[FFmpeg-devel,1/2] avcodec/lcldec: Optimize YUV422 case

Submitted by Michael Niedermayer on July 28, 2019, 11:56 a.m.

Details

Message ID 20190728115639.GQ3219@michaelspb
State New

Commit Message

Michael Niedermayer July 28, 2019, 11:56 a.m.
On Sun, Jul 28, 2019 at 12:45:36AM +0200, Reimar Döffinger wrote:
> 
> 
> On 28.07.2019, at 00:31, Michael Niedermayer <michael@niedermayer.cc> wrote:
> 
> > This merges several byte operations and avoids some shifts inside the loop
> > 
> > Improves: Timeout (330sec -> 134sec)
> > Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
> > 
> > Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> > Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> > ---
> > libavcodec/lcldec.c | 10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
> > index 104defa5f5..c3787b3cbe 100644
> > --- a/libavcodec/lcldec.c
> > +++ b/libavcodec/lcldec.c
> > @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
> >         break;
> >     case IMGTYPE_YUV422:
> >         for (row = 0; row < height; row++) {
> > -            for (col = 0; col < width - 3; col += 4) {
> > +            for (col = 0; col < (width - 2)>>1; col += 2) {
> >                 memcpy(y_out + col, encoded, 4);
> >                 encoded += 4;
> > -                u_out[ col >> 1     ] = *encoded++ + 128;
> > -                u_out[(col >> 1) + 1] = *encoded++ + 128;
> > -                v_out[ col >> 1     ] = *encoded++ + 128;
> > -                v_out[(col >> 1) + 1] = *encoded++ + 128;
> > +                AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
> > +                encoded += 2;
> > +                AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
> > +                encoded += 2;
> 
> Huh? Surely the pixel stride used for y_out still needs to be double of the u/v one?

> I suspect doing only the AV_RN16/xor optimization might be best, the one shift saved seems not worth the risk/complexity...

If you want, I can remove the shift change?
With the fixed shift change it's 155 sec; if I remove the shift optimization it's 170 sec.

Patch for the 155 sec case below:

commit 56998b7d57a2cd0ed7f53981c50e76fd419cd86f (HEAD)
Author: Michael Niedermayer <michael@niedermayer.cc>
Date:   Sat Jul 27 22:46:34 2019 +0200

    avcodec/lcldec: Optimize YUV422 case
    
    This merges several byte operations and avoids some shifts inside the loop
    
    Improves: Timeout (330sec -> 155sec)
    Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
    
    Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
    Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>

[...]
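
The "merges several byte operations" claim rests on the fact that adding 128 to a byte modulo 256 is the same as flipping its top bit, so two such additions can be done with one 16-bit XOR against 0x8080. The following is a minimal sketch of that equivalence, not code from the patch: `rn16` and `wn16` are hypothetical stand-ins for FFmpeg's `AV_RN16`/`AV_WN16`, and the other function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for AV_RN16/AV_WN16: native-endian 16-bit
 * load/store that are safe for unaligned pointers. Not the real macros. */
static uint16_t rn16(const uint8_t *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

static void wn16(uint8_t *p, uint16_t v)
{
    memcpy(p, &v, sizeof(v));
}

/* Bytewise form, as in the removed lines: add 128 to each byte (mod 256). */
static void bias_bytes(uint8_t dst[2], const uint8_t src[2])
{
    dst[0] = src[0] + 128;
    dst[1] = src[1] + 128;
}

/* Merged form, as in the added lines: one 16-bit XOR flips both top bits. */
static void bias_xor16(uint8_t dst[2], const uint8_t src[2])
{
    wn16(dst, rn16(src) ^ 0x8080);
}

/* Returns 1 if both forms agree for every possible pair of input bytes. */
static int bias_forms_agree(void)
{
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            const uint8_t src[2] = { (uint8_t)a, (uint8_t)b };
            uint8_t ref[2], opt[2];

            bias_bytes(ref, src);
            bias_xor16(opt, src);
            if (memcmp(ref, opt, 2))
                return 0;
        }
    }
    return 1;
}
```

Calling `assert(bias_forms_agree());` from a test program exercises all 65536 byte pairs. Because the constant 0x8080 has the same value in both bytes, the result does not depend on host endianness.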

Comments

Reimar Döffinger July 28, 2019, 2 p.m.
On 28.07.2019, at 13:56, Michael Niedermayer <michael@niedermayer.cc> wrote:

> On Sun, Jul 28, 2019 at 12:45:36AM +0200, Reimar Döffinger wrote:
>> 
>> 
>> On 28.07.2019, at 00:31, Michael Niedermayer <michael@niedermayer.cc> wrote:
>> 
>>> This merges several byte operations and avoids some shifts inside the loop
>>> 
>>> Improves: Timeout (330sec -> 134sec)
>>> Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
>>> 
>>> Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>>> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
>>> ---
>>> libavcodec/lcldec.c | 10 +++++-----
>>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>> 
>>> diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
>>> index 104defa5f5..c3787b3cbe 100644
>>> --- a/libavcodec/lcldec.c
>>> +++ b/libavcodec/lcldec.c
>>> @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
>>>        break;
>>>    case IMGTYPE_YUV422:
>>>        for (row = 0; row < height; row++) {
>>> -            for (col = 0; col < width - 3; col += 4) {
>>> +            for (col = 0; col < (width - 2)>>1; col += 2) {
>>>                memcpy(y_out + col, encoded, 4);
>>>                encoded += 4;
>>> -                u_out[ col >> 1     ] = *encoded++ + 128;
>>> -                u_out[(col >> 1) + 1] = *encoded++ + 128;
>>> -                v_out[ col >> 1     ] = *encoded++ + 128;
>>> -                v_out[(col >> 1) + 1] = *encoded++ + 128;
>>> +                AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
>>> +                encoded += 2;
>>> +                AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
>>> +                encoded += 2;
>> 
>> Huh? Surely the pixel stride used for y_out still needs to be double of the u/v one?
> 
>> I suspect doing only the AV_RN16/xor optimization might be best, the one shift saved seems not worth the risk/complexity...
> 
> If you want, I can remove the shift change?
> With the fixed shift change it's 155 sec; if I remove the shift optimization it's 170 sec.
> 
> Patch for the 155 sec case below:

I can't decide, it's a little uglier but a little faster...
Unless someone else has an opinion, go with whatever you prefer.

James Almer July 28, 2019, 2:06 p.m.
On 7/28/2019 8:56 AM, Michael Niedermayer wrote:
> On Sun, Jul 28, 2019 at 12:45:36AM +0200, Reimar Döffinger wrote:
>>
>>
>> On 28.07.2019, at 00:31, Michael Niedermayer <michael@niedermayer.cc> wrote:
>>
>>> This merges several byte operations and avoids some shifts inside the loop
>>>
>>> Improves: Timeout (330sec -> 134sec)
>>> Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
>>>
>>> Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>>> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
>>> ---
>>> libavcodec/lcldec.c | 10 +++++-----
>>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
>>> index 104defa5f5..c3787b3cbe 100644
>>> --- a/libavcodec/lcldec.c
>>> +++ b/libavcodec/lcldec.c
>>> @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
>>>         break;
>>>     case IMGTYPE_YUV422:
>>>         for (row = 0; row < height; row++) {
>>> -            for (col = 0; col < width - 3; col += 4) {
>>> +            for (col = 0; col < (width - 2)>>1; col += 2) {
>>>                 memcpy(y_out + col, encoded, 4);
>>>                 encoded += 4;
>>> -                u_out[ col >> 1     ] = *encoded++ + 128;
>>> -                u_out[(col >> 1) + 1] = *encoded++ + 128;
>>> -                v_out[ col >> 1     ] = *encoded++ + 128;
>>> -                v_out[(col >> 1) + 1] = *encoded++ + 128;
>>> +                AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
>>> +                encoded += 2;
>>> +                AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
>>> +                encoded += 2;
>>
>> Huh? Surely the pixel stride used for y_out still needs to be double of the u/v one?
> 
>> I suspect doing only the AV_RN16/xor optimization might be best, the one shift saved seems not worth the risk/complexity...
> 
> If you want, I can remove the shift change?
> With the fixed shift change it's 155 sec; if I remove the shift optimization it's 170 sec.
> 
> Patch for the 155 sec case below:
> 
> commit 56998b7d57a2cd0ed7f53981c50e76fd419cd86f (HEAD)
> Author: Michael Niedermayer <michael@niedermayer.cc>
> Date:   Sat Jul 27 22:46:34 2019 +0200
> 
>     avcodec/lcldec: Optimize YUV422 case
>     
>     This merges several byte operations and avoids some shifts inside the loop
>     
>     Improves: Timeout (330sec -> 155sec)
>     Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
>     
>     Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>     Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> 
> diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
> index 104defa5f5..9e018ff5a9 100644
> --- a/libavcodec/lcldec.c
> +++ b/libavcodec/lcldec.c
> @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
>          break;
>      case IMGTYPE_YUV422:
>          for (row = 0; row < height; row++) {
> -            for (col = 0; col < width - 3; col += 4) {
> -                memcpy(y_out + col, encoded, 4);
> +            for (col = 0; col < (width - 2)>>1; col += 2) {
> +                memcpy(y_out + 2 * col, encoded, 4);
>                  encoded += 4;
> -                u_out[ col >> 1     ] = *encoded++ + 128;
> -                u_out[(col >> 1) + 1] = *encoded++ + 128;
> -                v_out[ col >> 1     ] = *encoded++ + 128;
> -                v_out[(col >> 1) + 1] = *encoded++ + 128;
> +                AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
> +                encoded += 2;
> +                AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
> +                encoded += 2;
>              }
>              y_out -= frame->linesize[0];
>              u_out -= frame->linesize[1];
> [...]

As others pointed out before, this kind of optimization is usually meant for
the SIMD implementations and not the C boilerplate/reference. So
prioritize readability above speed if possible when choosing which
version to apply.

Michael Niedermayer July 28, 2019, 4:45 p.m.
On Sun, Jul 28, 2019 at 11:06:16AM -0300, James Almer wrote:
> On 7/28/2019 8:56 AM, Michael Niedermayer wrote:
> > On Sun, Jul 28, 2019 at 12:45:36AM +0200, Reimar Döffinger wrote:
> >>
> >>
> >> On 28.07.2019, at 00:31, Michael Niedermayer <michael@niedermayer.cc> wrote:
> >>
> >>> This merges several byte operations and avoids some shifts inside the loop
> >>>
> >>> Improves: Timeout (330sec -> 134sec)
> >>> Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
> >>>
> >>> Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> >>> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> >>> ---
> >>> libavcodec/lcldec.c | 10 +++++-----
> >>> 1 file changed, 5 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
> >>> index 104defa5f5..c3787b3cbe 100644
> >>> --- a/libavcodec/lcldec.c
> >>> +++ b/libavcodec/lcldec.c
> >>> @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
> >>>         break;
> >>>     case IMGTYPE_YUV422:
> >>>         for (row = 0; row < height; row++) {
> >>> -            for (col = 0; col < width - 3; col += 4) {
> >>> +            for (col = 0; col < (width - 2)>>1; col += 2) {
> >>>                 memcpy(y_out + col, encoded, 4);
> >>>                 encoded += 4;
> >>> -                u_out[ col >> 1     ] = *encoded++ + 128;
> >>> -                u_out[(col >> 1) + 1] = *encoded++ + 128;
> >>> -                v_out[ col >> 1     ] = *encoded++ + 128;
> >>> -                v_out[(col >> 1) + 1] = *encoded++ + 128;
> >>> +                AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
> >>> +                encoded += 2;
> >>> +                AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
> >>> +                encoded += 2;
> >>
> >> Huh? Surely the pixel stride used for y_out still needs to be double of the u/v one?
> > 
> >> I suspect doing only the AV_RN16/xor optimization might be best, the one shift saved seems not worth the risk/complexity...
> > 
> > If you want, I can remove the shift change?
> > With the fixed shift change it's 155 sec; if I remove the shift optimization it's 170 sec.
> > 
> > Patch for the 155 sec case below:
> > 
> > commit 56998b7d57a2cd0ed7f53981c50e76fd419cd86f (HEAD)
> > Author: Michael Niedermayer <michael@niedermayer.cc>
> > Date:   Sat Jul 27 22:46:34 2019 +0200
> > 
> >     avcodec/lcldec: Optimize YUV422 case
> >     
> >     This merges several byte operations and avoids some shifts inside the loop
> >     
> >     Improves: Timeout (330sec -> 155sec)
> >     Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
> >     
> >     Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> >     Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> > 
> > diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
> > index 104defa5f5..9e018ff5a9 100644
> > --- a/libavcodec/lcldec.c
> > +++ b/libavcodec/lcldec.c
> > @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
> >          break;
> >      case IMGTYPE_YUV422:
> >          for (row = 0; row < height; row++) {
> > -            for (col = 0; col < width - 3; col += 4) {
> > -                memcpy(y_out + col, encoded, 4);
> > +            for (col = 0; col < (width - 2)>>1; col += 2) {
> > +                memcpy(y_out + 2 * col, encoded, 4);
> >                  encoded += 4;
> > -                u_out[ col >> 1     ] = *encoded++ + 128;
> > -                u_out[(col >> 1) + 1] = *encoded++ + 128;
> > -                v_out[ col >> 1     ] = *encoded++ + 128;
> > -                v_out[(col >> 1) + 1] = *encoded++ + 128;
> > +                AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
> > +                encoded += 2;
> > +                AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
> > +                encoded += 2;
> >              }
> >              y_out -= frame->linesize[0];
> >              u_out -= frame->linesize[1];
> > [...]
> 
> As others pointed out before, this kind of optimization is usually meant for
> the SIMD implementations and not the C boilerplate/reference. So
> prioritize readability above speed if possible when choosing which
> version to apply.

I think it's not a big difference: a shift of width vs. a shift of col.
So I'll go with what was faster in this testcase, but I am happy to
do something else if people prefer.

Thanks
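
The "shift of width vs. shift of col" trade-off can be checked in isolation: the original loop counts luma samples in steps of 4, while the patched loop counts chroma samples in steps of 2 against a halved bound. A small sketch, with hypothetical function names that are not part of lcldec.c, that counts the 4-pixel groups each form visits:

```c
#include <assert.h>

/* Groups visited by the original loop, where col counts luma samples:
 *     for (col = 0; col < width - 3; col += 4) */
static int groups_luma_index(int width)
{
    int n = 0;
    for (int col = 0; col < width - 3; col += 4)
        n++;
    return n;
}

/* Groups visited by the patched loop, where col counts chroma samples:
 *     for (col = 0; col < (width - 2) >> 1; col += 2) */
static int groups_chroma_index(int width)
{
    int n = 0;
    for (int col = 0; col < (width - 2) >> 1; col += 2)
        n++;
    return n;
}

/* Returns 1 if both loops run the same number of iterations for every
 * width in [2, max_width]. Widths below 2 are skipped here because
 * right-shifting a negative int is implementation-defined in C. */
static int loop_bounds_agree(int max_width)
{
    for (int width = 2; width <= max_width; width++)
        if (groups_luma_index(width) != groups_chroma_index(width))
            return 0;
    return 1;
}
```

Both forms run width / 4 iterations (rounded down) for the widths tested, for example `assert(loop_bounds_agree(4096));`. The patched form moves the shift out of the loop body and into the bound, at the cost of needing `2 * col` for the luma offset.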

[...]

Patch

diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
index 104defa5f5..9e018ff5a9 100644
--- a/libavcodec/lcldec.c
+++ b/libavcodec/lcldec.c
@@ -391,13 +391,13 @@  static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
         break;
     case IMGTYPE_YUV422:
         for (row = 0; row < height; row++) {
-            for (col = 0; col < width - 3; col += 4) {
-                memcpy(y_out + col, encoded, 4);
+            for (col = 0; col < (width - 2)>>1; col += 2) {
+                memcpy(y_out + 2 * col, encoded, 4);
                 encoded += 4;
-                u_out[ col >> 1     ] = *encoded++ + 128;
-                u_out[(col >> 1) + 1] = *encoded++ + 128;
-                v_out[ col >> 1     ] = *encoded++ + 128;
-                v_out[(col >> 1) + 1] = *encoded++ + 128;
+                AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
+                encoded += 2;
+                AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
+                encoded += 2;
             }
             y_out -= frame->linesize[0];
             u_out -= frame->linesize[1];
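
To see why the luma write needs `2 * col` once `col` indexes chroma (the stride issue raised in review), the removed and added loop bodies can be run side by side on a single row. This is a standalone sketch under assumptions: `xor16` is a hypothetical stand-in for `AV_WN16(dst, AV_RN16(src) ^ 0x8080)`, and `MAX_WIDTH`, the buffer setup, and the function names are invented for the test and do not come from the decoder.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MAX_WIDTH = 64 }; /* arbitrary test limit, not from the decoder */

/* Hypothetical stand-in for AV_WN16(dst, AV_RN16(src) ^ 0x8080). */
static void xor16(uint8_t *dst, const uint8_t *src)
{
    uint16_t v;
    memcpy(&v, src, sizeof(v));
    v ^= 0x8080;
    memcpy(dst, &v, sizeof(v));
}

/* Removed lines: col indexes luma, chroma is written at col >> 1. */
static const uint8_t *row_ref(uint8_t *y, uint8_t *u, uint8_t *v,
                              const uint8_t *enc, int width)
{
    for (int col = 0; col < width - 3; col += 4) {
        memcpy(y + col, enc, 4);
        enc += 4;
        u[ col >> 1     ] = *enc++ + 128;
        u[(col >> 1) + 1] = *enc++ + 128;
        v[ col >> 1     ] = *enc++ + 128;
        v[(col >> 1) + 1] = *enc++ + 128;
    }
    return enc;
}

/* Added lines: col indexes chroma, so luma is written at 2 * col. */
static const uint8_t *row_opt(uint8_t *y, uint8_t *u, uint8_t *v,
                              const uint8_t *enc, int width)
{
    for (int col = 0; col < (width - 2) >> 1; col += 2) {
        memcpy(y + 2 * col, enc, 4);
        enc += 4;
        xor16(u + col, enc);
        enc += 2;
        xor16(v + col, enc);
        enc += 2;
    }
    return enc;
}

/* Returns 1 if both forms consume the same input and produce the same
 * planes for one row. Requires 2 <= width <= MAX_WIDTH. */
static int rows_match(int width)
{
    uint8_t enc[2 * MAX_WIDTH];
    uint8_t y0[MAX_WIDTH] = { 0 }, u0[MAX_WIDTH / 2] = { 0 }, v0[MAX_WIDTH / 2] = { 0 };
    uint8_t y1[MAX_WIDTH] = { 0 }, u1[MAX_WIDTH / 2] = { 0 }, v1[MAX_WIDTH / 2] = { 0 };

    /* Deterministic filler data for the encoded row. */
    for (int i = 0; i < 2 * MAX_WIDTH; i++)
        enc[i] = (uint8_t)(i * 37 + 11);

    const uint8_t *e0 = row_ref(y0, u0, v0, enc, width);
    const uint8_t *e1 = row_opt(y1, u1, v1, enc, width);

    return e0 == e1 &&
           !memcmp(y0, y1, sizeof(y0)) &&
           !memcmp(u0, u1, sizeof(u0)) &&
           !memcmp(v0, v1, sizeof(v0));
}
```

With `memcpy(y + col, ...)` in `row_opt`, as in the first version of the patch, successive groups would overlap in the luma plane and the comparison would fail for any width of 8 or more.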