| Message ID | 20190728115639.GQ3219@michaelspb |
|---|---|
| State | New |
On 28.07.2019, at 13:56, Michael Niedermayer <michael@niedermayer.cc> wrote:
> On Sun, Jul 28, 2019 at 12:45:36AM +0200, Reimar Döffinger wrote:
>>
>>
>> On 28.07.2019, at 00:31, Michael Niedermayer <michael@niedermayer.cc> wrote:
>>
>>> This merges several byte operations and avoids some shifts inside the loop
>>>
>>> Improves: Timeout (330sec -> 134sec)
>>> Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
>>>
>>> Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>>> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
>>> ---
>>>  libavcodec/lcldec.c | 10 +++++-----
>>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
>>> index 104defa5f5..c3787b3cbe 100644
>>> --- a/libavcodec/lcldec.c
>>> +++ b/libavcodec/lcldec.c
>>> @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
>>>              break;
>>>          case IMGTYPE_YUV422:
>>>              for (row = 0; row < height; row++) {
>>> -                for (col = 0; col < width - 3; col += 4) {
>>> +                for (col = 0; col < (width - 2)>>1; col += 2) {
>>>                      memcpy(y_out + col, encoded, 4);
>>>                      encoded += 4;
>>> -                    u_out[ col >> 1     ] = *encoded++ + 128;
>>> -                    u_out[(col >> 1) + 1] = *encoded++ + 128;
>>> -                    v_out[ col >> 1     ] = *encoded++ + 128;
>>> -                    v_out[(col >> 1) + 1] = *encoded++ + 128;
>>> +                    AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
>>> +                    encoded += 2;
>>> +                    AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
>>> +                    encoded += 2;
>>
>> Huh? Surely the pixel stride used for y_out still needs to be double of the u/v one?
>
>> I suspect doing only the AV_RN16/xor optimization might be best, the one shift saved seems not worth the risk/complexity...
>
> if you want i can remove the shift change ?
> with the fixed shift change its 155sec, if i remove the shift optimization its 170sec
>
> patch for the 155 case below:

I can't decide, it's a little uglier but a little faster... Unless someone else has an opinion, go with whatever you prefer.
On 7/28/2019 8:56 AM, Michael Niedermayer wrote:
> On Sun, Jul 28, 2019 at 12:45:36AM +0200, Reimar Döffinger wrote:
>>
>>
>> On 28.07.2019, at 00:31, Michael Niedermayer <michael@niedermayer.cc> wrote:
>>
>>> This merges several byte operations and avoids some shifts inside the loop
>>>
>>> Improves: Timeout (330sec -> 134sec)
>>> Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
>>>
>>> Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>>> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
>>> ---
>>>  libavcodec/lcldec.c | 10 +++++-----
>>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
>>> index 104defa5f5..c3787b3cbe 100644
>>> --- a/libavcodec/lcldec.c
>>> +++ b/libavcodec/lcldec.c
>>> @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
>>>              break;
>>>          case IMGTYPE_YUV422:
>>>              for (row = 0; row < height; row++) {
>>> -                for (col = 0; col < width - 3; col += 4) {
>>> +                for (col = 0; col < (width - 2)>>1; col += 2) {
>>>                      memcpy(y_out + col, encoded, 4);
>>>                      encoded += 4;
>>> -                    u_out[ col >> 1     ] = *encoded++ + 128;
>>> -                    u_out[(col >> 1) + 1] = *encoded++ + 128;
>>> -                    v_out[ col >> 1     ] = *encoded++ + 128;
>>> -                    v_out[(col >> 1) + 1] = *encoded++ + 128;
>>> +                    AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
>>> +                    encoded += 2;
>>> +                    AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
>>> +                    encoded += 2;
>>
>> Huh? Surely the pixel stride used for y_out still needs to be double of the u/v one?
>
>> I suspect doing only the AV_RN16/xor optimization might be best, the one shift saved seems not worth the risk/complexity...
>
> if you want i can remove the shift change ?
> with the fixed shift change its 155sec, if i remove the shift optimization its 170sec
>
> patch for the 155 case below:
>
> commit 56998b7d57a2cd0ed7f53981c50e76fd419cd86f (HEAD)
> Author: Michael Niedermayer <michael@niedermayer.cc>
> Date:   Sat Jul 27 22:46:34 2019 +0200
>
>     avcodec/lcldec: Optimize YUV422 case
>
>     This merges several byte operations and avoids some shifts inside the loop
>
>     Improves: Timeout (330sec -> 155sec)
>     Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
>
>     Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>     Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
>
> diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
> index 104defa5f5..9e018ff5a9 100644
> --- a/libavcodec/lcldec.c
> +++ b/libavcodec/lcldec.c
> @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
>              break;
>          case IMGTYPE_YUV422:
>              for (row = 0; row < height; row++) {
> -                for (col = 0; col < width - 3; col += 4) {
> -                    memcpy(y_out + col, encoded, 4);
> +                for (col = 0; col < (width - 2)>>1; col += 2) {
> +                    memcpy(y_out + 2 * col, encoded, 4);
>                      encoded += 4;
> -                    u_out[ col >> 1     ] = *encoded++ + 128;
> -                    u_out[(col >> 1) + 1] = *encoded++ + 128;
> -                    v_out[ col >> 1     ] = *encoded++ + 128;
> -                    v_out[(col >> 1) + 1] = *encoded++ + 128;
> +                    AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
> +                    encoded += 2;
> +                    AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
> +                    encoded += 2;
>                  }
>                  y_out -= frame->linesize[0];
>                  u_out -= frame->linesize[1];
> [...]

As others pointed out before, this kind of optimization is usually meant for the SIMD implementations and not the C boilerplate/reference. So prioritize readability above speed if possible when choosing which version to apply.
On Sun, Jul 28, 2019 at 11:06:16AM -0300, James Almer wrote:
> On 7/28/2019 8:56 AM, Michael Niedermayer wrote:
> > On Sun, Jul 28, 2019 at 12:45:36AM +0200, Reimar Döffinger wrote:
> >>
> >>
> >> On 28.07.2019, at 00:31, Michael Niedermayer <michael@niedermayer.cc> wrote:
> >>
> >>> This merges several byte operations and avoids some shifts inside the loop
> >>>
> >>> Improves: Timeout (330sec -> 134sec)
> >>> Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
> >>>
> >>> Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> >>> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> >>> ---
> >>>  libavcodec/lcldec.c | 10 +++++-----
> >>>  1 file changed, 5 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
> >>> index 104defa5f5..c3787b3cbe 100644
> >>> --- a/libavcodec/lcldec.c
> >>> +++ b/libavcodec/lcldec.c
> >>> @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
> >>>              break;
> >>>          case IMGTYPE_YUV422:
> >>>              for (row = 0; row < height; row++) {
> >>> -                for (col = 0; col < width - 3; col += 4) {
> >>> +                for (col = 0; col < (width - 2)>>1; col += 2) {
> >>>                      memcpy(y_out + col, encoded, 4);
> >>>                      encoded += 4;
> >>> -                    u_out[ col >> 1     ] = *encoded++ + 128;
> >>> -                    u_out[(col >> 1) + 1] = *encoded++ + 128;
> >>> -                    v_out[ col >> 1     ] = *encoded++ + 128;
> >>> -                    v_out[(col >> 1) + 1] = *encoded++ + 128;
> >>> +                    AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
> >>> +                    encoded += 2;
> >>> +                    AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
> >>> +                    encoded += 2;
> >>
> >> Huh? Surely the pixel stride used for y_out still needs to be double of the u/v one?
> >
> >> I suspect doing only the AV_RN16/xor optimization might be best, the one shift saved seems not worth the risk/complexity...
> >
> > if you want i can remove the shift change ?
> > with the fixed shift change its 155sec, if i remove the shift optimization its 170sec
> >
> > patch for the 155 case below:
> >
> > commit 56998b7d57a2cd0ed7f53981c50e76fd419cd86f (HEAD)
> > Author: Michael Niedermayer <michael@niedermayer.cc>
> > Date:   Sat Jul 27 22:46:34 2019 +0200
> >
> >     avcodec/lcldec: Optimize YUV422 case
> >
> >     This merges several byte operations and avoids some shifts inside the loop
> >
> >     Improves: Timeout (330sec -> 155sec)
> >     Improves: 15599/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_MSZH_fuzzer-5658127116009472
> >
> >     Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> >     Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> >
> > diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
> > index 104defa5f5..9e018ff5a9 100644
> > --- a/libavcodec/lcldec.c
> > +++ b/libavcodec/lcldec.c
> > @@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
> >              break;
> >          case IMGTYPE_YUV422:
> >              for (row = 0; row < height; row++) {
> > -                for (col = 0; col < width - 3; col += 4) {
> > -                    memcpy(y_out + col, encoded, 4);
> > +                for (col = 0; col < (width - 2)>>1; col += 2) {
> > +                    memcpy(y_out + 2 * col, encoded, 4);
> >                      encoded += 4;
> > -                    u_out[ col >> 1     ] = *encoded++ + 128;
> > -                    u_out[(col >> 1) + 1] = *encoded++ + 128;
> > -                    v_out[ col >> 1     ] = *encoded++ + 128;
> > -                    v_out[(col >> 1) + 1] = *encoded++ + 128;
> > +                    AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
> > +                    encoded += 2;
> > +                    AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
> > +                    encoded += 2;
> >                  }
> >                  y_out -= frame->linesize[0];
> >                  u_out -= frame->linesize[1];
> > [...]
>
> As others pointed out before, this kind of optimization is usually meant for
> the SIMD implementations and not the C boilerplate/reference. So
> prioritize readability above speed if possible when choosing which
> version to apply.

I think it's not a big difference (a shift of width vs. a shift of col), so I'll go with what was faster in this test case, but I am happy to do something else if people prefer.

Thanks

[...]
diff --git a/libavcodec/lcldec.c b/libavcodec/lcldec.c
index 104defa5f5..9e018ff5a9 100644
--- a/libavcodec/lcldec.c
+++ b/libavcodec/lcldec.c
@@ -391,13 +391,13 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame, AVPac
             break;
         case IMGTYPE_YUV422:
             for (row = 0; row < height; row++) {
-                for (col = 0; col < width - 3; col += 4) {
-                    memcpy(y_out + col, encoded, 4);
+                for (col = 0; col < (width - 2)>>1; col += 2) {
+                    memcpy(y_out + 2 * col, encoded, 4);
                     encoded += 4;
-                    u_out[ col >> 1     ] = *encoded++ + 128;
-                    u_out[(col >> 1) + 1] = *encoded++ + 128;
-                    v_out[ col >> 1     ] = *encoded++ + 128;
-                    v_out[(col >> 1) + 1] = *encoded++ + 128;
+                    AV_WN16(u_out + col, AV_RN16(encoded) ^ 0x8080);
+                    encoded += 2;
+                    AV_WN16(v_out + col, AV_RN16(encoded) ^ 0x8080);
+                    encoded += 2;
                 }
                 y_out -= frame->linesize[0];
                 u_out -= frame->linesize[1];