[FFmpeg-devel,3/3] avcodec/scpr: optimize shift loop.

Message ID	20170908212913.20478-3-michael@niedermayer.cc
State	Superseded
Headers	show Delivered-To: ffmpegpatchwork@gmail.com Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; From: Michael Niedermayer <michael@niedermayer.cc> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> Date: Fri, 8 Sep 2017 23:29:13 +0200 Message-Id: <20170908212913.20478-3-michael@niedermayer.cc> In-Reply-To: <20170908212913.20478-1-michael@niedermayer.cc> References: <20170908212913.20478-1-michael@niedermayer.cc> Subject: [FFmpeg-devel] [PATCH 3/3] avcodec/scpr: optimize shift loop. Precedence: list Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

Michael Niedermayer Sept. 8, 2017, 9:29 p.m. UTC

Speeds code up from 50sec to 15sec

Fixes Timeout
Fixes: 3242/clusterfuzz-testcase-5811951672229888

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
---
 libavcodec/scpr.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

James Almer Sept. 8, 2017, 9:43 p.m. UTC | #1

On 9/8/2017 6:29 PM, Michael Niedermayer wrote:
> Speeds code up from 50sec to 15sec
> 
> Fixes Timeout
> Fixes: 3242/clusterfuzz-testcase-5811951672229888
> 
> Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> ---
>  libavcodec/scpr.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
> index 37fbe7a106..2ef63a7bf8 100644
> --- a/libavcodec/scpr.c
> +++ b/libavcodec/scpr.c
> @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
>              return ret;
>  
>          for (y = 0; y < avctx->height; y++) {
> -            for (x = 0; x < avctx->width * 4; x++) {
> +            if (!(((uintptr_t)dst) & 7)) {
> +                uint64_t *dst64 = (uint64_t *)dst;
> +                int w = avctx->width>>1;
> +                for (x = 0; x < w; x++) {
> +                    dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;

Shouldn't this be used only if HAVE_FAST_64BIT is true, and a version
shifting four bytes at a time used otherwise? That's how we do almost
everywhere else.

The chances for anyone bothering writing simd for this decoder are
almost none, so adding C optimized loops is ok in this case.

> +                }
> +                x *= 8;
> +            } else
> +                x = 0;

How does this fix the timeout if the new code is only run if the pointer
is eight byte aligned? (or four once you add that).

> +            for (; x < avctx->width * 4; x++) {
>                  dst[x] = dst[x] << 3;
>              }
>              dst += frame->linesize[0];
>

Kieran Kunhya Sept. 8, 2017, 9:47 p.m. UTC | #2

On Fri, 8 Sep 2017 at 22:29 Michael Niedermayer <michael@niedermayer.cc>
wrote:

> Speeds code up from 50sec to 15sec
>
> Fixes Timeout
> Fixes: 3242/clusterfuzz-testcase-5811951672229888
>
> Found-by: continuous fuzzing process
> https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> Signed-off-by
> <https://github.com/google/oss-fuzz/tree/master/projects/ffmpegSigned-off-by>:
> Michael Niedermayer <michael@niedermayer.cc>
> ---
>  libavcodec/scpr.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
> index 37fbe7a106..2ef63a7bf8 100644
> --- a/libavcodec/scpr.c
> +++ b/libavcodec/scpr.c
> @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void
> *data, int *got_frame,
>              return ret;
>
>          for (y = 0; y < avctx->height; y++) {
> -            for (x = 0; x < avctx->width * 4; x++) {
> +            if (!(((uintptr_t)dst) & 7)) {
> +                uint64_t *dst64 = (uint64_t *)dst;
> +                int w = avctx->width>>1;
> +                for (x = 0; x < w; x++) {
> +                    dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;
> +                }
> +                x *= 8;
> +            } else
> +                x = 0;
> +            for (; x < avctx->width * 4; x++) {
>                  dst[x] = dst[x] << 3;
>              }
>              dst += frame->linesize[0];
> --
> 2.14.1
>

This is as clear as mud.

James Almer Sept. 8, 2017, 10:15 p.m. UTC | #3

On 9/8/2017 6:47 PM, Kieran Kunhya wrote:
> On Fri, 8 Sep 2017 at 22:29 Michael Niedermayer <michael@niedermayer.cc>
> wrote:
> 
>> Speeds code up from 50sec to 15sec
>>
>> Fixes Timeout
>> Fixes: 3242/clusterfuzz-testcase-5811951672229888
>>
>> Found-by: continuous fuzzing process
>> https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>> Signed-off-by
>> <https://github.com/google/oss-fuzz/tree/master/projects/ffmpegSigned-off-by>:
>> Michael Niedermayer <michael@niedermayer.cc>
>> ---
>>  libavcodec/scpr.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
>> index 37fbe7a106..2ef63a7bf8 100644
>> --- a/libavcodec/scpr.c
>> +++ b/libavcodec/scpr.c
>> @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void
>> *data, int *got_frame,
>>              return ret;
>>
>>          for (y = 0; y < avctx->height; y++) {
>> -            for (x = 0; x < avctx->width * 4; x++) {
>> +            if (!(((uintptr_t)dst) & 7)) {
>> +                uint64_t *dst64 = (uint64_t *)dst;
>> +                int w = avctx->width>>1;
>> +                for (x = 0; x < w; x++) {
>> +                    dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;
>> +                }
>> +                x *= 8;
>> +            } else
>> +                x = 0;
>> +            for (; x < avctx->width * 4; x++) {
>>                  dst[x] = dst[x] << 3;
>>              }
>>              dst += frame->linesize[0];
>> --
>> 2.14.1
>>
> 
> This is as clear as mud.

It reads eight bytes at a time if the buffer is sufficiently aligned,
then finishes reading the remaining bytes one at a time.
If the buffer is unaligned, it reads everything one byte at a time like
it used to.

See ff_h2645_extract_rbsp() and add_bytes_c() for another example of
this optimization.

Michael Niedermayer Sept. 8, 2017, 11:49 p.m. UTC | #4

On Fri, Sep 08, 2017 at 06:43:06PM -0300, James Almer wrote:
> On 9/8/2017 6:29 PM, Michael Niedermayer wrote:
> > Speeds code up from 50sec to 15sec
> > 
> > Fixes Timeout
> > Fixes: 3242/clusterfuzz-testcase-5811951672229888
> > 
> > Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> > Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> > ---
> >  libavcodec/scpr.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
> > index 37fbe7a106..2ef63a7bf8 100644
> > --- a/libavcodec/scpr.c
> > +++ b/libavcodec/scpr.c
> > @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
> >              return ret;
> >  
> >          for (y = 0; y < avctx->height; y++) {
> > -            for (x = 0; x < avctx->width * 4; x++) {
> > +            if (!(((uintptr_t)dst) & 7)) {
> > +                uint64_t *dst64 = (uint64_t *)dst;
> > +                int w = avctx->width>>1;
> > +                for (x = 0; x < w; x++) {
> > +                    dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;
> 
> Shouldn't this be used only if HAVE_FAST_64BIT is true, and a version
> shifting four bytes at a time used otherwise? That's how we do almost
> everywhere else.

The compiler would break the 64bit into two 32bit automatically.
I can write an explicit version if that is wanted ?
it seemed overkill for this here though


> 
> The chances for anyone bothering writing simd for this decoder are
> almost none, so adding C optimized loops is ok in this case.
> 
> > +                }
> > +                x *= 8;
> > +            } else
> > +                x = 0;
> 
> How does this fix the timeout if the new code is only run if the pointer
> is eight byte aligned? (or four once you add that).

The timeout is IIRC currently defined as 30sec or so on the platform
the fuzzer runs on, data is aligned in that case.

You could imagine a platform where data is not aligned, yes
on that patform the patch would not improve the time decoding takes.
Simiarly you could imagine a CPU that supports only 8bit operations,
or just a slower CPU.

This patch adds some optimizations that makes decoding faster which
gets us over the threshold for this particular sample and a normal
modern desktop platform.

Its quite possible another scpr file will still hit the threshold
and certainly another threshold or other hw could trigger it
still with this sample

I would very much prefer a more universal fix ...
But i didnt see one and making that loop 3 times as fast is great on
its own.

[...]

James Almer Sept. 9, 2017, 12:01 a.m. UTC | #5

On 9/8/2017 8:49 PM, Michael Niedermayer wrote:
> On Fri, Sep 08, 2017 at 06:43:06PM -0300, James Almer wrote:
>> On 9/8/2017 6:29 PM, Michael Niedermayer wrote:
>>> Speeds code up from 50sec to 15sec
>>>
>>> Fixes Timeout
>>> Fixes: 3242/clusterfuzz-testcase-5811951672229888
>>>
>>> Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>>> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
>>> ---
>>>  libavcodec/scpr.c | 11 ++++++++++-
>>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/libavcodec/scpr.c b/libavcodec/scpr.c
>>> index 37fbe7a106..2ef63a7bf8 100644
>>> --- a/libavcodec/scpr.c
>>> +++ b/libavcodec/scpr.c
>>> @@ -827,7 +827,16 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
>>>              return ret;
>>>  
>>>          for (y = 0; y < avctx->height; y++) {
>>> -            for (x = 0; x < avctx->width * 4; x++) {
>>> +            if (!(((uintptr_t)dst) & 7)) {
>>> +                uint64_t *dst64 = (uint64_t *)dst;
>>> +                int w = avctx->width>>1;
>>> +                for (x = 0; x < w; x++) {
>>> +                    dst64[x] = (dst64[x] << 3) & 0xFCFCFCFCFCFCFCFCULL;
>>
>> Shouldn't this be used only if HAVE_FAST_64BIT is true, and a version
>> shifting four bytes at a time used otherwise? That's how we do almost
>> everywhere else.
> 
> The compiler would break the 64bit into two 32bit automatically.
> I can write an explicit version if that is wanted ?
> it seemed overkill for this here though

I guess it's simple enough that the compiler can figure that out, yeah.
Should be ok then.

> 
> 
>>
>> The chances for anyone bothering writing simd for this decoder are
>> almost none, so adding C optimized loops is ok in this case.
>>
>>> +                }
>>> +                x *= 8;
>>> +            } else
>>> +                x = 0;
>>
>> How does this fix the timeout if the new code is only run if the pointer
>> is eight byte aligned? (or four once you add that).
> 
> The timeout is IIRC currently defined as 30sec or so on the platform
> the fuzzer runs on, data is aligned in that case.
> 
> You could imagine a platform where data is not aligned, yes
> on that patform the patch would not improve the time decoding takes.
> Simiarly you could imagine a CPU that supports only 8bit operations,
> or just a slower CPU.
> 
> This patch adds some optimizations that makes decoding faster which
> gets us over the threshold for this particular sample and a normal
> modern desktop platform.
> 
> Its quite possible another scpr file will still hit the threshold
> and certainly another threshold or other hw could trigger it
> still with this sample
> 
> I would very much prefer a more universal fix ...
> But i didnt see one and making that loop 3 times as fast is great on
> its own.

Ok. As i said nobody is going to write simd for this, so adding C
optimized code is fine.

Derek Buitenhuis Sept. 9, 2017, 2:11 p.m. UTC | #6

On 9/8/2017 11:15 PM, James Almer wrote:
> It reads eight bytes at a time if the buffer is sufficiently aligned,
> then finishes reading the remaining bytes one at a time.
> If the buffer is unaligned, it reads everything one byte at a time like
> it used to.
> 
> See ff_h2645_extract_rbsp() and add_bytes_c() for another example of
> this optimization.

So put a comment, or at least put it in the commit message.

It isn't exactly straightforward; it's like reading Hex Rays output.

- Derek

Michael Niedermayer Sept. 9, 2017, 6:03 p.m. UTC | #7

On Sat, Sep 09, 2017 at 03:11:00PM +0100, Derek Buitenhuis wrote:
> On 9/8/2017 11:15 PM, James Almer wrote:
> > It reads eight bytes at a time if the buffer is sufficiently aligned,
> > then finishes reading the remaining bytes one at a time.
> > If the buffer is unaligned, it reads everything one byte at a time like
> > it used to.
> > 
> > See ff_h2645_extract_rbsp() and add_bytes_c() for another example of
> > this optimization.
> 
> So put a comment, or at least put it in the commit message.
> 
> It isn't exactly straightforward; it's like reading Hex Rays output.

ill repost the patch with comments

thx

[...]

[FFmpeg-devel,3/3] avcodec/scpr: optimize shift loop.

Commit Message

Comments

Patch