[FFmpeg-devel,1/3] avutil/imgutils: Optimize writing 4 bytes in memset_bytes()

Submitted by Michael Niedermayer on Jan. 17, 2019, 10:28 p.m.

Details

Message ID 20190117222802.GP3501@michaelspb
State New

Commit Message

Michael Niedermayer Jan. 17, 2019, 10:28 p.m.
On Wed, Jan 16, 2019 at 08:00:22PM +0100, Marton Balint wrote:
> 
> 
> On Tue, 15 Jan 2019, Michael Niedermayer wrote:
> 
> >On Sun, Dec 30, 2018 at 07:15:49PM +0100, Marton Balint wrote:
> >>
> >>
> >>On Fri, 28 Dec 2018, Michael Niedermayer wrote:
> >>
> >>>On Wed, Dec 26, 2018 at 10:16:47PM +0100, Marton Balint wrote:
> >>>>
> >>>>
> >>>>On Wed, 26 Dec 2018, Paul B Mahol wrote:
> >>>>
> >>>>>On 12/26/18, Michael Niedermayer <michael@niedermayer.cc> wrote:
> >>>>>>On Wed, Dec 26, 2018 at 04:32:17PM +0100, Paul B Mahol wrote:
> >>>>>>>On 12/25/18, Michael Niedermayer <michael@niedermayer.cc> wrote:
> >>>>>>>>Fixes: Timeout
> >>>>>>>>Fixes:
> >>>>>>>>11502/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_WCMV_fuzzer-5664893810769920
> >>>>>>>>Before: Executed
> >>>>>>>>clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_WCMV_fuzzer-5664893810769920
> >>>>>>>>in 11294 ms
> >>>>>>>>After : Executed
> >>>>>>>>clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_WCMV_fuzzer-5664893810769920
> >>>>>>>>in 4249 ms
> >>>>>>>>
> >>>>>>>>Found-by: continuous fuzzing process
> >>>>>>>>https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> >>>>>>>>Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> >>>>>>>>---
> >>>>>>>>libavutil/imgutils.c | 6 ++++++
> >>>>>>>>1 file changed, 6 insertions(+)
> >>>>>>>>
> >>>>>>>>diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
> >>>>>>>>index 4938a7ef67..cc38f1e878 100644
> >>>>>>>>--- a/libavutil/imgutils.c
> >>>>>>>>+++ b/libavutil/imgutils.c
> >>>>>>>>@@ -529,6 +529,12 @@ static void memset_bytes(uint8_t *dst, size_t
> >>>>>>>>dst_size,
> >>>>>>>>uint8_t *clear,
> >>>>>>>>        }
> >>>>>>>>    } else if (clear_size == 4) {
> >>>>>>>>        uint32_t val = AV_RN32(clear);
> >>>>>>>>+        uint64_t val8 = val * 0x100000001ULL;
> >>>>>>>>+        for (; dst_size >= 32; dst_size -= 32) {
> >>>>>>>>+            AV_WN64(dst   , val8); AV_WN64(dst+ 8, val8);
> >>>>>>>>+            AV_WN64(dst+16, val8); AV_WN64(dst+24, val8);
> >>>>>>>>+            dst += 32;
> >>>>>>>>+        }
> >>>>>>>>        for (; dst_size >= 4; dst_size -= 4) {
> >>>>>>>>            AV_WN32(dst, val);
> >>>>>>>>            dst += 4;
> >>>>>>>>--
> >>>>>>>>2.20.1
> >>>>>>>>
> >>>>>>>
> >>>>>>>NAK, implement special memset function instead.
> >>>>>>
> >>>>>>I can move the added loop into a separate function, if that's what you
> >>>>>>suggest?
> >>>>>
> >>>>>No, don't do that.
> >>>>>
> >>>>>>All the code is already in a "special" memset though, this is
> >>>>>>memset_bytes()
> >>>>>>
> >>>>>
> >>>>>I guess the function is less useful if it's static. So any duplication should
> >>>>>be avoided in the codebase.
> >>>>
> >>>>Doesn't av_memcpy_backptr do almost exactly what is needed here? That can
> >>>>also be optimized further if needed.
> >>>
> >>>av_memcpy_backptr() copies data with overlap; it's more like a recursive
> >>>memmove().
> >>
> >>So? As far as I see the memset_bytes function in imgutils.c can be replaced
> >>with this:
> >>
> >>    if (clear_size > dst_size)
> >>        clear_size = dst_size;
> >>    memcpy(dst, clear, clear_size);
> >>    av_memcpy_backptr(dst + clear_size, clear_size, dst_size - clear_size);
> >>
> >>I am not against an av_memset_bytes API addition, but I believe it should
> >>share code with av_memcpy_backptr to avoid duplication.
> >
> >I've implemented this; it does not seem to be really faster in the testcase
> 
> I guess it is not faster because you have not applied your original
> optimization to fill32 in libavutil/mem.c. Could you compare speed after
> optimizing that the same way your original patch did it with imgutils
> memset_bytes?

sure, that makes it faster:

From f5660e4025bb8161ebdb55cda03b656cbf685b1a Mon Sep 17 00:00:00 2001
From: Michael Niedermayer <michael@niedermayer.cc>
Date: Thu, 17 Jan 2019 22:35:10 +0100
Subject: [PATCH 1/2] avutil/mem: Optimize fill32() by unrolling and using
 64bit

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
---
 libavutil/mem.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Marton Balint Jan. 18, 2019, 11:28 p.m.
On Thu, 17 Jan 2019, Michael Niedermayer wrote:

> sure, that makes it faster:

Thanks, both patches LGTM.

Regards,
Marton
Michael Niedermayer Jan. 20, 2019, 8:14 p.m.
On Sat, Jan 19, 2019 at 12:28:25AM +0100, Marton Balint wrote:
> 
> 
> On Thu, 17 Jan 2019, Michael Niedermayer wrote:
> 
> >sure, that makes it faster:
> 
> Thanks, both patches LGTM.

will apply

thanks

[...]
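
For readers following the thread, the replacement that was agreed on (seed the destination with one copy of the pattern, then let an overlapping forward copy replicate it) can be sketched in plain C. This is only an illustration under stated assumptions: pattern_fill_sketch is a hypothetical name, and the byte-wise loop stands in for av_memcpy_backptr(), which in FFmpeg dispatches to specialized fill routines such as fill32() for small back distances.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of FFmpeg: fill dst[0..dst_size) with
 * repetitions of clear[0..clear_size). Assumes clear_size > 0. */
static void pattern_fill_sketch(uint8_t *dst, size_t dst_size,
                                const uint8_t *clear, size_t clear_size)
{
    /* Never seed more bytes than the destination can hold. */
    if (clear_size > dst_size)
        clear_size = dst_size;

    /* Seed: one copy of the pattern at the start of dst. */
    memcpy(dst, clear, clear_size);

    /* Replicate: each byte is copied from clear_size bytes earlier.
     * Source and destination overlap on purpose, so a plain memcpy()
     * would not be valid here; this loop plays the role that
     * av_memcpy_backptr() plays in the patch. */
    for (size_t i = clear_size; i < dst_size; i++)
        dst[i] = dst[i - clear_size];
}
```

With a 3-byte pattern {1, 2, 3} and a 10-byte destination, the seed writes bytes 0..2 and the loop extends them to 1 2 3 1 2 3 1 2 3 1, so the partial repetition at the end needs no separate tail loop.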

Patch

diff --git a/libavutil/mem.c b/libavutil/mem.c
index 6149755a6b..88fe09b179 100644
--- a/libavutil/mem.c
+++ b/libavutil/mem.c
@@ -399,6 +399,18 @@  static void fill32(uint8_t *dst, int len)
 {
     uint32_t v = AV_RN32(dst - 4);
 
+#if HAVE_FAST_64BIT
+    uint64_t v2= v + ((uint64_t)v<<32);
+    while (len >= 32) {
+        AV_WN64(dst   , v2);
+        AV_WN64(dst+ 8, v2);
+        AV_WN64(dst+16, v2);
+        AV_WN64(dst+24, v2);
+        dst += 32;
+        len -= 32;
+    }
+#endif
+
     while (len >= 4) {
         AV_WN32(dst, v);
         dst += 4;
-- 
2.20.1
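
The idea in the fill32() hunk above can be tried outside FFmpeg with a small standalone sketch. The names fill32_sketch, store32 and store64 are hypothetical; the two store helpers use memcpy() as stand-ins for the unaligned native-endian writes that AV_WN32 and AV_WN64 perform, and the HAVE_FAST_64BIT guard is left out for brevity. Note that v + ((uint64_t)v << 32) yields the same value as the v * 0x100000001ULL form used in the first version of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for AV_WN32 / AV_WN64 (unaligned native stores). */
static void store32(uint8_t *p, uint32_t v) { memcpy(p, &v, sizeof(v)); }
static void store64(uint8_t *p, uint64_t v) { memcpy(p, &v, sizeof(v)); }

/* Repeat the 4 bytes located just before dst across the next len bytes.
 * The caller must guarantee that dst[-4..-1] is readable. */
static void fill32_sketch(uint8_t *dst, size_t len)
{
    uint32_t v;
    memcpy(&v, dst - 4, sizeof(v));

    /* Both 32-bit halves hold v, so the in-memory layout is the
     * 4-byte pattern twice on any endianness. */
    uint64_t v2 = v + ((uint64_t)v << 32);

    /* Unrolled main loop: four 64-bit stores per iteration. */
    while (len >= 32) {
        store64(dst,      v2);
        store64(dst +  8, v2);
        store64(dst + 16, v2);
        store64(dst + 24, v2);
        dst += 32;
        len -= 32;
    }

    /* Remaining whole 32-bit words. */
    while (len >= 4) {
        store32(dst, v);
        dst += 4;
        len -= 4;
    }

    /* Final 0..3 bytes: copy from 4 bytes back. */
    while (len--) {
        *dst = dst[-4];
        dst++;
    }
}
```

Whether the unrolled 64-bit loop is actually faster depends on the target and compiler; the timings quoted in this thread come from one fuzzer testcase and should not be read as a general benchmark.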

From 9b5573f91a043a818fe1fd6b93d0d36c4830cd9c Mon Sep 17 00:00:00 2001
From: Michael Niedermayer <michael@niedermayer.cc>
Date: Tue, 25 Dec 2018 23:15:20 +0100
Subject: [PATCH 2/2] avutil/imgutils: Optimize memset_bytes() by using
 av_memcpy_backptr()

This is strongly based on code by Marton Balint

Fixes: Timeout
Fixes: 11502/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_WCMV_fuzzer-5664893810769920
Before: Executed clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_WCMV_fuzzer-5664893810769920 in 11209 ms
After:  Executed clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_WCMV_fuzzer-5664893810769920 in  4104 ms

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
---
 libavutil/imgutils.c | 26 +++++---------------------
 1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/libavutil/imgutils.c b/libavutil/imgutils.c
index 4938a7ef67..cf06afde3f 100644
--- a/libavutil/imgutils.c
+++ b/libavutil/imgutils.c
@@ -521,28 +521,12 @@  static void memset_bytes(uint8_t *dst, size_t dst_size, uint8_t *clear,
     if (clear_size == 1) {
         memset(dst, clear[0], dst_size);
         dst_size = 0;
-    } else if (clear_size == 2) {
-        uint16_t val = AV_RN16(clear);
-        for (; dst_size >= 2; dst_size -= 2) {
-            AV_WN16(dst, val);
-            dst += 2;
-        }
-    } else if (clear_size == 4) {
-        uint32_t val = AV_RN32(clear);
-        for (; dst_size >= 4; dst_size -= 4) {
-            AV_WN32(dst, val);
-            dst += 4;
-        }
-    } else if (clear_size == 8) {
-        uint32_t val = AV_RN64(clear);
-        for (; dst_size >= 8; dst_size -= 8) {
-            AV_WN64(dst, val);
-            dst += 8;
-        }
+    } else {
+        if (clear_size > dst_size)
+            clear_size = dst_size;
+        memcpy(dst, clear, clear_size);
+        av_memcpy_backptr(dst + clear_size, clear_size, dst_size - clear_size);
     }
-
-    for (; dst_size; dst_size--)
-        *dst++ = clear[pos++ % clear_size];
 }
 
 // Maximum size in bytes of a plane element (usually a pixel, or multiple pixels