
[FFmpeg-devel,1/2] libavutil/cpu: Adds fast gather detection.

Message ID 20210625075429.72269-1-alankelly@google.com
State Superseded
Series [FFmpeg-devel,1/2] libavutil/cpu: Adds fast gather detection.

Checks

Context                 Check    Description
andriy/x86_make         success  Make finished
andriy/x86_make_fate    success  Make fate finished
andriy/PPC64_make       success  Make finished
andriy/PPC64_make_fate  success  Make fate finished

Commit Message

Alan Kelly June 25, 2021, 7:54 a.m. UTC
Broadwell and later and Zen3 and later have fast gather instructions.
---
 Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
 and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
 libavutil/cpu.h     |  2 ++
 libavutil/x86/cpu.c | 18 ++++++++++++++++--
 libavutil/x86/cpu.h |  1 +
 3 files changed, 19 insertions(+), 2 deletions(-)

Comments

Lynne June 25, 2021, 8:39 a.m. UTC | #1
Jun 25, 2021, 09:54 by alankelly-at-google.com@ffmpeg.org:

> Broadwell and later and Zen3 and later have fast gather instructions.
> ---
>  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
>  and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
>  libavutil/cpu.h     |  2 ++
>  libavutil/x86/cpu.c | 18 ++++++++++++++++--
>  libavutil/x86/cpu.h |  1 +
>  3 files changed, 19 insertions(+), 2 deletions(-)
>

No, we really don't need more FAST/SLOW flags, especially for
something like this which is just fixable by _not_using_vgather_.
Take a look at libavutil/x86/tx_float.asm: we only use vgather
if it's guaranteed to either be faster for what we're gathering or
at least as fast as the "slow" path. If neither is true, we use manual lookups,
which is actually advantageous since for AVX2 we can interleave
the lookups that happen in each lane.

Even if we disregard this, I've extensively benchmarked vgather
on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
a great vgather improvement to be found in Zen 3 to justify
using a new CPU flag for this.
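
To illustrate the manual-lookup alternative described above, here is a minimal
C-intrinsics sketch; tx_float.asm itself does this in hand-written assembly,
and the helper names and data layout below are made up for the example:

#include <immintrin.h>

/* Hardware gather: a single instruction, but 9-12 cycles on Haswell
 * according to the commit message above. */
static __m256 gather_hw(const float *table, __m256i idx)
{
    return _mm256_i32gather_ps(table, idx, 4);
}

/* "Manual lookups": eight independent scalar loads. They carry no
 * dependency on one another, so the loads overlap in the pipeline and
 * the two 128-bit lanes can be filled in an interleaved fashion. */
static __m256 gather_manual(const float *table, const int idx[8])
{
    __m128 lo = _mm_setr_ps(table[idx[0]], table[idx[1]],
                            table[idx[2]], table[idx[3]]);
    __m128 hi = _mm_setr_ps(table[idx[4]], table[idx[5]],
                            table[idx[6]], table[idx[7]]);
    return _mm256_set_m128(hi, lo);
}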
Alan Kelly June 25, 2021, 11:24 a.m. UTC | #2
On Fri, Jun 25, 2021 at 10:40 AM Lynne <dev@lynne.ee> wrote:

> Jun 25, 2021, 09:54 by alankelly-at-google.com@ffmpeg.org:
>
> > Broadwell and later and Zen3 and later have fast gather instructions.
> > ---
> >  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
> >  and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
> >  libavutil/cpu.h     |  2 ++
> >  libavutil/x86/cpu.c | 18 ++++++++++++++++--
> >  libavutil/x86/cpu.h |  1 +
> >  3 files changed, 19 insertions(+), 2 deletions(-)
> >
>
> No, we really don't need more FAST/SLOW flags, especially for
> something like this which is just fixable by _not_using_vgather_.
> Take a look at libavutil/x86/tx_float.asm, we only use vgather
> if it's guaranteed to either be faster for what we're gathering or
> is just as fast "slow". If neither is true, we use manual lookups,
> which is actually advantageous since for AVX2 we can interleave
> the lookups that happen in each lane.
>
> Even if we disregard this, I've extensively benchmarked vgather
> on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
> a great vgather improvement to be found in Zen 3 to justify
> using a new CPU flag for this.
>

Thanks for your response. I'm not against finding a cleaner way of
enabling/disabling the code which will be protected by this flag. However,
the manual lookups solution proposed will not work in this case: the avx2
version of hscale is only faster if fast gathers are available;
otherwise, the ssse3 version should be used.

I haven't got access to a Zen3 so I can't comment on the performance. I
have tested on a Zen 2 and it is slow. On Broadwell hscale avx2 is about
10% faster than the ssse3 version and on Skylake about 40% faster; Haswell
has similar performance to Zen 2.

Is there a proxy which could be used for detecting Broadwell or Skylake and
later? AVX512 seems too strict as there are Skylake chips without AVX512.
Thanks
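
To make the dispatch Alan describes concrete, here is a hedged sketch using the
macro proposed in this patch; the function-pointer type and kernel names are
placeholders, not the real swscale symbols:

#include <stddef.h>
#include "libavutil/cpu.h"
#include "libavutil/x86/cpu.h"

typedef void (*hscale_fn)(void);           /* placeholder type           */
extern void hscale_8_to_15_ssse3(void);    /* placeholder kernel symbols */
extern void hscale_8_to_15_avx2(void);

static hscale_fn pick_hscale(void)
{
    int cpu_flags = av_get_cpu_flags();
    hscale_fn fn = NULL;

    if (EXTERNAL_SSSE3(cpu_flags))
        fn = hscale_8_to_15_ssse3;
    /* Proposed macro: AVX2 present *and* gathers not flagged as slow,
     * so Haswell and Zen 2 keep the SSSE3 version. */
    if (EXTERNAL_AVX2_FAST_GATHER(cpu_flags))
        fn = hscale_8_to_15_avx2;
    return fn;
}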
Alan Kelly July 12, 2021, 9:29 a.m. UTC | #3
On Fri, Jun 25, 2021 at 1:24 PM Alan Kelly <alankelly@google.com> wrote:

> On Fri, Jun 25, 2021 at 10:40 AM Lynne <dev@lynne.ee> wrote:
>
>> Jun 25, 2021, 09:54 by alankelly-at-google.com@ffmpeg.org:
>>
>> > Broadwell and later and Zen3 and later have fast gather instructions.
>> > ---
>> >  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on
>> Broadwell,
>> >  and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
>> >  libavutil/cpu.h     |  2 ++
>> >  libavutil/x86/cpu.c | 18 ++++++++++++++++--
>> >  libavutil/x86/cpu.h |  1 +
>> >  3 files changed, 19 insertions(+), 2 deletions(-)
>> >
>>
>> No, we really don't need more FAST/SLOW flags, especially for
>> something like this which is just fixable by _not_using_vgather_.
>> Take a look at libavutil/x86/tx_float.asm, we only use vgather
>> if it's guaranteed to either be faster for what we're gathering or
>> is just as fast "slow". If neither is true, we use manual lookups,
>> which is actually advantageous since for AVX2 we can interleave
>> the lookups that happen in each lane.
>>
>> Even if we disregard this, I've extensively benchmarked vgather
>> on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
>> a great vgather improvement to be found in Zen 3 to justify
>> using a new CPU flag for this.
>>
>
> Thanks for your response. I'm not against finding a cleaner way of
> enabling/disabling the code which will be protected by this flag. However,
> the manual lookups solution proposed will not work in this case, the avx2
> version of hscale will only be faster if fast gathers are available,
> otherwise, the ssse3 version should be used.
>
> I haven't got access to a Zen3 so I can't comment on the performance. I
> have tested on a Zen 2 and it is slow. On Broadwell hscale avx2 is about
> 10% faster than the ssse3 version and on Skylake about 40% faster, Haswell
> has similar performance to Zen2.
>
> Is there a proxy which could be used for detecting Broadwell or Skylake
> and later? AVX512 seems too strict as there are Skylake chips without
> AVX512. Thanks
>

Hi,

I will paste the performance figures from the thread for the other part of
this patch here so that the justification for this flag is clearer:

                              Skylake  Haswell
hscale_8_to_15_width4_ssse3     761.2      760
hscale_8_to_15_width4_avx2      468.7      957
hscale_8_to_15_width8_ssse3    1170.7     1032
hscale_8_to_15_width8_avx2      865.7     1979
hscale_8_to_15_width12_ssse3   2172.2     2472
hscale_8_to_15_width12_avx2    1245.7     2901
hscale_8_to_15_width16_ssse3   2244.2     2400
hscale_8_to_15_width16_avx2    1647.2     3681

As you can see, it is catastrophic on Haswell and older chips, but the gains
on Skylake are impressive.
As I don't have performance figures for Zen 3, I can disable this feature
on all CPUs apart from Broadwell and later, since you say that there is no
worthwhile improvement on Zen 3. Is this OK with you?

Thanks
Lynne July 12, 2021, 10:46 a.m. UTC | #4
12 Jul 2021, 11:29 by alankelly-at-google.com@ffmpeg.org:

> On Fri, Jun 25, 2021 at 1:24 PM Alan Kelly <alankelly@google.com> wrote:
>
>> On Fri, Jun 25, 2021 at 10:40 AM Lynne <dev@lynne.ee> wrote:
>>
>>> Jun 25, 2021, 09:54 by alankelly-at-google.com@ffmpeg.org:
>>>
>>> > Broadwell and later and Zen3 and later have fast gather instructions.
>>> > ---
>>> >  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on
>>> Broadwell,
>>> >  and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
>>> >  libavutil/cpu.h     |  2 ++
>>> >  libavutil/x86/cpu.c | 18 ++++++++++++++++--
>>> >  libavutil/x86/cpu.h |  1 +
>>> >  3 files changed, 19 insertions(+), 2 deletions(-)
>>> >
>>>
>>> No, we really don't need more FAST/SLOW flags, especially for
>>> something like this which is just fixable by _not_using_vgather_.
>>> Take a look at libavutil/x86/tx_float.asm, we only use vgather
>>> if it's guaranteed to either be faster for what we're gathering or
>>> is just as fast "slow". If neither is true, we use manual lookups,
>>> which is actually advantageous since for AVX2 we can interleave
>>> the lookups that happen in each lane.
>>>
>>> Even if we disregard this, I've extensively benchmarked vgather
>>> on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
>>> a great vgather improvement to be found in Zen 3 to justify
>>> using a new CPU flag for this.
>>>
>>
>> Thanks for your response. I'm not against finding a cleaner way of
>> enabling/disabling the code which will be protected by this flag. However,
>> the manual lookups solution proposed will not work in this case, the avx2
>> version of hscale will only be faster if fast gathers are available,
>> otherwise, the ssse3 version should be used.
>>
>> I haven't got access to a Zen3 so I can't comment on the performance. I
>> have tested on a Zen 2 and it is slow. On Broadwell hscale avx2 is about
>> 10% faster than the ssse3 version and on Skylake about 40% faster, Haswell
>> has similar performance to Zen2.
>>
>> Is there a proxy which could be used for detecting Broadwell or Skylake
>> and later? AVX512 seems too strict as there are Skylake chips without
>> AVX512. Thanks
>>
>
> Hi,
>
> I will paste the performance figures from the thread for the other part of
> this patch here so that the justification for this flag is clearer:
>
> Skylake Haswell
> hscale_8_to_15_width4_ssse3 761.2 760
> hscale_8_to_15_width4_avx2 468.7 957
> hscale_8_to_15_width8_ssse3 1170.7 1032
> hscale_8_to_15_width8_avx2 865.7 1979
> hscale_8_to_15_width12_ssse3 2172.2 2472
> hscale_8_to_15_width12_avx2 1245.7 2901
> hscale_8_to_15_width16_ssse3 2244.2 2400
> hscale_8_to_15_width16_avx2 1647.2 3681
>
> As you can see, it is catastrophic on Haswell and older chips but the gains
> on Skylake are impressive.
> As I don't have performance figures for Zen 3, I can disable this feature
> on all cpus apart from Broadwell and later as you say that there is no
> worthwhile improvement on Zen3. Is this OK with you?
>

It's not that catastrophic. Since Haswell CPUs generally don't have
large AVX2 gains, could you just exclude Haswell only from
EXTERNAL_AVX2_FAST, and require EXTERNAL_AVX2_FAST
to enable those functions?
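
For reference, EXTERNAL_AVX2_FAST is roughly equivalent to the check below
(paraphrased for illustration, not copied verbatim from libavutil/x86/cpu.h);
excluding Haswell from it would therefore also affect every other function
guarded by the same macro, which is the objection raised in the next reply:

/* Approximate expansion of EXTERNAL_AVX2_FAST, for illustration only:
 * AVX2 must be enabled at build time and at runtime, and the CPU must
 * not carry the AVXSLOW flag. */
#define EXTERNAL_AVX2_FAST_APPROX(flags)     \
    (HAVE_AVX2_EXTERNAL           &&         \
     ((flags) & AV_CPU_FLAG_AVX2) &&         \
     !((flags) & AV_CPU_FLAG_AVXSLOW))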
Lynne July 12, 2021, 1:39 p.m. UTC | #5
12 Jul 2021, 13:53 by jamrial@gmail.com:

> On 7/12/2021 7:46 AM, Lynne wrote:
>
>> 12 Jul 2021, 11:29 by alankelly-at-google.com@ffmpeg.org:
>>
>>> On Fri, Jun 25, 2021 at 1:24 PM Alan Kelly <alankelly@google.com> wrote:
>>>
>>>> On Fri, Jun 25, 2021 at 10:40 AM Lynne <dev@lynne.ee> wrote:
>>>>
>>>>> Jun 25, 2021, 09:54 by alankelly-at-google.com@ffmpeg.org:
>>>>>
>>>>>> Broadwell and later and Zen3 and later have fast gather instructions.
>>>>>> ---
>>>>>>  Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on
>>>>>>
>>>>> Broadwell,
>>>>>
>>>>>> and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
>>>>>>  libavutil/cpu.h     |  2 ++
>>>>>>  libavutil/x86/cpu.c | 18 ++++++++++++++++--
>>>>>>  libavutil/x86/cpu.h |  1 +
>>>>>>  3 files changed, 19 insertions(+), 2 deletions(-)
>>>>>>
>>>>>
>>>>> No, we really don't need more FAST/SLOW flags, especially for
>>>>> something like this which is just fixable by _not_using_vgather_.
>>>>> Take a look at libavutil/x86/tx_float.asm, we only use vgather
>>>>> if it's guaranteed to either be faster for what we're gathering or
>>>>> is just as fast "slow". If neither is true, we use manual lookups,
>>>>> which is actually advantageous since for AVX2 we can interleave
>>>>> the lookups that happen in each lane.
>>>>>
>>>>> Even if we disregard this, I've extensively benchmarked vgather
>>>>> on Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly
>>>>> a great vgather improvement to be found in Zen 3 to justify
>>>>> using a new CPU flag for this.
>>>>>
>>>>
>>>> Thanks for your response. I'm not against finding a cleaner way of
>>>> enabling/disabling the code which will be protected by this flag. However,
>>>> the manual lookups solution proposed will not work in this case, the avx2
>>>> version of hscale will only be faster if fast gathers are available,
>>>> otherwise, the ssse3 version should be used.
>>>>
>>>> I haven't got access to a Zen3 so I can't comment on the performance. I
>>>> have tested on a Zen 2 and it is slow. On Broadwell hscale avx2 is about
>>>> 10% faster than the ssse3 version and on Skylake about 40% faster, Haswell
>>>> has similar performance to Zen2.
>>>>
>>>> Is there a proxy which could be used for detecting Broadwell or Skylake
>>>> and later? AVX512 seems too strict as there are Skylake chips without
>>>> AVX512. Thanks
>>>>
>>>
>>> Hi,
>>>
>>> I will paste the performance figures from the thread for the other part of
>>> this patch here so that the justification for this flag is clearer:
>>>
>>> Skylake Haswell
>>> hscale_8_to_15_width4_ssse3 761.2 760
>>> hscale_8_to_15_width4_avx2 468.7 957
>>> hscale_8_to_15_width8_ssse3 1170.7 1032
>>> hscale_8_to_15_width8_avx2 865.7 1979
>>> hscale_8_to_15_width12_ssse3 2172.2 2472
>>> hscale_8_to_15_width12_avx2 1245.7 2901
>>> hscale_8_to_15_width16_ssse3 2244.2 2400
>>> hscale_8_to_15_width16_avx2 1647.2 3681
>>>
>>> As you can see, it is catastrophic on Haswell and older chips but the gains
>>> on Skylake are impressive.
>>> As I don't have performance figures for Zen 3, I can disable this feature
>>> on all cpus apart from Broadwell and later as you say that there is no
>>> worthwhile improvement on Zen3. Is this OK with you?
>>>
>>
>> It's not that catastrophic. Since Haswell CPUs generally don't have
>> large AVX2 gains, could you just exclude Haswell only from
>> EXTERNAL_AVX2_FAST, and require EXTERNAL_AVX2_FAST
>> to enable those functions?
>>
>
> And disable all non gather AVX2 asm functions on Haswell? No. And it's a lie that Haswell doesn't have large gains with AVX2.
>

It won't disable ALL of the AVX2 code, but it'll affect a few random components, the most
prominent of which is some (not all) of the HEVC assembly.
But I think I'd rather just not do anything at all. Performance of vgather even on Haswell
is still above 2x the C version, and we barely have any vgathers in our code. And
Haswell use is in decline too.

Patch

diff --git a/libavutil/cpu.h b/libavutil/cpu.h
index b555422dae..f94eb79af1 100644
--- a/libavutil/cpu.h
+++ b/libavutil/cpu.h
@@ -50,6 +50,7 @@ 
 #define AV_CPU_FLAG_FMA4         0x0800 ///< Bulldozer FMA4 functions
 #define AV_CPU_FLAG_CMOV         0x1000 ///< supports cmov instruction
 #define AV_CPU_FLAG_AVX2         0x8000 ///< AVX2 functions: requires OS support even if YMM registers aren't used
+#define AV_CPU_FLAG_AVX2SLOW  0x2000000 ///< AVX2 supported but gather is slower.
 #define AV_CPU_FLAG_FMA3        0x10000 ///< Haswell FMA3 functions
 #define AV_CPU_FLAG_BMI1        0x20000 ///< Bit Manipulation Instruction Set 1
 #define AV_CPU_FLAG_BMI2        0x40000 ///< Bit Manipulation Instruction Set 2
@@ -107,6 +108,7 @@  int av_cpu_count(void);
  *  av_set_cpu_flags_mask(), then this function will behave as if AVX is not
  *  present.
  */
+
 size_t av_cpu_max_align(void);
 
 #endif /* AVUTIL_CPU_H */
diff --git a/libavutil/x86/cpu.c b/libavutil/x86/cpu.c
index bcd41a50a2..56fcde594c 100644
--- a/libavutil/x86/cpu.c
+++ b/libavutil/x86/cpu.c
@@ -146,8 +146,20 @@  int ff_get_cpu_flags_x86(void)
     if (max_std_level >= 7) {
         cpuid(7, eax, ebx, ecx, edx);
 #if HAVE_AVX2
-        if ((rval & AV_CPU_FLAG_AVX) && (ebx & 0x00000020))
+        if ((rval & AV_CPU_FLAG_AVX) && (ebx & 0x00000020)){
             rval |= AV_CPU_FLAG_AVX2;
+
+            cpuid(1, eax, ebx, ecx, std_caps);
+            family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
+            model  = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);
+            // Haswell and earlier has slow gather
+            if(family == 6 && model < 70)
+                rval |= AV_CPU_FLAG_AVX2SLOW;
+            // Zen 2 and earlier
+            if (!strncmp(vendor.c, "AuthenticAMD", 12) && family < 25)
+                    rval |= AV_CPU_FLAG_AVX2SLOW;
+        }
+
 #if HAVE_AVX512 /* F, CD, BW, DQ, VL */
         if ((xcr0_lo & 0xe0) == 0xe0) { /* OPMASK/ZMM state */
             if ((rval & AV_CPU_FLAG_AVX2) && (ebx & 0xd0030000) == 0xd0030000)
@@ -194,8 +206,10 @@  int ff_get_cpu_flags_x86(void)
            functions using XMM registers are always faster on them.
            AV_CPU_FLAG_AVX and AV_CPU_FLAG_AVXSLOW are both set so that AVX is
            used unless explicitly disabled by checking AV_CPU_FLAG_AVXSLOW. */
-            if ((family == 0x15 || family == 0x16) && (rval & AV_CPU_FLAG_AVX))
+            if ((family == 0x15 || family == 0x16) && (rval & AV_CPU_FLAG_AVX)){
                 rval |= AV_CPU_FLAG_AVXSLOW;
+                rval |= AV_CPU_FLAG_AVX2SLOW;
+            }
         }
 
         /* XOP and FMA4 use the AVX instruction coding scheme, so they can't be
diff --git a/libavutil/x86/cpu.h b/libavutil/x86/cpu.h
index 937c697fa0..a42a15a997 100644
--- a/libavutil/x86/cpu.h
+++ b/libavutil/x86/cpu.h
@@ -78,6 +78,7 @@ 
 #define EXTERNAL_AVX2(flags)        CPUEXT_SUFFIX(flags, _EXTERNAL, AVX2)
 #define EXTERNAL_AVX2_FAST(flags)   CPUEXT_SUFFIX_FAST2(flags, _EXTERNAL, AVX2, AVX)
 #define EXTERNAL_AVX2_SLOW(flags)   CPUEXT_SUFFIX_SLOW2(flags, _EXTERNAL, AVX2, AVX)
+#define EXTERNAL_AVX2_FAST_GATHER(flags)   CPUEXT_SUFFIX_FAST(flags, _EXTERNAL, AVX2)
 #define EXTERNAL_AESNI(flags)       CPUEXT_SUFFIX(flags, _EXTERNAL, AESNI)
 #define EXTERNAL_AVX512(flags)      CPUEXT_SUFFIX(flags, _EXTERNAL, AVX512)
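
As a standalone illustration of the family/model decode used in the cpu.c hunk
above (the same bit arithmetic, shown outside ff_get_cpu_flags_x86; the sample
EAX value is just a representative Haswell CPUID signature):

#include <stdio.h>

int main(void)
{
    unsigned eax = 0x000306C3;  /* example CPUID.1 EAX for a Haswell part */

    /* Base family/model plus the extended fields, as in the patch. */
    int family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
    int model  = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);

    /* Heuristic from the patch: Intel family 6 with model < 70 (described
     * there as "Haswell and earlier") and AMD family < 25 (pre-Zen 3)
     * get AV_CPU_FLAG_AVX2SLOW. */
    printf("family %d, model %d\n", family, model); /* prints "family 6, model 60" */
    return 0;
}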