| Message ID | 20210625075429.72269-1-alankelly@google.com |
|---|---|
| State | Superseded |
| Series | [FFmpeg-devel,1/2] libavutil/cpu: Adds fast gather detection. |
| Context | Check | Description |
|---|---|---|
| andriy/x86_make | success | Make finished |
| andriy/x86_make_fate | success | Make fate finished |
| andriy/PPC64_make | success | Make finished |
| andriy/PPC64_make_fate | success | Make fate finished |
Jun 25, 2021, 09:54 by alankelly-at-google.com@ffmpeg.org:

> Broadwell and later and Zen3 and later have fast gather instructions.
> ---
> Gather requires between 9 and 12 cycles on Haswell, 5 to 7 on Broadwell,
> and 2 to 5 on Skylake and newer. It is also slow on AMD before Zen 3.
>
>  libavutil/cpu.h     |  2 ++
>  libavutil/x86/cpu.c | 18 ++++++++++++++++--
>  libavutil/x86/cpu.h |  1 +
>  3 files changed, 19 insertions(+), 2 deletions(-)

No, we really don't need more FAST/SLOW flags, especially for something
like this, which is just fixable by _not_using_vgather_. Take a look at
libavutil/x86/tx_float.asm: we only use vgather if it's guaranteed to
either be faster for what we're gathering or just as fast as the "slow"
path. If neither is true, we use manual lookups, which is actually
advantageous, since for AVX2 we can interleave the lookups that happen
in each lane.

Even if we disregard this, I've extensively benchmarked vgather on
Zen 3, Zen 2, Cascade Lake and Skylake, and there's hardly a great
vgather improvement to be found on Zen 3 to justify using a new CPU
flag for this.
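[Editor's note] The manual-lookup alternative described above is easiest to see from the semantics of the instruction itself: a 256-bit vpgatherdd is just eight independent 32-bit table loads, so they can be issued as ordinary loads instead. Below is a minimal scalar model of that idea; it is illustrative only (FFmpeg's tx_float.asm does this with interleaved per-lane loads in assembly, not C):

```c
#include <stdint.h>

/* Scalar model of AVX2's vpgatherdd on one ymm register: eight
 * independent 32-bit loads from table at the given indices. On CPUs
 * with slow hardware gather (Haswell, AMD before Zen 3), issuing the
 * loads "manually" can match or beat the vgather instruction. */
static void gather8_manual(uint32_t dst[8], const uint32_t *table,
                           const int32_t idx[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = table[idx[i]];
}
```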
On Fri, Jun 25, 2021 at 10:40 AM Lynne <dev@lynne.ee> wrote:

> [quote of the previous message snipped]

Thanks for your response. I'm not against finding a cleaner way of
enabling/disabling the code which will be protected by this flag. However,
the manual-lookups solution proposed will not work in this case: the avx2
version of hscale will only be faster if fast gathers are available;
otherwise, the ssse3 version should be used.

I haven't got access to a Zen 3, so I can't comment on its performance. I
have tested on a Zen 2 and it is slow. On Broadwell, hscale avx2 is about
10% faster than the ssse3 version, and on Skylake about 40% faster; Haswell
has similar performance to Zen 2.

Is there a proxy which could be used for detecting Broadwell or Skylake
and later? AVX512 seems too strict, as there are Skylake chips without
AVX512. Thanks
On Fri, Jun 25, 2021 at 1:24 PM Alan Kelly <alankelly@google.com> wrote:

> [quote of the previous messages snipped]

Hi,

I will paste the performance figures from the thread for the other part of
this patch here so that the justification for this flag is clearer:

                               Skylake   Haswell
hscale_8_to_15_width4_ssse3      761.2      760
hscale_8_to_15_width4_avx2       468.7      957
hscale_8_to_15_width8_ssse3     1170.7     1032
hscale_8_to_15_width8_avx2       865.7     1979
hscale_8_to_15_width12_ssse3    2172.2     2472
hscale_8_to_15_width12_avx2     1245.7     2901
hscale_8_to_15_width16_ssse3    2244.2     2400
hscale_8_to_15_width16_avx2     1647.2     3681

As you can see, it is catastrophic on Haswell and older chips, but the
gains on Skylake are impressive.

As I don't have performance figures for Zen 3, I can disable this feature
on all CPUs apart from Broadwell and later, as you say that there is no
worthwhile improvement on Zen 3. Is this OK with you? Thanks
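[Editor's note] The gating described above (use the AVX2 hscale only when gathers are fast, otherwise fall back to SSSE3) amounts to a dispatch like the following sketch. The two flag values mirror the patch's libavutil/cpu.h; the enum and function names are hypothetical, for illustration only:

```c
/* Model of the dispatch the proposed flag enables: pick the AVX2 hscale
 * only when AVX2 is present AND the slow-gather bit is clear. */
#define AV_CPU_FLAG_AVX2     0x8000     /* mirrors libavutil/cpu.h */
#define AV_CPU_FLAG_AVX2SLOW 0x2000000  /* added by this patch */

enum hscale_impl { HSCALE_SSSE3, HSCALE_AVX2 };

static enum hscale_impl pick_hscale(int cpu_flags)
{
    if ((cpu_flags & AV_CPU_FLAG_AVX2) && !(cpu_flags & AV_CPU_FLAG_AVX2SLOW))
        return HSCALE_AVX2;  /* fast gather: Broadwell+ / Zen 3+ */
    return HSCALE_SSSE3;     /* no AVX2, or AVX2 with slow gather */
}
```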
12 Jul 2021, 11:29 by alankelly-at-google.com@ffmpeg.org:

> [quote of the previous messages snipped]

It's not that catastrophic. Since Haswell CPUs generally don't have
large AVX2 gains, could you just exclude Haswell only from
EXTERNAL_AVX2_FAST, and require EXTERNAL_AVX2_FAST to enable those
functions?
12 Jul 2021, 13:53 by jamrial@gmail.com:

> On 7/12/2021 7:46 AM, Lynne wrote:
> [quote of the previous messages snipped]

And disable all non-gather AVX2 asm functions on Haswell? No. And it's a
lie that Haswell doesn't have large gains with AVX2.
It won't disable ALL of the AVX2, but it'll affect a few random
components, the most prominent of which is some (not all) hevc assembly.

But I think I'd rather just not do anything at all. Performance of
vgather even on Haswell is still above 2x the C version, and we barely
have any vgathers in our code. And Haswell use is in decline too.
diff --git a/libavutil/cpu.h b/libavutil/cpu.h
index b555422dae..f94eb79af1 100644
--- a/libavutil/cpu.h
+++ b/libavutil/cpu.h
@@ -50,6 +50,7 @@
 #define AV_CPU_FLAG_FMA4         0x0800 ///< Bulldozer FMA4 functions
 #define AV_CPU_FLAG_CMOV         0x1000 ///< supports cmov instruction
 #define AV_CPU_FLAG_AVX2         0x8000 ///< AVX2 functions: requires OS support even if YMM registers aren't used
+#define AV_CPU_FLAG_AVX2SLOW  0x2000000 ///< AVX2 supported but gather is slower.
 #define AV_CPU_FLAG_FMA3        0x10000 ///< Haswell FMA3 functions
 #define AV_CPU_FLAG_BMI1        0x20000 ///< Bit Manipulation Instruction Set 1
 #define AV_CPU_FLAG_BMI2        0x40000 ///< Bit Manipulation Instruction Set 2
@@ -107,6 +108,7 @@ int av_cpu_count(void);
  * av_set_cpu_flags_mask(), then this function will behave as if AVX is not
  * present.
  */
+
 size_t av_cpu_max_align(void);
 
 #endif /* AVUTIL_CPU_H */
diff --git a/libavutil/x86/cpu.c b/libavutil/x86/cpu.c
index bcd41a50a2..56fcde594c 100644
--- a/libavutil/x86/cpu.c
+++ b/libavutil/x86/cpu.c
@@ -146,8 +146,20 @@ int ff_get_cpu_flags_x86(void)
     if (max_std_level >= 7) {
         cpuid(7, eax, ebx, ecx, edx);
 #if HAVE_AVX2
-        if ((rval & AV_CPU_FLAG_AVX) && (ebx & 0x00000020))
+        if ((rval & AV_CPU_FLAG_AVX) && (ebx & 0x00000020)) {
             rval |= AV_CPU_FLAG_AVX2;
+
+            cpuid(1, eax, ebx, ecx, std_caps);
+            family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
+            model  = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);
+            // Haswell and earlier has slow gather
+            if (family == 6 && model < 70)
+                rval |= AV_CPU_FLAG_AVX2SLOW;
+            // Zen 2 and earlier
+            if (!strncmp(vendor.c, "AuthenticAMD", 12) && family < 25)
+                rval |= AV_CPU_FLAG_AVX2SLOW;
+        }
+
 #if HAVE_AVX512 /* F, CD, BW, DQ, VL */
         if ((xcr0_lo & 0xe0) == 0xe0) { /* OPMASK/ZMM state */
             if ((rval & AV_CPU_FLAG_AVX2) && (ebx & 0xd0030000) == 0xd0030000)
@@ -194,8 +206,10 @@ int ff_get_cpu_flags_x86(void)
            functions using XMM registers are always faster on them.
            AV_CPU_FLAG_AVX and AV_CPU_FLAG_AVXSLOW are both set so that AVX is
            used unless explicitly disabled by checking AV_CPU_FLAG_AVXSLOW. */
-        if ((family == 0x15 || family == 0x16) && (rval & AV_CPU_FLAG_AVX))
+        if ((family == 0x15 || family == 0x16) && (rval & AV_CPU_FLAG_AVX)) {
             rval |= AV_CPU_FLAG_AVXSLOW;
+            rval |= AV_CPU_FLAG_AVX2SLOW;
+        }
     }
 
     /* XOP and FMA4 use the AVX instruction coding scheme, so they can't be
diff --git a/libavutil/x86/cpu.h b/libavutil/x86/cpu.h
index 937c697fa0..a42a15a997 100644
--- a/libavutil/x86/cpu.h
+++ b/libavutil/x86/cpu.h
@@ -78,6 +78,7 @@
 #define EXTERNAL_AVX2(flags)             CPUEXT_SUFFIX(flags, _EXTERNAL, AVX2)
 #define EXTERNAL_AVX2_FAST(flags)        CPUEXT_SUFFIX_FAST2(flags, _EXTERNAL, AVX2, AVX)
 #define EXTERNAL_AVX2_SLOW(flags)        CPUEXT_SUFFIX_SLOW2(flags, _EXTERNAL, AVX2, AVX)
+#define EXTERNAL_AVX2_FAST_GATHER(flags) CPUEXT_SUFFIX_FAST(flags, _EXTERNAL, AVX2)
 #define EXTERNAL_AESNI(flags)            CPUEXT_SUFFIX(flags, _EXTERNAL, AESNI)
 
 #define EXTERNAL_AVX512(flags)           CPUEXT_SUFFIX(flags, _EXTERNAL, AVX512)
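[Editor's note] The family/model arithmetic in the cpu.c hunk can be checked against known CPUID leaf-1 signatures. This standalone sketch reproduces the patch's decoding; the signature constants are published values for a Haswell i7 (0x306C3) and a Skylake client part (0x506E3):

```c
#include <stdint.h>

/* Reproduces the family/model decoding from the cpu.c hunk: CPUID
 * leaf 1's EAX packs base and extended family/model fields. The patch
 * then treats Intel family 6 with model < 70, and AMD family < 25
 * (pre-Zen-3), as AV_CPU_FLAG_AVX2SLOW. */
static void decode_family_model(uint32_t eax, int *family, int *model)
{
    *family = (int)(((eax >> 8) & 0xf) + ((eax >> 20) & 0xff)); /* base + ext */
    *model  = (int)(((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0)); /* base + ext */
}
```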