[FFmpeg-devel,08/11] avcodec/v210enc: add AVX-512 10-bit line pack function

Submitted by James Darnley on Nov. 9, 2017, 11:58 a.m.

Details

Message ID 20171109115837.32618-9-jdarnley@obe.tv
State New

Commit Message

James Darnley Nov. 9, 2017, 11:58 a.m.
---
 libavcodec/x86/v210enc.asm    | 5 +++++
 libavcodec/x86/v210enc_init.c | 7 +++++++
 2 files changed, 12 insertions(+)

Comments

Martin Vignali Nov. 13, 2017, 6:57 p.m.
2017-11-10 22:13 GMT+01:00 James Darnley <jdarnley@obe.tv>:

> On 2017-11-10 14:32, James Darnley wrote:
> > I mentioned previously that using ZMM registers will cause the CPU to
> > reduce its frequency.
> >
> > Gramner said on IRC that a user should spend 20-30% of time in
> > AVX-512/ZMM code for it to be a net gain in speed.
> > From ffmpeg-devel IRC on 2017-10-26
> >> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-October/004622.html
> >> [18:49:26 CEST] <Gramner> J_Darnley: be aware that using zmm registers induces significant frequency drops which reduces performance of everything else, so if you want to use 512-bit vectors you better go all in on it to make up for it. you probably want to spend at least 20-30% of overall runtime in avx-512 code
> >> [18:50:00 CEST] <Gramner> the alternative is to stay in 256-bit mode and just leverage new instructions and opmasks
> >
> > This means any cycles you might save by using longer registers, fewer
> > instructions, better instructions, whatever, will be lost because the
> > frequency drops meaning it takes longer to execute overall.
>
> Some details about this can be found in one of Intel's documents: Intel®
> 64 and IA-32 Architectures Optimization Reference Manual
> Order Number: 248966-038
> October 2017
> > https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
> Specifically section 15.26 "SKYLAKE SERVER POWER MANAGEMENT"
>
> Earlier on the ffmpeg-devel IRC channel I posted a link to Cloudflare's
> blog in which they discuss the effects of running just a few (my words)
> AVX-512/ZMM instructions.
> > https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/
>
> In the worst cases on some of the new processors the frequency drop can
> be 1GHz.  In Cloudflare's case just spending about 2.5% of time in a
> cryptography function using AVX-512 was causing a 10% drop in their
> overall performance (requests served per second).
>
> After seeing this and the discussion on IRC I won't commit any of the
> function patches.  The functions are not very impressive and are likely
> to make everything else slower.
>
> The IRC log should appear at the link below.
> > https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004651.html
>
>
Thanks for the detailed explanations.

Martin

James Darnley Nov. 13, 2017, 8:08 p.m.
On 2017-11-10 22:13, James Darnley wrote:
> The IRC log should appear at the link below.
>> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004651.html

Of course when I try to predict what number an email will get based on
the past few it ends up being out of order.

The ffmpeg-devel log I was referring to is here:
> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004652.html
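
The trade-off discussed in this thread can be put into rough numbers. The following is a minimal sketch under a deliberately simple model that is not taken from the thread or from Intel's manual: a fraction f of the baseline runtime is spent in code that the wider vectors speed up by a factor `speedup`, while the whole program runs at `clock_ratio` times its original frequency. The function names and the model itself are assumptions made for illustration only.

```c
#include <assert.h>

/*
 * Relative runtime (baseline = 1.0) under the simplified model:
 *   f           fraction of baseline runtime spent in the vectorized code
 *   speedup     how much faster that code gets per cycle (e.g. 2.0)
 *   clock_ratio new frequency divided by old frequency (e.g. 0.9)
 * A result below 1.0 means a net gain, above 1.0 a net loss.
 */
static double relative_runtime(double f, double speedup, double clock_ratio)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(speedup > 0.0);
    assert(clock_ratio > 0.0 && clock_ratio <= 1.0);
    return ((1.0 - f) + f / speedup) / clock_ratio;
}

/*
 * Fraction of baseline runtime that must be vectorized for the change
 * to break even, obtained by solving relative_runtime(f, ...) == 1.0.
 * A result above 1.0 means the change can never pay off in this model.
 */
static double break_even_fraction(double speedup, double clock_ratio)
{
    assert(speedup > 1.0);
    assert(clock_ratio > 0.0 && clock_ratio <= 1.0);
    return (1.0 - clock_ratio) / (1.0 - 1.0 / speedup);
}
```

With purely illustrative inputs, `break_even_fraction(2.0, 0.9)` comes out at about 0.2, and `relative_runtime(0.025, 2.0, 0.9)` is above 1.0, i.e. a program that spends only a small share of its time in the faster code ends up slower overall. Real frequency behaviour depends on the processor and workload, so these numbers are not measurements.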

Patch

diff --git a/libavcodec/x86/v210enc.asm b/libavcodec/x86/v210enc.asm
index 965f2bea3c..5068af27f8 100644
--- a/libavcodec/x86/v210enc.asm
+++ b/libavcodec/x86/v210enc.asm
@@ -103,6 +103,11 @@  INIT_YMM avx2
 v210_planar_pack_10
 %endif
 
+%if HAVE_AVX512_EXTERNAL
+INIT_YMM avx512
+v210_planar_pack_10
+%endif
+
 %macro v210_planar_pack_8 0
 
 ; v210_planar_pack_8(const uint8_t *y, const uint8_t *u, const uint8_t *v, uint8_t *dst, ptrdiff_t width)
diff --git a/libavcodec/x86/v210enc_init.c b/libavcodec/x86/v210enc_init.c
index e997b4b67a..e8aac373a0 100644
--- a/libavcodec/x86/v210enc_init.c
+++ b/libavcodec/x86/v210enc_init.c
@@ -32,6 +32,9 @@  void ff_v210_planar_pack_10_ssse3(const uint16_t *y, const uint16_t *u,
 void ff_v210_planar_pack_10_avx2(const uint16_t *y, const uint16_t *u,
                                  const uint16_t *v, uint8_t *dst,
                                  ptrdiff_t width);
+void ff_v210_planar_pack_10_avx512(const uint16_t *y, const uint16_t *u,
+                                   const uint16_t *v, uint8_t *dst,
+                                   ptrdiff_t width);
 
 av_cold void ff_v210enc_init_x86(V210EncContext *s)
 {
@@ -51,4 +54,8 @@  av_cold void ff_v210enc_init_x86(V210EncContext *s)
         s->sample_factor_10 = 2;
         s->pack_line_10     = ff_v210_planar_pack_10_avx2;
     }
+
+    if (EXTERNAL_AVX512(cpu_flags)) {
+        s->pack_line_10 = ff_v210_planar_pack_10_avx512;
+    }
 }
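
The init hunk above relies on the order of the checks: each later branch overrides the function pointer chosen by an earlier one, and the AVX-512 branch leaves sample_factor_10 at whatever value was set before it. A self-contained sketch of that pattern follows. The flag bits, the PackContext struct, the stub functions, the SSSE3 branch and the default sample factor of 1 are all hypothetical stand-ins, not FFmpeg's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; FFmpeg's real flags are the AV_CPU_FLAG_* values. */
enum { CPU_SSSE3 = 1 << 0, CPU_AVX2 = 1 << 1, CPU_AVX512 = 1 << 2 };

typedef void (*pack_line_fn)(const uint16_t *y, const uint16_t *u,
                             const uint16_t *v, uint8_t *dst, ptrdiff_t width);

/* Hypothetical stand-in for the two V210EncContext fields used here. */
typedef struct PackContext {
    pack_line_fn pack_line_10;
    int          sample_factor_10;
} PackContext;

/* Empty stubs standing in for the C and assembly implementations. */
#define PACK_STUB(name)                                              \
    static void name(const uint16_t *y, const uint16_t *u,           \
                     const uint16_t *v, uint8_t *dst, ptrdiff_t width) \
    { (void)y; (void)u; (void)v; (void)dst; (void)width; }

PACK_STUB(pack_c)
PACK_STUB(pack_ssse3)
PACK_STUB(pack_avx2)
PACK_STUB(pack_avx512)

static void pack_init(PackContext *s, int cpu_flags)
{
    assert(s != NULL);

    /* Assumed defaults for the sketch. */
    s->pack_line_10     = pack_c;
    s->sample_factor_10 = 1;

    if (cpu_flags & CPU_SSSE3)
        s->pack_line_10 = pack_ssse3;

    if (cpu_flags & CPU_AVX2) {
        s->sample_factor_10 = 2;
        s->pack_line_10     = pack_avx2;
    }

    /* Checked last, so it wins; the sample factor is inherited from above. */
    if (cpu_flags & CPU_AVX512)
        s->pack_line_10 = pack_avx512;
}
```

Calling `pack_init(&s, CPU_SSSE3 | CPU_AVX2 | CPU_AVX512)` selects `pack_avx512` with a sample factor of 2. If only the AVX-512 bit were set, the sample factor would stay at the default, so in this sketch the last branch depends on the AVX2 branch having run first.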