[FFmpeg-devel,0/3] swscale: add AVX2 version of yuv2nv12cX

Message ID 1587697999-84025-1-git-send-email-negomez@linux.microsoft.com

Nelson Gomez April 24, 2020, 3:13 a.m. UTC
This patchset adds an AVX2 implementation of yuv2nv12cX for Intel/AMD chips, as
a faster alternative to the C version, yuv2nv12cX_c. To support this change, the
typedef declaration for yuv2interleavedX_fn now takes two additional parameters,
chrDither8 and dstFormat, in place of a pointer to the entire SwsContext.
Output is bit-identical to the C implementation.
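
To illustrate the shape of the signature change, here is a minimal,
self-contained sketch. It is not the code from the patches: the names
PixFmtSketch, interleavedX_fn_sketch, clip_u8 and nv12cX_ref are hypothetical
stand-ins, and the scalar loop (12-bit filter coefficients, a 19-bit final
shift, the dither offset of 3 for V) is an assumption modeled on the usual
swscale vertical-filter pattern. The point is only that the callback receives
the dither table and destination format directly, so an asm routine never has
to know the layout of SwsContext.

```c
#include <stdint.h>

/* Hypothetical stand-in for the destination pixel format argument. */
enum PixFmtSketch { FMT_NV12_SKETCH, FMT_NV21_SKETCH };

/* New-style callback: the dither table and the format are passed explicitly
 * instead of a pointer to the whole scaler context. */
typedef void (*interleavedX_fn_sketch)(enum PixFmtSketch dstFormat,
                                       const uint8_t *chrDither8,
                                       const int16_t *chrFilter,
                                       int chrFilterSize,
                                       const int16_t **chrUSrc,
                                       const int16_t **chrVSrc,
                                       uint8_t *dest, int chrDstW);

/* Clamp an int to the 0..255 range of an 8-bit sample. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar reference: vertically filter the U and V lines, add dither,
 * and write the results interleaved (UV for NV12, VU for NV21). */
static void nv12cX_ref(enum PixFmtSketch dstFormat, const uint8_t *chrDither8,
                       const int16_t *chrFilter, int chrFilterSize,
                       const int16_t **chrUSrc, const int16_t **chrVSrc,
                       uint8_t *dest, int chrDstW)
{
    int swapped = dstFormat == FMT_NV21_SKETCH;

    for (int i = 0; i < chrDstW; i++) {
        int u = chrDither8[i & 7] << 12;
        int v = chrDither8[(i + 3) & 7] << 12;

        for (int j = 0; j < chrFilterSize; j++) {
            u += chrUSrc[j][i] * chrFilter[j];
            v += chrVSrc[j][i] * chrFilter[j];
        }

        /* Assumes an arithmetic right shift for negative sums. */
        dest[2 * i + swapped]  = clip_u8(u >> 19);
        dest[2 * i + !swapped] = clip_u8(v >> 19);
    }
}
```

With a one-tap filter of 4096 and a zero dither table, the sketch simply
rescales each 15-bit input sample to 8 bits and interleaves the two planes,
which makes it easy to compare against a vectorized version byte for byte.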

Patchset validated on an Intel Xeon W-2133, Core i7-8650U, and an AMD Ryzen
1700. Passes fate tests; this patch is exercised by
fate-filter-pixdesc-nv{12,21,24,42}.

Benchmarks measured on the W-2133. Flags used are:

  -benchmark -i /dev/shm/benchmark.mp4 -pix_fmt nv42 -f null -

Benchmark material is a yuv420p file:

  http://linux.microsoft.com/~negomez/ffmpeg/yuv420p-benchmark.mp4

Results:

  * Single-threaded conversion: +95% fps

    -cpuflags -avx2 -threads 1:

    frame= 9959 fps=114 q=-0.0 Lsize=N/A time=00:05:32.29 bitrate=N/A speed=3.79x
    bench: utime=87.648s stime=0.060s rtime=87.709s
    bench: maxrss=35020kB

    -cpuflags all -threads 1:

    frame= 9959 fps=222 q=-0.0 Lsize=N/A time=00:05:32.29 bitrate=N/A speed=7.39x
    bench: utime=44.900s stime=0.040s rtime=44.941s
    bench: maxrss=33048kB


  * Multi-threaded conversion: +197% fps

    -cpuflags -avx2:

    frame= 9959 fps=159 q=-0.0 Lsize=N/A time=00:05:32.29 bitrate=N/A speed=5.3x
    bench: utime=90.381s stime=0.430s rtime=62.663s
    bench: maxrss=77420kB

    -cpuflags all:

    frame= 9959 fps=473 q=-0.0 Lsize=N/A time=00:05:32.29 bitrate=N/A speed=15.8x
    bench: utime=48.625s stime=0.459s rtime=21.058s
    bench: maxrss=78500kB
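
The percentages quoted above follow directly from the fps values in the log
lines. As a quick sanity check, a small helper (hypothetical, not part of the
patchset) reproduces them:

```c
/* Relative fps gain in percent: (after - before) / before * 100. */
static double fps_gain_percent(double fps_before, double fps_after)
{
    return (fps_after - fps_before) / fps_before * 100.0;
}
```

Feeding it 114 and 222 gives roughly 94.7, and 159 and 473 gives roughly
197.5, matching the rounded figures reported for the single-threaded and
multi-threaded runs.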

---

Nelson Gomez (3):
      swscale: make yuv2interleavedX more asm-friendly
      swscale/x86/output: add AVX2 version of yuv2nv12cX
      swscale: cosmetic fixes

 libswscale/output.c           |  19 ++++++++++---------
 libswscale/swscale_internal.h |   6 ++++--
 libswscale/vscale.c           |   2 +-
 libswscale/x86/output.asm     | 140 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 libswscale/x86/swscale.c      |  24 ++++++++++++++++++++++++
 5 files changed, 178 insertions(+), 13 deletions(-)