[FFmpeg-devel,00/11] lavu/tx: FFT improvements, additions and assembly

Message ID MYfmSp7--3-2@lynne.ee

Message

Lynne April 19, 2021, 8:19 p.m. UTC
Subject: [PATCH 00/11] lavu/tx: FFT improvements, additions and assembly
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This patchset cleans up and improves the power-of-two C code,
adds a 7-point and a 9-point FFT, and adds a power-of-two length
floating-point assembly.

Lynne (11):
  lavu/tx: minor code style improvements and additional comments
  lavu/tx: refactor power-of-two FFT
  lavu/tx: add a 7-point FFT and (i)MDCT
  lavu/tx: add a 9-point FFT and (i)MDCT
  lavu/tx: add unaligned flag to the API
  lavu/tx: add full-sized iMDCT transform flag
  lavu: bump minor and add APIchanges entry for the lavu/tx changes
  lavu/tx: add parity revtab generator version
  checkasm: add av_tx FFT SIMD testing code
  doc/transforms: add documentation for the FFT transforms
  lavu/x86: add FFT assembly

 doc/APIchanges                |    3 +
 doc/transforms.md             |  706 +++++++++++++++++++
 libavutil/tx.c                |   83 ++-
 libavutil/tx.h                |   21 +-
 libavutil/tx_priv.h           |  103 ++-
 libavutil/tx_template.c       |  481 ++++++++++---
 libavutil/version.h           |    2 +-
 libavutil/x86/Makefile        |    2 +
 libavutil/x86/tx_float.asm    | 1216 +++++++++++++++++++++++++++++++++
 libavutil/x86/tx_float_init.c |  101 +++
 tests/checkasm/Makefile       |    1 +
 tests/checkasm/av_tx.c        |  109 +++
 tests/checkasm/checkasm.c     |    1 +
 tests/checkasm/checkasm.h     |    1 +
 tests/fate/checkasm.mak       |    1 +
 15 files changed, 2684 insertions(+), 147 deletions(-)
 create mode 100644 doc/transforms.md
 create mode 100644 libavutil/x86/tx_float.asm
 create mode 100644 libavutil/x86/tx_float_init.c
 create mode 100644 tests/checkasm/av_tx.c

Comments

Lynne April 19, 2021, 10:57 p.m. UTC | #1
Apr 19, 2021, 22:27 by dev@lynne.ee:

> This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx. 
> The design of this pure assembly FFT is pretty unconventional.
>

Oh, I forgot to mention _why_ it's slower than FFTW on the majority of transforms.
It's simple - we don't hardcode anything. In fact, the biggest bottleneck by far is
the lookup done during loading of input data. Easily more than 40% of the entire
time is spent just doing lookups.
FFTW hardcodes the addresses in the transforms themselves (they call them
codelets). In the process they duplicate herculean amounts of code, amplified by
their just-as-inefficient Split-Radix refactoring which needs 4 versions of each codelet.
The 5+MB stripped binary can't get fat by itself, after all.
Also, FFTW only supports a single extension (e.g. fma, avx, avx2) per build,
chosen at compile time, so that size figure looks even worse.

For an in-place pre-permuted transform where all the lookups are gone and replaced
with simple loads, we're actually consistently faster than FFTW. Thus, once
non-power-of-two transforms are implemented (where this situation happens),
we'll be faster than FFTW by quite a margin. The old non-power-of-two FFT
code is faster than FFTW by a bit, and this code's C version is quite a bit faster
than that code's C version.
I'd have implemented this as well before I sent the patch, but there are only so many
hours in a day, and ways to ignore the outside world, and the 5.0 release demands
my urgent attention, so I'll write the non-power-of-two part up when I feel like it in
hopefully a week or two.
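
To make the lookup cost concrete, here is a minimal C sketch of the two load
patterns discussed above. It is an illustration only, not the code from the
patchset: the type and function names are hypothetical, and a plain
bit-reversal table stands in for whatever permutation the real transform uses.

```c
#include <stddef.h>

/* Hypothetical complex type for illustration; not the libavutil definition. */
typedef struct { float re, im; } demo_cplx;

/* Out-of-place load: every element goes through a table lookup before the
 * actual read, so the access pattern into src is scattered. */
static void load_with_lookup(demo_cplx *dst, const demo_cplx *src,
                             const int *revtab, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[revtab[i]];
}

/* Pre-permuted input: the data is already in transform order, so the load
 * is a plain sequential read with no table involved. */
static void load_prepermuted(demo_cplx *dst, const demo_cplx *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Build a bit-reversal table for n = 1 << bits entries (assumed stand-in
 * for the real permutation). */
static void build_bitrev(int *tab, int bits)
{
    int n = 1 << bits;
    for (int i = 0; i < n; i++) {
        int r = 0;
        for (int b = 0; b < bits; b++)
            if (i & (1 << b))
                r |= 1 << (bits - 1 - b);
        tab[i] = r;
    }
}
```

Both loads produce the same dst if the caller permutes the input once up
front; the second version just moves the lookup out of the transform.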

An easy way to gauge optimization is the perf report:
  39.58%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.32pt
  21.17%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.64pt
   8.21%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_deinterleave
   3.90%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_32768
   3.15%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_8192
   3.08%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_16384
   3.00%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_65536
   2.97%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_512
   2.96%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_4096
   2.96%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_2048
   2.95%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.256pt
   2.91%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.128pt
   2.89%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.synth_1024
   0.04%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.512pt
   0.01%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.1024pt
   0.01%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.4096pt
   0.01%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.2048pt
   0.00%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.131072pt
   0.00%  a.out    a.out              [.] ff_split_radix_fft_float_fma3.8192pt

Only the 32-point and 64-point transforms read data from the input,
and that's where roughly 60% of the time (39.58% + 21.17%) goes.

A fast vgatherdpd would go a long way toward speeding this up, as would
AVX512. This code should be quite a bit faster on Ice Lake and Tiger Lake
systems, where vgatherdpd was significantly improved over Skylake.
Maybe someone can run checkasm on one of those systems.
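
For anyone not familiar with the instruction, below is a rough scalar model
in C of what a 4-lane gather with 32-bit indices and 64-bit elements does,
assuming each 64-bit element holds one interleaved complex float (re, im).
The names are hypothetical and this is not the assembly from the patch; the
real instruction also takes a mask, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 8-byte complex float, i.e. one 64-bit gather element. */
typedef struct { float re, im; } gather_cplx;

/* Scalar model of a 4-lane gather: lane k reads the 8-byte element at
 * base + idx[k] * 8. The hardware does this in one instruction, but each
 * lane is still a separate, possibly far-apart memory access. */
static void gather4_cplx(gather_cplx dst[4], const gather_cplx *base,
                         const int32_t idx[4])
{
    const unsigned char *p = (const unsigned char *)base;
    for (int lane = 0; lane < 4; lane++)
        memcpy(&dst[lane],
               p + (ptrdiff_t)idx[lane] * (ptrdiff_t)sizeof(gather_cplx),
               sizeof(gather_cplx));
}
```

The point of the model is that the cost is dominated by four independent
loads, which is why the speed of the gather itself matters so much here.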
Lynne April 21, 2021, 2:42 a.m. UTC | #2
Apr 19, 2021, 22:19 by dev@lynne.ee:

> This patchset cleans up and improves the power-of-two C code, 
> adds a 7-point and a 9-point FFT, and adds a power-of-two length
> floating-point assembly.
>

Ping. There's no one signed up to review the assembly yet, but apart from me,
there's unfortunately pretty much no one else who does float SIMD these days.
Though I did spend long enough on every part to make sure it's as close
to perfect as I could make it.
Paul B Mahol April 21, 2021, 5:36 p.m. UTC | #3
I will just test it and reply if everything is ok.

On Wed, Apr 21, 2021 at 4:43 AM Lynne <dev@lynne.ee> wrote:

> Apr 19, 2021, 22:19 by dev@lynne.ee:
>
> > This patchset cleans up and improves the power-of-two C code,
> > adds a 7-point and a 9-point FFT, and adds a power-of-two length
> > floating-point assembly.
> >
>
> Ping. There's no one signed up to review the assembly yet, but apart from
> me,
> there's unfortunately pretty much no one else who does float SIMD these
> days.
> Though I did spend long enough on every part to make sure it's as close
> to perfect as I could make it.
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>
Lynne April 24, 2021, 3:22 p.m. UTC | #4
Apr 19, 2021, 22:19 by dev@lynne.ee:

> This patchset cleans up and improves the power-of-two C code, 
> adds a 7-point and a 9-point FFT, and adds a power-of-two length
> floating-point assembly.
>

Patchset pushed.
Hopefully FATE will pass everywhere. The check should be lenient enough to
absorb any precision gained by FMA.
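
As an illustration of why a tolerance is needed at all, here is a small C
sketch. It is a generic epsilon comparison under my own assumptions, not the
actual checkasm or FATE test; all names are hypothetical. The "wide" variant
only approximates a fused multiply-add by doing the arithmetic in double.

```c
#include <stddef.h>

/* Returns 1 if every element of out is within eps of ref, 0 otherwise. */
static int close_enough(const float *ref, const float *out, size_t n, float eps)
{
    for (size_t i = 0; i < n; i++) {
        float d = ref[i] - out[i];
        if (d < 0.0f)
            d = -d;
        if (d > eps)
            return 0;
    }
    return 1;
}

/* a*b + c with the product rounded to float first (two roundings).
 * The volatile store keeps the compiler from fusing the two operations. */
static float mul_add_unfused(float a, float b, float c)
{
    volatile float p = a * b;
    return p + c;
}

/* Approximation of a fused multiply-add: the product of two floats is exact
 * in double, so the intermediate is not rounded to float before the add. */
static float mul_add_wide(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = 1 + 1/4096 and c = -1, the two variants of a*a + c differ in the
last bits, so an exact comparison fails while a small eps passes.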