Message ID | MYfmSp7--3-2@lynne.ee |
---|---|
Headers | show |
Series | lavu/tx: FFT improvements, additions and assembly | expand |
Apr 19, 2021, 22:27 by dev@lynne.ee: > This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx. > The design of this pure assembly FFT is pretty unconventional. > Oh, I forgot to mention _why_ on the majority of transforms it's slower than FFTW. It's simple - we don't hardcode anything. In fact, the biggest bottleneck by far is the lookup done during loading of input data. Easily more than 40% of the entire time is spent just doing lookups. FFTW hardcodes the addresses in the transforms themselves (they call them codelets). And in the process duplicate herculean amounts of code, amplified by their just-as-inefficient Split-Radix refactoring which needs 4 versions of each codelet. The 5+MB stripped binary can't get fat by itself, after all. Also FFTW only supports a single extension (e.g. fma, avx, avx2) at once during compile time, so that size figure looks even worse. For an in-place pre-permuted transform where all the lookups are gone and replaced with simple loads, we're actually consistently faster than FFTW. Thus, once non-power-of-two transforms are implemented (where this situation happens), we'll be faster than FFTW by quite a margin. The old non-power-of-two FFT code is faster than FFTW by a bit, and this code's C version is quite a bit faster than that code's C version. I'd have implemented this as well before I sent the patch, but there are only so many hours in a day, and ways to ignore the outside world, and the 5.0 release demands my urgent attention, so I'll write the non-power-of-two part up when I feel like it in hopefully a week or two. An easy way to gauge optimization is the perf graph: 39.58% a.out a.out [.] ff_split_radix_fft_float_fma3.32pt 21.17% a.out a.out [.] ff_split_radix_fft_float_fma3.64pt 8.21% a.out a.out [.] ff_split_radix_fft_float_fma3.synth_deinterleave 3.90% a.out a.out [.] ff_split_radix_fft_float_fma3.synth_32768 3.15% a.out a.out [.] ff_split_radix_fft_float_fma3.synth_8192 3.08% a.out a.out [.] ff_split_radix_fft_float_fma3.synth_16384 3.00% a.out a.out [.] ff_split_radix_fft_float_fma3.synth_65536 2.97% a.out a.out [.] ff_split_radix_fft_float_fma3.synth_512 2.96% a.out a.out [.] ff_split_radix_fft_float_fma3.synth_4096 2.96% a.out a.out [.] ff_split_radix_fft_float_fma3.synth_2048 2.95% a.out a.out [.] ff_split_radix_fft_float_fma3.256pt 2.91% a.out a.out [.] ff_split_radix_fft_float_fma3.128pt 2.89% a.out a.out [.] ff_split_radix_fft_float_fma3.synth_1024 0.04% a.out a.out [.] ff_split_radix_fft_float_fma3.512pt 0.01% a.out a.out [.] ff_split_radix_fft_float_fma3.1024pt 0.01% a.out a.out [.] ff_split_radix_fft_float_fma3.4096pt 0.01% a.out a.out [.] ff_split_radix_fft_float_fma3.2048pt 0.00% a.out a.out [.] ff_split_radix_fft_float_fma3.131072pt 0.00% a.out a.out [.] ff_split_radix_fft_float_fma3.8192pt Only the 32-point and 64-point transforms read data from the input, and that's where 60% of the overhead is. A fast vgatherdpd will go a long way to speed this up, as well as AVX512. This code should be quite a bit faster on Ice Lake and Tiger Lake systems, where vgatherdpd was significantly improved over Skylake. Maybe someone can run checkasm on one of those systems.
Apr 19, 2021, 22:19 by dev@lynne.ee: > This patchset cleans up and improves the power-of-two C code, > adds a 7-point and a 9-point FFT, and adds a power-of-two length > floating-point assembly. > Ping. There's no one signed up to review the assembly yet, but apart from me, there's unfortunately pretty much no one else who does float SIMD these days. Thought I did spend long enough on every part to make sure it's as close to perfect as I could make it.
I will just test it and reply if everything is ok. On Wed, Apr 21, 2021 at 4:43 AM Lynne <dev@lynne.ee> wrote: > Apr 19, 2021, 22:19 by dev@lynne.ee: > > > This patchset cleans up and improves the power-of-two C code, > > adds a 7-point and a 9-point FFT, and adds a power-of-two length > > floating-point assembly. > > > > Ping. There's no one signed up to review the assembly yet, but apart from > me, > there's unfortunately pretty much no one else who does float SIMD these > days. > Thought I did spend long enough on every part to make sure it's as close > to perfect as I could make it. > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". >
Apr 19, 2021, 22:19 by dev@lynne.ee: > This patchset cleans up and improves the power-of-two C code, > adds a 7-point and a 9-point FFT, and adds a power-of-two length > floating-point assembly. > Patchset pushed. Hopefully FATE will pass everywhere, but there's enough lenience in the check for any precision gained by FMA that it will.