Message ID | NCgcUxK--3-2@lynne.ee
---|---
State | New
Series | [FFmpeg-devel,1/6] opus: convert encoder and decoder to lavu/tx

Context | Check | Description
---|---|---
andriy/make_x86 | success | Make finished
andriy/make_fate_x86 | success | Make fate finished
On Sat, 24 Sep 2022, Lynne wrote:

> This commit changes both the encoder and decoder to use the new lavu/tx code,
> which has faster C transforms and more assembly optimizations.

What's the case of e.g. 32 bit arm - that does have a bunch of fft and mdct
assembly, but is that something that ends up used by opus today, or does the
mdct15 stuff use separate codepaths that aren't optimized there today yet?

// Martin
On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö <martin@martin.st> wrote:
>
> What's the case of e.g. 32 bit arm - that does have a bunch of fft and
> mdct assembly, but is that something that ends up used by opus today, or
> does the mdct15 stuff use separate codepaths that aren't optimized there
> today yet?
>

mdct15 only has some x86 assembly, nothing for ARM.
Only the normal (power of 2) fft/mdct has some ARM 32-bit assembly.

- Hendrik
On Sat, Sep 24, 2022 at 9:26 PM Hendrik Leppkes <h.leppkes@gmail.com> wrote:
>
> mdct15 only has some x86 assembly, nothing for ARM.
> Only the normal (power of 2) fft/mdct has some ARM 32-bit assembly.
>

Actually, I missed that the mdct15 internally uses one of the normal fft
functions for a part of the calculation, but how much impact that has on
performance vs. the new code, where the C alone is quite a bit faster, would
have to be confirmed by Lynne.

- Hendrik
On Sat, 24 Sep 2022, Hendrik Leppkes wrote:

> Actually, I missed that the mdct15 internally uses one of the normal
> fft functions for a part of the calculation, but how much impact that
> has on performance vs. the new code where the C alone is quite a bit
> faster would have to be confirmed by Lynne.

Ok, fair enough. What about ac3dsp then - that one seems like it's fairly
optimized for arm?

// Martin
Sep 24, 2022, 21:40 by martin@martin.st:

>> Actually, I missed that the mdct15 internally uses one of the normal
>> fft functions for a part of the calculation, but how much impact that
>> has on performance vs. the new code where the C alone is quite a bit
>> faster would have to be confirmed by Lynne.
>
> Ok, fair enough.
>

I did some benchmarking. Just lavc's C nptwo MDCT is 10% slower than lavu's
C nptwo MDCT.

I don't have 32-bit ARM hardware to test on, but I do have an aarch64 A53
core. On it, with all optimizations, this patch made the decoder 15% faster.
With lavu/tx's aarch64 assembly disabled to simulate arm32's situation, the
decoder was still 10% faster overall. It's probably going to be similar on
arm32.

On x86, the decoder with this patch but all lavu/tx assembly disabled was
only 10% slower than the decoder without the patch. With assembly enabled,
the patched decoder is 15% faster overall on an Alder Lake system.
As for the overall decoding time for Opus, the MDCT is very far from being
the largest overhead - that's coefficient decoding (on x86 with
optimizations, 50% of the time is spent there, whilst only 5% on the MDCT in
total). It's a very optimized decoder.

For the transform alone, with a C non-power-of-two lavu MDCT at the lengths
used by Opus, using C instead of AVX for the ptwo part makes transforms on
the order of 20% slower at 960pt, and C instead of SSE likewise around 20%
slower at 240pt. Most of this is function call overhead:
(framesize/2)/ptwo = 120, 60, 30 and 15 calls to ptwo FFTs per transform.
The assembly largely eliminates this overhead by linking assembly functions
together with a minimal 'ABI'.

> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>

Haven't touched them, they're still being used. Unfortunately, for AC3, the
full MDCT optimizations in lavc do make a difference, and the overall
decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's asm
disabled, and 7% slower with lavu/tx's asm enabled. I do plan to write
aarch64 MDCT NEON SIMD code in a month or so, unless someone is faster,
which should make the decoder at least 10% faster with lavu/tx.

For Opus, the ptwo lengths used are (framesize/2)/15 = 32, 16, 8 and 4pt
FFTs. If you'd like to help out, I've documented the C factorizations used
in docs/transforms.md. You could also try porting the existing assembly. It
should be trivial if it doesn't use the upper half of the tables: lavc's and
lavu's FFT tables differ in size - lavu's are half the size of lavc's,
because lavc's tables contain the multiplication factors mirrored after the
halfway point. That's used by the RDFT, and by the x86 assembly. It's not
worth replicating this; the memory overhead is just too much, especially on
bandwidth-starved cores.
If the arm32 assembly uses the upper half, it shouldn't be too hard to make
it read from both the start and the end of the exptab array in the
recombination function of the ptwo transforms.

The MDCT asm can be ported in a straightforward way and would improve both
decoders significantly. If the ABI is simpler than x86's, you could even
make the asm transform call into C functions, which would lessen the work. A
lot of the MDCT overhead is in the gather and multiplication part, whilst
the FFT is limited mostly by adds and memory bandwidth, so just with MDCT
assembly the decoder would get a lot faster.
Sep 24, 2022, 23:57 by dev@lynne.ee:

>> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>>
> Haven't touched them, they're still being used. Unfortunately, for AC3,
> the full MDCT optimizations in lavc do make a difference and the overall
> decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's
> asm disabled and 7% slower with lavu/tx's asm enabled. I do plan to write
> aarch64 MDCT NEON SIMD code in a month or so, unless someone is faster,
> which should make the decoder at least 10% faster with lavu/tx.
>

I'd just like to add that this was for the float version of the AC3 decoder.
The fixed-point version is a few percent faster with the patch on an A53,
and quite a bit more accurate.

The lavc fixed-point FFT code also has some weird large spikes in #cycles
for some transform sizes, so the figure above is an average, but the dips
went from 117x realtime to 78x realtime, which on a slower CPU may be the
difference between stuttering and realtime playback.

On this CPU, the fixed-point version is 23% slower than the float version,
but on a CPU with slower float ops, it would make more sense to pick the
fixed-point decoder over the float one.

The two decoders produce nearly identical results, minus a few rounding
errors, since AC3 is inherently a fixed-point codec. The only differences
are the transforms themselves, and the extra ops needed to convert the
25-bit ints to floats in the float decoder.
Lynne:

> I'd just like to add this was for the float version of the ac3 decoder.
> The fixed-point version is a few percent faster with the patch on an A53,
> and quite a bit more accurate.
> [...]
> The 2 decoders produce nearly identical results, minus a few rounding
> errors, since AC3 is inherently a fixed-point codec.

1. You forgot to remove the mdct15 requirements from configure in this whole
patchset.

2. You forgot to update the FATE references for several tests; e.g. when
only applying the ac3 patch, I get this:

TEST    ac3-4.0
stddev:    7.60 PSNR: 78.71 MAXDIFF:  867 bytes:   761856/   761856
MAXDIFF: |867 - 0| >= 1
Test ac3-4.0 failed. Look at tests/data/fate/ac3-4.0.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-4.0] Error 1

TEST    ac3-2.0
stddev:    2.57 PSNR: 88.10 MAXDIFF:  414 bytes:   804864/   804864
MAXDIFF: |414 - 0| >= 1
Test ac3-2.0 failed. Look at tests/data/fate/ac3-2.0.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-2.0] Error 1

TEST    ac3-4.0-downmix-stereo
stddev:    2.99 PSNR: 86.81 MAXDIFF:  198 bytes:   380928/   380928
MAXDIFF: |198 - 0| >= 1
Test ac3-4.0-downmix-stereo failed. Look at tests/data/fate/ac3-4.0-downmix-stereo.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-4.0-downmix-stereo] Error 1

TEST    ac3-4.0-downmix-mono
stddev:    4.11 PSNR: 84.05 MAXDIFF:  281 bytes:   190464/   190464
MAXDIFF: |281 - 0| >= 1
Test ac3-4.0-downmix-mono failed. Look at tests/data/fate/ac3-4.0-downmix-mono.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-4.0-downmix-mono] Error 1

TEST    ac3-fixed-2.0
stddev:  382.35 PSNR: 44.68 MAXDIFF:32866 bytes:   804864/   804864
MAXDIFF: |32866 - 0| >= 1
Test ac3-fixed-2.0 failed. Look at tests/data/fate/ac3-fixed-2.0.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-fixed-2.0] Error 1

TEST    ac3-fixed-4.0-downmix-mono
stddev: 1140.81 PSNR: 35.18 MAXDIFF:34416 bytes:   190464/   190464
MAXDIFF: |34416 - 0| >= 1
Test ac3-fixed-4.0-downmix-mono failed. Look at tests/data/fate/ac3-fixed-4.0-downmix-mono.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-fixed-4.0-downmix-mono] Error 1

TEST    ac3-fixed-encode
--- -	2022-09-25 14:22:45.695390813 +0200
+++ tests/data/fate/ac3-fixed-encode	2022-09-25 14:22:45.687999547 +0200
@@ -1 +1 @@
-1f548175e11a95e62ce20e442fcc8d08
+e9d78bca187b4bbafc4512bcea8efd3e
Test ac3-fixed-encode failed. Look at tests/data/fate/ac3-fixed-encode.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-fixed-encode] Error 1

(Additionally, checksums in unknown_layout-ac3, lavf-rm, shortest,
copy-shortest1 and copy-shortest2 need to be updated.)
As the above shows, the difference between the reference files and the
decoded output becomes larger in several tests, i.e. the reference files
won't be usable later on. If the new float and fixed-point decoders indeed
produce nearly identical output, then one could write tests that decode the
same file with both the floating-point and the fixed-point decoder, check
that both are nearly identical, and print a checksum of the output of the
fixed-point decoder.

Also note that there is currently no test that directly verifies your claims
of greater accuracy. One could write such a test by encoding a file with
ac3-fixed, decoding it again (with the fixed-point decoder) and printing the
PSNR of input and output. No encoding test does this at the moment.

- Andreas
On Sat, 24 Sep 2022, Lynne wrote:

>> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>>
>
> Haven't touched them, they're still being used. Unfortunately, for AC3,
> the full MDCT optimizations in lavc do make a difference and the overall
> decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's
> asm disabled and 7% slower with lavu/tx's asm enabled.

Hmm, that's a shame...

> I do plan to write aarch64 MDCT NEON SIMD code in a month or so,
> unless someone is faster, which should make the decoder at least 10%
> faster with lavu/tx.

Would you consider holding off on converting the ac3 decoder until that
point, to avoid unnecessary temporary performance regressions, at least for
the architectures that are covered by the new lavu/tx framework?

> If you'd like to help out, I've documented the C factorizations used in
> docs/transforms.md.

Sorry, I don't think I have time at the moment to take on writing new code
from scratch for this...

I could maybe consider porting the aarch64 assembly to arm32; if it's not
register starved, it's usually quite straightforward to do such rewrites
(there's either half the number of SIMD registers compared to aarch64, or
the same number but half the length).

The reason why I'm asking about arm32 is that ffmpeg has got a bunch of
users who have spent a fair amount of effort on reaching specific
performance levels for some codecs, both for the Raspberry Pi 1 (which
doesn't have NEON but only VFP) and for the newer ones with NEON. I don't
remember exactly which codecs are relevant for these users - I doubt opus
is, but ac3 and dca are, iirc.

I'm CCing Ben Avison, who has contributed a lot of optimizations in this
area.

// Martin
Sep 25, 2022, 21:55 by martin@martin.st:

> I could maybe consider porting the aarch64 assembly to arm32; if it's not
> register starved, it's usually quite straightforward to do such rewrites
> (there's either half the number of SIMD registers compared to aarch64, or
> the same number but half the length).
>

For the basis transforms (double 4, double 8 and 8, single 16), there's no
starvation. For the 32pt transform, it's a bit starved, but nothing you
couldn't work out. For 64pt and up, absolutely all registers are used, to
the point of needing to stash vector regs across GPRs. If all registers are
written back to memory (no register sharing between transform sizes), it
becomes as starved as the 32pt. It's obvious where the starvation happens
(only 32pt -> 64pt) and how to fix it, but it's still work to convert the
code. Take a look at it and see if you can spot something that would make it
difficult?
> The reason why I'm asking about arm32, is because ffmpeg has got a bunch
> of users who have spent a fair amount of effort on reaching specific
> performance levels for some codecs, both for raspberry pi 1 (which doesn't
> have neon but only vfp) and for the newer ones with neon. I don't remember
> exactly which codecs are relevant for these users - I doubt opus is, but
> ac3 and dca are, iirc.
>

We do maintain old versions for years after a release, we recently-ish had a
major bump, and very recently 5.1. I think there's enough time to bring
these decoders back up and make them faster still before stuck users become
too outdated - what do you think? Maybe someone who's interested will notice
and help out?

> I'm CCing Ben Avison who has contributed a lot of optimizations in this
> area.
>

Thanks.
Sep 25, 2022, 14:34 by andreas.rheinhardt@outlook.com:

> 1. You forgot to remove the mdct15 requirements from configure in this
> whole patchset.
>
> 2. You forgot to update the FATE references for several tests; e.g. when
> only applying the ac3 patch, I get this:
> [...]

I know, durandal pointed it out the day I sent them. I'll send them again
later.
I'm planning to just push the Opus patch in a day, with the mdct15 line in
configure gone.

> As the above shows, the difference between the reference files and the
> decoded output becomes larger in several tests, i.e. the reference files
> won't be usable later on. If the new float and fixed-point decoders indeed
> produce nearly identical output, then one could write tests that decode
> the same file with both the floating-point and the fixed-point decoder,
> check that both are nearly identical, and print a checksum of the output
> of the fixed-point decoder.
>

I have a standalone program I've hacked on as needed for the fixed-point
transforms: https://0x0.st/oWxO.c

The square root of the summed squared rounding error across the entire range
(1 to 21 bits) of transforms from 32pt to 1024pt is 6.855655 for lavu and
7.141428 for lavc, which is slightly worse. If you extend the range to 22
bits, the 1024pt transform in lavc explodes, while lavu is still fine, thus
showing a greater range. The rounding errors are a lesser problem than
hitting the max range, because then you get huge spikes in the output. I can
further reduce the error in lavu at the cost of speed, but I think this is
sufficient.

> Also note that there is currently no test that directly verifies your
> claims of greater accuracy. One could write such a test by encoding a file
> with ac3-fixed, decoding it again (with the fixed-point decoder) and
> printing the PSNR of input and output. No encoding test does this at the
> moment.
>

I'm not writing that, but I like the idea. The point of fixed-point decoders
isn't bitexactness, but speed on slow hardware, so we shouldn't be testing
an MD5.
Lynne:

> I have a standalone program I've hacked on as needed for the fixed-point
> transforms: https://0x0.st/oWxO.c
> [...]
> I'm not writing that, but I like the idea. The point of fixed-point
> decoders isn't bitexactness, but speed on slow hardware, so we shouldn't
> be testing an MD5.

Are your fixed-point transforms bitexact across all arches/cpuflags?

- Andreas
Sep 25, 2022, 23:17 by andreas.rheinhardt@outlook.com:

> Are your fixed-point transforms bitexact across all arches/cpuflags?
>

As much as libavcodec's. This is because we use a float value for the MDCT
scale, and we calculate the exptabs and FFT tables with floats before
converting them to ints during init. If issues arise, we could special-case
them, though as libavcodec's hasn't needed that, lavu doesn't need it
either.

Since the FFT tables are always constant, they would benefit from
hardcoding, as it would take any local machine precision out of the
equation. The actual constants are quantized versions of the computed
floats, which also gives a fair amount of leeway.
From d4fdda5b57ab1e0f08eb3d78dac6b003060dfd41 Mon Sep 17 00:00:00 2001
From: Lynne <dev@lynne.ee>
Date: Sat, 24 Sep 2022 00:46:44 +0200
Subject: [PATCH 1/6] opus: convert encoder and decoder to lavu/tx

This commit changes both the encoder and decoder to use the new lavu/tx code,
which has faster C transforms and more assembly optimizations.
---
 libavcodec/opus_celt.c   | 20 ++++++++++++--------
 libavcodec/opus_celt.h   |  5 +++--
 libavcodec/opusenc.c     | 15 +++++++++------
 libavcodec/opusenc_psy.c | 13 ++++++++-----
 libavcodec/opusenc_psy.h |  4 +++-
 5 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/libavcodec/opus_celt.c b/libavcodec/opus_celt.c
index 9dbeff1927..f1fb88a56d 100644
--- a/libavcodec/opus_celt.c
+++ b/libavcodec/opus_celt.c
@@ -323,7 +323,8 @@ int ff_celt_decode_frame(CeltFrame *f, OpusRangeCoder *rc,
 {
     int i, j, downmix = 0;
     int consumed;           // bits of entropy consumed thus far for this frame
-    MDCT15Context *imdct;
+    AVTXContext *imdct;
+    av_tx_fn imdct_fn;
 
     if (channels != 1 && channels != 2) {
         av_log(f->avctx, AV_LOG_ERROR, "Invalid number of coded channels: %d\n",
@@ -385,7 +386,8 @@ int ff_celt_decode_frame(CeltFrame *f, OpusRangeCoder *rc,
     f->blocks    = f->transient ? 1 << f->size : 1;
     f->blocksize = frame_size / f->blocks;
 
-    imdct = f->imdct[f->transient ? 0 : f->size];
+    imdct    = f->tx[f->transient ? 0 : f->size];
+    imdct_fn = f->tx_fn[f->transient ? 0 : f->size];
 
     if (channels == 1) {
         for (i = 0; i < CELT_MAX_BANDS; i++)
@@ -440,8 +442,8 @@ int ff_celt_decode_frame(CeltFrame *f, OpusRangeCoder *rc,
         for (j = 0; j < f->blocks; j++) {
             float *dst  = block->buf + 1024 + j * f->blocksize;
 
-            imdct->imdct_half(imdct, dst + CELT_OVERLAP / 2, f->block[i].coeffs + j,
-                              f->blocks);
+            imdct_fn(imdct, dst + CELT_OVERLAP / 2, f->block[i].coeffs + j,
+                     sizeof(float)*f->blocks);
             f->dsp->vector_fmul_window(dst, dst, dst + CELT_OVERLAP / 2,
                                        ff_celt_window, CELT_OVERLAP / 2);
         }
@@ -526,8 +528,8 @@ void ff_celt_free(CeltFrame **f)
     if (!frm)
         return;
 
-    for (i = 0; i < FF_ARRAY_ELEMS(frm->imdct); i++)
-        ff_mdct15_uninit(&frm->imdct[i]);
+    for (i = 0; i < FF_ARRAY_ELEMS(frm->tx); i++)
+        av_tx_uninit(&frm->tx[i]);
 
     ff_celt_pvq_uninit(&frm->pvq);
 
@@ -555,9 +557,11 @@ int ff_celt_init(AVCodecContext *avctx, CeltFrame **f, int output_channels,
     frm->output_channels = output_channels;
     frm->apply_phase_inv = apply_phase_inv;
 
-    for (i = 0; i < FF_ARRAY_ELEMS(frm->imdct); i++)
-        if ((ret = ff_mdct15_init(&frm->imdct[i], 1, i + 3, -1.0f/32768)) < 0)
+    for (i = 0; i < FF_ARRAY_ELEMS(frm->tx); i++) {
+        const float scale = -1.0f/32768;
+        if ((ret = av_tx_init(&frm->tx[i], &frm->tx_fn[i], AV_TX_FLOAT_MDCT, 1, 15 << (i + 3), &scale, 0)) < 0)
             goto fail;
+    }
 
     if ((ret = ff_celt_pvq_init(&frm->pvq, 0)) < 0)
         goto fail;
diff --git a/libavcodec/opus_celt.h b/libavcodec/opus_celt.h
index 661ca251de..291a544298 100644
--- a/libavcodec/opus_celt.h
+++ b/libavcodec/opus_celt.h
@@ -30,10 +30,10 @@
 #include "opus_pvq.h"
 #include "opusdsp.h"
 
-#include "mdct15.h"
 #include "libavutil/float_dsp.h"
 #include "libavutil/libm.h"
 #include "libavutil/mem_internal.h"
+#include "libavutil/tx.h"
 
 #define CELT_VECTORS 11
 #define CELT_ALLOC_STEPS 6
@@ -93,7 +93,8 @@ typedef struct CeltBlock {
 struct CeltFrame {
     // constant values that do not change during context lifetime
     AVCodecContext      *avctx;
-    MDCT15Context       *imdct[4];
+    AVTXContext         *tx[4];
+    av_tx_fn            tx_fn[4];
     AVFloatDSPContext   *dsp;
     CeltBlock           block[2];
     CeltPVQ             *pvq;
diff --git a/libavcodec/opusenc.c b/libavcodec/opusenc.c
index a7a9d3a5f5..8cdd27d930 100644
--- a/libavcodec/opusenc.c
+++ b/libavcodec/opusenc.c
@@ -40,7 +40,8 @@ typedef struct OpusEncContext {
     AVCodecContext *avctx;
     AudioFrameQueue afq;
     AVFloatDSPContext *dsp;
-    MDCT15Context *mdct[CELT_BLOCK_NB];
+    AVTXContext *tx[CELT_BLOCK_NB];
+    av_tx_fn tx_fn[CELT_BLOCK_NB];
     CeltPVQ *pvq;
     struct FFBufQueue bufqueue;
 
@@ -204,7 +205,7 @@ static void celt_frame_mdct(OpusEncContext *s, CeltFrame *f)
                 s->dsp->vector_fmul_reverse(&win[CELT_OVERLAP], src2,
                                             ff_celt_window - 8, 128);
                 src1 = src2;
-                s->mdct[0]->mdct(s->mdct[0], b->coeffs + t, win, f->blocks);
+                s->tx_fn[0](s->tx[0], b->coeffs + t, win, sizeof(float)*f->blocks);
             }
         }
     } else {
@@ -226,7 +227,7 @@ static void celt_frame_mdct(OpusEncContext *s, CeltFrame *f)
                                         ff_celt_window - 8, 128);
         memcpy(win + lap_dst + blk_len, temp, CELT_OVERLAP*sizeof(float));
 
-        s->mdct[f->size]->mdct(s->mdct[f->size], b->coeffs, win, 1);
+        s->tx_fn[f->size](s->tx[f->size], b->coeffs, win, sizeof(float));
     }
 }
 
@@ -612,7 +613,7 @@ static av_cold int opus_encode_end(AVCodecContext *avctx)
     OpusEncContext *s = avctx->priv_data;
 
     for (int i = 0; i < CELT_BLOCK_NB; i++)
-        ff_mdct15_uninit(&s->mdct[i]);
+        av_tx_uninit(&s->tx[i]);
 
     ff_celt_pvq_uninit(&s->pvq);
     av_freep(&s->dsp);
@@ -668,9 +669,11 @@ static av_cold int opus_encode_init(AVCodecContext *avctx)
         return AVERROR(ENOMEM);
 
     /* I have no idea why a base scaling factor of 68 works, could be the twiddles */
-    for (int i = 0; i < CELT_BLOCK_NB; i++)
-        if ((ret = ff_mdct15_init(&s->mdct[i], 0, i + 3, 68 << (CELT_BLOCK_NB - 1 - i))))
+    for (int i = 0; i < CELT_BLOCK_NB; i++) {
+        const float scale = 68 << (CELT_BLOCK_NB - 1 - i);
+        if ((ret = av_tx_init(&s->tx[i], &s->tx_fn[i], AV_TX_FLOAT_MDCT, 0, 15 << (i + 3), &scale, 0))))
             return AVERROR(ENOMEM);
+    }
 
     /* Zero out previous energy (matters for inter first frame) */
     for (int ch = 0; ch < s->channels; ch++)
diff --git a/libavcodec/opusenc_psy.c b/libavcodec/opusenc_psy.c
index 1c8f69269c..3bff57d347 100644
--- a/libavcodec/opusenc_psy.c
+++ b/libavcodec/opusenc_psy.c
@@ -22,7 +22,6 @@
 #include "opusenc_psy.h"
 #include "opus_pvq.h"
 #include "opustab.h"
-#include "mdct15.h"
 #include "libavutil/qsort.h"
 
 static float pvq_band_cost(CeltPVQ *pvq, CeltFrame *f, OpusRangeCoder *rc, int band,
@@ -99,7 +98,8 @@ static void step_collect_psy_metrics(OpusPsyContext *s, int index)
         s->dsp->vector_fmul(s->scratch, s->scratch, s->window[s->bsize_analysis],
                             (OPUS_BLOCK_SIZE(s->bsize_analysis) << 1));
 
-        s->mdct[s->bsize_analysis]->mdct(s->mdct[s->bsize_analysis], st->coeffs[ch], s->scratch, 1);
+        s->mdct_fn[s->bsize_analysis](s->mdct[s->bsize_analysis], st->coeffs[ch],
+                                      s->scratch, sizeof(float));
 
         for (i = 0; i < CELT_MAX_BANDS; i++)
             st->bands[ch][i] = &st->coeffs[ch][ff_celt_freq_bands[i] << s->bsize_analysis];
@@ -558,13 +558,16 @@ av_cold int ff_opus_psy_init(OpusPsyContext *s, AVCodecContext *avctx,
     for (i = 0; i < CELT_BLOCK_NB; i++) {
         float tmp;
        const int len = OPUS_BLOCK_SIZE(i);
+        const float scale = 68 << (CELT_BLOCK_NB - 1 - i);
         s->window[i] = av_malloc(2*len*sizeof(float));
         if (!s->window[i]) {
             ret = AVERROR(ENOMEM);
             goto fail;
         }
         generate_window_func(s->window[i], 2*len, WFUNC_SINE, &tmp);
-        if ((ret = ff_mdct15_init(&s->mdct[i], 0, i + 3, 68 << (CELT_BLOCK_NB - 1 - i))))
+        ret = av_tx_init(&s->mdct[i], &s->mdct_fn[i], AV_TX_FLOAT_MDCT,
+                         0, 15 << (i + 3), &scale, 0);
+        if (ret < 0)
             goto fail;
     }
 
@@ -575,7 +578,7 @@ fail:
     av_freep(&s->dsp);
 
     for (i = 0; i < CELT_BLOCK_NB; i++) {
-        ff_mdct15_uninit(&s->mdct[i]);
+        av_tx_uninit(&s->mdct[i]);
         av_freep(&s->window[i]);
     }
 
@@ -598,7 +601,7 @@ av_cold int ff_opus_psy_end(OpusPsyContext *s)
     av_freep(&s->dsp);
 
     for (i = 0; i < CELT_BLOCK_NB; i++) {
-        ff_mdct15_uninit(&s->mdct[i]);
+        av_tx_uninit(&s->mdct[i]);
         av_freep(&s->window[i]);
    }
 
diff --git a/libavcodec/opusenc_psy.h b/libavcodec/opusenc_psy.h
index d4fb096a3d..0a7cdb6f2c 100644
---
a/libavcodec/opusenc_psy.h +++ b/libavcodec/opusenc_psy.h @@ -22,6 +22,7 @@ #ifndef AVCODEC_OPUSENC_PSY_H #define AVCODEC_OPUSENC_PSY_H +#include "libavutil/tx.h" #include "libavutil/mem_internal.h" #include "opusenc.h" @@ -70,7 +71,8 @@ typedef struct OpusPsyContext { int max_steps; float *window[CELT_BLOCK_NB]; - MDCT15Context *mdct[CELT_BLOCK_NB]; + AVTXContext *mdct[CELT_BLOCK_NB]; + av_tx_fn mdct_fn[CELT_BLOCK_NB]; int bsize_analysis; DECLARE_ALIGNED(32, float, scratch)[2048]; -- 2.37.2.609.g9ff673ca1a