
[FFmpeg-devel,1/6] opus: convert encoder and decoder to lavu/tx

Message ID NCgcUxK--3-2@lynne.ee
State New
Series [FFmpeg-devel,1/6] opus: convert encoder and decoder to lavu/tx

Checks

Context Check Description
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Lynne Sept. 23, 2022, 11:14 p.m. UTC
This commit changes both the encoder and decoder to use the new lavu/tx code,
which has faster C transforms and more assembly optimizations.

Patch attached.
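
For reference, the core of the conversion is swapping the dedicated mdct15 API
for the generic lavu/tx one. A condensed sketch of the change (types, calls and
scale values as they appear in the patch below; 'i' indexes the four CELT block
sizes):

    /* Before: dedicated mdct15 context */
    MDCT15Context *mdct;
    ff_mdct15_init(&mdct, /* inverse */ 1, i + 3, -1.0f/32768);
    mdct->imdct_half(mdct, dst, src, stride);         /* stride in elements */
    ff_mdct15_uninit(&mdct);

    /* After: generic lavu/tx context plus a function pointer */
    AVTXContext *tx;
    av_tx_fn tx_fn;
    const float scale = -1.0f/32768;
    av_tx_init(&tx, &tx_fn, AV_TX_FLOAT_MDCT, /* inverse */ 1,
               15 << (i + 3), &scale, 0);
    tx_fn(tx, dst, src, sizeof(float) * stride);      /* stride now in bytes */
    av_tx_uninit(&tx);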

Comments

Martin Storsjö Sept. 24, 2022, 6:42 p.m. UTC | #1
On Sat, 24 Sep 2022, Lynne wrote:

> This commit changes both the encoder and decoder to use the new lavu/tx code,
> which has faster C transforms and more assembly optimizations.

What's the case of e.g. 32 bit arm - that does have a bunch of fft and 
mdct assembly, but is that something that ends up used by opus today, or 
does the mdct15 stuff use separate codepaths that aren't optimized there 
today yet?

// Martin
Hendrik Leppkes Sept. 24, 2022, 7:26 p.m. UTC | #2
On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö <martin@martin.st> wrote:
>
> On Sat, 24 Sep 2022, Lynne wrote:
>
> > This commit changes both the encoder and decoder to use the new lavu/tx code,
> > which has faster C transforms and more assembly optimizations.
>
> What's the case of e.g. 32 bit arm - that does have a bunch of fft and
> mdct assembly, but is that something that ends up used by opus today, or
> does the mdct15 stuff use separate codepaths that aren't optimized there
> today yet?
>

mdct15 only has some x86 assembly, nothing for ARM.
Only the normal (power of 2) fft/mdct has some ARM 32-bit assembly.

- Hendrik
Hendrik Leppkes Sept. 24, 2022, 7:31 p.m. UTC | #3
On Sat, Sep 24, 2022 at 9:26 PM Hendrik Leppkes <h.leppkes@gmail.com> wrote:
>
> On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö <martin@martin.st> wrote:
> >
> > On Sat, 24 Sep 2022, Lynne wrote:
> >
> > > This commit changes both the encoder and decoder to use the new lavu/tx code,
> > > which has faster C transforms and more assembly optimizations.
> >
> > What's the case of e.g. 32 bit arm - that does have a bunch of fft and
> > mdct assembly, but is that something that ends up used by opus today, or
> > does the mdct15 stuff use separate codepaths that aren't optimized there
> > today yet?
> >
>
> mdct15 only has some x86 assembly, nothing for ARM.
> Only the normal (power of 2) fft/mdct has some ARM 32-bit assembly.
>

Actually, I missed that the mdct15 internally uses one of the normal
fft functions for a part of the calculation, but how much impact that
has on performance vs. the new code where the C alone is quite a bit
faster would have to be confirmed by Lynne.

- Hendrik
Martin Storsjö Sept. 24, 2022, 7:40 p.m. UTC | #4
On Sat, 24 Sep 2022, Hendrik Leppkes wrote:

> On Sat, Sep 24, 2022 at 9:26 PM Hendrik Leppkes <h.leppkes@gmail.com> wrote:
>>
>> On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö <martin@martin.st> wrote:
>> >
>> > On Sat, 24 Sep 2022, Lynne wrote:
>> >
>> > > This commit changes both the encoder and decoder to use the new lavu/tx code,
>> > > which has faster C transforms and more assembly optimizations.
>> >
>> > What's the case of e.g. 32 bit arm - that does have a bunch of fft and
>> > mdct assembly, but is that something that ends up used by opus today, or
>> > does the mdct15 stuff use separate codepaths that aren't optimized there
>> > today yet?
>> >
>>
>> mdct15 only has some x86 assembly, nothing for ARM.
>> Only the normal (power of 2) fft/mdct has some ARM 32-bit assembly.
>>
>
> Actually, I missed that the mdct15 internally uses one of the normal
> fft functions for a part of the calculation, but how much impact that
> has on performance vs. the new code where the C alone is quite a bit
> faster would have to be confirmed by Lynne.

Ok, fair enough.

What about ac3dsp then - that one seems like it's fairly optimized for 
arm?

// Martin
Lynne Sept. 24, 2022, 9:57 p.m. UTC | #5
Sep 24, 2022, 21:40 by martin@martin.st:

> On Sat, 24 Sep 2022, Hendrik Leppkes wrote:
>
>> On Sat, Sep 24, 2022 at 9:26 PM Hendrik Leppkes <h.leppkes@gmail.com> wrote:
>>
>>>
>>> On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö <martin@martin.st> wrote:
>>> >
>>> > On Sat, 24 Sep 2022, Lynne wrote:
>>> >
>>> > > This commit changes both the encoder and decoder to use the new lavu/tx code,
>>> > > which has faster C transforms and more assembly optimizations.
>>> >
>>> > What's the case of e.g. 32 bit arm - that does have a bunch of fft and
>>> > mdct assembly, but is that something that ends up used by opus today, or
>>> > does the mdct15 stuff use separate codepaths that aren't optimized there
>>> > today yet?
>>> >
>>>
>>> mdct15 only has some x86 assembly, nothing for ARM.
>>> Only the normal (power of 2) fft/mdct has some ARM 32-bit assembly.
>>>
>>
>> Actually, I missed that the mdct15 internally uses one of the normal
>> fft functions for a part of the calculation, but how much impact that
>> has on performance vs. the new code where the C alone is quite a bit
>> faster would have to be confirmed by Lynne.
>>
>
> Ok, fair enough.
>

I did some benchmarking. lavc's C nptwo MDCT alone is 10% slower than lavu's
C nptwo MDCT. I don't have 32-bit ARM hardware to test on, but I do have an
aarch64 A53 core. On it, with all optimizations enabled, the decoder became
15% faster with this patch. With lavu/tx's aarch64 assembly disabled to
simulate the arm32 situation, the decoder was still 10% faster overall. It's
probably going to be similar on arm32.

On x86, the decoder with this patch but with all lavu/tx asm disabled was only
10% slower than the decoder without the patch. With assembly enabled, the
patched decoder is 15% faster overall on an Alder Lake system.

As for the overall decoding time for Opus, the MDCT is far behind the largest
overhead, coefficient decoding (on x86 with optimizations, 50% of the time is
spent there, versus only 5% on the MDCT in total). It's a very optimized decoder.

In general, for the transform alone at the lengths used by Opus, the C
non-power-of-two lavu MDCT is on the order of 20% slower than the AVX version
of the ptwo part for 960pt, and around 20% slower than the SSE version for
240pt. Most of this is function call overhead: (framesize/2)/ptwo = 120, 60,
30 and 15 calls to ptwo FFTs per transform. The assembly largely eliminates
this overhead by linking the assembly functions together with a minimal
'ABI'.


> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>

Haven't touched them, they're still being used. Unfortunately, for AC3,
the full MDCT optimizations in lavc do make a difference, and the overall
decoder becomes 15% slower with this patch on aarch64 with lavu/tx's
asm disabled, and 7% slower with lavu/tx's asm enabled. I do plan to write
aarch64 NEON SIMD code for the MDCT in a month or so, unless someone beats
me to it, which should make the decoder at least 10% faster with lavu/tx.

For Opus, the ptwo lengths used are (framesize/2)/15 = 32, 16, 8 and 4pt FFTs
(spelled out below).
If you'd like to help out, I've documented the C factorizations used in
docs/transforms.md. You could also try porting the existing assembly. It should be
trivial if it doesn't use the upper half of the tables. lavc's and lavu's FFT tables
differ in size - lavu's are half the size of lavc's, because lavc's tables
contain the multiplication factors mirrored after the halfway point. That's used by
the RDFT, and by the x86 assembly. It's not worth replicating this; the
memory overhead is just too much, especially on bandwidth-starved cores.
If the arm32 assembly uses the upper part, it shouldn't be too hard to
make it read from both the start and the end of the exptab array in the
recombination function of the ptwo transforms.
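
That is, directly from the formula above, for the four CELT sizes:

    framesize   framesize/2 = 15 * ptwo
       120           60     = 15 *  4
       240          120     = 15 *  8
       480          240     = 15 * 16
       960          480     = 15 * 32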

The MDCT asm can be ported in a straightforward way and would improve
both decoders significantly. If the ABI is simpler than x86's, you could even
have the asm transform call into the C FFT functions, which would lessen the work.
A lot of the MDCT overhead is in the gather and multiplication part, whilst
the FFT is limited mostly by adds and memory bandwidth, so with MDCT assembly
alone the decoder would already get a lot faster.
Lynne Sept. 25, 2022, 7:54 a.m. UTC | #6
Sep 24, 2022, 23:57 by dev@lynne.ee:

> Sep 24, 2022, 21:40 by martin@martin.st:
>
>> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>>
> Haven't touched them, they're still being used. Unfortunately, for AC3,
> the full MDCT optimizations in lavc do make a difference and the overall
> decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's
> asm disabled and 7% slower with lavu/tx's asm enabled. I do plan to write
> an aarch64 MDCT NEON SIMD code in a month or so, unless someone is faster,
> which should make the decoder at least 10% faster with lavu/tx.
>

I'd just like to add that this was for the float version of the ac3 decoder. The
fixed-point version is a few percent faster with the patch on an A53, and quite
a bit more accurate.
The lavc fixed-point FFT code also has some weird large spikes in cycle counts
for some transform sizes, so the figure above is an average, but the dips
went from 117x realtime to 78x realtime, which on a slower CPU may
be the difference between stuttering and realtime playback.
On this CPU, the fixed-point version is 23% slower than the float version,
but on a CPU with slower float ops, it would make more sense to pick the
fixed-point decoder over the float one.
The two decoders produce nearly identical results, minus a few rounding
errors, since AC3 is inherently a fixed-point codec. The only differences
are the transforms themselves, and the extra ops needed to convert
the 25-bit ints to floats in the float decoder.
Andreas Rheinhardt Sept. 25, 2022, 12:34 p.m. UTC | #7
Lynne:
> Sep 24, 2022, 23:57 by dev@lynne.ee:
> 
>> Sep 24, 2022, 21:40 by martin@martin.st:
>>
>>> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>>>
>> Haven't touched them, they're still being used. Unfortunately, for AC3,
>> the full MDCT optimizations in lavc do make a difference and the overall
>> decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's
>> asm disabled and 7% slower with lavu/tx's asm enabled. I do plan to write
>> an aarch64 MDCT NEON SIMD code in a month or so, unless someone is faster,
>> which should make the decoder at least 10% faster with lavu/tx.
>>
> 
> I'd just like to add this was for the float version of the ac3 decoder. The fixed-point
> version is a few percent faster with the patch on an A53, and quite a bit
> more accurate.
> The lavc fixed-point FFT code also has some weird large spikes in #cycles
> for some transform sizes, so the figure above is an average, but the dips
> went from 117x realtime to 78x realtime, which on a slower CPU may
> be the difference between stuttering and realtime playback.
> On this CPU, the fixed-point version is 23% slower than the float version,
> but on a CPU with slower float ops, it would make more sense to pick that
> decoder up than the float version.
> The 2 decoders produce nearly identical results, minus a few rounding
> errors, since AC3 is inherently a fixed-point codec. The only difference
> are the transforms themselves, and the extra ops needed to convert
> the 25bit ints to floats in the float decoder.

1. You forgot to remove mdct15 requirements from configure in this whole
patchset.
2. You forgot to update the FATE references for several tests; e.g. when
applying only the ac3 patch, I get this:

TEST    ac3-4.0
stddev:    7.60 PSNR: 78.71 MAXDIFF:  867 bytes:   761856/   761856
MAXDIFF: |867 - 0| >= 1
Test ac3-4.0 failed. Look at tests/data/fate/ac3-4.0.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-4.0] Error 1
TEST    ac3-2.0
stddev:    2.57 PSNR: 88.10 MAXDIFF:  414 bytes:   804864/   804864
MAXDIFF: |414 - 0| >= 1
Test ac3-2.0 failed. Look at tests/data/fate/ac3-2.0.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-2.0] Error 1
TEST    ac3-4.0-downmix-stereo
stddev:    2.99 PSNR: 86.81 MAXDIFF:  198 bytes:   380928/   380928
MAXDIFF: |198 - 0| >= 1
Test ac3-4.0-downmix-stereo failed. Look at
tests/data/fate/ac3-4.0-downmix-stereo.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-4.0-downmix-stereo] Error 1
TEST    ac3-4.0-downmix-mono
stddev:    4.11 PSNR: 84.05 MAXDIFF:  281 bytes:   190464/   190464
MAXDIFF: |281 - 0| >= 1
Test ac3-4.0-downmix-mono failed. Look at
tests/data/fate/ac3-4.0-downmix-mono.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-4.0-downmix-mono] Error 1
TEST    ac3-fixed-2.0
stddev:  382.35 PSNR: 44.68 MAXDIFF:32866 bytes:   804864/   804864
MAXDIFF: |32866 - 0| >= 1
Test ac3-fixed-2.0 failed. Look at tests/data/fate/ac3-fixed-2.0.err for
details.
make: *** [src/tests/Makefile:307: fate-ac3-fixed-2.0] Error 1
TEST    ac3-fixed-4.0-downmix-mono
stddev: 1140.81 PSNR: 35.18 MAXDIFF:34416 bytes:   190464/   190464
MAXDIFF: |34416 - 0| >= 1
Test ac3-fixed-4.0-downmix-mono failed. Look at
tests/data/fate/ac3-fixed-4.0-downmix-mono.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-fixed-4.0-downmix-mono] Error 1
TEST    ac3-fixed-encode
--- -	2022-09-25 14:22:45.695390813 +0200
+++ tests/data/fate/ac3-fixed-encode	2022-09-25 14:22:45.687999547 +0200
@@ -1 +1 @@
-1f548175e11a95e62ce20e442fcc8d08
+e9d78bca187b4bbafc4512bcea8efd3e
Test ac3-fixed-encode failed. Look at
tests/data/fate/ac3-fixed-encode.err for details.
make: *** [src/tests/Makefile:307: fate-ac3-fixed-encode] Error 1

(Additionally, checksums in unknown_layout-ac3, lavf-rm, shortest,
copy-shortest1 and copy-shortest2 need to be updated.)

As the above shows, the difference between the reference files and the
decoded output becomes larger in several tests, i.e. the reference files
won't be usable later on. If the new float and fixed-point decoders do
indeed produce nearly identical output, then one could write tests that
decode the same file with both the floating-point and the fixed-point
decoder, check that both are nearly identical, and print a checksum of
the output of the fixed-point decoder.

Also note that there is currently no test that directly verifies your
claims of greater accuracy. One could write such a test by encoding a
file with ac3-fixed, decoding it again (with the fixed-point decoder)
and printing the PSNR of input and output. No encoding test does this
at the moment.
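
For what it's worth, a rough, hypothetical sketch of the metric such a test
could print (this is not an existing FATE helper; the name and the s16
assumption are illustrative):

    #include <math.h>
    #include <stdint.h>
    #include <stddef.h>

    /* PSNR between the original PCM and the ac3_fixed encode -> decode
     * round trip, over interleaved s16 samples of equal length. */
    static double pcm_s16_psnr(const int16_t *ref, const int16_t *out, size_t n)
    {
        double sse = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = (double)ref[i] - (double)out[i];
            sse += d * d;
        }
        if (sse == 0.0)
            return INFINITY; /* bit-identical */
        return 10.0 * log10((32767.0 * 32767.0) / (sse / n));
    }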

- Andreas
Martin Storsjö Sept. 25, 2022, 7:55 p.m. UTC | #8
On Sat, 24 Sep 2022, Lynne wrote:

>> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>>
>
> Haven't touched them, they're still being used. Unfortunately, for AC3,
> the full MDCT optimizations in lavc do make a difference and the overall
> decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's
> asm disabled and 7% slower with lavu/tx's asm enabled.

Hmm, that's a shame...

> I do plan to write an aarch64 MDCT NEON SIMD code in a month or so, 
> unless someone is faster, which should make the decoder at least 10% 
> faster with lavu/tx.

Would you consider holding off on converting the ac3 decoder until then, to 
avoid unnecessary temporary performance regressions, at least for the 
architectures that are covered by the new lavu/tx framework?

> If you'd like to help out, I've documented the C factorizations used in
> docs/transforms.md.

Sorry, I don't think I have time at the moment to take on writing new code 
from scratch for this...

I could maybe consider porting the aarch64 assembly to arm32; if it's not 
register starved, it's usually quite straightforward to do such rewrites 
(there's either half the number of SIMD registers compared to aarch64, or 
the same number but half the length).


The reason I'm asking about arm32 is that ffmpeg has a bunch of users who 
have spent a fair amount of effort on reaching specific performance levels 
for some codecs, both for the raspberry pi 1 (which doesn't have neon, only 
vfp) and for the newer ones with neon. I don't remember exactly which codecs 
are relevant for these users - I doubt opus is, but ac3 and dca are, iirc.

I'm CCing Ben Avison who has contributed a lot of optimizations in this 
area.

// Martin
Lynne Sept. 25, 2022, 8:45 p.m. UTC | #9
Sep 25, 2022, 21:55 by martin@martin.st:

> On Sat, 24 Sep 2022, Lynne wrote:
>
>>> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>>>
>>
>> Haven't touched them, they're still being used. Unfortunately, for AC3,
>> the full MDCT optimizations in lavc do make a difference and the overall
>> decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's
>> asm disabled and 7% slower with lavu/tx's asm enabled.
>>
>
> Hmm, that's a shame...
>
>> I do plan to write an aarch64 MDCT NEON SIMD code in a month or so, unless someone is faster, which should make the decoder at least 10% faster with lavu/tx.
>>
>
> Would you consider holding off of converting the ac3 decoder until this point, to avoid unnecessary temporary performance regressions at least for the architectures that are covered by the new lavu/tx framework?
>
>> If you'd like to help out, I've documented the C factorizations used in
>> docs/transforms.md.
>>
>
> Sorry, I don't think I have time at the moment to take on writing new code from scratch for this...
>
> I could maybe consider porting the aarch64 assembly to arm32; if it's not register starved, it's usually quite straightforward to do such rewrites (there's either half the number of SIMD registers compared to aarch64, or the same number but half the length)
>

For the basis transforms (double 4, double 8 and 8, single 16), there's no starvation.
For the 32pt transform, it's a bit starved, but nothing you couldn't work out.
For 64pt and up, absolutely all registers are used, to the point of needing to
stash vector regs in GPRs. If all registers are written back to memory (no register
sharing between transform sizes), it becomes as starved as the 32pt.
It's easy to see where the starvation happens (only in the 32pt -> 64pt step) and
how to fix it, but it's still work to convert the code. Could you take a look at it
and see if you can spot anything that would make it difficult?


> The reason why I'm asking about arm32, is because ffmpeg has got a bunch of users who have spent a fair amount of effort on reaching specific performance levels for some codecs, both for raspberry pi 1 (which doesn't have neon but only vfp) and for the newer ones with neon. I don't remember exactly which codecs are relevant for these users - I doubt opus is, but ac3 and dca are, iirc.
>

We do maintain old versions for years after a release, we had a major bump
fairly recently, and 5.1 very recently. I think there's enough time to bring
the optimizations back up and make them faster still before users stuck on
those releases become too outdated - what do you think? Maybe someone who's
interested will notice and help out.


> I'm CCing Ben Avison who has contributed a lot of optimizations in this area.
>

Thanks.
Lynne Sept. 25, 2022, 9:08 p.m. UTC | #10
Sep 25, 2022, 14:34 by andreas.rheinhardt@outlook.com:

> Lynne:
>
>> Sep 24, 2022, 23:57 by dev@lynne.ee:
>>
>>> Sep 24, 2022, 21:40 by martin@martin.st:
>>>
>>>> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>>>>
>>> Haven't touched them, they're still being used. Unfortunately, for AC3,
>>> the full MDCT optimizations in lavc do make a difference and the overall
>>> decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's
>>> asm disabled and 7% slower with lavu/tx's asm enabled. I do plan to write
>>> an aarch64 MDCT NEON SIMD code in a month or so, unless someone is faster,
>>> which should make the decoder at least 10% faster with lavu/tx.
>>>
>>
>> I'd just like to add this was for the float version of the ac3 decoder. The fixed-point
>> version is a few percent faster with the patch on an A53, and quite a bit
>> more accurate.
>> The lavc fixed-point FFT code also has some weird large spikes in #cycles
>> for some transform sizes, so the figure above is an average, but the dips
>> went from 117x realtime to 78x realtime, which on a slower CPU may
>> be the difference between stuttering and realtime playback.
>> On this CPU, the fixed-point version is 23% slower than the float version,
>> but on a CPU with slower float ops, it would make more sense to pick that
>> decoder up than the float version.
>> The 2 decoders produce nearly identical results, minus a few rounding
>> errors, since AC3 is inherently a fixed-point codec. The only difference
>> are the transforms themselves, and the extra ops needed to convert
>> the 25bit ints to floats in the float decoder.
>>
>
> 1. You forgot to remove mdct15 requirements from configure in this whole
> patchset.
> 2. You forgot to update the FATE references for several tests; e.g. when
> only applying the ac3 patch, then I get this:
>

I know. durandal pointed it out the day I sent them. I'll send them again
later.
I'm planning to just push the Opus patch in a day with the mdct15
line in configure gone.


> As the above shows, the difference between the reference files and the
> decoded output becomes larger in several tests, i.e. the reference files
> won't be usable lateron. If the new float and fixed-point decoders
> produce indeed produce nearly identical output, then one could write
> tests that decode the same file with both the floating point and the
> fixed point decoder, check that both are nearly identical and print a
> checksum of the output of the fixed point decoder.
>

I have a standalone program I've been hacking on as needed for the fixed-point
transforms: https://0x0.st/oWxO.c
The square root of the squared rounding error across the entire range
(1 to 21 bits) of transforms from 32pt to 1024pt is 6.855655 for lavu and
7.141428 for lavc, which is slightly worse. If you extend the range
to 22 bits, the 1024pt transform in lavc explodes, while lavu is still fine,
showing that it has greater range.
The rounding errors are a lesser problem than hitting the maximum range,
because that's when you get huge spikes in the output.
I can further reduce the error in lavu at the cost of speed, but I think
this is sufficient.
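
For context, a minimal sketch of the kind of error metric described above
(not the linked program; the name, the double-precision reference and the
scaling convention are assumptions):

    #include <math.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Accumulate the squared rounding error of a fixed-point transform's
     * output against a high-precision reference, then take the square root. */
    static double fixed_tx_error(const int32_t *fixed_out, const double *ref_out,
                                 size_t len, double fixed_scale)
    {
        double err = 0.0;
        for (size_t i = 0; i < len; i++) {
            double d = (double)fixed_out[i] / fixed_scale - ref_out[i];
            err += d * d;
        }
        return sqrt(err);
    }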


> Also note that there is currently no test that directly verifies your
> claims of greater accuracy. One could write such a test by encoding a
> file with ac3-fixed and decoding it again (with the fixed point decoder)
> and printing the psnr of input and output. No encoding tests does this
> at the moment.
>

I'm not writing that, but I like the idea. The point of fixed-point decoders
isn't bitexactness, but speed on slow hardware, so we shouldn't be testing
an MD5.
Andreas Rheinhardt Sept. 25, 2022, 9:17 p.m. UTC | #11
Lynne:
> Sep 25, 2022, 14:34 by andreas.rheinhardt@outlook.com:
> 
>> Lynne:
>>
>>> Sep 24, 2022, 23:57 by dev@lynne.ee:
>>>
>>>> Sep 24, 2022, 21:40 by martin@martin.st:
>>>>
>>>>> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>>>>>
>>>> Haven't touched them, they're still being used. Unfortunately, for AC3,
>>>> the full MDCT optimizations in lavc do make a difference and the overall
>>>> decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's
>>>> asm disabled and 7% slower with lavu/tx's asm enabled. I do plan to write
>>>> an aarch64 MDCT NEON SIMD code in a month or so, unless someone is faster,
>>>> which should make the decoder at least 10% faster with lavu/tx.
>>>>
>>>
>>> I'd just like to add this was for the float version of the ac3 decoder. The fixed-point
>>> version is a few percent faster with the patch on an A53, and quite a bit
>>> more accurate.
>>> The lavc fixed-point FFT code also has some weird large spikes in #cycles
>>> for some transform sizes, so the figure above is an average, but the dips
>>> went from 117x realtime to 78x realtime, which on a slower CPU may
>>> be the difference between stuttering and realtime playback.
>>> On this CPU, the fixed-point version is 23% slower than the float version,
>>> but on a CPU with slower float ops, it would make more sense to pick that
>>> decoder up than the float version.
>>> The 2 decoders produce nearly identical results, minus a few rounding
>>> errors, since AC3 is inherently a fixed-point codec. The only difference
>>> are the transforms themselves, and the extra ops needed to convert
>>> the 25bit ints to floats in the float decoder.
>>>
>>
>> 1. You forgot to remove mdct15 requirements from configure in this whole
>> patchset.
>> 2. You forgot to update the FATE references for several tests; e.g. when
>> only applying the ac3 patch, then I get this:
>>
> 
> I know. durandal pointed it out the day I sent them. I'll send them again
> later.
> I'm planning to just push the Opus patch in a day with the mdct15
> line in configure gone.
> 
> 
>> As the above shows, the difference between the reference files and the
>> decoded output becomes larger in several tests, i.e. the reference files
>> won't be usable lateron. If the new float and fixed-point decoders
>> produce indeed produce nearly identical output, then one could write
>> tests that decode the same file with both the floating point and the
>> fixed point decoder, check that both are nearly identical and print a
>> checksum of the output of the fixed point decoder.
>>
> 
> I have a standalone program I've hacked on as I need to for the fixed-point
> transforms: https://0x0.st/oWxO.c
> The square root of the squared rounding error across the entire range
> (1 to 21 bits) of transforms from 32pt to 1024pt is 6.855655 for lavu and
> 7.141428 for lavc, which is slightly worse. If you extend the range
> to 22bits, the 1024pt transform in lavc explodes, while lavu is still fine,
> thus showing a greater range.
> The rounding errors are a lesser problem than hitting the max range,
> because then you get huge spikes in the output.
> I can further reduce the error in lavu at the cost of speed, but I think
> this is sufficient.
> 
> 
>> Also note that there is currently no test that directly verifies your
>> claims of greater accuracy. One could write such a test by encoding a
>> file with ac3-fixed and decoding it again (with the fixed point decoder)
>> and printing the psnr of input and output. No encoding tests does this
>> at the moment.
>>
> 
> I'm not writing that, but I like the idea, the point of fixed-point decoders
> isn't bitexactness, but speed on slow hardware, so we shouldn't be testing
> an MD5.

Are your fixed-point transforms bitexact across all arches/cpuflags?

- Andreas
Lynne Sept. 25, 2022, 9:46 p.m. UTC | #12
Sep 25, 2022, 23:17 by andreas.rheinhardt@outlook.com:

> Lynne:
>
>> Sep 25, 2022, 14:34 by andreas.rheinhardt@outlook.com:
>>
>>> Lynne:
>>>
>>>> Sep 24, 2022, 23:57 by dev@lynne.ee:
>>>>
>>>>> Sep 24, 2022, 21:40 by martin@martin.st:
>>>>>
>>>>>> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>>>>>>
>>>>> Haven't touched them, they're still being used. Unfortunately, for AC3,
>>>>> the full MDCT optimizations in lavc do make a difference and the overall
>>>>> decoder becomes 15% slower with this patch on for aarch64 with lavu/tx's
>>>>> asm disabled and 7% slower with lavu/tx's asm enabled. I do plan to write
>>>>> an aarch64 MDCT NEON SIMD code in a month or so, unless someone is faster,
>>>>> which should make the decoder at least 10% faster with lavu/tx.
>>>>>
>>>>
>>>> I'd just like to add this was for the float version of the ac3 decoder. The fixed-point
>>>> version is a few percent faster with the patch on an A53, and quite a bit
>>>> more accurate.
>>>> The lavc fixed-point FFT code also has some weird large spikes in #cycles
>>>> for some transform sizes, so the figure above is an average, but the dips
>>>> went from 117x realtime to 78x realtime, which on a slower CPU may
>>>> be the difference between stuttering and realtime playback.
>>>> On this CPU, the fixed-point version is 23% slower than the float version,
>>>> but on a CPU with slower float ops, it would make more sense to pick that
>>>> decoder up than the float version.
>>>> The 2 decoders produce nearly identical results, minus a few rounding
>>>> errors, since AC3 is inherently a fixed-point codec. The only difference
>>>> are the transforms themselves, and the extra ops needed to convert
>>>> the 25bit ints to floats in the float decoder.
>>>>
>>>
>>> 1. You forgot to remove mdct15 requirements from configure in this whole
>>> patchset.
>>> 2. You forgot to update the FATE references for several tests; e.g. when
>>> only applying the ac3 patch, then I get this:
>>>
>>
>> I know. durandal pointed it out the day I sent them. I'll send them again
>> later.
>> I'm planning to just push the Opus patch in a day with the mdct15
>> line in configure gone.
>>
>>
>>> As the above shows, the difference between the reference files and the
>>> decoded output becomes larger in several tests, i.e. the reference files
>>> won't be usable lateron. If the new float and fixed-point decoders
>>> produce indeed produce nearly identical output, then one could write
>>> tests that decode the same file with both the floating point and the
>>> fixed point decoder, check that both are nearly identical and print a
>>> checksum of the output of the fixed point decoder.
>>>
>>
>> I have a standalone program I've hacked on as I need to for the fixed-point
>> transforms: https://0x0.st/oWxO.c
>> The square root of the squared rounding error across the entire range
>> (1 to 21 bits) of transforms from 32pt to 1024pt is 6.855655 for lavu and
>> 7.141428 for lavc, which is slightly worse. If you extend the range
>> to 22bits, the 1024pt transform in lavc explodes, while lavu is still fine,
>> thus showing a greater range.
>> The rounding errors are a lesser problem than hitting the max range,
>> because then you get huge spikes in the output.
>> I can further reduce the error in lavu at the cost of speed, but I think
>> this is sufficient.
>>
>>
>>> Also note that there is currently no test that directly verifies your
>>> claims of greater accuracy. One could write such a test by encoding a
>>> file with ac3-fixed and decoding it again (with the fixed point decoder)
>>> and printing the psnr of input and output. No encoding tests does this
>>> at the moment.
>>>
>>
>> I'm not writing that, but I like the idea, the point of fixed-point decoders
>> isn't bitexactness, but speed on slow hardware, so we shouldn't be testing
>> an MD5.
>>
>
> Are your fixed-point transforms bitexact across all arches/cpuflags?
>

As much as libavcodec's. This is because we use a float value for the MDCT scale,
and we calculate the exptabs and FFT tables with floats before converting
them to ints during init. If issues arise, we could special-case them, though as
libavcodec's hasn't needed that, lavu shouldn't need it either.
Since the FFT tables are always constant, they would benefit from hardcoding,
as it would take any local machine precision out of the equation. The actual
constants are quantized versions of the computed floats, which also leaves a
fair amount of leeway.
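
As a purely illustrative sketch of the init-time quantization described above
(the function name, table layout and Q31-style scaling are assumptions, not
lavu/tx's actual internals):

    #include <math.h>
    #include <stdint.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Twiddles are computed in floating point at init and then quantized to
     * int32; the rounding here is the only place host FP precision can leak
     * in, which hardcoded tables would avoid. */
    static void gen_int32_exptab(int32_t *tab, int n)
    {
        const double scale = 2147483647.0; /* Q0.31-style range */
        for (int i = 0; i < n; i++) {
            double phase = 2.0 * M_PI * (double)i / (double)n;
            tab[2 * i + 0] = (int32_t)llrint(cos(phase) * scale);
            tab[2 * i + 1] = (int32_t)llrint(sin(phase) * scale);
        }
    }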

Patch

From d4fdda5b57ab1e0f08eb3d78dac6b003060dfd41 Mon Sep 17 00:00:00 2001
From: Lynne <dev@lynne.ee>
Date: Sat, 24 Sep 2022 00:46:44 +0200
Subject: [PATCH 1/6] opus: convert encoder and decoder to lavu/tx

This commit changes both the encoder and decoder to use the new lavu/tx code,
which has faster C transforms and more assembly optimizations.
---
 libavcodec/opus_celt.c   | 20 ++++++++++++--------
 libavcodec/opus_celt.h   |  5 +++--
 libavcodec/opusenc.c     | 15 +++++++++------
 libavcodec/opusenc_psy.c | 13 ++++++++-----
 libavcodec/opusenc_psy.h |  4 +++-
 5 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/libavcodec/opus_celt.c b/libavcodec/opus_celt.c
index 9dbeff1927..f1fb88a56d 100644
--- a/libavcodec/opus_celt.c
+++ b/libavcodec/opus_celt.c
@@ -323,7 +323,8 @@  int ff_celt_decode_frame(CeltFrame *f, OpusRangeCoder *rc,
 {
     int i, j, downmix = 0;
     int consumed;           // bits of entropy consumed thus far for this frame
-    MDCT15Context *imdct;
+    AVTXContext *imdct;
+    av_tx_fn imdct_fn;
 
     if (channels != 1 && channels != 2) {
         av_log(f->avctx, AV_LOG_ERROR, "Invalid number of coded channels: %d\n",
@@ -385,7 +386,8 @@  int ff_celt_decode_frame(CeltFrame *f, OpusRangeCoder *rc,
     f->blocks    = f->transient ? 1 << f->size : 1;
     f->blocksize = frame_size / f->blocks;
 
-    imdct = f->imdct[f->transient ? 0 : f->size];
+    imdct = f->tx[f->transient ? 0 : f->size];
+    imdct_fn = f->tx_fn[f->transient ? 0 : f->size];
 
     if (channels == 1) {
         for (i = 0; i < CELT_MAX_BANDS; i++)
@@ -440,8 +442,8 @@  int ff_celt_decode_frame(CeltFrame *f, OpusRangeCoder *rc,
         for (j = 0; j < f->blocks; j++) {
             float *dst  = block->buf + 1024 + j * f->blocksize;
 
-            imdct->imdct_half(imdct, dst + CELT_OVERLAP / 2, f->block[i].coeffs + j,
-                              f->blocks);
+            imdct_fn(imdct, dst + CELT_OVERLAP / 2, f->block[i].coeffs + j,
+                     sizeof(float)*f->blocks);
             f->dsp->vector_fmul_window(dst, dst, dst + CELT_OVERLAP / 2,
                                        ff_celt_window, CELT_OVERLAP / 2);
         }
@@ -526,8 +528,8 @@  void ff_celt_free(CeltFrame **f)
     if (!frm)
         return;
 
-    for (i = 0; i < FF_ARRAY_ELEMS(frm->imdct); i++)
-        ff_mdct15_uninit(&frm->imdct[i]);
+    for (i = 0; i < FF_ARRAY_ELEMS(frm->tx); i++)
+        av_tx_uninit(&frm->tx[i]);
 
     ff_celt_pvq_uninit(&frm->pvq);
 
@@ -555,9 +557,11 @@  int ff_celt_init(AVCodecContext *avctx, CeltFrame **f, int output_channels,
     frm->output_channels = output_channels;
     frm->apply_phase_inv = apply_phase_inv;
 
-    for (i = 0; i < FF_ARRAY_ELEMS(frm->imdct); i++)
-        if ((ret = ff_mdct15_init(&frm->imdct[i], 1, i + 3, -1.0f/32768)) < 0)
+    for (i = 0; i < FF_ARRAY_ELEMS(frm->tx); i++) {
+        const float scale = -1.0f/32768;
+        if ((ret = av_tx_init(&frm->tx[i], &frm->tx_fn[i], AV_TX_FLOAT_MDCT, 1, 15 << (i + 3), &scale, 0)) < 0)
             goto fail;
+    }
 
     if ((ret = ff_celt_pvq_init(&frm->pvq, 0)) < 0)
         goto fail;
diff --git a/libavcodec/opus_celt.h b/libavcodec/opus_celt.h
index 661ca251de..291a544298 100644
--- a/libavcodec/opus_celt.h
+++ b/libavcodec/opus_celt.h
@@ -30,10 +30,10 @@ 
 #include "opus_pvq.h"
 #include "opusdsp.h"
 
-#include "mdct15.h"
 #include "libavutil/float_dsp.h"
 #include "libavutil/libm.h"
 #include "libavutil/mem_internal.h"
+#include "libavutil/tx.h"
 
 #define CELT_VECTORS                 11
 #define CELT_ALLOC_STEPS             6
@@ -93,7 +93,8 @@  typedef struct CeltBlock {
 struct CeltFrame {
     // constant values that do not change during context lifetime
     AVCodecContext      *avctx;
-    MDCT15Context       *imdct[4];
+    AVTXContext        *tx[4];
+    av_tx_fn            tx_fn[4];
     AVFloatDSPContext   *dsp;
     CeltBlock           block[2];
     CeltPVQ             *pvq;
diff --git a/libavcodec/opusenc.c b/libavcodec/opusenc.c
index a7a9d3a5f5..8cdd27d930 100644
--- a/libavcodec/opusenc.c
+++ b/libavcodec/opusenc.c
@@ -40,7 +40,8 @@  typedef struct OpusEncContext {
     AVCodecContext *avctx;
     AudioFrameQueue afq;
     AVFloatDSPContext *dsp;
-    MDCT15Context *mdct[CELT_BLOCK_NB];
+    AVTXContext *tx[CELT_BLOCK_NB];
+    av_tx_fn tx_fn[CELT_BLOCK_NB];
     CeltPVQ *pvq;
     struct FFBufQueue bufqueue;
 
@@ -204,7 +205,7 @@  static void celt_frame_mdct(OpusEncContext *s, CeltFrame *f)
                 s->dsp->vector_fmul_reverse(&win[CELT_OVERLAP], src2,
                                             ff_celt_window - 8, 128);
                 src1 = src2;
-                s->mdct[0]->mdct(s->mdct[0], b->coeffs + t, win, f->blocks);
+                s->tx_fn[0](s->tx[0], b->coeffs + t, win, sizeof(float)*f->blocks);
             }
         }
     } else {
@@ -226,7 +227,7 @@  static void celt_frame_mdct(OpusEncContext *s, CeltFrame *f)
                                         ff_celt_window - 8, 128);
             memcpy(win + lap_dst + blk_len, temp, CELT_OVERLAP*sizeof(float));
 
-            s->mdct[f->size]->mdct(s->mdct[f->size], b->coeffs, win, 1);
+            s->tx_fn[f->size](s->tx[f->size], b->coeffs, win, sizeof(float));
         }
     }
 
@@ -612,7 +613,7 @@  static av_cold int opus_encode_end(AVCodecContext *avctx)
     OpusEncContext *s = avctx->priv_data;
 
     for (int i = 0; i < CELT_BLOCK_NB; i++)
-        ff_mdct15_uninit(&s->mdct[i]);
+        av_tx_uninit(&s->tx[i]);
 
     ff_celt_pvq_uninit(&s->pvq);
     av_freep(&s->dsp);
@@ -668,9 +669,11 @@  static av_cold int opus_encode_init(AVCodecContext *avctx)
         return AVERROR(ENOMEM);
 
     /* I have no idea why a base scaling factor of 68 works, could be the twiddles */
-    for (int i = 0; i < CELT_BLOCK_NB; i++)
-        if ((ret = ff_mdct15_init(&s->mdct[i], 0, i + 3, 68 << (CELT_BLOCK_NB - 1 - i))))
+    for (int i = 0; i < CELT_BLOCK_NB; i++) {
+        const float scale = 68 << (CELT_BLOCK_NB - 1 - i);
+        if ((ret = av_tx_init(&s->tx[i], &s->tx_fn[i], AV_TX_FLOAT_MDCT, 0, 15 << (i + 3), &scale, 0)))
             return AVERROR(ENOMEM);
+    }
 
     /* Zero out previous energy (matters for inter first frame) */
     for (int ch = 0; ch < s->channels; ch++)
diff --git a/libavcodec/opusenc_psy.c b/libavcodec/opusenc_psy.c
index 1c8f69269c..3bff57d347 100644
--- a/libavcodec/opusenc_psy.c
+++ b/libavcodec/opusenc_psy.c
@@ -22,7 +22,6 @@ 
 #include "opusenc_psy.h"
 #include "opus_pvq.h"
 #include "opustab.h"
-#include "mdct15.h"
 #include "libavutil/qsort.h"
 
 static float pvq_band_cost(CeltPVQ *pvq, CeltFrame *f, OpusRangeCoder *rc, int band,
@@ -99,7 +98,8 @@  static void step_collect_psy_metrics(OpusPsyContext *s, int index)
         s->dsp->vector_fmul(s->scratch, s->scratch, s->window[s->bsize_analysis],
                             (OPUS_BLOCK_SIZE(s->bsize_analysis) << 1));
 
-        s->mdct[s->bsize_analysis]->mdct(s->mdct[s->bsize_analysis], st->coeffs[ch], s->scratch, 1);
+        s->mdct_fn[s->bsize_analysis](s->mdct[s->bsize_analysis], st->coeffs[ch],
+                                      s->scratch, sizeof(float));
 
         for (i = 0; i < CELT_MAX_BANDS; i++)
             st->bands[ch][i] = &st->coeffs[ch][ff_celt_freq_bands[i] << s->bsize_analysis];
@@ -558,13 +558,16 @@  av_cold int ff_opus_psy_init(OpusPsyContext *s, AVCodecContext *avctx,
     for (i = 0; i < CELT_BLOCK_NB; i++) {
         float tmp;
         const int len = OPUS_BLOCK_SIZE(i);
+        const float scale = 68 << (CELT_BLOCK_NB - 1 - i);
         s->window[i] = av_malloc(2*len*sizeof(float));
         if (!s->window[i]) {
             ret = AVERROR(ENOMEM);
             goto fail;
         }
         generate_window_func(s->window[i], 2*len, WFUNC_SINE, &tmp);
-        if ((ret = ff_mdct15_init(&s->mdct[i], 0, i + 3, 68 << (CELT_BLOCK_NB - 1 - i))))
+        ret = av_tx_init(&s->mdct[i], &s->mdct_fn[i], AV_TX_FLOAT_MDCT,
+                         0, 15 << (i + 3), &scale, 0);
+        if (ret < 0)
             goto fail;
     }
 
@@ -575,7 +578,7 @@  fail:
     av_freep(&s->dsp);
 
     for (i = 0; i < CELT_BLOCK_NB; i++) {
-        ff_mdct15_uninit(&s->mdct[i]);
+        av_tx_uninit(&s->mdct[i]);
         av_freep(&s->window[i]);
     }
 
@@ -598,7 +601,7 @@  av_cold int ff_opus_psy_end(OpusPsyContext *s)
     av_freep(&s->dsp);
 
     for (i = 0; i < CELT_BLOCK_NB; i++) {
-        ff_mdct15_uninit(&s->mdct[i]);
+        av_tx_uninit(&s->mdct[i]);
         av_freep(&s->window[i]);
     }
 
diff --git a/libavcodec/opusenc_psy.h b/libavcodec/opusenc_psy.h
index d4fb096a3d..0a7cdb6f2c 100644
--- a/libavcodec/opusenc_psy.h
+++ b/libavcodec/opusenc_psy.h
@@ -22,6 +22,7 @@ 
 #ifndef AVCODEC_OPUSENC_PSY_H
 #define AVCODEC_OPUSENC_PSY_H
 
+#include "libavutil/tx.h"
 #include "libavutil/mem_internal.h"
 
 #include "opusenc.h"
@@ -70,7 +71,8 @@  typedef struct OpusPsyContext {
     int max_steps;
 
     float *window[CELT_BLOCK_NB];
-    MDCT15Context *mdct[CELT_BLOCK_NB];
+    AVTXContext *mdct[CELT_BLOCK_NB];
+    av_tx_fn mdct_fn[CELT_BLOCK_NB];
     int bsize_analysis;
 
     DECLARE_ALIGNED(32, float, scratch)[2048];
-- 
2.37.2.609.g9ff673ca1a