mbox series

[FFmpeg-devel,0/6] avcodec/vc1: Arm optimisations

Message ID 20220317185819.466470-1-bavison@riscosopen.org
Headers show
Series avcodec/vc1: Arm optimisations | expand

Message

Ben Avison March 17, 2022, 6:58 p.m. UTC
The VC1 decoder was missing lots of important fast paths for Arm, especially
for 64-bit Arm. This submission fills in implementations for all functions
where a fast path already existed and the fallback C implementation was
taking 1% or more of the runtime, and adds a new fast path to permit
vc1_unescape_buffer() to be overridden.

I've measured the playback speed on a 1.5 GHz Cortex-A72 (Raspberry Pi 4)
using `ffmpeg -i <bitstream> -f null -` for a couple of example streams:

Architecture:  AArch32    AArch32    AArch64    AArch64
Stream:        1          2          1          2
Before speed:  1.22x      0.82x      1.00x      0.67x
After speed:   1.31x      0.98x      1.39x      1.06x
Improvement:   7.4%       20%        39%        58%

`make fate` passes on both AArch32 and AArch64.

Ben Avison (6):
  avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths
  avcodec/vc1: Arm 32-bit NEON deblocking filter fast paths
  avcodec/vc1: Arm 64-bit NEON inverse transform fast paths
  avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths
  avcodec/blockdsp: Arm 64-bit NEON block clear fast paths
  avcodec/vc1: Introduce fast path for unescaping bitstream buffer

 libavcodec/aarch64/Makefile                |    6 +-
 libavcodec/aarch64/blockdsp_init_aarch64.c |   42 +
 libavcodec/aarch64/blockdsp_neon.S         |   43 +
 libavcodec/aarch64/idctdsp_init_aarch64.c  |   26 +-
 libavcodec/aarch64/idctdsp_neon.S          |  130 ++
 libavcodec/aarch64/vc1dsp_init_aarch64.c   |   93 ++
 libavcodec/aarch64/vc1dsp_neon.S           | 1552 ++++++++++++++++++++
 libavcodec/arm/vc1dsp_init_neon.c          |   74 +
 libavcodec/arm/vc1dsp_neon.S               |  761 ++++++++++
 libavcodec/blockdsp.c                      |    2 +
 libavcodec/blockdsp.h                      |    1 +
 libavcodec/vc1dec.c                        |   20 +-
 libavcodec/vc1dsp.c                        |    2 +
 libavcodec/vc1dsp.h                        |    3 +
 14 files changed, 2736 insertions(+), 19 deletions(-)
 create mode 100644 libavcodec/aarch64/blockdsp_init_aarch64.c
 create mode 100644 libavcodec/aarch64/blockdsp_neon.S
 create mode 100644 libavcodec/aarch64/idctdsp_neon.S
 create mode 100644 libavcodec/aarch64/vc1dsp_neon.S

Comments

Martin Storsjö March 19, 2022, 11:06 p.m. UTC | #1
Hi Ben,

On Thu, 17 Mar 2022, Ben Avison wrote:

> The VC1 decoder was missing lots of important fast paths for Arm, especially
> for 64-bit Arm. This submission fills in implementations for all functions
> where a fast path already existed and the fallback C implementation was
> taking 1% or more of the runtime, and adds a new fast path to permit
> vc1_unescape_buffer() to be overridden.
>
> I've measured the playback speed on a 1.5 GHz Cortex-A72 (Raspberry Pi 4)
> using `ffmpeg -i <bitstream> -f null -` for a couple of example streams:
>
> Architecture:  AArch32    AArch32    AArch64    AArch64
> Stream:        1          2          1          2
> Before speed:  1.22x      0.82x      1.00x      0.67x
> After speed:   1.31x      0.98x      1.39x      1.06x
> Improvement:   7.4%       20%        39%        58%
>
> `make fate` passes on both AArch32 and AArch64.

Thanks for the patches! I have looked briefly at the patches (I haven't 
started reading the implementation in detail yet though).

As you are writing assembly for these functions, I would very much 
appreciate if you could add checkasm tests for all the functions you're 
implementing. I see that there exists a test for the blockdsp functions, 
but all the other ones are missing a test.

I try to request such tests for all new assembly. Such a test allows 
testing all interesting cornercases of the DSP functions with one concise 
test, instead of having to run the full fate testsuite. It also allows 
catching a number of other possible lingering issues, like using the full 
64 bit register when the argument only set 32 bits where the upper bits 
are undefined, or missing to restore callee saved registers, etc. It also 
allows for easy benchmarking of the functions on their own, which is very 
useful for tuning of the implementation. And it finally allows easily 
checking that the assembly works correctly when built with a different 
toolchain for a different platform - without needing to run the full 
decoding tests.

Especially as you've been implementing the functions, you're probably more 
familiar with the expectations and behaviours (and potential cornercases 
that are worth testing) of the functions than most other developers in the 
community at the moment, which is good for writing useful testcases.

There's plenty of existing examples of such tests - the h264dsp, vp8dsp 
and vp9dsp cases might be relevant.


The other main issue I'd like to request is to indent the assembly 
similarly to the rest of the existing assembly. For the 32 bit assembly, 
your patches do match the surrounding code, but for the 64 bit assembly, 
your patches align the operands column differently than the rest. (I think 
your code aligns the operands with 16 chars to the left of the operands, 
while our code aligns it with 24 chars to the left, both in 32 and 64 bit 
arm assembly.)


Finally, the 32 bit assembly fails to build for me both with (recent) 
clang and old binutils, with errors like these:

src/libavcodec/arm/vc1dsp_neon.S: Assembler messages:
src/libavcodec/arm/vc1dsp_neon.S:1579: Error: bad type for scalar -- `vmov r0,d4[1]'
src/libavcodec/arm/vc1dsp_neon.S:1582: Error: bad type for scalar -- `vmov r2,d5[1]'
src/libavcodec/arm/vc1dsp_neon.S:1592: Error: bad type for scalar -- `vmov r2,d8[1]'
src/libavcodec/arm/vc1dsp_neon.S:1595: Error: bad type for scalar -- `vmov r12,d9[1]'

Qualifying the "vmov" into "vmov.32" seems to fix it.

// Martin
Martin Storsjö March 19, 2022, 11:07 p.m. UTC | #2
On Sun, 20 Mar 2022, Martin Storsjö wrote:

> The other main issue I'd like to request is to indent the assembly similarly 
> to the rest of the existing assembly. For the 32 bit assembly, your patches 
> do match the surrounding code, but for the 64 bit assembly, your patches 
> align the operands column differently than the rest. (I think your code 
> aligns the operands with 16 chars to the left of the operands, while our code 
> aligns it with 24 chars to the left, both in 32 and 64 bit arm assembly.)

Oh, sidenote - I do see that the last patch in the set uses much more 
inconsistent indentation, with varying indentation between lines. Is this 
intentional to signify some structure in the code, or just accidental? I 
think it'd be preferrable to have it use normal straight indentation all 
the way throughout.

// Martin
Ben Avison March 21, 2022, 5:37 p.m. UTC | #3
Hi Martin,

Thanks very much for taking a look.

On 19/03/2022 23:06, Martin Storsjö wrote:
> As you are writing assembly for these functions, I would very much 
> appreciate if you could add checkasm tests for all the functions you're 
> implementing. I see that there exists a test for the blockdsp functions, 
> but all the other ones are missing a test.

I think I'd have a bit of a learning curve ahead of me there! I did 
write my own fuzz testers to check the validity of my assembly 
implementations, and I could share them (they'd probably need a bit of 
tidying up since I wasn't intending them for public consumption) but 
they were written in ignorance of the checkasm framework, so probably 
wouldn't slot in neatly.

Is there any writeup of checkasm anywhere, discussing how it's used, 
what sorts of things it tests, any speed/memory limits that tests should 
try to adhere to - that sort of thing?

> The other main issue I'd like to request is to indent the assembly 
> similarly to the rest of the existing assembly. For the 32 bit assembly, 
> your patches do match the surrounding code, but for the 64 bit assembly, 
> your patches align the operands column differently than the rest.

Since I was creating new source files for the 64-bit stuff, I assumed I 
had a bit of leeway in indentation style - but I can easily change it.

For what it's worth, the opcodes in AArch64 are significantly shorter 
than in AArch32, since the vector element size qualifiers go on the 
operands instead of the opcodes, so there's less need for extra indentation.

> Finally, the 32 bit assembly fails to build for me both with (recent) 
> clang and old binutils, with errors like these:
> 
> src/libavcodec/arm/vc1dsp_neon.S: Assembler messages:
> src/libavcodec/arm/vc1dsp_neon.S:1579: Error: bad type for scalar -- 
> `vmov r0,d4[1]'

Thanks - the Armv8-A ARM says (section F6.1.139) that the data type can 
be omitted here, and in that case it is equivalent to '32', so that's a 
bug in clang. But easy to work around.

> Oh, sidenote - I do see that the last patch in the set uses much more
> inconsistent indentation, with varying indentation between lines. Is
> this intentional to signify some structure in the code, or just
> accidental?

That was deliberate! The inner loop there is unrolled x2, and then 
adjacent iterations are overlapped 180 degrees out of phase. This is 
because each iteration starts off busy, with lots of instructions to 
execute, keeping pipelines full, and towards the end, it thins out, 
meaning we can benefit by using what would otherwise be stalls to 
speculatively start to process the next iteration before we've completed 
the current one.

Effectively, if you only read a series of instructions with matching 
indentation, you get one logical iteration of the loop - for example, in 
the AArch32 version, you can follow through the process from loading the 
source buffer into q10 (line 1849) until we store from it to the 
destination buffer, having determined that it doesn't contain the start 
of any escape sequences (line 1890).

It's a trick I've seen used a few times elsewhere, which is why I didn't 
bother explaining it in a comment. I could add one, or if you still 
don't like it once you've understood what it means, I'd be happy to take 
it out.

Ben
Martin Storsjö March 21, 2022, 10:29 p.m. UTC | #4
On Mon, 21 Mar 2022, Ben Avison wrote:

>
> On 19/03/2022 23:06, Martin Storsjö wrote:
>> As you are writing assembly for these functions, I would very much 
>> appreciate if you could add checkasm tests for all the functions you're 
>> implementing. I see that there exists a test for the blockdsp functions, 
>> but all the other ones are missing a test.
>
> I think I'd have a bit of a learning curve ahead of me there! I did write my 
> own fuzz testers to check the validity of my assembly implementations, and I 
> could share them (they'd probably need a bit of tidying up since I wasn't 
> intending them for public consumption) but they were written in ignorance of 
> the checkasm framework, so probably wouldn't slot in neatly.
>
> Is there any writeup of checkasm anywhere, discussing how it's used, what 
> sorts of things it tests, any speed/memory limits that tests should try to 
> adhere to - that sort of thing?

I'm not aware of any guide in itself, but I can try to do a short writeup 
here.

Checkasm is essentially a unit test framework for assembly functions. A 
test runs two versions of the same function, one reference (e.g. C 
implementation) and one test version (e.g. NEON implementation) against 
randomish input data, and compares the output to make sure that it matches 
(for the relevant part of the output buffer).

For each test, the main point lies in knowing what the expected/valid 
ranges of inputs are, so that you pick random inputs that are valid - and 
test specifically the e.g. buffer sizes that are used in practice by the 
decoder.

If a function usually has e.g. different codepaths internally, depending 
on some input values, you can make the test exercise all the different 
codepaths. Or if the function has potential overflows in some case, you 
can make part of the input buffer always contain the worst-case scenario, 
so that each run always tests for the overflows, even if the rest of the 
buffers are random data.

Another aspect of tests is what part of the output to check. E.g. 
functions in video codecs often write a rectangular block in a buffer. At 
the very least, a test can check that the contents within the expected 
region of the output buffer matches. But tests can also (optionally) take 
it one step further, and check that e.g. the function didn't accidentally 
write outside of the edges of the indended rectangle in the output buffer. 
Or in some cases it's expected that a function may overwrite e.g. up to 16 
bytes past the end of the payload in each row - then you intentionally 
wouldn't check that area.

Additionally, after comparing the tested version with the reference, 
checkasm can also optionally benchmark your functions. This runs the 
benchmark specifically of only the assembly function, nothing else, by 
running the function with the same input parameters e.g. 1000 times. It 
also measures the C version of the function, so that you can compare 
against that and see the speedup of your assembly work, in isolation.

(On linux on ARM, it by default uses the perf timers. If you have enabled 
user mode access to the cycle counter registers, which I highly recommend, 
and configure with --disable-linux-perf, you get much more precise timing 
- to the point that you can measure the impact of different instruction 
scheduling setups on in-order cores, like the Cortex A53.)

Additionally, for the testing (but not for benchmarking), the functions 
get wrapped with extra setup to try to find lingering nonfunctional issues 
that don't show up when you just run the code.

E.g. one common isssue is with functions that take a 32 bit value as 
argument. On a 64 bit architecture with arguments in registers, the upper 
32 bits of such a register are undefined - while in practice they're often 
zero. This can lead to issues later down the line, when e.g. a different 
or updated compiler suddenly happens to pass nonzero bits in the undefined 
part. The test wrapping tries to arrange so that these bits end up as 
nonzero, to catch such hidden bugs.

Additionally, it checks to make sure you've restored all callee saved 
registers. In most cases, if you happen to forget to restore e.g. a callee 
saved SIMD register, the effects of it normally don't show up soon (or 
at all), but may only show up much later depending on what the compiler 
did in a calling function. But all functions covered in checkasm get this 
checked for free.


After building checkasm, if you run e.g. ./tests/checkasm/checkasm, it 
runs all the tests for all functions, for all SIMD instruction sets 
available. (On ARM there's usually only NEON, but e.g. on X86 it first 
enables only MMX, tests all functions available there, then increasing 
levels with SSE2, SSSE3, etc, to test all potential implementations.) If 
you run e.g. checkasm --test=blockdsp or --test=h264dsp, it will only run 
the tests for that subsystem. If you further add --bench=h264_idct, it 
will benchmark all functions with a name starting with h264_idct.


One of the simplest tests to have a look at, to understand the structure, 
is tests/checkasm/blockdsp.c. First, the cpu feature mask is set so that 
av_get_cpu_flags() returns 0. Then the test main function, 
checkasm_check_blockdsp, is called, which initializes the DSP context 
(which then only gets assigned the reference C implementation of the 
function). In this case the test does nothing, as it compares the 
C implementation with itself. The next time around, av_get_cpu_flags 
returns NEON, and checkasm_check_blockdsp gets called again, where the DSP 
context now gets the NEON function assigned.

In blockdsp.c, the first call to check_func(h.func, 
"blockdsp.clear_block") stored a copy of the previous function pointer, 
the C reference function, in a map. On the second call to it, it digs up 
the previous version and keeps the new current version. These function 
pointers are used via the macros call_ref() and call_new(), with the same 
parameters as if you'd call the function directly. After running 
both, you inspect the output of them to see if they match, and if not 
you fail the test. Finally, the bench_new() macro checks if you've asked 
to benchmark this particular function. If this function is one of the 
functions to benchmark, it runs it N times with the provided parameters.

For your patches, the existing tests in h264dsp, vp8dsp, vp9dsp probably 
are good examples of such tests.

A small gotcha when/if you're adding a new checkasm test in a new file, 
under a new name. If you just run checkasm without parameters, it runs all 
the tests by default (as long as the test is hooked up in the main tests[] 
array in checkasm.c). But when running fate, it runs checkasm individually 
with one module at a time. So if adding a new test module in checkasm, be 
sure to add it to the test listing in tests/fate/checkasm.mak too.


The inverse transforms are tricky to test, because you probably can't feed 
them any random input data. The existing h264dsp, vp8dsp and vp9dsp 
inverse transform tests take random pixels and do a naive forward 
transform of them, so that you only get transform coefficients within the 
possible range.

For deblocking filters, those tests start with random-ish input data, but 
try to arrange coefficients in a way so that each block contains all 
possible combinations of data (above or below the threshold values). Or a 
test can run the functions multiple times, with input data arranged to 
trigger each special case.

For the unescape function, I'm not sure if we have any good examples of 
existing testcases that work on a similar function though. Try to come up 
with all cases of interesting input to the function (short buffers, long 
buffers, mod-4/non-mod-4 length, nothing to unescape, lots of things to 
unescape close together).


>> The other main issue I'd like to request is to indent the assembly 
>> similarly to the rest of the existing assembly. For the 32 bit assembly, 
>> your patches do match the surrounding code, but for the 64 bit assembly, 
>> your patches align the operands column differently than the rest.
>
> Since I was creating new source files for the 64-bit stuff, I assumed I had a 
> bit of leeway in indentation style - but I can easily change it.

Ok, thanks, that'd be appreciated. Yeah I try to maintain consistency 
across files here.

> For what it's worth, the opcodes in AArch64 are significantly shorter than in 
> AArch32, since the vector element size qualifiers go on the operands instead 
> of the opcodes, so there's less need for extra indentation.

Yup, that's true. But for functions where instruction-like macros are 
used, the macro names often are a bit longer than regular instructions, so 
there the extra space is appreciated. And consistency is still nice when 
both 32 and 64 bit arm use the same indentation style; in many cases, code 
is ported between the two by just copying and slightly adjusting/rewriting 
e.g. register names and tweaking instruction names.

>> Finally, the 32 bit assembly fails to build for me both with (recent) clang 
>> and old binutils, with errors like these:
>> 
>> src/libavcodec/arm/vc1dsp_neon.S: Assembler messages:
>> src/libavcodec/arm/vc1dsp_neon.S:1579: Error: bad type for scalar -- `vmov 
>> r0,d4[1]'
>
> Thanks - the Armv8-A ARM says (section F6.1.139) that the data type can be 
> omitted here, and in that case it is equivalent to '32', so that's a bug in 
> clang. But easy to work around.

Ok, good. Yeah, bugs or not, we try to stick with the subset of assembly 
that builds on all toolchains that regularly are used to building.

>> Oh, sidenote - I do see that the last patch in the set uses much more
>> inconsistent indentation, with varying indentation between lines. Is
>> this intentional to signify some structure in the code, or just
>> accidental?
>
> That was deliberate! The inner loop there is unrolled x2, and then adjacent 
> iterations are overlapped 180 degrees out of phase. This is because each 
> iteration starts off busy, with lots of instructions to execute, keeping 
> pipelines full, and towards the end, it thins out, meaning we can benefit by 
> using what would otherwise be stalls to speculatively start to process the 
> next iteration before we've completed the current one.
>
> Effectively, if you only read a series of instructions with matching 
> indentation, you get one logical iteration of the loop - for example, in the 
> AArch32 version, you can follow through the process from loading the source 
> buffer into q10 (line 1849) until we store from it to the destination buffer, 
> having determined that it doesn't contain the start of any escape sequences 
> (line 1890).
>
> It's a trick I've seen used a few times elsewhere, which is why I didn't 
> bother explaining it in a comment. I could add one, or if you still don't 
> like it once you've understood what it means, I'd be happy to take it out.

Right, I see. (I didn't try to read the code and follow it yet, I just 
browsed your patches and testbuilt them.) I think it can be valuable to 
keep this nonstandard indentation as a readability/maintainability aid 
then. (But do shift the operand column 8 chars to the right for the 64 bit 
version.)

// Martin