Message ID | 20220317185819.466470-1-bavison@riscosopen.org |
---|---|
Headers | show |
Series | avcodec/vc1: Arm optimisations | expand |
Hi Ben, On Thu, 17 Mar 2022, Ben Avison wrote: > The VC1 decoder was missing lots of important fast paths for Arm, especially > for 64-bit Arm. This submission fills in implementations for all functions > where a fast path already existed and the fallback C implementation was > taking 1% or more of the runtime, and adds a new fast path to permit > vc1_unescape_buffer() to be overridden. > > I've measured the playback speed on a 1.5 GHz Cortex-A72 (Raspberry Pi 4) > using `ffmpeg -i <bitstream> -f null -` for a couple of example streams: > > Architecture: AArch32 AArch32 AArch64 AArch64 > Stream: 1 2 1 2 > Before speed: 1.22x 0.82x 1.00x 0.67x > After speed: 1.31x 0.98x 1.39x 1.06x > Improvement: 7.4% 20% 39% 58% > > `make fate` passes on both AArch32 and AArch64. Thanks for the patches! I have looked briefly at the patches (I haven't started reading the implementation in detail yet though). As you are writing assembly for these functions, I would very much appreciate if you could add checkasm tests for all the functions you're implementing. I see that there exists a test for the blockdsp functions, but all the other ones are missing a test. I try to request such tests for all new assembly. Such a test allows testing all interesting cornercases of the DSP functions with one concise test, instead of having to run the full fate testsuite. It also allows catching a number of other possible lingering issues, like using the full 64 bit register when the argument only set 32 bits where the upper bits are undefined, or missing to restore callee saved registers, etc. It also allows for easy benchmarking of the functions on their own, which is very useful for tuning of the implementation. And it finally allows easily checking that the assembly works correctly when built with a different toolchain for a different platform - without needing to run the full decoding tests. Especially as you've been implementing the functions, you're probably more familiar with the expectations and behaviours (and potential cornercases that are worth testing) of the functions than most other developers in the community at the moment, which is good for writing useful testcases. There's plenty of existing examples of such tests - the h264dsp, vp8dsp and vp9dsp cases might be relevant. The other main issue I'd like to request is to indent the assembly similarly to the rest of the existing assembly. For the 32 bit assembly, your patches do match the surrounding code, but for the 64 bit assembly, your patches align the operands column differently than the rest. (I think your code aligns the operands with 16 chars to the left of the operands, while our code aligns it with 24 chars to the left, both in 32 and 64 bit arm assembly.) Finally, the 32 bit assembly fails to build for me both with (recent) clang and old binutils, with errors like these: src/libavcodec/arm/vc1dsp_neon.S: Assembler messages: src/libavcodec/arm/vc1dsp_neon.S:1579: Error: bad type for scalar -- `vmov r0,d4[1]' src/libavcodec/arm/vc1dsp_neon.S:1582: Error: bad type for scalar -- `vmov r2,d5[1]' src/libavcodec/arm/vc1dsp_neon.S:1592: Error: bad type for scalar -- `vmov r2,d8[1]' src/libavcodec/arm/vc1dsp_neon.S:1595: Error: bad type for scalar -- `vmov r12,d9[1]' Qualifying the "vmov" into "vmov.32" seems to fix it. // Martin
On Sun, 20 Mar 2022, Martin Storsjö wrote: > The other main issue I'd like to request is to indent the assembly similarly > to the rest of the existing assembly. For the 32 bit assembly, your patches > do match the surrounding code, but for the 64 bit assembly, your patches > align the operands column differently than the rest. (I think your code > aligns the operands with 16 chars to the left of the operands, while our code > aligns it with 24 chars to the left, both in 32 and 64 bit arm assembly.) Oh, sidenote - I do see that the last patch in the set uses much more inconsistent indentation, with varying indentation between lines. Is this intentional to signify some structure in the code, or just accidental? I think it'd be preferrable to have it use normal straight indentation all the way throughout. // Martin
Hi Martin, Thanks very much for taking a look. On 19/03/2022 23:06, Martin Storsjö wrote: > As you are writing assembly for these functions, I would very much > appreciate if you could add checkasm tests for all the functions you're > implementing. I see that there exists a test for the blockdsp functions, > but all the other ones are missing a test. I think I'd have a bit of a learning curve ahead of me there! I did write my own fuzz testers to check the validity of my assembly implementations, and I could share them (they'd probably need a bit of tidying up since I wasn't intending them for public consumption) but they were written in ignorance of the checkasm framework, so probably wouldn't slot in neatly. Is there any writeup of checkasm anywhere, discussing how it's used, what sorts of things it tests, any speed/memory limits that tests should try to adhere to - that sort of thing? > The other main issue I'd like to request is to indent the assembly > similarly to the rest of the existing assembly. For the 32 bit assembly, > your patches do match the surrounding code, but for the 64 bit assembly, > your patches align the operands column differently than the rest. Since I was creating new source files for the 64-bit stuff, I assumed I had a bit of leeway in indentation style - but I can easily change it. For what it's worth, the opcodes in AArch64 are significantly shorter than in AArch32, since the vector element size qualifiers go on the operands instead of the opcodes, so there's less need for extra indentation. > Finally, the 32 bit assembly fails to build for me both with (recent) > clang and old binutils, with errors like these: > > src/libavcodec/arm/vc1dsp_neon.S: Assembler messages: > src/libavcodec/arm/vc1dsp_neon.S:1579: Error: bad type for scalar -- > `vmov r0,d4[1]' Thanks - the Armv8-A ARM says (section F6.1.139) that the data type can be omitted here, and in that case it is equivalent to '32', so that's a bug in clang. But easy to work around. > Oh, sidenote - I do see that the last patch in the set uses much more > inconsistent indentation, with varying indentation between lines. Is > this intentional to signify some structure in the code, or just > accidental? That was deliberate! The inner loop there is unrolled x2, and then adjacent iterations are overlapped 180 degrees out of phase. This is because each iteration starts off busy, with lots of instructions to execute, keeping pipelines full, and towards the end, it thins out, meaning we can benefit by using what would otherwise be stalls to speculatively start to process the next iteration before we've completed the current one. Effectively, if you only read a series of instructions with matching indentation, you get one logical iteration of the loop - for example, in the AArch32 version, you can follow through the process from loading the source buffer into q10 (line 1849) until we store from it to the destination buffer, having determined that it doesn't contain the start of any escape sequences (line 1890). It's a trick I've seen used a few times elsewhere, which is why I didn't bother explaining it in a comment. I could add one, or if you still don't like it once you've understood what it means, I'd be happy to take it out. Ben
On Mon, 21 Mar 2022, Ben Avison wrote: > > On 19/03/2022 23:06, Martin Storsjö wrote: >> As you are writing assembly for these functions, I would very much >> appreciate if you could add checkasm tests for all the functions you're >> implementing. I see that there exists a test for the blockdsp functions, >> but all the other ones are missing a test. > > I think I'd have a bit of a learning curve ahead of me there! I did write my > own fuzz testers to check the validity of my assembly implementations, and I > could share them (they'd probably need a bit of tidying up since I wasn't > intending them for public consumption) but they were written in ignorance of > the checkasm framework, so probably wouldn't slot in neatly. > > Is there any writeup of checkasm anywhere, discussing how it's used, what > sorts of things it tests, any speed/memory limits that tests should try to > adhere to - that sort of thing? I'm not aware of any guide in itself, but I can try to do a short writeup here. Checkasm is essentially a unit test framework for assembly functions. A test runs two versions of the same function, one reference (e.g. C implementation) and one test version (e.g. NEON implementation) against randomish input data, and compares the output to make sure that it matches (for the relevant part of the output buffer). For each test, the main point lies in knowing what the expected/valid ranges of inputs are, so that you pick random inputs that are valid - and test specifically the e.g. buffer sizes that are used in practice by the decoder. If a function usually has e.g. different codepaths internally, depending on some input values, you can make the test exercise all the different codepaths. Or if the function has potential overflows in some case, you can make part of the input buffer always contain the worst-case scenario, so that each run always tests for the overflows, even if the rest of the buffers are random data. Another aspect of tests is what part of the output to check. E.g. functions in video codecs often write a rectangular block in a buffer. At the very least, a test can check that the contents within the expected region of the output buffer matches. But tests can also (optionally) take it one step further, and check that e.g. the function didn't accidentally write outside of the edges of the indended rectangle in the output buffer. Or in some cases it's expected that a function may overwrite e.g. up to 16 bytes past the end of the payload in each row - then you intentionally wouldn't check that area. Additionally, after comparing the tested version with the reference, checkasm can also optionally benchmark your functions. This runs the benchmark specifically of only the assembly function, nothing else, by running the function with the same input parameters e.g. 1000 times. It also measures the C version of the function, so that you can compare against that and see the speedup of your assembly work, in isolation. (On linux on ARM, it by default uses the perf timers. If you have enabled user mode access to the cycle counter registers, which I highly recommend, and configure with --disable-linux-perf, you get much more precise timing - to the point that you can measure the impact of different instruction scheduling setups on in-order cores, like the Cortex A53.) Additionally, for the testing (but not for benchmarking), the functions get wrapped with extra setup to try to find lingering nonfunctional issues that don't show up when you just run the code. E.g. one common isssue is with functions that take a 32 bit value as argument. On a 64 bit architecture with arguments in registers, the upper 32 bits of such a register are undefined - while in practice they're often zero. This can lead to issues later down the line, when e.g. a different or updated compiler suddenly happens to pass nonzero bits in the undefined part. The test wrapping tries to arrange so that these bits end up as nonzero, to catch such hidden bugs. Additionally, it checks to make sure you've restored all callee saved registers. In most cases, if you happen to forget to restore e.g. a callee saved SIMD register, the effects of it normally don't show up soon (or at all), but may only show up much later depending on what the compiler did in a calling function. But all functions covered in checkasm get this checked for free. After building checkasm, if you run e.g. ./tests/checkasm/checkasm, it runs all the tests for all functions, for all SIMD instruction sets available. (On ARM there's usually only NEON, but e.g. on X86 it first enables only MMX, tests all functions available there, then increasing levels with SSE2, SSSE3, etc, to test all potential implementations.) If you run e.g. checkasm --test=blockdsp or --test=h264dsp, it will only run the tests for that subsystem. If you further add --bench=h264_idct, it will benchmark all functions with a name starting with h264_idct. One of the simplest tests to have a look at, to understand the structure, is tests/checkasm/blockdsp.c. First, the cpu feature mask is set so that av_get_cpu_flags() returns 0. Then the test main function, checkasm_check_blockdsp, is called, which initializes the DSP context (which then only gets assigned the reference C implementation of the function). In this case the test does nothing, as it compares the C implementation with itself. The next time around, av_get_cpu_flags returns NEON, and checkasm_check_blockdsp gets called again, where the DSP context now gets the NEON function assigned. In blockdsp.c, the first call to check_func(h.func, "blockdsp.clear_block") stored a copy of the previous function pointer, the C reference function, in a map. On the second call to it, it digs up the previous version and keeps the new current version. These function pointers are used via the macros call_ref() and call_new(), with the same parameters as if you'd call the function directly. After running both, you inspect the output of them to see if they match, and if not you fail the test. Finally, the bench_new() macro checks if you've asked to benchmark this particular function. If this function is one of the functions to benchmark, it runs it N times with the provided parameters. For your patches, the existing tests in h264dsp, vp8dsp, vp9dsp probably are good examples of such tests. A small gotcha when/if you're adding a new checkasm test in a new file, under a new name. If you just run checkasm without parameters, it runs all the tests by default (as long as the test is hooked up in the main tests[] array in checkasm.c). But when running fate, it runs checkasm individually with one module at a time. So if adding a new test module in checkasm, be sure to add it to the test listing in tests/fate/checkasm.mak too. The inverse transforms are tricky to test, because you probably can't feed them any random input data. The existing h264dsp, vp8dsp and vp9dsp inverse transform tests take random pixels and do a naive forward transform of them, so that you only get transform coefficients within the possible range. For deblocking filters, those tests start with random-ish input data, but try to arrange coefficients in a way so that each block contains all possible combinations of data (above or below the threshold values). Or a test can run the functions multiple times, with input data arranged to trigger each special case. For the unescape function, I'm not sure if we have any good examples of existing testcases that work on a similar function though. Try to come up with all cases of interesting input to the function (short buffers, long buffers, mod-4/non-mod-4 length, nothing to unescape, lots of things to unescape close together). >> The other main issue I'd like to request is to indent the assembly >> similarly to the rest of the existing assembly. For the 32 bit assembly, >> your patches do match the surrounding code, but for the 64 bit assembly, >> your patches align the operands column differently than the rest. > > Since I was creating new source files for the 64-bit stuff, I assumed I had a > bit of leeway in indentation style - but I can easily change it. Ok, thanks, that'd be appreciated. Yeah I try to maintain consistency across files here. > For what it's worth, the opcodes in AArch64 are significantly shorter than in > AArch32, since the vector element size qualifiers go on the operands instead > of the opcodes, so there's less need for extra indentation. Yup, that's true. But for functions where instruction-like macros are used, the macro names often are a bit longer than regular instructions, so there the extra space is appreciated. And consistency is still nice when both 32 and 64 bit arm use the same indentation style; in many cases, code is ported between the two by just copying and slightly adjusting/rewriting e.g. register names and tweaking instruction names. >> Finally, the 32 bit assembly fails to build for me both with (recent) clang >> and old binutils, with errors like these: >> >> src/libavcodec/arm/vc1dsp_neon.S: Assembler messages: >> src/libavcodec/arm/vc1dsp_neon.S:1579: Error: bad type for scalar -- `vmov >> r0,d4[1]' > > Thanks - the Armv8-A ARM says (section F6.1.139) that the data type can be > omitted here, and in that case it is equivalent to '32', so that's a bug in > clang. But easy to work around. Ok, good. Yeah, bugs or not, we try to stick with the subset of assembly that builds on all toolchains that regularly are used to building. >> Oh, sidenote - I do see that the last patch in the set uses much more >> inconsistent indentation, with varying indentation between lines. Is >> this intentional to signify some structure in the code, or just >> accidental? > > That was deliberate! The inner loop there is unrolled x2, and then adjacent > iterations are overlapped 180 degrees out of phase. This is because each > iteration starts off busy, with lots of instructions to execute, keeping > pipelines full, and towards the end, it thins out, meaning we can benefit by > using what would otherwise be stalls to speculatively start to process the > next iteration before we've completed the current one. > > Effectively, if you only read a series of instructions with matching > indentation, you get one logical iteration of the loop - for example, in the > AArch32 version, you can follow through the process from loading the source > buffer into q10 (line 1849) until we store from it to the destination buffer, > having determined that it doesn't contain the start of any escape sequences > (line 1890). > > It's a trick I've seen used a few times elsewhere, which is why I didn't > bother explaining it in a comment. I could add one, or if you still don't > like it once you've understood what it means, I'd be happy to take it out. Right, I see. (I didn't try to read the code and follow it yet, I just browsed your patches and testbuilt them.) I think it can be valuable to keep this nonstandard indentation as a readability/maintainability aid then. (But do shift the operand column 8 chars to the right for the 64 bit version.) // Martin