Message ID | 680e2122-47b1-008e-6ae2-cab6e3043bd4@mail.de |
---|---|
State | New |
Series | [FFmpeg-devel] lavc/alsdec: Add NEON optimizations |
Context | Check | Description |
---|---|---|
andriy/x86_make | success | Make finished |
andriy/x86_make_fate | success | Make fate finished |
andriy/PPC64_make | success | Make finished |
andriy/PPC64_make_fate | success | Make fate finished |
Hi Thilo,

On Sun, 28 Feb 2021, Thilo Borgmann wrote:

> it's my first attempt to do some assembly, it might still include some don'ts of the asm world...
> Tested with gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
>
> Speed-wise, it sees a drop for small prediction orders until around 10 or 11.
> Well, the maximum prediction order is 1023.
> I therefore checked with the "real-world" samples from the fate-suite, which suggests low prediction orders are non-dominant:
>
> pred_order = {7..17}, gain: 23%
>
> als_reconstruct_all_c: 26645.2
> als_reconstruct_all_neon: 20635.2

This is the combination that the patch actually tests by default, if I read the code correctly - right?

You didn't write what CPU you tested this on - do note that the actual performance of the assembly is pretty heavily dependent on the CPU.

I get roughly similar numbers if I build with GCC:

                          Cortex A53      A72      A73
als_reconstruct_all_c:      107708.2  44044.5  57427.7
als_reconstruct_all_neon:    78895.7  38464.7  34065.5

However - if I build with Clang, where vectorization isn't disabled by configure, the C code beats the handwritten assembly:

                          Cortex A53
als_reconstruct_all_c:       69145.7
als_reconstruct_all_neon:    78895.7

Even if I only test order 17, the C code still is faster. So clearly we can do better - if nothing else, we could copy the assembly code that Clang outputs :-)

First a couple technical details about the patch...

> new file mode 100644
> index 0000000000..130b1a615e
> --- /dev/null
> +++ b/libavcodec/aarch64/alsdsp_init_aarch64.c
> @@ -0,0 +1,35 @@
> +
> +#include "config.h"
> +
> +#include "libavutil/aarch64/cpu.h"
> +#include "libavcodec/alsdsp.h"
> +
> +void ff_alsdsp_reconstruct_all_neon(int32_t *samples, int32_t *samples_end, int32_t *coeffs, unsigned int opt_order);

Nit: Long line?

> diff --git a/libavcodec/aarch64/alsdsp_neon.S b/libavcodec/aarch64/alsdsp_neon.S
> new file mode 100644
> index 0000000000..fe95eaf843
> --- /dev/null
> +++ b/libavcodec/aarch64/alsdsp_neon.S
> @@ -0,0 +1,155 @@
> +
> +#include "libavutil/aarch64/asm.S"
> +#include "neon.S"
> +
> +//void ff_alsdsp_reconstruct_all_neon(int32_t *samples, int32_t *samples_end, int32_t *coeffs, unsigned int opt_order);
> +// x0: int32_t *samples
> +// x1: int32_t *samples_end
> +// x2: int32_t *coeffs
> +// w3: unsigned int opt_order
> +function ff_alsdsp_reconstruct_all_neon, export = 1

Write the named macro argument without extra spaces, i.e. "export=1". Otherwise this breaks building with gas-preprocessor.

> + sub sp, sp, #128

Please align instructions and operands in the same way as in other sources. Also for the operands, I'd recommend aligning the columns according to the max width for each operand. E.g. like this:

        lsl             x3,  x3,  #32
        neg             x3,  x3,  lsr #32
        lsl             x10, x3,  #2

That way the columns line up nicely regardless of which registers are used. And for vector register operands, align them so that the max sized register (v16.16b) would fit. For things that deviate from the regular form (e.g. loads/stores etc) just align things so it looks pretty.

> + st1 {v8.4s - v11.4s}, [sp], #64
> + st1 {v12.4s - v15.4s}, [sp], #64

You aren't using registers v16-v31 at all. You could use those and avoid using v8-v15, to avoid needing to back up and restore these registers.

> +// avoid 32-bit clubber from register

Nit: The common spelling is "clobber"

> + lsl x3, x3, #32
> + neg x3, x3, lsr #32

There's normally no need to do such manual cleanup of the argument that might have junk in the upper half. If you really need to, you can fix it by doing "mov w3, w3" (zero extension) or "sxtw x3, w3" (sign extension), but in most cases you can avoid it by just making sure to refer to the register as w3 instead of x3 the first time you use it.

If you do a write to a register in the form wN instead of xN, it will implicitly clear the upper half, so you could use the xN form after that. That doesn't work quite as easily here when you want to have it fully negated though. But e.g. with something like this, it works just fine:

        neg             w3,  w3
        lsl             w10, w3,  #2
        // Sign extension when used with a 64 bit register
        add             x4,  x0,  w10, sxtw
        add             x5,  x2,  w10, sxtw
        mov             w6,  w3
        // All other uses use w6 instead of x6 etc.

> +// x10 counts the bytes left to read, set to 4 * -opt_order
> + lsl x10, x3, #2
> +
> +// loop x0 .. x1
> +1: cmp x0, x1
> + b.eq 4f
> +
> +// samples - opt_order, coeffs - opt_order
> + add x4, x0, x10
> + add x5, x2, x10
> +// reset local counter: count -opt_order .. 0
> + mov x6, x3
> +
> +// reset local acc
> + movi v8.2d, #0
> + movi v9.2d, #0
> + movi v10.2d, #0
> + movi v11.2d, #0
> + movi v12.2d, #0
> + movi v13.2d, #0
> + movi v14.2d, #0
> + movi v15.2d, #0
> +
> +// loop over 16 samples while >= 16 more to read
> + adds x6, x6, #16
> + b.gt 3f
> +
> +2: ld1 {v0.4s - v3.4s}, [x4], #64
> + ld1 {v4.4s - v7.4s}, [x5], #64
> +
> + smlal v8.2d, v0.2s, v4.2s
> + smlal2 v9.2d, v0.4s, v4.4s
> + smlal v10.2d, v1.2s, v5.2s
> + smlal2 v11.2d, v1.4s, v5.4s
> + smlal v12.2d, v2.2s, v6.2s
> + smlal2 v13.2d, v2.4s, v6.4s
> + smlal v14.2d, v3.2s, v7.2s
> + smlal2 v15.2d, v3.4s, v7.4s
> +
> + adds x6, x6, #16
> + b.le 2b
> +
> +// reduce to four NEON registers
> +// acc values into register
> +3: subs x6, x6, #16
> +
> + add v4.2d, v8.2d, v9.2d
> + add v5.2d, v10.2d, v11.2d
> + add v6.2d, v12.2d, v13.2d
> + add v7.2d, v14.2d, v15.2d
> +
> +// next 8 samples
> + cmn x6, #8
> + b.gt 3f
> +
> + ld1 {v0.4s - v1.4s}, [x4], #32
> + ld1 {v2.4s - v3.4s}, [x5], #32
> +
> + smlal v4.2d, v0.2s, v2.2s
> + smlal2 v5.2d, v0.4s, v2.4s
> + smlal v6.2d, v1.2s, v3.2s
> + smlal2 v7.2d, v1.4s, v3.4s
> +
> + adds x6, x6, #8
> +
> +// reduce to two NEON registers
> +// acc values into register
> +3: add v2.2d, v4.2d, v5.2d
> + add v3.2d, v6.2d, v7.2d
> +
> +// next 4 samples
> + cmn x6, #4
> + b.gt 3f
> +
> + ld1 {v0.4s}, [x4], #16
> + ld1 {v1.4s}, [x5], #16
> +
> + smlal v2.2d, v0.2s, v1.2s
> + smlal2 v3.2d, v0.4s, v1.4s
> +
> + adds x6, x6, #4
> +
> +// reduce to A64 registers
> +// acc values into register
> +3: add v2.2d, v2.2d, v3.2d
> + mov x7, v2.2d[0]
> + mov x8, v2.2d[1]

This breaks building both with Clang and with MS armasm64.exe (via gas-preprocessor); binutils accepts the syntax "v2.2d[0]" here but the correct form is "v2.d[0]" (as you're only accessing one lane at a time, it doesn't matter if you see the register as full or half).

> + add x7, x7, x8
> +
> + cmn x6, #0
> + b.eq 3f
> +
> +// loop over the remaining < 4 samples to read
> +2: ldrsw x8, [x4], #4
> + ldrsw x9, [x5], #4
> +
> + madd x7, x8, x9, x7
> + adds x6, x6, #1
> + b.lt 2b
> +
> +// add 1<<19 and store s-=X>>20
> +3: mov x9, #1
> + lsl x9, x9, #19
> + add x7, x7, x9
> + neg x7, x7, asr #20
> +
> + ldrsw x9, [x4]
> + add x9, x9, x7
> + str w9, [x4]
> +
> +// increment samples and loop
> + add x0, x0, #4
> + b 1b
> +
> +4: sub sp, sp, #128
> + ld1 {v8.4s - v11.4s}, [sp], #64
> + ld1 {v12.4s - v15.4s}, [sp], #64
> +
> + ret
> +endfunc
> diff --git a/libavcodec/alsdsp.c b/libavcodec/alsdsp.c
> new file mode 100644
> index 0000000000..00270bb5e6
> --- /dev/null
> +++ b/libavcodec/alsdsp.c
> +#include "libavutil/attributes.h"
> +#include "libavutil/samplefmt.h"
> +#include "mathops.h"
> +#include "alsdsp.h"
> +#include "config.h"
> +
> +static void als_reconstruct_all_c(int32_t *raw_samples, int32_t *raw_samples_end, int32_t *lpc_cof, unsigned int opt_order)
> +{
> + int64_t y;
> + int sb;
> +
> + for (; raw_samples < raw_samples_end; raw_samples++) {
> + y = 1 << 19;
> +
> + for (sb = -opt_order; sb < 0; sb++)
> + y += (uint64_t)MUL64(lpc_cof[sb], raw_samples[sb]);
> +
> + *raw_samples -= y >> 20;
> + }

This new file uses incorrect indentation and even uses tabs.

> +
> +
> +av_cold void ff_alsdsp_init(ALSDSPContext *ctx)
> +{
> + ctx->reconstruct_all = als_reconstruct_all_c;
> +
> + if (ARCH_AARCH64)
> + ff_alsdsp_init_neon(ctx);

I think the norm here would be to have this function be called *_aarch64, as it's behind an ARCH_AARCH64 check. (In the future we could have other SIMD instruction sets on aarch64, that all would go through the same init function, just like on x86 where there's SSE* and AVX*.)

> diff --git a/tests/checkasm/alsdsp.c b/tests/checkasm/alsdsp.c
> new file mode 100644
> index 0000000000..f35c7d49be
> --- /dev/null
> +++ b/tests/checkasm/alsdsp.c
> @@ -0,0 +1,81 @@
> +
> +void checkasm_check_alsdsp(void)
> +{
> + LOCAL_ALIGNED_16(uint32_t, ref_samples, [1024]);
> + LOCAL_ALIGNED_16(uint32_t, ref_coeffs, [1024]);
> + LOCAL_ALIGNED_16(uint32_t, new_samples, [1024]);
> + LOCAL_ALIGNED_16(uint32_t, new_coeffs, [1024]);
> +
> + ALSDSPContext dsp;
> + ff_alsdsp_init(&dsp);
> +
> + if (check_func(dsp.reconstruct_all, "als_reconstruct_all")) {
> + declare_func(void, int32_t *samples, int32_t *samples_end, int32_t *coeffs, unsigned int opt_order);
> + int32_t *s, *c, *e;
> + unsigned int o;
> + unsigned int O[] = {7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17};
> + for (int k = 0; k <11; k++) {

Would it be good to use FF_ARRAY_ELEMS() here instead of hardcoding 11? That would simplify testing in various configurations. Alternatively you could have "for (o = 7; o <= 17; o++)", but the array form is nice for testing specific values as long as it's not a very long range.

> + o = O[k];
> +
> + randomize_buffers();
> +
> + s = (int32_t*)(ref_samples + o);

Can't you just declare the arrays as int32_t to avoid these casts?

Now for the actual algorithm - while you do use SIMD instructions for doing the multiplication (which is the biggest part when you have long filters), your algorithm is serial (you produce one single output sample) in the end, and for small filter sizes, this is a significant portion of the runtime.
For an even more SIMDy algorithm you would filter and produce e.g. 4 output samples at a time.

Is it ok to write up to 3 samples past samples_end? If not, can the calling code be rearranged to allow that, to avoid the need for extra edge conditions?

Is lpc_coef[] before -opt_order undefined, or can it be arranged so that lpc_coef[] is padded with a couple zeros before the first coefficient we need to care about? That would allow us to ignore even more boundary conditions and e.g. round opt_order up to the nearest multiple of 4 which would simplify things even more.

Anyway, the way of filtering multiple samples at a time, while avoiding any serial processing, looks like this:

    opt_order = FFALIGN(opt_order, 4);
    while (samples_left > 0) {
        // Make this >= 4 with serial processing at the end if we aren't
        // allowed to go past the end of the buffer
        ptr = raw_samples - 4*opt_order;
        lpc_coef -= 4*opt_order;
        ld1     {v1.4s},  [ptr], #16            // load 4 samples
        n = opt_order;
        movi    v4.2d,  #0                      // accumulator
        movi    v5.2d,  #0
        do {
            ld1     {v2.4s},  [ptr], #16        // load 4 more samples
            ld1     {v0.4s},  [lpc_coef], #16   // load 4 coefficients
            ext     v16.16b, v1.16b,  v2.16b,  #4   // Samples v1-v2 offset by 1
            ext     v17.16b, v1.16b,  v2.16b,  #8   // offset by 2
            ext     v18.16b, v1.16b,  v2.16b,  #12  // offset by 3
            smlal   v4.2d,  v1.2s,  v0.s[0]
            smlal2  v5.2d,  v1.4s,  v0.s[0]
            smlal   v4.2d,  v16.2s, v0.s[1]
            smlal2  v5.2d,  v16.4s, v0.s[1]
            smlal   v4.2d,  v17.2s, v0.s[2]
            smlal2  v5.2d,  v17.4s, v0.s[2]
            smlal   v4.2d,  v18.2s, v0.s[3]
            smlal2  v5.2d,  v18.4s, v0.s[3]
            // For in-order cores like A53 and A55, this can also be more
            // efficient if reordered to do all 4 smlal to v4.2d first,
            // followed by 4 smlal2 to v5.2d.
            mov     v1.16b, v2.16b              // shift input samples
            n -= 4;
        } while (n > 0);
        // v4-v5 now contains the final sum for 4 output samples, done in parallel
        // Use the rounding shift function for doing the (sum + (1<<19)) >> 20
        rshrn   v4.2s,  v4.2d,  #20
        rshrn2  v4.4s,  v5.2d,  #20
        // v2 should be the corresponding output samples at this point
        // Rewind ptr to point at where v2 was loaded from. That way we don't
        // need a separate pointer register for this, we should be able to
        // do with just one register for all input/output to raw_samples.
        sub     ptr,  ptr,  #16
        sub     v2.4s,  v2.4s,  v4.4s
        st1     {v2.4s},  [ptr], #16
        samples_left -= 4;
    }

// Martin
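
For comparison with the pseudocode above, the per-sample arithmetic of the patch's C loop can be written out in plain C. This is only a minimal sketch under stated assumptions: `round_shift20` and `reconstruct_ref` are names invented here, the loop body follows als_reconstruct_all_c from the patch, and it assumes that >> on a negative int64_t is an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical helper: the rounding shift (sum + (1 << 19)) >> 20.
 * Assumes >> on a negative int64_t is an arithmetic shift. */
static int64_t round_shift20(int64_t sum)
{
    return (sum + (INT64_C(1) << 19)) >> 20;
}

/* Scalar model following als_reconstruct_all_c from the patch.
 * lpc_cof points one past the last coefficient, so it is indexed
 * with negative offsets, just like raw_samples. */
static void reconstruct_ref(int32_t *raw_samples, int32_t *raw_samples_end,
                            const int32_t *lpc_cof, unsigned int opt_order)
{
    for (; raw_samples < raw_samples_end; raw_samples++) {
        int64_t sum = 0;

        for (int sb = -(int)opt_order; sb < 0; sb++)
            sum += (int64_t)lpc_cof[sb] * raw_samples[sb];

        /* This store is what the next iteration reads as raw_samples[-1]. */
        *raw_samples -= (int32_t)round_shift20(sum);
    }
}
```

With two history samples of 1 << 20, the coefficients { 1 << 19, 1 << 18 } and two input samples of 10, calling reconstruct_ref() with opt_order 2 gives -786422 and -327673. The second value is computed from the first one after it has been stored, which is worth keeping in mind for any variant that produces several outputs per iteration.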
Hi Martin,

>> it's my first attempt to do some assembly, it might still include some don'ts of the asm world...
>> Tested with gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
>>
>> Speed-wise, it sees a drop for small prediction orders until around 10 or 11.
>> Well, the maximum prediction order is 1023.
>> I therefore checked with the "real-world" samples from the fate-suite, which suggests low prediction orders are non-dominant:
>>
>> pred_order = {7..17}, gain: 23%
>>
>> als_reconstruct_all_c: 26645.2
>> als_reconstruct_all_neon: 20635.2
>
> This is the combination that the patch actually tests by default, if I read the code correctly - right?

Exactly.

> You didn't write what CPU you tested this on - do note that the actual performance of the assembly is pretty heavily dependent on the CPU.
>
> I get roughly similar numbers if I build with GCC:
>
>                           Cortex A53      A72      A73
> als_reconstruct_all_c:      107708.2  44044.5  57427.7
> als_reconstruct_all_neon:    78895.7  38464.7  34065.5

Was a remote one, don't know exactly, yet. Will find out for v2.

> However - if I build with Clang, where vectorization isn't disabled by configure, the C code beats the handwritten assembly:
>
>                           Cortex A53
> als_reconstruct_all_c:       69145.7
> als_reconstruct_all_neon:    78895.7
>
> Even if I only test order 17, the C code still is faster. So clearly we can do better - if nothing else, we could copy the assembly code that Clang outputs :-)

Narf. Well maybe thoughts about the code itself will get more speed manually...

> First a couple technical details about the patch...
> [...]

I very much appreciate your excessive feedback, I will need quite some time to work through it! :)

Thanks!
-Thilo
diff --git a/configure b/configure index 900505756b..30875f87f2 100755 --- a/configure +++ b/configure @@ -2345,6 +2345,7 @@ CONFIG_EXTRA=" aandcttables ac3dsp adts_header + alsdsp atsc_a53 audio_frame_queue audiodsp @@ -2664,7 +2665,7 @@ adpcm_g722_decoder_select="g722dsp" adpcm_g722_encoder_select="g722dsp" aic_decoder_select="golomb idctdsp" alac_encoder_select="lpc" -als_decoder_select="bswapdsp" +als_decoder_select="bswapdsp alsdsp" amrnb_decoder_select="lsp" amrwb_decoder_select="lsp" amv_decoder_select="sp5x_decoder exif" diff --git a/libavcodec/Makefile b/libavcodec/Makefile index 35318f4f4d..8a23ab8ea0 100644 --- a/libavcodec/Makefile +++ b/libavcodec/Makefile @@ -62,6 +62,7 @@ OBJS = ac3_parser.o \ OBJS-$(CONFIG_AANDCTTABLES) += aandcttab.o OBJS-$(CONFIG_AC3DSP) += ac3dsp.o ac3.o ac3tab.o OBJS-$(CONFIG_ADTS_HEADER) += adts_header.o mpeg4audio.o +OBJS-$(CONFIG_ALSDSP) += alsdsp.o OBJS-$(CONFIG_AMF) += amfenc.o OBJS-$(CONFIG_AUDIO_FRAME_QUEUE) += audio_frame_queue.o OBJS-$(CONFIG_ATSC_A53) += atsc_a53.o diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile index f6434e40da..a7493c7c2b 100644 --- a/libavcodec/aarch64/Makefile +++ b/libavcodec/aarch64/Makefile @@ -1,4 +1,5 @@ # subsystems +OBJS-$(CONFIG_ALSDSP) += aarch64/alsdsp_init_aarch64.o OBJS-$(CONFIG_FFT) += aarch64/fft_init_aarch64.o OBJS-$(CONFIG_FMTCONVERT) += aarch64/fmtconvert_init.o OBJS-$(CONFIG_H264CHROMA) += aarch64/h264chroma_init_aarch64.o @@ -52,6 +53,7 @@ NEON-OBJS-$(CONFIG_VP8DSP) += aarch64/vp8dsp_neon.o # decoders/encoders NEON-OBJS-$(CONFIG_AAC_DECODER) += aarch64/aacpsdsp_neon.o +NEON-OBJS-$(CONFIG_ALS_DECODER) += aarch64/alsdsp_neon.o NEON-OBJS-$(CONFIG_DCA_DECODER) += aarch64/synth_filter_neon.o NEON-OBJS-$(CONFIG_OPUS_DECODER) += aarch64/opusdsp_neon.o NEON-OBJS-$(CONFIG_VORBIS_DECODER) += aarch64/vorbisdsp_neon.o diff --git a/libavcodec/aarch64/alsdsp_init_aarch64.c b/libavcodec/aarch64/alsdsp_init_aarch64.c new file mode 100644 index 0000000000..130b1a615e --- 
/dev/null +++ b/libavcodec/aarch64/alsdsp_init_aarch64.c @@ -0,0 +1,35 @@ +/* + * Copyright (c) 2021 Thilo Borgmann <thilo.borgmann _at_ mail.de> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "config.h" + +#include "libavutil/aarch64/cpu.h" +#include "libavcodec/alsdsp.h" + +void ff_alsdsp_reconstruct_all_neon(int32_t *samples, int32_t *samples_end, int32_t *coeffs, unsigned int opt_order); + +av_cold void ff_alsdsp_init_neon(ALSDSPContext *s) +{ + int cpu_flags = av_get_cpu_flags(); + + if (have_neon(cpu_flags)) { + s->reconstruct_all = ff_alsdsp_reconstruct_all_neon; + } +} diff --git a/libavcodec/aarch64/alsdsp_neon.S b/libavcodec/aarch64/alsdsp_neon.S new file mode 100644 index 0000000000..fe95eaf843 --- /dev/null +++ b/libavcodec/aarch64/alsdsp_neon.S @@ -0,0 +1,155 @@ +/* + * Copyright (c) 2021 Thilo Borgmann <thilo.borgmann _at_ mail.de> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. 
+ * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/aarch64/asm.S" +#include "neon.S" + +//void ff_alsdsp_reconstruct_all_neon(int32_t *samples, int32_t *samples_end, int32_t *coeffs, unsigned int opt_order); +// x0: int32_t *samples +// x1: int32_t *samples_end +// x2: int32_t *coeffs +// w3: unsigned int opt_order +function ff_alsdsp_reconstruct_all_neon, export = 1 + sub sp, sp, #128 + st1 {v8.4s - v11.4s}, [sp], #64 + st1 {v12.4s - v15.4s}, [sp], #64 +// avoid 32-bit clubber from register + lsl x3, x3, #32 + neg x3, x3, lsr #32 +// x10 counts the bytes left to read, set to 4 * -opt_order + lsl x10, x3, #2 + +// loop x0 .. x1 +1: cmp x0, x1 + b.eq 4f + +// samples - opt_order, coeffs - opt_order + add x4, x0, x10 + add x5, x2, x10 +// reset local counter: count -opt_order .. 
0 + mov x6, x3 + +// reset local acc + movi v8.2d, #0 + movi v9.2d, #0 + movi v10.2d, #0 + movi v11.2d, #0 + movi v12.2d, #0 + movi v13.2d, #0 + movi v14.2d, #0 + movi v15.2d, #0 + +// loop over 16 samples while >= 16 more to read + adds x6, x6, #16 + b.gt 3f + +2: ld1 {v0.4s - v3.4s}, [x4], #64 + ld1 {v4.4s - v7.4s}, [x5], #64 + + smlal v8.2d, v0.2s, v4.2s + smlal2 v9.2d, v0.4s, v4.4s + smlal v10.2d, v1.2s, v5.2s + smlal2 v11.2d, v1.4s, v5.4s + smlal v12.2d, v2.2s, v6.2s + smlal2 v13.2d, v2.4s, v6.4s + smlal v14.2d, v3.2s, v7.2s + smlal2 v15.2d, v3.4s, v7.4s + + adds x6, x6, #16 + b.le 2b + +// reduce to four NEON registers +// acc values into register +3: subs x6, x6, #16 + + add v4.2d, v8.2d, v9.2d + add v5.2d, v10.2d, v11.2d + add v6.2d, v12.2d, v13.2d + add v7.2d, v14.2d, v15.2d + +// next 8 samples + cmn x6, #8 + b.gt 3f + + ld1 {v0.4s - v1.4s}, [x4], #32 + ld1 {v2.4s - v3.4s}, [x5], #32 + + smlal v4.2d, v0.2s, v2.2s + smlal2 v5.2d, v0.4s, v2.4s + smlal v6.2d, v1.2s, v3.2s + smlal2 v7.2d, v1.4s, v3.4s + + adds x6, x6, #8 + +// reduce to two NEON registers +// acc values into register +3: add v2.2d, v4.2d, v5.2d + add v3.2d, v6.2d, v7.2d + +// next 4 samples + cmn x6, #4 + b.gt 3f + + ld1 {v0.4s}, [x4], #16 + ld1 {v1.4s}, [x5], #16 + + smlal v2.2d, v0.2s, v1.2s + smlal2 v3.2d, v0.4s, v1.4s + + adds x6, x6, #4 + +// reduce to A64 registers +// acc values into register +3: add v2.2d, v2.2d, v3.2d + mov x7, v2.2d[0] + mov x8, v2.2d[1] + add x7, x7, x8 + + cmn x6, #0 + b.eq 3f + +// loop over the remaining < 4 samples to read +2: ldrsw x8, [x4], #4 + ldrsw x9, [x5], #4 + + madd x7, x8, x9, x7 + adds x6, x6, #1 + b.lt 2b + +// add 1<<19 and store s-=X>>20 +3: mov x9, #1 + lsl x9, x9, #19 + add x7, x7, x9 + neg x7, x7, asr #20 + + ldrsw x9, [x4] + add x9, x9, x7 + str w9, [x4] + +// increment samples and loop + add x0, x0, #4 + b 1b + +4: sub sp, sp, #128 + ld1 {v8.4s - v11.4s}, [sp], #64 + ld1 {v12.4s - v15.4s}, [sp], #64 + + ret +endfunc diff --git 
a/libavcodec/alsdec.c b/libavcodec/alsdec.c index b3c444c54f..044e372b87 100644 --- a/libavcodec/alsdec.c +++ b/libavcodec/alsdec.c @@ -32,6 +32,7 @@ #include "unary.h" #include "mpeg4audio.h" #include "bgmc.h" +#include "alsdsp.h" #include "bswapdsp.h" #include "internal.h" #include "mlz.h" @@ -195,6 +196,7 @@ typedef struct ALSDecContext { AVCodecContext *avctx; ALSSpecificConfig sconf; GetBitContext gb; + ALSDSPContext dsp; BswapDSPContext bdsp; const AVCRC *crc_table; uint32_t crc_org; ///< CRC value of the original input data @@ -903,6 +905,7 @@ static int read_var_block_data(ALSDecContext *ctx, ALSBlockData *bd) static int decode_var_block_data(ALSDecContext *ctx, ALSBlockData *bd) { ALSSpecificConfig *sconf = &ctx->sconf; + ALSDSPContext *dsp = &ctx->dsp; unsigned int block_length = bd->block_length; unsigned int smp = 0; unsigned int k; @@ -987,14 +990,7 @@ static int decode_var_block_data(ALSDecContext *ctx, ALSBlockData *bd) raw_samples = bd->raw_samples + smp; lpc_cof = lpc_cof_reversed + opt_order; - for (; raw_samples < raw_samples_end; raw_samples++) { - y = 1 << 19; - - for (sb = -opt_order; sb < 0; sb++) - y += (uint64_t)MUL64(lpc_cof[sb], raw_samples[sb]); - - *raw_samples -= y >> 20; - } + dsp->reconstruct_all(raw_samples, raw_samples_end, lpc_cof, opt_order); raw_samples = bd->raw_samples; @@ -2150,6 +2146,7 @@ static av_cold int decode_init(AVCodecContext *avctx) } } + ff_alsdsp_init(&ctx->dsp); ff_bswapdsp_init(&ctx->bdsp); return 0; diff --git a/libavcodec/alsdsp.c b/libavcodec/alsdsp.c new file mode 100644 index 0000000000..00270bb5e6 --- /dev/null +++ b/libavcodec/alsdsp.c @@ -0,0 +1,49 @@ +/* + * Copyright (c) 2021 Thilo Borgmann <thilo.borgmann _at_ mail.de> + * + * This file is part of FFmpeg. 
+ * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/attributes.h" +#include "libavutil/samplefmt.h" +#include "mathops.h" +#include "alsdsp.h" +#include "config.h" + +static void als_reconstruct_all_c(int32_t *raw_samples, int32_t *raw_samples_end, int32_t *lpc_cof, unsigned int opt_order) +{ + int64_t y; + int sb; + + for (; raw_samples < raw_samples_end; raw_samples++) { + y = 1 << 19; + + for (sb = -opt_order; sb < 0; sb++) + y += (uint64_t)MUL64(lpc_cof[sb], raw_samples[sb]); + + *raw_samples -= y >> 20; + } +} + + +av_cold void ff_alsdsp_init(ALSDSPContext *ctx) +{ + ctx->reconstruct_all = als_reconstruct_all_c; + + if (ARCH_AARCH64) + ff_alsdsp_init_neon(ctx); +} diff --git a/libavcodec/alsdsp.h b/libavcodec/alsdsp.h new file mode 100644 index 0000000000..b285edbe6e --- /dev/null +++ b/libavcodec/alsdsp.h @@ -0,0 +1,35 @@ +/* + * Copyright (c) 2021 Thilo Borgmann <thilo.borgmann _at_ mail.de> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. 
+ * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVCODEC_ALSDSP_H +#define AVCODEC_ALSDSP_H + +#include <stdint.h> +#include "libavutil/internal.h" +#include "libavutil/samplefmt.h" + +typedef struct ALSDSPContext { + void (*reconstruct_all)(int32_t *raw_samples, int32_t *raw_samples_end, int32_t *lpc_cof, unsigned int opt_order); +} ALSDSPContext; + +void ff_alsdsp_init(ALSDSPContext *c); +void ff_alsdsp_init_neon(ALSDSPContext *c); + +#endif /* AVCODEC_ALSDSP_H */ diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile index 9e9569777b..2f1c03d78c 100644 --- a/tests/checkasm/Makefile +++ b/tests/checkasm/Makefile @@ -1,6 +1,7 @@ # libavcodec tests # subsystems AVCODECOBJS-$(CONFIG_AUDIODSP) += audiodsp.o +AVCODECOBJS-$(CONFIG_ALSDSP) += alsdsp.o AVCODECOBJS-$(CONFIG_BLOCKDSP) += blockdsp.o AVCODECOBJS-$(CONFIG_BSWAPDSP) += bswapdsp.o AVCODECOBJS-$(CONFIG_FLACDSP) += flacdsp.o diff --git a/tests/checkasm/alsdsp.c b/tests/checkasm/alsdsp.c new file mode 100644 index 0000000000..f35c7d49be --- /dev/null +++ b/tests/checkasm/alsdsp.c @@ -0,0 +1,81 @@ +/* + * Copyright (c) 2021 Thilo Borgmann <thilo.borgmann _at_ mail.de> + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License along + * with FFmpeg; if not, write to the Free Software Foundation, Inc., + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + */ + +#include <string.h> +#include "checkasm.h" +#include "libavcodec/alsdsp.h" +#include "libavutil/common.h" +#include "libavutil/internal.h" +#include "libavutil/intreadwrite.h" +#include "libavutil/mem_internal.h" + +#define NUM 1024 + +#define randomize_buffers() \ + do { \ + int i; \ + for (i = 0; i < NUM; i++) { \ + uint32_t r = rnd(); \ + AV_WN32A(&ref_coeffs[i], r); \ + AV_WN32A(&new_coeffs[i], r); \ + r = rnd(); \ + AV_WN32A(&ref_samples[i], r); \ + AV_WN32A(&new_samples[i], r); \ + } \ + } while (0) + + +void checkasm_check_alsdsp(void) +{ + LOCAL_ALIGNED_16(uint32_t, ref_samples, [1024]); + LOCAL_ALIGNED_16(uint32_t, ref_coeffs, [1024]); + LOCAL_ALIGNED_16(uint32_t, new_samples, [1024]); + LOCAL_ALIGNED_16(uint32_t, new_coeffs, [1024]); + + ALSDSPContext dsp; + ff_alsdsp_init(&dsp); + + if (check_func(dsp.reconstruct_all, "als_reconstruct_all")) { + declare_func(void, int32_t *samples, int32_t *samples_end, int32_t *coeffs, unsigned int opt_order); + int32_t *s, *c, *e; + unsigned int o; + unsigned int O[] = {7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}; + for (int k = 0; k <11; k++) { + o = O[k]; + + randomize_buffers(); + + s = (int32_t*)(ref_samples + o); + e = (int32_t*)(ref_samples + 1024); + c = (int32_t*)(ref_coeffs + o); + call_ref(s, e, c, o); + + s = (int32_t*)(new_samples + o); + e = (int32_t*)(new_samples + 1024); + c = (int32_t*)(new_coeffs + o); + call_new(s, e, c, o); + + if (memcmp(ref_samples, new_samples, o+1) || memcmp(ref_coeffs, new_coeffs, o+1)) + fail(); + bench_new(new_samples, 
e, new_coeffs, o); + } + } + report("reconstruct_all"); +} diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c index b3ac76c325..c847ae28f5 100644 --- a/tests/checkasm/checkasm.c +++ b/tests/checkasm/checkasm.c @@ -80,6 +80,9 @@ static const struct { #if CONFIG_ALAC_DECODER { "alacdsp", checkasm_check_alacdsp }, #endif + #if CONFIG_ALSDSP + { "alsdsp", checkasm_check_alsdsp }, + #endif #if CONFIG_AUDIODSP { "audiodsp", checkasm_check_audiodsp }, #endif diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h index 0190bc912c..da9f9c73fe 100644 --- a/tests/checkasm/checkasm.h +++ b/tests/checkasm/checkasm.h @@ -42,6 +42,7 @@ void checkasm_check_aacpsdsp(void); void checkasm_check_afir(void); void checkasm_check_alacdsp(void); +void checkasm_check_alsdsp(void); void checkasm_check_audiodsp(void); void checkasm_check_blend(void); void checkasm_check_blockdsp(void);
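
The order loop and buffer comparison in tests/checkasm/alsdsp.c above can be sketched outside of checkasm as follows. This is a hedged, self-contained illustration rather than a replacement for the test: `ARRAY_ELEMS` is a local stand-in for FFmpeg's FF_ARRAY_ELEMS(), and `reconstruct_fn`, `reconstruct_c`, `small_rnd` and `count_mismatches` are names made up for this sketch (the real test would keep using check_func/call_ref/call_new and rnd()). It declares the buffers as int32_t so no casts are needed, derives the loop bound from the array, and compares whole buffers, since memcmp() takes a size in bytes.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-in for FFmpeg's FF_ARRAY_ELEMS(), to keep this self-contained. */
#define ARRAY_ELEMS(a) (sizeof(a) / sizeof((a)[0]))
#define NUM 1024

/* Hypothetical function pointer type matching reconstruct_all in the patch. */
typedef void (*reconstruct_fn)(int32_t *samples, int32_t *samples_end,
                               int32_t *coeffs, unsigned int opt_order);

/* Scalar reference following als_reconstruct_all_c from the patch. */
static void reconstruct_c(int32_t *samples, int32_t *samples_end,
                          int32_t *coeffs, unsigned int opt_order)
{
    for (; samples < samples_end; samples++) {
        int64_t y = 1 << 19;

        for (int sb = -(int)opt_order; sb < 0; sb++)
            y += (int64_t)coeffs[sb] * samples[sb];

        *samples -= (int32_t)(y >> 20);
    }
}

/* Small deterministic generator (hypothetical replacement for rnd()),
 * returning values in [-range, range). range must be a power of two. */
static uint32_t lcg_state = 1;
static int32_t small_rnd(int32_t range)
{
    lcg_state = lcg_state * 1664525u + 1013904223u;
    return (int32_t)((lcg_state >> 8) & (uint32_t)(2 * range - 1)) - range;
}

/* Returns how many of the tested orders give differing buffers. */
static int count_mismatches(reconstruct_fn ref, reconstruct_fn tested)
{
    static const unsigned int orders[] = { 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 };
    static int32_t ref_samples[NUM], new_samples[NUM];
    static int32_t ref_coeffs[NUM], new_coeffs[NUM];
    int mismatches = 0;

    for (size_t k = 0; k < ARRAY_ELEMS(orders); k++) {
        unsigned int o = orders[k];

        for (int i = 0; i < NUM; i++) {
            /* Small coefficients keep the filter output well inside int32_t. */
            ref_coeffs[i]  = new_coeffs[i]  = small_rnd(1 << 14);
            ref_samples[i] = new_samples[i] = small_rnd(1 << 19);
        }

        ref(ref_samples + o, ref_samples + NUM, ref_coeffs + o, o);
        tested(new_samples + o, new_samples + NUM, new_coeffs + o, o);

        /* memcmp() takes a size in bytes, so compare the whole arrays. */
        if (memcmp(ref_samples, new_samples, sizeof(ref_samples)) ||
            memcmp(ref_coeffs, new_coeffs, sizeof(ref_coeffs)))
            mismatches++;
    }
    return mismatches;
}
```

Calling count_mismatches(reconstruct_c, reconstruct_c) compares the reference with itself and should report no mismatches; passing an optimized implementation as the second argument exercises it over the same orders. The coefficient range is deliberately small here, an assumption of this sketch, so that the results stay within int32_t.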
Hi,

it's my first attempt to do some assembly, it might still include some don'ts of the asm world...
Tested with gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Speed-wise, it sees a drop for small prediction orders until around 10 or 11.
Well, the maximum prediction order is 1023.
I therefore checked with the "real-world" samples from the fate-suite, which suggests low prediction orders are non-dominant:

pred_order = 9, gain: -6%
als_reconstruct_all_c: 15898.2
als_reconstruct_all_neon: 16460.0

pred_order = 15, gain: 35%
als_reconstruct_all_c: 34843.7
als_reconstruct_all_neon: 22840.5

pred_order = {7..17}, gain: 23%
als_reconstruct_all_c: 26645.2
als_reconstruct_all_neon: 20635.2

patched:
TEST mpeg4-als-conformance-00
TEST mpeg4-als-conformance-01
TEST mpeg4-als-conformance-02
TEST mpeg4-als-conformance-03
TEST mpeg4-als-conformance-04
TEST mpeg4-als-conformance-05
TEST mpeg4-als-conformance-09

real 0m1.006s
user 0m0.903s
sys 0m0.112s

real 0m1.007s
user 0m0.889s
sys 0m0.127s

real 0m1.005s
user 0m0.897s
sys 0m0.117s

unpatched:
TEST mpeg4-als-conformance-00
TEST mpeg4-als-conformance-01
TEST mpeg4-als-conformance-02
TEST mpeg4-als-conformance-03
TEST mpeg4-als-conformance-04
TEST mpeg4-als-conformance-05
TEST mpeg4-als-conformance-09

real 0m1.204s
user 0m1.122s
sys 0m0.091s

real 0m1.204s
user 0m1.098s
sys 0m0.115s

real 0m1.205s
user 0m1.077s
sys 0m0.137s

-Thilo

From 42a4d5f581570b0d292b63bb193e3e8da9645fcd Mon Sep 17 00:00:00 2001
From: Thilo Borgmann <thilo.borgmann@mail.de>
Date: Sun, 28 Feb 2021 14:13:32 +0000
Subject: [PATCH] lavc/alsdec: Add NEON optimizations

---
 configure                                |   3 +-
 libavcodec/Makefile                      |   1 +
 libavcodec/aarch64/Makefile              |   2 +
 libavcodec/aarch64/alsdsp_init_aarch64.c |  35 +++++
 libavcodec/aarch64/alsdsp_neon.S         | 155 +++++++++++++++++++++++
 libavcodec/alsdec.c                      |  13 +-
 libavcodec/alsdsp.c                      |  49 +++++++
 libavcodec/alsdsp.h                      |  35 +++++
 tests/checkasm/Makefile                  |   1 +
 tests/checkasm/alsdsp.c                  |  81 ++++++++++++
 tests/checkasm/checkasm.c                |   3 +
 tests/checkasm/checkasm.h                |   1 +
 12 files changed, 370 insertions(+), 9 deletions(-)
 create mode 100644 libavcodec/aarch64/alsdsp_init_aarch64.c
 create mode 100644 libavcodec/aarch64/alsdsp_neon.S
 create mode 100644 libavcodec/alsdsp.c
 create mode 100644 libavcodec/alsdsp.h
 create mode 100644 tests/checkasm/alsdsp.c