Message ID | 20210819213102.603690-1-mnitenko@gmail.com |
---|---|
State | New |
Headers | show |
Series | [FFmpeg-devel,1/2] lavc/aarch64: move transpose_4x4S and transpose_8x8S to neon.S | expand |
Context | Check | Description |
---|---|---|
andriy/x86_make | success | Make finished |
andriy/x86_make_fate | success | Make fate finished |
andriy/PPC64_make | success | Make finished |
andriy/PPC64_make_fate | success | Make fate finished |
On Fri, 20 Aug 2021, Mikhail Nitenko wrote: > > there is a function that is not covered by tests, but I tested it with > sample videos, not sure what to do with it If there are asm functions that lack checkasm tests, then please do add a test for it while working on that function. Asm functions that aren't tested by checkasm can have various hidden bugs, and not having a checkasm test makes the code much more unmaintainable. // Martin
On Fri, 20 Aug 2021, Mikhail Nitenko wrote: > transpose_4x4S and transpose_8x8S were declared in vp9itxfm_16bpp_neon, however these macros are > not unique to vp9 and could be used elsewhere. > > Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> > --- > libavcodec/aarch64/neon.S | 49 ++++++++++++++++++++++++ > libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 49 ------------------------ > 2 files changed, 49 insertions(+), 49 deletions(-) > > diff --git a/libavcodec/aarch64/neon.S b/libavcodec/aarch64/neon.S > index 1ad32c359d..4186186185 100644 > --- a/libavcodec/aarch64/neon.S > +++ b/libavcodec/aarch64/neon.S > @@ -160,3 +160,52 @@ > trn2 \r7\().2D, \r9\().2D, \r7\().2D > > .endm > + > +.macro transpose_4x4S r0, r1, r2, r3, r4, r5, r6, r7 > -.macro transpose_4x4s r0, r1, r2, r3, r4, r5, r6, r7 Here, you are changing the case of the macro name. This breaks the existing users of the macro in vp9itxfm when built with clang or gas-preprocessor, which both treat macros case-sensitively. If you want to make the casing consistent with the rest of neon.S here, then please do mention it in the commit message, and update vp9itxfm correspondingly. // Martin
On Fri, 20 Aug 2021, Mikhail Nitenko wrote: > Benchmarks: A53 A72 > h264_idct4_add_10bpp_c: 187.7 115.2 > h264_idct4_add_10bpp_neon: 72.5 45.0 > h264_idct4_add_dc_10bpp_c: 96.0 61.2 > h264_idct4_add_dc_10bpp_neon: 36.0 19.5 > h264_idct8_add4_10bpp_c: 2115.5 1424.2 > h264_idct8_add4_10bpp_neon: 734.0 459.5 > h264_idct8_add_10bpp_c: 1017.5 709.0 > h264_idct8_add_10bpp_neon: 345.5 216.5 > h264_idct8_add_dc_10bpp_c: 316.0 235.5 > h264_idct8_add_dc_10bpp_neon: 69.7 44.0 > h264_idct_add16_10bpp_c: 2540.2 1498.5 > h264_idct_add16_10bpp_neon: 1080.5 616.0 > h264_idct_add16intra_10bpp_c: 784.7 439.5 > h264_idct_add16intra_10bpp_neon: 641.0 462.2 > > Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> > --- > > there is a function that is not covered by tests, but I tested it with > sample videos, not sure what to do with it It would be really good to add a checkasm test for it, because assembly without checkasm tests can have lots of hidden bugs (although it seems fairly straightforward here) that only get uncovered by later compiler updates. Not saying that it is the case here, but without a checkasm test we don't know. Overall the patch seems fine, the code is fairly 1:1 copy of the existing but with wider SIMD elements, and I presume that the range of values don't allow keeping anything in more narrow form. > +function ff_h264_idct_add8_neon_10, export=1 // NO TESTS but test video > looks fine (did not look fine before the fixes so it is definitely > working somehow) I'm not quite sure what you mean here - did it look wrong before implementing this function - then we'd have a bug in the C code? Or did it look wrong with a broken version of this assembly function, and look right after getting it right? // Martin
diff --git a/libavcodec/aarch64/neon.S b/libavcodec/aarch64/neon.S index 1ad32c359d..4186186185 100644 --- a/libavcodec/aarch64/neon.S +++ b/libavcodec/aarch64/neon.S @@ -160,3 +160,52 @@ trn2 \r7\().2D, \r9\().2D, \r7\().2D .endm + +.macro transpose_4x4S r0, r1, r2, r3, r4, r5, r6, r7 + trn1 \r4\().4s, \r0\().4s, \r1\().4s + trn2 \r5\().4s, \r0\().4s, \r1\().4s + trn1 \r6\().4s, \r2\().4s, \r3\().4s + trn2 \r7\().4s, \r2\().4s, \r3\().4s + trn1 \r0\().2d, \r4\().2d, \r6\().2d + trn2 \r2\().2d, \r4\().2d, \r6\().2d + trn1 \r1\().2d, \r5\().2d, \r7\().2d + trn2 \r3\().2d, \r5\().2d, \r7\().2d +.endm + +// Transpose a 8x8 matrix of 32 bit elements, where each row is spread out +// over two registers. +.macro transpose_8x8S r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, t0, t1, t2, t3 + transpose_4x4S \r0, \r2, \r4, \r6, \t0, \t1, \t2, \t3 + transpose_4x4S \r9, \r11, \r13, \r15, \t0, \t1, \t2, \t3 + + // Do 4x4 transposes of r1,r3,r5,r7 and r8,r10,r12,r14 + // while swapping the two 4x4 matrices between each other + + // First step of the 4x4 transpose of r1-r7, into t0-t3 + trn1 \t0\().4s, \r1\().4s, \r3\().4s + trn2 \t1\().4s, \r1\().4s, \r3\().4s + trn1 \t2\().4s, \r5\().4s, \r7\().4s + trn2 \t3\().4s, \r5\().4s, \r7\().4s + + // First step of the 4x4 transpose of r8-r12, into r1-r7 + trn1 \r1\().4s, \r8\().4s, \r10\().4s + trn2 \r3\().4s, \r8\().4s, \r10\().4s + trn1 \r5\().4s, \r12\().4s, \r14\().4s + trn2 \r7\().4s, \r12\().4s, \r14\().4s + + // Second step of the 4x4 transpose of r1-r7 (now in t0-r3), into r8-r12 + trn1 \r8\().2d, \t0\().2d, \t2\().2d + trn2 \r12\().2d, \t0\().2d, \t2\().2d + trn1 \r10\().2d, \t1\().2d, \t3\().2d + trn2 \r14\().2d, \t1\().2d, \t3\().2d + + // Second step of the 4x4 transpose of r8-r12 (now in r1-r7), in place as far as possible + trn1 \t0\().2d, \r1\().2d, \r5\().2d + trn2 \r5\().2d, \r1\().2d, \r5\().2d + trn1 \t1\().2d, \r3\().2d, \r7\().2d + trn2 \r7\().2d, \r3\().2d, \r7\().2d + + // Move the outputs of trn1 back in place + mov \r1\().16b, \t0\().16b + mov \r3\().16b, \t1\().16b +.endm \ No newline at end of file diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S index 68296d9c40..a165ab3271 100644 --- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S +++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S @@ -41,55 +41,6 @@ const iadst16_coeffs, align=4 .short 14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207 endconst -.macro transpose_4x4s r0, r1, r2, r3, r4, r5, r6, r7 - trn1 \r4\().4s, \r0\().4s, \r1\().4s - trn2 \r5\().4s, \r0\().4s, \r1\().4s - trn1 \r6\().4s, \r2\().4s, \r3\().4s - trn2 \r7\().4s, \r2\().4s, \r3\().4s - trn1 \r0\().2d, \r4\().2d, \r6\().2d - trn2 \r2\().2d, \r4\().2d, \r6\().2d - trn1 \r1\().2d, \r5\().2d, \r7\().2d - trn2 \r3\().2d, \r5\().2d, \r7\().2d -.endm - -// Transpose a 8x8 matrix of 32 bit elements, where each row is spread out -// over two registers. -.macro transpose_8x8s r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, t0, t1, t2, t3 - transpose_4x4s \r0, \r2, \r4, \r6, \t0, \t1, \t2, \t3 - transpose_4x4s \r9, \r11, \r13, \r15, \t0, \t1, \t2, \t3 - - // Do 4x4 transposes of r1,r3,r5,r7 and r8,r10,r12,r14 - // while swapping the two 4x4 matrices between each other - - // First step of the 4x4 transpose of r1-r7, into t0-t3 - trn1 \t0\().4s, \r1\().4s, \r3\().4s - trn2 \t1\().4s, \r1\().4s, \r3\().4s - trn1 \t2\().4s, \r5\().4s, \r7\().4s - trn2 \t3\().4s, \r5\().4s, \r7\().4s - - // First step of the 4x4 transpose of r8-r12, into r1-r7 - trn1 \r1\().4s, \r8\().4s, \r10\().4s - trn2 \r3\().4s, \r8\().4s, \r10\().4s - trn1 \r5\().4s, \r12\().4s, \r14\().4s - trn2 \r7\().4s, \r12\().4s, \r14\().4s - - // Second step of the 4x4 transpose of r1-r7 (now in t0-r3), into r8-r12 - trn1 \r8\().2d, \t0\().2d, \t2\().2d - trn2 \r12\().2d, \t0\().2d, \t2\().2d - trn1 \r10\().2d, \t1\().2d, \t3\().2d - trn2 \r14\().2d, \t1\().2d, \t3\().2d - - // Second step of the 4x4 transpose of r8-r12 (now in r1-r7), in place as far as possible - trn1 \t0\().2d, \r1\().2d, \r5\().2d - trn2 \r5\().2d, \r1\().2d, \r5\().2d - trn1 \t1\().2d, \r3\().2d, \r7\().2d - trn2 \r7\().2d, \r3\().2d, \r7\().2d - - // Move the outputs of trn1 back in place - mov \r1\().16b, \t0\().16b - mov \r3\().16b, \t1\().16b -.endm - // out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14 // out2 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14 // in/out are .4s registers; this can do with 4 temp registers, but is
transpose_4x4S and transpose_8x8S were declared in vp9itxfm_16bpp_neon, however these macros are not unique to vp9 and could be used elsewhere. Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com> --- libavcodec/aarch64/neon.S | 49 ++++++++++++++++++++++++ libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 49 ------------------------ 2 files changed, 49 insertions(+), 49 deletions(-)