diff mbox series

[FFmpeg-devel,1/2] lavc/aarch64: move transpose_4x4S and transpose_8x8S to neon.S

Message ID 20210819213102.603690-1-mnitenko@gmail.com
State New
Headers show
Series [FFmpeg-devel,1/2] lavc/aarch64: move transpose_4x4S and transpose_8x8S to neon.S | expand

Checks

Context Check Description
andriy/x86_make success Make finished
andriy/x86_make_fate success Make fate finished
andriy/PPC64_make success Make finished
andriy/PPC64_make_fate success Make fate finished

Commit Message

Mikhail Nitenko Aug. 19, 2021, 9:31 p.m. UTC
transpose_4x4S and transpose_8x8S were declared in vp9itxfm_16bpp_neon, however these macros are
not unique to vp9 and could be used elsewhere.

Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
---
 libavcodec/aarch64/neon.S                | 49 ++++++++++++++++++++++++
 libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 49 ------------------------
 2 files changed, 49 insertions(+), 49 deletions(-)

Comments

Martin Storsjö Aug. 19, 2021, 9:37 p.m. UTC | #1
On Fri, 20 Aug 2021, Mikhail Nitenko wrote:

>
> there is a function that is not covered by tests, but I tested it with
> sample videos, not sure what to do with it

If there are asm functions that lack checkasm tests, then please do add a 
test for it while working on that function. Asm functions that aren't 
tested by checkasm can have various hidden bugs, and not having a checkasm 
test makes the code much more unmaintainable.

// Martin
Martin Storsjö Sept. 3, 2021, 9 a.m. UTC | #2
On Fri, 20 Aug 2021, Mikhail Nitenko wrote:

> transpose_4x4S and transpose_8x8S were declared in vp9itxfm_16bpp_neon, however these macros are
> not unique to vp9 and could be used elsewhere.
>
> Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
> ---
> libavcodec/aarch64/neon.S                | 49 ++++++++++++++++++++++++
> libavcodec/aarch64/vp9itxfm_16bpp_neon.S | 49 ------------------------
> 2 files changed, 49 insertions(+), 49 deletions(-)
>
> diff --git a/libavcodec/aarch64/neon.S b/libavcodec/aarch64/neon.S
> index 1ad32c359d..4186186185 100644
> --- a/libavcodec/aarch64/neon.S
> +++ b/libavcodec/aarch64/neon.S
> @@ -160,3 +160,52 @@
>         trn2            \r7\().2D,  \r9\().2D,  \r7\().2D
>
> .endm
> +
> +.macro transpose_4x4S r0, r1, r2, r3, r4, r5, r6, r7

> -.macro transpose_4x4s r0, r1, r2, r3, r4, r5, r6, r7

Here, you are changing the case of the macro name. This breaks the 
existing users of the macro in vp9itxfm when built with clang or 
gas-preprocessor, which both treat macros case-sensitively.

If you want to make the casing consistent with the rest of neon.S here, 
then please do mention it in the commit message, and update vp9itxfm 
correspondingly.

// Martin
Martin Storsjö Sept. 3, 2021, 11:35 a.m. UTC | #3
On Fri, 20 Aug 2021, Mikhail Nitenko wrote:

> Benchmarks:                                             A53     A72
> h264_idct4_add_10bpp_c:                                187.7   115.2
> h264_idct4_add_10bpp_neon:                              72.5    45.0
> h264_idct4_add_dc_10bpp_c:                              96.0    61.2
> h264_idct4_add_dc_10bpp_neon:                           36.0    19.5
> h264_idct8_add4_10bpp_c:                              2115.5  1424.2
> h264_idct8_add4_10bpp_neon:                            734.0   459.5
> h264_idct8_add_10bpp_c:                               1017.5   709.0
> h264_idct8_add_10bpp_neon:                             345.5   216.5
> h264_idct8_add_dc_10bpp_c:                             316.0   235.5
> h264_idct8_add_dc_10bpp_neon:                           69.7    44.0
> h264_idct_add16_10bpp_c:                              2540.2  1498.5
> h264_idct_add16_10bpp_neon:                           1080.5   616.0
> h264_idct_add16intra_10bpp_c:                          784.7   439.5
> h264_idct_add16intra_10bpp_neon:                       641.0   462.2
>
> Signed-off-by: Mikhail Nitenko <mnitenko@gmail.com>
> ---
>
> there is a function that is not covered by tests, but I tested it with
> sample videos, not sure what to do with it

It would be really good to add a checkasm test for it, because assembly 
without checkasm tests can have lots of hidden bugs (although it seems 
fairly straightforward here) that only get uncovered by later compiler 
updates. Not saying that it is the case here, but without a checkasm test 
we don't know.

Overall the patch seems fine, the code is fairly 1:1 copy of the existing 
but with wider SIMD elements, and I presume that the range of values don't 
allow keeping anything in more narrow form.

> +function ff_h264_idct_add8_neon_10, export=1 // NO TESTS but test video 
> looks fine (did not look fine before the fixes so it is definitely 
> working somehow)

I'm not quite sure what you mean here - did it look wrong before 
implementing this function - then we'd have a bug in the C code? Or did it 
look wrong with a broken version of this assembly function, and look right 
after getting it right?

// Martin
diff mbox series

Patch

diff --git a/libavcodec/aarch64/neon.S b/libavcodec/aarch64/neon.S
index 1ad32c359d..4186186185 100644
--- a/libavcodec/aarch64/neon.S
+++ b/libavcodec/aarch64/neon.S
@@ -160,3 +160,52 @@ 
         trn2            \r7\().2D,  \r9\().2D,  \r7\().2D
 
 .endm
+
+.macro transpose_4x4S r0, r1, r2, r3, r4, r5, r6, r7
+        trn1            \r4\().4s,  \r0\().4s,  \r1\().4s
+        trn2            \r5\().4s,  \r0\().4s,  \r1\().4s
+        trn1            \r6\().4s,  \r2\().4s,  \r3\().4s
+        trn2            \r7\().4s,  \r2\().4s,  \r3\().4s
+        trn1            \r0\().2d,  \r4\().2d,  \r6\().2d
+        trn2            \r2\().2d,  \r4\().2d,  \r6\().2d
+        trn1            \r1\().2d,  \r5\().2d,  \r7\().2d
+        trn2            \r3\().2d,  \r5\().2d,  \r7\().2d
+.endm
+
+// Transpose a 8x8 matrix of 32 bit elements, where each row is spread out
+// over two registers.
+.macro transpose_8x8S r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, t0, t1, t2, t3
+        transpose_4x4S  \r0,  \r2,  \r4,  \r6,  \t0, \t1, \t2, \t3
+        transpose_4x4S  \r9,  \r11, \r13, \r15, \t0, \t1, \t2, \t3
+
+        // Do 4x4 transposes of r1,r3,r5,r7 and r8,r10,r12,r14
+        // while swapping the two 4x4 matrices between each other
+
+        // First step of the 4x4 transpose of r1-r7, into t0-t3
+        trn1            \t0\().4s,  \r1\().4s,  \r3\().4s
+        trn2            \t1\().4s,  \r1\().4s,  \r3\().4s
+        trn1            \t2\().4s,  \r5\().4s,  \r7\().4s
+        trn2            \t3\().4s,  \r5\().4s,  \r7\().4s
+
+        // First step of the 4x4 transpose of r8-r12, into r1-r7
+        trn1            \r1\().4s,  \r8\().4s,  \r10\().4s
+        trn2            \r3\().4s,  \r8\().4s,  \r10\().4s
+        trn1            \r5\().4s,  \r12\().4s, \r14\().4s
+        trn2            \r7\().4s,  \r12\().4s, \r14\().4s
+
+        // Second step of the 4x4 transpose of r1-r7 (now in t0-r3), into r8-r12
+        trn1            \r8\().2d,  \t0\().2d,  \t2\().2d
+        trn2            \r12\().2d, \t0\().2d,  \t2\().2d
+        trn1            \r10\().2d, \t1\().2d,  \t3\().2d
+        trn2            \r14\().2d, \t1\().2d,  \t3\().2d
+
+        // Second step of the 4x4 transpose of r8-r12 (now in r1-r7), in place as far as possible
+        trn1            \t0\().2d,  \r1\().2d,  \r5\().2d
+        trn2            \r5\().2d,  \r1\().2d,  \r5\().2d
+        trn1            \t1\().2d,  \r3\().2d,  \r7\().2d
+        trn2            \r7\().2d,  \r3\().2d,  \r7\().2d
+
+        // Move the outputs of trn1 back in place
+        mov             \r1\().16b,  \t0\().16b
+        mov             \r3\().16b,  \t1\().16b
+.endm
\ No newline at end of file
diff --git a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
index 68296d9c40..a165ab3271 100644
--- a/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/aarch64/vp9itxfm_16bpp_neon.S
@@ -41,55 +41,6 @@  const iadst16_coeffs, align=4
         .short  14811, 7005, 13160, 9760, 5520, 15426, 2404, 16207
 endconst
 
-.macro transpose_4x4s r0, r1, r2, r3, r4, r5, r6, r7
-        trn1            \r4\().4s,  \r0\().4s,  \r1\().4s
-        trn2            \r5\().4s,  \r0\().4s,  \r1\().4s
-        trn1            \r6\().4s,  \r2\().4s,  \r3\().4s
-        trn2            \r7\().4s,  \r2\().4s,  \r3\().4s
-        trn1            \r0\().2d,  \r4\().2d,  \r6\().2d
-        trn2            \r2\().2d,  \r4\().2d,  \r6\().2d
-        trn1            \r1\().2d,  \r5\().2d,  \r7\().2d
-        trn2            \r3\().2d,  \r5\().2d,  \r7\().2d
-.endm
-
-// Transpose a 8x8 matrix of 32 bit elements, where each row is spread out
-// over two registers.
-.macro transpose_8x8s r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, t0, t1, t2, t3
-        transpose_4x4s  \r0,  \r2,  \r4,  \r6,  \t0, \t1, \t2, \t3
-        transpose_4x4s  \r9,  \r11, \r13, \r15, \t0, \t1, \t2, \t3
-
-        // Do 4x4 transposes of r1,r3,r5,r7 and r8,r10,r12,r14
-        // while swapping the two 4x4 matrices between each other
-
-        // First step of the 4x4 transpose of r1-r7, into t0-t3
-        trn1            \t0\().4s,  \r1\().4s,  \r3\().4s
-        trn2            \t1\().4s,  \r1\().4s,  \r3\().4s
-        trn1            \t2\().4s,  \r5\().4s,  \r7\().4s
-        trn2            \t3\().4s,  \r5\().4s,  \r7\().4s
-
-        // First step of the 4x4 transpose of r8-r12, into r1-r7
-        trn1            \r1\().4s,  \r8\().4s,  \r10\().4s
-        trn2            \r3\().4s,  \r8\().4s,  \r10\().4s
-        trn1            \r5\().4s,  \r12\().4s, \r14\().4s
-        trn2            \r7\().4s,  \r12\().4s, \r14\().4s
-
-        // Second step of the 4x4 transpose of r1-r7 (now in t0-r3), into r8-r12
-        trn1            \r8\().2d,  \t0\().2d,  \t2\().2d
-        trn2            \r12\().2d, \t0\().2d,  \t2\().2d
-        trn1            \r10\().2d, \t1\().2d,  \t3\().2d
-        trn2            \r14\().2d, \t1\().2d,  \t3\().2d
-
-        // Second step of the 4x4 transpose of r8-r12 (now in r1-r7), in place as far as possible
-        trn1            \t0\().2d,  \r1\().2d,  \r5\().2d
-        trn2            \r5\().2d,  \r1\().2d,  \r5\().2d
-        trn1            \t1\().2d,  \r3\().2d,  \r7\().2d
-        trn2            \r7\().2d,  \r3\().2d,  \r7\().2d
-
-        // Move the outputs of trn1 back in place
-        mov             \r1\().16b,  \t0\().16b
-        mov             \r3\().16b,  \t1\().16b
-.endm
-
 // out1 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
 // out2 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14
 // in/out are .4s registers; this can do with 4 temp registers, but is