[FFmpeg-devel] lavc/aarch64: Add neon implementation for sse16

Message ID 20220725111208.43542-1-hum@semihalf.com
State New
Series [FFmpeg-devel] lavc/aarch64: Add neon implementation for sse16

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Hubert Mazur July 25, 2022, 11:12 a.m. UTC
Provide a neon implementation of the sse16 function.

Performance comparison tests are shown below.
- sse_0_c: 273.0
- sse_0_neon: 48.2

Benchmarks and tests were run with the checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
 libavcodec/aarch64/me_cmp_neon.S         | 82 ++++++++++++++++++++++++
 2 files changed, 86 insertions(+)

Comments

Martin Storsjö Aug. 3, 2022, 1:22 p.m. UTC | #1
On Mon, 25 Jul 2022, Hubert Mazur wrote:

> Provide a neon implementation of the sse16 function.
>
> Performance comparison tests are shown below.
> - sse_0_c: 273.0
> - sse_0_neon: 48.2
>
> Benchmarks and tests were run with the checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
> libavcodec/aarch64/me_cmp_neon.S         | 82 ++++++++++++++++++++++++
> 2 files changed, 86 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 136b008eb7..3ff5767bd0 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -30,6 +30,9 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, uint8_t *blk1, uint8_t *blk2,
> int ff_pix_abs16_x2_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2,
>                       ptrdiff_t stride, int h);
> 
> +int sse16_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2,
> +                  ptrdiff_t stride, int h);

The signature of these functions has been changed now (right after these 
patches were submitted); the pix1/pix2 parameters are now const.

Also, a nitpick: please align the following line ("ptrdiff_t stride, ...") 
correctly with the parenthesis on the line above.

> +
> av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
> {
>     int cpu_flags = av_get_cpu_flags();
> @@ -40,5 +43,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
>         c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
>
>         c->sad[0] = ff_pix_abs16_neon;
> +        c->sse[0] = sse16_neon;
>     }
> }
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index cda7ce0408..98c912b608 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -270,3 +270,85 @@ function ff_pix_abs16_x2_neon, export=1
>
>         ret
> endfunc
> +
> +function sse16_neon, export=1
> +        // x0 - unused
> +        // x1 - pix1
> +        // x2 - pix2
> +        // x3 - stride
> +        // w4 - h
> +
> +        cmp             w4, #4
> +        movi            d18, #0
> +        b.lt            2f
> +
> +// Make 4 iterations at once
> +1:
> +
> +        // res = abs(pix1[0] - pix2[0])
> +        // res * res
> +
> +        ld1             {v0.16b}, [x1], x3              // Load pix1 vector for first iteration
> +        ld1             {v1.16b}, [x2], x3              // Load pix2 vector for first iteration
> +        uabd            v30.16b, v0.16b, v1.16b         // Absolute difference, first iteration

Try to improve the interleaving of this function; I did a quick test on 
Cortex A53, A72 and A73, and got these numbers:

Before:
sse_0_neon:  147.7   64.5   64.7
After:
sse_0_neon:  133.7   60.7   59.2

Overall, try to avoid having consecutive instructions operating on the 
same iteration (except for when doing the same operation on different 
halves of the same iteration), i.e. not "absolute difference third 
iteration; multiply lower half third iteration, multiply upper half third 
iteration, pairwise add third iteration", but bundle it up so you have 
e.g. "absolute difference third iteration; pairwise add first iteration; 
multiply {upper,lower} half third iteration; pairwise add second 
iteration; pairwise add third iteration", or something like that.
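
Spelled out with the register assignments from the patch, that could look 
something like this (untested, just to illustrate the bundling):

        uabd            v24.16b, v4.16b, v5.16b         // absolute difference, third iteration
        uadalp          v17.4s, v28.8h                  // pairwise add, first iteration
        umull           v23.8h, v24.8b, v24.8b          // multiply lower half, third iteration
        umull2          v22.8h, v24.16b, v24.16b        // multiply upper half, third iteration
        uadalp          v17.4s, v26.8h                  // pairwise add, second iteration
        uadalp          v17.4s, v25.8h                  // pairwise add, second iteration
        uadalp          v17.4s, v23.8h                  // pairwise add, third iteration
        uadalp          v17.4s, v22.8h                  // pairwise add, third iteration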

Then secondly, in general, don't serialize the summation down to a single 
element in each iteration! You can keep the accumulated sum as a vX.4s 
vector (or maybe even better, two .4s vectors!) throughout the whole 
algorithm, and then only add them up horizontally (with an uaddv) at the 
end.
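
As a rough, untested sketch of that shape (picking v16/v17 as the 
accumulators is an arbitrary choice):

        movi            v16.4s, #0                      // accumulator for lower-half products
        movi            v17.4s, #0                      // accumulator for upper-half products
1:
        // ... loads, uabd, umull/umull2 per iteration as before ...
        uadalp          v16.4s, v29.8h                  // accumulate; no horizontal add in the loop
        uadalp          v17.4s, v28.8h
        sub             w4, w4, #4                      // h -= 4
        cmp             w4, #4
        b.ge            1b
        add             v16.4s, v16.4s, v17.4s          // combine the two accumulators
        uaddlv          d16, v16.4s                     // one horizontal add at the very end
        fmov            w0, s16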

For adding vectors, I would instinctively prefer doing "uaddl v0.4s, 
v2.4h, v3.4h; uaddl2 v1.4s, v2.8h, v3.8h" instead of "uaddlp v0.4s, 
v2.8h; uadalp v0.4s, v3.8h" etc.
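
The uaddlp/uadalp pair forms a dependency chain through v0, whereas the 
two uaddl instructions are independent of each other; the lanes end up 
ordered differently, but that washes out in the final horizontal add. 
Side by side:

        // chained: the uadalp has to wait for the uaddlp result in v0
        uaddlp          v0.4s, v2.8h
        uadalp          v0.4s, v3.8h

        // independent: the two widening adds can execute in parallel
        uaddl           v0.4s, v2.4h, v3.4h
        uaddl2          v1.4s, v2.8h, v3.8h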

I didn't try out this modification, but please do; I'm pretty sure it will 
be a fair bit faster, and if not, at least more idiomatic SIMD.

I didn't check the other patches yet, but if the other sse* functions are 
implemented similarly, I would expect the same feedback to apply to them 
too.

Let's iterate on the sse16 patch first and get that one right, and then 
update sse4/sse8 similarly once it's settled.

I'll try to have a look at the other patches in the set later 
today/tomorrow.

// Martin
Martin Storsjö Aug. 4, 2022, 7:46 a.m. UTC | #2
On Mon, 25 Jul 2022, Hubert Mazur wrote:

> Provide a neon implementation of the sse16 function.
>
> Performance comparison tests are shown below.
> - sse_0_c: 273.0
> - sse_0_neon: 48.2
>
> Benchmarks and tests were run with the checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
> libavcodec/aarch64/me_cmp_init_aarch64.c |  4 ++
> libavcodec/aarch64/me_cmp_neon.S         | 82 ++++++++++++++++++++++++
> 2 files changed, 86 insertions(+)

> +// iterate by one
> +2:
> +
> +        ld1             {v0.16b}, [x1], x3              // Load pix1
> +        ld1             {v1.16b}, [x2], x3              // Load pix2
> +
> +        uabd            v30.16b, v0.16b, v1.16b
> +        umull           v29.8h, v0.8b, v1.8b
> +        umull2          v28.8h, v0.16b, v1.16b

This should probably be using v30 instead of v0/v1 in the umull here.
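
I.e. square the absolute difference rather than multiplying the raw 
pixel values, something like:

        uabd            v30.16b, v0.16b, v1.16b
        umull           v29.8h, v30.8b, v30.8b          // square lower half of the differences
        umull2          v28.8h, v30.16b, v30.16b        // square upper half of the differences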

The whole codepath for non-modulo-4 heights is untested in practice. You 
can apply the patches from 
https://patchwork.ffmpeg.org/project/ffmpeg/list/?series=7028 to make 
checkasm test it, so please make sure that the uncommon codepaths in the 
patches do work too.

// Martin

Patch

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 136b008eb7..3ff5767bd0 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -30,6 +30,9 @@  int ff_pix_abs16_xy2_neon(MpegEncContext *s, uint8_t *blk1, uint8_t *blk2,
 int ff_pix_abs16_x2_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2,
                       ptrdiff_t stride, int h);
 
+int sse16_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2,
+                  ptrdiff_t stride, int h);
+
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -40,5 +43,6 @@  av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
 
         c->sad[0] = ff_pix_abs16_neon;
+        c->sse[0] = sse16_neon;
     }
 }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index cda7ce0408..98c912b608 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -270,3 +270,85 @@  function ff_pix_abs16_x2_neon, export=1
 
         ret
 endfunc
+
+function sse16_neon, export=1
+        // x0 - unused
+        // x1 - pix1
+        // x2 - pix2
+        // x3 - stride
+        // w4 - h
+
+        cmp             w4, #4
+        movi            d18, #0
+        b.lt            2f
+
+// Make 4 iterations at once
+1:
+
+        // res = abs(pix1[0] - pix2[0])
+        // res * res
+
+        ld1             {v0.16b}, [x1], x3              // Load pix1 vector for first iteration
+        ld1             {v1.16b}, [x2], x3              // Load pix2 vector for first iteration
+        uabd            v30.16b, v0.16b, v1.16b         // Absolute difference, first iteration
+        ld1             {v2.16b}, [x1], x3              // Load pix1 vector for second iteration
+        umull           v29.8h, v30.8b, v30.8b          // Multiply lower half of vectors, first iteration
+        ld1             {v3.16b}, [x2], x3              // Load pix2 vector for second iteration
+        umull2          v28.8h, v30.16b, v30.16b        // Multiply upper half of vectors, first iteration
+        uabd            v27.16b, v2.16b, v3.16b         // Absolute difference, second iteration
+        uaddlp          v17.4s, v29.8h                  // Pairwise add, first iteration
+        umull           v26.8h, v27.8b, v27.8b          // Multiply lower half, second iteration
+        umull2          v25.8h, v27.16b, v27.16b        // Multiply upper half, second iteration
+        ld1             {v4.16b}, [x1], x3              // Load pix1 for third iteration
+        uadalp          v17.4s, v26.8h                  // Pairwise add and accumulate, second iteration
+        ld1             {v5.16b}, [x2], x3              // Load pix2 for third iteration
+        uadalp          v17.4s, v25.8h                  // Pairwise add and accumulate, second iteration
+        uabd            v24.16b, v4.16b, v5.16b         // Absolute difference, third iteration
+        ld1             {v6.16b}, [x1], x3              // Load pix1 for fourth iteration
+        umull           v23.8h, v24.8b, v24.8b          // Multiply lower half, third iteration
+        umull2          v22.8h, v24.16b, v24.16b        // Multiply upper half, third iteration
+        uadalp          v17.4s, v23.8h                  // Pairwise add and accumulate, third iteration
+        uadalp          v17.4s, v22.8h                  // Pairwise add and accumulate, third iteration
+        ld1             {v7.16b}, [x2], x3              // Load pix2 for fourth iteration
+        uadalp          v17.4s, v28.8h                  // Pairwise add and accumulate, first iteration
+        uabd            v21.16b, v6.16b, v7.16b         // Absolute difference, fourth iteration
+        umull           v20.8h, v21.8b, v21.8b          // Multiply lower half, fourth iteration
+        uadalp          v17.4s, v20.8h                  // Pairwise add and accumulate, fourth iteration
+        umull2          v19.8h, v21.16b, v21.16b        // Multiply upper half, fourth iteration
+        uadalp          v17.4s, v19.8h                  // Pairwise add and accumulate, fourth iteration
+
+        sub             w4, w4, #4                      // h -= 4
+        uaddlv          d16, v17.4s                     // add up accumulator vector
+        cmp             w4, #4
+        add             d18, d18, d16
+
+        b.ge            1b
+
+        cbnz            w4, 2f
+        fmov            w0, s18
+
+        ret
+
+// iterate by one
+2:
+
+        ld1             {v0.16b}, [x1], x3              // Load pix1
+        ld1             {v1.16b}, [x2], x3              // Load pix2
+
+        uabd            v30.16b, v0.16b, v1.16b
+        umull           v29.8h, v0.8b, v1.8b
+        umull2          v28.8h, v0.16b, v1.16b
+        uaddlp          v17.4s, v29.8h
+        uadalp          v17.4s, v28.8h
+
+
+        subs            w4, w4, #1
+        uaddlv          d16, v17.4s
+        add             d18, d18, d16
+
+        b.ne            2b
+        fmov            w0, s18
+
+        ret
+
+endfunc