diff mbox series

[FFmpeg-devel,1/4] aarch64: Fix ff_hevc_put_hevc_epel_h48_8_neon_i8mm

Message ID 20240312131229.1551-1-martin@martin.st
State Accepted
Commit 0c5da7be599d2f0d101705cfce27dcc965b2fc07
Headers show
Series [FFmpeg-devel,1/4] aarch64: Fix ff_hevc_put_hevc_epel_h48_8_neon_i8mm | expand

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Martin Storsjö March 12, 2024, 1:12 p.m. UTC
The first 32 elements of each row were correct, while the
last 16 were scrambled.

This hasn't been noticed, because the checkasm test erroneously
only checked half of the output (for 8 bit functions), and
apparently none of the samples as part of "fate-hevc" seem to
trigger this specific function.
---
 libavcodec/aarch64/hevcdsp_epel_neon.S | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Comments

J. Dekker March 14, 2024, 12:47 p.m. UTC | #1
Martin Storsjö <martin@martin.st> writes:

> The first 32 elements of each row were correct, while the
> last 16 were scrambled.
>
> This hasn't been noticed, because the checkasm test erroneously
> only checked half of the output (for 8 bit functions), and
> apparently none of the samples as part of "fate-hevc" seem to
> trigger this specific function.
> ---
>  libavcodec/aarch64/hevcdsp_epel_neon.S | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)

Thanks for the fixes, wonder if we should use checkasm_check()
exclusively in checkasm rather than memcmp(), would probably be useful.

Pushed set
Martin Storsjö March 14, 2024, 1:16 p.m. UTC | #2
On Thu, 14 Mar 2024, J. Dekker wrote:

>
> Martin Storsjö <martin@martin.st> writes:
>
>> The first 32 elements of each row were correct, while the
>> last 16 were scrambled.
>>
>> This hasn't been noticed, because the checkasm test erroneously
>> only checked half of the output (for 8 bit functions), and
>> apparently none of the samples as part of "fate-hevc" seem to
>> trigger this specific function.
>> ---
>>  libavcodec/aarch64/hevcdsp_epel_neon.S | 14 +++++++++-----
>>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> Thanks for the fixes, wonder if we should use checkasm_check()
> exclusively in checkasm rather than memcmp(), would probably be useful.

Wherever it makes sense and works, then yes, using checkasm_check() 
probably is useful. (Within dav1d, we use it in most tests except for a 
few.)

FWIW, many checkasm tests seem to have pretty naive setups, where e.g. all 
rows are tightly packed. If they'd use a bigger stride with more padding 
between rows, one can also detect some other cases of potential asm bugs.

> Pushed set

Thanks!

// Martin
diff mbox series

Patch

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 2dafa09337..d3f0a26f79 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -1572,6 +1572,7 @@  function ff_hevc_put_hevc_epel_h48_8_neon_i8mm, export=1
         xtn2            v22.8h, v26.4s
         xtn             v23.4h, v23.4s
         xtn2            v23.8h, v27.4s
+        add             x7, x0, #64
         st4             {v20.8h, v21.8h, v22.8h, v23.8h}, [x0], x10
         ext             v4.16b, v2.16b, v3.16b, #1
         ext             v5.16b, v2.16b, v3.16b, #2
@@ -1584,11 +1585,14 @@  function ff_hevc_put_hevc_epel_h48_8_neon_i8mm, export=1
         usdot           v21.4s, v4.16b, v30.16b
         usdot           v22.4s, v5.16b, v30.16b
         usdot           v23.4s, v6.16b, v30.16b
-        xtn             v20.4h, v20.4s
-        xtn2            v20.8h, v22.4s
-        xtn             v21.4h, v21.4s
-        xtn2            v21.8h, v23.4s
-        add             x7, x0, #64
+        zip1            v24.4s, v20.4s, v22.4s
+        zip2            v25.4s, v20.4s, v22.4s
+        zip1            v26.4s, v21.4s, v23.4s
+        zip2            v27.4s, v21.4s, v23.4s
+        xtn             v20.4h, v24.4s
+        xtn2            v20.8h, v25.4s
+        xtn             v21.4h, v26.4s
+        xtn2            v21.8h, v27.4s
         st2             {v20.8h, v21.8h}, [x7]
         b.ne            1b
         ret