Message ID | 20240312131229.1551-1-martin@martin.st |
---|---|
State | Accepted |
Commit | 0c5da7be599d2f0d101705cfce27dcc965b2fc07 |
Headers | show |
Series | [FFmpeg-devel,1/4] aarch64: Fix ff_hevc_put_hevc_epel_h48_8_neon_i8mm | expand |
Context | Check | Description |
---|---|---|
yinshiyou/make_loongarch64 | success | Make finished |
yinshiyou/make_fate_loongarch64 | success | Make fate finished |
andriy/make_x86 | success | Make finished |
andriy/make_fate_x86 | success | Make fate finished |
Martin Storsjö <martin@martin.st> writes: > The first 32 elements of each row were correct, while the > last 16 were scrambled. > > This hasn't been noticed, because the checkasm test erroneously > only checked half of the output (for 8 bit functions), and > apparently none of the samples as part of "fate-hevc" seem to > trigger this specific function. > --- > libavcodec/aarch64/hevcdsp_epel_neon.S | 14 +++++++++----- > 1 file changed, 9 insertions(+), 5 deletions(-) Thanks for the fixes, wonder if we should use checkasm_check() exclusively in checkasm rather than memcmp(), would probably be useful. Pushed set
On Thu, 14 Mar 2024, J. Dekker wrote: > > Martin Storsjö <martin@martin.st> writes: > >> The first 32 elements of each row were correct, while the >> last 16 were scrambled. >> >> This hasn't been noticed, because the checkasm test erroneously >> only checked half of the output (for 8 bit functions), and >> apparently none of the samples as part of "fate-hevc" seem to >> trigger this specific function. >> --- >> libavcodec/aarch64/hevcdsp_epel_neon.S | 14 +++++++++----- >> 1 file changed, 9 insertions(+), 5 deletions(-) > > Thanks for the fixes, wonder if we should use checkasm_check() > exclusively in checkasm rather than memcmp(), would probably be useful. Wherever it makes sense and works, then yes, using checkasm_check() probably is useful. (Within dav1d, we use it in most tests except for a few.) FWIW, many checkasm tests seem to have pretty naive setups, where e.g. all rows are tightly packed. If they'd use a bigger stride with more padding between rows, one can also detect some other cases of potential asm bugs. > Pushed set Thanks! // Martin
diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S index 2dafa09337..d3f0a26f79 100644 --- a/libavcodec/aarch64/hevcdsp_epel_neon.S +++ b/libavcodec/aarch64/hevcdsp_epel_neon.S @@ -1572,6 +1572,7 @@ function ff_hevc_put_hevc_epel_h48_8_neon_i8mm, export=1 xtn2 v22.8h, v26.4s xtn v23.4h, v23.4s xtn2 v23.8h, v27.4s + add x7, x0, #64 st4 {v20.8h, v21.8h, v22.8h, v23.8h}, [x0], x10 ext v4.16b, v2.16b, v3.16b, #1 ext v5.16b, v2.16b, v3.16b, #2 @@ -1584,11 +1585,14 @@ function ff_hevc_put_hevc_epel_h48_8_neon_i8mm, export=1 usdot v21.4s, v4.16b, v30.16b usdot v22.4s, v5.16b, v30.16b usdot v23.4s, v6.16b, v30.16b - xtn v20.4h, v20.4s - xtn2 v20.8h, v22.4s - xtn v21.4h, v21.4s - xtn2 v21.8h, v23.4s - add x7, x0, #64 + zip1 v24.4s, v20.4s, v22.4s + zip2 v25.4s, v20.4s, v22.4s + zip1 v26.4s, v21.4s, v23.4s + zip2 v27.4s, v21.4s, v23.4s + xtn v20.4h, v24.4s + xtn2 v20.8h, v25.4s + xtn v21.4h, v26.4s + xtn2 v21.8h, v27.4s st2 {v20.8h, v21.8h}, [x7] b.ne 1b ret