diff mbox series

[FFmpeg-devel] avcodec/x86/hevc: fix luma 12b overflow

Message ID 20240225082755.355295-1-jdek@itanimul.li
State New
Headers show
Series [FFmpeg-devel] avcodec/x86/hevc: fix luma 12b overflow | expand

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

J. Dekker Feb. 25, 2024, 8:27 a.m. UTC
Weak filter can overflow in delta0 calculation before >> 4 in int16.

Signed-off-by: J. Dekker <jdek@itanimul.li>
---

 I do not know x86 simd at all, so this is just an attempt to fix
 the implementation rather than write extremely performant code.

 Suggestions welcome.

 libavcodec/x86/hevc_deblock.asm | 47 +++++++++++++++++++++++++++++++++
 libavcodec/x86/hevcdsp_init.c   |  8 ------
 2 files changed, 47 insertions(+), 8 deletions(-)

Comments

Ronald S. Bultje Feb. 25, 2024, 3:56 p.m. UTC | #1
Hi,

On Sun, Feb 25, 2024 at 3:28 AM J. Dekker <jdek@itanimul.li> wrote:

> Weak filter can overflow in delta0 calculation before >> 4 in int16.
>
> Signed-off-by: J. Dekker <jdek@itanimul.li>
> ---
>
>  I do not know x86 simd at all, so this is just an attempt to fix
>  the implementation rather than write extremely performant code.
>
>  Suggestions welcome.
>

https://pastebin.com/KvcbQ2nK

Most of this can remain in 16bits when doing pabsw before the add. The odd
thing is that if I break (intentionally) the sse2 version, the ssse3/avx
checkams tests start failing occasionally also. I'm not sure why that is
the case. Not sure how to measure perf impact, I don't think there is any
for the non-12bit case (same number of instructions) but I didn't test very
hard and the perf is not very stable...

Ronald
Ronald S. Bultje Feb. 25, 2024, 4:22 p.m. UTC | #2
On Sun, Feb 25, 2024 at 10:56 AM Ronald S. Bultje <rsbultje@gmail.com>
wrote:

> Hi,
>
> On Sun, Feb 25, 2024 at 3:28 AM J. Dekker <jdek@itanimul.li> wrote:
>
>> Weak filter can overflow in delta0 calculation before >> 4 in int16.
>>
>> Signed-off-by: J. Dekker <jdek@itanimul.li>
>> ---
>>
>>  I do not know x86 simd at all, so this is just an attempt to fix
>>  the implementation rather than write extremely performant code.
>>
>>  Suggestions welcome.
>>
>
> https://pastebin.com/KvcbQ2nK
>

Attached a slightly adjusted version which does sse2 in 16bit also.

Ronald
Ronald S. Bultje Feb. 25, 2024, 4:24 p.m. UTC | #3
Hi,

On Sun, Feb 25, 2024 at 11:22 AM Ronald S. Bultje <rsbultje@gmail.com>
wrote:

>
>
> On Sun, Feb 25, 2024 at 10:56 AM Ronald S. Bultje <rsbultje@gmail.com>
> wrote:
>
>> Hi,
>>
>> On Sun, Feb 25, 2024 at 3:28 AM J. Dekker <jdek@itanimul.li> wrote:
>>
>>> Weak filter can overflow in delta0 calculation before >> 4 in int16.
>>>
>>> Signed-off-by: J. Dekker <jdek@itanimul.li>
>>> ---
>>>
>>>  I do not know x86 simd at all, so this is just an attempt to fix
>>>  the implementation rather than write extremely performant code.
>>>
>>>  Suggestions welcome.
>>>
>>
>> https://pastebin.com/KvcbQ2nK
>>
>
> Attached a slightly adjusted version which does sse2 in 16bit also.
>

And now without typos and whitespace changes.

Ronald
James Almer Feb. 25, 2024, 4:28 p.m. UTC | #4
On 2/25/2024 1:22 PM, Ronald S. Bultje wrote:
> On Sun, Feb 25, 2024 at 10:56 AM Ronald S. Bultje <rsbultje@gmail.com>
> wrote:
> 
>> Hi,
>>
>> On Sun, Feb 25, 2024 at 3:28 AM J. Dekker <jdek@itanimul.li> wrote:
>>
>>> Weak filter can overflow in delta0 calculation before >> 4 in int16.
>>>
>>> Signed-off-by: J. Dekker <jdek@itanimul.li>
>>> ---
>>>
>>>   I do not know x86 simd at all, so this is just an attempt to fix
>>>   the implementation rather than write extremely performant code.
>>>
>>>   Suggestions welcome.
>>>
>>
>> https://pastebin.com/KvcbQ2nK
>>
> 
> Attached a slightly adjusted version which does sse2 in 16bit also.
> 
> Ronald

> diff --git a/libavcodec/x86/hevc_deblock.asm b/libavcodec/x86/hevc_deblock.asm
> index 85ee4800bb..869301caff 100644
> --- a/libavcodec/x86/hevc_deblock.asm
> +++ b/libavcodec/x86/hevc_deblock.asm
> @@ -31,6 +31,7 @@ cextern pw_1023
>  pw_pixel_max_12: times 8 dw ((1 << 12)-1)
>  pw_m2:           times 8 dw -2
>  pd_1 :           times 4 dd  1
> +pd_8 :           times 8 dd  8

This is unused.
Ronald S. Bultje Feb. 25, 2024, 4:41 p.m. UTC | #5
Hi,

On Sun, Feb 25, 2024 at 11:28 AM James Almer <jamrial@gmail.com> wrote:

> On 2/25/2024 1:22 PM, Ronald S. Bultje wrote:
> > On Sun, Feb 25, 2024 at 10:56 AM Ronald S. Bultje <rsbultje@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> On Sun, Feb 25, 2024 at 3:28 AM J. Dekker <jdek@itanimul.li> wrote:
> >>
> >>> Weak filter can overflow in delta0 calculation before >> 4 in int16.
> >>>
> >>> Signed-off-by: J. Dekker <jdek@itanimul.li>
> >>> ---
> >>>
> >>>   I do not know x86 simd at all, so this is just an attempt to fix
> >>>   the implementation rather than write extremely performant code.
> >>>
> >>>   Suggestions welcome.
> >>>
> >>
> >> https://pastebin.com/KvcbQ2nK
> >>
> >
> > Attached a slightly adjusted version which does sse2 in 16bit also.
> >
> > Ronald
>
> > diff --git a/libavcodec/x86/hevc_deblock.asm
> b/libavcodec/x86/hevc_deblock.asm
> > index 85ee4800bb..869301caff 100644
> > --- a/libavcodec/x86/hevc_deblock.asm
> > +++ b/libavcodec/x86/hevc_deblock.asm
> > @@ -31,6 +31,7 @@ cextern pw_1023
> >  pw_pixel_max_12: times 8 dw ((1 << 12)-1)
> >  pw_m2:           times 8 dw -2
> >  pd_1 :           times 4 dd  1
> > +pd_8 :           times 8 dd  8
>
> This is unused.
>

Fixed.

Ronald
Henrik Gramner Feb. 25, 2024, 10:30 p.m. UTC | #6
On Sun, Feb 25, 2024 at 5:42 PM Ronald S. Bultje <rsbultje@gmail.com> wrote:
> +    mova            m13, [pw_8]
> +    paddw           m10, m12, m12
> +    paddw           m12, m10 ; 9 * (q0 - p0) - 3 * ( q1 - p1 )
>      paddw           m12, m13; + 8

Memory operand

> +    paddw           m10, m13, m13
> +    paddw           m13, m10 ; abs(9 * (q0 - p0) - 3 * ( q1 - p1 ))
> +    paddw           m13, [pw_8]
[...]
> +    paddw           m13, m12, m12
> +    paddw           m13, m12 ; 3*abs(m12)
> +    paddw           m13, [pw_8]

Another minor improvement would be to reorder the adds like (x + x) +
(x + 8) instead of ((x + x) + x) + 8 to allow for more
instruction-level parallelism.
Ronald S. Bultje Feb. 25, 2024, 11 p.m. UTC | #7
Hi,

On Sun, Feb 25, 2024 at 5:30 PM Henrik Gramner via ffmpeg-devel <
ffmpeg-devel@ffmpeg.org> wrote:

> On Sun, Feb 25, 2024 at 5:42 PM Ronald S. Bultje <rsbultje@gmail.com>
> wrote:
> > +    mova            m13, [pw_8]
> > +    paddw           m10, m12, m12
> > +    paddw           m12, m10 ; 9 * (q0 - p0) - 3 * ( q1 - p1 )
> >      paddw           m12, m13; + 8
>
> Memory operand
>
> > +    paddw           m10, m13, m13
> > +    paddw           m13, m10 ; abs(9 * (q0 - p0) - 3 * ( q1 - p1 ))
> > +    paddw           m13, [pw_8]
> [...]
> > +    paddw           m13, m12, m12
> > +    paddw           m13, m12 ; 3*abs(m12)
> > +    paddw           m13, [pw_8]
>
> Another minor improvement would be to reorder the adds like (x + x) +
> (x + 8) instead of ((x + x) + x) + 8 to allow for more
> instruction-level parallelism.
>

New version attached.

Ronald
J. Dekker Feb. 26, 2024, 2:26 p.m. UTC | #8
"Ronald S. Bultje" <rsbultje@gmail.com> writes:
>
> New version attached.
>

Thanks for the help, pushed this along with the test.
diff mbox series

Patch

diff --git a/libavcodec/x86/hevc_deblock.asm b/libavcodec/x86/hevc_deblock.asm
index 85ee4800bb..ce9221ebc7 100644
--- a/libavcodec/x86/hevc_deblock.asm
+++ b/libavcodec/x86/hevc_deblock.asm
@@ -541,6 +541,7 @@  ALIGN 16
     add             betaq, r13
     shr             betaq, 3; ((beta + (beta >> 1)) >> 3))
 
+%if %1 < 12
     mova            m13, [pw_8]
     psubw           m12, m4, m3 ; q0 - p0
     psllw           m10, m12, 3; 8 * (q0 - p0)
@@ -553,7 +554,49 @@  ALIGN 16
     paddw           m12, m13; + 8
     psraw           m12, 4; >> 4 , delta0
     PABSW           m13, m12; abs(delta0)
+%else
+    psubw           m12, m4, m3 ; q0 - p0
+    pmovsxwd        m13, m12 ; m13 low
+    movhlps         m12, m12
+    pmovsxwd        m12, m12 ; m12 high
+
+    ; m8 low, m10 high
+    pslld            m8, m13, 3; 8 * (q0 - p0)
+    pslld           m10, m12, 3
+
+    paddd            m8, m13 ; 9 * (q0 - p0)
+    paddd           m10, m12
+
+    psubw           m12, m5, m2 ; q1 - p1
+    pmovsxwd        m13, m12 ; m13 low
+    movhlps         m12, m12
+    pmovsxwd        m12, m12 ; m12 high
 
+    psubd            m8, m13 ; 9 * (q0 - p0) - ( q1 - p1 )
+    psubd           m10, m12
+
+    pslld           m13, m13, 1; 2 * ( q1 - p1 )
+    pslld           m12, m12, 1
+
+    psubd            m8, m13; 9 * (q0 - p0) - 3 * ( q1 - p1 )
+    psubd           m10, m12
+
+    mova            m13, [pw_8]
+    pmovsxwd        m13, m13
+
+    paddd            m8, m13 ; + 8
+    paddd           m10, m13
+
+    psrad            m8, 4; >> 4 , delta0
+    psrad           m10, 4
+
+    packssdw        m12, m8
+    packssdw        m10, m10
+
+    psrldq          m12, 8
+    punpcklqdq      m12, m10
+    PABSW           m13, m12; abs(delta0)
+%endif
 
     psllw           m10, m9, 2; 8 * tc
     paddw           m10, m9; 10 * tc
@@ -746,6 +789,7 @@  cglobal hevc_v_loop_filter_luma_10, 4, 14, 16, pix, stride, beta, tc, pix0, src3
 .bypassluma:
     RET
 
+%if cpuflag(avx)
 cglobal hevc_v_loop_filter_luma_12, 4, 14, 16, pix, stride, beta, tc, pix0, src3stride
     sub            pixq, 8
     lea           pix0q, [3 * strideq]
@@ -757,6 +801,7 @@  cglobal hevc_v_loop_filter_luma_12, 4, 14, 16, pix, stride, beta, tc, pix0, src3
     TRANSPOSE8x8W_STORE PASS8ROWS(src3strideq, pixq, r1, pix0q), [pw_pixel_max_12]
 .bypassluma:
     RET
+%endif
 
 ;-----------------------------------------------------------------------------
 ; void ff_hevc_h_loop_filter_luma(uint8_t *_pix, ptrdiff_t _stride, int beta,
@@ -829,6 +874,7 @@  cglobal hevc_h_loop_filter_luma_10, 4, 14, 16, pix, stride, beta, tc, pix0, src3
 .bypassluma:
     RET
 
+%if cpuflag(avx)
 cglobal hevc_h_loop_filter_luma_12, 4, 14, 16, pix, stride, beta, tc, pix0, src3stride
     lea                  src3strideq, [3 * strideq]
     mov                        pix0q, pixq
@@ -859,6 +905,7 @@  cglobal hevc_h_loop_filter_luma_12, 4, 14, 16, pix, stride, beta, tc, pix0, src3
     movdqu     [pixq  + 2 * strideq], m6;  q2
 .bypassluma:
     RET
+%endif
 
 %endmacro
 
diff --git a/libavcodec/x86/hevcdsp_init.c b/libavcodec/x86/hevcdsp_init.c
index f5bc342cd5..e3fcb7b591 100644
--- a/libavcodec/x86/hevcdsp_init.c
+++ b/libavcodec/x86/hevcdsp_init.c
@@ -1205,10 +1205,6 @@  void ff_hevc_dsp_init_x86(HEVCDSPContext *c, const int bit_depth)
         if (EXTERNAL_SSE2(cpu_flags)) {
             c->hevc_v_loop_filter_chroma = ff_hevc_v_loop_filter_chroma_12_sse2;
             c->hevc_h_loop_filter_chroma = ff_hevc_h_loop_filter_chroma_12_sse2;
-            if (ARCH_X86_64) {
-                c->hevc_v_loop_filter_luma = ff_hevc_v_loop_filter_luma_12_sse2;
-                c->hevc_h_loop_filter_luma = ff_hevc_h_loop_filter_luma_12_sse2;
-            }
             SAO_BAND_INIT(12, sse2);
             SAO_EDGE_INIT(12, sse2);
 
@@ -1216,10 +1212,6 @@  void ff_hevc_dsp_init_x86(HEVCDSPContext *c, const int bit_depth)
             c->idct_dc[2] = ff_hevc_idct_16x16_dc_12_sse2;
             c->idct_dc[3] = ff_hevc_idct_32x32_dc_12_sse2;
         }
-        if (EXTERNAL_SSSE3(cpu_flags) && ARCH_X86_64) {
-            c->hevc_v_loop_filter_luma = ff_hevc_v_loop_filter_luma_12_ssse3;
-            c->hevc_h_loop_filter_luma = ff_hevc_h_loop_filter_luma_12_ssse3;
-        }
         if (EXTERNAL_SSE4(cpu_flags) && ARCH_X86_64) {
             EPEL_LINKS(c->put_hevc_epel, 0, 0, pel_pixels, 12, sse4);
             EPEL_LINKS(c->put_hevc_epel, 0, 1, epel_h,     12, sse4);