Message ID | OSZP286MB2173897FD4B90324B4B1AFAFCAF22@OSZP286MB2173.JPNP286.PROD.OUTLOOK.COM |
---|---|
State | New |
Headers | show |
Series | [FFmpeg-devel,1/3] avcodec/x86/vvc/vvc_alf: fix integer overflow | expand |
Context | Check | Description |
---|---|---|
yinshiyou/make_loongarch64 | success | Make finished |
yinshiyou/make_fate_loongarch64 | success | Make fate finished |
andriy/make_x86 | success | Make finished |
andriy/make_fate_x86 | success | Make fate finished |
toqsxw@outlook.com: > From: Wu Jianhua <toqsxw@outlook.com> > > Some tests fails with certain seeds > > tests/checkasm/checkasm 2325607578 --test=vvc_alf > checkasm: using random seed 2325607578 > AVX2: > vvc_alf_filter_luma_120x20_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x24_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x28_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x32_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x36_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x40_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x44_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x48_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x52_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x56_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x60_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x64_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x68_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x72_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x76_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x80_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x84_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x88_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x92_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x96_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x100_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x104_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x108_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x112_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x116_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x120_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x124_12_avx2 (vvc_alf.c:104) > vvc_alf_filter_luma_120x128_12_avx2 (vvc_alf.c:104) > - vvc_alf.alf_filter [FAILED] > - vvc_alf.alf_classify [OK] > checkasm: 28 of 9216 tests have failed > > Reported-by: James Almer <jamrial@gmail.com> > Signed-off-by: Wu Jianhua <toqsxw@outlook.com> > --- > libavcodec/x86/vvc/vvc_alf.asm | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/libavcodec/x86/vvc/vvc_alf.asm b/libavcodec/x86/vvc/vvc_alf.asm > index 71e821c27b..91f158bac9 100644 > --- a/libavcodec/x86/vvc/vvc_alf.asm > +++ b/libavcodec/x86/vvc/vvc_alf.asm > @@ -278,7 +278,9 @@ SECTION .text > psrad m0, SHIFT + 3 > psrad m1, SHIFT + 3 > %%shift_end: > +%if ps == 1 > packssdw m0, m0, m1 > +%endif > %endmacro > > ; FILTER_VB(line) > @@ -356,7 +358,18 @@ SECTION .text > > FILTER_VB xq > > + ; sum += curr > +%if ps == 1 > paddw m0, m2 > +%else > + vpunpcklqdq m11, m2, m2 > + vpunpckhqdq m12, m2, m2 > + vpunpcklwd m11, m11, m14 > + vpunpcklwd m12, m12, m14 > + paddd m0, m11 > + paddd m1, m12 > + packssdw m0, m0, m1 > +%endif > > ; clip to pixel > CLIPW m0, m14, m15 Can this happen with real inputs (like when called from the decoder)? If not, then the test needs to be made more realistic. Anyway, what is the performance impact of this? - Andreas
Hi, On Wed, May 29, 2024 at 11:38 AM <toqsxw@outlook.com> wrote: > +%else > + vpunpcklqdq m11, m2, m2 > + vpunpckhqdq m12, m2, m2 > + vpunpcklwd m11, m11, m14 > + vpunpcklwd m12, m12, m14 > + paddd m0, m11 > + paddd m1, m12 > + packssdw m0, m0, m1 > +%endif > punpcklqdq a, src, src punpckhqdq b, src, src punpcklwd a, a, zero punpcklwd b, b, zero is the same as punpcklwd a, src, zero punpckhwd b, src, zero Also, the whole thing just emulates a saturated add. Can't you use paddsw instead of paddw and be done with it? To add to Andreas' question: is saturating here normatively required? Ronald
Ronald S. Bultje: > 发件人: Ronald S. Bultje <rsbultje@gmail.com> > 发送时间: 2024年5月29日 10:51 > 收件人: FFmpeg development discussions and patches > 抄送: James Almer; Wu Jianhua > 主题: Re: [FFmpeg-devel] [PATCH 1/3] avcodec/x86/vvc/vvc_alf: fix integer overflow > > Hi, > > On Wed, May 29, 2024 at 11:38 AM <toqsxw@outlook.com<mailto:toqsxw@outlook.com>> wrote: > +%else > + vpunpcklqdq m11, m2, m2 > + vpunpckhqdq m12, m2, m2 > + vpunpcklwd m11, m11, m14 > + vpunpcklwd m12, m12, m14 > + paddd m0, m11 > + paddd m1, m12 > + packssdw m0, m0, m1 > +%endif > > punpcklqdq a, src, src > punpckhqdq b, src, src > punpcklwd a, a, zero > punpcklwd b, b, zero > > is the same as > > punpcklwd a, src, zero > punpckhwd b, src, zero Thank you for pointing out this. This modification is really helpful for my improvement! Andreas: >Can this happen with real inputs (like when called from the decoder)? If > not, then the test needs to be made more realistic. > Anyway, what is the performance impact of this? I didn't have a unit test, but the average FPS looks no change. Ronald: > Also, the whole thing just emulates a saturated add. Can't you use paddsw instead of paddw and be done with it? To add to Andreas' question: is saturating here normatively required? We didn't have any sample that failed for this issue except for the checksum with specific seeds. I think we can keep not changing it until a real sample has something wrong. @Nuomi to get more details.
Hi, On Wed, May 29, 2024 at 3:44 PM Wu Jianhua <toqsxw@outlook.com> wrote: > Ronald S. Bultje: > > On Wed, May 29, 2024 at 11:38 AM <toqsxw@outlook.com<mailto: > toqsxw@outlook.com>> wrote: > > +%else > > + vpunpcklqdq m11, m2, m2 > > + vpunpckhqdq m12, m2, m2 > > + vpunpcklwd m11, m11, m14 > > + vpunpcklwd m12, m12, m14 > > + paddd m0, m11 > > + paddd m1, m12 > > + packssdw m0, m0, m1 > > +%endif > > > [..] > > Also, the whole thing just emulates a saturated add. Can't you use paddsw > instead of paddw and be done with it? To add to Andreas' question: is > saturating here normatively required? > > We didn't have any sample that failed for this issue except for the > checksum with specific seeds. I think we can keep not changing it until a > real sample has something wrong. > > @Nuomi to get more details. > I think "just" replacing paddw with paddsw is correct, since the input pixels are 12bit (so they could be either unsigned or signed), the filtered output is the result of packssdw (so signed words), and the desired output is 12bit pixels anyway, anything greater than that is clipped to 12bit range. So to me, it seems paddsw is a cheaper way to accomplish the same thing. Ronald
Ronald S. Bultje: > 发件人: Ronald S. Bultje <rsbultje@gmail.com> > 发送时间: 2024年5月29日 13:56 > 收件人: Wu Jianhua > 抄送: FFmpeg development discussions and patches; Nuo Mi; James Almer > 主题: Re: [FFmpeg-devel] [PATCH 1/3] avcodec/x86/vvc/vvc_alf: fix integer overflow > > Hi, > > On Wed, May 29, 2024 at 3:44 PM Wu Jianhua <toqsxw@outlook.com<mailto:toqsxw@outlook.com>> wrote: > Ronald S. Bultje: >> On Wed, May 29, 2024 at 11:38 AM <toqsxw@outlook.com<mailto:toqsxw@outlook.com>> <mailto:toqsxw@outlook.com<mailto:toqsxw@outlook.com>>> wrote: >> +%else >> + vpunpcklqdq m11, m2, m2 >> + vpunpckhqdq m12, m2, m2 >> + vpunpcklwd m11, m11, m14 >> + vpunpcklwd m12, m12, m14 >> + paddd m0, m11 >> + paddd m1, m12 >> + packssdw m0, m0, m1 >> +%endif > > [..] > > Also, the whole thing just emulates a saturated add. Can't you use paddsw instead of paddw and be done with it? To add to Andreas' question: is >> saturating here normatively required? > > > We didn't have any sample that failed for this issue except for the checksum with specific seeds. I think we can keep not changing it until a real sample has something wrong. > > @Nuomi to get more details. > > I think "just" replacing paddw with paddsw is correct, since the input pixels are 12bit (so they could be either unsigned or signed), the filtered output > is the result of packssdw (so signed words), and the desired output is 12bit pixels anyway, anything greater than that is clipped to 12bit range. So to > me, it seems paddsw is a cheaper way to accomplish the same thing. > > Ronald Hi Ronald, Yes, it does. I've test paddsw and everything works well. It must be a cheaper way to get minimum performance loss. And v2 sent. Thanks for this. Jianhua
diff --git a/libavcodec/x86/vvc/vvc_alf.asm b/libavcodec/x86/vvc/vvc_alf.asm index 71e821c27b..91f158bac9 100644 --- a/libavcodec/x86/vvc/vvc_alf.asm +++ b/libavcodec/x86/vvc/vvc_alf.asm @@ -278,7 +278,9 @@ SECTION .text psrad m0, SHIFT + 3 psrad m1, SHIFT + 3 %%shift_end: +%if ps == 1 packssdw m0, m0, m1 +%endif %endmacro ; FILTER_VB(line) @@ -356,7 +358,18 @@ SECTION .text FILTER_VB xq + ; sum += curr +%if ps == 1 paddw m0, m2 +%else + vpunpcklqdq m11, m2, m2 + vpunpckhqdq m12, m2, m2 + vpunpcklwd m11, m11, m14 + vpunpcklwd m12, m12, m14 + paddd m0, m11 + paddd m1, m12 + packssdw m0, m0, m1 +%endif ; clip to pixel CLIPW m0, m14, m15