Message ID | 20220325185257.513933-4-bavison@riscosopen.org |
---|---|
State | New |
Headers | show |
Series | avcodec/vc1: Arm optimisations | expand |
Context | Check | Description |
---|---|---|
andriy/make_x86 | success | Make finished |
andriy/make_fate_x86 | success | Make fate finished |
On Fri, 25 Mar 2022, Ben Avison wrote: > Disable ff_add_pixels_clamped_arm, which was found to fail the test. As this > is normally only used for Arms prior to Armv6 (ARM11) it seems quite unlikely > that anyone is still using this, so I haven't put in the effort to debug it. I had a look at this function, and I see that the overflow checks are using tst r6, #0x100 to see whether the addition overflowed (either above or below). However, if block[] was e.g. 0x200, it's possible to overflow without setting this bit at all. If it would be the case that the valid range of block[] values would be e.g. [-255,255], then this kind of overflow checking would work though. (As there exists assembly for armv6, then this function probably hasn't been used much in modern times, so this doesn't say much about what values actually are used here.) Secondly, the clamping seems to be done with movne r6, r5, lsr #24 However that should use asr, not lsr, I think, to get proper clamping in both ends? Thirdly - the added test also occasionally fails for the other existing functions (armv6, neon) and the newly added aarch64 neon version. If you have e.g. src[] = 32767, dst[] = 255, then the widening 8->16 addition will overflow, as there's no operation that both widens and clamps at the same time. I think this is reason to limit the range of src[] at least somewhat in the test, since I don't think the full 16 bit signed range actually is relevant here. // Martin
On Tue, 29 Mar 2022, Martin Storsjö wrote: > On Fri, 25 Mar 2022, Ben Avison wrote: > >> Disable ff_add_pixels_clamped_arm, which was found to fail the test. As >> this >> is normally only used for Arms prior to Armv6 (ARM11) it seems quite >> unlikely >> that anyone is still using this, so I haven't put in the effort to debug >> it. > > I had a look at this function, and I see that the overflow checks are using > > tst r6, #0x100 > > to see whether the addition overflowed (either above or below). However, if > block[] was e.g. 0x200, it's possible to overflow without setting this bit at > all. > > If it would be the case that the valid range of block[] values would be e.g. > [-255,255], then this kind of overflow checking would work though. (As there > exists assembly for armv6, then this function probably hasn't been used much > in modern times, so this doesn't say much about what values actually are used > here.) > > Secondly, the clamping seems to be done with > > movne r6, r5, lsr #24 > > However that should use asr, not lsr, I think, to get proper clamping in both > ends? On second thought, no, lsr #24 should be correct here. But "tst r6, #0x100" probably is the main issue, given the range of input values set by the current test. No idea what the actual value range is, for the decoders that use this function though. // Martin
On 29/03/2022 14:13, Martin Storsjö wrote: > On Fri, 25 Mar 2022, Ben Avison wrote: > >> Disable ff_add_pixels_clamped_arm, which was found to fail the test. > > I had a look at this function, and I see that the overflow checks are using > > tst r6, #0x100 > > to see whether the addition overflowed (either above or below). However, > if block[] was e.g. 0x200, it's possible to overflow without setting > this bit at all. Yes, thinking about it, that test is only valid if the signed 16-bit value from block[] lies in the range -0x100..+0x100 inclusive, otherwise there exists at least one unsigned 8-bit value which should have clamped but won't. > Secondly, the clamping seems to be done with > > movne r6, r5, lsr #24 > > However that should use asr, not lsr, I think, to get proper clamping in > both ends? r5 is the NOTted version, so all that's doing is selecting 0x000000FF if there was positive overflow, and 0x00000000 if there was negative overflow. Given that bit 8 and above need to be zero to facilitate repacking the 8-bit samples, that's the right thing to do. > Thirdly - the added test also occasionally fails for the other existing > functions (armv6, neon) and the newly added aarch64 neon version. If you > have e.g. src[] = 32767, dst[] = 255, then the widening 8->16 addition > will overflow, as there's no operation that both widens and clamps at > the same time. So it does. I obviously just didn't hit those cases in my test runs! I can't easily test all codecs that use this function, but I just tried instrumenting the VC-1 case and it doesn't appear to actually use this particular function, so I'm none the wiser! Should I just limit the 16-bit values to +/-0x100 and re-enable the armv4 fast path then? Ben
On Tue, 29 Mar 2022, Ben Avison wrote: >> Thirdly - the added test also occasionally fails for the other existing >> functions (armv6, neon) and the newly added aarch64 neon version. If you >> have e.g. src[] = 32767, dst[] = 255, then the widening 8->16 addition >> will overflow, as there's no operation that both widens and clamps at >> the same time. > > So it does. I obviously just didn't hit those cases in my test runs! > > I can't easily test all codecs that use this function, but I just tried > instrumenting the VC-1 case and it doesn't appear to actually use this > particular function, so I'm none the wiser! > > Should I just limit the 16-bit values to +/-0x100 and re-enable the > armv4 fast path then? Yes, I think that'd be the safest path forward. Worst case, the test would be slightly too narrow and could miss some valid case - but that's at least better than having the test give false positives for perfectly correct assembly, that would work just fine for actual decoder use. // Martin
diff --git a/libavcodec/arm/idctdsp_init_arm.c b/libavcodec/arm/idctdsp_init_arm.c index ebc90e4b49..8c8f7daf06 100644 --- a/libavcodec/arm/idctdsp_init_arm.c +++ b/libavcodec/arm/idctdsp_init_arm.c @@ -83,7 +83,9 @@ av_cold void ff_idctdsp_init_arm(IDCTDSPContext *c, AVCodecContext *avctx, } } +#if 0 // FIXME: this implementation fails checkasm test c->add_pixels_clamped = ff_add_pixels_clamped_arm; +#endif if (have_armv5te(cpu_flags)) ff_idctdsp_init_armv5te(c, avctx, high_bit_depth); diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile index 7133a6ee66..f6b1008855 100644 --- a/tests/checkasm/Makefile +++ b/tests/checkasm/Makefile @@ -9,6 +9,7 @@ AVCODECOBJS-$(CONFIG_G722DSP) += g722dsp.o AVCODECOBJS-$(CONFIG_H264DSP) += h264dsp.o AVCODECOBJS-$(CONFIG_H264PRED) += h264pred.o AVCODECOBJS-$(CONFIG_H264QPEL) += h264qpel.o +AVCODECOBJS-$(CONFIG_IDCTDSP) += idctdsp.o AVCODECOBJS-$(CONFIG_LLVIDDSP) += llviddsp.o AVCODECOBJS-$(CONFIG_LLVIDENCDSP) += llviddspenc.o AVCODECOBJS-$(CONFIG_VC1DSP) += vc1dsp.o diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c index c2efd81b6d..57134f96ea 100644 --- a/tests/checkasm/checkasm.c +++ b/tests/checkasm/checkasm.c @@ -123,6 +123,9 @@ static const struct { #if CONFIG_HUFFYUV_DECODER { "huffyuvdsp", checkasm_check_huffyuvdsp }, #endif + #if CONFIG_IDCTDSP + { "idctdsp", checkasm_check_idctdsp }, + #endif #if CONFIG_JPEG2000_DECODER { "jpeg2000dsp", checkasm_check_jpeg2000dsp }, #endif diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h index 52ab18a5b1..a86db140e3 100644 --- a/tests/checkasm/checkasm.h +++ b/tests/checkasm/checkasm.h @@ -64,6 +64,7 @@ void checkasm_check_hevc_idct(void); void checkasm_check_hevc_pel(void); void checkasm_check_hevc_sao(void); void checkasm_check_huffyuvdsp(void); +void checkasm_check_idctdsp(void); void checkasm_check_jpeg2000dsp(void); void checkasm_check_llviddsp(void); void checkasm_check_llviddspenc(void); diff --git a/tests/checkasm/idctdsp.c b/tests/checkasm/idctdsp.c new file mode 100644 index 0000000000..d94728b672 --- /dev/null +++ b/tests/checkasm/idctdsp.c @@ -0,0 +1,85 @@ +/* + * Copyright (c) 2022 Ben Avison + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License along + * with FFmpeg; if not, write to the Free Software Foundation, Inc., + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + */ + +#include <string.h> + +#include "checkasm.h" + +#include "libavcodec/idctdsp.h" + +#include "libavutil/common.h" +#include "libavutil/internal.h" +#include "libavutil/intreadwrite.h" +#include "libavutil/mem_internal.h" + +#define RANDOMIZE_BUFFER16(name, size) \ + do { \ + int i; \ + for (i = 0; i < size; ++i) { \ + uint16_t r = rnd(); \ + AV_WN16A(name##0 + i, r); \ + AV_WN16A(name##1 + i, r); \ + } \ + } while (0) + +#define RANDOMIZE_BUFFER8(name, size) \ + do { \ + int i; \ + for (i = 0; i < size; ++i) { \ + uint8_t r = rnd(); \ + name##0[i] = r; \ + name##1[i] = r; \ + } \ + } while (0) + +#define CHECK_ADD_PUT_CLAMPED(func) \ + do { \ + if (check_func(h.func, "idctdsp." #func)) { \ + declare_func_emms(AV_CPU_FLAG_MMX, void, const int16_t *, uint8_t *, ptrdiff_t); \ + RANDOMIZE_BUFFER16(src, 64); \ + RANDOMIZE_BUFFER8(dst, 10 * 24); \ + call_ref(src0, dst0 + 24 + 8, 24); \ + call_new(src1, dst1 + 24 + 8, 24); \ + if (memcmp(dst0, dst1, 10 * 24)) \ + fail(); \ + bench_new(src1, dst1 + 24 + 8, 24); \ + } \ + } while (0) + +void checkasm_check_idctdsp(void) +{ + /* Source buffers are only as big as needed, since any over-read won't affect results */ + LOCAL_ALIGNED_16(int16_t, src0, [64]); + LOCAL_ALIGNED_16(int16_t, src1, [64]); + /* Destination buffers have borders of one row above/below and 8 columns left/right to catch overflows */ + LOCAL_ALIGNED_8(uint8_t, dst0, [10 * 24]); + LOCAL_ALIGNED_8(uint8_t, dst1, [10 * 24]); + + AVCodecContext avctx = { 0 }; + IDCTDSPContext h; + + ff_idctdsp_init(&h, &avctx); + + CHECK_ADD_PUT_CLAMPED(add_pixels_clamped); + CHECK_ADD_PUT_CLAMPED(put_pixels_clamped); + CHECK_ADD_PUT_CLAMPED(put_signed_pixels_clamped); + + report("idctdsp"); +} diff --git a/tests/fate/checkasm.mak b/tests/fate/checkasm.mak index 99e6bb13c4..c6273db183 100644 --- a/tests/fate/checkasm.mak +++ b/tests/fate/checkasm.mak @@ -19,6 +19,7 @@ FATE_CHECKASM = fate-checkasm-aacpsdsp \ fate-checkasm-hevc_pel \ fate-checkasm-hevc_sao \ fate-checkasm-huffyuvdsp \ + fate-checkasm-idctdsp \ fate-checkasm-jpeg2000dsp \ fate-checkasm-llviddsp \ fate-checkasm-llviddspenc \
Disable ff_add_pixels_clamped_arm, which was found to fail the test. As this is normally only used for Arms prior to Armv6 (ARM11) it seems quite unlikely that anyone is still using this, so I haven't put in the effort to debug it. Signed-off-by: Ben Avison <bavison@riscosopen.org> --- libavcodec/arm/idctdsp_init_arm.c | 2 + tests/checkasm/Makefile | 1 + tests/checkasm/checkasm.c | 3 ++ tests/checkasm/checkasm.h | 1 + tests/checkasm/idctdsp.c | 85 +++++++++++++++++++++++++++++++ tests/fate/checkasm.mak | 1 + 6 files changed, 93 insertions(+) create mode 100644 tests/checkasm/idctdsp.c