Message ID | 20240828204303.5801-1-ramiro.polla@gmail.com
---|---
State | New
Series | [FFmpeg-devel,1/4] checkasm/sw_rgb: add rgb24toyv12 tests
Context | Check | Description
---|---|---
yinshiyou/make_loongarch64 | success | Make finished |
yinshiyou/make_fate_loongarch64 | success | Make fate finished |
andriy/make_x86 | success | Make finished |
andriy/make_fate_x86 | fail | Make fate failed |
On Wed, 28 Aug 2024, Ramiro Polla wrote:

> NOTE: currently the tests for rgb24toyv12 fail for x86 since the c and
>       mmxext implementations differ (the mmxext version averages four
>       rgb pixels before performing the chroma calculations).
> ---
>  tests/checkasm/sw_rgb.c | 89 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 89 insertions(+)

Would it be better for bisectability, if you'd swap the order of patches
1 and 2?

// Martin
On Wed, Aug 28, 2024 at 11:13 PM Martin Storsjö <martin@martin.st> wrote:
> On Wed, 28 Aug 2024, Ramiro Polla wrote:
>
> > NOTE: currently the tests for rgb24toyv12 fail for x86 since the c and
> >       mmxext implementations differ (the mmxext version averages four
> >       rgb pixels before performing the chroma calculations).
> > ---
> >  tests/checkasm/sw_rgb.c | 89 +++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 89 insertions(+)
>
> Would it be better for bisectability, if you'd swap the order of patches
> 1 and 2?

Indeed it would. I'll change it for the next iteration (or before pushing).
On Wed, Aug 28, 2024 at 11:23 PM Martin Storsjö <martin@martin.st> wrote:
> On Wed, 28 Aug 2024, Ramiro Polla wrote:
>
> > +2:
> > +        // load first line
> > +        ld3             {v16.8b, v17.8b, v18.8b}, [x0], #24
> > +        ld3             {v19.8b, v20.8b, v21.8b}, [x0], #24
>
> Hmm, can't we do just one single ld3 with .16b registers, instead of two
> separate ones?
>
> If you want to keep the same register layout as now, load into v19-v21,
> then do "uxtl v16.8h, v19.8b; uxtl2 v19.8h, v19.16b".

Thanks, that made it faster.

> > +        uxtl            v16.8h, v16.8b  // v16 = B11
> > +        uxtl            v17.8h, v17.8b  // v17 = G11
> > +        uxtl            v18.8h, v18.8b  // v18 = R11
> > +        uxtl            v19.8h, v19.8b  // v19 = B12
> > +        uxtl            v20.8h, v20.8b  // v20 = G12
> > +        uxtl            v21.8h, v21.8b  // v21 = R12
> > +
> > +        // calculate Y values for first line
> > +        rgbconv16       v24, v16, v17, v18, BY, GY, RY  // v24 = Y11
> > +        rgbconv16       v25, v19, v20, v21, BY, GY, RY  // v25 = Y12
> > +
> > +        // pairwise add and save rgb values to calculate average
> > +        addp            v5.8h, v16.8h, v19.8h
> > +        addp            v6.8h, v17.8h, v20.8h
> > +        addp            v7.8h, v18.8h, v21.8h
> > +
> > +        // load second line
> > +        ld3             {v16.8b, v17.8b, v18.8b}, [x10], #24
> > +        ld3             {v19.8b, v20.8b, v21.8b}, [x10], #24
>
> It's a shame we can't start this load earlier. But as essentially
> everything depends on the input as it is, in v16-v21, we'd pretty much
> need to use different registers here in order to do that.
>
> If you wanted to, you could try loading earlier, into different registers
> (I think v26-v31 are free at this point?), while then doing the uxtl into
> the same registers as before, which shouldn't require any further changes.

Thanks, that also led to a small improvement.

New patch attached.

Ramiro
On Thu, 29 Aug 2024, Ramiro Polla wrote:

> On Wed, Aug 28, 2024 at 11:23 PM Martin Storsjö <martin@martin.st> wrote:
>> On Wed, 28 Aug 2024, Ramiro Polla wrote:
>
>>> +2:
>>> +        // load first line
>>> +        ld3             {v16.8b, v17.8b, v18.8b}, [x0], #24
>>> +        ld3             {v19.8b, v20.8b, v21.8b}, [x0], #24
>>
>> Hmm, can't we do just one single ld3 with .16b registers, instead of two
>> separate ones?
>>
>> If you want to keep the same register layout as now, load into v19-v21,
>> then do "uxtl v16.8h, v19.8b; uxtl2 v19.8h, v19.16b".
>
> Thanks, that made it faster.
>
>>> +        uxtl            v16.8h, v16.8b  // v16 = B11
>>> +        uxtl            v17.8h, v17.8b  // v17 = G11
>>> +        uxtl            v18.8h, v18.8b  // v18 = R11
>>> +        uxtl            v19.8h, v19.8b  // v19 = B12
>>> +        uxtl            v20.8h, v20.8b  // v20 = G12
>>> +        uxtl            v21.8h, v21.8b  // v21 = R12
>>> +
>>> +        // calculate Y values for first line
>>> +        rgbconv16       v24, v16, v17, v18, BY, GY, RY  // v24 = Y11
>>> +        rgbconv16       v25, v19, v20, v21, BY, GY, RY  // v25 = Y12
>>> +
>>> +        // pairwise add and save rgb values to calculate average
>>> +        addp            v5.8h, v16.8h, v19.8h
>>> +        addp            v6.8h, v17.8h, v20.8h
>>> +        addp            v7.8h, v18.8h, v21.8h
>>> +
>>> +        // load second line
>>> +        ld3             {v16.8b, v17.8b, v18.8b}, [x10], #24
>>> +        ld3             {v19.8b, v20.8b, v21.8b}, [x10], #24
>>
>> It's a shame we can't start this load earlier. But as essentially
>> everything depends on the input as it is, in v16-v21, we'd pretty much
>> need to use different registers here in order to do that.
>>
>> If you wanted to, you could try loading earlier, into different registers
>> (I think v26-v31 are free at this point?), while then doing the uxtl into
>> the same registers as before, which shouldn't require any further changes.
>
> Thanks, that also led to a small improvement.
>
> New patch attached.

The new version LGTM, thanks!

// Martin
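The `addp` instructions in the quoted asm take the two 8-lane halves of one line's deinterleaved B (or G, or R) values and produce the eight horizontal pair sums; adding the corresponding result for the second line then completes the 2x2 sums used for the chroma average. A scalar C model of the instruction's semantics (illustrative only, not part of the patch; `addp_u16` is a hypothetical helper name):

```c
#include <stdint.h>

/* Scalar model of AArch64 "addp Vd.8h, Vn.8h, Vm.8h": the low half of
 * dst receives the pairwise sums of a's lanes, the high half the
 * pairwise sums of b's lanes. n is the total lane count of dst. */
static void addp_u16(uint16_t *dst, const uint16_t *a, const uint16_t *b, int n)
{
    for (int i = 0; i < n / 2; i++) {
        dst[i]         = a[2 * i] + a[2 * i + 1];
        dst[n / 2 + i] = b[2 * i] + b[2 * i + 1];
    }
}
```

One `addp` per color channel per line thus halves the data width while keeping the horizontal pair sums, which is why the asm can compute the averages with a single extra add against the second line.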
On Wed, Aug 28, 2024 at 10:43 PM Ramiro Polla <ramiro.polla@gmail.com> wrote:
>
> The current code subsamples by dropping 3/4 pixels to calculate the
> chroma components. This commit calculates the average of 4 rgb pixels
> before calculating the chroma components, putting it in line with the
> mmxext implementation.
> ---
>  libswscale/rgb2rgb_template.c | 88 ++++++++++++++++-------------------
>  1 file changed, 41 insertions(+), 47 deletions(-)

The attached new patch also removes another occurrence of the comment
about chrominance data only being taken from every second line. It also
flips the patch order so this is the first one in the patchset (and
therefore checkasm doesn't fail on x86).

I'll push the patchset in a couple of days if there are no more comments.
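The averaging described in the commit message, replacing the old "drop 3 of 4 pixels" subsampling, can be sketched in scalar C as below. This is a simplified model, not the patch's code: the rounding here is an assumption, and the real implementation feeds the averages into fixed-point chroma formulas driven by the rgb2yuv table.

```c
#include <stdint.h>

/* Average one 2x2 block of packed RGB24 pixels before the chroma
 * computation. row0/row1 point at the top-left pixel of the block in
 * two consecutive source lines; layout assumed R,G,B, 3 bytes/pixel.
 * Rounds to nearest by adding 2 before the divide-by-4 shift. */
static void avg2x2_rgb24(const uint8_t *row0, const uint8_t *row1,
                         int *r, int *g, int *b)
{
    *r = (row0[0] + row0[3] + row1[0] + row1[3] + 2) >> 2;
    *g = (row0[1] + row0[4] + row1[1] + row1[4] + 2) >> 2;
    *b = (row0[2] + row0[5] + row1[2] + row1[5] + 2) >> 2;
}
```

Averaging before the conversion is what makes the C path agree (to within rounding) with the mmxext implementation, which is why the checkasm test can then pass on x86.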
diff --git a/tests/checkasm/sw_rgb.c b/tests/checkasm/sw_rgb.c
index f278454d3d..ff84ddf5ef 100644
--- a/tests/checkasm/sw_rgb.c
+++ b/tests/checkasm/sw_rgb.c
@@ -114,6 +114,92 @@ static void check_uyvy_to_422p(void)
     }
 }
 
+#define NUM_LINES 5
+#define MAX_LINE_SIZE 1920
+#define BUFSIZE (NUM_LINES * MAX_LINE_SIZE)
+
+static int cmp_off_by_n(const uint8_t *ref, const uint8_t *test, size_t n, int accuracy)
+{
+    for (size_t i = 0; i < n; i++) {
+        if (abs(ref[i] - test[i]) > accuracy)
+            return 1;
+    }
+    return 0;
+}
+
+static void check_rgb24toyv12(struct SwsContext *ctx)
+{
+    static const int input_sizes[] = {16, 128, 512, MAX_LINE_SIZE, -MAX_LINE_SIZE};
+
+    LOCAL_ALIGNED_32(uint8_t, src, [BUFSIZE * 3]);
+    LOCAL_ALIGNED_32(uint8_t, buf_y_0, [BUFSIZE]);
+    LOCAL_ALIGNED_32(uint8_t, buf_y_1, [BUFSIZE]);
+    LOCAL_ALIGNED_32(uint8_t, buf_u_0, [BUFSIZE / 4]);
+    LOCAL_ALIGNED_32(uint8_t, buf_u_1, [BUFSIZE / 4]);
+    LOCAL_ALIGNED_32(uint8_t, buf_v_0, [BUFSIZE / 4]);
+    LOCAL_ALIGNED_32(uint8_t, buf_v_1, [BUFSIZE / 4]);
+
+    declare_func(void, const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+                 uint8_t *vdst, int width, int height, int lumStride,
+                 int chromStride, int srcStride, int32_t *rgb2yuv);
+
+    randomize_buffers(src, BUFSIZE * 3);
+
+    for (int isi = 0; isi < FF_ARRAY_ELEMS(input_sizes); isi++) {
+        int input_size = input_sizes[isi];
+        int negstride = input_size < 0;
+        const char *negstride_str = negstride ? "_negstride" : "";
+        int width = FFABS(input_size);
+        int linesize = width + 32;
+        /* calculate height based on specified width to use the entire buffer. */
+        int height = (BUFSIZE / linesize) & ~1;
+        uint8_t *src0 = src;
+        uint8_t *src1 = src;
+        uint8_t *dst_y_0 = buf_y_0;
+        uint8_t *dst_y_1 = buf_y_1;
+        uint8_t *dst_u_0 = buf_u_0;
+        uint8_t *dst_u_1 = buf_u_1;
+        uint8_t *dst_v_0 = buf_v_0;
+        uint8_t *dst_v_1 = buf_v_1;
+
+        if (negstride) {
+            src0 += (height - 1) * (linesize * 3);
+            src1 += (height - 1) * (linesize * 3);
+            dst_y_0 += (height - 1) * linesize;
+            dst_y_1 += (height - 1) * linesize;
+            dst_u_0 += ((height / 2) - 1) * (linesize / 2);
+            dst_u_1 += ((height / 2) - 1) * (linesize / 2);
+            dst_v_0 += ((height / 2) - 1) * (linesize / 2);
+            dst_v_1 += ((height / 2) - 1) * (linesize / 2);
+            linesize *= -1;
+        }
+
+        if (check_func(ff_rgb24toyv12, "rgb24toyv12_%d_%d%s", width, height, negstride_str)) {
+            memset(buf_y_0, 0xFF, BUFSIZE);
+            memset(buf_y_1, 0xFF, BUFSIZE);
+            memset(buf_u_0, 0xFF, BUFSIZE / 4);
+            memset(buf_u_1, 0xFF, BUFSIZE / 4);
+            memset(buf_v_0, 0xFF, BUFSIZE / 4);
+            memset(buf_v_1, 0xFF, BUFSIZE / 4);
+
+            call_ref(src0, dst_y_0, dst_u_0, dst_v_0, width, height,
+                     linesize, linesize / 2, linesize * 3, ctx->input_rgb2yuv_table);
+            call_new(src1, dst_y_1, dst_u_1, dst_v_1, width, height,
+                     linesize, linesize / 2, linesize * 3, ctx->input_rgb2yuv_table);
+            if (cmp_off_by_n(buf_y_0, buf_y_1, BUFSIZE, 1) ||
+                cmp_off_by_n(buf_u_0, buf_u_1, BUFSIZE / 4, 1) ||
+                cmp_off_by_n(buf_v_0, buf_v_1, BUFSIZE / 4, 1))
+                fail();
+            bench_new(src1, dst_y_1, dst_u_1, dst_v_1, width, height,
+                      linesize, linesize / 2, linesize * 3, ctx->input_rgb2yuv_table);
+        }
+    }
+}
+
+#undef NUM_LINES
+#undef MAX_LINE_SIZE
+#undef BUFSIZE
+
 static void check_interleave_bytes(void)
 {
     LOCAL_ALIGNED_16(uint8_t, src0_buf, [MAX_STRIDE*MAX_HEIGHT+1]);
@@ -327,5 +413,8 @@ void checkasm_check_sw_rgb(void)
     check_rgb_to_uv(ctx);
     report("rgb_to_uv");
 
+    check_rgb24toyv12(ctx);
+    report("rgb24toyv12");
+
    sws_freeContext(ctx);
 }
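The test above compares each plane against the C reference with `cmp_off_by_n`, tolerating a per-byte difference of 1 so that rounding differences between implementations don't fail the run. Its behaviour can be checked in isolation (same logic as the helper in the patch, reproduced standalone):

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero iff any byte of test deviates from ref by more than
 * `accuracy`; a return of 0 means the buffers match within tolerance. */
static int cmp_off_by_n(const uint8_t *ref, const uint8_t *test,
                        size_t n, int accuracy)
{
    for (size_t i = 0; i < n; i++) {
        if (abs(ref[i] - test[i]) > accuracy)
            return 1;
    }
    return 0;
}
```

With `accuracy = 1`, a buffer whose bytes are all within +/-1 of the reference passes, while a single byte off by 2 or more fails, matching how the checkasm test treats the x86 and C rgb24toyv12 outputs.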