
[FFmpeg-devel,1/4] checkasm/sw_rgb: add rgb24toyv12 tests

Message ID 20240828204303.5801-1-ramiro.polla@gmail.com
State New
Series [FFmpeg-devel,1/4] checkasm/sw_rgb: add rgb24toyv12 tests

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 fail Make fate failed

Commit Message

Ramiro Polla Aug. 28, 2024, 8:43 p.m. UTC
NOTE: currently the tests for rgb24toyv12 fail for x86 since the c and
      mmxext implementations differ (the mmxext version averages four
      rgb pixels before performing the chroma calculations).
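The behavioral difference can be sketched in C (illustrative only: simplified BT.601-style integer coefficients, not libswscale's actual fixed-point tables; `chroma_u`, `u_drop`, and `u_average` are hypothetical helper names):

```c
#include <stdint.h>

/* Illustrative U (Cb) from one RGB pixel, with rounding.
 * Coefficients approximate BT.601; not the libswscale tables. */
static uint8_t chroma_u(int r, int g, int b)
{
    return (uint8_t)((-38 * r - 74 * g + 112 * b + 128 * 256 + 128) >> 8);
}

/* C path under test: chroma taken from one pixel of each 2x2 block,
 * the other three are dropped. */
static uint8_t u_drop(const uint8_t rgb[4][3])
{
    return chroma_u(rgb[0][0], rgb[0][1], rgb[0][2]);
}

/* mmxext path: average the four RGB pixels first, then convert. */
static uint8_t u_average(const uint8_t rgb[4][3])
{
    int r = 0, g = 0, b = 0;
    for (int i = 0; i < 4; i++) {
        r += rgb[i][0];
        g += rgb[i][1];
        b += rgb[i][2];
    }
    return chroma_u((r + 2) / 4, (g + 2) / 4, (b + 2) / 4);
}
```

For a uniform 2x2 block both paths agree, but any block with intra-block color variation can produce different chroma, which is why the checkasm test fails until the C code is changed to match.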
---
 tests/checkasm/sw_rgb.c | 89 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

Comments

Martin Storsjö Aug. 28, 2024, 9:13 p.m. UTC | #1
On Wed, 28 Aug 2024, Ramiro Polla wrote:

> NOTE: currently the tests for rgb24toyv12 fail for x86 since the c and
>      mmxext implementations differ (the mmxext version averages four
>      rgb pixels before performing the chroma calculations).
> ---
> tests/checkasm/sw_rgb.c | 89 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 89 insertions(+)

Would it be better for bisectability, if you'd swap the order of patches 
1 and 2?

// Martin
Ramiro Polla Aug. 28, 2024, 9:21 p.m. UTC | #2
On Wed, Aug 28, 2024 at 11:13 PM Martin Storsjö <martin@martin.st> wrote:
> On Wed, 28 Aug 2024, Ramiro Polla wrote:
>
> > NOTE: currently the tests for rgb24toyv12 fail for x86 since the c and
> >      mmxext implementations differ (the mmxext version averages four
> >      rgb pixels before performing the chroma calculations).
> > ---
> > tests/checkasm/sw_rgb.c | 89 +++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 89 insertions(+)
>
> Would it be better for bisectability, if you'd swap the order of patches
> 1 and 2?

Indeed it would. I'll change it for the next iteration (or before pushing).
Ramiro Polla Aug. 28, 2024, 10:20 p.m. UTC | #3
On Wed, Aug 28, 2024 at 11:23 PM Martin Storsjö <martin@martin.st> wrote:
> On Wed, 28 Aug 2024, Ramiro Polla wrote:

> > +2:
> > +        // load first line
> > +        ld3             {v16.8b, v17.8b, v18.8b}, [x0], #24
> > +        ld3             {v19.8b, v20.8b, v21.8b}, [x0], #24
>
> Hmm, can't we do just one single ld3 with .16b registers, instead of two
> separate ones?
>
> If you want to keep the same register layout as now, load into v19-v21,
> then do "uxtl v16.8h, v19.8b; uxtl2 v19.8h, v19.16b".

Thanks, that made it faster.
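For readers following along, the suggested restructuring amounts to something like this (a sketch in the same aarch64 assembly as the quoted patch, keeping the quoted v16-v21 register layout; offsets assume the same 16-pixel-per-iteration loop):

```
        // one ld3 loads 16 pixels of each component (48 bytes total)
        ld3             {v19.16b, v20.16b, v21.16b}, [x0], #48
        // widen low/high halves, preserving the previous register layout
        uxtl            v16.8h, v19.8b              // v16 = B11
        uxtl2           v19.8h, v19.16b             // v19 = B12
        uxtl            v17.8h, v20.8b              // v17 = G11
        uxtl2           v20.8h, v20.16b             // v20 = G12
        uxtl            v18.8h, v21.8b              // v18 = R11
        uxtl2           v21.8h, v21.16b             // v21 = R12
```

This replaces two ld3 instructions with one and lets the uxtl/uxtl2 pairs do the deinterleaved widening.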

> > +        uxtl            v16.8h, v16.8b              // v16 = B11
> > +        uxtl            v17.8h, v17.8b              // v17 = G11
> > +        uxtl            v18.8h, v18.8b              // v18 = R11
> > +        uxtl            v19.8h, v19.8b              // v19 = B12
> > +        uxtl            v20.8h, v20.8b              // v20 = G12
> > +        uxtl            v21.8h, v21.8b              // v21 = R12
> > +
> > +        // calculate Y values for first line
> > +        rgbconv16       v24, v16, v17, v18, BY, GY, RY // v24 = Y11
> > +        rgbconv16       v25, v19, v20, v21, BY, GY, RY // v25 = Y12
> > +
> > +        // pairwise add and save rgb values to calculate average
> > +        addp            v5.8h, v16.8h, v19.8h
> > +        addp            v6.8h, v17.8h, v20.8h
> > +        addp            v7.8h, v18.8h, v21.8h
> > +
> > +        // load second line
> > +        ld3             {v16.8b, v17.8b, v18.8b}, [x10], #24
> > +        ld3             {v19.8b, v20.8b, v21.8b}, [x10], #24
>
> It's a shame we can't start this load earlier. But as essentially
> everything depends on the input as it is, in v16-v21, we'd pretty much
> need to use different registers here in order to do that.
>
> If you wanted to, you could try loading earlier, into different registers
> (I think v26-v31 are free at this point?), while then doing the uxtl into
> the same registers as before, which shouldn't require any further changes.

Thanks, that also led to a small improvement.

New patch attached.

Ramiro
Martin Storsjö Aug. 29, 2024, 11:22 a.m. UTC | #4
On Thu, 29 Aug 2024, Ramiro Polla wrote:

> On Wed, Aug 28, 2024 at 11:23 PM Martin Storsjö <martin@martin.st> wrote:
>> On Wed, 28 Aug 2024, Ramiro Polla wrote:
>
>>> +2:
>>> +        // load first line
>>> +        ld3             {v16.8b, v17.8b, v18.8b}, [x0], #24
>>> +        ld3             {v19.8b, v20.8b, v21.8b}, [x0], #24
>>
>> Hmm, can't we do just one single ld3 with .16b registers, instead of two
>> separate ones?
>>
>> If you want to keep the same register layout as now, load into v19-v21,
>> then do "uxtl v16.8h, v19.8b; uxtl2 v19.8h, v19.16b".
>
> Thanks, that made it faster.
>
>>> +        uxtl            v16.8h, v16.8b              // v16 = B11
>>> +        uxtl            v17.8h, v17.8b              // v17 = G11
>>> +        uxtl            v18.8h, v18.8b              // v18 = R11
>>> +        uxtl            v19.8h, v19.8b              // v19 = B12
>>> +        uxtl            v20.8h, v20.8b              // v20 = G12
>>> +        uxtl            v21.8h, v21.8b              // v21 = R12
>>> +
>>> +        // calculate Y values for first line
>>> +        rgbconv16       v24, v16, v17, v18, BY, GY, RY // v24 = Y11
>>> +        rgbconv16       v25, v19, v20, v21, BY, GY, RY // v25 = Y12
>>> +
>>> +        // pairwise add and save rgb values to calculate average
>>> +        addp            v5.8h, v16.8h, v19.8h
>>> +        addp            v6.8h, v17.8h, v20.8h
>>> +        addp            v7.8h, v18.8h, v21.8h
>>> +
>>> +        // load second line
>>> +        ld3             {v16.8b, v17.8b, v18.8b}, [x10], #24
>>> +        ld3             {v19.8b, v20.8b, v21.8b}, [x10], #24
>>
>> It's a shame we can't start this load earlier. But as essentially
>> everything depends on the input as it is, in v16-v21, we'd pretty much
>> need to use different registers here in order to do that.
>>
>> If you wanted to, you could try loading earlier, into different registers
>> (I think v26-v31 are free at this point?), while then doing the uxtl into
>> the same registers as before, which shouldn't require any further changes.
>
> Thanks, that also led to a small improvement.
>
> New patch attached.

The new version LGTM, thanks!

// Martin
Ramiro Polla Sept. 3, 2024, 3:40 p.m. UTC | #5
On Wed, Aug 28, 2024 at 10:43 PM Ramiro Polla <ramiro.polla@gmail.com> wrote:
>
> The current code subsamples by dropping 3/4 pixels to calculate the
> chroma components. This commit calculates the average of 4 rgb pixels
> before calculating the chroma components, putting it in line with the
> mmxext implementation.
> ---
>  libswscale/rgb2rgb_template.c | 88 ++++++++++++++++-------------------
>  1 file changed, 41 insertions(+), 47 deletions(-)

The attached new patch removes another occurrence of the comment about
chrominance data only being taken from every second line.
It also flips the patch order so that this is the first patch in the
patchset (and therefore checkasm doesn't fail on x86).

I'll push the patchset in a couple of days if there are no more comments.

Patch

diff --git a/tests/checkasm/sw_rgb.c b/tests/checkasm/sw_rgb.c
index f278454d3d..ff84ddf5ef 100644
--- a/tests/checkasm/sw_rgb.c
+++ b/tests/checkasm/sw_rgb.c
@@ -114,6 +114,92 @@  static void check_uyvy_to_422p(void)
     }
 }
 
+#define NUM_LINES 5
+#define MAX_LINE_SIZE 1920
+#define BUFSIZE (NUM_LINES * MAX_LINE_SIZE)
+
+static int cmp_off_by_n(const uint8_t *ref, const uint8_t *test, size_t n, int accuracy)
+{
+    for (size_t i = 0; i < n; i++) {
+        if (abs(ref[i] - test[i]) > accuracy)
+            return 1;
+    }
+    return 0;
+}
+
+static void check_rgb24toyv12(struct SwsContext *ctx)
+{
+    static const int input_sizes[] = {16, 128, 512, MAX_LINE_SIZE, -MAX_LINE_SIZE};
+
+    LOCAL_ALIGNED_32(uint8_t, src, [BUFSIZE * 3]);
+    LOCAL_ALIGNED_32(uint8_t, buf_y_0, [BUFSIZE]);
+    LOCAL_ALIGNED_32(uint8_t, buf_y_1, [BUFSIZE]);
+    LOCAL_ALIGNED_32(uint8_t, buf_u_0, [BUFSIZE / 4]);
+    LOCAL_ALIGNED_32(uint8_t, buf_u_1, [BUFSIZE / 4]);
+    LOCAL_ALIGNED_32(uint8_t, buf_v_0, [BUFSIZE / 4]);
+    LOCAL_ALIGNED_32(uint8_t, buf_v_1, [BUFSIZE / 4]);
+
+    declare_func(void, const uint8_t *src, uint8_t *ydst, uint8_t *udst,
+                       uint8_t *vdst, int width, int height, int lumStride,
+                       int chromStride, int srcStride, int32_t *rgb2yuv);
+
+    randomize_buffers(src, BUFSIZE * 3);
+
+    for (int isi = 0; isi < FF_ARRAY_ELEMS(input_sizes); isi++) {
+        int input_size = input_sizes[isi];
+        int negstride = input_size < 0;
+        const char *negstride_str = negstride ? "_negstride" : "";
+        int width = FFABS(input_size);
+        int linesize = width + 32;
+        /* calculate height based on specified width to use the entire buffer. */
+        int height = (BUFSIZE / linesize) & ~1;
+        uint8_t *src0 = src;
+        uint8_t *src1 = src;
+        uint8_t *dst_y_0 = buf_y_0;
+        uint8_t *dst_y_1 = buf_y_1;
+        uint8_t *dst_u_0 = buf_u_0;
+        uint8_t *dst_u_1 = buf_u_1;
+        uint8_t *dst_v_0 = buf_v_0;
+        uint8_t *dst_v_1 = buf_v_1;
+
+        if (negstride) {
+            src0    += (height - 1) * (linesize * 3);
+            src1    += (height - 1) * (linesize * 3);
+            dst_y_0 += (height - 1) * linesize;
+            dst_y_1 += (height - 1) * linesize;
+            dst_u_0 += ((height / 2) - 1) * (linesize / 2);
+            dst_u_1 += ((height / 2) - 1) * (linesize / 2);
+            dst_v_0 += ((height / 2) - 1) * (linesize / 2);
+            dst_v_1 += ((height / 2) - 1) * (linesize / 2);
+            linesize *= -1;
+        }
+
+        if (check_func(ff_rgb24toyv12, "rgb24toyv12_%d_%d%s", width, height, negstride_str)) {
+            memset(buf_y_0, 0xFF, BUFSIZE);
+            memset(buf_y_1, 0xFF, BUFSIZE);
+            memset(buf_u_0, 0xFF, BUFSIZE / 4);
+            memset(buf_u_1, 0xFF, BUFSIZE / 4);
+            memset(buf_v_0, 0xFF, BUFSIZE / 4);
+            memset(buf_v_1, 0xFF, BUFSIZE / 4);
+
+            call_ref(src0, dst_y_0, dst_u_0, dst_v_0, width, height,
+                     linesize, linesize / 2, linesize * 3, ctx->input_rgb2yuv_table);
+            call_new(src1, dst_y_1, dst_u_1, dst_v_1, width, height,
+                     linesize, linesize / 2, linesize * 3, ctx->input_rgb2yuv_table);
+            if (cmp_off_by_n(buf_y_0, buf_y_1, BUFSIZE, 1) ||
+                cmp_off_by_n(buf_u_0, buf_u_1, BUFSIZE / 4, 1) ||
+                cmp_off_by_n(buf_v_0, buf_v_1, BUFSIZE / 4, 1))
+                fail();
+            bench_new(src1, dst_y_1, dst_u_1, dst_v_1, width, height,
+                      linesize, linesize / 2, linesize * 3, ctx->input_rgb2yuv_table);
+        }
+    }
+}
+
+#undef NUM_LINES
+#undef MAX_LINE_SIZE
+#undef BUFSIZE
+
 static void check_interleave_bytes(void)
 {
     LOCAL_ALIGNED_16(uint8_t, src0_buf, [MAX_STRIDE*MAX_HEIGHT+1]);
@@ -327,5 +413,8 @@  void checkasm_check_sw_rgb(void)
     check_rgb_to_uv(ctx);
     report("rgb_to_uv");
 
+    check_rgb24toyv12(ctx);
+    report("rgb24toyv12");
+
     sws_freeContext(ctx);
 }