From patchwork Sat Aug 13 20:55:55 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Swinney, Jonathan" <jswinney@amazon.com>
X-Patchwork-Id: 37258
From: "Swinney, Jonathan" <jswinney@amazon.com>
To: "ffmpeg-devel@ffmpeg.org" <ffmpeg-devel@ffmpeg.org>
Thread-Topic: [PATCH v3 1/3] checkasm: updated tests for sw_scale
Thread-Index: AdivVusAsr+hqURDQGaVD4fKOSqJyg==
Date: Sat, 13 Aug 2022 20:55:55 +0000
Message-ID: <859182400d774ed6a80829087b578b8e@amazon.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.43.162.134]
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH v3 1/3] checkasm: updated tests for sw_scale
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st>,
 Hubert Mazur <hum@semihalf.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: WYvmlerZxkur

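- add a test for yuv2plane1
- fix the yuv2planeX test on aarch64, where it previously did not work
  at all
- update the yuv2planeX test to require exact results when
  SWS_ACCURATE_RND is set and to accept results within 2 of the
  reference otherwise
- x86: do not select the SIMD yuv2plane1 functions when
  SWS_ACCURATE_RND is set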

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
---
 libswscale/x86/swscale.c  |   8 +-
 tests/checkasm/sw_scale.c | 188 ++++++++++++++++++++++++++++++--------
 2 files changed, 154 insertions(+), 42 deletions(-)
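
For readers following the test changes, here is a minimal standalone sketch
of the two ideas in the updated yuv2planeX test: a scalar reference modeled
on yuv2planeX_8_ref, and a comparison that is exact when the tolerance is 0
and approximate otherwise. The names planeX_8_model and clip_uint8 are
hypothetical stand-ins (clip_uint8 replaces av_clip_uint8 so the sketch
builds without FFmpeg). It is an illustration only, not the code under test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Clamp to 0..255; hypothetical stand-in for FFmpeg's av_clip_uint8(). */
uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar model of the 8-bit vertical filter, following yuv2planeX_8_ref
 * from this patch: dither is scaled by 1 << 12, the filter taps are
 * accumulated, and the sum is shifted down by 19 bits. */
void planeX_8_model(const int16_t *filter, int filter_size,
                    const int16_t **src, uint8_t *dest, int dst_w,
                    const uint8_t *dither, int offset)
{
    assert(filter_size > 0);
    for (int i = 0; i < dst_w; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filter_size; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_uint8(val >> 19);
    }
}

/* Returns 1 if any byte differs by more than `accuracy`, 0 otherwise.
 * accuracy == 0 is an exact comparison. */
int cmp_off_by_n(const uint8_t *ref, const uint8_t *test, size_t n,
                 int accuracy)
{
    for (size_t i = 0; i < n; i++)
        if (abs(ref[i] - test[i]) > accuracy)
            return 1;
    return 0;
}
```

If the coefficients sum to 1 << 12 and every source sample is p << 7, a zero
dither yields p back, since p * 2^7 * 2^12 = p * 2^19. The test in this patch
calls the comparison with accuracy 0 when SWS_ACCURATE_RND is set and with
accuracy 2 otherwise.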

diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 628f12137c..32d441245d 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -534,7 +534,8 @@ switch(c->dstBpc){ \
         ASSIGN_SSE_SCALE_FUNC(c->hcScale, c->hChrFilterSize, sse2, sse2);
         ASSIGN_VSCALEX_FUNC(c->yuv2planeX, sse2, ,
                             HAVE_ALIGNED_STACK || ARCH_X86_64);
-        ASSIGN_VSCALE_FUNC(c->yuv2plane1, sse2);
+        if (!(c->flags & SWS_ACCURATE_RND))
+            ASSIGN_VSCALE_FUNC(c->yuv2plane1, sse2);
 
         switch (c->srcFormat) {
         case AV_PIX_FMT_YA8:
@@ -583,14 +584,15 @@ switch(c->dstBpc){ \
         ASSIGN_VSCALEX_FUNC(c->yuv2planeX, sse4,
                             if (!isBE(c->dstFormat)) c->yuv2planeX = ff_yuv2planeX_16_sse4,
                             HAVE_ALIGNED_STACK || ARCH_X86_64);
-        if (c->dstBpc == 16 && !isBE(c->dstFormat))
+        if (c->dstBpc == 16 && !isBE(c->dstFormat) && !(c->flags & SWS_ACCURATE_RND))
             c->yuv2plane1 = ff_yuv2plane1_16_sse4;
     }
 
     if (EXTERNAL_AVX(cpu_flags)) {
         ASSIGN_VSCALEX_FUNC(c->yuv2planeX, avx, ,
                             HAVE_ALIGNED_STACK || ARCH_X86_64);
-        ASSIGN_VSCALE_FUNC(c->yuv2plane1, avx);
+        if (!(c->flags & SWS_ACCURATE_RND))
+            ASSIGN_VSCALE_FUNC(c->yuv2plane1, avx);
 
         switch (c->srcFormat) {
         case AV_PIX_FMT_YUYV422:
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index b643a47c30..859993db6f 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -35,40 +35,140 @@
             AV_WN32(buf + j, rnd());      \
     } while (0)
 
-// This reference function is the same approximate algorithm employed by the
-// SIMD functions
-static void ref_function(const int16_t *filter, int filterSize,
-                                                 const int16_t **src, uint8_t *dest, int dstW,
-                                                 const uint8_t *dither, int offset)
+static void yuv2planeX_8_ref(const int16_t *filter, int filterSize,
+                             const int16_t **src, uint8_t *dest, int dstW,
+                             const uint8_t *dither, int offset)
 {
-    int i, d;
-    d = ((filterSize - 1) * 8 + dither[0]) >> 4;
-    for ( i = 0; i < dstW; i++) {
-        int16_t val = d;
+    // This corresponds to the yuv2planeX_8_c function
+    int i;
+    for (i = 0; i < dstW; i++) {
+        int val = dither[(i + offset) & 7] << 12;
         int j;
-        union {
-            int val;
-            int16_t v[2];
-        } t;
-        for (j = 0; j < filterSize; j++){
-            t.val = (int)src[j][i + offset] * (int)filter[j];
-            val += t.v[1];
+        for (j = 0; j < filterSize; j++)
+            val += src[j][i] * filter[j];
+
+        dest[i]= av_clip_uint8(val >> 19);
+    }
+}
+
+static int cmp_off_by_n(const uint8_t *ref, const uint8_t *test, size_t n, int accuracy)
+{
+    for (size_t i = 0; i < n; i++) {
+        if (abs(ref[i] - test[i]) > accuracy)
+            return 1;
+    }
+    return 0;
+}
+
+static void print_data(uint8_t *p, size_t len, size_t offset)
+{
+    size_t i = 0;
+    for (; i < len; i++) {
+        if (i % 8 == 0) {
+            printf("0x%04zx: ", i+offset);
+        }
+        printf("0x%02x ", (uint32_t) p[i]);
+        if (i % 8 == 7) {
+            printf("\n");
         }
-        dest[i]= av_clip_uint8(val>>3);
     }
+    if (i % 8 != 0) {
+        printf("\n");
+    }
+}
+
+static size_t show_differences(uint8_t *a, uint8_t *b, size_t len)
+{
+    for (size_t i = 0; i < len; i++) {
+        if (a[i] != b[i]) {
+            size_t offset_of_mismatch = i;
+            size_t offset;
+            if (i >= 8) i-=8;
+            offset = i & (~7);
+            printf("test a:\n");
+            print_data(&a[offset], 32, offset);
+            printf("\ntest b:\n");
+            print_data(&b[offset], 32, offset);
+            printf("\n");
+            return offset_of_mismatch;
+        }
+    }
+    return len;
 }
 
-static void check_yuv2yuvX(void)
+static void check_yuv2yuv1(int accurate)
+{
+    struct SwsContext *ctx;
+    int osi, isi;
+    int dstW, offset;
+    size_t fail_offset;
+    const int input_sizes[] = {8, 24, 128, 144, 256, 512};
+    const int INPUT_SIZES = sizeof(input_sizes)/sizeof(input_sizes[0]);
+    #define LARGEST_INPUT_SIZE 512
+
+    const int offsets[] = {0, 3, 8, 11, 16, 19};
+    const int OFFSET_SIZES = sizeof(offsets)/sizeof(offsets[0]);
+    const char *accurate_str = (accurate) ? "accurate" : "approximate";
+
+    declare_func_emms(AV_CPU_FLAG_MMX, void,
+                      const int16_t *src, uint8_t *dest,
+                      int dstW, const uint8_t *dither, int offset);
+
+    LOCAL_ALIGNED_8(int16_t, src_pixels, [LARGEST_INPUT_SIZE]);
+    LOCAL_ALIGNED_8(uint8_t, dst0, [LARGEST_INPUT_SIZE]);
+    LOCAL_ALIGNED_8(uint8_t, dst1, [LARGEST_INPUT_SIZE]);
+    LOCAL_ALIGNED_8(uint8_t, dither, [8]);
+
+    randomize_buffers((uint8_t*)dither, 8);
+    randomize_buffers((uint8_t*)src_pixels, LARGEST_INPUT_SIZE * sizeof(int16_t));
+    ctx = sws_alloc_context();
+    if (accurate)
+        ctx->flags |= SWS_ACCURATE_RND;
+    if (sws_init_context(ctx, NULL, NULL) < 0)
+        fail();
+
+    ff_sws_init_scale(ctx);
+    for (isi = 0; isi < INPUT_SIZES; ++isi) {
+        dstW = input_sizes[isi];
+        for (osi = 0; osi < OFFSET_SIZES; osi++) {
+            offset = offsets[osi];
+            if (check_func(ctx->yuv2plane1, "yuv2yuv1_%d_%d_%s", offset, dstW, accurate_str)){
+                memset(dst0, 0, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
+                memset(dst1, 0, LARGEST_INPUT_SIZE * sizeof(dst1[0]));
+
+                call_ref(src_pixels, dst0, dstW, dither, offset);
+                call_new(src_pixels, dst1, dstW, dither, offset);
+                if (cmp_off_by_n(dst0, dst1, dstW * sizeof(dst0[0]), accurate ? 0 : 2)) {
+                    fail();
+                    printf("failed: yuv2yuv1_%d_%d_%s\n", offset, dstW, accurate_str);
+                    fail_offset = show_differences(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
+                    printf("failing values: src: 0x%04x dither: 0x%02x dst-c: %02x dst-asm: %02x\n",
+                            (int) src_pixels[fail_offset],
+                            (int) dither[(fail_offset + offset) & 7],
+                            (int) dst0[fail_offset],
+                            (int) dst1[fail_offset]);
+                }
+                if(dstW == LARGEST_INPUT_SIZE)
+                    bench_new(src_pixels, dst1, dstW, dither, offset);
+            }
+        }
+    }
+    sws_freeContext(ctx);
+}
+
+static void check_yuv2yuvX(int accurate)
 {
     struct SwsContext *ctx;
     int fsi, osi, isi, i, j;
     int dstW;
 #define LARGEST_FILTER 16
-#define FILTER_SIZES 4
-    static const int filter_sizes[FILTER_SIZES] = {1, 4, 8, 16};
+    // ff_yuv2planeX_8_sse2 can't handle odd filter sizes
+    const int filter_sizes[] = {2, 4, 8, 16};
+    const int FILTER_SIZES = sizeof(filter_sizes)/sizeof(filter_sizes[0]);
 #define LARGEST_INPUT_SIZE 512
-#define INPUT_SIZES 6
-    static const int input_sizes[INPUT_SIZES] = {8, 24, 128, 144, 256, 512};
+    static const int input_sizes[] = {8, 24, 128, 144, 256, 512};
+    const int INPUT_SIZES = sizeof(input_sizes)/sizeof(input_sizes[0]);
+    const char *accurate_str = (accurate) ? "accurate" : "approximate";
 
     declare_func_emms(AV_CPU_FLAG_MMX, void, const int16_t *filter,
                       int filterSize, const int16_t **src, uint8_t *dest,
@@ -89,6 +189,8 @@ static void check_yuv2yuvX(void)
     randomize_buffers((uint8_t*)src_pixels, LARGEST_FILTER * LARGEST_INPUT_SIZE * sizeof(int16_t));
     randomize_buffers((uint8_t*)filter_coeff, LARGEST_FILTER * sizeof(int16_t));
     ctx = sws_alloc_context();
+    if (accurate)
+        ctx->flags |= SWS_ACCURATE_RND;
     if (sws_init_context(ctx, NULL, NULL) < 0)
         fail();
 
@@ -96,33 +198,37 @@ static void check_yuv2yuvX(void)
     for(isi = 0; isi < INPUT_SIZES; ++isi){
         dstW = input_sizes[isi];
         for(osi = 0; osi < 64; osi += 16){
-            for(fsi = 0; fsi < FILTER_SIZES; ++fsi){
+            if (dstW <= osi)
+                continue;
+            for (fsi = 0; fsi < FILTER_SIZES; ++fsi) {
                 src = av_malloc(sizeof(int16_t*) * filter_sizes[fsi]);
                 vFilterData = av_malloc((filter_sizes[fsi] + 2) * sizeof(union VFilterData));
                 memset(vFilterData, 0, (filter_sizes[fsi] + 2) * sizeof(union VFilterData));
-                for(i = 0; i < filter_sizes[fsi]; ++i){
+                for (i = 0; i < filter_sizes[fsi]; ++i) {
                     src[i] = &src_pixels[i * LARGEST_INPUT_SIZE];
-                    vFilterData[i].src = src[i];
+                    vFilterData[i].src = src[i] - osi;
                     for(j = 0; j < 4; ++j)
                         vFilterData[i].coeff[j + 4] = filter_coeff[i];
                 }
-                if (check_func(ctx->yuv2planeX, "yuv2yuvX_%d_%d_%d", filter_sizes[fsi], osi, dstW)){
+                if (check_func(ctx->yuv2planeX, "yuv2yuvX_%d_%d_%d_%s", filter_sizes[fsi], osi, dstW, accurate_str)){
+                    // use vFilterData for the mmx function
+                    const int16_t *filter = ctx->use_mmx_vfilter ? (const int16_t*)vFilterData : &filter_coeff[0];
                     memset(dst0, 0, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
                     memset(dst1, 0, LARGEST_INPUT_SIZE * sizeof(dst1[0]));
 
-                    // The reference function is not the scalar function selected when mmx
-                    // is deactivated as the SIMD functions do not give the same result as
-                    // the scalar ones due to rounding. The SIMD functions are activated by
-                    // the flag SWS_ACCURATE_RND
-                    ref_function(&filter_coeff[0], filter_sizes[fsi], src, dst0, dstW - osi, dither, osi);
-                    // There's no point in calling new for the reference function
-                    if(ctx->use_mmx_vfilter){
-                        call_new((const int16_t*)vFilterData, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
-                        if (memcmp(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0])))
-                            fail();
-                        if(dstW == LARGEST_INPUT_SIZE)
-                            bench_new((const int16_t*)vFilterData, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
+                    // We can't use call_ref here, because we don't know if use_mmx_vfilter was set for that
+                    // function or not, so we can't pass it the parameters correctly.
+                    yuv2planeX_8_ref(&filter_coeff[0], filter_sizes[fsi], src, dst0, dstW - osi, dither, osi);
+
+                    call_new(filter, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
+                    if (cmp_off_by_n(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]), accurate ? 0 : 2)) {
+                        fail();
+                        printf("failed: yuv2yuvX_%d_%d_%d_%s\n", filter_sizes[fsi], osi, dstW, accurate_str);
+                        show_differences(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
                     }
+                    if(dstW == LARGEST_INPUT_SIZE)
+                        bench_new(filter, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
+
                 }
                 av_freep(&src);
                 av_freep(&vFilterData);
@@ -245,6 +351,10 @@ void checkasm_check_sw_scale(void)
 {
     check_hscale();
     report("hscale");
-    check_yuv2yuvX();
+    check_yuv2yuv1(0);
+    check_yuv2yuv1(1);
+    report("yuv2yuv1");
+    check_yuv2yuvX(0);
+    check_yuv2yuvX(1);
     report("yuv2yuvX");
 }

From patchwork Sat Aug 13 20:56:02 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Swinney, Jonathan" <jswinney@amazon.com>
X-Patchwork-Id: 37259
From: "Swinney, Jonathan" <jswinney@amazon.com>
To: "ffmpeg-devel@ffmpeg.org" <ffmpeg-devel@ffmpeg.org>
Thread-Topic: [PATCH v3 2/3] swscale/aarch64: vscale optimization
Thread-Index: AdivVwIoGM530APiQviyMJ/WNteqHQ==
Date: Sat, 13 Aug 2022 20:56:02 +0000
Message-ID: <17ac9282aef74f869ec9895f0d17f17e@amazon.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.43.162.134]
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH v3 2/3] swscale/aarch64: vscale optimization
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st>,
 Hubert Mazur <hum@semihalf.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: U4pqH0biRAvE

Use scalar-times-vector multiply-accumulate instructions instead of
vector-times-vector ones. This removes the need for the replicating load
instructions, which are slightly slower.

checkasm benchmarks on AWS c7g (Graviton 3, Neoverse V1) instances:
                                 before   after
yuv2yuvX_8_0_512_accurate_neon:  1144.8   987.4
yuv2yuvX_16_0_512_accurate_neon: 2080.5  1869.4

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
---
 libswscale/aarch64/output.S | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)
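
To illustrate why the replicating loads become unnecessary, here is a scalar
C model (not NEON code) of the two forms of the widening multiply-accumulate.
The function names are hypothetical, and the four-lane width mirrors the
.4H/.4S operands in the diff. The sketch only demonstrates that both forms
accumulate the same values.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

/* Models `smlal vd.4S, vn.4H, vm.4H`: both operands are full vectors, so
 * the coefficient has to be replicated into every lane first (the job the
 * ld1r loads did before this patch). */
void smlal_by_vector(int32_t acc[LANES], const int16_t a[LANES],
                     const int16_t b[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc[i] += (int32_t)a[i] * b[i];
}

/* Models `smlal vd.4S, vn.4H, vm.H[lane]`: a single lane of vm is
 * multiplied with every element of vn, so no replication is needed. */
void smlal_by_element(int32_t acc[LANES], const int16_t a[LANES],
                      const int16_t *coeffs, int lane)
{
    assert(lane >= 0);
    for (int i = 0; i < LANES; i++)
        acc[i] += (int32_t)a[i] * coeffs[lane];
}
```

In the assembly below, a single `ldr s7` brings in both coefficients, and
`v7.H[0]` and `v7.H[1]` select them, replacing the two `ld1r` loads.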

diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index af71de6050..991750cf31 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -34,16 +34,15 @@ function ff_yuv2planeX_8_neon, export=1
         mov                 x9, x2                          // srcp    = src
         mov                 x10, x0                         // filterp = filter
 3:      ldp                 x11, x12, [x9], #16             // get 2 pointers: src[j] and src[j+1]
+        ldr                 s7, [x10], #4                   // read 2x16-bit coeff X and Y at filter[j] and filter[j+1]
         add                 x11, x11, x7, lsl #1            // &src[j  ][i]
         add                 x12, x12, x7, lsl #1            // &src[j+1][i]
         ld1                 {v5.8H}, [x11]                  // read 8x16-bit @ src[j  ][i + {0..7}]: A,B,C,D,E,F,G,H
         ld1                 {v6.8H}, [x12]                  // read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P
-        ld1r                {v7.8H}, [x10], #2              // read 1x16-bit coeff X at filter[j  ] and duplicate across lanes
-        ld1r                {v16.8H}, [x10], #2             // read 1x16-bit coeff Y at filter[j+1] and duplicate across lanes
-        smlal               v3.4S, v5.4H, v7.4H             // val0 += {A,B,C,D} * X
-        smlal2              v4.4S, v5.8H, v7.8H             // val1 += {E,F,G,H} * X
-        smlal               v3.4S, v6.4H, v16.4H            // val0 += {I,J,K,L} * Y
-        smlal2              v4.4S, v6.8H, v16.8H            // val1 += {M,N,O,P} * Y
+        smlal               v3.4S, v5.4H, v7.H[0]           // val0 += {A,B,C,D} * X
+        smlal2              v4.4S, v5.8H, v7.H[0]           // val1 += {E,F,G,H} * X
+        smlal               v3.4S, v6.4H, v7.H[1]           // val0 += {I,J,K,L} * Y
+        smlal2              v4.4S, v6.8H, v7.H[1]           // val1 += {M,N,O,P} * Y
         subs                w8, w8, #2                      // tmpfilterSize -= 2
         b.gt                3b                              // loop until filterSize consumed
 

From patchwork Sat Aug 13 20:56:06 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Swinney, Jonathan" <jswinney@amazon.com>
X-Patchwork-Id: 37260
From: "Swinney, Jonathan" <jswinney@amazon.com>
To: "ffmpeg-devel@ffmpeg.org" <ffmpeg-devel@ffmpeg.org>
Thread-Topic: [PATCH v3 3/3] swscale/aarch64: add vscale specializations
Thread-Index: AdivVxEmf6fLDA28TLO6ywGfvD7wdQ==
Date: Sat, 13 Aug 2022 20:56:06 +0000
Message-ID: <70629b7632564b30a44c71bf6a903b26@amazon.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.43.162.134]
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH v3 3/3] swscale/aarch64: add vscale
 specializations
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st>,
 Hubert Mazur <hum@semihalf.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: 8NWbUFa+vR15

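Add specialized vscale code paths to ff_yuv2planeX_8_neon for filterSize
values of 2, 4, and 8. Each path is fully unrolled over the filter taps,
so the source pointers and filter coefficients are loaded once, outside
the per-pixel loop. Odd filter sizes get a separate loop that consumes
one tap per iteration. Also add ff_yuv2plane1_8_neon, which handles a
single source line for 8-bit output.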

On AWS c7g (Graviton 3, Neoverse V1) instances:
                                 before   after
yuv2yuvX_2_0_512_accurate_neon:  558.8    268.9
yuv2yuvX_4_0_512_accurate_neon:  637.5    434.9
yuv2yuvX_8_0_512_accurate_neon:  1144.8   806.2
yuv2yuvX_16_0_512_accurate_neon: 2080.5   1853.7

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
---
 libswscale/aarch64/output.S  | 177 +++++++++++++++++++++++++++++++++++
 libswscale/aarch64/swscale.c |  12 +++
 2 files changed, 189 insertions(+)
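
For review, a scalar C model of the arithmetic described by the comments in
output.S may be useful. This is only a sketch: the names ref_planeX_8,
ref_plane1_8 and clip_u8 are hypothetical and not part of this patch, and the
code assumes that right-shifting a negative int is an arithmetic shift.

```c
#include <stdint.h>

/* Hypothetical helper: clamp an int to the 0..255 range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* Scalar model of the planeX path: the accumulator starts at dither << 12,
 * every filter tap is multiplied and accumulated, then the sum is
 * shifted right by 19 and clipped to 8 bits. */
static void ref_planeX_8(const int16_t *filter, int filterSize,
                         const int16_t **src, uint8_t *dest, int dstW,
                         const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_u8(val >> 19);
    }
}

/* Scalar model of the plane1 path: add the unshifted dither to a single
 * source line, shift right by 7 and clip to 8 bits. */
static void ref_plane1_8(const int16_t *src, uint8_t *dest, int dstW,
                         const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++)
        dest[i] = clip_u8((src[i] + dither[(i + offset) & 7]) >> 7);
}
```

The specialized paths for filterSize 2, 4 and 8 compute the same sum as the
inner loop of ref_planeX_8, with the loop over j unrolled.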

diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index 991750cf31..b8a2818c9b 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -21,13 +21,33 @@
 #include "libavutil/aarch64/asm.S"
 
 function ff_yuv2planeX_8_neon, export=1
+// x0 - const int16_t *filter,
+// w1 - int filterSize,
+// x2 - const int16_t **src,
+// x3 - uint8_t *dest,
+// w4 - int dstW,
+// x5 - const uint8_t *dither,
+// w6 - int offset
+
         ld1                 {v0.8B}, [x5]                   // load 8x8-bit dither
+        and                 w6, w6, #7
         cbz                 w6, 1f                          // check if offsetting present
         ext                 v0.8B, v0.8B, v0.8B, #3         // honor offsetting which can be 0 or 3 only
 1:      uxtl                v0.8H, v0.8B                    // extend dither to 16-bit
         ushll               v1.4S, v0.4H, #12               // extend dither to 32-bit with left shift by 12 (part 1)
         ushll2              v2.4S, v0.8H, #12               // extend dither to 32-bit with left shift by 12 (part 2)
+        cmp                 w1, #8                          // if filterSize == 8, branch to specialized version
+        b.eq                6f
+        cmp                 w1, #4                          // if filterSize == 4, branch to specialized version
+        b.eq                8f
+        cmp                 w1, #2                          // if filterSize == 2, branch to specialized version
+        b.eq                10f
+
+// The filter size does not match any of the specialized implementations. It is either even or odd. If it is even,
+// use the first section below.
         mov                 x7, #0                          // i = 0
+        tbnz                w1, #0, 4f                      // if filterSize is odd, branch to the odd-size loop
+// fs % 2 == 0
 2:      mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
         mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
         mov                 w8, w1                          // tmpfilterSize = filterSize
@@ -54,4 +74,161 @@ function ff_yuv2planeX_8_neon, export=1
         add                 x7, x7, #8                      // i += 8
         b.gt                2b                              // loop until width consumed
         ret
+
+// If filter size is odd (most likely == 1), then use this section.
+// fs % 2 != 0
+4:      mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+        mov                 w8, w1                          // tmpfilterSize = filterSize
+        mov                 x9, x2                          // srcp    = src
+        mov                 x10, x0                         // filterp = filter
+5:      ldr                 x11, [x9], #8                   // get 1 pointer: src[j]
+        ldr                 h6, [x10], #2                   // read one 16-bit coeff X at filter[j]
+        add                 x11, x11, x7, lsl #1            // &src[j  ][i]
+        ld1                 {v5.8H}, [x11]                  // read 8x16-bit @ src[j  ][i + {0..7}]: A,B,C,D,E,F,G,H
+        smlal               v3.4S, v5.4H, v6.H[0]           // val0 += {A,B,C,D} * X
+        smlal2              v4.4S, v5.8H, v6.H[0]           // val1 += {E,F,G,H} * X
+        subs                w8, w8, #1                      // tmpfilterSize -= 1
+        b.gt                5b                              // loop until filterSize consumed
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        subs                w4, w4, #8                      // dstW -= 8
+        add                 x7, x7, #8                      // i += 8
+        b.gt                4b                              // loop until width consumed
+        ret
+
+6:      // fs=8
+        ldp                 x5, x6, [x2]                    // load 2 pointers: src[j  ] and src[j+1]
+        ldp                 x7, x9, [x2, #16]               // load 2 pointers: src[j+2] and src[j+3]
+        ldp                 x10, x11, [x2, #32]             // load 2 pointers: src[j+4] and src[j+5]
+        ldp                 x12, x13, [x2, #48]             // load 2 pointers: src[j+6] and src[j+7]
+
+        // load 8x16-bit values for filter[j], where j=0..7
+        ld1                 {v6.8H}, [x0]
+7:
+        mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+
+        ld1                 {v24.8H}, [x5], #16             // load 8x16-bit values for src[j + 0][i + {0..7}]
+        ld1                 {v25.8H}, [x6], #16             // load 8x16-bit values for src[j + 1][i + {0..7}]
+        ld1                 {v26.8H}, [x7], #16             // load 8x16-bit values for src[j + 2][i + {0..7}]
+        ld1                 {v27.8H}, [x9], #16             // load 8x16-bit values for src[j + 3][i + {0..7}]
+        ld1                 {v28.8H}, [x10], #16            // load 8x16-bit values for src[j + 4][i + {0..7}]
+        ld1                 {v29.8H}, [x11], #16            // load 8x16-bit values for src[j + 5][i + {0..7}]
+        ld1                 {v30.8H}, [x12], #16            // load 8x16-bit values for src[j + 6][i + {0..7}]
+        ld1                 {v31.8H}, [x13], #16            // load 8x16-bit values for src[j + 7][i + {0..7}]
+
+        smlal               v3.4S, v24.4H, v6.H[0]          // val0 += src[0][i + {0..3}] * filter[0]
+        smlal2              v4.4S, v24.8H, v6.H[0]          // val1 += src[0][i + {4..7}] * filter[0]
+        smlal               v3.4S, v25.4H, v6.H[1]          // val0 += src[1][i + {0..3}] * filter[1]
+        smlal2              v4.4S, v25.8H, v6.H[1]          // val1 += src[1][i + {4..7}] * filter[1]
+        smlal               v3.4S, v26.4H, v6.H[2]          // val0 += src[2][i + {0..3}] * filter[2]
+        smlal2              v4.4S, v26.8H, v6.H[2]          // val1 += src[2][i + {4..7}] * filter[2]
+        smlal               v3.4S, v27.4H, v6.H[3]          // val0 += src[3][i + {0..3}] * filter[3]
+        smlal2              v4.4S, v27.8H, v6.H[3]          // val1 += src[3][i + {4..7}] * filter[3]
+        smlal               v3.4S, v28.4H, v6.H[4]          // val0 += src[4][i + {0..3}] * filter[4]
+        smlal2              v4.4S, v28.8H, v6.H[4]          // val1 += src[4][i + {4..7}] * filter[4]
+        smlal               v3.4S, v29.4H, v6.H[5]          // val0 += src[5][i + {0..3}] * filter[5]
+        smlal2              v4.4S, v29.8H, v6.H[5]          // val1 += src[5][i + {4..7}] * filter[5]
+        smlal               v3.4S, v30.4H, v6.H[6]          // val0 += src[6][i + {0..3}] * filter[6]
+        smlal2              v4.4S, v30.8H, v6.H[6]          // val1 += src[6][i + {4..7}] * filter[6]
+        smlal               v3.4S, v31.4H, v6.H[7]          // val0 += src[7][i + {0..3}] * filter[7]
+        smlal2              v4.4S, v31.8H, v6.H[7]          // val1 += src[7][i + {4..7}] * filter[7]
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        subs                w4, w4, #8                      // dstW -= 8
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        b.gt                7b                              // loop until width consumed
+        ret
+
+8:      // fs=4
+        ldp                 x5, x6, [x2]                    // load 2 pointers: src[j  ] and src[j+1]
+        ldp                 x7, x9, [x2, #16]               // load 2 pointers: src[j+2] and src[j+3]
+
+        // load 4x16-bit values for filter[j], where j=0..3
+        ld1                 {v6.4H}, [x0]
+9:
+        mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+
+        ld1                 {v24.8H}, [x5], #16             // load 8x16-bit values for src[j + 0][i + {0..7}]
+        ld1                 {v25.8H}, [x6], #16             // load 8x16-bit values for src[j + 1][i + {0..7}]
+        ld1                 {v26.8H}, [x7], #16             // load 8x16-bit values for src[j + 2][i + {0..7}]
+        ld1                 {v27.8H}, [x9], #16             // load 8x16-bit values for src[j + 3][i + {0..7}]
+
+        smlal               v3.4S, v24.4H, v6.H[0]          // val0 += src[0][i + {0..3}] * filter[0]
+        smlal2              v4.4S, v24.8H, v6.H[0]          // val1 += src[0][i + {4..7}] * filter[0]
+        smlal               v3.4S, v25.4H, v6.H[1]          // val0 += src[1][i + {0..3}] * filter[1]
+        smlal2              v4.4S, v25.8H, v6.H[1]          // val1 += src[1][i + {4..7}] * filter[1]
+        smlal               v3.4S, v26.4H, v6.H[2]          // val0 += src[2][i + {0..3}] * filter[2]
+        smlal2              v4.4S, v26.8H, v6.H[2]          // val1 += src[2][i + {4..7}] * filter[2]
+        smlal               v3.4S, v27.4H, v6.H[3]          // val0 += src[3][i + {0..3}] * filter[3]
+        smlal2              v4.4S, v27.8H, v6.H[3]          // val1 += src[3][i + {4..7}] * filter[3]
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        subs                w4, w4, #8                      // dstW -= 8
+        b.gt                9b                              // loop until width consumed
+        ret
+
+10:     // fs=2
+        ldp                 x5, x6, [x2]                    // load 2 pointers: src[j  ] and src[j+1]
+
+        // load 2x16-bit values for filter[j], where j=0..1
+        ldr                 s6, [x0]
+11:
+        mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+
+        ld1                 {v24.8H}, [x5], #16             // load 8x16-bit values for src[j + 0][i + {0..7}]
+        ld1                 {v25.8H}, [x6], #16             // load 8x16-bit values for src[j + 1][i + {0..7}]
+
+        smlal               v3.4S, v24.4H, v6.H[0]          // val0 += src[0][i + {0..3}] * filter[0]
+        smlal2              v4.4S, v24.8H, v6.H[0]          // val1 += src[0][i + {4..7}] * filter[0]
+        smlal               v3.4S, v25.4H, v6.H[1]          // val0 += src[1][i + {0..3}] * filter[1]
+        smlal2              v4.4S, v25.8H, v6.H[1]          // val1 += src[1][i + {4..7}] * filter[1]
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        subs                w4, w4, #8                      // dstW -= 8
+        b.gt                11b                             // loop until width consumed
+        ret
+endfunc
+
+function ff_yuv2plane1_8_neon, export=1
+// x0 - const int16_t *src,
+// x1 - uint8_t *dest,
+// w2 - int dstW,
+// x3 - const uint8_t *dither,
+// w4 - int offset
+        ld1                 {v0.8B}, [x3]                   // load 8x8-bit dither
+        and                 w4, w4, #7
+        cbz                 w4, 1f                          // check if offsetting present
+        ext                 v0.8B, v0.8B, v0.8B, #3         // honor offsetting which can be 0 or 3 only
+1:      uxtl                v0.8H, v0.8B                    // extend dither to 16-bit
+        uxtl                v1.4s, v0.4h                    // extend dither to 32-bit (part 1)
+        uxtl2               v2.4s, v0.8h                    // extend dither to 32-bit (part 2)
+2:
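+        ld1                 {v3.8h}, [x0], #16              // read 8x16-bit @ src[i + {0..7}]: A,B,C,D,E,F,G,H
+        sxtl                v4.4s, v3.4h                    // sign-extend {A,B,C,D} to 32-bit
+        sxtl2               v5.4s, v3.8h                    // sign-extend {E,F,G,H} to 32-bit
+        add                 v4.4s, v4.4s, v1.4s             // val0 = {A,B,C,D} + dither (part 1)
+        add                 v5.4s, v5.4s, v2.4s             // val1 = {E,F,G,H} + dither (part 2)
+        sqshrun             v4.4h, v4.4s, #6                // clip16(val0>>6)
+        sqshrun2            v4.8h, v5.4s, #6                // clip16(val1>>6)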
+
+        uqshrn              v3.8b, v4.8h, #1                // clip8(val>>7)
+        subs                w2, w2, #8                      // dstW -= 8
+        st1                 {v3.8b}, [x1], #8               // write to destination
+        b.gt                2b                              // loop until width consumed
+        ret
 endfunc
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index ab28be4da6..321d1f844e 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -39,6 +39,12 @@ ALL_SCALE_FUNCS(neon);
 void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset);
+void ff_yuv2plane1_8_neon(
+        const int16_t *src,
+        uint8_t *dest,
+        int dstW,
+        const uint8_t *dither,
+        int offset);
 
 #define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do {              \
     if (c->srcBpc == 8 && c->dstBpc <= 14) {                            \
@@ -54,6 +60,11 @@ void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
                ASSIGN_SCALE_FUNC2(hscalefn, X8, opt);                   \
            break;                                                       \
   }
+#define ASSIGN_VSCALE_FUNC(vscalefn, opt)                               \
+    switch (c->dstBpc) {                                                \
+    case 8: vscalefn = ff_yuv2plane1_8_  ## opt;  break;                \
+    default: break;                                                     \
+    }
 
 av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
 {
@@ -62,6 +73,7 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
     if (have_neon(cpu_flags)) {
         ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon);
         ASSIGN_SCALE_FUNC(c->hcScale, c->hChrFilterSize, neon);
+        ASSIGN_VSCALE_FUNC(c->yuv2plane1, neon);
         if (c->dstBpc == 8) {
             c->yuv2planeX = ff_yuv2planeX_8_neon;
         }