From patchwork Fri Apr 15 21:36:56 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: "Swinney, Jonathan" <jswinney@amazon.com>
X-Patchwork-Id: 35331
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:671c:b0:7c:62c8:b2d1 with SMTP id q28csp468672pzh;
        Fri, 15 Apr 2022 14:37:13 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJzbR3apPrSL5DAIejVuMDxCC8SQTCTLWAXlakeL9YYVCHGs/aUt8E18zq/dB9eUuNiXtyuF
X-Received: by 2002:a17:907:6d8b:b0:6ec:b63:a108 with SMTP id
 sb11-20020a1709076d8b00b006ec0b63a108mr700622ejc.670.1650058633291;
        Fri, 15 Apr 2022 14:37:13 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1650058633; cv=none;
        d=google.com; s=arc-20160816;
        b=PU0zxvqmKRfJzxbJ3zcobGtvXuS9AoacM+k8KUzR8Qe5VN+1zGDwtTOIDopVBg4rpj
         as8yTVPxZiw6jnhCUTwtm9YV4VAovuHL0hbTUjbhUqL3kqx6X3lFVrR7zBlfX9Q/yMLy
         lJJnfq8SRK4MooX+8NspSxDqwi3kVo7kwr8ffK3w+M32lx3anEJNAF7DkSPynvZvzMfA
         uN18b+G9J5qUFrZP91/ZIDfuwxp+Q9TMRyzbUZjSQrxIzVHElMY3MRZOTb2+qeXrxkEl
         fbT/d8TB7dAMchkJ5pZB8zVdWVMHaf+oQTEl6VdtOtX6vX/s85UEBYeurCAxy04Zp5RY
         xLiQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:content-language
         :accept-language:message-id:date:thread-index:thread-topic:to:from
         :dkim-signature:delivered-to;
        bh=lKOCbre+1yWx4CXYOE1qJAeUP3RDFbzHl01yEJz1a18=;
        b=SAn8uzJaPuIVwCvI6zerngdvxCJjxLNdRuSHu2D6jRwGy5evvHFb07QjPW+RrcECJq
         vIOCJp5L93wHwEYE8+IL4r4SXdtMEKBYeCzTYd+SEakO66LMO3JbS13Jd+1Mk67taIxT
         G237LXn91zLlCj6M4hRtQa8eBOWgsG72+hn4G5Odtd5jckxjRXmTGfwKtVPbr9OhLKt6
         +3b8YeuvQzFrw7R4Htj1YNDpx28ME77T0QMBE3dvSjrEtjW/cg4znt7HUSCPemknfC4a
         e7qUFtNFWGR7e9KBDgIOlLFx6c5PtoixEs4DBto8dwJ/k2sS0F9TaeSHLIaL1BDeemeU
         XxaA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@amazon.com
 header.s=amazon201209 header.b=bL8UFV+j;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 iy21-20020a170907819500b006df76385c90si1537405ejc.304.2022.04.15.14.37.12;
        Fri, 15 Apr 2022 14:37:13 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@amazon.com
 header.s=amazon201209 header.b=bL8UFV+j;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id D7D6368B4EC;
	Sat, 16 Apr 2022 00:37:08 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from smtp-fw-9102.amazon.com (smtp-fw-9102.amazon.com
 [207.171.184.29])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id C0CF368B43B
 for <ffmpeg-devel@ffmpeg.org>; Sat, 16 Apr 2022 00:37:00 +0300 (EEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209;
 t=1650058626; x=1681594626;
 h=from:to:cc:subject:date:message-id:
 content-transfer-encoding:mime-version;
 bh=YxDetklbYCwhUPghHs4wdlO2nt9TCOAA6EYpOz68SSE=;
 b=bL8UFV+juy7+or2piBXap9eCwszZt0svRpiBWzGKv1VHhq4fE5Tl53Dh
 3SMdtNf8yLzPMA1ga9O5hww7zBQYP/PV2nKHBvBduGt1Zq/VmtknJQkkJ
 sShyE7wSiC3EJHCd8crcLKRaoUoWZhjePGEW+xJnzmBaOCP8eyz5vYtJ4 A=;
X-IronPort-AV: E=Sophos;i="5.90,263,1643673600"; d="scan'208";a="211331152"
Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO
 email-inbound-relay-pdx-1box-2b-eee1d651.us-west-2.amazon.com)
 ([10.25.36.214])
 by smtp-border-fw-9102.sea19.amazon.com with ESMTP;
 15 Apr 2022 21:36:58 +0000
Received: from EX13MTAUWB001.ant.amazon.com
 (pdx1-ws-svc-p6-lb9-vlan2.pdx.amazon.com [10.236.137.194])
 by email-inbound-relay-pdx-1box-2b-eee1d651.us-west-2.amazon.com (Postfix)
 with ESMTPS id 6031A9E77F; Fri, 15 Apr 2022 21:36:57 +0000 (UTC)
Received: from EX13D01UWB002.ant.amazon.com (10.43.161.136) by
 EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS)
 id 15.0.1497.32; Fri, 15 Apr 2022 21:36:57 +0000
Received: from EX13D07UWB004.ant.amazon.com (10.43.161.196) by
 EX13d01UWB002.ant.amazon.com (10.43.161.136) with Microsoft SMTP Server (TLS)
 id 15.0.1497.32; Fri, 15 Apr 2022 21:36:56 +0000
Received: from EX13D07UWB004.ant.amazon.com ([10.43.161.196]) by
 EX13D07UWB004.ant.amazon.com ([10.43.161.196]) with mapi id 15.00.1497.033;
 Fri, 15 Apr 2022 21:36:56 +0000
From: "Swinney, Jonathan" <jswinney@amazon.com>
To: "ffmpeg-devel@ffmpeg.org" <ffmpeg-devel@ffmpeg.org>
Thread-Topic: [PATCH 1/2] swscale/aarch64: add hscale specializations
Thread-Index: AdgqlYlVJdfeUegcRyq1eLesJnt5lw==
Date: Fri, 15 Apr 2022 21:36:56 +0000
Message-ID: <199f4223693645eda3fdd8257c2e6355@EX13D07UWB004.ant.amazon.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-ms-exchange-transport-fromentityheader: Hosted
x-originating-ip: [10.43.160.81]
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 1/2] swscale/aarch64: add hscale
 specializations
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st>, "Pop,
 Sebastian" <spop@amazon.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: xtJRixwHb+wj

This patch adds hscale specializations for filterSize == 4 and filterSize == 8,
and turns the existing implementation into the X8 version, which handles any
filterSize that is a multiple of 8. In the old code, now used for the X8
version, the final summations are made more efficient by reducing 11
instructions to 7.
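
For orientation, the computation that all three routines perform can be modeled
in scalar C roughly as follows. This is only a sketch: the names
hscale8to15_model and sat_shift7 are made up for illustration, and the
two-sided saturation mirrors the sqshrn #7 used in the assembly rather than
any particular C code path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper modeled on "sqshrn #7": shift right by 7, then
 * saturate to the int16_t range. Assumes >> on a negative int32_t is an
 * arithmetic shift, which holds on mainstream compilers. */
int16_t sat_shift7(int32_t val)
{
    int32_t v = val >> 7;
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical scalar model: each output pixel is a dot product of
 * filterSize source bytes, starting at filterPos[i], with that pixel's
 * filterSize coefficients. */
void hscale8to15_model(int16_t *dst, int dstW, const uint8_t *src,
                       const int16_t *filter, const int32_t *filterPos,
                       int filterSize)
{
    assert(filterSize > 0);
    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;
        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[filterPos[i] + j] * filter[filterSize * i + j];
        dst[i] = sat_shift7(val);
    }
}
```

The _4 and _8 variants fix the inner loop count at 4 and 8; the X8 variant
keeps an inner loop that consumes 8 coefficients per pass.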

ff_hscale8to15_8_neon is mostly unchanged from the original, with a few
exceptions:
 - The loads for the filter data were consolidated into a single 64-byte ld1
   instruction.
 - The final summations were improved.
 - The inner loop on filterSize was completely removed.
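
The 7-instruction summation (two uzp1, two uzp2, two add, one addp) can be
checked with a small lane-level model in C. The v4s type and the model_*
helpers below are hypothetical stand-ins for a .4S register and the
corresponding NEON instructions, written only to illustrate the data movement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a NEON register holding four 32-bit lanes. */
typedef struct { int32_t s[4]; } v4s;

/* uzp1: even-numbered lanes of a, then even-numbered lanes of b. */
v4s model_uzp1(v4s a, v4s b)
{
    v4s r = {{a.s[0], a.s[2], b.s[0], b.s[2]}};
    return r;
}

/* uzp2: odd-numbered lanes of a, then odd-numbered lanes of b. */
v4s model_uzp2(v4s a, v4s b)
{
    v4s r = {{a.s[1], a.s[3], b.s[1], b.s[3]}};
    return r;
}

/* add: lane-wise addition. */
v4s model_add(v4s a, v4s b)
{
    v4s r = {{a.s[0] + b.s[0], a.s[1] + b.s[1],
              a.s[2] + b.s[2], a.s[3] + b.s[3]}};
    return r;
}

/* addp: sums of adjacent lane pairs of a, then of b. */
v4s model_addp(v4s a, v4s b)
{
    v4s r = {{a.s[0] + a.s[1], a.s[2] + a.s[3],
              b.s[0] + b.s[1], b.s[2] + b.s[3]}};
    return r;
}

/* Reduce four accumulators to one vector of four horizontal sums,
 * following the uzp1/uzp2/add/addp sequence from the patch. */
v4s model_reduce4(v4s v0, v4s v1, v4s v2, v4s v3)
{
    v4s lo01 = model_uzp1(v0, v1);     /* v0[0], v0[2], v1[0], v1[2] */
    v4s hi01 = model_uzp2(v0, v1);     /* v0[1], v0[3], v1[1], v1[3] */
    v4s lo23 = model_uzp1(v2, v3);
    v4s hi23 = model_uzp2(v2, v3);
    v4s h01  = model_add(lo01, hi01);  /* two half sums each for v0, v1 */
    v4s h23  = model_add(lo23, hi23);  /* two half sums each for v2, v3 */
    return model_addp(h01, h23);       /* sum(v0), sum(v1), sum(v2), sum(v3) */
}

/* Usage check: every output lane equals the plain sum of one accumulator. */
void model_reduce4_check(v4s v0, v4s v1, v4s v2, v4s v3)
{
    v4s r = model_reduce4(v0, v1, v2, v3);
    assert(r.s[0] == v0.s[0] + v0.s[1] + v0.s[2] + v0.s[3]);
    assert(r.s[1] == v1.s[0] + v1.s[1] + v1.s[2] + v1.s[3]);
    assert(r.s[2] == v2.s[0] + v2.s[1] + v2.s[2] + v2.s[3]);
    assert(r.s[3] == v3.s[0] + v3.s[1] + v3.s[2] + v3.s[3]);
}
```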

ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck here is
loading the data from src, this data is loaded a whole block ahead and stored
back to the stack to be loaded again with ld4. This arranges the data for the
most efficient use of the vector instructions and removes the need for
completion adds at the end. The number of C loop iterations covered by one
iteration of the assembly loop is increased from 4 to 8, but because of the
prefetching, this path can only be used when dstW is >= 16. Narrower lines and
any leftover outputs are handled one at a time by a tail loop.
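
The data arrangement can be illustrated with a scalar C model of one 8-output
block. This is a sketch under stated assumptions: the function and helper
names are hypothetical, dstW is assumed to be a multiple of 8, and the
software pipelining (gathering the next block while the current one is
multiplied) is left out for clarity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper modeled on "sqshrn #7" (assumes arithmetic >>). */
int16_t sat_shift7(int32_t val)
{
    int32_t v = val >> 7;
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical model of the filterSize == 4 main loop, 8 outputs per pass. */
void hscale8to15_4_block_model(int16_t *dst, int dstW, const uint8_t *src,
                               const int16_t *filter, const int32_t *filterPos)
{
    assert(dstW % 8 == 0);
    for (int i = 0; i < dstW; i += 8) {
        uint8_t scratch[32];         /* stands in for the 32 bytes of stack */
        int32_t val[8] = {0};

        /* Gather: 4 source bytes per output, stored back to back. */
        for (int k = 0; k < 8; k++)
            memcpy(scratch + 4 * k, src + filterPos[i + k], 4);

        /* ld4 de-interleaves with stride 4, so "register" j holds tap j of
         * all 8 outputs: lane k of register j is scratch[4 * k + j]. The
         * filter gets the same treatment. Each j below corresponds to one
         * smlal/smlal2 pair, and no horizontal add is needed afterwards. */
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 8; k++)
                val[k] += (int32_t)scratch[4 * k + j] * filter[4 * (i + k) + j];

        for (int k = 0; k < 8; k++)
            dst[i + k] = sat_shift7(val[k]);
    }
}
```

Because every lane accumulates the taps of a single output pixel, the result
vector is already in dst order when the multiplies finish.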

This improves speed by about 26% (avg time reported by the bench filter) on
Graviton 2 (Neoverse N1):
ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
before: t:0.001796 avg:0.001839 max:0.002756 min:0.001733
after:  t:0.001690 avg:0.001352 max:0.002171 min:0.001292

In the direct microbenchmarks I wrote, the benefit is more dramatic when
filterSize == 4.

| c6g (seconds) | filterSize 4 | filterSize 8 |
| ------------- | ------------ | ------------ |
| original      | 7.554        | 7.621        |
| optimized     | 3.736        | 7.054        |
| improvement   | 50.5%        | 7.44%        |

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
---
 libswscale/aarch64/hscale.S  | 263 +++++++++++++++++++++++++++++++++--
 libswscale/aarch64/swscale.c |  41 ++++--
 libswscale/utils.c           |   2 +-
 3 files changed, 284 insertions(+), 22 deletions(-)

-- 
2.32.0

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index af55ffe2b7..a934653a46 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -1,5 +1,7 @@
 /*
  * Copyright (c) 2016 Clément Bœsch <clement stupeflix.com>
+ * Copyright (c) 2019-2021 Sebastian Pop <spop@amazon.com>
+ * Copyright (c) 2022 Jonathan Swinney <jswinney@amazon.com>
  *
  * This file is part of FFmpeg.
  *
@@ -20,7 +22,25 @@
 
 #include "libavutil/aarch64/asm.S"
 
-function ff_hscale_8_to_15_neon, export=1
+/*
+;-----------------------------------------------------------------------------
+; horizontal line scaling
+;
+; void hscale<source_width>to<intermediate_nbits>_<filterSize>_<opt>
+;                               (SwsContext *c, int{16,32}_t *dst,
+;                                int dstW, const uint{8,16}_t *src,
+;                                const int16_t *filter,
+;                                const int32_t *filterPos, int filterSize);
+;
+; Scale one horizontal line. Input is either 8-bit width or 16-bit width
+; ($source_width can be either 8, 9, 10 or 16, difference is whether we have to
+; downscale before multiplying). Filter is 14 bits. Output is either 15 bits
+; (in int16_t) or 19 bits (in int32_t), as given in $intermediate_nbits. Each
+; output pixel is generated from $filterSize input pixels, the position of
+; the first pixel is given in filterPos[nOutputPixel].
+;----------------------------------------------------------------------------- */
+
+function ff_hscale8to15_X8_neon, export=1
         sbfiz               x7, x6, #1, #32             // filterSize*2 (*2 because int16)
 1:      ldr                 w8, [x5], #4                // filterPos[idx]
         ldr                 w0, [x5], #4                // filterPos[idx + 1]
@@ -61,20 +81,239 @@ function ff_hscale_8_to_15_neon, export=1
         smlal               v3.4S, v18.4H, v19.4H       // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
         smlal2              v3.4S, v18.8H, v19.8H       // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
         b.gt                2b                          // inner loop if filterSize not consumed completely
-        addp                v0.4S, v0.4S, v0.4S         // part0 horizontal pair adding
-        addp                v1.4S, v1.4S, v1.4S         // part1 horizontal pair adding
-        addp                v2.4S, v2.4S, v2.4S         // part2 horizontal pair adding
-        addp                v3.4S, v3.4S, v3.4S         // part3 horizontal pair adding
-        addp                v0.4S, v0.4S, v0.4S         // part0 horizontal pair adding
-        addp                v1.4S, v1.4S, v1.4S         // part1 horizontal pair adding
-        addp                v2.4S, v2.4S, v2.4S         // part2 horizontal pair adding
-        addp                v3.4S, v3.4S, v3.4S         // part3 horizontal pair adding
-        zip1                v0.4S, v0.4S, v1.4S         // part01 = zip values from part0 and part1
-        zip1                v2.4S, v2.4S, v3.4S         // part23 = zip values from part2 and part3
-        mov                 v0.d[1], v2.d[0]            // part0123 = zip values from part01 and part23
+        uzp1                v4.4S, v0.4S, v1.4S         // unzip low parts 0 and 1
+        uzp2                v5.4S, v0.4S, v1.4S         // unzip high parts 0 and 1
+        uzp1                v6.4S, v2.4S, v3.4S         // unzip low parts 2 and 3
+        uzp2                v7.4S, v2.4S, v3.4S         // unzip high parts 2 and 3
+        add                 v16.4S, v4.4S, v5.4S        // add half of each of part 0 and 1
+        add                 v17.4S, v6.4S, v7.4S        // add half of each of part 2 and 3
+        addp                v0.4S, v16.4S, v17.4S       // pairwise add to complete half adds in earlier steps
         subs                w2, w2, #4                  // dstW -= 4
         sqshrn              v0.4H, v0.4S, #7            // shift and clip the 2x16-bit final values
         st1                 {v0.4H}, [x1], #8           // write to destination part0123
         b.gt                1b                          // loop until end of line
         ret
 endfunc
+
+
+function ff_hscale8to15_8_neon, export=1
+// x0      SwsContext *c (not used)
+// x1      int16_t *dst
+// x2      int dstW
+// x3      const uint8_t *src
+// x4      const int16_t *filter
+// x5      const int32_t *filterPos
+// x6      int filterSize
+// x8-x11  filterPos values
+
+// v0-v3   multiply add accumulators
+// v4-v7   filter data, temp for final horizontal sum
+// v16-v19 src data
+1:
+        ld1                 {v4.8H, v5.8H, v6.8H, v7.8H}, [x4], #64 // load filter[idx=0..3, j=0..7]
+        ldp                 w8, w9,  [x5]               // filterPos[idx + 0], [idx + 1]
+        ldp                 w10, w11, [x5, 8]           // filterPos[idx + 2], [idx + 3]
+        movi                v0.2D, #0                   // val sum part 1 (for dst[0])
+        movi                v1.2D, #0                   // val sum part 2 (for dst[1])
+        add                 x5, x5, #16                 // increment filterPos
+
+        add                 x8, x3, w8, UXTW            // srcp + filterPos[0]
+        add                 x9,  x3, w9, UXTW           // srcp + filterPos[1]
+        add                 x10, x3, w10, UXTW          // srcp + filterPos[2]
+        add                 x11, x3, w11, UXTW          // srcp + filterPos[3]
+
+        ld1                 {v16.8B}, [x8], #8          // srcp[filterPos[0] + {0..7}]
+        ld1                 {v17.8B}, [x9], #8          // srcp[filterPos[1] + {0..7}]
+
+        movi                v2.2D, #0                   // val sum part 3 (for dst[2])
+        movi                v3.2D, #0                   // val sum part 4 (for dst[3])
+
+        uxtl                v16.8H, v16.8B              // unpack part 1 to 16-bit
+        uxtl                v17.8H, v17.8B              // unpack part 2 to 16-bit
+
+        smlal               v0.4S, v16.4H, v4.4H        // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}]
+        smlal               v1.4S, v17.4H, v5.4H        // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+
+        ld1                 {v18.8B}, [x10], #8         // srcp[filterPos[2] + {0..7}]
+        ld1                 {v19.8B}, [x11], #8         // srcp[filterPos[3] + {0..7}]
+
+        smlal2              v0.4S, v16.8H, v4.8H        // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}]
+        smlal2              v1.4S, v17.8H, v5.8H        // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+
+        uxtl                v18.8H, v18.8B              // unpack part 3 to 16-bit
+        uxtl                v19.8H, v19.8B              // unpack part 4 to 16-bit
+
+        smlal               v2.4S, v18.4H, v6.4H        // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        smlal               v3.4S, v19.4H, v7.4H        // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+
+        smlal2              v2.4S, v18.8H, v6.8H        // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        smlal2              v3.4S, v19.8H, v7.8H        // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+
+        uzp1                v4.4S, v0.4S, v1.4S         // unzip low parts 0 and 1
+        uzp2                v5.4S, v0.4S, v1.4S         // unzip high parts 0 and 1
+        uzp1                v6.4S, v2.4S, v3.4S         // unzip low parts 2 and 3
+        uzp2                v7.4S, v2.4S, v3.4S         // unzip high parts 2 and 3
+
+        add                 v0.4S, v4.4S, v5.4S         // add half of each of part 0 and 1
+        add                 v1.4S, v6.4S, v7.4S         // add half of each of part 2 and 3
+
+        addp                v4.4S, v0.4S, v1.4S         // pairwise add to complete half adds in earlier steps
+
+        subs                w2, w2, #4                  // dstW -= 4
+        sqshrn              v0.4H, v4.4S, #7            // shift and clip the 2x16-bit final values
+        st1                 {v0.4H}, [x1], #8           // write to destination part0123
+        b.gt                1b                          // loop until end of line
+        ret
+endfunc
+
+function ff_hscale8to15_4_neon, export=1
+// x0  SwsContext *c (not used)
+// x1  int16_t *dst
+// x2  int dstW
+// x3  const uint8_t *src
+// x4  const int16_t *filter
+// x5  const int32_t *filterPos
+// x6  int filterSize
+// x8-x15 registers for gathering src data
+
+// v0      madd accumulator 4S
+// v1-v4   filter values (16 bit) 8H
+// v5      madd accumulator 4S
+// v16-v19 src values (8 bit) 8B
+
+// This implementation has 4 sections:
+//  1. Prefetch src data
+//  2. Interleaved prefetching src data and madd
+//  3. Complete madd
+//  4. Complete remaining iterations when dstW % 8 != 0
+
+        add                 sp, sp, #-32                // allocate 32 bytes on the stack
+        cmp                 w2, #16                     // if dstW <16, skip to the last block used for wrapping up
+        b.lt                2f
+
+        // load 8 values from filterPos to be used as offsets into src
+        ldp                 w8, w9,  [x5]               // filterPos[idx + 0], [idx + 1]
+        ldp                 w10, w11, [x5, 8]           // filterPos[idx + 2], [idx + 3]
+        ldp                 w12, w13, [x5, 16]          // filterPos[idx + 4], [idx + 5]
+        ldp                 w14, w15, [x5, 24]          // filterPos[idx + 6], [idx + 7]
+        add                 x5, x5, #32                 // advance filterPos
+
+        // gather random access data from src into contiguous memory
+        ldr                 w8, [x3, w8, UXTW]          // src[filterPos[idx + 0]][0..3]
+        ldr                 w9, [x3, w9, UXTW]          // src[filterPos[idx + 1]][0..3]
+        ldr                 w10, [x3, w10, UXTW]        // src[filterPos[idx + 2]][0..3]
+        ldr                 w11, [x3, w11, UXTW]        // src[filterPos[idx + 3]][0..3]
+        ldr                 w12, [x3, w12, UXTW]        // src[filterPos[idx + 4]][0..3]
+        ldr                 w13, [x3, w13, UXTW]        // src[filterPos[idx + 5]][0..3]
+        ldr                 w14, [x3, w14, UXTW]        // src[filterPos[idx + 6]][0..3]
+        ldr                 w15, [x3, w15, UXTW]        // src[filterPos[idx + 7]][0..3]
+        stp                 w8, w9, [sp]                // *scratch_mem = { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] }
+        stp                 w10, w11, [sp, 8]           // *scratch_mem = { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] }
+        stp                 w12, w13, [sp, 16]          // *scratch_mem = { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] }
+        stp                 w14, w15, [sp, 24]          // *scratch_mem = { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] }
+
+1:
+        ld4                 {v16.8B, v17.8B, v18.8B, v19.8B}, [sp] // transpose 8 bytes each from src into 4 registers
+
+        // load 8 values from filterPos to be used as offsets into src
+        ldp                 w8, w9,  [x5]               // filterPos[idx + 0][0..3], [idx + 1][0..3], next iteration
+        ldp                 w10, w11, [x5, 8]           // filterPos[idx + 2][0..3], [idx + 3][0..3], next iteration
+        ldp                 w12, w13, [x5, 16]          // filterPos[idx + 4][0..3], [idx + 5][0..3], next iteration
+        ldp                 w14, w15, [x5, 24]          // filterPos[idx + 6][0..3], [idx + 7][0..3], next iteration
+
+        movi                v0.2D, #0                   // Clear madd accumulator for idx 0..3
+        movi                v5.2D, #0                   // Clear madd accumulator for idx 4..7
+
+        ld4                 {v1.8H, v2.8H, v3.8H, v4.8H}, [x4], #64 // load filter idx + 0..7
+
+        add                 x5, x5, #32                 // advance filterPos
+
+        // interleaved SIMD and prefetching intended to keep ld/st and vector pipelines busy
+        uxtl                v16.8H, v16.8B              // unsigned extend long, convert src data to 16-bit
+        uxtl                v17.8H, v17.8B              // unsigned extend long, convert src data to 16-bit
+        ldr                 w8, [x3, w8, UXTW]          // src[filterPos[idx + 0]], next iteration
+        ldr                 w9, [x3, w9, UXTW]          // src[filterPos[idx + 1]], next iteration
+        uxtl                v18.8H, v18.8B              // unsigned extend long, convert src data to 16-bit
+        uxtl                v19.8H, v19.8B              // unsigned extend long, convert src data to 16-bit
+        ldr                 w10, [x3, w10, UXTW]        // src[filterPos[idx + 2]], next iteration
+        ldr                 w11, [x3, w11, UXTW]        // src[filterPos[idx + 3]], next iteration
+
+        smlal               v0.4S, v1.4H, v16.4H        // multiply accumulate inner loop j = 0, idx = 0..3
+        smlal               v0.4S, v2.4H, v17.4H        // multiply accumulate inner loop j = 1, idx = 0..3
+        ldr                 w12, [x3, w12, UXTW]        // src[filterPos[idx + 4]], next iteration
+        ldr                 w13, [x3, w13, UXTW]        // src[filterPos[idx + 5]], next iteration
+        smlal               v0.4S, v3.4H, v18.4H        // multiply accumulate inner loop j = 2, idx = 0..3
+        smlal               v0.4S, v4.4H, v19.4H        // multiply accumulate inner loop j = 3, idx = 0..3
+        ldr                 w14, [x3, w14, UXTW]        // src[filterPos[idx + 6]], next iteration
+        ldr                 w15, [x3, w15, UXTW]        // src[filterPos[idx + 7]], next iteration
+
+        smlal2              v5.4S, v1.8H, v16.8H        // multiply accumulate inner loop j = 0, idx = 4..7
+        smlal2              v5.4S, v2.8H, v17.8H        // multiply accumulate inner loop j = 1, idx = 4..7
+        stp                 w8, w9, [sp]                // *scratch_mem = { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] }
+        stp                 w10, w11, [sp, 8]           // *scratch_mem = { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] }
+        smlal2              v5.4S, v3.8H, v18.8H        // multiply accumulate inner loop j = 2, idx = 4..7
+        smlal2              v5.4S, v4.8H, v19.8H        // multiply accumulate inner loop j = 3, idx = 4..7
+        stp                 w12, w13, [sp, 16]          // *scratch_mem = { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] }
+        stp                 w14, w15, [sp, 24]          // *scratch_mem = { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] }
+
+        sub                 w2, w2, #8                  // dstW -= 8
+        sqshrn              v0.4H, v0.4S, #7            // shift and clip the 2x16-bit final values
+        sqshrn              v1.4H, v5.4S, #7            // shift and clip the 2x16-bit final values
+        st1                 {v0.4H, v1.4H}, [x1], #16   // write to dst[idx + 0..7]
+        cmp                 w2, #16                     // continue on main loop if there are at least 16 iterations left
+        b.ge                1b
+
+        // last full iteration
+        ld4                 {v16.8B, v17.8B, v18.8B, v19.8B}, [sp]
+        ld4                 {v1.8H, v2.8H, v3.8H, v4.8H}, [x4], #64 // load filter idx + 0..7
+
+        movi                v0.2D, #0                   // Clear madd accumulator for idx 0..3
+        movi                v5.2D, #0                   // Clear madd accumulator for idx 4..7
+
+        uxtl                v16.8H, v16.8B              // unsigned extend long, convert src data to 16-bit
+        uxtl                v17.8H, v17.8B              // unsigned extend long, convert src data to 16-bit
+        uxtl                v18.8H, v18.8B              // unsigned extend long, convert src data to 16-bit
+        uxtl                v19.8H, v19.8B              // unsigned extend long, convert src data to 16-bit
+
+        smlal               v0.4S, v1.4H, v16.4H        // multiply accumulate inner loop j = 0, idx = 0..3
+        smlal               v0.4S, v2.4H, v17.4H        // multiply accumulate inner loop j = 1, idx = 0..3
+        smlal               v0.4S, v3.4H, v18.4H        // multiply accumulate inner loop j = 2, idx = 0..3
+        smlal               v0.4S, v4.4H, v19.4H        // multiply accumulate inner loop j = 3, idx = 0..3
+
+        smlal2              v5.4S, v1.8H, v16.8H        // multiply accumulate inner loop j = 0, idx = 4..7
+        smlal2              v5.4S, v2.8H, v17.8H        // multiply accumulate inner loop j = 1, idx = 4..7
+        smlal2              v5.4S, v3.8H, v18.8H        // multiply accumulate inner loop j = 2, idx = 4..7
+        smlal2              v5.4S, v4.8H, v19.8H        // multiply accumulate inner loop j = 3, idx = 4..7
+
+        subs                w2, w2, #8                  // dstW -= 8
+        sqshrn              v0.4H, v0.4S, #7            // shift and clip the 2x16-bit final values
+        sqshrn              v1.4H, v5.4S, #7            // shift and clip the 2x16-bit final values
+        st1                 {v0.4H, v1.4H}, [x1], #16   // write to dst[idx + 0..7]
+
+        cbnz                w2, 2f                      // if >0 iterations remain, jump to the wrap up section
+
+        add                 sp, sp, #32                 // clean up stack
+        ret
+
+        // finish up when dstW % 8 != 0 or dstW < 16
+2:
+        // load src
+        ldr                 w8, [x5], #4                // filterPos[i]
+        ldr                 w9, [x3, w8, UXTW]          // src[filterPos[i] + 0..3]
+        ins                 v5.S[0], w9                 // move to simd register
+        // load filter
+        ld1                 {v6.4H}, [x4], #8           // filter[filterSize * i + 0..3]
+
+        uxtl                v5.8H, v5.8B                // unsigned extend long, convert src data to 16-bit
+        smull               v0.4S, v5.4H, v6.4H         // 4 iterations of src[...] * filter[...]
+        addp                v0.4S, v0.4S, v0.4S         // accumulate the smull results
+        addp                v0.4S, v0.4S, v0.4S         // accumulate the smull results
+        sqshrn              v0.4H, v0.4S, #7            // shift and clip the 2x16-bit final values
+        mov                 w10, v0.S[0]                // move back to general register (only one value from simd reg is used)
+        strh                w10, [x1], #2               // dst[i] = ...
+        sub                 w2, w2, #1                  // dstW--
+        cbnz                w2, 2b
+
+        add                 sp, sp, #32                 // clean up stack
+        ret
+endfunc
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 09d0a7130e..2ea4ccb3a6 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -22,25 +22,48 @@
 #include "libswscale/swscale_internal.h"
 #include "libavutil/aarch64/cpu.h"
 
-void ff_hscale_8_to_15_neon(SwsContext *c, int16_t *dst, int dstW,
-                            const uint8_t *src, const int16_t *filter,
-                            const int32_t *filterPos, int filterSize);
+#define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
+void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
+                                                SwsContext *c, int16_t *data, \
+                                                int dstW, const uint8_t *src, \
+                                                const int16_t *filter, \
+                                                const int32_t *filterPos, int filterSize)
+#define SCALE_FUNCS(filter_n, opt) \
+    SCALE_FUNC(filter_n,  8, 15, opt);
+#define ALL_SCALE_FUNCS(opt) \
+    SCALE_FUNCS(4, opt); \
+    SCALE_FUNCS(8, opt); \
+    SCALE_FUNCS(X8, opt)
+
+ALL_SCALE_FUNCS(neon);
 
 void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset);
 
+#define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do {              \
+    if (c->srcBpc == 8 && c->dstBpc <= 14) {                            \
+      hscalefn =                                                        \
+        ff_hscale8to15_ ## filtersize ## _ ## opt;                      \
+    }                                                                   \
+} while (0)
+
+#define ASSIGN_SCALE_FUNC(hscalefn, filtersize, opt)                    \
+  switch (filtersize) {                                                 \
+  case 4:  ASSIGN_SCALE_FUNC2(hscalefn, 4, opt); break;                 \
+  case 8:  ASSIGN_SCALE_FUNC2(hscalefn, 8, opt); break;                 \
+  default: if (filtersize % 8 == 0)                                     \
+               ASSIGN_SCALE_FUNC2(hscalefn, X8, opt);                   \
+           break;                                                       \
+  }
+
 av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
 {
     int cpu_flags = av_get_cpu_flags();
 
     if (have_neon(cpu_flags)) {
-        if (c->srcBpc == 8 && c->dstBpc <= 14 &&
-            (c->hLumFilterSize % 8) == 0 &&
-            (c->hChrFilterSize % 8) == 0)
-        {
-            c->hyScale = c->hcScale = ff_hscale_8_to_15_neon;
-        }
+        ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon);
+        ASSIGN_SCALE_FUNC(c->hcScale, c->hChrFilterSize, neon);
         if (c->dstBpc == 8) {
             c->yuv2planeX = ff_yuv2planeX_8_neon;
         }
diff --git a/libswscale/utils.c b/libswscale/utils.c
index c5ea8853d5..2f2b8e73a9 100644
--- a/libswscale/utils.c
+++ b/libswscale/utils.c
@@ -1825,7 +1825,7 @@ av_cold int sws_init_context(SwsContext *c, SwsFilter *srcFilter,
         {
             const int filterAlign = X86_MMX(cpu_flags)     ? 4 :
                                     PPC_ALTIVEC(cpu_flags) ? 8 :
-                                    have_neon(cpu_flags)   ? 8 : 1;
+                                    have_neon(cpu_flags)   ? 4 : 1;
 
             if ((ret = initFilter(&c->hLumFilter, &c->hLumFilterPos,
                            &c->hLumFilterSize, c->lumXInc,

From patchwork Fri Apr 15 21:37:06 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Swinney, Jonathan" <jswinney@amazon.com>
X-Patchwork-Id: 35332
From: "Swinney, Jonathan" <jswinney@amazon.com>
To: "ffmpeg-devel@ffmpeg.org" <ffmpeg-devel@ffmpeg.org>
Thread-Topic: [PATCH 2/2] swscale/aarch64: add vscale specializations
Thread-Index: AdhRDA+QV+DuTNdhRaigK3LBQ9rnSg==
Date: Fri, 15 Apr 2022 21:37:06 +0000
Message-ID: <fb9f751291d84d36bc66b0c65028e741@EX13D07UWB004.ant.amazon.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-ms-exchange-transport-fromentityheader: Hosted
x-originating-ip: [10.43.160.81]
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 2/2] swscale/aarch64: add vscale
 specializations
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st>, "Pop,
 Sebastian" <spop@amazon.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: gzTnPKFKJst3

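This commit adds specialized code paths to ff_yuv2planeX_8_neon for filterSize
values of 2, 4, and 8. Each path is unrolled to match its filterSize and loads
the source pointers and filter coefficients once, before the per-pixel loop,
which improves performance. It also adds ff_yuv2plane1_8_neon, which is
selected for yuv2plane1 when dstBpc is 8.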

Benchmark on c6g (times in seconds):

| filterSize  | 2     | 4     | 8     |
| ----------- | ----- | ----- | ----- |
| original    | 0.581 | 0.974 | 1.744 |
| optimized   | 0.399 | 0.569 | 1.052 |
| improvement | 31.1% | 41.6% | 39.7% |

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
---
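
For reviewers, here is a minimal scalar sketch of what the specialized paths
compute. It is an illustration and not part of the patch. The names
planeX_8_ref, planeX_8_fs2 and clip_uint8 are hypothetical, and clip_uint8
stands in for av_clip_uint8. The sketch assumes the usual 8-bit vertical
scaling formula (dither << 12, accumulate src[j][i] * filter[j], shift right
by 19, clip to 8 bits), which is what the comments in the assembly describe.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to 0..255; hypothetical stand-in for av_clip_uint8(). */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Generic path: works for any filterSize, re-reads src[j] and filter[j]
 * for every output pixel. */
static void planeX_8_ref(const int16_t *filter, int filterSize,
                         const int16_t **src, uint8_t *dest, int dstW,
                         const uint8_t *dither, int offset)
{
    assert(filterSize > 0);
    for (int i = 0; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];
        dest[i] = clip_uint8(val >> 19);
    }
}

/* filterSize == 2 path: row pointers and coefficients are loaded once,
 * and the inner loop over j is fully unrolled. */
static void planeX_8_fs2(const int16_t *filter,
                         const int16_t **src, uint8_t *dest, int dstW,
                         const uint8_t *dither, int offset)
{
    const int16_t *s0 = src[0], *s1 = src[1];
    const int f0 = filter[0], f1 = filter[1];

    for (int i = 0; i < dstW; i++) {
        int val = dither[(i + offset) & 7] << 12;
        val += s0[i] * f0 + s1[i] * f1;
        dest[i] = clip_uint8(val >> 19);
    }
}
```

The filterSize 4 and 8 paths follow the same pattern with more rows. The NEON
code does the same work on 8 pixels per iteration.
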
 libswscale/aarch64/output.S  | 147 +++++++++++++++++++++++++++++++++--
 libswscale/aarch64/swscale.c |  12 +++
 2 files changed, 153 insertions(+), 6 deletions(-)

diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index af71de6050..9c99c3bea9 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -21,12 +21,27 @@
 #include "libavutil/aarch64/asm.S"
 
 function ff_yuv2planeX_8_neon, export=1
+// x0 - const int16_t *filter
+// w1 - int filterSize
+// x2 - const int16_t **src
+// x3 - uint8_t *dest
+// w4 - int dstW
+// x5 - const uint8_t *dither
+// w6 - int offset
+
         ld1                 {v0.8B}, [x5]                   // load 8x8-bit dither
         cbz                 w6, 1f                          // check if offsetting present
         ext                 v0.8B, v0.8B, v0.8B, #3         // honor offsetting which can be 0 or 3 only
 1:      uxtl                v0.8H, v0.8B                    // extend dither to 16-bit
         ushll               v1.4S, v0.4H, #12               // extend dither to 32-bit with left shift by 12 (part 1)
         ushll2              v2.4S, v0.8H, #12               // extend dither to 32-bit with left shift by 12 (part 2)
+        cmp                 w1, #8                          // if filterSize == 8, branch to specialized version
+        b.eq                5f
+        cmp                 w1, #4                          // if filterSize == 4, branch to specialized version
+        b.eq                7f
+        cmp                 w1, #2                          // if filterSize == 2, branch to specialized version
+        b.eq                9f
+
         mov                 x7, #0                          // i = 0
 2:      mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
         mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
@@ -34,16 +49,15 @@ function ff_yuv2planeX_8_neon, export=1
         mov                 x9, x2                          // srcp    = src
         mov                 x10, x0                         // filterp = filter
 3:      ldp                 x11, x12, [x9], #16             // get 2 pointers: src[j] and src[j+1]
+        ld2r                {v16.8H, v17.8H}, [x10], #4     // read 2x16-bit coeff X and Y at filter[j] and filter[j+1]
         add                 x11, x11, x7, lsl #1            // &src[j  ][i]
         add                 x12, x12, x7, lsl #1            // &src[j+1][i]
         ld1                 {v5.8H}, [x11]                  // read 8x16-bit @ src[j  ][i + {0..7}]: A,B,C,D,E,F,G,H
         ld1                 {v6.8H}, [x12]                  // read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P
-        ld1r                {v7.8H}, [x10], #2              // read 1x16-bit coeff X at filter[j  ] and duplicate across lanes
-        ld1r                {v16.8H}, [x10], #2             // read 1x16-bit coeff Y at filter[j+1] and duplicate across lanes
-        smlal               v3.4S, v5.4H, v7.4H             // val0 += {A,B,C,D} * X
-        smlal2              v4.4S, v5.8H, v7.8H             // val1 += {E,F,G,H} * X
-        smlal               v3.4S, v6.4H, v16.4H            // val0 += {I,J,K,L} * Y
-        smlal2              v4.4S, v6.8H, v16.8H            // val1 += {M,N,O,P} * Y
+        smlal               v3.4S, v5.4H, v16.4H            // val0 += {A,B,C,D} * X
+        smlal2              v4.4S, v5.8H, v16.8H            // val1 += {E,F,G,H} * X
+        smlal               v3.4S, v6.4H, v17.4H            // val0 += {I,J,K,L} * Y
+        smlal2              v4.4S, v6.8H, v17.8H            // val1 += {M,N,O,P} * Y
         subs                w8, w8, #2                      // tmpfilterSize -= 2
         b.gt                3b                              // loop until filterSize consumed
 
@@ -55,4 +69,125 @@ function ff_yuv2planeX_8_neon, export=1
         add                 x7, x7, #8                      // i += 8
         b.gt                2b                              // loop until width consumed
         ret
+
+5:      // fs=8
+        ldp                 x5, x6, [x2]                    // load 2 pointers: src[j  ] and src[j+1]
+        ldp                 x7, x9, [x2, #16]               // load 2 pointers: src[j+2] and src[j+3]
+        ldp                 x10, x11, [x2, #32]             // load 2 pointers: src[j+4] and src[j+5]
+        ldp                 x12, x13, [x2, #48]             // load 2 pointers: src[j+6] and src[j+7]
+
+        // load 8x16-bit values for filter[j], where j=0..7 and replicated across lanes
+        ld4r                {v16.8H, v17.8H, v18.8H, v19.8H}, [x0], #8
+        ld4r                {v20.8H, v21.8H, v22.8H, v23.8H}, [x0]
+6:
+        mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+
+        ld1                 {v24.8H}, [x5], #16             // load 8x16-bit values for src[j + 0][i + {0..7}]
+        ld1                 {v25.8H}, [x6], #16             // load 8x16-bit values for src[j + 1][i + {0..7}]
+        ld1                 {v26.8H}, [x7], #16             // load 8x16-bit values for src[j + 2][i + {0..7}]
+        ld1                 {v27.8H}, [x9], #16             // load 8x16-bit values for src[j + 3][i + {0..7}]
+        ld1                 {v28.8H}, [x10], #16            // load 8x16-bit values for src[j + 4][i + {0..7}]
+        ld1                 {v29.8H}, [x11], #16            // load 8x16-bit values for src[j + 5][i + {0..7}]
+        ld1                 {v30.8H}, [x12], #16            // load 8x16-bit values for src[j + 6][i + {0..7}]
+        ld1                 {v31.8H}, [x13], #16            // load 8x16-bit values for src[j + 7][i + {0..7}]
+
+        smlal               v3.4S, v24.4H, v16.4H           // val0 += src[0][i + {0..3}] * filter[0]
+        smlal2              v4.4S, v24.8H, v16.8H           // val1 += src[0][i + {4..7}] * filter[0]
+        smlal               v3.4S, v25.4H, v17.4H           // val0 += src[1][i + {0..3}] * filter[1]
+        smlal2              v4.4S, v25.8H, v17.8H           // val1 += src[1][i + {4..7}] * filter[1]
+        smlal               v3.4S, v26.4H, v18.4H           // val0 += src[2][i + {0..3}] * filter[2]
+        smlal2              v4.4S, v26.8H, v18.8H           // val1 += src[2][i + {4..7}] * filter[2]
+        smlal               v3.4S, v27.4H, v19.4H           // val0 += src[3][i + {0..3}] * filter[3]
+        smlal2              v4.4S, v27.8H, v19.8H           // val1 += src[3][i + {4..7}] * filter[3]
+        smlal               v3.4S, v28.4H, v20.4H           // val0 += src[4][i + {0..3}] * filter[4]
+        smlal2              v4.4S, v28.8H, v20.8H           // val1 += src[4][i + {4..7}] * filter[4]
+        smlal               v3.4S, v29.4H, v21.4H           // val0 += src[5][i + {0..3}] * filter[5]
+        smlal2              v4.4S, v29.8H, v21.8H           // val1 += src[5][i + {4..7}] * filter[5]
+        smlal               v3.4S, v30.4H, v22.4H           // val0 += src[6][i + {0..3}] * filter[6]
+        smlal2              v4.4S, v30.8H, v22.8H           // val1 += src[6][i + {4..7}] * filter[6]
+        smlal               v3.4S, v31.4H, v23.4H           // val0 += src[7][i + {0..3}] * filter[7]
+        smlal2              v4.4S, v31.8H, v23.8H           // val1 += src[7][i + {4..7}] * filter[7]
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        subs                w4, w4, #8                      // dstW -= 8
+        b.gt                6b                              // loop until width consumed
+        ret
+
+7:      // fs=4
+        ldp                 x5, x6, [x2]                    // load 2 pointers: src[j  ] and src[j+1]
+        ldp                 x7, x9, [x2, #16]               // load 2 pointers: src[j+2] and src[j+3]
+
+        // load 4x16-bit values for filter[j], where j=0..3 and replicated across lanes
+        ld4r                {v16.8H, v17.8H, v18.8H, v19.8H}, [x0]
+8:
+        mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+
+        ld1                 {v24.8H}, [x5], #16             // load 8x16-bit values for src[j + 0][i + {0..7}]
+        ld1                 {v25.8H}, [x6], #16             // load 8x16-bit values for src[j + 1][i + {0..7}]
+        ld1                 {v26.8H}, [x7], #16             // load 8x16-bit values for src[j + 2][i + {0..7}]
+        ld1                 {v27.8H}, [x9], #16             // load 8x16-bit values for src[j + 3][i + {0..7}]
+
+        smlal               v3.4S, v24.4H, v16.4H           // val0 += src[0][i + {0..3}] * filter[0]
+        smlal2              v4.4S, v24.8H, v16.8H           // val1 += src[0][i + {4..7}] * filter[0]
+        smlal               v3.4S, v25.4H, v17.4H           // val0 += src[1][i + {0..3}] * filter[1]
+        smlal2              v4.4S, v25.8H, v17.8H           // val1 += src[1][i + {4..7}] * filter[1]
+        smlal               v3.4S, v26.4H, v18.4H           // val0 += src[2][i + {0..3}] * filter[2]
+        smlal2              v4.4S, v26.8H, v18.8H           // val1 += src[2][i + {4..7}] * filter[2]
+        smlal               v3.4S, v27.4H, v19.4H           // val0 += src[3][i + {0..3}] * filter[3]
+        smlal2              v4.4S, v27.8H, v19.8H           // val1 += src[3][i + {4..7}] * filter[3]
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        subs                w4, w4, #8                      // dstW -= 8
+        b.gt                8b                              // loop until width consumed
+        ret
+
+9:      // fs=2
+        ldp                 x5, x6, [x2]                    // load 2 pointers: src[j  ] and src[j+1]
+
+        // load 2x16-bit values for filter[j], where j=0..1 and replicated across lanes
+        ld2r                {v16.8H, v17.8H}, [x0]
+10:
+        mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+
+        ld1                 {v24.8H}, [x5], #16             // load 8x16-bit values for src[j + 0][i + {0..7}]
+        ld1                 {v25.8H}, [x6], #16             // load 8x16-bit values for src[j + 1][i + {0..7}]
+
+        smlal               v3.4S, v24.4H, v16.4H           // val0 += src[0][i + {0..3}] * filter[0]
+        smlal2              v4.4S, v24.8H, v16.8H           // val1 += src[0][i + {4..7}] * filter[0]
+        smlal               v3.4S, v25.4H, v17.4H           // val0 += src[1][i + {0..3}] * filter[1]
+        smlal2              v4.4S, v25.8H, v17.8H           // val1 += src[1][i + {4..7}] * filter[1]
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        subs                w4, w4, #8                      // dstW -= 8
+        b.gt                10b                             // loop until width consumed
+        ret
+endfunc
+function ff_yuv2plane1_8_neon, export=1
+        ld1                 {v0.8B}, [x3]                   // load 8x8-bit dither (4th argument)
+        cbz                 w4, 1f                          // check if offsetting present (5th argument)
+        ext                 v0.8B, v0.8B, v0.8B, #3         // honor offsetting which can be 0 or 3 only
+1:      uxtl                v0.8H, v0.8B                    // extend dither to 16-bit
+
+
+2:
+        ld1                 {v5.8H}, [x0], #16              // read 8x16-bit @ src[i + {0..7}]
+        add                 v3.8H, v0.8H, v5.8H             // val = src[i + {0..7}] + dither
+
+        uqshrn              v3.8b, v3.8h, #7                // clip8(val>>7)
+        st1                 {v3.8b}, [x1], #8               // write to destination
+        subs                w2, w2, #8                      // dstW -= 8
+        b.gt                2b                              // loop until width consumed
+        ret
 endfunc
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 2ea4ccb3a6..0cc821bf11 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -40,6 +40,12 @@ ALL_SCALE_FUNCS(neon);
 void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset);
+void ff_yuv2plane1_8_neon(
+        const int16_t *src,
+        uint8_t *dest,
+        int dstW,
+        const uint8_t *dither,
+        int offset);
 
 #define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do {              \
     if (c->srcBpc == 8 && c->dstBpc <= 14) {                            \
@@ -56,6 +62,11 @@ void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
                ASSIGN_SCALE_FUNC2(hscalefn, X8, opt);                   \
            break;                                                       \
   }
+#define ASSIGN_VSCALE_FUNC(vscalefn, opt1)                              \
+    switch(c->dstBpc){                                                  \
+    case 8: vscalefn = ff_yuv2plane1_8_  ## opt1;  break;               \
+    default: break;                                                     \
+    }
 
 av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
 {
@@ -64,6 +75,7 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
     if (have_neon(cpu_flags)) {
         ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon);
         ASSIGN_SCALE_FUNC(c->hcScale, c->hChrFilterSize, neon);
+        ASSIGN_VSCALE_FUNC(c->yuv2plane1, neon);
         if (c->dstBpc == 8) {
             c->yuv2planeX = ff_yuv2planeX_8_neon;
         }