From patchwork Mon Oct 17 13:07:12 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Hubert Mazur <hum@semihalf.com>
X-Patchwork-Id: 38762
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:4a86:b0:9d:28a3:170e with SMTP id fn6csp1584471pzb;
        Mon, 17 Oct 2022 06:08:37 -0700 (PDT)
X-Google-Smtp-Source: 
 AMsMyM593Pz70ifXyAddJvTjKFfa2AqIMW19MknQjEmsFBhvfzfH4TopcINjZcR/gaOTXDJcDA6Y
X-Received: by 2002:a17:906:db03:b0:741:337e:3600 with SMTP id
 xj3-20020a170906db0300b00741337e3600mr8667761ejb.343.1666012116755;
        Mon, 17 Oct 2022 06:08:36 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1666012116; cv=none;
        d=google.com; s=arc-20160816;
        b=VyqG8IZJNw1llGEadCC6D6gsAPLQU61AfiAe/Mg+eIXJbj+/PxsXvKE1Q6YJGnA0tF
         kvsAgR6gMC4tQTBPofrmLYZNJ6xiknCcfz9dYcygO4O26ip/AfcHGCk3XQ1RpUUalSAY
         cOf7UG45AEtDBkNDmW5J6yTaaf71Q2Mn26Nwjci34U0b/inkSMPuRMMWdV3bfXzcG8dZ
         SoMdTb+C/wgh0HSSXsIRuOpeYiks4NmR0it+IWlJY+gzHl6mG38o+WU5BxGDID5govZP
         3EhVtu25KiI30gWgR6VZQJcraI97uH47k3fkfn/jzLG68omEv2dRFzzHJoVUtKItoq8i
         9VvQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:dkim-signature:delivered-to;
        bh=oTEVwTvglecRlF++1LEVuDX3IFpvV2+jfo2Al1WvDkI=;
        b=Jq9otxl+wYa5BAZR0j/wl5EjWt8kOeqP7L6Bq6dKt3BfXlPvoLmqIQWOAKomtfNY33
         F0TjtkPMUq5doaf2TVJ4wdkqAys0TKXKHlfDUAs2APCTWI38PORymCyOlBpuR/C44o62
         P8yELBC8djXSyNWMH1XLSMQRGJ+abXJ7s+jNJdXjO9HXtUtKHsiLmggdtLF8+2XD+rYY
         GQWZ6mxj72vkRwAVzWN+w8Xb8EGtPwrhJy44ybmou7iwZwCKXU899hjSd/nwCokS6Q1V
         mO6hYDmubLCQpjeRATi7U5ZjBPbXR0IDVfUQDzpn7/5GES3v4OkLYiNRawrg6atgzzKX
         UhMA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@semihalf.com
 header.s=google header.b=FEtxOEFi;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE)
 header.from=semihalf.com
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 ga13-20020a1709070c0d00b0078db1343eedsi9334166ejc.774.2022.10.17.06.08.36;
        Mon, 17 Oct 2022 06:08:36 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@semihalf.com
 header.s=google header.b=FEtxOEFi;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE)
 header.from=semihalf.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 8151868BD0D;
	Mon, 17 Oct 2022 16:08:25 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail-wm1-f47.google.com (mail-wm1-f47.google.com
 [209.85.128.47])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E606A68BD00
 for <ffmpeg-devel@ffmpeg.org>; Mon, 17 Oct 2022 16:08:18 +0300 (EEST)
Received: by mail-wm1-f47.google.com with SMTP id
 v130-20020a1cac88000000b003bcde03bd44so12654974wme.5
 for <ffmpeg-devel@ffmpeg.org>; Mon, 17 Oct 2022 06:08:18 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=semihalf.com;
 s=google;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=0J628ykI6t9YgPtLIUmxg/iHJuXbUWR2dHMWGpd5GVM=;
 b=FEtxOEFiXEomSLGRHVclmD7CGzR3H8to2YBF8Ji7cnAWyggKLZZB37ZUuq0y5GKVKk
 UYqDkhvzTlHvmrhjGXfpBU7pQx1wdw1epcYPdAJmIhneFsZH2c0ij/HXXV6seV6Bi8ll
 1hI04DEABQ8C6hFNgwEbJzuq8GGow0kAWLrsVsuVn2GcCfR1Yi7pwVWL3DBo7OjsDOJY
 4JSjkvJaI00T5wuS8WI9NISIWQ1JeoBPKGUZSLxWr4kaDnhZ+OTR+jHsHMT4QawSZAdh
 KdT/FB8oIQrE7SGGz2ZRbZzSGOAtoO0pM1Q4QdwPUio68PRAFSB1yY74W1g+JrkinLx6
 IsJw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
 :subject:date:message-id:reply-to;
 bh=0J628ykI6t9YgPtLIUmxg/iHJuXbUWR2dHMWGpd5GVM=;
 b=fttD4EqRbY+n43IOFK0CvOsUhhw/LDewCQ2RwyegLuLC+THBT3d3HgBKXTwMogyAZV
 5BvrfvDc+KFmyqWl1U64vipcP0f/f4fMnTKQsA2HG3Lv1z/VA+/MifNLKsZ4+RyvYT0K
 UTAIQOx+EbZc+nHuQO/30lyCofDY2bErvYcOFIDZhIuQjOvmqEs5327P6YfjhPJYSbDX
 w+E0TFkMwIRrigkmV3D1Yljwo2vXDyx7Odi1Poy+YMH/h8i1bQ2lm9JErAfnRNNogw5s
 DeWDX2rnL+0mkUBJbFR1s83dMf5BWszFyxk4o+3ZW4p9uEWkSLqiss+7nVM5fgsEglxZ
 u0uw==
X-Gm-Message-State: ACrzQf30TTnoV3LAuCHgsUxV+cHiOWAx52/ZgETWKVw+rCGUqvoZ4T4L
 rMyQfguOH0SAvAdPYDwM5LCddbY7yvdcOqGX
X-Received: by 2002:a05:600c:6028:b0:3c6:f0bb:316a with SMTP id
 az40-20020a05600c602800b003c6f0bb316amr7197132wmb.1.1666012097760;
 Mon, 17 Oct 2022 06:08:17 -0700 (PDT)
Received: from ip-172-31-3-164.eu-west-1.compute.internal
 (ec2-54-154-193-154.eu-west-1.compute.amazonaws.com. [54.154.193.154])
 by smtp.gmail.com with ESMTPSA id
 t18-20020a5d6a52000000b0022af865810esm8297237wrw.75.2022.10.17.06.08.16
 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
 Mon, 17 Oct 2022 06:08:17 -0700 (PDT)
From: Hubert Mazur <hum@semihalf.com>
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 17 Oct 2022 13:07:12 +0000
Message-Id: <20221017130715.30896-2-hum@semihalf.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20221017130715.30896-1-hum@semihalf.com>
References: <20221017130715.30896-1-hum@semihalf.com>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale
 8 to 19
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com,
 Hubert Mazur <hum@semihalf.com>, martin@martin.st, mw@semihalf.com,
 spop@amazon.com
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: 4jyaSZpV1UWS

Add arm64 NEON implementations of hscale 8 to 19 for filter
sizes 4, X4 and X8. They closely follow the existing routines
for hscale 8 to 15. The main difference is in how the result is
stored: it is written as int32_t instead of int16_t.

These functions are heavily inspired by the patches provided by
J. Swinney and M. Storsjö for hscale8to15, slightly adapted for
hscale8to19.

Tests and benchmarks were run on AWS Graviton 2 instances. The
checkasm results are shown below.

hscale_8_to_19__fs_4_dstW_512_c: 5663.2
hscale_8_to_19__fs_4_dstW_512_neon: 1259.7
hscale_8_to_19__fs_8_dstW_512_c: 9306.0
hscale_8_to_19__fs_8_dstW_512_neon: 2020.2
hscale_8_to_19__fs_12_dstW_512_c: 12932.7
hscale_8_to_19__fs_12_dstW_512_neon: 2462.5
hscale_8_to_19__fs_16_dstW_512_c: 16844.2
hscale_8_to_19__fs_16_dstW_512_neon: 4671.2
hscale_8_to_19__fs_32_dstW_512_c: 32803.7
hscale_8_to_19__fs_32_dstW_512_neon: 5474.2
hscale_8_to_19__fs_40_dstW_512_c: 40948.0
hscale_8_to_19__fs_40_dstW_512_neon: 6669.7

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
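For reviewers comparing against the scalar path, the arithmetic these
routines perform can be sketched in C as below. This is a minimal model
read off the assembly in this patch (shift right by 3, then clamp to
(1 << 19) - 1). The name sketch_hscale8to19 is hypothetical and is not
an FFmpeg symbol.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of an 8-bit to 19-bit horizontal scale.
 * sketch_hscale8to19 is a hypothetical name used for illustration only. */
static void sketch_hscale8to19(int32_t *dst, int dstW, const uint8_t *src,
                               const int16_t *filter, const int32_t *filterPos,
                               int filterSize)
{
    /* Same constant the NEON code builds with movi/shl/sub. */
    const int32_t max19 = (1 << 19) - 1;

    assert(filterSize > 0);
    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;

        /* Dot product of filterSize source bytes with the i-th filter row. */
        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[filterPos[i] + j] * filter[filterSize * i + j];

        /* Mirrors sshr #3; assumes an arithmetic right shift for negative sums. */
        val >>= 3;

        /* Mirrors smin against the 19-bit maximum; results are stored as int32_t. */
        dst[i] = val < max19 ? val : max19;
    }
}
```

Only the upper bound is clamped, matching the smin in the assembly. The
output buffer holds one int32_t per destination sample.
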
 libswscale/aarch64/hscale.S  | 292 ++++++++++++++++++++++++++++++++++-
 libswscale/aarch64/swscale.c |  13 +-
 2 files changed, 300 insertions(+), 5 deletions(-)
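
The selection performed by ASSIGN_SCALE_FUNC2 after this change can be
modelled as follows. The enum and function names are hypothetical and
exist only for illustration. Only the srcBpc and dstBpc conditions come
from the macro.

```c
#include <assert.h>

/* Hypothetical labels for the three outcomes of the macro. */
enum sketch_hscale {
    SKETCH_KEEP,    /* hscalefn is left untouched */
    SKETCH_8TO15,   /* ff_hscale8to15_<filtersize>_<opt> */
    SKETCH_8TO19    /* ff_hscale8to19_<filtersize>_<opt> */
};

/* Models which routine family is chosen for a given bit-depth pair. */
static enum sketch_hscale sketch_pick_hscale(int srcBpc, int dstBpc)
{
    assert(srcBpc > 0 && dstBpc > 0);

    if (srcBpc != 8)
        return SKETCH_KEEP;             /* only 8-bit input is specialized */

    /* Up to 14 bits the int16_t path is used, otherwise the int32_t path. */
    return dstBpc <= 14 ? SKETCH_8TO15 : SKETCH_8TO19;
}
```

With 8-bit input, any destination depth above 14 bits now selects the
hscale8to19 family, where previously no NEON routine was assigned.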

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index a16d3dca42..5e8cad9825 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -218,7 +218,6 @@ function ff_hscale8to15_4_neon, export=1
 //  2. Interleaved prefetching src data and madd
 //  3. Complete madd
 //  4. Complete remaining iterations when dstW % 8 != 0
-
         sub                 sp, sp, #32                 // allocate 32 bytes on the stack
         cmp                 w2, #16                     // if dstW <16, skip to the last block used for wrapping up
         b.lt                2f
@@ -347,3 +346,294 @@ function ff_hscale8to15_4_neon, export=1
         add                 sp, sp, #32                 // clean up stack
         ret
 endfunc
+
+function ff_hscale8to19_4_neon, export=1
+        // x0               SwsContext *c (unused)
+        // x1               int32_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src
+        // x4               const int16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #19
+        sub                 v18.4s, v18.4s, v17.4s      // max allowed value
+
+        cmp                 w2, #16
+        b.lt                2f // move to last block
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        // load four source bytes at each of the eight filter positions
+        ldr                 w8, [x3, w8, UXTW]
+        ldr                 w9, [x3, w9, UXTW]
+        ldr                 w10, [x3, w10, UXTW]
+        ldr                 w11, [x3, w11, UXTW]
+        ldr                 w12, [x3, w12, UXTW]
+        ldr                 w13, [x3, w13, UXTW]
+        ldr                 w14, [x3, w14, UXTW]
+        ldr                 w15, [x3, w15, UXTW]
+
+        sub                 sp, sp, #32
+
+        stp                 w8, w9, [sp]
+        stp                 w10, w11, [sp, #8]
+        stp                 w12, w13, [sp, #16]
+        stp                 w14, w15, [sp, #24]
+
+1:
+        ld4                 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+        // load filterPositions into registers for next iteration
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+        uxtl                v0.8h, v0.8b
+        ldr                 w8, [x3, w8, UXTW]
+        smull               v5.4s, v0.4h, v28.4h        // multiply first column of src
+        ldr                 w9, [x3, w9, UXTW]
+        smull2              v6.4s, v0.8h, v28.8h
+        stp                 w8, w9, [sp]
+
+        uxtl                v1.8h, v1.8b
+        ldr                 w10, [x3, w10, UXTW]
+        smlal               v5.4s, v1.4h, v29.4h        // multiply second column of src
+        ldr                 w11, [x3, w11, UXTW]
+        smlal2              v6.4s, v1.8h, v29.8h
+        stp                 w10, w11, [sp, #8]
+
+        uxtl                v2.8h, v2.8b
+        ldr                 w12, [x3, w12, UXTW]
+        smlal               v5.4s, v2.4h, v30.4h        // multiply third column of src
+        ldr                 w13, [x3, w13, UXTW]
+        smlal2              v6.4s, v2.8h, v30.8h
+        stp                 w12, w13, [sp, #16]
+
+        uxtl                v3.8h, v3.8b
+        ldr                 w14, [x3, w14, UXTW]
+        smlal               v5.4s, v3.4h, v31.4h        // multiply fourth column of src
+        ldr                 w15, [x3, w15, UXTW]
+        smlal2              v6.4s, v3.8h, v31.8h
+        stp                 w14, w15, [sp, #24]
+
+        sub                 w2, w2, #8
+        sshr                v5.4s, v5.4s, #3
+        sshr                v6.4s, v6.4s, #3
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+
+        st1                 {v5.4s, v6.4s}, [x1], #32
+        cmp                 w2, #16
+        b.ge                1b
+
+        // here we make last iteration, without updating the registers
+        ld4                 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+        uxtl                v0.8h, v0.8b
+        uxtl                v1.8h, v1.8b
+        smull               v5.4s, v0.4h, v28.4h
+        smull2              v6.4s, v0.8h, v28.8h
+        uxtl                v2.8h, v2.8b
+        smlal               v5.4s, v1.4h, v29.4H
+        smlal2              v6.4s, v1.8h, v29.8H
+        uxtl                v3.8h, v3.8b
+        smlal               v5.4s, v2.4h, v30.4H
+        smlal2              v6.4s, v2.8h, v30.8H
+        smlal               v5.4s, v3.4h, v31.4H
+        smlal2              v6.4s, v3.8h, v31.8h
+
+        sshr                v5.4s, v5.4s, #3
+        sshr                v6.4s, v6.4s, #3
+
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+
+        sub                 w2, w2, #8
+        st1                 {v5.4s, v6.4s}, [x1], #32
+        add                 sp, sp, #32 // restore stack
+        cbnz                w2, 2f
+
+        ret
+
+2:
+        ldr                 w8, [x5], #4 // load filterPos
+        add                 x9, x3, w8, UXTW // src + filterPos
+        ld1                 {v0.s}[0], [x9] // load 4 x uint8_t into a single 32-bit lane
+        ld1                 {v31.4h}, [x4], #8
+        uxtl                v0.8h, v0.8b
+        smull               v5.4s, v0.4h, v31.4H
+        saddlv              d0, v5.4S
+        sqshrn              s0, d0, #3
+        smin                v0.4s, v0.4s, v18.4s
+        st1                 {v0.s}[0], [x1], #4
+        sub                 w2, w2, #1
+        cbnz                w2, 2b // if iterations remain jump to beginning
+
+        ret
+endfunc
+
+function ff_hscale8to19_X8_neon, export=1
+        movi                v20.4s, #1
+        movi                v17.4s, #1
+        shl                 v20.4s, v20.4s, #19
+        sub                 v20.4s, v20.4s, v17.4s
+
+        sbfiz               x7, x6, #1, #32             // filterSize*2 (*2 because int16)
+1:
+        mov                 x16, x4                     // filter0 = filter
+        ldr                 w8, [x5], #4                // filterPos[idx]
+        add                 x12, x16, x7                // filter1 = filter0 + filterSize*2
+        ldr                 w0, [x5], #4                // filterPos[idx + 1]
+        add                 x13, x12, x7                // filter2 = filter1 + filterSize*2
+        ldr                 w11, [x5], #4               // filterPos[idx + 2]
+        add                 x4, x13, x7                 // filter3 = filter2 + filterSize*2
+        ldr                 w9, [x5], #4                // filterPos[idx + 3]
+        movi                v0.2D, #0                   // val sum part 1 (for dst[0])
+        movi                v1.2D, #0                   // val sum part 2 (for dst[1])
+        movi                v2.2D, #0                   // val sum part 3 (for dst[2])
+        movi                v3.2D, #0                   // val sum part 4 (for dst[3])
+        add                 x17, x3, w8, UXTW           // srcp + filterPos[0]
+        add                 x8,  x3, w0, UXTW           // srcp + filterPos[1]
+        add                 x0, x3, w11, UXTW           // srcp + filterPos[2]
+        add                 x11, x3, w9, UXTW           // srcp + filterPos[3]
+        mov                 w15, w6                     // filterSize counter
+2:      ld1                 {v4.8B}, [x17], #8          // srcp[filterPos[0] + {0..7}]
+        ld1                 {v5.8H}, [x16], #16         // load 8x16-bit filter values, part 1
+        uxtl                v4.8H, v4.8B                // unpack part 1 to 16-bit
+        smlal               v0.4S, v4.4H, v5.4H         // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}]
+        ld1                 {v6.8B}, [x8], #8           // srcp[filterPos[1] + {0..7}]
+        smlal2              v0.4S, v4.8H, v5.8H         // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}]
+        ld1                 {v7.8H}, [x12], #16         // load 8x16-bit at filter+filterSize
+        ld1                 {v16.8B}, [x0], #8          // srcp[filterPos[2] + {0..7}]
+        uxtl                v6.8H, v6.8B                // unpack part 2 to 16-bit
+        ld1                 {v17.8H}, [x13], #16        // load 8x16-bit at filter+2*filterSize
+        uxtl                v16.8H, v16.8B              // unpack part 3 to 16-bit
+        smlal               v1.4S, v6.4H, v7.4H         // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+        ld1                 {v18.8B}, [x11], #8         // srcp[filterPos[3] + {0..7}]
+        smlal               v2.4S, v16.4H, v17.4H       // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        ld1                 {v19.8H}, [x4], #16         // load 8x16-bit at filter+3*filterSize
+        smlal2              v2.4S, v16.8H, v17.8H       // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        uxtl                v18.8H, v18.8B              // unpack part 4 to 16-bit
+        smlal2              v1.4S, v6.8H, v7.8H         // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+        smlal               v3.4S, v18.4H, v19.4H       // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+        subs                w15, w15, #8                // j -= 8: processed 8/filterSize
+        smlal2              v3.4S, v18.8H, v19.8H       // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+        b.gt                2b                          // inner loop if filterSize not consumed completely
+        addp                v0.4S, v0.4S, v1.4S         // part01 horizontal pair adding
+        addp                v2.4S, v2.4S, v3.4S         // part23 horizontal pair adding
+        addp                v0.4S, v0.4S, v2.4S         // part0123 horizontal pair adding
+        subs                w2, w2, #4                  // dstW -= 4
+        sshr                v0.4s, v0.4S, #3            // shift the 4x32-bit sums right by 3
+        smin                v0.4s, v0.4s, v20.4s        // clip to the 19-bit maximum
+        st1                 {v0.4s}, [x1], #16           // write to destination part0123
+        b.gt                1b                          // loop until end of line
+        ret
+endfunc
+
+function ff_hscale8to19_X4_neon, export=1
+        // x0  SwsContext *c (not used)
+        // x1  int32_t *dst
+        // w2  int dstW
+        // x3  const uint8_t *src
+        // x4  const int16_t *filter
+        // x5  const int32_t *filterPos
+        // w6  int filterSize
+
+        movi                v20.4s, #1
+        movi                v17.4s, #1
+        shl                 v20.4s, v20.4s, #19
+        sub                 v20.4s, v20.4s, v17.4s
+
+        lsl                 w7, w6, #1
+1:
+        ldp                 w8, w9, [x5]
+        ldp                 w10, w11, [x5, #8]
+
+        movi                v16.2d, #0                  // initialize accumulator for idx + 0
+        movi                v17.2d, #0                  // initialize accumulator for idx + 1
+        movi                v18.2d, #0                  // initialize accumulator for idx + 2
+        movi                v19.2d, #0                  // initialize accumulator for idx + 3
+
+        mov                 x12, x4                     // filter + 0
+        add                 x13, x4, x7                 // filter + 1
+        add                 x8, x3, w8, UXTW            // srcp + filterPos 0
+        add                 x14, x13, x7                // filter + 2
+        add                 x9, x3, w9, UXTW            // srcp + filterPos 1
+        add                 x15, x14, x7                // filter + 3
+        add                 x10, x3, w10, UXTW          // srcp + filterPos 2
+        mov                 w0, w6                      // save the filterSize to temporary variable
+        add                 x11, x3, w11, UXTW          // srcp + filterPos 3
+        add                 x5, x5, #16                 // advance filter position
+        mov                 x16, xzr                    // clear the register x16 used for offsetting the filter values
+
+2:
+        ldr                 d4, [x8], #8                // load src values for idx 0
+        ldr                 q31, [x12, x16]             // load filter values for idx 0
+        uxtl                v4.8h, v4.8b                // extend type to match the filter's size
+        ldr                 d5, [x9], #8                // load src values for idx 1
+        smlal               v16.4s, v4.4h, v31.4h       // multiplication of lower half for idx 0
+        uxtl                v5.8h, v5.8b                // extend type to match the filter's size
+        ldr                 q30, [x13, x16]             // load filter values for idx 1
+        smlal2              v16.4s, v4.8h, v31.8h       // multiplication of upper half for idx 0
+        ldr                 d6, [x10], #8               // load src values for idx 2
+        ldr                 q29, [x14, x16]             // load filter values for idx 2
+        smlal               v17.4s, v5.4h, v30.4H       // multiplication of lower half for idx 1
+        ldr                 d7, [x11], #8               // load src values for idx 3
+        smlal2              v17.4s, v5.8h, v30.8H       // multiplication of upper half for idx 1
+        uxtl                v6.8h, v6.8B                // extend type to match the filter's size
+        ldr                 q28, [x15, x16]             // load filter values for idx 3
+        smlal               v18.4s, v6.4h, v29.4h       // multiplication of lower half for idx 2
+        uxtl                v7.8h, v7.8B
+        smlal2              v18.4s, v6.8h, v29.8H       // multiplication of upper half for idx 2
+        sub                 w0, w0, #8
+        smlal               v19.4s, v7.4h, v28.4H       // multiplication of lower half for idx 3
+        cmp                 w0, #8
+        smlal2              v19.4s, v7.8h, v28.8h       // multiplication of upper half for idx 3
+        add                 x16, x16, #16                // advance filter values indexing
+
+        b.ge                2b
+
+
+        // 4 iterations left
+
+        sub                 x17, x7, #8                 // step back to wrap up the filter pos for last 4 elements
+
+        ldr                 s4, [x8]                    // load src values for idx 0
+        ldr                 d31, [x12, x17]             // load filter values for idx 0
+        uxtl                v4.8h, v4.8b                // extend type to match the filter's size
+        ldr                 s5, [x9]                    // load src values for idx 1
+        smlal               v16.4s, v4.4h, v31.4h
+        ldr                 d30, [x13, x17]             // load filter values for idx 1
+        uxtl                v5.8h, v5.8b                // extend type to match the filter's size
+        ldr                 s6, [x10]                   // load src values for idx 2
+        smlal               v17.4s, v5.4h, v30.4h
+        uxtl                v6.8h, v6.8B                // extend type to match the filter's size
+        ldr                 d29, [x14, x17]             // load filter values for idx 2
+        ldr                 s7, [x11]                   // load src values for idx 3
+        addp                v16.4s, v16.4s, v17.4s
+        uxtl                v7.8h, v7.8B
+        ldr                 d28, [x15, x17]             // load filter values for idx 3
+        smlal               v18.4s, v6.4h, v29.4h
+        smlal               v19.4s, v7.4h, v28.4h
+        subs                w2, w2, #4
+        addp                v18.4s, v18.4s, v19.4s
+        addp                v16.4s, v16.4s, v18.4s
+        sshr                v16.4s, v16.4s, #3
+        smin                v16.4s, v16.4s, v20.4s
+
+        st1                 {v16.4s}, [x1], #16
+        add                 x4, x4, x7, lsl #2
+        b.gt                1b
+        ret
+
+endfunc
\ No newline at end of file
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index d1312c6658..479fe129d0 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -29,7 +29,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
                                                 const int16_t *filter, \
                                                 const int32_t *filterPos, int filterSize)
 #define SCALE_FUNCS(filter_n, opt) \
-    SCALE_FUNC(filter_n,  8, 15, opt);
+    SCALE_FUNC(filter_n,  8, 15, opt); \
+    SCALE_FUNC(filter_n, 8, 19, opt);
 #define ALL_SCALE_FUNCS(opt) \
     SCALE_FUNCS(4, opt); \
     SCALE_FUNCS(X8, opt); \
@@ -48,9 +49,13 @@ void ff_yuv2plane1_8_neon(
         int offset);
 
 #define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do {              \
-    if (c->srcBpc == 8 && c->dstBpc <= 14) {                            \
-      hscalefn =                                                        \
-        ff_hscale8to15_ ## filtersize ## _ ## opt;                      \
+    if (c->srcBpc == 8) {                                               \
+        if (c->dstBpc <= 14) {                                          \
+            hscalefn =                                                  \
+                ff_hscale8to15_ ## filtersize ## _ ## opt;              \
+        } else                                                          \
+            hscalefn =                                                  \
+                ff_hscale8to19_ ## filtersize ## _ ## opt;              \
     }                                                                   \
 } while (0)
 

From patchwork Mon Oct 17 13:07:13 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hubert Mazur <hum@semihalf.com>
X-Patchwork-Id: 38763
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:4a86:b0:9d:28a3:170e with SMTP id fn6csp1584540pzb;
        Mon, 17 Oct 2022 06:08:46 -0700 (PDT)
X-Google-Smtp-Source: 
 AMsMyM7zAb1SH5dRxkzuQGCBFzkcrS5irhmFAASLQdCB6MvdTXHXBhyAI7KV2g1zlnf9fudgDhGR
X-Received: by 2002:a05:6402:2686:b0:45d:82c0:c2b6 with SMTP id
 w6-20020a056402268600b0045d82c0c2b6mr5789926edd.390.1666012126275;
        Mon, 17 Oct 2022 06:08:46 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1666012126; cv=none;
        d=google.com; s=arc-20160816;
        b=agdF58kip7MBBpozT6Fk390rq8iBcYVzCH9f5mujSRUtaZNuSO6BEn8wShvjsNfZ6w
         ig+ird3O87Pelj3JulI3xgQN/fBYkgAjpjvOdHFGS2BrTC5DX2vJ0M3kUe8fjjUYDH8D
         j5e+Rt0mGGHunWI692xl88Z2BTYf6DJM55NbRscwH80gOKoDSvWc8v9gzU7yrPf4wDsE
         ewd/uPwFU6SB3n48e96juaz/g5HJsMQR7DKX+DEguY9Xx+de/Rh5vGCL02+v+dcYGi75
         gVaOoLfbpZI//jWZll74H39hp08BC/pYENrDjEwGElCLas3SrSj7VddA0vEoKemEdoUM
         m1kQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:dkim-signature:delivered-to;
        bh=v+KlX61JTRlrUe5gf7J3nwBnxosrb5Uzc2RQxOaFPfk=;
        b=gwgH5qyNi4qjwCdrwLWcYf07UPiY9NZfIHzSmCtdMFCvmc2dRtuQ4ezHmntb5mpRds
         TTaDf6xeoVNY4WMg+ZmRS54tDz0XNimRlJmPYZhz9mvoKA0mYOJ395LFbny6SwknwQyA
         FW46qnr/7awpqyw3JxNvLUiIqvhvZ7Tj/8o80o5Uw1zt8w6nemM8TmNx1Xru3xGZHczZ
         udD07qU9RN9mR6ynzJDP8ZnKUxgs9JQ77O7d2GXg3oWvTCBRMgNlDr2VwQmRs4z6j/Of
         uUD0a7dPvZNN1mfyKd2fF5u7rnEhRPi2axTPIowqmjxeTXkWnWpbBhhCAorCT18YvQvB
         hGlQ==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@semihalf.com
 header.s=google header.b=lsooud3t;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE)
 header.from=semihalf.com
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 sc12-20020a1709078a0c00b0078da9130dc8si9923491ejc.164.2022.10.17.06.08.45;
        Mon, 17 Oct 2022 06:08:46 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@semihalf.com
 header.s=google header.b=lsooud3t;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE)
 header.from=semihalf.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 94D9568BD10;
	Mon, 17 Oct 2022 16:08:28 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail-wm1-f47.google.com (mail-wm1-f47.google.com
 [209.85.128.47])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 2BB3B68BCFE
 for <ffmpeg-devel@ffmpeg.org>; Mon, 17 Oct 2022 16:08:20 +0300 (EEST)
Received: by mail-wm1-f47.google.com with SMTP id n9so8604069wms.1
 for <ffmpeg-devel@ffmpeg.org>; Mon, 17 Oct 2022 06:08:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=semihalf.com;
 s=google;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=HMJ/dB3W+aJT6xEOVd3elsu9wDdmB00a430IhEwETvg=;
 b=lsooud3tcyunmt+kTgc4JYFoLO1MUe5MHer7jPdw8YtoV9XWfKwKAP3eoLyBS8KkwF
 JJThpIZNi6r2ejdHnENTKjl6I5bfUvKPsY4msi+qE8sksYKgN2+jbbrXKn4JdXFogHJ1
 TZBkRmEzdWxaEmhDnoGgvcT0B0ugCzER/U62Hsjd0n7nvptnGKPB0pwq21fYZZkHngpA
 DAsAzqurCPW0qYvxXmahPUUOARDxF7LxYwDvlX7qZNSt8aJg2iPYRKKud0ofUgh3jwpl
 mJmn/jk6oQpWUY2ZrVrz3KbQo7USAP/7nUhzxRPPrET/SxkPOMk0IoRMUwlU3SSDKmtG
 bjqg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
 :subject:date:message-id:reply-to;
 bh=HMJ/dB3W+aJT6xEOVd3elsu9wDdmB00a430IhEwETvg=;
 b=WAKHALoHv3/G+KXMf2fke7LcXOJJzVx4HFNgiKJnyK8L0RUSz8WxLc+7IXbhCIL1rA
 Au44924emqlgTvR+FJLy3aCLRTWUZWXzUnGdTSxnTTUr3hQrenf6jhtqGavRP4DToanl
 ZUnYiBnxrjj+7wDrWb1vEEot76xFlZIb/3KXY29pj17pAV5qc7leoA9BJJYHH/8dsXJ6
 h26YCudSq/33zRVEVh3+ZBN3SxKuFwZ6BavXy83hMzHuSWADnzAm47cgqrtoHJayFP30
 hkkY1yrqdwgRfXFydGYsV62/RKGmVXW3c04r+52oQf5L40Cv1GFh4nBdlTMwdS1Sj9pa
 QZYg==
X-Gm-Message-State: ACrzQf3OB+uN9vuMxGHo1QivHKLHye/+HnzK0kQKbddYD06/LWdS/gVa
 DWBL+TErVj9GAPnAhghcUGvcDNKCvjUHnoDD
X-Received: by 2002:a05:600c:3213:b0:3c6:cab8:dac4 with SMTP id
 r19-20020a05600c321300b003c6cab8dac4mr19502211wmp.160.1666012099208;
 Mon, 17 Oct 2022 06:08:19 -0700 (PDT)
Received: from ip-172-31-3-164.eu-west-1.compute.internal
 (ec2-54-154-193-154.eu-west-1.compute.amazonaws.com. [54.154.193.154])
 by smtp.gmail.com with ESMTPSA id
 t18-20020a5d6a52000000b0022af865810esm8297237wrw.75.2022.10.17.06.08.18
 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
 Mon, 17 Oct 2022 06:08:18 -0700 (PDT)
From: Hubert Mazur <hum@semihalf.com>
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 17 Oct 2022 13:07:13 +0000
Message-Id: <20221017130715.30896-3-hum@semihalf.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20221017130715.30896-1-hum@semihalf.com>
References: <20221017130715.30896-1-hum@semihalf.com>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 2/4] tests/sw_scale: Add test cases for input
 sizes 16
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com,
 Hubert Mazur <hum@semihalf.com>, martin@martin.st, mw@semihalf.com,
 spop@amazon.com
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: ZMDC7YwW9+0f

Previously the test cases covered only an input size of 8.
Add support for input size 16, which is used by the scaling
routines hscale16To15 and hscale16To19. Pass the SwsContext
pointer to each function, as some of them make use of it.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 tests/checkasm/sw_scale.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)
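
Note for reviewers: the idea behind the _dst/DST_WIDTH changes can be
sketched in isolation as below. This is only a rough illustration, not
the checkasm code: fill(), outputs_match() and the buffer length N are
hypothetical names introduced here, and fill() merely stands in for a
scaler that writes int16_t when dstBpc is 14 and int32_t otherwise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 8 /* hypothetical buffer length, in elements */

/* Element size of the destination: 14 -> int16_t, otherwise int32_t. */
size_t dst_elem_size(int dst_bpc)
{
    return dst_bpc == 14 ? sizeof(int16_t) : sizeof(int32_t);
}

/* Hypothetical stand-in for a scaler: writes i * 3 at the chosen width. */
void fill(void *dst, int n, int dst_bpc)
{
    for (int i = 0; i < n; i++) {
        if (dst_bpc == 14)
            ((int16_t *)dst)[i] = (int16_t)(i * 3);
        else
            ((int32_t *)dst)[i] = i * 3;
    }
}

/* Returns 1 when two runs produce identical bytes over n elements. */
int outputs_match(int dst_bpc, int n)
{
    int16_t ref16[N], new16[N];
    int32_t ref32[N], new32[N];
    void *dst16[2] = { ref16, new16 };
    void *dst32[2] = { ref32, new32 };
    void **dst = dst_bpc == 14 ? dst16 : dst32; /* pick the buffer pair */
    size_t elem = dst_elem_size(dst_bpc);

    assert(n >= 0 && n <= N);
    memset(dst[0], 0, N * elem);
    memset(dst[1], 0, N * elem);
    fill(dst[0], n, dst_bpc);
    fill(dst[1], n, dst_bpc);
    /* Compare only the bytes that were actually written. */
    return memcmp(dst[0], dst[1], (size_t)n * elem) == 0;
}
```

Selecting the buffer pair once per configuration keeps the memset and
memcmp sizes consistent with the element type the function writes.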

diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 3b8dd310ec..2e4b698f88 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -262,23 +262,31 @@ static void check_hscale(void)
 #define FILTER_SIZES 6
     static const int filter_sizes[FILTER_SIZES] = { 4, 8, 12, 16, 32, 40 };
 
-#define HSCALE_PAIRS 2
+#define HSCALE_PAIRS 4
     static const int hscale_pairs[HSCALE_PAIRS][2] = {
         { 8, 14 },
         { 8, 18 },
+        { 16, 14 },
+        { 16, 18 }
     };
 
+#define DST_WIDTH(x) ( (x) == (14) ? sizeof(int16_t) : sizeof(int32_t))
 #define LARGEST_INPUT_SIZE 512
 #define INPUT_SIZES 6
     static const int input_sizes[INPUT_SIZES] = {8, 24, 128, 144, 256, 512};
 
     int i, j, fsi, hpi, width, dstWi;
     struct SwsContext *ctx;
+    void *(*_dst)[2];
+    void *_src;
 
     // padded
     LOCAL_ALIGNED_32(uint8_t, src, [FFALIGN(SRC_PIXELS + MAX_FILTER_WIDTH - 1, 4)]);
-    LOCAL_ALIGNED_32(uint32_t, dst0, [SRC_PIXELS]);
-    LOCAL_ALIGNED_32(uint32_t, dst1, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(uint16_t, src1, [FFALIGN(SRC_PIXELS + MAX_FILTER_WIDTH - 1, 4)]);
+    LOCAL_ALIGNED_32(int16_t, dst_ref_16, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int16_t, dst_new_16, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int32_t, dst_ref_32, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int32_t, dst_new_32, [SRC_PIXELS]);
 
     // padded
     LOCAL_ALIGNED_32(int16_t, filter, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
@@ -286,6 +294,9 @@ static void check_hscale(void)
     LOCAL_ALIGNED_32(int16_t, filterAvx2, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
     LOCAL_ALIGNED_32(int32_t, filterPosAvx, [SRC_PIXELS]);
 
+    void *_dst_16[2] = {dst_ref_16, dst_new_16};
+    void *_dst_32[2] = {dst_ref_32, dst_new_32};
+
     // The dst parameter here is either int16_t or int32_t but we use void* to
     // just cover both cases.
     declare_func_emms(AV_CPU_FLAG_MMX, void, void *c, void *dst, int dstW,
@@ -297,6 +308,7 @@ static void check_hscale(void)
         fail();
 
     randomize_buffers(src, SRC_PIXELS + MAX_FILTER_WIDTH - 1);
+    randomize_buffers(src1, SRC_PIXELS + MAX_FILTER_WIDTH - 1);
 
     for (hpi = 0; hpi < HSCALE_PAIRS; hpi++) {
         for (fsi = 0; fsi < FILTER_SIZES; fsi++) {
@@ -306,6 +318,8 @@ static void check_hscale(void)
                 ctx->srcBpc = hscale_pairs[hpi][0];
                 ctx->dstBpc = hscale_pairs[hpi][1];
                 ctx->hLumFilterSize = ctx->hChrFilterSize = width;
+                _src = ctx->srcBpc == 8 ? (void *)src : (void *)src1;
+                _dst = ctx->dstBpc == 14 ? (void*)_dst_16 : (void*)_dst_32;
 
                 for (i = 0; i < SRC_PIXELS; i++) {
                     filterPos[i] = i;
@@ -343,14 +357,15 @@ static void check_hscale(void)
                 ff_shuffle_filter_coefficients(ctx, filterPosAvx, width, filterAvx2, ctx->dstW);
 
                 if (check_func(ctx->hcScale, "hscale_%d_to_%d__fs_%d_dstW_%d", ctx->srcBpc, ctx->dstBpc + 1, width, ctx->dstW)) {
-                    memset(dst0, 0, SRC_PIXELS * sizeof(dst0[0]));
-                    memset(dst1, 0, SRC_PIXELS * sizeof(dst1[0]));
+                    memset((*_dst)[0], 0, SRC_PIXELS * DST_WIDTH(ctx->dstBpc));
+                    memset((*_dst)[1], 0, SRC_PIXELS * DST_WIDTH(ctx->dstBpc));
+
+                    call_ref(ctx, (*_dst)[0], ctx->dstW, _src, filter, filterPos, width);
+                    call_new(ctx, (*_dst)[1], ctx->dstW, _src, filterAvx2, filterPosAvx, width);
 
-                    call_ref(NULL, dst0, ctx->dstW, src, filter, filterPos, width);
-                    call_new(NULL, dst1, ctx->dstW, src, filterAvx2, filterPosAvx, width);
-                    if (memcmp(dst0, dst1, ctx->dstW * sizeof(dst0[0])))
+                    if (memcmp((*_dst)[0], (*_dst)[1], ctx->dstW * DST_WIDTH(ctx->dstBpc)))
                         fail();
-                    bench_new(NULL, dst0, ctx->dstW, src, filter, filterPosAvx, width);
+                    bench_new(ctx, (*_dst)[1], ctx->dstW, _src, filter, filterPosAvx, width);
                 }
             }
         }
@@ -358,6 +373,8 @@ static void check_hscale(void)
     sws_freeContext(ctx);
 }
 
+#undef DST_WIDTH
+
 void checkasm_check_sw_scale(void)
 {
     check_hscale();

From patchwork Mon Oct 17 13:07:14 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hubert Mazur <hum@semihalf.com>
X-Patchwork-Id: 38764
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:4a86:b0:9d:28a3:170e with SMTP id fn6csp1584647pzb;
        Mon, 17 Oct 2022 06:08:56 -0700 (PDT)
X-Google-Smtp-Source: 
 AMsMyM4BqJMDhUyHli9bqoYXN3hWBxLnjfaaHG3I53kacETMb0Yy27oJvIKUGaCj2aARsziatwLy
X-Received: by 2002:a05:6402:51d1:b0:45d:b498:169 with SMTP id
 r17-20020a05640251d100b0045db4980169mr1824091edd.119.1666012136177;
        Mon, 17 Oct 2022 06:08:56 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1666012136; cv=none;
        d=google.com; s=arc-20160816;
        b=rLvZCtGvaKoljshxyo6C3ifn/VxOtS+ZbwgQcg81UmqOf0wxemsOzQsDUJbYL+lpun
         smTUcmwlIJHYpNuuoWumWFC1DcYUFSRCA7V6HjLME9OktZamEzMS+0h+OTXUpQwQzTT3
         hWX2meGGoLB7YHxBW6bVYEykGkyTeiGxPN22BjCIcEMkUzTwcO3hqp8079Mi8MOQhK0O
         PlaWiGqG6LGuKMhQpnGnK1X73Fj0E26pkkDwlSMx1otl3h6ColCzwelVfsBKjtVULZt3
         EvCN6+gBizt2N5zhKhw/Qq7dBgnBG2NTLe3AVvInHhb5TY0Vsnp5M9WyYPyIDc5BJfNM
         441A==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:dkim-signature:delivered-to;
        bh=dXXBinwJKAbjgEs/AOioVPS5ixn76dPODYQ4BkyLUQI=;
        b=Puw/5bV9Agp6eVIxM2V27UHrYsCSFd3nXdT5CLFMt14f03A3f4AR6k0RG+rGHbrJFO
         jqBlbRrYkuVo/RJZgoPqiBbW2Ql6g0N86/+ok9ArXvg80p+OT2CDmepz7bAOnN72vOif
         XKsS0/5iE16P/zMiMfI2J9+JSmcdI9ThlJAB3tMqlpR23dRIlMChLdODoSWKZVP6ho7b
         W+EtvbEPG65Upu+jDglxTC6AikgEAAOOSsAysRANQEnfVAmHfCs5xkSatujgV0mQu00r
         9gDsJG55S2DMMepm/+XUFFNfbCfgBF4dUOSQ5i9fKWSydkAhAsrwaX3RTisEckvIVUkb
         vJSA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@semihalf.com
 header.s=google header.b="O4BVXvn/";
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE)
 header.from=semihalf.com
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 b11-20020aa7cd0b000000b0045bd55b1240si7619071edw.313.2022.10.17.06.08.55;
        Mon, 17 Oct 2022 06:08:56 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@semihalf.com
 header.s=google header.b="O4BVXvn/";
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE)
 header.from=semihalf.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 9953A68BD15;
	Mon, 17 Oct 2022 16:08:29 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail-wr1-f53.google.com (mail-wr1-f53.google.com
 [209.85.221.53])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E7E8368BD05
 for <ffmpeg-devel@ffmpeg.org>; Mon, 17 Oct 2022 16:08:21 +0300 (EEST)
Received: by mail-wr1-f53.google.com with SMTP id n12so18342377wrp.10
 for <ffmpeg-devel@ffmpeg.org>; Mon, 17 Oct 2022 06:08:21 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=semihalf.com;
 s=google;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=OIEfF3n82rk2G9hF2G+IaPWE/hUTe2lz6mjrXSgiLcg=;
 b=O4BVXvn/b05Rbs/UCG0DNgQlpTVk1Oc5m27EtOfhgMLw4U7S4O10Xv+zZ4pK2JuFxh
 bLia9r/mXZfeMtu5zOM1RkovOs0hjhNpTBadA269vUanfuVdI7HgfJpu6yc/uvDH4Eyo
 3ajNdqgq4OfacDyKpDtkwZIdB+12sUGqx7td9KZ+7StrIRafcwrD3nsgFxtOd33Pw6iO
 Nzpq5Yb9uQ+3OQL58qSfr0XendYO4HfIq+N66VE1VBTOk7aRmHtMww8EMPGRVrIqQdeV
 BmZFloQ+bs1U6dt+PfBVy+Xvv2w0+nueET6QsHTybJc81IVXin1l6plQnGfBt3k7Okod
 1+iw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
 :subject:date:message-id:reply-to;
 bh=OIEfF3n82rk2G9hF2G+IaPWE/hUTe2lz6mjrXSgiLcg=;
 b=Nt2Ip5x2QVK47sczEzHucre63JUxagADzWdrZUGMyjMzS83fZZxXscXPKiaJDH3pL2
 ShlcTsa/fFb1XFLdkNGjouUEPlgdMf/ZUCDqMmzk0+3cJtEYqXjeQh3qG6BkMCVuey2+
 K5O6V4kEtEtHHm0gb2hiWz+f47IXlnRzD7na7nVlf8jn4aYjMbJGrDrHP1su6P3u4ISL
 aBI3wODuc/HUduB03RTXfHNPl/IgP78LrIFzQJhfsEowIZJbGXCLEHAdZzBpHxYywVYM
 3QH151UInrCt5lQAuoB8bzkamNW093ptuL2pQbPiMkQ+UbXNQfyNRVv6C+6VWaJTN1AQ
 JSew==
X-Gm-Message-State: ACrzQf2J9TPBJ3G2nkevzuhY9J2EGbJMISuiJDRZ/3YJBlflRrSdNgUI
 xCHAL35c1F2QIgVngCKfQ9FmDhbOJzpO91sr
X-Received: by 2002:adf:fd04:0:b0:22e:4bf6:4a08 with SMTP id
 e4-20020adffd04000000b0022e4bf64a08mr6618042wrr.619.1666012100551;
 Mon, 17 Oct 2022 06:08:20 -0700 (PDT)
Received: from ip-172-31-3-164.eu-west-1.compute.internal
 (ec2-54-154-193-154.eu-west-1.compute.amazonaws.com. [54.154.193.154])
 by smtp.gmail.com with ESMTPSA id
 t18-20020a5d6a52000000b0022af865810esm8297237wrw.75.2022.10.17.06.08.19
 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
 Mon, 17 Oct 2022 06:08:20 -0700 (PDT)
From: Hubert Mazur <hum@semihalf.com>
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 17 Oct 2022 13:07:14 +0000
Message-Id: <20221017130715.30896-4-hum@semihalf.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20221017130715.30896-1-hum@semihalf.com>
References: <20221017130715.30896-1-hum@semihalf.com>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 3/4] sw_scale: Add specializations for hscale
 16 to 15
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com,
 Hubert Mazur <hum@semihalf.com>, martin@martin.st, mw@semihalf.com,
 spop@amazon.com
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: G7MFcgvSYmo7

Add arm64 NEON implementations of hscale 16 to 15 for filter
sizes 4, X8 and X4.

The tests and benchmarks were run on AWS Graviton 2 instances.
The results from the checkasm tool are shown below.

hscale_16_to_15__fs_4_dstW_512_c: 6703.5
hscale_16_to_15__fs_4_dstW_512_neon: 2298.0
hscale_16_to_15__fs_8_dstW_512_c: 10983.0
hscale_16_to_15__fs_8_dstW_512_neon: 3216.5
hscale_16_to_15__fs_12_dstW_512_c: 15526.0
hscale_16_to_15__fs_12_dstW_512_neon: 3993.0
hscale_16_to_15__fs_16_dstW_512_c: 20183.5
hscale_16_to_15__fs_16_dstW_512_neon: 5369.7
hscale_16_to_15__fs_32_dstW_512_c: 39315.2
hscale_16_to_15__fs_32_dstW_512_neon: 9511.2
hscale_16_to_15__fs_40_dstW_512_c: 48995.7
hscale_16_to_15__fs_40_dstW_512_neon: 11570.0

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
 libswscale/aarch64/hscale.S  | 409 ++++++++++++++++++++++++++++++++++-
 libswscale/aarch64/swscale.c |  66 +++++-
 libswscale/swscale.c         |   3 +-
 3 files changed, 474 insertions(+), 4 deletions(-)
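
Note for reviewers: as a reading aid for the assembly, the per-sample
arithmetic it performs can be sketched in scalar C roughly as follows.
This is an assumption-laden sketch, not the libswscale C routine:
hscale16to15_sketch() is a hypothetical name, the shift is taken as a
plain argument (as in the asm entry points), and the sum is kept in a
32-bit unsigned accumulator to mimic the wrapping of the 32-bit vector
lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model: 16-bit unsigned input, 15-bit clamped output. */
void hscale16to15_sketch(int shift, int16_t *dst, int dstW,
                         const uint16_t *src, const int16_t *filter,
                         const int32_t *filterPos, int filterSize)
{
    const int32_t max = (1 << 15) - 1; /* largest allowed output value */

    assert(shift >= 0 && shift < 32);
    for (int i = 0; i < dstW; i++) {
        uint32_t acc = 0; /* wraps like a 32-bit lane */

        for (int j = 0; j < filterSize; j++) {
            /* src is zero-extended (uxtl), filter is sign-extended (sxtl) */
            int32_t s = src[filterPos[i] + j];
            int32_t f = filter[filterSize * i + j];
            acc += (uint32_t)(s * f);
        }
        /* Right shift of a negative value is implementation-defined in C;
         * common compilers shift arithmetically, as sshl does here. */
        int32_t val = (int32_t)acc >> shift;
        dst[i] = (int16_t)(val < max ? val : max);
    }
}
```

Widening both operands to 32 bits before multiplying matters because a
uint16_t sample above 32767 has no int16_t representation.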

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index 5e8cad9825..7d7e1c1f2e 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -635,5 +635,412 @@ function ff_hscale8to19_X4_neon, export=1
         add                 x4, x4, x7, lsl #2
         b.gt                1b
         ret
+endfunc
+
+function ff_hscale16to15_4_neon_asm, export=1
+        // w0               int shift
+        // x1               int16_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src // treat it as uint16_t *src
+        // x4               const int16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #15
+        sub                 v18.4s, v18.4s, v17.4s      // max allowed value
+        dup                 v17.4s, w0                  // read shift
+        neg                 v17.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        cmp                 w2, #16
+        b.lt                2f // move to last block
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        // shift all filterPos left by one, as uint16_t will be read
+        lsl                 x8, x8, #1
+        lsl                 x9, x9, #1
+        lsl                 x10, x10, #1
+        lsl                 x11, x11, #1
+        lsl                 x12, x12, #1
+        lsl                 x13, x13, #1
+        lsl                 x14, x14, #1
+        lsl                 x15, x15, #1
+
+        // load src with given offset
+        ldr                 x8,  [x3, w8,  UXTW]
+        ldr                 x9,  [x3, w9,  UXTW]
+        ldr                 x10, [x3, w10, UXTW]
+        ldr                 x11, [x3, w11, UXTW]
+        ldr                 x12, [x3, w12, UXTW]
+        ldr                 x13, [x3, w13, UXTW]
+        ldr                 x14, [x3, w14, UXTW]
+        ldr                 x15, [x3, w15, UXTW]
+
+        sub                 sp, sp, #64
+        // push src on stack so it can be loaded into vectors later
+        stp                 x8, x9, [sp]
+        stp                 x10, x11, [sp, #16]
+        stp                 x12, x13, [sp, #32]
+        stp                 x14, x15, [sp, #48]
+
+1:
+        ld4                 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+        // Each of the blocks below does the following:
+        // Extend src and filter to 32 bits with uxtl and sxtl,
+        // then multiply, or multiply and accumulate, the results.
+        // Extending to 32 bits is necessary, as uint16_t values can't
+        // be represented as int16_t without type promotion.
+        uxtl                v26.4s, v0.4h
+        sxtl                v27.4s, v28.4H
+        uxtl2               v0.4s, v0.8h
+        mul                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v28.8H
+        uxtl                v26.4s, v1.4h
+        mul                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v29.4H
+        uxtl2               v0.4s, v1.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v29.8H
+        uxtl                v26.4s, v2.4h
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sxtl                v27.4s, v30.4H
+        uxtl2               v0.4s, v2.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v30.8H
+        uxtl                v26.4s, v3.4h
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sxtl                v27.4s, v31.4H
+        uxtl2               v0.4s, v3.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v31.8H
+        sub                 w2, w2, #8
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sshl                v5.4s, v5.4s, v17.4s
+        sshl                v6.4s, v6.4s, v17.4s
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+        xtn                 v5.4h, v5.4s
+        xtn2                v5.8h, v6.4s
+
+        st1                 {v5.8h}, [x1], #16
+        cmp                 w2, #16
+
+        // load filterPositions into registers for next iteration
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        lsl                 x8, x8, #1
+        lsl                 x9, x9, #1
+        lsl                 x10, x10, #1
+        lsl                 x11, x11, #1
+        lsl                 x12, x12, #1
+        lsl                 x13, x13, #1
+        lsl                 x14, x14, #1
+        lsl                 x15, x15, #1
+
+        ldr                 x8,  [x3, w8,  UXTW]
+        ldr                 x9,  [x3, w9,  UXTW]
+        ldr                 x10, [x3, w10, UXTW]
+        ldr                 x11, [x3, w11, UXTW]
+        ldr                 x12, [x3, w12, UXTW]
+        ldr                 x13, [x3, w13, UXTW]
+        ldr                 x14, [x3, w14, UXTW]
+        ldr                 x15, [x3, w15, UXTW]
+
+        stp                 x8, x9, [sp]
+        stp                 x10, x11, [sp, #16]
+        stp                 x12, x13, [sp, #32]
+        stp                 x14, x15, [sp, #48]
 
-endfunc
\ No newline at end of file
+        b.ge                1b
+
+        // perform the last iteration here, without reloading the registers
+        ld4                 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64
+
+        uxtl                v26.4s, v0.4h
+        sxtl                v27.4s, v28.4H
+        uxtl2               v0.4s, v0.8h
+        mul                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v28.8H
+        uxtl                v26.4s, v1.4h
+        mul                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v29.4H
+        uxtl2               v0.4s, v1.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v29.8H
+        uxtl                v26.4s, v2.4h
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v30.4H
+        uxtl2               v0.4s, v2.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v30.8H
+        uxtl                v26.4s, v3.4h
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v31.4H
+        uxtl2               v0.4s, v3.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v31.8H
+        subs                w2, w2, #8
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sshl                v5.4s, v5.4s, v17.4s
+        sshl                v6.4s, v6.4s, v17.4s
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+        xtn                 v5.4h, v5.4S
+        xtn2                v5.8h, v6.4s
+
+        st1                 {v5.8h}, [x1], #16
+        add                 sp, sp, #64                 // restore stack
+        cbnz                w2, 2f
+
+        ret
+
+2:
+        ldr                 w8, [x5], #4                // load filterPos
+        lsl                 w8, w8, #1
+        add                 x9, x3, w8, UXTW            // src + filterPos
+        ld1                 {v0.4h}, [x9]               // load 4 * uint16_t
+        ld1                 {v31.4h}, [x4], #8
+
+        uxtl                v0.4s, v0.4h
+        sxtl                v31.4s, v31.4h
+        mul                 v5.4s, v0.4s, v31.4s
+        addv                s0, v5.4S
+        sshl                v0.4s, v0.4s, v17.4s
+        smin                v0.4s, v0.4s, v18.4s
+        st1                 {v0.h}[0], [x1], #2
+        sub                 w2, w2, #1
+        cbnz                w2, 2b                      // if iterations remain jump to beginning
+
+        ret
+endfunc
+
+function ff_hscale16to15_X8_neon_asm, export=1
+        // w0               int shift
+        // x1               int16_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src // treat it as uint16_t *src
+        // x4               const int16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v20.4s, #1
+        movi                v21.4s, #1
+        shl                 v20.4s, v20.4s, #15
+        sub                 v20.4s, v20.4s, v21.4s
+        dup                 v21.4s, w0
+        neg                 v21.4s, v21.4s
+
+        sbfiz               x7, x6, #1, #32             // filterSize*2 (*2 because int16)
+1:      ldr                 w8, [x5], #4                // filterPos[idx]
+        lsl                 w8, w8, #1
+        ldr                 w10, [x5], #4               // filterPos[idx + 1]
+        lsl                 w10, w10, #1
+        ldr                 w11, [x5], #4               // filterPos[idx + 2]
+        lsl                 w11, w11, #1
+        ldr                 w9, [x5], #4                // filterPos[idx + 3]
+        lsl                 w9, w9, #1
+        mov                 x16, x4                     // filter0 = filter
+        add                 x12, x16, x7                // filter1 = filter0 + filterSize*2
+        add                 x13, x12, x7                // filter2 = filter1 + filterSize*2
+        add                 x4, x13, x7                 // filter3 = filter2 + filterSize*2
+        movi                v0.2D, #0                   // val sum part 1 (for dst[0])
+        movi                v1.2D, #0                   // val sum part 2 (for dst[1])
+        movi                v2.2D, #0                   // val sum part 3 (for dst[2])
+        movi                v3.2D, #0                   // val sum part 4 (for dst[3])
+        add                 x17, x3, w8, UXTW           // srcp + filterPos[0]
+        add                 x8,  x3, w10, UXTW          // srcp + filterPos[1]
+        add                 x10, x3, w11, UXTW          // srcp + filterPos[2]
+        add                 x11, x3, w9, UXTW           // srcp + filterPos[3]
+        mov                 w15, w6                     // filterSize counter
+2:      ld1                 {v4.8H}, [x17], #16         // srcp[filterPos[0] + {0..7}]
+        ld1                 {v5.8H}, [x16], #16         // load 8x16-bit filter values, part 1
+        ld1                 {v6.8H}, [x8], #16          // srcp[filterPos[1] + {0..7}]
+        ld1                 {v7.8H}, [x12], #16         // load 8x16-bit at filter+filterSize
+        uxtl                v24.4s, v4.4H               // zero-extend srcp lower half to 32 bits (source samples are unsigned)
+        sxtl                v25.4s, v5.4H               // extend filter lower half to 32 bits to match srcp size
+        uxtl2               v4.4s, v4.8h                // extend srcp upper half to 32 bits
+        mla                 v0.4s, v24.4s, v25.4s       // multiply accumulate lower half of v4 * v5
+        sxtl2               v5.4s, v5.8h                // extend filter upper half to 32 bits
+        uxtl                v26.4s, v6.4h               // extend srcp lower half to 32 bits
+        mla                 v0.4S, v4.4s, v5.4s         // multiply accumulate upper half of v4 * v5
+        sxtl                v27.4s, v7.4H               // extend filter lower half
+        uxtl2               v6.4s, v6.8H                // extend srcp upper half
+        sxtl2               v7.4s, v7.8h                // extend filter upper half
+        ld1                 {v16.8H}, [x10], #16        // srcp[filterPos[2] + {0..7}]
+        mla                 v1.4S, v26.4s, v27.4s       // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+        ld1                 {v17.8H}, [x13], #16        // load 8x16-bit at filter+2*filterSize
+        uxtl                v22.4s, v16.4H              // extend srcp lower half
+        sxtl                v23.4s, v17.4H              // extend filter lower half
+        uxtl2               v16.4s, v16.8H              // extend srcp upper half
+        sxtl2               v17.4s, v17.8h              // extend filter upper half
+        mla                 v2.4S, v22.4s, v23.4s       // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        mla                 v2.4S, v16.4s, v17.4s       // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        ld1                 {v18.8H}, [x11], #16        // srcp[filterPos[3] + {0..7}]
+        mla                 v1.4S, v6.4s, v7.4s         // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+        ld1                 {v19.8H}, [x4], #16         // load 8x16-bit at filter+3*filterSize
+        subs                w15, w15, #8                // j -= 8: processed 8/filterSize
+        uxtl                v28.4s, v18.4H              // extend srcp lower half
+        sxtl                v29.4s, v19.4H              // extend filter lower half
+        uxtl2               v18.4s, v18.8H              // extend srcp upper half
+        sxtl2               v19.4s, v19.8h              // extend filter upper half
+        mla                 v3.4S, v28.4s, v29.4s       // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+        mla                 v3.4S, v18.4s, v19.4s       // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+        b.gt                2b                          // inner loop if filterSize not consumed completely
+        addp                v0.4S, v0.4S, v1.4S         // part01 horizontal pair adding
+        addp                v2.4S, v2.4S, v3.4S         // part23 horizontal pair adding
+        addp                v0.4S, v0.4S, v2.4S         // part0123 horizontal pair adding
+        subs                w2, w2, #4                  // dstW -= 4
+        sshl                v0.4s, v0.4s, v21.4s        // shift right (effectively right, as shift is negative); overflow expected
+        smin                v0.4s, v0.4s, v20.4s        // apply min (do not use sqshl)
+        xtn                 v0.4h, v0.4s                // narrow down to 16 bits
+
+        st1                 {v0.4H}, [x1], #8           // write to destination part0123
+        b.gt                1b                          // loop until end of line
+        ret
+endfunc
+
+function ff_hscale16to15_X4_neon_asm, export=1
+        // w0  int shift
+        // x1  int16_t *dst
+        // w2  int dstW
+        // x3  const uint8_t *src
+        // x4  const int16_t *filter
+        // x5  const int32_t *filterPos
+        // w6  int filterSize
+
+        stp                 d8, d9, [sp, #-0x20]!
+        stp                 d10, d11, [sp, #0x10]
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #15
+        sub                 v21.4s, v18.4s, v17.4s      // max allowed value
+        dup                 v17.4s, w0                  // read shift
+        neg                 v20.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        lsl                 w7, w6, #1
+1:
+        ldp                 w8, w9, [x5]
+        ldp                 w10, w11, [x5, #8]
+
+        movi                v16.2d, #0                  // initialize accumulator for idx + 0
+        movi                v17.2d, #0                  // initialize accumulator for idx + 1
+        movi                v18.2d, #0                  // initialize accumulator for idx + 2
+        movi                v19.2d, #0                  // initialize accumulator for idx + 3
+
+        mov                 x12, x4                     // filter + 0
+        add                 x13, x4, x7                 // filter + 1
+        add                 x8, x3, x8, lsl #1          // srcp + filterPos 0
+        add                 x14, x13, x7                // filter + 2
+        add                 x9, x3, x9, lsl #1          // srcp + filterPos 1
+        add                 x15, x14, x7                // filter + 3
+        add                 x10, x3, x10, lsl #1        // srcp + filterPos 2
+        mov                 w0, w6                      // save the filterSize to temporary variable
+        add                 x11, x3, x11, lsl #1        // srcp + filterPos 3
+        add                 x5, x5, #16                 // advance filter position
+        mov                 x16, xzr                    // clear the register x16 used for offsetting the filter values
+
+2:
+        ldr                 q4, [x8], #16               // load src values for idx 0
+        ldr                 q5, [x9], #16               // load src values for idx 1
+        uxtl                v26.4s, v4.4h
+        uxtl2               v4.4s, v4.8h
+        ldr                 q31, [x12, x16]             // load filter values for idx 0
+        ldr                 q6, [x10], #16              // load src values for idx 2
+        sxtl                v22.4s, v31.4h
+        sxtl2               v31.4s, v31.8h
+        mla                 v16.4s, v26.4s, v22.4s      // multiplication of lower half for idx 0
+        uxtl                v25.4s, v5.4h
+        uxtl2               v5.4s, v5.8h
+        ldr                 q30, [x13, x16]             // load filter values for idx 1
+        ldr                 q7, [x11], #16              // load src values for idx 3
+        mla                 v16.4s, v4.4s, v31.4s       // multiplication of upper half for idx 0
+        uxtl                v24.4s, v6.4h
+        sxtl                v8.4s, v30.4h
+        sxtl2               v30.4s, v30.8h
+        mla                 v17.4s, v25.4s, v8.4s       // multiplication of lower half for idx 1
+        ldr                 q29, [x14, x16]             // load filter values for idx 2
+        uxtl2               v6.4s, v6.8h
+        sxtl                v9.4s, v29.4h
+        sxtl2               v29.4s, v29.8h
+        mla                 v17.4s, v5.4s, v30.4s       // multiplication of upper half for idx 1
+        mla                 v18.4s, v24.4s, v9.4s       // multiplication of lower half for idx 2
+        ldr                 q28, [x15, x16]             // load filter values for idx 3
+        uxtl                v23.4s, v7.4h
+        sxtl                v10.4s, v28.4h
+        mla                 v18.4s, v6.4s, v29.4s       // multiplication of upper half for idx 2
+        uxtl2               v7.4s, v7.8h
+        sxtl2               v28.4s, v28.8h
+        mla                 v19.4s, v23.4s, v10.4s      // multiplication of lower half for idx 3
+        sub                 w0, w0, #8
+        cmp                 w0, #8
+        mla                 v19.4s, v7.4s, v28.4s       // multiplication of upper half for idx 3
+
+        add                 x16, x16, #16               // advance filter values indexing
+
+        b.ge                2b
+
+        // 4 iterations left
+
+        sub                 x17, x7, #8                 // step back to wrap up the filter pos for last 4 elements
+
+        ldr                 d4, [x8]                    // load src values for idx 0
+        ldr                 d31, [x12, x17]             // load filter values for idx 0
+        uxtl                v4.4s, v4.4h
+        sxtl                v31.4s, v31.4h
+        ldr                 d5, [x9]                    // load src values for idx 1
+        mla                 v16.4s, v4.4s, v31.4s       // multiplication of last 4 elements for idx 0
+        ldr                 d30, [x13, x17]             // load filter values for idx 1
+        uxtl                v5.4s, v5.4h
+        sxtl                v30.4s, v30.4h
+        ldr                 d6, [x10]                   // load src values for idx 2
+        mla                 v17.4s, v5.4s, v30.4s       // multiplication of last 4 elements for idx 1
+        ldr                 d29, [x14, x17]             // load filter values for idx 2
+        uxtl                v6.4s, v6.4h
+        sxtl                v29.4s, v29.4h
+        ldr                 d7, [x11]                   // load src values for idx 3
+        ldr                 d28, [x15, x17]             // load filter values for idx 3
+        mla                 v18.4s, v6.4s, v29.4s       // multiplication of last 4 elements for idx 2
+        uxtl                v7.4s, v7.4h
+        sxtl                v28.4s, v28.4h
+        addp                v16.4s, v16.4s, v17.4s
+        mla                 v19.4s, v7.4s, v28.4s       // multiplication of last 4 elements for idx 3
+        subs                w2, w2, #4
+        addp                v18.4s, v18.4s, v19.4s
+        addp                v16.4s, v16.4s, v18.4s
+        sshl                v16.4s, v16.4s, v20.4s
+        smin                v16.4s, v16.4s, v21.4s
+        xtn                 v16.4h, v16.4s
+
+        st1                 {v16.4h}, [x1], #8
+        add                 x4, x4, x7, lsl #2
+        b.gt                1b
+
+        ldp                 d8, d9, [sp]
+        ldp                 d10, d11, [sp, #0x10]
+
+        add                 sp, sp, #0x20
+
+        ret
+endfunc
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 479fe129d0..993cdd67dd 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -22,6 +22,18 @@
 #include "libswscale/swscale_internal.h"
 #include "libavutil/aarch64/cpu.h"
 
+void ff_hscale16to15_4_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+
+void ff_hscale16to15_X8_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+
+void ff_hscale16to15_X4_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+
 #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
 void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
                                                 SwsContext *c, int16_t *data, \
@@ -30,7 +42,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
                                                 const int32_t *filterPos, int filterSize)
 #define SCALE_FUNCS(filter_n, opt) \
     SCALE_FUNC(filter_n,  8, 15, opt); \
-    SCALE_FUNC(filter_n, 8, 19, opt);
+    SCALE_FUNC(filter_n, 8, 19, opt); \
+    SCALE_FUNC(filter_n, 16, 15, opt);
 #define ALL_SCALE_FUNCS(opt) \
     SCALE_FUNCS(4, opt); \
     SCALE_FUNCS(X8, opt); \
@@ -56,6 +69,10 @@ void ff_yuv2plane1_8_neon(
         } else                                                          \
             hscalefn =                                                  \
                 ff_hscale8to19_ ## filtersize ## _ ## opt;              \
+    } else {                                                            \
+        if (c->dstBpc <= 14)                                            \
+            hscalefn =                                                  \
+                ff_hscale16to15_ ## filtersize ## _ ## opt;             \
     }                                                                   \
 } while (0)
 
@@ -87,3 +104,50 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
         }
     }
 }
+
+void ff_hscale16to15_4_neon(SwsContext *c, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int sh              = desc->comp[0].depth - 1;
+
+    if (sh<15) {
+        sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1);
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float inputs are processed like uint 16bpc */
+        sh = 16 - 1;
+    }
+    ff_hscale16to15_4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to15_X8_neon(SwsContext *c, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int sh              = desc->comp[0].depth - 1;
+
+    if (sh<15) {
+        sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1);
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float inputs are processed like uint 16bpc */
+        sh = 16 - 1;
+    }
+    ff_hscale16to15_X8_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to15_X4_neon(SwsContext *c, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int sh              = desc->comp[0].depth - 1;
+
+    if (sh<15) {
+        sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1);
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float inputs are processed like uint 16bpc */
+        sh = 16 - 1;
+    }
+    ff_hscale16to15_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+}
\ No newline at end of file
diff --git a/libswscale/swscale.c b/libswscale/swscale.c
index 367d045a02..5afd5eba83 100644
--- a/libswscale/swscale.c
+++ b/libswscale/swscale.c
@@ -109,11 +109,10 @@ static void hScale16To15_c(SwsContext *c, int16_t *dst, int dstW,
         int j;
         int srcPos = filterPos[i];
         int val    = 0;
-
         for (j = 0; j < filterSize; j++) {
             val += src[srcPos + j] * filter[filterSize * i + j];
         }
-        // filter=14 bit, input=16 bit, output=30 bit, >> 15 makes 15 bit
+        //filter=14 bit, input=16 bit, output=30 bit, >> 15 makes 15 bit
         dst[i] = FFMIN(val >> sh, (1 << 15) - 1);
     }
 }

From patchwork Mon Oct 17 13:07:15 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hubert Mazur <hum@semihalf.com>
X-Patchwork-Id: 38765
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:4a86:b0:9d:28a3:170e with SMTP id fn6csp1584746pzb;
        Mon, 17 Oct 2022 06:09:06 -0700 (PDT)
X-Google-Smtp-Source: 
 AMsMyM7yG4lWCfJ3Nu8fL76qPcz2x6VqNRHTlPGO78uJ7ZDrQrK7ry+3gIA+w37D5kypEEzE0Mf2
X-Received: by 2002:a05:6402:1d86:b0:457:e84:f0e with SMTP id
 dk6-20020a0564021d8600b004570e840f0emr10028249edb.241.1666012146446;
        Mon, 17 Oct 2022 06:09:06 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1666012146; cv=none;
        d=google.com; s=arc-20160816;
        b=eesBZocH3HHHXMqel5b3NqBn1Q88J5HSNYUXebtwKKx8zU3vgvp67P7vzZIwlH39eB
         hvg8JtUuQyzcczGjvLCP6fK3JoYozGXeG7QmgUtkRpJ7o54FaHwijKkkBf3iTKYhqTou
         aPp2XUbnR7BvSqCrMlzVcR+PxeHBWMWqPbv3Et9Jw0HmNfElrkvrv1uNFp7NRi+/ELYD
         Hf3b/zcWdHQoSzp1eSfAu1Xd/gBOVQtTzRMEQXthSufR3BbWoydui0eVvb5VRzvOa4x8
         wwaXnSy77F0JkhfA5skEVCAQpL+bj74Q8WKCtYWEpmwTOBG7E1OyrjyL48bApHnIJgqS
         SLRA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:dkim-signature:delivered-to;
        bh=U6QLTknRHgxqwDJkQvwHYqfQbHFIcw94qVRslbZHOh4=;
        b=Vb484p1Sz7sWpu6Nk8+b9UTgxFxFF2zWH6+b7Za36oa+Ug16uMciJ/wy8t0ixh23NC
         /v7+des9eD0IOYvFhQKK/gnKZvKmc2/pu9b0V0+e2Gn0RGxpByBuQKuUlpM8zkO1TbiG
         Lp0oJEomzgIwi3NK6jJgFMsH9eXpxz1ZYH4eaVt/0gtdgA/JUC0YCLbFote+Z+XPdJd9
         8/664UXiIL8qLDVmFyDuRbmlAy19Y5t6tCWakDFhM6sZ2b+gg4kKURfsBTlGebGS5BVM
         xBduvZkvBJegXoT7YXW4YS6C+klLfgv6/tR8ZH/2q2hJf2TNXMdvd+IEdBU4vfBScZ/R
         VB4Q==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@semihalf.com
 header.s=google header.b=O6YF29s8;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE)
 header.from=semihalf.com
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 g5-20020a056402090500b00457e6752422si8160465edz.189.2022.10.17.06.09.06;
        Mon, 17 Oct 2022 06:09:06 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@semihalf.com
 header.s=google header.b=O6YF29s8;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE)
 header.from=semihalf.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 8F7FE68BD06;
	Mon, 17 Oct 2022 16:08:30 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail-wm1-f46.google.com (mail-wm1-f46.google.com
 [209.85.128.46])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 7EDFB68BD05
 for <ffmpeg-devel@ffmpeg.org>; Mon, 17 Oct 2022 16:08:23 +0300 (EEST)
Received: by mail-wm1-f46.google.com with SMTP id iv17so8602217wmb.4
 for <ffmpeg-devel@ffmpeg.org>; Mon, 17 Oct 2022 06:08:23 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=semihalf.com;
 s=google;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=rrd0IALhut9+kdhwBUnDoxxEX3cFpbw38Kyn0o3lmsg=;
 b=O6YF29s8nLwqoN3w5TWtL9s0Wqc2lHrF95wQUKlDXYAx5pJaFxZHpQb08duKsPliOz
 kK0UvYoN7NOdyndO3lumnPk0u1KHjEmdI0WRxkkGDVfFoGNn/f8sTUR0F1lODgh8JYdD
 8rId/2U7EKStZLDMmMEXcVMVDj5or8TOq+HggLWef3zHELIE86ON6jma7O2AyHhgfjNA
 wLIoZgMqCKfeEm5Y0H+J+eqngaFr+ZEVfFlnEPlvDqG1dvp3rqWhMK5+YHH2qeNOPPwK
 1Vnk9KYbdVuDX10nlA+qJ/WZutx+Psd8Wqnk+Pp97x54SxEokgohpwmDC5CCqZBfYqec
 4sWw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20210112;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
 :subject:date:message-id:reply-to;
 bh=rrd0IALhut9+kdhwBUnDoxxEX3cFpbw38Kyn0o3lmsg=;
 b=kyA3a6FobzRYnLTODXRaCJ9flG0GsHIAQGVHwPjoOGMuTjhaOJKzGCVgaE2CRHi2D3
 24B+B5bphcPChQXKNJZqDuLF7jw3sv4L3QeJr28L4zKGp5F/dKDvouby7MDY44e/E1TA
 DpGzCEGq5jAIbePlbST1mqOJD/bn9nUS6t6mqEOyuje5ovsiXlr5KkuCogFd5It/b0nI
 1CMgo2mO+R9iZqyFqoN4j3Q/MeiETIG2oOWp6h+k50pBSXQyXzUGFwAbVB8CCuccdrzp
 YkJh8w947wxa5VwrGXRxFrWg4k9ztL+fO5cHrWscTw3D5eBsWG1lLpzk+CEmFHK5EDy3
 Fw1Q==
X-Gm-Message-State: ACrzQf1Qj63OjNV0spzDI7YZZerRpyMdezRMMpdfr/LbQEukz9s8s8Hs
 +/rX+NokZ+TqhQNKSnUJxP9EHilNwDKpj+Zo
X-Received: by 2002:a05:600c:1906:b0:3c6:f83e:d15f with SMTP id
 j6-20020a05600c190600b003c6f83ed15fmr3023889wmq.205.1666012102242;
 Mon, 17 Oct 2022 06:08:22 -0700 (PDT)
Received: from ip-172-31-3-164.eu-west-1.compute.internal
 (ec2-54-154-193-154.eu-west-1.compute.amazonaws.com. [54.154.193.154])
 by smtp.gmail.com with ESMTPSA id
 t18-20020a5d6a52000000b0022af865810esm8297237wrw.75.2022.10.17.06.08.21
 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
 Mon, 17 Oct 2022 06:08:21 -0700 (PDT)
From: Hubert Mazur <hum@semihalf.com>
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 17 Oct 2022 13:07:15 +0000
Message-Id: <20221017130715.30896-5-hum@semihalf.com>
X-Mailer: git-send-email 2.37.1
In-Reply-To: <20221017130715.30896-1-hum@semihalf.com>
References: <20221017130715.30896-1-hum@semihalf.com>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale
 16 to 19
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com,
 Hubert Mazur <hum@semihalf.com>, martin@martin.st, mw@semihalf.com,
 spop@amazon.com
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: lnOLmJ1ZlKTM

Provide arm64 NEON optimized implementations of hscale16To19 for
filter sizes 4, X8 and X4.

The tests and benchmarks were run on AWS Graviton 2 instances.
The results from the checkasm tool are shown below.

hscale_16_to_19__fs_4_dstW_512_c: 6216.0
hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
hscale_16_to_19__fs_8_dstW_512_c: 10417.7
hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
hscale_16_to_19__fs_12_dstW_512_c: 14890.5
hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
hscale_16_to_19__fs_16_dstW_512_c: 19006.5
hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
hscale_16_to_19__fs_32_dstW_512_c: 36629.5
hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
hscale_16_to_19__fs_40_dstW_512_c: 45477.5
hscale_16_to_19__fs_40_dstW_512_neon: 11552.0

Signed-off-by: Hubert Mazur <hum@semihalf.com>
---
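Note for reviewers: the following scalar sketch is not part of the
patch. It only illustrates what the NEON routines are expected to
compute per output sample, modelled on the existing C code path. The
function name hscale16to19_ref and its uint16_t source pointer are
hypothetical (the assembly receives src as const uint8_t * and reads it
as 16-bit samples), and the unsigned accumulation is an assumption made
here to mirror the wrapping behaviour of 32-bit mla lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference for hscale 16 -> 19 (illustration only). */
void hscale16to19_ref(int sh, int32_t *dst, int dstW,
                      const uint16_t *src, const int16_t *filter,
                      const int32_t *filterPos, int filterSize)
{
    const int32_t max = (1 << 19) - 1;      /* largest 19-bit value */

    assert(sh >= 0 && sh < 32 && filterSize > 0);

    for (int i = 0; i < dstW; i++) {
        /* Accumulate modulo 2^32, like the 32-bit vector lanes do. */
        uint32_t acc = 0;
        for (int j = 0; j < filterSize; j++) {
            uint32_t s = src[filterPos[i] + j];                 /* zero-extend (uxtl) */
            uint32_t f = (uint32_t)(int32_t)filter[filterSize * i + j]; /* sign-extend (sxtl) */
            acc += s * f;
        }
        /* Arithmetic shift right (sshl with a negated shift), then clamp
         * (smin). Right-shifting a negative int32_t is implementation
         * defined in C; mainstream compilers shift arithmetically. */
        int32_t val = (int32_t)acc >> sh;
        dst[i] = val < max ? val : max;
    }
}
```

With four filter taps of 4096 each and sh = 11, the source samples
{100, 200, 300, 400} sum to 1000 * 4096 and shift down to 2000.
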
 libswscale/aarch64/hscale.S  | 402 +++++++++++++++++++++++++++++++++++
 libswscale/aarch64/swscale.c |  70 +++++-
 2 files changed, 471 insertions(+), 1 deletion(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index 7d7e1c1f2e..dfc635d1b9 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -1044,3 +1044,405 @@ function ff_hscale16to15_X4_neon_asm, export=1
 
         ret
 endfunc
+
+function ff_hscale16to19_4_neon_asm, export=1
+        // w0               int shift
+        // x1               int32_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src // treat it as uint16_t *src
+        // x4               const int16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #19
+        sub                 v18.4s, v18.4s, v17.4s      // max allowed value
+        dup                 v17.4s, w0                  // read shift
+        neg                 v17.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        cmp                 w2, #16
+        b.lt                2f // move to last block
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        // shift all filterPos left by one, as uint16_t will be read
+        lsl                 x8, x8, #1
+        lsl                 x9, x9, #1
+        lsl                 x10, x10, #1
+        lsl                 x11, x11, #1
+        lsl                 x12, x12, #1
+        lsl                 x13, x13, #1
+        lsl                 x14, x14, #1
+        lsl                 x15, x15, #1
+
+        // load src with given offset
+        ldr                 x8,  [x3, w8,  UXTW]
+        ldr                 x9,  [x3, w9,  UXTW]
+        ldr                 x10, [x3, w10, UXTW]
+        ldr                 x11, [x3, w11, UXTW]
+        ldr                 x12, [x3, w12, UXTW]
+        ldr                 x13, [x3, w13, UXTW]
+        ldr                 x14, [x3, w14, UXTW]
+        ldr                 x15, [x3, w15, UXTW]
+
+        sub                 sp, sp, #64
+        // push src on stack so it can be loaded into vectors later
+        stp                 x8, x9, [sp]
+        stp                 x10, x11, [sp, #16]
+        stp                 x12, x13, [sp, #32]
+        stp                 x14, x15, [sp, #48]
+
+1:
+        ld4                 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+        // Each of the blocks below does the following:
+        // extend src and filter to 32 bits with uxtl and sxtl, then
+        // multiply, or multiply and accumulate, the results.
+        // Extending to 32 bits is necessary, as uint16_t values can't
+        // be represented as int16_t without type promotion.
+        uxtl                v26.4s, v0.4h
+        sxtl                v27.4s, v28.4H
+        uxtl2               v0.4s, v0.8h
+        mul                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v28.8H
+        uxtl                v26.4s, v1.4h
+        mul                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v29.4H
+        uxtl2               v0.4s, v1.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v29.8H
+        uxtl                v26.4s, v2.4h
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sxtl                v27.4s, v30.4H
+        uxtl2               v0.4s, v2.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v30.8H
+        uxtl                v26.4s, v3.4h
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sxtl                v27.4s, v31.4H
+        uxtl2               v0.4s, v3.8h
+        mla                 v5.4s, v27.4s, v26.4s
+        sxtl2               v28.4s, v31.8H
+        sub                 w2, w2, #8
+        mla                 v6.4s, v28.4s, v0.4s
+
+        sshl                v5.4s, v5.4s, v17.4s
+        sshl                v6.4s, v6.4s, v17.4s
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+
+        st1                 {v5.4s, v6.4s}, [x1], #32
+        cmp                 w2, #16
+
+        // load filterPositions into registers for next iteration
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        lsl                 x8, x8, #1
+        lsl                 x9, x9, #1
+        lsl                 x10, x10, #1
+        lsl                 x11, x11, #1
+        lsl                 x12, x12, #1
+        lsl                 x13, x13, #1
+        lsl                 x14, x14, #1
+        lsl                 x15, x15, #1
+
+        ldr                 x8,  [x3, w8,  UXTW]
+        ldr                 x9,  [x3, w9,  UXTW]
+        ldr                 x10, [x3, w10, UXTW]
+        ldr                 x11, [x3, w11, UXTW]
+        ldr                 x12, [x3, w12, UXTW]
+        ldr                 x13, [x3, w13, UXTW]
+        ldr                 x14, [x3, w14, UXTW]
+        ldr                 x15, [x3, w15, UXTW]
+
+        stp                 x8, x9, [sp]
+        stp                 x10, x11, [sp, #16]
+        stp                 x12, x13, [sp, #32]
+        stp                 x14, x15, [sp, #48]
+
+        b.ge                1b
+
+        // perform the last iteration, without reloading the registers
+        ld4                 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp]
+        ld4                 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64
+
+        uxtl                v26.4s, v0.4h
+        sxtl                v27.4s, v28.4H
+        uxtl2               v0.4s, v0.8h
+        mul                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v28.8H
+        uxtl                v26.4s, v1.4h
+        mul                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v29.4H
+        uxtl2               v0.4s, v1.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v29.8H
+        uxtl                v26.4s, v2.4h
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v30.4H
+        uxtl2               v0.4s, v2.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v30.8H
+        uxtl                v26.4s, v3.4h
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sxtl                v27.4s, v31.4H
+        uxtl2               v0.4s, v3.8h
+        mla                 v5.4s, v26.4s, v27.4s
+        sxtl2               v28.4s, v31.8H
+        subs                w2, w2, #8
+        mla                 v6.4s, v0.4s, v28.4s
+
+        sshl                v5.4s, v5.4s, v17.4s
+        sshl                v6.4s, v6.4s, v17.4s
+
+        smin                v5.4s, v5.4s, v18.4s
+        smin                v6.4s, v6.4s, v18.4s
+
+        st1                 {v5.4s, v6.4s}, [x1], #32
+        add                 sp, sp, #64                 // restore stack
+        cbnz                w2, 2f
+
+        ret
+
+2:
+        ldr                 w8, [x5], #4                // load filterPos
+        lsl                 w8, w8, #1
+        add                 x9, x3, w8, UXTW            // src + filterPos
+        ld1                 {v0.4h}, [x9]               // load 4 * uint16_t
+        ld1                 {v31.4h}, [x4], #8
+
+        uxtl                v0.4s, v0.4h
+        sxtl                v31.4s, v31.4h
+        subs                w2, w2, #1
+        mul                 v5.4s, v0.4s, v31.4s
+        addv                s0, v5.4S
+        sshl                v0.4s, v0.4s, v17.4s
+        smin                v0.4s, v0.4s, v18.4s
+        st1                 {v0.s}[0], [x1], #4
+        cbnz                w2, 2b                      // if iterations remain jump to beginning
+
+        ret
+endfunc
+
+function ff_hscale16to19_X8_neon_asm, export=1
+        // w0               int shift
+        // x1               int32_t *dst
+        // w2               int dstW
+        // x3               const uint8_t *src // treat it as uint16_t *src
+        // x4               const int16_t *filter
+        // x5               const int32_t *filterPos
+        // w6               int filterSize
+
+        movi                v20.4s, #1
+        movi                v21.4s, #1
+        shl                 v20.4s, v20.4s, #19
+        sub                 v20.4s, v20.4s, v21.4s
+        dup                 v21.4s, w0
+        neg                 v21.4s, v21.4s
+
+        sbfiz               x7, x6, #1, #32             // filterSize*2 (*2 because int16)
+1:      ldr                 w8, [x5], #4                // filterPos[idx]
+        ldr                 w10, [x5], #4               // filterPos[idx + 1]
+        lsl                 w8, w8, #1
+        ldr                 w11, [x5], #4               // filterPos[idx + 2]
+        ldr                 w9, [x5], #4                // filterPos[idx + 3]
+        mov                 x16, x4                     // filter0 = filter
+        lsl                 w11, w11, #1
+        add                 x12, x16, x7                // filter1 = filter0 + filterSize*2
+        lsl                 w9, w9, #1
+        add                 x13, x12, x7                // filter2 = filter1 + filterSize*2
+        lsl                 w10, w10, #1
+        add                 x4, x13, x7                 // filter3 = filter2 + filterSize*2
+        movi                v0.2D, #0                   // val sum part 1 (for dst[0])
+        movi                v1.2D, #0                   // val sum part 2 (for dst[1])
+        movi                v2.2D, #0                   // val sum part 3 (for dst[2])
+        movi                v3.2D, #0                   // val sum part 4 (for dst[3])
+        add                 x17, x3, w8, UXTW           // srcp + filterPos[0]
+        add                 x8,  x3, w10, UXTW          // srcp + filterPos[1]
+        add                 x10, x3, w11, UXTW          // srcp + filterPos[2]
+        add                 x11, x3, w9, UXTW           // srcp + filterPos[3]
+        mov                 w15, w6                     // filterSize counter
+2:      ld1                 {v4.8H}, [x17], #16         // srcp[filterPos[0] + {0..7}]
+        ld1                 {v5.8H}, [x16], #16         // load 8x16-bit filter values, part 1
+        ld1                 {v6.8H}, [x8], #16          // srcp[filterPos[1] + {0..7}]
+        ld1                 {v7.8H}, [x12], #16         // load 8x16-bit at filter+filterSize
+        uxtl                v24.4s, v4.4H               // zero-extend srcp lower half to 32 bits so it stays unsigned
+        sxtl                v25.4s, v5.4H               // extend filter lower half to 32 bits to match srcp size
+        uxtl2               v4.4s, v4.8h                // extend srcp upper half to 32 bits
+        mla                 v0.4s, v24.4s, v25.4s       // multiply accumulate lower half of v4 * v5
+        sxtl2               v5.4s, v5.8h                // extend filter upper half to 32 bits
+        uxtl                v26.4s, v6.4h               // extend srcp lower half to 32 bits
+        mla                 v0.4S, v4.4s, v5.4s         // multiply accumulate upper half of v4 * v5
+        sxtl                v27.4s, v7.4H               // extend filter lower half
+        uxtl2               v6.4s, v6.8H                // extend srcp upper half
+        sxtl2               v7.4s, v7.8h                // extend filter upper half
+        ld1                 {v16.8H}, [x10], #16        // srcp[filterPos[2] + {0..7}]
+        mla                 v1.4S, v26.4s, v27.4s       // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+        ld1                 {v17.8H}, [x13], #16        // load 8x16-bit at filter+2*filterSize
+        uxtl                v22.4s, v16.4H              // extend srcp lower half
+        sxtl                v23.4s, v17.4H              // extend filter lower half
+        uxtl2               v16.4s, v16.8H              // extend srcp upper half
+        sxtl2               v17.4s, v17.8h              // extend filter upper half
+        mla                 v2.4S, v22.4s, v23.4s       // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        mla                 v2.4S, v16.4s, v17.4s       // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        ld1                 {v18.8H}, [x11], #16        // srcp[filterPos[3] + {0..7}]
+        mla                 v1.4S, v6.4s, v7.4s         // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+        ld1                 {v19.8H}, [x4], #16         // load 8x16-bit at filter+3*filterSize
+        subs                w15, w15, #8                // j -= 8: processed 8/filterSize
+        uxtl                v28.4s, v18.4H              // extend srcp lower half
+        sxtl                v29.4s, v19.4H              // extend filter lower half
+        uxtl2               v18.4s, v18.8H              // extend srcp upper half
+        sxtl2               v19.4s, v19.8h              // extend filter upper half
+        mla                 v3.4S, v28.4s, v29.4s       // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+        mla                 v3.4S, v18.4s, v19.4s       // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+        b.gt                2b                          // inner loop if filterSize not consumed completely
+        addp                v0.4S, v0.4S, v1.4S         // part01 horizontal pair adding
+        addp                v2.4S, v2.4S, v3.4S         // part23 horizontal pair adding
+        addp                v0.4S, v0.4S, v2.4S         // part0123 horizontal pair adding
+        subs                w2, w2, #4                  // dstW -= 4
+        sshl                v0.4s, v0.4s, v21.4s        // shift left by a negative amount (effectively a right shift); overflow expected
+        smin                v0.4s, v0.4s, v20.4s        // apply min (do not use sqshl)
+        st1                 {v0.4s}, [x1], #16          // write to destination part0123
+        b.gt                1b                          // loop until end of line
+        ret
+endfunc
+
+function ff_hscale16to19_X4_neon_asm, export=1
+        // w0  int shift
+        // x1  int16_t *dst (written as 32-bit elements)
+        // w2  int dstW
+        // x3  const uint8_t *src (read as 16-bit elements)
+        // x4  const int16_t *filter
+        // x5  const int32_t *filterPos
+        // w6  int filterSize
+
+        stp                 d8, d9, [sp, #-0x20]!
+        stp                 d10, d11, [sp, #0x10]
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #19
+        sub                 v21.4s, v18.4s, v17.4s      // max allowed value: (1 << 19) - 1
+        dup                 v17.4s, w0                  // read shift
+        neg                 v20.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        lsl                 w7, w6, #1                  // filterSize * 2: size of one filter row in bytes
+1:
+        ldp                 w8, w9, [x5]
+        ldp                 w10, w11, [x5, #8]
+
+        movi                v16.2d, #0                  // initialize accumulator for idx + 0
+        movi                v17.2d, #0                  // initialize accumulator for idx + 1
+        movi                v18.2d, #0                  // initialize accumulator for idx + 2
+        movi                v19.2d, #0                  // initialize accumulator for idx + 3
+
+        mov                 x12, x4                     // filter + 0
+        add                 x13, x4, x7                 // filter + 1
+        add                 x8, x3, x8, lsl #1          // srcp + filterPos 0
+        add                 x14, x13, x7                // filter + 2
+        add                 x9, x3, x9, lsl #1          // srcp + filterPos 1
+        add                 x15, x14, x7                // filter + 3
+        add                 x10, x3, x10, lsl #1        // srcp + filterPos 2
+        mov                 w0, w6                      // save the filterSize to temporary variable
+        add                 x11, x3, x11, lsl #1        // srcp + filterPos 3
+        add                 x5, x5, #16                 // advance filterPos by 4 entries
+        mov                 x16, xzr                    // clear the register x16 used for offsetting the filter values
+
+2:
+        ldr                 q4, [x8], #16               // load src values for idx 0
+        ldr                 q5, [x9], #16               // load src values for idx 1
+        uxtl                v26.4s, v4.4h
+        uxtl2               v4.4s, v4.8h
+        ldr                 q31, [x12, x16]             // load filter values for idx 0
+        ldr                 q6, [x10], #16              // load src values for idx 2
+        sxtl                v22.4s, v31.4h
+        sxtl2               v31.4s, v31.8h
+        mla                 v16.4s, v26.4s, v22.4s      // multiplication of lower half for idx 0
+        uxtl                v25.4s, v5.4h
+        uxtl2               v5.4s, v5.8h
+        ldr                 q30, [x13, x16]             // load filter values for idx 1
+        ldr                 q7, [x11], #16              // load src values for idx 3
+        mla                 v16.4s, v4.4s, v31.4s       // multiplication of upper half for idx 0
+        uxtl                v24.4s, v6.4h
+        sxtl                v8.4s, v30.4h
+        sxtl2               v30.4s, v30.8h
+        mla                 v17.4s, v25.4s, v8.4s       // multiplication of lower half for idx 1
+        ldr                 q29, [x14, x16]             // load filter values for idx 2
+        uxtl2               v6.4s, v6.8h
+        sxtl                v9.4s, v29.4h
+        sxtl2               v29.4s, v29.8h
+        mla                 v17.4s, v5.4s, v30.4s       // multiplication of upper half for idx 1
+        ldr                 q28, [x15, x16]             // load filter values for idx 3
+        mla                 v18.4s, v24.4s, v9.4s       // multiplication of lower half for idx 2
+        uxtl                v23.4s, v7.4h
+        sxtl                v10.4s, v28.4h
+        mla                 v18.4s, v6.4s, v29.4s       // multiplication of upper half for idx 2
+        uxtl2               v7.4s, v7.8h
+        sxtl2               v28.4s, v28.8h
+        mla                 v19.4s, v23.4s, v10.4s      // multiplication of lower half for idx 3
+        sub                 w0, w0, #8                  // 8 filter taps consumed
+        cmp                 w0, #8                      // at least 8 taps left?
+        mla                 v19.4s, v7.4s, v28.4s       // multiplication of upper half for idx 3
+
+        add                 x16, x16, #16               // advance filter values indexing
+
+        b.ge                2b
+
+        // last 4 filter taps left
+
+        sub                 x17, x7, #8                 // byte offset of the last 4 coefficients in each filter row
+
+        ldr                 d4, [x8]                    // load src values for idx 0
+        ldr                 d31, [x12, x17]             // load filter values for idx 0
+        uxtl                v4.4s, v4.4h
+        sxtl                v31.4s, v31.4h
+        ldr                 d5, [x9]                    // load src values for idx 1
+        mla                 v16.4s, v4.4s, v31.4s       // multiply-accumulate last 4 elements for idx 0
+        ldr                 d30, [x13, x17]             // load filter values for idx 1
+        uxtl                v5.4s, v5.4h
+        sxtl                v30.4s, v30.4h
+        ldr                 d6, [x10]                   // load src values for idx 2
+        mla                 v17.4s, v5.4s, v30.4s       // multiply-accumulate last 4 elements for idx 1
+        ldr                 d29, [x14, x17]             // load filter values for idx 2
+        uxtl                v6.4s, v6.4h
+        sxtl                v29.4s, v29.4h
+        ldr                 d7, [x11]                   // load src values for idx 3
+        ldr                 d28, [x15, x17]             // load filter values for idx 3
+        mla                 v18.4s, v6.4s, v29.4s       // multiply-accumulate last 4 elements for idx 2
+        uxtl                v7.4s, v7.4h
+        sxtl                v28.4s, v28.4h
+        addp                v16.4s, v16.4s, v17.4s
+        mla                 v19.4s, v7.4s, v28.4s       // multiply-accumulate last 4 elements for idx 3
+        subs                w2, w2, #4
+        addp                v18.4s, v18.4s, v19.4s
+        addp                v16.4s, v16.4s, v18.4s
+        sshl                v16.4s, v16.4s, v20.4s
+        smin                v16.4s, v16.4s, v21.4s
+
+        st1                 {v16.4s}, [x1], #16
+        add                 x4, x4, x7, lsl #2          // advance filter by 4 rows
+        b.gt                1b
+
+        ldp                 d8, d9, [sp]
+        ldp                 d10, d11, [sp, #0x10]
+
+        add                 sp, sp, #0x20
+
+        ret
+endfunc
\ No newline at end of file
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 993cdd67dd..ef6029e068 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -34,6 +34,16 @@ void ff_hscale16to15_X4_neon_asm(int shift, int16_t *_dst, int dstW,
                       const uint8_t *_src, const int16_t *filter,
                       const int32_t *filterPos, int filterSize);
 
+void ff_hscale16to19_4_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+void ff_hscale16to19_X8_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+void ff_hscale16to19_X4_neon_asm(int shift, int16_t *_dst, int dstW,
+                      const uint8_t *_src, const int16_t *filter,
+                      const int32_t *filterPos, int filterSize);
+
 #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \
 void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
                                                 SwsContext *c, int16_t *data, \
@@ -43,7 +53,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
 #define SCALE_FUNCS(filter_n, opt) \
     SCALE_FUNC(filter_n,  8, 15, opt); \
     SCALE_FUNC(filter_n, 8, 19, opt); \
-    SCALE_FUNC(filter_n, 16, 15, opt);
+    SCALE_FUNC(filter_n, 16, 15, opt); \
+    SCALE_FUNC(filter_n, 16, 19, opt);
 #define ALL_SCALE_FUNCS(opt) \
     SCALE_FUNCS(4, opt); \
     SCALE_FUNCS(X8, opt); \
@@ -73,6 +84,9 @@ void ff_yuv2plane1_8_neon(
         if (c->dstBpc <= 14)                                            \
             hscalefn =                                                  \
                 ff_hscale16to15_ ## filtersize ## _ ## opt;             \
+        else                                                            \
+            hscalefn =                                                  \
+                ff_hscale16to19_ ## filtersize ## _ ## opt;             \
     }                                                                   \
 } while (0)
 
@@ -150,4 +164,58 @@ void ff_hscale16to15_X4_neon(SwsContext *c, int16_t *_dst, int dstW,
         sh = 16 - 1;
     }
     ff_hscale16to15_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+}
+
+void ff_hscale16to19_4_neon(SwsContext *c, int16_t *_dst, int dstW,
+                           const uint8_t *_src, const int16_t *filter,
+                           const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int bits            = desc->comp[0].depth - 1;
+    int sh              = bits - 4;
+
+    if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) {
+        sh = 9;
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+        sh = 16 - 1 - 4;
+    }
+
+    ff_hscale16to19_4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to19_X8_neon(SwsContext *c, int16_t *_dst, int dstW,
+                           const uint8_t *_src, const int16_t *filter,
+                           const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int bits            = desc->comp[0].depth - 1;
+    int sh              = bits - 4;
+
+    if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) {
+        sh = 9;
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+        sh = 16 - 1 - 4;
+    }
+
+    ff_hscale16to19_X8_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
+}
+
+void ff_hscale16to19_X4_neon(SwsContext *c, int16_t *_dst, int dstW,
+                           const uint8_t *_src, const int16_t *filter,
+                           const int32_t *filterPos, int filterSize)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat);
+    int bits            = desc->comp[0].depth - 1;
+    int sh              = bits - 4;
+
+    if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) {
+        sh = 9;
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input is processed like uint 16bpc */
+        sh = 16 - 1 - 4;
+    }
+
+    ff_hscale16to19_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+
 }
\ No newline at end of file