From patchwork Mon Oct 17 13:07:12 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Hubert Mazur X-Patchwork-Id: 38762 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a20:4a86:b0:9d:28a3:170e with SMTP id fn6csp1584471pzb; Mon, 17 Oct 2022 06:08:37 -0700 (PDT) X-Google-Smtp-Source: AMsMyM593Pz70ifXyAddJvTjKFfa2AqIMW19MknQjEmsFBhvfzfH4TopcINjZcR/gaOTXDJcDA6Y X-Received: by 2002:a17:906:db03:b0:741:337e:3600 with SMTP id xj3-20020a170906db0300b00741337e3600mr8667761ejb.343.1666012116755; Mon, 17 Oct 2022 06:08:36 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1666012116; cv=none; d=google.com; s=arc-20160816; b=VyqG8IZJNw1llGEadCC6D6gsAPLQU61AfiAe/Mg+eIXJbj+/PxsXvKE1Q6YJGnA0tF kvsAgR6gMC4tQTBPofrmLYZNJ6xiknCcfz9dYcygO4O26ip/AfcHGCk3XQ1RpUUalSAY cOf7UG45AEtDBkNDmW5J6yTaaf71Q2Mn26Nwjci34U0b/inkSMPuRMMWdV3bfXzcG8dZ SoMdTb+C/wgh0HSSXsIRuOpeYiks4NmR0it+IWlJY+gzHl6mG38o+WU5BxGDID5govZP 3EhVtu25KiI30gWgR6VZQJcraI97uH47k3fkfn/jzLG68omEv2dRFzzHJoVUtKItoq8i 9VvQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:dkim-signature:delivered-to; bh=oTEVwTvglecRlF++1LEVuDX3IFpvV2+jfo2Al1WvDkI=; b=Jq9otxl+wYa5BAZR0j/wl5EjWt8kOeqP7L6Bq6dKt3BfXlPvoLmqIQWOAKomtfNY33 F0TjtkPMUq5doaf2TVJ4wdkqAys0TKXKHlfDUAs2APCTWI38PORymCyOlBpuR/C44o62 P8yELBC8djXSyNWMH1XLSMQRGJ+abXJ7s+jNJdXjO9HXtUtKHsiLmggdtLF8+2XD+rYY GQWZ6mxj72vkRwAVzWN+w8Xb8EGtPwrhJy44ybmou7iwZwCKXU899hjSd/nwCokS6Q1V mO6hYDmubLCQpjeRATi7U5ZjBPbXR0IDVfUQDzpn7/5GES3v4OkLYiNRawrg6atgzzKX UhMA== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b=FEtxOEFi; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org 
designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id ga13-20020a1709070c0d00b0078db1343eedsi9334166ejc.774.2022.10.17.06.08.36; Mon, 17 Oct 2022 06:08:36 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b=FEtxOEFi; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 8151868BD0D; Mon, 17 Oct 2022 16:08:25 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from mail-wm1-f47.google.com (mail-wm1-f47.google.com [209.85.128.47]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E606A68BD00 for ; Mon, 17 Oct 2022 16:08:18 +0300 (EEST) Received: by mail-wm1-f47.google.com with SMTP id v130-20020a1cac88000000b003bcde03bd44so12654974wme.5 for ; Mon, 17 Oct 2022 06:08:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=semihalf.com; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=0J628ykI6t9YgPtLIUmxg/iHJuXbUWR2dHMWGpd5GVM=; b=FEtxOEFiXEomSLGRHVclmD7CGzR3H8to2YBF8Ji7cnAWyggKLZZB37ZUuq0y5GKVKk UYqDkhvzTlHvmrhjGXfpBU7pQx1wdw1epcYPdAJmIhneFsZH2c0ij/HXXV6seV6Bi8ll 1hI04DEABQ8C6hFNgwEbJzuq8GGow0kAWLrsVsuVn2GcCfR1Yi7pwVWL3DBo7OjsDOJY 
4JSjkvJaI00T5wuS8WI9NISIWQ1JeoBPKGUZSLxWr4kaDnhZ+OTR+jHsHMT4QawSZAdh KdT/FB8oIQrE7SGGz2ZRbZzSGOAtoO0pM1Q4QdwPUio68PRAFSB1yY74W1g+JrkinLx6 IsJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=0J628ykI6t9YgPtLIUmxg/iHJuXbUWR2dHMWGpd5GVM=; b=fttD4EqRbY+n43IOFK0CvOsUhhw/LDewCQ2RwyegLuLC+THBT3d3HgBKXTwMogyAZV 5BvrfvDc+KFmyqWl1U64vipcP0f/f4fMnTKQsA2HG3Lv1z/VA+/MifNLKsZ4+RyvYT0K UTAIQOx+EbZc+nHuQO/30lyCofDY2bErvYcOFIDZhIuQjOvmqEs5327P6YfjhPJYSbDX w+E0TFkMwIRrigkmV3D1Yljwo2vXDyx7Odi1Poy+YMH/h8i1bQ2lm9JErAfnRNNogw5s DeWDX2rnL+0mkUBJbFR1s83dMf5BWszFyxk4o+3ZW4p9uEWkSLqiss+7nVM5fgsEglxZ u0uw== X-Gm-Message-State: ACrzQf30TTnoV3LAuCHgsUxV+cHiOWAx52/ZgETWKVw+rCGUqvoZ4T4L rMyQfguOH0SAvAdPYDwM5LCddbY7yvdcOqGX X-Received: by 2002:a05:600c:6028:b0:3c6:f0bb:316a with SMTP id az40-20020a05600c602800b003c6f0bb316amr7197132wmb.1.1666012097760; Mon, 17 Oct 2022 06:08:17 -0700 (PDT) Received: from ip-172-31-3-164.eu-west-1.compute.internal (ec2-54-154-193-154.eu-west-1.compute.amazonaws.com. 
[54.154.193.154]) by smtp.gmail.com with ESMTPSA id t18-20020a5d6a52000000b0022af865810esm8297237wrw.75.2022.10.17.06.08.16 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 17 Oct 2022 06:08:17 -0700 (PDT) From: Hubert Mazur To: ffmpeg-devel@ffmpeg.org Date: Mon, 17 Oct 2022 13:07:12 +0000 Message-Id: <20221017130715.30896-2-hum@semihalf.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20221017130715.30896-1-hum@semihalf.com> References: <20221017130715.30896-1-hum@semihalf.com> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 1/4] sw_scale: Add specializations for hscale 8 to 19 X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur , martin@martin.st, mw@semihalf.com, spop@amazon.com Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: 4jyaSZpV1UWS

Add arm64 NEON implementations for hscale 8 to 19 with filter sizes 4, X4
and X8. All of them are based on the very similar implementations dedicated
to hscale 8 to 15. The major change is in how the data is stored: the result
is written as int32_t instead of int16_t.

These functions are heavily inspired by the patches provided by J. Swinney
and M. Storsjö for hscale8to15, slightly adapted for hscale8to19.

The tests and benchmarks were run on AWS Graviton 2 instances. The results
from the checkasm tool are shown below.

hscale_8_to_19__fs_4_dstW_512_c: 5663.2
hscale_8_to_19__fs_4_dstW_512_neon: 1259.7
hscale_8_to_19__fs_8_dstW_512_c: 9306.0
hscale_8_to_19__fs_8_dstW_512_neon: 2020.2
hscale_8_to_19__fs_12_dstW_512_c: 12932.7
hscale_8_to_19__fs_12_dstW_512_neon: 2462.5
hscale_8_to_19__fs_16_dstW_512_c: 16844.2
hscale_8_to_19__fs_16_dstW_512_neon: 4671.2
hscale_8_to_19__fs_32_dstW_512_c: 32803.7
hscale_8_to_19__fs_32_dstW_512_neon: 5474.2
hscale_8_to_19__fs_40_dstW_512_c: 40948.0
hscale_8_to_19__fs_40_dstW_512_neon: 6669.7

Signed-off-by: Hubert Mazur
---
 libswscale/aarch64/hscale.S  | 292 ++++++++++++++++++++++++++++++++++-
 libswscale/aarch64/swscale.c |  13 +-
 2 files changed, 300 insertions(+), 5 deletions(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index a16d3dca42..5e8cad9825 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -218,7 +218,6 @@ function ff_hscale8to15_4_neon, export=1
 // 2. Interleaved prefetching src data and madd
 // 3. Complete madd
 // 4. Complete remaining iterations when dstW % 8 != 0

-        sub     sp, sp, #32                 // allocate 32 bytes on the stack
         cmp     w2, #16                     // if dstW <16, skip to the last block used for wrapping up
         b.lt    2f

@@ -347,3 +346,294 @@ function ff_hscale8to15_4_neon, export=1
         add     sp, sp, #32                 // clean up stack
         ret
 endfunc
+
+function ff_hscale8to19_4_neon, export=1
+        // x0  SwsContext *c (unused)
+        // x1  int32_t *dst
+        // w2  int dstW
+        // x3  const uint8_t *src // treat it as uint16_t *src
+        // x4  const uint16_t *filter
+        // x5  const int32_t *filterPos
+        // w6  int filterSize
+
+        movi    v18.4s, #1
+        movi    v17.4s, #1
+        shl     v18.4s, v18.4s, #19
+        sub     v18.4s, v18.4s, v17.4s      // max allowed value
+
+        cmp     w2, #16
+        b.lt    2f                          // move to last block
+
+        ldp     w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp     w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp     w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp     w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add     x5, x5, #32
+
+        // load data from
+        ldr     w8, [x3, w8, UXTW]
+        ldr     w9, [x3, w9, UXTW]
+        ldr     w10, [x3, w10, UXTW]
+        ldr     w11, [x3, w11, UXTW]
+        ldr     w12, [x3, w12, UXTW]
+        ldr     w13, [x3, w13, UXTW]
+        ldr     w14, [x3, w14, UXTW]
+        ldr     w15, [x3, w15, UXTW]
+
+        sub     sp, sp, #32
+
+        stp     w8, w9, [sp]
+        stp     w10, w11, [sp, #8]
+        stp     w12, w13, [sp, #16]
+        stp     w14, w15, [sp, #24]
+
+1:
+        ld4     {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
+        ld4     {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+        // load filterPositions into registers for next iteration
+
+        ldp     w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp     w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp     w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp     w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add     x5, x5, #32
+        uxtl    v0.8h, v0.8b
+        ldr     w8, [x3, w8, UXTW]
+        smull   v5.4s, v0.4h, v28.4h        // multiply first column of src
+        ldr     w9, [x3, w9, UXTW]
+        smull2  v6.4s, v0.8h, v28.8h
+        stp     w8, w9, [sp]
+
+        uxtl    v1.8h, v1.8b
+        ldr     w10, [x3, w10, UXTW]
+        smlal   v5.4s, v1.4h, v29.4h        // multiply second column of src
+        ldr     w11, [x3, w11, UXTW]
+        smlal2  v6.4s, v1.8h, v29.8h
+        stp     w10, w11, [sp, #8]
+
+        uxtl    v2.8h, v2.8b
+        ldr     w12, [x3, w12, UXTW]
+        smlal   v5.4s, v2.4h, v30.4h        // multiply third column of src
+        ldr     w13, [x3, w13, UXTW]
+        smlal2  v6.4s, v2.8h, v30.8h
+        stp     w12, w13, [sp, #16]
+
+        uxtl    v3.8h, v3.8b
+        ldr     w14, [x3, w14, UXTW]
+        smlal   v5.4s, v3.4h, v31.4h        // multiply fourth column of src
+        ldr     w15, [x3, w15, UXTW]
+        smlal2  v6.4s, v3.8h, v31.8h
+        stp     w14, w15, [sp, #24]
+
+        sub     w2, w2, #8
+        sshr    v5.4s, v5.4s, #3
+        sshr    v6.4s, v6.4s, #3
+        smin    v5.4s, v5.4s, v18.4s
+        smin    v6.4s, v6.4s, v18.4s
+
+        st1     {v5.4s, v6.4s}, [x1], #32
+        cmp     w2, #16
+        b.ge    1b
+
+        // here we make last iteration, without updating the registers
+        ld4     {v0.8b, v1.8b, v2.8b, v3.8b}, [sp]
+        ld4     {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7]
+
+        uxtl    v0.8h, v0.8b
+        uxtl    v1.8h, v1.8b
+        smull   v5.4s, v0.4h, v28.4h
+        smull2  v6.4s, v0.8h, v28.8h
+        uxtl    v2.8h, v2.8b
+        smlal   v5.4s, v1.4h, v29.4H
+        smlal2  v6.4s, v1.8h, v29.8H
+        uxtl    v3.8h, v3.8b
+        smlal   v5.4s, v2.4h, v30.4H
+        smlal2  v6.4s, v2.8h, v30.8H
+        smlal   v5.4s, v3.4h, v31.4H
+        smlal2  v6.4s, v3.8h, v31.8h
+
+        sshr    v5.4s, v5.4s, #3
+        sshr    v6.4s, v6.4s, #3
+
+        smin    v5.4s, v5.4s, v18.4s
+        smin    v6.4s, v6.4s, v18.4s
+
+        sub     w2, w2, #8
+        st1     {v5.4s, v6.4s}, [x1], #32
+        add     sp, sp, #32                 // restore stack
+        cbnz    w2, 2f
+
+        ret
+
+2:
+        ldr     w8, [x5], #4                // load filterPos
+        add     x9, x3, w8, UXTW            // src + filterPos
+        ld1     {v0.s}[0], [x9]             // load 4 * uint8_t* into one single
+        ld1     {v31.4h}, [x4], #8
+        uxtl    v0.8h, v0.8b
+        smull   v5.4s, v0.4h, v31.4H
+        saddlv  d0, v5.4S
+        sqshrn  s0, d0, #3
+        smin    v0.4s, v0.4s, v18.4s
+        st1     {v0.s}[0], [x1], #4
+        sub     w2, w2, #1
+        cbnz    w2, 2b                      // if iterations remain jump to beginning
+
+        ret
+endfunc
+
+function ff_hscale8to19_X8_neon, export=1
+        movi    v20.4s, #1
+        movi    v17.4s, #1
+        shl     v20.4s, v20.4s, #19
+        sub     v20.4s, v20.4s, v17.4s
+
+        sbfiz   x7, x6, #1, #32             // filterSize*2 (*2 because int16)
+1:
+        mov     x16, x4                     // filter0 = filter
+        ldr     w8, [x5], #4                // filterPos[idx]
+        add     x12, x16, x7                // filter1 = filter0 + filterSize*2
+        ldr     w0, [x5], #4                // filterPos[idx + 1]
+        add     x13, x12, x7                // filter2 = filter1 + filterSize*2
+        ldr     w11, [x5], #4               // filterPos[idx + 2]
+        add     x4, x13, x7                 // filter3 = filter2 + filterSize*2
+        ldr     w9, [x5], #4                // filterPos[idx + 3]
+        movi    v0.2D, #0                   // val sum part 1 (for dst[0])
+        movi    v1.2D, #0                   // val sum part 2 (for dst[1])
+        movi    v2.2D, #0                   // val sum part 3 (for dst[2])
+        movi    v3.2D, #0                   // val sum part 4 (for dst[3])
+        add     x17, x3, w8, UXTW           // srcp + filterPos[0]
+        add     x8, x3, w0, UXTW            // srcp + filterPos[1]
+        add     x0, x3, w11, UXTW           // srcp + filterPos[2]
+        add     x11, x3, w9, UXTW           // srcp + filterPos[3]
+        mov     w15, w6                     // filterSize counter
+2:      ld1     {v4.8B}, [x17], #8          // srcp[filterPos[0] + {0..7}]
+        ld1     {v5.8H}, [x16], #16         // load 8x16-bit filter values, part 1
+        uxtl    v4.8H, v4.8B                // unpack part 1 to 16-bit
+        smlal   v0.4S, v4.4H, v5.4H         // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}]
+        ld1     {v6.8B}, [x8], #8           // srcp[filterPos[1] + {0..7}]
+        smlal2  v0.4S, v4.8H, v5.8H         // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}]
+        ld1     {v7.8H}, [x12], #16         // load 8x16-bit at filter+filterSize
+        ld1     {v16.8B}, [x0], #8          // srcp[filterPos[2] + {0..7}]
+        uxtl    v6.8H, v6.8B                // unpack part 2 to 16-bit
+        ld1     {v17.8H}, [x13], #16        // load 8x16-bit at filter+2*filterSize
+        uxtl    v16.8H, v16.8B              // unpack part 3 to 16-bit
+        smlal   v1.4S, v6.4H, v7.4H         // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
+        ld1     {v18.8B}, [x11], #8         // srcp[filterPos[3] + {0..7}]
+        smlal   v2.4S, v16.4H, v17.4H       // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        ld1     {v19.8H}, [x4], #16         // load 8x16-bit at filter+3*filterSize
+        smlal2  v2.4S, v16.8H, v17.8H       // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        uxtl    v18.8H, v18.8B              // unpack part 4 to 16-bit
+        smlal2  v1.4S, v6.8H, v7.8H         // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
+        smlal   v3.4S, v18.4H, v19.4H       // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+        subs    w15, w15, #8                // j -= 8: processed 8/filterSize
+        smlal2  v3.4S, v18.8H, v19.8H       // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+        b.gt    2b                          // inner loop if filterSize not consumed completely
+        addp    v0.4S, v0.4S, v1.4S         // part01 horizontal pair adding
+        addp    v2.4S, v2.4S, v3.4S         // part23 horizontal pair adding
+        addp    v0.4S, v0.4S, v2.4S         // part0123 horizontal pair adding
+        subs    w2, w2, #4                  // dstW -= 4
+        sshr    v0.4s, v0.4S, #3            // shift and clip the 2x16-bit final values
+        smin    v0.4s, v0.4s, v20.4s
+        st1     {v0.4s}, [x1], #16          // write to destination part0123
+        b.gt    1b                          // loop until end of line
+        ret
+endfunc
+
+function ff_hscale8to19_X4_neon, export=1
+        // x0  SwsContext *c (not used)
+        // x1  int32_t *dst
+        // w2  int dstW
+        // x3  const uint8_t *src
+        // x4  const int16_t *filter
+        // x5  const int32_t *filterPos
+        // w6  int filterSize
+
+        movi    v20.4s, #1
+        movi    v17.4s, #1
+        shl     v20.4s, v20.4s, #19
+        sub     v20.4s, v20.4s, v17.4s
+
+        lsl     w7, w6, #1
+1:
+        ldp     w8, w9, [x5]
+        ldp     w10, w11, [x5, #8]
+
+        movi    v16.2d, #0                  // initialize accumulator for idx + 0
+        movi    v17.2d, #0                  // initialize accumulator for idx + 1
+        movi    v18.2d, #0                  // initialize accumulator for idx + 2
+        movi    v19.2d, #0                  // initialize accumulator for idx + 3
+
+        mov     x12, x4                     // filter + 0
+        add     x13, x4, x7                 // filter + 1
+        add     x8, x3, w8, UXTW            // srcp + filterPos 0
+        add     x14, x13, x7                // filter + 2
+        add     x9, x3, w9, UXTW            // srcp + filterPos 1
+        add     x15, x14, x7                // filter + 3
+        add     x10, x3, w10, UXTW          // srcp + filterPos 2
+        mov     w0, w6                      // save the filterSize to temporary variable
+        add     x11, x3, w11, UXTW          // srcp + filterPos 3
+        add     x5, x5, #16                 // advance filter position
+        mov     x16, xzr                    // clear the register x16 used for offsetting the filter values
+
+2:
+        ldr     d4, [x8], #8                // load src values for idx 0
+        ldr     q31, [x12, x16]             // load filter values for idx 0
+        uxtl    v4.8h, v4.8b                // extend type to match the filter's size
+        ldr     d5, [x9], #8                // load src values for idx 1
+        smlal   v16.4s, v4.4h, v31.4h       // multiplication of lower half for idx 0
+        uxtl    v5.8h, v5.8b                // extend type to match the filter's size
+        ldr     q30, [x13, x16]             // load filter values for idx 1
+        smlal2  v16.4s, v4.8h, v31.8h       // multiplication of upper half for idx 0
+        ldr     d6, [x10], #8               // load src values for idx 2
+        ldr     q29, [x14, x16]             // load filter values for idx 2
+        smlal   v17.4s, v5.4h, v30.4H       // multiplication of lower half for idx 1
+        ldr     d7, [x11], #8               // load src values for idx 3
+        smlal2  v17.4s, v5.8h, v30.8H       // multiplication of upper half for idx 1
+        uxtl    v6.8h, v6.8B                // extend type to match the filter's size
+        ldr     q28, [x15, x16]             // load filter values for idx 3
+        smlal   v18.4s, v6.4h, v29.4h       // multiplication of lower half for idx 2
+        uxtl    v7.8h, v7.8B
+        smlal2  v18.4s, v6.8h, v29.8H       // multiplication of upper half for idx 2
+        sub     w0, w0, #8
+        smlal   v19.4s, v7.4h, v28.4H       // multiplication of lower half for idx 3
+        cmp     w0, #8
+        smlal2  v19.4s, v7.8h, v28.8h       // multiplication of upper half for idx 3
+        add     x16, x16, #16               // advance filter values indexing
+
+        b.ge    2b
+
+
+        // 4 iterations left
+
+        sub     x17, x7, #8                 // step back to wrap up the filter pos for last 4 elements
+
+        ldr     s4, [x8]                    // load src values for idx 0
+        ldr     d31, [x12, x17]             // load filter values for idx 0
+        uxtl    v4.8h, v4.8b                // extend type to match the filter's size
+        ldr     s5, [x9]                    // load src values for idx 1
+        smlal   v16.4s, v4.4h, v31.4h
+        ldr     d30, [x13, x17]             // load filter values for idx 1
+        uxtl    v5.8h, v5.8b                // extend type to match the filter's size
+        ldr     s6, [x10]                   // load src values for idx 2
+        smlal   v17.4s, v5.4h, v30.4h
+        uxtl    v6.8h, v6.8B                // extend type to match the filter's size
+        ldr     d29, [x14, x17]             // load filter values for idx 2
+        ldr     s7, [x11]                   // load src values for idx 3
+        addp    v16.4s, v16.4s, v17.4s
+        uxtl    v7.8h, v7.8B
+        ldr     d28, [x15, x17]             // load filter values for idx 3
+        smlal   v18.4s, v6.4h, v29.4h
+        smlal   v19.4s, v7.4h, v28.4h
+        subs    w2, w2, #4
+        addp    v18.4s, v18.4s, v19.4s
+        addp    v16.4s, v16.4s, v18.4s
+        sshr    v16.4s, v16.4s, #3
+        smin    v16.4s, v16.4s, v20.4s
+
+        st1     {v16.4s}, [x1], #16
+        add     x4, x4, x7, lsl #2
+        b.gt    1b
+        ret
+
+endfunc
\ No newline at end of file
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index d1312c6658..479fe129d0 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -29,7 +29,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
                                                 const int16_t *filter, \
                                                 const int32_t *filterPos, int filterSize)
 #define SCALE_FUNCS(filter_n, opt) \
-    SCALE_FUNC(filter_n,  8, 15, opt);
+    SCALE_FUNC(filter_n,  8, 15, opt); \
+    SCALE_FUNC(filter_n,  8, 19, opt);
 #define ALL_SCALE_FUNCS(opt) \
     SCALE_FUNCS(4, opt); \
     SCALE_FUNCS(X8, opt); \
@@ -48,9 +49,13 @@ void ff_yuv2plane1_8_neon(
         int offset);

 #define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do {              \
-    if (c->srcBpc == 8 && c->dstBpc <= 14) {                            \
-      hscalefn =                                                        \
-        ff_hscale8to15_ ## filtersize ## _ ## opt;                      \
+    if (c->srcBpc == 8) {                                               \
+        if(c->dstBpc <= 14) {                                           \
+            hscalefn =                                                  \
+                ff_hscale8to15_ ## filtersize ## _ ## opt;              \
+        } else                                                          \
+            hscalefn =                                                  \
+                ff_hscale8to19_ ## filtersize ## _ ## opt;              \
     }                                                                   \
 } while (0)


From patchwork Mon Oct 17 13:07:13 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hubert Mazur X-Patchwork-Id: 38763 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a20:4a86:b0:9d:28a3:170e with SMTP id fn6csp1584540pzb; Mon, 17 Oct 2022 06:08:46 -0700 (PDT) X-Google-Smtp-Source: AMsMyM7zAb1SH5dRxkzuQGCBFzkcrS5irhmFAASLQdCB6MvdTXHXBhyAI7KV2g1zlnf9fudgDhGR X-Received: by 2002:a05:6402:2686:b0:45d:82c0:c2b6 with SMTP id w6-20020a056402268600b0045d82c0c2b6mr5789926edd.390.1666012126275; Mon, 17 Oct 2022 06:08:46 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1666012126; cv=none; d=google.com; s=arc-20160816; b=agdF58kip7MBBpozT6Fk390rq8iBcYVzCH9f5mujSRUtaZNuSO6BEn8wShvjsNfZ6w
ig+ird3O87Pelj3JulI3xgQN/fBYkgAjpjvOdHFGS2BrTC5DX2vJ0M3kUe8fjjUYDH8D j5e+Rt0mGGHunWI692xl88Z2BTYf6DJM55NbRscwH80gOKoDSvWc8v9gzU7yrPf4wDsE ewd/uPwFU6SB3n48e96juaz/g5HJsMQR7DKX+DEguY9Xx+de/Rh5vGCL02+v+dcYGi75 gVaOoLfbpZI//jWZll74H39hp08BC/pYENrDjEwGElCLas3SrSj7VddA0vEoKemEdoUM m1kQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:dkim-signature:delivered-to; bh=v+KlX61JTRlrUe5gf7J3nwBnxosrb5Uzc2RQxOaFPfk=; b=gwgH5qyNi4qjwCdrwLWcYf07UPiY9NZfIHzSmCtdMFCvmc2dRtuQ4ezHmntb5mpRds TTaDf6xeoVNY4WMg+ZmRS54tDz0XNimRlJmPYZhz9mvoKA0mYOJ395LFbny6SwknwQyA FW46qnr/7awpqyw3JxNvLUiIqvhvZ7Tj/8o80o5Uw1zt8w6nemM8TmNx1Xru3xGZHczZ udD07qU9RN9mR6ynzJDP8ZnKUxgs9JQ77O7d2GXg3oWvTCBRMgNlDr2VwQmRs4z6j/Of uUD0a7dPvZNN1mfyKd2fF5u7rnEhRPi2axTPIowqmjxeTXkWnWpbBhhCAorCT18YvQvB hGlQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b=lsooud3t; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. 
[79.124.17.100]) by mx.google.com with ESMTP id sc12-20020a1709078a0c00b0078da9130dc8si9923491ejc.164.2022.10.17.06.08.45; Mon, 17 Oct 2022 06:08:46 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b=lsooud3t; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 94D9568BD10; Mon, 17 Oct 2022 16:08:28 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from mail-wm1-f47.google.com (mail-wm1-f47.google.com [209.85.128.47]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 2BB3B68BCFE for ; Mon, 17 Oct 2022 16:08:20 +0300 (EEST) Received: by mail-wm1-f47.google.com with SMTP id n9so8604069wms.1 for ; Mon, 17 Oct 2022 06:08:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=semihalf.com; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=HMJ/dB3W+aJT6xEOVd3elsu9wDdmB00a430IhEwETvg=; b=lsooud3tcyunmt+kTgc4JYFoLO1MUe5MHer7jPdw8YtoV9XWfKwKAP3eoLyBS8KkwF JJThpIZNi6r2ejdHnENTKjl6I5bfUvKPsY4msi+qE8sksYKgN2+jbbrXKn4JdXFogHJ1 TZBkRmEzdWxaEmhDnoGgvcT0B0ugCzER/U62Hsjd0n7nvptnGKPB0pwq21fYZZkHngpA DAsAzqurCPW0qYvxXmahPUUOARDxF7LxYwDvlX7qZNSt8aJg2iPYRKKud0ofUgh3jwpl mJmn/jk6oQpWUY2ZrVrz3KbQo7USAP/7nUhzxRPPrET/SxkPOMk0IoRMUwlU3SSDKmtG bjqg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to 
:message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=HMJ/dB3W+aJT6xEOVd3elsu9wDdmB00a430IhEwETvg=; b=WAKHALoHv3/G+KXMf2fke7LcXOJJzVx4HFNgiKJnyK8L0RUSz8WxLc+7IXbhCIL1rA Au44924emqlgTvR+FJLy3aCLRTWUZWXzUnGdTSxnTTUr3hQrenf6jhtqGavRP4DToanl ZUnYiBnxrjj+7wDrWb1vEEot76xFlZIb/3KXY29pj17pAV5qc7leoA9BJJYHH/8dsXJ6 h26YCudSq/33zRVEVh3+ZBN3SxKuFwZ6BavXy83hMzHuSWADnzAm47cgqrtoHJayFP30 hkkY1yrqdwgRfXFydGYsV62/RKGmVXW3c04r+52oQf5L40Cv1GFh4nBdlTMwdS1Sj9pa QZYg== X-Gm-Message-State: ACrzQf3OB+uN9vuMxGHo1QivHKLHye/+HnzK0kQKbddYD06/LWdS/gVa DWBL+TErVj9GAPnAhghcUGvcDNKCvjUHnoDD X-Received: by 2002:a05:600c:3213:b0:3c6:cab8:dac4 with SMTP id r19-20020a05600c321300b003c6cab8dac4mr19502211wmp.160.1666012099208; Mon, 17 Oct 2022 06:08:19 -0700 (PDT) Received: from ip-172-31-3-164.eu-west-1.compute.internal (ec2-54-154-193-154.eu-west-1.compute.amazonaws.com. [54.154.193.154]) by smtp.gmail.com with ESMTPSA id t18-20020a5d6a52000000b0022af865810esm8297237wrw.75.2022.10.17.06.08.18 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 17 Oct 2022 06:08:18 -0700 (PDT) From: Hubert Mazur To: ffmpeg-devel@ffmpeg.org Date: Mon, 17 Oct 2022 13:07:13 +0000 Message-Id: <20221017130715.30896-3-hum@semihalf.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20221017130715.30896-1-hum@semihalf.com> References: <20221017130715.30896-1-hum@semihalf.com> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 2/4] tests/sw_scale: Add test cases for input sizes 16 X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur , martin@martin.st, mw@semihalf.com, spop@amazon.com Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: ZMDC7YwW9+0f 

Previously the test cases only handled 8-bit input. Add support for 16-bit
input, which is used by the scaling routines hscale16To15 and hscale16To19.
Pass the SwsContext pointer to each function, as some of them make use of it.

Signed-off-by: Hubert Mazur
---
 tests/checkasm/sw_scale.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 3b8dd310ec..2e4b698f88 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -262,23 +262,31 @@ static void check_hscale(void)
 #define FILTER_SIZES 6
     static const int filter_sizes[FILTER_SIZES] = { 4, 8, 12, 16, 32, 40 };

-#define HSCALE_PAIRS 2
+#define HSCALE_PAIRS 4
     static const int hscale_pairs[HSCALE_PAIRS][2] = {
         { 8, 14 },
         { 8, 18 },
+        { 16, 14 },
+        { 16, 18 }
     };

+#define DST_WIDTH(x) ( (x) == (14) ? sizeof(int16_t) : sizeof(int32_t))
 #define LARGEST_INPUT_SIZE 512
 #define INPUT_SIZES 6
     static const int input_sizes[INPUT_SIZES] = {8, 24, 128, 144, 256, 512};

     int i, j, fsi, hpi, width, dstWi;
     struct SwsContext *ctx;
+    void *(*_dst)[2];
+    void *_src;

     // padded
     LOCAL_ALIGNED_32(uint8_t, src, [FFALIGN(SRC_PIXELS + MAX_FILTER_WIDTH - 1, 4)]);
-    LOCAL_ALIGNED_32(uint32_t, dst0, [SRC_PIXELS]);
-    LOCAL_ALIGNED_32(uint32_t, dst1, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(uint16_t, src1, [FFALIGN(SRC_PIXELS + MAX_FILTER_WIDTH - 1, 4)]);
+    LOCAL_ALIGNED_32(int16_t, dst_ref_16, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int16_t, dst_new_16, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int32_t, dst_ref_32, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int32_t, dst_new_32, [SRC_PIXELS]);

     // padded
     LOCAL_ALIGNED_32(int16_t, filter, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
@@ -286,6 +294,9 @@ static void check_hscale(void)
     LOCAL_ALIGNED_32(int16_t, filterAvx2, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
     LOCAL_ALIGNED_32(int32_t, filterPosAvx, [SRC_PIXELS]);

+    void *_dst_16[2] = {dst_ref_16, dst_new_16};
+    void *_dst_32[2] = {dst_ref_32, dst_new_32};
+
     // The dst parameter here is either int16_t or int32_t but we use void* to
     // just cover both cases.
     declare_func_emms(AV_CPU_FLAG_MMX, void, void *c, void *dst, int dstW,
@@ -297,6 +308,7 @@ static void check_hscale(void)
         fail();

     randomize_buffers(src, SRC_PIXELS + MAX_FILTER_WIDTH - 1);
+    randomize_buffers(src1, SRC_PIXELS + MAX_FILTER_WIDTH - 1);

     for (hpi = 0; hpi < HSCALE_PAIRS; hpi++) {
         for (fsi = 0; fsi < FILTER_SIZES; fsi++) {
@@ -306,6 +318,8 @@ static void check_hscale(void)
             ctx->srcBpc = hscale_pairs[hpi][0];
             ctx->dstBpc = hscale_pairs[hpi][1];
             ctx->hLumFilterSize = ctx->hChrFilterSize = width;
+            _src = ctx->srcBpc == 8 ? (void *)src : (void *)src1;
+            _dst = ctx->dstBpc == 14 ? (void*)_dst_16 : (void*)_dst_32;

             for (i = 0; i < SRC_PIXELS; i++) {
                 filterPos[i] = i;
@@ -343,14 +357,15 @@ static void check_hscale(void)
                     ff_shuffle_filter_coefficients(ctx, filterPosAvx, width, filterAvx2, ctx->dstW);

                 if (check_func(ctx->hcScale, "hscale_%d_to_%d__fs_%d_dstW_%d", ctx->srcBpc, ctx->dstBpc + 1, width, ctx->dstW)) {
-                    memset(dst0, 0, SRC_PIXELS * sizeof(dst0[0]));
-                    memset(dst1, 0, SRC_PIXELS * sizeof(dst1[0]));
+                    memset((*_dst)[0], 0, SRC_PIXELS * DST_WIDTH(ctx->dstBpc));
+                    memset((*_dst)[1], 0, SRC_PIXELS * DST_WIDTH(ctx->dstBpc));
+
+                    call_ref(ctx, (*_dst)[0], ctx->dstW, _src, filter, filterPos, width);
+                    call_new(ctx, (*_dst)[1], ctx->dstW, _src, filterAvx2, filterPosAvx, width);

-                    call_ref(NULL, dst0, ctx->dstW, src, filter, filterPos, width);
-                    call_new(NULL, dst1, ctx->dstW, src, filterAvx2, filterPosAvx, width);
-                    if (memcmp(dst0, dst1, ctx->dstW * sizeof(dst0[0])))
+                    if (memcmp((*_dst)[0], (*_dst)[1], ctx->dstW * DST_WIDTH(ctx->dstBpc)))
                         fail();
-                    bench_new(NULL, dst0, ctx->dstW, src, filter, filterPosAvx, width);
+                    bench_new(ctx, (*_dst)[1], ctx->dstW, _src, filter, filterPosAvx, width);
                 }
             }
         }
@@ -358,6 +373,8 @@ static void check_hscale(void)
     sws_freeContext(ctx);
 }

+#undef DST_WIDTH
+
 void checkasm_check_sw_scale(void)
 {
     check_hscale();

From patchwork Mon Oct 17 13:07:14 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hubert Mazur X-Patchwork-Id: 38764 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a20:4a86:b0:9d:28a3:170e with SMTP id fn6csp1584647pzb; Mon, 17 Oct 2022 06:08:56 -0700 (PDT) X-Google-Smtp-Source: AMsMyM4BqJMDhUyHli9bqoYXN3hWBxLnjfaaHG3I53kacETMb0Yy27oJvIKUGaCj2aARsziatwLy X-Received: by 2002:a05:6402:51d1:b0:45d:b498:169 with SMTP id r17-20020a05640251d100b0045db4980169mr1824091edd.119.1666012136177; Mon, 17 Oct 2022 06:08:56 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1666012136; cv=none; d=google.com; s=arc-20160816; b=rLvZCtGvaKoljshxyo6C3ifn/VxOtS+ZbwgQcg81UmqOf0wxemsOzQsDUJbYL+lpun smTUcmwlIJHYpNuuoWumWFC1DcYUFSRCA7V6HjLME9OktZamEzMS+0h+OTXUpQwQzTT3 hWX2meGGoLB7YHxBW6bVYEykGkyTeiGxPN22BjCIcEMkUzTwcO3hqp8079Mi8MOQhK0O PlaWiGqG6LGuKMhQpnGnK1X73Fj0E26pkkDwlSMx1otl3h6ColCzwelVfsBKjtVULZt3 EvCN6+gBizt2N5zhKhw/Qq7dBgnBG2NTLe3AVvInHhb5TY0Vsnp5M9WyYPyIDc5BJfNM 441A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:dkim-signature:delivered-to; bh=dXXBinwJKAbjgEs/AOioVPS5ixn76dPODYQ4BkyLUQI=; b=Puw/5bV9Agp6eVIxM2V27UHrYsCSFd3nXdT5CLFMt14f03A3f4AR6k0RG+rGHbrJFO jqBlbRrYkuVo/RJZgoPqiBbW2Ql6g0N86/+ok9ArXvg80p+OT2CDmepz7bAOnN72vOif XKsS0/5iE16P/zMiMfI2J9+JSmcdI9ThlJAB3tMqlpR23dRIlMChLdODoSWKZVP6ho7b W+EtvbEPG65Upu+jDglxTC6AikgEAAOOSsAysRANQEnfVAmHfCs5xkSatujgV0mQu00r 9gDsJG55S2DMMepm/+XUFFNfbCfgBF4dUOSQ5i9fKWSydkAhAsrwaX3RTisEckvIVUkb vJSA== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b="O4BVXvn/"; spf=pass (google.com: domain of
ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id b11-20020aa7cd0b000000b0045bd55b1240si7619071edw.313.2022.10.17.06.08.55; Mon, 17 Oct 2022 06:08:56 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b="O4BVXvn/"; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 9953A68BD15; Mon, 17 Oct 2022 16:08:29 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from mail-wr1-f53.google.com (mail-wr1-f53.google.com [209.85.221.53]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E7E8368BD05 for ; Mon, 17 Oct 2022 16:08:21 +0300 (EEST) Received: by mail-wr1-f53.google.com with SMTP id n12so18342377wrp.10 for ; Mon, 17 Oct 2022 06:08:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=semihalf.com; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=OIEfF3n82rk2G9hF2G+IaPWE/hUTe2lz6mjrXSgiLcg=; b=O4BVXvn/b05Rbs/UCG0DNgQlpTVk1Oc5m27EtOfhgMLw4U7S4O10Xv+zZ4pK2JuFxh bLia9r/mXZfeMtu5zOM1RkovOs0hjhNpTBadA269vUanfuVdI7HgfJpu6yc/uvDH4Eyo 3ajNdqgq4OfacDyKpDtkwZIdB+12sUGqx7td9KZ+7StrIRafcwrD3nsgFxtOd33Pw6iO 
Nzpq5Yb9uQ+3OQL58qSfr0XendYO4HfIq+N66VE1VBTOk7aRmHtMww8EMPGRVrIqQdeV BmZFloQ+bs1U6dt+PfBVy+Xvv2w0+nueET6QsHTybJc81IVXin1l6plQnGfBt3k7Okod 1+iw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=OIEfF3n82rk2G9hF2G+IaPWE/hUTe2lz6mjrXSgiLcg=; b=Nt2Ip5x2QVK47sczEzHucre63JUxagADzWdrZUGMyjMzS83fZZxXscXPKiaJDH3pL2 ShlcTsa/fFb1XFLdkNGjouUEPlgdMf/ZUCDqMmzk0+3cJtEYqXjeQh3qG6BkMCVuey2+ K5O6V4kEtEtHHm0gb2hiWz+f47IXlnRzD7na7nVlf8jn4aYjMbJGrDrHP1su6P3u4ISL aBI3wODuc/HUduB03RTXfHNPl/IgP78LrIFzQJhfsEowIZJbGXCLEHAdZzBpHxYywVYM 3QH151UInrCt5lQAuoB8bzkamNW093ptuL2pQbPiMkQ+UbXNQfyNRVv6C+6VWaJTN1AQ JSew== X-Gm-Message-State: ACrzQf2J9TPBJ3G2nkevzuhY9J2EGbJMISuiJDRZ/3YJBlflRrSdNgUI xCHAL35c1F2QIgVngCKfQ9FmDhbOJzpO91sr X-Received: by 2002:adf:fd04:0:b0:22e:4bf6:4a08 with SMTP id e4-20020adffd04000000b0022e4bf64a08mr6618042wrr.619.1666012100551; Mon, 17 Oct 2022 06:08:20 -0700 (PDT) Received: from ip-172-31-3-164.eu-west-1.compute.internal (ec2-54-154-193-154.eu-west-1.compute.amazonaws.com. 
[54.154.193.154]) by smtp.gmail.com with ESMTPSA id t18-20020a5d6a52000000b0022af865810esm8297237wrw.75.2022.10.17.06.08.19 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 17 Oct 2022 06:08:20 -0700 (PDT) From: Hubert Mazur To: ffmpeg-devel@ffmpeg.org Date: Mon, 17 Oct 2022 13:07:14 +0000 Message-Id: <20221017130715.30896-4-hum@semihalf.com> X-Mailer: git-send-email 2.37.1 In-Reply-To: <20221017130715.30896-1-hum@semihalf.com> References: <20221017130715.30896-1-hum@semihalf.com> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 3/4] sw_scale: Add specializations for hscale 16 to 15 X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur , martin@martin.st, mw@semihalf.com, spop@amazon.com Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: G7MFcgvSYmo7 Add arm64 neon implementations for hscale 16 to 15 with filter sizes 4, 8 and X4. The tests and benchmarks run on AWS Graviton 2 instances. The results from a checkasm tool are shown below. 
hscale_16_to_15__fs_4_dstW_512_c: 6703.5
hscale_16_to_15__fs_4_dstW_512_neon: 2298.0
hscale_16_to_15__fs_8_dstW_512_c: 10983.0
hscale_16_to_15__fs_8_dstW_512_neon: 3216.5
hscale_16_to_15__fs_12_dstW_512_c: 15526.0
hscale_16_to_15__fs_12_dstW_512_neon: 3993.0
hscale_16_to_15__fs_16_dstW_512_c: 20183.5
hscale_16_to_15__fs_16_dstW_512_neon: 5369.7
hscale_16_to_15__fs_32_dstW_512_c: 39315.2
hscale_16_to_15__fs_32_dstW_512_neon: 9511.2
hscale_16_to_15__fs_40_dstW_512_c: 48995.7
hscale_16_to_15__fs_40_dstW_512_neon: 11570.0

Signed-off-by: Hubert Mazur
---
 libswscale/aarch64/hscale.S  | 409 ++++++++++++++++++++++++++++++++++-
 libswscale/aarch64/swscale.c |  66 +++++-
 libswscale/swscale.c         |   3 +-
 3 files changed, 474 insertions(+), 4 deletions(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index 5e8cad9825..7d7e1c1f2e 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -635,5 +635,412 @@ function ff_hscale8to19_X4_neon, export=1
         add                 x4, x4, x7, lsl #2
         b.gt                1b
         ret
+endfunc
+
+function ff_hscale16to15_4_neon_asm, export=1
+        // w0   int shift
+        // x1   int16_t *dst
+        // w2   int dstW
+        // x3   const uint8_t *src // treat it as uint16_t *src
+        // x4   const int16_t *filter
+        // x5   const int32_t *filterPos
+        // w6   int filterSize
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #15
+        sub                 v18.4s, v18.4s, v17.4s      // max allowed value
+        dup                 v17.4s, w0                  // read shift
+        neg                 v17.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        cmp                 w2, #16
+        b.lt                2f // move to last block
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        // shift all filterPos left by one, as uint16_t will be read
+        lsl                 x8, x8, #1
+        lsl                 x9, x9, #1
+        lsl                 x10, x10, #1
+        lsl                 x11, x11, #1
+        lsl                 x12, x12, #1
+        lsl                 x13, x13, #1
+        lsl                 x14, x14, #1
+        lsl                 x15,
x15, #1 + + // load src with given offset + ldr x8, [x3, w8, UXTW] + ldr x9, [x3, w9, UXTW] + ldr x10, [x3, w10, UXTW] + ldr x11, [x3, w11, UXTW] + ldr x12, [x3, w12, UXTW] + ldr x13, [x3, w13, UXTW] + ldr x14, [x3, w14, UXTW] + ldr x15, [x3, w15, UXTW] + + sub sp, sp, #64 + // push src on stack so it can be loaded into vectors later + stp x8, x9, [sp] + stp x10, x11, [sp, #16] + stp x12, x13, [sp, #32] + stp x14, x15, [sp, #48] + +1: + ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] + + // Each of blocks does the following: + // Extend src and filter to 32 bits with uxtl and sxtl + // multiply or multiply and accumulate results + // Extending to 32 bits is necessary, as unit16_t values can't + // be represented as int16_t without type promotion. + uxtl v26.4s, v0.4h + sxtl v27.4s, v28.4H + uxtl2 v0.4s, v0.8h + mul v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v28.8H + uxtl v26.4s, v1.4h + mul v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v29.4H + uxtl2 v0.4s, v1.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v29.8H + uxtl v26.4s, v2.4h + mla v6.4s, v28.4s, v0.4s + + sxtl v27.4s, v30.4H + uxtl2 v0.4s, v2.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v30.8H + uxtl v26.4s, v3.4h + mla v6.4s, v28.4s, v0.4s + + sxtl v27.4s, v31.4H + uxtl2 v0.4s, v3.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v31.8H + sub w2, w2, #8 + mla v6.4s, v28.4s, v0.4s + + sshl v5.4s, v5.4s, v17.4s + sshl v6.4s, v6.4s, v17.4s + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + xtn v5.4h, v5.4s + xtn2 v5.8h, v6.4s + + st1 {v5.8h}, [x1], #16 + cmp w2, #16 + + // load filterPositions into registers for next iteration + ldp w8, w9, [x5] // filterPos[0], filterPos[1] + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] + add x5, x5, #32 + + lsl x8, x8, #1 + lsl x9, x9, #1 + lsl x10, x10, #1 + lsl x11, x11, #1 + lsl x12, x12, #1 + lsl x13, 
x13, #1 + lsl x14, x14, #1 + lsl x15, x15, #1 + + ldr x8, [x3, w8, UXTW] + ldr x9, [x3, w9, UXTW] + ldr x10, [x3, w10, UXTW] + ldr x11, [x3, w11, UXTW] + ldr x12, [x3, w12, UXTW] + ldr x13, [x3, w13, UXTW] + ldr x14, [x3, w14, UXTW] + ldr x15, [x3, w15, UXTW] + + stp x8, x9, [sp] + stp x10, x11, [sp, #16] + stp x12, x13, [sp, #32] + stp x14, x15, [sp, #48] -endfunc \ No newline at end of file + b.ge 1b + + // here we make last iteration, without updating the registers + ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 + + uxtl v26.4s, v0.4h + sxtl v27.4s, v28.4H + uxtl2 v0.4s, v0.8h + mul v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v28.8H + uxtl v26.4s, v1.4h + mul v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v29.4H + uxtl2 v0.4s, v1.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v29.8H + uxtl v26.4s, v2.4h + mla v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v30.4H + uxtl2 v0.4s, v2.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v30.8H + uxtl v26.4s, v3.4h + mla v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v31.4H + uxtl2 v0.4s, v3.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v31.8H + subs w2, w2, #8 + mla v6.4s, v0.4s, v28.4s + + sshl v5.4s, v5.4s, v17.4s + sshl v6.4s, v6.4s, v17.4s + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + xtn v5.4h, v5.4S + xtn2 v5.8h, v6.4s + + st1 {v5.8h}, [x1], #16 + add sp, sp, #64 // restore stack + cbnz w2, 2f + + ret + +2: + ldr w8, [x5], #4 // load filterPos + lsl w8, w8, #1 + add x9, x3, w8, UXTW // src + filterPos + ld1 {v0.4h}, [x9] // load 4 * uint16_t + ld1 {v31.4h}, [x4], #8 + + uxtl v0.4s, v0.4h + sxtl v31.4s, v31.4h + mul v5.4s, v0.4s, v31.4s + addv s0, v5.4S + sshl v0.4s, v0.4s, v17.4s + smin v0.4s, v0.4s, v18.4s + st1 {v0.h}[0], [x1], #2 + sub w2, w2, #1 + cbnz w2, 2b // if iterations remain jump to beginning + + ret +endfunc + +function ff_hscale16to15_X8_neon_asm, export=1 + // w0 int shift + // x1 int32_t *dst + // w2 int dstW + // x3 const uint8_t *src // treat it as uint16_t *src + // x4 const 
uint16_t *filter + // x5 const int32_t *filterPos + // w6 int filterSize + + movi v20.4s, #1 + movi v21.4s, #1 + shl v20.4s, v20.4s, #15 + sub v20.4s, v20.4s, v21.4s + dup v21.4s, w0 + neg v21.4s, v21.4s + + sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) +1: ldr w8, [x5], #4 // filterPos[idx] + lsl w8, w8, #1 + ldr w10, [x5], #4 // filterPos[idx + 1] + lsl w10, w10, #1 + ldr w11, [x5], #4 // filterPos[idx + 2] + lsl w11, w11, #1 + ldr w9, [x5], #4 // filterPos[idx + 3] + lsl w9, w9, #1 + mov x16, x4 // filter0 = filter + add x12, x16, x7 // filter1 = filter0 + filterSize*2 + add x13, x12, x7 // filter2 = filter1 + filterSize*2 + add x4, x13, x7 // filter3 = filter2 + filterSize*2 + movi v0.2D, #0 // val sum part 1 (for dst[0]) + movi v1.2D, #0 // val sum part 2 (for dst[1]) + movi v2.2D, #0 // val sum part 3 (for dst[2]) + movi v3.2D, #0 // val sum part 4 (for dst[3]) + add x17, x3, w8, UXTW // srcp + filterPos[0] + add x8, x3, w10, UXTW // srcp + filterPos[1] + add x10, x3, w11, UXTW // srcp + filterPos[2] + add x11, x3, w9, UXTW // srcp + filterPos[3] + mov w15, w6 // filterSize counter +2: ld1 {v4.8H}, [x17], #16 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1 + ld1 {v6.8H}, [x8], #16 // srcp[filterPos[1] + {0..7}] + ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize + uxtl v24.4s, v4.4H // extend srcp lower half to 32 bits to preserve sign + sxtl v25.4s, v5.4H // extend filter lower half to 32 bits to match srcp size + uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits + mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5 + sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits + uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits + mla v0.4S, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 + sxtl v27.4s, v7.4H // exted filter lower half + uxtl2 v6.4s, v6.8H // extend srcp upper half + sxtl2 v7.4s, v7.8h // extend filter upper half + ld1 {v16.8H}, 
[x10], #16 // srcp[filterPos[2] + {0..7}] + mla v1.4S, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v22.4s, v16.4H // extend srcp lower half + sxtl v23.4s, v17.4H // extend filter lower half + uxtl2 v16.4s, v16.8H // extend srcp upper half + sxtl2 v17.4s, v17.8h // extend filter upper half + mla v2.4S, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + mla v2.4S, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + ld1 {v18.8H}, [x11], #16 // srcp[filterPos[3] + {0..7}] + mla v1.4S, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize + subs w15, w15, #8 // j -= 8: processed 8/filterSize + uxtl v28.4s, v18.4H // extend srcp lower half + sxtl v29.4s, v19.4H // extend filter lower half + uxtl2 v18.4s, v18.8H // extend srcp upper half + sxtl2 v19.4s, v19.8h // extend filter upper half + mla v3.4S, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] + mla v3.4S, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + b.gt 2b // inner loop if filterSize not consumed completely + addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding + addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding + addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding + subs w2, w2, #4 // dstW -= 4 + sshl v0.4s, v0.4s, v21.4s // shift right (effectively rigth, as shift is negative); overflow expected + smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl) + xtn v0.4h, v0.4s // narrow down to 16 bits + + st1 {v0.4H}, [x1], #8 // write to destination part0123 + b.gt 1b // loop until end of line + ret +endfunc + +function ff_hscale16to15_X4_neon_asm, export=1 + // w0 int shift + // x1 int16_t *dst + // w2 int dstW + // x3 const uint8_t *src + // x4 const int16_t *filter + // x5 const 
int32_t *filterPos + // w6 int filterSize + + stp d8, d9, [sp, #-0x20]! + stp d10, d11, [sp, #0x10] + + movi v18.4s, #1 + movi v17.4s, #1 + shl v18.4s, v18.4s, #15 + sub v21.4s, v18.4s, v17.4s // max allowed value + dup v17.4s, w0 // read shift + neg v20.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) + + lsl w7, w6, #1 +1: + ldp w8, w9, [x5] + ldp w10, w11, [x5, #8] + + movi v16.2d, #0 // initialize accumulator for idx + 0 + movi v17.2d, #0 // initialize accumulator for idx + 1 + movi v18.2d, #0 // initialize accumulator for idx + 2 + movi v19.2d, #0 // initialize accumulator for idx + 3 + + mov x12, x4 // filter + 0 + add x13, x4, x7 // filter + 1 + add x8, x3, x8, lsl #1 // srcp + filterPos 0 + add x14, x13, x7 // filter + 2 + add x9, x3, x9, lsl #1 // srcp + filterPos 1 + add x15, x14, x7 // filter + 3 + add x10, x3, x10, lsl #1 // srcp + filterPos 2 + mov w0, w6 // save the filterSize to temporary variable + add x11, x3, x11, lsl #1 // srcp + filterPos 3 + add x5, x5, #16 // advance filter position + mov x16, xzr // clear the register x16 used for offsetting the filter values + +2: + ldr q4, [x8], #16 // load src values for idx 0 + ldr q5, [x9], #16 // load src values for idx 1 + uxtl v26.4s, v4.4h + uxtl2 v4.4s, v4.8h + ldr q31, [x12, x16] // load filter values for idx 0 + ldr q6, [x10], #16 // load src values for idx 2 + sxtl v22.4s, v31.4h + sxtl2 v31.4s, v31.8h + mla v16.4s, v26.4s, v22.4s // multiplication of lower half for idx 0 + uxtl v25.4s, v5.4h + uxtl2 v5.4s, v5.8h + ldr q30, [x13, x16] // load filter values for idx 1 + ldr q7, [x11], #16 // load src values for idx 3 + mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 + uxtl v24.4s, v6.4h + sxtl v8.4s, v30.4h + sxtl2 v30.4s, v30.8h + mla v17.4s, v25.4s, v8.4s // multiplication of lower half for idx 1 + ldr q29, [x14, x16] // load filter values for idx 2 + uxtl2 v6.4s, v6.8h + sxtl v9.4s, v29.4h + sxtl2 v29.4s, v29.8h + mla v17.4s, v5.4s, v30.4s // 
multiplication of upper half for idx 1 + mla v18.4s, v24.4s, v9.4s // multiplication of lower half for idx 2 + ldr q28, [x15, x16] // load filter values for idx 3 + uxtl v23.4s, v7.4h + sxtl v10.4s, v28.4h + mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 + uxtl2 v7.4s, v7.8h + sxtl2 v28.4s, v28.8h + mla v19.4s, v23.4s, v10.4s // multiplication of lower half for idx 3 + sub w0, w0, #8 + cmp w0, #8 + mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 + + add x16, x16, #16 // advance filter values indexing + + b.ge 2b + + // 4 iterations left + + sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements + + ldr d4, [x8] // load src values for idx 0 + ldr d31, [x12, x17] // load filter values for idx 0 + uxtl v4.4s, v4.4h + sxtl v31.4s, v31.4h + ldr d5, [x9] // load src values for idx 1 + mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 + ldr d30, [x13, x17] // load filter values for idx 1 + uxtl v5.4s, v5.4h + sxtl v30.4s, v30.4h + ldr d6, [x10] // load src values for idx 2 + mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 + ldr d29, [x14, x17] // load filter values for idx 2 + uxtl v6.4s, v6.4h + sxtl v29.4s, v29.4h + ldr d7, [x11] // load src values for idx 3 + ldr d28, [x15, x17] // load filter values for idx 3 + mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 + uxtl v7.4s, v7.4h + sxtl v28.4s, v28.4h + addp v16.4s, v16.4s, v17.4s + mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 + subs w2, w2, #4 + addp v18.4s, v18.4s, v19.4s + addp v16.4s, v16.4s, v18.4s + sshl v16.4s, v16.4s, v20.4s + smin v16.4s, v16.4s, v21.4s + xtn v16.4h, v16.4s + + st1 {v16.4h}, [x1], #8 + add x4, x4, x7, lsl #2 + b.gt 1b + + ldp d8, d9, [sp] + ldp d10, d11, [sp, #0x10] + + add sp, sp, #0x20 + + ret +endfunc diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c index 479fe129d0..993cdd67dd 100644 --- a/libswscale/aarch64/swscale.c +++ 
b/libswscale/aarch64/swscale.c @@ -22,6 +22,18 @@ #include "libswscale/swscale_internal.h" #include "libavutil/aarch64/cpu.h" +void ff_hscale16to15_4_neon_asm(int shift, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize); + +void ff_hscale16to15_X8_neon_asm(int shift, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize); + +void ff_hscale16to15_X4_neon_asm(int shift, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize); + #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \ SwsContext *c, int16_t *data, \ @@ -30,7 +42,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \ const int32_t *filterPos, int filterSize) #define SCALE_FUNCS(filter_n, opt) \ SCALE_FUNC(filter_n, 8, 15, opt); \ - SCALE_FUNC(filter_n, 8, 19, opt); + SCALE_FUNC(filter_n, 8, 19, opt); \ + SCALE_FUNC(filter_n, 16, 15, opt); #define ALL_SCALE_FUNCS(opt) \ SCALE_FUNCS(4, opt); \ SCALE_FUNCS(X8, opt); \ @@ -56,6 +69,10 @@ void ff_yuv2plane1_8_neon( } else \ hscalefn = \ ff_hscale8to19_ ## filtersize ## _ ## opt; \ + } else { \ + if (c->dstBpc <= 14) \ + hscalefn = \ + ff_hscale16to15_ ## filtersize ## _ ## opt; \ } \ } while (0) @@ -87,3 +104,50 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c) } } } + +void ff_hscale16to15_4_neon(SwsContext *c, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize) +{ + const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat); + int sh = desc->comp[0].depth - 1; + + if (sh<15) { + sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 
13 : (desc->comp[0].depth - 1); + } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input are process like uint 16bpc */ + sh = 16 - 1; + } + ff_hscale16to15_4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize); + +} + +void ff_hscale16to15_X8_neon(SwsContext *c, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize) +{ + const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat); + int sh = desc->comp[0].depth - 1; + + if (sh<15) { + sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 13 : (desc->comp[0].depth - 1); + } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input are process like uint 16bpc */ + sh = 16 - 1; + } + ff_hscale16to15_X8_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize); + +} + +void ff_hscale16to15_X4_neon(SwsContext *c, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize) +{ + const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat); + int sh = desc->comp[0].depth - 1; + + if (sh<15) { + sh = isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8 ? 
13 : (desc->comp[0].depth - 1);
+    } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float input are process like uint 16bpc */
+        sh = 16 - 1;
+    }
+    ff_hscale16to15_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize);
+}
\ No newline at end of file
diff --git a/libswscale/swscale.c b/libswscale/swscale.c
index 367d045a02..5afd5eba83 100644
--- a/libswscale/swscale.c
+++ b/libswscale/swscale.c
@@ -109,11 +109,10 @@ static void hScale16To15_c(SwsContext *c, int16_t *dst, int dstW,
         int j;
         int srcPos = filterPos[i];
         int val = 0;
-
         for (j = 0; j < filterSize; j++) {
             val += src[srcPos + j] * filter[filterSize * i + j];
         }
-        // filter=14 bit, input=16 bit, output=30 bit, >> 15 makes 15 bit
+        //filter=14 bit, input=16 bit, output=30 bit, >> 15 makes 15 bit
         dst[i] = FFMIN(val >> sh, (1 << 15) - 1);
     }
 }

From patchwork Mon Oct 17 13:07:15 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hubert Mazur
X-Patchwork-Id: 38765
From: Hubert Mazur
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 17 Oct 2022 13:07:15 +0000
Message-Id: <20221017130715.30896-5-hum@semihalf.com>
In-Reply-To: <20221017130715.30896-1-hum@semihalf.com>
References: <20221017130715.30896-1-hum@semihalf.com>
Subject: [FFmpeg-devel] [PATCH 4/4] sw_scale: Add specializations for hscale 16 to 19
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur, martin@martin.st, mw@semihalf.com, spop@amazon.com

Provide
arm64 neon optimized implementations for hscale16To19 with filter sizes 4, X8 and X4.

The tests and benchmarks were run on AWS Graviton 2 instances. The results from the checkasm tool are shown below.

hscale_16_to_19__fs_4_dstW_512_c: 6216.0
hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
hscale_16_to_19__fs_8_dstW_512_c: 10417.7
hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
hscale_16_to_19__fs_12_dstW_512_c: 14890.5
hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
hscale_16_to_19__fs_16_dstW_512_c: 19006.5
hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
hscale_16_to_19__fs_32_dstW_512_c: 36629.5
hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
hscale_16_to_19__fs_40_dstW_512_c: 45477.5
hscale_16_to_19__fs_40_dstW_512_neon: 11552.0

Signed-off-by: Hubert Mazur
---
 libswscale/aarch64/hscale.S  | 402 +++++++++++++++++++++++++++++++++++
 libswscale/aarch64/swscale.c |  70 +++++-
 2 files changed, 471 insertions(+), 1 deletion(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index 7d7e1c1f2e..dfc635d1b9 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -1044,3 +1044,405 @@ function ff_hscale16to15_X4_neon_asm, export=1

         ret
 endfunc
+
+function ff_hscale16to19_4_neon_asm, export=1
+        // w0   int shift
+        // x1   int32_t *dst
+        // w2   int dstW
+        // x3   const uint8_t *src // treat it as uint16_t *src
+        // x4   const int16_t *filter
+        // x5   const int32_t *filterPos
+        // w6   int filterSize
+
+        movi                v18.4s, #1
+        movi                v17.4s, #1
+        shl                 v18.4s, v18.4s, #19
+        sub                 v18.4s, v18.4s, v17.4s      // max allowed value
+        dup                 v17.4s, w0                  // read shift
+        neg                 v17.4s, v17.4s              // negate it, so it can be used in sshl (effectively shift right)
+
+        cmp                 w2, #16
+        b.lt                2f // move to last block
+
+        ldp                 w8, w9, [x5]                // filterPos[0], filterPos[1]
+        ldp                 w10, w11, [x5, #8]          // filterPos[2], filterPos[3]
+        ldp                 w12, w13, [x5, #16]         // filterPos[4], filterPos[5]
+        ldp                 w14, w15, [x5, #24]         // filterPos[6], filterPos[7]
+        add                 x5, x5, #32
+
+        // shift all filterPos left by one, as uint16_t will be read
+
lsl x8, x8, #1 + lsl x9, x9, #1 + lsl x10, x10, #1 + lsl x11, x11, #1 + lsl x12, x12, #1 + lsl x13, x13, #1 + lsl x14, x14, #1 + lsl x15, x15, #1 + + // load src with given offset + ldr x8, [x3, w8, UXTW] + ldr x9, [x3, w9, UXTW] + ldr x10, [x3, w10, UXTW] + ldr x11, [x3, w11, UXTW] + ldr x12, [x3, w12, UXTW] + ldr x13, [x3, w13, UXTW] + ldr x14, [x3, w14, UXTW] + ldr x15, [x3, w15, UXTW] + + sub sp, sp, #64 + // push src on stack so it can be loaded into vectors later + stp x8, x9, [sp] + stp x10, x11, [sp, #16] + stp x12, x13, [sp, #32] + stp x14, x15, [sp, #48] + +1: + ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] + + // Each of blocks does the following: + // Extend src and filter to 32 bits with uxtl and sxtl + // multiply or multiply and accumulate results + // Extending to 32 bits is necessary, as unit16_t values can't + // be represented as int16_t without type promotion. + uxtl v26.4s, v0.4h + sxtl v27.4s, v28.4H + uxtl2 v0.4s, v0.8h + mul v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v28.8H + uxtl v26.4s, v1.4h + mul v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v29.4H + uxtl2 v0.4s, v1.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v29.8H + uxtl v26.4s, v2.4h + mla v6.4s, v28.4s, v0.4s + + sxtl v27.4s, v30.4H + uxtl2 v0.4s, v2.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v30.8H + uxtl v26.4s, v3.4h + mla v6.4s, v28.4s, v0.4s + + sxtl v27.4s, v31.4H + uxtl2 v0.4s, v3.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v31.8H + sub w2, w2, #8 + mla v6.4s, v28.4s, v0.4s + + sshl v5.4s, v5.4s, v17.4s + sshl v6.4s, v6.4s, v17.4s + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + + st1 {v5.4s, v6.4s}, [x1], #32 + cmp w2, #16 + + // load filterPositions into registers for next iteration + ldp w8, w9, [x5] // filterPos[0], filterPos[1] + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] + add x5, x5, 
#32 + + lsl x8, x8, #1 + lsl x9, x9, #1 + lsl x10, x10, #1 + lsl x11, x11, #1 + lsl x12, x12, #1 + lsl x13, x13, #1 + lsl x14, x14, #1 + lsl x15, x15, #1 + + ldr x8, [x3, w8, UXTW] + ldr x9, [x3, w9, UXTW] + ldr x10, [x3, w10, UXTW] + ldr x11, [x3, w11, UXTW] + ldr x12, [x3, w12, UXTW] + ldr x13, [x3, w13, UXTW] + ldr x14, [x3, w14, UXTW] + ldr x15, [x3, w15, UXTW] + + stp x8, x9, [sp] + stp x10, x11, [sp, #16] + stp x12, x13, [sp, #32] + stp x14, x15, [sp, #48] + + b.ge 1b + + // here we make last iteration, without updating the registers + ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 + + uxtl v26.4s, v0.4h + sxtl v27.4s, v28.4H + uxtl2 v0.4s, v0.8h + mul v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v28.8H + uxtl v26.4s, v1.4h + mul v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v29.4H + uxtl2 v0.4s, v1.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v29.8H + uxtl v26.4s, v2.4h + mla v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v30.4H + uxtl2 v0.4s, v2.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v30.8H + uxtl v26.4s, v3.4h + mla v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v31.4H + uxtl2 v0.4s, v3.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v31.8H + subs w2, w2, #8 + mla v6.4s, v0.4s, v28.4s + + sshl v5.4s, v5.4s, v17.4s + sshl v6.4s, v6.4s, v17.4s + + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + + st1 {v5.4s, v6.4s}, [x1], #32 + add sp, sp, #64 // restore stack + cbnz w2, 2f + + ret + +2: + ldr w8, [x5], #4 // load filterPos + lsl w8, w8, #1 + add x9, x3, w8, UXTW // src + filterPos + ld1 {v0.4h}, [x9] // load 4 * uint16_t + ld1 {v31.4h}, [x4], #8 + + uxtl v0.4s, v0.4h + sxtl v31.4s, v31.4h + subs w2, w2, #1 + mul v5.4s, v0.4s, v31.4s + addv s0, v5.4S + sshl v0.4s, v0.4s, v17.4s + smin v0.4s, v0.4s, v18.4s + st1 {v0.s}[0], [x1], #4 + cbnz w2, 2b // if iterations remain jump to beginning + + ret +endfunc + +function ff_hscale16to19_X8_neon_asm, export=1 + // w0 int shift + // x1 int32_t *dst + // w2 int dstW + // x3 const uint8_t *src 
// treat it as uint16_t *src + // x4 const int16_t *filter + // x5 const int32_t *filterPos + // w6 int filterSize + + movi v20.4s, #1 + movi v21.4s, #1 + shl v20.4s, v20.4s, #19 + sub v20.4s, v20.4s, v21.4s + dup v21.4s, w0 + neg v21.4s, v21.4s + + sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) +1: ldr w8, [x5], #4 // filterPos[idx] + ldr w10, [x5], #4 // filterPos[idx + 1] + lsl w8, w8, #1 + ldr w11, [x5], #4 // filterPos[idx + 2] + ldr w9, [x5], #4 // filterPos[idx + 3] + mov x16, x4 // filter0 = filter + lsl w11, w11, #1 + add x12, x16, x7 // filter1 = filter0 + filterSize*2 + lsl w9, w9, #1 + add x13, x12, x7 // filter2 = filter1 + filterSize*2 + lsl w10, w10, #1 + add x4, x13, x7 // filter3 = filter2 + filterSize*2 + movi v0.2D, #0 // val sum part 1 (for dst[0]) + movi v1.2D, #0 // val sum part 2 (for dst[1]) + movi v2.2D, #0 // val sum part 3 (for dst[2]) + movi v3.2D, #0 // val sum part 4 (for dst[3]) + add x17, x3, w8, UXTW // srcp + filterPos[0] + add x8, x3, w10, UXTW // srcp + filterPos[1] + add x10, x3, w11, UXTW // srcp + filterPos[2] + add x11, x3, w9, UXTW // srcp + filterPos[3] + mov w15, w6 // filterSize counter +2: ld1 {v4.8H}, [x17], #16 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1 + ld1 {v6.8H}, [x8], #16 // srcp[filterPos[1] + {0..7}] + ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize + uxtl v24.4s, v4.4H // zero-extend srcp lower half to 32 bits (src is unsigned) + sxtl v25.4s, v5.4H // extend filter lower half to 32 bits to match srcp size + uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits + mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5 + sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits + uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits + mla v0.4S, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 + sxtl v27.4s, v7.4H // extend filter lower half + uxtl2 v6.4s, v6.8H // extend srcp upper half + sxtl2 v7.4s, v7.8h //
extend filter upper half + ld1 {v16.8H}, [x10], #16 // srcp[filterPos[2] + {0..7}] + mla v1.4S, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v22.4s, v16.4H // extend srcp lower half + sxtl v23.4s, v17.4H // extend filter lower half + uxtl2 v16.4s, v16.8H // extend srcp upper half + sxtl2 v17.4s, v17.8h // extend filter upper half + mla v2.4S, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + mla v2.4S, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + ld1 {v18.8H}, [x11], #16 // srcp[filterPos[3] + {0..7}] + mla v1.4S, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize + subs w15, w15, #8 // j -= 8: processed 8/filterSize + uxtl v28.4s, v18.4H // extend srcp lower half + sxtl v29.4s, v19.4H // extend filter lower half + uxtl2 v18.4s, v18.8H // extend srcp upper half + sxtl2 v19.4s, v19.8h // extend filter upper half + mla v3.4S, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] + mla v3.4S, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + b.gt 2b // inner loop if filterSize not consumed completely + addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding + addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding + addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding + subs w2, w2, #4 // dstW -= 4 + sshl v0.4s, v0.4s, v21.4s // shift right (effectively right, as shift is negative); overflow expected + smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl) + st1 {v0.4s}, [x1], #16 // write to destination part0123 + b.gt 1b // loop until end of line + ret +endfunc + +function ff_hscale16to19_X4_neon_asm, export=1 + // w0 int shift + // x1 int16_t *dst + // w2 int dstW + // x3 const uint8_t *src + // x4 const int16_t *filter + // x5 const
int32_t *filterPos + // w6 int filterSize + + stp d8, d9, [sp, #-0x20]! + stp d10, d11, [sp, #0x10] + + movi v18.4s, #1 + movi v17.4s, #1 + shl v18.4s, v18.4s, #19 + sub v21.4s, v18.4s, v17.4s // max allowed value + dup v17.4s, w0 // read shift + neg v20.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) + + lsl w7, w6, #1 +1: + ldp w8, w9, [x5] + ldp w10, w11, [x5, #8] + + movi v16.2d, #0 // initialize accumulator for idx + 0 + movi v17.2d, #0 // initialize accumulator for idx + 1 + movi v18.2d, #0 // initialize accumulator for idx + 2 + movi v19.2d, #0 // initialize accumulator for idx + 3 + + mov x12, x4 // filter + 0 + add x13, x4, x7 // filter + 1 + add x8, x3, x8, lsl #1 // srcp + filterPos 0 + add x14, x13, x7 // filter + 2 + add x9, x3, x9, lsl #1 // srcp + filterPos 1 + add x15, x14, x7 // filter + 3 + add x10, x3, x10, lsl #1 // srcp + filterPos 2 + mov w0, w6 // save the filterSize to temporary variable + add x11, x3, x11, lsl #1 // srcp + filterPos 3 + add x5, x5, #16 // advance filter position + mov x16, xzr // clear the register x16 used for offsetting the filter values + +2: + ldr q4, [x8], #16 // load src values for idx 0 + ldr q5, [x9], #16 // load src values for idx 1 + uxtl v26.4s, v4.4h + uxtl2 v4.4s, v4.8h + ldr q31, [x12, x16] // load filter values for idx 0 + ldr q6, [x10], #16 // load src values for idx 2 + sxtl v22.4s, v31.4h + sxtl2 v31.4s, v31.8h + mla v16.4s, v26.4s, v22.4s // multiplication of lower half for idx 0 + uxtl v25.4s, v5.4h + uxtl2 v5.4s, v5.8h + ldr q30, [x13, x16] // load filter values for idx 1 + ldr q7, [x11], #16 // load src values for idx 3 + mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 + uxtl v24.4s, v6.4h + sxtl v8.4s, v30.4h + sxtl2 v30.4s, v30.8h + mla v17.4s, v25.4s, v8.4s // multiplication of lower half for idx 1 + ldr q29, [x14, x16] // load filter values for idx 2 + uxtl2 v6.4s, v6.8h + sxtl v9.4s, v29.4h + sxtl2 v29.4s, v29.8h + mla v17.4s, v5.4s, v30.4s // 
multiplication of upper half for idx 1 + ldr q28, [x15, x16] // load filter values for idx 3 + mla v18.4s, v24.4s, v9.4s // multiplication of lower half for idx 2 + uxtl v23.4s, v7.4h + sxtl v10.4s, v28.4h + mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 + uxtl2 v7.4s, v7.8h + sxtl2 v28.4s, v28.8h + mla v19.4s, v23.4s, v10.4s // multiplication of lower half for idx 3 + sub w0, w0, #8 + cmp w0, #8 + mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 + + add x16, x16, #16 // advance filter values indexing + + b.ge 2b + + // 4 iterations left + + sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements + + ldr d4, [x8] // load src values for idx 0 + ldr d31, [x12, x17] // load filter values for idx 0 + uxtl v4.4s, v4.4h + sxtl v31.4s, v31.4h + ldr d5, [x9] // load src values for idx 1 + mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 + ldr d30, [x13, x17] // load filter values for idx 1 + uxtl v5.4s, v5.4h + sxtl v30.4s, v30.4h + ldr d6, [x10] // load src values for idx 2 + mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 + ldr d29, [x14, x17] // load filter values for idx 2 + uxtl v6.4s, v6.4h + sxtl v29.4s, v29.4h + ldr d7, [x11] // load src values for idx 3 + ldr d28, [x15, x17] // load filter values for idx 3 + mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 + uxtl v7.4s, v7.4h + sxtl v28.4s, v28.4h + addp v16.4s, v16.4s, v17.4s + mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 + subs w2, w2, #4 + addp v18.4s, v18.4s, v19.4s + addp v16.4s, v16.4s, v18.4s + sshl v16.4s, v16.4s, v20.4s + smin v16.4s, v16.4s, v21.4s + + st1 {v16.4s}, [x1], #16 + add x4, x4, x7, lsl #2 + b.gt 1b + + ldp d8, d9, [sp] + ldp d10, d11, [sp, #0x10] + + add sp, sp, #0x20 + + ret +endfunc \ No newline at end of file diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c index 993cdd67dd..ef6029e068 100644 --- a/libswscale/aarch64/swscale.c 
+++ b/libswscale/aarch64/swscale.c @@ -34,6 +34,16 @@ void ff_hscale16to15_X4_neon_asm(int shift, int16_t *_dst, int dstW, const uint8_t *_src, const int16_t *filter, const int32_t *filterPos, int filterSize); +void ff_hscale16to19_4_neon_asm(int shift, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize); +void ff_hscale16to19_X8_neon_asm(int shift, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize); +void ff_hscale16to19_X4_neon_asm(int shift, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize); + #define SCALE_FUNC(filter_n, from_bpc, to_bpc, opt) \ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \ SwsContext *c, int16_t *data, \ @@ -43,7 +53,8 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \ #define SCALE_FUNCS(filter_n, opt) \ SCALE_FUNC(filter_n, 8, 15, opt); \ SCALE_FUNC(filter_n, 8, 19, opt); \ - SCALE_FUNC(filter_n, 16, 15, opt); + SCALE_FUNC(filter_n, 16, 15, opt); \ + SCALE_FUNC(filter_n, 16, 19, opt); #define ALL_SCALE_FUNCS(opt) \ SCALE_FUNCS(4, opt); \ SCALE_FUNCS(X8, opt); \ @@ -73,6 +84,9 @@ void ff_yuv2plane1_8_neon( if (c->dstBpc <= 14) \ hscalefn = \ ff_hscale16to15_ ## filtersize ## _ ## opt; \ + else \ + hscalefn = \ + ff_hscale16to19_ ## filtersize ## _ ## opt; \ } \ } while (0) @@ -150,4 +164,58 @@ void ff_hscale16to15_X4_neon(SwsContext *c, int16_t *_dst, int dstW, sh = 16 - 1; } ff_hscale16to15_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize); +} + +void ff_hscale16to19_4_neon(SwsContext *c, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize) +{ + const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat); + int bits = desc->comp[0].depth - 1; + int sh = bits - 4; + + if ((isAnyRGB(c->srcFormat) || 
c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) { + sh = 9; + } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float inputs are processed like uint 16bpc */ + sh = 16 - 1 - 4; + } + + ff_hscale16to19_4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize); + +} + +void ff_hscale16to19_X8_neon(SwsContext *c, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize) +{ + const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat); + int bits = desc->comp[0].depth - 1; + int sh = bits - 4; + + if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) { + sh = 9; + } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float inputs are processed like uint 16bpc */ + sh = 16 - 1 - 4; + } + + ff_hscale16to19_X8_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize); + +} + +void ff_hscale16to19_X4_neon(SwsContext *c, int16_t *_dst, int dstW, + const uint8_t *_src, const int16_t *filter, + const int32_t *filterPos, int filterSize) +{ + const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(c->srcFormat); + int bits = desc->comp[0].depth - 1; + int sh = bits - 4; + + if ((isAnyRGB(c->srcFormat) || c->srcFormat==AV_PIX_FMT_PAL8) && desc->comp[0].depth<16) { + sh = 9; + } else if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) { /* float inputs are processed like uint 16bpc */ + sh = 16 - 1 - 4; + } + + ff_hscale16to19_X4_neon_asm(sh, _dst, dstW, _src, filter, filterPos, filterSize); + } \ No newline at end of file
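
For review purposes, here is a rough scalar model of what the three NEON routines above appear to compute. It is a sketch, not part of the patch. The names `hscale16to19_ref` and `hscale16to19_shift` are hypothetical, and the pixel-format checks are reduced to plain flags standing in for `isAnyRGB()`, `AV_PIX_FMT_PAL8` and `AV_PIX_FMT_FLAG_FLOAT`. The accumulator is unsigned on the assumption that the 32-bit `mla` lanes wrap on overflow, as the "overflow expected" comment suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the shift selection in the C wrappers. */
static int hscale16to19_shift(int depth, int rgb_or_pal8, int is_float)
{
    int sh = (depth - 1) - 4;          /* bits - 4 */

    if (rgb_or_pal8 && depth < 16)
        sh = 9;
    else if (is_float)                 /* handled like uint 16bpc */
        sh = 16 - 1 - 4;
    return sh;
}

/* Hypothetical scalar model of the asm: accumulate, shift right, clamp. */
static void hscale16to19_ref(int shift, int32_t *dst, int dst_w,
                             const uint16_t *src, const int16_t *filter,
                             const int32_t *filter_pos, int filter_size)
{
    const int32_t max19 = (1 << 19) - 1;   /* same constant as shl #19, sub 1 */

    assert(filter_size > 0);
    for (int i = 0; i < dst_w; i++) {
        const uint16_t *s = src + filter_pos[i];
        const int16_t  *f = filter + (int64_t)i * filter_size;
        uint32_t acc = 0;                  /* wraps modulo 2^32 */

        for (int j = 0; j < filter_size; j++)
            acc += (uint32_t)s[j] * (uint32_t)(int32_t)f[j];

        /* Assumes two's-complement conversion and an arithmetic right
         * shift, which is what sshl with a negated shift performs. */
        int32_t val = (int32_t)acc >> shift;
        dst[i] = val < max19 ? val : max19; /* smin against 2^19 - 1 */
    }
}
```

Comparing the output of such a model against each asm variant for filter sizes 4, 8 and 12 would be one way to exercise the `_4`, `_X8` and `_X4` paths respectively.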