From patchwork Sat Aug 13 20:56:06 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Swinney, Jonathan"
X-Patchwork-Id: 37260
From: "Swinney, Jonathan"
To: "ffmpeg-devel@ffmpeg.org"
Cc: Martin Storsjö, Hubert Mazur
Date: Sat, 13 Aug 2022 20:56:06 +0000
Message-ID: <70629b7632564b30a44c71bf6a903b26@amazon.com>
Subject: [FFmpeg-devel] [PATCH v3 3/3] swscale/aarch64: add vscale specializations

This commit adds new code paths for vscale when
filterSize is 2, 4, or 8. By using specialized code with unrolling to match the
filterSize, we can improve performance.

On AWS c7g (Graviton 3, Neoverse V1) instances:
                                  before    after
yuv2yuvX_2_0_512_accurate_neon:    558.8    268.9
yuv2yuvX_4_0_512_accurate_neon:    637.5    434.9
yuv2yuvX_8_0_512_accurate_neon:   1144.8    806.2
yuv2yuvX_16_0_512_accurate_neon:  2080.5   1853.7

Signed-off-by: Jonathan Swinney
---
 libswscale/aarch64/output.S  | 177 +++++++++++++++++++++++++++++++++++
 libswscale/aarch64/swscale.c |  12 +++
 2 files changed, 189 insertions(+)

diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index 991750cf31..b8a2818c9b 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -21,13 +21,33 @@
 #include "libavutil/aarch64/asm.S"
 
 function ff_yuv2planeX_8_neon, export=1
+// x0 - const int16_t *filter,
+// x1 - int filterSize,
+// x2 - const int16_t **src,
+// x3 - uint8_t *dest,
+// w4 - int dstW,
+// x5 - const uint8_t *dither,
+// w6 - int offset
+
         ld1                 {v0.8B}, [x5]                   // load 8x8-bit dither
+        and                 w6, w6, #7
         cbz                 w6, 1f                          // check if offsetting present
         ext                 v0.8B, v0.8B, v0.8B, #3         // honor offsetting which can be 0 or 3 only
 1:      uxtl                v0.8H, v0.8B                    // extend dither to 16-bit
         ushll               v1.4S, v0.4H, #12               // extend dither to 32-bit with left shift by 12 (part 1)
         ushll2              v2.4S, v0.8H, #12               // extend dither to 32-bit with left shift by 12 (part 2)
+        cmp                 w1, #8                          // if filterSize == 8, branch to specialized version
+        b.eq                6f
+        cmp                 w1, #4                          // if filterSize == 4, branch to specialized version
+        b.eq                8f
+        cmp                 w1, #2                          // if filterSize == 2, branch to specialized version
+        b.eq                10f
+
+// The filter size does not match one of the specialized implementations. It is either even or odd. If it is even,
+// then use the first section below.
         mov                 x7, #0                          // i = 0
+        tbnz                w1, #0, 4f                      // if filterSize % 2 != 0, branch to the odd-size version
+// fs % 2 == 0
 2:      mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
         mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
         mov                 w8, w1                          // tmpfilterSize = filterSize
@@ -54,4 +74,161 @@ function ff_yuv2planeX_8_neon, export=1
         add                 x7, x7, #8                      // i += 8
         b.gt                2b                              // loop until width consumed
         ret
+
+// If filter size is odd (most likely == 1), then use this section.
+// fs % 2 != 0
+4:      mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+        mov                 w8, w1                          // tmpfilterSize = filterSize
+        mov                 x9, x2                          // srcp    = src
+        mov                 x10, x0                         // filterp = filter
+5:      ldr                 x11, [x9], #8                   // get 1 pointer: src[j]
+        ldr                 h6, [x10], #2                   // read 1 16-bit coeff X at filter[j]
+        add                 x11, x11, x7, lsl #1            // &src[j  ][i]
+        ld1                 {v5.8H}, [x11]                  // read 8x16-bit @ src[j  ][i + {0..7}]: A,B,C,D,E,F,G,H
+        smlal               v3.4S, v5.4H, v6.H[0]           // val0 += {A,B,C,D} * X
+        smlal2              v4.4S, v5.8H, v6.H[0]           // val1 += {E,F,G,H} * X
+        subs                w8, w8, #1                      // tmpfilterSize -= 1
+        b.gt                5b                              // loop until filterSize consumed
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        subs                w4, w4, #8                      // dstW -= 8
+        add                 x7, x7, #8                      // i += 8
+        b.gt                4b                              // loop until width consumed
+        ret
+
+6:      // fs=8
+        ldp                 x5, x6, [x2]                    // load 2 pointers: src[j  ] and src[j+1]
+        ldp                 x7, x9, [x2, #16]               // load 2 pointers: src[j+2] and src[j+3]
+        ldp                 x10, x11, [x2, #32]             // load 2 pointers: src[j+4] and src[j+5]
+        ldp                 x12, x13, [x2, #48]             // load 2 pointers: src[j+6] and src[j+7]
+
+        // load 8x16-bit values for filter[j], where j=0..7
+        ld1                 {v6.8H}, [x0]
+7:
+        mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+
+        ld1                 {v24.8H}, [x5], #16             // load 8x16-bit values for src[j + 0][i + {0..7}]
+        ld1                 {v25.8H}, [x6], #16             // load 8x16-bit values for src[j + 1][i + {0..7}]
+        ld1                 {v26.8H}, [x7], #16             // load 8x16-bit values for src[j + 2][i + {0..7}]
+        ld1                 {v27.8H}, [x9], #16             // load 8x16-bit values for src[j + 3][i + {0..7}]
+        ld1                 {v28.8H}, [x10], #16            // load 8x16-bit values for src[j + 4][i + {0..7}]
+        ld1                 {v29.8H}, [x11], #16            // load 8x16-bit values for src[j + 5][i + {0..7}]
+        ld1                 {v30.8H}, [x12], #16            // load 8x16-bit values for src[j + 6][i + {0..7}]
+        ld1                 {v31.8H}, [x13], #16            // load 8x16-bit values for src[j + 7][i + {0..7}]
+
+        smlal               v3.4S, v24.4H, v6.H[0]          // val0 += src[0][i + {0..3}] * filter[0]
+        smlal2              v4.4S, v24.8H, v6.H[0]          // val1 += src[0][i + {4..7}] * filter[0]
+        smlal               v3.4S, v25.4H, v6.H[1]          // val0 += src[1][i + {0..3}] * filter[1]
+        smlal2              v4.4S, v25.8H, v6.H[1]          // val1 += src[1][i + {4..7}] * filter[1]
+        smlal               v3.4S, v26.4H, v6.H[2]          // val0 += src[2][i + {0..3}] * filter[2]
+        smlal2              v4.4S, v26.8H, v6.H[2]          // val1 += src[2][i + {4..7}] * filter[2]
+        smlal               v3.4S, v27.4H, v6.H[3]          // val0 += src[3][i + {0..3}] * filter[3]
+        smlal2              v4.4S, v27.8H, v6.H[3]          // val1 += src[3][i + {4..7}] * filter[3]
+        smlal               v3.4S, v28.4H, v6.H[4]          // val0 += src[4][i + {0..3}] * filter[4]
+        smlal2              v4.4S, v28.8H, v6.H[4]          // val1 += src[4][i + {4..7}] * filter[4]
+        smlal               v3.4S, v29.4H, v6.H[5]          // val0 += src[5][i + {0..3}] * filter[5]
+        smlal2              v4.4S, v29.8H, v6.H[5]          // val1 += src[5][i + {4..7}] * filter[5]
+        smlal               v3.4S, v30.4H, v6.H[6]          // val0 += src[6][i + {0..3}] * filter[6]
+        smlal2              v4.4S, v30.8H, v6.H[6]          // val1 += src[6][i + {4..7}] * filter[6]
+        smlal               v3.4S, v31.4H, v6.H[7]          // val0 += src[7][i + {0..3}] * filter[7]
+        smlal2              v4.4S, v31.8H, v6.H[7]          // val1 += src[7][i + {4..7}] * filter[7]
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        subs                w4, w4, #8                      // dstW -= 8
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        b.gt                7b                              // loop until width consumed
+        ret
+
+8:      // fs=4
+        ldp                 x5, x6, [x2]                    // load 2 pointers: src[j  ] and src[j+1]
+        ldp                 x7, x9, [x2, #16]               // load 2 pointers: src[j+2] and src[j+3]
+
+        // load 4x16-bit values for filter[j], where j=0..3, into the low half of v6
+        ld1                 {v6.4H}, [x0]
+9:
+        mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+
+        ld1                 {v24.8H}, [x5], #16             // load 8x16-bit values for src[j + 0][i + {0..7}]
+        ld1                 {v25.8H}, [x6], #16             // load 8x16-bit values for src[j + 1][i + {0..7}]
+        ld1                 {v26.8H}, [x7], #16             // load 8x16-bit values for src[j + 2][i + {0..7}]
+        ld1                 {v27.8H}, [x9], #16             // load 8x16-bit values for src[j + 3][i + {0..7}]
+
+        smlal               v3.4S, v24.4H, v6.H[0]          // val0 += src[0][i + {0..3}] * filter[0]
+        smlal2              v4.4S, v24.8H, v6.H[0]          // val1 += src[0][i + {4..7}] * filter[0]
+        smlal               v3.4S, v25.4H, v6.H[1]          // val0 += src[1][i + {0..3}] * filter[1]
+        smlal2              v4.4S, v25.8H, v6.H[1]          // val1 += src[1][i + {4..7}] * filter[1]
+        smlal               v3.4S, v26.4H, v6.H[2]          // val0 += src[2][i + {0..3}] * filter[2]
+        smlal2              v4.4S, v26.8H, v6.H[2]          // val1 += src[2][i + {4..7}] * filter[2]
+        smlal               v3.4S, v27.4H, v6.H[3]          // val0 += src[3][i + {0..3}] * filter[3]
+        smlal2              v4.4S, v27.8H, v6.H[3]          // val1 += src[3][i + {4..7}] * filter[3]
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        subs                w4, w4, #8                      // dstW -= 8
+        b.gt                9b                              // loop until width consumed
+        ret
+
+10:     // fs=2
+        ldp                 x5, x6, [x2]                    // load 2 pointers: src[j  ] and src[j+1]
+
+        // load 2x16-bit values for filter[j], where j=0..1, into the low 32 bits of v6
+        ldr                 s6, [x0]
+11:
+        mov                 v3.16B, v1.16B                  // initialize accumulator part 1 with dithering value
+        mov                 v4.16B, v2.16B                  // initialize accumulator part 2 with dithering value
+
+        ld1                 {v24.8H}, [x5], #16             // load 8x16-bit values for src[j + 0][i + {0..7}]
+        ld1                 {v25.8H}, [x6], #16             // load 8x16-bit values for src[j + 1][i + {0..7}]
+
+        smlal               v3.4S, v24.4H, v6.H[0]          // val0 += src[0][i + {0..3}] * filter[0]
+        smlal2              v4.4S, v24.8H, v6.H[0]          // val1 += src[0][i + {4..7}] * filter[0]
+        smlal               v3.4S, v25.4H, v6.H[1]          // val0 += src[1][i + {0..3}] * filter[1]
+        smlal2              v4.4S, v25.8H, v6.H[1]          // val1 += src[1][i + {4..7}] * filter[1]
+
+        sqshrun             v3.4h, v3.4s, #16               // clip16(val0>>16)
+        sqshrun2            v3.8h, v4.4s, #16               // clip16(val1>>16)
+        uqshrn              v3.8b, v3.8h, #3                // clip8(val>>19)
+        st1                 {v3.8b}, [x3], #8               // write to destination
+        subs                w4, w4, #8                      // dstW -= 8
+        b.gt                11b                             // loop until width consumed
+        ret
+endfunc
+
+function ff_yuv2plane1_8_neon, export=1
+// x0 - const int16_t *src,
+// x1 - uint8_t *dest,
+// w2 - int dstW,
+// x3 - const uint8_t *dither,
+// w4 - int offset
+        ld1                 {v0.8B}, [x3]                   // load 8x8-bit dither
+        and                 w4, w4, #7
+        cbz                 w4, 1f                          // check if offsetting present
+        ext                 v0.8B, v0.8B, v0.8B, #3         // honor offsetting which can be 0 or 3 only
+1:      uxtl                v0.8H, v0.8B                    // extend dither to 16-bit
+        uxtl                v1.4s, v0.4h                    // extend dither to 32-bit (part 1)
+        uxtl2               v2.4s, v0.8h                    // extend dither to 32-bit (part 2)
+2:
+        ld1                 {v3.8h}, [x0], #16              // read 8x16-bit @ src[i + {0..7}]: A,B,C,D,E,F,G,H
+        sxtl                v4.4s, v3.4h
+        sxtl2               v5.4s, v3.8h
+        add                 v4.4s, v4.4s, v1.4s
+        add                 v5.4s, v5.4s, v2.4s
+        sqshrun             v4.4h, v4.4s, #6
+        sqshrun2            v4.8h, v5.4s, #6
+
+        uqshrn              v3.8b, v4.8h, #1                // clip8(val>>7)
+        subs                w2, w2, #8                      // dstW -= 8
+        st1                 {v3.8b}, [x1], #8               // write to destination
+        b.gt                2b                              // loop until width consumed
+        ret
 endfunc
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index ab28be4da6..321d1f844e 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -39,6 +39,12 @@ ALL_SCALE_FUNCS(neon);
 void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
                           const int16_t **src, uint8_t *dest, int dstW,
                           const uint8_t *dither, int offset);
+void ff_yuv2plane1_8_neon(
+        const int16_t *src,
+        uint8_t *dest,
+        int dstW,
+        const uint8_t *dither,
+        int offset);
 
 #define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do {              \
     if (c->srcBpc == 8 && c->dstBpc <= 14) {                            \
@@ -54,6 +60,11 @@ void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
         ASSIGN_SCALE_FUNC2(hscalefn, X8, opt);                          \
         break;                                                          \
     }
+#define ASSIGN_VSCALE_FUNC(vscalefn, opt)                               \
+    switch (c->dstBpc) {                                                \
+    case 8: vscalefn = ff_yuv2plane1_8_ ## opt; break;                  \
+    default: break;                                                     \
+    }
 
 av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
 {
@@ -62,6 +73,7 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
     if (have_neon(cpu_flags)) {
         ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon);
         ASSIGN_SCALE_FUNC(c->hcScale, c->hChrFilterSize, neon);
+        ASSIGN_VSCALE_FUNC(c->yuv2plane1, neon);
         if (c->dstBpc == 8) {
             c->yuv2planeX = ff_yuv2planeX_8_neon;
         }
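
Note for reviewers (not part of the patch): a scalar C model of the two entry
points may help when checking the NEON paths by hand. The sketch below follows
the arithmetic spelled out in the assembly comments: the accumulator starts at
dither << 12, adds src[j][i] * filter[j] for each tap, and ends with
clip8(val >> 19) for yuv2planeX; yuv2plane1 computes clip8((src[i] + dither) >> 7).
The names ref_yuv2planeX_8, ref_yuv2plane1_8 and clip_uint8 are hypothetical
helpers introduced only for illustration. The dither indexing (i + offset) & 7
is an assumption chosen to agree with the "offset can be 0 or 3 only" rotation
in the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clamp an int to the 0..255 range. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar model of the multi-tap vertical scaler with 8-bit output.
 * Assumes arithmetic right shift of negative ints, as on AArch64 compilers. */
static void ref_yuv2planeX_8(const int16_t *filter, int filterSize,
                             const int16_t **src, uint8_t *dest, int dstW,
                             const uint8_t *dither, int offset)
{
    assert(filterSize >= 1);
    for (int i = 0; i < dstW; i++) {
        /* accumulator starts from the dither value shifted left by 12 */
        int val = dither[(i + offset) & 7] << 12;
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];      /* one multiply-accumulate per tap */
        dest[i] = clip_uint8(val >> 19);       /* clip8(val >> 19) */
    }
}

/* Scalar model of the single-tap case: add dither, shift by 7, clip to 8 bits. */
static void ref_yuv2plane1_8(const int16_t *src, uint8_t *dest, int dstW,
                             const uint8_t *dither, int offset)
{
    for (int i = 0; i < dstW; i++) {
        int val = (src[i] + dither[(i + offset) & 7]) >> 7;
        dest[i] = clip_uint8(val);
    }
}
```

Unlike this model, the NEON loops above store 8 pixels per iteration, so a
comparison against it is simplest with dstW a multiple of 8 and with
filterSize set to 1, 2, 4, 8 and one other even value to reach every branch.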