From patchwork Sat Sep 21 17:41:43 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhao Zhili
X-Patchwork-Id: 51687
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 22 Sep 2024 01:41:43 +0800
X-OQ-MSGID: <20240921174146.10928-1-quinkblack@foxmail.com>
X-Mailer: git-send-email 2.42.0
Subject: [FFmpeg-devel] [PATCH 1/4] aarch64/vvc: Add w_avg
List-Id: FFmpeg development discussions and patches
Cc: Zhao Zhili

From: Zhao Zhili

w_avg_8_2x2_c: 0.0 ( 0.00x)
w_avg_8_2x2_neon: 0.0 ( 0.00x)
w_avg_8_4x4_c: 0.2 ( 1.00x)
w_avg_8_4x4_neon: 0.0 ( 0.00x)
w_avg_8_8x8_c: 1.2 ( 1.00x)
w_avg_8_8x8_neon: 0.2 ( 5.00x)
w_avg_8_16x16_c: 4.2 ( 1.00x)
w_avg_8_16x16_neon: 0.8 ( 5.67x)
w_avg_8_32x32_c: 16.2 ( 1.00x)
w_avg_8_32x32_neon: 2.5 ( 6.50x)
w_avg_8_64x64_c: 64.5 ( 1.00x)
w_avg_8_64x64_neon: 9.0 ( 7.17x)
w_avg_8_128x128_c: 269.5 ( 1.00x)
w_avg_8_128x128_neon: 35.5 ( 7.59x)
w_avg_10_2x2_c: 0.2 ( 1.00x)
w_avg_10_2x2_neon: 0.2 ( 1.00x)
w_avg_10_4x4_c: 0.2 ( 1.00x)
w_avg_10_4x4_neon: 0.2 ( 1.00x)
w_avg_10_8x8_c: 1.0 ( 1.00x)
w_avg_10_8x8_neon: 0.2 ( 4.00x)
w_avg_10_16x16_c: 4.2 ( 1.00x)
w_avg_10_16x16_neon: 0.8 ( 5.67x)
w_avg_10_32x32_c: 16.2 ( 1.00x)
w_avg_10_32x32_neon: 2.5 ( 6.50x)
w_avg_10_64x64_c: 66.2 ( 1.00x)
w_avg_10_64x64_neon: 10.0 ( 6.62x)
w_avg_10_128x128_c: 277.8 ( 1.00x)
w_avg_10_128x128_neon: 39.8 ( 6.99x)
w_avg_12_2x2_c: 0.0 ( 0.00x)
w_avg_12_2x2_neon: 0.2 ( 0.00x)
w_avg_12_4x4_c: 0.2 ( 1.00x)
w_avg_12_4x4_neon: 0.0 ( 0.00x)
w_avg_12_8x8_c: 1.2 ( 1.00x)
w_avg_12_8x8_neon: 0.5 ( 2.50x)
w_avg_12_16x16_c: 4.8 ( 1.00x)
w_avg_12_16x16_neon: 0.8 ( 6.33x)
w_avg_12_32x32_c: 17.0 ( 1.00x)
w_avg_12_32x32_neon: 2.8 ( 6.18x)
w_avg_12_64x64_c: 64.0 ( 1.00x)
w_avg_12_64x64_neon: 10.0 ( 6.40x)
w_avg_12_128x128_c: 269.2 ( 1.00x)
w_avg_12_128x128_neon: 42.0 ( 6.41x)
---
 libavcodec/aarch64/vvc/dsp_init.c | 34 ++++++++++++
 libavcodec/aarch64/vvc/inter.S    | 92 +++++++++++++++++++++++++------
 2 files changed, 109 insertions(+), 17 deletions(-)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index ad767d17e2..b39ebb83fc 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -52,6 +52,37 @@ void ff_vvc_avg_12_neon(uint8_t *dst, ptrdiff_t dst_stride,
                         const int16_t *src0, const int16_t *src1, int width,
                         int height);
 
+void ff_vvc_w_avg_8_neon(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                         const int16_t *src0, const int16_t *src1,
+                         const int width, const int height,
+                         uintptr_t w0_w1, uintptr_t offset_shift);
+void ff_vvc_w_avg_10_neon(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                          const int16_t *src0, const int16_t *src1,
+                          const int width, const int height,
+                          uintptr_t w0_w1, uintptr_t offset_shift);
+void ff_vvc_w_avg_12_neon(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                          const int16_t *src0, const int16_t *src1,
+                          const int width, const int height,
+                          uintptr_t w0_w1, uintptr_t offset_shift);
+/* When passing arguments to functions, Apple platforms diverge from the ARM64
+ * standard ABI, so we can't implement the function directly in asm.
+ */
+#define W_AVG_FUN(bit_depth) \
+static void vvc_w_avg_ ## bit_depth(uint8_t *dst, const ptrdiff_t dst_stride, \
+    const int16_t *src0, const int16_t *src1, const int width, const int height, \
+    const int denom, const int w0, const int w1, const int o0, const int o1) \
+{ \
+    const int shift = denom + FFMAX(3, 15 - bit_depth); \
+    const int offset = ((o0 + o1) * (1 << (bit_depth - 8)) + 1) * (1 << (shift - 1)); \
+    uintptr_t w0_w1 = ((uintptr_t)w0 << 32) | (uint32_t)w1; \
+    uintptr_t offset_shift = ((uintptr_t)offset << 32) | (uint32_t)shift; \
+    ff_vvc_w_avg_ ## bit_depth ## _neon(dst, dst_stride, src0, src1, width, height, w0_w1, offset_shift); \
+}
+
+W_AVG_FUN(8)
+W_AVG_FUN(10)
+W_AVG_FUN(12)
+
 void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -123,6 +154,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
             c->inter.put_uni_w[0][6][0][0] = ff_vvc_put_pel_uni_w_pixels128_8_neon;
 
             c->inter.avg = ff_vvc_avg_8_neon;
+            c->inter.w_avg = vvc_w_avg_8;
 
             for (int i = 0; i < FF_ARRAY_ELEMS(c->sao.band_filter); i++)
                 c->sao.band_filter[i] = ff_h26x_sao_band_filter_8x8_8_neon;
@@ -163,11 +195,13 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
             }
         } else if (bd == 10) {
             c->inter.avg = ff_vvc_avg_10_neon;
+            c->inter.w_avg = vvc_w_avg_10;
 
             c->alf.filter[LUMA] = alf_filter_luma_10_neon;
             c->alf.filter[CHROMA] = alf_filter_chroma_10_neon;
         } else if (bd == 12) {
             c->inter.avg = ff_vvc_avg_12_neon;
+            c->inter.w_avg = vvc_w_avg_12;
 
             c->alf.filter[LUMA] = alf_filter_luma_12_neon;
             c->alf.filter[CHROMA] = alf_filter_chroma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 2f69274b86..49e1050aee 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -22,9 +22,9 @@
 
 #define VVC_MAX_PB_SIZE 128
 
-.macro vvc_avg, bit_depth
+.macro vvc_avg type, bit_depth
 
-.macro vvc_avg_\bit_depth\()_2_4, tap
+.macro vvc_\type\()_\bit_depth\()_2_4 tap
 .if \tap == 2
         ldr             s0, [src0]
         ldr             s2, [src1]
@@ -32,9 +32,19 @@
         ldr             d0, [src0]
         ldr             d2, [src1]
 .endif
+
+.ifc \type, avg
         saddl           v4.4s, v0.4h, v2.4h
         add             v4.4s, v4.4s, v16.4s
         sqshrn          v4.4h, v4.4s, #(15 - \bit_depth)
+.else
+        mov             v4.16b, v16.16b
+        smlal           v4.4s, v0.4h, v19.4h
+        smlal           v4.4s, v2.4h, v20.4h
+        sqshl           v4.4s, v4.4s, v22.4s
+        sqxtn           v4.4h, v4.4s
+.endif
+
 .if \bit_depth == 8
         sqxtun          v4.8b, v4.8h
 .if \tap == 2
@@ -57,7 +67,7 @@
         add             dst, dst, dst_stride
 .endm
 
-function ff_vvc_avg_\bit_depth\()_neon, export=1
+function ff_vvc_\type\()_\bit_depth\()_neon, export=1
         dst             .req x0
         dst_stride      .req x1
         src0            .req x2
@@ -67,42 +77,64 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
 
         mov             x10, #(VVC_MAX_PB_SIZE * 2)
         cmp             width, #8
-.if \bit_depth == 8
-        movi            v16.4s, #64
-.else
-.if \bit_depth == 10
-        mov             w6, #1023
-        movi            v16.4s, #16
+.ifc \type, avg
+        movi            v16.4s, #(1 << (14 - \bit_depth))
 .else
-        mov             w6, #4095
-        movi            v16.4s, #4
-.endif
+        lsr             x11, x6, #32        // weight0
+        mov             w12, w6             // weight1
+        lsr             x13, x7, #32        // offset
+        mov             w14, w7             // shift
+
+        dup             v19.8h, w11
+        neg             w14, w14            // so we can use sqshl
+        dup             v20.8h, w12
+        dup             v16.4s, w13
+        dup             v22.4s, w14
+.endif // avg
+
+ .if \bit_depth >= 10
+        // clip pixel
+        mov             w6, #((1 << \bit_depth) - 1)
         movi            v18.8h, #0
         dup             v17.8h, w6
 .endif
+
         b.eq            8f
         b.hi            16f
         cmp             width, #4
         b.eq            4f
 2:      // width == 2
         subs            height, height, #1
-        vvc_avg_\bit_depth\()_2_4 2
+        vvc_\type\()_\bit_depth\()_2_4 2
         b.ne            2b
         b               32f
 4:      // width == 4
         subs            height, height, #1
-        vvc_avg_\bit_depth\()_2_4 4
+        vvc_\type\()_\bit_depth\()_2_4 4
         b.ne            4b
         b               32f
 8:      // width == 8
         ld1             {v0.8h}, [src0], x10
         ld1             {v2.8h}, [src1], x10
+.ifc \type, avg
         saddl           v4.4s, v0.4h, v2.4h
         saddl2          v5.4s, v0.8h, v2.8h
         add             v4.4s, v4.4s, v16.4s
         add             v5.4s, v5.4s, v16.4s
         sqshrn          v4.4h, v4.4s, #(15 - \bit_depth)
         sqshrn2         v4.8h, v5.4s, #(15 - \bit_depth)
+.else
+        mov             v4.16b, v16.16b
+        mov             v5.16b, v16.16b
+        smlal           v4.4s, v0.4h, v19.4h
+        smlal           v4.4s, v2.4h, v20.4h
+        smlal2          v5.4s, v0.8h, v19.8h
+        smlal2          v5.4s, v2.8h, v20.8h
+        sqshl           v4.4s, v4.4s, v22.4s
+        sqshl           v5.4s, v5.4s, v22.4s
+        sqxtn           v4.4h, v4.4s
+        sqxtn2          v4.8h, v5.4s
+.endif
         subs            height, height, #1
 .if \bit_depth == 8
         sqxtun          v4.8b, v4.8h
@@ -122,6 +154,7 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
 17:
         ldp             q0, q1, [x7], #32
         ldp             q2, q3, [x8], #32
+.ifc \type, avg
         saddl           v4.4s, v0.4h, v2.4h
         saddl2          v5.4s, v0.8h, v2.8h
         saddl           v6.4s, v1.4h, v3.4h
@@ -134,6 +167,28 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
         sqshrn2         v4.8h, v5.4s, #(15 - \bit_depth)
         sqshrn          v6.4h, v6.4s, #(15 - \bit_depth)
         sqshrn2         v6.8h, v7.4s, #(15 - \bit_depth)
+.else // avg
+        mov             v4.16b, v16.16b
+        mov             v5.16b, v16.16b
+        mov             v6.16b, v16.16b
+        mov             v7.16b, v16.16b
+        smlal           v4.4s, v0.4h, v19.4h
+        smlal           v4.4s, v2.4h, v20.4h
+        smlal2          v5.4s, v0.8h, v19.8h
+        smlal2          v5.4s, v2.8h, v20.8h
+        smlal           v6.4s, v1.4h, v19.4h
+        smlal           v6.4s, v3.4h, v20.4h
+        smlal2          v7.4s, v1.8h, v19.8h
+        smlal2          v7.4s, v3.8h, v20.8h
+        sqshl           v4.4s, v4.4s, v22.4s
+        sqshl           v5.4s, v5.4s, v22.4s
+        sqshl           v6.4s, v6.4s, v22.4s
+        sqshl           v7.4s, v7.4s, v22.4s
+        sqxtn           v4.4h, v4.4s
+        sqxtn           v6.4h, v6.4s
+        sqxtn2          v4.8h, v5.4s
+        sqxtn2          v6.8h, v7.4s
+.endif // w_avg
         subs            w6, w6, #16
 .if \bit_depth == 8
         sqxtun          v4.8b, v4.8h
@@ -158,6 +213,9 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
 endfunc
 .endm
 
-vvc_avg 8
-vvc_avg 10
-vvc_avg 12
+vvc_avg avg, 8
+vvc_avg avg, 10
+vvc_avg avg, 12
+vvc_avg w_avg, 8
+vvc_avg w_avg, 10
+vvc_avg w_avg, 12
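
For anyone checking the NEON path against scalar code, the per-sample arithmetic can be sketched in C as below. This is an illustration only and not part of the patch. The helper names (pack_pair, unpack_hi, unpack_lo, w_avg_sample) are hypothetical, uint64_t stands in for the 64-bit uintptr_t used by the wrapper, and the sketch assumes that >> on a negative int is an arithmetic shift, which matches what sqshl with a negated shift count does. The shift and offset expressions are copied from W_AVG_FUN; the accumulate, shift and clip steps are my reading of the smlal/sqshl/clip sequence in the w_avg branch.

```c
#include <stdint.h>

/* Pack two 32-bit values into one 64-bit word: hi in bits 63..32, lo in
 * bits 31..0. Same bit layout as the w0_w1 and offset_shift arguments. */
static uint64_t pack_pair(int32_t hi, int32_t lo)
{
    return ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
}

/* Counterparts of "lsr xN, xM, #32" and "mov wN, wM" in the asm prologue. */
static int32_t unpack_hi(uint64_t packed)
{
    return (int32_t)(uint32_t)(packed >> 32);
}

static int32_t unpack_lo(uint64_t packed)
{
    return (int32_t)(uint32_t)packed;
}

/* One output sample of the weighted average for a given bit depth. */
static int w_avg_sample(int a, int b, int bit_depth, int denom,
                        int w0, int w1, int o0, int o1)
{
    /* Same expressions as in W_AVG_FUN, with FFMAX written out. */
    const int shift  = denom + (15 - bit_depth > 3 ? 15 - bit_depth : 3);
    const int offset = ((o0 + o1) * (1 << (bit_depth - 8)) + 1) * (1 << (shift - 1));
    const int max    = (1 << bit_depth) - 1;

    /* Start from the offset, accumulate both weighted inputs, shift right. */
    const int v = (a * w0 + b * w1 + offset) >> shift;

    /* Clip to the valid pixel range for this bit depth. */
    return v < 0 ? 0 : v > max ? max : v;
}
```

With equal weights (w0 = w1 = 4, denom = 2, zero offsets) at 10 bits, two inputs of 1000 give 63, the same value the plain avg path computes as (1000 + 1000 + 16) >> 5.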