From patchwork Sat Sep 21 17:41:43 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhao Zhili
X-Patchwork-Id: 51687
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 22 Sep 2024 01:41:43 +0800
X-OQ-MSGID: <20240921174146.10928-1-quinkblack@foxmail.com>
X-Mailer: git-send-email 2.42.0
Subject: [FFmpeg-devel] [PATCH 1/4] aarch64/vvc: Add w_avg
List-Id: FFmpeg development discussions and patches
Cc: Zhao Zhili

From: Zhao Zhili

w_avg_8_2x2_c: 0.0 ( 0.00x)
w_avg_8_2x2_neon: 0.0 ( 0.00x)
w_avg_8_4x4_c: 0.2 ( 1.00x)
w_avg_8_4x4_neon: 0.0 ( 0.00x)
w_avg_8_8x8_c: 1.2 ( 1.00x)
w_avg_8_8x8_neon: 0.2 ( 5.00x)
w_avg_8_16x16_c: 4.2 ( 1.00x)
w_avg_8_16x16_neon: 0.8 ( 5.67x)
w_avg_8_32x32_c: 16.2 ( 1.00x)
w_avg_8_32x32_neon: 2.5 ( 6.50x)
w_avg_8_64x64_c: 64.5 ( 1.00x)
w_avg_8_64x64_neon: 9.0 ( 7.17x)
w_avg_8_128x128_c: 269.5 ( 1.00x)
w_avg_8_128x128_neon: 35.5 ( 7.59x)
w_avg_10_2x2_c: 0.2 ( 1.00x)
w_avg_10_2x2_neon: 0.2 ( 1.00x)
w_avg_10_4x4_c: 0.2 ( 1.00x)
w_avg_10_4x4_neon: 0.2 ( 1.00x)
w_avg_10_8x8_c: 1.0 ( 1.00x)
w_avg_10_8x8_neon: 0.2 ( 4.00x)
w_avg_10_16x16_c: 4.2 ( 1.00x)
w_avg_10_16x16_neon: 0.8 ( 5.67x)
w_avg_10_32x32_c: 16.2 ( 1.00x)
w_avg_10_32x32_neon: 2.5 ( 6.50x)
w_avg_10_64x64_c: 66.2 ( 1.00x)
w_avg_10_64x64_neon: 10.0 ( 6.62x)
w_avg_10_128x128_c: 277.8 ( 1.00x)
w_avg_10_128x128_neon: 39.8 ( 6.99x)
w_avg_12_2x2_c: 0.0 ( 0.00x)
w_avg_12_2x2_neon: 0.2 ( 0.00x)
w_avg_12_4x4_c: 0.2 ( 1.00x)
w_avg_12_4x4_neon: 0.0 ( 0.00x)
w_avg_12_8x8_c: 1.2 ( 1.00x)
w_avg_12_8x8_neon: 0.5 ( 2.50x)
w_avg_12_16x16_c: 4.8 ( 1.00x)
w_avg_12_16x16_neon: 0.8 ( 6.33x)
w_avg_12_32x32_c: 17.0 ( 1.00x)
w_avg_12_32x32_neon: 2.8 ( 6.18x)
w_avg_12_64x64_c: 64.0 ( 1.00x)
w_avg_12_64x64_neon: 10.0 ( 6.40x)
w_avg_12_128x128_c: 269.2 ( 1.00x)
w_avg_12_128x128_neon: 42.0 ( 6.41x)
---
 libavcodec/aarch64/vvc/dsp_init.c | 34 ++++++++++++
 libavcodec/aarch64/vvc/inter.S    | 92 +++++++++++++++++++++++++------
 2 files changed, 109 insertions(+), 17 deletions(-)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index ad767d17e2..b39ebb83fc 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -52,6 +52,37 @@ void ff_vvc_avg_12_neon(uint8_t *dst, ptrdiff_t dst_stride,
                         const int16_t *src0, const int16_t *src1, int width,
                         int height);

+void ff_vvc_w_avg_8_neon(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                         const int16_t *src0, const int16_t *src1,
+                         const int width, const int height,
+                         uintptr_t w0_w1, uintptr_t offset_shift);
+void ff_vvc_w_avg_10_neon(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                          const int16_t *src0, const int16_t *src1,
+                          const int width, const int height,
+                          uintptr_t w0_w1, uintptr_t offset_shift);
+void ff_vvc_w_avg_12_neon(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                          const int16_t *src0, const int16_t *src1,
+                          const int width, const int height,
+                          uintptr_t w0_w1, uintptr_t offset_shift);
+/* When passing arguments to functions, Apple platforms diverge from the ARM64
+ * standard ABI, that we can't implement the function directly in asm.
+ */
+#define W_AVG_FUN(bit_depth) \
+static void vvc_w_avg_ ## bit_depth(uint8_t *dst, const ptrdiff_t dst_stride, \
+    const int16_t *src0, const int16_t *src1, const int width, const int height, \
+    const int denom, const int w0, const int w1, const int o0, const int o1) \
+{ \
+    const int shift = denom + FFMAX(3, 15 - bit_depth); \
+    const int offset = ((o0 + o1) * (1 << (bit_depth - 8)) + 1) * (1 << (shift - 1)); \
+    uintptr_t w0_w1 = ((uintptr_t)w0 << 32) | (uint32_t)w1; \
+    uintptr_t offset_shift = ((uintptr_t)offset << 32) | (uint32_t)shift; \
+    ff_vvc_w_avg_ ## bit_depth ## _neon(dst, dst_stride, src0, src1, width, height, w0_w1, offset_shift); \
+}
+
+W_AVG_FUN(8)
+W_AVG_FUN(10)
+W_AVG_FUN(12)
+
 void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -123,6 +154,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.put_uni_w[0][6][0][0] = ff_vvc_put_pel_uni_w_pixels128_8_neon;

         c->inter.avg = ff_vvc_avg_8_neon;
+        c->inter.w_avg = vvc_w_avg_8;

         for (int i = 0; i < FF_ARRAY_ELEMS(c->sao.band_filter); i++)
             c->sao.band_filter[i] = ff_h26x_sao_band_filter_8x8_8_neon;
@@ -163,11 +195,13 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         }
     } else if (bd == 10) {
         c->inter.avg = ff_vvc_avg_10_neon;
+        c->inter.w_avg = vvc_w_avg_10;

         c->alf.filter[LUMA] = alf_filter_luma_10_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_10_neon;
     } else if (bd == 12) {
         c->inter.avg = ff_vvc_avg_12_neon;
+        c->inter.w_avg = vvc_w_avg_12;

         c->alf.filter[LUMA] = alf_filter_luma_12_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 2f69274b86..49e1050aee 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -22,9 +22,9 @@

 #define VVC_MAX_PB_SIZE 128

-.macro vvc_avg, bit_depth
+.macro vvc_avg type, bit_depth

-.macro vvc_avg_\bit_depth\()_2_4, tap
+.macro vvc_\type\()_\bit_depth\()_2_4 tap
 .if \tap == 2
         ldr             s0, [src0]
         ldr             s2, [src1]
@@ -32,9 +32,19 @@
         ldr             d0, [src0]
         ldr             d2, [src1]
 .endif
+
+.ifc \type, avg
         saddl           v4.4s, v0.4h, v2.4h
         add             v4.4s, v4.4s, v16.4s
         sqshrn          v4.4h, v4.4s, #(15 - \bit_depth)
+.else
+        mov             v4.16b, v16.16b
+        smlal           v4.4s, v0.4h, v19.4h
+        smlal           v4.4s, v2.4h, v20.4h
+        sqshl           v4.4s, v4.4s, v22.4s
+        sqxtn           v4.4h, v4.4s
+.endif
+
 .if \bit_depth == 8
         sqxtun          v4.8b, v4.8h
 .if \tap == 2
@@ -57,7 +67,7 @@
         add             dst, dst, dst_stride
 .endm

-function ff_vvc_avg_\bit_depth\()_neon, export=1
+function ff_vvc_\type\()_\bit_depth\()_neon, export=1
         dst             .req x0
         dst_stride      .req x1
         src0            .req x2
@@ -67,42 +77,64 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1

         mov             x10, #(VVC_MAX_PB_SIZE * 2)
         cmp             width, #8
-.if \bit_depth == 8
-        movi            v16.4s, #64
-.else
-.if \bit_depth == 10
-        mov             w6, #1023
-        movi            v16.4s, #16
+.ifc \type, avg
+        movi            v16.4s, #(1 << (14 - \bit_depth))
 .else
-        mov             w6, #4095
-        movi            v16.4s, #4
-.endif
+        lsr             x11, x6, #32        // weight0
+        mov             w12, w6             // weight1
+        lsr             x13, x7, #32        // offset
+        mov             w14, w7             // shift
+
+        dup             v19.8h, w11
+        neg             w14, w14            // so we can use sqshl
+        dup             v20.8h, w12
+        dup             v16.4s, w13
+        dup             v22.4s, w14
+.endif // avg
+
+ .if \bit_depth >= 10
+        // clip pixel
+        mov             w6, #((1 << \bit_depth) - 1)
         movi            v18.8h, #0
         dup             v17.8h, w6
 .endif
+
         b.eq            8f
         b.hi            16f
         cmp             width, #4
         b.eq            4f
 2: // width == 2
         subs            height, height, #1
-        vvc_avg_\bit_depth\()_2_4 2
+        vvc_\type\()_\bit_depth\()_2_4 2
         b.ne            2b
         b               32f
 4: // width == 4
         subs            height, height, #1
-        vvc_avg_\bit_depth\()_2_4 4
+        vvc_\type\()_\bit_depth\()_2_4 4
         b.ne            4b
         b               32f
 8: // width == 8
         ld1             {v0.8h}, [src0], x10
         ld1             {v2.8h}, [src1], x10
+.ifc \type, avg
         saddl           v4.4s, v0.4h, v2.4h
         saddl2          v5.4s, v0.8h, v2.8h
         add             v4.4s, v4.4s, v16.4s
         add             v5.4s, v5.4s, v16.4s
         sqshrn          v4.4h, v4.4s, #(15 - \bit_depth)
         sqshrn2         v4.8h, v5.4s, #(15 - \bit_depth)
+.else
+        mov             v4.16b, v16.16b
+        mov             v5.16b, v16.16b
+        smlal           v4.4s, v0.4h, v19.4h
+        smlal           v4.4s, v2.4h, v20.4h
+        smlal2          v5.4s, v0.8h, v19.8h
+        smlal2          v5.4s, v2.8h, v20.8h
+        sqshl           v4.4s, v4.4s, v22.4s
+        sqshl           v5.4s, v5.4s, v22.4s
+        sqxtn           v4.4h, v4.4s
+        sqxtn2          v4.8h, v5.4s
+.endif
         subs            height, height, #1
 .if \bit_depth == 8
         sqxtun          v4.8b, v4.8h
@@ -122,6 +154,7 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
 17:
         ldp             q0, q1, [x7], #32
         ldp             q2, q3, [x8], #32
+.ifc \type, avg
         saddl           v4.4s, v0.4h, v2.4h
         saddl2          v5.4s, v0.8h, v2.8h
         saddl           v6.4s, v1.4h, v3.4h
@@ -134,6 +167,28 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
         sqshrn2         v4.8h, v5.4s, #(15 - \bit_depth)
         sqshrn          v6.4h, v6.4s, #(15 - \bit_depth)
         sqshrn2         v6.8h, v7.4s, #(15 - \bit_depth)
+.else // avg
+        mov             v4.16b, v16.16b
+        mov             v5.16b, v16.16b
+        mov             v6.16b, v16.16b
+        mov             v7.16b, v16.16b
+        smlal           v4.4s, v0.4h, v19.4h
+        smlal           v4.4s, v2.4h, v20.4h
+        smlal2          v5.4s, v0.8h, v19.8h
+        smlal2          v5.4s, v2.8h, v20.8h
+        smlal           v6.4s, v1.4h, v19.4h
+        smlal           v6.4s, v3.4h, v20.4h
+        smlal2          v7.4s, v1.8h, v19.8h
+        smlal2          v7.4s, v3.8h, v20.8h
+        sqshl           v4.4s, v4.4s, v22.4s
+        sqshl           v5.4s, v5.4s, v22.4s
+        sqshl           v6.4s, v6.4s, v22.4s
+        sqshl           v7.4s, v7.4s, v22.4s
+        sqxtn           v4.4h, v4.4s
+        sqxtn           v6.4h, v6.4s
+        sqxtn2          v4.8h, v5.4s
+        sqxtn2          v6.8h, v7.4s
+.endif // w_avg
         subs            w6, w6, #16
 .if \bit_depth == 8
         sqxtun          v4.8b, v4.8h
@@ -158,6 +213,9 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
 endfunc
 .endm

-vvc_avg 8
-vvc_avg 10
-vvc_avg 12
+vvc_avg avg, 8
+vvc_avg avg, 10
+vvc_avg avg, 12
+vvc_avg w_avg, 8
+vvc_avg w_avg, 10
+vvc_avg w_avg, 12

From patchwork Sat Sep 21 17:41:44 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter:
 Zhao Zhili
X-Patchwork-Id: 51685
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 22 Sep 2024 01:41:44 +0800
X-OQ-MSGID: <20240921174146.10928-2-quinkblack@foxmail.com>
X-Mailer: git-send-email 2.42.0
In-Reply-To: <20240921174146.10928-1-quinkblack@foxmail.com>
References: <20240921174146.10928-1-quinkblack@foxmail.com>
Subject: [FFmpeg-devel] [PATCH 2/4] aarch64/vvc: Add apply_bdof
List-Id: FFmpeg development discussions and patches
Cc: Zhao Zhili

From: Zhao Zhili

apply_bdof_8_8x16_c: 18.7 ( 1.00x)
apply_bdof_8_8x16_neon: 9.7 ( 1.93x)
apply_bdof_8_16x8_c: 20.0 ( 1.00x)
apply_bdof_8_16x8_neon: 9.5 ( 2.11x)
apply_bdof_8_16x16_c: 36.7 ( 1.00x)
apply_bdof_8_16x16_neon: 19.0 ( 1.94x)
apply_bdof_10_8x16_c: 18.0 ( 1.00x)
apply_bdof_10_8x16_neon: 10.0 ( 1.80x)
apply_bdof_10_16x8_c: 18.0 ( 1.00x)
apply_bdof_10_16x8_neon: 9.5 ( 1.90x)
apply_bdof_10_16x16_c: 35.5 ( 1.00x)
apply_bdof_10_16x16_neon: 19.0 ( 1.87x)
apply_bdof_12_8x16_c: 17.5 ( 1.00x)
apply_bdof_12_8x16_neon: 9.7 ( 1.80x)
apply_bdof_12_16x8_c: 18.2 ( 1.00x)
apply_bdof_12_16x8_neon: 9.5 ( 1.92x)
apply_bdof_12_16x16_c: 34.5 ( 1.00x)
apply_bdof_12_16x16_neon: 18.7 ( 1.84x)
---
 libavcodec/aarch64/vvc/dsp_init.c    |   9 +
 libavcodec/aarch64/vvc/inter.S       | 351 +++++++++++++++++++++++++++
 libavcodec/aarch64/vvc/of_template.c |  70 ++++++
 3 files changed, 430 insertions(+)
 create mode 100644 libavcodec/aarch64/vvc/of_template.c

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index b39ebb83fc..03a4c62310 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -27,16 +27,22 @@
 #include "libavcodec/vvc/dec.h"
 #include "libavcodec/vvc/ctu.h"

+#define BDOF_BLOCK_SIZE 16
+#define BDOF_MIN_BLOCK_SIZE 4
+
 #define BIT_DEPTH 8
 #include "alf_template.c"
+#include "of_template.c"
 #undef BIT_DEPTH

 #define BIT_DEPTH 10
 #include "alf_template.c"
+#include "of_template.c"
 #undef BIT_DEPTH

 #define BIT_DEPTH 12
 #include "alf_template.c"
+#include "of_template.c"
 #undef BIT_DEPTH

 int ff_vvc_sad_neon(const int16_t *src0, const int16_t *src1, int dx, int dy,
@@ -155,6 +161,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)

         c->inter.avg = ff_vvc_avg_8_neon;
         c->inter.w_avg = vvc_w_avg_8;
+        c->inter.apply_bdof = apply_bdof_8;

         for (int i = 0; i < FF_ARRAY_ELEMS(c->sao.band_filter); i++)
             c->sao.band_filter[i] = ff_h26x_sao_band_filter_8x8_8_neon;
@@ -196,12 +203,14 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
     } else if (bd == 10) {
         c->inter.avg = ff_vvc_avg_10_neon;
         c->inter.w_avg = vvc_w_avg_10;
+        c->inter.apply_bdof = apply_bdof_10;

         c->alf.filter[LUMA] = alf_filter_luma_10_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_10_neon;
     } else if (bd == 12) {
         c->inter.avg = ff_vvc_avg_12_neon;
         c->inter.w_avg = vvc_w_avg_12;
+        c->inter.apply_bdof = apply_bdof_12;

         c->alf.filter[LUMA] = alf_filter_luma_12_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 49e1050aee..8cfacef44f 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -21,6 +21,8 @@
 #include "libavutil/aarch64/asm.S"

 #define VVC_MAX_PB_SIZE 128
+#define BDOF_BLOCK_SIZE 16
+#define BDOF_MIN_BLOCK_SIZE 4

 .macro vvc_avg type, bit_depth

@@ -211,6 +213,13 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1
 32:
         ret
 endfunc
+
+.unreq dst
+.unreq dst_stride
+.unreq src0
+.unreq src1
+.unreq width
+.unreq height
 .endm

 vvc_avg avg, 8
@@ -219,3 +228,345 @@ vvc_avg avg, 12
 vvc_avg w_avg, 8
 vvc_avg w_avg, 10
 vvc_avg w_avg, 12
+
+function ff_vvc_prof_grad_filter_8x_neon, export=1
+        gh              .req x0
+        gv              .req x1
+        gstride         .req x2
+        src             .req x3
+        src_stride      .req x4
+        width           .req w5
+        height          .req w6
+
+        lsl             src_stride, src_stride, #1
+        neg             x7, src_stride
+1:
+        mov             x10, src
+        mov             w11, width
+        mov             x12, gh
+        mov             x13, gv
+2:
+        ldur            q0, [x10, #2]
+        ldur            q1, [x10, #-2]
+        subs            w11, w11, #8
+        ldr             q2, [x10, src_stride]
+        ldr             q3, [x10, x7]
+        sshr            v0.8h, v0.8h, #6
+        sshr            v1.8h, v1.8h, #6
+        sshr            v2.8h, v2.8h, #6
+        sshr            v3.8h, v3.8h, #6
+        sub             v0.8h, v0.8h, v1.8h
+        sub             v2.8h, v2.8h, v3.8h
+        st1             {v0.8h}, [x12], #16
+        st1             {v2.8h}, [x13], #16
+        add             x10, x10, #16
+        b.ne            2b
+
+        subs            height, height, #1
+        add             gh, gh, gstride, lsl #1
+        add             gv, gv, gstride, lsl #1
+        add             src, src, src_stride
+        b.ne            1b
+        ret
+
+.unreq gh
+.unreq gv
+.unreq gstride
+.unreq src
+.unreq src_stride
+.unreq width
+.unreq height
+
+endfunc
+
+.macro vvc_apply_bdof_min_block bit_depth
+        dst             .req x0
+        dst_stride      .req x1
+        src0            .req x2
+        src1            .req x3
+        gh              .req x4
+        gv              .req x5
+        vx              .req w6
+        vy              .req w7
+
+        dup             v0.4h, vx
+        dup             v1.4h, vy
+        movi            v7.4s, #(1 << (14 - \bit_depth))
+        ldp             x8, x9, [gh]
+        ldp             x10, x11, [gv]
+        mov             x12, #(BDOF_BLOCK_SIZE * 2)
+        mov             w13, #(BDOF_MIN_BLOCK_SIZE)
+        mov             x14, #(VVC_MAX_PB_SIZE * 2)
+.if \bit_depth >= 10
+        // clip pixel
+        mov             w15, #((1 << \bit_depth) - 1)
+        movi            v18.8h, #0
+        lsl             dst_stride, dst_stride, #1
+        dup             v17.8h, w15
+.endif
+1:
+        ld1             {v2.4h}, [x8], x12
+        ld1             {v3.4h}, [x9], x12
+        ld1             {v4.4h}, [x10], x12
+        ld1             {v5.4h}, [x11], x12
+        sub             v2.4h, v2.4h, v3.4h
+        sub             v4.4h, v4.4h, v5.4h
+        smull           v2.4s, v0.4h, v2.4h
+        smlal           v2.4s, v1.4h, v4.4h
+
+        ld1             {v5.4h}, [src0], x14
+        ld1             {v6.4h}, [src1], x14
+        saddl           v5.4s, v5.4h, v6.4h
+        add             v5.4s, v5.4s, v7.4s
+        add             v5.4s, v5.4s, v2.4s
+        sqshrn          v5.4h, v5.4s, #(15 - \bit_depth)
+        subs            w13, w13, #1
+.if \bit_depth == 8
+        sqxtun          v5.8b, v5.8h
+        str             s5, [dst]
+        add             dst, dst, dst_stride
+.else
+        smin            v5.4h, v5.4h, v17.4h
+        smax            v5.4h, v5.4h, v18.4h
+        st1             {v5.4h}, [dst], dst_stride
+.endif
+        b.ne            1b
+        ret
+
+.unreq dst
+.unreq dst_stride
+.unreq src0
+.unreq src1
+.unreq gh
+.unreq gv
+.unreq vx
+.unreq vy
+.endm
+
+function ff_vvc_apply_bdof_min_block_8_neon, export=1
+        vvc_apply_bdof_min_block 8
+endfunc
+
+function ff_vvc_apply_bdof_min_block_10_neon, export=1
+        vvc_apply_bdof_min_block 10
+endfunc
+
+function ff_vvc_apply_bdof_min_block_12_neon, export=1
+        vvc_apply_bdof_min_block 12
+endfunc
+
+.macro derive_bdof_vx_vy_x_begin_end
+        ldrh            w19, [x14, x16, lsl #1]     // load from src0
+        ldrh            w20, [x15, x16, lsl #1]     // load from src1
+        sxth            w19, w19
+        sxth            w20, w20
+        asr             w19, w19, #4
+        asr             w20, w20, #4
+        sub             w19, w19, w20               // diff
+        add             x17, x16, x13, lsl #4       // idx
+        ldrh            w3, [gh0, x17, lsl #1]      // load from gh0
+        ldrh            w4, [gh1, x17, lsl #1]      // load from gh1
+        sxth            w3, w3
+        sxth            w4, w4
+        ldrh            w22, [gv0, x17, lsl #1]     // load from gv0
+        ldrh            w23, [gv1, x17, lsl #1]     // load from gv1
+        add             w3, w3, w4
+        asr             w21, w3, #1                 // temph
+        sxth            w3, w22
+        sxth            w4, w23
+        add             w3, w3, w4
+        cmp             w21, #0
+        asr             w22, w3, #1                 // tempv
+        cneg            w20, w21, mi
+        csetm           w23, ne
+        csinc           w23, w23, wzr, ge           // -VVC_SIGN(temph)
+        cmp             w22, #0
+        add             sgx2, sgx2, w20
+        cneg            w20, w22, mi
+        cset            w24, ne
+        csinv           w24, w24, wzr, ge           // VVC_SIGN(tempv)
+        add             sgy2, sgy2, w20
+        madd            sgxgy, w24, w21, sgxgy
+        madd            sgxdi, w23, w19, sgxdi
+        csetm           w24, ne
+        csinc           w24, w24, wzr, ge           // -VVC_SIGN(tempv)
+        madd            sgydi, w24, w19, sgydi
+.endm
+
+function ff_vvc_derive_bdof_vx_vy_neon, export=1
+        src0            .req x0
+        src1            .req x1
+        pad_mask        .req w2
+        gh              .req x3
+        gv              .req x4
+        gh0             .req x27
+        gh1             .req x28
+        gv0             .req x25
+        gv1             .req x26
+        vx              .req x5
+        vy              .req x6
+        sgx2            .req w7
+        sgy2            .req w8
+        sgxgy           .req w9
+        sgxdi           .req w10
+        sgydi           .req w11
+        y               .req x12
+
+        stp             x27, x28, [sp, #-80]!
+        stp             x25, x26, [sp, #16]
+        stp             x23, x24, [sp, #32]
+        stp             x21, x22, [sp, #48]
+        stp             x19, x20, [sp, #64]
+
+        ldp             gh0, gh1, [gh]
+        mov             sgx2, #0
+        mov             sgy2, #0
+        mov             sgxgy, #0
+        mov             sgxdi, #0
+        mov             sgydi, #0
+        ldp             gv0, gv1, [gv]
+
+        mov             y, #-1
+        mov             x13, #-1                    // dy
+        tst             pad_mask, #2
+        b.eq            1f
+        mov             x13, #0                     // dy: pad top
+1:
+        add             x14, src0, x13, lsl #8      // local src0
+        add             x15, src1, x13, lsl #8      // local src1
+
+        // x = -1
+        mov             x16, #-1                    // dx
+        tst             pad_mask, #1
+        b.eq            2f
+        mov             x16, #0
+2:
+        derive_bdof_vx_vy_x_begin_end
+
+        // x = 0 to BDOF_MIN_BLOCK_SIZE - 1
+        ldr             d0, [x14]
+        ldr             d1, [x15]
+        lsl             x19, x13, #5
+        ldr             d2, [gh0, x19]
+        ldr             d3, [gh1, x19]
+        sshr            v0.4h, v0.4h, #4
+        sshr            v1.4h, v1.4h, #4
+        ssubl           v0.4s, v0.4h, v1.4h         // diff
+        ldr             d4, [gv0, x19]
+        ldr             d5, [gv1, x19]
+        saddl           v2.4s, v2.4h, v3.4h
+        saddl           v4.4s, v4.4h, v5.4h
+        sshr            v2.4s, v2.4s, #1            // temph
+        sshr            v4.4s, v4.4s, #1            // tempv
+        abs             v3.4s, v2.4s
+        abs             v5.4s, v4.4s
+        addv            s3, v3.4s
+        addv            s5, v5.4s
+        mov             w19, v3.s[0]
+        mov             w20, v5.s[0]
+        add             sgx2, sgx2, w19
+        add             sgy2, sgy2, w20
+
+        movi            v5.4s, #1
+        cmgt            v17.4s, v4.4s, #0           // mask > 0
+        cmlt            v18.4s, v4.4s, #0           // mask < 0
+        and             v17.16b, v17.16b, v5.16b
+        and             v18.16b, v18.16b, v5.16b
+        neg             v19.4s, v18.4s
+        add             v20.4s, v17.4s, v19.4s      // VVC_SIGN(tempv)
+        smull           v21.2d, v20.2s, v2.2s
+        smlal2          v21.2d, v20.4s, v2.4s
+        addp            d21, v21.2d
+        mov             w19, v21.s[0]
+        add             sgxgy, sgxgy, w19
+
+        smull           v16.2d, v20.2s, v0.2s
+        smlal2          v16.2d, v20.4s, v0.4s
+        addp            d16, v16.2d
+        mov             w19, v16.s[0]
+        sub             sgydi, sgydi, w19
+
+        cmgt            v17.4s, v2.4s, #0
+        cmlt            v18.4s, v2.4s, #0
+        and             v17.16b, v17.16b, v5.16b
+        and             v18.16b, v18.16b, v5.16b
+        neg             v21.4s, v17.4s
+        add             v16.4s, v21.4s, v18.4s      // -VVC_SIGN(temph)
+        smull           v20.2d, v16.2s, v0.2s
+        smlal2          v20.2d, v16.4s, v0.4s
+        addp            d20, v20.2d
+        mov             w19, v20.s[0]
+        add             sgxdi, sgxdi, w19
+
+        // x = BDOF_MIN_BLOCK_SIZE
+        mov             x16, #BDOF_MIN_BLOCK_SIZE   // dx
+        tst             pad_mask, #4
+        b.eq            3f
+        mov             x16, #(BDOF_MIN_BLOCK_SIZE - 1)
+3:
+        derive_bdof_vx_vy_x_begin_end
+
+        add             y, y, #1
+        cmp             y, #(BDOF_MIN_BLOCK_SIZE)
+        mov             x13, y
+        b.gt            4f
+        b.lt            1b
+        tst             pad_mask, #8
+        b.eq            1b
+        sub             x13, x13, #1                // pad bottom
+        b               1b
+4:
+        mov             w3, #31
+        mov             w14, #0
+        mov             w16, #-15
+        mov             w17, #15
+        cbz             sgx2, 5f
+        clz             w12, sgx2
+        lsl             sgxdi, sgxdi, #2
+        sub             w13, w3, w12                // log2(sgx2)
+        asr             sgxdi, sgxdi, w13
+        cmp             sgxdi, w16
+        csel            w14, w16, sgxdi, lt         // clip to -15
+        b.le            5f
+        cmp             sgxdi, w17
+        csel            w14, w17, sgxdi, gt         // clip to 15
+5:
+        str             w14, [vx]
+
+        mov             w15, #0
+        cbz             sgy2, 6f
+        lsl             sgydi, sgydi, #2
+        smull           x14, w14, sgxgy
+        asr             w14, w14, #1
+        sub             sgydi, sgydi, w14
+        clz             w12, sgy2
+        sub             w13, w3, w12                // log2(sgy2)
+        asr             sgydi, sgydi, w13
+        cmp             sgydi, w16
+        csel            w15, w16, sgydi, lt         // clip to -15
+        b.le            6f
+        cmp             sgydi, w17
+        csel            w15, w17, sgydi, gt         // clip to 15
+6:
+        str             w15, [vy]
+        ldp             x25, x26, [sp, #16]
+        ldp             x23, x24, [sp, #32]
+        ldp             x21, x22, [sp, #48]
+        ldp             x19, x20, [sp, #64]
+        ldp             x27, x28, [sp], #80
+        ret
+.unreq src0
+.unreq src1
+.unreq pad_mask
+.unreq gh
+.unreq gv
+.unreq vx
+.unreq vy
+.unreq sgx2
+.unreq sgy2
+.unreq sgxgy
+.unreq sgxdi
+.unreq sgydi
+.unreq y
+endfunc
+
diff --git a/libavcodec/aarch64/vvc/of_template.c b/libavcodec/aarch64/vvc/of_template.c
new file mode 100644
index 0000000000..508ea6d99d
--- /dev/null
+++ b/libavcodec/aarch64/vvc/of_template.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2024 Zhao Zhili
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavcodec/bit_depth_template.c"
+
+void ff_vvc_prof_grad_filter_8x_neon(int16_t *gradient_h,
+                                     int16_t *gradient_v,
+                                     const ptrdiff_t gradient_stride,
+                                     const int16_t *_src,
+                                     const ptrdiff_t src_stride,
+                                     const int width, const int height);
+
+void ff_vvc_derive_bdof_vx_vy_neon(
+        const int16_t *_src0, const int16_t *_src1, int pad_mask,
+        const int16_t **gradient_h, const int16_t **gradient_v,
+        int *vx, int *vy);
+
+void FUNC2(ff_vvc_apply_bdof_min_block, BIT_DEPTH, _neon)(pixel* dst,
+        const ptrdiff_t dst_stride, const int16_t *src0, const int16_t *src1,
+        const int16_t **gh, const int16_t **gv, const int vx, const int vy);
+
+static void FUNC(apply_bdof)(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                             const int16_t *_src0, const int16_t *_src1,
+                             const int block_w, const int block_h)
+{
+    int16_t gradient_h[2][BDOF_BLOCK_SIZE * BDOF_BLOCK_SIZE];
+    int16_t gradient_v[2][BDOF_BLOCK_SIZE * BDOF_BLOCK_SIZE];
+    int vx, vy;
+    const ptrdiff_t dst_stride = _dst_stride / sizeof(pixel);
+    pixel* dst = (pixel*)_dst;
+
+    ff_vvc_prof_grad_filter_8x_neon(gradient_h[0], gradient_v[0], BDOF_BLOCK_SIZE,
+                                    _src0, MAX_PB_SIZE, block_w, block_h);
+    ff_vvc_prof_grad_filter_8x_neon(gradient_h[1], gradient_v[1], BDOF_BLOCK_SIZE,
+                                    _src1, MAX_PB_SIZE, block_w, block_h);
+
+    for (int y = 0; y < block_h; y += BDOF_MIN_BLOCK_SIZE) {
+        for (int x = 0; x < block_w; x += BDOF_MIN_BLOCK_SIZE) {
+            const int16_t* src0 = _src0 + y * MAX_PB_SIZE + x;
+            const int16_t* src1 = _src1 + y * MAX_PB_SIZE + x;
+            pixel *d = dst + x;
+            const int idx = BDOF_BLOCK_SIZE * y + x;
+            const int16_t* gh[] = { gradient_h[0] + idx, gradient_h[1] + idx };
+            const int16_t* gv[] = { gradient_v[0] + idx, gradient_v[1] + idx };
+            const int pad_mask = !x | ((!y) << 1) |
+                        ((x + BDOF_MIN_BLOCK_SIZE == block_w) << 2) |
+                        ((y + BDOF_MIN_BLOCK_SIZE == block_h) << 3);
+            ff_vvc_derive_bdof_vx_vy_neon(src0, src1, pad_mask, gh, gv, &vx, &vy);
+            FUNC2(ff_vvc_apply_bdof_min_block, BIT_DEPTH, _neon)(d, dst_stride, src0, src1, gh, gv, vx, vy);
+        }
+        dst += BDOF_MIN_BLOCK_SIZE * dst_stride;
+    }
+}

From patchwork Sat Sep 21 17:41:45 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhao Zhili
X-Patchwork-Id: 51686
 22koUqFGqUelYtXSLr4vXJ9h9mom2nRfRa+sIi3nrb19Wxts0fsReFU3rWeuwBvzKTPSoArepCRf
 RI+Gg0UiE+EVSiRb/olJ8zfJMdLJdykTGJqDXHBLBC7cOvy7b/otWiHweMxX4+GPwiQt0vhcz3M6
 O0nl3oEbk7qEPoPM9P7uYTdJvm3bTH6QHhdM52csO6NF8LH8P79+1pGYHvXxUrOufvputMZazZR6
 JPHtuVKu3QJDWZ8CYfYi/CUhoXDuGp18Wkah6YLjmDGs471UpbY1NffbRkI7JLoG0Q1Hb80Ra+JC
 CbQWlz9ShhTKiY1VnJAWAYambh3iBNpUCRMlHEGoaxdS4VRu/hOp3FCp/fuP6A8TV5VrVt2MOPcb
 ABAosb0zsRvPxjHJAQeVtYNCFNB/J9is/848pknuXQvKYgfTWGmoBk0k5cVHCfuwK+JcwRF+m1PH
 0FAhYW/leXTCGM8iBGY5lU6g0ubCG3DgUM+SCOFDQbXOc8hBr2CWY1tz2iG77WmGdrpAFRXgTytC
 lYo5O/mE9mZhBovDPAluGhKq92NvR/aFbgBfCYg8LCjK+PaXutQ0DbuQ8hj3JOoegp9z0i0JYI/J
 wl31lyI1KW++HmTwcdqCqdxEglDZCVVzvxyAAK71cH3pIFxRWVjXTmxZrYSXR7Bu8M73CIgBiD+y
 CBkJoGA+Fq6Vf5YeY2crIlBcFi6WufFD2USezxdqNUeCa+RgDJEBYHJZSIsBRz
X-QQ-XMRINFO: OD9hHCdaPRBwq3WW+NvGbIU=
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 22 Sep 2024 01:41:45 +0800
X-OQ-MSGID: <20240921174146.10928-3-quinkblack@foxmail.com>
X-Mailer: git-send-email 2.42.0
In-Reply-To: <20240921174146.10928-1-quinkblack@foxmail.com>
References: <20240921174146.10928-1-quinkblack@foxmail.com>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 3/4] aarch64/vvc: Add dmvr_hv
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: FFmpeg development discussions and patches
Cc: Zhao Zhili
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"
X-TUID: PABTqOFDa5/L

From: Zhao Zhili

dmvr_hv_8_12x20_c:       8.0 ( 1.00x)
dmvr_hv_8_12x20_neon:    1.2 ( 6.62x)
dmvr_hv_8_20x12_c:       8.0 ( 1.00x)
dmvr_hv_8_20x12_neon:    0.9 ( 8.37x)
dmvr_hv_8_20x20_c:      12.9 ( 1.00x)
dmvr_hv_8_20x20_neon:    1.7 ( 7.62x)
dmvr_hv_10_12x20_c:      7.0 ( 1.00x)
dmvr_hv_10_12x20_neon:   1.7 ( 4.09x)
dmvr_hv_10_20x12_c:      7.0 ( 1.00x)
dmvr_hv_10_20x12_neon:   1.7 ( 4.09x)
dmvr_hv_10_20x20_c:     11.2 ( 1.00x)
dmvr_hv_10_20x20_neon:   2.7 ( 4.15x)
dmvr_hv_12_12x20_c:      6.5 ( 1.00x)
dmvr_hv_12_12x20_neon:   1.7 ( 3.79x)
dmvr_hv_12_20x12_c:      6.5 ( 1.00x)
dmvr_hv_12_20x12_neon:   1.7 ( 3.79x)
dmvr_hv_12_20x20_c:     10.2 ( 1.00x)
dmvr_hv_12_20x20_neon:   2.2 ( 4.64x)
---
 libavcodec/aarch64/vvc/dsp_init.c |  12 ++
 libavcodec/aarch64/vvc/inter.S    | 307 ++++++++++++++++++++++++++++++
 2 files changed, 319 insertions(+)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index 03a4c62310..48642e98e6 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -89,6 +89,15 @@ W_AVG_FUN(8)
 W_AVG_FUN(10)
 W_AVG_FUN(12)
 
+#define DMVR_FUN(fn, bd) \
+    void ff_vvc_dmvr_ ## fn ## bd ## _neon(int16_t *dst, \
+        const uint8_t *_src, const ptrdiff_t _src_stride, const int height, \
+        const intptr_t mx, const intptr_t my, const int width);
+
+DMVR_FUN(hv_, 8)
+DMVR_FUN(hv_, 10)
+DMVR_FUN(hv_, 12)
+
 void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -162,6 +171,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
             c->inter.avg = ff_vvc_avg_8_neon;
             c->inter.w_avg = vvc_w_avg_8;
             c->inter.apply_bdof = apply_bdof_8;
+            c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_8_neon;
 
             for (int i = 0; i < FF_ARRAY_ELEMS(c->sao.band_filter); i++)
                 c->sao.band_filter[i] = ff_h26x_sao_band_filter_8x8_8_neon;
@@ -204,6 +214,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
             c->inter.avg = ff_vvc_avg_10_neon;
             c->inter.w_avg = vvc_w_avg_10;
             c->inter.apply_bdof = apply_bdof_10;
+            c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_10_neon;
 
             c->alf.filter[LUMA] = alf_filter_luma_10_neon;
             c->alf.filter[CHROMA] = alf_filter_chroma_10_neon;
@@ -211,6 +222,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
             c->inter.avg = ff_vvc_avg_12_neon;
             c->inter.w_avg = vvc_w_avg_12;
             c->inter.apply_bdof = apply_bdof_12;
+            c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_12_neon;
 
             c->alf.filter[LUMA] = alf_filter_luma_12_neon;
             c->alf.filter[CHROMA] = alf_filter_chroma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 8cfacef44f..b652e0d609 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -570,3 +570,310 @@ function ff_vvc_derive_bdof_vx_vy_neon, export=1
         .unreq          y
 endfunc
 
+/* x0: int16_t *dst
+ * x1: const uint8_t *_src
+ * x2: const ptrdiff_t _src_stride
+ * w3: const int height
+ * x4: const intptr_t mx
+ * x5: const intptr_t my
+ * w6: const int width
+ */
+function ff_vvc_dmvr_hv_8_neon, export=1
+        dst             .req x0
+        src             .req x1
+        src_stride      .req x2
+        height          .req w3
+        mx              .req x4
+        my              .req x5
+        width           .req w6
+        tmp0            .req x7
+        tmp1            .req x8
+
+        sub             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+
+        movrel          x9, X(ff_vvc_inter_luma_dmvr_filters)
+        add             x12, x9, mx, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        mov             tmp0, sp
+        add             tmp1, tmp0, #(VVC_MAX_PB_SIZE * 2)
+        // We know the values are positive
+        dup             v0.8h, w10                  // filter_x[0]
+        dup             v1.8h, w11                  // filter_x[1]
+
+        add             x12, x9, my, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        sxtw            x6, w6
+        movi            v30.8h, #(1 << (8 - 7))     // offset1
+        movi            v31.8h, #8                  // offset2
+        dup             v2.8h, w10                  // filter_y[0]
+        dup             v3.8h, w11                  // filter_y[1]
+
+        // The only valid values for width are 8 + 4 and 16 + 4
+        cmp             width, #16
+        mov             w10, #0                     // start filter_y or not
+        add             height, height, #1
+        sub             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        sub             src_stride, src_stride, x6
+        cset            w15, gt                     // width > 16
+1:
+        mov             x12, tmp0
+        mov             x13, tmp1
+        mov             x14, dst
+        cbz             w15, 2f
+
+        // width > 16
+        ldur            q5, [src, #1]
+        ldr             q4, [src], #16
+        uxtl            v7.8h, v5.8b
+        uxtl2           v17.8h, v5.16b
+        uxtl            v6.8h, v4.8b
+        uxtl2           v16.8h, v4.16b
+        mul             v6.8h, v6.8h, v0.8h
+        mul             v16.8h, v16.8h, v0.8h
+        mla             v6.8h, v7.8h, v1.8h
+        mla             v16.8h, v17.8h, v1.8h
+        add             v6.8h, v6.8h, v30.8h
+        add             v16.8h, v16.8h, v30.8h
+        ushr            v6.8h, v6.8h, #(8 - 6)
+        ushr            v7.8h, v16.8h, #(8 - 6)
+        stp             q6, q7, [x13], #32
+
+        cbz             w10, 3f
+
+        ldp             q16, q17, [x12], #32
+        mul             v16.8h, v16.8h, v2.8h
+        mul             v17.8h, v17.8h, v2.8h
+        mla             v16.8h, v6.8h, v3.8h
+        mla             v17.8h, v7.8h, v3.8h
+        add             v16.8h, v16.8h, v31.8h
+        add             v17.8h, v17.8h, v31.8h
+        ushr            v16.8h, v16.8h, #4
+        ushr            v17.8h, v17.8h, #4
+        stp             q16, q17, [x14], #32
+        b               3f
+2:
+        // width > 8
+        ldur            d5, [src, #1]
+        ldr             d4, [src], #8
+        uxtl            v7.8h, v5.8b
+        uxtl            v6.8h, v4.8b
+        mul             v6.8h, v6.8h, v0.8h
+        mla             v6.8h, v7.8h, v1.8h
+        add             v6.8h, v6.8h, v30.8h
+        ushr            v6.8h, v6.8h, #(8 - 6)
+        str             q6, [x13], #16
+
+        cbz             w10, 3f
+
+        ldr             q16, [x12], #16
+        mul             v16.8h, v16.8h, v2.8h
+        mla             v16.8h, v6.8h, v3.8h
+        add             v16.8h, v16.8h, v31.8h
+        ushr            v16.8h, v16.8h, #4
+        str             q16, [x14], #16
+3:
+        ldr             s5, [src, #1]
+        ldr             s4, [src], #4
+        uxtl            v7.8h, v5.8b
+        uxtl            v6.8h, v4.8b
+        mul             v6.4h, v6.4h, v0.4h
+        mla             v6.4h, v7.4h, v1.4h
+        add             v6.4h, v6.4h, v30.4h
+        ushr            v6.4h, v6.4h, #(8 - 6)
+        str             d6, [x13], #8
+
+        cbz             w10, 4f
+
+        ldr             d16, [x12], #8
+        mul             v16.4h, v16.4h, v2.4h
+        mla             v16.4h, v6.4h, v3.4h
+        add             v16.4h, v16.4h, v31.4h
+        ushr            v16.4h, v16.4h, #4
+        str             d16, [x14], #8
+4:
+        subs            height, height, #1
+        mov             w10, #1
+        add             src, src, src_stride
+        add             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        eor             tmp0, tmp0, tmp1
+        eor             tmp1, tmp0, tmp1
+        eor             tmp0, tmp0, tmp1
+        b.ne            1b
+
+        add             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+        ret
+endfunc
+
+function ff_vvc_dmvr_hv_12_neon, export=1
+        movi            v29.4s, #(12 - 6)
+        movi            v30.4s, #(1 << (12 - 7))    // offset1
+        b               0f
+endfunc
+
+function ff_vvc_dmvr_hv_10_neon, export=1
+        movi            v29.4s, #(10 - 6)
+        movi            v30.4s, #(1 << (10 - 7))    // offset1
+0:
+        movi            v31.4s, #8                  // offset2
+        neg             v29.4s, v29.4s
+
+        sub             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+
+        movrel          x9, X(ff_vvc_inter_luma_dmvr_filters)
+        add             x12, x9, mx, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        mov             tmp0, sp
+        add             tmp1, tmp0, #(VVC_MAX_PB_SIZE * 2)
+        // We know the values are positive
+        dup             v0.8h, w10                  // filter_x[0]
+        dup             v1.8h, w11                  // filter_x[1]
+
+        add             x12, x9, my, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        sxtw            x6, w6
+        dup             v2.8h, w10                  // filter_y[0]
+        dup             v3.8h, w11                  // filter_y[1]
+
+        // The only valid values for width are 8 + 4 and 16 + 4
+        cmp             width, #16
+        mov             w10, #0                     // start filter_y or not
+        add             height, height, #1
+        sub             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        sub             src_stride, src_stride, x6, lsl #1
+        cset            w15, gt                     // width > 16
+1:
+        mov             x12, tmp0
+        mov             x13, tmp1
+        mov             x14, dst
+        cbz             w15, 2f
+
+        // width > 16
+        add             x16, src, #2
+        ldp             q6, q16, [src], #32
+        ldp             q7, q17, [x16]
+        umull           v4.4s, v6.4h, v0.4h
+        umull2          v5.4s, v6.8h, v0.8h
+        umull           v18.4s, v16.4h, v0.4h
+        umull2          v19.4s, v16.8h, v0.8h
+        umlal           v4.4s, v7.4h, v1.4h
+        umlal2          v5.4s, v7.8h, v1.8h
+        umlal           v18.4s, v17.4h, v1.4h
+        umlal2          v19.4s, v17.8h, v1.8h
+
+        add             v4.4s, v4.4s, v30.4s
+        add             v5.4s, v5.4s, v30.4s
+        add             v18.4s, v18.4s, v30.4s
+        add             v19.4s, v19.4s, v30.4s
+        ushl            v4.4s, v4.4s, v29.4s
+        ushl            v5.4s, v5.4s, v29.4s
+        ushl            v18.4s, v18.4s, v29.4s
+        ushl            v19.4s, v19.4s, v29.4s
+        uqxtn           v6.4h, v4.4s
+        uqxtn2          v6.8h, v5.4s
+        uqxtn           v7.4h, v18.4s
+        uqxtn2          v7.8h, v19.4s
+        stp             q6, q7, [x13], #32
+
+        cbz             w10, 3f
+
+        ldp             q4, q5, [x12], #32
+        umull           v17.4s, v4.4h, v2.4h
+        umull2          v18.4s, v4.8h, v2.8h
+        umull           v19.4s, v5.4h, v2.4h
+        umull2          v20.4s, v5.8h, v2.8h
+        umlal           v17.4s, v6.4h, v3.4h
+        umlal2          v18.4s, v6.8h, v3.8h
+        umlal           v19.4s, v7.4h, v3.4h
+        umlal2          v20.4s, v7.8h, v3.8h
+        add             v17.4s, v17.4s, v31.4s
+        add             v18.4s, v18.4s, v31.4s
+        add             v19.4s, v19.4s, v31.4s
+        add             v20.4s, v20.4s, v31.4s
+        ushr            v17.4s, v17.4s, #4
+        ushr            v18.4s, v18.4s, #4
+        ushr            v19.4s, v19.4s, #4
+        ushr            v20.4s, v20.4s, #4
+        uqxtn           v6.4h, v17.4s
+        uqxtn2          v6.8h, v18.4s
+        uqxtn           v7.4h, v19.4s
+        uqxtn2          v7.8h, v20.4s
+        stp             q6, q7, [x14], #32
+        b               3f
+2:
+        // width > 8
+        ldur            q7, [src, #2]
+        ldr             q6, [src], #16
+        umull           v4.4s, v6.4h, v0.4h
+        umull2          v5.4s, v6.8h, v0.8h
+        umlal           v4.4s, v7.4h, v1.4h
+        umlal2          v5.4s, v7.8h, v1.8h
+
+        add             v4.4s, v4.4s, v30.4s
+        add             v5.4s, v5.4s, v30.4s
+        ushl            v4.4s, v4.4s, v29.4s
+        ushl            v5.4s, v5.4s, v29.4s
+        uqxtn           v6.4h, v4.4s
+        uqxtn2          v6.8h, v5.4s
+        str             q6, [x13], #16
+
+        cbz             w10, 3f
+
+        ldr             q16, [x12], #16
+        umull           v17.4s, v16.4h, v2.4h
+        umull2          v18.4s, v16.8h, v2.8h
+        umlal           v17.4s, v6.4h, v3.4h
+        umlal2          v18.4s, v6.8h, v3.8h
+        add             v17.4s, v17.4s, v31.4s
+        add             v18.4s, v18.4s, v31.4s
+        ushr            v17.4s, v17.4s, #4
+        ushr            v18.4s, v18.4s, #4
+        uqxtn           v16.4h, v17.4s
+        uqxtn2          v16.8h, v18.4s
+        str             q16, [x14], #16
+3:
+        ldr             d7, [src, #2]
+        ldr             d6, [src], #8
+        umull           v4.4s, v7.4h, v1.4h
+        umlal           v4.4s, v6.4h, v0.4h
+        add             v4.4s, v4.4s, v30.4s
+        ushl            v4.4s, v4.4s, v29.4s
+        uqxtn           v6.4h, v4.4s
+        str             d6, [x13], #8
+
+        cbz             w10, 4f
+
+        ldr             d16, [x12], #8
+        umull           v17.4s, v16.4h, v2.4h
+        umlal           v17.4s, v6.4h, v3.4h
+        add             v17.4s, v17.4s, v31.4s
+        ushr            v17.4s, v17.4s, #4
+        uqxtn           v16.4h, v17.4s
+        str             d16, [x14], #8
+4:
+        subs            height, height, #1
+        mov             w10, #1
+        add             src, src, src_stride
+        add             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        eor             tmp0, tmp0, tmp1
+        eor             tmp1, tmp0, tmp1
+        eor             tmp0, tmp0, tmp1
+        b.ne            1b
+
+        add             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+        ret
+
+.unreq dst
+.unreq src
+.unreq src_stride
+.unreq height
+.unreq mx
+.unreq my
+.unreq width
+.unreq tmp0
+.unreq tmp1
+endfunc
+

From patchwork Sat Sep 21 17:41:46 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhao Zhili
X-Patchwork-Id: 51688
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a59:d154:0:b0:48e:c0f8:d0de with SMTP id bt20csp1653731vqb; Sat, 21 Sep 2024 10:49:17 -0700 (PDT)
X-Forwarded-Encrypted: i=2; AJvYcCX1rCevGiSFI9NvjMdYkDEmM74yIhXd/lLWcoI7Z5EhDW0FIzZ4hQPxPV6Kc17EJHLBQO8znQwyVkUz+m4jlCSo@gmail.com
X-Google-Smtp-Source: AGHT+IH7zVnRV0aW9c+Zp5tAIXWSIzDItwIvE1D/G6wcAIoKTNIVzDMjcrsxP4qcjQMjs6NE7Sxn
X-Received: by 2002:a17:907:e60b:b0:a8a:8c92:1c9c with SMTP id a640c23a62f3a-a90d50045c9mr585687166b.29.1726940957088; Sat, 21 Sep 2024 10:49:17 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1726940957; cv=none; d=google.com; s=arc-20240605; b=Kh5BUmyB+FQeaGWtyJ9knLQU7i0okGFfU4O8aWUAn2XgIG7Xanm60r0s1XjmR6bTB9
 X7oJXfs/asqPIAyOU+kZ0kIC6Vz0r1o6cmFUFSS87Wwpgh9pNRrZINeqRCrSUnOuQgi0
W/YkAzxWN3lueMgp2OsGgL9/FlYGHgfA4+avd4ThB2As1pq4zR+qwiVHio7YnRHl+QwK kuUfwa7b1yM/J7P7iRkrEbg+USX1ZOUiy2lTZoaD9RWYUp2jTlm4BGttaQ72gXrHFWCK wJS1Urj5l43YVnMgns9tNIfa4sAyqThD+knTOt7Wj5lhUagBHbvDS/1xrpOlvXOBbEX4 QGsQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to:date :to:from:message-id:dkim-signature:delivered-to; bh=yXfRTdwg7ECYokDv+sW7d2u9qREwA8Dc2bA6usG4Pqw=; fh=HnHYuZ9XgUo86ZRXTLWWmQxhslYEI9B9taZ5X1DLFfc=; b=V2O3cCYGpowvRGQJjJE9mDR2jy84BpTKFuBU4jFKKwa126xSdyYpij+XJ6IV4kTXa/ wSxV7Uzp4vnioCRdZdgPtDUGZ0bN3RzT6fFJLx1uUsaChNu9V8bKDl2CpjusVkksWs6g 4JXSBscTqfRtqrel14ywwZBdPA90uWEUgZDBLo/1eUEJYxFetbWxPOR82RGsZ6hxOM13 mX12B3M4DpuDHF6oBYdTtoskw1nEMO7oWSBAR5knVbXCsnY/kBhCmMUikvVjiZUJIenu CtnEnrCpZVS51J/1GvZKaKR3dER/pm7VSt0iwmWhTaGqFhj58iQuV0G7gk7qZetOoS6g UmIA==; dara=google.com ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@foxmail.com header.s=s201512 header.b=GvDyCntS; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=foxmail.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. 
[79.124.17.100]) by mx.google.com with ESMTP id a640c23a62f3a-a9061336b78si1138390466b.800.2024.09.21.10.49.16; Sat, 21 Sep 2024 10:49:17 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@foxmail.com header.s=s201512 header.b=GvDyCntS; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=foxmail.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id A78D068DBD4; Sat, 21 Sep 2024 20:42:09 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from out162-62-58-211.mail.qq.com (out162-62-58-211.mail.qq.com [162.62.58.211]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E904C68DB0E for ; Sat, 21 Sep 2024 20:41:59 +0300 (EEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=foxmail.com; s=s201512; t=1726940511; bh=nvLU/myQbNP/7EfIYvqs8S+3KyMjXp03+WQGgsUAQAQ=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=GvDyCntSXx8CsOK4ZJVh5P7br/JEbHv7Q32QgFSdUMXy8zzE8OjoORLLRdh2z01gr z4feKujFOetavNj/s86Yg3cdsKlnWr9LHwSk03UTIkFzfryIlTvkhPBp3HsJs1pnen o0g+mygBgPu8j4s2AtEqbNmBNgVYEDSgcrLKgPQ4= Received: from ZHILIZHAO-MB1.tencent.com ([113.118.103.137]) by newxmesmtplogicsvrsza29-0.qq.com (NewEsmtp) with SMTP id A7081A2C; Sun, 22 Sep 2024 01:41:48 +0800 X-QQ-mid: xmsmtpt1726940511tmslafmvs Message-ID: X-QQ-XMAILINFO: M01R5JnlBdC2MBzVZ+E607mIJfolGjrV+a3YSTDFj41LszR8hrlM5kIBnFIh3h vsr/jtUpMDAOkfFjTySZYfIzWFYjAQXOj1ASvwXGMMRiZFZFsEYREf0LV3270u4ecpO6OQWBcWGK C2sqyCxEvy+bVwj0xXZu6wrhlKsG0iy2lxNgnMLwFt9/8MtblPiXrV/FCEKkajF7mtPMHsjo7H4t EzqLbOVIJbzSFd1Ti78bUSDJA2qOcXduJ4nx5HIzb/IMNp3g1T0RnnsXuMRNUpOHFQkbpvym1q0O 
 kjlGUUqCwkkRwpyn5mFcHF6UJFmWsL5hxFlWHtdf6drC2REKd7D33fexwogp02aTvWbHdYYFb0Uz
 /1BRGrN1rpNcnxdWH8OO5ew2R9Ijc9SGcOYo4S1iKemJdWRdhvIvAdlB03I5XiMLCrx9jkzzSMmQ
 kG/eLC1FLBGWYNXzizvUGjeaEHMP7PfnLPQmgNpmRPZ8ynAy5o+/VQuLCbB/mBZjawAjFeCDQGLk
 TnGdcTcfnd+AeCXqrSSQQkGfXe2ohT4qTew/qG1U3DxtDspd2eT+7SZ6Y4+wjylBKtif5iS4RLWx
 wfk935nqj/k8ZukAFdZ7YA5MehFjLnv/FcgMjw4hl1aDlZ4qOcuFcBl4leox+jQj30yIRLhGWTEa
 1EapPommyjN8v/TZ13ebWfiu7I4ZLvrK9Vx23vGT6bQ4nC6cA8d9YrNrtD734LkxiAjITglU2dv3
 5ertdK0OOOaWqX+DAugUnuiheykTlsd/cTuCQTLygOFjyna8mn1HgjdYhaoos98GRjIZkHZnOywa
 n8UzZ60oZqMZIk3Nxat2wfozGifihceslwY7HTct/6WdxK6yPvcgVyzhPfPCtos1JdiZp0QEvRZg
 zRgFeLg7Zr9rQ3uEXulPXlGOhivRLkOShCRYLHMG/i2bL+104triwofkJpFhwCBgtcBiC8qcGCh2
 0eBO+7zYvg/00fxZjhaQ9WvsHjEOUDpO7kBx/rjsk1EedmLduEIurzK8+f1f2ogA+qXKmJTutyxt
 QGoYBK0A==
X-QQ-XMRINFO: OWPUhxQsoeAVDbp3OJHYyFg=
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 22 Sep 2024 01:41:46 +0800
X-OQ-MSGID: <20240921174146.10928-4-quinkblack@foxmail.com>
X-Mailer: git-send-email 2.42.0
In-Reply-To: <20240921174146.10928-1-quinkblack@foxmail.com>
References: <20240921174146.10928-1-quinkblack@foxmail.com>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 4/4] aarch64/vvc: Add dmvr
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: FFmpeg development discussions and patches
Cc: Zhao Zhili
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"
X-TUID: UahU7m39flm9

From: Zhao Zhili

dmvr_8_12x20_c:      2.2 ( 1.00x)
dmvr_8_12x20_neon:   0.5 ( 4.50x)
dmvr_8_20x12_c:      2.0 ( 1.00x)
dmvr_8_20x12_neon:   0.2 ( 8.00x)
dmvr_8_20x20_c:      3.2 ( 1.00x)
dmvr_8_20x20_neon:   0.5 ( 6.50x)
dmvr_12_12x20_c:     2.2 ( 1.00x)
dmvr_12_12x20_neon:  0.5 ( 4.50x)
dmvr_12_20x12_c:     2.2 ( 1.00x)
dmvr_12_20x12_neon:  0.5 ( 4.50x)
dmvr_12_20x20_c:     3.2 ( 1.00x)
dmvr_12_20x20_neon:  0.8 ( 4.33x)
---
 libavcodec/aarch64/vvc/dsp_init.c |  4 ++
 libavcodec/aarch64/vvc/inter.S    | 94 ++++++++++++++++++++++++++++++-
 2 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index 48642e98e6..36611a6f5d 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -94,6 +94,8 @@ W_AVG_FUN(12)
         const uint8_t *_src, const ptrdiff_t _src_stride, const int height, \
         const intptr_t mx, const intptr_t my, const int width);
 
+DMVR_FUN(, 8)
+DMVR_FUN(, 12)
 DMVR_FUN(hv_, 8)
 DMVR_FUN(hv_, 10)
 DMVR_FUN(hv_, 12)
@@ -171,6 +173,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
             c->inter.avg = ff_vvc_avg_8_neon;
             c->inter.w_avg = vvc_w_avg_8;
             c->inter.apply_bdof = apply_bdof_8;
+            c->inter.dmvr[0][0] = ff_vvc_dmvr_8_neon;
             c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_8_neon;
 
             for (int i = 0; i < FF_ARRAY_ELEMS(c->sao.band_filter); i++)
@@ -222,6 +225,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
             c->inter.avg = ff_vvc_avg_12_neon;
             c->inter.w_avg = vvc_w_avg_12;
             c->inter.apply_bdof = apply_bdof_12;
+            c->inter.dmvr[0][0] = ff_vvc_dmvr_12_neon;
             c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_12_neon;
 
             c->alf.filter[LUMA] = alf_filter_luma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index b652e0d609..1f4706e2fa 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -578,7 +578,7 @@ endfunc
  * x5: const intptr_t my
  * w6: const int width
  */
-function ff_vvc_dmvr_hv_8_neon, export=1
+function ff_vvc_dmvr_8_neon, export=1
         dst             .req x0
         src             .req x1
         src_stride      .req x2
@@ -586,6 +586,98 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mx              .req x4
         my              .req x5
         width           .req w6
+
+        sxtw            x6, w6
+        mov             x7, #(VVC_MAX_PB_SIZE * 2 + 8)
+        cmp             width, #16
+        sub             src_stride, src_stride, x6
+        cset            w15, gt                     // width > 16
+        movi            v16.8h, #2                  // DMVR_SHIFT
+        sub             x7, x7, x6, lsl #1
+1:
+        cbz             w15, 2f
+        ldr             q0, [src], #16
+        uxtl            v1.8h, v0.8b
+        uxtl2           v2.8h, v0.16b
+        ushl            v1.8h, v1.8h, v16.8h
+        ushl            v2.8h, v2.8h, v16.8h
+        stp             q1, q2, [dst], #32
+        b               3f
+2:
+        ldr             d0, [src], #8
+        uxtl            v1.8h, v0.8b
+        ushl            v1.8h, v1.8h, v16.8h
+        str             q1, [dst], #16
+3:
+        subs            height, height, #1
+        ldr             s3, [src], #4
+        uxtl            v4.8h, v3.8b
+        ushl            v4.4h, v4.4h, v16.4h
+        st1             {v4.4h}, [dst], x7
+
+        add             src, src, src_stride
+        b.ne            1b
+
+        ret
+endfunc
+
+function ff_vvc_dmvr_12_neon, export=1
+        sxtw            x6, w6
+        mov             x7, #(VVC_MAX_PB_SIZE * 2 + 8)
+        cmp             width, #16
+        sub             src_stride, src_stride, x6, lsl #1
+        cset            w15, gt                     // width > 16
+        movi            v16.4s, #2                  // offset4
+        sub             x7, x7, x6, lsl #1
+1:
+        cbz             w15, 2f
+        ldp             q0, q1, [src], #32
+        uxtl            v2.4s, v0.4h
+        uxtl2           v3.4s, v0.8h
+        uxtl            v4.4s, v1.4h
+        uxtl2           v5.4s, v1.8h
+        add             v2.4s, v2.4s, v16.4s
+        add             v3.4s, v3.4s, v16.4s
+        add             v4.4s, v4.4s, v16.4s
+        add             v5.4s, v5.4s, v16.4s
+        ushr            v2.4s, v2.4s, #2
+        ushr            v3.4s, v3.4s, #2
+        ushr            v4.4s, v4.4s, #2
+        ushr            v5.4s, v5.4s, #2
+        uqxtn           v2.4h, v2.4s
+        uqxtn2          v2.8h, v3.4s
+        uqxtn           v4.4h, v4.4s
+        uqxtn2          v4.8h, v5.4s
+
+        stp             q2, q4, [dst], #32
+        b               3f
+2:
+        ldr             q0, [src], #16
+        uxtl            v2.4s, v0.4h
+        uxtl2           v3.4s, v0.8h
+        add             v2.4s, v2.4s, v16.4s
+        add             v3.4s, v3.4s, v16.4s
+        ushr            v2.4s, v2.4s, #2
+        ushr            v3.4s, v3.4s, #2
+        uqxtn           v2.4h, v2.4s
+        uqxtn2          v2.8h, v3.4s
+        str             q2, [dst], #16
+3:
+        subs            height, height, #1
+        ldr             d0, [src], #8
+        uxtl            v3.4s, v0.4h
+        add             v3.4s, v3.4s, v16.4s
+        ushr            v3.4s, v3.4s, #2
+        uqxtn           v3.4h, v3.4s
+        st1             {v3.4h}, [dst], x7
+
+        add             src, src, src_stride
+        b.ne            1b
+
+        ret
+endfunc
+
+function ff_vvc_dmvr_hv_8_neon, export=1
         tmp0            .req x7
         tmp1            .req x8