From patchwork Mon Sep 23 09:05:38 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhao Zhili
X-Patchwork-Id: 51733
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 23 Sep 2024 17:05:38 +0800
X-OQ-MSGID: <20240923090540.18807-1-quinkblack@foxmail.com>
X-Mailer: git-send-email 2.42.0
Subject: [FFmpeg-devel] [PATCH v2 1/3] aarch64/vvc: Add w_avg
List-Id: FFmpeg development discussions and patches
Cc: Zhao Zhili

From: Zhao Zhili

w_avg_8_2x2_c:            0.0 ( 0.00x)
w_avg_8_2x2_neon:         0.0 ( 0.00x)
w_avg_8_4x4_c:            0.2 ( 1.00x)
w_avg_8_4x4_neon:         0.0 ( 0.00x)
w_avg_8_8x8_c:            1.2 ( 1.00x)
w_avg_8_8x8_neon:         0.2 ( 5.00x)
w_avg_8_16x16_c:          4.2 ( 1.00x)
w_avg_8_16x16_neon:       0.8 ( 5.67x)
w_avg_8_32x32_c:         16.2 ( 1.00x)
w_avg_8_32x32_neon:       2.5 ( 6.50x)
w_avg_8_64x64_c:         64.5 ( 1.00x)
w_avg_8_64x64_neon:       9.0 ( 7.17x)
w_avg_8_128x128_c:      269.5 ( 1.00x)
w_avg_8_128x128_neon:    35.5 ( 7.59x)
w_avg_10_2x2_c:           0.2 ( 1.00x)
w_avg_10_2x2_neon:        0.2 ( 1.00x)
w_avg_10_4x4_c:           0.2 ( 1.00x)
w_avg_10_4x4_neon:        0.2 ( 1.00x)
w_avg_10_8x8_c:           1.0 ( 1.00x)
w_avg_10_8x8_neon:        0.2 ( 4.00x)
w_avg_10_16x16_c:         4.2 ( 1.00x)
w_avg_10_16x16_neon:      0.8 ( 5.67x)
w_avg_10_32x32_c:        16.2 ( 1.00x)
w_avg_10_32x32_neon:      2.5 ( 6.50x)
w_avg_10_64x64_c:        66.2 ( 1.00x)
w_avg_10_64x64_neon:     10.0 ( 6.62x)
w_avg_10_128x128_c:     277.8 ( 1.00x)
w_avg_10_128x128_neon:   39.8 ( 6.99x)
w_avg_12_2x2_c:           0.0 ( 0.00x)
w_avg_12_2x2_neon:        0.2 ( 0.00x)
w_avg_12_4x4_c:           0.2 ( 1.00x)
w_avg_12_4x4_neon:        0.0 ( 0.00x)
w_avg_12_8x8_c:           1.2 ( 1.00x)
w_avg_12_8x8_neon:        0.5 ( 2.50x)
w_avg_12_16x16_c:         4.8 ( 1.00x)
w_avg_12_16x16_neon:      0.8 ( 6.33x)
w_avg_12_32x32_c:        17.0 ( 1.00x)
w_avg_12_32x32_neon:      2.8 ( 6.18x)
w_avg_12_64x64_c:        64.0 ( 1.00x)
w_avg_12_64x64_neon:     10.0 ( 6.40x)
w_avg_12_128x128_c:     269.2 ( 1.00x)
w_avg_12_128x128_neon:   42.0 ( 6.41x)
---
 libavcodec/aarch64/vvc/dsp_init.c | 34 +++++++++++
 libavcodec/aarch64/vvc/inter.S    | 99 +++++++++++++++++++++++++------
 2 files changed, 116 insertions(+), 17 deletions(-)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index ad767d17e2..b39ebb83fc 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -52,6 +52,37 @@ void ff_vvc_avg_12_neon(uint8_t *dst, ptrdiff_t dst_stride,
                         const int16_t *src0, const int16_t *src1, int width,
                         int height);
 
+void ff_vvc_w_avg_8_neon(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                         const int16_t *src0, const int16_t *src1,
+                         const int width, const int height,
+                         uintptr_t w0_w1, uintptr_t offset_shift);
+void ff_vvc_w_avg_10_neon(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                          const int16_t *src0, const int16_t *src1,
+                          const int width, const int height,
+                          uintptr_t w0_w1, uintptr_t offset_shift);
+void ff_vvc_w_avg_12_neon(uint8_t *_dst, const ptrdiff_t _dst_stride,
+                          const int16_t *src0, const int16_t *src1,
+                          const int width, const int height,
+                          uintptr_t w0_w1, uintptr_t offset_shift);
+/* When passing arguments to functions, Apple platforms diverge from the ARM64
+ * standard ABI, that we can't implement the function directly in asm.
+ */
+#define W_AVG_FUN(bit_depth) \
+static void vvc_w_avg_ ## bit_depth(uint8_t *dst, const ptrdiff_t dst_stride, \
+    const int16_t *src0, const int16_t *src1, const int width, const int height, \
+    const int denom, const int w0, const int w1, const int o0, const int o1) \
+{ \
+    const int shift = denom + FFMAX(3, 15 - bit_depth); \
+    const int offset = ((o0 + o1) * (1 << (bit_depth - 8)) + 1) * (1 << (shift - 1)); \
+    uintptr_t w0_w1 = ((uintptr_t)w0 << 32) | (uint32_t)w1; \
+    uintptr_t offset_shift = ((uintptr_t)offset << 32) | (uint32_t)shift; \
+    ff_vvc_w_avg_ ## bit_depth ## _neon(dst, dst_stride, src0, src1, width, height, w0_w1, offset_shift); \
+}
+
+W_AVG_FUN(8)
+W_AVG_FUN(10)
+W_AVG_FUN(12)
+
 void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -123,6 +154,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.put_uni_w[0][6][0][0] = ff_vvc_put_pel_uni_w_pixels128_8_neon;
 
         c->inter.avg = ff_vvc_avg_8_neon;
+        c->inter.w_avg = vvc_w_avg_8;
 
         for (int i = 0; i < FF_ARRAY_ELEMS(c->sao.band_filter); i++)
             c->sao.band_filter[i] = ff_h26x_sao_band_filter_8x8_8_neon;
@@ -163,11 +195,13 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         }
     } else if (bd == 10) {
         c->inter.avg = ff_vvc_avg_10_neon;
+        c->inter.w_avg = vvc_w_avg_10;
 
         c->alf.filter[LUMA] = alf_filter_luma_10_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_10_neon;
     } else if (bd == 12) {
         c->inter.avg = ff_vvc_avg_12_neon;
+        c->inter.w_avg = vvc_w_avg_12;
 
         c->alf.filter[LUMA] = alf_filter_luma_12_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 2f69274b86..c4c6ab1a72 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -22,9 +22,9 @@
 
 #define VVC_MAX_PB_SIZE 128
 
-.macro vvc_avg, bit_depth
+.macro vvc_avg type, bit_depth
 
-.macro vvc_avg_\bit_depth\()_2_4, tap
+.macro vvc_\type\()_\bit_depth\()_2_4 tap
 .if \tap == 2
         ldr             s0, [src0]
         ldr             s2, [src1]
@@ -32,9 +32,19 @@
         ldr             d0, [src0]
         ldr             d2, [src1]
 .endif
+
+.ifc \type, avg
         saddl           v4.4s, v0.4h, v2.4h
         add             v4.4s, v4.4s, v16.4s
         sqshrn          v4.4h, v4.4s, #(15 - \bit_depth)
+.else
+        mov             v4.16b, v16.16b
+        smlal           v4.4s, v0.4h, v19.4h
+        smlal           v4.4s, v2.4h, v20.4h
+        sqshl           v4.4s, v4.4s, v22.4s
+        sqxtn           v4.4h, v4.4s
+.endif
+
 .if \bit_depth == 8
         sqxtun          v4.8b, v4.8h
 .if \tap == 2
@@ -57,7 +67,7 @@
         add             dst, dst, dst_stride
 .endm
 
-function ff_vvc_avg_\bit_depth\()_neon, export=1
+function ff_vvc_\type\()_\bit_depth\()_neon, export=1
         dst             .req x0
         dst_stride      .req x1
         src0            .req x2
@@ -67,42 +77,64 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
 
         mov             x10, #(VVC_MAX_PB_SIZE * 2)
         cmp             width, #8
-.if \bit_depth == 8
-        movi            v16.4s, #64
-.else
-.if \bit_depth == 10
-        mov             w6, #1023
-        movi            v16.4s, #16
+.ifc \type, avg
+        movi            v16.4s, #(1 << (14 - \bit_depth))
 .else
-        mov             w6, #4095
-        movi            v16.4s, #4
-.endif
+        lsr             x11, x6, #32        // weight0
+        mov             w12, w6             // weight1
+        lsr             x13, x7, #32        // offset
+        mov             w14, w7             // shift
+
+        dup             v19.8h, w11
+        neg             w14, w14            // so we can use sqshl
+        dup             v20.8h, w12
+        dup             v16.4s, w13
+        dup             v22.4s, w14
+.endif // avg
+
+ .if \bit_depth >= 10
+        // clip pixel
+        mov             w6, #((1 << \bit_depth) - 1)
         movi            v18.8h, #0
         dup             v17.8h, w6
 .endif
+
         b.eq            8f
         b.hi            16f
         cmp             width, #4
         b.eq            4f
 2:      // width == 2
         subs            height, height, #1
-        vvc_avg_\bit_depth\()_2_4 2
+        vvc_\type\()_\bit_depth\()_2_4 2
         b.ne            2b
         b               32f
 4:      // width == 4
         subs            height, height, #1
-        vvc_avg_\bit_depth\()_2_4 4
+        vvc_\type\()_\bit_depth\()_2_4 4
         b.ne            4b
         b               32f
 8:      // width == 8
         ld1             {v0.8h}, [src0], x10
         ld1             {v2.8h}, [src1], x10
+.ifc \type, avg
         saddl           v4.4s, v0.4h, v2.4h
         saddl2          v5.4s, v0.8h, v2.8h
         add             v4.4s, v4.4s, v16.4s
         add             v5.4s, v5.4s, v16.4s
         sqshrn          v4.4h, v4.4s, #(15 - \bit_depth)
         sqshrn2         v4.8h, v5.4s, #(15 - \bit_depth)
+.else
+        mov             v4.16b, v16.16b
+        mov             v5.16b, v16.16b
+        smlal           v4.4s, v0.4h, v19.4h
+        smlal           v4.4s, v2.4h, v20.4h
+        smlal2          v5.4s, v0.8h, v19.8h
+        smlal2          v5.4s, v2.8h, v20.8h
+        sqshl           v4.4s, v4.4s, v22.4s
+        sqshl           v5.4s, v5.4s, v22.4s
+        sqxtn           v4.4h, v4.4s
+        sqxtn2          v4.8h, v5.4s
+.endif
         subs            height, height, #1
 .if \bit_depth == 8
         sqxtun          v4.8b, v4.8h
@@ -122,6 +154,7 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
 17:
         ldp             q0, q1, [x7], #32
         ldp             q2, q3, [x8], #32
+.ifc \type, avg
         saddl           v4.4s, v0.4h, v2.4h
         saddl2          v5.4s, v0.8h, v2.8h
         saddl           v6.4s, v1.4h, v3.4h
@@ -134,6 +167,28 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
         sqshrn2         v4.8h, v5.4s, #(15 - \bit_depth)
         sqshrn          v6.4h, v6.4s, #(15 - \bit_depth)
         sqshrn2         v6.8h, v7.4s, #(15 - \bit_depth)
+.else   // avg
+        mov             v4.16b, v16.16b
+        mov             v5.16b, v16.16b
+        mov             v6.16b, v16.16b
+        mov             v7.16b, v16.16b
+        smlal           v4.4s, v0.4h, v19.4h
+        smlal           v4.4s, v2.4h, v20.4h
+        smlal2          v5.4s, v0.8h, v19.8h
+        smlal2          v5.4s, v2.8h, v20.8h
+        smlal           v6.4s, v1.4h, v19.4h
+        smlal           v6.4s, v3.4h, v20.4h
+        smlal2          v7.4s, v1.8h, v19.8h
+        smlal2          v7.4s, v3.8h, v20.8h
+        sqshl           v4.4s, v4.4s, v22.4s
+        sqshl           v5.4s, v5.4s, v22.4s
+        sqshl           v6.4s, v6.4s, v22.4s
+        sqshl           v7.4s, v7.4s, v22.4s
+        sqxtn           v4.4h, v4.4s
+        sqxtn           v6.4h, v6.4s
+        sqxtn2          v4.8h, v5.4s
+        sqxtn2          v6.8h, v7.4s
+.endif  // w_avg
         subs            w6, w6, #16
 .if \bit_depth == 8
         sqxtun          v4.8b, v4.8h
@@ -155,9 +210,19 @@ function ff_vvc_avg_\bit_depth\()_neon, export=1
         b.ne            16b
 32:
         ret
+
+.unreq dst
+.unreq dst_stride
+.unreq src0
+.unreq src1
+.unreq width
+.unreq height
 endfunc
 .endm
 
-vvc_avg 8
-vvc_avg 10
-vvc_avg 12
+vvc_avg avg, 8
+vvc_avg avg, 10
+vvc_avg avg, 12
+vvc_avg w_avg, 8
+vvc_avg w_avg, 10
+vvc_avg w_avg, 12

From patchwork Mon Sep 23 09:05:39 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhao Zhili
X-Patchwork-Id: 51735
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 23 Sep 2024 17:05:39 +0800
X-OQ-MSGID: <20240923090540.18807-2-quinkblack@foxmail.com>
X-Mailer: git-send-email 2.42.0
In-Reply-To: <20240923090540.18807-1-quinkblack@foxmail.com>
References: <20240923090540.18807-1-quinkblack@foxmail.com>
Subject: [FFmpeg-devel] [PATCH v2 2/3] aarch64/vvc: Add dmvr_hv
List-Id: FFmpeg development discussions and patches
Cc: Zhao Zhili

From: Zhao Zhili

dmvr_hv_8_12x20_c:        8.0 ( 1.00x)
dmvr_hv_8_12x20_neon:     1.2 ( 6.62x)
dmvr_hv_8_20x12_c:        8.0 ( 1.00x)
dmvr_hv_8_20x12_neon:     0.9 ( 8.37x)
dmvr_hv_8_20x20_c:       12.9 ( 1.00x)
dmvr_hv_8_20x20_neon:     1.7 ( 7.62x)
dmvr_hv_10_12x20_c:       7.0 ( 1.00x)
dmvr_hv_10_12x20_neon:    1.7 ( 4.09x)
dmvr_hv_10_20x12_c:       7.0 ( 1.00x)
dmvr_hv_10_20x12_neon:    1.7 ( 4.09x)
dmvr_hv_10_20x20_c:      11.2 ( 1.00x)
dmvr_hv_10_20x20_neon:    2.7 ( 4.15x)
dmvr_hv_12_12x20_c:       6.5 ( 1.00x)
dmvr_hv_12_12x20_neon:    1.7 ( 3.79x)
dmvr_hv_12_20x12_c:       6.5 ( 1.00x)
dmvr_hv_12_20x12_neon:    1.7 ( 3.79x)
dmvr_hv_12_20x20_c:      10.2 ( 1.00x)
dmvr_hv_12_20x20_neon:    2.2 ( 4.64x)
---
 libavcodec/aarch64/vvc/dsp_init.c |  12 ++
 libavcodec/aarch64/vvc/inter.S    | 307 ++++++++++++++++++++++++++++++
 2 files changed, 319 insertions(+)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index b39ebb83fc..995e26d163 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -83,6 +83,15 @@ W_AVG_FUN(8)
 W_AVG_FUN(10)
 W_AVG_FUN(12)
 
+#define DMVR_FUN(fn, bd) \
+    void ff_vvc_dmvr_ ## fn ## bd ## _neon(int16_t *dst, \
+        const uint8_t *_src, const ptrdiff_t _src_stride, const int height, \
+        const intptr_t mx, const intptr_t my, const int width);
+
+DMVR_FUN(hv_, 8)
+DMVR_FUN(hv_, 10)
+DMVR_FUN(hv_, 12)
+
 void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -155,6 +164,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
 
         c->inter.avg = ff_vvc_avg_8_neon;
         c->inter.w_avg = vvc_w_avg_8;
+        c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_8_neon;
 
         for (int i = 0; i < FF_ARRAY_ELEMS(c->sao.band_filter); i++)
             c->sao.band_filter[i] = ff_h26x_sao_band_filter_8x8_8_neon;
@@ -196,12 +206,14 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
     } else if (bd == 10) {
         c->inter.avg = ff_vvc_avg_10_neon;
         c->inter.w_avg = vvc_w_avg_10;
+        c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_10_neon;
 
         c->alf.filter[LUMA] = alf_filter_luma_10_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_10_neon;
     } else if (bd == 12) {
         c->inter.avg = ff_vvc_avg_12_neon;
         c->inter.w_avg = vvc_w_avg_12;
+        c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_12_neon;
 
         c->alf.filter[LUMA] = alf_filter_luma_12_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index c4c6ab1a72..a0bb356f07 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -226,3 +226,310 @@ vvc_avg avg, 12
 vvc_avg w_avg, 8
 vvc_avg w_avg, 10
 vvc_avg w_avg, 12
+
+/* x0: int16_t *dst
+ * x1: const uint8_t *_src
+ * x2: const ptrdiff_t _src_stride
+ * w3: const int height
+ * x4: const intptr_t mx
+ * x5: const intptr_t my
+ * w6: const int width
+ */
+function ff_vvc_dmvr_hv_8_neon, export=1
+        dst             .req x0
+        src             .req x1
+        src_stride      .req x2
+        height          .req w3
+        mx              .req x4
+        my              .req x5
+        width           .req w6
+        tmp0            .req x7
+        tmp1            .req x8
+
+        sub             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+
+        movrel          x9, X(ff_vvc_inter_luma_dmvr_filters)
+        add             x12, x9, mx, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        mov             tmp0, sp
+        add             tmp1, tmp0, #(VVC_MAX_PB_SIZE * 2)
+        // We know the value are positive
+        dup             v0.8h, w10                  // filter_x[0]
+        dup             v1.8h, w11                  // filter_x[1]
+
+        add             x12, x9, my, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        sxtw            x6, w6
+        movi            v30.8h, #(1 << (8 - 7))     // offset1
+        movi            v31.8h, #8                  // offset2
+        dup             v2.8h, w10                  // filter_y[0]
+        dup             v3.8h, w11                  // filter_y[1]
+
+        // Valid value for width can only be 8 + 4, 16 + 4
+        cmp             width, #16
+        mov             w10, #0                     // start filter_y or not
+        add             height, height, #1
+        sub             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        sub             src_stride, src_stride, x6
+        cset            w15, gt                     // width > 16
+1:
+        mov             x12, tmp0
+        mov             x13, tmp1
+        mov             x14, dst
+        cbz             w15, 2f
+
+        // width > 16
+        ldur            q5, [src, #1]
+        ldr             q4, [src], #16
+        uxtl            v7.8h, v5.8b
+        uxtl2           v17.8h, v5.16b
+        uxtl            v6.8h, v4.8b
+        uxtl2           v16.8h, v4.16b
+        mul             v6.8h, v6.8h, v0.8h
+        mul             v16.8h, v16.8h, v0.8h
+        mla             v6.8h, v7.8h, v1.8h
+        mla             v16.8h, v17.8h, v1.8h
+        add             v6.8h, v6.8h, v30.8h
+        add             v16.8h, v16.8h, v30.8h
+        ushr            v6.8h, v6.8h, #(8 - 6)
+        ushr            v7.8h, v16.8h, #(8 - 6)
+        stp             q6, q7, [x13], #32
+
+        cbz             w10, 3f
+
+        ldp             q16, q17, [x12], #32
+        mul             v16.8h, v16.8h, v2.8h
+        mul             v17.8h, v17.8h, v2.8h
+        mla             v16.8h, v6.8h, v3.8h
+        mla             v17.8h, v7.8h, v3.8h
+        add             v16.8h, v16.8h, v31.8h
+        add             v17.8h, v17.8h, v31.8h
+        ushr            v16.8h, v16.8h, #4
+        ushr            v17.8h, v17.8h, #4
+        stp             q16, q17, [x14], #32
+        b               3f
+2:
+        // width > 8
+        ldur            d5, [src, #1]
+        ldr             d4, [src], #8
+        uxtl            v7.8h, v5.8b
+        uxtl            v6.8h, v4.8b
+        mul             v6.8h, v6.8h, v0.8h
+        mla             v6.8h, v7.8h, v1.8h
+        add             v6.8h, v6.8h, v30.8h
+        ushr            v6.8h, v6.8h, #(8 - 6)
+        str             q6, [x13], #16
+
+        cbz             w10, 3f
+
+        ldr             q16, [x12], #16
+        mul             v16.8h, v16.8h, v2.8h
+        mla             v16.8h, v6.8h, v3.8h
+        add             v16.8h, v16.8h, v31.8h
+        ushr            v16.8h, v16.8h, #4
+        str             q16, [x14], #16
+3:
+        ldr             s5, [src, #1]
+        ldr             s4, [src], #4
+        uxtl            v7.8h, v5.8b
+        uxtl            v6.8h, v4.8b
+        mul             v6.4h, v6.4h, v0.4h
+        mla             v6.4h, v7.4h, v1.4h
+        add             v6.4h, v6.4h, v30.4h
+        ushr            v6.4h, v6.4h, #(8 - 6)
+        str             d6, [x13], #8
+
+        cbz             w10, 4f
+
+        ldr             d16, [x12], #8
+        mul             v16.4h, v16.4h, v2.4h
+        mla             v16.4h, v6.4h, v3.4h
+        add             v16.4h, v16.4h, v31.4h
+        ushr            v16.4h, v16.4h, #4
+        str             d16, [x14], #8
+4:
+        subs            height, height, #1
+        mov             w10, #1
+        add             src, src, src_stride
+        add             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        eor             tmp0, tmp0, tmp1
+        eor             tmp1, tmp0, tmp1
+        eor             tmp0, tmp0, tmp1
+        b.ne            1b
+
+        add             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+        ret
+endfunc
+
+function ff_vvc_dmvr_hv_12_neon, export=1
+        movi            v29.4s, #(12 - 6)
+        movi            v30.4s, #(1 << (12 - 7))    // offset1
+        b               0f
+endfunc
+
+function ff_vvc_dmvr_hv_10_neon, export=1
+        movi            v29.4s, #(10 - 6)
+        movi            v30.4s, #(1 << (10 - 7))    // offset1
+0:
+        movi            v31.4s, #8                  // offset2
+        neg             v29.4s, v29.4s
+
+        sub             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+
+        movrel          x9, X(ff_vvc_inter_luma_dmvr_filters)
+        add             x12, x9, mx, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        mov             tmp0, sp
+        add             tmp1, tmp0, #(VVC_MAX_PB_SIZE * 2)
+        // We know the value are positive
+        dup             v0.8h, w10                  // filter_x[0]
+        dup             v1.8h, w11                  // filter_x[1]
+
+        add             x12, x9, my, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        sxtw            x6, w6
+        dup             v2.8h, w10                  // filter_y[0]
+        dup             v3.8h, w11                  // filter_y[1]
+
+        // Valid value for width can only be 8 + 4, 16 + 4
+        cmp             width, #16
+        mov             w10, #0                     // start filter_y or not
+        add             height, height, #1
+        sub             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        sub             src_stride, src_stride, x6, lsl #1
+        cset            w15, gt                     // width > 16
+1:
+        mov             x12, tmp0
+        mov             x13, tmp1
+        mov             x14, dst
+        cbz             w15, 2f
+
+        // width > 16
+        add             x16, src, #2
+        ldp             q6, q16, [src], #32
+        ldp             q7, q17, [x16]
+        umull           v4.4s, v6.4h, v0.4h
+        umull2          v5.4s, v6.8h, v0.8h
+        umull           v18.4s, v16.4h, v0.4h
+        umull2          v19.4s, v16.8h, v0.8h
+        umlal           v4.4s, v7.4h, v1.4h
+        umlal2          v5.4s, v7.8h, v1.8h
+        umlal           v18.4s, v17.4h, v1.4h
+        umlal2          v19.4s, v17.8h, v1.8h
+
+        add             v4.4s, v4.4s, v30.4s
+        add             v5.4s, v5.4s, v30.4s
+        add             v18.4s, v18.4s, v30.4s
+        add             v19.4s, v19.4s, v30.4s
+        ushl            v4.4s, v4.4s, v29.4s
+        ushl            v5.4s, v5.4s, v29.4s
+        ushl            v18.4s, v18.4s, v29.4s
+        ushl            v19.4s, v19.4s, v29.4s
+        uqxtn           v6.4h, v4.4s
+        uqxtn2          v6.8h, v5.4s
+        uqxtn           v7.4h, v18.4s
+        uqxtn2          v7.8h, v19.4s
+        stp             q6, q7, [x13], #32
+
+        cbz             w10, 3f
+
+        ldp             q4, q5, [x12], #32
+        umull           v17.4s, v4.4h, v2.4h
+        umull2          v18.4s, v4.8h, v2.8h
+        umull           v19.4s, v5.4h, v2.4h
+        umull2          v20.4s, v5.8h, v2.8h
+        umlal           v17.4s, v6.4h, v3.4h
+        umlal2          v18.4s, v6.8h, v3.8h
+        umlal           v19.4s, v7.4h, v3.4h
+        umlal2          v20.4s, v7.8h, v3.8h
+        add             v17.4s, v17.4s, v31.4s
+        add             v18.4s, v18.4s, v31.4s
+        add             v19.4s, v19.4s, v31.4s
+        add             v20.4s, v20.4s, v31.4s
+        ushr            v17.4s, v17.4s, #4
+        ushr            v18.4s, v18.4s, #4
+        ushr            v19.4s, v19.4s, #4
+        ushr            v20.4s, v20.4s, #4
+        uqxtn           v6.4h, v17.4s
+        uqxtn2          v6.8h, v18.4s
+        uqxtn           v7.4h, v19.4s
+        uqxtn2          v7.8h, v20.4s
+        stp             q6, q7, [x14], #32
+        b               3f
+2:
+        // width > 8
+        ldur            q7, [src, #2]
+        ldr             q6, [src], #16
+        umull           v4.4s, v6.4h, v0.4h
+        umull2          v5.4s, v6.8h, v0.8h
+        umlal           v4.4s, v7.4h, v1.4h
+        umlal2          v5.4s, v7.8h, v1.8h
+
+        add             v4.4s, v4.4s, v30.4s
+        add             v5.4s, v5.4s, v30.4s
+        ushl            v4.4s, v4.4s, v29.4s
+        ushl            v5.4s, v5.4s, v29.4s
+        uqxtn           v6.4h, v4.4s
+        uqxtn2          v6.8h, v5.4s
+        str             q6, [x13], #16
+
+        cbz             w10, 3f
+
+        ldr             q16, [x12], #16
+        umull           v17.4s, v16.4h, v2.4h
+        umull2          v18.4s, v16.8h, v2.8h
+        umlal           v17.4s, v6.4h, v3.4h
+        umlal2          v18.4s, v6.8h, v3.8h
+        add             v17.4s, v17.4s, v31.4s
+        add             v18.4s, v18.4s, v31.4s
+        ushr            v17.4s, v17.4s, #4
+        ushr            v18.4s, v18.4s, #4
+        uqxtn           v16.4h, v17.4s
+        uqxtn2          v16.8h, v18.4s
+        str             q16, [x14], #16
+3:
+        ldr             d7, [src, #2]
+        ldr             d6, [src], #8
+        umull           v4.4s, v7.4h, v1.4h
+        umlal           v4.4s, v6.4h, v0.4h
+        add             v4.4s, v4.4s, v30.4s
+        ushl            v4.4s, v4.4s, v29.4s
+        uqxtn           v6.4h, v4.4s
+        str             d6, [x13], #8
+
+        cbz             w10, 4f
+
+        ldr             d16, [x12], #8
+        umull           v17.4s, v16.4h, v2.4h
+        umlal           v17.4s, v6.4h, v3.4h
+        add             v17.4s, v17.4s, v31.4s
+        ushr            v17.4s, v17.4s, #4
+        uqxtn           v16.4h, v17.4s
+        str             d16, [x14], #8
+4:
+        subs            height, height, #1
+        mov             w10, #1
+        add             src, src, src_stride
+        add             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        eor             tmp0, tmp0, tmp1
+        eor             tmp1, tmp0, tmp1
+        eor             tmp0, tmp0, tmp1
+        b.ne            1b
+
+        add             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+        ret
+
+.unreq dst
+.unreq src
+.unreq src_stride
+.unreq height
+.unreq mx
+.unreq my
+.unreq width
+.unreq tmp0
+.unreq tmp1
+endfunc

From patchwork Mon Sep 23 09:05:40 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhao Zhili
X-Patchwork-Id: 51734
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 23 Sep 2024 17:05:40 +0800
X-OQ-MSGID: <20240923090540.18807-3-quinkblack@foxmail.com>
X-Mailer: git-send-email 2.42.0
In-Reply-To: <20240923090540.18807-1-quinkblack@foxmail.com>
References: <20240923090540.18807-1-quinkblack@foxmail.com>
Subject: [FFmpeg-devel] [PATCH v2 3/3] aarch64/vvc: Add dmvr
List-Id: FFmpeg development discussions and patches
Cc: Zhao Zhili

From: Zhao Zhili

dmvr_8_12x20_c:           2.2 ( 1.00x)
dmvr_8_12x20_neon:        0.5 ( 4.50x)
dmvr_8_20x12_c:           2.0 ( 1.00x)
dmvr_8_20x12_neon:        0.2 ( 8.00x)
dmvr_8_20x20_c:           3.2 ( 1.00x)
dmvr_8_20x20_neon:        0.5 ( 6.50x)
dmvr_12_12x20_c:          2.2 ( 1.00x)
dmvr_12_12x20_neon:       0.5 ( 4.50x)
dmvr_12_20x12_c:          2.2 ( 1.00x)
dmvr_12_20x12_neon:       0.5 ( 4.50x)
dmvr_12_20x20_c:          3.2 ( 1.00x)
dmvr_12_20x20_neon:       0.8 ( 4.33x)
---
 libavcodec/aarch64/vvc/dsp_init.c |  4 ++
 libavcodec/aarch64/vvc/inter.S    | 94 ++++++++++++++++++++++++++++++-
 2 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index 995e26d163..e9bb65bd0f 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -88,6 +88,8 @@ W_AVG_FUN(12)
         const uint8_t *_src, const ptrdiff_t _src_stride, const int height, \
         const intptr_t mx, const intptr_t my, const int width);
 
+DMVR_FUN(, 8)
+DMVR_FUN(, 12)
 DMVR_FUN(hv_, 8)
 DMVR_FUN(hv_, 10)
 DMVR_FUN(hv_, 12)
@@ -164,6 +166,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.avg = ff_vvc_avg_8_neon;
         c->inter.w_avg = vvc_w_avg_8;
 
+        c->inter.dmvr[0][0] = ff_vvc_dmvr_8_neon;
         c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_8_neon;
 
         for (int i = 0; i < FF_ARRAY_ELEMS(c->sao.band_filter); i++)
@@ -213,6 +216,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
     } else if (bd == 12) {
         c->inter.avg = ff_vvc_avg_12_neon;
         c->inter.w_avg = vvc_w_avg_12;
+        c->inter.dmvr[0][0] = ff_vvc_dmvr_12_neon;
         c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_12_neon;
 
         c->alf.filter[LUMA] = alf_filter_luma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index a0bb356f07..1b3fb5b468 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -235,7 +235,7 @@ vvc_avg w_avg, 12
  * x5: const intptr_t my
  * w6: const int width
  */
-function ff_vvc_dmvr_hv_8_neon, export=1
+function ff_vvc_dmvr_8_neon, export=1
         dst             .req x0
         src             .req x1
         src_stride      .req x2
@@ -243,6 +243,98 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mx              .req x4
         my              .req x5
         width           .req w6
+
+        sxtw            x6, w6
+        mov             x7, #(VVC_MAX_PB_SIZE * 2 + 8)
+        cmp             width, #16
+        sub             src_stride, src_stride, x6
+        cset            w15, gt                     // width > 16
+        movi            v16.8h, #2                  // DMVR_SHIFT
+        sub             x7, x7, x6, lsl #1
+1:
+        cbz             w15, 2f
+        ldr             q0, [src], #16
+        uxtl            v1.8h, v0.8b
+        uxtl2           v2.8h, v0.16b
+        ushl            v1.8h, v1.8h, v16.8h
+        ushl            v2.8h, v2.8h, v16.8h
+        stp             q1, q2, [dst], #32
+        b               3f
+2:
+        ldr             d0, [src], #8
+        uxtl            v1.8h, v0.8b
+        ushl            v1.8h, v1.8h, v16.8h
+        str             q1, [dst], #16
+3:
+        subs            height, height, #1
+        ldr             s3, [src], #4
+        uxtl            v4.8h, v3.8b
+        ushl            v4.4h, v4.4h, v16.4h
+        st1             {v4.4h}, [dst], x7
+
+        add             src, src, src_stride
+        b.ne            1b
+
+        ret
+endfunc
+
+function ff_vvc_dmvr_12_neon, export=1
+        sxtw            x6, w6
+        mov             x7, #(VVC_MAX_PB_SIZE * 2 + 8)
+        cmp             width, #16
+        sub             src_stride, src_stride, x6, lsl #1
+        cset            w15, gt                     // width > 16
+        movi            v16.4s, #2                  // offset4
+        sub             x7, x7, x6, lsl #1
+1:
+        cbz             w15, 2f
+        ldp             q0, q1, [src], #32
+        uxtl            v2.4s, v0.4h
+        uxtl2           v3.4s, v0.8h
+        uxtl            v4.4s, v1.4h
+        uxtl2           v5.4s, v1.8h
+        add             v2.4s, v2.4s, v16.4s
+        add             v3.4s, v3.4s, v16.4s
+        add             v4.4s, v4.4s, v16.4s
+        add             v5.4s, v5.4s, v16.4s
+        ushr            v2.4s, v2.4s, #2
+        ushr            v3.4s, v3.4s, #2
+        ushr            v4.4s, v4.4s, #2
+        ushr            v5.4s, v5.4s, #2
+        uqxtn           v2.4h, v2.4s
+        uqxtn2          v2.8h, v3.4s
+        uqxtn           v4.4h, v4.4s
+        uqxtn2          v4.8h, v5.4s
+
+        stp             q2, q4, [dst], #32
+        b               3f
+2:
+        ldr             q0, [src], #16
+        uxtl            v2.4s, v0.4h
+        uxtl2           v3.4s, v0.8h
+        add             v2.4s, v2.4s, v16.4s
+        add             v3.4s, v3.4s, v16.4s
+        ushr            v2.4s, v2.4s, #2
+        ushr            v3.4s, v3.4s, #2
+        uqxtn           v2.4h, v2.4s
+        uqxtn2          v2.8h, v3.4s
+        str             q2, [dst], #16
+3:
+        subs            height, height, #1
+        ldr             d0, [src], #8
+        uxtl            v3.4s, v0.4h
+        add             v3.4s, v3.4s, v16.4s
+        ushr            v3.4s, v3.4s, #2
+        uqxtn           v3.4h, v3.4s
+        st1             {v4.4h}, [dst], x7
+
+        add             src, src, src_stride
+        b.ne            1b
+
+        ret
+endfunc
+
+function ff_vvc_dmvr_hv_8_neon, export=1
         tmp0            .req x7
         tmp1            .req x8