From patchwork Sat Sep 21 17:41:45 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhao Zhili
X-Patchwork-Id: 51686
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Cc: Zhao Zhili
Date: Sun, 22 Sep 2024 01:41:45 +0800
X-OQ-MSGID: <20240921174146.10928-3-quinkblack@foxmail.com>
X-Mailer: git-send-email 2.42.0
In-Reply-To: <20240921174146.10928-1-quinkblack@foxmail.com>
References: <20240921174146.10928-1-quinkblack@foxmail.com>
Subject: [FFmpeg-devel] [PATCH 3/4] aarch64/vvc: Add dmvr_hv

From: Zhao Zhili

dmvr_hv_8_12x20_c:        8.0 ( 1.00x)
dmvr_hv_8_12x20_neon:     1.2 ( 6.62x)
dmvr_hv_8_20x12_c:        8.0 ( 1.00x)
dmvr_hv_8_20x12_neon:     0.9 ( 8.37x)
dmvr_hv_8_20x20_c:       12.9 ( 1.00x)
dmvr_hv_8_20x20_neon:     1.7 ( 7.62x)
dmvr_hv_10_12x20_c:       7.0 ( 1.00x)
dmvr_hv_10_12x20_neon:    1.7 ( 4.09x)
dmvr_hv_10_20x12_c:       7.0 ( 1.00x)
dmvr_hv_10_20x12_neon:    1.7 ( 4.09x)
dmvr_hv_10_20x20_c:      11.2 ( 1.00x)
dmvr_hv_10_20x20_neon:    2.7 ( 4.15x)
dmvr_hv_12_12x20_c:       6.5 ( 1.00x)
dmvr_hv_12_12x20_neon:    1.7 ( 3.79x)
dmvr_hv_12_20x12_c:       6.5 ( 1.00x)
dmvr_hv_12_20x12_neon:    1.7 ( 3.79x)
dmvr_hv_12_20x20_c:      10.2 ( 1.00x)
dmvr_hv_12_20x20_neon:    2.2 ( 4.64x)
---
 libavcodec/aarch64/vvc/dsp_init.c |  12 ++
 libavcodec/aarch64/vvc/inter.S    | 307 ++++++++++++++++++++++++++++++
 2 files changed, 319 insertions(+)

diff --git a/libavcodec/aarch64/vvc/dsp_init.c b/libavcodec/aarch64/vvc/dsp_init.c
index 03a4c62310..48642e98e6 100644
--- a/libavcodec/aarch64/vvc/dsp_init.c
+++ b/libavcodec/aarch64/vvc/dsp_init.c
@@ -89,6 +89,15 @@ W_AVG_FUN(8)
 W_AVG_FUN(10)
 W_AVG_FUN(12)
 
+#define DMVR_FUN(fn, bd) \
+    void ff_vvc_dmvr_ ## fn ## bd ## _neon(int16_t *dst, \
+    const uint8_t *_src, const ptrdiff_t _src_stride, const int height, \
+    const intptr_t mx, const intptr_t my, const int width);
+
+DMVR_FUN(hv_, 8)
+DMVR_FUN(hv_, 10)
+DMVR_FUN(hv_, 12)
+
 void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -162,6 +171,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.avg = ff_vvc_avg_8_neon;
         c->inter.w_avg = vvc_w_avg_8;
         c->inter.apply_bdof = apply_bdof_8;
+        c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_8_neon;
 
         for (int i = 0; i < FF_ARRAY_ELEMS(c->sao.band_filter); i++)
             c->sao.band_filter[i] = ff_h26x_sao_band_filter_8x8_8_neon;
@@ -204,6 +214,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.avg = ff_vvc_avg_10_neon;
         c->inter.w_avg = vvc_w_avg_10;
         c->inter.apply_bdof = apply_bdof_10;
+        c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_10_neon;
 
         c->alf.filter[LUMA] = alf_filter_luma_10_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_10_neon;
@@ -211,6 +222,7 @@ void ff_vvc_dsp_init_aarch64(VVCDSPContext *const c, const int bd)
         c->inter.avg = ff_vvc_avg_12_neon;
         c->inter.w_avg = vvc_w_avg_12;
         c->inter.apply_bdof = apply_bdof_12;
+        c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_12_neon;
 
         c->alf.filter[LUMA] = alf_filter_luma_12_neon;
         c->alf.filter[CHROMA] = alf_filter_chroma_12_neon;
diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 8cfacef44f..b652e0d609 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -570,3 +570,310 @@ function ff_vvc_derive_bdof_vx_vy_neon, export=1
         .unreq          y
 endfunc
 
+/* x0: int16_t *dst
+ * x1: const uint8_t *_src
+ * x2: const ptrdiff_t _src_stride
+ * w3: const int height
+ * x4: const intptr_t mx
+ * x5: const intptr_t my
+ * w6: const int width
+ */
+function ff_vvc_dmvr_hv_8_neon, export=1
+        dst             .req x0
+        src             .req x1
+        src_stride      .req x2
+        height          .req w3
+        mx              .req x4
+        my              .req x5
+        width           .req w6
+        tmp0            .req x7
+        tmp1            .req x8
+
+        sub             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+
+        movrel          x9, X(ff_vvc_inter_luma_dmvr_filters)
+        add             x12, x9, mx, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        mov             tmp0, sp
+        add             tmp1, tmp0, #(VVC_MAX_PB_SIZE * 2)
+        // We know the value are positive
+        dup             v0.8h, w10                  // filter_x[0]
+        dup             v1.8h, w11                  // filter_x[1]
+
+        add             x12, x9, my, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        sxtw            x6, w6
+        movi            v30.8h, #(1 << (8 - 7))     // offset1
+        movi            v31.8h, #8                  // offset2
+        dup             v2.8h, w10                  // filter_y[0]
+        dup             v3.8h, w11                  // filter_y[1]
+
+        // Valid value for width can only be 8 + 4, 16 + 4
+        cmp             width, #16
+        mov             w10, #0                     // start filter_y or not
+        add             height, height, #1
+        sub             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        sub             src_stride, src_stride, x6
+        cset            w15, gt                     // width > 16
+1:
+        mov             x12, tmp0
+        mov             x13, tmp1
+        mov             x14, dst
+        cbz             w15, 2f
+
+        // width > 16
+        ldur            q5, [src, #1]
+        ldr             q4, [src], #16
+        uxtl            v7.8h, v5.8b
+        uxtl2           v17.8h, v5.16b
+        uxtl            v6.8h, v4.8b
+        uxtl2           v16.8h, v4.16b
+        mul             v6.8h, v6.8h, v0.8h
+        mul             v16.8h, v16.8h, v0.8h
+        mla             v6.8h, v7.8h, v1.8h
+        mla             v16.8h, v17.8h, v1.8h
+        add             v6.8h, v6.8h, v30.8h
+        add             v16.8h, v16.8h, v30.8h
+        ushr            v6.8h, v6.8h, #(8 - 6)
+        ushr            v7.8h, v16.8h, #(8 - 6)
+        stp             q6, q7, [x13], #32
+
+        cbz             w10, 3f
+
+        ldp             q16, q17, [x12], #32
+        mul             v16.8h, v16.8h, v2.8h
+        mul             v17.8h, v17.8h, v2.8h
+        mla             v16.8h, v6.8h, v3.8h
+        mla             v17.8h, v7.8h, v3.8h
+        add             v16.8h, v16.8h, v31.8h
+        add             v17.8h, v17.8h, v31.8h
+        ushr            v16.8h, v16.8h, #4
+        ushr            v17.8h, v17.8h, #4
+        stp             q16, q17, [x14], #32
+        b               3f
+2:
+        // width > 8
+        ldur            d5, [src, #1]
+        ldr             d4, [src], #8
+        uxtl            v7.8h, v5.8b
+        uxtl            v6.8h, v4.8b
+        mul             v6.8h, v6.8h, v0.8h
+        mla             v6.8h, v7.8h, v1.8h
+        add             v6.8h, v6.8h, v30.8h
+        ushr            v6.8h, v6.8h, #(8 - 6)
+        str             q6, [x13], #16
+
+        cbz             w10, 3f
+
+        ldr             q16, [x12], #16
+        mul             v16.8h, v16.8h, v2.8h
+        mla             v16.8h, v6.8h, v3.8h
+        add             v16.8h, v16.8h, v31.8h
+        ushr            v16.8h, v16.8h, #4
+        str             q16, [x14], #16
+3:
+        ldr             s5, [src, #1]
+        ldr             s4, [src], #4
+        uxtl            v7.8h, v5.8b
+        uxtl            v6.8h, v4.8b
+        mul             v6.4h, v6.4h, v0.4h
+        mla             v6.4h, v7.4h, v1.4h
+        add             v6.4h, v6.4h, v30.4h
+        ushr            v6.4h, v6.4h, #(8 - 6)
+        str             d6, [x13], #8
+
+        cbz             w10, 4f
+
+        ldr             d16, [x12], #8
+        mul             v16.4h, v16.4h, v2.4h
+        mla             v16.4h, v6.4h, v3.4h
+        add             v16.4h, v16.4h, v31.4h
+        ushr            v16.4h, v16.4h, #4
+        str             d16, [x14], #8
+4:
+        subs            height, height, #1
+        mov             w10, #1
+        add             src, src, src_stride
+        add             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        eor             tmp0, tmp0, tmp1
+        eor             tmp1, tmp0, tmp1
+        eor             tmp0, tmp0, tmp1
+        b.ne            1b
+
+        add             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+        ret
+endfunc
+
+function ff_vvc_dmvr_hv_12_neon, export=1
+        movi            v29.4s, #(12 - 6)
+        movi            v30.4s, #(1 << (12 - 7))    // offset1
+        b               0f
+endfunc
+
+function ff_vvc_dmvr_hv_10_neon, export=1
+        movi            v29.4s, #(10 - 6)
+        movi            v30.4s, #(1 << (10 - 7))    // offset1
+0:
+        movi            v31.4s, #8                  // offset2
+        neg             v29.4s, v29.4s
+
+        sub             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+
+        movrel          x9, X(ff_vvc_inter_luma_dmvr_filters)
+        add             x12, x9, mx, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        mov             tmp0, sp
+        add             tmp1, tmp0, #(VVC_MAX_PB_SIZE * 2)
+        // We know the value are positive
+        dup             v0.8h, w10                  // filter_x[0]
+        dup             v1.8h, w11                  // filter_x[1]
+
+        add             x12, x9, my, lsl #1
+        ldrb            w10, [x12]
+        ldrb            w11, [x12, #1]
+        sxtw            x6, w6
+        dup             v2.8h, w10                  // filter_y[0]
+        dup             v3.8h, w11                  // filter_y[1]
+
+        // Valid value for width can only be 8 + 4, 16 + 4
+        cmp             width, #16
+        mov             w10, #0                     // start filter_y or not
+        add             height, height, #1
+        sub             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        sub             src_stride, src_stride, x6, lsl #1
+        cset            w15, gt                     // width > 16
+1:
+        mov             x12, tmp0
+        mov             x13, tmp1
+        mov             x14, dst
+        cbz             w15, 2f
+
+        // width > 16
+        add             x16, src, #2
+        ldp             q6, q16, [src], #32
+        ldp             q7, q17, [x16]
+        umull           v4.4s, v6.4h, v0.4h
+        umull2          v5.4s, v6.8h, v0.8h
+        umull           v18.4s, v16.4h, v0.4h
+        umull2          v19.4s, v16.8h, v0.8h
+        umlal           v4.4s, v7.4h, v1.4h
+        umlal2          v5.4s, v7.8h, v1.8h
+        umlal           v18.4s, v17.4h, v1.4h
+        umlal2          v19.4s, v17.8h, v1.8h
+
+        add             v4.4s, v4.4s, v30.4s
+        add             v5.4s, v5.4s, v30.4s
+        add             v18.4s, v18.4s, v30.4s
+        add             v19.4s, v19.4s, v30.4s
+        ushl            v4.4s, v4.4s, v29.4s
+        ushl            v5.4s, v5.4s, v29.4s
+        ushl            v18.4s, v18.4s, v29.4s
+        ushl            v19.4s, v19.4s, v29.4s
+        uqxtn           v6.4h, v4.4s
+        uqxtn2          v6.8h, v5.4s
+        uqxtn           v7.4h, v18.4s
+        uqxtn2          v7.8h, v19.4s
+        stp             q6, q7, [x13], #32
+
+        cbz             w10, 3f
+
+        ldp             q4, q5, [x12], #32
+        umull           v17.4s, v4.4h, v2.4h
+        umull2          v18.4s, v4.8h, v2.8h
+        umull           v19.4s, v5.4h, v2.4h
+        umull2          v20.4s, v5.8h, v2.8h
+        umlal           v17.4s, v6.4h, v3.4h
+        umlal2          v18.4s, v6.8h, v3.8h
+        umlal           v19.4s, v7.4h, v3.4h
+        umlal2          v20.4s, v7.8h, v3.8h
+        add             v17.4s, v17.4s, v31.4s
+        add             v18.4s, v18.4s, v31.4s
+        add             v19.4s, v19.4s, v31.4s
+        add             v20.4s, v20.4s, v31.4s
+        ushr            v17.4s, v17.4s, #4
+        ushr            v18.4s, v18.4s, #4
+        ushr            v19.4s, v19.4s, #4
+        ushr            v20.4s, v20.4s, #4
+        uqxtn           v6.4h, v17.4s
+        uqxtn2          v6.8h, v18.4s
+        uqxtn           v7.4h, v19.4s
+        uqxtn2          v7.8h, v20.4s
+        stp             q6, q7, [x14], #32
+        b               3f
+2:
+        // width > 8
+        ldur            q7, [src, #2]
+        ldr             q6, [src], #16
+        umull           v4.4s, v6.4h, v0.4h
+        umull2          v5.4s, v6.8h, v0.8h
+        umlal           v4.4s, v7.4h, v1.4h
+        umlal2          v5.4s, v7.8h, v1.8h
+
+        add             v4.4s, v4.4s, v30.4s
+        add             v5.4s, v5.4s, v30.4s
+        ushl            v4.4s, v4.4s, v29.4s
+        ushl            v5.4s, v5.4s, v29.4s
+        uqxtn           v6.4h, v4.4s
+        uqxtn2          v6.8h, v5.4s
+        str             q6, [x13], #16
+
+        cbz             w10, 3f
+
+        ldr             q16, [x12], #16
+        umull           v17.4s, v16.4h, v2.4h
+        umull2          v18.4s, v16.8h, v2.8h
+        umlal           v17.4s, v6.4h, v3.4h
+        umlal2          v18.4s, v6.8h, v3.8h
+        add             v17.4s, v17.4s, v31.4s
+        add             v18.4s, v18.4s, v31.4s
+        ushr            v17.4s, v17.4s, #4
+        ushr            v18.4s, v18.4s, #4
+        uqxtn           v16.4h, v17.4s
+        uqxtn2          v16.8h, v18.4s
+        str             q16, [x14], #16
+3:
+        ldr             d7, [src, #2]
+        ldr             d6, [src], #8
+        umull           v4.4s, v7.4h, v1.4h
+        umlal           v4.4s, v6.4h, v0.4h
+        add             v4.4s, v4.4s, v30.4s
+        ushl            v4.4s, v4.4s, v29.4s
+        uqxtn           v6.4h, v4.4s
+        str             d6, [x13], #8
+
+        cbz             w10, 4f
+
+        ldr             d16, [x12], #8
+        umull           v17.4s, v16.4h, v2.4h
+        umlal           v17.4s, v6.4h, v3.4h
+        add             v17.4s, v17.4s, v31.4s
+        ushr            v17.4s, v17.4s, #4
+        uqxtn           v16.4h, v17.4s
+        str             d16, [x14], #8
+4:
+        subs            height, height, #1
+        mov             w10, #1
+        add             src, src, src_stride
+        add             dst, dst, #(VVC_MAX_PB_SIZE * 2)
+        eor             tmp0, tmp0, tmp1
+        eor             tmp1, tmp0, tmp1
+        eor             tmp0, tmp0, tmp1
+        b.ne            1b
+
+        add             sp, sp, #(VVC_MAX_PB_SIZE * 4)
+        ret
+
+.unreq dst
+.unreq src
+.unreq src_stride
+.unreq height
+.unreq mx
+.unreq my
+.unreq width
+.unreq tmp0
+.unreq tmp1
+endfunc
+
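
For readers following the NEON code, the 8-bit path above can be modelled in scalar C roughly as below. This is only a sketch read off the assembly, not the C reference from the tree. The name dmvr_hv_8_model is hypothetical, MAX_PB_SIZE is an assumed stand-in for VVC_MAX_PB_SIZE (taken to be 128 here), and the two filter taps are passed in directly rather than looked up in ff_vvc_inter_luma_dmvr_filters with mx and my. It keeps the same structure as the assembly: height + 1 rows are filtered horizontally, two row buffers are swapped each iteration, and the vertical filter starts from the second row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for VVC_MAX_PB_SIZE, used as the dst stride in
 * int16_t elements and as the size of each temporary row. */
#define MAX_PB_SIZE 128

/* Hypothetical scalar model of the 8-bit dmvr_hv path.
 * fx/fy are the two horizontal/vertical taps (assumed non-negative). */
static void dmvr_hv_8_model(int16_t *dst, const uint8_t *src,
                            ptrdiff_t src_stride, int height,
                            const uint8_t fx[2], const uint8_t fy[2],
                            int width)
{
    int16_t rows[2][MAX_PB_SIZE];
    int16_t *prev = rows[0];    /* plays the role of tmp0 */
    int16_t *cur  = rows[1];    /* plays the role of tmp1 */

    assert(width > 0 && width <= MAX_PB_SIZE);

    /* One extra source row is needed to feed the vertical filter. */
    for (int y = 0; y < height + 1; y++) {
        for (int x = 0; x < width; x++) {
            /* Horizontal pass: offset1 = 1 << (8 - 7), shift1 = 8 - 6. */
            cur[x] = (int16_t)((src[x] * fx[0] + src[x + 1] * fx[1] + 2) >> 2);
            /* Vertical pass: offset2 = 8, shift2 = 4; skipped on the
             * first row, like the w10 flag in the assembly. */
            if (y > 0)
                dst[x] = (int16_t)((prev[x] * fy[0] + cur[x] * fy[1] + 8) >> 4);
        }
        if (y > 0)
            dst += MAX_PB_SIZE;
        src += src_stride;

        /* Swap the two row buffers (the eor triple in the assembly). */
        int16_t *t = prev;
        prev = cur;
        cur  = t;
    }
}
```

With taps {16, 0} in both directions the model reduces to dst = 4 * src, which gives a quick sanity check. Note that the caller has to provide width + 1 readable columns and height + 1 readable rows of source.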