From patchwork Mon Jun 19 13:06:09 2023
X-Patchwork-Submitter: Arnie Chang
X-Patchwork-Id: 42224
From: Arnie Chang <arnie.chang@sifive.com>
To: ffmpeg-devel@ffmpeg.org
Cc: Arnie Chang <arnie.chang@sifive.com>
Date: Mon, 19 Jun 2023 21:06:09 +0800
Message-Id: <20230619130609.15547-1-arnie.chang@sifive.com>
Subject: [FFmpeg-devel] [PATCH v2] lavc/h264chroma: RISC-V V add motion compensation for 4xH and 2xH chroma blocks

Optimize the put and avg filtering for 4xH and 2xH blocks.

Signed-off-by: Arnie Chang <arnie.chang@sifive.com>
---
V2:
1. Change \width to a run-time argument instead of a macro parameter.
2. Call an internal function instead of instantiating similar code three times.

RVVi32:
 - h264chroma.chroma_mc [OK]
checkasm: all 6 tests passed
avg_h264_chroma_mc1_8_c:       1821.5
avg_h264_chroma_mc1_8_rvv_i32:  466.5
avg_h264_chroma_mc2_8_c:        939.2
avg_h264_chroma_mc2_8_rvv_i32:  466.5
avg_h264_chroma_mc4_8_c:        502.2
avg_h264_chroma_mc4_8_rvv_i32:  466.5
put_h264_chroma_mc1_8_c:       1436.5
put_h264_chroma_mc1_8_rvv_i32:  382.5
put_h264_chroma_mc2_8_c:        824.2
put_h264_chroma_mc2_8_rvv_i32:  382.5
put_h264_chroma_mc4_8_c:        431.2
put_h264_chroma_mc4_8_rvv_i32:  382.5

 libavcodec/riscv/h264_chroma_init_riscv.c |   8 +
 libavcodec/riscv/h264_mc_chroma.S         | 237 ++++++++++++++--------
 2 files changed, 160 insertions(+), 85 deletions(-)
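A note for reviewers (placed below the cut, so git-am ignores it): every
function touched here computes the standard H.264 eighth-pel bilinear chroma
interpolation; only the block width differs. The scalar sketch below mirrors
FFmpeg's generic C template (compare libavcodec/h264chroma_template.c) and is
illustrative only; the helper name is not part of this patch.

    #include <stddef.h>
    #include <stdint.h>

    /* What each put mcW function computes for a w x h block: a bilinear
     * blend of four neighboring samples, weighted by the 1/8-pel offsets
     * x and y (0..7) and rounded with (... + 32) >> 6.  The RVV code
     * evaluates the same sum with vwmulu/vwmaccu and narrows it with
     * vnclipu.  The avg variants additionally average the result into
     * dst as (dst + pel + 1) >> 1, which vaaddu performs under
     * vxrm = round-to-nearest-up (csrw vxrm, zero). */
    static void put_chroma_wxh_ref(uint8_t *dst, const uint8_t *src,
                                   ptrdiff_t stride, int w, int h,
                                   int x, int y)
    {
        const int A = (8 - x) * (8 - y);
        const int B = x * (8 - y);
        const int C = (8 - x) * y;
        const int D = x * y;

        for (int j = 0; j < h; j++) {
            for (int i = 0; i < w; i++)
                dst[i] = (A * src[i]          + B * src[i + 1] +
                          C * src[stride + i] + D * src[stride + i + 1] +
                          32) >> 6;
            dst += stride;
            src += stride;
        }
    }
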
diff --git a/libavcodec/riscv/h264_chroma_init_riscv.c b/libavcodec/riscv/h264_chroma_init_riscv.c
index 7c905edfcd..9f95150ea3 100644
--- a/libavcodec/riscv/h264_chroma_init_riscv.c
+++ b/libavcodec/riscv/h264_chroma_init_riscv.c
@@ -27,6 +27,10 @@
 void h264_put_chroma_mc8_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
 void h264_avg_chroma_mc8_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+void h264_put_chroma_mc4_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+void h264_avg_chroma_mc4_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+void h264_put_chroma_mc2_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
+void h264_avg_chroma_mc2_rvv(uint8_t *p_dst, const uint8_t *p_src, ptrdiff_t stride, int h, int x, int y);
 
 av_cold void ff_h264chroma_init_riscv(H264ChromaContext *c, int bit_depth)
 {
@@ -36,6 +40,10 @@ av_cold void ff_h264chroma_init_riscv(H264ChromaContext *c, int bit_depth)
     if (bit_depth == 8 && (flags & AV_CPU_FLAG_RVV_I32) && ff_get_rv_vlenb() >= 16) {
         c->put_h264_chroma_pixels_tab[0] = h264_put_chroma_mc8_rvv;
         c->avg_h264_chroma_pixels_tab[0] = h264_avg_chroma_mc8_rvv;
+        c->put_h264_chroma_pixels_tab[1] = h264_put_chroma_mc4_rvv;
+        c->avg_h264_chroma_pixels_tab[1] = h264_avg_chroma_mc4_rvv;
+        c->put_h264_chroma_pixels_tab[2] = h264_put_chroma_mc2_rvv;
+        c->avg_h264_chroma_pixels_tab[2] = h264_avg_chroma_mc2_rvv;
     }
 #endif
 }
diff --git a/libavcodec/riscv/h264_mc_chroma.S b/libavcodec/riscv/h264_mc_chroma.S
index 364bc3156e..ce99bda44d 100644
--- a/libavcodec/riscv/h264_mc_chroma.S
+++ b/libavcodec/riscv/h264_mc_chroma.S
@@ -19,8 +19,7 @@
  */
 #include "libavutil/riscv/asm.S"
 
-.macro h264_chroma_mc8 type
-func h264_\type\()_chroma_mc8_rvv, zve32x
+.macro do_chroma_mc type unroll
         csrw            vxrm, zero
         slli            t2, a5, 3
         mul             t1, a5, a4
@@ -30,94 +29,100 @@ func h264_\type\()_chroma_mc8_rvv, zve32x
         sub             a7, a4, t1
         addi            a6, a5, 64
         sub             t0, t2, t1
-        vsetivli        t3, 8, e8, m1, ta, mu
+        vsetvli         t3, t6, e8, m1, ta, mu
         beqz            t1, 2f
         blez            a3, 8f
         li              t4, 0
         li              t2, 0
         li              t5, 1
         addi            a5, t3, 1
-        slli            t3, a2, 2
+        slli            t3, a2, (1 + \unroll)
 1:                                # if (xy != 0)
         add             a4, a1, t4
         vsetvli         zero, a5, e8, m1, ta, ma
+        .ifc \unroll,1
         addi            t2, t2, 4
+        .else
+        addi            t2, t2, 2
+        .endif
         vle8.v          v10, (a4)
         add             a4, a4, a2
         vslide1down.vx  v11, v10, t5
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmulu.vx       v8, v10, a6
         vwmaccu.vx      v8, a7, v11
         vsetvli         zero, a5, e8, m1, ta, ma
         vle8.v          v12, (a4)
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         add             a4, a4, a2
         vwmaccu.vx      v8, t0, v12
         vsetvli         zero, a5, e8, m1, ta, ma
         vslide1down.vx  v13, v12, t5
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmulu.vx       v10, v12, a6
         vwmaccu.vx      v8, t1, v13
         vwmaccu.vx      v10, a7, v13
         vsetvli         zero, a5, e8, m1, ta, ma
         vle8.v          v14, (a4)
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         add             a4, a4, a2
         vwmaccu.vx      v10, t0, v14
         vsetvli         zero, a5, e8, m1, ta, ma
         vslide1down.vx  v15, v14, t5
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmulu.vx       v12, v14, a6
         vwmaccu.vx      v10, t1, v15
         vwmaccu.vx      v12, a7, v15
+        vnclipu.wi      v15, v8, 6
+        .ifc \type,avg
+        vle8.v          v9, (a0)
+        vaaddu.vv       v15, v15, v9
+        .endif
+        vse8.v          v15, (a0)
+        add             a0, a0, a2
+        vnclipu.wi      v8, v10, 6
+        .ifc \type,avg
+        vle8.v          v9, (a0)
+        vaaddu.vv       v8, v8, v9
+        .endif
+        add             t4, t4, t3
+        vse8.v          v8, (a0)
+        add             a0, a0, a2
+        .ifc \unroll,1
         vsetvli         zero, a5, e8, m1, ta, ma
         vle8.v          v14, (a4)
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         add             a4, a4, a2
         vwmaccu.vx      v12, t0, v14
         vsetvli         zero, a5, e8, m1, ta, ma
         vslide1down.vx  v15, v14, t5
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmulu.vx       v16, v14, a6
         vwmaccu.vx      v12, t1, v15
         vwmaccu.vx      v16, a7, v15
         vsetvli         zero, a5, e8, m1, ta, ma
         vle8.v          v14, (a4)
-        vsetivli        zero, 8, e8, m1, ta, ma
-        add             a4, a0, t4
-        add             t4, t4, t3
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmaccu.vx      v16, t0, v14
         vsetvli         zero, a5, e8, m1, ta, ma
         vslide1down.vx  v14, v14, t5
-        vsetivli        zero, 8, e8, m1, ta, ma
-        vnclipu.wi      v15, v8, 6
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmaccu.vx      v16, t1, v14
-        .ifc \type,avg
-        vle8.v          v9, (a4)
-        vaaddu.vv       v15, v15, v9
-        .endif
-        vse8.v          v15, (a4)
-        add             a4, a4, a2
-        vnclipu.wi      v8, v10, 6
-        .ifc \type,avg
-        vle8.v          v9, (a4)
-        vaaddu.vv       v8, v8, v9
-        .endif
-        vse8.v          v8, (a4)
-        add             a4, a4, a2
         vnclipu.wi      v8, v12, 6
         .ifc \type,avg
-        vle8.v          v9, (a4)
+        vle8.v          v9, (a0)
         vaaddu.vv       v8, v8, v9
         .endif
-        vse8.v          v8, (a4)
-        add             a4, a4, a2
+        vse8.v          v8, (a0)
+        add             a0, a0, a2
         vnclipu.wi      v8, v16, 6
         .ifc \type,avg
-        vle8.v          v9, (a4)
+        vle8.v          v9, (a0)
         vaaddu.vv       v8, v8, v9
         .endif
-        vse8.v          v8, (a4)
+        vse8.v          v8, (a0)
+        add             a0, a0, a2
+        .endif
         blt             t2, a3, 1b
         j               8f
 2:
@@ -126,11 +131,15 @@ func h264_\type\()_chroma_mc8_rvv, zve32x
         blez            a3, 8f
         li              a4, 0
         li              t1, 0
-        slli            a7, a2, 2
+        slli            a7, a2, (1 + \unroll)
 3:                                # if ((x8 - xy) == 0 && (y8 -xy) != 0)
         add             a5, a1, a4
         vsetvli         zero, zero, e8, m1, ta, ma
+        .ifc \unroll,1
         addi            t1, t1, 4
+        .else
+        addi            t1, t1, 2
+        .endif
         vle8.v          v8, (a5)
         add             a5, a5, a2
         add             t2, a5, a2
@@ -141,42 +150,44 @@ func h264_\type\()_chroma_mc8_rvv, zve32x
         vle8.v          v9, (a5)
         add             t2, t2, a2
         add             a5, t2, a2
         vwmaccu.vx      v10, t0, v8
-        vle8.v          v8, (t2)
-        vle8.v          v14, (a5)
-        add             a5, a0, a4
         add             a4, a4, a7
         vwmaccu.vx      v12, t0, v9
         vnclipu.wi      v15, v10, 6
         vwmulu.vx       v10, v9, a6
+        vnclipu.wi      v9, v12, 6
         .ifc \type,avg
-        vle8.v          v16, (a5)
+        vle8.v          v16, (a0)
         vaaddu.vv       v15, v15, v16
         .endif
-        vse8.v          v15, (a5)
-        add             a5, a5, a2
-        vnclipu.wi      v9, v12, 6
-        vwmaccu.vx      v10, t0, v8
-        vwmulu.vx       v12, v8, a6
+        vse8.v          v15, (a0)
+        add             a0, a0, a2
         .ifc \type,avg
-        vle8.v          v16, (a5)
+        vle8.v          v16, (a0)
         vaaddu.vv       v9, v9, v16
         .endif
-        vse8.v          v9, (a5)
-        add             a5, a5, a2
+        vse8.v          v9, (a0)
+        add             a0, a0, a2
+        .ifc \unroll,1
+        vle8.v          v8, (t2)
+        vle8.v          v14, (a5)
+        vwmaccu.vx      v10, t0, v8
+        vwmulu.vx       v12, v8, a6
         vnclipu.wi      v8, v10, 6
         vwmaccu.vx      v12, t0, v14
         .ifc \type,avg
-        vle8.v          v16, (a5)
+        vle8.v          v16, (a0)
         vaaddu.vv       v8, v8, v16
         .endif
-        vse8.v          v8, (a5)
-        add             a5, a5, a2
+        vse8.v          v8, (a0)
+        add             a0, a0, a2
         vnclipu.wi      v8, v12, 6
         .ifc \type,avg
-        vle8.v          v16, (a5)
+        vle8.v          v16, (a0)
         vaaddu.vv       v8, v8, v16
         .endif
-        vse8.v          v8, (a5)
+        vse8.v          v8, (a0)
+        add             a0, a0, a2
+        .endif
         blt             t1, a3, 3b
         j               8f
 4:
@@ -186,87 +197,95 @@ func h264_\type\()_chroma_mc8_rvv, zve32x
         li              a4, 0
         li              t2, 0
         addi            t0, t3, 1
-        slli            t1, a2, 2
+        slli            t1, a2, (1 + \unroll)
 5:                                # if ((x8 - xy) != 0 && (y8 -xy) == 0)
         add             a5, a1, a4
         vsetvli         zero, t0, e8, m1, ta, ma
+        .ifc \unroll,1
         addi            t2, t2, 4
+        .else
+        addi            t2, t2, 2
+        .endif
         vle8.v          v8, (a5)
         add             a5, a5, a2
         vslide1down.vx  v9, v8, t5
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmulu.vx       v10, v8, a6
         vwmaccu.vx      v10, a7, v9
         vsetvli         zero, t0, e8, m1, ta, ma
         vle8.v          v8, (a5)
         add             a5, a5, a2
         vslide1down.vx  v9, v8, t5
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmulu.vx       v12, v8, a6
         vwmaccu.vx      v12, a7, v9
+        vnclipu.wi      v16, v10, 6
+        .ifc \type,avg
+        vle8.v          v18, (a0)
+        vaaddu.vv       v16, v16, v18
+        .endif
+        vse8.v          v16, (a0)
+        add             a0, a0, a2
+        vnclipu.wi      v10, v12, 6
+        .ifc \type,avg
+        vle8.v          v18, (a0)
+        vaaddu.vv       v10, v10, v18
+        .endif
+        add             a4, a4, t1
+        vse8.v          v10, (a0)
+        add             a0, a0, a2
+        .ifc \unroll,1
         vsetvli         zero, t0, e8, m1, ta, ma
         vle8.v          v8, (a5)
         add             a5, a5, a2
         vslide1down.vx  v9, v8, t5
-        vsetivli        zero, 8, e8, m1, ta, ma
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmulu.vx       v14, v8, a6
         vwmaccu.vx      v14, a7, v9
         vsetvli         zero, t0, e8, m1, ta, ma
         vle8.v          v8, (a5)
-        add             a5, a0, a4
-        add             a4, a4, t1
         vslide1down.vx  v9, v8, t5
-        vsetivli        zero, 8, e8, m1, ta, ma
-        vnclipu.wi      v16, v10, 6
-        .ifc \type,avg
-        vle8.v          v18, (a5)
-        vaaddu.vv       v16, v16, v18
-        .endif
-        vse8.v          v16, (a5)
-        add             a5, a5, a2
-        vnclipu.wi      v10, v12, 6
+        vsetvli         zero, t6, e8, m1, ta, ma
         vwmulu.vx       v12, v8, a6
-        .ifc \type,avg
-        vle8.v          v18, (a5)
-        vaaddu.vv       v10, v10, v18
-        .endif
-        vse8.v          v10, (a5)
-        add             a5, a5, a2
         vnclipu.wi      v8, v14, 6
         vwmaccu.vx      v12, a7, v9
         .ifc \type,avg
-        vle8.v          v18, (a5)
+        vle8.v          v18, (a0)
         vaaddu.vv       v8, v8, v18
         .endif
-        vse8.v          v8, (a5)
-        add             a5, a5, a2
+        vse8.v          v8, (a0)
+        add             a0, a0, a2
         vnclipu.wi      v8, v12, 6
         .ifc \type,avg
-        vle8.v          v18, (a5)
+        vle8.v          v18, (a0)
         vaaddu.vv       v8, v8, v18
         .endif
-        vse8.v          v8, (a5)
+        vse8.v          v8, (a0)
+        add             a0, a0, a2
+        .endif
         blt             t2, a3, 5b
         j               8f
 6:
         blez            a3, 8f
         li              a4, 0
         li              t2, 0
-        slli            a7, a2, 2
+        slli            a7, a2, (1 + \unroll)
 7:                                # the final else, none of the above conditions are met
         add             t0, a1, a4
         vsetvli         zero, zero, e8, m1, ta, ma
         add             a5, a0, a4
         add             a4, a4, a7
+        .ifc \unroll,1
         addi            t2, t2, 4
+        .else
+        addi            t2, t2, 2
+        .endif
         vle8.v          v8, (t0)
         add             t0, t0, a2
         add             t1, t0, a2
         vwmulu.vx       v10, v8, a6
         vle8.v          v8, (t0)
         add             t0, t1, a2
-        vle8.v          v9, (t1)
-        vle8.v          v12, (t0)
         vnclipu.wi      v13, v10, 6
         vwmulu.vx       v10, v8, a6
         .ifc \type,avg
@@ -276,13 +295,16 @@ func h264_\type\()_chroma_mc8_rvv, zve32x
         vse8.v          v13, (a5)
         add             a5, a5, a2
         vnclipu.wi      v8, v10, 6
-        vwmulu.vx       v10, v9, a6
         .ifc \type,avg
         vle8.v          v18, (a5)
         vaaddu.vv       v8, v8, v18
         .endif
         vse8.v          v8, (a5)
         add             a5, a5, a2
+        .ifc \unroll,1
+        vle8.v          v9, (t1)
+        vle8.v          v12, (t0)
+        vwmulu.vx       v10, v9, a6
         vnclipu.wi      v8, v10, 6
         vwmulu.vx       v10, v12, a6
         .ifc \type,avg
@@ -297,11 +319,56 @@ func h264_\type\()_chroma_mc8_rvv, zve32x
         vaaddu.vv       v8, v8, v18
         .endif
         vse8.v          v8, (a5)
+        .endif
         blt             t2, a3, 7b
 8:
         ret
-endfunc
 .endm
 
-h264_chroma_mc8 put
-h264_chroma_mc8 avg
+func h264_put_chroma_mc_rvv, zve32x
+11:
+        li              a7, 3
+        blt             a3, a7, 12f
+        do_chroma_mc    put 1
+12:
+        do_chroma_mc    put 0
+endfunc
+
+func h264_avg_chroma_mc_rvv, zve32x
+21:
+        li              a7, 3
+        blt             a3, a7, 22f
+        do_chroma_mc    avg 1
+22:
+        do_chroma_mc    avg 0
+endfunc
+
+func h264_put_chroma_mc8_rvv, zve32x
+        li              t6, 8
+        j               11b
+endfunc
+
+func h264_put_chroma_mc4_rvv, zve32x
+        li              t6, 4
+        j               11b
+endfunc
+
+func h264_put_chroma_mc2_rvv, zve32x
+        li              t6, 2
+        j               11b
+endfunc
+
+func h264_avg_chroma_mc8_rvv, zve32x
+        li              t6, 8
+        j               21b
+endfunc
+
+func h264_avg_chroma_mc4_rvv, zve32x
+        li              t6, 4
+        j               21b
+endfunc
+
+func h264_avg_chroma_mc2_rvv, zve32x
+        li              t6, 2
+        j               21b
+endfunc
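
One structural note for reviewers, with a hypothetical C model (all names
below are invented for illustration, not FFmpeg API): each exported mcW
symbol only loads the block width into t6 and jumps to a shared put/avg
body, which instantiates the filtering loop twice per type, unrolled by
four rows when h >= 3 (h is 2, 4, or 8 in practice) and by two rows
otherwise.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical model of the dispatch added above; mc_rows stands in
     * for one unrolled pass of the real filtering loop. */
    static void mc_rows(uint8_t *dst, const uint8_t *src, ptrdiff_t stride,
                        int w, int rows, int x, int y)
    {
        /* filter `rows` rows of a w-wide block (see the scalar
         * reference sketched before the diff) */
        (void)dst; (void)src; (void)stride; (void)w; (void)rows;
        (void)x; (void)y;
    }

    static void chroma_mc(uint8_t *dst, const uint8_t *src, ptrdiff_t stride,
                          int w, int h, int x, int y)
    {
        const int step = h >= 3 ? 4 : 2;  /* mirrors li a7, 3; blt a3, a7, ... */

        for (int j = 0; j < h; j += step)
            mc_rows(dst + j * stride, src + j * stride, stride, w, step, x, y);
    }

    /* thin entry points: only the width (t6 in the assembly) differs */
    static void put_chroma_mc8(uint8_t *d, const uint8_t *s, ptrdiff_t t,
                               int h, int x, int y) { chroma_mc(d, s, t, 8, h, x, y); }
    static void put_chroma_mc4(uint8_t *d, const uint8_t *s, ptrdiff_t t,
                               int h, int x, int y) { chroma_mc(d, s, t, 4, h, x, y); }
    static void put_chroma_mc2(uint8_t *d, const uint8_t *s, ptrdiff_t t,
                               int h, int x, int y) { chroma_mc(d, s, t, 2, h, x, y); }

Instantiating the body twice per type (unroll = 1 and unroll = 0) rather than
once per width is what V2 point 2 refers to; it keeps code size flat while
vsetvli selects the 8-, 4-, or 2-element vector length at run time.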