Message ID | tencent_1CC522736C587F61113C98AAEF167D43BC08@qq.com
---|---
State | New
Series | [FFmpeg-devel,v3] lavc/vvc_mc: R-V V avg w_avg

Context | Check | Description
---|---|---
yinshiyou/make_loongarch64 | success | Make finished
yinshiyou/make_fate_loongarch64 | success | Make fate finished
andriy/make_x86 | success | Make finished
andriy/make_fate_x86 | success | Make fate finished
> I think we can drop the 2x2 transforms. In all likelihood, scalar code will
> end up faster than vector code on future hardware, especially out-of-order
> pipelines.

I want to drop 2x2 as well, but a single function handles all cases here instead of 7*7 separate functions, so there is no dedicated 2x2 path to remove.

> AFAIU, this will generate relocations. I wonder if the linker is smart enough
> to put that into .data.relro rather than whine that it can't live in .rodata?
>
> In assembler, we can dodge the problem entirely by storing relative offsets
> rather than addresses. You can also stick to 4- or even 2-byte values then.

Okay, updated it in the reply.

> LLA is an alias for AUIPC; ADD. You can avoid that ADD by folding the low bits
> into LD. See how ff_h263_loop_filter_strength is addressed in h263dsp_rvv.S.

With the change to relative offsets in the table, the full table start address still needs to be stored in a register once, so this case still appears to require lla.

<uk7b@foxmail.com> wrote on Wed, Jun 12, 2024 at 00:38:
> From: sunyuechi <sunyuechi@iscas.ac.cn>
>
> C908 X60
> avg_8_2x2_c : 1.2 1.0
> avg_8_2x2_rvv_i32 : 1.0 1.0
> avg_8_2x4_c : 2.0 2.0
> avg_8_2x4_rvv_i32 : 1.5 1.2
> avg_8_2x8_c : 3.7 4.0
> avg_8_2x8_rvv_i32 : 2.0 2.0
> avg_8_2x16_c : 7.2 7.7
> avg_8_2x16_rvv_i32 : 3.2 3.0
> avg_8_2x32_c : 14.5 15.2
> avg_8_2x32_rvv_i32 : 5.7 5.0
> avg_8_2x64_c : 50.0 45.2
> avg_8_2x64_rvv_i32 : 41.5 32.5
> avg_8_2x128_c : 101.5 84.2
> avg_8_2x128_rvv_i32 : 89.5 73.2
> avg_8_4x2_c : 2.0 2.0
> avg_8_4x2_rvv_i32 : 1.0 1.0
> avg_8_4x4_c : 3.5 3.5
> avg_8_4x4_rvv_i32 : 1.5 1.2
> avg_8_4x8_c : 6.7 7.0
> avg_8_4x8_rvv_i32 : 2.0 1.7
> avg_8_4x16_c : 13.2 14.0
> avg_8_4x16_rvv_i32 : 3.2 3.0
> avg_8_4x32_c : 26.2 27.7
> avg_8_4x32_rvv_i32 : 5.7 5.0
> avg_8_4x64_c : 75.0 66.0
> avg_8_4x64_rvv_i32 : 40.2 33.0
> avg_8_4x128_c : 144.5 128.0
> avg_8_4x128_rvv_i32 : 89.5 78.7
> avg_8_8x2_c : 3.2 3.5
> avg_8_8x2_rvv_i32 : 1.2 1.0
> avg_8_8x4_c : 6.5 6.7
> avg_8_8x4_rvv_i32 : 1.5 1.5
avg_8_8x8_c : 12.7 13.2 > avg_8_8x8_rvv_i32 : 2.2 1.7 > avg_8_8x16_c : 25.2 26.5 > avg_8_8x16_rvv_i32 : 3.7 2.7 > avg_8_8x32_c : 50.2 52.7 > avg_8_8x32_rvv_i32 : 6.5 5.0 > avg_8_8x64_c : 120.2 117.7 > avg_8_8x64_rvv_i32 : 45.2 39.2 > avg_8_8x128_c : 223.0 233.5 > avg_8_8x128_rvv_i32 : 80.0 73.2 > avg_8_16x2_c : 6.2 6.5 > avg_8_16x2_rvv_i32 : 1.5 1.0 > avg_8_16x4_c : 12.5 12.7 > avg_8_16x4_rvv_i32 : 2.0 1.2 > avg_8_16x8_c : 24.7 26.0 > avg_8_16x8_rvv_i32 : 3.2 2.0 > avg_8_16x16_c : 49.0 51.2 > avg_8_16x16_rvv_i32 : 5.7 3.2 > avg_8_16x32_c : 97.7 102.5 > avg_8_16x32_rvv_i32 : 10.7 5.7 > avg_8_16x64_c : 220.5 214.2 > avg_8_16x64_rvv_i32 : 48.2 39.5 > avg_8_16x128_c : 436.2 428.0 > avg_8_16x128_rvv_i32 : 97.2 77.0 > avg_8_32x2_c : 12.2 12.7 > avg_8_32x2_rvv_i32 : 2.0 1.2 > avg_8_32x4_c : 24.5 25.5 > avg_8_32x4_rvv_i32 : 3.2 1.7 > avg_8_32x8_c : 48.5 50.7 > avg_8_32x8_rvv_i32 : 5.7 2.7 > avg_8_32x16_c : 96.5 101.2 > avg_8_32x16_rvv_i32 : 10.2 5.0 > avg_8_32x32_c : 192.5 202.2 > avg_8_32x32_rvv_i32 : 20.0 9.5 > avg_8_32x64_c : 405.7 404.5 > avg_8_32x64_rvv_i32 : 72.5 40.2 > avg_8_32x128_c : 821.0 832.2 > avg_8_32x128_rvv_i32 : 136.2 75.7 > avg_8_64x2_c : 24.0 25.2 > avg_8_64x2_rvv_i32 : 3.2 1.7 > avg_8_64x4_c : 48.5 51.0 > avg_8_64x4_rvv_i32 : 5.5 2.7 > avg_8_64x8_c : 97.0 101.5 > avg_8_64x8_rvv_i32 : 10.2 5.0 > avg_8_64x16_c : 193.5 202.7 > avg_8_64x16_rvv_i32 : 19.2 9.2 > avg_8_64x32_c : 404.2 405.7 > avg_8_64x32_rvv_i32 : 38.0 17.7 > avg_8_64x64_c : 834.0 840.7 > avg_8_64x64_rvv_i32 : 75.0 36.2 > avg_8_64x128_c : 1667.2 1685.7 > avg_8_64x128_rvv_i32 : 336.0 181.5 > avg_8_128x2_c : 49.0 50.7 > avg_8_128x2_rvv_i32 : 5.2 2.7 > avg_8_128x4_c : 96.7 101.0 > avg_8_128x4_rvv_i32 : 9.7 4.7 > avg_8_128x8_c : 193.0 201.7 > avg_8_128x8_rvv_i32 : 19.0 8.5 > avg_8_128x16_c : 386.2 402.7 > avg_8_128x16_rvv_i32 : 37.2 79.0 > avg_8_128x32_c : 789.2 805.0 > avg_8_128x32_rvv_i32 : 73.5 32.7 > avg_8_128x64_c : 1620.5 1651.2 > avg_8_128x64_rvv_i32 : 185.2 69.7 > avg_8_128x128_c : 3203.0 
3236.7 > avg_8_128x128_rvv_i32 : 457.2 280.5 > w_avg_8_2x2_c : 1.5 1.5 > w_avg_8_2x2_rvv_i32 : 1.7 1.5 > w_avg_8_2x4_c : 2.7 2.7 > w_avg_8_2x4_rvv_i32 : 2.7 2.5 > w_avg_8_2x8_c : 5.0 4.7 > w_avg_8_2x8_rvv_i32 : 4.5 4.0 > w_avg_8_2x16_c : 9.7 9.5 > w_avg_8_2x16_rvv_i32 : 8.0 7.0 > w_avg_8_2x32_c : 18.7 18.5 > w_avg_8_2x32_rvv_i32 : 15.0 13.2 > w_avg_8_2x64_c : 57.7 49.0 > w_avg_8_2x64_rvv_i32 : 42.7 35.2 > w_avg_8_2x128_c : 127.5 94.5 > w_avg_8_2x128_rvv_i32 : 99.2 78.2 > w_avg_8_4x2_c : 2.5 2.5 > w_avg_8_4x2_rvv_i32 : 2.0 1.7 > w_avg_8_4x4_c : 4.7 4.5 > w_avg_8_4x4_rvv_i32 : 2.7 2.2 > w_avg_8_4x8_c : 9.0 9.0 > w_avg_8_4x8_rvv_i32 : 4.5 4.0 > w_avg_8_4x16_c : 17.7 17.7 > w_avg_8_4x16_rvv_i32 : 8.0 7.0 > w_avg_8_4x32_c : 35.0 35.0 > w_avg_8_4x32_rvv_i32 : 32.7 13.2 > w_avg_8_4x64_c : 117.5 79.5 > w_avg_8_4x64_rvv_i32 : 47.2 39.0 > w_avg_8_4x128_c : 235.7 159.0 > w_avg_8_4x128_rvv_i32 : 101.0 80.0 > w_avg_8_8x2_c : 4.5 4.5 > w_avg_8_8x2_rvv_i32 : 1.7 1.7 > w_avg_8_8x4_c : 8.7 8.7 > w_avg_8_8x4_rvv_i32 : 2.7 2.5 > w_avg_8_8x8_c : 17.2 17.0 > w_avg_8_8x8_rvv_i32 : 4.7 4.0 > w_avg_8_8x16_c : 34.0 34.2 > w_avg_8_8x16_rvv_i32 : 8.5 7.0 > w_avg_8_8x32_c : 67.5 67.7 > w_avg_8_8x32_rvv_i32 : 16.0 13.2 > w_avg_8_8x64_c : 184.0 147.7 > w_avg_8_8x64_rvv_i32 : 53.7 35.0 > w_avg_8_8x128_c : 350.0 320.5 > w_avg_8_8x128_rvv_i32 : 98.5 74.7 > w_avg_8_16x2_c : 8.7 8.5 > w_avg_8_16x2_rvv_i32 : 2.5 1.7 > w_avg_8_16x4_c : 17.0 17.0 > w_avg_8_16x4_rvv_i32 : 3.7 2.5 > w_avg_8_16x8_c : 49.7 33.5 > w_avg_8_16x8_rvv_i32 : 6.5 4.2 > w_avg_8_16x16_c : 66.5 66.5 > w_avg_8_16x16_rvv_i32 : 12.2 7.5 > w_avg_8_16x32_c : 132.2 134.0 > w_avg_8_16x32_rvv_i32 : 23.2 14.2 > w_avg_8_16x64_c : 298.2 283.5 > w_avg_8_16x64_rvv_i32 : 65.7 40.2 > w_avg_8_16x128_c : 755.5 593.2 > w_avg_8_16x128_rvv_i32 : 132.0 76.0 > w_avg_8_32x2_c : 16.7 16.7 > w_avg_8_32x2_rvv_i32 : 3.5 2.0 > w_avg_8_32x4_c : 33.2 33.2 > w_avg_8_32x4_rvv_i32 : 6.0 3.2 > w_avg_8_32x8_c : 65.7 66.0 > w_avg_8_32x8_rvv_i32 : 11.2 6.5 > 
w_avg_8_32x16_c : 148.2 132.0 > w_avg_8_32x16_rvv_i32 : 21.5 10.7 > w_avg_8_32x32_c : 266.2 267.0 > w_avg_8_32x32_rvv_i32 : 60.7 20.7 > w_avg_8_32x64_c : 683.5 559.7 > w_avg_8_32x64_rvv_i32 : 83.7 65.0 > w_avg_8_32x128_c : 1169.5 1140.2 > w_avg_8_32x128_rvv_i32 : 191.2 96.7 > w_avg_8_64x2_c : 33.0 33.2 > w_avg_8_64x2_rvv_i32 : 6.0 3.2 > w_avg_8_64x4_c : 65.5 65.7 > w_avg_8_64x4_rvv_i32 : 11.2 5.2 > w_avg_8_64x8_c : 149.7 132.0 > w_avg_8_64x8_rvv_i32 : 21.7 9.7 > w_avg_8_64x16_c : 279.2 262.7 > w_avg_8_64x16_rvv_i32 : 42.2 18.5 > w_avg_8_64x32_c : 538.7 542.2 > w_avg_8_64x32_rvv_i32 : 83.7 36.5 > w_avg_8_64x64_c : 1200.2 1074.2 > w_avg_8_64x64_rvv_i32 : 204.5 73.7 > w_avg_8_64x128_c : 2375.7 2482.0 > w_avg_8_64x128_rvv_i32 : 390.5 205.2 > w_avg_8_128x2_c : 66.2 66.5 > w_avg_8_128x2_rvv_i32 : 11.0 5.2 > w_avg_8_128x4_c : 133.2 133.2 > w_avg_8_128x4_rvv_i32 : 21.7 10.0 > w_avg_8_128x8_c : 303.5 268.5 > w_avg_8_128x8_rvv_i32 : 42.2 18.5 > w_avg_8_128x16_c : 544.0 545.5 > w_avg_8_128x16_rvv_i32 : 83.5 36.5 > w_avg_8_128x32_c : 1128.0 1090.7 > w_avg_8_128x32_rvv_i32 : 166.5 72.2 > w_avg_8_128x64_c : 2275.7 2167.5 > w_avg_8_128x64_rvv_i32 : 391.7 146.7 > w_avg_8_128x128_c : 4851.2 4310.5 > w_avg_8_128x128_rvv_i32 : 742.0 341.2 > --- > libavcodec/riscv/vvc/Makefile | 2 + > libavcodec/riscv/vvc/vvc_mc_rvv.S | 296 +++++++++++++++++++++++++++++ > libavcodec/riscv/vvc/vvcdsp_init.c | 71 +++++++ > libavcodec/vvc/dsp.c | 4 +- > libavcodec/vvc/dsp.h | 1 + > 5 files changed, 373 insertions(+), 1 deletion(-) > create mode 100644 libavcodec/riscv/vvc/Makefile > create mode 100644 libavcodec/riscv/vvc/vvc_mc_rvv.S > create mode 100644 libavcodec/riscv/vvc/vvcdsp_init.c > > diff --git a/libavcodec/riscv/vvc/Makefile b/libavcodec/riscv/vvc/Makefile > new file mode 100644 > index 0000000000..582b051579 > --- /dev/null > +++ b/libavcodec/riscv/vvc/Makefile > @@ -0,0 +1,2 @@ > +OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvcdsp_init.o > +RVV-OBJS-$(CONFIG_VVC_DECODER) += 
riscv/vvc/vvc_mc_rvv.o > diff --git a/libavcodec/riscv/vvc/vvc_mc_rvv.S > b/libavcodec/riscv/vvc/vvc_mc_rvv.S > new file mode 100644 > index 0000000000..e6e906f3b2 > --- /dev/null > +++ b/libavcodec/riscv/vvc/vvc_mc_rvv.S > @@ -0,0 +1,296 @@ > +/* > + * Copyright (c) 2024 Institue of Software Chinese Academy of Sciences > (ISCAS). > + * > + * This file is part of FFmpeg. > + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with FFmpeg; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > 02110-1301 USA > + */ > + > +#include "libavutil/riscv/asm.S" > + > +.macro vsetvlstatic8 w, vlen, is_w > + .if \w <= 2 > + vsetivli zero, \w, e8, mf8, ta, ma > + .elseif \w <= 4 && \vlen == 128 > + vsetivli zero, \w, e8, mf4, ta, ma > + .elseif \w <= 4 && \vlen >= 256 > + vsetivli zero, \w, e8, mf8, ta, ma > + .elseif \w <= 8 && \vlen == 128 > + vsetivli zero, \w, e8, mf2, ta, ma > + .elseif \w <= 8 && \vlen >= 256 > + vsetivli zero, \w, e8, mf4, ta, ma > + .elseif \w <= 16 && \vlen == 128 > + vsetivli zero, \w, e8, m1, ta, ma > + .elseif \w <= 16 && \vlen >= 256 > + vsetivli zero, \w, e8, mf2, ta, ma > + .elseif \w <= 32 && \vlen >= 256 > + li t0, \w > + vsetvli zero, t0, e8, m1, ta, ma > + .elseif \w <= (\vlen / 4) || \is_w > + li t0, 64 > + vsetvli zero, t0, e8, m2, ta, ma > + .else > + li t0, \w > + vsetvli zero, t0, e8, m4, ta, ma > + .endif > +.endm > + 
> +.macro vsetvlstatic16 w, vlen, is_w > + .if \w <= 2 > + vsetivli zero, \w, e16, mf4, ta, ma > + .elseif \w <= 4 && \vlen == 128 > + vsetivli zero, \w, e16, mf2, ta, ma > + .elseif \w <= 4 && \vlen >= 256 > + vsetivli zero, \w, e16, mf4, ta, ma > + .elseif \w <= 8 && \vlen == 128 > + vsetivli zero, \w, e16, m1, ta, ma > + .elseif \w <= 8 && \vlen >= 256 > + vsetivli zero, \w, e16, mf2, ta, ma > + .elseif \w <= 16 && \vlen == 128 > + vsetivli zero, \w, e16, m2, ta, ma > + .elseif \w <= 16 && \vlen >= 256 > + vsetivli zero, \w, e16, m1, ta, ma > + .elseif \w <= 32 && \vlen >= 256 > + li t0, \w > + vsetvli zero, t0, e16, m2, ta, ma > + .elseif \w <= (\vlen / 4) || \is_w > + li t0, 64 > + vsetvli zero, t0, e16, m4, ta, ma > + .else > + li t0, \w > + vsetvli zero, t0, e16, m8, ta, ma > + .endif > +.endm > + > +.macro vsetvlstatic32 w, vlen > + .if \w <= 2 > + vsetivli zero, \w, e32, mf2, ta, ma > + .elseif \w <= 4 && \vlen == 128 > + vsetivli zero, \w, e32, m1, ta, ma > + .elseif \w <= 4 && \vlen >= 256 > + vsetivli zero, \w, e32, mf2, ta, ma > + .elseif \w <= 8 && \vlen == 128 > + vsetivli zero, \w, e32, m2, ta, ma > + .elseif \w <= 8 && \vlen >= 256 > + vsetivli zero, \w, e32, m1, ta, ma > + .elseif \w <= 16 && \vlen == 128 > + vsetivli zero, \w, e32, m4, ta, ma > + .elseif \w <= 16 && \vlen >= 256 > + vsetivli zero, \w, e32, m2, ta, ma > + .elseif \w <= 32 && \vlen >= 256 > + li t0, \w > + vsetvli zero, t0, e32, m4, ta, ma > + .else > + li t0, \w > + vsetvli zero, t0, e32, m8, ta, ma > + .endif > +.endm > + > +.macro avg_nx1 w, vlen > + vsetvlstatic16 \w, \vlen, 0 > + vle16.v v0, (a2) > + vle16.v v8, (a3) > + vadd.vv v8, v8, v0 > + vmax.vx v8, v8, zero > + vsetvlstatic8 \w, \vlen, 0 > + vnclipu.wi v8, v8, 7 > + vse8.v v8, (a0) > +.endm > + > +.macro avg w, vlen, id > +\id\w\vlen: > +.if \w < 128 > + vsetvlstatic16 \w, \vlen, 0 > + addi t0, a2, 128*2 > + addi t1, a3, 128*2 > + add t2, a0, a1 > + vle16.v v0, (a2) > + vle16.v v8, (a3) > + addi a5, a5, -2 > + vle16.v 
v16, (t0) > + vle16.v v24, (t1) > + vadd.vv v8, v8, v0 > + vadd.vv v24, v24, v16 > + vmax.vx v8, v8, zero > + vmax.vx v24, v24, zero > + vsetvlstatic8 \w, \vlen, 0 > + addi a2, a2, 128*4 > + vnclipu.wi v8, v8, 7 > + vnclipu.wi v24, v24, 7 > + addi a3, a3, 128*4 > + vse8.v v8, (a0) > + vse8.v v24, (t2) > + sh1add a0, a1, a0 > +.else > + avg_nx1 128, \vlen > + addi a5, a5, -1 > + .if \vlen == 128 > + addi a2, a2, 64*2 > + addi a3, a3, 64*2 > + addi a0, a0, 64 > + avg_nx1 128, \vlen > + addi a0, a0, -64 > + addi a2, a2, 128 > + addi a3, a3, 128 > + .else > + addi a2, a2, 128*2 > + addi a3, a3, 128*2 > + .endif > + add a0, a0, a1 > +.endif > + bnez a5, \id\w\vlen\()b > + ret > +.endm > + > + > +.macro AVG_JMP_TABLE id, vlen > +const jmp_table_\id\vlen > + .4byte \id\()2\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()4\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()8\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()16\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()32\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()64\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()128\vlen\()f - jmp_table_\id\vlen > +endconst > +.endm > + > +.macro AVG_J vlen, id > + clz a4, a4 > + li t0, __riscv_xlen-2 > + sub a4, t0, a4 > + lla t5, jmp_table_\id\vlen > + sh2add t0, a4, t5 > + lw t0, 0(t0) > + add t0, t0, t5 > + jr t0 > +.endm > + > +.macro func_avg vlen > +func ff_vvc_avg_8_rvv_\vlen\(), zve32x > + AVG_JMP_TABLE 1, \vlen > + csrw vxrm, zero > + AVG_J \vlen, 1 > + .irp w,2,4,8,16,32,64,128 > + avg \w, \vlen, 1 > + .endr > +endfunc > +.endm > + > +.macro w_avg_nx1 w, vlen > + vsetvlstatic16 \w, \vlen, 1 > + vle16.v v0, (a2) > + vle16.v v8, (a3) > + vwmul.vx v16, v0, a7 > + vwmacc.vx v16, t3, v8 > + vsetvlstatic32 \w, \vlen > + vadd.vx v16, v16, t4 > + vsetvlstatic16 \w, \vlen, 1 > + vnsrl.wx v16, v16, t6 > + vmax.vx v16, v16, zero > + vsetvlstatic8 \w, \vlen, 1 > + vnclipu.wi v16, v16, 0 > + vse8.v v16, (a0) > +.endm > + > +.macro w_avg w, vlen, id > +\id\w\vlen: > +.if \vlen <= 16 > + 
vsetvlstatic16 \w, \vlen, 1 > + addi t0, a2, 128*2 > + addi t1, a3, 128*2 > + vle16.v v0, (a2) > + vle16.v v8, (a3) > + addi a5, a5, -2 > + vle16.v v20, (t0) > + vle16.v v24, (t1) > + vwmul.vx v16, v0, a7 > + vwmul.vx v28, v20, a7 > + vwmacc.vx v16, t3, v8 > + vwmacc.vx v28, t3, v24 > + vsetvlstatic32 \w, \vlen > + add t2, a0, a1 > + vadd.vx v16, v16, t4 > + vadd.vx v28, v28, t4 > + vsetvlstatic16 \w, \vlen, 1 > + vnsrl.wx v16, v16, t6 > + vnsrl.wx v28, v28, t6 > + vmax.vx v16, v16, zero > + vmax.vx v28, v28, zero > + vsetvlstatic8 \w, \vlen, 1 > + addi a2, a2, 128*4 > + vnclipu.wi v16, v16, 0 > + vnclipu.wi v28, v28, 0 > + vse8.v v16, (a0) > + addi a3, a3, 128*4 > + vse8.v v28, (t2) > + sh1add a0, a1, a0 > +.else > + w_avg_nx1 \w, \vlen > + addi a5, a5, -1 > + .if \w == (\vlen / 2) > + addi a2, a2, (\vlen / 2) > + addi a3, a3, (\vlen / 2) > + addi a0, a0, (\vlen / 4) > + w_avg_nx1 \w, \vlen > + addi a2, a2, -(\vlen / 2) > + addi a3, a3, -(\vlen / 2) > + addi a0, a0, -(\vlen / 4) > + .elseif \w == 128 && \vlen == 128 > + .rept 3 > + addi a2, a2, (\vlen / 2) > + addi a3, a3, (\vlen / 2) > + addi a0, a0, (\vlen / 4) > + w_avg_nx1 \w, \vlen > + .endr > + addi a2, a2, -(\vlen / 2) * 3 > + addi a3, a3, -(\vlen / 2) * 3 > + addi a0, a0, -(\vlen / 4) * 3 > + .endif > + > + addi a2, a2, 128*2 > + addi a3, a3, 128*2 > + add a0, a0, a1 > +.endif > + bnez a5, \id\w\vlen\()b > + ret > +.endm > + > + > +.macro func_w_avg vlen > +func ff_vvc_w_avg_8_rvv_\vlen\(), zve32x > + AVG_JMP_TABLE 2, \vlen > + csrw vxrm, zero > + addi t6, a6, 7 > + ld t3, (sp) > + ld t4, 8(sp) > + ld t5, 16(sp) > + add t4, t4, t5 > + addi t4, t4, 1 // o0 + o1 + 1 > + addi t5, t6, -1 // shift - 1 > + sll t4, t4, t5 > + AVG_J \vlen, 2 > + .irp w,2,4,8,16,32,64,128 > + w_avg \w, \vlen, 2 > + .endr > +endfunc > +.endm > + > +func_avg 128 > +func_avg 256 > +#if (__riscv_xlen == 64) > +func_w_avg 128 > +func_w_avg 256 > +#endif > diff --git a/libavcodec/riscv/vvc/vvcdsp_init.c > 
b/libavcodec/riscv/vvc/vvcdsp_init.c > new file mode 100644 > index 0000000000..85b1ede061 > --- /dev/null > +++ b/libavcodec/riscv/vvc/vvcdsp_init.c > @@ -0,0 +1,71 @@ > +/* > + * Copyright (c) 2024 Institue of Software Chinese Academy of Sciences > (ISCAS). > + * > + * This file is part of FFmpeg. > + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with FFmpeg; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA > 02110-1301 USA > + */ > + > +#include "config.h" > + > +#include "libavutil/attributes.h" > +#include "libavutil/cpu.h" > +#include "libavutil/riscv/cpu.h" > +#include "libavcodec/vvc/dsp.h" > + > +#define bf(fn, bd, opt) fn##_##bd##_##opt > + > +#define AVG_PROTOTYPES(bd, opt) > \ > +void bf(ff_vvc_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride, > \ > + const int16_t *src0, const int16_t *src1, int width, int height); > \ > +void bf(ff_vvc_w_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride, > \ > + const int16_t *src0, const int16_t *src1, int width, int height, > \ > + int denom, int w0, int w1, int o0, int o1); > + > +AVG_PROTOTYPES(8, rvv_128) > +AVG_PROTOTYPES(8, rvv_256) > + > +void ff_vvc_dsp_init_riscv(VVCDSPContext *const c, const int bd) > +{ > +#if HAVE_RVV > + const int flags = av_get_cpu_flags(); > + > + if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR) && > + ff_rv_vlen_least(256)) { > + switch (bd) 
{ > + case 8: > + c->inter.avg = ff_vvc_avg_8_rvv_256; > +# if (__riscv_xlen == 64) > + c->inter.w_avg = ff_vvc_w_avg_8_rvv_256; > +# endif > + break; > + default: > + break; > + } > + } else if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & > AV_CPU_FLAG_RVB_ADDR) && > + ff_rv_vlen_least(128)) { > + switch (bd) { > + case 8: > + c->inter.avg = ff_vvc_avg_8_rvv_128; > +# if (__riscv_xlen == 64) > + c->inter.w_avg = ff_vvc_w_avg_8_rvv_128; > +# endif > + break; > + default: > + break; > + } > + } > +#endif > +} > diff --git a/libavcodec/vvc/dsp.c b/libavcodec/vvc/dsp.c > index 41e830a98a..c55a37d255 100644 > --- a/libavcodec/vvc/dsp.c > +++ b/libavcodec/vvc/dsp.c > @@ -121,7 +121,9 @@ void ff_vvc_dsp_init(VVCDSPContext *vvcdsp, int > bit_depth) > break; > } > > -#if ARCH_X86 > +#if ARCH_RISCV > + ff_vvc_dsp_init_riscv(vvcdsp, bit_depth); > +#elif ARCH_X86 > ff_vvc_dsp_init_x86(vvcdsp, bit_depth); > #endif > } > diff --git a/libavcodec/vvc/dsp.h b/libavcodec/vvc/dsp.h > index 1f14096c41..e03236dd76 100644 > --- a/libavcodec/vvc/dsp.h > +++ b/libavcodec/vvc/dsp.h > @@ -180,6 +180,7 @@ typedef struct VVCDSPContext { > > void ff_vvc_dsp_init(VVCDSPContext *hpc, int bit_depth); > > +void ff_vvc_dsp_init_riscv(VVCDSPContext *hpc, const int bit_depth); > void ff_vvc_dsp_init_x86(VVCDSPContext *hpc, const int bit_depth); > > #endif /* AVCODEC_VVC_DSP_H */ > -- > 2.45.2 > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". >
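For orientation, here is a scalar model of what the patch's 8-bit `avg` path computes per sample: the two 16-bit intermediate predictions are summed (`vadd.vv`), clamped at zero (`vmax.vx ..., zero`), then narrowed with a rounding right shift by 7 (`vnclipu.wi v8, v8, 7` with `vxrm` set to round-to-nearest-up). This is a sketch, not code from the patch; the helper name `avg8` is ours.

```c
#include <stdint.h>

/* Scalar model of the RVV avg path: vadd.vv, vmax.vx zero, then
 * vnclipu.wi by 7 with round-to-nearest-up (vxrm = 0 / RNU). */
static uint8_t avg8(int16_t s0, int16_t s1)
{
    int sum = s0 + s1;          /* vadd.vv */
    if (sum < 0)
        sum = 0;                /* vmax.vx ..., zero */
    sum = (sum + 64) >> 7;      /* rounding narrow by 7 bits (RNU) */
    return sum > 255 ? 255 : (uint8_t)sum; /* unsigned clip of vnclipu */
}
```

The rounding add of 64 is what `vxrm = 0` contributes implicitly in the vector narrow.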
On Tuesday, 11 June 2024 at 19.38.15 EEST, uk7b@foxmail.com wrote:
> From: sunyuechi <sunyuechi@iscas.ac.cn>
>
> C908 X60
> avg_8_2x2_c : 1.2 1.0
> avg_8_2x2_rvv_i32 : 1.0 1.0
> avg_8_2x4_c : 2.0 2.0
> avg_8_2x4_rvv_i32 : 1.5 1.2
> avg_8_2x8_c : 3.7 4.0
> avg_8_2x8_rvv_i32 : 2.0 2.0
> avg_8_2x16_c : 7.2 7.7
> avg_8_2x16_rvv_i32 : 3.2 3.0
> avg_8_2x32_c : 14.5 15.2
> avg_8_2x32_rvv_i32 : 5.7 5.0
> avg_8_2x64_c : 50.0 45.2
> avg_8_2x64_rvv_i32 : 41.5 32.5
> avg_8_2x128_c : 101.5 84.2
> avg_8_2x128_rvv_i32 : 89.5 73.2
> avg_8_4x2_c : 2.0 2.0
> avg_8_4x2_rvv_i32 : 1.0 1.0
> avg_8_4x4_c : 3.5 3.5
> avg_8_4x4_rvv_i32 : 1.5 1.2
> avg_8_4x8_c : 6.7 7.0
> avg_8_4x8_rvv_i32 : 2.0 1.7
> avg_8_4x16_c : 13.2 14.0
> avg_8_4x16_rvv_i32 : 3.2 3.0
> avg_8_4x32_c : 26.2 27.7
> avg_8_4x32_rvv_i32 : 5.7 5.0
> avg_8_4x64_c : 75.0 66.0
> avg_8_4x64_rvv_i32 : 40.2 33.0
> avg_8_4x128_c : 144.5 128.0
> avg_8_4x128_rvv_i32 : 89.5 78.7
> avg_8_8x2_c : 3.2 3.5
> avg_8_8x2_rvv_i32 : 1.2 1.0
> avg_8_8x4_c : 6.5 6.7
> avg_8_8x4_rvv_i32 : 1.5 1.5
> avg_8_8x8_c : 12.7 13.2
> avg_8_8x8_rvv_i32 : 2.2 1.7
> avg_8_8x16_c : 25.2 26.5
> avg_8_8x16_rvv_i32 : 3.7 2.7
> avg_8_8x32_c : 50.2 52.7
> avg_8_8x32_rvv_i32 : 6.5 5.0
> avg_8_8x64_c : 120.2 117.7
> avg_8_8x64_rvv_i32 : 45.2 39.2
> avg_8_8x128_c : 223.0 233.5
> avg_8_8x128_rvv_i32 : 80.0 73.2
> avg_8_16x2_c : 6.2 6.5
> avg_8_16x2_rvv_i32 : 1.5 1.0
> avg_8_16x4_c : 12.5 12.7
> avg_8_16x4_rvv_i32 : 2.0 1.2
> avg_8_16x8_c : 24.7 26.0
> avg_8_16x8_rvv_i32 : 3.2 2.0
> avg_8_16x16_c : 49.0 51.2
> avg_8_16x16_rvv_i32 : 5.7 3.2
> avg_8_16x32_c : 97.7 102.5
> avg_8_16x32_rvv_i32 : 10.7 5.7
> avg_8_16x64_c : 220.5 214.2
> avg_8_16x64_rvv_i32 : 48.2 39.5
> avg_8_16x128_c : 436.2 428.0
> avg_8_16x128_rvv_i32 : 97.2 77.0
> avg_8_32x2_c : 12.2 12.7
> avg_8_32x2_rvv_i32 : 2.0 1.2
> avg_8_32x4_c : 24.5 25.5
> avg_8_32x4_rvv_i32 : 3.2 1.7
> avg_8_32x8_c : 48.5 50.7
> avg_8_32x8_rvv_i32 : 5.7 2.7
> avg_8_32x16_c : 96.5 101.2
> avg_8_32x16_rvv_i32 : 10.2 5.0
avg_8_32x32_c : 192.5 202.2 > avg_8_32x32_rvv_i32 : 20.0 9.5 > avg_8_32x64_c : 405.7 404.5 > avg_8_32x64_rvv_i32 : 72.5 40.2 > avg_8_32x128_c : 821.0 832.2 > avg_8_32x128_rvv_i32 : 136.2 75.7 > avg_8_64x2_c : 24.0 25.2 > avg_8_64x2_rvv_i32 : 3.2 1.7 > avg_8_64x4_c : 48.5 51.0 > avg_8_64x4_rvv_i32 : 5.5 2.7 > avg_8_64x8_c : 97.0 101.5 > avg_8_64x8_rvv_i32 : 10.2 5.0 > avg_8_64x16_c : 193.5 202.7 > avg_8_64x16_rvv_i32 : 19.2 9.2 > avg_8_64x32_c : 404.2 405.7 > avg_8_64x32_rvv_i32 : 38.0 17.7 > avg_8_64x64_c : 834.0 840.7 > avg_8_64x64_rvv_i32 : 75.0 36.2 > avg_8_64x128_c : 1667.2 1685.7 > avg_8_64x128_rvv_i32 : 336.0 181.5 > avg_8_128x2_c : 49.0 50.7 > avg_8_128x2_rvv_i32 : 5.2 2.7 > avg_8_128x4_c : 96.7 101.0 > avg_8_128x4_rvv_i32 : 9.7 4.7 > avg_8_128x8_c : 193.0 201.7 > avg_8_128x8_rvv_i32 : 19.0 8.5 > avg_8_128x16_c : 386.2 402.7 > avg_8_128x16_rvv_i32 : 37.2 79.0 > avg_8_128x32_c : 789.2 805.0 > avg_8_128x32_rvv_i32 : 73.5 32.7 > avg_8_128x64_c : 1620.5 1651.2 > avg_8_128x64_rvv_i32 : 185.2 69.7 > avg_8_128x128_c : 3203.0 3236.7 > avg_8_128x128_rvv_i32 : 457.2 280.5 > w_avg_8_2x2_c : 1.5 1.5 > w_avg_8_2x2_rvv_i32 : 1.7 1.5 > w_avg_8_2x4_c : 2.7 2.7 > w_avg_8_2x4_rvv_i32 : 2.7 2.5 > w_avg_8_2x8_c : 5.0 4.7 > w_avg_8_2x8_rvv_i32 : 4.5 4.0 > w_avg_8_2x16_c : 9.7 9.5 > w_avg_8_2x16_rvv_i32 : 8.0 7.0 > w_avg_8_2x32_c : 18.7 18.5 > w_avg_8_2x32_rvv_i32 : 15.0 13.2 > w_avg_8_2x64_c : 57.7 49.0 > w_avg_8_2x64_rvv_i32 : 42.7 35.2 > w_avg_8_2x128_c : 127.5 94.5 > w_avg_8_2x128_rvv_i32 : 99.2 78.2 > w_avg_8_4x2_c : 2.5 2.5 > w_avg_8_4x2_rvv_i32 : 2.0 1.7 > w_avg_8_4x4_c : 4.7 4.5 > w_avg_8_4x4_rvv_i32 : 2.7 2.2 > w_avg_8_4x8_c : 9.0 9.0 > w_avg_8_4x8_rvv_i32 : 4.5 4.0 > w_avg_8_4x16_c : 17.7 17.7 > w_avg_8_4x16_rvv_i32 : 8.0 7.0 > w_avg_8_4x32_c : 35.0 35.0 > w_avg_8_4x32_rvv_i32 : 32.7 13.2 > w_avg_8_4x64_c : 117.5 79.5 > w_avg_8_4x64_rvv_i32 : 47.2 39.0 > w_avg_8_4x128_c : 235.7 159.0 > w_avg_8_4x128_rvv_i32 : 101.0 80.0 > w_avg_8_8x2_c : 4.5 4.5 > w_avg_8_8x2_rvv_i32 : 
1.7 1.7 > w_avg_8_8x4_c : 8.7 8.7 > w_avg_8_8x4_rvv_i32 : 2.7 2.5 > w_avg_8_8x8_c : 17.2 17.0 > w_avg_8_8x8_rvv_i32 : 4.7 4.0 > w_avg_8_8x16_c : 34.0 34.2 > w_avg_8_8x16_rvv_i32 : 8.5 7.0 > w_avg_8_8x32_c : 67.5 67.7 > w_avg_8_8x32_rvv_i32 : 16.0 13.2 > w_avg_8_8x64_c : 184.0 147.7 > w_avg_8_8x64_rvv_i32 : 53.7 35.0 > w_avg_8_8x128_c : 350.0 320.5 > w_avg_8_8x128_rvv_i32 : 98.5 74.7 > w_avg_8_16x2_c : 8.7 8.5 > w_avg_8_16x2_rvv_i32 : 2.5 1.7 > w_avg_8_16x4_c : 17.0 17.0 > w_avg_8_16x4_rvv_i32 : 3.7 2.5 > w_avg_8_16x8_c : 49.7 33.5 > w_avg_8_16x8_rvv_i32 : 6.5 4.2 > w_avg_8_16x16_c : 66.5 66.5 > w_avg_8_16x16_rvv_i32 : 12.2 7.5 > w_avg_8_16x32_c : 132.2 134.0 > w_avg_8_16x32_rvv_i32 : 23.2 14.2 > w_avg_8_16x64_c : 298.2 283.5 > w_avg_8_16x64_rvv_i32 : 65.7 40.2 > w_avg_8_16x128_c : 755.5 593.2 > w_avg_8_16x128_rvv_i32 : 132.0 76.0 > w_avg_8_32x2_c : 16.7 16.7 > w_avg_8_32x2_rvv_i32 : 3.5 2.0 > w_avg_8_32x4_c : 33.2 33.2 > w_avg_8_32x4_rvv_i32 : 6.0 3.2 > w_avg_8_32x8_c : 65.7 66.0 > w_avg_8_32x8_rvv_i32 : 11.2 6.5 > w_avg_8_32x16_c : 148.2 132.0 > w_avg_8_32x16_rvv_i32 : 21.5 10.7 > w_avg_8_32x32_c : 266.2 267.0 > w_avg_8_32x32_rvv_i32 : 60.7 20.7 > w_avg_8_32x64_c : 683.5 559.7 > w_avg_8_32x64_rvv_i32 : 83.7 65.0 > w_avg_8_32x128_c : 1169.5 1140.2 > w_avg_8_32x128_rvv_i32 : 191.2 96.7 > w_avg_8_64x2_c : 33.0 33.2 > w_avg_8_64x2_rvv_i32 : 6.0 3.2 > w_avg_8_64x4_c : 65.5 65.7 > w_avg_8_64x4_rvv_i32 : 11.2 5.2 > w_avg_8_64x8_c : 149.7 132.0 > w_avg_8_64x8_rvv_i32 : 21.7 9.7 > w_avg_8_64x16_c : 279.2 262.7 > w_avg_8_64x16_rvv_i32 : 42.2 18.5 > w_avg_8_64x32_c : 538.7 542.2 > w_avg_8_64x32_rvv_i32 : 83.7 36.5 > w_avg_8_64x64_c : 1200.2 1074.2 > w_avg_8_64x64_rvv_i32 : 204.5 73.7 > w_avg_8_64x128_c : 2375.7 2482.0 > w_avg_8_64x128_rvv_i32 : 390.5 205.2 > w_avg_8_128x2_c : 66.2 66.5 > w_avg_8_128x2_rvv_i32 : 11.0 5.2 > w_avg_8_128x4_c : 133.2 133.2 > w_avg_8_128x4_rvv_i32 : 21.7 10.0 > w_avg_8_128x8_c : 303.5 268.5 > w_avg_8_128x8_rvv_i32 : 42.2 18.5 > w_avg_8_128x16_c : 
544.0 545.5 > w_avg_8_128x16_rvv_i32 : 83.5 36.5 > w_avg_8_128x32_c : 1128.0 1090.7 > w_avg_8_128x32_rvv_i32 : 166.5 72.2 > w_avg_8_128x64_c : 2275.7 2167.5 > w_avg_8_128x64_rvv_i32 : 391.7 146.7 > w_avg_8_128x128_c : 4851.2 4310.5 > w_avg_8_128x128_rvv_i32 : 742.0 341.2 > --- > libavcodec/riscv/vvc/Makefile | 2 + > libavcodec/riscv/vvc/vvc_mc_rvv.S | 296 +++++++++++++++++++++++++++++ > libavcodec/riscv/vvc/vvcdsp_init.c | 71 +++++++ > libavcodec/vvc/dsp.c | 4 +- > libavcodec/vvc/dsp.h | 1 + > 5 files changed, 373 insertions(+), 1 deletion(-) > create mode 100644 libavcodec/riscv/vvc/Makefile > create mode 100644 libavcodec/riscv/vvc/vvc_mc_rvv.S > create mode 100644 libavcodec/riscv/vvc/vvcdsp_init.c > > diff --git a/libavcodec/riscv/vvc/Makefile b/libavcodec/riscv/vvc/Makefile > new file mode 100644 > index 0000000000..582b051579 > --- /dev/null > +++ b/libavcodec/riscv/vvc/Makefile > @@ -0,0 +1,2 @@ > +OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvcdsp_init.o > +RVV-OBJS-$(CONFIG_VVC_DECODER) += riscv/vvc/vvc_mc_rvv.o > diff --git a/libavcodec/riscv/vvc/vvc_mc_rvv.S > b/libavcodec/riscv/vvc/vvc_mc_rvv.S new file mode 100644 > index 0000000000..e6e906f3b2 > --- /dev/null > +++ b/libavcodec/riscv/vvc/vvc_mc_rvv.S > @@ -0,0 +1,296 @@ > +/* > + * Copyright (c) 2024 Institue of Software Chinese Academy of Sciences > (ISCAS). + * > + * This file is part of FFmpeg. > + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. 
> + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with FFmpeg; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 > USA + */ > + > +#include "libavutil/riscv/asm.S" > + > +.macro vsetvlstatic8 w, vlen, is_w Could SEW be a parameter so that these three macros would be a little bit more factored? .ifc / .ifnc might help to match e8/e16/e32. > + .if \w <= 2 > + vsetivli zero, \w, e8, mf8, ta, ma > + .elseif \w <= 4 && \vlen == 128 > + vsetivli zero, \w, e8, mf4, ta, ma > + .elseif \w <= 4 && \vlen >= 256 > + vsetivli zero, \w, e8, mf8, ta, ma > + .elseif \w <= 8 && \vlen == 128 > + vsetivli zero, \w, e8, mf2, ta, ma > + .elseif \w <= 8 && \vlen >= 256 > + vsetivli zero, \w, e8, mf4, ta, ma > + .elseif \w <= 16 && \vlen == 128 > + vsetivli zero, \w, e8, m1, ta, ma > + .elseif \w <= 16 && \vlen >= 256 > + vsetivli zero, \w, e8, mf2, ta, ma > + .elseif \w <= 32 && \vlen >= 256 > + li t0, \w > + vsetvli zero, t0, e8, m1, ta, ma > + .elseif \w <= (\vlen / 4) || \is_w > + li t0, 64 > + vsetvli zero, t0, e8, m2, ta, ma > + .else > + li t0, \w > + vsetvli zero, t0, e8, m4, ta, ma > + .endif > +.endm > + > +.macro vsetvlstatic16 w, vlen, is_w > + .if \w <= 2 > + vsetivli zero, \w, e16, mf4, ta, ma > + .elseif \w <= 4 && \vlen == 128 > + vsetivli zero, \w, e16, mf2, ta, ma > + .elseif \w <= 4 && \vlen >= 256 > + vsetivli zero, \w, e16, mf4, ta, ma > + .elseif \w <= 8 && \vlen == 128 > + vsetivli zero, \w, e16, m1, ta, ma > + .elseif \w <= 8 && \vlen >= 256 > + vsetivli zero, \w, e16, mf2, ta, ma > + .elseif \w <= 16 && \vlen == 128 > + vsetivli zero, \w, e16, m2, ta, ma > + .elseif \w <= 16 && \vlen >= 256 > + vsetivli zero, \w, e16, m1, ta, ma > + .elseif \w <= 32 && \vlen >= 256 > + li t0, \w > + vsetvli zero, t0, e16, m2, ta, ma > + .elseif \w <= (\vlen / 4) || \is_w > + li t0, 64 > + vsetvli zero, t0, e16, m4, ta, ma > + .else > + li t0, \w > + vsetvli zero, t0, 
e16, m8, ta, ma > + .endif > +.endm > + > +.macro vsetvlstatic32 w, vlen > + .if \w <= 2 > + vsetivli zero, \w, e32, mf2, ta, ma > + .elseif \w <= 4 && \vlen == 128 > + vsetivli zero, \w, e32, m1, ta, ma > + .elseif \w <= 4 && \vlen >= 256 > + vsetivli zero, \w, e32, mf2, ta, ma > + .elseif \w <= 8 && \vlen == 128 > + vsetivli zero, \w, e32, m2, ta, ma > + .elseif \w <= 8 && \vlen >= 256 > + vsetivli zero, \w, e32, m1, ta, ma > + .elseif \w <= 16 && \vlen == 128 > + vsetivli zero, \w, e32, m4, ta, ma > + .elseif \w <= 16 && \vlen >= 256 > + vsetivli zero, \w, e32, m2, ta, ma > + .elseif \w <= 32 && \vlen >= 256 > + li t0, \w > + vsetvli zero, t0, e32, m4, ta, ma > + .else > + li t0, \w > + vsetvli zero, t0, e32, m8, ta, ma > + .endif > +.endm > + > +.macro avg_nx1 w, vlen > + vsetvlstatic16 \w, \vlen, 0 > + vle16.v v0, (a2) > + vle16.v v8, (a3) > + vadd.vv v8, v8, v0 > + vmax.vx v8, v8, zero > + vsetvlstatic8 \w, \vlen, 0 > + vnclipu.wi v8, v8, 7 > + vse8.v v8, (a0) > +.endm > + > +.macro avg w, vlen, id > +\id\w\vlen: > +.if \w < 128 > + vsetvlstatic16 \w, \vlen, 0 > + addi t0, a2, 128*2 > + addi t1, a3, 128*2 > + add t2, a0, a1 > + vle16.v v0, (a2) > + vle16.v v8, (a3) > + addi a5, a5, -2 > + vle16.v v16, (t0) > + vle16.v v24, (t1) > + vadd.vv v8, v8, v0 > + vadd.vv v24, v24, v16 > + vmax.vx v8, v8, zero > + vmax.vx v24, v24, zero > + vsetvlstatic8 \w, \vlen, 0 > + addi a2, a2, 128*4 > + vnclipu.wi v8, v8, 7 > + vnclipu.wi v24, v24, 7 > + addi a3, a3, 128*4 > + vse8.v v8, (a0) > + vse8.v v24, (t2) > + sh1add a0, a1, a0 > +.else > + avg_nx1 128, \vlen > + addi a5, a5, -1 > + .if \vlen == 128 > + addi a2, a2, 64*2 > + addi a3, a3, 64*2 > + addi a0, a0, 64 > + avg_nx1 128, \vlen > + addi a0, a0, -64 > + addi a2, a2, 128 > + addi a3, a3, 128 > + .else > + addi a2, a2, 128*2 > + addi a3, a3, 128*2 > + .endif > + add a0, a0, a1 > +.endif > + bnez a5, \id\w\vlen\()b > + ret > +.endm > + > + > +.macro AVG_JMP_TABLE id, vlen > +const jmp_table_\id\vlen > + .4byte 
\id\()2\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()4\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()8\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()16\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()32\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()64\vlen\()f - jmp_table_\id\vlen > + .4byte \id\()128\vlen\()f - jmp_table_\id\vlen > +endconst > +.endm > + > +.macro AVG_J vlen, id > + clz a4, a4 > + li t0, __riscv_xlen-2 > + sub a4, t0, a4 > + lla t5, jmp_table_\id\vlen In C, it would be invalid pointer arithmetic, but in assembler, you can add whatever constant offset you want to this symbol, even if points outside the table. So you should be able to eliminate the LI above. It won't make much difference though. > + sh2add t0, a4, t5 > + lw t0, 0(t0) > + add t0, t0, t5 > + jr t0 > +.endm > + > +.macro func_avg vlen > +func ff_vvc_avg_8_rvv_\vlen\(), zve32x > + AVG_JMP_TABLE 1, \vlen > + csrw vxrm, zero Nit: for overall code base consistency, I'd use csrwi here. Reason being that for other rounding modes, csrwi is the better option. > + AVG_J \vlen, 1 > + .irp w,2,4,8,16,32,64,128 > + avg \w, \vlen, 1 > + .endr > +endfunc > +.endm > + > +.macro w_avg_nx1 w, vlen > + vsetvlstatic16 \w, \vlen, 1 > + vle16.v v0, (a2) > + vle16.v v8, (a3) > + vwmul.vx v16, v0, a7 > + vwmacc.vx v16, t3, v8 > + vsetvlstatic32 \w, \vlen > + vadd.vx v16, v16, t4 I guess t4 is 32-bit? 
Kinda sad to switch VTYPE just for this but if so, I don't have any better idea :( > + vsetvlstatic16 \w, \vlen, 1 > + vnsrl.wx v16, v16, t6 > + vmax.vx v16, v16, zero > + vsetvlstatic8 \w, \vlen, 1 > + vnclipu.wi v16, v16, 0 > + vse8.v v16, (a0) > +.endm > + > +.macro w_avg w, vlen, id > +\id\w\vlen: > +.if \vlen <= 16 > + vsetvlstatic16 \w, \vlen, 1 > + addi t0, a2, 128*2 > + addi t1, a3, 128*2 > + vle16.v v0, (a2) > + vle16.v v8, (a3) > + addi a5, a5, -2 > + vle16.v v20, (t0) > + vle16.v v24, (t1) > + vwmul.vx v16, v0, a7 > + vwmul.vx v28, v20, a7 > + vwmacc.vx v16, t3, v8 > + vwmacc.vx v28, t3, v24 > + vsetvlstatic32 \w, \vlen > + add t2, a0, a1 > + vadd.vx v16, v16, t4 > + vadd.vx v28, v28, t4 > + vsetvlstatic16 \w, \vlen, 1 > + vnsrl.wx v16, v16, t6 > + vnsrl.wx v28, v28, t6 > + vmax.vx v16, v16, zero > + vmax.vx v28, v28, zero > + vsetvlstatic8 \w, \vlen, 1 > + addi a2, a2, 128*4 > + vnclipu.wi v16, v16, 0 > + vnclipu.wi v28, v28, 0 > + vse8.v v16, (a0) > + addi a3, a3, 128*4 > + vse8.v v28, (t2) > + sh1add a0, a1, a0 > +.else > + w_avg_nx1 \w, \vlen > + addi a5, a5, -1 > + .if \w == (\vlen / 2) > + addi a2, a2, (\vlen / 2) > + addi a3, a3, (\vlen / 2) > + addi a0, a0, (\vlen / 4) > + w_avg_nx1 \w, \vlen > + addi a2, a2, -(\vlen / 2) > + addi a3, a3, -(\vlen / 2) > + addi a0, a0, -(\vlen / 4) > + .elseif \w == 128 && \vlen == 128 > + .rept 3 > + addi a2, a2, (\vlen / 2) > + addi a3, a3, (\vlen / 2) > + addi a0, a0, (\vlen / 4) > + w_avg_nx1 \w, \vlen Is that .rept meaningfully faster than a run-time loop? 
> + .endr > + addi a2, a2, -(\vlen / 2) * 3 > + addi a3, a3, -(\vlen / 2) * 3 > + addi a0, a0, -(\vlen / 4) * 3 > + .endif > + > + addi a2, a2, 128*2 > + addi a3, a3, 128*2 > + add a0, a0, a1 > +.endif > + bnez a5, \id\w\vlen\()b > + ret > +.endm > + > + > +.macro func_w_avg vlen > +func ff_vvc_w_avg_8_rvv_\vlen\(), zve32x > + AVG_JMP_TABLE 2, \vlen > + csrw vxrm, zero Nit again > + addi t6, a6, 7 > + ld t3, (sp) > + ld t4, 8(sp) > + ld t5, 16(sp) > + add t4, t4, t5 > + addi t4, t4, 1 // o0 + o1 + 1 Probably faster to swap the two above, to avoid stalling on LD. > + addi t5, t6, -1 // shift - 1 > + sll t4, t4, t5 > + AVG_J \vlen, 2 > + .irp w,2,4,8,16,32,64,128 > + w_avg \w, \vlen, 2 > + .endr > +endfunc > +.endm > + > +func_avg 128 > +func_avg 256 > +#if (__riscv_xlen == 64) > +func_w_avg 128 > +func_w_avg 256 > +#endif > diff --git a/libavcodec/riscv/vvc/vvcdsp_init.c > b/libavcodec/riscv/vvc/vvcdsp_init.c new file mode 100644 > index 0000000000..85b1ede061 > --- /dev/null > +++ b/libavcodec/riscv/vvc/vvcdsp_init.c > @@ -0,0 +1,71 @@ > +/* > + * Copyright (c) 2024 Institue of Software Chinese Academy of Sciences > (ISCAS). + * > + * This file is part of FFmpeg. > + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. 
> + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with FFmpeg; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 > USA + */ > + > +#include "config.h" > + > +#include "libavutil/attributes.h" > +#include "libavutil/cpu.h" > +#include "libavutil/riscv/cpu.h" > +#include "libavcodec/vvc/dsp.h" > + > +#define bf(fn, bd, opt) fn##_##bd##_##opt > + > +#define AVG_PROTOTYPES(bd, opt) > \ +void bf(ff_vvc_avg, bd, opt)(uint8_t *dst, > ptrdiff_t dst_stride, \ + const > int16_t *src0, const int16_t *src1, int width, int height); > \ +void bf(ff_vvc_w_avg, bd, opt)(uint8_t *dst, ptrdiff_t > dst_stride, \ + const int16_t *src0, > const int16_t *src1, int width, int height, > \ + int denom, int w0, int w1, int o0, int o1); > + > +AVG_PROTOTYPES(8, rvv_128) > +AVG_PROTOTYPES(8, rvv_256) > + > +void ff_vvc_dsp_init_riscv(VVCDSPContext *const c, const int bd) > +{ > +#if HAVE_RVV > + const int flags = av_get_cpu_flags(); > + > + if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & AV_CPU_FLAG_RVB_ADDR) && > + ff_rv_vlen_least(256)) { If you check more than one length, better to get ff_get_rv_vlenb() into a local variable. 
> + switch (bd) { > + case 8: > + c->inter.avg = ff_vvc_avg_8_rvv_256; > +# if (__riscv_xlen == 64) > + c->inter.w_avg = ff_vvc_w_avg_8_rvv_256; > +# endif > + break; > + default: > + break; > + } > + } else if ((flags & AV_CPU_FLAG_RVV_I32) && (flags & > AV_CPU_FLAG_RVB_ADDR) && > + ff_rv_vlen_least(128)) { > + switch (bd) { > + case 8: > + c->inter.avg = ff_vvc_avg_8_rvv_128; > +# if (__riscv_xlen == 64) > + c->inter.w_avg = ff_vvc_w_avg_8_rvv_128; > +# endif > + break; > + default: > + break; > + } > + } > +#endif > +} > diff --git a/libavcodec/vvc/dsp.c b/libavcodec/vvc/dsp.c > index 41e830a98a..c55a37d255 100644 > --- a/libavcodec/vvc/dsp.c > +++ b/libavcodec/vvc/dsp.c > @@ -121,7 +121,9 @@ void ff_vvc_dsp_init(VVCDSPContext *vvcdsp, int > bit_depth) break; > } > > -#if ARCH_X86 > +#if ARCH_RISCV > + ff_vvc_dsp_init_riscv(vvcdsp, bit_depth); > +#elif ARCH_X86 > ff_vvc_dsp_init_x86(vvcdsp, bit_depth); > #endif > } > diff --git a/libavcodec/vvc/dsp.h b/libavcodec/vvc/dsp.h > index 1f14096c41..e03236dd76 100644 > --- a/libavcodec/vvc/dsp.h > +++ b/libavcodec/vvc/dsp.h > @@ -180,6 +180,7 @@ typedef struct VVCDSPContext { > > void ff_vvc_dsp_init(VVCDSPContext *hpc, int bit_depth); > > +void ff_vvc_dsp_init_riscv(VVCDSPContext *hpc, const int bit_depth); > void ff_vvc_dsp_init_x86(VVCDSPContext *hpc, const int bit_depth); > > #endif /* AVCODEC_VVC_DSP_H */