
[FFmpeg-devel] lavc/opusdsp: simplify R-V V postfilter

Message ID 20231216104453.13092-1-remi@remlab.net
State Accepted
Commit db32f75c635c5783b76e7c3fd8060548d0917180
Series [FFmpeg-devel] lavc/opusdsp: simplify R-V V postfilter

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Rémi Denis-Courmont Dec. 16, 2023, 10:44 a.m. UTC
This skips the round-trip through a scalar register for the sliding 'x'
coefficients, improving performance by about 5%. The trick here is that
the vector slide-up instruction preserves the elements of the destination
vector below the slide offset.
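
A minimal sketch of the difference (mnemonics as in the patch; register
choices are illustrative rather than a verbatim excerpt):

        // before: the carried element takes a round-trip through a
        // scalar FP register
        vfmv.f.s      ft1, v8      // ft1 = v8[0], saved from last iteration
        vfslide1up.vf v4, v0, ft1  // v4 = {ft1, v0[0], v0[1], ...}

        // after: v4[0] is pre-seeded with the carried value, and
        // vslideup.vi with offset 1 only writes elements 1 and up
        vslideup.vi   v4, v0, 1    // v4 = {v4[0], v0[0], v0[1], ...}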

The switch from vfslide1up.vf to vslideup.vi also makes it possible to
eliminate the data dependencies between consecutive slides. Since the
specification recommends sticking to power-of-two offsets, we could slide
as follows:

        vslideup.vi v8, v0, 2
        vslideup.vi v4, v0, 1
        vslideup.vi v12, v8, 1
        vslideup.vi v16, v8, 2

However, on the device under test, this seems to make performance slightly
worse, so it is left for (in)validation on future, better hardware.
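
For comparison, the patch as applied keeps the plain serial chain, in which
each slide depends on the result of the previous one:

        vslideup.vi v4, v0, 1
        vslideup.vi v8, v4, 1
        vslideup.vi v12, v8, 1
        vslideup.vi v16, v12, 1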
---
 libavcodec/riscv/opusdsp_rvv.S | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

Patch

diff --git a/libavcodec/riscv/opusdsp_rvv.S b/libavcodec/riscv/opusdsp_rvv.S
index 79ae86c30e..9a8914c78d 100644
--- a/libavcodec/riscv/opusdsp_rvv.S
+++ b/libavcodec/riscv/opusdsp_rvv.S
@@ -26,40 +26,34 @@  func ff_opus_postfilter_rvv, zve32f
         flw     fa1, 4(a2) // g1
         sub     t0, a0, t1
         flw     fa2, 8(a2) // g2
+        addi    t1, t0, -2 * 4 // data - (period + 2) = initial &x4
+        vsetivli zero, 4, e32, m4, ta, ma
         addi    t0, t0, 2 * 4 // data - (period - 2) = initial &x0
-
-        flw     ft4, -16(t0)
+        vle32.v v16, (t1)
         addi    t3, a1, -2 // maximum parallelism w/o stepping our tail
-        flw     ft3, -12(t0)
-        flw     ft2,  -8(t0)
-        flw     ft1,  -4(t0)
 1:
+        vslidedown.vi v8, v16, 2
         min     t1, a3, t3
+        vslide1down.vx v12, v16, zero
         vsetvli t1, t1, e32, m4, ta, ma
         vle32.v v0, (t0) // x0
         sub     a3, a3, t1
-        vle32.v v28, (a0)
+        vslide1down.vx v4, v8, zero
         sh2add  t0, t1, t0
-        vfslide1up.vf v4, v0, ft1
+        vle32.v v28, (a0)
         addi    t2, t1, -4
-        vfslide1up.vf v8, v4, ft2
-        vfslide1up.vf v12, v8, ft3
-        vfslide1up.vf v16, v12, ft4
+        vslideup.vi v4, v0, 1
+        vslideup.vi v8, v4, 1
+        vslideup.vi v12, v8, 1
+        vslideup.vi v16, v12, 1
         vfadd.vv v20, v4, v12
         vfadd.vv v24, v0, v16
-        vslidedown.vx v12, v0, t2
+        vslidedown.vx v16, v0, t2
         vfmacc.vf v28, fa0, v8
-        vslidedown.vi v4, v12, 2
         vfmacc.vf v28, fa1, v20
-        vslide1down.vx v8, v12, zero
         vfmacc.vf v28, fa2, v24
-        vslide1down.vx v0, v4, zero
         vse32.v v28, (a0)
-        vfmv.f.s ft4, v12
         sh2add  a0, t1, a0
-        vfmv.f.s ft2, v4
-        vfmv.f.s ft3, v8
-        vfmv.f.s ft1, v0
         bnez    a3, 1b
 
         ret
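
A note on the carried state (my reading of the new loop; the comments are
added here): each slide destination must enter the vslideup sequence with
the right carried value already in element 0. The patch seeds those
elements from the tail of the previous iteration's x0 vector:

        // end of an iteration: keep the last 4 elements of x0 (t2 = vl - 4)
        vslidedown.vx v16, v0, t2     // v16 = {x[vl-4], x[vl-3], x[vl-2], x[vl-1]}
        // top of the next iteration: derive the other three seeds
        vslidedown.vi v8, v16, 2      // v8[0]  = x[vl-2]
        vslide1down.vx v12, v16, zero // v12[0] = x[vl-3]
        vslide1down.vx v4, v8, zero   // v4[0]  = x[vl-1]
        // the vslideup.vi ..., 1 sequence then leaves these elements intact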