From patchwork Sat Dec 16 10:44:53 2023
X-Patchwork-Submitter: Rémi Denis-Courmont
X-Patchwork-Id: 45177
From: Rémi Denis-Courmont
To: ffmpeg-devel@ffmpeg.org
Date: Sat, 16 Dec 2023 12:44:53 +0200
Message-ID: <20231216104453.13092-1-remi@remlab.net>
Subject: [FFmpeg-devel] [PATCH] lavc/opusdsp: simplify R-V V postfilter

This skips the round-trip to a scalar register for the sliding 'x'
coefficients, improving performance by about 5%.
The trick here is that the vector slide-up instruction leaves the elements
of the destination vector below the slide offset unchanged. The switch from
vfslide1up.vf to vslideup.vi also allows eliminating the data dependencies
between consecutive slides. Since the specifications recommend sticking to
power-of-two offsets, we could slide as follows:

    vslideup.vi v8, v0, 2
    vslideup.vi v4, v0, 1
    vslideup.vi v12, v8, 1
    vslideup.vi v16, v8, 2

However, on the device under test, this seems to make performance slightly
worse, so it is left for (in)validation with future, better hardware.
---
 libavcodec/riscv/opusdsp_rvv.S | 30 ++++++++++++------------------
 1 file changed, 12 insertions(+), 18 deletions(-)

diff --git a/libavcodec/riscv/opusdsp_rvv.S b/libavcodec/riscv/opusdsp_rvv.S
index 79ae86c30e..9a8914c78d 100644
--- a/libavcodec/riscv/opusdsp_rvv.S
+++ b/libavcodec/riscv/opusdsp_rvv.S
@@ -26,40 +26,34 @@ func ff_opus_postfilter_rvv, zve32f
         flw      fa1, 4(a2)      // g1
         sub      t0, a0, t1
         flw      fa2, 8(a2)      // g2
+        addi     t1, t0, -2 * 4  // data - (period + 2) = initial &x4
+        vsetivli zero, 4, e32, m4, ta, ma
         addi     t0, t0, 2 * 4   // data - (period - 2) = initial &x0
-
-        flw      ft4, -16(t0)
+        vle32.v  v16, (t1)
         addi     t3, a1, -2 // maximum parallelism w/o stepping our tail
-        flw      ft3, -12(t0)
-        flw      ft2, -8(t0)
-        flw      ft1, -4(t0)
 1:
+        vslidedown.vi v8, v16, 2
         min      t1, a3, t3
+        vslide1down.vx v12, v16, zero
         vsetvli  t1, t1, e32, m4, ta, ma
         vle32.v  v0, (t0)        // x0
         sub      a3, a3, t1
-        vle32.v  v28, (a0)
+        vslide1down.vx v4, v8, zero
         sh2add   t0, t1, t0
-        vfslide1up.vf v4, v0, ft1
+        vle32.v  v28, (a0)
         addi     t2, t1, -4
-        vfslide1up.vf v8, v4, ft2
-        vfslide1up.vf v12, v8, ft3
-        vfslide1up.vf v16, v12, ft4
+        vslideup.vi v4, v0, 1
+        vslideup.vi v8, v4, 1
+        vslideup.vi v12, v8, 1
+        vslideup.vi v16, v12, 1
         vfadd.vv v20, v4, v12
         vfadd.vv v24, v0, v16
-        vslidedown.vx v12, v0, t2
+        vslidedown.vx v16, v0, t2
         vfmacc.vf v28, fa0, v8
-        vslidedown.vi v4, v12, 2
         vfmacc.vf v28, fa1, v20
-        vslide1down.vx v8, v12, zero
         vfmacc.vf v28, fa2, v24
-
-        vslide1down.vx v0, v4, zero
         vse32.v  v28, (a0)
-        vfmv.f.s ft4, v12
         sh2add   a0, t1, a0
-        vfmv.f.s ft2, v4
-        vfmv.f.s ft3, v8
-        vfmv.f.s ft1, v0
         bnez     a3, 1b

         ret