From patchwork Mon Apr 29 19:21:44 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?R=C3=A9mi_Denis-Courmont?=
X-Patchwork-Id: 48373
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a21:1509:b0:1a9:af23:56c1 with SMTP id nq9csp2200937pzb;
 Mon, 29 Apr 2024 12:22:05 -0700 (PDT)
X-Forwarded-Encrypted: i=2; AJvYcCU76ZNOIMp4gV4bg7j3VV82L284xZQxe85ohNfchw9mBY7Sy8HVJrYCwJ0f21k/3E+2Xvd34lghmtXdmA2DxlYo7GsO/qcm0nJleg==
X-Google-Smtp-Source: AGHT+IG1ySNhPjmfC/WszKdfY5tinzBNu++JN9M+VFr8zOy6Gpu0SacbE4ek9KoYiC9XA1BelyJA
X-Received: by 2002:a17:906:395b:b0:a55:b93e:90ef with SMTP id g27-20020a170906395b00b00a55b93e90efmr4621090eje.77.1714418525471;
 Mon, 29 Apr 2024 12:22:05 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1714418525; cv=none;
 d=google.com; s=arc-20160816;
 b=oBukt4A+/ZczsubjMSLhM17KznhsixLIAykeqHsNV4HQRNgDDwAwYmwUbse7sE4ahH
 bz0AZ3UAKPpdO6wB69vddC+f5BovKGOGrvTH7loOEcK7BGNwIIjMiWra7akkyllroM79
 mS4JgmPuY77MGIfqDfsCtl4GLgdx7vhjVdzDVzAEZLEs4YHV1AFngP9psI4Jxtyh7KVN
 0Yzch6WERO/wprL6zAJhdoVEaJTUw50Bsin7AzgiIoALcSiTJn4Q8HBYiAjMvgKIGTA7
 700latg9PobY9xURzny+y/+L3Xse1UzoKCPGYTGFVOnoPr+3N1E5dhre1/XmQ277vZZI
 u8Bg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
 h=sender:errors-to:content-transfer-encoding:reply-to:list-subscribe
 :list-help:list-post:list-archive:list-unsubscribe:list-id
 :precedence:subject:mime-version:references:in-reply-to:message-id
 :date:to:from:delivered-to;
 bh=iRLbC+09nJT25Q3JxdW4mEfMbvw46Q/XOedvd1jVhS4=;
 fh=YOA8vD9MJZuwZ71F/05pj6KdCjf6jQRmzLS+CATXUQk=;
 b=P4MLC8gOqt2pCq6BAdgbDygGBHiQJoyv2mycuUds7Jp2ZMC6MhXGYmhx+7cQtQIQjy
 +fO/mLdA1D43fm72Nm23eYaX5gtnwl970CM0pki1NkZAxk/KIpHTfKbnJ9HPjGWdl94W
 cA9sziEyiWlPWOv7qeJ3AjrtwDk3ncVAC3j3xOFwsMWX/9z8+YJhVQA182ETFdJTn0Ou
 rW93PA4veFk3Rhqsl3aJBWEpIWFbxKdLAcsoT0PV9H96wicQjpcHW3yUH8vFLcWW2q3K
 DGOJZTmchYXK1xRT2dMnlsPibMQ7LbCNpxo4Tkal1stFs4qj9kTeSUAX4o9xKuHn7/ZQ
 XvmA==;
 dara=google.com
ARC-Authentication-Results: i=1; mx.google.com;
 spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path:
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
 by mx.google.com with ESMTP id l16-20020a1709062a9000b00a523d95aa45si14945016eje.298.2024.04.29.12.22.04;
 Mon, 29 Apr 2024 12:22:05 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender)
 client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
 spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id E58E768D518;
 Mon, 29 Apr 2024 22:21:52 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from ursule.remlab.net (vps-a2bccee9.vps.ovh.net [51.75.19.47])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 9588468D40C for ;
 Mon, 29 Apr 2024 22:21:45 +0300 (EEST)
Received: from basile.remlab.net (localhost [IPv6:::1])
 by ursule.remlab.net (Postfix) with ESMTP id D6EEAC0214 for ;
 Mon, 29 Apr 2024 22:21:44 +0300 (EEST)
From: =?utf-8?q?R=C3=A9mi_Denis-Courmont?=
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 29 Apr 2024 22:21:44 +0300
Message-ID: <20240429192144.84571-2-remi@remlab.net>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240429192144.84571-1-remi@remlab.net>
References: <20240429192144.84571-1-remi@remlab.net>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 2/2] lavc/ac3dsp: R-V V sum_square_butterfly_float
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: FFmpeg development discussions and patches
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"
X-TUID: 6fHvuaIWT74e

As we do not need to widen accumulators to 64 bits, we effectively get
double capacity for unrolling compared to the integer function. This
explains the slightly better performance gains.

ac3_sum_square_bufferfly_float_c: 65.2
ac3_sum_square_bufferfly_float_rvv_f32: 12.2
---
 libavcodec/riscv/ac3dsp_init.c |  6 +++++-
 libavcodec/riscv/ac3dsp_rvv.S  | 39 ++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/libavcodec/riscv/ac3dsp_init.c b/libavcodec/riscv/ac3dsp_init.c
index be5e153fac..e120aa2dce 100644
--- a/libavcodec/riscv/ac3dsp_init.c
+++ b/libavcodec/riscv/ac3dsp_init.c
@@ -30,6 +30,8 @@ void ff_extract_exponents_rvb(uint8_t *exp, int32_t *coef, int nb_coefs);
 void ff_float_to_fixed24_rvv(int32_t *dst, const float *src, size_t len);
 void ff_sum_square_butterfly_int32_rvv(int64_t *, const int32_t *,
                                        const int32_t *, int);
+void ff_sum_square_butterfly_float_rvv(float *, const float *,
+                                       const float *, int);
 
 av_cold void ff_ac3dsp_init_riscv(AC3DSPContext *c)
 {
@@ -39,8 +41,10 @@ av_cold void ff_ac3dsp_init_riscv(AC3DSPContext *c)
     if (flags & AV_CPU_FLAG_RVB_ADDR) {
         if (flags & AV_CPU_FLAG_RVB_BASIC)
             c->extract_exponents = ff_extract_exponents_rvb;
-        if (flags & AV_CPU_FLAG_RVV_F32)
+        if (flags & AV_CPU_FLAG_RVV_F32) {
             c->float_to_fixed24 = ff_float_to_fixed24_rvv;
+            c->sum_square_butterfly_float = ff_sum_square_butterfly_float_rvv;
+        }
 # if __riscv_xlen >= 64
         if (flags & AV_CPU_FLAG_RVV_I64)
             c->sum_square_butterfly_int32 = ff_sum_square_butterfly_int32_rvv;
diff --git a/libavcodec/riscv/ac3dsp_rvv.S b/libavcodec/riscv/ac3dsp_rvv.S
index dd0b4cd797..397e000ab0 100644
--- a/libavcodec/riscv/ac3dsp_rvv.S
+++ b/libavcodec/riscv/ac3dsp_rvv.S
@@ -78,3 +78,42 @@ func ff_sum_square_butterfly_int32_rvv, zve64x
         ret
 endfunc
 #endif
+
+func ff_sum_square_butterfly_float_rvv, zve32f
+        vsetvli     t0, zero, e32, m8, ta, ma
+        vmv.v.x     v0, zero
+        vmv.v.x     v8, zero
+1:
+        vsetvli     t0, a3, e32, m4, tu, ma
+        vle32.v     v16, (a1)
+        sub         a3, a3, t0
+        vle32.v     v20, (a2)
+        sh2add      a1, t0, a1
+        vfadd.vv    v24, v16, v20
+        sh2add      a2, t0, a2
+        vfsub.vv    v28, v16, v20
+        vfmacc.vv   v0, v16, v16
+        vfmacc.vv   v4, v20, v20
+        vfmacc.vv   v8, v24, v24
+        vfmacc.vv   v12, v28, v28
+        bnez        a3, 1b
+
+        vsetvli     t0, zero, e32, m4, ta, ma
+        vmv.s.x     v16, zero
+        vmv.s.x     v17, zero
+        vfredsum.vs v16, v0, v16
+        vmv.s.x     v18, zero
+        vfredsum.vs v17, v4, v17
+        vmv.s.x     v19, zero
+        vfredsum.vs v18, v8, v18
+        vfmv.f.s    ft0, v16
+        vfredsum.vs v19, v12, v19
+        vfmv.f.s    ft1, v17
+        fsw         ft0, (a0)
+        vfmv.f.s    ft2, v18
+        fsw         ft1, 4(a0)
+        vfmv.f.s    ft3, v19
+        fsw         ft2, 8(a0)
+        fsw         ft3, 12(a0)
+        ret
+endfunc
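
For readers who do not follow the vector assembly, the four sums that the
routine above accumulates and stores through a0 can be sketched in scalar C.
This is only an illustrative sketch read off the assembly, not the C
implementation in libavcodec; the function name
sum_square_butterfly_float_ref and its parameter names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical scalar reference, not part of the patch.
 * sum[0] = sum of coef0[i]^2              (vfmacc.vv v0,  v16, v16)
 * sum[1] = sum of coef1[i]^2              (vfmacc.vv v4,  v20, v20)
 * sum[2] = sum of (coef0[i] + coef1[i])^2 (vfmacc.vv v8,  v24, v24)
 * sum[3] = sum of (coef0[i] - coef1[i])^2 (vfmacc.vv v12, v28, v28)
 */
static void sum_square_butterfly_float_ref(float sum[4], const float *coef0,
                                           const float *coef1, int len)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    for (int i = 0; i < len; i++) {
        float l = coef0[i];
        float r = coef1[i];
        float m = l + r; /* corresponds to vfadd.vv */
        float s = l - r; /* corresponds to vfsub.vv */

        s0 += l * l;
        s1 += r * r;
        s2 += m * m;
        s3 += s * s;
    }

    sum[0] = s0;
    sum[1] = s1;
    sum[2] = s2;
    sum[3] = s3;
}
```

The vector routine keeps per-lane partial sums and reduces them only once
after the loop, so its rounding can differ slightly from this strictly
sequential sketch.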