From patchwork Sun Nov 19 18:53:40 2023
X-Patchwork-Submitter: Rémi Denis-Courmont
X-Patchwork-Id: 44722
From: Rémi Denis-Courmont <remi@remlab.net>
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 19 Nov 2023 20:53:40 +0200
Message-ID: <20231119185340.29082-1-remi@remlab.net>
X-Mailer: git-send-email 2.42.0
Subject: [FFmpeg-devel] [PATCH] lavc/aacpsdsp: use LMUL=2 and amortise strides

The input is laid out in 16 segments, of which 13 actually need to be
loaded. There are no really efficient ways to deal with this:

1) If we load 8 segments with unit stride, then narrow to 16 segments with
   right shifts, we can use only one half-size vector per segment, or just
   2 elements per vector (EMUL=1/2). This ends up, unsurprisingly, about as
   fast as the C code.

2) The current approach is to load with strides. We keep that approach, but
   improve it by using three 4-segment loads instead of 12 single-segment
   loads. This divides the number of distinct loaded addresses by 4.

3) A potential third approach would be to avoid segmentation altogether and
   splat the scalar coefficients into vectors. Then we could use a unit
   stride and the maximum EMUL. But the downside is that we would have to
   multiply the 3 (of 16) unused segments by zero as part of the
   multiply-accumulate operations.

In addition, we also reuse vectors mid-loop so as to increase the EMUL from
1 to 2, which also improves performance a little bit.

Overall the gains are quite small with the device under test, as it does
not deal with segmented loads very well. But at least the code is tidier,
and it should enjoy bigger speed-ups on better hardware implementations.
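(Illustration, not part of the patch: a minimal sketch of what the load
change in approach 2 boils down to, with arbitrary register numbers and
LMUL=1 assumed for simplicity. A group of four single-segment strided
loads, each needing its own base address, becomes one strided segment load
from a single base address:

        // before: four strided loads, four distinct base addresses
        vlse32.v        v16, (a2), t4 // field 0: a2 + i * t4
        addi    t1, a2, 4
        vlse32.v        v17, (t1), t4 // field 1: a2 + i * t4 + 4
        addi    t1, a2, 8
        vlse32.v        v18, (t1), t4 // field 2: a2 + i * t4 + 8
        addi    t1, a2, 12
        vlse32.v        v19, (t1), t4 // field 3: a2 + i * t4 + 12

        // after: one strided segment load, one base address
        vlsseg4e32.v    v16, (a2), t4 // fields 0-3 land in v16-v19

Both sequences leave the same data in v16-v19; the segment load merely
presents the whole access pattern to the hardware at once.)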
ps_hybrid_analysis_c:       1819.2
ps_hybrid_analysis_rvv_f32: 1037.0 (before)
ps_hybrid_analysis_rvv_f32:  990.0 (after)
---
 libavcodec/riscv/aacpsdsp_rvv.S | 61 +++++++++++----------------------
 1 file changed, 20 insertions(+), 41 deletions(-)

diff --git a/libavcodec/riscv/aacpsdsp_rvv.S b/libavcodec/riscv/aacpsdsp_rvv.S
index 1dc426e01c..f46b35fe91 100644
--- a/libavcodec/riscv/aacpsdsp_rvv.S
+++ b/libavcodec/riscv/aacpsdsp_rvv.S
@@ -85,63 +85,42 @@ NOHWD   fsw     fs\n, (4 * \n)(sp)
         flw     fs4, (4 * ((6 * 2) + 0))(a1)
         flw     fs5, (4 * ((6 * 2) + 1))(a1)
-        add     a2, a2, 6 * 2 * 4 // point to filter[i][6][0]
+        add     t2, a2, 6 * 2 * 4 // point to filter[i][6][0]
         li      t4, 8 * 2 * 4 // filter byte stride
         slli    a3, a3, 3 // output byte stride
 1:
 .macro filter, vs0, vs1, fo0, fo1, fo2, fo3
         vfmacc.vf       v8, \fo0, \vs0
-        vfmacc.vf       v9, \fo2, \vs0
+        vfmacc.vf       v10, \fo2, \vs0
         vfnmsac.vf      v8, \fo1, \vs1
-        vfmacc.vf       v9, \fo3, \vs1
+        vfmacc.vf       v10, \fo3, \vs1
 .endm
-        vsetvli t0, a4, e32, m1, ta, ma
+        vsetvli t0, a4, e32, m2, ta, ma
         /*
          * The filter (a2) has 16 segments, of which 13 need to be extracted.
          * R-V V supports only up to 8 segments, so unrolling is unavoidable.
          */
-        addi    t1, a2, -48
-        vlse32.v        v22, (a2), t4
-        addi    t2, a2, -44
-        vlse32.v        v16, (t1), t4
-        addi    t1, a2, -40
-        vfmul.vf        v8, v22, fs4
-        vlse32.v        v24, (t2), t4
-        addi    t2, a2, -36
-        vfmul.vf        v9, v22, fs5
-        vlse32.v        v17, (t1), t4
-        addi    t1, a2, -32
-        vlse32.v        v25, (t2), t4
-        addi    t2, a2, -28
-        filter  v16, v24, ft0, ft1, ft2, ft3
-        vlse32.v        v18, (t1), t4
-        addi    t1, a2, -24
-        vlse32.v        v26, (t2), t4
-        addi    t2, a2, -20
-        filter  v17, v25, ft4, ft5, ft6, ft7
-        vlse32.v        v19, (t1), t4
-        addi    t1, a2, -16
-        vlse32.v        v27, (t2), t4
-        addi    t2, a2, -12
-        filter  v18, v26, ft8, ft9, ft10, ft11
-        vlse32.v        v20, (t1), t4
-        addi    t1, a2, -8
         vlse32.v        v28, (t2), t4
-        addi    t2, a2, -4
-        filter  v19, v27, fa0, fa1, fa2, fa3
-        vlse32.v        v21, (t1), t4
+        addi    t1, a2, 16
+        vfmul.vf        v8, v28, fs4
+        vlsseg4e32.v    v16, (a2), t4
+        vfmul.vf        v10, v28, fs5
+        filter  v16, v18, ft0, ft1, ft2, ft3
+        vlsseg4e32.v    v24, (t1), t4
+        filter  v20, v22, ft4, ft5, ft6, ft7
+        addi    t1, a2, 32
+        filter  v24, v26, ft8, ft9, ft10, ft11
+        vlsseg4e32.v    v16, (t1), t4
         sub     a4, a4, t0
-        vlse32.v        v29, (t2), t4
+        filter  v28, v30, fa0, fa1, fa2, fa3
         slli    t1, t0, 3 + 1 + 2 // ctz(8 * 2 * 4)
-        add     a2, a2, t1
-        filter  v20, v28, fa4, fa5, fa6, fa7
-        filter  v21, v29, fs0, fs1, fs2, fs3
-
-        add     t2, a0, 4
-        vsse32.v        v8, (a0), a3
+        filter  v16, v18, fa4, fa5, fa6, fa7
         mul     t0, t0, a3
-        vsse32.v        v9, (t2), a3
+        filter  v20, v22, fs0, fs1, fs2, fs3
+        add     a2, a2, t1
+        add     t2, t2, t1
+        vssseg2e32.v    v8, (a0), a3
         add     a0, a0, t0
         bnez    a4, 1b
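
(Likewise on the store side, again only an illustrative sketch with
arbitrary registers and LMUL=1, not the patch itself: the two strided
stores of the interleaved real/imaginary outputs collapse into one 2-field
strided segment store:

        // before: two strided stores, the second offset by 4 bytes
        vsse32.v        v8, (a0), a3 // field 0: a0 + i * a3
        addi    t2, a0, 4
        vsse32.v        v9, (t2), a3 // field 1: a0 + i * a3 + 4

        // after: one strided segment store writes both fields per element
        vssseg2e32.v    v8, (a0), a3 // v8[i] at a0 + i*a3, v9[i] 4 bytes later

This is the store-side analogue of the segment loads above.)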