From patchwork Wed Sep 28 09:13:33 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Martin Storsjö
X-Patchwork-Id: 38397
From: Martin Storsjö
To: ffmpeg-devel@ffmpeg.org
Cc: Grzegorz Bernacki, Jonathan Swinney, Hubert Mazur, Martin Storsjö
Date: Wed, 28 Sep 2022 12:13:33 +0300
Message-Id: <20220928091334.7838-1-martin@martin.st>
X-Mailer: git-send-email 2.25.1
Subject: [FFmpeg-devel] [PATCH 1/2] aarch64: me_cmp: Avoid redundant loads in ff_pix_abs16_y2_neon

This avoids one redundant load per row; pix3 from the previous
iteration can be used as pix2 in the next
one.

Before:          Cortex A53     A72     A73
pix_abs_0_2_neon:     138.0    59.7    48.0
After:
pix_abs_0_2_neon:     109.7    50.2    39.5

Signed-off-by: Martin Storsjö
---
 libavcodec/aarch64/me_cmp_neon.S | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 11af4849f9..832a7cb22d 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -326,9 +326,9 @@ function ff_pix_abs16_y2_neon, export=1
         // w4           int h
 
         // initialize buffers
+        ld1             {v1.16b}, [x2], x3              // Load pix2
         movi            v29.8h, #0                      // clear the accumulator
         movi            v28.8h, #0                      // clear the accumulator
-        add             x5, x2, x3                      // pix2 + stride
         cmp             w4, #4
         b.lt            2f
 
@@ -339,29 +339,25 @@ function ff_pix_abs16_y2_neon, export=1
         // avg2(a, b) = (((a) + (b) + 1) >> 1)
         // abs(x) = (x < 0 ? (-x) : (x))
 
-        ld1             {v1.16b}, [x2], x3              // Load pix2 for first iteration
-        ld1             {v2.16b}, [x5], x3              // Load pix3 for first iteration
+        ld1             {v2.16b}, [x2], x3              // Load pix3 for first iteration
         ld1             {v0.16b}, [x1], x3              // Load pix1 for first iteration
         urhadd          v30.16b, v1.16b, v2.16b         // Rounding halving add, first iteration
-        ld1             {v4.16b}, [x2], x3              // Load pix2 for second iteration
-        ld1             {v5.16b}, [x5], x3              // Load pix3 for second iteartion
+        ld1             {v5.16b}, [x2], x3              // Load pix3 for second iteartion
         uabal           v29.8h, v0.8b, v30.8b           // Absolute difference of lower half, first iteration
         uabal2          v28.8h, v0.16b, v30.16b         // Absolute difference of upper half, first iteration
         ld1             {v3.16b}, [x1], x3              // Load pix1 for second iteration
-        urhadd          v27.16b, v4.16b, v5.16b         // Rounding halving add, second iteration
-        ld1             {v7.16b}, [x2], x3              // Load pix2 for third iteration
-        ld1             {v20.16b}, [x5], x3             // Load pix3 for third iteration
+        urhadd          v27.16b, v2.16b, v5.16b         // Rounding halving add, second iteration
+        ld1             {v20.16b}, [x2], x3             // Load pix3 for third iteration
         uabal           v29.8h, v3.8b, v27.8b           // Absolute difference of lower half for second iteration
         uabal2          v28.8h, v3.16b, v27.16b         // Absolute difference of upper half for second iteration
         ld1             {v6.16b}, [x1], x3              // Load pix1 for third iteration
-        urhadd          v26.16b, v7.16b, v20.16b        // Rounding halving add, third iteration
-        ld1             {v22.16b}, [x2], x3             // Load pix2 for fourth iteration
-        ld1             {v23.16b}, [x5], x3             // Load pix3 for fourth iteration
+        urhadd          v26.16b, v5.16b, v20.16b        // Rounding halving add, third iteration
+        ld1             {v1.16b}, [x2], x3              // Load pix3 for fourth iteration
         uabal           v29.8h, v6.8b, v26.8b           // Absolute difference of lower half for third iteration
         uabal2          v28.8h, v6.16b, v26.16b         // Absolute difference of upper half for third iteration
         ld1             {v21.16b}, [x1], x3             // Load pix1 for fourth iteration
         sub             w4, w4, #4                      // h-= 4
-        urhadd          v25.16b, v22.16b, v23.16b       // Rounding halving add
+        urhadd          v25.16b, v20.16b, v1.16b        // Rounding halving add
         cmp             w4, #4
         uabal           v29.8h, v21.8b, v25.8b          // Absolute difference of lower half for fourth iteration
         uabal2          v28.8h, v21.16b, v25.16b        // Absolute difference of upper half for fourth iteration
@@ -372,11 +368,11 @@ function ff_pix_abs16_y2_neon, export=1
 
 // iterate by one
 2:
-        ld1             {v1.16b}, [x2], x3              // Load pix2
-        ld1             {v2.16b}, [x5], x3              // Load pix3
+        ld1             {v2.16b}, [x2], x3              // Load pix3
         subs            w4, w4, #1
         ld1             {v0.16b}, [x1], x3              // Load pix1
         urhadd          v30.16b, v1.16b, v2.16b         // Rounding halving add
+        mov             v1.16b, v2.16b                  // Shift pix3->pix2
         uabal           v29.8h, v30.8b, v0.8b
         uabal2          v28.8h, v30.16b, v0.16b
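
For readers who do not follow AArch64 assembly, the idea behind the patch can be sketched in plain C. This is only an illustration under stated assumptions, not the FFmpeg C reference code: the function names, the BLOCK_W constant, the local row buffers that stand in for NEON registers, and the check_agreement helper are all made up for this sketch. The first function reads both the current and the next row of pix2 for every output row, as the old assembly did; the second reads the first row once up front and then only one new row per iteration, carrying the previous "pix3" row over as the next "pix2".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { BLOCK_W = 16 };  /* hypothetical name; the function handles 16-pixel rows */

/* avg2 as given in the comments of the patched function */
int avg2(int a, int b) { return (a + b + 1) >> 1; }

/* Old scheme: two pix2-side rows are read for every output row. */
int pix_abs16_y2_ref(const uint8_t *pix1, const uint8_t *pix2,
                     ptrdiff_t stride, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++) {
        const uint8_t *pix3 = pix2 + stride;  /* row below pix2 */
        for (int x = 0; x < BLOCK_W; x++)
            sum += abs(pix1[x] - avg2(pix2[x], pix3[x]));
        pix1 += stride;
        pix2 += stride;
    }
    return sum;
}

/* New scheme: the row loaded as pix3 is kept and reused as pix2. */
int pix_abs16_y2_reuse(const uint8_t *pix1, const uint8_t *pix2,
                       ptrdiff_t stride, int h)
{
    uint8_t upper[BLOCK_W], lower[BLOCK_W];  /* stand-ins for registers */
    int sum = 0;

    memcpy(upper, pix2, BLOCK_W);            /* initial pix2 load, done once */
    pix2 += stride;
    for (int y = 0; y < h; y++) {
        memcpy(lower, pix2, BLOCK_W);        /* only pix3 is loaded here */
        for (int x = 0; x < BLOCK_W; x++)
            sum += abs(pix1[x] - avg2(upper[x], lower[x]));
        memcpy(upper, lower, BLOCK_W);       /* shift pix3 -> pix2 */
        pix1 += stride;
        pix2 += stride;
    }
    return sum;
}

/* Compares both versions on synthetic data for several heights and
 * returns the sum computed for the largest height. */
int check_agreement(void)
{
    enum { STRIDE = 32, MAX_H = 8 };
    uint8_t a[STRIDE * (MAX_H + 1)], b[STRIDE * (MAX_H + 1)];
    int last = 0;

    for (size_t i = 0; i < sizeof(a); i++) {
        a[i] = (uint8_t)(i * 37u + 11u);
        b[i] = (uint8_t)(i * 101u + 7u);
    }
    for (int h = 0; h <= MAX_H; h++) {
        last = pix_abs16_y2_ref(a, b, STRIDE, h);
        assert(last == pix_abs16_y2_reuse(a, b, STRIDE, h));
    }
    return last;
}
```

In the unrolled assembly loop the explicit copy is not needed, since the register roles simply rotate between iterations; only the one-row tail loop uses a mov, which corresponds to the last memcpy in the sketch.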