From patchwork Wed Apr 20 08:57:53 2022
X-Patchwork-Submitter: Martin Storsjö
X-Patchwork-Id: 35354
From: Martin Storsjö
To: ffmpeg-devel@ffmpeg.org
Cc: Martin Storsjö, Jonathan Swinney, Clément Bœsch, Sebastian Pop
Date: Wed, 20 Apr 2022 11:57:53 +0300
Message-Id: <20220420085753.2019156-1-martin@martin.st>
X-Mailer: git-send-email 2.25.1
Subject: [FFmpeg-devel] [PATCH] swscale: aarch64: Optimize the final summation in the hscale routine

Before:                      Cortex A53       A72       A73
hscale_8_to_15_width8_neon:      8273.0    4602.5    4289.5
hscale_8_to_15_width16_neon:    12405.7    6803.0    6359.0
hscale_8_to_15_width32_neon:    21258.7   11491.7   11469.2
hscale_8_to_15_width40_neon:    25652.0   14173.7   12488.2
After:
hscale_8_to_15_width8_neon:      7633.0    3981.5    3350.2
hscale_8_to_15_width16_neon:    11666.7    5951.0    5512.0
hscale_8_to_15_width32_neon:    20900.7   10733.2    9481.7
hscale_8_to_15_width40_neon:    24826.0   13536.2   11502.0

Thus, this gives an overall 8-28% speedup for the smaller filter
sizes, and around 3-8% for the larger filter sizes.

Inspired by a patch by Jonathan Swinney.

Signed-off-by: Martin Storsjö
---
I'll go ahead and apply this patch within a few days if there's no
opposition, as it should be a fairly uncontroversial change.
---
 libswscale/aarch64/hscale.S | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index af55ffe2b7..da34f1cb8d 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -61,17 +61,9 @@ function ff_hscale_8_to_15_neon, export=1
         smlal               v3.4S, v18.4H, v19.4H                   // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
         smlal2              v3.4S, v18.8H, v19.8H                   // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
         b.gt                2b                                      // inner loop if filterSize not consumed completely
-        addp                v0.4S, v0.4S, v0.4S                     // part0 horizontal pair adding
-        addp                v1.4S, v1.4S, v1.4S                     // part1 horizontal pair adding
-        addp                v2.4S, v2.4S, v2.4S                     // part2 horizontal pair adding
-        addp                v3.4S, v3.4S, v3.4S                     // part3 horizontal pair adding
-        addp                v0.4S, v0.4S, v0.4S                     // part0 horizontal pair adding
-        addp                v1.4S, v1.4S, v1.4S                     // part1 horizontal pair adding
-        addp                v2.4S, v2.4S, v2.4S                     // part2 horizontal pair adding
-        addp                v3.4S, v3.4S, v3.4S                     // part3 horizontal pair adding
-        zip1                v0.4S, v0.4S, v1.4S                     // part01 = zip values from part0 and part1
-        zip1                v2.4S, v2.4S, v3.4S                     // part23 = zip values from part2 and part3
-        mov                 v0.d[1], v2.d[0]                        // part0123 = zip values from part01 and part23
+        addp                v0.4S, v0.4S, v1.4S                     // part01 horizontal pair adding
+        addp                v2.4S, v2.4S, v3.4S                     // part23 horizontal pair adding
+        addp                v0.4S, v0.4S, v2.4S                     // part0123 horizontal pair adding
         subs                w2, w2, #4                              // dstW -= 4
         sqshrn              v0.4H, v0.4S, #7                        // shift and clip the 2x16-bit final values
         st1                 {v0.4H}, [x1], #8                       // write to destination part0123
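
For readers less familiar with the NEON instructions involved, the equivalence of the removed and added sequences can be checked with a scalar model. The following is a minimal C sketch and is not part of the patch: the v4s type and the helpers add32, addp, zip1, sum_old, sum_new and check_equivalence are hypothetical names introduced here for illustration, and each models only the .4S form of the instruction it is named after.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 128-bit NEON register viewed as four 32-bit lanes (.4S). */
typedef struct { int32_t s[4]; } v4s;

/* Wrapping 32-bit add; done in unsigned to avoid signed-overflow UB in C. */
static int32_t add32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* addp Vd.4S, Vn.4S, Vm.4S: adjacent pairs of n fill the low two lanes,
 * adjacent pairs of m fill the high two lanes. */
static v4s addp(v4s n, v4s m)
{
    v4s d = {{ add32(n.s[0], n.s[1]), add32(n.s[2], n.s[3]),
               add32(m.s[0], m.s[1]), add32(m.s[2], m.s[3]) }};
    return d;
}

/* zip1 Vd.4S, Vn.4S, Vm.4S: interleave the low two lanes of n and m. */
static v4s zip1(v4s n, v4s m)
{
    v4s d = {{ n.s[0], m.s[0], n.s[1], m.s[1] }};
    return d;
}

/* Removed sequence: 8x addp, 2x zip1, 1x mov (11 instructions). */
static v4s sum_old(v4s v0, v4s v1, v4s v2, v4s v3)
{
    v0 = addp(v0, v0); v1 = addp(v1, v1);
    v2 = addp(v2, v2); v3 = addp(v3, v3);
    v0 = addp(v0, v0); v1 = addp(v1, v1);   /* every lane now holds the full sum */
    v2 = addp(v2, v2); v3 = addp(v3, v3);
    v0 = zip1(v0, v1);                      /* { sum0, sum1, sum0, sum1 } */
    v2 = zip1(v2, v3);                      /* { sum2, sum3, sum2, sum3 } */
    v0.s[2] = v2.s[0];                      /* mov v0.d[1], v2.d[0] */
    v0.s[3] = v2.s[1];
    return v0;
}

/* Added sequence: 3x addp forming a pairwise reduction tree. */
static v4s sum_new(v4s v0, v4s v1, v4s v2, v4s v3)
{
    v0 = addp(v0, v1);                      /* partial sums of part0 and part1 */
    v2 = addp(v2, v3);                      /* partial sums of part2 and part3 */
    return addp(v0, v2);                    /* { sum0, sum1, sum2, sum3 } */
}

/* Asserts that both sequences yield identical lanes for the given inputs. */
static void check_equivalence(v4s v0, v4s v1, v4s v2, v4s v3)
{
    v4s o = sum_old(v0, v1, v2, v3);
    v4s n = sum_new(v0, v1, v2, v3);
    assert(memcmp(&o, &n, sizeof(o)) == 0);
}
```

Calling check_equivalence() with arbitrary accumulator contents should never trip the assertion, since both sequences leave the horizontal sum of the i-th input in lane i. The new sequence gets there because addp takes two distinct source registers, so each instruction halves the number of partial sums for two accumulators at once.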