From patchwork Sat Aug 13 20:56:02 2022
X-Patchwork-Submitter: "Swinney, Jonathan"
X-Patchwork-Id: 37259
From: "Swinney, Jonathan"
To: "ffmpeg-devel@ffmpeg.org"
Cc: Martin Storsjö, Hubert Mazur
Date: Sat, 13 Aug 2022 20:56:02 +0000
Message-ID: <17ac9282aef74f869ec9895f0d17f17e@amazon.com>
Subject: [FFmpeg-devel] [PATCH v3 2/3] swscale/aarch64: vscale optimization

Use scalar-times-vector multiply-accumulate instructions instead of
vector-times-vector ones. This removes the need for the replicating load
instructions, which are slightly slower.

On AWS c7g (Graviton 3, Neoverse V1) instances:
yuv2yuvX_8_0_512_accurate_neon:  1144.8   987.4
yuv2yuvX_16_0_512_accurate_neon: 2080.5  1869.4

Signed-off-by: Jonathan Swinney
---
 libswscale/aarch64/output.S | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index af71de6050..991750cf31 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -34,16 +34,15 @@ function ff_yuv2planeX_8_neon, export=1
         mov                 x9, x2                      // srcp    = src
         mov                 x10, x0                     // filterp = filter
 3:      ldp                 x11, x12, [x9], #16         // get 2 pointers: src[j] and src[j+1]
+        ldr                 s7, [x10], #4               // read 2x16-bit coeff X and Y at filter[j] and filter[j+1]
         add                 x11, x11, x7, lsl #1        // &src[j  ][i]
         add                 x12, x12, x7, lsl #1        // &src[j+1][i]
         ld1                 {v5.8H}, [x11]              // read 8x16-bit @ src[j  ][i + {0..7}]: A,B,C,D,E,F,G,H
         ld1                 {v6.8H}, [x12]              // read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P
-        ld1r                {v7.8H}, [x10], #2          // read 1x16-bit coeff X at filter[j  ] and duplicate across lanes
-        ld1r                {v16.8H}, [x10], #2         // read 1x16-bit coeff Y at filter[j+1] and duplicate across lanes
-        smlal               v3.4S, v5.4H, v7.4H         // val0 += {A,B,C,D} * X
-        smlal2              v4.4S, v5.8H, v7.8H         // val1 += {E,F,G,H} * X
-        smlal               v3.4S, v6.4H, v16.4H        // val0 += {I,J,K,L} * Y
-        smlal2              v4.4S, v6.8H, v16.8H        // val1 += {M,N,O,P} * Y
+        smlal               v3.4S, v5.4H, v7.H[0]       // val0 += {A,B,C,D} * X
+        smlal2              v4.4S, v5.8H, v7.H[0]       // val1 += {E,F,G,H} * X
+        smlal               v3.4S, v6.4H, v7.H[1]       // val0 += {I,J,K,L} * Y
+        smlal2              v4.4S, v6.8H, v7.H[1]       // val1 += {M,N,O,P} * Y
         subs                w8, w8, #2                  // tmpfilterSize -= 2
         b.gt                3b                          // loop until filterSize consumed