From patchwork Sun Apr 19 21:09:09 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
X-Patchwork-Submitter: Martin Storsjö
X-Patchwork-Id: 19094
From: Martin Storsjö
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 20 Apr 2020 00:09:09 +0300
Message-Id: <20200419210909.20584-1-martin@martin.st>
In-Reply-To: <20200419174841.28749-1-martin@martin.st>
References: <20200419174841.28749-1-martin@martin.st>
Subject: [FFmpeg-devel] [PATCH v2] swscale: aarch64: Don't clobber callee-saved registers v8-v15

---
If there had been checkasm tests for these functions, this would have been
caught immediately.

Fixed one missed case: the code mixed upper and lower case in register
names (v8 and V8), so my search/replace missed one instance. As there is
no dedicated checkasm test, the change can only be tested implicitly via
other fate tests.
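The mixed-case miss described above could be guarded against with a case-insensitive scan of the assembly source. The following is only a minimal sketch and not part of this patch: the function name `find_callee_saved_uses` is hypothetical, it assumes `//` comments as used in these files, and it only looks at `v` register names (not the `d8`-`d15` or `q8`-`q15` aliases).

```python
import re

# AAPCS64 requires a callee to preserve the low 64 bits of v8-v15.
# Match v8..v15 in either case; the lookarounds keep the pattern from
# matching v80, v16 or identifiers that merely end in "v8".
CALLEE_SAVED = re.compile(
    r"(?<![A-Za-z0-9_])v(?:[89]|1[0-5])(?![0-9A-Za-z_])",
    re.IGNORECASE,
)


def find_callee_saved_uses(asm_text):
    """Return (line_number, register) pairs for each v8-v15 reference in code."""
    hits = []
    for lineno, line in enumerate(asm_text.splitlines(), start=1):
        code = line.split("//", 1)[0]  # ignore trailing comments
        for match in CALLEE_SAVED.finditer(code):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # Lines taken from the pre-patch hscale.S, plus one already-fixed line.
    sample = (
        "uxtl v8.8H, v8.8B // unpack part 3 to 16-bit\n"
        "smlal2 v2.4S, V8.8H, v9.8H\n"
        "ld1 {v16.8B}, [x0], #8\n"
    )
    for lineno, reg in find_callee_saved_uses(sample):
        print(lineno, reg)
```

A scan like this only reports references; it cannot tell whether a function saves and restores the registers itself, so a checkasm test would still be the more reliable check.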
---
 libswscale/aarch64/hscale.S | 20 ++++++++++----------
 libswscale/aarch64/output.S |  6 +++---
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S
index ae73014a25..af55ffe2b7 100644
--- a/libswscale/aarch64/hscale.S
+++ b/libswscale/aarch64/hscale.S
@@ -46,20 +46,20 @@ function ff_hscale_8_to_15_neon, export=1
         uxtl v4.8H, v4.8B // unpack part 1 to 16-bit
         smlal v0.4S, v4.4H, v5.4H // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}]
         smlal2 v0.4S, v4.8H, v5.8H // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}]
-        ld1 {v8.8B}, [x0], #8 // srcp[filterPos[2] + {0..7}]
-        ld1 {v9.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize
+        ld1 {v16.8B}, [x0], #8 // srcp[filterPos[2] + {0..7}]
+        ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize
         uxtl v6.8H, v6.8B // unpack part 2 to 16-bit
         smlal v1.4S, v6.4H, v7.4H // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}]
-        uxtl v8.8H, v8.8B // unpack part 3 to 16-bit
-        smlal v2.4S, v8.4H, v9.4H // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
-        smlal2 v2.4S, V8.8H, v9.8H // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
-        ld1 {v10.8B}, [x11], #8 // srcp[filterPos[3] + {0..7}]
+        uxtl v16.8H, v16.8B // unpack part 3 to 16-bit
+        smlal v2.4S, v16.4H, v17.4H // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}]
+        smlal2 v2.4S, v16.8H, v17.8H // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}]
+        ld1 {v18.8B}, [x11], #8 // srcp[filterPos[3] + {0..7}]
         smlal2 v1.4S, v6.8H, v7.8H // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}]
-        ld1 {v11.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize
+        ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize
         subs w15, w15, #8 // j -= 8: processed 8/filterSize
-        uxtl v10.8H, v10.8B // unpack part 4 to 16-bit
-        smlal v3.4S, v10.4H, v11.4H // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
-        smlal2 v3.4S, v10.8H, v11.8H // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
+        uxtl v18.8H, v18.8B // unpack part 4 to 16-bit
+        smlal v3.4S, v18.4H, v19.4H // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}]
+        smlal2 v3.4S, v18.8H, v19.8H // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}]
         b.gt 2b // inner loop if filterSize not consumed completely
         addp v0.4S, v0.4S, v0.4S // part0 horizontal pair adding
         addp v1.4S, v1.4S, v1.4S // part1 horizontal pair adding
diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index 25bf28b6e4..af71de6050 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -39,11 +39,11 @@ function ff_yuv2planeX_8_neon, export=1
         ld1 {v5.8H}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H
         ld1 {v6.8H}, [x12] // read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P
         ld1r {v7.8H}, [x10], #2 // read 1x16-bit coeff X at filter[j ] and duplicate across lanes
-        ld1r {v8.8H}, [x10], #2 // read 1x16-bit coeff Y at filter[j+1] and duplicate across lanes
+        ld1r {v16.8H}, [x10], #2 // read 1x16-bit coeff Y at filter[j+1] and duplicate across lanes
         smlal v3.4S, v5.4H, v7.4H // val0 += {A,B,C,D} * X
         smlal2 v4.4S, v5.8H, v7.8H // val1 += {E,F,G,H} * X
-        smlal v3.4S, v6.4H, v8.4H // val0 += {I,J,K,L} * Y
-        smlal2 v4.4S, v6.8H, v8.8H // val1 += {M,N,O,P} * Y
+        smlal v3.4S, v6.4H, v16.4H // val0 += {I,J,K,L} * Y
+        smlal2 v4.4S, v6.8H, v16.8H // val1 += {M,N,O,P} * Y
         subs w8, w8, #2 // tmpfilterSize -= 2
         b.gt 3b // loop until filterSize consumed