From patchwork Mon Jan 9 22:15:07 2017
X-Patchwork-Submitter: Martin Storsjö
X-Patchwork-Id: 2152
From: Martin Storsjö
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 10 Jan 2017 00:15:07 +0200
Message-Id: <1484000119-4959-1-git-send-email-martin@martin.st>
X-Mailer: git-send-email 2.7.4
Subject: [FFmpeg-devel] [PATCH 01/13] aarch64: vp9: use alternative returns in the core loop filter function

From: Janne Grunau

Since aarch64 has enough free general purpose registers, use them to
branch to the appropriate storage code.
1-2 cycles faster for the functions using loop_filter 8/16, ... on a
cortex-a53. Mixed results (up to 2 cycles faster/slower) on a cortex-a57.

This is cherrypicked from libav commit
d7595de0b25e7064fd9e06dea5d0425536cef6dc.
---
 libavcodec/aarch64/vp9lpf_neon.S | 48 +++++++++++++++-------------------------
 1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/libavcodec/aarch64/vp9lpf_neon.S b/libavcodec/aarch64/vp9lpf_neon.S
index e727a4d..78aae61 100644
--- a/libavcodec/aarch64/vp9lpf_neon.S
+++ b/libavcodec/aarch64/vp9lpf_neon.S
@@ -410,15 +410,19 @@
 .endif
         // If no pixels needed flat8in nor flat8out, jump to a
         // writeout of the inner 4 pixels
-        cbz             x5,  7f
+        cbnz            x5,  1f
+        br              x14
+1:
         mov             x5,  v7.d[0]
 .ifc \sz, .16b
         mov             x6,  v7.d[1]
         orr             x5,  x5,  x6
 .endif
         // If no pixels need flat8out, jump to a writeout of the inner 6 pixels
-        cbz             x5,  8f
+        cbnz            x5,  1f
+        br              x15
+1:
 
         // flat8out
         // This writes all outputs into v2-v17 (skipping v6 and v16).
         // If this part is skipped, the output is read from v21-v26 (which is the input
@@ -549,35 +553,24 @@ endfunc
 
 function vp9_loop_filter_8
         loop_filter     8,  .8b,  0,    v16, v17, v18, v19, v28, v29, v30, v31
-        mov             x5,  #0
         ret
 6:
-        mov             x5,  #6
-        ret
+        br              x13
 9:
         br              x10
 endfunc
 
 function vp9_loop_filter_8_16b_mix
         loop_filter     8,  .16b, 88,   v16, v17, v18, v19, v28, v29, v30, v31
-        mov             x5,  #0
         ret
 6:
-        mov             x5,  #6
-        ret
+        br              x13
 9:
         br              x10
 endfunc
 
 function vp9_loop_filter_16
         loop_filter     16, .8b,  0,    v8,  v9,  v10, v11, v12, v13, v14, v15
-        mov             x5,  #0
-        ret
-7:
-        mov             x5,  #7
-        ret
-8:
-        mov             x5,  #8
         ret
 9:
         ldp             d8,  d9,  [sp], 0x10
@@ -589,13 +582,6 @@ endfunc
 
 function vp9_loop_filter_16_16b
         loop_filter     16, .16b, 0,    v8,  v9,  v10, v11, v12, v13, v14, v15
-        mov             x5,  #0
-        ret
-7:
-        mov             x5,  #7
-        ret
-8:
-        mov             x5,  #8
         ret
 9:
         ldp             d8,  d9,  [sp], 0x10
@@ -614,11 +600,14 @@ endfunc
 .endm
 
 .macro loop_filter_8
+        // calculate alternative 'return' targets
+        adr             x13, 6f
         bl              vp9_loop_filter_8
-        cbnz            x5,  6f
 .endm
 
 .macro loop_filter_8_16b_mix mix
+        // calculate alternative 'return' targets
+        adr             x13, 6f
 .if \mix == 48
         mov             x11, #0xffffffff00000000
 .elseif \mix == 84
@@ -627,21 +616,20 @@ endfunc
         mov             x11, #0xffffffffffffffff
 .endif
         bl              vp9_loop_filter_8_16b_mix
-        cbnz            x5,  6f
 .endm
 
 .macro loop_filter_16
+        // calculate alternative 'return' targets
+        adr             x14, 7f
+        adr             x15, 8f
         bl              vp9_loop_filter_16
-        cmp             x5,  7
-        b.gt            8f
-        b.eq            7f
 .endm
 
 .macro loop_filter_16_16b
+        // calculate alternative 'return' targets
+        adr             x14, 7f
+        adr             x15, 8f
         bl              vp9_loop_filter_16_16b
-        cmp             x5,  7
-        b.gt            8f
-        b.eq            7f
 .endm
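
For readers less familiar with the pattern, here is a rough C model of the control-flow change in the 16-pixel case. It is only a sketch under stated assumptions, not the NEON code: function pointers stand in for the addresses loaded with adr into x14/x15, a tail call stands in for "br", and every name below (core_status, core_targets, write_inner4, ...) is hypothetical and does not exist in FFmpeg.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical identifiers for the three writeout paths; the values 7 and 8
 * mirror the local labels 7: and 8: used in the patch. */
enum path { PATH_ALL = 0, PATH_INNER4 = 7, PATH_INNER6 = 8 };

typedef enum path (*writeout_fn)(void);

/* Stand-ins for the writeout code following each macro invocation. */
static enum path write_all(void)    { return PATH_ALL; }
static enum path write_inner4(void) { return PATH_INNER4; }
static enum path write_inner6(void) { return PATH_INNER6; }

/* Old scheme: the core returns a status (x5 in the patch) and the caller
 * has to compare it and branch again. */
static int core_status(int flat8in, int flat8out)
{
    if (!flat8in && !flat8out)
        return 7;               /* only the inner 4 pixels changed */
    if (!flat8out)
        return 8;               /* only the inner 6 pixels changed */
    return 0;                   /* full writeout */
}

static enum path filter_old(int flat8in, int flat8out)
{
    int status = core_status(flat8in, flat8out);
    if (status == 7)
        return write_inner4();
    if (status == 8)
        return write_inner6();
    return write_all();
}

/* New scheme: the caller passes the alternative targets up front (like
 * "adr x14, 7f" / "adr x15, 8f") and the core jumps to them directly. */
struct targets {
    writeout_fn inner4;         /* models x14 */
    writeout_fn inner6;         /* models x15 */
    writeout_fn all;            /* models the normal return address */
};

static enum path core_targets(int flat8in, int flat8out,
                              const struct targets *t)
{
    assert(t != NULL && t->inner4 && t->inner6 && t->all);
    if (!flat8in && !flat8out)
        return t->inner4();     /* models "br x14" */
    if (!flat8out)
        return t->inner6();     /* models "br x15" */
    return t->all();            /* models "ret" */
}

static enum path filter_new(int flat8in, int flat8out)
{
    const struct targets t = { write_inner4, write_inner6, write_all };
    return core_targets(flat8in, flat8out, &t);
}
```

In this model both schemes select the same writeout path for every input; the second one drops the status value and the caller's second round of compares, which is what the patch does by removing the "mov x5, #N" / "cmp x5, 7" sequences.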