From patchwork Thu Mar 16 22:10:15 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st>
X-Patchwork-Id: 2972
Delivered-To: ffmpegpatchwork@gmail.com
Received: by 10.103.50.79 with SMTP id y76csp3949vsy;
	Thu, 16 Mar 2017 15:18:33 -0700 (PDT)
X-Received: by 10.223.133.228 with SMTP id 33mr11273782wru.0.1489702713561;
	Thu, 16 Mar 2017 15:18:33 -0700 (PDT)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
	by mx.google.com with ESMTP id k19si459512wmi.79.2017.03.16.15.18.33;
	Thu, 16 Mar 2017 15:18:33 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
	designates 79.124.17.100 as permitted sender)
	client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
	dkim=neutral (body hash did not verify)
	header.i=@martin-st.20150623.gappssmtp.com;
	spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
	designates 79.124.17.100 as permitted sender)
	smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id B75C9688350;
	Fri, 17 Mar 2017 00:18:13 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail-lf0-f67.google.com (mail-lf0-f67.google.com
	[209.85.215.67])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 2449E688312
	for <ffmpeg-devel@ffmpeg.org>; Fri, 17 Mar 2017 00:18:13 +0200 (EET)
Received: by mail-lf0-f67.google.com with SMTP id r36so4324957lfi.0
	for <ffmpeg-devel@ffmpeg.org>; Thu, 16 Mar 2017 15:18:30 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=martin-st.20150623.gappssmtp.com; s=20150623;
	h=from:to:subject:date:message-id:in-reply-to:references;
	bh=7BF9U13gXnwFyU1ZzMJUxfZihVvxXa2+Mu40p6/zV0A=;
	b=KKMp7ISckqOnLJLwtr4q/dzho1vIzXIpnaik0xwpSBfDydrFBHcp311vqefBN1chK+
	7iAuKq5ztZ/wlw1ZI8hdpN+UPRLGKXDonBW0/sOLkahwrGl7/hDyC7NeiyTlGQ+l5E92
	0z++++6oeB/ToxgVTcQORRlHPQEcsXLSYZM7kqXnKetqifOh60rB5tUy0/3d/rAQcb9I
	BXW5TIgfi8puJa0j0wsYyHH19BUwXlJenzn2z8GWar2j4uRQE9rYADNtLsAm6WBVdaFi
	9coFnIAh1M7dEDT9rOHhZEfU4dsFvefQGqHU663pAXtBx6kfd0O1jEp6sabmHUaXqhOS
	ashQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:from:to:subject:date:message-id:in-reply-to
	:references;
	bh=7BF9U13gXnwFyU1ZzMJUxfZihVvxXa2+Mu40p6/zV0A=;
	b=qCOvAjVuaG3HIJ+xtO40ZpzZHYy6g+VQ4Vla/ua6in24MtvtTwC5dPwDV/FCWc2lle
	SfCks9MHPqAhJAiSqNJNKWXJkNe9xdKjaa7vkeBhy4m+LlKl6aFB1Xu0ve1fDBScHoxt
	1mO1VHJbRiv90kYdv5lyNnDi2MQy33rNpX4JjWqeuOiErhOaiI9OrXGGo70Fg89G73+S
	82SPG6Rc16wUp/2RTEFfF9MJS5I5ytYixxK+mOldIpfrUH071vbCTF9/BdqArVPJEw/G
	cYK9s5CbPybyXQN3bt01B2Advuc79ztJ1U39U3qIm7jKcKiejFFbXR0PnNQOWgRrm2PI
	Pq8A==
X-Gm-Message-State: 
 AFeK/H0UKOdf80Lgxh7/WpJ2g98Jph8n8mK2uKvv5FlV9dC+0t7Xv/TaE3V/AODgcmoCxQ==
X-Received: by 10.25.22.26 with SMTP id m26mr3418153lfi.137.1489702229289;
	Thu, 16 Mar 2017 15:10:29 -0700 (PDT)
Received: from localhost.localdomain ([2001:470:28:852:10ad:e858:1f3b:5c2c])
	by smtp.gmail.com with ESMTPSA id
	g3sm1124718lfe.34.2017.03.16.15.10.28 for <ffmpeg-devel@ffmpeg.org>
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
	Thu, 16 Mar 2017 15:10:28 -0700 (PDT)
From: =?UTF-8?q?Martin=20Storsj=C3=B6?= <martin@martin.st>
To: ffmpeg-devel@ffmpeg.org
Date: Fri, 17 Mar 2017 00:10:15 +0200
Message-Id: <1489702219-12643-10-git-send-email-martin@martin.st>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1489702219-12643-1-git-send-email-martin@martin.st>
References: <1489702219-12643-1-git-send-email-martin@martin.st>
Subject: [FFmpeg-devel] [PATCH 10/14] arm: vp9itxfm16: Make the larger core
	transforms standalone functions
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <http://ffmpeg.org/mailman/options/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <http://ffmpeg.org/pipermail/ffmpeg-devel/>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <http://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches
	<ffmpeg-devel@ffmpeg.org>
MIME-Version: 1.0
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from
17500 to 14516 bytes.

This gives a small slowdown in most cases, from around ten cycles up
to a few hundred cycles in the measurements below (the largest being
around 585 cycles for 32x32 sub4 on Cortex A7), but makes it more
feasible to add more optimized versions of these transforms.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4237.4   3561.5   3971.8   2525.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4375.1   3571.9   4283.8   2567.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9
---
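The pattern applied throughout the diff below can be sketched as
follows. This is a minimal illustration only, not code from the patch:
example_core and example_pass are made-up names, plain GNU as labels
are used in place of the function/endfunc macros, and the single vadd
stands in for a real transform body.

```
        .syntax unified
        .arch   armv7-a
        .fpu    neon
        .arm
        .text

@ Hypothetical shared core, assembled once instead of being
@ expanded as a macro at every use site.
example_core:
        vadd.s32        q8,  q8,  q9    @ placeholder for the transform
        bx              lr              @ leaf function: plain return

@ Hypothetical wrapper standing in for one of the pass functions.
example_pass:
        push            {lr}            @ the bl below overwrites lr
        bl              example_core
        pop             {pc}            @ return to the original caller
```

The extra push/bl/bx/pop per call is where the added cycles come from;
the size reduction comes from each core body being emitted once rather
than once per caller.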
 libavcodec/arm/vp9itxfm_16bpp_neon.S | 43 ++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 16 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_16bpp_neon.S b/libavcodec/arm/vp9itxfm_16bpp_neon.S
index 29d95ca..8350153 100644
--- a/libavcodec/arm/vp9itxfm_16bpp_neon.S
+++ b/libavcodec/arm/vp9itxfm_16bpp_neon.S
@@ -807,7 +807,7 @@ function idct16x16_dc_add_neon
 endfunc
 .ltorg
 
-.macro idct16
+function idct16
         mbutterfly0     d16, d24, d16, d24, d8, d10, q4,  q5 @ d16 = t0a,  d24 = t1a
         mbutterfly      d20, d28, d1[0], d1[1], q4,  q5  @ d20 = t2a,  d28 = t3a
         mbutterfly      d18, d30, d2[0], d2[1], q4,  q5  @ d18 = t4a,  d30 = t7a
@@ -853,9 +853,10 @@ endfunc
         vmov            d8,  d21                         @ d8  = t10a
         butterfly       d20, d27, d10, d27               @ d20 = out[4], d27 = out[11]
         butterfly       d21, d26, d26, d8                @ d21 = out[5], d26 = out[10]
-.endm
+        bx              lr
+endfunc
 
-.macro iadst16
+function iadst16
         movrel          r12, iadst16_coeffs
         vld1.16         {q0},  [r12,:128]!
         vmovl.s16       q1,  d1
@@ -933,7 +934,8 @@ endfunc
 
         vmov            d16, d2
         vmov            d30, d4
-.endm
+        bx              lr
+endfunc
 
 .macro itxfm16_1d_funcs txfm
 @ Read a vertical 2x16 slice out of a 16x16 matrix, do a transform on it,
@@ -941,6 +943,8 @@ endfunc
 @ r0 = dst (temp buffer)
 @ r2 = src
 function \txfm\()16_1d_2x16_pass1_neon
+        push            {lr}
+
         mov             r12, #64
         vmov.s32        q4,  #0
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
@@ -948,7 +952,7 @@ function \txfm\()16_1d_2x16_pass1_neon
         vst1.32         {d8},  [r2,:64], r12
 .endr
 
-        \txfm\()16
+        bl              \txfm\()16
 
         @ Do eight 2x2 transposes. Originally, d16-d31 contain the
         @ 16 rows. Afterwards, d16-d17, d18-d19 etc contain the eight
@@ -959,7 +963,7 @@ function \txfm\()16_1d_2x16_pass1_neon
 .irp i, 16, 18, 20, 22, 24, 26, 28, 30, 17, 19, 21, 23, 25, 27, 29, 31
         vst1.32         {d\i}, [r0,:64]!
 .endr
-        bx              lr
+        pop             {pc}
 endfunc
 
 @ Read a vertical 2x16 slice out of a 16x16 matrix, do a transform on it,
@@ -968,6 +972,8 @@ endfunc
 @ r1 = dst stride
 @ r2 = src (temp buffer)
 function \txfm\()16_1d_2x16_pass2_neon
+        push            {lr}
+
         mov             r12, #64
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
         vld1.16         {d\i}, [r2,:64], r12
@@ -975,7 +981,7 @@ function \txfm\()16_1d_2x16_pass2_neon
 
         add             r3,  r0,  r1
         lsl             r1,  r1,  #1
-        \txfm\()16
+        bl              \txfm\()16
 
 .macro load_add_store coef0, coef1, coef2, coef3
         vrshr.s32       \coef0, \coef0, #6
@@ -1019,7 +1025,7 @@ function \txfm\()16_1d_2x16_pass2_neon
         load_add_store  q12, q13, q14, q15
 .purgem load_add_store
 
-        bx              lr
+        pop             {pc}
 endfunc
 .endm
 
@@ -1193,7 +1199,7 @@ function idct32x32_dc_add_neon
         pop             {r4-r9,pc}
 endfunc
 
-.macro idct32_odd
+function idct32_odd
         movrel          r12, idct_coeffs
 
         @ Overwrite the idct16 coeffs with the stored ones for idct32
@@ -1262,7 +1268,8 @@ endfunc
         mbutterfly0     d26, d21, d26, d21, d8, d10, q4, q5 @ d26 = t26a, d21 = t21a
         mbutterfly0     d25, d22, d25, d22, d8, d10, q4, q5 @ d25 = t25,  d22 = t22
         mbutterfly0     d24, d23, d24, d23, d8, d10, q4, q5 @ d24 = t24a, d23 = t23a
-.endm
+        bx              lr
+endfunc
 
 @ Do an 32-point IDCT of a 2x32 slice out of a 32x32 matrix.
 @ We don't have register space to do a single pass IDCT of 2x32 though,
@@ -1274,6 +1281,8 @@ endfunc
 @ r1 = unused
 @ r2 = src
 function idct32_1d_2x32_pass1_neon
+        push            {lr}
+
         @ Double stride of the input, since we only read every other line
         mov             r12, #256
         vmov.s32        d8,  #0
@@ -1284,7 +1293,7 @@ function idct32_1d_2x32_pass1_neon
         vst1.32         {d8},  [r2,:64], r12
 .endr
 
-        idct16
+        bl              idct16
 
         @ Do eight 2x2 transposes. Originally, d16-d31 contain the
         @ 16 rows. Afterwards, d16-d17, d18-d19 etc contain the eight
@@ -1319,7 +1328,7 @@ function idct32_1d_2x32_pass1_neon
         vst1.16         {d8},  [r2,:64], r12
 .endr
 
-        idct32_odd
+        bl              idct32_odd
 
         transpose32_8x_2x2 d31, d30, d29, d28, d27, d26, d25, d24, d23, d22, d21, d20, d19, d18, d17, d16
 
@@ -1343,7 +1352,7 @@ function idct32_1d_2x32_pass1_neon
         store_rev       31, 29, 27, 25, 23, 21, 19, 17
         store_rev       30, 28, 26, 24, 22, 20, 18, 16
 .purgem store_rev
-        bx              lr
+        pop             {pc}
 endfunc
 .ltorg
 
@@ -1354,6 +1363,8 @@ endfunc
 @ r1 = dst stride
 @ r2 = src (temp buffer)
 function idct32_1d_2x32_pass2_neon
+        push            {lr}
+
         mov             r12, #256
         @ d16 = IN(0), d17 = IN(2) ... d31 = IN(30)
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
@@ -1361,7 +1372,7 @@ function idct32_1d_2x32_pass2_neon
 .endr
         sub             r2,  r2,  r12, lsl #4
 
-        idct16
+        bl              idct16
 
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
         vst1.32         {d\i}, [r2,:64], r12
@@ -1377,7 +1388,7 @@ function idct32_1d_2x32_pass2_neon
         sub             r2,  r2,  r12, lsl #4
         sub             r2,  r2,  #128
 
-        idct32_odd
+        bl              idct32_odd
 
         @ Narrow the ict16 coefficients in q0-q3 into q0-q1, to
         @ allow clobbering q2-q3 below.
@@ -1439,7 +1450,7 @@ function idct32_1d_2x32_pass2_neon
         vmovl.s16       q3,  d3
         vmovl.s16       q1,  d1
         vmovl.s16       q0,  d0
-        bx              lr
+        pop             {pc}
 endfunc
 
 const min_eob_idct_idct_32, align=4