From patchwork Mon Jan  9 22:15:15 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st>
X-Patchwork-Id: 2158
Delivered-To: ffmpegpatchwork@gmail.com
Received: by 10.103.89.21 with SMTP id n21csp7257390vsb;
	Mon, 9 Jan 2017 14:21:16 -0800 (PST)
X-Received: by 10.28.135.3 with SMTP id j3mr201125wmd.96.1484000476241;
	Mon, 09 Jan 2017 14:21:16 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
	by mx.google.com with ESMTP id k3si9359072wrk.55.2017.01.09.14.21.15;
	Mon, 09 Jan 2017 14:21:16 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
	designates 79.124.17.100 as permitted sender)
	client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
	dkim=neutral (body hash did not verify)
	header.i=@martin-st.20150623.gappssmtp.com;
	spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
	designates 79.124.17.100 as permitted sender)
	smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id AEEAB68A225;
	Tue, 10 Jan 2017 00:21:06 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail-lf0-f66.google.com (mail-lf0-f66.google.com
	[209.85.215.66])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 3F1C968077C
	for <ffmpeg-devel@ffmpeg.org>; Tue, 10 Jan 2017 00:21:00 +0200 (EET)
Received: by mail-lf0-f66.google.com with SMTP id j75so13010044lfe.3
	for <ffmpeg-devel@ffmpeg.org>; Mon, 09 Jan 2017 14:21:07 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=martin-st.20150623.gappssmtp.com; s=20150623;
	h=from:to:subject:date:message-id:in-reply-to:references;
	bh=5IpgTnXfAUWEPtfizQ6PfvvXqgo9LCGmOkN6iESh3fU=;
	b=Q3mvhRza65cEzji9t6M5G9ycZArhrMDt33lSrlUSRgA8XTPL631YkmWJrU6QlbIKiB
	VZ1QcV3SKKe1xHQjQBX69Q9qQguno7ZMyaHsH3r/HSYwDjzso7cu8iUD/AZz+NsgYn8z
	EP6UH0LChN0O62xs7rN/uyeiv1HDkr+0Cq1YonLCuCd8gnXEoDqp09dVpXIb7DyYdrgO
	rlgn31rGTqFOcu51LOm9B8lyHL28SLZooziqFkW08PNxMnEgm3BehBYijWjROihzYXzn
	+DaVls0nDXn54BI7tY2wYn0ZUcOAlo4elGMUlkYfJIe8z9wRFqwq7zWJV5vjsLKPVn5q
	pRRw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:from:to:subject:date:message-id:in-reply-to
	:references;
	bh=5IpgTnXfAUWEPtfizQ6PfvvXqgo9LCGmOkN6iESh3fU=;
	b=LFg6kM5s9pnP/iLla+VHN2aMUBtgx4PVqJ9uFuhXuFlcF+aclqNgpcPuUQgfJXRUL9
	nPC3Lrhi3+t93X05H3OccEKKPdA4TQuKWcOwD+UBaVK4QmIZc3FrW7ESBR0T+MkmIjCT
	ZIGvjbi4bzSGPrGMakZIY8NZPmvY3PgRvuYku80TXmXUk3rQLj93qH7h1xIj/YvYJlUU
	GX+7twutYdpydUTndmie5LxIyEtV2FbpQy9v3XNJtW2LmhLU8MbESVaVV5jtFWx6qE3T
	g3PTfORKcTiw3ndi/CT540WLgb6Xf1sM440aKRG82ZCMaOywrR7u3OJO6WaBN7DlELlB
	3vQA==
X-Gm-Message-State: 
 AIkVDXKnZFzX6KJk5FTyWv+gpQDoAqojU9HXzYSLmveDgF5WtjBRO7dQAnmxEBVMaPFw7w==
X-Received: by 10.25.134.2 with SMTP id i2mr19948346lfd.79.1484000128301;
	Mon, 09 Jan 2017 14:15:28 -0800 (PST)
Received: from localhost.localdomain ([2001:470:28:852:a9ed:5432:636c:1053])
	by smtp.gmail.com with ESMTPSA id
	f25sm1358538lji.26.2017.01.09.14.15.27
	for <ffmpeg-devel@ffmpeg.org>
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
	Mon, 09 Jan 2017 14:15:27 -0800 (PST)
From: =?UTF-8?q?Martin=20Storsj=C3=B6?= <martin@martin.st>
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 10 Jan 2017 00:15:15 +0200
Message-Id: <1484000119-4959-9-git-send-email-martin@martin.st>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1484000119-4959-1-git-send-email-martin@martin.st>
References: <1484000119-4959-1-git-send-email-martin@martin.st>
Subject: [FFmpeg-devel] [PATCH 09/13] arm: vp9itxfm: Skip empty slices in
	the first pass of idct_idct 16x16 and 32x32
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <http://ffmpeg.org/mailman/options/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <http://ffmpeg.org/pipermail/ffmpeg-devel/>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <http://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches
	<ffmpeg-devel@ffmpeg.org>
MIME-Version: 1.0
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

This work is sponsored by, and copyright, Google.

Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:

                                     Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub16_add_neon:   3188.1   2435.4   2499.0   1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.7  16582.3  14207.6  12000.3

By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:

vp9_inv_dct_dct_16x16_sub1_add_neon:     274.6    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    2064.0   1534.8   1719.4   1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon:    2135.0   1477.2   1736.3   1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon:    2446.7   1828.7   1993.6   1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2832.4   2118.3   2266.5   1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon:   3211.7   2475.3   2523.5   1983.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     756.2    456.7    862.0    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10682.2   8190.4   8539.2   6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon:   10813.5   8014.9   8518.3   6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon:   11859.6   9313.0   9347.4   7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon:  12946.6  10752.4  10192.2   8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon:  14074.6  11946.5  11001.4   9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon:  15269.9  13662.7  11816.1   9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon:  16327.9  14940.1  12626.7  10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17462.7  15776.1  13446.2  11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon:  18575.5  17157.0  14249.3  12015.1

I.e. in general a very minor overhead for the full subpartition case due
to the additional loads and cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.

In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.

This is cherrypicked from libav commit
9c8bc74c2b40537b0997f646c87c008042d788c2.
---
 libavcodec/arm/vp9itxfm_neon.S | 75 +++++++++++++++++++++++++++++++++++++-----
 tests/checkasm/vp9dsp.c        |  6 ++--
 2 files changed, 70 insertions(+), 11 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index d5b8495..25f6dde 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -659,9 +659,8 @@ endfunc
 @ Read a vertical 4x16 slice out of a 16x16 matrix, do a transform on it,
 @ transpose into a horizontal 16x4 slice and store.
 @ r0 = dst (temp buffer)
-@ r1 = unused
+@ r1 = slice offset
 @ r2 = src
-@ r3 = slice offset
 function \txfm\()16_1d_4x16_pass1_neon
         mov             r12, #32
         vmov.s16        q2, #0
@@ -678,14 +677,14 @@ function \txfm\()16_1d_4x16_pass1_neon
         transpose16_q_4x_4x4 q8,  q9,  q10, q11, q12, q13, q14, q15, d16, d17, d18, d19, d20, d21, d22, d23, d24, d25, d26, d27, d28, d29, d30, d31
 
         @ Store the transposed 4x4 blocks horizontally.
-        cmp             r3,  #12
+        cmp             r1,  #12
         beq             1f
 .irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31
         vst1.16         {d\i}, [r0,:64]!
 .endr
         bx              lr
 1:
-        @ Special case: For the last input column (r3 == 12),
+        @ Special case: For the last input column (r1 == 12),
         @ which would be stored as the last row in the temp buffer,
         @ don't store the first 4x4 block, but keep it in registers
         @ for the first slice of the second pass (where it is the
@@ -781,15 +780,22 @@ endfunc
 itxfm16_1d_funcs idct
 itxfm16_1d_funcs iadst
 
+@ This is the minimum eob value for each subpartition, in increments of 4
+const min_eob_idct_idct_16, align=4
+        .short  0, 10, 38, 89
+endconst
+
 .macro itxfm_func16x16 txfm1, txfm2
 function ff_vp9_\txfm1\()_\txfm2\()_16x16_add_neon, export=1
 .ifc \txfm1\()_\txfm2,idct_idct
         cmp             r3,  #1
         beq             idct16x16_dc_add_neon
 .endif
-        push            {r4-r7,lr}
+        push            {r4-r8,lr}
 .ifnc \txfm1\()_\txfm2,idct_idct
         vpush           {q4-q7}
+.else
+        movrel          r8,  min_eob_idct_idct_16 + 2
 .endif
 
         @ Align the stack, allocate a temp buffer
@@ -810,10 +816,36 @@ A       and             r7,  sp,  #15
 
 .irp i, 0, 4, 8, 12
         add             r0,  sp,  #(\i*32)
+.ifc \txfm1\()_\txfm2,idct_idct
+.if \i > 0
+        ldrh_post       r1,  r8,  #2
+        cmp             r3,  r1
+        it              le
+        movle           r1,  #(16 - \i)/4
+        ble             1f
+.endif
+.endif
+        mov             r1,  #\i
         add             r2,  r6,  #(\i*2)
-        mov             r3,  #\i
         bl              \txfm1\()16_1d_4x16_pass1_neon
 .endr
+
+.ifc \txfm1\()_\txfm2,idct_idct
+        b               3f
+1:
+        @ For all-zero slices in pass 1, set d28-d31 to zero, for the in-register
+        @ passthrough of coefficients to pass 2 and clear the end of the temp buffer
+        vmov.i16        q14, #0
+        vmov.i16        q15, #0
+2:
+        subs            r1,  r1,  #1
+.rept 4
+        vst1.16         {q14-q15}, [r0,:128]!
+.endr
+        bne             2b
+3:
+.endif
+
 .ifc \txfm1\()_\txfm2,iadst_idct
         movrel          r12, idct_coeffs
         vld1.16         {q0-q1}, [r12,:128]
@@ -830,7 +862,7 @@ A       and             r7,  sp,  #15
 .ifnc \txfm1\()_\txfm2,idct_idct
         vpop            {q4-q7}
 .endif
-        pop             {r4-r7,pc}
+        pop             {r4-r8,pc}
 endfunc
 .endm
 
@@ -1110,11 +1142,16 @@ function idct32_1d_4x32_pass2_neon
         bx              lr
 endfunc
 
+const min_eob_idct_idct_32, align=4
+        .short  0, 9, 34, 70, 135, 240, 336, 448
+endconst
+
 function ff_vp9_idct_idct_32x32_add_neon, export=1
         cmp             r3,  #1
         beq             idct32x32_dc_add_neon
-        push            {r4-r7,lr}
+        push            {r4-r8,lr}
         vpush           {q4-q7}
+        movrel          r8,  min_eob_idct_idct_32 + 2
 
         @ Align the stack, allocate a temp buffer
 T       mov             r7,  sp
@@ -1129,9 +1166,29 @@ A       and             r7,  sp,  #15
 
 .irp i, 0, 4, 8, 12, 16, 20, 24, 28
         add             r0,  sp,  #(\i*64)
+.if \i > 0
+        ldrh_post       r1,  r8,  #2
+        cmp             r3,  r1
+        it              le
+        movle           r1,  #(32 - \i)/2
+        ble             1f
+.endif
         add             r2,  r6,  #(\i*2)
         bl              idct32_1d_4x32_pass1_neon
 .endr
+        b               3f
+
+1:
+        @ Write zeros to the temp buffer for pass 2
+        vmov.i16        q14, #0
+        vmov.i16        q15, #0
+2:
+        subs            r1,  r1,  #1
+.rept 4
+        vst1.16         {q14-q15}, [r0,:128]!
+.endr
+        bne             2b
+3:
 .irp i, 0, 4, 8, 12, 16, 20, 24, 28
         add             r0,  r4,  #(\i)
         mov             r1,  r5
@@ -1141,5 +1198,5 @@ A       and             r7,  sp,  #15
 
         add             sp,  sp,  r7
         vpop            {q4-q7}
-        pop             {r4-r7,pc}
+        pop             {r4-r8,pc}
 endfunc
diff --git a/tests/checkasm/vp9dsp.c b/tests/checkasm/vp9dsp.c
index f32b97c..4817309 100644
--- a/tests/checkasm/vp9dsp.c
+++ b/tests/checkasm/vp9dsp.c
@@ -334,8 +334,10 @@ static void check_itxfm(void)
                 // skip testing sub-IDCTs for WHT or ADST since they don't
                 // implement it in any of the SIMD functions. If they do,
                 // consider changing this to ensure we have complete test
-                // coverage
-                for (sub = (txtp == 0 && tx < 4) ? 1 : sz; sub <= sz; sub <<= 1) {
+                // coverage. Test sub=1 for dc-only, then 2, 4, 8, 12, etc,
+                // since the arm version can distinguish them at that level.
+                for (sub = (txtp == 0 && tx < 4) ? 1 : sz; sub <= sz;
+                     sub < 4 ? (sub <<= 1) : (sub += 4)) {
                     if (check_func(dsp.itxfm_add[tx][txtp],
                                    "vp9_inv_%s_%dx%d_sub%d_add_%d",
                                    tx == 4 ? "wht_wht" : txtp_types[txtp],