From patchwork Wed Mar  8 10:00:47 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st>
X-Patchwork-Id: 2804
From: =?UTF-8?q?Martin=20Storsj=C3=B6?= <martin@martin.st>
To: ffmpeg-devel@ffmpeg.org
Date: Wed,  8 Mar 2017 12:00:47 +0200
Message-Id: <1488967274-8143-7-git-send-email-martin@martin.st>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1488967274-8143-1-git-send-email-martin@martin.st>
References: <1488967274-8143-1-git-send-email-martin@martin.st>
Subject: [FFmpeg-devel] [PATCH 07/34] arm: vp9itxfm: Do a simpler
	half/quarter idct16/idct32 when possible
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <http://ffmpeg.org/mailman/options/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <http://ffmpeg.org/pipermail/ffmpeg-devel/>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <http://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches
	<ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
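
As an illustration of the first point (not part of the patch): when one
input of a butterfly is known to be zero, the rotation collapses to two
multiplies, which is what the new mbutterfly_h1/mbutterfly_h2 macros in
the diff do. A minimal C model, assuming the formulas in the existing
mbutterfly comment; the C function names mirror the macros, while the
checker half_butterflies_match() is made up for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Rounding shift right by 14 with narrowing to 16 bits, modelling
 * vrshrn.s32 #14. Assumes arithmetic right shift of negative values. */
static int16_t rshrn14(int32_t v)
{
    return (int16_t)((v + (1 << 13)) >> 14);
}

/* Full butterfly, following the comment on the mbutterfly macro:
 * io1 = (io1 * c1 - io2 * c2 + (1 << 13)) >> 14
 * io2 = (io1 * c2 + io2 * c1 + (1 << 13)) >> 14 */
static void mbutterfly(int16_t *io1, int16_t *io2, int16_t c1, int16_t c2)
{
    int32_t a = *io1 * c1 - *io2 * c2;
    int32_t b = *io1 * c2 + *io2 * c1;
    *io1 = rshrn14(a);
    *io2 = rshrn14(b);
}

/* Model of mbutterfly_h1: the second input is treated as zero. */
static void mbutterfly_h1(int16_t *io1, int16_t *io2, int16_t c1, int16_t c2)
{
    int32_t a = *io1 * c1;
    int32_t b = *io1 * c2;
    *io1 = rshrn14(a);
    *io2 = rshrn14(b);
}

/* Model of mbutterfly_h2: the first input is treated as zero. */
static void mbutterfly_h2(int16_t *io1, int16_t *io2, int16_t c1, int16_t c2)
{
    int32_t a = -(*io2 * c2);
    int32_t b = *io2 * c1;
    *io1 = rshrn14(a);
    *io2 = rshrn14(b);
}

/* Hypothetical checker: returns 1 if both reduced forms give the same
 * result as the full butterfly fed with an explicit zero. */
static int half_butterflies_match(int16_t x, int16_t c1, int16_t c2)
{
    int16_t f1 = x, f2 = 0;
    int16_t h1 = x, h2 = 12345; /* h2 is ignored and overwritten */

    mbutterfly(&f1, &f2, c1, c2);
    mbutterfly_h1(&h1, &h2, c1, c2);
    if (f1 != h1 || f2 != h2)
        return 0;

    f1 = 0; f2 = x;
    h1 = 12345; h2 = x;         /* h1 is ignored and overwritten */
    mbutterfly(&f1, &f2, c1, c2);
    mbutterfly_h2(&h1, &h2, c1, c2);
    return f1 == h1 && f2 == h2;
}
```

For example, half_butterflies_match(100, 6270, 15137) should return 1;
the two constants are just plausible 14-bit fixed-point values, not a
claim about which table entries the assembly uses.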

This gives a substantial speedup for the smaller subpartitions: in the
numbers below, 16x16 sub2/sub4 drop by roughly 34-43% and 32x32 sub2 to
sub8 by roughly 28-31%.
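
Which version runs is decided per block in the 16x16 idct_idct entry
point: the diff compares r3 against 10 and 38 and branches to the
quarter or half function. A rough C equivalent, assuming r3 carries the
eob argument (the diff itself does not say so); the enum and function
names are hypothetical:

```c
#include <assert.h>

/* Hypothetical names for the three code paths of the 16x16 idct_idct. */
enum idct16_variant {
    IDCT16_QUARTER, /* idct16x16_quarter_add_neon */
    IDCT16_HALF,    /* idct16x16_half_add_neon    */
    IDCT16_FULL     /* the existing full transform */
};

/* Mirrors "cmp r3, #10; ble ..." followed by "cmp r3, #38; ble ...".
 * ble is a signed less-or-equal, hence the plain int comparison.
 * Any other special cases handled elsewhere are not modelled here. */
static enum idct16_variant pick_idct16_variant(int eob)
{
    if (eob <= 10)
        return IDCT16_QUARTER;
    if (eob <= 38)
        return IDCT16_HALF;
    return IDCT16_FULL;
}
```

The thresholds are taken verbatim from the diff; the 32x32 case is
handled separately and is not sketched here.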

The code size increases from 12388 bytes to 19784 bytes (about 60%).

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:                              Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    212.0    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    2102.1   1521.7   1736.2   1265.8
vp9_inv_dct_dct_16x16_sub4_add_neon:    2104.5   1533.0   1736.6   1265.5
vp9_inv_dct_dct_16x16_sub8_add_neon:    2484.8   1828.7   2014.4   1506.5
vp9_inv_dct_dct_16x16_sub12_add_neon:   2851.2   2117.8   2294.8   1753.2
vp9_inv_dct_dct_16x16_sub16_add_neon:   3239.4   2408.3   2543.5   1994.9
vp9_inv_dct_dct_32x32_sub1_add_neon:     758.3    456.7    864.5    553.9
vp9_inv_dct_dct_32x32_sub2_add_neon:   10776.7   7949.8   8567.7   6819.7
vp9_inv_dct_dct_32x32_sub4_add_neon:   10865.6   8131.5   8589.6   6816.3
vp9_inv_dct_dct_32x32_sub8_add_neon:   12053.9   9271.3   9387.7   7564.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  13328.3  10463.2  10217.0   8321.3
vp9_inv_dct_dct_32x32_sub16_add_neon:  14176.4  11509.5  11018.7   9062.3
vp9_inv_dct_dct_32x32_sub20_add_neon:  15301.5  12999.9  11855.1   9828.2
vp9_inv_dct_dct_32x32_sub24_add_neon:  16482.7  14931.5  12650.1  10575.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17589.5  15811.9  13482.8  11333.4
vp9_inv_dct_dct_32x32_sub32_add_neon:  18696.2  17049.2  14355.6  12089.7

After:                               Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_neon:     273.0    189.5    211.7    235.8
vp9_inv_dct_dct_16x16_sub2_add_neon:    1203.5    998.2   1035.3    763.0
vp9_inv_dct_dct_16x16_sub4_add_neon:    1203.5    998.1   1035.5    760.8
vp9_inv_dct_dct_16x16_sub8_add_neon:    1926.1   1610.6   1722.1   1271.7
vp9_inv_dct_dct_16x16_sub12_add_neon:   2873.2   2129.7   2285.1   1757.3
vp9_inv_dct_dct_16x16_sub16_add_neon:   3221.4   2520.3   2557.6   2002.1
vp9_inv_dct_dct_32x32_sub1_add_neon:     753.0    457.5    866.6    554.6
vp9_inv_dct_dct_32x32_sub2_add_neon:    7554.6   5652.4   6048.4   4920.2
vp9_inv_dct_dct_32x32_sub4_add_neon:    7549.9   5685.0   6046.9   4925.7
vp9_inv_dct_dct_32x32_sub8_add_neon:    8336.9   6704.5   6604.0   5478.0
vp9_inv_dct_dct_32x32_sub12_add_neon:  10914.0   9777.2   9240.4   7416.9
vp9_inv_dct_dct_32x32_sub16_add_neon:  11859.2  11223.3   9966.3   8095.1
vp9_inv_dct_dct_32x32_sub20_add_neon:  15237.1  13029.4  11838.3   9829.4
vp9_inv_dct_dct_32x32_sub24_add_neon:  16293.2  14379.8  12644.9  10572.0
vp9_inv_dct_dct_32x32_sub28_add_neon:  17424.3  15734.7  13473.0  11326.9
vp9_inv_dct_dct_32x32_sub32_add_neon:  18531.3  17457.0  14298.6  12080.0
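
To put the two tables side by side, the speedup for a row is simply
before/after. A small C helper, with a few Cortex A7 rows copied from
the tables above; the struct and function names exist only for this
example:

```c
#include <assert.h>
#include <stdio.h>

struct bench_row {
    const char *name;
    double before;
    double after;
};

/* Cortex A7 column, values copied from the Before/After tables. */
static const struct bench_row a7_rows[] = {
    { "16x16_sub2",   2102.1,  1203.5 },
    { "16x16_sub8",   2484.8,  1926.1 },
    { "32x32_sub4",  10865.6,  7549.9 },
    { "32x32_sub16", 14176.4, 11859.2 },
};

/* Speedup factor: the "before" figure divided by the "after" figure. */
static double speedup(double before, double after)
{
    return before / after;
}

/* Print one line per row with its speedup factor. */
static void print_speedups(void)
{
    size_t i;

    for (i = 0; i < sizeof(a7_rows) / sizeof(a7_rows[0]); i++)
        printf("%-12s %.2fx\n", a7_rows[i].name,
               speedup(a7_rows[i].before, a7_rows[i].after));
}
```

Calling print_speedups() from a main() lists the factors; rows for the
other cores can be added to the table in the same way.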

This is cherrypicked from libav commit
5eb5aec475aabc884d083566f902876ecbc072cb.
---
 libavcodec/arm/vp9itxfm_neon.S | 591 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 537 insertions(+), 54 deletions(-)

diff --git a/libavcodec/arm/vp9itxfm_neon.S b/libavcodec/arm/vp9itxfm_neon.S
index 682a82e..33a7af1 100644
--- a/libavcodec/arm/vp9itxfm_neon.S
+++ b/libavcodec/arm/vp9itxfm_neon.S
@@ -74,6 +74,14 @@ endconst
         vrshrn.s32      \out2, \tmpq4, #14
 .endm
 
+@ Same as mbutterfly0 above, but treating the input in in2 as zero,
+@ writing the same output into both out1 and out2.
+.macro mbutterfly0_h out1, out2, in1, in2, tmpd1, tmpd2, tmpq3, tmpq4
+        vmull.s16       \tmpq3, \in1, d0[0]
+        vrshrn.s32      \out1,  \tmpq3, #14
+        vrshrn.s32      \out2,  \tmpq3, #14
+.endm
+
 @ out1,out2 = ((in1 + in2) * d0[0] + (1 << 13)) >> 14
 @ out3,out4 = ((in1 - in2) * d0[0] + (1 << 13)) >> 14
 @ Same as mbutterfly0, but with input being 2 q registers, output
@@ -137,6 +145,23 @@ endconst
         vrshrn.s32      \inout2, \tmp2,  #14
 .endm
 
+@ Same as mbutterfly above, but treating the input in inout2 as zero
+.macro mbutterfly_h1 inout1, inout2, coef1, coef2, tmp1, tmp2
+        vmull.s16       \tmp1,   \inout1, \coef1
+        vmull.s16       \tmp2,   \inout1, \coef2
+        vrshrn.s32      \inout1, \tmp1,   #14
+        vrshrn.s32      \inout2, \tmp2,   #14
+.endm
+
+@ Same as mbutterfly above, but treating the input in inout1 as zero
+.macro mbutterfly_h2 inout1, inout2, coef1, coef2, tmp1, tmp2
+        vmull.s16       \tmp1,   \inout2, \coef2
+        vmull.s16       \tmp2,   \inout2, \coef1
+        vneg.s32        \tmp1,   \tmp1
+        vrshrn.s32      \inout2, \tmp2,   #14
+        vrshrn.s32      \inout1, \tmp1,   #14
+.endm
+
 @ inout1,inout2 = (inout1,inout2 * coef1 - inout3,inout4 * coef2 + (1 << 13)) >> 14
 @ inout3,inout4 = (inout1,inout2 * coef2 + inout3,inout4 * coef1 + (1 << 13)) >> 14
 @ inout are 4 d registers, tmp are 4 q registers
@@ -534,6 +559,33 @@ function idct16x16_dc_add_neon
 endfunc
 .ltorg
 
+.macro idct16_end
+        butterfly       d18, d7,  d4,  d7                @ d18 = t0a,  d7  = t7a
+        butterfly       d19, d22, d5,  d22               @ d19 = t1a,  d22 = t6
+        butterfly       d4,  d26, d20, d26               @ d4  = t2a,  d26 = t5
+        butterfly       d5,  d6,  d28, d6                @ d5  = t3a,  d6  = t4
+        butterfly       d20, d28, d16, d24               @ d20 = t8a,  d28 = t11a
+        butterfly       d24, d21, d23, d21               @ d24 = t9,   d21 = t10
+        butterfly       d23, d27, d25, d27               @ d23 = t14,  d27 = t13
+        butterfly       d25, d29, d29, d17               @ d25 = t15a, d29 = t12a
+
+        mbutterfly0     d27, d21, d27, d21, d16, d30, q8, q15 @ d27 = t13a, d21 = t10a
+        mbutterfly0     d29, d28, d29, d28, d16, d30, q8, q15 @ d29 = t12,  d28 = t11
+
+        vswp            d27, d29                         @ d27 = t12, d29 = t13a
+        vswp            d28, d27                         @ d28 = t12, d27 = t11
+        butterfly       d16, d31, d18, d25               @ d16 = out[0], d31 = out[15]
+        butterfly       d17, d30, d19, d23               @ d17 = out[1], d30 = out[14]
+        butterfly_r     d25, d22, d22, d24               @ d25 = out[9], d22 = out[6]
+        butterfly       d23, d24, d7,  d20               @ d23 = out[7], d24 = out[8]
+        butterfly       d18, d29, d4,  d29               @ d18 = out[2], d29 = out[13]
+        butterfly       d19, d28, d5,  d28               @ d19 = out[3], d28 = out[12]
+        vmov            d4,  d21                         @ d4  = t10a
+        butterfly       d20, d27, d6,  d27               @ d20 = out[4], d27 = out[11]
+        butterfly       d21, d26, d26, d4                @ d21 = out[5], d26 = out[10]
+        bx              lr
+.endm
+
 function idct16
         mbutterfly0     d16, d24, d16, d24, d4, d6,  q2,  q3 @ d16 = t0a,  d24 = t1a
         mbutterfly      d20, d28, d0[1], d0[2], q2,  q3  @ d20 = t2a,  d28 = t3a
@@ -556,31 +608,63 @@ function idct16
         mbutterfly0     d22, d26, d22, d26, d18, d30, q9,  q15  @ d22 = t6a, d26 = t5a
         mbutterfly      d23, d25, d0[1], d0[2], q9,  q15        @ d23 = t9a,  d25 = t14a
         mbutterfly      d27, d21, d0[1], d0[2], q9,  q15, neg=1 @ d27 = t13a, d21 = t10a
+        idct16_end
+endfunc
 
-        butterfly       d18, d7,  d4,  d7                @ d18 = t0a,  d7  = t7a
-        butterfly       d19, d22, d5,  d22               @ d19 = t1a,  d22 = t6
-        butterfly       d4,  d26, d20, d26               @ d4  = t2a,  d26 = t5
-        butterfly       d5,  d6,  d28, d6                @ d5  = t3a,  d6  = t4
-        butterfly       d20, d28, d16, d24               @ d20 = t8a,  d28 = t11a
-        butterfly       d24, d21, d23, d21               @ d24 = t9,   d21 = t10
-        butterfly       d23, d27, d25, d27               @ d23 = t14,  d27 = t13
-        butterfly       d25, d29, d29, d17               @ d25 = t15a, d29 = t12a
+function idct16_half
+        mbutterfly0_h   d16, d24, d16, d24, d4, d6,  q2,  q3 @ d16 = t0a,  d24 = t1a
+        mbutterfly_h1   d20, d28, d0[1], d0[2], q2,  q3  @ d20 = t2a,  d28 = t3a
+        mbutterfly_h1   d18, d30, d0[3], d1[0], q2,  q3  @ d18 = t4a,  d30 = t7a
+        mbutterfly_h2   d26, d22, d1[1], d1[2], q2,  q3  @ d26 = t5a,  d22 = t6a
+        mbutterfly_h1   d17, d31, d1[3], d2[0], q2,  q3  @ d17 = t8a,  d31 = t15a
+        mbutterfly_h2   d25, d23, d2[1], d2[2], q2,  q3  @ d25 = t9a,  d23 = t14a
+        mbutterfly_h1   d21, d27, d2[3], d3[0], q2,  q3  @ d21 = t10a, d27 = t13a
+        mbutterfly_h2   d29, d19, d3[1], d3[2], q2,  q3  @ d29 = t11a, d19 = t12a
 
-        mbutterfly0     d27, d21, d27, d21, d16, d30, q8, q15 @ d27 = t13a, d21 = t10a
-        mbutterfly0     d29, d28, d29, d28, d16, d30, q8, q15 @ d29 = t12,  d28 = t11
+        butterfly       d4,  d28, d16, d28               @ d4  = t0,   d28 = t3
+        butterfly       d5,  d20, d24, d20               @ d5  = t1,   d20 = t2
+        butterfly       d6,  d26, d18, d26               @ d6  = t4,   d26 = t5
+        butterfly       d7,  d22, d30, d22               @ d7  = t7,   d22 = t6
+        butterfly       d16, d25, d17, d25               @ d16 = t8,   d25 = t9
+        butterfly       d24, d21, d29, d21               @ d24 = t11,  d21 = t10
+        butterfly       d17, d27, d19, d27               @ d17 = t12,  d27 = t13
+        butterfly       d29, d23, d31, d23               @ d29 = t15,  d23 = t14
 
-        vswp            d27, d29                         @ d27 = t12, d29 = t13a
-        vswp            d28, d27                         @ d28 = t12, d27 = t11
-        butterfly       d16, d31, d18, d25               @ d16 = out[0], d31 = out[15]
-        butterfly       d17, d30, d19, d23               @ d17 = out[1], d30 = out[14]
-        butterfly_r     d25, d22, d22, d24               @ d25 = out[9], d22 = out[6]
-        butterfly       d23, d24, d7,  d20               @ d23 = out[7], d24 = out[8]
-        butterfly       d18, d29, d4,  d29               @ d18 = out[2], d29 = out[13]
-        butterfly       d19, d28, d5,  d28               @ d19 = out[3], d28 = out[12]
-        vmov            d4,  d21                         @ d4  = t10a
-        butterfly       d20, d27, d6,  d27               @ d20 = out[4], d27 = out[11]
-        butterfly       d21, d26, d26, d4                @ d21 = out[5], d26 = out[10]
-        bx              lr
+        mbutterfly0     d22, d26, d22, d26, d18, d30, q9,  q15  @ d22 = t6a, d26 = t5a
+        mbutterfly      d23, d25, d0[1], d0[2], q9,  q15        @ d23 = t9a,  d25 = t14a
+        mbutterfly      d27, d21, d0[1], d0[2], q9,  q15, neg=1 @ d27 = t13a, d21 = t10a
+        idct16_end
+endfunc
+
+function idct16_quarter
+        vmull.s16       q12, d19, d3[2]
+        vmull.s16       q2,  d17, d1[3]
+        vmull.s16       q3,  d18, d1[0]
+        vmull.s16       q15, d18, d0[3]
+        vneg.s32        q12, q12
+        vmull.s16       q14, d17, d2[0]
+        vmull.s16       q13, d19, d3[1]
+        vmull.s16       q11, d16, d0[0]
+        vrshrn.s32      d24, q12, #14
+        vrshrn.s32      d16, q2,  #14
+        vrshrn.s32      d7,  q3,  #14
+        vrshrn.s32      d6,  q15, #14
+        vrshrn.s32      d29, q14, #14
+        vrshrn.s32      d17, q13, #14
+        vrshrn.s32      d28, q11, #14
+
+        mbutterfly_l    q10, q11, d17, d24, d0[1], d0[2]
+        mbutterfly_l    q9,  q15, d29, d16, d0[1], d0[2]
+        vneg.s32        q11, q11
+        vrshrn.s32      d27, q10, #14
+        vrshrn.s32      d21, q11, #14
+        vrshrn.s32      d23, q9,  #14
+        vrshrn.s32      d25, q15, #14
+        vmov            d4,  d28
+        vmov            d5,  d28
+        mbutterfly0     d22, d26, d7,  d6,  d18, d30, q9,  q15
+        vmov            d20, d28
+        idct16_end
 endfunc
 
 function iadst16
@@ -819,6 +903,13 @@ A       and             r7,  sp,  #15
         vld1.16         {q0-q1}, [r12,:128]
 .endif
 
+.ifc \txfm1\()_\txfm2,idct_idct
+        cmp             r3,  #10
+        ble             idct16x16_quarter_add_neon
+        cmp             r3,  #38
+        ble             idct16x16_half_add_neon
+.endif
+
 .irp i, 0, 4, 8, 12
         add             r0,  sp,  #(\i*32)
 .ifc \txfm1\()_\txfm2,idct_idct
@@ -877,6 +968,169 @@ itxfm_func16x16 idct,  iadst
 itxfm_func16x16 iadst, iadst
 .ltorg
 
+function idct16_1d_4x16_pass1_quarter_neon
+        push            {lr}
+        mov             r12, #32
+        vmov.s16        q2, #0
+.irp i, 16, 17, 18, 19
+        vld1.16         {d\i}, [r2,:64]
+        vst1.16         {d4},  [r2,:64], r12
+.endr
+
+        bl              idct16_quarter
+
+        @ Do four 4x4 transposes. Originally, d16-d31 contain the
+        @ 16 rows. Afterwards, d16-d19, d20-d23, d24-d27, d28-d31
+        @ contain the transposed 4x4 blocks.
+        transpose16_q_4x_4x4 q8,  q9,  q10, q11, q12, q13, q14, q15, d16, d17, d18, d19, d20, d21, d22, d23, d24, d25, d26, d27, d28, d29, d30, d31
+
+        @ Store the transposed 4x4 blocks horizontally.
+        @ The first 4x4 block is kept in registers for the second pass,
+        @ store the rest in the temp buffer.
+        add             r0,  r0,  #8
+        vst1.16         {d20}, [r0,:64]!
+        vst1.16         {d24}, [r0,:64]!
+        vst1.16         {d28}, [r0,:64]!
+        add             r0,  r0,  #8
+        vst1.16         {d21}, [r0,:64]!
+        vst1.16         {d25}, [r0,:64]!
+        vst1.16         {d29}, [r0,:64]!
+        add             r0,  r0,  #8
+        vst1.16         {d22}, [r0,:64]!
+        vst1.16         {d26}, [r0,:64]!
+        vst1.16         {d30}, [r0,:64]!
+        add             r0,  r0,  #8
+        vst1.16         {d23}, [r0,:64]!
+        vst1.16         {d27}, [r0,:64]!
+        vst1.16         {d31}, [r0,:64]!
+        pop             {pc}
+endfunc
+
+function idct16_1d_4x16_pass2_quarter_neon
+        push            {lr}
+        @ Only load the top 4 lines, and only do it for the later slices.
+        @ For the first slice, d16-d19 is kept in registers from the first pass.
+        cmp             r3,  #0
+        beq             1f
+        mov             r12, #32
+.irp i, 16, 17, 18, 19
+        vld1.16         {d\i}, [r2,:64], r12
+.endr
+1:
+
+        add             r3,  r0,  r1
+        lsl             r1,  r1,  #1
+        bl              idct16_quarter
+
+        load_add_store  q8,  q9,  q10, q11
+        load_add_store  q12, q13, q14, q15
+
+        pop             {pc}
+endfunc
+
+function idct16_1d_4x16_pass1_half_neon
+        push            {lr}
+        mov             r12, #32
+        vmov.s16        q2, #0
+.irp i, 16, 17, 18, 19, 20, 21, 22, 23
+        vld1.16         {d\i}, [r2,:64]
+        vst1.16         {d4},  [r2,:64], r12
+.endr
+
+        bl              idct16_half
+
+        @ Do four 4x4 transposes. Originally, d16-d31 contain the
+        @ 16 rows. Afterwards, d16-d19, d20-d23, d24-d27, d28-d31
+        @ contain the transposed 4x4 blocks.
+        transpose16_q_4x_4x4 q8,  q9,  q10, q11, q12, q13, q14, q15, d16, d17, d18, d19, d20, d21, d22, d23, d24, d25, d26, d27, d28, d29, d30, d31
+
+        @ Store the transposed 4x4 blocks horizontally.
+        cmp             r1,  #4
+        beq             1f
+.irp i, 16, 20, 24, 28, 17, 21, 25, 29, 18, 22, 26, 30, 19, 23, 27, 31
+        vst1.16         {d\i}, [r0,:64]!
+.endr
+        pop             {pc}
+1:
+        @ Special case: For the second input column (r1 == 4),
+        @ which would be stored as the second row in the temp buffer,
+        @ don't store the first 4x4 block, but keep it in registers
+        @ for the first slice of the second pass (where it is the
+        @ second 4x4 block).
+        add             r0,  r0,  #8
+        vst1.16         {d20}, [r0,:64]!
+        vst1.16         {d24}, [r0,:64]!
+        vst1.16         {d28}, [r0,:64]!
+        add             r0,  r0,  #8
+        vst1.16         {d21}, [r0,:64]!
+        vst1.16         {d25}, [r0,:64]!
+        vst1.16         {d29}, [r0,:64]!
+        add             r0,  r0,  #8
+        vst1.16         {d22}, [r0,:64]!
+        vst1.16         {d26}, [r0,:64]!
+        vst1.16         {d30}, [r0,:64]!
+        add             r0,  r0,  #8
+        vst1.16         {d23}, [r0,:64]!
+        vst1.16         {d27}, [r0,:64]!
+        vst1.16         {d31}, [r0,:64]!
+        vmov            d20, d16
+        vmov            d21, d17
+        vmov            d22, d18
+        vmov            d23, d19
+        pop             {pc}
+endfunc
+
+function idct16_1d_4x16_pass2_half_neon
+        push            {lr}
+        mov             r12, #32
+        cmp             r3,  #0
+.irp i, 16, 17, 18, 19
+        vld1.16         {d\i}, [r2,:64], r12
+.endr
+        beq             1f
+.irp i, 20, 21, 22, 23
+        vld1.16         {d\i}, [r2,:64], r12
+.endr
+1:
+
+        add             r3,  r0,  r1
+        lsl             r1,  r1,  #1
+        bl              idct16_half
+
+        load_add_store  q8,  q9,  q10, q11
+        load_add_store  q12, q13, q14, q15
+
+        pop             {pc}
+endfunc
+.purgem load_add_store
+
+.macro idct16_partial size
+function idct16x16_\size\()_add_neon
+        add             r0,  sp,  #(0*32)
+        mov             r1,  #0
+        add             r2,  r6,  #(0*2)
+        bl              idct16_1d_4x16_pass1_\size\()_neon
+.ifc \size,half
+        add             r0,  sp,  #(4*32)
+        mov             r1,  #4
+        add             r2,  r6,  #(4*2)
+        bl              idct16_1d_4x16_pass1_\size\()_neon
+.endif
+.irp i, 0, 4, 8, 12
+        add             r0,  r4,  #(\i)
+        mov             r1,  r5
+        add             r2,  sp,  #(\i*2)
+        mov             r3,  #\i
+        bl              idct16_1d_4x16_pass2_\size\()_neon
+.endr
+
+        add             sp,  sp,  r7
+        pop             {r4-r8,pc}
+endfunc
+.endm
+
+idct16_partial quarter
+idct16_partial half
 
 function idct32x32_dc_add_neon
         movrel          r12, idct_coeffs
@@ -913,6 +1167,38 @@ function idct32x32_dc_add_neon
         bx              lr
 endfunc
 
+.macro idct32_end
+        butterfly       d16, d5,  d4,  d5  @ d16 = t16a, d5  = t19a
+        butterfly       d17, d20, d23, d20 @ d17 = t17,  d20 = t18
+        butterfly       d18, d6,  d7,  d6  @ d18 = t23a, d6  = t20a
+        butterfly       d19, d21, d22, d21 @ d19 = t22,  d21 = t21
+        butterfly       d4,  d28, d28, d30 @ d4  = t24a, d28 = t27a
+        butterfly       d23, d26, d25, d26 @ d23 = t25,  d26 = t26
+        butterfly       d7,  d29, d29, d31 @ d7  = t31a, d29 = t28a
+        butterfly       d22, d27, d24, d27 @ d22 = t30,  d27 = t29
+
+        mbutterfly      d27, d20, d0[1], d0[2], q12, q15        @ d27 = t18a, d20 = t29a
+        mbutterfly      d29, d5,  d0[1], d0[2], q12, q15        @ d29 = t19,  d5  = t28
+        mbutterfly      d28, d6,  d0[1], d0[2], q12, q15, neg=1 @ d28 = t27,  d6  = t20
+        mbutterfly      d26, d21, d0[1], d0[2], q12, q15, neg=1 @ d26 = t26a, d21 = t21a
+
+        butterfly       d31, d24, d7,  d4  @ d31 = t31,  d24 = t24
+        butterfly       d30, d25, d22, d23 @ d30 = t30a, d25 = t25a
+        butterfly_r     d23, d16, d16, d18 @ d23 = t23,  d16 = t16
+        butterfly_r     d22, d17, d17, d19 @ d22 = t22a, d17 = t17a
+        butterfly       d18, d21, d27, d21 @ d18 = t18,  d21 = t21
+        butterfly_r     d27, d28, d5,  d28 @ d27 = t27a, d28 = t28a
+        butterfly       d4,  d26, d20, d26 @ d4  = t29,  d26 = t26
+        butterfly       d19, d20, d29, d6  @ d19 = t19a, d20 = t20
+        vmov            d29, d4            @ d29 = t29
+
+        mbutterfly0     d27, d20, d27, d20, d4, d6, q2, q3 @ d27 = t27,  d20 = t20
+        mbutterfly0     d26, d21, d26, d21, d4, d6, q2, q3 @ d26 = t26a, d21 = t21a
+        mbutterfly0     d25, d22, d25, d22, d4, d6, q2, q3 @ d25 = t25,  d22 = t22
+        mbutterfly0     d24, d23, d24, d23, d4, d6, q2, q3 @ d24 = t24a, d23 = t23a
+        bx              lr
+.endm
+
 function idct32_odd
         movrel          r12, idct_coeffs
         add             r12, r12, #32
@@ -943,38 +1229,91 @@ function idct32_odd
         mbutterfly      d27, d20, d0[3], d1[0], q8, q9, neg=1 @ d27 = t29a, d20 = t18a
         mbutterfly      d21, d26, d1[1], d1[2], q8, q9        @ d21 = t21a, d26 = t26a
         mbutterfly      d25, d22, d1[1], d1[2], q8, q9, neg=1 @ d25 = t25a, d22 = t22a
+        idct32_end
+endfunc
 
-        butterfly       d16, d5,  d4,  d5  @ d16 = t16a, d5  = t19a
-        butterfly       d17, d20, d23, d20 @ d17 = t17,  d20 = t18
-        butterfly       d18, d6,  d7,  d6  @ d18 = t23a, d6  = t20a
-        butterfly       d19, d21, d22, d21 @ d19 = t22,  d21 = t21
-        butterfly       d4,  d28, d28, d30 @ d4  = t24a, d28 = t27a
-        butterfly       d23, d26, d25, d26 @ d23 = t25,  d26 = t26
-        butterfly       d7,  d29, d29, d31 @ d7  = t31a, d29 = t28a
-        butterfly       d22, d27, d24, d27 @ d22 = t30,  d27 = t29
+function idct32_odd_half
+        movrel          r12, idct_coeffs
+        add             r12, r12, #32
+        vld1.16         {q0-q1}, [r12,:128]
 
-        mbutterfly      d27, d20, d0[1], d0[2], q12, q15        @ d27 = t18a, d20 = t29a
-        mbutterfly      d29, d5,  d0[1], d0[2], q12, q15        @ d29 = t19,  d5  = t28
-        mbutterfly      d28, d6,  d0[1], d0[2], q12, q15, neg=1 @ d28 = t27,  d6  = t20
-        mbutterfly      d26, d21, d0[1], d0[2], q12, q15, neg=1 @ d26 = t26a, d21 = t21a
+        mbutterfly_h1   d16, d31, d0[0], d0[1], q2, q3 @ d16 = t16a, d31 = t31a
+        mbutterfly_h2   d24, d23, d0[2], d0[3], q2, q3 @ d24 = t17a, d23 = t30a
+        mbutterfly_h1   d20, d27, d1[0], d1[1], q2, q3 @ d20 = t18a, d27 = t29a
+        mbutterfly_h2   d28, d19, d1[2], d1[3], q2, q3 @ d28 = t19a, d19 = t28a
+        mbutterfly_h1   d18, d29, d2[0], d2[1], q2, q3 @ d18 = t20a, d29 = t27a
+        mbutterfly_h2   d26, d21, d2[2], d2[3], q2, q3 @ d26 = t21a, d21 = t26a
+        mbutterfly_h1   d22, d25, d3[0], d3[1], q2, q3 @ d22 = t22a, d25 = t25a
+        mbutterfly_h2   d30, d17, d3[2], d3[3], q2, q3 @ d30 = t23a, d17 = t24a
 
-        butterfly       d31, d24, d7,  d4  @ d31 = t31,  d24 = t24
-        butterfly       d30, d25, d22, d23 @ d30 = t30a, d25 = t25a
-        butterfly_r     d23, d16, d16, d18 @ d23 = t23,  d16 = t16
-        butterfly_r     d22, d17, d17, d19 @ d22 = t22a, d17 = t17a
-        butterfly       d18, d21, d27, d21 @ d18 = t18,  d21 = t21
-        butterfly_r     d27, d28, d5,  d28 @ d27 = t27a, d28 = t28a
-        butterfly       d4,  d26, d20, d26 @ d4  = t29,  d26 = t26
-        butterfly       d19, d20, d29, d6  @ d19 = t19a, d20 = t20
-        vmov            d29, d4            @ d29 = t29
+        sub             r12, r12, #32
+        vld1.16         {q0}, [r12,:128]
 
-        mbutterfly0     d27, d20, d27, d20, d4, d6, q2, q3 @ d27 = t27,  d20 = t20
-        mbutterfly0     d26, d21, d26, d21, d4, d6, q2, q3 @ d26 = t26a, d21 = t21a
-        mbutterfly0     d25, d22, d25, d22, d4, d6, q2, q3 @ d25 = t25,  d22 = t22
-        mbutterfly0     d24, d23, d24, d23, d4, d6, q2, q3 @ d24 = t24a, d23 = t23a
-        bx              lr
+        butterfly       d4,  d24, d16, d24 @ d4  = t16, d24 = t17
+        butterfly       d5,  d20, d28, d20 @ d5  = t19, d20 = t18
+        butterfly       d6,  d26, d18, d26 @ d6  = t20, d26 = t21
+        butterfly       d7,  d22, d30, d22 @ d7  = t23, d22 = t22
+        butterfly       d28, d25, d17, d25 @ d28 = t24, d25 = t25
+        butterfly       d30, d21, d29, d21 @ d30 = t27, d21 = t26
+        butterfly       d29, d23, d31, d23 @ d29 = t31, d23 = t30
+        butterfly       d31, d27, d19, d27 @ d31 = t28, d27 = t29
+
+        mbutterfly      d23, d24, d0[3], d1[0], q8, q9        @ d23 = t17a, d24 = t30a
+        mbutterfly      d27, d20, d0[3], d1[0], q8, q9, neg=1 @ d27 = t29a, d20 = t18a
+        mbutterfly      d21, d26, d1[1], d1[2], q8, q9        @ d21 = t21a, d26 = t26a
+        mbutterfly      d25, d22, d1[1], d1[2], q8, q9, neg=1 @ d25 = t25a, d22 = t22a
+
+        idct32_end
 endfunc
 
+function idct32_odd_quarter
+        movrel          r12, idct_coeffs
+        add             r12, r12, #32
+        vld1.16         {q0-q1}, [r12,:128]
+
+        vmull.s16       q2,  d16, d0[0]
+        vmull.s16       q14, d19, d1[3]
+        vmull.s16       q15, d16, d0[1]
+        vmull.s16       q11, d17, d3[2]
+        vmull.s16       q3,  d17, d3[3]
+        vmull.s16       q13, d19, d1[2]
+        vmull.s16       q10, d18, d2[0]
+        vmull.s16       q12, d18, d2[1]
+
+        sub             r12, r12, #32
+        vld1.16         {q0}, [r12,:128]
+
+        vneg.s32        q14, q14
+        vneg.s32        q3,  q3
+
+        vrshrn.s32      d4,  q2,  #14
+        vrshrn.s32      d5,  q14, #14
+        vrshrn.s32      d29, q15, #14
+        vrshrn.s32      d28, q11, #14
+        vrshrn.s32      d7,  q3,  #14
+        vrshrn.s32      d31, q13, #14
+        vrshrn.s32      d6,  q10, #14
+        vrshrn.s32      d30, q12, #14
+
+        mbutterfly_l    q8,  q9,  d29, d4,  d0[3], d1[0]
+        mbutterfly_l    q13, q10, d31, d5,  d0[3], d1[0]
+        vrshrn.s32      d23, q8,  #14
+        vrshrn.s32      d24, q9,  #14
+        vneg.s32        q10, q10
+        vrshrn.s32      d27, q13, #14
+        vrshrn.s32      d20, q10, #14
+        mbutterfly_l    q8,  q9,  d30, d6,  d1[1], d1[2]
+        vrshrn.s32      d21, q8,  #14
+        vrshrn.s32      d26, q9,  #14
+        mbutterfly_l    q8,  q9,  d28, d7,  d1[1], d1[2]
+        vrshrn.s32      d25, q8,  #14
+        vneg.s32        q9,  q9
+        vrshrn.s32      d22, q9,  #14
+
+        idct32_end
+endfunc
+
+.macro idct32_funcs suffix
 @ Do an 32-point IDCT of a 4x32 slice out of a 32x32 matrix.
 @ We don't have register space to do a single pass IDCT of 4x32 though,
 @ but the 32-point IDCT can be decomposed into two 16-point IDCTs;
@@ -984,7 +1323,7 @@ endfunc
 @ r0 = dst (temp buffer)
 @ r1 = unused
 @ r2 = src
-function idct32_1d_4x32_pass1_neon
+function idct32_1d_4x32_pass1\suffix\()_neon
         push            {lr}
 
         movrel          r12, idct_coeffs
@@ -995,12 +1334,26 @@ function idct32_1d_4x32_pass1_neon
         vmov.s16        d4, #0
 
         @ d16 = IN(0), d17 = IN(2) ... d31 = IN(30)
+.ifb \suffix
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
         vld1.16         {d\i}, [r2,:64]
         vst1.16         {d4},  [r2,:64], r12
 .endr
+.endif
+.ifc \suffix,_quarter
+.irp i, 16, 17, 18, 19
+        vld1.16         {d\i}, [r2,:64]
+        vst1.16         {d4},  [r2,:64], r12
+.endr
+.endif
+.ifc \suffix,_half
+.irp i, 16, 17, 18, 19, 20, 21, 22, 23
+        vld1.16         {d\i}, [r2,:64]
+        vst1.16         {d4},  [r2,:64], r12
+.endr
+.endif
 
-        bl              idct16
+        bl              idct16\suffix
 
         @ Do four 4x4 transposes. Originally, d16-d31 contain the
         @ 16 rows. Afterwards, d16-d19, d20-d23, d24-d27, d28-d31
@@ -1026,17 +1379,39 @@ function idct32_1d_4x32_pass1_neon
 
         @ Move r2 back to the start of the input, and move
         @ to the first odd row
+.ifb \suffix
         sub             r2,  r2,  r12, lsl #4
+.endif
+.ifc \suffix,_quarter
+        sub             r2,  r2,  r12, lsl #2
+.endif
+.ifc \suffix,_half
+        sub             r2,  r2,  r12, lsl #3
+.endif
         add             r2,  r2,  #64
 
         vmov.s16        d4, #0
         @ d16 = IN(1), d17 = IN(3) ... d31 = IN(31)
+.ifb \suffix
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
         vld1.16         {d\i}, [r2,:64]
         vst1.16         {d4},  [r2,:64], r12
 .endr
+.endif
+.ifc \suffix,_quarter
+.irp i, 16, 17, 18, 19
+        vld1.16         {d\i}, [r2,:64]
+        vst1.16         {d4},  [r2,:64], r12
+.endr
+.endif
+.ifc \suffix,_half
+.irp i, 16, 17, 18, 19, 20, 21, 22, 23
+        vld1.16         {d\i}, [r2,:64]
+        vst1.16         {d4},  [r2,:64], r12
+.endr
+.endif
 
-        bl              idct32_odd
+        bl              idct32_odd\suffix
 
         transpose16_q_4x_4x4 q15, q14, q13, q12, q11, q10, q9,  q8,  d31, d30, d29, d28, d27, d26, d25, d24, d23, d22, d21, d20, d19, d18, d17, d16
 
@@ -1072,19 +1447,33 @@ endfunc
 @ r0 = dst
 @ r1 = dst stride
 @ r2 = src (temp buffer)
-function idct32_1d_4x32_pass2_neon
+function idct32_1d_4x32_pass2\suffix\()_neon
         push            {lr}
         movrel          r12, idct_coeffs
         vld1.16         {q0-q1}, [r12,:128]
 
         mov             r12, #128
         @ d16 = IN(0), d17 = IN(2) ... d31 = IN(30)
+.ifb \suffix
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
         vld1.16         {d\i}, [r2,:64], r12
 .endr
         sub             r2,  r2,  r12, lsl #4
+.endif
+.ifc \suffix,_quarter
+.irp i, 16, 17, 18, 19
+        vld1.16         {d\i}, [r2,:64], r12
+.endr
+        sub             r2,  r2,  r12, lsl #2
+.endif
+.ifc \suffix,_half
+.irp i, 16, 17, 18, 19, 20, 21, 22, 23
+        vld1.16         {d\i}, [r2,:64], r12
+.endr
+        sub             r2,  r2,  r12, lsl #3
+.endif
 
-        bl              idct16
+        bl              idct16\suffix
 
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
         vst1.16         {d\i}, [r2,:64], r12
@@ -1094,13 +1483,27 @@ function idct32_1d_4x32_pass2_neon
         add             r2,  r2,  #64
 
         @ d16 = IN(1), d17 = IN(3) ... d31 = IN(31)
+.ifb \suffix
 .irp i, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
         vld1.16         {d\i}, [r2,:64], r12
 .endr
         sub             r2,  r2,  r12, lsl #4
+.endif
+.ifc \suffix,_quarter
+.irp i, 16, 17, 18, 19
+        vld1.16         {d\i}, [r2,:64], r12
+.endr
+        sub             r2,  r2,  r12, lsl #2
+.endif
+.ifc \suffix,_half
+.irp i, 16, 17, 18, 19, 20, 21, 22, 23
+        vld1.16         {d\i}, [r2,:64], r12
+.endr
+        sub             r2,  r2,  r12, lsl #3
+.endif
         sub             r2,  r2,  #64
 
-        bl              idct32_odd
+        bl              idct32_odd\suffix
 
         mov             r12, #128
 .macro load_acc_store a, b, c, d, neg=0
@@ -1150,6 +1553,11 @@ function idct32_1d_4x32_pass2_neon
 .purgem load_acc_store
         pop             {pc}
 endfunc
+.endm
+
+idct32_funcs
+idct32_funcs _quarter
+idct32_funcs _half
 
 const min_eob_idct_idct_32, align=4
         .short  0, 9, 34, 70, 135, 240, 336, 448
@@ -1173,6 +1581,11 @@ A       and             r7,  sp,  #15
         mov             r5,  r1
         mov             r6,  r2
 
+        cmp             r3,  #34
+        ble             idct32x32_quarter_add_neon
+        cmp             r3,  #135
+        ble             idct32x32_half_add_neon
+
 .irp i, 0, 4, 8, 12, 16, 20, 24, 28
         add             r0,  sp,  #(\i*64)
 .if \i > 0
@@ -1209,3 +1622,73 @@ A       and             r7,  sp,  #15
         vpop            {q4-q7}
         pop             {r4-r8,pc}
 endfunc
+
+function idct32x32_quarter_add_neon
+.irp i, 0, 4
+        add             r0,  sp,  #(\i*64)
+.if \i == 4
+        cmp             r3,  #9
+        ble             1f
+.endif
+        add             r2,  r6,  #(\i*2)
+        bl              idct32_1d_4x32_pass1_quarter_neon
+.endr
+        b               3f
+
+1:
+        @ Write zeros to the temp buffer for pass 2
+        vmov.i16        q14, #0
+        vmov.i16        q15, #0
+.rept 8
+        vst1.16         {q14-q15}, [r0,:128]!
+.endr
+3:
+.irp i, 0, 4, 8, 12, 16, 20, 24, 28
+        add             r0,  r4,  #(\i)
+        mov             r1,  r5
+        add             r2,  sp,  #(\i*2)
+        bl              idct32_1d_4x32_pass2_quarter_neon
+.endr
+
+        add             sp,  sp,  r7
+        vpop            {q4-q7}
+        pop             {r4-r8,pc}
+endfunc
+
+function idct32x32_half_add_neon
+.irp i, 0, 4, 8, 12
+        add             r0,  sp,  #(\i*64)
+.if \i > 0
+        ldrh_post       r1,  r8,  #2
+        cmp             r3,  r1
+        it              le
+        movle           r1,  #(16 - \i)/2
+        ble             1f
+.endif
+        add             r2,  r6,  #(\i*2)
+        bl              idct32_1d_4x32_pass1_half_neon
+.endr
+        b               3f
+
+1:
+        @ Write zeros to the temp buffer for pass 2
+        vmov.i16        q14, #0
+        vmov.i16        q15, #0
+2:
+        subs            r1,  r1,  #1
+.rept 4
+        vst1.16         {q14-q15}, [r0,:128]!
+.endr
+        bne             2b
+3:
+.irp i, 0, 4, 8, 12, 16, 20, 24, 28
+        add             r0,  r4,  #(\i)
+        mov             r1,  r5
+        add             r2,  sp,  #(\i*2)
+        bl              idct32_1d_4x32_pass2_half_neon
+.endr
+
+        add             sp,  sp,  r7
+        vpop            {q4-q7}
+        pop             {r4-r8,pc}
+endfunc
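
Reviewer note (not part of the patch): the dispatch added above can be modelled in a few lines. The following is a minimal sketch, not the author's code. It assumes that r3 holds the end-of-block (eob) count, as the table name min_eob_idct_idct_32 suggests, and that r8 points at the second entry of that table when the first `ldrh_post` runs; neither setup is visible in these hunks. The names `select_variant` and `pass1_slices` are hypothetical.

```python
# Thresholds copied from the min_eob_idct_idct_32 row visible in the patch.
MIN_EOB = [0, 9, 34, 70, 135, 240, 336, 448]


def select_variant(eob):
    """Mirror the two cmp/ble pairs added to the 32x32 entry point."""
    if eob <= 34:
        return "quarter"
    if eob <= 135:
        return "half"
    return "full"


def pass1_slices(eob):
    """Return (variant, number of 4-column slices transformed in pass 1).

    Slices that are not transformed are zero-filled in the temp buffer.
    Assumption: the threshold checked before slice k is MIN_EOB[k].
    """
    variant = select_variant(eob)
    limit = {"quarter": 2, "half": 4, "full": 8}[variant]
    done = 1  # slice 0 is always transformed
    for k in range(1, limit):
        if eob <= MIN_EOB[k]:
            break  # corresponds to the "ble 1f" that jumps to the zero fill
        done += 1
    return variant, done
```

Under these assumptions, a block with eob = 35 takes the half path and transforms three slices in pass 1, while eob = 9 takes the quarter path and transforms only the first slice.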