From patchwork Fri Mar 25 18:52:53 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ben Avison <bavison@riscosopen.org>
X-Patchwork-Id: 34990
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:ab0:5fda:0:0:0:0:0 with SMTP id g26csp1432587uaj;
        Fri, 25 Mar 2022 11:55:00 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJwnlGcc5N9uMTA71F0idtgSTBhiG45ewDk400EbRLXUgY0bSHgKHOG6kqbS36ugiTi8Dqi0
X-Received: by 2002:a05:6402:254e:b0:418:d759:dc48 with SMTP id
 l14-20020a056402254e00b00418d759dc48mr132164edb.104.1648234500032;
        Fri, 25 Mar 2022 11:55:00 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1648234500; cv=none;
        d=google.com; s=arc-20160816;
        b=K6rloM2Ec/ygmWN/yyBqf4UZJ1m7/EH1mi2ocudS0VNo04d+plRvWrtZJtTcflY/10
         VJLwcxdWOm8uD4UYX7DTk+kHXW6U+js4kHpIQn3JNcBlokP3mHXSEKDMaGpY6ZJQeWZ3
         zzSxeeTCB/3jwHvljMWDSVONtv7B/jbL8XxFSxgHpBodYMi/vRMBECh4bgLBUCJDSk0D
         xAq4UyXTYaL9BILf2pawBamgSHlq3PxuGHm8JErW12BXoJ4qyQd4xKOhQ5IiMY85N+Fg
         xvbK5N0rYiiXdrL+xEmOLV3PRVzWzK1DjQfvz7kIWT234SofhfcZEydfBg2+DOrlLQTc
         tndw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:delivered-to;
        bh=tfMFbt4D13eMAlPuLjYMuFEQlQOPxpavUZU+K6qN7d4=;
        b=sWxprjhVujzlGHqyueVmEx3Sgc5BPmwJ5pR2ZfygPfIeTdHn6dx8qFL6Icksg8knmp
         vefIPeGM7IkTau6BrFNoCcb38SCH09Of8opNH5D3SaF7dcOoQp6/vZ7tJ986ViJt2AXq
         kCJ9WGBXgUoUiCEQ4vvSSgDIPmkSolCgP+iu89eC2TY9NTyF4ciHRWRu9i7hQi6qh040
         FEMPeKz2ZoHg19PsTpJqe1WGG6Y99R2/CvBjc0m9hE6YW4mw1JdBpQGKbkVPZrqGrlXI
         57l3CmS9iignuvRbJFQDWvx1aEUExvVzYmEK/q1/92Yml+wjdkk344CvL0HqdVa5JWhL
         zZ1w==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 d11-20020a1709064c4b00b006df76385e2bsi3244663ejw.715.2022.03.25.11.54.59;
        Fri, 25 Mar 2022 11:55:00 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 3AFC568B263;
	Fri, 25 Mar 2022 20:53:57 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from outmail149075.authsmtp.net (outmail149075.authsmtp.net
 [62.13.149.75])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id A5C6168B24D
 for <ffmpeg-devel@ffmpeg.org>; Fri, 25 Mar 2022 20:53:48 +0200 (EET)
Received: from punt23.authsmtp.com (punt23.authsmtp.com [62.13.128.122])
 by punt15.authsmtp.com. (8.15.2/8.15.2) with ESMTP id 22PIrmxr014745
 for <ffmpeg-devel@ffmpeg.org>; Fri, 25 Mar 2022 18:53:48 GMT
 (envelope-from bavison@riscosopen.org)
Received: from mail-c233.authsmtp.com (mail-c233.authsmtp.com [62.13.128.233])
 by punt23.authsmtp.com. (8.15.2/8.15.2) with ESMTP id 22PIrmZt011370;
 Fri, 25 Mar 2022 18:53:48 GMT (envelope-from bavison@riscosopen.org)
Received: from rpi2021 (237.63.9.51.dyn.plus.net [51.9.63.237])
 (authenticated bits=0)
 by mail.authsmtp.com (8.15.2/8.15.2) with ESMTPSA id 22PIrjRo046593
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Fri, 25 Mar 2022 18:53:46 GMT (envelope-from bavison@riscosopen.org)
Received: by rpi2021 (sSMTP sendmail emulation);
 Fri, 25 Mar 2022 18:53:45 +0000
From: Ben Avison <bavison@riscosopen.org>
To: ffmpeg-devel@ffmpeg.org
Date: Fri, 25 Mar 2022 18:52:53 +0000
Message-Id: <20220325185257.513933-7-bavison@riscosopen.org>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220325185257.513933-1-bavison@riscosopen.org>
References: <20220317185819.466470-1-bavison@riscosopen.org>
 <20220325185257.513933-1-bavison@riscosopen.org>
MIME-Version: 1.0
X-Server-Quench: e700522b-ac6c-11ec-a0f2-84349711df28
X-AuthReport-Spam: If SPAM / abuse - report it at:
 http://www.authsmtp.com/abuse
X-AuthRoute: OCd1YggXA1ZfRRob ESQCJDVBUg4iPRpU DBlFKhFVNl8UURhQ
 KkJXbgASJgdBAnRQ QXkJW1ZWQFx5U2Z2 YQ5WIwBcfENQWQZ0 UktOXVBXFgB3AFID
 BH5nEEMQKAVCfnt5 ZghrX3hbWgovckZ9 R0xWR3BXNmEzdWEe BRNFJgMCch5CehxB
 Y1d+VSdbY21JDRoR IyQTd2hvemwHHWxc XAoKIV8ZBlgAR2x0 bgoHVW51WEcEW20V
 AjsAYlIRFkIWOUYz N1RpV1McLxIIQgNX BF0FGCZFJ0Jp
X-Authentic-SMTP: 61633632303230.1021:7600
X-AuthFastPath: 0 (Was 255)
X-AuthVirus-Status: No virus detected - but ensure you scan with your own
 anti-virus system.
Subject: [FFmpeg-devel] [PATCH 06/10] avcodec/vc1: Arm 32-bit NEON
 deblocking filter fast paths
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Ben Avison <bavison@riscosopen.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: 1oWYBdKyYhCb

checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
version can still outperform the NEON version in specific cases. The balance
between different code paths is stream-dependent, but in practice the best
case happens about 5% of the time, the worst case happens about 40% of the
time, and the complexity of the remaining cases fall somewhere in between.
Therefore, taking the average of the best and worst case timings is
probably a conservative estimate of the degree by which the NEON code
improves performance.

vc1dsp.vc1_h_loop_filter4_bestcase_c: 19.0
vc1dsp.vc1_h_loop_filter4_bestcase_neon: 48.5
vc1dsp.vc1_h_loop_filter4_worstcase_c: 144.7
vc1dsp.vc1_h_loop_filter4_worstcase_neon: 76.2
vc1dsp.vc1_h_loop_filter8_bestcase_c: 41.0
vc1dsp.vc1_h_loop_filter8_bestcase_neon: 75.0
vc1dsp.vc1_h_loop_filter8_worstcase_c: 294.0
vc1dsp.vc1_h_loop_filter8_worstcase_neon: 102.7
vc1dsp.vc1_h_loop_filter16_bestcase_c: 54.7
vc1dsp.vc1_h_loop_filter16_bestcase_neon: 130.0
vc1dsp.vc1_h_loop_filter16_worstcase_c: 569.7
vc1dsp.vc1_h_loop_filter16_worstcase_neon: 186.7
vc1dsp.vc1_v_loop_filter4_bestcase_c: 20.2
vc1dsp.vc1_v_loop_filter4_bestcase_neon: 47.2
vc1dsp.vc1_v_loop_filter4_worstcase_c: 164.2
vc1dsp.vc1_v_loop_filter4_worstcase_neon: 68.5
vc1dsp.vc1_v_loop_filter8_bestcase_c: 43.5
vc1dsp.vc1_v_loop_filter8_bestcase_neon: 55.2
vc1dsp.vc1_v_loop_filter8_worstcase_c: 316.2
vc1dsp.vc1_v_loop_filter8_worstcase_neon: 72.7
vc1dsp.vc1_v_loop_filter16_bestcase_c: 62.2
vc1dsp.vc1_v_loop_filter16_bestcase_neon: 103.7
vc1dsp.vc1_v_loop_filter16_worstcase_c: 646.5
vc1dsp.vc1_v_loop_filter16_worstcase_neon: 110.7

Signed-off-by: Ben Avison <bavison@riscosopen.org>
---
 libavcodec/arm/vc1dsp_init_neon.c |  14 +
 libavcodec/arm/vc1dsp_neon.S      | 643 ++++++++++++++++++++++++++++++
 2 files changed, 657 insertions(+)

diff --git a/libavcodec/arm/vc1dsp_init_neon.c b/libavcodec/arm/vc1dsp_init_neon.c
index 2cca784f5a..f5f5c702d7 100644
--- a/libavcodec/arm/vc1dsp_init_neon.c
+++ b/libavcodec/arm/vc1dsp_init_neon.c
@@ -32,6 +32,13 @@ void ff_vc1_inv_trans_4x8_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *bloc
 void ff_vc1_inv_trans_8x4_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
 void ff_vc1_inv_trans_4x4_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
 
+void ff_vc1_v_loop_filter4_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_h_loop_filter4_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_v_loop_filter8_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_h_loop_filter8_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_v_loop_filter16_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_h_loop_filter16_neon(uint8_t *src, int stride, int pq);
+
 void ff_put_pixels8x8_neon(uint8_t *block, const uint8_t *pixels,
                            ptrdiff_t line_size, int rnd);
 
@@ -92,6 +99,13 @@ av_cold void ff_vc1dsp_init_neon(VC1DSPContext *dsp)
     dsp->vc1_inv_trans_8x4_dc = ff_vc1_inv_trans_8x4_dc_neon;
     dsp->vc1_inv_trans_4x4_dc = ff_vc1_inv_trans_4x4_dc_neon;
 
+    dsp->vc1_v_loop_filter4  = ff_vc1_v_loop_filter4_neon;
+    dsp->vc1_h_loop_filter4  = ff_vc1_h_loop_filter4_neon;
+    dsp->vc1_v_loop_filter8  = ff_vc1_v_loop_filter8_neon;
+    dsp->vc1_h_loop_filter8  = ff_vc1_h_loop_filter8_neon;
+    dsp->vc1_v_loop_filter16 = ff_vc1_v_loop_filter16_neon;
+    dsp->vc1_h_loop_filter16 = ff_vc1_h_loop_filter16_neon;
+
     dsp->put_vc1_mspel_pixels_tab[1][ 0] = ff_put_pixels8x8_neon;
     FN_ASSIGN(1, 0);
     FN_ASSIGN(2, 0);
diff --git a/libavcodec/arm/vc1dsp_neon.S b/libavcodec/arm/vc1dsp_neon.S
index 93f043bf08..a639e81171 100644
--- a/libavcodec/arm/vc1dsp_neon.S
+++ b/libavcodec/arm/vc1dsp_neon.S
@@ -1161,3 +1161,646 @@ function ff_vc1_inv_trans_4x4_dc_neon, export=1
         vst1.32         {d1[1]},  [r0,:32]
         bx              lr
 endfunc
+
+@ VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of vertically-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of lower block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_v_loop_filter4_neon, export=1
+        sub             r3, r0, r1, lsl #2
+        vldr            d0, .Lcoeffs
+        vld1.32         {d1[0]}, [r0], r1       @ P5
+        vld1.32         {d2[0]}, [r3], r1       @ P1
+        vld1.32         {d3[0]}, [r3], r1       @ P2
+        vld1.32         {d4[0]}, [r0], r1       @ P6
+        vld1.32         {d5[0]}, [r3], r1       @ P3
+        vld1.32         {d6[0]}, [r0], r1       @ P7
+        vld1.32         {d7[0]}, [r3]           @ P4
+        vld1.32         {d16[0]}, [r0]          @ P8
+        vshll.u8        q9, d1, #1              @ 2*P5
+        vdup.16         d17, r2                 @ pq
+        vshll.u8        q10, d2, #1             @ 2*P1
+        vmovl.u8        q11, d3                 @ P2
+        vmovl.u8        q1, d4                  @ P6
+        vmovl.u8        q12, d5                 @ P3
+        vmls.i16        d20, d22, d0[1]         @ 2*P1-5*P2
+        vmovl.u8        q11, d6                 @ P7
+        vmls.i16        d18, d2, d0[1]          @ 2*P5-5*P6
+        vshll.u8        q2, d5, #1              @ 2*P3
+        vmovl.u8        q3, d7                  @ P4
+        vmla.i16        d18, d22, d0[1]         @ 2*P5-5*P6+5*P7
+        vmovl.u8        q11, d16                @ P8
+        vmla.u16        d20, d24, d0[1]         @ 2*P1-5*P2+5*P3
+        vmovl.u8        q12, d1                 @ P5
+        vmls.u16        d4, d6, d0[1]           @ 2*P3-5*P4
+        vmls.u16        d18, d22, d0[0]         @ 2*P5-5*P6+5*P7-2*P8
+        vsub.i16        d1, d6, d24             @ P4-P5
+        vmls.i16        d20, d6, d0[0]          @ 2*P1-5*P2+5*P3-2*P4
+        vmla.i16        d4, d24, d0[1]          @ 2*P3-5*P4+5*P5
+        vmls.i16        d4, d2, d0[0]           @ 2*P3-5*P4+5*P5-2*P6
+        vabs.s16        d2, d1
+        vrshr.s16       d3, d18, #3
+        vrshr.s16       d5, d20, #3
+        vshr.s16        d2, d2, #1              @ clip
+        vrshr.s16       d4, d4, #3
+        vabs.s16        d3, d3                  @ a2
+        vshr.s16        d1, d1, #8              @ clip_sign
+        vabs.s16        d5, d5                  @ a1
+        vceq.i16        d7, d2, #0              @ test clip == 0
+        vabs.s16        d16, d4                 @ a0
+        vshr.s16        d4, d4, #8              @ a0_sign
+        vcge.s16        d18, d5, d3             @ test a1 >= a2
+        vcge.s16        d17, d16, d17           @ test a0 >= pq
+        vbsl            d18, d3, d5             @ a3
+        vsub.i16        d1, d1, d4              @ clip_sign - a0_sign
+        vorr            d3, d7, d17             @ test clip == 0 || a0 >= pq
+        vqsub.u16       d4, d16, d18            @ a0 >= a3 ? a0-a3 : 0  (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16        d5, d18, d16            @ test a3 >= a0
+        vmul.i16        d0, d4, d0[1]           @ a0 >= a3 ? 5*(a0-a3) : 0
+        vorr            d4, d3, d5              @ test clip == 0 || a0 >= pq || a3 >= a0
+        vmov.32         r0, d4[1]               @ move to gp reg
+        vshr.u16        d0, d0, #3              @ a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        vcge.s16        d4, d0, d2
+        tst             r0, #1
+        bne             1f                      @ none of the 4 pixel pairs should be updated if this one is not filtered
+        vbsl            d4, d2, d0              @ FFMIN(d, clip)
+        vbic            d0, d4, d3              @ set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub)
+        vmls.i16        d6, d0, d1              @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        vmla.i16        d24, d0, d1             @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        vqmovun.s16     d0, q3
+        vqmovun.s16     d1, q12
+        vst1.32         {d0[0]}, [r3], r1
+        vst1.32         {d1[0]}, [r3]
+1:      bx              lr
+endfunc
+
+@ VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of horizontally-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of right block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_h_loop_filter4_neon, export=1
+        sub             r3, r0, #4              @ where to start reading
+        vldr            d0, .Lcoeffs
+        vld1.32         {d2}, [r3], r1
+        sub             r0, r0, #1              @ where to start writing
+        vld1.32         {d4}, [r3], r1
+        vld1.32         {d3}, [r3], r1
+        vld1.32         {d5}, [r3]
+        vdup.16         d1, r2                  @ pq
+        vtrn.8          q1, q2
+        vtrn.16         d2, d3                  @ P1, P5, P3, P7
+        vtrn.16         d4, d5                  @ P2, P6, P4, P8
+        vshll.u8        q3, d2, #1              @ 2*P1, 2*P5
+        vmovl.u8        q8, d4                  @ P2, P6
+        vmovl.u8        q9, d3                  @ P3, P7
+        vmovl.u8        q2, d5                  @ P4, P8
+        vmls.i16        q3, q8, d0[1]           @ 2*P1-5*P2, 2*P5-5*P6
+        vshll.u8        q10, d3, #1             @ 2*P3, 2*P7
+        vmovl.u8        q1, d2                  @ P1, P5
+        vmla.i16        q3, q9, d0[1]           @ 2*P1-5*P2+5*P3, 2*P5-5*P6+5*P7
+        vmls.i16        q3, q2, d0[0]           @ 2*P1-5*P2+5*P3-2*P4, 2*P5-5*P6+5*P7-2*P8
+        vmov            d2, d3                  @ needs to be in an even-numbered vector for when we come to narrow it later
+        vmls.i16        d20, d4, d0[1]          @ 2*P3-5*P4
+        vmla.i16        d20, d3, d0[1]          @ 2*P3-5*P4+5*P5
+        vsub.i16        d3, d4, d2              @ P4-P5
+        vmls.i16        d20, d17, d0[0]         @ 2*P3-5*P4+5*P5-2*P6
+        vrshr.s16       q3, q3, #3
+        vabs.s16        d5, d3
+        vshr.s16        d3, d3, #8              @ clip_sign
+        vrshr.s16       d16, d20, #3
+        vabs.s16        q3, q3                  @ a1, a2
+        vshr.s16        d5, d5, #1              @ clip
+        vabs.s16        d17, d16                @ a0
+        vceq.i16        d18, d5, #0             @ test clip == 0
+        vshr.s16        d16, d16, #8            @ a0_sign
+        vcge.s16        d19, d6, d7             @ test a1 >= a2
+        vcge.s16        d1, d17, d1             @ test a0 >= pq
+        vsub.i16        d16, d3, d16            @ clip_sign - a0_sign
+        vbsl            d19, d7, d6             @ a3
+        vorr            d1, d18, d1             @ test clip == 0 || a0 >= pq
+        vqsub.u16       d3, d17, d19            @ a0 >= a3 ? a0-a3 : 0  (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16        d6, d19, d17            @ test a3 >= a0    @
+        vmul.i16        d0, d3, d0[1]           @ a0 >= a3 ? 5*(a0-a3) : 0
+        vorr            d3, d1, d6              @ test clip == 0 || a0 >= pq || a3 >= a0
+        vmov.32         r2, d3[1]               @ move to gp reg
+        vshr.u16        d0, d0, #3              @ a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        vcge.s16        d3, d0, d5
+        tst             r2, #1
+        bne             1f                      @ none of the 4 pixel pairs should be updated if this one is not filtered
+        vbsl            d3, d5, d0              @ FFMIN(d, clip)
+        vbic            d0, d3, d1              @ set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub)
+        vmla.i16        d2, d0, d16             @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        vmls.i16        d4, d0, d16             @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        vqmovun.s16     d1, q1
+        vqmovun.s16     d0, q2
+        vst2.8          {d0[0], d1[0]}, [r0], r1
+        vst2.8          {d0[1], d1[1]}, [r0], r1
+        vst2.8          {d0[2], d1[2]}, [r0], r1
+        vst2.8          {d0[3], d1[3]}, [r0]
+1:      bx              lr
+endfunc
+
+@ VC-1 in-loop deblocking filter for 8 pixel pairs at boundary of vertically-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of lower block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_v_loop_filter8_neon, export=1
+        sub             r3, r0, r1, lsl #2
+        vldr            d0, .Lcoeffs
+        vld1.32         {d1}, [r0], r1          @ P5
+        vld1.32         {d2}, [r3], r1          @ P1
+        vld1.32         {d3}, [r3], r1          @ P2
+        vld1.32         {d4}, [r0], r1          @ P6
+        vld1.32         {d5}, [r3], r1          @ P3
+        vld1.32         {d6}, [r0], r1          @ P7
+        vshll.u8        q8, d1, #1              @ 2*P5
+        vshll.u8        q9, d2, #1              @ 2*P1
+        vld1.32         {d7}, [r3]              @ P4
+        vmovl.u8        q1, d3                  @ P2
+        vld1.32         {d20}, [r0]             @ P8
+        vmovl.u8        q11, d4                 @ P6
+        vdup.16         q12, r2                 @ pq
+        vmovl.u8        q13, d5                 @ P3
+        vmls.i16        q9, q1, d0[1]           @ 2*P1-5*P2
+        vmovl.u8        q1, d6                  @ P7
+        vshll.u8        q2, d5, #1              @ 2*P3
+        vmls.i16        q8, q11, d0[1]          @ 2*P5-5*P6
+        vmovl.u8        q3, d7                  @ P4
+        vmovl.u8        q10, d20                @ P8
+        vmla.i16        q8, q1, d0[1]           @ 2*P5-5*P6+5*P7
+        vmovl.u8        q1, d1                  @ P5
+        vmla.i16        q9, q13, d0[1]          @ 2*P1-5*P2+5*P3
+        vsub.i16        q13, q3, q1             @ P4-P5
+        vmls.i16        q2, q3, d0[1]           @ 2*P3-5*P4
+        vmls.i16        q8, q10, d0[0]          @ 2*P5-5*P6+5*P7-2*P8
+        vabs.s16        q10, q13
+        vshr.s16        q13, q13, #8            @ clip_sign
+        vmls.i16        q9, q3, d0[0]           @ 2*P1-5*P2+5*P3-2*P4
+        vshr.s16        q10, q10, #1            @ clip
+        vmla.i16        q2, q1, d0[1]           @ 2*P3-5*P4+5*P5
+        vrshr.s16       q8, q8, #3
+        vmls.i16        q2, q11, d0[0]          @ 2*P3-5*P4+5*P5-2*P6
+        vceq.i16        q11, q10, #0            @ test clip == 0
+        vrshr.s16       q9, q9, #3
+        vabs.s16        q8, q8                  @ a2
+        vabs.s16        q9, q9                  @ a1
+        vrshr.s16       q2, q2, #3
+        vcge.s16        q14, q9, q8             @ test a1 >= a2
+        vabs.s16        q15, q2                 @ a0
+        vshr.s16        q2, q2, #8              @ a0_sign
+        vbsl            q14, q8, q9             @ a3
+        vcge.s16        q8, q15, q12            @ test a0 >= pq
+        vsub.i16        q2, q13, q2             @ clip_sign - a0_sign
+        vqsub.u16       q9, q15, q14            @ a0 >= a3 ? a0-a3 : 0  (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16        q12, q14, q15           @ test a3 >= a0
+        vorr            q8, q11, q8             @ test clip == 0 || a0 >= pq
+        vmul.i16        q0, q9, d0[1]           @ a0 >= a3 ? 5*(a0-a3) : 0
+        vorr            q9, q8, q12             @ test clip == 0 || a0 >= pq || a3 >= a0
+        vshl.i64        q11, q9, #16
+        vmov.32         r0, d18[1]              @ move to gp reg
+        vshr.u16        q0, q0, #3              @ a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        vmov.32         r2, d19[1]
+        vshr.s64        q9, q11, #48
+        vcge.s16        q11, q0, q10
+        vorr            q8, q8, q9
+        and             r0, r0, r2
+        vbsl            q11, q10, q0            @ FFMIN(d, clip)
+        tst             r0, #1
+        bne             1f                      @ none of the 8 pixel pairs should be updated in this case
+        vbic            q0, q11, q8             @ set each d to zero if it should not be filtered
+        vmls.i16        q3, q0, q2              @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        vmla.i16        q1, q0, q2              @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        vqmovun.s16     d0, q3
+        vqmovun.s16     d1, q1
+        vst1.32         {d0}, [r3], r1
+        vst1.32         {d1}, [r3]
+1:      bx              lr
+endfunc
+
+.align  5
+.Lcoeffs:
+.quad   0x00050002
+
+@ VC-1 in-loop deblocking filter for 8 pixel pairs at boundary of horizontally-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of right block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_h_loop_filter8_neon, export=1
+        push            {lr}
+        sub             r3, r0, #4              @ where to start reading
+        vldr            d0, .Lcoeffs
+        vld1.32         {d2}, [r3], r1          @ P1[0], P2[0]...
+        sub             r0, r0, #1              @ where to start writing
+        vld1.32         {d4}, [r3], r1
+        add             r12, r0, r1, lsl #2
+        vld1.32         {d3}, [r3], r1
+        vld1.32         {d5}, [r3], r1
+        vld1.32         {d6}, [r3], r1
+        vld1.32         {d16}, [r3], r1
+        vld1.32         {d7}, [r3], r1
+        vld1.32         {d17}, [r3]
+        vtrn.8          q1, q2                  @ P1[0], P1[1], P3[0]... P1[2], P1[3], P3[2]... P2[0], P2[1], P4[0]... P2[2], P2[3], P4[2]...
+        vdup.16         q9, r2                  @ pq
+        vtrn.16         d2, d3                  @ P1[0], P1[1], P1[2], P1[3], P5[0]... P3[0], P3[1], P3[2], P3[3], P7[0]...
+        vtrn.16         d4, d5                  @ P2[0], P2[1], P2[2], P2[3], P6[0]... P4[0], P4[1], P4[2], P4[3], P8[0]...
+        vtrn.8          q3, q8                  @ P1[4], P1[5], P3[4]... P1[6], P1[7], P3[6]... P2[4], P2[5], P4[4]... P2[6], P2[7], P4[6]...
+        vtrn.16         d6, d7                  @ P1[4], P1[5], P1[6], P1[7], P5[4]... P3[4], P3[5], P3[5], P3[7], P7[4]...
+        vtrn.16         d16, d17                @ P2[4], P2[5], P2[6], P2[7], P6[4]... P4[4], P4[5], P4[6], P4[7], P8[4]...
+        vtrn.32         d2, d6                  @ P1, P5
+        vtrn.32         d4, d16                 @ P2, P6
+        vtrn.32         d3, d7                  @ P3, P7
+        vtrn.32         d5, d17                 @ P4, P8
+        vshll.u8        q10, d2, #1             @ 2*P1
+        vshll.u8        q11, d6, #1             @ 2*P5
+        vmovl.u8        q12, d4                 @ P2
+        vmovl.u8        q13, d16                @ P6
+        vmovl.u8        q14, d3                 @ P3
+        vmls.i16        q10, q12, d0[1]         @ 2*P1-5*P2
+        vmovl.u8        q12, d7                 @ P7
+        vshll.u8        q1, d3, #1              @ 2*P3
+        vmls.i16        q11, q13, d0[1]         @ 2*P5-5*P6
+        vmovl.u8        q2, d5                  @ P4
+        vmovl.u8        q8, d17                 @ P8
+        vmla.i16        q11, q12, d0[1]         @ 2*P5-5*P6+5*P7
+        vmovl.u8        q3, d6                  @ P5
+        vmla.i16        q10, q14, d0[1]         @ 2*P1-5*P2+5*P3
+        vsub.i16        q12, q2, q3             @ P4-P5
+        vmls.i16        q1, q2, d0[1]           @ 2*P3-5*P4
+        vmls.i16        q11, q8, d0[0]          @ 2*P5-5*P6+5*P7-2*P8
+        vabs.s16        q8, q12
+        vshr.s16        q12, q12, #8            @ clip_sign
+        vmls.i16        q10, q2, d0[0]          @ 2*P1-5*P2+5*P3-2*P4
+        vshr.s16        q8, q8, #1              @ clip
+        vmla.i16        q1, q3, d0[1]           @ 2*P3-5*P4+5*P5
+        vrshr.s16       q11, q11, #3
+        vmls.i16        q1, q13, d0[0]          @ 2*P3-5*P4+5*P5-2*P6
+        vceq.i16        q13, q8, #0             @ test clip == 0
+        vrshr.s16       q10, q10, #3
+        vabs.s16        q11, q11                @ a2
+        vabs.s16        q10, q10                @ a1
+        vrshr.s16       q1, q1, #3
+        vcge.s16        q14, q10, q11           @ test a1 >= a2
+        vabs.s16        q15, q1                 @ a0
+        vshr.s16        q1, q1, #8              @ a0_sign
+        vbsl            q14, q11, q10           @ a3
+        vcge.s16        q9, q15, q9             @ test a0 >= pq
+        vsub.i16        q1, q12, q1             @ clip_sign - a0_sign
+        vqsub.u16       q10, q15, q14           @ a0 >= a3 ? a0-a3 : 0  (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16        q11, q14, q15           @ test a3 >= a0
+        vorr            q9, q13, q9             @ test clip == 0 || a0 >= pq
+        vmul.i16        q0, q10, d0[1]          @ a0 >= a3 ? 5*(a0-a3) : 0
+        vorr            q10, q9, q11            @ test clip == 0 || a0 >= pq || a3 >= a0
+        vmov.32         r2, d20[1]              @ move to gp reg
+        vshr.u16        q0, q0, #3              @ a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        vmov.32         r3, d21[1]
+        vcge.s16        q10, q0, q8
+        and             r14, r2, r3
+        vbsl            q10, q8, q0             @ FFMIN(d, clip)
+        tst             r14, #1
+        bne             2f                      @ none of the 8 pixel pairs should be updated in this case
+        vbic            q0, q10, q9             @ set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub)
+        vmla.i16        q3, q0, q1              @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        vmls.i16        q2, q0, q1              @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        vqmovun.s16     d1, q3
+        vqmovun.s16     d0, q2
+        tst             r2, #1
+        bne             1f                      @ none of the first 4 pixel pairs should be updated if so
+        vst2.8          {d0[0], d1[0]}, [r0], r1
+        vst2.8          {d0[1], d1[1]}, [r0], r1
+        vst2.8          {d0[2], d1[2]}, [r0], r1
+        vst2.8          {d0[3], d1[3]}, [r0]
+1:      tst             r3, #1
+        bne             2f                      @ none of the second 4 pixel pairs should be updated if so
+        vst2.8          {d0[4], d1[4]}, [r12], r1
+        vst2.8          {d0[5], d1[5]}, [r12], r1
+        vst2.8          {d0[6], d1[6]}, [r12], r1
+        vst2.8          {d0[7], d1[7]}, [r12]
+2:      pop             {pc}
+endfunc
+
+@ VC-1 in-loop deblocking filter for 16 pixel pairs at boundary of vertically-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of lower block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_v_loop_filter16_neon, export=1
+        vpush           {d8-d15}
+        sub             r3, r0, r1, lsl #2
+        vldr            d0, .Lcoeffs
+        vld1.64         {q1}, [r0], r1          @ P5
+        vld1.64         {q2}, [r3], r1          @ P1
+        vld1.64         {q3}, [r3], r1          @ P2
+        vld1.64         {q4}, [r0], r1          @ P6
+        vld1.64         {q5}, [r3], r1          @ P3
+        vld1.64         {q6}, [r0], r1          @ P7
+        vshll.u8        q7, d2, #1              @ 2*P5[0..7]
+        vshll.u8        q8, d4, #1              @ 2*P1[0..7]
+        vld1.64         {q9}, [r3]              @ P4
+        vmovl.u8        q10, d6                 @ P2[0..7]
+        vld1.64         {q11}, [r0]             @ P8
+        vmovl.u8        q12, d8                 @ P6[0..7]
+        vdup.16         q13, r2                 @ pq
+        vshll.u8        q2, d5, #1              @ 2*P1[8..15]
+        vmls.i16        q8, q10, d0[1]          @ 2*P1[0..7]-5*P2[0..7]
+        vshll.u8        q10, d3, #1             @ 2*P5[8..15]
+        vmovl.u8        q3, d7                  @ P2[8..15]
+        vmls.i16        q7, q12, d0[1]          @ 2*P5[0..7]-5*P6[0..7]
+        vmovl.u8        q4, d9                  @ P6[8..15]
+        vmovl.u8        q14, d10                @ P3[0..7]
+        vmovl.u8        q15, d12                @ P7[0..7]
+        vmls.i16        q2, q3, d0[1]           @ 2*P1[8..15]-5*P2[8..15]
+        vshll.u8        q3, d10, #1             @ 2*P3[0..7]
+        vmls.i16        q10, q4, d0[1]          @ 2*P5[8..15]-5*P6[8..15]
+        vmovl.u8        q6, d13                 @ P7[8..15]
+        vmla.i16        q8, q14, d0[1]          @ 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]
+        vmovl.u8        q14, d18                @ P4[0..7]
+        vmovl.u8        q9, d19                 @ P4[8..15]
+        vmla.i16        q7, q15, d0[1]          @ 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]
+        vmovl.u8        q15, d11                @ P3[8..15]
+        vshll.u8        q5, d11, #1             @ 2*P3[8..15]
+        vmls.i16        q3, q14, d0[1]          @ 2*P3[0..7]-5*P4[0..7]
+        vmla.i16        q2, q15, d0[1]          @ 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]
+        vmovl.u8        q15, d22                @ P8[0..7]
+        vmovl.u8        q11, d23                @ P8[8..15]
+        vmla.i16        q10, q6, d0[1]          @ 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]
+        vmovl.u8        q6, d2                  @ P5[0..7]
+        vmovl.u8        q1, d3                  @ P5[8..15]
+        vmls.i16        q5, q9, d0[1]           @ 2*P3[8..15]-5*P4[8..15]
+        vmls.i16        q8, q14, d0[0]          @ 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]-2*P4[0..7]
+        vmls.i16        q7, q15, d0[0]          @ 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]-2*P8[0..7]
+        vsub.i16        q15, q14, q6            @ P4[0..7]-P5[0..7]
+        vmla.i16        q3, q6, d0[1]           @ 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]
+        vrshr.s16       q8, q8, #3
+        vmls.i16        q2, q9, d0[0]           @ 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]-2*P4[8..15]
+        vrshr.s16       q7, q7, #3
+        vmls.i16        q10, q11, d0[0]         @ 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]-2*P8[8..15]
+        vabs.s16        q11, q15
+        vabs.s16        q8, q8                  @ a1[0..7]
+        vmla.i16        q5, q1, d0[1]           @ 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]
+        vshr.s16        q15, q15, #8            @ clip_sign[0..7]
+        vrshr.s16       q2, q2, #3
+        vmls.i16        q3, q12, d0[0]          @ 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]-2*P6[0..7]
+        vabs.s16        q7, q7                  @ a2[0..7]
+        vrshr.s16       q10, q10, #3
+        vsub.i16        q12, q9, q1             @ P4[8..15]-P5[8..15]
+        vshr.s16        q11, q11, #1            @ clip[0..7]
+        vmls.i16        q5, q4, d0[0]           @ 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]-2*P6[8..15]
+        vcge.s16        q4, q8, q7              @ test a1[0..7] >= a2[0..7]
+        vabs.s16        q2, q2                  @ a1[8..15]
+        vrshr.s16       q3, q3, #3
+        vabs.s16        q10, q10                @ a2[8..15]
+        vbsl            q4, q7, q8              @ a3[0..7]
+        vabs.s16        q7, q12
+        vshr.s16        q8, q12, #8             @ clip_sign[8..15]
+        vrshr.s16       q5, q5, #3
+        vcge.s16        q12, q2, q10            @ test a1[8..15] >= a2[8.15]
+        vshr.s16        q7, q7, #1              @ clip[8..15]
+        vbsl            q12, q10, q2            @ a3[8..15]
+        vabs.s16        q2, q3                  @ a0[0..7]
+        vceq.i16        q10, q11, #0            @ test clip[0..7] == 0
+        vshr.s16        q3, q3, #8              @ a0_sign[0..7]
+        vsub.i16        q3, q15, q3             @ clip_sign[0..7] - a0_sign[0..7]
+        vcge.s16        q15, q2, q13            @ test a0[0..7] >= pq
+        vorr            q10, q10, q15           @ test clip[0..7] == 0 || a0[0..7] >= pq
+        vqsub.u16       q15, q2, q4             @ a0[0..7] >= a3[0..7] ? a0[0..7]-a3[0..7] : 0  (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16        q2, q4, q2              @ test a3[0..7] >= a0[0..7]
+        vabs.s16        q4, q5                  @ a0[8..15]
+        vshr.s16        q5, q5, #8              @ a0_sign[8..15]
+        vmul.i16        q15, q15, d0[1]         @ a0[0..7] >= a3[0..7] ? 5*(a0[0..7]-a3[0..7]) : 0
+        vcge.s16        q13, q4, q13            @ test a0[8..15] >= pq
+        vorr            q2, q10, q2             @ test clip[0..7] == 0 || a0[0..7] >= pq || a3[0..7] >= a0[0..7]
+        vsub.i16        q5, q8, q5              @ clip_sign[8..15] - a0_sign[8..15]
+        vceq.i16        q8, q7, #0              @ test clip[8..15] == 0
+        vshr.u16        q15, q15, #3            @ a0[0..7] >= a3[0..7] ? (5*(a0[0..7]-a3[0..7]))>>3 : 0
+        vmov.32         r0, d4[1]               @ move to gp reg
+        vorr            q8, q8, q13             @ test clip[8..15] == 0 || a0[8..15] >= pq
+        vqsub.u16       q13, q4, q12            @ a0[8..15] >= a3[8..15] ? a0[8..15]-a3[8..15] : 0  (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vmov.32         r2, d5[1]
+        vcge.s16        q4, q12, q4             @ test a3[8..15] >= a0[8..15]
+        vshl.i64        q2, q2, #16
+        vcge.s16        q12, q15, q11
+        vmul.i16        q0, q13, d0[1]          @ a0[8..15] >= a3[8..15] ? 5*(a0[8..15]-a3[8..15]) : 0
+        vorr            q4, q8, q4              @ test clip[8..15] == 0 || a0[8..15] >= pq || a3[8..15] >= a0[8..15]
+        vshr.s64        q2, q2, #48
+        and             r0, r0, r2
+        vbsl            q12, q11, q15           @ FFMIN(d[0..7], clip[0..7])
+        vshl.i64        q11, q4, #16
+        vmov.32         r2, d8[1]
+        vshr.u16        q0, q0, #3              @ a0[8..15] >= a3[8..15] ? (5*(a0[8..15]-a3[8..15]))>>3 : 0
+        vorr            q2, q10, q2
+        vmov.32         r12, d9[1]
+        vshr.s64        q4, q11, #48
+        vcge.s16        q10, q0, q7
+        vbic            q2, q12, q2             @ set each d[0..7] to zero if it should not be filtered because clip[0..7] == 0 || a0[0..7] >= pq (a3 > a0 case already zeroed by saturating sub)
+        vorr            q4, q8, q4
+        and             r2, r2, r12
+        vbsl            q10, q7, q0             @ FFMIN(d[8..15], clip[8..15])
+        vmls.i16        q14, q2, q3             @ invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P4[0..7]
+        and             r0, r0, r2
+        vbic            q0, q10, q4             @ set each d[8..15] to zero if it should not be filtered because clip[8..15] == 0 || a0[8..15] >= pq (a3 > a0 case already zeroed by saturating sub)
+        tst             r0, #1
+        bne             1f                      @ none of the 16 pixel pairs should be updated in this case
+        vmla.i16        q6, q2, q3              @ invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P5[0..7]
+        vmls.i16        q9, q0, q5              @ invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P4[8..15]
+        vqmovun.s16     d4, q14
+        vmla.i16        q1, q0, q5              @ invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P5[8..15]
+        vqmovun.s16     d0, q6
+        vqmovun.s16     d5, q9
+        vqmovun.s16     d1, q1
+        vst1.64         {q2}, [r3], r1
+        vst1.64         {q0}, [r3]
+1:      vpop            {d8-d15}
+        bx              lr
+endfunc
+
+@ VC-1 in-loop deblocking filter for 16 pixel pairs at boundary of horizontally-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of right block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_h_loop_filter16_neon, export=1
+        push            {r4-r6,lr}
+        vpush           {d8-d15}
+        sub             r3, r0, #4              @ where to start reading
+        vldr            d0, .Lcoeffs
+        vld1.32         {d2}, [r3], r1          @ P1[0], P2[0]...
+        sub             r0, r0, #1              @ where to start writing
+        vld1.32         {d3}, [r3], r1
+        add             r4, r0, r1, lsl #2
+        vld1.32         {d10}, [r3], r1
+        vld1.32         {d11}, [r3], r1
+        vld1.32         {d16}, [r3], r1
+        vld1.32         {d4}, [r3], r1
+        vld1.32         {d8}, [r3], r1
+        vtrn.8          d2, d3                  @ P1[0], P1[1], P3[0]... P2[0], P2[1], P4[0]...
+        vld1.32         {d14}, [r3], r1
+        vld1.32         {d5}, [r3], r1
+        vtrn.8          d10, d11                @ P1[2], P1[3], P3[2]... P2[2], P2[3], P4[2]...
+        vld1.32         {d6}, [r3], r1
+        vld1.32         {d12}, [r3], r1
+        vtrn.8          d16, d4                 @ P1[4], P1[5], P3[4]... P2[4], P2[5], P4[4]...
+        vld1.32         {d13}, [r3], r1
+        vtrn.16         d2, d10                 @ P1[0], P1[1], P1[2], P1[3], P5[0]... P3[0], P3[1], P3[2], P3[3], P7[0]...
+        vld1.32         {d1}, [r3], r1
+        vtrn.8          d8, d14                 @ P1[6], P1[7], P3[6]... P2[6], P2[7], P4[6]...
+        vld1.32         {d7}, [r3], r1
+        vtrn.16         d3, d11                 @ P2[0], P2[1], P2[2], P2[3], P6[0]... P4[0], P4[1], P4[2], P4[3], P8[0]...
+        vld1.32         {d9}, [r3], r1
+        vtrn.8          d5, d6                  @ P1[8], P1[9], P3[8]... P2[8], P2[9], P4[8]...
+        vld1.32         {d15}, [r3]
+        vtrn.16         d16, d8                 @ P1[4], P1[5], P1[6], P1[7], P5[4]... P3[4], P3[5], P3[6], P3[7], P7[4]...
+        vtrn.16         d4, d14                 @ P2[4], P2[5], P2[6], P2[7], P6[4]... P4[4], P4[5], P4[6], P4[7], P8[4]...
+        vtrn.8          d12, d13                @ P1[10], P1[11], P3[10]... P2[10], P2[11], P4[10]...
+        vdup.16         q9, r2                  @ pq
+        vtrn.8          d1, d7                  @ P1[12], P1[13], P3[12]... P2[12], P2[13], P4[12]...
+        vtrn.32         d2, d16                 @ P1[0..7], P5[0..7]
+        vtrn.16         d5, d12                 @ P1[8], P1[7], P1[10], P1[11], P5[8]... P3[8], P3[9], P3[10], P3[11], P7[8]...
+        vtrn.16         d6, d13                 @ P2[8], P2[7], P2[10], P2[11], P6[8]... P4[8], P4[9], P4[10], P4[11], P8[8]...
+        vtrn.8          d9, d15                 @ P1[14], P1[15], P3[14]... P2[14], P2[15], P4[14]...
+        vtrn.32         d3, d4                  @ P2[0..7], P6[0..7]
+        vshll.u8        q10, d2, #1             @ 2*P1[0..7]
+        vtrn.32         d10, d8                 @ P3[0..7], P7[0..7]
+        vshll.u8        q11, d16, #1            @ 2*P5[0..7]
+        vtrn.32         d11, d14                @ P4[0..7], P8[0..7]
+        vtrn.16         d1, d9                  @ P1[12], P1[13], P1[14], P1[15], P5[12]... P3[12], P3[13], P3[14], P3[15], P7[12]...
+        vtrn.16         d7, d15                 @ P2[12], P2[13], P2[14], P2[15], P6[12]... P4[12], P4[13], P4[14], P4[15], P8[12]...
+        vmovl.u8        q1, d3                  @ P2[0..7]
+        vmovl.u8        q12, d4                 @ P6[0..7]
+        vtrn.32         d5, d1                  @ P1[8..15], P5[8..15]
+        vtrn.32         d6, d7                  @ P2[8..15], P6[8..15]
+        vtrn.32         d12, d9                 @ P3[8..15], P7[8..15]
+        vtrn.32         d13, d15                @ P4[8..15], P8[8..15]
+        vmls.i16        q10, q1, d0[1]          @ 2*P1[0..7]-5*P2[0..7]
+        vmovl.u8        q1, d10                 @ P3[0..7]
+        vshll.u8        q2, d5, #1              @ 2*P1[8..15]
+        vshll.u8        q13, d1, #1             @ 2*P5[8..15]
+        vmls.i16        q11, q12, d0[1]         @ 2*P5[0..7]-5*P6[0..7]
+        vmovl.u8        q14, d6                 @ P2[8..15]
+        vmovl.u8        q3, d7                  @ P6[8..15]
+        vmovl.u8        q15, d8                 @ P7[0..7]
+        vmla.i16        q10, q1, d0[1]          @ 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]
+        vmovl.u8        q1, d12                 @ P3[8..15]
+        vmls.i16        q2, q14, d0[1]          @ 2*P1[8..15]-5*P2[8..15]
+        vmovl.u8        q4, d9                  @ P7[8..15]
+        vshll.u8        q14, d10, #1            @ 2*P3[0..7]
+        vmls.i16        q13, q3, d0[1]          @ 2*P5[8..15]-5*P6[8..15]
+        vmovl.u8        q5, d11                 @ P4[0..7]
+        vmla.i16        q11, q15, d0[1]         @ 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]
+        vshll.u8        q15, d12, #1            @ 2*P3[8..15]
+        vmovl.u8        q6, d13                 @ P4[8..15]
+        vmla.i16        q2, q1, d0[1]           @ 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]
+        vmovl.u8        q1, d14                 @ P8[0..7]
+        vmovl.u8        q7, d15                 @ P8[8..15]
+        vmla.i16        q13, q4, d0[1]          @ 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]
+        vmovl.u8        q4, d16                 @ P5[0..7]
+        vmovl.u8        q8, d1                  @ P5[8..15]
+        vmls.i16        q14, q5, d0[1]          @ 2*P3[0..7]-5*P4[0..7]
+        vmls.i16        q15, q6, d0[1]          @ 2*P3[8..15]-5*P4[8..15]
+        vmls.i16        q10, q5, d0[0]          @ 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]-2*P4[0..7]
+        vmls.i16        q11, q1, d0[0]          @ 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]-2*P8[0..7]
+        vsub.i16        q1, q5, q4              @ P4[0..7]-P5[0..7]
+        vmls.i16        q2, q6, d0[0]           @ 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]-2*P4[8..15]
+        vrshr.s16       q10, q10, #3
+        vmls.i16        q13, q7, d0[0]          @ 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]-2*P8[8..15]
+        vsub.i16        q7, q6, q8              @ P4[8..15]-P5[8..15]
+        vrshr.s16       q11, q11, #3
+        vmla.s16        q14, q4, d0[1]          @ 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]
+        vrshr.s16       q2, q2, #3
+        vmla.i16        q15, q8, d0[1]          @ 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]
+        vabs.s16        q10, q10                @ a1[0..7]
+        vrshr.s16       q13, q13, #3
+        vmls.i16        q15, q3, d0[0]          @ 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]-2*P6[8..15]
+        vabs.s16        q3, q11                 @ a2[0..7]
+        vabs.s16        q2, q2                  @ a1[8..15]
+        vmls.i16        q14, q12, d0[0]         @ 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]-2*P6[0..7]
+        vabs.s16        q11, q1
+        vabs.s16        q12, q13                @ a2[8..15]
+        vcge.s16        q13, q10, q3            @ test a1[0..7] >= a2[0..7]
+        vshr.s16        q1, q1, #8              @ clip_sign[0..7]
+        vrshr.s16       q15, q15, #3
+        vshr.s16        q11, q11, #1            @ clip[0..7]
+        vrshr.s16       q14, q14, #3
+        vbsl            q13, q3, q10            @ a3[0..7]
+        vcge.s16        q3, q2, q12             @ test a1[8..15] >= a2[8.15]
+        vabs.s16        q10, q15                @ a0[8..15]
+        vshr.s16        q15, q15, #8            @ a0_sign[8..15]
+        vbsl            q3, q12, q2             @ a3[8..15]
+        vabs.s16        q2, q14                 @ a0[0..7]
+        vabs.s16        q12, q7
+        vshr.s16        q7, q7, #8              @ clip_sign[8..15]
+        vshr.s16        q14, q14, #8            @ a0_sign[0..7]
+        vshr.s16        q12, q12, #1            @ clip[8..15]
+        vsub.i16        q7, q7, q15             @ clip_sign[8..15] - a0_sign[8..15]
+        vqsub.u16       q15, q10, q3            @ a0[8..15] >= a3[8..15] ? a0[8..15]-a3[8..15] : 0  (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16        q3, q3, q10             @ test a3[8..15] >= a0[8..15]
+        vcge.s16        q10, q10, q9            @ test a0[8..15] >= pq
+        vcge.s16        q9, q2, q9              @ test a0[0..7] >= pq
+        vsub.i16        q1, q1, q14             @ clip_sign[0..7] - a0_sign[0..7]
+        vqsub.u16       q14, q2, q13            @ a0[0..7] >= a3[0..7] ? a0[0..7]-a3[0..7] : 0  (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16        q2, q13, q2             @ test a3[0..7] >= a0[0..7]
+        vmul.i16        q13, q15, d0[1]         @ a0[8..15] >= a3[8..15] ? 5*(a0[8..15]-a3[8..15]) : 0
+        vceq.i16        q15, q11, #0            @ test clip[0..7] == 0
+        vmul.i16        q0, q14, d0[1]          @ a0[0..7] >= a3[0..7] ? 5*(a0[0..7]-a3[0..7]) : 0
+        vorr            q9, q15, q9             @ test clip[0..7] == 0 || a0[0..7] >= pq
+        vceq.i16        q14, q12, #0            @ test clip[8..15] == 0
+        vshr.u16        q13, q13, #3            @ a0[8..15] >= a3[8..15] ? (5*(a0[8..15]-a3[8..15]))>>3 : 0
+        vorr            q2, q9, q2              @ test clip[0..7] == 0 || a0[0..7] >= pq || a3[0..7] >= a0[0..7]
+        vshr.u16        q0, q0, #3              @ a0[0..7] >= a3[0..7] ? (5*(a0[0..7]-a3[0..7]))>>3 : 0
+        vorr            q10, q14, q10           @ test clip[8..15] == 0 || a0[8..15] >= pq
+        vcge.s16        q14, q13, q12
+        vmov.32         r2, d4[1]               @ move to gp reg
+        vorr            q3, q10, q3             @ test clip[8..15] == 0 || a0[8..15] >= pq || a3[8..15] >= a0[8..15]
+        vmov.32         r3, d5[1]
+        vcge.s16        q2, q0, q11
+        vbsl            q14, q12, q13           @ FFMIN(d[8..15], clip[8..15])
+        vbsl            q2, q11, q0             @ FFMIN(d[0..7], clip[0..7])
+        vmov.32         r5, d6[1]
+        vbic            q0, q14, q10            @ set each d[8..15] to zero if it should not be filtered because clip[8..15] == 0 || a0[8..15] >= pq (a3 > a0 case already zeroed by saturating sub)
+        vmov.32         r6, d7[1]
+        and             r12, r2, r3
+        vbic            q2, q2, q9              @ set each d[0..7] to zero if it should not be filtered because clip[0..7] == 0 || a0[0..7] >= pq (a3 > a0 case already zeroed by saturating sub)
+        vmls.i16        q6, q0, q7              @ invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P4
+        vmls.i16        q5, q2, q1              @ invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P4
+        and             r14, r5, r6
+        vmla.i16        q4, q2, q1              @ invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P5
+        and             r12, r12, r14
+        vqmovun.s16     d4, q6
+        vmla.i16        q8, q0, q7              @ invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P5
+        tst             r12, #1
+        bne             4f                      @ none of the 16 pixel pairs should be updated in this case
+        vqmovun.s16     d2, q5
+        vqmovun.s16     d3, q4
+        vqmovun.s16     d5, q8
+        tst             r2, #1
+        bne             1f
+        vst2.8          {d2[0], d3[0]}, [r0], r1
+        vst2.8          {d2[1], d3[1]}, [r0], r1
+        vst2.8          {d2[2], d3[2]}, [r0], r1
+        vst2.8          {d2[3], d3[3]}, [r0]
+1:      add             r0, r4, r1, lsl #2
+        tst             r3, #1
+        bne             2f
+        vst2.8          {d2[4], d3[4]}, [r4], r1
+        vst2.8          {d2[5], d3[5]}, [r4], r1
+        vst2.8          {d2[6], d3[6]}, [r4], r1
+        vst2.8          {d2[7], d3[7]}, [r4]
+2:      add             r4, r0, r1, lsl #2
+        tst             r5, #1
+        bne             3f
+        vst2.8          {d4[0], d5[0]}, [r0], r1
+        vst2.8          {d4[1], d5[1]}, [r0], r1
+        vst2.8          {d4[2], d5[2]}, [r0], r1
+        vst2.8          {d4[3], d5[3]}, [r0]
+3:      tst             r6, #1
+        bne             4f
+        vst2.8          {d4[4], d5[4]}, [r4], r1
+        vst2.8          {d4[5], d5[5]}, [r4], r1
+        vst2.8          {d4[6], d5[6]}, [r4], r1
+        vst2.8          {d4[7], d5[7]}, [r4]
+4:      vpop            {d8-d15}
+        pop             {r4-r6,pc}
+endfunc