From patchwork Thu Mar 31 17:23:50 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ben Avison <bavison@riscosopen.org>
X-Patchwork-Id: 35113
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:c05:b0:7a:e998:b410 with SMTP id bw5csp230414pzb;
        Thu, 31 Mar 2022 10:25:56 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJwYBQmO338ysucTpHDo4tpoEhFQr88fVSXLp+o6eDxoSGoif7et1dP3+dF/2JCa3cqx4GxP
X-Received: by 2002:a17:906:4fd2:b0:6e0:5ce7:d7c7 with SMTP id
 i18-20020a1709064fd200b006e05ce7d7c7mr5909608ejw.113.1648747556621;
        Thu, 31 Mar 2022 10:25:56 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1648747556; cv=none;
        d=google.com; s=arc-20160816;
        b=Acys+q41sXFckfiFs5oMJhgGOVcNZ8q3E8uHUSG3aDXJpk1QjUNTy4uG+WpFgNTktt
         zqqGFsEuO5d1gvNKRQrXnrF5PPX05ch0s2lHzaTyPO+lhYV4cGFHv6G3Ij2kSwtJ/V7n
         NqYXb5rs75AHW1ZtWhJKQNlienkIzoEsW1w7fSrYANH/KLlUnzCHfEbgjoapziVytP4y
         4XFVsJyMz8/Vbds+OOIsRXt5RYVxV/FMU4kmgLTb95S1DZ3zk0mJLDBh1pNIF7ktOJv8
         lbO95itTZozHxON2/f0cTBXYtNRY0aNpoMaeNWEnLXrSAAru/ip+PPzZfjFCqxSbCSya
         YkLw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:delivered-to;
        bh=kD5jIeQ2dkTD6nWVGxIkFSHW8SCcc7f8CYfpdAql4Lc=;
        b=A8NZ67FC1vjrFDwBFMU2XACgfidOdxiR9AMf1U/ykvmK6EAlH5Gpz77dSEL83i8qZv
         PWN1P1qZU1dg5rAUsc5chssKItPWLGxok00EyVSiynd7ZCYB6hus8BotIrSYd3HClLrT
         uSvVjtues7usABnUgOu31RQUyPSioVtTmlCIoLy916IQFmSJnXv1dwpU9WPx3Ac9uBd+
         gllYHH5w1kCBJp24cyfywX0GW1eNXt4PvsoF/+h9hI9xhS/HIYFpF8WAgjalgD70tra7
         JUaWneI57NBTUQ7bElvoG8lvV0jQmfDioAQlt8yuzW31blxge7mXWj0BJderzVAp4Zv/
         YmCQ==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 s10-20020a170906284a00b006df76385c39si114799ejc.217.2022.03.31.10.25.55;
        Thu, 31 Mar 2022 10:25:56 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 4C5BC68B2C4;
	Thu, 31 Mar 2022 20:24:25 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from outmail149078.authsmtp.net (outmail149078.authsmtp.net
 [62.13.149.78])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 093D768B239
 for <ffmpeg-devel@ffmpeg.org>; Thu, 31 Mar 2022 20:24:20 +0300 (EEST)
Received: from mail-c233.authsmtp.com (mail-c233.authsmtp.com [62.13.128.233])
 by punt15.authsmtp.com. (8.15.2/8.15.2) with ESMTP id 22VHOKT2019939;
 Thu, 31 Mar 2022 18:24:20 +0100 (BST)
 (envelope-from bavison@riscosopen.org)
Received: from rpi2021 (237.63.9.51.dyn.plus.net [51.9.63.237])
 (authenticated bits=0)
 by mail.authsmtp.com (8.15.2/8.15.2) with ESMTPSA id 22VHOIZ0062625
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Thu, 31 Mar 2022 18:24:19 +0100 (BST)
 (envelope-from bavison@riscosopen.org)
Received: by rpi2021 (sSMTP sendmail emulation);
 Thu, 31 Mar 2022 18:24:18 +0100
From: Ben Avison <bavison@riscosopen.org>
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 31 Mar 2022 18:23:50 +0100
Message-Id: <20220331172351.550818-10-bavison@riscosopen.org>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220331172351.550818-1-bavison@riscosopen.org>
References: <20220331172351.550818-1-bavison@riscosopen.org>
MIME-Version: 1.0
X-Server-Quench: 666a4768-b117-11ec-a0f2-84349711df28
X-AuthReport-Spam: If SPAM / abuse - report it at:
 http://www.authsmtp.com/abuse
X-AuthRoute: OCd1YggXA1ZfRRob ESQCJDVBUg4iPRpU DBlFKhFVNl8UURhQ
 KkJXbgASJgZFAnRQ QXkJW1ZWQFx5U2Fx YQhRIwBcfENQWQZ0 UktOXVBXFgB3AFID
 BHhmLWAYdwVAenhy YAhgWnlcWAp8c0As RklSHXBUZGZndWEe BRNFJgMCch5CehxB
 Y1d+VSdbY21JDRoR IyQTdy5qdW0Ob30N d0kEM1kVTUsAWSA3 HkJKNC8qVRNZAi8y
 M1QAB3k6VFsXP145 OEMsEVwRKANaEgRC HykA
X-Authentic-SMTP: 61633632303230.1021:7600
X-AuthFastPath: 0 (Was 255)
X-AuthSMTP-Origin: 51.9.63.237/2525
X-AuthVirus-Status: No virus detected - but ensure you scan with your own
 anti-virus system.
Subject: [FFmpeg-devel] [PATCH v3 09/10] avcodec/vc1: Arm 64-bit NEON
 unescape fast path
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Ben Avison <bavison@riscosopen.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: ovcq/7C80sWi

checkasm benchmarks on a 1.5 GHz Cortex-A72 are as follows (the NEON
version is roughly 5.5x faster than the C version).

vc1dsp.vc1_unescape_buffer_c: 655617.7
vc1dsp.vc1_unescape_buffer_neon: 118237.0

Signed-off-by: Ben Avison <bavison@riscosopen.org>
---
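Note for reviewers (not part of the commit message): the sketch below is a
plain scalar illustration of the escape test that both the C wrapper and the
NEON helper in this patch apply, namely a little-endian 32-bit word with the
low two bits of its top byte masked off, compared against 0x00030000. This
matches the byte sequence 00 00 03 followed by a byte in the range 00..03.
The names read_le32, is_escape and unescape_ref are hypothetical and exist
only for this illustration; read_le32 stands in for AV_RL32. It is a
reference for reasoning about expected output, not the code being submitted.

```c
#include <stdint.h>

/* Hypothetical stand-in for AV_RL32: little-endian 32-bit read, built
 * byte by byte so it is independent of host endianness and alignment. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Non-zero if p points at 00 00 03 followed by a byte in 00..03. */
static int is_escape(const uint8_t *p)
{
    return (read_le32(p) & ~0x03000000u) == 0x00030000u;
}

/* Scalar reference: copy src to dst, dropping the 03 byte of every
 * escape sequence. Returns the number of bytes written to dst. */
static int unescape_ref(const uint8_t *src, int size, uint8_t *dst)
{
    int dsize = 0;
    while (size >= 4) {
        if (is_escape(src)) {
            dst[dsize++] = *src++;  /* keep first 00 */
            dst[dsize++] = *src++;  /* keep second 00 */
            ++src;                  /* drop the 03 */
            size -= 3;
        } else {
            dst[dsize++] = *src++;
            --size;
        }
    }
    /* Fewer than 4 bytes left: no escape can start here, copy verbatim. */
    while (size-- > 0)
        dst[dsize++] = *src++;
    return dsize;
}
```

For example, the input 12 00 00 03 01 34 00 00 03 04 loses only the first 03,
because the second 00 00 03 is followed by 04, which is outside 00..03.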
 libavcodec/aarch64/vc1dsp_init_aarch64.c |  61 ++++++++
 libavcodec/aarch64/vc1dsp_neon.S         | 176 +++++++++++++++++++++++
 2 files changed, 237 insertions(+)

diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b/libavcodec/aarch64/vc1dsp_init_aarch64.c
index e0eb52dd63..a7976fd596 100644
--- a/libavcodec/aarch64/vc1dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vc1dsp_init_aarch64.c
@@ -21,6 +21,7 @@
 #include "libavutil/attributes.h"
 #include "libavutil/cpu.h"
 #include "libavutil/aarch64/cpu.h"
+#include "libavutil/intreadwrite.h"
 #include "libavcodec/vc1dsp.h"
 
 #include "config.h"
@@ -51,6 +52,64 @@ void ff_put_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
 void ff_avg_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
                                 int h, int x, int y);
 
+int ff_vc1_unescape_buffer_helper_neon(const uint8_t *src, int size, uint8_t *dst);
+
+static int vc1_unescape_buffer_neon(const uint8_t *src, int size, uint8_t *dst)
+{
+    /* Starting, stopping and removing escape bytes are comparatively less
+     * time-sensitive, so they are more clearly expressed in a C wrapper
+     * around the assembly inner loop. Note that we assume a little-endian
+     * machine that supports unaligned loads. */
+    int dsize = 0;
+    while (size >= 4)
+    {
+        int found = 0;
+        while (!found && (((uintptr_t) dst) & 7) && size >= 4)
+        {
+            found = (AV_RL32(src) &~ 0x03000000) == 0x00030000;
+            if (!found)
+            {
+                *dst++ = *src++;
+                --size;
+                ++dsize;
+            }
+        }
+        if (!found)
+        {
+            int skip = size - ff_vc1_unescape_buffer_helper_neon(src, size, dst);
+            dst += skip;
+            src += skip;
+            size -= skip;
+            dsize += skip;
+            while (!found && size >= 4)
+            {
+                found = (AV_RL32(src) &~ 0x03000000) == 0x00030000;
+                if (!found)
+                {
+                    *dst++ = *src++;
+                    --size;
+                    ++dsize;
+                }
+            }
+        }
+        if (found)
+        {
+            *dst++ = *src++;
+            *dst++ = *src++;
+            ++src;
+            size -= 3;
+            dsize += 2;
+        }
+    }
+    while (size > 0)
+    {
+        *dst++ = *src++;
+        --size;
+        ++dsize;
+    }
+    return dsize;
+}
+
 av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -76,5 +135,7 @@ av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
         dsp->avg_no_rnd_vc1_chroma_pixels_tab[0] = ff_avg_vc1_chroma_mc8_neon;
         dsp->put_no_rnd_vc1_chroma_pixels_tab[1] = ff_put_vc1_chroma_mc4_neon;
         dsp->avg_no_rnd_vc1_chroma_pixels_tab[1] = ff_avg_vc1_chroma_mc4_neon;
+
+        dsp->vc1_unescape_buffer = vc1_unescape_buffer_neon;
     }
 }
diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
index 0201db4f78..9a96c2523c 100644
--- a/libavcodec/aarch64/vc1dsp_neon.S
+++ b/libavcodec/aarch64/vc1dsp_neon.S
@@ -1368,3 +1368,179 @@ function ff_vc1_h_loop_filter16_neon, export=1
         st2             {v2.b, v3.b}[7], [x6]
 4:      ret
 endfunc
+
+// Copy at most the specified number of bytes from source to destination buffer,
+// stopping at a multiple of 32 bytes, none of which are the start of an escape sequence.
+// On entry:
+//   x0 -> source buffer
+//   w1 = max number of bytes to copy
+//   x2 -> destination buffer, optimally 8-byte aligned
+// On exit:
+//   w0 = number of bytes not copied
+function ff_vc1_unescape_buffer_helper_neon, export=1
+        // Offset by 80 to screen out cases that are too short for us to handle,
+        // and also make it easy to test for loop termination, or to determine
+        // whether we need an odd number of half-iterations of the loop.
+        subs            w1, w1, #80
+        b.mi            90f
+
+        // Set up useful constants
+        movi            v20.4s, #3, lsl #24
+        movi            v21.4s, #3, lsl #16
+
+        tst             w1, #32
+        b.ne            1f
+
+          ld1             {v0.16b, v1.16b, v2.16b}, [x0], #48
+          ext             v25.16b, v0.16b, v1.16b, #1
+          ext             v26.16b, v0.16b, v1.16b, #2
+          ext             v27.16b, v0.16b, v1.16b, #3
+          ext             v29.16b, v1.16b, v2.16b, #1
+          ext             v30.16b, v1.16b, v2.16b, #2
+          ext             v31.16b, v1.16b, v2.16b, #3
+          bic             v24.16b, v0.16b, v20.16b
+          bic             v25.16b, v25.16b, v20.16b
+          bic             v26.16b, v26.16b, v20.16b
+          bic             v27.16b, v27.16b, v20.16b
+          bic             v28.16b, v1.16b, v20.16b
+          bic             v29.16b, v29.16b, v20.16b
+          bic             v30.16b, v30.16b, v20.16b
+          bic             v31.16b, v31.16b, v20.16b
+          eor             v24.16b, v24.16b, v21.16b
+          eor             v25.16b, v25.16b, v21.16b
+          eor             v26.16b, v26.16b, v21.16b
+          eor             v27.16b, v27.16b, v21.16b
+          eor             v28.16b, v28.16b, v21.16b
+          eor             v29.16b, v29.16b, v21.16b
+          eor             v30.16b, v30.16b, v21.16b
+          eor             v31.16b, v31.16b, v21.16b
+          cmeq            v24.4s, v24.4s, #0
+          cmeq            v25.4s, v25.4s, #0
+          cmeq            v26.4s, v26.4s, #0
+          cmeq            v27.4s, v27.4s, #0
+          add             w1, w1, #32
+          b               3f
+
+1:      ld1             {v3.16b, v4.16b, v5.16b}, [x0], #48
+        ext             v25.16b, v3.16b, v4.16b, #1
+        ext             v26.16b, v3.16b, v4.16b, #2
+        ext             v27.16b, v3.16b, v4.16b, #3
+        ext             v29.16b, v4.16b, v5.16b, #1
+        ext             v30.16b, v4.16b, v5.16b, #2
+        ext             v31.16b, v4.16b, v5.16b, #3
+        bic             v24.16b, v3.16b, v20.16b
+        bic             v25.16b, v25.16b, v20.16b
+        bic             v26.16b, v26.16b, v20.16b
+        bic             v27.16b, v27.16b, v20.16b
+        bic             v28.16b, v4.16b, v20.16b
+        bic             v29.16b, v29.16b, v20.16b
+        bic             v30.16b, v30.16b, v20.16b
+        bic             v31.16b, v31.16b, v20.16b
+        eor             v24.16b, v24.16b, v21.16b
+        eor             v25.16b, v25.16b, v21.16b
+        eor             v26.16b, v26.16b, v21.16b
+        eor             v27.16b, v27.16b, v21.16b
+        eor             v28.16b, v28.16b, v21.16b
+        eor             v29.16b, v29.16b, v21.16b
+        eor             v30.16b, v30.16b, v21.16b
+        eor             v31.16b, v31.16b, v21.16b
+        cmeq            v24.4s, v24.4s, #0
+        cmeq            v25.4s, v25.4s, #0
+        cmeq            v26.4s, v26.4s, #0
+        cmeq            v27.4s, v27.4s, #0
+        // Drop through...
+2:        mov             v0.16b, v5.16b
+          ld1             {v1.16b, v2.16b}, [x0], #32
+        cmeq            v28.4s, v28.4s, #0
+        cmeq            v29.4s, v29.4s, #0
+        cmeq            v30.4s, v30.4s, #0
+        cmeq            v31.4s, v31.4s, #0
+        orr             v24.16b, v24.16b, v25.16b
+        orr             v26.16b, v26.16b, v27.16b
+        orr             v28.16b, v28.16b, v29.16b
+        orr             v30.16b, v30.16b, v31.16b
+          ext             v25.16b, v0.16b, v1.16b, #1
+        orr             v22.16b, v24.16b, v26.16b
+          ext             v26.16b, v0.16b, v1.16b, #2
+          ext             v27.16b, v0.16b, v1.16b, #3
+          ext             v29.16b, v1.16b, v2.16b, #1
+        orr             v23.16b, v28.16b, v30.16b
+          ext             v30.16b, v1.16b, v2.16b, #2
+          ext             v31.16b, v1.16b, v2.16b, #3
+          bic             v24.16b, v0.16b, v20.16b
+          bic             v25.16b, v25.16b, v20.16b
+          bic             v26.16b, v26.16b, v20.16b
+        orr             v22.16b, v22.16b, v23.16b
+          bic             v27.16b, v27.16b, v20.16b
+          bic             v28.16b, v1.16b, v20.16b
+          bic             v29.16b, v29.16b, v20.16b
+          bic             v30.16b, v30.16b, v20.16b
+          bic             v31.16b, v31.16b, v20.16b
+        addv            s22, v22.4s
+          eor             v24.16b, v24.16b, v21.16b
+          eor             v25.16b, v25.16b, v21.16b
+          eor             v26.16b, v26.16b, v21.16b
+          eor             v27.16b, v27.16b, v21.16b
+          eor             v28.16b, v28.16b, v21.16b
+        mov             w3, v22.s[0]
+          eor             v29.16b, v29.16b, v21.16b
+          eor             v30.16b, v30.16b, v21.16b
+          eor             v31.16b, v31.16b, v21.16b
+          cmeq            v24.4s, v24.4s, #0
+          cmeq            v25.4s, v25.4s, #0
+          cmeq            v26.4s, v26.4s, #0
+          cmeq            v27.4s, v27.4s, #0
+        cbnz            w3, 90f
+        st1             {v3.16b, v4.16b}, [x2], #32
+3:          mov             v3.16b, v2.16b
+            ld1             {v4.16b, v5.16b}, [x0], #32
+          cmeq            v28.4s, v28.4s, #0
+          cmeq            v29.4s, v29.4s, #0
+          cmeq            v30.4s, v30.4s, #0
+          cmeq            v31.4s, v31.4s, #0
+          orr             v24.16b, v24.16b, v25.16b
+          orr             v26.16b, v26.16b, v27.16b
+          orr             v28.16b, v28.16b, v29.16b
+          orr             v30.16b, v30.16b, v31.16b
+            ext             v25.16b, v3.16b, v4.16b, #1
+          orr             v22.16b, v24.16b, v26.16b
+            ext             v26.16b, v3.16b, v4.16b, #2
+            ext             v27.16b, v3.16b, v4.16b, #3
+            ext             v29.16b, v4.16b, v5.16b, #1
+          orr             v23.16b, v28.16b, v30.16b
+            ext             v30.16b, v4.16b, v5.16b, #2
+            ext             v31.16b, v4.16b, v5.16b, #3
+            bic             v24.16b, v3.16b, v20.16b
+            bic             v25.16b, v25.16b, v20.16b
+            bic             v26.16b, v26.16b, v20.16b
+          orr             v22.16b, v22.16b, v23.16b
+            bic             v27.16b, v27.16b, v20.16b
+            bic             v28.16b, v4.16b, v20.16b
+            bic             v29.16b, v29.16b, v20.16b
+            bic             v30.16b, v30.16b, v20.16b
+            bic             v31.16b, v31.16b, v20.16b
+          addv            s22, v22.4s
+            eor             v24.16b, v24.16b, v21.16b
+            eor             v25.16b, v25.16b, v21.16b
+            eor             v26.16b, v26.16b, v21.16b
+            eor             v27.16b, v27.16b, v21.16b
+            eor             v28.16b, v28.16b, v21.16b
+          mov             w3, v22.s[0]
+            eor             v29.16b, v29.16b, v21.16b
+            eor             v30.16b, v30.16b, v21.16b
+            eor             v31.16b, v31.16b, v21.16b
+            cmeq            v24.4s, v24.4s, #0
+            cmeq            v25.4s, v25.4s, #0
+            cmeq            v26.4s, v26.4s, #0
+            cmeq            v27.4s, v27.4s, #0
+          cbnz            w3, 91f
+          st1             {v0.16b, v1.16b}, [x2], #32
+        subs            w1, w1, #64
+        b.pl            2b
+
+90:     add             w0, w1, #80
+        ret
+
+91:     sub             w1, w1, #32
+        b               90b
+endfunc