From patchwork Fri Mar 25 18:52:56 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ben Avison <bavison@riscosopen.org>
X-Patchwork-Id: 34991
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:ab0:5fda:0:0:0:0:0 with SMTP id g26csp1432617uaj;
        Fri, 25 Mar 2022 11:55:10 -0700 (PDT)
X-Google-Smtp-Source: 
 ABdhPJz89sWPDuIwKR5gHCNseGUyKkljgTcY7myoQ4rWcPE/nnem1X8BUyPLAXBuTiGuOjo4OjX1
X-Received: by 2002:a17:906:99d1:b0:6df:f83e:da21 with SMTP id
 s17-20020a17090699d100b006dff83eda21mr13163546ejn.462.1648234510619;
        Fri, 25 Mar 2022 11:55:10 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1648234510; cv=none;
        d=google.com; s=arc-20160816;
        b=pZH5yskT45WrnxMxLz4dlTINeafujml2alFixUfKR1n1RW4kYfsUn6SNIHjVASyAu1
         Xc7lOvVUjFOa2FVjkcQ6gcaLkiYc84OSSy3SbweIlUGvxxwbbuYFSM1qfaUaD8xurUnE
         Jm0cbJKUxNKWFo2P/CpidWk7kB3DjMVFPnFcZnsEs5Jovlpy0z5sDh6porLDXr4uVGLZ
         SgI1nejpGjkbAZolZAC0ObD2Kw+OTLq8FHXrNkT9HKL+n4lITpuOBnd8T91T09h29CV7
         Ay+Cluo8Ou6nK7A+lQ7S5H8NNwCT9g+6SsE/TJ3RXuFn6CxdtUg8kjW+AX8izRFufqVJ
         PcgA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:delivered-to;
        bh=DpPcRST4KND7SOUiN0pmRaV4p1m0FtH2qsZuuqrL+OA=;
        b=TqcSfiVW52is825Jfxu5PSve3u9mP/2YDTEKCWChV4+2JjlTF7N03u2Q08OGjgZ9J0
         er6spfYW16LHmcT2IJtAowEY7COEJT5ctSCkV2myY6qMYpdIQSEDVUAoHp8Z5Q6u37R9
         2C2ZSKLF6xvqlfm2LZ2FbjcF/FjqIEfczt1TrdpYQROEUq1Tu5UvRZcjxWbvtcPNdNyq
         ZFY/Bh2RRmGb4z+JoGcrcJTNa7rGe+YYkzytaBksJgZFitiXHV6MUiUTkMiBFKrVz7Xl
         Si7YojJ/oUBsC7aDraejxS0gWGkaxSCNU0uuHsy84K46HvZBGWhJU6+cL1LdBo3cWGRl
         6ljg==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 c12-20020aa7c98c000000b004195f430321si3697586edt.76.2022.03.25.11.55.10;
        Fri, 25 Mar 2022 11:55:10 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 3DE0668B253;
	Fri, 25 Mar 2022 20:53:59 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from outmail149113.authsmtp.com (outmail149113.authsmtp.com
 [62.13.149.113])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E149868B24B
 for <ffmpeg-devel@ffmpeg.org>; Fri, 25 Mar 2022 20:53:54 +0200 (EET)
Received: from mail-c233.authsmtp.com (mail-c233.authsmtp.com [62.13.128.233])
 by punt17.authsmtp.com. (8.15.2/8.15.2) with ESMTP id 22PIrscK052628;
 Fri, 25 Mar 2022 18:53:54 GMT (envelope-from bavison@riscosopen.org)
Received: from rpi2021 (237.63.9.51.dyn.plus.net [51.9.63.237])
 (authenticated bits=0)
 by mail.authsmtp.com (8.15.2/8.15.2) with ESMTPSA id 22PIrpNR046678
 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Fri, 25 Mar 2022 18:53:51 GMT (envelope-from bavison@riscosopen.org)
Received: by rpi2021 (sSMTP sendmail emulation);
 Fri, 25 Mar 2022 18:53:51 +0000
From: Ben Avison <bavison@riscosopen.org>
To: ffmpeg-devel@ffmpeg.org
Date: Fri, 25 Mar 2022 18:52:56 +0000
Message-Id: <20220325185257.513933-10-bavison@riscosopen.org>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220325185257.513933-1-bavison@riscosopen.org>
References: <20220317185819.466470-1-bavison@riscosopen.org>
 <20220325185257.513933-1-bavison@riscosopen.org>
MIME-Version: 1.0
X-Server-Quench: ea61b25e-ac6c-11ec-a0f2-84349711df28
X-AuthReport-Spam: If SPAM / abuse - report it at:
 http://www.authsmtp.com/abuse
X-AuthRoute: OCd1YggXA1ZfRRob ESQCJDVBUg4iPRpU DBlFKhFVNl8UURhQ
 KkJXbgASJgdBAnRQ QXkJW1ZWQFx5U2Z2 YQ9SIwBcfENQWQZ0 UktOXVBXFgB3AFID
 BH5nEFkMFQVCfnh3 bQhgW3BdVAovJEB8 EExRQHBXNmEzdWEe BRNFJgMCch5CehxB
 Y1d+VSdbY21JDRoR IyQTd2hgemwHHWxc XAoKIV8ZBlgAR2x0 bgoHVWtzWEcEW20V
 AjsAYkMaEV0aO10/ eVUoQk5QKxYOCmUA
X-Authentic-SMTP: 61633632303230.1021:7600
X-AuthFastPath: 0 (Was 255)
X-AuthSMTP-Origin: 51.9.63.237/2525
X-AuthVirus-Status: No virus detected - but ensure you scan with your own
 anti-virus system.
Subject: [FFmpeg-devel] [PATCH 09/10] avcodec/vc1: Arm 64-bit NEON unescape
 fast path
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Ben Avison <bavison@riscosopen.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: DsBTHczP9Rnf

checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

vc1dsp.vc1_unescape_buffer_c: 655617.7
vc1dsp.vc1_unescape_buffer_neon: 118237.0

Signed-off-by: Ben Avison <bavison@riscosopen.org>
---
 libavcodec/aarch64/vc1dsp_init_aarch64.c |  61 ++++++++
 libavcodec/aarch64/vc1dsp_neon.S         | 176 +++++++++++++++++++++++
 2 files changed, 237 insertions(+)

diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b/libavcodec/aarch64/vc1dsp_init_aarch64.c
index b672b2aa99..161d5a972b 100644
--- a/libavcodec/aarch64/vc1dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vc1dsp_init_aarch64.c
@@ -21,6 +21,7 @@
 #include "libavutil/attributes.h"
 #include "libavutil/cpu.h"
 #include "libavutil/aarch64/cpu.h"
+#include "libavutil/intreadwrite.h"
 #include "libavcodec/vc1dsp.h"
 
 #include "config.h"
@@ -51,6 +52,64 @@ void ff_put_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
 void ff_avg_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
                                 int h, int x, int y);
 
+int ff_vc1_unescape_buffer_helper_neon(const uint8_t *src, int size, uint8_t *dst);
+
+static int vc1_unescape_buffer_neon(const uint8_t *src, int size, uint8_t *dst)
+{
+    /* Dealing with starting and stopping, and removing escape bytes, are
+     * comparatively less time-sensitive, so are more clearly expressed using
+     * a C wrapper around the assembly inner loop. Note that we assume a
+     * little-endian machine that supports unaligned loads. */
+    int dsize = 0;
+    while (size >= 4)
+    {
+        int found = 0;
+        while (!found && (((uintptr_t) dst) & 7) && size >= 4)
+        {
+            found = (AV_RL32(src) &~ 0x03000000) == 0x00030000;
+            if (!found)
+            {
+                *dst++ = *src++;
+                --size;
+                ++dsize;
+            }
+        }
+        if (!found)
+        {
+            int skip = size - ff_vc1_unescape_buffer_helper_neon(src, size, dst);
+            dst += skip;
+            src += skip;
+            size -= skip;
+            dsize += skip;
+            while (!found && size >= 4)
+            {
+                found = (AV_RL32(src) &~ 0x03000000) == 0x00030000;
+                if (!found)
+                {
+                    *dst++ = *src++;
+                    --size;
+                    ++dsize;
+                }
+            }
+        }
+        if (found)
+        {
+            *dst++ = *src++;
+            *dst++ = *src++;
+            ++src;
+            size -= 3;
+            dsize += 2;
+        }
+    }
+    while (size > 0)
+    {
+        *dst++ = *src++;
+        --size;
+        ++dsize;
+    }
+    return dsize;
+}
+
 av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -76,5 +135,7 @@ av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
         dsp->avg_no_rnd_vc1_chroma_pixels_tab[0] = ff_avg_vc1_chroma_mc8_neon;
         dsp->put_no_rnd_vc1_chroma_pixels_tab[1] = ff_put_vc1_chroma_mc4_neon;
         dsp->avg_no_rnd_vc1_chroma_pixels_tab[1] = ff_avg_vc1_chroma_mc4_neon;
+
+        dsp->vc1_unescape_buffer = vc1_unescape_buffer_neon;
     }
 }
diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
index e68e0fce53..529c21d285 100644
--- a/libavcodec/aarch64/vc1dsp_neon.S
+++ b/libavcodec/aarch64/vc1dsp_neon.S
@@ -1374,3 +1374,179 @@ function ff_vc1_h_loop_filter16_neon, export=1
         st2             {v2.b, v3.b}[7], [x6]
 4:      ret
 endfunc
+
+// Copy at most the specified number of bytes from source to destination buffer,
+// stopping at a multiple of 32 bytes, none of which are the start of an escape sequence
+// On entry:
+//   x0 -> source buffer
+//   w1 = max number of bytes to copy
+//   x2 -> destination buffer, optimally 8-byte aligned
+// On exit:
+//   w0 = number of bytes not copied
+function ff_vc1_unescape_buffer_helper_neon, export=1
+        // Offset by 80 to screen out cases that are too short for us to handle,
+        // and also make it easy to test for loop termination, or to determine
+        // whether we need an odd number of half-iterations of the loop.
+        subs            w1, w1, #80
+        b.mi            90f
+
+        // Set up useful constants
+        movi            v20.4s, #3, lsl #24
+        movi            v21.4s, #3, lsl #16
+
+        tst             w1, #32
+        b.ne            1f
+
+          ld1             {v0.16b, v1.16b, v2.16b}, [x0], #48
+          ext             v25.16b, v0.16b, v1.16b, #1
+          ext             v26.16b, v0.16b, v1.16b, #2
+          ext             v27.16b, v0.16b, v1.16b, #3
+          ext             v29.16b, v1.16b, v2.16b, #1
+          ext             v30.16b, v1.16b, v2.16b, #2
+          ext             v31.16b, v1.16b, v2.16b, #3
+          bic             v24.16b, v0.16b, v20.16b
+          bic             v25.16b, v25.16b, v20.16b
+          bic             v26.16b, v26.16b, v20.16b
+          bic             v27.16b, v27.16b, v20.16b
+          bic             v28.16b, v1.16b, v20.16b
+          bic             v29.16b, v29.16b, v20.16b
+          bic             v30.16b, v30.16b, v20.16b
+          bic             v31.16b, v31.16b, v20.16b
+          eor             v24.16b, v24.16b, v21.16b
+          eor             v25.16b, v25.16b, v21.16b
+          eor             v26.16b, v26.16b, v21.16b
+          eor             v27.16b, v27.16b, v21.16b
+          eor             v28.16b, v28.16b, v21.16b
+          eor             v29.16b, v29.16b, v21.16b
+          eor             v30.16b, v30.16b, v21.16b
+          eor             v31.16b, v31.16b, v21.16b
+          cmeq            v24.4s, v24.4s, #0
+          cmeq            v25.4s, v25.4s, #0
+          cmeq            v26.4s, v26.4s, #0
+          cmeq            v27.4s, v27.4s, #0
+          add             w1, w1, #32
+          b               3f
+
+1:      ld1             {v3.16b, v4.16b, v5.16b}, [x0], #48
+        ext             v25.16b, v3.16b, v4.16b, #1
+        ext             v26.16b, v3.16b, v4.16b, #2
+        ext             v27.16b, v3.16b, v4.16b, #3
+        ext             v29.16b, v4.16b, v5.16b, #1
+        ext             v30.16b, v4.16b, v5.16b, #2
+        ext             v31.16b, v4.16b, v5.16b, #3
+        bic             v24.16b, v3.16b, v20.16b
+        bic             v25.16b, v25.16b, v20.16b
+        bic             v26.16b, v26.16b, v20.16b
+        bic             v27.16b, v27.16b, v20.16b
+        bic             v28.16b, v4.16b, v20.16b
+        bic             v29.16b, v29.16b, v20.16b
+        bic             v30.16b, v30.16b, v20.16b
+        bic             v31.16b, v31.16b, v20.16b
+        eor             v24.16b, v24.16b, v21.16b
+        eor             v25.16b, v25.16b, v21.16b
+        eor             v26.16b, v26.16b, v21.16b
+        eor             v27.16b, v27.16b, v21.16b
+        eor             v28.16b, v28.16b, v21.16b
+        eor             v29.16b, v29.16b, v21.16b
+        eor             v30.16b, v30.16b, v21.16b
+        eor             v31.16b, v31.16b, v21.16b
+        cmeq            v24.4s, v24.4s, #0
+        cmeq            v25.4s, v25.4s, #0
+        cmeq            v26.4s, v26.4s, #0
+        cmeq            v27.4s, v27.4s, #0
+        // Drop through...
+2:        mov             v0.16b, v5.16b
+          ld1             {v1.16b, v2.16b}, [x0], #32
+        cmeq            v28.4s, v28.4s, #0
+        cmeq            v29.4s, v29.4s, #0
+        cmeq            v30.4s, v30.4s, #0
+        cmeq            v31.4s, v31.4s, #0
+        orr             v24.16b, v24.16b, v25.16b
+        orr             v26.16b, v26.16b, v27.16b
+        orr             v28.16b, v28.16b, v29.16b
+        orr             v30.16b, v30.16b, v31.16b
+          ext             v25.16b, v0.16b, v1.16b, #1
+        orr             v22.16b, v24.16b, v26.16b
+          ext             v26.16b, v0.16b, v1.16b, #2
+          ext             v27.16b, v0.16b, v1.16b, #3
+          ext             v29.16b, v1.16b, v2.16b, #1
+        orr             v23.16b, v28.16b, v30.16b
+          ext             v30.16b, v1.16b, v2.16b, #2
+          ext             v31.16b, v1.16b, v2.16b, #3
+          bic             v24.16b, v0.16b, v20.16b
+          bic             v25.16b, v25.16b, v20.16b
+          bic             v26.16b, v26.16b, v20.16b
+        orr             v22.16b, v22.16b, v23.16b
+          bic             v27.16b, v27.16b, v20.16b
+          bic             v28.16b, v1.16b, v20.16b
+          bic             v29.16b, v29.16b, v20.16b
+          bic             v30.16b, v30.16b, v20.16b
+          bic             v31.16b, v31.16b, v20.16b
+        addv            s22, v22.4s
+          eor             v24.16b, v24.16b, v21.16b
+          eor             v25.16b, v25.16b, v21.16b
+          eor             v26.16b, v26.16b, v21.16b
+          eor             v27.16b, v27.16b, v21.16b
+          eor             v28.16b, v28.16b, v21.16b
+        mov             w3, v22.s[0]
+          eor             v29.16b, v29.16b, v21.16b
+          eor             v30.16b, v30.16b, v21.16b
+          eor             v31.16b, v31.16b, v21.16b
+          cmeq            v24.4s, v24.4s, #0
+          cmeq            v25.4s, v25.4s, #0
+          cmeq            v26.4s, v26.4s, #0
+          cmeq            v27.4s, v27.4s, #0
+        cbnz            w3, 90f
+        st1             {v3.16b, v4.16b}, [x2], #32
+3:          mov             v3.16b, v2.16b
+            ld1             {v4.16b, v5.16b}, [x0], #32
+          cmeq            v28.4s, v28.4s, #0
+          cmeq            v29.4s, v29.4s, #0
+          cmeq            v30.4s, v30.4s, #0
+          cmeq            v31.4s, v31.4s, #0
+          orr             v24.16b, v24.16b, v25.16b
+          orr             v26.16b, v26.16b, v27.16b
+          orr             v28.16b, v28.16b, v29.16b
+          orr             v30.16b, v30.16b, v31.16b
+            ext             v25.16b, v3.16b, v4.16b, #1
+          orr             v22.16b, v24.16b, v26.16b
+            ext             v26.16b, v3.16b, v4.16b, #2
+            ext             v27.16b, v3.16b, v4.16b, #3
+            ext             v29.16b, v4.16b, v5.16b, #1
+          orr             v23.16b, v28.16b, v30.16b
+            ext             v30.16b, v4.16b, v5.16b, #2
+            ext             v31.16b, v4.16b, v5.16b, #3
+            bic             v24.16b, v3.16b, v20.16b
+            bic             v25.16b, v25.16b, v20.16b
+            bic             v26.16b, v26.16b, v20.16b
+          orr             v22.16b, v22.16b, v23.16b
+            bic             v27.16b, v27.16b, v20.16b
+            bic             v28.16b, v4.16b, v20.16b
+            bic             v29.16b, v29.16b, v20.16b
+            bic             v30.16b, v30.16b, v20.16b
+            bic             v31.16b, v31.16b, v20.16b
+          addv            s22, v22.4s
+            eor             v24.16b, v24.16b, v21.16b
+            eor             v25.16b, v25.16b, v21.16b
+            eor             v26.16b, v26.16b, v21.16b
+            eor             v27.16b, v27.16b, v21.16b
+            eor             v28.16b, v28.16b, v21.16b
+          mov             w3, v22.s[0]
+            eor             v29.16b, v29.16b, v21.16b
+            eor             v30.16b, v30.16b, v21.16b
+            eor             v31.16b, v31.16b, v21.16b
+            cmeq            v24.4s, v24.4s, #0
+            cmeq            v25.4s, v25.4s, #0
+            cmeq            v26.4s, v26.4s, #0
+            cmeq            v27.4s, v27.4s, #0
+          cbnz            w3, 91f
+          st1             {v0.16b, v1.16b}, [x2], #32
+        subs            w1, w1, #64
+        b.pl            2b
+
+90:     add             w0, w1, #80
+        ret
+
+91:     sub             w1, w1, #32
+        b               90b
+endfunc