From patchwork Fri Mar 25 18:52:57 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Avison X-Patchwork-Id: 34993 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:ab0:5fda:0:0:0:0:0 with SMTP id g26csp1432681uaj; Fri, 25 Mar 2022 11:55:31 -0700 (PDT) X-Google-Smtp-Source: ABdhPJy3yj7xwpFb4X7zgKcsyFa68FVLpxu0CTRz+r1B40Pxi2RGJoq4vrAmPfIUmi0l0tg3Sw6V X-Received: by 2002:aa7:db94:0:b0:410:f0e8:c39e with SMTP id u20-20020aa7db94000000b00410f0e8c39emr144900edt.14.1648234531363; Fri, 25 Mar 2022 11:55:31 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1648234531; cv=none; d=google.com; s=arc-20160816; b=EObf3KFHumYwdjJOy1CuOgwXl/sp6TIJFFtr5BKY6D+0UMLDfvFGL2uIoUsupx6RgA PA+nYo3RNW9yXCUki/JtjeAJqIg5sJfkDg+6AJe3SND4uUmzksUbFiWPFS/xE6wf3LBw bhPNF9tC++4dG+nCmRfdWyMIb8gP+k36ny7ndQyoWhsu25mGtpiK76753mzCyOkR033o 7fhr/xNDHgfnylE0Joi2lEMfdRIYorx06ONYE4A0r/RYhaQo1dl4uDcU5gITFK4O9ged HE2hA6w3EM+wqWUZGJM+w1usZ/101pp6oa7iB3QlBuFTW9QdWb6pYsOFGvf67ORd3VdV 9u/A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:delivered-to; bh=pwzveZpmDqX0PhOz7LvRVD0Yzn+YflM+2LpMw7kIF7g=; b=BqpxL7NCo++d5KVPWD0kMCJIE7vmeYcFstG9ZgxEOsHne7CSGzhgnwgGQeEqEo2fm2 ULUIYb9FpUj8uYS26bi8WwHp/UW2rSYVYuo92TIWFxOQGN8QIEvewsXuXvM9znI8HFQs e/9yaC5csEa+ncNXtjvoGfl0v0Bd0EmegWzFkk9wwqk3l8l3xnpFBtxrMiBvLDiv7vY7 A6CTiE7wAXr23rY79Cl9vR1OcCf+TaNgu3zQGhGwuDJ3ZelUTiljkrw2mLWNu/iWI9Le cbNDo7yJNM0jOLkvL/pItUl5gF3f/hizPevh5Ml6lfncMQoREAlWeWwjMNFUHUpmfPL0 t7PQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id z17-20020a056402275100b00418ed1fc1easi4842157edd.428.2022.03.25.11.55.31; Fri, 25 Mar 2022 11:55:31 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 5EBB568B24B; Fri, 25 Mar 2022 20:54:01 +0200 (EET) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from outmail149113.authsmtp.com (outmail149113.authsmtp.com [62.13.149.113]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id C869A68B24F for ; Fri, 25 Mar 2022 20:53:55 +0200 (EET) Received: from punt22.authsmtp.com (punt22.authsmtp.com [62.13.128.207]) by punt17.authsmtp.com. (8.15.2/8.15.2) with ESMTP id 22PIrtKc052651 for ; Fri, 25 Mar 2022 18:53:55 GMT (envelope-from bavison@riscosopen.org) Received: from mail-c233.authsmtp.com (mail-c233.authsmtp.com [62.13.128.233]) by punt22.authsmtp.com. (8.15.2/8.15.2) with ESMTP id 22PIrtp9035513; Fri, 25 Mar 2022 18:53:55 GMT (envelope-from bavison@riscosopen.org) Received: from rpi2021 (237.63.9.51.dyn.plus.net [51.9.63.237]) (authenticated bits=0) by mail.authsmtp.com (8.15.2/8.15.2) with ESMTPSA id 22PIrreI046704 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Fri, 25 Mar 2022 18:53:53 GMT (envelope-from bavison@riscosopen.org) Received: by rpi2021 (sSMTP sendmail emulation); Fri, 25 Mar 2022 18:53:53 +0000 From: Ben Avison To: ffmpeg-devel@ffmpeg.org Date: Fri, 25 Mar 2022 18:52:57 +0000 Message-Id: <20220325185257.513933-11-bavison@riscosopen.org> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220325185257.513933-1-bavison@riscosopen.org> References: <20220317185819.466470-1-bavison@riscosopen.org> <20220325185257.513933-1-bavison@riscosopen.org> MIME-Version: 1.0 X-Server-Quench: eb6ffc95-ac6c-11ec-a0f2-84349711df28 X-AuthReport-Spam: If SPAM / abuse - report it at: http://www.authsmtp.com/abuse X-AuthRoute: OCd1YggXA1ZfRRob ESQCJDVBUg4iPRpU DBlFKhFVNl8UURhQ KkJXbgASJgdBAnRQ QXkJW1ZWQFx5U2Z2 YQ9VIwBcfENQWQZ0 UktOXVBXFgB3AFID BH5nEFsnDgVCfnlw YQhjWXJaXgovJ0Ar FB1dEHBXNmEzdWEe BRNFJgMCch5CehxB Y1d+VSdbY21JDRoR IyQTd2lpemwHHWxc XAoKIV8ZBlgAR2x0 bgoHVW51WEcEW20V AjsAYkMaEV0aO10/ eVUoQk5QKxYOCmUA X-Authentic-SMTP: 61633632303230.1021:7600 X-AuthFastPath: 0 (Was 255) X-AuthVirus-Status: No virus detected - but ensure you scan with your own anti-virus system. Subject: [FFmpeg-devel] [PATCH 10/10] avcodec/vc1: Arm 32-bit NEON unescape fast path X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Ben Avison Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: w4R2ypFrkHon checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. vc1dsp.vc1_unescape_buffer_c: 918624.7 vc1dsp.vc1_unescape_buffer_neon: 142958.0 Signed-off-by: Ben Avison --- libavcodec/arm/vc1dsp_init_neon.c | 61 +++++++++++++++ libavcodec/arm/vc1dsp_neon.S | 118 ++++++++++++++++++++++++++++++ 2 files changed, 179 insertions(+) diff --git a/libavcodec/arm/vc1dsp_init_neon.c b/libavcodec/arm/vc1dsp_init_neon.c index f5f5c702d7..48cb816b70 100644 --- a/libavcodec/arm/vc1dsp_init_neon.c +++ b/libavcodec/arm/vc1dsp_init_neon.c @@ -19,6 +19,7 @@ #include #include "libavutil/attributes.h" +#include "libavutil/intreadwrite.h" #include "libavcodec/vc1dsp.h" #include "vc1dsp.h" @@ -84,6 +85,64 @@ void ff_put_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride, void ff_avg_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride, int h, int x, int y); +int ff_vc1_unescape_buffer_helper_neon(const uint8_t *src, int size, uint8_t *dst); + +static int vc1_unescape_buffer_neon(const uint8_t *src, int size, uint8_t *dst) +{ + /* Dealing with starting and stopping, and removing escape bytes, are + * comparatively less time-sensitive, so are more clearly expressed using + * a C wrapper around the assembly inner loop. Note that we assume a + * little-endian machine that supports unaligned loads. */ + int dsize = 0; + while (size >= 4) + { + int found = 0; + while (!found && (((uintptr_t) dst) & 7) && size >= 4) + { + found = (AV_RL32(src) &~ 0x03000000) == 0x00030000; + if (!found) + { + *dst++ = *src++; + --size; + ++dsize; + } + } + if (!found) + { + int skip = size - ff_vc1_unescape_buffer_helper_neon(src, size, dst); + dst += skip; + src += skip; + size -= skip; + dsize += skip; + while (!found && size >= 4) + { + found = (AV_RL32(src) &~ 0x03000000) == 0x00030000; + if (!found) + { + *dst++ = *src++; + --size; + ++dsize; + } + } + } + if (found) + { + *dst++ = *src++; + *dst++ = *src++; + ++src; + size -= 3; + dsize += 2; + } + } + while (size > 0) + { + *dst++ = *src++; + --size; + ++dsize; + } + return dsize; +} + #define FN_ASSIGN(X, Y) \ dsp->put_vc1_mspel_pixels_tab[0][X+4*Y] = ff_put_vc1_mspel_mc##X##Y##_16_neon; \ dsp->put_vc1_mspel_pixels_tab[1][X+4*Y] = ff_put_vc1_mspel_mc##X##Y##_neon @@ -130,4 +189,6 @@ av_cold void ff_vc1dsp_init_neon(VC1DSPContext *dsp) dsp->avg_no_rnd_vc1_chroma_pixels_tab[0] = ff_avg_vc1_chroma_mc8_neon; dsp->put_no_rnd_vc1_chroma_pixels_tab[1] = ff_put_vc1_chroma_mc4_neon; dsp->avg_no_rnd_vc1_chroma_pixels_tab[1] = ff_avg_vc1_chroma_mc4_neon; + + dsp->vc1_unescape_buffer = vc1_unescape_buffer_neon; } diff --git a/libavcodec/arm/vc1dsp_neon.S b/libavcodec/arm/vc1dsp_neon.S index a639e81171..8e97bc5e58 100644 --- a/libavcodec/arm/vc1dsp_neon.S +++ b/libavcodec/arm/vc1dsp_neon.S @@ -1804,3 +1804,121 @@ function ff_vc1_h_loop_filter16_neon, export=1 4: vpop {d8-d15} pop {r4-r6,pc} endfunc + +@ Copy at most the specified number of bytes from source to destination buffer, +@ stopping at a multiple of 16 bytes, none of which are the start of an escape sequence +@ On entry: +@ r0 -> source buffer +@ r1 = max number of bytes to copy +@ r2 -> destination buffer, optimally 8-byte aligned +@ On exit: +@ r0 = number of bytes not copied +function ff_vc1_unescape_buffer_helper_neon, export=1 + @ Offset by 48 to screen out cases that are too short for us to handle, + @ and also make it easy to test for loop termination, or to determine + @ whether we need an odd number of half-iterations of the loop. + subs r1, r1, #48 + bmi 90f + + @ Set up useful constants + vmov.i32 q0, #0x3000000 + vmov.i32 q1, #0x30000 + + tst r1, #16 + bne 1f + + vld1.8 {q8, q9}, [r0]! + vbic q12, q8, q0 + vext.8 q13, q8, q9, #1 + vext.8 q14, q8, q9, #2 + vext.8 q15, q8, q9, #3 + veor q12, q12, q1 + vbic q13, q13, q0 + vbic q14, q14, q0 + vbic q15, q15, q0 + vceq.i32 q12, q12, #0 + veor q13, q13, q1 + veor q14, q14, q1 + veor q15, q15, q1 + vceq.i32 q13, q13, #0 + vceq.i32 q14, q14, #0 + vceq.i32 q15, q15, #0 + add r1, r1, #16 + b 3f + +1: vld1.8 {q10, q11}, [r0]! + vbic q12, q10, q0 + vext.8 q13, q10, q11, #1 + vext.8 q14, q10, q11, #2 + vext.8 q15, q10, q11, #3 + veor q12, q12, q1 + vbic q13, q13, q0 + vbic q14, q14, q0 + vbic q15, q15, q0 + vceq.i32 q12, q12, #0 + veor q13, q13, q1 + veor q14, q14, q1 + veor q15, q15, q1 + vceq.i32 q13, q13, #0 + vceq.i32 q14, q14, #0 + vceq.i32 q15, q15, #0 + @ Drop through... +2: vmov q8, q11 + vld1.8 {q9}, [r0]! + vorr q13, q12, q13 + vorr q15, q14, q15 + vbic q12, q8, q0 + vorr q3, q13, q15 + vext.8 q13, q8, q9, #1 + vext.8 q14, q8, q9, #2 + vext.8 q15, q8, q9, #3 + veor q12, q12, q1 + vorr d6, d6, d7 + vbic q13, q13, q0 + vbic q14, q14, q0 + vbic q15, q15, q0 + vceq.i32 q12, q12, #0 + vmov r3, r12, d6 + veor q13, q13, q1 + veor q14, q14, q1 + veor q15, q15, q1 + vceq.i32 q13, q13, #0 + vceq.i32 q14, q14, #0 + vceq.i32 q15, q15, #0 + orrs r3, r3, r12 + bne 90f + vst1.64 {q10}, [r2]! +3: vmov q10, q9 + vld1.8 {q11}, [r0]! + vorr q13, q12, q13 + vorr q15, q14, q15 + vbic q12, q10, q0 + vorr q3, q13, q15 + vext.8 q13, q10, q11, #1 + vext.8 q14, q10, q11, #2 + vext.8 q15, q10, q11, #3 + veor q12, q12, q1 + vorr d6, d6, d7 + vbic q13, q13, q0 + vbic q14, q14, q0 + vbic q15, q15, q0 + vceq.i32 q12, q12, #0 + vmov r3, r12, d6 + veor q13, q13, q1 + veor q14, q14, q1 + veor q15, q15, q1 + vceq.i32 q13, q13, #0 + vceq.i32 q14, q14, #0 + vceq.i32 q15, q15, #0 + orrs r3, r3, r12 + bne 91f + vst1.64 {q8}, [r2]! + subs r1, r1, #32 + bpl 2b + +90: add r0, r1, #48 + bx lr + +91: sub r1, r1, #16 + b 90b +endfunc