From patchwork Thu Mar 17 18:58:14 2022
X-Patchwork-Submitter: Ben Avison
X-Patchwork-Id: 34815
From: Ben Avison
To: ffmpeg-devel@ffmpeg.org
Cc: Ben Avison
Date: Thu, 17 Mar 2022 18:58:14 +0000
Message-Id: <20220317185819.466470-2-bavison@riscosopen.org>
In-Reply-To: <20220317185819.466470-1-bavison@riscosopen.org>
References: <20220317185819.466470-1-bavison@riscosopen.org>
Subject: [FFmpeg-devel] [PATCH 1/6] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths

Signed-off-by: Ben Avison
---
 libavcodec/aarch64/Makefile              |   1 +
 libavcodec/aarch64/vc1dsp_init_aarch64.c |  14 +
 libavcodec/aarch64/vc1dsp_neon.S         | 698 +++++++++++++++++++++++
 3 files changed, 713 insertions(+)
 create mode 100644 libavcodec/aarch64/vc1dsp_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 954461f81d..5b25e4dfb9 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -48,6 +48,7 @@ NEON-OBJS-$(CONFIG_IDCTDSP)             += aarch64/simple_idct_neon.o
 NEON-OBJS-$(CONFIG_MDCT)                += aarch64/mdct_neon.o
 NEON-OBJS-$(CONFIG_MPEGAUDIODSP)        += aarch64/mpegaudiodsp_neon.o
 NEON-OBJS-$(CONFIG_PIXBLOCKDSP)         += aarch64/pixblockdsp_neon.o
+NEON-OBJS-$(CONFIG_VC1DSP)              += aarch64/vc1dsp_neon.o
 NEON-OBJS-$(CONFIG_VP8DSP)              += aarch64/vp8dsp_neon.o

 # decoders/encoders
diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b/libavcodec/aarch64/vc1dsp_init_aarch64.c
index 13dfd74940..edfb296b75 100644
--- a/libavcodec/aarch64/vc1dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vc1dsp_init_aarch64.c
@@ -25,6 +25,13 @@

 #include "config.h"

+void ff_vc1_v_loop_filter4_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_h_loop_filter4_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_v_loop_filter8_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_h_loop_filter8_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_v_loop_filter16_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_h_loop_filter16_neon(uint8_t *src, int stride, int pq);
+
 void ff_put_vc1_chroma_mc8_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
                                 int h, int x, int y);
 void ff_avg_vc1_chroma_mc8_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
@@ -39,6 +46,13 @@ av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
     int cpu_flags = av_get_cpu_flags();

     if (have_neon(cpu_flags)) {
+        dsp->vc1_v_loop_filter4  = ff_vc1_v_loop_filter4_neon;
+        dsp->vc1_h_loop_filter4  = ff_vc1_h_loop_filter4_neon;
+        dsp->vc1_v_loop_filter8  = ff_vc1_v_loop_filter8_neon;
+        dsp->vc1_h_loop_filter8  = ff_vc1_h_loop_filter8_neon;
+        dsp->vc1_v_loop_filter16 = ff_vc1_v_loop_filter16_neon;
+        dsp->vc1_h_loop_filter16 = ff_vc1_h_loop_filter16_neon;
+
         dsp->put_no_rnd_vc1_chroma_pixels_tab[0] = ff_put_vc1_chroma_mc8_neon;
         dsp->avg_no_rnd_vc1_chroma_pixels_tab[0] = ff_avg_vc1_chroma_mc8_neon;
         dsp->put_no_rnd_vc1_chroma_pixels_tab[1] = ff_put_vc1_chroma_mc4_neon;
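For readers following the vector code below, this is roughly the scalar algorithm being vectorised, reconstructed from the comments in the NEON source and from the reference implementation in libavcodec/vc1dsp.c. It is a sketch, not verbatim FFmpeg code: filter_pair is a hypothetical name, and FFMIN/av_clip_uint8 are the usual libavutil helpers. P1..P8 are the eight pixels straddling the block edge; only the P4/P5 pair may be updated.

    static int filter_pair(uint8_t *src, int stride, int pq)
    {
        int P1 = src[-4*stride], P2 = src[-3*stride], P3 = src[-2*stride];
        int P4 = src[-1*stride], P5 = src[ 0*stride], P6 = src[ 1*stride];
        int P7 = src[ 2*stride], P8 = src[ 3*stride];

        int a0 = (2*P3 - 5*P4 + 5*P5 - 2*P6 + 4) >> 3;   /* rounding shift: srshr #3 */
        int a1 = abs((2*P1 - 5*P2 + 5*P3 - 2*P4 + 4) >> 3);
        int a2 = abs((2*P5 - 5*P6 + 5*P7 - 2*P8 + 4) >> 3);
        int a3 = FFMIN(a1, a2);
        int clip      = abs(P4 - P5) >> 1;
        int clip_sign = P4 - P5 < 0 ? -1 : 0;
        int a0_sign   = a0 < 0 ? -1 : 0;

        a0 = abs(a0);
        if (clip == 0 || a0 >= pq || a3 >= a0)
            return 0;                       /* pair is not filtered */

        int d = FFMIN((5 * (a0 - a3)) >> 3, clip);
        d *= clip_sign - a0_sign;           /* zero when the signs match */
        src[-1*stride] = av_clip_uint8(P4 - d);
        src[ 0*stride] = av_clip_uint8(P5 + d);
        return 1;
    }

a0 is the edge response across P4/P5, while a1 and a2 are the same response one step into each neighbouring block, so the pair is only filtered when the edge is locally the strongest feature (a3 < a0), weak relative to the quantiser (a0 < pq), and the two pixels actually differ (clip != 0).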
diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
new file mode 100644
index 0000000000..fe8963545a
--- /dev/null
+++ b/libavcodec/aarch64/vc1dsp_neon.S
@@ -0,0 +1,698 @@
+/*
+ * VC1 AArch64 NEON optimisations
+ *
+ * Copyright (c) 2022 Ben Avison
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+.align  5
+.Lcoeffs:
+.quad   0x00050002
+
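The .Lcoeffs quadword packs the two filter multipliers into a single scalar load: viewed as four unsigned 16-bit lanes, 0x00050002 gives v0.h[0] = 2 and v0.h[1] = 5, so one ldr d0 feeds every by-element mls/mla in the file. A small C picture of that layout (illustrative only):

    #include <stdint.h>

    static const uint64_t coeffs = UINT64_C(0x00050002);

    /* lane(0) == 2, lane(1) == 5; lanes 2 and 3 are unused */
    static inline uint16_t lane(int i)
    {
        return (uint16_t)(coeffs >> (16 * i));
    }
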
+// VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of vertically-neighbouring blocks
+// On entry:
+//   x0 -> top-left pel of lower block
+//   w1 = row stride, bytes
+//   w2 = PQUANT bitstream parameter
+function ff_vc1_v_loop_filter4_neon, export=1
+        sub      x3, x0, w1, sxtw #2
+        sxtw     x1, w1                 // technically, stride is signed int
+        ldr      d0, .Lcoeffs
+        ld1      {v1.s}[0], [x0], x1    // P5
+        ld1      {v2.s}[0], [x3], x1    // P1
+        ld1      {v3.s}[0], [x3], x1    // P2
+        ld1      {v4.s}[0], [x0], x1    // P6
+        ld1      {v5.s}[0], [x3], x1    // P3
+        ld1      {v6.s}[0], [x0], x1    // P7
+        ld1      {v7.s}[0], [x3]        // P4
+        ld1      {v16.s}[0], [x0]       // P8
+        ushll    v17.8h, v1.8b, #1      // 2*P5
+        dup      v18.8h, w2             // pq
+        ushll    v2.8h, v2.8b, #1       // 2*P1
+        uxtl     v3.8h, v3.8b           // P2
+        uxtl     v4.8h, v4.8b           // P6
+        uxtl     v19.8h, v5.8b          // P3
+        mls      v2.4h, v3.4h, v0.h[1]  // 2*P1-5*P2
+        uxtl     v3.8h, v6.8b           // P7
+        mls      v17.4h, v4.4h, v0.h[1] // 2*P5-5*P6
+        ushll    v5.8h, v5.8b, #1       // 2*P3
+        uxtl     v6.8h, v7.8b           // P4
+        mla      v17.4h, v3.4h, v0.h[1] // 2*P5-5*P6+5*P7
+        uxtl     v3.8h, v16.8b          // P8
+        mla      v2.4h, v19.4h, v0.h[1] // 2*P1-5*P2+5*P3
+        uxtl     v1.8h, v1.8b           // P5
+        mls      v5.4h, v6.4h, v0.h[1]  // 2*P3-5*P4
+        mls      v17.4h, v3.4h, v0.h[0] // 2*P5-5*P6+5*P7-2*P8
+        sub      v3.4h, v6.4h, v1.4h    // P4-P5
+        mls      v2.4h, v6.4h, v0.h[0]  // 2*P1-5*P2+5*P3-2*P4
+        mla      v5.4h, v1.4h, v0.h[1]  // 2*P3-5*P4+5*P5
+        mls      v5.4h, v4.4h, v0.h[0]  // 2*P3-5*P4+5*P5-2*P6
+        abs      v4.4h, v3.4h
+        srshr    v7.4h, v17.4h, #3
+        srshr    v2.4h, v2.4h, #3
+        sshr     v4.4h, v4.4h, #1       // clip
+        srshr    v5.4h, v5.4h, #3
+        abs      v7.4h, v7.4h           // a2
+        sshr     v3.4h, v3.4h, #8       // clip_sign
+        abs      v2.4h, v2.4h           // a1
+        cmeq     v16.4h, v4.4h, #0      // test clip == 0
+        abs      v17.4h, v5.4h          // a0
+        sshr     v5.4h, v5.4h, #8       // a0_sign
+        cmhs     v19.4h, v2.4h, v7.4h   // test a1 >= a2
+        cmhs     v18.4h, v17.4h, v18.4h // test a0 >= pq
+        sub      v3.4h, v3.4h, v5.4h    // clip_sign - a0_sign
+        bsl      v19.8b, v7.8b, v2.8b   // a3
+        orr      v2.8b, v16.8b, v18.8b  // test clip == 0 || a0 >= pq
+        uqsub    v5.4h, v17.4h, v19.4h  // a0 >= a3 ? a0-a3 : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        cmhs     v7.4h, v19.4h, v17.4h  // test a3 >= a0
+        mul      v0.4h, v5.4h, v0.h[1]  // a0 >= a3 ? 5*(a0-a3) : 0
+        orr      v5.8b, v2.8b, v7.8b    // test clip == 0 || a0 >= pq || a3 >= a0
+        mov      w0, v5.s[1]            // move to gp reg
+        ushr     v0.4h, v0.4h, #3       // a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        cmhs     v5.4h, v0.4h, v4.4h
+        tbnz     w0, #0, 1f             // none of the 4 pixel pairs should be updated if this one is not filtered
+        bsl      v5.8b, v4.8b, v0.8b    // FFMIN(d, clip)
+        bic      v0.8b, v5.8b, v2.8b    // set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub)
+        mls      v6.4h, v0.4h, v3.4h    // invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        mla      v1.4h, v0.4h, v3.4h    // invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        sqxtun   v0.8b, v6.8h
+        sqxtun   v1.8b, v1.8h
+        st1      {v0.s}[0], [x3], x1
+        st1      {v1.s}[0], [x3]
+1:      ret
+endfunc
+
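The mov w0, v5.s[1] plus tbnz sequence above encodes VC-1's group rule: lanes 2-3 of the skip mask hold the verdict for the third pixel pair, and if that pair is unfiltered the whole group of four is left alone. The reference C drives it the same way, along the lines of vc1_loop_filter in libavcodec/vc1dsp.c (a sketch, reusing the hypothetical filter_pair from earlier):

    static void loop_filter(uint8_t *src, int step, int stride, int len, int pq)
    {
        for (int i = 0; i < len; i += 4) {
            /* pair 2 of each group of 4 decides for the whole group */
            if (filter_pair(src + 2 * step, stride, pq)) {
                filter_pair(src + 0 * step, stride, pq);
                filter_pair(src + 1 * step, stride, pq);
                filter_pair(src + 3 * step, stride, pq);
            }
            src += 4 * step;
        }
    }

In the NEON version every pair is computed unconditionally and the verdict is applied as a lane mask, so the branch is only an early-out for the all-skip case.
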
+// VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of horizontally-neighbouring blocks
+// On entry:
+//   x0 -> top-left pel of right block
+//   w1 = row stride, bytes
+//   w2 = PQUANT bitstream parameter
+function ff_vc1_h_loop_filter4_neon, export=1
+        sub      x3, x0, #4             // where to start reading
+        sxtw     x1, w1                 // technically, stride is signed int
+        ldr      d0, .Lcoeffs
+        ld1      {v1.8b}, [x3], x1
+        sub      x0, x0, #1             // where to start writing
+        ld1      {v2.8b}, [x3], x1
+        ld1      {v3.8b}, [x3], x1
+        ld1      {v4.8b}, [x3]
+        dup      v5.8h, w2              // pq
+        trn1     v6.8b, v1.8b, v2.8b
+        trn2     v1.8b, v1.8b, v2.8b
+        trn1     v2.8b, v3.8b, v4.8b
+        trn2     v3.8b, v3.8b, v4.8b
+        trn1     v4.4h, v6.4h, v2.4h    // P1, P5
+        trn1     v7.4h, v1.4h, v3.4h    // P2, P6
+        trn2     v2.4h, v6.4h, v2.4h    // P3, P7
+        trn2     v1.4h, v1.4h, v3.4h    // P4, P8
+        ushll    v3.8h, v4.8b, #1       // 2*P1, 2*P5
+        uxtl     v6.8h, v7.8b           // P2, P6
+        uxtl     v7.8h, v2.8b           // P3, P7
+        uxtl     v1.8h, v1.8b           // P4, P8
+        mls      v3.8h, v6.8h, v0.h[1]  // 2*P1-5*P2, 2*P5-5*P6
+        ushll    v2.8h, v2.8b, #1       // 2*P3, 2*P7
+        uxtl     v4.8h, v4.8b           // P1, P5
+        mla      v3.8h, v7.8h, v0.h[1]  // 2*P1-5*P2+5*P3, 2*P5-5*P6+5*P7
+        mov      d6, v6.d[1]            // P6
+        mls      v3.8h, v1.8h, v0.h[0]  // 2*P1-5*P2+5*P3-2*P4, 2*P5-5*P6+5*P7-2*P8
+        mov      d4, v4.d[1]            // P5
+        mls      v2.4h, v1.4h, v0.h[1]  // 2*P3-5*P4
+        mla      v2.4h, v4.4h, v0.h[1]  // 2*P3-5*P4+5*P5
+        sub      v7.4h, v1.4h, v4.4h    // P4-P5
+        mls      v2.4h, v6.4h, v0.h[0]  // 2*P3-5*P4+5*P5-2*P6
+        srshr    v3.8h, v3.8h, #3
+        abs      v6.4h, v7.4h
+        sshr     v7.4h, v7.4h, #8       // clip_sign
+        srshr    v2.4h, v2.4h, #3
+        abs      v3.8h, v3.8h           // a1, a2
+        sshr     v6.4h, v6.4h, #1       // clip
+        mov      d16, v3.d[1]           // a2
+        abs      v17.4h, v2.4h          // a0
+        cmeq     v18.4h, v6.4h, #0      // test clip == 0
+        sshr     v2.4h, v2.4h, #8       // a0_sign
+        cmhs     v19.4h, v3.4h, v16.4h  // test a1 >= a2
+        cmhs     v5.4h, v17.4h, v5.4h   // test a0 >= pq
+        sub      v2.4h, v7.4h, v2.4h    // clip_sign - a0_sign
+        bsl      v19.8b, v16.8b, v3.8b  // a3
+        orr      v3.8b, v18.8b, v5.8b   // test clip == 0 || a0 >= pq
+        uqsub    v5.4h, v17.4h, v19.4h  // a0 >= a3 ? a0-a3 : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        cmhs     v7.4h, v19.4h, v17.4h  // test a3 >= a0
+        mul      v0.4h, v5.4h, v0.h[1]  // a0 >= a3 ? 5*(a0-a3) : 0
+        orr      v5.8b, v3.8b, v7.8b    // test clip == 0 || a0 >= pq || a3 >= a0
+        mov      w2, v5.s[1]            // move to gp reg
+        ushr     v0.4h, v0.4h, #3       // a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        cmhs     v5.4h, v0.4h, v6.4h
+        tbnz     w2, #0, 1f             // none of the 4 pixel pairs should be updated if this one is not filtered
+        bsl      v5.8b, v6.8b, v0.8b    // FFMIN(d, clip)
+        bic      v0.8b, v5.8b, v3.8b    // set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub)
+        mla      v4.4h, v0.4h, v2.4h    // invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        mls      v1.4h, v0.4h, v2.4h    // invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        sqxtun   v3.8b, v4.8h
+        sqxtun   v2.8b, v1.8h
+        st2      {v2.b, v3.b}[0], [x0], x1
+        st2      {v2.b, v3.b}[1], [x0], x1
+        st2      {v2.b, v3.b}[2], [x0], x1
+        st2      {v2.b, v3.b}[3], [x0]
+1:      ret
+endfunc
+
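Vertical and horizontal filtering share one kernel; only the roles of the two strides swap. In terms of the hypothetical helpers above:

    static void v_loop_filter4(uint8_t *src, int stride, int pq)
    {
        loop_filter(src, 1, stride, 4, pq);       /* pairs side by side    */
    }

    static void h_loop_filter4(uint8_t *src, int stride, int pq)
    {
        loop_filter(src, stride, 1, 4, pq);       /* pairs stacked in rows */
    }

Because NEON wants the filtered direction laid out across lanes, ff_vc1_h_loop_filter4_neon first transposes the tile with trn1/trn2, runs the same dataflow as the vertical case, and writes the two changed columns back with interleaved st2 stores.
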
+// VC-1 in-loop deblocking filter for 8 pixel pairs at boundary of vertically-neighbouring blocks
+// On entry:
+//   x0 -> top-left pel of lower block
+//   w1 = row stride, bytes
+//   w2 = PQUANT bitstream parameter
+function ff_vc1_v_loop_filter8_neon, export=1
+        sub      x3, x0, w1, sxtw #2
+        sxtw     x1, w1                 // technically, stride is signed int
+        ldr      d0, .Lcoeffs
+        ld1      {v1.8b}, [x0], x1      // P5
+        movi     v2.2d, #0x0000ffff00000000
+        ld1      {v3.8b}, [x3], x1      // P1
+        ld1      {v4.8b}, [x3], x1      // P2
+        ld1      {v5.8b}, [x0], x1      // P6
+        ld1      {v6.8b}, [x3], x1      // P3
+        ld1      {v7.8b}, [x0], x1      // P7
+        ushll    v16.8h, v1.8b, #1      // 2*P5
+        ushll    v3.8h, v3.8b, #1       // 2*P1
+        ld1      {v17.8b}, [x3]         // P4
+        uxtl     v4.8h, v4.8b           // P2
+        ld1      {v18.8b}, [x0]         // P8
+        uxtl     v5.8h, v5.8b           // P6
+        dup      v19.8h, w2             // pq
+        uxtl     v20.8h, v6.8b          // P3
+        mls      v3.8h, v4.8h, v0.h[1]  // 2*P1-5*P2
+        uxtl     v4.8h, v7.8b           // P7
+        ushll    v6.8h, v6.8b, #1       // 2*P3
+        mls      v16.8h, v5.8h, v0.h[1] // 2*P5-5*P6
+        uxtl     v7.8h, v17.8b          // P4
+        uxtl     v17.8h, v18.8b         // P8
+        mla      v16.8h, v4.8h, v0.h[1] // 2*P5-5*P6+5*P7
+        uxtl     v1.8h, v1.8b           // P5
+        mla      v3.8h, v20.8h, v0.h[1] // 2*P1-5*P2+5*P3
+        sub      v4.8h, v7.8h, v1.8h    // P4-P5
+        mls      v6.8h, v7.8h, v0.h[1]  // 2*P3-5*P4
+        mls      v16.8h, v17.8h, v0.h[0] // 2*P5-5*P6+5*P7-2*P8
+        abs      v17.8h, v4.8h
+        sshr     v4.8h, v4.8h, #8       // clip_sign
+        mls      v3.8h, v7.8h, v0.h[0]  // 2*P1-5*P2+5*P3-2*P4
+        sshr     v17.8h, v17.8h, #1     // clip
+        mla      v6.8h, v1.8h, v0.h[1]  // 2*P3-5*P4+5*P5
+        srshr    v16.8h, v16.8h, #3
+        mls      v6.8h, v5.8h, v0.h[0]  // 2*P3-5*P4+5*P5-2*P6
+        cmeq     v5.8h, v17.8h, #0      // test clip == 0
+        srshr    v3.8h, v3.8h, #3
+        abs      v16.8h, v16.8h         // a2
+        abs      v3.8h, v3.8h           // a1
+        srshr    v6.8h, v6.8h, #3
+        cmhs     v18.8h, v3.8h, v16.8h  // test a1 >= a2
+        abs      v20.8h, v6.8h          // a0
+        sshr     v6.8h, v6.8h, #8       // a0_sign
+        bsl      v18.16b, v16.16b, v3.16b // a3
+        cmhs     v3.8h, v20.8h, v19.8h  // test a0 >= pq
+        sub      v4.8h, v4.8h, v6.8h    // clip_sign - a0_sign
+        uqsub    v6.8h, v20.8h, v18.8h  // a0 >= a3 ? a0-a3 : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        cmhs     v16.8h, v18.8h, v20.8h // test a3 >= a0
+        orr      v3.16b, v5.16b, v3.16b // test clip == 0 || a0 >= pq
+        mul      v0.8h, v6.8h, v0.h[1]  // a0 >= a3 ? 5*(a0-a3) : 0
+        orr      v5.16b, v3.16b, v16.16b // test clip == 0 || a0 >= pq || a3 >= a0
+        cmtst    v2.2d, v5.2d, v2.2d    // if 2nd of each group of 4 is not filtered, then none of the others in the group should be either
+        mov      w0, v5.s[1]            // move to gp reg
+        ushr     v0.8h, v0.8h, #3       // a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        mov      w2, v5.s[3]
+        orr      v2.16b, v3.16b, v2.16b
+        cmhs     v3.8h, v0.8h, v17.8h
+        and      w0, w0, w2
+        bsl      v3.16b, v17.16b, v0.16b // FFMIN(d, clip)
+        tbnz     w0, #0, 1f             // none of the 8 pixel pairs should be updated in this case
+        bic      v0.16b, v3.16b, v2.16b // set each d to zero if it should not be filtered
+        mls      v7.8h, v0.8h, v4.8h    // invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        mla      v1.8h, v0.8h, v4.8h    // invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        sqxtun   v0.8b, v7.8h
+        sqxtun   v1.8b, v1.8h
+        st1      {v0.8b}, [x3], x1
+        st1      {v1.8b}, [x3]
+1:      ret
+endfunc
+
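ff_vc1_v_loop_filter8_neon introduces the movi v2.2d, #0x0000ffff00000000 mask: cmtst tests lane 2 of each 64-bit group of the skip vector (the decision pair of each group of four) and expands its verdict to fill the group. A scalar model of one 64-bit group (sketch; broadcast_group_skip is a hypothetical name):

    #include <stdint.h>

    /* "skip" holds 0x0000 or 0xffff per 16-bit lane; lane 2 decides */
    static uint64_t broadcast_group_skip(uint64_t skip)
    {
        return (skip & UINT64_C(0x0000ffff00000000)) ? ~UINT64_C(0) : 0;
    }

The broadcast mask is ORed into the per-pair skip mask before bic zeroes the deltas, while the two extracted words (mov w0, v5.s[1] and mov w2, v5.s[3]) are ANDed so the early-out branch fires only when both groups skip.
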
+// VC-1 in-loop deblocking filter for 8 pixel pairs at boundary of horizontally-neighbouring blocks
+// On entry:
+//   x0 -> top-left pel of right block
+//   w1 = row stride, bytes
+//   w2 = PQUANT bitstream parameter
+function ff_vc1_h_loop_filter8_neon, export=1
+        sub      x3, x0, #4             // where to start reading
+        sxtw     x1, w1                 // technically, stride is signed int
+        ldr      d0, .Lcoeffs
+        ld1      {v1.8b}, [x3], x1      // P1[0], P2[0]...
+        sub      x0, x0, #1             // where to start writing
+        ld1      {v2.8b}, [x3], x1
+        add      x4, x0, x1, lsl #2
+        ld1      {v3.8b}, [x3], x1
+        ld1      {v4.8b}, [x3], x1
+        ld1      {v5.8b}, [x3], x1
+        ld1      {v6.8b}, [x3], x1
+        ld1      {v7.8b}, [x3], x1
+        trn1     v16.8b, v1.8b, v2.8b   // P1[0], P1[1], P3[0]...
+        ld1      {v17.8b}, [x3]
+        trn2     v1.8b, v1.8b, v2.8b    // P2[0], P2[1], P4[0]...
+        trn1     v2.8b, v3.8b, v4.8b    // P1[2], P1[3], P3[2]...
+        trn2     v3.8b, v3.8b, v4.8b    // P2[2], P2[3], P4[2]...
+        dup      v4.8h, w2              // pq
+        trn1     v18.8b, v5.8b, v6.8b   // P1[4], P1[5], P3[4]...
+        trn2     v5.8b, v5.8b, v6.8b    // P2[4], P2[5], P4[4]...
+        trn1     v6.4h, v16.4h, v2.4h   // P1[0], P1[1], P1[2], P1[3], P5[0]...
+        trn1     v19.4h, v1.4h, v3.4h   // P2[0], P2[1], P2[2], P2[3], P6[0]...
+        trn1     v20.8b, v7.8b, v17.8b  // P1[6], P1[7], P3[6]...
+        trn2     v7.8b, v7.8b, v17.8b   // P2[6], P2[7], P4[6]...
+        trn2     v2.4h, v16.4h, v2.4h   // P3[0], P3[1], P3[2], P3[3], P7[0]...
+        trn2     v1.4h, v1.4h, v3.4h    // P4[0], P4[1], P4[2], P4[3], P8[0]...
+        trn1     v3.4h, v18.4h, v20.4h  // P1[4], P1[5], P1[6], P1[7], P5[4]...
+        trn1     v16.4h, v5.4h, v7.4h   // P2[4], P2[5], P2[6], P2[7], P6[4]...
+        trn2     v17.4h, v18.4h, v20.4h // P3[4], P3[5], P3[6], P3[7], P7[4]...
+        trn2     v5.4h, v5.4h, v7.4h    // P4[4], P4[5], P4[6], P4[7], P8[4]...
+        trn1     v7.2s, v6.2s, v3.2s    // P1
+        trn1     v18.2s, v19.2s, v16.2s // P2
+        trn2     v3.2s, v6.2s, v3.2s    // P5
+        trn2     v6.2s, v19.2s, v16.2s  // P6
+        trn1     v16.2s, v2.2s, v17.2s  // P3
+        trn2     v2.2s, v2.2s, v17.2s   // P7
+        ushll    v7.8h, v7.8b, #1       // 2*P1
+        trn1     v17.2s, v1.2s, v5.2s   // P4
+        ushll    v19.8h, v3.8b, #1      // 2*P5
+        trn2     v1.2s, v1.2s, v5.2s    // P8
+        uxtl     v5.8h, v18.8b          // P2
+        uxtl     v6.8h, v6.8b           // P6
+        uxtl     v18.8h, v16.8b         // P3
+        mls      v7.8h, v5.8h, v0.h[1]  // 2*P1-5*P2
+        uxtl     v2.8h, v2.8b           // P7
+        ushll    v5.8h, v16.8b, #1      // 2*P3
+        mls      v19.8h, v6.8h, v0.h[1] // 2*P5-5*P6
+        uxtl     v16.8h, v17.8b         // P4
+        uxtl     v1.8h, v1.8b           // P8
+        mla      v19.8h, v2.8h, v0.h[1] // 2*P5-5*P6+5*P7
+        uxtl     v2.8h, v3.8b           // P5
+        mla      v7.8h, v18.8h, v0.h[1] // 2*P1-5*P2+5*P3
+        sub      v3.8h, v16.8h, v2.8h   // P4-P5
+        mls      v5.8h, v16.8h, v0.h[1] // 2*P3-5*P4
+        mls      v19.8h, v1.8h, v0.h[0] // 2*P5-5*P6+5*P7-2*P8
+        abs      v1.8h, v3.8h
+        sshr     v3.8h, v3.8h, #8       // clip_sign
+        mls      v7.8h, v16.8h, v0.h[0] // 2*P1-5*P2+5*P3-2*P4
+        sshr     v1.8h, v1.8h, #1       // clip
+        mla      v5.8h, v2.8h, v0.h[1]  // 2*P3-5*P4+5*P5
+        srshr    v17.8h, v19.8h, #3
+        mls      v5.8h, v6.8h, v0.h[0]  // 2*P3-5*P4+5*P5-2*P6
+        cmeq     v6.8h, v1.8h, #0       // test clip == 0
+        srshr    v7.8h, v7.8h, #3
+        abs      v17.8h, v17.8h         // a2
+        abs      v7.8h, v7.8h           // a1
+        srshr    v5.8h, v5.8h, #3
+        cmhs     v18.8h, v7.8h, v17.8h  // test a1 >= a2
+        abs      v19.8h, v5.8h          // a0
+        sshr     v5.8h, v5.8h, #8       // a0_sign
+        bsl      v18.16b, v17.16b, v7.16b // a3
+        cmhs     v4.8h, v19.8h, v4.8h   // test a0 >= pq
+        sub      v3.8h, v3.8h, v5.8h    // clip_sign - a0_sign
+        uqsub    v5.8h, v19.8h, v18.8h  // a0 >= a3 ? a0-a3 : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        cmhs     v7.8h, v18.8h, v19.8h  // test a3 >= a0
+        orr      v4.16b, v6.16b, v4.16b // test clip == 0 || a0 >= pq
+        mul      v0.8h, v5.8h, v0.h[1]  // a0 >= a3 ? 5*(a0-a3) : 0
+        orr      v5.16b, v4.16b, v7.16b // test clip == 0 || a0 >= pq || a3 >= a0
+        mov      w2, v5.s[1]            // move to gp reg
+        ushr     v0.8h, v0.8h, #3       // a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        mov      w3, v5.s[3]
+        cmhs     v5.8h, v0.8h, v1.8h
+        and      w5, w2, w3
+        bsl      v5.16b, v1.16b, v0.16b // FFMIN(d, clip)
+        tbnz     w5, #0, 2f             // none of the 8 pixel pairs should be updated in this case
+        bic      v0.16b, v5.16b, v4.16b // set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub)
+        mla      v2.8h, v0.8h, v3.8h    // invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        mls      v16.8h, v0.8h, v3.8h   // invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        sqxtun   v1.8b, v2.8h
+        sqxtun   v0.8b, v16.8h
+        tbnz     w2, #0, 1f             // none of the first 4 pixel pairs should be updated if so
+        st2      {v0.b, v1.b}[0], [x0], x1
+        st2      {v0.b, v1.b}[1], [x0], x1
+        st2      {v0.b, v1.b}[2], [x0], x1
+        st2      {v0.b, v1.b}[3], [x0]
+1:      tbnz     w3, #0, 2f             // none of the second 4 pixel pairs should be updated if so
+        st2      {v0.b, v1.b}[4], [x4], x1
+        st2      {v0.b, v1.b}[5], [x4], x1
+        st2      {v0.b, v1.b}[6], [x4], x1
+        st2      {v0.b, v1.b}[7], [x4]
+2:      ret
+endfunc
+
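The horizontal 8-pair variant finishes with two independently gated blocks of st2 stores, one per group of four rows. Each st2 lane writes back the P4/P5 pair that straddles the edge. In scalar terms (a sketch; store_pairs, p4 and p5 are hypothetical, with dst pointing one byte left of the edge as set up by sub x0, x0, #1):

    static void store_pairs(uint8_t *dst, int stride,
                            const uint8_t *p4, const uint8_t *p5)
    {
        for (int row = 0; row < 4; row++) {
            dst[row * stride + 0] = p4[row];   /* last column of left block   */
            dst[row * stride + 1] = p5[row];   /* first column of right block */
        }
    }
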
+// VC-1 in-loop deblocking filter for 16 pixel pairs at boundary of vertically-neighbouring blocks
+// On entry:
+//   x0 -> top-left pel of lower block
+//   w1 = row stride, bytes
+//   w2 = PQUANT bitstream parameter
+function ff_vc1_v_loop_filter16_neon, export=1
+        sub      x3, x0, w1, sxtw #2
+        sxtw     x1, w1                 // technically, stride is signed int
+        ldr      d0, .Lcoeffs
+        ld1      {v1.16b}, [x0], x1     // P5
+        movi     v2.2d, #0x0000ffff00000000
+        ld1      {v3.16b}, [x3], x1     // P1
+        ld1      {v4.16b}, [x3], x1     // P2
+        ld1      {v5.16b}, [x0], x1     // P6
+        ld1      {v6.16b}, [x3], x1     // P3
+        ld1      {v7.16b}, [x0], x1     // P7
+        ushll    v16.8h, v1.8b, #1      // 2*P5[0..7]
+        ushll    v17.8h, v3.8b, #1      // 2*P1[0..7]
+        ld1      {v18.16b}, [x3]        // P4
+        uxtl     v19.8h, v4.8b          // P2[0..7]
+        ld1      {v20.16b}, [x0]        // P8
+        uxtl     v21.8h, v5.8b          // P6[0..7]
+        dup      v22.8h, w2             // pq
+        ushll2   v3.8h, v3.16b, #1      // 2*P1[8..15]
+        mls      v17.8h, v19.8h, v0.h[1] // 2*P1[0..7]-5*P2[0..7]
+        ushll2   v19.8h, v1.16b, #1     // 2*P5[8..15]
+        uxtl2    v4.8h, v4.16b          // P2[8..15]
+        mls      v16.8h, v21.8h, v0.h[1] // 2*P5[0..7]-5*P6[0..7]
+        uxtl2    v5.8h, v5.16b          // P6[8..15]
+        uxtl     v23.8h, v6.8b          // P3[0..7]
+        uxtl     v24.8h, v7.8b          // P7[0..7]
+        mls      v3.8h, v4.8h, v0.h[1]  // 2*P1[8..15]-5*P2[8..15]
+        ushll    v4.8h, v6.8b, #1       // 2*P3[0..7]
+        uxtl     v25.8h, v18.8b         // P4[0..7]
+        mls      v19.8h, v5.8h, v0.h[1] // 2*P5[8..15]-5*P6[8..15]
+        uxtl2    v26.8h, v6.16b         // P3[8..15]
+        mla      v17.8h, v23.8h, v0.h[1] // 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]
+        uxtl2    v7.8h, v7.16b          // P7[8..15]
+        ushll2   v6.8h, v6.16b, #1      // 2*P3[8..15]
+        mla      v16.8h, v24.8h, v0.h[1] // 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]
+        uxtl2    v18.8h, v18.16b        // P4[8..15]
+        uxtl     v23.8h, v20.8b         // P8[0..7]
+        mls      v4.8h, v25.8h, v0.h[1] // 2*P3[0..7]-5*P4[0..7]
+        uxtl     v24.8h, v1.8b          // P5[0..7]
+        uxtl2    v20.8h, v20.16b        // P8[8..15]
+        mla      v3.8h, v26.8h, v0.h[1] // 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]
+        uxtl2    v1.8h, v1.16b          // P5[8..15]
+        sub      v26.8h, v25.8h, v24.8h // P4[0..7]-P5[0..7]
+        mla      v19.8h, v7.8h, v0.h[1] // 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]
+        sub      v7.8h, v18.8h, v1.8h   // P4[8..15]-P5[8..15]
+        mls      v6.8h, v18.8h, v0.h[1] // 2*P3[8..15]-5*P4[8..15]
+        abs      v27.8h, v26.8h
+        sshr     v26.8h, v26.8h, #8     // clip_sign[0..7]
+        mls      v17.8h, v25.8h, v0.h[0] // 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]-2*P4[0..7]
+        abs      v28.8h, v7.8h
+        sshr     v27.8h, v27.8h, #1     // clip[0..7]
+        mls      v16.8h, v23.8h, v0.h[0] // 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]-2*P8[0..7]
+        sshr     v7.8h, v7.8h, #8       // clip_sign[8..15]
+        sshr     v23.8h, v28.8h, #1     // clip[8..15]
+        mla      v4.8h, v24.8h, v0.h[1] // 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]
+        cmeq     v28.8h, v27.8h, #0     // test clip[0..7] == 0
+        srshr    v17.8h, v17.8h, #3
+        mls      v3.8h, v18.8h, v0.h[0] // 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]-2*P4[8..15]
+        cmeq     v29.8h, v23.8h, #0     // test clip[8..15] == 0
+        srshr    v16.8h, v16.8h, #3
+        mls      v19.8h, v20.8h, v0.h[0] // 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]-2*P8[8..15]
+        abs      v17.8h, v17.8h         // a1[0..7]
+        mla      v6.8h, v1.8h, v0.h[1]  // 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]
+        srshr    v3.8h, v3.8h, #3
+        mls      v4.8h, v21.8h, v0.h[0] // 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]-2*P6[0..7]
+        abs      v16.8h, v16.8h         // a2[0..7]
+        srshr    v19.8h, v19.8h, #3
+        mls      v6.8h, v5.8h, v0.h[0]  // 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]-2*P6[8..15]
+        cmhs     v5.8h, v17.8h, v16.8h  // test a1[0..7] >= a2[0..7]
+        abs      v3.8h, v3.8h           // a1[8..15]
+        srshr    v4.8h, v4.8h, #3
+        abs      v19.8h, v19.8h         // a2[8..15]
+        bsl      v5.16b, v16.16b, v17.16b // a3[0..7]
+        srshr    v6.8h, v6.8h, #3
+        cmhs     v16.8h, v3.8h, v19.8h  // test a1[8..15] >= a2[8..15]
+        abs      v17.8h, v4.8h          // a0[0..7]
+        sshr     v4.8h, v4.8h, #8       // a0_sign[0..7]
+        bsl      v16.16b, v19.16b, v3.16b // a3[8..15]
+        uqsub    v3.8h, v17.8h, v5.8h   // a0[0..7] >= a3[0..7] ? a0[0..7]-a3[0..7] : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        abs      v19.8h, v6.8h          // a0[8..15]
+        cmhs     v20.8h, v17.8h, v22.8h // test a0[0..7] >= pq
+        cmhs     v5.8h, v5.8h, v17.8h   // test a3[0..7] >= a0[0..7]
+        sub      v4.8h, v26.8h, v4.8h   // clip_sign[0..7] - a0_sign[0..7]
+        sshr     v6.8h, v6.8h, #8       // a0_sign[8..15]
+        mul      v3.8h, v3.8h, v0.h[1]  // a0[0..7] >= a3[0..7] ? 5*(a0[0..7]-a3[0..7]) : 0
+        uqsub    v17.8h, v19.8h, v16.8h // a0[8..15] >= a3[8..15] ? a0[8..15]-a3[8..15] : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        orr      v20.16b, v28.16b, v20.16b // test clip[0..7] == 0 || a0[0..7] >= pq
+        cmhs     v21.8h, v19.8h, v22.8h // test a0[8..15] >= pq
+        cmhs     v16.8h, v16.8h, v19.8h // test a3[8..15] >= a0[8..15]
+        mul      v0.8h, v17.8h, v0.h[1] // a0[8..15] >= a3[8..15] ? 5*(a0[8..15]-a3[8..15]) : 0
+        sub      v6.8h, v7.8h, v6.8h    // clip_sign[8..15] - a0_sign[8..15]
+        orr      v5.16b, v20.16b, v5.16b // test clip[0..7] == 0 || a0[0..7] >= pq || a3[0..7] >= a0[0..7]
+        ushr     v3.8h, v3.8h, #3       // a0[0..7] >= a3[0..7] ? (5*(a0[0..7]-a3[0..7]))>>3 : 0
+        orr      v7.16b, v29.16b, v21.16b // test clip[8..15] == 0 || a0[8..15] >= pq
+        cmtst    v17.2d, v5.2d, v2.2d   // if 2nd of each group of 4 is not filtered, then none of the others in the group should be either
+        mov      w0, v5.s[1]            // move to gp reg
+        cmhs     v19.8h, v3.8h, v27.8h
+        ushr     v0.8h, v0.8h, #3       // a0[8..15] >= a3[8..15] ? (5*(a0[8..15]-a3[8..15]))>>3 : 0
+        mov      w2, v5.s[3]
+        orr      v5.16b, v7.16b, v16.16b // test clip[8..15] == 0 || a0[8..15] >= pq || a3[8..15] >= a0[8..15]
+        orr      v16.16b, v20.16b, v17.16b
+        bsl      v19.16b, v27.16b, v3.16b // FFMIN(d[0..7], clip[0..7])
+        cmtst    v2.2d, v5.2d, v2.2d
+        cmhs     v3.8h, v0.8h, v23.8h
+        mov      w4, v5.s[1]
+        mov      w5, v5.s[3]
+        and      w0, w0, w2
+        bic      v5.16b, v19.16b, v16.16b // set each d[0..7] to zero if it should not be filtered because clip[0..7] == 0 || a0[0..7] >= pq (a3 > a0 case already zeroed by saturating sub)
+        orr      v2.16b, v7.16b, v2.16b
+        bsl      v3.16b, v23.16b, v0.16b // FFMIN(d[8..15], clip[8..15])
+        mls      v25.8h, v5.8h, v4.8h   // invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P4[0..7]
+        and      w2, w4, w5
+        bic      v0.16b, v3.16b, v2.16b // set each d[8..15] to zero if it should not be filtered because clip[8..15] == 0 || a0[8..15] >= pq (a3 > a0 case already zeroed by saturating sub)
+        mla      v24.8h, v5.8h, v4.8h   // invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P5[0..7]
+        and      w0, w0, w2
+        mls      v18.8h, v0.8h, v6.8h   // invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P4[8..15]
+        sqxtun   v2.8b, v25.8h
+        tbnz     w0, #0, 1f             // none of the 16 pixel pairs should be updated in this case
+        mla      v1.8h, v0.8h, v6.8h    // invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P5[8..15]
+        sqxtun   v0.8b, v24.8h
+        sqxtun2  v2.16b, v18.8h
+        sqxtun2  v0.16b, v1.8h
+        st1      {v2.16b}, [x3], x1
+        st1      {v0.16b}, [x3]
+1:      ret
+endfunc
+
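The 16-pair vertical filter runs two copies of the 8-pair dataflow at once, halves [0..7] and [8..15], so four group verdicts are in play. They are ANDed down to one bit so the branch is taken only when nothing needs storing at all (sketch; g0..g3 are hypothetical per-group skip flags as extracted by the mov-from-lane instructions):

    static int all_groups_skip(uint32_t g0, uint32_t g1, uint32_t g2, uint32_t g3)
    {
        return g0 & g1 & g2 & g3 & 1;   /* bit 0 of each extracted word */
    }

When only some groups skip, the per-group bic masking has already zeroed their deltas, so the full 16-byte row stores remain correct.
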
+// VC-1 in-loop deblocking filter for 16 pixel pairs at boundary of horizontally-neighbouring blocks
+// On entry:
+//   x0 -> top-left pel of right block
+//   w1 = row stride, bytes
+//   w2 = PQUANT bitstream parameter
+function ff_vc1_h_loop_filter16_neon, export=1
+        sub      x3, x0, #4             // where to start reading
+        sxtw     x1, w1                 // technically, stride is signed int
+        ldr      d0, .Lcoeffs
+        ld1      {v1.8b}, [x3], x1      // P1[0], P2[0]...
+        sub      x0, x0, #1             // where to start writing
+        ld1      {v2.8b}, [x3], x1
+        add      x4, x0, x1, lsl #3
+        ld1      {v3.8b}, [x3], x1
+        add      x5, x0, x1, lsl #2
+        ld1      {v4.8b}, [x3], x1
+        add      x6, x4, x1, lsl #2
+        ld1      {v5.8b}, [x3], x1
+        ld1      {v6.8b}, [x3], x1
+        ld1      {v7.8b}, [x3], x1
+        trn1     v16.8b, v1.8b, v2.8b   // P1[0], P1[1], P3[0]...
+        ld1      {v17.8b}, [x3], x1
+        trn2     v1.8b, v1.8b, v2.8b    // P2[0], P2[1], P4[0]...
+        ld1      {v2.8b}, [x3], x1
+        trn1     v18.8b, v3.8b, v4.8b   // P1[2], P1[3], P3[2]...
+        ld1      {v19.8b}, [x3], x1
+        trn2     v3.8b, v3.8b, v4.8b    // P2[2], P2[3], P4[2]...
+        ld1      {v4.8b}, [x3], x1
+        trn1     v20.8b, v5.8b, v6.8b   // P1[4], P1[5], P3[4]...
+        ld1      {v21.8b}, [x3], x1
+        trn2     v5.8b, v5.8b, v6.8b    // P2[4], P2[5], P4[4]...
+        ld1      {v6.8b}, [x3], x1
+        trn1     v22.8b, v7.8b, v17.8b  // P1[6], P1[7], P3[6]...
+        ld1      {v23.8b}, [x3], x1
+        trn2     v7.8b, v7.8b, v17.8b   // P2[6], P2[7], P4[6]...
+        ld1      {v17.8b}, [x3], x1
+        trn1     v24.8b, v2.8b, v19.8b  // P1[8], P1[9], P3[8]...
+        ld1      {v25.8b}, [x3]
+        trn2     v2.8b, v2.8b, v19.8b   // P2[8], P2[9], P4[8]...
+        trn1     v19.4h, v16.4h, v18.4h // P1[0], P1[1], P1[2], P1[3], P5[0]...
+        trn1     v26.8b, v4.8b, v21.8b  // P1[10], P1[11], P3[10]...
+        trn2     v4.8b, v4.8b, v21.8b   // P2[10], P2[11], P4[10]...
+        trn1     v21.4h, v1.4h, v3.4h   // P2[0], P2[1], P2[2], P2[3], P6[0]...
+        trn1     v27.4h, v20.4h, v22.4h // P1[4], P1[5], P1[6], P1[7], P5[4]...
+        trn1     v28.8b, v6.8b, v23.8b  // P1[12], P1[13], P3[12]...
+        trn2     v6.8b, v6.8b, v23.8b   // P2[12], P2[13], P4[12]...
+        trn1     v23.4h, v5.4h, v7.4h   // P2[4], P2[5], P2[6], P2[7], P6[4]...
+        trn1     v29.4h, v24.4h, v26.4h // P1[8], P1[9], P1[10], P1[11], P5[8]...
+        trn1     v30.8b, v17.8b, v25.8b // P1[14], P1[15], P3[14]...
+        trn2     v17.8b, v17.8b, v25.8b // P2[14], P2[15], P4[14]...
+        trn1     v25.4h, v2.4h, v4.4h   // P2[8], P2[9], P2[10], P2[11], P6[8]...
+        trn1     v31.2s, v19.2s, v27.2s // P1[0..7]
+        trn2     v19.2s, v19.2s, v27.2s // P5[0..7]
+        trn1     v27.2s, v21.2s, v23.2s // P2[0..7]
+        trn2     v21.2s, v21.2s, v23.2s // P6[0..7]
+        trn1     v23.4h, v28.4h, v30.4h // P1[12], P1[13], P1[14], P1[15], P5[12]...
+        trn2     v16.4h, v16.4h, v18.4h // P3[0], P3[1], P3[2], P3[3], P7[0]...
+        trn1     v18.4h, v6.4h, v17.4h  // P2[12], P2[13], P2[14], P2[15], P6[12]...
+        trn2     v20.4h, v20.4h, v22.4h // P3[4], P3[5], P3[6], P3[7], P7[4]...
+        trn2     v22.4h, v24.4h, v26.4h // P3[8], P3[9], P3[10], P3[11], P7[8]...
+        trn1     v24.2s, v29.2s, v23.2s // P1[8..15]
+        trn2     v23.2s, v29.2s, v23.2s // P5[8..15]
+        trn1     v26.2s, v25.2s, v18.2s // P2[8..15]
+        trn2     v18.2s, v25.2s, v18.2s // P6[8..15]
+        trn2     v25.4h, v28.4h, v30.4h // P3[12], P3[13], P3[14], P3[15], P7[12]...
+        trn2     v1.4h, v1.4h, v3.4h    // P4[0], P4[1], P4[2], P4[3], P8[0]...
+        trn2     v3.4h, v5.4h, v7.4h    // P4[4], P4[5], P4[6], P4[7], P8[4]...
+        trn2     v2.4h, v2.4h, v4.4h    // P4[8], P4[9], P4[10], P4[11], P8[8]...
+        trn2     v4.4h, v6.4h, v17.4h   // P4[12], P4[13], P4[14], P4[15], P8[12]...
+        ushll    v5.8h, v31.8b, #1      // 2*P1[0..7]
+        ushll    v6.8h, v19.8b, #1      // 2*P5[0..7]
+        trn1     v7.2s, v16.2s, v20.2s  // P3[0..7]
+        uxtl     v17.8h, v27.8b         // P2[0..7]
+        trn2     v16.2s, v16.2s, v20.2s // P7[0..7]
+        uxtl     v20.8h, v21.8b         // P6[0..7]
+        trn1     v21.2s, v22.2s, v25.2s // P3[8..15]
+        ushll    v24.8h, v24.8b, #1     // 2*P1[8..15]
+        trn2     v22.2s, v22.2s, v25.2s // P7[8..15]
+        ushll    v25.8h, v23.8b, #1     // 2*P5[8..15]
+        trn1     v27.2s, v1.2s, v3.2s   // P4[0..7]
+        uxtl     v26.8h, v26.8b         // P2[8..15]
+        mls      v5.8h, v17.8h, v0.h[1] // 2*P1[0..7]-5*P2[0..7]
+        uxtl     v17.8h, v18.8b         // P6[8..15]
+        mls      v6.8h, v20.8h, v0.h[1] // 2*P5[0..7]-5*P6[0..7]
+        trn1     v18.2s, v2.2s, v4.2s   // P4[8..15]
+        uxtl     v28.8h, v7.8b          // P3[0..7]
+        mls      v24.8h, v26.8h, v0.h[1] // 2*P1[8..15]-5*P2[8..15]
+        uxtl     v16.8h, v16.8b         // P7[0..7]
+        uxtl     v26.8h, v21.8b         // P3[8..15]
+        mls      v25.8h, v17.8h, v0.h[1] // 2*P5[8..15]-5*P6[8..15]
+        uxtl     v22.8h, v22.8b         // P7[8..15]
+        ushll    v7.8h, v7.8b, #1       // 2*P3[0..7]
+        uxtl     v27.8h, v27.8b         // P4[0..7]
+        trn2     v1.2s, v1.2s, v3.2s    // P8[0..7]
+        ushll    v3.8h, v21.8b, #1      // 2*P3[8..15]
+        trn2     v2.2s, v2.2s, v4.2s    // P8[8..15]
+        uxtl     v4.8h, v18.8b          // P4[8..15]
+        mla      v5.8h, v28.8h, v0.h[1] // 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]
+        uxtl     v1.8h, v1.8b           // P8[0..7]
+        mla      v6.8h, v16.8h, v0.h[1] // 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]
+        uxtl     v2.8h, v2.8b           // P8[8..15]
+        uxtl     v16.8h, v19.8b         // P5[0..7]
+        mla      v24.8h, v26.8h, v0.h[1] // 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]
+        uxtl     v18.8h, v23.8b         // P5[8..15]
+        dup      v19.8h, w2             // pq
+        mla      v25.8h, v22.8h, v0.h[1] // 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]
+        sub      v21.8h, v27.8h, v16.8h // P4[0..7]-P5[0..7]
+        sub      v22.8h, v4.8h, v18.8h  // P4[8..15]-P5[8..15]
+        mls      v7.8h, v27.8h, v0.h[1] // 2*P3[0..7]-5*P4[0..7]
+        abs      v23.8h, v21.8h
+        mls      v3.8h, v4.8h, v0.h[1]  // 2*P3[8..15]-5*P4[8..15]
+        abs      v26.8h, v22.8h
+        sshr     v21.8h, v21.8h, #8     // clip_sign[0..7]
+        mls      v5.8h, v27.8h, v0.h[0] // 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]-2*P4[0..7]
+        sshr     v23.8h, v23.8h, #1     // clip[0..7]
+        sshr     v26.8h, v26.8h, #1     // clip[8..15]
+        mls      v6.8h, v1.8h, v0.h[0]  // 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]-2*P8[0..7]
+        sshr     v1.8h, v22.8h, #8      // clip_sign[8..15]
+        cmeq     v22.8h, v23.8h, #0     // test clip[0..7] == 0
+        mls      v24.8h, v4.8h, v0.h[0] // 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]-2*P4[8..15]
+        cmeq     v28.8h, v26.8h, #0     // test clip[8..15] == 0
+        srshr    v5.8h, v5.8h, #3
+        mls      v25.8h, v2.8h, v0.h[0] // 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]-2*P8[8..15]
+        srshr    v2.8h, v6.8h, #3
+        mla      v7.8h, v16.8h, v0.h[1] // 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]
+        srshr    v6.8h, v24.8h, #3
+        mla      v3.8h, v18.8h, v0.h[1] // 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]
+        abs      v5.8h, v5.8h           // a1[0..7]
+        srshr    v24.8h, v25.8h, #3
+        mls      v3.8h, v17.8h, v0.h[0] // 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]-2*P6[8..15]
+        abs      v2.8h, v2.8h           // a2[0..7]
+        abs      v6.8h, v6.8h           // a1[8..15]
+        mls      v7.8h, v20.8h, v0.h[0] // 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]-2*P6[0..7]
+        abs      v17.8h, v24.8h         // a2[8..15]
+        cmhs     v20.8h, v5.8h, v2.8h   // test a1[0..7] >= a2[0..7]
+        srshr    v3.8h, v3.8h, #3
+        cmhs     v24.8h, v6.8h, v17.8h  // test a1[8..15] >= a2[8..15]
+        srshr    v7.8h, v7.8h, #3
+        bsl      v20.16b, v2.16b, v5.16b // a3[0..7]
+        abs      v2.8h, v3.8h           // a0[8..15]
+        sshr     v3.8h, v3.8h, #8       // a0_sign[8..15]
+        bsl      v24.16b, v17.16b, v6.16b // a3[8..15]
+        abs      v5.8h, v7.8h           // a0[0..7]
+        sshr     v6.8h, v7.8h, #8       // a0_sign[0..7]
+        cmhs     v7.8h, v2.8h, v19.8h   // test a0[8..15] >= pq
+        sub      v1.8h, v1.8h, v3.8h    // clip_sign[8..15] - a0_sign[8..15]
+        uqsub    v3.8h, v2.8h, v24.8h   // a0[8..15] >= a3[8..15] ? a0[8..15]-a3[8..15] : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        cmhs     v2.8h, v24.8h, v2.8h   // test a3[8..15] >= a0[8..15]
+        uqsub    v17.8h, v5.8h, v20.8h  // a0[0..7] >= a3[0..7] ? a0[0..7]-a3[0..7] : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        cmhs     v19.8h, v5.8h, v19.8h  // test a0[0..7] >= pq
+        orr      v7.16b, v28.16b, v7.16b // test clip[8..15] == 0 || a0[8..15] >= pq
+        sub      v6.8h, v21.8h, v6.8h   // clip_sign[0..7] - a0_sign[0..7]
+        mul      v3.8h, v3.8h, v0.h[1]  // a0[8..15] >= a3[8..15] ? 5*(a0[8..15]-a3[8..15]) : 0
+        cmhs     v5.8h, v20.8h, v5.8h   // test a3[0..7] >= a0[0..7]
+        orr      v19.16b, v22.16b, v19.16b // test clip[0..7] == 0 || a0[0..7] >= pq
+        mul      v0.8h, v17.8h, v0.h[1] // a0[0..7] >= a3[0..7] ? 5*(a0[0..7]-a3[0..7]) : 0
+        orr      v2.16b, v7.16b, v2.16b // test clip[8..15] == 0 || a0[8..15] >= pq || a3[8..15] >= a0[8..15]
+        orr      v5.16b, v19.16b, v5.16b // test clip[0..7] == 0 || a0[0..7] >= pq || a3[0..7] >= a0[0..7]
+        ushr     v3.8h, v3.8h, #3       // a0[8..15] >= a3[8..15] ? (5*(a0[8..15]-a3[8..15]))>>3 : 0
+        mov      w7, v2.s[1]
+        mov      w8, v2.s[3]
+        ushr     v0.8h, v0.8h, #3       // a0[0..7] >= a3[0..7] ? (5*(a0[0..7]-a3[0..7]))>>3 : 0
+        mov      w2, v5.s[1]            // move to gp reg
+        cmhs     v2.8h, v3.8h, v26.8h
+        mov      w3, v5.s[3]
+        cmhs     v5.8h, v0.8h, v23.8h
+        bsl      v2.16b, v26.16b, v3.16b // FFMIN(d[8..15], clip[8..15])
+        and      w9, w7, w8
+        bsl      v5.16b, v23.16b, v0.16b // FFMIN(d[0..7], clip[0..7])
+        and      w10, w2, w3
+        bic      v0.16b, v2.16b, v7.16b // set each d[8..15] to zero if it should not be filtered because clip[8..15] == 0 || a0[8..15] >= pq (a3 > a0 case already zeroed by saturating sub)
+        and      w9, w10, w9
+        bic      v2.16b, v5.16b, v19.16b // set each d[0..7] to zero if it should not be filtered because clip[0..7] == 0 || a0[0..7] >= pq (a3 > a0 case already zeroed by saturating sub)
+        mls      v4.8h, v0.8h, v1.8h    // invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P4
+        tbnz     w9, #0, 4f             // none of the 16 pixel pairs should be updated in this case
+        mls      v27.8h, v2.8h, v6.8h   // invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P4
+        mla      v16.8h, v2.8h, v6.8h   // invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P5
+        sqxtun   v2.8b, v4.8h
+        mla      v18.8h, v0.8h, v1.8h   // invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P5
+        sqxtun   v0.8b, v27.8h
+        sqxtun   v1.8b, v16.8h
+        sqxtun   v3.8b, v18.8h
+        tbnz     w2, #0, 1f
+        st2      {v0.b, v1.b}[0], [x0], x1
+        st2      {v0.b, v1.b}[1], [x0], x1
+        st2      {v0.b, v1.b}[2], [x0], x1
+        st2      {v0.b, v1.b}[3], [x0]
+1:      tbnz     w3, #0, 2f
+        st2      {v0.b, v1.b}[4], [x5], x1
+        st2      {v0.b, v1.b}[5], [x5], x1
+        st2      {v0.b, v1.b}[6], [x5], x1
+        st2      {v0.b, v1.b}[7], [x5]
+2:      tbnz     w7, #0, 3f
+        st2      {v2.b, v3.b}[0], [x4], x1
+        st2      {v2.b, v3.b}[1], [x4], x1
+        st2      {v2.b, v3.b}[2], [x4], x1
+        st2      {v2.b, v3.b}[3], [x4]
+3:      tbnz     w8, #0, 4f
+        st2      {v2.b, v3.b}[4], [x6], x1
+        st2      {v2.b, v3.b}[5], [x6], x1
+        st2      {v2.b, v3.b}[6], [x6], x1
+        st2      {v2.b, v3.b}[7], [x6]
+4:      ret
+endfunc
From patchwork Thu Mar 17 18:58:15 2022
X-Patchwork-Submitter: Ben Avison
X-Patchwork-Id: 34816
From: Ben Avison
To: ffmpeg-devel@ffmpeg.org
Cc: Ben Avison
Date: Thu, 17 Mar 2022 18:58:15 +0000
Message-Id: <20220317185819.466470-3-bavison@riscosopen.org>
In-Reply-To: <20220317185819.466470-1-bavison@riscosopen.org>
References: <20220317185819.466470-1-bavison@riscosopen.org>
Subject: [FFmpeg-devel] [PATCH 2/6] avcodec/vc1: Arm 32-bit NEON deblocking filter fast paths

Signed-off-by: Ben Avison
---
 libavcodec/arm/vc1dsp_init_neon.c |  14 +
 libavcodec/arm/vc1dsp_neon.S      | 643 ++++++++++++++++++++++++++++++
 2 files changed, 657 insertions(+)

diff --git a/libavcodec/arm/vc1dsp_init_neon.c b/libavcodec/arm/vc1dsp_init_neon.c
index 2cca784f5a..f5f5c702d7 100644
--- a/libavcodec/arm/vc1dsp_init_neon.c
+++ b/libavcodec/arm/vc1dsp_init_neon.c
@@ -32,6 +32,13 @@ void ff_vc1_inv_trans_4x8_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *bloc
 void ff_vc1_inv_trans_8x4_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
 void ff_vc1_inv_trans_4x4_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);

+void ff_vc1_v_loop_filter4_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_h_loop_filter4_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_v_loop_filter8_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_h_loop_filter8_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_v_loop_filter16_neon(uint8_t *src, int stride, int pq);
+void ff_vc1_h_loop_filter16_neon(uint8_t *src, int stride, int pq);
+
 void ff_put_pixels8x8_neon(uint8_t *block, const uint8_t *pixels,
                            ptrdiff_t line_size, int rnd);

@@ -92,6 +99,13 @@ av_cold void ff_vc1dsp_init_neon(VC1DSPContext *dsp)
     dsp->vc1_inv_trans_8x4_dc = ff_vc1_inv_trans_8x4_dc_neon;
     dsp->vc1_inv_trans_4x4_dc = ff_vc1_inv_trans_4x4_dc_neon;

+    dsp->vc1_v_loop_filter4  = ff_vc1_v_loop_filter4_neon;
+    dsp->vc1_h_loop_filter4  = ff_vc1_h_loop_filter4_neon;
+    dsp->vc1_v_loop_filter8  = ff_vc1_v_loop_filter8_neon;
+    dsp->vc1_h_loop_filter8  = ff_vc1_h_loop_filter8_neon;
+    dsp->vc1_v_loop_filter16 = ff_vc1_v_loop_filter16_neon;
+    dsp->vc1_h_loop_filter16 = ff_vc1_h_loop_filter16_neon;
+
     dsp->put_vc1_mspel_pixels_tab[1][ 0] = ff_put_pixels8x8_neon;
     FN_ASSIGN(1, 0);
     FN_ASSIGN(2, 0);
diff --git a/libavcodec/arm/vc1dsp_neon.S b/libavcodec/arm/vc1dsp_neon.S
index 93f043bf08..4ef083102b 100644
--- a/libavcodec/arm/vc1dsp_neon.S
+++ b/libavcodec/arm/vc1dsp_neon.S
@@ -1161,3 +1161,646 @@ function ff_vc1_inv_trans_4x4_dc_neon, export=1
         vst1.32         {d1[1]}, [r0,:32]
         bx              lr
 endfunc
+
+@ VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of vertically-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of lower block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_v_loop_filter4_neon, export=1
+        sub      r3, r0, r1, lsl #2
+        vldr     d0, .Lcoeffs
+        vld1.32  {d1[0]}, [r0], r1      @ P5
+        vld1.32  {d2[0]}, [r3], r1      @ P1
+        vld1.32  {d3[0]}, [r3], r1      @ P2
+        vld1.32  {d4[0]}, [r0], r1      @ P6
+        vld1.32  {d5[0]}, [r3], r1      @ P3
+        vld1.32  {d6[0]}, [r0], r1      @ P7
+        vld1.32  {d7[0]}, [r3]          @ P4
+        vld1.32  {d16[0]}, [r0]         @ P8
+        vshll.u8 q9, d1, #1             @ 2*P5
+        vdup.16  d17, r2                @ pq
+        vshll.u8 q10, d2, #1            @ 2*P1
+        vmovl.u8 q11, d3                @ P2
+        vmovl.u8 q1, d4                 @ P6
+        vmovl.u8 q12, d5                @ P3
+        vmls.i16 d20, d22, d0[1]        @ 2*P1-5*P2
+        vmovl.u8 q11, d6                @ P7
+        vmls.i16 d18, d2, d0[1]         @ 2*P5-5*P6
+        vshll.u8 q2, d5, #1             @ 2*P3
+        vmovl.u8 q3, d7                 @ P4
+        vmla.i16 d18, d22, d0[1]        @ 2*P5-5*P6+5*P7
+        vmovl.u8 q11, d16               @ P8
+        vmla.u16 d20, d24, d0[1]        @ 2*P1-5*P2+5*P3
+        vmovl.u8 q12, d1                @ P5
+        vmls.u16 d4, d6, d0[1]          @ 2*P3-5*P4
+        vmls.u16 d18, d22, d0[0]        @ 2*P5-5*P6+5*P7-2*P8
+        vsub.i16 d1, d6, d24            @ P4-P5
+        vmls.i16 d20, d6, d0[0]         @ 2*P1-5*P2+5*P3-2*P4
+        vmla.i16 d4, d24, d0[1]         @ 2*P3-5*P4+5*P5
+        vmls.i16 d4, d2, d0[0]          @ 2*P3-5*P4+5*P5-2*P6
+        vabs.s16 d2, d1
+        vrshr.s16 d3, d18, #3
+        vrshr.s16 d5, d20, #3
+        vshr.s16 d2, d2, #1             @ clip
+        vrshr.s16 d4, d4, #3
+        vabs.s16 d3, d3                 @ a2
+        vshr.s16 d1, d1, #8             @ clip_sign
+        vabs.s16 d5, d5                 @ a1
+        vceq.i16 d7, d2, #0             @ test clip == 0
+        vabs.s16 d16, d4                @ a0
+        vshr.s16 d4, d4, #8             @ a0_sign
+        vcge.s16 d18, d5, d3            @ test a1 >= a2
+        vcge.s16 d17, d16, d17          @ test a0 >= pq
+        vbsl     d18, d3, d5            @ a3
+        vsub.i16 d1, d1, d4             @ clip_sign - a0_sign
+        vorr     d3, d7, d17            @ test clip == 0 || a0 >= pq
+        vqsub.u16 d4, d16, d18          @ a0 >= a3 ? a0-a3 : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16 d5, d18, d16           @ test a3 >= a0
+        vmul.i16 d0, d4, d0[1]          @ a0 >= a3 ? 5*(a0-a3) : 0
+        vorr     d4, d3, d5             @ test clip == 0 || a0 >= pq || a3 >= a0
+        vmov.32  r0, d4[1]              @ move to gp reg
+        vshr.u16 d0, d0, #3             @ a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        vcge.s16 d4, d0, d2
+        tst      r0, #1
+        bne      1f                     @ none of the 4 pixel pairs should be updated if this one is not filtered
+        vbsl     d4, d2, d0             @ FFMIN(d, clip)
+        vbic     d0, d4, d3             @ set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub)
+        vmls.i16 d6, d0, d1             @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        vmla.i16 d24, d0, d1            @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        vqmovun.s16 d0, q3
+        vqmovun.s16 d1, q12
+        vst1.32  {d0[0]}, [r3], r1
+        vst1.32  {d1[0]}, [r3]
+1:      bx       lr
+endfunc
+
+@ VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of horizontally-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of right block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_h_loop_filter4_neon, export=1
+        sub      r3, r0, #4             @ where to start reading
+        vldr     d0, .Lcoeffs
+        vld1.32  {d2}, [r3], r1
+        sub      r0, r0, #1             @ where to start writing
+        vld1.32  {d4}, [r3], r1
+        vld1.32  {d3}, [r3], r1
+        vld1.32  {d5}, [r3]
+        vdup.16  d1, r2                 @ pq
+        vtrn.8   q1, q2
+        vtrn.16  d2, d3                 @ P1, P5, P3, P7
+        vtrn.16  d4, d5                 @ P2, P6, P4, P8
+        vshll.u8 q3, d2, #1             @ 2*P1, 2*P5
+        vmovl.u8 q8, d4                 @ P2, P6
+        vmovl.u8 q9, d3                 @ P3, P7
+        vmovl.u8 q2, d5                 @ P4, P8
+        vmls.i16 q3, q8, d0[1]          @ 2*P1-5*P2, 2*P5-5*P6
+        vshll.u8 q10, d3, #1            @ 2*P3, 2*P7
+        vmovl.u8 q1, d2                 @ P1, P5
+        vmla.i16 q3, q9, d0[1]          @ 2*P1-5*P2+5*P3, 2*P5-5*P6+5*P7
+        vmls.i16 q3, q2, d0[0]          @ 2*P1-5*P2+5*P3-2*P4, 2*P5-5*P6+5*P7-2*P8
+        vmov     d2, d3                 @ needs to be in an even-numbered vector for when we come to narrow it later
+        vmls.i16 d20, d4, d0[1]         @ 2*P3-5*P4
+        vmla.i16 d20, d3, d0[1]         @ 2*P3-5*P4+5*P5
+        vsub.i16 d3, d4, d2             @ P4-P5
+        vmls.i16 d20, d17, d0[0]        @ 2*P3-5*P4+5*P5-2*P6
+        vrshr.s16 q3, q3, #3
+        vabs.s16 d5, d3
+        vshr.s16 d3, d3, #8             @ clip_sign
+        vrshr.s16 d16, d20, #3
+        vabs.s16 q3, q3                 @ a1, a2
+        vshr.s16 d5, d5, #1             @ clip
+        vabs.s16 d17, d16               @ a0
+        vceq.i16 d18, d5, #0            @ test clip == 0
+        vshr.s16 d16, d16, #8           @ a0_sign
+        vcge.s16 d19, d6, d7            @ test a1 >= a2
+        vcge.s16 d1, d17, d1            @ test a0 >= pq
+        vsub.i16 d16, d3, d16           @ clip_sign - a0_sign
+        vbsl     d19, d7, d6            @ a3
+        vorr     d1, d18, d1            @ test clip == 0 || a0 >= pq
+        vqsub.u16 d3, d17, d19          @ a0 >= a3 ? a0-a3 : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16 d6, d19, d17           @ test a3 >= a0
+        vmul.i16 d0, d3, d0[1]          @ a0 >= a3 ? 5*(a0-a3) : 0
+        vorr     d3, d1, d6             @ test clip == 0 || a0 >= pq || a3 >= a0
+        vmov.32  r2, d3[1]              @ move to gp reg
+        vshr.u16 d0, d0, #3             @ a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        vcge.s16 d3, d0, d5
+        tst      r2, #1
+        bne      1f                     @ none of the 4 pixel pairs should be updated if this one is not filtered
+        vbsl     d3, d5, d0             @ FFMIN(d, clip)
+        vbic     d0, d3, d1             @ set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub)
+        vmla.i16 d2, d0, d16            @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        vmls.i16 d4, d0, d16            @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        vqmovun.s16 d1, q1
+        vqmovun.s16 d0, q2
+        vst2.8   {d0[0], d1[0]}, [r0], r1
+        vst2.8   {d0[1], d1[1]}, [r0], r1
+        vst2.8   {d0[2], d1[2]}, [r0], r1
+        vst2.8   {d0[3], d1[3]}, [r0]
+1:      bx       lr
+endfunc
+
+@ VC-1 in-loop deblocking filter for 8 pixel pairs at boundary of vertically-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of lower block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_v_loop_filter8_neon, export=1
+        sub      r3, r0, r1, lsl #2
+        vldr     d0, .Lcoeffs
+        vld1.32  {d1}, [r0], r1         @ P5
+        vld1.32  {d2}, [r3], r1         @ P1
+        vld1.32  {d3}, [r3], r1         @ P2
+        vld1.32  {d4}, [r0], r1         @ P6
+        vld1.32  {d5}, [r3], r1         @ P3
+        vld1.32  {d6}, [r0], r1         @ P7
+        vshll.u8 q8, d1, #1             @ 2*P5
+        vshll.u8 q9, d2, #1             @ 2*P1
+        vld1.32  {d7}, [r3]             @ P4
+        vmovl.u8 q1, d3                 @ P2
+        vld1.32  {d20}, [r0]            @ P8
+        vmovl.u8 q11, d4                @ P6
+        vdup.16  q12, r2                @ pq
+        vmovl.u8 q13, d5                @ P3
+        vmls.i16 q9, q1, d0[1]          @ 2*P1-5*P2
+        vmovl.u8 q1, d6                 @ P7
+        vshll.u8 q2, d5, #1             @ 2*P3
+        vmls.i16 q8, q11, d0[1]         @ 2*P5-5*P6
+        vmovl.u8 q3, d7                 @ P4
+        vmovl.u8 q10, d20               @ P8
+        vmla.i16 q8, q1, d0[1]          @ 2*P5-5*P6+5*P7
+        vmovl.u8 q1, d1                 @ P5
+        vmla.i16 q9, q13, d0[1]         @ 2*P1-5*P2+5*P3
+        vsub.i16 q13, q3, q1            @ P4-P5
+        vmls.i16 q2, q3, d0[1]          @ 2*P3-5*P4
+        vmls.i16 q8, q10, d0[0]         @ 2*P5-5*P6+5*P7-2*P8
+        vabs.s16 q10, q13
+        vshr.s16 q13, q13, #8           @ clip_sign
+        vmls.i16 q9, q3, d0[0]          @ 2*P1-5*P2+5*P3-2*P4
+        vshr.s16 q10, q10, #1           @ clip
+        vmla.i16 q2, q1, d0[1]          @ 2*P3-5*P4+5*P5
+        vrshr.s16 q8, q8, #3
+        vmls.i16 q2, q11, d0[0]         @ 2*P3-5*P4+5*P5-2*P6
+        vceq.i16 q11, q10, #0           @ test clip == 0
+        vrshr.s16 q9, q9, #3
+        vabs.s16 q8, q8                 @ a2
+        vabs.s16 q9, q9                 @ a1
+        vrshr.s16 q2, q2, #3
+        vcge.s16 q14, q9, q8            @ test a1 >= a2
+        vabs.s16 q15, q2                @ a0
+        vshr.s16 q2, q2, #8             @ a0_sign
+        vbsl     q14, q8, q9            @ a3
+        vcge.s16 q8, q15, q12           @ test a0 >= pq
+        vsub.i16 q2, q13, q2            @ clip_sign - a0_sign
+        vqsub.u16 q9, q15, q14          @ a0 >= a3 ? a0-a3 : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16 q12, q14, q15          @ test a3 >= a0
+        vorr     q8, q11, q8            @ test clip == 0 || a0 >= pq
+        vmul.i16 q0, q9, d0[1]          @ a0 >= a3 ? 5*(a0-a3) : 0
+        vorr     q9, q8, q12            @ test clip == 0 || a0 >= pq || a3 >= a0
+        vshl.i64 q11, q9, #16
+        vmov.32  r0, d18[1]             @ move to gp reg
+        vshr.u16 q0, q0, #3             @ a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        vmov.32  r2, d19[1]
+        vshr.s64 q9, q11, #48
+        vcge.s16 q11, q0, q10
+        vorr     q8, q8, q9
+        and      r0, r0, r2
+        vbsl     q11, q10, q0           @ FFMIN(d, clip)
+        tst      r0, #1
+        bne      1f                     @ none of the 8 pixel pairs should be updated in this case
+        vbic     q0, q11, q8            @ set each d to zero if it should not be filtered
+        vmls.i16 q3, q0, q2             @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        vmla.i16 q1, q0, q2             @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        vqmovun.s16 d0, q3
+        vqmovun.s16 d1, q1
+        vst1.32  {d0}, [r3], r1
+        vst1.32  {d1}, [r3]
+1:      bx       lr
+endfunc
+
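Where the A64 code used cmtst against a 64-bit mask to broadcast the decision pair's verdict, this 32-bit version uses a shift pair instead: vshl.i64 q11, q9, #16 moves lane 2 of each 64-bit group into the top half-word, and vshr.s64 q9, q11, #48 arithmetic-shifts it back so the verdict (0 or 0xffff) fills the whole group. A C model of one 64-bit group (sketch; broadcast_group_skip is a hypothetical name):

    #include <stdint.h>

    /* lane 2 (bits 32..47) is the decision pair of the group of four */
    static uint64_t broadcast_group_skip(uint64_t skip)
    {
        return (uint64_t)(((int64_t)skip << 16) >> 48);   /* 0 or all ones */
    }
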
+.align  5
+.Lcoeffs:
+.quad   0x00050002
+
+@ VC-1 in-loop deblocking filter for 8 pixel pairs at boundary of horizontally-neighbouring blocks
+@ On entry:
+@   r0 -> top-left pel of right block
+@   r1 = row stride, bytes
+@   r2 = PQUANT bitstream parameter
+function ff_vc1_h_loop_filter8_neon, export=1
+        push     {lr}
+        sub      r3, r0, #4             @ where to start reading
+        vldr     d0, .Lcoeffs
+        vld1.32  {d2}, [r3], r1         @ P1[0], P2[0]...
+        sub      r0, r0, #1             @ where to start writing
+        vld1.32  {d4}, [r3], r1
+        add      r12, r0, r1, lsl #2
+        vld1.32  {d3}, [r3], r1
+        vld1.32  {d5}, [r3], r1
+        vld1.32  {d6}, [r3], r1
+        vld1.32  {d16}, [r3], r1
+        vld1.32  {d7}, [r3], r1
+        vld1.32  {d17}, [r3]
+        vtrn.8   q1, q2                 @ P1[0], P1[1], P3[0]... P1[2], P1[3], P3[2]... P2[0], P2[1], P4[0]... P2[2], P2[3], P4[2]...
+        vdup.16  q9, r2                 @ pq
+        vtrn.16  d2, d3                 @ P1[0], P1[1], P1[2], P1[3], P5[0]... P3[0], P3[1], P3[2], P3[3], P7[0]...
+        vtrn.16  d4, d5                 @ P2[0], P2[1], P2[2], P2[3], P6[0]... P4[0], P4[1], P4[2], P4[3], P8[0]...
+        vtrn.8   q3, q8                 @ P1[4], P1[5], P3[4]... P1[6], P1[7], P3[6]... P2[4], P2[5], P4[4]... P2[6], P2[7], P4[6]...
+        vtrn.16  d6, d7                 @ P1[4], P1[5], P1[6], P1[7], P5[4]... P3[4], P3[5], P3[6], P3[7], P7[4]...
+        vtrn.16  d16, d17               @ P2[4], P2[5], P2[6], P2[7], P6[4]... P4[4], P4[5], P4[6], P4[7], P8[4]...
+        vtrn.32  d2, d6                 @ P1, P5
+        vtrn.32  d4, d16                @ P2, P6
+        vtrn.32  d3, d7                 @ P3, P7
+        vtrn.32  d5, d17                @ P4, P8
+        vshll.u8 q10, d2, #1            @ 2*P1
+        vshll.u8 q11, d6, #1            @ 2*P5
+        vmovl.u8 q12, d4                @ P2
+        vmovl.u8 q13, d16               @ P6
+        vmovl.u8 q14, d3                @ P3
+        vmls.i16 q10, q12, d0[1]        @ 2*P1-5*P2
+        vmovl.u8 q12, d7                @ P7
+        vshll.u8 q1, d3, #1             @ 2*P3
+        vmls.i16 q11, q13, d0[1]        @ 2*P5-5*P6
+        vmovl.u8 q2, d5                 @ P4
+        vmovl.u8 q8, d17                @ P8
+        vmla.i16 q11, q12, d0[1]        @ 2*P5-5*P6+5*P7
+        vmovl.u8 q3, d6                 @ P5
+        vmla.i16 q10, q14, d0[1]        @ 2*P1-5*P2+5*P3
+        vsub.i16 q12, q2, q3            @ P4-P5
+        vmls.i16 q1, q2, d0[1]          @ 2*P3-5*P4
+        vmls.i16 q11, q8, d0[0]         @ 2*P5-5*P6+5*P7-2*P8
+        vabs.s16 q8, q12
+        vshr.s16 q12, q12, #8           @ clip_sign
+        vmls.i16 q10, q2, d0[0]         @ 2*P1-5*P2+5*P3-2*P4
+        vshr.s16 q8, q8, #1             @ clip
+        vmla.i16 q1, q3, d0[1]          @ 2*P3-5*P4+5*P5
+        vrshr.s16 q11, q11, #3
+        vmls.i16 q1, q13, d0[0]         @ 2*P3-5*P4+5*P5-2*P6
+        vceq.i16 q13, q8, #0            @ test clip == 0
+        vrshr.s16 q10, q10, #3
+        vabs.s16 q11, q11               @ a2
+        vabs.s16 q10, q10               @ a1
+        vrshr.s16 q1, q1, #3
+        vcge.s16 q14, q10, q11          @ test a1 >= a2
+        vabs.s16 q15, q1                @ a0
+        vshr.s16 q1, q1, #8             @ a0_sign
+        vbsl     q14, q11, q10          @ a3
+        vcge.s16 q9, q15, q9            @ test a0 >= pq
+        vsub.i16 q1, q12, q1            @ clip_sign - a0_sign
+        vqsub.u16 q10, q15, q14         @ a0 >= a3 ? a0-a3 : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16 q11, q14, q15          @ test a3 >= a0
+        vorr     q9, q13, q9            @ test clip == 0 || a0 >= pq
+        vmul.i16 q0, q10, d0[1]         @ a0 >= a3 ? 5*(a0-a3) : 0
+        vorr     q10, q9, q11           @ test clip == 0 || a0 >= pq || a3 >= a0
+        vmov.32  r2, d20[1]             @ move to gp reg
+        vshr.u16 q0, q0, #3             @ a0 >= a3 ? (5*(a0-a3))>>3 : 0
+        vmov.32  r3, d21[1]
+        vcge.s16 q10, q0, q8
+        and      r14, r2, r3
+        vbsl     q10, q8, q0            @ FFMIN(d, clip)
+        tst      r14, #1
+        bne      2f                     @ none of the 8 pixel pairs should be updated in this case
+        vbic     q0, q10, q9            @ set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub)
+        vmla.i16 q3, q0, q1             @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5
+        vmls.i16 q2, q0, q1             @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4
+        vqmovun.s16 d1, q3
+        vqmovun.s16 d0, q2
+        tst      r2, #1
+        bne      1f                     @ none of the first 4 pixel pairs should be updated if so
+        vst2.8   {d0[0], d1[0]}, [r0], r1
+        vst2.8   {d0[1], d1[1]}, [r0], r1
+        vst2.8   {d0[2], d1[2]}, [r0], r1
+        vst2.8   {d0[3], d1[3]}, [r0]
+1:      tst      r3, #1
+        bne      2f                     @ none of the second 4 pixel pairs should be updated if so
+        vst2.8   {d0[4], d1[4]}, [r12], r1
+        vst2.8   {d0[5], d1[5]}, [r12], r1
+        vst2.8   {d0[6], d1[6]}, [r12], r1
+        vst2.8   {d0[7], d1[7]}, [r12]
+2:      pop      {pc}
+endfunc
+
5*(a0-a3) : 0 + vorr q10, q9, q11 @ test clip == 0 || a0 >= pq || a3 >= a0 + vmov.32 r2, d20[1] @ move to gp reg + vshr.u16 q0, q0, #3 @ a0 >= a3 ? (5*(a0-a3))>>3 : 0 + vmov.32 r3, d21[1] + vcge.s16 q10, q0, q8 + and r14, r2, r3 + vbsl q10, q8, q0 @ FFMIN(d, clip) + tst r14, #1 + bne 2f @ none of the 8 pixel pairs should be updated in this case + vbic q0, q10, q9 @ set each d to zero if it should not be filtered because clip == 0 || a0 >= pq (a3 > a0 case already zeroed by saturating sub) + vmla.i16 q3, q0, q1 @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P5 + vmls.i16 q2, q0, q1 @ invert d depending on clip_sign & a0_sign, or zero it if they match, and accumulate into P4 + vqmovun.s16 d1, q3 + vqmovun.s16 d0, q2 + tst r2, #1 + bne 1f @ none of the first 4 pixel pairs should be updated if so + vst2.8 {d0[0], d1[0]}, [r0], r1 + vst2.8 {d0[1], d1[1]}, [r0], r1 + vst2.8 {d0[2], d1[2]}, [r0], r1 + vst2.8 {d0[3], d1[3]}, [r0] +1: tst r3, #1 + bne 2f @ none of the second 4 pixel pairs should be updated if so + vst2.8 {d0[4], d1[4]}, [r12], r1 + vst2.8 {d0[5], d1[5]}, [r12], r1 + vst2.8 {d0[6], d1[6]}, [r12], r1 + vst2.8 {d0[7], d1[7]}, [r12] +2: pop {pc} +endfunc + +@ VC-1 in-loop deblocking filter for 16 pixel pairs at boundary of vertically-neighbouring blocks +@ On entry: +@ r0 -> top-left pel of lower block +@ r1 = row stride, bytes +@ r2 = PQUANT bitstream parameter +function ff_vc1_v_loop_filter16_neon, export=1 + vpush {d8-d15} + sub r3, r0, r1, lsl #2 + vldr d0, .Lcoeffs + vld1.64 {q1}, [r0], r1 @ P5 + vld1.64 {q2}, [r3], r1 @ P1 + vld1.64 {q3}, [r3], r1 @ P2 + vld1.64 {q4}, [r0], r1 @ P6 + vld1.64 {q5}, [r3], r1 @ P3 + vld1.64 {q6}, [r0], r1 @ P7 + vshll.u8 q7, d2, #1 @ 2*P5[0..7] + vshll.u8 q8, d4, #1 @ 2*P1[0..7] + vld1.64 {q9}, [r3] @ P4 + vmovl.u8 q10, d6 @ P2[0..7] + vld1.64 {q11}, [r0] @ P8 + vmovl.u8 q12, d8 @ P6[0..7] + vdup.16 q13, r2 @ pq + vshll.u8 q2, d5, #1 @ 2*P1[8..15] + vmls.i16 q8, q10, d0[1] @ 2*P1[0..7]-5*P2[0..7] + vshll.u8 q10, d3, #1 @ 2*P5[8..15] + vmovl.u8 q3, d7 @ P2[8..15] + vmls.i16 q7, q12, d0[1] @ 2*P5[0..7]-5*P6[0..7] + vmovl.u8 q4, d9 @ P6[8..15] + vmovl.u8 q14, d10 @ P3[0..7] + vmovl.u8 q15, d12 @ P7[0..7] + vmls.i16 q2, q3, d0[1] @ 2*P1[8..15]-5*P2[8..15] + vshll.u8 q3, d10, #1 @ 2*P3[0..7] + vmls.i16 q10, q4, d0[1] @ 2*P5[8..15]-5*P6[8..15] + vmovl.u8 q6, d13 @ P7[8..15] + vmla.i16 q8, q14, d0[1] @ 2*P1[0..7]-5*P2[0..7]+5*P3[0..7] + vmovl.u8 q14, d18 @ P4[0..7] + vmovl.u8 q9, d19 @ P4[8..15] + vmla.i16 q7, q15, d0[1] @ 2*P5[0..7]-5*P6[0..7]+5*P7[0..7] + vmovl.u8 q15, d11 @ P3[8..15] + vshll.u8 q5, d11, #1 @ 2*P3[8..15] + vmls.i16 q3, q14, d0[1] @ 2*P3[0..7]-5*P4[0..7] + vmla.i16 q2, q15, d0[1] @ 2*P1[8..15]-5*P2[8..15]+5*P3[8..15] + vmovl.u8 q15, d22 @ P8[0..7] + vmovl.u8 q11, d23 @ P8[8..15] + vmla.i16 q10, q6, d0[1] @ 2*P5[8..15]-5*P6[8..15]+5*P7[8..15] + vmovl.u8 q6, d2 @ P5[0..7] + vmovl.u8 q1, d3 @ P5[8..15] + vmls.i16 q5, q9, d0[1] @ 2*P3[8..15]-5*P4[8..15] + vmls.i16 q8, q14, d0[0] @ 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]-2*P4[0..7] + vmls.i16 q7, q15, d0[0] @ 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]-2*P8[0..7] + vsub.i16 q15, q14, q6 @ P4[0..7]-P5[0..7] + vmla.i16 q3, q6, d0[1] @ 2*P3[0..7]-5*P4[0..7]+5*P5[0..7] + vrshr.s16 q8, q8, #3 + vmls.i16 q2, q9, d0[0] @ 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]-2*P4[8..15] + vrshr.s16 q7, q7, #3 + vmls.i16 q10, q11, d0[0] @ 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]-2*P8[8..15] + vabs.s16 q11, q15 + vabs.s16 q8, q8 @ a1[0..7] + vmla.i16 q5, q1, d0[1] @ 
2*P3[8..15]-5*P4[8..15]+5*P5[8..15]
+        vshr.s16 q15, q15, #8 @ clip_sign[0..7]
+        vrshr.s16 q2, q2, #3
+        vmls.i16 q3, q12, d0[0] @ 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]-2*P6[0..7]
+        vabs.s16 q7, q7 @ a2[0..7]
+        vrshr.s16 q10, q10, #3
+        vsub.i16 q12, q9, q1 @ P4[8..15]-P5[8..15]
+        vshr.s16 q11, q11, #1 @ clip[0..7]
+        vmls.i16 q5, q4, d0[0] @ 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]-2*P6[8..15]
+        vcge.s16 q4, q8, q7 @ test a1[0..7] >= a2[0..7]
+        vabs.s16 q2, q2 @ a1[8..15]
+        vrshr.s16 q3, q3, #3
+        vabs.s16 q10, q10 @ a2[8..15]
+        vbsl q4, q7, q8 @ a3[0..7]
+        vabs.s16 q7, q12
+        vshr.s16 q8, q12, #8 @ clip_sign[8..15]
+        vrshr.s16 q5, q5, #3
+        vcge.s16 q12, q2, q10 @ test a1[8..15] >= a2[8..15]
+        vshr.s16 q7, q7, #1 @ clip[8..15]
+        vbsl q12, q10, q2 @ a3[8..15]
+        vabs.s16 q2, q3 @ a0[0..7]
+        vceq.i16 q10, q11, #0 @ test clip[0..7] == 0
+        vshr.s16 q3, q3, #8 @ a0_sign[0..7]
+        vsub.i16 q3, q15, q3 @ clip_sign[0..7] - a0_sign[0..7]
+        vcge.s16 q15, q2, q13 @ test a0[0..7] >= pq
+        vorr q10, q10, q15 @ test clip[0..7] == 0 || a0[0..7] >= pq
+        vqsub.u16 q15, q2, q4 @ a0[0..7] >= a3[0..7] ? a0[0..7]-a3[0..7] : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vcge.s16 q2, q4, q2 @ test a3[0..7] >= a0[0..7]
+        vabs.s16 q4, q5 @ a0[8..15]
+        vshr.s16 q5, q5, #8 @ a0_sign[8..15]
+        vmul.i16 q15, q15, d0[1] @ a0[0..7] >= a3[0..7] ? 5*(a0[0..7]-a3[0..7]) : 0
+        vcge.s16 q13, q4, q13 @ test a0[8..15] >= pq
+        vorr q2, q10, q2 @ test clip[0..7] == 0 || a0[0..7] >= pq || a3[0..7] >= a0[0..7]
+        vsub.i16 q5, q8, q5 @ clip_sign[8..15] - a0_sign[8..15]
+        vceq.i16 q8, q7, #0 @ test clip[8..15] == 0
+        vshr.u16 q15, q15, #3 @ a0[0..7] >= a3[0..7] ? (5*(a0[0..7]-a3[0..7]))>>3 : 0
+        vmov r0, d4[1] @ move to gp reg
+        vorr q8, q8, q13 @ test clip[8..15] == 0 || a0[8..15] >= pq
+        vqsub.u16 q13, q4, q12 @ a0[8..15] >= a3[8..15] ? a0[8..15]-a3[8..15] : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs)
+        vmov r2, d5[1]
+        vcge.s16 q4, q12, q4 @ test a3[8..15] >= a0[8..15]
+        vshl.i64 q2, q2, #16
+        vcge.s16 q12, q15, q11
+        vmul.i16 q0, q13, d0[1] @ a0[8..15] >= a3[8..15] ? 5*(a0[8..15]-a3[8..15]) : 0
+        vorr q4, q8, q4 @ test clip[8..15] == 0 || a0[8..15] >= pq || a3[8..15] >= a0[8..15]
+        vshr.s64 q2, q2, #48
+        and r0, r0, r2
+        vbsl q12, q11, q15 @ FFMIN(d[0..7], clip[0..7])
+        vshl.i64 q11, q4, #16
+        vmov r2, d8[1]
+        vshr.u16 q0, q0, #3 @ a0[8..15] >= a3[8..15] ? (5*(a0[8..15]-a3[8..15]))>>3 : 0
+        vorr q2, q10, q2
+        vmov r12, d9[1]
+        vshr.s64 q4, q11, #48
+        vcge.s16 q10, q0, q7
+        vbic q2, q12, q2 @ set each d[0..7] to zero if it should not be filtered because clip[0..7] == 0 || a0[0..7] >= pq (a3 > a0 case already zeroed by saturating sub)
+        vorr q4, q8, q4
+        and r2, r2, r12
+        vbsl q10, q7, q0 @ FFMIN(d[8..15], clip[8..15])
+        vmls.i16 q14, q2, q3 @ invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P4[0..7]
+        and r0, r0, r2
+        vbic q0, q10, q4 @ set each d[8..15] to zero if it should not be filtered because clip[8..15] == 0 || a0[8..15] >= pq (a3 > a0 case already zeroed by saturating sub)
+        tst r0, #1
+        bne 1f @ none of the 16 pixel pairs should be updated in this case
+        vmla.i16 q6, q2, q3 @ invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P5[0..7]
+        vmls.i16 q9, q0, q5 @ invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P4[8..15]
+        vqmovun.s16 d4, q14
+        vmla.i16 q1, q0, q5 @ invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P5[8..15]
+        vqmovun.s16 d0, q6
+        vqmovun.s16 d5, q9
+        vqmovun.s16 d1, q1
+        vst1.64 {q2}, [r3], r1
+        vst1.64 {q0}, [r3]
+1:      vpop {d8-d15}
+        bx lr
+endfunc
+
+@ VC-1 in-loop deblocking filter for 16 pixel pairs at boundary of horizontally-neighbouring blocks
+@ On entry:
+@ r0 -> top-left pel of right block
+@ r1 = row stride, bytes
+@ r2 = PQUANT bitstream parameter
+function ff_vc1_h_loop_filter16_neon, export=1
+        push {r4-r6,lr}
+        vpush {d8-d15}
+        sub r3, r0, #4 @ where to start reading
+        vldr d0, .Lcoeffs
+        vld1.32 {d2}, [r3], r1 @ P1[0], P2[0]...
+        sub r0, r0, #1 @ where to start writing
+        vld1.32 {d3}, [r3], r1
+        add r4, r0, r1, lsl #2
+        vld1.32 {d10}, [r3], r1
+        vld1.32 {d11}, [r3], r1
+        vld1.32 {d16}, [r3], r1
+        vld1.32 {d4}, [r3], r1
+        vld1.32 {d8}, [r3], r1
+        vtrn.8 d2, d3 @ P1[0], P1[1], P3[0]... P2[0], P2[1], P4[0]...
+        vld1.32 {d14}, [r3], r1
+        vld1.32 {d5}, [r3], r1
+        vtrn.8 d10, d11 @ P1[2], P1[3], P3[2]... P2[2], P2[3], P4[2]...
+        vld1.32 {d6}, [r3], r1
+        vld1.32 {d12}, [r3], r1
+        vtrn.8 d16, d4 @ P1[4], P1[5], P3[4]... P2[4], P2[5], P4[4]...
+        vld1.32 {d13}, [r3], r1
+        vtrn.16 d2, d10 @ P1[0], P1[1], P1[2], P1[3], P5[0]... P3[0], P3[1], P3[2], P3[3], P7[0]...
+        vld1.32 {d1}, [r3], r1
+        vtrn.8 d8, d14 @ P1[6], P1[7], P3[6]... P2[6], P2[7], P4[6]...
+        vld1.32 {d7}, [r3], r1
+        vtrn.16 d3, d11 @ P2[0], P2[1], P2[2], P2[3], P6[0]... P4[0], P4[1], P4[2], P4[3], P8[0]...
+        vld1.32 {d9}, [r3], r1
+        vtrn.8 d5, d6 @ P1[8], P1[9], P3[8]... P2[8], P2[9], P4[8]...
+        vld1.32 {d15}, [r3]
+        vtrn.16 d16, d8 @ P1[4], P1[5], P1[6], P1[7], P5[4]... P3[4], P3[5], P3[6], P3[7], P7[4]...
+        vtrn.16 d4, d14 @ P2[4], P2[5], P2[6], P2[7], P6[4]... P4[4], P4[5], P4[6], P4[7], P8[4]...
+        vtrn.8 d12, d13 @ P1[10], P1[11], P3[10]... P2[10], P2[11], P4[10]...
+        vdup.16 q9, r2 @ pq
+        vtrn.8 d1, d7 @ P1[12], P1[13], P3[12]... P2[12], P2[13], P4[12]...
+        vtrn.32 d2, d16 @ P1[0..7], P5[0..7]
+        vtrn.16 d5, d12 @ P1[8], P1[9], P1[10], P1[11], P5[8]... P3[8], P3[9], P3[10], P3[11], P7[8]...
+        vtrn.16 d6, d13 @ P2[8], P2[9], P2[10], P2[11], P6[8]... P4[8], P4[9], P4[10], P4[11], P8[8]...
+        vtrn.8 d9, d15 @ P1[14], P1[15], P3[14]... P2[14], P2[15], P4[14]...
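+        @ All vtrn.8 steps are now complete; the remaining vtrn.16 and
+        @ vtrn.32 steps below finish the 16x8 transpose, gathering each
+        @ filter tap P1..P8 into D registers holding 8 rows apiece, after
+        @ which the arithmetic is the same as in ff_vc1_v_loop_filter16_neon.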
+        vtrn.32 d3, d4 @ P2[0..7], P6[0..7]
+        vshll.u8 q10, d2, #1 @ 2*P1[0..7]
+        vtrn.32 d10, d8 @ P3[0..7], P7[0..7]
+        vshll.u8 q11, d16, #1 @ 2*P5[0..7]
+        vtrn.32 d11, d14 @ P4[0..7], P8[0..7]
+        vtrn.16 d1, d9 @ P1[12], P1[13], P1[14], P1[15], P5[12]... P3[12], P3[13], P3[14], P3[15], P7[12]...
+        vtrn.16 d7, d15 @ P2[12], P2[13], P2[14], P2[15], P6[12]... P4[12], P4[13], P4[14], P4[15], P8[12]...
+        vmovl.u8 q1, d3 @ P2[0..7]
+        vmovl.u8 q12, d4 @ P6[0..7]
+        vtrn.32 d5, d1 @ P1[8..15], P5[8..15]
+        vtrn.32 d6, d7 @ P2[8..15], P6[8..15]
+        vtrn.32 d12, d9 @ P3[8..15], P7[8..15]
+        vtrn.32 d13, d15 @ P4[8..15], P8[8..15]
+        vmls.i16 q10, q1, d0[1] @ 2*P1[0..7]-5*P2[0..7]
+        vmovl.u8 q1, d10 @ P3[0..7]
+        vshll.u8 q2, d5, #1 @ 2*P1[8..15]
+        vshll.u8 q13, d1, #1 @ 2*P5[8..15]
+        vmls.i16 q11, q12, d0[1] @ 2*P5[0..7]-5*P6[0..7]
+        vmovl.u8 q14, d6 @ P2[8..15]
+        vmovl.u8 q3, d7 @ P6[8..15]
+        vmovl.u8 q15, d8 @ P7[0..7]
+        vmla.i16 q10, q1, d0[1] @ 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]
+        vmovl.u8 q1, d12 @ P3[8..15]
+        vmls.i16 q2, q14, d0[1] @ 2*P1[8..15]-5*P2[8..15]
+        vmovl.u8 q4, d9 @ P7[8..15]
+        vshll.u8 q14, d10, #1 @ 2*P3[0..7]
+        vmls.i16 q13, q3, d0[1] @ 2*P5[8..15]-5*P6[8..15]
+        vmovl.u8 q5, d11 @ P4[0..7]
+        vmla.i16 q11, q15, d0[1] @ 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]
+        vshll.u8 q15, d12, #1 @ 2*P3[8..15]
+        vmovl.u8 q6, d13 @ P4[8..15]
+        vmla.i16 q2, q1, d0[1] @ 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]
+        vmovl.u8 q1, d14 @ P8[0..7]
+        vmovl.u8 q7, d15 @ P8[8..15]
+        vmla.i16 q13, q4, d0[1] @ 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]
+        vmovl.u8 q4, d16 @ P5[0..7]
+        vmovl.u8 q8, d1 @ P5[8..15]
+        vmls.i16 q14, q5, d0[1] @ 2*P3[0..7]-5*P4[0..7]
+        vmls.i16 q15, q6, d0[1] @ 2*P3[8..15]-5*P4[8..15]
+        vmls.i16 q10, q5, d0[0] @ 2*P1[0..7]-5*P2[0..7]+5*P3[0..7]-2*P4[0..7]
+        vmls.i16 q11, q1, d0[0] @ 2*P5[0..7]-5*P6[0..7]+5*P7[0..7]-2*P8[0..7]
+        vsub.i16 q1, q5, q4 @ P4[0..7]-P5[0..7]
+        vmls.i16 q2, q6, d0[0] @ 2*P1[8..15]-5*P2[8..15]+5*P3[8..15]-2*P4[8..15]
+        vrshr.s16 q10, q10, #3
+        vmls.i16 q13, q7, d0[0] @ 2*P5[8..15]-5*P6[8..15]+5*P7[8..15]-2*P8[8..15]
+        vsub.i16 q7, q6, q8 @ P4[8..15]-P5[8..15]
+        vrshr.s16 q11, q11, #3
+        vmla.s16 q14, q4, d0[1] @ 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]
+        vrshr.s16 q2, q2, #3
+        vmla.i16 q15, q8, d0[1] @ 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]
+        vabs.s16 q10, q10 @ a1[0..7]
+        vrshr.s16 q13, q13, #3
+        vmls.i16 q15, q3, d0[0] @ 2*P3[8..15]-5*P4[8..15]+5*P5[8..15]-2*P6[8..15]
+        vabs.s16 q3, q11 @ a2[0..7]
+        vabs.s16 q2, q2 @ a1[8..15]
+        vmls.i16 q14, q12, d0[0] @ 2*P3[0..7]-5*P4[0..7]+5*P5[0..7]-2*P6[0..7]
+        vabs.s16 q11, q1
+        vabs.s16 q12, q13 @ a2[8..15]
+        vcge.s16 q13, q10, q3 @ test a1[0..7] >= a2[0..7]
+        vshr.s16 q1, q1, #8 @ clip_sign[0..7]
+        vrshr.s16 q15, q15, #3
+        vshr.s16 q11, q11, #1 @ clip[0..7]
+        vrshr.s16 q14, q14, #3
+        vbsl q13, q3, q10 @ a3[0..7]
+        vcge.s16 q3, q2, q12 @ test a1[8..15] >= a2[8..15]
+        vabs.s16 q10, q15 @ a0[8..15]
+        vshr.s16 q15, q15, #8 @ a0_sign[8..15]
+        vbsl q3, q12, q2 @ a3[8..15]
+        vabs.s16 q2, q14 @ a0[0..7]
+        vabs.s16 q12, q7
+        vshr.s16 q7, q7, #8 @ clip_sign[8..15]
+        vshr.s16 q14, q14, #8 @ a0_sign[0..7]
+        vshr.s16 q12, q12, #1 @ clip[8..15]
+        vsub.i16 q7, q7, q15 @ clip_sign[8..15] - a0_sign[8..15]
+        vqsub.u16 q15, q10, q3 @ a0[8..15] >= a3[8..15] ?
a0[8..15]-a3[8..15] : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs) + vcge.s16 q3, q3, q10 @ test a3[8..15] >= a0[8..15] + vcge.s16 q10, q10, q9 @ test a0[8..15] >= pq + vcge.s16 q9, q2, q9 @ test a0[0..7] >= pq + vsub.i16 q1, q1, q14 @ clip_sign[0..7] - a0_sign[0..7] + vqsub.u16 q14, q2, q13 @ a0[0..7] >= a3[0..7] ? a0[0..7]-a3[0..7] : 0 (a0 > a3 in all cases where filtering is enabled, so makes more sense to subtract this way round than the opposite and then taking the abs) + vcge.s16 q2, q13, q2 @ test a3[0..7] >= a0[0..7] + vmul.i16 q13, q15, d0[1] @ a0[8..15] >= a3[8..15] ? 5*(a0[8..15]-a3[8..15]) : 0 + vceq.i16 q15, q11, #0 @ test clip[0..7] == 0 + vmul.i16 q0, q14, d0[1] @ a0[0..7] >= a3[0..7] ? 5*(a0[0..7]-a3[0..7]) : 0 + vorr q9, q15, q9 @ test clip[0..7] == 0 || a0[0..7] >= pq + vceq.i16 q14, q12, #0 @ test clip[8..15] == 0 + vshr.u16 q13, q13, #3 @ a0[8..15] >= a3[8..15] ? (5*(a0[8..15]-a3[8..15]))>>3 : 0 + vorr q2, q9, q2 @ test clip[0..7] == 0 || a0[0..7] >= pq || a3[0..7] >= a0[0..7] + vshr.u16 q0, q0, #3 @ a0[0..7] >= a3[0..7] ? (5*(a0[0..7]-a3[0..7]))>>3 : 0 + vorr q10, q14, q10 @ test clip[8..15] == 0 || a0[8..15] >= pq + vcge.s16 q14, q13, q12 + vmov.32 r2, d4[1] @ move to gp reg + vorr q3, q10, q3 @ test clip[8..15] == 0 || a0[8..15] >= pq || a3[8..15] >= a0[8..15] + vmov.32 r3, d5[1] + vcge.s16 q2, q0, q11 + vbsl q14, q12, q13 @ FFMIN(d[8..15], clip[8..15]) + vbsl q2, q11, q0 @ FFMIN(d[0..7], clip[0..7]) + vmov.32 r5, d6[1] + vbic q0, q14, q10 @ set each d[8..15] to zero if it should not be filtered because clip[8..15] == 0 || a0[8..15] >= pq (a3 > a0 case already zeroed by saturating sub) + vmov.32 r6, d7[1] + and r12, r2, r3 + vbic q2, q2, q9 @ set each d[0..7] to zero if it should not be filtered because clip[0..7] == 0 || a0[0..7] >= pq (a3 > a0 case already zeroed by saturating sub) + vmls.i16 q6, q0, q7 @ invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P4 + vmls.i16 q5, q2, q1 @ invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P4 + and r14, r5, r6 + vmla.i16 q4, q2, q1 @ invert d[0..7] depending on clip_sign[0..7] & a0_sign[0..7], or zero it if they match, and accumulate into P5 + and r12, r12, r14 + vqmovun.s16 d4, q6 + vmla.i16 q8, q0, q7 @ invert d[8..15] depending on clip_sign[8..15] & a0_sign[8..15], or zero it if they match, and accumulate into P5 + tst r12, #1 + bne 4f @ none of the 16 pixel pairs should be updated in this case + vqmovun.s16 d2, q5 + vqmovun.s16 d3, q4 + vqmovun.s16 d5, q8 + tst r2, #1 + bne 1f + vst2.8 {d2[0], d3[0]}, [r0], r1 + vst2.8 {d2[1], d3[1]}, [r0], r1 + vst2.8 {d2[2], d3[2]}, [r0], r1 + vst2.8 {d2[3], d3[3]}, [r0] +1: add r0, r4, r1, lsl #2 + tst r3, #1 + bne 2f + vst2.8 {d2[4], d3[4]}, [r4], r1 + vst2.8 {d2[5], d3[5]}, [r4], r1 + vst2.8 {d2[6], d3[6]}, [r4], r1 + vst2.8 {d2[7], d3[7]}, [r4] +2: add r4, r0, r1, lsl #2 + tst r5, #1 + bne 3f + vst2.8 {d4[0], d5[0]}, [r0], r1 + vst2.8 {d4[1], d5[1]}, [r0], r1 + vst2.8 {d4[2], d5[2]}, [r0], r1 + vst2.8 {d4[3], d5[3]}, [r0] +3: tst r6, #1 + bne 4f + vst2.8 {d4[4], d5[4]}, [r4], r1 + vst2.8 {d4[5], d5[5]}, [r4], r1 + vst2.8 {d4[6], d5[6]}, [r4], r1 + vst2.8 {d4[7], d5[7]}, [r4] +4: vpop {d8-d15} + pop {r4-r6,pc} +endfunc From patchwork Thu Mar 17 18:58:16 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: 
Ben Avison
X-Patchwork-Id: 34817
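For reference, the decision logic that each of the loop-filter functions above applies to a pixel pair can be written in scalar form as below. This is a sketch derived from the comments in the assembly, shown for the vertical case where src points at P5, the first row of the lower block; the helper names are illustrative, and the upstream C reference is vc1_filter_line() in libavcodec/vc1dsp.c. The three rejection tests correspond to the vorr/vbic chains, and the clip_sign/a0_sign handling to the vmls/vmla accumulations; in the 4-pixel-wide filters the result for the third pixel pair decides whether the whole edge is filtered (the tst/bne after the vmov to a general-purpose register).

    #include <stdint.h>
    #include <stdlib.h>

    #define P(n) src[((n) - 5) * (ptrdiff_t)stride] /* P5 = src[0] */

    static inline int min_int(int a, int b) { return a < b ? a : b; }

    static int filter_pair_sketch(uint8_t *src, int stride, int pq)
    {
        int a0 = (2 * P(3) - 5 * P(4) + 5 * P(5) - 2 * P(6) + 4) >> 3;
        int a1 = abs((2 * P(1) - 5 * P(2) + 5 * P(3) - 2 * P(4) + 4) >> 3);
        int a2 = abs((2 * P(5) - 5 * P(6) + 5 * P(7) - 2 * P(8) + 4) >> 3);
        int a3 = min_int(a1, a2);
        int clip = abs(P(4) - P(5)) >> 1;

        if (clip == 0 || abs(a0) >= pq || a3 >= abs(a0))
            return 0;                      /* pair is not filtered */

        int d = min_int((5 * (abs(a0) - a3)) >> 3, clip);
        int clip_sign = P(4) - P(5) < 0 ? -1 : 0;
        int a0_sign   = a0 < 0 ? -1 : 0;
        int mult = clip_sign - a0_sign;    /* 0 if the signs match, else +/-1 */
        P(4) = (uint8_t)(P(4) - mult * d); /* the NEON code saturates via vqmovun */
        P(5) = (uint8_t)(P(5) + mult * d);
        return 1;
    }

Note that d stays within [0, clip], so the two updated pixels only move toward each other, never past one another.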
From: Ben Avison
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 17 Mar 2022 18:58:16 +0000
Message-Id: <20220317185819.466470-4-bavison@riscosopen.org>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220317185819.466470-1-bavison@riscosopen.org>
References: <20220317185819.466470-1-bavison@riscosopen.org>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 3/6] avcodec/vc1: Arm 64-bit NEON inverse transform fast paths
Cc: Ben Avison

Signed-off-by: Ben Avison
---
 libavcodec/aarch64/vc1dsp_init_aarch64.c |  19 +
 libavcodec/aarch64/vc1dsp_neon.S         | 678 +++++++++++++++++++++++
 2 files changed, 697 insertions(+)

diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b/libavcodec/aarch64/vc1dsp_init_aarch64.c
index edfb296b75..b672b2aa99 100644
--- a/libavcodec/aarch64/vc1dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vc1dsp_init_aarch64.c
@@ -25,6 +25,16 @@
 
 #include "config.h"
 
+void ff_vc1_inv_trans_8x8_neon(int16_t *block);
+void ff_vc1_inv_trans_8x4_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_4x8_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_4x4_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+
+void ff_vc1_inv_trans_8x8_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_8x4_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_4x8_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+void ff_vc1_inv_trans_4x4_dc_neon(uint8_t *dest, ptrdiff_t stride, int16_t *block);
+
 void ff_vc1_v_loop_filter4_neon(uint8_t *src, int stride, int pq);
 void ff_vc1_h_loop_filter4_neon(uint8_t *src, int stride, int pq);
 void ff_vc1_v_loop_filter8_neon(uint8_t *src, int stride, int pq);
@@ -46,6 +56,15 @@ av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
     int cpu_flags = av_get_cpu_flags();
 
     if (have_neon(cpu_flags)) {
+        dsp->vc1_inv_trans_8x8 = ff_vc1_inv_trans_8x8_neon;
+        dsp->vc1_inv_trans_8x4 = ff_vc1_inv_trans_8x4_neon;
+        dsp->vc1_inv_trans_4x8 = ff_vc1_inv_trans_4x8_neon;
+        dsp->vc1_inv_trans_4x4 = ff_vc1_inv_trans_4x4_neon;
+
dsp->vc1_inv_trans_8x8_dc = ff_vc1_inv_trans_8x8_dc_neon; + dsp->vc1_inv_trans_8x4_dc = ff_vc1_inv_trans_8x4_dc_neon; + dsp->vc1_inv_trans_4x8_dc = ff_vc1_inv_trans_4x8_dc_neon; + dsp->vc1_inv_trans_4x4_dc = ff_vc1_inv_trans_4x4_dc_neon; + dsp->vc1_v_loop_filter4 = ff_vc1_v_loop_filter4_neon; dsp->vc1_h_loop_filter4 = ff_vc1_h_loop_filter4_neon; dsp->vc1_v_loop_filter8 = ff_vc1_v_loop_filter8_neon; diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S index fe8963545a..c3ca3eae1e 100644 --- a/libavcodec/aarch64/vc1dsp_neon.S +++ b/libavcodec/aarch64/vc1dsp_neon.S @@ -22,7 +22,685 @@ #include "libavutil/aarch64/asm.S" +// VC-1 8x8 inverse transform +// On entry: +// x0 -> array of 16-bit inverse transform coefficients, in column-major order +// On exit: +// array at x0 updated to hold transformed block; also now held in row-major order +function ff_vc1_inv_trans_8x8_neon, export=1 + ld1 {v1.16b, v2.16b}, [x0], #32 + ld1 {v3.16b, v4.16b}, [x0], #32 + ld1 {v5.16b, v6.16b}, [x0], #32 + shl v1.8h, v1.8h, #2 // 8/2 * src[0] + sub x1, x0, #3*32 + ld1 {v16.16b, v17.16b}, [x0] + shl v7.8h, v2.8h, #4 // 16 * src[8] + shl v18.8h, v2.8h, #2 // 4 * src[8] + shl v19.8h, v4.8h, #4 // 16 * src[24] + ldr d0, .Lcoeffs_it8 + shl v5.8h, v5.8h, #2 // 8/2 * src[32] + shl v20.8h, v6.8h, #4 // 16 * src[40] + shl v21.8h, v6.8h, #2 // 4 * src[40] + shl v22.8h, v17.8h, #4 // 16 * src[56] + ssra v20.8h, v19.8h, #2 // 4 * src[24] + 16 * src[40] + mul v23.8h, v3.8h, v0.h[0] // 6/2 * src[16] + sub v19.8h, v19.8h, v21.8h // 16 * src[24] - 4 * src[40] + ssra v7.8h, v22.8h, #2 // 16 * src[8] + 4 * src[56] + sub v18.8h, v22.8h, v18.8h // - 4 * src[8] + 16 * src[56] + shl v3.8h, v3.8h, #3 // 16/2 * src[16] + mls v20.8h, v2.8h, v0.h[2] // - 15 * src[8] + 4 * src[24] + 16 * src[40] + ssra v1.8h, v1.8h, #1 // 12/2 * src[0] + ssra v5.8h, v5.8h, #1 // 12/2 * src[32] + mla v7.8h, v4.8h, v0.h[2] // 16 * src[8] + 15 * src[24] + 4 * src[56] + shl v21.8h, v16.8h, #3 // 16/2 * src[48] + mls v19.8h, v2.8h, v0.h[1] // - 9 * src[8] + 16 * src[24] - 4 * src[40] + sub v2.8h, v23.8h, v21.8h // t4/2 = 6/2 * src[16] - 16/2 * src[48] + mla v18.8h, v4.8h, v0.h[1] // - 4 * src[8] + 9 * src[24] + 16 * src[56] + add v4.8h, v1.8h, v5.8h // t1/2 = 12/2 * src[0] + 12/2 * src[32] + sub v1.8h, v1.8h, v5.8h // t2/2 = 12/2 * src[0] - 12/2 * src[32] + mla v3.8h, v16.8h, v0.h[0] // t3/2 = 16/2 * src[16] + 6/2 * src[48] + mla v7.8h, v6.8h, v0.h[1] // t1 = 16 * src[8] + 15 * src[24] + 9 * src[40] + 4 * src[56] + add v5.8h, v1.8h, v2.8h // t6/2 = t2/2 + t4/2 + sub v16.8h, v1.8h, v2.8h // t7/2 = t2/2 - t4/2 + mla v20.8h, v17.8h, v0.h[1] // -t2 = - 15 * src[8] + 4 * src[24] + 16 * src[40] + 9 * src[56] + add v21.8h, v1.8h, v2.8h // t6/2 = t2/2 + t4/2 + add v22.8h, v4.8h, v3.8h // t5/2 = t1/2 + t3/2 + mls v19.8h, v17.8h, v0.h[2] // -t3 = - 9 * src[8] + 16 * src[24] - 4 * src[40] - 15 * src[56] + sub v17.8h, v4.8h, v3.8h // t8/2 = t1/2 - t3/2 + add v23.8h, v4.8h, v3.8h // t5/2 = t1/2 + t3/2 + mls v18.8h, v6.8h, v0.h[2] // -t4 = - 4 * src[8] + 9 * src[24] - 15 * src[40] + 16 * src[56] + sub v1.8h, v1.8h, v2.8h // t7/2 = t2/2 - t4/2 + sub v2.8h, v4.8h, v3.8h // t8/2 = t1/2 - t3/2 + neg v3.8h, v7.8h // -t1 + neg v4.8h, v20.8h // +t2 + neg v6.8h, v19.8h // +t3 + ssra v22.8h, v7.8h, #1 // (t5 + t1) >> 1 + ssra v1.8h, v19.8h, #1 // (t7 - t3) >> 1 + neg v7.8h, v18.8h // +t4 + ssra v5.8h, v4.8h, #1 // (t6 + t2) >> 1 + ssra v16.8h, v6.8h, #1 // (t7 + t3) >> 1 + ssra v2.8h, v18.8h, #1 // (t8 - t4) >> 1 + ssra v17.8h, v7.8h, #1 // (t8 + t4) >> 1 + 
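+// The even-part terms are computed at half scale (t1/2 .. t8/2) so that
+// intermediates stay within the 16-bit lanes; each ssra folds a full-scale
+// odd-part term into a halved even-part term, yielding (tN +/- tM) >> 1.
+// In the second pass the srsra forms additionally fold in a +1, which is
+// where the 65 in the final rounding comments comes from.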
ssra v21.8h, v20.8h, #1 // (t6 - t2) >> 1 + ssra v23.8h, v3.8h, #1 // (t5 - t1) >> 1 + srshr v3.8h, v22.8h, #2 // (t5 + t1 + 4) >> 3 + srshr v4.8h, v5.8h, #2 // (t6 + t2 + 4) >> 3 + srshr v5.8h, v16.8h, #2 // (t7 + t3 + 4) >> 3 + srshr v6.8h, v17.8h, #2 // (t8 + t4 + 4) >> 3 + srshr v2.8h, v2.8h, #2 // (t8 - t4 + 4) >> 3 + srshr v1.8h, v1.8h, #2 // (t7 - t3 + 4) >> 3 + srshr v7.8h, v21.8h, #2 // (t6 - t2 + 4) >> 3 + srshr v16.8h, v23.8h, #2 // (t5 - t1 + 4) >> 3 + trn2 v17.8h, v3.8h, v4.8h + trn2 v18.8h, v5.8h, v6.8h + trn2 v19.8h, v2.8h, v1.8h + trn2 v20.8h, v7.8h, v16.8h + trn1 v21.4s, v17.4s, v18.4s + trn2 v17.4s, v17.4s, v18.4s + trn1 v18.4s, v19.4s, v20.4s + trn2 v19.4s, v19.4s, v20.4s + trn1 v3.8h, v3.8h, v4.8h + trn2 v4.2d, v21.2d, v18.2d + trn1 v20.2d, v17.2d, v19.2d + trn1 v5.8h, v5.8h, v6.8h + trn1 v1.8h, v2.8h, v1.8h + trn1 v2.8h, v7.8h, v16.8h + trn1 v6.2d, v21.2d, v18.2d + trn2 v7.2d, v17.2d, v19.2d + shl v16.8h, v20.8h, #4 // 16 * src[24] + shl v17.8h, v4.8h, #4 // 16 * src[40] + trn1 v18.4s, v3.4s, v5.4s + trn1 v19.4s, v1.4s, v2.4s + shl v21.8h, v7.8h, #4 // 16 * src[56] + shl v22.8h, v6.8h, #2 // 4 * src[8] + shl v23.8h, v4.8h, #2 // 4 * src[40] + trn2 v3.4s, v3.4s, v5.4s + trn2 v1.4s, v1.4s, v2.4s + shl v2.8h, v6.8h, #4 // 16 * src[8] + sub v5.8h, v16.8h, v23.8h // 16 * src[24] - 4 * src[40] + ssra v17.8h, v16.8h, #2 // 4 * src[24] + 16 * src[40] + sub v16.8h, v21.8h, v22.8h // - 4 * src[8] + 16 * src[56] + trn1 v22.2d, v18.2d, v19.2d + trn2 v18.2d, v18.2d, v19.2d + trn1 v19.2d, v3.2d, v1.2d + ssra v2.8h, v21.8h, #2 // 16 * src[8] + 4 * src[56] + mls v17.8h, v6.8h, v0.h[2] // - 15 * src[8] + 4 * src[24] + 16 * src[40] + shl v21.8h, v22.8h, #2 // 8/2 * src[0] + shl v18.8h, v18.8h, #2 // 8/2 * src[32] + mls v5.8h, v6.8h, v0.h[1] // - 9 * src[8] + 16 * src[24] - 4 * src[40] + shl v6.8h, v19.8h, #3 // 16/2 * src[16] + trn2 v1.2d, v3.2d, v1.2d + mla v16.8h, v20.8h, v0.h[1] // - 4 * src[8] + 9 * src[24] + 16 * src[56] + ssra v21.8h, v21.8h, #1 // 12/2 * src[0] + ssra v18.8h, v18.8h, #1 // 12/2 * src[32] + mul v3.8h, v19.8h, v0.h[0] // 6/2 * src[16] + shl v19.8h, v1.8h, #3 // 16/2 * src[48] + mla v2.8h, v20.8h, v0.h[2] // 16 * src[8] + 15 * src[24] + 4 * src[56] + add v20.8h, v21.8h, v18.8h // t1/2 = 12/2 * src[0] + 12/2 * src[32] + mla v6.8h, v1.8h, v0.h[0] // t3/2 = 16/2 * src[16] + 6/2 * src[48] + sub v1.8h, v21.8h, v18.8h // t2/2 = 12/2 * src[0] - 12/2 * src[32] + sub v3.8h, v3.8h, v19.8h // t4/2 = 6/2 * src[16] - 16/2 * src[48] + mla v17.8h, v7.8h, v0.h[1] // -t2 = - 15 * src[8] + 4 * src[24] + 16 * src[40] + 9 * src[56] + mls v5.8h, v7.8h, v0.h[2] // -t3 = - 9 * src[8] + 16 * src[24] - 4 * src[40] - 15 * src[56] + add v7.8h, v1.8h, v3.8h // t6/2 = t2/2 + t4/2 + add v18.8h, v20.8h, v6.8h // t5/2 = t1/2 + t3/2 + mls v16.8h, v4.8h, v0.h[2] // -t4 = - 4 * src[8] + 9 * src[24] - 15 * src[40] + 16 * src[56] + sub v19.8h, v1.8h, v3.8h // t7/2 = t2/2 - t4/2 + neg v21.8h, v17.8h // +t2 + mla v2.8h, v4.8h, v0.h[1] // t1 = 16 * src[8] + 15 * src[24] + 9 * src[40] + 4 * src[56] + sub v0.8h, v20.8h, v6.8h // t8/2 = t1/2 - t3/2 + neg v4.8h, v5.8h // +t3 + sub v22.8h, v1.8h, v3.8h // t7/2 = t2/2 - t4/2 + sub v23.8h, v20.8h, v6.8h // t8/2 = t1/2 - t3/2 + neg v24.8h, v16.8h // +t4 + add v6.8h, v20.8h, v6.8h // t5/2 = t1/2 + t3/2 + add v1.8h, v1.8h, v3.8h // t6/2 = t2/2 + t4/2 + ssra v7.8h, v21.8h, #1 // (t6 + t2) >> 1 + neg v3.8h, v2.8h // -t1 + ssra v18.8h, v2.8h, #1 // (t5 + t1) >> 1 + ssra v19.8h, v4.8h, #1 // (t7 + t3) >> 1 + ssra v0.8h, v24.8h, #1 // (t8 + t4) >> 1 + srsra v23.8h, 
v16.8h, #1 // (t8 - t4 + 1) >> 1 + srsra v22.8h, v5.8h, #1 // (t7 - t3 + 1) >> 1 + srsra v1.8h, v17.8h, #1 // (t6 - t2 + 1) >> 1 + srsra v6.8h, v3.8h, #1 // (t5 - t1 + 1) >> 1 + srshr v2.8h, v18.8h, #6 // (t5 + t1 + 64) >> 7 + srshr v3.8h, v7.8h, #6 // (t6 + t2 + 64) >> 7 + srshr v4.8h, v19.8h, #6 // (t7 + t3 + 64) >> 7 + srshr v5.8h, v0.8h, #6 // (t8 + t4 + 64) >> 7 + srshr v16.8h, v23.8h, #6 // (t8 - t4 + 65) >> 7 + srshr v17.8h, v22.8h, #6 // (t7 - t3 + 65) >> 7 + st1 {v2.16b, v3.16b}, [x1], #32 + srshr v0.8h, v1.8h, #6 // (t6 - t2 + 65) >> 7 + srshr v1.8h, v6.8h, #6 // (t5 - t1 + 65) >> 7 + st1 {v4.16b, v5.16b}, [x1], #32 + st1 {v16.16b, v17.16b}, [x1], #32 + st1 {v0.16b, v1.16b}, [x1] + ret +endfunc + +// VC-1 8x4 inverse transform +// On entry: +// x0 -> array of 8-bit samples, in row-major order +// x1 = row stride for 8-bit sample array +// x2 -> array of 16-bit inverse transform coefficients, in row-major order +// On exit: +// array at x0 updated by saturated addition of (narrowed) transformed block +function ff_vc1_inv_trans_8x4_neon, export=1 + ld1 {v1.8b, v2.8b, v3.8b, v4.8b}, [x2], #32 + mov x3, x0 + ld1 {v16.8b, v17.8b, v18.8b, v19.8b}, [x2] + ldr q0, .Lcoeffs_it8 // includes 4-point coefficients in upper half of vector + ld1 {v5.8b}, [x0], x1 + trn2 v6.4h, v1.4h, v3.4h + trn2 v7.4h, v2.4h, v4.4h + trn1 v1.4h, v1.4h, v3.4h + trn1 v2.4h, v2.4h, v4.4h + trn2 v3.4h, v16.4h, v18.4h + trn2 v4.4h, v17.4h, v19.4h + trn1 v16.4h, v16.4h, v18.4h + trn1 v17.4h, v17.4h, v19.4h + ld1 {v18.8b}, [x0], x1 + trn1 v19.2s, v6.2s, v3.2s + trn2 v3.2s, v6.2s, v3.2s + trn1 v6.2s, v7.2s, v4.2s + trn2 v4.2s, v7.2s, v4.2s + trn1 v7.2s, v1.2s, v16.2s + trn1 v20.2s, v2.2s, v17.2s + shl v21.4h, v19.4h, #4 // 16 * src[1] + trn2 v1.2s, v1.2s, v16.2s + shl v16.4h, v3.4h, #4 // 16 * src[3] + trn2 v2.2s, v2.2s, v17.2s + shl v17.4h, v6.4h, #4 // 16 * src[5] + ld1 {v22.8b}, [x0], x1 + shl v23.4h, v4.4h, #4 // 16 * src[7] + mul v24.4h, v1.4h, v0.h[0] // 6/2 * src[2] + ld1 {v25.8b}, [x0] + shl v26.4h, v19.4h, #2 // 4 * src[1] + shl v27.4h, v6.4h, #2 // 4 * src[5] + ssra v21.4h, v23.4h, #2 // 16 * src[1] + 4 * src[7] + ssra v17.4h, v16.4h, #2 // 4 * src[3] + 16 * src[5] + sub v23.4h, v23.4h, v26.4h // - 4 * src[1] + 16 * src[7] + sub v16.4h, v16.4h, v27.4h // 16 * src[3] - 4 * src[5] + shl v7.4h, v7.4h, #2 // 8/2 * src[0] + shl v20.4h, v20.4h, #2 // 8/2 * src[4] + mla v21.4h, v3.4h, v0.h[2] // 16 * src[1] + 15 * src[3] + 4 * src[7] + shl v1.4h, v1.4h, #3 // 16/2 * src[2] + mls v17.4h, v19.4h, v0.h[2] // - 15 * src[1] + 4 * src[3] + 16 * src[5] + ssra v7.4h, v7.4h, #1 // 12/2 * src[0] + mls v16.4h, v19.4h, v0.h[1] // - 9 * src[1] + 16 * src[3] - 4 * src[5] + ssra v20.4h, v20.4h, #1 // 12/2 * src[4] + mla v23.4h, v3.4h, v0.h[1] // - 4 * src[1] + 9 * src[3] + 16 * src[7] + shl v3.4h, v2.4h, #3 // 16/2 * src[6] + mla v1.4h, v2.4h, v0.h[0] // t3/2 = 16/2 * src[2] + 6/2 * src[6] + mla v21.4h, v6.4h, v0.h[1] // t1 = 16 * src[1] + 15 * src[3] + 9 * src[5] + 4 * src[7] + mla v17.4h, v4.4h, v0.h[1] // -t2 = - 15 * src[1] + 4 * src[3] + 16 * src[5] + 9 * src[7] + sub v2.4h, v24.4h, v3.4h // t4/2 = 6/2 * src[2] - 16/2 * src[6] + mls v16.4h, v4.4h, v0.h[2] // -t3 = - 9 * src[1] + 16 * src[3] - 4 * src[5] - 15 * src[7] + add v3.4h, v7.4h, v20.4h // t1/2 = 12/2 * src[0] + 12/2 * src[4] + mls v23.4h, v6.4h, v0.h[2] // -t4 = - 4 * src[1] + 9 * src[3] - 15 * src[5] + 16 * src[7] + sub v4.4h, v7.4h, v20.4h // t2/2 = 12/2 * src[0] - 12/2 * src[4] + neg v6.4h, v21.4h // -t1 + add v7.4h, v3.4h, v1.4h // t5/2 = t1/2 + t3/2 + sub 
v19.4h, v3.4h, v1.4h // t8/2 = t1/2 - t3/2 + add v20.4h, v4.4h, v2.4h // t6/2 = t2/2 + t4/2 + sub v24.4h, v4.4h, v2.4h // t7/2 = t2/2 - t4/2 + add v26.4h, v3.4h, v1.4h // t5/2 = t1/2 + t3/2 + add v27.4h, v4.4h, v2.4h // t6/2 = t2/2 + t4/2 + sub v2.4h, v4.4h, v2.4h // t7/2 = t2/2 - t4/2 + sub v1.4h, v3.4h, v1.4h // t8/2 = t1/2 - t3/2 + neg v3.4h, v17.4h // +t2 + neg v4.4h, v16.4h // +t3 + neg v28.4h, v23.4h // +t4 + ssra v7.4h, v21.4h, #1 // (t5 + t1) >> 1 + ssra v1.4h, v23.4h, #1 // (t8 - t4) >> 1 + ssra v20.4h, v3.4h, #1 // (t6 + t2) >> 1 + ssra v24.4h, v4.4h, #1 // (t7 + t3) >> 1 + ssra v19.4h, v28.4h, #1 // (t8 + t4) >> 1 + ssra v2.4h, v16.4h, #1 // (t7 - t3) >> 1 + ssra v27.4h, v17.4h, #1 // (t6 - t2) >> 1 + ssra v26.4h, v6.4h, #1 // (t5 - t1) >> 1 + trn1 v1.2d, v7.2d, v1.2d + trn1 v2.2d, v20.2d, v2.2d + trn1 v3.2d, v24.2d, v27.2d + trn1 v4.2d, v19.2d, v26.2d + srshr v1.8h, v1.8h, #2 // (t5 + t1 + 4) >> 3, (t8 - t4 + 4) >> 3 + srshr v2.8h, v2.8h, #2 // (t6 + t2 + 4) >> 3, (t7 - t3 + 4) >> 3 + srshr v3.8h, v3.8h, #2 // (t7 + t3 + 4) >> 3, (t6 - t2 + 4) >> 3 + srshr v4.8h, v4.8h, #2 // (t8 + t4 + 4) >> 3, (t5 - t1 + 4) >> 3 + trn2 v6.8h, v1.8h, v2.8h + trn1 v1.8h, v1.8h, v2.8h + trn2 v2.8h, v3.8h, v4.8h + trn1 v3.8h, v3.8h, v4.8h + trn2 v4.4s, v6.4s, v2.4s + trn1 v7.4s, v1.4s, v3.4s + trn2 v1.4s, v1.4s, v3.4s + mul v3.8h, v4.8h, v0.h[5] // 22/2 * src[24] + trn1 v2.4s, v6.4s, v2.4s + mul v4.8h, v4.8h, v0.h[4] // 10/2 * src[24] + mul v6.8h, v7.8h, v0.h[6] // 17 * src[0] + mul v1.8h, v1.8h, v0.h[6] // 17 * src[16] + mls v3.8h, v2.8h, v0.h[4] // t4/2 = - 10/2 * src[8] + 22/2 * src[24] + mla v4.8h, v2.8h, v0.h[5] // t3/2 = 22/2 * src[8] + 10/2 * src[24] + add v0.8h, v6.8h, v1.8h // t1 = 17 * src[0] + 17 * src[16] + sub v1.8h, v6.8h, v1.8h // t2 = 17 * src[0] - 17 * src[16] + neg v2.8h, v3.8h // -t4/2 + neg v6.8h, v4.8h // -t3/2 + ssra v4.8h, v0.8h, #1 // (t1 + t3) >> 1 + ssra v2.8h, v1.8h, #1 // (t2 - t4) >> 1 + ssra v3.8h, v1.8h, #1 // (t2 + t4) >> 1 + ssra v6.8h, v0.8h, #1 // (t1 - t3) >> 1 + srshr v0.8h, v4.8h, #6 // (t1 + t3 + 64) >> 7 + srshr v1.8h, v2.8h, #6 // (t2 - t4 + 64) >> 7 + srshr v2.8h, v3.8h, #6 // (t2 + t4 + 64) >> 7 + srshr v3.8h, v6.8h, #6 // (t1 - t3 + 64) >> 7 + uaddw v0.8h, v0.8h, v5.8b + uaddw v1.8h, v1.8h, v18.8b + uaddw v2.8h, v2.8h, v22.8b + uaddw v3.8h, v3.8h, v25.8b + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h + sqxtun v2.8b, v2.8h + sqxtun v3.8b, v3.8h + st1 {v0.8b}, [x3], x1 + st1 {v1.8b}, [x3], x1 + st1 {v2.8b}, [x3], x1 + st1 {v3.8b}, [x3] + ret +endfunc + +// VC-1 4x8 inverse transform +// On entry: +// x0 -> array of 8-bit samples, in row-major order +// x1 = row stride for 8-bit sample array +// x2 -> array of 16-bit inverse transform coefficients, in row-major order (row stride is 8 coefficients) +// On exit: +// array at x0 updated by saturated addition of (narrowed) transformed block +function ff_vc1_inv_trans_4x8_neon, export=1 + mov x3, #16 + ldr q0, .Lcoeffs_it8 // includes 4-point coefficients in upper half of vector + mov x4, x0 + ld1 {v1.d}[0], [x2], x3 // 00 01 02 03 + ld1 {v2.d}[0], [x2], x3 // 10 11 12 13 + ld1 {v3.d}[0], [x2], x3 // 20 21 22 23 + ld1 {v4.d}[0], [x2], x3 // 30 31 32 33 + ld1 {v1.d}[1], [x2], x3 // 40 41 42 43 + ld1 {v2.d}[1], [x2], x3 // 50 51 52 53 + ld1 {v3.d}[1], [x2], x3 // 60 61 62 63 + ld1 {v4.d}[1], [x2] // 70 71 72 73 + ld1 {v5.s}[0], [x0], x1 + ld1 {v6.s}[0], [x0], x1 + ld1 {v7.s}[0], [x0], x1 + trn2 v16.8h, v1.8h, v2.8h // 01 11 03 13 41 51 43 53 + trn1 v1.8h, v1.8h, v2.8h // 00 10 02 12 40 50 42 52 + trn2 v2.8h, v3.8h, 
v4.8h // 21 31 23 33 61 71 63 73
+        trn1 v3.8h, v3.8h, v4.8h // 20 30 22 32 60 70 62 72
+        ld1 {v4.s}[0], [x0], x1
+        trn2 v17.4s, v16.4s, v2.4s // 03 13 23 33 43 53 63 73
+        trn1 v18.4s, v1.4s, v3.4s // 00 10 20 30 40 50 60 70
+        trn1 v2.4s, v16.4s, v2.4s // 01 11 21 31 41 51 61 71
+        mul v16.8h, v17.8h, v0.h[4] // 10/2 * src[3]
+        ld1 {v5.s}[1], [x0], x1
+        mul v17.8h, v17.8h, v0.h[5] // 22/2 * src[3]
+        ld1 {v6.s}[1], [x0], x1
+        trn2 v1.4s, v1.4s, v3.4s // 02 12 22 32 42 52 62 72
+        mul v3.8h, v18.8h, v0.h[6] // 17 * src[0]
+        ld1 {v7.s}[1], [x0], x1
+        mul v1.8h, v1.8h, v0.h[6] // 17 * src[2]
+        ld1 {v4.s}[1], [x0]
+        mla v16.8h, v2.8h, v0.h[5] // t3/2 = 22/2 * src[1] + 10/2 * src[3]
+        mls v17.8h, v2.8h, v0.h[4] // t4/2 = - 10/2 * src[1] + 22/2 * src[3]
+        add v2.8h, v3.8h, v1.8h // t1 = 17 * src[0] + 17 * src[2]
+        sub v1.8h, v3.8h, v1.8h // t2 = 17 * src[0] - 17 * src[2]
+        neg v3.8h, v16.8h // -t3/2
+        ssra v16.8h, v2.8h, #1 // (t1 + t3) >> 1
+        neg v18.8h, v17.8h // -t4/2
+        ssra v17.8h, v1.8h, #1 // (t2 + t4) >> 1
+        ssra v3.8h, v2.8h, #1 // (t1 - t3) >> 1
+        ssra v18.8h, v1.8h, #1 // (t2 - t4) >> 1
+        srshr v1.8h, v16.8h, #2 // (t1 + t3 + 4) >> 3
+        srshr v2.8h, v17.8h, #2 // (t2 + t4 + 4) >> 3
+        srshr v3.8h, v3.8h, #2 // (t1 - t3 + 4) >> 3
+        srshr v16.8h, v18.8h, #2 // (t2 - t4 + 4) >> 3
+        trn2 v17.8h, v2.8h, v3.8h // 12 13 32 33 52 53 72 73
+        trn2 v18.8h, v1.8h, v16.8h // 10 11 30 31 50 51 70 71
+        trn1 v1.8h, v1.8h, v16.8h // 00 01 20 21 40 41 60 61
+        trn1 v2.8h, v2.8h, v3.8h // 02 03 22 23 42 43 62 63
+        trn1 v3.4s, v18.4s, v17.4s // 10 11 12 13 50 51 52 53
+        trn2 v16.4s, v18.4s, v17.4s // 30 31 32 33 70 71 72 73
+        trn1 v17.4s, v1.4s, v2.4s // 00 01 02 03 40 41 42 43
+        mov d18, v3.d[1] // 50 51 52 53
+        shl v19.4h, v3.4h, #4 // 16 * src[8]
+        mov d20, v16.d[1] // 70 71 72 73
+        shl v21.4h, v16.4h, #4 // 16 * src[24]
+        mov d22, v17.d[1] // 40 41 42 43
+        shl v23.4h, v3.4h, #2 // 4 * src[8]
+        shl v24.4h, v18.4h, #4 // 16 * src[40]
+        shl v25.4h, v20.4h, #4 // 16 * src[56]
+        shl v26.4h, v18.4h, #2 // 4 * src[40]
+        trn2 v1.4s, v1.4s, v2.4s // 20 21 22 23 60 61 62 63
+        ssra v24.4h, v21.4h, #2 // 4 * src[24] + 16 * src[40]
+        sub v2.4h, v25.4h, v23.4h // - 4 * src[8] + 16 * src[56]
+        shl v17.4h, v17.4h, #2 // 8/2 * src[0]
+        sub v21.4h, v21.4h, v26.4h // 16 * src[24] - 4 * src[40]
+        shl v22.4h, v22.4h, #2 // 8/2 * src[32]
+        mov d23, v1.d[1] // 60 61 62 63
+        ssra v19.4h, v25.4h, #2 // 16 * src[8] + 4 * src[56]
+        mul v25.4h, v1.4h, v0.h[0] // 6/2 * src[16]
+        shl v1.4h, v1.4h, #3 // 16/2 * src[16]
+        mls v24.4h, v3.4h, v0.h[2] // - 15 * src[8] + 4 * src[24] + 16 * src[40]
+        ssra v17.4h, v17.4h, #1 // 12/2 * src[0]
+        mls v21.4h, v3.4h, v0.h[1] // - 9 * src[8] + 16 * src[24] - 4 * src[40]
+        ssra v22.4h, v22.4h, #1 // 12/2 * src[32]
+        mla v2.4h, v16.4h, v0.h[1] // - 4 * src[8] + 9 * src[24] + 16 * src[56]
+        shl v3.4h, v23.4h, #3 // 16/2 * src[48]
+        mla v19.4h, v16.4h, v0.h[2] // 16 * src[8] + 15 * src[24] + 4 * src[56]
+        mla v1.4h, v23.4h, v0.h[0] // t3/2 = 16/2 * src[16] + 6/2 * src[48]
+        mla v24.4h, v20.4h, v0.h[1] // -t2 = - 15 * src[8] + 4 * src[24] + 16 * src[40] + 9 * src[56]
+        add v16.4h, v17.4h, v22.4h // t1/2 = 12/2 * src[0] + 12/2 * src[32]
+        sub v3.4h, v25.4h, v3.4h // t4/2 = 6/2 * src[16] - 16/2 * src[48]
+        sub v17.4h, v17.4h, v22.4h // t2/2 = 12/2 * src[0] - 12/2 * src[32]
+        mls v21.4h, v20.4h, v0.h[2] // -t3 = - 9 * src[8] + 16 * src[24] - 4 * src[40] - 15 * src[56]
+        mla v19.4h, v18.4h, v0.h[1] // t1 = 16 * src[8] + 15 * src[24] + 9 * src[40] + 4 * src[56]
+        add v20.4h, v16.4h, v1.4h // t5/2 = t1/2 + t3/2
+        mls v2.4h, v18.4h, v0.h[2] // -t4 = - 4 * src[8] + 9 * src[24] - 15 * src[40] + 16 * src[56]
+        sub v0.4h, v16.4h, v1.4h // t8/2 = t1/2 - t3/2
+        add v18.4h, v17.4h, v3.4h // t6/2 = t2/2 + t4/2
+        sub v22.4h, v17.4h, v3.4h // t7/2 = t2/2 - t4/2
+        neg v23.4h, v24.4h // +t2
+        sub v25.4h, v17.4h, v3.4h // t7/2 = t2/2 - t4/2
+        add v3.4h, v17.4h, v3.4h // t6/2 = t2/2 + t4/2
+        neg v17.4h, v21.4h // +t3
+        sub v26.4h, v16.4h, v1.4h // t8/2 = t1/2 - t3/2
+        add v1.4h, v16.4h, v1.4h // t5/2 = t1/2 + t3/2
+        neg v16.4h, v19.4h // -t1
+        neg v27.4h, v2.4h // +t4
+        ssra v20.4h, v19.4h, #1 // (t5 + t1) >> 1
+        srsra v0.4h, v2.4h, #1 // (t8 - t4 + 1) >> 1
+        ssra v18.4h, v23.4h, #1 // (t6 + t2) >> 1
+        srsra v22.4h, v21.4h, #1 // (t7 - t3 + 1) >> 1
+        ssra v25.4h, v17.4h, #1 // (t7 + t3) >> 1
+        srsra v3.4h, v24.4h, #1 // (t6 - t2 + 1) >> 1
+        ssra v26.4h, v27.4h, #1 // (t8 + t4) >> 1
+        srsra v1.4h, v16.4h, #1 // (t5 - t1 + 1) >> 1
+        trn1 v0.2d, v20.2d, v0.2d
+        trn1 v2.2d, v18.2d, v22.2d
+        trn1 v3.2d, v25.2d, v3.2d
+        trn1 v1.2d, v26.2d, v1.2d
+        srshr v0.8h, v0.8h, #6 // (t5 + t1 + 64) >> 7, (t8 - t4 + 65) >> 7
+        srshr v2.8h, v2.8h, #6 // (t6 + t2 + 64) >> 7, (t7 - t3 + 65) >> 7
+        srshr v3.8h, v3.8h, #6 // (t7 + t3 + 64) >> 7, (t6 - t2 + 65) >> 7
+        srshr v1.8h, v1.8h, #6 // (t8 + t4 + 64) >> 7, (t5 - t1 + 65) >> 7
+        uaddw v0.8h, v0.8h, v5.8b
+        uaddw v2.8h, v2.8h, v6.8b
+        uaddw v3.8h, v3.8h, v7.8b
+        uaddw v1.8h, v1.8h, v4.8b
+        sqxtun v0.8b, v0.8h
+        sqxtun v2.8b, v2.8h
+        sqxtun v3.8b, v3.8h
+        sqxtun v1.8b, v1.8h
+        st1 {v0.s}[0], [x4], x1
+        st1 {v2.s}[0], [x4], x1
+        st1 {v3.s}[0], [x4], x1
+        st1 {v1.s}[0], [x4], x1
+        st1 {v0.s}[1], [x4], x1
+        st1 {v2.s}[1], [x4], x1
+        st1 {v3.s}[1], [x4], x1
+        st1 {v1.s}[1], [x4]
+        ret
+endfunc
+
+// VC-1 4x4 inverse transform
+// On entry:
+// x0 -> array of 8-bit samples, in row-major order
+// x1 = row stride for 8-bit sample array
+// x2 -> array of 16-bit inverse transform coefficients, in row-major order (row stride is 8 coefficients)
+// On exit:
+// array at x0 updated by saturated addition of (narrowed) transformed block
+function ff_vc1_inv_trans_4x4_neon, export=1
+        mov x3, #16
+        ldr d0, .Lcoeffs_it4
+        mov x4, x0
+        ld1 {v1.d}[0], [x2], x3 // 00 01 02 03
+        ld1 {v2.d}[0], [x2], x3 // 10 11 12 13
+        ld1 {v3.d}[0], [x2], x3 // 20 21 22 23
+        ld1 {v4.d}[0], [x2] // 30 31 32 33
+        ld1 {v5.s}[0], [x0], x1
+        ld1 {v5.s}[1], [x0], x1
+        ld1 {v6.s}[0], [x0], x1
+        trn2 v7.4h, v1.4h, v2.4h // 01 11 03 13
+        trn1 v1.4h, v1.4h, v2.4h // 00 10 02 12
+        ld1 {v6.s}[1], [x0]
+        trn2 v2.4h, v3.4h, v4.4h // 21 31 23 33
+        trn1 v3.4h, v3.4h, v4.4h // 20 30 22 32
+        trn2 v4.2s, v7.2s, v2.2s // 03 13 23 33
+        trn1 v16.2s, v1.2s, v3.2s // 00 10 20 30
+        trn1 v2.2s, v7.2s, v2.2s // 01 11 21 31
+        trn2 v1.2s, v1.2s, v3.2s // 02 12 22 32
+        mul v3.4h, v4.4h, v0.h[0] // 10/2 * src[3]
+        mul v4.4h, v4.4h, v0.h[1] // 22/2 * src[3]
+        mul v7.4h, v16.4h, v0.h[2] // 17 * src[0]
+        mul v1.4h, v1.4h, v0.h[2] // 17 * src[2]
+        mla v3.4h, v2.4h, v0.h[1] // t3/2 = 22/2 * src[1] + 10/2 * src[3]
+        mls v4.4h, v2.4h, v0.h[0] // t4/2 = - 10/2 * src[1] + 22/2 * src[3]
+        add v2.4h, v7.4h, v1.4h // t1 = 17 * src[0] + 17 * src[2]
+        sub v1.4h, v7.4h, v1.4h // t2 = 17 * src[0] - 17 * src[2]
+        neg v7.4h, v3.4h // -t3/2
+        neg v16.4h, v4.4h // -t4/2
+        ssra v3.4h, v2.4h, #1 // (t1 + t3) >> 1
+        ssra v4.4h, v1.4h, #1 // (t2 + t4) >> 1
+        ssra v16.4h, v1.4h, #1 // (t2 - t4) >> 1
+        ssra v7.4h, v2.4h, #1 // (t1 - t3) >> 1
+        srshr v1.4h, v3.4h, #2 // (t1 + t3 + 4) >> 3
+        srshr v2.4h, v4.4h, #2 // (t2 + t4 + 4) >> 3
+        srshr v3.4h, v16.4h, #2 // (t2 - t4 + 4) >> 3
+        srshr v4.4h, v7.4h, #2 // (t1 - t3 + 4) >> 3
+        trn2 v7.4h, v1.4h, v3.4h // 10 11 30 31
+        trn1 v1.4h, v1.4h, v3.4h // 00 01 20 21
+        trn2 v3.4h, v2.4h, v4.4h // 12 13 32 33
+        trn1 v2.4h, v2.4h, v4.4h // 02 03 22 23
+        trn2 v4.2s, v7.2s, v3.2s // 30 31 32 33
+        trn1 v16.2s, v1.2s, v2.2s // 00 01 02 03
+        trn1 v3.2s, v7.2s, v3.2s // 10 11 12 13
+        trn2 v1.2s, v1.2s, v2.2s // 20 21 22 23
+        mul v2.4h, v4.4h, v0.h[1] // 22/2 * src[24]
+        mul v4.4h, v4.4h, v0.h[0] // 10/2 * src[24]
+        mul v7.4h, v16.4h, v0.h[2] // 17 * src[0]
+        mul v1.4h, v1.4h, v0.h[2] // 17 * src[16]
+        mls v2.4h, v3.4h, v0.h[0] // t4/2 = - 10/2 * src[8] + 22/2 * src[24]
+        mla v4.4h, v3.4h, v0.h[1] // t3/2 = 22/2 * src[8] + 10/2 * src[24]
+        add v0.4h, v7.4h, v1.4h // t1 = 17 * src[0] + 17 * src[16]
+        sub v1.4h, v7.4h, v1.4h // t2 = 17 * src[0] - 17 * src[16]
+        neg v3.4h, v2.4h // -t4/2
+        neg v7.4h, v4.4h // -t3/2
+        ssra v4.4h, v0.4h, #1 // (t1 + t3) >> 1
+        ssra v3.4h, v1.4h, #1 // (t2 - t4) >> 1
+        ssra v2.4h, v1.4h, #1 // (t2 + t4) >> 1
+        ssra v7.4h, v0.4h, #1 // (t1 - t3) >> 1
+        trn1 v0.2d, v4.2d, v3.2d
+        trn1 v1.2d, v2.2d, v7.2d
+        srshr v0.8h, v0.8h, #6 // (t1 + t3 + 64) >> 7, (t2 - t4 + 64) >> 7
+        srshr v1.8h, v1.8h, #6 // (t2 + t4 + 64) >> 7, (t1 - t3 + 64) >> 7
+        uaddw v0.8h, v0.8h, v5.8b
+        uaddw v1.8h, v1.8h, v6.8b
+        sqxtun v0.8b, v0.8h
+        sqxtun v1.8b, v1.8h
+        st1 {v0.s}[0], [x4], x1
+        st1 {v0.s}[1], [x4], x1
+        st1 {v1.s}[0], [x4], x1
+        st1 {v1.s}[1], [x4]
+        ret
+endfunc
+
+// VC-1 8x8 inverse transform, DC case
+// On entry:
+// x0 -> array of 8-bit samples, in row-major order
+// x1 = row stride for 8-bit sample array
+// x2 -> 16-bit inverse transform DC coefficient
+// On exit:
+// array at x0 updated by saturated addition of (narrowed) transformed block
+function ff_vc1_inv_trans_8x8_dc_neon, export=1
+        ldrsh w2, [x2]
+        mov x3, x0
+        ld1 {v0.8b}, [x0], x1
+        ld1 {v1.8b}, [x0], x1
+        ld1 {v2.8b}, [x0], x1
+        add w2, w2, w2, lsl #1
+        ld1 {v3.8b}, [x0], x1
+        ld1 {v4.8b}, [x0], x1
+        add w2, w2, #1
+        ld1 {v5.8b}, [x0], x1
+        asr w2, w2, #1
+        ld1 {v6.8b}, [x0], x1
+        add w2, w2, w2, lsl #1
+        ld1 {v7.8b}, [x0]
+        add w0, w2, #16
+        asr w0, w0, #5
+        dup v16.8h, w0
+        uaddw v0.8h, v16.8h, v0.8b
+        uaddw v1.8h, v16.8h, v1.8b
+        uaddw v2.8h, v16.8h, v2.8b
+        uaddw v3.8h, v16.8h, v3.8b
+        uaddw v4.8h, v16.8h, v4.8b
+        uaddw v5.8h, v16.8h, v5.8b
+        sqxtun v0.8b, v0.8h
+        uaddw v6.8h, v16.8h, v6.8b
+        sqxtun v1.8b, v1.8h
+        uaddw v7.8h, v16.8h, v7.8b
+        sqxtun v2.8b, v2.8h
+        sqxtun v3.8b, v3.8h
+        sqxtun v4.8b, v4.8h
+        st1 {v0.8b}, [x3], x1
+        sqxtun v0.8b, v5.8h
+        st1 {v1.8b}, [x3], x1
+        sqxtun v1.8b, v6.8h
+        st1 {v2.8b}, [x3], x1
+        sqxtun v2.8b, v7.8h
+        st1 {v3.8b}, [x3], x1
+        st1 {v4.8b}, [x3], x1
+        st1 {v0.8b}, [x3], x1
+        st1 {v1.8b}, [x3], x1
+        st1 {v2.8b}, [x3]
+        ret
+endfunc
+
+// VC-1 8x4 inverse transform, DC case
+// On entry:
+// x0 -> array of 8-bit samples, in row-major order
+// x1 = row stride for 8-bit sample array
+// x2 -> 16-bit inverse transform DC coefficient
+// On exit:
+// array at x0 updated by saturated addition of (narrowed) transformed block
+function ff_vc1_inv_trans_8x4_dc_neon, export=1
+        ldrsh w2, [x2]
+        mov x3, x0
+        ld1 {v0.8b}, [x0], x1
+        ld1 {v1.8b}, [x0], x1
+        ld1 {v2.8b}, [x0], x1
+        add w2, w2, w2, lsl #1
+        ld1 {v3.8b}, [x0]
+        add w0, w2, #1
+        asr w0, w0, #1
+        add w0, w0, w0, lsl #4
+        add w0, w0, #64
+        asr w0, w0, #7
+        dup v4.8h, w0
+        uaddw v0.8h, v4.8h, v0.8b
+        uaddw v1.8h, v4.8h, v1.8b
+        uaddw v2.8h, v4.8h, v2.8b
+        uaddw v3.8h, v4.8h, v3.8b
+        sqxtun v0.8b, v0.8h
+        sqxtun v1.8b, v1.8h
+        sqxtun v2.8b, v2.8h
+        sqxtun v3.8b, v3.8h
+        st1 {v0.8b}, [x3], x1
+        st1 {v1.8b}, [x3], x1
+        st1 {v2.8b}, [x3], x1
+        st1 {v3.8b}, [x3]
+        ret
+endfunc
+
+// VC-1 4x8 inverse transform, DC case
+// On entry:
+// x0 -> array of 8-bit samples, in row-major order
+// x1 = row stride for 8-bit sample array
+// x2 -> 16-bit inverse transform DC coefficient
+// On exit:
+// array at x0 updated by saturated addition of (narrowed) transformed block
+function ff_vc1_inv_trans_4x8_dc_neon, export=1
+        ldrsh w2, [x2]
+        mov x3, x0
+        ld1 {v0.s}[0], [x0], x1
+        ld1 {v1.s}[0], [x0], x1
+        ld1 {v2.s}[0], [x0], x1
+        add w2, w2, w2, lsl #4
+        ld1 {v3.s}[0], [x0], x1
+        add w2, w2, #4
+        asr w2, w2, #3
+        add w2, w2, w2, lsl #1
+        ld1 {v0.s}[1], [x0], x1
+        add w2, w2, #16
+        asr w2, w2, #5
+        dup v4.8h, w2
+        ld1 {v1.s}[1], [x0], x1
+        ld1 {v2.s}[1], [x0], x1
+        ld1 {v3.s}[1], [x0]
+        uaddw v0.8h, v4.8h, v0.8b
+        uaddw v1.8h, v4.8h, v1.8b
+        uaddw v2.8h, v4.8h, v2.8b
+        uaddw v3.8h, v4.8h, v3.8b
+        sqxtun v0.8b, v0.8h
+        sqxtun v1.8b, v1.8h
+        sqxtun v2.8b, v2.8h
+        sqxtun v3.8b, v3.8h
+        st1 {v0.s}[0], [x3], x1
+        st1 {v1.s}[0], [x3], x1
+        st1 {v2.s}[0], [x3], x1
+        st1 {v3.s}[0], [x3], x1
+        st1 {v0.s}[1], [x3], x1
+        st1 {v1.s}[1], [x3], x1
+        st1 {v2.s}[1], [x3], x1
+        st1 {v3.s}[1], [x3]
+        ret
+endfunc
+
+// VC-1 4x4 inverse transform, DC case
+// On entry:
+// x0 -> array of 8-bit samples, in row-major order
+// x1 = row stride for 8-bit sample array
+// x2 -> 16-bit inverse transform DC coefficient
+// On exit:
+// array at x0 updated by saturated addition of (narrowed) transformed block
+function ff_vc1_inv_trans_4x4_dc_neon, export=1
+        ldrsh w2, [x2]
+        mov x3, x0
+        ld1 {v0.s}[0], [x0], x1
+        ld1 {v1.s}[0], [x0], x1
+        ld1 {v0.s}[1], [x0], x1
+        add w2, w2, w2, lsl #4
+        ld1 {v1.s}[1], [x0]
+        add w0, w2, #4
+        asr w0, w0, #3
+        add w0, w0, w0, lsl #4
+        add w0, w0, #64
+        asr w0, w0, #7
+        dup v2.8h, w0
+        uaddw v0.8h, v2.8h, v0.8b
+        uaddw v1.8h, v2.8h, v1.8b
+        sqxtun v0.8b, v0.8h
+        sqxtun v1.8b, v1.8h
+        st1 {v0.s}[0], [x3], x1
+        st1 {v1.s}[0], [x3], x1
+        st1 {v0.s}[1], [x3], x1
+        st1 {v1.s}[1], [x3]
+        ret
+endfunc
+
 .align 5
+.Lcoeffs_it8:
+.quad 0x000F00090003
+.Lcoeffs_it4:
+.quad 0x0011000B0005
 .Lcoeffs:
 .quad 0x00050002
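For reference, the 4-point transform stage that the 4xN functions above implement corresponds to the following scalar sketch. The helper name is illustrative (the scalar reference lives in libavcodec/vc1dsp.c); the coefficients 17, 22 and 10 are the ones packed into .Lcoeffs_it4, which the NEON code loads in halved form (17, 11, 5) so it can work at t/2 precision. The (rnd, shift) pair is (4, 3) for a first (row) pass and (64, 7) for a second (column) pass, matching the rounding comments in the code above; the 8-point stage follows the same pattern with the coefficient set 12, 16, 15, 9, 6, 4 derived from .Lcoeffs_it8.

    #include <stdint.h>

    static void vc1_inv_trans_4_sketch(int16_t *dst, const int16_t *src,
                                       int rnd, int shift)
    {
        /* coefficients 17, 22, 10 as in .Lcoeffs_it4 (0x0011000B0005) */
        int t1 = 17 * (src[0] + src[2]) + rnd;
        int t2 = 17 * (src[0] - src[2]) + rnd;
        int t3 = 22 * src[1] + 10 * src[3];
        int t4 = 22 * src[3] - 10 * src[1];

        dst[0] = (t1 + t3) >> shift;
        dst[1] = (t2 - t4) >> shift;
        dst[2] = (t2 + t4) >> shift;
        dst[3] = (t1 - t3) >> shift;
    }

After the second pass the results are added to the existing 8-bit pixels and saturated, which is what the uaddw/sqxtun pairs at the end of each non-8x8 function do.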
From patchwork Thu Mar 17 18:58:17 2022
X-Patchwork-Submitter: Ben Avison
X-Patchwork-Id: 34818

From: Ben Avison
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 17 Mar 2022 18:58:17 +0000
Message-Id: <20220317185819.466470-5-bavison@riscosopen.org>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20220317185819.466470-1-bavison@riscosopen.org>
References: <20220317185819.466470-1-bavison@riscosopen.org>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 4/6] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths
Cc: Ben Avison

Signed-off-by: Ben Avison
---
 libavcodec/aarch64/Makefile               |   3 +-
 libavcodec/aarch64/idctdsp_init_aarch64.c |  26 +++--
 libavcodec/aarch64/idctdsp_neon.S         | 130 ++++++++++++++++++++++
 3 files changed, 150 insertions(+), 9 deletions(-)
 create mode 100644 libavcodec/aarch64/idctdsp_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 5b25e4dfb9..c8935f205e 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -44,7 +44,8 @@ NEON-OBJS-$(CONFIG_H264PRED)           += aarch64/h264pred_neon.o
 NEON-OBJS-$(CONFIG_H264QPEL)           += aarch64/h264qpel_neon.o             \
                                           aarch64/hpeldsp_neon.o
 NEON-OBJS-$(CONFIG_HPELDSP)            += aarch64/hpeldsp_neon.o
-NEON-OBJS-$(CONFIG_IDCTDSP)            += aarch64/simple_idct_neon.o
+NEON-OBJS-$(CONFIG_IDCTDSP)            += aarch64/idctdsp_neon.o              \
+                                          aarch64/simple_idct_neon.o
 NEON-OBJS-$(CONFIG_MDCT)               += aarch64/mdct_neon.o
 NEON-OBJS-$(CONFIG_MPEGAUDIODSP)       += aarch64/mpegaudiodsp_neon.o
 NEON-OBJS-$(CONFIG_PIXBLOCKDSP)        += aarch64/pixblockdsp_neon.o

diff --git a/libavcodec/aarch64/idctdsp_init_aarch64.c b/libavcodec/aarch64/idctdsp_init_aarch64.c
index 742a3372e3..eec21aa5a2 100644
--- a/libavcodec/aarch64/idctdsp_init_aarch64.c
+++ b/libavcodec/aarch64/idctdsp_init_aarch64.c
@@ -27,19 +27,29 @@
 #include "libavcodec/idctdsp.h"
 #include "idct.h"
 
+void ff_put_pixels_clamped_neon(const int16_t *, uint8_t *, ptrdiff_t);
+void ff_put_signed_pixels_clamped_neon(const int16_t *, uint8_t *, ptrdiff_t);
+void ff_add_pixels_clamped_neon(const int16_t *, uint8_t *, ptrdiff_t);
+
 av_cold void ff_idctdsp_init_aarch64(IDCTDSPContext *c, AVCodecContext *avctx,
                                      unsigned high_bit_depth)
 {
     int cpu_flags = av_get_cpu_flags();
 
-    if (have_neon(cpu_flags) && !avctx->lowres && !high_bit_depth) {
-        if (avctx->idct_algo == FF_IDCT_AUTO ||
-            avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
-            avctx->idct_algo == FF_IDCT_SIMPLENEON) {
-            c->idct_put  = ff_simple_idct_put_neon;
-            c->idct_add  = ff_simple_idct_add_neon;
-            c->idct      = ff_simple_idct_neon;
-            c->perm_type = FF_IDCT_PERM_PARTTRANS;
+    if (have_neon(cpu_flags)) {
+        if (!avctx->lowres && !high_bit_depth) {
+            if (avctx->idct_algo == FF_IDCT_AUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLENEON) {
+                c->idct_put  = ff_simple_idct_put_neon;
+                c->idct_add  = ff_simple_idct_add_neon;
+                c->idct      = ff_simple_idct_neon;
+                c->perm_type = FF_IDCT_PERM_PARTTRANS;
+            }
         }
+
+        c->add_pixels_clamped        = ff_add_pixels_clamped_neon;
+        c->put_pixels_clamped        = ff_put_pixels_clamped_neon;
+        c->put_signed_pixels_clamped = ff_put_signed_pixels_clamped_neon;
     }
 }

diff --git a/libavcodec/aarch64/idctdsp_neon.S b/libavcodec/aarch64/idctdsp_neon.S
new file mode 100644
index 0000000000..bbc9dc3f84
--- /dev/null
+++ b/libavcodec/aarch64/idctdsp_neon.S
@@ -0,0 +1,130 @@
+/*
+ * IDCT AArch64 NEON optimisations
+ *
+ * Copyright (c) 2022 Ben Avison
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+// Clamp 16-bit signed block coefficients to unsigned 8-bit
+// On entry:
+//   x0 -> array of 64x 16-bit coefficients
+//   x1 -> 8-bit results
+//   x2 = row stride for results, bytes
+function ff_put_pixels_clamped_neon, export=1
+        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
+        ld1             {v4.16b, v5.16b, v6.16b, v7.16b}, [x0]
+        sqxtun          v0.8b, v0.8h
+        sqxtun          v1.8b, v1.8h
+        sqxtun          v2.8b, v2.8h
+        sqxtun          v3.8b, v3.8h
+        sqxtun          v4.8b, v4.8h
+        st1             {v0.8b}, [x1], x2
+        sqxtun          v0.8b, v5.8h
+        st1             {v1.8b}, [x1], x2
+        sqxtun          v1.8b, v6.8h
+        st1             {v2.8b}, [x1], x2
+        sqxtun          v2.8b, v7.8h
+        st1             {v3.8b}, [x1], x2
+        st1             {v4.8b}, [x1], x2
+        st1             {v0.8b}, [x1], x2
+        st1             {v1.8b}, [x1], x2
+        st1             {v2.8b}, [x1]
+        ret
+endfunc
+
+// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128)
+// On entry:
+//   x0 -> array of 64x 16-bit coefficients
+//   x1 -> 8-bit results
+//   x2 = row stride for results, bytes
+function ff_put_signed_pixels_clamped_neon, export=1
+        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
+        movi            v4.8b, #128
+        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0]
+        sqxtn           v0.8b, v0.8h
+        sqxtn           v1.8b, v1.8h
+        sqxtn           v2.8b, v2.8h
+        sqxtn           v3.8b, v3.8h
+        sqxtn           v5.8b, v16.8h
+        add             v0.8b, v0.8b, v4.8b
+        sqxtn           v6.8b, v17.8h
+        add             v1.8b, v1.8b, v4.8b
+        sqxtn           v7.8b, v18.8h
+        add             v2.8b, v2.8b, v4.8b
+        sqxtn           v16.8b, v19.8h
+        add             v3.8b, v3.8b, v4.8b
+        st1             {v0.8b}, [x1], x2
+        add             v0.8b, v5.8b, v4.8b
+        st1             {v1.8b}, [x1], x2
+        add             v1.8b, v6.8b, v4.8b
+        st1             {v2.8b}, [x1], x2
+        add             v2.8b, v7.8b, v4.8b
+        st1             {v3.8b}, [x1], x2
+        add             v3.8b, v16.8b, v4.8b
+        st1             {v0.8b}, [x1], x2
+        st1             {v1.8b}, [x1], x2
+        st1             {v2.8b}, [x1], x2
+        st1             {v3.8b}, [x1]
+        ret
+endfunc
+
+// Add 16-bit signed block coefficients to unsigned 8-bit
+// On entry:
+//   x0 -> array of 64x 16-bit coefficients
+//   x1 -> 8-bit input and results
+//   x2 = row stride for 8-bit input and results, bytes
+function ff_add_pixels_clamped_neon, export=1
+        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
+        mov             x3, x1
+        ld1             {v4.8b}, [x1], x2
+        ld1             {v5.8b}, [x1], x2
+        ld1             {v6.8b}, [x1], x2
+        ld1             {v7.8b}, [x1], x2
+        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0]
+        uaddw           v0.8h, v0.8h, v4.8b
+        uaddw           v1.8h, v1.8h, v5.8b
+        uaddw           v2.8h, v2.8h, v6.8b
+        ld1             {v4.8b}, [x1], x2
+        uaddw           v3.8h, v3.8h, v7.8b
+        ld1             {v5.8b}, [x1], x2
+        sqxtun          v0.8b, v0.8h
+        ld1             {v6.8b}, [x1], x2
+        sqxtun          v1.8b, v1.8h
+        ld1             {v7.8b}, [x1]
+        sqxtun          v2.8b, v2.8h
+        sqxtun          v3.8b, v3.8h
+        uaddw           v4.8h, v16.8h, v4.8b
+        st1             {v0.8b}, [x3], x2
+        uaddw           v0.8h, v17.8h, v5.8b
+        st1             {v1.8b}, [x3], x2
+        uaddw           v1.8h, v18.8h, v6.8b
+        st1             {v2.8b}, [x3], x2
+        uaddw           v2.8h, v19.8h, v7.8b
+        sqxtun          v4.8b, v4.8h
+        sqxtun          v0.8b, v0.8h
+        st1             {v3.8b}, [x3], x2
+        sqxtun          v1.8b, v1.8h
+        sqxtun          v2.8b, v2.8h
+        st1             {v4.8b}, [x3], x2
+        st1             {v0.8b}, [x3], x2
+        st1             {v1.8b}, [x3], x2
+        st1             {v2.8b}, [x3]
+        ret
+endfunc

From patchwork Thu Mar 17 18:58:18 2022
X-Patchwork-Submitter: Ben Avison <bavison@riscosopen.org>
X-Patchwork-Id: 34820
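
For reference, the scalar operation that ff_add_pixels_clamped_neon accelerates is roughly the following (an illustrative sketch only, with made-up names; FFmpeg's own C fallback lives in libavcodec/idctdsp.c). Each uaddw above widens a row of 8 pixels and adds the corresponding int16_t coefficients; sqxtun then narrows back with unsigned saturation:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of add_pixels_clamped: add an 8x8 block of 16-bit IDCT
     * coefficients to 8-bit pixels, saturating each sum to 0..255. */
    static void add_pixels_clamped_ref(const int16_t *block, uint8_t *pixels,
                                       ptrdiff_t line_size)
    {
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++) {
                int v = pixels[j] + block[j];
                pixels[j] = v < 0 ? 0 : v > 255 ? 255 : v;
            }
            block  += 8;
            pixels += line_size;
        }
    }

put_pixels_clamped is the same minus the addition (a plain saturating narrow of the coefficients), and put_signed_pixels_clamped narrows with signed saturation and then biases the result by 128.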
From: Ben Avison <bavison@riscosopen.org>
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 17 Mar 2022 18:58:18 +0000
Message-Id: <20220317185819.466470-6-bavison@riscosopen.org>
In-Reply-To: <20220317185819.466470-1-bavison@riscosopen.org>
References: <20220317185819.466470-1-bavison@riscosopen.org>
Subject: [FFmpeg-devel] [PATCH 5/6] avcodec/blockdsp: Arm 64-bit NEON block clear fast paths

Signed-off-by: Ben Avison <bavison@riscosopen.org>
---
 libavcodec/aarch64/Makefile                |  2 +
 libavcodec/aarch64/blockdsp_init_aarch64.c | 42 +++++++++++++++++++++
 libavcodec/aarch64/blockdsp_neon.S         | 43 ++++++++++++++++++++++
 libavcodec/blockdsp.c                      |  2 +
 libavcodec/blockdsp.h                      |  1 +
 5 files changed, 90 insertions(+)
 create mode 100644 libavcodec/aarch64/blockdsp_init_aarch64.c
 create mode 100644 libavcodec/aarch64/blockdsp_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index c8935f205e..7078dc6089 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -35,6 +35,8 @@ ARMV8-OBJS-$(CONFIG_VIDEODSP)           += aarch64/videodsp.o
 
 # subsystems
 NEON-OBJS-$(CONFIG_AAC_DECODER)         += aarch64/sbrdsp_neon.o
+NEON-OBJS-$(CONFIG_BLOCKDSP)            += aarch64/blockdsp_init_aarch64.o    \
+                                           aarch64/blockdsp_neon.o
 NEON-OBJS-$(CONFIG_FFT)                 += aarch64/fft_neon.o
 NEON-OBJS-$(CONFIG_FMTCONVERT)          += aarch64/fmtconvert_neon.o
 NEON-OBJS-$(CONFIG_H264CHROMA)          += aarch64/h264cmc_neon.o

diff --git a/libavcodec/aarch64/blockdsp_init_aarch64.c b/libavcodec/aarch64/blockdsp_init_aarch64.c
new file mode 100644
index 0000000000..9f3280f007
--- /dev/null
+++ b/libavcodec/aarch64/blockdsp_init_aarch64.c
@@ -0,0 +1,42 @@
+/*
+ * AArch64 NEON optimised block operations
+ *
+ * Copyright (c) 2022 Ben Avison <bavison@riscosopen.org>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <stdint.h>
+
+#include "libavutil/attributes.h"
+#include "libavutil/cpu.h"
+#include "libavutil/aarch64/cpu.h"
+#include "libavcodec/avcodec.h"
+#include "libavcodec/blockdsp.h"
+
+void ff_clear_block_neon(int16_t *block);
+void ff_clear_blocks_neon(int16_t *blocks);
+
+av_cold void ff_blockdsp_init_aarch64(BlockDSPContext *c)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (have_neon(cpu_flags)) {
+        c->clear_block  = ff_clear_block_neon;
+        c->clear_blocks = ff_clear_blocks_neon;
+    }
+}

diff --git a/libavcodec/aarch64/blockdsp_neon.S b/libavcodec/aarch64/blockdsp_neon.S
new file mode 100644
index 0000000000..a310647a5d
--- /dev/null
+++ b/libavcodec/aarch64/blockdsp_neon.S
@@ -0,0 +1,43 @@
+/*
+ * AArch64 NEON optimised block operations
+ *
+ * Copyright (c) 2022 Ben Avison <bavison@riscosopen.org>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+function ff_clear_block_neon, export=1
+        movi            v0.16b, #0
+        movi            v1.16b, #0
+        st1             {v0.16b, v1.16b}, [x0], #32
+        st1             {v0.16b, v1.16b}, [x0], #32
+        st1             {v0.16b, v1.16b}, [x0], #32
+        st1             {v0.16b, v1.16b}, [x0]
+        ret
+endfunc
+
+function ff_clear_blocks_neon, export=1
+        movi            v0.16b, #0
+        movi            v1.16b, #0
+        .rept           23
+        st1             {v0.16b, v1.16b}, [x0], #32
+        .endr
+        st1             {v0.16b, v1.16b}, [x0]
+        ret
+endfunc

diff --git a/libavcodec/blockdsp.c b/libavcodec/blockdsp.c
index 5fb242ea65..97cc074765 100644
--- a/libavcodec/blockdsp.c
+++ b/libavcodec/blockdsp.c
@@ -64,6 +64,8 @@ av_cold void ff_blockdsp_init(BlockDSPContext *c, AVCodecContext *avctx)
     c->fill_block_tab[0] = fill_block16_c;
     c->fill_block_tab[1] = fill_block8_c;
 
+    if (ARCH_AARCH64)
+        ff_blockdsp_init_aarch64(c);
     if (ARCH_ALPHA)
         ff_blockdsp_init_alpha(c);
     if (ARCH_ARM)

diff --git a/libavcodec/blockdsp.h b/libavcodec/blockdsp.h
index 58eecffb07..9838922d22 100644
--- a/libavcodec/blockdsp.h
+++ b/libavcodec/blockdsp.h
@@ -40,6 +40,7 @@ typedef struct BlockDSPContext {
 
 void ff_blockdsp_init(BlockDSPContext *c, AVCodecContext *avctx);
 
+void ff_blockdsp_init_aarch64(BlockDSPContext *c);
 void ff_blockdsp_init_alpha(BlockDSPContext *c);
 void ff_blockdsp_init_arm(BlockDSPContext *c);
 void ff_blockdsp_init_ppc(BlockDSPContext *c);
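
For orientation: one 64-coefficient int16_t block is 128 bytes, so ff_clear_block_neon issues four 32-byte stores, and the 24 stores in ff_clear_blocks_neon cover the usual group of six blocks (768 bytes). A scalar equivalent is simply the following (an illustrative sketch with made-up names, not FFmpeg's C fallback):

    #include <stdint.h>
    #include <string.h>

    /* Zero one 8x8 block of 16-bit coefficients (128 bytes). */
    static void clear_block_ref(int16_t *block)
    {
        memset(block, 0, 64 * sizeof(int16_t));
    }

    /* Zero six consecutive blocks (768 bytes), as used per macroblock. */
    static void clear_blocks_ref(int16_t *blocks)
    {
        memset(blocks, 0, 6 * 64 * sizeof(int16_t));
    }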
From patchwork Thu Mar 17 18:58:19 2022
X-Patchwork-Submitter: Ben Avison <bavison@riscosopen.org>
X-Patchwork-Id: 34819
From: Ben Avison <bavison@riscosopen.org>
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 17 Mar 2022 18:58:19 +0000
Message-Id: <20220317185819.466470-7-bavison@riscosopen.org>
In-Reply-To: <20220317185819.466470-1-bavison@riscosopen.org>
References: <20220317185819.466470-1-bavison@riscosopen.org>
Subject: [FFmpeg-devel] [PATCH 6/6] avcodec/vc1: Introduce fast path for unescaping bitstream buffer

Populate with implementations suitable for 32-bit and 64-bit Arm.
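
Background for the patch below: VC-1 inserts an emulation-prevention 0x03 byte after two zero bytes so that picture data cannot mimic a start code; unescaping strips that byte out again. On a little-endian machine the escape sequence 00 00 03 0X (X <= 3) can be recognised with a single 32-bit test, (word & ~0x03000000) == 0x00030000, which is the predicate both the C wrappers and the NEON helpers below evaluate. A scalar sketch of the whole operation (illustrative only; the existing C implementation is vc1_unescape_buffer() in vc1_common.h):

    #include <stdint.h>

    /* Copy src to dst, dropping the 0x03 escape byte from any
     * 00 00 03 {00..03} sequence; returns bytes written. Sketch only. */
    static int unescape_ref(const uint8_t *src, int size, uint8_t *dst)
    {
        int dsize = 0;
        for (int i = 0; i < size; i++) {
            if (i >= 2 && i + 1 < size &&
                src[i - 2] == 0x00 && src[i - 1] == 0x00 &&
                src[i] == 0x03 && src[i + 1] <= 0x03)
                continue;               /* skip the escape byte itself */
            dst[dsize++] = src[i];
        }
        return dsize;
    }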
Signed-off-by: Ben Avison <bavison@riscosopen.org>
---
 libavcodec/aarch64/vc1dsp_init_aarch64.c |  60 ++++++++
 libavcodec/aarch64/vc1dsp_neon.S         | 176 +++++++++++++++++++++++
 libavcodec/arm/vc1dsp_init_neon.c        |  60 ++++++++
 libavcodec/arm/vc1dsp_neon.S             | 118 +++++++++++++++
 libavcodec/vc1dec.c                      |  20 +--
 libavcodec/vc1dsp.c                      |   2 +
 libavcodec/vc1dsp.h                      |   3 +
 7 files changed, 429 insertions(+), 10 deletions(-)

diff --git a/libavcodec/aarch64/vc1dsp_init_aarch64.c b/libavcodec/aarch64/vc1dsp_init_aarch64.c
index b672b2aa99..2fc2d5d1d3 100644
--- a/libavcodec/aarch64/vc1dsp_init_aarch64.c
+++ b/libavcodec/aarch64/vc1dsp_init_aarch64.c
@@ -51,6 +51,64 @@ void ff_put_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
 void ff_avg_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
                                 int h, int x, int y);
 
+int ff_vc1_unescape_buffer_helper_neon(const uint8_t *src, int size, uint8_t *dst);
+
+static int vc1_unescape_buffer_neon(const uint8_t *src, int size, uint8_t *dst)
+{
+    /* Dealing with starting and stopping, and removing escape bytes, are
+     * comparatively less time-sensitive, so are more clearly expressed using
+     * a C wrapper around the assembly inner loop. Note that we assume a
+     * little-endian machine that supports unaligned loads. */
+    int dsize = 0;
+    while (size >= 4)
+    {
+        int found = 0;
+        while (!found && (((uintptr_t) dst) & 7) && size >= 4)
+        {
+            found = (*(uint32_t *)src &~ 0x03000000) == 0x00030000;
+            if (!found)
+            {
+                *dst++ = *src++;
+                --size;
+                ++dsize;
+            }
+        }
+        if (!found)
+        {
+            int skip = size - ff_vc1_unescape_buffer_helper_neon(src, size, dst);
+            dst += skip;
+            src += skip;
+            size -= skip;
+            dsize += skip;
+            while (!found && size >= 4)
+            {
+                found = (*(uint32_t *)src &~ 0x03000000) == 0x00030000;
+                if (!found)
+                {
+                    *dst++ = *src++;
+                    --size;
+                    ++dsize;
+                }
+            }
+        }
+        if (found)
+        {
+            *dst++ = *src++;
+            *dst++ = *src++;
+            ++src;
+            size -= 3;
+            dsize += 2;
+        }
+    }
+    while (size > 0)
+    {
+        *dst++ = *src++;
+        --size;
+        ++dsize;
+    }
+    return dsize;
+}
+
 av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -76,5 +134,7 @@ av_cold void ff_vc1dsp_init_aarch64(VC1DSPContext *dsp)
         dsp->avg_no_rnd_vc1_chroma_pixels_tab[0] = ff_avg_vc1_chroma_mc8_neon;
         dsp->put_no_rnd_vc1_chroma_pixels_tab[1] = ff_put_vc1_chroma_mc4_neon;
         dsp->avg_no_rnd_vc1_chroma_pixels_tab[1] = ff_avg_vc1_chroma_mc4_neon;
+
+        dsp->vc1_unescape_buffer = vc1_unescape_buffer_neon;
     }
 }

diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
index c3ca3eae1e..8bdeffab44 100644
--- a/libavcodec/aarch64/vc1dsp_neon.S
+++ b/libavcodec/aarch64/vc1dsp_neon.S
@@ -1374,3 +1374,179 @@ function ff_vc1_h_loop_filter16_neon, export=1
         st2             {v2.b, v3.b}[7], [x6]
 4:      ret
 endfunc
+
+// Copy at most the specified number of bytes from source to destination buffer,
+// stopping at a multiple of 32 bytes, none of which are the start of an escape sequence
+// On entry:
+//   x0 -> source buffer
+//   w1 = max number of bytes to copy
+//   x2 -> destination buffer, optimally 8-byte aligned
+// On exit:
+//   w0 = number of bytes not copied
+function ff_vc1_unescape_buffer_helper_neon, export=1
+        // Offset by 80 to screen out cases that are too short for us to handle,
+        // and also make it easy to test for loop termination, or to determine
+        // whether we need an odd number of half-iterations of the loop.
+        subs            w1, w1, #80
+        b.mi            90f
+
+        // Set up useful constants
+        movi            v20.4s, #3, lsl #24
+        movi            v21.4s, #3, lsl #16
+
+        tst             w1, #32
+        b.ne            1f
+
+        ld1             {v0.16b, v1.16b, v2.16b}, [x0], #48
+        ext             v25.16b, v0.16b, v1.16b, #1
+        ext             v26.16b, v0.16b, v1.16b, #2
+        ext             v27.16b, v0.16b, v1.16b, #3
+        ext             v29.16b, v1.16b, v2.16b, #1
+        ext             v30.16b, v1.16b, v2.16b, #2
+        ext             v31.16b, v1.16b, v2.16b, #3
+        bic             v24.16b, v0.16b, v20.16b
+        bic             v25.16b, v25.16b, v20.16b
+        bic             v26.16b, v26.16b, v20.16b
+        bic             v27.16b, v27.16b, v20.16b
+        bic             v28.16b, v1.16b, v20.16b
+        bic             v29.16b, v29.16b, v20.16b
+        bic             v30.16b, v30.16b, v20.16b
+        bic             v31.16b, v31.16b, v20.16b
+        eor             v24.16b, v24.16b, v21.16b
+        eor             v25.16b, v25.16b, v21.16b
+        eor             v26.16b, v26.16b, v21.16b
+        eor             v27.16b, v27.16b, v21.16b
+        eor             v28.16b, v28.16b, v21.16b
+        eor             v29.16b, v29.16b, v21.16b
+        eor             v30.16b, v30.16b, v21.16b
+        eor             v31.16b, v31.16b, v21.16b
+        cmeq            v24.4s, v24.4s, #0
+        cmeq            v25.4s, v25.4s, #0
+        cmeq            v26.4s, v26.4s, #0
+        cmeq            v27.4s, v27.4s, #0
+        add             w1, w1, #32
+        b               3f
+
+1:      ld1             {v3.16b, v4.16b, v5.16b}, [x0], #48
+        ext             v25.16b, v3.16b, v4.16b, #1
+        ext             v26.16b, v3.16b, v4.16b, #2
+        ext             v27.16b, v3.16b, v4.16b, #3
+        ext             v29.16b, v4.16b, v5.16b, #1
+        ext             v30.16b, v4.16b, v5.16b, #2
+        ext             v31.16b, v4.16b, v5.16b, #3
+        bic             v24.16b, v3.16b, v20.16b
+        bic             v25.16b, v25.16b, v20.16b
+        bic             v26.16b, v26.16b, v20.16b
+        bic             v27.16b, v27.16b, v20.16b
+        bic             v28.16b, v4.16b, v20.16b
+        bic             v29.16b, v29.16b, v20.16b
+        bic             v30.16b, v30.16b, v20.16b
+        bic             v31.16b, v31.16b, v20.16b
+        eor             v24.16b, v24.16b, v21.16b
+        eor             v25.16b, v25.16b, v21.16b
+        eor             v26.16b, v26.16b, v21.16b
+        eor             v27.16b, v27.16b, v21.16b
+        eor             v28.16b, v28.16b, v21.16b
+        eor             v29.16b, v29.16b, v21.16b
+        eor             v30.16b, v30.16b, v21.16b
+        eor             v31.16b, v31.16b, v21.16b
+        cmeq            v24.4s, v24.4s, #0
+        cmeq            v25.4s, v25.4s, #0
+        cmeq            v26.4s, v26.4s, #0
+        cmeq            v27.4s, v27.4s, #0
+        // Drop through...
+2:      mov             v0.16b, v5.16b
+        ld1             {v1.16b, v2.16b}, [x0], #32
+        cmeq            v28.4s, v28.4s, #0
+        cmeq            v29.4s, v29.4s, #0
+        cmeq            v30.4s, v30.4s, #0
+        cmeq            v31.4s, v31.4s, #0
+        orr             v24.16b, v24.16b, v25.16b
+        orr             v26.16b, v26.16b, v27.16b
+        orr             v28.16b, v28.16b, v29.16b
+        orr             v30.16b, v30.16b, v31.16b
+        ext             v25.16b, v0.16b, v1.16b, #1
+        orr             v22.16b, v24.16b, v26.16b
+        ext             v26.16b, v0.16b, v1.16b, #2
+        ext             v27.16b, v0.16b, v1.16b, #3
+        ext             v29.16b, v1.16b, v2.16b, #1
+        orr             v23.16b, v28.16b, v30.16b
+        ext             v30.16b, v1.16b, v2.16b, #2
+        ext             v31.16b, v1.16b, v2.16b, #3
+        bic             v24.16b, v0.16b, v20.16b
+        bic             v25.16b, v25.16b, v20.16b
+        bic             v26.16b, v26.16b, v20.16b
+        orr             v22.16b, v22.16b, v23.16b
+        bic             v27.16b, v27.16b, v20.16b
+        bic             v28.16b, v1.16b, v20.16b
+        bic             v29.16b, v29.16b, v20.16b
+        bic             v30.16b, v30.16b, v20.16b
+        bic             v31.16b, v31.16b, v20.16b
+        addv            s22, v22.4s
+        eor             v24.16b, v24.16b, v21.16b
+        eor             v25.16b, v25.16b, v21.16b
+        eor             v26.16b, v26.16b, v21.16b
+        eor             v27.16b, v27.16b, v21.16b
+        eor             v28.16b, v28.16b, v21.16b
+        mov             w3, v22.s[0]
+        eor             v29.16b, v29.16b, v21.16b
+        eor             v30.16b, v30.16b, v21.16b
+        eor             v31.16b, v31.16b, v21.16b
+        cmeq            v24.4s, v24.4s, #0
+        cmeq            v25.4s, v25.4s, #0
+        cmeq            v26.4s, v26.4s, #0
+        cmeq            v27.4s, v27.4s, #0
+        cbnz            w3, 90f
+        st1             {v3.16b, v4.16b}, [x2], #32
+3:      mov             v3.16b, v2.16b
+        ld1             {v4.16b, v5.16b}, [x0], #32
+        cmeq            v28.4s, v28.4s, #0
+        cmeq            v29.4s, v29.4s, #0
+        cmeq            v30.4s, v30.4s, #0
+        cmeq            v31.4s, v31.4s, #0
+        orr             v24.16b, v24.16b, v25.16b
+        orr             v26.16b, v26.16b, v27.16b
+        orr             v28.16b, v28.16b, v29.16b
+        orr             v30.16b, v30.16b, v31.16b
+        ext             v25.16b, v3.16b, v4.16b, #1
+        orr             v22.16b, v24.16b, v26.16b
+        ext             v26.16b, v3.16b, v4.16b, #2
+        ext             v27.16b, v3.16b, v4.16b, #3
+        ext             v29.16b, v4.16b, v5.16b, #1
+        orr             v23.16b, v28.16b, v30.16b
+        ext             v30.16b, v4.16b, v5.16b, #2
+        ext             v31.16b, v4.16b, v5.16b, #3
+        bic             v24.16b, v3.16b, v20.16b
+        bic             v25.16b, v25.16b, v20.16b
+        bic             v26.16b, v26.16b, v20.16b
+        orr             v22.16b, v22.16b, v23.16b
+        bic             v27.16b, v27.16b, v20.16b
+        bic             v28.16b, v4.16b, v20.16b
+        bic             v29.16b, v29.16b, v20.16b
+        bic             v30.16b, v30.16b, v20.16b
+        bic             v31.16b, v31.16b, v20.16b
+        addv            s22, v22.4s
+        eor             v24.16b, v24.16b, v21.16b
+        eor             v25.16b, v25.16b, v21.16b
+        eor             v26.16b, v26.16b, v21.16b
+        eor             v27.16b, v27.16b, v21.16b
+        eor             v28.16b, v28.16b, v21.16b
+        mov             w3, v22.s[0]
+        eor             v29.16b, v29.16b, v21.16b
+        eor             v30.16b, v30.16b, v21.16b
+        eor             v31.16b, v31.16b, v21.16b
+        cmeq            v24.4s, v24.4s, #0
+        cmeq            v25.4s, v25.4s, #0
+        cmeq            v26.4s, v26.4s, #0
+        cmeq            v27.4s, v27.4s, #0
+        cbnz            w3, 91f
+        st1             {v0.16b, v1.16b}, [x2], #32
+        subs            w1, w1, #64
+        b.pl            2b
+
+90:     add             w0, w1, #80
+        ret
+
+91:     sub             w1, w1, #32
+        b               90b
+endfunc
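
The helper tests every byte offset of each 32-byte window in parallel: the unshifted vectors plus the three ext shifts give word views at byte offsets 0-3, and the four 32-bit lanes of each view extend that in steps of 4, so all 16 offsets per vector are covered. The bic/eor/cmeq chain evaluates the same (word & ~0x03000000) == 0x00030000 test, and addv folds the per-lane hits into the single flag checked by cbnz. A scalar model of the window predicate (an illustrative sketch; the 32-bit Arm version below works the same way on 16-byte windows):

    #include <stdint.h>
    #include <string.h>

    /* Does any little-endian 32-bit word starting at bytes 0..31 of the
     * window begin an escape sequence? Reads up to 3 bytes past the
     * window, which is why the NEON code keeps the next vector loaded. */
    static int window_has_escape(const uint8_t *window)
    {
        for (int i = 0; i < 32; i++) {
            uint32_t w;
            memcpy(&w, window + i, 4);   /* unaligned little-endian load */
            if ((w & ~0x03000000u) == 0x00030000u)
                return 1;
        }
        return 0;
    }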
diff --git a/libavcodec/arm/vc1dsp_init_neon.c b/libavcodec/arm/vc1dsp_init_neon.c
index f5f5c702d7..3aefbcaf6d 100644
--- a/libavcodec/arm/vc1dsp_init_neon.c
+++ b/libavcodec/arm/vc1dsp_init_neon.c
@@ -84,6 +84,64 @@ void ff_put_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
 void ff_avg_vc1_chroma_mc4_neon(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
                                 int h, int x, int y);
 
+int ff_vc1_unescape_buffer_helper_neon(const uint8_t *src, int size, uint8_t *dst);
+
+static int vc1_unescape_buffer_neon(const uint8_t *src, int size, uint8_t *dst)
+{
+    /* Dealing with starting and stopping, and removing escape bytes, are
+     * comparatively less time-sensitive, so are more clearly expressed using
+     * a C wrapper around the assembly inner loop. Note that we assume a
+     * little-endian machine that supports unaligned loads. */
+    int dsize = 0;
+    while (size >= 4)
+    {
+        int found = 0;
+        while (!found && (((uintptr_t) dst) & 7) && size >= 4)
+        {
+            found = (*(uint32_t *)src &~ 0x03000000) == 0x00030000;
+            if (!found)
+            {
+                *dst++ = *src++;
+                --size;
+                ++dsize;
+            }
+        }
+        if (!found)
+        {
+            int skip = size - ff_vc1_unescape_buffer_helper_neon(src, size, dst);
+            dst += skip;
+            src += skip;
+            size -= skip;
+            dsize += skip;
+            while (!found && size >= 4)
+            {
+                found = (*(uint32_t *)src &~ 0x03000000) == 0x00030000;
+                if (!found)
+                {
+                    *dst++ = *src++;
+                    --size;
+                    ++dsize;
+                }
+            }
+        }
+        if (found)
+        {
+            *dst++ = *src++;
+            *dst++ = *src++;
+            ++src;
+            size -= 3;
+            dsize += 2;
+        }
+    }
+    while (size > 0)
+    {
+        *dst++ = *src++;
+        --size;
+        ++dsize;
+    }
+    return dsize;
+}
+
 #define FN_ASSIGN(X, Y) \
     dsp->put_vc1_mspel_pixels_tab[0][X+4*Y] = ff_put_vc1_mspel_mc##X##Y##_16_neon; \
     dsp->put_vc1_mspel_pixels_tab[1][X+4*Y] = ff_put_vc1_mspel_mc##X##Y##_neon
@@ -130,4 +188,6 @@ av_cold void ff_vc1dsp_init_neon(VC1DSPContext *dsp)
     dsp->avg_no_rnd_vc1_chroma_pixels_tab[0] = ff_avg_vc1_chroma_mc8_neon;
     dsp->put_no_rnd_vc1_chroma_pixels_tab[1] = ff_put_vc1_chroma_mc4_neon;
     dsp->avg_no_rnd_vc1_chroma_pixels_tab[1] = ff_avg_vc1_chroma_mc4_neon;
+
+    dsp->vc1_unescape_buffer = vc1_unescape_buffer_neon;
 }

diff --git a/libavcodec/arm/vc1dsp_neon.S b/libavcodec/arm/vc1dsp_neon.S
index 4ef083102b..9d7333cf12 100644
--- a/libavcodec/arm/vc1dsp_neon.S
+++ b/libavcodec/arm/vc1dsp_neon.S
@@ -1804,3 +1804,121 @@ function ff_vc1_h_loop_filter16_neon, export=1
 4:      vpop            {d8-d15}
         pop             {r4-r6,pc}
 endfunc
+
+@ Copy at most the specified number of bytes from source to destination buffer,
+@ stopping at a multiple of 16 bytes, none of which are the start of an escape sequence
+@ On entry:
+@   r0 -> source buffer
+@   r1 = max number of bytes to copy
+@   r2 -> destination buffer, optimally 8-byte aligned
+@ On exit:
+@   r0 = number of bytes not copied
+function ff_vc1_unescape_buffer_helper_neon, export=1
+        @ Offset by 48 to screen out cases that are too short for us to handle,
+        @ and also make it easy to test for loop termination, or to determine
+        @ whether we need an odd number of half-iterations of the loop.
+        subs            r1, r1, #48
+        bmi             90f
+
+        @ Set up useful constants
+        vmov.i32        q0, #0x3000000
+        vmov.i32        q1, #0x30000
+
+        tst             r1, #16
+        bne             1f
+
+        vld1.8          {q8, q9}, [r0]!
+        vbic            q12, q8, q0
+        vext.8          q13, q8, q9, #1
+        vext.8          q14, q8, q9, #2
+        vext.8          q15, q8, q9, #3
+        veor            q12, q12, q1
+        vbic            q13, q13, q0
+        vbic            q14, q14, q0
+        vbic            q15, q15, q0
+        vceq.i32        q12, q12, #0
+        veor            q13, q13, q1
+        veor            q14, q14, q1
+        veor            q15, q15, q1
+        vceq.i32        q13, q13, #0
+        vceq.i32        q14, q14, #0
+        vceq.i32        q15, q15, #0
+        add             r1, r1, #16
+        b               3f
+
+1:      vld1.8          {q10, q11}, [r0]!
+        vbic            q12, q10, q0
+        vext.8          q13, q10, q11, #1
+        vext.8          q14, q10, q11, #2
+        vext.8          q15, q10, q11, #3
+        veor            q12, q12, q1
+        vbic            q13, q13, q0
+        vbic            q14, q14, q0
+        vbic            q15, q15, q0
+        vceq.i32        q12, q12, #0
+        veor            q13, q13, q1
+        veor            q14, q14, q1
+        veor            q15, q15, q1
+        vceq.i32        q13, q13, #0
+        vceq.i32        q14, q14, #0
+        vceq.i32        q15, q15, #0
+        @ Drop through...
+2:      vmov            q8, q11
+        vld1.8          {q9}, [r0]!
+        vorr            q13, q12, q13
+        vorr            q15, q14, q15
+        vbic            q12, q8, q0
+        vorr            q3, q13, q15
+        vext.8          q13, q8, q9, #1
+        vext.8          q14, q8, q9, #2
+        vext.8          q15, q8, q9, #3
+        veor            q12, q12, q1
+        vorr            d6, d6, d7
+        vbic            q13, q13, q0
+        vbic            q14, q14, q0
+        vbic            q15, q15, q0
+        vceq.i32        q12, q12, #0
+        vmov            r3, r12, d6
+        veor            q13, q13, q1
+        veor            q14, q14, q1
+        veor            q15, q15, q1
+        vceq.i32        q13, q13, #0
+        vceq.i32        q14, q14, #0
+        vceq.i32        q15, q15, #0
+        orrs            r3, r3, r12
+        bne             90f
+        vst1.64         {q10}, [r2]!
+3:      vmov            q10, q9
+        vld1.8          {q11}, [r0]!
+        vorr            q13, q12, q13
+        vorr            q15, q14, q15
+        vbic            q12, q10, q0
+        vorr            q3, q13, q15
+        vext.8          q13, q10, q11, #1
+        vext.8          q14, q10, q11, #2
+        vext.8          q15, q10, q11, #3
+        veor            q12, q12, q1
+        vorr            d6, d6, d7
+        vbic            q13, q13, q0
+        vbic            q14, q14, q0
+        vbic            q15, q15, q0
+        vceq.i32        q12, q12, #0
+        vmov            r3, r12, d6
+        veor            q13, q13, q1
+        veor            q14, q14, q1
+        veor            q15, q15, q1
+        vceq.i32        q13, q13, #0
+        vceq.i32        q14, q14, #0
+        vceq.i32        q15, q15, #0
+        orrs            r3, r3, r12
+        bne             91f
+        vst1.64         {q8}, [r2]!
+        subs            r1, r1, #32
+        bpl             2b
+
+90:     add             r0, r1, #48
+        bx              lr
+
+91:     sub             r1, r1, #16
+        b               90b
+endfunc

diff --git a/libavcodec/vc1dec.c b/libavcodec/vc1dec.c
index 1c92b9d401..6a30b5b664 100644
--- a/libavcodec/vc1dec.c
+++ b/libavcodec/vc1dec.c
@@ -490,7 +490,7 @@ static av_cold int vc1_decode_init(AVCodecContext *avctx)
                 size = next - start - 4;
                 if (size <= 0)
                     continue;
-                buf2_size = vc1_unescape_buffer(start + 4, size, buf2);
+                buf2_size = v->vc1dsp.vc1_unescape_buffer(start + 4, size, buf2);
                 init_get_bits(&gb, buf2, buf2_size * 8);
                 switch (AV_RB32(start)) {
                 case VC1_CODE_SEQHDR:
@@ -680,7 +680,7 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
                 case VC1_CODE_FRAME:
                     if (avctx->hwaccel)
                         buf_start = start;
-                    buf_size2 = vc1_unescape_buffer(start + 4, size, buf2);
+                    buf_size2 = v->vc1dsp.vc1_unescape_buffer(start + 4, size, buf2);
                     break;
                 case VC1_CODE_FIELD: {
                     int buf_size3;
@@ -697,8 +697,8 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
                         ret = AVERROR(ENOMEM);
                         goto err;
                     }
-                    buf_size3 = vc1_unescape_buffer(start + 4, size,
-                                                    slices[n_slices].buf);
+                    buf_size3 = v->vc1dsp.vc1_unescape_buffer(start + 4, size,
+                                                              slices[n_slices].buf);
                     init_get_bits(&slices[n_slices].gb, slices[n_slices].buf,
                                   buf_size3 << 3);
                     slices[n_slices].mby_start = avctx->coded_height + 31 >> 5;
@@ -709,7 +709,7 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
                     break;
                 }
                 case VC1_CODE_ENTRYPOINT: /* it should be before frame data */
-                    buf_size2 = vc1_unescape_buffer(start + 4, size, buf2);
+                    buf_size2 = v->vc1dsp.vc1_unescape_buffer(start + 4, size, buf2);
                     init_get_bits(&s->gb, buf2, buf_size2 * 8);
                     ff_vc1_decode_entry_point(avctx, v, &s->gb);
                     break;
@@ -726,8 +726,8 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
                         ret = AVERROR(ENOMEM);
                         goto err;
                     }
-                    buf_size3 = vc1_unescape_buffer(start + 4, size,
-                                                    slices[n_slices].buf);
+                    buf_size3 = v->vc1dsp.vc1_unescape_buffer(start + 4, size,
+                                                              slices[n_slices].buf);
                     init_get_bits(&slices[n_slices].gb, slices[n_slices].buf,
                                   buf_size3 << 3);
                     slices[n_slices].mby_start = get_bits(&slices[n_slices].gb, 9);
@@ -761,7 +761,7 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
                     ret = AVERROR(ENOMEM);
                     goto err;
                 }
-                buf_size3 = vc1_unescape_buffer(divider + 4, buf + buf_size - divider - 4, slices[n_slices].buf);
+                buf_size3 = v->vc1dsp.vc1_unescape_buffer(divider + 4, buf + buf_size - divider - 4, slices[n_slices].buf);
                 init_get_bits(&slices[n_slices].gb, slices[n_slices].buf,
                               buf_size3 << 3);
                slices[n_slices].mby_start = s->mb_height + 1 >> 1;
@@ -770,9 +770,9 @@ static int vc1_decode_frame(AVCodecContext *avctx, void *data,
                n_slices1 = n_slices - 1;
                n_slices++;
            }
-            buf_size2 = vc1_unescape_buffer(buf, divider - buf, buf2);
+            buf_size2 = v->vc1dsp.vc1_unescape_buffer(buf, divider - buf, buf2);
        } else {
-            buf_size2 = vc1_unescape_buffer(buf, buf_size, buf2);
+            buf_size2 = v->vc1dsp.vc1_unescape_buffer(buf, buf_size, buf2);
        }
        init_get_bits(&s->gb, buf2, buf_size2*8);
    } else{

diff --git a/libavcodec/vc1dsp.c b/libavcodec/vc1dsp.c
index a29b91bf3d..11d493f002 100644
--- a/libavcodec/vc1dsp.c
+++ b/libavcodec/vc1dsp.c
@@ -34,6 +34,7 @@
 #include "rnd_avg.h"
 #include "vc1dsp.h"
 #include "startcode.h"
+#include "vc1_common.h"
 
 /* Apply overlap transform to horizontal edge */
 static void vc1_v_overlap_c(uint8_t *src, int stride)
@@ -1030,6 +1031,7 @@ av_cold void ff_vc1dsp_init(VC1DSPContext *dsp)
 #endif /* CONFIG_WMV3IMAGE_DECODER || CONFIG_VC1IMAGE_DECODER */
 
     dsp->startcode_find_candidate = ff_startcode_find_candidate_c;
+    dsp->vc1_unescape_buffer = vc1_unescape_buffer;
 
     if (ARCH_AARCH64)
         ff_vc1dsp_init_aarch64(dsp);

diff --git a/libavcodec/vc1dsp.h b/libavcodec/vc1dsp.h
index c6443acb20..8be1198071 100644
--- a/libavcodec/vc1dsp.h
+++ b/libavcodec/vc1dsp.h
@@ -80,6 +80,9 @@ typedef struct VC1DSPContext {
      * one or more further zero bytes and a one byte. */
     int (*startcode_find_candidate)(const uint8_t *buf, int size);
+
+    /* Copy a buffer, removing startcode emulation escape bytes as we go */
+    int (*vc1_unescape_buffer)(const uint8_t *src, int size, uint8_t *dst);
 } VC1DSPContext;
 
 void ff_vc1dsp_init(VC1DSPContext* c);
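
A platform implementation of the new function pointer can be cross-checked against the C fallback with a small harness along these lines (a sketch only, reusing the unescape_ref() model shown earlier; everything outside VC1DSPContext and ff_vc1dsp_init() is illustrative):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #include "libavcodec/vc1dsp.h"

    int main(void)
    {
        VC1DSPContext dsp;
        ff_vc1dsp_init(&dsp);           /* installs any NEON fast path */

        static uint8_t src[1024], dst[1024], ref[1024];
        srand(0);
        for (int iter = 0; iter < 1000; iter++) {
            int size = rand() % (int)sizeof(src);
            for (int i = 0; i < size; i++)
                src[i] = (uint8_t)"\x00\x00\x03\x01\xff"[rand() % 5];
                /* biased towards 00/03 so escape sequences occur often */

            int refsize = unescape_ref(src, size, ref);
            int dstsize = dsp.vc1_unescape_buffer(src, size, dst);
            if (dstsize != refsize || memcmp(dst, ref, refsize)) {
                fprintf(stderr, "mismatch at size %d\n", size);
                return 1;
            }
        }
        return 0;
    }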