From patchwork Tue Dec 14 13:33:10 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= X-Patchwork-Id: 32484 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp6964854iog; Tue, 14 Dec 2021 05:33:59 -0800 (PST) X-Google-Smtp-Source: ABdhPJx0aIR0m3ojs08dLu2rEXJ3jUZzLs7HZSmFHH+USbIUGbGohCQuK6ZtYeoCgT0c3goJn0MO X-Received: by 2002:a17:906:2ac4:: with SMTP id m4mr5680794eje.734.1639488838725; Tue, 14 Dec 2021 05:33:58 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1639488838; cv=none; d=google.com; s=arc-20160816; b=kOXjP3wD/qT5XD6EiaG5NhzK4xL3VlB5jPPDEKWGW5woiQnfkYfOUkjBOAxMyyCUFm Rj8wqXnygyGnHZLbh8ZM5coI5a4Wl7e9902zlv52xl+MAJkDi3gl2T+I2uWiIQ2Wp5CM MXSIoYH/GhxoB+IpN9qcV965VRoNHPoWqLheuxEIFCuMhSFomp+amy/mfcDho0RCImyK 8RqP3DqL5Ulh2uL5JusIaBy/dg3UsyIB3PvCuuc2fd8htqvMouS0Ibalvsbxfnc62I0t xAhb4+M3HsWR3hJHYgKkppS2cA3wVfEQxGhkMvU39qMePyoiNHJSKZktUGsxUlEgxd7m sTsA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:delivered-to; bh=UkDzGZrFbMm6Rc4ew+1bsQiIzOYhxy/IUyzQtecdy3w=; b=G6mMyekJotnW3TibjdoneQxztkScuyQqbKfrJJYTVXpiQi+iBvvUn0AsJV5UkCHvyZ oxBlquEGOJfdGYa0DxDM/tlEPUnmllM+vysi7oRn7YvuSKC+IRu6Rk2+/DdNetaTDyHb zAHg+WN0DeUmE3PXZBtBL8UMzhttDoESaqxW6EWfMVIQ1g3W9KIIcOxi8nIZEG4PQ6dU oG42FUZ1eaWPdbisA0VsULHzwGzAE0gD1+/QPVNkcquRtVY8kT+XDCErIYVgScPlzfVB 7SrQUo6rUJcCwOOg6ODMzMphN6EfE3z0FPoZ232SR6r1MXjinQb9kEsm7/8FtGuVjp1k sWjg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. 
[79.124.17.100]) by mx.google.com with ESMTP id js19si26362680ejc.573.2021.12.14.05.33.58; Tue, 14 Dec 2021 05:33:58 -0800 (PST) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 75A5468ADF7; Tue, 14 Dec 2021 15:33:52 +0200 (EET) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from loongson.cn (mail.loongson.cn [114.242.206.163]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 0C1C968AAD7 for ; Tue, 14 Dec 2021 15:33:43 +0200 (EET) Received: from localhost (unknown [36.33.26.144]) by mail.loongson.cn (Coremail) with SMTP id AQAAf9Dx7Nw0nbhhkKcAAA--.3410S3; Tue, 14 Dec 2021 21:33:41 +0800 (CST) From: Hao Chen To: ffmpeg-devel@ffmpeg.org Date: Tue, 14 Dec 2021 21:33:10 +0800 Message-Id: <20211214133316.8978-2-chenhao@loongson.cn> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn> References: <20211214133316.8978-1-chenhao@loongson.cn> MIME-Version: 1.0 X-CM-TRANSID: AQAAf9Dx7Nw0nbhhkKcAAA--.3410S3 X-Coremail-Antispam: 1UD129KBjvJXoW3uw4rWr15Kw1rXFW5Kw4Uurg_yoWkZFW7pr Z7Cr4rKF18XFWIkF92qr98Jr1rWws3WF429FW3uw1jyrs8JF98Jrn2yF9xuFyxW34ru34x u3WkWFy3KFy7G3DanT9S1TB71UUUUU7qnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2 9KBjDU0xBIdaVrnRJUUUkv14x267AKxVWUJVW8JwAFc2x0x2IEx4CE42xK8VAvwI8IcIk0 rVWrJVCq3wAFIxvE14AKwVWUJVWUGwA2ocxC64kIII0Yj41l84x0c7CEw4AK67xGY2AK02 1l84ACjcxK6xIIjxv20xvE14v26ryj6F1UM28EF7xvwVC0I7IYx2IY6xkF7I0E14v26F4j 6r4UJwA2z4x0Y4vEx4A2jsIE14v26rxl6s0DM28EF7xvwVC2z280aVCY1x0267AKxVW0oV Cq3wAS0I0E0xvYzxvE52x082IY62kv0487Mc02F40EFcxC0VAKzVAqx4xG6I80ewAv7VC0 
I7IYx2IY67AKxVWUtVWrXwAv7VC2z280aVAFwI0_Gr1j6F4UJwAm72CE4IkC6x0Yz7v_Jr 0_Gr1lF7xvr2IYc2Ij64vIr41lF7I21c0EjII2zVCS5cI20VAGYxC7MxkIecxEwVAFwVW5 GwCF04k20xvY0x0EwIxGrwCFx2IqxVCFs4IE7xkEbVWUJVW8JwC20s026c02F40E14v26r 1j6r18MI8I3I0E7480Y4vE14v26r106r1rMI8E67AF67kF1VAFwI0_Jrv_JF1lIxkGc2Ij 64vIr41lIxAIcVC0I7IYx2IY67AKxVWUCVW8JwCI42IY6xIIjxv20xvEc7CjxVAFwI0_Gr 0_Cr1lIxAIcVCF04k26cxKx2IYs7xG6r1j6r1xMIIF0xvEx4A2jsIE14v26r4j6F4UMIIF 0xvEx4A2jsIEc7CjxVAFwI0_Gr0_Gr1UYxBIdaVFxhVjvjDU0xZFpf9x0JU47KxUUUUU= X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/ Subject: [FFmpeg-devel] [PATCH v2 1/7] avutil: [loongarch] Add support for loongarch SIMD. X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Shiyou Yin Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: MnaQHeGgv9vs From: Shiyou Yin LSX and LASX are LoongArch SIMD extensions. They are enabled by default if the compiler supports them, and can be disabled with '--disable-lsx' and '--disable-lasx'.
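Note for readers of this patch: besides the configure switches, the patch adds runtime detection in libavutil/loongarch/cpu.c. The following is a minimal standalone sketch of that logic, not the patch code itself. It assumes the CPUCFG word 2 value is passed in as a plain integer instead of being read with the cpucfg instruction, so it compiles on any host. The helper names decode_cfg2(), max_align_for() and self_check() are hypothetical; the bit positions, flag values and alignment results are taken from the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions in CPUCFG word 2, as defined in libavutil/loongarch/cpu.c. */
#define LOONGARCH_CFG2_LSX  (1 << 6)
#define LOONGARCH_CFG2_LASX (1 << 7)

/* Flag values this patch adds to libavutil/cpu.h. */
#define AV_CPU_FLAG_LSX  (1 << 0)
#define AV_CPU_FLAG_LASX (1 << 1)

/* Hypothetical helper: map a CPUCFG word 2 value to AV_CPU_FLAG_* bits. */
static int decode_cfg2(uint32_t cfg2)
{
    int flags = 0;

    if (cfg2 & LOONGARCH_CFG2_LSX)
        flags |= AV_CPU_FLAG_LSX;
    if (cfg2 & LOONGARCH_CFG2_LASX)
        flags |= AV_CPU_FLAG_LASX;

    return flags;
}

/* Hypothetical helper: same policy as ff_get_cpu_max_align_loongarch(). */
static size_t max_align_for(int flags)
{
    if (flags & AV_CPU_FLAG_LASX)
        return 32; /* 256-bit LASX vectors */
    if (flags & AV_CPU_FLAG_LSX)
        return 16; /* 128-bit LSX vectors */

    return 8;
}

/* Hypothetical self-check covering the three cases handled by the patch. */
static void self_check(void)
{
    assert(decode_cfg2(0) == 0);
    assert(max_align_for(decode_cfg2(0)) == 8);
    assert(max_align_for(decode_cfg2(1u << 6)) == 16);
    assert(max_align_for(decode_cfg2((1u << 6) | (1u << 7))) == 32);
}
```

Calling self_check() from a test program exercises the no-SIMD, LSX-only and LSX+LASX cases. In the patch itself the input word comes from the cpucfg instruction and is only queried on Linux.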
Change-Id: Ie2608ea61dbd9b7fffadbf0ec2348bad6c124476 --- Makefile | 2 +- configure | 20 +++++++++-- ffbuild/arch.mak | 4 ++- ffbuild/common.mak | 8 +++++ libavutil/cpu.c | 7 ++++ libavutil/cpu.h | 4 +++ libavutil/cpu_internal.h | 2 ++ libavutil/loongarch/Makefile | 1 + libavutil/loongarch/cpu.c | 69 ++++++++++++++++++++++++++++++++++++ libavutil/loongarch/cpu.h | 31 ++++++++++++++++ libavutil/tests/cpu.c | 3 ++ tests/checkasm/checkasm.c | 3 ++ 12 files changed, 150 insertions(+), 4 deletions(-) create mode 100644 libavutil/loongarch/Makefile create mode 100644 libavutil/loongarch/cpu.c create mode 100644 libavutil/loongarch/cpu.h diff --git a/Makefile b/Makefile index 26c9107237..5b20658b52 100644 --- a/Makefile +++ b/Makefile @@ -89,7 +89,7 @@ SUBDIR_VARS := CLEANFILES FFLIBS HOSTPROGS TESTPROGS TOOLS \ ARMV5TE-OBJS ARMV6-OBJS ARMV8-OBJS VFP-OBJS NEON-OBJS \ ALTIVEC-OBJS VSX-OBJS MMX-OBJS X86ASM-OBJS \ MIPSFPU-OBJS MIPSDSPR2-OBJS MIPSDSP-OBJS MSA-OBJS \ - MMI-OBJS OBJS SLIBOBJS HOSTOBJS TESTOBJS + MMI-OBJS LSX-OBJS LASX-OBJS OBJS SLIBOBJS HOSTOBJS TESTOBJS define RESET $(1) := diff --git a/configure b/configure index a7593ec2db..c4afde4c5c 100755 --- a/configure +++ b/configure @@ -452,7 +452,9 @@ Optimization options (experts only): --disable-mipsdspr2 disable MIPS DSP ASE R2 optimizations --disable-msa disable MSA optimizations --disable-mipsfpu disable floating point MIPS optimizations - --disable-mmi disable Loongson SIMD optimizations + --disable-mmi disable Loongson MMI optimizations + --disable-lsx disable Loongson LSX optimizations + --disable-lasx disable Loongson LASX optimizations --disable-fast-unaligned consider unaligned accesses slow Developer options (useful when working on FFmpeg itself): @@ -2081,6 +2083,8 @@ ARCH_EXT_LIST_LOONGSON=" loongson2 loongson3 mmi + lsx + lasx " ARCH_EXT_LIST_X86_SIMD=" @@ -2617,6 +2621,10 @@ power8_deps="vsx" loongson2_deps="mips" loongson3_deps="mips" +mmi_deps_any="loongson2 loongson3" +lsx_deps="loongarch" 
+lasx_deps="lsx" + mips32r2_deps="mips" mips32r5_deps="mips" mips32r6_deps="mips" @@ -2625,7 +2633,6 @@ mips64r6_deps="mips" mipsfpu_deps="mips" mipsdsp_deps="mips" mipsdspr2_deps="mips" -mmi_deps_any="loongson2 loongson3" msa_deps="mipsfpu" cpunop_deps="i686" @@ -6134,6 +6141,9 @@ EOF ;; esac +elif enabled loongarch; then + enabled lsx && check_inline_asm lsx '"vadd.b $vr0, $vr1, $vr2"' '-mlsx' && append LSXFLAGS '-mlsx' + enabled lasx && check_inline_asm lasx '"xvadd.b $xr0, $xr1, $xr2"' '-mlasx' && append LASXFLAGS '-mlasx' fi check_cc intrinsics_neon arm_neon.h "int16x8_t test = vdupq_n_s16(0)" @@ -7484,6 +7494,10 @@ if enabled ppc; then echo "PPC 4xx optimizations ${ppc4xx-no}" echo "dcbzl available ${dcbzl-no}" fi +if enabled loongarch; then + echo "LSX enabled ${lsx-no}" + echo "LASX enabled ${lasx-no}" +fi echo "debug symbols ${debug-no}" echo "strip symbols ${stripping-no}" echo "optimize for size ${small-no}" @@ -7645,6 +7659,8 @@ ASMSTRIPFLAGS=$ASMSTRIPFLAGS X86ASMFLAGS=$X86ASMFLAGS MSAFLAGS=$MSAFLAGS MMIFLAGS=$MMIFLAGS +LSXFLAGS=$LSXFLAGS +LASXFLAGS=$LASXFLAGS BUILDSUF=$build_suffix PROGSSUF=$progs_suffix FULLNAME=$FULLNAME diff --git a/ffbuild/arch.mak b/ffbuild/arch.mak index e09006efca..997e31e85e 100644 --- a/ffbuild/arch.mak +++ b/ffbuild/arch.mak @@ -8,7 +8,9 @@ OBJS-$(HAVE_MIPSFPU) += $(MIPSFPU-OBJS) $(MIPSFPU-OBJS-yes) OBJS-$(HAVE_MIPSDSP) += $(MIPSDSP-OBJS) $(MIPSDSP-OBJS-yes) OBJS-$(HAVE_MIPSDSPR2) += $(MIPSDSPR2-OBJS) $(MIPSDSPR2-OBJS-yes) OBJS-$(HAVE_MSA) += $(MSA-OBJS) $(MSA-OBJS-yes) -OBJS-$(HAVE_MMI) += $(MMI-OBJS) $(MMI-OBJS-yes) +OBJS-$(HAVE_MMI) += $(MMI-OBJS) $(MMI-OBJS-yes) +OBJS-$(HAVE_LSX) += $(LSX-OBJS) $(LSX-OBJS-yes) +OBJS-$(HAVE_LASX) += $(LASX-OBJS) $(LASX-OBJS-yes) OBJS-$(HAVE_ALTIVEC) += $(ALTIVEC-OBJS) $(ALTIVEC-OBJS-yes) OBJS-$(HAVE_VSX) += $(VSX-OBJS) $(VSX-OBJS-yes) diff --git a/ffbuild/common.mak b/ffbuild/common.mak index 268ae61154..0eb831d434 100644 --- a/ffbuild/common.mak +++ b/ffbuild/common.mak @@ -59,6 +59,8 @@ 
COMPILE_HOSTC = $(call COMPILE,HOSTCC) COMPILE_NVCC = $(call COMPILE,NVCC) COMPILE_MMI = $(call COMPILE,CC,MMIFLAGS) COMPILE_MSA = $(call COMPILE,CC,MSAFLAGS) +COMPILE_LSX = $(call COMPILE,CC,LSXFLAGS) +COMPILE_LASX = $(call COMPILE,CC,LASXFLAGS) %_mmi.o: %_mmi.c $(COMPILE_MMI) @@ -66,6 +68,12 @@ COMPILE_MSA = $(call COMPILE,CC,MSAFLAGS) %_msa.o: %_msa.c $(COMPILE_MSA) +%_lsx.o: %_lsx.c + $(COMPILE_LSX) + +%_lasx.o: %_lasx.c + $(COMPILE_LASX) + %.o: %.c $(COMPILE_C) diff --git a/libavutil/cpu.c b/libavutil/cpu.c index 4627af4f23..63efb97ffd 100644 --- a/libavutil/cpu.c +++ b/libavutil/cpu.c @@ -62,6 +62,8 @@ static int get_cpu_flags(void) return ff_get_cpu_flags_ppc(); if (ARCH_X86) return ff_get_cpu_flags_x86(); + if (ARCH_LOONGARCH) + return ff_get_cpu_flags_loongarch(); return 0; } @@ -168,6 +170,9 @@ int av_parse_cpu_caps(unsigned *flags, const char *s) #elif ARCH_MIPS { "mmi", NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_MMI }, .unit = "flags" }, { "msa", NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_MSA }, .unit = "flags" }, +#elif ARCH_LOONGARCH + { "lsx", NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_LSX }, .unit = "flags" }, + { "lasx", NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_LASX }, .unit = "flags" }, #endif { NULL }, }; @@ -253,6 +258,8 @@ size_t av_cpu_max_align(void) return ff_get_cpu_max_align_ppc(); if (ARCH_X86) return ff_get_cpu_max_align_x86(); + if (ARCH_LOONGARCH) + return ff_get_cpu_max_align_loongarch(); return 8; } diff --git a/libavutil/cpu.h b/libavutil/cpu.h index afea0640b4..ae443eccad 100644 --- a/libavutil/cpu.h +++ b/libavutil/cpu.h @@ -72,6 +72,10 @@ #define AV_CPU_FLAG_MMI (1 << 0) #define AV_CPU_FLAG_MSA (1 << 1) +//Loongarch SIMD extension. +#define AV_CPU_FLAG_LSX (1 << 0) +#define AV_CPU_FLAG_LASX (1 << 1) + /** * Return the flags which specify extensions supported by the CPU. 
* The returned value is affected by av_force_cpu_flags() if that was used diff --git a/libavutil/cpu_internal.h b/libavutil/cpu_internal.h index 889764320b..e207b2d480 100644 --- a/libavutil/cpu_internal.h +++ b/libavutil/cpu_internal.h @@ -46,11 +46,13 @@ int ff_get_cpu_flags_aarch64(void); int ff_get_cpu_flags_arm(void); int ff_get_cpu_flags_ppc(void); int ff_get_cpu_flags_x86(void); +int ff_get_cpu_flags_loongarch(void); size_t ff_get_cpu_max_align_mips(void); size_t ff_get_cpu_max_align_aarch64(void); size_t ff_get_cpu_max_align_arm(void); size_t ff_get_cpu_max_align_ppc(void); size_t ff_get_cpu_max_align_x86(void); +size_t ff_get_cpu_max_align_loongarch(void); #endif /* AVUTIL_CPU_INTERNAL_H */ diff --git a/libavutil/loongarch/Makefile b/libavutil/loongarch/Makefile new file mode 100644 index 0000000000..2addd9351c --- /dev/null +++ b/libavutil/loongarch/Makefile @@ -0,0 +1 @@ +OBJS += loongarch/cpu.o diff --git a/libavutil/loongarch/cpu.c b/libavutil/loongarch/cpu.c new file mode 100644 index 0000000000..e4b240bc44 --- /dev/null +++ b/libavutil/loongarch/cpu.c @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2020 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include <stdint.h> +#include "cpu.h" + +#define LOONGARCH_CFG2 0x2 +#define LOONGARCH_CFG2_LSX (1 << 6) +#define LOONGARCH_CFG2_LASX (1 << 7) + +static int cpu_flags_cpucfg(void) +{ + int flags = 0; + uint32_t cfg2 = 0; + + __asm__ volatile( + "cpucfg %0, %1 \n\t" + : "+&r"(cfg2) + : "r"(LOONGARCH_CFG2) + ); + + if (cfg2 & LOONGARCH_CFG2_LSX) + flags |= AV_CPU_FLAG_LSX; + + if (cfg2 & LOONGARCH_CFG2_LASX) + flags |= AV_CPU_FLAG_LASX; + + return flags; +} + +int ff_get_cpu_flags_loongarch(void) +{ +#if defined __linux__ + return cpu_flags_cpucfg(); +#else + /* Assume no SIMD ASE supported */ + return 0; +#endif +} + +size_t ff_get_cpu_max_align_loongarch(void) +{ + int flags = av_get_cpu_flags(); + + if (flags & AV_CPU_FLAG_LASX) + return 32; + if (flags & AV_CPU_FLAG_LSX) + return 16; + + return 8; +} diff --git a/libavutil/loongarch/cpu.h b/libavutil/loongarch/cpu.h new file mode 100644 index 0000000000..1a445c69bc --- /dev/null +++ b/libavutil/loongarch/cpu.h @@ -0,0 +1,31 @@ +/* + * Copyright (c) 2020 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVUTIL_LOONGARCH_CPU_H +#define AVUTIL_LOONGARCH_CPU_H + +#include "libavutil/cpu.h" +#include "libavutil/cpu_internal.h" + +#define have_lsx(flags) CPUEXT(flags, LSX) +#define have_lasx(flags) CPUEXT(flags, LASX) + +#endif /* AVUTIL_LOONGARCH_CPU_H */ diff --git a/libavutil/tests/cpu.c b/libavutil/tests/cpu.c index c853371fb3..0a6c0cd32e 100644 --- a/libavutil/tests/cpu.c +++ b/libavutil/tests/cpu.c @@ -77,6 +77,9 @@ static const struct { { AV_CPU_FLAG_BMI2, "bmi2" }, { AV_CPU_FLAG_AESNI, "aesni" }, { AV_CPU_FLAG_AVX512, "avx512" }, +#elif ARCH_LOONGARCH + { AV_CPU_FLAG_LSX, "lsx" }, + { AV_CPU_FLAG_LASX, "lasx" }, #endif { 0 } }; diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c index b1353f7cbe..90d080de02 100644 --- a/tests/checkasm/checkasm.c +++ b/tests/checkasm/checkasm.c @@ -236,6 +236,9 @@ static const struct { { "FMA4", "fma4", AV_CPU_FLAG_FMA4 }, { "AVX2", "avx2", AV_CPU_FLAG_AVX2 }, { "AVX-512", "avx512", AV_CPU_FLAG_AVX512 }, +#elif ARCH_LOONGARCH + { "LSX", "lsx", AV_CPU_FLAG_LSX }, + { "LASX", "lasx", AV_CPU_FLAG_LASX }, #endif { NULL } }; From patchwork Tue Dec 14 13:33:11 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= X-Patchwork-Id: 32485 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp6965359iog; Tue, 14 Dec 2021 05:34:24 -0800 (PST) X-Google-Smtp-Source: ABdhPJx5wYGaczGzNVz0W9q3bPsGRPLRebv8AvQUNbxCbuJBoHdNg1G3kDNpxyHzy9TQtJp7rdvS X-Received: by 2002:a05:6402:270a:: with SMTP id y10mr8071437edd.108.1639488863853; Tue, 14 Dec 2021 05:34:23 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1639488863; cv=none; d=google.com; s=arc-20160816; 
b=GMI7y5J6DEvZFwexrMCB0LD/9RLVHnPsE274qRd46zN82QSHv4vxQ/+/1ghltSPvAX 2CAJYNCB0/Wjc5/0zy776ffzfTXPAjpd8mk0raqcp7CDnSvS8rb7r4arcv/FpqXtNrd6 q6lNrAmL3z5eclTJrp/60UZjAV7chJhluFhymT72U7UiMXaHHXo3G2Mv4JColv60ClV+ KUcoFYxV0nLjA+n6auW/vp1sQxjcRmjiV9/rKv9Z+2/oH1srNUF7RrYGrGde9ePiojwx o3+5dUiOccMNnI0tvF+PrSnhidXeOJtzFYEeEru0NnAMtKZhgXtBfKah9tjwAgI1MlDT foyQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:cc:reply-to:list-subscribe:list-help:list-post :list-archive:list-unsubscribe:list-id:precedence:subject :mime-version:references:in-reply-to:message-id:date:to:from :delivered-to; bh=LSTOUzelZUcdO9i3OXoIqgjd05yZ8izxJUqO3Z5MgxE=; b=P6pHmPtQC2VDhnHcaCUhVX4fph/rUFn18lkeSa4MTyNEHOHUtC0NOQuiQpuY/ychqf 3hjX7pSPdb0mKWRImVl8IR7HDmccrk7RixhstNGkYLxbdPVdpLJAJun+z1mZMqcVp5kD hpqveJkJl0LxlwB+RWi8K65f2QtaIh0HDzn6N3ARENUA6Dq2RrMgkESbgsv7AerrEVjQ KgtSUhc0qTK9NqlVMYDV2TyEqnBq3qnqvKTtDx67dhDOidZvIaHFmp4PuRoUNHFsBaDY JAT+C+fh6W0s+fplWYBXDaQjGuWdV6TKpL+cnaVeRHIzfwgDp/DIji80fAf7TEIJpJ4t 3sEg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. 
[79.124.17.100]) by mx.google.com with ESMTP id rh16si18239123ejb.761.2021.12.14.05.34.23; Tue, 14 Dec 2021 05:34:23 -0800 (PST) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id AD36968AF08; Tue, 14 Dec 2021 15:33:56 +0200 (EET) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from loongson.cn (mail.loongson.cn [114.242.206.163]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 3BB6B68ADF7 for ; Tue, 14 Dec 2021 15:33:44 +0200 (EET) Received: from localhost (unknown [36.33.26.144]) by mail.loongson.cn (Coremail) with SMTP id AQAAf9Dx_Nw2nbhhkacAAA--.3468S3; Tue, 14 Dec 2021 21:33:42 +0800 (CST) From: Hao Chen To: ffmpeg-devel@ffmpeg.org Date: Tue, 14 Dec 2021 21:33:11 +0800 Message-Id: <20211214133316.8978-3-chenhao@loongson.cn> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn> References: <20211214133316.8978-1-chenhao@loongson.cn> MIME-Version: 1.0 X-CM-TRANSID: AQAAf9Dx_Nw2nbhhkacAAA--.3468S3 X-Coremail-Antispam: 1UD129KBjDUn29KB7ZKAUJUUUUU529EdanIXcx71UUUUU7v73 VFW2AGmfu7bjvjm3AaLaJ3UjIYCTnIWjp_UUUYn7AC8VAFwI0_Jr0_Gr1l1xkIjI8I6I8E 6xAIw20EY4v20xvaj40_Wr0E3s1l1IIY67AEw4v_Jr0_Jr4l8cAvFVAK0II2c7xJM28Cjx kF64kEwVA0rcxSw2x7M28EF7xvwVC0I7IYx2IY67AKxVW5JVW7JwA2z4x0Y4vE2Ix0cI8I cVCY1x0267AKxVWxJVW8Jr1l84ACjcxK6I8E87Iv67AKxVW0oVCq3wA2z4x0Y4vEx4A2js IEc7CjxVAFwI0_GcCE3s1le2I262IYc4CY6c8Ij28IcVAaY2xG8wASzI0E04IjxsIE14AK x2xKxwAqx4xG64xvF2IEw4CE5I8CrVC2j2WlYx0E2Ix0cI8IcVAFwI0_Jw0_WrylYx0Ex4 A2jsIE14v26r4UJVWxJr1lOx8S6xCaFVCjc4AY6r1j6r4UM4x0x7Aq67IIx4CEVc8vx2IE 
rcIFxwAKzVC20s0267AEwI8IwI0ExsIj0wCY02Avz4vE14v_Xr4l4I8I3I0E4IkC6x0Yz7 v_Jr0_Gr1lx2IqxVAqx4xG67AKxVWUJVWUGwC20s026x8GjcxK67AKxVWUJVWUGwC2zVAF 1VAY17CE14v26r1Y6r17MIIF0xvE2Ix0cI8IcVAFwI0_Gr0_Xr1lIxAIcVC0I7IYx2IY6x kF7I0E14v26r4j6F4UMIIF0xvE42xK8VAvwI8IcIk0rVWUJVWUCwCI42IY6I8E87Iv67AK xVW8JVWxJwCI42IY6I8E87Iv6xkF7I0E14v26r4j6r4UJbIYCTnIWIevJa73UjIFyTuYvj fU8AwIUUUUU X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/ Subject: [FFmpeg-devel] [PATCH v2 2/7] avcodec: [loongarch] Optimize h264_chroma_mc with LASX. X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Shiyou Yin Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: xnNh6AOhJXEr From: Shiyou Yin ./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an before:170 after :183 Change-Id: I42ff23cc2dc7c32bd1b7e4274da9d9ec87065f20 --- libavcodec/h264chroma.c | 2 + libavcodec/h264chroma.h | 1 + libavcodec/loongarch/Makefile | 2 + .../loongarch/h264chroma_init_loongarch.c | 37 + libavcodec/loongarch/h264chroma_lasx.c | 1280 +++++++++++ libavcodec/loongarch/h264chroma_lasx.h | 36 + libavutil/loongarch/loongson_intrinsics.h | 1877 +++++++++++++++++ 7 files changed, 3235 insertions(+) create mode 100644 libavcodec/loongarch/Makefile create mode 100644 libavcodec/loongarch/h264chroma_init_loongarch.c create mode 100644 libavcodec/loongarch/h264chroma_lasx.c create mode 100644 libavcodec/loongarch/h264chroma_lasx.h create mode 100644 libavutil/loongarch/loongson_intrinsics.h diff --git a/libavcodec/h264chroma.c b/libavcodec/h264chroma.c index c2f1f30f5a..0ae6c793e1 100644 --- a/libavcodec/h264chroma.c +++ b/libavcodec/h264chroma.c @@ -56,4 +56,6 @@ av_cold void ff_h264chroma_init(H264ChromaContext *c, int bit_depth) ff_h264chroma_init_x86(c, bit_depth); if (ARCH_MIPS) 
ff_h264chroma_init_mips(c, bit_depth); + if (ARCH_LOONGARCH64) + ff_h264chroma_init_loongarch(c, bit_depth); } diff --git a/libavcodec/h264chroma.h b/libavcodec/h264chroma.h index 5c89fd12df..3259b4935f 100644 --- a/libavcodec/h264chroma.h +++ b/libavcodec/h264chroma.h @@ -36,5 +36,6 @@ void ff_h264chroma_init_arm(H264ChromaContext *c, int bit_depth); void ff_h264chroma_init_ppc(H264ChromaContext *c, int bit_depth); void ff_h264chroma_init_x86(H264ChromaContext *c, int bit_depth); void ff_h264chroma_init_mips(H264ChromaContext *c, int bit_depth); +void ff_h264chroma_init_loongarch(H264ChromaContext *c, int bit_depth); #endif /* AVCODEC_H264CHROMA_H */ diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile new file mode 100644 index 0000000000..f8fb54c925 --- /dev/null +++ b/libavcodec/loongarch/Makefile @@ -0,0 +1,2 @@ +OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_init_loongarch.o +LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o diff --git a/libavcodec/loongarch/h264chroma_init_loongarch.c b/libavcodec/loongarch/h264chroma_init_loongarch.c new file mode 100644 index 0000000000..0ca24ecc47 --- /dev/null +++ b/libavcodec/loongarch/h264chroma_init_loongarch.c @@ -0,0 +1,37 @@ +/* + * Copyright (c) 2020 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "h264chroma_lasx.h" +#include "libavutil/attributes.h" +#include "libavutil/loongarch/cpu.h" +#include "libavcodec/h264chroma.h" + +av_cold void ff_h264chroma_init_loongarch(H264ChromaContext *c, int bit_depth) +{ + int cpu_flags = av_get_cpu_flags(); + if (have_lasx(cpu_flags)) { + if (bit_depth <= 8) { + c->put_h264_chroma_pixels_tab[0] = ff_put_h264_chroma_mc8_lasx; + c->avg_h264_chroma_pixels_tab[0] = ff_avg_h264_chroma_mc8_lasx; + c->put_h264_chroma_pixels_tab[1] = ff_put_h264_chroma_mc4_lasx; + } + } +} diff --git a/libavcodec/loongarch/h264chroma_lasx.c b/libavcodec/loongarch/h264chroma_lasx.c new file mode 100644 index 0000000000..824a78dfc8 --- /dev/null +++ b/libavcodec/loongarch/h264chroma_lasx.c @@ -0,0 +1,1280 @@ +/* + * Loongson LASX optimized h264chroma + * + * Copyright (c) 2020 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "h264chroma_lasx.h" +#include "libavutil/attributes.h" +#include "libavutil/avassert.h" +#include "libavutil/loongarch/loongson_intrinsics.h" + +static const uint8_t chroma_mask_arr[64] = { + 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, + 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, + 0, 1, 1, 2, 2, 3, 3, 4, 16, 17, 17, 18, 18, 19, 19, 20, + 0, 1, 1, 2, 2, 3, 3, 4, 16, 17, 17, 18, 18, 19, 19, 20 +}; + +static av_always_inline void avc_chroma_hv_8x4_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride, uint32_t coef_hor0, + uint32_t coef_hor1, uint32_t coef_ver0, + uint32_t coef_ver1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride_2x << 1; + __m256i src0, src1, src2, src3, src4, out; + __m256i res_hz0, res_hz1, res_hz2, res_vt0, res_vt1; + __m256i mask; + __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0); + __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1); + __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1); + __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0); + __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1); + + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x, + src1, src2, src3, src4); + DUP2_ARG3(__lasx_xvpermi_q, src2, src1, 0x20, src4, src3, 0x20, src1, src3); + src0 = __lasx_xvshuf_b(src0, src0, mask); + DUP2_ARG3(__lasx_xvshuf_b, src1, src1, mask, src3, src3, mask, src1, src3); + DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, res_hz0, res_hz1); + res_hz2 = __lasx_xvdp2_h_bu(src3, coeff_hz_vec); + res_vt0 = __lasx_xvmul_h(res_hz1, coeff_vt_vec0); + res_vt1 = __lasx_xvmul_h(res_hz2, 
coeff_vt_vec0); + res_hz0 = __lasx_xvpermi_q(res_hz1, res_hz0, 0x20); + res_hz1 = __lasx_xvpermi_q(res_hz1, res_hz2, 0x3); + res_vt0 = __lasx_xvmadd_h(res_vt0, res_hz0, coeff_vt_vec1); + res_vt1 = __lasx_xvmadd_h(res_vt1, res_hz1, coeff_vt_vec1); + out = __lasx_xvssrarni_bu_h(res_vt1, res_vt0, 6); + __lasx_xvstelm_d(out, dst, 0, 0); + __lasx_xvstelm_d(out, dst + stride, 0, 2); + __lasx_xvstelm_d(out, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out, dst + stride_3x, 0, 3); +} + +static av_always_inline void avc_chroma_hv_8x8_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride, uint32_t coef_hor0, + uint32_t coef_hor1, uint32_t coef_ver0, + uint32_t coef_ver1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8; + __m256i out0, out1; + __m256i res_hz0, res_hz1, res_hz2, res_hz3, res_hz4; + __m256i res_vt0, res_vt1, res_vt2, res_vt3; + __m256i mask; + __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0); + __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1); + __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1); + __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0); + __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1); + + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x, + src1, src2, src3, src4); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x, + src5, src6, src7, src8); + DUP4_ARG3(__lasx_xvpermi_q, src2, src1, 0x20, src4, src3, 0x20, src6, src5, 0x20, + src8, src7, 0x20, src1, src3, src5, src7); + src0 = __lasx_xvshuf_b(src0, src0, mask); + DUP4_ARG3(__lasx_xvshuf_b, src1, src1, mask, src3, src3, mask, src5, src5, mask, src7, + src7, mask, src1, src3, src5, src7); + DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, src3, + 
coeff_hz_vec, src5, coeff_hz_vec, res_hz0, res_hz1, res_hz2, res_hz3); + res_hz4 = __lasx_xvdp2_h_bu(src7, coeff_hz_vec); + res_vt0 = __lasx_xvmul_h(res_hz1, coeff_vt_vec0); + res_vt1 = __lasx_xvmul_h(res_hz2, coeff_vt_vec0); + res_vt2 = __lasx_xvmul_h(res_hz3, coeff_vt_vec0); + res_vt3 = __lasx_xvmul_h(res_hz4, coeff_vt_vec0); + res_hz0 = __lasx_xvpermi_q(res_hz1, res_hz0, 0x20); + res_hz1 = __lasx_xvpermi_q(res_hz1, res_hz2, 0x3); + res_hz2 = __lasx_xvpermi_q(res_hz2, res_hz3, 0x3); + res_hz3 = __lasx_xvpermi_q(res_hz3, res_hz4, 0x3); + DUP4_ARG3(__lasx_xvmadd_h, res_vt0, res_hz0, coeff_vt_vec1, res_vt1, res_hz1, coeff_vt_vec1, + res_vt2, res_hz2, coeff_vt_vec1, res_vt3, res_hz3, coeff_vt_vec1, + res_vt0, res_vt1, res_vt2, res_vt3); + DUP2_ARG3(__lasx_xvssrarni_bu_h, res_vt1, res_vt0, 6, res_vt3, res_vt2, 6, out0, out1); + __lasx_xvstelm_d(out0, dst, 0, 0); + __lasx_xvstelm_d(out0, dst + stride, 0, 2); + __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_d(out1, dst, 0, 0); + __lasx_xvstelm_d(out1, dst + stride, 0, 2); + __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3); +} + +static av_always_inline void avc_chroma_hz_8x4_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride, uint32_t coeff0, uint32_t coeff1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + __m256i src0, src1, src2, src3, out; + __m256i res0, res1; + __m256i mask; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0); + DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src1, src2); + src3 = __lasx_xvldx(src, stride_3x); + DUP2_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src0, src2); + 
DUP2_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src0, src2); + DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1); + out = __lasx_xvssrarni_bu_h(res1, res0, 6); + __lasx_xvstelm_d(out, dst, 0, 0); + __lasx_xvstelm_d(out, dst + stride, 0, 2); + __lasx_xvstelm_d(out, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out, dst + stride_3x, 0, 3); + +} + +static av_always_inline void avc_chroma_hz_8x8_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride, uint32_t coeff0, uint32_t coeff1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + __m256i out0, out1; + __m256i res0, res1, res2, res3; + __m256i mask; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x, + src1, src2, src3, src4); + src += stride_4x; + DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src5, src6); + src7 = __lasx_xvldx(src, stride_3x); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, 0x20, + src7, src6, 0x20, src0, src2, src4, src6); + DUP4_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src4, src4, mask, + src6, src6, mask, src0, src2, src4, src6); + DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, src4, coeff_vec, src6, + coeff_vec, res0, res1, res2, res3); + DUP2_ARG3(__lasx_xvssrarni_bu_h, res1, res0, 6, res3, res2, 6, out0, out1); + __lasx_xvstelm_d(out0, dst, 0, 0); + __lasx_xvstelm_d(out0, dst + stride, 0, 2); + __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_d(out1, dst, 0, 0); + 
__lasx_xvstelm_d(out1, dst + stride, 0, 2); + __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3); +} + +static av_always_inline void avc_chroma_hz_nonmult_lasx(uint8_t *src, + uint8_t *dst, ptrdiff_t stride, uint32_t coeff0, + uint32_t coeff1, int32_t height) +{ + uint32_t row; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i src0, src1, src2, src3, out; + __m256i res0, res1; + __m256i mask; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + mask = __lasx_xvld(chroma_mask_arr, 0); + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + + for (row = height >> 2; row--;) { + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src0, src1, src2, src3); + src += stride_4x; + DUP2_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src0, src2); + DUP2_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src0, src2); + DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1); + out = __lasx_xvssrarni_bu_h(res1, res0, 6); + __lasx_xvstelm_d(out, dst, 0, 0); + __lasx_xvstelm_d(out, dst + stride, 0, 2); + __lasx_xvstelm_d(out, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out, dst + stride_3x, 0, 3); + dst += stride_4x; + } + + if ((height & 3)) { + src0 = __lasx_xvld(src, 0); + src1 = __lasx_xvldx(src, stride); + src1 = __lasx_xvpermi_q(src1, src0, 0x20); + src0 = __lasx_xvshuf_b(src1, src1, mask); + res0 = __lasx_xvdp2_h_bu(src0, coeff_vec); + out = __lasx_xvssrarni_bu_h(res0, res0, 6); + __lasx_xvstelm_d(out, dst, 0, 0); + dst += stride; + __lasx_xvstelm_d(out, dst, 0, 2); + } +} + +static av_always_inline void avc_chroma_vt_8x4_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride, uint32_t coeff0, uint32_t coeff1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x 
+ stride; + __m256i src0, src1, src2, src3, src4, out; + __m256i res0, res1; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + src0 = __lasx_xvld(src, 0); + src += stride; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src1, src2, src3, src4); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, 0x20, + src4, src3, 0x20, src0, src1, src2, src3); + DUP2_ARG2(__lasx_xvilvl_b, src1, src0, src3, src2, src0, src2); + DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1); + out = __lasx_xvssrarni_bu_h(res1, res0, 6); + __lasx_xvstelm_d(out, dst, 0, 0); + __lasx_xvstelm_d(out, dst + stride, 0, 2); + __lasx_xvstelm_d(out, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out, dst + stride_3x, 0, 3); +} + +static av_always_inline void avc_chroma_vt_8x8_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride, uint32_t coeff0, uint32_t coeff1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8; + __m256i out0, out1; + __m256i res0, res1, res2, res3; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + src0 = __lasx_xvld(src, 0); + src += stride; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src1, src2, src3, src4); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src5, src6, src7, src8); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, 0x20, + src4, src3, 0x20, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, 
src6, src5, 0x20, src7, src6, 0x20, + src8, src7, 0x20, src4, src5, src6, src7); + DUP4_ARG2(__lasx_xvilvl_b, src1, src0, src3, src2, src5, src4, src7, src6, + src0, src2, src4, src6); + DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, src4, coeff_vec, + src6, coeff_vec, res0, res1, res2, res3); + DUP2_ARG3(__lasx_xvssrarni_bu_h, res1, res0, 6, res3, res2, 6, out0, out1); + __lasx_xvstelm_d(out0, dst, 0, 0); + __lasx_xvstelm_d(out0, dst + stride, 0, 2); + __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_d(out1, dst, 0, 0); + __lasx_xvstelm_d(out1, dst + stride, 0, 2); + __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3); +} + +static av_always_inline void copy_width8x8_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride) +{ + uint64_t tmp[8]; + ptrdiff_t stride_2, stride_3, stride_4; + __asm__ volatile ( + "slli.d %[stride_2], %[stride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[stride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "ld.d %[tmp0], %[src], 0x0 \n\t" + "ldx.d %[tmp1], %[src], %[stride] \n\t" + "ldx.d %[tmp2], %[src], %[stride_2] \n\t" + "ldx.d %[tmp3], %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "ld.d %[tmp4], %[src], 0x0 \n\t" + "ldx.d %[tmp5], %[src], %[stride] \n\t" + "ldx.d %[tmp6], %[src], %[stride_2] \n\t" + "ldx.d %[tmp7], %[src], %[stride_3] \n\t" + + "st.d %[tmp0], %[dst], 0x0 \n\t" + "stx.d %[tmp1], %[dst], %[stride] \n\t" + "stx.d %[tmp2], %[dst], %[stride_2] \n\t" + "stx.d %[tmp3], %[dst], %[stride_3] \n\t" + "add.d %[dst], %[dst], %[stride_4] \n\t" + "st.d %[tmp4], %[dst], 0x0 \n\t" + "stx.d %[tmp5], %[dst], %[stride] \n\t" + "stx.d %[tmp6], %[dst], %[stride_2] \n\t" + "stx.d %[tmp7], %[dst], %[stride_3] \n\t" + : [tmp0]"=&r"(tmp[0]), [tmp1]"=&r"(tmp[1]), + [tmp2]"=&r"(tmp[2]), [tmp3]"=&r"(tmp[3]), + [tmp4]"=&r"(tmp[4]), [tmp5]"=&r"(tmp[5]), + [tmp6]"=&r"(tmp[6]), 
[tmp7]"=&r"(tmp[7]), + [dst]"+&r"(dst), [src]"+&r"(src), + [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), + [stride_4]"=&r"(stride_4) + : [stride]"r"(stride) + : "memory" + ); +} + +static av_always_inline void copy_width8x4_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride) +{ + uint64_t tmp[4]; + ptrdiff_t stride_2, stride_3; + __asm__ volatile ( + "slli.d %[stride_2], %[stride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[stride] \n\t" + "ld.d %[tmp0], %[src], 0x0 \n\t" + "ldx.d %[tmp1], %[src], %[stride] \n\t" + "ldx.d %[tmp2], %[src], %[stride_2] \n\t" + "ldx.d %[tmp3], %[src], %[stride_3] \n\t" + + "st.d %[tmp0], %[dst], 0x0 \n\t" + "stx.d %[tmp1], %[dst], %[stride] \n\t" + "stx.d %[tmp2], %[dst], %[stride_2] \n\t" + "stx.d %[tmp3], %[dst], %[stride_3] \n\t" + : [tmp0]"=&r"(tmp[0]), [tmp1]"=&r"(tmp[1]), + [tmp2]"=&r"(tmp[2]), [tmp3]"=&r"(tmp[3]), + [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3) + : [stride]"r"(stride), [dst]"r"(dst), [src]"r"(src) + : "memory" + ); +} + +static void avc_chroma_hv_8w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coef_hor0, uint32_t coef_hor1, + uint32_t coef_ver0, uint32_t coef_ver1, + int32_t height) +{ + if (4 == height) { + avc_chroma_hv_8x4_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0, + coef_ver1); + } else if (8 == height) { + avc_chroma_hv_8x8_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0, + coef_ver1); + } +} + +static void avc_chroma_hv_4x2_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coef_hor0, uint32_t coef_hor1, + uint32_t coef_ver0, uint32_t coef_ver1) +{ + ptrdiff_t stride_2 = stride << 1; + __m256i src0, src1, src2; + __m256i res_hz, res_vt; + __m256i mask; + __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0); + __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1); + __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1); + __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0); + __m256i coeff_vt_vec1 = 
__lasx_xvreplgr2vr_h(coef_ver1); + __m256i coeff_vt_vec = __lasx_xvpermi_q(coeff_vt_vec1, coeff_vt_vec0, 0x02); + + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0); + DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2, src1, src2); + DUP2_ARG3(__lasx_xvshuf_b, src1, src0, mask, src2, src1, mask, src0, src1); + src0 = __lasx_xvpermi_q(src0, src1, 0x02); + res_hz = __lasx_xvdp2_h_bu(src0, coeff_hz_vec); + res_vt = __lasx_xvmul_h(res_hz, coeff_vt_vec); + res_hz = __lasx_xvpermi_q(res_hz, res_vt, 0x01); + res_vt = __lasx_xvadd_h(res_hz, res_vt); + res_vt = __lasx_xvssrarni_bu_h(res_vt, res_vt, 6); + __lasx_xvstelm_w(res_vt, dst, 0, 0); + __lasx_xvstelm_w(res_vt, dst + stride, 0, 1); +} + +static void avc_chroma_hv_4x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coef_hor0, uint32_t coef_hor1, + uint32_t coef_ver0, uint32_t coef_ver1) +{ + ptrdiff_t stride_2 = stride << 1; + ptrdiff_t stride_3 = stride_2 + stride; + ptrdiff_t stride_4 = stride_2 << 1; + __m256i src0, src1, src2, src3, src4; + __m256i res_hz0, res_hz1, res_vt0, res_vt1; + __m256i mask; + __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0); + __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1); + __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1); + __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0); + __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1); + + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3, + src, stride_4, src1, src2, src3, src4); + DUP4_ARG3(__lasx_xvshuf_b, src1, src0, mask, src2, src1, mask, src3, src2, mask, + src4, src3, mask, src0, src1, src2, src3); + DUP2_ARG3(__lasx_xvpermi_q, src0, src2, 0x02, src1, src3, 0x02, src0, src1); + DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, res_hz0, res_hz1); + DUP2_ARG2(__lasx_xvmul_h, res_hz0, coeff_vt_vec1, res_hz1, coeff_vt_vec0, res_vt0, res_vt1); + res_hz0 = 
__lasx_xvadd_h(res_vt0, res_vt1); + res_hz0 = __lasx_xvssrarni_bu_h(res_hz0, res_hz0, 6); + __lasx_xvstelm_w(res_hz0, dst, 0, 0); + __lasx_xvstelm_w(res_hz0, dst + stride, 0, 1); + __lasx_xvstelm_w(res_hz0, dst + stride_2, 0, 4); + __lasx_xvstelm_w(res_hz0, dst + stride_3, 0, 5); +} + +static void avc_chroma_hv_4x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coef_hor0, uint32_t coef_hor1, + uint32_t coef_ver0, uint32_t coef_ver1) +{ + ptrdiff_t stride_2 = stride << 1; + ptrdiff_t stride_3 = stride_2 + stride; + ptrdiff_t stride_4 = stride_2 << 1; + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8; + __m256i res_hz0, res_hz1, res_hz2, res_hz3; + __m256i res_vt0, res_vt1, res_vt2, res_vt3; + __m256i mask; + __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0); + __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1); + __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1); + __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0); + __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1); + + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3, + src, stride_4, src1, src2, src3, src4); + src += stride_4; + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3, + src, stride_4, src5, src6, src7, src8); + DUP4_ARG3(__lasx_xvshuf_b, src1, src0, mask, src2, src1, mask, src3, src2, mask, + src4, src3, mask, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvshuf_b, src5, src4, mask, src6, src5, mask, src7, src6, mask, + src8, src7, mask, src4, src5, src6, src7); + DUP4_ARG3(__lasx_xvpermi_q, src0, src2, 0x02, src1, src3, 0x02, src4, src6, 0x02, + src5, src7, 0x02, src0, src1, src4, src5); + DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, src4, coeff_hz_vec, + src5, coeff_hz_vec, res_hz0, res_hz1, res_hz2, res_hz3); + DUP4_ARG2(__lasx_xvmul_h, res_hz0, coeff_vt_vec1, res_hz1, coeff_vt_vec0, res_hz2, + coeff_vt_vec1, 
res_hz3, coeff_vt_vec0, res_vt0, res_vt1, res_vt2, res_vt3); + DUP2_ARG2(__lasx_xvadd_h, res_vt0, res_vt1, res_vt2, res_vt3, res_vt0, res_vt2); + res_hz0 = __lasx_xvssrarni_bu_h(res_vt2, res_vt0, 6); + __lasx_xvstelm_w(res_hz0, dst, 0, 0); + __lasx_xvstelm_w(res_hz0, dst + stride, 0, 1); + __lasx_xvstelm_w(res_hz0, dst + stride_2, 0, 4); + __lasx_xvstelm_w(res_hz0, dst + stride_3, 0, 5); + dst += stride_4; + __lasx_xvstelm_w(res_hz0, dst, 0, 2); + __lasx_xvstelm_w(res_hz0, dst + stride, 0, 3); + __lasx_xvstelm_w(res_hz0, dst + stride_2, 0, 6); + __lasx_xvstelm_w(res_hz0, dst + stride_3, 0, 7); +} + +static void avc_chroma_hv_4w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coef_hor0, uint32_t coef_hor1, + uint32_t coef_ver0, uint32_t coef_ver1, + int32_t height) +{ + if (8 == height) { + avc_chroma_hv_4x8_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0, + coef_ver1); + } else if (4 == height) { + avc_chroma_hv_4x4_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0, + coef_ver1); + } else if (2 == height) { + avc_chroma_hv_4x2_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0, + coef_ver1); + } +} + +static void avc_chroma_hz_4x2_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1) +{ + __m256i src0, src1; + __m256i res, mask; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0); + src1 = __lasx_xvldx(src, stride); + src0 = __lasx_xvshuf_b(src1, src0, mask); + res = __lasx_xvdp2_h_bu(src0, coeff_vec); + res = __lasx_xvslli_h(res, 3); + res = __lasx_xvssrarni_bu_h(res, res, 6); + __lasx_xvstelm_w(res, dst, 0, 0); + __lasx_xvstelm_w(res, dst + stride, 0, 1); +} + +static void avc_chroma_hz_4x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1) +{ + ptrdiff_t stride_2 = stride << 1; + 
ptrdiff_t stride_3 = stride_2 + stride; + __m256i src0, src1, src2, src3; + __m256i res, mask; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0); + DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2, src1, src2); + src3 = __lasx_xvldx(src, stride_3); + DUP2_ARG3(__lasx_xvshuf_b, src1, src0, mask, src3, src2, mask, src0, src2); + src0 = __lasx_xvpermi_q(src0, src2, 0x02); + res = __lasx_xvdp2_h_bu(src0, coeff_vec); + res = __lasx_xvslli_h(res, 3); + res = __lasx_xvssrarni_bu_h(res, res, 6); + __lasx_xvstelm_w(res, dst, 0, 0); + __lasx_xvstelm_w(res, dst + stride, 0, 1); + __lasx_xvstelm_w(res, dst + stride_2, 0, 4); + __lasx_xvstelm_w(res, dst + stride_3, 0, 5); +} + +static void avc_chroma_hz_4x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1) +{ + ptrdiff_t stride_2 = stride << 1; + ptrdiff_t stride_3 = stride_2 + stride; + ptrdiff_t stride_4 = stride_2 << 1; + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + __m256i res0, res1, mask; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3, + src, stride_4, src1, src2, src3, src4); + src += stride_4; + DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2, src5, src6); + src7 = __lasx_xvldx(src, stride_3); + DUP4_ARG3(__lasx_xvshuf_b, src1, src0, mask, src3, src2, mask, src5, src4, mask, + src7, src6, mask, src0, src2, src4, src6); + DUP2_ARG3(__lasx_xvpermi_q, src0, src2, 0x02, src4, src6, 0x02, src0, src4); + DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src4, coeff_vec, res0, res1); + res0 = 
__lasx_xvssrarni_bu_h(res1, res0, 6); + __lasx_xvstelm_w(res0, dst, 0, 0); + __lasx_xvstelm_w(res0, dst + stride, 0, 1); + __lasx_xvstelm_w(res0, dst + stride_2, 0, 4); + __lasx_xvstelm_w(res0, dst + stride_3, 0, 5); + dst += stride_4; + __lasx_xvstelm_w(res0, dst, 0, 2); + __lasx_xvstelm_w(res0, dst + stride, 0, 3); + __lasx_xvstelm_w(res0, dst + stride_2, 0, 6); + __lasx_xvstelm_w(res0, dst + stride_3, 0, 7); +} + +static void avc_chroma_hz_4w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1, + int32_t height) +{ + if (8 == height) { + avc_chroma_hz_4x8_lasx(src, dst, stride, coeff0, coeff1); + } else if (4 == height) { + avc_chroma_hz_4x4_lasx(src, dst, stride, coeff0, coeff1); + } else if (2 == height) { + avc_chroma_hz_4x2_lasx(src, dst, stride, coeff0, coeff1); + } +} + +static void avc_chroma_hz_8w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1, + int32_t height) +{ + if (4 == height) { + avc_chroma_hz_8x4_lasx(src, dst, stride, coeff0, coeff1); + } else if (8 == height) { + avc_chroma_hz_8x8_lasx(src, dst, stride, coeff0, coeff1); + } else { + avc_chroma_hz_nonmult_lasx(src, dst, stride, coeff0, coeff1, height); + } +} + +static void avc_chroma_vt_4x2_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1) +{ + __m256i src0, src1, src2; + __m256i tmp0, tmp1; + __m256i res; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + src0 = __lasx_xvld(src, 0); + DUP2_ARG2(__lasx_xvldx, src, stride, src, stride << 1, src1, src2); + DUP2_ARG2(__lasx_xvilvl_b, src1, src0, src2, src1, tmp0, tmp1); + tmp0 = __lasx_xvilvl_d(tmp1, tmp0); + res = __lasx_xvdp2_h_bu(tmp0, coeff_vec); + res = __lasx_xvslli_h(res, 3); + res = __lasx_xvssrarni_bu_h(res, res, 6); + __lasx_xvstelm_w(res, dst, 0, 0); + __lasx_xvstelm_w(res, dst + stride, 0, 1); +} + +static 
void avc_chroma_vt_4x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1) +{ + ptrdiff_t stride_2 = stride << 1; + ptrdiff_t stride_3 = stride_2 + stride; + ptrdiff_t stride_4 = stride_2 << 1; + __m256i src0, src1, src2, src3, src4; + __m256i tmp0, tmp1, tmp2, tmp3; + __m256i res; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + src0 = __lasx_xvld(src, 0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3, + src, stride_4, src1, src2, src3, src4); + DUP4_ARG2(__lasx_xvilvl_b, src1, src0, src2, src1, src3, src2, src4, src3, + tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp2); + tmp0 = __lasx_xvpermi_q(tmp0, tmp2, 0x02); + res = __lasx_xvdp2_h_bu(tmp0, coeff_vec); + res = __lasx_xvslli_h(res, 3); + res = __lasx_xvssrarni_bu_h(res, res, 6); + __lasx_xvstelm_w(res, dst, 0, 0); + __lasx_xvstelm_w(res, dst + stride, 0, 1); + __lasx_xvstelm_w(res, dst + stride_2, 0, 4); + __lasx_xvstelm_w(res, dst + stride_3, 0, 5); +} + +static void avc_chroma_vt_4x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1) +{ + ptrdiff_t stride_2 = stride << 1; + ptrdiff_t stride_3 = stride_2 + stride; + ptrdiff_t stride_4 = stride_2 << 1; + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8; + __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7; + __m256i res0, res1; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + src0 = __lasx_xvld(src, 0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3, + src, stride_4, src1, src2, src3, src4); + src += stride_4; + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3, + src, stride_4, src5, 
src6, src7, src8); + DUP4_ARG2(__lasx_xvilvl_b, src1, src0, src2, src1, src3, src2, src4, src3, + tmp0, tmp1, tmp2, tmp3); + DUP4_ARG2(__lasx_xvilvl_b, src5, src4, src6, src5, src7, src6, src8, src7, + tmp4, tmp5, tmp6, tmp7); + DUP4_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp5, tmp4, tmp7, tmp6, + tmp0, tmp2, tmp4, tmp6); + tmp0 = __lasx_xvpermi_q(tmp0, tmp2, 0x02); + tmp4 = __lasx_xvpermi_q(tmp4, tmp6, 0x02); + DUP2_ARG2(__lasx_xvdp2_h_bu, tmp0, coeff_vec, tmp4, coeff_vec, res0, res1); + res0 = __lasx_xvssrarni_bu_h(res1, res0, 6); + __lasx_xvstelm_w(res0, dst, 0, 0); + __lasx_xvstelm_w(res0, dst + stride, 0, 1); + __lasx_xvstelm_w(res0, dst + stride_2, 0, 4); + __lasx_xvstelm_w(res0, dst + stride_3, 0, 5); + dst += stride_4; + __lasx_xvstelm_w(res0, dst, 0, 2); + __lasx_xvstelm_w(res0, dst + stride, 0, 3); + __lasx_xvstelm_w(res0, dst + stride_2, 0, 6); + __lasx_xvstelm_w(res0, dst + stride_3, 0, 7); +} + +static void avc_chroma_vt_4w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1, + int32_t height) +{ + if (8 == height) { + avc_chroma_vt_4x8_lasx(src, dst, stride, coeff0, coeff1); + } else if (4 == height) { + avc_chroma_vt_4x4_lasx(src, dst, stride, coeff0, coeff1); + } else if (2 == height) { + avc_chroma_vt_4x2_lasx(src, dst, stride, coeff0, coeff1); + } +} + +static void avc_chroma_vt_8w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + uint32_t coeff0, uint32_t coeff1, + int32_t height) +{ + if (4 == height) { + avc_chroma_vt_8x4_lasx(src, dst, stride, coeff0, coeff1); + } else if (8 == height) { + avc_chroma_vt_8x8_lasx(src, dst, stride, coeff0, coeff1); + } +} + +static void copy_width4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + int32_t height) +{ + uint32_t tp0, tp1, tp2, tp3, tp4, tp5, tp6, tp7; + + if (8 == height) { + ptrdiff_t stride_2, stride_3, stride_4; + + __asm__ volatile ( + "slli.d %[stride_2], %[stride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[stride] \n\t" + "slli.d %[stride_4], 
%[stride_2], 1 \n\t" + "ld.wu %[tp0], %[src], 0 \n\t" + "ldx.wu %[tp1], %[src], %[stride] \n\t" + "ldx.wu %[tp2], %[src], %[stride_2] \n\t" + "ldx.wu %[tp3], %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "ld.wu %[tp4], %[src], 0 \n\t" + "ldx.wu %[tp5], %[src], %[stride] \n\t" + "ldx.wu %[tp6], %[src], %[stride_2] \n\t" + "ldx.wu %[tp7], %[src], %[stride_3] \n\t" + "st.w %[tp0], %[dst], 0 \n\t" + "stx.w %[tp1], %[dst], %[stride] \n\t" + "stx.w %[tp2], %[dst], %[stride_2] \n\t" + "stx.w %[tp3], %[dst], %[stride_3] \n\t" + "add.d %[dst], %[dst], %[stride_4] \n\t" + "st.w %[tp4], %[dst], 0 \n\t" + "stx.w %[tp5], %[dst], %[stride] \n\t" + "stx.w %[tp6], %[dst], %[stride_2] \n\t" + "stx.w %[tp7], %[dst], %[stride_3] \n\t" + : [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), [stride_4]"=&r"(stride_4), + [src]"+&r"(src), [dst]"+&r"(dst), [tp0]"=&r"(tp0), [tp1]"=&r"(tp1), + [tp2]"=&r"(tp2), [tp3]"=&r"(tp3), [tp4]"=&r"(tp4), [tp5]"=&r"(tp5), + [tp6]"=&r"(tp6), [tp7]"=&r"(tp7) + : [stride]"r"(stride) + : "memory" + ); + } else if (4 == height) { + ptrdiff_t stride_2, stride_3; + + __asm__ volatile ( + "slli.d %[stride_2], %[stride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[stride] \n\t" + "ld.wu %[tp0], %[src], 0 \n\t" + "ldx.wu %[tp1], %[src], %[stride] \n\t" + "ldx.wu %[tp2], %[src], %[stride_2] \n\t" + "ldx.wu %[tp3], %[src], %[stride_3] \n\t" + "st.w %[tp0], %[dst], 0 \n\t" + "stx.w %[tp1], %[dst], %[stride] \n\t" + "stx.w %[tp2], %[dst], %[stride_2] \n\t" + "stx.w %[tp3], %[dst], %[stride_3] \n\t" + : [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), + [src]"+&r"(src), [dst]"+&r"(dst), [tp0]"=&r"(tp0), [tp1]"=&r"(tp1), + [tp2]"=&r"(tp2), [tp3]"=&r"(tp3) + : [stride]"r"(stride) + : "memory" + ); + } else if (2 == height) { + __asm__ volatile ( + "ld.wu %[tp0], %[src], 0 \n\t" + "ldx.wu %[tp1], %[src], %[stride] \n\t" + "st.w %[tp0], %[dst], 0 \n\t" + "stx.w %[tp1], %[dst], %[stride] \n\t" + : [tp0]"=&r"(tp0), [tp1]"=&r"(tp1) + : 
[src]"r"(src), [dst]"r"(dst), [stride]"r"(stride) + : "memory" + ); + } +} + +static void copy_width8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + int32_t height) +{ + if (8 == height) { + copy_width8x8_lasx(src, dst, stride); + } else if (4 == height) { + copy_width8x4_lasx(src, dst, stride); + } +} + +void ff_put_h264_chroma_mc4_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride, + int height, int x, int y) +{ + av_assert2(x < 8 && y < 8 && x >= 0 && y >= 0); + + if (x && y) { + avc_chroma_hv_4w_lasx(src, dst, stride, x, (8 - x), y, (8 - y), height); + } else if (x) { + avc_chroma_hz_4w_lasx(src, dst, stride, x, (8 - x), height); + } else if (y) { + avc_chroma_vt_4w_lasx(src, dst, stride, y, (8 - y), height); + } else { + copy_width4_lasx(src, dst, stride, height); + } +} + +void ff_put_h264_chroma_mc8_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride, + int height, int x, int y) +{ + av_assert2(x < 8 && y < 8 && x >= 0 && y >= 0); + + if (!(x || y)) { + copy_width8_lasx(src, dst, stride, height); + } else if (x && y) { + avc_chroma_hv_8w_lasx(src, dst, stride, x, (8 - x), y, (8 - y), height); + } else if (x) { + avc_chroma_hz_8w_lasx(src, dst, stride, x, (8 - x), height); + } else { + avc_chroma_vt_8w_lasx(src, dst, stride, y, (8 - y), height); + } +} + +static av_always_inline void avc_chroma_hv_and_aver_dst_8x4_lasx(uint8_t *src, + uint8_t *dst, ptrdiff_t stride, uint32_t coef_hor0, + uint32_t coef_hor1, uint32_t coef_ver0, + uint32_t coef_ver1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i tp0, tp1, tp2, tp3; + __m256i src0, src1, src2, src3, src4, out; + __m256i res_hz0, res_hz1, res_hz2, res_vt0, res_vt1; + __m256i mask; + __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0); + __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1); + __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1); + __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0); + 
__m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1); + + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x, + src1, src2, src3, src4); + DUP2_ARG3(__lasx_xvpermi_q, src2, src1, 0x20, src4, src3, 0x20, src1, src3); + src0 = __lasx_xvshuf_b(src0, src0, mask); + DUP2_ARG3(__lasx_xvshuf_b, src1, src1, mask, src3, src3, mask, src1, src3); + DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, res_hz0, res_hz1); + res_hz2 = __lasx_xvdp2_h_bu(src3, coeff_hz_vec); + res_vt0 = __lasx_xvmul_h(res_hz1, coeff_vt_vec0); + res_vt1 = __lasx_xvmul_h(res_hz2, coeff_vt_vec0); + res_hz0 = __lasx_xvpermi_q(res_hz1, res_hz0, 0x20); + res_hz1 = __lasx_xvpermi_q(res_hz1, res_hz2, 0x3); + res_vt0 = __lasx_xvmadd_h(res_vt0, res_hz0, coeff_vt_vec1); + res_vt1 = __lasx_xvmadd_h(res_vt1, res_hz1, coeff_vt_vec1); + out = __lasx_xvssrarni_bu_h(res_vt1, res_vt0, 6); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x, + tp0, tp1, tp2, tp3); + DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2); + tp0 = __lasx_xvpermi_q(tp2, tp0, 0x20); + out = __lasx_xvavgr_bu(out, tp0); + __lasx_xvstelm_d(out, dst, 0, 0); + __lasx_xvstelm_d(out, dst + stride, 0, 2); + __lasx_xvstelm_d(out, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out, dst + stride_3x, 0, 3); +} + +static av_always_inline void avc_chroma_hv_and_aver_dst_8x8_lasx(uint8_t *src, + uint8_t *dst, ptrdiff_t stride, uint32_t coef_hor0, + uint32_t coef_hor1, uint32_t coef_ver0, + uint32_t coef_ver1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i tp0, tp1, tp2, tp3, dst0, dst1; + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8; + __m256i out0, out1; + __m256i res_hz0, res_hz1, res_hz2, res_hz3, res_hz4; + __m256i res_vt0, res_vt1, res_vt2, res_vt3; + __m256i mask; + __m256i coeff_hz_vec0 = 
__lasx_xvreplgr2vr_b(coef_hor0); + __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1); + __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0); + __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1); + __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1); + + DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0); + src += stride; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src1, src2, src3, src4); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src5, src6, src7, src8); + DUP4_ARG3(__lasx_xvpermi_q, src2, src1, 0x20, src4, src3, 0x20, src6, src5, 0x20, + src8, src7, 0x20, src1, src3, src5, src7); + src0 = __lasx_xvshuf_b(src0, src0, mask); + DUP4_ARG3(__lasx_xvshuf_b, src1, src1, mask, src3, src3, mask, src5, src5, mask, src7, + src7, mask, src1, src3, src5, src7); + DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, src3, + coeff_hz_vec, src5, coeff_hz_vec, res_hz0, res_hz1, res_hz2, res_hz3); + res_hz4 = __lasx_xvdp2_h_bu(src7, coeff_hz_vec); + res_vt0 = __lasx_xvmul_h(res_hz1, coeff_vt_vec0); + res_vt1 = __lasx_xvmul_h(res_hz2, coeff_vt_vec0); + res_vt2 = __lasx_xvmul_h(res_hz3, coeff_vt_vec0); + res_vt3 = __lasx_xvmul_h(res_hz4, coeff_vt_vec0); + res_hz0 = __lasx_xvpermi_q(res_hz1, res_hz0, 0x20); + res_hz1 = __lasx_xvpermi_q(res_hz1, res_hz2, 0x3); + res_hz2 = __lasx_xvpermi_q(res_hz2, res_hz3, 0x3); + res_hz3 = __lasx_xvpermi_q(res_hz3, res_hz4, 0x3); + res_vt0 = __lasx_xvmadd_h(res_vt0, res_hz0, coeff_vt_vec1); + res_vt1 = __lasx_xvmadd_h(res_vt1, res_hz1, coeff_vt_vec1); + res_vt2 = __lasx_xvmadd_h(res_vt2, res_hz2, coeff_vt_vec1); + res_vt3 = __lasx_xvmadd_h(res_vt3, res_hz3, coeff_vt_vec1); + DUP2_ARG3(__lasx_xvssrarni_bu_h, res_vt1, res_vt0, 6, res_vt3, res_vt2, 6, + out0, out1); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x, + tp0, tp1, tp2, tp3); + DUP2_ARG2(__lasx_xvilvl_d, tp2, 
tp0, tp3, tp1, tp0, tp2); + dst0 = __lasx_xvpermi_q(tp2, tp0, 0x20); + dst += stride_4x; + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x, + tp0, tp1, tp2, tp3); + dst -= stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2); + dst1 = __lasx_xvpermi_q(tp2, tp0, 0x20); + out0 = __lasx_xvavgr_bu(out0, dst0); + out1 = __lasx_xvavgr_bu(out1, dst1); + __lasx_xvstelm_d(out0, dst, 0, 0); + __lasx_xvstelm_d(out0, dst + stride, 0, 2); + __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_d(out1, dst, 0, 0); + __lasx_xvstelm_d(out1, dst + stride, 0, 2); + __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3); +} + +static av_always_inline void avc_chroma_hz_and_aver_dst_8x4_lasx(uint8_t *src, + uint8_t *dst, ptrdiff_t stride, uint32_t coeff0, + uint32_t coeff1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + __m256i tp0, tp1, tp2, tp3; + __m256i src0, src1, src2, src3, out; + __m256i res0, res1; + __m256i mask; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + mask = __lasx_xvld(chroma_mask_arr, 0); + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src0, src1, src2, src3); + DUP2_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src0, src2); + DUP2_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src0, src2); + DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1); + out = __lasx_xvssrarni_bu_h(res1, res0, 6); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x, + tp0, tp1, tp2, tp3); + DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2); + tp0 = __lasx_xvpermi_q(tp2, tp0, 0x20); + out = __lasx_xvavgr_bu(out, 
tp0); + __lasx_xvstelm_d(out, dst, 0, 0); + __lasx_xvstelm_d(out, dst + stride, 0, 2); + __lasx_xvstelm_d(out, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out, dst + stride_3x, 0, 3); +} + +static av_always_inline void avc_chroma_hz_and_aver_dst_8x8_lasx(uint8_t *src, + uint8_t *dst, ptrdiff_t stride, uint32_t coeff0, + uint32_t coeff1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i tp0, tp1, tp2, tp3, dst0, dst1; + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + __m256i out0, out1; + __m256i res0, res1, res2, res3; + __m256i mask; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + mask = __lasx_xvld(chroma_mask_arr, 0); + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src0, src1, src2, src3); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src4, src5, src6, src7); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, 0x20, + src7, src6, 0x20, src0, src2, src4, src6); + DUP4_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src4, src4, + mask, src6, src6, mask, src0, src2, src4, src6); + DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, src4, coeff_vec, src6, + coeff_vec, res0, res1, res2, res3); + DUP2_ARG3(__lasx_xvssrarni_bu_h, res1, res0, 6, res3, res2, 6, out0, out1); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x, + tp0, tp1, tp2, tp3); + DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2); + dst0 = __lasx_xvpermi_q(tp2, tp0, 0x20); + dst += stride_4x; + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x, + tp0, tp1, tp2, tp3); + dst -= stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2); + dst1 = 
__lasx_xvpermi_q(tp2, tp0, 0x20); + out0 = __lasx_xvavgr_bu(out0, dst0); + out1 = __lasx_xvavgr_bu(out1, dst1); + __lasx_xvstelm_d(out0, dst, 0, 0); + __lasx_xvstelm_d(out0, dst + stride, 0, 2); + __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_d(out1, dst, 0, 0); + __lasx_xvstelm_d(out1, dst + stride, 0, 2); + __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3); +} + +static av_always_inline void avc_chroma_vt_and_aver_dst_8x4_lasx(uint8_t *src, + uint8_t *dst, ptrdiff_t stride, uint32_t coeff0, + uint32_t coeff1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i tp0, tp1, tp2, tp3; + __m256i src0, src1, src2, src3, src4, out; + __m256i res0, res1; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + src0 = __lasx_xvld(src, 0); + DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x, + src1, src2, src3, src4); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, 0x20, + src4, src3, 0x20, src0, src1, src2, src3); + DUP2_ARG2(__lasx_xvilvl_b, src1, src0, src3, src2, src0, src2); + DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1); + out = __lasx_xvssrarni_bu_h(res1, res0, 6); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x, + tp0, tp1, tp2, tp3); + DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2); + tp0 = __lasx_xvpermi_q(tp2, tp0, 0x20); + out = __lasx_xvavgr_bu(out, tp0); + __lasx_xvstelm_d(out, dst, 0, 0); + __lasx_xvstelm_d(out, dst + stride, 0, 2); + __lasx_xvstelm_d(out, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out, dst + stride_3x, 0, 3); +} + +static av_always_inline void 
avc_chroma_vt_and_aver_dst_8x8_lasx(uint8_t *src, + uint8_t *dst, ptrdiff_t stride, uint32_t coeff0, + uint32_t coeff1) +{ + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i tp0, tp1, tp2, tp3, dst0, dst1; + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8; + __m256i out0, out1; + __m256i res0, res1, res2, res3; + __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0); + __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1); + __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1); + + coeff_vec = __lasx_xvslli_b(coeff_vec, 3); + src0 = __lasx_xvld(src, 0); + src += stride; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src1, src2, src3, src4); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x, + src5, src6, src7, src8); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, 0x20, + src4, src3, 0x20, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, src6, src5, 0x20, src7, src6, 0x20, + src8, src7, 0x20, src4, src5, src6, src7); + DUP4_ARG2(__lasx_xvilvl_b, src1, src0, src3, src2, src5, src4, src7, src6, + src0, src2, src4, src6); + DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, src4, coeff_vec, src6, + coeff_vec, res0, res1, res2, res3); + DUP2_ARG3(__lasx_xvssrarni_bu_h, res1, res0, 6, res3, res2, 6, out0, out1); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x, + tp0, tp1, tp2, tp3); + DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2); + dst0 = __lasx_xvpermi_q(tp2, tp0, 0x20); + dst += stride_4x; + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x, + tp0, tp1, tp2, tp3); + dst -= stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2); + dst1 = __lasx_xvpermi_q(tp2, tp0, 0x20); + out0 = __lasx_xvavgr_bu(out0, dst0); + out1 = __lasx_xvavgr_bu(out1, dst1); + 
__lasx_xvstelm_d(out0, dst, 0, 0); + __lasx_xvstelm_d(out0, dst + stride, 0, 2); + __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_d(out1, dst, 0, 0); + __lasx_xvstelm_d(out1, dst + stride, 0, 2); + __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1); + __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3); +} + +static av_always_inline void avg_width8x8_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride) +{ + __m256i src0, src1, src2, src3; + __m256i dst0, dst1, dst2, dst3; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + + src0 = __lasx_xvldrepl_d(src, 0); + src1 = __lasx_xvldrepl_d(src + stride, 0); + src2 = __lasx_xvldrepl_d(src + stride_2x, 0); + src3 = __lasx_xvldrepl_d(src + stride_3x, 0); + dst0 = __lasx_xvldrepl_d(dst, 0); + dst1 = __lasx_xvldrepl_d(dst + stride, 0); + dst2 = __lasx_xvldrepl_d(dst + stride_2x, 0); + dst3 = __lasx_xvldrepl_d(dst + stride_3x, 0); + src0 = __lasx_xvpackev_d(src1,src0); + src2 = __lasx_xvpackev_d(src3,src2); + src0 = __lasx_xvpermi_q(src0, src2, 0x02); + dst0 = __lasx_xvpackev_d(dst1,dst0); + dst2 = __lasx_xvpackev_d(dst3,dst2); + dst0 = __lasx_xvpermi_q(dst0, dst2, 0x02); + dst0 = __lasx_xvavgr_bu(src0, dst0); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst + stride, 0, 1); + __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3); + + src += stride_4x; + dst += stride_4x; + src0 = __lasx_xvldrepl_d(src, 0); + src1 = __lasx_xvldrepl_d(src + stride, 0); + src2 = __lasx_xvldrepl_d(src + stride_2x, 0); + src3 = __lasx_xvldrepl_d(src + stride_3x, 0); + dst0 = __lasx_xvldrepl_d(dst, 0); + dst1 = __lasx_xvldrepl_d(dst + stride, 0); + dst2 = __lasx_xvldrepl_d(dst + stride_2x, 0); + dst3 = __lasx_xvldrepl_d(dst + stride_3x, 0); + src0 = __lasx_xvpackev_d(src1,src0); + src2 = __lasx_xvpackev_d(src3,src2); + src0 = __lasx_xvpermi_q(src0, 
src2, 0x02); + dst0 = __lasx_xvpackev_d(dst1,dst0); + dst2 = __lasx_xvpackev_d(dst3,dst2); + dst0 = __lasx_xvpermi_q(dst0, dst2, 0x02); + dst0 = __lasx_xvavgr_bu(src0, dst0); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst + stride, 0, 1); + __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3); +} + +static av_always_inline void avg_width8x4_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride) +{ + __m256i src0, src1, src2, src3; + __m256i dst0, dst1, dst2, dst3; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + + src0 = __lasx_xvldrepl_d(src, 0); + src1 = __lasx_xvldrepl_d(src + stride, 0); + src2 = __lasx_xvldrepl_d(src + stride_2x, 0); + src3 = __lasx_xvldrepl_d(src + stride_3x, 0); + dst0 = __lasx_xvldrepl_d(dst, 0); + dst1 = __lasx_xvldrepl_d(dst + stride, 0); + dst2 = __lasx_xvldrepl_d(dst + stride_2x, 0); + dst3 = __lasx_xvldrepl_d(dst + stride_3x, 0); + src0 = __lasx_xvpackev_d(src1,src0); + src2 = __lasx_xvpackev_d(src3,src2); + src0 = __lasx_xvpermi_q(src0, src2, 0x02); + dst0 = __lasx_xvpackev_d(dst1,dst0); + dst2 = __lasx_xvpackev_d(dst3,dst2); + dst0 = __lasx_xvpermi_q(dst0, dst2, 0x02); + dst0 = __lasx_xvavgr_bu(src0, dst0); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst + stride, 0, 1); + __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3); +} + +static void avc_chroma_hv_and_aver_dst_8w_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride, + uint32_t coef_hor0, + uint32_t coef_hor1, + uint32_t coef_ver0, + uint32_t coef_ver1, + int32_t height) +{ + if (4 == height) { + avc_chroma_hv_and_aver_dst_8x4_lasx(src, dst, stride, coef_hor0, + coef_hor1, coef_ver0, coef_ver1); + } else if (8 == height) { + avc_chroma_hv_and_aver_dst_8x8_lasx(src, dst, stride, coef_hor0, + coef_hor1, coef_ver0, coef_ver1); + } +} + +static void avc_chroma_hz_and_aver_dst_8w_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t 
stride, uint32_t coeff0, + uint32_t coeff1, int32_t height) +{ + if (4 == height) { + avc_chroma_hz_and_aver_dst_8x4_lasx(src, dst, stride, coeff0, coeff1); + } else if (8 == height) { + avc_chroma_hz_and_aver_dst_8x8_lasx(src, dst, stride, coeff0, coeff1); + } +} + +static void avc_chroma_vt_and_aver_dst_8w_lasx(uint8_t *src, uint8_t *dst, + ptrdiff_t stride, uint32_t coeff0, + uint32_t coeff1, int32_t height) +{ + if (4 == height) { + avc_chroma_vt_and_aver_dst_8x4_lasx(src, dst, stride, coeff0, coeff1); + } else if (8 == height) { + avc_chroma_vt_and_aver_dst_8x8_lasx(src, dst, stride, coeff0, coeff1); + } +} + +static void avg_width8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + int32_t height) +{ + if (8 == height) { + avg_width8x8_lasx(src, dst, stride); + } else if (4 == height) { + avg_width8x4_lasx(src, dst, stride); + } +} + +void ff_avg_h264_chroma_mc8_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride, + int height, int x, int y) +{ + av_assert2(x < 8 && y < 8 && x >= 0 && y >= 0); + + if (!(x || y)) { + avg_width8_lasx(src, dst, stride, height); + } else if (x && y) { + avc_chroma_hv_and_aver_dst_8w_lasx(src, dst, stride, x, (8 - x), y, + (8 - y), height); + } else if (x) { + avc_chroma_hz_and_aver_dst_8w_lasx(src, dst, stride, x, (8 - x), height); + } else { + avc_chroma_vt_and_aver_dst_8w_lasx(src, dst, stride, y, (8 - y), height); + } +} diff --git a/libavcodec/loongarch/h264chroma_lasx.h b/libavcodec/loongarch/h264chroma_lasx.h new file mode 100644 index 0000000000..4aac8db8cb --- /dev/null +++ b/libavcodec/loongarch/h264chroma_lasx.h @@ -0,0 +1,36 @@ +/* + * Copyright (c) 2020 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. 
+ * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVCODEC_LOONGARCH_H264CHROMA_LASX_H +#define AVCODEC_LOONGARCH_H264CHROMA_LASX_H + +#include <stdint.h> +#include <stddef.h> +#include "libavcodec/h264.h" + +void ff_put_h264_chroma_mc4_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride, + int h, int x, int y); +void ff_put_h264_chroma_mc8_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride, + int h, int x, int y); +void ff_avg_h264_chroma_mc8_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride, + int h, int x, int y); + +#endif /* AVCODEC_LOONGARCH_H264CHROMA_LASX_H */ diff --git a/libavutil/loongarch/loongson_intrinsics.h b/libavutil/loongarch/loongson_intrinsics.h new file mode 100644 index 0000000000..6e0439f829 --- /dev/null +++ b/libavutil/loongarch/loongson_intrinsics.h @@ -0,0 +1,1877 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * All rights reserved. + * Contributed by Shiyou Yin + * Xiwei Gu + * Lu Wang + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details.
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + * + */ + +#ifndef AVUTIL_LOONGARCH_LOONGSON_INTRINSICS_H +#define AVUTIL_LOONGARCH_LOONGSON_INTRINSICS_H + +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * All rights reserved. + * Contributed by Shiyou Yin + * Xiwei Gu + * Lu Wang + * + * This file is a header file for LoongArch builtin extensions. + * + */ + +#ifndef LOONGSON_INTRINSICS_H +#define LOONGSON_INTRINSICS_H + +/** + * MAJOR version: Macro usage changes. + * MINOR version: Add new functions, or bug fix. + * MICRO version: Comment changes or implementation changes. + */ +#define LSOM_VERSION_MAJOR 1 +#define LSOM_VERSION_MINOR 0 +#define LSOM_VERSION_MICRO 3 + +#define DUP2_ARG1(_INS, _IN0, _IN1, _OUT0, _OUT1) \ +{ \ + _OUT0 = _INS(_IN0); \ + _OUT1 = _INS(_IN1); \ +} + +#define DUP2_ARG2(_INS, _IN0, _IN1, _IN2, _IN3, _OUT0, _OUT1) \ +{ \ + _OUT0 = _INS(_IN0, _IN1); \ + _OUT1 = _INS(_IN2, _IN3); \ +} + +#define DUP2_ARG3(_INS, _IN0, _IN1, _IN2, _IN3, _IN4, _IN5, _OUT0, _OUT1) \ +{ \ + _OUT0 = _INS(_IN0, _IN1, _IN2); \ + _OUT1 = _INS(_IN3, _IN4, _IN5); \ +} + +#define DUP4_ARG1(_INS, _IN0, _IN1, _IN2, _IN3, _OUT0, _OUT1, _OUT2, _OUT3) \ +{ \ + DUP2_ARG1(_INS, _IN0, _IN1, _OUT0, _OUT1); \ + DUP2_ARG1(_INS, _IN2, _IN3, _OUT2, _OUT3); \ +} + +#define DUP4_ARG2(_INS, _IN0, _IN1, _IN2, _IN3, _IN4, _IN5, _IN6, _IN7, \ + _OUT0, _OUT1, _OUT2, _OUT3) \ +{ \ + DUP2_ARG2(_INS, _IN0, _IN1, _IN2, _IN3, _OUT0, _OUT1); \ + DUP2_ARG2(_INS, _IN4, _IN5, _IN6, _IN7, _OUT2, _OUT3); \ +} + +#define DUP4_ARG3(_INS, _IN0, _IN1, _IN2, _IN3, _IN4, _IN5, _IN6, _IN7, \ + _IN8, _IN9, _IN10, _IN11, _OUT0, _OUT1, _OUT2, _OUT3) \ +{ \ + DUP2_ARG3(_INS, _IN0, _IN1, _IN2, _IN3, _IN4, _IN5, _OUT0, _OUT1); \ + DUP2_ARG3(_INS, _IN6, _IN7, _IN8, _IN9, _IN10, _IN11, _OUT2, _OUT3); \ +} + +#ifdef
__loongarch_sx +#include <lsxintrin.h> +/* + * ============================================================================= + * Description : Dot product & addition of byte vector elements + * Arguments : Inputs - in_c, in_h, in_l + * Outputs - out + * Return Type - halfword + * Details : Signed byte elements from in_h are multiplied by + * signed byte elements from in_l, and adjacent products are + * added to give results twice the width of the inputs. + * The results are then added to the signed halfword elements of in_c. + * Example : out = __lsx_vdp2add_h_b(in_c, in_h, in_l) + * in_c : 1,2,3,4, 1,2,3,4 + * in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1 + * out : 23,40,41,26, 23,40,41,26 + * ============================================================================= + */ +static inline __m128i __lsx_vdp2add_h_b(__m128i in_c, __m128i in_h, __m128i in_l) +{ + __m128i out; + + out = __lsx_vmaddwev_h_b(in_c, in_h, in_l); + out = __lsx_vmaddwod_h_b(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product & addition of byte vector elements + * Arguments : Inputs - in_c, in_h, in_l + * Outputs - out + * Return Type - halfword + * Details : Unsigned byte elements from in_h are multiplied by + * unsigned byte elements from in_l, and adjacent products are + * added to give results twice the width of the inputs. + * The results are then added to the signed halfword elements of in_c.
+ * Example : out = __lsx_vdp2add_h_bu(in_c, in_h, in_l) + * in_c : 1,2,3,4, 1,2,3,4 + * in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1 + * out : 23,40,41,26, 23,40,41,26 + * ============================================================================= + */ +static inline __m128i __lsx_vdp2add_h_bu(__m128i in_c, __m128i in_h, __m128i in_l) +{ + __m128i out; + + out = __lsx_vmaddwev_h_bu(in_c, in_h, in_l); + out = __lsx_vmaddwod_h_bu(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product & addition of half word vector elements + * Arguments : Inputs - in_c, in_h, in_l + * Outputs - out + * Return Type - word + * Details : Signed half word elements from in_h are multiplied by + * signed half word elements from in_l, and adjacent products are + * added to give results twice the width of the inputs. + * The results are then added to the signed word elements of in_c. + * Example : out = __lsx_vdp2add_w_h(in_c, in_h, in_l) + * in_c : 1,2,3,4 + * in_h : 1,2,3,4, 5,6,7,8 + * in_l : 8,7,6,5, 4,3,2,1 + * out : 23,40,41,26 + * ============================================================================= + */ +static inline __m128i __lsx_vdp2add_w_h(__m128i in_c, __m128i in_h, __m128i in_l) +{ + __m128i out; + + out = __lsx_vmaddwev_w_h(in_c, in_h, in_l); + out = __lsx_vmaddwod_w_h(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of byte vector elements + * Arguments : Inputs - in_h, in_l + * Outputs - out + * Return Type - halfword + * Details : Signed byte elements from in_h are multiplied by + * signed byte elements from in_l, and adjacent products are + * added to give results twice the width of the inputs.
+ * Example : out = __lsx_vdp2_h_b(in_h, in_l) + * in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1 + * out : 22,38,38,22, 22,38,38,22 + * ============================================================================= + */ +static inline __m128i __lsx_vdp2_h_b(__m128i in_h, __m128i in_l) +{ + __m128i out; + + out = __lsx_vmulwev_h_b(in_h, in_l); + out = __lsx_vmaddwod_h_b(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of byte vector elements + * Arguments : Inputs - in_h, in_l + * Outputs - out + * Return Type - halfword + * Details : Unsigned byte elements from in_h are multiplied by + * unsigned byte elements from in_l, and adjacent products are + * added to give results twice the width of the inputs. + * Example : out = __lsx_vdp2_h_bu(in_h, in_l) + * in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1 + * out : 22,38,38,22, 22,38,38,22 + * ============================================================================= + */ +static inline __m128i __lsx_vdp2_h_bu(__m128i in_h, __m128i in_l) +{ + __m128i out; + + out = __lsx_vmulwev_h_bu(in_h, in_l); + out = __lsx_vmaddwod_h_bu(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of byte vector elements + * Arguments : Inputs - in_h, in_l + * Outputs - out + * Return Type - halfword + * Details : Unsigned byte elements from in_h are multiplied by + * signed byte elements from in_l, and adjacent products are + * added to give results twice the width of the inputs.
+ * Example : out = __lsx_vdp2_h_bu_b(in_h, in_l) + * in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,-1 + * out : 22,38,38,22, 22,38,38,6 + * ============================================================================= + */ +static inline __m128i __lsx_vdp2_h_bu_b(__m128i in_h, __m128i in_l) +{ + __m128i out; + + out = __lsx_vmulwev_h_bu_b(in_h, in_l); + out = __lsx_vmaddwod_h_bu_b(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of half word vector elements + * Arguments : Inputs - in_h, in_l + * Outputs - out + * Return Type - word + * Details : Signed half word elements from in_h are multiplied by + * signed half word elements from in_l, and adjacent products are + * added to give results twice the width of the inputs. + * Example : out = __lsx_vdp2_w_h(in_h, in_l) + * in_h : 1,2,3,4, 5,6,7,8 + * in_l : 8,7,6,5, 4,3,2,1 + * out : 22,38,38,22 + * ============================================================================= + */ +static inline __m128i __lsx_vdp2_w_h(__m128i in_h, __m128i in_l) +{ + __m128i out; + + out = __lsx_vmulwev_w_h(in_h, in_l); + out = __lsx_vmaddwod_w_h(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Clip all halfword elements of input vector between min & max + * out = ((_in) < (min)) ? (min) : (((_in) > (max)) ?
(max) : (_in)) + * Arguments : Inputs - _in (input vector) + * - min (min threshold) + * - max (max threshold) + * Outputs - out (output vector with clipped elements) + * Return Type - signed halfword + * Example : out = __lsx_vclip_h(_in, min, max) + * _in : -8,2,280,249, -8,255,280,249 + * min : 1,1,1,1, 1,1,1,1 + * max : 9,9,9,9, 9,9,9,9 + * out : 1,2,9,9, 1,9,9,9 + * ============================================================================= + */ +static inline __m128i __lsx_vclip_h(__m128i _in, __m128i min, __m128i max) +{ + __m128i out; + + out = __lsx_vmax_h(min, _in); + out = __lsx_vmin_h(max, out); + return out; +} + +/* + * ============================================================================= + * Description : Clamp each element of the vector to the range 0 to 255 + * Arguments : Inputs - _in + * Outputs - out + * Return Type - halfword + * Details : Signed half word elements from _in are clamped between 0 and 255. + * Example : out = __lsx_vclip255_h(_in) + * _in : -8,255,280,249, -8,255,280,249 + * out : 0,255,255,249, 0,255,255,249 + * ============================================================================= + */ +static inline __m128i __lsx_vclip255_h(__m128i _in) +{ + __m128i out; + + out = __lsx_vmaxi_h(_in, 0); + out = __lsx_vsat_hu(out, 7); + return out; +} + +/* + * ============================================================================= + * Description : Clamp each element of the vector to the range 0 to 255 + * Arguments : Inputs - _in + * Outputs - out + * Return Type - word + * Details : Signed word elements from _in are clamped between 0 and 255.
+ * Example : out = __lsx_vclip255_w(_in) + * _in : -8,255,280,249 + * out : 0,255,255,249 + * ============================================================================= + */ +static inline __m128i __lsx_vclip255_w(__m128i _in) +{ + __m128i out; + + out = __lsx_vmaxi_w(_in, 0); + out = __lsx_vsat_wu(out, 7); + return out; +} + +/* + * ============================================================================= + * Description : Swap two variables + * Arguments : Inputs - _in0, _in1 + * Outputs - _in0, _in1 (in-place) + * Details : Swapping of two input variables using xor + * Example : LSX_SWAP(_in0, _in1) + * _in0 : 1,2,3,4 + * _in1 : 5,6,7,8 + * _in0(out) : 5,6,7,8 + * _in1(out) : 1,2,3,4 + * ============================================================================= + */ +#define LSX_SWAP(_in0, _in1) \ +{ \ + _in0 = __lsx_vxor_v(_in0, _in1); \ + _in1 = __lsx_vxor_v(_in0, _in1); \ + _in0 = __lsx_vxor_v(_in0, _in1); \ +} + +/* + * ============================================================================= + * Description : Transpose 4x4 block with word elements in vectors + * Arguments : Inputs - _in0, _in1, _in2, _in3 + * Outputs - _out0, _out1, _out2, _out3 + * Details : The rows of the matrix become columns, and the columns become rows. 
+ * Example : + * 1, 2, 3, 4 1, 5, 9,13 + * 5, 6, 7, 8 to 2, 6,10,14 + * 9,10,11,12 =====> 3, 7,11,15 + * 13,14,15,16 4, 8,12,16 + * ============================================================================= + */ +#define LSX_TRANSPOSE4x4_W(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + __m128i _t0, _t1, _t2, _t3; \ + \ + _t0 = __lsx_vilvl_w(_in1, _in0); \ + _t1 = __lsx_vilvh_w(_in1, _in0); \ + _t2 = __lsx_vilvl_w(_in3, _in2); \ + _t3 = __lsx_vilvh_w(_in3, _in2); \ + _out0 = __lsx_vilvl_d(_t2, _t0); \ + _out1 = __lsx_vilvh_d(_t2, _t0); \ + _out2 = __lsx_vilvl_d(_t3, _t1); \ + _out3 = __lsx_vilvh_d(_t3, _t1); \ +} + +/* + * ============================================================================= + * Description : Transpose 8x8 block with byte elements in vectors + *
Arguments : Inputs - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7 + * Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7 + * Details : The rows of the matrix become columns, and the columns become rows. + * Example : LSX_TRANSPOSE8x8_B + * _in0 : 00,01,02,03,04,05,06,07, 00,00,00,00,00,00,00,00 + * _in1 : 10,11,12,13,14,15,16,17, 00,00,00,00,00,00,00,00 + * _in2 : 20,21,22,23,24,25,26,27, 00,00,00,00,00,00,00,00 + * _in3 : 30,31,32,33,34,35,36,37, 00,00,00,00,00,00,00,00 + * _in4 : 40,41,42,43,44,45,46,47, 00,00,00,00,00,00,00,00 + * _in5 : 50,51,52,53,54,55,56,57, 00,00,00,00,00,00,00,00 + * _in6 : 60,61,62,63,64,65,66,67, 00,00,00,00,00,00,00,00 + * _in7 : 70,71,72,73,74,75,76,77, 00,00,00,00,00,00,00,00 + * + * _out0 : 00,10,20,30,40,50,60,70, 00,00,00,00,00,00,00,00 + * _out1 : 01,11,21,31,41,51,61,71, 00,00,00,00,00,00,00,00 + * _out2 : 02,12,22,32,42,52,62,72, 00,00,00,00,00,00,00,00 + * _out3 : 03,13,23,33,43,53,63,73, 00,00,00,00,00,00,00,00 + * _out4 : 04,14,24,34,44,54,64,74, 00,00,00,00,00,00,00,00 + * _out5 : 05,15,25,35,45,55,65,75, 00,00,00,00,00,00,00,00 + * _out6 : 06,16,26,36,46,56,66,76, 00,00,00,00,00,00,00,00 + * _out7 : 07,17,27,37,47,57,67,77, 00,00,00,00,00,00,00,00 + * ============================================================================= + */ +#define LSX_TRANSPOSE8x8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + __m128i zero = {0}; \ + __m128i shuf8 = {0x0F0E0D0C0B0A0908, 0x1716151413121110}; \ + __m128i _t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7; \ + \ + _t0 = __lsx_vilvl_b(_in2, _in0); \ + _t1 = __lsx_vilvl_b(_in3, _in1); \ + _t2 = __lsx_vilvl_b(_in6, _in4); \ + _t3 = __lsx_vilvl_b(_in7, _in5); \ + _t4 = __lsx_vilvl_b(_t1, _t0); \ + _t5 = __lsx_vilvh_b(_t1, _t0); \ + _t6 = __lsx_vilvl_b(_t3, _t2); \ + _t7 = __lsx_vilvh_b(_t3, _t2); \ + _out0 = __lsx_vilvl_w(_t6, _t4); \ + _out2 = __lsx_vilvh_w(_t6, _t4); \ + _out4 = __lsx_vilvl_w(_t7, _t5);
\ + _out6 = __lsx_vilvh_w(_t7, _t5); \ + _out1 = __lsx_vshuf_b(zero, _out0, shuf8); \ + _out3 = __lsx_vshuf_b(zero, _out2, shuf8); \ + _out5 = __lsx_vshuf_b(zero, _out4, shuf8); \ + _out7 = __lsx_vshuf_b(zero, _out6, shuf8); \ +} + +/* + * ============================================================================= + * Description : Transpose 8x8 block with half word elements in vectors + * Arguments : Inputs - in0, in1, in2, in3, in4, in5, in6, in7 + * Outputs - out0, out1, out2, out3, out4, out5, out6, out7 + * Details : + * Example : + * 00,01,02,03,04,05,06,07 00,10,20,30,40,50,60,70 + * 10,11,12,13,14,15,16,17 01,11,21,31,41,51,61,71 + * 20,21,22,23,24,25,26,27 02,12,22,32,42,52,62,72 + * 30,31,32,33,34,35,36,37 to 03,13,23,33,43,53,63,73 + * 40,41,42,43,44,45,46,47 ======> 04,14,24,34,44,54,64,74 + * 50,51,52,53,54,55,56,57 05,15,25,35,45,55,65,75 + * 60,61,62,63,64,65,66,67 06,16,26,36,46,56,66,76 + * 70,71,72,73,74,75,76,77 07,17,27,37,47,57,67,77 + * ============================================================================= + */ +#define LSX_TRANSPOSE8x8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + __m128i _s0, _s1, _t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7; \ + \ + _s0 = __lsx_vilvl_h(_in6, _in4); \ + _s1 = __lsx_vilvl_h(_in7, _in5); \ + _t0 = __lsx_vilvl_h(_s1, _s0); \ + _t1 = __lsx_vilvh_h(_s1, _s0); \ + _s0 = __lsx_vilvh_h(_in6, _in4); \ + _s1 = __lsx_vilvh_h(_in7, _in5); \ + _t2 = __lsx_vilvl_h(_s1, _s0); \ + _t3 = __lsx_vilvh_h(_s1, _s0); \ + _s0 = __lsx_vilvl_h(_in2, _in0); \ + _s1 = __lsx_vilvl_h(_in3, _in1); \ + _t4 = __lsx_vilvl_h(_s1, _s0); \ + _t5 = __lsx_vilvh_h(_s1, _s0); \ + _s0 = __lsx_vilvh_h(_in2, _in0); \ + _s1 = __lsx_vilvh_h(_in3, _in1); \ + _t6 = __lsx_vilvl_h(_s1, _s0); \ + _t7 = __lsx_vilvh_h(_s1, _s0); \ + \ + _out0 = __lsx_vpickev_d(_t0, _t4); \ + _out2 = __lsx_vpickev_d(_t1, _t5); \ + _out4 = __lsx_vpickev_d(_t2, _t6); \ + _out6 = __lsx_vpickev_d(_t3, _t7); 
\ + _out1 = __lsx_vpickod_d(_t0, _t4); \ + _out3 = __lsx_vpickod_d(_t1, _t5); \ + _out5 = __lsx_vpickod_d(_t2, _t6); \ + _out7 = __lsx_vpickod_d(_t3, _t7); \ +} + +/* + * ============================================================================= + * Description : Transpose input 8x4 byte block into 4x8 + * Arguments : Inputs - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7 + * Outputs - _out0, _out1, _out2, _out3 (output 4x8 byte block) + * Return Type - as per RTYPE + * Details : The rows of the matrix become columns, and the columns become rows. + * Example : LSX_TRANSPOSE8x4_B + * _in0 : 00,01,02,03,00,00,00,00, 00,00,00,00,00,00,00,00 + * _in1 : 10,11,12,13,00,00,00,00, 00,00,00,00,00,00,00,00 + * _in2 : 20,21,22,23,00,00,00,00, 00,00,00,00,00,00,00,00 + * _in3 : 30,31,32,33,00,00,00,00, 00,00,00,00,00,00,00,00 + * _in4 : 40,41,42,43,00,00,00,00, 00,00,00,00,00,00,00,00 + * _in5 : 50,51,52,53,00,00,00,00, 00,00,00,00,00,00,00,00 + * _in6 : 60,61,62,63,00,00,00,00, 00,00,00,00,00,00,00,00 + * _in7 : 70,71,72,73,00,00,00,00, 00,00,00,00,00,00,00,00 + * + * _out0 : 00,10,20,30,40,50,60,70, 00,00,00,00,00,00,00,00 + * _out1 : 01,11,21,31,41,51,61,71, 00,00,00,00,00,00,00,00 + * _out2 : 02,12,22,32,42,52,62,72, 00,00,00,00,00,00,00,00 + * _out3 : 03,13,23,33,43,53,63,73, 00,00,00,00,00,00,00,00 + * ============================================================================= + */ +#define LSX_TRANSPOSE8x4_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3) \ +{ \ + __m128i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m; \ + \ + _tmp0_m = __lsx_vpackev_w(_in4, _in0); \ + _tmp1_m = __lsx_vpackev_w(_in5, _in1); \ + _tmp2_m = __lsx_vilvl_b(_tmp1_m, _tmp0_m); \ + _tmp0_m = __lsx_vpackev_w(_in6, _in2); \ + _tmp1_m = __lsx_vpackev_w(_in7, _in3); \ + \ + _tmp3_m = __lsx_vilvl_b(_tmp1_m, _tmp0_m); \ + _tmp0_m = __lsx_vilvl_h(_tmp3_m, _tmp2_m); \ + _tmp1_m = __lsx_vilvh_h(_tmp3_m, _tmp2_m); \ + \ + _out0 = __lsx_vilvl_w(_tmp1_m, _tmp0_m); \ + _out2 =
__lsx_vilvh_w(_tmp1_m, _tmp0_m); \ + _out1 = __lsx_vilvh_d(_out2, _out0); \ + _out3 = __lsx_vilvh_d(_out0, _out2); \ +} + +/* + * ============================================================================= + * Description : Transpose 16x8 block with byte elements in vectors + * Arguments : Inputs - in0, in1, in2, in3, in4, in5, in6, in7, in8 + * in9, in10, in11, in12, in13, in14, in15 + * Outputs - out0, out1, out2, out3, out4, out5, out6, out7 + * Details : + * Example : + * 000,001,002,003,004,005,006,007 + * 008,009,010,011,012,013,014,015 + * 016,017,018,019,020,021,022,023 + * 024,025,026,027,028,029,030,031 + * 032,033,034,035,036,037,038,039 + * 040,041,042,043,044,045,046,047 000,008,...,112,120 + * 048,049,050,051,052,053,054,055 001,009,...,113,121 + * 056,057,058,059,060,061,062,063 to 002,010,...,114,122 + * 064,065,066,067,068,069,070,071 =====> 003,011,...,115,123 + * 072,073,074,075,076,077,078,079 004,012,...,116,124 + * 080,081,082,083,084,085,086,087 005,013,...,117,125 + * 088,089,090,091,092,093,094,095 006,014,...,118,126 + * 096,097,098,099,100,101,102,103 007,015,...,119,127 + * 104,105,106,107,108,109,110,111 + * 112,113,114,115,116,117,118,119 + * 120,121,122,123,124,125,126,127 + * ============================================================================= + */ +#define LSX_TRANSPOSE16x8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, _in8, \ + _in9, _in10, _in11, _in12, _in13, _in14, _in15, _out0, \ + _out1, _out2, _out3, _out4, _out5, _out6, _out7) \ +{ \ + __m128i _tmp0, _tmp1, _tmp2, _tmp3, _tmp4, _tmp5, _tmp6, _tmp7; \ + __m128i _t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7; \ + DUP4_ARG2(__lsx_vilvl_b, _in2, _in0, _in3, _in1, _in6, _in4, _in7, _in5, \ + _tmp0, _tmp1, _tmp2, _tmp3); \ + DUP4_ARG2(__lsx_vilvl_b, _in10, _in8, _in11, _in9, _in14, _in12, _in15, \ + _in13, _tmp4, _tmp5, _tmp6, _tmp7); \ + DUP2_ARG2(__lsx_vilvl_b, _tmp1, _tmp0, _tmp3, _tmp2, _t0, _t2); \ + DUP2_ARG2(__lsx_vilvh_b, _tmp1, _tmp0, _tmp3, _tmp2, _t1, _t3); \ +
DUP2_ARG2(__lsx_vilvl_b, _tmp5, _tmp4, _tmp7, _tmp6, _t4, _t6); \ + DUP2_ARG2(__lsx_vilvh_b, _tmp5, _tmp4, _tmp7, _tmp6, _t5, _t7); \ + DUP2_ARG2(__lsx_vilvl_w, _t2, _t0, _t3, _t1, _tmp0, _tmp4); \ + DUP2_ARG2(__lsx_vilvh_w, _t2, _t0, _t3, _t1, _tmp2, _tmp6); \ + DUP2_ARG2(__lsx_vilvl_w, _t6, _t4, _t7, _t5, _tmp1, _tmp5); \ + DUP2_ARG2(__lsx_vilvh_w, _t6, _t4, _t7, _t5, _tmp3, _tmp7); \ + DUP2_ARG2(__lsx_vilvl_d, _tmp1, _tmp0, _tmp3, _tmp2, _out0, _out2); \ + DUP2_ARG2(__lsx_vilvh_d, _tmp1, _tmp0, _tmp3, _tmp2, _out1, _out3); \ + DUP2_ARG2(__lsx_vilvl_d, _tmp5, _tmp4, _tmp7, _tmp6, _out4, _out6); \ + DUP2_ARG2(__lsx_vilvh_d, _tmp5, _tmp4, _tmp7, _tmp6, _out5, _out7); \ +} + +/* + * ============================================================================= + * Description : Butterfly of 4 input vectors + * Arguments : Inputs - in0, in1, in2, in3 + * Outputs - out0, out1, out2, out3 + * Details : Butterfly operation + * Example : + * out0 = in0 + in3; + * out1 = in1 + in2; + * out2 = in1 - in2; + * out3 = in0 - in3; + * ============================================================================= + */ +#define LSX_BUTTERFLY_4_B(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + _out0 = __lsx_vadd_b(_in0, _in3); \ + _out1 = __lsx_vadd_b(_in1, _in2); \ + _out2 = __lsx_vsub_b(_in1, _in2); \ + _out3 = __lsx_vsub_b(_in0, _in3); \ +} +#define LSX_BUTTERFLY_4_H(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + _out0 = __lsx_vadd_h(_in0, _in3); \ + _out1 = __lsx_vadd_h(_in1, _in2); \ + _out2 = __lsx_vsub_h(_in1, _in2); \ + _out3 = __lsx_vsub_h(_in0, _in3); \ +} +#define LSX_BUTTERFLY_4_W(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + _out0 = __lsx_vadd_w(_in0, _in3); \ + _out1 = __lsx_vadd_w(_in1, _in2); \ + _out2 = __lsx_vsub_w(_in1, _in2); \ + _out3 = __lsx_vsub_w(_in0, _in3); \ +} +#define LSX_BUTTERFLY_4_D(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + _out0 = __lsx_vadd_d(_in0, _in3); \ + _out1 = __lsx_vadd_d(_in1, 
_in2); \ + _out2 = __lsx_vsub_d(_in1, _in2); \ + _out3 = __lsx_vsub_d(_in0, _in3); \ +} + +/* + * ============================================================================= + * Description : Butterfly of 8 input vectors + * Arguments : Inputs - _in0, _in1, _in2, _in3, ~ + * Outputs - _out0, _out1, _out2, _out3, ~ + * Details : Butterfly operation + * Example : + * _out0 = _in0 + _in7; + * _out1 = _in1 + _in6; + * _out2 = _in2 + _in5; + * _out3 = _in3 + _in4; + * _out4 = _in3 - _in4; + * _out5 = _in2 - _in5; + * _out6 = _in1 - _in6; + * _out7 = _in0 - _in7; + * ============================================================================= + */ +#define LSX_BUTTERFLY_8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + _out0 = __lsx_vadd_b(_in0, _in7); \ + _out1 = __lsx_vadd_b(_in1, _in6); \ + _out2 = __lsx_vadd_b(_in2, _in5); \ + _out3 = __lsx_vadd_b(_in3, _in4); \ + _out4 = __lsx_vsub_b(_in3, _in4); \ + _out5 = __lsx_vsub_b(_in2, _in5); \ + _out6 = __lsx_vsub_b(_in1, _in6); \ + _out7 = __lsx_vsub_b(_in0, _in7); \ +} + +#define LSX_BUTTERFLY_8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + _out0 = __lsx_vadd_h(_in0, _in7); \ + _out1 = __lsx_vadd_h(_in1, _in6); \ + _out2 = __lsx_vadd_h(_in2, _in5); \ + _out3 = __lsx_vadd_h(_in3, _in4); \ + _out4 = __lsx_vsub_h(_in3, _in4); \ + _out5 = __lsx_vsub_h(_in2, _in5); \ + _out6 = __lsx_vsub_h(_in1, _in6); \ + _out7 = __lsx_vsub_h(_in0, _in7); \ +} + +#define LSX_BUTTERFLY_8_W(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + _out0 = __lsx_vadd_w(_in0, _in7); \ + _out1 = __lsx_vadd_w(_in1, _in6); \ + _out2 = __lsx_vadd_w(_in2, _in5); \ + _out3 = __lsx_vadd_w(_in3, _in4); \ + _out4 = __lsx_vsub_w(_in3, _in4); \ + _out5 = __lsx_vsub_w(_in2, _in5); \ + _out6 = __lsx_vsub_w(_in1, _in6); \ + _out7 = __lsx_vsub_w(_in0, _in7); \ 
+} + +#define LSX_BUTTERFLY_8_D(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + _out0 = __lsx_vadd_d(_in0, _in7); \ + _out1 = __lsx_vadd_d(_in1, _in6); \ + _out2 = __lsx_vadd_d(_in2, _in5); \ + _out3 = __lsx_vadd_d(_in3, _in4); \ + _out4 = __lsx_vsub_d(_in3, _in4); \ + _out5 = __lsx_vsub_d(_in2, _in5); \ + _out6 = __lsx_vsub_d(_in1, _in6); \ + _out7 = __lsx_vsub_d(_in0, _in7); \ +} + +#endif //LSX + +#ifdef __loongarch_asx +#include <lasxintrin.h> +/* + * ============================================================================= + * Description : Dot product of byte vector elements + * Arguments : Inputs - in_h, in_l + * Output - out + * Return Type - signed halfword + * Details : Unsigned byte elements from in_h are multiplied with + * unsigned byte elements from in_l producing a result + * twice the size of input i.e. signed halfword. + * Then the multiplication results of adjacent odd-even elements + * are added to the out vector. + * Example : See out = __lasx_xvdp2_w_h(in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2_h_bu(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmulwev_h_bu(in_h, in_l); + out = __lasx_xvmaddwod_h_bu(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of byte vector elements + * Arguments : Inputs - in_h, in_l + * Output - out + * Return Type - signed halfword + * Details : Signed byte elements from in_h are multiplied with + * signed byte elements from in_l producing a result + * twice the size of input i.e. signed halfword.
+ * Then the multiplication results of adjacent odd-even elements + * are added to the out vector. + * Example : See out = __lasx_xvdp2_w_h(in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2_h_b(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmulwev_h_b(in_h, in_l); + out = __lasx_xvmaddwod_h_b(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of halfword vector elements + * Arguments : Inputs - in_h, in_l + * Output - out + * Return Type - signed word + * Details : Signed halfword elements from in_h are multiplied with + * signed halfword elements from in_l producing a result + * twice the size of input i.e. signed word. + * Then the multiplication results of adjacent odd-even elements + * are added to the out vector. + * Example : out = __lasx_xvdp2_w_h(in_h, in_l) + * in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1 + * out : 22,38,38,22, 22,38,38,22 + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2_w_h(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmulwev_w_h(in_h, in_l); + out = __lasx_xvmaddwod_w_h(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of word vector elements + * Arguments : Inputs - in_h, in_l + * Output - out + * Return Type - signed doubleword + * Details : Signed word elements from in_h are multiplied with + * signed word elements from in_l producing a result + * twice the size of input i.e. signed doubleword. + * Then the multiplication results of adjacent odd-even elements + * are added to the out vector.
+ * Example : See out = __lasx_xvdp2_w_h(in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2_d_w(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmulwev_d_w(in_h, in_l); + out = __lasx_xvmaddwod_d_w(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of halfword vector elements + * Arguments : Inputs - in_h, in_l + * Output - out + * Return Type - signed word + * Details : Unsigned halfword elements from in_h are multiplied with + * signed halfword elements from in_l producing a result + * twice the size of input i.e. signed word. + * The multiplication results of adjacent odd-even elements + * are added to the out vector. + * Example : See out = __lasx_xvdp2_w_h(in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2_w_hu_h(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmulwev_w_hu_h(in_h, in_l); + out = __lasx_xvmaddwod_w_hu_h(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product & addition of byte vector elements + * Arguments : Inputs - in_c, in_h, in_l + * Output - out + * Return Type - signed halfword + * Details : Signed byte elements from in_h are multiplied with + * signed byte elements from in_l producing a result + * twice the size of input i.e. signed halfword. + * Then the multiplication results of adjacent odd-even elements + * are added to the in_c vector.
+ * Example : See out = __lasx_xvdp2add_w_h(in_c, in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2add_h_b(__m256i in_c,__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmaddwev_h_b(in_c, in_h, in_l); + out = __lasx_xvmaddwod_h_b(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of halfword vector elements + * Arguments : Inputs - in_c, in_h, in_l + * Output - out + * Return Type - per RTYPE + * Details : Signed halfword elements from in_h are multiplied with + * signed halfword elements from in_l producing a result + * twice the size of input i.e. signed word. + * Multiplication result of adjacent odd-even elements + * are added to the in_c vector. + * Example : out = __lasx_xvdp2add_w_h(in_c, in_h, in_l) + * in_c : 1,2,3,4, 1,2,3,4 + * in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8, + * in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1, + * out : 23,40,41,26, 23,40,41,26 + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2add_w_h(__m256i in_c, __m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmaddwev_w_h(in_c, in_h, in_l); + out = __lasx_xvmaddwod_w_h(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of halfword vector elements + * Arguments : Inputs - in_c, in_h, in_l + * Output - out + * Return Type - signed word + * Details : Unsigned halfword elements from in_h are multiplied with + * unsigned halfword elements from in_l producing a result + * twice the size of input i.e. signed word. + * Multiplication result of adjacent odd-even elements + * are added to the in_c vector. 
+ * Example : See out = __lasx_xvdp2add_w_h(in_c, in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2add_w_hu(__m256i in_c, __m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmaddwev_w_hu(in_c, in_h, in_l); + out = __lasx_xvmaddwod_w_hu(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of halfword vector elements + * Arguments : Inputs - in_c, in_h, in_l + * Output - out + * Return Type - signed word + * Details : Unsigned halfword elements from in_h are multiplied with + * signed halfword elements from in_l producing a result + * twice the size of input i.e. signed word. + * Multiplication result of adjacent odd-even elements + * are added to the in_c vector + * Example : See out = __lasx_xvdp2add_w_h(in_c, in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2add_w_hu_h(__m256i in_c, __m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmaddwev_w_hu_h(in_c, in_h, in_l); + out = __lasx_xvmaddwod_w_hu_h(out, in_h, in_l); + return out; +} + +/* + * ============================================================================= + * Description : Vector Unsigned Dot Product and Subtract + * Arguments : Inputs - in_c, in_h, in_l + * Output - out + * Return Type - signed halfword + * Details : Unsigned byte elements from in_h are multiplied with + * unsigned byte elements from in_l producing a result + * twice the size of input i.e. signed halfword. + * Multiplication result of adjacent odd-even elements + * are added together and subtracted from double width elements + * in_c vector. 
+ * Example : See out = __lasx_xvdp2sub_w_h(in_c, in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2sub_h_bu(__m256i in_c, __m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmulwev_h_bu(in_h, in_l); + out = __lasx_xvmaddwod_h_bu(out, in_h, in_l); + out = __lasx_xvsub_h(in_c, out); + return out; +} + +/* + * ============================================================================= + * Description : Vector Signed Dot Product and Subtract + * Arguments : Inputs - in_c, in_h, in_l + * Output - out + * Return Type - signed word + * Details : Signed halfword elements from in_h are multiplied with + * signed halfword elements from in_l producing a result + * twice the size of input i.e. signed word. + * The multiplication results of adjacent odd-even elements + * are added together and subtracted from the double-width + * elements of the in_c vector. + * Example : out = __lasx_xvdp2sub_w_h(in_c, in_h, in_l) + * in_c : 0,0,0,0, 0,0,0,0 + * in_h : 3,1,3,0, 0,0,0,1, 0,0,1,1, 0,0,0,1 + * in_l : 2,1,1,0, 1,0,0,0, 0,0,1,0, 1,0,0,1 + * out : -7,-3,0,0, 0,-1,0,-1 + * ============================================================================= + */ +static inline __m256i __lasx_xvdp2sub_w_h(__m256i in_c, __m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmulwev_w_h(in_h, in_l); + out = __lasx_xvmaddwod_w_h(out, in_h, in_l); + out = __lasx_xvsub_w(in_c, out); + return out; +} + +/* + * ============================================================================= + * Description : Dot product of halfword vector elements + * Arguments : Inputs - in_h, in_l + * Output - out + * Return Type - signed doubleword + * Details : Signed halfword elements from in_h are multiplied with + * signed halfword elements from in_l producing a result + * four times the size of input i.e. signed doubleword.
+ * Then the multiplication results of four adjacent elements + * are added together and stored to the out vector. + * Example : out = __lasx_xvdp4_d_h(in_h, in_l) + * in_h : 3,1,3,0, 0,0,0,1, 0,0,1,-1, 0,0,0,1 + * in_l : -2,1,1,0, 1,0,0,0, 0,0,1, 0, 1,0,0,1 + * out : -2,0,1,1 + * ============================================================================= + */ +static inline __m256i __lasx_xvdp4_d_h(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvmulwev_w_h(in_h, in_l); + out = __lasx_xvmaddwod_w_h(out, in_h, in_l); + out = __lasx_xvhaddw_d_w(out, out); + return out; +} + +/* + * ============================================================================= + * Description : The high half of the vector elements are expanded and + * added after being doubled. + * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_h vector and the in_l vector are added after the + * higher half of the two-fold sign extension (signed byte + * to signed halfword) and stored to the out vector. + * Example : See out = __lasx_xvaddwh_w_h(in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvaddwh_h_b(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvilvh_b(in_h, in_l); + out = __lasx_xvhaddw_h_b(out, out); + return out; +} + +/* + * ============================================================================= + * Description : The high half of the vector elements are expanded and + * added after being doubled. + * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_h vector and the in_l vector are added after the + * higher half of the two-fold sign extension (signed halfword + * to signed word) and stored to the out vector.
+ * Example : out = __lasx_xvaddwh_w_h(in_h, in_l) + * in_h : 3, 0,3,0, 0,0,0,-1, 0,0,1,-1, 0,0,0,1 + * in_l : 2,-1,1,2, 1,0,0, 0, 1,0,1, 0, 1,0,0,1 + * out : 1,0,0,-1, 1,0,0, 2 + * ============================================================================= + */ + static inline __m256i __lasx_xvaddwh_w_h(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvilvh_h(in_h, in_l); + out = __lasx_xvhaddw_w_h(out, out); + return out; +} + +/* + * ============================================================================= + * Description : The low half of the vector elements are expanded and + * added after being doubled. + * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_h vector and the in_l vector are added after the + * lower half of the two-fold sign extension (signed byte + * to signed halfword) and stored to the out vector. + * Example : See out = __lasx_xvaddwl_w_h(in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvaddwl_h_b(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvilvl_b(in_h, in_l); + out = __lasx_xvhaddw_h_b(out, out); + return out; +} + +/* + * ============================================================================= + * Description : The low half of the vector elements are expanded and + * added after being doubled. + * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_h vector and the in_l vector are added after the + * lower half of the two-fold sign extension (signed halfword + * to signed word) and stored to the out vector. 
+ * Example : out = __lasx_xvaddwl_w_h(in_h, in_l) + * in_h : 3, 0,3,0, 0,0,0,-1, 0,0,1,-1, 0,0,0,1 + * in_l : 2,-1,1,2, 1,0,0, 0, 1,0,1, 0, 1,0,0,1 + * out : 5,-1,4,2, 1,0,2,-1 + * ============================================================================= + */ +static inline __m256i __lasx_xvaddwl_w_h(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvilvl_h(in_h, in_l); + out = __lasx_xvhaddw_w_h(out, out); + return out; +} + +/* + * ============================================================================= + * Description : The low half of the vector elements are expanded and + * added after being doubled. + * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_h vector and the in_l vector are added after the + * lower half of the two-fold zero extension (unsigned byte + * to unsigned halfword) and stored to the out vector. + * Example : See out = __lasx_xvaddwl_w_h(in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvaddwl_h_bu(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvilvl_b(in_h, in_l); + out = __lasx_xvhaddw_hu_bu(out, out); + return out; +} + +/* + * ============================================================================= + * Description : The low half of the vector elements are expanded and + * added after being doubled. + * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_l vector, after two-fold zero extension (unsigned byte to + * signed halfword), is added to the in_h vector.
+ * Example : See out = __lasx_xvaddw_w_w_h(in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvaddw_h_h_bu(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvsllwil_hu_bu(in_l, 0); + out = __lasx_xvadd_h(in_h, out); + return out; +} + +/* + * ============================================================================= + * Description : The low half of the vector elements are expanded and + * added after being doubled. + * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_l vector after double sign extension (signed halfword to + * signed word), added to the in_h vector. + * Example : out = __lasx_xvaddw_w_w_h(in_h, in_l) + * in_h : 0, 1,0,0, -1,0,0,1, + * in_l : 2,-1,1,2, 1,0,0,0, 0,0,1,0, 1,0,0,1, + * out : 2, 0,1,2, -1,0,1,1, + * ============================================================================= + */ +static inline __m256i __lasx_xvaddw_w_w_h(__m256i in_h, __m256i in_l) +{ + __m256i out; + + out = __lasx_xvsllwil_w_h(in_l, 0); + out = __lasx_xvadd_w(in_h, out); + return out; +} + +/* + * ============================================================================= + * Description : Multiplication and addition calculation after expansion + * of the lower half of the vector. + * Arguments : Inputs - in_c, in_h, in_l + * Output - out + * Details : The in_h vector and the in_l vector are multiplied after + * the lower half of the two-fold sign extension (signed halfword + * to signed word), and the result is added to the vector in_c, + * then stored to the out vector. 
+ * Example : out = __lasx_xvmaddwl_w_h(in_c, in_h, in_l) + * in_c : 1,2,3,4, 5,6,7,8 + * in_h : 1,2,3,4, 1,2,3,4, 5,6,7,8, 5,6,7,8 + * in_l : 200, 300, 400, 500, 2000, 3000, 4000, 5000, + * -200,-300,-400,-500, -2000,-3000,-4000,-5000 + * out : 201, 602,1203,2004, -995, -1794,-2793,-3992 + * ============================================================================= + */ +static inline __m256i __lasx_xvmaddwl_w_h(__m256i in_c, __m256i in_h, __m256i in_l) +{ + __m256i tmp0, tmp1, out; + + tmp0 = __lasx_xvsllwil_w_h(in_h, 0); + tmp1 = __lasx_xvsllwil_w_h(in_l, 0); + tmp0 = __lasx_xvmul_w(tmp0, tmp1); + out = __lasx_xvadd_w(tmp0, in_c); + return out; +} + +/* + * ============================================================================= + * Description : Multiplication and addition calculation after expansion + * of the higher half of the vector. + * Arguments : Inputs - in_c, in_h, in_l + * Output - out + * Details : The in_h vector and the in_l vector are multiplied after + * the higher half of the two-fold sign extension (signed + * halfword to signed word), and the result is added to + * the vector in_c, then stored to the out vector. + * Example : See out = __lasx_xvmaddwl_w_h(in_c, in_h, in_l) + * ============================================================================= + */ +static inline __m256i __lasx_xvmaddwh_w_h(__m256i in_c, __m256i in_h, __m256i in_l) +{ + __m256i tmp0, tmp1, out; + + tmp0 = __lasx_xvilvh_h(in_h, in_h); + tmp1 = __lasx_xvilvh_h(in_l, in_l); + tmp0 = __lasx_xvmulwev_w_h(tmp0, tmp1); + out = __lasx_xvadd_w(tmp0, in_c); + return out; +} + +/* + * ============================================================================= + * Description : Multiplication calculation after expansion of the lower + * half of the vector. 
+ * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_h vector and the in_l vector are multiplied after + * the lower half of the two-fold sign extension (signed + * halfword to signed word), then stored to the out vector. + * Example : out = __lasx_xvmulwl_w_h(in_h, in_l) + * in_h : 3,-1,3,0, 0,0,0,-1, 0,0,1,-1, 0,0,0,1 + * in_l : 2,-1,1,2, 1,0,0, 0, 0,0,1, 0, 1,0,0,1 + * out : 6,1,3,0, 0,0,1,0 + * ============================================================================= + */ +static inline __m256i __lasx_xvmulwl_w_h(__m256i in_h, __m256i in_l) +{ + __m256i tmp0, tmp1, out; + + tmp0 = __lasx_xvsllwil_w_h(in_h, 0); + tmp1 = __lasx_xvsllwil_w_h(in_l, 0); + out = __lasx_xvmul_w(tmp0, tmp1); + return out; +} + +/* + * ============================================================================= + * Description : Multiplication calculation after expansion of the higher + * half of the vector. + * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_h vector and the in_l vector are multiplied after + * the higher half of the two-fold sign extension (signed + * halfword to signed word), then stored to the out vector. + * Example : out = __lasx_xvmulwh_w_h(in_h, in_l) + * in_h : 3,-1,3,0, 0,0,0,-1, 0,0,1,-1, 0,0,0,1 + * in_l : 2,-1,1,2, 1,0,0, 0, 0,0,1, 0, 1,0,0,1 + * out : 0,0,0,0, 0,0,0,1 + * ============================================================================= + */ +static inline __m256i __lasx_xvmulwh_w_h(__m256i in_h, __m256i in_l) +{ + __m256i tmp0, tmp1, out; + + tmp0 = __lasx_xvilvh_h(in_h, in_h); + tmp1 = __lasx_xvilvh_h(in_l, in_l); + out = __lasx_xvmulwev_w_h(tmp0, tmp1); + return out; +} + +/* + * ============================================================================= + * Description : The low half of the vector elements are expanded and + * added with saturation after being doubled.
+ * Arguments : Inputs - in_h, in_l + * Output - out + * Details : The in_l vector is added to the in_h vector with saturation after + * the lower half of the two-fold zero extension (unsigned byte to + * unsigned halfword), and the results are stored to the out vector. + * Example : out = __lasx_xvsaddw_hu_hu_bu(in_h, in_l) + * in_h : 2,65532,1,2, 1,0,0,0, 0,0,1,0, 1,0,0,1 + * in_l : 3,6,3,0, 0,0,0,1, 0,0,1,1, 0,0,0,1, 3,18,3,0, 0,0,0,1, 0,0,1,1, 0,0,0,1 + * out : 5,65535,4,2, 1,0,0,1, 3,18,4,0, 1,0,0,2 + * ============================================================================= + */ +static inline __m256i __lasx_xvsaddw_hu_hu_bu(__m256i in_h, __m256i in_l) +{ + __m256i tmp1, out; + __m256i zero = {0}; + + tmp1 = __lasx_xvilvl_b(zero, in_l); + out = __lasx_xvsadd_hu(in_h, tmp1); + return out; +} + +/* + * ============================================================================= + * Description : Clip all halfword elements of input vector between min & max + * out = ((in) < (min)) ? (min) : (((in) > (max)) ?
(max) : (in)) + * Arguments : Inputs - in (input vector) + * - min (min threshold) + * - max (max threshold) + * Outputs - in (output vector with clipped elements) + * Return Type - signed halfword + * Example : out = __lasx_xvclip_h(in, min, max) + * in : -8,2,280,249, -8,255,280,249, 4,4,4,4, 5,5,5,5 + * min : 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1 + * max : 9,9,9,9, 9,9,9,9, 9,9,9,9, 9,9,9,9 + * out : 1,2,9,9, 1,9,9,9, 4,4,4,4, 5,5,5,5 + * ============================================================================= + */ +static inline __m256i __lasx_xvclip_h(__m256i in, __m256i min, __m256i max) +{ + __m256i out; + + out = __lasx_xvmax_h(min, in); + out = __lasx_xvmin_h(max, out); + return out; +} + +/* + * ============================================================================= + * Description : Clip all signed halfword elements of input vector + * between 0 & 255 + * Arguments : Inputs - in (input vector) + * Outputs - out (output vector with clipped elements) + * Return Type - signed halfword + * Example : See out = __lasx_xvclip255_w(in) + * ============================================================================= + */ +static inline __m256i __lasx_xvclip255_h(__m256i in) +{ + __m256i out; + + out = __lasx_xvmaxi_h(in, 0); + out = __lasx_xvsat_hu(out, 7); + return out; +} + +/* + * ============================================================================= + * Description : Clip all signed word elements of input vector + * between 0 & 255 + * Arguments : Inputs - in (input vector) + * Output - out (output vector with clipped elements) + * Return Type - signed word + * Example : out = __lasx_xvclip255_w(in) + * in : -8,255,280,249, -8,255,280,249 + * out : 0,255,255,249, 0,255,255,249 + * ============================================================================= + */ +static inline __m256i __lasx_xvclip255_w(__m256i in) +{ + __m256i out; + + out = __lasx_xvmaxi_w(in, 0); + out = __lasx_xvsat_wu(out, 7); + return out; +} + +/* + * 
============================================================================= + * Description : Indexed halfword element values are replicated to all + * elements in output vector. If 'idx < 8' use xvsplati_l_*, + * if 'idx >= 8' use xvsplati_h_*. + * Arguments : Inputs - in, idx + * Output - out + * Details : The element at index idx of the in vector is replicated to all + * elements in the out vector. + * Valid index range for halfword operation is 0-7 + * Example : out = __lasx_xvsplati_l_h(in, idx) + * in : 20,10,11,12, 13,14,15,16, 0,0,2,0, 0,0,0,0 + * idx : 0x02 + * out : 11,11,11,11, 11,11,11,11, 11,11,11,11, 11,11,11,11 + * ============================================================================= + */ +static inline __m256i __lasx_xvsplati_l_h(__m256i in, int idx) +{ + __m256i out; + + out = __lasx_xvpermi_q(in, in, 0x02); + out = __lasx_xvreplve_h(out, idx); + return out; +} + +/* + * ============================================================================= + * Description : Indexed halfword element values are replicated to all + * elements in output vector. If 'idx < 8' use xvsplati_l_*, + * if 'idx >= 8' use xvsplati_h_*. + * Arguments : Inputs - in, idx + * Output - out + * Details : The element at index idx of the in vector is replicated to all + * elements in the out vector.
+ * Valid index range for halfword operation is 0-7 + * Example : out = __lasx_xvsplati_h_h(in, idx) + * in : 20,10,11,12, 13,14,15,16, 0,2,0,0, 0,0,0,0 + * idx : 0x09 + * out : 2,2,2,2, 2,2,2,2, 2,2,2,2, 2,2,2,2 + * ============================================================================= + */ +static inline __m256i __lasx_xvsplati_h_h(__m256i in, int idx) +{ + __m256i out; + + out = __lasx_xvpermi_q(in, in, 0x13); + out = __lasx_xvreplve_h(out, idx); + return out; +} + +/* + * ============================================================================= + * Description : Transpose 4x4 block with double word elements in vectors + * Arguments : Inputs - _in0, _in1, _in2, _in3 + * Outputs - _out0, _out1, _out2, _out3 + * Example : LASX_TRANSPOSE4x4_D + * _in0 : 1,2,3,4 + * _in1 : 1,2,3,4 + * _in2 : 1,2,3,4 + * _in3 : 1,2,3,4 + * + * _out0 : 1,1,1,1 + * _out1 : 2,2,2,2 + * _out2 : 3,3,3,3 + * _out3 : 4,4,4,4 + * ============================================================================= + */ +#define LASX_TRANSPOSE4x4_D(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + __m256i _tmp0, _tmp1, _tmp2, _tmp3; \ + _tmp0 = __lasx_xvilvl_d(_in1, _in0); \ + _tmp1 = __lasx_xvilvh_d(_in1, _in0); \ + _tmp2 = __lasx_xvilvl_d(_in3, _in2); \ + _tmp3 = __lasx_xvilvh_d(_in3, _in2); \ + _out0 = __lasx_xvpermi_q(_tmp2, _tmp0, 0x20); \ + _out2 = __lasx_xvpermi_q(_tmp2, _tmp0, 0x31); \ + _out1 = __lasx_xvpermi_q(_tmp3, _tmp1, 0x20); \ + _out3 = __lasx_xvpermi_q(_tmp3, _tmp1, 0x31); \ +} + +/* + * ============================================================================= + * Description : Transpose 8x8 block with word elements in vectors + * Arguments : Inputs - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7 + * Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7 + * Example : LASX_TRANSPOSE8x8_W + * _in0 : 1,2,3,4,5,6,7,8 + * _in1 : 2,2,3,4,5,6,7,8 + * _in2 : 3,2,3,4,5,6,7,8 + * _in3 : 4,2,3,4,5,6,7,8 + * _in4 : 5,2,3,4,5,6,7,8 + * _in5 : 6,2,3,4,5,6,7,8 
+ * _in6 : 7,2,3,4,5,6,7,8 + * _in7 : 8,2,3,4,5,6,7,8 + * + * _out0 : 1,2,3,4,5,6,7,8 + * _out1 : 2,2,2,2,2,2,2,2 + * _out2 : 3,3,3,3,3,3,3,3 + * _out3 : 4,4,4,4,4,4,4,4 + * _out4 : 5,5,5,5,5,5,5,5 + * _out5 : 6,6,6,6,6,6,6,6 + * _out6 : 7,7,7,7,7,7,7,7 + * _out7 : 8,8,8,8,8,8,8,8 + * ============================================================================= + */ +#define LASX_TRANSPOSE8x8_W(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7) \ +{ \ + __m256i _s0_m, _s1_m; \ + __m256i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m; \ + __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m; \ + \ + _s0_m = __lasx_xvilvl_w(_in2, _in0); \ + _s1_m = __lasx_xvilvl_w(_in3, _in1); \ + _tmp0_m = __lasx_xvilvl_w(_s1_m, _s0_m); \ + _tmp1_m = __lasx_xvilvh_w(_s1_m, _s0_m); \ + _s0_m = __lasx_xvilvh_w(_in2, _in0); \ + _s1_m = __lasx_xvilvh_w(_in3, _in1); \ + _tmp2_m = __lasx_xvilvl_w(_s1_m, _s0_m); \ + _tmp3_m = __lasx_xvilvh_w(_s1_m, _s0_m); \ + _s0_m = __lasx_xvilvl_w(_in6, _in4); \ + _s1_m = __lasx_xvilvl_w(_in7, _in5); \ + _tmp4_m = __lasx_xvilvl_w(_s1_m, _s0_m); \ + _tmp5_m = __lasx_xvilvh_w(_s1_m, _s0_m); \ + _s0_m = __lasx_xvilvh_w(_in6, _in4); \ + _s1_m = __lasx_xvilvh_w(_in7, _in5); \ + _tmp6_m = __lasx_xvilvl_w(_s1_m, _s0_m); \ + _tmp7_m = __lasx_xvilvh_w(_s1_m, _s0_m); \ + _out0 = __lasx_xvpermi_q(_tmp4_m, _tmp0_m, 0x20); \ + _out1 = __lasx_xvpermi_q(_tmp5_m, _tmp1_m, 0x20); \ + _out2 = __lasx_xvpermi_q(_tmp6_m, _tmp2_m, 0x20); \ + _out3 = __lasx_xvpermi_q(_tmp7_m, _tmp3_m, 0x20); \ + _out4 = __lasx_xvpermi_q(_tmp4_m, _tmp0_m, 0x31); \ + _out5 = __lasx_xvpermi_q(_tmp5_m, _tmp1_m, 0x31); \ + _out6 = __lasx_xvpermi_q(_tmp6_m, _tmp2_m, 0x31); \ + _out7 = __lasx_xvpermi_q(_tmp7_m, _tmp3_m, 0x31); \ +} + +/* + * ============================================================================= + * Description : Transpose input 16x8 byte block + * Arguments : Inputs - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, + * _in8, _in9, _in10, 
_in11, _in12, _in13, _in14, _in15 + * (input 16x8 byte block) + * Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7 + * (output 8x16 byte block) + * Details : The rows of the matrix become columns, and the columns become rows. + * Example : See LASX_TRANSPOSE16x8_H + * ============================================================================= + */ +#define LASX_TRANSPOSE16x8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _in8, _in9, _in10, _in11, _in12, _in13, _in14, _in15, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7) \ +{ \ + __m256i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m; \ + __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m; \ + \ + _tmp0_m = __lasx_xvilvl_b(_in2, _in0); \ + _tmp1_m = __lasx_xvilvl_b(_in3, _in1); \ + _tmp2_m = __lasx_xvilvl_b(_in6, _in4); \ + _tmp3_m = __lasx_xvilvl_b(_in7, _in5); \ + _tmp4_m = __lasx_xvilvl_b(_in10, _in8); \ + _tmp5_m = __lasx_xvilvl_b(_in11, _in9); \ + _tmp6_m = __lasx_xvilvl_b(_in14, _in12); \ + _tmp7_m = __lasx_xvilvl_b(_in15, _in13); \ + _out0 = __lasx_xvilvl_b(_tmp1_m, _tmp0_m); \ + _out1 = __lasx_xvilvh_b(_tmp1_m, _tmp0_m); \ + _out2 = __lasx_xvilvl_b(_tmp3_m, _tmp2_m); \ + _out3 = __lasx_xvilvh_b(_tmp3_m, _tmp2_m); \ + _out4 = __lasx_xvilvl_b(_tmp5_m, _tmp4_m); \ + _out5 = __lasx_xvilvh_b(_tmp5_m, _tmp4_m); \ + _out6 = __lasx_xvilvl_b(_tmp7_m, _tmp6_m); \ + _out7 = __lasx_xvilvh_b(_tmp7_m, _tmp6_m); \ + _tmp0_m = __lasx_xvilvl_w(_out2, _out0); \ + _tmp2_m = __lasx_xvilvh_w(_out2, _out0); \ + _tmp4_m = __lasx_xvilvl_w(_out3, _out1); \ + _tmp6_m = __lasx_xvilvh_w(_out3, _out1); \ + _tmp1_m = __lasx_xvilvl_w(_out6, _out4); \ + _tmp3_m = __lasx_xvilvh_w(_out6, _out4); \ + _tmp5_m = __lasx_xvilvl_w(_out7, _out5); \ + _tmp7_m = __lasx_xvilvh_w(_out7, _out5); \ + _out0 = __lasx_xvilvl_d(_tmp1_m, _tmp0_m); \ + _out1 = __lasx_xvilvh_d(_tmp1_m, _tmp0_m); \ + _out2 = __lasx_xvilvl_d(_tmp3_m, _tmp2_m); \ + _out3 = __lasx_xvilvh_d(_tmp3_m, _tmp2_m); \ + _out4 = __lasx_xvilvl_d(_tmp5_m, _tmp4_m); \ + _out5 
= __lasx_xvilvh_d(_tmp5_m, _tmp4_m); \ + _out6 = __lasx_xvilvl_d(_tmp7_m, _tmp6_m); \ + _out7 = __lasx_xvilvh_d(_tmp7_m, _tmp6_m); \ +} + +/* + * ============================================================================= + * Description : Transpose input 16x8 halfword block + * Arguments : Inputs - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, + * _in8, _in9, _in10, _in11, _in12, _in13, _in14, _in15 + * (input 16x8 halfword block) + * Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7 + * (output 8x16 halfword block) + * Details : The rows of the matrix become columns, and the columns become rows. + * Example : LASX_TRANSPOSE16x8_H + * _in0 : 1,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in1 : 2,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in2 : 3,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in3 : 4,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in4 : 5,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in5 : 6,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in6 : 7,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in7 : 8,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in8 : 9,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in9 : 1,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in10 : 0,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in11 : 2,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in12 : 3,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in13 : 7,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in14 : 5,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * _in15 : 6,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0 + * + * _out0 : 1,2,3,4,5,6,7,8,9,1,0,2,3,7,5,6 + * _out1 : 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2 + * _out2 : 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3 + * _out3 : 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4 + * _out4 : 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5 + * _out5 : 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6 + * _out6 : 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7 + * _out7 : 8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8 + * ============================================================================= + */ +#define LASX_TRANSPOSE16x8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _in8, _in9, _in10, _in11, _in12, _in13, _in14, _in15, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7) \ + { \ + __m256i
_tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m; \ + __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m; \ + __m256i _t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7; \ + \ + _tmp0_m = __lasx_xvilvl_h(_in2, _in0); \ + _tmp1_m = __lasx_xvilvl_h(_in3, _in1); \ + _tmp2_m = __lasx_xvilvl_h(_in6, _in4); \ + _tmp3_m = __lasx_xvilvl_h(_in7, _in5); \ + _tmp4_m = __lasx_xvilvl_h(_in10, _in8); \ + _tmp5_m = __lasx_xvilvl_h(_in11, _in9); \ + _tmp6_m = __lasx_xvilvl_h(_in14, _in12); \ + _tmp7_m = __lasx_xvilvl_h(_in15, _in13); \ + _t0 = __lasx_xvilvl_h(_tmp1_m, _tmp0_m); \ + _t1 = __lasx_xvilvh_h(_tmp1_m, _tmp0_m); \ + _t2 = __lasx_xvilvl_h(_tmp3_m, _tmp2_m); \ + _t3 = __lasx_xvilvh_h(_tmp3_m, _tmp2_m); \ + _t4 = __lasx_xvilvl_h(_tmp5_m, _tmp4_m); \ + _t5 = __lasx_xvilvh_h(_tmp5_m, _tmp4_m); \ + _t6 = __lasx_xvilvl_h(_tmp7_m, _tmp6_m); \ + _t7 = __lasx_xvilvh_h(_tmp7_m, _tmp6_m); \ + _tmp0_m = __lasx_xvilvl_d(_t2, _t0); \ + _tmp2_m = __lasx_xvilvh_d(_t2, _t0); \ + _tmp4_m = __lasx_xvilvl_d(_t3, _t1); \ + _tmp6_m = __lasx_xvilvh_d(_t3, _t1); \ + _tmp1_m = __lasx_xvilvl_d(_t6, _t4); \ + _tmp3_m = __lasx_xvilvh_d(_t6, _t4); \ + _tmp5_m = __lasx_xvilvl_d(_t7, _t5); \ + _tmp7_m = __lasx_xvilvh_d(_t7, _t5); \ + _out0 = __lasx_xvpermi_q(_tmp1_m, _tmp0_m, 0x20); \ + _out1 = __lasx_xvpermi_q(_tmp3_m, _tmp2_m, 0x20); \ + _out2 = __lasx_xvpermi_q(_tmp5_m, _tmp4_m, 0x20); \ + _out3 = __lasx_xvpermi_q(_tmp7_m, _tmp6_m, 0x20); \ + \ + _tmp0_m = __lasx_xvilvh_h(_in2, _in0); \ + _tmp1_m = __lasx_xvilvh_h(_in3, _in1); \ + _tmp2_m = __lasx_xvilvh_h(_in6, _in4); \ + _tmp3_m = __lasx_xvilvh_h(_in7, _in5); \ + _tmp4_m = __lasx_xvilvh_h(_in10, _in8); \ + _tmp5_m = __lasx_xvilvh_h(_in11, _in9); \ + _tmp6_m = __lasx_xvilvh_h(_in14, _in12); \ + _tmp7_m = __lasx_xvilvh_h(_in15, _in13); \ + _t0 = __lasx_xvilvl_h(_tmp1_m, _tmp0_m); \ + _t1 = __lasx_xvilvh_h(_tmp1_m, _tmp0_m); \ + _t2 = __lasx_xvilvl_h(_tmp3_m, _tmp2_m); \ + _t3 = __lasx_xvilvh_h(_tmp3_m, _tmp2_m); \ + _t4 = __lasx_xvilvl_h(_tmp5_m, _tmp4_m); \ + _t5 = 
__lasx_xvilvh_h(_tmp5_m, _tmp4_m); \ + _t6 = __lasx_xvilvl_h(_tmp7_m, _tmp6_m); \ + _t7 = __lasx_xvilvh_h(_tmp7_m, _tmp6_m); \ + _tmp0_m = __lasx_xvilvl_d(_t2, _t0); \ + _tmp2_m = __lasx_xvilvh_d(_t2, _t0); \ + _tmp4_m = __lasx_xvilvl_d(_t3, _t1); \ + _tmp6_m = __lasx_xvilvh_d(_t3, _t1); \ + _tmp1_m = __lasx_xvilvl_d(_t6, _t4); \ + _tmp3_m = __lasx_xvilvh_d(_t6, _t4); \ + _tmp5_m = __lasx_xvilvl_d(_t7, _t5); \ + _tmp7_m = __lasx_xvilvh_d(_t7, _t5); \ + _out4 = __lasx_xvpermi_q(_tmp1_m, _tmp0_m, 0x20); \ + _out5 = __lasx_xvpermi_q(_tmp3_m, _tmp2_m, 0x20); \ + _out6 = __lasx_xvpermi_q(_tmp5_m, _tmp4_m, 0x20); \ + _out7 = __lasx_xvpermi_q(_tmp7_m, _tmp6_m, 0x20); \ +} + +/* + * ============================================================================= + * Description : Transpose 4x4 block with halfword elements in vectors + * Arguments : Inputs - _in0, _in1, _in2, _in3 + * Outputs - _out0, _out1, _out2, _out3 + * Return Type - signed halfword + * Details : The rows of the matrix become columns, and the columns become rows. 
+ * Example : See LASX_TRANSPOSE8x8_H + * ============================================================================= + */ +#define LASX_TRANSPOSE4x4_H(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + __m256i _s0_m, _s1_m; \ + \ + _s0_m = __lasx_xvilvl_h(_in1, _in0); \ + _s1_m = __lasx_xvilvl_h(_in3, _in2); \ + _out0 = __lasx_xvilvl_w(_s1_m, _s0_m); \ + _out2 = __lasx_xvilvh_w(_s1_m, _s0_m); \ + _out1 = __lasx_xvilvh_d(_out0, _out0); \ + _out3 = __lasx_xvilvh_d(_out2, _out2); \ +} + +/* + * ============================================================================= + * Description : Transpose input 8x8 byte block + * Arguments : Inputs - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7 + * (input 8x8 byte block) + * Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7 + * (output 8x8 byte block) + * Example : See LASX_TRANSPOSE8x8_H + * ============================================================================= + */ +#define LASX_TRANSPOSE8x8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, _out0, \ + _out1, _out2, _out3, _out4, _out5, _out6, _out7) \ +{ \ + __m256i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m; \ + __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m; \ + _tmp0_m = __lasx_xvilvl_b(_in2, _in0); \ + _tmp1_m = __lasx_xvilvl_b(_in3, _in1); \ + _tmp2_m = __lasx_xvilvl_b(_in6, _in4); \ + _tmp3_m = __lasx_xvilvl_b(_in7, _in5); \ + _tmp4_m = __lasx_xvilvl_b(_tmp1_m, _tmp0_m); \ + _tmp5_m = __lasx_xvilvh_b(_tmp1_m, _tmp0_m); \ + _tmp6_m = __lasx_xvilvl_b(_tmp3_m, _tmp2_m); \ + _tmp7_m = __lasx_xvilvh_b(_tmp3_m, _tmp2_m); \ + _out0 = __lasx_xvilvl_w(_tmp6_m, _tmp4_m); \ + _out2 = __lasx_xvilvh_w(_tmp6_m, _tmp4_m); \ + _out4 = __lasx_xvilvl_w(_tmp7_m, _tmp5_m); \ + _out6 = __lasx_xvilvh_w(_tmp7_m, _tmp5_m); \ + _out1 = __lasx_xvbsrl_v(_out0, 8); \ + _out3 = __lasx_xvbsrl_v(_out2, 8); \ + _out5 = __lasx_xvbsrl_v(_out4, 8); \ + _out7 = __lasx_xvbsrl_v(_out6, 8); \ +} + +/* + * 
============================================================================= + * Description : Transpose 8x8 block with halfword elements in vectors. + * Arguments : Inputs - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7 + * Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7 + * Details : The rows of the matrix become columns, and the columns become rows. + * Example : LASX_TRANSPOSE8x8_H + * _in0 : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * _in1 : 8,2,3,4, 5,6,7,8, 8,2,3,4, 5,6,7,8 + * _in2 : 8,2,3,4, 5,6,7,8, 8,2,3,4, 5,6,7,8 + * _in3 : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * _in4 : 9,2,3,4, 5,6,7,8, 9,2,3,4, 5,6,7,8 + * _in5 : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * _in6 : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8 + * _in7 : 9,2,3,4, 5,6,7,8, 9,2,3,4, 5,6,7,8 + * + * _out0 : 1,8,8,1, 9,1,1,9, 1,8,8,1, 9,1,1,9 + * _out1 : 2,2,2,2, 2,2,2,2, 2,2,2,2, 2,2,2,2 + * _out2 : 3,3,3,3, 3,3,3,3, 3,3,3,3, 3,3,3,3 + * _out3 : 4,4,4,4, 4,4,4,4, 4,4,4,4, 4,4,4,4 + * _out4 : 5,5,5,5, 5,5,5,5, 5,5,5,5, 5,5,5,5 + * _out5 : 6,6,6,6, 6,6,6,6, 6,6,6,6, 6,6,6,6 + * _out6 : 7,7,7,7, 7,7,7,7, 7,7,7,7, 7,7,7,7 + * _out7 : 8,8,8,8, 8,8,8,8, 8,8,8,8, 8,8,8,8 + * ============================================================================= + */ +#define LASX_TRANSPOSE8x8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, _out0, \ + _out1, _out2, _out3, _out4, _out5, _out6, _out7) \ +{ \ + __m256i _s0_m, _s1_m; \ + __m256i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m; \ + __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m; \ + \ + _s0_m = __lasx_xvilvl_h(_in6, _in4); \ + _s1_m = __lasx_xvilvl_h(_in7, _in5); \ + _tmp0_m = __lasx_xvilvl_h(_s1_m, _s0_m); \ + _tmp1_m = __lasx_xvilvh_h(_s1_m, _s0_m); \ + _s0_m = __lasx_xvilvh_h(_in6, _in4); \ + _s1_m = __lasx_xvilvh_h(_in7, _in5); \ + _tmp2_m = __lasx_xvilvl_h(_s1_m, _s0_m); \ + _tmp3_m = __lasx_xvilvh_h(_s1_m, _s0_m); \ + \ + _s0_m = __lasx_xvilvl_h(_in2, _in0); \ + _s1_m = __lasx_xvilvl_h(_in3, _in1); \ + _tmp4_m = __lasx_xvilvl_h(_s1_m, _s0_m); \ + _tmp5_m = __lasx_xvilvh_h(_s1_m, _s0_m); \ + _s0_m = __lasx_xvilvh_h(_in2, _in0);
\ + _s1_m = __lasx_xvilvh_h(_in3, _in1); \ + _tmp6_m = __lasx_xvilvl_h(_s1_m, _s0_m); \ + _tmp7_m = __lasx_xvilvh_h(_s1_m, _s0_m); \ + \ + _out0 = __lasx_xvpickev_d(_tmp0_m, _tmp4_m); \ + _out2 = __lasx_xvpickev_d(_tmp1_m, _tmp5_m); \ + _out4 = __lasx_xvpickev_d(_tmp2_m, _tmp6_m); \ + _out6 = __lasx_xvpickev_d(_tmp3_m, _tmp7_m); \ + _out1 = __lasx_xvpickod_d(_tmp0_m, _tmp4_m); \ + _out3 = __lasx_xvpickod_d(_tmp1_m, _tmp5_m); \ + _out5 = __lasx_xvpickod_d(_tmp2_m, _tmp6_m); \ + _out7 = __lasx_xvpickod_d(_tmp3_m, _tmp7_m); \ +} + +/* + * ============================================================================= + * Description : Butterfly of 4 input vectors + * Arguments : Inputs - _in0, _in1, _in2, _in3 + * Outputs - _out0, _out1, _out2, _out3 + * Details : Butterfly operation + * Example : LASX_BUTTERFLY_4 + * _out0 = _in0 + _in3; + * _out1 = _in1 + _in2; + * _out2 = _in1 - _in2; + * _out3 = _in0 - _in3; + * ============================================================================= + */ +#define LASX_BUTTERFLY_4_B(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + _out0 = __lasx_xvadd_b(_in0, _in3); \ + _out1 = __lasx_xvadd_b(_in1, _in2); \ + _out2 = __lasx_xvsub_b(_in1, _in2); \ + _out3 = __lasx_xvsub_b(_in0, _in3); \ +} +#define LASX_BUTTERFLY_4_H(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + _out0 = __lasx_xvadd_h(_in0, _in3); \ + _out1 = __lasx_xvadd_h(_in1, _in2); \ + _out2 = __lasx_xvsub_h(_in1, _in2); \ + _out3 = __lasx_xvsub_h(_in0, _in3); \ +} +#define LASX_BUTTERFLY_4_W(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + _out0 = __lasx_xvadd_w(_in0, _in3); \ + _out1 = __lasx_xvadd_w(_in1, _in2); \ + _out2 = __lasx_xvsub_w(_in1, _in2); \ + _out3 = __lasx_xvsub_w(_in0, _in3); \ +} +#define LASX_BUTTERFLY_4_D(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \ +{ \ + _out0 = __lasx_xvadd_d(_in0, _in3); \ + _out1 = __lasx_xvadd_d(_in1, _in2); \ + _out2 = __lasx_xvsub_d(_in1, _in2); \ + _out3 = __lasx_xvsub_d(_in0, 
_in3); \ +} + +/* + * ============================================================================= + * Description : Butterfly of 8 input vectors + * Arguments : Inputs - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7 + * Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7 + * Details : Adds mirrored input pairs into _out0.._out3 and subtracts them into _out4.._out7 + * Example : LASX_BUTTERFLY_8 + * _out0 = _in0 + _in7; + * _out1 = _in1 + _in6; + * _out2 = _in2 + _in5; + * _out3 = _in3 + _in4; + * _out4 = _in3 - _in4; + * _out5 = _in2 - _in5; + * _out6 = _in1 - _in6; + * _out7 = _in0 - _in7; + * ============================================================================= + */ +#define LASX_BUTTERFLY_8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + _out0 = __lasx_xvadd_b(_in0, _in7); \ + _out1 = __lasx_xvadd_b(_in1, _in6); \ + _out2 = __lasx_xvadd_b(_in2, _in5); \ + _out3 = __lasx_xvadd_b(_in3, _in4); \ + _out4 = __lasx_xvsub_b(_in3, _in4); \ + _out5 = __lasx_xvsub_b(_in2, _in5); \ + _out6 = __lasx_xvsub_b(_in1, _in6); \ + _out7 = __lasx_xvsub_b(_in0, _in7); \ +} + +#define LASX_BUTTERFLY_8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + _out0 = __lasx_xvadd_h(_in0, _in7); \ + _out1 = __lasx_xvadd_h(_in1, _in6); \ + _out2 = __lasx_xvadd_h(_in2, _in5); \ + _out3 = __lasx_xvadd_h(_in3, _in4); \ + _out4 = __lasx_xvsub_h(_in3, _in4); \ + _out5 = __lasx_xvsub_h(_in2, _in5); \ + _out6 = __lasx_xvsub_h(_in1, _in6); \ + _out7 = __lasx_xvsub_h(_in0, _in7); \ +} + +#define LASX_BUTTERFLY_8_W(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + _out0 = __lasx_xvadd_w(_in0, _in7); \ + _out1 = __lasx_xvadd_w(_in1, _in6); \ + _out2 = __lasx_xvadd_w(_in2, _in5); \ + _out3 = __lasx_xvadd_w(_in3, _in4); \ + _out4 = __lasx_xvsub_w(_in3, _in4); \ + _out5 = __lasx_xvsub_w(_in2, _in5); \ + _out6 = __lasx_xvsub_w(_in1, _in6); \ + _out7 = __lasx_xvsub_w(_in0, _in7); \ +} +
+#define LASX_BUTTERFLY_8_D(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, \ + _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\ +{ \ + _out0 = __lasx_xvadd_d(_in0, _in7); \ + _out1 = __lasx_xvadd_d(_in1, _in6); \ + _out2 = __lasx_xvadd_d(_in2, _in5); \ + _out3 = __lasx_xvadd_d(_in3, _in4); \ + _out4 = __lasx_xvsub_d(_in3, _in4); \ + _out5 = __lasx_xvsub_d(_in2, _in5); \ + _out6 = __lasx_xvsub_d(_in1, _in6); \ + _out7 = __lasx_xvsub_d(_in0, _in7); \ +} + +#endif //LASX + +/* + * ============================================================================= + * Description : Print out elements in vector. + * Arguments : Inputs - RTYPE, _element_num, _in0, _enter + * Outputs - + * Details : Print out '_element_num' elements in 'RTYPE' vector '_in0', if + * '_enter' is TRUE, prefix "\nVP:" will be added first. + * Example : VECT_PRINT(v4i32,4,in0,1); // in0: 1,2,3,4 + * VP:1,2,3,4, + * ============================================================================= + */ +#define VECT_PRINT(RTYPE, element_num, in0, enter) \ +{ \ + RTYPE _tmp0 = (RTYPE)in0; \ + int _i = 0; \ + if (enter) \ + printf("\nVP:"); \ + for(_i = 0; _i < element_num; _i++) \ + printf("%d,",_tmp0[_i]); \ +} + +#endif /* LOONGSON_INTRINSICS_H */ +#endif /* AVUTIL_LOONGARCH_LOONGSON_INTRINSICS_H */
From patchwork Tue Dec 14 13:33:12 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= X-Patchwork-Id: 32487 From: Hao Chen To: ffmpeg-devel@ffmpeg.org Date: Tue, 14 Dec 2021 21:33:12 +0800 Message-Id: <20211214133316.8978-4-chenhao@loongson.cn> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn> References: <20211214133316.8978-1-chenhao@loongson.cn> Subject: [FFmpeg-devel] [PATCH v2 3/7] avcodec: [loongarch] Optimize h264qpel with LASX. Cc: Shiyou Yin
From: Shiyou Yin ./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an before:183 after :225 Change-Id: I7c7d2f34cd82ef728aab5ce8f6bfb46dd81f0da4 --- libavcodec/h264qpel.c | 2 + libavcodec/h264qpel.h | 1 + libavcodec/loongarch/Makefile | 2 + .../loongarch/h264qpel_init_loongarch.c | 98 + libavcodec/loongarch/h264qpel_lasx.c | 2038 +++++++++++++++++ libavcodec/loongarch/h264qpel_lasx.h | 158 ++ 6 files changed, 2299 insertions(+) create mode 100644 libavcodec/loongarch/h264qpel_init_loongarch.c create mode 100644 libavcodec/loongarch/h264qpel_lasx.c create mode 100644 libavcodec/loongarch/h264qpel_lasx.h diff --git a/libavcodec/h264qpel.c b/libavcodec/h264qpel.c index 50e82e23b0..535ebd25b4 100644 --- a/libavcodec/h264qpel.c +++ b/libavcodec/h264qpel.c @@ -106,4 +106,6 @@ av_cold void ff_h264qpel_init(H264QpelContext *c, int bit_depth) ff_h264qpel_init_x86(c, bit_depth); if (ARCH_MIPS) ff_h264qpel_init_mips(c, bit_depth); + if (ARCH_LOONGARCH64) + ff_h264qpel_init_loongarch(c, bit_depth); } diff --git a/libavcodec/h264qpel.h
b/libavcodec/h264qpel.h index 7c57ad001c..0259e8de23 100644 --- a/libavcodec/h264qpel.h +++ b/libavcodec/h264qpel.h @@ -36,5 +36,6 @@ void ff_h264qpel_init_arm(H264QpelContext *c, int bit_depth); void ff_h264qpel_init_ppc(H264QpelContext *c, int bit_depth); void ff_h264qpel_init_x86(H264QpelContext *c, int bit_depth); void ff_h264qpel_init_mips(H264QpelContext *c, int bit_depth); +void ff_h264qpel_init_loongarch(H264QpelContext *c, int bit_depth); #endif /* AVCODEC_H264QPEL_H */ diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile index f8fb54c925..4e2ce8487f 100644 --- a/libavcodec/loongarch/Makefile +++ b/libavcodec/loongarch/Makefile @@ -1,2 +1,4 @@ OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_init_loongarch.o +OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_init_loongarch.o LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o +LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o diff --git a/libavcodec/loongarch/h264qpel_init_loongarch.c b/libavcodec/loongarch/h264qpel_init_loongarch.c new file mode 100644 index 0000000000..969c9c376c --- /dev/null +++ b/libavcodec/loongarch/h264qpel_init_loongarch.c @@ -0,0 +1,98 @@ +/* + * Copyright (c) 2020 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "h264qpel_lasx.h" +#include "libavutil/attributes.h" +#include "libavutil/loongarch/cpu.h" +#include "libavcodec/h264qpel.h" + +av_cold void ff_h264qpel_init_loongarch(H264QpelContext *c, int bit_depth) +{ + int cpu_flags = av_get_cpu_flags(); + if (have_lasx(cpu_flags)) { + if (8 == bit_depth) { + c->put_h264_qpel_pixels_tab[0][0] = ff_put_h264_qpel16_mc00_lasx; + c->put_h264_qpel_pixels_tab[0][1] = ff_put_h264_qpel16_mc10_lasx; + c->put_h264_qpel_pixels_tab[0][2] = ff_put_h264_qpel16_mc20_lasx; + c->put_h264_qpel_pixels_tab[0][3] = ff_put_h264_qpel16_mc30_lasx; + c->put_h264_qpel_pixels_tab[0][4] = ff_put_h264_qpel16_mc01_lasx; + c->put_h264_qpel_pixels_tab[0][5] = ff_put_h264_qpel16_mc11_lasx; + + c->put_h264_qpel_pixels_tab[0][6] = ff_put_h264_qpel16_mc21_lasx; + c->put_h264_qpel_pixels_tab[0][7] = ff_put_h264_qpel16_mc31_lasx; + c->put_h264_qpel_pixels_tab[0][8] = ff_put_h264_qpel16_mc02_lasx; + c->put_h264_qpel_pixels_tab[0][9] = ff_put_h264_qpel16_mc12_lasx; + c->put_h264_qpel_pixels_tab[0][10] = ff_put_h264_qpel16_mc22_lasx; + c->put_h264_qpel_pixels_tab[0][11] = ff_put_h264_qpel16_mc32_lasx; + c->put_h264_qpel_pixels_tab[0][12] = ff_put_h264_qpel16_mc03_lasx; + c->put_h264_qpel_pixels_tab[0][13] = ff_put_h264_qpel16_mc13_lasx; + c->put_h264_qpel_pixels_tab[0][14] = ff_put_h264_qpel16_mc23_lasx; + c->put_h264_qpel_pixels_tab[0][15] = ff_put_h264_qpel16_mc33_lasx; + c->avg_h264_qpel_pixels_tab[0][0] = ff_avg_h264_qpel16_mc00_lasx; + c->avg_h264_qpel_pixels_tab[0][1] = ff_avg_h264_qpel16_mc10_lasx; + c->avg_h264_qpel_pixels_tab[0][2] = ff_avg_h264_qpel16_mc20_lasx; + c->avg_h264_qpel_pixels_tab[0][3] = ff_avg_h264_qpel16_mc30_lasx; + c->avg_h264_qpel_pixels_tab[0][4] = ff_avg_h264_qpel16_mc01_lasx; + 
c->avg_h264_qpel_pixels_tab[0][5] = ff_avg_h264_qpel16_mc11_lasx; + c->avg_h264_qpel_pixels_tab[0][6] = ff_avg_h264_qpel16_mc21_lasx; + c->avg_h264_qpel_pixels_tab[0][7] = ff_avg_h264_qpel16_mc31_lasx; + c->avg_h264_qpel_pixels_tab[0][8] = ff_avg_h264_qpel16_mc02_lasx; + c->avg_h264_qpel_pixels_tab[0][9] = ff_avg_h264_qpel16_mc12_lasx; + c->avg_h264_qpel_pixels_tab[0][10] = ff_avg_h264_qpel16_mc22_lasx; + c->avg_h264_qpel_pixels_tab[0][11] = ff_avg_h264_qpel16_mc32_lasx; + c->avg_h264_qpel_pixels_tab[0][12] = ff_avg_h264_qpel16_mc03_lasx; + c->avg_h264_qpel_pixels_tab[0][13] = ff_avg_h264_qpel16_mc13_lasx; + c->avg_h264_qpel_pixels_tab[0][14] = ff_avg_h264_qpel16_mc23_lasx; + c->avg_h264_qpel_pixels_tab[0][15] = ff_avg_h264_qpel16_mc33_lasx; + + c->put_h264_qpel_pixels_tab[1][0] = ff_put_h264_qpel8_mc00_lasx; + c->put_h264_qpel_pixels_tab[1][1] = ff_put_h264_qpel8_mc10_lasx; + c->put_h264_qpel_pixels_tab[1][2] = ff_put_h264_qpel8_mc20_lasx; + c->put_h264_qpel_pixels_tab[1][3] = ff_put_h264_qpel8_mc30_lasx; + c->put_h264_qpel_pixels_tab[1][4] = ff_put_h264_qpel8_mc01_lasx; + c->put_h264_qpel_pixels_tab[1][5] = ff_put_h264_qpel8_mc11_lasx; + c->put_h264_qpel_pixels_tab[1][6] = ff_put_h264_qpel8_mc21_lasx; + c->put_h264_qpel_pixels_tab[1][7] = ff_put_h264_qpel8_mc31_lasx; + c->put_h264_qpel_pixels_tab[1][8] = ff_put_h264_qpel8_mc02_lasx; + c->put_h264_qpel_pixels_tab[1][9] = ff_put_h264_qpel8_mc12_lasx; + c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_lasx; + c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_lasx; + c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_lasx; + c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_lasx; + c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_lasx; + c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_lasx; + c->avg_h264_qpel_pixels_tab[1][0] = ff_avg_h264_qpel8_mc00_lasx; + c->avg_h264_qpel_pixels_tab[1][1] = ff_avg_h264_qpel8_mc10_lasx; + 
c->avg_h264_qpel_pixels_tab[1][2] = ff_avg_h264_qpel8_mc20_lasx; + c->avg_h264_qpel_pixels_tab[1][3] = ff_avg_h264_qpel8_mc30_lasx; + c->avg_h264_qpel_pixels_tab[1][5] = ff_avg_h264_qpel8_mc11_lasx; + c->avg_h264_qpel_pixels_tab[1][6] = ff_avg_h264_qpel8_mc21_lasx; + c->avg_h264_qpel_pixels_tab[1][7] = ff_avg_h264_qpel8_mc31_lasx; + c->avg_h264_qpel_pixels_tab[1][8] = ff_avg_h264_qpel8_mc02_lasx; + c->avg_h264_qpel_pixels_tab[1][9] = ff_avg_h264_qpel8_mc12_lasx; + c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_lasx; + c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_lasx; + c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_lasx; + c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_lasx; + c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_lasx; + } + } +} diff --git a/libavcodec/loongarch/h264qpel_lasx.c b/libavcodec/loongarch/h264qpel_lasx.c new file mode 100644 index 0000000000..1c142e510e --- /dev/null +++ b/libavcodec/loongarch/h264qpel_lasx.c @@ -0,0 +1,2038 @@ +/* + * Loongson LASX optimized h264qpel + * + * Copyright (c) 2020 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "h264qpel_lasx.h" +#include "libavutil/loongarch/loongson_intrinsics.h" +#include "libavutil/attributes.h" + +static const uint8_t luma_mask_arr[16 * 6] __attribute__((aligned(0x40))) = { + /* 8 width cases */ + 0, 5, 1, 6, 2, 7, 3, 8, 4, 9, 5, 10, 6, 11, 7, 12, + 0, 5, 1, 6, 2, 7, 3, 8, 4, 9, 5, 10, 6, 11, 7, 12, + 1, 4, 2, 5, 3, 6, 4, 7, 5, 8, 6, 9, 7, 10, 8, 11, + 1, 4, 2, 5, 3, 6, 4, 7, 5, 8, 6, 9, 7, 10, 8, 11, + 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, + 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10 +}; + +#define AVC_HORZ_FILTER_SH(in0, in1, mask0, mask1, mask2) \ +( { \ + __m256i out0_m; \ + __m256i tmp0_m; \ + \ + tmp0_m = __lasx_xvshuf_b(in1, in0, mask0); \ + out0_m = __lasx_xvhaddw_h_b(tmp0_m, tmp0_m); \ + tmp0_m = __lasx_xvshuf_b(in1, in0, mask1); \ + out0_m = __lasx_xvdp2add_h_b(out0_m, minus5b, tmp0_m); \ + tmp0_m = __lasx_xvshuf_b(in1, in0, mask2); \ + out0_m = __lasx_xvdp2add_h_b(out0_m, plus20b, tmp0_m); \ + \ + out0_m; \ +} ) + +#define AVC_DOT_SH3_SH(in0, in1, in2, coeff0, coeff1, coeff2) \ +( { \ + __m256i out0_m; \ + \ + out0_m = __lasx_xvdp2_h_b(in0, coeff0); \ + DUP2_ARG3(__lasx_xvdp2add_h_b, out0_m, in1, coeff1, out0_m,\ + in2, coeff2, out0_m, out0_m); \ + \ + out0_m; \ +} ) + +static av_always_inline +void avc_luma_hv_qrt_and_aver_dst_16x16_lasx(uint8_t *src_x, + uint8_t *src_y, + uint8_t *dst, ptrdiff_t stride) +{ + const int16_t filt_const0 = 0xfb01; + const int16_t filt_const1 = 0x1414; + const int16_t filt_const2 = 0x1fb; + uint32_t loop_cnt; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i tmp0, tmp1; + __m256i src_hz0, src_hz1, src_hz2, src_hz3, mask0, mask1, mask2; + __m256i src_vt0, src_vt1, src_vt2, src_vt3, 
src_vt4, src_vt5, src_vt6; + __m256i src_vt7, src_vt8; + __m256i src_vt10_h, src_vt21_h, src_vt32_h, src_vt43_h, src_vt54_h; + __m256i src_vt65_h, src_vt76_h, src_vt87_h, filt0, filt1, filt2; + __m256i hz_out0, hz_out1, hz_out2, hz_out3, vt_out0, vt_out1, vt_out2; + __m256i vt_out3, out0, out1, out2, out3; + __m256i minus5b = __lasx_xvldi(0xFB); + __m256i plus20b = __lasx_xvldi(20); + + filt0 = __lasx_xvreplgr2vr_h(filt_const0); + filt1 = __lasx_xvreplgr2vr_h(filt_const1); + filt2 = __lasx_xvreplgr2vr_h(filt_const2); + + mask0 = __lasx_xvld(luma_mask_arr, 0); + DUP2_ARG2(__lasx_xvld, luma_mask_arr, 32, luma_mask_arr, 64, mask1, mask2); + src_vt0 = __lasx_xvld(src_y, 0); + DUP4_ARG2(__lasx_xvldx, src_y, stride, src_y, stride_2x, src_y, stride_3x, + src_y, stride_4x, src_vt1, src_vt2, src_vt3, src_vt4); + src_y += stride_4x; + + src_vt0 = __lasx_xvxori_b(src_vt0, 128); + DUP4_ARG2(__lasx_xvxori_b, src_vt1, 128, src_vt2, 128, src_vt3, 128, + src_vt4, 128, src_vt1, src_vt2, src_vt3, src_vt4); + + for (loop_cnt = 4; loop_cnt--;) { + src_hz0 = __lasx_xvld(src_x, 0); + DUP2_ARG2(__lasx_xvldx, src_x, stride, src_x, stride_2x, + src_hz1, src_hz2); + src_hz3 = __lasx_xvldx(src_x, stride_3x); + src_x += stride_4x; + src_hz0 = __lasx_xvpermi_d(src_hz0, 0x94); + src_hz1 = __lasx_xvpermi_d(src_hz1, 0x94); + src_hz2 = __lasx_xvpermi_d(src_hz2, 0x94); + src_hz3 = __lasx_xvpermi_d(src_hz3, 0x94); + DUP4_ARG2(__lasx_xvxori_b, src_hz0, 128, src_hz1, 128, src_hz2, 128, + src_hz3, 128, src_hz0, src_hz1, src_hz2, src_hz3); + + hz_out0 = AVC_HORZ_FILTER_SH(src_hz0, src_hz0, mask0, mask1, mask2); + hz_out1 = AVC_HORZ_FILTER_SH(src_hz1, src_hz1, mask0, mask1, mask2); + hz_out2 = AVC_HORZ_FILTER_SH(src_hz2, src_hz2, mask0, mask1, mask2); + hz_out3 = AVC_HORZ_FILTER_SH(src_hz3, src_hz3, mask0, mask1, mask2); + hz_out0 = __lasx_xvssrarni_b_h(hz_out1, hz_out0, 5); + hz_out2 = __lasx_xvssrarni_b_h(hz_out3, hz_out2, 5); + + DUP4_ARG2(__lasx_xvldx, src_y, stride, src_y, stride_2x, + src_y, 
stride_3x, src_y, stride_4x, + src_vt5, src_vt6, src_vt7, src_vt8); + src_y += stride_4x; + + DUP4_ARG2(__lasx_xvxori_b, src_vt5, 128, src_vt6, 128, src_vt7, 128, + src_vt8, 128, src_vt5, src_vt6, src_vt7, src_vt8); + + DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_vt4, 0x02, src_vt1, src_vt5, + 0x02, src_vt2, src_vt6, 0x02, src_vt3, src_vt7, 0x02, + src_vt0, src_vt1, src_vt2, src_vt3); + src_vt87_h = __lasx_xvpermi_q(src_vt4, src_vt8, 0x02); + DUP4_ARG2(__lasx_xvilvh_b, src_vt1, src_vt0, src_vt2, src_vt1, + src_vt3, src_vt2, src_vt87_h, src_vt3, + src_hz0, src_hz1, src_hz2, src_hz3); + DUP4_ARG2(__lasx_xvilvl_b, src_vt1, src_vt0, src_vt2, src_vt1, + src_vt3, src_vt2, src_vt87_h, src_vt3, + src_vt0, src_vt1, src_vt2, src_vt3); + DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_hz0, 0x02, src_vt1, src_hz1, + 0x02, src_vt2, src_hz2, 0x02, src_vt3, src_hz3, 0x02, + src_vt10_h, src_vt21_h, src_vt32_h, src_vt43_h); + DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_hz0, 0x13, src_vt1, src_hz1, + 0x13, src_vt2, src_hz2, 0x13, src_vt3, src_hz3, 0x13, + src_vt54_h, src_vt65_h, src_vt76_h, src_vt87_h); + vt_out0 = AVC_DOT_SH3_SH(src_vt10_h, src_vt32_h, src_vt54_h, filt0, + filt1, filt2); + vt_out1 = AVC_DOT_SH3_SH(src_vt21_h, src_vt43_h, src_vt65_h, filt0, + filt1, filt2); + vt_out2 = AVC_DOT_SH3_SH(src_vt32_h, src_vt54_h, src_vt76_h, filt0, + filt1, filt2); + vt_out3 = AVC_DOT_SH3_SH(src_vt43_h, src_vt65_h, src_vt87_h, filt0, + filt1, filt2); + vt_out0 = __lasx_xvssrarni_b_h(vt_out1, vt_out0, 5); + vt_out2 = __lasx_xvssrarni_b_h(vt_out3, vt_out2, 5); + + DUP2_ARG2(__lasx_xvaddwl_h_b, hz_out0, vt_out0, hz_out2, vt_out2, + out0, out2); + DUP2_ARG2(__lasx_xvaddwh_h_b, hz_out0, vt_out0, hz_out2, vt_out2, + out1, out3); + tmp0 = __lasx_xvssrarni_b_h(out1, out0, 1); + tmp1 = __lasx_xvssrarni_b_h(out3, out2, 1); + + DUP2_ARG2(__lasx_xvxori_b, tmp0, 128, tmp1, 128, tmp0, tmp1); + out0 = __lasx_xvld(dst, 0); + DUP2_ARG2(__lasx_xvldx, dst, stride, dst, stride_2x, out1, out2); + out3 = __lasx_xvldx(dst, 
stride_3x); + out0 = __lasx_xvpermi_q(out0, out2, 0x02); + out1 = __lasx_xvpermi_q(out1, out3, 0x02); + out2 = __lasx_xvilvl_d(out1, out0); + out3 = __lasx_xvilvh_d(out1, out0); + out0 = __lasx_xvpermi_q(out2, out3, 0x02); + out1 = __lasx_xvpermi_q(out2, out3, 0x13); + tmp0 = __lasx_xvavgr_bu(out0, tmp0); + tmp1 = __lasx_xvavgr_bu(out1, tmp1); + + __lasx_xvstelm_d(tmp0, dst, 0, 0); + __lasx_xvstelm_d(tmp0, dst + stride, 0, 1); + __lasx_xvstelm_d(tmp1, dst + stride_2x, 0, 0); + __lasx_xvstelm_d(tmp1, dst + stride_3x, 0, 1); + + __lasx_xvstelm_d(tmp0, dst, 8, 2); + __lasx_xvstelm_d(tmp0, dst + stride, 8, 3); + __lasx_xvstelm_d(tmp1, dst + stride_2x, 8, 2); + __lasx_xvstelm_d(tmp1, dst + stride_3x, 8, 3); + + dst += stride_4x; + src_vt0 = src_vt4; + src_vt1 = src_vt5; + src_vt2 = src_vt6; + src_vt3 = src_vt7; + src_vt4 = src_vt8; + } +} + +static av_always_inline void +avc_luma_hv_qrt_16x16_lasx(uint8_t *src_x, uint8_t *src_y, + uint8_t *dst, ptrdiff_t stride) +{ + const int16_t filt_const0 = 0xfb01; + const int16_t filt_const1 = 0x1414; + const int16_t filt_const2 = 0x1fb; + uint32_t loop_cnt; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + ptrdiff_t stride_4x = stride << 2; + __m256i tmp0, tmp1; + __m256i src_hz0, src_hz1, src_hz2, src_hz3, mask0, mask1, mask2; + __m256i src_vt0, src_vt1, src_vt2, src_vt3, src_vt4, src_vt5, src_vt6; + __m256i src_vt7, src_vt8; + __m256i src_vt10_h, src_vt21_h, src_vt32_h, src_vt43_h, src_vt54_h; + __m256i src_vt65_h, src_vt76_h, src_vt87_h, filt0, filt1, filt2; + __m256i hz_out0, hz_out1, hz_out2, hz_out3, vt_out0, vt_out1, vt_out2; + __m256i vt_out3, out0, out1, out2, out3; + __m256i minus5b = __lasx_xvldi(0xFB); + __m256i plus20b = __lasx_xvldi(20); + + filt0 = __lasx_xvreplgr2vr_h(filt_const0); + filt1 = __lasx_xvreplgr2vr_h(filt_const1); + filt2 = __lasx_xvreplgr2vr_h(filt_const2); + + mask0 = __lasx_xvld(luma_mask_arr, 0); + DUP2_ARG2(__lasx_xvld, luma_mask_arr, 32, luma_mask_arr, 64, mask1, 
mask2); + src_vt0 = __lasx_xvld(src_y, 0); + DUP4_ARG2(__lasx_xvldx, src_y, stride, src_y, stride_2x, src_y, stride_3x, + src_y, stride_4x, src_vt1, src_vt2, src_vt3, src_vt4); + src_y += stride_4x; + + src_vt0 = __lasx_xvxori_b(src_vt0, 128); + DUP4_ARG2(__lasx_xvxori_b, src_vt1, 128, src_vt2, 128, src_vt3, 128, + src_vt4, 128, src_vt1, src_vt2, src_vt3, src_vt4); + + for (loop_cnt = 4; loop_cnt--;) { + src_hz0 = __lasx_xvld(src_x, 0); + DUP2_ARG2(__lasx_xvldx, src_x, stride, src_x, stride_2x, + src_hz1, src_hz2); + src_hz3 = __lasx_xvldx(src_x, stride_3x); + src_x += stride_4x; + src_hz0 = __lasx_xvpermi_d(src_hz0, 0x94); + src_hz1 = __lasx_xvpermi_d(src_hz1, 0x94); + src_hz2 = __lasx_xvpermi_d(src_hz2, 0x94); + src_hz3 = __lasx_xvpermi_d(src_hz3, 0x94); + DUP4_ARG2(__lasx_xvxori_b, src_hz0, 128, src_hz1, 128, src_hz2, 128, + src_hz3, 128, src_hz0, src_hz1, src_hz2, src_hz3); + + hz_out0 = AVC_HORZ_FILTER_SH(src_hz0, src_hz0, mask0, mask1, mask2); + hz_out1 = AVC_HORZ_FILTER_SH(src_hz1, src_hz1, mask0, mask1, mask2); + hz_out2 = AVC_HORZ_FILTER_SH(src_hz2, src_hz2, mask0, mask1, mask2); + hz_out3 = AVC_HORZ_FILTER_SH(src_hz3, src_hz3, mask0, mask1, mask2); + hz_out0 = __lasx_xvssrarni_b_h(hz_out1, hz_out0, 5); + hz_out2 = __lasx_xvssrarni_b_h(hz_out3, hz_out2, 5); + + DUP4_ARG2(__lasx_xvldx, src_y, stride, src_y, stride_2x, + src_y, stride_3x, src_y, stride_4x, + src_vt5, src_vt6, src_vt7, src_vt8); + src_y += stride_4x; + + DUP4_ARG2(__lasx_xvxori_b, src_vt5, 128, src_vt6, 128, src_vt7, 128, + src_vt8, 128, src_vt5, src_vt6, src_vt7, src_vt8); + DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_vt4, 0x02, src_vt1, src_vt5, + 0x02, src_vt2, src_vt6, 0x02, src_vt3, src_vt7, 0x02, + src_vt0, src_vt1, src_vt2, src_vt3); + src_vt87_h = __lasx_xvpermi_q(src_vt4, src_vt8, 0x02); + DUP4_ARG2(__lasx_xvilvh_b, src_vt1, src_vt0, src_vt2, src_vt1, + src_vt3, src_vt2, src_vt87_h, src_vt3, + src_hz0, src_hz1, src_hz2, src_hz3); + DUP4_ARG2(__lasx_xvilvl_b, src_vt1, src_vt0, src_vt2, 
src_vt1, + src_vt3, src_vt2, src_vt87_h, src_vt3, + src_vt0, src_vt1, src_vt2, src_vt3); + DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_hz0, 0x02, src_vt1, + src_hz1, 0x02, src_vt2, src_hz2, 0x02, src_vt3, src_hz3, + 0x02, src_vt10_h, src_vt21_h, src_vt32_h, src_vt43_h); + DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_hz0, 0x13, src_vt1, + src_hz1, 0x13, src_vt2, src_hz2, 0x13, src_vt3, src_hz3, + 0x13, src_vt54_h, src_vt65_h, src_vt76_h, src_vt87_h); + + vt_out0 = AVC_DOT_SH3_SH(src_vt10_h, src_vt32_h, src_vt54_h, + filt0, filt1, filt2); + vt_out1 = AVC_DOT_SH3_SH(src_vt21_h, src_vt43_h, src_vt65_h, + filt0, filt1, filt2); + vt_out2 = AVC_DOT_SH3_SH(src_vt32_h, src_vt54_h, src_vt76_h, + filt0, filt1, filt2); + vt_out3 = AVC_DOT_SH3_SH(src_vt43_h, src_vt65_h, src_vt87_h, + filt0, filt1, filt2); + vt_out0 = __lasx_xvssrarni_b_h(vt_out1, vt_out0, 5); + vt_out2 = __lasx_xvssrarni_b_h(vt_out3, vt_out2, 5); + + DUP2_ARG2(__lasx_xvaddwl_h_b, hz_out0, vt_out0, hz_out2, vt_out2, + out0, out2); + DUP2_ARG2(__lasx_xvaddwh_h_b, hz_out0, vt_out0, hz_out2, vt_out2, + out1, out3); + tmp0 = __lasx_xvssrarni_b_h(out1, out0, 1); + tmp1 = __lasx_xvssrarni_b_h(out3, out2, 1); + + DUP2_ARG2(__lasx_xvxori_b, tmp0, 128, tmp1, 128, tmp0, tmp1); + __lasx_xvstelm_d(tmp0, dst, 0, 0); + __lasx_xvstelm_d(tmp0, dst + stride, 0, 1); + __lasx_xvstelm_d(tmp1, dst + stride_2x, 0, 0); + __lasx_xvstelm_d(tmp1, dst + stride_3x, 0, 1); + + __lasx_xvstelm_d(tmp0, dst, 8, 2); + __lasx_xvstelm_d(tmp0, dst + stride, 8, 3); + __lasx_xvstelm_d(tmp1, dst + stride_2x, 8, 2); + __lasx_xvstelm_d(tmp1, dst + stride_3x, 8, 3); + + dst += stride_4x; + src_vt0 = src_vt4; + src_vt1 = src_vt5; + src_vt2 = src_vt6; + src_vt3 = src_vt7; + src_vt4 = src_vt8; + } +} + +/* put_pixels8_8_inline_asm: dst = src */ +static av_always_inline void +put_pixels8_8_inline_asm(uint8_t *dst, const uint8_t *src, ptrdiff_t stride) +{ + uint64_t tmp[8]; + ptrdiff_t stride_2, stride_3, stride_4; + __asm__ volatile ( + "slli.d %[stride_2], 
%[stride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[stride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "ld.d %[tmp0], %[src], 0x0 \n\t" + "ldx.d %[tmp1], %[src], %[stride] \n\t" + "ldx.d %[tmp2], %[src], %[stride_2] \n\t" + "ldx.d %[tmp3], %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "ld.d %[tmp4], %[src], 0x0 \n\t" + "ldx.d %[tmp5], %[src], %[stride] \n\t" + "ldx.d %[tmp6], %[src], %[stride_2] \n\t" + "ldx.d %[tmp7], %[src], %[stride_3] \n\t" + + "st.d %[tmp0], %[dst], 0x0 \n\t" + "stx.d %[tmp1], %[dst], %[stride] \n\t" + "stx.d %[tmp2], %[dst], %[stride_2] \n\t" + "stx.d %[tmp3], %[dst], %[stride_3] \n\t" + "add.d %[dst], %[dst], %[stride_4] \n\t" + "st.d %[tmp4], %[dst], 0x0 \n\t" + "stx.d %[tmp5], %[dst], %[stride] \n\t" + "stx.d %[tmp6], %[dst], %[stride_2] \n\t" + "stx.d %[tmp7], %[dst], %[stride_3] \n\t" + : [tmp0]"=&r"(tmp[0]), [tmp1]"=&r"(tmp[1]), + [tmp2]"=&r"(tmp[2]), [tmp3]"=&r"(tmp[3]), + [tmp4]"=&r"(tmp[4]), [tmp5]"=&r"(tmp[5]), + [tmp6]"=&r"(tmp[6]), [tmp7]"=&r"(tmp[7]), + [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), + [stride_4]"=&r"(stride_4), + [dst]"+&r"(dst), [src]"+&r"(src) + : [stride]"r"(stride) + : "memory" + ); +} + +/* avg_pixels8_8_lsx : dst = avg(src, dst) + * put_pixels8_l2_8_lsx: dst = avg(src, half) , half stride is 8. 
+ * avg_pixels8_l2_8_lsx: dst = avg(avg(src, half), dst) , half stride is 8.*/ +static av_always_inline void +avg_pixels8_8_lsx(uint8_t *dst, const uint8_t *src, ptrdiff_t stride) +{ + uint8_t *tmp = dst; + ptrdiff_t stride_2, stride_3, stride_4; + __asm__ volatile ( + /* h0~h7 */ + "slli.d %[stride_2], %[stride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[stride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[stride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[stride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + + "vld $vr8, %[tmp], 0 \n\t" + "vldx $vr9, %[tmp], %[stride] \n\t" + "vldx $vr10, %[tmp], %[stride_2] \n\t" + "vldx $vr11, %[tmp], %[stride_3] \n\t" + "add.d %[tmp], %[tmp], %[stride_4] \n\t" + "vld $vr12, %[tmp], 0 \n\t" + "vldx $vr13, %[tmp], %[stride] \n\t" + "vldx $vr14, %[tmp], %[stride_2] \n\t" + "vldx $vr15, %[tmp], %[stride_3] \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vstelm.d $vr0, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[stride] \n\t" + "vstelm.d $vr1, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[stride] \n\t" + "vstelm.d $vr2, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[stride] \n\t" + "vstelm.d $vr3, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[stride] \n\t" + "vstelm.d $vr4, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[stride] \n\t" + "vstelm.d $vr5, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[stride] \n\t" + "vstelm.d $vr6, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[stride] \n\t" + "vstelm.d $vr7, %[dst], 0, 0 \n\t" + : [dst]"+&r"(dst), 
[tmp]"+&r"(tmp), [src]"+&r"(src), + [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), + [stride_4]"=&r"(stride_4) + : [stride]"r"(stride) + : "memory" + ); +} + +/* avg_pixels8_8_lsx : dst = avg(src, dst) + * put_pixels8_l2_8_lsx: dst = avg(src, half) , half stride is 8. + * avg_pixels8_l2_8_lsx: dst = avg(avg(src, half), dst) , half stride is 8.*/ +static av_always_inline void +put_pixels8_l2_8_lsx(uint8_t *dst, const uint8_t *src, const uint8_t *half, + ptrdiff_t dstStride, ptrdiff_t srcStride) +{ + ptrdiff_t stride_2, stride_3, stride_4; + __asm__ volatile ( + /* h0~h7 */ + "slli.d %[stride_2], %[srcStride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[srcStride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[srcStride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[srcStride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + + "vld $vr8, %[half], 0x00 \n\t" + "vld $vr9, %[half], 0x08 \n\t" + "vld $vr10, %[half], 0x10 \n\t" + "vld $vr11, %[half], 0x18 \n\t" + "vld $vr12, %[half], 0x20 \n\t" + "vld $vr13, %[half], 0x28 \n\t" + "vld $vr14, %[half], 0x30 \n\t" + "vld $vr15, %[half], 0x38 \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vstelm.d $vr0, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr1, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr2, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr3, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr4, %[dst], 0, 0 \n\t" + 
"add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr5, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr6, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr7, %[dst], 0, 0 \n\t" + : [dst]"+&r"(dst), [half]"+&r"(half), [src]"+&r"(src), + [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), + [stride_4]"=&r"(stride_4) + : [srcStride]"r"(srcStride), [dstStride]"r"(dstStride) + : "memory" + ); +} + +/* avg_pixels8_8_lsx : dst = avg(src, dst) + * put_pixels8_l2_8_lsx: dst = avg(src, half) , half stride is 8. + * avg_pixels8_l2_8_lsx: dst = avg(avg(src, half), dst) , half stride is 8.*/ +static av_always_inline void +avg_pixels8_l2_8_lsx(uint8_t *dst, const uint8_t *src, const uint8_t *half, + ptrdiff_t dstStride, ptrdiff_t srcStride) +{ + uint8_t *tmp = dst; + ptrdiff_t stride_2, stride_3, stride_4; + __asm__ volatile ( + /* h0~h7 */ + "slli.d %[stride_2], %[srcStride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[srcStride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[srcStride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[srcStride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + + "vld $vr8, %[half], 0x00 \n\t" + "vld $vr9, %[half], 0x08 \n\t" + "vld $vr10, %[half], 0x10 \n\t" + "vld $vr11, %[half], 0x18 \n\t" + "vld $vr12, %[half], 0x20 \n\t" + "vld $vr13, %[half], 0x28 \n\t" + "vld $vr14, %[half], 0x30 \n\t" + "vld $vr15, %[half], 0x38 \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "slli.d %[stride_2], %[dstStride], 1 \n\t" + 
"add.d %[stride_3], %[stride_2], %[dstStride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "vld $vr8, %[tmp], 0 \n\t" + "vldx $vr9, %[tmp], %[dstStride] \n\t" + "vldx $vr10, %[tmp], %[stride_2] \n\t" + "vldx $vr11, %[tmp], %[stride_3] \n\t" + "add.d %[tmp], %[tmp], %[stride_4] \n\t" + "vld $vr12, %[tmp], 0 \n\t" + "vldx $vr13, %[tmp], %[dstStride] \n\t" + "vldx $vr14, %[tmp], %[stride_2] \n\t" + "vldx $vr15, %[tmp], %[stride_3] \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vstelm.d $vr0, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr1, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr2, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr3, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr4, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr5, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr6, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr7, %[dst], 0, 0 \n\t" + : [dst]"+&r"(dst), [tmp]"+&r"(tmp), [half]"+&r"(half), + [src]"+&r"(src), [stride_2]"=&r"(stride_2), + [stride_3]"=&r"(stride_3), [stride_4]"=&r"(stride_4) + : [dstStride]"r"(dstStride), [srcStride]"r"(srcStride) + : "memory" + ); +} + +/* put_pixels16_8_lsx: dst = src */ +static av_always_inline void +put_pixels16_8_lsx(uint8_t *dst, const uint8_t *src, ptrdiff_t stride) +{ + ptrdiff_t stride_2, stride_3, stride_4; + __asm__ volatile ( + "slli.d %[stride_2], %[stride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[stride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[stride] \n\t" + "vldx $vr2, 
%[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[stride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + + "vst $vr0, %[dst], 0 \n\t" + "vstx $vr1, %[dst], %[stride] \n\t" + "vstx $vr2, %[dst], %[stride_2] \n\t" + "vstx $vr3, %[dst], %[stride_3] \n\t" + "add.d %[dst], %[dst], %[stride_4] \n\t" + "vst $vr4, %[dst], 0 \n\t" + "vstx $vr5, %[dst], %[stride] \n\t" + "vstx $vr6, %[dst], %[stride_2] \n\t" + "vstx $vr7, %[dst], %[stride_3] \n\t" + "add.d %[dst], %[dst], %[stride_4] \n\t" + + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[stride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[stride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + + "vst $vr0, %[dst], 0 \n\t" + "vstx $vr1, %[dst], %[stride] \n\t" + "vstx $vr2, %[dst], %[stride_2] \n\t" + "vstx $vr3, %[dst], %[stride_3] \n\t" + "add.d %[dst], %[dst], %[stride_4] \n\t" + "vst $vr4, %[dst], 0 \n\t" + "vstx $vr5, %[dst], %[stride] \n\t" + "vstx $vr6, %[dst], %[stride_2] \n\t" + "vstx $vr7, %[dst], %[stride_3] \n\t" + : [dst]"+&r"(dst), [src]"+&r"(src), + [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), + [stride_4]"=&r"(stride_4) + : [stride]"r"(stride) + : "memory" + ); +} + +/* avg_pixels16_8_lsx : dst = avg(src, dst) + * put_pixels16_l2_8_lsx: dst = avg(src, half) , half stride is 16. 
+ * avg_pixels16_l2_8_lsx: dst = avg(avg(src, half), dst) , half stride is 16.*/ +static av_always_inline void +avg_pixels16_8_lsx(uint8_t *dst, const uint8_t *src, ptrdiff_t stride) +{ + uint8_t *tmp = dst; + ptrdiff_t stride_2, stride_3, stride_4; + __asm__ volatile ( + /* h0~h7 */ + "slli.d %[stride_2], %[stride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[stride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[stride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[stride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + + "vld $vr8, %[tmp], 0 \n\t" + "vldx $vr9, %[tmp], %[stride] \n\t" + "vldx $vr10, %[tmp], %[stride_2] \n\t" + "vldx $vr11, %[tmp], %[stride_3] \n\t" + "add.d %[tmp], %[tmp], %[stride_4] \n\t" + "vld $vr12, %[tmp], 0 \n\t" + "vldx $vr13, %[tmp], %[stride] \n\t" + "vldx $vr14, %[tmp], %[stride_2] \n\t" + "vldx $vr15, %[tmp], %[stride_3] \n\t" + "add.d %[tmp], %[tmp], %[stride_4] \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vst $vr0, %[dst], 0 \n\t" + "vstx $vr1, %[dst], %[stride] \n\t" + "vstx $vr2, %[dst], %[stride_2] \n\t" + "vstx $vr3, %[dst], %[stride_3] \n\t" + "add.d %[dst], %[dst], %[stride_4] \n\t" + "vst $vr4, %[dst], 0 \n\t" + "vstx $vr5, %[dst], %[stride] \n\t" + "vstx $vr6, %[dst], %[stride_2] \n\t" + "vstx $vr7, %[dst], %[stride_3] \n\t" + "add.d %[dst], %[dst], %[stride_4] \n\t" + + /* h8~h15 */ + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[stride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, 
%[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[stride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + + "vld $vr8, %[tmp], 0 \n\t" + "vldx $vr9, %[tmp], %[stride] \n\t" + "vldx $vr10, %[tmp], %[stride_2] \n\t" + "vldx $vr11, %[tmp], %[stride_3] \n\t" + "add.d %[tmp], %[tmp], %[stride_4] \n\t" + "vld $vr12, %[tmp], 0 \n\t" + "vldx $vr13, %[tmp], %[stride] \n\t" + "vldx $vr14, %[tmp], %[stride_2] \n\t" + "vldx $vr15, %[tmp], %[stride_3] \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vst $vr0, %[dst], 0 \n\t" + "vstx $vr1, %[dst], %[stride] \n\t" + "vstx $vr2, %[dst], %[stride_2] \n\t" + "vstx $vr3, %[dst], %[stride_3] \n\t" + "add.d %[dst], %[dst], %[stride_4] \n\t" + "vst $vr4, %[dst], 0 \n\t" + "vstx $vr5, %[dst], %[stride] \n\t" + "vstx $vr6, %[dst], %[stride_2] \n\t" + "vstx $vr7, %[dst], %[stride_3] \n\t" + : [dst]"+&r"(dst), [tmp]"+&r"(tmp), [src]"+&r"(src), + [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), + [stride_4]"=&r"(stride_4) + : [stride]"r"(stride) + : "memory" + ); +} + +/* avg_pixels16_8_lsx : dst = avg(src, dst) + * put_pixels16_l2_8_lsx: dst = avg(src, half) , half stride is 16. 
+ * avg_pixels16_l2_8_lsx: dst = avg(avg(src, half), dst) , half stride is 16.*/ +static av_always_inline void +put_pixels16_l2_8_lsx(uint8_t *dst, const uint8_t *src, uint8_t *half, + ptrdiff_t dstStride, ptrdiff_t srcStride) +{ + ptrdiff_t stride_2, stride_3, stride_4; + ptrdiff_t dstride_2, dstride_3, dstride_4; + __asm__ volatile ( + "slli.d %[stride_2], %[srcStride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[srcStride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "slli.d %[dstride_2], %[dstStride], 1 \n\t" + "add.d %[dstride_3], %[dstride_2], %[dstStride] \n\t" + "slli.d %[dstride_4], %[dstride_2], 1 \n\t" + /* h0~h7 */ + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[srcStride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[srcStride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + + "vld $vr8, %[half], 0x00 \n\t" + "vld $vr9, %[half], 0x10 \n\t" + "vld $vr10, %[half], 0x20 \n\t" + "vld $vr11, %[half], 0x30 \n\t" + "vld $vr12, %[half], 0x40 \n\t" + "vld $vr13, %[half], 0x50 \n\t" + "vld $vr14, %[half], 0x60 \n\t" + "vld $vr15, %[half], 0x70 \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vst $vr0, %[dst], 0 \n\t" + "vstx $vr1, %[dst], %[dstStride] \n\t" + "vstx $vr2, %[dst], %[dstride_2] \n\t" + "vstx $vr3, %[dst], %[dstride_3] \n\t" + "add.d %[dst], %[dst], %[dstride_4] \n\t" + "vst $vr4, %[dst], 0 \n\t" + "vstx $vr5, %[dst], %[dstStride] \n\t" + "vstx $vr6, %[dst], %[dstride_2] \n\t" + "vstx $vr7, %[dst], %[dstride_3] \n\t" + "add.d %[dst], %[dst], %[dstride_4] \n\t" + + /* h8~h15 */ 
+ "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[srcStride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[srcStride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + + "vld $vr8, %[half], 0x80 \n\t" + "vld $vr9, %[half], 0x90 \n\t" + "vld $vr10, %[half], 0xa0 \n\t" + "vld $vr11, %[half], 0xb0 \n\t" + "vld $vr12, %[half], 0xc0 \n\t" + "vld $vr13, %[half], 0xd0 \n\t" + "vld $vr14, %[half], 0xe0 \n\t" + "vld $vr15, %[half], 0xf0 \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vst $vr0, %[dst], 0 \n\t" + "vstx $vr1, %[dst], %[dstStride] \n\t" + "vstx $vr2, %[dst], %[dstride_2] \n\t" + "vstx $vr3, %[dst], %[dstride_3] \n\t" + "add.d %[dst], %[dst], %[dstride_4] \n\t" + "vst $vr4, %[dst], 0 \n\t" + "vstx $vr5, %[dst], %[dstStride] \n\t" + "vstx $vr6, %[dst], %[dstride_2] \n\t" + "vstx $vr7, %[dst], %[dstride_3] \n\t" + : [dst]"+&r"(dst), [half]"+&r"(half), [src]"+&r"(src), + [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), + [stride_4]"=&r"(stride_4), [dstride_2]"=&r"(dstride_2), + [dstride_3]"=&r"(dstride_3), [dstride_4]"=&r"(dstride_4) + : [dstStride]"r"(dstStride), [srcStride]"r"(srcStride) + : "memory" + ); +} + +/* avg_pixels16_8_lsx : dst = avg(src, dst) + * put_pixels16_l2_8_lsx: dst = avg(src, half) , half stride is 16. 
+ * avg_pixels16_l2_8_lsx: dst = avg(avg(src, half), dst) , half stride is 16.*/ +static av_always_inline void +avg_pixels16_l2_8_lsx(uint8_t *dst, const uint8_t *src, uint8_t *half, + ptrdiff_t dstStride, ptrdiff_t srcStride) +{ + uint8_t *tmp = dst; + ptrdiff_t stride_2, stride_3, stride_4; + ptrdiff_t dstride_2, dstride_3, dstride_4; + __asm__ volatile ( + "slli.d %[stride_2], %[srcStride], 1 \n\t" + "add.d %[stride_3], %[stride_2], %[srcStride] \n\t" + "slli.d %[stride_4], %[stride_2], 1 \n\t" + "slli.d %[dstride_2], %[dstStride], 1 \n\t" + "add.d %[dstride_3], %[dstride_2], %[dstStride] \n\t" + "slli.d %[dstride_4], %[dstride_2], 1 \n\t" + /* h0~h7 */ + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[srcStride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[srcStride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + + "vld $vr8, %[half], 0x00 \n\t" + "vld $vr9, %[half], 0x10 \n\t" + "vld $vr10, %[half], 0x20 \n\t" + "vld $vr11, %[half], 0x30 \n\t" + "vld $vr12, %[half], 0x40 \n\t" + "vld $vr13, %[half], 0x50 \n\t" + "vld $vr14, %[half], 0x60 \n\t" + "vld $vr15, %[half], 0x70 \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vld $vr8, %[tmp], 0 \n\t" + "vldx $vr9, %[tmp], %[dstStride] \n\t" + "vldx $vr10, %[tmp], %[dstride_2] \n\t" + "vldx $vr11, %[tmp], %[dstride_3] \n\t" + "add.d %[tmp], %[tmp], %[dstride_4] \n\t" + "vld $vr12, %[tmp], 0 \n\t" + "vldx $vr13, %[tmp], %[dstStride] \n\t" + "vldx $vr14, %[tmp], %[dstride_2] \n\t" + "vldx $vr15, %[tmp], %[dstride_3] \n\t" + "add.d %[tmp], %[tmp], 
%[dstride_4] \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vst $vr0, %[dst], 0 \n\t" + "vstx $vr1, %[dst], %[dstStride] \n\t" + "vstx $vr2, %[dst], %[dstride_2] \n\t" + "vstx $vr3, %[dst], %[dstride_3] \n\t" + "add.d %[dst], %[dst], %[dstride_4] \n\t" + "vst $vr4, %[dst], 0 \n\t" + "vstx $vr5, %[dst], %[dstStride] \n\t" + "vstx $vr6, %[dst], %[dstride_2] \n\t" + "vstx $vr7, %[dst], %[dstride_3] \n\t" + "add.d %[dst], %[dst], %[dstride_4] \n\t" + + /* h8~h15 */ + "vld $vr0, %[src], 0 \n\t" + "vldx $vr1, %[src], %[srcStride] \n\t" + "vldx $vr2, %[src], %[stride_2] \n\t" + "vldx $vr3, %[src], %[stride_3] \n\t" + "add.d %[src], %[src], %[stride_4] \n\t" + "vld $vr4, %[src], 0 \n\t" + "vldx $vr5, %[src], %[srcStride] \n\t" + "vldx $vr6, %[src], %[stride_2] \n\t" + "vldx $vr7, %[src], %[stride_3] \n\t" + + "vld $vr8, %[half], 0x80 \n\t" + "vld $vr9, %[half], 0x90 \n\t" + "vld $vr10, %[half], 0xa0 \n\t" + "vld $vr11, %[half], 0xb0 \n\t" + "vld $vr12, %[half], 0xc0 \n\t" + "vld $vr13, %[half], 0xd0 \n\t" + "vld $vr14, %[half], 0xe0 \n\t" + "vld $vr15, %[half], 0xf0 \n\t" + + "vavgr.bu $vr0, $vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vld $vr8, %[tmp], 0 \n\t" + "vldx $vr9, %[tmp], %[dstStride] \n\t" + "vldx $vr10, %[tmp], %[dstride_2] \n\t" + "vldx $vr11, %[tmp], %[dstride_3] \n\t" + "add.d %[tmp], %[tmp], %[dstride_4] \n\t" + "vld $vr12, %[tmp], 0 \n\t" + "vldx $vr13, %[tmp], %[dstStride] \n\t" + "vldx $vr14, %[tmp], %[dstride_2] \n\t" + "vldx $vr15, %[tmp], %[dstride_3] \n\t" + + "vavgr.bu $vr0, 
$vr8, $vr0 \n\t" + "vavgr.bu $vr1, $vr9, $vr1 \n\t" + "vavgr.bu $vr2, $vr10, $vr2 \n\t" + "vavgr.bu $vr3, $vr11, $vr3 \n\t" + "vavgr.bu $vr4, $vr12, $vr4 \n\t" + "vavgr.bu $vr5, $vr13, $vr5 \n\t" + "vavgr.bu $vr6, $vr14, $vr6 \n\t" + "vavgr.bu $vr7, $vr15, $vr7 \n\t" + + "vst $vr0, %[dst], 0 \n\t" + "vstx $vr1, %[dst], %[dstStride] \n\t" + "vstx $vr2, %[dst], %[dstride_2] \n\t" + "vstx $vr3, %[dst], %[dstride_3] \n\t" + "add.d %[dst], %[dst], %[dstride_4] \n\t" + "vst $vr4, %[dst], 0 \n\t" + "vstx $vr5, %[dst], %[dstStride] \n\t" + "vstx $vr6, %[dst], %[dstride_2] \n\t" + "vstx $vr7, %[dst], %[dstride_3] \n\t" + : [dst]"+&r"(dst), [tmp]"+&r"(tmp), [half]"+&r"(half), [src]"+&r"(src), + [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), + [stride_4]"=&r"(stride_4), [dstride_2]"=&r"(dstride_2), + [dstride_3]"=&r"(dstride_3), [dstride_4]"=&r"(dstride_4) + : [dstStride]"r"(dstStride), [srcStride]"r"(srcStride) + : "memory" + ); +} + +#define QPEL8_H_LOWPASS(out_v) \ + src00 = __lasx_xvld(src, - 2); \ + src += srcStride; \ + src10 = __lasx_xvld(src, - 2); \ + src += srcStride; \ + src00 = __lasx_xvpermi_q(src00, src10, 0x02); \ + src01 = __lasx_xvshuf_b(src00, src00, (__m256i)mask1); \ + src02 = __lasx_xvshuf_b(src00, src00, (__m256i)mask2); \ + src03 = __lasx_xvshuf_b(src00, src00, (__m256i)mask3); \ + src04 = __lasx_xvshuf_b(src00, src00, (__m256i)mask4); \ + src05 = __lasx_xvshuf_b(src00, src00, (__m256i)mask5); \ + DUP2_ARG2(__lasx_xvaddwl_h_bu, src02, src03, src01, src04, src02, src01);\ + src00 = __lasx_xvaddwl_h_bu(src00, src05); \ + src02 = __lasx_xvmul_h(src02, h_20); \ + src01 = __lasx_xvmul_h(src01, h_5); \ + src02 = __lasx_xvssub_h(src02, src01); \ + src02 = __lasx_xvsadd_h(src02, src00); \ + src02 = __lasx_xvsadd_h(src02, h_16); \ + out_v = __lasx_xvssrani_bu_h(src02, src02, 5); \ + +static av_always_inline void +put_h264_qpel8_h_lowpass_lasx(uint8_t *dst, const uint8_t *src, int dstStride, + int srcStride) +{ + int dstStride_2x = dstStride << 1; + 
__m256i src00, src01, src02, src03, src04, src05, src10; + __m256i out0, out1, out2, out3; + __m256i h_20 = __lasx_xvldi(0x414); + __m256i h_5 = __lasx_xvldi(0x405); + __m256i h_16 = __lasx_xvldi(0x410); + __m256i mask1 = {0x0807060504030201, 0x0, 0x0807060504030201, 0x0}; + __m256i mask2 = {0x0908070605040302, 0x0, 0x0908070605040302, 0x0}; + __m256i mask3 = {0x0a09080706050403, 0x0, 0x0a09080706050403, 0x0}; + __m256i mask4 = {0x0b0a090807060504, 0x0, 0x0b0a090807060504, 0x0}; + __m256i mask5 = {0x0c0b0a0908070605, 0x0, 0x0c0b0a0908070605, 0x0}; + + QPEL8_H_LOWPASS(out0) + QPEL8_H_LOWPASS(out1) + QPEL8_H_LOWPASS(out2) + QPEL8_H_LOWPASS(out3) + __lasx_xvstelm_d(out0, dst, 0, 0); + __lasx_xvstelm_d(out0, dst + dstStride, 0, 2); + dst += dstStride_2x; + __lasx_xvstelm_d(out1, dst, 0, 0); + __lasx_xvstelm_d(out1, dst + dstStride, 0, 2); + dst += dstStride_2x; + __lasx_xvstelm_d(out2, dst, 0, 0); + __lasx_xvstelm_d(out2, dst + dstStride, 0, 2); + dst += dstStride_2x; + __lasx_xvstelm_d(out3, dst, 0, 0); + __lasx_xvstelm_d(out3, dst + dstStride, 0, 2); +} + +#define QPEL8_V_LOWPASS(src0, src1, src2, src3, src4, src5, src6, \ + tmp0, tmp1, tmp2, tmp3, tmp4, tmp5) \ +{ \ + tmp0 = __lasx_xvpermi_q(src0, src1, 0x02); \ + tmp1 = __lasx_xvpermi_q(src1, src2, 0x02); \ + tmp2 = __lasx_xvpermi_q(src2, src3, 0x02); \ + tmp3 = __lasx_xvpermi_q(src3, src4, 0x02); \ + tmp4 = __lasx_xvpermi_q(src4, src5, 0x02); \ + tmp5 = __lasx_xvpermi_q(src5, src6, 0x02); \ + DUP2_ARG2(__lasx_xvaddwl_h_bu, tmp2, tmp3, tmp1, tmp4, tmp2, tmp1); \ + tmp0 = __lasx_xvaddwl_h_bu(tmp0, tmp5); \ + tmp2 = __lasx_xvmul_h(tmp2, h_20); \ + tmp1 = __lasx_xvmul_h(tmp1, h_5); \ + tmp2 = __lasx_xvssub_h(tmp2, tmp1); \ + tmp2 = __lasx_xvsadd_h(tmp2, tmp0); \ + tmp2 = __lasx_xvsadd_h(tmp2, h_16); \ + tmp2 = __lasx_xvssrani_bu_h(tmp2, tmp2, 5); \ +} + +static av_always_inline void +put_h264_qpel8_v_lowpass_lasx(uint8_t *dst, uint8_t *src, int dstStride, + int srcStride) +{ + int srcStride_2x = srcStride << 1; + int 
dstStride_2x = dstStride << 1; + int srcStride_4x = srcStride << 2; + int srcStride_3x = srcStride_2x + srcStride; + __m256i src00, src01, src02, src03, src04, src05, src06; + __m256i src07, src08, src09, src10, src11, src12; + __m256i tmp00, tmp01, tmp02, tmp03, tmp04, tmp05; + __m256i h_20 = __lasx_xvldi(0x414); + __m256i h_5 = __lasx_xvldi(0x405); + __m256i h_16 = __lasx_xvldi(0x410); + + DUP2_ARG2(__lasx_xvld, src - srcStride_2x, 0, src - srcStride, 0, + src00, src01); + src02 = __lasx_xvld(src, 0); + DUP4_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src, + srcStride_3x, src, srcStride_4x, src03, src04, src05, src06); + src += srcStride_4x; + DUP4_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src, + srcStride_3x, src, srcStride_4x, src07, src08, src09, src10); + src += srcStride_4x; + DUP2_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src11, src12); + + QPEL8_V_LOWPASS(src00, src01, src02, src03, src04, src05, src06, + tmp00, tmp01, tmp02, tmp03, tmp04, tmp05); + __lasx_xvstelm_d(tmp02, dst, 0, 0); + __lasx_xvstelm_d(tmp02, dst + dstStride, 0, 2); + dst += dstStride_2x; + QPEL8_V_LOWPASS(src02, src03, src04, src05, src06, src07, src08, + tmp00, tmp01, tmp02, tmp03, tmp04, tmp05); + __lasx_xvstelm_d(tmp02, dst, 0, 0); + __lasx_xvstelm_d(tmp02, dst + dstStride, 0, 2); + dst += dstStride_2x; + QPEL8_V_LOWPASS(src04, src05, src06, src07, src08, src09, src10, + tmp00, tmp01, tmp02, tmp03, tmp04, tmp05); + __lasx_xvstelm_d(tmp02, dst, 0, 0); + __lasx_xvstelm_d(tmp02, dst + dstStride, 0, 2); + dst += dstStride_2x; + QPEL8_V_LOWPASS(src06, src07, src08, src09, src10, src11, src12, + tmp00, tmp01, tmp02, tmp03, tmp04, tmp05); + __lasx_xvstelm_d(tmp02, dst, 0, 0); + __lasx_xvstelm_d(tmp02, dst + dstStride, 0, 2); +} + +static av_always_inline void +avg_h264_qpel8_v_lowpass_lasx(uint8_t *dst, uint8_t *src, int dstStride, + int srcStride) +{ + int srcStride_2x = srcStride << 1; + int srcStride_4x = srcStride << 2; + int dstStride_2x = dstStride << 
1; + int dstStride_4x = dstStride << 2; + int srcStride_3x = srcStride_2x + srcStride; + int dstStride_3x = dstStride_2x + dstStride; + __m256i src00, src01, src02, src03, src04, src05, src06; + __m256i src07, src08, src09, src10, src11, src12, tmp00; + __m256i tmp01, tmp02, tmp03, tmp04, tmp05, tmp06, tmp07, tmp08, tmp09; + __m256i h_20 = __lasx_xvldi(0x414); + __m256i h_5 = __lasx_xvldi(0x405); + __m256i h_16 = __lasx_xvldi(0x410); + + + DUP2_ARG2(__lasx_xvld, src - srcStride_2x, 0, src - srcStride, 0, + src00, src01); + src02 = __lasx_xvld(src, 0); + DUP4_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src, + srcStride_3x, src, srcStride_4x, src03, src04, src05, src06); + src += srcStride_4x; + DUP4_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src, + srcStride_3x, src, srcStride_4x, src07, src08, src09, src10); + src += srcStride_4x; + DUP2_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src11, src12); + + tmp06 = __lasx_xvld(dst, 0); + DUP4_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, + dst, dstStride_3x, dst, dstStride_4x, + tmp07, tmp02, tmp03, tmp04); + dst += dstStride_4x; + DUP2_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, + tmp05, tmp00); + tmp01 = __lasx_xvldx(dst, dstStride_3x); + dst -= dstStride_4x; + + tmp06 = __lasx_xvpermi_q(tmp06, tmp07, 0x02); + tmp07 = __lasx_xvpermi_q(tmp02, tmp03, 0x02); + tmp08 = __lasx_xvpermi_q(tmp04, tmp05, 0x02); + tmp09 = __lasx_xvpermi_q(tmp00, tmp01, 0x02); + + QPEL8_V_LOWPASS(src00, src01, src02, src03, src04, src05, src06, + tmp00, tmp01, tmp02, tmp03, tmp04, tmp05); + tmp06 = __lasx_xvavgr_bu(tmp06, tmp02); + __lasx_xvstelm_d(tmp06, dst, 0, 0); + __lasx_xvstelm_d(tmp06, dst + dstStride, 0, 2); + dst += dstStride_2x; + QPEL8_V_LOWPASS(src02, src03, src04, src05, src06, src07, src08, + tmp00, tmp01, tmp02, tmp03, tmp04, tmp05); + tmp07 = __lasx_xvavgr_bu(tmp07, tmp02); + __lasx_xvstelm_d(tmp07, dst, 0, 0); + __lasx_xvstelm_d(tmp07, dst + dstStride, 0, 2); + dst += dstStride_2x; + 
QPEL8_V_LOWPASS(src04, src05, src06, src07, src08, src09, src10, + tmp00, tmp01, tmp02, tmp03, tmp04, tmp05); + tmp08 = __lasx_xvavgr_bu(tmp08, tmp02); + __lasx_xvstelm_d(tmp08, dst, 0, 0); + __lasx_xvstelm_d(tmp08, dst + dstStride, 0, 2); + dst += dstStride_2x; + QPEL8_V_LOWPASS(src06, src07, src08, src09, src10, src11, src12, + tmp00, tmp01, tmp02, tmp03, tmp04, tmp05); + tmp09 = __lasx_xvavgr_bu(tmp09, tmp02); + __lasx_xvstelm_d(tmp09, dst, 0, 0); + __lasx_xvstelm_d(tmp09, dst + dstStride, 0, 2); +} + +#define QPEL8_HV_LOWPASS_H(tmp) \ +{ \ + src00 = __lasx_xvld(src, -2); \ + src += srcStride; \ + src10 = __lasx_xvld(src, -2); \ + src += srcStride; \ + src00 = __lasx_xvpermi_q(src00, src10, 0x02); \ + src01 = __lasx_xvshuf_b(src00, src00, (__m256i)mask1); \ + src02 = __lasx_xvshuf_b(src00, src00, (__m256i)mask2); \ + src03 = __lasx_xvshuf_b(src00, src00, (__m256i)mask3); \ + src04 = __lasx_xvshuf_b(src00, src00, (__m256i)mask4); \ + src05 = __lasx_xvshuf_b(src00, src00, (__m256i)mask5); \ + DUP2_ARG2(__lasx_xvaddwl_h_bu, src02, src03, src01, src04, src02, src01);\ + src00 = __lasx_xvaddwl_h_bu(src00, src05); \ + src02 = __lasx_xvmul_h(src02, h_20); \ + src01 = __lasx_xvmul_h(src01, h_5); \ + src02 = __lasx_xvssub_h(src02, src01); \ + tmp = __lasx_xvsadd_h(src02, src00); \ +} + +#define QPEL8_HV_LOWPASS_V(src0, src1, src2, src3, \ + src4, src5, temp0, temp1, \ + temp2, temp3, temp4, temp5, \ + out) \ +{ \ + DUP2_ARG2(__lasx_xvaddwl_w_h, src2, src3, src1, src4, temp0, temp2); \ + DUP2_ARG2(__lasx_xvaddwh_w_h, src2, src3, src1, src4, temp1, temp3); \ + temp4 = __lasx_xvaddwl_w_h(src0, src5); \ + temp5 = __lasx_xvaddwh_w_h(src0, src5); \ + temp0 = __lasx_xvmul_w(temp0, w_20); \ + temp1 = __lasx_xvmul_w(temp1, w_20); \ + temp2 = __lasx_xvmul_w(temp2, w_5); \ + temp3 = __lasx_xvmul_w(temp3, w_5); \ + temp0 = __lasx_xvssub_w(temp0, temp2); \ + temp1 = __lasx_xvssub_w(temp1, temp3); \ + temp0 = __lasx_xvsadd_w(temp0, temp4); \ + temp1 = __lasx_xvsadd_w(temp1, temp5); \ 
+ temp0 = __lasx_xvsadd_w(temp0, w_512); \ + temp1 = __lasx_xvsadd_w(temp1, w_512); \ + temp0 = __lasx_xvssrani_hu_w(temp0, temp0, 10); \ + temp1 = __lasx_xvssrani_hu_w(temp1, temp1, 10); \ + temp0 = __lasx_xvpackev_d(temp1, temp0); \ + out = __lasx_xvssrani_bu_h(temp0, temp0, 0); \ +} + +static av_always_inline void +put_h264_qpel8_hv_lowpass_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dstStride, ptrdiff_t srcStride) +{ + __m256i src00, src01, src02, src03, src04, src05, src10; + __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6; + __m256i tmp7, tmp8, tmp9, tmp10, tmp11, tmp12; + __m256i h_20 = __lasx_xvldi(0x414); + __m256i h_5 = __lasx_xvldi(0x405); + __m256i w_20 = __lasx_xvldi(0x814); + __m256i w_5 = __lasx_xvldi(0x805); + __m256i w_512 = {512}; + __m256i mask1 = {0x0807060504030201, 0x0, 0x0807060504030201, 0x0}; + __m256i mask2 = {0x0908070605040302, 0x0, 0x0908070605040302, 0x0}; + __m256i mask3 = {0x0a09080706050403, 0x0, 0x0a09080706050403, 0x0}; + __m256i mask4 = {0x0b0a090807060504, 0x0, 0x0b0a090807060504, 0x0}; + __m256i mask5 = {0x0c0b0a0908070605, 0x0, 0x0c0b0a0908070605, 0x0}; + + w_512 = __lasx_xvreplve0_w(w_512); + + src -= srcStride << 1; + QPEL8_HV_LOWPASS_H(tmp0) + QPEL8_HV_LOWPASS_H(tmp2) + QPEL8_HV_LOWPASS_H(tmp4) + QPEL8_HV_LOWPASS_H(tmp6) + QPEL8_HV_LOWPASS_H(tmp8) + QPEL8_HV_LOWPASS_H(tmp10) + QPEL8_HV_LOWPASS_H(tmp12) + tmp11 = __lasx_xvpermi_q(tmp12, tmp10, 0x21); + tmp9 = __lasx_xvpermi_q(tmp10, tmp8, 0x21); + tmp7 = __lasx_xvpermi_q(tmp8, tmp6, 0x21); + tmp5 = __lasx_xvpermi_q(tmp6, tmp4, 0x21); + tmp3 = __lasx_xvpermi_q(tmp4, tmp2, 0x21); + tmp1 = __lasx_xvpermi_q(tmp2, tmp0, 0x21); + + QPEL8_HV_LOWPASS_V(tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, src00, src01, + src02, src03, src04, src05, tmp0) + QPEL8_HV_LOWPASS_V(tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, src00, src01, + src02, src03, src04, src05, tmp2) + QPEL8_HV_LOWPASS_V(tmp4, tmp5, tmp6, tmp7, tmp8, tmp9, src00, src01, + src02, src03, src04, src05, tmp4) + QPEL8_HV_LOWPASS_V(tmp6, 
tmp7, tmp8, tmp9, tmp10, tmp11, src00, src01, + src02, src03, src04, src05, tmp6) + __lasx_xvstelm_d(tmp0, dst, 0, 0); + dst += dstStride; + __lasx_xvstelm_d(tmp0, dst, 0, 2); + dst += dstStride; + __lasx_xvstelm_d(tmp2, dst, 0, 0); + dst += dstStride; + __lasx_xvstelm_d(tmp2, dst, 0, 2); + dst += dstStride; + __lasx_xvstelm_d(tmp4, dst, 0, 0); + dst += dstStride; + __lasx_xvstelm_d(tmp4, dst, 0, 2); + dst += dstStride; + __lasx_xvstelm_d(tmp6, dst, 0, 0); + dst += dstStride; + __lasx_xvstelm_d(tmp6, dst, 0, 2); +} + +static av_always_inline void +avg_h264_qpel8_h_lowpass_lasx(uint8_t *dst, const uint8_t *src, int dstStride, + int srcStride) +{ + int dstStride_2x = dstStride << 1; + int dstStride_4x = dstStride << 2; + int dstStride_3x = dstStride_2x + dstStride; + __m256i src00, src01, src02, src03, src04, src05, src10; + __m256i dst00, dst01, dst0, dst1, dst2, dst3; + __m256i out0, out1, out2, out3; + __m256i h_20 = __lasx_xvldi(0x414); + __m256i h_5 = __lasx_xvldi(0x405); + __m256i h_16 = __lasx_xvldi(0x410); + __m256i mask1 = {0x0807060504030201, 0x0, 0x0807060504030201, 0x0}; + __m256i mask2 = {0x0908070605040302, 0x0, 0x0908070605040302, 0x0}; + __m256i mask3 = {0x0a09080706050403, 0x0, 0x0a09080706050403, 0x0}; + __m256i mask4 = {0x0b0a090807060504, 0x0, 0x0b0a090807060504, 0x0}; + __m256i mask5 = {0x0c0b0a0908070605, 0x0, 0x0c0b0a0908070605, 0x0}; + + QPEL8_H_LOWPASS(out0) + QPEL8_H_LOWPASS(out1) + QPEL8_H_LOWPASS(out2) + QPEL8_H_LOWPASS(out3) + src00 = __lasx_xvld(dst, 0); + DUP4_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, dst, + dstStride_3x, dst, dstStride_4x, src01, src02, src03, src04); + dst += dstStride_4x; + DUP2_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, src05, dst00); + dst01 = __lasx_xvldx(dst, dstStride_3x); + dst -= dstStride_4x; + dst0 = __lasx_xvpermi_q(src00, src01, 0x02); + dst1 = __lasx_xvpermi_q(src02, src03, 0x02); + dst2 = __lasx_xvpermi_q(src04, src05, 0x02); + dst3 = __lasx_xvpermi_q(dst00, dst01, 0x02); + dst0 = 
__lasx_xvavgr_bu(dst0, out0); + dst1 = __lasx_xvavgr_bu(dst1, out1); + dst2 = __lasx_xvavgr_bu(dst2, out2); + dst3 = __lasx_xvavgr_bu(dst3, out3); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst + dstStride, 0, 2); + __lasx_xvstelm_d(dst1, dst + dstStride_2x, 0, 0); + __lasx_xvstelm_d(dst1, dst + dstStride_3x, 0, 2); + dst += dstStride_4x; + __lasx_xvstelm_d(dst2, dst, 0, 0); + __lasx_xvstelm_d(dst2, dst + dstStride, 0, 2); + __lasx_xvstelm_d(dst3, dst + dstStride_2x, 0, 0); + __lasx_xvstelm_d(dst3, dst + dstStride_3x, 0, 2); +} + +static av_always_inline void +avg_h264_qpel8_hv_lowpass_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dstStride, ptrdiff_t srcStride) +{ + __m256i src00, src01, src02, src03, src04, src05, src10; + __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6; + __m256i tmp7, tmp8, tmp9, tmp10, tmp11, tmp12; + __m256i h_20 = __lasx_xvldi(0x414); + __m256i h_5 = __lasx_xvldi(0x405); + __m256i w_20 = __lasx_xvldi(0x814); + __m256i w_5 = __lasx_xvldi(0x805); + __m256i w_512 = {512}; + __m256i mask1 = {0x0807060504030201, 0x0, 0x0807060504030201, 0x0}; + __m256i mask2 = {0x0908070605040302, 0x0, 0x0908070605040302, 0x0}; + __m256i mask3 = {0x0a09080706050403, 0x0, 0x0a09080706050403, 0x0}; + __m256i mask4 = {0x0b0a090807060504, 0x0, 0x0b0a090807060504, 0x0}; + __m256i mask5 = {0x0c0b0a0908070605, 0x0, 0x0c0b0a0908070605, 0x0}; + ptrdiff_t dstStride_2x = dstStride << 1; + ptrdiff_t dstStride_4x = dstStride << 2; + ptrdiff_t dstStride_3x = dstStride_2x + dstStride; + + w_512 = __lasx_xvreplve0_w(w_512); + + src -= srcStride << 1; + QPEL8_HV_LOWPASS_H(tmp0) + QPEL8_HV_LOWPASS_H(tmp2) + QPEL8_HV_LOWPASS_H(tmp4) + QPEL8_HV_LOWPASS_H(tmp6) + QPEL8_HV_LOWPASS_H(tmp8) + QPEL8_HV_LOWPASS_H(tmp10) + QPEL8_HV_LOWPASS_H(tmp12) + tmp11 = __lasx_xvpermi_q(tmp12, tmp10, 0x21); + tmp9 = __lasx_xvpermi_q(tmp10, tmp8, 0x21); + tmp7 = __lasx_xvpermi_q(tmp8, tmp6, 0x21); + tmp5 = __lasx_xvpermi_q(tmp6, tmp4, 0x21); + tmp3 = __lasx_xvpermi_q(tmp4, 
tmp2, 0x21); + tmp1 = __lasx_xvpermi_q(tmp2, tmp0, 0x21); + + QPEL8_HV_LOWPASS_V(tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, src00, src01, + src02, src03, src04, src05, tmp0) + QPEL8_HV_LOWPASS_V(tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, src00, src01, + src02, src03, src04, src05, tmp2) + QPEL8_HV_LOWPASS_V(tmp4, tmp5, tmp6, tmp7, tmp8, tmp9, src00, src01, + src02, src03, src04, src05, tmp4) + QPEL8_HV_LOWPASS_V(tmp6, tmp7, tmp8, tmp9, tmp10, tmp11, src00, src01, + src02, src03, src04, src05, tmp6) + + src00 = __lasx_xvld(dst, 0); + DUP4_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, dst, + dstStride_3x, dst, dstStride_4x, src01, src02, src03, src04); + dst += dstStride_4x; + DUP2_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, src05, tmp8); + tmp9 = __lasx_xvldx(dst, dstStride_3x); + dst -= dstStride_4x; + tmp1 = __lasx_xvpermi_q(src00, src01, 0x02); + tmp3 = __lasx_xvpermi_q(src02, src03, 0x02); + tmp5 = __lasx_xvpermi_q(src04, src05, 0x02); + tmp7 = __lasx_xvpermi_q(tmp8, tmp9, 0x02); + tmp0 = __lasx_xvavgr_bu(tmp0, tmp1); + tmp2 = __lasx_xvavgr_bu(tmp2, tmp3); + tmp4 = __lasx_xvavgr_bu(tmp4, tmp5); + tmp6 = __lasx_xvavgr_bu(tmp6, tmp7); + __lasx_xvstelm_d(tmp0, dst, 0, 0); + dst += dstStride; + __lasx_xvstelm_d(tmp0, dst, 0, 2); + dst += dstStride; + __lasx_xvstelm_d(tmp2, dst, 0, 0); + dst += dstStride; + __lasx_xvstelm_d(tmp2, dst, 0, 2); + dst += dstStride; + __lasx_xvstelm_d(tmp4, dst, 0, 0); + dst += dstStride; + __lasx_xvstelm_d(tmp4, dst, 0, 2); + dst += dstStride; + __lasx_xvstelm_d(tmp6, dst, 0, 0); + dst += dstStride; + __lasx_xvstelm_d(tmp6, dst, 0, 2); +} + +static av_always_inline void +put_h264_qpel16_h_lowpass_lasx(uint8_t *dst, const uint8_t *src, + int dstStride, int srcStride) +{ + put_h264_qpel8_h_lowpass_lasx(dst, src, dstStride, srcStride); + put_h264_qpel8_h_lowpass_lasx(dst+8, src+8, dstStride, srcStride); + src += srcStride << 3; + dst += dstStride << 3; + put_h264_qpel8_h_lowpass_lasx(dst, src, dstStride, srcStride); + 
put_h264_qpel8_h_lowpass_lasx(dst+8, src+8, dstStride, srcStride); +} + +static av_always_inline void +avg_h264_qpel16_h_lowpass_lasx(uint8_t *dst, const uint8_t *src, + int dstStride, int srcStride) +{ + avg_h264_qpel8_h_lowpass_lasx(dst, src, dstStride, srcStride); + avg_h264_qpel8_h_lowpass_lasx(dst+8, src+8, dstStride, srcStride); + src += srcStride << 3; + dst += dstStride << 3; + avg_h264_qpel8_h_lowpass_lasx(dst, src, dstStride, srcStride); + avg_h264_qpel8_h_lowpass_lasx(dst+8, src+8, dstStride, srcStride); +} + +static void put_h264_qpel16_v_lowpass_lasx(uint8_t *dst, const uint8_t *src, + int dstStride, int srcStride) +{ + put_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, dstStride, srcStride); + put_h264_qpel8_v_lowpass_lasx(dst+8, (uint8_t*)src+8, dstStride, srcStride); + src += 8*srcStride; + dst += 8*dstStride; + put_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, dstStride, srcStride); + put_h264_qpel8_v_lowpass_lasx(dst+8, (uint8_t*)src+8, dstStride, srcStride); +} + +static void avg_h264_qpel16_v_lowpass_lasx(uint8_t *dst, const uint8_t *src, + int dstStride, int srcStride) +{ + avg_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, dstStride, srcStride); + avg_h264_qpel8_v_lowpass_lasx(dst+8, (uint8_t*)src+8, dstStride, srcStride); + src += 8*srcStride; + dst += 8*dstStride; + avg_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, dstStride, srcStride); + avg_h264_qpel8_v_lowpass_lasx(dst+8, (uint8_t*)src+8, dstStride, srcStride); +} + +static void put_h264_qpel16_hv_lowpass_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dstStride, ptrdiff_t srcStride) +{ + put_h264_qpel8_hv_lowpass_lasx(dst, src, dstStride, srcStride); + put_h264_qpel8_hv_lowpass_lasx(dst + 8, src + 8, dstStride, srcStride); + src += srcStride << 3; + dst += dstStride << 3; + put_h264_qpel8_hv_lowpass_lasx(dst, src, dstStride, srcStride); + put_h264_qpel8_hv_lowpass_lasx(dst + 8, src + 8, dstStride, srcStride); +} + +static void avg_h264_qpel16_hv_lowpass_lasx(uint8_t *dst, const uint8_t 
*src, + ptrdiff_t dstStride, ptrdiff_t srcStride) +{ + avg_h264_qpel8_hv_lowpass_lasx(dst, src, dstStride, srcStride); + avg_h264_qpel8_hv_lowpass_lasx(dst + 8, src + 8, dstStride, srcStride); + src += srcStride << 3; + dst += dstStride << 3; + avg_h264_qpel8_hv_lowpass_lasx(dst, src, dstStride, srcStride); + avg_h264_qpel8_hv_lowpass_lasx(dst + 8, src + 8, dstStride, srcStride); +} + +void ff_put_h264_qpel8_mc00_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + /* In mmi optimization, it used function ff_put_pixels8_8_mmi + * which implemented in hpeldsp_mmi.c */ + put_pixels8_8_inline_asm(dst, src, stride); +} + +void ff_put_h264_qpel8_mc10_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[64]; + + put_h264_qpel8_h_lowpass_lasx(half, src, 8, stride); + /* in qpel8, the stride of half and height of block is 8 */ + put_pixels8_l2_8_lsx(dst, src, half, stride, stride); +} + +void ff_put_h264_qpel8_mc20_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + put_h264_qpel8_h_lowpass_lasx(dst, src, stride, stride); +} + +void ff_put_h264_qpel8_mc30_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[64]; + + put_h264_qpel8_h_lowpass_lasx(half, src, 8, stride); + put_pixels8_l2_8_lsx(dst, src+1, half, stride, stride); +} + +void ff_put_h264_qpel8_mc01_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[64]; + + put_h264_qpel8_v_lowpass_lasx(half, (uint8_t*)src, 8, stride); + put_pixels8_l2_8_lsx(dst, src, half, stride, stride); +} + +void ff_put_h264_qpel8_mc11_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t halfH[64]; + uint8_t halfV[64]; + + put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src, 8, stride); + put_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8); +} + +void ff_put_h264_qpel8_mc21_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[128]; + uint8_t *const 
halfH = temp; + uint8_t *const halfHV = temp + 64; + + put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride); + put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride); + put_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8); +} + +void ff_put_h264_qpel8_mc31_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t halfH[64]; + uint8_t halfV[64]; + + put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src + 1, 8, stride); + put_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8); +} + +void ff_put_h264_qpel8_mc02_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + put_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, stride, stride); +} + +void ff_put_h264_qpel8_mc12_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[128]; + uint8_t *const halfHV = temp; + uint8_t *const halfH = temp + 64; + + put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfH, (uint8_t*)src, 8, stride); + put_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8); +} + +void ff_put_h264_qpel8_mc22_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + put_h264_qpel8_hv_lowpass_lasx(dst, src, stride, stride); +} + +void ff_put_h264_qpel8_mc32_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[128]; + uint8_t *const halfHV = temp; + uint8_t *const halfH = temp + 64; + + put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfH, (uint8_t*)src + 1, 8, stride); + put_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8); +} + +void ff_put_h264_qpel8_mc03_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[64]; + + put_h264_qpel8_v_lowpass_lasx(half, (uint8_t*)src, 8, stride); + put_pixels8_l2_8_lsx(dst, src + stride, half, stride, stride); +} + +void ff_put_h264_qpel8_mc13_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t halfH[64]; + uint8_t halfV[64]; + + 
put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src, 8, stride); + put_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8); +} + +void ff_put_h264_qpel8_mc23_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[128]; + uint8_t *const halfH = temp; + uint8_t *const halfHV = temp + 64; + + put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride); + put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride); + put_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8); +} + +void ff_put_h264_qpel8_mc33_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t halfH[64]; + uint8_t halfV[64]; + + put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src + 1, 8, stride); + put_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8); +} + +void ff_avg_h264_qpel8_mc00_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + /* In mmi optimization, it used function ff_avg_pixels8_8_mmi + * which implemented in hpeldsp_mmi.c */ + avg_pixels8_8_lsx(dst, src, stride); +} + +void ff_avg_h264_qpel8_mc10_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[64]; + + put_h264_qpel8_h_lowpass_lasx(half, src, 8, stride); + avg_pixels8_l2_8_lsx(dst, src, half, stride, stride); +} + +void ff_avg_h264_qpel8_mc20_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avg_h264_qpel8_h_lowpass_lasx(dst, src, stride, stride); +} + +void ff_avg_h264_qpel8_mc30_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[64]; + + put_h264_qpel8_h_lowpass_lasx(half, src, 8, stride); + avg_pixels8_l2_8_lsx(dst, src+1, half, stride, stride); +} + +void ff_avg_h264_qpel8_mc11_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t halfH[64]; + uint8_t halfV[64]; + + put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src, 8, stride); + 
avg_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8); +} + +void ff_avg_h264_qpel8_mc21_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[128]; + uint8_t *const halfH = temp; + uint8_t *const halfHV = temp + 64; + + put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride); + put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride); + avg_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8); +} + +void ff_avg_h264_qpel8_mc31_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t halfH[64]; + uint8_t halfV[64]; + + put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src + 1, 8, stride); + avg_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8); +} + +void ff_avg_h264_qpel8_mc02_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avg_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, stride, stride); +} + +void ff_avg_h264_qpel8_mc12_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[128]; + uint8_t *const halfHV = temp; + uint8_t *const halfH = temp + 64; + + put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfH, (uint8_t*)src, 8, stride); + avg_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8); +} + +void ff_avg_h264_qpel8_mc22_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avg_h264_qpel8_hv_lowpass_lasx(dst, src, stride, stride); +} + +void ff_avg_h264_qpel8_mc32_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[128]; + uint8_t *const halfHV = temp; + uint8_t *const halfH = temp + 64; + + put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfH, (uint8_t*)src + 1, 8, stride); + avg_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8); +} + +void ff_avg_h264_qpel8_mc13_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t halfH[64]; + uint8_t halfV[64]; + + put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride); + 
put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src, 8, stride); + avg_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8); +} + +void ff_avg_h264_qpel8_mc23_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[128]; + uint8_t *const halfH = temp; + uint8_t *const halfHV = temp + 64; + + put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride); + put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride); + avg_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8); +} + +void ff_avg_h264_qpel8_mc33_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t halfH[64]; + uint8_t halfV[64]; + + put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride); + put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src + 1, 8, stride); + avg_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8); +} + +void ff_put_h264_qpel16_mc00_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + /* In mmi optimization, it used function ff_put_pixels16_8_mmi + * which implemented in hpeldsp_mmi.c */ + put_pixels16_8_lsx(dst, src, stride); +} + +void ff_put_h264_qpel16_mc10_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[256]; + + put_h264_qpel16_h_lowpass_lasx(half, src, 16, stride); + put_pixels16_l2_8_lsx(dst, src, half, stride, stride); +} + +void ff_put_h264_qpel16_mc20_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + put_h264_qpel16_h_lowpass_lasx(dst, src, stride, stride); +} + +void ff_put_h264_qpel16_mc30_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[256]; + + put_h264_qpel16_h_lowpass_lasx(half, src, 16, stride); + put_pixels16_l2_8_lsx(dst, src+1, half, stride, stride); +} + +void ff_put_h264_qpel16_mc01_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[256]; + + put_h264_qpel16_v_lowpass_lasx(half, src, 16, stride); + put_pixels16_l2_8_lsx(dst, src, half, stride, stride); +} + +void ff_put_h264_qpel16_mc11_lasx(uint8_t *dst, const uint8_t *src, + 
ptrdiff_t stride) +{ + avc_luma_hv_qrt_16x16_lasx((uint8_t*)src - 2, (uint8_t*)src - (stride * 2), + dst, stride); +} + +void ff_put_h264_qpel16_mc21_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[512]; + uint8_t *const halfH = temp; + uint8_t *const halfHV = temp + 256; + + put_h264_qpel16_h_lowpass_lasx(halfH, src, 16, stride); + put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride); + put_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16); +} + +void ff_put_h264_qpel16_mc31_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avc_luma_hv_qrt_16x16_lasx((uint8_t*)src - 2, (uint8_t*)src - (stride * 2) + 1, + dst, stride); +} + +void ff_put_h264_qpel16_mc02_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + put_h264_qpel16_v_lowpass_lasx(dst, src, stride, stride); +} + +void ff_put_h264_qpel16_mc12_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[512]; + uint8_t *const halfHV = temp; + uint8_t *const halfH = temp + 256; + + put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride); + put_h264_qpel16_v_lowpass_lasx(halfH, src, 16, stride); + put_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16); +} + +void ff_put_h264_qpel16_mc22_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + put_h264_qpel16_hv_lowpass_lasx(dst, src, stride, stride); +} + +void ff_put_h264_qpel16_mc32_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[512]; + uint8_t *const halfHV = temp; + uint8_t *const halfH = temp + 256; + + put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride); + put_h264_qpel16_v_lowpass_lasx(halfH, src + 1, 16, stride); + put_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16); +} + +void ff_put_h264_qpel16_mc03_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[256]; + + put_h264_qpel16_v_lowpass_lasx(half, src, 16, stride); + put_pixels16_l2_8_lsx(dst, src+stride, half, stride, stride); +} + +void 
ff_put_h264_qpel16_mc13_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avc_luma_hv_qrt_16x16_lasx((uint8_t*)src + stride - 2, (uint8_t*)src - (stride * 2), + dst, stride); +} + +void ff_put_h264_qpel16_mc23_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[512]; + uint8_t *const halfH = temp; + uint8_t *const halfHV = temp + 256; + + put_h264_qpel16_h_lowpass_lasx(halfH, src + stride, 16, stride); + put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride); + put_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16); +} + +void ff_put_h264_qpel16_mc33_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avc_luma_hv_qrt_16x16_lasx((uint8_t*)src + stride - 2, + (uint8_t*)src - (stride * 2) + 1, dst, stride); +} + +void ff_avg_h264_qpel16_mc00_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + /* In mmi optimization, it used function ff_avg_pixels16_8_mmi + * which implemented in hpeldsp_mmi.c */ + avg_pixels16_8_lsx(dst, src, stride); +} + +void ff_avg_h264_qpel16_mc10_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[256]; + + put_h264_qpel16_h_lowpass_lasx(half, src, 16, stride); + avg_pixels16_l2_8_lsx(dst, src, half, stride, stride); +} + +void ff_avg_h264_qpel16_mc20_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avg_h264_qpel16_h_lowpass_lasx(dst, src, stride, stride); +} + +void ff_avg_h264_qpel16_mc30_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[256]; + + put_h264_qpel16_h_lowpass_lasx(half, src, 16, stride); + avg_pixels16_l2_8_lsx(dst, src+1, half, stride, stride); +} + +void ff_avg_h264_qpel16_mc01_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[256]; + + put_h264_qpel16_v_lowpass_lasx(half, src, 16, stride); + avg_pixels16_l2_8_lsx(dst, src, half, stride, stride); +} + +void ff_avg_h264_qpel16_mc11_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + 
avc_luma_hv_qrt_and_aver_dst_16x16_lasx((uint8_t*)src - 2, + (uint8_t*)src - (stride * 2), + dst, stride); +} + +void ff_avg_h264_qpel16_mc21_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[512]; + uint8_t *const halfH = temp; + uint8_t *const halfHV = temp + 256; + + put_h264_qpel16_h_lowpass_lasx(halfH, src, 16, stride); + put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride); + avg_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16); +} + +void ff_avg_h264_qpel16_mc31_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avc_luma_hv_qrt_and_aver_dst_16x16_lasx((uint8_t*)src - 2, + (uint8_t*)src - (stride * 2) + 1, + dst, stride); +} + +void ff_avg_h264_qpel16_mc02_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avg_h264_qpel16_v_lowpass_lasx(dst, src, stride, stride); +} + +void ff_avg_h264_qpel16_mc12_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[512]; + uint8_t *const halfHV = temp; + uint8_t *const halfH = temp + 256; + + put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride); + put_h264_qpel16_v_lowpass_lasx(halfH, src, 16, stride); + avg_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16); +} + +void ff_avg_h264_qpel16_mc22_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avg_h264_qpel16_hv_lowpass_lasx(dst, src, stride, stride); +} + +void ff_avg_h264_qpel16_mc32_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[512]; + uint8_t *const halfHV = temp; + uint8_t *const halfH = temp + 256; + + put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride); + put_h264_qpel16_v_lowpass_lasx(halfH, src + 1, 16, stride); + avg_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16); +} + +void ff_avg_h264_qpel16_mc03_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t half[256]; + + put_h264_qpel16_v_lowpass_lasx(half, src, 16, stride); + avg_pixels16_l2_8_lsx(dst, src + stride, half, stride, stride); +} + +void 
ff_avg_h264_qpel16_mc13_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avc_luma_hv_qrt_and_aver_dst_16x16_lasx((uint8_t*)src + stride - 2, + (uint8_t*)src - (stride * 2), + dst, stride); +} + +void ff_avg_h264_qpel16_mc23_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + uint8_t temp[512]; + uint8_t *const halfH = temp; + uint8_t *const halfHV = temp + 256; + + put_h264_qpel16_h_lowpass_lasx(halfH, src + stride, 16, stride); + put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride); + avg_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16); +} + +void ff_avg_h264_qpel16_mc33_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride) +{ + avc_luma_hv_qrt_and_aver_dst_16x16_lasx((uint8_t*)src + stride - 2, + (uint8_t*)src - (stride * 2) + 1, + dst, stride); +} diff --git a/libavcodec/loongarch/h264qpel_lasx.h b/libavcodec/loongarch/h264qpel_lasx.h new file mode 100644 index 0000000000..32b6b50917 --- /dev/null +++ b/libavcodec/loongarch/h264qpel_lasx.h @@ -0,0 +1,158 @@ +/* + * Copyright (c) 2020 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVCODEC_LOONGARCH_H264QPEL_LASX_H +#define AVCODEC_LOONGARCH_H264QPEL_LASX_H + +#include <stdint.h> +#include <stddef.h> +#include "libavcodec/h264.h" + +void ff_h264_h_lpf_luma_inter_lasx(uint8_t *src, int stride, + int alpha, int beta, int8_t *tc0); +void ff_h264_v_lpf_luma_inter_lasx(uint8_t *src, int stride, + int alpha, int beta, int8_t *tc0); +void ff_put_h264_qpel16_mc00_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc10_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc20_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc30_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc01_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc11_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc21_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc31_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc02_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc12_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc32_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc22_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc03_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc13_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_put_h264_qpel16_mc23_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void
ff_put_h264_qpel16_mc33_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc00_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc10_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc20_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc30_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc01_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc11_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc21_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc31_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc02_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc12_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc22_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc32_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc03_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc13_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc23_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel16_mc33_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); + +void ff_put_h264_qpel8_mc00_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc10_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc20_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc30_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc01_lasx(uint8_t *dst, 
const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc11_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc21_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc31_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc02_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc12_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc22_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc32_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc03_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc13_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc23_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_put_h264_qpel8_mc33_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t stride); +void ff_avg_h264_qpel8_mc00_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc10_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc20_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc30_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc11_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc21_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc31_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc02_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc12_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc22_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc32_lasx(uint8_t *dst, 
const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc13_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc23_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +void ff_avg_h264_qpel8_mc33_lasx(uint8_t *dst, const uint8_t *src, + ptrdiff_t dst_stride); +#endif // #ifndef AVCODEC_LOONGARCH_H264QPEL_LASX_H
From patchwork Tue Dec 14 13:33:13 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= X-Patchwork-Id: 32489 From: Hao Chen To: ffmpeg-devel@ffmpeg.org Date: Tue, 14 Dec 2021 21:33:13 +0800 Message-Id: <20211214133316.8978-5-chenhao@loongson.cn> In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn> References: <20211214133316.8978-1-chenhao@loongson.cn> Subject: [FFmpeg-devel] [PATCH v2 4/7] avcodec: [loongarch] Optimize h264dsp with LASX. Reply-To: FFmpeg development discussions and patches Cc: gxw
From: gxw ./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an before:225 after :282 Change-Id: Ibe245827dcdfe8fc1541c6b172483151bfa9e642 --- libavcodec/h264dsp.c | 1 + libavcodec/h264dsp.h | 2 + libavcodec/loongarch/Makefile | 2 + libavcodec/loongarch/h264dsp_init_loongarch.c | 58 + libavcodec/loongarch/h264dsp_lasx.c | 2114 +++++++++++++++++ libavcodec/loongarch/h264dsp_lasx.h | 68 + 6 files changed, 2245 insertions(+) create mode 100644 libavcodec/loongarch/h264dsp_init_loongarch.c create mode 100644 libavcodec/loongarch/h264dsp_lasx.c create mode 100644 libavcodec/loongarch/h264dsp_lasx.h diff --git a/libavcodec/h264dsp.c
b/libavcodec/h264dsp.c index e76932b565..f97ac2823c 100644 --- a/libavcodec/h264dsp.c +++ b/libavcodec/h264dsp.c @@ -157,4 +157,5 @@ av_cold void ff_h264dsp_init(H264DSPContext *c, const int bit_depth, if (ARCH_PPC) ff_h264dsp_init_ppc(c, bit_depth, chroma_format_idc); if (ARCH_X86) ff_h264dsp_init_x86(c, bit_depth, chroma_format_idc); if (ARCH_MIPS) ff_h264dsp_init_mips(c, bit_depth, chroma_format_idc); + if (ARCH_LOONGARCH) ff_h264dsp_init_loongarch(c, bit_depth, chroma_format_idc); } diff --git a/libavcodec/h264dsp.h b/libavcodec/h264dsp.h index 850d4471fd..e0880c4d88 100644 --- a/libavcodec/h264dsp.h +++ b/libavcodec/h264dsp.h @@ -129,5 +129,7 @@ void ff_h264dsp_init_x86(H264DSPContext *c, const int bit_depth, const int chroma_format_idc); void ff_h264dsp_init_mips(H264DSPContext *c, const int bit_depth, const int chroma_format_idc); +void ff_h264dsp_init_loongarch(H264DSPContext *c, const int bit_depth, + const int chroma_format_idc); #endif /* AVCODEC_H264DSP_H */ diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile index 4e2ce8487f..df43151dbd 100644 --- a/libavcodec/loongarch/Makefile +++ b/libavcodec/loongarch/Makefile @@ -1,4 +1,6 @@ OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_init_loongarch.o OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_init_loongarch.o +OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_init_loongarch.o LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o +LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o diff --git a/libavcodec/loongarch/h264dsp_init_loongarch.c b/libavcodec/loongarch/h264dsp_init_loongarch.c new file mode 100644 index 0000000000..ddc0877a74 --- /dev/null +++ b/libavcodec/loongarch/h264dsp_init_loongarch.c @@ -0,0 +1,58 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * Xiwei Gu + * + * This file is part of FFmpeg. 
+ * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/loongarch/cpu.h" +#include "h264dsp_lasx.h" + +av_cold void ff_h264dsp_init_loongarch(H264DSPContext *c, const int bit_depth, + const int chroma_format_idc) +{ + int cpu_flags = av_get_cpu_flags(); + + if (have_lasx(cpu_flags)) { + if (bit_depth == 8) { + c->h264_add_pixels4_clear = ff_h264_add_pixels4_8_lasx; + c->h264_add_pixels8_clear = ff_h264_add_pixels8_8_lasx; + c->h264_v_loop_filter_luma = ff_h264_v_lpf_luma_8_lasx; + c->h264_h_loop_filter_luma = ff_h264_h_lpf_luma_8_lasx; + c->h264_v_loop_filter_luma_intra = ff_h264_v_lpf_luma_intra_8_lasx; + c->h264_h_loop_filter_luma_intra = ff_h264_h_lpf_luma_intra_8_lasx; + c->h264_v_loop_filter_chroma = ff_h264_v_lpf_chroma_8_lasx; + + if (chroma_format_idc <= 1) + c->h264_h_loop_filter_chroma = ff_h264_h_lpf_chroma_8_lasx; + c->h264_v_loop_filter_chroma_intra = ff_h264_v_lpf_chroma_intra_8_lasx; + + if (chroma_format_idc <= 1) + c->h264_h_loop_filter_chroma_intra = ff_h264_h_lpf_chroma_intra_8_lasx; + + /* Weighted MC */ + c->weight_h264_pixels_tab[0] = ff_weight_h264_pixels16_8_lasx; + c->weight_h264_pixels_tab[1] = ff_weight_h264_pixels8_8_lasx; + c->weight_h264_pixels_tab[2] = ff_weight_h264_pixels4_8_lasx; + + c->biweight_h264_pixels_tab[0] = 
ff_biweight_h264_pixels16_8_lasx; + c->biweight_h264_pixels_tab[1] = ff_biweight_h264_pixels8_8_lasx; + c->biweight_h264_pixels_tab[2] = ff_biweight_h264_pixels4_8_lasx; + } + } +} diff --git a/libavcodec/loongarch/h264dsp_lasx.c b/libavcodec/loongarch/h264dsp_lasx.c new file mode 100644 index 0000000000..7fd4cedf7e --- /dev/null +++ b/libavcodec/loongarch/h264dsp_lasx.c @@ -0,0 +1,2114 @@ +/* + * Loongson LASX optimized h264dsp + * + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * Xiwei Gu + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/loongarch/loongson_intrinsics.h" +#include "h264dsp_lasx.h" + +#define AVC_LPF_P1_OR_Q1(p0_or_q0_org_in, q0_or_p0_org_in, \ + p1_or_q1_org_in, p2_or_q2_org_in, \ + neg_tc_in, tc_in, p1_or_q1_out) \ +{ \ + __m256i clip3, temp; \ + \ + clip3 = __lasx_xvavgr_hu(p0_or_q0_org_in, \ + q0_or_p0_org_in); \ + temp = __lasx_xvslli_h(p1_or_q1_org_in, 1); \ + clip3 = __lasx_xvsub_h(clip3, temp); \ + clip3 = __lasx_xvavg_h(p2_or_q2_org_in, clip3); \ + clip3 = __lasx_xvclip_h(clip3, neg_tc_in, tc_in); \ + p1_or_q1_out = __lasx_xvadd_h(p1_or_q1_org_in, clip3); \ +} + +#define AVC_LPF_P0Q0(q0_or_p0_org_in, p0_or_q0_org_in, \ + p1_or_q1_org_in, q1_or_p1_org_in, \ + neg_threshold_in, threshold_in, \ + p0_or_q0_out, q0_or_p0_out) \ +{ \ + __m256i q0_sub_p0, p1_sub_q1, delta; \ + \ + q0_sub_p0 = __lasx_xvsub_h(q0_or_p0_org_in, \ + p0_or_q0_org_in); \ + p1_sub_q1 = __lasx_xvsub_h(p1_or_q1_org_in, \ + q1_or_p1_org_in); \ + q0_sub_p0 = __lasx_xvslli_h(q0_sub_p0, 2); \ + p1_sub_q1 = __lasx_xvaddi_hu(p1_sub_q1, 4); \ + delta = __lasx_xvadd_h(q0_sub_p0, p1_sub_q1); \ + delta = __lasx_xvsrai_h(delta, 3); \ + delta = __lasx_xvclip_h(delta, neg_threshold_in, \ + threshold_in); \ + p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_org_in, delta); \ + q0_or_p0_out = __lasx_xvsub_h(q0_or_p0_org_in, delta); \ + \ + p0_or_q0_out = __lasx_xvclip255_h(p0_or_q0_out); \ + q0_or_p0_out = __lasx_xvclip255_h(q0_or_p0_out); \ +} + +void ff_h264_h_lpf_luma_8_lasx(uint8_t *data, ptrdiff_t img_width, + int alpha_in, int beta_in, int8_t *tc) +{ + ptrdiff_t img_width_2x = img_width << 1; + ptrdiff_t img_width_4x = img_width << 2; + ptrdiff_t img_width_8x = img_width << 3; + ptrdiff_t img_width_3x = img_width_2x + img_width; + __m256i tmp_vec0, bs_vec; + __m256i tc_vec = 
{0x0101010100000000, 0x0303030302020202, + 0x0101010100000000, 0x0303030302020202}; + + tmp_vec0 = __lasx_xvldrepl_w((uint32_t*)tc, 0); + tc_vec = __lasx_xvshuf_b(tmp_vec0, tmp_vec0, tc_vec); + bs_vec = __lasx_xvslti_b(tc_vec, 0); + bs_vec = __lasx_xvxori_b(bs_vec, 255); + bs_vec = __lasx_xvandi_b(bs_vec, 1); + + if (__lasx_xbnz_v(bs_vec)) { + uint8_t *src = data - 4; + __m256i p3_org, p2_org, p1_org, p0_org, q0_org, q1_org, q2_org, q3_org; + __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta; + __m256i is_less_than, is_less_than_beta, is_less_than_alpha; + __m256i is_bs_greater_than0; + __m256i zero = __lasx_xvldi(0); + + is_bs_greater_than0 = __lasx_xvslt_bu(zero, bs_vec); + + { + uint8_t *src_tmp = src + img_width_8x; + __m256i row0, row1, row2, row3, row4, row5, row6, row7; + __m256i row8, row9, row10, row11, row12, row13, row14, row15; + + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, + src, img_width_3x, row0, row1, row2, row3); + src += img_width_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, + src, img_width_3x, row4, row5, row6, row7); + src -= img_width_4x; + DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, img_width, src_tmp, + img_width_2x, src_tmp, img_width_3x, + row8, row9, row10, row11); + src_tmp += img_width_4x; + DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, img_width, src_tmp, + img_width_2x, src_tmp, img_width_3x, + row12, row13, row14, row15); + src_tmp -= img_width_4x; + + LASX_TRANSPOSE16x8_B(row0, row1, row2, row3, row4, row5, row6, + row7, row8, row9, row10, row11, + row12, row13, row14, row15, + p3_org, p2_org, p1_org, p0_org, + q0_org, q1_org, q2_org, q3_org); + } + + p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org); + p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org); + q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org); + + alpha = __lasx_xvreplgr2vr_b(alpha_in); + beta = __lasx_xvreplgr2vr_b(beta_in); + + is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha); + is_less_than_beta = 
__lasx_xvslt_bu(p1_asub_p0, beta); + is_less_than = is_less_than_alpha & is_less_than_beta; + is_less_than_beta = __lasx_xvslt_bu(q1_asub_q0, beta); + is_less_than = is_less_than_beta & is_less_than; + is_less_than = is_less_than & is_bs_greater_than0; + + if (__lasx_xbnz_v(is_less_than)) { + __m256i neg_tc_h, tc_h, p1_org_h, p0_org_h, q0_org_h, q1_org_h; + __m256i p2_asub_p0, q2_asub_q0; + + neg_tc_h = __lasx_xvneg_b(tc_vec); + neg_tc_h = __lasx_vext2xv_h_b(neg_tc_h); + tc_h = __lasx_vext2xv_hu_bu(tc_vec); + p1_org_h = __lasx_vext2xv_hu_bu(p1_org); + p0_org_h = __lasx_vext2xv_hu_bu(p0_org); + q0_org_h = __lasx_vext2xv_hu_bu(q0_org); + + p2_asub_p0 = __lasx_xvabsd_bu(p2_org, p0_org); + is_less_than_beta = __lasx_xvslt_bu(p2_asub_p0, beta); + is_less_than_beta = is_less_than_beta & is_less_than; + + if (__lasx_xbnz_v(is_less_than_beta)) { + __m256i p2_org_h, p1_h; + + p2_org_h = __lasx_vext2xv_hu_bu(p2_org); + AVC_LPF_P1_OR_Q1(p0_org_h, q0_org_h, p1_org_h, p2_org_h, + neg_tc_h, tc_h, p1_h); + p1_h = __lasx_xvpickev_b(p1_h, p1_h); + p1_h = __lasx_xvpermi_d(p1_h, 0xd8); + p1_org = __lasx_xvbitsel_v(p1_org, p1_h, is_less_than_beta); + is_less_than_beta = __lasx_xvandi_b(is_less_than_beta, 1); + tc_vec = __lasx_xvadd_b(tc_vec, is_less_than_beta); + } + + q2_asub_q0 = __lasx_xvabsd_bu(q2_org, q0_org); + is_less_than_beta = __lasx_xvslt_bu(q2_asub_q0, beta); + is_less_than_beta = is_less_than_beta & is_less_than; + + q1_org_h = __lasx_vext2xv_hu_bu(q1_org); + + if (__lasx_xbnz_v(is_less_than_beta)) { + __m256i q2_org_h, q1_h; + + q2_org_h = __lasx_vext2xv_hu_bu(q2_org); + AVC_LPF_P1_OR_Q1(p0_org_h, q0_org_h, q1_org_h, q2_org_h, + neg_tc_h, tc_h, q1_h); + q1_h = __lasx_xvpickev_b(q1_h, q1_h); + q1_h = __lasx_xvpermi_d(q1_h, 0xd8); + q1_org = __lasx_xvbitsel_v(q1_org, q1_h, is_less_than_beta); + + is_less_than_beta = __lasx_xvandi_b(is_less_than_beta, 1); + tc_vec = __lasx_xvadd_b(tc_vec, is_less_than_beta); + } + + { + __m256i neg_thresh_h, p0_h, q0_h; + + neg_thresh_h = 
__lasx_xvneg_b(tc_vec); + neg_thresh_h = __lasx_vext2xv_h_b(neg_thresh_h); + tc_h = __lasx_vext2xv_hu_bu(tc_vec); + + AVC_LPF_P0Q0(q0_org_h, p0_org_h, p1_org_h, q1_org_h, + neg_thresh_h, tc_h, p0_h, q0_h); + DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h, + p0_h, q0_h); + DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8, + p0_h, q0_h); + p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than); + q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than); + } + + { + __m256i row0, row1, row2, row3, row4, row5, row6, row7; + __m256i control = {0x0000000400000000, 0x0000000500000001, + 0x0000000600000002, 0x0000000700000003}; + + DUP4_ARG3(__lasx_xvpermi_q, p0_org, q3_org, 0x02, p1_org, + q2_org, 0x02, p2_org, q1_org, 0x02, p3_org, + q0_org, 0x02, p0_org, p1_org, p2_org, p3_org); + DUP2_ARG2(__lasx_xvilvl_b, p1_org, p3_org, p0_org, p2_org, + row0, row2); + DUP2_ARG2(__lasx_xvilvh_b, p1_org, p3_org, p0_org, p2_org, + row1, row3); + DUP2_ARG2(__lasx_xvilvl_b, row2, row0, row3, row1, row4, row6); + DUP2_ARG2(__lasx_xvilvh_b, row2, row0, row3, row1, row5, row7); + DUP4_ARG2(__lasx_xvperm_w, row4, control, row5, control, row6, + control, row7, control, row4, row5, row6, row7); + __lasx_xvstelm_d(row4, src, 0, 0); + __lasx_xvstelm_d(row4, src + img_width, 0, 1); + src += img_width_2x; + __lasx_xvstelm_d(row4, src, 0, 2); + __lasx_xvstelm_d(row4, src + img_width, 0, 3); + src += img_width_2x; + __lasx_xvstelm_d(row5, src, 0, 0); + __lasx_xvstelm_d(row5, src + img_width, 0, 1); + src += img_width_2x; + __lasx_xvstelm_d(row5, src, 0, 2); + __lasx_xvstelm_d(row5, src + img_width, 0, 3); + src += img_width_2x; + __lasx_xvstelm_d(row6, src, 0, 0); + __lasx_xvstelm_d(row6, src + img_width, 0, 1); + src += img_width_2x; + __lasx_xvstelm_d(row6, src, 0, 2); + __lasx_xvstelm_d(row6, src + img_width, 0, 3); + src += img_width_2x; + __lasx_xvstelm_d(row7, src, 0, 0); + __lasx_xvstelm_d(row7, src + img_width, 0, 1); + src += img_width_2x; + __lasx_xvstelm_d(row7, src, 0, 2); + 
__lasx_xvstelm_d(row7, src + img_width, 0, 3); + } + } + } +} + +void ff_h264_v_lpf_luma_8_lasx(uint8_t *data, ptrdiff_t img_width, + int alpha_in, int beta_in, int8_t *tc) +{ + ptrdiff_t img_width_2x = img_width << 1; + ptrdiff_t img_width_3x = img_width + img_width_2x; + __m256i tmp_vec0, bs_vec; + __m256i tc_vec = {0x0101010100000000, 0x0303030302020202, + 0x0101010100000000, 0x0303030302020202}; + + tmp_vec0 = __lasx_xvldrepl_w((uint32_t*)tc, 0); + tc_vec = __lasx_xvshuf_b(tmp_vec0, tmp_vec0, tc_vec); + bs_vec = __lasx_xvslti_b(tc_vec, 0); + bs_vec = __lasx_xvxori_b(bs_vec, 255); + bs_vec = __lasx_xvandi_b(bs_vec, 1); + + if (__lasx_xbnz_v(bs_vec)) { + __m256i p2_org, p1_org, p0_org, q0_org, q1_org, q2_org; + __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta; + __m256i is_less_than, is_less_than_beta, is_less_than_alpha; + __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h; + __m256i is_bs_greater_than0; + __m256i zero = __lasx_xvldi(0); + + alpha = __lasx_xvreplgr2vr_b(alpha_in); + beta = __lasx_xvreplgr2vr_b(beta_in); + + DUP2_ARG2(__lasx_xvldx, data, -img_width_3x, data, -img_width_2x, + p2_org, p1_org); + p0_org = __lasx_xvldx(data, -img_width); + DUP2_ARG2(__lasx_xvldx, data, 0, data, img_width, q0_org, q1_org); + + is_bs_greater_than0 = __lasx_xvslt_bu(zero, bs_vec); + p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org); + p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org); + q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org); + + is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha); + is_less_than_beta = __lasx_xvslt_bu(p1_asub_p0, beta); + is_less_than = is_less_than_alpha & is_less_than_beta; + is_less_than_beta = __lasx_xvslt_bu(q1_asub_q0, beta); + is_less_than = is_less_than_beta & is_less_than; + is_less_than = is_less_than & is_bs_greater_than0; + + if (__lasx_xbnz_v(is_less_than)) { + __m256i neg_tc_h, tc_h, p2_asub_p0, q2_asub_q0; + + q2_org = __lasx_xvldx(data, img_width_2x); + + neg_tc_h = __lasx_xvneg_b(tc_vec); + neg_tc_h = 
__lasx_vext2xv_h_b(neg_tc_h); + tc_h = __lasx_vext2xv_hu_bu(tc_vec); + p1_org_h = __lasx_vext2xv_hu_bu(p1_org); + p0_org_h = __lasx_vext2xv_hu_bu(p0_org); + q0_org_h = __lasx_vext2xv_hu_bu(q0_org); + + p2_asub_p0 = __lasx_xvabsd_bu(p2_org, p0_org); + is_less_than_beta = __lasx_xvslt_bu(p2_asub_p0, beta); + is_less_than_beta = is_less_than_beta & is_less_than; + + if (__lasx_xbnz_v(is_less_than_beta)) { + __m256i p1_h, p2_org_h; + + p2_org_h = __lasx_vext2xv_hu_bu(p2_org); + AVC_LPF_P1_OR_Q1(p0_org_h, q0_org_h, p1_org_h, p2_org_h, + neg_tc_h, tc_h, p1_h); + p1_h = __lasx_xvpickev_b(p1_h, p1_h); + p1_h = __lasx_xvpermi_d(p1_h, 0xd8); + p1_h = __lasx_xvbitsel_v(p1_org, p1_h, is_less_than_beta); + p1_org = __lasx_xvpermi_q(p1_org, p1_h, 0x30); + __lasx_xvst(p1_org, data - img_width_2x, 0); + + is_less_than_beta = __lasx_xvandi_b(is_less_than_beta, 1); + tc_vec = __lasx_xvadd_b(tc_vec, is_less_than_beta); + } + + q2_asub_q0 = __lasx_xvabsd_bu(q2_org, q0_org); + is_less_than_beta = __lasx_xvslt_bu(q2_asub_q0, beta); + is_less_than_beta = is_less_than_beta & is_less_than; + + q1_org_h = __lasx_vext2xv_hu_bu(q1_org); + + if (__lasx_xbnz_v(is_less_than_beta)) { + __m256i q1_h, q2_org_h; + + q2_org_h = __lasx_vext2xv_hu_bu(q2_org); + AVC_LPF_P1_OR_Q1(p0_org_h, q0_org_h, q1_org_h, q2_org_h, + neg_tc_h, tc_h, q1_h); + q1_h = __lasx_xvpickev_b(q1_h, q1_h); + q1_h = __lasx_xvpermi_d(q1_h, 0xd8); + q1_h = __lasx_xvbitsel_v(q1_org, q1_h, is_less_than_beta); + q1_org = __lasx_xvpermi_q(q1_org, q1_h, 0x30); + __lasx_xvst(q1_org, data + img_width, 0); + + is_less_than_beta = __lasx_xvandi_b(is_less_than_beta, 1); + tc_vec = __lasx_xvadd_b(tc_vec, is_less_than_beta); + + } + + { + __m256i neg_thresh_h, p0_h, q0_h; + + neg_thresh_h = __lasx_xvneg_b(tc_vec); + neg_thresh_h = __lasx_vext2xv_h_b(neg_thresh_h); + tc_h = __lasx_vext2xv_hu_bu(tc_vec); + + AVC_LPF_P0Q0(q0_org_h, p0_org_h, p1_org_h, q1_org_h, + neg_thresh_h, tc_h, p0_h, q0_h); + DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, 
q0_h, + p0_h, q0_h); + DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8, + p0_h, q0_h); + p0_h = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than); + q0_h = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than); + p0_org = __lasx_xvpermi_q(p0_org, p0_h, 0x30); + q0_org = __lasx_xvpermi_q(q0_org, q0_h, 0x30); + __lasx_xvst(p0_org, data - img_width, 0); + __lasx_xvst(q0_org, data, 0); + } + } + } +} + +void ff_h264_h_lpf_chroma_8_lasx(uint8_t *data, ptrdiff_t img_width, + int alpha_in, int beta_in, int8_t *tc) +{ + __m256i tmp_vec0, bs_vec; + __m256i tc_vec = {0x0303020201010000, 0x0303020201010000, 0x0, 0x0}; + __m256i zero = __lasx_xvldi(0); + ptrdiff_t img_width_2x = img_width << 1; + ptrdiff_t img_width_4x = img_width << 2; + ptrdiff_t img_width_3x = img_width_2x + img_width; + + tmp_vec0 = __lasx_xvldrepl_w((uint32_t*)tc, 0); + tc_vec = __lasx_xvshuf_b(tmp_vec0, tmp_vec0, tc_vec); + bs_vec = __lasx_xvslti_b(tc_vec, 0); + bs_vec = __lasx_xvxori_b(bs_vec, 255); + bs_vec = __lasx_xvandi_b(bs_vec, 1); + bs_vec = __lasx_xvpermi_q(zero, bs_vec, 0x30); + + if (__lasx_xbnz_v(bs_vec)) { + uint8_t *src = data - 2; + __m256i p1_org, p0_org, q0_org, q1_org; + __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta; + __m256i is_less_than, is_less_than_beta, is_less_than_alpha; + __m256i is_bs_greater_than0; + + is_bs_greater_than0 = __lasx_xvslt_bu(zero, bs_vec); + + { + __m256i row0, row1, row2, row3, row4, row5, row6, row7; + + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, + src, img_width_3x, row0, row1, row2, row3); + src += img_width_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, + src, img_width_3x, row4, row5, row6, row7); + src -= img_width_4x; + /* LASX_TRANSPOSE8x4_B */ + DUP4_ARG2(__lasx_xvilvl_b, row2, row0, row3, row1, row6, row4, + row7, row5, p1_org, p0_org, q0_org, q1_org); + row0 = __lasx_xvilvl_b(p0_org, p1_org); + row1 = __lasx_xvilvl_b(q1_org, q0_org); + row3 = __lasx_xvilvh_w(row1, row0); + row2 =
__lasx_xvilvl_w(row1, row0); + p1_org = __lasx_xvpermi_d(row2, 0x00); + p0_org = __lasx_xvpermi_d(row2, 0x55); + q0_org = __lasx_xvpermi_d(row3, 0x00); + q1_org = __lasx_xvpermi_d(row3, 0x55); + } + + p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org); + p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org); + q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org); + + alpha = __lasx_xvreplgr2vr_b(alpha_in); + beta = __lasx_xvreplgr2vr_b(beta_in); + + is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha); + is_less_than_beta = __lasx_xvslt_bu(p1_asub_p0, beta); + is_less_than = is_less_than_alpha & is_less_than_beta; + is_less_than_beta = __lasx_xvslt_bu(q1_asub_q0, beta); + is_less_than = is_less_than_beta & is_less_than; + is_less_than = is_less_than & is_bs_greater_than0; + + if (__lasx_xbnz_v(is_less_than)) { + __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h; + + p1_org_h = __lasx_vext2xv_hu_bu(p1_org); + p0_org_h = __lasx_vext2xv_hu_bu(p0_org); + q0_org_h = __lasx_vext2xv_hu_bu(q0_org); + q1_org_h = __lasx_vext2xv_hu_bu(q1_org); + + { + __m256i tc_h, neg_thresh_h, p0_h, q0_h; + + neg_thresh_h = __lasx_xvneg_b(tc_vec); + neg_thresh_h = __lasx_vext2xv_h_b(neg_thresh_h); + tc_h = __lasx_vext2xv_hu_bu(tc_vec); + + AVC_LPF_P0Q0(q0_org_h, p0_org_h, p1_org_h, q1_org_h, + neg_thresh_h, tc_h, p0_h, q0_h); + DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h, + p0_h, q0_h); + DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8, + p0_h, q0_h); + p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than); + q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than); + } + + p0_org = __lasx_xvilvl_b(q0_org, p0_org); + src = data - 1; + __lasx_xvstelm_h(p0_org, src, 0, 0); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 1); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 2); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 3); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 4); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 5); + src += img_width; + 
__lasx_xvstelm_h(p0_org, src, 0, 6); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 7); + } + } +} + +void ff_h264_v_lpf_chroma_8_lasx(uint8_t *data, ptrdiff_t img_width, + int alpha_in, int beta_in, int8_t *tc) +{ + ptrdiff_t img_width_2x = img_width << 1; + __m256i tmp_vec0, bs_vec; + __m256i tc_vec = {0x0303020201010000, 0x0303020201010000, 0x0, 0x0}; + __m256i zero = __lasx_xvldi(0); + + tmp_vec0 = __lasx_xvldrepl_w((uint32_t*)tc, 0); + tc_vec = __lasx_xvshuf_b(tmp_vec0, tmp_vec0, tc_vec); + bs_vec = __lasx_xvslti_b(tc_vec, 0); + bs_vec = __lasx_xvxori_b(bs_vec, 255); + bs_vec = __lasx_xvandi_b(bs_vec, 1); + bs_vec = __lasx_xvpermi_q(zero, bs_vec, 0x30); + + if (__lasx_xbnz_v(bs_vec)) { + __m256i p1_org, p0_org, q0_org, q1_org; + __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta; + __m256i is_less_than, is_less_than_beta, is_less_than_alpha; + __m256i is_bs_greater_than0; + + alpha = __lasx_xvreplgr2vr_b(alpha_in); + beta = __lasx_xvreplgr2vr_b(beta_in); + + DUP2_ARG2(__lasx_xvldx, data, -img_width_2x, data, -img_width, + p1_org, p0_org); + DUP2_ARG2(__lasx_xvldx, data, 0, data, img_width, q0_org, q1_org); + + is_bs_greater_than0 = __lasx_xvslt_bu(zero, bs_vec); + p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org); + p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org); + q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org); + + is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha); + is_less_than_beta = __lasx_xvslt_bu(p1_asub_p0, beta); + is_less_than = is_less_than_alpha & is_less_than_beta; + is_less_than_beta = __lasx_xvslt_bu(q1_asub_q0, beta); + is_less_than = is_less_than_beta & is_less_than; + is_less_than = is_less_than & is_bs_greater_than0; + + if (__lasx_xbnz_v(is_less_than)) { + __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h; + + p1_org_h = __lasx_vext2xv_hu_bu(p1_org); + p0_org_h = __lasx_vext2xv_hu_bu(p0_org); + q0_org_h = __lasx_vext2xv_hu_bu(q0_org); + q1_org_h = __lasx_vext2xv_hu_bu(q1_org); + + { + __m256i neg_thresh_h, tc_h, p0_h, q0_h; + + 
neg_thresh_h = __lasx_xvneg_b(tc_vec); + neg_thresh_h = __lasx_vext2xv_h_b(neg_thresh_h); + tc_h = __lasx_vext2xv_hu_bu(tc_vec); + + AVC_LPF_P0Q0(q0_org_h, p0_org_h, p1_org_h, q1_org_h, + neg_thresh_h, tc_h, p0_h, q0_h); + DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h, + p0_h, q0_h); + DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8, + p0_h, q0_h); + p0_h = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than); + q0_h = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than); + __lasx_xvstelm_d(p0_h, data - img_width, 0, 0); + __lasx_xvstelm_d(q0_h, data, 0, 0); + } + } + } +} + +#define AVC_LPF_P0P1P2_OR_Q0Q1Q2(p3_or_q3_org_in, p0_or_q0_org_in, \ + q3_or_p3_org_in, p1_or_q1_org_in, \ + p2_or_q2_org_in, q1_or_p1_org_in, \ + p0_or_q0_out, p1_or_q1_out, p2_or_q2_out) \ +{ \ + __m256i threshold; \ + __m256i const2, const3 = __lasx_xvldi(0); \ + \ + const2 = __lasx_xvaddi_hu(const3, 2); \ + const3 = __lasx_xvaddi_hu(const3, 3); \ + threshold = __lasx_xvadd_h(p0_or_q0_org_in, q3_or_p3_org_in); \ + threshold = __lasx_xvadd_h(p1_or_q1_org_in, threshold); \ + \ + p0_or_q0_out = __lasx_xvslli_h(threshold, 1); \ + p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_out, p2_or_q2_org_in); \ + p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_out, q1_or_p1_org_in); \ + p0_or_q0_out = __lasx_xvsrar_h(p0_or_q0_out, const3); \ + \ + p1_or_q1_out = __lasx_xvadd_h(p2_or_q2_org_in, threshold); \ + p1_or_q1_out = __lasx_xvsrar_h(p1_or_q1_out, const2); \ + \ + p2_or_q2_out = __lasx_xvmul_h(p2_or_q2_org_in, const3); \ + p2_or_q2_out = __lasx_xvadd_h(p2_or_q2_out, p3_or_q3_org_in); \ + p2_or_q2_out = __lasx_xvadd_h(p2_or_q2_out, p3_or_q3_org_in); \ + p2_or_q2_out = __lasx_xvadd_h(p2_or_q2_out, threshold); \ + p2_or_q2_out = __lasx_xvsrar_h(p2_or_q2_out, const3); \ +} + +/* data[-u32_img_width] = (uint8_t)((2 * p1 + p0 + q1 + 2) >> 2); */ +#define AVC_LPF_P0_OR_Q0(p0_or_q0_org_in, q1_or_p1_org_in, \ + p1_or_q1_org_in, p0_or_q0_out) \ +{ \ + __m256i const2 = __lasx_xvldi(0); \ + const2 = __lasx_xvaddi_hu(const2, 2); \ + 
p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_org_in, q1_or_p1_org_in); \ + p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_out, p1_or_q1_org_in); \ + p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_out, p1_or_q1_org_in); \ + p0_or_q0_out = __lasx_xvsrar_h(p0_or_q0_out, const2); \ +} + +void ff_h264_h_lpf_luma_intra_8_lasx(uint8_t *data, ptrdiff_t img_width, + int alpha_in, int beta_in) +{ + ptrdiff_t img_width_2x = img_width << 1; + ptrdiff_t img_width_4x = img_width << 2; + ptrdiff_t img_width_3x = img_width_2x + img_width; + uint8_t *src = data - 4; + __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta; + __m256i is_less_than, is_less_than_beta, is_less_than_alpha; + __m256i p3_org, p2_org, p1_org, p0_org, q0_org, q1_org, q2_org, q3_org; + __m256i zero = __lasx_xvldi(0); + + { + __m256i row0, row1, row2, row3, row4, row5, row6, row7; + __m256i row8, row9, row10, row11, row12, row13, row14, row15; + + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, + src, img_width_3x, row0, row1, row2, row3); + src += img_width_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, + src, img_width_3x, row4, row5, row6, row7); + src += img_width_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, + src, img_width_3x, row8, row9, row10, row11); + src += img_width_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, + src, img_width_3x, row12, row13, row14, row15); + src += img_width_4x; + + LASX_TRANSPOSE16x8_B(row0, row1, row2, row3, + row4, row5, row6, row7, + row8, row9, row10, row11, + row12, row13, row14, row15, + p3_org, p2_org, p1_org, p0_org, + q0_org, q1_org, q2_org, q3_org); + } + + alpha = __lasx_xvreplgr2vr_b(alpha_in); + beta = __lasx_xvreplgr2vr_b(beta_in); + p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org); + p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org); + q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org); + + is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha); + is_less_than_beta = __lasx_xvslt_bu(p1_asub_p0, 
beta); + is_less_than = is_less_than_beta & is_less_than_alpha; + is_less_than_beta = __lasx_xvslt_bu(q1_asub_q0, beta); + is_less_than = is_less_than_beta & is_less_than; + is_less_than = __lasx_xvpermi_q(zero, is_less_than, 0x30); + + if (__lasx_xbnz_v(is_less_than)) { + __m256i p2_asub_p0, q2_asub_q0, p0_h, q0_h, negate_is_less_than_beta; + __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h; + __m256i less_alpha_shift2_add2 = __lasx_xvsrli_b(alpha, 2); + + less_alpha_shift2_add2 = __lasx_xvaddi_bu(less_alpha_shift2_add2, 2); + less_alpha_shift2_add2 = __lasx_xvslt_bu(p0_asub_q0, + less_alpha_shift2_add2); + + p1_org_h = __lasx_vext2xv_hu_bu(p1_org); + p0_org_h = __lasx_vext2xv_hu_bu(p0_org); + q0_org_h = __lasx_vext2xv_hu_bu(q0_org); + q1_org_h = __lasx_vext2xv_hu_bu(q1_org); + + p2_asub_p0 = __lasx_xvabsd_bu(p2_org, p0_org); + is_less_than_beta = __lasx_xvslt_bu(p2_asub_p0, beta); + is_less_than_beta = is_less_than_beta & less_alpha_shift2_add2; + negate_is_less_than_beta = __lasx_xvxori_b(is_less_than_beta, 0xff); + is_less_than_beta = is_less_than_beta & is_less_than; + negate_is_less_than_beta = negate_is_less_than_beta & is_less_than; + + /* combine and store */ + if (__lasx_xbnz_v(is_less_than_beta)) { + __m256i p2_org_h, p3_org_h, p1_h, p2_h; + + p2_org_h = __lasx_vext2xv_hu_bu(p2_org); + p3_org_h = __lasx_vext2xv_hu_bu(p3_org); + + AVC_LPF_P0P1P2_OR_Q0Q1Q2(p3_org_h, p0_org_h, q0_org_h, p1_org_h, + p2_org_h, q1_org_h, p0_h, p1_h, p2_h); + + p0_h = __lasx_xvpickev_b(p0_h, p0_h); + p0_h = __lasx_xvpermi_d(p0_h, 0xd8); + DUP2_ARG2(__lasx_xvpickev_b, p1_h, p1_h, p2_h, p2_h, p1_h, p2_h); + DUP2_ARG2(__lasx_xvpermi_d, p1_h, 0xd8, p2_h, 0xd8, p1_h, p2_h); + p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than_beta); + p1_org = __lasx_xvbitsel_v(p1_org, p1_h, is_less_than_beta); + p2_org = __lasx_xvbitsel_v(p2_org, p2_h, is_less_than_beta); + } + + AVC_LPF_P0_OR_Q0(p0_org_h, q1_org_h, p1_org_h, p0_h); + /* combine */ + p0_h = __lasx_xvpickev_b(p0_h, p0_h); + p0_h = 
__lasx_xvpermi_d(p0_h, 0xd8); + p0_org = __lasx_xvbitsel_v(p0_org, p0_h, negate_is_less_than_beta); + + /* if (tmpFlag && (unsigned)ABS(q2-q0) < thresholds->beta_in) */ + q2_asub_q0 = __lasx_xvabsd_bu(q2_org, q0_org); + is_less_than_beta = __lasx_xvslt_bu(q2_asub_q0, beta); + is_less_than_beta = is_less_than_beta & less_alpha_shift2_add2; + negate_is_less_than_beta = __lasx_xvxori_b(is_less_than_beta, 0xff); + is_less_than_beta = is_less_than_beta & is_less_than; + negate_is_less_than_beta = negate_is_less_than_beta & is_less_than; + + /* combine and store */ + if (__lasx_xbnz_v(is_less_than_beta)) { + __m256i q2_org_h, q3_org_h, q1_h, q2_h; + + q2_org_h = __lasx_vext2xv_hu_bu(q2_org); + q3_org_h = __lasx_vext2xv_hu_bu(q3_org); + + AVC_LPF_P0P1P2_OR_Q0Q1Q2(q3_org_h, q0_org_h, p0_org_h, q1_org_h, + q2_org_h, p1_org_h, q0_h, q1_h, q2_h); + + q0_h = __lasx_xvpickev_b(q0_h, q0_h); + q0_h = __lasx_xvpermi_d(q0_h, 0xd8); + DUP2_ARG2(__lasx_xvpickev_b, q1_h, q1_h, q2_h, q2_h, q1_h, q2_h); + DUP2_ARG2(__lasx_xvpermi_d, q1_h, 0xd8, q2_h, 0xd8, q1_h, q2_h); + q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than_beta); + q1_org = __lasx_xvbitsel_v(q1_org, q1_h, is_less_than_beta); + q2_org = __lasx_xvbitsel_v(q2_org, q2_h, is_less_than_beta); + + } + + AVC_LPF_P0_OR_Q0(q0_org_h, p1_org_h, q1_org_h, q0_h); + + /* combine */ + q0_h = __lasx_xvpickev_b(q0_h, q0_h); + q0_h = __lasx_xvpermi_d(q0_h, 0xd8); + q0_org = __lasx_xvbitsel_v(q0_org, q0_h, negate_is_less_than_beta); + + /* transpose and store */ + { + __m256i row0, row1, row2, row3, row4, row5, row6, row7; + __m256i control = {0x0000000400000000, 0x0000000500000001, + 0x0000000600000002, 0x0000000700000003}; + + DUP4_ARG3(__lasx_xvpermi_q, p0_org, q3_org, 0x02, p1_org, q2_org, + 0x02, p2_org, q1_org, 0x02, p3_org, q0_org, 0x02, + p0_org, p1_org, p2_org, p3_org); + DUP2_ARG2(__lasx_xvilvl_b, p1_org, p3_org, p0_org, p2_org, + row0, row2); + DUP2_ARG2(__lasx_xvilvh_b, p1_org, p3_org, p0_org, p2_org, + row1, row3); + 
DUP2_ARG2(__lasx_xvilvl_b, row2, row0, row3, row1, row4, row6); + DUP2_ARG2(__lasx_xvilvh_b, row2, row0, row3, row1, row5, row7); + DUP4_ARG2(__lasx_xvperm_w, row4, control, row5, control, row6, + control, row7, control, row4, row5, row6, row7); + src = data - 4; + __lasx_xvstelm_d(row4, src, 0, 0); + __lasx_xvstelm_d(row4, src + img_width, 0, 1); + src += img_width_2x; + __lasx_xvstelm_d(row4, src, 0, 2); + __lasx_xvstelm_d(row4, src + img_width, 0, 3); + src += img_width_2x; + __lasx_xvstelm_d(row5, src, 0, 0); + __lasx_xvstelm_d(row5, src + img_width, 0, 1); + src += img_width_2x; + __lasx_xvstelm_d(row5, src, 0, 2); + __lasx_xvstelm_d(row5, src + img_width, 0, 3); + src += img_width_2x; + __lasx_xvstelm_d(row6, src, 0, 0); + __lasx_xvstelm_d(row6, src + img_width, 0, 1); + src += img_width_2x; + __lasx_xvstelm_d(row6, src, 0, 2); + __lasx_xvstelm_d(row6, src + img_width, 0, 3); + src += img_width_2x; + __lasx_xvstelm_d(row7, src, 0, 0); + __lasx_xvstelm_d(row7, src + img_width, 0, 1); + src += img_width_2x; + __lasx_xvstelm_d(row7, src, 0, 2); + __lasx_xvstelm_d(row7, src + img_width, 0, 3); + } + } +} + +void ff_h264_v_lpf_luma_intra_8_lasx(uint8_t *data, ptrdiff_t img_width, + int alpha_in, int beta_in) +{ + ptrdiff_t img_width_2x = img_width << 1; + ptrdiff_t img_width_3x = img_width_2x + img_width; + uint8_t *src = data - img_width_2x; + __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta; + __m256i is_less_than, is_less_than_beta, is_less_than_alpha; + __m256i p1_org, p0_org, q0_org, q1_org; + __m256i zero = __lasx_xvldi(0); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, + src, img_width_3x, p1_org, p0_org, q0_org, q1_org); + alpha = __lasx_xvreplgr2vr_b(alpha_in); + beta = __lasx_xvreplgr2vr_b(beta_in); + p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org); + p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org); + q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org); + + is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha); + is_less_than_beta 
= __lasx_xvslt_bu(p1_asub_p0, beta); + is_less_than = is_less_than_beta & is_less_than_alpha; + is_less_than_beta = __lasx_xvslt_bu(q1_asub_q0, beta); + is_less_than = is_less_than_beta & is_less_than; + is_less_than = __lasx_xvpermi_q(zero, is_less_than, 0x30); + + if (__lasx_xbnz_v(is_less_than)) { + __m256i p2_asub_p0, q2_asub_q0, p0_h, q0_h, negate_is_less_than_beta; + __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h; + __m256i p2_org = __lasx_xvldx(src, -img_width); + __m256i q2_org = __lasx_xvldx(data, img_width_2x); + __m256i less_alpha_shift2_add2 = __lasx_xvsrli_b(alpha, 2); + less_alpha_shift2_add2 = __lasx_xvaddi_bu(less_alpha_shift2_add2, 2); + less_alpha_shift2_add2 = __lasx_xvslt_bu(p0_asub_q0, + less_alpha_shift2_add2); + + p1_org_h = __lasx_vext2xv_hu_bu(p1_org); + p0_org_h = __lasx_vext2xv_hu_bu(p0_org); + q0_org_h = __lasx_vext2xv_hu_bu(q0_org); + q1_org_h = __lasx_vext2xv_hu_bu(q1_org); + + p2_asub_p0 = __lasx_xvabsd_bu(p2_org, p0_org); + is_less_than_beta = __lasx_xvslt_bu(p2_asub_p0, beta); + is_less_than_beta = is_less_than_beta & less_alpha_shift2_add2; + negate_is_less_than_beta = __lasx_xvxori_b(is_less_than_beta, 0xff); + is_less_than_beta = is_less_than_beta & is_less_than; + negate_is_less_than_beta = negate_is_less_than_beta & is_less_than; + + /* combine and store */ + if (__lasx_xbnz_v(is_less_than_beta)) { + __m256i p2_org_h, p3_org_h, p1_h, p2_h; + __m256i p3_org = __lasx_xvldx(src, -img_width_2x); + + p2_org_h = __lasx_vext2xv_hu_bu(p2_org); + p3_org_h = __lasx_vext2xv_hu_bu(p3_org); + + AVC_LPF_P0P1P2_OR_Q0Q1Q2(p3_org_h, p0_org_h, q0_org_h, p1_org_h, + p2_org_h, q1_org_h, p0_h, p1_h, p2_h); + + p0_h = __lasx_xvpickev_b(p0_h, p0_h); + p0_h = __lasx_xvpermi_d(p0_h, 0xd8); + DUP2_ARG2(__lasx_xvpickev_b, p1_h, p1_h, p2_h, p2_h, p1_h, p2_h); + DUP2_ARG2(__lasx_xvpermi_d, p1_h, 0xd8, p2_h, 0xd8, p1_h, p2_h); + p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than_beta); + p1_org = __lasx_xvbitsel_v(p1_org, p1_h, is_less_than_beta); + 
p2_org = __lasx_xvbitsel_v(p2_org, p2_h, is_less_than_beta); + + __lasx_xvst(p1_org, src, 0); + __lasx_xvst(p2_org, src - img_width, 0); + } + + AVC_LPF_P0_OR_Q0(p0_org_h, q1_org_h, p1_org_h, p0_h); + /* combine */ + p0_h = __lasx_xvpickev_b(p0_h, p0_h); + p0_h = __lasx_xvpermi_d(p0_h, 0xd8); + p0_org = __lasx_xvbitsel_v(p0_org, p0_h, negate_is_less_than_beta); + __lasx_xvst(p0_org, data - img_width, 0); + + /* if (tmpFlag && (unsigned)ABS(q2-q0) < thresholds->beta_in) */ + q2_asub_q0 = __lasx_xvabsd_bu(q2_org, q0_org); + is_less_than_beta = __lasx_xvslt_bu(q2_asub_q0, beta); + is_less_than_beta = is_less_than_beta & less_alpha_shift2_add2; + negate_is_less_than_beta = __lasx_xvxori_b(is_less_than_beta, 0xff); + is_less_than_beta = is_less_than_beta & is_less_than; + negate_is_less_than_beta = negate_is_less_than_beta & is_less_than; + + /* combine and store */ + if (__lasx_xbnz_v(is_less_than_beta)) { + __m256i q2_org_h, q3_org_h, q1_h, q2_h; + __m256i q3_org = __lasx_xvldx(data, img_width_2x + img_width); + + q2_org_h = __lasx_vext2xv_hu_bu(q2_org); + q3_org_h = __lasx_vext2xv_hu_bu(q3_org); + + AVC_LPF_P0P1P2_OR_Q0Q1Q2(q3_org_h, q0_org_h, p0_org_h, q1_org_h, + q2_org_h, p1_org_h, q0_h, q1_h, q2_h); + + q0_h = __lasx_xvpickev_b(q0_h, q0_h); + q0_h = __lasx_xvpermi_d(q0_h, 0xd8); + DUP2_ARG2(__lasx_xvpickev_b, q1_h, q1_h, q2_h, q2_h, q1_h, q2_h); + DUP2_ARG2(__lasx_xvpermi_d, q1_h, 0xd8, q2_h, 0xd8, q1_h, q2_h); + q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than_beta); + q1_org = __lasx_xvbitsel_v(q1_org, q1_h, is_less_than_beta); + q2_org = __lasx_xvbitsel_v(q2_org, q2_h, is_less_than_beta); + + __lasx_xvst(q1_org, data + img_width, 0); + __lasx_xvst(q2_org, data + img_width_2x, 0); + } + + AVC_LPF_P0_OR_Q0(q0_org_h, p1_org_h, q1_org_h, q0_h); + + /* combine */ + q0_h = __lasx_xvpickev_b(q0_h, q0_h); + q0_h = __lasx_xvpermi_d(q0_h, 0xd8); + q0_org = __lasx_xvbitsel_v(q0_org, q0_h, negate_is_less_than_beta); + + __lasx_xvst(q0_org, data, 0); + } +} + +void 
ff_h264_h_lpf_chroma_intra_8_lasx(uint8_t *data, ptrdiff_t img_width, + int alpha_in, int beta_in) +{ + uint8_t *src = data - 2; + ptrdiff_t img_width_2x = img_width << 1; + ptrdiff_t img_width_4x = img_width << 2; + ptrdiff_t img_width_3x = img_width_2x + img_width; + __m256i p1_org, p0_org, q0_org, q1_org; + __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta; + __m256i is_less_than, is_less_than_beta, is_less_than_alpha; + + { + __m256i row0, row1, row2, row3, row4, row5, row6, row7; + + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, src, + img_width_3x, row0, row1, row2, row3); + src += img_width_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, src, + img_width_3x, row4, row5, row6, row7); + + /* LASX_TRANSPOSE8x4_B */ + DUP4_ARG2(__lasx_xvilvl_b, row2, row0, row3, row1, row6, row4, row7, row5, + p1_org, p0_org, q0_org, q1_org); + row0 = __lasx_xvilvl_b(p0_org, p1_org); + row1 = __lasx_xvilvl_b(q1_org, q0_org); + row3 = __lasx_xvilvh_w(row1, row0); + row2 = __lasx_xvilvl_w(row1, row0); + p1_org = __lasx_xvpermi_d(row2, 0x00); + p0_org = __lasx_xvpermi_d(row2, 0x55); + q0_org = __lasx_xvpermi_d(row3, 0x00); + q1_org = __lasx_xvpermi_d(row3, 0x55); + } + + alpha = __lasx_xvreplgr2vr_b(alpha_in); + beta = __lasx_xvreplgr2vr_b(beta_in); + + p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org); + p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org); + q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org); + + is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha); + is_less_than_beta = __lasx_xvslt_bu(p1_asub_p0, beta); + is_less_than = is_less_than_alpha & is_less_than_beta; + is_less_than_beta = __lasx_xvslt_bu(q1_asub_q0, beta); + is_less_than = is_less_than_beta & is_less_than; + + if (__lasx_xbnz_v(is_less_than)) { + __m256i p0_h, q0_h, p1_org_h, p0_org_h, q0_org_h, q1_org_h; + + p1_org_h = __lasx_vext2xv_hu_bu(p1_org); + p0_org_h = __lasx_vext2xv_hu_bu(p0_org); + q0_org_h = __lasx_vext2xv_hu_bu(q0_org); + q1_org_h = 
__lasx_vext2xv_hu_bu(q1_org); + + AVC_LPF_P0_OR_Q0(p0_org_h, q1_org_h, p1_org_h, p0_h); + AVC_LPF_P0_OR_Q0(q0_org_h, p1_org_h, q1_org_h, q0_h); + DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h, p0_h, q0_h); + DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8, p0_h, q0_h); + p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than); + q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than); + } + p0_org = __lasx_xvilvl_b(q0_org, p0_org); + src = data - 1; + __lasx_xvstelm_h(p0_org, src, 0, 0); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 1); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 2); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 3); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 4); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 5); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 6); + src += img_width; + __lasx_xvstelm_h(p0_org, src, 0, 7); +} + +void ff_h264_v_lpf_chroma_intra_8_lasx(uint8_t *data, ptrdiff_t img_width, + int alpha_in, int beta_in) +{ + ptrdiff_t img_width_2x = img_width << 1; + __m256i p1_org, p0_org, q0_org, q1_org; + __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta; + __m256i is_less_than, is_less_than_beta, is_less_than_alpha; + + alpha = __lasx_xvreplgr2vr_b(alpha_in); + beta = __lasx_xvreplgr2vr_b(beta_in); + + p1_org = __lasx_xvldx(data, -img_width_2x); + p0_org = __lasx_xvldx(data, -img_width); + DUP2_ARG2(__lasx_xvldx, data, 0, data, img_width, q0_org, q1_org); + + p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org); + p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org); + q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org); + + is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha); + is_less_than_beta = __lasx_xvslt_bu(p1_asub_p0, beta); + is_less_than = is_less_than_alpha & is_less_than_beta; + is_less_than_beta = __lasx_xvslt_bu(q1_asub_q0, beta); + is_less_than = is_less_than_beta & is_less_than; + + if (__lasx_xbnz_v(is_less_than)) { + __m256i p0_h, q0_h, p1_org_h, p0_org_h, q0_org_h, 
q1_org_h; + + p1_org_h = __lasx_vext2xv_hu_bu(p1_org); + p0_org_h = __lasx_vext2xv_hu_bu(p0_org); + q0_org_h = __lasx_vext2xv_hu_bu(q0_org); + q1_org_h = __lasx_vext2xv_hu_bu(q1_org); + + AVC_LPF_P0_OR_Q0(p0_org_h, q1_org_h, p1_org_h, p0_h); + AVC_LPF_P0_OR_Q0(q0_org_h, p1_org_h, q1_org_h, q0_h); + DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h, p0_h, q0_h); + DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8, p0_h, q0_h); + p0_h = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than); + q0_h = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than); + __lasx_xvstelm_d(p0_h, data - img_width, 0, 0); + __lasx_xvstelm_d(q0_h, data, 0, 0); + } +} + +void ff_biweight_h264_pixels16_8_lasx(uint8_t *dst, uint8_t *src, + ptrdiff_t stride, int height, + int log2_denom, int weight_dst, + int weight_src, int offset_in) +{ + __m256i wgt; + __m256i src0, src1, src2, src3; + __m256i dst0, dst1, dst2, dst3; + __m256i vec0, vec1, vec2, vec3, vec4, vec5, vec6, vec7; + __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7; + __m256i denom, offset; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + + offset_in = (unsigned) ((offset_in + 1) | 1) << log2_denom; + offset_in += ((weight_src + weight_dst) << 7); + log2_denom += 1; + + tmp0 = __lasx_xvreplgr2vr_b(weight_src); + tmp1 = __lasx_xvreplgr2vr_b(weight_dst); + wgt = __lasx_xvilvh_b(tmp1, tmp0); + offset = __lasx_xvreplgr2vr_h(offset_in); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp4, tmp5, tmp6, tmp7); + src += stride_4x; + DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5, tmp4, + 0x20, tmp7, tmp6, 0x20, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, + dst, stride_3x, tmp0, tmp1, tmp2, tmp3); + dst += stride_4x; + 
DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, + dst, stride_3x, tmp4, tmp5, tmp6, tmp7); + dst -= stride_4x; + DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5, tmp4, + 0x20, tmp7, tmp6, 0x20, dst0, dst1, dst2, dst3); + + DUP4_ARG2(__lasx_xvxori_b, src0, 128, src1, 128, src2, 128, src3, 128, + src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvxori_b, dst0, 128, dst1, 128, dst2, 128, dst3, 128, + dst0, dst1, dst2, dst3); + DUP4_ARG2(__lasx_xvilvl_b, dst0, src0, dst1, src1, dst2, src2, + dst3, src3, vec0, vec2, vec4, vec6); + DUP4_ARG2(__lasx_xvilvh_b, dst0, src0, dst1, src1, dst2, src2, + dst3, src3, vec1, vec3, vec5, vec7); + + DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1, + offset, wgt, vec2, offset, wgt, vec3, tmp0, tmp1, tmp2, tmp3); + DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec4, offset, wgt, vec5, + offset, wgt, vec6, offset, wgt, vec7, tmp4, tmp5, tmp6, tmp7); + + tmp0 = __lasx_xvsra_h(tmp0, denom); + tmp1 = __lasx_xvsra_h(tmp1, denom); + tmp2 = __lasx_xvsra_h(tmp2, denom); + tmp3 = __lasx_xvsra_h(tmp3, denom); + tmp4 = __lasx_xvsra_h(tmp4, denom); + tmp5 = __lasx_xvsra_h(tmp5, denom); + tmp6 = __lasx_xvsra_h(tmp6, denom); + tmp7 = __lasx_xvsra_h(tmp7, denom); + + DUP4_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp2, tmp3, + tmp0, tmp1, tmp2, tmp3); + DUP4_ARG1(__lasx_xvclip255_h, tmp4, tmp5, tmp6, tmp7, + tmp4, tmp5, tmp6, tmp7); + DUP4_ARG2(__lasx_xvpickev_b, tmp1, tmp0, tmp3, tmp2, tmp5, tmp4, tmp7, tmp6, + dst0, dst1, dst2, dst3); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst, 8, 1); + dst += stride; + __lasx_xvstelm_d(dst0, dst, 0, 2); + __lasx_xvstelm_d(dst0, dst, 8, 3); + dst += stride; + __lasx_xvstelm_d(dst1, dst, 0, 0); + __lasx_xvstelm_d(dst1, dst, 8, 1); + dst += stride; + __lasx_xvstelm_d(dst1, dst, 0, 2); + __lasx_xvstelm_d(dst1, dst, 8, 3); + dst += stride; + __lasx_xvstelm_d(dst2, dst, 0, 0); + __lasx_xvstelm_d(dst2, dst, 8, 1); + dst += stride; + __lasx_xvstelm_d(dst2, 
dst, 0, 2); + __lasx_xvstelm_d(dst2, dst, 8, 3); + dst += stride; + __lasx_xvstelm_d(dst3, dst, 0, 0); + __lasx_xvstelm_d(dst3, dst, 8, 1); + dst += stride; + __lasx_xvstelm_d(dst3, dst, 0, 2); + __lasx_xvstelm_d(dst3, dst, 8, 3); + dst += stride; + + if (16 == height) { + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp4, tmp5, tmp6, tmp7); + src += stride_4x; + DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5, + tmp4, 0x20, tmp7, tmp6, 0x20, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, + dst, stride_3x, tmp0, tmp1, tmp2, tmp3); + dst += stride_4x; + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, + dst, stride_3x, tmp4, tmp5, tmp6, tmp7); + dst -= stride_4x; + DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5, + tmp4, 0x20, tmp7, tmp6, 0x20, dst0, dst1, dst2, dst3); + + DUP4_ARG2(__lasx_xvxori_b, src0, 128, src1, 128, src2, 128, src3, 128, + src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvxori_b, dst0, 128, dst1, 128, dst2, 128, dst3, 128, + dst0, dst1, dst2, dst3); + DUP4_ARG2(__lasx_xvilvl_b, dst0, src0, dst1, src1, dst2, src2, + dst3, src3, vec0, vec2, vec4, vec6); + DUP4_ARG2(__lasx_xvilvh_b, dst0, src0, dst1, src1, dst2, src2, + dst3, src3, vec1, vec3, vec5, vec7); + + DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1, + offset, wgt, vec2, offset, wgt, vec3, tmp0, tmp1, tmp2, tmp3); + DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec4, offset, wgt, vec5, + offset, wgt, vec6, offset, wgt, vec7, tmp4, tmp5, tmp6, tmp7); + + tmp0 = __lasx_xvsra_h(tmp0, denom); + tmp1 = __lasx_xvsra_h(tmp1, denom); + tmp2 = __lasx_xvsra_h(tmp2, denom); + tmp3 = __lasx_xvsra_h(tmp3, denom); + tmp4 = __lasx_xvsra_h(tmp4, denom); + tmp5 = __lasx_xvsra_h(tmp5, denom); + tmp6 = __lasx_xvsra_h(tmp6, denom); + tmp7 = 
__lasx_xvsra_h(tmp7, denom); + + DUP4_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp2, tmp3, + tmp0, tmp1, tmp2, tmp3); + DUP4_ARG1(__lasx_xvclip255_h, tmp4, tmp5, tmp6, tmp7, + tmp4, tmp5, tmp6, tmp7); + DUP4_ARG2(__lasx_xvpickev_b, tmp1, tmp0, tmp3, tmp2, tmp5, tmp4, tmp7, + tmp6, dst0, dst1, dst2, dst3); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst, 8, 1); + dst += stride; + __lasx_xvstelm_d(dst0, dst, 0, 2); + __lasx_xvstelm_d(dst0, dst, 8, 3); + dst += stride; + __lasx_xvstelm_d(dst1, dst, 0, 0); + __lasx_xvstelm_d(dst1, dst, 8, 1); + dst += stride; + __lasx_xvstelm_d(dst1, dst, 0, 2); + __lasx_xvstelm_d(dst1, dst, 8, 3); + dst += stride; + __lasx_xvstelm_d(dst2, dst, 0, 0); + __lasx_xvstelm_d(dst2, dst, 8, 1); + dst += stride; + __lasx_xvstelm_d(dst2, dst, 0, 2); + __lasx_xvstelm_d(dst2, dst, 8, 3); + dst += stride; + __lasx_xvstelm_d(dst3, dst, 0, 0); + __lasx_xvstelm_d(dst3, dst, 8, 1); + dst += stride; + __lasx_xvstelm_d(dst3, dst, 0, 2); + __lasx_xvstelm_d(dst3, dst, 8, 3); + } +} + +static void avc_biwgt_8x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t weight_dst, int32_t offset_in) +{ + __m256i wgt, vec0, vec1; + __m256i src0, dst0; + __m256i tmp0, tmp1, tmp2, tmp3, denom, offset; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + + offset_in = (unsigned) ((offset_in + 1) | 1) << log2_denom; + offset_in += ((weight_src + weight_dst) << 7); + log2_denom += 1; + + tmp0 = __lasx_xvreplgr2vr_b(weight_src); + tmp1 = __lasx_xvreplgr2vr_b(weight_dst); + wgt = __lasx_xvilvh_b(tmp1, tmp0); + offset = __lasx_xvreplgr2vr_h(offset_in); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, + dst, 
stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + dst0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP2_ARG2(__lasx_xvxori_b, src0, 128, dst0, 128, src0, dst0); + vec0 = __lasx_xvilvl_b(dst0, src0); + vec1 = __lasx_xvilvh_b(dst0, src0); + DUP2_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1, + tmp0, tmp1); + tmp0 = __lasx_xvsra_h(tmp0, denom); + tmp1 = __lasx_xvsra_h(tmp1, denom); + DUP2_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp0, tmp1); + dst0 = __lasx_xvpickev_b(tmp1, tmp0); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst + stride, 0, 1); + __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3); +} + +static void avc_biwgt_8x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t weight_dst, int32_t offset_in) +{ + __m256i wgt, vec0, vec1, vec2, vec3; + __m256i src0, src1, dst0, dst1; + __m256i tmp0, tmp1, tmp2, tmp3, denom, offset; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + uint8_t* dst_tmp = dst; + + offset_in = (unsigned) ((offset_in + 1) | 1) << log2_denom; + offset_in += ((weight_src + weight_dst) << 7); + log2_denom += 1; + + tmp0 = __lasx_xvreplgr2vr_b(weight_src); + tmp1 = __lasx_xvreplgr2vr_b(weight_dst); + wgt = __lasx_xvilvh_b(tmp1, tmp0); + offset = __lasx_xvreplgr2vr_h(offset_in); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + tmp0 = __lasx_xvld(dst_tmp, 0); + 
DUP2_ARG2(__lasx_xvldx, dst_tmp, stride, dst_tmp, stride_2x, tmp1, tmp2); + tmp3 = __lasx_xvldx(dst_tmp, stride_3x); + dst_tmp += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + dst0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x, + dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + dst1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + + DUP4_ARG2(__lasx_xvxori_b, src0, 128, src1, 128, dst0, 128, dst1, 128, + src0, src1, dst0, dst1); + DUP2_ARG2(__lasx_xvilvl_b, dst0, src0, dst1, src1, vec0, vec2); + DUP2_ARG2(__lasx_xvilvh_b, dst0, src0, dst1, src1, vec1, vec3); + DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1, + offset, wgt, vec2, offset, wgt, vec3, tmp0, tmp1, tmp2, tmp3); + tmp0 = __lasx_xvsra_h(tmp0, denom); + tmp1 = __lasx_xvsra_h(tmp1, denom); + tmp2 = __lasx_xvsra_h(tmp2, denom); + tmp3 = __lasx_xvsra_h(tmp3, denom); + DUP4_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp2, tmp3, + tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvpickev_b, tmp1, tmp0, tmp3, tmp2, dst0, dst1); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst + stride, 0, 1); + __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_d(dst1, dst, 0, 0); + __lasx_xvstelm_d(dst1, dst + stride, 0, 1); + __lasx_xvstelm_d(dst1, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst1, dst + stride_3x, 0, 3); +} + +static void avc_biwgt_8x16_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t weight_dst, int32_t offset_in) +{ + __m256i wgt, vec0, vec1, vec2, vec3, vec4, vec5, vec6, vec7; + __m256i src0, src1, src2, src3, dst0, dst1, dst2, dst3; + __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, denom, offset; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x 
= stride_2x + stride; + uint8_t* dst_tmp = dst; + + offset_in = (unsigned) ((offset_in + 1) | 1) << log2_denom; + offset_in += ((weight_src + weight_dst) << 7); + log2_denom += 1; + + tmp0 = __lasx_xvreplgr2vr_b(weight_src); + tmp1 = __lasx_xvreplgr2vr_b(weight_dst); + wgt = __lasx_xvilvh_b(tmp1, tmp0); + offset = __lasx_xvreplgr2vr_h(offset_in); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src2 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src3 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + + DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x, + dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + dst_tmp += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + dst0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x, + dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + dst_tmp += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + dst1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x, + dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + dst_tmp += stride_4x; + 
DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + dst2 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x, + dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + dst3 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + + DUP4_ARG2(__lasx_xvxori_b, src0, 128, src1, 128, src2, 128, src3, 128, + src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvxori_b, dst0, 128, dst1, 128, dst2, 128, dst3, 128, + dst0, dst1, dst2, dst3); + DUP4_ARG2(__lasx_xvilvl_b, dst0, src0, dst1, src1, dst2, src2, + dst3, src3, vec0, vec2, vec4, vec6); + DUP4_ARG2(__lasx_xvilvh_b, dst0, src0, dst1, src1, dst2, src2, + dst3, src3, vec1, vec3, vec5, vec7); + DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1, + offset, wgt, vec2, offset, wgt, vec3, tmp0, tmp1, tmp2, tmp3); + DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec4, offset, wgt, vec5, + offset, wgt, vec6, offset, wgt, vec7, tmp4, tmp5, tmp6, tmp7); + tmp0 = __lasx_xvsra_h(tmp0, denom); + tmp1 = __lasx_xvsra_h(tmp1, denom); + tmp2 = __lasx_xvsra_h(tmp2, denom); + tmp3 = __lasx_xvsra_h(tmp3, denom); + tmp4 = __lasx_xvsra_h(tmp4, denom); + tmp5 = __lasx_xvsra_h(tmp5, denom); + tmp6 = __lasx_xvsra_h(tmp6, denom); + tmp7 = __lasx_xvsra_h(tmp7, denom); + DUP4_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp2, tmp3, + tmp0, tmp1, tmp2, tmp3); + DUP4_ARG1(__lasx_xvclip255_h, tmp4, tmp5, tmp6, tmp7, + tmp4, tmp5, tmp6, tmp7); + DUP4_ARG2(__lasx_xvpickev_b, tmp1, tmp0, tmp3, tmp2, tmp5, tmp4, tmp7, tmp6, + dst0, dst1, dst2, dst3); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst + stride, 0, 1); + __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_d(dst1, dst, 0, 0); + __lasx_xvstelm_d(dst1, dst + stride, 0, 1); + __lasx_xvstelm_d(dst1, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst1, dst + stride_3x, 0, 3); + dst += stride_4x; 
+ __lasx_xvstelm_d(dst2, dst, 0, 0); + __lasx_xvstelm_d(dst2, dst + stride, 0, 1); + __lasx_xvstelm_d(dst2, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst2, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_d(dst3, dst, 0, 0); + __lasx_xvstelm_d(dst3, dst + stride, 0, 1); + __lasx_xvstelm_d(dst3, dst + stride_2x, 0, 2); + __lasx_xvstelm_d(dst3, dst + stride_3x, 0, 3); +} + +void ff_biweight_h264_pixels8_8_lasx(uint8_t *dst, uint8_t *src, + ptrdiff_t stride, int height, + int log2_denom, int weight_dst, + int weight_src, int offset) +{ + if (4 == height) { + avc_biwgt_8x4_lasx(src, dst, stride, log2_denom, weight_src, weight_dst, + offset); + } else if (8 == height) { + avc_biwgt_8x8_lasx(src, dst, stride, log2_denom, weight_src, weight_dst, + offset); + } else { + avc_biwgt_8x16_lasx(src, dst, stride, log2_denom, weight_src, weight_dst, + offset); + } +} + +static void avc_biwgt_4x2_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t weight_dst, int32_t offset_in) +{ + __m256i wgt, vec0; + __m256i src0, dst0; + __m256i tmp0, tmp1, denom, offset; + + offset_in = (unsigned) ((offset_in + 1) | 1) << log2_denom; + offset_in += ((weight_src + weight_dst) << 7); + log2_denom += 1; + + tmp0 = __lasx_xvreplgr2vr_b(weight_src); + tmp1 = __lasx_xvreplgr2vr_b(weight_dst); + wgt = __lasx_xvilvh_b(tmp1, tmp0); + offset = __lasx_xvreplgr2vr_h(offset_in); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP2_ARG2(__lasx_xvldx, src, 0, src, stride, tmp0, tmp1); + src0 = __lasx_xvilvl_w(tmp1, tmp0); + DUP2_ARG2(__lasx_xvldx, dst, 0, dst, stride, tmp0, tmp1); + dst0 = __lasx_xvilvl_w(tmp1, tmp0); + DUP2_ARG2(__lasx_xvxori_b, src0, 128, dst0, 128, src0, dst0); + vec0 = __lasx_xvilvl_b(dst0, src0); + tmp0 = __lasx_xvdp2add_h_b(offset, wgt, vec0); + tmp0 = __lasx_xvsra_h(tmp0, denom); + tmp0 = __lasx_xvclip255_h(tmp0); + tmp0 = __lasx_xvpickev_b(tmp0, tmp0); + __lasx_xvstelm_w(tmp0, dst, 0, 0); + __lasx_xvstelm_w(tmp0, dst + 
stride, 0, 1); +} + +static void avc_biwgt_4x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t weight_dst, int32_t offset_in) +{ + __m256i wgt, vec0; + __m256i src0, dst0; + __m256i tmp0, tmp1, tmp2, tmp3, denom, offset; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + + offset_in = (unsigned) ((offset_in + 1) | 1) << log2_denom; + offset_in += ((weight_src + weight_dst) << 7); + log2_denom += 1; + + tmp0 = __lasx_xvreplgr2vr_b(weight_src); + tmp1 = __lasx_xvreplgr2vr_b(weight_dst); + wgt = __lasx_xvilvh_b(tmp1, tmp0); + offset = __lasx_xvreplgr2vr_h(offset_in); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp0, tmp1); + src0 = __lasx_xvilvl_w(tmp1, tmp0); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, + dst, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp0, tmp1); + dst0 = __lasx_xvilvl_w(tmp1, tmp0); + DUP2_ARG2(__lasx_xvxori_b, src0, 128, dst0, 128, src0, dst0); + vec0 = __lasx_xvilvl_b(dst0, src0); + dst0 = __lasx_xvilvh_b(dst0, src0); + vec0 = __lasx_xvpermi_q(vec0, dst0, 0x02); + tmp0 = __lasx_xvdp2add_h_b(offset, wgt, vec0); + tmp0 = __lasx_xvsra_h(tmp0, denom); + tmp0 = __lasx_xvclip255_h(tmp0); + tmp0 = __lasx_xvpickev_b(tmp0, tmp0); + __lasx_xvstelm_w(tmp0, dst, 0, 0); + __lasx_xvstelm_w(tmp0, dst + stride, 0, 1); + __lasx_xvstelm_w(tmp0, dst + stride_2x, 0, 4); + __lasx_xvstelm_w(tmp0, dst + stride_3x, 0, 5); +} + +static void avc_biwgt_4x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t weight_dst, int32_t offset_in) +{ + __m256i wgt, vec0, vec1; + __m256i src0, dst0; + __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, denom, offset; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t 
stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + + offset_in = (unsigned) ((offset_in + 1) | 1) << log2_denom; + offset_in += ((weight_src + weight_dst) << 7); + log2_denom += 1; + + tmp0 = __lasx_xvreplgr2vr_b(weight_src); + tmp1 = __lasx_xvreplgr2vr_b(weight_dst); + wgt = __lasx_xvilvh_b(tmp1, tmp0); + offset = __lasx_xvreplgr2vr_h(offset_in); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp4, tmp5, tmp6, tmp7); + DUP4_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp6, tmp4, tmp7, tmp5, + tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_w, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, + dst, stride_3x, tmp0, tmp1, tmp2, tmp3); + dst += stride_4x; + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, + dst, stride_3x, tmp4, tmp5, tmp6, tmp7); + dst -= stride_4x; + DUP4_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp6, tmp4, tmp7, tmp5, + tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_w, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + dst0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP2_ARG2(__lasx_xvxori_b, src0, 128, dst0, 128, src0, dst0); + vec0 = __lasx_xvilvl_b(dst0, src0); + vec1 = __lasx_xvilvh_b(dst0, src0); + DUP2_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1, + tmp0, tmp1); + tmp0 = __lasx_xvsra_h(tmp0, denom); + tmp1 = __lasx_xvsra_h(tmp1, denom); + DUP2_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp0, tmp1); + tmp0 = __lasx_xvpickev_b(tmp1, tmp0); + __lasx_xvstelm_w(tmp0, dst, 0, 0); + __lasx_xvstelm_w(tmp0, dst + stride, 0, 1); + __lasx_xvstelm_w(tmp0, dst + stride_2x, 0, 2); + __lasx_xvstelm_w(tmp0, dst + stride_3x, 0, 3); + dst += stride_4x; + __lasx_xvstelm_w(tmp0, dst, 0, 4); + __lasx_xvstelm_w(tmp0, dst + 
stride, 0, 5); + __lasx_xvstelm_w(tmp0, dst + stride_2x, 0, 6); + __lasx_xvstelm_w(tmp0, dst + stride_3x, 0, 7); +} + +void ff_biweight_h264_pixels4_8_lasx(uint8_t *dst, uint8_t *src, + ptrdiff_t stride, int height, + int log2_denom, int weight_dst, + int weight_src, int offset) +{ + if (2 == height) { + avc_biwgt_4x2_lasx(src, dst, stride, log2_denom, weight_src, + weight_dst, offset); + } else if (4 == height) { + avc_biwgt_4x4_lasx(src, dst, stride, log2_denom, weight_src, + weight_dst, offset); + } else { + avc_biwgt_4x8_lasx(src, dst, stride, log2_denom, weight_src, + weight_dst, offset); + } +} + +void ff_weight_h264_pixels16_8_lasx(uint8_t *src, ptrdiff_t stride, + int height, int log2_denom, + int weight_src, int offset_in) +{ + uint32_t offset_val; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + __m256i zero = __lasx_xvldi(0); + __m256i src0, src1, src2, src3; + __m256i src0_l, src1_l, src2_l, src3_l, src0_h, src1_h, src2_h, src3_h; + __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7; + __m256i wgt, denom, offset; + + offset_val = (unsigned) offset_in << log2_denom; + + wgt = __lasx_xvreplgr2vr_h(weight_src); + offset = __lasx_xvreplgr2vr_h(offset_val); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp4, tmp5, tmp6, tmp7); + src -= stride_4x; + DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5, tmp4, + 0x20, tmp7, tmp6, 0x20, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvilvl_b, zero, src0, zero, src1, zero, src2, + zero, src3, src0_l, src1_l, src2_l, src3_l); + DUP4_ARG2(__lasx_xvilvh_b, zero, src0, zero, src1, zero, src2, + zero, src3, src0_h, src1_h, src2_h, src3_h); + src0_l = __lasx_xvmul_h(wgt, src0_l); + src0_h = __lasx_xvmul_h(wgt, src0_h); + src1_l = 
__lasx_xvmul_h(wgt, src1_l); + src1_h = __lasx_xvmul_h(wgt, src1_h); + src2_l = __lasx_xvmul_h(wgt, src2_l); + src2_h = __lasx_xvmul_h(wgt, src2_h); + src3_l = __lasx_xvmul_h(wgt, src3_l); + src3_h = __lasx_xvmul_h(wgt, src3_h); + DUP4_ARG2(__lasx_xvsadd_h, src0_l, offset, src0_h, offset, src1_l, offset, + src1_h, offset, src0_l, src0_h, src1_l, src1_h); + DUP4_ARG2(__lasx_xvsadd_h, src2_l, offset, src2_h, offset, src3_l, offset, + src3_h, offset, src2_l, src2_h, src3_l, src3_h); + src0_l = __lasx_xvmaxi_h(src0_l, 0); + src0_h = __lasx_xvmaxi_h(src0_h, 0); + src1_l = __lasx_xvmaxi_h(src1_l, 0); + src1_h = __lasx_xvmaxi_h(src1_h, 0); + src2_l = __lasx_xvmaxi_h(src2_l, 0); + src2_h = __lasx_xvmaxi_h(src2_h, 0); + src3_l = __lasx_xvmaxi_h(src3_l, 0); + src3_h = __lasx_xvmaxi_h(src3_h, 0); + src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom); + src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom); + src1_l = __lasx_xvssrlrn_bu_h(src1_l, denom); + src1_h = __lasx_xvssrlrn_bu_h(src1_h, denom); + src2_l = __lasx_xvssrlrn_bu_h(src2_l, denom); + src2_h = __lasx_xvssrlrn_bu_h(src2_h, denom); + src3_l = __lasx_xvssrlrn_bu_h(src3_l, denom); + src3_h = __lasx_xvssrlrn_bu_h(src3_h, denom); + __lasx_xvstelm_d(src0_l, src, 0, 0); + __lasx_xvstelm_d(src0_h, src, 8, 0); + src += stride; + __lasx_xvstelm_d(src0_l, src, 0, 2); + __lasx_xvstelm_d(src0_h, src, 8, 2); + src += stride; + __lasx_xvstelm_d(src1_l, src, 0, 0); + __lasx_xvstelm_d(src1_h, src, 8, 0); + src += stride; + __lasx_xvstelm_d(src1_l, src, 0, 2); + __lasx_xvstelm_d(src1_h, src, 8, 2); + src += stride; + __lasx_xvstelm_d(src2_l, src, 0, 0); + __lasx_xvstelm_d(src2_h, src, 8, 0); + src += stride; + __lasx_xvstelm_d(src2_l, src, 0, 2); + __lasx_xvstelm_d(src2_h, src, 8, 2); + src += stride; + __lasx_xvstelm_d(src3_l, src, 0, 0); + __lasx_xvstelm_d(src3_h, src, 8, 0); + src += stride; + __lasx_xvstelm_d(src3_l, src, 0, 2); + __lasx_xvstelm_d(src3_h, src, 8, 2); + src += stride; + + if (16 == height) { + DUP4_ARG2(__lasx_xvldx, src, 0, 
src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp4, tmp5, tmp6, tmp7); + src -= stride_4x; + DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5, + tmp4, 0x20, tmp7, tmp6, 0x20, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvilvl_b, zero, src0, zero, src1, zero, src2, + zero, src3, src0_l, src1_l, src2_l, src3_l); + DUP4_ARG2(__lasx_xvilvh_b, zero, src0, zero, src1, zero, src2, + zero, src3, src0_h, src1_h, src2_h, src3_h); + src0_l = __lasx_xvmul_h(wgt, src0_l); + src0_h = __lasx_xvmul_h(wgt, src0_h); + src1_l = __lasx_xvmul_h(wgt, src1_l); + src1_h = __lasx_xvmul_h(wgt, src1_h); + src2_l = __lasx_xvmul_h(wgt, src2_l); + src2_h = __lasx_xvmul_h(wgt, src2_h); + src3_l = __lasx_xvmul_h(wgt, src3_l); + src3_h = __lasx_xvmul_h(wgt, src3_h); + DUP4_ARG2(__lasx_xvsadd_h, src0_l, offset, src0_h, offset, src1_l, + offset, src1_h, offset, src0_l, src0_h, src1_l, src1_h); + DUP4_ARG2(__lasx_xvsadd_h, src2_l, offset, src2_h, offset, src3_l, + offset, src3_h, offset, src2_l, src2_h, src3_l, src3_h); + src0_l = __lasx_xvmaxi_h(src0_l, 0); + src0_h = __lasx_xvmaxi_h(src0_h, 0); + src1_l = __lasx_xvmaxi_h(src1_l, 0); + src1_h = __lasx_xvmaxi_h(src1_h, 0); + src2_l = __lasx_xvmaxi_h(src2_l, 0); + src2_h = __lasx_xvmaxi_h(src2_h, 0); + src3_l = __lasx_xvmaxi_h(src3_l, 0); + src3_h = __lasx_xvmaxi_h(src3_h, 0); + src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom); + src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom); + src1_l = __lasx_xvssrlrn_bu_h(src1_l, denom); + src1_h = __lasx_xvssrlrn_bu_h(src1_h, denom); + src2_l = __lasx_xvssrlrn_bu_h(src2_l, denom); + src2_h = __lasx_xvssrlrn_bu_h(src2_h, denom); + src3_l = __lasx_xvssrlrn_bu_h(src3_l, denom); + src3_h = __lasx_xvssrlrn_bu_h(src3_h, denom); + __lasx_xvstelm_d(src0_l, src, 0, 0); + __lasx_xvstelm_d(src0_h, src, 8, 0); + src += stride; + __lasx_xvstelm_d(src0_l, src, 0, 2); + 
__lasx_xvstelm_d(src0_h, src, 8, 2); + src += stride; + __lasx_xvstelm_d(src1_l, src, 0, 0); + __lasx_xvstelm_d(src1_h, src, 8, 0); + src += stride; + __lasx_xvstelm_d(src1_l, src, 0, 2); + __lasx_xvstelm_d(src1_h, src, 8, 2); + src += stride; + __lasx_xvstelm_d(src2_l, src, 0, 0); + __lasx_xvstelm_d(src2_h, src, 8, 0); + src += stride; + __lasx_xvstelm_d(src2_l, src, 0, 2); + __lasx_xvstelm_d(src2_h, src, 8, 2); + src += stride; + __lasx_xvstelm_d(src3_l, src, 0, 0); + __lasx_xvstelm_d(src3_h, src, 8, 0); + src += stride; + __lasx_xvstelm_d(src3_l, src, 0, 2); + __lasx_xvstelm_d(src3_h, src, 8, 2); + } +} + +static void avc_wgt_8x4_lasx(uint8_t *src, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t offset_in) +{ + uint32_t offset_val; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + stride; + __m256i wgt, zero = __lasx_xvldi(0); + __m256i src0, src0_h, src0_l; + __m256i tmp0, tmp1, tmp2, tmp3, denom, offset; + + offset_val = (unsigned) offset_in << log2_denom; + + wgt = __lasx_xvreplgr2vr_h(weight_src); + offset = __lasx_xvreplgr2vr_h(offset_val); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + src0_l = __lasx_xvilvl_b(zero, src0); + src0_h = __lasx_xvilvh_b(zero, src0); + src0_l = __lasx_xvmul_h(wgt, src0_l); + src0_h = __lasx_xvmul_h(wgt, src0_h); + src0_l = __lasx_xvsadd_h(src0_l, offset); + src0_h = __lasx_xvsadd_h(src0_h, offset); + src0_l = __lasx_xvmaxi_h(src0_l, 0); + src0_h = __lasx_xvmaxi_h(src0_h, 0); + src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom); + src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom); + + src0 = __lasx_xvpickev_d(src0_h, src0_l); + __lasx_xvstelm_d(src0, src, 0, 0); + __lasx_xvstelm_d(src0, src + stride, 0, 1); + __lasx_xvstelm_d(src0, src + stride_2x, 0, 2); + __lasx_xvstelm_d(src0, 
src + stride_3x, 0, 3); +} + +static void avc_wgt_8x8_lasx(uint8_t *src, ptrdiff_t stride, int32_t log2_denom, + int32_t src_weight, int32_t offset_in) +{ + __m256i src0, src1, src0_h, src0_l, src1_h, src1_l, zero = __lasx_xvldi(0); + __m256i tmp0, tmp1, tmp2, tmp3, denom, offset, wgt; + uint32_t offset_val; + uint8_t* src_tmp = src; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + + offset_val = (unsigned) offset_in << log2_denom; + + wgt = __lasx_xvreplgr2vr_h(src_weight); + offset = __lasx_xvreplgr2vr_h(offset_val); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x, + src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + src_tmp += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x, + src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP2_ARG2(__lasx_xvilvl_b, zero, src0, zero, src1, src0_l, src1_l); + DUP2_ARG2(__lasx_xvilvh_b, zero, src0, zero, src1, src0_h, src1_h); + src0_l = __lasx_xvmul_h(wgt, src0_l); + src0_h = __lasx_xvmul_h(wgt, src0_h); + src1_l = __lasx_xvmul_h(wgt, src1_l); + src1_h = __lasx_xvmul_h(wgt, src1_h); + DUP4_ARG2(__lasx_xvsadd_h, src0_l, offset, src0_h, offset, src1_l, offset, + src1_h, offset, src0_l, src0_h, src1_l, src1_h); + src0_l = __lasx_xvmaxi_h(src0_l, 0); + src0_h = __lasx_xvmaxi_h(src0_h, 0); + src1_l = __lasx_xvmaxi_h(src1_l, 0); + src1_h = __lasx_xvmaxi_h(src1_h, 0); + src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom); + src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom); + src1_l = __lasx_xvssrlrn_bu_h(src1_l, denom); + src1_h = __lasx_xvssrlrn_bu_h(src1_h, denom); + + DUP2_ARG2(__lasx_xvpickev_d, src0_h, src0_l, src1_h, src1_l, src0, src1); + 
__lasx_xvstelm_d(src0, src, 0, 0); + __lasx_xvstelm_d(src0, src + stride, 0, 1); + __lasx_xvstelm_d(src0, src + stride_2x, 0, 2); + __lasx_xvstelm_d(src0, src + stride_3x, 0, 3); + src += stride_4x; + __lasx_xvstelm_d(src1, src, 0, 0); + __lasx_xvstelm_d(src1, src + stride, 0, 1); + __lasx_xvstelm_d(src1, src + stride_2x, 0, 2); + __lasx_xvstelm_d(src1, src + stride_3x, 0, 3); +} + +static void avc_wgt_8x16_lasx(uint8_t *src, ptrdiff_t stride, + int32_t log2_denom, int32_t src_weight, + int32_t offset_in) +{ + __m256i src0, src1, src2, src3; + __m256i src0_h, src0_l, src1_h, src1_l, src2_h, src2_l, src3_h, src3_l; + __m256i tmp0, tmp1, tmp2, tmp3, denom, offset, wgt; + __m256i zero = __lasx_xvldi(0); + uint32_t offset_val; + uint8_t* src_tmp = src; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + + offset_val = (unsigned) offset_in << log2_denom; + + wgt = __lasx_xvreplgr2vr_h(src_weight); + offset = __lasx_xvreplgr2vr_h(offset_val); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x, + src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + src_tmp += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x, + src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + src_tmp += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x, + src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + src_tmp += stride_4x; + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src2 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x, + src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, 
tmp3, tmp2, tmp0, tmp1); + src3 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + + DUP4_ARG2(__lasx_xvilvl_b, zero, src0, zero, src1, zero, src2, zero, src3, + src0_l, src1_l, src2_l, src3_l); + DUP4_ARG2(__lasx_xvilvh_b, zero, src0, zero, src1, zero, src2, zero, src3, + src0_h, src1_h, src2_h, src3_h); + src0_l = __lasx_xvmul_h(wgt, src0_l); + src0_h = __lasx_xvmul_h(wgt, src0_h); + src1_l = __lasx_xvmul_h(wgt, src1_l); + src1_h = __lasx_xvmul_h(wgt, src1_h); + src2_l = __lasx_xvmul_h(wgt, src2_l); + src2_h = __lasx_xvmul_h(wgt, src2_h); + src3_l = __lasx_xvmul_h(wgt, src3_l); + src3_h = __lasx_xvmul_h(wgt, src3_h); + + DUP4_ARG2(__lasx_xvsadd_h, src0_l, offset, src0_h, offset, src1_l, offset, + src1_h, offset, src0_l, src0_h, src1_l, src1_h); + DUP4_ARG2(__lasx_xvsadd_h, src2_l, offset, src2_h, offset, src3_l, offset, + src3_h, offset, src2_l, src2_h, src3_l, src3_h); + + src0_l = __lasx_xvmaxi_h(src0_l, 0); + src0_h = __lasx_xvmaxi_h(src0_h, 0); + src1_l = __lasx_xvmaxi_h(src1_l, 0); + src1_h = __lasx_xvmaxi_h(src1_h, 0); + src2_l = __lasx_xvmaxi_h(src2_l, 0); + src2_h = __lasx_xvmaxi_h(src2_h, 0); + src3_l = __lasx_xvmaxi_h(src3_l, 0); + src3_h = __lasx_xvmaxi_h(src3_h, 0); + src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom); + src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom); + src1_l = __lasx_xvssrlrn_bu_h(src1_l, denom); + src1_h = __lasx_xvssrlrn_bu_h(src1_h, denom); + src2_l = __lasx_xvssrlrn_bu_h(src2_l, denom); + src2_h = __lasx_xvssrlrn_bu_h(src2_h, denom); + src3_l = __lasx_xvssrlrn_bu_h(src3_l, denom); + src3_h = __lasx_xvssrlrn_bu_h(src3_h, denom); + DUP4_ARG2(__lasx_xvpickev_d, src0_h, src0_l, src1_h, src1_l, src2_h, src2_l, + src3_h, src3_l, src0, src1, src2, src3); + + __lasx_xvstelm_d(src0, src, 0, 0); + __lasx_xvstelm_d(src0, src + stride, 0, 1); + __lasx_xvstelm_d(src0, src + stride_2x, 0, 2); + __lasx_xvstelm_d(src0, src + stride_3x, 0, 3); + src += stride_4x; + __lasx_xvstelm_d(src1, src, 0, 0); + __lasx_xvstelm_d(src1, src + stride, 0, 1); + 
__lasx_xvstelm_d(src1, src + stride_2x, 0, 2); + __lasx_xvstelm_d(src1, src + stride_3x, 0, 3); + src += stride_4x; + __lasx_xvstelm_d(src2, src, 0, 0); + __lasx_xvstelm_d(src2, src + stride, 0, 1); + __lasx_xvstelm_d(src2, src + stride_2x, 0, 2); + __lasx_xvstelm_d(src2, src + stride_3x, 0, 3); + src += stride_4x; + __lasx_xvstelm_d(src3, src, 0, 0); + __lasx_xvstelm_d(src3, src + stride, 0, 1); + __lasx_xvstelm_d(src3, src + stride_2x, 0, 2); + __lasx_xvstelm_d(src3, src + stride_3x, 0, 3); +} + +void ff_weight_h264_pixels8_8_lasx(uint8_t *src, ptrdiff_t stride, + int height, int log2_denom, + int weight_src, int offset) +{ + if (4 == height) { + avc_wgt_8x4_lasx(src, stride, log2_denom, weight_src, offset); + } else if (8 == height) { + avc_wgt_8x8_lasx(src, stride, log2_denom, weight_src, offset); + } else { + avc_wgt_8x16_lasx(src, stride, log2_denom, weight_src, offset); + } +} + +static void avc_wgt_4x2_lasx(uint8_t *src, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t offset_in) +{ + uint32_t offset_val; + __m256i wgt, zero = __lasx_xvldi(0); + __m256i src0, tmp0, tmp1, denom, offset; + + offset_val = (unsigned) offset_in << log2_denom; + + wgt = __lasx_xvreplgr2vr_h(weight_src); + offset = __lasx_xvreplgr2vr_h(offset_val); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP2_ARG2(__lasx_xvldx, src, 0, src, stride, tmp0, tmp1); + src0 = __lasx_xvilvl_w(tmp1, tmp0); + src0 = __lasx_xvilvl_b(zero, src0); + src0 = __lasx_xvmul_h(wgt, src0); + src0 = __lasx_xvsadd_h(src0, offset); + src0 = __lasx_xvmaxi_h(src0, 0); + src0 = __lasx_xvssrlrn_bu_h(src0, denom); + __lasx_xvstelm_w(src0, src, 0, 0); + __lasx_xvstelm_w(src0, src + stride, 0, 1); +} + +static void avc_wgt_4x4_lasx(uint8_t *src, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t offset_in) +{ + __m256i wgt; + __m256i src0, tmp0, tmp1, tmp2, tmp3, denom, offset; + uint32_t offset_val; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_3x = stride_2x + 
stride; + + offset_val = (unsigned) offset_in << log2_denom; + + wgt = __lasx_xvreplgr2vr_h(weight_src); + offset = __lasx_xvreplgr2vr_h(offset_val); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp0, tmp1); + src0 = __lasx_xvilvl_w(tmp1, tmp0); + src0 = __lasx_vext2xv_hu_bu(src0); + src0 = __lasx_xvmul_h(wgt, src0); + src0 = __lasx_xvsadd_h(src0, offset); + src0 = __lasx_xvmaxi_h(src0, 0); + src0 = __lasx_xvssrlrn_bu_h(src0, denom); + __lasx_xvstelm_w(src0, src, 0, 0); + __lasx_xvstelm_w(src0, src + stride, 0, 1); + __lasx_xvstelm_w(src0, src + stride_2x, 0, 4); + __lasx_xvstelm_w(src0, src + stride_3x, 0, 5); +} + +static void avc_wgt_4x8_lasx(uint8_t *src, ptrdiff_t stride, + int32_t log2_denom, int32_t weight_src, + int32_t offset_in) +{ + __m256i src0, src0_h, src0_l; + __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, denom, offset; + __m256i wgt, zero = __lasx_xvldi(0); + uint32_t offset_val; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + + offset_val = (unsigned) offset_in << log2_denom; + + wgt = __lasx_xvreplgr2vr_h(weight_src); + offset = __lasx_xvreplgr2vr_h(offset_val); + denom = __lasx_xvreplgr2vr_h(log2_denom); + + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp0, tmp1, tmp2, tmp3); + src += stride_4x; + DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, + src, stride_3x, tmp4, tmp5, tmp6, tmp7); + src -= stride_4x; + DUP4_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp6, tmp4, tmp7, + tmp5, tmp0, tmp1, tmp2, tmp3); + DUP2_ARG2(__lasx_xvilvl_w, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1); + src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20); + src0_l = __lasx_xvilvl_b(zero, src0); + src0_h = __lasx_xvilvh_b(zero, src0); + src0_l = __lasx_xvmul_h(wgt, src0_l); + src0_h = 
__lasx_xvmul_h(wgt, src0_h); + src0_l = __lasx_xvsadd_h(src0_l, offset); + src0_h = __lasx_xvsadd_h(src0_h, offset); + src0_l = __lasx_xvmaxi_h(src0_l, 0); + src0_h = __lasx_xvmaxi_h(src0_h, 0); + src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom); + src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom); + __lasx_xvstelm_w(src0_l, src, 0, 0); + __lasx_xvstelm_w(src0_l, src + stride, 0, 1); + __lasx_xvstelm_w(src0_h, src + stride_2x, 0, 0); + __lasx_xvstelm_w(src0_h, src + stride_3x, 0, 1); + src += stride_4x; + __lasx_xvstelm_w(src0_l, src, 0, 4); + __lasx_xvstelm_w(src0_l, src + stride, 0, 5); + __lasx_xvstelm_w(src0_h, src + stride_2x, 0, 4); + __lasx_xvstelm_w(src0_h, src + stride_3x, 0, 5); +} + +void ff_weight_h264_pixels4_8_lasx(uint8_t *src, ptrdiff_t stride, + int height, int log2_denom, + int weight_src, int offset) +{ + if (2 == height) { + avc_wgt_4x2_lasx(src, stride, log2_denom, weight_src, offset); + } else if (4 == height) { + avc_wgt_4x4_lasx(src, stride, log2_denom, weight_src, offset); + } else { + avc_wgt_4x8_lasx(src, stride, log2_denom, weight_src, offset); + } +} + +void ff_h264_add_pixels4_8_lasx(uint8_t *_dst, int16_t *_src, int stride) +{ + __m256i src0, dst0, dst1, dst2, dst3, zero; + __m256i tmp0, tmp1; + uint8_t* _dst1 = _dst + stride; + uint8_t* _dst2 = _dst1 + stride; + uint8_t* _dst3 = _dst2 + stride; + + src0 = __lasx_xvld(_src, 0); + dst0 = __lasx_xvldrepl_w(_dst, 0); + dst1 = __lasx_xvldrepl_w(_dst1, 0); + dst2 = __lasx_xvldrepl_w(_dst2, 0); + dst3 = __lasx_xvldrepl_w(_dst3, 0); + tmp0 = __lasx_xvilvl_w(dst1, dst0); + tmp1 = __lasx_xvilvl_w(dst3, dst2); + dst0 = __lasx_xvilvl_d(tmp1, tmp0); + tmp0 = __lasx_vext2xv_hu_bu(dst0); + zero = __lasx_xvldi(0); + tmp1 = __lasx_xvadd_h(src0, tmp0); + dst0 = __lasx_xvpickev_b(tmp1, tmp1); + __lasx_xvstelm_w(dst0, _dst, 0, 0); + __lasx_xvstelm_w(dst0, _dst1, 0, 1); + __lasx_xvstelm_w(dst0, _dst2, 0, 4); + __lasx_xvstelm_w(dst0, _dst3, 0, 5); + __lasx_xvst(zero, _src, 0); +} + +void 
ff_h264_add_pixels8_8_lasx(uint8_t *_dst, int16_t *_src, int stride) +{ + __m256i src0, src1, src2, src3; + __m256i dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7; + __m256i tmp0, tmp1, tmp2, tmp3; + __m256i zero = __lasx_xvldi(0); + uint8_t *_dst1 = _dst + stride; + uint8_t *_dst2 = _dst1 + stride; + uint8_t *_dst3 = _dst2 + stride; + uint8_t *_dst4 = _dst3 + stride; + uint8_t *_dst5 = _dst4 + stride; + uint8_t *_dst6 = _dst5 + stride; + uint8_t *_dst7 = _dst6 + stride; + + src0 = __lasx_xvld(_src, 0); + src1 = __lasx_xvld(_src, 32); + src2 = __lasx_xvld(_src, 64); + src3 = __lasx_xvld(_src, 96); + dst0 = __lasx_xvldrepl_d(_dst, 0); + dst1 = __lasx_xvldrepl_d(_dst1, 0); + dst2 = __lasx_xvldrepl_d(_dst2, 0); + dst3 = __lasx_xvldrepl_d(_dst3, 0); + dst4 = __lasx_xvldrepl_d(_dst4, 0); + dst5 = __lasx_xvldrepl_d(_dst5, 0); + dst6 = __lasx_xvldrepl_d(_dst6, 0); + dst7 = __lasx_xvldrepl_d(_dst7, 0); + tmp0 = __lasx_xvilvl_d(dst1, dst0); + tmp1 = __lasx_xvilvl_d(dst3, dst2); + tmp2 = __lasx_xvilvl_d(dst5, dst4); + tmp3 = __lasx_xvilvl_d(dst7, dst6); + dst0 = __lasx_vext2xv_hu_bu(tmp0); + dst1 = __lasx_vext2xv_hu_bu(tmp1); + dst1 = __lasx_vext2xv_hu_bu(tmp1); + dst2 = __lasx_vext2xv_hu_bu(tmp2); + dst3 = __lasx_vext2xv_hu_bu(tmp3); + tmp0 = __lasx_xvadd_h(src0, dst0); + tmp1 = __lasx_xvadd_h(src1, dst1); + tmp2 = __lasx_xvadd_h(src2, dst2); + tmp3 = __lasx_xvadd_h(src3, dst3); + dst1 = __lasx_xvpickev_b(tmp1, tmp0); + dst2 = __lasx_xvpickev_b(tmp3, tmp2); + __lasx_xvst(zero, _src, 0); + __lasx_xvst(zero, _src, 32); + __lasx_xvst(zero, _src, 64); + __lasx_xvst(zero, _src, 96); + __lasx_xvstelm_d(dst1, _dst, 0, 0); + __lasx_xvstelm_d(dst1, _dst1, 0, 2); + __lasx_xvstelm_d(dst1, _dst2, 0, 1); + __lasx_xvstelm_d(dst1, _dst3, 0, 3); + __lasx_xvstelm_d(dst2, _dst4, 0, 0); + __lasx_xvstelm_d(dst2, _dst5, 0, 2); + __lasx_xvstelm_d(dst2, _dst6, 0, 1); + __lasx_xvstelm_d(dst2, _dst7, 0, 3); +} diff --git a/libavcodec/loongarch/h264dsp_lasx.h b/libavcodec/loongarch/h264dsp_lasx.h 
new file mode 100644 index 0000000000..538c14c936 --- /dev/null +++ b/libavcodec/loongarch/h264dsp_lasx.h @@ -0,0 +1,68 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * Xiwei Gu + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVCODEC_LOONGARCH_H264DSP_LASX_H +#define AVCODEC_LOONGARCH_H264DSP_LASX_H + +#include "libavcodec/h264dec.h" + +void ff_h264_h_lpf_luma_8_lasx(uint8_t *src, ptrdiff_t stride, + int alpha, int beta, int8_t *tc0); +void ff_h264_v_lpf_luma_8_lasx(uint8_t *src, ptrdiff_t stride, + int alpha, int beta, int8_t *tc0); +void ff_h264_h_lpf_luma_intra_8_lasx(uint8_t *src, ptrdiff_t stride, + int alpha, int beta); +void ff_h264_v_lpf_luma_intra_8_lasx(uint8_t *src, ptrdiff_t stride, + int alpha, int beta); +void ff_h264_h_lpf_chroma_8_lasx(uint8_t *src, ptrdiff_t stride, + int alpha, int beta, int8_t *tc0); +void ff_h264_v_lpf_chroma_8_lasx(uint8_t *src, ptrdiff_t stride, + int alpha, int beta, int8_t *tc0); +void ff_h264_h_lpf_chroma_intra_8_lasx(uint8_t *src, ptrdiff_t stride, + int alpha, int beta); +void ff_h264_v_lpf_chroma_intra_8_lasx(uint8_t *src, ptrdiff_t stride, + int alpha, int beta); +void ff_biweight_h264_pixels16_8_lasx(uint8_t *dst, uint8_t *src, + 
ptrdiff_t stride, int height, + int log2_denom, int weight_dst, + int weight_src, int offset_in); +void ff_biweight_h264_pixels8_8_lasx(uint8_t *dst, uint8_t *src, + ptrdiff_t stride, int height, + int log2_denom, int weight_dst, + int weight_src, int offset); +void ff_biweight_h264_pixels4_8_lasx(uint8_t *dst, uint8_t *src, + ptrdiff_t stride, int height, + int log2_denom, int weight_dst, + int weight_src, int offset); +void ff_weight_h264_pixels16_8_lasx(uint8_t *src, ptrdiff_t stride, + int height, int log2_denom, + int weight_src, int offset_in); +void ff_weight_h264_pixels8_8_lasx(uint8_t *src, ptrdiff_t stride, + int height, int log2_denom, + int weight_src, int offset); +void ff_weight_h264_pixels4_8_lasx(uint8_t *src, ptrdiff_t stride, + int height, int log2_denom, + int weight_src, int offset); +void ff_h264_add_pixels4_8_lasx(uint8_t *_dst, int16_t *_src, int stride); + +void ff_h264_add_pixels8_8_lasx(uint8_t *_dst, int16_t *_src, int stride); +#endif // #ifndef AVCODEC_LOONGARCH_H264DSP_LASX_H
From patchwork Tue Dec 14 13:33:14 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?=
X-Patchwork-Id: 32486
From: Hao Chen
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:14 +0800
Message-Id: <20211214133316.8978-6-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
Subject: [FFmpeg-devel] [PATCH v2 5/7] avcodec: [loongarch] Optimize h264idct with LASX.
Cc: Lu Wang

From: Lu Wang

./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before:282
after :293

Change-Id: Ia8889935a6359630dd5dbb61263287f1cb24a0a4
---
 libavcodec/loongarch/Makefile | 3 +-
 libavcodec/loongarch/h264dsp_init_loongarch.c | 15 +
 libavcodec/loongarch/h264dsp_lasx.h | 23 +
 libavcodec/loongarch/h264idct_lasx.c | 498 ++++++++++++++++++
 4 files changed, 538 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/loongarch/h264idct_lasx.c
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile index df43151dbd..242a2be290 100644 --- a/libavcodec/loongarch/Makefile +++ b/libavcodec/loongarch/Makefile @@ -3,4 +3,5 @@ OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_init_loongarch.o OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_init_loongarch.o LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o -LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o +LASX-OBJS-$(CONFIG_H264DSP) += 
loongarch/h264dsp_lasx.o \ + loongarch/h264idct_lasx.o diff --git a/libavcodec/loongarch/h264dsp_init_loongarch.c b/libavcodec/loongarch/h264dsp_init_loongarch.c index ddc0877a74..0985c2fe8a 100644 --- a/libavcodec/loongarch/h264dsp_init_loongarch.c +++ b/libavcodec/loongarch/h264dsp_init_loongarch.c @@ -53,6 +53,21 @@ av_cold void ff_h264dsp_init_loongarch(H264DSPContext *c, const int bit_depth, c->biweight_h264_pixels_tab[0] = ff_biweight_h264_pixels16_8_lasx; c->biweight_h264_pixels_tab[1] = ff_biweight_h264_pixels8_8_lasx; c->biweight_h264_pixels_tab[2] = ff_biweight_h264_pixels4_8_lasx; + + c->h264_idct_add = ff_h264_idct_add_lasx; + c->h264_idct8_add = ff_h264_idct8_addblk_lasx; + c->h264_idct_dc_add = ff_h264_idct4x4_addblk_dc_lasx; + c->h264_idct8_dc_add = ff_h264_idct8_dc_addblk_lasx; + c->h264_idct_add16 = ff_h264_idct_add16_lasx; + c->h264_idct8_add4 = ff_h264_idct8_add4_lasx; + + if (chroma_format_idc <= 1) + c->h264_idct_add8 = ff_h264_idct_add8_lasx; + else + c->h264_idct_add8 = ff_h264_idct_add8_422_lasx; + + c->h264_idct_add16intra = ff_h264_idct_add16_intra_lasx; + c->h264_luma_dc_dequant_idct = ff_h264_deq_idct_luma_dc_lasx; } } } diff --git a/libavcodec/loongarch/h264dsp_lasx.h b/libavcodec/loongarch/h264dsp_lasx.h index 538c14c936..bfd567fffa 100644 --- a/libavcodec/loongarch/h264dsp_lasx.h +++ b/libavcodec/loongarch/h264dsp_lasx.h @@ -65,4 +65,27 @@ void ff_weight_h264_pixels4_8_lasx(uint8_t *src, ptrdiff_t stride, void ff_h264_add_pixels4_8_lasx(uint8_t *_dst, int16_t *_src, int stride); void ff_h264_add_pixels8_8_lasx(uint8_t *_dst, int16_t *_src, int stride); +void ff_h264_idct_add_lasx(uint8_t *dst, int16_t *src, int32_t dst_stride); +void ff_h264_idct8_addblk_lasx(uint8_t *dst, int16_t *src, int32_t dst_stride); +void ff_h264_idct4x4_addblk_dc_lasx(uint8_t *dst, int16_t *src, + int32_t dst_stride); +void ff_h264_idct8_dc_addblk_lasx(uint8_t *dst, int16_t *src, + int32_t dst_stride); +void ff_h264_idct_add16_lasx(uint8_t *dst, const int32_t 
*blk_offset, + int16_t *block, int32_t dst_stride, + const uint8_t nzc[15 * 8]); +void ff_h264_idct8_add4_lasx(uint8_t *dst, const int32_t *blk_offset, + int16_t *block, int32_t dst_stride, + const uint8_t nzc[15 * 8]); +void ff_h264_idct_add8_lasx(uint8_t **dst, const int32_t *blk_offset, + int16_t *block, int32_t dst_stride, + const uint8_t nzc[15 * 8]); +void ff_h264_idct_add8_422_lasx(uint8_t **dst, const int32_t *blk_offset, + int16_t *block, int32_t dst_stride, + const uint8_t nzc[15 * 8]); +void ff_h264_idct_add16_intra_lasx(uint8_t *dst, const int32_t *blk_offset, + int16_t *block, int32_t dst_stride, + const uint8_t nzc[15 * 8]); +void ff_h264_deq_idct_luma_dc_lasx(int16_t *dst, int16_t *src, + int32_t de_qval); #endif // #ifndef AVCODEC_LOONGARCH_H264DSP_LASX_H diff --git a/libavcodec/loongarch/h264idct_lasx.c b/libavcodec/loongarch/h264idct_lasx.c new file mode 100644 index 0000000000..46bd3b74d5 --- /dev/null +++ b/libavcodec/loongarch/h264idct_lasx.c @@ -0,0 +1,498 @@ +/* + * Loongson LASX optimized h264dsp + * + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * Xiwei Gu + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/loongarch/loongson_intrinsics.h" +#include "h264dsp_lasx.h" +#include "libavcodec/bit_depth_template.c" + +#define AVC_ITRANS_H(in0, in1, in2, in3, out0, out1, out2, out3) \ +{ \ + __m256i tmp0_m, tmp1_m, tmp2_m, tmp3_m; \ + \ + tmp0_m = __lasx_xvadd_h(in0, in2); \ + tmp1_m = __lasx_xvsub_h(in0, in2); \ + tmp2_m = __lasx_xvsrai_h(in1, 1); \ + tmp2_m = __lasx_xvsub_h(tmp2_m, in3); \ + tmp3_m = __lasx_xvsrai_h(in3, 1); \ + tmp3_m = __lasx_xvadd_h(in1, tmp3_m); \ + \ + LASX_BUTTERFLY_4_H(tmp0_m, tmp1_m, tmp2_m, tmp3_m, \ + out0, out1, out2, out3); \ +} + +void ff_h264_idct_add_lasx(uint8_t *dst, int16_t *src, int32_t dst_stride) +{ + __m256i src0_m, src1_m, src2_m, src3_m; + __m256i dst0_m, dst1_m; + __m256i hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3; + __m256i inp0_m, inp1_m, res0_m, src1, src3; + __m256i src0 = __lasx_xvld(src, 0); + __m256i src2 = __lasx_xvld(src, 16); + __m256i zero = __lasx_xvldi(0); + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + + __lasx_xvst(zero, src, 0); + DUP2_ARG2(__lasx_xvilvh_d, src0, src0, src2, src2, src1, src3); + AVC_ITRANS_H(src0, src1, src2, src3, hres0, hres1, hres2, hres3); + LASX_TRANSPOSE4x4_H(hres0, hres1, hres2, hres3, hres0, hres1, hres2, hres3); + AVC_ITRANS_H(hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3); + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x, + dst, dst_stride_3x, src0_m, src1_m, src2_m, src3_m); + DUP4_ARG2(__lasx_xvld, dst, 0, dst + dst_stride, 0, dst + dst_stride_2x, + 0, dst + dst_stride_3x, 0, src0_m, src1_m, src2_m, src3_m); + DUP2_ARG2(__lasx_xvilvl_d, vres1, vres0, vres3, vres2, inp0_m, inp1_m); + inp0_m = __lasx_xvpermi_q(inp1_m, inp0_m, 0x20); + inp0_m = 
__lasx_xvsrari_h(inp0_m, 6); + DUP2_ARG2(__lasx_xvilvl_w, src1_m, src0_m, src3_m, src2_m, dst0_m, dst1_m); + dst0_m = __lasx_xvilvl_d(dst1_m, dst0_m); + res0_m = __lasx_vext2xv_hu_bu(dst0_m); + res0_m = __lasx_xvadd_h(res0_m, inp0_m); + res0_m = __lasx_xvclip255_h(res0_m); + dst0_m = __lasx_xvpickev_b(res0_m, res0_m); + __lasx_xvstelm_w(dst0_m, dst, 0, 0); + __lasx_xvstelm_w(dst0_m, dst + dst_stride, 0, 1); + __lasx_xvstelm_w(dst0_m, dst + dst_stride_2x, 0, 4); + __lasx_xvstelm_w(dst0_m, dst + dst_stride_3x, 0, 5); +} + +void ff_h264_idct8_addblk_lasx(uint8_t *dst, int16_t *src, + int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + __m256i vec0, vec1, vec2, vec3; + __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7; + __m256i res0, res1, res2, res3, res4, res5, res6, res7; + __m256i dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7; + __m256i zero = __lasx_xvldi(0); + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_4x = dst_stride << 2; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + + src[0] += 32; + DUP4_ARG2(__lasx_xvld, src, 0, src, 16, src, 32, src, 48, + src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvld, src, 64, src, 80, src, 96, src, 112, + src4, src5, src6, src7); + __lasx_xvst(zero, src, 0); + __lasx_xvst(zero, src, 32); + __lasx_xvst(zero, src, 64); + __lasx_xvst(zero, src, 96); + + vec0 = __lasx_xvadd_h(src0, src4); + vec1 = __lasx_xvsub_h(src0, src4); + vec2 = __lasx_xvsrai_h(src2, 1); + vec2 = __lasx_xvsub_h(vec2, src6); + vec3 = __lasx_xvsrai_h(src6, 1); + vec3 = __lasx_xvadd_h(src2, vec3); + + LASX_BUTTERFLY_4_H(vec0, vec1, vec2, vec3, tmp0, tmp1, tmp2, tmp3); + + vec0 = __lasx_xvsrai_h(src7, 1); + vec0 = __lasx_xvsub_h(src5, vec0); + vec0 = __lasx_xvsub_h(vec0, src3); + vec0 = __lasx_xvsub_h(vec0, src7); + + vec1 = __lasx_xvsrai_h(src3, 1); + vec1 = __lasx_xvsub_h(src1, vec1); + vec1 = __lasx_xvadd_h(vec1, src7); + vec1 = __lasx_xvsub_h(vec1, src3); + + vec2 = __lasx_xvsrai_h(src5, 1); + vec2 = 
__lasx_xvsub_h(vec2, src1); + vec2 = __lasx_xvadd_h(vec2, src7); + vec2 = __lasx_xvadd_h(vec2, src5); + + vec3 = __lasx_xvsrai_h(src1, 1); + vec3 = __lasx_xvadd_h(src3, vec3); + vec3 = __lasx_xvadd_h(vec3, src5); + vec3 = __lasx_xvadd_h(vec3, src1); + + tmp4 = __lasx_xvsrai_h(vec3, 2); + tmp4 = __lasx_xvadd_h(tmp4, vec0); + tmp5 = __lasx_xvsrai_h(vec2, 2); + tmp5 = __lasx_xvadd_h(tmp5, vec1); + tmp6 = __lasx_xvsrai_h(vec1, 2); + tmp6 = __lasx_xvsub_h(tmp6, vec2); + tmp7 = __lasx_xvsrai_h(vec0, 2); + tmp7 = __lasx_xvsub_h(vec3, tmp7); + + LASX_BUTTERFLY_8_H(tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, + res0, res1, res2, res3, res4, res5, res6, res7); + LASX_TRANSPOSE8x8_H(res0, res1, res2, res3, res4, res5, res6, res7, + res0, res1, res2, res3, res4, res5, res6, res7); + + DUP4_ARG1(__lasx_vext2xv_w_h, res0, res1, res2, res3, + tmp0, tmp1, tmp2, tmp3); + DUP4_ARG1(__lasx_vext2xv_w_h, res4, res5, res6, res7, + tmp4, tmp5, tmp6, tmp7); + vec0 = __lasx_xvadd_w(tmp0, tmp4); + vec1 = __lasx_xvsub_w(tmp0, tmp4); + + vec2 = __lasx_xvsrai_w(tmp2, 1); + vec2 = __lasx_xvsub_w(vec2, tmp6); + vec3 = __lasx_xvsrai_w(tmp6, 1); + vec3 = __lasx_xvadd_w(vec3, tmp2); + + tmp0 = __lasx_xvadd_w(vec0, vec3); + tmp2 = __lasx_xvadd_w(vec1, vec2); + tmp4 = __lasx_xvsub_w(vec1, vec2); + tmp6 = __lasx_xvsub_w(vec0, vec3); + + vec0 = __lasx_xvsrai_w(tmp7, 1); + vec0 = __lasx_xvsub_w(tmp5, vec0); + vec0 = __lasx_xvsub_w(vec0, tmp3); + vec0 = __lasx_xvsub_w(vec0, tmp7); + + vec1 = __lasx_xvsrai_w(tmp3, 1); + vec1 = __lasx_xvsub_w(tmp1, vec1); + vec1 = __lasx_xvadd_w(vec1, tmp7); + vec1 = __lasx_xvsub_w(vec1, tmp3); + + vec2 = __lasx_xvsrai_w(tmp5, 1); + vec2 = __lasx_xvsub_w(vec2, tmp1); + vec2 = __lasx_xvadd_w(vec2, tmp7); + vec2 = __lasx_xvadd_w(vec2, tmp5); + + vec3 = __lasx_xvsrai_w(tmp1, 1); + vec3 = __lasx_xvadd_w(tmp3, vec3); + vec3 = __lasx_xvadd_w(vec3, tmp5); + vec3 = __lasx_xvadd_w(vec3, tmp1); + + tmp1 = __lasx_xvsrai_w(vec3, 2); + tmp1 = __lasx_xvadd_w(tmp1, vec0); + tmp3 = 
__lasx_xvsrai_w(vec2, 2); + tmp3 = __lasx_xvadd_w(tmp3, vec1); + tmp5 = __lasx_xvsrai_w(vec1, 2); + tmp5 = __lasx_xvsub_w(tmp5, vec2); + tmp7 = __lasx_xvsrai_w(vec0, 2); + tmp7 = __lasx_xvsub_w(vec3, tmp7); + + LASX_BUTTERFLY_4_W(tmp0, tmp2, tmp5, tmp7, res0, res1, res6, res7); + LASX_BUTTERFLY_4_W(tmp4, tmp6, tmp1, tmp3, res2, res3, res4, res5); + + DUP4_ARG2(__lasx_xvsrai_w, res0, 6, res1, 6, res2, 6, res3, 6, + res0, res1, res2, res3); + DUP4_ARG2(__lasx_xvsrai_w, res4, 6, res5, 6, res6, 6, res7, 6, + res4, res5, res6, res7); + DUP4_ARG2(__lasx_xvpickev_h, res1, res0, res3, res2, res5, res4, res7, + res6, res0, res1, res2, res3); + DUP4_ARG2(__lasx_xvpermi_d, res0, 0xd8, res1, 0xd8, res2, 0xd8, res3, 0xd8, + res0, res1, res2, res3); + + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x, + dst, dst_stride_3x, dst0, dst1, dst2, dst3); + dst += dst_stride_4x; + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x, + dst, dst_stride_3x, dst4, dst5, dst6, dst7); + dst -= dst_stride_4x; + DUP4_ARG2(__lasx_xvilvl_b, zero, dst0, zero, dst1, zero, dst2, zero, dst3, + dst0, dst1, dst2, dst3); + DUP4_ARG2(__lasx_xvilvl_b, zero, dst4, zero, dst5, zero, dst6, zero, dst7, + dst4, dst5, dst6, dst7); + DUP4_ARG3(__lasx_xvpermi_q, dst1, dst0, 0x20, dst3, dst2, 0x20, dst5, + dst4, 0x20, dst7, dst6, 0x20, dst0, dst1, dst2, dst3); + res0 = __lasx_xvadd_h(res0, dst0); + res1 = __lasx_xvadd_h(res1, dst1); + res2 = __lasx_xvadd_h(res2, dst2); + res3 = __lasx_xvadd_h(res3, dst3); + DUP4_ARG1(__lasx_xvclip255_h, res0, res1, res2, res3, res0, res1, + res2, res3); + DUP2_ARG2(__lasx_xvpickev_b, res1, res0, res3, res2, res0, res1); + __lasx_xvstelm_d(res0, dst, 0, 0); + __lasx_xvstelm_d(res0, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(res0, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(res0, dst + dst_stride_3x, 0, 3); + dst += dst_stride_4x; + __lasx_xvstelm_d(res1, dst, 0, 0); + __lasx_xvstelm_d(res1, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(res1, 
dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(res1, dst + dst_stride_3x, 0, 3); +} + +void ff_h264_idct4x4_addblk_dc_lasx(uint8_t *dst, int16_t *src, + int32_t dst_stride) +{ + const int16_t dc = (src[0] + 32) >> 6; + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + __m256i pred, out; + __m256i src0, src1, src2, src3; + __m256i input_dc = __lasx_xvreplgr2vr_h(dc); + + src[0] = 0; + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x, + dst, dst_stride_3x, src0, src1, src2, src3); + DUP2_ARG2(__lasx_xvilvl_w, src1, src0, src3, src2, src0, src1); + + pred = __lasx_xvpermi_q(src0, src1, 0x02); + pred = __lasx_xvaddw_h_h_bu(input_dc, pred); + pred = __lasx_xvclip255_h(pred); + out = __lasx_xvpickev_b(pred, pred); + __lasx_xvstelm_w(out, dst, 0, 0); + __lasx_xvstelm_w(out, dst + dst_stride, 0, 1); + __lasx_xvstelm_w(out, dst + dst_stride_2x, 0, 4); + __lasx_xvstelm_w(out, dst + dst_stride_3x, 0, 5); +} + +void ff_h264_idct8_dc_addblk_lasx(uint8_t *dst, int16_t *src, + int32_t dst_stride) +{ + int32_t dc_val; + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_4x = dst_stride << 2; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + __m256i dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7; + __m256i dc; + + dc_val = (src[0] + 32) >> 6; + dc = __lasx_xvreplgr2vr_h(dc_val); + + src[0] = 0; + + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x, + dst, dst_stride_3x, dst0, dst1, dst2, dst3); + dst += dst_stride_4x; + DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x, + dst, dst_stride_3x, dst4, dst5, dst6, dst7); + dst -= dst_stride_4x; + DUP4_ARG1(__lasx_vext2xv_hu_bu, dst0, dst1, dst2, dst3, + dst0, dst1, dst2, dst3); + DUP4_ARG1(__lasx_vext2xv_hu_bu, dst4, dst5, dst6, dst7, + dst4, dst5, dst6, dst7); + DUP4_ARG3(__lasx_xvpermi_q, dst1, dst0, 0x20, dst3, dst2, 0x20, dst5, + dst4, 0x20, dst7, dst6, 0x20, dst0, dst1, dst2, dst3); + dst0 = __lasx_xvadd_h(dst0, 
dc); + dst1 = __lasx_xvadd_h(dst1, dc); + dst2 = __lasx_xvadd_h(dst2, dc); + dst3 = __lasx_xvadd_h(dst3, dc); + DUP4_ARG1(__lasx_xvclip255_h, dst0, dst1, dst2, dst3, + dst0, dst1, dst2, dst3); + DUP2_ARG2(__lasx_xvpickev_b, dst1, dst0, dst3, dst2, dst0, dst1); + __lasx_xvstelm_d(dst0, dst, 0, 0); + __lasx_xvstelm_d(dst0, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(dst0, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(dst0, dst + dst_stride_3x, 0, 3); + dst += dst_stride_4x; + __lasx_xvstelm_d(dst1, dst, 0, 0); + __lasx_xvstelm_d(dst1, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(dst1, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(dst1, dst + dst_stride_3x, 0, 3); +} + +void ff_h264_idct_add16_lasx(uint8_t *dst, + const int32_t *blk_offset, + int16_t *block, int32_t dst_stride, + const uint8_t nzc[15 * 8]) +{ + int32_t i; + + for (i = 0; i < 16; i++) { + int32_t nnz = nzc[scan8[i]]; + + if (nnz) { + if (nnz == 1 && ((dctcoef *) block)[i * 16]) + ff_h264_idct4x4_addblk_dc_lasx(dst + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + else + ff_h264_idct_add_lasx(dst + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + } + } +} + +void ff_h264_idct8_add4_lasx(uint8_t *dst, const int32_t *blk_offset, + int16_t *block, int32_t dst_stride, + const uint8_t nzc[15 * 8]) +{ + int32_t cnt; + + for (cnt = 0; cnt < 16; cnt += 4) { + int32_t nnz = nzc[scan8[cnt]]; + + if (nnz) { + if (nnz == 1 && ((dctcoef *) block)[cnt * 16]) + ff_h264_idct8_dc_addblk_lasx(dst + blk_offset[cnt], + block + cnt * 16 * sizeof(pixel), + dst_stride); + else + ff_h264_idct8_addblk_lasx(dst + blk_offset[cnt], + block + cnt * 16 * sizeof(pixel), + dst_stride); + } + } +} + + +void ff_h264_idct_add8_lasx(uint8_t **dst, + const int32_t *blk_offset, + int16_t *block, int32_t dst_stride, + const uint8_t nzc[15 * 8]) +{ + int32_t i; + + for (i = 16; i < 20; i++) { + if (nzc[scan8[i]]) + ff_h264_idct_add_lasx(dst[0] + blk_offset[i], + block + i * 16 * sizeof(pixel), + 
dst_stride); + else if (((dctcoef *) block)[i * 16]) + ff_h264_idct4x4_addblk_dc_lasx(dst[0] + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + } + for (i = 32; i < 36; i++) { + if (nzc[scan8[i]]) + ff_h264_idct_add_lasx(dst[1] + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + else if (((dctcoef *) block)[i * 16]) + ff_h264_idct4x4_addblk_dc_lasx(dst[1] + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + } +} + +void ff_h264_idct_add8_422_lasx(uint8_t **dst, + const int32_t *blk_offset, + int16_t *block, int32_t dst_stride, + const uint8_t nzc[15 * 8]) +{ + int32_t i; + + for (i = 16; i < 20; i++) { + if (nzc[scan8[i]]) + ff_h264_idct_add_lasx(dst[0] + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + else if (((dctcoef *) block)[i * 16]) + ff_h264_idct4x4_addblk_dc_lasx(dst[0] + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + } + for (i = 32; i < 36; i++) { + if (nzc[scan8[i]]) + ff_h264_idct_add_lasx(dst[1] + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + else if (((dctcoef *) block)[i * 16]) + ff_h264_idct4x4_addblk_dc_lasx(dst[1] + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + } + for (i = 20; i < 24; i++) { + if (nzc[scan8[i + 4]]) + ff_h264_idct_add_lasx(dst[0] + blk_offset[i + 4], + block + i * 16 * sizeof(pixel), + dst_stride); + else if (((dctcoef *) block)[i * 16]) + ff_h264_idct4x4_addblk_dc_lasx(dst[0] + blk_offset[i + 4], + block + i * 16 * sizeof(pixel), + dst_stride); + } + for (i = 36; i < 40; i++) { + if (nzc[scan8[i + 4]]) + ff_h264_idct_add_lasx(dst[1] + blk_offset[i + 4], + block + i * 16 * sizeof(pixel), + dst_stride); + else if (((dctcoef *) block)[i * 16]) + ff_h264_idct4x4_addblk_dc_lasx(dst[1] + blk_offset[i + 4], + block + i * 16 * sizeof(pixel), + dst_stride); + } +} + +void ff_h264_idct_add16_intra_lasx(uint8_t *dst, + const int32_t *blk_offset, + int16_t *block, + int32_t dst_stride, + const uint8_t nzc[15 * 8]) +{ 
+ int32_t i; + + for (i = 0; i < 16; i++) { + if (nzc[scan8[i]]) + ff_h264_idct_add_lasx(dst + blk_offset[i], + block + i * 16 * sizeof(pixel), dst_stride); + else if (((dctcoef *) block)[i * 16]) + ff_h264_idct4x4_addblk_dc_lasx(dst + blk_offset[i], + block + i * 16 * sizeof(pixel), + dst_stride); + } +} + +void ff_h264_deq_idct_luma_dc_lasx(int16_t *dst, int16_t *src, + int32_t de_qval) +{ +#define DC_DEST_STRIDE 16 + + __m256i src0, src1, src2, src3; + __m256i vec0, vec1, vec2, vec3; + __m256i tmp0, tmp1, tmp2, tmp3; + __m256i hres0, hres1, hres2, hres3; + __m256i vres0, vres1, vres2, vres3; + __m256i de_q_vec = __lasx_xvreplgr2vr_w(de_qval); + + DUP4_ARG2(__lasx_xvld, src, 0, src, 8, src, 16, src, 24, + src0, src1, src2, src3); + LASX_TRANSPOSE4x4_H(src0, src1, src2, src3, tmp0, tmp1, tmp2, tmp3); + LASX_BUTTERFLY_4_H(tmp0, tmp2, tmp3, tmp1, vec0, vec3, vec2, vec1); + LASX_BUTTERFLY_4_H(vec0, vec1, vec2, vec3, hres0, hres3, hres2, hres1); + LASX_TRANSPOSE4x4_H(hres0, hres1, hres2, hres3, + hres0, hres1, hres2, hres3); + LASX_BUTTERFLY_4_H(hres0, hres1, hres3, hres2, vec0, vec3, vec2, vec1); + LASX_BUTTERFLY_4_H(vec0, vec1, vec2, vec3, vres0, vres1, vres2, vres3); + DUP4_ARG1(__lasx_vext2xv_w_h, vres0, vres1, vres2, vres3, + vres0, vres1, vres2, vres3); + DUP2_ARG3(__lasx_xvpermi_q, vres1, vres0, 0x20, vres3, vres2, 0x20, + vres0, vres1); + + vres0 = __lasx_xvmul_w(vres0, de_q_vec); + vres1 = __lasx_xvmul_w(vres1, de_q_vec); + + vres0 = __lasx_xvsrari_w(vres0, 8); + vres1 = __lasx_xvsrari_w(vres1, 8); + vec0 = __lasx_xvpickev_h(vres1, vres0); + vec0 = __lasx_xvpermi_d(vec0, 0xd8); + __lasx_xvstelm_h(vec0, dst + 0 * DC_DEST_STRIDE, 0, 0); + __lasx_xvstelm_h(vec0, dst + 2 * DC_DEST_STRIDE, 0, 1); + __lasx_xvstelm_h(vec0, dst + 8 * DC_DEST_STRIDE, 0, 2); + __lasx_xvstelm_h(vec0, dst + 10 * DC_DEST_STRIDE, 0, 3); + __lasx_xvstelm_h(vec0, dst + 1 * DC_DEST_STRIDE, 0, 4); + __lasx_xvstelm_h(vec0, dst + 3 * DC_DEST_STRIDE, 0, 5); + __lasx_xvstelm_h(vec0, dst + 9 * 
DC_DEST_STRIDE, 0, 6); + __lasx_xvstelm_h(vec0, dst + 11 * DC_DEST_STRIDE, 0, 7); + __lasx_xvstelm_h(vec0, dst + 4 * DC_DEST_STRIDE, 0, 8); + __lasx_xvstelm_h(vec0, dst + 6 * DC_DEST_STRIDE, 0, 9); + __lasx_xvstelm_h(vec0, dst + 12 * DC_DEST_STRIDE, 0, 10); + __lasx_xvstelm_h(vec0, dst + 14 * DC_DEST_STRIDE, 0, 11); + __lasx_xvstelm_h(vec0, dst + 5 * DC_DEST_STRIDE, 0, 12); + __lasx_xvstelm_h(vec0, dst + 7 * DC_DEST_STRIDE, 0, 13); + __lasx_xvstelm_h(vec0, dst + 13 * DC_DEST_STRIDE, 0, 14); + __lasx_xvstelm_h(vec0, dst + 15 * DC_DEST_STRIDE, 0, 15); + +#undef DC_DEST_STRIDE +}
From patchwork Tue Dec 14 13:33:15 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?=
X-Patchwork-Id: 32488
From: Hao Chen
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:15 +0800
Message-Id: <20211214133316.8978-7-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
Subject: [FFmpeg-devel] [PATCH v2 6/7] avcodec: [loongarch] Optimize h264_deblock with LASX.
Cc: Jin Bo

From: Jin Bo

./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before:293
after :295

Change-Id: I5ff6cba4eaca0c4218c0c97b880ca500e35f9c87
Signed-off-by: Hao Chen
---
 libavcodec/loongarch/Makefile | 3 +-
 libavcodec/loongarch/h264_deblock_lasx.c | 147 ++++++++++++++++++
 libavcodec/loongarch/h264dsp_init_loongarch.c | 2 +
 libavcodec/loongarch/h264dsp_lasx.h | 6 +
 4 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/loongarch/h264_deblock_lasx.c
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile index 242a2be290..1e1fe3fd48 100644 --- a/libavcodec/loongarch/Makefile +++ b/libavcodec/loongarch/Makefile @@ -4,4 +4,5 @@ OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_init_loongarch.o LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \ - loongarch/h264idct_lasx.o + loongarch/h264idct_lasx.o \ + loongarch/h264_deblock_lasx.o diff --git a/libavcodec/loongarch/h264_deblock_lasx.c b/libavcodec/loongarch/h264_deblock_lasx.c new file mode 100644 index 0000000000..c89bea9a84 --- /dev/null +++ b/libavcodec/loongarch/h264_deblock_lasx.c @@ -0,0 +1,147 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Xiwei Gu + * + * This file is part of FFmpeg. 
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavcodec/bit_depth_template.c"
+#include "h264dsp_lasx.h"
+#include "libavutil/loongarch/loongson_intrinsics.h"
+
+#define H264_LOOP_FILTER_STRENGTH_ITERATION_LASX(edges, step, mask_mv, dir, \
+                                                 d_idx, mask_dir) \
+do { \
+    int b_idx = 0; \
+    int step_x4 = step << 2; \
+    int d_idx_12 = d_idx + 12; \
+    int d_idx_52 = d_idx + 52; \
+    int d_idx_x4 = d_idx << 2; \
+    int d_idx_x4_48 = d_idx_x4 + 48; \
+    int dir_x32 = dir * 32; \
+    uint8_t *ref_t = (uint8_t*)ref; \
+    uint8_t *mv_t = (uint8_t*)mv; \
+    uint8_t *nnz_t = (uint8_t*)nnz; \
+    uint8_t *bS_t = (uint8_t*)bS; \
+    mask_mv <<= 3; \
+    for (; b_idx < edges; b_idx += step) { \
+        out &= mask_dir; \
+        if (!(mask_mv & b_idx)) { \
+            if (bidir) { \
+                ref2 = __lasx_xvldx(ref_t, d_idx_12); \
+                ref3 = __lasx_xvldx(ref_t, d_idx_52); \
+                ref0 = __lasx_xvld(ref_t, 12); \
+                ref1 = __lasx_xvld(ref_t, 52); \
+                ref2 = __lasx_xvilvl_w(ref3, ref2); \
+                ref0 = __lasx_xvilvl_w(ref0, ref0); \
+                ref1 = __lasx_xvilvl_w(ref1, ref1); \
+                ref3 = __lasx_xvshuf4i_w(ref2, 0xB1); \
+                ref0 = __lasx_xvsub_b(ref0, ref2); \
+                ref1 = __lasx_xvsub_b(ref1, ref3); \
+                ref0 = __lasx_xvor_v(ref0, ref1); \
+\
+                tmp2 = __lasx_xvldx(mv_t, d_idx_x4_48); \
+                tmp3 = __lasx_xvld(mv_t, 48); \
+                tmp4 = __lasx_xvld(mv_t, 208); \
+                tmp5 = __lasx_xvld(mv_t + d_idx_x4, 208); \
+                DUP2_ARG3(__lasx_xvpermi_q, tmp2, tmp2, 0x20, tmp5, tmp5, \
+                          0x20, tmp2, tmp5); \
+                tmp3 = __lasx_xvpermi_q(tmp4, tmp3, 0x20); \
+                tmp2 = __lasx_xvsub_h(tmp2, tmp3); \
+                tmp5 = __lasx_xvsub_h(tmp5, tmp3); \
+                DUP2_ARG2(__lasx_xvsat_h, tmp2, 7, tmp5, 7, tmp2, tmp5); \
+                tmp0 = __lasx_xvpickev_b(tmp5, tmp2); \
+                tmp0 = __lasx_xvpermi_d(tmp0, 0xd8); \
+                tmp0 = __lasx_xvadd_b(tmp0, cnst_1); \
+                tmp0 = __lasx_xvssub_bu(tmp0, cnst_0); \
+                tmp0 = __lasx_xvsat_h(tmp0, 7); \
+                tmp0 = __lasx_xvpickev_b(tmp0, tmp0); \
+                tmp0 = __lasx_xvpermi_d(tmp0, 0xd8); \
+                tmp1 = __lasx_xvpickod_d(tmp0, tmp0); \
+                out = __lasx_xvor_v(ref0, tmp0); \
+                tmp1 = __lasx_xvshuf4i_w(tmp1, 0xB1); \
+                out = __lasx_xvor_v(out, tmp1); \
+                tmp0 = __lasx_xvshuf4i_w(out, 0xB1); \
+                out = __lasx_xvmin_bu(out, tmp0); \
+            } else { \
+                ref0 = __lasx_xvldx(ref_t, d_idx_12); \
+                ref3 = __lasx_xvld(ref_t, 12); \
+                tmp2 = __lasx_xvldx(mv_t, d_idx_x4_48); \
+                tmp3 = __lasx_xvld(mv_t, 48); \
+                tmp4 = __lasx_xvsub_h(tmp3, tmp2); \
+                tmp1 = __lasx_xvsat_h(tmp4, 7); \
+                tmp1 = __lasx_xvpickev_b(tmp1, tmp1); \
+                tmp1 = __lasx_xvadd_b(tmp1, cnst_1); \
+                out = __lasx_xvssub_bu(tmp1, cnst_0); \
+                out = __lasx_xvsat_h(out, 7); \
+                out = __lasx_xvpickev_b(out, out); \
+                ref0 = __lasx_xvsub_b(ref3, ref0); \
+                out = __lasx_xvor_v(out, ref0); \
+            } \
+        } \
+        tmp0 = __lasx_xvld(nnz_t, 12); \
+        tmp1 = __lasx_xvldx(nnz_t, d_idx_12); \
+        tmp0 = __lasx_xvor_v(tmp0, tmp1); \
+        tmp0 = __lasx_xvmin_bu(tmp0, cnst_2); \
+        out = __lasx_xvmin_bu(out, cnst_2); \
+        tmp0 = __lasx_xvslli_h(tmp0, 1); \
+        tmp0 = __lasx_xvmax_bu(out, tmp0); \
+        tmp0 = __lasx_vext2xv_hu_bu(tmp0); \
+        __lasx_xvstelm_d(tmp0, bS_t + dir_x32, 0, 0); \
+        ref_t += step; \
+        mv_t += step_x4; \
+        nnz_t += step; \
+        bS_t += step; \
+    } \
+} while(0)
+
+void ff_h264_loop_filter_strength_lasx(int16_t bS[2][4][4], uint8_t nnz[40],
+                                       int8_t ref[2][40], int16_t mv[2][40][2],
+                                       int bidir, int edges, int step,
+                                       int mask_mv0, int mask_mv1, int field)
+{
+    __m256i out;
+    __m256i ref0, ref1, ref2, ref3;
+    __m256i tmp0, tmp1;
+    __m256i tmp2, tmp3, tmp4, tmp5;
+    __m256i cnst_0, cnst_1, cnst_2;
+    __m256i zero = __lasx_xvldi(0);
+    __m256i one = __lasx_xvnor_v(zero, zero);
+    int64_t cnst3 = 0x0206020602060206, cnst4 = 0x0103010301030103;
+    if (field) {
+        cnst_0 = __lasx_xvreplgr2vr_d(cnst3);
+        cnst_1 = __lasx_xvreplgr2vr_d(cnst4);
+        cnst_2 = __lasx_xvldi(0x01);
+    } else {
+        DUP2_ARG1(__lasx_xvldi, 0x06, 0x03, cnst_0, cnst_1);
+        cnst_2 = __lasx_xvldi(0x01);
+    }
+    step <<= 3;
+    edges <<= 3;
+
+    H264_LOOP_FILTER_STRENGTH_ITERATION_LASX(edges, step, mask_mv1,
+                                             1, -8, zero);
+    H264_LOOP_FILTER_STRENGTH_ITERATION_LASX(32, 8, mask_mv0, 0, -1, one);
+
+    DUP2_ARG2(__lasx_xvld, (int8_t*)bS, 0, (int8_t*)bS, 16, tmp0, tmp1);
+    DUP2_ARG2(__lasx_xvilvh_d, tmp0, tmp0, tmp1, tmp1, tmp2, tmp3);
+    LASX_TRANSPOSE4x4_H(tmp0, tmp2, tmp1, tmp3, tmp2, tmp3, tmp4, tmp5);
+    __lasx_xvstelm_d(tmp2, (int8_t*)bS, 0, 0);
+    __lasx_xvstelm_d(tmp3, (int8_t*)bS + 8, 0, 0);
+    __lasx_xvstelm_d(tmp4, (int8_t*)bS + 16, 0, 0);
+    __lasx_xvstelm_d(tmp5, (int8_t*)bS + 24, 0, 0);
+}
diff --git a/libavcodec/loongarch/h264dsp_init_loongarch.c b/libavcodec/loongarch/h264dsp_init_loongarch.c
index 0985c2fe8a..37633c3e51 100644
--- a/libavcodec/loongarch/h264dsp_init_loongarch.c
+++ b/libavcodec/loongarch/h264dsp_init_loongarch.c
@@ -29,6 +29,8 @@ av_cold void ff_h264dsp_init_loongarch(H264DSPContext *c, const int bit_depth,
     int cpu_flags = av_get_cpu_flags();
 
     if (have_lasx(cpu_flags)) {
+        if (chroma_format_idc <= 1)
+            c->h264_loop_filter_strength = ff_h264_loop_filter_strength_lasx;
         if (bit_depth == 8) {
             c->h264_add_pixels4_clear = ff_h264_add_pixels4_8_lasx;
             c->h264_add_pixels8_clear = ff_h264_add_pixels8_8_lasx;
diff --git a/libavcodec/loongarch/h264dsp_lasx.h b/libavcodec/loongarch/h264dsp_lasx.h
index bfd567fffa..4cf813750b 100644
--- a/libavcodec/loongarch/h264dsp_lasx.h
+++ b/libavcodec/loongarch/h264dsp_lasx.h
@@ -88,4 +88,10 @@ void ff_h264_idct_add16_intra_lasx(uint8_t *dst, const int32_t *blk_offset,
                                    const uint8_t nzc[15 * 8]);
 void ff_h264_deq_idct_luma_dc_lasx(int16_t *dst, int16_t *src,
                                    int32_t de_qval);
+
+void ff_h264_loop_filter_strength_lasx(int16_t bS[2][4][4], uint8_t nnz[40],
+                                       int8_t ref[2][40], int16_t mv[2][40][2],
+                                       int bidir, int edges, int step,
+                                       int mask_mv0, int mask_mv1, int field);
+
 #endif // #ifndef AVCODEC_LOONGARCH_H264DSP_LASX_H

From patchwork Tue Dec 14 13:33:16 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?=
X-Patchwork-Id: 32490
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp6966815iog;
 Tue, 14 Dec 2021 05:35:32 -0800 (PST)
X-Google-Smtp-Source: ABdhPJzroN9XYpSjLwV4BCaCBJXJFKL6Sl/UJuK5JwXOn0KjjqtE2a89JAGUoLcF2eQWAIk0wDn3
X-Received: by 2002:aa7:dd47:: with SMTP id o7mr7727984edw.34.1639488932666;
 Tue, 14 Dec 2021 05:35:32 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1639488932; cv=none; d=google.com; s=arc-20160816;
 b=lxRK69PwW4K29CNP3Lb4fOOdS7SESno5KkgITxdIuFPRM0sRn4wjHot8TOiTh5T3sO
 /iygKaNBSWGOqLoP6nyq39ZA6dhhiODM3k+C+xq3Tx6JyNuYc7hDTni98YKtfgM5KP0f
 Yc4UXqkstOR5rQ4JzS3OCA4oIMr3oaGwpaWHn4oArgP2AvirVP2oSXwssXD2lcoblRiR
 6RAz2yCSu/QSx5OwmWl2dWjj4LLK22SV7L+wAMmwy1WYbkDqiuBjYf4O654sXl+GLo6y
 HrvfVSwJ0ongmTeEY9OwrzTAQbUyblS5ASEoLBYzqtWmBAJBAfs4x2ZjBLgAdFi+CyEV
 hPbg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
 h=sender:errors-to:content-transfer-encoding:reply-to:list-subscribe
 :list-help:list-post:list-archive:list-unsubscribe:list-id
 :precedence:subject:mime-version:references:in-reply-to:message-id
 :date:to:from:delivered-to;
 bh=nc1X72D8rL0VpWd768OEIjzMmRQ4BlTQguWk2JRbPyA=;
 b=uOdG6b92R3xjDsdw0ltcpkLDcFPZBdKtufBm6X9MLFUsyInijsKOBQEgm6MgwKRuHo
 YmGLVzNPA5nzf1+bYol4PhJ1/raR5W+7lRsS91jB4VudUwTyQvqQjPKep1yM7GJ3uIUo
 E6nj6uS+losMndDooTg8rrLNrlEvfSJNpMqslv9qZy1RJfdYT06TYv0ZXeMunMx5M351
 O4KPbCl5WUOE1Ddnah2jqKi/wUa+J95joYLFokK6qh0ZwLft47Pwqzy9foTeuCwHtuqN
 8LkDEls2wp7ZtaJlxpHBFfuqfzT6f9WSpAsljgvylxjKToAYhZccKQP1Ad9Gb7ZpZJw6
 7x+g==
ARC-Authentication-Results: i=1; mx.google.com;
 spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path:
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
 by mx.google.com with ESMTP id dy14si21707139edb.594.2021.12.14.05.35.31;
 Tue, 14 Dec 2021 05:35:32 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender)
 client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
 spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 93F4D68AF41;
 Tue, 14 Dec 2021 15:34:02 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from loongson.cn (mail.loongson.cn [114.242.206.163])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id B689E68AECF for ;
 Tue, 14 Dec 2021 15:33:48 +0200 (EET)
Received: from localhost (unknown [36.33.26.144])
 by mail.loongson.cn (Coremail) with SMTP id AQAAf9Dx2ZY6nbhhmKcAAA--.480S3;
 Tue, 14 Dec 2021 21:33:46 +0800 (CST)
From: Hao Chen
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:16 +0800
Message-Id: <20211214133316.8978-8-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
MIME-Version: 1.0
X-CM-TRANSID: AQAAf9Dx2ZY6nbhhmKcAAA--.480S3
X-Coremail-Antispam: 1UD129KBjvDXoW8JrWkuw13ZryDtw1kZoXrpw1ktrc_Gw1SkF
 18Cr4rCas2ga1jgw13Cr98ZrW8AFnxAryvyFnaqa45XFyrXa1kX3Wjvw1UKr97ZFy5J343
 t3Z7Aw1UKjkaLaAFLSUrUUUUjb8apTn2vfkv8UJUUUU8Yxn0WfASr-VFAUDa7-sFnT9fnU
 UIcSsGvfJTRUUUb28YjsxI4VWkKwAYFVCjjxCrM7AC8VAFwI0_Jr0_Gr1l1xkIjI8I6I8E
 6xAIw20EY4v20xvaj40_Wr0E3s1l1IIY67AEw4v_Jr0_Jr4l8cAvFVAK0II2c7xJM28Cjx
 kF64kEwVA0rcxSw2x7M28EF7xvwVC0I7IYx2IY67AKxVW5JVW7JwA2z4x0Y4vE2Ix0cI8I
 cVCY1x0267AKxVWxJVW8Jr1l84ACjcxK6I8E87Iv67AKxVW0oVCq3wA2z4x0Y4vEx4A2js
 IEc7CjxVAFwI0_GcCE3s1le2I262IYc4CY6c8Ij28IcVAaY2xG8wAqx4xG64xvF2IEw4CE
 5I8CrVC2j2WlYx0E2Ix0cI8IcVAFwI0_Jw0_WrylYx0Ex4A2jsIE14v26r4UJVWxJr1lOx
 8S6xCaFVCjc4AY6r1j6r4UM4x0Y48IcxkI7VAKI48JMxkIecxEwVAFwVW5GwCF04k20xvY
 0x0EwIxGrwCFx2IqxVCFs4IE7xkEbVWUJVW8JwC20s026c02F40E14v26r1j6r18MI8I3I
 0E7480Y4vE14v26r106r1rMI8E67AF67kF1VAFwI0_Jr0_JrylIxkGc2Ij64vIr41lIxAI
 cVC0I7IYx2IY67AKxVW8JVW5JwCI42IY6xIIjxv20xvEc7CjxVAFwI0_Gr0_Cr1lIxAIcV
 CF04k26cxKx2IYs7xG6r1j6r1xMIIF0xvEx4A2jsIE14v26r4j6F4UMIIF0xvEx4A2jsIE
 c7CjxVAFwI0_Gr0_Gr1UYxBIdaVFxhVjvjDU0xZFpf9x07bwo7NUUUUU=
X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/
Subject: [FFmpeg-devel] [PATCH v2 7/7] avcodec: [loongarch] Optimize pred16x16_plane with LASX.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: FFmpeg development discussions and patches
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"
X-TUID: 479ClSNE1gCA

./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before:295
after :296

Change-Id: I281bc739f708d45f91fc3860150944c0b8a6a5ba
---
 libavcodec/h264pred.c | 2 +
 libavcodec/h264pred.h | 2 +
 libavcodec/loongarch/Makefile | 2 +
 .../loongarch/h264_intrapred_init_loongarch.c | 50 ++++++++
 libavcodec/loongarch/h264_intrapred_lasx.c | 121 ++++++++++++++++++
 libavcodec/loongarch/h264_intrapred_lasx.h | 31 +++++
 6 files changed, 208 insertions(+)
 create mode 100644 libavcodec/loongarch/h264_intrapred_init_loongarch.c
 create mode 100644 libavcodec/loongarch/h264_intrapred_lasx.c
 create mode 100644 libavcodec/loongarch/h264_intrapred_lasx.h

diff --git a/libavcodec/h264pred.c b/libavcodec/h264pred.c
index b0fec71f25..bd0d4a3d06 100644
--- a/libavcodec/h264pred.c
+++ b/libavcodec/h264pred.c
@@ -602,4 +602,6 @@ av_cold void ff_h264_pred_init(H264PredContext *h, int codec_id,
         ff_h264_pred_init_x86(h, codec_id, bit_depth, chroma_format_idc);
     if (ARCH_MIPS)
         ff_h264_pred_init_mips(h, codec_id, bit_depth, chroma_format_idc);
+    if (ARCH_LOONGARCH)
+        ff_h264_pred_init_loongarch(h, codec_id, bit_depth, chroma_format_idc);
 }
diff --git a/libavcodec/h264pred.h b/libavcodec/h264pred.h
index 2863dc9bd1..4583052dfe 100644
--- a/libavcodec/h264pred.h
+++ b/libavcodec/h264pred.h
@@ -122,5 +122,7 @@ void ff_h264_pred_init_x86(H264PredContext *h, int codec_id,
                            const int bit_depth, const int chroma_format_idc);
 void ff_h264_pred_init_mips(H264PredContext *h, int codec_id,
                             const int bit_depth, const int chroma_format_idc);
+void ff_h264_pred_init_loongarch(H264PredContext *h, int codec_id,
+                                 const int bit_depth, const int chroma_format_idc);
 
 #endif /* AVCODEC_H264PRED_H */
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 1e1fe3fd48..30799e4e48 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -1,8 +1,10 @@
 OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_init_loongarch.o
 OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_init_loongarch.o
 OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_init_loongarch.o
+OBJS-$(CONFIG_H264PRED) += loongarch/h264_intrapred_init_loongarch.o
 LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o
 LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o
 LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \
                                loongarch/h264idct_lasx.o \
                                loongarch/h264_deblock_lasx.o
+LASX-OBJS-$(CONFIG_H264PRED) += loongarch/h264_intrapred_lasx.o
diff --git a/libavcodec/loongarch/h264_intrapred_init_loongarch.c b/libavcodec/loongarch/h264_intrapred_init_loongarch.c
new file mode 100644
index 0000000000..12620bd842
--- /dev/null
+++ b/libavcodec/loongarch/h264_intrapred_init_loongarch.c
@@ -0,0 +1,50 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/cpu.h"
+#include "libavcodec/h264pred.h"
+#include "h264_intrapred_lasx.h"
+
+av_cold void ff_h264_pred_init_loongarch(H264PredContext *h, int codec_id,
+                                         const int bit_depth,
+                                         const int chroma_format_idc)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (bit_depth == 8) {
+        if (have_lasx(cpu_flags)) {
+            if (chroma_format_idc <= 1) {
+            }
+            if (codec_id == AV_CODEC_ID_VP7 || codec_id == AV_CODEC_ID_VP8) {
+            } else {
+                if (chroma_format_idc <= 1) {
+                }
+                if (codec_id == AV_CODEC_ID_SVQ3) {
+                    h->pred16x16[PLANE_PRED8x8] = ff_h264_pred16x16_plane_svq3_8_lasx;
+                } else if (codec_id == AV_CODEC_ID_RV40) {
+                    h->pred16x16[PLANE_PRED8x8] = ff_h264_pred16x16_plane_rv40_8_lasx;
+                } else {
+                    h->pred16x16[PLANE_PRED8x8] = ff_h264_pred16x16_plane_h264_8_lasx;
+                }
+            }
+        }
+    }
+}
diff --git a/libavcodec/loongarch/h264_intrapred_lasx.c b/libavcodec/loongarch/h264_intrapred_lasx.c
new file mode 100644
index 0000000000..c38cd611b8
--- /dev/null
+++ b/libavcodec/loongarch/h264_intrapred_lasx.c
@@ -0,0 +1,121 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/loongson_intrinsics.h"
+#include "h264_intrapred_lasx.h"
+
+#define PRED16X16_PLANE \
+    ptrdiff_t stride_1, stride_2, stride_3, stride_4, stride_5, stride_6; \
+    ptrdiff_t stride_8, stride_15; \
+    int32_t res0, res1, res2, res3, cnt; \
+    uint8_t *src0, *src1; \
+    __m256i reg0, reg1, reg2, reg3, reg4; \
+    __m256i tmp0, tmp1, tmp2, tmp3; \
+    __m256i shuff = {0x0B040A0509060807, 0x0F000E010D020C03, 0, 0}; \
+    __m256i mult = {0x0004000300020001, 0x0008000700060005, 0, 0}; \
+    __m256i int_mult1 = {0x0000000100000000, 0x0000000300000002, \
+                         0x0000000500000004, 0x0000000700000006}; \
+    \
+    stride_1 = -stride; \
+    stride_2 = stride << 1; \
+    stride_3 = stride_2 + stride; \
+    stride_4 = stride_2 << 1; \
+    stride_5 = stride_4 + stride; \
+    stride_6 = stride_3 << 1; \
+    stride_8 = stride_4 << 1; \
+    stride_15 = (stride_8 << 1) - stride; \
+    src0 = src - 1; \
+    src1 = src0 + stride_8; \
+    \
+    reg0 = __lasx_xvldx(src0, -stride); \
+    reg1 = __lasx_xvldx(src, (8 - stride)); \
+    reg0 = __lasx_xvilvl_d(reg1, reg0); \
+    reg0 = __lasx_xvshuf_b(reg0, reg0, shuff); \
+    reg0 = __lasx_xvhsubw_hu_bu(reg0, reg0); \
+    reg0 = __lasx_xvmul_h(reg0, mult); \
+    res1 = (src1[0] - src0[stride_6]) + \
+           2 * (src1[stride] - src0[stride_5]) + \
+           3 * (src1[stride_2] - src0[stride_4]) + \
+           4 * (src1[stride_3] - src0[stride_3]) + \
+           5 * (src1[stride_4] - src0[stride_2]) + \
+           6 * (src1[stride_5] - src0[stride]) + \
+           7 * (src1[stride_6] - src0[0]) + \
+           8 * (src0[stride_15] - src0[stride_1]); \
+    reg0 = __lasx_xvhaddw_w_h(reg0, reg0); \
+    reg0 = __lasx_xvhaddw_d_w(reg0, reg0); \
+    reg0 = __lasx_xvhaddw_q_d(reg0, reg0); \
+    res0 = __lasx_xvpickve2gr_w(reg0, 0); \
+
+#define PRED16X16_PLANE_END \
+    res2 = (src0[stride_15] + src[15 - stride] + 1) << 4; \
+    res3 = 7 * (res0 + res1); \
+    res2 -= res3; \
+    reg0 = __lasx_xvreplgr2vr_w(res0); \
+    reg1 = __lasx_xvreplgr2vr_w(res1); \
+    reg2 = __lasx_xvreplgr2vr_w(res2); \
+    reg3 = __lasx_xvmul_w(reg0, int_mult1); \
+    reg4 = __lasx_xvslli_w(reg0, 3); \
+    reg4 = __lasx_xvadd_w(reg4, reg3); \
+    for (cnt = 8; cnt--;) { \
+        tmp0 = __lasx_xvadd_w(reg2, reg3); \
+        tmp1 = __lasx_xvadd_w(reg2, reg4); \
+        tmp0 = __lasx_xvssrani_hu_w(tmp1, tmp0, 5); \
+        tmp0 = __lasx_xvpermi_d(tmp0, 0xD8); \
+        reg2 = __lasx_xvadd_w(reg2, reg1); \
+        tmp2 = __lasx_xvadd_w(reg2, reg3); \
+        tmp3 = __lasx_xvadd_w(reg2, reg4); \
+        tmp1 = __lasx_xvssrani_hu_w(tmp3, tmp2, 5); \
+        tmp1 = __lasx_xvpermi_d(tmp1, 0xD8); \
+        tmp0 = __lasx_xvssrani_bu_h(tmp1, tmp0, 0); \
+        reg2 = __lasx_xvadd_w(reg2, reg1); \
+        __lasx_xvstelm_d(tmp0, src, 0, 0); \
+        __lasx_xvstelm_d(tmp0, src, 8, 2); \
+        src += stride; \
+        __lasx_xvstelm_d(tmp0, src, 0, 1); \
+        __lasx_xvstelm_d(tmp0, src, 8, 3); \
+        src += stride; \
+    }
+
+
+void ff_h264_pred16x16_plane_h264_8_lasx(uint8_t *src, ptrdiff_t stride)
+{
+    PRED16X16_PLANE
+    res0 = (5 * res0 + 32) >> 6;
+    res1 = (5 * res1 + 32) >> 6;
+    PRED16X16_PLANE_END
+}
+
+void ff_h264_pred16x16_plane_rv40_8_lasx(uint8_t *src, ptrdiff_t stride)
+{
+    PRED16X16_PLANE
+    res0 = (res0 + (res0 >> 2)) >> 4;
+    res1 = (res1 + (res1 >> 2)) >> 4;
+    PRED16X16_PLANE_END
+}
+
+void ff_h264_pred16x16_plane_svq3_8_lasx(uint8_t *src, ptrdiff_t stride)
+{
+    PRED16X16_PLANE
+    cnt = (5 * (res0/4)) / 16;
+    res0 = (5 * (res1/4)) / 16;
+    res1 = cnt;
+    PRED16X16_PLANE_END
+}
diff --git a/libavcodec/loongarch/h264_intrapred_lasx.h b/libavcodec/loongarch/h264_intrapred_lasx.h
new file mode 100644
index 0000000000..0c2653300c
--- /dev/null
+++ b/libavcodec/loongarch/h264_intrapred_lasx.h
@@ -0,0 +1,31 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_H264_INTRAPRED_LASX_H
+#define AVCODEC_LOONGARCH_H264_INTRAPRED_LASX_H
+
+#include "libavcodec/avcodec.h"
+
+void ff_h264_pred16x16_plane_h264_8_lasx(uint8_t *src, ptrdiff_t stride);
+void ff_h264_pred16x16_plane_rv40_8_lasx(uint8_t *src, ptrdiff_t stride);
+void ff_h264_pred16x16_plane_svq3_8_lasx(uint8_t *src, ptrdiff_t stride);
+
+#endif // #ifndef AVCODEC_LOONGARCH_H264_INTRAPRED_LASX_H
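
Note for readers cross-checking the PRED16X16_PLANE and PRED16X16_PLANE_END macros in h264_intrapred_lasx.c: the arithmetic they vectorize can be written as a plain scalar loop. The sketch below is an illustration under stated assumptions, not code from this patch. The names pred16x16_plane_h264_ref and clip_uint8 are hypothetical, only the H.264 rounding variant is covered (the RV40 and SVQ3 variants in the patch scale the two gradients differently, and SVQ3 additionally swaps them), and it assumes that right-shifting a negative int is an arithmetic shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit sample (hypothetical helper). */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference for 16x16 plane prediction, H.264 rounding.
 * src points at the top-left sample of the block; the row above, the column
 * to the left and the top-left corner must be readable. */
static void pred16x16_plane_h264_ref(uint8_t *src, ptrdiff_t stride)
{
    assert(src != NULL);

    const uint8_t *top  = src - stride; /* row above the block */
    const uint8_t *left = src - 1;      /* column left of the block */
    int H = 0, V = 0;

    /* Weighted differences around the centre of the top row (H, res0 in the
     * macro) and of the left column (V, res1 in the macro). */
    for (int k = 1; k <= 8; k++) {
        H += k * (top[7 + k] - top[7 - k]);
        V += k * (left[(7 + k) * stride] - left[(7 - k) * stride]);
    }

    /* H.264 scaling of the two gradients. */
    H = (5 * H + 32) >> 6;
    V = (5 * V + 32) >> 6;

    /* Value at the top-left sample, scaled by 32 (res2 in the macro). */
    int a = 16 * (left[15 * stride] + top[15] + 1) - 7 * (V + H);

    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            src[x] = clip_uint8((a + x * H) >> 5);
        a   += V;
        src += stride;
    }
}
```

With a flat neighbourhood (every border sample equal to 100) both gradients are zero and every predicted sample comes out as 100, which makes a simple smoke test when comparing a scalar reference like this against the LASX functions.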