From patchwork Tue Dec 14 13:33:10 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= <chenhao@loongson.cn>
X-Patchwork-Id: 32484
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp6964854iog;
        Tue, 14 Dec 2021 05:33:59 -0800 (PST)
X-Google-Smtp-Source: 
 ABdhPJx0aIR0m3ojs08dLu2rEXJ3jUZzLs7HZSmFHH+USbIUGbGohCQuK6ZtYeoCgT0c3goJn0MO
X-Received: by 2002:a17:906:2ac4:: with SMTP id
 m4mr5680794eje.734.1639488838725;
        Tue, 14 Dec 2021 05:33:58 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1639488838; cv=none;
        d=google.com; s=arc-20160816;
        b=kOXjP3wD/qT5XD6EiaG5NhzK4xL3VlB5jPPDEKWGW5woiQnfkYfOUkjBOAxMyyCUFm
         Rj8wqXnygyGnHZLbh8ZM5coI5a4Wl7e9902zlv52xl+MAJkDi3gl2T+I2uWiIQ2Wp5CM
         MXSIoYH/GhxoB+IpN9qcV965VRoNHPoWqLheuxEIFCuMhSFomp+amy/mfcDho0RCImyK
         8RqP3DqL5Ulh2uL5JusIaBy/dg3UsyIB3PvCuuc2fd8htqvMouS0Ibalvsbxfnc62I0t
         xAhb4+M3HsWR3hJHYgKkppS2cA3wVfEQxGhkMvU39qMePyoiNHJSKZktUGsxUlEgxd7m
         sTsA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:delivered-to;
        bh=UkDzGZrFbMm6Rc4ew+1bsQiIzOYhxy/IUyzQtecdy3w=;
        b=G6mMyekJotnW3TibjdoneQxztkScuyQqbKfrJJYTVXpiQi+iBvvUn0AsJV5UkCHvyZ
         oxBlquEGOJfdGYa0DxDM/tlEPUnmllM+vysi7oRn7YvuSKC+IRu6Rk2+/DdNetaTDyHb
         zAHg+WN0DeUmE3PXZBtBL8UMzhttDoESaqxW6EWfMVIQ1g3W9KIIcOxi8nIZEG4PQ6dU
         oG42FUZ1eaWPdbisA0VsULHzwGzAE0gD1+/QPVNkcquRtVY8kT+XDCErIYVgScPlzfVB
         7SrQUo6rUJcCwOOg6ODMzMphN6EfE3z0FPoZ232SR6r1MXjinQb9kEsm7/8FtGuVjp1k
         sWjg==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 js19si26362680ejc.573.2021.12.14.05.33.58;
        Tue, 14 Dec 2021 05:33:58 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 75A5468ADF7;
	Tue, 14 Dec 2021 15:33:52 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from loongson.cn (mail.loongson.cn [114.242.206.163])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 0C1C968AAD7
 for <ffmpeg-devel@ffmpeg.org>; Tue, 14 Dec 2021 15:33:43 +0200 (EET)
Received: from localhost (unknown [36.33.26.144])
 by mail.loongson.cn (Coremail) with SMTP id AQAAf9Dx7Nw0nbhhkKcAAA--.3410S3;
 Tue, 14 Dec 2021 21:33:41 +0800 (CST)
From: Hao Chen <chenhao@loongson.cn>
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:10 +0800
Message-Id: <20211214133316.8978-2-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
MIME-Version: 1.0
X-CM-TRANSID: AQAAf9Dx7Nw0nbhhkKcAAA--.3410S3
X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/
Subject: [FFmpeg-devel] [PATCH v2 1/7] avutil: [loongarch] Add support for
 loongarch SIMD.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Shiyou Yin <yinshiyou-hf@loongson.cn>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: MnaQHeGgv9vs

From: Shiyou Yin <yinshiyou-hf@loongson.cn>

LSX and LASX are the LoongArch SIMD extensions.
They are enabled by default if the compiler supports them, and can be
disabled with '--disable-lsx' and '--disable-lasx'.

Change-Id: Ie2608ea61dbd9b7fffadbf0ec2348bad6c124476
---
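Note on the runtime detection: the sketch below restates, in plain C, the
bit tests that libavutil/loongarch/cpu.c applies to CPUCFG word 2 and the
alignment rule in ff_get_cpu_max_align_loongarch(). The bit positions and
flag values are copied from this patch; the helper names
sketch_flags_from_cfg2() and sketch_max_align() are hypothetical and exist
only for illustration. The raw word is passed in as a parameter so the logic
can be exercised on any host without the cpucfg instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit positions in CPUCFG word 2, as defined in libavutil/loongarch/cpu.c. */
#define LOONGARCH_CFG2_LSX  (1 << 6)
#define LOONGARCH_CFG2_LASX (1 << 7)

/* Flag values as added to libavutil/cpu.h by this patch. */
#define AV_CPU_FLAG_LSX  (1 << 0)
#define AV_CPU_FLAG_LASX (1 << 1)

/* Hypothetical helper: decode a raw CPUCFG word 2 value into AV_CPU_FLAG_* bits. */
static int sketch_flags_from_cfg2(uint32_t cfg2)
{
    int flags = 0;

    if (cfg2 & LOONGARCH_CFG2_LSX)
        flags |= AV_CPU_FLAG_LSX;
    if (cfg2 & LOONGARCH_CFG2_LASX)
        flags |= AV_CPU_FLAG_LASX;

    return flags;
}

/* Hypothetical helper: same alignment rule as ff_get_cpu_max_align_loongarch(),
 * i.e. 32 bytes with LASX, 16 bytes with LSX, 8 bytes otherwise. */
static size_t sketch_max_align(int flags)
{
    if (flags & AV_CPU_FLAG_LASX)
        return 32;
    if (flags & AV_CPU_FLAG_LSX)
        return 16;

    return 8;
}
```

For example, a word 2 value with bits 6 and 7 set yields both flags and a
32-byte maximum alignment, while a value of 0 yields no flags and 8 bytes.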
 Makefile                     |  2 +-
 configure                    | 20 +++++++++--
 ffbuild/arch.mak             |  4 ++-
 ffbuild/common.mak           |  8 +++++
 libavutil/cpu.c              |  7 ++++
 libavutil/cpu.h              |  4 +++
 libavutil/cpu_internal.h     |  2 ++
 libavutil/loongarch/Makefile |  1 +
 libavutil/loongarch/cpu.c    | 69 ++++++++++++++++++++++++++++++++++++
 libavutil/loongarch/cpu.h    | 31 ++++++++++++++++
 libavutil/tests/cpu.c        |  3 ++
 tests/checkasm/checkasm.c    |  3 ++
 12 files changed, 150 insertions(+), 4 deletions(-)
 create mode 100644 libavutil/loongarch/Makefile
 create mode 100644 libavutil/loongarch/cpu.c
 create mode 100644 libavutil/loongarch/cpu.h

diff --git a/Makefile b/Makefile
index 26c9107237..5b20658b52 100644
--- a/Makefile
+++ b/Makefile
@@ -89,7 +89,7 @@ SUBDIR_VARS := CLEANFILES FFLIBS HOSTPROGS TESTPROGS TOOLS               \
                ARMV5TE-OBJS ARMV6-OBJS ARMV8-OBJS VFP-OBJS NEON-OBJS     \
                ALTIVEC-OBJS VSX-OBJS MMX-OBJS X86ASM-OBJS                \
                MIPSFPU-OBJS MIPSDSPR2-OBJS MIPSDSP-OBJS MSA-OBJS         \
-               MMI-OBJS OBJS SLIBOBJS HOSTOBJS TESTOBJS
+               MMI-OBJS LSX-OBJS LASX-OBJS OBJS SLIBOBJS HOSTOBJS TESTOBJS
 
 define RESET
 $(1) :=
diff --git a/configure b/configure
index a7593ec2db..c4afde4c5c 100755
--- a/configure
+++ b/configure
@@ -452,7 +452,9 @@ Optimization options (experts only):
   --disable-mipsdspr2      disable MIPS DSP ASE R2 optimizations
   --disable-msa            disable MSA optimizations
   --disable-mipsfpu        disable floating point MIPS optimizations
-  --disable-mmi            disable Loongson SIMD optimizations
+  --disable-mmi            disable Loongson MMI optimizations
+  --disable-lsx            disable Loongson LSX optimizations
+  --disable-lasx           disable Loongson LASX optimizations
   --disable-fast-unaligned consider unaligned accesses slow
 
 Developer options (useful when working on FFmpeg itself):
@@ -2081,6 +2083,8 @@ ARCH_EXT_LIST_LOONGSON="
     loongson2
     loongson3
     mmi
+    lsx
+    lasx
 "
 
 ARCH_EXT_LIST_X86_SIMD="
@@ -2617,6 +2621,10 @@ power8_deps="vsx"
 
 loongson2_deps="mips"
 loongson3_deps="mips"
+mmi_deps_any="loongson2 loongson3"
+lsx_deps="loongarch"
+lasx_deps="lsx"
+
 mips32r2_deps="mips"
 mips32r5_deps="mips"
 mips32r6_deps="mips"
@@ -2625,7 +2633,6 @@ mips64r6_deps="mips"
 mipsfpu_deps="mips"
 mipsdsp_deps="mips"
 mipsdspr2_deps="mips"
-mmi_deps_any="loongson2 loongson3"
 msa_deps="mipsfpu"
 
 cpunop_deps="i686"
@@ -6134,6 +6141,9 @@ EOF
         ;;
     esac
 
+elif enabled loongarch; then
+    enabled lsx && check_inline_asm lsx '"vadd.b $vr0, $vr1, $vr2"' '-mlsx' && append LSXFLAGS '-mlsx'
+    enabled lasx && check_inline_asm lasx '"xvadd.b $xr0, $xr1, $xr2"' '-mlasx' && append LASXFLAGS '-mlasx'
 fi
 
 check_cc intrinsics_neon arm_neon.h "int16x8_t test = vdupq_n_s16(0)"
@@ -7484,6 +7494,10 @@ if enabled ppc; then
     echo "PPC 4xx optimizations     ${ppc4xx-no}"
     echo "dcbzl available           ${dcbzl-no}"
 fi
+if enabled loongarch; then
+    echo "LSX enabled               ${lsx-no}"
+    echo "LASX enabled              ${lasx-no}"
+fi
 echo "debug symbols             ${debug-no}"
 echo "strip symbols             ${stripping-no}"
 echo "optimize for size         ${small-no}"
@@ -7645,6 +7659,8 @@ ASMSTRIPFLAGS=$ASMSTRIPFLAGS
 X86ASMFLAGS=$X86ASMFLAGS
 MSAFLAGS=$MSAFLAGS
 MMIFLAGS=$MMIFLAGS
+LSXFLAGS=$LSXFLAGS
+LASXFLAGS=$LASXFLAGS
 BUILDSUF=$build_suffix
 PROGSSUF=$progs_suffix
 FULLNAME=$FULLNAME
diff --git a/ffbuild/arch.mak b/ffbuild/arch.mak
index e09006efca..997e31e85e 100644
--- a/ffbuild/arch.mak
+++ b/ffbuild/arch.mak
@@ -8,7 +8,9 @@ OBJS-$(HAVE_MIPSFPU)   += $(MIPSFPU-OBJS)    $(MIPSFPU-OBJS-yes)
 OBJS-$(HAVE_MIPSDSP)   += $(MIPSDSP-OBJS)    $(MIPSDSP-OBJS-yes)
 OBJS-$(HAVE_MIPSDSPR2) += $(MIPSDSPR2-OBJS)  $(MIPSDSPR2-OBJS-yes)
 OBJS-$(HAVE_MSA)       += $(MSA-OBJS)        $(MSA-OBJS-yes)
-OBJS-$(HAVE_MMI)   += $(MMI-OBJS)   $(MMI-OBJS-yes)
+OBJS-$(HAVE_MMI)       += $(MMI-OBJS)        $(MMI-OBJS-yes)
+OBJS-$(HAVE_LSX)       += $(LSX-OBJS)        $(LSX-OBJS-yes)
+OBJS-$(HAVE_LASX)      += $(LASX-OBJS)       $(LASX-OBJS-yes)
 
 OBJS-$(HAVE_ALTIVEC) += $(ALTIVEC-OBJS) $(ALTIVEC-OBJS-yes)
 OBJS-$(HAVE_VSX)     += $(VSX-OBJS) $(VSX-OBJS-yes)
diff --git a/ffbuild/common.mak b/ffbuild/common.mak
index 268ae61154..0eb831d434 100644
--- a/ffbuild/common.mak
+++ b/ffbuild/common.mak
@@ -59,6 +59,8 @@ COMPILE_HOSTC = $(call COMPILE,HOSTCC)
 COMPILE_NVCC = $(call COMPILE,NVCC)
 COMPILE_MMI = $(call COMPILE,CC,MMIFLAGS)
 COMPILE_MSA = $(call COMPILE,CC,MSAFLAGS)
+COMPILE_LSX = $(call COMPILE,CC,LSXFLAGS)
+COMPILE_LASX = $(call COMPILE,CC,LASXFLAGS)
 
 %_mmi.o: %_mmi.c
 	$(COMPILE_MMI)
@@ -66,6 +68,12 @@ COMPILE_MSA = $(call COMPILE,CC,MSAFLAGS)
 %_msa.o: %_msa.c
 	$(COMPILE_MSA)
 
+%_lsx.o: %_lsx.c
+	$(COMPILE_LSX)
+
+%_lasx.o: %_lasx.c
+	$(COMPILE_LASX)
+
 %.o: %.c
 	$(COMPILE_C)
 
diff --git a/libavutil/cpu.c b/libavutil/cpu.c
index 4627af4f23..63efb97ffd 100644
--- a/libavutil/cpu.c
+++ b/libavutil/cpu.c
@@ -62,6 +62,8 @@ static int get_cpu_flags(void)
         return ff_get_cpu_flags_ppc();
     if (ARCH_X86)
         return ff_get_cpu_flags_x86();
+    if (ARCH_LOONGARCH)
+        return ff_get_cpu_flags_loongarch();
     return 0;
 }
 
@@ -168,6 +170,9 @@ int av_parse_cpu_caps(unsigned *flags, const char *s)
 #elif ARCH_MIPS
         { "mmi",      NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_MMI      },    .unit = "flags" },
         { "msa",      NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_MSA      },    .unit = "flags" },
+#elif ARCH_LOONGARCH
+        { "lsx",      NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_LSX      },    .unit = "flags" },
+        { "lasx",     NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_LASX     },    .unit = "flags" },
 #endif
         { NULL },
     };
@@ -253,6 +258,8 @@ size_t av_cpu_max_align(void)
         return ff_get_cpu_max_align_ppc();
     if (ARCH_X86)
         return ff_get_cpu_max_align_x86();
+    if (ARCH_LOONGARCH)
+        return ff_get_cpu_max_align_loongarch();
 
     return 8;
 }
diff --git a/libavutil/cpu.h b/libavutil/cpu.h
index afea0640b4..ae443eccad 100644
--- a/libavutil/cpu.h
+++ b/libavutil/cpu.h
@@ -72,6 +72,10 @@
 #define AV_CPU_FLAG_MMI          (1 << 0)
 #define AV_CPU_FLAG_MSA          (1 << 1)
 
+// LoongArch SIMD extensions.
+#define AV_CPU_FLAG_LSX          (1 << 0)
+#define AV_CPU_FLAG_LASX         (1 << 1)
+
 /**
  * Return the flags which specify extensions supported by the CPU.
  * The returned value is affected by av_force_cpu_flags() if that was used
diff --git a/libavutil/cpu_internal.h b/libavutil/cpu_internal.h
index 889764320b..e207b2d480 100644
--- a/libavutil/cpu_internal.h
+++ b/libavutil/cpu_internal.h
@@ -46,11 +46,13 @@ int ff_get_cpu_flags_aarch64(void);
 int ff_get_cpu_flags_arm(void);
 int ff_get_cpu_flags_ppc(void);
 int ff_get_cpu_flags_x86(void);
+int ff_get_cpu_flags_loongarch(void);
 
 size_t ff_get_cpu_max_align_mips(void);
 size_t ff_get_cpu_max_align_aarch64(void);
 size_t ff_get_cpu_max_align_arm(void);
 size_t ff_get_cpu_max_align_ppc(void);
 size_t ff_get_cpu_max_align_x86(void);
+size_t ff_get_cpu_max_align_loongarch(void);
 
 #endif /* AVUTIL_CPU_INTERNAL_H */
diff --git a/libavutil/loongarch/Makefile b/libavutil/loongarch/Makefile
new file mode 100644
index 0000000000..2addd9351c
--- /dev/null
+++ b/libavutil/loongarch/Makefile
@@ -0,0 +1 @@
+OBJS += loongarch/cpu.o
diff --git a/libavutil/loongarch/cpu.c b/libavutil/loongarch/cpu.c
new file mode 100644
index 0000000000..e4b240bc44
--- /dev/null
+++ b/libavutil/loongarch/cpu.c
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2020 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <stdint.h>
+#include "cpu.h"
+
+#define LOONGARCH_CFG2 0x2
+#define LOONGARCH_CFG2_LSX    (1 << 6)
+#define LOONGARCH_CFG2_LASX   (1 << 7)
+
+static int cpu_flags_cpucfg(void)
+{
+    int flags = 0;
+    uint32_t cfg2 = 0;
+
+    __asm__ volatile(
+        "cpucfg %0, %1 \n\t"
+        : "+&r"(cfg2)
+        : "r"(LOONGARCH_CFG2)
+    );
+
+    if (cfg2 & LOONGARCH_CFG2_LSX)
+        flags |= AV_CPU_FLAG_LSX;
+
+    if (cfg2 & LOONGARCH_CFG2_LASX)
+        flags |= AV_CPU_FLAG_LASX;
+
+    return flags;
+}
+
+int ff_get_cpu_flags_loongarch(void)
+{
+#if defined __linux__
+    return cpu_flags_cpucfg();
+#else
+    /* Assume no SIMD ASE supported */
+    return 0;
+#endif
+}
+
+size_t ff_get_cpu_max_align_loongarch(void)
+{
+    int flags = av_get_cpu_flags();
+
+    if (flags & AV_CPU_FLAG_LASX)
+        return 32;
+    if (flags & AV_CPU_FLAG_LSX)
+        return 16;
+
+    return 8;
+}
diff --git a/libavutil/loongarch/cpu.h b/libavutil/loongarch/cpu.h
new file mode 100644
index 0000000000..1a445c69bc
--- /dev/null
+++ b/libavutil/loongarch/cpu.h
@@ -0,0 +1,31 @@
+/*
+ * Copyright (c) 2020 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVUTIL_LOONGARCH_CPU_H
+#define AVUTIL_LOONGARCH_CPU_H
+
+#include "libavutil/cpu.h"
+#include "libavutil/cpu_internal.h"
+
+#define have_lsx(flags) CPUEXT(flags, LSX)
+#define have_lasx(flags) CPUEXT(flags, LASX)
+
+#endif /* AVUTIL_LOONGARCH_CPU_H */
diff --git a/libavutil/tests/cpu.c b/libavutil/tests/cpu.c
index c853371fb3..0a6c0cd32e 100644
--- a/libavutil/tests/cpu.c
+++ b/libavutil/tests/cpu.c
@@ -77,6 +77,9 @@ static const struct {
     { AV_CPU_FLAG_BMI2,      "bmi2"       },
     { AV_CPU_FLAG_AESNI,     "aesni"      },
     { AV_CPU_FLAG_AVX512,    "avx512"     },
+#elif ARCH_LOONGARCH
+    { AV_CPU_FLAG_LSX,       "lsx"        },
+    { AV_CPU_FLAG_LASX,      "lasx"       },
 #endif
     { 0 }
 };
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index b1353f7cbe..90d080de02 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -236,6 +236,9 @@ static const struct {
     { "FMA4",     "fma4",     AV_CPU_FLAG_FMA4 },
     { "AVX2",     "avx2",     AV_CPU_FLAG_AVX2 },
     { "AVX-512",  "avx512",   AV_CPU_FLAG_AVX512 },
+#elif ARCH_LOONGARCH
+    { "LSX",      "lsx",      AV_CPU_FLAG_LSX },
+    { "LASX",     "lasx",     AV_CPU_FLAG_LASX },
 #endif
     { NULL }
 };

From patchwork Tue Dec 14 13:33:11 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= <chenhao@loongson.cn>
X-Patchwork-Id: 32485
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp6965359iog;
        Tue, 14 Dec 2021 05:34:24 -0800 (PST)
X-Google-Smtp-Source: 
 ABdhPJx5wYGaczGzNVz0W9q3bPsGRPLRebv8AvQUNbxCbuJBoHdNg1G3kDNpxyHzy9TQtJp7rdvS
X-Received: by 2002:a05:6402:270a:: with SMTP id
 y10mr8071437edd.108.1639488863853;
        Tue, 14 Dec 2021 05:34:23 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1639488863; cv=none;
        d=google.com; s=arc-20160816;
        b=GMI7y5J6DEvZFwexrMCB0LD/9RLVHnPsE274qRd46zN82QSHv4vxQ/+/1ghltSPvAX
         2CAJYNCB0/Wjc5/0zy776ffzfTXPAjpd8mk0raqcp7CDnSvS8rb7r4arcv/FpqXtNrd6
         q6lNrAmL3z5eclTJrp/60UZjAV7chJhluFhymT72U7UiMXaHHXo3G2Mv4JColv60ClV+
         KUcoFYxV0nLjA+n6auW/vp1sQxjcRmjiV9/rKv9Z+2/oH1srNUF7RrYGrGde9ePiojwx
         o3+5dUiOccMNnI0tvF+PrSnhidXeOJtzFYEeEru0NnAMtKZhgXtBfKah9tjwAgI1MlDT
         foyQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:cc:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:date:to:from
         :delivered-to;
        bh=LSTOUzelZUcdO9i3OXoIqgjd05yZ8izxJUqO3Z5MgxE=;
        b=P6pHmPtQC2VDhnHcaCUhVX4fph/rUFn18lkeSa4MTyNEHOHUtC0NOQuiQpuY/ychqf
         3hjX7pSPdb0mKWRImVl8IR7HDmccrk7RixhstNGkYLxbdPVdpLJAJun+z1mZMqcVp5kD
         hpqveJkJl0LxlwB+RWi8K65f2QtaIh0HDzn6N3ARENUA6Dq2RrMgkESbgsv7AerrEVjQ
         KgtSUhc0qTK9NqlVMYDV2TyEqnBq3qnqvKTtDx67dhDOidZvIaHFmp4PuRoUNHFsBaDY
         JAT+C+fh6W0s+fplWYBXDaQjGuWdV6TKpL+cnaVeRHIzfwgDp/DIji80fAf7TEIJpJ4t
         3sEg==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 rh16si18239123ejb.761.2021.12.14.05.34.23;
        Tue, 14 Dec 2021 05:34:23 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id AD36968AF08;
	Tue, 14 Dec 2021 15:33:56 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from loongson.cn (mail.loongson.cn [114.242.206.163])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 3BB6B68ADF7
 for <ffmpeg-devel@ffmpeg.org>; Tue, 14 Dec 2021 15:33:44 +0200 (EET)
Received: from localhost (unknown [36.33.26.144])
 by mail.loongson.cn (Coremail) with SMTP id AQAAf9Dx_Nw2nbhhkacAAA--.3468S3;
 Tue, 14 Dec 2021 21:33:42 +0800 (CST)
From: Hao Chen <chenhao@loongson.cn>
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:11 +0800
Message-Id: <20211214133316.8978-3-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
MIME-Version: 1.0
X-CM-TRANSID: AQAAf9Dx_Nw2nbhhkacAAA--.3468S3
X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/
Subject: [FFmpeg-devel] [PATCH v2 2/7] avcodec: [loongarch] Optimize
 h264_chroma_mc with LASX.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Shiyou Yin <yinshiyou-hf@loongson.cn>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: xnNh6AOhJXEr

From: Shiyou Yin <yinshiyou-hf@loongson.cn>

./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before: 170
after:  183

Change-Id: I42ff23cc2dc7c32bd1b7e4274da9d9ec87065f20
---
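Note for reviewers: the scalar sketch below is one reading of what the
avc_chroma_hv_* LASX routines compute per pixel, and is not code from this
patch. It assumes, from the xvilvl_b interleave and the xvmul_h/xvmadd_h
sequence, that coef_hor1 weights the current pixel and coef_hor0 its right
neighbour, and that coef_ver1 weights the current row and coef_ver0 the row
below. The callers are not visible in this part of the diff, so treat that
mapping as an assumption. It also assumes each coefficient pair sums to 8,
as in H.264 chroma interpolation, so the result fits the rounding shift
by 6. The names sketch_clip_u8() and sketch_chroma_hv() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Saturate to the unsigned 8-bit range, mirroring the "bu" saturation of
 * __lasx_xvssrarni_bu_h. */
static uint8_t sketch_clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference for a w x h block.
 * src must provide (h + 1) rows of (w + 1) readable pixels. */
static void sketch_chroma_hv(const uint8_t *src, uint8_t *dst, ptrdiff_t stride,
                             int w, int h,
                             int coef_hor0, int coef_hor1,
                             int coef_ver0, int coef_ver1)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            /* Horizontal two-tap filter on the current row and the row below. */
            int hz_cur  = src[x]          * coef_hor1 + src[x + 1]          * coef_hor0;
            int hz_next = src[x + stride] * coef_hor1 + src[x + stride + 1] * coef_hor0;
            /* Vertical two-tap filter, then round and shift right by 6. */
            int sum = hz_cur * coef_ver1 + hz_next * coef_ver0;

            dst[x] = sketch_clip_u8((sum + 32) >> 6);
        }
        src += stride;
        dst += stride;
    }
}
```

A constant input block is a quick sanity check: with any coefficient pairs
that each sum to 8, every output pixel equals the input value.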
 libavcodec/h264chroma.c                       |    2 +
 libavcodec/h264chroma.h                       |    1 +
 libavcodec/loongarch/Makefile                 |    2 +
 .../loongarch/h264chroma_init_loongarch.c     |   37 +
 libavcodec/loongarch/h264chroma_lasx.c        | 1280 +++++++++++
 libavcodec/loongarch/h264chroma_lasx.h        |   36 +
 libavutil/loongarch/loongson_intrinsics.h     | 1877 +++++++++++++++++
 7 files changed, 3235 insertions(+)
 create mode 100644 libavcodec/loongarch/Makefile
 create mode 100644 libavcodec/loongarch/h264chroma_init_loongarch.c
 create mode 100644 libavcodec/loongarch/h264chroma_lasx.c
 create mode 100644 libavcodec/loongarch/h264chroma_lasx.h
 create mode 100644 libavutil/loongarch/loongson_intrinsics.h

diff --git a/libavcodec/h264chroma.c b/libavcodec/h264chroma.c
index c2f1f30f5a..0ae6c793e1 100644
--- a/libavcodec/h264chroma.c
+++ b/libavcodec/h264chroma.c
@@ -56,4 +56,6 @@ av_cold void ff_h264chroma_init(H264ChromaContext *c, int bit_depth)
         ff_h264chroma_init_x86(c, bit_depth);
     if (ARCH_MIPS)
         ff_h264chroma_init_mips(c, bit_depth);
+    if (ARCH_LOONGARCH64)
+        ff_h264chroma_init_loongarch(c, bit_depth);
 }
diff --git a/libavcodec/h264chroma.h b/libavcodec/h264chroma.h
index 5c89fd12df..3259b4935f 100644
--- a/libavcodec/h264chroma.h
+++ b/libavcodec/h264chroma.h
@@ -36,5 +36,6 @@ void ff_h264chroma_init_arm(H264ChromaContext *c, int bit_depth);
 void ff_h264chroma_init_ppc(H264ChromaContext *c, int bit_depth);
 void ff_h264chroma_init_x86(H264ChromaContext *c, int bit_depth);
 void ff_h264chroma_init_mips(H264ChromaContext *c, int bit_depth);
+void ff_h264chroma_init_loongarch(H264ChromaContext *c, int bit_depth);
 
 #endif /* AVCODEC_H264CHROMA_H */
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
new file mode 100644
index 0000000000..f8fb54c925
--- /dev/null
+++ b/libavcodec/loongarch/Makefile
@@ -0,0 +1,2 @@
+OBJS-$(CONFIG_H264CHROMA)             += loongarch/h264chroma_init_loongarch.o
+LASX-OBJS-$(CONFIG_H264CHROMA)        += loongarch/h264chroma_lasx.o
diff --git a/libavcodec/loongarch/h264chroma_init_loongarch.c b/libavcodec/loongarch/h264chroma_init_loongarch.c
new file mode 100644
index 0000000000..0ca24ecc47
--- /dev/null
+++ b/libavcodec/loongarch/h264chroma_init_loongarch.c
@@ -0,0 +1,37 @@
+/*
+ * Copyright (c) 2020 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "h264chroma_lasx.h"
+#include "libavutil/attributes.h"
+#include "libavutil/loongarch/cpu.h"
+#include "libavcodec/h264chroma.h"
+
+av_cold void ff_h264chroma_init_loongarch(H264ChromaContext *c, int bit_depth)
+{
+    int cpu_flags = av_get_cpu_flags();
+    if (have_lasx(cpu_flags)) {
+        if (bit_depth <= 8) {
+            c->put_h264_chroma_pixels_tab[0] = ff_put_h264_chroma_mc8_lasx;
+            c->avg_h264_chroma_pixels_tab[0] = ff_avg_h264_chroma_mc8_lasx;
+            c->put_h264_chroma_pixels_tab[1] = ff_put_h264_chroma_mc4_lasx;
+        }
+    }
+}
diff --git a/libavcodec/loongarch/h264chroma_lasx.c b/libavcodec/loongarch/h264chroma_lasx.c
new file mode 100644
index 0000000000..824a78dfc8
--- /dev/null
+++ b/libavcodec/loongarch/h264chroma_lasx.c
@@ -0,0 +1,1280 @@
+/*
+ * Loongson LASX optimized h264chroma
+ *
+ * Copyright (c) 2020 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "h264chroma_lasx.h"
+#include "libavutil/attributes.h"
+#include "libavutil/avassert.h"
+#include "libavutil/loongarch/loongson_intrinsics.h"
+
+static const uint8_t chroma_mask_arr[64] = {
+    0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8,
+    0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8,
+    0, 1, 1, 2, 2, 3, 3, 4, 16, 17, 17, 18, 18, 19, 19, 20,
+    0, 1, 1, 2, 2, 3, 3, 4, 16, 17, 17, 18, 18, 19, 19, 20
+};
+
+static av_always_inline void avc_chroma_hv_8x4_lasx(uint8_t *src, uint8_t *dst,
+                             ptrdiff_t stride, uint32_t coef_hor0,
+                             uint32_t coef_hor1, uint32_t coef_ver0,
+                             uint32_t coef_ver1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride_2x << 1;
+    __m256i src0, src1, src2, src3, src4, out;
+    __m256i res_hz0, res_hz1, res_hz2, res_vt0, res_vt1;
+    __m256i mask;
+    __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0);
+    __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1);
+    __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1);
+    __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0);
+    __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1);
+
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x,
+              src1, src2, src3, src4);
+    DUP2_ARG3(__lasx_xvpermi_q, src2, src1, 0x20, src4, src3, 0x20, src1, src3);
+    src0 = __lasx_xvshuf_b(src0, src0, mask);
+    DUP2_ARG3(__lasx_xvshuf_b, src1, src1, mask, src3, src3, mask, src1, src3);
+    DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, res_hz0, res_hz1);
+    res_hz2 = __lasx_xvdp2_h_bu(src3, coeff_hz_vec);
+    res_vt0 = __lasx_xvmul_h(res_hz1, coeff_vt_vec0);
+    res_vt1 = __lasx_xvmul_h(res_hz2, coeff_vt_vec0);
+    res_hz0 = __lasx_xvpermi_q(res_hz1, res_hz0, 0x20);
+    res_hz1 = __lasx_xvpermi_q(res_hz1, res_hz2, 0x3);
+    res_vt0 = __lasx_xvmadd_h(res_vt0, res_hz0, coeff_vt_vec1);
+    res_vt1 = __lasx_xvmadd_h(res_vt1, res_hz1, coeff_vt_vec1);
+    out = __lasx_xvssrarni_bu_h(res_vt1, res_vt0, 6);
+    __lasx_xvstelm_d(out, dst, 0, 0);
+    __lasx_xvstelm_d(out, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avc_chroma_hv_8x8_lasx(uint8_t *src, uint8_t *dst,
+                             ptrdiff_t stride, uint32_t coef_hor0,
+                             uint32_t coef_hor1, uint32_t coef_ver0,
+                             uint32_t coef_ver1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8;
+    __m256i out0, out1;
+    __m256i res_hz0, res_hz1, res_hz2, res_hz3, res_hz4;
+    __m256i res_vt0, res_vt1, res_vt2, res_vt3;
+    __m256i mask;
+    __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0);
+    __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1);
+    __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1);
+    __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0);
+    __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1);
+
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x,
+              src1, src2, src3, src4);
+    src += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x,
+              src5, src6, src7, src8);
+    DUP4_ARG3(__lasx_xvpermi_q, src2, src1, 0x20, src4, src3, 0x20, src6, src5, 0x20,
+              src8, src7, 0x20, src1, src3, src5, src7);
+    src0 = __lasx_xvshuf_b(src0, src0, mask);
+    DUP4_ARG3(__lasx_xvshuf_b, src1, src1, mask, src3, src3, mask, src5, src5, mask, src7,
+              src7, mask, src1, src3, src5, src7);
+    DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, src3,
+              coeff_hz_vec, src5, coeff_hz_vec, res_hz0, res_hz1, res_hz2, res_hz3);
+    res_hz4 = __lasx_xvdp2_h_bu(src7, coeff_hz_vec);
+    res_vt0 = __lasx_xvmul_h(res_hz1, coeff_vt_vec0);
+    res_vt1 = __lasx_xvmul_h(res_hz2, coeff_vt_vec0);
+    res_vt2 = __lasx_xvmul_h(res_hz3, coeff_vt_vec0);
+    res_vt3 = __lasx_xvmul_h(res_hz4, coeff_vt_vec0);
+    res_hz0 = __lasx_xvpermi_q(res_hz1, res_hz0, 0x20);
+    res_hz1 = __lasx_xvpermi_q(res_hz1, res_hz2, 0x3);
+    res_hz2 = __lasx_xvpermi_q(res_hz2, res_hz3, 0x3);
+    res_hz3 = __lasx_xvpermi_q(res_hz3, res_hz4, 0x3);
+    DUP4_ARG3(__lasx_xvmadd_h, res_vt0, res_hz0, coeff_vt_vec1, res_vt1, res_hz1, coeff_vt_vec1,
+              res_vt2, res_hz2, coeff_vt_vec1, res_vt3, res_hz3, coeff_vt_vec1,
+              res_vt0, res_vt1, res_vt2, res_vt3);
+    DUP2_ARG3(__lasx_xvssrarni_bu_h, res_vt1, res_vt0, 6, res_vt3, res_vt2, 6, out0, out1);
+    __lasx_xvstelm_d(out0, dst, 0, 0);
+    __lasx_xvstelm_d(out0, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(out1, dst, 0, 0);
+    __lasx_xvstelm_d(out1, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avc_chroma_hz_8x4_lasx(uint8_t *src, uint8_t *dst,
+                             ptrdiff_t stride, uint32_t coeff0, uint32_t coeff1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    __m256i src0, src1, src2, src3, out;
+    __m256i res0, res1;
+    __m256i mask;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0);
+    DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src1, src2);
+    src3 = __lasx_xvldx(src, stride_3x);
+    DUP2_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src0, src2);
+    DUP2_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src0, src2);
+    DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1);
+    out = __lasx_xvssrarni_bu_h(res1, res0, 6);
+    __lasx_xvstelm_d(out, dst, 0, 0);
+    __lasx_xvstelm_d(out, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out, dst + stride_3x, 0, 3);
+
+}
+
+static av_always_inline void avc_chroma_hz_8x8_lasx(uint8_t *src, uint8_t *dst,
+                             ptrdiff_t stride, uint32_t coeff0, uint32_t coeff1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+    __m256i out0, out1;
+    __m256i res0, res1, res2, res3;
+    __m256i mask;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x,
+              src1, src2, src3, src4);
+    src += stride_4x;
+    DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src5, src6);
+    src7 = __lasx_xvldx(src, stride_3x);
+    DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, 0x20,
+              src7, src6, 0x20, src0, src2, src4, src6);
+    DUP4_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src4, src4, mask,
+              src6, src6, mask, src0, src2, src4, src6);
+    DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, src4, coeff_vec, src6,
+              coeff_vec, res0, res1, res2, res3);
+    DUP2_ARG3(__lasx_xvssrarni_bu_h, res1, res0, 6, res3, res2, 6, out0, out1);
+    __lasx_xvstelm_d(out0, dst, 0, 0);
+    __lasx_xvstelm_d(out0, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(out1, dst, 0, 0);
+    __lasx_xvstelm_d(out1, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avc_chroma_hz_nonmult_lasx(uint8_t *src,
+                             uint8_t *dst, ptrdiff_t stride, uint32_t coeff0,
+                             uint32_t coeff1, int32_t height)
+{
+    uint32_t row;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i src0, src1, src2, src3, out;
+    __m256i res0, res1;
+    __m256i mask;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    mask = __lasx_xvld(chroma_mask_arr, 0);
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+
+    for (row = height >> 2; row--;) {
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+                  src0, src1, src2, src3);
+        src += stride_4x;
+        DUP2_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src0, src2);
+        DUP2_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src0, src2);
+        DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1);
+        out = __lasx_xvssrarni_bu_h(res1, res0, 6);
+        __lasx_xvstelm_d(out, dst, 0, 0);
+        __lasx_xvstelm_d(out, dst + stride, 0, 2);
+        __lasx_xvstelm_d(out, dst + stride_2x, 0, 1);
+        __lasx_xvstelm_d(out, dst + stride_3x, 0, 3);
+        dst += stride_4x;
+    }
+
+    if (height & 3) { /* tail: processes exactly two remaining rows */
+        src0 = __lasx_xvld(src, 0);
+        src1 = __lasx_xvldx(src, stride);
+        src1 = __lasx_xvpermi_q(src1, src0, 0x20);
+        src0 = __lasx_xvshuf_b(src1, src1, mask);
+        res0 = __lasx_xvdp2_h_bu(src0, coeff_vec);
+        out  = __lasx_xvssrarni_bu_h(res0, res0, 6);
+        __lasx_xvstelm_d(out, dst, 0, 0);
+        dst += stride;
+        __lasx_xvstelm_d(out, dst, 0, 2);
+    }
+}
+
+static av_always_inline void avc_chroma_vt_8x4_lasx(uint8_t *src, uint8_t *dst,
+                             ptrdiff_t stride, uint32_t coeff0, uint32_t coeff1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    __m256i src0, src1, src2, src3, src4, out;
+    __m256i res0, res1;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    src0 = __lasx_xvld(src, 0);
+    src += stride;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src1, src2, src3, src4);
+    DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, 0x20,
+              src4, src3, 0x20, src0, src1, src2, src3);
+    DUP2_ARG2(__lasx_xvilvl_b, src1, src0, src3, src2, src0, src2);
+    DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1);
+    out  = __lasx_xvssrarni_bu_h(res1, res0, 6);
+    __lasx_xvstelm_d(out, dst, 0, 0);
+    __lasx_xvstelm_d(out, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avc_chroma_vt_8x8_lasx(uint8_t *src, uint8_t *dst,
+                             ptrdiff_t stride, uint32_t coeff0, uint32_t coeff1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8;
+    __m256i out0, out1;
+    __m256i res0, res1, res2, res3;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    src0 = __lasx_xvld(src, 0);
+    src += stride;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src1, src2, src3, src4);
+    src += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src5, src6, src7, src8);
+    DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, 0x20,
+              src4, src3, 0x20, src0, src1, src2, src3);
+    DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, src6, src5, 0x20, src7, src6, 0x20,
+              src8, src7, 0x20, src4, src5, src6, src7);
+    DUP4_ARG2(__lasx_xvilvl_b, src1, src0, src3, src2, src5, src4, src7, src6,
+              src0, src2, src4, src6);
+    DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, src4, coeff_vec,
+              src6, coeff_vec, res0, res1, res2, res3);
+    DUP2_ARG3(__lasx_xvssrarni_bu_h, res1, res0, 6, res3, res2, 6, out0, out1);
+    __lasx_xvstelm_d(out0, dst, 0, 0);
+    __lasx_xvstelm_d(out0, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(out1, dst, 0, 0);
+    __lasx_xvstelm_d(out1, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void copy_width8x8_lasx(uint8_t *src, uint8_t *dst,
+                             ptrdiff_t stride)
+{
+    uint64_t tmp[8];
+    ptrdiff_t stride_2, stride_3, stride_4;
+    __asm__ volatile (
+        "slli.d   %[stride_2],     %[stride],     1             \n\t"
+        "add.d    %[stride_3],     %[stride_2],   %[stride]     \n\t"
+        "slli.d   %[stride_4],     %[stride_2],   1             \n\t"
+        "ld.d     %[tmp0],         %[src],        0x0           \n\t"
+        "ldx.d    %[tmp1],         %[src],        %[stride]     \n\t"
+        "ldx.d    %[tmp2],         %[src],        %[stride_2]   \n\t"
+        "ldx.d    %[tmp3],         %[src],        %[stride_3]   \n\t"
+        "add.d    %[src],          %[src],        %[stride_4]   \n\t"
+        "ld.d     %[tmp4],         %[src],        0x0           \n\t"
+        "ldx.d    %[tmp5],         %[src],        %[stride]     \n\t"
+        "ldx.d    %[tmp6],         %[src],        %[stride_2]   \n\t"
+        "ldx.d    %[tmp7],         %[src],        %[stride_3]   \n\t"
+
+        "st.d     %[tmp0],         %[dst],        0x0           \n\t"
+        "stx.d    %[tmp1],         %[dst],        %[stride]     \n\t"
+        "stx.d    %[tmp2],         %[dst],        %[stride_2]   \n\t"
+        "stx.d    %[tmp3],         %[dst],        %[stride_3]   \n\t"
+        "add.d    %[dst],          %[dst],        %[stride_4]   \n\t"
+        "st.d     %[tmp4],         %[dst],        0x0           \n\t"
+        "stx.d    %[tmp5],         %[dst],        %[stride]     \n\t"
+        "stx.d    %[tmp6],         %[dst],        %[stride_2]   \n\t"
+        "stx.d    %[tmp7],         %[dst],        %[stride_3]   \n\t"
+        : [tmp0]"=&r"(tmp[0]),        [tmp1]"=&r"(tmp[1]),
+          [tmp2]"=&r"(tmp[2]),        [tmp3]"=&r"(tmp[3]),
+          [tmp4]"=&r"(tmp[4]),        [tmp5]"=&r"(tmp[5]),
+          [tmp6]"=&r"(tmp[6]),        [tmp7]"=&r"(tmp[7]),
+          [dst]"+&r"(dst),            [src]"+&r"(src),
+          [stride_2]"=&r"(stride_2),  [stride_3]"=&r"(stride_3),
+          [stride_4]"=&r"(stride_4)
+        : [stride]"r"(stride)
+        : "memory"
+    );
+}
+
+static av_always_inline void copy_width8x4_lasx(uint8_t *src, uint8_t *dst,
+                             ptrdiff_t stride)
+{
+    uint64_t tmp[4];
+    ptrdiff_t stride_2, stride_3;
+    __asm__ volatile (
+        "slli.d   %[stride_2],     %[stride],     1             \n\t"
+        "add.d    %[stride_3],     %[stride_2],   %[stride]     \n\t"
+        "ld.d     %[tmp0],         %[src],        0x0           \n\t"
+        "ldx.d    %[tmp1],         %[src],        %[stride]     \n\t"
+        "ldx.d    %[tmp2],         %[src],        %[stride_2]   \n\t"
+        "ldx.d    %[tmp3],         %[src],        %[stride_3]   \n\t"
+
+        "st.d     %[tmp0],         %[dst],        0x0           \n\t"
+        "stx.d    %[tmp1],         %[dst],        %[stride]     \n\t"
+        "stx.d    %[tmp2],         %[dst],        %[stride_2]   \n\t"
+        "stx.d    %[tmp3],         %[dst],        %[stride_3]   \n\t"
+        : [tmp0]"=&r"(tmp[0]),        [tmp1]"=&r"(tmp[1]),
+          [tmp2]"=&r"(tmp[2]),        [tmp3]"=&r"(tmp[3]),
+          [stride_2]"=&r"(stride_2),  [stride_3]"=&r"(stride_3)
+        : [stride]"r"(stride), [dst]"r"(dst), [src]"r"(src)
+        : "memory"
+    );
+}
+
+static void avc_chroma_hv_8w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                  uint32_t coef_hor0, uint32_t coef_hor1,
+                                  uint32_t coef_ver0, uint32_t coef_ver1,
+                                  int32_t height)
+{
+    if (4 == height) {
+        avc_chroma_hv_8x4_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0,
+                               coef_ver1);
+    } else if (8 == height) {
+        avc_chroma_hv_8x8_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0,
+                               coef_ver1);
+    }
+}
+
+static void avc_chroma_hv_4x2_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                   uint32_t coef_hor0, uint32_t coef_hor1,
+                                   uint32_t coef_ver0, uint32_t coef_ver1)
+{
+    ptrdiff_t stride_2 = stride << 1;
+    __m256i src0, src1, src2;
+    __m256i res_hz, res_vt;
+    __m256i mask;
+    __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0);
+    __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1);
+    __m256i coeff_hz_vec  = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1);
+    __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0);
+    __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1);
+    __m256i coeff_vt_vec  = __lasx_xvpermi_q(coeff_vt_vec1, coeff_vt_vec0, 0x02);
+
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0);
+    DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2, src1, src2);
+    DUP2_ARG3(__lasx_xvshuf_b, src1, src0, mask, src2, src1, mask, src0, src1);
+    src0 = __lasx_xvpermi_q(src0, src1, 0x02);
+    res_hz = __lasx_xvdp2_h_bu(src0, coeff_hz_vec);
+    res_vt = __lasx_xvmul_h(res_hz, coeff_vt_vec);
+    res_hz = __lasx_xvpermi_q(res_hz, res_vt, 0x01);
+    res_vt = __lasx_xvadd_h(res_hz, res_vt);
+    res_vt = __lasx_xvssrarni_bu_h(res_vt, res_vt, 6);
+    __lasx_xvstelm_w(res_vt, dst, 0, 0);
+    __lasx_xvstelm_w(res_vt, dst + stride, 0, 1);
+}
+
+static void avc_chroma_hv_4x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                   uint32_t coef_hor0, uint32_t coef_hor1,
+                                   uint32_t coef_ver0, uint32_t coef_ver1)
+{
+    ptrdiff_t stride_2 = stride << 1;
+    ptrdiff_t stride_3 = stride_2 + stride;
+    ptrdiff_t stride_4 = stride_2 << 1;
+    __m256i src0, src1, src2, src3, src4;
+    __m256i res_hz0, res_hz1, res_vt0, res_vt1;
+    __m256i mask;
+    __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0);
+    __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1);
+    __m256i coeff_hz_vec  = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1);
+    __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0);
+    __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1);
+
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3,
+              src, stride_4, src1, src2, src3, src4);
+    DUP4_ARG3(__lasx_xvshuf_b, src1, src0, mask, src2, src1, mask, src3, src2, mask,
+              src4, src3, mask, src0, src1, src2, src3);
+    DUP2_ARG3(__lasx_xvpermi_q, src0, src2, 0x02, src1, src3, 0x02, src0, src1);
+    DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, res_hz0, res_hz1);
+    DUP2_ARG2(__lasx_xvmul_h, res_hz0, coeff_vt_vec1, res_hz1, coeff_vt_vec0, res_vt0, res_vt1);
+    res_hz0 = __lasx_xvadd_h(res_vt0, res_vt1);
+    res_hz0 = __lasx_xvssrarni_bu_h(res_hz0, res_hz0, 6);
+    __lasx_xvstelm_w(res_hz0, dst, 0, 0);
+    __lasx_xvstelm_w(res_hz0, dst + stride, 0, 1);
+    __lasx_xvstelm_w(res_hz0, dst + stride_2, 0, 4);
+    __lasx_xvstelm_w(res_hz0, dst + stride_3, 0, 5);
+}
+
+static void avc_chroma_hv_4x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                   uint32_t coef_hor0, uint32_t coef_hor1,
+                                   uint32_t coef_ver0, uint32_t coef_ver1)
+{
+    ptrdiff_t stride_2 = stride << 1;
+    ptrdiff_t stride_3 = stride_2 + stride;
+    ptrdiff_t stride_4 = stride_2 << 1;
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8;
+    __m256i res_hz0, res_hz1, res_hz2, res_hz3;
+    __m256i res_vt0, res_vt1, res_vt2, res_vt3;
+    __m256i mask;
+    __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0);
+    __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1);
+    __m256i coeff_hz_vec  = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1);
+    __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0);
+    __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1);
+
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3,
+              src, stride_4, src1, src2, src3, src4);
+    src += stride_4;
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3,
+              src, stride_4, src5, src6, src7, src8);
+    DUP4_ARG3(__lasx_xvshuf_b, src1, src0, mask, src2, src1, mask, src3, src2, mask,
+              src4, src3, mask, src0, src1, src2, src3);
+    DUP4_ARG3(__lasx_xvshuf_b, src5, src4, mask, src6, src5, mask, src7, src6, mask,
+              src8, src7, mask, src4, src5, src6, src7);
+    DUP4_ARG3(__lasx_xvpermi_q, src0, src2, 0x02, src1, src3, 0x02, src4, src6, 0x02,
+              src5, src7, 0x02, src0, src1, src4, src5);
+    DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, src4, coeff_hz_vec,
+              src5, coeff_hz_vec, res_hz0, res_hz1, res_hz2, res_hz3);
+    DUP4_ARG2(__lasx_xvmul_h, res_hz0, coeff_vt_vec1, res_hz1, coeff_vt_vec0, res_hz2,
+              coeff_vt_vec1, res_hz3, coeff_vt_vec0, res_vt0, res_vt1, res_vt2, res_vt3);
+    DUP2_ARG2(__lasx_xvadd_h, res_vt0, res_vt1, res_vt2, res_vt3, res_vt0, res_vt2);
+    res_hz0 = __lasx_xvssrarni_bu_h(res_vt2, res_vt0, 6);
+    __lasx_xvstelm_w(res_hz0, dst, 0, 0);
+    __lasx_xvstelm_w(res_hz0, dst + stride, 0, 1);
+    __lasx_xvstelm_w(res_hz0, dst + stride_2, 0, 4);
+    __lasx_xvstelm_w(res_hz0, dst + stride_3, 0, 5);
+    dst += stride_4;
+    __lasx_xvstelm_w(res_hz0, dst, 0, 2);
+    __lasx_xvstelm_w(res_hz0, dst + stride, 0, 3);
+    __lasx_xvstelm_w(res_hz0, dst + stride_2, 0, 6);
+    __lasx_xvstelm_w(res_hz0, dst + stride_3, 0, 7);
+}
+
+static void avc_chroma_hv_4w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                  uint32_t coef_hor0, uint32_t coef_hor1,
+                                  uint32_t coef_ver0, uint32_t coef_ver1,
+                                  int32_t height)
+{
+    if (8 == height) {
+        avc_chroma_hv_4x8_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0,
+                               coef_ver1);
+    } else if (4 == height) {
+        avc_chroma_hv_4x4_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0,
+                               coef_ver1);
+    } else if (2 == height) {
+        avc_chroma_hv_4x2_lasx(src, dst, stride, coef_hor0, coef_hor1, coef_ver0,
+                               coef_ver1);
+    }
+}
+
+static void avc_chroma_hz_4x2_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                   uint32_t coeff0, uint32_t coeff1)
+{
+    __m256i src0, src1;
+    __m256i res, mask;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec  = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0);
+    src1 = __lasx_xvldx(src, stride);
+    src0 = __lasx_xvshuf_b(src1, src0, mask);
+    res = __lasx_xvdp2_h_bu(src0, coeff_vec);
+    res = __lasx_xvslli_h(res, 3);
+    res = __lasx_xvssrarni_bu_h(res, res, 6);
+    __lasx_xvstelm_w(res, dst, 0, 0);
+    __lasx_xvstelm_w(res, dst + stride, 0, 1);
+}
+
+static void avc_chroma_hz_4x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                   uint32_t coeff0, uint32_t coeff1)
+{
+    ptrdiff_t stride_2 = stride << 1;
+    ptrdiff_t stride_3 = stride_2 + stride;
+    __m256i src0, src1, src2, src3;
+    __m256i res, mask;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec  = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0);
+    DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2, src1, src2);
+    src3 = __lasx_xvldx(src, stride_3);
+    DUP2_ARG3(__lasx_xvshuf_b, src1, src0, mask, src3, src2, mask, src0, src2);
+    src0 = __lasx_xvpermi_q(src0, src2, 0x02);
+    res = __lasx_xvdp2_h_bu(src0, coeff_vec);
+    res = __lasx_xvslli_h(res, 3);
+    res = __lasx_xvssrarni_bu_h(res, res, 6);
+    __lasx_xvstelm_w(res, dst, 0, 0);
+    __lasx_xvstelm_w(res, dst + stride, 0, 1);
+    __lasx_xvstelm_w(res, dst + stride_2, 0, 4);
+    __lasx_xvstelm_w(res, dst + stride_3, 0, 5);
+}
+
+static void avc_chroma_hz_4x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                   uint32_t coeff0, uint32_t coeff1)
+{
+    ptrdiff_t stride_2 = stride << 1;
+    ptrdiff_t stride_3 = stride_2 + stride;
+    ptrdiff_t stride_4 = stride_2 << 1;
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+    __m256i res0, res1, mask;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec  = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 32, src, 0, mask, src0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3,
+              src, stride_4, src1, src2, src3, src4);
+    src += stride_4;
+    DUP2_ARG2(__lasx_xvldx, src, stride, src, stride_2, src5, src6);
+    src7 = __lasx_xvldx(src, stride_3);
+    DUP4_ARG3(__lasx_xvshuf_b, src1, src0, mask, src3, src2, mask, src5, src4, mask,
+              src7, src6, mask, src0, src2, src4, src6);
+    DUP2_ARG3(__lasx_xvpermi_q, src0, src2, 0x02, src4, src6, 0x02, src0, src4);
+    DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src4, coeff_vec, res0, res1);
+    res0 = __lasx_xvssrarni_bu_h(res1, res0, 6);
+    __lasx_xvstelm_w(res0, dst, 0, 0);
+    __lasx_xvstelm_w(res0, dst + stride, 0, 1);
+    __lasx_xvstelm_w(res0, dst + stride_2, 0, 4);
+    __lasx_xvstelm_w(res0, dst + stride_3, 0, 5);
+    dst += stride_4;
+    __lasx_xvstelm_w(res0, dst, 0, 2);
+    __lasx_xvstelm_w(res0, dst + stride, 0, 3);
+    __lasx_xvstelm_w(res0, dst + stride_2, 0, 6);
+    __lasx_xvstelm_w(res0, dst + stride_3, 0, 7);
+}
+
+static void avc_chroma_hz_4w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                  uint32_t coeff0, uint32_t coeff1,
+                                  int32_t height)
+{
+    if (8 == height) {
+        avc_chroma_hz_4x8_lasx(src, dst, stride, coeff0, coeff1);
+    } else if (4 == height) {
+        avc_chroma_hz_4x4_lasx(src, dst, stride, coeff0, coeff1);
+    } else if (2 == height) {
+        avc_chroma_hz_4x2_lasx(src, dst, stride, coeff0, coeff1);
+    }
+}
+
+static void avc_chroma_hz_8w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                  uint32_t coeff0, uint32_t coeff1,
+                                  int32_t height)
+{
+    if (4 == height) {
+        avc_chroma_hz_8x4_lasx(src, dst, stride, coeff0, coeff1);
+    } else if (8 == height) {
+        avc_chroma_hz_8x8_lasx(src, dst, stride, coeff0, coeff1);
+    } else {
+        avc_chroma_hz_nonmult_lasx(src, dst, stride, coeff0, coeff1, height);
+    }
+}
+
+static void avc_chroma_vt_4x2_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                   uint32_t coeff0, uint32_t coeff1)
+{
+    __m256i src0, src1, src2;
+    __m256i tmp0, tmp1;
+    __m256i res;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec  = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    src0 = __lasx_xvld(src, 0);
+    DUP2_ARG2(__lasx_xvldx, src, stride, src, stride << 1, src1, src2);
+    DUP2_ARG2(__lasx_xvilvl_b, src1, src0, src2, src1, tmp0, tmp1);
+    tmp0 = __lasx_xvilvl_d(tmp1, tmp0);
+    res  = __lasx_xvdp2_h_bu(tmp0, coeff_vec);
+    res  = __lasx_xvslli_h(res, 3);
+    res  = __lasx_xvssrarni_bu_h(res, res, 6);
+    __lasx_xvstelm_w(res, dst, 0, 0);
+    __lasx_xvstelm_w(res, dst + stride, 0, 1);
+}
+
+static void avc_chroma_vt_4x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                   uint32_t coeff0, uint32_t coeff1)
+{
+    ptrdiff_t stride_2 = stride << 1;
+    ptrdiff_t stride_3 = stride_2 + stride;
+    ptrdiff_t stride_4 = stride_2 << 1;
+    __m256i src0, src1, src2, src3, src4;
+    __m256i tmp0, tmp1, tmp2, tmp3;
+    __m256i res;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec  = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    src0 = __lasx_xvld(src, 0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3,
+              src, stride_4, src1, src2, src3, src4);
+    DUP4_ARG2(__lasx_xvilvl_b, src1, src0, src2, src1, src3, src2, src4, src3,
+              tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp2);
+    tmp0 = __lasx_xvpermi_q(tmp0, tmp2, 0x02);
+    res = __lasx_xvdp2_h_bu(tmp0, coeff_vec);
+    res = __lasx_xvslli_h(res, 3);
+    res = __lasx_xvssrarni_bu_h(res, res, 6);
+    __lasx_xvstelm_w(res, dst, 0, 0);
+    __lasx_xvstelm_w(res, dst + stride, 0, 1);
+    __lasx_xvstelm_w(res, dst + stride_2, 0, 4);
+    __lasx_xvstelm_w(res, dst + stride_3, 0, 5);
+}
+
+static void avc_chroma_vt_4x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                   uint32_t coeff0, uint32_t coeff1)
+{
+    ptrdiff_t stride_2 = stride << 1;
+    ptrdiff_t stride_3 = stride_2 + stride;
+    ptrdiff_t stride_4 = stride_2 << 1;
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
+    __m256i res0, res1;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec  = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    src0 = __lasx_xvld(src, 0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3,
+              src, stride_4, src1, src2, src3, src4);
+    src += stride_4;
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2, src, stride_3,
+              src, stride_4, src5, src6, src7, src8);
+    DUP4_ARG2(__lasx_xvilvl_b, src1, src0, src2, src1, src3, src2, src4, src3,
+              tmp0, tmp1, tmp2, tmp3);
+    DUP4_ARG2(__lasx_xvilvl_b, src5, src4, src6, src5, src7, src6, src8, src7,
+              tmp4, tmp5, tmp6, tmp7);
+    DUP4_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp5, tmp4, tmp7, tmp6,
+              tmp0, tmp2, tmp4, tmp6);
+    tmp0 = __lasx_xvpermi_q(tmp0, tmp2, 0x02);
+    tmp4 = __lasx_xvpermi_q(tmp4, tmp6, 0x02);
+    DUP2_ARG2(__lasx_xvdp2_h_bu, tmp0, coeff_vec, tmp4, coeff_vec, res0, res1);
+    res0 = __lasx_xvssrarni_bu_h(res1, res0, 6);
+    __lasx_xvstelm_w(res0, dst, 0, 0);
+    __lasx_xvstelm_w(res0, dst + stride, 0, 1);
+    __lasx_xvstelm_w(res0, dst + stride_2, 0, 4);
+    __lasx_xvstelm_w(res0, dst + stride_3, 0, 5);
+    dst += stride_4;
+    __lasx_xvstelm_w(res0, dst, 0, 2);
+    __lasx_xvstelm_w(res0, dst + stride, 0, 3);
+    __lasx_xvstelm_w(res0, dst + stride_2, 0, 6);
+    __lasx_xvstelm_w(res0, dst + stride_3, 0, 7);
+}
+
+static void avc_chroma_vt_4w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                  uint32_t coeff0, uint32_t coeff1,
+                                  int32_t height)
+{
+    if (8 == height) {
+        avc_chroma_vt_4x8_lasx(src, dst, stride, coeff0, coeff1);
+    } else if (4 == height) {
+        avc_chroma_vt_4x4_lasx(src, dst, stride, coeff0, coeff1);
+    } else if (2 == height) {
+        avc_chroma_vt_4x2_lasx(src, dst, stride, coeff0, coeff1);
+    }
+}
+
+static void avc_chroma_vt_8w_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                  uint32_t coeff0, uint32_t coeff1,
+                                  int32_t height)
+{
+    if (4 == height) {
+        avc_chroma_vt_8x4_lasx(src, dst, stride, coeff0, coeff1);
+    } else if (8 == height) {
+        avc_chroma_vt_8x8_lasx(src, dst, stride, coeff0, coeff1);
+    }
+}
+
+static void copy_width4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                             int32_t height)
+{
+    uint32_t tp0, tp1, tp2, tp3, tp4, tp5, tp6, tp7;
+
+    if (8 == height) {
+        ptrdiff_t stride_2, stride_3, stride_4;
+
+        __asm__ volatile (
+        "slli.d   %[stride_2],     %[stride],     1             \n\t"
+        "add.d    %[stride_3],     %[stride_2],   %[stride]     \n\t"
+        "slli.d   %[stride_4],     %[stride_2],   1             \n\t"
+        "ld.wu    %[tp0],          %[src],        0             \n\t"
+        "ldx.wu   %[tp1],          %[src],        %[stride]     \n\t"
+        "ldx.wu   %[tp2],          %[src],        %[stride_2]   \n\t"
+        "ldx.wu   %[tp3],          %[src],        %[stride_3]   \n\t"
+        "add.d    %[src],          %[src],        %[stride_4]   \n\t"
+        "ld.wu    %[tp4],          %[src],        0             \n\t"
+        "ldx.wu   %[tp5],          %[src],        %[stride]     \n\t"
+        "ldx.wu   %[tp6],          %[src],        %[stride_2]   \n\t"
+        "ldx.wu   %[tp7],          %[src],        %[stride_3]   \n\t"
+        "st.w     %[tp0],          %[dst],        0             \n\t"
+        "stx.w    %[tp1],          %[dst],        %[stride]     \n\t"
+        "stx.w    %[tp2],          %[dst],        %[stride_2]   \n\t"
+        "stx.w    %[tp3],          %[dst],        %[stride_3]   \n\t"
+        "add.d    %[dst],          %[dst],        %[stride_4]   \n\t"
+        "st.w     %[tp4],          %[dst],        0             \n\t"
+        "stx.w    %[tp5],          %[dst],        %[stride]     \n\t"
+        "stx.w    %[tp6],          %[dst],        %[stride_2]   \n\t"
+        "stx.w    %[tp7],          %[dst],        %[stride_3]   \n\t"
+        : [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3), [stride_4]"=&r"(stride_4),
+          [src]"+&r"(src), [dst]"+&r"(dst), [tp0]"=&r"(tp0), [tp1]"=&r"(tp1),
+          [tp2]"=&r"(tp2), [tp3]"=&r"(tp3), [tp4]"=&r"(tp4), [tp5]"=&r"(tp5),
+          [tp6]"=&r"(tp6), [tp7]"=&r"(tp7)
+        : [stride]"r"(stride)
+        : "memory"
+        );
+    } else if (4 == height) {
+        ptrdiff_t stride_2, stride_3;
+
+        __asm__ volatile (
+        "slli.d   %[stride_2],     %[stride],     1             \n\t"
+        "add.d    %[stride_3],     %[stride_2],   %[stride]     \n\t"
+        "ld.wu    %[tp0],          %[src],        0             \n\t"
+        "ldx.wu   %[tp1],          %[src],        %[stride]     \n\t"
+        "ldx.wu   %[tp2],          %[src],        %[stride_2]   \n\t"
+        "ldx.wu   %[tp3],          %[src],        %[stride_3]   \n\t"
+        "st.w     %[tp0],          %[dst],        0             \n\t"
+        "stx.w    %[tp1],          %[dst],        %[stride]     \n\t"
+        "stx.w    %[tp2],          %[dst],        %[stride_2]   \n\t"
+        "stx.w    %[tp3],          %[dst],        %[stride_3]   \n\t"
+        : [stride_2]"=&r"(stride_2), [stride_3]"=&r"(stride_3),
+          [src]"+&r"(src), [dst]"+&r"(dst), [tp0]"=&r"(tp0), [tp1]"=&r"(tp1),
+          [tp2]"=&r"(tp2), [tp3]"=&r"(tp3)
+        : [stride]"r"(stride)
+        : "memory"
+        );
+    } else if (2 == height) {
+        __asm__ volatile (
+        "ld.wu    %[tp0],          %[src],        0             \n\t"
+        "ldx.wu   %[tp1],          %[src],        %[stride]     \n\t"
+        "st.w     %[tp0],          %[dst],        0             \n\t"
+        "stx.w    %[tp1],          %[dst],        %[stride]     \n\t"
+        : [tp0]"=&r"(tp0), [tp1]"=&r"(tp1)
+        : [src]"r"(src), [dst]"r"(dst), [stride]"r"(stride)
+        : "memory"
+        );
+    }
+}
+
+static void copy_width8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                             int32_t height)
+{
+    if (8 == height) {
+        copy_width8x8_lasx(src, dst, stride);
+    } else if (4 == height) {
+        copy_width8x4_lasx(src, dst, stride);
+    }
+}
+
+void ff_put_h264_chroma_mc4_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
+                                 int height, int x, int y)
+{
+    av_assert2(x < 8 && y < 8 && x >= 0 && y >= 0);
+
+    if (x && y) {
+        avc_chroma_hv_4w_lasx(src, dst, stride, x, (8 - x), y, (8 - y), height);
+    } else if (x) {
+        avc_chroma_hz_4w_lasx(src, dst, stride, x, (8 - x), height);
+    } else if (y) {
+        avc_chroma_vt_4w_lasx(src, dst, stride, y, (8 - y), height);
+    } else {
+        copy_width4_lasx(src, dst, stride, height);
+    }
+}
+
+void ff_put_h264_chroma_mc8_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
+                                 int height, int x, int y)
+{
+    av_assert2(x < 8 && y < 8 && x >= 0 && y >= 0);
+
+    if (!(x || y)) {
+        copy_width8_lasx(src, dst, stride, height);
+    } else if (x && y) {
+        avc_chroma_hv_8w_lasx(src, dst, stride, x, (8 - x), y, (8 - y), height);
+    } else if (x) {
+        avc_chroma_hz_8w_lasx(src, dst, stride, x, (8 - x), height);
+    } else {
+        avc_chroma_vt_8w_lasx(src, dst, stride, y, (8 - y), height);
+    }
+}
+
+static av_always_inline void avc_chroma_hv_and_aver_dst_8x4_lasx(uint8_t *src,
+                             uint8_t *dst, ptrdiff_t stride, uint32_t coef_hor0,
+                             uint32_t coef_hor1, uint32_t coef_ver0,
+                             uint32_t coef_ver1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i tp0, tp1, tp2, tp3;
+    __m256i src0, src1, src2, src3, src4, out;
+    __m256i res_hz0, res_hz1, res_hz2, res_vt0, res_vt1;
+    __m256i mask;
+    __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0);
+    __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1);
+    __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1);
+    __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0);
+    __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1);
+
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x,
+              src1, src2, src3, src4);
+    DUP2_ARG3(__lasx_xvpermi_q, src2, src1, 0x20, src4, src3, 0x20, src1, src3);
+    src0 = __lasx_xvshuf_b(src0, src0, mask);
+    DUP2_ARG3(__lasx_xvshuf_b, src1, src1, mask, src3, src3, mask, src1, src3);
+    DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, res_hz0, res_hz1);
+    res_hz2 = __lasx_xvdp2_h_bu(src3, coeff_hz_vec);
+    res_vt0 = __lasx_xvmul_h(res_hz1, coeff_vt_vec0);
+    res_vt1 = __lasx_xvmul_h(res_hz2, coeff_vt_vec0);
+    res_hz0 = __lasx_xvpermi_q(res_hz1, res_hz0, 0x20);
+    res_hz1 = __lasx_xvpermi_q(res_hz1, res_hz2, 0x3);
+    res_vt0 = __lasx_xvmadd_h(res_vt0, res_hz0, coeff_vt_vec1);
+    res_vt1 = __lasx_xvmadd_h(res_vt1, res_hz1, coeff_vt_vec1);
+    out = __lasx_xvssrarni_bu_h(res_vt1, res_vt0, 6);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x,
+              tp0, tp1, tp2, tp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2);
+    tp0 = __lasx_xvpermi_q(tp2, tp0, 0x20);
+    out = __lasx_xvavgr_bu(out, tp0);
+    __lasx_xvstelm_d(out, dst, 0, 0);
+    __lasx_xvstelm_d(out, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avc_chroma_hv_and_aver_dst_8x8_lasx(uint8_t *src,
+                             uint8_t *dst, ptrdiff_t stride, uint32_t coef_hor0,
+                             uint32_t coef_hor1, uint32_t coef_ver0,
+                             uint32_t coef_ver1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i tp0, tp1, tp2, tp3, dst0, dst1;
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8;
+    __m256i out0, out1;
+    __m256i res_hz0, res_hz1, res_hz2, res_hz3, res_hz4;
+    __m256i res_vt0, res_vt1, res_vt2, res_vt3;
+    __m256i mask;
+    __m256i coeff_hz_vec0 = __lasx_xvreplgr2vr_b(coef_hor0);
+    __m256i coeff_hz_vec1 = __lasx_xvreplgr2vr_b(coef_hor1);
+    __m256i coeff_vt_vec0 = __lasx_xvreplgr2vr_h(coef_ver0);
+    __m256i coeff_vt_vec1 = __lasx_xvreplgr2vr_h(coef_ver1);
+    __m256i coeff_hz_vec = __lasx_xvilvl_b(coeff_hz_vec0, coeff_hz_vec1);
+
+    DUP2_ARG2(__lasx_xvld, chroma_mask_arr, 0, src, 0, mask, src0);
+    src += stride;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src1, src2, src3, src4);
+    src += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src5, src6, src7, src8);
+    DUP4_ARG3(__lasx_xvpermi_q, src2, src1, 0x20, src4, src3, 0x20, src6, src5, 0x20,
+              src8, src7, 0x20, src1, src3, src5, src7);
+    src0 = __lasx_xvshuf_b(src0, src0, mask);
+    DUP4_ARG3(__lasx_xvshuf_b, src1, src1, mask, src3, src3, mask, src5, src5, mask, src7,
+              src7, mask, src1, src3, src5, src7);
+    DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_hz_vec, src1, coeff_hz_vec, src3,
+              coeff_hz_vec, src5, coeff_hz_vec, res_hz0, res_hz1, res_hz2, res_hz3);
+    res_hz4 = __lasx_xvdp2_h_bu(src7, coeff_hz_vec);
+    res_vt0 = __lasx_xvmul_h(res_hz1, coeff_vt_vec0);
+    res_vt1 = __lasx_xvmul_h(res_hz2, coeff_vt_vec0);
+    res_vt2 = __lasx_xvmul_h(res_hz3, coeff_vt_vec0);
+    res_vt3 = __lasx_xvmul_h(res_hz4, coeff_vt_vec0);
+    res_hz0 = __lasx_xvpermi_q(res_hz1, res_hz0, 0x20);
+    res_hz1 = __lasx_xvpermi_q(res_hz1, res_hz2, 0x3);
+    res_hz2 = __lasx_xvpermi_q(res_hz2, res_hz3, 0x3);
+    res_hz3 = __lasx_xvpermi_q(res_hz3, res_hz4, 0x3);
+    res_vt0 = __lasx_xvmadd_h(res_vt0, res_hz0, coeff_vt_vec1);
+    res_vt1 = __lasx_xvmadd_h(res_vt1, res_hz1, coeff_vt_vec1);
+    res_vt2 = __lasx_xvmadd_h(res_vt2, res_hz2, coeff_vt_vec1);
+    res_vt3 = __lasx_xvmadd_h(res_vt3, res_hz3, coeff_vt_vec1);
+    DUP2_ARG3(__lasx_xvssrarni_bu_h, res_vt1, res_vt0, 6, res_vt3, res_vt2, 6,
+              out0, out1);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x,
+              tp0, tp1, tp2, tp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2);
+    dst0 = __lasx_xvpermi_q(tp2, tp0, 0x20);
+    dst += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x,
+              tp0, tp1, tp2, tp3);
+    dst -= stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2);
+    dst1 = __lasx_xvpermi_q(tp2, tp0, 0x20);
+    out0 = __lasx_xvavgr_bu(out0, dst0);
+    out1 = __lasx_xvavgr_bu(out1, dst1);
+    __lasx_xvstelm_d(out0, dst, 0, 0);
+    __lasx_xvstelm_d(out0, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(out1, dst, 0, 0);
+    __lasx_xvstelm_d(out1, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avc_chroma_hz_and_aver_dst_8x4_lasx(uint8_t *src,
+                             uint8_t *dst, ptrdiff_t stride, uint32_t coeff0,
+                             uint32_t coeff1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    __m256i tp0, tp1, tp2, tp3;
+    __m256i src0, src1, src2, src3, out;
+    __m256i res0, res1;
+    __m256i mask;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    mask = __lasx_xvld(chroma_mask_arr, 0);
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src0, src1, src2, src3);
+    DUP2_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src0, src2);
+    DUP2_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src0, src2);
+    DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1);
+    out = __lasx_xvssrarni_bu_h(res1, res0, 6);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x,
+              tp0, tp1, tp2, tp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2);
+    tp0 = __lasx_xvpermi_q(tp2, tp0, 0x20);
+    out = __lasx_xvavgr_bu(out, tp0);
+    __lasx_xvstelm_d(out, dst, 0, 0);
+    __lasx_xvstelm_d(out, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avc_chroma_hz_and_aver_dst_8x8_lasx(uint8_t *src,
+                             uint8_t *dst, ptrdiff_t stride, uint32_t coeff0,
+                             uint32_t coeff1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i tp0, tp1, tp2, tp3, dst0, dst1;
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+    __m256i out0, out1;
+    __m256i res0, res1, res2, res3;
+    __m256i mask;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    mask = __lasx_xvld(chroma_mask_arr, 0);
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src0, src1, src2, src3);
+    src += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src4, src5, src6, src7);
+    DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, 0x20,
+              src7, src6, 0x20, src0, src2, src4, src6);
+    DUP4_ARG3(__lasx_xvshuf_b, src0, src0, mask, src2, src2, mask, src4, src4,
+              mask, src6, src6, mask, src0, src2, src4, src6);
+    DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, src4, coeff_vec, src6,
+              coeff_vec, res0, res1, res2, res3);
+    DUP2_ARG3(__lasx_xvssrarni_bu_h, res1, res0, 6, res3, res2, 6, out0, out1);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x,
+              tp0, tp1, tp2, tp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2);
+    dst0 = __lasx_xvpermi_q(tp2, tp0, 0x20);
+    dst += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x,
+              tp0, tp1, tp2, tp3);
+    dst -= stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2);
+    dst1 = __lasx_xvpermi_q(tp2, tp0, 0x20);
+    out0 = __lasx_xvavgr_bu(out0, dst0);
+    out1 = __lasx_xvavgr_bu(out1, dst1);
+    __lasx_xvstelm_d(out0, dst, 0, 0);
+    __lasx_xvstelm_d(out0, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(out1, dst, 0, 0);
+    __lasx_xvstelm_d(out1, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avc_chroma_vt_and_aver_dst_8x4_lasx(uint8_t *src,
+                             uint8_t *dst, ptrdiff_t stride, uint32_t coeff0,
+                             uint32_t coeff1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i tp0, tp1, tp2, tp3;
+    __m256i src0, src1, src2, src3, src4, out;
+    __m256i res0, res1;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    src0 = __lasx_xvld(src, 0);
+    DUP4_ARG2(__lasx_xvldx, src, stride, src, stride_2x, src, stride_3x, src, stride_4x,
+              src1, src2, src3, src4);
+    DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, 0x20,
+              src4, src3, 0x20, src0, src1, src2, src3);
+    DUP2_ARG2(__lasx_xvilvl_b, src1, src0, src3, src2, src0, src2);
+    DUP2_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, res0, res1);
+    out = __lasx_xvssrarni_bu_h(res1, res0, 6);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x,
+              tp0, tp1, tp2, tp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2);
+    tp0 = __lasx_xvpermi_q(tp2, tp0, 0x20);
+    out = __lasx_xvavgr_bu(out, tp0);
+    __lasx_xvstelm_d(out, dst, 0, 0);
+    __lasx_xvstelm_d(out, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avc_chroma_vt_and_aver_dst_8x8_lasx(uint8_t *src,
+                             uint8_t *dst, ptrdiff_t stride, uint32_t coeff0,
+                             uint32_t coeff1)
+{
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i tp0, tp1, tp2, tp3, dst0, dst1;
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8;
+    __m256i out0, out1;
+    __m256i res0, res1, res2, res3;
+    __m256i coeff_vec0 = __lasx_xvreplgr2vr_b(coeff0);
+    __m256i coeff_vec1 = __lasx_xvreplgr2vr_b(coeff1);
+    __m256i coeff_vec = __lasx_xvilvl_b(coeff_vec0, coeff_vec1);
+
+    coeff_vec = __lasx_xvslli_b(coeff_vec, 3);
+    src0 = __lasx_xvld(src, 0);
+    src += stride;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src1, src2, src3, src4);
+    src += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x, src, stride_3x,
+              src5, src6, src7, src8);
+    DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, 0x20,
+              src4, src3, 0x20, src0, src1, src2, src3);
+    DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, src6, src5, 0x20, src7, src6, 0x20,
+              src8, src7, 0x20, src4, src5, src6, src7);
+    DUP4_ARG2(__lasx_xvilvl_b, src1, src0, src3, src2, src5, src4, src7, src6,
+              src0, src2, src4, src6);
+    DUP4_ARG2(__lasx_xvdp2_h_bu, src0, coeff_vec, src2, coeff_vec, src4, coeff_vec, src6,
+              coeff_vec, res0, res1, res2, res3);
+    DUP2_ARG3(__lasx_xvssrarni_bu_h, res1, res0, 6, res3, res2, 6, out0, out1);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x,
+              tp0, tp1, tp2, tp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2);
+    dst0 = __lasx_xvpermi_q(tp2, tp0, 0x20);
+    dst += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x, dst, stride_3x,
+              tp0, tp1, tp2, tp3);
+    dst -= stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tp2, tp0, tp3, tp1, tp0, tp2);
+    dst1 = __lasx_xvpermi_q(tp2, tp0, 0x20);
+    out0 = __lasx_xvavgr_bu(out0, dst0);
+    out1 = __lasx_xvavgr_bu(out1, dst1);
+    __lasx_xvstelm_d(out0, dst, 0, 0);
+    __lasx_xvstelm_d(out0, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out0, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out0, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(out1, dst, 0, 0);
+    __lasx_xvstelm_d(out1, dst + stride, 0, 2);
+    __lasx_xvstelm_d(out1, dst + stride_2x, 0, 1);
+    __lasx_xvstelm_d(out1, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avg_width8x8_lasx(uint8_t *src, uint8_t *dst,
+                                               ptrdiff_t stride)
+{
+    __m256i src0, src1, src2, src3;
+    __m256i dst0, dst1, dst2, dst3;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+
+    src0 = __lasx_xvldrepl_d(src, 0);
+    src1 = __lasx_xvldrepl_d(src + stride, 0);
+    src2 = __lasx_xvldrepl_d(src + stride_2x, 0);
+    src3 = __lasx_xvldrepl_d(src + stride_3x, 0);
+    dst0 = __lasx_xvldrepl_d(dst, 0);
+    dst1 = __lasx_xvldrepl_d(dst + stride, 0);
+    dst2 = __lasx_xvldrepl_d(dst + stride_2x, 0);
+    dst3 = __lasx_xvldrepl_d(dst + stride_3x, 0);
+    src0 = __lasx_xvpackev_d(src1, src0);
+    src2 = __lasx_xvpackev_d(src3, src2);
+    src0 = __lasx_xvpermi_q(src0, src2, 0x02);
+    dst0 = __lasx_xvpackev_d(dst1, dst0);
+    dst2 = __lasx_xvpackev_d(dst3, dst2);
+    dst0 = __lasx_xvpermi_q(dst0, dst2, 0x02);
+    dst0 = __lasx_xvavgr_bu(src0, dst0);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3);
+
+    src += stride_4x;
+    dst += stride_4x;
+    src0 = __lasx_xvldrepl_d(src, 0);
+    src1 = __lasx_xvldrepl_d(src + stride, 0);
+    src2 = __lasx_xvldrepl_d(src + stride_2x, 0);
+    src3 = __lasx_xvldrepl_d(src + stride_3x, 0);
+    dst0 = __lasx_xvldrepl_d(dst, 0);
+    dst1 = __lasx_xvldrepl_d(dst + stride, 0);
+    dst2 = __lasx_xvldrepl_d(dst + stride_2x, 0);
+    dst3 = __lasx_xvldrepl_d(dst + stride_3x, 0);
+    src0 = __lasx_xvpackev_d(src1, src0);
+    src2 = __lasx_xvpackev_d(src3, src2);
+    src0 = __lasx_xvpermi_q(src0, src2, 0x02);
+    dst0 = __lasx_xvpackev_d(dst1, dst0);
+    dst2 = __lasx_xvpackev_d(dst3, dst2);
+    dst0 = __lasx_xvpermi_q(dst0, dst2, 0x02);
+    dst0 = __lasx_xvavgr_bu(src0, dst0);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3);
+}
+
+static av_always_inline void avg_width8x4_lasx(uint8_t *src, uint8_t *dst,
+                                               ptrdiff_t stride)
+{
+    __m256i src0, src1, src2, src3;
+    __m256i dst0, dst1, dst2, dst3;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+
+    src0 = __lasx_xvldrepl_d(src, 0);
+    src1 = __lasx_xvldrepl_d(src + stride, 0);
+    src2 = __lasx_xvldrepl_d(src + stride_2x, 0);
+    src3 = __lasx_xvldrepl_d(src + stride_3x, 0);
+    dst0 = __lasx_xvldrepl_d(dst, 0);
+    dst1 = __lasx_xvldrepl_d(dst + stride, 0);
+    dst2 = __lasx_xvldrepl_d(dst + stride_2x, 0);
+    dst3 = __lasx_xvldrepl_d(dst + stride_3x, 0);
+    src0 = __lasx_xvpackev_d(src1, src0);
+    src2 = __lasx_xvpackev_d(src3, src2);
+    src0 = __lasx_xvpermi_q(src0, src2, 0x02);
+    dst0 = __lasx_xvpackev_d(dst1, dst0);
+    dst2 = __lasx_xvpackev_d(dst3, dst2);
+    dst0 = __lasx_xvpermi_q(dst0, dst2, 0x02);
+    dst0 = __lasx_xvavgr_bu(src0, dst0);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3);
+}
+
+static void avc_chroma_hv_and_aver_dst_8w_lasx(uint8_t *src, uint8_t *dst,
+                                               ptrdiff_t stride,
+                                               uint32_t coef_hor0,
+                                               uint32_t coef_hor1,
+                                               uint32_t coef_ver0,
+                                               uint32_t coef_ver1,
+                                               int32_t height)
+{
+    if (4 == height) {
+        avc_chroma_hv_and_aver_dst_8x4_lasx(src, dst, stride, coef_hor0,
+                                            coef_hor1, coef_ver0, coef_ver1);
+    } else if (8 == height) {
+        avc_chroma_hv_and_aver_dst_8x8_lasx(src, dst, stride, coef_hor0,
+                                            coef_hor1, coef_ver0, coef_ver1);
+    }
+}
+
+static void avc_chroma_hz_and_aver_dst_8w_lasx(uint8_t *src, uint8_t *dst,
+                                               ptrdiff_t stride, uint32_t coeff0,
+                                               uint32_t coeff1, int32_t height)
+{
+    if (4 == height) {
+        avc_chroma_hz_and_aver_dst_8x4_lasx(src, dst, stride, coeff0, coeff1);
+    } else if (8 == height) {
+        avc_chroma_hz_and_aver_dst_8x8_lasx(src, dst, stride, coeff0, coeff1);
+    }
+}
+
+static void avc_chroma_vt_and_aver_dst_8w_lasx(uint8_t *src, uint8_t *dst,
+                                               ptrdiff_t stride, uint32_t coeff0,
+                                               uint32_t coeff1, int32_t height)
+{
+    if (4 == height) {
+        avc_chroma_vt_and_aver_dst_8x4_lasx(src, dst, stride, coeff0, coeff1);
+    } else if (8 == height) {
+        avc_chroma_vt_and_aver_dst_8x8_lasx(src, dst, stride, coeff0, coeff1);
+    }
+}
+
+static void avg_width8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                            int32_t height)
+{
+    if (8 == height) {
+        avg_width8x8_lasx(src, dst, stride);
+    } else if (4 == height) {
+        avg_width8x4_lasx(src, dst, stride);
+    }
+}
+
+void ff_avg_h264_chroma_mc8_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
+                                 int height, int x, int y)
+{
+    av_assert2(x < 8 && y < 8 && x >= 0 && y >= 0);
+
+    if (!(x || y)) {
+        avg_width8_lasx(src, dst, stride, height);
+    } else if (x && y) {
+        avc_chroma_hv_and_aver_dst_8w_lasx(src, dst, stride, x, (8 - x), y,
+                                           (8 - y), height);
+    } else if (x) {
+        avc_chroma_hz_and_aver_dst_8w_lasx(src, dst, stride, x, (8 - x), height);
+    } else {
+        avc_chroma_vt_and_aver_dst_8w_lasx(src, dst, stride, y, (8 - y), height);
+    }
+}
diff --git a/libavcodec/loongarch/h264chroma_lasx.h b/libavcodec/loongarch/h264chroma_lasx.h
new file mode 100644
index 0000000000..4aac8db8cb
--- /dev/null
+++ b/libavcodec/loongarch/h264chroma_lasx.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright (c) 2020 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_H264CHROMA_LASX_H
+#define AVCODEC_LOONGARCH_H264CHROMA_LASX_H
+
+#include <stdint.h>
+#include <stddef.h>
+#include "libavcodec/h264.h"
+
+void ff_put_h264_chroma_mc4_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
+        int h, int x, int y);
+void ff_put_h264_chroma_mc8_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
+        int h, int x, int y);
+void ff_avg_h264_chroma_mc8_lasx(uint8_t *dst, uint8_t *src, ptrdiff_t stride,
+        int h, int x, int y);
+
+#endif /* AVCODEC_LOONGARCH_H264CHROMA_LASX_H */
diff --git a/libavutil/loongarch/loongson_intrinsics.h b/libavutil/loongarch/loongson_intrinsics.h
new file mode 100644
index 0000000000..6e0439f829
--- /dev/null
+++ b/libavutil/loongarch/loongson_intrinsics.h
@@ -0,0 +1,1877 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * All rights reserved.
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *                Xiwei Gu   <guxiwei-hf@loongson.cn>
+ *                Lu Wang    <wanglu@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ *
+ */
+
+#ifndef AVUTIL_LOONGARCH_LOONGSON_INTRINSICS_H
+#define AVUTIL_LOONGARCH_LOONGSON_INTRINSICS_H
+
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * All rights reserved.
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *                Xiwei Gu   <guxiwei-hf@loongson.cn>
+ *                Lu Wang    <wanglu@loongson.cn>
+ *
+ * This file is a header file for the LoongArch builtin extension.
+ *
+ */
+
+#ifndef LOONGSON_INTRINSICS_H
+#define LOONGSON_INTRINSICS_H
+
+/**
+ * MAJOR version: Macro usage changes.
+ * MINOR version: Add new functions, or bug fix.
+ * MICRO version: Comment changes or implementation changes.
+ */
+#define LSOM_VERSION_MAJOR 1
+#define LSOM_VERSION_MINOR 0
+#define LSOM_VERSION_MICRO 3
+
+#define DUP2_ARG1(_INS, _IN0, _IN1, _OUT0, _OUT1) \
+{ \
+    _OUT0 = _INS(_IN0); \
+    _OUT1 = _INS(_IN1); \
+}
+
+#define DUP2_ARG2(_INS, _IN0, _IN1, _IN2, _IN3, _OUT0, _OUT1) \
+{ \
+    _OUT0 = _INS(_IN0, _IN1); \
+    _OUT1 = _INS(_IN2, _IN3); \
+}
+
+#define DUP2_ARG3(_INS, _IN0, _IN1, _IN2, _IN3, _IN4, _IN5, _OUT0, _OUT1) \
+{ \
+    _OUT0 = _INS(_IN0, _IN1, _IN2); \
+    _OUT1 = _INS(_IN3, _IN4, _IN5); \
+}
+
+#define DUP4_ARG1(_INS, _IN0, _IN1, _IN2, _IN3, _OUT0, _OUT1, _OUT2, _OUT3) \
+{ \
+    DUP2_ARG1(_INS, _IN0, _IN1, _OUT0, _OUT1); \
+    DUP2_ARG1(_INS, _IN2, _IN3, _OUT2, _OUT3); \
+}
+
+#define DUP4_ARG2(_INS, _IN0, _IN1, _IN2, _IN3, _IN4, _IN5, _IN6, _IN7, \
+                  _OUT0, _OUT1, _OUT2, _OUT3) \
+{ \
+    DUP2_ARG2(_INS, _IN0, _IN1, _IN2, _IN3, _OUT0, _OUT1); \
+    DUP2_ARG2(_INS, _IN4, _IN5, _IN6, _IN7, _OUT2, _OUT3); \
+}
+
+#define DUP4_ARG3(_INS, _IN0, _IN1, _IN2, _IN3, _IN4, _IN5, _IN6, _IN7, \
+                  _IN8, _IN9, _IN10, _IN11, _OUT0, _OUT1, _OUT2, _OUT3) \
+{ \
+    DUP2_ARG3(_INS, _IN0, _IN1, _IN2, _IN3, _IN4,  _IN5,  _OUT0, _OUT1); \
+    DUP2_ARG3(_INS, _IN6, _IN7, _IN8, _IN9, _IN10, _IN11, _OUT2, _OUT3); \
+}
+
+#ifdef __loongarch_sx
+#include <lsxintrin.h>
+/*
+ * =============================================================================
+ * Description : Dot product & addition of byte vector elements
+ * Arguments   : Inputs  - in_c, in_h, in_l
+ *               Outputs - out
+ *               Return Type - halfword
+ * Details     : Signed byte elements from in_h are multiplied by
+ *               signed byte elements from in_l, and adjacent products are
+ *               added together to give results twice the size of the input.
+ *               The sums are then added to the signed half words of in_c.
+ * Example     : out = __lsx_vdp2add_h_b(in_c, in_h, in_l)
+ *        in_c : 1,2,3,4, 1,2,3,4
+ *        in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1
+ *         out : 23,40,41,26, 23,40,41,26
+ * =============================================================================
+ */
+static inline __m128i __lsx_vdp2add_h_b(__m128i in_c, __m128i in_h, __m128i in_l)
+{
+    __m128i out;
+
+    out = __lsx_vmaddwev_h_b(in_c, in_h, in_l);
+    out = __lsx_vmaddwod_h_b(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product & addition of byte vector elements
+ * Arguments   : Inputs  - in_c, in_h, in_l
+ *               Outputs - out
+ *               Return Type - halfword
+ * Details     : Unsigned byte elements from in_h are multiplied by
+ *               unsigned byte elements from in_l, and adjacent products are
+ *               added together to give results twice the size of the input.
+ *               The sums are then added to the signed half words of in_c.
+ * Example     : out = __lsx_vdp2add_h_bu(in_c, in_h, in_l)
+ *        in_c : 1,2,3,4, 1,2,3,4
+ *        in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1
+ *         out : 23,40,41,26, 23,40,41,26
+ * =============================================================================
+ */
+static inline __m128i __lsx_vdp2add_h_bu(__m128i in_c, __m128i in_h, __m128i in_l)
+{
+    __m128i out;
+
+    out = __lsx_vmaddwev_h_bu(in_c, in_h, in_l);
+    out = __lsx_vmaddwod_h_bu(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product & addition of half word vector elements
+ * Arguments   : Inputs  - in_c, in_h, in_l
+ *               Outputs - out
+ *               Return Type - __m128i
+ * Details     : Signed half word elements from in_h are multiplied by
+ *               signed half word elements from in_l, and adjacent products are
+ *               added together to give results twice the size of the input.
+ *               The sums are then added to the signed words of in_c.
+ * Example     : out = __lsx_vdp2add_w_h(in_c, in_h, in_l)
+ *        in_c : 1,2,3,4
+ *        in_h : 1,2,3,4, 5,6,7,8
+ *        in_l : 8,7,6,5, 4,3,2,1
+ *         out : 23,40,41,26
+ * =============================================================================
+ */
+static inline __m128i __lsx_vdp2add_w_h(__m128i in_c, __m128i in_h, __m128i in_l)
+{
+    __m128i out;
+
+    out = __lsx_vmaddwev_w_h(in_c, in_h, in_l);
+    out = __lsx_vmaddwod_w_h(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product of byte vector elements
+ * Arguments   : Inputs  - in_h, in_l
+ *               Outputs - out
+ *               Return Type - halfword
+ * Details     : Signed byte elements from in_h are multiplied by
+ *               signed byte elements from in_l, and adjacent products are
+ *               added together to give results twice the width of the input.
+ * Example     : out = __lsx_vdp2_h_b(in_h, in_l)
+ *        in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1
+ *         out : 22,38,38,22, 22,38,38,22
+ * =============================================================================
+ */
+static inline __m128i __lsx_vdp2_h_b(__m128i in_h, __m128i in_l)
+{
+    __m128i out;
+
+    out = __lsx_vmulwev_h_b(in_h, in_l);
+    out = __lsx_vmaddwod_h_b(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product of byte vector elements
+ * Arguments   : Inputs  - in_h, in_l
+ *               Outputs - out
+ *               Return Type - halfword
+ * Details     : Unsigned byte elements from in_h are multiplied by
+ *               unsigned byte elements from in_l, and adjacent products are
+ *               added together to give results twice the width of the input.
+ * Example     : out = __lsx_vdp2_h_bu(in_h, in_l)
+ *        in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1
+ *         out : 22,38,38,22, 22,38,38,22
+ * =============================================================================
+ */
+static inline __m128i __lsx_vdp2_h_bu(__m128i in_h, __m128i in_l)
+{
+    __m128i out;
+
+    out = __lsx_vmulwev_h_bu(in_h, in_l);
+    out = __lsx_vmaddwod_h_bu(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product of byte vector elements
+ * Arguments   : Inputs  - in_h, in_l
+ *               Outputs - out
+ *               Return Type - halfword
+ * Details     : Unsigned byte elements from in_h are multiplied by
+ *               signed byte elements from in_l, and adjacent products are
+ *               added together to give results twice the width of the input.
+ * Example     : out = __lsx_vdp2_h_bu_b(in_h, in_l)
+ *        in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,-1
+ *         out : 22,38,38,22, 22,38,38,6
+ * =============================================================================
+ */
+static inline __m128i __lsx_vdp2_h_bu_b(__m128i in_h, __m128i in_l)
+{
+    __m128i out;
+
+    out = __lsx_vmulwev_h_bu_b(in_h, in_l);
+    out = __lsx_vmaddwod_h_bu_b(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product of half word vector elements
+ * Arguments   : Inputs  - in_h, in_l
+ *               Outputs - out
+ *               Return Type - word
+ * Details     : Signed half word elements from in_h are multiplied by
+ *               signed half word elements from in_l, and adjacent products are
+ *               added together to give results twice the width of the input.
+ * Example     : out = __lsx_vdp2_w_h(in_h, in_l)
+ *        in_h : 1,2,3,4, 5,6,7,8
+ *        in_l : 8,7,6,5, 4,3,2,1
+ *         out : 22,38,38,22
+ * =============================================================================
+ */
+static inline __m128i __lsx_vdp2_w_h(__m128i in_h, __m128i in_l)
+{
+    __m128i out;
+
+    out = __lsx_vmulwev_w_h(in_h, in_l);
+    out = __lsx_vmaddwod_w_h(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Clip all halfword elements of input vector between min & max
+ *               out = ((_in) < (min)) ? (min) : (((_in) > (max)) ? (max) : (_in))
+ * Arguments   : Inputs  - _in  (input vector)
+ *                       - min  (min threshold)
+ *                       - max  (max threshold)
+ *               Outputs - out  (output vector with clipped elements)
+ *               Return Type - signed halfword
+ * Example     : out = __lsx_vclip_h(_in, min, max)
+ *         _in : -8,2,280,249, -8,255,280,249
+ *         min : 1,1,1,1, 1,1,1,1
+ *         max : 9,9,9,9, 9,9,9,9
+ *         out : 1,2,9,9, 1,9,9,9
+ * =============================================================================
+ */
+static inline __m128i __lsx_vclip_h(__m128i _in, __m128i min, __m128i max)
+{
+    __m128i out;
+
+    out = __lsx_vmax_h(min, _in);
+    out = __lsx_vmin_h(max, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Set each element of vector between 0 and 255
+ * Arguments   : Inputs  - _in
+ *               Outputs - out
+ *               Return Type - halfword
+ * Details     : Signed halfword elements from _in are clamped between 0 and 255.
+ * Example     : out = __lsx_vclip255_h(_in)
+ *         _in : -8,255,280,249, -8,255,280,249
+ *         out : 0,255,255,249, 0,255,255,249
+ * =============================================================================
+ */
+static inline __m128i __lsx_vclip255_h(__m128i _in)
+{
+    __m128i out;
+
+    out = __lsx_vmaxi_h(_in, 0);
+    out = __lsx_vsat_hu(out, 7);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Set each element of vector between 0 and 255
+ * Arguments   : Inputs  - _in
+ *               Outputs - out
+ *               Return Type - word
+ * Details     : Signed word elements from _in are clamped between 0 and 255.
+ * Example     : out = __lsx_vclip255_w(_in)
+ *         _in : -8,255,280,249
+ *         out : 0,255,255,249
+ * =============================================================================
+ */
+static inline __m128i __lsx_vclip255_w(__m128i _in)
+{
+    __m128i out;
+
+    out = __lsx_vmaxi_w(_in, 0);
+    out = __lsx_vsat_wu(out, 7);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Swap two variables
+ * Arguments   : Inputs  - _in0, _in1
+ *               Outputs - _in0, _in1 (in-place)
+ * Details     : Swapping of two input variables using xor
+ * Example     : LSX_SWAP(_in0, _in1)
+ *        _in0 : 1,2,3,4
+ *        _in1 : 5,6,7,8
+ *   _in0(out) : 5,6,7,8
+ *   _in1(out) : 1,2,3,4
+ * =============================================================================
+ */
+#define LSX_SWAP(_in0, _in1)                                            \
+{                                                                       \
+    _in0 = __lsx_vxor_v(_in0, _in1);                                    \
+    _in1 = __lsx_vxor_v(_in0, _in1);                                    \
+    _in0 = __lsx_vxor_v(_in0, _in1);                                    \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose 4x4 block with word elements in vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3
+ *               Outputs - _out0, _out1, _out2, _out3
+ * Details     : The rows of the matrix become columns, and the columns become rows.
+ * Example     :
+ *               1, 2, 3, 4            1, 5, 9,13
+ *               5, 6, 7, 8    to      2, 6,10,14
+ *               9,10,11,12  =====>    3, 7,11,15
+ *              13,14,15,16            4, 8,12,16
+ * =============================================================================
+ */
+#define LSX_TRANSPOSE4x4_W(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \
+{                                                                              \
+    __m128i _t0, _t1, _t2, _t3;                                                \
+                                                                               \
+    _t0   = __lsx_vilvl_w(_in1, _in0);                                         \
+    _t1   = __lsx_vilvh_w(_in1, _in0);                                         \
+    _t2   = __lsx_vilvl_w(_in3, _in2);                                         \
+    _t3   = __lsx_vilvh_w(_in3, _in2);                                         \
+    _out0 = __lsx_vilvl_d(_t2, _t0);                                           \
+    _out1 = __lsx_vilvh_d(_t2, _t0);                                           \
+    _out2 = __lsx_vilvl_d(_t3, _t1);                                           \
+    _out3 = __lsx_vilvh_d(_t3, _t1);                                           \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose 8x8 block with byte elements in vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7
+ *               Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7
+ * Details     : The rows of the matrix become columns, and the columns become rows.
+ * Example     : LSX_TRANSPOSE8x8_B
+ *        _in0 : 00,01,02,03,04,05,06,07, 00,00,00,00,00,00,00,00
+ *        _in1 : 10,11,12,13,14,15,16,17, 00,00,00,00,00,00,00,00
+ *        _in2 : 20,21,22,23,24,25,26,27, 00,00,00,00,00,00,00,00
+ *        _in3 : 30,31,32,33,34,35,36,37, 00,00,00,00,00,00,00,00
+ *        _in4 : 40,41,42,43,44,45,46,47, 00,00,00,00,00,00,00,00
+ *        _in5 : 50,51,52,53,54,55,56,57, 00,00,00,00,00,00,00,00
+ *        _in6 : 60,61,62,63,64,65,66,67, 00,00,00,00,00,00,00,00
+ *        _in7 : 70,71,72,73,74,75,76,77, 00,00,00,00,00,00,00,00
+ *
+ *       _out0 : 00,10,20,30,40,50,60,70, 00,00,00,00,00,00,00,00
+ *       _out1 : 01,11,21,31,41,51,61,71, 00,00,00,00,00,00,00,00
+ *       _out2 : 02,12,22,32,42,52,62,72, 00,00,00,00,00,00,00,00
+ *       _out3 : 03,13,23,33,43,53,63,73, 00,00,00,00,00,00,00,00
+ *       _out4 : 04,14,24,34,44,54,64,74, 00,00,00,00,00,00,00,00
+ *       _out5 : 05,15,25,35,45,55,65,75, 00,00,00,00,00,00,00,00
+ *       _out6 : 06,16,26,36,46,56,66,76, 00,00,00,00,00,00,00,00
+ *       _out7 : 07,17,27,37,47,57,67,77, 00,00,00,00,00,00,00,00
+ * =============================================================================
+ */
+#define LSX_TRANSPOSE8x8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                           _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                 \
+   __m128i zero = {0};                                                            \
+   __m128i shuf8 = {0x0F0E0D0C0B0A0908, 0x1716151413121110};                      \
+   __m128i _t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7;                                \
+                                                                                  \
+   _t0 = __lsx_vilvl_b(_in2, _in0);                                               \
+   _t1 = __lsx_vilvl_b(_in3, _in1);                                               \
+   _t2 = __lsx_vilvl_b(_in6, _in4);                                               \
+   _t3 = __lsx_vilvl_b(_in7, _in5);                                               \
+   _t4 = __lsx_vilvl_b(_t1, _t0);                                                 \
+   _t5 = __lsx_vilvh_b(_t1, _t0);                                                 \
+   _t6 = __lsx_vilvl_b(_t3, _t2);                                                 \
+   _t7 = __lsx_vilvh_b(_t3, _t2);                                                 \
+   _out0 = __lsx_vilvl_w(_t6, _t4);                                               \
+   _out2 = __lsx_vilvh_w(_t6, _t4);                                               \
+   _out4 = __lsx_vilvl_w(_t7, _t5);                                               \
+   _out6 = __lsx_vilvh_w(_t7, _t5);                                               \
+   _out1 = __lsx_vshuf_b(zero, _out0, shuf8);                                     \
+   _out3 = __lsx_vshuf_b(zero, _out2, shuf8);                                     \
+   _out5 = __lsx_vshuf_b(zero, _out4, shuf8);                                     \
+   _out7 = __lsx_vshuf_b(zero, _out6, shuf8);                                     \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose 8x8 block with half word elements in vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7
+ *               Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7
+ * Details     : The rows of the matrix become columns, and the columns become rows.
+ * Example     :
+ *              00,01,02,03,04,05,06,07           00,10,20,30,40,50,60,70
+ *              10,11,12,13,14,15,16,17           01,11,21,31,41,51,61,71
+ *              20,21,22,23,24,25,26,27           02,12,22,32,42,52,62,72
+ *              30,31,32,33,34,35,36,37    to     03,13,23,33,43,53,63,73
+ *              40,41,42,43,44,45,46,47  ======>  04,14,24,34,44,54,64,74
+ *              50,51,52,53,54,55,56,57           05,15,25,35,45,55,65,75
+ *              60,61,62,63,64,65,66,67           06,16,26,36,46,56,66,76
+ *              70,71,72,73,74,75,76,77           07,17,27,37,47,57,67,77
+ * =============================================================================
+ */
+#define LSX_TRANSPOSE8x8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                           _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                 \
+    __m128i _s0, _s1, _t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7;                     \
+                                                                                  \
+    _s0 = __lsx_vilvl_h(_in6, _in4);                                              \
+    _s1 = __lsx_vilvl_h(_in7, _in5);                                              \
+    _t0 = __lsx_vilvl_h(_s1, _s0);                                                \
+    _t1 = __lsx_vilvh_h(_s1, _s0);                                                \
+    _s0 = __lsx_vilvh_h(_in6, _in4);                                              \
+    _s1 = __lsx_vilvh_h(_in7, _in5);                                              \
+    _t2 = __lsx_vilvl_h(_s1, _s0);                                                \
+    _t3 = __lsx_vilvh_h(_s1, _s0);                                                \
+    _s0 = __lsx_vilvl_h(_in2, _in0);                                              \
+    _s1 = __lsx_vilvl_h(_in3, _in1);                                              \
+    _t4 = __lsx_vilvl_h(_s1, _s0);                                                \
+    _t5 = __lsx_vilvh_h(_s1, _s0);                                                \
+    _s0 = __lsx_vilvh_h(_in2, _in0);                                              \
+    _s1 = __lsx_vilvh_h(_in3, _in1);                                              \
+    _t6 = __lsx_vilvl_h(_s1, _s0);                                                \
+    _t7 = __lsx_vilvh_h(_s1, _s0);                                                \
+                                                                                  \
+    _out0 = __lsx_vpickev_d(_t0, _t4);                                            \
+    _out2 = __lsx_vpickev_d(_t1, _t5);                                            \
+    _out4 = __lsx_vpickev_d(_t2, _t6);                                            \
+    _out6 = __lsx_vpickev_d(_t3, _t7);                                            \
+    _out1 = __lsx_vpickod_d(_t0, _t4);                                            \
+    _out3 = __lsx_vpickod_d(_t1, _t5);                                            \
+    _out5 = __lsx_vpickod_d(_t2, _t6);                                            \
+    _out7 = __lsx_vpickod_d(_t3, _t7);                                            \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose input 8x4 byte block into 4x8
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3,
+ *                         _in4, _in5, _in6, _in7      (input 8x4 byte block)
+ *               Outputs - _out0, _out1, _out2, _out3  (output 4x8 byte block)
+ * Details     : The rows of the matrix become columns, and the columns become rows.
+ * Example     : LSX_TRANSPOSE8x4_B
+ *        _in0 : 00,01,02,03,00,00,00,00, 00,00,00,00,00,00,00,00
+ *        _in1 : 10,11,12,13,00,00,00,00, 00,00,00,00,00,00,00,00
+ *        _in2 : 20,21,22,23,00,00,00,00, 00,00,00,00,00,00,00,00
+ *        _in3 : 30,31,32,33,00,00,00,00, 00,00,00,00,00,00,00,00
+ *        _in4 : 40,41,42,43,00,00,00,00, 00,00,00,00,00,00,00,00
+ *        _in5 : 50,51,52,53,00,00,00,00, 00,00,00,00,00,00,00,00
+ *        _in6 : 60,61,62,63,00,00,00,00, 00,00,00,00,00,00,00,00
+ *        _in7 : 70,71,72,73,00,00,00,00, 00,00,00,00,00,00,00,00
+ *
+ *       _out0 : 00,10,20,30,40,50,60,70, 00,00,00,00,00,00,00,00
+ *       _out1 : 01,11,21,31,41,51,61,71, 00,00,00,00,00,00,00,00
+ *       _out2 : 02,12,22,32,42,52,62,72, 00,00,00,00,00,00,00,00
+ *       _out3 : 03,13,23,33,43,53,63,73, 00,00,00,00,00,00,00,00
+ * =============================================================================
+ */
+#define LSX_TRANSPOSE8x4_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,       \
+                           _out0, _out1, _out2, _out3)                           \
+{                                                                                \
+    __m128i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m;                                  \
+                                                                                 \
+    _tmp0_m = __lsx_vpackev_w(_in4, _in0);                                       \
+    _tmp1_m = __lsx_vpackev_w(_in5, _in1);                                       \
+    _tmp2_m = __lsx_vilvl_b(_tmp1_m, _tmp0_m);                                   \
+    _tmp0_m = __lsx_vpackev_w(_in6, _in2);                                       \
+    _tmp1_m = __lsx_vpackev_w(_in7, _in3);                                       \
+                                                                                 \
+    _tmp3_m = __lsx_vilvl_b(_tmp1_m, _tmp0_m);                                   \
+    _tmp0_m = __lsx_vilvl_h(_tmp3_m, _tmp2_m);                                   \
+    _tmp1_m = __lsx_vilvh_h(_tmp3_m, _tmp2_m);                                   \
+                                                                                 \
+    _out0 = __lsx_vilvl_w(_tmp1_m, _tmp0_m);                                     \
+    _out2 = __lsx_vilvh_w(_tmp1_m, _tmp0_m);                                     \
+    _out1 = __lsx_vilvh_d(_out2, _out0);                                         \
+    _out3 = __lsx_vilvh_d(_out0, _out2);                                         \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose 16x8 block with byte elements in vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, _in8,
+ *                         _in9, _in10, _in11, _in12, _in13, _in14, _in15
+ *               Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7
+ * Details     : The rows of the matrix become columns, and the columns become rows.
+ * Example     :
+ *              000,001,002,003,004,005,006,007
+ *              008,009,010,011,012,013,014,015
+ *              016,017,018,019,020,021,022,023
+ *              024,025,026,027,028,029,030,031
+ *              032,033,034,035,036,037,038,039
+ *              040,041,042,043,044,045,046,047        000,008,...,112,120
+ *              048,049,050,051,052,053,054,055        001,009,...,113,121
+ *              056,057,058,059,060,061,062,063   to   002,010,...,114,122
+ *              064,065,066,067,068,069,070,071 =====> 003,011,...,115,123
+ *              072,073,074,075,076,077,078,079        004,012,...,116,124
+ *              080,081,082,083,084,085,086,087        005,013,...,117,125
+ *              088,089,090,091,092,093,094,095        006,014,...,118,126
+ *              096,097,098,099,100,101,102,103        007,015,...,119,127
+ *              104,105,106,107,108,109,110,111
+ *              112,113,114,115,116,117,118,119
+ *              120,121,122,123,124,125,126,127
+ * =============================================================================
+ */
+#define LSX_TRANSPOSE16x8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, _in8,  \
+                            _in9, _in10, _in11, _in12, _in13, _in14, _in15, _out0, \
+                            _out1, _out2, _out3, _out4, _out5, _out6, _out7)       \
+{                                                                                  \
+    __m128i _tmp0, _tmp1, _tmp2, _tmp3, _tmp4, _tmp5, _tmp6, _tmp7;                \
+    __m128i _t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7;                                \
+    DUP4_ARG2(__lsx_vilvl_b, _in2, _in0, _in3, _in1, _in6, _in4, _in7, _in5,       \
+              _tmp0, _tmp1, _tmp2, _tmp3);                                         \
+    DUP4_ARG2(__lsx_vilvl_b, _in10, _in8, _in11, _in9, _in14, _in12, _in15,        \
+              _in13, _tmp4, _tmp5, _tmp6, _tmp7);                                  \
+    DUP2_ARG2(__lsx_vilvl_b, _tmp1, _tmp0, _tmp3, _tmp2, _t0, _t2);                \
+    DUP2_ARG2(__lsx_vilvh_b, _tmp1, _tmp0, _tmp3, _tmp2, _t1, _t3);                \
+    DUP2_ARG2(__lsx_vilvl_b, _tmp5, _tmp4, _tmp7, _tmp6, _t4, _t6);                \
+    DUP2_ARG2(__lsx_vilvh_b, _tmp5, _tmp4, _tmp7, _tmp6, _t5, _t7);                \
+    DUP2_ARG2(__lsx_vilvl_w, _t2, _t0, _t3, _t1, _tmp0, _tmp4);                    \
+    DUP2_ARG2(__lsx_vilvh_w, _t2, _t0, _t3, _t1, _tmp2, _tmp6);                    \
+    DUP2_ARG2(__lsx_vilvl_w, _t6, _t4, _t7, _t5, _tmp1, _tmp5);                    \
+    DUP2_ARG2(__lsx_vilvh_w, _t6, _t4, _t7, _t5, _tmp3, _tmp7);                    \
+    DUP2_ARG2(__lsx_vilvl_d, _tmp1, _tmp0, _tmp3, _tmp2, _out0, _out2);            \
+    DUP2_ARG2(__lsx_vilvh_d, _tmp1, _tmp0, _tmp3, _tmp2, _out1, _out3);            \
+    DUP2_ARG2(__lsx_vilvl_d, _tmp5, _tmp4, _tmp7, _tmp6, _out4, _out6);            \
+    DUP2_ARG2(__lsx_vilvh_d, _tmp5, _tmp4, _tmp7, _tmp6, _out5, _out7);            \
+}
+
+/*
+ * =============================================================================
+ * Description : Butterfly of 4 input vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3
+ *               Outputs - _out0, _out1, _out2, _out3
+ * Details     : Butterfly operation
+ * Example     :
+ *               out0 = in0 + in3;
+ *               out1 = in1 + in2;
+ *               out2 = in1 - in2;
+ *               out3 = in0 - in3;
+ * =============================================================================
+ */
+#define LSX_BUTTERFLY_4_B(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \
+{                                                                             \
+    _out0 = __lsx_vadd_b(_in0, _in3);                                         \
+    _out1 = __lsx_vadd_b(_in1, _in2);                                         \
+    _out2 = __lsx_vsub_b(_in1, _in2);                                         \
+    _out3 = __lsx_vsub_b(_in0, _in3);                                         \
+}
+#define LSX_BUTTERFLY_4_H(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \
+{                                                                             \
+    _out0 = __lsx_vadd_h(_in0, _in3);                                         \
+    _out1 = __lsx_vadd_h(_in1, _in2);                                         \
+    _out2 = __lsx_vsub_h(_in1, _in2);                                         \
+    _out3 = __lsx_vsub_h(_in0, _in3);                                         \
+}
+#define LSX_BUTTERFLY_4_W(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \
+{                                                                             \
+    _out0 = __lsx_vadd_w(_in0, _in3);                                         \
+    _out1 = __lsx_vadd_w(_in1, _in2);                                         \
+    _out2 = __lsx_vsub_w(_in1, _in2);                                         \
+    _out3 = __lsx_vsub_w(_in0, _in3);                                         \
+}
+#define LSX_BUTTERFLY_4_D(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \
+{                                                                             \
+    _out0 = __lsx_vadd_d(_in0, _in3);                                         \
+    _out1 = __lsx_vadd_d(_in1, _in2);                                         \
+    _out2 = __lsx_vsub_d(_in1, _in2);                                         \
+    _out3 = __lsx_vsub_d(_in0, _in3);                                         \
+}
+
+/*
+ * =============================================================================
+ * Description : Butterfly of 8 input vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, ~
+ *               Outputs - _out0, _out1, _out2, _out3, ~
+ * Details     : Butterfly operation
+ * Example     :
+ *              _out0 = _in0 + _in7;
+ *              _out1 = _in1 + _in6;
+ *              _out2 = _in2 + _in5;
+ *              _out3 = _in3 + _in4;
+ *              _out4 = _in3 - _in4;
+ *              _out5 = _in2 - _in5;
+ *              _out6 = _in1 - _in6;
+ *              _out7 = _in0 - _in7;
+ * =============================================================================
+ */
+#define LSX_BUTTERFLY_8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                          _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                \
+    _out0 = __lsx_vadd_b(_in0, _in7);                                            \
+    _out1 = __lsx_vadd_b(_in1, _in6);                                            \
+    _out2 = __lsx_vadd_b(_in2, _in5);                                            \
+    _out3 = __lsx_vadd_b(_in3, _in4);                                            \
+    _out4 = __lsx_vsub_b(_in3, _in4);                                            \
+    _out5 = __lsx_vsub_b(_in2, _in5);                                            \
+    _out6 = __lsx_vsub_b(_in1, _in6);                                            \
+    _out7 = __lsx_vsub_b(_in0, _in7);                                            \
+}
+
+#define LSX_BUTTERFLY_8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                          _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                \
+    _out0 = __lsx_vadd_h(_in0, _in7);                                            \
+    _out1 = __lsx_vadd_h(_in1, _in6);                                            \
+    _out2 = __lsx_vadd_h(_in2, _in5);                                            \
+    _out3 = __lsx_vadd_h(_in3, _in4);                                            \
+    _out4 = __lsx_vsub_h(_in3, _in4);                                            \
+    _out5 = __lsx_vsub_h(_in2, _in5);                                            \
+    _out6 = __lsx_vsub_h(_in1, _in6);                                            \
+    _out7 = __lsx_vsub_h(_in0, _in7);                                            \
+}
+
+#define LSX_BUTTERFLY_8_W(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                          _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                \
+    _out0 = __lsx_vadd_w(_in0, _in7);                                            \
+    _out1 = __lsx_vadd_w(_in1, _in6);                                            \
+    _out2 = __lsx_vadd_w(_in2, _in5);                                            \
+    _out3 = __lsx_vadd_w(_in3, _in4);                                            \
+    _out4 = __lsx_vsub_w(_in3, _in4);                                            \
+    _out5 = __lsx_vsub_w(_in2, _in5);                                            \
+    _out6 = __lsx_vsub_w(_in1, _in6);                                            \
+    _out7 = __lsx_vsub_w(_in0, _in7);                                            \
+}
+
+#define LSX_BUTTERFLY_8_D(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                          _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                \
+    _out0 = __lsx_vadd_d(_in0, _in7);                                            \
+    _out1 = __lsx_vadd_d(_in1, _in6);                                            \
+    _out2 = __lsx_vadd_d(_in2, _in5);                                            \
+    _out3 = __lsx_vadd_d(_in3, _in4);                                            \
+    _out4 = __lsx_vsub_d(_in3, _in4);                                            \
+    _out5 = __lsx_vsub_d(_in2, _in5);                                            \
+    _out6 = __lsx_vsub_d(_in1, _in6);                                            \
+    _out7 = __lsx_vsub_d(_in0, _in7);                                            \
+}
+
+#endif //LSX
+
+#ifdef __loongarch_asx
+#include <lasxintrin.h>
+/*
+ * =============================================================================
+ * Description : Dot product of byte vector elements
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ *               Return Type - signed halfword
+ * Details     : Unsigned byte elements from in_h are multiplied with
+ *               unsigned byte elements from in_l producing a result
+ *               twice the size of input i.e. signed halfword.
+ *               The products of adjacent even and odd elements are then
+ *               added together to form the out vector.
+ * Example     : See out = __lasx_xvdp2_w_h(in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2_h_bu(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmulwev_h_bu(in_h, in_l);
+    out = __lasx_xvmaddwod_h_bu(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product of byte vector elements
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ *               Return Type - signed halfword
+ * Details     : Signed byte elements from in_h are multiplied with
+ *               signed byte elements from in_l producing a result
+ *               twice the size of input i.e. signed halfword.
+ *               The products of adjacent even and odd elements are then
+ *               added together to form the out vector.
+ * Example     : See out = __lasx_xvdp2_w_h(in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2_h_b(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmulwev_h_b(in_h, in_l);
+    out = __lasx_xvmaddwod_h_b(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product of halfword vector elements
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ *               Return Type - signed word
+ * Details     : Signed halfword elements from in_h are multiplied with
+ *               signed halfword elements from in_l producing a result
+ *               twice the size of input i.e. signed word.
+ *               The products of adjacent even and odd elements are then
+ *               added together to form the out vector.
+ * Example     : out = __lasx_xvdp2_w_h(in_h, in_l)
+ *        in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1
+ *         out : 22,38,38,22, 22,38,38,22
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2_w_h(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmulwev_w_h(in_h, in_l);
+    out = __lasx_xvmaddwod_w_h(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product of word vector elements
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ *               Return Type - signed doubleword
+ * Details     : Signed word elements from in_h are multiplied with
+ *               signed word elements from in_l producing a result
+ *               twice the size of input i.e. signed double word.
+ *               The multiplication results of adjacent odd-even elements
+ *               are then added together and stored to the out vector.
+ * Example     : See out = __lasx_xvdp2_w_h(in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2_d_w(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmulwev_d_w(in_h, in_l);
+    out = __lasx_xvmaddwod_d_w(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product of halfword vector elements
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ *               Return Type - signed word
+ * Details     : Unsigned halfword elements from in_h are multiplied with
+ *               signed halfword elements from in_l producing a result
+ *               twice the size of input i.e. signed word.
+ *               The multiplication results of adjacent odd-even elements
+ *               are then added together and stored to the out vector.
+ * Example     : See out = __lasx_xvdp2_w_h(in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2_w_hu_h(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmulwev_w_hu_h(in_h, in_l);
+    out = __lasx_xvmaddwod_w_hu_h(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product & addition of byte vector elements
+ * Arguments   : Inputs - in_c, in_h, in_l
+ *               Output - out
+ *               Return Type - signed halfword
+ * Details     : Signed byte elements from in_h are multiplied with
+ *               signed byte elements from in_l producing a result
+ *               twice the size of input i.e. signed halfword.
+ *               The multiplication results of adjacent odd-even elements
+ *               are then added to the in_c vector.
+ * Example     : See out = __lasx_xvdp2add_w_h(in_c, in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2add_h_b(__m256i in_c, __m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmaddwev_h_b(in_c, in_h, in_l);
+    out = __lasx_xvmaddwod_h_b(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product & addition of halfword vector elements
+ * Arguments   : Inputs - in_c, in_h, in_l
+ *               Output - out
+ *               Return Type - signed word
+ * Details     : Signed halfword elements from in_h are multiplied with
+ *               signed halfword elements from in_l producing a result
+ *               twice the size of input i.e. signed word.
+ *               Multiplication results of adjacent odd-even elements
+ *               are added to the in_c vector.
+ * Example     : out = __lasx_xvdp2add_w_h(in_c, in_h, in_l)
+ *        in_c : 1,2,3,4, 1,2,3,4
+ *        in_h : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        in_l : 8,7,6,5, 4,3,2,1, 8,7,6,5, 4,3,2,1
+ *         out : 23,40,41,26, 23,40,41,26
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2add_w_h(__m256i in_c, __m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmaddwev_w_h(in_c, in_h, in_l);
+    out = __lasx_xvmaddwod_w_h(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product & addition of halfword vector elements
+ * Arguments   : Inputs - in_c, in_h, in_l
+ *               Output - out
+ *               Return Type - signed word
+ * Details     : Unsigned halfword elements from in_h are multiplied with
+ *               unsigned halfword elements from in_l producing a result
+ *               twice the size of input i.e. signed word.
+ *               Multiplication results of adjacent odd-even elements
+ *               are added to the in_c vector.
+ * Example     : See out = __lasx_xvdp2add_w_h(in_c, in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2add_w_hu(__m256i in_c, __m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmaddwev_w_hu(in_c, in_h, in_l);
+    out = __lasx_xvmaddwod_w_hu(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product & addition of halfword vector elements
+ * Arguments   : Inputs - in_c, in_h, in_l
+ *               Output - out
+ *               Return Type - signed word
+ * Details     : Unsigned halfword elements from in_h are multiplied with
+ *               signed halfword elements from in_l producing a result
+ *               twice the size of input i.e. signed word.
+ *               Multiplication results of adjacent odd-even elements
+ *               are added to the in_c vector.
+ * Example     : See out = __lasx_xvdp2add_w_h(in_c, in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2add_w_hu_h(__m256i in_c, __m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmaddwev_w_hu_h(in_c, in_h, in_l);
+    out = __lasx_xvmaddwod_w_hu_h(out, in_h, in_l);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Vector Unsigned Dot Product and Subtract
+ * Arguments   : Inputs - in_c, in_h, in_l
+ *               Output - out
+ *               Return Type - signed halfword
+ * Details     : Unsigned byte elements from in_h are multiplied with
+ *               unsigned byte elements from in_l producing a result
+ *               twice the size of input i.e. signed halfword.
+ *               Multiplication results of adjacent odd-even elements
+ *               are added together and subtracted from the double-width
+ *               elements of the in_c vector.
+ * Example     : See out = __lasx_xvdp2sub_w_h(in_c, in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2sub_h_bu(__m256i in_c, __m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmulwev_h_bu(in_h, in_l);
+    out = __lasx_xvmaddwod_h_bu(out, in_h, in_l);
+    out = __lasx_xvsub_h(in_c, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Vector Signed Dot Product and Subtract
+ * Arguments   : Inputs - in_c, in_h, in_l
+ *               Output - out
+ *               Return Type - signed word
+ * Details     : Signed halfword elements from in_h are multiplied with
+ *               signed halfword elements from in_l producing a result
+ *               twice the size of input i.e. signed word.
+ *               Multiplication results of adjacent odd-even elements
+ *               are added together and subtracted from the double-width
+ *               elements of the in_c vector.
+ * Example     : out = __lasx_xvdp2sub_w_h(in_c, in_h, in_l)
+ *        in_c : 0,0,0,0, 0,0,0,0
+ *        in_h : 3,1,3,0, 0,0,0,1, 0,0,1,1, 0,0,0,1
+ *        in_l : 2,1,1,0, 1,0,0,0, 0,0,1,0, 1,0,0,1
+ *         out : -7,-3,0,0, 0,-1,0,-1
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp2sub_w_h(__m256i in_c, __m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmulwev_w_h(in_h, in_l);
+    out = __lasx_xvmaddwod_w_h(out, in_h, in_l);
+    out = __lasx_xvsub_w(in_c, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Dot product of halfword vector elements
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ *               Return Type - signed doubleword
+ * Details     : Signed halfword elements from in_h are multiplied with
+ *               signed halfword elements from in_l producing a result
+ *               four times the size of input i.e. signed doubleword.
+ *               The multiplication results of four adjacent elements
+ *               are then added together and stored to the out vector.
+ * Example     : out = __lasx_xvdp4_d_h(in_h, in_l)
+ *        in_h :  3,1,3,0, 0,0,0,1, 0,0,1,-1, 0,0,0,1
+ *        in_l : -2,1,1,0, 1,0,0,0, 0,0,1, 0, 1,0,0,1
+ *         out : -2,0,1,1
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvdp4_d_h(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvmulwev_w_h(in_h, in_l);
+    out = __lasx_xvmaddwod_w_h(out, in_h, in_l);
+    out = __lasx_xvhaddw_d_w(out, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : The high half of the vector elements are widened to twice
+ *               their width and then added.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The higher halves of the in_h and in_l vectors are
+ *               sign-extended to twice their width (signed byte to signed
+ *               halfword), added, and stored to the out vector.
+ * Example     : See out = __lasx_xvaddwh_w_h(in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvaddwh_h_b(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvilvh_b(in_h, in_l);
+    out = __lasx_xvhaddw_h_b(out, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : The high half of the vector elements are widened to twice
+ *               their width and then added.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The higher halves of the in_h and in_l vectors are
+ *               sign-extended to twice their width (signed halfword to
+ *               signed word), added, and stored to the out vector.
+ * Example     : out = __lasx_xvaddwh_w_h(in_h, in_l)
+ *        in_h : 3, 0,3,0, 0,0,0,-1, 0,0,1,-1, 0,0,0,1
+ *        in_l : 2,-1,1,2, 1,0,0, 0, 1,0,1, 0, 1,0,0,1
+ *         out : 1,0,0,-1, 1,0,0, 2
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvaddwh_w_h(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvilvh_h(in_h, in_l);
+    out = __lasx_xvhaddw_w_h(out, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : The low half of the vector elements are widened to twice
+ *               their width and then added.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The lower halves of the in_h and in_l vectors are
+ *               sign-extended to twice their width (signed byte to signed
+ *               halfword), added, and stored to the out vector.
+ * Example     : See out = __lasx_xvaddwl_w_h(in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvaddwl_h_b(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvilvl_b(in_h, in_l);
+    out = __lasx_xvhaddw_h_b(out, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : The low half of the vector elements are widened to twice
+ *               their width and then added.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The lower halves of the in_h and in_l vectors are
+ *               sign-extended to twice their width (signed halfword to
+ *               signed word), added, and stored to the out vector.
+ * Example     : out = __lasx_xvaddwl_w_h(in_h, in_l)
+ *        in_h : 3, 0,3,0, 0,0,0,-1, 0,0,1,-1, 0,0,0,1
+ *        in_l : 2,-1,1,2, 1,0,0, 0, 1,0,1, 0, 1,0,0,1
+ *         out : 5,-1,4,2, 1,0,2,-1
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvaddwl_w_h(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvilvl_h(in_h, in_l);
+    out = __lasx_xvhaddw_w_h(out, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : The low half of the vector elements are widened to twice
+ *               their width and then added.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The lower halves of the in_h and in_l vectors are
+ *               zero-extended to twice their width (unsigned byte to
+ *               unsigned halfword), added, and stored to the out vector.
+ * Example     : See out = __lasx_xvaddwl_w_h(in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvaddwl_h_bu(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvilvl_b(in_h, in_l);
+    out = __lasx_xvhaddw_hu_bu(out, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : The low half of the in_l vector is widened to twice its
+ *               element width and added to the in_h vector.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The lower half of in_l is zero-extended to twice its element
+ *               width (unsigned byte to signed halfword) and added to in_h.
+ * Example     : See out = __lasx_xvaddw_w_w_h(in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvaddw_h_h_bu(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvsllwil_hu_bu(in_l, 0);
+    out = __lasx_xvadd_h(in_h, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : The low half of the in_l vector is widened to twice its
+ *               element width and added to the in_h vector.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The lower half of in_l is sign-extended to twice its element
+ *               width (signed halfword to signed word) and added to in_h.
+ * Example     : out = __lasx_xvaddw_w_w_h(in_h, in_l)
+ *        in_h : 0, 1,0,0, -1,0,0,1
+ *        in_l : 2,-1,1,2,  1,0,0,0, 0,0,1,0, 1,0,0,1
+ *         out : 2, 0,1,2, -1,0,1,1
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvaddw_w_w_h(__m256i in_h, __m256i in_l)
+{
+    __m256i out;
+
+    out = __lasx_xvsllwil_w_h(in_l, 0);
+    out = __lasx_xvadd_w(in_h, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Multiplication and addition calculation after expansion
+ *               of the lower half of the vector.
+ * Arguments   : Inputs - in_c, in_h, in_l
+ *               Output - out
+ * Details     : The lower halves of the in_h and in_l vectors are
+ *               sign-extended to twice their width (signed halfword to signed
+ *               word) and multiplied; the result is added to the in_c vector
+ *               and stored to the out vector.
+ * Example     : out = __lasx_xvmaddwl_w_h(in_c, in_h, in_l)
+ *        in_c : 1,2,3,4, 5,6,7,8
+ *        in_h : 1,2,3,4, 1,2,3,4, 5,6,7,8, 5,6,7,8
+ *        in_l : 200, 300, 400, 500,  2000, 3000, 4000, 5000,
+ *              -200,-300,-400,-500, -2000,-3000,-4000,-5000
+ *         out : 201, 602,1203,2004, -995, -1794,-2793,-3992
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvmaddwl_w_h(__m256i in_c, __m256i in_h, __m256i in_l)
+{
+    __m256i tmp0, tmp1, out;
+
+    tmp0 = __lasx_xvsllwil_w_h(in_h, 0);
+    tmp1 = __lasx_xvsllwil_w_h(in_l, 0);
+    tmp0 = __lasx_xvmul_w(tmp0, tmp1);
+    out  = __lasx_xvadd_w(tmp0, in_c);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Multiplication and addition calculation after expansion
+ *               of the higher half of the vector.
+ * Arguments   : Inputs - in_c, in_h, in_l
+ *               Output - out
+ * Details     : The higher halves of the in_h and in_l vectors are
+ *               sign-extended to twice their width (signed halfword to signed
+ *               word) and multiplied; the result is added to the in_c vector
+ *               and stored to the out vector.
+ * Example     : See out = __lasx_xvmaddwl_w_h(in_c, in_h, in_l)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvmaddwh_w_h(__m256i in_c, __m256i in_h, __m256i in_l)
+{
+    __m256i tmp0, tmp1, out;
+
+    tmp0 = __lasx_xvilvh_h(in_h, in_h);
+    tmp1 = __lasx_xvilvh_h(in_l, in_l);
+    tmp0 = __lasx_xvmulwev_w_h(tmp0, tmp1);
+    out  = __lasx_xvadd_w(tmp0, in_c);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Multiplication calculation after expansion of the lower
+ *               half of the vector.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The lower halves of the in_h and in_l vectors are
+ *               sign-extended to twice their width (signed halfword to signed
+ *               word), multiplied, and stored to the out vector.
+ * Example     : out = __lasx_xvmulwl_w_h(in_h, in_l)
+ *        in_h : 3,-1,3,0, 0,0,0,-1, 0,0,1,-1, 0,0,0,1
+ *        in_l : 2,-1,1,2, 1,0,0, 0, 0,0,1, 0, 1,0,0,1
+ *         out : 6,1,3,0, 0,0,1,0
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvmulwl_w_h(__m256i in_h, __m256i in_l)
+{
+    __m256i tmp0, tmp1, out;
+
+    tmp0 = __lasx_xvsllwil_w_h(in_h, 0);
+    tmp1 = __lasx_xvsllwil_w_h(in_l, 0);
+    out  = __lasx_xvmul_w(tmp0, tmp1);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Multiplication calculation after expansion of the higher
+ *               half of the vector.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The higher halves of the in_h and in_l vectors are
+ *               sign-extended to twice their width (signed halfword to signed
+ *               word), multiplied, and stored to the out vector.
+ * Example     : out = __lasx_xvmulwh_w_h(in_h, in_l)
+ *        in_h : 3,-1,3,0, 0,0,0,-1, 0,0,1,-1, 0,0,0,1
+ *        in_l : 2,-1,1,2, 1,0,0, 0, 0,0,1, 0, 1,0,0,1
+ *         out : 0,0,0,0, 0,0,0,1
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvmulwh_w_h(__m256i in_h, __m256i in_l)
+{
+    __m256i tmp0, tmp1, out;
+
+    tmp0 = __lasx_xvilvh_h(in_h, in_h);
+    tmp1 = __lasx_xvilvh_h(in_l, in_l);
+    out  = __lasx_xvmulwev_w_h(tmp0, tmp1);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : The low half of the vector elements are widened to twice
+ *               their width and added with saturation.
+ * Arguments   : Inputs - in_h, in_l
+ *               Output - out
+ * Details     : The lower half of in_l is zero-extended to twice its width
+ *               (unsigned byte to unsigned halfword) and added with
+ *               saturation to in_h; the results are stored to the out vector.
+ * Example     : out = __lasx_xvsaddw_hu_hu_bu(in_h, in_l)
+ *        in_h : 2,65532,1,2, 1,0,0,0, 0,0,1,0, 1,0,0,1
+ *        in_l : 3,6,3,0, 0,0,0,1, 0,0,1,1, 0,0,0,1, 3,18,3,0, 0,0,0,1, 0,0,1,1, 0,0,0,1
+ *         out : 5,65535,4,2, 1,0,0,1, 3,18,4,0, 1,0,0,2
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvsaddw_hu_hu_bu(__m256i in_h, __m256i in_l)
+{
+    __m256i tmp1, out;
+    __m256i zero = {0};
+
+    tmp1 = __lasx_xvilvl_b(zero, in_l);
+    out  = __lasx_xvsadd_hu(in_h, tmp1);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Clip all halfword elements of input vector between min & max
+ *               out = ((in) < (min)) ? (min) : (((in) > (max)) ? (max) : (in))
+ * Arguments   : Inputs  - in    (input vector)
+ *                       - min   (min threshold)
+ *                       - max   (max threshold)
+ *               Outputs - out   (output vector with clipped elements)
+ *               Return Type - signed halfword
+ * Example     : out = __lasx_xvclip_h(in, min, max)
+ *          in : -8,2,280,249, -8,255,280,249, 4,4,4,4, 5,5,5,5
+ *         min : 1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1
+ *         max : 9,9,9,9, 9,9,9,9, 9,9,9,9, 9,9,9,9
+ *         out : 1,2,9,9, 1,9,9,9, 4,4,4,4, 5,5,5,5
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvclip_h(__m256i in, __m256i min, __m256i max)
+{
+    __m256i out;
+
+    out = __lasx_xvmax_h(min, in);
+    out = __lasx_xvmin_h(max, out);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Clip all signed halfword elements of input vector
+ *               between 0 & 255
+ * Arguments   : Inputs  - in   (input vector)
+ *               Outputs - out  (output vector with clipped elements)
+ *               Return Type - signed halfword
+ * Example     : See out = __lasx_xvclip255_w(in)
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvclip255_h(__m256i in)
+{
+    __m256i out;
+
+    out = __lasx_xvmaxi_h(in, 0);
+    out = __lasx_xvsat_hu(out, 7);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Clip all signed word elements of input vector
+ *               between 0 & 255
+ * Arguments   : Inputs - in   (input vector)
+ *               Output - out  (output vector with clipped elements)
+ *               Return Type - signed word
+ * Example     : out = __lasx_xvclip255_w(in)
+ *          in : -8,255,280,249, -8,255,280,249
+ *         out :  0,255,255,249,  0,255,255,249
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvclip255_w(__m256i in)
+{
+    __m256i out;
+
+    out = __lasx_xvmaxi_w(in, 0);
+    out = __lasx_xvsat_wu(out, 7);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Indexed halfword element values are replicated to all
+ *               elements in output vector. If 'idx < 8' use xvsplati_l_*,
+ *               if 'idx >= 8' use xvsplati_h_*.
+ * Arguments   : Inputs - in, idx
+ *               Output - out
+ * Details     : Idx element value from in vector is replicated to all
+ *               elements in out vector.
+ *               Valid index range for halfword operation is 0-7
+ * Example     : out = __lasx_xvsplati_l_h(in, idx)
+ *          in : 20,10,11,12, 13,14,15,16, 0,0,2,0, 0,0,0,0
+ *         idx : 0x02
+ *         out : 11,11,11,11, 11,11,11,11, 11,11,11,11, 11,11,11,11
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvsplati_l_h(__m256i in, int idx)
+{
+    __m256i out;
+
+    out = __lasx_xvpermi_q(in, in, 0x02);
+    out = __lasx_xvreplve_h(out, idx);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Indexed halfword element values are replicated to all
+ *               elements in output vector. If 'idx < 8' use xvsplati_l_*,
+ *               if 'idx >= 8' use xvsplati_h_*.
+ * Arguments   : Inputs - in, idx
+ *               Output - out
+ * Details     : Idx element value from in vector is replicated to all
+ *               elements in out vector.
+ *               Valid index range for halfword operation is 8-15
+ * Example     : out = __lasx_xvsplati_h_h(in, idx)
+ *          in : 20,10,11,12, 13,14,15,16, 0,2,0,0, 0,0,0,0
+ *         idx : 0x09
+ *         out : 2,2,2,2, 2,2,2,2, 2,2,2,2, 2,2,2,2
+ * =============================================================================
+ */
+static inline __m256i __lasx_xvsplati_h_h(__m256i in, int idx)
+{
+    __m256i out;
+
+    out = __lasx_xvpermi_q(in, in, 0x13);
+    out = __lasx_xvreplve_h(out, idx);
+    return out;
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose 4x4 block with double word elements in vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3
+ *               Outputs - _out0, _out1, _out2, _out3
+ * Example     : LASX_TRANSPOSE4x4_D
+ *        _in0 : 1,2,3,4
+ *        _in1 : 1,2,3,4
+ *        _in2 : 1,2,3,4
+ *        _in3 : 1,2,3,4
+ *
+ *       _out0 : 1,1,1,1
+ *       _out1 : 2,2,2,2
+ *       _out2 : 3,3,3,3
+ *       _out3 : 4,4,4,4
+ * =============================================================================
+ */
+#define LASX_TRANSPOSE4x4_D(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3) \
+{                                                                               \
+    __m256i _tmp0, _tmp1, _tmp2, _tmp3;                                         \
+    _tmp0 = __lasx_xvilvl_d(_in1, _in0);                                        \
+    _tmp1 = __lasx_xvilvh_d(_in1, _in0);                                        \
+    _tmp2 = __lasx_xvilvl_d(_in3, _in2);                                        \
+    _tmp3 = __lasx_xvilvh_d(_in3, _in2);                                        \
+    _out0 = __lasx_xvpermi_q(_tmp2, _tmp0, 0x20);                               \
+    _out2 = __lasx_xvpermi_q(_tmp2, _tmp0, 0x31);                               \
+    _out1 = __lasx_xvpermi_q(_tmp3, _tmp1, 0x20);                               \
+    _out3 = __lasx_xvpermi_q(_tmp3, _tmp1, 0x31);                               \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose 8x8 block with word elements in vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7
+ *               Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7
+ * Example     : LASX_TRANSPOSE8x8_W
+ *        _in0 : 1,2,3,4,5,6,7,8
+ *        _in1 : 2,2,3,4,5,6,7,8
+ *        _in2 : 3,2,3,4,5,6,7,8
+ *        _in3 : 4,2,3,4,5,6,7,8
+ *        _in4 : 5,2,3,4,5,6,7,8
+ *        _in5 : 6,2,3,4,5,6,7,8
+ *        _in6 : 7,2,3,4,5,6,7,8
+ *        _in7 : 8,2,3,4,5,6,7,8
+ *
+ *       _out0 : 1,2,3,4,5,6,7,8
+ *       _out1 : 2,2,2,2,2,2,2,2
+ *       _out2 : 3,3,3,3,3,3,3,3
+ *       _out3 : 4,4,4,4,4,4,4,4
+ *       _out4 : 5,5,5,5,5,5,5,5
+ *       _out5 : 6,6,6,6,6,6,6,6
+ *       _out6 : 7,7,7,7,7,7,7,7
+ *       _out7 : 8,8,8,8,8,8,8,8
+ * =============================================================================
+ */
+#define LASX_TRANSPOSE8x8_W(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,         \
+                            _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7) \
+{                                                                                   \
+    __m256i _s0_m, _s1_m;                                                           \
+    __m256i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m;                                     \
+    __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m;                                     \
+                                                                                    \
+    _s0_m   = __lasx_xvilvl_w(_in2, _in0);                                          \
+    _s1_m   = __lasx_xvilvl_w(_in3, _in1);                                          \
+    _tmp0_m = __lasx_xvilvl_w(_s1_m, _s0_m);                                        \
+    _tmp1_m = __lasx_xvilvh_w(_s1_m, _s0_m);                                        \
+    _s0_m   = __lasx_xvilvh_w(_in2, _in0);                                          \
+    _s1_m   = __lasx_xvilvh_w(_in3, _in1);                                          \
+    _tmp2_m = __lasx_xvilvl_w(_s1_m, _s0_m);                                        \
+    _tmp3_m = __lasx_xvilvh_w(_s1_m, _s0_m);                                        \
+    _s0_m   = __lasx_xvilvl_w(_in6, _in4);                                          \
+    _s1_m   = __lasx_xvilvl_w(_in7, _in5);                                          \
+    _tmp4_m = __lasx_xvilvl_w(_s1_m, _s0_m);                                        \
+    _tmp5_m = __lasx_xvilvh_w(_s1_m, _s0_m);                                        \
+    _s0_m   = __lasx_xvilvh_w(_in6, _in4);                                          \
+    _s1_m   = __lasx_xvilvh_w(_in7, _in5);                                          \
+    _tmp6_m = __lasx_xvilvl_w(_s1_m, _s0_m);                                        \
+    _tmp7_m = __lasx_xvilvh_w(_s1_m, _s0_m);                                        \
+    _out0 = __lasx_xvpermi_q(_tmp4_m, _tmp0_m, 0x20);                               \
+    _out1 = __lasx_xvpermi_q(_tmp5_m, _tmp1_m, 0x20);                               \
+    _out2 = __lasx_xvpermi_q(_tmp6_m, _tmp2_m, 0x20);                               \
+    _out3 = __lasx_xvpermi_q(_tmp7_m, _tmp3_m, 0x20);                               \
+    _out4 = __lasx_xvpermi_q(_tmp4_m, _tmp0_m, 0x31);                               \
+    _out5 = __lasx_xvpermi_q(_tmp5_m, _tmp1_m, 0x31);                               \
+    _out6 = __lasx_xvpermi_q(_tmp6_m, _tmp2_m, 0x31);                               \
+    _out7 = __lasx_xvpermi_q(_tmp7_m, _tmp3_m, 0x31);                               \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose input 16x8 byte block
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,
+ *                         _in8, _in9, _in10, _in11, _in12, _in13, _in14, _in15
+ *                         (input 16x8 byte block)
+ *               Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7
+ *                         (output 8x16 byte block)
+ * Details     : The rows of the matrix become columns, and the columns become rows.
+ * Example     : See LASX_TRANSPOSE16x8_H
+ * =============================================================================
+ */
+#define LASX_TRANSPOSE16x8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,         \
+                             _in8, _in9, _in10, _in11, _in12, _in13, _in14, _in15,   \
+                             _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7) \
+{                                                                                    \
+    __m256i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m;                                      \
+    __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m;                                      \
+                                                                                     \
+    _tmp0_m = __lasx_xvilvl_b(_in2, _in0);                                           \
+    _tmp1_m = __lasx_xvilvl_b(_in3, _in1);                                           \
+    _tmp2_m = __lasx_xvilvl_b(_in6, _in4);                                           \
+    _tmp3_m = __lasx_xvilvl_b(_in7, _in5);                                           \
+    _tmp4_m = __lasx_xvilvl_b(_in10, _in8);                                          \
+    _tmp5_m = __lasx_xvilvl_b(_in11, _in9);                                          \
+    _tmp6_m = __lasx_xvilvl_b(_in14, _in12);                                         \
+    _tmp7_m = __lasx_xvilvl_b(_in15, _in13);                                         \
+    _out0 = __lasx_xvilvl_b(_tmp1_m, _tmp0_m);                                       \
+    _out1 = __lasx_xvilvh_b(_tmp1_m, _tmp0_m);                                       \
+    _out2 = __lasx_xvilvl_b(_tmp3_m, _tmp2_m);                                       \
+    _out3 = __lasx_xvilvh_b(_tmp3_m, _tmp2_m);                                       \
+    _out4 = __lasx_xvilvl_b(_tmp5_m, _tmp4_m);                                       \
+    _out5 = __lasx_xvilvh_b(_tmp5_m, _tmp4_m);                                       \
+    _out6 = __lasx_xvilvl_b(_tmp7_m, _tmp6_m);                                       \
+    _out7 = __lasx_xvilvh_b(_tmp7_m, _tmp6_m);                                       \
+    _tmp0_m = __lasx_xvilvl_w(_out2, _out0);                                         \
+    _tmp2_m = __lasx_xvilvh_w(_out2, _out0);                                         \
+    _tmp4_m = __lasx_xvilvl_w(_out3, _out1);                                         \
+    _tmp6_m = __lasx_xvilvh_w(_out3, _out1);                                         \
+    _tmp1_m = __lasx_xvilvl_w(_out6, _out4);                                         \
+    _tmp3_m = __lasx_xvilvh_w(_out6, _out4);                                         \
+    _tmp5_m = __lasx_xvilvl_w(_out7, _out5);                                         \
+    _tmp7_m = __lasx_xvilvh_w(_out7, _out5);                                         \
+    _out0 = __lasx_xvilvl_d(_tmp1_m, _tmp0_m);                                       \
+    _out1 = __lasx_xvilvh_d(_tmp1_m, _tmp0_m);                                       \
+    _out2 = __lasx_xvilvl_d(_tmp3_m, _tmp2_m);                                       \
+    _out3 = __lasx_xvilvh_d(_tmp3_m, _tmp2_m);                                       \
+    _out4 = __lasx_xvilvl_d(_tmp5_m, _tmp4_m);                                       \
+    _out5 = __lasx_xvilvh_d(_tmp5_m, _tmp4_m);                                       \
+    _out6 = __lasx_xvilvl_d(_tmp7_m, _tmp6_m);                                       \
+    _out7 = __lasx_xvilvh_d(_tmp7_m, _tmp6_m);                                       \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose input 16x8 halfword block
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,
+ *                         _in8, _in9, _in10, _in11, _in12, _in13, _in14, _in15
+ *                         (input 16x8 halfword block)
+ *               Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7
+ *                         (output 8x16 halfword block)
+ * Details     : The rows of the matrix become columns, and the columns become rows.
+ * Example     : LASX_TRANSPOSE16x8_H
+ *        _in0 : 1,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *        _in1 : 2,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *        _in2 : 3,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *        _in3 : 4,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *        _in4 : 5,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *        _in5 : 6,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *        _in6 : 7,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *        _in7 : 8,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *        _in8 : 9,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *        _in9 : 1,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *       _in10 : 0,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *       _in11 : 2,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *       _in12 : 3,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *       _in13 : 7,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *       _in14 : 5,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *       _in15 : 6,2,3,4,5,6,7,8,0,0,0,0,0,0,0,0
+ *
+ *       _out0 : 1,2,3,4,5,6,7,8,9,1,0,2,3,7,5,6
+ *       _out1 : 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
+ *       _out2 : 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
+ *       _out3 : 4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
+ *       _out4 : 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
+ *       _out5 : 6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6
+ *       _out6 : 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
+ *       _out7 : 8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8
+ * =============================================================================
+ */
+#define LASX_TRANSPOSE16x8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,         \
+                             _in8, _in9, _in10, _in11, _in12, _in13, _in14, _in15,   \
+                             _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7) \
+{                                                                                    \
+    __m256i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m;                                      \
+    __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m;                                      \
+    __m256i _t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7;                                  \
+                                                                                     \
+    _tmp0_m = __lasx_xvilvl_h(_in2, _in0);                                           \
+    _tmp1_m = __lasx_xvilvl_h(_in3, _in1);                                           \
+    _tmp2_m = __lasx_xvilvl_h(_in6, _in4);                                           \
+    _tmp3_m = __lasx_xvilvl_h(_in7, _in5);                                           \
+    _tmp4_m = __lasx_xvilvl_h(_in10, _in8);                                          \
+    _tmp5_m = __lasx_xvilvl_h(_in11, _in9);                                          \
+    _tmp6_m = __lasx_xvilvl_h(_in14, _in12);                                         \
+    _tmp7_m = __lasx_xvilvl_h(_in15, _in13);                                         \
+    _t0 = __lasx_xvilvl_h(_tmp1_m, _tmp0_m);                                         \
+    _t1 = __lasx_xvilvh_h(_tmp1_m, _tmp0_m);                                         \
+    _t2 = __lasx_xvilvl_h(_tmp3_m, _tmp2_m);                                         \
+    _t3 = __lasx_xvilvh_h(_tmp3_m, _tmp2_m);                                         \
+    _t4 = __lasx_xvilvl_h(_tmp5_m, _tmp4_m);                                         \
+    _t5 = __lasx_xvilvh_h(_tmp5_m, _tmp4_m);                                         \
+    _t6 = __lasx_xvilvl_h(_tmp7_m, _tmp6_m);                                         \
+    _t7 = __lasx_xvilvh_h(_tmp7_m, _tmp6_m);                                         \
+    _tmp0_m = __lasx_xvilvl_d(_t2, _t0);                                             \
+    _tmp2_m = __lasx_xvilvh_d(_t2, _t0);                                             \
+    _tmp4_m = __lasx_xvilvl_d(_t3, _t1);                                             \
+    _tmp6_m = __lasx_xvilvh_d(_t3, _t1);                                             \
+    _tmp1_m = __lasx_xvilvl_d(_t6, _t4);                                             \
+    _tmp3_m = __lasx_xvilvh_d(_t6, _t4);                                             \
+    _tmp5_m = __lasx_xvilvl_d(_t7, _t5);                                             \
+    _tmp7_m = __lasx_xvilvh_d(_t7, _t5);                                             \
+    _out0 = __lasx_xvpermi_q(_tmp1_m, _tmp0_m, 0x20);                                \
+    _out1 = __lasx_xvpermi_q(_tmp3_m, _tmp2_m, 0x20);                                \
+    _out2 = __lasx_xvpermi_q(_tmp5_m, _tmp4_m, 0x20);                                \
+    _out3 = __lasx_xvpermi_q(_tmp7_m, _tmp6_m, 0x20);                                \
+                                                                                     \
+    _tmp0_m = __lasx_xvilvh_h(_in2, _in0);                                           \
+    _tmp1_m = __lasx_xvilvh_h(_in3, _in1);                                           \
+    _tmp2_m = __lasx_xvilvh_h(_in6, _in4);                                           \
+    _tmp3_m = __lasx_xvilvh_h(_in7, _in5);                                           \
+    _tmp4_m = __lasx_xvilvh_h(_in10, _in8);                                          \
+    _tmp5_m = __lasx_xvilvh_h(_in11, _in9);                                          \
+    _tmp6_m = __lasx_xvilvh_h(_in14, _in12);                                         \
+    _tmp7_m = __lasx_xvilvh_h(_in15, _in13);                                         \
+    _t0 = __lasx_xvilvl_h(_tmp1_m, _tmp0_m);                                         \
+    _t1 = __lasx_xvilvh_h(_tmp1_m, _tmp0_m);                                         \
+    _t2 = __lasx_xvilvl_h(_tmp3_m, _tmp2_m);                                         \
+    _t3 = __lasx_xvilvh_h(_tmp3_m, _tmp2_m);                                         \
+    _t4 = __lasx_xvilvl_h(_tmp5_m, _tmp4_m);                                         \
+    _t5 = __lasx_xvilvh_h(_tmp5_m, _tmp4_m);                                         \
+    _t6 = __lasx_xvilvl_h(_tmp7_m, _tmp6_m);                                         \
+    _t7 = __lasx_xvilvh_h(_tmp7_m, _tmp6_m);                                         \
+    _tmp0_m = __lasx_xvilvl_d(_t2, _t0);                                             \
+    _tmp2_m = __lasx_xvilvh_d(_t2, _t0);                                             \
+    _tmp4_m = __lasx_xvilvl_d(_t3, _t1);                                             \
+    _tmp6_m = __lasx_xvilvh_d(_t3, _t1);                                             \
+    _tmp1_m = __lasx_xvilvl_d(_t6, _t4);                                             \
+    _tmp3_m = __lasx_xvilvh_d(_t6, _t4);                                             \
+    _tmp5_m = __lasx_xvilvl_d(_t7, _t5);                                             \
+    _tmp7_m = __lasx_xvilvh_d(_t7, _t5);                                             \
+    _out4 = __lasx_xvpermi_q(_tmp1_m, _tmp0_m, 0x20);                                \
+    _out5 = __lasx_xvpermi_q(_tmp3_m, _tmp2_m, 0x20);                                \
+    _out6 = __lasx_xvpermi_q(_tmp5_m, _tmp4_m, 0x20);                                \
+    _out7 = __lasx_xvpermi_q(_tmp7_m, _tmp6_m, 0x20);                                \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose 4x4 block with halfword elements in vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3
+ *               Outputs - _out0, _out1, _out2, _out3
+ *               Return Type - signed halfword
+ * Details     : The rows of the matrix become columns, and the columns become rows.
+ * Example     : See LASX_TRANSPOSE8x8_H
+ * =============================================================================
+ */
+#define LASX_TRANSPOSE4x4_H(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3)     \
+{                                                                                   \
+    __m256i _s0_m, _s1_m;                                                           \
+                                                                                    \
+    _s0_m = __lasx_xvilvl_h(_in1, _in0);                                            \
+    _s1_m = __lasx_xvilvl_h(_in3, _in2);                                            \
+    _out0 = __lasx_xvilvl_w(_s1_m, _s0_m);                                          \
+    _out2 = __lasx_xvilvh_w(_s1_m, _s0_m);                                          \
+    _out1 = __lasx_xvilvh_d(_out0, _out0);                                          \
+    _out3 = __lasx_xvilvh_d(_out2, _out2);                                          \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose input 8x8 byte block
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7
+ *                         (input 8x8 byte block)
+ *               Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7
+ *                         (output 8x8 byte block)
+ * Example     : See LASX_TRANSPOSE8x8_H
+ * =============================================================================
+ */
+#define LASX_TRANSPOSE8x8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, _out0,  \
+                            _out1, _out2, _out3, _out4, _out5, _out6, _out7)        \
+{                                                                                   \
+    __m256i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m;                                     \
+    __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m;                                     \
+    _tmp0_m = __lasx_xvilvl_b(_in2, _in0);                                          \
+    _tmp1_m = __lasx_xvilvl_b(_in3, _in1);                                          \
+    _tmp2_m = __lasx_xvilvl_b(_in6, _in4);                                          \
+    _tmp3_m = __lasx_xvilvl_b(_in7, _in5);                                          \
+    _tmp4_m = __lasx_xvilvl_b(_tmp1_m, _tmp0_m);                                    \
+    _tmp5_m = __lasx_xvilvh_b(_tmp1_m, _tmp0_m);                                    \
+    _tmp6_m = __lasx_xvilvl_b(_tmp3_m, _tmp2_m);                                    \
+    _tmp7_m = __lasx_xvilvh_b(_tmp3_m, _tmp2_m);                                    \
+    _out0 = __lasx_xvilvl_w(_tmp6_m, _tmp4_m);                                      \
+    _out2 = __lasx_xvilvh_w(_tmp6_m, _tmp4_m);                                      \
+    _out4 = __lasx_xvilvl_w(_tmp7_m, _tmp5_m);                                      \
+    _out6 = __lasx_xvilvh_w(_tmp7_m, _tmp5_m);                                      \
+    _out1 = __lasx_xvbsrl_v(_out0, 8);                                              \
+    _out3 = __lasx_xvbsrl_v(_out2, 8);                                              \
+    _out5 = __lasx_xvbsrl_v(_out4, 8);                                              \
+    _out7 = __lasx_xvbsrl_v(_out6, 8);                                              \
+}
+
+/*
+ * =============================================================================
+ * Description : Transpose 8x8 block with halfword elements in vectors.
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7
+ *               Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7
+ * Details     : The rows of the matrix become columns, and the columns become rows.
+ * Example     : LASX_TRANSPOSE8x8_H
+ *        _in0 : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        _in1 : 8,2,3,4, 5,6,7,8, 8,2,3,4, 5,6,7,8
+ *        _in2 : 8,2,3,4, 5,6,7,8, 8,2,3,4, 5,6,7,8
+ *        _in3 : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        _in4 : 9,2,3,4, 5,6,7,8, 9,2,3,4, 5,6,7,8
+ *        _in5 : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        _in6 : 1,2,3,4, 5,6,7,8, 1,2,3,4, 5,6,7,8
+ *        _in7 : 9,2,3,4, 5,6,7,8, 9,2,3,4, 5,6,7,8
+ *
+ *       _out0 : 1,8,8,1, 9,1,1,9, 1,8,8,1, 9,1,1,9
+ *       _out1 : 2,2,2,2, 2,2,2,2, 2,2,2,2, 2,2,2,2
+ *       _out2 : 3,3,3,3, 3,3,3,3, 3,3,3,3, 3,3,3,3
+ *       _out3 : 4,4,4,4, 4,4,4,4, 4,4,4,4, 4,4,4,4
+ *       _out4 : 5,5,5,5, 5,5,5,5, 5,5,5,5, 5,5,5,5
+ *       _out5 : 6,6,6,6, 6,6,6,6, 6,6,6,6, 6,6,6,6
+ *       _out6 : 7,7,7,7, 7,7,7,7, 7,7,7,7, 7,7,7,7
+ *       _out7 : 8,8,8,8, 8,8,8,8, 8,8,8,8, 8,8,8,8
+ * =============================================================================
+ */
+#define LASX_TRANSPOSE8x8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7, _out0,  \
+                            _out1, _out2, _out3, _out4, _out5, _out6, _out7)        \
+{                                                                                   \
+    __m256i _s0_m, _s1_m;                                                           \
+    __m256i _tmp0_m, _tmp1_m, _tmp2_m, _tmp3_m;                                     \
+    __m256i _tmp4_m, _tmp5_m, _tmp6_m, _tmp7_m;                                     \
+                                                                                    \
+    _s0_m   = __lasx_xvilvl_h(_in6, _in4);                                          \
+    _s1_m   = __lasx_xvilvl_h(_in7, _in5);                                          \
+    _tmp0_m = __lasx_xvilvl_h(_s1_m, _s0_m);                                        \
+    _tmp1_m = __lasx_xvilvh_h(_s1_m, _s0_m);                                        \
+    _s0_m   = __lasx_xvilvh_h(_in6, _in4);                                          \
+    _s1_m   = __lasx_xvilvh_h(_in7, _in5);                                          \
+    _tmp2_m = __lasx_xvilvl_h(_s1_m, _s0_m);                                        \
+    _tmp3_m = __lasx_xvilvh_h(_s1_m, _s0_m);                                        \
+                                                                                    \
+    _s0_m   = __lasx_xvilvl_h(_in2, _in0);                                          \
+    _s1_m   = __lasx_xvilvl_h(_in3, _in1);                                          \
+    _tmp4_m = __lasx_xvilvl_h(_s1_m, _s0_m);                                        \
+    _tmp5_m = __lasx_xvilvh_h(_s1_m, _s0_m);                                        \
+    _s0_m   = __lasx_xvilvh_h(_in2, _in0);                                          \
+    _s1_m   = __lasx_xvilvh_h(_in3, _in1);                                          \
+    _tmp6_m = __lasx_xvilvl_h(_s1_m, _s0_m);                                        \
+    _tmp7_m = __lasx_xvilvh_h(_s1_m, _s0_m);                                        \
+                                                                                    \
+    _out0 = __lasx_xvpickev_d(_tmp0_m, _tmp4_m);                                    \
+    _out2 = __lasx_xvpickev_d(_tmp1_m, _tmp5_m);                                    \
+    _out4 = __lasx_xvpickev_d(_tmp2_m, _tmp6_m);                                    \
+    _out6 = __lasx_xvpickev_d(_tmp3_m, _tmp7_m);                                    \
+    _out1 = __lasx_xvpickod_d(_tmp0_m, _tmp4_m);                                    \
+    _out3 = __lasx_xvpickod_d(_tmp1_m, _tmp5_m);                                    \
+    _out5 = __lasx_xvpickod_d(_tmp2_m, _tmp6_m);                                    \
+    _out7 = __lasx_xvpickod_d(_tmp3_m, _tmp7_m);                                    \
+}
+
+/*
+ * =============================================================================
+ * Description : Butterfly of 4 input vectors
+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3
+ *               Outputs - _out0, _out1, _out2, _out3
+ * Details     : Element-wise butterfly: sums and differences of mirrored inputs
+ * Example     : LASX_BUTTERFLY_4_{B,H,W,D}
+ *               _out0 = _in0 + _in3;
+ *               _out1 = _in1 + _in2;
+ *               _out2 = _in1 - _in2;
+ *               _out3 = _in0 - _in3;
+ * =============================================================================
+ */
+#define LASX_BUTTERFLY_4_B(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3)  \
+{                                                                               \
+    _out0 = __lasx_xvadd_b(_in0, _in3);                                         \
+    _out1 = __lasx_xvadd_b(_in1, _in2);                                         \
+    _out2 = __lasx_xvsub_b(_in1, _in2);                                         \
+    _out3 = __lasx_xvsub_b(_in0, _in3);                                         \
+}
+#define LASX_BUTTERFLY_4_H(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3)  \
+{                                                                               \
+    _out0 = __lasx_xvadd_h(_in0, _in3);                                         \
+    _out1 = __lasx_xvadd_h(_in1, _in2);                                         \
+    _out2 = __lasx_xvsub_h(_in1, _in2);                                         \
+    _out3 = __lasx_xvsub_h(_in0, _in3);                                         \
+}
+#define LASX_BUTTERFLY_4_W(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3)  \
+{                                                                               \
+    _out0 = __lasx_xvadd_w(_in0, _in3);                                         \
+    _out1 = __lasx_xvadd_w(_in1, _in2);                                         \
+    _out2 = __lasx_xvsub_w(_in1, _in2);                                         \
+    _out3 = __lasx_xvsub_w(_in0, _in3);                                         \
+}
+#define LASX_BUTTERFLY_4_D(_in0, _in1, _in2, _in3, _out0, _out1, _out2, _out3)  \
+{                                                                               \
+    _out0 = __lasx_xvadd_d(_in0, _in3);                                         \
+    _out1 = __lasx_xvadd_d(_in1, _in2);                                         \
+    _out2 = __lasx_xvsub_d(_in1, _in2);                                         \
+    _out3 = __lasx_xvsub_d(_in0, _in3);                                         \
+}
+
+/*
+ * =============================================================================
+ * Description : Butterfly of 8 input vectors
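+ * Arguments   : Inputs  - _in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7
+ *               Outputs - _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7
+ * Details     : Element-wise butterfly: sums and differences of mirrored inputs
+ * Example     : LASX_BUTTERFLY_8_{B,H,W,D}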
+ *               _out0 = _in0 + _in7;
+ *               _out1 = _in1 + _in6;
+ *               _out2 = _in2 + _in5;
+ *               _out3 = _in3 + _in4;
+ *               _out4 = _in3 - _in4;
+ *               _out5 = _in2 - _in5;
+ *               _out6 = _in1 - _in6;
+ *               _out7 = _in0 - _in7;
+ * =============================================================================
+ */
+#define LASX_BUTTERFLY_8_B(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                           _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                 \
+    _out0 = __lasx_xvadd_b(_in0, _in7);                                           \
+    _out1 = __lasx_xvadd_b(_in1, _in6);                                           \
+    _out2 = __lasx_xvadd_b(_in2, _in5);                                           \
+    _out3 = __lasx_xvadd_b(_in3, _in4);                                           \
+    _out4 = __lasx_xvsub_b(_in3, _in4);                                           \
+    _out5 = __lasx_xvsub_b(_in2, _in5);                                           \
+    _out6 = __lasx_xvsub_b(_in1, _in6);                                           \
+    _out7 = __lasx_xvsub_b(_in0, _in7);                                           \
+}
+
+#define LASX_BUTTERFLY_8_H(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                           _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                 \
+    _out0 = __lasx_xvadd_h(_in0, _in7);                                           \
+    _out1 = __lasx_xvadd_h(_in1, _in6);                                           \
+    _out2 = __lasx_xvadd_h(_in2, _in5);                                           \
+    _out3 = __lasx_xvadd_h(_in3, _in4);                                           \
+    _out4 = __lasx_xvsub_h(_in3, _in4);                                           \
+    _out5 = __lasx_xvsub_h(_in2, _in5);                                           \
+    _out6 = __lasx_xvsub_h(_in1, _in6);                                           \
+    _out7 = __lasx_xvsub_h(_in0, _in7);                                           \
+}
+
+#define LASX_BUTTERFLY_8_W(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                           _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                 \
+    _out0 = __lasx_xvadd_w(_in0, _in7);                                           \
+    _out1 = __lasx_xvadd_w(_in1, _in6);                                           \
+    _out2 = __lasx_xvadd_w(_in2, _in5);                                           \
+    _out3 = __lasx_xvadd_w(_in3, _in4);                                           \
+    _out4 = __lasx_xvsub_w(_in3, _in4);                                           \
+    _out5 = __lasx_xvsub_w(_in2, _in5);                                           \
+    _out6 = __lasx_xvsub_w(_in1, _in6);                                           \
+    _out7 = __lasx_xvsub_w(_in0, _in7);                                           \
+}
+
+#define LASX_BUTTERFLY_8_D(_in0, _in1, _in2, _in3, _in4, _in5, _in6, _in7,        \
+                           _out0, _out1, _out2, _out3, _out4, _out5, _out6, _out7)\
+{                                                                                 \
+    _out0 = __lasx_xvadd_d(_in0, _in7);                                           \
+    _out1 = __lasx_xvadd_d(_in1, _in6);                                           \
+    _out2 = __lasx_xvadd_d(_in2, _in5);                                           \
+    _out3 = __lasx_xvadd_d(_in3, _in4);                                           \
+    _out4 = __lasx_xvsub_d(_in3, _in4);                                           \
+    _out5 = __lasx_xvsub_d(_in2, _in5);                                           \
+    _out6 = __lasx_xvsub_d(_in1, _in6);                                           \
+    _out7 = __lasx_xvsub_d(_in0, _in7);                                           \
+}
+
+#endif //LASX
+
+/*
+ * =============================================================================
+ * Description : Print out elements in vector.
+ * Arguments   : Inputs  - RTYPE, element_num, in0, enter
+ *               Outputs -
+ * Details     : Print the first 'element_num' elements of 'in0', cast to vector
+ *               type 'RTYPE'. If 'enter' is nonzero, "\nVP:" is printed first.
+ * Example     : VECT_PRINT(v4i32,4,in0,1); // in0: 1,2,3,4
+ *               VP:1,2,3,4,
+ * =============================================================================
+ */
+#define VECT_PRINT(RTYPE, element_num, in0, enter)    \
+{                                                     \
+    RTYPE _tmp0 = (RTYPE)(in0);                       \
+    int _i = 0;                                       \
+    if (enter)                                        \
+        printf("\nVP:");                              \
+    for (_i = 0; _i < (element_num); _i++)            \
+        printf("%d,", _tmp0[_i]);                     \
+}
+
+#endif /* LOONGSON_INTRINSICS_H */
+#endif /* AVUTIL_LOONGARCH_LOONGSON_INTRINSICS_H */
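Note (not part of the patch): the LASX_TRANSPOSE* macros above can be checked against a plain scalar model. The sketch below is only an illustration of the documented behaviour ("the rows of the matrix become columns, and the columns become rows"). The helper name transpose_ref is hypothetical and does not exist in FFmpeg. For LASX_TRANSPOSE16x8_H it models only the low eight halfwords of each input vector, as in the example in that macro's comment.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar reference transpose (hypothetical helper, not part of the patch).
 * 'in' holds rows x cols elements in row-major order; 'out' receives
 * cols x rows elements, also in row-major order.
 */
static void transpose_ref(const int16_t *in, int16_t *out,
                          size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}
```

Called with rows = 16 and cols = 8 on the input from the LASX_TRANSPOSE16x8_H comment, the first output row collects the first element of every input row, and output row k (k >= 1) is filled with the constant k + 1.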

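Note (not part of the patch): the LASX_BUTTERFLY_4_* and LASX_BUTTERFLY_8_* macros follow one pattern, which a scalar model makes explicit. The sketch below is an illustration only. It assumes the __lasx_xvadd_* and __lasx_xvsub_* intrinsics perform ordinary non-saturating element-wise arithmetic, and it models a single halfword lane. The helper name butterfly_ref is hypothetical and does not exist in FFmpeg.

```c
#include <stdint.h>

/*
 * Scalar butterfly over n inputs (n even), one lane only.
 * Hypothetical helper, not part of the patch.
 *   out[i]         = in[i] + in[n - 1 - i]   (first half: sums)
 *   out[n - 1 - i] = in[i] - in[n - 1 - i]   (second half: differences)
 * n = 4 matches the LASX_BUTTERFLY_4_* comment, n = 8 the
 * LASX_BUTTERFLY_8_* comment. Inputs are assumed small enough that the
 * 16-bit results do not overflow.
 */
static void butterfly_ref(const int16_t *in, int16_t *out, int n)
{
    for (int i = 0; i < n / 2; i++) {
        out[i]         = (int16_t)(in[i] + in[n - 1 - i]);
        out[n - 1 - i] = (int16_t)(in[i] - in[n - 1 - i]);
    }
}
```

For n = 4 and inputs 1, 2, 3, 4 the model yields 5, 5, -1, -3, which corresponds to _out0 = _in0 + _in3, _out1 = _in1 + _in2, _out2 = _in1 - _in2 and _out3 = _in0 - _in3.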
From patchwork Tue Dec 14 13:33:12 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= <chenhao@loongson.cn>
X-Patchwork-Id: 32487
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp6965948iog;
        Tue, 14 Dec 2021 05:34:52 -0800 (PST)
X-Google-Smtp-Source: 
 ABdhPJynGkATOQ791Rxlj/po67e2VprZTPmI0v8vbq2PoCqYmwk+TY/VyhvgZ/IWSzj7ivAH7j7z
X-Received: by 2002:a05:6402:1d50:: with SMTP id
 dz16mr7659627edb.385.1639488892013;
        Tue, 14 Dec 2021 05:34:52 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 x25si21461336edq.109.2021.12.14.05.34.51;
        Tue, 14 Dec 2021 05:34:52 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 346D568AF18;
	Tue, 14 Dec 2021 15:33:59 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from loongson.cn (mail.loongson.cn [114.242.206.163])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 51B0F68A8DB
 for <ffmpeg-devel@ffmpeg.org>; Tue, 14 Dec 2021 15:33:45 +0200 (EET)
Received: from localhost (unknown [36.33.26.144])
 by mail.loongson.cn (Coremail) with SMTP id AQAAf9DxLNw2nbhhk6cAAA--.674S3;
 Tue, 14 Dec 2021 21:33:42 +0800 (CST)
From: Hao Chen <chenhao@loongson.cn>
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:12 +0800
Message-Id: <20211214133316.8978-4-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
MIME-Version: 1.0
X-CM-TRANSID: AQAAf9DxLNw2nbhhk6cAAA--.674S3
X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/
Subject: [FFmpeg-devel] [PATCH v2 3/7] avcodec: [loongarch] Optimize
 h264qpel with LASX.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Shiyou Yin <yinshiyou-hf@loongson.cn>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: Wt0VbSORvHD9

From: Shiyou Yin <yinshiyou-hf@loongson.cn>

Benchmark command:
./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before: 183
after:  225

Change-Id: I7c7d2f34cd82ef728aab5ce8f6bfb46dd81f0da4
---
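Note (not part of the commit message): the init function in this patch fills
put_h264_qpel_pixels_tab and avg_h264_qpel_pixels_tab by hand. Judging from
the assignments, slot i of a table row holds the mcXY function with
i = X + 4 * Y, where row [0] carries the 16x16 functions and row [1] the 8x8
ones. Below is a minimal standalone sketch of that mapping. The helpers
qpel_tab_index() and qpel_func_name() are hypothetical and exist only in this
note; they are not FFmpeg API.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: table slot for quarter-pel offsets x and y (0..3).
 * Follows the pattern in the patch: mc10 -> 1, mc01 -> 4, mc33 -> 15. */
static int qpel_tab_index(int x, int y)
{
    assert(x >= 0 && x < 4 && y >= 0 && y < 4);
    return x + 4 * y;
}

/* Hypothetical helper: rebuild the symbol name assigned to a given slot.
 * op is "put" or "avg"; block_size is 16 or 8 in this patch. */
static void qpel_func_name(char *buf, size_t len, const char *op,
                           int block_size, int idx)
{
    assert(idx >= 0 && idx < 16);
    snprintf(buf, len, "ff_%s_h264_qpel%d_mc%d%d_lasx",
             op, block_size, idx & 3, idx >> 2);
}
```

For example, qpel_tab_index(1, 2) evaluates to 9, which lines up with the
..._mc12_lasx assignments at index 9 in the init function.
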
 libavcodec/h264qpel.c                         |    2 +
 libavcodec/h264qpel.h                         |    1 +
 libavcodec/loongarch/Makefile                 |    2 +
 .../loongarch/h264qpel_init_loongarch.c       |   98 +
 libavcodec/loongarch/h264qpel_lasx.c          | 2038 +++++++++++++++++
 libavcodec/loongarch/h264qpel_lasx.h          |  158 ++
 6 files changed, 2299 insertions(+)
 create mode 100644 libavcodec/loongarch/h264qpel_init_loongarch.c
 create mode 100644 libavcodec/loongarch/h264qpel_lasx.c
 create mode 100644 libavcodec/loongarch/h264qpel_lasx.h
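
Note for readers comparing against scalar code: avc_luma_hv_qrt_16x16_lasx()
appears to compute, per pixel, the rounded average of a horizontally filtered
and a vertically filtered half-pel sample. Each uses the H.264 6-tap kernel
(1, -5, 20, 20, -5, 1) with a rounding shift by 5 and a clip to 8 bits; the
byte pairs packed into filt_const0/1/2 (0xfb01, 0x1414, 0x1fb) are those same
taps. The sketch below is a plain C rendering under those assumptions. The
names clip_u8(), tap6() and hv_qrt_ref() are hypothetical and are not part of
this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* H.264 6-tap half-pel filter over p[0], p[step], ..., p[5 * step]. */
static uint8_t tap6(const uint8_t *p, ptrdiff_t step)
{
    int sum = p[0] - 5 * p[step] + 20 * p[2 * step]
            + 20 * p[3 * step] - 5 * p[4 * step] + p[5 * step];
    return clip_u8((sum + 16) >> 5);
}

/* dst = rounded average of the horizontal and vertical half-pel samples.
 * Assumption: src_x points at the leftmost horizontal tap (2 pixels left of
 * the output column) and src_y at the topmost vertical tap (2 rows above). */
static void hv_qrt_ref(uint8_t *dst, const uint8_t *src_x,
                       const uint8_t *src_y, ptrdiff_t stride, int size)
{
    assert(size > 0);
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++) {
            int hz = tap6(src_x + y * stride + x, 1);
            int vt = tap6(src_y + y * stride + x, stride);
            dst[y * stride + x] = (uint8_t)((hz + vt + 1) >> 1);
        }
    }
}
```

The vector code reaches the same result in the signed domain: it XORs the
inputs with 128 before filtering and XORs the final bytes with 128 again, so
the saturating narrowing shifts take the place of the explicit clip.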

diff --git a/libavcodec/h264qpel.c b/libavcodec/h264qpel.c
index 50e82e23b0..535ebd25b4 100644
--- a/libavcodec/h264qpel.c
+++ b/libavcodec/h264qpel.c
@@ -106,4 +106,6 @@ av_cold void ff_h264qpel_init(H264QpelContext *c, int bit_depth)
         ff_h264qpel_init_x86(c, bit_depth);
     if (ARCH_MIPS)
         ff_h264qpel_init_mips(c, bit_depth);
+    if (ARCH_LOONGARCH64)
+        ff_h264qpel_init_loongarch(c, bit_depth);
 }
diff --git a/libavcodec/h264qpel.h b/libavcodec/h264qpel.h
index 7c57ad001c..0259e8de23 100644
--- a/libavcodec/h264qpel.h
+++ b/libavcodec/h264qpel.h
@@ -36,5 +36,6 @@ void ff_h264qpel_init_arm(H264QpelContext *c, int bit_depth);
 void ff_h264qpel_init_ppc(H264QpelContext *c, int bit_depth);
 void ff_h264qpel_init_x86(H264QpelContext *c, int bit_depth);
 void ff_h264qpel_init_mips(H264QpelContext *c, int bit_depth);
+void ff_h264qpel_init_loongarch(H264QpelContext *c, int bit_depth);
 
 #endif /* AVCODEC_H264QPEL_H */
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index f8fb54c925..4e2ce8487f 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -1,2 +1,4 @@
 OBJS-$(CONFIG_H264CHROMA)             += loongarch/h264chroma_init_loongarch.o
+OBJS-$(CONFIG_H264QPEL)               += loongarch/h264qpel_init_loongarch.o
 LASX-OBJS-$(CONFIG_H264CHROMA)        += loongarch/h264chroma_lasx.o
+LASX-OBJS-$(CONFIG_H264QPEL)          += loongarch/h264qpel_lasx.o
diff --git a/libavcodec/loongarch/h264qpel_init_loongarch.c b/libavcodec/loongarch/h264qpel_init_loongarch.c
new file mode 100644
index 0000000000..969c9c376c
--- /dev/null
+++ b/libavcodec/loongarch/h264qpel_init_loongarch.c
@@ -0,0 +1,98 @@
+/*
+ * Copyright (c) 2020 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "h264qpel_lasx.h"
+#include "libavutil/attributes.h"
+#include "libavutil/loongarch/cpu.h"
+#include "libavcodec/h264qpel.h"
+
+av_cold void ff_h264qpel_init_loongarch(H264QpelContext *c, int bit_depth)
+{
+    int cpu_flags = av_get_cpu_flags();
+    if (have_lasx(cpu_flags)) {
+        if (8 == bit_depth) {
+            c->put_h264_qpel_pixels_tab[0][0]  = ff_put_h264_qpel16_mc00_lasx;
+            c->put_h264_qpel_pixels_tab[0][1]  = ff_put_h264_qpel16_mc10_lasx;
+            c->put_h264_qpel_pixels_tab[0][2]  = ff_put_h264_qpel16_mc20_lasx;
+            c->put_h264_qpel_pixels_tab[0][3]  = ff_put_h264_qpel16_mc30_lasx;
+            c->put_h264_qpel_pixels_tab[0][4]  = ff_put_h264_qpel16_mc01_lasx;
+            c->put_h264_qpel_pixels_tab[0][5]  = ff_put_h264_qpel16_mc11_lasx;
+
+            c->put_h264_qpel_pixels_tab[0][6]  = ff_put_h264_qpel16_mc21_lasx;
+            c->put_h264_qpel_pixels_tab[0][7]  = ff_put_h264_qpel16_mc31_lasx;
+            c->put_h264_qpel_pixels_tab[0][8]  = ff_put_h264_qpel16_mc02_lasx;
+            c->put_h264_qpel_pixels_tab[0][9]  = ff_put_h264_qpel16_mc12_lasx;
+            c->put_h264_qpel_pixels_tab[0][10] = ff_put_h264_qpel16_mc22_lasx;
+            c->put_h264_qpel_pixels_tab[0][11] = ff_put_h264_qpel16_mc32_lasx;
+            c->put_h264_qpel_pixels_tab[0][12] = ff_put_h264_qpel16_mc03_lasx;
+            c->put_h264_qpel_pixels_tab[0][13] = ff_put_h264_qpel16_mc13_lasx;
+            c->put_h264_qpel_pixels_tab[0][14] = ff_put_h264_qpel16_mc23_lasx;
+            c->put_h264_qpel_pixels_tab[0][15] = ff_put_h264_qpel16_mc33_lasx;
+            c->avg_h264_qpel_pixels_tab[0][0]  = ff_avg_h264_qpel16_mc00_lasx;
+            c->avg_h264_qpel_pixels_tab[0][1]  = ff_avg_h264_qpel16_mc10_lasx;
+            c->avg_h264_qpel_pixels_tab[0][2]  = ff_avg_h264_qpel16_mc20_lasx;
+            c->avg_h264_qpel_pixels_tab[0][3]  = ff_avg_h264_qpel16_mc30_lasx;
+            c->avg_h264_qpel_pixels_tab[0][4]  = ff_avg_h264_qpel16_mc01_lasx;
+            c->avg_h264_qpel_pixels_tab[0][5]  = ff_avg_h264_qpel16_mc11_lasx;
+            c->avg_h264_qpel_pixels_tab[0][6]  = ff_avg_h264_qpel16_mc21_lasx;
+            c->avg_h264_qpel_pixels_tab[0][7]  = ff_avg_h264_qpel16_mc31_lasx;
+            c->avg_h264_qpel_pixels_tab[0][8]  = ff_avg_h264_qpel16_mc02_lasx;
+            c->avg_h264_qpel_pixels_tab[0][9]  = ff_avg_h264_qpel16_mc12_lasx;
+            c->avg_h264_qpel_pixels_tab[0][10] = ff_avg_h264_qpel16_mc22_lasx;
+            c->avg_h264_qpel_pixels_tab[0][11] = ff_avg_h264_qpel16_mc32_lasx;
+            c->avg_h264_qpel_pixels_tab[0][12] = ff_avg_h264_qpel16_mc03_lasx;
+            c->avg_h264_qpel_pixels_tab[0][13] = ff_avg_h264_qpel16_mc13_lasx;
+            c->avg_h264_qpel_pixels_tab[0][14] = ff_avg_h264_qpel16_mc23_lasx;
+            c->avg_h264_qpel_pixels_tab[0][15] = ff_avg_h264_qpel16_mc33_lasx;
+
+            c->put_h264_qpel_pixels_tab[1][0]  = ff_put_h264_qpel8_mc00_lasx;
+            c->put_h264_qpel_pixels_tab[1][1]  = ff_put_h264_qpel8_mc10_lasx;
+            c->put_h264_qpel_pixels_tab[1][2]  = ff_put_h264_qpel8_mc20_lasx;
+            c->put_h264_qpel_pixels_tab[1][3]  = ff_put_h264_qpel8_mc30_lasx;
+            c->put_h264_qpel_pixels_tab[1][4]  = ff_put_h264_qpel8_mc01_lasx;
+            c->put_h264_qpel_pixels_tab[1][5]  = ff_put_h264_qpel8_mc11_lasx;
+            c->put_h264_qpel_pixels_tab[1][6]  = ff_put_h264_qpel8_mc21_lasx;
+            c->put_h264_qpel_pixels_tab[1][7]  = ff_put_h264_qpel8_mc31_lasx;
+            c->put_h264_qpel_pixels_tab[1][8]  = ff_put_h264_qpel8_mc02_lasx;
+            c->put_h264_qpel_pixels_tab[1][9]  = ff_put_h264_qpel8_mc12_lasx;
+            c->put_h264_qpel_pixels_tab[1][10] = ff_put_h264_qpel8_mc22_lasx;
+            c->put_h264_qpel_pixels_tab[1][11] = ff_put_h264_qpel8_mc32_lasx;
+            c->put_h264_qpel_pixels_tab[1][12] = ff_put_h264_qpel8_mc03_lasx;
+            c->put_h264_qpel_pixels_tab[1][13] = ff_put_h264_qpel8_mc13_lasx;
+            c->put_h264_qpel_pixels_tab[1][14] = ff_put_h264_qpel8_mc23_lasx;
+            c->put_h264_qpel_pixels_tab[1][15] = ff_put_h264_qpel8_mc33_lasx;
+            c->avg_h264_qpel_pixels_tab[1][0]  = ff_avg_h264_qpel8_mc00_lasx;
+            c->avg_h264_qpel_pixels_tab[1][1]  = ff_avg_h264_qpel8_mc10_lasx;
+            c->avg_h264_qpel_pixels_tab[1][2]  = ff_avg_h264_qpel8_mc20_lasx;
+            c->avg_h264_qpel_pixels_tab[1][3]  = ff_avg_h264_qpel8_mc30_lasx;
+            c->avg_h264_qpel_pixels_tab[1][5]  = ff_avg_h264_qpel8_mc11_lasx;
+            c->avg_h264_qpel_pixels_tab[1][6]  = ff_avg_h264_qpel8_mc21_lasx;
+            c->avg_h264_qpel_pixels_tab[1][7]  = ff_avg_h264_qpel8_mc31_lasx;
+            c->avg_h264_qpel_pixels_tab[1][8]  = ff_avg_h264_qpel8_mc02_lasx;
+            c->avg_h264_qpel_pixels_tab[1][9]  = ff_avg_h264_qpel8_mc12_lasx;
+            c->avg_h264_qpel_pixels_tab[1][10] = ff_avg_h264_qpel8_mc22_lasx;
+            c->avg_h264_qpel_pixels_tab[1][11] = ff_avg_h264_qpel8_mc32_lasx;
+            c->avg_h264_qpel_pixels_tab[1][13] = ff_avg_h264_qpel8_mc13_lasx;
+            c->avg_h264_qpel_pixels_tab[1][14] = ff_avg_h264_qpel8_mc23_lasx;
+            c->avg_h264_qpel_pixels_tab[1][15] = ff_avg_h264_qpel8_mc33_lasx;
+        }
+    }
+}
diff --git a/libavcodec/loongarch/h264qpel_lasx.c b/libavcodec/loongarch/h264qpel_lasx.c
new file mode 100644
index 0000000000..1c142e510e
--- /dev/null
+++ b/libavcodec/loongarch/h264qpel_lasx.c
@@ -0,0 +1,2038 @@
+/*
+ * Loongson LASX optimized h264qpel
+ *
+ * Copyright (c) 2020 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "h264qpel_lasx.h"
+#include "libavutil/loongarch/loongson_intrinsics.h"
+#include "libavutil/attributes.h"
+
+static const uint8_t luma_mask_arr[16 * 6] __attribute__((aligned(0x40))) = {
+    /* 8 width cases */
+    0, 5, 1, 6, 2, 7, 3, 8, 4, 9, 5, 10, 6, 11, 7, 12,
+    0, 5, 1, 6, 2, 7, 3, 8, 4, 9, 5, 10, 6, 11, 7, 12,
+    1, 4, 2, 5, 3, 6, 4, 7, 5, 8, 6, 9, 7, 10, 8, 11,
+    1, 4, 2, 5, 3, 6, 4, 7, 5, 8, 6, 9, 7, 10, 8, 11,
+    2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,
+    2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10
+};
+
+#define AVC_HORZ_FILTER_SH(in0, in1, mask0, mask1, mask2)  \
+( {                                                        \
+    __m256i out0_m;                                        \
+    __m256i tmp0_m;                                        \
+                                                           \
+    tmp0_m = __lasx_xvshuf_b(in1, in0, mask0);             \
+    out0_m = __lasx_xvhaddw_h_b(tmp0_m, tmp0_m);           \
+    tmp0_m = __lasx_xvshuf_b(in1, in0, mask1);             \
+    out0_m = __lasx_xvdp2add_h_b(out0_m, minus5b, tmp0_m); \
+    tmp0_m = __lasx_xvshuf_b(in1, in0, mask2);             \
+    out0_m = __lasx_xvdp2add_h_b(out0_m, plus20b, tmp0_m); \
+                                                           \
+    out0_m;                                                \
+} )
+
+#define AVC_DOT_SH3_SH(in0, in1, in2, coeff0, coeff1, coeff2)  \
+( {                                                            \
+    __m256i out0_m;                                            \
+                                                               \
+    out0_m = __lasx_xvdp2_h_b(in0, coeff0);                    \
+    DUP2_ARG3(__lasx_xvdp2add_h_b, out0_m, in1, coeff1, out0_m,\
+              in2, coeff2, out0_m, out0_m);                    \
+                                                               \
+    out0_m;                                                    \
+} )
+
+static av_always_inline
+void avc_luma_hv_qrt_and_aver_dst_16x16_lasx(uint8_t *src_x,
+                                             uint8_t *src_y,
+                                             uint8_t *dst, ptrdiff_t stride)
+{
+    const int16_t filt_const0 = 0xfb01;
+    const int16_t filt_const1 = 0x1414;
+    const int16_t filt_const2 = 0x1fb;
+    uint32_t loop_cnt;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i tmp0, tmp1;
+    __m256i src_hz0, src_hz1, src_hz2, src_hz3, mask0, mask1, mask2;
+    __m256i src_vt0, src_vt1, src_vt2, src_vt3, src_vt4, src_vt5, src_vt6;
+    __m256i src_vt7, src_vt8;
+    __m256i src_vt10_h, src_vt21_h, src_vt32_h, src_vt43_h, src_vt54_h;
+    __m256i src_vt65_h, src_vt76_h, src_vt87_h, filt0, filt1, filt2;
+    __m256i hz_out0, hz_out1, hz_out2, hz_out3, vt_out0, vt_out1, vt_out2;
+    __m256i vt_out3, out0, out1, out2, out3;
+    __m256i minus5b = __lasx_xvldi(0xFB);
+    __m256i plus20b = __lasx_xvldi(20);
+
+    filt0 = __lasx_xvreplgr2vr_h(filt_const0);
+    filt1 = __lasx_xvreplgr2vr_h(filt_const1);
+    filt2 = __lasx_xvreplgr2vr_h(filt_const2);
+
+    mask0 = __lasx_xvld(luma_mask_arr, 0);
+    DUP2_ARG2(__lasx_xvld, luma_mask_arr, 32, luma_mask_arr, 64, mask1, mask2);
+    src_vt0 = __lasx_xvld(src_y, 0);
+    DUP4_ARG2(__lasx_xvldx, src_y, stride, src_y, stride_2x, src_y, stride_3x,
+              src_y, stride_4x, src_vt1, src_vt2, src_vt3, src_vt4);
+    src_y += stride_4x;
+
+    src_vt0 = __lasx_xvxori_b(src_vt0, 128);
+    DUP4_ARG2(__lasx_xvxori_b, src_vt1, 128, src_vt2, 128, src_vt3, 128,
+              src_vt4, 128, src_vt1, src_vt2, src_vt3, src_vt4);
+
+    for (loop_cnt = 4; loop_cnt--;) {
+        src_hz0 = __lasx_xvld(src_x, 0);
+        DUP2_ARG2(__lasx_xvldx, src_x, stride, src_x, stride_2x,
+                  src_hz1, src_hz2);
+        src_hz3 = __lasx_xvldx(src_x, stride_3x);
+        src_x  += stride_4x;
+        src_hz0 = __lasx_xvpermi_d(src_hz0, 0x94);
+        src_hz1 = __lasx_xvpermi_d(src_hz1, 0x94);
+        src_hz2 = __lasx_xvpermi_d(src_hz2, 0x94);
+        src_hz3 = __lasx_xvpermi_d(src_hz3, 0x94);
+        DUP4_ARG2(__lasx_xvxori_b, src_hz0, 128, src_hz1, 128, src_hz2, 128,
+                  src_hz3, 128, src_hz0, src_hz1, src_hz2, src_hz3);
+
+        hz_out0 = AVC_HORZ_FILTER_SH(src_hz0, src_hz0, mask0, mask1, mask2);
+        hz_out1 = AVC_HORZ_FILTER_SH(src_hz1, src_hz1, mask0, mask1, mask2);
+        hz_out2 = AVC_HORZ_FILTER_SH(src_hz2, src_hz2, mask0, mask1, mask2);
+        hz_out3 = AVC_HORZ_FILTER_SH(src_hz3, src_hz3, mask0, mask1, mask2);
+        hz_out0 = __lasx_xvssrarni_b_h(hz_out1, hz_out0, 5);
+        hz_out2 = __lasx_xvssrarni_b_h(hz_out3, hz_out2, 5);
+
+        DUP4_ARG2(__lasx_xvldx, src_y, stride, src_y, stride_2x,
+                  src_y, stride_3x, src_y, stride_4x,
+                  src_vt5, src_vt6, src_vt7, src_vt8);
+        src_y += stride_4x;
+
+        DUP4_ARG2(__lasx_xvxori_b, src_vt5, 128, src_vt6, 128, src_vt7, 128,
+                  src_vt8, 128, src_vt5, src_vt6, src_vt7, src_vt8);
+
+        DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_vt4, 0x02, src_vt1, src_vt5,
+                  0x02, src_vt2, src_vt6, 0x02, src_vt3, src_vt7, 0x02,
+                  src_vt0, src_vt1, src_vt2, src_vt3);
+        src_vt87_h = __lasx_xvpermi_q(src_vt4, src_vt8, 0x02);
+        DUP4_ARG2(__lasx_xvilvh_b, src_vt1, src_vt0, src_vt2, src_vt1,
+                  src_vt3, src_vt2, src_vt87_h, src_vt3,
+                  src_hz0, src_hz1, src_hz2, src_hz3);
+        DUP4_ARG2(__lasx_xvilvl_b, src_vt1, src_vt0, src_vt2, src_vt1,
+                  src_vt3, src_vt2, src_vt87_h, src_vt3,
+                  src_vt0, src_vt1, src_vt2, src_vt3);
+        DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_hz0, 0x02, src_vt1, src_hz1,
+                  0x02, src_vt2, src_hz2, 0x02, src_vt3, src_hz3, 0x02,
+                  src_vt10_h, src_vt21_h, src_vt32_h, src_vt43_h);
+        DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_hz0, 0x13, src_vt1, src_hz1,
+                  0x13, src_vt2, src_hz2, 0x13, src_vt3, src_hz3, 0x13,
+                  src_vt54_h, src_vt65_h, src_vt76_h, src_vt87_h);
+        vt_out0 = AVC_DOT_SH3_SH(src_vt10_h, src_vt32_h, src_vt54_h, filt0,
+                                 filt1, filt2);
+        vt_out1 = AVC_DOT_SH3_SH(src_vt21_h, src_vt43_h, src_vt65_h, filt0,
+                                 filt1, filt2);
+        vt_out2 = AVC_DOT_SH3_SH(src_vt32_h, src_vt54_h, src_vt76_h, filt0,
+                                 filt1, filt2);
+        vt_out3 = AVC_DOT_SH3_SH(src_vt43_h, src_vt65_h, src_vt87_h, filt0,
+                                 filt1, filt2);
+        vt_out0 = __lasx_xvssrarni_b_h(vt_out1, vt_out0, 5);
+        vt_out2 = __lasx_xvssrarni_b_h(vt_out3, vt_out2, 5);
+
+        DUP2_ARG2(__lasx_xvaddwl_h_b, hz_out0, vt_out0, hz_out2, vt_out2,
+                  out0, out2);
+        DUP2_ARG2(__lasx_xvaddwh_h_b, hz_out0, vt_out0, hz_out2, vt_out2,
+                  out1, out3);
+        tmp0 = __lasx_xvssrarni_b_h(out1, out0, 1);
+        tmp1 = __lasx_xvssrarni_b_h(out3, out2, 1);
+
+        DUP2_ARG2(__lasx_xvxori_b, tmp0, 128, tmp1, 128, tmp0, tmp1);
+        out0 = __lasx_xvld(dst, 0);
+        DUP2_ARG2(__lasx_xvldx, dst, stride, dst, stride_2x, out1, out2);
+        out3 = __lasx_xvldx(dst, stride_3x);
+        out0 = __lasx_xvpermi_q(out0, out2, 0x02);
+        out1 = __lasx_xvpermi_q(out1, out3, 0x02);
+        out2 = __lasx_xvilvl_d(out1, out0);
+        out3 = __lasx_xvilvh_d(out1, out0);
+        out0 = __lasx_xvpermi_q(out2, out3, 0x02);
+        out1 = __lasx_xvpermi_q(out2, out3, 0x13);
+        tmp0 = __lasx_xvavgr_bu(out0, tmp0);
+        tmp1 = __lasx_xvavgr_bu(out1, tmp1);
+
+        __lasx_xvstelm_d(tmp0, dst, 0, 0);
+        __lasx_xvstelm_d(tmp0, dst + stride, 0, 1);
+        __lasx_xvstelm_d(tmp1, dst + stride_2x, 0, 0);
+        __lasx_xvstelm_d(tmp1, dst + stride_3x, 0, 1);
+
+        __lasx_xvstelm_d(tmp0, dst, 8, 2);
+        __lasx_xvstelm_d(tmp0, dst + stride, 8, 3);
+        __lasx_xvstelm_d(tmp1, dst + stride_2x, 8, 2);
+        __lasx_xvstelm_d(tmp1, dst + stride_3x, 8, 3);
+
+        dst    += stride_4x;
+        src_vt0 = src_vt4;
+        src_vt1 = src_vt5;
+        src_vt2 = src_vt6;
+        src_vt3 = src_vt7;
+        src_vt4 = src_vt8;
+    }
+}
+
+static av_always_inline void
+avc_luma_hv_qrt_16x16_lasx(uint8_t *src_x, uint8_t *src_y,
+                           uint8_t *dst, ptrdiff_t stride)
+{
+    const int16_t filt_const0 = 0xfb01;
+    const int16_t filt_const1 = 0x1414;
+    const int16_t filt_const2 = 0x1fb;
+    uint32_t loop_cnt;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    ptrdiff_t stride_4x = stride << 2;
+    __m256i tmp0, tmp1;
+    __m256i src_hz0, src_hz1, src_hz2, src_hz3, mask0, mask1, mask2;
+    __m256i src_vt0, src_vt1, src_vt2, src_vt3, src_vt4, src_vt5, src_vt6;
+    __m256i src_vt7, src_vt8;
+    __m256i src_vt10_h, src_vt21_h, src_vt32_h, src_vt43_h, src_vt54_h;
+    __m256i src_vt65_h, src_vt76_h, src_vt87_h, filt0, filt1, filt2;
+    __m256i hz_out0, hz_out1, hz_out2, hz_out3, vt_out0, vt_out1, vt_out2;
+    __m256i vt_out3, out0, out1, out2, out3;
+    __m256i minus5b = __lasx_xvldi(0xFB);
+    __m256i plus20b = __lasx_xvldi(20);
+
+    filt0 = __lasx_xvreplgr2vr_h(filt_const0);
+    filt1 = __lasx_xvreplgr2vr_h(filt_const1);
+    filt2 = __lasx_xvreplgr2vr_h(filt_const2);
+
+    mask0 = __lasx_xvld(luma_mask_arr, 0);
+    DUP2_ARG2(__lasx_xvld, luma_mask_arr, 32, luma_mask_arr, 64, mask1, mask2);
+    src_vt0 = __lasx_xvld(src_y, 0);
+    DUP4_ARG2(__lasx_xvldx, src_y, stride, src_y, stride_2x, src_y, stride_3x,
+              src_y, stride_4x, src_vt1, src_vt2, src_vt3, src_vt4);
+    src_y += stride_4x;
+
+    src_vt0 = __lasx_xvxori_b(src_vt0, 128);
+    DUP4_ARG2(__lasx_xvxori_b, src_vt1, 128, src_vt2, 128, src_vt3, 128,
+              src_vt4, 128, src_vt1, src_vt2, src_vt3, src_vt4);
+
+    for (loop_cnt = 4; loop_cnt--;) {
+        src_hz0 = __lasx_xvld(src_x, 0);
+        DUP2_ARG2(__lasx_xvldx, src_x, stride, src_x, stride_2x,
+                  src_hz1, src_hz2);
+        src_hz3 = __lasx_xvldx(src_x, stride_3x);
+        src_x  += stride_4x;
+        src_hz0 = __lasx_xvpermi_d(src_hz0, 0x94);
+        src_hz1 = __lasx_xvpermi_d(src_hz1, 0x94);
+        src_hz2 = __lasx_xvpermi_d(src_hz2, 0x94);
+        src_hz3 = __lasx_xvpermi_d(src_hz3, 0x94);
+        DUP4_ARG2(__lasx_xvxori_b, src_hz0, 128, src_hz1, 128, src_hz2, 128,
+                  src_hz3, 128, src_hz0, src_hz1, src_hz2, src_hz3);
+
+        hz_out0 = AVC_HORZ_FILTER_SH(src_hz0, src_hz0, mask0, mask1, mask2);
+        hz_out1 = AVC_HORZ_FILTER_SH(src_hz1, src_hz1, mask0, mask1, mask2);
+        hz_out2 = AVC_HORZ_FILTER_SH(src_hz2, src_hz2, mask0, mask1, mask2);
+        hz_out3 = AVC_HORZ_FILTER_SH(src_hz3, src_hz3, mask0, mask1, mask2);
+        hz_out0 = __lasx_xvssrarni_b_h(hz_out1, hz_out0, 5);
+        hz_out2 = __lasx_xvssrarni_b_h(hz_out3, hz_out2, 5);
+
+        DUP4_ARG2(__lasx_xvldx, src_y, stride, src_y, stride_2x,
+                  src_y, stride_3x, src_y, stride_4x,
+                  src_vt5, src_vt6, src_vt7, src_vt8);
+        src_y += stride_4x;
+
+        DUP4_ARG2(__lasx_xvxori_b, src_vt5, 128, src_vt6, 128, src_vt7, 128,
+                  src_vt8, 128, src_vt5, src_vt6, src_vt7, src_vt8);
+        DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_vt4, 0x02, src_vt1, src_vt5,
+                  0x02, src_vt2, src_vt6, 0x02, src_vt3, src_vt7, 0x02,
+                  src_vt0, src_vt1, src_vt2, src_vt3);
+        src_vt87_h = __lasx_xvpermi_q(src_vt4, src_vt8, 0x02);
+        DUP4_ARG2(__lasx_xvilvh_b, src_vt1, src_vt0, src_vt2, src_vt1,
+                  src_vt3, src_vt2, src_vt87_h, src_vt3,
+                  src_hz0, src_hz1, src_hz2, src_hz3);
+        DUP4_ARG2(__lasx_xvilvl_b, src_vt1, src_vt0, src_vt2, src_vt1,
+                  src_vt3, src_vt2, src_vt87_h, src_vt3,
+                  src_vt0, src_vt1, src_vt2, src_vt3);
+        DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_hz0, 0x02, src_vt1,
+                  src_hz1, 0x02, src_vt2, src_hz2, 0x02, src_vt3, src_hz3,
+                  0x02, src_vt10_h, src_vt21_h, src_vt32_h, src_vt43_h);
+        DUP4_ARG3(__lasx_xvpermi_q, src_vt0, src_hz0, 0x13, src_vt1,
+                  src_hz1, 0x13, src_vt2, src_hz2, 0x13, src_vt3, src_hz3,
+                  0x13, src_vt54_h, src_vt65_h, src_vt76_h, src_vt87_h);
+
+        vt_out0 = AVC_DOT_SH3_SH(src_vt10_h, src_vt32_h, src_vt54_h,
+                                 filt0, filt1, filt2);
+        vt_out1 = AVC_DOT_SH3_SH(src_vt21_h, src_vt43_h, src_vt65_h,
+                                 filt0, filt1, filt2);
+        vt_out2 = AVC_DOT_SH3_SH(src_vt32_h, src_vt54_h, src_vt76_h,
+                                 filt0, filt1, filt2);
+        vt_out3 = AVC_DOT_SH3_SH(src_vt43_h, src_vt65_h, src_vt87_h,
+                                 filt0, filt1, filt2);
+        vt_out0 = __lasx_xvssrarni_b_h(vt_out1, vt_out0, 5);
+        vt_out2 = __lasx_xvssrarni_b_h(vt_out3, vt_out2, 5);
+
+        DUP2_ARG2(__lasx_xvaddwl_h_b, hz_out0, vt_out0, hz_out2, vt_out2,
+                  out0, out2);
+        DUP2_ARG2(__lasx_xvaddwh_h_b, hz_out0, vt_out0, hz_out2, vt_out2,
+                  out1, out3);
+        tmp0 = __lasx_xvssrarni_b_h(out1, out0, 1);
+        tmp1 = __lasx_xvssrarni_b_h(out3, out2, 1);
+
+        DUP2_ARG2(__lasx_xvxori_b, tmp0, 128, tmp1, 128, tmp0, tmp1);
+        __lasx_xvstelm_d(tmp0, dst, 0, 0);
+        __lasx_xvstelm_d(tmp0, dst + stride, 0, 1);
+        __lasx_xvstelm_d(tmp1, dst + stride_2x, 0, 0);
+        __lasx_xvstelm_d(tmp1, dst + stride_3x, 0, 1);
+
+        __lasx_xvstelm_d(tmp0, dst, 8, 2);
+        __lasx_xvstelm_d(tmp0, dst + stride, 8, 3);
+        __lasx_xvstelm_d(tmp1, dst + stride_2x, 8, 2);
+        __lasx_xvstelm_d(tmp1, dst + stride_3x, 8, 3);
+
+        dst    += stride_4x;
+        src_vt0 = src_vt4;
+        src_vt1 = src_vt5;
+        src_vt2 = src_vt6;
+        src_vt3 = src_vt7;
+        src_vt4 = src_vt8;
+    }
+}
+
+/* put_pixels8_8_inline_asm: dst = src */
+static av_always_inline void
+put_pixels8_8_inline_asm(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+    uint64_t tmp[8];
+    ptrdiff_t stride_2, stride_3, stride_4;
+    __asm__ volatile (
+    "slli.d     %[stride_2],     %[stride],   1           \n\t"
+    "add.d      %[stride_3],     %[stride_2], %[stride]   \n\t"
+    "slli.d     %[stride_4],     %[stride_2], 1           \n\t"
+    "ld.d       %[tmp0],         %[src],      0x0         \n\t"
+    "ldx.d      %[tmp1],         %[src],      %[stride]   \n\t"
+    "ldx.d      %[tmp2],         %[src],      %[stride_2] \n\t"
+    "ldx.d      %[tmp3],         %[src],      %[stride_3] \n\t"
+    "add.d      %[src],          %[src],      %[stride_4] \n\t"
+    "ld.d       %[tmp4],         %[src],      0x0         \n\t"
+    "ldx.d      %[tmp5],         %[src],      %[stride]   \n\t"
+    "ldx.d      %[tmp6],         %[src],      %[stride_2] \n\t"
+    "ldx.d      %[tmp7],         %[src],      %[stride_3] \n\t"
+
+    "st.d       %[tmp0],         %[dst],      0x0         \n\t"
+    "stx.d      %[tmp1],         %[dst],      %[stride]   \n\t"
+    "stx.d      %[tmp2],         %[dst],      %[stride_2] \n\t"
+    "stx.d      %[tmp3],         %[dst],      %[stride_3] \n\t"
+    "add.d      %[dst],          %[dst],      %[stride_4] \n\t"
+    "st.d       %[tmp4],         %[dst],      0x0         \n\t"
+    "stx.d      %[tmp5],         %[dst],      %[stride]   \n\t"
+    "stx.d      %[tmp6],         %[dst],      %[stride_2] \n\t"
+    "stx.d      %[tmp7],         %[dst],      %[stride_3] \n\t"
+    : [tmp0]"=&r"(tmp[0]),        [tmp1]"=&r"(tmp[1]),
+      [tmp2]"=&r"(tmp[2]),        [tmp3]"=&r"(tmp[3]),
+      [tmp4]"=&r"(tmp[4]),        [tmp5]"=&r"(tmp[5]),
+      [tmp6]"=&r"(tmp[6]),        [tmp7]"=&r"(tmp[7]),
+      [stride_2]"=&r"(stride_2),  [stride_3]"=&r"(stride_3),
+      [stride_4]"=&r"(stride_4),
+      [dst]"+&r"(dst),            [src]"+&r"(src)
+    : [stride]"r"(stride)
+    : "memory"
+    );
+}
+
+/* avg_pixels8_8_lsx   : dst = avg(src, dst)
+ * put_pixels8_l2_8_lsx: dst = avg(src, half) , half stride is 8.
+ * avg_pixels8_l2_8_lsx: dst = avg(avg(src, half), dst) , half stride is 8.*/
+static av_always_inline void
+avg_pixels8_8_lsx(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+    uint8_t *tmp = dst;
+    ptrdiff_t stride_2, stride_3, stride_4;
+    __asm__ volatile (
+    /* h0~h7 */
+    "slli.d     %[stride_2],     %[stride],   1           \n\t"
+    "add.d      %[stride_3],     %[stride_2], %[stride]   \n\t"
+    "slli.d     %[stride_4],     %[stride_2], 1           \n\t"
+    "vld        $vr0,            %[src],      0           \n\t"
+    "vldx       $vr1,            %[src],      %[stride]   \n\t"
+    "vldx       $vr2,            %[src],      %[stride_2] \n\t"
+    "vldx       $vr3,            %[src],      %[stride_3] \n\t"
+    "add.d      %[src],          %[src],      %[stride_4] \n\t"
+    "vld        $vr4,            %[src],      0           \n\t"
+    "vldx       $vr5,            %[src],      %[stride]   \n\t"
+    "vldx       $vr6,            %[src],      %[stride_2] \n\t"
+    "vldx       $vr7,            %[src],      %[stride_3] \n\t"
+
+    "vld        $vr8,            %[tmp],      0           \n\t"
+    "vldx       $vr9,            %[tmp],      %[stride]   \n\t"
+    "vldx       $vr10,           %[tmp],      %[stride_2] \n\t"
+    "vldx       $vr11,           %[tmp],      %[stride_3] \n\t"
+    "add.d      %[tmp],          %[tmp],      %[stride_4] \n\t"
+    "vld        $vr12,           %[tmp],      0           \n\t"
+    "vldx       $vr13,           %[tmp],      %[stride]   \n\t"
+    "vldx       $vr14,           %[tmp],      %[stride_2] \n\t"
+    "vldx       $vr15,           %[tmp],      %[stride_3] \n\t"
+
+    "vavgr.bu    $vr0,           $vr8,        $vr0        \n\t"
+    "vavgr.bu    $vr1,           $vr9,        $vr1        \n\t"
+    "vavgr.bu    $vr2,           $vr10,       $vr2        \n\t"
+    "vavgr.bu    $vr3,           $vr11,       $vr3        \n\t"
+    "vavgr.bu    $vr4,           $vr12,       $vr4        \n\t"
+    "vavgr.bu    $vr5,           $vr13,       $vr5        \n\t"
+    "vavgr.bu    $vr6,           $vr14,       $vr6        \n\t"
+    "vavgr.bu    $vr7,           $vr15,       $vr7        \n\t"
+
+    "vstelm.d    $vr0,           %[dst],      0,  0       \n\t"
+    "add.d       %[dst],         %[dst],      %[stride]   \n\t"
+    "vstelm.d    $vr1,           %[dst],      0,  0       \n\t"
+    "add.d       %[dst],         %[dst],      %[stride]   \n\t"
+    "vstelm.d    $vr2,           %[dst],      0,  0       \n\t"
+    "add.d       %[dst],         %[dst],      %[stride]   \n\t"
+    "vstelm.d    $vr3,           %[dst],      0,  0       \n\t"
+    "add.d       %[dst],         %[dst],      %[stride]   \n\t"
+    "vstelm.d    $vr4,           %[dst],      0,  0       \n\t"
+    "add.d       %[dst],         %[dst],      %[stride]   \n\t"
+    "vstelm.d    $vr5,           %[dst],      0,  0       \n\t"
+    "add.d       %[dst],         %[dst],      %[stride]   \n\t"
+    "vstelm.d    $vr6,           %[dst],      0,  0       \n\t"
+    "add.d       %[dst],         %[dst],      %[stride]   \n\t"
+    "vstelm.d    $vr7,           %[dst],      0,  0       \n\t"
+    : [dst]"+&r"(dst), [tmp]"+&r"(tmp), [src]"+&r"(src),
+      [stride_2]"=&r"(stride_2),  [stride_3]"=&r"(stride_3),
+      [stride_4]"=&r"(stride_4)
+    : [stride]"r"(stride)
+    : "memory"
+    );
+}
+
+/* put_pixels8_l2_8_lsx: dst = avg(src, half) over an 8x8 block.
+ * half is a packed 8x8 buffer (row stride 8); src rows advance by
+ * srcStride and dst rows by dstStride. */
+static av_always_inline void
+put_pixels8_l2_8_lsx(uint8_t *dst, const uint8_t *src, const uint8_t *half,
+                     ptrdiff_t dstStride, ptrdiff_t srcStride)
+{
+    ptrdiff_t stride_2, stride_3, stride_4;
+    __asm__ volatile (
+    /* h0~h7 */
+    "slli.d     %[stride_2],     %[srcStride],   1            \n\t"
+    "add.d      %[stride_3],     %[stride_2],    %[srcStride] \n\t"
+    "slli.d     %[stride_4],     %[stride_2],    1            \n\t"
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+
+    "vld        $vr8,            %[half],        0x00         \n\t"
+    "vld        $vr9,            %[half],        0x08         \n\t"
+    "vld        $vr10,           %[half],        0x10         \n\t"
+    "vld        $vr11,           %[half],        0x18         \n\t"
+    "vld        $vr12,           %[half],        0x20         \n\t"
+    "vld        $vr13,           %[half],        0x28         \n\t"
+    "vld        $vr14,           %[half],        0x30         \n\t"
+    "vld        $vr15,           %[half],        0x38         \n\t"
+
+    "vavgr.bu   $vr0,            $vr8,           $vr0         \n\t"
+    "vavgr.bu   $vr1,            $vr9,           $vr1         \n\t"
+    "vavgr.bu   $vr2,            $vr10,          $vr2         \n\t"
+    "vavgr.bu   $vr3,            $vr11,          $vr3         \n\t"
+    "vavgr.bu   $vr4,            $vr12,          $vr4         \n\t"
+    "vavgr.bu   $vr5,            $vr13,          $vr5         \n\t"
+    "vavgr.bu   $vr6,            $vr14,          $vr6         \n\t"
+    "vavgr.bu   $vr7,            $vr15,          $vr7         \n\t"
+
+    "vstelm.d   $vr0,            %[dst],         0,  0        \n\t"
+    "add.d      %[dst],          %[dst],         %[dstStride] \n\t"
+    "vstelm.d   $vr1,            %[dst],         0,  0        \n\t"
+    "add.d      %[dst],          %[dst],         %[dstStride] \n\t"
+    "vstelm.d   $vr2,            %[dst],         0,  0        \n\t"
+    "add.d      %[dst],          %[dst],         %[dstStride] \n\t"
+    "vstelm.d   $vr3,            %[dst],         0,  0        \n\t"
+    "add.d      %[dst],          %[dst],         %[dstStride] \n\t"
+    "vstelm.d   $vr4,            %[dst],         0,  0        \n\t"
+    "add.d      %[dst],          %[dst],         %[dstStride] \n\t"
+    "vstelm.d   $vr5,            %[dst],         0,  0        \n\t"
+    "add.d      %[dst],          %[dst],         %[dstStride] \n\t"
+    "vstelm.d   $vr6,            %[dst],         0,  0        \n\t"
+    "add.d      %[dst],          %[dst],         %[dstStride] \n\t"
+    "vstelm.d   $vr7,            %[dst],         0,  0        \n\t"
+    : [dst]"+&r"(dst), [half]"+&r"(half), [src]"+&r"(src),
+      [stride_2]"=&r"(stride_2),  [stride_3]"=&r"(stride_3),
+      [stride_4]"=&r"(stride_4)
+    : [srcStride]"r"(srcStride), [dstStride]"r"(dstStride)
+    : "memory"
+    );
+}
+
+/* avg_pixels8_8_lsx   : dst = avg(src, dst)
+ * put_pixels8_l2_8_lsx: dst = avg(src, half), half stride is 8.
+ * avg_pixels8_l2_8_lsx: dst = avg(avg(src, half), dst), half stride is 8. */
+static av_always_inline void
+avg_pixels8_l2_8_lsx(uint8_t *dst, const uint8_t *src, const uint8_t *half,
+                     ptrdiff_t dstStride, ptrdiff_t srcStride)
+{
+    uint8_t *tmp = dst;
+    ptrdiff_t stride_2, stride_3, stride_4;
+    __asm__ volatile (
+    /* h0~h7 */
+    "slli.d     %[stride_2],     %[srcStride],   1            \n\t"
+    "add.d      %[stride_3],     %[stride_2],    %[srcStride] \n\t"
+    "slli.d     %[stride_4],     %[stride_2],    1            \n\t"
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+
+    "vld        $vr8,            %[half],        0x00         \n\t"
+    "vld        $vr9,            %[half],        0x08         \n\t"
+    "vld        $vr10,           %[half],        0x10         \n\t"
+    "vld        $vr11,           %[half],        0x18         \n\t"
+    "vld        $vr12,           %[half],        0x20         \n\t"
+    "vld        $vr13,           %[half],        0x28         \n\t"
+    "vld        $vr14,           %[half],        0x30         \n\t"
+    "vld        $vr15,           %[half],        0x38         \n\t"
+
+    "vavgr.bu    $vr0,           $vr8,           $vr0         \n\t"
+    "vavgr.bu    $vr1,           $vr9,           $vr1         \n\t"
+    "vavgr.bu    $vr2,           $vr10,          $vr2         \n\t"
+    "vavgr.bu    $vr3,           $vr11,          $vr3         \n\t"
+    "vavgr.bu    $vr4,           $vr12,          $vr4         \n\t"
+    "vavgr.bu    $vr5,           $vr13,          $vr5         \n\t"
+    "vavgr.bu    $vr6,           $vr14,          $vr6         \n\t"
+    "vavgr.bu    $vr7,           $vr15,          $vr7         \n\t"
+
+    "slli.d     %[stride_2],     %[dstStride],   1            \n\t"
+    "add.d      %[stride_3],     %[stride_2],    %[dstStride] \n\t"
+    "slli.d     %[stride_4],     %[stride_2],    1            \n\t"
+    "vld        $vr8,            %[tmp],         0            \n\t"
+    "vldx       $vr9,            %[tmp],         %[dstStride] \n\t"
+    "vldx       $vr10,           %[tmp],         %[stride_2]  \n\t"
+    "vldx       $vr11,           %[tmp],         %[stride_3]  \n\t"
+    "add.d      %[tmp],          %[tmp],         %[stride_4]  \n\t"
+    "vld        $vr12,           %[tmp],         0            \n\t"
+    "vldx       $vr13,           %[tmp],         %[dstStride] \n\t"
+    "vldx       $vr14,           %[tmp],         %[stride_2]  \n\t"
+    "vldx       $vr15,           %[tmp],         %[stride_3]  \n\t"
+
+    "vavgr.bu    $vr0,           $vr8,           $vr0         \n\t"
+    "vavgr.bu    $vr1,           $vr9,           $vr1         \n\t"
+    "vavgr.bu    $vr2,           $vr10,          $vr2         \n\t"
+    "vavgr.bu    $vr3,           $vr11,          $vr3         \n\t"
+    "vavgr.bu    $vr4,           $vr12,          $vr4         \n\t"
+    "vavgr.bu    $vr5,           $vr13,          $vr5         \n\t"
+    "vavgr.bu    $vr6,           $vr14,          $vr6         \n\t"
+    "vavgr.bu    $vr7,           $vr15,          $vr7         \n\t"
+
+    "vstelm.d    $vr0,           %[dst],         0,  0        \n\t"
+    "add.d       %[dst],         %[dst],         %[dstStride] \n\t"
+    "vstelm.d    $vr1,           %[dst],         0,  0        \n\t"
+    "add.d       %[dst],         %[dst],         %[dstStride] \n\t"
+    "vstelm.d    $vr2,           %[dst],         0,  0        \n\t"
+    "add.d       %[dst],         %[dst],         %[dstStride] \n\t"
+    "vstelm.d    $vr3,           %[dst],         0,  0        \n\t"
+    "add.d       %[dst],         %[dst],         %[dstStride] \n\t"
+    "vstelm.d    $vr4,           %[dst],         0,  0        \n\t"
+    "add.d       %[dst],         %[dst],         %[dstStride] \n\t"
+    "vstelm.d    $vr5,           %[dst],         0,  0        \n\t"
+    "add.d       %[dst],         %[dst],         %[dstStride] \n\t"
+    "vstelm.d    $vr6,           %[dst],         0,  0        \n\t"
+    "add.d       %[dst],         %[dst],         %[dstStride] \n\t"
+    "vstelm.d    $vr7,           %[dst],         0,  0        \n\t"
+    : [dst]"+&r"(dst), [tmp]"+&r"(tmp), [half]"+&r"(half),
+      [src]"+&r"(src), [stride_2]"=&r"(stride_2),
+      [stride_3]"=&r"(stride_3), [stride_4]"=&r"(stride_4)
+    : [dstStride]"r"(dstStride), [srcStride]"r"(srcStride)
+    : "memory"
+    );
+}
+
+/* put_pixels16_8_lsx: dst = src */
+static av_always_inline void
+put_pixels16_8_lsx(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+    ptrdiff_t stride_2, stride_3, stride_4;
+    __asm__ volatile (
+    "slli.d     %[stride_2],     %[stride],      1            \n\t"
+    "add.d      %[stride_3],     %[stride_2],    %[stride]    \n\t"
+    "slli.d     %[stride_4],     %[stride_2],    1            \n\t"
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[stride]    \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[stride]    \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+
+    "vst        $vr0,            %[dst],         0            \n\t"
+    "vstx       $vr1,            %[dst],         %[stride]    \n\t"
+    "vstx       $vr2,            %[dst],         %[stride_2]  \n\t"
+    "vstx       $vr3,            %[dst],         %[stride_3]  \n\t"
+    "add.d      %[dst],          %[dst],         %[stride_4]  \n\t"
+    "vst        $vr4,            %[dst],         0            \n\t"
+    "vstx       $vr5,            %[dst],         %[stride]    \n\t"
+    "vstx       $vr6,            %[dst],         %[stride_2]  \n\t"
+    "vstx       $vr7,            %[dst],         %[stride_3]  \n\t"
+    "add.d      %[dst],          %[dst],         %[stride_4]  \n\t"
+
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[stride]    \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[stride]    \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+
+    "vst        $vr0,            %[dst],         0            \n\t"
+    "vstx       $vr1,            %[dst],         %[stride]    \n\t"
+    "vstx       $vr2,            %[dst],         %[stride_2]  \n\t"
+    "vstx       $vr3,            %[dst],         %[stride_3]  \n\t"
+    "add.d      %[dst],          %[dst],         %[stride_4]  \n\t"
+    "vst        $vr4,            %[dst],         0            \n\t"
+    "vstx       $vr5,            %[dst],         %[stride]    \n\t"
+    "vstx       $vr6,            %[dst],         %[stride_2]  \n\t"
+    "vstx       $vr7,            %[dst],         %[stride_3]  \n\t"
+    : [dst]"+&r"(dst),            [src]"+&r"(src),
+      [stride_2]"=&r"(stride_2),  [stride_3]"=&r"(stride_3),
+      [stride_4]"=&r"(stride_4)
+    : [stride]"r"(stride)
+    : "memory"
+    );
+}
+
+/* avg_pixels16_8_lsx   : dst = avg(src, dst)
+ * put_pixels16_l2_8_lsx: dst = avg(src, half), half stride is 16.
+ * avg_pixels16_l2_8_lsx: dst = avg(avg(src, half), dst), half stride is 16. */
+static av_always_inline void
+avg_pixels16_8_lsx(uint8_t *dst, const uint8_t *src, ptrdiff_t stride)
+{
+    uint8_t *tmp = dst;
+    ptrdiff_t stride_2, stride_3, stride_4;
+    __asm__ volatile (
+    /* h0~h7 */
+    "slli.d     %[stride_2],     %[stride],      1            \n\t"
+    "add.d      %[stride_3],     %[stride_2],    %[stride]    \n\t"
+    "slli.d     %[stride_4],     %[stride_2],    1            \n\t"
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[stride]    \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[stride]    \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+
+    "vld        $vr8,            %[tmp],         0            \n\t"
+    "vldx       $vr9,            %[tmp],         %[stride]    \n\t"
+    "vldx       $vr10,           %[tmp],         %[stride_2]  \n\t"
+    "vldx       $vr11,           %[tmp],         %[stride_3]  \n\t"
+    "add.d      %[tmp],          %[tmp],         %[stride_4]  \n\t"
+    "vld        $vr12,           %[tmp],         0            \n\t"
+    "vldx       $vr13,           %[tmp],         %[stride]    \n\t"
+    "vldx       $vr14,           %[tmp],         %[stride_2]  \n\t"
+    "vldx       $vr15,           %[tmp],         %[stride_3]  \n\t"
+    "add.d      %[tmp],          %[tmp],         %[stride_4]  \n\t"
+
+    "vavgr.bu   $vr0,            $vr8,           $vr0         \n\t"
+    "vavgr.bu   $vr1,            $vr9,           $vr1         \n\t"
+    "vavgr.bu   $vr2,            $vr10,          $vr2         \n\t"
+    "vavgr.bu   $vr3,            $vr11,          $vr3         \n\t"
+    "vavgr.bu   $vr4,            $vr12,          $vr4         \n\t"
+    "vavgr.bu   $vr5,            $vr13,          $vr5         \n\t"
+    "vavgr.bu   $vr6,            $vr14,          $vr6         \n\t"
+    "vavgr.bu   $vr7,            $vr15,          $vr7         \n\t"
+
+    "vst        $vr0,            %[dst],         0            \n\t"
+    "vstx       $vr1,            %[dst],         %[stride]    \n\t"
+    "vstx       $vr2,            %[dst],         %[stride_2]  \n\t"
+    "vstx       $vr3,            %[dst],         %[stride_3]  \n\t"
+    "add.d      %[dst],          %[dst],         %[stride_4]  \n\t"
+    "vst        $vr4,            %[dst],         0            \n\t"
+    "vstx       $vr5,            %[dst],         %[stride]    \n\t"
+    "vstx       $vr6,            %[dst],         %[stride_2]  \n\t"
+    "vstx       $vr7,            %[dst],         %[stride_3]  \n\t"
+    "add.d      %[dst],          %[dst],         %[stride_4]  \n\t"
+
+    /* h8~h15 */
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[stride]    \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[stride]    \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+
+    "vld        $vr8,            %[tmp],         0            \n\t"
+    "vldx       $vr9,            %[tmp],         %[stride]    \n\t"
+    "vldx       $vr10,           %[tmp],         %[stride_2]  \n\t"
+    "vldx       $vr11,           %[tmp],         %[stride_3]  \n\t"
+    "add.d      %[tmp],          %[tmp],         %[stride_4]  \n\t"
+    "vld        $vr12,           %[tmp],         0            \n\t"
+    "vldx       $vr13,           %[tmp],         %[stride]    \n\t"
+    "vldx       $vr14,           %[tmp],         %[stride_2]  \n\t"
+    "vldx       $vr15,           %[tmp],         %[stride_3]  \n\t"
+
+    "vavgr.bu    $vr0,           $vr8,           $vr0         \n\t"
+    "vavgr.bu    $vr1,           $vr9,           $vr1         \n\t"
+    "vavgr.bu    $vr2,           $vr10,          $vr2         \n\t"
+    "vavgr.bu    $vr3,           $vr11,          $vr3         \n\t"
+    "vavgr.bu    $vr4,           $vr12,          $vr4         \n\t"
+    "vavgr.bu    $vr5,           $vr13,          $vr5         \n\t"
+    "vavgr.bu    $vr6,           $vr14,          $vr6         \n\t"
+    "vavgr.bu    $vr7,           $vr15,          $vr7         \n\t"
+
+    "vst        $vr0,            %[dst],         0            \n\t"
+    "vstx       $vr1,            %[dst],         %[stride]    \n\t"
+    "vstx       $vr2,            %[dst],         %[stride_2]  \n\t"
+    "vstx       $vr3,            %[dst],         %[stride_3]  \n\t"
+    "add.d      %[dst],          %[dst],         %[stride_4]  \n\t"
+    "vst        $vr4,            %[dst],         0            \n\t"
+    "vstx       $vr5,            %[dst],         %[stride]    \n\t"
+    "vstx       $vr6,            %[dst],         %[stride_2]  \n\t"
+    "vstx       $vr7,            %[dst],         %[stride_3]  \n\t"
+    : [dst]"+&r"(dst), [tmp]"+&r"(tmp), [src]"+&r"(src),
+      [stride_2]"=&r"(stride_2),  [stride_3]"=&r"(stride_3),
+      [stride_4]"=&r"(stride_4)
+    : [stride]"r"(stride)
+    : "memory"
+    );
+}
+
+/* avg_pixels16_8_lsx   : dst = avg(src, dst)
+ * put_pixels16_l2_8_lsx: dst = avg(src, half), half stride is 16.
+ * avg_pixels16_l2_8_lsx: dst = avg(avg(src, half), dst), half stride is 16. */
+static av_always_inline void
+put_pixels16_l2_8_lsx(uint8_t *dst, const uint8_t *src, const uint8_t *half,
+                      ptrdiff_t dstStride, ptrdiff_t srcStride)
+{
+    ptrdiff_t stride_2, stride_3, stride_4;
+    ptrdiff_t dstride_2, dstride_3, dstride_4;
+    __asm__ volatile (
+    "slli.d     %[stride_2],     %[srcStride],   1            \n\t"
+    "add.d      %[stride_3],     %[stride_2],    %[srcStride] \n\t"
+    "slli.d     %[stride_4],     %[stride_2],    1            \n\t"
+    "slli.d     %[dstride_2],    %[dstStride],   1            \n\t"
+    "add.d      %[dstride_3],    %[dstride_2],   %[dstStride] \n\t"
+    "slli.d     %[dstride_4],    %[dstride_2],   1            \n\t"
+    /* h0~h7 */
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+
+    "vld        $vr8,            %[half],        0x00         \n\t"
+    "vld        $vr9,            %[half],        0x10         \n\t"
+    "vld        $vr10,           %[half],        0x20         \n\t"
+    "vld        $vr11,           %[half],        0x30         \n\t"
+    "vld        $vr12,           %[half],        0x40         \n\t"
+    "vld        $vr13,           %[half],        0x50         \n\t"
+    "vld        $vr14,           %[half],        0x60         \n\t"
+    "vld        $vr15,           %[half],        0x70         \n\t"
+
+    "vavgr.bu   $vr0,            $vr8,           $vr0         \n\t"
+    "vavgr.bu   $vr1,            $vr9,           $vr1         \n\t"
+    "vavgr.bu   $vr2,            $vr10,          $vr2         \n\t"
+    "vavgr.bu   $vr3,            $vr11,          $vr3         \n\t"
+    "vavgr.bu   $vr4,            $vr12,          $vr4         \n\t"
+    "vavgr.bu   $vr5,            $vr13,          $vr5         \n\t"
+    "vavgr.bu   $vr6,            $vr14,          $vr6         \n\t"
+    "vavgr.bu   $vr7,            $vr15,          $vr7         \n\t"
+
+    "vst        $vr0,            %[dst],         0            \n\t"
+    "vstx       $vr1,            %[dst],         %[dstStride] \n\t"
+    "vstx       $vr2,            %[dst],         %[dstride_2] \n\t"
+    "vstx       $vr3,            %[dst],         %[dstride_3] \n\t"
+    "add.d      %[dst],          %[dst],         %[dstride_4] \n\t"
+    "vst        $vr4,            %[dst],         0            \n\t"
+    "vstx       $vr5,            %[dst],         %[dstStride] \n\t"
+    "vstx       $vr6,            %[dst],         %[dstride_2] \n\t"
+    "vstx       $vr7,            %[dst],         %[dstride_3] \n\t"
+    "add.d      %[dst],          %[dst],         %[dstride_4] \n\t"
+
+    /* h8~h15 */
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+
+    "vld        $vr8,            %[half],        0x80         \n\t"
+    "vld        $vr9,            %[half],        0x90         \n\t"
+    "vld        $vr10,           %[half],        0xa0         \n\t"
+    "vld        $vr11,           %[half],        0xb0         \n\t"
+    "vld        $vr12,           %[half],        0xc0         \n\t"
+    "vld        $vr13,           %[half],        0xd0         \n\t"
+    "vld        $vr14,           %[half],        0xe0         \n\t"
+    "vld        $vr15,           %[half],        0xf0         \n\t"
+
+    "vavgr.bu   $vr0,            $vr8,           $vr0         \n\t"
+    "vavgr.bu   $vr1,            $vr9,           $vr1         \n\t"
+    "vavgr.bu   $vr2,            $vr10,          $vr2         \n\t"
+    "vavgr.bu   $vr3,            $vr11,          $vr3         \n\t"
+    "vavgr.bu   $vr4,            $vr12,          $vr4         \n\t"
+    "vavgr.bu   $vr5,            $vr13,          $vr5         \n\t"
+    "vavgr.bu   $vr6,            $vr14,          $vr6         \n\t"
+    "vavgr.bu   $vr7,            $vr15,          $vr7         \n\t"
+
+    "vst        $vr0,            %[dst],         0            \n\t"
+    "vstx       $vr1,            %[dst],         %[dstStride] \n\t"
+    "vstx       $vr2,            %[dst],         %[dstride_2] \n\t"
+    "vstx       $vr3,            %[dst],         %[dstride_3] \n\t"
+    "add.d      %[dst],          %[dst],         %[dstride_4] \n\t"
+    "vst        $vr4,            %[dst],         0            \n\t"
+    "vstx       $vr5,            %[dst],         %[dstStride] \n\t"
+    "vstx       $vr6,            %[dst],         %[dstride_2] \n\t"
+    "vstx       $vr7,            %[dst],         %[dstride_3] \n\t"
+    : [dst]"+&r"(dst), [half]"+&r"(half), [src]"+&r"(src),
+      [stride_2]"=&r"(stride_2),  [stride_3]"=&r"(stride_3),
+      [stride_4]"=&r"(stride_4),  [dstride_2]"=&r"(dstride_2),
+      [dstride_3]"=&r"(dstride_3), [dstride_4]"=&r"(dstride_4)
+    : [dstStride]"r"(dstStride), [srcStride]"r"(srcStride)
+    : "memory"
+    );
+}
+
+/* avg_pixels16_8_lsx   : dst = avg(src, dst)
+ * put_pixels16_l2_8_lsx: dst = avg(src, half), half stride is 16.
+ * avg_pixels16_l2_8_lsx: dst = avg(avg(src, half), dst), half stride is 16. */
+static av_always_inline void
+avg_pixels16_l2_8_lsx(uint8_t *dst, const uint8_t *src, const uint8_t *half,
+                      ptrdiff_t dstStride, ptrdiff_t srcStride)
+{
+    uint8_t *tmp = dst;
+    ptrdiff_t stride_2, stride_3, stride_4;
+    ptrdiff_t dstride_2, dstride_3, dstride_4;
+    __asm__ volatile (
+    "slli.d     %[stride_2],     %[srcStride],   1            \n\t"
+    "add.d      %[stride_3],     %[stride_2],    %[srcStride] \n\t"
+    "slli.d     %[stride_4],     %[stride_2],    1            \n\t"
+    "slli.d     %[dstride_2],    %[dstStride],   1            \n\t"
+    "add.d      %[dstride_3],    %[dstride_2],   %[dstStride] \n\t"
+    "slli.d     %[dstride_4],    %[dstride_2],   1            \n\t"
+    /* h0~h7 */
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+
+    "vld        $vr8,            %[half],        0x00         \n\t"
+    "vld        $vr9,            %[half],        0x10         \n\t"
+    "vld        $vr10,           %[half],        0x20         \n\t"
+    "vld        $vr11,           %[half],        0x30         \n\t"
+    "vld        $vr12,           %[half],        0x40         \n\t"
+    "vld        $vr13,           %[half],        0x50         \n\t"
+    "vld        $vr14,           %[half],        0x60         \n\t"
+    "vld        $vr15,           %[half],        0x70         \n\t"
+
+    "vavgr.bu   $vr0,            $vr8,           $vr0         \n\t"
+    "vavgr.bu   $vr1,            $vr9,           $vr1         \n\t"
+    "vavgr.bu   $vr2,            $vr10,          $vr2         \n\t"
+    "vavgr.bu   $vr3,            $vr11,          $vr3         \n\t"
+    "vavgr.bu   $vr4,            $vr12,          $vr4         \n\t"
+    "vavgr.bu   $vr5,            $vr13,          $vr5         \n\t"
+    "vavgr.bu   $vr6,            $vr14,          $vr6         \n\t"
+    "vavgr.bu   $vr7,            $vr15,          $vr7         \n\t"
+
+    "vld        $vr8,            %[tmp],         0            \n\t"
+    "vldx       $vr9,            %[tmp],         %[dstStride] \n\t"
+    "vldx       $vr10,           %[tmp],         %[dstride_2] \n\t"
+    "vldx       $vr11,           %[tmp],         %[dstride_3] \n\t"
+    "add.d      %[tmp],          %[tmp],         %[dstride_4] \n\t"
+    "vld        $vr12,           %[tmp],         0            \n\t"
+    "vldx       $vr13,           %[tmp],         %[dstStride] \n\t"
+    "vldx       $vr14,           %[tmp],         %[dstride_2] \n\t"
+    "vldx       $vr15,           %[tmp],         %[dstride_3] \n\t"
+    "add.d      %[tmp],          %[tmp],         %[dstride_4] \n\t"
+
+    "vavgr.bu    $vr0,           $vr8,           $vr0         \n\t"
+    "vavgr.bu    $vr1,           $vr9,           $vr1         \n\t"
+    "vavgr.bu    $vr2,           $vr10,          $vr2         \n\t"
+    "vavgr.bu    $vr3,           $vr11,          $vr3         \n\t"
+    "vavgr.bu    $vr4,           $vr12,          $vr4         \n\t"
+    "vavgr.bu    $vr5,           $vr13,          $vr5         \n\t"
+    "vavgr.bu    $vr6,           $vr14,          $vr6         \n\t"
+    "vavgr.bu    $vr7,           $vr15,          $vr7         \n\t"
+
+    "vst        $vr0,            %[dst],         0            \n\t"
+    "vstx       $vr1,            %[dst],         %[dstStride] \n\t"
+    "vstx       $vr2,            %[dst],         %[dstride_2] \n\t"
+    "vstx       $vr3,            %[dst],         %[dstride_3] \n\t"
+    "add.d      %[dst],          %[dst],         %[dstride_4] \n\t"
+    "vst        $vr4,            %[dst],         0            \n\t"
+    "vstx       $vr5,            %[dst],         %[dstStride] \n\t"
+    "vstx       $vr6,            %[dst],         %[dstride_2] \n\t"
+    "vstx       $vr7,            %[dst],         %[dstride_3] \n\t"
+    "add.d      %[dst],          %[dst],         %[dstride_4] \n\t"
+
+    /* h8~h15 */
+    "vld        $vr0,            %[src],         0            \n\t"
+    "vldx       $vr1,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr2,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr3,            %[src],         %[stride_3]  \n\t"
+    "add.d      %[src],          %[src],         %[stride_4]  \n\t"
+    "vld        $vr4,            %[src],         0            \n\t"
+    "vldx       $vr5,            %[src],         %[srcStride] \n\t"
+    "vldx       $vr6,            %[src],         %[stride_2]  \n\t"
+    "vldx       $vr7,            %[src],         %[stride_3]  \n\t"
+
+    "vld        $vr8,            %[half],        0x80         \n\t"
+    "vld        $vr9,            %[half],        0x90         \n\t"
+    "vld        $vr10,           %[half],        0xa0         \n\t"
+    "vld        $vr11,           %[half],        0xb0         \n\t"
+    "vld        $vr12,           %[half],        0xc0         \n\t"
+    "vld        $vr13,           %[half],        0xd0         \n\t"
+    "vld        $vr14,           %[half],        0xe0         \n\t"
+    "vld        $vr15,           %[half],        0xf0         \n\t"
+
+    "vavgr.bu    $vr0,           $vr8,           $vr0         \n\t"
+    "vavgr.bu    $vr1,           $vr9,           $vr1         \n\t"
+    "vavgr.bu    $vr2,           $vr10,          $vr2         \n\t"
+    "vavgr.bu    $vr3,           $vr11,          $vr3         \n\t"
+    "vavgr.bu    $vr4,           $vr12,          $vr4         \n\t"
+    "vavgr.bu    $vr5,           $vr13,          $vr5         \n\t"
+    "vavgr.bu    $vr6,           $vr14,          $vr6         \n\t"
+    "vavgr.bu    $vr7,           $vr15,          $vr7         \n\t"
+
+    "vld        $vr8,            %[tmp],         0            \n\t"
+    "vldx       $vr9,            %[tmp],         %[dstStride] \n\t"
+    "vldx       $vr10,           %[tmp],         %[dstride_2] \n\t"
+    "vldx       $vr11,           %[tmp],         %[dstride_3] \n\t"
+    "add.d      %[tmp],          %[tmp],         %[dstride_4] \n\t"
+    "vld        $vr12,           %[tmp],         0            \n\t"
+    "vldx       $vr13,           %[tmp],         %[dstStride] \n\t"
+    "vldx       $vr14,           %[tmp],         %[dstride_2] \n\t"
+    "vldx       $vr15,           %[tmp],         %[dstride_3] \n\t"
+
+    "vavgr.bu    $vr0,           $vr8,           $vr0         \n\t"
+    "vavgr.bu    $vr1,           $vr9,           $vr1         \n\t"
+    "vavgr.bu    $vr2,           $vr10,          $vr2         \n\t"
+    "vavgr.bu    $vr3,           $vr11,          $vr3         \n\t"
+    "vavgr.bu    $vr4,           $vr12,          $vr4         \n\t"
+    "vavgr.bu    $vr5,           $vr13,          $vr5         \n\t"
+    "vavgr.bu    $vr6,           $vr14,          $vr6         \n\t"
+    "vavgr.bu    $vr7,           $vr15,          $vr7         \n\t"
+
+    "vst        $vr0,            %[dst],         0            \n\t"
+    "vstx       $vr1,            %[dst],         %[dstStride] \n\t"
+    "vstx       $vr2,            %[dst],         %[dstride_2] \n\t"
+    "vstx       $vr3,            %[dst],         %[dstride_3] \n\t"
+    "add.d      %[dst],          %[dst],         %[dstride_4] \n\t"
+    "vst        $vr4,            %[dst],         0            \n\t"
+    "vstx       $vr5,            %[dst],         %[dstStride] \n\t"
+    "vstx       $vr6,            %[dst],         %[dstride_2] \n\t"
+    "vstx       $vr7,            %[dst],         %[dstride_3] \n\t"
+    : [dst]"+&r"(dst), [tmp]"+&r"(tmp), [half]"+&r"(half), [src]"+&r"(src),
+      [stride_2]"=&r"(stride_2),  [stride_3]"=&r"(stride_3),
+      [stride_4]"=&r"(stride_4),  [dstride_2]"=&r"(dstride_2),
+      [dstride_3]"=&r"(dstride_3), [dstride_4]"=&r"(dstride_4)
+    : [dstStride]"r"(dstStride), [srcStride]"r"(srcStride)
+    : "memory"
+    );
+}
+
+#define QPEL8_H_LOWPASS(out_v)                                               \
+    src00 = __lasx_xvld(src, -2);                                            \
+    src += srcStride;                                                        \
+    src10 = __lasx_xvld(src, -2);                                            \
+    src += srcStride;                                                        \
+    src00 = __lasx_xvpermi_q(src00, src10, 0x02);                            \
+    src01 = __lasx_xvshuf_b(src00, src00, (__m256i)mask1);                   \
+    src02 = __lasx_xvshuf_b(src00, src00, (__m256i)mask2);                   \
+    src03 = __lasx_xvshuf_b(src00, src00, (__m256i)mask3);                   \
+    src04 = __lasx_xvshuf_b(src00, src00, (__m256i)mask4);                   \
+    src05 = __lasx_xvshuf_b(src00, src00, (__m256i)mask5);                   \
+    DUP2_ARG2(__lasx_xvaddwl_h_bu, src02, src03, src01, src04, src02, src01);\
+    src00 = __lasx_xvaddwl_h_bu(src00, src05);                               \
+    src02 = __lasx_xvmul_h(src02, h_20);                                     \
+    src01 = __lasx_xvmul_h(src01, h_5);                                      \
+    src02 = __lasx_xvssub_h(src02, src01);                                   \
+    src02 = __lasx_xvsadd_h(src02, src00);                                   \
+    src02 = __lasx_xvsadd_h(src02, h_16);                                    \
+    out_v = __lasx_xvssrani_bu_h(src02, src02, 5);                           \
+
+static av_always_inline void
+put_h264_qpel8_h_lowpass_lasx(uint8_t *dst, const uint8_t *src, int dstStride,
+                              int srcStride)
+{
+    int dstStride_2x = dstStride << 1;
+    __m256i src00, src01, src02, src03, src04, src05, src10;
+    __m256i out0, out1, out2, out3;
+    __m256i h_20 = __lasx_xvldi(0x414);
+    __m256i h_5  = __lasx_xvldi(0x405);
+    __m256i h_16 = __lasx_xvldi(0x410);
+    __m256i mask1 = {0x0807060504030201, 0x0, 0x0807060504030201, 0x0};
+    __m256i mask2 = {0x0908070605040302, 0x0, 0x0908070605040302, 0x0};
+    __m256i mask3 = {0x0a09080706050403, 0x0, 0x0a09080706050403, 0x0};
+    __m256i mask4 = {0x0b0a090807060504, 0x0, 0x0b0a090807060504, 0x0};
+    __m256i mask5 = {0x0c0b0a0908070605, 0x0, 0x0c0b0a0908070605, 0x0};
+
+    QPEL8_H_LOWPASS(out0)
+    QPEL8_H_LOWPASS(out1)
+    QPEL8_H_LOWPASS(out2)
+    QPEL8_H_LOWPASS(out3)
+    __lasx_xvstelm_d(out0, dst, 0, 0);
+    __lasx_xvstelm_d(out0, dst + dstStride, 0, 2);
+    dst += dstStride_2x;
+    __lasx_xvstelm_d(out1, dst, 0, 0);
+    __lasx_xvstelm_d(out1, dst + dstStride, 0, 2);
+    dst += dstStride_2x;
+    __lasx_xvstelm_d(out2, dst, 0, 0);
+    __lasx_xvstelm_d(out2, dst + dstStride, 0, 2);
+    dst += dstStride_2x;
+    __lasx_xvstelm_d(out3, dst, 0, 0);
+    __lasx_xvstelm_d(out3, dst + dstStride, 0, 2);
+}
+
+#define QPEL8_V_LOWPASS(src0, src1, src2, src3, src4, src5, src6,       \
+                        tmp0, tmp1, tmp2, tmp3, tmp4, tmp5)             \
+{                                                                       \
+    tmp0 = __lasx_xvpermi_q(src0, src1, 0x02);                          \
+    tmp1 = __lasx_xvpermi_q(src1, src2, 0x02);                          \
+    tmp2 = __lasx_xvpermi_q(src2, src3, 0x02);                          \
+    tmp3 = __lasx_xvpermi_q(src3, src4, 0x02);                          \
+    tmp4 = __lasx_xvpermi_q(src4, src5, 0x02);                          \
+    tmp5 = __lasx_xvpermi_q(src5, src6, 0x02);                          \
+    DUP2_ARG2(__lasx_xvaddwl_h_bu, tmp2, tmp3, tmp1, tmp4, tmp2, tmp1); \
+    tmp0 = __lasx_xvaddwl_h_bu(tmp0, tmp5);                             \
+    tmp2 = __lasx_xvmul_h(tmp2, h_20);                                  \
+    tmp1 = __lasx_xvmul_h(tmp1, h_5);                                   \
+    tmp2 = __lasx_xvssub_h(tmp2, tmp1);                                 \
+    tmp2 = __lasx_xvsadd_h(tmp2, tmp0);                                 \
+    tmp2 = __lasx_xvsadd_h(tmp2, h_16);                                 \
+    tmp2 = __lasx_xvssrani_bu_h(tmp2, tmp2, 5);                         \
+}
+
+static av_always_inline void
+put_h264_qpel8_v_lowpass_lasx(uint8_t *dst, uint8_t *src, int dstStride,
+                              int srcStride)
+{
+    int srcStride_2x = srcStride << 1;
+    int dstStride_2x = dstStride << 1;
+    int srcStride_4x = srcStride << 2;
+    int srcStride_3x = srcStride_2x + srcStride;
+    __m256i src00, src01, src02, src03, src04, src05, src06;
+    __m256i src07, src08, src09, src10, src11, src12;
+    __m256i tmp00, tmp01, tmp02, tmp03, tmp04, tmp05;
+    __m256i h_20 = __lasx_xvldi(0x414);
+    __m256i h_5  = __lasx_xvldi(0x405);
+    __m256i h_16 = __lasx_xvldi(0x410);
+
+    DUP2_ARG2(__lasx_xvld, src - srcStride_2x, 0, src - srcStride, 0,
+              src00, src01);
+    src02 = __lasx_xvld(src, 0);
+    DUP4_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src,
+              srcStride_3x, src, srcStride_4x, src03, src04, src05, src06);
+    src += srcStride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src,
+              srcStride_3x, src, srcStride_4x, src07, src08, src09, src10);
+    src += srcStride_4x;
+    DUP2_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src11, src12);
+
+    QPEL8_V_LOWPASS(src00, src01, src02, src03, src04, src05, src06,
+                    tmp00, tmp01, tmp02, tmp03, tmp04, tmp05);
+    __lasx_xvstelm_d(tmp02, dst, 0, 0);
+    __lasx_xvstelm_d(tmp02, dst + dstStride, 0, 2);
+    dst += dstStride_2x;
+    QPEL8_V_LOWPASS(src02, src03, src04, src05, src06, src07, src08,
+                    tmp00, tmp01, tmp02, tmp03, tmp04, tmp05);
+    __lasx_xvstelm_d(tmp02, dst, 0, 0);
+    __lasx_xvstelm_d(tmp02, dst + dstStride, 0, 2);
+    dst += dstStride_2x;
+    QPEL8_V_LOWPASS(src04, src05, src06, src07, src08, src09, src10,
+                    tmp00, tmp01, tmp02, tmp03, tmp04, tmp05);
+    __lasx_xvstelm_d(tmp02, dst, 0, 0);
+    __lasx_xvstelm_d(tmp02, dst + dstStride, 0, 2);
+    dst += dstStride_2x;
+    QPEL8_V_LOWPASS(src06, src07, src08, src09, src10, src11, src12,
+                    tmp00, tmp01, tmp02, tmp03, tmp04, tmp05);
+    __lasx_xvstelm_d(tmp02, dst, 0, 0);
+    __lasx_xvstelm_d(tmp02, dst + dstStride, 0, 2);
+}
+
+static av_always_inline void
+avg_h264_qpel8_v_lowpass_lasx(uint8_t *dst, uint8_t *src, int dstStride,
+                              int srcStride)
+{
+    int srcStride_2x = srcStride << 1;
+    int srcStride_4x = srcStride << 2;
+    int dstStride_2x = dstStride << 1;
+    int dstStride_4x = dstStride << 2;
+    int srcStride_3x = srcStride_2x + srcStride;
+    int dstStride_3x = dstStride_2x + dstStride;
+    __m256i src00, src01, src02, src03, src04, src05, src06;
+    __m256i src07, src08, src09, src10, src11, src12, tmp00;
+    __m256i tmp01, tmp02, tmp03, tmp04, tmp05, tmp06, tmp07, tmp08, tmp09;
+    __m256i h_20 = __lasx_xvldi(0x414);
+    __m256i h_5  = __lasx_xvldi(0x405);
+    __m256i h_16 = __lasx_xvldi(0x410);
+
+
+    DUP2_ARG2(__lasx_xvld, src - srcStride_2x, 0, src - srcStride, 0,
+              src00, src01);
+    src02 = __lasx_xvld(src, 0);
+    DUP4_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src,
+              srcStride_3x, src, srcStride_4x, src03, src04, src05, src06);
+    src += srcStride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src,
+              srcStride_3x, src, srcStride_4x, src07, src08, src09, src10);
+    src += srcStride_4x;
+    DUP2_ARG2(__lasx_xvldx, src, srcStride, src, srcStride_2x, src11, src12);
+
+    tmp06 = __lasx_xvld(dst, 0);
+    DUP4_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x,
+              dst, dstStride_3x, dst, dstStride_4x,
+              tmp07, tmp02, tmp03, tmp04);
+    dst += dstStride_4x;
+    DUP2_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x,
+              tmp05, tmp00);
+    tmp01 = __lasx_xvldx(dst, dstStride_3x);
+    dst -= dstStride_4x;
+
+    tmp06 = __lasx_xvpermi_q(tmp06, tmp07, 0x02);
+    tmp07 = __lasx_xvpermi_q(tmp02, tmp03, 0x02);
+    tmp08 = __lasx_xvpermi_q(tmp04, tmp05, 0x02);
+    tmp09 = __lasx_xvpermi_q(tmp00, tmp01, 0x02);
+
+    QPEL8_V_LOWPASS(src00, src01, src02, src03, src04, src05, src06,
+                    tmp00, tmp01, tmp02, tmp03, tmp04, tmp05);
+    tmp06 = __lasx_xvavgr_bu(tmp06, tmp02);
+    __lasx_xvstelm_d(tmp06, dst, 0, 0);
+    __lasx_xvstelm_d(tmp06, dst + dstStride, 0, 2);
+    dst += dstStride_2x;
+    QPEL8_V_LOWPASS(src02, src03, src04, src05, src06, src07, src08,
+                    tmp00, tmp01, tmp02, tmp03, tmp04, tmp05);
+    tmp07 = __lasx_xvavgr_bu(tmp07, tmp02);
+    __lasx_xvstelm_d(tmp07, dst, 0, 0);
+    __lasx_xvstelm_d(tmp07, dst + dstStride, 0, 2);
+    dst += dstStride_2x;
+    QPEL8_V_LOWPASS(src04, src05, src06, src07, src08, src09, src10,
+                    tmp00, tmp01, tmp02, tmp03, tmp04, tmp05);
+    tmp08 = __lasx_xvavgr_bu(tmp08, tmp02);
+    __lasx_xvstelm_d(tmp08, dst, 0, 0);
+    __lasx_xvstelm_d(tmp08, dst + dstStride, 0, 2);
+    dst += dstStride_2x;
+    QPEL8_V_LOWPASS(src06, src07, src08, src09, src10, src11, src12,
+                    tmp00, tmp01, tmp02, tmp03, tmp04, tmp05);
+    tmp09 = __lasx_xvavgr_bu(tmp09, tmp02);
+    __lasx_xvstelm_d(tmp09, dst, 0, 0);
+    __lasx_xvstelm_d(tmp09, dst + dstStride, 0, 2);
+}
+
+#define QPEL8_HV_LOWPASS_H(tmp)                                              \
+{                                                                            \
+    src00 = __lasx_xvld(src, -2);                                            \
+    src += srcStride;                                                        \
+    src10 = __lasx_xvld(src, -2);                                            \
+    src += srcStride;                                                        \
+    src00 = __lasx_xvpermi_q(src00, src10, 0x02);                            \
+    src01 = __lasx_xvshuf_b(src00, src00, (__m256i)mask1);                   \
+    src02 = __lasx_xvshuf_b(src00, src00, (__m256i)mask2);                   \
+    src03 = __lasx_xvshuf_b(src00, src00, (__m256i)mask3);                   \
+    src04 = __lasx_xvshuf_b(src00, src00, (__m256i)mask4);                   \
+    src05 = __lasx_xvshuf_b(src00, src00, (__m256i)mask5);                   \
+    DUP2_ARG2(__lasx_xvaddwl_h_bu, src02, src03, src01, src04, src02, src01);\
+    src00 = __lasx_xvaddwl_h_bu(src00, src05);                               \
+    src02 = __lasx_xvmul_h(src02, h_20);                                     \
+    src01 = __lasx_xvmul_h(src01, h_5);                                      \
+    src02 = __lasx_xvssub_h(src02, src01);                                   \
+    tmp  = __lasx_xvsadd_h(src02, src00);                                    \
+}
+
+#define QPEL8_HV_LOWPASS_V(src0, src1, src2, src3,                       \
+                           src4, src5, temp0, temp1,                     \
+                           temp2, temp3, temp4, temp5,                   \
+                           out)                                          \
+{                                                                        \
+    DUP2_ARG2(__lasx_xvaddwl_w_h, src2, src3, src1, src4, temp0, temp2); \
+    DUP2_ARG2(__lasx_xvaddwh_w_h, src2, src3, src1, src4, temp1, temp3); \
+    temp4 = __lasx_xvaddwl_w_h(src0, src5);                              \
+    temp5 = __lasx_xvaddwh_w_h(src0, src5);                              \
+    temp0 = __lasx_xvmul_w(temp0, w_20);                                 \
+    temp1 = __lasx_xvmul_w(temp1, w_20);                                 \
+    temp2 = __lasx_xvmul_w(temp2, w_5);                                  \
+    temp3 = __lasx_xvmul_w(temp3, w_5);                                  \
+    temp0 = __lasx_xvssub_w(temp0, temp2);                               \
+    temp1 = __lasx_xvssub_w(temp1, temp3);                               \
+    temp0 = __lasx_xvsadd_w(temp0, temp4);                               \
+    temp1 = __lasx_xvsadd_w(temp1, temp5);                               \
+    temp0 = __lasx_xvsadd_w(temp0, w_512);                               \
+    temp1 = __lasx_xvsadd_w(temp1, w_512);                               \
+    temp0 = __lasx_xvssrani_hu_w(temp0, temp0, 10);                      \
+    temp1 = __lasx_xvssrani_hu_w(temp1, temp1, 10);                      \
+    temp0 = __lasx_xvpackev_d(temp1, temp0);                             \
+    out   = __lasx_xvssrani_bu_h(temp0, temp0, 0);                       \
+}
+
+static av_always_inline void
+put_h264_qpel8_hv_lowpass_lasx(uint8_t *dst, const uint8_t *src,
+                               ptrdiff_t dstStride, ptrdiff_t srcStride)
+{
+    __m256i src00, src01, src02, src03, src04, src05, src10;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6;
+    __m256i tmp7, tmp8, tmp9, tmp10, tmp11, tmp12;
+    __m256i h_20 = __lasx_xvldi(0x414);
+    __m256i h_5  = __lasx_xvldi(0x405);
+    __m256i w_20 = __lasx_xvldi(0x814);
+    __m256i w_5  = __lasx_xvldi(0x805);
+    __m256i w_512 = {512};
+    __m256i mask1 = {0x0807060504030201, 0x0, 0x0807060504030201, 0x0};
+    __m256i mask2 = {0x0908070605040302, 0x0, 0x0908070605040302, 0x0};
+    __m256i mask3 = {0x0a09080706050403, 0x0, 0x0a09080706050403, 0x0};
+    __m256i mask4 = {0x0b0a090807060504, 0x0, 0x0b0a090807060504, 0x0};
+    __m256i mask5 = {0x0c0b0a0908070605, 0x0, 0x0c0b0a0908070605, 0x0};
+
+    w_512 = __lasx_xvreplve0_w(w_512);
+
+    src -= srcStride << 1;
+    QPEL8_HV_LOWPASS_H(tmp0)
+    QPEL8_HV_LOWPASS_H(tmp2)
+    QPEL8_HV_LOWPASS_H(tmp4)
+    QPEL8_HV_LOWPASS_H(tmp6)
+    QPEL8_HV_LOWPASS_H(tmp8)
+    QPEL8_HV_LOWPASS_H(tmp10)
+    QPEL8_HV_LOWPASS_H(tmp12)
+    tmp11 = __lasx_xvpermi_q(tmp12, tmp10, 0x21);
+    tmp9  = __lasx_xvpermi_q(tmp10, tmp8,  0x21);
+    tmp7  = __lasx_xvpermi_q(tmp8,  tmp6,  0x21);
+    tmp5  = __lasx_xvpermi_q(tmp6,  tmp4,  0x21);
+    tmp3  = __lasx_xvpermi_q(tmp4,  tmp2,  0x21);
+    tmp1  = __lasx_xvpermi_q(tmp2,  tmp0,  0x21);
+
+    QPEL8_HV_LOWPASS_V(tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, src00, src01,
+                       src02, src03, src04, src05, tmp0)
+    QPEL8_HV_LOWPASS_V(tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, src00, src01,
+                       src02, src03, src04, src05, tmp2)
+    QPEL8_HV_LOWPASS_V(tmp4, tmp5, tmp6, tmp7, tmp8, tmp9, src00, src01,
+                       src02, src03, src04, src05, tmp4)
+    QPEL8_HV_LOWPASS_V(tmp6, tmp7, tmp8, tmp9, tmp10, tmp11, src00, src01,
+                       src02, src03, src04, src05, tmp6)
+    __lasx_xvstelm_d(tmp0, dst, 0, 0);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp0, dst, 0, 2);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp2, dst, 0, 0);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp2, dst, 0, 2);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp4, dst, 0, 0);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp4, dst, 0, 2);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp6, dst, 0, 0);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp6, dst, 0, 2);
+}
+
+static av_always_inline void
+avg_h264_qpel8_h_lowpass_lasx(uint8_t *dst, const uint8_t *src, int dstStride,
+                              int srcStride)
+{
+    int dstStride_2x = dstStride << 1;
+    int dstStride_4x = dstStride << 2;
+    int dstStride_3x = dstStride_2x + dstStride;
+    __m256i src00, src01, src02, src03, src04, src05, src10;
+    __m256i dst00, dst01, dst0, dst1, dst2, dst3;
+    __m256i out0, out1, out2, out3;
+    __m256i h_20 = __lasx_xvldi(0x414);
+    __m256i h_5  = __lasx_xvldi(0x405);
+    __m256i h_16 = __lasx_xvldi(0x410);
+    __m256i mask1 = {0x0807060504030201, 0x0, 0x0807060504030201, 0x0};
+    __m256i mask2 = {0x0908070605040302, 0x0, 0x0908070605040302, 0x0};
+    __m256i mask3 = {0x0a09080706050403, 0x0, 0x0a09080706050403, 0x0};
+    __m256i mask4 = {0x0b0a090807060504, 0x0, 0x0b0a090807060504, 0x0};
+    __m256i mask5 = {0x0c0b0a0908070605, 0x0, 0x0c0b0a0908070605, 0x0};
+
+    QPEL8_H_LOWPASS(out0)
+    QPEL8_H_LOWPASS(out1)
+    QPEL8_H_LOWPASS(out2)
+    QPEL8_H_LOWPASS(out3)
+    src00 = __lasx_xvld(dst, 0);
+    DUP4_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, dst,
+              dstStride_3x, dst, dstStride_4x, src01, src02, src03, src04);
+    dst += dstStride_4x;
+    DUP2_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, src05, dst00);
+    dst01 = __lasx_xvldx(dst, dstStride_3x);
+    dst -= dstStride_4x;
+    dst0 = __lasx_xvpermi_q(src00, src01, 0x02);
+    dst1 = __lasx_xvpermi_q(src02, src03, 0x02);
+    dst2 = __lasx_xvpermi_q(src04, src05, 0x02);
+    dst3 = __lasx_xvpermi_q(dst00, dst01, 0x02);
+    dst0 = __lasx_xvavgr_bu(dst0, out0);
+    dst1 = __lasx_xvavgr_bu(dst1, out1);
+    dst2 = __lasx_xvavgr_bu(dst2, out2);
+    dst3 = __lasx_xvavgr_bu(dst3, out3);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst + dstStride, 0, 2);
+    __lasx_xvstelm_d(dst1, dst + dstStride_2x, 0, 0);
+    __lasx_xvstelm_d(dst1, dst + dstStride_3x, 0, 2);
+    dst += dstStride_4x;
+    __lasx_xvstelm_d(dst2, dst, 0, 0);
+    __lasx_xvstelm_d(dst2, dst + dstStride, 0, 2);
+    __lasx_xvstelm_d(dst3, dst + dstStride_2x, 0, 0);
+    __lasx_xvstelm_d(dst3, dst + dstStride_3x, 0, 2);
+}
+
+static av_always_inline void
+avg_h264_qpel8_hv_lowpass_lasx(uint8_t *dst, const uint8_t *src,
+                               ptrdiff_t dstStride, ptrdiff_t srcStride)
+{
+    __m256i src00, src01, src02, src03, src04, src05, src10;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6;
+    __m256i tmp7, tmp8, tmp9, tmp10, tmp11, tmp12;
+    __m256i h_20 = __lasx_xvldi(0x414);
+    __m256i h_5  = __lasx_xvldi(0x405);
+    __m256i w_20 = __lasx_xvldi(0x814);
+    __m256i w_5  = __lasx_xvldi(0x805);
+    __m256i w_512 = {512};
+    __m256i mask1 = {0x0807060504030201, 0x0, 0x0807060504030201, 0x0};
+    __m256i mask2 = {0x0908070605040302, 0x0, 0x0908070605040302, 0x0};
+    __m256i mask3 = {0x0a09080706050403, 0x0, 0x0a09080706050403, 0x0};
+    __m256i mask4 = {0x0b0a090807060504, 0x0, 0x0b0a090807060504, 0x0};
+    __m256i mask5 = {0x0c0b0a0908070605, 0x0, 0x0c0b0a0908070605, 0x0};
+    ptrdiff_t dstStride_2x = dstStride << 1;
+    ptrdiff_t dstStride_4x = dstStride << 2;
+    ptrdiff_t dstStride_3x = dstStride_2x + dstStride;
+
+    w_512 = __lasx_xvreplve0_w(w_512);
+
+    src -= srcStride << 1;
+    QPEL8_HV_LOWPASS_H(tmp0)
+    QPEL8_HV_LOWPASS_H(tmp2)
+    QPEL8_HV_LOWPASS_H(tmp4)
+    QPEL8_HV_LOWPASS_H(tmp6)
+    QPEL8_HV_LOWPASS_H(tmp8)
+    QPEL8_HV_LOWPASS_H(tmp10)
+    QPEL8_HV_LOWPASS_H(tmp12)
+    tmp11 = __lasx_xvpermi_q(tmp12, tmp10, 0x21);
+    tmp9  = __lasx_xvpermi_q(tmp10, tmp8,  0x21);
+    tmp7  = __lasx_xvpermi_q(tmp8,  tmp6,  0x21);
+    tmp5  = __lasx_xvpermi_q(tmp6,  tmp4,  0x21);
+    tmp3  = __lasx_xvpermi_q(tmp4,  tmp2,  0x21);
+    tmp1  = __lasx_xvpermi_q(tmp2,  tmp0,  0x21);
+
+    QPEL8_HV_LOWPASS_V(tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, src00, src01,
+                       src02, src03, src04, src05, tmp0)
+    QPEL8_HV_LOWPASS_V(tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, src00, src01,
+                       src02, src03, src04, src05, tmp2)
+    QPEL8_HV_LOWPASS_V(tmp4, tmp5, tmp6, tmp7, tmp8, tmp9, src00, src01,
+                       src02, src03, src04, src05, tmp4)
+    QPEL8_HV_LOWPASS_V(tmp6, tmp7, tmp8, tmp9, tmp10, tmp11, src00, src01,
+                       src02, src03, src04, src05, tmp6)
+
+    src00 = __lasx_xvld(dst, 0);
+    DUP4_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, dst,
+              dstStride_3x, dst, dstStride_4x, src01, src02, src03, src04);
+    dst += dstStride_4x;
+    DUP2_ARG2(__lasx_xvldx, dst, dstStride, dst, dstStride_2x, src05, tmp8);
+    tmp9 = __lasx_xvldx(dst, dstStride_3x);
+    dst -= dstStride_4x;
+    tmp1 = __lasx_xvpermi_q(src00, src01, 0x02);
+    tmp3 = __lasx_xvpermi_q(src02, src03, 0x02);
+    tmp5 = __lasx_xvpermi_q(src04, src05, 0x02);
+    tmp7 = __lasx_xvpermi_q(tmp8,  tmp9,  0x02);
+    tmp0 = __lasx_xvavgr_bu(tmp0, tmp1);
+    tmp2 = __lasx_xvavgr_bu(tmp2, tmp3);
+    tmp4 = __lasx_xvavgr_bu(tmp4, tmp5);
+    tmp6 = __lasx_xvavgr_bu(tmp6, tmp7);
+    __lasx_xvstelm_d(tmp0, dst, 0, 0);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp0, dst, 0, 2);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp2, dst, 0, 0);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp2, dst, 0, 2);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp4, dst, 0, 0);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp4, dst, 0, 2);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp6, dst, 0, 0);
+    dst += dstStride;
+    __lasx_xvstelm_d(tmp6, dst, 0, 2);
+}
+
+static av_always_inline void
+put_h264_qpel16_h_lowpass_lasx(uint8_t *dst, const uint8_t *src,
+                               int dstStride, int srcStride)
+{
+    put_h264_qpel8_h_lowpass_lasx(dst, src, dstStride, srcStride);
+    put_h264_qpel8_h_lowpass_lasx(dst+8, src+8, dstStride, srcStride);
+    src += srcStride << 3;
+    dst += dstStride << 3;
+    put_h264_qpel8_h_lowpass_lasx(dst, src, dstStride, srcStride);
+    put_h264_qpel8_h_lowpass_lasx(dst+8, src+8, dstStride, srcStride);
+}
+
+static av_always_inline void
+avg_h264_qpel16_h_lowpass_lasx(uint8_t *dst, const uint8_t *src,
+                               int dstStride, int srcStride)
+{
+    avg_h264_qpel8_h_lowpass_lasx(dst, src, dstStride, srcStride);
+    avg_h264_qpel8_h_lowpass_lasx(dst+8, src+8, dstStride, srcStride);
+    src += srcStride << 3;
+    dst += dstStride << 3;
+    avg_h264_qpel8_h_lowpass_lasx(dst, src, dstStride, srcStride);
+    avg_h264_qpel8_h_lowpass_lasx(dst+8, src+8, dstStride, srcStride);
+}
+
+static void put_h264_qpel16_v_lowpass_lasx(uint8_t *dst, const uint8_t *src,
+                                           int dstStride, int srcStride)
+{
+    put_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, dstStride, srcStride);
+    put_h264_qpel8_v_lowpass_lasx(dst+8, (uint8_t*)src+8, dstStride, srcStride);
+    src += 8*srcStride;
+    dst += 8*dstStride;
+    put_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, dstStride, srcStride);
+    put_h264_qpel8_v_lowpass_lasx(dst+8, (uint8_t*)src+8, dstStride, srcStride);
+}
+
+static void avg_h264_qpel16_v_lowpass_lasx(uint8_t *dst, const uint8_t *src,
+                                           int dstStride, int srcStride)
+{
+    avg_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, dstStride, srcStride);
+    avg_h264_qpel8_v_lowpass_lasx(dst+8, (uint8_t*)src+8, dstStride, srcStride);
+    src += 8*srcStride;
+    dst += 8*dstStride;
+    avg_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, dstStride, srcStride);
+    avg_h264_qpel8_v_lowpass_lasx(dst+8, (uint8_t*)src+8, dstStride, srcStride);
+}
+
+static void put_h264_qpel16_hv_lowpass_lasx(uint8_t *dst, const uint8_t *src,
+                                     ptrdiff_t dstStride, ptrdiff_t srcStride)
+{
+    put_h264_qpel8_hv_lowpass_lasx(dst, src, dstStride, srcStride);
+    put_h264_qpel8_hv_lowpass_lasx(dst + 8, src + 8, dstStride, srcStride);
+    src += srcStride << 3;
+    dst += dstStride << 3;
+    put_h264_qpel8_hv_lowpass_lasx(dst, src, dstStride, srcStride);
+    put_h264_qpel8_hv_lowpass_lasx(dst + 8, src + 8, dstStride, srcStride);
+}
+
+static void avg_h264_qpel16_hv_lowpass_lasx(uint8_t *dst, const uint8_t *src,
+                                     ptrdiff_t dstStride, ptrdiff_t srcStride)
+{
+    avg_h264_qpel8_hv_lowpass_lasx(dst, src, dstStride, srcStride);
+    avg_h264_qpel8_hv_lowpass_lasx(dst + 8, src + 8, dstStride, srcStride);
+    src += srcStride << 3;
+    dst += dstStride << 3;
+    avg_h264_qpel8_hv_lowpass_lasx(dst, src, dstStride, srcStride);
+    avg_h264_qpel8_hv_lowpass_lasx(dst + 8, src + 8, dstStride, srcStride);
+}
+
+void ff_put_h264_qpel8_mc00_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    /* The MMI optimization uses ff_put_pixels8_8_mmi here,
+     * which is implemented in hpeldsp_mmi.c. */
+    put_pixels8_8_inline_asm(dst, src, stride);
+}
+
+void ff_put_h264_qpel8_mc10_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t half[64];
+
+    put_h264_qpel8_h_lowpass_lasx(half, src, 8, stride);
+    /* For qpel8, the stride of half and the block height are both 8. */
+    put_pixels8_l2_8_lsx(dst, src, half, stride, stride);
+}
+
+void ff_put_h264_qpel8_mc20_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    put_h264_qpel8_h_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_put_h264_qpel8_mc30_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t half[64];
+
+    put_h264_qpel8_h_lowpass_lasx(half, src, 8, stride);
+    put_pixels8_l2_8_lsx(dst, src+1, half, stride, stride);
+}
+
+void ff_put_h264_qpel8_mc01_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t half[64];
+
+    put_h264_qpel8_v_lowpass_lasx(half, (uint8_t*)src, 8, stride);
+    put_pixels8_l2_8_lsx(dst, src, half, stride, stride);
+}
+
+void ff_put_h264_qpel8_mc11_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t halfH[64];
+    uint8_t halfV[64];
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src, 8, stride);
+    put_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8);
+}
+
+void ff_put_h264_qpel8_mc21_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t temp[128];
+    uint8_t *const halfH  = temp;
+    uint8_t *const halfHV = temp + 64;
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride);
+    put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride);
+    put_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8);
+}
+
+void ff_put_h264_qpel8_mc31_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t halfH[64];
+    uint8_t halfV[64];
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src + 1, 8, stride);
+    put_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8);
+}
+
+void ff_put_h264_qpel8_mc02_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    put_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, stride, stride);
+}
+
+void ff_put_h264_qpel8_mc12_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t temp[128];
+    uint8_t *const halfHV = temp;
+    uint8_t *const halfH  = temp + 64;
+
+    put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfH, (uint8_t*)src, 8, stride);
+    put_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8);
+}
+
+void ff_put_h264_qpel8_mc22_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    put_h264_qpel8_hv_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_put_h264_qpel8_mc32_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t temp[128];
+    uint8_t *const halfHV = temp;
+    uint8_t *const halfH  = temp + 64;
+
+    put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfH, (uint8_t*)src + 1, 8, stride);
+    put_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8);
+}
+
+void ff_put_h264_qpel8_mc03_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t half[64];
+
+    put_h264_qpel8_v_lowpass_lasx(half, (uint8_t*)src, 8, stride);
+    put_pixels8_l2_8_lsx(dst, src + stride, half, stride, stride);
+}
+
+void ff_put_h264_qpel8_mc13_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t halfH[64];
+    uint8_t halfV[64];
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src, 8, stride);
+    put_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8);
+}
+
+void ff_put_h264_qpel8_mc23_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t temp[128];
+    uint8_t *const halfH  = temp;
+    uint8_t *const halfHV = temp + 64;
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride);
+    put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride);
+    put_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8);
+}
+
+void ff_put_h264_qpel8_mc33_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t halfH[64];
+    uint8_t halfV[64];
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src + 1, 8, stride);
+    put_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8);
+}
+
+void ff_avg_h264_qpel8_mc00_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    /* The MMI optimization uses ff_avg_pixels8_8_mmi here,
+     * which is implemented in hpeldsp_mmi.c. */
+    avg_pixels8_8_lsx(dst, src, stride);
+}
+
+void ff_avg_h264_qpel8_mc10_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t half[64];
+
+    put_h264_qpel8_h_lowpass_lasx(half, src, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, src, half, stride, stride);
+}
+
+void ff_avg_h264_qpel8_mc20_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    avg_h264_qpel8_h_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_avg_h264_qpel8_mc30_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t half[64];
+
+    put_h264_qpel8_h_lowpass_lasx(half, src, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, src+1, half, stride, stride);
+}
+
+void ff_avg_h264_qpel8_mc11_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t halfH[64];
+    uint8_t halfV[64];
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8);
+}
+
+void ff_avg_h264_qpel8_mc21_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t temp[128];
+    uint8_t *const halfH  = temp;
+    uint8_t *const halfHV = temp + 64;
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride);
+    put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8);
+}
+
+void ff_avg_h264_qpel8_mc31_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t halfH[64];
+    uint8_t halfV[64];
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src + 1, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8);
+}
+
+void ff_avg_h264_qpel8_mc02_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    avg_h264_qpel8_v_lowpass_lasx(dst, (uint8_t*)src, stride, stride);
+}
+
+void ff_avg_h264_qpel8_mc12_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t temp[128];
+    uint8_t *const halfHV = temp;
+    uint8_t *const halfH  = temp + 64;
+
+    put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfH, (uint8_t*)src, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8);
+}
+
+void ff_avg_h264_qpel8_mc22_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    avg_h264_qpel8_hv_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_avg_h264_qpel8_mc32_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t temp[128];
+    uint8_t *const halfHV = temp;
+    uint8_t *const halfH  = temp + 64;
+
+    put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfH, (uint8_t*)src + 1, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8);
+}
+
+void ff_avg_h264_qpel8_mc13_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t halfH[64];
+    uint8_t halfV[64];
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8);
+}
+
+void ff_avg_h264_qpel8_mc23_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t temp[128];
+    uint8_t *const halfH  = temp;
+    uint8_t *const halfHV = temp + 64;
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride);
+    put_h264_qpel8_hv_lowpass_lasx(halfHV, src, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, halfH, halfHV, stride, 8);
+}
+
+void ff_avg_h264_qpel8_mc33_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride)
+{
+    uint8_t halfH[64];
+    uint8_t halfV[64];
+
+    put_h264_qpel8_h_lowpass_lasx(halfH, src + stride, 8, stride);
+    put_h264_qpel8_v_lowpass_lasx(halfV, (uint8_t*)src + 1, 8, stride);
+    avg_pixels8_l2_8_lsx(dst, halfH, halfV, stride, 8);
+}
+
+void ff_put_h264_qpel16_mc00_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    /* The MMI optimization uses ff_put_pixels16_8_mmi here,
+     * which is implemented in hpeldsp_mmi.c. */
+    put_pixels16_8_lsx(dst, src, stride);
+}
+
+void ff_put_h264_qpel16_mc10_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t half[256];
+
+    put_h264_qpel16_h_lowpass_lasx(half, src, 16, stride);
+    put_pixels16_l2_8_lsx(dst, src, half, stride, stride);
+}
+
+void ff_put_h264_qpel16_mc20_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    put_h264_qpel16_h_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_put_h264_qpel16_mc30_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t half[256];
+
+    put_h264_qpel16_h_lowpass_lasx(half, src, 16, stride);
+    put_pixels16_l2_8_lsx(dst, src+1, half, stride, stride);
+}
+
+void ff_put_h264_qpel16_mc01_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t half[256];
+
+    put_h264_qpel16_v_lowpass_lasx(half, src, 16, stride);
+    put_pixels16_l2_8_lsx(dst, src, half, stride, stride);
+}
+
+void ff_put_h264_qpel16_mc11_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avc_luma_hv_qrt_16x16_lasx((uint8_t*)src - 2, (uint8_t*)src - (stride * 2),
+                               dst, stride);
+}
+
+void ff_put_h264_qpel16_mc21_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t temp[512];
+    uint8_t *const halfH  = temp;
+    uint8_t *const halfHV = temp + 256;
+
+    put_h264_qpel16_h_lowpass_lasx(halfH, src, 16, stride);
+    put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride);
+    put_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16);
+}
+
+void ff_put_h264_qpel16_mc31_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avc_luma_hv_qrt_16x16_lasx((uint8_t*)src - 2, (uint8_t*)src - (stride * 2) + 1,
+                               dst, stride);
+}
+
+void ff_put_h264_qpel16_mc02_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    put_h264_qpel16_v_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_put_h264_qpel16_mc12_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t temp[512];
+    uint8_t *const halfHV = temp;
+    uint8_t *const halfH  = temp + 256;
+
+    put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride);
+    put_h264_qpel16_v_lowpass_lasx(halfH, src, 16, stride);
+    put_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16);
+}
+
+void ff_put_h264_qpel16_mc22_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    put_h264_qpel16_hv_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_put_h264_qpel16_mc32_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t temp[512];
+    uint8_t *const halfHV = temp;
+    uint8_t *const halfH  = temp + 256;
+
+    put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride);
+    put_h264_qpel16_v_lowpass_lasx(halfH, src + 1, 16, stride);
+    put_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16);
+}
+
+void ff_put_h264_qpel16_mc03_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t half[256];
+
+    put_h264_qpel16_v_lowpass_lasx(half, src, 16, stride);
+    put_pixels16_l2_8_lsx(dst, src+stride, half, stride, stride);
+}
+
+void ff_put_h264_qpel16_mc13_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avc_luma_hv_qrt_16x16_lasx((uint8_t*)src + stride - 2, (uint8_t*)src - (stride * 2),
+                               dst, stride);
+}
+
+void ff_put_h264_qpel16_mc23_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t temp[512];
+    uint8_t *const halfH  = temp;
+    uint8_t *const halfHV = temp + 256;
+
+    put_h264_qpel16_h_lowpass_lasx(halfH, src + stride, 16, stride);
+    put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride);
+    put_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16);
+}
+
+void ff_put_h264_qpel16_mc33_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avc_luma_hv_qrt_16x16_lasx((uint8_t*)src + stride - 2,
+                               (uint8_t*)src - (stride * 2) + 1, dst, stride);
+}
+
+void ff_avg_h264_qpel16_mc00_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    /* The MMI optimization uses ff_avg_pixels16_8_mmi here,
+     * which is implemented in hpeldsp_mmi.c. */
+    avg_pixels16_8_lsx(dst, src, stride);
+}
+
+void ff_avg_h264_qpel16_mc10_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t half[256];
+
+    put_h264_qpel16_h_lowpass_lasx(half, src, 16, stride);
+    avg_pixels16_l2_8_lsx(dst, src, half, stride, stride);
+}
+
+void ff_avg_h264_qpel16_mc20_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avg_h264_qpel16_h_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_avg_h264_qpel16_mc30_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t half[256];
+
+    put_h264_qpel16_h_lowpass_lasx(half, src, 16, stride);
+    avg_pixels16_l2_8_lsx(dst, src+1, half, stride, stride);
+}
+
+void ff_avg_h264_qpel16_mc01_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t half[256];
+
+    put_h264_qpel16_v_lowpass_lasx(half, src, 16, stride);
+    avg_pixels16_l2_8_lsx(dst, src, half, stride, stride);
+}
+
+void ff_avg_h264_qpel16_mc11_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avc_luma_hv_qrt_and_aver_dst_16x16_lasx((uint8_t*)src - 2,
+                                           (uint8_t*)src - (stride * 2),
+                                           dst, stride);
+}
+
+void ff_avg_h264_qpel16_mc21_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t temp[512];
+    uint8_t *const halfH  = temp;
+    uint8_t *const halfHV = temp + 256;
+
+    put_h264_qpel16_h_lowpass_lasx(halfH, src, 16, stride);
+    put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride);
+    avg_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16);
+}
+
+void ff_avg_h264_qpel16_mc31_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avc_luma_hv_qrt_and_aver_dst_16x16_lasx((uint8_t*)src - 2,
+                                            (uint8_t*)src - (stride * 2) + 1,
+                                            dst, stride);
+}
+
+void ff_avg_h264_qpel16_mc02_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avg_h264_qpel16_v_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_avg_h264_qpel16_mc12_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t temp[512];
+    uint8_t *const halfHV = temp;
+    uint8_t *const halfH  = temp + 256;
+
+    put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride);
+    put_h264_qpel16_v_lowpass_lasx(halfH, src, 16, stride);
+    avg_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16);
+}
+
+void ff_avg_h264_qpel16_mc22_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avg_h264_qpel16_hv_lowpass_lasx(dst, src, stride, stride);
+}
+
+void ff_avg_h264_qpel16_mc32_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t temp[512];
+    uint8_t *const halfHV = temp;
+    uint8_t *const halfH  = temp + 256;
+
+    put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride);
+    put_h264_qpel16_v_lowpass_lasx(halfH, src + 1, 16, stride);
+    avg_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16);
+}
+
+void ff_avg_h264_qpel16_mc03_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t half[256];
+
+    put_h264_qpel16_v_lowpass_lasx(half, src, 16, stride);
+    avg_pixels16_l2_8_lsx(dst, src + stride, half, stride, stride);
+}
+
+void ff_avg_h264_qpel16_mc13_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avc_luma_hv_qrt_and_aver_dst_16x16_lasx((uint8_t*)src + stride - 2,
+                                            (uint8_t*)src - (stride * 2),
+                                            dst, stride);
+}
+
+void ff_avg_h264_qpel16_mc23_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    uint8_t temp[512];
+    uint8_t *const halfH  = temp;
+    uint8_t *const halfHV = temp + 256;
+
+    put_h264_qpel16_h_lowpass_lasx(halfH, src + stride, 16, stride);
+    put_h264_qpel16_hv_lowpass_lasx(halfHV, src, 16, stride);
+    avg_pixels16_l2_8_lsx(dst, halfH, halfHV, stride, 16);
+}
+
+void ff_avg_h264_qpel16_mc33_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t stride)
+{
+    avc_luma_hv_qrt_and_aver_dst_16x16_lasx((uint8_t*)src + stride - 2,
+                                            (uint8_t*)src - (stride * 2) + 1,
+                                            dst, stride);
+}
diff --git a/libavcodec/loongarch/h264qpel_lasx.h b/libavcodec/loongarch/h264qpel_lasx.h
new file mode 100644
index 0000000000..32b6b50917
--- /dev/null
+++ b/libavcodec/loongarch/h264qpel_lasx.h
@@ -0,0 +1,158 @@
+/*
+ * Copyright (c) 2020 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_H264QPEL_LASX_H
+#define AVCODEC_LOONGARCH_H264QPEL_LASX_H
+
+#include <stdint.h>
+#include <stddef.h>
+#include "libavcodec/h264.h"
+
+void ff_h264_h_lpf_luma_inter_lasx(uint8_t *src, int stride,
+                                   int alpha, int beta, int8_t *tc0);
+void ff_h264_v_lpf_luma_inter_lasx(uint8_t *src, int stride,
+                                   int alpha, int beta, int8_t *tc0);
+void ff_put_h264_qpel16_mc00_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc10_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc20_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc30_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc01_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc11_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc21_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc31_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc02_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc12_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc32_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc22_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc03_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc13_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc23_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_put_h264_qpel16_mc33_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc00_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc10_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc20_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc30_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc01_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc11_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc21_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc31_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc02_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc12_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc22_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc32_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc03_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc13_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc23_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+void ff_avg_h264_qpel16_mc33_lasx(uint8_t *dst, const uint8_t *src,
+                                  ptrdiff_t dst_stride);
+
+void ff_put_h264_qpel8_mc00_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc10_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc20_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc30_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc01_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc11_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc21_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc31_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc02_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc12_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc22_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc32_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc03_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc13_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc23_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_put_h264_qpel8_mc33_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t stride);
+void ff_avg_h264_qpel8_mc00_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc10_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc20_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc30_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc11_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc21_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc31_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc02_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc12_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc22_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc32_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc13_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc23_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+void ff_avg_h264_qpel8_mc33_lasx(uint8_t *dst, const uint8_t *src,
+                                 ptrdiff_t dst_stride);
+#endif /* AVCODEC_LOONGARCH_H264QPEL_LASX_H */

From patchwork Tue Dec 14 13:33:13 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= <chenhao@loongson.cn>
X-Patchwork-Id: 32489
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp6966507iog;
        Tue, 14 Dec 2021 05:35:17 -0800 (PST)
X-Google-Smtp-Source: 
 ABdhPJxADkLYi4YY4X0VhfcziZgVXnnaG0BfCo5W8/gCxwtvu2gIqO6ch2BC4JHfP+Mnm7hc6qm8
X-Received: by 2002:a17:906:58d0:: with SMTP id
 e16mr5566030ejs.605.1639488917368;
        Tue, 14 Dec 2021 05:35:17 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1639488917; cv=none;
        d=google.com; s=arc-20160816;
        b=02IUHCiLw7c8pMruRDYgkWiuO5vRvY8hj8u3Sv1uSRuHgg4rFCi87lzUs6I+VJbOgh
         3ysTwpIkwIvSBqxyteFawgyX7sC2VKSYY3VhyduZbiZFo2h9LRnHu96yytV0m+mSwC61
         CASR0whyo4OBL7f8mz5gFQB4s02JqYILToJ3yumml6eAsAJS6pSTYFnVWP7jDNt6Wx1B
         G/x7xO4vicReo7ZH2fWaxOYhpKfVPgh1/pfCb+LGAXmlaEiD09oYI20W5H8/a6jMdJ/7
         RATJdH3bIXlXQmZC9R60X2m2zKNJIJCEbP/fdvNMADpIfzZ4ammwEnHu+jhEqiL1PQl4
         7WsA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:delivered-to;
        bh=wvSU8oMAiSQSf3pbNfStLe1mT7pNYyMBS0VjUzspZgo=;
        b=K9S891MyhZOTOqAikPU21FJJFNx75PurXvxKPunfEzy0UwUWXk1CBeOkL3klUG2RBd
         PSSQ8jlKHexbnnqzLWfvW8KgN8sv9V1KIVRorHkofpQ+gBpI77xhYx5OJWKufD2Ieu6k
         ueefH6eIWVL1REzW1dpNHhU6ZMnlDXT6uZqqfFP41z6gzQETbS6W6sZlvbR47Atf5ojX
         R1nB5g7f8v3vV3g/9mkHr3waWAsnxp/WefPKKfvj8Wn9WzBqKWX0smVcfN3CnG5DH31v
         k5hrDkBTEY8XKm7De3aTI/9FviYtnp0Ubgp/EzX+sjD4UjR+Nvpf68sDQswy2q6j29md
         WgbA==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 qb11si5098691ejc.903.2021.12.14.05.35.16;
        Tue, 14 Dec 2021 05:35:17 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 8387268AEDD;
	Tue, 14 Dec 2021 15:34:01 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from loongson.cn (mail.loongson.cn [114.242.206.163])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id AA97868AE51
 for <ffmpeg-devel@ffmpeg.org>; Tue, 14 Dec 2021 15:33:46 +0200 (EET)
Received: from localhost (unknown [36.33.26.144])
 by mail.loongson.cn (Coremail) with SMTP id AQAAf9BxnN03nbhhlKcAAA--.3535S3;
 Tue, 14 Dec 2021 21:33:43 +0800 (CST)
From: Hao Chen <chenhao@loongson.cn>
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:13 +0800
Message-Id: <20211214133316.8978-5-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
MIME-Version: 1.0
X-CM-TRANSID: AQAAf9BxnN03nbhhlKcAAA--.3535S3
X-Coremail-Antispam: 1UD129KBjvAXoWDtryftF17WF45Kw1rtFWfGrg_yoWfWr1kKo
 WUKw4Ivrn2gF1Iy345JrnayFyUua4xCryDXw4jqws2ka45XF90yrWYk3y5Xry5tr4kX34D
 A3yUXa47Zw1Yqwn8n29KB7ZKAUJUUUU8529EdanIXcx71UUUUU7v73VFW2AGmfu7bjvjm3
 AaLaJ3UjIYCTnIWjp_UUU5-7k0a2IF6w4kM7kC6x804xWl14x267AKxVWUJVW8JwAFc2x0
 x2IEx4CE42xK8VAvwI8IcIk0rVWrJVCq3wAFIxvE14AKwVWUJVWUGwA2ocxC64kIII0Yj4
 1l84x0c7CEw4AK67xGY2AK021l84ACjcxK6xIIjxv20xvE14v26ryj6F1UM28EF7xvwVC0
 I7IYx2IY6xkF7I0E14v26F4j6r4UJwA2z4x0Y4vEx4A2jsIE14v26rxl6s0DM28EF7xvwV
 C2z280aVCY1x0267AKxVW0oVCq3wAS0I0E0xvYzxvE52x082IY62kv0487Mc02F40EFcxC
 0VAKzVAqx4xG6I80ewAv7VC0I7IYx2IY67AKxVWUtVWrXwAv7VC2z280aVAFwI0_Gr1j6F
 4UJwAm72CE4IkC6x0Yz7v_Jr0_Gr1lc2xSY4AK67AK6ry8MxC20s026xCaFVCjc4AY6r1j
 6r4UMI8I3I0E5I8CrVAFwI0_Jr0_Jr4lx2IqxVCjr7xvwVAFwI0_JrI_JrWlx4CE17CEb7
 AF67AKxVWUXVWUAwCI42IY6xIIjxv20xvE14v26r4j6ryUMIIF0xvE2Ix0cI8IcVCY1x02
 67AKxVW8JVWxJwCI42IY6xAIw20EY4v20xvaj40_Jr0_JF4lIxAIcVC2z280aVAFwI0_Gr
 0_Cr1lIxAIcVC2z280aVCY1x0267AKxVW8JVW8JrUvcSsGvfC2KfnxnUUI43ZEXa7IU5tl
 1PUUUUU==
X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/
Subject: [FFmpeg-devel] [PATCH v2 4/7] avcodec: [loongarch] Optimize h264dsp
 with LASX.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: gxw <guxiwei-hf@loongson.cn>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: 5s75qHj4t0Pn

From: gxw <guxiwei-hf@loongson.cn>

./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before: 225
after:  282

Change-Id: Ibe245827dcdfe8fc1541c6b172483151bfa9e642
---
 libavcodec/h264dsp.c                          |    1 +
 libavcodec/h264dsp.h                          |    2 +
 libavcodec/loongarch/Makefile                 |    2 +
 libavcodec/loongarch/h264dsp_init_loongarch.c |   58 +
 libavcodec/loongarch/h264dsp_lasx.c           | 2114 +++++++++++++++++
 libavcodec/loongarch/h264dsp_lasx.h           |   68 +
 6 files changed, 2245 insertions(+)
 create mode 100644 libavcodec/loongarch/h264dsp_init_loongarch.c
 create mode 100644 libavcodec/loongarch/h264dsp_lasx.c
 create mode 100644 libavcodec/loongarch/h264dsp_lasx.h
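
Reviewer note: the AVC_LPF_P0Q0 macro added in h264dsp_lasx.c appears to
vectorize the normal (bS < 4) H.264 edge-filter update of the p0/q0 samples.
For comparison, here is a minimal scalar sketch of the same arithmetic. It
assumes 8-bit samples and an arithmetic right shift for negative values; the
names clip3 and lpf_p0q0_scalar are hypothetical and are not part of this patch.

```c
#include <stdint.h>

/* Clamp v to the inclusive range [lo, hi]. */
static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical scalar helper: filter one p0/q0 pair across an edge in place.
 * p1 and q1 are the neighbouring samples on each side; tc is the clipping
 * threshold. Mirrors the steps of AVC_LPF_P0Q0 for a single lane. */
static void lpf_p0q0_scalar(uint8_t *p0, uint8_t *q0, int p1, int q1, int tc)
{
    /* ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3, clipped to [-tc, tc].
     * The shift of a negative value is assumed to be arithmetic. */
    int delta = clip3((((*q0 - *p0) * 4) + (p1 - q1) + 4) >> 3, -tc, tc);

    /* Apply the correction with opposite signs and saturate to 8 bits,
     * as the __lasx_xvclip255_h calls do in the macro. */
    *p0 = (uint8_t)clip3(*p0 + delta, 0, 255);
    *q0 = (uint8_t)clip3(*q0 - delta, 0, 255);
}
```

The LASX macro performs these steps on sixteen 16-bit lanes at once, which is
why the inputs are widened to halfwords before the subtraction.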

diff --git a/libavcodec/h264dsp.c b/libavcodec/h264dsp.c
index e76932b565..f97ac2823c 100644
--- a/libavcodec/h264dsp.c
+++ b/libavcodec/h264dsp.c
@@ -157,4 +157,5 @@ av_cold void ff_h264dsp_init(H264DSPContext *c, const int bit_depth,
     if (ARCH_PPC) ff_h264dsp_init_ppc(c, bit_depth, chroma_format_idc);
     if (ARCH_X86) ff_h264dsp_init_x86(c, bit_depth, chroma_format_idc);
     if (ARCH_MIPS) ff_h264dsp_init_mips(c, bit_depth, chroma_format_idc);
+    if (ARCH_LOONGARCH) ff_h264dsp_init_loongarch(c, bit_depth, chroma_format_idc);
 }
diff --git a/libavcodec/h264dsp.h b/libavcodec/h264dsp.h
index 850d4471fd..e0880c4d88 100644
--- a/libavcodec/h264dsp.h
+++ b/libavcodec/h264dsp.h
@@ -129,5 +129,7 @@ void ff_h264dsp_init_x86(H264DSPContext *c, const int bit_depth,
                          const int chroma_format_idc);
 void ff_h264dsp_init_mips(H264DSPContext *c, const int bit_depth,
                           const int chroma_format_idc);
+void ff_h264dsp_init_loongarch(H264DSPContext *c, const int bit_depth,
+                               const int chroma_format_idc);
 
 #endif /* AVCODEC_H264DSP_H */
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 4e2ce8487f..df43151dbd 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -1,4 +1,6 @@
 OBJS-$(CONFIG_H264CHROMA)             += loongarch/h264chroma_init_loongarch.o
 OBJS-$(CONFIG_H264QPEL)               += loongarch/h264qpel_init_loongarch.o
+OBJS-$(CONFIG_H264DSP)                += loongarch/h264dsp_init_loongarch.o
 LASX-OBJS-$(CONFIG_H264CHROMA)        += loongarch/h264chroma_lasx.o
 LASX-OBJS-$(CONFIG_H264QPEL)          += loongarch/h264qpel_lasx.o
+LASX-OBJS-$(CONFIG_H264DSP)           += loongarch/h264dsp_lasx.o
diff --git a/libavcodec/loongarch/h264dsp_init_loongarch.c b/libavcodec/loongarch/h264dsp_init_loongarch.c
new file mode 100644
index 0000000000..ddc0877a74
--- /dev/null
+++ b/libavcodec/loongarch/h264dsp_init_loongarch.c
@@ -0,0 +1,58 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *                Xiwei  Gu  <guxiwei-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/cpu.h"
+#include "h264dsp_lasx.h"
+
+av_cold void ff_h264dsp_init_loongarch(H264DSPContext *c, const int bit_depth,
+                                       const int chroma_format_idc)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (have_lasx(cpu_flags)) {
+        if (bit_depth == 8) {
+            c->h264_add_pixels4_clear = ff_h264_add_pixels4_8_lasx;
+            c->h264_add_pixels8_clear = ff_h264_add_pixels8_8_lasx;
+            c->h264_v_loop_filter_luma = ff_h264_v_lpf_luma_8_lasx;
+            c->h264_h_loop_filter_luma = ff_h264_h_lpf_luma_8_lasx;
+            c->h264_v_loop_filter_luma_intra = ff_h264_v_lpf_luma_intra_8_lasx;
+            c->h264_h_loop_filter_luma_intra = ff_h264_h_lpf_luma_intra_8_lasx;
+            c->h264_v_loop_filter_chroma = ff_h264_v_lpf_chroma_8_lasx;
+            c->h264_v_loop_filter_chroma_intra = ff_h264_v_lpf_chroma_intra_8_lasx;
+
+            /* The horizontal chroma filters are only set for chroma_format_idc <= 1. */
+            if (chroma_format_idc <= 1) {
+                c->h264_h_loop_filter_chroma = ff_h264_h_lpf_chroma_8_lasx;
+                c->h264_h_loop_filter_chroma_intra = ff_h264_h_lpf_chroma_intra_8_lasx;
+            }
+
+            /* Weighted MC */
+            c->weight_h264_pixels_tab[0] = ff_weight_h264_pixels16_8_lasx;
+            c->weight_h264_pixels_tab[1] = ff_weight_h264_pixels8_8_lasx;
+            c->weight_h264_pixels_tab[2] = ff_weight_h264_pixels4_8_lasx;
+
+            c->biweight_h264_pixels_tab[0] = ff_biweight_h264_pixels16_8_lasx;
+            c->biweight_h264_pixels_tab[1] = ff_biweight_h264_pixels8_8_lasx;
+            c->biweight_h264_pixels_tab[2] = ff_biweight_h264_pixels4_8_lasx;
+        }
+    }
+}
diff --git a/libavcodec/loongarch/h264dsp_lasx.c b/libavcodec/loongarch/h264dsp_lasx.c
new file mode 100644
index 0000000000..7fd4cedf7e
--- /dev/null
+++ b/libavcodec/loongarch/h264dsp_lasx.c
@@ -0,0 +1,2114 @@
+/*
+ * Loongson LASX optimized h264dsp
+ *
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *                Xiwei  Gu  <guxiwei-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/loongson_intrinsics.h"
+#include "h264dsp_lasx.h"
+
+#define AVC_LPF_P1_OR_Q1(p0_or_q0_org_in, q0_or_p0_org_in,   \
+                         p1_or_q1_org_in, p2_or_q2_org_in,   \
+                         neg_tc_in, tc_in, p1_or_q1_out)     \
+{                                                            \
+    __m256i clip3, temp;                                     \
+                                                             \
+    clip3 = __lasx_xvavgr_hu(p0_or_q0_org_in,                \
+                             q0_or_p0_org_in);               \
+    temp = __lasx_xvslli_h(p1_or_q1_org_in, 1);              \
+    clip3 = __lasx_xvsub_h(clip3, temp);                     \
+    clip3 = __lasx_xvavg_h(p2_or_q2_org_in, clip3);          \
+    clip3 = __lasx_xvclip_h(clip3, neg_tc_in, tc_in);        \
+    p1_or_q1_out = __lasx_xvadd_h(p1_or_q1_org_in, clip3);   \
+}
+
+#define AVC_LPF_P0Q0(q0_or_p0_org_in, p0_or_q0_org_in,       \
+                     p1_or_q1_org_in, q1_or_p1_org_in,       \
+                     neg_threshold_in, threshold_in,         \
+                     p0_or_q0_out, q0_or_p0_out)             \
+{                                                            \
+    __m256i q0_sub_p0, p1_sub_q1, delta;                     \
+                                                             \
+    q0_sub_p0 = __lasx_xvsub_h(q0_or_p0_org_in,              \
+                               p0_or_q0_org_in);             \
+    p1_sub_q1 = __lasx_xvsub_h(p1_or_q1_org_in,              \
+                               q1_or_p1_org_in);             \
+    q0_sub_p0 = __lasx_xvslli_h(q0_sub_p0, 2);               \
+    p1_sub_q1 = __lasx_xvaddi_hu(p1_sub_q1, 4);              \
+    delta = __lasx_xvadd_h(q0_sub_p0, p1_sub_q1);            \
+    delta = __lasx_xvsrai_h(delta, 3);                       \
+    delta = __lasx_xvclip_h(delta, neg_threshold_in,         \
+                            threshold_in);                   \
+    p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_org_in, delta);   \
+    q0_or_p0_out = __lasx_xvsub_h(q0_or_p0_org_in, delta);   \
+                                                             \
+    p0_or_q0_out = __lasx_xvclip255_h(p0_or_q0_out);         \
+    q0_or_p0_out = __lasx_xvclip255_h(q0_or_p0_out);         \
+}
+
+void ff_h264_h_lpf_luma_8_lasx(uint8_t *data, ptrdiff_t img_width,
+                               int alpha_in, int beta_in, int8_t *tc)
+{
+    ptrdiff_t img_width_2x = img_width << 1;
+    ptrdiff_t img_width_4x = img_width << 2;
+    ptrdiff_t img_width_8x = img_width << 3;
+    ptrdiff_t img_width_3x = img_width_2x + img_width;
+    __m256i tmp_vec0, bs_vec;
+    __m256i tc_vec = {0x0101010100000000, 0x0303030302020202,
+                      0x0101010100000000, 0x0303030302020202};
+
+    tmp_vec0 = __lasx_xvldrepl_w((uint32_t*)tc, 0);
+    tc_vec   = __lasx_xvshuf_b(tmp_vec0, tmp_vec0, tc_vec);
+    bs_vec   = __lasx_xvslti_b(tc_vec, 0);
+    bs_vec   = __lasx_xvxori_b(bs_vec, 255);
+    bs_vec   = __lasx_xvandi_b(bs_vec, 1);
+
+    if (__lasx_xbnz_v(bs_vec)) {
+        uint8_t *src = data - 4;
+        __m256i p3_org, p2_org, p1_org, p0_org, q0_org, q1_org, q2_org, q3_org;
+        __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta;
+        __m256i is_less_than, is_less_than_beta, is_less_than_alpha;
+        __m256i is_bs_greater_than0;
+        __m256i zero = __lasx_xvldi(0);
+
+        is_bs_greater_than0 = __lasx_xvslt_bu(zero, bs_vec);
+
+        {
+            uint8_t *src_tmp = src + img_width_8x;
+            __m256i row0, row1, row2, row3, row4, row5, row6, row7;
+            __m256i row8, row9, row10, row11, row12, row13, row14, row15;
+
+            DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x,
+                      src, img_width_3x, row0, row1, row2, row3);
+            src += img_width_4x;
+            DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x,
+                      src, img_width_3x, row4, row5, row6, row7);
+            src -= img_width_4x;
+            DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, img_width, src_tmp,
+                      img_width_2x, src_tmp, img_width_3x,
+                      row8, row9, row10, row11);
+            src_tmp += img_width_4x;
+            DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, img_width, src_tmp,
+                      img_width_2x, src_tmp, img_width_3x,
+                      row12, row13, row14, row15);
+            src_tmp -= img_width_4x;
+
+            LASX_TRANSPOSE16x8_B(row0, row1, row2, row3, row4, row5, row6,
+                                 row7, row8, row9, row10, row11,
+                                 row12, row13, row14, row15,
+                                 p3_org, p2_org, p1_org, p0_org,
+                                 q0_org, q1_org, q2_org, q3_org);
+        }
+
+        p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org);
+        p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org);
+        q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org);
+
+        alpha = __lasx_xvreplgr2vr_b(alpha_in);
+        beta  = __lasx_xvreplgr2vr_b(beta_in);
+
+        is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha);
+        is_less_than_beta  = __lasx_xvslt_bu(p1_asub_p0, beta);
+        is_less_than       = is_less_than_alpha & is_less_than_beta;
+        is_less_than_beta  = __lasx_xvslt_bu(q1_asub_q0, beta);
+        is_less_than       = is_less_than_beta & is_less_than;
+        is_less_than       = is_less_than & is_bs_greater_than0;
+
+        if (__lasx_xbnz_v(is_less_than)) {
+            __m256i neg_tc_h, tc_h, p1_org_h, p0_org_h, q0_org_h, q1_org_h;
+            __m256i p2_asub_p0, q2_asub_q0;
+
+            neg_tc_h = __lasx_xvneg_b(tc_vec);
+            neg_tc_h = __lasx_vext2xv_h_b(neg_tc_h);
+            tc_h     = __lasx_vext2xv_hu_bu(tc_vec);
+            p1_org_h = __lasx_vext2xv_hu_bu(p1_org);
+            p0_org_h = __lasx_vext2xv_hu_bu(p0_org);
+            q0_org_h = __lasx_vext2xv_hu_bu(q0_org);
+
+            p2_asub_p0 = __lasx_xvabsd_bu(p2_org, p0_org);
+            is_less_than_beta = __lasx_xvslt_bu(p2_asub_p0, beta);
+            is_less_than_beta = is_less_than_beta & is_less_than;
+
+            if (__lasx_xbnz_v(is_less_than_beta)) {
+                __m256i p2_org_h, p1_h;
+
+                p2_org_h = __lasx_vext2xv_hu_bu(p2_org);
+                AVC_LPF_P1_OR_Q1(p0_org_h, q0_org_h, p1_org_h, p2_org_h,
+                                 neg_tc_h, tc_h, p1_h);
+                p1_h = __lasx_xvpickev_b(p1_h, p1_h);
+                p1_h = __lasx_xvpermi_d(p1_h, 0xd8);
+                p1_org = __lasx_xvbitsel_v(p1_org, p1_h, is_less_than_beta);
+                is_less_than_beta = __lasx_xvandi_b(is_less_than_beta, 1);
+                tc_vec = __lasx_xvadd_b(tc_vec, is_less_than_beta);
+            }
+
+            q2_asub_q0 = __lasx_xvabsd_bu(q2_org, q0_org);
+            is_less_than_beta = __lasx_xvslt_bu(q2_asub_q0, beta);
+            is_less_than_beta = is_less_than_beta & is_less_than;
+
+            q1_org_h = __lasx_vext2xv_hu_bu(q1_org);
+
+            if (__lasx_xbnz_v(is_less_than_beta)) {
+                __m256i q2_org_h, q1_h;
+
+                q2_org_h = __lasx_vext2xv_hu_bu(q2_org);
+                AVC_LPF_P1_OR_Q1(p0_org_h, q0_org_h, q1_org_h, q2_org_h,
+                                 neg_tc_h, tc_h, q1_h);
+                q1_h = __lasx_xvpickev_b(q1_h, q1_h);
+                q1_h = __lasx_xvpermi_d(q1_h, 0xd8);
+                q1_org = __lasx_xvbitsel_v(q1_org, q1_h, is_less_than_beta);
+
+                is_less_than_beta = __lasx_xvandi_b(is_less_than_beta, 1);
+                tc_vec = __lasx_xvadd_b(tc_vec, is_less_than_beta);
+            }
+
+            {
+                __m256i neg_thresh_h, p0_h, q0_h;
+
+                neg_thresh_h = __lasx_xvneg_b(tc_vec);
+                neg_thresh_h = __lasx_vext2xv_h_b(neg_thresh_h);
+                tc_h         = __lasx_vext2xv_hu_bu(tc_vec);
+
+                AVC_LPF_P0Q0(q0_org_h, p0_org_h, p1_org_h, q1_org_h,
+                             neg_thresh_h, tc_h, p0_h, q0_h);
+                DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h,
+                          p0_h, q0_h);
+                DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8,
+                          p0_h, q0_h);
+                p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than);
+                q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than);
+            }
+
+            {
+                __m256i row0, row1, row2, row3, row4, row5, row6, row7;
+                __m256i control = {0x0000000400000000, 0x0000000500000001,
+                                   0x0000000600000002, 0x0000000700000003};
+
+                DUP4_ARG3(__lasx_xvpermi_q, p0_org, q3_org, 0x02, p1_org,
+                          q2_org, 0x02, p2_org, q1_org, 0x02, p3_org,
+                          q0_org, 0x02, p0_org, p1_org, p2_org, p3_org);
+                DUP2_ARG2(__lasx_xvilvl_b, p1_org, p3_org, p0_org, p2_org,
+                          row0, row2);
+                DUP2_ARG2(__lasx_xvilvh_b, p1_org, p3_org, p0_org, p2_org,
+                          row1, row3);
+                DUP2_ARG2(__lasx_xvilvl_b, row2, row0, row3, row1, row4, row6);
+                DUP2_ARG2(__lasx_xvilvh_b, row2, row0, row3, row1, row5, row7);
+                DUP4_ARG2(__lasx_xvperm_w, row4, control, row5, control, row6,
+                          control, row7, control, row4, row5, row6, row7);
+                __lasx_xvstelm_d(row4, src, 0, 0);
+                __lasx_xvstelm_d(row4, src + img_width, 0, 1);
+                src += img_width_2x;
+                __lasx_xvstelm_d(row4, src, 0, 2);
+                __lasx_xvstelm_d(row4, src + img_width, 0, 3);
+                src += img_width_2x;
+                __lasx_xvstelm_d(row5, src, 0, 0);
+                __lasx_xvstelm_d(row5, src + img_width, 0, 1);
+                src += img_width_2x;
+                __lasx_xvstelm_d(row5, src, 0, 2);
+                __lasx_xvstelm_d(row5, src + img_width, 0, 3);
+                src += img_width_2x;
+                __lasx_xvstelm_d(row6, src, 0, 0);
+                __lasx_xvstelm_d(row6, src + img_width, 0, 1);
+                src += img_width_2x;
+                __lasx_xvstelm_d(row6, src, 0, 2);
+                __lasx_xvstelm_d(row6, src + img_width, 0, 3);
+                src += img_width_2x;
+                __lasx_xvstelm_d(row7, src, 0, 0);
+                __lasx_xvstelm_d(row7, src + img_width, 0, 1);
+                src += img_width_2x;
+                __lasx_xvstelm_d(row7, src, 0, 2);
+                __lasx_xvstelm_d(row7, src + img_width, 0, 3);
+            }
+        }
+    }
+}
+
+void ff_h264_v_lpf_luma_8_lasx(uint8_t *data, ptrdiff_t img_width,
+                               int alpha_in, int beta_in, int8_t *tc)
+{
+    ptrdiff_t img_width_2x = img_width << 1;
+    ptrdiff_t img_width_3x = img_width + img_width_2x;
+    __m256i tmp_vec0, bs_vec;
+    __m256i tc_vec = {0x0101010100000000, 0x0303030302020202,
+                      0x0101010100000000, 0x0303030302020202};
+
+    tmp_vec0 = __lasx_xvldrepl_w((uint32_t*)tc, 0);
+    tc_vec   = __lasx_xvshuf_b(tmp_vec0, tmp_vec0, tc_vec);
+    bs_vec   = __lasx_xvslti_b(tc_vec, 0);
+    bs_vec   = __lasx_xvxori_b(bs_vec, 255);
+    bs_vec   = __lasx_xvandi_b(bs_vec, 1);
+
+    if (__lasx_xbnz_v(bs_vec)) {
+        __m256i p2_org, p1_org, p0_org, q0_org, q1_org, q2_org;
+        __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta;
+        __m256i is_less_than, is_less_than_beta, is_less_than_alpha;
+        __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h;
+        __m256i is_bs_greater_than0;
+        __m256i zero = __lasx_xvldi(0);
+
+        alpha = __lasx_xvreplgr2vr_b(alpha_in);
+        beta  = __lasx_xvreplgr2vr_b(beta_in);
+
+        DUP2_ARG2(__lasx_xvldx, data, -img_width_3x, data, -img_width_2x,
+                  p2_org, p1_org);
+        p0_org = __lasx_xvldx(data, -img_width);
+        DUP2_ARG2(__lasx_xvldx, data, 0, data, img_width, q0_org, q1_org);
+
+        is_bs_greater_than0 = __lasx_xvslt_bu(zero, bs_vec);
+        p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org);
+        p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org);
+        q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org);
+
+        is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha);
+        is_less_than_beta  = __lasx_xvslt_bu(p1_asub_p0, beta);
+        is_less_than       = is_less_than_alpha & is_less_than_beta;
+        is_less_than_beta  = __lasx_xvslt_bu(q1_asub_q0, beta);
+        is_less_than       = is_less_than_beta & is_less_than;
+        is_less_than       = is_less_than & is_bs_greater_than0;
+
+        if (__lasx_xbnz_v(is_less_than)) {
+            __m256i neg_tc_h, tc_h, p2_asub_p0, q2_asub_q0;
+
+            q2_org = __lasx_xvldx(data, img_width_2x);
+
+            neg_tc_h = __lasx_xvneg_b(tc_vec);
+            neg_tc_h = __lasx_vext2xv_h_b(neg_tc_h);
+            tc_h     = __lasx_vext2xv_hu_bu(tc_vec);
+            p1_org_h = __lasx_vext2xv_hu_bu(p1_org);
+            p0_org_h = __lasx_vext2xv_hu_bu(p0_org);
+            q0_org_h = __lasx_vext2xv_hu_bu(q0_org);
+
+            p2_asub_p0        = __lasx_xvabsd_bu(p2_org, p0_org);
+            is_less_than_beta = __lasx_xvslt_bu(p2_asub_p0, beta);
+            is_less_than_beta = is_less_than_beta & is_less_than;
+
+            if (__lasx_xbnz_v(is_less_than_beta)) {
+                __m256i p1_h, p2_org_h;
+
+                p2_org_h = __lasx_vext2xv_hu_bu(p2_org);
+                AVC_LPF_P1_OR_Q1(p0_org_h, q0_org_h, p1_org_h, p2_org_h,
+                                 neg_tc_h, tc_h, p1_h);
+                p1_h = __lasx_xvpickev_b(p1_h, p1_h);
+                p1_h = __lasx_xvpermi_d(p1_h, 0xd8);
+                p1_h   = __lasx_xvbitsel_v(p1_org, p1_h, is_less_than_beta);
+                p1_org = __lasx_xvpermi_q(p1_org, p1_h, 0x30);
+                __lasx_xvst(p1_org, data - img_width_2x, 0);
+
+                is_less_than_beta = __lasx_xvandi_b(is_less_than_beta, 1);
+                tc_vec = __lasx_xvadd_b(tc_vec, is_less_than_beta);
+            }
+
+            q2_asub_q0 = __lasx_xvabsd_bu(q2_org, q0_org);
+            is_less_than_beta = __lasx_xvslt_bu(q2_asub_q0, beta);
+            is_less_than_beta = is_less_than_beta & is_less_than;
+
+            q1_org_h = __lasx_vext2xv_hu_bu(q1_org);
+
+            if (__lasx_xbnz_v(is_less_than_beta)) {
+                __m256i q1_h, q2_org_h;
+
+                q2_org_h = __lasx_vext2xv_hu_bu(q2_org);
+                AVC_LPF_P1_OR_Q1(p0_org_h, q0_org_h, q1_org_h, q2_org_h,
+                                 neg_tc_h, tc_h, q1_h);
+                q1_h = __lasx_xvpickev_b(q1_h, q1_h);
+                q1_h = __lasx_xvpermi_d(q1_h, 0xd8);
+                q1_h = __lasx_xvbitsel_v(q1_org, q1_h, is_less_than_beta);
+                q1_org = __lasx_xvpermi_q(q1_org, q1_h, 0x30);
+                __lasx_xvst(q1_org, data + img_width, 0);
+
+                is_less_than_beta = __lasx_xvandi_b(is_less_than_beta, 1);
+                tc_vec = __lasx_xvadd_b(tc_vec, is_less_than_beta);
+
+            }
+
+            {
+                __m256i neg_thresh_h, p0_h, q0_h;
+
+                neg_thresh_h = __lasx_xvneg_b(tc_vec);
+                neg_thresh_h = __lasx_vext2xv_h_b(neg_thresh_h);
+                tc_h         = __lasx_vext2xv_hu_bu(tc_vec);
+
+                AVC_LPF_P0Q0(q0_org_h, p0_org_h, p1_org_h, q1_org_h,
+                             neg_thresh_h, tc_h, p0_h, q0_h);
+                DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h,
+                          p0_h, q0_h);
+                DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8,
+                          p0_h, q0_h);
+                p0_h = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than);
+                q0_h = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than);
+                p0_org = __lasx_xvpermi_q(p0_org, p0_h, 0x30);
+                q0_org = __lasx_xvpermi_q(q0_org, q0_h, 0x30);
+                __lasx_xvst(p0_org, data - img_width, 0);
+                __lasx_xvst(q0_org, data, 0);
+            }
+        }
+    }
+}
+
+void ff_h264_h_lpf_chroma_8_lasx(uint8_t *data, ptrdiff_t img_width,
+                                 int alpha_in, int beta_in, int8_t *tc)
+{
+    __m256i tmp_vec0, bs_vec;
+    __m256i tc_vec = {0x0303020201010000, 0x0303020201010000, 0x0, 0x0};
+    __m256i zero = __lasx_xvldi(0);
+    ptrdiff_t img_width_2x = img_width << 1;
+    ptrdiff_t img_width_4x = img_width << 2;
+    ptrdiff_t img_width_3x = img_width_2x + img_width;
+
+    tmp_vec0 = __lasx_xvldrepl_w((uint32_t*)tc, 0);
+    tc_vec   = __lasx_xvshuf_b(tmp_vec0, tmp_vec0, tc_vec);
+    bs_vec   = __lasx_xvslti_b(tc_vec, 0);
+    bs_vec   = __lasx_xvxori_b(bs_vec, 255);
+    bs_vec   = __lasx_xvandi_b(bs_vec, 1);
+    bs_vec   = __lasx_xvpermi_q(zero, bs_vec, 0x30);
+
+    if (__lasx_xbnz_v(bs_vec)) {
+        uint8_t *src = data - 2;
+        __m256i p1_org, p0_org, q0_org, q1_org;
+        __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta;
+        __m256i is_less_than, is_less_than_beta, is_less_than_alpha;
+        __m256i is_bs_greater_than0;
+
+        is_bs_greater_than0 = __lasx_xvslt_bu(zero, bs_vec);
+
+        {
+            __m256i row0, row1, row2, row3, row4, row5, row6, row7;
+
+            DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x,
+                      src, img_width_3x, row0, row1, row2, row3);
+            src += img_width_4x;
+            DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x,
+                      src, img_width_3x, row4, row5, row6, row7);
+            src -= img_width_4x;
+            /* LASX_TRANSPOSE8x4_B */
+            DUP4_ARG2(__lasx_xvilvl_b, row2, row0, row3, row1, row6, row4,
+                      row7, row5, p1_org, p0_org, q0_org, q1_org);
+            row0 = __lasx_xvilvl_b(p0_org, p1_org);
+            row1 = __lasx_xvilvl_b(q1_org, q0_org);
+            row3 = __lasx_xvilvh_w(row1, row0);
+            row2 = __lasx_xvilvl_w(row1, row0);
+            p1_org = __lasx_xvpermi_d(row2, 0x00);
+            p0_org = __lasx_xvpermi_d(row2, 0x55);
+            q0_org = __lasx_xvpermi_d(row3, 0x00);
+            q1_org = __lasx_xvpermi_d(row3, 0x55);
+        }
+
+        p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org);
+        p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org);
+        q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org);
+
+        alpha = __lasx_xvreplgr2vr_b(alpha_in);
+        beta  = __lasx_xvreplgr2vr_b(beta_in);
+
+        is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha);
+        is_less_than_beta  = __lasx_xvslt_bu(p1_asub_p0, beta);
+        is_less_than       = is_less_than_alpha & is_less_than_beta;
+        is_less_than_beta  = __lasx_xvslt_bu(q1_asub_q0, beta);
+        is_less_than       = is_less_than_beta & is_less_than;
+        is_less_than       = is_less_than & is_bs_greater_than0;
+
+        if (__lasx_xbnz_v(is_less_than)) {
+            __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h;
+
+            p1_org_h = __lasx_vext2xv_hu_bu(p1_org);
+            p0_org_h = __lasx_vext2xv_hu_bu(p0_org);
+            q0_org_h = __lasx_vext2xv_hu_bu(q0_org);
+            q1_org_h = __lasx_vext2xv_hu_bu(q1_org);
+
+            {
+                __m256i tc_h, neg_thresh_h, p0_h, q0_h;
+
+                neg_thresh_h = __lasx_xvneg_b(tc_vec);
+                neg_thresh_h = __lasx_vext2xv_h_b(neg_thresh_h);
+                tc_h         = __lasx_vext2xv_hu_bu(tc_vec);
+
+                AVC_LPF_P0Q0(q0_org_h, p0_org_h, p1_org_h, q1_org_h,
+                             neg_thresh_h, tc_h, p0_h, q0_h);
+                DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h,
+                          p0_h, q0_h);
+                DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8,
+                          p0_h, q0_h);
+                p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than);
+                q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than);
+            }
+
+            p0_org = __lasx_xvilvl_b(q0_org, p0_org);
+            src = data - 1;
+            __lasx_xvstelm_h(p0_org, src, 0, 0);
+            src += img_width;
+            __lasx_xvstelm_h(p0_org, src, 0, 1);
+            src += img_width;
+            __lasx_xvstelm_h(p0_org, src, 0, 2);
+            src += img_width;
+            __lasx_xvstelm_h(p0_org, src, 0, 3);
+            src += img_width;
+            __lasx_xvstelm_h(p0_org, src, 0, 4);
+            src += img_width;
+            __lasx_xvstelm_h(p0_org, src, 0, 5);
+            src += img_width;
+            __lasx_xvstelm_h(p0_org, src, 0, 6);
+            src += img_width;
+            __lasx_xvstelm_h(p0_org, src, 0, 7);
+        }
+    }
+}
+
+void ff_h264_v_lpf_chroma_8_lasx(uint8_t *data, ptrdiff_t img_width,
+                                 int alpha_in, int beta_in, int8_t *tc)
+{
+    ptrdiff_t img_width_2x = img_width << 1;
+    __m256i tmp_vec0, bs_vec;
+    __m256i tc_vec = {0x0303020201010000, 0x0303020201010000, 0x0, 0x0};
+    __m256i zero = __lasx_xvldi(0);
+
+    tmp_vec0 = __lasx_xvldrepl_w((uint32_t*)tc, 0);
+    tc_vec   = __lasx_xvshuf_b(tmp_vec0, tmp_vec0, tc_vec);
+    bs_vec   = __lasx_xvslti_b(tc_vec, 0);
+    bs_vec   = __lasx_xvxori_b(bs_vec, 255);
+    bs_vec   = __lasx_xvandi_b(bs_vec, 1);
+    bs_vec   = __lasx_xvpermi_q(zero, bs_vec, 0x30);
+
+    if (__lasx_xbnz_v(bs_vec)) {
+        __m256i p1_org, p0_org, q0_org, q1_org;
+        __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta;
+        __m256i is_less_than, is_less_than_beta, is_less_than_alpha;
+        __m256i is_bs_greater_than0;
+
+        alpha = __lasx_xvreplgr2vr_b(alpha_in);
+        beta  = __lasx_xvreplgr2vr_b(beta_in);
+
+        DUP2_ARG2(__lasx_xvldx, data, -img_width_2x, data, -img_width,
+                  p1_org, p0_org);
+        DUP2_ARG2(__lasx_xvldx, data, 0, data, img_width, q0_org, q1_org);
+
+        is_bs_greater_than0 = __lasx_xvslt_bu(zero, bs_vec);
+        p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org);
+        p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org);
+        q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org);
+
+        is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha);
+        is_less_than_beta  = __lasx_xvslt_bu(p1_asub_p0, beta);
+        is_less_than       = is_less_than_alpha & is_less_than_beta;
+        is_less_than_beta  = __lasx_xvslt_bu(q1_asub_q0, beta);
+        is_less_than       = is_less_than_beta & is_less_than;
+        is_less_than       = is_less_than & is_bs_greater_than0;
+
+        if (__lasx_xbnz_v(is_less_than)) {
+            __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h;
+
+            p1_org_h = __lasx_vext2xv_hu_bu(p1_org);
+            p0_org_h = __lasx_vext2xv_hu_bu(p0_org);
+            q0_org_h = __lasx_vext2xv_hu_bu(q0_org);
+            q1_org_h = __lasx_vext2xv_hu_bu(q1_org);
+
+            {
+                __m256i neg_thresh_h, tc_h, p0_h, q0_h;
+
+                neg_thresh_h = __lasx_xvneg_b(tc_vec);
+                neg_thresh_h = __lasx_vext2xv_h_b(neg_thresh_h);
+                tc_h         = __lasx_vext2xv_hu_bu(tc_vec);
+
+                AVC_LPF_P0Q0(q0_org_h, p0_org_h, p1_org_h, q1_org_h,
+                             neg_thresh_h, tc_h, p0_h, q0_h);
+                DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h,
+                          p0_h, q0_h);
+                DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8,
+                          p0_h, q0_h);
+                p0_h = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than);
+                q0_h = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than);
+                __lasx_xvstelm_d(p0_h, data - img_width, 0, 0);
+                __lasx_xvstelm_d(q0_h, data, 0, 0);
+            }
+        }
+    }
+}
+
+#define AVC_LPF_P0P1P2_OR_Q0Q1Q2(p3_or_q3_org_in, p0_or_q0_org_in,          \
+                                 q3_or_p3_org_in, p1_or_q1_org_in,          \
+                                 p2_or_q2_org_in, q1_or_p1_org_in,          \
+                                 p0_or_q0_out, p1_or_q1_out, p2_or_q2_out)  \
+{                                                                           \
+    __m256i threshold;                                                      \
+    __m256i const2, const3 = __lasx_xvldi(0);                               \
+                                                                            \
+    const2 = __lasx_xvaddi_hu(const3, 2);                                   \
+    const3 = __lasx_xvaddi_hu(const3, 3);                                   \
+    threshold = __lasx_xvadd_h(p0_or_q0_org_in, q3_or_p3_org_in);           \
+    threshold = __lasx_xvadd_h(p1_or_q1_org_in, threshold);                 \
+                                                                            \
+    p0_or_q0_out = __lasx_xvslli_h(threshold, 1);                           \
+    p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_out, p2_or_q2_org_in);           \
+    p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_out, q1_or_p1_org_in);           \
+    p0_or_q0_out = __lasx_xvsrar_h(p0_or_q0_out, const3);                   \
+                                                                            \
+    p1_or_q1_out = __lasx_xvadd_h(p2_or_q2_org_in, threshold);              \
+    p1_or_q1_out = __lasx_xvsrar_h(p1_or_q1_out, const2);                   \
+                                                                            \
+    p2_or_q2_out = __lasx_xvmul_h(p2_or_q2_org_in, const3);                 \
+    p2_or_q2_out = __lasx_xvadd_h(p2_or_q2_out, p3_or_q3_org_in);           \
+    p2_or_q2_out = __lasx_xvadd_h(p2_or_q2_out, p3_or_q3_org_in);           \
+    p2_or_q2_out = __lasx_xvadd_h(p2_or_q2_out, threshold);                 \
+    p2_or_q2_out = __lasx_xvsrar_h(p2_or_q2_out, const3);                   \
+}
+
+/* p0' = (2 * p1 + p0 + q1 + 2) >> 2; q0' = (2 * q1 + q0 + p1 + 2) >> 2 */
+#define AVC_LPF_P0_OR_Q0(p0_or_q0_org_in, q1_or_p1_org_in,             \
+                         p1_or_q1_org_in, p0_or_q0_out)                \
+{                                                                      \
+    __m256i const2 = __lasx_xvldi(0);                                  \
+    const2 = __lasx_xvaddi_hu(const2, 2);                              \
+    p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_org_in, q1_or_p1_org_in);   \
+    p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_out, p1_or_q1_org_in);      \
+    p0_or_q0_out = __lasx_xvadd_h(p0_or_q0_out, p1_or_q1_org_in);      \
+    p0_or_q0_out = __lasx_xvsrar_h(p0_or_q0_out, const2);              \
+}
+
+void ff_h264_h_lpf_luma_intra_8_lasx(uint8_t *data, ptrdiff_t img_width,
+                                     int alpha_in, int beta_in)
+{
+    ptrdiff_t img_width_2x = img_width << 1;
+    ptrdiff_t img_width_4x = img_width << 2;
+    ptrdiff_t img_width_3x = img_width_2x + img_width;
+    uint8_t *src = data - 4;
+    __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta;
+    __m256i is_less_than, is_less_than_beta, is_less_than_alpha;
+    __m256i p3_org, p2_org, p1_org, p0_org, q0_org, q1_org, q2_org, q3_org;
+    __m256i zero = __lasx_xvldi(0);
+
+    {
+        __m256i row0, row1, row2, row3, row4, row5, row6, row7;
+        __m256i row8, row9, row10, row11, row12, row13, row14, row15;
+
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x,
+                  src, img_width_3x, row0, row1, row2, row3);
+        src += img_width_4x;
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x,
+                  src, img_width_3x, row4, row5, row6, row7);
+        src += img_width_4x;
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x,
+                  src, img_width_3x, row8, row9, row10, row11);
+        src += img_width_4x;
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x,
+                  src, img_width_3x, row12, row13, row14, row15);
+        src += img_width_4x;
+
+        LASX_TRANSPOSE16x8_B(row0, row1, row2, row3,
+                             row4, row5, row6, row7,
+                             row8, row9, row10, row11,
+                             row12, row13, row14, row15,
+                             p3_org, p2_org, p1_org, p0_org,
+                             q0_org, q1_org, q2_org, q3_org);
+    }
+
+    alpha = __lasx_xvreplgr2vr_b(alpha_in);
+    beta  = __lasx_xvreplgr2vr_b(beta_in);
+    p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org);
+    p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org);
+    q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org);
+
+    is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha);
+    is_less_than_beta  = __lasx_xvslt_bu(p1_asub_p0, beta);
+    is_less_than       = is_less_than_beta & is_less_than_alpha;
+    is_less_than_beta  = __lasx_xvslt_bu(q1_asub_q0, beta);
+    is_less_than       = is_less_than_beta & is_less_than;
+    is_less_than       = __lasx_xvpermi_q(zero, is_less_than, 0x30);
+
+    if (__lasx_xbnz_v(is_less_than)) {
+        __m256i p2_asub_p0, q2_asub_q0, p0_h, q0_h, negate_is_less_than_beta;
+        __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h;
+        __m256i less_alpha_shift2_add2 = __lasx_xvsrli_b(alpha, 2);
+
+        less_alpha_shift2_add2 = __lasx_xvaddi_bu(less_alpha_shift2_add2, 2);
+        less_alpha_shift2_add2 = __lasx_xvslt_bu(p0_asub_q0,
+                                                 less_alpha_shift2_add2);
+
+        p1_org_h = __lasx_vext2xv_hu_bu(p1_org);
+        p0_org_h = __lasx_vext2xv_hu_bu(p0_org);
+        q0_org_h = __lasx_vext2xv_hu_bu(q0_org);
+        q1_org_h = __lasx_vext2xv_hu_bu(q1_org);
+
+        p2_asub_p0               = __lasx_xvabsd_bu(p2_org, p0_org);
+        is_less_than_beta        = __lasx_xvslt_bu(p2_asub_p0, beta);
+        is_less_than_beta        = is_less_than_beta & less_alpha_shift2_add2;
+        negate_is_less_than_beta = __lasx_xvxori_b(is_less_than_beta, 0xff);
+        is_less_than_beta        = is_less_than_beta & is_less_than;
+        negate_is_less_than_beta = negate_is_less_than_beta & is_less_than;
+
+        /* filter and combine */
+        if (__lasx_xbnz_v(is_less_than_beta)) {
+            __m256i p2_org_h, p3_org_h, p1_h, p2_h;
+
+            p2_org_h   = __lasx_vext2xv_hu_bu(p2_org);
+            p3_org_h   = __lasx_vext2xv_hu_bu(p3_org);
+
+            AVC_LPF_P0P1P2_OR_Q0Q1Q2(p3_org_h, p0_org_h, q0_org_h, p1_org_h,
+                                     p2_org_h, q1_org_h, p0_h, p1_h, p2_h);
+
+            p0_h = __lasx_xvpickev_b(p0_h, p0_h);
+            p0_h = __lasx_xvpermi_d(p0_h, 0xd8);
+            DUP2_ARG2(__lasx_xvpickev_b, p1_h, p1_h, p2_h, p2_h, p1_h, p2_h);
+            DUP2_ARG2(__lasx_xvpermi_d, p1_h, 0xd8, p2_h, 0xd8, p1_h, p2_h);
+            p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than_beta);
+            p1_org = __lasx_xvbitsel_v(p1_org, p1_h, is_less_than_beta);
+            p2_org = __lasx_xvbitsel_v(p2_org, p2_h, is_less_than_beta);
+        }
+
+        AVC_LPF_P0_OR_Q0(p0_org_h, q1_org_h, p1_org_h, p0_h);
+        /* combine */
+        p0_h = __lasx_xvpickev_b(p0_h, p0_h);
+        p0_h = __lasx_xvpermi_d(p0_h, 0xd8);
+        p0_org = __lasx_xvbitsel_v(p0_org, p0_h, negate_is_less_than_beta);
+
+        /* q side: |p0 - q0| < (alpha >> 2) + 2 && |q2 - q0| < beta */
+        q2_asub_q0 = __lasx_xvabsd_bu(q2_org, q0_org);
+        is_less_than_beta = __lasx_xvslt_bu(q2_asub_q0, beta);
+        is_less_than_beta = is_less_than_beta & less_alpha_shift2_add2;
+        negate_is_less_than_beta = __lasx_xvxori_b(is_less_than_beta, 0xff);
+        is_less_than_beta = is_less_than_beta & is_less_than;
+        negate_is_less_than_beta = negate_is_less_than_beta & is_less_than;
+
+        /* filter and combine */
+        if (__lasx_xbnz_v(is_less_than_beta)) {
+            __m256i q2_org_h, q3_org_h, q1_h, q2_h;
+
+            q2_org_h   = __lasx_vext2xv_hu_bu(q2_org);
+            q3_org_h   = __lasx_vext2xv_hu_bu(q3_org);
+
+            AVC_LPF_P0P1P2_OR_Q0Q1Q2(q3_org_h, q0_org_h, p0_org_h, q1_org_h,
+                                     q2_org_h, p1_org_h, q0_h, q1_h, q2_h);
+
+            q0_h = __lasx_xvpickev_b(q0_h, q0_h);
+            q0_h = __lasx_xvpermi_d(q0_h, 0xd8);
+            DUP2_ARG2(__lasx_xvpickev_b, q1_h, q1_h, q2_h, q2_h, q1_h, q2_h);
+            DUP2_ARG2(__lasx_xvpermi_d, q1_h, 0xd8, q2_h, 0xd8, q1_h, q2_h);
+            q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than_beta);
+            q1_org = __lasx_xvbitsel_v(q1_org, q1_h, is_less_than_beta);
+            q2_org = __lasx_xvbitsel_v(q2_org, q2_h, is_less_than_beta);
+
+        }
+
+        AVC_LPF_P0_OR_Q0(q0_org_h, p1_org_h, q1_org_h, q0_h);
+
+        /* combine */
+        q0_h = __lasx_xvpickev_b(q0_h, q0_h);
+        q0_h = __lasx_xvpermi_d(q0_h, 0xd8);
+        q0_org = __lasx_xvbitsel_v(q0_org, q0_h, negate_is_less_than_beta);
+
+        /* transpose and store */
+        {
+            __m256i row0, row1, row2, row3, row4, row5, row6, row7;
+            __m256i control = {0x0000000400000000, 0x0000000500000001,
+                               0x0000000600000002, 0x0000000700000003};
+
+            DUP4_ARG3(__lasx_xvpermi_q, p0_org, q3_org, 0x02, p1_org, q2_org,
+                      0x02, p2_org, q1_org, 0x02, p3_org, q0_org, 0x02,
+                      p0_org, p1_org, p2_org, p3_org);
+            DUP2_ARG2(__lasx_xvilvl_b, p1_org, p3_org, p0_org, p2_org,
+                      row0, row2);
+            DUP2_ARG2(__lasx_xvilvh_b, p1_org, p3_org, p0_org, p2_org,
+                      row1, row3);
+            DUP2_ARG2(__lasx_xvilvl_b, row2, row0, row3, row1, row4, row6);
+            DUP2_ARG2(__lasx_xvilvh_b, row2, row0, row3, row1, row5, row7);
+            DUP4_ARG2(__lasx_xvperm_w, row4, control, row5, control, row6,
+                      control, row7, control, row4, row5, row6, row7);
+            src = data - 4;
+            __lasx_xvstelm_d(row4, src, 0, 0);
+            __lasx_xvstelm_d(row4, src + img_width, 0, 1);
+            src += img_width_2x;
+            __lasx_xvstelm_d(row4, src, 0, 2);
+            __lasx_xvstelm_d(row4, src + img_width, 0, 3);
+            src += img_width_2x;
+            __lasx_xvstelm_d(row5, src, 0, 0);
+            __lasx_xvstelm_d(row5, src + img_width, 0, 1);
+            src += img_width_2x;
+            __lasx_xvstelm_d(row5, src, 0, 2);
+            __lasx_xvstelm_d(row5, src + img_width, 0, 3);
+            src += img_width_2x;
+            __lasx_xvstelm_d(row6, src, 0, 0);
+            __lasx_xvstelm_d(row6, src + img_width, 0, 1);
+            src += img_width_2x;
+            __lasx_xvstelm_d(row6, src, 0, 2);
+            __lasx_xvstelm_d(row6, src + img_width, 0, 3);
+            src += img_width_2x;
+            __lasx_xvstelm_d(row7, src, 0, 0);
+            __lasx_xvstelm_d(row7, src + img_width, 0, 1);
+            src += img_width_2x;
+            __lasx_xvstelm_d(row7, src, 0, 2);
+            __lasx_xvstelm_d(row7, src + img_width, 0, 3);
+        }
+    }
+}
+
+void ff_h264_v_lpf_luma_intra_8_lasx(uint8_t *data, ptrdiff_t img_width,
+                                     int alpha_in, int beta_in)
+{
+    ptrdiff_t img_width_2x = img_width << 1;
+    ptrdiff_t img_width_3x = img_width_2x + img_width;
+    uint8_t *src = data - img_width_2x;
+    __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta;
+    __m256i is_less_than, is_less_than_beta, is_less_than_alpha;
+    __m256i p1_org, p0_org, q0_org, q1_org;
+    __m256i zero = __lasx_xvldi(0);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x,
+              src, img_width_3x, p1_org, p0_org, q0_org, q1_org);
+    alpha = __lasx_xvreplgr2vr_b(alpha_in);
+    beta  = __lasx_xvreplgr2vr_b(beta_in);
+    p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org);
+    p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org);
+    q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org);
+
+    is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha);
+    is_less_than_beta  = __lasx_xvslt_bu(p1_asub_p0, beta);
+    is_less_than       = is_less_than_beta & is_less_than_alpha;
+    is_less_than_beta  = __lasx_xvslt_bu(q1_asub_q0, beta);
+    is_less_than       = is_less_than_beta & is_less_than;
+    is_less_than       = __lasx_xvpermi_q(zero, is_less_than, 0x30);
+
+    if (__lasx_xbnz_v(is_less_than)) {
+        __m256i p2_asub_p0, q2_asub_q0, p0_h, q0_h, negate_is_less_than_beta;
+        __m256i p1_org_h, p0_org_h, q0_org_h, q1_org_h;
+        __m256i p2_org = __lasx_xvldx(src, -img_width);
+        __m256i q2_org = __lasx_xvldx(data, img_width_2x);
+        __m256i less_alpha_shift2_add2 = __lasx_xvsrli_b(alpha, 2);
+        less_alpha_shift2_add2 = __lasx_xvaddi_bu(less_alpha_shift2_add2, 2);
+        less_alpha_shift2_add2 = __lasx_xvslt_bu(p0_asub_q0,
+                                                 less_alpha_shift2_add2);
+
+        p1_org_h = __lasx_vext2xv_hu_bu(p1_org);
+        p0_org_h = __lasx_vext2xv_hu_bu(p0_org);
+        q0_org_h = __lasx_vext2xv_hu_bu(q0_org);
+        q1_org_h = __lasx_vext2xv_hu_bu(q1_org);
+
+        p2_asub_p0               = __lasx_xvabsd_bu(p2_org, p0_org);
+        is_less_than_beta        = __lasx_xvslt_bu(p2_asub_p0, beta);
+        is_less_than_beta        = is_less_than_beta & less_alpha_shift2_add2;
+        negate_is_less_than_beta = __lasx_xvxori_b(is_less_than_beta, 0xff);
+        is_less_than_beta        = is_less_than_beta & is_less_than;
+        negate_is_less_than_beta = negate_is_less_than_beta & is_less_than;
+
+        /* combine and store */
+        if (__lasx_xbnz_v(is_less_than_beta)) {
+            __m256i p2_org_h, p3_org_h, p1_h, p2_h;
+            __m256i p3_org = __lasx_xvldx(src, -img_width_2x);
+
+            p2_org_h   = __lasx_vext2xv_hu_bu(p2_org);
+            p3_org_h   = __lasx_vext2xv_hu_bu(p3_org);
+
+            AVC_LPF_P0P1P2_OR_Q0Q1Q2(p3_org_h, p0_org_h, q0_org_h, p1_org_h,
+                                     p2_org_h, q1_org_h, p0_h, p1_h, p2_h);
+
+            p0_h = __lasx_xvpickev_b(p0_h, p0_h);
+            p0_h = __lasx_xvpermi_d(p0_h, 0xd8);
+            DUP2_ARG2(__lasx_xvpickev_b, p1_h, p1_h, p2_h, p2_h, p1_h, p2_h);
+            DUP2_ARG2(__lasx_xvpermi_d, p1_h, 0xd8, p2_h, 0xd8, p1_h, p2_h);
+            p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than_beta);
+            p1_org = __lasx_xvbitsel_v(p1_org, p1_h, is_less_than_beta);
+            p2_org = __lasx_xvbitsel_v(p2_org, p2_h, is_less_than_beta);
+
+            __lasx_xvst(p1_org, src, 0);
+            __lasx_xvst(p2_org, src - img_width, 0);
+        }
+
+        AVC_LPF_P0_OR_Q0(p0_org_h, q1_org_h, p1_org_h, p0_h);
+        /* combine */
+        p0_h = __lasx_xvpickev_b(p0_h, p0_h);
+        p0_h = __lasx_xvpermi_d(p0_h, 0xd8);
+        p0_org = __lasx_xvbitsel_v(p0_org, p0_h, negate_is_less_than_beta);
+        __lasx_xvst(p0_org, data - img_width, 0);
+
+        /* if (|p0 - q0| < (alpha >> 2) + 2 && |q2 - q0| < beta) */
+        q2_asub_q0 = __lasx_xvabsd_bu(q2_org, q0_org);
+        is_less_than_beta = __lasx_xvslt_bu(q2_asub_q0, beta);
+        is_less_than_beta = is_less_than_beta & less_alpha_shift2_add2;
+        negate_is_less_than_beta = __lasx_xvxori_b(is_less_than_beta, 0xff);
+        is_less_than_beta = is_less_than_beta & is_less_than;
+        negate_is_less_than_beta = negate_is_less_than_beta & is_less_than;
+
+        /* combine and store */
+        if (__lasx_xbnz_v(is_less_than_beta)) {
+            __m256i q2_org_h, q3_org_h, q1_h, q2_h;
+            __m256i q3_org = __lasx_xvldx(data, img_width_2x + img_width);
+
+            q2_org_h   = __lasx_vext2xv_hu_bu(q2_org);
+            q3_org_h   = __lasx_vext2xv_hu_bu(q3_org);
+
+            AVC_LPF_P0P1P2_OR_Q0Q1Q2(q3_org_h, q0_org_h, p0_org_h, q1_org_h,
+                                     q2_org_h, p1_org_h, q0_h, q1_h, q2_h);
+
+            q0_h = __lasx_xvpickev_b(q0_h, q0_h);
+            q0_h = __lasx_xvpermi_d(q0_h, 0xd8);
+            DUP2_ARG2(__lasx_xvpickev_b, q1_h, q1_h, q2_h, q2_h, q1_h, q2_h);
+            DUP2_ARG2(__lasx_xvpermi_d, q1_h, 0xd8, q2_h, 0xd8, q1_h, q2_h);
+            q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than_beta);
+            q1_org = __lasx_xvbitsel_v(q1_org, q1_h, is_less_than_beta);
+            q2_org = __lasx_xvbitsel_v(q2_org, q2_h, is_less_than_beta);
+
+            __lasx_xvst(q1_org, data + img_width, 0);
+            __lasx_xvst(q2_org, data + img_width_2x, 0);
+        }
+
+        AVC_LPF_P0_OR_Q0(q0_org_h, p1_org_h, q1_org_h, q0_h);
+
+        /* combine */
+        q0_h = __lasx_xvpickev_b(q0_h, q0_h);
+        q0_h = __lasx_xvpermi_d(q0_h, 0xd8);
+        q0_org = __lasx_xvbitsel_v(q0_org, q0_h, negate_is_less_than_beta);
+
+        __lasx_xvst(q0_org, data, 0);
+    }
+}
+
+void ff_h264_h_lpf_chroma_intra_8_lasx(uint8_t *data, ptrdiff_t img_width,
+                                       int alpha_in, int beta_in)
+{
+    uint8_t *src = data - 2;
+    ptrdiff_t img_width_2x = img_width << 1;
+    ptrdiff_t img_width_4x = img_width << 2;
+    ptrdiff_t img_width_3x = img_width_2x + img_width;
+    __m256i p1_org, p0_org, q0_org, q1_org;
+    __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta;
+    __m256i is_less_than, is_less_than_beta, is_less_than_alpha;
+
+    {
+        __m256i row0, row1, row2, row3, row4, row5, row6, row7;
+
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, src,
+                  img_width_3x, row0, row1, row2, row3);
+        src += img_width_4x;
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, img_width, src, img_width_2x, src,
+                  img_width_3x, row4, row5, row6, row7);
+
+        /* LASX_TRANSPOSE8x4_B */
+        DUP4_ARG2(__lasx_xvilvl_b, row2, row0, row3, row1, row6, row4, row7, row5,
+                  p1_org, p0_org, q0_org, q1_org);
+        row0 = __lasx_xvilvl_b(p0_org, p1_org);
+        row1 = __lasx_xvilvl_b(q1_org, q0_org);
+        row3 = __lasx_xvilvh_w(row1, row0);
+        row2 = __lasx_xvilvl_w(row1, row0);
+        p1_org = __lasx_xvpermi_d(row2, 0x00);
+        p0_org = __lasx_xvpermi_d(row2, 0x55);
+        q0_org = __lasx_xvpermi_d(row3, 0x00);
+        q1_org = __lasx_xvpermi_d(row3, 0x55);
+    }
+
+    alpha = __lasx_xvreplgr2vr_b(alpha_in);
+    beta  = __lasx_xvreplgr2vr_b(beta_in);
+
+    p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org);
+    p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org);
+    q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org);
+
+    is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha);
+    is_less_than_beta  = __lasx_xvslt_bu(p1_asub_p0, beta);
+    is_less_than       = is_less_than_alpha & is_less_than_beta;
+    is_less_than_beta  = __lasx_xvslt_bu(q1_asub_q0, beta);
+    is_less_than       = is_less_than_beta & is_less_than;
+
+    if (__lasx_xbnz_v(is_less_than)) {
+        __m256i p0_h, q0_h, p1_org_h, p0_org_h, q0_org_h, q1_org_h;
+
+        p1_org_h = __lasx_vext2xv_hu_bu(p1_org);
+        p0_org_h = __lasx_vext2xv_hu_bu(p0_org);
+        q0_org_h = __lasx_vext2xv_hu_bu(q0_org);
+        q1_org_h = __lasx_vext2xv_hu_bu(q1_org);
+
+        AVC_LPF_P0_OR_Q0(p0_org_h, q1_org_h, p1_org_h, p0_h);
+        AVC_LPF_P0_OR_Q0(q0_org_h, p1_org_h, q1_org_h, q0_h);
+        DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h, p0_h, q0_h);
+        DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8, p0_h, q0_h);
+        p0_org = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than);
+        q0_org = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than);
+    }
+    p0_org = __lasx_xvilvl_b(q0_org, p0_org);
+    src = data - 1;
+    __lasx_xvstelm_h(p0_org, src, 0, 0);
+    src += img_width;
+    __lasx_xvstelm_h(p0_org, src, 0, 1);
+    src += img_width;
+    __lasx_xvstelm_h(p0_org, src, 0, 2);
+    src += img_width;
+    __lasx_xvstelm_h(p0_org, src, 0, 3);
+    src += img_width;
+    __lasx_xvstelm_h(p0_org, src, 0, 4);
+    src += img_width;
+    __lasx_xvstelm_h(p0_org, src, 0, 5);
+    src += img_width;
+    __lasx_xvstelm_h(p0_org, src, 0, 6);
+    src += img_width;
+    __lasx_xvstelm_h(p0_org, src, 0, 7);
+}
+
+void ff_h264_v_lpf_chroma_intra_8_lasx(uint8_t *data, ptrdiff_t img_width,
+                                       int alpha_in, int beta_in)
+{
+    ptrdiff_t img_width_2x = img_width << 1;
+    __m256i p1_org, p0_org, q0_org, q1_org;
+    __m256i p0_asub_q0, p1_asub_p0, q1_asub_q0, alpha, beta;
+    __m256i is_less_than, is_less_than_beta, is_less_than_alpha;
+
+    alpha = __lasx_xvreplgr2vr_b(alpha_in);
+    beta  = __lasx_xvreplgr2vr_b(beta_in);
+
+    p1_org = __lasx_xvldx(data, -img_width_2x);
+    p0_org = __lasx_xvldx(data, -img_width);
+    DUP2_ARG2(__lasx_xvldx, data, 0, data, img_width, q0_org, q1_org);
+
+    p0_asub_q0 = __lasx_xvabsd_bu(p0_org, q0_org);
+    p1_asub_p0 = __lasx_xvabsd_bu(p1_org, p0_org);
+    q1_asub_q0 = __lasx_xvabsd_bu(q1_org, q0_org);
+
+    is_less_than_alpha = __lasx_xvslt_bu(p0_asub_q0, alpha);
+    is_less_than_beta  = __lasx_xvslt_bu(p1_asub_p0, beta);
+    is_less_than       = is_less_than_alpha & is_less_than_beta;
+    is_less_than_beta  = __lasx_xvslt_bu(q1_asub_q0, beta);
+    is_less_than       = is_less_than_beta & is_less_than;
+
+    if (__lasx_xbnz_v(is_less_than)) {
+        __m256i p0_h, q0_h, p1_org_h, p0_org_h, q0_org_h, q1_org_h;
+
+        p1_org_h = __lasx_vext2xv_hu_bu(p1_org);
+        p0_org_h = __lasx_vext2xv_hu_bu(p0_org);
+        q0_org_h = __lasx_vext2xv_hu_bu(q0_org);
+        q1_org_h = __lasx_vext2xv_hu_bu(q1_org);
+
+        AVC_LPF_P0_OR_Q0(p0_org_h, q1_org_h, p1_org_h, p0_h);
+        AVC_LPF_P0_OR_Q0(q0_org_h, p1_org_h, q1_org_h, q0_h);
+        DUP2_ARG2(__lasx_xvpickev_b, p0_h, p0_h, q0_h, q0_h, p0_h, q0_h);
+        DUP2_ARG2(__lasx_xvpermi_d, p0_h, 0xd8, q0_h, 0xd8, p0_h, q0_h);
+        p0_h = __lasx_xvbitsel_v(p0_org, p0_h, is_less_than);
+        q0_h = __lasx_xvbitsel_v(q0_org, q0_h, is_less_than);
+        __lasx_xvstelm_d(p0_h, data - img_width, 0, 0);
+        __lasx_xvstelm_d(q0_h, data, 0, 0);
+    }
+}
+
+void ff_biweight_h264_pixels16_8_lasx(uint8_t *dst, uint8_t *src,
+                                      ptrdiff_t stride, int height,
+                                      int log2_denom, int weight_dst,
+                                      int weight_src, int offset_in)
+{
+    __m256i wgt;
+    __m256i src0, src1, src2, src3;
+    __m256i dst0, dst1, dst2, dst3;
+    __m256i vec0, vec1, vec2, vec3, vec4, vec5, vec6, vec7;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
+    __m256i denom, offset;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_4x = stride << 2;
+    ptrdiff_t stride_3x = stride_2x + stride;
+
+    offset_in   = (unsigned) ((offset_in + 1) | 1) << log2_denom;
+    offset_in  += ((weight_src + weight_dst) << 7);
+    log2_denom += 1;
+
+    tmp0   = __lasx_xvreplgr2vr_b(weight_src);
+    tmp1   = __lasx_xvreplgr2vr_b(weight_dst);
+    wgt    = __lasx_xvilvh_b(tmp1, tmp0);
+    offset = __lasx_xvreplgr2vr_h(offset_in);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp4, tmp5, tmp6, tmp7);
+    src += stride_4x;
+    DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5, tmp4,
+              0x20, tmp7, tmp6, 0x20, src0, src1, src2, src3);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x,
+              dst, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    dst += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x,
+              dst, stride_3x, tmp4, tmp5, tmp6, tmp7);
+    dst -= stride_4x;
+    DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5, tmp4,
+              0x20, tmp7, tmp6, 0x20, dst0, dst1, dst2, dst3);
+
+    DUP4_ARG2(__lasx_xvxori_b, src0, 128, src1, 128, src2, 128, src3, 128,
+              src0, src1, src2, src3);
+    DUP4_ARG2(__lasx_xvxori_b, dst0, 128, dst1, 128, dst2, 128, dst3, 128,
+              dst0, dst1, dst2, dst3);
+    DUP4_ARG2(__lasx_xvilvl_b, dst0, src0, dst1, src1, dst2, src2,
+              dst3, src3, vec0, vec2, vec4, vec6);
+    DUP4_ARG2(__lasx_xvilvh_b, dst0, src0, dst1, src1, dst2, src2,
+              dst3, src3, vec1, vec3, vec5, vec7);
+
+    DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1,
+              offset, wgt, vec2, offset, wgt, vec3, tmp0, tmp1, tmp2, tmp3);
+    DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec4, offset, wgt, vec5,
+              offset, wgt, vec6, offset, wgt, vec7, tmp4, tmp5, tmp6, tmp7);
+
+    tmp0 = __lasx_xvsra_h(tmp0, denom);
+    tmp1 = __lasx_xvsra_h(tmp1, denom);
+    tmp2 = __lasx_xvsra_h(tmp2, denom);
+    tmp3 = __lasx_xvsra_h(tmp3, denom);
+    tmp4 = __lasx_xvsra_h(tmp4, denom);
+    tmp5 = __lasx_xvsra_h(tmp5, denom);
+    tmp6 = __lasx_xvsra_h(tmp6, denom);
+    tmp7 = __lasx_xvsra_h(tmp7, denom);
+
+    DUP4_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp2, tmp3,
+                                  tmp0, tmp1, tmp2, tmp3);
+    DUP4_ARG1(__lasx_xvclip255_h, tmp4, tmp5, tmp6, tmp7,
+                                  tmp4, tmp5, tmp6, tmp7);
+    DUP4_ARG2(__lasx_xvpickev_b, tmp1, tmp0, tmp3, tmp2, tmp5, tmp4, tmp7, tmp6,
+              dst0, dst1, dst2, dst3);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst, 8, 1);
+    dst += stride;
+    __lasx_xvstelm_d(dst0, dst, 0, 2);
+    __lasx_xvstelm_d(dst0, dst, 8, 3);
+    dst += stride;
+    __lasx_xvstelm_d(dst1, dst, 0, 0);
+    __lasx_xvstelm_d(dst1, dst, 8, 1);
+    dst += stride;
+    __lasx_xvstelm_d(dst1, dst, 0, 2);
+    __lasx_xvstelm_d(dst1, dst, 8, 3);
+    dst += stride;
+    __lasx_xvstelm_d(dst2, dst, 0, 0);
+    __lasx_xvstelm_d(dst2, dst, 8, 1);
+    dst += stride;
+    __lasx_xvstelm_d(dst2, dst, 0, 2);
+    __lasx_xvstelm_d(dst2, dst, 8, 3);
+    dst += stride;
+    __lasx_xvstelm_d(dst3, dst, 0, 0);
+    __lasx_xvstelm_d(dst3, dst, 8, 1);
+    dst += stride;
+    __lasx_xvstelm_d(dst3, dst, 0, 2);
+    __lasx_xvstelm_d(dst3, dst, 8, 3);
+    dst += stride;
+
+    if (16 == height) {
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+                  src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+        src += stride_4x;
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+                  src, stride_3x, tmp4, tmp5, tmp6, tmp7);
+        src += stride_4x;
+        DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5,
+                  tmp4, 0x20, tmp7, tmp6, 0x20, src0, src1, src2, src3);
+        DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x,
+                  dst, stride_3x, tmp0, tmp1, tmp2, tmp3);
+        dst += stride_4x;
+        DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x,
+                  dst, stride_3x, tmp4, tmp5, tmp6, tmp7);
+        dst -= stride_4x;
+        DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5,
+                  tmp4, 0x20, tmp7, tmp6, 0x20, dst0, dst1, dst2, dst3);
+
+        DUP4_ARG2(__lasx_xvxori_b, src0, 128, src1, 128, src2, 128, src3, 128,
+                  src0, src1, src2, src3);
+        DUP4_ARG2(__lasx_xvxori_b, dst0, 128, dst1, 128, dst2, 128, dst3, 128,
+                  dst0, dst1, dst2, dst3);
+        DUP4_ARG2(__lasx_xvilvl_b, dst0, src0, dst1, src1, dst2, src2,
+                  dst3, src3, vec0, vec2, vec4, vec6);
+        DUP4_ARG2(__lasx_xvilvh_b, dst0, src0, dst1, src1, dst2, src2,
+                  dst3, src3, vec1, vec3, vec5, vec7);
+
+        DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1,
+                  offset, wgt, vec2, offset, wgt, vec3, tmp0, tmp1, tmp2, tmp3);
+        DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec4, offset, wgt, vec5,
+                  offset, wgt, vec6, offset, wgt, vec7, tmp4, tmp5, tmp6, tmp7);
+
+        tmp0 = __lasx_xvsra_h(tmp0, denom);
+        tmp1 = __lasx_xvsra_h(tmp1, denom);
+        tmp2 = __lasx_xvsra_h(tmp2, denom);
+        tmp3 = __lasx_xvsra_h(tmp3, denom);
+        tmp4 = __lasx_xvsra_h(tmp4, denom);
+        tmp5 = __lasx_xvsra_h(tmp5, denom);
+        tmp6 = __lasx_xvsra_h(tmp6, denom);
+        tmp7 = __lasx_xvsra_h(tmp7, denom);
+
+        DUP4_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp2, tmp3,
+                                      tmp0, tmp1, tmp2, tmp3);
+        DUP4_ARG1(__lasx_xvclip255_h, tmp4, tmp5, tmp6, tmp7,
+                                      tmp4, tmp5, tmp6, tmp7);
+        DUP4_ARG2(__lasx_xvpickev_b, tmp1, tmp0, tmp3, tmp2, tmp5, tmp4, tmp7,
+                  tmp6, dst0, dst1, dst2, dst3);
+        __lasx_xvstelm_d(dst0, dst, 0, 0);
+        __lasx_xvstelm_d(dst0, dst, 8, 1);
+        dst += stride;
+        __lasx_xvstelm_d(dst0, dst, 0, 2);
+        __lasx_xvstelm_d(dst0, dst, 8, 3);
+        dst += stride;
+        __lasx_xvstelm_d(dst1, dst, 0, 0);
+        __lasx_xvstelm_d(dst1, dst, 8, 1);
+        dst += stride;
+        __lasx_xvstelm_d(dst1, dst, 0, 2);
+        __lasx_xvstelm_d(dst1, dst, 8, 3);
+        dst += stride;
+        __lasx_xvstelm_d(dst2, dst, 0, 0);
+        __lasx_xvstelm_d(dst2, dst, 8, 1);
+        dst += stride;
+        __lasx_xvstelm_d(dst2, dst, 0, 2);
+        __lasx_xvstelm_d(dst2, dst, 8, 3);
+        dst += stride;
+        __lasx_xvstelm_d(dst3, dst, 0, 0);
+        __lasx_xvstelm_d(dst3, dst, 8, 1);
+        dst += stride;
+        __lasx_xvstelm_d(dst3, dst, 0, 2);
+        __lasx_xvstelm_d(dst3, dst, 8, 3);
+    }
+}
+
+static void avc_biwgt_8x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                               int32_t log2_denom, int32_t weight_src,
+                               int32_t weight_dst, int32_t offset_in)
+{
+    __m256i wgt, vec0, vec1;
+    __m256i src0, dst0;
+    __m256i tmp0, tmp1, tmp2, tmp3, denom, offset;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+
+    offset_in   = (unsigned) ((offset_in + 1) | 1) << log2_denom;
+    offset_in  += ((weight_src + weight_dst) << 7);
+    log2_denom += 1;
+
+    tmp0   = __lasx_xvreplgr2vr_b(weight_src);
+    tmp1   = __lasx_xvreplgr2vr_b(weight_dst);
+    wgt    = __lasx_xvilvh_b(tmp1, tmp0);
+    offset = __lasx_xvreplgr2vr_h(offset_in);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x,
+              dst, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    dst0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP2_ARG2(__lasx_xvxori_b, src0, 128, dst0, 128, src0, dst0);
+    vec0 = __lasx_xvilvl_b(dst0, src0);
+    vec1 = __lasx_xvilvh_b(dst0, src0);
+    DUP2_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1,
+              tmp0, tmp1);
+    tmp0 = __lasx_xvsra_h(tmp0, denom);
+    tmp1 = __lasx_xvsra_h(tmp1, denom);
+    DUP2_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp0, tmp1);
+    dst0 = __lasx_xvpickev_b(tmp1, tmp0);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3);
+}
+
+static void avc_biwgt_8x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                               int32_t log2_denom, int32_t weight_src,
+                               int32_t weight_dst, int32_t offset_in)
+{
+    __m256i wgt, vec0, vec1, vec2, vec3;
+    __m256i src0, src1, dst0, dst1;
+    __m256i tmp0, tmp1, tmp2, tmp3, denom, offset;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_4x = stride << 2;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    uint8_t *dst_tmp = dst;
+
+    offset_in   = (unsigned) ((offset_in + 1) | 1) << log2_denom;
+    offset_in  += ((weight_src + weight_dst) << 7);
+    log2_denom += 1;
+
+    tmp0   = __lasx_xvreplgr2vr_b(weight_src);
+    tmp1   = __lasx_xvreplgr2vr_b(weight_dst);
+    wgt    = __lasx_xvilvh_b(tmp1, tmp0);
+    offset = __lasx_xvreplgr2vr_h(offset_in);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    tmp0 = __lasx_xvld(dst_tmp, 0);
+    DUP2_ARG2(__lasx_xvldx, dst_tmp, stride, dst_tmp, stride_2x, tmp1, tmp2);
+    tmp3 = __lasx_xvldx(dst_tmp, stride_3x);
+    dst_tmp += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    dst0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x,
+              dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    dst1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+
+    DUP4_ARG2(__lasx_xvxori_b, src0, 128, src1, 128, dst0, 128, dst1, 128,
+              src0, src1, dst0, dst1);
+    DUP2_ARG2(__lasx_xvilvl_b, dst0, src0, dst1, src1, vec0, vec2);
+    DUP2_ARG2(__lasx_xvilvh_b, dst0, src0, dst1, src1, vec1, vec3);
+    DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1,
+              offset, wgt, vec2, offset, wgt, vec3, tmp0, tmp1, tmp2, tmp3);
+    tmp0 = __lasx_xvsra_h(tmp0, denom);
+    tmp1 = __lasx_xvsra_h(tmp1, denom);
+    tmp2 = __lasx_xvsra_h(tmp2, denom);
+    tmp3 = __lasx_xvsra_h(tmp3, denom);
+    DUP4_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp2, tmp3,
+                                  tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvpickev_b, tmp1, tmp0, tmp3, tmp2, dst0, dst1);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(dst1, dst, 0, 0);
+    __lasx_xvstelm_d(dst1, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst1, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst1, dst + stride_3x, 0, 3);
+}
+
+static void avc_biwgt_8x16_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                                int32_t log2_denom, int32_t weight_src,
+                                int32_t weight_dst, int32_t offset_in)
+{
+    __m256i wgt, vec0, vec1, vec2, vec3, vec4, vec5, vec6, vec7;
+    __m256i src0, src1, src2, src3, dst0, dst1, dst2, dst3;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, denom, offset;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_4x = stride << 2;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    uint8_t *dst_tmp = dst;
+
+    offset_in   = (unsigned) ((offset_in + 1) | 1) << log2_denom;
+    offset_in  += ((weight_src + weight_dst) << 7);
+    log2_denom += 1;
+
+    tmp0   = __lasx_xvreplgr2vr_b(weight_src);
+    tmp1   = __lasx_xvreplgr2vr_b(weight_dst);
+    wgt    = __lasx_xvilvh_b(tmp1, tmp0);
+    offset = __lasx_xvreplgr2vr_h(offset_in);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src2 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src3 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+
+    DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x,
+              dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    dst_tmp += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    dst0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x,
+              dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    dst_tmp += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    dst1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x,
+              dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    dst_tmp += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    dst2 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, dst_tmp, 0, dst_tmp, stride, dst_tmp, stride_2x,
+              dst_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    dst3 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+
+    DUP4_ARG2(__lasx_xvxori_b, src0, 128, src1, 128, src2, 128, src3, 128,
+              src0, src1, src2, src3);
+    DUP4_ARG2(__lasx_xvxori_b, dst0, 128, dst1, 128, dst2, 128, dst3, 128,
+              dst0, dst1, dst2, dst3);
+    DUP4_ARG2(__lasx_xvilvl_b, dst0, src0, dst1, src1, dst2, src2,
+              dst3, src3, vec0, vec2, vec4, vec6);
+    DUP4_ARG2(__lasx_xvilvh_b, dst0, src0, dst1, src1, dst2, src2,
+              dst3, src3, vec1, vec3, vec5, vec7);
+    DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1,
+              offset, wgt, vec2, offset, wgt, vec3, tmp0, tmp1, tmp2, tmp3);
+    DUP4_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec4, offset, wgt, vec5,
+              offset, wgt, vec6, offset, wgt, vec7, tmp4, tmp5, tmp6, tmp7);
+    tmp0 = __lasx_xvsra_h(tmp0, denom);
+    tmp1 = __lasx_xvsra_h(tmp1, denom);
+    tmp2 = __lasx_xvsra_h(tmp2, denom);
+    tmp3 = __lasx_xvsra_h(tmp3, denom);
+    tmp4 = __lasx_xvsra_h(tmp4, denom);
+    tmp5 = __lasx_xvsra_h(tmp5, denom);
+    tmp6 = __lasx_xvsra_h(tmp6, denom);
+    tmp7 = __lasx_xvsra_h(tmp7, denom);
+    DUP4_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp2, tmp3,
+                                  tmp0, tmp1, tmp2, tmp3);
+    DUP4_ARG1(__lasx_xvclip255_h, tmp4, tmp5, tmp6, tmp7,
+                                  tmp4, tmp5, tmp6, tmp7);
+    DUP4_ARG2(__lasx_xvpickev_b, tmp1, tmp0, tmp3, tmp2, tmp5, tmp4, tmp7, tmp6,
+              dst0, dst1, dst2, dst3);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst0, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst0, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(dst1, dst, 0, 0);
+    __lasx_xvstelm_d(dst1, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst1, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst1, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(dst2, dst, 0, 0);
+    __lasx_xvstelm_d(dst2, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst2, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst2, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_d(dst3, dst, 0, 0);
+    __lasx_xvstelm_d(dst3, dst + stride, 0, 1);
+    __lasx_xvstelm_d(dst3, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_d(dst3, dst + stride_3x, 0, 3);
+}
+
+void ff_biweight_h264_pixels8_8_lasx(uint8_t *dst, uint8_t *src,
+                                     ptrdiff_t stride, int height,
+                                     int log2_denom, int weight_dst,
+                                     int weight_src, int offset)
+{
+    if (4 == height) {
+        avc_biwgt_8x4_lasx(src, dst, stride, log2_denom, weight_src, weight_dst,
+                           offset);
+    } else if (8 == height) {
+        avc_biwgt_8x8_lasx(src, dst, stride, log2_denom, weight_src, weight_dst,
+                           offset);
+    } else {
+        avc_biwgt_8x16_lasx(src, dst, stride, log2_denom, weight_src, weight_dst,
+                            offset);
+    }
+}
+
+static void avc_biwgt_4x2_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                               int32_t log2_denom, int32_t weight_src,
+                               int32_t weight_dst, int32_t offset_in)
+{
+    __m256i wgt, vec0;
+    __m256i src0, dst0;
+    __m256i tmp0, tmp1, denom, offset;
+
+    offset_in   = (unsigned) ((offset_in + 1) | 1) << log2_denom;
+    offset_in  += ((weight_src + weight_dst) << 7);
+    log2_denom += 1;
+
+    tmp0   = __lasx_xvreplgr2vr_b(weight_src);
+    tmp1   = __lasx_xvreplgr2vr_b(weight_dst);
+    wgt    = __lasx_xvilvh_b(tmp1, tmp0);
+    offset = __lasx_xvreplgr2vr_h(offset_in);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP2_ARG2(__lasx_xvldx, src, 0, src, stride, tmp0, tmp1);
+    src0 = __lasx_xvilvl_w(tmp1, tmp0);
+    DUP2_ARG2(__lasx_xvldx, dst, 0, dst, stride, tmp0, tmp1);
+    dst0 = __lasx_xvilvl_w(tmp1, tmp0);
+    DUP2_ARG2(__lasx_xvxori_b, src0, 128, dst0, 128, src0, dst0);
+    vec0 = __lasx_xvilvl_b(dst0, src0);
+    tmp0 = __lasx_xvdp2add_h_b(offset, wgt, vec0);
+    tmp0 = __lasx_xvsra_h(tmp0, denom);
+    tmp0 = __lasx_xvclip255_h(tmp0);
+    tmp0 = __lasx_xvpickev_b(tmp0, tmp0);
+    __lasx_xvstelm_w(tmp0, dst, 0, 0);
+    __lasx_xvstelm_w(tmp0, dst + stride, 0, 1);
+}
+
+static void avc_biwgt_4x4_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                               int32_t log2_denom, int32_t weight_src,
+                               int32_t weight_dst, int32_t offset_in)
+{
+    __m256i wgt, vec0;
+    __m256i src0, dst0;
+    __m256i tmp0, tmp1, tmp2, tmp3, denom, offset;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+
+    offset_in   = (unsigned) ((offset_in + 1) | 1) << log2_denom;
+    offset_in  += ((weight_src + weight_dst) << 7);
+    log2_denom += 1;
+
+    tmp0   = __lasx_xvreplgr2vr_b(weight_src);
+    tmp1   = __lasx_xvreplgr2vr_b(weight_dst);
+    wgt    = __lasx_xvilvh_b(tmp1, tmp0);
+    offset = __lasx_xvreplgr2vr_h(offset_in);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp0, tmp1);
+    src0 = __lasx_xvilvl_w(tmp1, tmp0);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x,
+              dst, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp0, tmp1);
+    dst0 = __lasx_xvilvl_w(tmp1, tmp0);
+    DUP2_ARG2(__lasx_xvxori_b, src0, 128, dst0, 128, src0, dst0);
+    vec0 = __lasx_xvilvl_b(dst0, src0);
+    dst0 = __lasx_xvilvh_b(dst0, src0);
+    vec0 = __lasx_xvpermi_q(vec0, dst0, 0x02);
+    tmp0 = __lasx_xvdp2add_h_b(offset, wgt, vec0);
+    tmp0 = __lasx_xvsra_h(tmp0, denom);
+    tmp0 = __lasx_xvclip255_h(tmp0);
+    tmp0 = __lasx_xvpickev_b(tmp0, tmp0);
+    __lasx_xvstelm_w(tmp0, dst, 0, 0);
+    __lasx_xvstelm_w(tmp0, dst + stride, 0, 1);
+    __lasx_xvstelm_w(tmp0, dst + stride_2x, 0, 4);
+    __lasx_xvstelm_w(tmp0, dst + stride_3x, 0, 5);
+}
+
+static void avc_biwgt_4x8_lasx(uint8_t *src, uint8_t *dst, ptrdiff_t stride,
+                               int32_t log2_denom, int32_t weight_src,
+                               int32_t weight_dst, int32_t offset_in)
+{
+    __m256i wgt, vec0, vec1;
+    __m256i src0, dst0;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, denom, offset;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_4x = stride << 2;
+    ptrdiff_t stride_3x = stride_2x + stride;
+
+    offset_in   = (unsigned) ((offset_in + 1) | 1) << log2_denom;
+    offset_in  += ((weight_src + weight_dst) << 7);
+    log2_denom += 1;
+
+    tmp0   = __lasx_xvreplgr2vr_b(weight_src);
+    tmp1   = __lasx_xvreplgr2vr_b(weight_dst);
+    wgt    = __lasx_xvilvh_b(tmp1, tmp0);
+    offset = __lasx_xvreplgr2vr_h(offset_in);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp4, tmp5, tmp6, tmp7);
+    DUP4_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp6, tmp4, tmp7, tmp5,
+              tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_w, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x,
+              dst, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    dst += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, stride, dst, stride_2x,
+              dst, stride_3x, tmp4, tmp5, tmp6, tmp7);
+    dst -= stride_4x;
+    DUP4_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp6, tmp4, tmp7, tmp5,
+              tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_w, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    dst0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP2_ARG2(__lasx_xvxori_b, src0, 128, dst0, 128, src0, dst0);
+    vec0 = __lasx_xvilvl_b(dst0, src0);
+    vec1 = __lasx_xvilvh_b(dst0, src0);
+    DUP2_ARG3(__lasx_xvdp2add_h_b, offset, wgt, vec0, offset, wgt, vec1,
+              tmp0, tmp1);
+    tmp0 = __lasx_xvsra_h(tmp0, denom);
+    tmp1 = __lasx_xvsra_h(tmp1, denom);
+    DUP2_ARG1(__lasx_xvclip255_h, tmp0, tmp1, tmp0, tmp1);
+    tmp0 = __lasx_xvpickev_b(tmp1, tmp0);
+    __lasx_xvstelm_w(tmp0, dst, 0, 0);
+    __lasx_xvstelm_w(tmp0, dst + stride, 0, 1);
+    __lasx_xvstelm_w(tmp0, dst + stride_2x, 0, 2);
+    __lasx_xvstelm_w(tmp0, dst + stride_3x, 0, 3);
+    dst += stride_4x;
+    __lasx_xvstelm_w(tmp0, dst, 0, 4);
+    __lasx_xvstelm_w(tmp0, dst + stride, 0, 5);
+    __lasx_xvstelm_w(tmp0, dst + stride_2x, 0, 6);
+    __lasx_xvstelm_w(tmp0, dst + stride_3x, 0, 7);
+}
+
+void ff_biweight_h264_pixels4_8_lasx(uint8_t *dst, uint8_t *src,
+                                     ptrdiff_t stride, int height,
+                                     int log2_denom, int weight_dst,
+                                     int weight_src, int offset)
+{
+    if (2 == height) {
+        avc_biwgt_4x2_lasx(src, dst, stride, log2_denom, weight_src,
+                           weight_dst, offset);
+    } else if (4 == height) {
+        avc_biwgt_4x4_lasx(src, dst, stride, log2_denom, weight_src,
+                           weight_dst, offset);
+    } else {
+        avc_biwgt_4x8_lasx(src, dst, stride, log2_denom, weight_src,
+                           weight_dst, offset);
+    }
+}
+
+void ff_weight_h264_pixels16_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                    int height, int log2_denom,
+                                    int weight_src, int offset_in)
+{
+    uint32_t offset_val;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_4x = stride << 2;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    __m256i zero = __lasx_xvldi(0);
+    __m256i src0, src1, src2, src3;
+    __m256i src0_l, src1_l, src2_l, src3_l, src0_h, src1_h, src2_h, src3_h;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
+    __m256i wgt, denom, offset;
+
+    offset_val = (unsigned) offset_in << log2_denom;
+
+    wgt    = __lasx_xvreplgr2vr_h(weight_src);
+    offset = __lasx_xvreplgr2vr_h(offset_val);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp4, tmp5, tmp6, tmp7);
+    src -= stride_4x;
+    DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5, tmp4,
+              0x20, tmp7, tmp6, 0x20, src0, src1, src2, src3);
+    DUP4_ARG2(__lasx_xvilvl_b, zero, src0, zero, src1, zero, src2,
+              zero, src3, src0_l, src1_l, src2_l, src3_l);
+    DUP4_ARG2(__lasx_xvilvh_b, zero, src0, zero, src1, zero, src2,
+              zero, src3, src0_h, src1_h, src2_h, src3_h);
+    src0_l = __lasx_xvmul_h(wgt, src0_l);
+    src0_h = __lasx_xvmul_h(wgt, src0_h);
+    src1_l = __lasx_xvmul_h(wgt, src1_l);
+    src1_h = __lasx_xvmul_h(wgt, src1_h);
+    src2_l = __lasx_xvmul_h(wgt, src2_l);
+    src2_h = __lasx_xvmul_h(wgt, src2_h);
+    src3_l = __lasx_xvmul_h(wgt, src3_l);
+    src3_h = __lasx_xvmul_h(wgt, src3_h);
+    DUP4_ARG2(__lasx_xvsadd_h, src0_l, offset, src0_h, offset, src1_l, offset,
+              src1_h, offset, src0_l, src0_h, src1_l, src1_h);
+    DUP4_ARG2(__lasx_xvsadd_h, src2_l, offset, src2_h, offset, src3_l, offset,
+              src3_h, offset, src2_l, src2_h, src3_l, src3_h);
+    src0_l = __lasx_xvmaxi_h(src0_l, 0);
+    src0_h = __lasx_xvmaxi_h(src0_h, 0);
+    src1_l = __lasx_xvmaxi_h(src1_l, 0);
+    src1_h = __lasx_xvmaxi_h(src1_h, 0);
+    src2_l = __lasx_xvmaxi_h(src2_l, 0);
+    src2_h = __lasx_xvmaxi_h(src2_h, 0);
+    src3_l = __lasx_xvmaxi_h(src3_l, 0);
+    src3_h = __lasx_xvmaxi_h(src3_h, 0);
+    src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom);
+    src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom);
+    src1_l = __lasx_xvssrlrn_bu_h(src1_l, denom);
+    src1_h = __lasx_xvssrlrn_bu_h(src1_h, denom);
+    src2_l = __lasx_xvssrlrn_bu_h(src2_l, denom);
+    src2_h = __lasx_xvssrlrn_bu_h(src2_h, denom);
+    src3_l = __lasx_xvssrlrn_bu_h(src3_l, denom);
+    src3_h = __lasx_xvssrlrn_bu_h(src3_h, denom);
+    __lasx_xvstelm_d(src0_l, src, 0, 0);
+    __lasx_xvstelm_d(src0_h, src, 8, 0);
+    src += stride;
+    __lasx_xvstelm_d(src0_l, src, 0, 2);
+    __lasx_xvstelm_d(src0_h, src, 8, 2);
+    src += stride;
+    __lasx_xvstelm_d(src1_l, src, 0, 0);
+    __lasx_xvstelm_d(src1_h, src, 8, 0);
+    src += stride;
+    __lasx_xvstelm_d(src1_l, src, 0, 2);
+    __lasx_xvstelm_d(src1_h, src, 8, 2);
+    src += stride;
+    __lasx_xvstelm_d(src2_l, src, 0, 0);
+    __lasx_xvstelm_d(src2_h, src, 8, 0);
+    src += stride;
+    __lasx_xvstelm_d(src2_l, src, 0, 2);
+    __lasx_xvstelm_d(src2_h, src, 8, 2);
+    src += stride;
+    __lasx_xvstelm_d(src3_l, src, 0, 0);
+    __lasx_xvstelm_d(src3_h, src, 8, 0);
+    src += stride;
+    __lasx_xvstelm_d(src3_l, src, 0, 2);
+    __lasx_xvstelm_d(src3_h, src, 8, 2);
+    src += stride;
+
+    if (16 == height) {
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+                  src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+        src += stride_4x;
+        DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+                  src, stride_3x, tmp4, tmp5, tmp6, tmp7);
+        src -= stride_4x;
+        DUP4_ARG3(__lasx_xvpermi_q, tmp1, tmp0, 0x20, tmp3, tmp2, 0x20, tmp5,
+                  tmp4, 0x20, tmp7, tmp6, 0x20, src0, src1, src2, src3);
+        DUP4_ARG2(__lasx_xvilvl_b, zero, src0, zero, src1, zero, src2,
+                  zero, src3, src0_l, src1_l, src2_l, src3_l);
+        DUP4_ARG2(__lasx_xvilvh_b, zero, src0, zero, src1, zero, src2,
+                  zero, src3, src0_h, src1_h, src2_h, src3_h);
+        src0_l = __lasx_xvmul_h(wgt, src0_l);
+        src0_h = __lasx_xvmul_h(wgt, src0_h);
+        src1_l = __lasx_xvmul_h(wgt, src1_l);
+        src1_h = __lasx_xvmul_h(wgt, src1_h);
+        src2_l = __lasx_xvmul_h(wgt, src2_l);
+        src2_h = __lasx_xvmul_h(wgt, src2_h);
+        src3_l = __lasx_xvmul_h(wgt, src3_l);
+        src3_h = __lasx_xvmul_h(wgt, src3_h);
+        DUP4_ARG2(__lasx_xvsadd_h, src0_l, offset, src0_h, offset, src1_l,
+                  offset, src1_h, offset, src0_l, src0_h, src1_l, src1_h);
+        DUP4_ARG2(__lasx_xvsadd_h, src2_l, offset, src2_h, offset, src3_l,
+                  offset, src3_h, offset, src2_l, src2_h, src3_l, src3_h);
+        src0_l = __lasx_xvmaxi_h(src0_l, 0);
+        src0_h = __lasx_xvmaxi_h(src0_h, 0);
+        src1_l = __lasx_xvmaxi_h(src1_l, 0);
+        src1_h = __lasx_xvmaxi_h(src1_h, 0);
+        src2_l = __lasx_xvmaxi_h(src2_l, 0);
+        src2_h = __lasx_xvmaxi_h(src2_h, 0);
+        src3_l = __lasx_xvmaxi_h(src3_l, 0);
+        src3_h = __lasx_xvmaxi_h(src3_h, 0);
+        src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom);
+        src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom);
+        src1_l = __lasx_xvssrlrn_bu_h(src1_l, denom);
+        src1_h = __lasx_xvssrlrn_bu_h(src1_h, denom);
+        src2_l = __lasx_xvssrlrn_bu_h(src2_l, denom);
+        src2_h = __lasx_xvssrlrn_bu_h(src2_h, denom);
+        src3_l = __lasx_xvssrlrn_bu_h(src3_l, denom);
+        src3_h = __lasx_xvssrlrn_bu_h(src3_h, denom);
+        __lasx_xvstelm_d(src0_l, src, 0, 0);
+        __lasx_xvstelm_d(src0_h, src, 8, 0);
+        src += stride;
+        __lasx_xvstelm_d(src0_l, src, 0, 2);
+        __lasx_xvstelm_d(src0_h, src, 8, 2);
+        src += stride;
+        __lasx_xvstelm_d(src1_l, src, 0, 0);
+        __lasx_xvstelm_d(src1_h, src, 8, 0);
+        src += stride;
+        __lasx_xvstelm_d(src1_l, src, 0, 2);
+        __lasx_xvstelm_d(src1_h, src, 8, 2);
+        src += stride;
+        __lasx_xvstelm_d(src2_l, src, 0, 0);
+        __lasx_xvstelm_d(src2_h, src, 8, 0);
+        src += stride;
+        __lasx_xvstelm_d(src2_l, src, 0, 2);
+        __lasx_xvstelm_d(src2_h, src, 8, 2);
+        src += stride;
+        __lasx_xvstelm_d(src3_l, src, 0, 0);
+        __lasx_xvstelm_d(src3_h, src, 8, 0);
+        src += stride;
+        __lasx_xvstelm_d(src3_l, src, 0, 2);
+        __lasx_xvstelm_d(src3_h, src, 8, 2);
+    }
+}
+
+static void avc_wgt_8x4_lasx(uint8_t *src, ptrdiff_t stride,
+                             int32_t log2_denom, int32_t weight_src,
+                             int32_t offset_in)
+{
+    uint32_t offset_val;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+    __m256i wgt, zero = __lasx_xvldi(0);
+    __m256i src0, src0_h, src0_l;
+    __m256i tmp0, tmp1, tmp2, tmp3, denom, offset;
+
+    offset_val = (unsigned) offset_in << log2_denom;
+
+    wgt    = __lasx_xvreplgr2vr_h(weight_src);
+    offset = __lasx_xvreplgr2vr_h(offset_val);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    src0_l = __lasx_xvilvl_b(zero, src0);
+    src0_h = __lasx_xvilvh_b(zero, src0);
+    src0_l = __lasx_xvmul_h(wgt, src0_l);
+    src0_h = __lasx_xvmul_h(wgt, src0_h);
+    src0_l = __lasx_xvsadd_h(src0_l, offset);
+    src0_h = __lasx_xvsadd_h(src0_h, offset);
+    src0_l = __lasx_xvmaxi_h(src0_l, 0);
+    src0_h = __lasx_xvmaxi_h(src0_h, 0);
+    src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom);
+    src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom);
+
+    src0 = __lasx_xvpickev_d(src0_h, src0_l);
+    __lasx_xvstelm_d(src0, src, 0, 0);
+    __lasx_xvstelm_d(src0, src + stride, 0, 1);
+    __lasx_xvstelm_d(src0, src + stride_2x, 0, 2);
+    __lasx_xvstelm_d(src0, src + stride_3x, 0, 3);
+}
+
+static void avc_wgt_8x8_lasx(uint8_t *src, ptrdiff_t stride, int32_t log2_denom,
+                             int32_t src_weight, int32_t offset_in)
+{
+    __m256i src0, src1, src0_h, src0_l, src1_h, src1_l, zero = __lasx_xvldi(0);
+    __m256i tmp0, tmp1, tmp2, tmp3, denom, offset, wgt;
+    uint32_t offset_val;
+    uint8_t *src_tmp = src;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_4x = stride << 2;
+    ptrdiff_t stride_3x = stride_2x + stride;
+
+    offset_val = (unsigned) offset_in << log2_denom;
+
+    wgt    = __lasx_xvreplgr2vr_h(src_weight);
+    offset = __lasx_xvreplgr2vr_h(offset_val);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x,
+              src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src_tmp += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x,
+              src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP2_ARG2(__lasx_xvilvl_b, zero, src0, zero, src1, src0_l, src1_l);
+    DUP2_ARG2(__lasx_xvilvh_b, zero, src0, zero, src1, src0_h, src1_h);
+    src0_l = __lasx_xvmul_h(wgt, src0_l);
+    src0_h = __lasx_xvmul_h(wgt, src0_h);
+    src1_l = __lasx_xvmul_h(wgt, src1_l);
+    src1_h = __lasx_xvmul_h(wgt, src1_h);
+    DUP4_ARG2(__lasx_xvsadd_h, src0_l, offset, src0_h, offset, src1_l, offset,
+              src1_h, offset, src0_l, src0_h, src1_l, src1_h);
+    src0_l = __lasx_xvmaxi_h(src0_l, 0);
+    src0_h = __lasx_xvmaxi_h(src0_h, 0);
+    src1_l = __lasx_xvmaxi_h(src1_l, 0);
+    src1_h = __lasx_xvmaxi_h(src1_h, 0);
+    src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom);
+    src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom);
+    src1_l = __lasx_xvssrlrn_bu_h(src1_l, denom);
+    src1_h = __lasx_xvssrlrn_bu_h(src1_h, denom);
+
+    DUP2_ARG2(__lasx_xvpickev_d, src0_h, src0_l, src1_h, src1_l, src0, src1);
+    __lasx_xvstelm_d(src0, src, 0, 0);
+    __lasx_xvstelm_d(src0, src + stride, 0, 1);
+    __lasx_xvstelm_d(src0, src + stride_2x, 0, 2);
+    __lasx_xvstelm_d(src0, src + stride_3x, 0, 3);
+    src += stride_4x;
+    __lasx_xvstelm_d(src1, src, 0, 0);
+    __lasx_xvstelm_d(src1, src + stride, 0, 1);
+    __lasx_xvstelm_d(src1, src + stride_2x, 0, 2);
+    __lasx_xvstelm_d(src1, src + stride_3x, 0, 3);
+}
+
+static void avc_wgt_8x16_lasx(uint8_t *src, ptrdiff_t stride,
+                              int32_t log2_denom, int32_t src_weight,
+                              int32_t offset_in)
+{
+    __m256i src0, src1, src2, src3;
+    __m256i src0_h, src0_l, src1_h, src1_l, src2_h, src2_l, src3_h, src3_l;
+    __m256i tmp0, tmp1, tmp2, tmp3, denom, offset, wgt;
+    __m256i zero = __lasx_xvldi(0);
+    uint32_t offset_val;
+    uint8_t *src_tmp = src;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_4x = stride << 2;
+    ptrdiff_t stride_3x = stride_2x + stride;
+
+    offset_val = (unsigned) offset_in << log2_denom;
+
+    wgt    = __lasx_xvreplgr2vr_h(src_weight);
+    offset = __lasx_xvreplgr2vr_h(offset_val);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x,
+              src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src_tmp += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x,
+              src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src_tmp += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src1 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x,
+              src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src_tmp += stride_4x;
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src2 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    DUP4_ARG2(__lasx_xvldx, src_tmp, 0, src_tmp, stride, src_tmp, stride_2x,
+              src_tmp, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_d, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src3 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+
+    DUP4_ARG2(__lasx_xvilvl_b, zero, src0, zero, src1, zero, src2, zero, src3,
+              src0_l, src1_l, src2_l, src3_l);
+    DUP4_ARG2(__lasx_xvilvh_b, zero, src0, zero, src1, zero, src2, zero, src3,
+              src0_h, src1_h, src2_h, src3_h);
+    src0_l = __lasx_xvmul_h(wgt, src0_l);
+    src0_h = __lasx_xvmul_h(wgt, src0_h);
+    src1_l = __lasx_xvmul_h(wgt, src1_l);
+    src1_h = __lasx_xvmul_h(wgt, src1_h);
+    src2_l = __lasx_xvmul_h(wgt, src2_l);
+    src2_h = __lasx_xvmul_h(wgt, src2_h);
+    src3_l = __lasx_xvmul_h(wgt, src3_l);
+    src3_h = __lasx_xvmul_h(wgt, src3_h);
+
+    DUP4_ARG2(__lasx_xvsadd_h, src0_l, offset, src0_h, offset, src1_l, offset,
+              src1_h, offset, src0_l, src0_h, src1_l, src1_h);
+    DUP4_ARG2(__lasx_xvsadd_h, src2_l, offset, src2_h, offset, src3_l, offset,
+              src3_h, offset, src2_l, src2_h, src3_l, src3_h);
+
+    src0_l = __lasx_xvmaxi_h(src0_l, 0);
+    src0_h = __lasx_xvmaxi_h(src0_h, 0);
+    src1_l = __lasx_xvmaxi_h(src1_l, 0);
+    src1_h = __lasx_xvmaxi_h(src1_h, 0);
+    src2_l = __lasx_xvmaxi_h(src2_l, 0);
+    src2_h = __lasx_xvmaxi_h(src2_h, 0);
+    src3_l = __lasx_xvmaxi_h(src3_l, 0);
+    src3_h = __lasx_xvmaxi_h(src3_h, 0);
+    src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom);
+    src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom);
+    src1_l = __lasx_xvssrlrn_bu_h(src1_l, denom);
+    src1_h = __lasx_xvssrlrn_bu_h(src1_h, denom);
+    src2_l = __lasx_xvssrlrn_bu_h(src2_l, denom);
+    src2_h = __lasx_xvssrlrn_bu_h(src2_h, denom);
+    src3_l = __lasx_xvssrlrn_bu_h(src3_l, denom);
+    src3_h = __lasx_xvssrlrn_bu_h(src3_h, denom);
+    DUP4_ARG2(__lasx_xvpickev_d, src0_h, src0_l, src1_h, src1_l, src2_h, src2_l,
+              src3_h, src3_l, src0, src1, src2, src3);
+
+    __lasx_xvstelm_d(src0, src, 0, 0);
+    __lasx_xvstelm_d(src0, src + stride, 0, 1);
+    __lasx_xvstelm_d(src0, src + stride_2x, 0, 2);
+    __lasx_xvstelm_d(src0, src + stride_3x, 0, 3);
+    src += stride_4x;
+    __lasx_xvstelm_d(src1, src, 0, 0);
+    __lasx_xvstelm_d(src1, src + stride, 0, 1);
+    __lasx_xvstelm_d(src1, src + stride_2x, 0, 2);
+    __lasx_xvstelm_d(src1, src + stride_3x, 0, 3);
+    src += stride_4x;
+    __lasx_xvstelm_d(src2, src, 0, 0);
+    __lasx_xvstelm_d(src2, src + stride, 0, 1);
+    __lasx_xvstelm_d(src2, src + stride_2x, 0, 2);
+    __lasx_xvstelm_d(src2, src + stride_3x, 0, 3);
+    src += stride_4x;
+    __lasx_xvstelm_d(src3, src, 0, 0);
+    __lasx_xvstelm_d(src3, src + stride, 0, 1);
+    __lasx_xvstelm_d(src3, src + stride_2x, 0, 2);
+    __lasx_xvstelm_d(src3, src + stride_3x, 0, 3);
+}
+
+void ff_weight_h264_pixels8_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                   int height, int log2_denom,
+                                   int weight_src, int offset)
+{
+    if (4 == height) {
+        avc_wgt_8x4_lasx(src, stride, log2_denom, weight_src, offset);
+    } else if (8 == height) {
+        avc_wgt_8x8_lasx(src, stride, log2_denom, weight_src, offset);
+    } else {
+        avc_wgt_8x16_lasx(src, stride, log2_denom, weight_src, offset);
+    }
+}
+
+static void avc_wgt_4x2_lasx(uint8_t *src, ptrdiff_t stride,
+                             int32_t log2_denom, int32_t weight_src,
+                             int32_t offset_in)
+{
+    uint32_t offset_val;
+    __m256i wgt, zero = __lasx_xvldi(0);
+    __m256i src0, tmp0, tmp1, denom, offset;
+
+    offset_val = (unsigned) offset_in << log2_denom;
+
+    wgt    = __lasx_xvreplgr2vr_h(weight_src);
+    offset = __lasx_xvreplgr2vr_h(offset_val);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP2_ARG2(__lasx_xvldx, src, 0, src, stride, tmp0, tmp1);
+    src0 = __lasx_xvilvl_w(tmp1, tmp0);
+    src0 = __lasx_xvilvl_b(zero, src0);
+    src0 = __lasx_xvmul_h(wgt, src0);
+    src0 = __lasx_xvsadd_h(src0, offset);
+    src0 = __lasx_xvmaxi_h(src0, 0);
+    src0 = __lasx_xvssrlrn_bu_h(src0, denom);
+    __lasx_xvstelm_w(src0, src, 0, 0);
+    __lasx_xvstelm_w(src0, src + stride, 0, 1);
+}
+
+static void avc_wgt_4x4_lasx(uint8_t *src, ptrdiff_t stride,
+                             int32_t log2_denom, int32_t weight_src,
+                             int32_t offset_in)
+{
+    __m256i wgt;
+    __m256i src0, tmp0, tmp1, tmp2, tmp3, denom, offset;
+    uint32_t offset_val;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_3x = stride_2x + stride;
+
+    offset_val = (unsigned) offset_in << log2_denom;
+
+    wgt    = __lasx_xvreplgr2vr_h(weight_src);
+    offset = __lasx_xvreplgr2vr_h(offset_val);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp0, tmp1);
+    src0 = __lasx_xvilvl_w(tmp1, tmp0);
+    src0 = __lasx_vext2xv_hu_bu(src0);
+    src0 = __lasx_xvmul_h(wgt, src0);
+    src0 = __lasx_xvsadd_h(src0, offset);
+    src0 = __lasx_xvmaxi_h(src0, 0);
+    src0 = __lasx_xvssrlrn_bu_h(src0, denom);
+    __lasx_xvstelm_w(src0, src, 0, 0);
+    __lasx_xvstelm_w(src0, src + stride, 0, 1);
+    __lasx_xvstelm_w(src0, src + stride_2x, 0, 4);
+    __lasx_xvstelm_w(src0, src + stride_3x, 0, 5);
+}
+
+static void avc_wgt_4x8_lasx(uint8_t *src, ptrdiff_t stride,
+                             int32_t log2_denom, int32_t weight_src,
+                             int32_t offset_in)
+{
+    __m256i src0, src0_h, src0_l;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7, denom, offset;
+    __m256i wgt, zero = __lasx_xvldi(0);
+    uint32_t offset_val;
+    ptrdiff_t stride_2x = stride << 1;
+    ptrdiff_t stride_4x = stride << 2;
+    ptrdiff_t stride_3x = stride_2x + stride;
+
+    offset_val = (unsigned) offset_in << log2_denom;
+
+    wgt    = __lasx_xvreplgr2vr_h(weight_src);
+    offset = __lasx_xvreplgr2vr_h(offset_val);
+    denom  = __lasx_xvreplgr2vr_h(log2_denom);
+
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp0, tmp1, tmp2, tmp3);
+    src += stride_4x;
+    DUP4_ARG2(__lasx_xvldx, src, 0, src, stride, src, stride_2x,
+              src, stride_3x, tmp4, tmp5, tmp6, tmp7);
+    src -= stride_4x;
+    DUP4_ARG2(__lasx_xvilvl_w, tmp2, tmp0, tmp3, tmp1, tmp6, tmp4, tmp7,
+              tmp5, tmp0, tmp1, tmp2, tmp3);
+    DUP2_ARG2(__lasx_xvilvl_w, tmp1, tmp0, tmp3, tmp2, tmp0, tmp1);
+    src0 = __lasx_xvpermi_q(tmp1, tmp0, 0x20);
+    src0_l = __lasx_xvilvl_b(zero, src0);
+    src0_h = __lasx_xvilvh_b(zero, src0);
+    src0_l = __lasx_xvmul_h(wgt, src0_l);
+    src0_h = __lasx_xvmul_h(wgt, src0_h);
+    src0_l = __lasx_xvsadd_h(src0_l, offset);
+    src0_h = __lasx_xvsadd_h(src0_h, offset);
+    src0_l = __lasx_xvmaxi_h(src0_l, 0);
+    src0_h = __lasx_xvmaxi_h(src0_h, 0);
+    src0_l = __lasx_xvssrlrn_bu_h(src0_l, denom);
+    src0_h = __lasx_xvssrlrn_bu_h(src0_h, denom);
+    __lasx_xvstelm_w(src0_l, src, 0, 0);
+    __lasx_xvstelm_w(src0_l, src + stride, 0, 1);
+    __lasx_xvstelm_w(src0_h, src + stride_2x, 0, 0);
+    __lasx_xvstelm_w(src0_h, src + stride_3x, 0, 1);
+    src += stride_4x;
+    __lasx_xvstelm_w(src0_l, src, 0, 4);
+    __lasx_xvstelm_w(src0_l, src + stride, 0, 5);
+    __lasx_xvstelm_w(src0_h, src + stride_2x, 0, 4);
+    __lasx_xvstelm_w(src0_h, src + stride_3x, 0, 5);
+}
+
+void ff_weight_h264_pixels4_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                   int height, int log2_denom,
+                                   int weight_src, int offset)
+{
+    if (2 == height) {
+        avc_wgt_4x2_lasx(src, stride, log2_denom, weight_src, offset);
+    } else if (4 == height) {
+        avc_wgt_4x4_lasx(src, stride, log2_denom, weight_src, offset);
+    } else {
+        avc_wgt_4x8_lasx(src, stride, log2_denom, weight_src, offset);
+    }
+}
+
+void ff_h264_add_pixels4_8_lasx(uint8_t *_dst, int16_t *_src, int stride)
+{
+    __m256i src0, dst0, dst1, dst2, dst3, zero;
+    __m256i tmp0, tmp1;
+    uint8_t *_dst1 = _dst + stride;
+    uint8_t *_dst2 = _dst1 + stride;
+    uint8_t *_dst3 = _dst2 + stride;
+
+    src0 = __lasx_xvld(_src, 0);
+    dst0 = __lasx_xvldrepl_w(_dst, 0);
+    dst1 = __lasx_xvldrepl_w(_dst1, 0);
+    dst2 = __lasx_xvldrepl_w(_dst2, 0);
+    dst3 = __lasx_xvldrepl_w(_dst3, 0);
+    tmp0 = __lasx_xvilvl_w(dst1, dst0);
+    tmp1 = __lasx_xvilvl_w(dst3, dst2);
+    dst0 = __lasx_xvilvl_d(tmp1, tmp0);
+    tmp0 = __lasx_vext2xv_hu_bu(dst0);
+    zero = __lasx_xvldi(0);
+    tmp1 = __lasx_xvadd_h(src0, tmp0);
+    dst0 = __lasx_xvpickev_b(tmp1, tmp1);
+    __lasx_xvstelm_w(dst0, _dst, 0, 0);
+    __lasx_xvstelm_w(dst0, _dst1, 0, 1);
+    __lasx_xvstelm_w(dst0, _dst2, 0, 4);
+    __lasx_xvstelm_w(dst0, _dst3, 0, 5);
+    __lasx_xvst(zero, _src, 0);
+}
+
+void ff_h264_add_pixels8_8_lasx(uint8_t *_dst, int16_t *_src, int stride)
+{
+    __m256i src0, src1, src2, src3;
+    __m256i dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7;
+    __m256i tmp0, tmp1, tmp2, tmp3;
+    __m256i zero = __lasx_xvldi(0);
+    uint8_t *_dst1 = _dst + stride;
+    uint8_t *_dst2 = _dst1 + stride;
+    uint8_t *_dst3 = _dst2 + stride;
+    uint8_t *_dst4 = _dst3 + stride;
+    uint8_t *_dst5 = _dst4 + stride;
+    uint8_t *_dst6 = _dst5 + stride;
+    uint8_t *_dst7 = _dst6 + stride;
+
+    src0 = __lasx_xvld(_src, 0);
+    src1 = __lasx_xvld(_src, 32);
+    src2 = __lasx_xvld(_src, 64);
+    src3 = __lasx_xvld(_src, 96);
+    dst0 = __lasx_xvldrepl_d(_dst, 0);
+    dst1 = __lasx_xvldrepl_d(_dst1, 0);
+    dst2 = __lasx_xvldrepl_d(_dst2, 0);
+    dst3 = __lasx_xvldrepl_d(_dst3, 0);
+    dst4 = __lasx_xvldrepl_d(_dst4, 0);
+    dst5 = __lasx_xvldrepl_d(_dst5, 0);
+    dst6 = __lasx_xvldrepl_d(_dst6, 0);
+    dst7 = __lasx_xvldrepl_d(_dst7, 0);
+    tmp0 = __lasx_xvilvl_d(dst1, dst0);
+    tmp1 = __lasx_xvilvl_d(dst3, dst2);
+    tmp2 = __lasx_xvilvl_d(dst5, dst4);
+    tmp3 = __lasx_xvilvl_d(dst7, dst6);
+    dst0 = __lasx_vext2xv_hu_bu(tmp0);
+    dst1 = __lasx_vext2xv_hu_bu(tmp1);
+    dst1 = __lasx_vext2xv_hu_bu(tmp1);
+    dst2 = __lasx_vext2xv_hu_bu(tmp2);
+    dst3 = __lasx_vext2xv_hu_bu(tmp3);
+    tmp0 = __lasx_xvadd_h(src0, dst0);
+    tmp1 = __lasx_xvadd_h(src1, dst1);
+    tmp2 = __lasx_xvadd_h(src2, dst2);
+    tmp3 = __lasx_xvadd_h(src3, dst3);
+    dst1 = __lasx_xvpickev_b(tmp1, tmp0);
+    dst2 = __lasx_xvpickev_b(tmp3, tmp2);
+    __lasx_xvst(zero, _src, 0);
+    __lasx_xvst(zero, _src, 32);
+    __lasx_xvst(zero, _src, 64);
+    __lasx_xvst(zero, _src, 96);
+    __lasx_xvstelm_d(dst1, _dst, 0, 0);
+    __lasx_xvstelm_d(dst1, _dst1, 0, 2);
+    __lasx_xvstelm_d(dst1, _dst2, 0, 1);
+    __lasx_xvstelm_d(dst1, _dst3, 0, 3);
+    __lasx_xvstelm_d(dst2, _dst4, 0, 0);
+    __lasx_xvstelm_d(dst2, _dst5, 0, 2);
+    __lasx_xvstelm_d(dst2, _dst6, 0, 1);
+    __lasx_xvstelm_d(dst2, _dst7, 0, 3);
+}
diff --git a/libavcodec/loongarch/h264dsp_lasx.h b/libavcodec/loongarch/h264dsp_lasx.h
new file mode 100644
index 0000000000..538c14c936
--- /dev/null
+++ b/libavcodec/loongarch/h264dsp_lasx.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *                Xiwei  Gu  <guxiwei-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_H264DSP_LASX_H
+#define AVCODEC_LOONGARCH_H264DSP_LASX_H
+
+#include "libavcodec/h264dec.h"
+
+void ff_h264_h_lpf_luma_8_lasx(uint8_t *src, ptrdiff_t stride,
+                               int alpha, int beta, int8_t *tc0);
+void ff_h264_v_lpf_luma_8_lasx(uint8_t *src, ptrdiff_t stride,
+                               int alpha, int beta, int8_t *tc0);
+void ff_h264_h_lpf_luma_intra_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                     int alpha, int beta);
+void ff_h264_v_lpf_luma_intra_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                     int alpha, int beta);
+void ff_h264_h_lpf_chroma_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                 int alpha, int beta, int8_t *tc0);
+void ff_h264_v_lpf_chroma_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                 int alpha, int beta, int8_t *tc0);
+void ff_h264_h_lpf_chroma_intra_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                       int alpha, int beta);
+void ff_h264_v_lpf_chroma_intra_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                       int alpha, int beta);
+void ff_biweight_h264_pixels16_8_lasx(uint8_t *dst, uint8_t *src,
+                                      ptrdiff_t stride, int height,
+                                      int log2_denom, int weight_dst,
+                                      int weight_src, int offset_in);
+void ff_biweight_h264_pixels8_8_lasx(uint8_t *dst, uint8_t *src,
+                                     ptrdiff_t stride, int height,
+                                     int log2_denom, int weight_dst,
+                                     int weight_src, int offset);
+void ff_biweight_h264_pixels4_8_lasx(uint8_t *dst, uint8_t *src,
+                                     ptrdiff_t stride, int height,
+                                     int log2_denom, int weight_dst,
+                                     int weight_src, int offset);
+void ff_weight_h264_pixels16_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                    int height, int log2_denom,
+                                    int weight_src, int offset_in);
+void ff_weight_h264_pixels8_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                   int height, int log2_denom,
+                                   int weight_src, int offset);
+void ff_weight_h264_pixels4_8_lasx(uint8_t *src, ptrdiff_t stride,
+                                   int height, int log2_denom,
+                                   int weight_src, int offset);
+void ff_h264_add_pixels4_8_lasx(uint8_t *_dst, int16_t *_src, int stride);
+
+void ff_h264_add_pixels8_8_lasx(uint8_t *_dst, int16_t *_src, int stride);
+#endif  // #ifndef AVCODEC_LOONGARCH_H264DSP_LASX_H

From patchwork Tue Dec 14 13:33:14 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= <chenhao@loongson.cn>
X-Patchwork-Id: 32486
Delivered-To: ffmpegpatchwork2@gmail.com
From: Hao Chen <chenhao@loongson.cn>
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:14 +0800
Message-Id: <20211214133316.8978-6-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH v2 5/7] avcodec: [loongarch] Optimize
 h264idct with LASX.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Lu Wang <wanglu@loongson.cn>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: qB8rmJEZtOYZ

From: Lu Wang <wanglu@loongson.cn>

Decoding speed measured with:
./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before: 282fps
after:  293fps (about 3.9% faster)

Change-Id: Ia8889935a6359630dd5dbb61263287f1cb24a0a4
---
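Note for reviewers: AVC_ITRANS_H in h264idct_lasx.c is one 1-D pass of the
4x4 inverse transform. ff_h264_idct_add_lasx applies it twice with a
transpose in between, then rounds with a shift by 6 and does a clipped add
to the destination. The scalar sketch below is only meant as an aid for
cross-checking the vector code and is not part of the patch. The names
clip_u8(), itrans4() and idct4x4_add_ref() are hypothetical, and the sketch
assumes 8-bit pixels and an arithmetic right shift for negative values.

```c
#include <stdint.h>
#include <string.h>

/* Clamp to the 8-bit pixel range (hypothetical helper). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* One 1-D pass: same arithmetic as AVC_ITRANS_H, for a single lane. */
static void itrans4(const int in[4], int out[4])
{
    int t0 = in[0] + in[2];
    int t1 = in[0] - in[2];
    int t2 = (in[1] >> 1) - in[3];
    int t3 = in[1] + (in[3] >> 1);

    out[0] = t0 + t3;
    out[1] = t1 + t2;
    out[2] = t1 - t2;
    out[3] = t0 - t3;
}

/* Scalar 4x4 inverse transform plus add (hypothetical reference). */
static void idct4x4_add_ref(uint8_t *dst, int16_t *coef, int stride)
{
    int tmp[16]; /* kept in int here; the vector code uses 16-bit lanes */

    /* First pass over the elements i, i + 4, i + 8, i + 12. */
    for (int i = 0; i < 4; i++) {
        int in[4] = { coef[i], coef[i + 4], coef[i + 8], coef[i + 12] };
        int out[4];

        itrans4(in, out);
        for (int k = 0; k < 4; k++)
            tmp[i + 4 * k] = out[k];
    }

    /* Second pass over each group of four consecutive elements,
     * then round, add to the destination and clip. */
    for (int i = 0; i < 4; i++) {
        int out[4];

        itrans4(&tmp[4 * i], out);
        for (int k = 0; k < 4; k++) {
            int px = dst[i + k * stride] + ((out[k] + 32) >> 6);
            dst[i + k * stride] = clip_u8(px);
        }
    }

    /* The coefficient block is cleared after use, as in the patch. */
    memset(coef, 0, 16 * sizeof(*coef));
}
```

With only the DC coefficient set to 64, every pixel in the 4x4 block should
increase by 1, which gives a quick sanity check against the LASX output.
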
 libavcodec/loongarch/Makefile                 |   3 +-
 libavcodec/loongarch/h264dsp_init_loongarch.c |  15 +
 libavcodec/loongarch/h264dsp_lasx.h           |  23 +
 libavcodec/loongarch/h264idct_lasx.c          | 498 ++++++++++++++++++
 4 files changed, 538 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/loongarch/h264idct_lasx.c
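
Note for reviewers: ff_h264_idct_add16_lasx and ff_h264_idct8_add4_lasx
take the DC-only path when a block has exactly one non-zero coefficient and
that coefficient is the DC term. The DC-only functions compute
(src[0] + 32) >> 6, clear src[0], and add the result to every pixel with
clipping. A scalar sketch of that path follows, for comparison only. The
names clip_u8() and idct_dc_add_ref() and the size parameter are
hypothetical, and 8-bit pixels and an arithmetic right shift are assumed.

```c
#include <stdint.h>

/* Clamp to the 8-bit pixel range (hypothetical helper). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* DC-only inverse transform plus add for a size x size block,
 * where size is 4 or 8 (hypothetical reference). */
static void idct_dc_add_ref(uint8_t *dst, int16_t *coef, int stride, int size)
{
    /* Rounded DC value, same expression as in the patch. */
    int dc = (coef[0] + 32) >> 6;

    /* Only the DC coefficient needs clearing on this path. */
    coef[0] = 0;

    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[x + y * stride] = clip_u8(dst[x + y * stride] + dc);
}
```
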

diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index df43151dbd..242a2be290 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -3,4 +3,5 @@ OBJS-$(CONFIG_H264QPEL)               += loongarch/h264qpel_init_loongarch.o
 OBJS-$(CONFIG_H264DSP)                += loongarch/h264dsp_init_loongarch.o
 LASX-OBJS-$(CONFIG_H264CHROMA)        += loongarch/h264chroma_lasx.o
 LASX-OBJS-$(CONFIG_H264QPEL)          += loongarch/h264qpel_lasx.o
-LASX-OBJS-$(CONFIG_H264DSP)           += loongarch/h264dsp_lasx.o
+LASX-OBJS-$(CONFIG_H264DSP)           += loongarch/h264dsp_lasx.o \
+                                         loongarch/h264idct_lasx.o
diff --git a/libavcodec/loongarch/h264dsp_init_loongarch.c b/libavcodec/loongarch/h264dsp_init_loongarch.c
index ddc0877a74..0985c2fe8a 100644
--- a/libavcodec/loongarch/h264dsp_init_loongarch.c
+++ b/libavcodec/loongarch/h264dsp_init_loongarch.c
@@ -53,6 +53,21 @@ av_cold void ff_h264dsp_init_loongarch(H264DSPContext *c, const int bit_depth,
             c->biweight_h264_pixels_tab[0] = ff_biweight_h264_pixels16_8_lasx;
             c->biweight_h264_pixels_tab[1] = ff_biweight_h264_pixels8_8_lasx;
             c->biweight_h264_pixels_tab[2] = ff_biweight_h264_pixels4_8_lasx;
+
+            c->h264_idct_add = ff_h264_idct_add_lasx;
+            c->h264_idct8_add = ff_h264_idct8_addblk_lasx;
+            c->h264_idct_dc_add = ff_h264_idct4x4_addblk_dc_lasx;
+            c->h264_idct8_dc_add = ff_h264_idct8_dc_addblk_lasx;
+            c->h264_idct_add16 = ff_h264_idct_add16_lasx;
+            c->h264_idct8_add4 = ff_h264_idct8_add4_lasx;
+
+            if (chroma_format_idc <= 1)
+                c->h264_idct_add8 = ff_h264_idct_add8_lasx;
+            else
+                c->h264_idct_add8 = ff_h264_idct_add8_422_lasx;
+
+            c->h264_idct_add16intra = ff_h264_idct_add16_intra_lasx;
+            c->h264_luma_dc_dequant_idct = ff_h264_deq_idct_luma_dc_lasx;
         }
     }
 }
diff --git a/libavcodec/loongarch/h264dsp_lasx.h b/libavcodec/loongarch/h264dsp_lasx.h
index 538c14c936..bfd567fffa 100644
--- a/libavcodec/loongarch/h264dsp_lasx.h
+++ b/libavcodec/loongarch/h264dsp_lasx.h
@@ -65,4 +65,27 @@ void ff_weight_h264_pixels4_8_lasx(uint8_t *src, ptrdiff_t stride,
 void ff_h264_add_pixels4_8_lasx(uint8_t *_dst, int16_t *_src, int stride);
 
 void ff_h264_add_pixels8_8_lasx(uint8_t *_dst, int16_t *_src, int stride);
+void ff_h264_idct_add_lasx(uint8_t *dst, int16_t *src, int32_t dst_stride);
+void ff_h264_idct8_addblk_lasx(uint8_t *dst, int16_t *src, int32_t dst_stride);
+void ff_h264_idct4x4_addblk_dc_lasx(uint8_t *dst, int16_t *src,
+                                    int32_t dst_stride);
+void ff_h264_idct8_dc_addblk_lasx(uint8_t *dst, int16_t *src,
+                                  int32_t dst_stride);
+void ff_h264_idct_add16_lasx(uint8_t *dst, const int32_t *blk_offset,
+                             int16_t *block, int32_t dst_stride,
+                             const uint8_t nzc[15 * 8]);
+void ff_h264_idct8_add4_lasx(uint8_t *dst, const int32_t *blk_offset,
+                             int16_t *block, int32_t dst_stride,
+                             const uint8_t nzc[15 * 8]);
+void ff_h264_idct_add8_lasx(uint8_t **dst, const int32_t *blk_offset,
+                            int16_t *block, int32_t dst_stride,
+                            const uint8_t nzc[15 * 8]);
+void ff_h264_idct_add8_422_lasx(uint8_t **dst, const int32_t *blk_offset,
+                                int16_t *block, int32_t dst_stride,
+                                const uint8_t nzc[15 * 8]);
+void ff_h264_idct_add16_intra_lasx(uint8_t *dst, const int32_t *blk_offset,
+                                   int16_t *block, int32_t dst_stride,
+                                   const uint8_t nzc[15 * 8]);
+void ff_h264_deq_idct_luma_dc_lasx(int16_t *dst, int16_t *src,
+                                   int32_t de_qval);
 #endif  // #ifndef AVCODEC_LOONGARCH_H264DSP_LASX_H
diff --git a/libavcodec/loongarch/h264idct_lasx.c b/libavcodec/loongarch/h264idct_lasx.c
new file mode 100644
index 0000000000..46bd3b74d5
--- /dev/null
+++ b/libavcodec/loongarch/h264idct_lasx.c
@@ -0,0 +1,498 @@
+/*
+ * Loongson LASX optimized h264idct
+ *
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin <yinshiyou-hf@loongson.cn>
+ *                Xiwei  Gu  <guxiwei-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/loongson_intrinsics.h"
+#include "h264dsp_lasx.h"
+#include "libavcodec/bit_depth_template.c"
+
+#define AVC_ITRANS_H(in0, in1, in2, in3, out0, out1, out2, out3)     \
+{                                                                    \
+    __m256i tmp0_m, tmp1_m, tmp2_m, tmp3_m;                          \
+                                                                     \
+    tmp0_m = __lasx_xvadd_h(in0, in2);                               \
+    tmp1_m = __lasx_xvsub_h(in0, in2);                               \
+    tmp2_m = __lasx_xvsrai_h(in1, 1);                                \
+    tmp2_m = __lasx_xvsub_h(tmp2_m, in3);                            \
+    tmp3_m = __lasx_xvsrai_h(in3, 1);                                \
+    tmp3_m = __lasx_xvadd_h(in1, tmp3_m);                            \
+                                                                     \
+    LASX_BUTTERFLY_4_H(tmp0_m, tmp1_m, tmp2_m, tmp3_m,               \
+                       out0, out1, out2, out3);                      \
+}
+
+void ff_h264_idct_add_lasx(uint8_t *dst, int16_t *src, int32_t dst_stride)
+{
+    __m256i src0_m, src1_m, src2_m, src3_m;
+    __m256i dst0_m, dst1_m;
+    __m256i hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3;
+    __m256i inp0_m, inp1_m, res0_m, src1, src3;
+    __m256i src0 = __lasx_xvld(src, 0);
+    __m256i src2 = __lasx_xvld(src, 16);
+    __m256i zero = __lasx_xvldi(0);
+    int32_t dst_stride_2x = dst_stride << 1;
+    int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+
+    __lasx_xvst(zero, src, 0);
+    DUP2_ARG2(__lasx_xvilvh_d, src0, src0, src2, src2, src1, src3);
+    AVC_ITRANS_H(src0, src1, src2, src3, hres0, hres1, hres2, hres3);
+    LASX_TRANSPOSE4x4_H(hres0, hres1, hres2, hres3, hres0, hres1, hres2, hres3);
+    AVC_ITRANS_H(hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3);
+    /* Load the four destination rows once; only the low 4 bytes of
+     * each vector are used below. */
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, src0_m, src1_m, src2_m, src3_m);
+    DUP2_ARG2(__lasx_xvilvl_d, vres1, vres0, vres3, vres2, inp0_m, inp1_m);
+    inp0_m = __lasx_xvpermi_q(inp1_m, inp0_m, 0x20);
+    inp0_m = __lasx_xvsrari_h(inp0_m, 6);
+    DUP2_ARG2(__lasx_xvilvl_w, src1_m, src0_m, src3_m, src2_m, dst0_m, dst1_m);
+    dst0_m = __lasx_xvilvl_d(dst1_m, dst0_m);
+    res0_m = __lasx_vext2xv_hu_bu(dst0_m);
+    res0_m = __lasx_xvadd_h(res0_m, inp0_m);
+    res0_m = __lasx_xvclip255_h(res0_m);
+    dst0_m = __lasx_xvpickev_b(res0_m, res0_m);
+    __lasx_xvstelm_w(dst0_m, dst, 0, 0);
+    __lasx_xvstelm_w(dst0_m, dst + dst_stride, 0, 1);
+    __lasx_xvstelm_w(dst0_m, dst + dst_stride_2x, 0, 4);
+    __lasx_xvstelm_w(dst0_m, dst + dst_stride_3x, 0, 5);
+}
+
+void ff_h264_idct8_addblk_lasx(uint8_t *dst, int16_t *src,
+                               int32_t dst_stride)
+{
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+    __m256i vec0, vec1, vec2, vec3;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
+    __m256i res0, res1, res2, res3, res4, res5, res6, res7;
+    __m256i dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7;
+    __m256i zero = __lasx_xvldi(0);
+    int32_t dst_stride_2x = dst_stride << 1;
+    int32_t dst_stride_4x = dst_stride << 2;
+    int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+
+    src[0] += 32;
+    DUP4_ARG2(__lasx_xvld, src, 0, src, 16, src, 32, src, 48,
+              src0, src1, src2, src3);
+    DUP4_ARG2(__lasx_xvld, src, 64, src, 80, src, 96, src, 112,
+              src4, src5, src6, src7);
+    __lasx_xvst(zero, src, 0);
+    __lasx_xvst(zero, src, 32);
+    __lasx_xvst(zero, src, 64);
+    __lasx_xvst(zero, src, 96);
+
+    vec0 = __lasx_xvadd_h(src0, src4);
+    vec1 = __lasx_xvsub_h(src0, src4);
+    vec2 = __lasx_xvsrai_h(src2, 1);
+    vec2 = __lasx_xvsub_h(vec2, src6);
+    vec3 = __lasx_xvsrai_h(src6, 1);
+    vec3 = __lasx_xvadd_h(src2, vec3);
+
+    LASX_BUTTERFLY_4_H(vec0, vec1, vec2, vec3, tmp0, tmp1, tmp2, tmp3);
+
+    vec0 = __lasx_xvsrai_h(src7, 1);
+    vec0 = __lasx_xvsub_h(src5, vec0);
+    vec0 = __lasx_xvsub_h(vec0, src3);
+    vec0 = __lasx_xvsub_h(vec0, src7);
+
+    vec1 = __lasx_xvsrai_h(src3, 1);
+    vec1 = __lasx_xvsub_h(src1, vec1);
+    vec1 = __lasx_xvadd_h(vec1, src7);
+    vec1 = __lasx_xvsub_h(vec1, src3);
+
+    vec2 = __lasx_xvsrai_h(src5, 1);
+    vec2 = __lasx_xvsub_h(vec2, src1);
+    vec2 = __lasx_xvadd_h(vec2, src7);
+    vec2 = __lasx_xvadd_h(vec2, src5);
+
+    vec3 = __lasx_xvsrai_h(src1, 1);
+    vec3 = __lasx_xvadd_h(src3, vec3);
+    vec3 = __lasx_xvadd_h(vec3, src5);
+    vec3 = __lasx_xvadd_h(vec3, src1);
+
+    tmp4 = __lasx_xvsrai_h(vec3, 2);
+    tmp4 = __lasx_xvadd_h(tmp4, vec0);
+    tmp5 = __lasx_xvsrai_h(vec2, 2);
+    tmp5 = __lasx_xvadd_h(tmp5, vec1);
+    tmp6 = __lasx_xvsrai_h(vec1, 2);
+    tmp6 = __lasx_xvsub_h(tmp6, vec2);
+    tmp7 = __lasx_xvsrai_h(vec0, 2);
+    tmp7 = __lasx_xvsub_h(vec3, tmp7);
+
+    LASX_BUTTERFLY_8_H(tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7,
+                       res0, res1, res2, res3, res4, res5, res6, res7);
+    LASX_TRANSPOSE8x8_H(res0, res1, res2, res3, res4, res5, res6, res7,
+                        res0, res1, res2, res3, res4, res5, res6, res7);
+
+    DUP4_ARG1(__lasx_vext2xv_w_h, res0, res1, res2, res3,
+              tmp0, tmp1, tmp2, tmp3);
+    DUP4_ARG1(__lasx_vext2xv_w_h, res4, res5, res6, res7,
+              tmp4, tmp5, tmp6, tmp7);
+    vec0 = __lasx_xvadd_w(tmp0, tmp4);
+    vec1 = __lasx_xvsub_w(tmp0, tmp4);
+
+    vec2 = __lasx_xvsrai_w(tmp2, 1);
+    vec2 = __lasx_xvsub_w(vec2, tmp6);
+    vec3 = __lasx_xvsrai_w(tmp6, 1);
+    vec3 = __lasx_xvadd_w(vec3, tmp2);
+
+    tmp0 = __lasx_xvadd_w(vec0, vec3);
+    tmp2 = __lasx_xvadd_w(vec1, vec2);
+    tmp4 = __lasx_xvsub_w(vec1, vec2);
+    tmp6 = __lasx_xvsub_w(vec0, vec3);
+
+    vec0 = __lasx_xvsrai_w(tmp7, 1);
+    vec0 = __lasx_xvsub_w(tmp5, vec0);
+    vec0 = __lasx_xvsub_w(vec0, tmp3);
+    vec0 = __lasx_xvsub_w(vec0, tmp7);
+
+    vec1 = __lasx_xvsrai_w(tmp3, 1);
+    vec1 = __lasx_xvsub_w(tmp1, vec1);
+    vec1 = __lasx_xvadd_w(vec1, tmp7);
+    vec1 = __lasx_xvsub_w(vec1, tmp3);
+
+    vec2 = __lasx_xvsrai_w(tmp5, 1);
+    vec2 = __lasx_xvsub_w(vec2, tmp1);
+    vec2 = __lasx_xvadd_w(vec2, tmp7);
+    vec2 = __lasx_xvadd_w(vec2, tmp5);
+
+    vec3 = __lasx_xvsrai_w(tmp1, 1);
+    vec3 = __lasx_xvadd_w(tmp3, vec3);
+    vec3 = __lasx_xvadd_w(vec3, tmp5);
+    vec3 = __lasx_xvadd_w(vec3, tmp1);
+
+    tmp1 = __lasx_xvsrai_w(vec3, 2);
+    tmp1 = __lasx_xvadd_w(tmp1, vec0);
+    tmp3 = __lasx_xvsrai_w(vec2, 2);
+    tmp3 = __lasx_xvadd_w(tmp3, vec1);
+    tmp5 = __lasx_xvsrai_w(vec1, 2);
+    tmp5 = __lasx_xvsub_w(tmp5, vec2);
+    tmp7 = __lasx_xvsrai_w(vec0, 2);
+    tmp7 = __lasx_xvsub_w(vec3, tmp7);
+
+    LASX_BUTTERFLY_4_W(tmp0, tmp2, tmp5, tmp7, res0, res1, res6, res7);
+    LASX_BUTTERFLY_4_W(tmp4, tmp6, tmp1, tmp3, res2, res3, res4, res5);
+
+    DUP4_ARG2(__lasx_xvsrai_w, res0, 6, res1, 6, res2, 6, res3, 6,
+              res0, res1, res2, res3);
+    DUP4_ARG2(__lasx_xvsrai_w, res4, 6, res5, 6, res6, 6, res7, 6,
+              res4, res5, res6, res7);
+    DUP4_ARG2(__lasx_xvpickev_h, res1, res0, res3, res2, res5, res4, res7,
+              res6, res0, res1, res2, res3);
+    DUP4_ARG2(__lasx_xvpermi_d, res0, 0xd8, res1, 0xd8, res2, 0xd8, res3, 0xd8,
+              res0, res1, res2, res3);
+
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, dst0, dst1, dst2, dst3);
+    dst += dst_stride_4x;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, dst4, dst5, dst6, dst7);
+    dst -= dst_stride_4x;
+    DUP4_ARG2(__lasx_xvilvl_b, zero, dst0, zero, dst1, zero, dst2, zero, dst3,
+              dst0, dst1, dst2, dst3);
+    DUP4_ARG2(__lasx_xvilvl_b, zero, dst4, zero, dst5, zero, dst6, zero, dst7,
+              dst4, dst5, dst6, dst7);
+    DUP4_ARG3(__lasx_xvpermi_q, dst1, dst0, 0x20, dst3, dst2, 0x20, dst5,
+              dst4, 0x20, dst7, dst6, 0x20, dst0, dst1, dst2, dst3);
+    res0 = __lasx_xvadd_h(res0, dst0);
+    res1 = __lasx_xvadd_h(res1, dst1);
+    res2 = __lasx_xvadd_h(res2, dst2);
+    res3 = __lasx_xvadd_h(res3, dst3);
+    DUP4_ARG1(__lasx_xvclip255_h, res0, res1, res2, res3, res0, res1,
+              res2, res3);
+    DUP2_ARG2(__lasx_xvpickev_b, res1, res0, res3, res2, res0, res1);
+    __lasx_xvstelm_d(res0, dst, 0, 0);
+    __lasx_xvstelm_d(res0, dst + dst_stride, 0, 2);
+    __lasx_xvstelm_d(res0, dst + dst_stride_2x, 0, 1);
+    __lasx_xvstelm_d(res0, dst + dst_stride_3x, 0, 3);
+    dst += dst_stride_4x;
+    __lasx_xvstelm_d(res1, dst, 0, 0);
+    __lasx_xvstelm_d(res1, dst + dst_stride, 0, 2);
+    __lasx_xvstelm_d(res1, dst + dst_stride_2x, 0, 1);
+    __lasx_xvstelm_d(res1, dst + dst_stride_3x, 0, 3);
+}
+
+void ff_h264_idct4x4_addblk_dc_lasx(uint8_t *dst, int16_t *src,
+                                    int32_t dst_stride)
+{
+    const int16_t dc = (src[0] + 32) >> 6;
+    int32_t dst_stride_2x = dst_stride << 1;
+    int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+    __m256i pred, out;
+    __m256i src0, src1, src2, src3;
+    __m256i input_dc = __lasx_xvreplgr2vr_h(dc);
+
+    src[0] = 0;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, src0, src1, src2, src3);
+    DUP2_ARG2(__lasx_xvilvl_w, src1, src0, src3, src2, src0, src1);
+
+    pred = __lasx_xvpermi_q(src0, src1, 0x02);
+    pred = __lasx_xvaddw_h_h_bu(input_dc, pred);
+    pred = __lasx_xvclip255_h(pred);
+    out = __lasx_xvpickev_b(pred, pred);
+    __lasx_xvstelm_w(out, dst, 0, 0);
+    __lasx_xvstelm_w(out, dst + dst_stride, 0, 1);
+    __lasx_xvstelm_w(out, dst + dst_stride_2x, 0, 4);
+    __lasx_xvstelm_w(out, dst + dst_stride_3x, 0, 5);
+}
+
+void ff_h264_idct8_dc_addblk_lasx(uint8_t *dst, int16_t *src,
+                                  int32_t dst_stride)
+{
+    int32_t dc_val;
+    int32_t dst_stride_2x = dst_stride << 1;
+    int32_t dst_stride_4x = dst_stride << 2;
+    int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+    __m256i dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7;
+    __m256i dc;
+
+    dc_val = (src[0] + 32) >> 6;
+    dc = __lasx_xvreplgr2vr_h(dc_val);
+
+    src[0] = 0;
+
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, dst0, dst1, dst2, dst3);
+    dst += dst_stride_4x;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, dst4, dst5, dst6, dst7);
+    dst -= dst_stride_4x;
+    DUP4_ARG1(__lasx_vext2xv_hu_bu, dst0, dst1, dst2, dst3,
+              dst0, dst1, dst2, dst3);
+    DUP4_ARG1(__lasx_vext2xv_hu_bu, dst4, dst5, dst6, dst7,
+              dst4, dst5, dst6, dst7);
+    DUP4_ARG3(__lasx_xvpermi_q, dst1, dst0, 0x20, dst3, dst2, 0x20, dst5,
+              dst4, 0x20, dst7, dst6, 0x20, dst0, dst1, dst2, dst3);
+    dst0 = __lasx_xvadd_h(dst0, dc);
+    dst1 = __lasx_xvadd_h(dst1, dc);
+    dst2 = __lasx_xvadd_h(dst2, dc);
+    dst3 = __lasx_xvadd_h(dst3, dc);
+    DUP4_ARG1(__lasx_xvclip255_h, dst0, dst1, dst2, dst3,
+              dst0, dst1, dst2, dst3);
+    DUP2_ARG2(__lasx_xvpickev_b, dst1, dst0, dst3, dst2, dst0, dst1);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst + dst_stride, 0, 2);
+    __lasx_xvstelm_d(dst0, dst + dst_stride_2x, 0, 1);
+    __lasx_xvstelm_d(dst0, dst + dst_stride_3x, 0, 3);
+    dst += dst_stride_4x;
+    __lasx_xvstelm_d(dst1, dst, 0, 0);
+    __lasx_xvstelm_d(dst1, dst + dst_stride, 0, 2);
+    __lasx_xvstelm_d(dst1, dst + dst_stride_2x, 0, 1);
+    __lasx_xvstelm_d(dst1, dst + dst_stride_3x, 0, 3);
+}
+
+void ff_h264_idct_add16_lasx(uint8_t *dst,
+                             const int32_t *blk_offset,
+                             int16_t *block, int32_t dst_stride,
+                             const uint8_t nzc[15 * 8])
+{
+    int32_t i;
+
+    for (i = 0; i < 16; i++) {
+        int32_t nnz = nzc[scan8[i]];
+
+        if (nnz) {
+            if (nnz == 1 && ((dctcoef *) block)[i * 16])
+                ff_h264_idct4x4_addblk_dc_lasx(dst + blk_offset[i],
+                                               block + i * 16 * sizeof(pixel),
+                                               dst_stride);
+            else
+                ff_h264_idct_add_lasx(dst + blk_offset[i],
+                                      block + i * 16 * sizeof(pixel),
+                                      dst_stride);
+        }
+    }
+}
+
+void ff_h264_idct8_add4_lasx(uint8_t *dst, const int32_t *blk_offset,
+                             int16_t *block, int32_t dst_stride,
+                             const uint8_t nzc[15 * 8])
+{
+    int32_t cnt;
+
+    for (cnt = 0; cnt < 16; cnt += 4) {
+        int32_t nnz = nzc[scan8[cnt]];
+
+        if (nnz) {
+            if (nnz == 1 && ((dctcoef *) block)[cnt * 16])
+                ff_h264_idct8_dc_addblk_lasx(dst + blk_offset[cnt],
+                                             block + cnt * 16 * sizeof(pixel),
+                                             dst_stride);
+            else
+                ff_h264_idct8_addblk_lasx(dst + blk_offset[cnt],
+                                          block + cnt * 16 * sizeof(pixel),
+                                          dst_stride);
+        }
+    }
+}
+
+
+void ff_h264_idct_add8_lasx(uint8_t **dst,
+                            const int32_t *blk_offset,
+                            int16_t *block, int32_t dst_stride,
+                            const uint8_t nzc[15 * 8])
+{
+    int32_t i;
+
+    for (i = 16; i < 20; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst[0] + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[0] + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+    for (i = 32; i < 36; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst[1] + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[1] + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+}
+
+void ff_h264_idct_add8_422_lasx(uint8_t **dst,
+                                const int32_t *blk_offset,
+                                int16_t *block, int32_t dst_stride,
+                                const uint8_t nzc[15 * 8])
+{
+    int32_t i;
+
+    for (i = 16; i < 20; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst[0] + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[0] + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+    for (i = 32; i < 36; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst[1] + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[1] + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+    for (i = 20; i < 24; i++) {
+        if (nzc[scan8[i + 4]])
+            ff_h264_idct_add_lasx(dst[0] + blk_offset[i + 4],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[0] + blk_offset[i + 4],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+    for (i = 36; i < 40; i++) {
+        if (nzc[scan8[i + 4]])
+            ff_h264_idct_add_lasx(dst[1] + blk_offset[i + 4],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[1] + blk_offset[i + 4],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+}
+
+void ff_h264_idct_add16_intra_lasx(uint8_t *dst,
+                                   const int32_t *blk_offset,
+                                   int16_t *block,
+                                   int32_t dst_stride,
+                                   const uint8_t nzc[15 * 8])
+{
+    int32_t i;
+
+    for (i = 0; i < 16; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel), dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+}
+
+void ff_h264_deq_idct_luma_dc_lasx(int16_t *dst, int16_t *src,
+                                   int32_t de_qval)
+{
+#define DC_DEST_STRIDE 16
+
+    __m256i src0, src1, src2, src3;
+    __m256i vec0, vec1, vec2, vec3;
+    __m256i tmp0, tmp1, tmp2, tmp3;
+    __m256i hres0, hres1, hres2, hres3;
+    __m256i vres0, vres1, vres2, vres3;
+    __m256i de_q_vec = __lasx_xvreplgr2vr_w(de_qval);
+
+    DUP4_ARG2(__lasx_xvld, src, 0, src, 8, src, 16, src, 24,
+              src0, src1, src2, src3);
+    LASX_TRANSPOSE4x4_H(src0, src1, src2, src3, tmp0, tmp1, tmp2, tmp3);
+    LASX_BUTTERFLY_4_H(tmp0, tmp2, tmp3, tmp1, vec0, vec3, vec2, vec1);
+    LASX_BUTTERFLY_4_H(vec0, vec1, vec2, vec3, hres0, hres3, hres2, hres1);
+    LASX_TRANSPOSE4x4_H(hres0, hres1, hres2, hres3,
+                        hres0, hres1, hres2, hres3);
+    LASX_BUTTERFLY_4_H(hres0, hres1, hres3, hres2, vec0, vec3, vec2, vec1);
+    LASX_BUTTERFLY_4_H(vec0, vec1, vec2, vec3, vres0, vres1, vres2, vres3);
+    DUP4_ARG1(__lasx_vext2xv_w_h, vres0, vres1, vres2, vres3,
+              vres0, vres1, vres2, vres3);
+    DUP2_ARG3(__lasx_xvpermi_q, vres1, vres0, 0x20, vres3, vres2, 0x20,
+              vres0, vres1);
+
+    vres0 = __lasx_xvmul_w(vres0, de_q_vec);
+    vres1 = __lasx_xvmul_w(vres1, de_q_vec);
+
+    vres0 = __lasx_xvsrari_w(vres0, 8);
+    vres1 = __lasx_xvsrari_w(vres1, 8);
+    vec0 = __lasx_xvpickev_h(vres1, vres0);
+    vec0 = __lasx_xvpermi_d(vec0, 0xd8);
+    __lasx_xvstelm_h(vec0, dst + 0  * DC_DEST_STRIDE, 0, 0);
+    __lasx_xvstelm_h(vec0, dst + 2  * DC_DEST_STRIDE, 0, 1);
+    __lasx_xvstelm_h(vec0, dst + 8  * DC_DEST_STRIDE, 0, 2);
+    __lasx_xvstelm_h(vec0, dst + 10 * DC_DEST_STRIDE, 0, 3);
+    __lasx_xvstelm_h(vec0, dst + 1  * DC_DEST_STRIDE, 0, 4);
+    __lasx_xvstelm_h(vec0, dst + 3  * DC_DEST_STRIDE, 0, 5);
+    __lasx_xvstelm_h(vec0, dst + 9  * DC_DEST_STRIDE, 0, 6);
+    __lasx_xvstelm_h(vec0, dst + 11 * DC_DEST_STRIDE, 0, 7);
+    __lasx_xvstelm_h(vec0, dst + 4  * DC_DEST_STRIDE, 0, 8);
+    __lasx_xvstelm_h(vec0, dst + 6  * DC_DEST_STRIDE, 0, 9);
+    __lasx_xvstelm_h(vec0, dst + 12 * DC_DEST_STRIDE, 0, 10);
+    __lasx_xvstelm_h(vec0, dst + 14 * DC_DEST_STRIDE, 0, 11);
+    __lasx_xvstelm_h(vec0, dst + 5  * DC_DEST_STRIDE, 0, 12);
+    __lasx_xvstelm_h(vec0, dst + 7  * DC_DEST_STRIDE, 0, 13);
+    __lasx_xvstelm_h(vec0, dst + 13 * DC_DEST_STRIDE, 0, 14);
+    __lasx_xvstelm_h(vec0, dst + 15 * DC_DEST_STRIDE, 0, 15);
+
+#undef DC_DEST_STRIDE
+}

From patchwork Tue Dec 14 13:33:15 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= <chenhao@loongson.cn>
X-Patchwork-Id: 32488
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp6966225iog;
        Tue, 14 Dec 2021 05:35:04 -0800 (PST)
X-Google-Smtp-Source: 
 ABdhPJyHfO+YrJ0EMJilwaUoONdzVoJcKqn0kmt8TR/+EeD85uFiDE0f1GB9uQK8ZMbemSXcPsn5
X-Received: by 2002:a17:906:bccc:: with SMTP id
 lw12mr5924481ejb.128.1639488904542;
        Tue, 14 Dec 2021 05:35:04 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1639488904; cv=none;
        d=google.com; s=arc-20160816;
        b=Wg0hgXzRo3XIKcb5EZzKoHmFkzpI9I+umqOE+FZjei+XKdQxcXEfRAqabTLYwJXdHa
         OGB0th9AFxk/TqPqdoEoK+BqZRedV4Qzwb+yt+9ShPZHV0j12+GcF1Ca/v3ljEH76KQO
         08YN5Sw2e8++31zaqkLq3gg0j8hFhawFuiavk4630AhDhYxGcYslabkYiGDdTACtZyif
         owtRePgp3FyBZFjYBkr5U0DCY2TTrFFtEKz4q02GAMzC+qyXk9i5NSlmuknQbtbGTGGF
         4o2s6n3OGOYdrBfMnfmIssW0Uq+J5AZOfmQZ2xu0AibwXPTYtBC/ylNgX9ML8qEQiT8F
         5FGA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:delivered-to;
        bh=Xs7u0VFS2ZhoKL+5k4A2UiAZ2kGJwqfObSD596oY1bs=;
        b=VEnv6F9eWwmTQxtC+4pmsGOWy5PXdu2chC8Y5JMTp/BWr+uWuxwpa+EJ/fnD3MMhe7
         fXLbrinUcAsOq9AqXAw7U42tNws7EkNTJYvvWxmlcupdrl7baXzJ8WgSyEhpGz/z54sI
         84hFQEQqTAJBx/mx51w0y1U7KOF6NGxBB8ZLlntqRV412VmeDSgBEzOn0MyBaz1/Ioy2
         SyivNrF3NHxG8S/b4VhTCZvqAlui/WrmyaoUn+OXyJqg4G6x1ha7AoUAlPrF/D2x+Erq
         ulO/SfmnRAt8MDHnH7TOGRn705GfNmjeoYeXm9rbyeyydk837N5XkiJNyiBNR1jngbwZ
         b0Sg==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 l24si20222416edr.155.2021.12.14.05.35.03;
        Tue, 14 Dec 2021 05:35:04 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 62C4F68AE8C;
	Tue, 14 Dec 2021 15:34:00 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from loongson.cn (mail.loongson.cn [114.242.206.163])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id E160468AEAA
 for <ffmpeg-devel@ffmpeg.org>; Tue, 14 Dec 2021 15:33:47 +0200 (EET)
Received: from localhost (unknown [36.33.26.144])
 by mail.loongson.cn (Coremail) with SMTP id AQAAf9DxpN45nbhhl6cAAA--.3442S3;
 Tue, 14 Dec 2021 21:33:45 +0800 (CST)
From: Hao Chen <chenhao@loongson.cn>
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:15 +0800
Message-Id: <20211214133316.8978-7-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
MIME-Version: 1.0
X-CM-TRANSID: AQAAf9DxpN45nbhhl6cAAA--.3442S3
X-Coremail-Antispam: 1UD129KBjvJXoWxKw4Uuw48Kw13Wr4xtr4kCrg_yoW3tr43pa
 4j9FsrJa18JFsrZr9rXw4kAr1SyFZ7Gr17tF15K3W7urWavryxWrZ2kFWqq3WDJw4UGF15
 XF1fua4ava43Jw7anT9S1TB71UUUUU7qnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2
 9KBjDU0xBIdaVrnRJUUUk2b7Iv0xC_KF4lb4IE77IF4wAFF20E14v26r1j6r4UM7CY07I2
 0VC2zVCF04k26cxKx2IYs7xG6rWj6s0DM7CIcVAFz4kK6r1j6r18M28lY4IEw2IIxxk0rw
 A2F7IY1VAKz4vEj48ve4kI8wA2z4x0Y4vE2Ix0cI8IcVAFwI0_Xr0_Ar1l84ACjcxK6xII
 jxv20xvEc7CjxVAFwI0_Cr0_Gr1UM28EF7xvwVC2z280aVAFwI0_GcCE3s1l84ACjcxK6I
 8E87Iv6xkF7I0E14v26rxl6s0DM2AIxVAIcxkEcVAq07x20xvEncxIr21l5I8CrVACY4xI
 64kE6c02F40Ex7xfMcIj6xIIjxv20xvE14v26r1q6rW5McIj6I8E87Iv67AKxVW8Jr0_Cr
 1UMcvjeVCFs4IE7xkEbVWUJVW8JwACjcxG0xvY0x0EwIxGrwCY02Avz4vE14v_Xr4l42xK
 82IYc2Ij64vIr41l4I8I3I0E4IkC6x0Yz7v_Jr0_Gr1lx2IqxVAqx4xG67AKxVWUJVWUGw
 C20s026x8GjcxK67AKxVWUGVWUWwC2zVAF1VAY17CE14v26r1Y6r17MIIYrxkI7VAKI48J
 MIIF0xvE2Ix0cI8IcVAFwI0_Gr0_Xr1lIxAIcVC0I7IYx2IY6xkF7I0E14v26r4j6F4UMI
 IF0xvE42xK8VAvwI8IcIk0rVWUJVWUCwCI42IY6I8E87Iv67AKxVW8JVWxJwCI42IY6I8E
 87Iv6xkF7I0E14v26r4j6r4UJbIYCTnIWIevJa73UjIFyTuYvjxUxD73DUUUU
X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/
Subject: [FFmpeg-devel] [PATCH v2 6/7] avcodec: [loongarch] Optimize
 h264_deblock with LASX.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Jin Bo <jinbo@loongson.cn>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: gXuDGOM+ayW7

From: Jin Bo <jinbo@loongson.cn>

Benchmark (video decode only; audio disabled, output discarded):
./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before: 293
after:  295

Change-Id: I5ff6cba4eaca0c4218c0c97b880ca500e35f9c87
Signed-off-by: Hao Chen <chenhao@loongson.cn>
---
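Reviewer note (not part of the commit): the scalar sketch below is an
illustration of what the vector macro appears to compute per 4-sample edge
in the single-list (non-bidir) path. The names edge_strength_ref and
mv_exceeds are hypothetical and do not exist in FFmpeg. The bias/bound
pairs (3, 6) and (1, 2) are read off cnst_1/cnst_0 in this patch; the
mapping to "|d| >= 4" and "|d| >= 2" is my reading of the add/saturating
subtract sequence, so treat it as an assumption rather than a spec quote.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference: boundary strength for the edge between
 * neighbouring blocks p and q when only one reference list is used.
 * 2 = either block has coded coefficients, 1 = motion differs, 0 = none. */
static int edge_strength_ref(int nnz_p, int nnz_q, int ref_p, int ref_q,
                             const int16_t mv_p[2], const int16_t mv_q[2],
                             int field)
{
    int mvy_limit = field ? 2 : 4;   /* vertical limit is tighter for fields */

    if (nnz_p || nnz_q)
        return 2;
    if (ref_p != ref_q)
        return 1;
    if (abs(mv_p[0] - mv_q[0]) >= 4)
        return 1;
    if (abs(mv_p[1] - mv_q[1]) >= mvy_limit)
        return 1;
    return 0;
}

/* Branch-free test for |d| >= limit on one byte lane, mirroring the
 * saturate / add bias / unsigned saturating subtract sequence:
 * returns non-zero exactly when the limit is reached. */
static uint8_t mv_exceeds(int d, int limit)
{
    int     sat    = d > 127 ? 127 : d < -128 ? -128 : d; /* like xvsat_h(.., 7) */
    uint8_t biased = (uint8_t)(sat + (limit - 1));        /* wrapping byte add   */
    uint8_t bound  = (uint8_t)(2 * (limit - 1));

    return biased > bound ? (uint8_t)(biased - bound) : 0; /* saturating sub */
}
```

With limit = 4 the helper uses bias 3 and bound 6; with limit = 2 it uses
bias 1 and bound 2, which matches the per-byte constants selected by the
field flag in ff_h264_loop_filter_strength_lasx.
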
 libavcodec/loongarch/Makefile                 |   3 +-
 libavcodec/loongarch/h264_deblock_lasx.c      | 147 ++++++++++++++++++
 libavcodec/loongarch/h264dsp_init_loongarch.c |   2 +
 libavcodec/loongarch/h264dsp_lasx.h           |   6 +
 4 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/loongarch/h264_deblock_lasx.c

diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 242a2be290..1e1fe3fd48 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -4,4 +4,5 @@ OBJS-$(CONFIG_H264DSP)                += loongarch/h264dsp_init_loongarch.o
 LASX-OBJS-$(CONFIG_H264CHROMA)        += loongarch/h264chroma_lasx.o
 LASX-OBJS-$(CONFIG_H264QPEL)          += loongarch/h264qpel_lasx.o
 LASX-OBJS-$(CONFIG_H264DSP)           += loongarch/h264dsp_lasx.o \
-                                         loongarch/h264idct_lasx.o
+                                         loongarch/h264idct_lasx.o \
+                                         loongarch/h264_deblock_lasx.o
diff --git a/libavcodec/loongarch/h264_deblock_lasx.c b/libavcodec/loongarch/h264_deblock_lasx.c
new file mode 100644
index 0000000000..c89bea9a84
--- /dev/null
+++ b/libavcodec/loongarch/h264_deblock_lasx.c
@@ -0,0 +1,147 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Xiwei Gu <guxiwei-hf@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavcodec/bit_depth_template.c"
+#include "h264dsp_lasx.h"
+#include "libavutil/loongarch/loongson_intrinsics.h"
+
+#define H264_LOOP_FILTER_STRENGTH_ITERATION_LASX(edges, step, mask_mv, dir, \
+                                                 d_idx, mask_dir)           \
+do {                                                                        \
+    int b_idx = 0; \
+    int step_x4 = step << 2; \
+    int d_idx_12 = d_idx + 12; \
+    int d_idx_52 = d_idx + 52; \
+    int d_idx_x4 = d_idx << 2; \
+    int d_idx_x4_48 = d_idx_x4 + 48; \
+    int dir_x32  = dir * 32; \
+    uint8_t *ref_t = (uint8_t*)ref; \
+    uint8_t *mv_t  = (uint8_t*)mv; \
+    uint8_t *nnz_t = (uint8_t*)nnz; \
+    uint8_t *bS_t  = (uint8_t*)bS; \
+    mask_mv <<= 3; \
+    for (; b_idx < edges; b_idx += step) { \
+        out &= mask_dir; \
+        if (!(mask_mv & b_idx)) { \
+            if (bidir) { \
+                ref2 = __lasx_xvldx(ref_t, d_idx_12); \
+                ref3 = __lasx_xvldx(ref_t, d_idx_52); \
+                ref0 = __lasx_xvld(ref_t, 12); \
+                ref1 = __lasx_xvld(ref_t, 52); \
+                ref2 = __lasx_xvilvl_w(ref3, ref2); \
+                ref0 = __lasx_xvilvl_w(ref0, ref0); \
+                ref1 = __lasx_xvilvl_w(ref1, ref1); \
+                ref3 = __lasx_xvshuf4i_w(ref2, 0xB1); \
+                ref0 = __lasx_xvsub_b(ref0, ref2); \
+                ref1 = __lasx_xvsub_b(ref1, ref3); \
+                ref0 = __lasx_xvor_v(ref0, ref1); \
+\
+                tmp2 = __lasx_xvldx(mv_t, d_idx_x4_48); \
+                tmp3 = __lasx_xvld(mv_t, 48); \
+                tmp4 = __lasx_xvld(mv_t, 208); \
+                tmp5 = __lasx_xvld(mv_t + d_idx_x4, 208); \
+                DUP2_ARG3(__lasx_xvpermi_q, tmp2, tmp2, 0x20, tmp5, tmp5, \
+                          0x20, tmp2, tmp5); \
+                tmp3 = __lasx_xvpermi_q(tmp4, tmp3, 0x20); \
+                tmp2 = __lasx_xvsub_h(tmp2, tmp3); \
+                tmp5 = __lasx_xvsub_h(tmp5, tmp3); \
+                DUP2_ARG2(__lasx_xvsat_h, tmp2, 7, tmp5, 7, tmp2, tmp5); \
+                tmp0 = __lasx_xvpickev_b(tmp5, tmp2); \
+                tmp0 = __lasx_xvpermi_d(tmp0, 0xd8); \
+                tmp0 = __lasx_xvadd_b(tmp0, cnst_1); \
+                tmp0 = __lasx_xvssub_bu(tmp0, cnst_0); \
+                tmp0 = __lasx_xvsat_h(tmp0, 7); \
+                tmp0 = __lasx_xvpickev_b(tmp0, tmp0); \
+                tmp0 = __lasx_xvpermi_d(tmp0, 0xd8); \
+                tmp1 = __lasx_xvpickod_d(tmp0, tmp0); \
+                out = __lasx_xvor_v(ref0, tmp0); \
+                tmp1 = __lasx_xvshuf4i_w(tmp1, 0xB1); \
+                out = __lasx_xvor_v(out, tmp1); \
+                tmp0 = __lasx_xvshuf4i_w(out, 0xB1); \
+                out = __lasx_xvmin_bu(out, tmp0); \
+            } else { \
+                ref0 = __lasx_xvldx(ref_t, d_idx_12); \
+                ref3 = __lasx_xvld(ref_t, 12); \
+                tmp2 = __lasx_xvldx(mv_t, d_idx_x4_48); \
+                tmp3 = __lasx_xvld(mv_t, 48); \
+                tmp4 = __lasx_xvsub_h(tmp3, tmp2); \
+                tmp1 = __lasx_xvsat_h(tmp4, 7); \
+                tmp1 = __lasx_xvpickev_b(tmp1, tmp1); \
+                tmp1 = __lasx_xvadd_b(tmp1, cnst_1); \
+                out = __lasx_xvssub_bu(tmp1, cnst_0); \
+                out = __lasx_xvsat_h(out, 7); \
+                out = __lasx_xvpickev_b(out, out); \
+                ref0 = __lasx_xvsub_b(ref3, ref0); \
+                out = __lasx_xvor_v(out, ref0); \
+            } \
+        } \
+        tmp0 = __lasx_xvld(nnz_t, 12); \
+        tmp1 = __lasx_xvldx(nnz_t, d_idx_12); \
+        tmp0 = __lasx_xvor_v(tmp0, tmp1); \
+        tmp0 = __lasx_xvmin_bu(tmp0, cnst_2); \
+        out  = __lasx_xvmin_bu(out, cnst_2); \
+        tmp0 = __lasx_xvslli_h(tmp0, 1); \
+        tmp0 = __lasx_xvmax_bu(out, tmp0); \
+        tmp0 = __lasx_vext2xv_hu_bu(tmp0); \
+        __lasx_xvstelm_d(tmp0, bS_t + dir_x32, 0, 0); \
+        ref_t += step; \
+        mv_t  += step_x4; \
+        nnz_t += step; \
+        bS_t  += step; \
+    } \
+} while (0)
+
+void ff_h264_loop_filter_strength_lasx(int16_t bS[2][4][4], uint8_t nnz[40],
+                                       int8_t ref[2][40], int16_t mv[2][40][2],
+                                       int bidir, int edges, int step,
+                                       int mask_mv0, int mask_mv1, int field)
+{
+    __m256i out;
+    __m256i ref0, ref1, ref2, ref3;
+    __m256i tmp0, tmp1;
+    __m256i tmp2, tmp3, tmp4, tmp5;
+    __m256i cnst_0, cnst_1, cnst_2;
+    __m256i zero = __lasx_xvldi(0);
+    __m256i one  = __lasx_xvnor_v(zero, zero);
+    int64_t cnst3 = 0x0206020602060206, cnst4 = 0x0103010301030103;
+    if (field) {
+        cnst_0 = __lasx_xvreplgr2vr_d(cnst3);
+        cnst_1 = __lasx_xvreplgr2vr_d(cnst4);
+        cnst_2 = __lasx_xvldi(0x01);
+    } else {
+        DUP2_ARG1(__lasx_xvldi, 0x06, 0x03, cnst_0, cnst_1);
+        cnst_2 = __lasx_xvldi(0x01);
+    }
+    step  <<= 3;
+    edges <<= 3;
+
+    H264_LOOP_FILTER_STRENGTH_ITERATION_LASX(edges, step, mask_mv1,
+                                             1, -8, zero);
+    H264_LOOP_FILTER_STRENGTH_ITERATION_LASX(32, 8, mask_mv0, 0, -1, one);
+
+    DUP2_ARG2(__lasx_xvld, (int8_t*)bS, 0, (int8_t*)bS, 16, tmp0, tmp1);
+    DUP2_ARG2(__lasx_xvilvh_d, tmp0, tmp0, tmp1, tmp1, tmp2, tmp3);
+    LASX_TRANSPOSE4x4_H(tmp0, tmp2, tmp1, tmp3, tmp2, tmp3, tmp4, tmp5);
+    __lasx_xvstelm_d(tmp2, (int8_t*)bS, 0, 0);
+    __lasx_xvstelm_d(tmp3, (int8_t*)bS + 8, 0, 0);
+    __lasx_xvstelm_d(tmp4, (int8_t*)bS + 16, 0, 0);
+    __lasx_xvstelm_d(tmp5, (int8_t*)bS + 24, 0, 0);
+}
diff --git a/libavcodec/loongarch/h264dsp_init_loongarch.c b/libavcodec/loongarch/h264dsp_init_loongarch.c
index 0985c2fe8a..37633c3e51 100644
--- a/libavcodec/loongarch/h264dsp_init_loongarch.c
+++ b/libavcodec/loongarch/h264dsp_init_loongarch.c
@@ -29,6 +29,8 @@ av_cold void ff_h264dsp_init_loongarch(H264DSPContext *c, const int bit_depth,
     int cpu_flags = av_get_cpu_flags();
 
     if (have_lasx(cpu_flags)) {
+        if (chroma_format_idc <= 1)
+            c->h264_loop_filter_strength = ff_h264_loop_filter_strength_lasx;
         if (bit_depth == 8) {
             c->h264_add_pixels4_clear = ff_h264_add_pixels4_8_lasx;
             c->h264_add_pixels8_clear = ff_h264_add_pixels8_8_lasx;
diff --git a/libavcodec/loongarch/h264dsp_lasx.h b/libavcodec/loongarch/h264dsp_lasx.h
index bfd567fffa..4cf813750b 100644
--- a/libavcodec/loongarch/h264dsp_lasx.h
+++ b/libavcodec/loongarch/h264dsp_lasx.h
@@ -88,4 +88,10 @@ void ff_h264_idct_add16_intra_lasx(uint8_t *dst, const int32_t *blk_offset,
                                    const uint8_t nzc[15 * 8]);
 void ff_h264_deq_idct_luma_dc_lasx(int16_t *dst, int16_t *src,
                                    int32_t de_qval);
+
+void ff_h264_loop_filter_strength_lasx(int16_t bS[2][4][4], uint8_t nnz[40],
+                                       int8_t ref[2][40], int16_t mv[2][40][2],
+                                       int bidir, int edges, int step,
+                                       int mask_mv0, int mask_mv1, int field);
+
 #endif  // #ifndef AVCODEC_LOONGARCH_H264DSP_LASX_H

From patchwork Tue Dec 14 13:33:16 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= <chenhao@loongson.cn>
X-Patchwork-Id: 32490
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp6966815iog;
        Tue, 14 Dec 2021 05:35:32 -0800 (PST)
X-Google-Smtp-Source: 
 ABdhPJzroN9XYpSjLwV4BCaCBJXJFKL6Sl/UJuK5JwXOn0KjjqtE2a89JAGUoLcF2eQWAIk0wDn3
X-Received: by 2002:aa7:dd47:: with SMTP id o7mr7727984edw.34.1639488932666;
        Tue, 14 Dec 2021 05:35:32 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1639488932; cv=none;
        d=google.com; s=arc-20160816;
        b=lxRK69PwW4K29CNP3Lb4fOOdS7SESno5KkgITxdIuFPRM0sRn4wjHot8TOiTh5T3sO
         /iygKaNBSWGOqLoP6nyq39ZA6dhhiODM3k+C+xq3Tx6JyNuYc7hDTni98YKtfgM5KP0f
         Yc4UXqkstOR5rQ4JzS3OCA4oIMr3oaGwpaWHn4oArgP2AvirVP2oSXwssXD2lcoblRiR
         6RAz2yCSu/QSx5OwmWl2dWjj4LLK22SV7L+wAMmwy1WYbkDqiuBjYf4O654sXl+GLo6y
         HrvfVSwJ0ongmTeEY9OwrzTAQbUyblS5ASEoLBYzqtWmBAJBAfs4x2ZjBLgAdFi+CyEV
         hPbg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:reply-to:list-subscribe
         :list-help:list-post:list-archive:list-unsubscribe:list-id
         :precedence:subject:mime-version:references:in-reply-to:message-id
         :date:to:from:delivered-to;
        bh=nc1X72D8rL0VpWd768OEIjzMmRQ4BlTQguWk2JRbPyA=;
        b=uOdG6b92R3xjDsdw0ltcpkLDcFPZBdKtufBm6X9MLFUsyInijsKOBQEgm6MgwKRuHo
         YmGLVzNPA5nzf1+bYol4PhJ1/raR5W+7lRsS91jB4VudUwTyQvqQjPKep1yM7GJ3uIUo
         E6nj6uS+losMndDooTg8rrLNrlEvfSJNpMqslv9qZy1RJfdYT06TYv0ZXeMunMx5M351
         O4KPbCl5WUOE1Ddnah2jqKi/wUa+J95joYLFokK6qh0ZwLft47Pwqzy9foTeuCwHtuqN
         8LkDEls2wp7ZtaJlxpHBFfuqfzT6f9WSpAsljgvylxjKToAYhZccKQP1Ad9Gb7ZpZJw6
         7x+g==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 dy14si21707139edb.594.2021.12.14.05.35.31;
        Tue, 14 Dec 2021 05:35:32 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 93F4D68AF41;
	Tue, 14 Dec 2021 15:34:02 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from loongson.cn (mail.loongson.cn [114.242.206.163])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id B689E68AECF
 for <ffmpeg-devel@ffmpeg.org>; Tue, 14 Dec 2021 15:33:48 +0200 (EET)
Received: from localhost (unknown [36.33.26.144])
 by mail.loongson.cn (Coremail) with SMTP id AQAAf9Dx2ZY6nbhhmKcAAA--.480S3;
 Tue, 14 Dec 2021 21:33:46 +0800 (CST)
From: Hao Chen <chenhao@loongson.cn>
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 21:33:16 +0800
Message-Id: <20211214133316.8978-8-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20211214133316.8978-1-chenhao@loongson.cn>
References: <20211214133316.8978-1-chenhao@loongson.cn>
MIME-Version: 1.0
X-CM-TRANSID: AQAAf9Dx2ZY6nbhhmKcAAA--.480S3
X-Coremail-Antispam: 1UD129KBjvDXoW8JrWkuw13ZryDtw1kZoXrpw1ktrc_Gw1SkF
 18Cr4rCas2ga1jgw13Cr98ZrW8AFnxAryvyFnaqa45XFyrXa1kX3Wjvw1UKr97ZFy5J343
 t3Z7Aw1UKjkaLaAFLSUrUUUUjb8apTn2vfkv8UJUUUU8Yxn0WfASr-VFAUDa7-sFnT9fnU
 UIcSsGvfJTRUUUb28YjsxI4VWkKwAYFVCjjxCrM7AC8VAFwI0_Jr0_Gr1l1xkIjI8I6I8E
 6xAIw20EY4v20xvaj40_Wr0E3s1l1IIY67AEw4v_Jr0_Jr4l8cAvFVAK0II2c7xJM28Cjx
 kF64kEwVA0rcxSw2x7M28EF7xvwVC0I7IYx2IY67AKxVW5JVW7JwA2z4x0Y4vE2Ix0cI8I
 cVCY1x0267AKxVWxJVW8Jr1l84ACjcxK6I8E87Iv67AKxVW0oVCq3wA2z4x0Y4vEx4A2js
 IEc7CjxVAFwI0_GcCE3s1le2I262IYc4CY6c8Ij28IcVAaY2xG8wAqx4xG64xvF2IEw4CE
 5I8CrVC2j2WlYx0E2Ix0cI8IcVAFwI0_Jw0_WrylYx0Ex4A2jsIE14v26r4UJVWxJr1lOx
 8S6xCaFVCjc4AY6r1j6r4UM4x0Y48IcxkI7VAKI48JMxkIecxEwVAFwVW5GwCF04k20xvY
 0x0EwIxGrwCFx2IqxVCFs4IE7xkEbVWUJVW8JwC20s026c02F40E14v26r1j6r18MI8I3I
 0E7480Y4vE14v26r106r1rMI8E67AF67kF1VAFwI0_Jr0_JrylIxkGc2Ij64vIr41lIxAI
 cVC0I7IYx2IY67AKxVW8JVW5JwCI42IY6xIIjxv20xvEc7CjxVAFwI0_Gr0_Cr1lIxAIcV
 CF04k26cxKx2IYs7xG6r1j6r1xMIIF0xvEx4A2jsIE14v26r4j6F4UMIIF0xvEx4A2jsIE
 c7CjxVAFwI0_Gr0_Gr1UYxBIdaVFxhVjvjDU0xZFpf9x07bwo7NUUUUU=
X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/
Subject: [FFmpeg-devel] [PATCH v2 7/7] avcodec: [loongarch] Optimize
 pred16x16_plane with LASX.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: 479ClSNE1gCA

Benchmark (video decode only; audio disabled, output discarded):
./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before: 295
after:  296

Change-Id: I281bc739f708d45f91fc3860150944c0b8a6a5ba
---
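Reviewer note (not part of the commit): a scalar sketch of 16x16 plane
prediction for the plain H.264 case, useful as a reference when checking
the PRED16X16_PLANE / PRED16X16_PLANE_END macros. The function name
pred16x16_plane_h264_ref and the helper clip_u8 are hypothetical. The
slope rounding (5 * x + 32) >> 6 is the H.264 form as I understand it;
the SVQ3 and RV40 variants in this patch scale the slopes differently
and are not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference. src points at the top-left pixel of the
 * 16x16 block; the row above (x = -1..15) and the column to the left
 * (y = -1..15) must be readable. */
static void pred16x16_plane_h264_ref(uint8_t *src, ptrdiff_t stride)
{
    const uint8_t *top  = src - stride;  /* neighbours above the block  */
    const uint8_t *left = src - 1;       /* neighbours left of the block */
    int H = 0, V = 0;

    /* Weighted differences around the centre of the top row / left column. */
    for (int k = 1; k <= 8; k++) {
        H += k * (top[7 + k] - top[7 - k]);
        V += k * (left[(7 + k) * stride] - left[(7 - k) * stride]);
    }

    /* Slopes; >> on negative values is assumed to be an arithmetic shift. */
    int b = (5 * H + 32) >> 6;
    int c = (5 * V + 32) >> 6;
    int a = 16 * (left[15 * stride] + top[15] + 1) - 7 * (b + c);

    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            src[y * stride + x] = clip_u8((a + b * x + c * y) >> 5);
}
```

In the patch, res0 and res1 play the role of the horizontal and vertical
sums, and res2 corresponds to the offset a above.
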
 libavcodec/h264pred.c                         |   2 +
 libavcodec/h264pred.h                         |   2 +
 libavcodec/loongarch/Makefile                 |   2 +
 .../loongarch/h264_intrapred_init_loongarch.c |  50 ++++++++
 libavcodec/loongarch/h264_intrapred_lasx.c    | 121 ++++++++++++++++++
 libavcodec/loongarch/h264_intrapred_lasx.h    |  31 +++++
 6 files changed, 208 insertions(+)
 create mode 100644 libavcodec/loongarch/h264_intrapred_init_loongarch.c
 create mode 100644 libavcodec/loongarch/h264_intrapred_lasx.c
 create mode 100644 libavcodec/loongarch/h264_intrapred_lasx.h

diff --git a/libavcodec/h264pred.c b/libavcodec/h264pred.c
index b0fec71f25..bd0d4a3d06 100644
--- a/libavcodec/h264pred.c
+++ b/libavcodec/h264pred.c
@@ -602,4 +602,6 @@ av_cold void ff_h264_pred_init(H264PredContext *h, int codec_id,
         ff_h264_pred_init_x86(h, codec_id, bit_depth, chroma_format_idc);
     if (ARCH_MIPS)
         ff_h264_pred_init_mips(h, codec_id, bit_depth, chroma_format_idc);
+    if (ARCH_LOONGARCH)
+        ff_h264_pred_init_loongarch(h, codec_id, bit_depth, chroma_format_idc);
 }
diff --git a/libavcodec/h264pred.h b/libavcodec/h264pred.h
index 2863dc9bd1..4583052dfe 100644
--- a/libavcodec/h264pred.h
+++ b/libavcodec/h264pred.h
@@ -122,5 +122,7 @@ void ff_h264_pred_init_x86(H264PredContext *h, int codec_id,
                            const int bit_depth, const int chroma_format_idc);
 void ff_h264_pred_init_mips(H264PredContext *h, int codec_id,
                             const int bit_depth, const int chroma_format_idc);
+void ff_h264_pred_init_loongarch(H264PredContext *h, int codec_id,
+                                 const int bit_depth, const int chroma_format_idc);
 
 #endif /* AVCODEC_H264PRED_H */
diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 1e1fe3fd48..30799e4e48 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -1,8 +1,10 @@
 OBJS-$(CONFIG_H264CHROMA)             += loongarch/h264chroma_init_loongarch.o
 OBJS-$(CONFIG_H264QPEL)               += loongarch/h264qpel_init_loongarch.o
 OBJS-$(CONFIG_H264DSP)                += loongarch/h264dsp_init_loongarch.o
+OBJS-$(CONFIG_H264PRED)               += loongarch/h264_intrapred_init_loongarch.o
 LASX-OBJS-$(CONFIG_H264CHROMA)        += loongarch/h264chroma_lasx.o
 LASX-OBJS-$(CONFIG_H264QPEL)          += loongarch/h264qpel_lasx.o
 LASX-OBJS-$(CONFIG_H264DSP)           += loongarch/h264dsp_lasx.o \
                                          loongarch/h264idct_lasx.o \
                                          loongarch/h264_deblock_lasx.o
+LASX-OBJS-$(CONFIG_H264PRED)          += loongarch/h264_intrapred_lasx.o
diff --git a/libavcodec/loongarch/h264_intrapred_init_loongarch.c b/libavcodec/loongarch/h264_intrapred_init_loongarch.c
new file mode 100644
index 0000000000..12620bd842
--- /dev/null
+++ b/libavcodec/loongarch/h264_intrapred_init_loongarch.c
@@ -0,0 +1,50 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen <chenhao@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/cpu.h"
+#include "libavcodec/h264pred.h"
+#include "h264_intrapred_lasx.h"
+
+av_cold void ff_h264_pred_init_loongarch(H264PredContext *h, int codec_id,
+                                         const int bit_depth,
+                                         const int chroma_format_idc)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (bit_depth == 8) {
+        if (have_lasx(cpu_flags)) {
+            if (chroma_format_idc <= 1) {
+            }
+            if (codec_id == AV_CODEC_ID_VP7 || codec_id == AV_CODEC_ID_VP8) {
+            } else {
+                if (chroma_format_idc <= 1) {
+                }
+                if (codec_id == AV_CODEC_ID_SVQ3) {
+                    h->pred16x16[PLANE_PRED8x8] = ff_h264_pred16x16_plane_svq3_8_lasx;
+                } else if (codec_id == AV_CODEC_ID_RV40) {
+                    h->pred16x16[PLANE_PRED8x8] = ff_h264_pred16x16_plane_rv40_8_lasx;
+                } else {
+                    h->pred16x16[PLANE_PRED8x8] = ff_h264_pred16x16_plane_h264_8_lasx;
+                }
+            }
+        }
+    }
+}
diff --git a/libavcodec/loongarch/h264_intrapred_lasx.c b/libavcodec/loongarch/h264_intrapred_lasx.c
new file mode 100644
index 0000000000..c38cd611b8
--- /dev/null
+++ b/libavcodec/loongarch/h264_intrapred_lasx.c
@@ -0,0 +1,121 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen <chenhao@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/loongson_intrinsics.h"
+#include "h264_intrapred_lasx.h"
+
+#define PRED16X16_PLANE                                                        \
+    ptrdiff_t stride_1, stride_2, stride_3, stride_4, stride_5, stride_6;      \
+    ptrdiff_t stride_8, stride_15;                                             \
+    int32_t res0, res1, res2, res3, cnt;                                       \
+    uint8_t *src0, *src1;                                                      \
+    __m256i reg0, reg1, reg2, reg3, reg4;                                      \
+    __m256i tmp0, tmp1, tmp2, tmp3;                                            \
+    __m256i shuff = {0x0B040A0509060807, 0x0F000E010D020C03, 0, 0};            \
+    __m256i mult = {0x0004000300020001, 0x0008000700060005, 0, 0};             \
+    __m256i int_mult1 = {0x0000000100000000, 0x0000000300000002,               \
+                         0x0000000500000004, 0x0000000700000006};              \
+                                                                               \
+    stride_1 = -stride;                                                        \
+    stride_2 = stride << 1;                                                    \
+    stride_3 = stride_2 + stride;                                              \
+    stride_4 = stride_2 << 1;                                                  \
+    stride_5 = stride_4 + stride;                                              \
+    stride_6 = stride_3 << 1;                                                  \
+    stride_8 = stride_4 << 1;                                                  \
+    stride_15 = (stride_8 << 1) - stride;                                      \
+    src0 = src - 1;                                                            \
+    src1 = src0 + stride_8;                                                    \
+                                                                               \
+    reg0 = __lasx_xvldx(src0, -stride);                                        \
+    reg1 = __lasx_xvldx(src, (8 - stride));                                    \
+    reg0 = __lasx_xvilvl_d(reg1, reg0);                                        \
+    reg0 = __lasx_xvshuf_b(reg0, reg0, shuff);                                 \
+    reg0 = __lasx_xvhsubw_hu_bu(reg0, reg0);                                   \
+    reg0 = __lasx_xvmul_h(reg0, mult);                                         \
+    res1 = (src1[0] - src0[stride_6]) +                                        \
+        2 * (src1[stride] - src0[stride_5]) +                                  \
+        3 * (src1[stride_2] - src0[stride_4]) +                                \
+        4 * (src1[stride_3] - src0[stride_3]) +                                \
+        5 * (src1[stride_4] - src0[stride_2]) +                                \
+        6 * (src1[stride_5] - src0[stride]) +                                  \
+        7 * (src1[stride_6] - src0[0]) +                                       \
+        8 * (src0[stride_15] - src0[stride_1]);                                \
+    reg0 = __lasx_xvhaddw_w_h(reg0, reg0);                                     \
+    reg0 = __lasx_xvhaddw_d_w(reg0, reg0);                                     \
+    reg0 = __lasx_xvhaddw_q_d(reg0, reg0);                                     \
+    res0 = __lasx_xvpickve2gr_w(reg0, 0);                                      \
+
+#define PRED16X16_PLANE_END                                                    \
+    res2 = (src0[stride_15] + src[15 - stride] + 1) << 4;                      \
+    res3 = 7 * (res0 + res1);                                                  \
+    res2 -= res3;                                                              \
+    reg0 = __lasx_xvreplgr2vr_w(res0);                                         \
+    reg1 = __lasx_xvreplgr2vr_w(res1);                                         \
+    reg2 = __lasx_xvreplgr2vr_w(res2);                                         \
+    reg3 = __lasx_xvmul_w(reg0, int_mult1);                                    \
+    reg4 = __lasx_xvslli_w(reg0, 3);                                           \
+    reg4 = __lasx_xvadd_w(reg4, reg3);                                         \
+    for (cnt = 8; cnt--;) {                                                    \
+        tmp0 = __lasx_xvadd_w(reg2, reg3);                                     \
+        tmp1 = __lasx_xvadd_w(reg2, reg4);                                     \
+        tmp0 = __lasx_xvssrani_hu_w(tmp1, tmp0, 5);                            \
+        tmp0 = __lasx_xvpermi_d(tmp0, 0xD8);                                   \
+        reg2 = __lasx_xvadd_w(reg2, reg1);                                     \
+        tmp2 = __lasx_xvadd_w(reg2, reg3);                                     \
+        tmp3 = __lasx_xvadd_w(reg2, reg4);                                     \
+        tmp1 = __lasx_xvssrani_hu_w(tmp3, tmp2, 5);                            \
+        tmp1 = __lasx_xvpermi_d(tmp1, 0xD8);                                   \
+        tmp0 = __lasx_xvssrani_bu_h(tmp1, tmp0, 0);                            \
+        reg2 = __lasx_xvadd_w(reg2, reg1);                                     \
+        __lasx_xvstelm_d(tmp0, src, 0, 0);                                     \
+        __lasx_xvstelm_d(tmp0, src, 8, 2);                                     \
+        src += stride;                                                         \
+        __lasx_xvstelm_d(tmp0, src, 0, 1);                                     \
+        __lasx_xvstelm_d(tmp0, src, 8, 3);                                     \
+        src += stride;                                                         \
+    }
+
+
+void ff_h264_pred16x16_plane_h264_8_lasx(uint8_t *src, ptrdiff_t stride)
+{
+    PRED16X16_PLANE
+    res0 = (5 * res0 + 32) >> 6;
+    res1 = (5 * res1 + 32) >> 6;
+    PRED16X16_PLANE_END
+}
+
+void ff_h264_pred16x16_plane_rv40_8_lasx(uint8_t *src, ptrdiff_t stride)
+{
+    PRED16X16_PLANE
+    res0 = (res0 + (res0 >> 2)) >> 4;
+    res1 = (res1 + (res1 >> 2)) >> 4;
+    PRED16X16_PLANE_END
+}
+
+void ff_h264_pred16x16_plane_svq3_8_lasx(uint8_t *src, ptrdiff_t stride)
+{
+    PRED16X16_PLANE
+    cnt  = (5 * (res0/4)) / 16;
+    res0 = (5 * (res1/4)) / 16;
+    res1 = cnt;
+    PRED16X16_PLANE_END
+}
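
For review purposes, the arithmetic that PRED16X16_PLANE and PRED16X16_PLANE_END vectorize can be cross-checked against a plain scalar version. The sketch below covers only the H.264 rounding variant; the names pred16x16_plane_ref, clip_u8 and plane_ref_self_check are hypothetical and are not part of this patch or of FFmpeg. It assumes an arithmetic right shift for negative ints, as the existing C code does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clamp to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference for the H.264 16x16 plane predictor.
 * src points at the top-left pixel of the block; the row above and the
 * column to the left (including the top-left corner) must be readable. */
static void pred16x16_plane_ref(uint8_t *src, ptrdiff_t stride)
{
    const uint8_t *top  = src - stride; /* row above the block     */
    const uint8_t *left = src - 1;      /* column left of the block */
    int H = 0, V = 0;

    /* Weighted differences around the centre of the top row / left column. */
    for (int k = 1; k <= 8; k++) {
        H += k * (top[7 + k] - top[7 - k]);
        V += k * (left[(7 + k) * stride] - left[(7 - k) * stride]);
    }

    /* H.264 rounding; the rv40 and svq3 variants scale H and V differently. */
    H = (5 * H + 32) >> 6;
    V = (5 * V + 32) >> 6;

    int a = 16 * (left[15 * stride] + top[15] + 1) - 7 * (V + H);

    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            src[x] = clip_u8((a + x * H) >> 5);
        a   += V;
        src += stride;
    }
}

/* With all neighbours equal to 100 the gradients vanish, so every
 * predicted pixel must be 100 as well. Returns 1 on success. */
static int plane_ref_self_check(void)
{
    enum { STRIDE = 32 };
    uint8_t buf[17 * STRIDE + 1];
    uint8_t *blk = buf + STRIDE + 1; /* one row above, one column left */

    memset(buf, 100, sizeof(buf));
    for (int y = 0; y < 16; y++)
        memset(blk + y * STRIDE, 0, 16);

    pred16x16_plane_ref(blk, STRIDE);

    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            assert(blk[y * STRIDE + x] == 100);
    return 1;
}
```

Running the LASX functions and a reference like this over the same randomized neighbours and comparing the resulting 16x16 blocks is one way to confirm the vector path.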
diff --git a/libavcodec/loongarch/h264_intrapred_lasx.h b/libavcodec/loongarch/h264_intrapred_lasx.h
new file mode 100644
index 0000000000..0c2653300c
--- /dev/null
+++ b/libavcodec/loongarch/h264_intrapred_lasx.h
@@ -0,0 +1,31 @@
+/*
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Hao Chen <chenhao@loongson.cn>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_LOONGARCH_H264_INTRAPRED_LASX_H
+#define AVCODEC_LOONGARCH_H264_INTRAPRED_LASX_H
+
+#include "libavcodec/avcodec.h"
+
+void ff_h264_pred16x16_plane_h264_8_lasx(uint8_t *src, ptrdiff_t stride);
+void ff_h264_pred16x16_plane_rv40_8_lasx(uint8_t *src, ptrdiff_t stride);
+void ff_h264_pred16x16_plane_svq3_8_lasx(uint8_t *src, ptrdiff_t stride);
+
+#endif /* AVCODEC_LOONGARCH_H264_INTRAPRED_LASX_H */