From patchwork Tue Dec 14 07:15:43 2021
X-Patchwork-Submitter: Hao Chen
X-Patchwork-Id: 32482
From: Hao Chen
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 14 Dec 2021 15:15:43 +0800
Message-Id: <20211214071545.26283-6-chenhao@loongson.cn>
In-Reply-To: <20211214071545.26283-1-chenhao@loongson.cn>
References: <20211214071545.26283-1-chenhao@loongson.cn>
Subject: [FFmpeg-devel] [PATCH 5/7] avcodec: [loongarch] Optimize h264idct with LASX.
Cc: Lu Wang

From: Lu Wang

./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an
before: 282
after : 293

Change-Id: Ia8889935a6359630dd5dbb61263287f1cb24a0a4
---
 libavcodec/loongarch/Makefile                 |   3 +-
 libavcodec/loongarch/h264dsp_init_loongarch.c |  15 +
 libavcodec/loongarch/h264dsp_lasx.h           |  23 +
 libavcodec/loongarch/h264idct_lasx.c          | 498 ++++++++++++++++++
 4 files changed, 538 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/loongarch/h264idct_lasx.c

diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index df43151dbd..242a2be290 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -3,4 +3,5 @@ OBJS-$(CONFIG_H264QPEL)               += loongarch/h264qpel_init_loongarch.o
 OBJS-$(CONFIG_H264DSP)                += loongarch/h264dsp_init_loongarch.o
 LASX-OBJS-$(CONFIG_H264CHROMA)        += loongarch/h264chroma_lasx.o
 LASX-OBJS-$(CONFIG_H264QPEL)          += loongarch/h264qpel_lasx.o
-LASX-OBJS-$(CONFIG_H264DSP)           += loongarch/h264dsp_lasx.o
+LASX-OBJS-$(CONFIG_H264DSP)           += loongarch/h264dsp_lasx.o \
+                                         loongarch/h264idct_lasx.o
diff --git a/libavcodec/loongarch/h264dsp_init_loongarch.c b/libavcodec/loongarch/h264dsp_init_loongarch.c
index ddc0877a74..0985c2fe8a 100644
--- a/libavcodec/loongarch/h264dsp_init_loongarch.c
+++ b/libavcodec/loongarch/h264dsp_init_loongarch.c
@@ -53,6 +53,21 @@ av_cold void ff_h264dsp_init_loongarch(H264DSPContext *c, const int bit_depth,
             c->biweight_h264_pixels_tab[0] = ff_biweight_h264_pixels16_8_lasx;
             c->biweight_h264_pixels_tab[1] = ff_biweight_h264_pixels8_8_lasx;
             c->biweight_h264_pixels_tab[2] = ff_biweight_h264_pixels4_8_lasx;
+
+            c->h264_idct_add = ff_h264_idct_add_lasx;
+            c->h264_idct8_add = ff_h264_idct8_addblk_lasx;
+            c->h264_idct_dc_add = ff_h264_idct4x4_addblk_dc_lasx;
+            c->h264_idct8_dc_add = ff_h264_idct8_dc_addblk_lasx;
+            c->h264_idct_add16 = ff_h264_idct_add16_lasx;
+            c->h264_idct8_add4 = ff_h264_idct8_add4_lasx;
+
+            if (chroma_format_idc <= 1)
+                c->h264_idct_add8 = ff_h264_idct_add8_lasx;
+            else
+                c->h264_idct_add8 = ff_h264_idct_add8_422_lasx;
+
+            c->h264_idct_add16intra = ff_h264_idct_add16_intra_lasx;
+            c->h264_luma_dc_dequant_idct = ff_h264_deq_idct_luma_dc_lasx;
         }
     }
 }
diff --git a/libavcodec/loongarch/h264dsp_lasx.h b/libavcodec/loongarch/h264dsp_lasx.h
index 538c14c936..bfd567fffa 100644
--- a/libavcodec/loongarch/h264dsp_lasx.h
+++ b/libavcodec/loongarch/h264dsp_lasx.h
@@ -65,4 +65,27 @@ void ff_weight_h264_pixels4_8_lasx(uint8_t *src, ptrdiff_t stride,
 void ff_h264_add_pixels4_8_lasx(uint8_t *_dst, int16_t *_src, int stride);
 void ff_h264_add_pixels8_8_lasx(uint8_t *_dst, int16_t *_src, int stride);
+void ff_h264_idct_add_lasx(uint8_t *dst, int16_t *src, int32_t dst_stride);
+void ff_h264_idct8_addblk_lasx(uint8_t *dst, int16_t *src, int32_t dst_stride);
+void ff_h264_idct4x4_addblk_dc_lasx(uint8_t *dst, int16_t *src,
+                                    int32_t dst_stride);
+void ff_h264_idct8_dc_addblk_lasx(uint8_t *dst, int16_t *src,
+                                  int32_t dst_stride);
+void ff_h264_idct_add16_lasx(uint8_t *dst, const int32_t *blk_offset,
+                             int16_t *block, int32_t dst_stride,
+                             const uint8_t nzc[15 * 8]);
+void ff_h264_idct8_add4_lasx(uint8_t *dst, const int32_t *blk_offset,
+                             int16_t *block, int32_t dst_stride,
+                             const uint8_t nzc[15 * 8]);
+void ff_h264_idct_add8_lasx(uint8_t **dst, const int32_t *blk_offset,
+                            int16_t *block, int32_t dst_stride,
+                            const uint8_t nzc[15 * 8]);
+void ff_h264_idct_add8_422_lasx(uint8_t **dst, const int32_t *blk_offset,
+                                int16_t *block, int32_t dst_stride,
+                                const uint8_t nzc[15 * 8]);
+void ff_h264_idct_add16_intra_lasx(uint8_t *dst, const int32_t *blk_offset,
+                                   int16_t *block, int32_t dst_stride,
+                                   const uint8_t nzc[15 * 8]);
+void ff_h264_deq_idct_luma_dc_lasx(int16_t *dst, int16_t *src,
+                                   int32_t de_qval);
 
 #endif // #ifndef AVCODEC_LOONGARCH_H264DSP_LASX_H
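For context: the init hunk above installs these pointers only for 8-bit decoding on a CPU with LASX, and chroma_format_idc picks between the 4:2:0 and 4:2:2 chroma wrappers. A minimal sketch of that dispatch pattern, using illustrative stand-in names (SketchDSPContext, cpu_has_lasx) rather than FFmpeg's real H264DSPContext and av_get_cpu_flags():

#include <stdint.h>

/* Illustrative stand-in for FFmpeg's H264DSPContext. */
typedef void (*idct_fn)(uint8_t *dst, int16_t *coef, int32_t stride);

typedef struct SketchDSPContext {
    idct_fn idct_add;    /* per-block 4x4 coefficient add */
    idct_fn idct_add8;   /* chroma wrapper, layout-dependent */
} SketchDSPContext;

/* Hypothetical kernels and CPU probe, declared only for the sketch. */
extern void idct_add_c(uint8_t *, int16_t *, int32_t);
extern void idct_add_lasx(uint8_t *, int16_t *, int32_t);
extern void idct_add8_420_lasx(uint8_t *, int16_t *, int32_t);
extern void idct_add8_422_lasx(uint8_t *, int16_t *, int32_t);
extern int  cpu_has_lasx(void);

void sketch_dsp_init(SketchDSPContext *c, int bit_depth, int chroma_format_idc)
{
    c->idct_add = idct_add_c;                    /* portable default */
    if (bit_depth == 8 && cpu_has_lasx()) {
        c->idct_add = idct_add_lasx;             /* override with SIMD */
        /* 4:0:0 and 4:2:0 share a block layout; 4:2:2 has twice the
         * chroma rows and needs the _422 variant. */
        c->idct_add8 = (chroma_format_idc <= 1) ? idct_add8_420_lasx
                                                : idct_add8_422_lasx;
    }
}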
diff --git a/libavcodec/loongarch/h264idct_lasx.c b/libavcodec/loongarch/h264idct_lasx.c
new file mode 100644
index 0000000000..46bd3b74d5
--- /dev/null
+++ b/libavcodec/loongarch/h264idct_lasx.c
@@ -0,0 +1,498 @@
+/*
+ * Loongson LASX optimized h264dsp
+ *
+ * Copyright (c) 2021 Loongson Technology Corporation Limited
+ * Contributed by Shiyou Yin
+ *                Xiwei Gu
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/loongarch/loongson_intrinsics.h"
+#include "h264dsp_lasx.h"
+#include "libavcodec/bit_depth_template.c"
+
+#define AVC_ITRANS_H(in0, in1, in2, in3, out0, out1, out2, out3)    \
+{                                                                   \
+    __m256i tmp0_m, tmp1_m, tmp2_m, tmp3_m;                         \
+                                                                    \
+    tmp0_m = __lasx_xvadd_h(in0, in2);                              \
+    tmp1_m = __lasx_xvsub_h(in0, in2);                              \
+    tmp2_m = __lasx_xvsrai_h(in1, 1);                               \
+    tmp2_m = __lasx_xvsub_h(tmp2_m, in3);                           \
+    tmp3_m = __lasx_xvsrai_h(in3, 1);                               \
+    tmp3_m = __lasx_xvadd_h(in1, tmp3_m);                           \
+                                                                    \
+    LASX_BUTTERFLY_4_H(tmp0_m, tmp1_m, tmp2_m, tmp3_m,              \
+                       out0, out1, out2, out3);                     \
+}
+
+void ff_h264_idct_add_lasx(uint8_t *dst, int16_t *src, int32_t dst_stride)
+{
+    __m256i src0_m, src1_m, src2_m, src3_m;
+    __m256i dst0_m, dst1_m;
+    __m256i hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3;
+    __m256i inp0_m, inp1_m, res0_m, src1, src3;
+    __m256i src0 = __lasx_xvld(src, 0);
+    __m256i src2 = __lasx_xvld(src, 16);
+    __m256i zero = __lasx_xvldi(0);
+    int32_t dst_stride_2x = dst_stride << 1;
+    int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+
+    __lasx_xvst(zero, src, 0);
+    DUP2_ARG2(__lasx_xvilvh_d, src0, src0, src2, src2, src1, src3);
+    AVC_ITRANS_H(src0, src1, src2, src3, hres0, hres1, hres2, hres3);
+    LASX_TRANSPOSE4x4_H(hres0, hres1, hres2, hres3, hres0, hres1, hres2, hres3);
+    AVC_ITRANS_H(hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3);
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, src0_m, src1_m, src2_m, src3_m);
+    DUP2_ARG2(__lasx_xvilvl_d, vres1, vres0, vres3, vres2, inp0_m, inp1_m);
+    inp0_m = __lasx_xvpermi_q(inp1_m, inp0_m, 0x20);
+    inp0_m = __lasx_xvsrari_h(inp0_m, 6);
+    DUP2_ARG2(__lasx_xvilvl_w, src1_m, src0_m, src3_m, src2_m,
+              dst0_m, dst1_m);
+    dst0_m = __lasx_xvilvl_d(dst1_m, dst0_m);
+    res0_m = __lasx_vext2xv_hu_bu(dst0_m);
+    res0_m = __lasx_xvadd_h(res0_m, inp0_m);
+    res0_m = __lasx_xvclip255_h(res0_m);
+    dst0_m = __lasx_xvpickev_b(res0_m, res0_m);
+    __lasx_xvstelm_w(dst0_m, dst, 0, 0);
+    __lasx_xvstelm_w(dst0_m, dst + dst_stride, 0, 1);
+    __lasx_xvstelm_w(dst0_m, dst + dst_stride_2x, 0, 4);
+    __lasx_xvstelm_w(dst0_m, dst + dst_stride_3x, 0, 5);
+}
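For reference: AVC_ITRANS_H above is the 1-D H.264 4x4 inverse transform (a butterfly with the odd terms scaled by >>1), and ff_h264_idct_add_lasx runs it over the rows, transposes, runs it again, then rounds with (x + 32) >> 6 and adds the result to the prediction with clipping (xvsrari_h / xvclip255_h). A scalar sketch of the same computation, assuming 8-bit pixels (illustrative, not part of the patch):

#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? (uint8_t)255 : (uint8_t)v;
}

static void idct4x4_add_ref(uint8_t *dst, int16_t *blk, int stride)
{
    int tmp[16];
    /* horizontal (row) pass */
    for (int i = 0; i < 4; i++) {
        const int16_t *s = blk + i * 4;
        int t0 = s[0] + s[2];
        int t1 = s[0] - s[2];
        int t2 = (s[1] >> 1) - s[3];
        int t3 = s[1] + (s[3] >> 1);
        tmp[i * 4 + 0] = t0 + t3;
        tmp[i * 4 + 1] = t1 + t2;
        tmp[i * 4 + 2] = t1 - t2;
        tmp[i * 4 + 3] = t0 - t3;
    }
    /* vertical (column) pass, then round, add to prediction and clip */
    for (int j = 0; j < 4; j++) {
        int t0 = tmp[0 * 4 + j] + tmp[2 * 4 + j];
        int t1 = tmp[0 * 4 + j] - tmp[2 * 4 + j];
        int t2 = (tmp[1 * 4 + j] >> 1) - tmp[3 * 4 + j];
        int t3 = tmp[1 * 4 + j] + (tmp[3 * 4 + j] >> 1);
        dst[0 * stride + j] = clip_u8(dst[0 * stride + j] + ((t0 + t3 + 32) >> 6));
        dst[1 * stride + j] = clip_u8(dst[1 * stride + j] + ((t1 + t2 + 32) >> 6));
        dst[2 * stride + j] = clip_u8(dst[2 * stride + j] + ((t1 - t2 + 32) >> 6));
        dst[3 * stride + j] = clip_u8(dst[3 * stride + j] + ((t0 - t3 + 32) >> 6));
    }
    /* the LASX version also clears the coefficient block (xvst of zero) */
    for (int i = 0; i < 16; i++)
        blk[i] = 0;
}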
+
+void ff_h264_idct8_addblk_lasx(uint8_t *dst, int16_t *src,
+                               int32_t dst_stride)
+{
+    __m256i src0, src1, src2, src3, src4, src5, src6, src7;
+    __m256i vec0, vec1, vec2, vec3;
+    __m256i tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
+    __m256i res0, res1, res2, res3, res4, res5, res6, res7;
+    __m256i dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7;
+    __m256i zero = __lasx_xvldi(0);
+    int32_t dst_stride_2x = dst_stride << 1;
+    int32_t dst_stride_4x = dst_stride << 2;
+    int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+
+    src[0] += 32;
+    DUP4_ARG2(__lasx_xvld, src, 0, src, 16, src, 32, src, 48,
+              src0, src1, src2, src3);
+    DUP4_ARG2(__lasx_xvld, src, 64, src, 80, src, 96, src, 112,
+              src4, src5, src6, src7);
+    __lasx_xvst(zero, src, 0);
+    __lasx_xvst(zero, src, 32);
+    __lasx_xvst(zero, src, 64);
+    __lasx_xvst(zero, src, 96);
+
+    vec0 = __lasx_xvadd_h(src0, src4);
+    vec1 = __lasx_xvsub_h(src0, src4);
+    vec2 = __lasx_xvsrai_h(src2, 1);
+    vec2 = __lasx_xvsub_h(vec2, src6);
+    vec3 = __lasx_xvsrai_h(src6, 1);
+    vec3 = __lasx_xvadd_h(src2, vec3);
+
+    LASX_BUTTERFLY_4_H(vec0, vec1, vec2, vec3, tmp0, tmp1, tmp2, tmp3);
+
+    vec0 = __lasx_xvsrai_h(src7, 1);
+    vec0 = __lasx_xvsub_h(src5, vec0);
+    vec0 = __lasx_xvsub_h(vec0, src3);
+    vec0 = __lasx_xvsub_h(vec0, src7);
+
+    vec1 = __lasx_xvsrai_h(src3, 1);
+    vec1 = __lasx_xvsub_h(src1, vec1);
+    vec1 = __lasx_xvadd_h(vec1, src7);
+    vec1 = __lasx_xvsub_h(vec1, src3);
+
+    vec2 = __lasx_xvsrai_h(src5, 1);
+    vec2 = __lasx_xvsub_h(vec2, src1);
+    vec2 = __lasx_xvadd_h(vec2, src7);
+    vec2 = __lasx_xvadd_h(vec2, src5);
+
+    vec3 = __lasx_xvsrai_h(src1, 1);
+    vec3 = __lasx_xvadd_h(src3, vec3);
+    vec3 = __lasx_xvadd_h(vec3, src5);
+    vec3 = __lasx_xvadd_h(vec3, src1);
+
+    tmp4 = __lasx_xvsrai_h(vec3, 2);
+    tmp4 = __lasx_xvadd_h(tmp4, vec0);
+    tmp5 = __lasx_xvsrai_h(vec2, 2);
+    tmp5 = __lasx_xvadd_h(tmp5, vec1);
+    tmp6 = __lasx_xvsrai_h(vec1, 2);
+    tmp6 = __lasx_xvsub_h(tmp6, vec2);
+    tmp7 = __lasx_xvsrai_h(vec0, 2);
+    tmp7 = __lasx_xvsub_h(vec3, tmp7);
+
+    LASX_BUTTERFLY_8_H(tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7,
+                       res0, res1, res2, res3, res4, res5, res6, res7);
+    LASX_TRANSPOSE8x8_H(res0, res1, res2, res3, res4, res5, res6, res7,
+                        res0, res1, res2, res3, res4, res5, res6, res7);
+
+    DUP4_ARG1(__lasx_vext2xv_w_h, res0, res1, res2, res3,
+              tmp0, tmp1, tmp2, tmp3);
+    DUP4_ARG1(__lasx_vext2xv_w_h, res4, res5, res6, res7,
+              tmp4, tmp5, tmp6, tmp7);
+    vec0 = __lasx_xvadd_w(tmp0, tmp4);
+    vec1 = __lasx_xvsub_w(tmp0, tmp4);
+
+    vec2 = __lasx_xvsrai_w(tmp2, 1);
+    vec2 = __lasx_xvsub_w(vec2, tmp6);
+    vec3 = __lasx_xvsrai_w(tmp6, 1);
+    vec3 = __lasx_xvadd_w(vec3, tmp2);
+
+    tmp0 = __lasx_xvadd_w(vec0, vec3);
+    tmp2 = __lasx_xvadd_w(vec1, vec2);
+    tmp4 = __lasx_xvsub_w(vec1, vec2);
+    tmp6 = __lasx_xvsub_w(vec0, vec3);
+
+    vec0 = __lasx_xvsrai_w(tmp7, 1);
+    vec0 = __lasx_xvsub_w(tmp5, vec0);
+    vec0 = __lasx_xvsub_w(vec0, tmp3);
+    vec0 = __lasx_xvsub_w(vec0, tmp7);
+
+    vec1 = __lasx_xvsrai_w(tmp3, 1);
+    vec1 = __lasx_xvsub_w(tmp1, vec1);
+    vec1 = __lasx_xvadd_w(vec1, tmp7);
+    vec1 = __lasx_xvsub_w(vec1, tmp3);
+
+    vec2 = __lasx_xvsrai_w(tmp5, 1);
+    vec2 = __lasx_xvsub_w(vec2, tmp1);
+    vec2 = __lasx_xvadd_w(vec2, tmp7);
+    vec2 = __lasx_xvadd_w(vec2, tmp5);
+
+    vec3 = __lasx_xvsrai_w(tmp1, 1);
+    vec3 = __lasx_xvadd_w(tmp3, vec3);
+    vec3 = __lasx_xvadd_w(vec3, tmp5);
+    vec3 = __lasx_xvadd_w(vec3, tmp1);
+
+    tmp1 = __lasx_xvsrai_w(vec3, 2);
+    tmp1 = __lasx_xvadd_w(tmp1, vec0);
+    tmp3 = __lasx_xvsrai_w(vec2, 2);
+    tmp3 = __lasx_xvadd_w(tmp3, vec1);
+    tmp5 = __lasx_xvsrai_w(vec1, 2);
+    tmp5 = __lasx_xvsub_w(tmp5, vec2);
+    tmp7 = __lasx_xvsrai_w(vec0, 2);
+    tmp7 = __lasx_xvsub_w(vec3, tmp7);
+
+    LASX_BUTTERFLY_4_W(tmp0, tmp2, tmp5, tmp7, res0, res1, res6, res7);
+    LASX_BUTTERFLY_4_W(tmp4, tmp6, tmp1, tmp3, res2, res3, res4, res5);
+
+    DUP4_ARG2(__lasx_xvsrai_w, res0, 6, res1, 6, res2, 6, res3, 6,
+              res0, res1, res2, res3);
+    DUP4_ARG2(__lasx_xvsrai_w, res4, 6, res5, 6, res6, 6, res7, 6,
+              res4, res5, res6, res7);
+    DUP4_ARG2(__lasx_xvpickev_h, res1, res0, res3, res2, res5, res4, res7,
+              res6, res0, res1, res2, res3);
+    DUP4_ARG2(__lasx_xvpermi_d, res0, 0xd8, res1, 0xd8, res2, 0xd8, res3, 0xd8,
+              res0, res1, res2, res3);
+
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, dst0, dst1, dst2, dst3);
+    dst += dst_stride_4x;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, dst4, dst5, dst6, dst7);
+    dst -= dst_stride_4x;
+    DUP4_ARG2(__lasx_xvilvl_b, zero, dst0, zero, dst1, zero, dst2, zero, dst3,
+              dst0, dst1, dst2, dst3);
+    DUP4_ARG2(__lasx_xvilvl_b, zero, dst4, zero, dst5, zero, dst6, zero, dst7,
+              dst4, dst5, dst6, dst7);
+    DUP4_ARG3(__lasx_xvpermi_q, dst1, dst0, 0x20, dst3, dst2, 0x20, dst5,
+              dst4, 0x20, dst7, dst6, 0x20, dst0, dst1, dst2, dst3);
+    res0 = __lasx_xvadd_h(res0, dst0);
+    res1 = __lasx_xvadd_h(res1, dst1);
+    res2 = __lasx_xvadd_h(res2, dst2);
+    res3 = __lasx_xvadd_h(res3, dst3);
+    DUP4_ARG1(__lasx_xvclip255_h, res0, res1, res2, res3, res0, res1,
+              res2, res3);
+    DUP2_ARG2(__lasx_xvpickev_b, res1, res0, res3, res2, res0, res1);
+    __lasx_xvstelm_d(res0, dst, 0, 0);
+    __lasx_xvstelm_d(res0, dst + dst_stride, 0, 2);
+    __lasx_xvstelm_d(res0, dst + dst_stride_2x, 0, 1);
+    __lasx_xvstelm_d(res0, dst + dst_stride_3x, 0, 3);
+    dst += dst_stride_4x;
+    __lasx_xvstelm_d(res1, dst, 0, 0);
+    __lasx_xvstelm_d(res1, dst + dst_stride, 0, 2);
+    __lasx_xvstelm_d(res1, dst + dst_stride_2x, 0, 1);
+    __lasx_xvstelm_d(res1, dst + dst_stride_3x, 0, 3);
+}
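The 8x8 kernel follows the standard H.264 even/odd decomposition: a 16-bit row pass, a transpose, then a column pass widened to 32 bits (vext2xv_w_h) to avoid overflow, with the rounding bias folded into src[0] += 32 so the final shift is a plain >> 6. A scalar sketch of the same transform, assuming 8-bit pixels (illustrative, not part of the patch):

#include <stdint.h>

static uint8_t clip_u8b(int v)
{
    return v < 0 ? 0 : v > 255 ? (uint8_t)255 : (uint8_t)v;
}

static void idct8x8_add_ref(uint8_t *dst, int16_t *blk, int stride)
{
    int tmp[64];
    blk[0] += 32;                 /* fold in the final rounding bias */
    for (int i = 0; i < 8; i++) { /* row pass */
        const int16_t *s = blk + i * 8;
        int a0 = s[0] + s[4], a4 = s[0] - s[4];
        int a2 = (s[2] >> 1) - s[6], a6 = (s[6] >> 1) + s[2];
        int b0 = a0 + a6, b2 = a4 + a2, b4 = a4 - a2, b6 = a0 - a6;
        int a1 = -s[3] + s[5] - s[7] - (s[7] >> 1);
        int a3 =  s[1] + s[7] - s[3] - (s[3] >> 1);
        int a5 = -s[1] + s[7] + s[5] + (s[5] >> 1);
        int a7 =  s[3] + s[5] + s[1] + (s[1] >> 1);
        int b1 = (a7 >> 2) + a1, b3 = a3 + (a5 >> 2);
        int b5 = (a3 >> 2) - a5, b7 = a7 - (a1 >> 2);
        int *t = tmp + i * 8;
        t[0] = b0 + b7; t[7] = b0 - b7; t[1] = b2 + b5; t[6] = b2 - b5;
        t[2] = b4 + b3; t[5] = b4 - b3; t[3] = b6 + b1; t[4] = b6 - b1;
    }
    for (int j = 0; j < 8; j++) { /* column pass, then add and clip */
        int a0 = tmp[0 * 8 + j] + tmp[4 * 8 + j];
        int a4 = tmp[0 * 8 + j] - tmp[4 * 8 + j];
        int a2 = (tmp[2 * 8 + j] >> 1) - tmp[6 * 8 + j];
        int a6 = (tmp[6 * 8 + j] >> 1) + tmp[2 * 8 + j];
        int b0 = a0 + a6, b2 = a4 + a2, b4 = a4 - a2, b6 = a0 - a6;
        int a1 = -tmp[3*8+j] + tmp[5*8+j] - tmp[7*8+j] - (tmp[7*8+j] >> 1);
        int a3 =  tmp[1*8+j] + tmp[7*8+j] - tmp[3*8+j] - (tmp[3*8+j] >> 1);
        int a5 = -tmp[1*8+j] + tmp[7*8+j] + tmp[5*8+j] + (tmp[5*8+j] >> 1);
        int a7 =  tmp[3*8+j] + tmp[5*8+j] + tmp[1*8+j] + (tmp[1*8+j] >> 1);
        int b1 = (a7 >> 2) + a1, b3 = a3 + (a5 >> 2);
        int b5 = (a3 >> 2) - a5, b7 = a7 - (a1 >> 2);
        int o[8] = { b0 + b7, b2 + b5, b4 + b3, b6 + b1,
                     b6 - b1, b4 - b3, b2 - b5, b0 - b7 };
        for (int i = 0; i < 8; i++)
            dst[i * stride + j] = clip_u8b(dst[i * stride + j] + (o[i] >> 6));
    }
    for (int i = 0; i < 64; i++)  /* the LASX version clears the coeffs */
        blk[i] = 0;
}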
+
+void ff_h264_idct4x4_addblk_dc_lasx(uint8_t *dst, int16_t *src,
+                                    int32_t dst_stride)
+{
+    const int16_t dc = (src[0] + 32) >> 6;
+    int32_t dst_stride_2x = dst_stride << 1;
+    int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+    __m256i pred, out;
+    __m256i src0, src1, src2, src3;
+    __m256i input_dc = __lasx_xvreplgr2vr_h(dc);
+
+    src[0] = 0;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, src0, src1, src2, src3);
+    DUP2_ARG2(__lasx_xvilvl_w, src1, src0, src3, src2, src0, src1);
+
+    pred = __lasx_xvpermi_q(src0, src1, 0x02);
+    pred = __lasx_xvaddw_h_h_bu(input_dc, pred);
+    pred = __lasx_xvclip255_h(pred);
+    out = __lasx_xvpickev_b(pred, pred);
+    __lasx_xvstelm_w(out, dst, 0, 0);
+    __lasx_xvstelm_w(out, dst + dst_stride, 0, 1);
+    __lasx_xvstelm_w(out, dst + dst_stride_2x, 0, 4);
+    __lasx_xvstelm_w(out, dst + dst_stride_3x, 0, 5);
+}
+
+void ff_h264_idct8_dc_addblk_lasx(uint8_t *dst, int16_t *src,
+                                  int32_t dst_stride)
+{
+    int32_t dc_val;
+    int32_t dst_stride_2x = dst_stride << 1;
+    int32_t dst_stride_4x = dst_stride << 2;
+    int32_t dst_stride_3x = dst_stride_2x + dst_stride;
+    __m256i dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7;
+    __m256i dc;
+
+    dc_val = (src[0] + 32) >> 6;
+    dc = __lasx_xvreplgr2vr_h(dc_val);
+
+    src[0] = 0;
+
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, dst0, dst1, dst2, dst3);
+    dst += dst_stride_4x;
+    DUP4_ARG2(__lasx_xvldx, dst, 0, dst, dst_stride, dst, dst_stride_2x,
+              dst, dst_stride_3x, dst4, dst5, dst6, dst7);
+    dst -= dst_stride_4x;
+    DUP4_ARG1(__lasx_vext2xv_hu_bu, dst0, dst1, dst2, dst3,
+              dst0, dst1, dst2, dst3);
+    DUP4_ARG1(__lasx_vext2xv_hu_bu, dst4, dst5, dst6, dst7,
+              dst4, dst5, dst6, dst7);
+    DUP4_ARG3(__lasx_xvpermi_q, dst1, dst0, 0x20, dst3, dst2, 0x20, dst5,
+              dst4, 0x20, dst7, dst6, 0x20, dst0, dst1, dst2, dst3);
+    dst0 = __lasx_xvadd_h(dst0, dc);
+    dst1 = __lasx_xvadd_h(dst1, dc);
+    dst2 = __lasx_xvadd_h(dst2, dc);
+    dst3 = __lasx_xvadd_h(dst3, dc);
+    DUP4_ARG1(__lasx_xvclip255_h, dst0, dst1, dst2, dst3,
+              dst0, dst1, dst2, dst3);
+    DUP2_ARG2(__lasx_xvpickev_b, dst1, dst0, dst3, dst2, dst0, dst1);
+    __lasx_xvstelm_d(dst0, dst, 0, 0);
+    __lasx_xvstelm_d(dst0, dst + dst_stride, 0, 2);
+    __lasx_xvstelm_d(dst0, dst + dst_stride_2x, 0, 1);
+    __lasx_xvstelm_d(dst0, dst + dst_stride_3x, 0, 3);
+    dst += dst_stride_4x;
+    __lasx_xvstelm_d(dst1, dst, 0, 0);
+    __lasx_xvstelm_d(dst1, dst + dst_stride, 0, 2);
+    __lasx_xvstelm_d(dst1, dst + dst_stride_2x, 0, 1);
+    __lasx_xvstelm_d(dst1, dst + dst_stride_3x, 0, 3);
+}
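Both DC kernels above implement the shortcut for blocks whose only nonzero coefficient is the DC: round it once, broadcast it, and add it to every prediction pixel with clipping. A scalar equivalent (illustrative; size is 4 or 8):

#include <stdint.h>

static void idct_dc_add_ref(uint8_t *dst, int16_t *blk, int stride, int size)
{
    int dc = (blk[0] + 32) >> 6;  /* round the single DC coefficient */
    blk[0] = 0;                   /* consume it, as the kernels do */
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) {
            int v = dst[i * stride + j] + dc;
            dst[i * stride + j] = v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
        }
}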
+
+void ff_h264_idct_add16_lasx(uint8_t *dst,
+                             const int32_t *blk_offset,
+                             int16_t *block, int32_t dst_stride,
+                             const uint8_t nzc[15 * 8])
+{
+    int32_t i;
+
+    for (i = 0; i < 16; i++) {
+        int32_t nnz = nzc[scan8[i]];
+
+        if (nnz) {
+            if (nnz == 1 && ((dctcoef *) block)[i * 16])
+                ff_h264_idct4x4_addblk_dc_lasx(dst + blk_offset[i],
+                                               block + i * 16 * sizeof(pixel),
+                                               dst_stride);
+            else
+                ff_h264_idct_add_lasx(dst + blk_offset[i],
+                                      block + i * 16 * sizeof(pixel),
+                                      dst_stride);
+        }
+    }
+}
+
+void ff_h264_idct8_add4_lasx(uint8_t *dst, const int32_t *blk_offset,
+                             int16_t *block, int32_t dst_stride,
+                             const uint8_t nzc[15 * 8])
+{
+    int32_t cnt;
+
+    for (cnt = 0; cnt < 16; cnt += 4) {
+        int32_t nnz = nzc[scan8[cnt]];
+
+        if (nnz) {
+            if (nnz == 1 && ((dctcoef *) block)[cnt * 16])
+                ff_h264_idct8_dc_addblk_lasx(dst + blk_offset[cnt],
+                                             block + cnt * 16 * sizeof(pixel),
+                                             dst_stride);
+            else
+                ff_h264_idct8_addblk_lasx(dst + blk_offset[cnt],
+                                          block + cnt * 16 * sizeof(pixel),
+                                          dst_stride);
+        }
+    }
+}
+
+
+void ff_h264_idct_add8_lasx(uint8_t **dst,
+                            const int32_t *blk_offset,
+                            int16_t *block, int32_t dst_stride,
+                            const uint8_t nzc[15 * 8])
+{
+    int32_t i;
+
+    for (i = 16; i < 20; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst[0] + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[0] + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+    for (i = 32; i < 36; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst[1] + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[1] + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+}
+
+void ff_h264_idct_add8_422_lasx(uint8_t **dst,
+                                const int32_t *blk_offset,
+                                int16_t *block, int32_t dst_stride,
+                                const uint8_t nzc[15 * 8])
+{
+    int32_t i;
+
+    for (i = 16; i < 20; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst[0] + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[0] + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+    for (i = 32; i < 36; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst[1] + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[1] + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+    for (i = 20; i < 24; i++) {
+        if (nzc[scan8[i + 4]])
+            ff_h264_idct_add_lasx(dst[0] + blk_offset[i + 4],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[0] + blk_offset[i + 4],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+    for (i = 36; i < 40; i++) {
+        if (nzc[scan8[i + 4]])
+            ff_h264_idct_add_lasx(dst[1] + blk_offset[i + 4],
+                                  block + i * 16 * sizeof(pixel),
+                                  dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst[1] + blk_offset[i + 4],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+}
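The wrappers above all share one dispatch pattern: nzc is the decoder's non-zero-count cache, indexed through the scan8[] table; a block with no coefficients is skipped, a block whose count is 1 and whose DC coefficient is set takes the cheap DC path, and anything else gets the full IDCT. A condensed sketch of that pattern (scan8_tab and the function-pointer parameters are illustrative stand-ins for the table and kernels used above):

#include <stdint.h>

typedef void (*blk_fn)(uint8_t *dst, int16_t *coef, int32_t stride);

static void idct_add16_sketch(uint8_t *dst, const int32_t *blk_offset,
                              int16_t *block, int32_t stride,
                              const uint8_t *nzc_cache,
                              const uint8_t *scan8_tab,
                              blk_fn full_idct, blk_fn dc_only)
{
    for (int i = 0; i < 16; i++) {
        int nnz = nzc_cache[scan8_tab[i]];
        if (!nnz)
            continue;                         /* all-zero block: skip */
        if (nnz == 1 && block[i * 16])        /* only the DC is present */
            dc_only(dst + blk_offset[i], block + i * 16, stride);
        else                                  /* general case */
            full_idct(dst + blk_offset[i], block + i * 16, stride);
    }
}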
+
+void ff_h264_idct_add16_intra_lasx(uint8_t *dst,
+                                   const int32_t *blk_offset,
+                                   int16_t *block,
+                                   int32_t dst_stride,
+                                   const uint8_t nzc[15 * 8])
+{
+    int32_t i;
+
+    for (i = 0; i < 16; i++) {
+        if (nzc[scan8[i]])
+            ff_h264_idct_add_lasx(dst + blk_offset[i],
+                                  block + i * 16 * sizeof(pixel), dst_stride);
+        else if (((dctcoef *) block)[i * 16])
+            ff_h264_idct4x4_addblk_dc_lasx(dst + blk_offset[i],
+                                           block + i * 16 * sizeof(pixel),
+                                           dst_stride);
+    }
+}
+
+void ff_h264_deq_idct_luma_dc_lasx(int16_t *dst, int16_t *src,
+                                   int32_t de_qval)
+{
+#define DC_DEST_STRIDE 16
+
+    __m256i src0, src1, src2, src3;
+    __m256i vec0, vec1, vec2, vec3;
+    __m256i tmp0, tmp1, tmp2, tmp3;
+    __m256i hres0, hres1, hres2, hres3;
+    __m256i vres0, vres1, vres2, vres3;
+    __m256i de_q_vec = __lasx_xvreplgr2vr_w(de_qval);
+
+    DUP4_ARG2(__lasx_xvld, src, 0, src, 8, src, 16, src, 24,
+              src0, src1, src2, src3);
+    LASX_TRANSPOSE4x4_H(src0, src1, src2, src3, tmp0, tmp1, tmp2, tmp3);
+    LASX_BUTTERFLY_4_H(tmp0, tmp2, tmp3, tmp1, vec0, vec3, vec2, vec1);
+    LASX_BUTTERFLY_4_H(vec0, vec1, vec2, vec3, hres0, hres3, hres2, hres1);
+    LASX_TRANSPOSE4x4_H(hres0, hres1, hres2, hres3,
+                        hres0, hres1, hres2, hres3);
+    LASX_BUTTERFLY_4_H(hres0, hres1, hres3, hres2, vec0, vec3, vec2, vec1);
+    LASX_BUTTERFLY_4_H(vec0, vec1, vec2, vec3, vres0, vres1, vres2, vres3);
+    DUP4_ARG1(__lasx_vext2xv_w_h, vres0, vres1, vres2, vres3,
+              vres0, vres1, vres2, vres3);
+    DUP2_ARG3(__lasx_xvpermi_q, vres1, vres0, 0x20, vres3, vres2, 0x20,
+              vres0, vres1);
+
+    vres0 = __lasx_xvmul_w(vres0, de_q_vec);
+    vres1 = __lasx_xvmul_w(vres1, de_q_vec);
+
+    vres0 = __lasx_xvsrari_w(vres0, 8);
+    vres1 = __lasx_xvsrari_w(vres1, 8);
+    vec0 = __lasx_xvpickev_h(vres1, vres0);
+    vec0 = __lasx_xvpermi_d(vec0, 0xd8);
+    __lasx_xvstelm_h(vec0, dst + 0 * DC_DEST_STRIDE, 0, 0);
+    __lasx_xvstelm_h(vec0, dst + 2 * DC_DEST_STRIDE, 0, 1);
+    __lasx_xvstelm_h(vec0, dst + 8 * DC_DEST_STRIDE, 0, 2);
+    __lasx_xvstelm_h(vec0, dst + 10 * DC_DEST_STRIDE, 0, 3);
+    __lasx_xvstelm_h(vec0, dst + 1 * DC_DEST_STRIDE, 0, 4);
+    __lasx_xvstelm_h(vec0, dst + 3 * DC_DEST_STRIDE, 0, 5);
+    __lasx_xvstelm_h(vec0, dst + 9 * DC_DEST_STRIDE, 0, 6);
+    __lasx_xvstelm_h(vec0, dst + 11 * DC_DEST_STRIDE, 0, 7);
+    __lasx_xvstelm_h(vec0, dst + 4 * DC_DEST_STRIDE, 0, 8);
+    __lasx_xvstelm_h(vec0, dst + 6 * DC_DEST_STRIDE, 0, 9);
+    __lasx_xvstelm_h(vec0, dst + 12 * DC_DEST_STRIDE, 0, 10);
+    __lasx_xvstelm_h(vec0, dst + 14 * DC_DEST_STRIDE, 0, 11);
+    __lasx_xvstelm_h(vec0, dst + 5 * DC_DEST_STRIDE, 0, 12);
+    __lasx_xvstelm_h(vec0, dst + 7 * DC_DEST_STRIDE, 0, 13);
+    __lasx_xvstelm_h(vec0, dst + 13 * DC_DEST_STRIDE, 0, 14);
+    __lasx_xvstelm_h(vec0, dst + 15 * DC_DEST_STRIDE, 0, 15);
+
+#undef DC_DEST_STRIDE
+}
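ff_h264_deq_idct_luma_dc_lasx applies the 4x4 Hadamard-style inverse to the 16 luma DC coefficients, dequantizes with a rounded multiply (xvmul_w followed by xvsrari_w by 8, i.e. (x * qmul + 128) >> 8), and scatters each result to the DC slot of its 4x4 block, 16 coefficients apart; the store order above corresponds to the offsets {0,2,8,10, 1,3,9,11, 4,6,12,14, 5,7,13,15} * 16. A scalar sketch modeled on FFmpeg's generic C path (illustrative, not part of the patch):

#include <stdint.h>

static void deq_idct_luma_dc_ref(int16_t *dst, const int16_t *src, int qmul)
{
    int tmp[16];
    /* column offsets of the four 4x4-block groups within the macroblock */
    static const uint8_t x_offset[4] = { 0, 2 * 16, 8 * 16, 10 * 16 };

    /* row butterflies (Hadamard-like: no >>1 scaling for the DC plane) */
    for (int i = 0; i < 4; i++) {
        int z0 = src[4 * i + 0] + src[4 * i + 2];
        int z1 = src[4 * i + 0] - src[4 * i + 2];
        int z2 = src[4 * i + 1] - src[4 * i + 3];
        int z3 = src[4 * i + 1] + src[4 * i + 3];
        tmp[4 * i + 0] = z0 + z3;
        tmp[4 * i + 1] = z0 - z3;
        tmp[4 * i + 2] = z1 - z2;
        tmp[4 * i + 3] = z1 + z2;
    }
    /* column butterflies, dequant with rounding, scatter to the DC slot
     * of each 4x4 block (blocks are 16 coefficients apart in memory) */
    for (int i = 0; i < 4; i++) {
        int z0 = tmp[4 * 0 + i] + tmp[4 * 2 + i];
        int z1 = tmp[4 * 0 + i] - tmp[4 * 2 + i];
        int z2 = tmp[4 * 1 + i] - tmp[4 * 3 + i];
        int z3 = tmp[4 * 1 + i] + tmp[4 * 3 + i];
        int off = x_offset[i];
        dst[16 * 0 + off] = ((z0 + z3) * qmul + 128) >> 8;
        dst[16 * 1 + off] = ((z0 - z3) * qmul + 128) >> 8;
        dst[16 * 4 + off] = ((z1 + z2) * qmul + 128) >> 8;
        dst[16 * 5 + off] = ((z1 - z2) * qmul + 128) >> 8;
    }
}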