From patchwork Wed Dec 29 10:18:20 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= X-Patchwork-Id: 32943 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp11995410iog; Wed, 29 Dec 2021 02:18:49 -0800 (PST) X-Google-Smtp-Source: ABdhPJzETL/HlJfVzMwHGsDlYPAYXk4njLxgu8OXbXs0kjsyY/kkLRmE7DKGRlUtXg8iKh7bbXeI X-Received: by 2002:a05:6402:438b:: with SMTP id o11mr24349406edc.143.1640773129560; Wed, 29 Dec 2021 02:18:49 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1640773129; cv=none; d=google.com; s=arc-20160816; b=mmAu+lWkeEgpuHM8K8uQRs92a1XusdsdlACF/dGKgab19XY7vQsnzcH+pw7o741G8e 2avfNuK7dwnlivBuy+WKzVrB9wEOlKSIp8Rk4c+a6v1oUVbgqthzYIhFF3Bk10IHRaiU ltkjZcdY85gfFKh8MBQqJxcan5bWQxnw5BB1x5pxec6gbP1/jMdpbnM/39lnzg9mkmh0 gJdoU+iCmLQlJoaT5igIQRykVSH/GbgVaaq5314+7MGDgJkHo10drMyNZltKMjlpjaVs 4O9VxcDmn+2hgVTCrVZ5olIoVCgnpD52pbs4M1QkrhuVgISI50bVDNBG3uykQ/AGPMUp sgjw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:delivered-to; bh=onvTa0tHMysTlo+I/geE0d2AMb2bR8qWPmrxyAzoNYI=; b=kRTGr5u6IGLzTNJIgwL84T3A8cdZfwYJy+q/ITC9+q3YsVppFT5CGL2XXNPFJAwnBG 9amVCslbwp2j7kflj8qLz+gWCFJHfeEYjedzsWIXMJMOYyUcRlzYaTJYGrj/8QjjG8Yn 15e3x+7fQSgADOZzpW17ExSGgaetSfl23dCrbuaoJxsBEgxMQtqfEj29cnRwG0mQpzk0 tnsYMLHMr1VExj1di0f19UX+/d258hlpHETDXBOJALWi8t/yE/Vk7Wt+jt6JL+/fcsRP irOpGEQEuV65lXGBqrjAPxlV9o3OHJb1b8Z38+24H3ncjWq8Dnvq1TBSSr6BPrnlyaj1 wb7w== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id x20si11203469edd.196.2021.12.29.02.18.49; Wed, 29 Dec 2021 02:18:49 -0800 (PST) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id E777768AF42; Wed, 29 Dec 2021 12:18:37 +0200 (EET) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from loongson.cn (mail.loongson.cn [114.242.206.163]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id C4EF968AACC for ; Wed, 29 Dec 2021 12:18:28 +0200 (EET) Received: from localhost (unknown [36.33.26.144]) by mail.loongson.cn (Coremail) with SMTP id AQAAf9DxeZbxNcxh3igFAA--.4671S3; Wed, 29 Dec 2021 18:18:25 +0800 (CST) From: Hao Chen To: ffmpeg-devel@ffmpeg.org Date: Wed, 29 Dec 2021 18:18:20 +0800 Message-Id: <20211229101822.31956-2-chenhao@loongson.cn> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20211229101822.31956-1-chenhao@loongson.cn> References: <20211229101822.31956-1-chenhao@loongson.cn> MIME-Version: 1.0 X-CM-TRANSID: AQAAf9DxeZbxNcxh3igFAA--.4671S3 X-Coremail-Antispam: 1UD129KBjvAXoWDWFyktw17ur43Xw43GFy8Xwb_yoW7tw15to WUK397Kwn7KFySkrn8JrykKa47uFW5Ar1UZr47tw4vy345uryYkrWavw48Jw4Fgr4Fgw15 CF17try7Zw43Crn8n29KB7ZKAUJUUUU8529EdanIXcx71UUUUU7v73VFW2AGmfu7bjvjm3 AaLaJ3UjIYCTnIWjp_UUUYO7AC8VAFwI0_Jr0_Gr1l1xkIjI8I6I8E6xAIw20EY4v20xva j40_Wr0E3s1l1IIY67AEw4v_Jr0_Jr4l8cAvFVAK0II2c7xJM28CjxkF64kEwVA0rcxSw2 x7M28EF7xvwVC0I7IYx2IY67AKxVW5JVW7JwA2z4x0Y4vE2Ix0cI8IcVCY1x0267AKxVW8 Jr0_Cr1UM28EF7xvwVC2z280aVAFwI0_Cr1j6rxdM28EF7xvwVC2z280aVCY1x0267AKxV W0oVCq3wAS0I0E0xvYzxvE52x082IY62kv0487Mc02F40EFcxC0VAKzVAqx4xG6I80ewAv 7VC0I7IYx2IY67AKxVWUAVWUtwAv7VC2z280aVAFwI0_Cr0_Gr1UMcvjeVCFs4IE7xkEbV WUJVW8JwACjcxG0xvY0x0EwIxGrwACjI8F5VA0II8E6IAqYI8I648v4I1lc2xSY4AK67AK 6r4DMxAIw28IcxkI7VAKI48JMxC20s026xCaFVCjc4AY6r1j6r4UMI8I3I0E5I8CrVAFwI 0_Jr0_Jr4lx2IqxVCjr7xvwVAFwI0_JrI_JrWlx4CE17CEb7AF67AKxVWUXVWUAwCIc40Y 0x0EwIxGrwCI42IY6xIIjxv20xvE14v26r1I6r4UMIIF0xvE2Ix0cI8IcVCY1x0267AKxV WUJVW8JwCI42IY6xAIw20EY4v20xvaj40_Jr0_JF4lIxAIcVC2z280aVAFwI0_Jr0_Gr1l IxAIcVC2z280aVCY1x0267AKxVW8JVW8JrUvcSsGvfC2KfnxnUUI43ZEXa7VUb_-PUUUUU U== X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/ Subject: [FFmpeg-devel] [PATCH v3 1/3] avcodec: [loongarch] Optimize hpeldsp with LASX. X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Shiyou Yin Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: jLDwO5UBLz2q From: Shiyou Yin ./ffmpeg -i 8_mpeg4_1080p_24fps_12Mbps.avi -f rawvideo -y /dev/null -an before:376fps after :433fps --- libavcodec/hpeldsp.c | 2 + libavcodec/hpeldsp.h | 1 + libavcodec/loongarch/Makefile | 2 + libavcodec/loongarch/hpeldsp_init_loongarch.c | 50 + libavcodec/loongarch/hpeldsp_lasx.c | 1287 +++++++++++++++++ libavcodec/loongarch/hpeldsp_lasx.h | 58 + 6 files changed, 1400 insertions(+) create mode 100644 libavcodec/loongarch/hpeldsp_init_loongarch.c create mode 100644 libavcodec/loongarch/hpeldsp_lasx.c create mode 100644 libavcodec/loongarch/hpeldsp_lasx.h diff --git a/libavcodec/hpeldsp.c b/libavcodec/hpeldsp.c index 8e2fd8fcf5..843ba399c5 100644 --- a/libavcodec/hpeldsp.c +++ b/libavcodec/hpeldsp.c @@ -367,4 +367,6 @@ av_cold void ff_hpeldsp_init(HpelDSPContext *c, int flags) ff_hpeldsp_init_x86(c, flags); if (ARCH_MIPS) ff_hpeldsp_init_mips(c, flags); + if (ARCH_LOONGARCH64) + ff_hpeldsp_init_loongarch(c, flags); } diff --git a/libavcodec/hpeldsp.h b/libavcodec/hpeldsp.h index 768139bfc9..45e81b10a5 100644 --- a/libavcodec/hpeldsp.h +++ b/libavcodec/hpeldsp.h @@ -102,5 +102,6 @@ void ff_hpeldsp_init_arm(HpelDSPContext *c, int flags); void ff_hpeldsp_init_ppc(HpelDSPContext *c, int flags); void ff_hpeldsp_init_x86(HpelDSPContext *c, int flags); void ff_hpeldsp_init_mips(HpelDSPContext *c, int flags); +void ff_hpeldsp_init_loongarch(HpelDSPContext *c, int flags); #endif /* AVCODEC_HPELDSP_H */ diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile index baf5f92e84..07a401d883 100644 --- a/libavcodec/loongarch/Makefile +++ b/libavcodec/loongarch/Makefile @@ -5,6 +5,7 @@ OBJS-$(CONFIG_H264PRED) += loongarch/h264_intrapred_init_loongarch OBJS-$(CONFIG_VP8_DECODER) += loongarch/vp8dsp_init_loongarch.o OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9dsp_init_loongarch.o OBJS-$(CONFIG_VC1DSP) += loongarch/vc1dsp_init_loongarch.o +OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_init_loongarch.o LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \ @@ -12,6 +13,7 @@ LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \ loongarch/h264_deblock_lasx.o LASX-OBJS-$(CONFIG_H264PRED) += loongarch/h264_intrapred_lasx.o LASX-OBJS-$(CONFIG_VC1_DECODER) += loongarch/vc1dsp_lasx.o +LASX-OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_lasx.o LSX-OBJS-$(CONFIG_VP8_DECODER) += loongarch/vp8_mc_lsx.o \ loongarch/vp8_lpf_lsx.o LSX-OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9_mc_lsx.o \ diff --git a/libavcodec/loongarch/hpeldsp_init_loongarch.c b/libavcodec/loongarch/hpeldsp_init_loongarch.c new file mode 100644 index 0000000000..1690be5438 --- /dev/null +++ b/libavcodec/loongarch/hpeldsp_init_loongarch.c @@ -0,0 +1,50 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/loongarch/cpu.h" +#include "libavcodec/hpeldsp.h" +#include "libavcodec/loongarch/hpeldsp_lasx.h" + +void ff_hpeldsp_init_loongarch(HpelDSPContext *c, int flags) +{ + int cpu_flags = av_get_cpu_flags(); + + if (have_lasx(cpu_flags)) { + c->put_pixels_tab[0][0] = ff_put_pixels16_8_lsx; + c->put_pixels_tab[0][1] = ff_put_pixels16_x2_8_lasx; + c->put_pixels_tab[0][2] = ff_put_pixels16_y2_8_lasx; + c->put_pixels_tab[0][3] = ff_put_pixels16_xy2_8_lasx; + + c->put_pixels_tab[1][0] = ff_put_pixels8_8_lasx; + c->put_pixels_tab[1][1] = ff_put_pixels8_x2_8_lasx; + c->put_pixels_tab[1][2] = ff_put_pixels8_y2_8_lasx; + c->put_pixels_tab[1][3] = ff_put_pixels8_xy2_8_lasx; + c->put_no_rnd_pixels_tab[0][0] = ff_put_pixels16_8_lsx; + c->put_no_rnd_pixels_tab[0][1] = ff_put_no_rnd_pixels16_x2_8_lasx; + c->put_no_rnd_pixels_tab[0][2] = ff_put_no_rnd_pixels16_y2_8_lasx; + c->put_no_rnd_pixels_tab[0][3] = ff_put_no_rnd_pixels16_xy2_8_lasx; + + c->put_no_rnd_pixels_tab[1][0] = ff_put_pixels8_8_lasx; + c->put_no_rnd_pixels_tab[1][1] = ff_put_no_rnd_pixels8_x2_8_lasx; + c->put_no_rnd_pixels_tab[1][2] = ff_put_no_rnd_pixels8_y2_8_lasx; + c->put_no_rnd_pixels_tab[1][3] = ff_put_no_rnd_pixels8_xy2_8_lasx; + } +} diff --git a/libavcodec/loongarch/hpeldsp_lasx.c b/libavcodec/loongarch/hpeldsp_lasx.c new file mode 100644 index 0000000000..dd2ae173da --- /dev/null +++ b/libavcodec/loongarch/hpeldsp_lasx.c @@ -0,0 +1,1287 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/loongarch/loongson_intrinsics.h" +#include "hpeldsp_lasx.h" + +static av_always_inline void +put_pixels8_l2_8_lsx(uint8_t *dst, const uint8_t *src1, const uint8_t *src2, + int dst_stride, int src_stride1, int src_stride2, int h) +{ + int stride1_2, stride1_3, stride1_4; + int stride2_2, stride2_3, stride2_4; + __asm__ volatile ( + "slli.d %[stride1_2], %[srcStride1], 1 \n\t" + "slli.d %[stride2_2], %[srcStride2], 1 \n\t" + "add.d %[stride1_3], %[stride1_2], %[srcStride1] \n\t" + "add.d %[stride2_3], %[stride2_2], %[srcStride2] \n\t" + "slli.d %[stride1_4], %[stride1_2], 1 \n\t" + "slli.d %[stride2_4], %[stride2_2], 1 \n\t" + "1: \n\t" + "vld $vr0, %[src1], 0 \n\t" + "vldx $vr1, %[src1], %[srcStride1] \n\t" + "vldx $vr2, %[src1], %[stride1_2] \n\t" + "vldx $vr3, %[src1], %[stride1_3] \n\t" + "add.d %[src1], %[src1], %[stride1_4] \n\t" + + "vld $vr4, %[src2], 0 \n\t" + "vldx $vr5, %[src2], %[srcStride2] \n\t" + "vldx $vr6, %[src2], %[stride2_2] \n\t" + "vldx $vr7, %[src2], %[stride2_3] \n\t" + "add.d %[src2], %[src2], %[stride2_4] \n\t" + + "addi.d %[h], %[h], -4 \n\t" + + "vavgr.bu $vr0, $vr4, $vr0 \n\t" + "vavgr.bu $vr1, $vr5, $vr1 \n\t" + "vavgr.bu $vr2, $vr6, $vr2 \n\t" + "vavgr.bu $vr3, $vr7, $vr3 \n\t" + "vstelm.d $vr0, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr1, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr2, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "vstelm.d $vr3, %[dst], 0, 0 \n\t" + "add.d %[dst], %[dst], %[dstStride] \n\t" + "bnez %[h], 1b \n\t" + + : [dst]"+&r"(dst), [src2]"+&r"(src2), [src1]"+&r"(src1), + [h]"+&r"(h), [stride1_2]"=&r"(stride1_2), + [stride1_3]"=&r"(stride1_3), [stride1_4]"=&r"(stride1_4), + [stride2_2]"=&r"(stride2_2), [stride2_3]"=&r"(stride2_3), + [stride2_4]"=&r"(stride2_4) + : [dstStride]"r"(dst_stride), [srcStride1]"r"(src_stride1), + [srcStride2]"r"(src_stride2) + : "memory" + ); +} + +static av_always_inline void +put_pixels16_l2_8_lsx(uint8_t *dst, const uint8_t *src1, const uint8_t *src2, + int dst_stride, int src_stride1, int src_stride2, int h) +{ + int stride1_2, stride1_3, stride1_4; + int stride2_2, stride2_3, stride2_4; + int dststride2, dststride3, dststride4; + __asm__ volatile ( + "slli.d %[stride1_2], %[srcStride1], 1 \n\t" + "slli.d %[stride2_2], %[srcStride2], 1 \n\t" + "slli.d %[dststride2], %[dstStride], 1 \n\t" + "add.d %[stride1_3], %[stride1_2], %[srcStride1] \n\t" + "add.d %[stride2_3], %[stride2_2], %[srcStride2] \n\t" + "add.d %[dststride3], %[dststride2], %[dstStride] \n\t" + "slli.d %[stride1_4], %[stride1_2], 1 \n\t" + "slli.d %[stride2_4], %[stride2_2], 1 \n\t" + "slli.d %[dststride4], %[dststride2], 1 \n\t" + "1: \n\t" + "vld $vr0, %[src1], 0 \n\t" + "vldx $vr1, %[src1], %[srcStride1] \n\t" + "vldx $vr2, %[src1], %[stride1_2] \n\t" + "vldx $vr3, %[src1], %[stride1_3] \n\t" + "add.d %[src1], %[src1], %[stride1_4] \n\t" + + "vld $vr4, %[src2], 0 \n\t" + "vldx $vr5, %[src2], %[srcStride2] \n\t" + "vldx $vr6, %[src2], %[stride2_2] \n\t" + "vldx $vr7, %[src2], %[stride2_3] \n\t" + "add.d %[src2], %[src2], %[stride2_4] \n\t" + + "addi.d %[h], %[h], -4 \n\t" + + "vavgr.bu $vr0, $vr4, $vr0 \n\t" + "vavgr.bu $vr1, $vr5, $vr1 \n\t" + "vavgr.bu $vr2, $vr6, $vr2 \n\t" + "vavgr.bu $vr3, $vr7, $vr3 \n\t" + "vst $vr0, %[dst], 0 \n\t" + "vstx $vr1, %[dst], %[dstStride] \n\t" + "vstx $vr2, %[dst], %[dststride2] \n\t" + "vstx $vr3, %[dst], %[dststride3] \n\t" + "add.d %[dst], %[dst], %[dststride4] \n\t" + "bnez %[h], 1b \n\t" + + : [dst]"+&r"(dst), [src2]"+&r"(src2), [src1]"+&r"(src1), + [h]"+&r"(h), [stride1_2]"=&r"(stride1_2), + [stride1_3]"=&r"(stride1_3), [stride1_4]"=&r"(stride1_4), + [stride2_2]"=&r"(stride2_2), [stride2_3]"=&r"(stride2_3), + [stride2_4]"=&r"(stride2_4), [dststride2]"=&r"(dststride2), + [dststride3]"=&r"(dststride3), [dststride4]"=&r"(dststride4) + : [dstStride]"r"(dst_stride), [srcStride1]"r"(src_stride1), + [srcStride2]"r"(src_stride2) + : "memory" + ); +} + +void ff_put_pixels8_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + uint64_t tmp[8]; + int h_8 = h >> 3; + int res = h & 7; + ptrdiff_t stride2, stride3, stride4; + + __asm__ volatile ( + "beqz %[h_8], 2f \n\t" + "slli.d %[stride2], %[stride], 1 \n\t" + "add.d %[stride3], %[stride2], %[stride] \n\t" + "slli.d %[stride4], %[stride2], 1 \n\t" + "1: \n\t" + "ld.d %[tmp0], %[src], 0x0 \n\t" + "ldx.d %[tmp1], %[src], %[stride] \n\t" + "ldx.d %[tmp2], %[src], %[stride2] \n\t" + "ldx.d %[tmp3], %[src], %[stride3] \n\t" + "add.d %[src], %[src], %[stride4] \n\t" + "ld.d %[tmp4], %[src], 0x0 \n\t" + "ldx.d %[tmp5], %[src], %[stride] \n\t" + "ldx.d %[tmp6], %[src], %[stride2] \n\t" + "ldx.d %[tmp7], %[src], %[stride3] \n\t" + "add.d %[src], %[src], %[stride4] \n\t" + + "addi.d %[h_8], %[h_8], -1 \n\t" + + "st.d %[tmp0], %[dst], 0x0 \n\t" + "stx.d %[tmp1], %[dst], %[stride] \n\t" + "stx.d %[tmp2], %[dst], %[stride2] \n\t" + "stx.d %[tmp3], %[dst], %[stride3] \n\t" + "add.d %[dst], %[dst], %[stride4] \n\t" + "st.d %[tmp4], %[dst], 0x0 \n\t" + "stx.d %[tmp5], %[dst], %[stride] \n\t" + "stx.d %[tmp6], %[dst], %[stride2] \n\t" + "stx.d %[tmp7], %[dst], %[stride3] \n\t" + "add.d %[dst], %[dst], %[stride4] \n\t" + "bnez %[h_8], 1b \n\t" + + "2: \n\t" + "beqz %[res], 4f \n\t" + "3: \n\t" + "ld.d %[tmp0], %[src], 0x0 \n\t" + "add.d %[src], %[src], %[stride] \n\t" + "addi.d %[res], %[res], -1 \n\t" + "st.d %[tmp0], %[dst], 0x0 \n\t" + "add.d %[dst], %[dst], %[stride] \n\t" + "bnez %[res], 3b \n\t" + "4: \n\t" + : [tmp0]"=&r"(tmp[0]), [tmp1]"=&r"(tmp[1]), + [tmp2]"=&r"(tmp[2]), [tmp3]"=&r"(tmp[3]), + [tmp4]"=&r"(tmp[4]), [tmp5]"=&r"(tmp[5]), + [tmp6]"=&r"(tmp[6]), [tmp7]"=&r"(tmp[7]), + [dst]"+&r"(block), [src]"+&r"(pixels), + [h_8]"+&r"(h_8), [res]"+&r"(res), + [stride2]"=&r"(stride2), [stride3]"=&r"(stride3), + [stride4]"=&r"(stride4) + : [stride]"r"(line_size) + : "memory" + ); +} + +void ff_put_pixels16_8_lsx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + int h_8 = h >> 3; + int res = h & 7; + ptrdiff_t stride2, stride3, stride4; + + __asm__ volatile ( + "beqz %[h_8], 2f \n\t" + "slli.d %[stride2], %[stride], 1 \n\t" + "add.d %[stride3], %[stride2], %[stride] \n\t" + "slli.d %[stride4], %[stride2], 1 \n\t" + "1: \n\t" + "vld $vr0, %[src], 0x0 \n\t" + "vldx $vr1, %[src], %[stride] \n\t" + "vldx $vr2, %[src], %[stride2] \n\t" + "vldx $vr3, %[src], %[stride3] \n\t" + "add.d %[src], %[src], %[stride4] \n\t" + "vld $vr4, %[src], 0x0 \n\t" + "vldx $vr5, %[src], %[stride] \n\t" + "vldx $vr6, %[src], %[stride2] \n\t" + "vldx $vr7, %[src], %[stride3] \n\t" + "add.d %[src], %[src], %[stride4] \n\t" + + "addi.d %[h_8], %[h_8], -1 \n\t" + + "vst $vr0, %[dst], 0x0 \n\t" + "vstx $vr1, %[dst], %[stride] \n\t" + "vstx $vr2, %[dst], %[stride2] \n\t" + "vstx $vr3, %[dst], %[stride3] \n\t" + "add.d %[dst], %[dst], %[stride4] \n\t" + "vst $vr4, %[dst], 0x0 \n\t" + "vstx $vr5, %[dst], %[stride] \n\t" + "vstx $vr6, %[dst], %[stride2] \n\t" + "vstx $vr7, %[dst], %[stride3] \n\t" + "add.d %[dst], %[dst], %[stride4] \n\t" + "bnez %[h_8], 1b \n\t" + + "2: \n\t" + "beqz %[res], 4f \n\t" + "3: \n\t" + "vld $vr0, %[src], 0x0 \n\t" + "add.d %[src], %[src], %[stride] \n\t" + "addi.d %[res], %[res], -1 \n\t" + "vst $vr0, %[dst], 0x0 \n\t" + "add.d %[dst], %[dst], %[stride] \n\t" + "bnez %[res], 3b \n\t" + "4: \n\t" + : [dst]"+&r"(block), [src]"+&r"(pixels), + [h_8]"+&r"(h_8), [res]"+&r"(res), + [stride2]"=&r"(stride2), [stride3]"=&r"(stride3), + [stride4]"=&r"(stride4) + : [stride]"r"(line_size) + : "memory" + ); +} + +void ff_put_pixels8_x2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + put_pixels8_l2_8_lsx(block, pixels, pixels + 1, line_size, line_size, + line_size, h); +} + +void ff_put_pixels8_y2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + put_pixels8_l2_8_lsx(block, pixels, pixels + line_size, line_size, + line_size, line_size, h); +} + +void ff_put_pixels16_x2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + put_pixels16_l2_8_lsx(block, pixels, pixels + 1, line_size, line_size, + line_size, h); +} + +void ff_put_pixels16_y2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + put_pixels16_l2_8_lsx(block, pixels, pixels + line_size, line_size, + line_size, line_size, h); +} + +static void common_hz_bil_no_rnd_16x16_lasx(const uint8_t *src, + int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t *_src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += 1; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x -1); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, + src4, 0x20, src7, src6, 0x20, src0, src1, src2, src3); + src0 = __lasx_xvavg_bu(src0, src2); + src1 = __lasx_xvavg_bu(src1, src3); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src0, dst, 0, 2); + __lasx_xvstelm_d(src0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 0); + __lasx_xvstelm_d(src1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 2); + __lasx_xvstelm_d(src1, dst, 8, 3); + dst += dst_stride; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += 1; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, + 0x20, src7, src6, 0x20, src0, src1, src2, src3); + src0 = __lasx_xvavg_bu(src0, src2); + src1 = __lasx_xvavg_bu(src1, src3); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src0, dst, 0, 2); + __lasx_xvstelm_d(src0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 0); + __lasx_xvstelm_d(src1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 2); + __lasx_xvstelm_d(src1, dst, 8, 3); + dst += dst_stride; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += 1; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, + 0x20, src7, src6, 0x20, src0, src1, src2, src3); + src0 = __lasx_xvavg_bu(src0, src2); + src1 = __lasx_xvavg_bu(src1, src3); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src0, dst, 0, 2); + __lasx_xvstelm_d(src0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 0); + __lasx_xvstelm_d(src1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 2); + __lasx_xvstelm_d(src1, dst, 8, 3); + dst += dst_stride; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += 1; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, + 0x20, src7, src6, 0x20, src0, src1, src2, src3); + src0 = __lasx_xvavg_bu(src0, src2); + src1 = __lasx_xvavg_bu(src1, src3); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src0, dst, 0, 2); + __lasx_xvstelm_d(src0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 0); + __lasx_xvstelm_d(src1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 2); + __lasx_xvstelm_d(src1, dst, 8, 3); +} + +static void common_hz_bil_no_rnd_8x16_lasx(const uint8_t *src, + int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += 1; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, + 0x20, src7, src6, 0x20, src0, src1, src2, src3); + src0 = __lasx_xvavg_bu(src0, src2); + src1 = __lasx_xvavg_bu(src1, src3); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src0, dst, 0, 2); + __lasx_xvstelm_d(src0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 0); + __lasx_xvstelm_d(src1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 2); + __lasx_xvstelm_d(src1, dst, 8, 3); + dst += dst_stride; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += 1; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, + 0x20, src7, src6, 0x20, src0, src1, src2, src3); + src0 = __lasx_xvavg_bu(src0, src2); + src1 = __lasx_xvavg_bu(src1, src3); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src0, dst, 0, 2); + __lasx_xvstelm_d(src0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 0); + __lasx_xvstelm_d(src1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src1, dst, 0, 2); + __lasx_xvstelm_d(src1, dst, 8, 3); +} + +void ff_put_no_rnd_pixels16_x2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + if (h == 16) { + common_hz_bil_no_rnd_16x16_lasx(pixels, line_size, block, line_size); + } else if (h == 8) { + common_hz_bil_no_rnd_8x16_lasx(pixels, line_size, block, line_size); + } +} + +static void common_vt_bil_no_rnd_16x16_lasx(const uint8_t *src, + int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8; + __m256i src9, src10, src11, src12, src13, src14, src15, src16; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src8 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src9, src10); + src11 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src12 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src13, src14); + src15 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src16 = __lasx_xvld(_src, 0); + + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, + 0x20, src4, src3, 0x20, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, src6, src5, 0x20, src7, src6, + 0x20, src8, src7, 0x20, src4, src5, src6, src7); + DUP4_ARG3(__lasx_xvpermi_q, src9, src8, 0x20, src10, src9, 0x20, src11, + src10, 0x20, src12, src11, 0x20, src8, src9, src10, src11); + DUP4_ARG3(__lasx_xvpermi_q, src13, src12, 0x20, src14, src13, 0x20, src15, + src14, 0x20, src16, src15, 0x20, src12, src13, src14, src15); + DUP4_ARG2(__lasx_xvavg_bu, src0, src1, src2, src3, src4, src5, src6, src7, + src0, src2, src4, src6); + DUP4_ARG2(__lasx_xvavg_bu, src8, src9, src10, src11, src12, src13, src14, + src15, src8, src10, src12, src14); + + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src0, dst, 0, 2); + __lasx_xvstelm_d(src0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src2, dst, 0, 0); + __lasx_xvstelm_d(src2, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src2, dst, 0, 2); + __lasx_xvstelm_d(src2, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src4, dst, 0, 0); + __lasx_xvstelm_d(src4, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src4, dst, 0, 2); + __lasx_xvstelm_d(src4, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src6, dst, 0, 0); + __lasx_xvstelm_d(src6, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src6, dst, 0, 2); + __lasx_xvstelm_d(src6, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src8, dst, 0, 0); + __lasx_xvstelm_d(src8, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src8, dst, 0, 2); + __lasx_xvstelm_d(src8, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src10, dst, 0, 0); + __lasx_xvstelm_d(src10, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src10, dst, 0, 2); + __lasx_xvstelm_d(src10, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src12, dst, 0, 0); + __lasx_xvstelm_d(src12, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src12, dst, 0, 2); + __lasx_xvstelm_d(src12, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src14, dst, 0, 0); + __lasx_xvstelm_d(src14, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src14, dst, 0, 2); + __lasx_xvstelm_d(src14, dst, 8, 3); +} + +static void common_vt_bil_no_rnd_8x16_lasx(const uint8_t *src, + int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src8 = __lasx_xvld(_src, 0); + + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, + 0x20, src4, src3, 0x20, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, src6, src5, 0x20, src7, src6, + 0x20, src8, src7, 0x20, src4, src5, src6, src7); + DUP4_ARG2(__lasx_xvavg_bu, src0, src1, src2, src3, src4, src5, src6, src7, + src0, src2, src4, src6); + + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src0, dst, 0, 2); + __lasx_xvstelm_d(src0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src2, dst, 0, 0); + __lasx_xvstelm_d(src2, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src2, dst, 0, 2); + __lasx_xvstelm_d(src2, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src4, dst, 0, 0); + __lasx_xvstelm_d(src4, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src4, dst, 0, 2); + __lasx_xvstelm_d(src4, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(src6, dst, 0, 0); + __lasx_xvstelm_d(src6, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(src6, dst, 0, 2); + __lasx_xvstelm_d(src6, dst, 8, 3); +} + +void ff_put_no_rnd_pixels16_y2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + if (h == 16) { + common_vt_bil_no_rnd_16x16_lasx(pixels, line_size, block, line_size); + } else if (h == 8) { + common_vt_bil_no_rnd_8x16_lasx(pixels, line_size, block, line_size); + } +} + +static void common_hv_bil_no_rnd_16x16_lasx(const uint8_t *src, + int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8, src9; + __m256i src10, src11, src12, src13, src14, src15, src16, src17; + __m256i sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (1 - src_stride_4x); + src9 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src10, src11); + src12 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src13 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src14, src15); + src16 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17); + + DUP4_ARG3(__lasx_xvpermi_q, src0, src4, 0x02, src1, src5, 0x02, src2, + src6, 0x02, src3, src7, 0x02, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvpermi_q, src4, src8, 0x02, src9, src13, 0x02, src10, + src14, 0x02, src11, src15, 0x02, src4, src5, src6, src7); + DUP2_ARG3(__lasx_xvpermi_q, src12, src16, 0x02, src13, src17, 0x02, + src8, src9); + DUP4_ARG2(__lasx_xvilvl_h, src5, src0, src6, src1, src7, src2, src8, src3, + sum0, sum2, sum4, sum6); + DUP4_ARG2(__lasx_xvilvh_h, src5, src0, src6, src1, src7, src2, src8, src3, + sum1, sum3, sum5, sum7); + src8 = __lasx_xvilvl_h(src9, src4); + src9 = __lasx_xvilvh_h(src9, src4); + + DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum0, sum0, sum1, sum1, sum2, sum2, + sum3, sum3, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum4, sum4, sum5, sum5, sum6, sum6, + sum7, sum7, src4, src5, src6, src7); + DUP2_ARG2(__lasx_xvhaddw_hu_bu, src8, src8, src9, src9, src8, src9); + + DUP4_ARG2(__lasx_xvadd_h, src0, src2, src1, src3, src2, src4, src3, src5, + sum0, sum1, sum2, sum3); + DUP4_ARG2(__lasx_xvadd_h, src4, src6, src5, src7, src6, src8, src7, src9, + sum4, sum5, sum6, sum7); + DUP4_ARG2(__lasx_xvaddi_hu, sum0, 1, sum1, 1, sum2, 1, sum3, 1, + sum0, sum1, sum2, sum3); + DUP4_ARG2(__lasx_xvaddi_hu, sum4, 1, sum5, 1, sum6, 1, sum7, 1, + sum4, sum5, sum6, sum7); + DUP4_ARG3(__lasx_xvsrani_b_h, sum1, sum0, 2, sum3, sum2, 2, sum5, sum4, 2, + sum7, sum6, 2, sum0, sum1, sum2, sum3); + __lasx_xvstelm_d(sum0, dst, 0, 0); + __lasx_xvstelm_d(sum0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum1, dst, 0, 0); + __lasx_xvstelm_d(sum1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum2, dst, 0, 0); + __lasx_xvstelm_d(sum2, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum3, dst, 0, 0); + __lasx_xvstelm_d(sum3, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum0, dst, 0, 2); + __lasx_xvstelm_d(sum0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum1, dst, 0, 2); + __lasx_xvstelm_d(sum1, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum2, dst, 0, 2); + __lasx_xvstelm_d(sum2, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum3, dst, 0, 2); + __lasx_xvstelm_d(sum3, dst, 8, 3); + dst += dst_stride; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (1 - src_stride_4x); + src9 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src10, src11); + src12 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src13 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src14, src15); + src16 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17); + + DUP4_ARG3(__lasx_xvpermi_q, src0, src4, 0x02, src1, src5, 0x02, src2, src6, 0x02, + src3, src7, 0x02, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvpermi_q, src4, src8, 0x02, src9, src13, 0x02, src10, src14, 0x02, + src11, src15, 0x02, src4, src5, src6, src7); + DUP2_ARG3(__lasx_xvpermi_q, src12, src16, 0x02, src13, src17, 0x02, src8, src9); + + DUP4_ARG2(__lasx_xvilvl_h, src5, src0, src6, src1, src7, src2, src8, src3, + sum0, sum2, sum4, sum6); + DUP4_ARG2(__lasx_xvilvh_h, src5, src0, src6, src1, src7, src2, src8, src3, + sum1, sum3, sum5, sum7); + src8 = __lasx_xvilvl_h(src9, src4); + src9 = __lasx_xvilvh_h(src9, src4); + + DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum0, sum0, sum1, sum1, sum2, sum2, + sum3, sum3, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum4, sum4, sum5, sum5, sum6, sum6, + sum7, sum7, src4, src5, src6, src7); + DUP2_ARG2(__lasx_xvhaddw_hu_bu, src8, src8, src9, src9, src8, src9); + + DUP4_ARG2(__lasx_xvadd_h, src0, src2, src1, src3, src2, src4, src3, src5, + sum0, sum1, sum2, sum3); + DUP4_ARG2(__lasx_xvadd_h, src4, src6, src5, src7, src6, src8, src7, src9, + sum4, sum5, sum6, sum7); + DUP4_ARG2(__lasx_xvaddi_hu, sum0, 1, sum1, 1, sum2, 1, sum3, 1, + sum0, sum1, sum2, sum3); + DUP4_ARG2(__lasx_xvaddi_hu, sum4, 1, sum5, 1, sum6, 1, sum7, 1, + sum4, sum5, sum6, sum7); + DUP4_ARG3(__lasx_xvsrani_b_h, sum1, sum0, 2, sum3, sum2, 2, sum5, sum4, 2, + sum7, sum6, 2, sum0, sum1, sum2, sum3); + __lasx_xvstelm_d(sum0, dst, 0, 0); + __lasx_xvstelm_d(sum0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum1, dst, 0, 0); + __lasx_xvstelm_d(sum1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum2, dst, 0, 0); + __lasx_xvstelm_d(sum2, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum3, dst, 0, 0); + __lasx_xvstelm_d(sum3, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum0, dst, 0, 2); + __lasx_xvstelm_d(sum0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum1, dst, 0, 2); + __lasx_xvstelm_d(sum1, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum2, dst, 0, 2); + __lasx_xvstelm_d(sum2, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum3, dst, 0, 2); + __lasx_xvstelm_d(sum3, dst, 8, 3); +} + +static void common_hv_bil_no_rnd_8x16_lasx(const uint8_t *src, + int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8, src9; + __m256i src10, src11, src12, src13, src14, src15, src16, src17; + __m256i sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (1 - src_stride_4x); + src9 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src10, src11); + src12 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src13 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src14, src15); + src16 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17); + + DUP4_ARG3(__lasx_xvpermi_q, src0, src4, 0x02, src1, src5, 0x02, src2, + src6, 0x02, src3, src7, 0x02, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvpermi_q, src4, src8, 0x02, src9, src13, 0x02, src10, + src14, 0x02, src11, src15, 0x02, src4, src5, src6, src7); + DUP2_ARG3(__lasx_xvpermi_q, src12, src16, 0x02, src13, src17, 0x02, src8, src9); + + DUP4_ARG2(__lasx_xvilvl_h, src5, src0, src6, src1, src7, src2, src8, src3, + sum0, sum2, sum4, sum6); + DUP4_ARG2(__lasx_xvilvh_h, src5, src0, src6, src1, src7, src2, src8, src3, + sum1, sum3, sum5, sum7); + src8 = __lasx_xvilvl_h(src9, src4); + src9 = __lasx_xvilvh_h(src9, src4); + + DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum0, sum0, sum1, sum1, sum2, sum2, + sum3, sum3, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum4, sum4, sum5, sum5, sum6, sum6, + sum7, sum7, src4, src5, src6, src7); + DUP2_ARG2(__lasx_xvhaddw_hu_bu, src8, src8, src9, src9, src8, src9); + + DUP4_ARG2(__lasx_xvadd_h, src0, src2, src1, src3, src2, src4, src3, src5, + sum0, sum1, sum2, sum3); + DUP4_ARG2(__lasx_xvadd_h, src4, src6, src5, src7, src6, src8, src7, src9, + sum4, sum5, sum6, sum7); + DUP4_ARG2(__lasx_xvaddi_hu, sum0, 1, sum1, 1, sum2, 1, sum3, 1, + sum0, sum1, sum2, sum3); + DUP4_ARG2(__lasx_xvaddi_hu, sum4, 1, sum5, 1, sum6, 1, sum7, 1, + sum4, sum5, sum6, sum7); + DUP4_ARG3(__lasx_xvsrani_b_h, sum1, sum0, 2, sum3, sum2, 2, sum5, sum4, 2, + sum7, sum6, 2, sum0, sum1, sum2, sum3); + __lasx_xvstelm_d(sum0, dst, 0, 0); + __lasx_xvstelm_d(sum0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum1, dst, 0, 0); + __lasx_xvstelm_d(sum1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum2, dst, 0, 0); + __lasx_xvstelm_d(sum2, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum3, dst, 0, 0); + __lasx_xvstelm_d(sum3, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum0, dst, 0, 2); + __lasx_xvstelm_d(sum0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum1, dst, 0, 2); + __lasx_xvstelm_d(sum1, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum2, dst, 0, 2); + __lasx_xvstelm_d(sum2, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum3, dst, 0, 2); + __lasx_xvstelm_d(sum3, dst, 8, 3); +} + +void ff_put_no_rnd_pixels16_xy2_8_lasx(uint8_t *block, + const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + if (h == 16) { + common_hv_bil_no_rnd_16x16_lasx(pixels, line_size, block, line_size); + } else if (h == 8) { + common_hv_bil_no_rnd_8x16_lasx(pixels, line_size, block, line_size); + } +} + +static void common_hz_bil_no_rnd_8x8_lasx(const uint8_t *src, + int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + __m256i src8, src9, src10, src11, src12, src13, src14, src15; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_4x = dst_stride << 2; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (1 - src_stride_4x); + src8 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src9, src10); + src11 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src12 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src13, src14); + src15 = __lasx_xvldx(_src, src_stride_3x); + + DUP4_ARG2(__lasx_xvpickev_d, src1, src0, src3, src2, src5, src4, src7, + src6, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvpickev_d, src9, src8, src11, src10, src13, src12, src15, + src14, src4, src5, src6, src7); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src5, src4, + 0x20, src7, src6, 0x20, src0, src1, src2, src3); + src0 = __lasx_xvavg_bu(src0, src2); + src1 = __lasx_xvavg_bu(src1, src3); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst + dst_stride, 0, 1); + __lasx_xvstelm_d(src0, dst + dst_stride_2x, 0, 2); + __lasx_xvstelm_d(src0, dst + dst_stride_3x, 0, 3); + dst += dst_stride_4x; + __lasx_xvstelm_d(src1, dst, 0, 0); + __lasx_xvstelm_d(src1, dst + dst_stride, 0, 1); + __lasx_xvstelm_d(src1, dst + dst_stride_2x, 0, 2); + __lasx_xvstelm_d(src1, dst + dst_stride_3x, 0, 3); +} + +static void common_hz_bil_no_rnd_4x8_lasx(const uint8_t *src, + int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_3x = src_stride_2x + src_stride; + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + uint8_t *_src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += 1; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + DUP4_ARG2(__lasx_xvpickev_d, src1, src0, src3, src2, src5, src4, src7, src6, + src0, src1, src2, src3); + DUP2_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src3, src2, 0x20, src0, src1); + src0 = __lasx_xvavg_bu(src0, src1); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst + dst_stride, 0, 1); + __lasx_xvstelm_d(src0, dst + dst_stride_2x, 0, 2); + __lasx_xvstelm_d(src0, dst + dst_stride_3x, 0, 3); +} + +void ff_put_no_rnd_pixels8_x2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + if (h == 8) { + common_hz_bil_no_rnd_8x8_lasx(pixels, line_size, block, line_size); + } else if (h == 4) { + common_hz_bil_no_rnd_4x8_lasx(pixels, line_size, block, line_size); + } +} + +static void common_vt_bil_no_rnd_8x8_lasx(const uint8_t *src, int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_4x = dst_stride << 2; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src8 = __lasx_xvld(_src, 0); + + DUP4_ARG2(__lasx_xvpickev_d, src1, src0, src2, src1, src3, src2, src4, src3, + src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvpickev_d, src5, src4, src6, src5, src7, src6, src8, src7, + src4, src5, src6, src7); + DUP4_ARG3(__lasx_xvpermi_q, src2, src0, 0x20, src3, src1, 0x20, src6, src4, + 0x20, src7, src5, 0x20, src0, src1, src2, src3); + src0 = __lasx_xvavg_bu(src0, src1); + src1 = __lasx_xvavg_bu(src2, src3); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst + dst_stride, 0, 1); + __lasx_xvstelm_d(src0, dst + dst_stride_2x, 0, 2); + __lasx_xvstelm_d(src0, dst + dst_stride_3x, 0, 3); + dst += dst_stride_4x; + __lasx_xvstelm_d(src1, dst, 0, 0); + __lasx_xvstelm_d(src1, dst + dst_stride, 0, 1); + __lasx_xvstelm_d(src1, dst + dst_stride_2x, 0, 2); + __lasx_xvstelm_d(src1, dst + dst_stride_3x, 0, 3); +} + +static void common_vt_bil_no_rnd_4x8_lasx(const uint8_t *src, int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP4_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, _src, + src_stride_3x, _src, src_stride_4x, src1, src2, src3, src4); + DUP4_ARG2(__lasx_xvpickev_d, src1, src0, src2, src1, src3, src2, src4, src3, + src0, src1, src2, src3); + DUP2_ARG3(__lasx_xvpermi_q, src2, src0, 0x20, src3, src1, 0x20, src0, src1); + src0 = __lasx_xvavg_bu(src0, src1); + __lasx_xvstelm_d(src0, dst, 0, 0); + __lasx_xvstelm_d(src0, dst + dst_stride, 0, 1); + __lasx_xvstelm_d(src0, dst + dst_stride_2x, 0, 2); + __lasx_xvstelm_d(src0, dst + dst_stride_3x, 0, 3); +} + +void ff_put_no_rnd_pixels8_y2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + if (h == 8) { + common_vt_bil_no_rnd_8x8_lasx(pixels, line_size, block, line_size); + } else if (h == 4) { + common_vt_bil_no_rnd_4x8_lasx(pixels, line_size, block, line_size); + } +} + +static void common_hv_bil_no_rnd_8x8_lasx(const uint8_t *src, int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + __m256i src8, src9, src10, src11, src12, src13, src14, src15, src16, src17; + __m256i sum0, sum1, sum2, sum3; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_4x = dst_stride << 2; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (1 - src_stride_4x); + src9 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src10, src11); + src12 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src13 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src14, src15); + src16 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17); + + DUP4_ARG2(__lasx_xvilvl_b, src9, src0, src10, src1, src11, src2, src12, src3, + src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvilvl_b, src13, src4, src14, src5, src15, src6, src16, src7, + src4, src5, src6, src7); + src8 = __lasx_xvilvl_b(src17, src8); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, + 0x20, src4, src3, 0x20, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvpermi_q, src5, src4, 0x20, src6, src5, 0x20, src7, src6, + 0x20, src8, src7, 0x20, src4, src5, src6, src7); + DUP4_ARG2(__lasx_xvhaddw_hu_bu, src0, src0, src1, src1, src2, src2, + src3, src3, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvhaddw_hu_bu, src4, src4, src5, src5, src6, src6, + src7, src7, src4, src5, src6, src7); + DUP4_ARG2(__lasx_xvadd_h, src0, src1, src2, src3, src4, src5, src6, src7, + sum0, sum1, sum2, sum3); + DUP4_ARG2(__lasx_xvaddi_hu, sum0, 1, sum1, 1, sum2, 1, sum3, 1, + sum0, sum1, sum2, sum3); + DUP2_ARG3(__lasx_xvsrani_b_h, sum1, sum0, 2, sum3, sum2, 2, sum0, sum1); + __lasx_xvstelm_d(sum0, dst, 0, 0); + __lasx_xvstelm_d(sum0, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(sum0, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(sum0, dst + dst_stride_3x, 0, 3); + dst += dst_stride_4x; + __lasx_xvstelm_d(sum1, dst, 0, 0); + __lasx_xvstelm_d(sum1, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(sum1, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(sum1, dst + dst_stride_3x, 0, 3); +} + +static void common_hv_bil_no_rnd_4x8_lasx(const uint8_t *src, int32_t src_stride, + uint8_t *dst, int32_t dst_stride) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + __m256i src8, src9, sum0, sum1; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t *_src = (uint8_t*)src; + + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += 1; + src5 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src6, src7); + src8 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src4, src9); + + DUP4_ARG2(__lasx_xvilvl_b, src5, src0, src6, src1, src7, src2, src8, src3, + src0, src1, src2, src3); + src4 = __lasx_xvilvl_b(src9, src4); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, + 0x20, src4, src3, 0x20, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvhaddw_hu_bu, src0, src0, src1, src1, src2, src2, + src3, src3, src0, src1, src2, src3); + DUP2_ARG2(__lasx_xvadd_h, src0, src1, src2, src3, sum0, sum1); + sum0 = __lasx_xvaddi_hu(sum0, 1); + sum1 = __lasx_xvaddi_hu(sum1, 1); + sum0 = __lasx_xvsrani_b_h(sum1, sum0, 2); + __lasx_xvstelm_d(sum0, dst, 0, 0); + __lasx_xvstelm_d(sum0, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(sum0, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(sum0, dst + dst_stride_3x, 0, 3); +} + +void ff_put_no_rnd_pixels8_xy2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + if (h == 8) { + common_hv_bil_no_rnd_8x8_lasx(pixels, line_size, block, line_size); + } else if (h == 4) { + common_hv_bil_no_rnd_4x8_lasx(pixels, line_size, block, line_size); + } +} + +static void common_hv_bil_16w_lasx(const uint8_t *src, int32_t src_stride, + uint8_t *dst, int32_t dst_stride, + uint8_t height) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7, src8, src9; + __m256i src10, src11, src12, src13, src14, src15, src16, src17; + __m256i sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7; + uint8_t loop_cnt; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + for (loop_cnt = (height >> 3); loop_cnt--;) { + src0 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src1, src2); + src3 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src4 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src5, src6); + src7 = __lasx_xvldx(_src, src_stride_3x); + _src += (1 - src_stride_4x); + src9 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src10, src11); + src12 = __lasx_xvldx(_src, src_stride_3x); + _src += src_stride_4x; + src13 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, + src14, src15); + src16 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src8, src17); + + DUP4_ARG3(__lasx_xvpermi_q, src0, src4, 0x02, src1, src5, 0x02, src2, + src6, 0x02, src3, src7, 0x02, src0, src1, src2, src3); + DUP4_ARG3(__lasx_xvpermi_q, src4, src8, 0x02, src9, src13, 0x02, src10, + src14, 0x02, src11, src15, 0x02, src4, src5, src6, src7); + DUP2_ARG3(__lasx_xvpermi_q, src12, src16, 0x02, src13, src17, 0x02, + src8, src9); + + DUP4_ARG2(__lasx_xvilvl_h, src5, src0, src6, src1, src7, src2, src8, + src3, sum0, sum2, sum4, sum6); + DUP4_ARG2(__lasx_xvilvh_h, src5, src0, src6, src1, src7, src2, src8, + src3, sum1, sum3, sum5, sum7); + src8 = __lasx_xvilvl_h(src9, src4); + src9 = __lasx_xvilvh_h(src9, src4); + + DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum0, sum0, sum1, sum1, sum2, sum2, + sum3, sum3, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvhaddw_hu_bu, sum4, sum4, sum5, sum5, sum6, sum6, + sum7, sum7, src4, src5, src6, src7); + DUP2_ARG2(__lasx_xvhaddw_hu_bu, src8, src8, src9, src9, src8, src9); + + DUP4_ARG2(__lasx_xvadd_h, src0, src2, src1, src3, src2, src4, src3, + src5, sum0, sum1, sum2, sum3); + DUP4_ARG2(__lasx_xvadd_h, src4, src6, src5, src7, src6, src8, src7, + src9, sum4, sum5, sum6, sum7); + DUP4_ARG3(__lasx_xvsrarni_b_h, sum1, sum0, 2, sum3, sum2, 2, sum5, + sum4, 2, sum7, sum6, 2, sum0, sum1, sum2, sum3); + __lasx_xvstelm_d(sum0, dst, 0, 0); + __lasx_xvstelm_d(sum0, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum1, dst, 0, 0); + __lasx_xvstelm_d(sum1, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum2, dst, 0, 0); + __lasx_xvstelm_d(sum2, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum3, dst, 0, 0); + __lasx_xvstelm_d(sum3, dst, 8, 1); + dst += dst_stride; + __lasx_xvstelm_d(sum0, dst, 0, 2); + __lasx_xvstelm_d(sum0, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum1, dst, 0, 2); + __lasx_xvstelm_d(sum1, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum2, dst, 0, 2); + __lasx_xvstelm_d(sum2, dst, 8, 3); + dst += dst_stride; + __lasx_xvstelm_d(sum3, dst, 0, 2); + __lasx_xvstelm_d(sum3, dst, 8, 3); + dst += dst_stride; + } +} + +void ff_put_pixels16_xy2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + common_hv_bil_16w_lasx(pixels, line_size, block, line_size, h); +} + +static void common_hv_bil_8w_lasx(const uint8_t *src, int32_t src_stride, + uint8_t *dst, int32_t dst_stride, + uint8_t height) +{ + __m256i src0, src1, src2, src3, src4, src5, src6, src7; + __m256i src8, src9, sum0, sum1; + uint8_t loop_cnt; + int32_t src_stride_2x = src_stride << 1; + int32_t src_stride_4x = src_stride << 2; + int32_t dst_stride_2x = dst_stride << 1; + int32_t dst_stride_4x = dst_stride << 2; + int32_t dst_stride_3x = dst_stride_2x + dst_stride; + int32_t src_stride_3x = src_stride_2x + src_stride; + uint8_t* _src = (uint8_t*)src; + + DUP2_ARG2(__lasx_xvld, _src, 0, _src, 1, src0, src5); + _src += src_stride; + + for (loop_cnt = (height >> 2); loop_cnt--;) { + src1 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src2, src3); + src4 = __lasx_xvldx(_src, src_stride_3x); + _src += 1; + src6 = __lasx_xvld(_src, 0); + DUP2_ARG2(__lasx_xvldx, _src, src_stride, _src, src_stride_2x, src7, src8); + src9 = __lasx_xvldx(_src, src_stride_3x); + _src += (src_stride_4x - 1); + DUP4_ARG2(__lasx_xvilvl_b, src5, src0, src6, src1, src7, src2, src8, src3, + src0, src1, src2, src3); + src5 = __lasx_xvilvl_b(src9, src4); + DUP4_ARG3(__lasx_xvpermi_q, src1, src0, 0x20, src2, src1, 0x20, src3, src2, + 0x20, src5, src3, 0x20, src0, src1, src2, src3); + DUP4_ARG2(__lasx_xvhaddw_hu_bu, src0, src0, src1, src1, src2, src2, + src3, src3, src0, src1, src2, src3); + DUP2_ARG2(__lasx_xvadd_h, src0, src1, src2, src3, sum0, sum1); + sum0 = __lasx_xvsrarni_b_h(sum1, sum0, 2); + __lasx_xvstelm_d(sum0, dst, 0, 0); + __lasx_xvstelm_d(sum0, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(sum0, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(sum0, dst + dst_stride_3x, 0, 3); + dst += dst_stride_4x; + src0 = src4; + src5 = src9; + } +} + +void ff_put_pixels8_xy2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h) +{ + common_hv_bil_8w_lasx(pixels, line_size, block, line_size, h); +} diff --git a/libavcodec/loongarch/hpeldsp_lasx.h b/libavcodec/loongarch/hpeldsp_lasx.h new file mode 100644 index 0000000000..2e035eade8 --- /dev/null +++ b/libavcodec/loongarch/hpeldsp_lasx.h @@ -0,0 +1,58 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Shiyou Yin + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVCODEC_LOONGARCH_HPELDSP_LASX_H +#define AVCODEC_LOONGARCH_HPELDSP_LASX_H + +#include +#include +#include "libavutil/attributes.h" + +void ff_put_pixels8_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); +void ff_put_pixels8_x2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int32_t h); +void ff_put_pixels8_y2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int32_t h); +void ff_put_pixels16_8_lsx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); +void ff_put_pixels16_x2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int32_t h); +void ff_put_pixels16_y2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int32_t h); +void ff_put_no_rnd_pixels16_x2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); +void ff_put_no_rnd_pixels16_y2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); +void ff_put_no_rnd_pixels16_xy2_8_lasx(uint8_t *block, + const uint8_t *pixels, + ptrdiff_t line_size, int h); +void ff_put_no_rnd_pixels8_x2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); +void ff_put_no_rnd_pixels8_y2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); +void ff_put_no_rnd_pixels8_xy2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); +void ff_put_pixels8_xy2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); +void ff_put_pixels16_xy2_8_lasx(uint8_t *block, const uint8_t *pixels, + ptrdiff_t line_size, int h); +#endif From patchwork Wed Dec 29 10:18:21 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= X-Patchwork-Id: 32944 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp11995502iog; Wed, 29 Dec 2021 02:18:59 -0800 (PST) X-Google-Smtp-Source: ABdhPJysqqlyDP/benIyFAdXcaOAy8I7u69vIshWMFlmA0C+3r6koiAnOz1jyBn8qahYrJVZVKrZ X-Received: by 2002:a05:6402:438e:: with SMTP id o14mr6494859edc.121.1640773139758; Wed, 29 Dec 2021 02:18:59 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1640773139; cv=none; d=google.com; s=arc-20160816; b=XETET/JNGpZSJ1ZxaDgbYubkHg8HKh7n2b/gTd43UqjUISF+DcRpKUoqxnGtKE1lAF 0WYk3ZLhdP11kW7LoLQbdIQx5ew7WHHwYbbSwAq1E75vzJVkF9cGSH9KlgL3ZZpQNJwo 2y1Byk9HEAGD7AdrBFeKah5qHy5OCZlWdXvpiNfPJ/a+9zUrZ8DvWQFW5ae8zPzrEnzl VauHZlgeUdR1JCBUXEG5o/fHRbIZGfHEeMLhoQ9OBi+r9OT+4lAtxuWNPlbhTECU89To NCPIDLrTp+AYEYLHnPYnchb88UZUO4+LAZWtsFoe4VdWNXm2IA7Jt6x3rPPCYaWRHfqT 9wkA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:reply-to:list-subscribe :list-help:list-post:list-archive:list-unsubscribe:list-id :precedence:subject:mime-version:references:in-reply-to:message-id :date:to:from:delivered-to; bh=TaFdOjUmxuhOK3sf35KMxBjqGRcJK61lOi6tUzpxxF8=; b=fL8SLaZYMnpGvmY6Ow0WP4KuxODs8jj2YKKsDF4uD3QrlslvfMSQXDhVvJZVahRZT+ 1Vf1+wRns299LJNu3weYQYexV/9kpsh9Bq3kvmJZsD4/+XOp/SWoGrMiv/G6ORaVzkty VSguZY/HbGmylrdcCnedTgXimCw+IkwNFy8yygRYW9vOfXRY/IJvnJHWJIpDBDDDEuWd ARPaifsg1Obw7B+oJ2AVqVk5c5WwJGKzQX97lhdDHwlwKw9W5P9otzLDwamu2RLbxBb2 95NDT36hIzxcgzeCXBA8AybYWwqVoJfqM1Qu9OHTsOaCD1TbhWso7O6BtY/wMpaX1aNk Fu5A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id g24si10287114edu.555.2021.12.29.02.18.59; Wed, 29 Dec 2021 02:18:59 -0800 (PST) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id C60A268008C; Wed, 29 Dec 2021 12:18:38 +0200 (EET) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from loongson.cn (mail.loongson.cn [114.242.206.163]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 8D02268AE4E for ; Wed, 29 Dec 2021 12:18:29 +0200 (EET) Received: from localhost (unknown [36.33.26.144]) by mail.loongson.cn (Coremail) with SMTP id AQAAf9DxOZbyNcxh3ygFAA--.4628S3; Wed, 29 Dec 2021 18:18:26 +0800 (CST) From: Hao Chen To: ffmpeg-devel@ffmpeg.org Date: Wed, 29 Dec 2021 18:18:21 +0800 Message-Id: <20211229101822.31956-3-chenhao@loongson.cn> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20211229101822.31956-1-chenhao@loongson.cn> References: <20211229101822.31956-1-chenhao@loongson.cn> MIME-Version: 1.0 X-CM-TRANSID: AQAAf9DxOZbyNcxh3ygFAA--.4628S3 X-Coremail-Antispam: 1UD129KBjvAXoWfuFyUKr13Xr1ftF15uw4UArb_yoW5Jr4xJo WUK397tws7KryIyr98JrnYyayUGa4fCF15Aw17Xws2ya4rXFy5ArW29w15ZF17Krn5Wa4x Jry2qFy2v3W3Jr9rn29KB7ZKAUJUUUU5529EdanIXcx71UUUUU7v73VFW2AGmfu7bjvjm3 AaLaJ3UjIYCTnIWjp_UUUY07k0a2IF6w4xM7kC6x804xWl14x267AKxVWUJVW8JwAFc2x0 x2IEx4CE42xK8VAvwI8IcIk0rVWrJVCq3wAFIxvE14AKwVWUJVWUGwA2ocxC64kIII0Yj4 1l84x0c7CEw4AK67xGY2AK021l84ACjcxK6xIIjxv20xvE14v26ryj6F1UM28EF7xvwVC0 I7IYx2IY6xkF7I0E14v26r4UJVWxJr1l84ACjcxK6I8E87Iv67AKxVWxJr0_GcWl84ACjc xK6I8E87Iv6xkF7I0E14v26rxl6s0DM2AIxVAIcxkEcVAq07x20xvEncxIr21l5I8CrVAC Y4xI64kE6c02F40Ex7xfMcIj6xIIjxv20xvE14v26r126r1DMcIj6I8E87Iv67AKxVWxJV W8Jr1lOx8S6xCaFVCjc4AY6r1j6r4UM4x0Y48IcxkI7VAKI48JMxkIecxEwVAFwVW8twCF 04k20xvY0x0EwIxGrwCFx2IqxVCFs4IE7xkEbVWUJVW8JwC20s026c02F40E14v26r1j6r 18MI8I3I0E7480Y4vE14v26r106r1rMI8E67AF67kF1VAFwI0_Jr0_JrylIxkGc2Ij64vI r41lIxAIcVC0I7IYx2IY67AKxVWUCVW8JwCI42IY6xIIjxv20xvEc7CjxVAFwI0_Jr0_Gr 1lIxAIcVCF04k26cxKx2IYs7xG6r1j6r1xMIIF0xvEx4A2jsIE14v26r1j6r4UMIIF0xvE x4A2jsIEc7CjxVAFwI0_Gr0_Gr1UYxBIdaVFxhVjvjDU0xZFpf9x07j8nYwUUUUU= X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/ Subject: [FFmpeg-devel] [PATCH v3 2/3] avcodec: [loongarch] Optimize idctdstp with LASX. X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: neinNsop9XBZ ./ffmpeg -i 8_mpeg4_1080p_24fps_12Mbps.avi -f rawvideo -y /dev/null -an before:433fps after :552fps --- libavcodec/idctdsp.c | 2 + libavcodec/idctdsp.h | 2 + libavcodec/loongarch/Makefile | 3 + libavcodec/loongarch/idctdsp_init_loongarch.c | 45 +++ libavcodec/loongarch/idctdsp_lasx.c | 124 ++++++++ libavcodec/loongarch/idctdsp_loongarch.h | 41 +++ libavcodec/loongarch/simple_idct_lasx.c | 297 ++++++++++++++++++ 7 files changed, 514 insertions(+) create mode 100644 libavcodec/loongarch/idctdsp_init_loongarch.c create mode 100644 libavcodec/loongarch/idctdsp_lasx.c create mode 100644 libavcodec/loongarch/idctdsp_loongarch.h create mode 100644 libavcodec/loongarch/simple_idct_lasx.c diff --git a/libavcodec/idctdsp.c b/libavcodec/idctdsp.c index 846ed0b0f8..71bd03c606 100644 --- a/libavcodec/idctdsp.c +++ b/libavcodec/idctdsp.c @@ -315,6 +315,8 @@ av_cold void ff_idctdsp_init(IDCTDSPContext *c, AVCodecContext *avctx) ff_idctdsp_init_x86(c, avctx, high_bit_depth); if (ARCH_MIPS) ff_idctdsp_init_mips(c, avctx, high_bit_depth); + if (ARCH_LOONGARCH) + ff_idctdsp_init_loongarch(c, avctx, high_bit_depth); ff_init_scantable_permutation(c->idct_permutation, c->perm_type); diff --git a/libavcodec/idctdsp.h b/libavcodec/idctdsp.h index ca21a31a02..014488aec3 100644 --- a/libavcodec/idctdsp.h +++ b/libavcodec/idctdsp.h @@ -118,5 +118,7 @@ void ff_idctdsp_init_x86(IDCTDSPContext *c, AVCodecContext *avctx, unsigned high_bit_depth); void ff_idctdsp_init_mips(IDCTDSPContext *c, AVCodecContext *avctx, unsigned high_bit_depth); +void ff_idctdsp_init_loongarch(IDCTDSPContext *c, AVCodecContext *avctx, + unsigned high_bit_depth); #endif /* AVCODEC_IDCTDSP_H */ diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile index 07a401d883..c4d71e801b 100644 --- a/libavcodec/loongarch/Makefile +++ b/libavcodec/loongarch/Makefile @@ -6,6 +6,7 @@ OBJS-$(CONFIG_VP8_DECODER) += loongarch/vp8dsp_init_loongarch.o OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9dsp_init_loongarch.o OBJS-$(CONFIG_VC1DSP) += loongarch/vc1dsp_init_loongarch.o OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_init_loongarch.o +OBJS-$(CONFIG_IDCTDSP) += loongarch/idctdsp_init_loongarch.o LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \ @@ -14,6 +15,8 @@ LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \ LASX-OBJS-$(CONFIG_H264PRED) += loongarch/h264_intrapred_lasx.o LASX-OBJS-$(CONFIG_VC1_DECODER) += loongarch/vc1dsp_lasx.o LASX-OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_lasx.o +LASX-OBJS-$(CONFIG_IDCTDSP) += loongarch/simple_idct_lasx.o \ + loongarch/idctdsp_lasx.o LSX-OBJS-$(CONFIG_VP8_DECODER) += loongarch/vp8_mc_lsx.o \ loongarch/vp8_lpf_lsx.o LSX-OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9_mc_lsx.o \ diff --git a/libavcodec/loongarch/idctdsp_init_loongarch.c b/libavcodec/loongarch/idctdsp_init_loongarch.c new file mode 100644 index 0000000000..9d1d21cc18 --- /dev/null +++ b/libavcodec/loongarch/idctdsp_init_loongarch.c @@ -0,0 +1,45 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Hao Chen + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/loongarch/cpu.h" +#include "idctdsp_loongarch.h" +#include "libavcodec/xvididct.h" + +av_cold void ff_idctdsp_init_loongarch(IDCTDSPContext *c, AVCodecContext *avctx, + unsigned high_bit_depth) +{ + int cpu_flags = av_get_cpu_flags(); + + if (have_lasx(cpu_flags)) { + if ((avctx->lowres != 1) && (avctx->lowres != 2) && (avctx->lowres != 3) && + (avctx->bits_per_raw_sample != 10) && + (avctx->bits_per_raw_sample != 12) && + (avctx->idct_algo == FF_IDCT_AUTO)) { + c->idct_put = ff_simple_idct_put_lasx; + c->idct_add = ff_simple_idct_add_lasx; + c->idct = ff_simple_idct_lasx; + c->perm_type = FF_IDCT_PERM_NONE; + } + c->put_pixels_clamped = ff_put_pixels_clamped_lasx; + c->put_signed_pixels_clamped = ff_put_signed_pixels_clamped_lasx; + c->add_pixels_clamped = ff_add_pixels_clamped_lasx; + } +} diff --git a/libavcodec/loongarch/idctdsp_lasx.c b/libavcodec/loongarch/idctdsp_lasx.c new file mode 100644 index 0000000000..1cfab0e028 --- /dev/null +++ b/libavcodec/loongarch/idctdsp_lasx.c @@ -0,0 +1,124 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Hao Chen + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "idctdsp_loongarch.h" +#include "libavutil/loongarch/loongson_intrinsics.h" + +void ff_put_pixels_clamped_lasx(const int16_t *block, + uint8_t *av_restrict pixels, + ptrdiff_t stride) +{ + __m256i b0, b1, b2, b3; + __m256i temp0, temp1; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + + DUP4_ARG2(__lasx_xvld, block, 0, block, 32, block, 64, block, 96, + b0, b1, b2, b3); + DUP4_ARG1(__lasx_xvclip255_h, b0, b1, b2, b3, b0, b1, b2, b3); + DUP2_ARG2(__lasx_xvpickev_b, b1, b0, b3, b2, temp0, temp1); + __lasx_xvstelm_d(temp0, pixels, 0, 0); + __lasx_xvstelm_d(temp0, pixels + stride, 0, 2); + __lasx_xvstelm_d(temp0, pixels + stride_2x, 0, 1); + __lasx_xvstelm_d(temp0, pixels + stride_3x, 0, 3); + pixels += stride_4x; + __lasx_xvstelm_d(temp1, pixels, 0, 0); + __lasx_xvstelm_d(temp1, pixels + stride, 0, 2); + __lasx_xvstelm_d(temp1, pixels + stride_2x, 0, 1); + __lasx_xvstelm_d(temp1, pixels + stride_3x, 0, 3); +} + +void ff_put_signed_pixels_clamped_lasx(const int16_t *block, + uint8_t *av_restrict pixels, + ptrdiff_t stride) +{ + __m256i b0, b1, b2, b3; + __m256i temp0, temp1; + __m256i const_128 = {0x0080008000800080, 0x0080008000800080, + 0x0080008000800080, 0x0080008000800080}; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + + DUP4_ARG2(__lasx_xvld, block, 0, block, 32, block, 64, block, 96, + b0, b1, b2, b3); + DUP4_ARG2(__lasx_xvadd_h, b0, const_128, b1, const_128, b2, const_128, + b3, const_128, b0, b1, b2, b3); + DUP4_ARG1(__lasx_xvclip255_h, b0, b1, b2, b3, b0, b1, b2, b3); + DUP2_ARG2(__lasx_xvpickev_b, b1, b0, b3, b2, temp0, temp1); + __lasx_xvstelm_d(temp0, pixels, 0, 0); + __lasx_xvstelm_d(temp0, pixels + stride, 0, 2); + __lasx_xvstelm_d(temp0, pixels + stride_2x, 0, 1); + __lasx_xvstelm_d(temp0, pixels + stride_3x, 0, 3); + pixels += stride_4x; + __lasx_xvstelm_d(temp1, pixels, 0, 0); + __lasx_xvstelm_d(temp1, pixels + stride, 0, 2); + __lasx_xvstelm_d(temp1, pixels + stride_2x, 0, 1); + __lasx_xvstelm_d(temp1, pixels + stride_3x, 0, 3); +} + +void ff_add_pixels_clamped_lasx(const int16_t *block, + uint8_t *av_restrict pixels, + ptrdiff_t stride) +{ + __m256i b0, b1, b2, b3; + __m256i p0, p1, p2, p3, p4, p5, p6, p7; + __m256i temp0, temp1, temp2, temp3; + uint8_t *pix = pixels; + ptrdiff_t stride_2x = stride << 1; + ptrdiff_t stride_4x = stride << 2; + ptrdiff_t stride_3x = stride_2x + stride; + + DUP4_ARG2(__lasx_xvld, block, 0, block, 32, block, 64, block, 96, + b0, b1, b2, b3); + p0 = __lasx_xvldrepl_d(pix, 0); + pix += stride; + p1 = __lasx_xvldrepl_d(pix, 0); + pix += stride; + p2 = __lasx_xvldrepl_d(pix, 0); + pix += stride; + p3 = __lasx_xvldrepl_d(pix, 0); + pix += stride; + p4 = __lasx_xvldrepl_d(pix, 0); + pix += stride; + p5 = __lasx_xvldrepl_d(pix, 0); + pix += stride; + p6 = __lasx_xvldrepl_d(pix, 0); + pix += stride; + p7 = __lasx_xvldrepl_d(pix, 0); + DUP4_ARG3(__lasx_xvpermi_q, p1, p0, 0x20, p3, p2, 0x20, p5, p4, 0x20, + p7, p6, 0x20, temp0, temp1, temp2, temp3); + DUP4_ARG2(__lasx_xvaddw_h_h_bu, b0, temp0, b1, temp1, b2, temp2, b3, temp3, + temp0, temp1, temp2, temp3); + DUP4_ARG1(__lasx_xvclip255_h, temp0, temp1, temp2, temp3, + temp0, temp1, temp2, temp3); + DUP2_ARG2(__lasx_xvpickev_b, temp1, temp0, temp3, temp2, temp0, temp1); + __lasx_xvstelm_d(temp0, pixels, 0, 0); + __lasx_xvstelm_d(temp0, pixels + stride, 0, 2); + __lasx_xvstelm_d(temp0, pixels + stride_2x, 0, 1); + __lasx_xvstelm_d(temp0, pixels + stride_3x, 0, 3); + pixels += stride_4x; + __lasx_xvstelm_d(temp1, pixels, 0, 0); + __lasx_xvstelm_d(temp1, pixels + stride, 0, 2); + __lasx_xvstelm_d(temp1, pixels + stride_2x, 0, 1); + __lasx_xvstelm_d(temp1, pixels + stride_3x, 0, 3); +} diff --git a/libavcodec/loongarch/idctdsp_loongarch.h b/libavcodec/loongarch/idctdsp_loongarch.h new file mode 100644 index 0000000000..cae8e7af58 --- /dev/null +++ b/libavcodec/loongarch/idctdsp_loongarch.h @@ -0,0 +1,41 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Hao Chen + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVCODEC_LOONGARCH_IDCTDSP_LOONGARCH_H +#define AVCODEC_LOONGARCH_IDCTDSP_LOONGARCH_H + +#include +#include "libavcodec/mpegvideo.h" + +void ff_simple_idct_lasx(int16_t *block); +void ff_simple_idct_put_lasx(uint8_t *dest, ptrdiff_t stride_dst, int16_t *block); +void ff_simple_idct_add_lasx(uint8_t *dest, ptrdiff_t stride_dst, int16_t *block); +void ff_put_pixels_clamped_lasx(const int16_t *block, + uint8_t *av_restrict pixels, + ptrdiff_t line_size); +void ff_put_signed_pixels_clamped_lasx(const int16_t *block, + uint8_t *av_restrict pixels, + ptrdiff_t line_size); +void ff_add_pixels_clamped_lasx(const int16_t *block, + uint8_t *av_restrict pixels, + ptrdiff_t line_size); + +#endif /* AVCODEC_LOONGARCH_IDCTDSP_LOONGARCH_H */ diff --git a/libavcodec/loongarch/simple_idct_lasx.c b/libavcodec/loongarch/simple_idct_lasx.c new file mode 100644 index 0000000000..a0d936b666 --- /dev/null +++ b/libavcodec/loongarch/simple_idct_lasx.c @@ -0,0 +1,297 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Hao Chen + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/loongarch/loongson_intrinsics.h" +#include "idctdsp_loongarch.h" + +#define LASX_TRANSPOSE4x16(in_0, in_1, in_2, in_3, out_0, out_1, out_2, out_3) \ +{ \ + __m256i temp_0, temp_1, temp_2, temp_3; \ + __m256i temp_4, temp_5, temp_6, temp_7; \ + DUP4_ARG3(__lasx_xvpermi_q, in_2, in_0, 0x20, in_2, in_0, 0x31, in_3, in_1,\ + 0x20, in_3, in_1, 0x31, temp_0, temp_1, temp_2, temp_3); \ + DUP2_ARG2(__lasx_xvilvl_h, temp_1, temp_0, temp_3, temp_2, temp_4, temp_6);\ + DUP2_ARG2(__lasx_xvilvh_h, temp_1, temp_0, temp_3, temp_2, temp_5, temp_7);\ + DUP2_ARG2(__lasx_xvilvl_w, temp_6, temp_4, temp_7, temp_5, out_0, out_2); \ + DUP2_ARG2(__lasx_xvilvh_w, temp_6, temp_4, temp_7, temp_5, out_1, out_3); \ +} + +#define LASX_IDCTROWCONDDC \ + const_val = 16383 * ((1 << 19) / 16383); \ + const_val1 = __lasx_xvreplgr2vr_w(const_val); \ + DUP4_ARG2(__lasx_xvld, block, 0, block, 32, block, 64, block, 96, \ + in0, in1, in2, in3); \ + LASX_TRANSPOSE4x16(in0, in1, in2, in3, in0, in1, in2, in3); \ + a0 = __lasx_xvpermi_d(in0, 0xD8); \ + a0 = __lasx_vext2xv_w_h(a0); \ + temp = __lasx_xvslli_w(a0, 3); \ + a1 = __lasx_xvpermi_d(in0, 0x8D); \ + a1 = __lasx_vext2xv_w_h(a1); \ + a2 = __lasx_xvpermi_d(in1, 0xD8); \ + a2 = __lasx_vext2xv_w_h(a2); \ + a3 = __lasx_xvpermi_d(in1, 0x8D); \ + a3 = __lasx_vext2xv_w_h(a3); \ + b0 = __lasx_xvpermi_d(in2, 0xD8); \ + b0 = __lasx_vext2xv_w_h(b0); \ + b1 = __lasx_xvpermi_d(in2, 0x8D); \ + b1 = __lasx_vext2xv_w_h(b1); \ + b2 = __lasx_xvpermi_d(in3, 0xD8); \ + b2 = __lasx_vext2xv_w_h(b2); \ + b3 = __lasx_xvpermi_d(in3, 0x8D); \ + b3 = __lasx_vext2xv_w_h(b3); \ + select_vec = a0 | a1 | a2 | a3 | b0 | b1 | b2 | b3; \ + select_vec = __lasx_xvslti_wu(select_vec, 1); \ + \ + DUP4_ARG2(__lasx_xvrepl128vei_h, w1, 2, w1, 3, w1, 4, w1, 5, \ + w2, w3, w4, w5); \ + DUP2_ARG2(__lasx_xvrepl128vei_h, w1, 6, w1, 7, w6, w7); \ + w1 = __lasx_xvrepl128vei_h(w1, 1); \ + \ + /* part of FUNC6(idctRowCondDC) */ \ + temp0 = __lasx_xvmaddwl_w_h(const_val0, in0, w4); \ + DUP2_ARG2(__lasx_xvmulwl_w_h, in1, w2, in1, w6, temp1, temp2); \ + a0 = __lasx_xvadd_w(temp0, temp1); \ + a1 = __lasx_xvadd_w(temp0, temp2); \ + a2 = __lasx_xvsub_w(temp0, temp2); \ + a3 = __lasx_xvsub_w(temp0, temp1); \ + \ + DUP2_ARG2(__lasx_xvilvh_h, in1, in0, w3, w1, temp0, temp1); \ + b0 = __lasx_xvdp2_w_h(temp0, temp1); \ + temp1 = __lasx_xvneg_h(w7); \ + temp2 = __lasx_xvilvl_h(temp1, w3); \ + b1 = __lasx_xvdp2_w_h(temp0, temp2); \ + temp1 = __lasx_xvneg_h(w1); \ + temp2 = __lasx_xvilvl_h(temp1, w5); \ + b2 = __lasx_xvdp2_w_h(temp0, temp2); \ + temp1 = __lasx_xvneg_h(w5); \ + temp2 = __lasx_xvilvl_h(temp1, w7); \ + b3 = __lasx_xvdp2_w_h(temp0, temp2); \ + \ + /* if (AV_RAN64A(row + 4)) */ \ + DUP2_ARG2(__lasx_xvilvl_h, in3, in2, w6, w4, temp0, temp1); \ + a0 = __lasx_xvdp2add_w_h(a0, temp0, temp1); \ + temp1 = __lasx_xvilvl_h(w2, w4); \ + a1 = __lasx_xvdp2sub_w_h(a1, temp0, temp1); \ + temp1 = __lasx_xvneg_h(w4); \ + temp2 = __lasx_xvilvl_h(w2, temp1); \ + a2 = __lasx_xvdp2add_w_h(a2, temp0, temp2); \ + temp1 = __lasx_xvneg_h(w6); \ + temp2 = __lasx_xvilvl_h(temp1, w4); \ + a3 = __lasx_xvdp2add_w_h(a3, temp0, temp2); \ + \ + DUP2_ARG2(__lasx_xvilvh_h, in3, in2, w7, w5, temp0, temp1); \ + b0 = __lasx_xvdp2add_w_h(b0, temp0, temp1); \ + DUP2_ARG2(__lasx_xvilvl_h, w5, w1, w3, w7, temp1, temp2); \ + b1 = __lasx_xvdp2sub_w_h(b1, temp0, temp1); \ + b2 = __lasx_xvdp2add_w_h(b2, temp0, temp2); \ + temp1 = __lasx_xvneg_h(w1); \ + temp2 = __lasx_xvilvl_h(temp1, w3); \ + b3 = __lasx_xvdp2add_w_h(b3, temp0, temp2); \ + \ + DUP4_ARG2(__lasx_xvadd_w, a0, b0, a1, b1, a2, b2, a3, b3, \ + temp0, temp1, temp2, temp3); \ + DUP4_ARG2(__lasx_xvsub_w, a0, b0, a1, b1, a2, b2, a3, b3, \ + a0, a1, a2, a3); \ + DUP4_ARG2(__lasx_xvsrai_w, temp0, 11, temp1, 11, temp2, 11, temp3, 11, \ + temp0, temp1, temp2, temp3); \ + DUP4_ARG2(__lasx_xvsrai_w, a0, 11, a1, 11, a2, 11, a3, 11, a0, a1, a2, a3);\ + DUP4_ARG3(__lasx_xvbitsel_v, temp0, temp, select_vec, temp1, temp, \ + select_vec, temp2, temp, select_vec, temp3, temp, select_vec, \ + in0, in1, in2, in3); \ + DUP4_ARG3(__lasx_xvbitsel_v, a0, temp, select_vec, a1, temp, \ + select_vec, a2, temp, select_vec, a3, temp, select_vec, \ + a0, a1, a2, a3); \ + DUP4_ARG2(__lasx_xvpickev_h, in1, in0, in3, in2, a2, a3, a0, a1, \ + in0, in1, in2, in3); \ + DUP4_ARG2(__lasx_xvpermi_d, in0, 0xD8, in1, 0xD8, in2, 0xD8, in3, 0xD8, \ + in0, in1, in2, in3); \ + +#define LASX_IDCTCOLS \ + /* part of FUNC6(idctSparaseCol) */ \ + LASX_TRANSPOSE4x16(in0, in1, in2, in3, in0, in1, in2, in3); \ + temp0 = __lasx_xvmaddwl_w_h(const_val1, in0, w4); \ + DUP2_ARG2(__lasx_xvmulwl_w_h, in1, w2, in1, w6, temp1, temp2); \ + a0 = __lasx_xvadd_w(temp0, temp1); \ + a1 = __lasx_xvadd_w(temp0, temp2); \ + a2 = __lasx_xvsub_w(temp0, temp2); \ + a3 = __lasx_xvsub_w(temp0, temp1); \ + \ + DUP2_ARG2(__lasx_xvilvh_h, in1, in0, w3, w1, temp0, temp1); \ + b0 = __lasx_xvdp2_w_h(temp0, temp1); \ + temp1 = __lasx_xvneg_h(w7); \ + temp2 = __lasx_xvilvl_h(temp1, w3); \ + b1 = __lasx_xvdp2_w_h(temp0, temp2); \ + temp1 = __lasx_xvneg_h(w1); \ + temp2 = __lasx_xvilvl_h(temp1, w5); \ + b2 = __lasx_xvdp2_w_h(temp0, temp2); \ + temp1 = __lasx_xvneg_h(w5); \ + temp2 = __lasx_xvilvl_h(temp1, w7); \ + b3 = __lasx_xvdp2_w_h(temp0, temp2); \ + \ + /* if (AV_RAN64A(row + 4)) */ \ + DUP2_ARG2(__lasx_xvilvl_h, in3, in2, w6, w4, temp0, temp1); \ + a0 = __lasx_xvdp2add_w_h(a0, temp0, temp1); \ + temp1 = __lasx_xvilvl_h(w2, w4); \ + a1 = __lasx_xvdp2sub_w_h(a1, temp0, temp1); \ + temp1 = __lasx_xvneg_h(w4); \ + temp2 = __lasx_xvilvl_h(w2, temp1); \ + a2 = __lasx_xvdp2add_w_h(a2, temp0, temp2); \ + temp1 = __lasx_xvneg_h(w6); \ + temp2 = __lasx_xvilvl_h(temp1, w4); \ + a3 = __lasx_xvdp2add_w_h(a3, temp0, temp2); \ + \ + DUP2_ARG2(__lasx_xvilvh_h, in3, in2, w7, w5, temp0, temp1); \ + b0 = __lasx_xvdp2add_w_h(b0, temp0, temp1); \ + DUP2_ARG2(__lasx_xvilvl_h, w5, w1, w3, w7, temp1, temp2); \ + b1 = __lasx_xvdp2sub_w_h(b1, temp0, temp1); \ + b2 = __lasx_xvdp2add_w_h(b2, temp0, temp2); \ + temp1 = __lasx_xvneg_h(w1); \ + temp2 = __lasx_xvilvl_h(temp1, w3); \ + b3 = __lasx_xvdp2add_w_h(b3, temp0, temp2); \ + \ + DUP4_ARG2(__lasx_xvadd_w, a0, b0, a1, b1, a2, b2, a3, b3, \ + temp0, temp1, temp2, temp3); \ + DUP4_ARG2(__lasx_xvsub_w, a3, b3, a2, b2, a1, b1, a0, b0, \ + a3, a2, a1, a0); \ + DUP4_ARG3(__lasx_xvsrani_h_w, temp1, temp0, 20, temp3, temp2, 20, a2, a3, \ + 20, a0, a1, 20, in0, in1, in2, in3); \ + +void ff_simple_idct_lasx(int16_t *block) +{ + int32_t const_val = 1 << 10; + __m256i w1 = {0x4B42539F58C50000, 0x11A822A332493FFF, + 0x4B42539F58C50000, 0x11A822A332493FFF}; + __m256i in0, in1, in2, in3; + __m256i w2, w3, w4, w5, w6, w7; + __m256i a0, a1, a2, a3; + __m256i b0, b1, b2, b3; + __m256i temp0, temp1, temp2, temp3; + __m256i const_val0 = __lasx_xvreplgr2vr_w(const_val); + __m256i const_val1, select_vec, temp; + + LASX_IDCTROWCONDDC + LASX_IDCTCOLS + DUP4_ARG2(__lasx_xvpermi_d, in0, 0xD8, in1, 0xD8, in2, 0xD8, in3, 0xD8, + in0, in1, in2, in3); + __lasx_xvst(in0, block, 0); + __lasx_xvst(in1, block, 32); + __lasx_xvst(in2, block, 64); + __lasx_xvst(in3, block, 96); +} + +void ff_simple_idct_put_lasx(uint8_t *dst, ptrdiff_t dst_stride, + int16_t *block) +{ + int32_t const_val = 1 << 10; + ptrdiff_t dst_stride_2x = dst_stride << 1; + ptrdiff_t dst_stride_4x = dst_stride << 2; + ptrdiff_t dst_stride_3x = dst_stride_2x + dst_stride; + __m256i w1 = {0x4B42539F58C50000, 0x11A822A332493FFF, + 0x4B42539F58C50000, 0x11A822A332493FFF}; + __m256i in0, in1, in2, in3; + __m256i w2, w3, w4, w5, w6, w7; + __m256i a0, a1, a2, a3; + __m256i b0, b1, b2, b3; + __m256i temp0, temp1, temp2, temp3; + __m256i const_val0 = __lasx_xvreplgr2vr_w(const_val); + __m256i const_val1, select_vec, temp; + + LASX_IDCTROWCONDDC + LASX_IDCTCOLS + DUP4_ARG2(__lasx_xvpermi_d, in0, 0xD8, in1, 0xD8, in2, 0xD8, in3, 0xD8, + in0, in1, in2, in3); + DUP4_ARG1(__lasx_xvclip255_h, in0, in1, in2, in3, in0, in1, in2, in3); + DUP2_ARG2(__lasx_xvpickev_b, in1, in0, in3, in2, in0, in1); + __lasx_xvstelm_d(in0, dst, 0, 0); + __lasx_xvstelm_d(in0, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(in0, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(in0, dst + dst_stride_3x, 0, 3); + dst += dst_stride_4x; + __lasx_xvstelm_d(in1, dst, 0, 0); + __lasx_xvstelm_d(in1, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(in1, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(in1, dst + dst_stride_3x, 0, 3); +} + +void ff_simple_idct_add_lasx(uint8_t *dst, ptrdiff_t dst_stride, + int16_t *block) +{ + int32_t const_val = 1 << 10; + uint8_t *dst1 = dst; + ptrdiff_t dst_stride_2x = dst_stride << 1; + ptrdiff_t dst_stride_4x = dst_stride << 2; + ptrdiff_t dst_stride_3x = dst_stride_2x + dst_stride; + + __m256i w1 = {0x4B42539F58C50000, 0x11A822A332493FFF, + 0x4B42539F58C50000, 0x11A822A332493FFF}; + __m256i sh = {0x0003000200010000, 0x000B000A00090008, + 0x0007000600050004, 0x000F000E000D000C}; + __m256i in0, in1, in2, in3; + __m256i w2, w3, w4, w5, w6, w7; + __m256i a0, a1, a2, a3; + __m256i b0, b1, b2, b3; + __m256i temp0, temp1, temp2, temp3; + __m256i const_val0 = __lasx_xvreplgr2vr_w(const_val); + __m256i const_val1, select_vec, temp; + + LASX_IDCTROWCONDDC + LASX_IDCTCOLS + a0 = __lasx_xvldrepl_d(dst1, 0); + a0 = __lasx_vext2xv_hu_bu(a0); + dst1 += dst_stride; + a1 = __lasx_xvldrepl_d(dst1, 0); + a1 = __lasx_vext2xv_hu_bu(a1); + dst1 += dst_stride; + a2 = __lasx_xvldrepl_d(dst1, 0); + a2 = __lasx_vext2xv_hu_bu(a2); + dst1 += dst_stride; + a3 = __lasx_xvldrepl_d(dst1, 0); + a3 = __lasx_vext2xv_hu_bu(a3); + dst1 += dst_stride; + b0 = __lasx_xvldrepl_d(dst1, 0); + b0 = __lasx_vext2xv_hu_bu(b0); + dst1 += dst_stride; + b1 = __lasx_xvldrepl_d(dst1, 0); + b1 = __lasx_vext2xv_hu_bu(b1); + dst1 += dst_stride; + b2 = __lasx_xvldrepl_d(dst1, 0); + b2 = __lasx_vext2xv_hu_bu(b2); + dst1 += dst_stride; + b3 = __lasx_xvldrepl_d(dst1, 0); + b3 = __lasx_vext2xv_hu_bu(b3); + DUP4_ARG3(__lasx_xvshuf_h, sh, a1, a0, sh, a3, a2, sh, b1, b0, sh, b3, b2, + temp0, temp1, temp2, temp3); + DUP4_ARG2(__lasx_xvadd_h, temp0, in0, temp1, in1, temp2, in2, temp3, in3, + in0, in1, in2, in3); + DUP4_ARG2(__lasx_xvpermi_d, in0, 0xD8, in1, 0xD8, in2, 0xD8, in3, 0xD8, + in0, in1, in2, in3); + DUP4_ARG1(__lasx_xvclip255_h, in0, in1, in2, in3, in0, in1, in2, in3); + DUP2_ARG2(__lasx_xvpickev_b, in1, in0, in3, in2, in0, in1); + __lasx_xvstelm_d(in0, dst, 0, 0); + __lasx_xvstelm_d(in0, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(in0, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(in0, dst + dst_stride_3x, 0, 3); + dst += dst_stride_4x; + __lasx_xvstelm_d(in1, dst, 0, 0); + __lasx_xvstelm_d(in1, dst + dst_stride, 0, 2); + __lasx_xvstelm_d(in1, dst + dst_stride_2x, 0, 1); + __lasx_xvstelm_d(in1, dst + dst_stride_3x, 0, 3); +} From patchwork Wed Dec 29 10:18:22 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?b?6ZmI5piK?= X-Patchwork-Id: 32945 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a6b:cd86:0:0:0:0:0 with SMTP id d128csp11995678iog; Wed, 29 Dec 2021 02:19:19 -0800 (PST) X-Google-Smtp-Source: ABdhPJyYsi0QFGKQiedxoUvhIRjVSOB6B3e5jq4g72mKiMLNLF81lOlxKwRQnL5t1ZSAEgJ6R3dZ X-Received: by 2002:a17:906:4a02:: with SMTP id w2mr20703451eju.398.1640773149203; Wed, 29 Dec 2021 02:19:09 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1640773149; cv=none; d=google.com; s=arc-20160816; b=FHE6Vh7D2K/3DblWfRLQ2+/KYGVetdWPN+zJlhkaFWJyZwh3cYn/78oVHJ1Z2N+UQR 8ba3aAdRXreXq3z/L+ummtKo4fys6C6NlgYMkHToDmd6B4yZSglBMw4sl0F0+2PXKOIH JBWLrAWLi5WXDdnAZLq5RZPwGMuybGXQdiCa40HPAI7yNZJY+WRPHMXg4K/r305lqGz+ w+avQWkTHYwF0eGpzqk6ZtIzub7umOTo7JlkbBxFI87+kiubbY7zyqVvrA3DjfMr+5T7 PSK51gSKyeVBZQVHyko3+11QhdmfmuytUudZk+qHBESGKuPpPc7S1Ei4zlDaviBlTI7g Qc7A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:delivered-to; bh=HdVpPGloGWMZppak92vpaLkuE8w/oGqmJrMQDEEz7G0=; b=TA1OGSW85IZtscMiZzR6QDxZgNcS0wUsnuPcKb45ta8da3ZeCko39jUirUwrERkOAo 4Z6lmHy7YjyFBimLpMJR7XikVPn09cxYwieX9j2NpihIW28SYHIzR8ylpVEsb+lFh07R B3kyB4N1y6EGIVOK4f9+hgEwAux0Rc4HnIBC83I0tZREQl4LaZA2F/9/3DxJLWp7rjwZ vui4yhLmnSPz3kmqDNBpYrYpsFokwU9J+s/hHMQtE65UOrajvO0Dld8XCOe1FhswsnAp Ahk5jDj6+W57avri+7GetFVFYJJp8CawyN8fEwrysIEFD038Unlm/iFk2IavGdigHiEe GTiA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id v3si1911858ejf.202.2021.12.29.02.19.08; Wed, 29 Dec 2021 02:19:09 -0800 (PST) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id C4A8C68AEDE; Wed, 29 Dec 2021 12:18:40 +0200 (EET) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from loongson.cn (mail.loongson.cn [114.242.206.163]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 6083C68AEE7 for ; Wed, 29 Dec 2021 12:18:29 +0200 (EET) Received: from localhost (unknown [36.33.26.144]) by mail.loongson.cn (Coremail) with SMTP id AQAAf9DxyZbzNcxh4CgFAA--.10890S3; Wed, 29 Dec 2021 18:18:27 +0800 (CST) From: Hao Chen To: ffmpeg-devel@ffmpeg.org Date: Wed, 29 Dec 2021 18:18:22 +0800 Message-Id: <20211229101822.31956-4-chenhao@loongson.cn> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20211229101822.31956-1-chenhao@loongson.cn> References: <20211229101822.31956-1-chenhao@loongson.cn> MIME-Version: 1.0 X-CM-TRANSID: AQAAf9DxyZbzNcxh4CgFAA--.10890S3 X-Coremail-Antispam: 1UD129KBjvJXoWxXF43Wr4UAF1kur1DXFWxXrb_yoWrGFWrpa y7ur17Jw4kWrZFk397J3s8XF45tF93ury2qF13tw18CrWFvw1fXr92yr9rua4DXa1DAF1S qws3C3W7JF1rXw7anT9S1TB71UUUUUDqnTZGkaVYY2UrUUUUjbIjqfuFe4nvWSU5nxnvy2 9KBjDU0xBIdaVrnRJUUUkIb7Iv0xC_Kw4lb4IE77IF4wAFF20E14v26r1j6r4UM7CY07I2 0VC2zVCF04k26cxKx2IYs7xG6rWj6s0DM7CIcVAFz4kK6r1j6r18M28lY4IEw2IIxxk0rw A2F7IY1VAKz4vEj48ve4kI8wA2z4x0Y4vE2Ix0cI8IcVAFwI0_Xr0_Ar1l84ACjcxK6xII jxv20xvEc7CjxVAFwI0_Gr1j6F4UJwA2z4x0Y4vEx4A2jsIE14v26F4UJVW0owA2z4x0Y4 vEx4A2jsIEc7CjxVAFwI0_GcCE3s1le2I262IYc4CY6c8Ij28IcVAaY2xG8wAqx4xG64xv F2IEw4CE5I8CrVC2j2WlYx0E2Ix0cI8IcVAFwI0_Jw0_WrylYx0Ex4A2jsIE14v26F4j6r 4UJwAm72CE4IkC6x0Yz7v_Jr0_Gr1lF7xvr2IYc2Ij64vIr41lc2xSY4AK67AK6r4DMxAI w28IcxkI7VAKI48JMxC20s026xCaFVCjc4AY6r1j6r4UMI8I3I0E5I8CrVAFwI0_Jr0_Jr 4lx2IqxVCjr7xvwVAFwI0_JrI_JrWlx4CE17CEb7AF67AKxVWUXVWUAwCIc40Y0x0EwIxG rwCI42IY6xIIjxv20xvE14v26r1I6r4UMIIF0xvE2Ix0cI8IcVCY1x0267AKxVW8JVWxJw CI42IY6xAIw20EY4v20xvaj40_Jr0_JF4lIxAIcVC2z280aVAFwI0_Jr0_Gr1lIxAIcVC2 z280aVCY1x0267AKxVW8JVW8JrUvcSsGvfC2KfnxnUUI43ZEXa7IU566zUUUUUU== X-CM-SenderInfo: hfkh0xtdr6z05rqj20fqof0/ Subject: [FFmpeg-devel] [PATCH v3 3/3] avcodec: [loongarch] Optimize prefetch with loongarch. X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: gxw Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: DwTF8DE0BJIc From: gxw ./ffmpeg -i ../1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -y /dev/null -an before:296 after :308 --- libavcodec/loongarch/Makefile | 1 + libavcodec/loongarch/videodsp_init.c | 45 ++++++++++++++++++++++++++++ libavcodec/videodsp.c | 2 ++ libavcodec/videodsp.h | 1 + 4 files changed, 49 insertions(+) create mode 100644 libavcodec/loongarch/videodsp_init.c diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile index c4d71e801b..3c15c2edeb 100644 --- a/libavcodec/loongarch/Makefile +++ b/libavcodec/loongarch/Makefile @@ -7,6 +7,7 @@ OBJS-$(CONFIG_VP9_DECODER) += loongarch/vp9dsp_init_loongarch.o OBJS-$(CONFIG_VC1DSP) += loongarch/vc1dsp_init_loongarch.o OBJS-$(CONFIG_HPELDSP) += loongarch/hpeldsp_init_loongarch.o OBJS-$(CONFIG_IDCTDSP) += loongarch/idctdsp_init_loongarch.o +OBJS-$(CONFIG_VIDEODSP) += loongarch/videodsp_init.o LASX-OBJS-$(CONFIG_H264CHROMA) += loongarch/h264chroma_lasx.o LASX-OBJS-$(CONFIG_H264QPEL) += loongarch/h264qpel_lasx.o LASX-OBJS-$(CONFIG_H264DSP) += loongarch/h264dsp_lasx.o \ diff --git a/libavcodec/loongarch/videodsp_init.c b/libavcodec/loongarch/videodsp_init.c new file mode 100644 index 0000000000..6cbb7763ff --- /dev/null +++ b/libavcodec/loongarch/videodsp_init.c @@ -0,0 +1,45 @@ +/* + * Copyright (c) 2021 Loongson Technology Corporation Limited + * Contributed by Xiwei Gu + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavcodec/videodsp.h" +#include "libavutil/attributes.h" + +static void prefetch_loongarch(uint8_t *mem, ptrdiff_t stride, int h) +{ + register const uint8_t *p = mem; + + __asm__ volatile ( + "1: \n\t" + "preld 0, %[p], 0 \n\t" + "preld 0, %[p], 32 \n\t" + "addi.d %[h], %[h], -1 \n\t" + "add.d %[p], %[p], %[stride] \n\t" + + "blt $r0, %[h], 1b \n\t" + : [p] "+r" (p), [h] "+r" (h) + : [stride] "r" (stride) + ); +} + +av_cold void ff_videodsp_init_loongarch(VideoDSPContext *ctx, int bpc) +{ + ctx->prefetch = prefetch_loongarch; +} diff --git a/libavcodec/videodsp.c b/libavcodec/videodsp.c index ce9e9eb143..212147984f 100644 --- a/libavcodec/videodsp.c +++ b/libavcodec/videodsp.c @@ -54,4 +54,6 @@ av_cold void ff_videodsp_init(VideoDSPContext *ctx, int bpc) ff_videodsp_init_x86(ctx, bpc); if (ARCH_MIPS) ff_videodsp_init_mips(ctx, bpc); + if (ARCH_LOONGARCH64) + ff_videodsp_init_loongarch(ctx, bpc); } diff --git a/libavcodec/videodsp.h b/libavcodec/videodsp.h index c0545f22b0..ac971dc57f 100644 --- a/libavcodec/videodsp.h +++ b/libavcodec/videodsp.h @@ -84,5 +84,6 @@ void ff_videodsp_init_arm(VideoDSPContext *ctx, int bpc); void ff_videodsp_init_ppc(VideoDSPContext *ctx, int bpc); void ff_videodsp_init_x86(VideoDSPContext *ctx, int bpc); void ff_videodsp_init_mips(VideoDSPContext *ctx, int bpc); +void ff_videodsp_init_loongarch(VideoDSPContext *ctx, int bpc); #endif /* AVCODEC_VIDEODSP_H */