From patchwork Wed Dec 27 04:50:13 2023
X-Patchwork-Submitter: 金波 (jinbo)
X-Patchwork-Id: 45340
From: jinbo
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 27 Dec 2023 12:50:13 +0800
Message-Id: <20231227045019.25078-2-jinbo@loongson.cn>
In-Reply-To: <20231227045019.25078-1-jinbo@loongson.cn>
References: <20231227045019.25078-1-jinbo@loongson.cn>
Subject: [FFmpeg-devel] [PATCH v2 1/7] avcodec/hevc: Add init for sao_edge_filter
Cc: jinbo

Forgot to init c->sao_edge_filter[idx] when idx = 0/1/2/3. After this patch,
the speedup of decoding H.265 4K 30FPS 30Mbps on 3A6000 is about 7%
(42 fps ==> 45 fps).
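To illustrate why all five slots matter: FFmpeg fills its DSP context with C fallbacks first, then the per-architecture init selectively overrides entries. The sketch below is a hypothetical miniature (the names `DSPContext`, `sao_edge_c`, `sao_edge_lsx` are stand-ins, not FFmpeg's real types) showing how overriding only index 4 leaves the smaller block sizes on the scalar path:

```c
#include <assert.h>

typedef int (*sao_edge_fn)(void);

static int sao_edge_c(void)   { return 0; }  /* scalar fallback */
static int sao_edge_lsx(void) { return 1; }  /* stands in for ff_hevc_sao_edge_filter_8_lsx */

typedef struct { sao_edge_fn sao_edge_filter[5]; } DSPContext;

/* Generic init: every size bucket points at the C version. */
static void dsp_init_c(DSPContext *c)
{
    for (int i = 0; i < 5; i++)
        c->sao_edge_filter[i] = sao_edge_c;
}

/* Before the patch: only the largest bucket was overridden. */
static void dsp_init_lsx_before(DSPContext *c)
{
    c->sao_edge_filter[4] = sao_edge_lsx;
}

/* After the patch: all five buckets take the SIMD path. */
static void dsp_init_lsx_after(DSPContext *c)
{
    for (int i = 0; i < 5; i++)
        c->sao_edge_filter[i] = sao_edge_lsx;
}
```

With the old init, a lookup for a small block still dispatched to the C routine, which is where the ~7% comes back from.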
Change-Id: I521999b397fa72b931a23c165cf45f276440cdfb
---
 libavcodec/loongarch/hevcdsp_init_loongarch.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 22739c6f5b..5a96f3a4c9 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -167,6 +167,10 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
         c->put_hevc_qpel_uni_w[8][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv48_8_lsx;
         c->put_hevc_qpel_uni_w[9][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv64_8_lsx;
 
+        c->sao_edge_filter[0] = ff_hevc_sao_edge_filter_8_lsx;
+        c->sao_edge_filter[1] = ff_hevc_sao_edge_filter_8_lsx;
+        c->sao_edge_filter[2] = ff_hevc_sao_edge_filter_8_lsx;
+        c->sao_edge_filter[3] = ff_hevc_sao_edge_filter_8_lsx;
         c->sao_edge_filter[4] = ff_hevc_sao_edge_filter_8_lsx;
 
         c->hevc_h_loop_filter_luma = ff_hevc_loop_filter_luma_h_8_lsx;

From patchwork Wed Dec 27 04:50:14 2023
X-Patchwork-Submitter: 金波 (jinbo)
X-Patchwork-Id: 45342
From: jinbo
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 27 Dec 2023 12:50:14 +0800
Message-Id: <20231227045019.25078-3-jinbo@loongson.cn>
In-Reply-To: <20231227045019.25078-1-jinbo@loongson.cn>
References: <20231227045019.25078-1-jinbo@loongson.cn>
Subject: [FFmpeg-devel] [PATCH v2 2/7] avcodec/hevc: Add add_residual_4/8/16/32 asm opt
Cc: jinbo

After this patch, the performance of decoding H.265 4K 30FPS 30Mbps on 3A6000
with 8 threads improves by 2 fps (45 fps --> 47 fps).
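The add_residual routines implement the standard HEVC reconstruction step: add the decoded int16 residual block to the 8-bit prediction and saturate. A scalar C reference of what the LSX code vectorizes (my sketch, not FFmpeg's actual C fallback; `add_residual_c` is a hypothetical name):

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar equivalent of ff_hevc_add_residual{4x4,8x8,16x16,32x32}_8_lsx:
 * dst holds the prediction (row pitch = stride), res the size*size
 * residual stored contiguously. */
static void add_residual_c(uint8_t *dst, const int16_t *res,
                           ptrdiff_t stride, int size)
{
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = clip_uint8(dst[x] + res[x]);
        dst += stride;
        res += size;
    }
}
```

The `vssrani.bu.h` instructions in the assembly below perform exactly this saturating narrow from 16-bit sums back to unsigned bytes.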
---
 libavcodec/loongarch/Makefile                 |   3 +-
 libavcodec/loongarch/hevc_add_res.S           | 162 ++++++++++++++++++
 libavcodec/loongarch/hevcdsp_init_loongarch.c |   5 +
 libavcodec/loongarch/hevcdsp_lsx.h            |   5 +
 4 files changed, 174 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/loongarch/hevc_add_res.S

diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 06cfab5c20..07ea97f803 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -27,7 +27,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER)       += loongarch/hevcdsp_lsx.o \
                                          loongarch/hevc_lpf_sao_lsx.o \
                                          loongarch/hevc_mc_bi_lsx.o \
                                          loongarch/hevc_mc_uni_lsx.o \
-                                         loongarch/hevc_mc_uniw_lsx.o
+                                         loongarch/hevc_mc_uniw_lsx.o \
+                                         loongarch/hevc_add_res.o
 LSX-OBJS-$(CONFIG_H264DSP)            += loongarch/h264idct.o \
                                          loongarch/h264idct_loongarch.o \
                                          loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_add_res.S b/libavcodec/loongarch/hevc_add_res.S
new file mode 100644
index 0000000000..dd2d820af8
--- /dev/null
+++ b/libavcodec/loongarch/hevc_add_res.S
@@ -0,0 +1,162 @@
+/*
+ * Loongson LSX optimized add_residual functions for HEVC decoding
+ *
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+/*
+ * void ff_hevc_add_residual4x4_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+.macro ADD_RES_LSX_4x4_8
+    vldrepl.w vr0, a0, 0
+    add.d t0, a0, a2
+    vldrepl.w vr1, t0, 0
+    vld vr2, a1, 0
+
+    vilvl.w vr1, vr1, vr0
+    vsllwil.hu.bu vr1, vr1, 0
+    vadd.h vr1, vr1, vr2
+    vssrani.bu.h vr1, vr1, 0
+
+    vstelm.w vr1, a0, 0, 0
+    vstelm.w vr1, t0, 0, 1
+.endm
+
+function ff_hevc_add_residual4x4_8_lsx
+    ADD_RES_LSX_4x4_8
+    alsl.d a0, a2, a0, 1
+    addi.d a1, a1, 16
+    ADD_RES_LSX_4x4_8
+endfunc
+
+/*
+ * void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+.macro ADD_RES_LSX_8x8_8
+    vldrepl.d vr0, a0, 0
+    add.d t0, a0, a2
+    vldrepl.d vr1, t0, 0
+    add.d t1, t0, a2
+    vldrepl.d vr2, t1, 0
+    add.d t2, t1, a2
+    vldrepl.d vr3, t2, 0
+
+    vld vr4, a1, 0
+    addi.d t3, zero, 16
+    vldx vr5, a1, t3
+    addi.d t4, a1, 32
+    vld vr6, t4, 0
+    vldx vr7, t4, t3
+
+    vsllwil.hu.bu vr0, vr0, 0
+    vsllwil.hu.bu vr1, vr1, 0
+    vsllwil.hu.bu vr2, vr2, 0
+    vsllwil.hu.bu vr3, vr3, 0
+    vadd.h vr0, vr0, vr4
+    vadd.h vr1, vr1, vr5
+    vadd.h vr2, vr2, vr6
+    vadd.h vr3, vr3, vr7
+    vssrani.bu.h vr1, vr0, 0
+    vssrani.bu.h vr3, vr2, 0
+
+    vstelm.d vr1, a0, 0, 0
+    vstelm.d vr1, t0, 0, 1
+    vstelm.d vr3, t1, 0, 0
+    vstelm.d vr3, t2, 0, 1
+.endm
+
+function ff_hevc_add_residual8x8_8_lsx
+    ADD_RES_LSX_8x8_8
+    alsl.d a0, a2, a0, 2
+    addi.d a1, a1, 64
+    ADD_RES_LSX_8x8_8
+endfunc
+
+/*
+ * void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+function ff_hevc_add_residual16x16_8_lsx
+.rept 8
+    vld vr0, a0, 0
+    vldx vr2, a0, a2
+
+    vld vr4, a1, 0
+    addi.d t0, zero, 16
+    vldx vr5, a1, t0
+    addi.d t1, a1, 32
+    vld vr6, t1, 0
+    vldx vr7, t1, t0
+
+    vexth.hu.bu vr1, vr0
+    vsllwil.hu.bu vr0, vr0, 0
+    vexth.hu.bu vr3, vr2
+    vsllwil.hu.bu vr2, vr2, 0
+    vadd.h vr0, vr0, vr4
+    vadd.h vr1, vr1, vr5
+    vadd.h vr2, vr2, vr6
+    vadd.h vr3, vr3, vr7
+
+    vssrani.bu.h vr1, vr0, 0
+    vssrani.bu.h vr3, vr2, 0
+
+    vst vr1, a0, 0
+    vstx vr3, a0, a2
+
+    alsl.d a0, a2, a0, 1
+    addi.d a1, a1, 64
+.endr
+endfunc
+
+/*
+ * void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride)
+ */
+function ff_hevc_add_residual32x32_8_lsx
+.rept 32
+    vld vr0, a0, 0
+    addi.w t0, zero, 16
+    vldx vr2, a0, t0
+
+    vld vr4, a1, 0
+    vldx vr5, a1, t0
+    addi.d t1, a1, 32
+    vld vr6, t1, 0
+    vldx vr7, t1, t0
+
+    vexth.hu.bu vr1, vr0
+    vsllwil.hu.bu vr0, vr0, 0
+    vexth.hu.bu vr3, vr2
+    vsllwil.hu.bu vr2, vr2, 0
+    vadd.h vr0, vr0, vr4
+    vadd.h vr1, vr1, vr5
+    vadd.h vr2, vr2, vr6
+    vadd.h vr3, vr3, vr7
+
+    vssrani.bu.h vr1, vr0, 0
+    vssrani.bu.h vr3, vr2, 0
+
+    vst vr1, a0, 0
+    vstx vr3, a0, t0
+
+    add.d a0, a0, a2
+    addi.d a1, a1, 64
+.endr
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index 5a96f3a4c9..a8f753dc86 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -189,6 +189,11 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
         c->idct[1] = ff_hevc_idct_8x8_lsx;
         c->idct[2] = ff_hevc_idct_16x16_lsx;
         c->idct[3] = ff_hevc_idct_32x32_lsx;
+
+        c->add_residual[0] = ff_hevc_add_residual4x4_8_lsx;
+        c->add_residual[1] = ff_hevc_add_residual8x8_8_lsx;
+        c->add_residual[2] = ff_hevc_add_residual16x16_8_lsx;
+        c->add_residual[3] = ff_hevc_add_residual32x32_8_lsx;
         }
     }
 }
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 0d54196caf..ac509984fd 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -227,4 +227,9 @@ void ff_hevc_idct_8x8_lsx(int16_t *coeffs, int col_limit);
 void ff_hevc_idct_16x16_lsx(int16_t *coeffs, int col_limit);
 void ff_hevc_idct_32x32_lsx(int16_t *coeffs, int col_limit);
+void ff_hevc_add_residual4x4_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride);
+
 #endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H

From patchwork Wed Dec 27 04:50:15 2023
X-Patchwork-Submitter: 金波 (jinbo)
X-Patchwork-Id: 45341
From: jinbo
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 27 Dec 2023 12:50:15 +0800
Message-Id: <20231227045019.25078-4-jinbo@loongson.cn>
In-Reply-To: <20231227045019.25078-1-jinbo@loongson.cn>
References: <20231227045019.25078-1-jinbo@loongson.cn>
Subject: [FFmpeg-devel] [PATCH v2 3/7] avcodec/hevc: Add pel_uni_w_pixels4/6/8/12/16/24/32/48/64 asm opt
Cc: jinbo

tests/checkasm/checkasm:             C      LSX    LASX
put_hevc_pel_uni_w_pixels4_8_c:      2.7    1.0
put_hevc_pel_uni_w_pixels6_8_c:      6.2    2.0    1.5
put_hevc_pel_uni_w_pixels8_8_c:      10.7   2.5    1.7
put_hevc_pel_uni_w_pixels12_8_c:     23.0   5.5    5.0
put_hevc_pel_uni_w_pixels16_8_c:     41.0   8.2    5.0
put_hevc_pel_uni_w_pixels24_8_c:     91.0   19.7   13.2
put_hevc_pel_uni_w_pixels32_8_c:     161.7  32.5   16.2
put_hevc_pel_uni_w_pixels48_8_c:     354.5  73.7   43.0
put_hevc_pel_uni_w_pixels64_8_c:     641.5  130.0  64.2

The speedup of decoding H.265 4K 30FPS 30Mbps on 3A6000 with 8 threads
is 1 fps (47 fps --> 48 fps).
---
 libavcodec/loongarch/Makefile                 |   3 +-
 libavcodec/loongarch/hevc_mc.S                | 471 ++++++++++++++++++
 libavcodec/loongarch/hevcdsp_init_loongarch.c |  43 ++
 libavcodec/loongarch/hevcdsp_lasx.h           |  53 ++
 libavcodec/loongarch/hevcdsp_lsx.h            |  27 +
 5 files changed, 596 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/loongarch/hevc_mc.S
 create mode 100644 libavcodec/loongarch/hevcdsp_lasx.h

diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index 07ea97f803..ad98cd4054 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -28,7 +28,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER)       += loongarch/hevcdsp_lsx.o \
                                          loongarch/hevc_mc_bi_lsx.o \
                                          loongarch/hevc_mc_uni_lsx.o \
                                          loongarch/hevc_mc_uniw_lsx.o \
-                                         loongarch/hevc_add_res.o
+                                         loongarch/hevc_add_res.o \
+                                         loongarch/hevc_mc.o
 LSX-OBJS-$(CONFIG_H264DSP)            += loongarch/h264idct.o \
                                          loongarch/h264idct_loongarch.o \
                                          loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
new file mode 100644
index 0000000000..c5d553effe
--- /dev/null
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -0,0 +1,471 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by jinbo
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+.macro LOAD_VAR bit
+    addi.w t1, a5, 6          //shift
+    addi.w t3, zero, 1        //one
+    sub.w t4, t1, t3
+    sll.w t3, t3, t4          //offset
+.if \bit == 128
+    vreplgr2vr.w vr1, a6      //wx
+    vreplgr2vr.w vr2, t3      //offset
+    vreplgr2vr.w vr3, t1      //shift
+    vreplgr2vr.w vr4, a7      //ox
+.else
+    xvreplgr2vr.w xr1, a6
+    xvreplgr2vr.w xr2, t3
+    xvreplgr2vr.w xr3, t1
+    xvreplgr2vr.w xr4, a7
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS8_LSX src0, dst0, w
+    vldrepl.d vr0, \src0, 0
+    vsllwil.hu.bu vr0, vr0, 0
+    vexth.wu.hu vr5, vr0
+    vsllwil.wu.hu vr0, vr0, 0
+    vslli.w vr0, vr0, 6
+    vslli.w vr5, vr5, 6
+    vmul.w vr0, vr0, vr1
+    vmul.w vr5, vr5, vr1
+    vadd.w vr0, vr0, vr2
+    vadd.w vr5, vr5, vr2
+    vsra.w vr0, vr0, vr3
+    vsra.w vr5, vr5, vr3
+    vadd.w vr0, vr0, vr4
+    vadd.w vr5, vr5, vr4
+    vssrani.h.w vr5, vr0, 0
+    vssrani.bu.h vr5, vr5, 0
+.if \w == 6
+    fst.s f5, \dst0, 0
+    vstelm.h vr5, \dst0, 4, 2
+.else
+    fst.d f5, \dst0, 0
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS8x2_LASX src0, dst0, w
+    vldrepl.d vr0, \src0, 0
+    add.d t2, \src0, a3
+    vldrepl.d vr5, t2, 0
+    xvpermi.q xr0, xr5, 0x02
+    xvsllwil.hu.bu xr0, xr0, 0
+    xvexth.wu.hu xr5, xr0
+    xvsllwil.wu.hu xr0, xr0, 0
+    xvslli.w xr0, xr0, 6
+    xvslli.w xr5, xr5, 6
+    xvmul.w xr0, xr0, xr1
+    xvmul.w xr5, xr5, xr1
+    xvadd.w xr0, xr0, xr2
+    xvadd.w xr5, xr5, xr2
+    xvsra.w xr0, xr0, xr3
+    xvsra.w xr5, xr5, xr3
+    xvadd.w xr0, xr0, xr4
+    xvadd.w xr5, xr5, xr4
+    xvssrani.h.w xr5, xr0, 0
+    xvpermi.q xr0, xr5, 0x01
+    xvssrani.bu.h xr0, xr5, 0
+    add.d t3, \dst0, a1
+.if \w == 6
+    vstelm.w vr0, \dst0, 0, 0
+    vstelm.h vr0, \dst0, 4, 2
+    vstelm.w vr0, t3, 0, 2
+    vstelm.h vr0, t3, 4, 6
+.else
+    vstelm.d vr0, \dst0, 0, 0
+    vstelm.d vr0, t3, 0, 1
+.endif
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS16_LSX src0, dst0
+    vld vr0, \src0, 0
+    vexth.hu.bu vr7, vr0
+    vexth.wu.hu vr8, vr7
+    vsllwil.wu.hu vr7, vr7, 0
+    vsllwil.hu.bu vr5, vr0, 0
+    vexth.wu.hu vr6, vr5
+    vsllwil.wu.hu vr5, vr5, 0
+    vslli.w vr5, vr5, 6
+    vslli.w vr6, vr6, 6
+    vslli.w vr7, vr7, 6
+    vslli.w vr8, vr8, 6
+    vmul.w vr5, vr5, vr1
+    vmul.w vr6, vr6, vr1
+    vmul.w vr7, vr7, vr1
+    vmul.w vr8, vr8, vr1
+    vadd.w vr5, vr5, vr2
+    vadd.w vr6, vr6, vr2
+    vadd.w vr7, vr7, vr2
+    vadd.w vr8, vr8, vr2
+    vsra.w vr5, vr5, vr3
+    vsra.w vr6, vr6, vr3
+    vsra.w vr7, vr7, vr3
+    vsra.w vr8, vr8, vr3
+    vadd.w vr5, vr5, vr4
+    vadd.w vr6, vr6, vr4
+    vadd.w vr7, vr7, vr4
+    vadd.w vr8, vr8, vr4
+    vssrani.h.w vr6, vr5, 0
+    vssrani.h.w vr8, vr7, 0
+    vssrani.bu.h vr8, vr6, 0
+    vst vr8, \dst0, 0
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS16_LASX src0, dst0
+    vld vr0, \src0, 0
+    xvpermi.d xr0, xr0, 0xd8
+    xvsllwil.hu.bu xr0, xr0, 0
+    xvexth.wu.hu xr6, xr0
+    xvsllwil.wu.hu xr5, xr0, 0
+    xvslli.w xr5, xr5, 6
+    xvslli.w xr6, xr6, 6
+    xvmul.w xr5, xr5, xr1
+    xvmul.w xr6, xr6, xr1
+    xvadd.w xr5, xr5, xr2
+    xvadd.w xr6, xr6, xr2
+    xvsra.w xr5, xr5, xr3
+    xvsra.w xr6, xr6, xr3
+    xvadd.w xr5, xr5, xr4
+    xvadd.w xr6, xr6, xr4
+    xvssrani.h.w xr6, xr5, 0
+    xvpermi.q xr7, xr6, 0x01
+    xvssrani.bu.h xr7, xr6, 0
+    vst vr7, \dst0, 0
+.endm
+
+.macro HEVC_PEL_UNI_W_PIXELS32_LASX src0, dst0, w
+.if \w == 16
+    vld vr0, \src0, 0
+    add.d t2, \src0, a3
+    vld vr5, t2, 0
+    xvpermi.q xr0, xr5, 0x02
+.else //w=24/32
+    xvld xr0, \src0, 0
+.endif
+    xvexth.hu.bu xr7, xr0
+    xvexth.wu.hu xr8, xr7
+    xvsllwil.wu.hu xr7, xr7, 0
+    xvsllwil.hu.bu xr5, xr0, 0
+    xvexth.wu.hu xr6, xr5
+    xvsllwil.wu.hu xr5, xr5, 0
+    xvslli.w xr5, xr5, 6
+    xvslli.w xr6, xr6, 6
+    xvslli.w xr7, xr7, 6
+    xvslli.w xr8, xr8, 6
+    xvmul.w xr5, xr5, xr1
+    xvmul.w xr6, xr6, xr1
+    xvmul.w xr7, xr7, xr1
+    xvmul.w xr8, xr8, xr1
+    xvadd.w xr5, xr5, xr2
+    xvadd.w xr6, xr6, xr2
+    xvadd.w xr7, xr7, xr2
+    xvadd.w xr8, xr8, xr2
+    xvsra.w xr5, xr5, xr3
+    xvsra.w xr6, xr6, xr3
+    xvsra.w xr7, xr7, xr3
+    xvsra.w xr8, xr8, xr3
+    xvadd.w xr5, xr5, xr4
+    xvadd.w xr6, xr6, xr4
+    xvadd.w xr7, xr7, xr4
+    xvadd.w xr8, xr8, xr4
+    xvssrani.h.w xr6, xr5, 0
+    xvssrani.h.w xr8, xr7, 0
+    xvssrani.bu.h xr8, xr6, 0
+.if \w == 16
+    vst vr8, \dst0, 0
+    add.d t2, \dst0, a1
+    xvpermi.q xr8, xr8, 0x01
+    vst vr8, t2, 0
+.elseif \w == 24
+    vst vr8, \dst0, 0
+    xvstelm.d xr8, \dst0, 16, 2
+.else
+    xvst xr8, \dst0, 0
+.endif
+.endm
+
+function ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx
+    LOAD_VAR 128
+    srli.w t0, a4, 1
+.LOOP_PIXELS4:
+    vldrepl.w vr0, a2, 0
+    add.d t1, a2, a3
+    vldrepl.w vr5, t1, 0
+    vsllwil.hu.bu vr0, vr0, 0
+    vsllwil.wu.hu vr0, vr0, 0
+    vsllwil.hu.bu vr5, vr5, 0
+    vsllwil.wu.hu vr5, vr5, 0
+    vslli.w vr0, vr0, 6
+    vslli.w vr5, vr5, 6
+    vmul.w vr0, vr0, vr1
+    vmul.w vr5, vr5, vr1
+    vadd.w vr0, vr0, vr2
+    vadd.w vr5, vr5, vr2
+    vsra.w vr0, vr0, vr3
+    vsra.w vr5, vr5, vr3
+    vadd.w vr0, vr0, vr4
+    vadd.w vr5, vr5, vr4
+    vssrani.h.w vr5, vr0, 0
+    vssrani.bu.h vr5, vr5, 0
+    fst.s f5, a0, 0
+    add.d t2, a0, a1
+    vstelm.w vr5, t2, 0, 1
+    alsl.d a2, a3, a2, 1
+    alsl.d a0, a1, a0, 1
+    addi.w t0, t0, -1
+    bnez t0, .LOOP_PIXELS4
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx
+    LOAD_VAR 128
+.LOOP_PIXELS6:
+    HEVC_PEL_UNI_W_PIXELS8_LSX a2, a0, 6
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS6
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx
+    LOAD_VAR 256
+    srli.w t0, a4, 1
+.LOOP_PIXELS6_LASX:
+    HEVC_PEL_UNI_W_PIXELS8x2_LASX a2, a0, 6
+    alsl.d a2, a3, a2, 1
+    alsl.d a0, a1, a0, 1
+    addi.w t0, t0, -1
+    bnez t0, .LOOP_PIXELS6_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx
+    LOAD_VAR 128
+.LOOP_PIXELS8:
+    HEVC_PEL_UNI_W_PIXELS8_LSX a2, a0, 8
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS8
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx
+    LOAD_VAR 256
+    srli.w t0, a4, 1
+.LOOP_PIXELS8_LASX:
+    HEVC_PEL_UNI_W_PIXELS8x2_LASX a2, a0, 8
+    alsl.d a2, a3, a2, 1
+    alsl.d a0, a1, a0, 1
+    addi.w t0, t0, -1
+    bnez t0, .LOOP_PIXELS8_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx
+    LOAD_VAR 128
+.LOOP_PIXELS12:
+    vld vr0, a2, 0
+    vexth.hu.bu vr7, vr0
+    vsllwil.wu.hu vr7, vr7, 0
+    vsllwil.hu.bu vr5, vr0, 0
+    vexth.wu.hu vr6, vr5
+    vsllwil.wu.hu vr5, vr5, 0
+    vslli.w vr5, vr5, 6
+    vslli.w vr6, vr6, 6
+    vslli.w vr7, vr7, 6
+    vmul.w vr5, vr5, vr1
+    vmul.w vr6, vr6, vr1
+    vmul.w vr7, vr7, vr1
+    vadd.w vr5, vr5, vr2
+    vadd.w vr6, vr6, vr2
+    vadd.w vr7, vr7, vr2
+    vsra.w vr5, vr5, vr3
+    vsra.w vr6, vr6, vr3
+    vsra.w vr7, vr7, vr3
+    vadd.w vr5, vr5, vr4
+    vadd.w vr6, vr6, vr4
+    vadd.w vr7, vr7, vr4
+    vssrani.h.w vr6, vr5, 0
+    vssrani.h.w vr7, vr7, 0
+    vssrani.bu.h vr7, vr6, 0
+    fst.d f7, a0, 0
+    vstelm.w vr7, a0, 8, 2
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS12
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx
+    LOAD_VAR 256
+.LOOP_PIXELS12_LASX:
+    vld vr0, a2, 0
+    xvpermi.d xr0, xr0, 0xd8
+    xvsllwil.hu.bu xr0, xr0, 0
+    xvexth.wu.hu xr6, xr0
+    xvsllwil.wu.hu xr5, xr0, 0
+    xvslli.w xr5, xr5, 6
+    xvslli.w xr6, xr6, 6
+    xvmul.w xr5, xr5, xr1
+    xvmul.w xr6, xr6, xr1
+    xvadd.w xr5, xr5, xr2
+    xvadd.w xr6, xr6, xr2
+    xvsra.w xr5, xr5, xr3
+    xvsra.w xr6, xr6, xr3
+    xvadd.w xr5, xr5, xr4
+    xvadd.w xr6, xr6, xr4
+    xvssrani.h.w xr6, xr5, 0
+    xvpermi.q xr7, xr6, 0x01
+    xvssrani.bu.h xr7, xr6, 0
+    fst.d f7, a0, 0
+    vstelm.w vr7, a0, 8, 2
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS12_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx
+    LOAD_VAR 128
+.LOOP_PIXELS16:
+    HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS16
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx
+    LOAD_VAR 256
+    srli.w t0, a4, 1
+.LOOP_PIXELS16_LASX:
+    HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 16
+    alsl.d a2, a3, a2, 1
+    alsl.d a0, a1, a0, 1
+    addi.w t0, t0, -1
+    bnez t0, .LOOP_PIXELS16_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx
+    LOAD_VAR 128
+.LOOP_PIXELS24:
+    HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+    addi.d t0, a2, 16
+    addi.d t1, a0, 16
+    HEVC_PEL_UNI_W_PIXELS8_LSX t0, t1, 8
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS24
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx
+    LOAD_VAR 256
+.LOOP_PIXELS24_LASX:
+    HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 24
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS24_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx
+    LOAD_VAR 128
+.LOOP_PIXELS32:
+    HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+    addi.d t0, a2, 16
+    addi.d t1, a0, 16
+    HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS32
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx
+    LOAD_VAR 256
+.LOOP_PIXELS32_LASX:
+    HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS32_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx
+    LOAD_VAR 128
+.LOOP_PIXELS48:
+    HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+    addi.d t0, a2, 16
+    addi.d t1, a0, 16
+    HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+    addi.d t0, a2, 32
+    addi.d t1, a0, 32
+    HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS48
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx
+    LOAD_VAR 256
+.LOOP_PIXELS48_LASX:
+    HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+    addi.d t0, a2, 32
+    addi.d t1, a0, 32
+    HEVC_PEL_UNI_W_PIXELS16_LASX t0, t1
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS48_LASX
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx
+    LOAD_VAR 128
+.LOOP_PIXELS64:
+    HEVC_PEL_UNI_W_PIXELS16_LSX a2, a0
+    addi.d t0, a2, 16
+    addi.d t1, a0, 16
+    HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+    addi.d t0, a2, 32
+    addi.d t1, a0, 32
+    HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+    addi.d t0, a2, 48
+    addi.d t1, a0, 48
+    HEVC_PEL_UNI_W_PIXELS16_LSX t0, t1
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS64
+endfunc
+
+function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx
+    LOAD_VAR 256
+.LOOP_PIXELS64_LASX:
+    HEVC_PEL_UNI_W_PIXELS32_LASX a2, a0, 32
+    addi.d t0, a2, 32
+    addi.d t1, a0, 32
+    HEVC_PEL_UNI_W_PIXELS32_LASX t0, t1, 32
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.w a4, a4, -1
+    bnez a4, .LOOP_PIXELS64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index a8f753dc86..d0ee99d6b5 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -22,6 +22,7 @@
 #include "libavutil/loongarch/cpu.h"
 #include "hevcdsp_lsx.h"
+#include "hevcdsp_lasx.h"
 
 void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
 {
@@ -160,6 +161,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
         c->put_hevc_epel_uni[6][1][1] = ff_hevc_put_hevc_uni_epel_hv24_8_lsx;
         c->put_hevc_epel_uni[7][1][1] = ff_hevc_put_hevc_uni_epel_hv32_8_lsx;
 
+        c->put_hevc_qpel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx;
+        c->put_hevc_qpel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx;
+        c->put_hevc_qpel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx;
+        c->put_hevc_qpel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx;
+        c->put_hevc_qpel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx;
+        c->put_hevc_qpel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx;
+        c->put_hevc_qpel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx;
c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx; + c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx; + + c->put_hevc_epel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx; + c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx; + c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx; + c->put_hevc_epel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lsx; + c->put_hevc_epel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lsx; + c->put_hevc_epel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lsx; + c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lsx; + c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx; + c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx; + c->put_hevc_qpel_uni_w[3][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv8_8_lsx; c->put_hevc_qpel_uni_w[5][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv16_8_lsx; c->put_hevc_qpel_uni_w[6][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv24_8_lsx; @@ -196,4 +217,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->add_residual[3] = ff_hevc_add_residual32x32_8_lsx; } } + + if (have_lasx(cpu_flags)) { + if (bit_depth == 8) { + c->put_hevc_qpel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx; + c->put_hevc_qpel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx; + c->put_hevc_qpel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx; + c->put_hevc_qpel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx; + c->put_hevc_qpel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx; + c->put_hevc_qpel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx; + c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx; + c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx; + + c->put_hevc_epel_uni_w[2][0][0] = 
ff_hevc_put_hevc_pel_uni_w_pixels6_8_lasx; + c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lasx; + c->put_hevc_epel_uni_w[4][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels12_8_lasx; + c->put_hevc_epel_uni_w[5][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels16_8_lasx; + c->put_hevc_epel_uni_w[6][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels24_8_lasx; + c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx; + c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx; + c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx; + } + } } diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h new file mode 100644 index 0000000000..819c3c3ecf --- /dev/null +++ b/libavcodec/loongarch/hevcdsp_lasx.h @@ -0,0 +1,53 @@ +/* + * Copyright (c) 2023 Loongson Technology Corporation Limited + * Contributed by jinbo + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H +#define AVCODEC_LOONGARCH_HEVCDSP_LASX_H + +#include "libavcodec/hevcdsp.h" + +#define PEL_UNI_W(PEL, DIR, WIDTH) \ +void ff_hevc_put_hevc_##PEL##_uni_w_##DIR##WIDTH##_8_lasx(uint8_t *dst, \ + ptrdiff_t \ + dst_stride, \ + const uint8_t *src, \ + ptrdiff_t \ + src_stride, \ + int height, \ + int denom, \ + int wx, \ + int ox, \ + intptr_t mx, \ + intptr_t my, \ + int width) + +PEL_UNI_W(pel, pixels, 6); +PEL_UNI_W(pel, pixels, 8); +PEL_UNI_W(pel, pixels, 12); +PEL_UNI_W(pel, pixels, 16); +PEL_UNI_W(pel, pixels, 24); +PEL_UNI_W(pel, pixels, 32); +PEL_UNI_W(pel, pixels, 48); +PEL_UNI_W(pel, pixels, 64); + +#undef PEL_UNI_W + +#endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h index ac509984fd..0d724a90ef 100644 --- a/libavcodec/loongarch/hevcdsp_lsx.h +++ b/libavcodec/loongarch/hevcdsp_lsx.h @@ -232,4 +232,31 @@ void ff_hevc_add_residual8x8_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t s void ff_hevc_add_residual16x16_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride); void ff_hevc_add_residual32x32_8_lsx(uint8_t *dst, const int16_t *res, ptrdiff_t stride); +#define PEL_UNI_W(PEL, DIR, WIDTH) \ +void ff_hevc_put_hevc_##PEL##_uni_w_##DIR##WIDTH##_8_lsx(uint8_t *dst, \ + ptrdiff_t \ + dst_stride, \ + const uint8_t *src, \ + ptrdiff_t \ + src_stride, \ + int height, \ + int denom, \ + int wx, \ + int ox, \ + intptr_t mx, \ + intptr_t my, \ + int width) + +PEL_UNI_W(pel, pixels, 4); +PEL_UNI_W(pel, pixels, 6); +PEL_UNI_W(pel, pixels, 8); +PEL_UNI_W(pel, pixels, 12); +PEL_UNI_W(pel, pixels, 16); +PEL_UNI_W(pel, pixels, 24); +PEL_UNI_W(pel, pixels, 32); +PEL_UNI_W(pel, pixels, 48); +PEL_UNI_W(pel, pixels, 64); + 
+#undef PEL_UNI_W + #endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H
From patchwork Wed Dec 27 04:50:16 2023
X-Patchwork-Submitter: 金波 (jinbo)
X-Patchwork-Id: 45343
From: jinbo
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 27 Dec 2023 12:50:16 +0800
Message-Id: <20231227045019.25078-5-jinbo@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20231227045019.25078-1-jinbo@loongson.cn>
References: <20231227045019.25078-1-jinbo@loongson.cn>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH v2 4/7] avcodec/hevc: Add qpel_uni_w_v|h4/6/8/12/16/24/32/48/64 asm opt
Cc: jinbo

tests/checkasm/checkasm:           C     LSX    LASX
put_hevc_qpel_uni_w_h4_8_c:       6.5     1.7     1.2
put_hevc_qpel_uni_w_h6_8_c:      14.5     4.5     3.7
put_hevc_qpel_uni_w_h8_8_c:      24.5     5.7     4.5
put_hevc_qpel_uni_w_h12_8_c:     54.7    17.5    12.0
put_hevc_qpel_uni_w_h16_8_c:     96.5    22.7    13.2
put_hevc_qpel_uni_w_h24_8_c:    216.0    51.2    33.2
put_hevc_qpel_uni_w_h32_8_c:    385.7    87.0    53.2
put_hevc_qpel_uni_w_h48_8_c:    860.5   192.0   113.2
put_hevc_qpel_uni_w_h64_8_c:   1531.0   334.2   200.0
put_hevc_qpel_uni_w_v4_8_c:       8.0     1.7
put_hevc_qpel_uni_w_v6_8_c:      17.2     4.5
put_hevc_qpel_uni_w_v8_8_c:      29.5     6.0     5.2
put_hevc_qpel_uni_w_v12_8_c: 65.2 16.0 11.7 put_hevc_qpel_uni_w_v16_8_c: 116.5 20.5 14.0 put_hevc_qpel_uni_w_v24_8_c: 259.2 48.5 37.2 put_hevc_qpel_uni_w_v32_8_c: 459.5 80.5 56.0 put_hevc_qpel_uni_w_v48_8_c: 1028.5 180.2 126.5 put_hevc_qpel_uni_w_v64_8_c: 1831.2 319.2 224.2 Speedup of decoding H265 4K 30FPS 30Mbps on 3A6000 with 8 threads is 4fps(48fps-->52fps). Change-Id: I1178848541d90083869225ba98a02e6aa8bb8c5a --- libavcodec/loongarch/hevc_mc.S | 1294 +++++++++++++++++ libavcodec/loongarch/hevcdsp_init_loongarch.c | 38 + libavcodec/loongarch/hevcdsp_lasx.h | 18 + libavcodec/loongarch/hevcdsp_lsx.h | 20 + 4 files changed, 1370 insertions(+) diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S index c5d553effe..2ee338fb8e 100644 --- a/libavcodec/loongarch/hevc_mc.S +++ b/libavcodec/loongarch/hevc_mc.S @@ -21,6 +21,8 @@ #include "loongson_asm.S" +.extern ff_hevc_qpel_filters + .macro LOAD_VAR bit addi.w t1, a5, 6 //shift addi.w t3, zero, 1 //one @@ -469,3 +471,1295 @@ function ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx addi.w a4, a4, -1 bnez a4, .LOOP_PIXELS64_LASX endfunc + +.macro vhaddw.d.h in0 + vhaddw.w.h \in0, \in0, \in0 + vhaddw.d.w \in0, \in0, \in0 +.endm + +.macro xvhaddw.d.h in0 + xvhaddw.w.h \in0, \in0, \in0 + xvhaddw.d.w \in0, \in0, \in0 +.endm + +function ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 8 //my + addi.d t0, t0, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr5, t1, t0 //filter + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + add.d t2, t1, a3 //stride * 4 + sub.d a2, a2, t1 //src -= stride*3 + fld.s f6, a2, 0 //0 + fldx.s f7, a2, a3 //1 + fldx.s f8, a2, t0 //2 + add.d a2, a2, t1 + fld.s f9, a2, 0 //3 + fldx.s f10, a2, a3 //4 + fldx.s f11, a2, t0 //5 + fldx.s f12, a2, t1 //6 + add.d a2, a2, t2 + vilvl.b vr6, vr7, vr6 + vilvl.b vr7, vr9, vr8 + vilvl.b vr8, vr11, vr10 + vilvl.b vr9, vr13, vr12 + vilvl.h vr6, vr7, vr6 + vilvl.h vr7, vr9, vr8 + vilvl.w vr8, vr7, 
vr6 + vilvh.w vr9, vr7, vr6 +.LOOP_V4: + fld.s f13, a2, 0 //7 + fldx.s f14, a2, a3 //8 next loop + add.d a2, a2, t0 + vextrins.b vr8, vr13, 0x70 + vextrins.b vr8, vr13, 0xf1 + vextrins.b vr9, vr13, 0x72 + vextrins.b vr9, vr13, 0xf3 + vbsrl.v vr10, vr8, 1 + vbsrl.v vr11, vr9, 1 + vextrins.b vr10, vr14, 0x70 + vextrins.b vr10, vr14, 0xf1 + vextrins.b vr11, vr14, 0x72 + vextrins.b vr11, vr14, 0xf3 + vdp2.h.bu.b vr6, vr8, vr5 //QPEL_FILTER(src, stride) + vdp2.h.bu.b vr7, vr9, vr5 + vdp2.h.bu.b vr12, vr10, vr5 + vdp2.h.bu.b vr13, vr11, vr5 + vbsrl.v vr8, vr10, 1 + vbsrl.v vr9, vr11, 1 + vhaddw.d.h vr6 + vhaddw.d.h vr7 + vhaddw.d.h vr12 + vhaddw.d.h vr13 + vpickev.w vr6, vr7, vr6 + vpickev.w vr12, vr13, vr12 + vmulwev.w.h vr6, vr6, vr1 //QPEL_FILTER(src, stride) * wx + vmulwev.w.h vr12, vr12, vr1 + vadd.w vr6, vr6, vr2 + vsra.w vr6, vr6, vr3 + vadd.w vr6, vr6, vr4 + vadd.w vr12, vr12, vr2 + vsra.w vr12, vr12, vr3 + vadd.w vr12, vr12, vr4 + vssrani.h.w vr12, vr6, 0 + vssrani.bu.h vr12, vr12, 0 + fst.s f12, a0, 0 + add.d a0, a0, a1 + vstelm.w vr12, a0, 0, 1 + add.d a0, a0, a1 + addi.d a4, a4, -2 + bnez a4, .LOOP_V4 +endfunc + +function ff_hevc_put_hevc_qpel_uni_w_v6_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 8 //my + addi.d t0, t0, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr5, t1, t0 //filter + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + add.d t2, t1, a3 //stride * 4 + sub.d a2, a2, t1 //src -= stride*3 + fld.d f6, a2, 0 + fldx.d f7, a2, a3 + fldx.d f8, a2, t0 + add.d a2, a2, t1 + fld.d f9, a2, 0 + fldx.d f10, a2, a3 + fldx.d f11, a2, t0 + fldx.d f12, a2, t1 + add.d a2, a2, t2 + vilvl.b vr6, vr7, vr6 //transpose 8x6 to 3x16 + vilvl.b vr7, vr9, vr8 + vilvl.b vr8, vr11, vr10 + vilvl.b vr9, vr13, vr12 + vilvl.h vr10, vr7, vr6 + vilvh.h vr11, vr7, vr6 + vilvl.h vr12, vr9, vr8 + vilvh.h vr13, vr9, vr8 + vilvl.w vr6, vr12, vr10 + vilvh.w vr7, vr12, vr10 + vilvl.w vr8, vr13, vr11 +.LOOP_V6: + fld.d f13, a2, 0 + add.d a2, a2, a3 + vextrins.b 
vr6, vr13, 0x70 + vextrins.b vr6, vr13, 0xf1 + vextrins.b vr7, vr13, 0x72 + vextrins.b vr7, vr13, 0xf3 + vextrins.b vr8, vr13, 0x74 + vextrins.b vr8, vr13, 0xf5 + vdp2.h.bu.b vr10, vr6, vr5 //QPEL_FILTER(src, stride) + vdp2.h.bu.b vr11, vr7, vr5 + vdp2.h.bu.b vr12, vr8, vr5 + vbsrl.v vr6, vr6, 1 + vbsrl.v vr7, vr7, 1 + vbsrl.v vr8, vr8, 1 + vhaddw.d.h vr10 + vhaddw.d.h vr11 + vhaddw.d.h vr12 + vpickev.w vr10, vr11, vr10 + vpickev.w vr11, vr13, vr12 + vmulwev.w.h vr10, vr10, vr1 //QPEL_FILTER(src, stride) * wx + vmulwev.w.h vr11, vr11, vr1 + vadd.w vr10, vr10, vr2 + vadd.w vr11, vr11, vr2 + vsra.w vr10, vr10, vr3 + vsra.w vr11, vr11, vr3 + vadd.w vr10, vr10, vr4 + vadd.w vr11, vr11, vr4 + vssrani.h.w vr11, vr10, 0 + vssrani.bu.h vr11, vr11, 0 + fst.s f11, a0, 0 + vstelm.h vr11, a0, 4, 2 + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_V6 +endfunc + +// transpose 8x8b to 4x16b +.macro TRANSPOSE8X8B_LSX in0, in1, in2, in3, in4, in5, in6, in7, \ + out0, out1, out2, out3 + vilvl.b \in0, \in1, \in0 + vilvl.b \in1, \in3, \in2 + vilvl.b \in2, \in5, \in4 + vilvl.b \in3, \in7, \in6 + vilvl.h \in4, \in1, \in0 + vilvh.h \in5, \in1, \in0 + vilvl.h \in6, \in3, \in2 + vilvh.h \in7, \in3, \in2 + vilvl.w \out0, \in6, \in4 + vilvh.w \out1, \in6, \in4 + vilvl.w \out2, \in7, \in5 + vilvh.w \out3, \in7, \in5 +.endm + +.macro PUT_HEVC_QPEL_UNI_W_V8_LSX in0, in1, in2, in3, out0, out1, pos +.if \pos == 0 + vextrins.b \in0, vr13, 0x70 //insert the 8th load + vextrins.b \in0, vr13, 0xf1 + vextrins.b \in1, vr13, 0x72 + vextrins.b \in1, vr13, 0xf3 + vextrins.b \in2, vr13, 0x74 + vextrins.b \in2, vr13, 0xf5 + vextrins.b \in3, vr13, 0x76 + vextrins.b \in3, vr13, 0xf7 +.else// \pos == 8 + vextrins.b \in0, vr13, 0x78 + vextrins.b \in0, vr13, 0xf9 + vextrins.b \in1, vr13, 0x7a + vextrins.b \in1, vr13, 0xfb + vextrins.b \in2, vr13, 0x7c + vextrins.b \in2, vr13, 0xfd + vextrins.b \in3, vr13, 0x7e + vextrins.b \in3, vr13, 0xff +.endif + vdp2.h.bu.b \out0, \in0, vr5 //QPEL_FILTER(src, stride) + 
vdp2.h.bu.b \out1, \in1, vr5 + vdp2.h.bu.b vr12, \in2, vr5 + vdp2.h.bu.b vr20, \in3, vr5 + vbsrl.v \in0, \in0, 1 //Back up previous 7 loaded datas, + vbsrl.v \in1, \in1, 1 //so just need to insert the 8th + vbsrl.v \in2, \in2, 1 //load in the next loop. + vbsrl.v \in3, \in3, 1 + vhaddw.d.h \out0 + vhaddw.d.h \out1 + vhaddw.d.h vr12 + vhaddw.d.h vr20 + vpickev.w \out0, \out1, \out0 + vpickev.w \out1, vr20, vr12 + vmulwev.w.h \out0, \out0, vr1 //QPEL_FILTER(src, stride) * wx + vmulwev.w.h \out1, \out1, vr1 + vadd.w \out0, \out0, vr2 + vadd.w \out1, \out1, vr2 + vsra.w \out0, \out0, vr3 + vsra.w \out1, \out1, vr3 + vadd.w \out0, \out0, vr4 + vadd.w \out1, \out1, vr4 +.endm + +function ff_hevc_put_hevc_qpel_uni_w_v8_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 8 //my + addi.d t0, t0, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr5, t1, t0 //filter + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + add.d t2, t1, a3 //stride * 4 + sub.d a2, a2, t1 //src -= stride*3 + fld.d f6, a2, 0 + fldx.d f7, a2, a3 + fldx.d f8, a2, t0 + add.d a2, a2, t1 + fld.d f9, a2, 0 + fldx.d f10, a2, a3 + fldx.d f11, a2, t0 + fldx.d f12, a2, t1 + add.d a2, a2, t2 + TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \ + vr6, vr7, vr8, vr9 +.LOOP_V8: + fld.d f13, a2, 0 //the 8th load + add.d a2, a2, a3 + PUT_HEVC_QPEL_UNI_W_V8_LSX vr6, vr7, vr8, vr9, vr10, vr11, 0 + vssrani.h.w vr11, vr10, 0 + vssrani.bu.h vr11, vr11, 0 + fst.d f11, a0, 0 + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_V8 +endfunc + +.macro PUT_HEVC_UNI_W_V8_LASX w + fld.d f6, a2, 0 + fldx.d f7, a2, a3 + fldx.d f8, a2, t0 + add.d a2, a2, t1 + fld.d f9, a2, 0 + fldx.d f10, a2, a3 + fldx.d f11, a2, t0 + fldx.d f12, a2, t1 + add.d a2, a2, t2 + TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \ + vr6, vr7, vr8, vr9 + xvpermi.q xr6, xr7, 0x02 + xvpermi.q xr8, xr9, 0x02 +.LOOP_V8_LASX_\w: + fld.d f13, a2, 0 // 0 1 2 3 4 5 6 7 the 8th load + add.d a2, a2, a3 + vshuf4i.h vr13, 
vr13, 0xd8 + vbsrl.v vr14, vr13, 4 + xvpermi.q xr13, xr14, 0x02 //0 1 4 5 * * * * 2 3 6 7 * * * * + xvextrins.b xr6, xr13, 0x70 //begin to insert the 8th load + xvextrins.b xr6, xr13, 0xf1 + xvextrins.b xr8, xr13, 0x72 + xvextrins.b xr8, xr13, 0xf3 + xvdp2.h.bu.b xr20, xr6, xr5 //QPEL_FILTER(src, stride) + xvdp2.h.bu.b xr21, xr8, xr5 + xvbsrl.v xr6, xr6, 1 + xvbsrl.v xr8, xr8, 1 + xvhaddw.d.h xr20 + xvhaddw.d.h xr21 + xvpickev.w xr20, xr21, xr20 + xvpermi.d xr20, xr20, 0xd8 + xvmulwev.w.h xr20, xr20, xr1 //QPEL_FILTER(src, stride) * wx + xvadd.w xr20, xr20, xr2 + xvsra.w xr20, xr20, xr3 + xvadd.w xr10, xr20, xr4 + xvpermi.q xr11, xr10, 0x01 + vssrani.h.w vr11, vr10, 0 + vssrani.bu.h vr11, vr11, 0 + fst.d f11, a0, 0 + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_V8_LASX_\w +.endm + +function ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 8 //my + addi.d t0, t0, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr5, t1, t0 //filter + xvreplve0.q xr5, xr5 + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + add.d t2, t1, a3 //stride * 4 + sub.d a2, a2, t1 //src -= stride*3 + PUT_HEVC_UNI_W_V8_LASX 8 +endfunc + +.macro PUT_HEVC_QPEL_UNI_W_V16_LSX w + vld vr6, a2, 0 + vldx vr7, a2, a3 + vldx vr8, a2, t0 + add.d a2, a2, t1 + vld vr9, a2, 0 + vldx vr10, a2, a3 + vldx vr11, a2, t0 + vldx vr12, a2, t1 + add.d a2, a2, t2 +.if \w > 8 + vilvh.d vr14, vr14, vr6 + vilvh.d vr15, vr15, vr7 + vilvh.d vr16, vr16, vr8 + vilvh.d vr17, vr17, vr9 + vilvh.d vr18, vr18, vr10 + vilvh.d vr19, vr19, vr11 + vilvh.d vr20, vr20, vr12 +.endif + TRANSPOSE8X8B_LSX vr6, vr7, vr8, vr9, vr10, vr11, vr12, vr13, \ + vr6, vr7, vr8, vr9 +.if \w > 8 + TRANSPOSE8X8B_LSX vr14, vr15, vr16, vr17, vr18, vr19, vr20, vr21, \ + vr14, vr15, vr16, vr17 +.endif +.LOOP_HORI_16_\w: + vld vr13, a2, 0 + add.d a2, a2, a3 + PUT_HEVC_QPEL_UNI_W_V8_LSX vr6, vr7, vr8, vr9, vr10, vr11, 0 +.if \w > 8 + PUT_HEVC_QPEL_UNI_W_V8_LSX vr14, vr15, vr16, vr17, vr18, vr19, 8 
+.endif + vssrani.h.w vr11, vr10, 0 +.if \w > 8 + vssrani.h.w vr19, vr18, 0 + vssrani.bu.h vr19, vr11, 0 +.else + vssrani.bu.h vr11, vr11, 0 +.endif +.if \w == 8 + fst.d f11, a0, 0 +.elseif \w == 12 + fst.d f19, a0, 0 + vstelm.w vr19, a0, 8, 2 +.else + vst vr19, a0, 0 +.endif + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_HORI_16_\w +.endm + +function ff_hevc_put_hevc_qpel_uni_w_v16_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 8 //my + addi.d t0, t0, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr5, t1, t0 //filter + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + add.d t2, t1, a3 //stride * 4 + sub.d a2, a2, t1 //src -= stride*3 + PUT_HEVC_QPEL_UNI_W_V16_LSX 16 +endfunc + +.macro PUT_HEVC_QPEL_UNI_W_V16_LASX w + vld vr6, a2, 0 + vldx vr7, a2, a3 + vldx vr8, a2, t0 + add.d a2, a2, t1 + vld vr9, a2, 0 + vldx vr10, a2, a3 + vldx vr11, a2, t0 + vldx vr12, a2, t1 + add.d a2, a2, t2 + xvpermi.q xr6, xr10, 0x02 //pack and transpose the 8x16 to 4x32 begin + xvpermi.q xr7, xr11, 0x02 + xvpermi.q xr8, xr12, 0x02 + xvpermi.q xr9, xr13, 0x02 + xvilvl.b xr14, xr7, xr6 //0 2 + xvilvh.b xr15, xr7, xr6 //1 3 + xvilvl.b xr16, xr9, xr8 //0 2 + xvilvh.b xr17, xr9, xr8 //1 3 + xvpermi.d xr14, xr14, 0xd8 + xvpermi.d xr15, xr15, 0xd8 + xvpermi.d xr16, xr16, 0xd8 + xvpermi.d xr17, xr17, 0xd8 + xvilvl.h xr6, xr16, xr14 + xvilvh.h xr7, xr16, xr14 + xvilvl.h xr8, xr17, xr15 + xvilvh.h xr9, xr17, xr15 + xvilvl.w xr14, xr7, xr6 //0 1 4 5 + xvilvh.w xr15, xr7, xr6 //2 3 6 7 + xvilvl.w xr16, xr9, xr8 //8 9 12 13 + xvilvh.w xr17, xr9, xr8 //10 11 14 15 end +.LOOP_HORI_16_LASX_\w: + vld vr13, a2, 0 //the 8th load + add.d a2, a2, a3 + vshuf4i.w vr13, vr13, 0xd8 + vbsrl.v vr12, vr13, 8 + xvpermi.q xr13, xr12, 0x02 + xvextrins.b xr14, xr13, 0x70 //inset the 8th load + xvextrins.b xr14, xr13, 0xf1 + xvextrins.b xr15, xr13, 0x72 + xvextrins.b xr15, xr13, 0xf3 + xvextrins.b xr16, xr13, 0x74 + xvextrins.b xr16, xr13, 0xf5 + xvextrins.b xr17, xr13, 0x76 + xvextrins.b 
xr17, xr13, 0xf7 + xvdp2.h.bu.b xr6, xr14, xr5 //QPEL_FILTER(src, stride) + xvdp2.h.bu.b xr7, xr15, xr5 + xvdp2.h.bu.b xr8, xr16, xr5 + xvdp2.h.bu.b xr9, xr17, xr5 + xvhaddw.d.h xr6 + xvhaddw.d.h xr7 + xvhaddw.d.h xr8 + xvhaddw.d.h xr9 + xvbsrl.v xr14, xr14, 1 //Back up previous 7 loaded datas, + xvbsrl.v xr15, xr15, 1 //so just need to insert the 8th + xvbsrl.v xr16, xr16, 1 //load in next loop. + xvbsrl.v xr17, xr17, 1 + xvpickev.w xr6, xr7, xr6 //0 1 2 3 4 5 6 7 + xvpickev.w xr7, xr9, xr8 //8 9 10 11 12 13 14 15 + xvmulwev.w.h xr6, xr6, xr1 //QPEL_FILTER(src, stride) * wx + xvmulwev.w.h xr7, xr7, xr1 + xvadd.w xr6, xr6, xr2 + xvadd.w xr7, xr7, xr2 + xvsra.w xr6, xr6, xr3 + xvsra.w xr7, xr7, xr3 + xvadd.w xr6, xr6, xr4 + xvadd.w xr7, xr7, xr4 + xvssrani.h.w xr7, xr6, 0 //0 1 2 3 8 9 10 11 4 5 6 7 12 13 14 15 + xvpermi.q xr6, xr7, 0x01 + vssrani.bu.h vr6, vr7, 0 + vshuf4i.w vr6, vr6, 0xd8 +.if \w == 12 + fst.d f6, a0, 0 + vstelm.w vr6, a0, 8, 2 +.else + vst vr6, a0, 0 +.endif + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_HORI_16_LASX_\w +.endm + +function ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 8 //my + addi.d t0, t0, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr5, t1, t0 //filter + xvreplve0.q xr5, xr5 + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + add.d t2, t1, a3 //stride * 4 + sub.d a2, a2, t1 //src -= stride*3 + PUT_HEVC_QPEL_UNI_W_V16_LASX 16 +endfunc + +function ff_hevc_put_hevc_qpel_uni_w_v12_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 8 //my + addi.d t0, t0, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr5, t1, t0 //filter + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + add.d t2, t1, a3 //stride * 4 + sub.d a2, a2, t1 //src -= stride*3 + PUT_HEVC_QPEL_UNI_W_V16_LSX 12 +endfunc + +function ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 8 //my + addi.d t0, t0, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx 
vr5, t1, t0 //filter
+    xvreplve0.q   xr5, xr5
+    slli.d        t0, a3, 1        //stride * 2
+    add.d         t1, t0, a3       //stride * 3
+    add.d         t2, t1, a3       //stride * 4
+    sub.d         a2, a2, t1       //src -= stride*3
+    PUT_HEVC_QPEL_UNI_W_V16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v24_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 8        //my
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    slli.d        t0, a3, 1        //stride * 2
+    add.d         t1, t0, a3       //stride * 3
+    add.d         t2, t1, a3       //stride * 4
+    sub.d         a2, a2, t1       //src -= stride*3
+    addi.d        t4, a0, 0        //save dst
+    addi.d        t5, a2, 0        //save src
+    addi.d        t6, a4, 0
+    PUT_HEVC_QPEL_UNI_W_V16_LSX 24
+    addi.d        a0, t4, 16
+    addi.d        a2, t5, 16
+    addi.d        a4, t6, 0
+    PUT_HEVC_QPEL_UNI_W_V16_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v24_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 8        //my
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    slli.d        t0, a3, 1        //stride * 2
+    add.d         t1, t0, a3       //stride * 3
+    add.d         t2, t1, a3       //stride * 4
+    sub.d         a2, a2, t1       //src -= stride*3
+    addi.d        t4, a0, 0        //save dst
+    addi.d        t5, a2, 0        //save src
+    addi.d        t6, a4, 0
+    PUT_HEVC_QPEL_UNI_W_V16_LASX 24
+    addi.d        a0, t4, 16
+    addi.d        a2, t5, 16
+    addi.d        a4, t6, 0
+    PUT_HEVC_UNI_W_V8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v32_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 8        //my
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    slli.d        t0, a3, 1        //stride * 2
+    add.d         t1, t0, a3       //stride * 3
+    add.d         t2, t1, a3       //stride * 4
+    sub.d         a2, a2, t1       //src -= stride*3
+    addi.d        t3, zero, 2
+    addi.d        t4, a0, 0        //save dst
+    addi.d        t5, a2, 0        //save src
+    addi.d        t6, a4, 0
+.LOOP_V32:
+    PUT_HEVC_QPEL_UNI_W_V16_LSX 32
+    addi.d        t3, t3, -1
+    addi.d        a0, t4, 16
+    addi.d        a2, t5, 16
+    addi.d        a4, t6, 0
+    bnez          t3, .LOOP_V32
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v32_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 8        //my
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    slli.d        t0, a3, 1        //stride * 2
+    add.d         t1, t0, a3       //stride * 3
+    add.d         t2, t1, a3       //stride * 4
+    sub.d         a2, a2, t1       //src -= stride*3
+    addi.d        t3, zero, 2
+    addi.d        t4, a0, 0        //save dst
+    addi.d        t5, a2, 0        //save src
+    addi.d        t6, a4, 0
+.LOOP_V32_LASX:
+    PUT_HEVC_QPEL_UNI_W_V16_LASX 32
+    addi.d        t3, t3, -1
+    addi.d        a0, t4, 16
+    addi.d        a2, t5, 16
+    addi.d        a4, t6, 0
+    bnez          t3, .LOOP_V32_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v48_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 8        //my
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    slli.d        t0, a3, 1        //stride * 2
+    add.d         t1, t0, a3       //stride * 3
+    add.d         t2, t1, a3       //stride * 4
+    sub.d         a2, a2, t1       //src -= stride*3
+    addi.d        t3, zero, 3
+    addi.d        t4, a0, 0        //save dst
+    addi.d        t5, a2, 0        //save src
+    addi.d        t6, a4, 0
+.LOOP_V48:
+    PUT_HEVC_QPEL_UNI_W_V16_LSX 48
+    addi.d        t3, t3, -1
+    addi.d        a0, t4, 16
+    addi.d        t4, t4, 16
+    addi.d        a2, t5, 16
+    addi.d        t5, t5, 16
+    addi.d        a4, t6, 0
+    bnez          t3, .LOOP_V48
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 8        //my
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    slli.d        t0, a3, 1        //stride * 2
+    add.d         t1, t0, a3       //stride * 3
+    add.d         t2, t1, a3       //stride * 4
+    sub.d         a2, a2, t1       //src -= stride*3
+    addi.d        t3, zero, 3
+    addi.d        t4, a0, 0        //save dst
+    addi.d        t5, a2, 0        //save src
+    addi.d        t6, a4, 0
+.LOOP_V48_LASX:
+    PUT_HEVC_QPEL_UNI_W_V16_LASX 48
+    addi.d        t3, t3, -1
+    addi.d        a0, t4, 16
+    addi.d        t4, t4, 16
+    addi.d        a2, t5, 16
+    addi.d        t5, t5, 16
+    addi.d        a4, t6, 0
+    bnez          t3, .LOOP_V48_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v64_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 8        //my
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    slli.d        t0, a3, 1        //stride * 2
+    add.d         t1, t0, a3       //stride * 3
+    add.d         t2, t1, a3       //stride * 4
+    sub.d         a2, a2, t1       //src -= stride*3
+    addi.d        t3, zero, 4
+    addi.d        t4, a0, 0        //save dst
+    addi.d        t5, a2, 0        //save src
+    addi.d        t6, a4, 0
+.LOOP_V64:
+    PUT_HEVC_QPEL_UNI_W_V16_LSX 64
+    addi.d        t3, t3, -1
+    addi.d        a0, t4, 16
+    addi.d        t4, t4, 16
+    addi.d        a2, t5, 16
+    addi.d        t5, t5, 16
+    addi.d        a4, t6, 0
+    bnez          t3, .LOOP_V64
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 8        //my
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    slli.d        t0, a3, 1        //stride * 2
+    add.d         t1, t0, a3       //stride * 3
+    add.d         t2, t1, a3       //stride * 4
+    sub.d         a2, a2, t1       //src -= stride*3
+    addi.d        t3, zero, 4
+    addi.d        t4, a0, 0        //save dst
+    addi.d        t5, a2, 0        //save src
+    addi.d        t6, a4, 0
+.LOOP_V64_LASX:
+    PUT_HEVC_QPEL_UNI_W_V16_LASX 64
+    addi.d        t3, t3, -1
+    addi.d        a0, t4, 16
+    addi.d        t4, t4, 16
+    addi.d        a2, t5, 16
+    addi.d        t5, t5, 16
+    addi.d        a4, t6, 0
+    bnez          t3, .LOOP_V64_LASX
+endfunc
+
+.macro PUT_HEVC_QPEL_UNI_W_H8_LSX in0, out0, out1
+    vbsrl.v       vr7, \in0, 1
+    vbsrl.v       vr8, \in0, 2
+    vbsrl.v       vr9, \in0, 3
+    vbsrl.v       vr10, \in0, 4
+    vbsrl.v       vr11, \in0, 5
+    vbsrl.v       vr12, \in0, 6
+    vbsrl.v       vr13, \in0, 7
+    vilvl.d       vr6, vr7, \in0
+    vilvl.d       vr7, vr9, vr8
+    vilvl.d       vr8, vr11, vr10
+    vilvl.d       vr9, vr13, vr12
+    vdp2.h.bu.b   vr10, vr6, vr5
+    vdp2.h.bu.b   vr11, vr7, vr5
+    vdp2.h.bu.b   vr12, vr8, vr5
+    vdp2.h.bu.b   vr13, vr9, vr5
+    vhaddw.d.h    vr10
+    vhaddw.d.h    vr11
+    vhaddw.d.h    vr12
+    vhaddw.d.h    vr13
+    vpickev.w     vr10, vr11, vr10
+    vpickev.w     vr11, vr13, vr12
+    vmulwev.w.h   vr10, vr10, vr1
+    vmulwev.w.h   vr11, vr11, vr1
+    vadd.w        vr10, vr10, vr2
+    vadd.w        vr11, vr11, vr2
+    vsra.w        vr10, vr10, vr3
+    vsra.w        vr11, vr11, vr3
+    vadd.w        \out0, vr10, vr4
+    vadd.w        \out1, vr11, vr4
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_H8_LASX in0, out0
+    xvbsrl.v      xr7, \in0, 4
+    xvpermi.q     xr7, \in0, 0x20
+    xvbsrl.v      xr8, xr7, 1
+    xvbsrl.v      xr9, xr7, 2
+    xvbsrl.v      xr10, xr7, 3
+    xvpackev.d    xr7, xr8, xr7
+    xvpackev.d    xr8, xr10, xr9
+    xvdp2.h.bu.b  xr10, xr7, xr5
+    xvdp2.h.bu.b  xr11, xr8, xr5
+    xvhaddw.d.h   xr10
+    xvhaddw.d.h   xr11
+    xvpickev.w    xr10, xr11, xr10
+    xvmulwev.w.h  xr10, xr10, xr1
+    xvadd.w       xr10, xr10, xr2
+    xvsra.w       xr10, xr10, xr3
+    xvadd.w       \out0, xr10, xr4
+.endm
+
+.macro PUT_HEVC_QPEL_UNI_W_H16_LASX in0, out0
+    xvpermi.d     xr6, \in0, 0x94
+    xvbsrl.v      xr7, xr6, 1
+    xvbsrl.v      xr8, xr6, 2
+    xvbsrl.v      xr9, xr6, 3
+    xvbsrl.v      xr10, xr6, 4
+    xvbsrl.v      xr11, xr6, 5
+    xvbsrl.v      xr12, xr6, 6
+    xvbsrl.v      xr13, xr6, 7
+    xvpackev.d    xr6, xr7, xr6
+    xvpackev.d    xr7, xr9, xr8
+    xvpackev.d    xr8, xr11, xr10
+    xvpackev.d    xr9, xr13, xr12
+    xvdp2.h.bu.b  xr10, xr6, xr5
+    xvdp2.h.bu.b  xr11, xr7, xr5
+    xvdp2.h.bu.b  xr12, xr8, xr5
+    xvdp2.h.bu.b  xr13, xr9, xr5
+    xvhaddw.d.h   xr10
+    xvhaddw.d.h   xr11
+    xvhaddw.d.h   xr12
+    xvhaddw.d.h   xr13
+    xvpickev.w    xr10, xr11, xr10
+    xvpickev.w    xr11, xr13, xr12
+    xvmulwev.w.h  xr10, xr10, xr1
+    xvmulwev.w.h  xr11, xr11, xr1
+    xvadd.w       xr10, xr10, xr2
+    xvadd.w       xr11, xr11, xr2
+    xvsra.w       xr10, xr10, xr3
+    xvsra.w       xr11, xr11, xr3
+    xvadd.w       xr10, xr10, xr4
+    xvadd.w       xr11, xr11, xr4
+    xvssrani.h.w  xr11, xr10, 0
+    xvpermi.q     \out0, xr11, 0x01
+    xvssrani.bu.h \out0, xr11, 0
+.endm
+
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H4:
+    vld           vr18, a2, 0
+    vldx          vr19, a2, a3
+    alsl.d        a2, a3, a2, 1
+    vbsrl.v       vr6, vr18, 1
+    vbsrl.v       vr7, vr18, 2
+    vbsrl.v       vr8, vr18, 3
+    vbsrl.v       vr9, vr19, 1
+    vbsrl.v       vr10, vr19, 2
+    vbsrl.v       vr11, vr19, 3
+    vilvl.d       vr6, vr6, vr18
+    vilvl.d       vr7, vr8, vr7
+    vilvl.d       vr8, vr9, vr19
+    vilvl.d       vr9, vr11, vr10
+    vdp2.h.bu.b   vr10, vr6, vr5
+    vdp2.h.bu.b   vr11, vr7, vr5
+    vdp2.h.bu.b   vr12, vr8, vr5
+    vdp2.h.bu.b   vr13, vr9, vr5
+    vhaddw.d.h    vr10
+    vhaddw.d.h    vr11
+    vhaddw.d.h    vr12
+    vhaddw.d.h    vr13
+    vpickev.w     vr10, vr11, vr10
+    vpickev.w     vr11, vr13, vr12
+    vmulwev.w.h   vr10, vr10, vr1
+    vmulwev.w.h   vr11, vr11, vr1
+    vadd.w        vr10, vr10, vr2
+    vadd.w        vr11, vr11, vr2
+    vsra.w        vr10, vr10, vr3
+    vsra.w        vr11, vr11, vr3
+    vadd.w        vr10, vr10, vr4
+    vadd.w        vr11, vr11, vr4
+    vssrani.h.w   vr11, vr10, 0
+    vssrani.bu.h  vr11, vr11, 0
+    fst.s         f11, a0, 0
+    vbsrl.v       vr11, vr11, 4
+    fstx.s        f11, a0, a1
+    alsl.d        a0, a1, a0, 1
+    addi.d        a4, a4, -2
+    bnez          a4, .LOOP_H4
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H4_LASX:
+    vld           vr18, a2, 0
+    vldx          vr19, a2, a3
+    alsl.d        a2, a3, a2, 1
+    xvpermi.q     xr18, xr19, 0x02
+    xvbsrl.v      xr6, xr18, 1
+    xvbsrl.v      xr7, xr18, 2
+    xvbsrl.v      xr8, xr18, 3
+    xvpackev.d    xr6, xr6, xr18
+    xvpackev.d    xr7, xr8, xr7
+    xvdp2.h.bu.b  xr10, xr6, xr5
+    xvdp2.h.bu.b  xr11, xr7, xr5
+    xvhaddw.d.h   xr10
+    xvhaddw.d.h   xr11
+    xvpickev.w    xr10, xr11, xr10
+    xvmulwev.w.h  xr10, xr10, xr1
+    xvadd.w       xr10, xr10, xr2
+    xvsra.w       xr10, xr10, xr3
+    xvadd.w       xr10, xr10, xr4
+    xvpermi.q     xr11, xr10, 0x01
+    vssrani.h.w   vr11, vr10, 0
+    vssrani.bu.h  vr11, vr11, 0
+    fst.s         f11, a0, 0
+    vbsrl.v       vr11, vr11, 4
+    fstx.s        f11, a0, a1
+    alsl.d        a0, a1, a0, 1
+    addi.d        a4, a4, -2
+    bnez          a4, .LOOP_H4_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H6:
+    vld           vr6, a2, 0
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr10, vr11
+    vssrani.h.w   vr11, vr10, 0
+    vssrani.bu.h  vr11, vr11, 0
+    fst.s         f11, a0, 0
+    vstelm.h      vr11, a0, 4, 2
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H6
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H6_LASX:
+    vld           vr6, a2, 0
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H8_LASX xr6, xr10
+    xvpermi.q     xr11, xr10, 0x01
+    vssrani.h.w   vr11, vr10, 0
+    vssrani.bu.h  vr11, vr11, 0
+    fst.s         f11, a0, 0
+    vstelm.h      vr11, a0, 4, 2
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H6_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H8:
+    vld           vr6, a2, 0
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr10, vr11
+    vssrani.h.w   vr11, vr10, 0
+    vssrani.bu.h  vr11, vr11, 0
+    fst.d         f11, a0, 0
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H8
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H8_LASX:
+    vld           vr6, a2, 0
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H8_LASX xr6, xr10
+    xvpermi.q     xr11, xr10, 0x01
+    vssrani.h.w   vr11, vr10, 0
+    vssrani.bu.h  vr11, vr11, 0
+    fst.d         f11, a0, 0
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H8_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H12:
+    vld           vr6, a2, 0
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr14, vr15
+    vld           vr6, a2, 8
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr16, vr17
+    add.d         a2, a2, a3
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    fst.d         f17, a0, 0
+    vbsrl.v       vr17, vr17, 8
+    fst.s         f17, a0, 8
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H12
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H12_LASX:
+    xvld          xr6, a2, 0
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr6, xr14
+    fst.d         f14, a0, 0
+    vstelm.w      vr14, a0, 8, 2
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H16:
+    vld           vr6, a2, 0
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr14, vr15
+    vld           vr6, a2, 8
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr6, vr16, vr17
+    add.d         a2, a2, a3
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 0
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H16
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H16_LASX:
+    xvld          xr6, a2, 0
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr6, xr10
+    vst           vr10, a0, 0
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H24:
+    vld           vr18, a2, 0
+    vld           vr19, a2, 16
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+    vshuf4i.d     vr18, vr19, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 0
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.bu.h  vr15, vr15, 0
+    fst.d         f15, a0, 16
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H24
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H24_LASX:
+    xvld          xr18, a2, 0
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+    xvpermi.q     xr19, xr18, 0x01
+    vst           vr20, a0, 0
+    PUT_HEVC_QPEL_UNI_W_H8_LASX xr19, xr20
+    xvpermi.q     xr21, xr20, 0x01
+    vssrani.h.w   vr21, vr20, 0
+    vssrani.bu.h  vr21, vr21, 0
+    fst.d         f21, a0, 16
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H32:
+    vld           vr18, a2, 0
+    vld           vr19, a2, 16
+    vld           vr20, a2, 32
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+    vshuf4i.d     vr18, vr19, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 0
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+    vshuf4i.d     vr19, vr20, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 16
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H32
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H32_LASX:
+    xvld          xr18, a2, 0
+    xvld          xr19, a2, 16
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr21
+    xvpermi.q     xr20, xr21, 0x02
+    xvst          xr20, a0, 0
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h48_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H48:
+    vld           vr18, a2, 0
+    vld           vr19, a2, 16
+    vld           vr20, a2, 32
+    vld           vr21, a2, 48
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+    vshuf4i.d     vr18, vr19, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 0
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+    vshuf4i.d     vr19, vr20, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 16
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr14, vr15
+    vshuf4i.d     vr20, vr21, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 32
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H48
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H48_LASX:
+    xvld          xr18, a2, 0
+    xvld          xr19, a2, 32
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr20
+    xvpermi.q     xr18, xr19, 0x03
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr21
+    xvpermi.q     xr20, xr21, 0x02
+    xvst          xr20, a0, 0
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr20
+    vst           vr20, a0, 32
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h64_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H64:
+    vld           vr18, a2, 0
+    vld           vr19, a2, 16
+    vld           vr20, a2, 32
+    vld           vr21, a2, 48
+    vld           vr22, a2, 64
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr14, vr15
+    vshuf4i.d     vr18, vr19, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr18, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 0
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr14, vr15
+    vshuf4i.d     vr19, vr20, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr19, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 16
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr14, vr15
+    vshuf4i.d     vr20, vr21, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr20, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 32
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr21, vr14, vr15
+    vshuf4i.d     vr21, vr22, 0x09
+    PUT_HEVC_QPEL_UNI_W_H8_LSX vr21, vr16, vr17
+    vssrani.h.w   vr15, vr14, 0
+    vssrani.h.w   vr17, vr16, 0
+    vssrani.bu.h  vr17, vr15, 0
+    vst           vr17, a0, 48
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H64
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
+    LOAD_VAR 256
+    ld.d          t0, sp, 0        //mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 4
+    la.local      t1, ff_hevc_qpel_filters
+    vldx          vr5, t1, t0      //filter
+    xvreplve0.q   xr5, xr5
+    addi.d        a2, a2, -3       //src -= 3
+.LOOP_H64_LASX:
+    xvld          xr18, a2, 0
+    xvld          xr19, a2, 32
+    xvld          xr20, a2, 64
+    add.d         a2, a2, a3
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr21
+    xvpermi.q     xr18, xr19, 0x03
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr18, xr22
+    xvpermi.q     xr21, xr22, 0x02
+    xvst          xr21, a0, 0
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr21
+    xvpermi.q     xr19, xr20, 0x03
+    PUT_HEVC_QPEL_UNI_W_H16_LASX xr19, xr22
+    xvpermi.q     xr21, xr22, 0x02
+    xvst          xr21, a0, 32
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_H64_LASX
+endfunc
diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c
index d0ee99d6b5..3cdb3fb2d7 100644
--- a/libavcodec/loongarch/hevcdsp_init_loongarch.c
+++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c
@@ -188,6 +188,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
         c->put_hevc_qpel_uni_w[8][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv48_8_lsx;
         c->put_hevc_qpel_uni_w[9][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv64_8_lsx;

+        c->put_hevc_qpel_uni_w[1][1][0] = ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx;
+        c->put_hevc_qpel_uni_w[2][1][0] = ff_hevc_put_hevc_qpel_uni_w_v6_8_lsx;
+        c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lsx;
+        c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lsx;
+        c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lsx;
+        c->put_hevc_qpel_uni_w[6][1][0] = ff_hevc_put_hevc_qpel_uni_w_v24_8_lsx;
+        c->put_hevc_qpel_uni_w[7][1][0] = ff_hevc_put_hevc_qpel_uni_w_v32_8_lsx;
+        c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lsx;
+        c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lsx;
+
+        c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx;
+        c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lsx;
+        c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lsx;
+        c->put_hevc_qpel_uni_w[4][0][1] = ff_hevc_put_hevc_qpel_uni_w_h12_8_lsx;
+        c->put_hevc_qpel_uni_w[5][0][1] = ff_hevc_put_hevc_qpel_uni_w_h16_8_lsx;
+        c->put_hevc_qpel_uni_w[6][0][1] = ff_hevc_put_hevc_qpel_uni_w_h24_8_lsx;
+        c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lsx;
+        c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lsx;
+        c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lsx;
+
         c->sao_edge_filter[0] = ff_hevc_sao_edge_filter_8_lsx;
         c->sao_edge_filter[1] = ff_hevc_sao_edge_filter_8_lsx;
         c->sao_edge_filter[2] = ff_hevc_sao_edge_filter_8_lsx;
@@ -237,6 +257,24 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth)
         c->put_hevc_epel_uni_w[7][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels32_8_lasx;
         c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx;
         c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx;
+
+        c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx;
+        c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx;
+        c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx;
+        c->put_hevc_qpel_uni_w[6][1][0] = ff_hevc_put_hevc_qpel_uni_w_v24_8_lasx;
+        c->put_hevc_qpel_uni_w[7][1][0] = ff_hevc_put_hevc_qpel_uni_w_v32_8_lasx;
+        c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx;
+        c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx;
+
+        c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx;
+        c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx;
+        c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx;
+        c->put_hevc_qpel_uni_w[4][0][1] = ff_hevc_put_hevc_qpel_uni_w_h12_8_lasx;
+        c->put_hevc_qpel_uni_w[5][0][1] = ff_hevc_put_hevc_qpel_uni_w_h16_8_lasx;
+        c->put_hevc_qpel_uni_w[6][0][1] = ff_hevc_put_hevc_qpel_uni_w_h24_8_lasx;
+        c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx;
+        c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx;
+        c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx;
         }
     }
 }
diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h
index 819c3c3ecf..8a9266d375 100644
--- a/libavcodec/loongarch/hevcdsp_lasx.h
+++ b/libavcodec/loongarch/hevcdsp_lasx.h
@@ -48,6 +48,24 @@ PEL_UNI_W(pel, pixels, 32);
 PEL_UNI_W(pel, pixels, 48);
 PEL_UNI_W(pel, pixels, 64);

+PEL_UNI_W(qpel, v, 8);
+PEL_UNI_W(qpel, v, 12);
+PEL_UNI_W(qpel, v, 16);
+PEL_UNI_W(qpel, v, 24);
+PEL_UNI_W(qpel, v, 32);
+PEL_UNI_W(qpel, v, 48);
+PEL_UNI_W(qpel, v, 64);
+
+PEL_UNI_W(qpel, h, 4);
+PEL_UNI_W(qpel, h, 6);
+PEL_UNI_W(qpel, h, 8);
+PEL_UNI_W(qpel, h, 12);
+PEL_UNI_W(qpel, h, 16);
+PEL_UNI_W(qpel, h, 24);
+PEL_UNI_W(qpel, h, 32);
+PEL_UNI_W(qpel, h, 48);
+PEL_UNI_W(qpel, h, 64);
+
 #undef PEL_UNI_W

 #endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 0d724a90ef..3291294ed9 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -257,6 +257,26 @@ PEL_UNI_W(pel, pixels, 32);
 PEL_UNI_W(pel, pixels, 48);
 PEL_UNI_W(pel, pixels, 64);

+PEL_UNI_W(qpel, v, 4);
+PEL_UNI_W(qpel, v, 6);
+PEL_UNI_W(qpel, v, 8);
+PEL_UNI_W(qpel, v, 12);
+PEL_UNI_W(qpel, v, 16);
+PEL_UNI_W(qpel, v, 24);
+PEL_UNI_W(qpel, v, 32);
+PEL_UNI_W(qpel, v, 48);
+PEL_UNI_W(qpel, v, 64);
+
+PEL_UNI_W(qpel, h, 4);
+PEL_UNI_W(qpel, h, 6);
+PEL_UNI_W(qpel, h, 8);
+PEL_UNI_W(qpel, h, 12);
+PEL_UNI_W(qpel, h, 16);
+PEL_UNI_W(qpel, h, 24);
+PEL_UNI_W(qpel, h, 32);
+PEL_UNI_W(qpel, h, 48);
+PEL_UNI_W(qpel, h, 64);
+
 #undef PEL_UNI_W

 #endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H

From patchwork Wed Dec 27 04:50:17 2023
X-Patchwork-Id: 45344
From: jinbo
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 27 Dec 2023 12:50:17 +0800
Message-Id: <20231227045019.25078-6-jinbo@loongson.cn>
In-Reply-To: <20231227045019.25078-1-jinbo@loongson.cn>
References: <20231227045019.25078-1-jinbo@loongson.cn>
Subject: [FFmpeg-devel] [PATCH v2 5/7] avcodec/hevc: Add epel_uni_w_hv4/6/8/12/16/24/32/48/64 asm opt

tests/checkasm/checkasm:              C      LSX     LASX
put_hevc_epel_uni_w_hv4_8_c:        9.5      2.2
put_hevc_epel_uni_w_hv6_8_c:       18.5      5.0      3.7
put_hevc_epel_uni_w_hv8_8_c:       30.7      6.0      4.5
put_hevc_epel_uni_w_hv12_8_c:      63.7     14.0     10.7
put_hevc_epel_uni_w_hv16_8_c:     107.5     22.7     17.0
put_hevc_epel_uni_w_hv24_8_c:     236.7     50.2     31.7
put_hevc_epel_uni_w_hv32_8_c:     414.5     88.0     53.0
put_hevc_epel_uni_w_hv48_8_c:     917.5    197.7    118.5
put_hevc_epel_uni_w_hv64_8_c:    1617.0    349.5    203.0

After this patch, the performance of decoding an H.265 4K 30FPS 30Mbps
stream on 3A6000 with 8 threads improves by 3fps (52fps --> 55fps).
Change-Id: If067e394cec4685c62193e7adb829ac93ba4804d
---
 libavcodec/loongarch/hevc_mc.S                | 821 ++++++++++++++++++
 libavcodec/loongarch/hevcdsp_init_loongarch.c |  19 +
 libavcodec/loongarch/hevcdsp_lasx.h           |   9 +
 libavcodec/loongarch/hevcdsp_lsx.h            |  10 +
 4 files changed, 859 insertions(+)

diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S
index 2ee338fb8e..0b0647546b 100644
--- a/libavcodec/loongarch/hevc_mc.S
+++ b/libavcodec/loongarch/hevc_mc.S
@@ -22,6 +22,7 @@
 #include "loongson_asm.S"

 .extern ff_hevc_qpel_filters
+.extern ff_hevc_epel_filters

 .macro LOAD_VAR bit
     addi.w        t1, a5, 6        //shift
@@ -206,6 +207,12 @@
 .endif
 .endm

+/*
+ * void FUNC(put_hevc_pel_uni_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride,
+ *                                      const uint8_t *_src, ptrdiff_t _srcstride,
+ *                                      int height, int denom, int wx, int ox,
+ *                                      intptr_t mx, intptr_t my, int width)
+ */
 function ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx
     LOAD_VAR 128
     srli.w        t0, a4, 1
@@ -482,6 +489,12 @@ endfunc
     xvhaddw.d.w   \in0, \in0, \in0
 .endm

+/*
+ * void FUNC(put_hevc_qpel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
+ *                                  const uint8_t *_src, ptrdiff_t _srcstride,
+ *                                  int height, int denom, int wx, int ox,
+ *                                  intptr_t mx, intptr_t my, int width)
+ */
 function ff_hevc_put_hevc_qpel_uni_w_v4_8_lsx
     LOAD_VAR 128
     ld.d          t0, sp, 8        //my
@@ -1253,6 +1266,12 @@ endfunc
     xvssrani.bu.h \out0, xr11, 0
 .endm

+/*
+ * void FUNC(put_hevc_qpel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ *                                  const uint8_t *_src, ptrdiff_t _srcstride,
+ *                                  int height, int denom, int wx, int ox,
+ *                                  intptr_t mx, intptr_t my, int width)
+ */
 function ff_hevc_put_hevc_qpel_uni_w_h4_8_lsx
     LOAD_VAR 128
     ld.d          t0, sp, 0        //mx
@@ -1763,3 +1782,805 @@ function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx
     addi.d        a4, a4, -1
     bnez          a4, .LOOP_H64_LASX
 endfunc
+
+const shufb
+    .byte 0,1,2,3, 1,2,3,4, 2,3,4,5, 3,4,5,6
+    .byte 4,5,6,7, 5,6,7,8, 6,7,8,9, 7,8,9,10
+endconst
+
+.macro PUT_HEVC_EPEL_UNI_W_HV4_LSX w
+    fld.d         f7, a2, 0        // start to load src
+    fldx.d        f8, a2, a3
+    alsl.d        a2, a3, a2, 1
+    fld.d         f9, a2, 0
+    vshuf.b       vr7, vr7, vr7, vr0   // 0123 1234 2345 3456
+    vshuf.b       vr8, vr8, vr8, vr0
+    vshuf.b       vr9, vr9, vr9, vr0
+    vdp2.h.bu.b   vr10, vr7, vr5   // EPEL_FILTER(src, 1)
+    vdp2.h.bu.b   vr11, vr8, vr5
+    vdp2.h.bu.b   vr12, vr9, vr5
+    vhaddw.w.h    vr10, vr10, vr10 // tmp[0/1/2/3]
+    vhaddw.w.h    vr11, vr11, vr11 // vr10,vr11,vr12 corresponding to EPEL_EXTRA
+    vhaddw.w.h    vr12, vr12, vr12
+.LOOP_HV4_\w:
+    add.d         a2, a2, a3
+    fld.d         f14, a2, 0       // height loop begin
+    vshuf.b       vr14, vr14, vr14, vr0
+    vdp2.h.bu.b   vr13, vr14, vr5
+    vhaddw.w.h    vr13, vr13, vr13
+    vmul.w        vr14, vr10, vr16 // EPEL_FILTER(tmp, MAX_PB_SIZE)
+    vmadd.w       vr14, vr11, vr17
+    vmadd.w       vr14, vr12, vr18
+    vmadd.w       vr14, vr13, vr19
+    vaddi.wu      vr10, vr11, 0    // back up previous value
+    vaddi.wu      vr11, vr12, 0
+    vaddi.wu      vr12, vr13, 0
+    vsrai.w       vr14, vr14, 6    // >> 6
+    vmul.w        vr14, vr14, vr1  // * wx
+    vadd.w        vr14, vr14, vr2  // + offset
+    vsra.w        vr14, vr14, vr3  // >> shift
+    vadd.w        vr14, vr14, vr4  // + ox
+    vssrani.h.w   vr14, vr14, 0
+    vssrani.bu.h  vr14, vr14, 0    // clip
+    fst.s         f14, a0, 0
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_HV4_\w
+.endm
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride,
+ *                                   const uint8_t *_src, ptrdiff_t _srcstride,
+ *                                   int height, int denom, int wx, int ox,
+ *                                   intptr_t mx, intptr_t my, int width)
+ */
+function ff_hevc_put_hevc_epel_uni_w_hv4_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        // mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 2
+    la.local      t1, ff_hevc_epel_filters
+    vldx          vr5, t1, t0      // ff_hevc_epel_filters[mx - 1];
+    vreplvei.w    vr5, vr5, 0
+    ld.d          t0, sp, 8        // my
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 2
+    vldx          vr6, t1, t0      // ff_hevc_epel_filters[my - 1];
+    vsllwil.h.b   vr6, vr6, 0
+    vsllwil.w.h   vr6, vr6, 0
+    vreplvei.w    vr16, vr6, 0
+    vreplvei.w    vr17, vr6, 1
+    vreplvei.w    vr18, vr6, 2
+    vreplvei.w    vr19, vr6, 3
+    la.local      t1, shufb
+    vld           vr0, t1, 0
+    sub.d         a2, a2, a3       // src -= srcstride
+    addi.d        a2, a2, -1
+    PUT_HEVC_EPEL_UNI_W_HV4_LSX 4
+endfunc
+
+.macro PUT_HEVC_EPEL_UNI_W_HV8_LSX w
+    vld           vr7, a2, 0       // start to load src
+    vldx          vr8, a2, a3
+    alsl.d        a2, a3, a2, 1
+    vld           vr9, a2, 0
+    vshuf.b       vr10, vr7, vr7, vr0  // 0123 1234 2345 3456
+    vshuf.b       vr11, vr8, vr8, vr0
+    vshuf.b       vr12, vr9, vr9, vr0
+    vshuf.b       vr7, vr7, vr7, vr22  // 4567 5678 6789 78910
+    vshuf.b       vr8, vr8, vr8, vr22
+    vshuf.b       vr9, vr9, vr9, vr22
+    vdp2.h.bu.b   vr13, vr10, vr5  // EPEL_FILTER(src, 1)
+    vdp2.h.bu.b   vr14, vr11, vr5
+    vdp2.h.bu.b   vr15, vr12, vr5
+    vdp2.h.bu.b   vr23, vr7, vr5
+    vdp2.h.bu.b   vr20, vr8, vr5
+    vdp2.h.bu.b   vr21, vr9, vr5
+    vhaddw.w.h    vr7, vr13, vr13
+    vhaddw.w.h    vr8, vr14, vr14
+    vhaddw.w.h    vr9, vr15, vr15
+    vhaddw.w.h    vr10, vr23, vr23
+    vhaddw.w.h    vr11, vr20, vr20
+    vhaddw.w.h    vr12, vr21, vr21
+.LOOP_HV8_HORI_\w:
+    add.d         a2, a2, a3
+    vld           vr15, a2, 0
+    vshuf.b       vr23, vr15, vr15, vr0
+    vshuf.b       vr15, vr15, vr15, vr22
+    vdp2.h.bu.b   vr13, vr23, vr5
+    vdp2.h.bu.b   vr14, vr15, vr5
+    vhaddw.w.h    vr13, vr13, vr13 //789--13
+    vhaddw.w.h    vr14, vr14, vr14 //101112--14
+    vmul.w        vr15, vr7, vr16  //EPEL_FILTER(tmp, MAX_PB_SIZE)
+    vmadd.w       vr15, vr8, vr17
+    vmadd.w       vr15, vr9, vr18
+    vmadd.w       vr15, vr13, vr19
+    vmul.w        vr20, vr10, vr16
+    vmadd.w       vr20, vr11, vr17
+    vmadd.w       vr20, vr12, vr18
+    vmadd.w       vr20, vr14, vr19
+    vaddi.wu      vr7, vr8, 0      //back up previous value
+    vaddi.wu      vr8, vr9, 0
+    vaddi.wu      vr9, vr13, 0
+    vaddi.wu      vr10, vr11, 0
+    vaddi.wu      vr11, vr12, 0
+    vaddi.wu      vr12, vr14, 0
+    vsrai.w       vr15, vr15, 6    // >> 6
+    vsrai.w       vr20, vr20, 6
+    vmul.w        vr15, vr15, vr1  // * wx
+    vmul.w        vr20, vr20, vr1
+    vadd.w        vr15, vr15, vr2  // + offset
+    vadd.w        vr20, vr20, vr2
+    vsra.w        vr15, vr15, vr3  // >> shift
+    vsra.w        vr20, vr20, vr3
+    vadd.w        vr15, vr15, vr4  // + ox
+    vadd.w        vr20, vr20, vr4
+    vssrani.h.w   vr20, vr15, 0
+    vssrani.bu.h  vr20, vr20, 0
+.if \w > 6
+    fst.d         f20, a0, 0
+.else
+    fst.s         f20, a0, 0
+    vstelm.h      vr20, a0, 4, 2
+.endif
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_HV8_HORI_\w
+.endm
+
+.macro PUT_HEVC_EPEL_UNI_W_HV8_LASX w
+    vld           vr7, a2, 0       // start to load src
+    vldx          vr8, a2, a3
+    alsl.d        a2, a3, a2, 1
+    vld           vr9, a2, 0
+    xvreplve0.q   xr7, xr7
+    xvreplve0.q   xr8, xr8
+    xvreplve0.q   xr9, xr9
+    xvshuf.b      xr10, xr7, xr7, xr0  // 0123 1234 2345 3456
+    xvshuf.b      xr11, xr8, xr8, xr0
+    xvshuf.b      xr12, xr9, xr9, xr0
+    xvdp2.h.bu.b  xr13, xr10, xr5  // EPEL_FILTER(src, 1)
+    xvdp2.h.bu.b  xr14, xr11, xr5
+    xvdp2.h.bu.b  xr15, xr12, xr5
+    xvhaddw.w.h   xr7, xr13, xr13
+    xvhaddw.w.h   xr8, xr14, xr14
+    xvhaddw.w.h   xr9, xr15, xr15
+.LOOP_HV8_HORI_LASX_\w:
+    add.d         a2, a2, a3
+    vld           vr15, a2, 0
+    xvreplve0.q   xr15, xr15
+    xvshuf.b      xr23, xr15, xr15, xr0
+    xvdp2.h.bu.b  xr10, xr23, xr5
+    xvhaddw.w.h   xr10, xr10, xr10
+    xvmul.w       xr15, xr7, xr16  //EPEL_FILTER(tmp, MAX_PB_SIZE)
+    xvmadd.w      xr15, xr8, xr17
+    xvmadd.w      xr15, xr9, xr18
+    xvmadd.w      xr15, xr10, xr19
+    xvaddi.wu     xr7, xr8, 0      //back up previous value
+    xvaddi.wu     xr8, xr9, 0
+    xvaddi.wu     xr9, xr10, 0
+    xvsrai.w      xr15, xr15, 6    // >> 6
+    xvmul.w       xr15, xr15, xr1  // * wx
+    xvadd.w       xr15, xr15, xr2  // + offset
+    xvsra.w       xr15, xr15, xr3  // >> shift
+    xvadd.w       xr15, xr15, xr4  // + ox
+    xvpermi.q     xr20, xr15, 0x01
+    vssrani.h.w   vr20, vr15, 0
+    vssrani.bu.h  vr20, vr20, 0
+.if \w > 6
+    fst.d         f20, a0, 0
+.else
+    fst.s         f20, a0, 0
+    vstelm.h      vr20, a0, 4, 2
+.endif
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_HV8_HORI_LASX_\w
+.endm
+
+.macro PUT_HEVC_EPEL_UNI_W_HV16_LASX w
+    xvld          xr7, a2, 0       // start to load src
+    xvldx         xr8, a2, a3
+    alsl.d        a2, a3, a2, 1
+    xvld          xr9, a2, 0
+    xvpermi.d     xr10, xr7, 0x09  //8..18
+    xvpermi.d     xr11, xr8, 0x09
+    xvpermi.d     xr12, xr9, 0x09
+    xvreplve0.q   xr7, xr7
+    xvreplve0.q   xr8, xr8
+    xvreplve0.q   xr9, xr9
+    xvshuf.b      xr13, xr7, xr7, xr0  // 0123 1234 2345 3456
+    xvshuf.b      xr14, xr8, xr8, xr0
+    xvshuf.b      xr15, xr9, xr9, xr0
+    xvdp2.h.bu.b  xr20, xr13, xr5  // EPEL_FILTER(src, 1)
+    xvdp2.h.bu.b  xr21, xr14, xr5
+    xvdp2.h.bu.b  xr22, xr15, xr5
+    xvhaddw.w.h   xr7, xr20, xr20
+    xvhaddw.w.h   xr8, xr21, xr21
+    xvhaddw.w.h   xr9, xr22, xr22
+    xvreplve0.q   xr10, xr10
+    xvreplve0.q   xr11, xr11
+    xvreplve0.q   xr12, xr12
+    xvshuf.b      xr13, xr10, xr10, xr0
+    xvshuf.b      xr14, xr11, xr11, xr0
+    xvshuf.b      xr15, xr12, xr12, xr0
+    xvdp2.h.bu.b  xr20, xr13, xr5
+    xvdp2.h.bu.b  xr21, xr14, xr5
+    xvdp2.h.bu.b  xr22, xr15, xr5
+    xvhaddw.w.h   xr10, xr20, xr20
+    xvhaddw.w.h   xr11, xr21, xr21
+    xvhaddw.w.h   xr12, xr22, xr22
+.LOOP_HV16_HORI_LASX_\w:
+    add.d         a2, a2, a3
+    xvld          xr15, a2, 0
+    xvpermi.d     xr20, xr15, 0x09 //8...18
+    xvreplve0.q   xr15, xr15
+    xvreplve0.q   xr20, xr20
+    xvshuf.b      xr21, xr15, xr15, xr0
+    xvshuf.b      xr22, xr20, xr20, xr0
+    xvdp2.h.bu.b  xr13, xr21, xr5
+    xvdp2.h.bu.b  xr14, xr22, xr5
+    xvhaddw.w.h   xr13, xr13, xr13
+    xvhaddw.w.h   xr14, xr14, xr14
+    xvmul.w       xr15, xr7, xr16  //EPEL_FILTER(tmp, MAX_PB_SIZE)
+    xvmadd.w      xr15, xr8, xr17
+    xvmadd.w      xr15, xr9, xr18
+    xvmadd.w      xr15, xr13, xr19
+    xvmul.w       xr20, xr10, xr16
+    xvmadd.w      xr20, xr11, xr17
+    xvmadd.w      xr20, xr12, xr18
+    xvmadd.w      xr20, xr14, xr19
+    xvaddi.wu     xr7, xr8, 0      //back up previous value
+    xvaddi.wu     xr8, xr9, 0
+    xvaddi.wu     xr9, xr13, 0
+    xvaddi.wu     xr10, xr11, 0
+    xvaddi.wu     xr11, xr12, 0
+    xvaddi.wu     xr12, xr14, 0
+    xvsrai.w      xr15, xr15, 6    // >> 6
+    xvsrai.w      xr20, xr20, 6    // >> 6
+    xvmul.w       xr15, xr15, xr1  // * wx
+    xvmul.w       xr20, xr20, xr1  // * wx
+    xvadd.w       xr15, xr15, xr2  // + offset
+    xvadd.w       xr20, xr20, xr2  // + offset
+    xvsra.w       xr15, xr15, xr3  // >> shift
+    xvsra.w       xr20, xr20, xr3  // >> shift
+    xvadd.w       xr15, xr15, xr4  // + ox
+    xvadd.w       xr20, xr20, xr4  // + ox
+    xvssrani.h.w  xr20, xr15, 0
+    xvpermi.q     xr21, xr20, 0x01
+    vssrani.bu.h  vr21, vr20, 0
+    vpermi.w      vr21, vr21, 0xd8
+.if \w < 16
+    fst.d         f21, a0, 0
+    vstelm.w      vr21, a0, 8, 2
+.else
+    vst           vr21, a0, 0
+.endif
+    add.d         a0, a0, a1
+    addi.d        a4, a4, -1
+    bnez          a4, .LOOP_HV16_HORI_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_lsx
+    LOAD_VAR 128
+    ld.d          t0, sp, 0        // mx
+    addi.d        t0, t0, -1
+    slli.w        t0, t0, 2
+    la.local      t1, ff_hevc_epel_filters
+    vldx          vr5, t1, t0      // ff_hevc_epel_filters[mx - 1];
+    vreplvei.w    vr5, vr5, 0
+    ld.d          t0, sp, 8        // my
addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + vreplvei.w vr16, vr6, 0 + vreplvei.w vr17, vr6, 1 + vreplvei.w vr18, vr6, 2 + vreplvei.w vr19, vr6, 3 + la.local t1, shufb + vld vr0, t1, 0 + vaddi.bu vr22, vr0, 4 // update shufb to get high part + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + PUT_HEVC_EPEL_UNI_W_HV8_LSX 6 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv6_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + xvreplve0.w xr5, xr5 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + xvreplve0.q xr6, xr6 + xvrepl128vei.w xr16, xr6, 0 + xvrepl128vei.w xr17, xr6, 1 + xvrepl128vei.w xr18, xr6, 2 + xvrepl128vei.w xr19, xr6, 3 + la.local t1, shufb + xvld xr0, t1, 0 + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + PUT_HEVC_EPEL_UNI_W_HV8_LASX 6 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv8_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + vreplvei.w vr5, vr5, 0 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + vreplvei.w vr16, vr6, 0 + vreplvei.w vr17, vr6, 1 + vreplvei.w vr18, vr6, 2 + vreplvei.w vr19, vr6, 3 + la.local t1, shufb + vld vr0, t1, 0 + vaddi.bu vr22, vr0, 4 // update shufb to get high part + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + PUT_HEVC_EPEL_UNI_W_HV8_LSX 8 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv8_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + 
vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + xvreplve0.w xr5, xr5 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + xvreplve0.q xr6, xr6 + xvrepl128vei.w xr16, xr6, 0 + xvrepl128vei.w xr17, xr6, 1 + xvrepl128vei.w xr18, xr6, 2 + xvrepl128vei.w xr19, xr6, 3 + la.local t1, shufb + xvld xr0, t1, 0 + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + PUT_HEVC_EPEL_UNI_W_HV8_LASX 8 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv12_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + vreplvei.w vr5, vr5, 0 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + vreplvei.w vr16, vr6, 0 + vreplvei.w vr17, vr6, 1 + vreplvei.w vr18, vr6, 2 + vreplvei.w vr19, vr6, 3 + la.local t1, shufb + vld vr0, t1, 0 + vaddi.bu vr22, vr0, 4 // update shufb to get high part + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + PUT_HEVC_EPEL_UNI_W_HV8_LSX 12 + addi.d a0, t2, 8 + addi.d a2, t3, 8 + addi.d a4, t4, 0 + PUT_HEVC_EPEL_UNI_W_HV4_LSX 12 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv12_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + xvreplve0.w xr5, xr5 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + xvreplve0.q xr6, xr6 + xvrepl128vei.w xr16, xr6, 0 + xvrepl128vei.w xr17, xr6, 1 + xvrepl128vei.w xr18, xr6, 2 + xvrepl128vei.w xr19, xr6, 3 + la.local t1, shufb + xvld xr0, t1, 0 + sub.d a2, a2, a3 // src -= 
srcstride + addi.d a2, a2, -1 + PUT_HEVC_EPEL_UNI_W_HV16_LASX 12 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv16_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + vreplvei.w vr5, vr5, 0 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + vreplvei.w vr16, vr6, 0 + vreplvei.w vr17, vr6, 1 + vreplvei.w vr18, vr6, 2 + vreplvei.w vr19, vr6, 3 + la.local t1, shufb + vld vr0, t1, 0 + vaddi.bu vr22, vr0, 4 // update shufb to get high part + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + addi.d t5, zero, 2 +.LOOP_HV16: + PUT_HEVC_EPEL_UNI_W_HV8_LSX 16 + addi.d a0, t2, 8 + addi.d a2, t3, 8 + addi.d a4, t4, 0 + addi.d t5, t5, -1 + bnez t5, .LOOP_HV16 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv16_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + xvreplve0.w xr5, xr5 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + xvreplve0.q xr6, xr6 + xvrepl128vei.w xr16, xr6, 0 + xvrepl128vei.w xr17, xr6, 1 + xvrepl128vei.w xr18, xr6, 2 + xvrepl128vei.w xr19, xr6, 3 + la.local t1, shufb + xvld xr0, t1, 0 + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + PUT_HEVC_EPEL_UNI_W_HV16_LASX 16 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv24_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + vreplvei.w vr5, vr5, 0 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // 
ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + vreplvei.w vr16, vr6, 0 + vreplvei.w vr17, vr6, 1 + vreplvei.w vr18, vr6, 2 + vreplvei.w vr19, vr6, 3 + la.local t1, shufb + vld vr0, t1, 0 + vaddi.bu vr22, vr0, 4 // update shufb to get high part + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + addi.d t5, zero, 3 +.LOOP_HV24: + PUT_HEVC_EPEL_UNI_W_HV8_LSX 24 + addi.d a0, t2, 8 + addi.d t2, t2, 8 + addi.d a2, t3, 8 + addi.d t3, t3, 8 + addi.d a4, t4, 0 + addi.d t5, t5, -1 + bnez t5, .LOOP_HV24 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv24_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + xvreplve0.w xr5, xr5 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + xvreplve0.q xr6, xr6 + xvrepl128vei.w xr16, xr6, 0 + xvrepl128vei.w xr17, xr6, 1 + xvrepl128vei.w xr18, xr6, 2 + xvrepl128vei.w xr19, xr6, 3 + la.local t1, shufb + xvld xr0, t1, 0 + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + PUT_HEVC_EPEL_UNI_W_HV16_LASX 24 + addi.d a0, t2, 16 + addi.d a2, t3, 16 + addi.d a4, t4, 0 + PUT_HEVC_EPEL_UNI_W_HV8_LASX 24 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv32_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + vreplvei.w vr5, vr5, 0 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + vreplvei.w vr16, vr6, 0 + vreplvei.w vr17, vr6, 1 + vreplvei.w vr18, vr6, 2 + vreplvei.w vr19, vr6, 3 + la.local t1, shufb + vld vr0, t1, 0 
+ vaddi.bu vr22, vr0, 4 // update shufb to get high part + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + addi.d t5, zero, 4 +.LOOP_HV32: + PUT_HEVC_EPEL_UNI_W_HV8_LSX 32 + addi.d a0, t2, 8 + addi.d t2, t2, 8 + addi.d a2, t3, 8 + addi.d t3, t3, 8 + addi.d a4, t4, 0 + addi.d t5, t5, -1 + bnez t5, .LOOP_HV32 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv32_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + xvreplve0.w xr5, xr5 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + xvreplve0.q xr6, xr6 + xvrepl128vei.w xr16, xr6, 0 + xvrepl128vei.w xr17, xr6, 1 + xvrepl128vei.w xr18, xr6, 2 + xvrepl128vei.w xr19, xr6, 3 + la.local t1, shufb + xvld xr0, t1, 0 + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + addi.d t5, zero, 2 +.LOOP_HV32_LASX: + PUT_HEVC_EPEL_UNI_W_HV16_LASX 32 + addi.d a0, t2, 16 + addi.d t2, t2, 16 + addi.d a2, t3, 16 + addi.d t3, t3, 16 + addi.d a4, t4, 0 + addi.d t5, t5, -1 + bnez t5, .LOOP_HV32_LASX +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv48_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + vreplvei.w vr5, vr5, 0 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + vreplvei.w vr16, vr6, 0 + vreplvei.w vr17, vr6, 1 + vreplvei.w vr18, vr6, 2 + vreplvei.w vr19, vr6, 3 + la.local t1, shufb + vld vr0, t1, 0 + vaddi.bu vr22, vr0, 4 // update shufb to get high part + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, 
a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + addi.d t5, zero, 6 +.LOOP_HV48: + PUT_HEVC_EPEL_UNI_W_HV8_LSX 48 + addi.d a0, t2, 8 + addi.d t2, t2, 8 + addi.d a2, t3, 8 + addi.d t3, t3, 8 + addi.d a4, t4, 0 + addi.d t5, t5, -1 + bnez t5, .LOOP_HV48 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + xvreplve0.w xr5, xr5 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + xvreplve0.q xr6, xr6 + xvrepl128vei.w xr16, xr6, 0 + xvrepl128vei.w xr17, xr6, 1 + xvrepl128vei.w xr18, xr6, 2 + xvrepl128vei.w xr19, xr6, 3 + la.local t1, shufb + xvld xr0, t1, 0 + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + addi.d t5, zero, 3 +.LOOP_HV48_LASX: + PUT_HEVC_EPEL_UNI_W_HV16_LASX 48 + addi.d a0, t2, 16 + addi.d t2, t2, 16 + addi.d a2, t3, 16 + addi.d t3, t3, 16 + addi.d a4, t4, 0 + addi.d t5, t5, -1 + bnez t5, .LOOP_HV48_LASX +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv64_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + vreplvei.w vr5, vr5, 0 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + vreplvei.w vr16, vr6, 0 + vreplvei.w vr17, vr6, 1 + vreplvei.w vr18, vr6, 2 + vreplvei.w vr19, vr6, 3 + la.local t1, shufb + vld vr0, t1, 0 + vaddi.bu vr22, vr0, 4 // update shufb to get high part + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + addi.d t5, zero, 8 +.LOOP_HV64: + PUT_HEVC_EPEL_UNI_W_HV8_LSX 64 + addi.d a0, t2, 8 + 
addi.d t2, t2, 8 + addi.d a2, t3, 8 + addi.d t3, t3, 8 + addi.d a4, t4, 0 + addi.d t5, t5, -1 + bnez t5, .LOOP_HV64 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 // mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr5, t1, t0 // ff_hevc_epel_filters[mx - 1]; + xvreplve0.w xr5, xr5 + ld.d t0, sp, 8 // my + addi.d t0, t0, -1 + slli.w t0, t0, 2 + vldx vr6, t1, t0 // ff_hevc_epel_filters[my - 1]; + vsllwil.h.b vr6, vr6, 0 + vsllwil.w.h vr6, vr6, 0 + xvreplve0.q xr6, xr6 + xvrepl128vei.w xr16, xr6, 0 + xvrepl128vei.w xr17, xr6, 1 + xvrepl128vei.w xr18, xr6, 2 + xvrepl128vei.w xr19, xr6, 3 + la.local t1, shufb + xvld xr0, t1, 0 + sub.d a2, a2, a3 // src -= srcstride + addi.d a2, a2, -1 + addi.d t2, a0, 0 + addi.d t3, a2, 0 + addi.d t4, a4, 0 + addi.d t5, zero, 4 +.LOOP_HV64_LASX: + PUT_HEVC_EPEL_UNI_W_HV16_LASX 64 + addi.d a0, t2, 16 + addi.d t2, t2, 16 + addi.d a2, t3, 16 + addi.d t3, t3, 16 + addi.d a4, t4, 0 + addi.d t5, t5, -1 + bnez t5, .LOOP_HV64_LASX +endfunc diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c index 3cdb3fb2d7..245a833947 100644 --- a/libavcodec/loongarch/hevcdsp_init_loongarch.c +++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c @@ -171,6 +171,16 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->put_hevc_qpel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx; c->put_hevc_qpel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx; + c->put_hevc_epel_uni_w[1][1][1] = ff_hevc_put_hevc_epel_uni_w_hv4_8_lsx; + c->put_hevc_epel_uni_w[2][1][1] = ff_hevc_put_hevc_epel_uni_w_hv6_8_lsx; + c->put_hevc_epel_uni_w[3][1][1] = ff_hevc_put_hevc_epel_uni_w_hv8_8_lsx; + c->put_hevc_epel_uni_w[4][1][1] = ff_hevc_put_hevc_epel_uni_w_hv12_8_lsx; + c->put_hevc_epel_uni_w[5][1][1] = ff_hevc_put_hevc_epel_uni_w_hv16_8_lsx; + c->put_hevc_epel_uni_w[6][1][1] = 
ff_hevc_put_hevc_epel_uni_w_hv24_8_lsx; + c->put_hevc_epel_uni_w[7][1][1] = ff_hevc_put_hevc_epel_uni_w_hv32_8_lsx; + c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lsx; + c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lsx; + c->put_hevc_epel_uni_w[1][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels4_8_lsx; c->put_hevc_epel_uni_w[2][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels6_8_lsx; c->put_hevc_epel_uni_w[3][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels8_8_lsx; @@ -258,6 +268,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lasx; c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lasx; + c->put_hevc_epel_uni_w[2][1][1] = ff_hevc_put_hevc_epel_uni_w_hv6_8_lasx; + c->put_hevc_epel_uni_w[3][1][1] = ff_hevc_put_hevc_epel_uni_w_hv8_8_lasx; + c->put_hevc_epel_uni_w[4][1][1] = ff_hevc_put_hevc_epel_uni_w_hv12_8_lasx; + c->put_hevc_epel_uni_w[5][1][1] = ff_hevc_put_hevc_epel_uni_w_hv16_8_lasx; + c->put_hevc_epel_uni_w[6][1][1] = ff_hevc_put_hevc_epel_uni_w_hv24_8_lasx; + c->put_hevc_epel_uni_w[7][1][1] = ff_hevc_put_hevc_epel_uni_w_hv32_8_lasx; + c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx; + c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx; + c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx; c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx; c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx; diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h index 8a9266d375..7f09d0943a 100644 --- a/libavcodec/loongarch/hevcdsp_lasx.h +++ b/libavcodec/loongarch/hevcdsp_lasx.h @@ -66,6 +66,15 @@ PEL_UNI_W(qpel, h, 32); PEL_UNI_W(qpel, h, 48); PEL_UNI_W(qpel, h, 64); +PEL_UNI_W(epel, hv, 6); +PEL_UNI_W(epel, hv, 8); +PEL_UNI_W(epel, hv, 12); +PEL_UNI_W(epel, hv, 16); 
+PEL_UNI_W(epel, hv, 24);
+PEL_UNI_W(epel, hv, 32);
+PEL_UNI_W(epel, hv, 48);
+PEL_UNI_W(epel, hv, 64);
+
 #undef PEL_UNI_W
 
 #endif  // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H
diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h
index 3291294ed9..7769cf25ae 100644
--- a/libavcodec/loongarch/hevcdsp_lsx.h
+++ b/libavcodec/loongarch/hevcdsp_lsx.h
@@ -277,6 +277,16 @@ PEL_UNI_W(qpel, h, 32);
 PEL_UNI_W(qpel, h, 48);
 PEL_UNI_W(qpel, h, 64);
 
+PEL_UNI_W(epel, hv, 4);
+PEL_UNI_W(epel, hv, 6);
+PEL_UNI_W(epel, hv, 8);
+PEL_UNI_W(epel, hv, 12);
+PEL_UNI_W(epel, hv, 16);
+PEL_UNI_W(epel, hv, 24);
+PEL_UNI_W(epel, hv, 32);
+PEL_UNI_W(epel, hv, 48);
+PEL_UNI_W(epel, hv, 64);
+
 #undef PEL_UNI_W
 
 #endif  // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H

From patchwork Wed Dec 27 04:50:18 2023
X-Patchwork-Submitter: 金波 (jinbo)
X-Patchwork-Id: 45346
From: jinbo
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 27 Dec 2023 12:50:18 +0800
Message-Id: <20231227045019.25078-7-jinbo@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20231227045019.25078-1-jinbo@loongson.cn>
References: <20231227045019.25078-1-jinbo@loongson.cn>
Subject: [FFmpeg-devel] [PATCH v2 6/7] avcodec/hevc: Add asm opt for the following functions
Cc: jinbo

tests/checkasm/checkasm:          C      LSX    LASX
put_hevc_qpel_uni_h4_8_c:       5.7      1.2
put_hevc_qpel_uni_h6_8_c:      12.2      2.7
put_hevc_qpel_uni_h8_8_c:      21.5      3.2
put_hevc_qpel_uni_h12_8_c:     47.2      9.2     7.2
put_hevc_qpel_uni_h16_8_c:     87.0     11.7     9.0
put_hevc_qpel_uni_h24_8_c:    188.2     27.5    21.0
put_hevc_qpel_uni_h32_8_c:    335.2     46.7    28.5
put_hevc_qpel_uni_h48_8_c:    772.5    104.5    65.2
put_hevc_qpel_uni_h64_8_c:   1383.2    142.2   109.0
put_hevc_epel_uni_w_v4_8_c:     5.0      1.5
put_hevc_epel_uni_w_v6_8_c:    10.7      3.5     2.5
put_hevc_epel_uni_w_v8_8_c:    18.2      3.7     3.0
put_hevc_epel_uni_w_v12_8_c:   40.2     10.7     7.5
put_hevc_epel_uni_w_v16_8_c:   70.2     13.0     9.2
put_hevc_epel_uni_w_v24_8_c:  158.2     30.2    22.5
put_hevc_epel_uni_w_v32_8_c:  281.0     52.0    36.5
put_hevc_epel_uni_w_v48_8_c:  631.7    116.7    82.7
put_hevc_epel_uni_w_v64_8_c: 1108.2    207.5   142.2
put_hevc_epel_uni_w_h4_8_c:     4.7      1.2
put_hevc_epel_uni_w_h6_8_c:     9.7      3.5     2.7
put_hevc_epel_uni_w_h8_8_c:    17.2      4.2     3.5
put_hevc_epel_uni_w_h12_8_c:   38.0     11.5     7.2
put_hevc_epel_uni_w_h16_8_c:   69.2     14.5     9.2
put_hevc_epel_uni_w_h24_8_c:  152.0     34.7    22.5
put_hevc_epel_uni_w_h32_8_c:  271.0     58.0    40.0
put_hevc_epel_uni_w_h48_8_c:  597.5    136.7    95.0
put_hevc_epel_uni_w_h64_8_c: 1074.0    252.2   168.0
put_hevc_epel_bi_h4_8_c:        4.5      0.7
put_hevc_epel_bi_h6_8_c:        9.0      1.5
put_hevc_epel_bi_h8_8_c:       15.2      1.7
put_hevc_epel_bi_h12_8_c:      33.5      4.2     3.7
put_hevc_epel_bi_h16_8_c:      59.7      5.2     4.7
put_hevc_epel_bi_h24_8_c:     132.2     11.0
put_hevc_epel_bi_h32_8_c:     232.7     20.2    13.2
put_hevc_epel_bi_h48_8_c:     521.7     45.2    31.2
put_hevc_epel_bi_h64_8_c:     949.0     71.5    51.0

After this patch, the performance of decoding H265 4K 30FPS 30Mbps on
3A6000 with 8 threads improves by 1fps (55fps --> 56fps).
Change-Id: I8cc1e41daa63ca478039bc55d1ee8934a7423f51 --- libavcodec/loongarch/hevc_mc.S | 1991 ++++++++++++++++- libavcodec/loongarch/hevcdsp_init_loongarch.c | 66 + libavcodec/loongarch/hevcdsp_lasx.h | 54 + libavcodec/loongarch/hevcdsp_lsx.h | 36 +- 4 files changed, 2144 insertions(+), 3 deletions(-) diff --git a/libavcodec/loongarch/hevc_mc.S b/libavcodec/loongarch/hevc_mc.S index 0b0647546b..a0e5938fbd 100644 --- a/libavcodec/loongarch/hevc_mc.S +++ b/libavcodec/loongarch/hevc_mc.S @@ -1784,8 +1784,12 @@ function ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx endfunc const shufb - .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6 - .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10 + .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6 //mask for epel_uni_w(128-bit) + .byte 4,5,6,7, 5,6,7,8 ,6,7,8,9, 7,8,9,10 //mask for epel_uni_w(256-bit) + .byte 0,1,2,3, 4,5,6,7 ,1,2,3,4, 5,6,7,8 //mask for qpel_uni_h4 + .byte 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8 //mask for qpel_uni_h/v6/8... + .byte 0,1,2,3, 1,2,3,4 ,2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9, 7,8,9,10 //epel_uni_w_h16/24/32/48/64 + .byte 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8, 0,1,1,2, 2,3,3,4 ,4,5,5,6, 6,7,7,8 //mask for bi_epel_h16/24/32/48/64 endconst .macro PUT_HEVC_EPEL_UNI_W_HV4_LSX w @@ -2584,3 +2588,1986 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx addi.d t5, t5, -1 bnez t5, .LOOP_HV64_LASX endfunc + +/* + * void FUNC(put_hevc_qpel_uni_h)(uint8_t *_dst, ptrdiff_t _dststride, + * const uint8_t *_src, ptrdiff_t _srcstride, + * int height, intptr_t mx, intptr_t my, + * int width) + */ +function ff_hevc_put_hevc_uni_qpel_h4_8_lsx + addi.d t0, a5, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr5, t1, t0 //filter + addi.d a2, a2, -3 //src -= 3 + addi.w t1, zero, 32 + vreplgr2vr.h vr1, t1 + la.local t1, shufb + vld vr2, t1, 32 //mask0 0 1 + vaddi.bu vr3, vr2, 2 //mask1 2 3 +.LOOP_UNI_H4: + vld vr18, a2, 0 + vldx vr19, a2, a3 + alsl.d a2, a3, a2, 1 + vshuf.b vr6, vr18, vr18, vr2 + vshuf.b vr7, vr18, vr18, vr3 + vshuf.b 
vr8, vr19, vr19, vr2 + vshuf.b vr9, vr19, vr19, vr3 + vdp2.h.bu.b vr10, vr6, vr5 + vdp2.h.bu.b vr11, vr7, vr5 + vdp2.h.bu.b vr12, vr8, vr5 + vdp2.h.bu.b vr13, vr9, vr5 + vhaddw.d.h vr10 + vhaddw.d.h vr11 + vhaddw.d.h vr12 + vhaddw.d.h vr13 + vpickev.w vr10, vr11, vr10 + vpickev.w vr11, vr13, vr12 + vpickev.h vr10, vr11, vr10 + vadd.h vr10, vr10, vr1 + vsrai.h vr10, vr10, 6 + vssrani.bu.h vr10, vr10, 0 + fst.s f10, a0, 0 + vbsrl.v vr10, vr10, 4 + fstx.s f10, a0, a1 + alsl.d a0, a1, a0, 1 + addi.d a4, a4, -2 + bnez a4, .LOOP_UNI_H4 +endfunc + +.macro HEVC_UNI_QPEL_H8_LSX in0, out0 + vshuf.b vr10, \in0, \in0, vr5 + vshuf.b vr11, \in0, \in0, vr6 + vshuf.b vr12, \in0, \in0, vr7 + vshuf.b vr13, \in0, \in0, vr8 + vdp2.h.bu.b \out0, vr10, vr0 //(QPEL_FILTER(src, 1) + vdp2add.h.bu.b \out0, vr11, vr1 + vdp2add.h.bu.b \out0, vr12, vr2 + vdp2add.h.bu.b \out0, vr13, vr3 + vadd.h \out0, \out0, vr4 + vsrai.h \out0, \out0, 6 +.endm + +.macro HEVC_UNI_QPEL_H16_LASX in0, out0 + xvshuf.b xr10, \in0, \in0, xr5 + xvshuf.b xr11, \in0, \in0, xr6 + xvshuf.b xr12, \in0, \in0, xr7 + xvshuf.b xr13, \in0, \in0, xr8 + xvdp2.h.bu.b \out0, xr10, xr0 //(QPEL_FILTER(src, 1) + xvdp2add.h.bu.b \out0, xr11, xr1 + xvdp2add.h.bu.b \out0, xr12, xr2 + xvdp2add.h.bu.b \out0, xr13, xr3 + xvadd.h \out0, \out0, xr4 + xvsrai.h \out0, \out0, 6 +.endm + +function ff_hevc_put_hevc_uni_qpel_h6_8_lsx + addi.d t0, a5, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr0, t1, t0 //filter abcdefgh + vreplvei.h vr1, vr0, 1 //cd... + vreplvei.h vr2, vr0, 2 //ef... + vreplvei.h vr3, vr0, 3 //gh... + vreplvei.h vr0, vr0, 0 //ab... 
+ addi.d a2, a2, -3 //src -= 3 + addi.w t1, zero, 32 + vreplgr2vr.h vr4, t1 + la.local t1, shufb + vld vr5, t1, 48 + vaddi.bu vr6, vr5, 2 + vaddi.bu vr7, vr5, 4 + vaddi.bu vr8, vr5, 6 +.LOOP_UNI_H6: + vld vr9, a2, 0 + add.d a2, a2, a3 + HEVC_UNI_QPEL_H8_LSX vr9, vr14 + vssrani.bu.h vr14, vr14, 0 + fst.s f14, a0, 0 + vstelm.h vr14, a0, 4, 2 + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_UNI_H6 +endfunc + +function ff_hevc_put_hevc_uni_qpel_h8_8_lsx + addi.d t0, a5, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr0, t1, t0 //filter abcdefgh + vreplvei.h vr1, vr0, 1 //cd... + vreplvei.h vr2, vr0, 2 //ef... + vreplvei.h vr3, vr0, 3 //gh... + vreplvei.h vr0, vr0, 0 //ab... + addi.d a2, a2, -3 //src -= 3 + addi.w t1, zero, 32 + vreplgr2vr.h vr4, t1 + la.local t1, shufb + vld vr5, t1, 48 + vaddi.bu vr6, vr5, 2 + vaddi.bu vr7, vr5, 4 + vaddi.bu vr8, vr5, 6 +.LOOP_UNI_H8: + vld vr9, a2, 0 + add.d a2, a2, a3 + HEVC_UNI_QPEL_H8_LSX vr9, vr14 + vssrani.bu.h vr14, vr14, 0 + fst.d f14, a0, 0 + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_UNI_H8 +endfunc + +function ff_hevc_put_hevc_uni_qpel_h12_8_lsx + addi.d t0, a5, -1 + slli.w t0, t0, 4 + la.local t1, ff_hevc_qpel_filters + vldx vr0, t1, t0 //filter abcdefgh + vreplvei.h vr1, vr0, 1 //cd... + vreplvei.h vr2, vr0, 2 //ef... + vreplvei.h vr3, vr0, 3 //gh... + vreplvei.h vr0, vr0, 0 //ab... 
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    vreplgr2vr.h vr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    vaddi.bu vr6, vr5, 2
+    vaddi.bu vr7, vr5, 4
+    vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H12:
+    vld vr9, a2, 0
+    HEVC_UNI_QPEL_H8_LSX vr9, vr14
+    vld vr9, a2, 8
+    add.d a2, a2, a3
+    HEVC_UNI_QPEL_H8_LSX vr9, vr15
+    vssrani.bu.h vr15, vr14, 0
+    fst.d f15, a0, 0
+    vstelm.w vr15, a0, 8, 2
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H12
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h12_8_lasx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    xvreplve0.q xr0, xr0
+    xvrepl128vei.h xr1, xr0, 1 //cd...
+    xvrepl128vei.h xr2, xr0, 2 //ef...
+    xvrepl128vei.h xr3, xr0, 3 //gh...
+    xvrepl128vei.h xr0, xr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    xvreplgr2vr.h xr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    xvreplve0.q xr5, xr5
+    xvaddi.bu xr6, xr5, 2
+    xvaddi.bu xr7, xr5, 4
+    xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H12_LASX:
+    xvld xr9, a2, 0
+    add.d a2, a2, a3
+    xvpermi.d xr9, xr9, 0x94 //rearrange data
+    HEVC_UNI_QPEL_H16_LASX xr9, xr14
+    xvpermi.q xr15, xr14, 0x01
+    vssrani.bu.h vr15, vr14, 0
+    fst.d f15, a0, 0
+    vstelm.w vr15, a0, 8, 2
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h16_8_lsx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    vreplvei.h vr1, vr0, 1 //cd...
+    vreplvei.h vr2, vr0, 2 //ef...
+    vreplvei.h vr3, vr0, 3 //gh...
+    vreplvei.h vr0, vr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    vreplgr2vr.h vr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    vaddi.bu vr6, vr5, 2
+    vaddi.bu vr7, vr5, 4
+    vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H16:
+    vld vr9, a2, 0
+    HEVC_UNI_QPEL_H8_LSX vr9, vr14
+    vld vr9, a2, 8
+    add.d a2, a2, a3
+    HEVC_UNI_QPEL_H8_LSX vr9, vr15
+    vssrani.bu.h vr15, vr14, 0
+    vst vr15, a0, 0
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H16
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h16_8_lasx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    xvreplve0.q xr0, xr0
+    xvrepl128vei.h xr1, xr0, 1 //cd...
+    xvrepl128vei.h xr2, xr0, 2 //ef...
+    xvrepl128vei.h xr3, xr0, 3 //gh...
+    xvrepl128vei.h xr0, xr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    xvreplgr2vr.h xr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    xvreplve0.q xr5, xr5
+    xvaddi.bu xr6, xr5, 2
+    xvaddi.bu xr7, xr5, 4
+    xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H16_LASX:
+    xvld xr9, a2, 0
+    add.d a2, a2, a3
+    xvpermi.d xr9, xr9, 0x94 //rearrange data
+    HEVC_UNI_QPEL_H16_LASX xr9, xr14
+    xvpermi.q xr15, xr14, 0x01
+    vssrani.bu.h vr15, vr14, 0
+    vst vr15, a0, 0
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h24_8_lsx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    vreplvei.h vr1, vr0, 1 //cd...
+    vreplvei.h vr2, vr0, 2 //ef...
+    vreplvei.h vr3, vr0, 3 //gh...
+    vreplvei.h vr0, vr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    vreplgr2vr.h vr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    vaddi.bu vr6, vr5, 2
+    vaddi.bu vr7, vr5, 4
+    vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H24:
+    vld vr9, a2, 0
+    HEVC_UNI_QPEL_H8_LSX vr9, vr14
+    vld vr9, a2, 8
+    HEVC_UNI_QPEL_H8_LSX vr9, vr15
+    vld vr9, a2, 16
+    add.d a2, a2, a3
+    HEVC_UNI_QPEL_H8_LSX vr9, vr16
+    vssrani.bu.h vr15, vr14, 0
+    vssrani.bu.h vr16, vr16, 0
+    vst vr15, a0, 0
+    fst.d f16, a0, 16
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H24
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h24_8_lasx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    xvreplve0.q xr0, xr0
+    xvrepl128vei.h xr1, xr0, 1 //cd...
+    xvrepl128vei.h xr2, xr0, 2 //ef...
+    xvrepl128vei.h xr3, xr0, 3 //gh...
+    xvrepl128vei.h xr0, xr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    xvreplgr2vr.h xr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    xvreplve0.q xr5, xr5
+    xvaddi.bu xr6, xr5, 2
+    xvaddi.bu xr7, xr5, 4
+    xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H24_LASX:
+    xvld xr9, a2, 0
+    xvpermi.q xr19, xr9, 0x01 //16...23
+    add.d a2, a2, a3
+    xvpermi.d xr9, xr9, 0x94 //rearrange data
+    HEVC_UNI_QPEL_H16_LASX xr9, xr14
+    xvpermi.q xr15, xr14, 0x01
+    vssrani.bu.h vr15, vr14, 0
+    vst vr15, a0, 0
+    HEVC_UNI_QPEL_H8_LSX vr19, vr16
+    vssrani.bu.h vr16, vr16, 0
+    fst.d f16, a0, 16
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h32_8_lsx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    vreplvei.h vr1, vr0, 1 //cd...
+    vreplvei.h vr2, vr0, 2 //ef...
+    vreplvei.h vr3, vr0, 3 //gh...
+    vreplvei.h vr0, vr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    vreplgr2vr.h vr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    vaddi.bu vr6, vr5, 2
+    vaddi.bu vr7, vr5, 4
+    vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H32:
+    vld vr9, a2, 0
+    HEVC_UNI_QPEL_H8_LSX vr9, vr14
+    vld vr9, a2, 8
+    HEVC_UNI_QPEL_H8_LSX vr9, vr15
+    vld vr9, a2, 16
+    HEVC_UNI_QPEL_H8_LSX vr9, vr16
+    vld vr9, a2, 24
+    add.d a2, a2, a3
+    HEVC_UNI_QPEL_H8_LSX vr9, vr17
+    vssrani.bu.h vr15, vr14, 0
+    vssrani.bu.h vr17, vr16, 0
+    vst vr15, a0, 0
+    vst vr17, a0, 16
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H32
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h32_8_lasx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    xvreplve0.q xr0, xr0
+    xvrepl128vei.h xr1, xr0, 1 //cd...
+    xvrepl128vei.h xr2, xr0, 2 //ef...
+    xvrepl128vei.h xr3, xr0, 3 //gh...
+    xvrepl128vei.h xr0, xr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    xvreplgr2vr.h xr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    xvreplve0.q xr5, xr5
+    xvaddi.bu xr6, xr5, 2
+    xvaddi.bu xr7, xr5, 4
+    xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H32_LASX:
+    xvld xr9, a2, 0
+    xvpermi.d xr9, xr9, 0x94
+    HEVC_UNI_QPEL_H16_LASX xr9, xr14
+    xvld xr9, a2, 16
+    xvpermi.d xr9, xr9, 0x94
+    HEVC_UNI_QPEL_H16_LASX xr9, xr15
+    add.d a2, a2, a3
+    xvssrani.bu.h xr15, xr14, 0
+    xvpermi.d xr15, xr15, 0xd8
+    xvst xr15, a0, 0
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h48_8_lsx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    vreplvei.h vr1, vr0, 1 //cd...
+    vreplvei.h vr2, vr0, 2 //ef...
+    vreplvei.h vr3, vr0, 3 //gh...
+    vreplvei.h vr0, vr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    vreplgr2vr.h vr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    vaddi.bu vr6, vr5, 2
+    vaddi.bu vr7, vr5, 4
+    vaddi.bu vr8, vr5, 6
+.LOOP_UNI_H48:
+    vld vr9, a2, 0
+    HEVC_UNI_QPEL_H8_LSX vr9, vr14
+    vld vr9, a2, 8
+    HEVC_UNI_QPEL_H8_LSX vr9, vr15
+    vld vr9, a2, 16
+    HEVC_UNI_QPEL_H8_LSX vr9, vr16
+    vld vr9, a2, 24
+    HEVC_UNI_QPEL_H8_LSX vr9, vr17
+    vld vr9, a2, 32
+    HEVC_UNI_QPEL_H8_LSX vr9, vr18
+    vld vr9, a2, 40
+    add.d a2, a2, a3
+    HEVC_UNI_QPEL_H8_LSX vr9, vr19
+    vssrani.bu.h vr15, vr14, 0
+    vssrani.bu.h vr17, vr16, 0
+    vssrani.bu.h vr19, vr18, 0
+    vst vr15, a0, 0
+    vst vr17, a0, 16
+    vst vr19, a0, 32
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H48
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h48_8_lasx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    xvreplve0.q xr0, xr0
+    xvrepl128vei.h xr1, xr0, 1 //cd...
+    xvrepl128vei.h xr2, xr0, 2 //ef...
+    xvrepl128vei.h xr3, xr0, 3 //gh...
+    xvrepl128vei.h xr0, xr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    xvreplgr2vr.h xr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    xvreplve0.q xr5, xr5
+    xvaddi.bu xr6, xr5, 2
+    xvaddi.bu xr7, xr5, 4
+    xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H48_LASX:
+    xvld xr9, a2, 0
+    xvpermi.d xr9, xr9, 0x94
+    HEVC_UNI_QPEL_H16_LASX xr9, xr14
+    xvld xr9, a2, 16
+    xvpermi.d xr9, xr9, 0x94
+    HEVC_UNI_QPEL_H16_LASX xr9, xr15
+    xvld xr9, a2, 32
+    xvpermi.d xr9, xr9, 0x94
+    HEVC_UNI_QPEL_H16_LASX xr9, xr16
+    add.d a2, a2, a3
+    xvssrani.bu.h xr15, xr14, 0
+    xvpermi.d xr15, xr15, 0xd8
+    xvst xr15, a0, 0
+    xvpermi.q xr17, xr16, 0x01
+    vssrani.bu.h vr17, vr16, 0
+    vst vr17, a0, 32
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H48_LASX
+endfunc
+
+function ff_hevc_put_hevc_uni_qpel_h64_8_lasx
+    addi.d t0, a5, -1
+    slli.w t0, t0, 4
+    la.local t1, ff_hevc_qpel_filters
+    vldx vr0, t1, t0 //filter abcdefgh
+    xvreplve0.q xr0, xr0
+    xvrepl128vei.h xr1, xr0, 1 //cd...
+    xvrepl128vei.h xr2, xr0, 2 //ef...
+    xvrepl128vei.h xr3, xr0, 3 //gh...
+    xvrepl128vei.h xr0, xr0, 0 //ab...
+    addi.d a2, a2, -3 //src -= 3
+    addi.w t1, zero, 32
+    xvreplgr2vr.h xr4, t1
+    la.local t1, shufb
+    vld vr5, t1, 48
+    xvreplve0.q xr5, xr5
+    xvaddi.bu xr6, xr5, 2
+    xvaddi.bu xr7, xr5, 4
+    xvaddi.bu xr8, xr5, 6
+.LOOP_UNI_H64_LASX:
+    xvld xr9, a2, 0
+    xvpermi.d xr9, xr9, 0x94
+    HEVC_UNI_QPEL_H16_LASX xr9, xr14
+    xvld xr9, a2, 16
+    xvpermi.d xr9, xr9, 0x94
+    HEVC_UNI_QPEL_H16_LASX xr9, xr15
+    xvld xr9, a2, 32
+    xvpermi.d xr9, xr9, 0x94
+    HEVC_UNI_QPEL_H16_LASX xr9, xr16
+    xvld xr9, a2, 48
+    xvpermi.d xr9, xr9, 0x94
+    HEVC_UNI_QPEL_H16_LASX xr9, xr17
+    add.d a2, a2, a3
+    xvssrani.bu.h xr15, xr14, 0
+    xvpermi.d xr15, xr15, 0xd8
+    xvst xr15, a0, 0
+    xvssrani.bu.h xr17, xr16, 0
+    xvpermi.d xr17, xr17, 0xd8
+    xvst xr17, a0, 32
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_H64_LASX
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
+ *                                  const uint8_t *_src, ptrdiff_t _srcstride,
+ *                                  int height, int denom, int wx, int ox,
+ *                                  intptr_t mx, intptr_t my, int width)
+ */
+function ff_hevc_put_hevc_epel_uni_w_v4_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    fld.s f6, a2, 0 //0
+    fldx.s f7, a2, a3 //1
+    fldx.s f8, a2, t0 //2
+    add.d a2, a2, t1
+    vilvl.b vr6, vr7, vr6
+    vilvl.b vr7, vr8, vr8
+    vilvl.h vr6, vr7, vr6
+    vreplvei.w vr0, vr0, 0
+.LOOP_UNI_V4:
+    fld.s f9, a2, 0 //3
+    fldx.s f10, a2, a3 //4
+    add.d a2, a2, t0
+    vextrins.b vr6, vr9, 0x30 //insert the 3rd load
+    vextrins.b vr6, vr9, 0x71
+    vextrins.b vr6, vr9, 0xb2
+    vextrins.b vr6, vr9, 0xf3
+    vbsrl.v vr7, vr6, 1
+    vextrins.b vr7, vr10, 0x30 //insert the 4th load
+    vextrins.b vr7, vr10, 0x71
+    vextrins.b vr7, vr10, 0xb2
+    vextrins.b vr7, vr10, 0xf3
+    vdp2.h.bu.b vr8, vr6, vr0 //EPEL_FILTER(src, stride)
+    vdp2.h.bu.b vr9, vr7, vr0
+    vhaddw.w.h vr10, vr8, vr8
+    vhaddw.w.h vr11, vr9, vr9
+    vmulwev.w.h vr10, vr10, vr1 //EPEL_FILTER(src, stride) * wx
+    vmulwev.w.h vr11, vr11, vr1
+    vadd.w vr10, vr10, vr2 // + offset
+    vadd.w vr11, vr11, vr2
+    vsra.w vr10, vr10, vr3 // >> shift
+    vsra.w vr11, vr11, vr3
+    vadd.w vr10, vr10, vr4 // + ox
+    vadd.w vr11, vr11, vr4
+    vssrani.h.w vr11, vr10, 0
+    vssrani.bu.h vr10, vr11, 0
+    vbsrl.v vr6, vr7, 1
+    fst.s f10, a0, 0
+    vbsrl.v vr10, vr10, 4
+    fstx.s f10, a0, a1
+    alsl.d a0, a1, a0, 1
+    addi.d a4, a4, -2
+    bnez a4, .LOOP_UNI_V4
+endfunc
+
+.macro CALC_EPEL_FILTER_LSX out0, out1
+    vdp2.h.bu.b vr12, vr10, vr0 //EPEL_FILTER(src, stride)
+    vdp2add.h.bu.b vr12, vr11, vr5
+    vexth.w.h vr13, vr12
+    vsllwil.w.h vr12, vr12, 0
+    vmulwev.w.h vr12, vr12, vr1 //EPEL_FILTER(src, stride) * wx
+    vmulwev.w.h vr13, vr13, vr1 //EPEL_FILTER(src, stride) * wx
+    vadd.w vr12, vr12, vr2 // + offset
+    vadd.w vr13, vr13, vr2
+    vsra.w vr12, vr12, vr3 // >> shift
+    vsra.w vr13, vr13, vr3
+    vadd.w \out0, vr12, vr4 // + ox
+    vadd.w \out1, vr13, vr4
+.endm
+
+.macro CALC_EPEL_FILTER_LASX out0
+    xvdp2.h.bu.b xr11, xr12, xr0 //EPEL_FILTER(src, stride)
+    xvhaddw.w.h xr12, xr11, xr11
+    xvmulwev.w.h xr12, xr12, xr1 //EPEL_FILTER(src, stride) * wx
+    xvadd.w xr12, xr12, xr2 // + offset
+    xvsra.w xr12, xr12, xr3 // >> shift
+    xvadd.w \out0, xr12, xr4 // + ox
+.endm
+
+//w is a label suffix; it is also used as the condition of the ".if" directive.
+.macro PUT_HEVC_EPEL_UNI_W_V8_LSX w
+    fld.d f6, a2, 0 //0
+    fldx.d f7, a2, a3 //1
+    fldx.d f8, a2, t0 //2
+    add.d a2, a2, t1
+.LOOP_UNI_V8_\w:
+    fld.d f9, a2, 0 // 3
+    add.d a2, a2, a3
+    vilvl.b vr10, vr7, vr6
+    vilvl.b vr11, vr9, vr8
+    vaddi.bu vr6, vr7, 0 //back up previous value
+    vaddi.bu vr7, vr8, 0
+    vaddi.bu vr8, vr9, 0
+    CALC_EPEL_FILTER_LSX vr12, vr13
+    vssrani.h.w vr13, vr12, 0
+    vssrani.bu.h vr13, vr13, 0
+.if \w < 8
+    fst.s f13, a0, 0
+    vstelm.h vr13, a0, 4, 2
+.else
+    fst.d f13, a0, 0
+.endif
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_V8_\w
+.endm
+
+//w is a label suffix; it is also used as the condition of the ".if" directive.
+.macro PUT_HEVC_EPEL_UNI_W_V8_LASX w
+    fld.d f6, a2, 0 //0
+    fldx.d f7, a2, a3 //1
+    fldx.d f8, a2, t0 //2
+    add.d a2, a2, t1
+.LOOP_UNI_V8_LASX_\w:
+    fld.d f9, a2, 0 // 3
+    add.d a2, a2, a3
+    vilvl.b vr10, vr7, vr6
+    vilvl.b vr11, vr9, vr8
+    xvilvl.h xr12, xr11, xr10
+    xvilvh.h xr13, xr11, xr10
+    xvpermi.q xr12, xr13, 0x02
+    vaddi.bu vr6, vr7, 0 //back up previous value
+    vaddi.bu vr7, vr8, 0
+    vaddi.bu vr8, vr9, 0
+    CALC_EPEL_FILTER_LASX xr12
+    xvpermi.q xr13, xr12, 0x01
+    vssrani.h.w vr13, vr12, 0
+    vssrani.bu.h vr13, vr13, 0
+.if \w < 8
+    fst.s f13, a0, 0
+    vstelm.h vr13, a0, 4, 2
+.else
+    fst.d f13, a0, 0
+.endif
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_V8_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_v6_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+    PUT_HEVC_EPEL_UNI_W_V8_LSX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v6_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.w xr0, xr0
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    PUT_HEVC_EPEL_UNI_W_V8_LASX 6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v8_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+    PUT_HEVC_EPEL_UNI_W_V8_LSX 8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v8_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.w xr0, xr0
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    PUT_HEVC_EPEL_UNI_W_V8_LASX 8
+endfunc
+
+//w is a label suffix; it is also used as the condition of the ".if" directive.
+.macro PUT_HEVC_EPEL_UNI_W_V16_LSX w
+    vld vr6, a2, 0 //0
+    vldx vr7, a2, a3 //1
+    vldx vr8, a2, t0 //2
+    add.d a2, a2, t1
+.LOOP_UNI_V16_\w:
+    vld vr9, a2, 0 //3
+    add.d a2, a2, a3
+    vilvl.b vr10, vr7, vr6
+    vilvl.b vr11, vr9, vr8
+    CALC_EPEL_FILTER_LSX vr14, vr15
+    vilvh.b vr10, vr7, vr6
+    vilvh.b vr11, vr9, vr8
+    CALC_EPEL_FILTER_LSX vr16, vr17
+    vssrani.h.w vr15, vr14, 0
+    vssrani.h.w vr17, vr16, 0
+    vssrani.bu.h vr17, vr15, 0
+    vaddi.bu vr6, vr7, 0 //back up previous value
+    vaddi.bu vr7, vr8, 0
+    vaddi.bu vr8, vr9, 0
+.if \w < 16
+    fst.d f17, a0, 0
+    vstelm.w vr17, a0, 8, 2
+.else
+    vst vr17, a0, 0
+.endif
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_V16_\w
+.endm
+
+//w is a label suffix; it is also used as the condition of the ".if" directive.
+.macro PUT_HEVC_EPEL_UNI_W_V16_LASX w
+    vld vr6, a2, 0 //0
+    vldx vr7, a2, a3 //1
+    vldx vr8, a2, t0 //2
+    add.d a2, a2, t1
+.LOOP_UNI_V16_LASX_\w:
+    vld vr9, a2, 0 //3
+    add.d a2, a2, a3
+    xvilvl.b xr10, xr7, xr6
+    xvilvh.b xr11, xr7, xr6
+    xvpermi.q xr11, xr10, 0x20
+    xvilvl.b xr12, xr9, xr8
+    xvilvh.b xr13, xr9, xr8
+    xvpermi.q xr13, xr12, 0x20
+    xvdp2.h.bu.b xr10, xr11, xr0 //EPEL_FILTER(src, stride)
+    xvdp2add.h.bu.b xr10, xr13, xr5
+    xvexth.w.h xr11, xr10
+    xvsllwil.w.h xr10, xr10, 0
+    xvmulwev.w.h xr10, xr10, xr1 //EPEL_FILTER(src, stride) * wx
+    xvmulwev.w.h xr11, xr11, xr1
+    xvadd.w xr10, xr10, xr2 // + offset
+    xvadd.w xr11, xr11, xr2
+    xvsra.w xr10, xr10, xr3 // >> shift
+    xvsra.w xr11, xr11, xr3
+    xvadd.w xr10, xr10, xr4 // + ox
+    xvadd.w xr11, xr11, xr4
+    xvssrani.h.w xr11, xr10, 0
+    xvpermi.q xr10, xr11, 0x01
+    vssrani.bu.h vr10, vr11, 0
+    vaddi.bu vr6, vr7, 0 //back up previous value
+    vaddi.bu vr7, vr8, 0
+    vaddi.bu vr8, vr9, 0
+.if \w < 16
+    fst.d f10, a0, 0
+    vstelm.w vr10, a0, 8, 2
+.else
+    vst vr10, a0, 0
+.endif
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_V16_LASX_\w
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_v12_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v12_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.q xr0, xr0
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    xvrepl128vei.h xr5, xr0, 1
+    xvrepl128vei.h xr0, xr0, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v16_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v16_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.q xr0, xr0
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    xvrepl128vei.h xr5, xr0, 1
+    xvrepl128vei.h xr0, xr0, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v24_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+    addi.d t2, a0, 0 //save init
+    addi.d t3, a2, 0
+    addi.d t4, a4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 24
+    addi.d a0, t2, 16 //increase step
+    addi.d a2, t3, 16
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V8_LSX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v24_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.w xr20, xr0 //save xr0
+    xvreplve0.q xr0, xr0
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    xvrepl128vei.h xr5, xr0, 1
+    xvrepl128vei.h xr0, xr0, 0
+    addi.d t2, a0, 0 //save init
+    addi.d t3, a2, 0
+    addi.d t4, a4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 24
+    addi.d a0, t2, 16 //increase step
+    addi.d a2, t3, 16
+    addi.d a4, t4, 0
+    xvaddi.bu xr0, xr20, 0
+    PUT_HEVC_EPEL_UNI_W_V8_LASX 24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v32_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+    addi.d t2, a0, 0
+    addi.d t3, a2, 0
+    addi.d t4, a4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 32
+    addi.d a0, t2, 16
+    addi.d a2, t3, 16
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 33
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v32_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.q xr0, xr0
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    xvrepl128vei.h xr5, xr0, 1
+    xvrepl128vei.h xr0, xr0, 0
+    addi.d t2, a0, 0
+    addi.d t3, a2, 0
+    addi.d t4, a4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 32
+    addi.d a0, t2, 16
+    addi.d a2, t3, 16
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 33
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v48_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+    addi.d t2, a0, 0
+    addi.d t3, a2, 0
+    addi.d t4, a4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 48
+    addi.d a0, t2, 16
+    addi.d a2, t3, 16
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 49
+    addi.d a0, t2, 32
+    addi.d a2, t3, 32
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 50
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v48_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.q xr0, xr0
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    xvrepl128vei.h xr5, xr0, 1
+    xvrepl128vei.h xr0, xr0, 0
+    addi.d t2, a0, 0
+    addi.d t3, a2, 0
+    addi.d t4, a4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 48
+    addi.d a0, t2, 16
+    addi.d a2, t3, 16
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 49
+    addi.d a0, t2, 32
+    addi.d a2, t3, 32
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 50
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v64_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+    addi.d t2, a0, 0
+    addi.d t3, a2, 0
+    addi.d t4, a4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 64
+    addi.d a0, t2, 16
+    addi.d a2, t3, 16
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 65
+    addi.d a0, t2, 32
+    addi.d a2, t3, 32
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 66
+    addi.d a0, t2, 48
+    addi.d a2, t3, 48
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LSX 67
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_v64_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 8 //my
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.q xr0, xr0
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    sub.d a2, a2, a3 //src -= stride
+    xvrepl128vei.h xr5, xr0, 1
+    xvrepl128vei.h xr0, xr0, 0
+    addi.d t2, a0, 0
+    addi.d t3, a2, 0
+    addi.d t4, a4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 64
+    addi.d a0, t2, 16
+    addi.d a2, t3, 16
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 65
+    addi.d a0, t2, 32
+    addi.d a2, t3, 32
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 66
+    addi.d a0, t2, 48
+    addi.d a2, t3, 48
+    addi.d a4, t4, 0
+    PUT_HEVC_EPEL_UNI_W_V16_LASX 67
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ *                                  const uint8_t *_src, ptrdiff_t _srcstride,
+ *                                  int height, int denom, int wx, int ox,
+ *                                  intptr_t mx, intptr_t my, int width)
+ */
+function ff_hevc_put_hevc_epel_uni_w_h4_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    vreplvei.w vr0, vr0, 0
+    la.local t1, shufb
+    vld vr5, t1, 0
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H4:
+    fld.d f6, a2, 0
+    add.d a2, a2, a3
+    vshuf.b vr6, vr6, vr6, vr5
+    vdp2.h.bu.b vr7, vr6, vr0
+    vhaddw.w.h vr7, vr7, vr7
+    vmulwev.w.h vr7, vr7, vr1
+    vadd.w vr7, vr7, vr2
+    vsra.w vr7, vr7, vr3
+    vadd.w vr7, vr7, vr4
+    vssrani.h.w vr7, vr7, 0
+    vssrani.bu.h vr7, vr7, 0
+    fst.s f7, a0, 0
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H4
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    vreplvei.w vr0, vr0, 0
+    la.local t1, shufb
+    vld vr6, t1, 48
+    vaddi.bu vr7, vr6, 2
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H6:
+    vld vr8, a2, 0
+    add.d a2, a2, a3
+    vshuf.b vr10, vr8, vr8, vr6
+    vshuf.b vr11, vr8, vr8, vr7
+    CALC_EPEL_FILTER_LSX vr14, vr15
+    vssrani.h.w vr15, vr14, 0
+    vssrani.bu.h vr15, vr15, 0
+    fst.s f15, a0, 0
+    vstelm.h vr15, a0, 4, 2
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H6
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.w xr0, xr0
+    la.local t1, shufb
+    xvld xr6, t1, 64
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H6_LASX:
+    vld vr8, a2, 0
+    xvreplve0.q xr8, xr8
+    add.d a2, a2, a3
+    xvshuf.b xr12, xr8, xr8, xr6
+    CALC_EPEL_FILTER_LASX xr14
+    xvpermi.q xr15, xr14, 0x01
+    vssrani.h.w vr15, vr14, 0
+    vssrani.bu.h vr15, vr15, 0
+    fst.s f15, a0, 0
+    vstelm.h vr15, a0, 4, 2
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H6_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    vreplvei.w vr0, vr0, 0
+    la.local t1, shufb
+    vld vr6, t1, 48
+    vaddi.bu vr7, vr6, 2
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H8:
+    vld vr8, a2, 0
+    add.d a2, a2, a3
+    vshuf.b vr10, vr8, vr8, vr6
+    vshuf.b vr11, vr8, vr8, vr7
+    CALC_EPEL_FILTER_LSX vr14, vr15
+    vssrani.h.w vr15, vr14, 0
+    vssrani.bu.h vr15, vr15, 0
+    fst.d f15, a0, 0
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H8
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.w xr0, xr0
+    la.local t1, shufb
+    xvld xr6, t1, 64
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H8_LASX:
+    vld vr8, a2, 0
+    xvreplve0.q xr8, xr8
+    add.d a2, a2, a3
+    xvshuf.b xr12, xr8, xr8, xr6
+    CALC_EPEL_FILTER_LASX xr14
+    xvpermi.q xr15, xr14, 0x01
+    vssrani.h.w vr15, vr14, 0
+    vssrani.bu.h vr15, vr15, 0
+    fst.d f15, a0, 0
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H8_LASX
+endfunc
+
+.macro EPEL_UNI_W_H16_LOOP_LSX idx0, idx1, idx2
+    vld vr8, a2, \idx0
+    vshuf.b vr10, vr8, vr8, vr6
+    vshuf.b vr11, vr8, vr8, vr7
+    CALC_EPEL_FILTER_LSX vr14, vr15
+    vld vr8, a2, \idx1
+    vshuf.b vr10, vr8, vr8, vr6
+    vshuf.b vr11, vr8, vr8, vr7
+    CALC_EPEL_FILTER_LSX vr16, vr17
+    vssrani.h.w vr15, vr14, 0
+    vssrani.h.w vr17, vr16, 0
+    vssrani.bu.h vr17, vr15, 0
+    vst vr17, a0, \idx2
+.endm
+
+.macro EPEL_UNI_W_H16_LOOP_LASX idx0, idx2, w
+    xvld xr8, a2, \idx0
+    xvpermi.d xr9, xr8, 0x09
+    xvreplve0.q xr8, xr8
+    xvshuf.b xr12, xr8, xr8, xr6
+    CALC_EPEL_FILTER_LASX xr14
+    xvreplve0.q xr8, xr9
+    xvshuf.b xr12, xr8, xr8, xr6
+    CALC_EPEL_FILTER_LASX xr16
+    xvssrani.h.w xr16, xr14, 0
+    xvpermi.q xr17, xr16, 0x01
+    vssrani.bu.h vr17, vr16, 0
+    vpermi.w vr17, vr17, 0xd8
+.if \w == 12
+    fst.d f17, a0, 0
+    vstelm.w vr17, a0, 8, 2
+.else
+    vst vr17, a0, \idx2
+.endif
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    vreplvei.w vr0, vr0, 0
+    la.local t1, shufb
+    vld vr6, t1, 48
+    vaddi.bu vr7, vr6, 2
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H12:
+    vld vr8, a2, 0
+    vshuf.b vr10, vr8, vr8, vr6
+    vshuf.b vr11, vr8, vr8, vr7
+    CALC_EPEL_FILTER_LSX vr14, vr15
+    vld vr8, a2, 8
+    vshuf.b vr10, vr8, vr8, vr6
+    vshuf.b vr11, vr8, vr8, vr7
+    CALC_EPEL_FILTER_LSX vr16, vr17
+    vssrani.h.w vr15, vr14, 0
+    vssrani.h.w vr17, vr16, 0
+    vssrani.bu.h vr17, vr15, 0
+    fst.d f17, a0, 0
+    vstelm.w vr17, a0, 8, 2
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H12
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.w xr0, xr0
+    la.local t1, shufb
+    xvld xr6, t1, 64
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H12_LASX:
+    EPEL_UNI_W_H16_LOOP_LASX 0, 0, 12
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H12_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    vreplvei.w vr0, vr0, 0
+    la.local t1, shufb
+    vld vr6, t1, 48
+    vaddi.bu vr7, vr6, 2
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H16:
+    EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H16
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.w xr0, xr0
+    la.local t1, shufb
+    xvld xr6, t1, 64
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H16_LASX:
+    EPEL_UNI_W_H16_LOOP_LASX 0, 0, 16
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H16_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    vreplvei.w vr0, vr0, 0
+    la.local t1, shufb
+    vld vr6, t1, 48
+    vaddi.bu vr7, vr6, 2
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H24:
+    EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+    vld vr8, a2, 16
+    add.d a2, a2, a3
+    vshuf.b vr10, vr8, vr8, vr6
+    vshuf.b vr11, vr8, vr8, vr7
+    CALC_EPEL_FILTER_LSX vr18, vr19
+    vssrani.h.w vr19, vr18, 0
+    vssrani.bu.h vr19, vr19, 0
+    fst.d f19, a0, 16
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H24
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.w xr0, xr0
+    la.local t1, shufb
+    xvld xr6, t1, 64
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H24_LASX:
+    EPEL_UNI_W_H16_LOOP_LASX 0, 0, 24
+    vld vr8, a2, 16
+    add.d a2, a2, a3
+    xvreplve0.q xr8, xr8
+    xvshuf.b xr12, xr8, xr8, xr6
+    CALC_EPEL_FILTER_LASX xr14
+    xvpermi.q xr15, xr14, 0x01
+    vssrani.h.w vr15, vr14, 0
+    vssrani.bu.h vr15, vr15, 0
+    fst.d f15, a0, 16
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H24_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    vreplvei.w vr0, vr0, 0
+    la.local t1, shufb
+    vld vr6, t1, 48
+    vaddi.bu vr7, vr6, 2
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H32:
+    EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0
+    EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H32
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_lasx
+    LOAD_VAR 256
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    xvreplve0.w xr0, xr0
+    la.local t1, shufb
+    xvld xr6, t1, 64
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+.LOOP_UNI_W_H32_LASX:
+    EPEL_UNI_W_H16_LOOP_LASX 0, 0, 32
+    EPEL_UNI_W_H16_LOOP_LASX 16, 16, 32
+    add.d a2, a2, a3
+    add.d a0, a0, a1
+    addi.d a4, a4, -1
+    bnez a4, .LOOP_UNI_W_H32_LASX
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h48_8_lsx
+    LOAD_VAR 128
+    ld.d t0, sp, 0 //mx
+    addi.d t0, t0, -1
+    slli.w t0, t0, 2
+    la.local t1, ff_hevc_epel_filters
+    vldx vr0, t1, t0 //filter
+    vreplvei.w vr0, vr0, 0
+    la.local t1, shufb
+    vld vr6, t1, 48
+    vaddi.bu vr7, vr6, 2
+    slli.d t0, a3, 1 //stride * 2
+    add.d t1, t0, a3 //stride * 3
+    addi.d a2, a2, -1 //src -= 1
+    vreplvei.h vr5, vr0, 1
+    vreplvei.h vr0, vr0, 0
+.LOOP_UNI_W_H48:
+ EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0 + EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16 + EPEL_UNI_W_H16_LOOP_LSX 32, 40, 32 + add.d a2, a2, a3 + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_UNI_W_H48 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_h48_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 //mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr0, t1, t0 //filter + xvreplve0.w xr0, xr0 + la.local t1, shufb + xvld xr6, t1, 64 + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + addi.d a2, a2, -1 //src -= 1 +.LOOP_UNI_W_H48_LASX: + EPEL_UNI_W_H16_LOOP_LASX 0, 0, 48 + EPEL_UNI_W_H16_LOOP_LASX 16, 16, 48 + EPEL_UNI_W_H16_LOOP_LASX 32, 32, 48 + add.d a2, a2, a3 + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_UNI_W_H48_LASX +endfunc + +function ff_hevc_put_hevc_epel_uni_w_h64_8_lsx + LOAD_VAR 128 + ld.d t0, sp, 0 //mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr0, t1, t0 //filter + vreplvei.w vr0, vr0, 0 + la.local t1, shufb + vld vr6, t1, 48 + vaddi.bu vr7, vr6, 2 + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + addi.d a2, a2, -1 //src -= 1 + vreplvei.h vr5, vr0, 1 + vreplvei.h vr0, vr0, 0 +.LOOP_UNI_W_H64: + EPEL_UNI_W_H16_LOOP_LSX 0, 8, 0 + EPEL_UNI_W_H16_LOOP_LSX 16, 24, 16 + EPEL_UNI_W_H16_LOOP_LSX 32, 40, 32 + EPEL_UNI_W_H16_LOOP_LSX 48, 56, 48 + add.d a2, a2, a3 + add.d a0, a0, a1 + addi.d a4, a4, -1 + bnez a4, .LOOP_UNI_W_H64 +endfunc + +function ff_hevc_put_hevc_epel_uni_w_h64_8_lasx + LOAD_VAR 256 + ld.d t0, sp, 0 //mx + addi.d t0, t0, -1 + slli.w t0, t0, 2 + la.local t1, ff_hevc_epel_filters + vldx vr0, t1, t0 //filter + xvreplve0.w xr0, xr0 + la.local t1, shufb + xvld xr6, t1, 64 + slli.d t0, a3, 1 //stride * 2 + add.d t1, t0, a3 //stride * 3 + addi.d a2, a2, -1 //src -= 1 +.LOOP_UNI_W_H64_LASX: + EPEL_UNI_W_H16_LOOP_LASX 0, 0, 64 + EPEL_UNI_W_H16_LOOP_LASX 16, 16, 64 + EPEL_UNI_W_H16_LOOP_LASX 32, 32, 64 + EPEL_UNI_W_H16_LOOP_LASX 48, 48, 64 + add.d a2, 
a2, a3
+    add.d         a0,     a0,     a1
+    addi.d        a4,     a4,     -1
+    bnez          a4,     .LOOP_UNI_W_H64_LASX
+endfunc
+
+/*
+ * void FUNC(put_hevc_epel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride,
+ *                               const uint8_t *_src, ptrdiff_t _srcstride,
+ *                               const int16_t *src2, int height, intptr_t mx,
+ *                               intptr_t my, int width)
+ */
+function ff_hevc_put_hevc_bi_epel_h4_8_lsx
+    addi.d        a6,     a6,     -1
+    slli.w        a6,     a6,     2
+    la.local      t0,     ff_hevc_epel_filters
+    vldx          vr0,    t0,     a6        // filter
+    vreplvei.w    vr0,    vr0,    0
+    la.local      t0,     shufb
+    vld           vr1,    t0,     0         // mask
+    addi.d        a2,     a2,     -1        // src -= 1
+.LOOP_BI_EPEL_H4:
+    vld           vr4,    a4,     0         // src2
+    vld           vr5,    a2,     0
+    add.d         a2,     a2,     a3
+    addi.d        a4,     a4,     128
+    vshuf.b       vr5,    vr5,    vr5,    vr1
+    vdp2.h.bu.b   vr6,    vr5,    vr0       // EPEL_FILTER(src, 1)
+    vsllwil.w.h   vr4,    vr4,    0
+    vhaddw.w.h    vr6,    vr6,    vr6
+    vadd.w        vr6,    vr6,    vr4       // src2[x]
+    vssrani.h.w   vr6,    vr6,    0
+    vssrarni.bu.h vr6,    vr6,    7
+    fst.s         f6,     a0,     0
+    add.d         a0,     a0,     a1
+    addi.d        a5,     a5,     -1
+    bnez          a5,     .LOOP_BI_EPEL_H4
+endfunc
+
+.macro PUT_HEVC_BI_EPEL_H8_LSX in0, in1, in2, in3, out0
+    vshuf.b         vr6,    \in1,   \in0,   \in2
+    vshuf.b         vr7,    \in1,   \in0,   \in3
+    vdp2.h.bu.b     vr8,    vr6,    vr0     // EPEL_FILTER(src, 1)
+    vdp2add.h.bu.b  vr8,    vr7,    vr1     // EPEL_FILTER(src, 1)
+    vsadd.h         \out0,  vr8,    vr4     // src2[x]
+.endm
+
+.macro PUT_HEVC_BI_EPEL_H16_LASX in0, in1, in2, in3, out0
+    xvshuf.b        xr6,    \in1,   \in0,   \in2
+    xvshuf.b        xr7,    \in1,   \in0,   \in3
+    xvdp2.h.bu.b    xr8,    xr6,    xr0     // EPEL_FILTER(src, 1)
+    xvdp2add.h.bu.b xr8,    xr7,    xr1     // EPEL_FILTER(src, 1)
+    xvsadd.h        \out0,  xr8,    xr4     // src2[x]
+.endm
+
+function ff_hevc_put_hevc_bi_epel_h6_8_lsx
+    addi.d        a6,     a6,     -1
+    slli.w        a6,     a6,     2
+    la.local      t0,     ff_hevc_epel_filters
+    vldx          vr0,    t0,     a6        // filter
+    vreplvei.h    vr1,    vr0,    1
+    vreplvei.h    vr0,    vr0,    0
+    la.local      t0,     shufb
+    vld           vr2,    t0,     48        // mask
+    vaddi.bu      vr3,    vr2,    2
+    addi.d        a2,     a2,     -1        // src -= 1
+.LOOP_BI_EPEL_H6:
+    vld           vr4,    a4,     0         // src2
+    vld           vr5,    a2,     0
+    add.d         a2,     a2,     a3
+    addi.d        a4,     a4,     128
+    PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr7
+    vssrarni.bu.h vr7,    vr7,    7
+    fst.s
f7, a0, 0 + vstelm.h vr7, a0, 4, 2 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H6 +endfunc + +function ff_hevc_put_hevc_bi_epel_h8_8_lsx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6 // filter + vreplvei.h vr1, vr0, 1 + vreplvei.h vr0, vr0, 0 + la.local t0, shufb + vld vr2, t0, 48// mask + vaddi.bu vr3, vr2, 2 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H8: + vld vr4, a4, 0 // src2 + vld vr5, a2, 0 + add.d a2, a2, a3 + addi.d a4, a4, 128 + PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr7 + vssrarni.bu.h vr7, vr7, 7 + fst.d f7, a0, 0 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H8 +endfunc + +function ff_hevc_put_hevc_bi_epel_h12_8_lsx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6 // filter + vreplvei.h vr1, vr0, 1 + vreplvei.h vr0, vr0, 0 + la.local t0, shufb + vld vr2, t0, 48// mask + vaddi.bu vr3, vr2, 2 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H12: + vld vr4, a4, 0 // src2 + vld vr5, a2, 0 + PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr11 + vld vr5, a2, 8 + vld vr4, a4, 16 + PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12 + vssrarni.bu.h vr12, vr11, 7 + fst.d f12, a0, 0 + vstelm.w vr12, a0, 8, 2 + add.d a2, a2, a3 + addi.d a4, a4, 128 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H12 +endfunc + +function ff_hevc_put_hevc_bi_epel_h12_8_lasx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6 // filter + xvreplve0.q xr0, xr0 + xvrepl128vei.h xr1, xr0, 1 + xvrepl128vei.h xr0, xr0, 0 + la.local t0, shufb + xvld xr2, t0, 96// mask + xvaddi.bu xr3, xr2, 2 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H12_LASX: + xvld xr4, a4, 0 // src2 + xvld xr5, a2, 0 + xvpermi.d xr5, xr5, 0x94 + PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9 + xvpermi.q xr10, xr9, 0x01 + vssrarni.bu.h vr10, vr9, 7 + fst.d f10, a0, 0 + vstelm.w vr10, a0, 8, 2 + add.d a2, a2, a3 + addi.d a4, a4, 
128 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H12_LASX +endfunc + +function ff_hevc_put_hevc_bi_epel_h16_8_lsx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6 // filter + vreplvei.h vr1, vr0, 1 + vreplvei.h vr0, vr0, 0 + la.local t0, shufb + vld vr2, t0, 48// mask + vaddi.bu vr3, vr2, 2 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H16: + vld vr4, a4, 0 // src2 + vld vr5, a2, 0 + PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr11 + vld vr5, a2, 8 + vld vr4, a4, 16 + PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12 + vssrarni.bu.h vr12, vr11, 7 + vst vr12, a0, 0 + add.d a2, a2, a3 + addi.d a4, a4, 128 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H16 +endfunc + +function ff_hevc_put_hevc_bi_epel_h16_8_lasx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6 // filter + xvreplve0.q xr0, xr0 + xvrepl128vei.h xr1, xr0, 1 + xvrepl128vei.h xr0, xr0, 0 + la.local t0, shufb + xvld xr2, t0, 96// mask + xvaddi.bu xr3, xr2, 2 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H16_LASX: + xvld xr4, a4, 0 // src2 + xvld xr5, a2, 0 + xvpermi.d xr5, xr5, 0x94 + PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9 + xvpermi.q xr10, xr9, 0x01 + vssrarni.bu.h vr10, vr9, 7 + vst vr10, a0, 0 + add.d a2, a2, a3 + addi.d a4, a4, 128 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H16_LASX +endfunc + +function ff_hevc_put_hevc_bi_epel_h32_8_lasx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6 // filter + xvreplve0.q xr0, xr0 + xvrepl128vei.h xr1, xr0, 1 + xvrepl128vei.h xr0, xr0, 0 + la.local t0, shufb + xvld xr2, t0, 96// mask + xvaddi.bu xr3, xr2, 2 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H32_LASX: + xvld xr4, a4, 0 // src2 + xvld xr5, a2, 0 + xvpermi.q xr15, xr5, 0x01 + xvpermi.d xr5, xr5, 0x94 + PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr9 + xvld xr4, a4, 32 + xvld xr15, a2, 16 + xvpermi.d 
xr15, xr15, 0x94 + PUT_HEVC_BI_EPEL_H16_LASX xr15, xr15, xr2, xr3, xr11 + xvssrarni.bu.h xr11, xr9, 7 + xvpermi.d xr11, xr11, 0xd8 + xvst xr11, a0, 0 + add.d a2, a2, a3 + addi.d a4, a4, 128 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H32_LASX +endfunc + +function ff_hevc_put_hevc_bi_epel_h48_8_lsx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6// filter + vreplvei.h vr1, vr0, 1 + vreplvei.h vr0, vr0, 0 + la.local t0, shufb + vld vr2, t0, 48// mask + vaddi.bu vr3, vr2, 2 + vaddi.bu vr21, vr2, 8 + vaddi.bu vr22, vr2, 10 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H48: + vld vr4, a4, 0 // src2 + vld vr5, a2, 0 + vld vr9, a2, 16 + vld vr10, a2, 32 + vld vr11, a2, 48 + PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr12 + vld vr4, a4, 16 + PUT_HEVC_BI_EPEL_H8_LSX vr5, vr9, vr21, vr22, vr13 + vld vr4, a4, 32 + PUT_HEVC_BI_EPEL_H8_LSX vr9, vr9, vr2, vr3, vr14 + vld vr4, a4, 48 + PUT_HEVC_BI_EPEL_H8_LSX vr9, vr10, vr21, vr22, vr15 + vld vr4, a4, 64 + PUT_HEVC_BI_EPEL_H8_LSX vr10, vr10, vr2, vr3, vr16 + vld vr4, a4, 80 + PUT_HEVC_BI_EPEL_H8_LSX vr10, vr11, vr21, vr22, vr17 + vssrarni.bu.h vr13, vr12, 7 + vssrarni.bu.h vr15, vr14, 7 + vssrarni.bu.h vr17, vr16, 7 + vst vr13, a0, 0 + vst vr15, a0, 16 + vst vr17, a0, 32 + add.d a2, a2, a3 + addi.d a4, a4, 128 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H48 +endfunc + +function ff_hevc_put_hevc_bi_epel_h48_8_lasx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6 // filter + xvreplve0.q xr0, xr0 + xvrepl128vei.h xr1, xr0, 1 + xvrepl128vei.h xr0, xr0, 0 + la.local t0, shufb + xvld xr2, t0, 96// mask + xvaddi.bu xr3, xr2, 2 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H48_LASX: + xvld xr4, a4, 0 // src2 + xvld xr5, a2, 0 + xvld xr9, a2, 32 + xvpermi.d xr10, xr9, 0x94 + xvpermi.q xr9, xr5, 0x21 + xvpermi.d xr9, xr9, 0x94 + xvpermi.d xr5, xr5, 0x94 + PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr11 + 
xvld xr4, a4, 32 + PUT_HEVC_BI_EPEL_H16_LASX xr9, xr9, xr2, xr3, xr12 + xvld xr4, a4, 64 + PUT_HEVC_BI_EPEL_H16_LASX xr10, xr10, xr2, xr3, xr13 + xvssrarni.bu.h xr12, xr11, 7 + xvpermi.d xr12, xr12, 0xd8 + xvpermi.q xr14, xr13, 0x01 + vssrarni.bu.h vr14, vr13, 7 + xvst xr12, a0, 0 + vst vr14, a0, 32 + add.d a2, a2, a3 + addi.d a4, a4, 128 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H48_LASX +endfunc + +function ff_hevc_put_hevc_bi_epel_h64_8_lsx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6// filter + vreplvei.h vr1, vr0, 1 + vreplvei.h vr0, vr0, 0 + la.local t0, shufb + vld vr2, t0, 48// mask + vaddi.bu vr3, vr2, 2 + vaddi.bu vr21, vr2, 8 + vaddi.bu vr22, vr2, 10 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H64: + vld vr4, a4, 0 // src2 + vld vr5, a2, 0 + vld vr9, a2, 16 + vld vr10, a2, 32 + vld vr11, a2, 48 + vld vr12, a2, 64 + PUT_HEVC_BI_EPEL_H8_LSX vr5, vr5, vr2, vr3, vr13 + vld vr4, a4, 16 + PUT_HEVC_BI_EPEL_H8_LSX vr5, vr9, vr21, vr22, vr14 + vld vr4, a4, 32 + PUT_HEVC_BI_EPEL_H8_LSX vr9, vr9, vr2, vr3, vr15 + vld vr4, a4, 48 + PUT_HEVC_BI_EPEL_H8_LSX vr9, vr10, vr21, vr22, vr16 + vld vr4, a4, 64 + PUT_HEVC_BI_EPEL_H8_LSX vr10, vr10, vr2, vr3, vr17 + vld vr4, a4, 80 + PUT_HEVC_BI_EPEL_H8_LSX vr10, vr11, vr21, vr22, vr18 + vld vr4, a4, 96 + PUT_HEVC_BI_EPEL_H8_LSX vr11, vr11, vr2, vr3, vr19 + vld vr4, a4, 112 + PUT_HEVC_BI_EPEL_H8_LSX vr11, vr12, vr21, vr22, vr20 + vssrarni.bu.h vr14, vr13, 7 + vssrarni.bu.h vr16, vr15, 7 + vssrarni.bu.h vr18, vr17, 7 + vssrarni.bu.h vr20, vr19, 7 + vst vr14, a0, 0 + vst vr16, a0, 16 + vst vr18, a0, 32 + vst vr20, a0, 48 + add.d a2, a2, a3 + addi.d a4, a4, 128 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H64 +endfunc + +function ff_hevc_put_hevc_bi_epel_h64_8_lasx + addi.d a6, a6, -1 + slli.w a6, a6, 2 + la.local t0, ff_hevc_epel_filters + vldx vr0, t0, a6 // filter + xvreplve0.q xr0, xr0 + xvrepl128vei.h xr1, xr0, 1 + xvrepl128vei.h xr0, 
xr0, 0 + la.local t0, shufb + xvld xr2, t0, 96// mask + xvaddi.bu xr3, xr2, 2 + addi.d a2, a2, -1 // src -= 1 +.LOOP_BI_EPEL_H64_LASX: + xvld xr4, a4, 0 // src2 + xvld xr5, a2, 0 + xvld xr9, a2, 32 + xvld xr11, a2, 48 + xvpermi.d xr11, xr11, 0x94 + xvpermi.d xr10, xr9, 0x94 + xvpermi.q xr9, xr5, 0x21 + xvpermi.d xr9, xr9, 0x94 + xvpermi.d xr5, xr5, 0x94 + PUT_HEVC_BI_EPEL_H16_LASX xr5, xr5, xr2, xr3, xr12 + xvld xr4, a4, 32 + PUT_HEVC_BI_EPEL_H16_LASX xr9, xr9, xr2, xr3, xr13 + xvld xr4, a4, 64 + PUT_HEVC_BI_EPEL_H16_LASX xr10, xr10, xr2, xr3, xr14 + xvld xr4, a4, 96 + PUT_HEVC_BI_EPEL_H16_LASX xr11, xr11, xr2, xr3, xr15 + xvssrarni.bu.h xr13, xr12, 7 + xvssrarni.bu.h xr15, xr14, 7 + xvpermi.d xr13, xr13, 0xd8 + xvpermi.d xr15, xr15, 0xd8 + xvst xr13, a0, 0 + xvst xr15, a0, 32 + add.d a2, a2, a3 + addi.d a4, a4, 128 + add.d a0, a0, a1 + addi.d a5, a5, -1 + bnez a5, .LOOP_BI_EPEL_H64_LASX +endfunc diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c index 245a833947..2756755733 100644 --- a/libavcodec/loongarch/hevcdsp_init_loongarch.c +++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c @@ -124,8 +124,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->put_hevc_qpel_bi[8][0][1] = ff_hevc_put_hevc_bi_qpel_h48_8_lsx; c->put_hevc_qpel_bi[9][0][1] = ff_hevc_put_hevc_bi_qpel_h64_8_lsx; + c->put_hevc_epel_bi[1][0][1] = ff_hevc_put_hevc_bi_epel_h4_8_lsx; + c->put_hevc_epel_bi[2][0][1] = ff_hevc_put_hevc_bi_epel_h6_8_lsx; + c->put_hevc_epel_bi[3][0][1] = ff_hevc_put_hevc_bi_epel_h8_8_lsx; + c->put_hevc_epel_bi[4][0][1] = ff_hevc_put_hevc_bi_epel_h12_8_lsx; + c->put_hevc_epel_bi[5][0][1] = ff_hevc_put_hevc_bi_epel_h16_8_lsx; c->put_hevc_epel_bi[6][0][1] = ff_hevc_put_hevc_bi_epel_h24_8_lsx; c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lsx; + c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lsx; + c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lsx; 
c->put_hevc_epel_bi[4][1][0] = ff_hevc_put_hevc_bi_epel_v12_8_lsx; c->put_hevc_epel_bi[5][1][0] = ff_hevc_put_hevc_bi_epel_v16_8_lsx; @@ -138,6 +145,14 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->put_hevc_epel_bi[6][1][1] = ff_hevc_put_hevc_bi_epel_hv24_8_lsx; c->put_hevc_epel_bi[7][1][1] = ff_hevc_put_hevc_bi_epel_hv32_8_lsx; + c->put_hevc_qpel_uni[1][0][1] = ff_hevc_put_hevc_uni_qpel_h4_8_lsx; + c->put_hevc_qpel_uni[2][0][1] = ff_hevc_put_hevc_uni_qpel_h6_8_lsx; + c->put_hevc_qpel_uni[3][0][1] = ff_hevc_put_hevc_uni_qpel_h8_8_lsx; + c->put_hevc_qpel_uni[4][0][1] = ff_hevc_put_hevc_uni_qpel_h12_8_lsx; + c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_uni_qpel_h16_8_lsx; + c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_uni_qpel_h24_8_lsx; + c->put_hevc_qpel_uni[7][0][1] = ff_hevc_put_hevc_uni_qpel_h32_8_lsx; + c->put_hevc_qpel_uni[8][0][1] = ff_hevc_put_hevc_uni_qpel_h48_8_lsx; c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_uni_qpel_h64_8_lsx; c->put_hevc_qpel_uni[6][1][0] = ff_hevc_put_hevc_uni_qpel_v24_8_lsx; @@ -191,6 +206,26 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->put_hevc_epel_uni_w[8][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels48_8_lsx; c->put_hevc_epel_uni_w[9][0][0] = ff_hevc_put_hevc_pel_uni_w_pixels64_8_lsx; + c->put_hevc_epel_uni_w[1][0][1] = ff_hevc_put_hevc_epel_uni_w_h4_8_lsx; + c->put_hevc_epel_uni_w[2][0][1] = ff_hevc_put_hevc_epel_uni_w_h6_8_lsx; + c->put_hevc_epel_uni_w[3][0][1] = ff_hevc_put_hevc_epel_uni_w_h8_8_lsx; + c->put_hevc_epel_uni_w[4][0][1] = ff_hevc_put_hevc_epel_uni_w_h12_8_lsx; + c->put_hevc_epel_uni_w[5][0][1] = ff_hevc_put_hevc_epel_uni_w_h16_8_lsx; + c->put_hevc_epel_uni_w[6][0][1] = ff_hevc_put_hevc_epel_uni_w_h24_8_lsx; + c->put_hevc_epel_uni_w[7][0][1] = ff_hevc_put_hevc_epel_uni_w_h32_8_lsx; + c->put_hevc_epel_uni_w[8][0][1] = ff_hevc_put_hevc_epel_uni_w_h48_8_lsx; + c->put_hevc_epel_uni_w[9][0][1] = ff_hevc_put_hevc_epel_uni_w_h64_8_lsx; + + 
c->put_hevc_epel_uni_w[1][1][0] = ff_hevc_put_hevc_epel_uni_w_v4_8_lsx; + c->put_hevc_epel_uni_w[2][1][0] = ff_hevc_put_hevc_epel_uni_w_v6_8_lsx; + c->put_hevc_epel_uni_w[3][1][0] = ff_hevc_put_hevc_epel_uni_w_v8_8_lsx; + c->put_hevc_epel_uni_w[4][1][0] = ff_hevc_put_hevc_epel_uni_w_v12_8_lsx; + c->put_hevc_epel_uni_w[5][1][0] = ff_hevc_put_hevc_epel_uni_w_v16_8_lsx; + c->put_hevc_epel_uni_w[6][1][0] = ff_hevc_put_hevc_epel_uni_w_v24_8_lsx; + c->put_hevc_epel_uni_w[7][1][0] = ff_hevc_put_hevc_epel_uni_w_v32_8_lsx; + c->put_hevc_epel_uni_w[8][1][0] = ff_hevc_put_hevc_epel_uni_w_v48_8_lsx; + c->put_hevc_epel_uni_w[9][1][0] = ff_hevc_put_hevc_epel_uni_w_v64_8_lsx; + c->put_hevc_qpel_uni_w[3][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv8_8_lsx; c->put_hevc_qpel_uni_w[5][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv16_8_lsx; c->put_hevc_qpel_uni_w[6][1][1] = ff_hevc_put_hevc_uni_w_qpel_hv24_8_lsx; @@ -277,6 +312,15 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->put_hevc_epel_uni_w[8][1][1] = ff_hevc_put_hevc_epel_uni_w_hv48_8_lasx; c->put_hevc_epel_uni_w[9][1][1] = ff_hevc_put_hevc_epel_uni_w_hv64_8_lasx; + c->put_hevc_epel_uni_w[2][0][1] = ff_hevc_put_hevc_epel_uni_w_h6_8_lasx; + c->put_hevc_epel_uni_w[3][0][1] = ff_hevc_put_hevc_epel_uni_w_h8_8_lasx; + c->put_hevc_epel_uni_w[4][0][1] = ff_hevc_put_hevc_epel_uni_w_h12_8_lasx; + c->put_hevc_epel_uni_w[5][0][1] = ff_hevc_put_hevc_epel_uni_w_h16_8_lasx; + c->put_hevc_epel_uni_w[6][0][1] = ff_hevc_put_hevc_epel_uni_w_h24_8_lasx; + c->put_hevc_epel_uni_w[7][0][1] = ff_hevc_put_hevc_epel_uni_w_h32_8_lasx; + c->put_hevc_epel_uni_w[8][0][1] = ff_hevc_put_hevc_epel_uni_w_h48_8_lasx; + c->put_hevc_epel_uni_w[9][0][1] = ff_hevc_put_hevc_epel_uni_w_h64_8_lasx; + c->put_hevc_qpel_uni_w[3][1][0] = ff_hevc_put_hevc_qpel_uni_w_v8_8_lasx; c->put_hevc_qpel_uni_w[4][1][0] = ff_hevc_put_hevc_qpel_uni_w_v12_8_lasx; c->put_hevc_qpel_uni_w[5][1][0] = ff_hevc_put_hevc_qpel_uni_w_v16_8_lasx; @@ -285,6 +329,15 @@ void 
ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->put_hevc_qpel_uni_w[8][1][0] = ff_hevc_put_hevc_qpel_uni_w_v48_8_lasx; c->put_hevc_qpel_uni_w[9][1][0] = ff_hevc_put_hevc_qpel_uni_w_v64_8_lasx; + c->put_hevc_epel_uni_w[2][1][0] = ff_hevc_put_hevc_epel_uni_w_v6_8_lasx; + c->put_hevc_epel_uni_w[3][1][0] = ff_hevc_put_hevc_epel_uni_w_v8_8_lasx; + c->put_hevc_epel_uni_w[4][1][0] = ff_hevc_put_hevc_epel_uni_w_v12_8_lasx; + c->put_hevc_epel_uni_w[5][1][0] = ff_hevc_put_hevc_epel_uni_w_v16_8_lasx; + c->put_hevc_epel_uni_w[6][1][0] = ff_hevc_put_hevc_epel_uni_w_v24_8_lasx; + c->put_hevc_epel_uni_w[7][1][0] = ff_hevc_put_hevc_epel_uni_w_v32_8_lasx; + c->put_hevc_epel_uni_w[8][1][0] = ff_hevc_put_hevc_epel_uni_w_v48_8_lasx; + c->put_hevc_epel_uni_w[9][1][0] = ff_hevc_put_hevc_epel_uni_w_v64_8_lasx; + c->put_hevc_qpel_uni_w[1][0][1] = ff_hevc_put_hevc_qpel_uni_w_h4_8_lasx; c->put_hevc_qpel_uni_w[2][0][1] = ff_hevc_put_hevc_qpel_uni_w_h6_8_lasx; c->put_hevc_qpel_uni_w[3][0][1] = ff_hevc_put_hevc_qpel_uni_w_h8_8_lasx; @@ -294,6 +347,19 @@ void ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->put_hevc_qpel_uni_w[7][0][1] = ff_hevc_put_hevc_qpel_uni_w_h32_8_lasx; c->put_hevc_qpel_uni_w[8][0][1] = ff_hevc_put_hevc_qpel_uni_w_h48_8_lasx; c->put_hevc_qpel_uni_w[9][0][1] = ff_hevc_put_hevc_qpel_uni_w_h64_8_lasx; + + c->put_hevc_qpel_uni[4][0][1] = ff_hevc_put_hevc_uni_qpel_h12_8_lasx; + c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_uni_qpel_h16_8_lasx; + c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_uni_qpel_h24_8_lasx; + c->put_hevc_qpel_uni[7][0][1] = ff_hevc_put_hevc_uni_qpel_h32_8_lasx; + c->put_hevc_qpel_uni[8][0][1] = ff_hevc_put_hevc_uni_qpel_h48_8_lasx; + c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_uni_qpel_h64_8_lasx; + + c->put_hevc_epel_bi[4][0][1] = ff_hevc_put_hevc_bi_epel_h12_8_lasx; + c->put_hevc_epel_bi[5][0][1] = ff_hevc_put_hevc_bi_epel_h16_8_lasx; + c->put_hevc_epel_bi[7][0][1] = 
ff_hevc_put_hevc_bi_epel_h32_8_lasx; + c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lasx; + c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lasx; } } } diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h index 7f09d0943a..5db35eed47 100644 --- a/libavcodec/loongarch/hevcdsp_lasx.h +++ b/libavcodec/loongarch/hevcdsp_lasx.h @@ -75,6 +75,60 @@ PEL_UNI_W(epel, hv, 32); PEL_UNI_W(epel, hv, 48); PEL_UNI_W(epel, hv, 64); +PEL_UNI_W(epel, v, 6); +PEL_UNI_W(epel, v, 8); +PEL_UNI_W(epel, v, 12); +PEL_UNI_W(epel, v, 16); +PEL_UNI_W(epel, v, 24); +PEL_UNI_W(epel, v, 32); +PEL_UNI_W(epel, v, 48); +PEL_UNI_W(epel, v, 64); + +PEL_UNI_W(epel, h, 6); +PEL_UNI_W(epel, h, 8); +PEL_UNI_W(epel, h, 12); +PEL_UNI_W(epel, h, 16); +PEL_UNI_W(epel, h, 24); +PEL_UNI_W(epel, h, 32); +PEL_UNI_W(epel, h, 48); +PEL_UNI_W(epel, h, 64); + #undef PEL_UNI_W +#define UNI_MC(PEL, DIR, WIDTH) \ +void ff_hevc_put_hevc_uni_##PEL##_##DIR##WIDTH##_8_lasx(uint8_t *dst, \ + ptrdiff_t dst_stride, \ + const uint8_t *src, \ + ptrdiff_t src_stride, \ + int height, \ + intptr_t mx, \ + intptr_t my, \ + int width) +UNI_MC(qpel, h, 12); +UNI_MC(qpel, h, 16); +UNI_MC(qpel, h, 24); +UNI_MC(qpel, h, 32); +UNI_MC(qpel, h, 48); +UNI_MC(qpel, h, 64); + +#undef UNI_MC + +#define BI_MC(PEL, DIR, WIDTH) \ +void ff_hevc_put_hevc_bi_##PEL##_##DIR##WIDTH##_8_lasx(uint8_t *dst, \ + ptrdiff_t dst_stride, \ + const uint8_t *src, \ + ptrdiff_t src_stride, \ + const int16_t *src_16bit, \ + int height, \ + intptr_t mx, \ + intptr_t my, \ + int width) +BI_MC(epel, h, 12); +BI_MC(epel, h, 16); +BI_MC(epel, h, 32); +BI_MC(epel, h, 48); +BI_MC(epel, h, 64); + +#undef BI_MC + #endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H diff --git a/libavcodec/loongarch/hevcdsp_lsx.h b/libavcodec/loongarch/hevcdsp_lsx.h index 7769cf25ae..a5ef237b5d 100644 --- a/libavcodec/loongarch/hevcdsp_lsx.h +++ b/libavcodec/loongarch/hevcdsp_lsx.h @@ -126,8 +126,15 @@ BI_MC(qpel, hv, 32); 
 BI_MC(qpel, hv, 48);
 BI_MC(qpel, hv, 64);
+BI_MC(epel, h, 4);
+BI_MC(epel, h, 6);
+BI_MC(epel, h, 8);
+BI_MC(epel, h, 12);
+BI_MC(epel, h, 16);
 BI_MC(epel, h, 24);
 BI_MC(epel, h, 32);
+BI_MC(epel, h, 48);
+BI_MC(epel, h, 64);
 BI_MC(epel, v, 12);
 BI_MC(epel, v, 16);
@@ -151,7 +158,14 @@ void ff_hevc_put_hevc_uni_##PEL##_##DIR##WIDTH##_8_lsx(uint8_t *dst, \
                                                        intptr_t mx, \
                                                        intptr_t my, \
                                                        int width)
-
+UNI_MC(qpel, h, 4);
+UNI_MC(qpel, h, 6);
+UNI_MC(qpel, h, 8);
+UNI_MC(qpel, h, 12);
+UNI_MC(qpel, h, 16);
+UNI_MC(qpel, h, 24);
+UNI_MC(qpel, h, 32);
+UNI_MC(qpel, h, 48);
 UNI_MC(qpel, h, 64);
 UNI_MC(qpel, v, 24);
@@ -287,6 +301,26 @@ PEL_UNI_W(epel, hv, 32);
 PEL_UNI_W(epel, hv, 48);
 PEL_UNI_W(epel, hv, 64);
+PEL_UNI_W(epel, h, 4);
+PEL_UNI_W(epel, h, 6);
+PEL_UNI_W(epel, h, 8);
+PEL_UNI_W(epel, h, 12);
+PEL_UNI_W(epel, h, 16);
+PEL_UNI_W(epel, h, 24);
+PEL_UNI_W(epel, h, 32);
+PEL_UNI_W(epel, h, 48);
+PEL_UNI_W(epel, h, 64);
+
+PEL_UNI_W(epel, v, 4);
+PEL_UNI_W(epel, v, 6);
+PEL_UNI_W(epel, v, 8);
+PEL_UNI_W(epel, v, 12);
+PEL_UNI_W(epel, v, 16);
+PEL_UNI_W(epel, v, 24);
+PEL_UNI_W(epel, v, 32);
+PEL_UNI_W(epel, v, 48);
+PEL_UNI_W(epel, v, 64);
+
 #undef PEL_UNI_W
 #endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LSX_H

From patchwork Wed Dec 27 04:50:19 2023
X-Patchwork-Submitter: =?utf-8?b?6YeR5rOi?=
X-Patchwork-Id: 45345
From: jinbo
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 27 Dec 2023 12:50:19 +0800
Message-Id: <20231227045019.25078-8-jinbo@loongson.cn>
In-Reply-To: <20231227045019.25078-1-jinbo@loongson.cn>
References: <20231227045019.25078-1-jinbo@loongson.cn>
Subject: [FFmpeg-devel] [PATCH v2 7/7] avcodec/hevc: Add ff_hevc_idct_32x32_lasx asm opt
Cc: yuanhecai

From: yuanhecai

tests/checkasm/checkasm:     C        LSX      LASX
hevc_idct_32x32_8_c:         1243.0   211.7    101.7

Speedup of decoding H265 4K 30FPS 30Mbps on 3A6000 with 8 threads is
1fps (56fps --> 57fps).
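For orientation, the checkasm timings quoted above translate into roughly a 5.9x (LSX) and 12.2x (LASX) speedup over the C reference for the 32x32 IDCT. A minimal sketch of that arithmetic (the dictionary layout is illustrative, not checkasm's output format):

```python
# Cycle counts quoted in the commit message for hevc_idct_32x32_8
# (lower is better). The dict keys here are illustrative only.
timings = {"C": 1243.0, "LSX": 211.7, "LASX": 101.7}

def speedup(baseline: float, optimized: float) -> float:
    """Return how many times faster the optimized version runs."""
    return baseline / optimized

for isa in ("LSX", "LASX"):
    print(f"{isa}: {speedup(timings['C'], timings[isa]):.1f}x faster than C")
```

Note that the end-to-end decode gain (56fps --> 57fps) is far smaller than the per-function gain, since the IDCT is only one stage of the decoder.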
---
 libavcodec/loongarch/Makefile                 |   3 +-
 libavcodec/loongarch/hevc_idct.S              | 863 ++++++++++++++++++
 libavcodec/loongarch/hevc_idct_lsx.c          |  10 +-
 libavcodec/loongarch/hevcdsp_init_loongarch.c |   2 +
 libavcodec/loongarch/hevcdsp_lasx.h           |   2 +
 5 files changed, 874 insertions(+), 6 deletions(-)
 create mode 100644 libavcodec/loongarch/hevc_idct.S

diff --git a/libavcodec/loongarch/Makefile b/libavcodec/loongarch/Makefile
index ad98cd4054..07da2964e4 100644
--- a/libavcodec/loongarch/Makefile
+++ b/libavcodec/loongarch/Makefile
@@ -29,7 +29,8 @@ LSX-OBJS-$(CONFIG_HEVC_DECODER)       += loongarch/hevcdsp_lsx.o \
                                          loongarch/hevc_mc_uni_lsx.o \
                                          loongarch/hevc_mc_uniw_lsx.o \
                                          loongarch/hevc_add_res.o \
-                                         loongarch/hevc_mc.o
+                                         loongarch/hevc_mc.o \
+                                         loongarch/hevc_idct.o
 LSX-OBJS-$(CONFIG_H264DSP)            += loongarch/h264idct.o \
                                          loongarch/h264idct_loongarch.o \
                                          loongarch/h264dsp.o
diff --git a/libavcodec/loongarch/hevc_idct.S b/libavcodec/loongarch/hevc_idct.S
new file mode 100644
index 0000000000..5593e5fd73
--- /dev/null
+++ b/libavcodec/loongarch/hevc_idct.S
@@ -0,0 +1,863 @@
+/*
+ * Copyright (c) 2023 Loongson Technology Corporation Limited
+ * Contributed by Hecai Yuan
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "loongson_asm.S"
+
+.macro fr_store
+    addi.d        sp,     sp,     -64
+    fst.d         f24,    sp,     0
+    fst.d         f25,    sp,     8
+    fst.d         f26,    sp,     16
+    fst.d         f27,    sp,     24
+    fst.d         f28,    sp,     32
+    fst.d         f29,    sp,     40
+    fst.d         f30,    sp,     48
+    fst.d         f31,    sp,     56
+.endm
+
+.macro fr_recover
+    fld.d         f24,    sp,     0
+    fld.d         f25,    sp,     8
+    fld.d         f26,    sp,     16
+    fld.d         f27,    sp,     24
+    fld.d         f28,    sp,     32
+    fld.d         f29,    sp,     40
+    fld.d         f30,    sp,     48
+    fld.d         f31,    sp,     56
+    addi.d        sp,     sp,     64
+.endm
+
+.macro malloc_space number
+    li.w          t0,     \number
+    sub.d         sp,     sp,     t0
+    fr_store
+.endm
+
+.macro free_space number
+    fr_recover
+    li.w          t0,     \number
+    add.d         sp,     sp,     t0
+.endm
+
+.extern gt32x32_cnst1
+
+.extern gt32x32_cnst2
+
+.extern gt8x8_cnst
+
+.extern gt32x32_cnst0
+
+.macro idct_16x32_step1_lasx
+    xvldrepl.w    xr20,   t1,     0
+    xvldrepl.w    xr21,   t1,     4
+    xvldrepl.w    xr22,   t1,     8
+    xvldrepl.w    xr23,   t1,     12
+
+    xvmulwev.w.h  xr16,   xr8,    xr20
+    xvmaddwod.w.h xr16,   xr8,    xr20
+    xvmulwev.w.h  xr17,   xr9,    xr20
+    xvmaddwod.w.h xr17,   xr9,    xr20
+
+    xvmaddwev.w.h xr16,   xr10,   xr21
+    xvmaddwod.w.h xr16,   xr10,   xr21
+    xvmaddwev.w.h xr17,   xr11,   xr21
+    xvmaddwod.w.h xr17,   xr11,   xr21
+
+    xvmaddwev.w.h xr16,   xr12,   xr22
+    xvmaddwod.w.h xr16,   xr12,   xr22
+    xvmaddwev.w.h xr17,   xr13,   xr22
+    xvmaddwod.w.h xr17,   xr13,   xr22
+
+    xvmaddwev.w.h xr16,   xr14,   xr23
+    xvmaddwod.w.h xr16,   xr14,   xr23
+    xvmaddwev.w.h xr17,   xr15,   xr23
+    xvmaddwod.w.h xr17,   xr15,   xr23
+
+    xvld          xr0,    t2,     0
+    xvld          xr1,    t2,     32
+
+    xvadd.w       xr18,   xr0,    xr16
+    xvadd.w       xr19,   xr1,    xr17
+    xvsub.w       xr0,    xr0,    xr16
+    xvsub.w       xr1,    xr1,    xr17
+
+    xvst          xr18,   t2,     0
+    xvst          xr19,   t2,     32
+    xvst          xr0,    t3,     0
+    xvst          xr1,    t3,     32
+.endm
+
+.macro idct_16x32_step2_lasx in0, in1, in2, in3, in4, in5, in6, in7, out0, out1
+
+    xvldrepl.w    xr20,   t1,     0
+    xvldrepl.w    xr21,   t1,     4
+    xvldrepl.w    xr22,   t1,     8
+    xvldrepl.w    xr23,   t1,     12
+
+    xvmulwev.w.h  \out0,  \in0,   xr20
+    xvmaddwod.w.h \out0,  \in0,   xr20
+    xvmulwev.w.h  \out1,  \in1,   xr20
+    xvmaddwod.w.h \out1,  \in1,   xr20
+    xvmaddwev.w.h \out0,  \in2,   xr21
+    xvmaddwod.w.h \out0,  \in2,   xr21
+    xvmaddwev.w.h \out1,  \in3,   xr21
+    xvmaddwod.w.h \out1,  \in3,   xr21
+    xvmaddwev.w.h \out0,  \in4,   xr22
+    xvmaddwod.w.h \out0,  \in4,   xr22
+    xvmaddwev.w.h \out1,  \in5,   xr22
+    xvmaddwod.w.h \out1,  \in5,   xr22
+    xvmaddwev.w.h \out0,  \in6,   xr23
+    xvmaddwod.w.h \out0,  \in6,   xr23
+    xvmaddwev.w.h \out1,  \in7,   xr23    // sum0_r
+    xvmaddwod.w.h \out1,  \in7,   xr23    // sum0_l
+.endm
+
+    /* loop for all columns of filter constants */
+.macro idct_16x32_step3_lasx round
+    xvadd.w       xr16,   xr16,   xr30
+    xvadd.w       xr17,   xr17,   xr31
+
+    xvld          xr0,    t2,     0
+    xvld          xr1,    t2,     32
+
+    xvadd.w       xr30,   xr0,    xr16
+    xvadd.w       xr31,   xr1,    xr17
+    xvsub.w       xr16,   xr0,    xr16
+    xvsub.w       xr17,   xr1,    xr17
+    xvssrarni.h.w xr31,   xr30,   \round
+    xvssrarni.h.w xr17,   xr16,   \round
+    xvst          xr31,   t4,     0
+    xvst          xr17,   t5,     0
+.endm
+
+.macro idct_16x32_lasx buf_pitch, round
+    addi.d        t2,     sp,     64
+
+    addi.d        t0,     a0,     \buf_pitch*4*2
+
+    // 4 12 20 28
+    xvld          xr0,    t0,     0
+    xvld          xr1,    t0,     \buf_pitch*8*2
+    xvld          xr2,    t0,     \buf_pitch*16*2
+    xvld          xr3,    t0,     \buf_pitch*24*2
+
+    xvilvl.h      xr10,   xr1,    xr0
+    xvilvh.h      xr11,   xr1,    xr0
+    xvilvl.h      xr12,   xr3,    xr2
+    xvilvh.h      xr13,   xr3,    xr2
+
+    la.local      t1,     gt32x32_cnst2
+
+    xvldrepl.w    xr20,   t1,     0
+    xvldrepl.w    xr21,   t1,     4
+    xvmulwev.w.h  xr14,   xr10,   xr20
+    xvmaddwod.w.h xr14,   xr10,   xr20
+    xvmulwev.w.h  xr15,   xr11,   xr20
+    xvmaddwod.w.h xr15,   xr11,   xr20
+    xvmaddwev.w.h xr14,   xr12,   xr21
+    xvmaddwod.w.h xr14,   xr12,   xr21
+    xvmaddwev.w.h xr15,   xr13,   xr21
+    xvmaddwod.w.h xr15,   xr13,   xr21
+
+    xvldrepl.w    xr20,   t1,     8
+    xvldrepl.w    xr21,   t1,     12
+    xvmulwev.w.h  xr16,   xr10,   xr20
+    xvmaddwod.w.h xr16,   xr10,   xr20
+    xvmulwev.w.h  xr17,   xr11,   xr20
+    xvmaddwod.w.h xr17,   xr11,   xr20
+    xvmaddwev.w.h xr16,   xr12,   xr21
+    xvmaddwod.w.h xr16,   xr12,   xr21
+    xvmaddwev.w.h xr17,   xr13,   xr21
+    xvmaddwod.w.h xr17,   xr13,   xr21
+
+    xvldrepl.w    xr20,
t1, 16 + xvldrepl.w xr21, t1, 20 + xvmulwev.w.h xr18, xr10, xr20 + xvmaddwod.w.h xr18, xr10, xr20 + xvmulwev.w.h xr19, xr11, xr20 + xvmaddwod.w.h xr19, xr11, xr20 + xvmaddwev.w.h xr18, xr12, xr21 + xvmaddwod.w.h xr18, xr12, xr21 + xvmaddwev.w.h xr19, xr13, xr21 + xvmaddwod.w.h xr19, xr13, xr21 + + xvldrepl.w xr20, t1, 24 + xvldrepl.w xr21, t1, 28 + xvmulwev.w.h xr22, xr10, xr20 + xvmaddwod.w.h xr22, xr10, xr20 + xvmulwev.w.h xr23, xr11, xr20 + xvmaddwod.w.h xr23, xr11, xr20 + xvmaddwev.w.h xr22, xr12, xr21 + xvmaddwod.w.h xr22, xr12, xr21 + xvmaddwev.w.h xr23, xr13, xr21 + xvmaddwod.w.h xr23, xr13, xr21 + + /* process coeff 0, 8, 16, 24 */ + la.local t1, gt8x8_cnst + + xvld xr0, a0, 0 + xvld xr1, a0, \buf_pitch*8*2 + xvld xr2, a0, \buf_pitch*16*2 + xvld xr3, a0, \buf_pitch*24*2 + + xvldrepl.w xr20, t1, 0 + xvldrepl.w xr21, t1, 4 + + xvilvl.h xr10, xr2, xr0 + xvilvh.h xr11, xr2, xr0 + xvilvl.h xr12, xr3, xr1 + xvilvh.h xr13, xr3, xr1 + + xvmulwev.w.h xr4, xr10, xr20 + xvmaddwod.w.h xr4, xr10, xr20 // sum0_r + xvmulwev.w.h xr5, xr11, xr20 + xvmaddwod.w.h xr5, xr11, xr20 // sum0_l + xvmulwev.w.h xr6, xr12, xr21 + xvmaddwod.w.h xr6, xr12, xr21 // tmp1_r + xvmulwev.w.h xr7, xr13, xr21 + xvmaddwod.w.h xr7, xr13, xr21 // tmp1_l + + xvsub.w xr0, xr4, xr6 // sum1_r + xvadd.w xr1, xr4, xr6 // sum0_r + xvsub.w xr2, xr5, xr7 // sum1_l + xvadd.w xr3, xr5, xr7 // sum0_l + + // HEVC_EVEN16_CALC + xvsub.w xr24, xr1, xr14 // 7 + xvsub.w xr25, xr3, xr15 + xvadd.w xr14, xr1, xr14 // 0 + xvadd.w xr15, xr3, xr15 + xvst xr24, t2, 7*16*4 // 448=16*28=7*16*4 + xvst xr25, t2, 7*16*4+32 // 480 + xvst xr14, t2, 0 + xvst xr15, t2, 32 + + xvsub.w xr26, xr0, xr22 // 4 + xvsub.w xr27, xr2, xr23 + xvadd.w xr22, xr0, xr22 // 3 + xvadd.w xr23, xr2, xr23 + xvst xr26, t2, 4*16*4 // 256=4*16*4 + xvst xr27, t2, 4*16*4+32 // 288 + xvst xr22, t2, 3*16*4 // 192=3*16*4 + xvst xr23, t2, 3*16*4+32 // 224 + + xvldrepl.w xr20, t1, 16 + xvldrepl.w xr21, t1, 20 + + xvmulwev.w.h xr4, xr10, xr20 + xvmaddwod.w.h 
xr4, xr10, xr20 + xvmulwev.w.h xr5, xr11, xr20 + xvmaddwod.w.h xr5, xr11, xr20 + xvmulwev.w.h xr6, xr12, xr21 + xvmaddwod.w.h xr6, xr12, xr21 + xvmulwev.w.h xr7, xr13, xr21 + xvmaddwod.w.h xr7, xr13, xr21 + + xvsub.w xr0, xr4, xr6 // sum1_r + xvadd.w xr1, xr4, xr6 // sum0_r + xvsub.w xr2, xr5, xr7 // sum1_l + xvadd.w xr3, xr5, xr7 // sum0_l + + // HEVC_EVEN16_CALC + xvsub.w xr24, xr1, xr16 // 6 + xvsub.w xr25, xr3, xr17 + xvadd.w xr16, xr1, xr16 // 1 + xvadd.w xr17, xr3, xr17 + xvst xr24, t2, 6*16*4 // 384=6*16*4 + xvst xr25, t2, 6*16*4+32 // 416 + xvst xr16, t2, 1*16*4 // 64=1*16*4 + xvst xr17, t2, 1*16*4+32 // 96 + + xvsub.w xr26, xr0, xr18 // 5 + xvsub.w xr27, xr2, xr19 + xvadd.w xr18, xr0, xr18 // 2 + xvadd.w xr19, xr2, xr19 + xvst xr26, t2, 5*16*4 // 320=5*16*4 + xvst xr27, t2, 5*16*4+32 // 352 + xvst xr18, t2, 2*16*4 // 128=2*16*4 + xvst xr19, t2, 2*16*4+32 // 160 + + /* process coeff 2 6 10 14 18 22 26 30 */ + addi.d t0, a0, \buf_pitch*2*2 + + xvld xr0, t0, 0 + xvld xr1, t0, \buf_pitch*4*2 + xvld xr2, t0, \buf_pitch*8*2 + xvld xr3, t0, \buf_pitch*12*2 + + xvld xr4, t0, \buf_pitch*16*2 + xvld xr5, t0, \buf_pitch*20*2 + xvld xr6, t0, \buf_pitch*24*2 + xvld xr7, t0, \buf_pitch*28*2 + + xvilvl.h xr8, xr1, xr0 + xvilvh.h xr9, xr1, xr0 + xvilvl.h xr10, xr3, xr2 + xvilvh.h xr11, xr3, xr2 + xvilvl.h xr12, xr5, xr4 + xvilvh.h xr13, xr5, xr4 + xvilvl.h xr14, xr7, xr6 + xvilvh.h xr15, xr7, xr6 + + la.local t1, gt32x32_cnst1 + + addi.d t2, sp, 64 + addi.d t3, sp, 64+960 // 30*32 + + idct_16x32_step1_lasx + +.rept 7 + addi.d t1, t1, 16 + addi.d t2, t2, 64 + addi.d t3, t3, -64 + idct_16x32_step1_lasx +.endr + + addi.d t0, a0, \buf_pitch*2 + + xvld xr0, t0, 0 + xvld xr1, t0, \buf_pitch*2*2 + xvld xr2, t0, \buf_pitch*4*2 + xvld xr3, t0, \buf_pitch*6*2 + xvld xr4, t0, \buf_pitch*8*2 + xvld xr5, t0, \buf_pitch*10*2 + xvld xr6, t0, \buf_pitch*12*2 + xvld xr7, t0, \buf_pitch*14*2 + + xvilvl.h xr8, xr1, xr0 + xvilvh.h xr9, xr1, xr0 + xvilvl.h xr10, xr3, xr2 + xvilvh.h xr11, xr3, 
xr2 + xvilvl.h xr12, xr5, xr4 + xvilvh.h xr13, xr5, xr4 + xvilvl.h xr14, xr7, xr6 + xvilvh.h xr15, xr7, xr6 + + la.local t1, gt32x32_cnst0 + + idct_16x32_step2_lasx xr8, xr9, xr10, xr11, xr12, xr13, \ + xr14, xr15, xr16, xr17 + + addi.d t0, a0, \buf_pitch*16*2+\buf_pitch*2 + + xvld xr0, t0, 0 + xvld xr1, t0, \buf_pitch*2*2 + xvld xr2, t0, \buf_pitch*4*2 + xvld xr3, t0, \buf_pitch*6*2 + xvld xr4, t0, \buf_pitch*8*2 + xvld xr5, t0, \buf_pitch*10*2 + xvld xr6, t0, \buf_pitch*12*2 + xvld xr7, t0, \buf_pitch*14*2 + + xvilvl.h xr18, xr1, xr0 + xvilvh.h xr19, xr1, xr0 + xvilvl.h xr24, xr3, xr2 + xvilvh.h xr25, xr3, xr2 + xvilvl.h xr26, xr5, xr4 + xvilvh.h xr27, xr5, xr4 + xvilvl.h xr28, xr7, xr6 + xvilvh.h xr29, xr7, xr6 + + addi.d t1, t1, 16 + idct_16x32_step2_lasx xr18, xr19, xr24, xr25, xr26, xr27, \ + xr28, xr29, xr30, xr31 + + addi.d t4, a0, 0 + addi.d t5, a0, \buf_pitch*31*2 + addi.d t2, sp, 64 + + idct_16x32_step3_lasx \round + +.rept 15 + + addi.d t1, t1, 16 + idct_16x32_step2_lasx xr8, xr9, xr10, xr11, xr12, xr13, \ + xr14, xr15, xr16, xr17 + + addi.d t1, t1, 16 + idct_16x32_step2_lasx xr18, xr19, xr24, xr25, xr26, xr27, \ + xr28, xr29, xr30, xr31 + + addi.d t2, t2, 64 + addi.d t4, t4, \buf_pitch*2 + addi.d t5, t5, -\buf_pitch*2 + + idct_16x32_step3_lasx \round +.endr + +.endm + +function hevc_idct_16x32_column_step1_lasx + malloc_space 512+512+512 + + idct_16x32_lasx 32, 7 + + free_space 512+512+512 +endfunc + +function hevc_idct_16x32_column_step2_lasx + malloc_space 512+512+512 + + idct_16x32_lasx 16, 12 + + free_space 512+512+512 +endfunc + +function hevc_idct_transpose_32x16_to_16x32_lasx + fr_store + + xvld xr0, a0, 0 + xvld xr1, a0, 64 + xvld xr2, a0, 128 + xvld xr3, a0, 192 + xvld xr4, a0, 256 + xvld xr5, a0, 320 + xvld xr6, a0, 384 + xvld xr7, a0, 448 + + xvpermi.q xr8, xr0, 0x01 + xvpermi.q xr9, xr1, 0x01 + xvpermi.q xr10, xr2, 0x01 + xvpermi.q xr11, xr3, 0x01 + xvpermi.q xr12, xr4, 0x01 + xvpermi.q xr13, xr5, 0x01 + xvpermi.q xr14, xr6, 0x01 + 
xvpermi.q xr15, xr7, 0x01 + + LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \ + vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \ + vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + addi.d a0, a0, 512 + + vld vr24, a0, 0 + vld vr25, a0, 64 + vld vr26, a0, 128 + vld vr27, a0, 192 + vld vr28, a0, 256 + vld vr29, a0, 320 + vld vr30, a0, 384 + vld vr31, a0, 448 + + LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + xvpermi.q xr0, xr24, 0x02 + xvpermi.q xr1, xr25, 0x02 + xvpermi.q xr2, xr26, 0x02 + xvpermi.q xr3, xr27, 0x02 + xvpermi.q xr4, xr28, 0x02 + xvpermi.q xr5, xr29, 0x02 + xvpermi.q xr6, xr30, 0x02 + xvpermi.q xr7, xr31, 0x02 + + xvst xr0, a1, 0 + xvst xr1, a1, 32 + xvst xr2, a1, 64 + xvst xr3, a1, 96 + xvst xr4, a1, 128 + xvst xr5, a1, 160 + xvst xr6, a1, 192 + xvst xr7, a1, 224 + + addi.d a1, a1, 256 + addi.d a0, a0, 16 + + vld vr24, a0, 0 + vld vr25, a0, 64 + vld vr26, a0, 128 + vld vr27, a0, 192 + vld vr28, a0, 256 + vld vr29, a0, 320 + vld vr30, a0, 384 + vld vr31, a0, 448 + + LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + xvpermi.q xr8, xr24, 0x02 + xvpermi.q xr9, xr25, 0x02 + xvpermi.q xr10, xr26, 0x02 + xvpermi.q xr11, xr27, 0x02 + xvpermi.q xr12, xr28, 0x02 + xvpermi.q xr13, xr29, 0x02 + xvpermi.q xr14, xr30, 0x02 + xvpermi.q xr15, xr31, 0x02 + + xvst xr8, a1, 0 + xvst xr9, a1, 32 + xvst xr10, a1, 64 + xvst xr11, a1, 96 + xvst xr12, a1, 128 + xvst xr13, a1, 160 + xvst xr14, a1, 192 + xvst xr15, a1, 224 + + // second + addi.d a0, a0, 32-512-16 + + xvld xr0, a0, 0 + xvld xr1, a0, 64 + xvld xr2, a0, 128 + xvld xr3, a0, 192 + xvld xr4, a0, 
256 + xvld xr5, a0, 320 + xvld xr6, a0, 384 + xvld xr7, a0, 448 + + xvpermi.q xr8, xr0, 0x01 + xvpermi.q xr9, xr1, 0x01 + xvpermi.q xr10, xr2, 0x01 + xvpermi.q xr11, xr3, 0x01 + xvpermi.q xr12, xr4, 0x01 + xvpermi.q xr13, xr5, 0x01 + xvpermi.q xr14, xr6, 0x01 + xvpermi.q xr15, xr7, 0x01 + + LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \ + vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \ + vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + addi.d a0, a0, 512 + + vld vr24, a0, 0 + vld vr25, a0, 64 + vld vr26, a0, 128 + vld vr27, a0, 192 + vld vr28, a0, 256 + vld vr29, a0, 320 + vld vr30, a0, 384 + vld vr31, a0, 448 + + LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + xvpermi.q xr0, xr24, 0x02 + xvpermi.q xr1, xr25, 0x02 + xvpermi.q xr2, xr26, 0x02 + xvpermi.q xr3, xr27, 0x02 + xvpermi.q xr4, xr28, 0x02 + xvpermi.q xr5, xr29, 0x02 + xvpermi.q xr6, xr30, 0x02 + xvpermi.q xr7, xr31, 0x02 + + addi.d a1, a1, 256 + xvst xr0, a1, 0 + xvst xr1, a1, 32 + xvst xr2, a1, 64 + xvst xr3, a1, 96 + xvst xr4, a1, 128 + xvst xr5, a1, 160 + xvst xr6, a1, 192 + xvst xr7, a1, 224 + + addi.d a1, a1, 256 + addi.d a0, a0, 16 + + vld vr24, a0, 0 + vld vr25, a0, 64 + vld vr26, a0, 128 + vld vr27, a0, 192 + vld vr28, a0, 256 + vld vr29, a0, 320 + vld vr30, a0, 384 + vld vr31, a0, 448 + + LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + xvpermi.q xr8, xr24, 0x02 + xvpermi.q xr9, xr25, 0x02 + xvpermi.q xr10, xr26, 0x02 + xvpermi.q xr11, xr27, 0x02 + xvpermi.q xr12, xr28, 0x02 + xvpermi.q xr13, xr29, 0x02 + xvpermi.q xr14, xr30, 0x02 + xvpermi.q xr15, xr31, 0x02 + + xvst xr8, 
a1, 0 + xvst xr9, a1, 32 + xvst xr10, a1, 64 + xvst xr11, a1, 96 + xvst xr12, a1, 128 + xvst xr13, a1, 160 + xvst xr14, a1, 192 + xvst xr15, a1, 224 + + fr_recover +endfunc + +function hevc_idct_transpose_16x32_to_32x16_lasx + fr_store + + xvld xr0, a0, 0 + xvld xr1, a0, 32 + xvld xr2, a0, 64 + xvld xr3, a0, 96 + xvld xr4, a0, 128 + xvld xr5, a0, 160 + xvld xr6, a0, 192 + xvld xr7, a0, 224 + + xvpermi.q xr8, xr0, 0x01 + xvpermi.q xr9, xr1, 0x01 + xvpermi.q xr10, xr2, 0x01 + xvpermi.q xr11, xr3, 0x01 + xvpermi.q xr12, xr4, 0x01 + xvpermi.q xr13, xr5, 0x01 + xvpermi.q xr14, xr6, 0x01 + xvpermi.q xr15, xr7, 0x01 + + LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \ + vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \ + vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + addi.d a0, a0, 256 + + vld vr24, a0, 0 + vld vr25, a0, 32 + vld vr26, a0, 64 + vld vr27, a0, 96 + vld vr28, a0, 128 + vld vr29, a0, 160 + vld vr30, a0, 192 + vld vr31, a0, 224 + + LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + xvpermi.q xr0, xr24, 0x02 + xvpermi.q xr1, xr25, 0x02 + xvpermi.q xr2, xr26, 0x02 + xvpermi.q xr3, xr27, 0x02 + xvpermi.q xr4, xr28, 0x02 + xvpermi.q xr5, xr29, 0x02 + xvpermi.q xr6, xr30, 0x02 + xvpermi.q xr7, xr31, 0x02 + + xvst xr0, a1, 0 + xvst xr1, a1, 64 + xvst xr2, a1, 128 + xvst xr3, a1, 192 + xvst xr4, a1, 256 + xvst xr5, a1, 320 + xvst xr6, a1, 384 + xvst xr7, a1, 448 + + addi.d a1, a1, 512 + addi.d a0, a0, 16 + + vld vr24, a0, 0 + vld vr25, a0, 32 + vld vr26, a0, 64 + vld vr27, a0, 96 + vld vr28, a0, 128 + vld vr29, a0, 160 + vld vr30, a0, 192 + vld vr31, a0, 224 + + LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr24, vr25, vr26, vr27, vr28, vr29, 
vr30, vr31, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + xvpermi.q xr8, xr24, 0x02 + xvpermi.q xr9, xr25, 0x02 + xvpermi.q xr10, xr26, 0x02 + xvpermi.q xr11, xr27, 0x02 + xvpermi.q xr12, xr28, 0x02 + xvpermi.q xr13, xr29, 0x02 + xvpermi.q xr14, xr30, 0x02 + xvpermi.q xr15, xr31, 0x02 + + xvst xr8, a1, 0 + xvst xr9, a1, 64 + xvst xr10, a1, 128 + xvst xr11, a1, 192 + xvst xr12, a1, 256 + xvst xr13, a1, 320 + xvst xr14, a1, 384 + xvst xr15, a1, 448 + + // second + addi.d a0, a0, 256-16 + + xvld xr0, a0, 0 + xvld xr1, a0, 32 + xvld xr2, a0, 64 + xvld xr3, a0, 96 + xvld xr4, a0, 128 + xvld xr5, a0, 160 + xvld xr6, a0, 192 + xvld xr7, a0, 224 + + xvpermi.q xr8, xr0, 0x01 + xvpermi.q xr9, xr1, 0x01 + xvpermi.q xr10, xr2, 0x01 + xvpermi.q xr11, xr3, 0x01 + xvpermi.q xr12, xr4, 0x01 + xvpermi.q xr13, xr5, 0x01 + xvpermi.q xr14, xr6, 0x01 + xvpermi.q xr15, xr7, 0x01 + + LSX_TRANSPOSE8x8_H vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \ + vr0, vr1, vr2, vr3, vr4, vr5, vr6, vr7, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + LSX_TRANSPOSE8x8_H vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \ + vr8, vr9, vr10, vr11, vr12, vr13, vr14, vr15, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + addi.d a0, a0, 256 + + vld vr24, a0, 0 + vld vr25, a0, 32 + vld vr26, a0, 64 + vld vr27, a0, 96 + vld vr28, a0, 128 + vld vr29, a0, 160 + vld vr30, a0, 192 + vld vr31, a0, 224 + + LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + xvpermi.q xr0, xr24, 0x02 + xvpermi.q xr1, xr25, 0x02 + xvpermi.q xr2, xr26, 0x02 + xvpermi.q xr3, xr27, 0x02 + xvpermi.q xr4, xr28, 0x02 + xvpermi.q xr5, xr29, 0x02 + xvpermi.q xr6, xr30, 0x02 + xvpermi.q xr7, xr31, 0x02 + + addi.d a1, a1, -512+32 + + xvst xr0, a1, 0 + xvst xr1, a1, 64 + xvst xr2, a1, 128 + xvst xr3, a1, 192 + xvst xr4, a1, 256 + xvst xr5, a1, 320 + xvst xr6, a1, 384 + xvst xr7, a1, 448 + + addi.d a1, a1, 512 + 
addi.d a0, a0, 16 + + vld vr24, a0, 0 + vld vr25, a0, 32 + vld vr26, a0, 64 + vld vr27, a0, 96 + vld vr28, a0, 128 + vld vr29, a0, 160 + vld vr30, a0, 192 + vld vr31, a0, 224 + + LSX_TRANSPOSE8x8_H vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr24, vr25, vr26, vr27, vr28, vr29, vr30, vr31, \ + vr16, vr17, vr18, vr19, vr20, vr21, vr22, vr23 + + xvpermi.q xr8, xr24, 0x02 + xvpermi.q xr9, xr25, 0x02 + xvpermi.q xr10, xr26, 0x02 + xvpermi.q xr11, xr27, 0x02 + xvpermi.q xr12, xr28, 0x02 + xvpermi.q xr13, xr29, 0x02 + xvpermi.q xr14, xr30, 0x02 + xvpermi.q xr15, xr31, 0x02 + + xvst xr8, a1, 0 + xvst xr9, a1, 64 + xvst xr10, a1, 128 + xvst xr11, a1, 192 + xvst xr12, a1, 256 + xvst xr13, a1, 320 + xvst xr14, a1, 384 + xvst xr15, a1, 448 + + fr_recover +endfunc + +function ff_hevc_idct_32x32_lasx + + addi.d t7, a0, 0 + addi.d t6, a1, 0 + + addi.d sp, sp, -8 + st.d ra, sp, 0 + + bl hevc_idct_16x32_column_step1_lasx + + addi.d a0, a0, 32 + + bl hevc_idct_16x32_column_step1_lasx + + malloc_space (16*32+31)*2 + + addi.d t8, sp, 64+31*2 // tmp_buf_ptr + + addi.d a0, t7, 0 + addi.d a1, t8, 0 + bl hevc_idct_transpose_32x16_to_16x32_lasx + + addi.d a0, t8, 0 + bl hevc_idct_16x32_column_step2_lasx + + addi.d a0, t8, 0 + addi.d a1, t7, 0 + bl hevc_idct_transpose_16x32_to_32x16_lasx + + // second + addi.d a0, t7, 32*8*2*2 + addi.d a1, t8, 0 + bl hevc_idct_transpose_32x16_to_16x32_lasx + + addi.d a0, t8, 0 + bl hevc_idct_16x32_column_step2_lasx + + addi.d a0, t8, 0 + addi.d a1, t7, 32*8*2*2 + bl hevc_idct_transpose_16x32_to_32x16_lasx + + free_space (16*32+31)*2 + + ld.d ra, sp, 0 + addi.d sp, sp, 8 + +endfunc diff --git a/libavcodec/loongarch/hevc_idct_lsx.c b/libavcodec/loongarch/hevc_idct_lsx.c index 2193b27546..527279d85d 100644 --- a/libavcodec/loongarch/hevc_idct_lsx.c +++ b/libavcodec/loongarch/hevc_idct_lsx.c @@ -23,18 +23,18 @@ #include "libavutil/loongarch/loongson_intrinsics.h" #include "hevcdsp_lsx.h" -static const int16_t gt8x8_cnst[16] __attribute__ ((aligned 
(64))) = { +const int16_t gt8x8_cnst[16] __attribute__ ((aligned (64))) = { 64, 64, 83, 36, 89, 50, 18, 75, 64, -64, 36, -83, 75, -89, -50, -18 }; -static const int16_t gt16x16_cnst[64] __attribute__ ((aligned (64))) = { +const int16_t gt16x16_cnst[64] __attribute__ ((aligned (64))) = { 64, 83, 64, 36, 89, 75, 50, 18, 90, 80, 57, 25, 70, 87, 9, 43, 64, 36, -64, -83, 75, -18, -89, -50, 87, 9, -80, -70, -43, 57, -25, -90, 64, -36, -64, 83, 50, -89, 18, 75, 80, -70, -25, 90, -87, 9, 43, 57, 64, -83, 64, -36, 18, -50, 75, -89, 70, -87, 90, -80, 9, -43, -57, 25 }; -static const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = { +const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = { 90, 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4, 90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13, 88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22, @@ -53,14 +53,14 @@ static const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = { 4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90 }; -static const int16_t gt32x32_cnst1[64] __attribute__ ((aligned (64))) = { +const int16_t gt32x32_cnst1[64] __attribute__ ((aligned (64))) = { 90, 87, 80, 70, 57, 43, 25, 9, 87, 57, 9, -43, -80, -90, -70, -25, 80, 9, -70, -87, -25, 57, 90, 43, 70, -43, -87, 9, 90, 25, -80, -57, 57, -80, -25, 90, -9, -87, 43, 70, 43, -90, 57, 25, -87, 70, 9, -80, 25, -70, 90, -80, 43, 9, -57, 87, 9, -25, 43, -57, 70, -80, 87, -90 }; -static const int16_t gt32x32_cnst2[16] __attribute__ ((aligned (64))) = { +const int16_t gt32x32_cnst2[16] __attribute__ ((aligned (64))) = { 89, 75, 50, 18, 75, -18, -89, -50, 50, -89, 18, 75, 18, -50, 75, -89 }; diff --git a/libavcodec/loongarch/hevcdsp_init_loongarch.c b/libavcodec/loongarch/hevcdsp_init_loongarch.c index 2756755733..1585bda276 100644 --- a/libavcodec/loongarch/hevcdsp_init_loongarch.c +++ b/libavcodec/loongarch/hevcdsp_init_loongarch.c @@ -360,6 +360,8 @@ void 
ff_hevc_dsp_init_loongarch(HEVCDSPContext *c, const int bit_depth) c->put_hevc_epel_bi[7][0][1] = ff_hevc_put_hevc_bi_epel_h32_8_lasx; c->put_hevc_epel_bi[8][0][1] = ff_hevc_put_hevc_bi_epel_h48_8_lasx; c->put_hevc_epel_bi[9][0][1] = ff_hevc_put_hevc_bi_epel_h64_8_lasx; + + c->idct[3] = ff_hevc_idct_32x32_lasx; } } } diff --git a/libavcodec/loongarch/hevcdsp_lasx.h b/libavcodec/loongarch/hevcdsp_lasx.h index 5db35eed47..714cbf5880 100644 --- a/libavcodec/loongarch/hevcdsp_lasx.h +++ b/libavcodec/loongarch/hevcdsp_lasx.h @@ -131,4 +131,6 @@ BI_MC(epel, h, 64); #undef BI_MC +void ff_hevc_idct_32x32_lasx(int16_t *coeffs, int col_limit); + #endif // #ifndef AVCODEC_LOONGARCH_HEVCDSP_LASX_H