From patchwork Fri Sep 9 09:41:46 2022
X-Patchwork-Submitter: 陈昊 (Hao Chen)
X-Patchwork-Id: 37784
From: Hao Chen
To: ffmpeg-devel@ffmpeg.org
Date: Fri, 9 Sep 2022 17:41:46 +0800
Message-Id: <20220909094147.23928-2-chenhao@loongson.cn>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20220909094147.23928-1-chenhao@loongson.cn>
References: <20220909094147.23928-1-chenhao@loongson.cn>
Subject: [FFmpeg-devel] [PATCH v2 1/2] lavc/mips: Fix bugs in me_cmp_msa.c file.
Cc: Lu Wang

From: Lu Wang

This patch fixes a bug where fate-checkasm-motion fails when h is not a
multiple of 8.
---
 libavcodec/mips/me_cmp_msa.c | 201 ++++++++++++++++++++++++++++++-----
 1 file changed, 173 insertions(+), 28 deletions(-)

diff --git a/libavcodec/mips/me_cmp_msa.c b/libavcodec/mips/me_cmp_msa.c
index 00a3cfd53f..351494161f 100644
--- a/libavcodec/mips/me_cmp_msa.c
+++ b/libavcodec/mips/me_cmp_msa.c
@@ -25,11 +25,13 @@ static uint32_t sad_8width_msa(const uint8_t *src, int32_t src_stride,
                                const uint8_t *ref, int32_t ref_stride,
                                int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 2;
+    int res = (height & 0x03);
     v16u8 src0, src1, src2, src3, ref0, ref1, ref2, ref3;
+    v8u16 zero = { 0 };
     v8u16 sad = { 0 };

-    for (ht_cnt = (height >> 2); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         LD_UB4(src, src_stride, src0, src1, src2, src3);
         src += (4 * src_stride);
         LD_UB4(ref, ref_stride, ref0, ref1, ref2, ref3);
@@ -39,6 +41,16 @@ static uint32_t sad_8width_msa(const uint8_t *src, int32_t src_stride,
                    src0, src1, ref0, ref1);
         sad += SAD_UB2_UH(src0, src1, ref0, ref1);
     }
+    for (; res--; ) {
+        v16u8 diff;
+        src0 = LD_UB(src);
+        ref0 = LD_UB(ref);
+        src += src_stride;
+        ref += ref_stride;
+        diff = __msa_asub_u_b((v16u8) src0, (v16u8) ref0);
+        diff = (v16u8)__msa_ilvr_d((v2i64)zero, (v2i64)diff);
+        sad += __msa_hadd_u_h((v16u8) diff, (v16u8) diff);
+    }

     return (HADD_UH_U32(sad));
 }
@@ -47,11 +59,12 @@ static uint32_t sad_16width_msa(const uint8_t *src, int32_t src_stride,
                                 const uint8_t *ref, int32_t ref_stride,
                                 int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 2;
+    int res = (height & 0x03);
     v16u8 src0, src1, ref0, ref1;
     v8u16 sad = { 0 };

-    for (ht_cnt = (height >> 2); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         LD_UB2(src, src_stride, src0, src1);
         src += (2 * src_stride);
         LD_UB2(ref, ref_stride, ref0, ref1);
@@ -64,7 +77,15 @@ static uint32_t sad_16width_msa(const uint8_t *src, int32_t src_stride,
         ref += (2 * ref_stride);
         sad += SAD_UB2_UH(src0, src1, ref0, ref1);
     }
-
+    for (; res > 0; res--) {
+        v16u8 diff;
+        src0 = LD_UB(src);
+        ref0 = LD_UB(ref);
+        src += src_stride;
+        ref += ref_stride;
+        diff = __msa_asub_u_b((v16u8) src0, (v16u8) ref0);
+        sad += __msa_hadd_u_h((v16u8) diff, (v16u8) diff);
+    }
     return (HADD_UH_U32(sad));
 }

@@ -74,12 +95,14 @@ static uint32_t sad_horiz_bilinear_filter_8width_msa(const uint8_t *src,
                                                      int32_t ref_stride,
                                                      int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 3;
+    int32_t res = height & 0x07;
     v16u8 src0, src1, src2, src3, comp0, comp1;
     v16u8 ref0, ref1, ref2, ref3, ref4, ref5;
+    v8u16 zero = { 0 };
     v8u16 sad = { 0 };

-    for (ht_cnt = (height >> 3); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         LD_UB4(src, src_stride, src0, src1, src2, src3);
         src += (4 * src_stride);
         LD_UB4(ref, ref_stride, ref0, ref1, ref2, ref3);
@@ -107,6 +130,18 @@ static uint32_t sad_horiz_bilinear_filter_8width_msa(const uint8_t *src,
         sad += SAD_UB2_UH(src0, src1, comp0, comp1);
     }
+    for (; res--; ) {
+        v16u8 diff;
+        src0 = LD_UB(src);
+        ref0 = LD_UB(ref);
+        ref1 = LD_UB(ref + 1);
+        src += src_stride;
+        ref += ref_stride;
+        comp0 = (v16u8)__msa_aver_u_b((v16u8) ref0, (v16u8) ref1);
+        diff = __msa_asub_u_b((v16u8) src0, (v16u8) comp0);
+        diff = (v16u8)__msa_ilvr_d((v2i64) zero, (v2i64) diff);
+        sad += __msa_hadd_u_h((v16u8) diff, (v16u8) diff);
+    }

     return (HADD_UH_U32(sad));
 }
@@ -116,12 +151,13 @@ static uint32_t sad_horiz_bilinear_filter_16width_msa(const uint8_t *src,
                                                       int32_t ref_stride,
                                                       int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 3;
+    int32_t res = height & 0x07;
     v16u8 src0, src1, src2, src3, comp0, comp1;
     v16u8 ref00, ref10, ref20, ref30, ref01, ref11, ref21, ref31;
     v8u16 sad = { 0 };

-    for (ht_cnt = (height >> 3); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         LD_UB4(src, src_stride, src0, src1, src2, src3);
         src += (4 * src_stride);
         LD_UB4(ref, ref_stride, ref00, ref10, ref20, ref30);
@@ -145,6 +181,17 @@ static uint32_t sad_horiz_bilinear_filter_16width_msa(const uint8_t *src,
         sad += SAD_UB2_UH(src2, src3, comp0, comp1);
     }
+    for (; res--; ) {
+        v16u8 diff;
+        src0 = LD_UB(src);
+        ref00 = LD_UB(ref);
+        ref01 = LD_UB(ref + 1);
+        src += src_stride;
+        ref += ref_stride;
+        comp0 = (v16u8)__msa_aver_u_b((v16u8) ref00, (v16u8) ref01);
+        diff = __msa_asub_u_b((v16u8) src0, (v16u8) comp0);
+        sad += __msa_hadd_u_h((v16u8) diff, (v16u8) diff);
+    }

     return (HADD_UH_U32(sad));
 }
@@ -154,12 +201,14 @@ static uint32_t sad_vert_bilinear_filter_8width_msa(const uint8_t *src,
                                                     int32_t ref_stride,
                                                     int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 3;
+    int32_t res = height & 0x07;
     v16u8 src0, src1, src2, src3, comp0, comp1;
     v16u8 ref0, ref1, ref2, ref3, ref4;
+    v8u16 zero = { 0 };
     v8u16 sad = { 0 };

-    for (ht_cnt = (height >> 3); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         LD_UB4(src, src_stride, src0, src1, src2, src3);
         src += (4 * src_stride);
         LD_UB5(ref, ref_stride, ref0, ref1, ref2, ref3, ref4);
@@ -183,6 +232,17 @@ static uint32_t sad_vert_bilinear_filter_8width_msa(const uint8_t *src,
         sad += SAD_UB2_UH(src0, src1, comp0, comp1);
     }
+    for (; res--; ) {
+        v16u8 diff;
+        src0 = LD_UB(src);
+        LD_UB2(ref, ref_stride, ref0, ref1);
+        src += src_stride;
+        ref += ref_stride;
+        comp0 = (v16u8)__msa_aver_u_b((v16u8) ref0, (v16u8) ref1);
+        diff = __msa_asub_u_b((v16u8) src0, (v16u8) comp0);
+        diff = (v16u8)__msa_ilvr_d((v2i64) zero, (v2i64) diff);
+        sad += __msa_hadd_u_h((v16u8) diff, (v16u8) diff);
+    }

     return (HADD_UH_U32(sad));
 }
@@ -192,12 +252,13 @@ static uint32_t sad_vert_bilinear_filter_16width_msa(const uint8_t *src,
                                                      int32_t ref_stride,
                                                      int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 3;
+    int32_t res = height & 0x07;
     v16u8 src0, src1, src2, src3, comp0, comp1;
     v16u8 ref0, ref1, ref2, ref3, ref4;
     v8u16 sad = { 0 };

-    for (ht_cnt = (height >> 3); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         LD_UB5(ref, ref_stride, ref4, ref0, ref1, ref2, ref3);
         ref += (5 * ref_stride);
         LD_UB4(src, src_stride, src0, src1, src2, src3);
@@ -221,6 +282,16 @@ static uint32_t sad_vert_bilinear_filter_16width_msa(const uint8_t *src,
         sad += SAD_UB2_UH(src2, src3, comp0, comp1);
     }
+    for (; res--; ) {
+        v16u8 diff;
+        src0 = LD_UB(src);
+        LD_UB2(ref, ref_stride, ref0, ref1);
+        src += src_stride;
+        ref += ref_stride;
+        comp0 = (v16u8)__msa_aver_u_b((v16u8) ref0, (v16u8) ref1);
+        diff = __msa_asub_u_b((v16u8) src0, (v16u8) comp0);
+        sad += __msa_hadd_u_h((v16u8) diff, (v16u8) diff);
+    }

     return (HADD_UH_U32(sad));
 }
@@ -230,11 +301,13 @@ static uint32_t sad_hv_bilinear_filter_8width_msa(const uint8_t *src,
                                                   int32_t ref_stride,
                                                   int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 2;
+    int32_t res = height & 0x03;
     v16u8 src0, src1, src2, src3, temp0, temp1, diff;
     v16u8 ref0, ref1, ref2, ref3, ref4;
     v16i8 mask = { 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 };
     v8u16 comp0, comp1, comp2, comp3;
+    v8u16 zero = { 0 };
     v8u16 sad = { 0 };

     for (ht_cnt = (height >> 2); ht_cnt--;) {
@@ -277,6 +350,22 @@ static uint32_t sad_hv_bilinear_filter_8width_msa(const uint8_t *src,
         sad += __msa_hadd_u_h(diff, diff);
     }
+    for (; res--; ) {
+        src0 = LD_UB(src);
+        LD_UB2(ref, ref_stride, ref0, ref1);
+        temp0 = (v16u8) __msa_vshf_b(mask, (v16i8) ref0, (v16i8) ref0);
+        temp1 = (v16u8) __msa_vshf_b(mask, (v16i8) ref1, (v16i8) ref1);
+        src += src_stride;
+        ref += ref_stride;
+        comp0 = __msa_hadd_u_h(temp0, temp0);
+        comp2 = __msa_hadd_u_h(temp1, temp1);
+        comp2 += comp0;
+        comp2 = (v8u16)__msa_srari_h((v8i16) comp2, 2);
+        comp0 = (v16u8) __msa_pckev_b((v16i8) zero, (v16i8) comp2);
+        diff = __msa_asub_u_b(src0, comp0);
+        diff = (v16u8)__msa_ilvr_d((v2i64) zero, (v2i64) diff);
+        sad += __msa_hadd_u_h(diff, diff);
+    }

     return (HADD_UH_U32(sad));
 }
@@ -286,14 +375,15 @@ static uint32_t sad_hv_bilinear_filter_16width_msa(const uint8_t *src,
                                                    int32_t ref_stride,
                                                    int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 3;
+    int32_t res = height & 0x07;
     v16u8 src0, src1, src2, src3, comp, diff;
     v16u8 temp0, temp1, temp2, temp3;
     v16u8 ref00, ref01, ref02, ref03, ref04, ref10, ref11, ref12, ref13, ref14;
     v8u16 comp0, comp1, comp2, comp3;
     v8u16 sad = { 0 };

-    for (ht_cnt = (height >> 3); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         LD_UB4(src, src_stride, src0, src1, src2, src3);
         src += (4 * src_stride);
         LD_UB5(ref, ref_stride, ref04, ref00, ref01, ref02, ref03);
@@ -389,6 +479,25 @@ static uint32_t sad_hv_bilinear_filter_16width_msa(const uint8_t *src,
         diff = __msa_asub_u_b(src3, comp);
         sad += __msa_hadd_u_h(diff, diff);
     }
+    for (; res--; ) {
+        src0 = LD_UB(src);
+        LD_UB2(ref, ref_stride, ref00, ref10);
+        LD_UB2(ref + 1, ref_stride, ref01, ref11);
+        src += src_stride;
+        ref += ref_stride;
+        ILVRL_B2_UB(ref10, ref00, temp0, temp1);
+        ILVRL_B2_UB(ref11, ref01, temp2, temp3);
+        comp0 = __msa_hadd_u_h(temp0, temp0);
+        comp1 = __msa_hadd_u_h(temp1, temp1);
+        comp2 = __msa_hadd_u_h(temp2, temp2);
+        comp3 = __msa_hadd_u_h(temp3, temp3);
+        comp2 += comp0;
+        comp3 += comp1;
+        SRARI_H2_UH(comp2, comp3, 2);
+        comp = (v16u8) __msa_pckev_b((v16i8) comp3, (v16i8) comp2);
+        diff = __msa_asub_u_b(src0, comp);
+        sad += __msa_hadd_u_h(diff, diff);
+    }

     return (HADD_UH_U32(sad));
 }
@@ -407,15 +516,17 @@ static uint32_t sse_4width_msa(const uint8_t *src_ptr, int32_t src_stride,
                                const uint8_t *ref_ptr, int32_t ref_stride,
                                int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 2;
+    int32_t res = height & 0x03;
     uint32_t sse;
     uint32_t src0, src1, src2, src3;
     uint32_t ref0, ref1, ref2, ref3;
-    v16u8 src = { 0 };
-    v16u8 ref = { 0 };
-    v4i32 var = { 0 };
+    v16u8 src = { 0 };
+    v16u8 ref = { 0 };
+    v16u8 zero = { 0 };
+    v4i32 var = { 0 };

-    for (ht_cnt = (height >> 2); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         LW4(src_ptr, src_stride, src0, src1, src2, src3);
         src_ptr += (4 * src_stride);
         LW4(ref_ptr, ref_stride, ref0, ref1, ref2, ref3);
@@ -426,6 +537,20 @@ static uint32_t sse_4width_msa(const uint8_t *src_ptr, int32_t src_stride,
         CALC_MSE_B(src, ref, var);
     }
+    for (; res--; ) {
+        v16u8 reg0;
+        v8i16 tmp0;
+        src0 = LW(src_ptr);
+        ref0 = LW(ref_ptr);
+        src_ptr += src_stride;
+        ref_ptr += ref_stride;
+        src = (v16u8)__msa_insert_w((v4i32) src, 0, src0);
+        ref = (v16u8)__msa_insert_w((v4i32) ref, 0, ref0);
+        reg0 = (v16u8)__msa_ilvr_b(src, ref);
+        reg0 = (v16u8)__msa_ilvr_d((v2i64) zero, (v2i64) reg0);
+        tmp0 = (v8i16)__msa_hsub_u_h((v16u8) reg0, (v16u8) reg0);
+        var = (v4i32)__msa_dpadd_s_w((v4i32) var, (v8i16) tmp0, (v8i16) tmp0);
+    }

     sse = HADD_SW_S32(var);

     return sse;
@@ -435,13 +560,14 @@ static uint32_t sse_8width_msa(const uint8_t *src_ptr, int32_t src_stride,
                                const uint8_t *ref_ptr, int32_t ref_stride,
                                int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 2;
+    int32_t res = height & 0x03;
     uint32_t sse;
     v16u8 src0, src1, src2, src3;
     v16u8 ref0, ref1, ref2, ref3;
     v4i32 var = { 0 };

-    for (ht_cnt = (height >> 2); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         LD_UB4(src_ptr, src_stride, src0, src1, src2, src3);
         src_ptr += (4 * src_stride);
         LD_UB4(ref_ptr, ref_stride, ref0, ref1, ref2, ref3);
@@ -453,6 +579,16 @@ static uint32_t sse_8width_msa(const uint8_t *src_ptr, int32_t src_stride,
         CALC_MSE_B(src1, ref1, var);
     }
+    for (; res--; ) {
+        v8i16 tmp0;
+        src0 = LD_UB(src_ptr);
+        ref0 = LD_UB(ref_ptr);
+        src_ptr += src_stride;
+        ref_ptr += ref_stride;
+        ref1 = (v16u8)__msa_ilvr_b(src0, ref0);
+        tmp0 = (v8i16)__msa_hsub_u_h((v16u8) ref1, (v16u8) ref1);
+        var = (v4i32)__msa_dpadd_s_w((v4i32) var, (v8i16) tmp0, (v8i16) tmp0);
+    }

     sse = HADD_SW_S32(var);

     return sse;
@@ -462,12 +598,13 @@ static uint32_t sse_16width_msa(const uint8_t *src_ptr, int32_t src_stride,
                                 const uint8_t *ref_ptr, int32_t ref_stride,
                                 int32_t height)
 {
-    int32_t ht_cnt;
+    int32_t ht_cnt = height >> 2;
+    int32_t res = height & 0x03;
     uint32_t sse;
     v16u8 src, ref;
     v4i32 var = { 0 };

-    for (ht_cnt = (height >> 2); ht_cnt--;) {
+    for (; ht_cnt--; ) {
         src = LD_UB(src_ptr);
         src_ptr += src_stride;
         ref = LD_UB(ref_ptr);
@@ -493,6 +630,14 @@ static uint32_t sse_16width_msa(const uint8_t *src_ptr, int32_t src_stride,
         CALC_MSE_B(src, ref, var);
     }

+    for (; res--; ) {
+        src = LD_UB(src_ptr);
+        src_ptr += src_stride;
+        ref = LD_UB(ref_ptr);
+        ref_ptr += ref_stride;
+        CALC_MSE_B(src, ref, var);
+    }
+
     sse = HADD_SW_S32(var);

     return sse;
@@ -544,7 +689,7 @@ static int32_t hadamard_diff_8x8_msa(const uint8_t *src, int32_t src_stride,
 }

 static int32_t hadamard_intra_8x8_msa(const uint8_t *src, int32_t src_stride,
-                                      const uint8_t *ref, int32_t ref_stride)
+                                      const uint8_t *dumy, int32_t ref_stride)
 {
     int32_t sum_res = 0;
     v16u8 src0, src1, src2, src3, src4, src5, src6, src7;
@@ -659,10 +804,10 @@ int ff_hadamard8_diff8x8_msa(MpegEncContext *s, const uint8_t *dst, const uint8_
     return hadamard_diff_8x8_msa(src, stride, dst, stride);
 }

-int ff_hadamard8_intra8x8_msa(MpegEncContext *s, const uint8_t *dst, const uint8_t *src,
+int ff_hadamard8_intra8x8_msa(MpegEncContext *s, const uint8_t *src, const uint8_t *dummy,
                               ptrdiff_t stride, int h)
 {
-    return hadamard_intra_8x8_msa(src, stride, dst, stride);
+    return hadamard_intra_8x8_msa(src, stride, dummy, stride);
 }

 /* Hadamard Transform functions */
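
Every added block above follows the same pattern: the existing vectorized loop
still consumes rows four (or eight) at a time, and a new per-row tail loop now
covers the height % 4 (or height % 8) leftover rows that the old code skipped,
which is why fate-checkasm-motion failed whenever h was not a multiple of 8.
As a rough illustration of that structure, here is a scalar sketch of the
8-pixel-wide SAD case (illustrative only, not part of the patch;
sad_8width_ref is a made-up name):

#include <stdint.h>
#include <stdlib.h>

/* Scalar model of the fixed loop structure in sad_8width_msa(): a main loop
 * that handles four rows per iteration, plus a per-row tail for height % 4. */
static uint32_t sad_8width_ref(const uint8_t *src, int src_stride,
                               const uint8_t *ref, int ref_stride, int height)
{
    uint32_t sad = 0;
    int ht_cnt = height >> 2;   /* four-row groups (the vectorized path) */
    int res = height & 0x03;    /* leftover rows the old code ignored */

    for (; ht_cnt--; ) {
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 8; x++)
                sad += abs(src[y * src_stride + x] - ref[y * ref_stride + x]);
        src += 4 * src_stride;
        ref += 4 * ref_stride;
    }
    for (; res--; ) {           /* new tail: one row at a time */
        for (int x = 0; x < 8; x++)
            sad += abs(src[x] - ref[x]);
        src += src_stride;
        ref += ref_stride;
    }
    return sad;
}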