From patchwork Fri Sep 22 11:37:27 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Manojkumar Bhosale
X-Patchwork-Id: 5235
From: Manojkumar Bhosale
To: FFmpeg development discussions and patches
Thread-Topic: [FFmpeg-devel] [PATCH] avcodec/mips: Improve hevc uni-w copy mc msa functions
Date: Fri, 22 Sep 2017 11:37:27 +0000
Message-ID: <70293ACCC3BA6A4E81FFCA024C7A86E1E0592461@PUMAIL01.pu.imgtec.org>
References: <1505981734-20317-1-git-send-email-kaustubh.raste@imgtec.com>
In-Reply-To: <1505981734-20317-1-git-send-email-kaustubh.raste@imgtec.com>
Accept-Language: en-US
Content-Language: en-US
x-originating-ip: [192.168.91.86]
MIME-Version: 1.0
Subject: Re: [FFmpeg-devel] [PATCH] avcodec/mips: Improve hevc uni-w copy mc msa functions
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: Kaustubh Raste
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"

LGTM

-----Original Message-----
From: ffmpeg-devel [mailto:ffmpeg-devel-bounces@ffmpeg.org] On Behalf Of kaustubh.raste@imgtec.com
Sent: Thursday, September 21, 2017 1:46 PM
To: ffmpeg-devel@ffmpeg.org
Cc: Kaustubh Raste
Subject: [FFmpeg-devel] [PATCH] avcodec/mips: Improve hevc uni-w copy mc msa functions

From: Kaustubh Raste

Load the specific destination bytes instead of MSA load and pack.
Pack the data to half word before clipping.
Use immediate unsigned saturation for clip to max saving one vector register.

Signed-off-by: Kaustubh Raste
---
 libavcodec/mips/hevc_mc_uniw_msa.c  |  559 ++++++++++++++++++++++++-----------
 libavutil/mips/generic_macros_msa.h |   30 ++
 2 files changed, 415 insertions(+), 174 deletions(-)

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

diff --git a/libavcodec/mips/hevc_mc_uniw_msa.c b/libavcodec/mips/hevc_mc_uniw_msa.c
index ce10f41..d184419 100644
--- a/libavcodec/mips/hevc_mc_uniw_msa.c
+++ b/libavcodec/mips/hevc_mc_uniw_msa.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2015 Manojkumar Bhosale (Manojkumar.Bhosale@imgtec.com)
+ * Copyright (c) 2015 - 2017 Manojkumar Bhosale (Manojkumar.Bhosale@imgtec.com)
  *
  * This file is part of FFmpeg.
  *
@@ -62,6 +62,31 @@
                         out2_r, out3_r, out2_l, out3_l);                      \
 }

+#define HEVC_UNIW_RND_CLIP2_MAX_SATU_H(in0_h, in1_h, wgt_w, offset_h, rnd_w,  \
+                                       out0_h, out1_h)                        \
+{                                                                             \
+    v4i32 in0_r_m, in0_l_m, in1_r_m, in1_l_m;                                 \
+                                                                              \
+    ILVRL_H2_SW(in0_h, in0_h, in0_r_m, in0_l_m);                              \
+    ILVRL_H2_SW(in1_h, in1_h, in1_r_m, in1_l_m);                              \
+    DOTP_SH4_SW(in0_r_m, in1_r_m, in0_l_m, in1_l_m, wgt_w, wgt_w, wgt_w,      \
+                wgt_w, in0_r_m, in1_r_m, in0_l_m, in1_l_m);                   \
+    SRAR_W4_SW(in0_r_m, in1_r_m, in0_l_m, in1_l_m, rnd_w);                    \
+    PCKEV_H2_SH(in0_l_m, in0_r_m, in1_l_m, in1_r_m, out0_h, out1_h);          \
+    ADDS_SH2_SH(out0_h, offset_h, out1_h, offset_h, out0_h, out1_h);          \
+    CLIP_SH2_0_255_MAX_SATU(out0_h, out1_h);                                  \
+}
+
+#define HEVC_UNIW_RND_CLIP4_MAX_SATU_H(in0_h, in1_h, in2_h, in3_h, wgt_w,     \
+                                       offset_h, rnd_w, out0_h, out1_h,       \
+                                       out2_h, out3_h)                        \
+{                                                                             \
+    HEVC_UNIW_RND_CLIP2_MAX_SATU_H(in0_h, in1_h, wgt_w, offset_h, rnd_w,      \
+                                   out0_h, out1_h);                           \
+    HEVC_UNIW_RND_CLIP2_MAX_SATU_H(in2_h, in3_h, wgt_w, offset_h, rnd_w,      \
+                                   out2_h, out3_h);                           \
+}
+
 static void hevc_uniwgt_copy_4w_msa(uint8_t *src,
                                     int32_t src_stride,
                                     uint8_t *dst,
@@ -71,76 +96,60 @@ static void hevc_uniwgt_copy_4w_msa(uint8_t *src,
                                     int32_t offset,
                                     int32_t rnd_val)
 {
+    uint32_t loop_cnt, tp0, tp1, tp2, tp3;
     v16i8 zero = { 0 };
-    v4i32 weight_vec, offset_vec, rnd_vec;
+    v16u8 out0, out1;
+    v16i8 src0 = { 0 }, src1 = { 0 };
+    v8i16 dst0, dst1, dst2, dst3, offset_vec;
+    v4i32 weight_vec, rnd_vec;

     weight = weight & 0x0000FFFF;
     weight_vec = __msa_fill_w(weight);
-    offset_vec = __msa_fill_w(offset);
+    offset_vec = __msa_fill_h(offset);
     rnd_vec = __msa_fill_w(rnd_val);

     if (2 == height) {
-        v16i8 src0, src1;
-        v8i16 dst0;
         v4i32 dst0_r, dst0_l;

-        LD_SB2(src, src_stride, src0, src1);
-        src0 = (v16i8) __msa_ilvr_w((v4i32) src1, (v4i32) src0);
+        LW2(src, src_stride, tp0, tp1);
+        INSERT_W2_SB(tp0, tp1, src0);
         dst0 = (v8i16) __msa_ilvr_b(zero, src0);
         dst0 <<= 6;

         ILVRL_H2_SW(dst0, dst0, dst0_r, dst0_l);
         DOTP_SH2_SW(dst0_r, dst0_l, weight_vec, weight_vec, dst0_r,
                     dst0_l);
         SRAR_W2_SW(dst0_r, dst0_l, rnd_vec);
-        ADD2(dst0_r, offset_vec, dst0_l, offset_vec, dst0_r, dst0_l);
-        dst0_r = CLIP_SW_0_255(dst0_r);
-        dst0_l = CLIP_SW_0_255(dst0_l);
-
-        HEVC_PCK_SW_SB2(dst0_l, dst0_r, dst0_r);
-        ST4x2_UB(dst0_r, dst, dst_stride);
+        dst0 = __msa_pckev_h((v8i16) dst0_l, (v8i16) dst0_r);
+        dst0 += offset_vec;
+        dst0 = CLIP_SH_0_255_MAX_SATU(dst0);
+        out0 = (v16u8) __msa_pckev_b((v16i8) dst0, (v16i8) dst0);
+        ST4x2_UB(out0, dst, dst_stride);
     } else if (4 == height) {
-        v16i8 src0, src1, src2, src3;
-        v8i16 dst0, dst1;
-        v4i32 dst0_r, dst1_r;
-        v4i32 dst0_l, dst1_l;
-
-        LD_SB4(src, src_stride, src0, src1, src2, src3);
-        ILVR_W2_SB(src1, src0, src3, src2, src0, src1);
-        ILVR_B2_SH(zero, src0, zero, src1, dst0, dst1);
-        dst0 <<= 6;
-        dst1 <<= 6;
-
-        HEVC_UNIW_RND_CLIP2(dst0, dst1, weight_vec, offset_vec, rnd_vec,
-                            dst0_r, dst1_r, dst0_l, dst1_l);
-
-        HEVC_PCK_SW_SB4(dst0_l, dst0_r, dst1_l, dst1_r, dst0_r);
-        ST4x4_UB(dst0_r, dst0_r, 0, 1, 2, 3, dst, dst_stride);
-    } else if (0 == height % 8) {
-        uint32_t loop_cnt;
-        v16i8 src0, src1, src2, src3, src4, src5, src6, src7;
-        v8i16 dst0, dst1, dst2, dst3;
-        v4i32 dst0_r, dst1_r, dst2_r, dst3_r;
-        v4i32 dst0_l, dst1_l, dst2_l, dst3_l;
-
+        LW4(src, src_stride, tp0, tp1, tp2, tp3);
+        INSERT_W4_SB(tp0, tp1, tp2, tp3, src0);
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        SLLI_2V(dst0, dst1, 6);
+        HEVC_UNIW_RND_CLIP2_MAX_SATU_H(dst0, dst1, weight_vec, offset_vec,
+                                       rnd_vec, dst0, dst1);
+        out0 = (v16u8) __msa_pckev_b((v16i8) dst1, (v16i8) dst0);
+        ST4x4_UB(out0, out0, 0, 1, 2, 3, dst, dst_stride);
+    } else if (0 == (height % 8)) {
         for (loop_cnt = (height >> 3); loop_cnt--;) {
-            LD_SB8(src, src_stride,
-                   src0, src1, src2, src3, src4, src5, src6, src7);
-            src += (8 * src_stride);
-            ILVR_W4_SB(src1, src0, src3, src2, src5, src4, src7, src6,
-                       src0, src1, src2, src3);
-            ILVR_B4_SH(zero, src0, zero, src1, zero, src2, zero, src3,
-                       dst0, dst1, dst2, dst3);
-
+            LW4(src, src_stride, tp0, tp1, tp2, tp3);
+            src += 4 *
                   src_stride;
+            INSERT_W4_SB(tp0, tp1, tp2, tp3, src0);
+            LW4(src, src_stride, tp0, tp1, tp2, tp3);
+            src += 4 * src_stride;
+            INSERT_W4_SB(tp0, tp1, tp2, tp3, src1);
+            ILVRL_B2_SH(zero, src0, dst0, dst1);
+            ILVRL_B2_SH(zero, src1, dst2, dst3);
             SLLI_4V(dst0, dst1, dst2, dst3, 6);
-            HEVC_UNIW_RND_CLIP4(dst0, dst1, dst2, dst3,
-                                weight_vec, offset_vec, rnd_vec,
-                                dst0_r, dst1_r, dst2_r, dst3_r,
-                                dst0_l, dst1_l, dst2_l, dst3_l);
-
-            HEVC_PCK_SW_SB8(dst0_l, dst0_r, dst1_l, dst1_r,
-                            dst2_l, dst2_r, dst3_l, dst3_r, dst0_r, dst1_r);
-            ST4x8_UB(dst0_r, dst1_r, dst, dst_stride);
-            dst += (8 * dst_stride);
+            HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                           offset_vec, rnd_vec, dst0, dst1,
+                                           dst2, dst3);
+            PCKEV_B2_UB(dst1, dst0, dst3, dst2, out0, out1);
+            ST4x8_UB(out0, out1, dst, dst_stride);
+            dst += 8 * dst_stride;
         }
     }
 }
@@ -155,46 +164,48 @@ static void hevc_uniwgt_copy_6w_msa(uint8_t *src,
                                     int32_t rnd_val)
 {
     uint32_t loop_cnt;
+    uint64_t tp0, tp1, tp2, tp3;
     v16i8 zero = { 0 };
-    v16i8 src0, src1, src2, src3, src4, src5, src6, src7;
-    v8i16 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7;
-    v4i32 dst0_r, dst1_r, dst2_r, dst3_r;
-    v4i32 dst0_l, dst1_l, dst2_l, dst3_l;
-    v4i32 weight_vec, offset_vec, rnd_vec;
+    v16u8 out0, out1, out2, out3;
+    v16i8 src0, src1, src2, src3;
+    v8i16 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7, offset_vec;
+    v4i32 weight_vec, rnd_vec;

     weight = weight & 0x0000FFFF;
     weight_vec = __msa_fill_w(weight);
-    offset_vec = __msa_fill_w(offset);
+    offset_vec = __msa_fill_h(offset);
     rnd_vec = __msa_fill_w(rnd_val);

     for (loop_cnt = (height >> 3); loop_cnt--;) {
-        LD_SB8(src, src_stride, src0, src1, src2, src3, src4, src5, src6, src7);
-        src += (8 * src_stride);
-        ILVR_B4_SH(zero, src0, zero, src1, zero, src2, zero, src3,
-                   dst0, dst1, dst2, dst3);
-        ILVR_B4_SH(zero, src4, zero, src5, zero, src6, zero, src7,
-                   dst4, dst5, dst6, dst7);
+        LD4(src, src_stride, tp0, tp1, tp2, tp3);
+        src += (4 * src_stride);
+        INSERT_D2_SB(tp0, tp1, src0);
+        INSERT_D2_SB(tp2, tp3, src1);
+        LD4(src, src_stride, tp0, tp1, tp2, tp3);
+        src += (4 * src_stride);
+        INSERT_D2_SB(tp0, tp1, src2);
+        INSERT_D2_SB(tp2, tp3, src3);
+
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        ILVRL_B2_SH(zero, src1, dst2, dst3);
+        ILVRL_B2_SH(zero, src2, dst4, dst5);
+        ILVRL_B2_SH(zero, src3, dst6, dst7);

         SLLI_4V(dst0, dst1, dst2, dst3, 6);
         SLLI_4V(dst4, dst5, dst6, dst7, 6);

-        HEVC_UNIW_RND_CLIP4(dst0, dst1, dst2, dst3,
-                            weight_vec, offset_vec, rnd_vec,
-                            dst0_r, dst1_r, dst2_r, dst3_r,
-                            dst0_l, dst1_l, dst2_l, dst3_l);
-        HEVC_PCK_SW_SB8(dst0_l, dst0_r, dst1_l, dst1_r,
-                        dst2_l, dst2_r, dst3_l, dst3_r, dst0_r, dst1_r);
-        ST6x4_UB(dst0_r, dst1_r, dst, dst_stride);
-        dst += (4 * dst_stride);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                       offset_vec, rnd_vec, dst0, dst1, dst2,
+                                       dst3);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst4, dst5, dst6, dst7, weight_vec,
+                                       offset_vec, rnd_vec, dst4, dst5, dst6,
+                                       dst7);
+        PCKEV_B2_UB(dst1, dst0, dst3, dst2, out0, out1);
+        PCKEV_B2_UB(dst5, dst4, dst7, dst6, out2, out3);

-        HEVC_UNIW_RND_CLIP4(dst4, dst5, dst6, dst7,
-                            weight_vec, offset_vec, rnd_vec,
-                            dst0_r, dst1_r, dst2_r, dst3_r,
-                            dst0_l, dst1_l, dst2_l, dst3_l);
-
-        HEVC_PCK_SW_SB8(dst0_l, dst0_r, dst1_l, dst1_r,
-                        dst2_l, dst2_r, dst3_l, dst3_r, dst0_r, dst1_r);
-        ST6x4_UB(dst0_r, dst1_r, dst, dst_stride);
+        ST6x4_UB(out0, out1, dst, dst_stride);
+        dst += (4 * dst_stride);
+        ST6x4_UB(out2, out3, dst, dst_stride);
         dst += (4 * dst_stride);
     }
 }
@@ -208,78 +219,89 @@ static void hevc_uniwgt_copy_8w_msa(uint8_t *src,
                                     int32_t offset,
                                     int32_t rnd_val)
 {
+    uint32_t loop_cnt;
+    uint64_t tp0, tp1, tp2, tp3;
+    v16i8 src0 = { 0 }, src1 = { 0 }, src2 = { 0 }, src3 = { 0 };
     v16i8 zero = { 0 };
-    v4i32 weight_vec, offset_vec, rnd_vec;
+    v16u8 out0, out1, out2, out3;
+    v8i16 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7, offset_vec;
+    v4i32 weight_vec, rnd_vec;

     weight = weight & 0x0000FFFF;
     weight_vec = __msa_fill_w(weight);
-    offset_vec =
                 __msa_fill_w(offset);
+    offset_vec = __msa_fill_h(offset);
     rnd_vec = __msa_fill_w(rnd_val);

     if (2 == height) {
-        v16i8 src0, src1;
-        v8i16 dst0, dst1;
-        v4i32 dst0_r, dst1_r, dst0_l, dst1_l;
-
-        LD_SB2(src, src_stride, src0, src1);
-        ILVR_B2_SH(zero, src0, zero, src1, dst0, dst1);
-
-        dst0 <<= 6;
-        dst1 <<= 6;
-        HEVC_UNIW_RND_CLIP2(dst0, dst1, weight_vec, offset_vec, rnd_vec,
-                            dst0_r, dst1_r, dst0_l, dst1_l);
-
-        HEVC_PCK_SW_SB4(dst0_l, dst0_r, dst1_l, dst1_r, dst0_r);
-        ST8x2_UB(dst0_r, dst, dst_stride);
+        LD2(src, src_stride, tp0, tp1);
+        INSERT_D2_SB(tp0, tp1, src0);
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        SLLI_2V(dst0, dst1, 6);
+        HEVC_UNIW_RND_CLIP2_MAX_SATU_H(dst0, dst1, weight_vec, offset_vec,
+                                       rnd_vec, dst0, dst1);
+        out0 = (v16u8) __msa_pckev_b((v16i8) dst1, (v16i8) dst0);
+        ST8x2_UB(out0, dst, dst_stride);
+    } else if (4 == height) {
+        LD4(src, src_stride, tp0, tp1, tp2, tp3);
+        INSERT_D2_SB(tp0, tp1, src0);
+        INSERT_D2_SB(tp2, tp3, src1);
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        ILVRL_B2_SH(zero, src1, dst2, dst3);
+        SLLI_4V(dst0, dst1, dst2, dst3, 6);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                       offset_vec, rnd_vec, dst0, dst1, dst2,
+                                       dst3);
+        PCKEV_B2_UB(dst1, dst0, dst3, dst2, out0, out1);
+        ST8x4_UB(out0, out1, dst, dst_stride);
     } else if (6 == height) {
-        v16i8 src0, src1, src2, src3, src4, src5;
-        v8i16 dst0, dst1, dst2, dst3, dst4, dst5;
-        v4i32 dst0_r, dst1_r, dst2_r, dst3_r, dst4_r, dst5_r;
-        v4i32 dst0_l, dst1_l, dst2_l, dst3_l, dst4_l, dst5_l;
-
-        LD_SB6(src, src_stride, src0, src1, src2, src3, src4, src5);
-        ILVR_B4_SH(zero, src0, zero, src1, zero, src2, zero, src3,
-                   dst0, dst1, dst2, dst3);
-        ILVR_B2_SH(zero, src4, zero, src5, dst4, dst5);
-
+        LD4(src, src_stride, tp0, tp1, tp2, tp3);
+        src += 4 * src_stride;
+        INSERT_D2_SB(tp0, tp1, src0);
+        INSERT_D2_SB(tp2, tp3, src1);
+        LD2(src, src_stride, tp0, tp1);
+        INSERT_D2_SB(tp0, tp1, src2);
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        ILVRL_B2_SH(zero, src1, dst2,
                    dst3);
+        ILVRL_B2_SH(zero, src2, dst4, dst5);
         SLLI_4V(dst0, dst1, dst2, dst3, 6);
-        dst4 <<= 6;
-        dst5 <<= 6;
-        HEVC_UNIW_RND_CLIP4(dst0, dst1, dst2, dst3,
-                            weight_vec, offset_vec, rnd_vec,
-                            dst0_r, dst1_r, dst2_r, dst3_r,
-                            dst0_l, dst1_l, dst2_l, dst3_l);
-        HEVC_UNIW_RND_CLIP2(dst4, dst5, weight_vec, offset_vec, rnd_vec,
-                            dst4_r, dst5_r, dst4_l, dst5_l);
-
-        HEVC_PCK_SW_SB12(dst0_l, dst0_r, dst1_l, dst1_r,
-                         dst2_l, dst2_r, dst3_l, dst3_r,
-                         dst4_l, dst4_r, dst5_l, dst5_r,
-                         dst0_r, dst1_r, dst2_r);
-        ST8x4_UB(dst0_r, dst1_r, dst, dst_stride);
+        SLLI_2V(dst4, dst5, 6);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                       offset_vec, rnd_vec, dst0, dst1, dst2,
+                                       dst3);
+        HEVC_UNIW_RND_CLIP2_MAX_SATU_H(dst4, dst5, weight_vec, offset_vec,
+                                       rnd_vec, dst4, dst5);
+        PCKEV_B3_UB(dst1, dst0, dst3, dst2, dst5, dst4, out0, out1, out2);
+        ST8x4_UB(out0, out1, dst, dst_stride);
         dst += (4 * dst_stride);
-        ST8x2_UB(dst2_r, dst, dst_stride);
-    } else if (0 == height % 4) {
-        uint32_t loop_cnt;
-        v16i8 src0, src1, src2, src3;
-        v8i16 dst0, dst1, dst2, dst3;
-        v4i32 dst0_r, dst1_r, dst2_r, dst3_r, dst0_l, dst1_l, dst2_l, dst3_l;
-
-        for (loop_cnt = (height >> 2); loop_cnt--;) {
-            LD_SB4(src, src_stride, src0, src1, src2, src3);
-            src += (4 * src_stride);
-            ILVR_B4_SH(zero, src0, zero, src1, zero, src2, zero, src3,
-                       dst0, dst1, dst2, dst3);
-
+        ST8x2_UB(out2, dst, dst_stride);
+    } else if (0 == height % 8) {
+        for (loop_cnt = (height >> 3); loop_cnt--;) {
+            LD4(src, src_stride, tp0, tp1, tp2, tp3);
+            src += 4 * src_stride;
+            INSERT_D2_SB(tp0, tp1, src0);
+            INSERT_D2_SB(tp2, tp3, src1);
+            LD4(src, src_stride, tp0, tp1, tp2, tp3);
+            src += 4 * src_stride;
+            INSERT_D2_SB(tp0, tp1, src2);
+            INSERT_D2_SB(tp2, tp3, src3);
+
+            ILVRL_B2_SH(zero, src0, dst0, dst1);
+            ILVRL_B2_SH(zero, src1, dst2, dst3);
+            ILVRL_B2_SH(zero, src2, dst4, dst5);
+            ILVRL_B2_SH(zero, src3, dst6, dst7);
             SLLI_4V(dst0, dst1, dst2, dst3, 6);
-            HEVC_UNIW_RND_CLIP4(dst0, dst1, dst2, dst3,
-
                                weight_vec, offset_vec, rnd_vec,
-                                dst0_r, dst1_r, dst2_r, dst3_r,
-                                dst0_l, dst1_l, dst2_l, dst3_l);
-
-            HEVC_PCK_SW_SB8(dst0_l, dst0_r, dst1_l, dst1_r,
-                            dst2_l, dst2_r, dst3_l, dst3_r, dst0_r, dst1_r);
-            ST8x4_UB(dst0_r, dst1_r, dst, dst_stride);
+            SLLI_4V(dst4, dst5, dst6, dst7, 6);
+            HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                           offset_vec, rnd_vec, dst0, dst1,
+                                           dst2, dst3);
+            HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst4, dst5, dst6, dst7, weight_vec,
+                                           offset_vec, rnd_vec, dst4, dst5,
+                                           dst6, dst7);
+            PCKEV_B2_UB(dst1, dst0, dst3, dst2, out0, out1);
+            PCKEV_B2_UB(dst5, dst4, dst7, dst6, out2, out3);
+            ST8x4_UB(out0, out1, dst, dst_stride);
+            dst += (4 * dst_stride);
+            ST8x4_UB(out2, out3, dst, dst_stride);
             dst += (4 * dst_stride);
         }
     }
@@ -295,41 +317,36 @@ static void hevc_uniwgt_copy_12w_msa(uint8_t *src,
                                      int32_t rnd_val)
 {
     uint32_t loop_cnt;
+    v16u8 out0, out1, out2;
     v16i8 src0, src1, src2, src3;
     v8i16 dst0, dst1, dst2, dst3, dst4, dst5;
-    v4i32 dst0_r, dst1_r, dst2_r, dst3_r, dst4_r, dst5_r;
-    v4i32 dst0_l, dst1_l, dst2_l, dst3_l, dst4_l, dst5_l;
+    v8i16 offset_vec;
     v16i8 zero = { 0 };
-    v4i32 weight_vec, offset_vec, rnd_vec;
+    v4i32 weight_vec, rnd_vec;

     weight = weight & 0x0000FFFF;
     weight_vec = __msa_fill_w(weight);
-    offset_vec = __msa_fill_w(offset);
+    offset_vec = __msa_fill_h(offset);
     rnd_vec = __msa_fill_w(rnd_val);

-    for (loop_cnt = (height >> 2); loop_cnt--;) {
+    for (loop_cnt = 4; loop_cnt--;) {
         LD_SB4(src, src_stride, src0, src1, src2, src3);
         src += (4 * src_stride);
         ILVR_B4_SH(zero, src0, zero, src1, zero, src2, zero, src3,
                    dst0, dst1, dst2, dst3);

-        SLLI_4V(dst0, dst1, dst2, dst3, 6);
         ILVL_W2_SB(src1, src0, src3, src2, src0, src1);
         ILVR_B2_SH(zero, src0, zero, src1, dst4, dst5);
-        dst4 <<= 6;
-        dst5 <<= 6;
-        HEVC_UNIW_RND_CLIP4(dst0, dst1, dst2, dst3,
-                            weight_vec, offset_vec, rnd_vec,
-                            dst0_r, dst1_r, dst2_r, dst3_r,
-                            dst0_l, dst1_l, dst2_l, dst3_l);
-        HEVC_UNIW_RND_CLIP2(dst4, dst5, weight_vec, offset_vec, rnd_vec,
-                            dst4_r,
                            dst5_r, dst4_l, dst5_l);
-
-        HEVC_PCK_SW_SB12(dst0_l, dst0_r, dst1_l, dst1_r,
-                         dst2_l, dst2_r, dst3_l, dst3_r,
-                         dst4_l, dst4_r, dst5_l, dst5_r,
-                         dst0_r, dst1_r, dst2_r);
-        ST12x4_UB(dst0_r, dst1_r, dst2_r, dst, dst_stride);
+        SLLI_4V(dst0, dst1, dst2, dst3, 6);
+        SLLI_2V(dst4, dst5, 6);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                       offset_vec, rnd_vec, dst0, dst1, dst2,
+                                       dst3);
+        HEVC_UNIW_RND_CLIP2_MAX_SATU_H(dst4, dst5, weight_vec, offset_vec,
+                                       rnd_vec, dst4, dst5);
+
+        PCKEV_B3_UB(dst1, dst0, dst3, dst2, dst5, dst4, out0, out1, out2);
+        ST12x4_UB(out0, out1, out2, dst, dst_stride);
         dst += (4 * dst_stride);
     }
 }
@@ -410,8 +427,38 @@ static void hevc_uniwgt_copy_16w_msa(uint8_t *src,
                                      int32_t offset,
                                      int32_t rnd_val)
 {
-    hevc_uniwgt_copy_16multx4mult_msa(src, src_stride, dst, dst_stride,
-                                      height, weight, offset, rnd_val, 16);
+    uint32_t loop_cnt;
+    v16u8 out0, out1, out2, out3;
+    v16i8 src0, src1, src2, src3;
+    v16i8 zero = { 0 };
+    v8i16 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7, offset_vec;
+    v4i32 weight_vec, rnd_vec;
+
+    weight = weight & 0x0000FFFF;
+    weight_vec = __msa_fill_w(weight);
+    offset_vec = __msa_fill_h(offset);
+    rnd_vec = __msa_fill_w(rnd_val);
+
+    for (loop_cnt = height >> 2; loop_cnt--;) {
+        LD_SB4(src, src_stride, src0, src1, src2, src3);
+        src += (4 * src_stride);
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        ILVRL_B2_SH(zero, src1, dst2, dst3);
+        ILVRL_B2_SH(zero, src2, dst4, dst5);
+        ILVRL_B2_SH(zero, src3, dst6, dst7);
+        SLLI_4V(dst0, dst1, dst2, dst3, 6);
+        SLLI_4V(dst4, dst5, dst6, dst7, 6);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                       offset_vec, rnd_vec, dst0, dst1, dst2,
+                                       dst3);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst4, dst5, dst6, dst7, weight_vec,
+                                       offset_vec, rnd_vec, dst4, dst5, dst6,
+                                       dst7);
+        PCKEV_B2_UB(dst1, dst0, dst3, dst2, out0, out1);
+        PCKEV_B2_UB(dst5, dst4, dst7, dst6, out2, out3);
+        ST_UB4(out0, out1, out2, out3, dst, dst_stride);
+        dst += (4 * dst_stride);
+    }
 }
 static void hevc_uniwgt_copy_24w_msa(uint8_t *src,
@@ -423,11 +470,48 @@ static void hevc_uniwgt_copy_24w_msa(uint8_t *src,
                                      int32_t offset,
                                      int32_t rnd_val)
 {
-    hevc_uniwgt_copy_16multx4mult_msa(src, src_stride, dst, dst_stride,
-                                      height, weight, offset, rnd_val, 16);
+    uint32_t loop_cnt;
+    v16u8 out0, out1, out2, out3, out4, out5;
+    v16i8 src0, src1, src2, src3, src4, src5, src6, src7;
+    v16i8 zero = { 0 };
+    v8i16 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7, offset_vec;
+    v8i16 dst8, dst9, dst10, dst11;
+    v4i32 weight_vec, rnd_vec;

-    hevc_uniwgt_copy_8w_msa(src + 16, src_stride, dst + 16, dst_stride,
-                            height, weight, offset, rnd_val);
+    weight = weight & 0x0000FFFF;
+    weight_vec = __msa_fill_w(weight);
+    offset_vec = __msa_fill_h(offset);
+    rnd_vec = __msa_fill_w(rnd_val);
+
+    for (loop_cnt = (height >> 2); loop_cnt--;) {
+        LD_SB4(src, src_stride, src0, src1, src4, src5);
+        LD_SB4(src + 16, src_stride, src2, src3, src6, src7);
+        src += (4 * src_stride);
+
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        ILVRL_B2_SH(zero, src1, dst2, dst3);
+        ILVR_B2_SH(zero, src2, zero, src3, dst4, dst5);
+        ILVRL_B2_SH(zero, src4, dst6, dst7);
+        ILVRL_B2_SH(zero, src5, dst8, dst9);
+        ILVR_B2_SH(zero, src6, zero, src7, dst10, dst11);
+        SLLI_4V(dst0, dst1, dst2, dst3, 6);
+        SLLI_4V(dst4, dst5, dst6, dst7, 6);
+        SLLI_4V(dst8, dst9, dst10, dst11, 6);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                       offset_vec, rnd_vec, dst0, dst1, dst2,
+                                       dst3);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst4, dst5, dst6, dst7, weight_vec,
+                                       offset_vec, rnd_vec, dst4, dst5, dst6,
+                                       dst7);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst8, dst9, dst10, dst11, weight_vec,
+                                       offset_vec, rnd_vec, dst8, dst9, dst10,
+                                       dst11);
+        PCKEV_B3_UB(dst1, dst0, dst3, dst2, dst5, dst4, out0, out1, out2);
+        PCKEV_B3_UB(dst7, dst6, dst9, dst8, dst11, dst10, out3, out4, out5);
+        ST_UB4(out0, out1, out3, out4, dst, dst_stride);
+        ST8x4_UB(out2, out5, dst + 16, dst_stride);
+        dst += (4 * dst_stride);
+    }
 }

 static void
             hevc_uniwgt_copy_32w_msa(uint8_t *src,
@@ -439,8 +523,41 @@ static void hevc_uniwgt_copy_32w_msa(uint8_t *src,
                                      int32_t offset,
                                      int32_t rnd_val)
 {
-    hevc_uniwgt_copy_16multx4mult_msa(src, src_stride, dst, dst_stride,
-                                      height, weight, offset, rnd_val, 32);
+    uint32_t loop_cnt;
+    v16u8 out0, out1, out2, out3;
+    v16i8 src0, src1, src2, src3;
+    v16i8 zero = { 0 };
+    v8i16 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7, offset_vec;
+    v4i32 weight_vec, rnd_vec;
+
+    weight = weight & 0x0000FFFF;
+    weight_vec = __msa_fill_w(weight);
+    offset_vec = __msa_fill_h(offset);
+    rnd_vec = __msa_fill_w(rnd_val);
+
+    for (loop_cnt = (height >> 1); loop_cnt--;) {
+        LD_SB2(src, src_stride, src0, src1);
+        LD_SB2(src + 16, src_stride, src2, src3);
+        src += (2 * src_stride);
+
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        ILVRL_B2_SH(zero, src1, dst2, dst3);
+        ILVRL_B2_SH(zero, src2, dst4, dst5);
+        ILVRL_B2_SH(zero, src3, dst6, dst7);
+        SLLI_4V(dst0, dst1, dst2, dst3, 6);
+        SLLI_4V(dst4, dst5, dst6, dst7, 6);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                       offset_vec, rnd_vec, dst0, dst1, dst2,
+                                       dst3);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst4, dst5, dst6, dst7, weight_vec,
+                                       offset_vec, rnd_vec, dst4, dst5, dst6,
+                                       dst7);
+        PCKEV_B2_UB(dst1, dst0, dst3, dst2, out0, out1);
+        PCKEV_B2_UB(dst5, dst4, dst7, dst6, out2, out3);
+        ST_UB2(out0, out1, dst, dst_stride);
+        ST_UB2(out2, out3, dst + 16, dst_stride);
+        dst += (2 * dst_stride);
+    }
 }

 static void hevc_uniwgt_copy_48w_msa(uint8_t *src,
@@ -452,8 +569,52 @@ static void hevc_uniwgt_copy_48w_msa(uint8_t *src,
                                      int32_t offset,
                                      int32_t rnd_val)
 {
-    hevc_uniwgt_copy_16multx4mult_msa(src, src_stride, dst, dst_stride,
-                                      height, weight, offset, rnd_val, 48);
+    uint32_t loop_cnt;
+    v16u8 out0, out1, out2, out3, out4, out5;
+    v16i8 src0, src1, src2, src3, src4, src5;
+    v16i8 zero = { 0 };
+    v8i16 dst0, dst1, dst2, dst3, dst4, dst5, offset_vec;
+    v8i16 dst6, dst7, dst8, dst9, dst10, dst11;
+    v4i32 weight_vec, rnd_vec;
+
+    weight =
             weight & 0x0000FFFF;
+    weight_vec = __msa_fill_w(weight);
+    offset_vec = __msa_fill_h(offset);
+    rnd_vec = __msa_fill_w(rnd_val);
+
+    for (loop_cnt = (height >> 1); loop_cnt--;) {
+        LD_SB3(src, 16, src0, src1, src2);
+        src += src_stride;
+        LD_SB3(src, 16, src3, src4, src5);
+        src += src_stride;
+
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        ILVRL_B2_SH(zero, src1, dst2, dst3);
+        ILVRL_B2_SH(zero, src2, dst4, dst5);
+        ILVRL_B2_SH(zero, src3, dst6, dst7);
+        ILVRL_B2_SH(zero, src4, dst8, dst9);
+        ILVRL_B2_SH(zero, src5, dst10, dst11);
+        SLLI_4V(dst0, dst1, dst2, dst3, 6);
+        SLLI_4V(dst4, dst5, dst6, dst7, 6);
+        SLLI_4V(dst8, dst9, dst10, dst11, 6);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                       offset_vec, rnd_vec, dst0, dst1, dst2,
+                                       dst3);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst4, dst5, dst6, dst7, weight_vec,
+                                       offset_vec, rnd_vec, dst4, dst5, dst6,
+                                       dst7);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst8, dst9, dst10, dst11, weight_vec,
+                                       offset_vec, rnd_vec, dst8, dst9, dst10,
+                                       dst11);
+        PCKEV_B3_UB(dst1, dst0, dst3, dst2, dst5, dst4, out0, out1, out2);
+        PCKEV_B3_UB(dst7, dst6, dst9, dst8, dst11, dst10, out3, out4, out5);
+        ST_UB2(out0, out1, dst, 16);
+        ST_UB(out2, dst + 32);
+        dst += dst_stride;
+        ST_UB2(out3, out4, dst, 16);
+        ST_UB(out5, dst + 32);
+        dst += dst_stride;
+    }
 }

 static void hevc_uniwgt_copy_64w_msa(uint8_t *src,
@@ -465,8 +626,58 @@ static void hevc_uniwgt_copy_64w_msa(uint8_t *src,
                                      int32_t offset,
                                      int32_t rnd_val)
 {
-    hevc_uniwgt_copy_16multx4mult_msa(src, src_stride, dst, dst_stride,
-                                      height, weight, offset, rnd_val, 64);
+    uint32_t loop_cnt;
+    v16u8 out0, out1, out2, out3, out4, out5, out6, out7;
+    v16i8 src0, src1, src2, src3, src4, src5, src6, src7;
+    v16i8 zero = { 0 };
+    v8i16 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7, offset_vec;
+    v8i16 dst8, dst9, dst10, dst11, dst12, dst13, dst14, dst15;
+    v4i32 weight_vec, rnd_vec;
+
+    weight = weight & 0x0000FFFF;
+    weight_vec = __msa_fill_w(weight);
+    offset_vec =
                 __msa_fill_h(offset);
+    rnd_vec = __msa_fill_w(rnd_val);
+
+    for (loop_cnt = (height >> 1); loop_cnt--;) {
+        LD_SB4(src, 16, src0, src1, src2, src3);
+        src += src_stride;
+        LD_SB4(src, 16, src4, src5, src6, src7);
+        src += src_stride;
+
+        ILVRL_B2_SH(zero, src0, dst0, dst1);
+        ILVRL_B2_SH(zero, src1, dst2, dst3);
+        ILVRL_B2_SH(zero, src2, dst4, dst5);
+        ILVRL_B2_SH(zero, src3, dst6, dst7);
+        ILVRL_B2_SH(zero, src4, dst8, dst9);
+        ILVRL_B2_SH(zero, src5, dst10, dst11);
+        ILVRL_B2_SH(zero, src6, dst12, dst13);
+        ILVRL_B2_SH(zero, src7, dst14, dst15);
+        SLLI_4V(dst0, dst1, dst2, dst3, 6);
+        SLLI_4V(dst4, dst5, dst6, dst7, 6);
+        SLLI_4V(dst8, dst9, dst10, dst11, 6);
+        SLLI_4V(dst12, dst13, dst14, dst15, 6);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst0, dst1, dst2, dst3, weight_vec,
+                                       offset_vec, rnd_vec, dst0, dst1, dst2,
+                                       dst3);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst4, dst5, dst6, dst7, weight_vec,
+                                       offset_vec, rnd_vec, dst4, dst5, dst6,
+                                       dst7);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst8, dst9, dst10, dst11, weight_vec,
+                                       offset_vec, rnd_vec, dst8, dst9, dst10,
+                                       dst11);
+        HEVC_UNIW_RND_CLIP4_MAX_SATU_H(dst12, dst13, dst14, dst15, weight_vec,
+                                       offset_vec, rnd_vec, dst12, dst13, dst14,
+                                       dst15);
+        PCKEV_B2_UB(dst1, dst0, dst3, dst2, out0, out1);
+        PCKEV_B2_UB(dst5, dst4, dst7, dst6, out2, out3);
+        PCKEV_B2_UB(dst9, dst8, dst11, dst10, out4, out5);
+        PCKEV_B2_UB(dst13, dst12, dst15, dst14, out6, out7);
+        ST_UB4(out0, out1, out2, out3, dst, 16);
+        dst += dst_stride;
+        ST_UB4(out4, out5, out6, out7, dst, 16);
+        dst += dst_stride;
+    }
 }

 static void hevc_hz_uniwgt_8t_4w_msa(uint8_t *src,
diff --git a/libavutil/mips/generic_macros_msa.h b/libavutil/mips/generic_macros_msa.h
index 3ff94fd..bda3ed2 100644
--- a/libavutil/mips/generic_macros_msa.h
+++ b/libavutil/mips/generic_macros_msa.h
@@ -204,6 +204,12 @@
     out3 = LW((psrc) + 3 * stride);    \
 }

+#define LW2(psrc, stride, out0, out1)  \
+{                                      \
+    out0 = LW((psrc));                 \
+    out1 = LW((psrc) + stride);        \
+}
+
 /* Description : Load double words with
                 stride
    Arguments   : Inputs  - psrc (source pointer to load from)
                          - stride
@@ -1047,6 +1053,25 @@
     CLIP_SH2_0_255(in2, in3);  \
 }

+#define CLIP_SH_0_255_MAX_SATU(in)                    \
+( {                                                   \
+    v8i16 out_m;                                      \
+                                                      \
+    out_m = __msa_maxi_s_h((v8i16) in, 0);            \
+    out_m = (v8i16) __msa_sat_u_h((v8u16) out_m, 7);  \
+    out_m;                                            \
+} )
+#define CLIP_SH2_0_255_MAX_SATU(in0, in1)             \
+{                                                     \
+    in0 = CLIP_SH_0_255_MAX_SATU(in0);                \
+    in1 = CLIP_SH_0_255_MAX_SATU(in1);                \
+}
+#define CLIP_SH4_0_255_MAX_SATU(in0, in1, in2, in3)   \
+{                                                     \
+    CLIP_SH2_0_255_MAX_SATU(in0, in1);                \
+    CLIP_SH2_0_255_MAX_SATU(in2, in3);                \
+}
+
 /* Description : Clips all signed word elements of input vector
                  between 0 & 255
    Arguments   : Inputs  - in (input vector)
@@ -1965,6 +1990,11 @@
                  result is in place written to 'in0'
                  Similar for other pairs
 */
+#define SLLI_2V(in0, in1, shift)            \
+{                                           \
+    in0 = in0 << shift;                     \
+    in1 = in1 << shift;                     \
+}
 #define SLLI_4V(in0, in1, in2, in3, shift)  \
 {                                           \
     in0 = in0 << shift;                     \
-- 
1.7.9.5
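
For readers without the MSA intrinsics at hand, the per-pixel arithmetic that the new HEVC_UNIW_RND_CLIP2_MAX_SATU_H and CLIP_SH_0_255_MAX_SATU macros appear to perform can be sketched in scalar C. This is only an illustrative model, not code from the patch. All function names below (asr32, srar32, wrap16, adds16, uniw_copy_pixel) are hypothetical, and the sketch assumes 8-bit input, a weight and offset that fit in signed 16 bits, and rnd_val in the range 0..31. It follows the order described in the commit message: weight and round at 32 bits, pack to half word, then add the offset and clip.

```c
#include <stdint.h>

/* Portable arithmetic shift right (floor division by 2^s), 0 <= s <= 31. */
static int32_t asr32(int32_t v, int s)
{
    return v >= 0 ? v >> s : ~((~v) >> s);
}

/* Rounding shift, assumed to match SRAR: add back the last bit shifted out. */
static int32_t srar32(int32_t v, int s)
{
    if (s == 0)
        return v;
    return asr32(v, s) + (asr32(v, s - 1) & 1);
}

/* Keep the low 16 bits and reinterpret as signed, as a pack-even would. */
static int16_t wrap16(int32_t v)
{
    int32_t low = (int32_t) ((uint32_t) v & 0xFFFFu);
    return (int16_t) (low >= 0x8000 ? low - 0x10000 : low);
}

/* Signed saturating 16-bit add, modelling the ADDS step. */
static int16_t adds16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t) a + b;
    if (sum > INT16_MAX)
        sum = INT16_MAX;
    if (sum < INT16_MIN)
        sum = INT16_MIN;
    return (int16_t) sum;
}

/* Hypothetical scalar model of one uni-weighted "copy" sample. */
static uint8_t uniw_copy_pixel(uint8_t src, int16_t weight, int16_t offset,
                               int rnd_val)
{
    int32_t val = (int32_t) src << 6;   /* widen and scale, as dst <<= 6 */
    int16_t h;

    val = val * weight;                 /* 32-bit weighted value */
    val = srar32(val, rnd_val);         /* rounding shift */
    h = wrap16(val);                    /* pack to half word before clipping */
    h = adds16(h, offset);              /* saturating offset add */
    if (h < 0)                          /* max with immediate 0 */
        h = 0;
    if (h > 255)                        /* unsigned saturation to 8 bits */
        h = 255;
    return (uint8_t) h;
}
```

With unit gain (weight = 64, rnd_val = 12, offset = 0), uniw_copy_pixel(100, 64, 0, 12) returns 100. Both clamp bounds in the model are constants, which is in line with the commit message's remark that the immediate forms save one vector register.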