From patchwork Tue Oct 10 10:37:27 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Manojkumar Bhosale
X-Patchwork-Id: 5513
From: Manojkumar Bhosale
To: FFmpeg development discussions and patches
Date: Tue, 10 Oct 2017 10:37:27 +0000
Message-ID: <70293ACCC3BA6A4E81FFCA024C7A86E1E05944EF@PUMAIL01.pu.imgtec.org>
References: <1507551525-9406-1-git-send-email-kaustubh.raste@imgtec.com>
In-Reply-To: <1507551525-9406-1-git-send-email-kaustubh.raste@imgtec.com>
Subject: Re: [FFmpeg-devel] [PATCH] avcodec/mips: Improve avc uni copy mc msa functions
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: Kaustubh Raste

LGTM

-----Original Message-----
From: ffmpeg-devel [mailto:ffmpeg-devel-bounces@ffmpeg.org] On Behalf Of kaustubh.raste@imgtec.com
Sent: Monday, October 9, 2017 5:49 PM
To: ffmpeg-devel@ffmpeg.org
Cc: Kaustubh Raste
Subject: [FFmpeg-devel] [PATCH] avcodec/mips: Improve avc uni copy mc msa functions

From: Kaustubh Raste

Load the specific bytes instead of MSA load.
Signed-off-by: Kaustubh Raste
---
 libavcodec/mips/hevc_mc_uni_msa.c | 245 +++++++++++++++----------------------
 1 file changed, 100 insertions(+), 145 deletions(-)

     ST12x8_UB(src0, src1, src2, src3, src4, src5, src6, src7, dst, dst_stride);
 }

-static void copy_16multx8mult_msa(uint8_t *src, int32_t src_stride,
-                                  uint8_t *dst, int32_t dst_stride,
-                                  int32_t height, int32_t width)
-{
-    int32_t cnt, loop_cnt;
-    uint8_t *src_tmp, *dst_tmp;
-    v16u8 src0, src1, src2, src3, src4, src5, src6, src7;
-
-    for (cnt = (width >> 4); cnt--;) {
-        src_tmp = src;
-        dst_tmp = dst;
-
-        for (loop_cnt = (height >> 3); loop_cnt--;) {
-            LD_UB8(src_tmp, src_stride,
-                   src0, src1, src2, src3, src4, src5, src6, src7);
-            src_tmp += (8 * src_stride);
-
-            ST_UB8(src0, src1, src2, src3, src4, src5, src6, src7,
-                   dst_tmp, dst_stride);
-            dst_tmp += (8 * dst_stride);
-        }
-
-        src += 16;
-        dst += 16;
-    }
-}
-
 static void copy_width16_msa(uint8_t *src, int32_t src_stride,
                              uint8_t *dst, int32_t dst_stride,
                              int32_t height)
@@ -156,23 +85,25 @@ static void copy_width16_msa(uint8_t *src, int32_t src_stride,
     int32_t cnt;
     v16u8 src0, src1, src2, src3, src4, src5, src6, src7;

-    if (0 == height % 12) {
-        for (cnt = (height / 12); cnt--;) {
-            LD_UB8(src, src_stride,
-                   src0, src1, src2, src3, src4, src5, src6, src7);
+    if (12 == height) {
+        LD_UB8(src, src_stride, src0, src1, src2, src3, src4, src5, src6, src7);
+        src += (8 * src_stride);
+        ST_UB8(src0, src1, src2, src3, src4, src5, src6, src7, dst, dst_stride);
+        dst += (8 * dst_stride);
+        LD_UB4(src, src_stride, src0, src1, src2, src3);
+        src += (4 * src_stride);
+        ST_UB4(src0, src1, src2, src3, dst, dst_stride);
+        dst += (4 * dst_stride);
+    } else if (0 == (height % 8)) {
+        for (cnt = (height >> 3); cnt--;) {
+            LD_UB8(src, src_stride, src0, src1, src2, src3, src4, src5, src6,
+                   src7);
             src += (8 * src_stride);
-            ST_UB8(src0, src1, src2, src3, src4, src5, src6, src7,
-                   dst, dst_stride);
+            ST_UB8(src0, src1, src2, src3, src4, src5, src6, src7, dst,
+                   dst_stride);
             dst += (8 * dst_stride);
-
-            LD_UB4(src, src_stride, src0, src1, src2, src3);
-            src += (4 * src_stride);
-            ST_UB4(src0, src1, src2, src3, dst, dst_stride);
-            dst += (4 * dst_stride);
         }
-    } else if (0 == height % 8) {
-        copy_16multx8mult_msa(src, src_stride, dst, dst_stride, height, 16);
-    } else if (0 == height % 4) {
+    } else if (0 == (height % 4)) {
         for (cnt = (height >> 2); cnt--;) {
             LD_UB4(src, src_stride, src0, src1, src2, src3);
             src += (4 * src_stride);
@@ -187,8 +118,23 @@ static void copy_width24_msa(uint8_t *src, int32_t src_stride,
                              uint8_t *dst, int32_t dst_stride,
                              int32_t height)
 {
-    copy_16multx8mult_msa(src, src_stride, dst, dst_stride, height, 16);
-    copy_width8_msa(src + 16, src_stride, dst + 16, dst_stride, height);
+    int32_t cnt;
+    v16u8 src0, src1, src2, src3, src4, src5, src6, src7;
+    uint64_t out0, out1, out2, out3, out4, out5, out6, out7;
+
+    for (cnt = 4; cnt--;) {
+        LD_UB8(src, src_stride, src0, src1, src2, src3, src4, src5, src6, src7);
+        LD4(src + 16, src_stride, out0, out1, out2, out3);
+        src += (4 * src_stride);
+        LD4(src + 16, src_stride, out4, out5, out6, out7);
+        src += (4 * src_stride);
+
+        ST_UB8(src0, src1, src2, src3, src4, src5, src6, src7, dst, dst_stride);
+        SD4(out0, out1, out2, out3, dst + 16, dst_stride);
+        dst += (4 * dst_stride);
+        SD4(out4, out5, out6, out7, dst + 16, dst_stride);
+        dst += (4 * dst_stride);
+    }
 }

 static void copy_width32_msa(uint8_t *src, int32_t src_stride,
@@ -198,40 +144,13 @@ static void copy_width32_msa(uint8_t *src, int32_t src_stride,
     int32_t cnt;
     v16u8 src0, src1, src2, src3, src4, src5, src6, src7;

-    if (0 == height % 12) {
-        for (cnt = (height / 12); cnt--;) {
-            LD_UB4(src, src_stride, src0, src1, src2, src3);
-            LD_UB4(src + 16, src_stride, src4, src5, src6, src7);
-            src += (4 * src_stride);
-            ST_UB4(src0, src1, src2, src3, dst, dst_stride);
-            ST_UB4(src4, src5, src6, src7, dst + 16, dst_stride);
-            dst += (4 * dst_stride);
-
-            LD_UB4(src, src_stride, src0, src1, src2, src3);
-            LD_UB4(src + 16, src_stride, src4, src5, src6, src7);
-            src += (4 * src_stride);
-            ST_UB4(src0, src1, src2, src3, dst, dst_stride);
-            ST_UB4(src4, src5, src6, src7, dst + 16, dst_stride);
-            dst += (4 * dst_stride);
-
-            LD_UB4(src, src_stride, src0, src1, src2, src3);
-            LD_UB4(src + 16, src_stride, src4, src5, src6, src7);
-            src += (4 * src_stride);
-            ST_UB4(src0, src1, src2, src3, dst, dst_stride);
-            ST_UB4(src4, src5, src6, src7, dst + 16, dst_stride);
-            dst += (4 * dst_stride);
-        }
-    } else if (0 == height % 8) {
-        copy_16multx8mult_msa(src, src_stride, dst, dst_stride, height, 32);
-    } else if (0 == height % 4) {
-        for (cnt = (height >> 2); cnt--;) {
-            LD_UB4(src, src_stride, src0, src1, src2, src3);
-            LD_UB4(src + 16, src_stride, src4, src5, src6, src7);
-            src += (4 * src_stride);
-            ST_UB4(src0, src1, src2, src3, dst, dst_stride);
-            ST_UB4(src4, src5, src6, src7, dst + 16, dst_stride);
-            dst += (4 * dst_stride);
-        }
+    for (cnt = (height >> 2); cnt--;) {
+        LD_UB4(src, src_stride, src0, src1, src2, src3);
+        LD_UB4(src + 16, src_stride, src4, src5, src6, src7);
+        src += (4 * src_stride);
+        ST_UB4(src0, src1, src2, src3, dst, dst_stride);
+        ST_UB4(src4, src5, src6, src7, dst + 16, dst_stride);
+        dst += (4 * dst_stride);
     }
 }

@@ -239,14 +158,50 @@ static void copy_width48_msa(uint8_t *src, int32_t src_stride,
                              uint8_t *dst, int32_t dst_stride,
                              int32_t height)
 {
-    copy_16multx8mult_msa(src, src_stride, dst, dst_stride, height, 48);
+    int32_t cnt;
+    v16u8 src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10;
+    v16u8 src11;
+
+    for (cnt = (height >> 2); cnt--;) {
+        LD_UB4(src, src_stride, src0, src1, src2, src3);
+        LD_UB4(src + 16, src_stride, src4, src5, src6, src7);
+        LD_UB4(src + 32, src_stride, src8, src9, src10, src11);
+        src += (4 * src_stride);
+
+        ST_UB4(src0, src1, src2, src3, dst, dst_stride);
+        ST_UB4(src4, src5, src6, src7, dst + 16, dst_stride);
+        ST_UB4(src8, src9, src10, src11, dst + 32, dst_stride);
+        dst += (4 * dst_stride);
+    }
 }

 static void copy_width64_msa(uint8_t *src, int32_t src_stride,
                              uint8_t *dst, int32_t dst_stride,
                              int32_t height)
 {
-    copy_16multx8mult_msa(src, src_stride, dst, dst_stride, height, 64);
+    int32_t cnt;
+    v16u8 src0, src1, src2, src3, src4, src5, src6, src7;
+    v16u8 src8, src9, src10, src11, src12, src13, src14, src15;
+
+    for (cnt = (height >> 2); cnt--;) {
+        LD_UB4(src, 16, src0, src1, src2, src3);
+        src += src_stride;
+        LD_UB4(src, 16, src4, src5, src6, src7);
+        src += src_stride;
+        LD_UB4(src, 16, src8, src9, src10, src11);
+        src += src_stride;
+        LD_UB4(src, 16, src12, src13, src14, src15);
+        src += src_stride;
+
+        ST_UB4(src0, src1, src2, src3, dst, 16);
+        dst += dst_stride;
+        ST_UB4(src4, src5, src6, src7, dst, 16);
+        dst += dst_stride;
+        ST_UB4(src8, src9, src10, src11, dst, 16);
+        dst += dst_stride;
+        ST_UB4(src12, src13, src14, src15, dst, 16);
+        dst += dst_stride;
+    }
 }

 static const uint8_t mc_filt_mask_arr[16 * 3] = {
-- 
1.7.9.5

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
diff --git a/libavcodec/mips/hevc_mc_uni_msa.c b/libavcodec/mips/hevc_mc_uni_msa.c
index cf22e7f..eead591 100644
--- a/libavcodec/mips/hevc_mc_uni_msa.c
+++ b/libavcodec/mips/hevc_mc_uni_msa.c
@@ -28,83 +28,39 @@ static void copy_width8_msa(uint8_t *src, int32_t src_stride,
 {
     int32_t cnt;
     uint64_t out0, out1, out2, out3, out4, out5, out6, out7;
-    v16u8 src0, src1, src2, src3, src4, src5, src6, src7;
-
-    if (0 == height % 12) {
-        for (cnt = (height / 12); cnt--;) {
-            LD_UB8(src, src_stride,
-                   src0, src1, src2, src3, src4, src5, src6, src7);
-            src += (8 * src_stride);
-
-            out0 = __msa_copy_u_d((v2i64) src0, 0);
-            out1 = __msa_copy_u_d((v2i64) src1, 0);
-            out2 = __msa_copy_u_d((v2i64) src2, 0);
-            out3 = __msa_copy_u_d((v2i64) src3, 0);
-            out4 = __msa_copy_u_d((v2i64) src4, 0);
-            out5 = __msa_copy_u_d((v2i64) src5, 0);
-            out6 = __msa_copy_u_d((v2i64) src6, 0);
-            out7 = __msa_copy_u_d((v2i64) src7, 0);
-            SD4(out0, out1, out2, out3, dst, dst_stride);
-            dst += (4 * dst_stride);
-            SD4(out4, out5, out6, out7, dst, dst_stride);
-            dst += (4 * dst_stride);
-
-            LD_UB4(src, src_stride, src0, src1, src2, src3);
+    if (2 == height) {
+        LD2(src, src_stride, out0, out1);
+        SD(out0, dst);
+        dst += dst_stride;
+        SD(out1, dst);
+    } else if (6 == height) {
+        LD4(src, src_stride, out0, out1, out2, out3);
+        src += (4 * src_stride);
+        SD4(out0, out1, out2, out3, dst, dst_stride);
+        dst += (4 * dst_stride);
+        LD2(src, src_stride, out0, out1);
+        SD(out0, dst);
+        dst += dst_stride;
+        SD(out1, dst);
+    } else if (0 == (height % 8)) {
+        for (cnt = (height >> 3); cnt--;) {
+            LD4(src, src_stride, out0, out1, out2, out3);
+            src += (4 * src_stride);
+            LD4(src, src_stride, out4, out5, out6, out7);
             src += (4 * src_stride);
-
-            out0 = __msa_copy_u_d((v2i64) src0, 0);
-            out1 = __msa_copy_u_d((v2i64) src1, 0);
-            out2 = __msa_copy_u_d((v2i64) src2, 0);
-            out3 = __msa_copy_u_d((v2i64) src3, 0);
-
-            SD4(out0, out1, out2, out3, dst, dst_stride);
-            dst += (4 * dst_stride);
-        }
-    } else if (0 == height % 8) {
-        for (cnt = height >> 3; cnt--;) {
-            LD_UB8(src, src_stride,
-                   src0, src1, src2, src3, src4, src5, src6, src7);
-            src += (8 * src_stride);
-
-            out0 = __msa_copy_u_d((v2i64) src0, 0);
-            out1 = __msa_copy_u_d((v2i64) src1, 0);
-            out2 = __msa_copy_u_d((v2i64) src2, 0);
-            out3 = __msa_copy_u_d((v2i64) src3, 0);
-            out4 = __msa_copy_u_d((v2i64) src4, 0);
-            out5 = __msa_copy_u_d((v2i64) src5, 0);
-            out6 = __msa_copy_u_d((v2i64) src6, 0);
-            out7 = __msa_copy_u_d((v2i64) src7, 0);
-
             SD4(out0, out1, out2, out3, dst, dst_stride);
             dst += (4 * dst_stride);
             SD4(out4, out5, out6, out7, dst, dst_stride);
             dst += (4 * dst_stride);
         }
-    } else if (0 == height % 4) {
-        for (cnt = (height / 4); cnt--;) {
-            LD_UB4(src, src_stride, src0, src1, src2, src3);
+    } else if (0 == (height % 4)) {
+        for (cnt = (height >> 2); cnt--;) {
+            LD4(src, src_stride, out0, out1, out2, out3);
             src += (4 * src_stride);

-            out0 = __msa_copy_u_d((v2i64) src0, 0);
-            out1 = __msa_copy_u_d((v2i64) src1, 0);
-            out2 = __msa_copy_u_d((v2i64) src2, 0);
-            out3 = __msa_copy_u_d((v2i64) src3, 0);
-
             SD4(out0, out1, out2, out3, dst, dst_stride);
             dst += (4 * dst_stride);
         }
-    } else if (0 == height % 2) {
-        for (cnt = (height / 2); cnt--;) {
-            LD_UB2(src, src_stride, src0, src1);
-            src += (2 * src_stride);
-            out0 = __msa_copy_u_d((v2i64) src0, 0);
-            out1 = __msa_copy_u_d((v2i64) src1, 0);
-
-            SD(out0, dst);
-            dst += dst_stride;
-            SD(out1, dst);
-            dst += dst_stride;
-        }
     }
 }

@@ -122,33 +78,6 @@ static void copy_width12_msa(uint8_t *src, int32_t src_stride,
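
Note (not part of the patch): the commit message's "load the specific bytes instead of MSA load" can be illustrated with a small portable sketch. The idea, as far as the diff shows, is that an 8-pixel-wide row is moved with a 64-bit scalar load and store rather than by loading a full 16-byte vector and extracting its low doubleword. The names `copy_row8` and `copy_width8_sketch` below are hypothetical and stand in for the patch's `LD4`/`SD4` macros; `memcpy` is used here only as a portable way to express an unaligned 8-byte access and is an assumption, not how the FFmpeg macros are implemented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: move exactly 8 bytes of one row through a 64-bit
 * scalar, so no bytes beyond the 8-pixel block are read or written. */
static void copy_row8(const uint8_t *src, uint8_t *dst)
{
    uint64_t tmp;

    memcpy(&tmp, src, sizeof(tmp)); /* 8-byte load, alignment-agnostic */
    memcpy(dst, &tmp, sizeof(tmp)); /* 8-byte store */
}

/* Hypothetical row loop: copies an 8-byte-wide block of `height` rows.
 * The patch instead unrolls by 2, 4 or 8 rows depending on the height. */
static void copy_width8_sketch(const uint8_t *src, int32_t src_stride,
                               uint8_t *dst, int32_t dst_stride,
                               int32_t height)
{
    int32_t row;

    for (row = 0; row < height; row++) {
        copy_row8(src, dst);
        src += src_stride;
        dst += dst_stride;
    }
}
```

Calling `copy_width8_sketch(src, 16, dst, 16, 4)` on two 64-byte buffers copies the first 8 bytes of each of the 4 rows and leaves the remaining bytes of `dst` untouched.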