From patchwork Mon Jul 24 17:20:26 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Manojkumar Bhosale
X-Patchwork-Id: 4443
Delivered-To: ffmpegpatchwork@gmail.com
Received: by 10.103.1.76 with SMTP id 73csp4486791vsb;
 Mon, 24 Jul 2017 10:20:42 -0700 (PDT)
X-Received: by 10.28.97.134 with SMTP id v128mr1646464wmb.83.1500916842685;
 Mon, 24 Jul 2017 10:20:42 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1500916842; cv=none;
 d=google.com; s=arc-20160816;
 b=WeaCINg2HydCUKwJ3HZ3dY479nMCEeI9/6HcG8PJD7Tq1TaowEm7595p5W6gwfdnfs
 inaNb4ikrR+rogix3pSOiZIwRKwgcHFnz+v0SzSeRDQ0FGlU2M7xxoXbxP/BtRVo2ask
 O2triU3bTyj0cDFEPbnAHlWDbmp/wivcLrsv+wx0AM0JU2yPFiZW/I0h7AO1cA1eIFDb
 zCMh2jlvVND1g2I+3egl/c3GTxD+QLiwX7T+Wdv/I5UYN50Rh5K0PipJtwe/C8V7I69x
 NmN3dQcQrg2ZJz9DIV4c1stC3TvOOGLCPSW1ZhSM7KoFY39OgQPsN7MqHApQ+P7otBuJ
 gHGg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
 h=sender:errors-to:content-transfer-encoding:cc:reply-to
 :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
 :list-id:precedence:subject:mime-version:content-language
 :accept-language:in-reply-to:references:message-id:date:thread-index
 :thread-topic:to:from:delivered-to:arc-authentication-results;
 bh=OPFR1Zsx8/bFT2/4F0RJ1Vay6FjjCfPEgfTMMkpiHN4=;
 b=oofztY1fFQ3A55Bzw9d8ua47qr8ZAh4XqBhSM3DguKWvSb+cX+P2yP/yh4EVGU0xxA
 IJhCmQfwGR1u87z/1k70MIuuCoYgS57boWNOaEAmDiLS0A6aiw2Wa6gYPYeDTiB7TWmv
 ZCd/V1Mgfr8IUPRcPy9KxJ4/s5nfdyu/sQr1mGEoehb/av+w14G9mrHvAxt2PYuRQztg
 kpmmd4bmdFBIfrLanM+KwEoKjTHCBpku4G/61SE1JpVvgYx2A8y0W1xRe0iv2pAYCQZ9
 JQPFqWWTrt3a1msblCqIuIjYwtvz1yoF0qoPamTAg5jU1JGz50Jy0RKPWgUNVUNIcryf
 snEA==
ARC-Authentication-Results: i=1; mx.google.com;
 spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates
 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path:
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
 by mx.google.com with ESMTP id 142si5877144wmh.223.2017.07.24.10.20.42;
 Mon, 24 Jul 2017 10:20:42 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
 spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates
 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 25DF9689B3C;
 Mon, 24 Jul 2017 20:20:30 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mailapp01.imgtec.com (mailapp01.imgtec.com [195.59.15.196])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 09812689A77
 for ; Mon, 24 Jul 2017 20:20:24 +0300 (EEST)
Received: from hhmail02.hh.imgtec.org (unknown [10.100.10.20])
 by Forcepoint Email with ESMTPS id CD13FA8DBD4E6
 for ; Mon, 24 Jul 2017 18:20:28 +0100 (IST)
Received: from PUMAIL01.pu.imgtec.org (192.168.91.250) by
 hhmail02.hh.imgtec.org (10.100.10.20) with Microsoft SMTP Server (TLS) id
 14.3.294.0; Mon, 24 Jul 2017 18:20:31 +0100
Received: from PUMAIL01.pu.imgtec.org ([::1]) by PUMAIL01.pu.imgtec.org
 ([::1]) with mapi id 14.03.0266.001; Mon, 24 Jul 2017 22:50:28 +0530
From: Manojkumar Bhosale
To: FFmpeg development discussions and patches
Thread-Topic: [FFmpeg-devel] [PATCH] libavcodec/mips: Optimize avc idct 4x4
 for msa
Thread-Index: AQHTBHov/fnbQLf4HUi1aMndkhKd9KJjOSK+
Date: Mon, 24 Jul 2017 17:20:26 +0000
Message-ID: <70293ACCC3BA6A4E81FFCA024C7A86E1E058B928@PUMAIL01.pu.imgtec.org>
References: <1500900113-17355-1-git-send-email-kaustubh.raste@imgtec.com>
In-Reply-To: <1500900113-17355-1-git-send-email-kaustubh.raste@imgtec.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [10.100.10.19]
MIME-Version: 1.0
Subject: Re: [FFmpeg-devel] [PATCH] libavcodec/mips: Optimize avc idct 4x4
 for msa
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: FFmpeg development discussions and patches
Cc: Kaustubh Raste
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"

LGTM

diff --git a/libavcodec/mips/h264idct_msa.c b/libavcodec/mips/h264idct_msa.c
index fac1e7a..81e09e9 100644
--- a/libavcodec/mips/h264idct_msa.c
+++ b/libavcodec/mips/h264idct_msa.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2015 Manojkumar Bhosale (Manojkumar.Bhosale@imgtec.com)
+ * Copyright (c) 2015 - 2017 Manojkumar Bhosale (Manojkumar.Bhosale@imgtec.com)
  *
  * This file is part of FFmpeg.
  *
@@ -36,48 +36,6 @@
     BUTTERFLY_4(tmp0_m, tmp1_m, tmp2_m, tmp3_m, out0, out1, out2, out3);  \
 }
 
-static void avc_idct4x4_addblk_msa(uint8_t *dst, int16_t *src,
-                                   int32_t dst_stride)
-{
-    v8i16 src0, src1, src2, src3;
-    v8i16 hres0, hres1, hres2, hres3;
-    v8i16 vres0, vres1, vres2, vres3;
-    v8i16 zeros = { 0 };
-
-    LD4x4_SH(src, src0, src1, src2, src3);
-    AVC_ITRANS_H(src0, src1, src2, src3, hres0, hres1, hres2, hres3);
-    TRANSPOSE4x4_SH_SH(hres0, hres1, hres2, hres3, hres0, hres1, hres2, hres3);
-    AVC_ITRANS_H(hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3);
-    SRARI_H4_SH(vres0, vres1, vres2, vres3, 6);
-    ADDBLK_ST4x4_UB(vres0, vres1, vres2, vres3, dst, dst_stride);
-    ST_SH2(zeros, zeros, src, 8);
-}
-
-static void avc_idct4x4_addblk_dc_msa(uint8_t *dst, int16_t *src,
-                                      int32_t dst_stride)
-{
-    int16_t dc;
-    uint32_t src0, src1, src2, src3;
-    v16u8 pred = { 0 };
-    v16i8 out;
-    v8i16 input_dc, pred_r, pred_l;
-
-    dc = (src[0] + 32) >> 6;
-    input_dc = __msa_fill_h(dc);
-    src[0] = 0;
-
-    LW4(dst, dst_stride, src0, src1, src2, src3);
-    INSERT_W4_UB(src0, src1, src2, src3, pred);
-    UNPCK_UB_SH(pred, pred_r, pred_l);
-
-    pred_r += input_dc;
-    pred_l += input_dc;
-
-    CLIP_SH2_0_255(pred_r, pred_l);
-    out = __msa_pckev_b((v16i8) pred_l, (v16i8) pred_r);
-    ST4x4_UB(out, out, 0, 1, 2, 3, dst, dst_stride);
-}
-
 static void avc_deq_idct_luma_dc_msa(int16_t *dst, int16_t *src,
                                      int32_t de_q_val)
 {
@@ -317,11 +275,45 @@ static void avc_idct8_dc_addblk_msa(uint8_t *dst, int16_t *src,
     ST8x4_UB(dst2, dst3, dst, dst_stride);
 }
 
-void ff_h264_idct_add_msa(uint8_t *dst, int16_t *src,
-                          int32_t dst_stride)
+void ff_h264_idct_add_msa(uint8_t *dst, int16_t *src, int32_t dst_stride)
 {
-    avc_idct4x4_addblk_msa(dst, src, dst_stride);
-    memset(src, 0, 16 * sizeof(dctcoef));
+    uint32_t src0_m, src1_m, src2_m, src3_m, out0_m, out1_m, out2_m, out3_m;
+    v16i8 dst0_m = { 0 };
+    v16i8 dst1_m = { 0 };
+    v8i16 hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3;
+    v8i16 inp0_m, inp1_m, res0_m, res1_m, src1, src3;
+    const v8i16 src0 = LD_SH(src);
+    const v8i16 src2 = LD_SH(src + 8);
+    const v8i16 zero = { 0 };
+    const uint8_t *dst1 = dst + dst_stride;
+    const uint8_t *dst2 = dst + 2 * dst_stride;
+    const uint8_t *dst3 = dst + 3 * dst_stride;
+
+    ILVL_D2_SH(src0, src0, src2, src2, src1, src3);
+    ST_SH2(zero, zero, src, 8);
+    AVC_ITRANS_H(src0, src1, src2, src3, hres0, hres1, hres2, hres3);
+    TRANSPOSE4x4_SH_SH(hres0, hres1, hres2, hres3, hres0, hres1, hres2, hres3);
+    AVC_ITRANS_H(hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3);
+    src0_m = LW(dst);
+    src1_m = LW(dst1);
+    SRARI_H4_SH(vres0, vres1, vres2, vres3, 6);
+    src2_m = LW(dst2);
+    src3_m = LW(dst3);
+    ILVR_D2_SH(vres1, vres0, vres3, vres2, inp0_m, inp1_m);
+    INSERT_W2_SB(src0_m, src1_m, dst0_m);
+    INSERT_W2_SB(src2_m, src3_m, dst1_m);
+    ILVR_B2_SH(zero, dst0_m, zero, dst1_m, res0_m, res1_m);
+    ADD2(res0_m, inp0_m, res1_m, inp1_m, res0_m, res1_m);
+    CLIP_SH2_0_255(res0_m, res1_m);
+    PCKEV_B2_SB(res0_m, res0_m, res1_m, res1_m, dst0_m, dst1_m);
+    out0_m = __msa_copy_u_w((v4i32) dst0_m, 0);
+    out1_m = __msa_copy_u_w((v4i32) dst0_m, 1);
+    out2_m = __msa_copy_u_w((v4i32) dst1_m, 0);
+    out3_m = __msa_copy_u_w((v4i32) dst1_m, 1);
+    SW(out0_m, dst);
+    SW(out1_m, dst1);
+    SW(out2_m, dst2);
+    SW(out3_m, dst3);
 }
 
 void ff_h264_idct8_addblk_msa(uint8_t *dst, int16_t *src,
@@ -334,7 +326,23 @@ void ff_h264_idct8_addblk_msa(uint8_t *dst, int16_t *src,
 void ff_h264_idct4x4_addblk_dc_msa(uint8_t *dst, int16_t *src,
                                    int32_t dst_stride)
 {
-    avc_idct4x4_addblk_dc_msa(dst, src, dst_stride);
+    v16u8 pred = { 0 };
+    v16i8 out;
+    v8i16 pred_r, pred_l;
+    const uint32_t src0 = LW(dst);
+    const uint32_t src1 = LW(dst + dst_stride);
+    const uint32_t src2 = LW(dst + 2 * dst_stride);
+    const uint32_t src3 = LW(dst + 3 * dst_stride);
+    const int16_t dc = (src[0] + 32) >> 6;
+    const v8i16 input_dc = __msa_fill_h(dc);
+
+    src[0] = 0;
+    INSERT_W4_UB(src0, src1, src2, src3, pred);
+    UNPCK_UB_SH(pred, pred_r, pred_l);
+    ADD2(pred_r, input_dc, pred_l, input_dc, pred_r, pred_l);
+    CLIP_SH2_0_255(pred_r, pred_l);
+    out = __msa_pckev_b((v16i8) pred_l, (v16i8) pred_r);
+    ST4x4_UB(out, out, 0, 1, 2, 3, dst, dst_stride);
 }
 
 void ff_h264_idct8_dc_addblk_msa(uint8_t *dst, int16_t *src,
diff --git a/libavutil/mips/generic_macros_msa.h b/libavutil/mips/generic_macros_msa.h
index 61a8ee0..407d46e 100644
--- a/libavutil/mips/generic_macros_msa.h
+++ b/libavutil/mips/generic_macros_msa.h
@@ -1531,6 +1531,24 @@
 #define ILVR_D4_SB(...) ILVR_D4(v16i8, __VA_ARGS__)
 #define ILVR_D4_UB(...) ILVR_D4(v16u8, __VA_ARGS__)
 
+/* Description : Interleave left half of double word elements from vectors
+   Arguments   : Inputs  - in0, in1, in2, in3
+                 Outputs - out0, out1
+                 Return Type - as per RTYPE
+   Details     : Left half of double word elements of in0 and left half of
+                 double word elements of in1 are interleaved and copied to out0.
+                 Left half of double word elements of in2 and left half of
+                 double word elements of in3 are interleaved and copied to out1.
+*/
+#define ILVL_D2(RTYPE, in0, in1, in2, in3, out0, out1)      \
+{                                                           \
+    out0 = (RTYPE) __msa_ilvl_d((v2i64) in0, (v2i64) in1);  \
+    out1 = (RTYPE) __msa_ilvl_d((v2i64) in2, (v2i64) in3);  \
+}
+#define ILVL_D2_UB(...) ILVL_D2(v16u8, __VA_ARGS__)
+#define ILVL_D2_SB(...) ILVL_D2(v16i8, __VA_ARGS__)
+#define ILVL_D2_SH(...) ILVL_D2(v8i16, __VA_ARGS__)
+
 /* Description : Interleave both left and right half of input vectors
    Arguments   : Inputs  - in0, in1
                  Outputs - out0, out1