From patchwork Mon Jul 24 12:41:53 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: kaustubh.raste@imgtec.com
X-Patchwork-Id: 4437
From: <kaustubh.raste@imgtec.com>
To: <ffmpeg-devel@ffmpeg.org>
Date: Mon, 24 Jul 2017 18:11:53 +0530
Message-ID: <1500900113-17355-1-git-send-email-kaustubh.raste@imgtec.com>
X-Mailer: git-send-email 1.7.9.5
Subject: [FFmpeg-devel] [PATCH] libavcodec/mips: Optimize avc idct 4x4 for msa
List-Id: FFmpeg development discussions and patches
Cc: Kaustubh Raste <kaustubh.raste@imgtec.com>

From: Kaustubh Raste <kaustubh.raste@imgtec.com>

Removed memset call and improved performance.

Signed-off-by: Kaustubh Raste <kaustubh.raste@imgtec.com>
---
 libavcodec/mips/h264idct_msa.c      | 104 +++++++++++++++++++----------------
 libavutil/mips/generic_macros_msa.h |  18 ++++++
 2 files changed, 74 insertions(+), 48 deletions(-)

diff --git a/libavcodec/mips/h264idct_msa.c b/libavcodec/mips/h264idct_msa.c
index fac1e7a..81e09e9 100644
--- a/libavcodec/mips/h264idct_msa.c
+++ b/libavcodec/mips/h264idct_msa.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2015 Manojkumar Bhosale (Manojkumar.Bhosale@imgtec.com)
+ * Copyright (c) 2015 - 2017 Manojkumar Bhosale (Manojkumar.Bhosale@imgtec.com)
  *
  * This file is part of FFmpeg.
  *
@@ -36,48 +36,6 @@
     BUTTERFLY_4(tmp0_m, tmp1_m, tmp2_m, tmp3_m, out0, out1, out2, out3);  \
 }
 
-static void avc_idct4x4_addblk_msa(uint8_t *dst, int16_t *src,
-                                   int32_t dst_stride)
-{
-    v8i16 src0, src1, src2, src3;
-    v8i16 hres0, hres1, hres2, hres3;
-    v8i16 vres0, vres1, vres2, vres3;
-    v8i16 zeros = { 0 };
-
-    LD4x4_SH(src, src0, src1, src2, src3);
-    AVC_ITRANS_H(src0, src1, src2, src3, hres0, hres1, hres2, hres3);
-    TRANSPOSE4x4_SH_SH(hres0, hres1, hres2, hres3, hres0, hres1, hres2, hres3);
-    AVC_ITRANS_H(hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3);
-    SRARI_H4_SH(vres0, vres1, vres2, vres3, 6);
-    ADDBLK_ST4x4_UB(vres0, vres1, vres2, vres3, dst, dst_stride);
-    ST_SH2(zeros, zeros, src, 8);
-}
-
-static void avc_idct4x4_addblk_dc_msa(uint8_t *dst, int16_t *src,
-                                      int32_t dst_stride)
-{
-    int16_t dc;
-    uint32_t src0, src1, src2, src3;
-    v16u8 pred = { 0 };
-    v16i8 out;
-    v8i16 input_dc, pred_r, pred_l;
-
-    dc = (src[0] + 32) >> 6;
-    input_dc = __msa_fill_h(dc);
-    src[0] = 0;
-
-    LW4(dst, dst_stride, src0, src1, src2, src3);
-    INSERT_W4_UB(src0, src1, src2, src3, pred);
-    UNPCK_UB_SH(pred, pred_r, pred_l);
-
-    pred_r += input_dc;
-    pred_l += input_dc;
-
-    CLIP_SH2_0_255(pred_r, pred_l);
-    out = __msa_pckev_b((v16i8) pred_l, (v16i8) pred_r);
-    ST4x4_UB(out, out, 0, 1, 2, 3, dst, dst_stride);
-}
-
 static void avc_deq_idct_luma_dc_msa(int16_t *dst, int16_t *src,
                                      int32_t de_q_val)
 {
@@ -317,11 +275,45 @@ static void avc_idct8_dc_addblk_msa(uint8_t *dst, int16_t *src,
     ST8x4_UB(dst2, dst3, dst, dst_stride);
 }
 
-void ff_h264_idct_add_msa(uint8_t *dst, int16_t *src,
-                          int32_t dst_stride)
+void ff_h264_idct_add_msa(uint8_t *dst, int16_t *src, int32_t dst_stride)
 {
-    avc_idct4x4_addblk_msa(dst, src, dst_stride);
-    memset(src, 0, 16 * sizeof(dctcoef));
+    uint32_t src0_m, src1_m, src2_m, src3_m, out0_m, out1_m, out2_m, out3_m;
+    v16i8 dst0_m = { 0 };
+    v16i8 dst1_m = { 0 };
+    v8i16 hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3;
+    v8i16 inp0_m, inp1_m, res0_m, res1_m, src1, src3;
+    const v8i16 src0 = LD_SH(src);
+    const v8i16 src2 = LD_SH(src + 8);
+    const v8i16 zero = { 0 };
+    const uint8_t *dst1 = dst + dst_stride;
+    const uint8_t *dst2 = dst + 2 * dst_stride;
+    const uint8_t *dst3 = dst + 3 * dst_stride;
+
+    ILVL_D2_SH(src0, src0, src2, src2, src1, src3);
+    ST_SH2(zero, zero, src, 8);
+    AVC_ITRANS_H(src0, src1, src2, src3, hres0, hres1, hres2, hres3);
+    TRANSPOSE4x4_SH_SH(hres0, hres1, hres2, hres3, hres0, hres1, hres2, hres3);
+    AVC_ITRANS_H(hres0, hres1, hres2, hres3, vres0, vres1, vres2, vres3);
+    src0_m = LW(dst);
+    src1_m = LW(dst1);
+    SRARI_H4_SH(vres0, vres1, vres2, vres3, 6);
+    src2_m = LW(dst2);
+    src3_m = LW(dst3);
+    ILVR_D2_SH(vres1, vres0, vres3, vres2, inp0_m, inp1_m);
+    INSERT_W2_SB(src0_m, src1_m, dst0_m);
+    INSERT_W2_SB(src2_m, src3_m, dst1_m);
+    ILVR_B2_SH(zero, dst0_m, zero, dst1_m, res0_m, res1_m);
+    ADD2(res0_m, inp0_m, res1_m, inp1_m, res0_m, res1_m);
+    CLIP_SH2_0_255(res0_m, res1_m);
+    PCKEV_B2_SB(res0_m, res0_m, res1_m, res1_m, dst0_m, dst1_m);
+    out0_m = __msa_copy_u_w((v4i32) dst0_m, 0);
+    out1_m = __msa_copy_u_w((v4i32) dst0_m, 1);
+    out2_m = __msa_copy_u_w((v4i32) dst1_m, 0);
+    out3_m = __msa_copy_u_w((v4i32) dst1_m, 1);
+    SW(out0_m, dst);
+    SW(out1_m, dst1);
+    SW(out2_m, dst2);
+    SW(out3_m, dst3);
 }
 
 void ff_h264_idct8_addblk_msa(uint8_t *dst, int16_t *src,
@@ -334,7 +326,23 @@ void ff_h264_idct8_addblk_msa(uint8_t *dst, int16_t *src,
 void ff_h264_idct4x4_addblk_dc_msa(uint8_t *dst, int16_t *src,
                                    int32_t dst_stride)
 {
-    avc_idct4x4_addblk_dc_msa(dst, src, dst_stride);
+    v16u8 pred = { 0 };
+    v16i8 out;
+    v8i16 pred_r, pred_l;
+    const uint32_t src0 = LW(dst);
+    const uint32_t src1 = LW(dst + dst_stride);
+    const uint32_t src2 = LW(dst + 2 * dst_stride);
+    const uint32_t src3 = LW(dst + 3 * dst_stride);
+    const int16_t dc = (src[0] + 32) >> 6;
+    const v8i16 input_dc = __msa_fill_h(dc);
+
+    src[0] = 0;
+    INSERT_W4_UB(src0, src1, src2, src3, pred);
+    UNPCK_UB_SH(pred, pred_r, pred_l);
+    ADD2(pred_r, input_dc, pred_l, input_dc, pred_r, pred_l);
+    CLIP_SH2_0_255(pred_r, pred_l);
+    out = __msa_pckev_b((v16i8) pred_l, (v16i8) pred_r);
+    ST4x4_UB(out, out, 0, 1, 2, 3, dst, dst_stride);
 }
 
 void ff_h264_idct8_dc_addblk_msa(uint8_t *dst, int16_t *src,
diff --git a/libavutil/mips/generic_macros_msa.h b/libavutil/mips/generic_macros_msa.h
index 61a8ee0..407d46e 100644
--- a/libavutil/mips/generic_macros_msa.h
+++ b/libavutil/mips/generic_macros_msa.h
@@ -1531,6 +1531,24 @@
 #define ILVR_D4_SB(...) ILVR_D4(v16i8, __VA_ARGS__)
 #define ILVR_D4_UB(...) ILVR_D4(v16u8, __VA_ARGS__)
 
+/* Description : Interleave left half of double word elements from vectors
+   Arguments   : Inputs  - in0, in1, in2, in3
+                 Outputs - out0, out1
+                 Return Type - as per RTYPE
+   Details     : Left half of double word elements of in0 and left half of
+                 double word elements of in1 are interleaved and copied to out0.
+                 Left half of double word elements of in2 and left half of
+                 double word elements of in3 are interleaved and copied to out1.
+*/
+#define ILVL_D2(RTYPE, in0, in1, in2, in3, out0, out1)       \
+{                                                            \
+    out0 = (RTYPE) __msa_ilvl_d((v2i64) in0, (v2i64) in1);  \
+    out1 = (RTYPE) __msa_ilvl_d((v2i64) in2, (v2i64) in3);  \
+}
+#define ILVL_D2_UB(...) ILVL_D2(v16u8, __VA_ARGS__)
+#define ILVL_D2_SB(...) ILVL_D2(v16i8, __VA_ARGS__)
+#define ILVL_D2_SH(...) ILVL_D2(v8i16, __VA_ARGS__)
+
 /* Description : Interleave both left and right half of input vectors
    Arguments   : Inputs  - in0, in1
                  Outputs - out0, out1