From patchwork Wed Sep 13 09:31:00 2017
X-Patchwork-Submitter: Manojkumar Bhosale
X-Patchwork-Id: 5131
From: Manojkumar Bhosale
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Date: Wed, 13 Sep 2017 09:31:00 +0000
Message-ID: <70293ACCC3BA6A4E81FFCA024C7A86E1E05917D4@PUMAIL01.pu.imgtec.org>
References: <1505215918-16498-1-git-send-email-kaustubh.raste@imgtec.com>
In-Reply-To:
<1505215918-16498-1-git-send-email-kaustubh.raste@imgtec.com>
Subject: Re: [FFmpeg-devel] [PATCH] avcodec/mips: Improve hevc idct msa functions
Cc: Kaustubh Raste

LGTM

-----Original Message-----
From: ffmpeg-devel [mailto:ffmpeg-devel-bounces@ffmpeg.org] On Behalf Of kaustubh.raste@imgtec.com
Sent: Tuesday, September 12, 2017 5:02 PM
To: ffmpeg-devel@ffmpeg.org
Cc: Kaustubh Raste
Subject: [FFmpeg-devel] [PATCH] avcodec/mips: Improve hevc idct msa functions

From: Kaustubh Raste

Align the buffers. Remove redundant constant array.

Signed-off-by: Kaustubh Raste
---
 libavcodec/mips/hevc_idct_msa.c |  255 ++++++++++++++++++++++++++-------------
 1 file changed, 171 insertions(+), 84 deletions(-)

-static const int16_t gt32x32_cnst1[64] = {
+static const int16_t gt32x32_cnst1[64] __attribute__ ((aligned (64))) = {
     90, 87, 80, 70, 57, 43, 25, 9, 87, 57, 9, -43, -80, -90, -70, -25,
     80, 9, -70, -87, -25, 57, 90, 43, 70, -43, -87, 9, 90, 25, -80, -57,
     57, -80, -25, 90, -9, -87, 43, 70, 43, -90, 57, 25, -87, 70, 9, -80,
     25, -70, 90, -80, 43, 9, -57, 87, 9, -25, 43, -57, 70, -80, 87, -90
 };
 
-static const int16_t gt32x32_cnst2[16] = {
+static const int16_t gt32x32_cnst2[16] __attribute__ ((aligned (64))) = {
     89, 75, 50, 18, 75, -18, -89, -50, 50, -89, 18, 75, 18, -50, 75, -89
 };
 
-static const int16_t gt32x32_cnst3[16] = {
-    64, 64, 64, 64, 83, 36, -36, -83, 64, -64, -64, 64, 36, -83, 83, -36
-};
-
 #define HEVC_IDCT4x4_COL(in_r0, in_l0, in_r1, in_l1,          \
                          sum0, sum1, sum2, sum3, shift)       \
 {                                                             \
@@ -323,8 +319,12 @@ static void hevc_idct_4x4_msa(int16_t *coeffs)
     HEVC_IDCT4x4_COL(in_r0, in_l0, in_r1, in_l1, sum0, sum1, sum2, sum3, 7);
     TRANSPOSE4x4_SW_SW(sum0, sum1, sum2, sum3, in_r0, in_l0, in_r1, in_l1);
     HEVC_IDCT4x4_COL(in_r0, in_l0, in_r1, in_l1, sum0, sum1, sum2, sum3, 12);
-    TRANSPOSE4x4_SW_SW(sum0, sum1, sum2, sum3, sum0, sum1, sum2, sum3);
-    PCKEV_H2_SH(sum1, sum0, sum3, sum2, in0, in1);
+
+    /* Pack and transpose */
+    PCKEV_H2_SH(sum2, sum0, sum3, sum1, in0, in1);
+    ILVRL_H2_SW(in1, in0, sum0, sum1);
+    ILVRL_W2_SH(sum1, sum0, in0, in1);
+
     ST_SH2(in0, in1, coeffs, 8);
 }
@@ -432,27 +432,35 @@ static void hevc_idct_8x32_column_msa(int16_t *coeffs, uint8_t buf_pitch,
     const int16_t *filter_ptr0 = &gt32x32_cnst0[0];
     const int16_t *filter_ptr1 = &gt32x32_cnst1[0];
     const int16_t *filter_ptr2 = &gt32x32_cnst2[0];
-    const int16_t *filter_ptr3 = &gt32x32_cnst3[0];
+    const int16_t *filter_ptr3 = &gt8x8_cnst[0];
     int16_t *src0 = (coeffs + buf_pitch);
     int16_t *src1 = (coeffs + 2 * buf_pitch);
     int16_t *src2 = (coeffs + 4 * buf_pitch);
     int16_t *src3 = (coeffs);
     int32_t cnst0, cnst1;
-    int32_t tmp_buf[8 * 32];
-    int32_t *tmp_buf_ptr = &tmp_buf[0];
+    int32_t tmp_buf[8 * 32 + 15];
+    int32_t *tmp_buf_ptr = tmp_buf + 15;
     v8i16 in0, in1, in2, in3, in4, in5, in6, in7;
     v8i16 src0_r, src1_r, src2_r, src3_r, src4_r, src5_r, src6_r, src7_r;
     v8i16 src0_l, src1_l, src2_l, src3_l, src4_l, src5_l, src6_l, src7_l;
     v8i16 filt0, filter0, filter1, filter2, filter3;
     v4i32 sum0_r, sum0_l, sum1_r, sum1_l, tmp0_r, tmp0_l, tmp1_r, tmp1_l;
 
+    /* Align pointer to 64 byte boundary */
+    tmp_buf_ptr = (int32_t *)(((uintptr_t) tmp_buf_ptr) & ~(uintptr_t) 63);
+
     /* process coeff 4, 12, 20, 28 */
     LD_SH4(src2, 8 * buf_pitch, in0, in1, in2, in3);
     ILVR_H2_SH(in1, in0, in3, in2, src0_r, src1_r);
     ILVL_H2_SH(in1, in0, in3, in2, src0_l, src1_l);
 
+    LD_SH2(src3, 16 * buf_pitch, in4, in6);
+    LD_SH2((src3 + 8 * buf_pitch), 16 * buf_pitch, in5, in7);
+    ILVR_H2_SH(in6, in4, in7, in5, src2_r, src3_r);
+    ILVL_H2_SH(in6, in4, in7, in5, src2_l, src3_l);
+
     /* loop for all columns of constants */
-    for (i = 0; i < 4; i++) {
+    for (i = 0; i < 2; i++) {
         /* processing single column of constants */
         cnst0 = LW(filter_ptr2);
         cnst1 = LW(filter_ptr2 + 2);
@@ -462,37 +470,39 @@ static void hevc_idct_8x32_column_msa(int16_t *coeffs, uint8_t buf_pitch,
         DOTP_SH2_SW(src0_r, src0_l, filter0, filter0, sum0_r, sum0_l);
         DPADD_SH2_SW(src1_r, src1_l, filter1, filter1, sum0_r, sum0_l);
-        ST_SW2(sum0_r, sum0_l, (tmp_buf_ptr + i * 8), 4);
+        ST_SW2(sum0_r, sum0_l, (tmp_buf_ptr + 2 * i * 8), 4);
 
-        filter_ptr2 += 4;
-    }
+        /* processing single column of constants */
+        cnst0 = LW(filter_ptr2 + 4);
+        cnst1 = LW(filter_ptr2 + 6);
 
-    /* process coeff 0, 8, 16, 24 */
-    LD_SH2(src3, 16 * buf_pitch, in0, in2);
-    LD_SH2((src3 + 8 * buf_pitch), 16 * buf_pitch, in1, in3);
+        filter0 = (v8i16) __msa_fill_w(cnst0);
+        filter1 = (v8i16) __msa_fill_w(cnst1);
+
+        DOTP_SH2_SW(src0_r, src0_l, filter0, filter0, sum0_r, sum0_l);
+        DPADD_SH2_SW(src1_r, src1_l, filter1, filter1, sum0_r, sum0_l);
+        ST_SW2(sum0_r, sum0_l, (tmp_buf_ptr + (2 * i + 1) * 8), 4);
 
-    ILVR_H2_SH(in2, in0, in3, in1, src0_r, src1_r);
-    ILVL_H2_SH(in2, in0, in3, in1, src0_l, src1_l);
+        filter_ptr2 += 8;
+    }
 
+    /* process coeff 0, 8, 16, 24 */
     /* loop for all columns of constants */
     for (i = 0; i < 2; i++) {
         /* processing first column of filter constants */
         cnst0 = LW(filter_ptr3);
-        cnst1 = LW(filter_ptr3 + 4);
+        cnst1 = LW(filter_ptr3 + 2);
 
         filter0 = (v8i16) __msa_fill_w(cnst0);
         filter1 = (v8i16) __msa_fill_w(cnst1);
 
-        DOTP_SH4_SW(src0_r, src0_l, src1_r, src1_l, filter0, filter0, filter1,
+        DOTP_SH4_SW(src2_r, src2_l, src3_r, src3_l, filter0, filter0,
+                    filter1, filter1, sum0_r, sum0_l, tmp1_r, tmp1_l);
 
-        sum1_r = sum0_r;
-        sum1_l = sum0_l;
-        sum0_r += tmp1_r;
-        sum0_l += tmp1_l;
-
-        sum1_r -= tmp1_r;
-        sum1_l -= tmp1_l;
+        sum1_r = sum0_r - tmp1_r;
+        sum1_l = sum0_l - tmp1_l;
+        sum0_r = sum0_r + tmp1_r;
+        sum0_l = sum0_l + tmp1_l;
 
         HEVC_EVEN16_CALC(tmp_buf_ptr, sum0_r, sum0_l, i, (7 - i));
         HEVC_EVEN16_CALC(tmp_buf_ptr, sum1_r, sum1_l, (3 - i), (4 + i));
@@ -618,11 +628,14 @@ static void hevc_idct_32x32_msa(int16_t *coeffs)
 {
     uint8_t row_cnt, col_cnt;
     int16_t *src = coeffs;
-    int16_t tmp_buf[8 * 32];
-    int16_t *tmp_buf_ptr = &tmp_buf[0];
+    int16_t tmp_buf[8 * 32 + 31];
+    int16_t *tmp_buf_ptr = tmp_buf + 31;
     uint8_t round;
     uint8_t buf_pitch;
 
+    /* Align pointer to 64 byte boundary */
+    tmp_buf_ptr = (int16_t *)(((uintptr_t) tmp_buf_ptr) & ~(uintptr_t) 63);
+
     /* column transform */
     round = 7;
     buf_pitch = 32;
@@ -758,21 +771,22 @@ static void hevc_addblk_16x16_msa(int16_t *coeffs, uint8_t *dst, int32_t stride)
 {
     uint8_t loop_cnt;
     uint8_t *temp_dst = dst;
-    v16u8 dst0, dst1, dst2, dst3;
+    v16u8 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7;
     v8i16 dst_r0, dst_l0, dst_r1, dst_l1, dst_r2, dst_l2, dst_r3, dst_l3;
     v8i16 in0, in1, in2, in3, in4, in5, in6, in7;
 
-    for (loop_cnt = 4; loop_cnt--;) {
-        LD_SH4(coeffs, 16, in0, in2, in4, in6);
-        LD_SH4((coeffs + 8), 16, in1, in3, in5, in7);
-        coeffs += 64;
-        LD_UB4(temp_dst, stride, dst0, dst1, dst2, dst3);
-        temp_dst += (4 * stride);
+    /* Pre-load for next iteration */
+    LD_UB4(temp_dst, stride, dst4, dst5, dst6, dst7);
+    temp_dst += (4 * stride);
+    LD_SH4(coeffs, 16, in0, in2, in4, in6);
+    LD_SH4((coeffs + 8), 16, in1, in3, in5, in7);
+    coeffs += 64;
 
-        UNPCK_UB_SH(dst0, dst_r0, dst_l0);
-        UNPCK_UB_SH(dst1, dst_r1, dst_l1);
-        UNPCK_UB_SH(dst2, dst_r2, dst_l2);
-        UNPCK_UB_SH(dst3, dst_r3, dst_l3);
+    for (loop_cnt = 3; loop_cnt--;) {
+        UNPCK_UB_SH(dst4, dst_r0, dst_l0);
+        UNPCK_UB_SH(dst5, dst_r1, dst_l1);
+        UNPCK_UB_SH(dst6, dst_r2, dst_l2);
+        UNPCK_UB_SH(dst7, dst_r3, dst_l3);
 
         dst_r0 += in0;
         dst_l0 += in1;
@@ -783,6 +797,13 @@ static void hevc_addblk_16x16_msa(int16_t *coeffs, uint8_t *dst, int32_t stride)
         dst_r3 += in6;
         dst_l3 += in7;
 
+        /* Pre-load for next iteration */
+        LD_UB4(temp_dst, stride, dst4, dst5, dst6, dst7);
+        temp_dst += (4 * stride);
+        LD_SH4(coeffs, 16, in0, in2, in4, in6);
+        LD_SH4((coeffs + 8), 16, in1, in3, in5, in7);
+        coeffs += 64;
+
         CLIP_SH4_0_255(dst_r0, dst_l0, dst_r1, dst_l1);
         CLIP_SH4_0_255(dst_r2, dst_l2, dst_r3, dst_l3);
         PCKEV_B4_UB(dst_l0, dst_r0, dst_l1, dst_r1, dst_l2, dst_r2, dst_l3,
@@ -790,25 +811,50 @@ static void hevc_addblk_16x16_msa(int16_t *coeffs, uint8_t *dst, int32_t stride)
         ST_UB4(dst0, dst1, dst2, dst3, dst, stride);
         dst += (4 * stride);
     }
+
+    UNPCK_UB_SH(dst4, dst_r0, dst_l0);
+    UNPCK_UB_SH(dst5, dst_r1, dst_l1);
+    UNPCK_UB_SH(dst6, dst_r2, dst_l2);
+    UNPCK_UB_SH(dst7, dst_r3, dst_l3);
+
+    dst_r0 += in0;
+    dst_l0 += in1;
+    dst_r1 += in2;
+    dst_l1 += in3;
+    dst_r2 += in4;
+    dst_l2 += in5;
+    dst_r3 += in6;
+    dst_l3 += in7;
+
+    CLIP_SH4_0_255(dst_r0, dst_l0, dst_r1, dst_l1);
+    CLIP_SH4_0_255(dst_r2, dst_l2, dst_r3, dst_l3);
+    PCKEV_B4_UB(dst_l0, dst_r0, dst_l1, dst_r1, dst_l2, dst_r2, dst_l3,
+                dst_r3, dst0, dst1, dst2, dst3);
+    ST_UB4(dst0, dst1, dst2, dst3, dst, stride);
 }
 
 static void hevc_addblk_32x32_msa(int16_t *coeffs, uint8_t *dst, int32_t stride)
 {
     uint8_t loop_cnt;
     uint8_t *temp_dst = dst;
-    v16u8 dst0, dst1, dst2, dst3;
+    v16u8 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7;
     v8i16 dst_r0, dst_l0, dst_r1, dst_l1, dst_r2, dst_l2, dst_r3, dst_l3;
     v8i16 in0, in1, in2, in3, in4, in5, in6, in7;
 
-    for (loop_cnt = 8; loop_cnt--;) {
-        LD_SH4(coeffs, 32, in0, in2, in4, in6);
-        LD_SH4((coeffs + 8), 32, in1, in3, in5, in7);
-        LD_UB4(temp_dst, stride, dst0, dst1, dst2, dst3);
-
-        UNPCK_UB_SH(dst0, dst_r0, dst_l0);
-        UNPCK_UB_SH(dst1, dst_r1, dst_l1);
-        UNPCK_UB_SH(dst2, dst_r2, dst_l2);
-        UNPCK_UB_SH(dst3, dst_r3, dst_l3);
+    /* Pre-load for next iteration */
+    LD_UB2(temp_dst, 16, dst4, dst5);
+    temp_dst += stride;
+    LD_UB2(temp_dst, 16, dst6, dst7);
+    temp_dst += stride;
+    LD_SH4(coeffs, 16, in0, in2, in4, in6);
+    LD_SH4((coeffs + 8), 16, in1, in3, in5, in7);
+    coeffs += 64;
+
+    for (loop_cnt = 14; loop_cnt--;) {
+        UNPCK_UB_SH(dst4, dst_r0, dst_l0);
+        UNPCK_UB_SH(dst5, dst_r1, dst_l1);
+        UNPCK_UB_SH(dst6, dst_r2, dst_l2);
+        UNPCK_UB_SH(dst7, dst_r3, dst_l3);
 
         dst_r0 += in0;
         dst_l0 += in1;
@@ -819,40 +865,77 @@ static void hevc_addblk_32x32_msa(int16_t *coeffs, uint8_t *dst, int32_t stride)
         dst_r3 += in6;
         dst_l3 += in7;
 
+        /* Pre-load for next iteration */
+        LD_UB2(temp_dst, 16, dst4, dst5);
+        temp_dst += stride;
+        LD_UB2(temp_dst, 16, dst6, dst7);
+        temp_dst += stride;
+        LD_SH4(coeffs, 16, in0, in2, in4, in6);
+        LD_SH4((coeffs + 8), 16, in1, in3, in5, in7);
+        coeffs += 64;
+
         CLIP_SH4_0_255(dst_r0, dst_l0, dst_r1, dst_l1);
         CLIP_SH4_0_255(dst_r2, dst_l2, dst_r3, dst_l3);
         PCKEV_B4_UB(dst_l0, dst_r0, dst_l1, dst_r1, dst_l2, dst_r2, dst_l3,
                     dst_r3, dst0, dst1, dst2, dst3);
-        ST_UB4(dst0, dst1, dst2, dst3, dst, stride);
-
-        LD_SH4((coeffs + 16), 32, in0, in2, in4, in6);
-        LD_SH4((coeffs + 24), 32, in1, in3, in5, in7);
-        coeffs += 128;
-        LD_UB4((temp_dst + 16), stride, dst0, dst1, dst2, dst3);
-        temp_dst += (4 * stride);
-
-        UNPCK_UB_SH(dst0, dst_r0, dst_l0);
-        UNPCK_UB_SH(dst1, dst_r1, dst_l1);
-        UNPCK_UB_SH(dst2, dst_r2, dst_l2);
-        UNPCK_UB_SH(dst3, dst_r3, dst_l3);
+        ST_UB2(dst0, dst1, dst, 16);
+        dst += stride;
+        ST_UB2(dst2, dst3, dst, 16);
+        dst += stride;
+    }
 
-        dst_r0 += in0;
-        dst_l0 += in1;
-        dst_r1 += in2;
-        dst_l1 += in3;
-        dst_r2 += in4;
-        dst_l2 += in5;
-        dst_r3 += in6;
-        dst_l3 += in7;
+    UNPCK_UB_SH(dst4, dst_r0, dst_l0);
+    UNPCK_UB_SH(dst5, dst_r1, dst_l1);
+    UNPCK_UB_SH(dst6, dst_r2, dst_l2);
+    UNPCK_UB_SH(dst7, dst_r3, dst_l3);
+
+    dst_r0 += in0;
+    dst_l0 += in1;
+    dst_r1 += in2;
+    dst_l1 += in3;
+    dst_r2 += in4;
+    dst_l2 += in5;
+    dst_r3 += in6;
+    dst_l3 += in7;
+
+    /* Pre-load for next iteration */
+    LD_UB2(temp_dst, 16, dst4, dst5);
+    temp_dst += stride;
+    LD_UB2(temp_dst, 16, dst6, dst7);
+    temp_dst += stride;
+    LD_SH4(coeffs, 16, in0, in2, in4, in6);
+    LD_SH4((coeffs + 8), 16, in1, in3, in5, in7);
 
-        CLIP_SH4_0_255(dst_r0, dst_l0, dst_r1, dst_l1);
-        CLIP_SH4_0_255(dst_r2, dst_l2, dst_r3, dst_l3);
-        PCKEV_B4_UB(dst_l0, dst_r0, dst_l1, dst_r1, dst_l2, dst_r2, dst_l3,
-                    dst_r3, dst0, dst1, dst2, dst3);
+    CLIP_SH4_0_255(dst_r0, dst_l0, dst_r1, dst_l1);
+    CLIP_SH4_0_255(dst_r2, dst_l2, dst_r3, dst_l3);
+    PCKEV_B4_UB(dst_l0, dst_r0, dst_l1, dst_r1, dst_l2, dst_r2, dst_l3,
+                dst_r3, dst0, dst1, dst2, dst3);
+    ST_UB2(dst0, dst1, dst, 16);
+    dst += stride;
+    ST_UB2(dst2, dst3, dst, 16);
+    dst += stride;
+
+    UNPCK_UB_SH(dst4, dst_r0, dst_l0);
+    UNPCK_UB_SH(dst5, dst_r1, dst_l1);
+    UNPCK_UB_SH(dst6, dst_r2, dst_l2);
+    UNPCK_UB_SH(dst7, dst_r3, dst_l3);
+
+    dst_r0 += in0;
+    dst_l0 += in1;
+    dst_r1 += in2;
+    dst_l1 += in3;
+    dst_r2 += in4;
+    dst_l2 += in5;
+    dst_r3 += in6;
+    dst_l3 += in7;
 
-        ST_UB4(dst0, dst1, dst2, dst3, (dst + 16), stride);
-        dst += (4 * stride);
-    }
+    CLIP_SH4_0_255(dst_r0, dst_l0, dst_r1, dst_l1);
+    CLIP_SH4_0_255(dst_r2, dst_l2, dst_r3, dst_l3);
+    PCKEV_B4_UB(dst_l0, dst_r0, dst_l1, dst_r1, dst_l2, dst_r2, dst_l3,
+                dst_r3, dst0, dst1, dst2, dst3);
+    ST_UB2(dst0, dst1, dst, 16);
+    dst += stride;
+    ST_UB2(dst2, dst3, dst, 16);
 }
 
 static void hevc_idct_luma_4x4_msa(int16_t *coeffs)
@@ -868,8 +951,12 @@ static void hevc_idct_luma_4x4_msa(int16_t *coeffs)
     TRANSPOSE4x4_SW_SW(res0, res1, res2, res3, in_r0, in_l0, in_r1, in_l1);
     HEVC_IDCT_LUMA4x4_COL(in_r0, in_l0, in_r1, in_l1, res0, res1, res2, res3, 12);
-    TRANSPOSE4x4_SW_SW(res0, res1, res2, res3, res0, res1, res2, res3);
-    PCKEV_H2_SH(res1, res0, res3, res2, dst0, dst1);
+
+    /* Pack and transpose */
+    PCKEV_H2_SH(res2, res0, res3, res1, dst0, dst1);
+    ILVRL_H2_SW(dst1, dst0, res0, res1);
+    ILVRL_W2_SH(res1, res0, dst0, dst1);
+
     ST_SH2(dst0, dst1, coeffs, 8);
 }
-- 
1.7.9.5

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

diff --git a/libavcodec/mips/hevc_idct_msa.c b/libavcodec/mips/hevc_idct_msa.c
index d483707..0943119 100644
--- a/libavcodec/mips/hevc_idct_msa.c
+++ b/libavcodec/mips/hevc_idct_msa.c
@@ -21,18 +21,18 @@
 #include "libavutil/mips/generic_macros_msa.h"
 #include "libavcodec/mips/hevcdsp_mips.h"
 
-static const int16_t gt8x8_cnst[16] = {
+static const int16_t gt8x8_cnst[16] __attribute__ ((aligned (64))) = {
     64, 64, 83, 36, 89, 50, 18, 75, 64, -64, 36, -83, 75, -89, -50, -18
 };
 
-static const int16_t gt16x16_cnst[64] = {
+static const int16_t gt16x16_cnst[64] __attribute__ ((aligned (64))) = {
     64, 83, 64, 36, 89, 75, 50, 18, 90, 80, 57, 25, 70, 87, 9, 43,
     64, 36, -64, -83, 75, -18, -89, -50, 87, 9, -80, -70, -43, 57, -25, -90,
     64, -36, -64, 83, 50, -89, 18, 75, 80, -70, -25, 90, -87, 9, 43, 57,
     64, -83, 64, -36, 18, -50, 75, -89, 70, -87, 90, -80, 9, -43, -57, 25
 };
 
-static const int16_t gt32x32_cnst0[256] = {
+static const int16_t gt32x32_cnst0[256] __attribute__ ((aligned (64))) = {
     90, 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4,
     90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13,
     88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22,
@@ -51,21 +51,17 @@ static const int16_t gt32x32_cnst0[256] = {
     4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90
 };