From patchwork Fri Sep 15 12:44:03 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Manojkumar Bhosale
X-Patchwork-Id: 5153
From: Manojkumar Bhosale
To: FFmpeg development discussions and patches
Cc: Kaustubh Raste
Date: Fri, 15 Sep 2017 12:44:03 +0000
Message-ID: <70293ACCC3BA6A4E81FFCA024C7A86E1E0591B51@PUMAIL01.pu.imgtec.org>
References: <1505475394-29139-1-git-send-email-kaustubh.raste@imgtec.com>
In-Reply-To: <1505475394-29139-1-git-send-email-kaustubh.raste@imgtec.com>
Subject: Re: [FFmpeg-devel] [PATCH] avcodec/mips: Improve hevc sao band filter msa functions

LGTM

-----Original Message-----
From: ffmpeg-devel [mailto:ffmpeg-devel-bounces@ffmpeg.org] On Behalf Of kaustubh.raste@imgtec.com
Sent: Friday, September 15, 2017 5:07 PM
To: ffmpeg-devel@ffmpeg.org
Cc: Kaustubh Raste
Subject: [FFmpeg-devel] [PATCH] avcodec/mips: Improve hevc sao band filter msa functions

From: Kaustubh Raste

Preload data in band filter 0-8 for better pipeline parallelization.

Signed-off-by: Kaustubh Raste
---
 libavcodec/mips/hevc_lpf_sao_msa.c  | 174 ++++++++++++++++++++++-------------
 libavutil/mips/generic_macros_msa.h |   1 +
 2 files changed, 112 insertions(+), 63 deletions(-)

diff --git a/libavcodec/mips/hevc_lpf_sao_msa.c b/libavcodec/mips/hevc_lpf_sao_msa.c
index 79b156f..1d77432 100644
--- a/libavcodec/mips/hevc_lpf_sao_msa.c
+++ b/libavcodec/mips/hevc_lpf_sao_msa.c
@@ -1049,29 +1049,28 @@ static void hevc_sao_band_filter_4width_msa(uint8_t *dst, int32_t dst_stride,
                                             int16_t *sao_offset_val,
                                             int32_t height)
 {
-    int32_t h_cnt;
     v16u8 src0, src1, src2, src3;
     v16i8 src0_r, src1_r;
     v16i8 offset, offset_val, mask;
-    v16i8 offset0 = { 0 };
-    v16i8 offset1 = { 0 };
+    v16i8 dst0, offset0, offset1;
     v16i8 zero = { 0 };
-    v8i16 temp0, temp1, dst0, dst1;

     offset_val = LD_SB(sao_offset_val + 1);
     offset_val = (v16i8) __msa_pckev_d((v2i64) offset_val, (v2i64) offset_val);

     offset_val = __msa_pckev_b(offset_val, offset_val);
-    offset1 = (v16i8) __msa_insve_w((v4i32) offset1, 3, (v4i32) offset_val);
-    offset0 = __msa_sld_b(offset1, offset0, 28 - ((sao_left_class) & 31));
+    offset1 = (v16i8) __msa_insve_w((v4i32) zero, 3, (v4i32) offset_val);
+    offset0 = __msa_sld_b(offset1, zero, 28 - ((sao_left_class) & 31));
     offset1 = __msa_sld_b(zero, offset1, 28 - ((sao_left_class) & 31));

+    /* load in advance. */
+    LD_UB4(src, src_stride, src0, src1, src2, src3);
+
     if (!((sao_left_class > 12) & (sao_left_class < 29))) {
         SWAP(offset0, offset1);
     }

-    for (h_cnt = height >> 2; h_cnt--;) {
-        LD_UB4(src, src_stride, src0, src1, src2, src3);
+    for (height -= 4; height; height -= 4) {
         src += (4 * src_stride);

         ILVEV_D2_SB(src0, src1, src2, src3, src0_r, src1_r);
@@ -1080,14 +1079,30 @@ static void hevc_sao_band_filter_4width_msa(uint8_t *dst, int32_t dst_stride,
         mask = __msa_srli_b(src0_r, 3);
         offset = __msa_vshf_b(mask, offset1, offset0);

-        UNPCK_SB_SH(offset, temp0, temp1);
-        ILVRL_B2_SH(zero, src0_r, dst0, dst1);
-        ADD2(dst0, temp0, dst1, temp1, dst0, dst1);
-        CLIP_SH2_0_255(dst0, dst1);
-        dst0 = (v8i16) __msa_pckev_b((v16i8) dst1, (v16i8) dst0);
+        src0_r = (v16i8) __msa_xori_b((v16u8) src0_r, 128);
+        dst0 = __msa_adds_s_b(src0_r, offset);
+        dst0 = (v16i8) __msa_xori_b((v16u8) dst0, 128);
+
+        /* load in advance. */
+        LD_UB4(src, src_stride, src0, src1, src2, src3);
+
+        /* store results */
         ST4x4_UB(dst0, dst0, 0, 1, 2, 3, dst, dst_stride);
         dst += (4 * dst_stride);
     }
+
+    ILVEV_D2_SB(src0, src1, src2, src3, src0_r, src1_r);
+
+    src0_r = (v16i8) __msa_pckev_w((v4i32) src1_r, (v4i32) src0_r);
+    mask = __msa_srli_b(src0_r, 3);
+    offset = __msa_vshf_b(mask, offset1, offset0);
+
+    src0_r = (v16i8) __msa_xori_b((v16u8) src0_r, 128);
+    dst0 = __msa_adds_s_b(src0_r, offset);
+    dst0 = (v16i8) __msa_xori_b((v16u8) dst0, 128);
+
+    /* store results */
+    ST4x4_UB(dst0, dst0, 0, 1, 2, 3, dst, dst_stride);
 }

 static void hevc_sao_band_filter_8width_msa(uint8_t *dst, int32_t dst_stride,
@@ -1096,51 +1111,69 @@ static void hevc_sao_band_filter_8width_msa(uint8_t *dst, int32_t dst_stride,
                                             int16_t *sao_offset_val,
                                             int32_t height)
 {
-    int32_t h_cnt;
     v16u8 src0, src1, src2, src3;
     v16i8 src0_r, src1_r, mask0, mask1;
-    v16i8 offset, offset_val;
-    v16i8 offset0 = { 0 };
-    v16i8 offset1 = { 0 };
+    v16i8 offset_mask0, offset_mask1, offset_val;
+    v16i8 offset0, offset1, dst0, dst1;
     v16i8 zero = { 0 };
-    v8i16 dst0, dst1, dst2, dst3;
-    v8i16 temp0, temp1, temp2, temp3;

     offset_val = LD_SB(sao_offset_val + 1);
     offset_val = (v16i8) __msa_pckev_d((v2i64) offset_val, (v2i64) offset_val);
     offset_val = __msa_pckev_b(offset_val, offset_val);
-    offset1 = (v16i8) __msa_insve_w((v4i32) offset1, 3, (v4i32) offset_val);
-    offset0 = __msa_sld_b(offset1, offset0, 28 - ((sao_left_class) & 31));
+    offset1 = (v16i8) __msa_insve_w((v4i32) zero, 3, (v4i32) offset_val);
+    offset0 = __msa_sld_b(offset1, zero, 28 - ((sao_left_class) & 31));
     offset1 = __msa_sld_b(zero, offset1, 28 - ((sao_left_class) & 31));

+    /* load in advance. */
+    LD_UB4(src, src_stride, src0, src1, src2, src3);
+
     if (!((sao_left_class > 12) & (sao_left_class < 29))) {
         SWAP(offset0, offset1);
     }

-    for (h_cnt = height >> 2; h_cnt--;) {
-        LD_UB4(src, src_stride, src0, src1, src2, src3);
-        src += (4 * src_stride);
+    for (height -= 4; height; height -= 4) {
+        src += src_stride << 2;

         ILVR_D2_SB(src1, src0, src3, src2, src0_r, src1_r);

         mask0 = __msa_srli_b(src0_r, 3);
         mask1 = __msa_srli_b(src1_r, 3);

-        offset = __msa_vshf_b(mask0, offset1, offset0);
-        UNPCK_SB_SH(offset, temp0, temp1);
+        offset_mask0 = __msa_vshf_b(mask0, offset1, offset0);
+        offset_mask1 = __msa_vshf_b(mask1, offset1, offset0);

-        offset = __msa_vshf_b(mask1, offset1, offset0);
-        UNPCK_SB_SH(offset, temp2, temp3);
+        /* load in advance. */
+        LD_UB4(src, src_stride, src0, src1, src2, src3);

-        UNPCK_UB_SH(src0_r, dst0, dst1);
-        UNPCK_UB_SH(src1_r, dst2, dst3);
-        ADD4(dst0, temp0, dst1, temp1, dst2, temp2, dst3, temp3,
-             dst0, dst1, dst2, dst3);
-        CLIP_SH4_0_255(dst0, dst1, dst2, dst3);
-        PCKEV_B2_SH(dst1, dst0, dst3, dst2, dst0, dst2);
-        ST8x4_UB(dst0, dst2, dst, dst_stride);
-        dst += (4 * dst_stride);
+        XORI_B2_128_SB(src0_r, src1_r);
+
+        dst0 = __msa_adds_s_b(src0_r, offset_mask0);
+        dst1 = __msa_adds_s_b(src1_r, offset_mask1);
+
+        XORI_B2_128_SB(dst0, dst1);
+
+        /* store results */
+        ST8x4_UB(dst0, dst1, dst, dst_stride);
+        dst += dst_stride << 2;
     }
+
+    ILVR_D2_SB(src1, src0, src3, src2, src0_r, src1_r);
+
+    mask0 = __msa_srli_b(src0_r, 3);
+    mask1 = __msa_srli_b(src1_r, 3);
+
+    offset_mask0 = __msa_vshf_b(mask0, offset1, offset0);
+    offset_mask1 = __msa_vshf_b(mask1, offset1, offset0);
+
+    XORI_B2_128_SB(src0_r, src1_r);
+
+    dst0 = __msa_adds_s_b(src0_r, offset_mask0);
+    dst1 = __msa_adds_s_b(src1_r, offset_mask1);
+
+    XORI_B2_128_SB(dst0, dst1);
+
+    /* store results */
+    ST8x4_UB(dst0, dst1, dst, dst_stride);
 }

 static void hevc_sao_band_filter_16multiple_msa(uint8_t *dst,
@@ -1151,32 +1184,30 @@ static void hevc_sao_band_filter_16multiple_msa(uint8_t *dst,
                                                 int16_t *sao_offset_val,
                                                 int32_t width, int32_t height)
 {
-    int32_t h_cnt, w_cnt;
+    int32_t w_cnt;
     v16u8 src0, src1, src2, src3;
-    v8i16 dst0, dst1, dst2, dst3, dst4, dst5, dst6, dst7;
     v16i8 out0, out1, out2, out3;
     v16i8 mask0, mask1, mask2, mask3;
     v16i8 tmp0, tmp1, tmp2, tmp3, offset_val;
-    v16i8 offset0 = { 0 };
-    v16i8 offset1 = { 0 };
+    v16i8 offset0, offset1;
     v16i8 zero = { 0 };
-    v8i16 temp0, temp1, temp2, temp3, temp4, temp5, temp6, temp7;

     offset_val = LD_SB(sao_offset_val + 1);
     offset_val = (v16i8) __msa_pckev_d((v2i64) offset_val, (v2i64) offset_val);

     offset_val = __msa_pckev_b(offset_val, offset_val);
-    offset1 = (v16i8) __msa_insve_w((v4i32) offset1, 3, (v4i32) offset_val);
-    offset0 = __msa_sld_b(offset1, offset0, 28 - ((sao_left_class) & 31));
+    offset1 = (v16i8) __msa_insve_w((v4i32) zero, 3, (v4i32) offset_val);
+    offset0 = __msa_sld_b(offset1, zero, 28 - ((sao_left_class) & 31));
     offset1 = __msa_sld_b(zero, offset1, 28 - ((sao_left_class) & 31));

     if (!((sao_left_class > 12) & (sao_left_class < 29))) {
         SWAP(offset0, offset1);
     }

-    for (h_cnt = height >> 2; h_cnt--;) {
-        for (w_cnt = 0; w_cnt < (width >> 4); w_cnt++) {
-            LD_UB4(src + w_cnt * 16, src_stride, src0, src1, src2, src3);
+    while (height > 0) {
+        /* load in advance */
+        LD_UB4(src, src_stride, src0, src1, src2, src3);

+        for (w_cnt = 16; w_cnt < width; w_cnt += 16) {
             mask0 = __msa_srli_b((v16i8) src0, 3);
             mask1 = __msa_srli_b((v16i8) src1, 3);
             mask2 = __msa_srli_b((v16i8) src2, 3);
@@ -1186,27 +1217,44 @@ static void hevc_sao_band_filter_16multiple_msa(uint8_t *dst,
                        tmp0, tmp1);
             VSHF_B2_SB(offset0, offset1, offset0, offset1, mask2, mask3,
                        tmp2, tmp3);
-            UNPCK_SB_SH(tmp0, temp0, temp1);
-            UNPCK_SB_SH(tmp1, temp2, temp3);
-            UNPCK_SB_SH(tmp2, temp4, temp5);
-            UNPCK_SB_SH(tmp3, temp6, temp7);
-            ILVRL_B2_SH(zero, src0, dst0, dst1);
-            ILVRL_B2_SH(zero, src1, dst2, dst3);
-            ILVRL_B2_SH(zero, src2, dst4, dst5);
-            ILVRL_B2_SH(zero, src3, dst6, dst7);
-            ADD4(dst0, temp0, dst1, temp1, dst2, temp2, dst3, temp3,
-                 dst0, dst1, dst2, dst3);
-            ADD4(dst4, temp4, dst5, temp5, dst6, temp6, dst7, temp7,
-                 dst4, dst5, dst6, dst7);
-            CLIP_SH4_0_255(dst0, dst1, dst2, dst3);
-            CLIP_SH4_0_255(dst4, dst5, dst6, dst7);
-            PCKEV_B4_SB(dst1, dst0, dst3, dst2, dst5, dst4, dst7, dst6,
-                        out0, out1, out2, out3);
-            ST_SB4(out0, out1, out2, out3, dst + w_cnt * 16, dst_stride);
+            XORI_B4_128_UB(src0, src1, src2, src3);
+
+            out0 = __msa_adds_s_b((v16i8) src0, tmp0);
+            out1 = __msa_adds_s_b((v16i8) src1, tmp1);
+            out2 = __msa_adds_s_b((v16i8) src2, tmp2);
+            out3 = __msa_adds_s_b((v16i8) src3, tmp3);
+
+            /* load for next iteration */
+            LD_UB4(src + w_cnt, src_stride, src0, src1, src2, src3);
+
+            XORI_B4_128_SB(out0, out1, out2, out3);
+
+            ST_SB4(out0, out1, out2, out3, dst + w_cnt - 16,
+                   dst_stride);
         }

+        mask0 = __msa_srli_b((v16i8) src0, 3);
+        mask1 = __msa_srli_b((v16i8) src1, 3);
+        mask2 = __msa_srli_b((v16i8) src2, 3);
+        mask3 = __msa_srli_b((v16i8) src3, 3);
+
+        VSHF_B2_SB(offset0, offset1, offset0, offset1, mask0, mask1, tmp0,
+                   tmp1);
+        VSHF_B2_SB(offset0, offset1, offset0, offset1, mask2, mask3, tmp2,
+                   tmp3);
+        XORI_B4_128_UB(src0, src1, src2, src3);
+
+        out0 = __msa_adds_s_b((v16i8) src0, tmp0);
+        out1 = __msa_adds_s_b((v16i8) src1, tmp1);
+        out2 = __msa_adds_s_b((v16i8) src2, tmp2);
+        out3 = __msa_adds_s_b((v16i8) src3, tmp3);
+
+        XORI_B4_128_SB(out0, out1, out2, out3);
+
+        ST_SB4(out0, out1, out2, out3, dst + w_cnt - 16, dst_stride);
+
         src += src_stride << 2;
         dst += dst_stride << 2;
+        height -= 4;
     }
 }

diff --git a/libavutil/mips/generic_macros_msa.h b/libavutil/mips/generic_macros_msa.h
index ee7d663..3ff94fd 100644
--- a/libavutil/mips/generic_macros_msa.h
+++ b/libavutil/mips/generic_macros_msa.h
@@ -1574,6 +1574,7 @@
     out0 = (RTYPE) __msa_ilvr_h((v8i16) in0, (v8i16) in1);  \
     out1 = (RTYPE) __msa_ilvl_h((v8i16) in0, (v8i16) in1);  \
 }
+#define ILVRL_H2_UB(...) ILVRL_H2(v16u8, __VA_ARGS__)
 #define ILVRL_H2_SB(...) ILVRL_H2(v16i8, __VA_ARGS__)
 #define ILVRL_H2_SH(...) ILVRL_H2(v8i16, __VA_ARGS__)
 #define ILVRL_H2_SW(...) ILVRL_H2(v4i32, __VA_ARGS__)
--
1.7.9.5
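
Reviewer-style note, not part of the patch: the diff swaps the "widen to 16 bits, add, clip to 0..255, pack" sequence for "xor with 128, saturating signed byte add, xor with 128", and it moves each load ahead of the store of the previous result. The scalar C sketch below is only a model of one byte lane under the assumption that the offsets fit in int8_t (which the patch's `__msa_pckev_b` packing implies). All function names here (`sao_add_clip`, `adds_s8`, `sao_add_xor`, `sao_self_check`, `band_rows_pipelined`) are hypothetical helpers for illustration and do not exist in FFmpeg; `adds_s8` merely stands in for one lane of `__msa_adds_s_b`, and the 32-entry `lut` stands in for the `offset0`/`offset1` pair indexed by `pixel >> 3`.

```c
#include <assert.h>
#include <stdint.h>

/* Old path, one lane: widen, add the offset, clip to [0, 255]. */
static uint8_t sao_add_clip(uint8_t px, int8_t off)
{
    int v = (int) px + off;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t) v;
}

/* Hypothetical scalar stand-in for one lane of a saturating signed add. */
static int8_t adds_s8(int8_t a, int8_t b)
{
    int s = (int) a + b;
    if (s < -128) s = -128;
    if (s > 127)  s = 127;
    return (int8_t) s;
}

/* New path, one lane: bias by 128, saturating signed add, remove the bias. */
static uint8_t sao_add_xor(uint8_t px, int8_t off)
{
    /* px - 128 has the same bit pattern as px ^ 128, read as signed. */
    int8_t biased = (int8_t) ((int) px - 128);
    assert((uint8_t) biased == (px ^ 0x80));
    return (uint8_t) ((uint8_t) adds_s8(biased, off) ^ 0x80);
}

/* Exhaustive comparison of both paths; returns the number of mismatches. */
static int sao_self_check(void)
{
    int mismatches = 0;
    for (int px = 0; px < 256; px++)
        for (int off = -128; off < 128; off++)
            mismatches += sao_add_clip((uint8_t) px, (int8_t) off) !=
                          sao_add_xor((uint8_t) px, (int8_t) off);
    return mismatches;
}

/* Software-pipelined loop in the spirit of the "load in advance" comments:
 * prologue load, loop body that fetches the next input before storing the
 * current result, and an epilogue for the last element.
 * Assumes n >= 1, much as the patched loops assume height >= 4. */
static void band_rows_pipelined(uint8_t *dst, const uint8_t *src,
                                const int8_t lut[32], int n)
{
    uint8_t cur = src[0];                      /* load in advance */
    for (int i = 1; i < n; i++) {
        uint8_t out  = sao_add_xor(cur, lut[cur >> 3]);
        uint8_t next = src[i];                 /* load for next iteration */
        dst[i - 1] = out;                      /* store results */
        cur = next;
    }
    dst[n - 1] = sao_add_xor(cur, lut[cur >> 3]);  /* epilogue */
}
```

Calling `sao_self_check()` compares the two paths over every pixel value and every int8_t offset, which is a cheap way to convince oneself that dropping the 16-bit intermediate does not change the output. The pipelined loop trades a duplicated tail for the chance to overlap the next load with the current arithmetic, which appears to be the "better pipeline parallelization" the commit message refers to.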