From patchwork Wed Nov 22 13:59:06 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shengbin Meng
X-Patchwork-Id: 6272
From: Shengbin Meng
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 22 Nov 2017 21:59:06 +0800
Message-Id: <20171122135910.20719-1-shengbinmeng@gmail.com>
X-Mailer: git-send-email 2.13.6 (Apple Git-96)
Subject: [FFmpeg-devel] [PATCH v2 1/5] avcodec/hevcdsp: Add NEON optimization for qpel weighted mode
List-Id: FFmpeg development discussions and patches
Cc: Meng Wang

From: Meng Wang

Signed-off-by: Meng Wang
---
 libavcodec/arm/hevcdsp_init_neon.c |  67 +++++
 libavcodec/arm/hevcdsp_qpel_neon.S | 509 +++++++++++++++++++++++++++++++++++++
 2 files changed, 576 insertions(+)

diff --git a/libavcodec/arm/hevcdsp_init_neon.c b/libavcodec/arm/hevcdsp_init_neon.c
index a4628d2a93..183162803e 100644
--- a/libavcodec/arm/hevcdsp_init_neon.c
+++ b/libavcodec/arm/hevcdsp_init_neon.c
@@ -81,6 +81,8 @@ static void (*put_hevc_qpel_neon[4][4])(int16_t *dst, ptrdiff_t dststride, uint8
                                    int height, int width);
 static void (*put_hevc_qpel_uw_neon[4][4])(uint8_t *dst, ptrdiff_t dststride, uint8_t *_src, ptrdiff_t _srcstride,
                                    int width, int height, int16_t* src2, ptrdiff_t src2stride);
+static void (*put_hevc_qpel_wt_neon[4][4])(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride,
+                                   int width, int height, int denom, int wx1, int ox1, int wx0, int ox0, int16_t* src2, ptrdiff_t src2stride);
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t srcstride,
                                    int height, intptr_t mx, intptr_t my, int width);
 void ff_hevc_put_qpel_uni_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride,
@@ -88,6 +90,15 @@ void ff_hevc_put_qpel_uni_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_
 void ff_hevc_put_qpel_bi_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride,
                                    int16_t *src2,
                                    int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_uni_w_neon_wrapper(uint8_t *dst, ptrdiff_t dststride,
+                                   uint8_t *src, ptrdiff_t srcstride,
+                                   int height, int denom, int wx, int ox,
+                                   intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_bi_w_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride,
+                                   int16_t *src2,
+                                   int height, int denom, int wx0, int wx1,
+                                   int ox0, int ox1, intptr_t mx, intptr_t my, int width);
+
 #define QPEL_FUNC(name) \
     void name(int16_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride, \
                                    int height, int width)
@@ -142,6 +153,26 @@ QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v2_neon_8);
 QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v3_neon_8);
 #undef QPEL_FUNC_UW

+#define QPEL_FUNC_WT(name) \
+void name(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, \
+          int width, int height, int denom, int wx1, int ox1, int wx0, int ox0, int16_t* src2, ptrdiff_t src2stride);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v3_neon_8);
+#undef QPEL_FUNC_WT
+
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t srcstride,
                                    int height, intptr_t mx, intptr_t my, int width)
 {
@@ -160,6 +191,21 @@ void ff_hevc_put_qpel_bi_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t
     put_hevc_qpel_uw_neon[my][mx](dst, dststride, src, srcstride, width, height, src2, MAX_PB_SIZE);
 }

+void ff_hevc_put_qpel_uni_w_neon_wrapper(uint8_t *dst, ptrdiff_t dststride,
+                                   uint8_t *src, ptrdiff_t srcstride,
+                                   int height, int denom, int wx, int ox,
+                                   intptr_t mx, intptr_t my, int width) {
+    put_hevc_qpel_wt_neon[my][mx](dst, dststride, src, srcstride, width, height, denom, wx, ox, 0, 0, NULL, 0);
+}
+
+void ff_hevc_put_qpel_bi_w_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride,
+                                   int16_t *src2,
+                                   int height, int denom, int wx0, int wx1,
+                                   int ox0, int ox1, intptr_t mx, intptr_t my, int width) {
+    put_hevc_qpel_wt_neon[my][mx](dst, dststride, src, srcstride, width, height, denom, wx1, ox1, wx0, ox0, src2, MAX_PB_SIZE);
+}
+
+
 av_cold void ff_hevc_dsp_init_neon(HEVCDSPContext *c, const int bit_depth)
 {
     if (bit_depth == 8) {
@@ -211,6 +257,21 @@ av_cold void ff_hevc_dsp_init_neon(HEVCDSPContext *c, const int bit_depth)
         put_hevc_qpel_uw_neon[3][1] = ff_hevc_put_qpel_uw_h1v3_neon_8;
         put_hevc_qpel_uw_neon[3][2] = ff_hevc_put_qpel_uw_h2v3_neon_8;
         put_hevc_qpel_uw_neon[3][3] = ff_hevc_put_qpel_uw_h3v3_neon_8;
+        put_hevc_qpel_wt_neon[1][0] = ff_hevc_put_qpel_wt_v1_neon_8;
+        put_hevc_qpel_wt_neon[2][0] = ff_hevc_put_qpel_wt_v2_neon_8;
+        put_hevc_qpel_wt_neon[3][0] = ff_hevc_put_qpel_wt_v3_neon_8;
+        put_hevc_qpel_wt_neon[0][1] = ff_hevc_put_qpel_wt_h1_neon_8;
+        put_hevc_qpel_wt_neon[0][2] = ff_hevc_put_qpel_wt_h2_neon_8;
+        put_hevc_qpel_wt_neon[0][3] = ff_hevc_put_qpel_wt_h3_neon_8;
+        put_hevc_qpel_wt_neon[1][1] = ff_hevc_put_qpel_wt_h1v1_neon_8;
+        put_hevc_qpel_wt_neon[1][2] = ff_hevc_put_qpel_wt_h2v1_neon_8;
+        put_hevc_qpel_wt_neon[1][3] = ff_hevc_put_qpel_wt_h3v1_neon_8;
+        put_hevc_qpel_wt_neon[2][1] = ff_hevc_put_qpel_wt_h1v2_neon_8;
+        put_hevc_qpel_wt_neon[2][2] = ff_hevc_put_qpel_wt_h2v2_neon_8;
+        put_hevc_qpel_wt_neon[2][3] = ff_hevc_put_qpel_wt_h3v2_neon_8;
+        put_hevc_qpel_wt_neon[3][1] = ff_hevc_put_qpel_wt_h1v3_neon_8;
+        put_hevc_qpel_wt_neon[3][2] = ff_hevc_put_qpel_wt_h2v3_neon_8;
+        put_hevc_qpel_wt_neon[3][3] = ff_hevc_put_qpel_wt_h3v3_neon_8;
         for (x = 0; x < 10; x++) {
             c->put_hevc_qpel[x][1][0] = ff_hevc_put_qpel_neon_wrapper;
             c->put_hevc_qpel[x][0][1] = ff_hevc_put_qpel_neon_wrapper;
@@ -221,6 +282,12 @@ av_cold void ff_hevc_dsp_init_neon(HEVCDSPContext *c, const int bit_depth)
             c->put_hevc_qpel_bi[x][1][0] = ff_hevc_put_qpel_bi_neon_wrapper;
             c->put_hevc_qpel_bi[x][0][1] = ff_hevc_put_qpel_bi_neon_wrapper;
             c->put_hevc_qpel_bi[x][1][1] = ff_hevc_put_qpel_bi_neon_wrapper;
+            c->put_hevc_qpel_uni_w[x][1][0] = ff_hevc_put_qpel_uni_w_neon_wrapper;
+            c->put_hevc_qpel_uni_w[x][0][1] = ff_hevc_put_qpel_uni_w_neon_wrapper;
+            c->put_hevc_qpel_uni_w[x][1][1] = ff_hevc_put_qpel_uni_w_neon_wrapper;
+            c->put_hevc_qpel_bi_w[x][1][0] = ff_hevc_put_qpel_bi_w_neon_wrapper;
+            c->put_hevc_qpel_bi_w[x][0][1] = ff_hevc_put_qpel_bi_w_neon_wrapper;
+            c->put_hevc_qpel_bi_w[x][1][1] = ff_hevc_put_qpel_bi_w_neon_wrapper;
         }
         c->put_hevc_qpel[0][0][0] = ff_hevc_put_pixels_w2_neon_8;
         c->put_hevc_qpel[1][0][0] = ff_hevc_put_pixels_w4_neon_8;
diff --git a/libavcodec/arm/hevcdsp_qpel_neon.S b/libavcodec/arm/hevcdsp_qpel_neon.S
index 86f92cf75a..e188b215ba 100644
--- a/libavcodec/arm/hevcdsp_qpel_neon.S
+++ b/libavcodec/arm/hevcdsp_qpel_neon.S
@@ -333,6 +333,139 @@
         bx          lr
 .endm

+.macro hevc_put_qpel_wt_vX_neon_8 filter
+        push        {r4-r12}
+        ldr         r5, [sp, #36] // width
+        ldr         r4, [sp, #40] // height
+        ldr         r8, [sp, #44] // denom
+        ldr         r9, [sp, #48] // wx1
+        ldr         r10,[sp, #52] // ox1
+        ldr         r11,[sp, #64] // src2
+        vpush       {d8-d15}
+        sub         r2, r2, r3, lsl #1 // r2 - 3*stride
+        sub         r2, r3
+        mov         r12, r4
+        mov         r6, r0
+        mov         r7, r2
+        add         r8, #6 // weight shift = denom + 6
+        vdup.32     q5, r8 // shift is a 32 bit action
+        vneg.s32    q4, q5 // q4 = -q5
+        vdup.32     q6, r9 // q6 wx
+        vdup.32     q5, r10 // q5 ox
+        cmp         r11, #0 // if src2 != 0 goto bi mode
+        bne         .Lbi\@
+0:      loadin8
+        cmp         r5, #4
+        beq         4f
+8:      subs        r4, #1
+        \filter
+        vmovl.s16   q12, d14 // extending signed 4x16bit data to 4x32 bit
+        vmovl.s16   q13, d15
+        vmul.s32    q14, q12, q6 // src * wx
+        vmul.s32    q15, q13, q6 // src * wx
+        vqrshl.s32  q12, q14, q4 // src * wx >> shift
+        vqrshl.s32  q13, q15, q4 // src * wx >> shift
+        vadd.s32    q14, q12, q5 // src * wx >> shift + ox
+        vadd.s32    q15, q13, q5 // src * wx >> shift + ox
+        vqmovun.s32 d2, q14 // narrow signed 4x32bit to unsigned 4x16bit
+        vqmovun.s32 d3, q15 // narrow signed 4x32bit to unsigned 4x16bit
+        vqmovn.u16  d0, q1 // narrow unsigned 8x16bit to unsigned 8x8bit
+        vst1.8      d0, [r0], r1
+        regshuffle_d8
+        vld1.8      {d23}, [r2], r3
+        bne         8b
+        subs        r5, #8
+        beq         99f
+        mov         r4, r12
+        add         r6, #8
+        mov         r0, r6
+        add         r7, #8
+        mov         r2, r7
+        b           0b
+4:      subs        r4, #1
+        \filter
+        vmovl.s16   q12, d14 // extending signed 4x16bit data to 4x32 bit
+        vmul.s32    q14, q12, q6
+        vqrshl.s32  q12, q14, q4
+        vadd.s32    q14, q12, q5
+        vqmovun.s32 d14, q14
+        vqmovn.u16  d0, q7
+        vst1.32     d0[0], [r0], r1
+        regshuffle_d8
+        vld1.32     {d23[0]}, [r2], r3
+        bne         4b
+        b           99f
+.Lbi\@: ldr         r8, [sp, #120] // w0
+        vdup.32     q1, r8 // q1 wx0
+        ldr         r8, [sp, #124] // ox0
+        vdup.32     q2, r8 // q2 ox0
+        vadd.s32    q2, q5 // q2 = ox0 +ox1
+        vmov.s32    q10, #1
+        vadd.s32    q2, q10 // q2 = ox0 +ox1 + 1
+        vneg.s32    q15, q4 // q15 = - q4, preparation for left shift
+        vqrshl.s32  q3, q2, q15 // q3 = (ox0 + ox1 + 1)<