From patchwork Wed Nov 22 11:12:06 2017
X-Patchwork-Submitter: Shengbin Meng
X-Patchwork-Id: 6265
Delivered-To: ffmpeg-devel@ffmpeg.org
From: Shengbin Meng
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 22 Nov 2017 19:12:06 +0800
Message-Id: <20171122111206.17214-7-shengbinmeng@gmail.com>
X-Mailer: git-send-email 2.13.6 (Apple Git-96)
In-Reply-To: <20171122111206.17214-1-shengbinmeng@gmail.com>
References: <20171122111206.17214-1-shengbinmeng@gmail.com>
Subject: [FFmpeg-devel] [PATCH 6/6] avcodec/hevcdsp: Add NEON optimization for idct16x16
Cc: Meng Wang

From: Meng Wang

Signed-off-by: Meng Wang
---
 libavcodec/arm/hevcdsp_idct_neon.S | 241 +++++++++++++++++++++++++++++++++++++
 libavcodec/arm/hevcdsp_init_neon.c |   2 +
 2 files changed, 243 insertions(+)

diff --git a/libavcodec/arm/hevcdsp_idct_neon.S b/libavcodec/arm/hevcdsp_idct_neon.S
index e39d00634b..272abf279c 100644
--- a/libavcodec/arm/hevcdsp_idct_neon.S
+++ b/libavcodec/arm/hevcdsp_idct_neon.S
@@ -451,6 +451,247 @@ function ff_hevc_transform_8x8_neon_8, export=1
         bx      lr
 endfunc
 
+/* 16x16 even line combine, input: q3-q10 output: q8-q15 */
+.macro tr8_combine
+        vsub.s32   q12, q3, q10   // e_8[3] - o_8[3], dst[4]
+        vadd.s32   q11, q3, q10   // e_8[3] + o_8[3], dst[3]
+
+        vsub.s32   q13, q6, q9    // e_8[2] - o_8[2], dst[5]
+        vadd.s32   q10, q6, q9    // e_8[2] + o_8[2], dst[2]
+
+        vsub.s32   q14, q5, q8    // e_8[1] - o_8[1], dst[6]
+        vadd.s32   q9,  q5, q8    // e_8[1] + o_8[1], dst[1]
+
+        vsub.s32   q15, q4, q7    // e_8[0] - o_8[0], dst[7]
+        vadd.s32   q8,  q4, q7    // e_8[0] + o_8[0], dst[0]
+.endm
+
+.macro tr16_begin in0, in1, in2, in3, in4, in5, in6, in7
+        vmull.s16  q2, \in0, d2[1]  // 90 * src1
+        vmull.s16  q3, \in0, d2[0]  // 87 * src1
+        vmull.s16  q4, \in0, d2[3]  // 80 * src1
+        vmull.s16  q5, \in0, d2[2]  // 70 * src1
+        vmull.s16  q6, \in0, d3[1]  // 57 * src1
+        vmull.s16  q7, \in0, d3[0]  // 43 * src1
+        vmull.s16  q8, \in0, d3[3]  // 25 * src1
+        vmull.s16  q9, \in0, d3[2]  //  9 * src1
+
+        vmlal.s16  q2, \in1, d2[0]  // 87 * src3
+        vmlal.s16  q3, \in1, d3[1]  // 57 * src3
+        vmlal.s16  q4, \in1, d3[2]  //  9 * src3
+        vmlsl.s16  q5, \in1, d3[0]  //-43 * src3
+        vmlsl.s16  q6, \in1, d2[3]  //-80 * src3
+        vmlsl.s16  q7, \in1, d2[1]  //-90 * src3
+        vmlsl.s16  q8, \in1, d2[2]  //-70 * src3
+        vmlsl.s16  q9, \in1, d3[3]  //-25 * src3
+
+        vmlal.s16  q2, \in2, d2[3]  // 80 * src5
+        vmlal.s16  q3, \in2, d3[2]  //  9 * src5
+        vmlsl.s16  q4, \in2, d2[2]  //-70 * src5
+        vmlsl.s16  q5, \in2, d2[0]  //-87 * src5
+        vmlsl.s16  q6, \in2, d3[3]  //-25 * src5
+        vmlal.s16  q7, \in2, d3[1]  // 57 * src5
+        vmlal.s16  q8, \in2, d2[1]  // 90 * src5
+        vmlal.s16  q9, \in2, d3[0]  // 43 * src5
+
+        vmlal.s16  q2, \in3, d2[2]  // 70 * src7
+        vmlsl.s16  q3, \in3, d3[0]  //-43 * src7
+        vmlsl.s16  q4, \in3, d2[0]  //-87 * src7
+        vmlal.s16  q5, \in3, d3[2]  //  9 * src7
+        vmlal.s16  q6, \in3, d2[1]  // 90 * src7
+        vmlal.s16  q7, \in3, d3[3]  // 25 * src7
+        vmlsl.s16  q8, \in3, d2[3]  //-80 * src7
+        vmlsl.s16  q9, \in3, d3[1]  //-57 * src7
+
+        vmlal.s16  q2, \in4, d3[1]  // 57 * src9
+        vmlsl.s16  q3, \in4, d2[3]  //-80 * src9
+        vmlsl.s16  q4, \in4, d3[3]  //-25 * src9
+        vmlal.s16  q5, \in4, d2[1]  // 90 * src9
+        vmlsl.s16  q6, \in4, d3[2]  // -9 * src9
+        vmlsl.s16  q7, \in4, d2[0]  //-87 * src9
+        vmlal.s16  q8, \in4, d3[0]  // 43 * src9
+        vmlal.s16  q9, \in4, d2[2]  // 70 * src9
+
+        vmlal.s16  q2, \in5, d3[0]  // 43 * src11
+        vmlsl.s16  q3, \in5, d2[1]  //-90 * src11
+        vmlal.s16  q4, \in5, d3[1]  // 57 * src11
+        vmlal.s16  q5, \in5, d3[3]  // 25 * src11
+        vmlsl.s16  q6, \in5, d2[0]  //-87 * src11
+        vmlal.s16  q7, \in5, d2[2]  // 70 * src11
+        vmlal.s16  q8, \in5, d3[2]  //  9 * src11
+        vmlsl.s16  q9, \in5, d2[3]  //-80 * src11
+
+        vmlal.s16  q2, \in6, d3[3]  // 25 * src13
+        vmlsl.s16  q3, \in6, d2[2]  //-70 * src13
+        vmlal.s16  q4, \in6, d2[1]  // 90 * src13
+        vmlsl.s16  q5, \in6, d2[3]  //-80 * src13
+        vmlal.s16  q6, \in6, d3[0]  // 43 * src13
+        vmlal.s16  q7, \in6, d3[2]  //  9 * src13
+        vmlsl.s16  q8, \in6, d3[1]  //-57 * src13
+        vmlal.s16  q9, \in6, d2[0]  // 87 * src13
+
+        vmlal.s16  q2, \in7, d3[2]  //  9 * src15
+        vmlsl.s16  q3, \in7, d3[3]  //-25 * src15
+        vmlal.s16  q4, \in7, d3[0]  // 43 * src15
+        vmlsl.s16  q5, \in7, d3[1]  //-57 * src15
+        vmlal.s16  q6, \in7, d2[2]  // 70 * src15
+        vmlsl.s16  q7, \in7, d2[3]  //-80 * src15
+        vmlal.s16  q8, \in7, d2[0]  // 87 * src15
+        vmlsl.s16  q9, \in7, d2[1]  //-90 * src15
+.endm
+
+.macro tr16_end shift
+        vpop       {q2-q3}
+        vadd.s32   q4, q8, q2
+        vsub.s32   q5, q8, q2
+        vqrshrn.s32 d12, q4, \shift
+        vqrshrn.s32 d15, q5, \shift
+
+        vadd.s32   q4, q9, q3
+        vsub.s32   q5, q9, q3
+        vqrshrn.s32 d13, q4, \shift
+        vqrshrn.s32 d14, q5, \shift
+
+        vpop       {q2-q3}
+        vadd.s32   q4, q10, q2
+        vsub.s32   q5, q10, q2
+        vqrshrn.s32 d16, q4, \shift
+        vqrshrn.s32 d19, q5, \shift
+
+        vadd.s32   q4, q11, q3
+        vsub.s32   q5, q11, q3
+        vqrshrn.s32 d17, q4, \shift
+        vqrshrn.s32 d18, q5, \shift
+
+        vpop       {q2-q3}
+        vadd.s32   q4, q12, q2
+        vsub.s32   q5, q12, q2
+        vqrshrn.s32 d20, q4, \shift
+        vqrshrn.s32 d23, q5, \shift
+
+        vadd.s32   q4, q13, q3
+        vsub.s32   q5, q13, q3
+        vqrshrn.s32 d21, q4, \shift
+        vqrshrn.s32 d22, q5, \shift
+
+        vpop       {q2-q3}
+        vadd.s32   q4, q14, q2
+        vsub.s32   q5, q14, q2
+        vqrshrn.s32 d24, q4, \shift
+        vqrshrn.s32 d27, q5, \shift
+
+        vadd.s32   q4, q15, q3
+        vsub.s32   q5, q15, q3
+        vqrshrn.s32 d25, q4, \shift
+        vqrshrn.s32 d26, q5, \shift
+.endm
+
+function ff_hevc_transform_16x16_neon_8, export=1
+        push       {r4-r8}
+        vpush      {d8-d15}
+        mov        r5, #64
+        mov        r6, #32
+        mov        r7, #0
+        adr        r3, tr4f
+        vld1.16    {d0, d1, d2, d3}, [r3]
+        mov        r8, r0
+0:
+        add        r7, #4
+        add        r0, #32
+        // odd line
+        vld1.16    {d24}, [r0], r5
+        vld1.16    {d25}, [r0], r5
+        vld1.16    {d26}, [r0], r5
+        vld1.16    {d27}, [r0], r5
+        vld1.16    {d28}, [r0], r5
+        vld1.16    {d29}, [r0], r5
+        vld1.16    {d30}, [r0], r5
+        vld1.16    {d31}, [r0], r5
+        sub        r0, #544
+
+        tr16_begin d24, d25, d26, d27, d28, d29, d30, d31
+        vpush      {q2-q9}
+
+        // even line
+        vld1.16    {d24}, [r0], r5
+        vld1.16    {d25}, [r0], r5
+        vld1.16    {d26}, [r0], r5
+        vld1.16    {d27}, [r0], r5
+        vld1.16    {d28}, [r0], r5
+        vld1.16    {d29}, [r0], r5
+        vld1.16    {d30}, [r0], r5
+        vld1.16    {d31}, [r0], r5
+        sub        r0, #512
+
+        tr8_begin  d25, d27, d29, d31
+        tr4        d24, d26, d28, d30
+        tr8_combine
+
+        // combine
+        tr16_end   #7
+
+        // store
+        vst1.16    {d12}, [r0], r6
+        vst1.16    {d13}, [r0], r6
+        vst1.16    {d16}, [r0], r6
+        vst1.16    {d17}, [r0], r6
+        vst1.16    {d20}, [r0], r6
+        vst1.16    {d21}, [r0], r6
+        vst1.16    {d24}, [r0], r6
+        vst1.16    {d25}, [r0], r6
+        vst1.16    {d26}, [r0], r6
+        vst1.16    {d27}, [r0], r6
+        vst1.16    {d22}, [r0], r6
+        vst1.16    {d23}, [r0], r6
+        vst1.16    {d18}, [r0], r6
+        vst1.16    {d19}, [r0], r6
+        vst1.16    {d14}, [r0], r6
+        vst1.16    {d15}, [r0], r6
+        sub        r0, #504    // 512 - 8
+
+        cmp        r1, r7
+        blt        1f
+
+        cmp        r7, #16
+        blt        0b
+
+1:      mov        r0, r8
+        mov        r7, #4
+2:      subs       r7, #1
+        // 1st 4 line
+        vldm       r0, {q8-q15}  // coeffs
+        transpose_16b_4x4 d16, d20, d24, d28
+        transpose_16b_4x4 d17, d21, d25, d29
+        transpose_16b_4x4 d18, d22, d26, d30
+        transpose_16b_4x4 d19, d23, d27, d31
+        vpush      {q12-q13}     // 16x16 even line (8x8 odd line)
+        vpush      {q8-q9}       // 16x16 even line (8x8 even line)
+        tr16_begin d20, d28, d21, d29, d22, d30, d23, d31 // odd line transform 2n+1
+        vpop       {q12-q15}     // pop even line
+        vpush      {q2-q9}       // push results of 16x16 odd line
+        tr8_begin  d28, d29, d30, d31 // even line transform 2n
+        tr4        d24, d25, d26, d27
+        tr8_combine
+        tr16_end   #12
+        transpose_16b_4x4 d12, d13, d16, d17
+        transpose_16b_4x4 d20, d21, d24, d25
+        transpose_16b_4x4 d26, d27, d22, d23
+        transpose_16b_4x4 d18, d19, d14, d15
+        vswp       d13, d20
+        vswp       d14, d23
+        vswp       d17, d24
+        vswp       d18, d27
+        vswp       q8, q10
+        vswp       q7, q13
+        vstm       r0!, {q6-q13}
+        bne        2b
+
+        vpop       {d8-d15}
+        pop        {r4-r8}
+        bx         lr
+endfunc
+
 .align 4
 tr4f:
 .word 0x00240053 // 36 and d1[0] = 83
diff --git a/libavcodec/arm/hevcdsp_init_neon.c b/libavcodec/arm/hevcdsp_init_neon.c
index 33cc44ef40..d846d01081 100644
--- a/libavcodec/arm/hevcdsp_init_neon.c
+++ b/libavcodec/arm/hevcdsp_init_neon.c
@@ -36,6 +36,7 @@ void ff_hevc_v_loop_filter_chroma_neon(uint8_t *_pix, ptrdiff_t _stride, int *_t
 void ff_hevc_h_loop_filter_chroma_neon(uint8_t *_pix, ptrdiff_t _stride, int *_tc, uint8_t *_no_p, uint8_t *_no_q);
 void ff_hevc_transform_4x4_neon_8(int16_t *coeffs, int col_limit);
 void ff_hevc_transform_8x8_neon_8(int16_t *coeffs, int col_limit);
+void ff_hevc_transform_16x16_neon_8(int16_t *coeffs, int col_limit);
 void ff_hevc_idct_4x4_dc_neon_8(int16_t *coeffs);
 void ff_hevc_idct_8x8_dc_neon_8(int16_t *coeffs);
 void ff_hevc_idct_16x16_dc_neon_8(int16_t *coeffs);
@@ -550,6 +551,7 @@ av_cold void ff_hevcdsp_init_neon(HEVCDSPContext *c, const int bit_depth)
     c->sao_edge_filter[4] = ff_hevc_sao_edge_filter_neon_wrapper;
     c->idct[0] = ff_hevc_transform_4x4_neon_8;
     c->idct[1] = ff_hevc_transform_8x8_neon_8;
+    c->idct[2] = ff_hevc_transform_16x16_neon_8;
     c->idct_dc[0] = ff_hevc_idct_4x4_dc_neon_8;
     c->idct_dc[1] = ff_hevc_idct_8x8_dc_neon_8;
     c->idct_dc[2] = ff_hevc_idct_16x16_dc_neon_8;
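Note for reviewers: the even/odd butterfly decomposition that tr4, tr8_begin, tr8_combine, tr16_begin and tr16_end implement corresponds to the C sketch of one 16-point column pass below. Function and variable names are illustrative (not FFmpeg's); the coefficient tables are the HEVC partial-butterfly constants visible in the tr16_begin comments. Shift is 7 for the first pass and 12 for the second, matching tr16_end #7 / #12.

```c
#include <assert.h>
#include <stdint.h>

static int16_t clip16(int v)
{
    return v < -32768 ? -32768 : v > 32767 ? 32767 : (int16_t)v;
}

/* One 16-point inverse-transform pass over a single column. */
void idct16_col(const int16_t *src, int16_t *dst, int stride, int shift)
{
    /* Odd-input matrix: rows = src1,src3,...,src15, columns = o[0..7].
     * These are the 90/87/80/... multipliers in the tr16_begin comments. */
    static const int o16[8][8] = {
        { 90, 87, 80, 70, 57, 43, 25,  9 },
        { 87, 57,  9,-43,-80,-90,-70,-25 },
        { 80,  9,-70,-87,-25, 57, 90, 43 },
        { 70,-43,-87,  9, 90, 25,-80,-57 },
        { 57,-80,-25, 90, -9,-87, 43, 70 },
        { 43,-90, 57, 25,-87, 70,  9,-80 },
        { 25,-70, 90,-80, 43,  9,-57, 87 },
        {  9,-25, 43,-57, 70,-80, 87,-90 },
    };
    /* Odd part of the embedded 8-point transform (tr8_begin inputs). */
    static const int o8[4][4] = {
        { 89, 75, 50, 18 },
        { 75,-18,-89,-50 },
        { 50,-89, 18, 75 },
        { 18,-50, 75,-89 },
    };
    int o[8], e[8], ee[4];
    const int add = 1 << (shift - 1);

    /* tr16_begin: odd half from src1, src3, ..., src15 */
    for (int k = 0; k < 8; k++) {
        o[k] = 0;
        for (int i = 0; i < 8; i++)
            o[k] += o16[i][k] * src[(2 * i + 1) * stride];
    }

    /* tr4: 4-point core from src0, src4, src8, src12 */
    int ee0 = 64 * src[0]          + 64 * src[ 8 * stride];
    int ee1 = 64 * src[0]          - 64 * src[ 8 * stride];
    int eo0 = 83 * src[4 * stride] + 36 * src[12 * stride];
    int eo1 = 36 * src[4 * stride] - 83 * src[12 * stride];
    ee[0] = ee0 + eo0;
    ee[1] = ee1 + eo1;
    ee[2] = ee1 - eo1;
    ee[3] = ee0 - eo0;

    /* tr8_begin + tr8_combine: 8-point even half from src2, src6, src10, src14 */
    for (int k = 0; k < 4; k++) {
        int eo = 0;
        for (int i = 0; i < 4; i++)
            eo += o8[i][k] * src[(4 * i + 2) * stride];
        e[k]     = ee[k] + eo;
        e[7 - k] = ee[k] - eo;
    }

    /* tr16_end: final butterfly with rounding, shift and narrowing,
     * matching the vqrshrn.s32 rounding-narrow behaviour */
    for (int k = 0; k < 8; k++) {
        dst[k * stride]        = clip16((e[k] + o[k] + add) >> shift);
        dst[(15 - k) * stride] = clip16((e[k] - o[k] + add) >> shift);
    }
}
```

A quick sanity check: with a DC-only input, every output of the column pass must come out equal, which is an easy cross-check against the NEON path on small hand-made coefficient blocks.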