From patchwork Thu Feb 4 11:32:56 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Josh Dekker
X-Patchwork-Id: 25385
From: Josh Dekker
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 4 Feb 2021 12:32:56 +0100
Message-Id: <20210204113259.20112-2-josh@itanimul.li>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20210204113259.20112-1-josh@itanimul.li>
References: <20210204113259.20112-1-josh@itanimul.li>
Subject: [FFmpeg-devel] [PATCH v2 1/4] avcodec/aarch64/hevcdsp: port SIMD idct functions
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: Reimar Döffinger
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" From: Reimar Döffinger Makes SIMD-optimized 8x8 and 16x16 idcts for 8 and 10 bit depth available on aarch64. For a UHD HDR (10 bit) sample video these were consuming the most time and this optimization reduced overall decode time from 19.4s to 16.4s, approximately 15% speedup. Test sample was the first 300 frames of "LG 4K HDR Demo - New York.ts", running on Apple M1. Signed-off-by: Josh Dekker --- libavcodec/aarch64/Makefile | 2 + libavcodec/aarch64/hevcdsp_idct_neon.S | 380 ++++++++++++++++++++++ libavcodec/aarch64/hevcdsp_init_aarch64.c | 45 +++ libavcodec/hevcdsp.c | 2 + libavcodec/hevcdsp.h | 1 + 5 files changed, 430 insertions(+) create mode 100644 libavcodec/aarch64/hevcdsp_idct_neon.S create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile index f6434e40da..2ea1d74a38 100644 --- a/libavcodec/aarch64/Makefile +++ b/libavcodec/aarch64/Makefile @@ -61,3 +61,5 @@ NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \ aarch64/vp9lpf_neon.o \ aarch64/vp9mc_16bpp_neon.o \ aarch64/vp9mc_neon.o +NEON-OBJS-$(CONFIG_HEVC_DECODER) += aarch64/hevcdsp_idct_neon.o \ + aarch64/hevcdsp_init_aarch64.o diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S new file mode 100644 index 0000000000..c70d6a906d --- /dev/null +++ b/libavcodec/aarch64/hevcdsp_idct_neon.S @@ -0,0 +1,380 @@ +/* + * ARM NEON optimised IDCT functions for HEVC decoding + * Copyright (c) 2014 Seppo Tomperi + * Copyright (c) 2017 Alexandra Hájková + * + * Ported from arm/hevcdsp_idct_neon.S by + * Copyright (c) 2020 Reimar Döffinger + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. 
+ * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/aarch64/asm.S" + +const trans, align=4 + .short 64, 83, 64, 36 + .short 89, 75, 50, 18 + .short 90, 87, 80, 70 + .short 57, 43, 25, 9 + .short 90, 90, 88, 85 + .short 82, 78, 73, 67 + .short 61, 54, 46, 38 + .short 31, 22, 13, 4 +endconst + +.macro sum_sub out, in, c, op, p + .ifc \op, + + smlal\p \out, \in, \c + .else + smlsl\p \out, \in, \c + .endif +.endm + +.macro fixsqrshrn d, dt, n, m + .ifc \dt, .8h + sqrshrn2 \d\dt, \n\().4s, \m + .else + sqrshrn \n\().4h, \n\().4s, \m + mov \d\().d[0], \n\().d[0] + .endif +.endm + +// uses and clobbers v28-v31 as temp registers +.macro tr_4x4_8 in0, in1, in2, in3, out0, out1, out2, out3, p1, p2 + sshll\p1 v28.4s, \in0, #6 + mov v29.16b, v28.16b + smull\p1 v30.4s, \in1, v0.h[1] + smull\p1 v31.4s, \in1, v0.h[3] + smlal\p2 v28.4s, \in2, v0.h[0] //e0 + smlsl\p2 v29.4s, \in2, v0.h[0] //e1 + smlal\p2 v30.4s, \in3, v0.h[3] //o0 + smlsl\p2 v31.4s, \in3, v0.h[1] //o1 + + add \out0, v28.4s, v30.4s + add \out1, v29.4s, v31.4s + sub \out2, v29.4s, v31.4s + sub \out3, v28.4s, v30.4s +.endm + +.macro transpose8_4x4 r0, r1, r2, r3 + trn1 v2.8h, \r0\().8h, \r1\().8h + trn2 v3.8h, \r0\().8h, \r1\().8h + trn1 v4.8h, \r2\().8h, \r3\().8h + trn2 v5.8h, \r2\().8h, \r3\().8h + trn1 \r0\().4s, v2.4s, v4.4s + trn2 \r2\().4s, v2.4s, v4.4s + trn1 \r1\().4s, v3.4s, v5.4s + trn2 \r3\().4s, v3.4s, v5.4s +.endm + +.macro transpose_8x8 r0, r1, r2, r3, r4, r5, r6, r7 + transpose8_4x4 \r0, \r1, \r2, \r3 + transpose8_4x4 \r4, \r5, \r6, \r7 +.endm + +.macro tr_8x4 
shift, in0,in0t, in1,in1t, in2,in2t, in3,in3t, in4,in4t, in5,in5t, in6,in6t, in7,in7t, p1, p2 + tr_4x4_8 \in0\in0t, \in2\in2t, \in4\in4t, \in6\in6t, v24.4s, v25.4s, v26.4s, v27.4s, \p1, \p2 + + smull\p1 v30.4s, \in1\in1t, v0.h[6] + smull\p1 v28.4s, \in1\in1t, v0.h[4] + smull\p1 v29.4s, \in1\in1t, v0.h[5] + sum_sub v30.4s, \in3\in3t, v0.h[4], -, \p1 + sum_sub v28.4s, \in3\in3t, v0.h[5], +, \p1 + sum_sub v29.4s, \in3\in3t, v0.h[7], -, \p1 + + sum_sub v30.4s, \in5\in5t, v0.h[7], +, \p2 + sum_sub v28.4s, \in5\in5t, v0.h[6], +, \p2 + sum_sub v29.4s, \in5\in5t, v0.h[4], -, \p2 + + sum_sub v30.4s, \in7\in7t, v0.h[5], +, \p2 + sum_sub v28.4s, \in7\in7t, v0.h[7], +, \p2 + sum_sub v29.4s, \in7\in7t, v0.h[6], -, \p2 + + add v31.4s, v26.4s, v30.4s + sub v26.4s, v26.4s, v30.4s + fixsqrshrn \in2,\in2t, v31, \shift + + + smull\p1 v31.4s, \in1\in1t, v0.h[7] + sum_sub v31.4s, \in3\in3t, v0.h[6], -, \p1 + sum_sub v31.4s, \in5\in5t, v0.h[5], +, \p2 + sum_sub v31.4s, \in7\in7t, v0.h[4], -, \p2 + fixsqrshrn \in5,\in5t, v26, \shift + + + add v26.4s, v24.4s, v28.4s + sub v24.4s, v24.4s, v28.4s + add v28.4s, v25.4s, v29.4s + sub v25.4s, v25.4s, v29.4s + add v30.4s, v27.4s, v31.4s + sub v27.4s, v27.4s, v31.4s + + fixsqrshrn \in0,\in0t, v26, \shift + fixsqrshrn \in7,\in7t, v24, \shift + fixsqrshrn \in1,\in1t, v28, \shift + fixsqrshrn \in6,\in6t, v25, \shift + fixsqrshrn \in3,\in3t, v30, \shift + fixsqrshrn \in4,\in4t, v27, \shift +.endm + +.macro idct_8x8 bitdepth +function ff_hevc_idct_8x8_\bitdepth\()_neon, export=1 +//x0 - coeffs + mov x1, x0 + ld1 {v16.8h-v19.8h}, [x1], #64 + ld1 {v20.8h-v23.8h}, [x1] + + movrel x1, trans + ld1 {v0.8h}, [x1] + + tr_8x4 7, v16,.4h, v17,.4h, v18,.4h, v19,.4h, v20,.4h, v21,.4h, v22,.4h, v23,.4h + tr_8x4 7, v16,.8h, v17,.8h, v18,.8h, v19,.8h, v20,.8h, v21,.8h, v22,.8h, v23,.8h, 2, 2 + + transpose_8x8 v16, v17, v18, v19, v20, v21, v22, v23 + + tr_8x4 20 - \bitdepth, v16,.4h, v17,.4h, v18,.4h, v19,.4h, v16,.8h, v17,.8h, v18,.8h, v19,.8h, , 2 + tr_8x4 20 - 
\bitdepth, v20,.4h, v21,.4h, v22,.4h, v23,.4h, v20,.8h, v21,.8h, v22,.8h, v23,.8h, , 2 + + transpose_8x8 v16, v17, v18, v19, v20, v21, v22, v23 + + mov x1, x0 + st1 {v16.8h-v19.8h}, [x1], #64 + st1 {v20.8h-v23.8h}, [x1] + + ret +endfunc +.endm + +.macro butterfly e, o, tmp_p, tmp_m + add \tmp_p, \e, \o + sub \tmp_m, \e, \o +.endm + +.macro tr16_8x4 in0, in1, in2, in3, offset + tr_4x4_8 \in0\().4h, \in1\().4h, \in2\().4h, \in3\().4h, v24.4s, v25.4s, v26.4s, v27.4s + + smull2 v28.4s, \in0\().8h, v0.h[4] + smull2 v29.4s, \in0\().8h, v0.h[5] + smull2 v30.4s, \in0\().8h, v0.h[6] + smull2 v31.4s, \in0\().8h, v0.h[7] + sum_sub v28.4s, \in1\().8h, v0.h[5], +, 2 + sum_sub v29.4s, \in1\().8h, v0.h[7], -, 2 + sum_sub v30.4s, \in1\().8h, v0.h[4], -, 2 + sum_sub v31.4s, \in1\().8h, v0.h[6], -, 2 + + sum_sub v28.4s, \in2\().8h, v0.h[6], +, 2 + sum_sub v29.4s, \in2\().8h, v0.h[4], -, 2 + sum_sub v30.4s, \in2\().8h, v0.h[7], +, 2 + sum_sub v31.4s, \in2\().8h, v0.h[5], +, 2 + + sum_sub v28.4s, \in3\().8h, v0.h[7], +, 2 + sum_sub v29.4s, \in3\().8h, v0.h[6], -, 2 + sum_sub v30.4s, \in3\().8h, v0.h[5], +, 2 + sum_sub v31.4s, \in3\().8h, v0.h[4], -, 2 + + butterfly v24.4s, v28.4s, v16.4s, v23.4s + butterfly v25.4s, v29.4s, v17.4s, v22.4s + butterfly v26.4s, v30.4s, v18.4s, v21.4s + butterfly v27.4s, v31.4s, v19.4s, v20.4s + add x4, sp, #\offset + st1 {v16.4s-v19.4s}, [x4], #64 + st1 {v20.4s-v23.4s}, [x4] +.endm + +.macro load16 in0, in1, in2, in3 + ld1 {\in0}[0], [x1], x2 + ld1 {\in0}[1], [x3], x2 + ld1 {\in1}[0], [x1], x2 + ld1 {\in1}[1], [x3], x2 + ld1 {\in2}[0], [x1], x2 + ld1 {\in2}[1], [x3], x2 + ld1 {\in3}[0], [x1], x2 + ld1 {\in3}[1], [x3], x2 +.endm + +.macro add_member in, t0, t1, t2, t3, t4, t5, t6, t7, op0, op1, op2, op3, op4, op5, op6, op7, p + sum_sub v21.4s, \in, \t0, \op0, \p + sum_sub v22.4s, \in, \t1, \op1, \p + sum_sub v23.4s, \in, \t2, \op2, \p + sum_sub v24.4s, \in, \t3, \op3, \p + sum_sub v25.4s, \in, \t4, \op4, \p + sum_sub v26.4s, \in, \t5, \op5, \p + sum_sub 
v27.4s, \in, \t6, \op6, \p + sum_sub v28.4s, \in, \t7, \op7, \p +.endm + +.macro butterfly16 in0, in1, in2, in3, in4, in5, in6, in7 + add v20.4s, \in0, \in1 + sub \in0, \in0, \in1 + add \in1, \in2, \in3 + sub \in2, \in2, \in3 + add \in3, \in4, \in5 + sub \in4, \in4, \in5 + add \in5, \in6, \in7 + sub \in6, \in6, \in7 +.endm + +.macro store16 in0, in1, in2, in3, rx + st1 {\in0}[0], [x1], x2 + st1 {\in0}[1], [x3], \rx + st1 {\in1}[0], [x1], x2 + st1 {\in1}[1], [x3], \rx + st1 {\in2}[0], [x1], x2 + st1 {\in2}[1], [x3], \rx + st1 {\in3}[0], [x1], x2 + st1 {\in3}[1], [x3], \rx +.endm + +.macro scale out0, out1, out2, out3, in0, in1, in2, in3, in4, in5, in6, in7, shift + sqrshrn \out0\().4h, \in0, \shift + sqrshrn2 \out0\().8h, \in1, \shift + sqrshrn \out1\().4h, \in2, \shift + sqrshrn2 \out1\().8h, \in3, \shift + sqrshrn \out2\().4h, \in4, \shift + sqrshrn2 \out2\().8h, \in5, \shift + sqrshrn \out3\().4h, \in6, \shift + sqrshrn2 \out3\().8h, \in7, \shift +.endm + +.macro transpose16_4x4_2 r0, r1, r2, r3 + // lower halves + trn1 v2.4h, \r0\().4h, \r1\().4h + trn2 v3.4h, \r0\().4h, \r1\().4h + trn1 v4.4h, \r2\().4h, \r3\().4h + trn2 v5.4h, \r2\().4h, \r3\().4h + trn1 v6.2s, v2.2s, v4.2s + trn2 v7.2s, v2.2s, v4.2s + trn1 v2.2s, v3.2s, v5.2s + trn2 v4.2s, v3.2s, v5.2s + mov \r0\().d[0], v6.d[0] + mov \r2\().d[0], v7.d[0] + mov \r1\().d[0], v2.d[0] + mov \r3\().d[0], v4.d[0] + + // upper halves in reverse order + trn1 v2.8h, \r3\().8h, \r2\().8h + trn2 v3.8h, \r3\().8h, \r2\().8h + trn1 v4.8h, \r1\().8h, \r0\().8h + trn2 v5.8h, \r1\().8h, \r0\().8h + trn1 v6.4s, v2.4s, v4.4s + trn2 v7.4s, v2.4s, v4.4s + trn1 v2.4s, v3.4s, v5.4s + trn2 v4.4s, v3.4s, v5.4s + mov \r3\().d[1], v6.d[1] + mov \r1\().d[1], v7.d[1] + mov \r2\().d[1], v2.d[1] + mov \r0\().d[1], v4.d[1] +.endm + +.macro tr_16x4 name, shift, offset, step +function func_tr_16x4_\name + mov x1, x5 + add x3, x5, #(\step * 64) + mov x2, #(\step * 128) + load16 v16.d, v17.d, v18.d, v19.d + movrel x1, trans + ld1 {v0.8h}, 
[x1] + + tr16_8x4 v16, v17, v18, v19, \offset + + add x1, x5, #(\step * 32) + add x3, x5, #(\step * 3 *32) + mov x2, #(\step * 128) + load16 v20.d, v17.d, v18.d, v19.d + movrel x1, trans, 16 + ld1 {v1.8h}, [x1] + smull v21.4s, v20.4h, v1.h[0] + smull v22.4s, v20.4h, v1.h[1] + smull v23.4s, v20.4h, v1.h[2] + smull v24.4s, v20.4h, v1.h[3] + smull v25.4s, v20.4h, v1.h[4] + smull v26.4s, v20.4h, v1.h[5] + smull v27.4s, v20.4h, v1.h[6] + smull v28.4s, v20.4h, v1.h[7] + + add_member v20.8h, v1.h[1], v1.h[4], v1.h[7], v1.h[5], v1.h[2], v1.h[0], v1.h[3], v1.h[6], +, +, +, -, -, -, -, -, 2 + add_member v17.4h, v1.h[2], v1.h[7], v1.h[3], v1.h[1], v1.h[6], v1.h[4], v1.h[0], v1.h[5], +, +, -, -, -, +, +, + + add_member v17.8h, v1.h[3], v1.h[5], v1.h[1], v1.h[7], v1.h[0], v1.h[6], v1.h[2], v1.h[4], +, -, -, +, +, +, -, -, 2 + add_member v18.4h, v1.h[4], v1.h[2], v1.h[6], v1.h[0], v1.h[7], v1.h[1], v1.h[5], v1.h[3], +, -, -, +, -, -, +, + + add_member v18.8h, v1.h[5], v1.h[0], v1.h[4], v1.h[6], v1.h[1], v1.h[3], v1.h[7], v1.h[2], +, -, +, +, -, +, +, -, 2 + add_member v19.4h, v1.h[6], v1.h[3], v1.h[0], v1.h[2], v1.h[5], v1.h[7], v1.h[4], v1.h[1], +, -, +, -, +, +, -, + + add_member v19.8h, v1.h[7], v1.h[6], v1.h[5], v1.h[4], v1.h[3], v1.h[2], v1.h[1], v1.h[0], +, -, +, -, +, -, +, -, 2 + + add x4, sp, #\offset + ld1 {v16.4s-v19.4s}, [x4], #64 + + butterfly16 v16.4s, v21.4s, v17.4s, v22.4s, v18.4s, v23.4s, v19.4s, v24.4s + scale v29, v30, v31, v24, v20.4s, v16.4s, v21.4s, v17.4s, v22.4s, v18.4s, v23.4s, v19.4s, \shift + transpose16_4x4_2 v29, v30, v31, v24 + mov x1, x6 + add x3, x6, #(24 +3*32) + mov x2, #32 + mov x4, #-32 + store16 v29.d, v30.d, v31.d, v24.d, x4 + + add x4, sp, #(\offset + 64) + ld1 {v16.4s-v19.4s}, [x4] + butterfly16 v16.4s, v25.4s, v17.4s, v26.4s, v18.4s, v27.4s, v19.4s, v28.4s + scale v29, v30, v31, v20, v20.4s, v16.4s, v25.4s, v17.4s, v26.4s, v18.4s, v27.4s, v19.4s, \shift + transpose16_4x4_2 v29, v30, v31, v20 + + add x1, x6, #8 + add x3, x6, #(16 + 3 * 32) 
+ mov x2, #32 + mov x4, #-32 + store16 v29.d, v30.d, v31.d, v20.d, x4 + + ret +endfunc +.endm + +.macro idct_16x16 bitdepth +function ff_hevc_idct_16x16_\bitdepth\()_neon, export=1 +//r0 - coeffs + mov x15, lr + + // allocate a temp buffer + sub sp, sp, #640 + +.irp i, 0, 1, 2, 3 + add x5, x0, #(8 * \i) + add x6, sp, #(8 * \i * 16) + bl func_tr_16x4_firstpass +.endr + +.irp i, 0, 1, 2, 3 + add x5, sp, #(8 * \i) + add x6, x0, #(8 * \i * 16) + bl func_tr_16x4_secondpass_\bitdepth +.endr + + add sp, sp, #640 + + mov lr, x15 + ret +endfunc +.endm + +idct_8x8 8 +idct_8x8 10 + +tr_16x4 firstpass, 7, 512, 1 +tr_16x4 secondpass_8, 20 - 8, 512, 1 +tr_16x4 secondpass_10, 20 - 10, 512, 1 + +idct_16x16 8 +idct_16x16 10 diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c new file mode 100644 index 0000000000..19d9a7f9ed --- /dev/null +++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c @@ -0,0 +1,45 @@ +/* + * Copyright (c) 2020 Reimar Döffinger + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include <stdint.h> + +#include "libavutil/attributes.h" +#include "libavutil/cpu.h" +#include "libavutil/aarch64/cpu.h" +#include "libavcodec/hevcdsp.h" + +void ff_hevc_idct_8x8_8_neon(int16_t *coeffs, int col_limit); +void ff_hevc_idct_8x8_10_neon(int16_t *coeffs, int col_limit); +void ff_hevc_idct_16x16_8_neon(int16_t *coeffs, int col_limit); +void ff_hevc_idct_16x16_10_neon(int16_t *coeffs, int col_limit); + +av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth) +{ + if (!have_neon(av_get_cpu_flags())) return; + + if (bit_depth == 8) { + c->idct[1] = ff_hevc_idct_8x8_8_neon; + c->idct[2] = ff_hevc_idct_16x16_8_neon; + } + if (bit_depth == 10) { + c->idct[1] = ff_hevc_idct_8x8_10_neon; + c->idct[2] = ff_hevc_idct_16x16_10_neon; + } +} diff --git a/libavcodec/hevcdsp.c b/libavcodec/hevcdsp.c index 957e40d5ff..fe272ac1ce 100644 --- a/libavcodec/hevcdsp.c +++ b/libavcodec/hevcdsp.c @@ -257,6 +257,8 @@ int i = 0; break; } + if (ARCH_AARCH64) + ff_hevc_dsp_init_aarch64(hevcdsp, bit_depth); if (ARCH_ARM) ff_hevc_dsp_init_arm(hevcdsp, bit_depth); if (ARCH_PPC) diff --git a/libavcodec/hevcdsp.h b/libavcodec/hevcdsp.h index c605a343d6..0e013a8328 100644 --- a/libavcodec/hevcdsp.h +++ b/libavcodec/hevcdsp.h @@ -129,6 +129,7 @@ void ff_hevc_dsp_init(HEVCDSPContext *hpc, int bit_depth); extern const int8_t ff_hevc_epel_filters[7][4]; extern const int8_t ff_hevc_qpel_filters[3][16]; +void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth); void ff_hevc_dsp_init_arm(HEVCDSPContext *c, const int bit_depth); void ff_hevc_dsp_init_ppc(HEVCDSPContext *c, const int bit_depth); void ff_hevc_dsp_init_x86(HEVCDSPContext *c, const int bit_depth); From patchwork Thu Feb 4 11:32:57 2021 Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Josh Dekker
X-Patchwork-Id: 25384
From: Josh Dekker
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 4 Feb 2021 12:32:57 +0100
Message-Id: <20210204113259.20112-3-josh@itanimul.li>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20210204113259.20112-1-josh@itanimul.li>
References: <20210204113259.20112-1-josh@itanimul.li>
Subject: [FFmpeg-devel] [PATCH v2 2/4] avcodec/aarch64/hevcdsp: port add_residual functions
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: Reimar Döffinger
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"

From: Reimar Döffinger

Speedup is fairly small, around 1.5%, but these are fairly simple.
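
For readers following the NEON code in this patch, a scalar C sketch of what the add_residual functions compute may help. It is only an illustration and not the FFmpeg C reference: the names clip_int, add_residual_8_ref and add_residual_10_ref are hypothetical, the stride is assumed to be in bytes, and the residuals are assumed to be a packed size x size block of int16_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp v to the inclusive range [lo, hi]. */
static int clip_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* 8-bit sketch: dst[x] = clip(dst[x] + res[x], 0, 255) for a size x size
 * block. stride is the distance between destination rows in bytes
 * (assumption); res is read as a packed block (assumption). */
static void add_residual_8_ref(uint8_t *dst, const int16_t *res,
                               ptrdiff_t stride, int size)
{
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (uint8_t)clip_int(dst[x] + res[x], 0, 255);
        dst += stride;
        res += size;
    }
}

/* 10-bit sketch: pixels are 16-bit words, clipped to [0, 0x3FF].
 * The destination is passed as uint8_t * to mirror the prototypes in
 * this patch; it is assumed to be suitably aligned for uint16_t. */
static void add_residual_10_ref(uint8_t *dst8, const int16_t *res,
                                ptrdiff_t stride, int size)
{
    for (int y = 0; y < size; y++) {
        uint16_t *dst = (uint16_t *)dst8;
        for (int x = 0; x < size; x++)
            dst[x] = (uint16_t)clip_int(dst[x] + res[x], 0, 0x3FF);
        dst8 += stride;
        res  += size;
    }
}
```

The 8-bit assembly reaches the same result by widening the pixels with uxtl, adding with sqadd and narrowing with sqxtun; the 10-bit variants use sqadd followed by the clip10 macro (smax against 0, smin against 0x3FF).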
Signed-off-by: Josh Dekker --- libavcodec/aarch64/hevcdsp_idct_neon.S | 190 ++++++++++++++++++++++ libavcodec/aarch64/hevcdsp_init_aarch64.c | 24 +++ 2 files changed, 214 insertions(+) diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S index c70d6a906d..329038a958 100644 --- a/libavcodec/aarch64/hevcdsp_idct_neon.S +++ b/libavcodec/aarch64/hevcdsp_idct_neon.S @@ -36,6 +36,196 @@ const trans, align=4 .short 31, 22, 13, 4 endconst +.macro clip10 in1, in2, c1, c2 + smax \in1, \in1, \c1 + smax \in2, \in2, \c1 + smin \in1, \in1, \c2 + smin \in2, \in2, \c2 +.endm + +function ff_hevc_add_residual_4x4_8_neon, export=1 + ld1 {v0.8h-v1.8h}, [x1] + ld1 {v2.s}[0], [x0], x2 + ld1 {v2.s}[1], [x0], x2 + ld1 {v2.s}[2], [x0], x2 + ld1 {v2.s}[3], [x0], x2 + sub x0, x0, x2, lsl #2 + uxtl v6.8h, v2.8B + uxtl2 v7.8h, v2.16B + sqadd v0.8h, v0.8h, v6.8h + sqadd v1.8h, v1.8h, v7.8h + sqxtun v0.8B, v0.8h + sqxtun2 v0.16B, v1.8h + st1 {v0.s}[0], [x0], x2 + st1 {v0.s}[1], [x0], x2 + st1 {v0.s}[2], [x0], x2 + st1 {v0.s}[3], [x0], x2 + ret +endfunc + +function ff_hevc_add_residual_4x4_10_neon, export=1 + mov x12, x0 + ld1 {v0.8h-v1.8h}, [x1] + ld1 {v2.d}[0], [x12], x2 + ld1 {v2.d}[1], [x12], x2 + ld1 {v3.d}[0], [x12], x2 + sqadd v0.8h, v0.8h, v2.8h + ld1 {V3.d}[1], [x12], x2 + movi v4.8h, #0 + sqadd v1.8h, v1.8h, v3.8h + mvni v5.8h, #0xFC, LSL #8 // movi #0x3FF + clip10 v0.8h, v1.8h, v4.8h, v5.8h + st1 {v0.d}[0], [x0], x2 + st1 {v0.d}[1], [x0], x2 + st1 {v1.d}[0], [x0], x2 + st1 {v1.d}[1], [x0], x2 + ret +endfunc + +function ff_hevc_add_residual_8x8_8_neon, export=1 + add x12, x0, x2 + add x2, x2, x2 + mov x3, #8 +1: subs x3, x3, #2 + ld1 {v2.d}[0], [x0] + ld1 {v2.d}[1], [x12] + uxtl v3.8h, v2.8B + ld1 {v0.8h-v1.8h}, [x1], #32 + uxtl2 v2.8h, v2.16B + sqadd v0.8h, v0.8h, v3.8h + sqadd v1.8h, v1.8h, v2.8h + sqxtun v0.8B, v0.8h + sqxtun2 v0.16B, v1.8h + st1 {v0.d}[0], [x0], x2 + st1 {v0.d}[1], [x12], x2 + bne 1b + ret +endfunc + +function 
ff_hevc_add_residual_8x8_10_neon, export=1 + add x12, x0, x2 + add x2, x2, x2 + mov x3, #8 + movi v4.8h, #0 + mvni v5.8h, #0xFC, LSL #8 // movi #0x3FF +1: subs x3, x3, #2 + ld1 {v0.8h-v1.8h}, [x1], #32 + ld1 {v2.8h}, [x0] + sqadd v0.8h, v0.8h, v2.8h + ld1 {v3.8h}, [x12] + sqadd v1.8h, v1.8h, v3.8h + clip10 v0.8h, v1.8h, v4.8h, v5.8h + st1 {v0.8h}, [x0], x2 + st1 {v1.8h}, [x12], x2 + bne 1b + ret +endfunc + +function ff_hevc_add_residual_16x16_8_neon, export=1 + mov x3, #16 + add x12, x0, x2 + add x2, x2, x2 +1: subs x3, x3, #2 + ld1 {v16.16B}, [x0] + ld1 {v0.8h-v3.8h}, [x1], #64 + ld1 {v19.16B}, [x12] + uxtl v17.8h, v16.8B + uxtl2 v18.8h, v16.16B + uxtl v20.8h, v19.8B + uxtl2 v21.8h, v19.16B + sqadd v0.8h, v0.8h, v17.8h + sqadd v1.8h, v1.8h, v18.8h + sqadd v2.8h, v2.8h, v20.8h + sqadd v3.8h, v3.8h, v21.8h + sqxtun v0.8B, v0.8h + sqxtun2 v0.16B, v1.8h + sqxtun v1.8B, v2.8h + sqxtun2 v1.16B, v3.8h + st1 {v0.16B}, [x0], x2 + st1 {v1.16B}, [x12], x2 + bne 1b + ret +endfunc + +function ff_hevc_add_residual_16x16_10_neon, export=1 + mov x3, #16 + movi v20.8h, #0 + mvni v21.8h, #0xFC, LSL #8 // movi #0x3FF + add x12, x0, x2 + add x2, x2, x2 +1: subs x3, x3, #2 + ld1 {v16.8h-v17.8h}, [x0] + ld1 {v0.8h-v3.8h}, [x1], #64 + sqadd v0.8h, v0.8h, v16.8h + ld1 {v18.8h-v19.8h}, [x12] + sqadd v1.8h, v1.8h, v17.8h + sqadd v2.8h, v2.8h, v18.8h + sqadd v3.8h, v3.8h, v19.8h + clip10 v0.8h, v1.8h, v20.8h, v21.8h + clip10 v2.8h, v3.8h, v20.8h, v21.8h + st1 {v0.8h-v1.8h}, [x0], x2 + st1 {v2.8h-v3.8h}, [x12], x2 + bne 1b + ret +endfunc + +function ff_hevc_add_residual_32x32_8_neon, export=1 + add x12, x0, x2 + add x2, x2, x2 + mov x3, #32 +1: subs x3, x3, #2 + ld1 {v20.16B, v21.16B}, [x0] + uxtl v16.8h, v20.8B + uxtl2 v17.8h, v20.16B + ld1 {v22.16B, v23.16B}, [x12] + uxtl v18.8h, v21.8B + uxtl2 v19.8h, v21.16B + uxtl v20.8h, v22.8B + ld1 {v0.8h-v3.8h}, [x1], #64 + ld1 {v4.8h-v7.8h}, [x1], #64 + uxtl2 v21.8h, v22.16B + uxtl v22.8h, v23.8B + uxtl2 v23.8h, v23.16B + sqadd v0.8h, v0.8h, v16.8h 
+ sqadd v1.8h, v1.8h, v17.8h + sqadd v2.8h, v2.8h, v18.8h + sqadd v3.8h, v3.8h, v19.8h + sqadd v4.8h, v4.8h, v20.8h + sqadd v5.8h, v5.8h, v21.8h + sqadd v6.8h, v6.8h, v22.8h + sqadd v7.8h, v7.8h, v23.8h + sqxtun v0.8B, v0.8h + sqxtun2 v0.16B, v1.8h + sqxtun v1.8B, v2.8h + sqxtun2 v1.16B, v3.8h + sqxtun v2.8B, v4.8h + sqxtun2 v2.16B, v5.8h + st1 {v0.16B, v1.16B}, [x0], x2 + sqxtun v3.8B, v6.8h + sqxtun2 v3.16B, v7.8h + st1 {v2.16B, v3.16B}, [x12], x2 + bne 1b + ret +endfunc + +function ff_hevc_add_residual_32x32_10_neon, export=1 + mov x3, #32 + movi v20.8h, #0 + mvni v21.8h, #0xFC, LSL #8 // movi #0x3FF +1: subs x3, x3, #1 + ld1 {v0.8h-v3.8h}, [x1], #64 + ld1 {v16.8h-v19.8h}, [x0] + sqadd v0.8h, v0.8h, v16.8h + sqadd v1.8h, v1.8h, v17.8h + sqadd v2.8h, v2.8h, v18.8h + sqadd v3.8h, v3.8h, v19.8h + clip10 v0.8h, v1.8h, v20.8h, v21.8h + clip10 v2.8h, v3.8h, v20.8h, v21.8h + st1 {v0.8h-v3.8h}, [x0], x2 + bne 1b + ret +endfunc + .macro sum_sub out, in, c, op, p .ifc \op, + smlal\p \out, \in, \c diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c index 19d9a7f9ed..4c29daa6d5 100644 --- a/libavcodec/aarch64/hevcdsp_init_aarch64.c +++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c @@ -25,6 +25,22 @@ #include "libavutil/aarch64/cpu.h" #include "libavcodec/hevcdsp.h" +void ff_hevc_add_residual_4x4_8_neon(uint8_t *_dst, int16_t *coeffs, + ptrdiff_t stride); +void ff_hevc_add_residual_4x4_10_neon(uint8_t *_dst, int16_t *coeffs, + ptrdiff_t stride); +void ff_hevc_add_residual_8x8_8_neon(uint8_t *_dst, int16_t *coeffs, + ptrdiff_t stride); +void ff_hevc_add_residual_8x8_10_neon(uint8_t *_dst, int16_t *coeffs, + ptrdiff_t stride); +void ff_hevc_add_residual_16x16_8_neon(uint8_t *_dst, int16_t *coeffs, + ptrdiff_t stride); +void ff_hevc_add_residual_16x16_10_neon(uint8_t *_dst, int16_t *coeffs, + ptrdiff_t stride); +void ff_hevc_add_residual_32x32_8_neon(uint8_t *_dst, int16_t *coeffs, + ptrdiff_t stride); +void 
ff_hevc_add_residual_32x32_10_neon(uint8_t *_dst, int16_t *coeffs, + ptrdiff_t stride); void ff_hevc_idct_8x8_8_neon(int16_t *coeffs, int col_limit); void ff_hevc_idct_8x8_10_neon(int16_t *coeffs, int col_limit); void ff_hevc_idct_16x16_8_neon(int16_t *coeffs, int col_limit); @@ -35,10 +51,18 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth) if (!have_neon(av_get_cpu_flags())) return; if (bit_depth == 8) { + c->add_residual[0] = ff_hevc_add_residual_4x4_8_neon; + c->add_residual[1] = ff_hevc_add_residual_8x8_8_neon; + c->add_residual[2] = ff_hevc_add_residual_16x16_8_neon; + c->add_residual[3] = ff_hevc_add_residual_32x32_8_neon; c->idct[1] = ff_hevc_idct_8x8_8_neon; c->idct[2] = ff_hevc_idct_16x16_8_neon; } if (bit_depth == 10) { + c->add_residual[0] = ff_hevc_add_residual_4x4_10_neon; + c->add_residual[1] = ff_hevc_add_residual_8x8_10_neon; + c->add_residual[2] = ff_hevc_add_residual_16x16_10_neon; + c->add_residual[3] = ff_hevc_add_residual_32x32_10_neon; c->idct[1] = ff_hevc_idct_8x8_10_neon; c->idct[2] = ff_hevc_idct_16x16_10_neon; }
From patchwork Thu Feb 4 11:32:58 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Josh Dekker
X-Patchwork-Id: 25383
From: Josh Dekker
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 4 Feb 2021 12:32:58 +0100
Message-Id: <20210204113259.20112-4-josh@itanimul.li>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20210204113259.20112-1-josh@itanimul.li>
References: <20210204113259.20112-1-josh@itanimul.li>
Subject: [FFmpeg-devel] [PATCH v2 3/4] avcodec/aarch64/hevcdsp: add idct_dc NEON
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"

Signed-off-by: Josh Dekker --- libavcodec/aarch64/hevcdsp_idct_neon.S | 54 +++++++++++++++++++++++ libavcodec/aarch64/hevcdsp_init_aarch64.c | 16 +++++++ 2 files changed, 70 insertions(+) diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S index 329038a958..d3902a9e0f 100644 --- a/libavcodec/aarch64/hevcdsp_idct_neon.S +++ b/libavcodec/aarch64/hevcdsp_idct_neon.S @@ -5,6 +5,7 @@ * * Ported from arm/hevcdsp_idct_neon.S by * Copyright (c) 2020 Reimar Döffinger + * Copyright (c) 2020 Josh Dekker * * This file is part of FFmpeg.
  *
@@ -568,3 +569,56 @@ tr_16x4 secondpass_10, 20 - 10, 512, 1

 idct_16x16 8
 idct_16x16 10
+
+// void ff_hevc_idct_NxN_dc_DEPTH_neon(int16_t *coeffs)
+.macro idct_dc size bitdepth
+function ff_hevc_idct_\size\()x\size\()_dc_\bitdepth\()_neon, export=1
+        ldrsh           w1, [x0]
+        mov             w2, #(1 << (13 - \bitdepth))
+        add             w1, w1, #1
+        asr             w1, w1, #1
+        add             w1, w1, w2
+        asr             w1, w1, #(14 - \bitdepth)
+        dup             v0.8h, w1
+        dup             v1.8h, w1
+.if \size > 4
+        dup             v2.8h, w1
+        dup             v3.8h, w1
+.if \size > 16 /* dc 32x32 */
+        mov             x2, #4
+1:
+        subs            x2, x2, #1
+.endif
+        add             x12, x0, #64
+        mov             x13, #128
+.if \size > 8 /* dc 16x16 */
+        st1             {v0.8h-v3.8h}, [ x0], x13
+        st1             {v0.8h-v3.8h}, [x12], x13
+        st1             {v0.8h-v3.8h}, [ x0], x13
+        st1             {v0.8h-v3.8h}, [x12], x13
+        st1             {v0.8h-v3.8h}, [ x0], x13
+        st1             {v0.8h-v3.8h}, [x12], x13
+.endif /* dc 8x8 */
+        st1             {v0.8h-v3.8h}, [ x0], x13
+        st1             {v0.8h-v3.8h}, [x12], x13
+.if \size > 16 /* dc 32x32 */
+        bne             1b
+.endif
+.else /* dc 4x4 */
+        st1             {v0.8h-v1.8h}, [x0]
+.endif
+        ret
+endfunc
+.endm
+
+idct_dc 4 8
+idct_dc 4 10
+
+idct_dc 8 8
+idct_dc 8 10
+
+idct_dc 16 8
+idct_dc 16 10
+
+idct_dc 32 8
+idct_dc 32 10
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 4c29daa6d5..fe111bd1ac 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -45,6 +45,14 @@ void ff_hevc_idct_8x8_8_neon(int16_t *coeffs, int col_limit);
 void ff_hevc_idct_8x8_10_neon(int16_t *coeffs, int col_limit);
 void ff_hevc_idct_16x16_8_neon(int16_t *coeffs, int col_limit);
 void ff_hevc_idct_16x16_10_neon(int16_t *coeffs, int col_limit);
+void ff_hevc_idct_4x4_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_8x8_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_16x16_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_32x32_dc_8_neon(int16_t *coeffs);
+void ff_hevc_idct_4x4_dc_10_neon(int16_t *coeffs);
+void ff_hevc_idct_8x8_dc_10_neon(int16_t *coeffs);
+void ff_hevc_idct_16x16_dc_10_neon(int16_t *coeffs);
+void ff_hevc_idct_32x32_dc_10_neon(int16_t *coeffs);

 av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 {
@@ -57,6 +65,10 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         c->add_residual[3] = ff_hevc_add_residual_32x32_8_neon;
         c->idct[1] = ff_hevc_idct_8x8_8_neon;
         c->idct[2] = ff_hevc_idct_16x16_8_neon;
+        c->idct_dc[0] = ff_hevc_idct_4x4_dc_8_neon;
+        c->idct_dc[1] = ff_hevc_idct_8x8_dc_8_neon;
+        c->idct_dc[2] = ff_hevc_idct_16x16_dc_8_neon;
+        c->idct_dc[3] = ff_hevc_idct_32x32_dc_8_neon;
     }
     if (bit_depth == 10) {
         c->add_residual[0] = ff_hevc_add_residual_4x4_10_neon;
@@ -65,5 +77,9 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         c->add_residual[3] = ff_hevc_add_residual_32x32_10_neon;
         c->idct[1] = ff_hevc_idct_8x8_10_neon;
         c->idct[2] = ff_hevc_idct_16x16_10_neon;
+        c->idct_dc[0] = ff_hevc_idct_4x4_dc_10_neon;
+        c->idct_dc[1] = ff_hevc_idct_8x8_dc_10_neon;
+        c->idct_dc[2] = ff_hevc_idct_16x16_dc_10_neon;
+        c->idct_dc[3] = ff_hevc_idct_32x32_dc_10_neon;
     }
 }

From patchwork Thu Feb 4 11:32:59 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Josh Dekker
X-Patchwork-Id: 25386
From: Josh Dekker
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 4 Feb 2021 12:32:59 +0100
Message-Id: <20210204113259.20112-5-josh@itanimul.li>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
In-Reply-To: <20210204113259.20112-1-josh@itanimul.li>
References: <20210204113259.20112-1-josh@itanimul.li>
Subject: [FFmpeg-devel] [PATCH v2 4/4] avcodec/aarch64/hevcdsp: add sao_band NEON

Only handles 8-bit 8x8 blocks: the inner loop does not advance the
source and destination pointers, so widths other than 8 are not
supported.

Signed-off-by: Josh Dekker
---
 libavcodec/aarch64/Makefile               |  3 +-
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  7 ++
 libavcodec/aarch64/hevcdsp_sao_neon.S     | 87 +++++++++++++++++++++++
 3 files changed, 96 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/aarch64/hevcdsp_sao_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 2ea1d74a38..954461f81d 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -62,4 +62,5 @@ NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \
                                    aarch64/vp9mc_16bpp_neon.o \
                                    aarch64/vp9mc_neon.o
 NEON-OBJS-$(CONFIG_HEVC_DECODER) += aarch64/hevcdsp_idct_neon.o \
-                                    aarch64/hevcdsp_init_aarch64.o
+                                    aarch64/hevcdsp_init_aarch64.o \
+                                    aarch64/hevcdsp_sao_neon.o
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index fe111bd1ac..c785e46f79 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -53,6 +53,12 @@ void ff_hevc_idct_4x4_dc_10_neon(int16_t *coeffs);
 void ff_hevc_idct_8x8_dc_10_neon(int16_t *coeffs);
 void ff_hevc_idct_16x16_dc_10_neon(int16_t *coeffs);
 void ff_hevc_idct_32x32_dc_10_neon(int16_t *coeffs);
+void ff_hevc_sao_band_filter_8x8_8_neon(uint8_t *_dst, uint8_t *_src,
+                                  ptrdiff_t stride_dst, ptrdiff_t stride_src,
+                                  int16_t *sao_offset_val, int sao_left_class,
+                                  int width, int height);
+
+

 av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 {
@@ -69,6 +75,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         c->idct_dc[1] = ff_hevc_idct_8x8_dc_8_neon;
         c->idct_dc[2] = ff_hevc_idct_16x16_dc_8_neon;
         c->idct_dc[3] = ff_hevc_idct_32x32_dc_8_neon;
+        c->sao_band_filter[0] = ff_hevc_sao_band_filter_8x8_8_neon;
     }
     if (bit_depth == 10) {
         c->add_residual[0] = ff_hevc_add_residual_4x4_10_neon;
diff --git a/libavcodec/aarch64/hevcdsp_sao_neon.S b/libavcodec/aarch64/hevcdsp_sao_neon.S
new file mode 100644
index 0000000000..f142c1e8c2
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_sao_neon.S
@@ -0,0 +1,87 @@
+/* -*-arm64-*-
+ * vim: syntax=arm64asm
+ *
+ * AArch64 NEON optimised SAO functions for HEVC decoding
+ *
+ * Copyright (c) 2020 Josh Dekker
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+// void sao_band_filter(uint8_t *_dst, uint8_t *_src,
+//                      ptrdiff_t stride_dst, ptrdiff_t stride_src,
+//                      int16_t *sao_offset_val, int sao_left_class,
+//                      int width, int height)
+function ff_hevc_sao_band_filter_8x8_8_neon, export=1
+        sub             sp,  sp,  #64
+        stp             xzr, xzr, [sp]
+        stp             xzr, xzr, [sp, #16]
+        stp             xzr, xzr, [sp, #32]
+        stp             xzr, xzr, [sp, #48]
+        mov             w8,  #4
+0:
+        ldrsh           x9,  [x4, x8, lsl #1] // x9 = sao_offset_val[k+1]
+        subs            w8,  w8,  #1
+        add             w10, w8,  w5 // w10 = k + sao_left_class
+        and             w10, w10, #0x1F
+        strh            w9,  [sp, x10, lsl #1]
+        bne             0b
+        ld1             {v16.16b-v19.16b}, [sp], #64
+        movi            v20.8h, #1
+1:      // beginning of line
+        mov             w8,  w6
+2:
+        // Simple layout for accessing 16bit values
+        // with 8bit LUT.
+        //
+        //   00  01  02  03  04  05  06  07
+        // +----------------------------------->
+        // |xDE#xAD|xCA#xFE|xBE#xEF|xFE#xED|....
+        // +----------------------------------->
+        //    i-0     i-1     i-2     i-3
+        // dst[x] = av_clip_pixel(src[x] + offset_table[src[x] >> shift]);
+        ld1             {v2.8b}, [x1]
+        // load src[x]
+        uxtl            v0.8h, v2.8b
+        // >> shift
+        ushr            v2.8h, v0.8h, #3 // shift = BIT_DEPTH - 5
+        // band * 2: byte index of the low byte of the 16-bit entry
+        shl             v1.8h, v2.8h, #1
+        // + 1: byte index of the high byte of the 16-bit entry
+        add             v3.8h, v1.8h, v20.8h
+        // shift insert index to upper byte
+        sli             v1.8h, v3.8h, #8
+        // table lookup
+        tbx             v2.16b, {v16.16b-v19.16b}, v1.16b
+        // src[x] + table
+        add             v1.8h, v0.8h, v2.8h
+        // clip + narrow
+        sqxtun          v4.8b, v1.8h
+        // store
+        st1             {v4.8b}, [x0]
+        // done 8 pixels
+        subs            w8,  w8,  #8
+        bne             2b
+        // finished line
+        subs            w7,  w7,  #1
+        add             x0,  x0,  x2 // dst += stride_dst
+        add             x1,  x1,  x3 // src += stride_src
+        bne             1b
+        ret
+endfunc
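
For comparison with the NEON routine above, here is a minimal scalar C sketch of what the 8-bit band filter appears to compute. It assumes the formula quoted in the assembly comment (dst[x] = av_clip_pixel(src[x] + offset_table[src[x] >> shift])) with shift = BIT_DEPTH - 5, and a 32-entry offset table filled from sao_offset_val[1..4] starting at sao_left_class. The names clip_u8, build_offset_table, sao_band_filter_ref, lookup_via_bytes and sao_band_example are hypothetical and not part of FFmpeg. lookup_via_bytes only illustrates how a 16-bit table entry can be fetched through two byte indices (2*band and 2*band + 1), which is the idea behind the tbx index construction; it is not a model of the instruction itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the 8-bit pixel range (the role av_clip_pixel plays at 8 bits). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scatter the four band offsets into a 32-entry table, wrapping at band 31. */
static void build_offset_table(int16_t table[32], const int16_t *sao_offset_val,
                               int sao_left_class)
{
    for (int i = 0; i < 32; i++)
        table[i] = 0;
    for (int k = 0; k < 4; k++)
        table[(k + sao_left_class) & 31] = sao_offset_val[k + 1];
}

/* Scalar model of the 8-bit band filter; the name is hypothetical. */
static void sao_band_filter_ref(uint8_t *dst, const uint8_t *src,
                                ptrdiff_t stride_dst, ptrdiff_t stride_src,
                                const int16_t *sao_offset_val,
                                int sao_left_class, int width, int height)
{
    int16_t table[32];
    const int shift = 8 - 5; /* BIT_DEPTH - 5 for 8-bit content */

    build_offset_table(table, sao_offset_val, sao_left_class);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = clip_u8(src[x] + table[src[x] >> shift]);
        dst += stride_dst;
        src += stride_src;
    }
}

/* Fetch one table entry through byte indices: low byte at 2*band,
 * high byte at 2*band + 1 (little-endian byte layout built explicitly). */
static int16_t lookup_via_bytes(const int16_t table[32], uint8_t pixel)
{
    uint8_t bytes[64];
    for (int i = 0; i < 32; i++) {
        uint16_t u = (uint16_t)table[i];
        bytes[2 * i]     = (uint8_t)(u & 0xFF);
        bytes[2 * i + 1] = (uint8_t)(u >> 8);
    }
    unsigned lo = (unsigned)(pixel >> 3) << 1;
    int v = bytes[lo] | (bytes[lo + 1] << 8);
    if (v >= 0x8000)
        v -= 0x10000; /* reinterpret as signed 16-bit */
    return (int16_t)v;
}

/* Usage example: with sao_left_class = 30, bands 30, 31, 0 and 1
 * receive the four offsets. */
static void sao_band_example(void)
{
    const int16_t offsets[5] = { 0, 20, -3, 100, -100 };
    const uint8_t src[8] = { 245, 250, 5, 10, 200, 0, 15, 255 };
    uint8_t dst[8];

    sao_band_filter_ref(dst, src, 8, 8, offsets, 30, 8, 1);
    assert(dst[0] == 255); /* 245 + 20 clips to 255 */
    assert(dst[3] == 0);   /* 10 - 100 clips to 0 */
    assert(dst[4] == 200); /* band 25 has no offset */
}
```

Unlike the NEON version, this sketch walks every x in the row, so it is not limited to a width of 8; it may be useful as a comparison point if the assembly is later extended to wider blocks.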