From patchwork Tue May 24 11:38:03 2022
X-Patchwork-Submitter: "J. Dekker"
X-Patchwork-Id: 35906
From: "J. Dekker"
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 24 May 2022 13:38:03 +0200
Message-Id: <20220524113803.9642-1-jdek@itanimul.li>
Subject: [FFmpeg-devel] [PATCH] lavc/aarch64: add hevc horizontal qpel/uni/bi

checkasm --benchmark on Ampere Altra (Neoverse N1):
put_hevc_qpel_bi_h4_8_c: 173.7
put_hevc_qpel_bi_h4_8_neon: 77.0
put_hevc_qpel_bi_h6_8_c: 385.7
put_hevc_qpel_bi_h6_8_neon: 125.7
put_hevc_qpel_bi_h8_8_c: 680.7
put_hevc_qpel_bi_h8_8_neon: 137.5
put_hevc_qpel_bi_h12_8_c: 1480.0
put_hevc_qpel_bi_h12_8_neon: 438.5
put_hevc_qpel_bi_h16_8_c: 2663.2
put_hevc_qpel_bi_h16_8_neon: 561.5
put_hevc_qpel_bi_h24_8_c: 6039.0
put_hevc_qpel_bi_h24_8_neon: 1717.5
put_hevc_qpel_bi_h32_8_c: 11104.2
put_hevc_qpel_bi_h32_8_neon: 2222.0
put_hevc_qpel_bi_h48_8_c: 25175.2
put_hevc_qpel_bi_h48_8_neon: 4983.7
put_hevc_qpel_bi_h64_8_c: 42806.5
put_hevc_qpel_bi_h64_8_neon: 8848.5
put_hevc_qpel_h4_8_c: 149.7
put_hevc_qpel_h4_8_neon: 68.2
put_hevc_qpel_h6_8_c: 318.5
put_hevc_qpel_h6_8_neon: 105.2
put_hevc_qpel_h8_8_c: 577.0
put_hevc_qpel_h8_8_neon: 133.2
put_hevc_qpel_h12_8_c: 1276.0
put_hevc_qpel_h12_8_neon: 394.5
put_hevc_qpel_h16_8_c: 2278.2
put_hevc_qpel_h16_8_neon: 517.5
put_hevc_qpel_h24_8_c: 5081.7
put_hevc_qpel_h24_8_neon: 1546.5
put_hevc_qpel_h32_8_c: 9081.0
put_hevc_qpel_h32_8_neon: 2054.0
put_hevc_qpel_h48_8_c: 20280.7
put_hevc_qpel_h48_8_neon: 4615.5
put_hevc_qpel_h64_8_c: 36042.0
put_hevc_qpel_h64_8_neon: 8197.5
put_hevc_qpel_uni_h4_8_c: 165.5
put_hevc_qpel_uni_h4_8_neon: 73.5
put_hevc_qpel_uni_h6_8_c: 366.5
put_hevc_qpel_uni_h6_8_neon: 118.5
put_hevc_qpel_uni_h8_8_c: 661.7
put_hevc_qpel_uni_h8_8_neon: 138.2
put_hevc_qpel_uni_h12_8_c: 1440.5
put_hevc_qpel_uni_h12_8_neon: 399.5
put_hevc_qpel_uni_h16_8_c: 2489.0
put_hevc_qpel_uni_h16_8_neon: 532.2
put_hevc_qpel_uni_h24_8_c: 5896.5
put_hevc_qpel_uni_h24_8_neon: 1558.5
put_hevc_qpel_uni_h32_8_c: 10675.5
put_hevc_qpel_uni_h32_8_neon: 2092.2
put_hevc_qpel_uni_h48_8_c: 24103.0
put_hevc_qpel_uni_h48_8_neon: 4680.2
put_hevc_qpel_uni_h64_8_c: 42789.2
put_hevc_qpel_uni_h64_8_neon: 8330.0

Signed-off-by: J. Dekker
---
 libavcodec/aarch64/Makefile               |   1 +
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  43 +-
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 520 ++++++++++++++++++++++
 3 files changed, 563 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/aarch64/hevcdsp_qpel_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index c8935f205e..2f95649c66 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -65,4 +65,5 @@ NEON-OBJS-$(CONFIG_VP9_DECODER)         += aarch64/vp9itxfm_16bpp_neon.o       \
                                            aarch64/vp9mc_neon.o
 NEON-OBJS-$(CONFIG_HEVC_DECODER)        += aarch64/hevcdsp_idct_neon.o         \
                                            aarch64/hevcdsp_init_aarch64.o      \
+                                           aarch64/hevcdsp_qpel_neon.o         \
                                            aarch64/hevcdsp_sao_neon.o
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 1e40be740c..ca2cb7cf97 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -58,7 +58,21 @@ void ff_hevc_sao_band_filter_8x8_8_neon(uint8_t *_dst, uint8_t *_src,
                                   int16_t *sao_offset_val, int sao_left_class,
                                   int width, int height);
-
+void ff_hevc_put_hevc_qpel_h4_8_neon(int16_t *dst, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_h6_8_neon(int16_t *dst, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_h8_8_neon(int16_t *dst, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_h12_8_neon(int16_t *dst, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_h16_8_neon(int16_t *dst, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_uni_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_uni_h6_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_uni_h8_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_uni_h12_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_uni_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_bi_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int16_t *src2, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_bi_h6_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int16_t *src2, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_bi_h8_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int16_t *src2, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_bi_h12_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int16_t *src2, int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_bi_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, int16_t *src2, int height, intptr_t mx, intptr_t my, int width);
 
 av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 {
@@ -80,6 +94,33 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         // for the current size, but if enabled for bigger sizes, the cases
         // of non-multiple of 8 seem to arise.
 //        c->sao_band_filter[0]          = ff_hevc_sao_band_filter_8x8_8_neon;
+        c->put_hevc_qpel[1][0][1]      = ff_hevc_put_hevc_qpel_h4_8_neon;
+        c->put_hevc_qpel[2][0][1]      = ff_hevc_put_hevc_qpel_h6_8_neon;
+        c->put_hevc_qpel[3][0][1]      = ff_hevc_put_hevc_qpel_h8_8_neon;
+        c->put_hevc_qpel[4][0][1]      =
+        c->put_hevc_qpel[6][0][1]      = ff_hevc_put_hevc_qpel_h12_8_neon;
+        c->put_hevc_qpel[5][0][1]      =
+        c->put_hevc_qpel[7][0][1]      =
+        c->put_hevc_qpel[8][0][1]      =
+        c->put_hevc_qpel[9][0][1]      = ff_hevc_put_hevc_qpel_h16_8_neon;
+        c->put_hevc_qpel_uni[1][0][1]  = ff_hevc_put_hevc_qpel_uni_h4_8_neon;
+        c->put_hevc_qpel_uni[2][0][1]  = ff_hevc_put_hevc_qpel_uni_h6_8_neon;
+        c->put_hevc_qpel_uni[3][0][1]  = ff_hevc_put_hevc_qpel_uni_h8_8_neon;
+        c->put_hevc_qpel_uni[4][0][1]  =
+        c->put_hevc_qpel_uni[6][0][1]  = ff_hevc_put_hevc_qpel_uni_h12_8_neon;
+        c->put_hevc_qpel_uni[5][0][1]  =
+        c->put_hevc_qpel_uni[7][0][1]  =
+        c->put_hevc_qpel_uni[8][0][1]  =
+        c->put_hevc_qpel_uni[9][0][1]  = ff_hevc_put_hevc_qpel_uni_h16_8_neon;
+        c->put_hevc_qpel_bi[1][0][1]   = ff_hevc_put_hevc_qpel_bi_h4_8_neon;
+        c->put_hevc_qpel_bi[2][0][1]   = ff_hevc_put_hevc_qpel_bi_h6_8_neon;
+        c->put_hevc_qpel_bi[3][0][1]   = ff_hevc_put_hevc_qpel_bi_h8_8_neon;
+        c->put_hevc_qpel_bi[4][0][1]   =
+        c->put_hevc_qpel_bi[6][0][1]   = ff_hevc_put_hevc_qpel_bi_h12_8_neon;
+        c->put_hevc_qpel_bi[5][0][1]   =
+        c->put_hevc_qpel_bi[7][0][1]   =
+        c->put_hevc_qpel_bi[8][0][1]   =
+        c->put_hevc_qpel_bi[9][0][1]   = ff_hevc_put_hevc_qpel_bi_h16_8_neon;
     }
     if (bit_depth == 10) {
         c->add_residual[0]             = ff_hevc_add_residual_4x4_10_neon;
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
new file mode 100644
index 0000000000..bbaa32a9d9
--- /dev/null
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -0,0 +1,520 @@
+/* -*-arm64-*-
+ * vim: syntax=arm64asm
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+#define MAX_PB_SIZE 64
+
+const qpel_filters, align=4
+        .byte            0,  0,  0,  0,  0,  0,  0,  0
+        .byte           -1,  4,-10, 58, 17, -5,  1,  0
+        .byte           -1,  4,-11, 40, 40,-11,  4, -1
+        .byte            0,  1, -5, 17, 58,-10,  4, -1
+endconst
+
+.macro load_qpel_filter m
+        movrel          x15, qpel_filters
+        add             x15, x15, \m, lsl #3
+        ld1             {v0.8b}, [x15]
+        sxtl            v0.8h, v0.8b
+.endm
+
+// void put_hevc_qpel_h(int16_t *dst,
+//                      uint8_t *_src, ptrdiff_t _srcstride,
+//                      int height, intptr_t mx, intptr_t my, int width)
+
+// void put_hevc_qpel_uni_h(uint8_t *_dst, ptrdiff_t _dststride,
+//                          uint8_t *_src, ptrdiff_t _srcstride,
+//                          int height, intptr_t mx, intptr_t my, int width)
+
+// void put_hevc_qpel_bi_h(uint8_t *_dst, ptrdiff_t _dststride,
+//                         uint8_t *_src, ptrdiff_t _srcstride,
+//                         int16_t *src2, int height, intptr_t mx,
+//                         intptr_t my, int width)
+
+.macro put_hevc type
+function ff_hevc_put_hevc_\type\()_h4_8_neon, export=1
+.ifc \type, qpel
+        load_qpel_filter x4
+        lsl             x10, x2, #1                     // src stride * 2
+        sub             x13, x1, #3                     // src1 = src - 3
+        mov             x15, #(MAX_PB_SIZE << 2)        // dst stride
+        add             x14, x13, x2                    // src2 = src1 + src stride
+        add             x17, x0, #(MAX_PB_SIZE << 1)    // dst2 = dst1 + 64 * 2
+.else
+.ifc \type, qpel_bi
+        load_qpel_filter x6
+        mov             x6, #(MAX_PB_SIZE << 2)         // rsrc stride << 1
+        add             x7, x4, #(MAX_PB_SIZE << 1)     // rsrc2
+.else
+        load_qpel_filter x5
+.endif
+        lsl             x10, x3, #1                     // src stride * 2
+        sub             x13, x2, #3                     // src1 = src - 3
+        lsl             x15, x1, #1                     // dst stride * 2
+        add             x14, x13, x3                    // src2 = src1 + src stride
+        add             x17, x0, x1                     // dst2 = dst1 + dst stride
+.endif
+0:      ld1             {v16.8b, v17.8b}, [x13], x10
+        ld1             {v18.8b, v19.8b}, [x14], x10
+.ifc \type, qpel_bi
+        ld1             {v25.8h}, [x4], x6
+        ld1             {v26.8h}, [x7], x6
+.endif
+        uxtl            v16.8h, v16.8b
+        uxtl            v17.8h, v17.8b
+        uxtl            v18.8h, v18.8b
+        uxtl            v19.8h, v19.8b
+
+        mul             v23.8h, v16.8h, v0.h[0]
+        mul             v24.8h, v18.8h, v0.h[0]
+
+.irpc i, 1234567
+        ext             v20.16b, v16.16b, v17.16b, #(2*\i)
+        ext             v21.16b, v18.16b, v19.16b, #(2*\i)
+        mla             v23.8h, v20.8h, v0.h[\i]
+        mla             v24.8h, v21.8h, v0.h[\i]
+.endr
+
+.ifc \type, qpel
+        subs            w3, w3, #2
+        st1             {v23.4h}, [ x0], x15
+        st1             {v24.4h}, [x17], x15
+.else
+.ifc \type, qpel_bi
+        subs            w5, w5, #2
+        sqadd           v23.8h, v23.8h, v25.8h
+        sqadd           v24.8h, v24.8h, v26.8h
+        sqrshrun        v23.8b, v23.8h, #7
+        sqrshrun        v24.8b, v24.8h, #7
+.else
+        subs            w4, w4, #2
+        sqrshrun        v23.8b, v23.8h, #6
+        sqrshrun        v24.8b, v24.8h, #6
+.endif
+        st1             {v23.s}[0], [ x0], x15
+        st1             {v24.s}[0], [x17], x15
+.endif
+        b.gt            0b                              // double line
+        ret
+endfunc
+
+function ff_hevc_put_hevc_\type\()_h6_8_neon, export=1
+.ifc \type, qpel
+        load_qpel_filter x4
+        lsl             x10, x2, #1                     // src stride * 2
+        sub             x13, x1, #3                     // src1 = src - 3
+        mov             x15, #(MAX_PB_SIZE * 4 - 8)     // dst stride
+        add             x14, x13, x2                    // src2 = src1 + src stride
+        add             x17, x0, #(MAX_PB_SIZE << 1)    // dst2 = dst1 + 64 * 2
+.else
+.ifc \type, qpel_bi
+        load_qpel_filter x6
+        mov             x6, #(MAX_PB_SIZE << 2)         // rsrc stride << 1
+        add             x7, x4, #(MAX_PB_SIZE << 1)     // rsrc2
+.else
+        load_qpel_filter x5
+.endif
+        lsl             x10, x3, #1                     // src stride * 2
+        sub             x13, x2, #3                     // src1 = src - 3
+        lsl             x15, x1, #1                     // dst stride * 2
+        subs            x15, x15, #4
+        add             x14, x13, x3                    // src2 = src1 + src stride
+        add             x17, x0, x1                     // dst2 = dst1 + dst stride
+.endif
+0:      ld1             {v16.8b, v17.8b}, [x13], x10
+        ld1             {v18.8b, v19.8b}, [x14], x10
+.ifc \type, qpel_bi
+        ld1             {v25.8h}, [x4], x6
+        ld1             {v26.8h}, [x7], x6
+.endif
+
+        uxtl            v16.8h, v16.8b
+        uxtl            v17.8h, v17.8b
+        uxtl            v18.8h, v18.8b
+        uxtl            v19.8h, v19.8b
+
+        mul             v23.8h, v16.8h, v0.h[0]
+        mul             v24.8h, v18.8h, v0.h[0]
+
+.irpc i, 1234567
+        ext             v20.16b, v16.16b, v17.16b, #(2*\i)
+        ext             v21.16b, v18.16b, v19.16b, #(2*\i)
+        mla             v23.8h, v20.8h, v0.h[\i]
+        mla             v24.8h, v21.8h, v0.h[\i]
+.endr
+
+.ifc \type, qpel
+        subs            w3, w3, #2
+        st1             {v23.4h}, [ x0], #8
+        st1             {v23.s}[2], [ x0], x15
+        st1             {v24.4h}, [x17], #8
+        st1             {v24.s}[2], [x17], x15
+.else
+.ifc \type, qpel_bi
+        subs            w5, w5, #2
+        sqadd           v23.8h, v23.8h, v25.8h
+        sqadd           v24.8h, v24.8h, v26.8h
+        sqrshrun        v23.8b, v23.8h, #7
+        sqrshrun        v24.8b, v24.8h, #7
+.else
+        subs            w4, w4, #2
+        sqrshrun        v23.8b, v23.8h, #6
+        sqrshrun        v24.8b, v24.8h, #6
+.endif
+        st1             {v23.s}[0], [ x0], #4
+        st1             {v23.h}[2], [ x0], x15
+        st1             {v24.s}[0], [x17], #4
+        st1             {v24.h}[2], [x17], x15
+.endif
+        b.gt            0b                              // double line
+        ret
+endfunc
+
+function ff_hevc_put_hevc_\type\()_h8_8_neon, export=1
+.ifc \type, qpel
+        load_qpel_filter x4
+        lsl             x10, x2, #1                     // src stride * 2
+        sub             x13, x1, #3                     // src1 = src - 3
+        mov             x15, #(MAX_PB_SIZE << 2)        // dst stride
+        add             x14, x13, x2                    // src2 = src1 + src stride
+        add             x17, x0, #(MAX_PB_SIZE << 1)    // dst2 = dst1 + 64 * 2
+.else
+.ifc \type, qpel_bi
+        load_qpel_filter x6
+        mov             x6, #(MAX_PB_SIZE << 2)         // rsrc stride << 1
+        add             x7, x4, #(MAX_PB_SIZE << 1)     // rsrc2
+.else
+        load_qpel_filter x5
+.endif
+        lsl             x10, x3, #1                     // src stride * 2
+        sub             x13, x2, #3                     // src1 = src - 3
+        lsl             x15, x1, #1                     // dst stride * 2
+        add             x14, x13, x3                    // src2 = src1 + src stride
+        add             x17, x0, x1                     // dst2 = dst1 + dst stride
+.endif
+0:      ld1             {v16.8b, v17.8b}, [x13], x10
+        ld1             {v18.8b, v19.8b}, [x14], x10
+.ifc \type, qpel_bi
+        ld1             {v25.8h}, [x4], x6
+        ld1             {v26.8h}, [x7], x6
+.endif
+
+        uxtl            v16.8h, v16.8b
+        uxtl            v17.8h, v17.8b
+        uxtl            v18.8h, v18.8b
+        uxtl            v19.8h, v19.8b
+
+        mul             v23.8h, v16.8h, v0.h[0]
+        mul             v24.8h, v18.8h, v0.h[0]
+
+.irpc i, 1234567
+        ext             v20.16b, v16.16b, v17.16b, #(2*\i)
+        ext             v21.16b, v18.16b, v19.16b, #(2*\i)
+        mla             v23.8h, v20.8h, v0.h[\i]
+        mla             v24.8h, v21.8h, v0.h[\i]
+.endr
+
+.ifc \type, qpel
+        subs            w3, w3, #2
+        st1             {v23.8h}, [ x0], x15
+        st1             {v24.8h}, [x17], x15
+.else
+.ifc \type, qpel_bi
+        subs            w5, w5, #2
+        sqadd           v23.8h, v23.8h, v25.8h
+        sqadd           v24.8h, v24.8h, v26.8h
+        sqrshrun        v23.8b, v23.8h, #7
+        sqrshrun        v24.8b, v24.8h, #7
+.else
+        subs            w4, w4, #2
+        sqrshrun        v23.8b, v23.8h, #6
+        sqrshrun        v24.8b, v24.8h, #6
+.endif
+        st1             {v23.8b}, [ x0], x15
+        st1             {v24.8b}, [x17], x15
+.endif
+        b.gt            0b                              // double line
+        ret
+endfunc
+
+function ff_hevc_put_hevc_\type\()_h12_8_neon, export=1
+.ifc \type, qpel
+        load_qpel_filter x4
+        // blocks
+        mov             w8, #0xAAAB
+        movk            w8, #0x2AAA, lsl #16
+        smull           x15, w8, w6
+        asr             x15, x15, #33
+        sub             w6, w15, w6, asr #31
+        // fast divide by 12, thank gcc for this one...
+
+        // src constants
+        lsl             x10, x2, #1                     // src stride * 2
+        sub             x1, x1, #3                      // src = src - 3
+
+        // dst constants
+        mov             x15, #(MAX_PB_SIZE * 4 - 16)    // dst stride
+
+        // loop
+        mov             x8, xzr                         // hblock
+0:      mov             w7, w3
+
+        // 12 * hblock
+        lsl             x12, x8, #3
+        add             x12, x12, x8, lsl #2
+
+        add             x13, x1, x12                    // src1 = src0 + 12 * hblock
+        add             x14, x13, x2                    // src2 = src1 + src stride
+
+        add             x16, x0, x12, lsl #1            // dst1 = dst0 + 12 * hblock * 2
+        add             x17, x16, #(MAX_PB_SIZE << 1)   // dst2 = dst1 + dst stride
+.else
+        // blocks
+.ifc \type, qpel_bi
+        ldrh            w7, [sp]
+        load_qpel_filter x6
+.else
+        load_qpel_filter x5
+.endif
+        mov             w9, #0xAAAB
+        movk            w9, #0x2AAA, lsl #16
+        smull           x15, w9, w7
+        asr             x15, x15, #33
+        sub             w6, w15, w7, asr #31
+
+        // src constants
+        lsl             x10, x3, #1                     // src stride * 2
+        sub             x2, x2, #3                      // src = src - 3
+
+        // dst constants
+        lsl             x15, x1, #1                     // dst stride * 2
+.ifc \type, qpel_bi
+        mov             x9, #(MAX_PB_SIZE << 2)
+.endif
+        sub             x15, x15, #8
+        // loop
+        mov             x8, xzr                         // hblock
+0:
+.ifc \type, qpel_bi // height
+        mov             w7, w5
+.else
+        mov             w7, w4
+.endif
+        // 12 * hblock
+        lsl             x12, x8, #3
+        add             x12, x12, x8, lsl #2
+
+        add             x13, x2, x12                    // src1 = src0 + 12 * hblock
+        add             x14, x13, x3                    // src2 = src1 + src stride
+
+        add             x16, x0, x12                    // dst1 = dst0 + 12 * hblock
+        add             x17, x16, x1                    // dst2 = dst1 + dst stride
+.ifc \type, qpel_bi
+        add             x11, x4, x12, lsl #1            // rsrc1 = rsrc0 + 12 * hblock * 2
+        add             x12, x11, #(MAX_PB_SIZE << 1)   // rsrc2 = rsrc1 + rsrc stride
+.endif
+.endif
+1:      ld1             {v16.8b-v18.8b}, [x13], x10
+        ld1             {v19.8b-v21.8b}, [x14], x10
+
+        uxtl            v16.8h, v16.8b
+        uxtl            v17.8h, v17.8b
+        uxtl            v18.8h, v18.8b
+
+        uxtl            v19.8h, v19.8b
+        uxtl            v20.8h, v20.8b
+        uxtl            v21.8h, v21.8b
+
+        mul             v26.8h, v16.8h, v0.h[0]
+        mul             v27.8h, v17.8h, v0.h[0]
+        mul             v28.8h, v19.8h, v0.h[0]
+        mul             v29.8h, v20.8h, v0.h[0]
+
+.irpc i, 1234567
+        ext             v22.16b, v16.16b, v17.16b, #(2*\i)
+        ext             v23.16b, v17.16b, v18.16b, #(2*\i)
+
+        ext             v24.16b, v19.16b, v20.16b, #(2*\i)
+        ext             v25.16b, v20.16b, v21.16b, #(2*\i)
+
+        mla             v26.8h, v22.8h, v0.h[\i]
+        mla             v27.8h, v23.8h, v0.h[\i]
+
+        mla             v28.8h, v24.8h, v0.h[\i]
+        mla             v29.8h, v25.8h, v0.h[\i]
+.endr
+        subs            w7, w7, #2
+.ifc \type, qpel
+        st1             {v26.8h}, [x16], #16
+        st1             {v27.4h}, [x16], x15
+        st1             {v28.8h}, [x17], #16
+        st1             {v29.4h}, [x17], x15
+.else
+.ifc \type, qpel_bi
+        ld1             {v16.8h, v17.8h}, [x11], x9
+        ld1             {v18.8h, v19.8h}, [x12], x9
+        sqadd           v26.8h, v26.8h, v16.8h
+        sqadd           v27.8h, v27.8h, v17.8h
+        sqadd           v28.8h, v28.8h, v18.8h
+        sqadd           v29.8h, v29.8h, v19.8h
+        sqrshrun        v26.8b, v26.8h, #7
+        sqrshrun        v27.8b, v27.8h, #7
+        sqrshrun        v28.8b, v28.8h, #7
+        sqrshrun        v29.8b, v29.8h, #7
+.else
+        sqrshrun        v26.8b, v26.8h, #6
+        sqrshrun        v27.8b, v27.8h, #6
+        sqrshrun        v28.8b, v28.8h, #6
+        sqrshrun        v29.8b, v29.8h, #6
+.endif
+        st1             {v26.8b}, [x16], #8
+        st1             {v27.s}[0], [x16], x15
+        st1             {v28.8b}, [x17], #8
+        st1             {v29.s}[0], [x17], x15
+.endif
+        b.gt            1b                              // double line
+        add             x8, x8, #1
+        cmp             x8, x6
+        b.lt            0b                              // line of blocks
+        ret
+endfunc
+
+function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
+        mov             x8, xzr                         // hblock
+.ifc \type, qpel
+        load_qpel_filter x4
+        // blocks
+        lsr             w6, w6, #4                      // horizontal block count
+        // src constants
+        lsl             x10, x2, #1                     // src stride * 2
+        sub             x1, x1, #3                      // src = src - 3
+        // dst constants
+        mov             x15, #(MAX_PB_SIZE * 4 - 16)    // dst stride
+        // loop
+0:      mov             w7, w3                          // reset height
+
+        add             x13, x1, x8, lsl #4
+        add             x14, x13, x2                    // src2 = src1 + src stride
+
+        add             x16, x0, x8, lsl #5             // dst1 = dst0 + hblock * 16 * 2
+        add             x17, x16, #(MAX_PB_SIZE << 1)   // dst2 = dst1 + 64 * 2
+.else
+.ifc \type, qpel_bi
+        mov             x9, #(MAX_PB_SIZE << 2)
+        ldrh            w7, [sp]
+        load_qpel_filter x6
+.else
+        load_qpel_filter x5
+.endif
+        // blocks
+        lsr             w6, w7, #4                      // horizontal block count
+        // src constants
+        lsl             x10, x3, #1                     // src stride * 2
+        sub             x2, x2, #3                      // src = src - 3
+        // dst constants
+        lsl             x15, x1, #1                     // dst stride * 2
+        sub             x15, x15, #8
+        // loop
+0:
+.ifc \type, qpel_bi // height
+        mov             w7, w5
+.else
+        mov             w7, w4
+.endif
+
+        add             x13, x2, x8, lsl #4             // src1 = src0 + hblock * 16
+        add             x14, x13, x3                    // src2 = src1 + src stride
+
+        add             x16, x0, x8, lsl #4             // dst1 = dst0 + hblock * 16
+        add             x17, x16, x1                    // dst2 = dst1 + dst stride
+.ifc \type, qpel_bi
+        add             x11, x4, x8, lsl #5             // rsrc1 = rsrc0 + 16 * hblock * 2
+        add             x12, x11, #(MAX_PB_SIZE << 1)   // rsrc2 = rsrc1 + rsrc stride
+.endif
+.endif
+1:      ld1             {v16.8b-v18.8b}, [x13], x10
+        ld1             {v19.8b-v21.8b}, [x14], x10
+
+        uxtl            v16.8h, v16.8b
+        uxtl            v17.8h, v17.8b
+        uxtl            v18.8h, v18.8b
+
+        uxtl            v19.8h, v19.8b
+        uxtl            v20.8h, v20.8b
+        uxtl            v21.8h, v21.8b
+
+        mul             v26.8h, v16.8h, v0.h[0]
+        mul             v27.8h, v17.8h, v0.h[0]
+        mul             v28.8h, v19.8h, v0.h[0]
+        mul             v29.8h, v20.8h, v0.h[0]
+
+.irpc i, 1234567
+        ext             v22.16b, v16.16b, v17.16b, #(2*\i)
+        ext             v23.16b, v17.16b, v18.16b, #(2*\i)
+
+        ext             v24.16b, v19.16b, v20.16b, #(2*\i)
+        ext             v25.16b, v20.16b, v21.16b, #(2*\i)
+
+        mla             v26.8h, v22.8h, v0.h[\i]
+        mla             v27.8h, v23.8h, v0.h[\i]
+
+        mla             v28.8h, v24.8h, v0.h[\i]
+        mla             v29.8h, v25.8h, v0.h[\i]
+.endr
+        subs            w7, w7, #2
+.ifc \type, qpel
+        st1             {v26.8h}, [x16], #16
+        st1             {v27.8h}, [x16], x15
+        st1             {v28.8h}, [x17], #16
+        st1             {v29.8h}, [x17], x15
+.else
+.ifc \type, qpel_bi
+        ld1             {v16.8h, v17.8h}, [x11], x9
+        ld1             {v18.8h, v19.8h}, [x12], x9
+        sqadd           v26.8h, v26.8h, v16.8h
+        sqadd           v27.8h, v27.8h, v17.8h
+        sqadd           v28.8h, v28.8h, v18.8h
+        sqadd           v29.8h, v29.8h, v19.8h
+        sqrshrun        v26.8b, v26.8h, #7
+        sqrshrun        v27.8b, v27.8h, #7
+        sqrshrun        v28.8b, v28.8h, #7
+        sqrshrun        v29.8b, v29.8h, #7
+.else
+        sqrshrun        v26.8b, v26.8h, #6
+        sqrshrun        v27.8b, v27.8h, #6
+        sqrshrun        v28.8b, v28.8h, #6
+        sqrshrun        v29.8b, v29.8h, #6
+.endif
+        st1             {v26.8b}, [x16], #8
+        st1             {v27.8b}, [x16], x15
+        st1             {v28.8b}, [x17], #8
+        st1             {v29.8b}, [x17], x15
+.endif
+        b.gt            1b                              // double line
+        add             x8, x8, #1
+        cmp             x8, x6
+        b.lt            0b                              // horizontal tiling
+        ret
+endfunc
+.endm
+
+put_hevc qpel
+put_hevc qpel_uni
+put_hevc qpel_bi
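
Note for readers (not part of the patch): the three horizontal variants above differ only in how the 8-tap sum is stored. A minimal scalar sketch in C, for 8-bit input, may help when checking the NEON code by hand. This is not FFmpeg's C reference. The names ref_qpel_h, ref_qpel_uni_h, ref_qpel_bi_h, qpel_tap8, clip_u8 and div12 are hypothetical, the unused my argument is dropped, and the code assumes that right-shifting a negative int is an arithmetic shift, as on common compilers. div12 mirrors the mov/movk/smull/asr/sub sequence used for the block count in the h12 functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64

/* Same coefficients as qpel_filters in the patch; the row index is mx. */
static const int8_t qpel_filters[4][8] = {
    {  0, 0,   0,  0,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* 8-tap sum around src[x]; reads src[x - 3] .. src[x + 4]. */
static int qpel_tap8(const uint8_t *src, int x, int mx)
{
    int sum = 0;
    assert(mx >= 0 && mx < 4);
    for (int k = 0; k < 8; k++)
        sum += qpel_filters[mx][k] * src[x + k - 3];
    return sum;
}

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* qpel: unrounded 16-bit output, rows MAX_PB_SIZE elements apart. */
static void ref_qpel_h(int16_t *dst, const uint8_t *src, ptrdiff_t srcstride,
                       int height, int mx, int width)
{
    for (int y = 0; y < height; y++, src += srcstride, dst += MAX_PB_SIZE)
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)qpel_tap8(src, x, mx);
}

/* qpel_uni: rounding shift by 6, clipped to 8 bits (cf. sqrshrun #6). */
static void ref_qpel_uni_h(uint8_t *dst, ptrdiff_t dststride,
                           const uint8_t *src, ptrdiff_t srcstride,
                           int height, int mx, int width)
{
    for (int y = 0; y < height; y++, src += srcstride, dst += dststride)
        for (int x = 0; x < width; x++)
            dst[x] = clip_u8((qpel_tap8(src, x, mx) + 32) >> 6);
}

/* qpel_bi: add the 16-bit prediction src2, then rounding shift by 7
 * (cf. sqadd followed by sqrshrun #7). src2 rows are MAX_PB_SIZE apart. */
static void ref_qpel_bi_h(uint8_t *dst, ptrdiff_t dststride,
                          const uint8_t *src, ptrdiff_t srcstride,
                          const int16_t *src2, int height, int mx, int width)
{
    for (int y = 0; y < height;
         y++, src += srcstride, dst += dststride, src2 += MAX_PB_SIZE)
        for (int x = 0; x < width; x++)
            dst[x] = clip_u8((qpel_tap8(src, x, mx) + src2[x] + 64) >> 7);
}

/* Multiply-and-shift division by 12, as in the h12 functions:
 * 0x2AAAAAAB is 2^33 / 12 rounded up. */
static int div12(int32_t w)
{
    int64_t prod = (int64_t)w * 0x2AAAAAAB;  /* smull */
    return (int)(prod >> 33) - (w >> 31);    /* asr #33, then sub ..., asr #31 */
}
```

Each row of the filter table sums to 64, so a flat source of value v gives 64 * v from ref_qpel_h and v again from ref_qpel_uni_h; that makes a quick sanity check. Callers must pass a src pointer with at least 3 readable bytes before it and 4 after the last column, matching the src - 3 adjustment in the assembly.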