From patchwork Thu Feb 3 13:51:51 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "J. Dekker" X-Patchwork-Id: 34093 From: "J. 
Dekker" To: ffmpeg-devel@ffmpeg.org Date: Thu, 3 Feb 2022 14:51:51 +0100 Message-Id: <20220203135151.90166-2-jdek@itanimul.li> In-Reply-To: <20220203135151.90166-1-jdek@itanimul.li> References: <9890ed6-9e61-7078-cfc0-b8c79885f2@martin.st> <20220203135151.90166-1-jdek@itanimul.li> Subject: [FFmpeg-devel] [PATCH v2 2/2] lavc/aarch64: add hevc epel assembly Thanks: Rafal Dabrowa --- libavcodec/aarch64/Makefile | 3 +- libavcodec/aarch64/hevcdsp_epel_neon.S | 2501 +++++++++++++++++++++ libavcodec/aarch64/hevcdsp_init_aarch64.c | 52 + 3 files changed, 2555 insertions(+), 1 deletion(-) create mode 100644 libavcodec/aarch64/hevcdsp_epel_neon.S diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile index 8592692479..ebedc03bfa 100644 --- a/libavcodec/aarch64/Makefile +++ b/libavcodec/aarch64/Makefile @@ -61,7 +61,8 @@ NEON-OBJS-$(CONFIG_VP9_DECODER) += aarch64/vp9itxfm_16bpp_neon.o \ aarch64/vp9lpf_neon.o \ aarch64/vp9mc_16bpp_neon.o \ aarch64/vp9mc_neon.o -NEON-OBJS-$(CONFIG_HEVC_DECODER) += aarch64/hevcdsp_idct_neon.o \ +NEON-OBJS-$(CONFIG_HEVC_DECODER) += aarch64/hevcdsp_epel_neon.o \ + aarch64/hevcdsp_idct_neon.o \ aarch64/hevcdsp_init_aarch64.o \ aarch64/hevcdsp_qpel_neon.o \ aarch64/hevcdsp_sao_neon.o diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S new file mode 100644 index 0000000000..bbf93c3d6a --- /dev/null +++ b/libavcodec/aarch64/hevcdsp_epel_neon.S @@ -0,0 +1,2501 @@ +/* -*-arm64-*- + * vim: syntax=arm64asm + * + * This file is part of FFmpeg. 
+ * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "libavutil/aarch64/asm.S" +#define MAX_PB_SIZE 64 + +function ff_hevc_put_hevc_pel_pixels4_8_neon, export=1 + mov x7, #(MAX_PB_SIZE * 2) +1: ld1 {v0.s}[0], [x1], x2 + ushll v4.8h, v0.8b, #6 + subs w3, w3, #1 + st1 {v4.d}[0], [x0], x7 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_pixels6_8_neon, export=1 + mov x7, #(MAX_PB_SIZE * 2 - 8) +1: ld1 {v0.8b}, [x1], x2 + ushll v4.8h, v0.8b, #6 + st1 {v4.d}[0], [x0], #8 + subs w3, w3, #1 + st1 {v4.s}[2], [x0], x7 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_pixels8_8_neon, export=1 + mov x7, #(MAX_PB_SIZE * 2) +1: ld1 {v0.8b}, [x1], x2 + ushll v4.8h, v0.8b, #6 + subs w3, w3, #1 + st1 {v4.8h}, [x0], x7 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_pixels12_8_neon, export=1 + mov x7, #(MAX_PB_SIZE * 2 - 16) +1: ld1 {v0.8b, v1.8b}, [x1], x2 + ushll v4.8h, v0.8b, #6 + st1 {v4.8h}, [x0], #16 + ushll v5.8h, v1.8b, #6 + subs w3, w3, #1 + st1 {v5.d}[0], [x0], x7 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_pixels16_8_neon, export=1 + mov x7, #(MAX_PB_SIZE * 2) +1: ld1 {v0.8b, v1.8b}, [x1], x2 + ushll v4.8h, v0.8b, #6 + ushll v5.8h, v1.8b, #6 + subs w3, w3, #1 + st1 {v4.8h, v5.8h}, [x0], x7 + b.ne 1b + ret +endfunc + +function 
ff_hevc_put_hevc_pel_pixels24_8_neon, export=1 + mov x7, #(MAX_PB_SIZE * 2) +1: ld1 {v0.8b-v2.8b}, [x1], x2 + ushll v4.8h, v0.8b, #6 + ushll v5.8h, v1.8b, #6 + ushll v6.8h, v2.8b, #6 + subs w3, w3, #1 + st1 {v4.8h-v6.8h}, [x0], x7 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_pixels32_8_neon, export=1 + mov x7, #(MAX_PB_SIZE * 2) +1: ld1 {v0.8b-v3.8b}, [x1], x2 + ushll v4.8h, v0.8b, #6 + ushll v5.8h, v1.8b, #6 + ushll v6.8h, v2.8b, #6 + ushll v7.8h, v3.8b, #6 + subs w3, w3, #1 + st1 {v4.8h-v7.8h}, [x0], x7 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_pixels48_8_neon, export=1 + mov x7, #(MAX_PB_SIZE) +1: ld1 {v0.16b-v2.16b}, [x1], x2 + ushll v4.8h, v0.8b, #6 + ushll2 v5.8h, v0.16b, #6 + ushll v6.8h, v1.8b, #6 + ushll2 v7.8h, v1.16b, #6 + st1 {v4.8h-v7.8h}, [x0], #64 + ushll v16.8h, v2.8b, #6 + ushll2 v17.8h, v2.16b, #6 + subs w3, w3, #1 + st1 {v16.8h-v17.8h}, [x0], x7 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_pixels64_8_neon, export=1 +1: ld1 {v0.16b-v3.16b}, [x1], x2 + ushll v4.8h, v0.8b, #6 + ushll2 v5.8h, v0.16b, #6 + ushll v6.8h, v1.8b, #6 + ushll2 v7.8h, v1.16b, #6 + st1 {v4.8h-v7.8h}, [x0], #(MAX_PB_SIZE) + ushll v16.8h, v2.8b, #6 + ushll2 v17.8h, v2.16b, #6 + ushll v18.8h, v3.8b, #6 + ushll2 v19.8h, v3.16b, #6 + subs w3, w3, #1 + st1 {v16.8h-v19.8h}, [x0], #(MAX_PB_SIZE) + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_bi_pixels4_8_neon, export=1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld1 {v0.s}[0], [x2], x3 // src + ushll v16.8h, v0.8b, #6 + ld1 {v20.4h}, [x4], x10 // src2 + sqadd v16.8h, v16.8h, v20.8h + sqrshrun v0.8b, v16.8h, #7 + st1 {v0.s}[0], [x0], x1 + subs w5, w5, #1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_bi_pixels6_8_neon, export=1 + mov x10, #(MAX_PB_SIZE * 2 - 8) + sub x1, x1, #4 +1: ld1 {v0.8b}, [x2], x3 + ushll v16.8h, v0.8b, #6 + ld1 {v20.4h}, [x4], #8 + ld1 {v20.s}[2], [x4], x10 + sqadd v16.8h, v16.8h, v20.8h + sqrshrun v0.8b, v16.8h, #7 + st1 {v0.s}[0], [x0], #4 + st1 {v0.h}[2], 
[x0], x1 + subs w5, w5, #1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_bi_pixels8_8_neon, export=1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld1 {v0.8b}, [x2], x3 // src + ushll v16.8h, v0.8b, #6 + ld1 {v20.8h}, [x4], x10 // src2 + sqadd v16.8h, v16.8h, v20.8h + sqrshrun v0.8b, v16.8h, #7 + subs w5, w5, #1 + st1 {v0.8b}, [x0], x1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_bi_pixels12_8_neon, export=1 + mov x10, #(MAX_PB_SIZE * 2 - 16) + sub x1, x1, #8 +1: ld1 {v0.16b}, [x2], x3 + ushll v16.8h, v0.8b, #6 + ushll2 v17.8h, v0.16b, #6 + ld1 {v20.8h}, [x4], #16 + ld1 {v21.4h}, [x4], x10 + sqadd v16.8h, v16.8h, v20.8h + sqadd v17.8h, v17.8h, v21.8h + sqrshrun v0.8b, v16.8h, #7 + sqrshrun2 v0.16b, v17.8h, #7 + st1 {v0.8b}, [x0], #8 + subs w5, w5, #1 + st1 {v0.s}[2], [x0], x1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_bi_pixels16_8_neon, export=1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld1 {v0.16b}, [x2], x3 // src + ushll v16.8h, v0.8b, #6 + ushll2 v17.8h, v0.16b, #6 + ld1 {v20.8h, v21.8h}, [x4], x10 // src2 + sqadd v16.8h, v16.8h, v20.8h + sqadd v17.8h, v17.8h, v21.8h + sqrshrun v0.8b, v16.8h, #7 + sqrshrun2 v0.16b, v17.8h, #7 + subs w5, w5, #1 + st1 {v0.16b}, [x0], x1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_bi_pixels24_8_neon, export=1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld1 {v0.8b-v2.8b}, [x2], x3 // src + ushll v16.8h, v0.8b, #6 + ushll v17.8h, v1.8b, #6 + ushll v18.8h, v2.8b, #6 + ld1 {v20.8h-v22.8h}, [x4], x10 // src2 + sqadd v16.8h, v16.8h, v20.8h + sqadd v17.8h, v17.8h, v21.8h + sqadd v18.8h, v18.8h, v22.8h + sqrshrun v0.8b, v16.8h, #7 + sqrshrun v1.8b, v17.8h, #7 + sqrshrun v2.8b, v18.8h, #7 + subs w5, w5, #1 + st1 {v0.8b-v2.8b}, [x0], x1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_bi_pixels32_8_neon, export=1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld1 {v0.16b-v1.16b}, [x2], x3 // src + ushll v16.8h, v0.8b, #6 + ushll2 v17.8h, v0.16b, #6 + ushll v18.8h, v1.8b, #6 + ushll2 v19.8h, v1.16b, #6 + ld1 
{v20.8h-v23.8h}, [x4], x10 // src2 + sqadd v16.8h, v16.8h, v20.8h + sqadd v17.8h, v17.8h, v21.8h + sqadd v18.8h, v18.8h, v22.8h + sqadd v19.8h, v19.8h, v23.8h + sqrshrun v0.8b, v16.8h, #7 + sqrshrun2 v0.16b, v17.8h, #7 + sqrshrun v1.8b, v18.8h, #7 + sqrshrun2 v1.16b, v19.8h, #7 + st1 {v0.16b-v1.16b}, [x0], x1 + subs w5, w5, #1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_bi_pixels48_8_neon, export=1 + mov x10, #(MAX_PB_SIZE) +1: ld1 {v0.16b-v2.16b}, [x2], x3 // src + ushll v16.8h, v0.8b, #6 + ushll2 v17.8h, v0.16b, #6 + ushll v18.8h, v1.8b, #6 + ushll2 v19.8h, v1.16b, #6 + ushll v20.8h, v2.8b, #6 + ushll2 v21.8h, v2.16b, #6 + ld1 {v24.8h-v27.8h}, [x4], #(MAX_PB_SIZE) // src2 + sqadd v16.8h, v16.8h, v24.8h + sqadd v17.8h, v17.8h, v25.8h + sqadd v18.8h, v18.8h, v26.8h + sqadd v19.8h, v19.8h, v27.8h + ld1 {v24.8h-v25.8h}, [x4], x10 + sqadd v20.8h, v20.8h, v24.8h + sqadd v21.8h, v21.8h, v25.8h + sqrshrun v0.8b, v16.8h, #7 + sqrshrun2 v0.16b, v17.8h, #7 + sqrshrun v1.8b, v18.8h, #7 + sqrshrun2 v1.16b, v19.8h, #7 + sqrshrun v2.8b, v20.8h, #7 + sqrshrun2 v2.16b, v21.8h, #7 + subs w5, w5, #1 + st1 {v0.16b-v2.16b}, [x0], x1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_pel_bi_pixels64_8_neon, export=1 +1: ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x2], x3 // src + ushll v16.8h, v0.8b, #6 + ushll2 v17.8h, v0.16b, #6 + ushll v18.8h, v1.8b, #6 + ushll2 v19.8h, v1.16b, #6 + ushll v20.8h, v2.8b, #6 + ushll2 v21.8h, v2.16b, #6 + ushll v22.8h, v3.8b, #6 + ushll2 v23.8h, v3.16b, #6 + ld1 {v24.8h, v25.8h, v26.8h, v27.8h}, [x4], #(MAX_PB_SIZE) // src2 + sqadd v16.8h, v16.8h, v24.8h + sqadd v17.8h, v17.8h, v25.8h + sqadd v18.8h, v18.8h, v26.8h + sqadd v19.8h, v19.8h, v27.8h + ld1 {v24.8h, v25.8h, v26.8h, v27.8h}, [x4], #(MAX_PB_SIZE) + sqadd v20.8h, v20.8h, v24.8h + sqadd v21.8h, v21.8h, v25.8h + sqadd v22.8h, v22.8h, v26.8h + sqadd v23.8h, v23.8h, v27.8h + sqrshrun v0.8b, v16.8h, #7 + sqrshrun2 v0.16b, v17.8h, #7 + sqrshrun v1.8b, v18.8h, #7 + sqrshrun2 v1.16b, 
v19.8h, #7 + sqrshrun v2.8b, v20.8h, #7 + sqrshrun2 v2.16b, v21.8h, #7 + sqrshrun v3.8b, v22.8h, #7 + sqrshrun2 v3.16b, v23.8h, #7 + st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], x1 + subs w5, w5, #1 + b.ne 1b + ret +endfunc + +.Lepel_filters: + .byte 0, 0, 0, 0 + .byte -2, 58, 10, -2 + .byte -4, 54, 16, -2 + .byte -6, 46, 28, -4 + .byte -4, 36, 36, -4 + .byte -4, 28, 46, -6 + .byte -2, 16, 54, -4 + .byte -2, 10, 58, -2 + +.macro load_epel_filterb freg, xreg + adr \xreg, .Lepel_filters + add \xreg, \xreg, \freg, lsl #2 + ld4r {v0.16b, v1.16b, v2.16b, v3.16b}, [\xreg] // filter + neg v0.16b, v0.16b + neg v3.16b, v3.16b +.endm + +.macro calc_epelb dst, src0, src1, src2, src3 + umlsl \dst\().8h, \src0\().8b, v0.8b + umlal \dst\().8h, \src1\().8b, v1.8b + umlal \dst\().8h, \src2\().8b, v2.8b + umlsl \dst\().8h, \src3\().8b, v3.8b +.endm + +.macro calc_epelb2 dst, src0, src1, src2, src3 + umlsl2 \dst\().8h, \src0\().16b, v0.16b + umlal2 \dst\().8h, \src1\().16b, v1.16b + umlal2 \dst\().8h, \src2\().16b, v2.16b + umlsl2 \dst\().8h, \src3\().16b, v3.16b +.endm + +.macro load_epel_filterh freg, xreg + adr \xreg, .Lepel_filters + add \xreg, \xreg, \freg, lsl #2 + ld1 {v0.8b}, [\xreg] + sxtl v0.8h, v0.8b +.endm + +.macro calc_epelh dst, src0, src1, src2, src3 + smull \dst\().4s, \src0\().4h, v0.h[0] + smlal \dst\().4s, \src1\().4h, v0.h[1] + smlal \dst\().4s, \src2\().4h, v0.h[2] + smlal \dst\().4s, \src3\().4h, v0.h[3] + sqshrn \dst\().4h, \dst\().4s, #6 +.endm + +.macro calc_epelh2 dst, tmp, src0, src1, src2, src3 + smull2 \tmp\().4s, \src0\().8h, v0.h[0] + smlal2 \tmp\().4s, \src1\().8h, v0.h[1] + smlal2 \tmp\().4s, \src2\().8h, v0.h[2] + smlal2 \tmp\().4s, \src3\().8h, v0.h[3] + sqshrn2 \dst\().8h, \tmp\().4s, #6 +.endm + +function ff_hevc_put_hevc_epel_h4_8_neon, export=1 + load_epel_filterb x4, x5 + sub x1, x1, #1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld1 {v4.8b}, [x1], x2 + ushr v5.2d, v4.2d, #8 + ushr v6.2d, v5.2d, #8 + ushr v7.2d, v6.2d, #8 + movi v16.8h, #0 + calc_epelb 
v16, v4, v5, v6, v7 + st1 {v16.4h}, [x0], x10 + subs w3, w3, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_h6_8_neon, export=1 + load_epel_filterb x4, x5 + sub x1, x1, #1 + sub x2, x2, #8 + mov x10, #(MAX_PB_SIZE * 2 - 8) +1: ld1 {v24.8b}, [x1], #8 + ushr v26.2d, v24.2d, #8 + ushr v27.2d, v26.2d, #8 + ushr v28.2d, v27.2d, #8 + movi v16.8h, #0 + ld1 {v28.b}[5], [x1], x2 + calc_epelb v16, v24, v26, v27, v28 + st1 {v16.4h}, [x0], #8 + st1 {v16.s}[2], [x0], x10 + subs w3, w3, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_h8_8_neon, export=1 + load_epel_filterb x4, x5 + sub x1, x1, #1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld2 {v24.8b, v25.8b}, [x1], x2 + ushr v26.2d, v24.2d, #8 + ushr v27.2d, v25.2d, #8 + ushr v28.2d, v26.2d, #8 + movi v16.8h, #0 + movi v17.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + st2 {v16.4h, v17.4h}, [x0], x10 + subs w3, w3, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_h12_8_neon, export=1 + load_epel_filterb x4, x5 + sub x1, x1, #1 + mov x10, #(MAX_PB_SIZE * 2 - 16) +1: ld2 {v24.8b, v25.8b}, [x1], x2 + ushr v26.2d, v24.2d, #8 + ushr v27.2d, v25.2d, #8 + ushr v28.2d, v26.2d, #8 + movi v16.8h, #0 + movi v17.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + zip1 v18.8h, v16.8h, v17.8h + zip2 v19.8h, v16.8h, v17.8h + st1 {v18.8h}, [x0], #16 + st1 {v19.d}[0], [x0], x10 + subs w3, w3, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_h16_8_neon, export=1 + load_epel_filterb x4, x5 + sub x1, x1, #1 + sub x2, x2, #16 + mov x10, #(MAX_PB_SIZE * 2) +1: ld2 {v24.8b, v25.8b}, [x1], #16 + ld1 {v20.s}[0], [x1], x2 + ushr v26.2d, v24.2d, #8 + ushr v27.2d, v25.2d, #8 + mov v26.b[7], v20.b[0] + mov v27.b[7], v20.b[1] + ushr v28.2d, v26.2d, #8 + mov v28.b[7], v20.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + st2 {v16.8h, v17.8h}, 
[x0], x10 + subs w3, w3, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_h24_8_neon, export=1 + load_epel_filterb x4, x5 + sub x1, x1, #1 + sub x2, x2, #24 + mov x10, #(MAX_PB_SIZE * 2) +1: ld3 {v24.8b, v25.8b, v26.8b}, [x1], #24 + ld1 {v20.s}[0], [x1], x2 + ushr v27.2d, v24.2d, #8 + ushr v28.2d, v25.2d, #8 + ushr v29.2d, v26.2d, #8 + mov v27.b[7], v20.b[0] + mov v28.b[7], v20.b[1] + mov v29.b[7], v20.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + movi v18.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + calc_epelb v18, v26, v27, v28, v29 + st3 {v16.8h, v17.8h, v18.8h}, [x0], x10 + subs w3, w3, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_h32_8_neon, export=1 + load_epel_filterb x4, x5 + sub x1, x1, #1 + sub x2, x2, #32 + mov x10, #(MAX_PB_SIZE * 2) +1: ld4 {v24.8b, v25.8b, v26.8b, v27.8b}, [x1], #32 + ld1 {v20.s}[0], [x1], x2 + ushr v28.2d, v24.2d, #8 + ushr v29.2d, v25.2d, #8 + ushr v30.2d, v26.2d, #8 + ins v28.b[7], v20.b[0] + ins v29.b[7], v20.b[1] + ins v30.b[7], v20.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + movi v18.8h, #0 + movi v19.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + calc_epelb v18, v26, v27, v28, v29 + calc_epelb v19, v27, v28, v29, v30 + st4 {v16.8h, v17.8h, v18.8h, v19.8h}, [x0], x10 + subs w3, w3, #1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_h48_8_neon, export=1 + load_epel_filterb x4, x5 + sub x1, x1, #1 + sub x2, x2, #48 + mov x5, #24 + mov x10, #(128 - 48) +1: ld3 {v26.16b, v27.16b, v28.16b}, [x1], x5 + ushr v29.2d, v26.2d, #8 + ushr v30.2d, v27.2d, #8 + ushr v31.2d, v28.2d, #8 + ld1 {v24.s}[0], [x1], x5 + ld1 {v25.s}[0], [x1], x2 + mov v29.b[7], v24.b[0] + mov v30.b[7], v24.b[1] + mov v31.b[7], v24.b[2] + mov v29.b[15], v25.b[0] + mov v30.b[15], v25.b[1] + mov v31.b[15], v25.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + movi v18.8h, #0 + movi v20.8h, #0 + movi v21.8h, #0 + movi v22.8h, #0 + calc_epelb v16, 
v26, v27, v28, v29 + calc_epelb2 v20, v26, v27, v28, v29 + calc_epelb v17, v27, v28, v29, v30 + calc_epelb2 v21, v27, v28, v29, v30 + calc_epelb v18, v28, v29, v30, v31 + calc_epelb2 v22, v28, v29, v30, v31 + st3 {v16.8h, v17.8h, v18.8h}, [x0], #48 + st3 {v20.8h, v21.8h, v22.8h}, [x0], x10 + subs w3, w3, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_h64_8_neon, export=1 + load_epel_filterb x4, x5 + sub x1, x1, #1 + sub x2, x2, #64 + mov x7, #32 +1: ld4 {v24.16b, v25.16b, v26.16b, v27.16b}, [x1], x7 + ushr v28.2d, v24.2d, #8 + ushr v29.2d, v25.2d, #8 + ushr v30.2d, v26.2d, #8 + ld1 {v4.s}[0], [x1], x7 + ld1 {v5.s}[0], [x1], x2 + ins v28.b[7], v4.b[0] + ins v28.b[15], v5.b[0] + ins v29.b[7], v4.b[1] + ins v29.b[15], v5.b[1] + ins v30.b[7], v4.b[2] + ins v30.b[15], v5.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + movi v18.8h, #0 + movi v19.8h, #0 + movi v20.8h, #0 + movi v21.8h, #0 + movi v22.8h, #0 + movi v23.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb2 v20, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + calc_epelb2 v21, v25, v26, v27, v28 + calc_epelb v18, v26, v27, v28, v29 + calc_epelb2 v22, v26, v27, v28, v29 + calc_epelb v19, v27, v28, v29, v30 + calc_epelb2 v23, v27, v28, v29, v30 + st4 {v16.8h, v17.8h, v18.8h, v19.8h}, [x0], #64 + st4 {v20.8h, v21.8h, v22.8h, v23.8h}, [x0], #64 + subs w3, w3, #1 + b.ne 1b + ret +endfunc + +.macro calc_all4 + calc v16, v17, v18, v19 + b.eq 2f + calc v17, v18, v19, v16 + b.eq 2f + calc v18, v19, v16, v17 + b.eq 2f + calc v19, v16, v17, v18 + b.ne 1b +.endm + +.macro calc_all8 + calc v16, v17, v18, v19, v20, v21, v22, v23 + b.eq 2f + calc v18, v19, v20, v21, v22, v23, v16, v17 + b.eq 2f + calc v20, v21, v22, v23, v16, v17, v18, v19 + b.eq 2f + calc v22, v23, v16, v17, v18, v19, v20, v21 + b.ne 1b +.endm + +.macro calc_all12 + calc v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27 + b.eq 2f + calc v19, v20, v21, v22, v23, v24, v25, v26, v27, v16, v17, v18 + b.eq 2f + calc 
v22, v23, v24, v25, v26, v27, v16, v17, v18, v19, v20, v21 + b.eq 2f + calc v25, v26, v27, v16, v17, v18, v19, v20, v21, v22, v23, v24 + b.ne 1b +.endm + +.macro calc_all16 + calc v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, v28, v29, v30, v31 + b.eq 2f + calc v20, v21, v22, v23, v24, v25, v26, v27, v28, v29, v30, v31, v16, v17, v18, v19 + b.eq 2f + calc v24, v25, v26, v27, v28, v29, v30, v31, v16, v17, v18, v19, v20, v21, v22, v23 + b.eq 2f + calc v28, v29, v30, v31, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27 + b.ne 1b +.endm + +function ff_hevc_put_hevc_epel_v4_8_neon, export=1 + load_epel_filterb x5, x4 + sub x1, x1, x2 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.s}[0], [x1], x2 + ld1 {v17.s}[0], [x1], x2 + ld1 {v18.s}[0], [x1], x2 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().s}[0], [x1], x2 + movi v4.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + subs w3, w3, #1 + st1 {v4.4h}, [x0], x10 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_v6_8_neon, export=1 + load_epel_filterb x5, x4 + sub x1, x1, x2 + mov x10, #(MAX_PB_SIZE * 2 - 8) + ld1 {v16.8b}, [x1], x2 + ld1 {v17.8b}, [x1], x2 + ld1 {v18.8b}, [x1], x2 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8b}, [x1], x2 + movi v4.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + st1 {v4.d}[0], [x0], #8 + subs w3, w3, #1 + st1 {v4.s}[2], [x0], x10 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_v8_8_neon, export=1 + load_epel_filterb x5, x4 + sub x1, x1, x2 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8b}, [x1], x2 + ld1 {v17.8b}, [x1], x2 + ld1 {v18.8b}, [x1], x2 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8b}, [x1], x2 + movi v4.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + subs w3, w3, #1 + st1 {v4.8h}, [x0], x10 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_v12_8_neon, export=1 + load_epel_filterb x5, x4 + sub x1, x1, x2 + mov x10, 
#(MAX_PB_SIZE * 2 - 16) + ld1 {v16.16b}, [x1], x2 + ld1 {v17.16b}, [x1], x2 + ld1 {v18.16b}, [x1], x2 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().16b}, [x1], x2 + movi v4.8h, #0 + movi v5.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + calc_epelb2 v5, \src0, \src1, \src2, \src3 + st1 {v4.8h}, [x0], #16 + subs w3, w3, #1 + st1 {v5.d}[0], [x0], x10 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_v16_8_neon, export=1 + load_epel_filterb x5, x4 + sub x1, x1, x2 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.16b}, [x1], x2 + ld1 {v17.16b}, [x1], x2 + ld1 {v18.16b}, [x1], x2 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().16b}, [x1], x2 + movi v4.8h, #0 + movi v5.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + calc_epelb2 v5, \src0, \src1, \src2, \src3 + subs w3, w3, #1 + st1 {v4.8h, v5.8h}, [x0], x10 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_v24_8_neon, export=1 + load_epel_filterb x5, x4 + sub x1, x1, x2 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8b, v17.8b, v18.8b}, [x1], x2 + ld1 {v19.8b, v20.8b, v21.8b}, [x1], x2 + ld1 {v22.8b, v23.8b, v24.8b}, [x1], x2 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11 + ld1 {\src9\().8b, \src10\().8b, \src11\().8b}, [x1], x2 + movi v4.8h, #0 + movi v5.8h, #0 + movi v6.8h, #0 + calc_epelb v4, \src0, \src3, \src6, \src9 + calc_epelb v5, \src1, \src4, \src7, \src10 + calc_epelb v6, \src2, \src5, \src8, \src11 + subs w3, w3, #1 + st1 {v4.8h-v6.8h}, [x0], x10 +.endm +1: calc_all12 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_v32_8_neon, export=1 + load_epel_filterb x5, x4 + sub x1, x1, x2 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.16b, v17.16b}, [x1], x2 + ld1 {v18.16b, v19.16b}, [x1], x2 + ld1 {v20.16b, v21.16b}, [x1], x2 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7 + ld1 {\src6\().16b, \src7\().16b}, [x1], x2 + movi v4.8h, #0 + movi v5.8h, #0 + movi v6.8h, #0 + movi v7.8h, 
#0 + calc_epelb v4, \src0, \src2, \src4, \src6 + calc_epelb2 v5, \src0, \src2, \src4, \src6 + calc_epelb v6, \src1, \src3, \src5, \src7 + calc_epelb2 v7, \src1, \src3, \src5, \src7 + subs w3, w3, #1 + st1 {v4.8h-v7.8h}, [x0], x10 +.endm +1: calc_all8 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_v48_8_neon, export=1 + load_epel_filterb x5, x4 + sub x1, x1, x2 + mov x10, #64 + ld1 {v16.16b, v17.16b, v18.16b}, [x1], x2 + ld1 {v19.16b, v20.16b, v21.16b}, [x1], x2 + ld1 {v22.16b, v23.16b, v24.16b}, [x1], x2 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11 + ld1 {\src9\().16b, \src10\().16b, \src11\().16b}, [x1], x2 + movi v4.8h, #0 + movi v5.8h, #0 + movi v6.8h, #0 + movi v7.8h, #0 + movi v28.8h, #0 + movi v29.8h, #0 + calc_epelb v4, \src0, \src3, \src6, \src9 + calc_epelb2 v5, \src0, \src3, \src6, \src9 + calc_epelb v6, \src1, \src4, \src7, \src10 + calc_epelb2 v7, \src1, \src4, \src7, \src10 + calc_epelb v28, \src2, \src5, \src8, \src11 + calc_epelb2 v29, \src2, \src5, \src8, \src11 + st1 { v4.8h-v7.8h}, [x0], #64 + subs w3, w3, #1 + st1 {v28.8h-v29.8h}, [x0], x10 +.endm +1: calc_all12 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_v64_8_neon, export=1 + load_epel_filterb x5, x4 + sub sp, sp, #32 + st1 {v8.8b-v11.8b}, [sp] + sub x1, x1, x2 + ld1 {v16.16b, v17.16b, v18.16b, v19.16b}, [x1], x2 + ld1 {v20.16b, v21.16b, v22.16b, v23.16b}, [x1], x2 + ld1 {v24.16b, v25.16b, v26.16b, v27.16b}, [x1], x2 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15 + ld1 {\src12\().16b-\src15\().16b}, [x1], x2 + movi v4.8h, #0 + movi v5.8h, #0 + movi v6.8h, #0 + movi v7.8h, #0 + movi v8.8h, #0 + movi v9.8h, #0 + movi v10.8h, #0 + movi v11.8h, #0 + calc_epelb v4, \src0, \src4, \src8, \src12 + calc_epelb2 v5, \src0, \src4, \src8, \src12 + calc_epelb v6, \src1, \src5, \src9, \src13 + calc_epelb2 v7, \src1, \src5, \src9, \src13 + calc_epelb v8, \src2, 
\src6, \src10, \src14 + calc_epelb2 v9, \src2, \src6, \src10, \src14 + calc_epelb v10, \src3, \src7, \src11, \src15 + calc_epelb2 v11, \src3, \src7, \src11, \src15 + st1 {v4.8h-v7.8h}, [x0], #64 + subs w3, w3, #1 + st1 {v8.8h-v11.8h}, [x0], #64 +.endm +1: calc_all16 +.purgem calc +2: ld1 {v8.8b-v11.8b}, [sp] + add sp, sp, #32 + ret +endfunc + +function ff_hevc_put_hevc_epel_hv4_8_neon, export=1 + add w10, w3, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x3, [sp, #-16]! + stp x5, x30, [sp, #-16]! + add x0, sp, #32 + sub x1, x1, x2 + add w3, w3, #3 + bl X(ff_hevc_put_hevc_epel_h4_8_neon) + ldp x5, x30, [sp], #16 + ldp x0, x3, [sp], #16 + load_epel_filterh x5, x4 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.4h}, [sp], x10 + ld1 {v17.4h}, [sp], x10 + ld1 {v18.4h}, [sp], x10 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().4h}, [sp], x10 + calc_epelh v4, \src0, \src1, \src2, \src3 + subs w3, w3, #1 + st1 {v4.4h}, [x0], x10 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_hv6_8_neon, export=1 + add w10, w3, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x3, [sp, #-16]! + stp x5, x30, [sp, #-16]! + add x0, sp, #32 + sub x1, x1, x2 + add w3, w3, #3 + bl X(ff_hevc_put_hevc_epel_h6_8_neon) + ldp x5, x30, [sp], #16 + ldp x0, x3, [sp], #16 + load_epel_filterh x5, x4 + mov x5, #120 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h}, [sp], x10 + ld1 {v17.8h}, [sp], x10 + ld1 {v18.8h}, [sp], x10 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8h}, [sp], x10 + calc_epelh v4, \src0, \src1, \src2, \src3 + calc_epelh2 v4, v5, \src0, \src1, \src2, \src3 + st1 {v4.d}[0], [x0], #8 + subs w3, w3, #1 + st1 {v4.s}[2], [x0], x5 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_hv8_8_neon, export=1 + add w10, w3, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x3, [sp, #-16]! + stp x5, x30, [sp, #-16]! 
+ add x0, sp, #32 + sub x1, x1, x2 + add w3, w3, #3 + bl X(ff_hevc_put_hevc_epel_h8_8_neon) + ldp x5, x30, [sp], #16 + ldp x0, x3, [sp], #16 + load_epel_filterh x5, x4 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h}, [sp], x10 + ld1 {v17.8h}, [sp], x10 + ld1 {v18.8h}, [sp], x10 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8h}, [sp], x10 + calc_epelh v4, \src0, \src1, \src2, \src3 + calc_epelh2 v4, v5, \src0, \src1, \src2, \src3 + subs w3, w3, #1 + st1 {v4.8h}, [x0], x10 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_hv12_8_neon, export=1 + add w10, w3, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x3, [sp, #-16]! + stp x5, x30, [sp, #-16]! + add x0, sp, #32 + sub x1, x1, x2 + add w3, w3, #3 + bl X(ff_hevc_put_hevc_epel_h12_8_neon) + ldp x5, x30, [sp], #16 + ldp x0, x3, [sp], #16 + load_epel_filterh x5, x4 + mov x5, #112 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h, v17.8h}, [sp], x10 + ld1 {v18.8h, v19.8h}, [sp], x10 + ld1 {v20.8h, v21.8h}, [sp], x10 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7 + ld1 {\src6\().8h, \src7\().8h}, [sp], x10 + calc_epelh v4, \src0, \src2, \src4, \src6 + calc_epelh2 v4, v5, \src0, \src2, \src4, \src6 + calc_epelh v5, \src1, \src3, \src5, \src7 + st1 {v4.8h}, [x0], #16 + subs w3, w3, #1 + st1 {v5.4h}, [x0], x5 +.endm +1: calc_all8 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_hv16_8_neon, export=1 + add w10, w3, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x3, [sp, #-16]! + stp x5, x30, [sp, #-16]! 
+ add x0, sp, #32 + sub x1, x1, x2 + add w3, w3, #3 + bl X(ff_hevc_put_hevc_epel_h16_8_neon) + ldp x5, x30, [sp], #16 + ldp x0, x3, [sp], #16 + load_epel_filterh x5, x4 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h, v17.8h}, [sp], x10 + ld1 {v18.8h, v19.8h}, [sp], x10 + ld1 {v20.8h, v21.8h}, [sp], x10 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7 + ld1 {\src6\().8h, \src7\().8h}, [sp], x10 + calc_epelh v4, \src0, \src2, \src4, \src6 + calc_epelh2 v4, v5, \src0, \src2, \src4, \src6 + calc_epelh v5, \src1, \src3, \src5, \src7 + calc_epelh2 v5, v6, \src1, \src3, \src5, \src7 + subs w3, w3, #1 + st1 {v4.8h, v5.8h}, [x0], x10 +.endm +1: calc_all8 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_hv24_8_neon, export=1 + add w10, w3, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x3, [sp, #-16]! + stp x5, x30, [sp, #-16]! + add x0, sp, #32 + sub x1, x1, x2 + add w3, w3, #3 + bl X(ff_hevc_put_hevc_epel_h24_8_neon) + ldp x5, x30, [sp], #16 + ldp x0, x3, [sp], #16 + load_epel_filterh x5, x4 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h, v17.8h, v18.8h}, [sp], x10 + ld1 {v19.8h, v20.8h, v21.8h}, [sp], x10 + ld1 {v22.8h, v23.8h, v24.8h}, [sp], x10 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11 + ld1 {\src9\().8h-\src11\().8h}, [sp], x10 + calc_epelh v4, \src0, \src3, \src6, \src9 + calc_epelh2 v4, v5, \src0, \src3, \src6, \src9 + calc_epelh v5, \src1, \src4, \src7, \src10 + calc_epelh2 v5, v6, \src1, \src4, \src7, \src10 + calc_epelh v6, \src2, \src5, \src8, \src11 + calc_epelh2 v6, v7, \src2, \src5, \src8, \src11 + subs w3, w3, #1 + st1 {v4.8h-v6.8h}, [x0], x10 +.endm +1: calc_all12 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_hv32_8_neon, export=1 + stp xzr, x30, [sp, #-16]! + stp x0, x1, [sp, #-16]! + stp x2, x3, [sp, #-16]! + stp x4, x5, [sp, #-16]! 
+ mov x6, #16 + bl X(ff_hevc_put_hevc_epel_hv16_8_neon) + ldp x4, x5, [sp], #16 + ldp x2, x3, [sp], #16 + ldp x0, x1, [sp], #16 + add x0, x0, #32 + add x1, x1, #16 + mov x6, #16 + bl X(ff_hevc_put_hevc_epel_hv16_8_neon) + ldp xzr, x30, [sp], #16 + ret +endfunc + +function ff_hevc_put_hevc_epel_hv48_8_neon, export=1 + stp xzr, x30, [sp, #-16]! + stp x0, x1, [sp, #-16]! + stp x2, x3, [sp, #-16]! + stp x4, x5, [sp, #-16]! + mov x6, #24 + bl X(ff_hevc_put_hevc_epel_hv24_8_neon) + ldp x4, x5, [sp], #16 + ldp x2, x3, [sp], #16 + ldp x0, x1, [sp], #16 + add x0, x0, #48 + add x1, x1, #24 + mov x6, #24 + bl X(ff_hevc_put_hevc_epel_hv24_8_neon) + ldp xzr, x30, [sp], #16 + ret +endfunc + +function ff_hevc_put_hevc_epel_hv64_8_neon, export=1 + stp xzr, x30, [sp, #-16]! + stp x0, x1, [sp, #-16]! + stp x2, x3, [sp, #-16]! + stp x4, x5, [sp, #-16]! + mov x6, #16 + bl X(ff_hevc_put_hevc_epel_hv16_8_neon) + ldp x4, x5, [sp] + ldp x2, x3, [sp, #16] + ldp x0, x1, [sp, #32] + add x0, x0, #32 + add x1, x1, #16 + mov x6, #16 + bl X(ff_hevc_put_hevc_epel_hv16_8_neon) + ldp x4, x5, [sp] + ldp x2, x3, [sp, #16] + ldp x0, x1, [sp, #32] + add x0, x0, #64 + add x1, x1, #32 + mov x6, #16 + bl X(ff_hevc_put_hevc_epel_hv16_8_neon) + ldp x4, x5, [sp], #16 + ldp x2, x3, [sp], #16 + ldp x0, x1, [sp], #16 + add x0, x0, #96 + add x1, x1, #48 + mov x6, #16 + bl X(ff_hevc_put_hevc_epel_hv16_8_neon) + ldp xzr, x30, [sp], #16 + ret +endfunc + +function ff_hevc_put_hevc_epel_uni_v4_8_neon, export=1 + load_epel_filterb x6, x5 + sxtw x3, w3 + sxtw x1, w1 + sub x2, x2, x3 + ld1 {v16.s}[0], [x2], x3 + ld1 {v17.s}[0], [x2], x3 + ld1 {v18.s}[0], [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().s}[0], [x2], x3 + movi v4.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + sqrshrun v4.8b, v4.8h, #6 + subs w4, w4, #1 + st1 {v4.s}[0], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_v6_8_neon, export=1 + load_epel_filterb x6, x5 + sxtw x3, w3 + sxtw x1, 
w1 + sub x2, x2, x3 + sub x1, x1, #4 + ld1 {v16.8b}, [x2], x3 + ld1 {v17.8b}, [x2], x3 + ld1 {v18.8b}, [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8b}, [x2], x3 + movi v4.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + sqrshrun v4.8b, v4.8h, #6 + st1 {v4.s}[0], [x0], #4 + subs w4, w4, #1 + st1 {v4.h}[2], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_v8_8_neon, export=1 + load_epel_filterb x6, x5 + sxtw x3, w3 + sxtw x1, w1 + sub x2, x2, x3 + ld1 {v16.8b}, [x2], x3 + ld1 {v17.8b}, [x2], x3 + ld1 {v18.8b}, [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8b}, [x2], x3 + movi v4.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + sqrshrun v4.8b, v4.8h, #6 + subs w4, w4, #1 + st1 {v4.8b}, [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_v12_8_neon, export=1 + load_epel_filterb x6, x5 + sxtw x3, w3 + sxtw x1, w1 + sub x2, x2, x3 + sub x1, x1, #8 + ld1 {v16.16b}, [x2], x3 + ld1 {v17.16b}, [x2], x3 + ld1 {v18.16b}, [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().16b}, [x2], x3 + movi v4.8h, #0 + movi v5.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + calc_epelb2 v5, \src0, \src1, \src2, \src3 + sqrshrun v4.8b, v4.8h, #6 + sqrshrun2 v4.16b, v5.8h, #6 + subs w4, w4, #1 + st1 {v4.8b}, [x0], #8 + st1 {v4.s}[2], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_v16_8_neon, export=1 + load_epel_filterb x6, x5 + sxtw x3, w3 + sxtw x1, w1 + sub x2, x2, x3 + ld1 {v16.16b}, [x2], x3 + ld1 {v17.16b}, [x2], x3 + ld1 {v18.16b}, [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().16b}, [x2], x3 + movi v4.8h, #0 + movi v5.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + calc_epelb2 v5, \src0, \src1, \src2, \src3 + sqrshrun v4.8b, v4.8h, #6 + sqrshrun2 v4.16b, v5.8h, #6 + subs w4, w4, #1 + st1 {v4.16b}, [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + 
+function ff_hevc_put_hevc_epel_uni_v24_8_neon, export=1 + load_epel_filterb x6, x5 + sxtw x3, w3 + sxtw x1, w1 + sub x2, x2, x3 + ld1 {v16.8b, v17.8b, v18.8b}, [x2], x3 + ld1 {v19.8b, v20.8b, v21.8b}, [x2], x3 + ld1 {v22.8b, v23.8b, v24.8b}, [x2], x3 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11 + ld1 {\src9\().8b, \src10\().8b, \src11\().8b}, [x2], x3 + movi v4.8h, #0 + movi v5.8h, #0 + movi v6.8h, #0 + calc_epelb v4, \src0, \src3, \src6, \src9 + calc_epelb v5, \src1, \src4, \src7, \src10 + calc_epelb v6, \src2, \src5, \src8, \src11 + sqrshrun v4.8b, v4.8h, #6 + sqrshrun v5.8b, v5.8h, #6 + sqrshrun v6.8b, v6.8h, #6 + subs w4, w4, #1 + st1 {v4.8b-v6.8b}, [x0], x1 +.endm +1: calc_all12 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_v32_8_neon, export=1 + load_epel_filterb x6, x5 + sxtw x3, w3 + sxtw x1, w1 + sub x2, x2, x3 + ld1 {v16.16b, v17.16b}, [x2], x3 + ld1 {v18.16b, v19.16b}, [x2], x3 + ld1 {v20.16b, v21.16b}, [x2], x3 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7 + ld1 {\src6\().16b, \src7\().16b}, [x2], x3 + movi v4.8h, #0 + movi v5.8h, #0 + movi v6.8h, #0 + movi v7.8h, #0 + calc_epelb v4, \src0, \src2, \src4, \src6 + calc_epelb2 v5, \src0, \src2, \src4, \src6 + calc_epelb v6, \src1, \src3, \src5, \src7 + calc_epelb2 v7, \src1, \src3, \src5, \src7 + sqrshrun v4.8b, v4.8h, #6 + sqrshrun2 v4.16b, v5.8h, #6 + sqrshrun v5.8b, v6.8h, #6 + sqrshrun2 v5.16b, v7.8h, #6 + subs w4, w4, #1 + st1 {v4.16b, v5.16b}, [x0], x1 +.endm +1: calc_all8 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_v48_8_neon, export=1 + load_epel_filterb x6, x5 + sxtw x3, w3 + sxtw x1, w1 + sub x2, x2, x3 + ld1 {v16.16b, v17.16b, v18.16b}, [x2], x3 + ld1 {v19.16b, v20.16b, v21.16b}, [x2], x3 + ld1 {v22.16b, v23.16b, v24.16b}, [x2], x3 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11 + ld1 {\src9\().16b, \src10\().16b, \src11\().16b}, [x2], x3 + movi v4.8h, #0 
+ movi v5.8h, #0 + movi v6.8h, #0 + movi v7.8h, #0 + movi v28.8h, #0 + movi v29.8h, #0 + calc_epelb v4, \src0, \src3, \src6, \src9 + calc_epelb2 v5, \src0, \src3, \src6, \src9 + calc_epelb v6, \src1, \src4, \src7, \src10 + calc_epelb2 v7, \src1, \src4, \src7, \src10 + calc_epelb v28, \src2, \src5, \src8, \src11 + calc_epelb2 v29, \src2, \src5, \src8, \src11 + sqrshrun v4.8b, v4.8h, #6 + sqrshrun2 v4.16b, v5.8h, #6 + sqrshrun v5.8b, v6.8h, #6 + sqrshrun2 v5.16b, v7.8h, #6 + sqrshrun v6.8b, v28.8h, #6 + sqrshrun2 v6.16b, v29.8h, #6 + subs w4, w4, #1 + st1 {v4.16b, v5.16b, v6.16b}, [x0], x1 +.endm +1: calc_all12 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_v64_8_neon, export=1 + load_epel_filterb x6, x5 + sub sp, sp, #32 + sxtw x3, w3 + sxtw x1, w1 + st1 {v8.8b-v11.8b}, [sp] + sub x2, x2, x3 + ld1 {v16.16b, v17.16b, v18.16b, v19.16b}, [x2], x3 + ld1 {v20.16b, v21.16b, v22.16b, v23.16b}, [x2], x3 + ld1 {v24.16b, v25.16b, v26.16b, v27.16b}, [x2], x3 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15 + ld1 {\src12\().16b, \src13\().16b, \src14\().16b, \src15\().16b}, [x2], x3 + movi v4.8h, #0 + movi v5.8h, #0 + movi v6.8h, #0 + movi v7.8h, #0 + movi v8.8h, #0 + movi v9.8h, #0 + movi v10.8h, #0 + movi v11.8h, #0 + calc_epelb v10, \src3, \src7, \src11, \src15 + calc_epelb2 v11, \src3, \src7, \src11, \src15 + calc_epelb v4, \src0, \src4, \src8, \src12 + calc_epelb2 v5, \src0, \src4, \src8, \src12 + calc_epelb v6, \src1, \src5, \src9, \src13 + calc_epelb2 v7, \src1, \src5, \src9, \src13 + calc_epelb v8, \src2, \src6, \src10, \src14 + calc_epelb2 v9, \src2, \src6, \src10, \src14 + sqrshrun v4.8b, v4.8h, #6 + sqrshrun2 v4.16b, v5.8h, #6 + sqrshrun v5.8b, v6.8h, #6 + sqrshrun2 v5.16b, v7.8h, #6 + sqrshrun v6.8b, v8.8h, #6 + sqrshrun2 v6.16b, v9.8h, #6 + sqrshrun v7.8b, v10.8h, #6 + sqrshrun2 v7.16b, v11.8h, #6 + subs w4, w4, #1 + st1 {v4.16b, v5.16b, v6.16b, v7.16b}, [x0], x1 +.endm +1: 
calc_all16 +.purgem calc +2: ld1 {v8.8b-v11.8b}, [sp] + add sp, sp, #32 + ret +endfunc + +function ff_hevc_put_hevc_epel_uni_hv4_8_neon, export=1 + add w10, w4, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x6, [sp, #-16]! + stp xzr, x30, [sp, #-16]! + add x0, sp, #48 + sub x1, x2, x3 + mov x2, x3 + add w3, w4, #3 + mov x4, x5 + bl X(ff_hevc_put_hevc_epel_h4_8_neon) + ldp xzr, x30, [sp], #16 + ldp x4, x6, [sp], #16 + ldp x0, x1, [sp], #16 + load_epel_filterh x6, x5 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.4h}, [sp], x10 + ld1 {v17.4h}, [sp], x10 + ld1 {v18.4h}, [sp], x10 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().4h}, [sp], x10 + calc_epelh v4, \src0, \src1, \src2, \src3 + sqrshrun v4.8b, v4.8h, #6 + subs w4, w4, #1 + st1 {v4.s}[0], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_hv6_8_neon, export=1 + add w10, w4, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x6, [sp, #-16]! + stp xzr, x30, [sp, #-16]! + add x0, sp, #48 + sub x1, x2, x3 + mov x2, x3 + add w3, w4, #3 + mov x4, x5 + bl X(ff_hevc_put_hevc_epel_h6_8_neon) + ldp xzr, x30, [sp], #16 + ldp x4, x6, [sp], #16 + ldp x0, x1, [sp], #16 + load_epel_filterh x6, x5 + sub x1, x1, #4 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h}, [sp], x10 + ld1 {v17.8h}, [sp], x10 + ld1 {v18.8h}, [sp], x10 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8h}, [sp], x10 + calc_epelh v4, \src0, \src1, \src2, \src3 + calc_epelh2 v4, v5, \src0, \src1, \src2, \src3 + sqrshrun v4.8b, v4.8h, #6 + st1 {v4.s}[0], [x0], #4 + subs w4, w4, #1 + st1 {v4.h}[2], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_hv8_8_neon, export=1 + add w10, w4, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x6, [sp, #-16]! + stp xzr, x30, [sp, #-16]! 
+ add x0, sp, #48 + sub x1, x2, x3 + mov x2, x3 + add w3, w4, #3 + mov x4, x5 + bl X(ff_hevc_put_hevc_epel_h8_8_neon) + ldp xzr, x30, [sp], #16 + ldp x4, x6, [sp], #16 + ldp x0, x1, [sp], #16 + load_epel_filterh x6, x5 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h}, [sp], x10 + ld1 {v17.8h}, [sp], x10 + ld1 {v18.8h}, [sp], x10 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8h}, [sp], x10 + calc_epelh v4, \src0, \src1, \src2, \src3 + calc_epelh2 v4, v5, \src0, \src1, \src2, \src3 + sqrshrun v4.8b, v4.8h, #6 + subs w4, w4, #1 + st1 {v4.8b}, [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_hv12_8_neon, export=1 + add w10, w4, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x6, [sp, #-16]! + stp xzr, x30, [sp, #-16]! + add x0, sp, #48 + sub x1, x2, x3 + mov x2, x3 + add w3, w4, #3 + mov x4, x5 + bl X(ff_hevc_put_hevc_epel_h12_8_neon) + ldp xzr, x30, [sp], #16 + ldp x4, x6, [sp], #16 + ldp x0, x1, [sp], #16 + load_epel_filterh x6, x5 + sub x1, x1, #8 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h, v17.8h}, [sp], x10 + ld1 {v18.8h, v19.8h}, [sp], x10 + ld1 {v20.8h, v21.8h}, [sp], x10 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7 + ld1 {\src6\().8h, \src7\().8h}, [sp], x10 + calc_epelh v4, \src0, \src2, \src4, \src6 + calc_epelh2 v4, v5, \src0, \src2, \src4, \src6 + calc_epelh v5, \src1, \src3, \src5, \src7 + sqrshrun v4.8b, v4.8h, #6 + sqrshrun2 v4.16b, v5.8h, #6 + st1 {v4.8b}, [x0], #8 + st1 {v4.s}[2], [x0], x1 + subs w4, w4, #1 +.endm +1: calc_all8 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_hv16_8_neon, export=1 + add w10, w4, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x6, [sp, #-16]! + stp xzr, x30, [sp, #-16]! 
+ add x0, sp, #48 + sub x1, x2, x3 + mov x2, x3 + add w3, w4, #3 + mov x4, x5 + bl X(ff_hevc_put_hevc_epel_h16_8_neon) + ldp xzr, x30, [sp], #16 + ldp x4, x6, [sp], #16 + ldp x0, x1, [sp], #16 + load_epel_filterh x6, x5 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h, v17.8h}, [sp], x10 + ld1 {v18.8h, v19.8h}, [sp], x10 + ld1 {v20.8h, v21.8h}, [sp], x10 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7 + ld1 {\src6\().8h, \src7\().8h}, [sp], x10 + calc_epelh v4, \src0, \src2, \src4, \src6 + calc_epelh2 v4, v5, \src0, \src2, \src4, \src6 + calc_epelh v5, \src1, \src3, \src5, \src7 + calc_epelh2 v5, v6, \src1, \src3, \src5, \src7 + sqrshrun v4.8b, v4.8h, #6 + sqrshrun2 v4.16b, v5.8h, #6 + subs w4, w4, #1 + st1 {v4.16b}, [x0], x1 +.endm +1: calc_all8 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_hv24_8_neon, export=1 + add w10, w4, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x6, [sp, #-16]! + stp xzr, x30, [sp, #-16]! 
+ add x0, sp, #48 + sub x1, x2, x3 + mov x2, x3 + add w3, w4, #3 + mov x4, x5 + bl X(ff_hevc_put_hevc_epel_h24_8_neon) + ldp xzr, x30, [sp], #16 + ldp x4, x6, [sp], #16 + ldp x0, x1, [sp], #16 + load_epel_filterh x6, x5 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h, v17.8h, v18.8h}, [sp], x10 + ld1 {v19.8h, v20.8h, v21.8h}, [sp], x10 + ld1 {v22.8h, v23.8h, v24.8h}, [sp], x10 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11 + ld1 {\src9\().8h, \src10\().8h, \src11\().8h}, [sp], x10 + calc_epelh v4, \src0, \src3, \src6, \src9 + calc_epelh2 v4, v5, \src0, \src3, \src6, \src9 + calc_epelh v5, \src1, \src4, \src7, \src10 + calc_epelh2 v5, v6, \src1, \src4, \src7, \src10 + calc_epelh v6, \src2, \src5, \src8, \src11 + calc_epelh2 v6, v7, \src2, \src5, \src8, \src11 + sqrshrun v4.8b, v4.8h, #6 + sqrshrun v5.8b, v5.8h, #6 + sqrshrun v6.8b, v6.8h, #6 + subs w4, w4, #1 + st1 {v4.8b, v5.8b, v6.8b}, [x0], x1 +.endm +1: calc_all12 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_uni_hv32_8_neon, export=1 + stp x0, x30, [sp, #-16]! + stp x1, x2, [sp, #-16]! + stp x3, x4, [sp, #-16]! + stp x5, x6, [sp, #-16]! + mov x7, #16 + bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon) + ldp x5, x6, [sp], #16 + ldp x3, x4, [sp], #16 + ldp x1, x2, [sp], #16 + ldr x0, [sp] + add x0, x0, #16 + add x2, x2, #16 + mov x7, #16 + bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon) + ldp xzr, x30, [sp], #16 + ret +endfunc + +function ff_hevc_put_hevc_epel_uni_hv48_8_neon, export=1 + stp x0, x30, [sp, #-16]! + stp x1, x2, [sp, #-16]! + stp x3, x4, [sp, #-16]! + stp x5, x6, [sp, #-16]! + mov x7, #24 + bl X(ff_hevc_put_hevc_epel_uni_hv24_8_neon) + ldp x5, x6, [sp], #16 + ldp x3, x4, [sp], #16 + ldp x1, x2, [sp], #16 + ldr x0, [sp] + add x0, x0, #24 + add x2, x2, #24 + mov x7, #24 + bl X(ff_hevc_put_hevc_epel_uni_hv24_8_neon) + ldp xzr, x30, [sp], #16 + ret +endfunc + +function ff_hevc_put_hevc_epel_uni_hv64_8_neon, export=1 + stp x0, x30, [sp, #-16]! 
+ stp x1, x2, [sp, #-16]! + stp x3, x4, [sp, #-16]! + stp x5, x6, [sp, #-16]! + mov x7, #16 + bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon) + ldp x5, x6, [sp] + ldp x3, x4, [sp, #16] + ldp x1, x2, [sp, #32] + ldr x0, [sp, #48] + add x0, x0, #16 + add x2, x2, #16 + mov x7, #16 + bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon) + ldp x5, x6, [sp] + ldp x3, x4, [sp, #16] + ldp x1, x2, [sp, #32] + ldr x0, [sp, #48] + add x0, x0, #32 + add x2, x2, #32 + mov x7, #16 + bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon) + ldp x5, x6, [sp], #16 + ldp x3, x4, [sp], #16 + ldp x1, x2, [sp], #16 + ldr x0, [sp] + add x0, x0, #48 + add x2, x2, #48 + mov x7, #16 + bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon) + ldp xzr, x30, [sp], #16 + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_h4_8_neon, export=1 + load_epel_filterb x6, x7 + sub x2, x2, #1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld1 {v4.8b}, [x2], x3 + ushr v5.2d, v4.2d, #8 + ushr v6.2d, v5.2d, #8 + ushr v7.2d, v6.2d, #8 + movi v16.8h, #0 + calc_epelb v16, v4, v5, v6, v7 + ld1 {v20.4h}, [x4], x10 + sqadd v16.8h, v16.8h, v20.8h + sqrshrun v4.8b, v16.8h, #7 + st1 {v4.s}[0], [x0], x1 + subs w5, w5, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_h6_8_neon, export=1 + load_epel_filterb x6, x7 + sub w1, w1, #4 + sub x2, x2, #1 + sub x3, x3, #8 + mov x10, #(MAX_PB_SIZE * 2) +1: ld1 {v24.8b}, [x2], #8 + ushr v26.2d, v24.2d, #8 + ushr v27.2d, v26.2d, #8 + ushr v28.2d, v27.2d, #8 + movi v16.8h, #0 + ld1 {v28.b}[5], [x2], x3 + calc_epelb v16, v24, v26, v27, v28 + ld1 {v20.8h}, [x4], x10 + sqadd v16.8h, v16.8h, v20.8h + sqrshrun v16.8b, v16.8h, #7 + st1 {v16.s}[0], [x0], #4 + st1 {v16.h}[2], [x0], x1 + subs w5, w5, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_h8_8_neon, export=1 + load_epel_filterb x6, x7 + sub x2, x2, #1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld2 {v24.8b, v25.8b}, [x2], x3 + ushr v26.2d, v24.2d, #8 + ushr v27.2d, v25.2d, #8 + ushr v28.2d, v26.2d, #8 + movi v16.8h, #0 + movi v17.8h, #0 
+ calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + zip1 v16.8h, v16.8h, v17.8h + ld1 {v20.8h}, [x4], x10 + sqadd v16.8h, v16.8h, v20.8h + sqrshrun v16.8b, v16.8h, #7 + st1 {v16.8b}, [x0], x1 + subs w5, w5, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_h12_8_neon, export=1 + load_epel_filterb x6, x7 + sub x1, x1, #8 + sub x2, x2, #1 + mov x10, #(MAX_PB_SIZE * 2) +1: ld2 {v24.8b, v25.8b}, [x2], x3 + ushr v26.2d, v24.2d, #8 + ushr v27.2d, v25.2d, #8 + ushr v28.2d, v26.2d, #8 + movi v16.8h, #0 + movi v17.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + zip1 v18.8h, v16.8h, v17.8h + zip2 v19.8h, v16.8h, v17.8h + ld1 {v20.8h, v21.8h}, [x4], x10 + sqadd v18.8h, v18.8h, v20.8h + sqadd v19.8h, v19.8h, v21.8h + sqrshrun v20.8b, v18.8h, #7 + sqrshrun v21.8b, v19.8h, #7 + st1 {v20.8b}, [x0], #8 + st1 {v21.s}[0], [x0], x1 + subs w5, w5, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_h16_8_neon, export=1 + load_epel_filterb x6, x7 + sub x2, x2, #1 + sub w3, w3, #16 + mov x10, #(MAX_PB_SIZE * 2) +1: ld2 {v24.8b, v25.8b}, [x2], #16 + ld1 {v20.s}[0], [x2], x3 + ushr v26.2d, v24.2d, #8 + ushr v27.2d, v25.2d, #8 + mov v26.b[7], v20.b[0] + mov v27.b[7], v20.b[1] + ushr v28.2d, v26.2d, #8 + mov v28.b[7], v20.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + zip1 v18.8h, v16.8h, v17.8h + zip2 v19.8h, v16.8h, v17.8h + ld2 {v24.8h, v25.8h}, [x4], x10 + sqadd v16.8h, v16.8h, v24.8h + sqadd v17.8h, v17.8h, v25.8h + sqrshrun v4.8b, v16.8h, #7 + sqrshrun v5.8b, v17.8h, #7 + st2 {v4.8b, v5.8b}, [x0], x1 + subs w5, w5, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_h24_8_neon, export=1 + load_epel_filterb x6, x7 + sub x2, x2, #1 + sub w3, w3, #24 + mov x10, #(MAX_PB_SIZE * 2) +1: ld3 {v24.8b, v25.8b, v26.8b}, [x2], #24 + ld1 {v20.s}[0], [x2], x3 + ushr v27.2d, v24.2d, #8 + ushr v28.2d, 
v25.2d, #8 + ushr v29.2d, v26.2d, #8 + mov v27.b[7], v20.b[0] + mov v28.b[7], v20.b[1] + mov v29.b[7], v20.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + movi v18.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + calc_epelb v18, v26, v27, v28, v29 + ld3 {v24.8h, v25.8h, v26.8h}, [x4], x10 + sqadd v16.8h, v16.8h, v24.8h + sqadd v17.8h, v17.8h, v25.8h + sqadd v18.8h, v18.8h, v26.8h + sqrshrun v4.8b, v16.8h, #7 + sqrshrun v5.8b, v17.8h, #7 + sqrshrun v6.8b, v18.8h, #7 + st3 {v4.8b, v5.8b, v6.8b}, [x0], x1 + subs w5, w5, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_h32_8_neon, export=1 + load_epel_filterb x6, x7 + sub x2, x2, #1 + sub w3, w3, #32 + mov x10, #(MAX_PB_SIZE * 2) +1: ld4 {v24.8b, v25.8b, v26.8b, v27.8b}, [x2], #32 + ld1 {v20.s}[0], [x2], x3 + ushr v28.2d, v24.2d, #8 + ushr v29.2d, v25.2d, #8 + ushr v30.2d, v26.2d, #8 + ins v28.b[7], v20.b[0] + ins v29.b[7], v20.b[1] + ins v30.b[7], v20.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + movi v18.8h, #0 + movi v19.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + calc_epelb v18, v26, v27, v28, v29 + calc_epelb v19, v27, v28, v29, v30 + ld4 {v24.8h, v25.8h, v26.8h, v27.8h}, [x4], x10 + sqadd v16.8h, v16.8h, v24.8h + sqadd v17.8h, v17.8h, v25.8h + sqadd v18.8h, v18.8h, v26.8h + sqadd v19.8h, v19.8h, v27.8h + sqrshrun v4.8b, v16.8h, #7 + sqrshrun v5.8b, v17.8h, #7 + sqrshrun v6.8b, v18.8h, #7 + sqrshrun v7.8b, v19.8h, #7 + st4 {v4.8b, v5.8b, v6.8b, v7.8b}, [x0], x1 + subs w5, w5, #1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_h48_8_neon, export=1 + load_epel_filterb x6, x7 + sub x2, x2, #1 + sub w3, w3, #48 + mov x7, #24 + mov x10, #(128 - 48) +1: ld3 {v26.16b, v27.16b, v28.16b}, [x2], x7 + ushr v29.2d, v26.2d, #8 + ushr v30.2d, v27.2d, #8 + ushr v31.2d, v28.2d, #8 + ld1 {v24.s}[0], [x2], x7 + ld1 {v25.s}[0], [x2], x3 + mov v29.b[7], v24.b[0] + mov v30.b[7], v24.b[1] + mov v31.b[7], v24.b[2] + mov v29.b[15], 
v25.b[0] + mov v30.b[15], v25.b[1] + mov v31.b[15], v25.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + movi v18.8h, #0 + movi v20.8h, #0 + movi v21.8h, #0 + movi v22.8h, #0 + calc_epelb v16, v26, v27, v28, v29 + calc_epelb2 v20, v26, v27, v28, v29 + calc_epelb v17, v27, v28, v29, v30 + calc_epelb2 v21, v27, v28, v29, v30 + calc_epelb v18, v28, v29, v30, v31 + calc_epelb2 v22, v28, v29, v30, v31 + ld3 {v24.8h, v25.8h, v26.8h}, [x4], #48 + sqadd v16.8h, v16.8h, v24.8h + sqadd v17.8h, v17.8h, v25.8h + sqadd v18.8h, v18.8h, v26.8h + ld3 {v27.8h, v28.8h, v29.8h}, [x4], x10 + sqadd v20.8h, v20.8h, v27.8h + sqadd v21.8h, v21.8h, v28.8h + sqadd v22.8h, v22.8h, v29.8h + sqrshrun v4.8b, v16.8h, #7 + sqrshrun v5.8b, v17.8h, #7 + sqrshrun v6.8b, v18.8h, #7 + sqrshrun2 v4.16b, v20.8h, #7 + sqrshrun2 v5.16b, v21.8h, #7 + sqrshrun2 v6.16b, v22.8h, #7 + st3 {v4.16b, v5.16b, v6.16b}, [x0], x1 + subs w5, w5, #1 // height + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_h64_8_neon, export=1 + load_epel_filterb x6, x7 + sub x2, x2, #1 + sub w3, w3, #64 + mov x7, #32 +1: ld4 {v24.16b, v25.16b, v26.16b, v27.16b}, [x2], x7 + ushr v28.2d, v24.2d, #8 + ushr v29.2d, v25.2d, #8 + ushr v30.2d, v26.2d, #8 + ld1 {v4.s}[0], [x2], x7 + ld1 {v5.s}[0], [x2], x3 + ins v28.b[7], v4.b[0] + ins v28.b[15], v5.b[0] + ins v29.b[7], v4.b[1] + ins v29.b[15], v5.b[1] + ins v30.b[7], v4.b[2] + ins v30.b[15], v5.b[2] + movi v16.8h, #0 + movi v17.8h, #0 + movi v18.8h, #0 + movi v19.8h, #0 + movi v20.8h, #0 + movi v21.8h, #0 + movi v22.8h, #0 + movi v23.8h, #0 + calc_epelb v16, v24, v25, v26, v27 + calc_epelb2 v20, v24, v25, v26, v27 + calc_epelb v17, v25, v26, v27, v28 + calc_epelb2 v21, v25, v26, v27, v28 + calc_epelb v18, v26, v27, v28, v29 + calc_epelb2 v22, v26, v27, v28, v29 + calc_epelb v19, v27, v28, v29, v30 + calc_epelb2 v23, v27, v28, v29, v30 + ld4 {v24.8h, v25.8h, v26.8h, v27.8h}, [x4], #64 + sqadd v16.8h, v16.8h, v24.8h + sqadd v17.8h, v17.8h, v25.8h + sqadd v18.8h, v18.8h, v26.8h + sqadd 
v19.8h, v19.8h, v27.8h + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 + sqadd v20.8h, v20.8h, v28.8h + sqadd v21.8h, v21.8h, v29.8h + sqadd v22.8h, v22.8h, v30.8h + sqadd v23.8h, v23.8h, v31.8h + sqrshrun v4.8b, v16.8h, #7 + sqrshrun v5.8b, v17.8h, #7 + sqrshrun v6.8b, v18.8h, #7 + sqrshrun v7.8b, v19.8h, #7 + sqrshrun2 v4.16b, v20.8h, #7 + sqrshrun2 v5.16b, v21.8h, #7 + sqrshrun2 v6.16b, v22.8h, #7 + sqrshrun2 v7.16b, v23.8h, #7 + st4 {v4.16b, v5.16b, v6.16b, v7.16b}, [x0], x1 + subs w5, w5, #1 + b.ne 1b + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_v4_8_neon, export=1 + load_epel_filterb x7, x6 + sub x2, x2, x3 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.s}[0], [x2], x3 + ld1 {v17.s}[0], [x2], x3 + ld1 {v18.s}[0], [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().s}[0], [x2], x3 + movi v4.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + ld1 {v24.4h}, [x4], x10 + sqadd v4.8h, v4.8h, v24.8h + sqrshrun v4.8b, v4.8h, #7 + subs w5, w5, #1 + st1 {v4.s}[0], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_v6_8_neon, export=1 + load_epel_filterb x7, x6 + sub x2, x2, x3 + sub x1, x1, #4 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8b}, [x2], x3 + ld1 {v17.8b}, [x2], x3 + ld1 {v18.8b}, [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8b}, [x2], x3 + movi v4.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + ld1 {v24.8h}, [x4], x10 + sqadd v4.8h, v4.8h, v24.8h + sqrshrun v4.8b, v4.8h, #7 + st1 {v4.s}[0], [x0], #4 + subs w5, w5, #1 + st1 {v4.h}[2], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_v8_8_neon, export=1 + load_epel_filterb x7, x6 + sub x2, x2, x3 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8b}, [x2], x3 + ld1 {v17.8b}, [x2], x3 + ld1 {v18.8b}, [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8b}, [x2], x3 + movi v4.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + ld1 {v24.8h}, [x4], x10 + sqadd v4.8h, v4.8h, v24.8h + 
sqrshrun v4.8b, v4.8h, #7 + subs w5, w5, #1 + st1 {v4.8b}, [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_v12_8_neon, export=1 + load_epel_filterb x7, x6 + sub x1, x1, #8 + sub x2, x2, x3 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.16b}, [x2], x3 + ld1 {v17.16b}, [x2], x3 + ld1 {v18.16b}, [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().16b}, [x2], x3 + movi v4.8h, #0 + movi v5.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + calc_epelb2 v5, \src0, \src1, \src2, \src3 + ld1 {v24.8h, v25.8h}, [x4], x10 + sqadd v4.8h, v4.8h, v24.8h + sqadd v5.8h, v5.8h, v25.8h + sqrshrun v4.8b, v4.8h, #7 + sqrshrun2 v4.16b, v5.8h, #7 + st1 {v4.8b}, [x0], #8 + subs w5, w5, #1 + st1 {v4.s}[2], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_v16_8_neon, export=1 + load_epel_filterb x7, x6 + sub x2, x2, x3 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.16b}, [x2], x3 + ld1 {v17.16b}, [x2], x3 + ld1 {v18.16b}, [x2], x3 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().16b}, [x2], x3 + movi v4.8h, #0 + movi v5.8h, #0 + calc_epelb v4, \src0, \src1, \src2, \src3 + calc_epelb2 v5, \src0, \src1, \src2, \src3 + ld1 {v24.8h, v25.8h}, [x4], x10 + sqadd v4.8h, v4.8h, v24.8h + sqadd v5.8h, v5.8h, v25.8h + sqrshrun v4.8b, v4.8h, #7 + sqrshrun2 v4.16b, v5.8h, #7 + st1 {v4.16b}, [x0], x1 + subs w5, w5, #1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_v24_8_neon, export=1 + load_epel_filterb x7, x6 + sub x2, x2, x3 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8b, v17.8b, v18.8b}, [x2], x3 + ld1 {v19.8b, v20.8b, v21.8b}, [x2], x3 + ld1 {v22.8b, v23.8b, v24.8b}, [x2], x3 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11 + ld1 {\src9\().8b, \src10\().8b, \src11\().8b}, [x2], x3 + movi v4.8h, #0 + movi v5.8h, #0 + movi v6.8h, #0 + calc_epelb v4, \src0, \src3, \src6, \src9 + calc_epelb v5, \src1, \src4, \src7, \src10 + 
calc_epelb v6, \src2, \src5, \src8, \src11 + ld1 {v28.8h, v29.8h, v30.8h}, [x4], x10 + sqadd v4.8h, v4.8h, v28.8h + sqadd v5.8h, v5.8h, v29.8h + sqadd v6.8h, v6.8h, v30.8h + sqrshrun v4.8b, v4.8h, #7 + sqrshrun v5.8b, v5.8h, #7 + sqrshrun v6.8b, v6.8h, #7 + subs w5, w5, #1 + st1 {v4.8b, v5.8b, v6.8b}, [x0], x1 +.endm +1: calc_all12 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_v32_8_neon, export=1 + load_epel_filterb x7, x6 + sub x2, x2, x3 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.16b, v17.16b}, [x2], x3 + ld1 {v18.16b, v19.16b}, [x2], x3 + ld1 {v20.16b, v21.16b}, [x2], x3 +.macro calc src0, src1, src2, src3, src4, src5, src6, src7 + ld1 {\src6\().16b, \src7\().16b}, [x2], x3 + movi v4.8h, #0 + movi v5.8h, #0 + movi v6.8h, #0 + movi v7.8h, #0 + calc_epelb v4, \src0, \src2, \src4, \src6 + calc_epelb2 v5, \src0, \src2, \src4, \src6 + calc_epelb v6, \src1, \src3, \src5, \src7 + calc_epelb2 v7, \src1, \src3, \src5, \src7 + ld1 {v24.8h-v27.8h}, [x4], x10 + sqadd v4.8h, v4.8h, v24.8h + sqadd v5.8h, v5.8h, v25.8h + sqadd v6.8h, v6.8h, v26.8h + sqadd v7.8h, v7.8h, v27.8h + sqrshrun v4.8b, v4.8h, #7 + sqrshrun2 v4.16b, v5.8h, #7 + sqrshrun v5.8b, v6.8h, #7 + sqrshrun2 v5.16b, v7.8h, #7 + st1 {v4.16b, v5.16b}, [x0], x1 + subs w5, w5, #1 +.endm +1: calc_all8 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_v48_8_neon, export=1 + stp x7, x30, [sp, #-16]! + stp x0, x1, [sp, #-16]! + stp x2, x3, [sp, #-16]! + stp x4, x5, [sp, #-16]! + bl X(ff_hevc_put_hevc_epel_bi_v24_8_neon) + ldp x4, x5, [sp], #16 + ldp x2, x3, [sp], #16 + ldp x0, x1, [sp], #16 + add x0, x0, #24 + add x2, x2, #24 + add x4, x4, #48 + ldr x7, [sp] + bl X(ff_hevc_put_hevc_epel_bi_v24_8_neon) + ldp xzr, x30, [sp], #16 + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_v64_8_neon, export=1 + stp x7, x30, [sp, #-16]! + stp x0, x1, [sp, #-16]! + stp x2, x3, [sp, #-16]! + stp x4, x5, [sp, #-16]! 
+ bl X(ff_hevc_put_hevc_epel_bi_v32_8_neon) + ldp x4, x5, [sp], #16 + ldp x2, x3, [sp], #16 + ldp x0, x1, [sp], #16 + add x0, x0, #32 + add x2, x2, #32 + add x4, x4, #64 + ldr x7, [sp] + bl X(ff_hevc_put_hevc_epel_bi_v32_8_neon) + ldp xzr, x30, [sp], #16 + ret +endfunc + +function ff_hevc_put_hevc_epel_bi_hv4_8_neon, export=1 + add w10, w5, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x5, [sp, #-16]! + stp x7, x30, [sp, #-16]! + add x0, sp, #48 + sub x1, x2, x3 + mov x2, x3 + add w3, w5, #3 + mov x4, x6 + mov x5, x7 + bl X(ff_hevc_put_hevc_epel_h4_8_neon) + ldp x7, x30, [sp], #16 + ldp x4, x5, [sp], #16 + ldp x0, x1, [sp], #16 + load_epel_filterh x7, x6 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.4h}, [sp], x10 + ld1 {v17.4h}, [sp], x10 + ld1 {v18.4h}, [sp], x10 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().4h}, [sp], x10 + calc_epelh v4, \src0, \src1, \src2, \src3 + ld1 {v6.4h}, [x4], x10 + sqadd v4.4h, v4.4h, v6.4h + sqrshrun v4.8b, v4.8h, #7 + subs w5, w5, #1 + st1 {v4.s}[0], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_hv6_8_neon, export=1 + add w10, w5, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x5, [sp, #-16]! + stp x7, x30, [sp, #-16]! 
+ add x0, sp, #48 + sub x1, x2, x3 + mov x2, x3 + add w3, w5, #3 + mov x4, x6 + mov x5, x7 + bl X(ff_hevc_put_hevc_epel_h6_8_neon) + ldp x7, x30, [sp], #16 + ldp x4, x5, [sp], #16 + ldp x0, x1, [sp], #16 + load_epel_filterh x7, x6 + sub x1, x1, #4 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h}, [sp], x10 + ld1 {v17.8h}, [sp], x10 + ld1 {v18.8h}, [sp], x10 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8h}, [sp], x10 + calc_epelh v4, \src0, \src1, \src2, \src3 + calc_epelh2 v4, v5, \src0, \src1, \src2, \src3 + ld1 {v6.8h}, [x4], x10 + sqadd v4.8h, v4.8h, v6.8h + sqrshrun v4.8b, v4.8h, #7 + st1 {v4.s}[0], [x0], #4 + subs w5, w5, #1 + st1 {v4.h}[2], [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_hv8_8_neon, export=1 + add w10, w5, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x5, [sp, #-16]! + stp x7, x30, [sp, #-16]! + add x0, sp, #48 + sub x1, x2, x3 + mov x2, x3 + add w3, w5, #3 + mov x4, x6 + mov x5, x7 + bl X(ff_hevc_put_hevc_epel_h8_8_neon) + ldp x7, x30, [sp], #16 + ldp x4, x5, [sp], #16 + ldp x0, x1, [sp], #16 + load_epel_filterh x7, x6 + mov x10, #(MAX_PB_SIZE * 2) + ld1 {v16.8h}, [sp], x10 + ld1 {v17.8h}, [sp], x10 + ld1 {v18.8h}, [sp], x10 +.macro calc src0, src1, src2, src3 + ld1 {\src3\().8h}, [sp], x10 + calc_epelh v4, \src0, \src1, \src2, \src3 + calc_epelh2 v4, v5, \src0, \src1, \src2, \src3 + ld1 {v6.8h}, [x4], x10 + sqadd v4.8h, v4.8h, v6.8h + sqrshrun v4.8b, v4.8h, #7 + subs w5, w5, #1 + st1 {v4.8b}, [x0], x1 +.endm +1: calc_all4 +.purgem calc +2: ret +endfunc + +function ff_hevc_put_hevc_epel_bi_hv12_8_neon, export=1 + add w10, w5, #3 + lsl x10, x10, #7 + sub sp, sp, x10 // tmp_array + stp x0, x1, [sp, #-16]! + stp x4, x5, [sp, #-16]! + stp x7, x30, [sp, #-16]! 
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h12_8_neon)
+        ldp             x7, x30, [sp], #16
+        ldp             x4, x5, [sp], #16
+        ldp             x0, x1, [sp], #16
+        load_epel_filterh x7, x6
+        sub             x1, x1, #8
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.8h, v17.8h}, [sp], x10
+        ld1             {v18.8h, v19.8h}, [sp], x10
+        ld1             {v20.8h, v21.8h}, [sp], x10
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\src6\().8h, \src7\().8h}, [sp], x10
+        calc_epelh      v4, \src0, \src2, \src4, \src6
+        calc_epelh2     v4, v5, \src0, \src2, \src4, \src6
+        calc_epelh      v5, \src1, \src3, \src5, \src7
+        ld1             {v6.8h, v7.8h}, [x4], x10
+        sqadd           v4.8h, v4.8h, v6.8h
+        sqadd           v5.8h, v5.8h, v7.8h
+        sqrshrun        v4.8b, v4.8h, #7
+        sqrshrun2       v4.16b, v5.8h, #7
+        st1             {v4.8b}, [x0], #8
+        subs            w5, w5, #1
+        st1             {v4.s}[2], [x0], x1
+.endm
+1:      calc_all8
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv16_8_neon, export=1
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x0, x1, [sp, #-16]!
+        stp             x4, x5, [sp, #-16]!
+        stp             x7, x30, [sp, #-16]!
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h16_8_neon)
+        ldp             x7, x30, [sp], #16
+        ldp             x4, x5, [sp], #16
+        ldp             x0, x1, [sp], #16
+        load_epel_filterh x7, x6
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.8h, v17.8h}, [sp], x10
+        ld1             {v18.8h, v19.8h}, [sp], x10
+        ld1             {v20.8h, v21.8h}, [sp], x10
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\src6\().8h, \src7\().8h}, [sp], x10
+        calc_epelh      v4, \src0, \src2, \src4, \src6
+        calc_epelh2     v4, v5, \src0, \src2, \src4, \src6
+        calc_epelh      v5, \src1, \src3, \src5, \src7
+        calc_epelh2     v5, v6, \src1, \src3, \src5, \src7
+        ld1             {v6.8h, v7.8h}, [x4], x10
+        sqadd           v4.8h, v4.8h, v6.8h
+        sqadd           v5.8h, v5.8h, v7.8h
+        sqrshrun        v4.8b, v4.8h, #7
+        sqrshrun2       v4.16b, v5.8h, #7
+        st1             {v4.16b}, [x0], x1
+        subs            w5, w5, #1
+.endm
+1:      calc_all8
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv24_8_neon, export=1
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x0, x1, [sp, #-16]!
+        stp             x4, x5, [sp, #-16]!
+        stp             x7, x30, [sp, #-16]!
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h24_8_neon)
+        ldp             x7, x30, [sp], #16
+        ldp             x4, x5, [sp], #16
+        ldp             x0, x1, [sp], #16
+        load_epel_filterh x7, x6
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.8h, v17.8h, v18.8h}, [sp], x10
+        ld1             {v19.8h, v20.8h, v21.8h}, [sp], x10
+        ld1             {v22.8h, v23.8h, v24.8h}, [sp], x10
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11
+        ld1             {\src9\().8h, \src10\().8h, \src11\().8h}, [sp], x10
+        calc_epelh      v1, \src0, \src3, \src6, \src9
+        calc_epelh2     v1, v2, \src0, \src3, \src6, \src9
+        calc_epelh      v2, \src1, \src4, \src7, \src10
+        calc_epelh2     v2, v3, \src1, \src4, \src7, \src10
+        calc_epelh      v3, \src2, \src5, \src8, \src11
+        calc_epelh2     v3, v4, \src2, \src5, \src8, \src11
+        ld1             {v4.8h, v5.8h, v6.8h}, [x4], x10
+        sqadd           v1.8h, v1.8h, v4.8h
+        sqadd           v2.8h, v2.8h, v5.8h
+        sqadd           v3.8h, v3.8h, v6.8h
+        sqrshrun        v1.8b, v1.8h, #7
+        sqrshrun        v2.8b, v2.8h, #7
+        sqrshrun        v3.8b, v3.8h, #7
+        subs            w5, w5, #1
+        st1             {v1.8b, v2.8b, v3.8b}, [x0], x1
+.endm
+1:      calc_all12
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv32_8_neon, export=1
+        sub             sp, sp, #16
+        st1             {v8.16b}, [sp]
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x0, x1, [sp, #-16]!
+        stp             x4, x5, [sp, #-16]!
+        stp             x7, x30, [sp, #-16]!
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h32_8_neon)
+        ldp             x7, x30, [sp], #16
+        ldp             x4, x5, [sp], #16
+        ldp             x0, x1, [sp], #16
+        load_epel_filterh x7, x6
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.8h, v17.8h, v18.8h, v19.8h}, [sp], x10
+        ld1             {v20.8h, v21.8h, v22.8h, v23.8h}, [sp], x10
+        ld1             {v24.8h, v25.8h, v26.8h, v27.8h}, [sp], x10
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15
+        ld1             {\src12\().8h, \src13\().8h, \src14\().8h, \src15\().8h}, [sp], x10
+        calc_epelh      v1, \src0, \src4, \src8, \src12
+        calc_epelh2     v1, v2, \src0, \src4, \src8, \src12
+        calc_epelh      v2, \src1, \src5, \src9, \src13
+        calc_epelh2     v2, v3, \src1, \src5, \src9, \src13
+        calc_epelh      v3, \src2, \src6, \src10, \src14
+        calc_epelh2     v3, v4, \src2, \src6, \src10, \src14
+        calc_epelh      v4, \src3, \src7, \src11, \src15
+        calc_epelh2     v4, v5, \src3, \src7, \src11, \src15
+        ld1             {v5.8h, v6.8h, v7.8h, v8.8h}, [x4], x10
+        sqadd           v1.8h, v1.8h, v5.8h
+        sqadd           v2.8h, v2.8h, v6.8h
+        sqadd           v3.8h, v3.8h, v7.8h
+        sqadd           v4.8h, v4.8h, v8.8h
+        sqrshrun        v1.8b, v1.8h, #7
+        sqrshrun        v2.8b, v2.8h, #7
+        sqrshrun        v3.8b, v3.8h, #7
+        sqrshrun        v4.8b, v4.8h, #7
+        st1             {v1.8b, v2.8b, v3.8b, v4.8b}, [x0], x1
+        subs            w5, w5, #1
+.endm
+1:      calc_all16
+.purgem calc
+2:      ld1             {v8.16b}, [sp], #16
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv48_8_neon, export=1
+        stp             xzr, x30, [sp, #-16]!
+        stp             x0, x1, [sp, #-16]!
+        stp             x2, x3, [sp, #-16]!
+        stp             x4, x5, [sp, #-16]!
+        stp             x6, x7, [sp, #-16]!
+        bl              X(ff_hevc_put_hevc_epel_bi_hv24_8_neon)
+        ldp             x6, x7, [sp], #16
+        ldp             x4, x5, [sp], #16
+        ldp             x2, x3, [sp], #16
+        ldp             x0, x1, [sp], #16
+        add             x0, x0, #24
+        add             x2, x2, #24
+        add             x4, x4, #48
+        bl              X(ff_hevc_put_hevc_epel_bi_hv24_8_neon)
+        ldp             xzr, x30, [sp], #16
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv64_8_neon, export=1
+        stp             xzr, x30, [sp, #-16]!
+        stp             x0, x1, [sp, #-16]!
+        stp             x2, x3, [sp, #-16]!
+        stp             x4, x5, [sp, #-16]!
+        stp             x6, x7, [sp, #-16]!
+        bl              X(ff_hevc_put_hevc_epel_bi_hv32_8_neon)
+        ldp             x6, x7, [sp], #16
+        ldp             x4, x5, [sp], #16
+        ldp             x2, x3, [sp], #16
+        ldp             x0, x1, [sp], #16
+        add             x0, x0, #32
+        add             x2, x2, #32
+        add             x4, x4, #64
+        bl              X(ff_hevc_put_hevc_epel_bi_hv32_8_neon)
+        ldp             xzr, x30, [sp], #16
+        ret
+endfunc
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 3e5d85247e..a4c078683b 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -69,6 +69,46 @@ void ff_hevc_sao_band_filter_8x8_8_neon(uint8_t *_dst, uint8_t *_src,
     void ff_hevc_put_hevc_##fn##48_8_neon args; \
     void ff_hevc_put_hevc_##fn##64_8_neon args; \
 
+NEON8_FNPROTO(pel_pixels, (int16_t *dst,
+        uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width));
+
+NEON8_FNPROTO(pel_bi_pixels, (uint8_t *dst, ptrdiff_t dststride,
+        uint8_t *src, ptrdiff_t srcstride, int16_t *src2,
+        int height, intptr_t mx, intptr_t my, int width));
+
+NEON8_FNPROTO(epel_h, (int16_t *dst,
+        uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width));
+
+NEON8_FNPROTO(epel_v, (int16_t *dst,
+        uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width));
+
+NEON8_FNPROTO(epel_hv, (int16_t *dst,
+        uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width));
+
+NEON8_FNPROTO(epel_uni_v, (uint8_t *dst, ptrdiff_t dststride,
+        uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width));
+
+NEON8_FNPROTO(epel_uni_hv, (uint8_t *dst, ptrdiff_t _dststride,
+        uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width));
+
+NEON8_FNPROTO(epel_bi_h, (uint8_t *dst, ptrdiff_t dststride,
+        uint8_t *src, ptrdiff_t srcstride, int16_t *src2,
+        int height, intptr_t mx, intptr_t my, int width));
+
+NEON8_FNPROTO(epel_bi_v, (uint8_t *dst, ptrdiff_t dststride,
+        uint8_t *src, ptrdiff_t srcstride, int16_t *src2,
+        int height, intptr_t mx, intptr_t my, int width));
+
+NEON8_FNPROTO(epel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
+        uint8_t *src, ptrdiff_t srcstride, int16_t *src2,
+        int height, intptr_t mx, intptr_t my, int width));
+
 NEON8_FNPROTO(qpel_h, (int16_t *dst,
         uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width));
@@ -137,12 +177,24 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         // of non-multiple of 8 seem to arise.
 //        c->sao_band_filter[0]          = ff_hevc_sao_band_filter_8x8_8_neon;
 
+        NEON8_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels);
+        NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h);
+        NEON8_FNASSIGN(c->put_hevc_epel, 1, 0, epel_v);
+        NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv);
+        NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 0, epel_uni_v);
+        NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv);
+        NEON8_FNASSIGN(c->put_hevc_epel_bi, 0, 0, pel_bi_pixels);
+        NEON8_FNASSIGN(c->put_hevc_epel_bi, 0, 1, epel_bi_h);
+        NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 0, epel_bi_v);
+        NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv);
+        NEON8_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels);
         NEON8_FNASSIGN(c->put_hevc_qpel, 0, 1, qpel_h);
         NEON8_FNASSIGN(c->put_hevc_qpel, 1, 0, qpel_v);
         NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv);
         NEON8_FNASSIGN(c->put_hevc_qpel_uni, 0, 1, qpel_uni_h);
         NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 0, qpel_uni_v);
         NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv);
+        NEON8_FNASSIGN(c->put_hevc_qpel_bi, 0, 0, pel_bi_pixels);
         NEON8_FNASSIGN(c->put_hevc_qpel_bi, 0, 1, qpel_bi_h);
         NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 0, qpel_bi_v);
         NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 1, qpel_bi_hv);