From: Ben Avison <bavison@riscosopen.org>
To: ffmpeg-devel@ffmpeg.org
Cc: Ben Avison <bavison@riscosopen.org>
Date: Thu, 31 Mar 2022 18:23:49 +0100
Message-Id: <20220331172351.550818-9-bavison@riscosopen.org>
In-Reply-To: <20220331172351.550818-1-bavison@riscosopen.org>
References: <20220331172351.550818-1-bavison@riscosopen.org>
Subject: [FFmpeg-devel] [PATCH v3 08/10] avcodec/idctdsp: Arm 64-bit NEON block add and clamp fast paths

checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.

idctdsp.add_pixels_clamped_c: 313.3
idctdsp.add_pixels_clamped_neon: 24.3
idctdsp.put_pixels_clamped_c: 220.3
idctdsp.put_pixels_clamped_neon: 15.5
idctdsp.put_signed_pixels_clamped_c: 210.5
idctdsp.put_signed_pixels_clamped_neon: 19.5

Signed-off-by: Ben Avison <bavison@riscosopen.org>
---
 libavcodec/aarch64/Makefile               |   3 +-
 libavcodec/aarch64/idctdsp_init_aarch64.c |  26 +++--
 libavcodec/aarch64/idctdsp_neon.S         | 130 ++++++++++++++++++++++
 3 files changed, 150 insertions(+), 9 deletions(-)
 create mode 100644 libavcodec/aarch64/idctdsp_neon.S

diff --git a/libavcodec/aarch64/Makefile b/libavcodec/aarch64/Makefile
index 5b25e4dfb9..c8935f205e 100644
--- a/libavcodec/aarch64/Makefile
+++ b/libavcodec/aarch64/Makefile
@@ -44,7 +44,8 @@ NEON-OBJS-$(CONFIG_H264PRED)            += aarch64/h264pred_neon.o
 NEON-OBJS-$(CONFIG_H264QPEL)            += aarch64/h264qpel_neon.o             \
                                            aarch64/hpeldsp_neon.o
 NEON-OBJS-$(CONFIG_HPELDSP)             += aarch64/hpeldsp_neon.o
-NEON-OBJS-$(CONFIG_IDCTDSP)             += aarch64/simple_idct_neon.o
+NEON-OBJS-$(CONFIG_IDCTDSP)             += aarch64/idctdsp_neon.o              \
+                                           aarch64/simple_idct_neon.o
 NEON-OBJS-$(CONFIG_MDCT)                += aarch64/mdct_neon.o
 NEON-OBJS-$(CONFIG_MPEGAUDIODSP)        += aarch64/mpegaudiodsp_neon.o
 NEON-OBJS-$(CONFIG_PIXBLOCKDSP)         += aarch64/pixblockdsp_neon.o
diff --git a/libavcodec/aarch64/idctdsp_init_aarch64.c b/libavcodec/aarch64/idctdsp_init_aarch64.c
index 742a3372e3..eec21aa5a2 100644
--- a/libavcodec/aarch64/idctdsp_init_aarch64.c
+++ b/libavcodec/aarch64/idctdsp_init_aarch64.c
@@ -27,19 +27,29 @@
 #include "libavcodec/idctdsp.h"
 #include "idct.h"
 
+void ff_put_pixels_clamped_neon(const int16_t *, uint8_t *, ptrdiff_t);
+void ff_put_signed_pixels_clamped_neon(const int16_t *, uint8_t *, ptrdiff_t);
+void ff_add_pixels_clamped_neon(const int16_t *, uint8_t *, ptrdiff_t);
+
 av_cold void ff_idctdsp_init_aarch64(IDCTDSPContext *c, AVCodecContext *avctx,
                                      unsigned high_bit_depth)
 {
     int cpu_flags = av_get_cpu_flags();
 
-    if (have_neon(cpu_flags) && !avctx->lowres && !high_bit_depth) {
-        if (avctx->idct_algo == FF_IDCT_AUTO ||
-            avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
-            avctx->idct_algo == FF_IDCT_SIMPLENEON) {
-            c->idct_put  = ff_simple_idct_put_neon;
-            c->idct_add  = ff_simple_idct_add_neon;
-            c->idct      = ff_simple_idct_neon;
-            c->perm_type = FF_IDCT_PERM_PARTTRANS;
+    if (have_neon(cpu_flags)) {
+        if (!avctx->lowres && !high_bit_depth) {
+            if (avctx->idct_algo == FF_IDCT_AUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLENEON) {
+                c->idct_put  = ff_simple_idct_put_neon;
+                c->idct_add  = ff_simple_idct_add_neon;
+                c->idct      = ff_simple_idct_neon;
+                c->perm_type = FF_IDCT_PERM_PARTTRANS;
+            }
         }
+
+        c->add_pixels_clamped = ff_add_pixels_clamped_neon;
+        c->put_pixels_clamped = ff_put_pixels_clamped_neon;
+        c->put_signed_pixels_clamped = ff_put_signed_pixels_clamped_neon;
     }
 }
diff --git a/libavcodec/aarch64/idctdsp_neon.S b/libavcodec/aarch64/idctdsp_neon.S
new file mode 100644
index 0000000000..7f47611206
--- /dev/null
+++ b/libavcodec/aarch64/idctdsp_neon.S
@@ -0,0 +1,130 @@
+/*
+ * IDCT AArch64 NEON optimisations
+ *
+ * Copyright (c) 2022 Ben Avison
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/aarch64/asm.S"
+
+// Clamp 16-bit signed block coefficients to unsigned 8-bit
+// On entry:
+//   x0 -> array of 64x 16-bit coefficients
+//   x1 -> 8-bit results
+//   x2 = row stride for results, bytes
+function ff_put_pixels_clamped_neon, export=1
+        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
+        ld1             {v4.16b, v5.16b, v6.16b, v7.16b}, [x0]
+        sqxtun          v0.8b, v0.8h
+        sqxtun          v1.8b, v1.8h
+        sqxtun          v2.8b, v2.8h
+        sqxtun          v3.8b, v3.8h
+        sqxtun          v4.8b, v4.8h
+        st1             {v0.8b}, [x1], x2
+        sqxtun          v0.8b, v5.8h
+        st1             {v1.8b}, [x1], x2
+        sqxtun          v1.8b, v6.8h
+        st1             {v2.8b}, [x1], x2
+        sqxtun          v2.8b, v7.8h
+        st1             {v3.8b}, [x1], x2
+        st1             {v4.8b}, [x1], x2
+        st1             {v0.8b}, [x1], x2
+        st1             {v1.8b}, [x1], x2
+        st1             {v2.8b}, [x1]
+        ret
+endfunc
+
+// Clamp 16-bit signed block coefficients to signed 8-bit (biased by 128)
+// On entry:
+//   x0 -> array of 64x 16-bit coefficients
+//   x1 -> 8-bit results
+//   x2 = row stride for results, bytes
+function ff_put_signed_pixels_clamped_neon, export=1
+        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
+        movi            v4.8b, #128
+        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0]
+        sqxtn           v0.8b, v0.8h
+        sqxtn           v1.8b, v1.8h
+        sqxtn           v2.8b, v2.8h
+        sqxtn           v3.8b, v3.8h
+        sqxtn           v5.8b, v16.8h
+        add             v0.8b, v0.8b, v4.8b
+        sqxtn           v6.8b, v17.8h
+        add             v1.8b, v1.8b, v4.8b
+        sqxtn           v7.8b, v18.8h
+        add             v2.8b, v2.8b, v4.8b
+        sqxtn           v16.8b, v19.8h
+        add             v3.8b, v3.8b, v4.8b
+        st1             {v0.8b}, [x1], x2
+        add             v0.8b, v5.8b, v4.8b
+        st1             {v1.8b}, [x1], x2
+        add             v1.8b, v6.8b, v4.8b
+        st1             {v2.8b}, [x1], x2
+        add             v2.8b, v7.8b, v4.8b
+        st1             {v3.8b}, [x1], x2
+        add             v3.8b, v16.8b, v4.8b
+        st1             {v0.8b}, [x1], x2
+        st1             {v1.8b}, [x1], x2
+        st1             {v2.8b}, [x1], x2
+        st1             {v3.8b}, [x1]
+        ret
+endfunc
+
+// Add 16-bit signed block coefficients to unsigned 8-bit
+// On entry:
+//   x0 -> array of 64x 16-bit coefficients
+//   x1 -> 8-bit input and results
+//   x2 = row stride for 8-bit input and results, bytes
+function ff_add_pixels_clamped_neon, export=1
+        ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], #64
+        mov             x3, x1
+        ld1             {v4.8b}, [x1], x2
+        ld1             {v5.8b}, [x1], x2
+        ld1             {v6.8b}, [x1], x2
+        ld1             {v7.8b}, [x1], x2
+        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x0]
+        uaddw           v0.8h, v0.8h, v4.8b
+        uaddw           v1.8h, v1.8h, v5.8b
+        uaddw           v2.8h, v2.8h, v6.8b
+        ld1             {v4.8b}, [x1], x2
+        uaddw           v3.8h, v3.8h, v7.8b
+        ld1             {v5.8b}, [x1], x2
+        sqxtun          v0.8b, v0.8h
+        ld1             {v6.8b}, [x1], x2
+        sqxtun          v1.8b, v1.8h
+        ld1             {v7.8b}, [x1]
+        sqxtun          v2.8b, v2.8h
+        sqxtun          v3.8b, v3.8h
+        uaddw           v4.8h, v16.8h, v4.8b
+        st1             {v0.8b}, [x3], x2
+        uaddw           v0.8h, v17.8h, v5.8b
+        st1             {v1.8b}, [x3], x2
+        uaddw           v1.8h, v18.8h, v6.8b
+        st1             {v2.8b}, [x3], x2
+        uaddw           v2.8h, v19.8h, v7.8b
+        sqxtun          v4.8b, v4.8h
+        sqxtun          v0.8b, v0.8h
+        st1             {v3.8b}, [x3], x2
+        sqxtun          v1.8b, v1.8h
+        sqxtun          v2.8b, v2.8h
+        st1             {v4.8b}, [x3], x2
+        st1             {v0.8b}, [x3], x2
+        st1             {v1.8b}, [x3], x2
+        st1             {v2.8b}, [x3]
+        ret
+endfunc
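
For readers comparing against the assembly, the arithmetic that the three routines are described as performing (per their header comments) can be sketched in plain C. This is only an illustrative sketch, not FFmpeg's own C code: the names ref_put_pixels_clamped, ref_put_signed_pixels_clamped, ref_add_pixels_clamped and clip_u8 are hypothetical, and the sketch uses int intermediates, whereas the NEON add path accumulates in 16-bit lanes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: saturate an int to the unsigned 8-bit range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Clamp an 8x8 block of 16-bit coefficients to unsigned 8-bit and store it
 * row by row, advancing the destination by `stride` bytes per row. */
static void ref_put_pixels_clamped(const int16_t *block, uint8_t *pixels,
                                   ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            pixels[x] = clip_u8(block[y * 8 + x]);
        pixels += stride;
    }
}

/* Clamp to signed 8-bit [-128, 127], then bias by 128 so that the stored
 * byte lies in [0, 255]. */
static void ref_put_signed_pixels_clamped(const int16_t *block,
                                          uint8_t *pixels, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int v = block[y * 8 + x];
            v = v < -128 ? -128 : v > 127 ? 127 : v;
            pixels[x] = (uint8_t)(v + 128);
        }
        pixels += stride;
    }
}

/* Add each coefficient to the existing pixel and clamp the sum to
 * unsigned 8-bit, writing the result back in place. */
static void ref_add_pixels_clamped(const int16_t *block, uint8_t *pixels,
                                   ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            pixels[x] = clip_u8(pixels[x] + block[y * 8 + x]);
        pixels += stride;
    }
}
```

All three take the same arguments as the prototypes added to idctdsp_init_aarch64.c (coefficient block, destination pointer, row stride in bytes), so a scalar loop like this could serve as a quick cross-check of the NEON output on small test blocks.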