From patchwork Fri Sep 27 12:52:41 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ramiro Polla
X-Patchwork-Id: 51888
From: Ramiro Polla
To: ffmpeg-devel@ffmpeg.org
Date: Fri, 27 Sep 2024 14:52:41 +0200
Message-Id: <20240927125241.15887-17-ramiro.polla@gmail.com>
X-Mailer: git-send-email 2.30.2
In-Reply-To: <20240927125241.15887-1-ramiro.polla@gmail.com>
References: <20240927125241.15887-1-ramiro.polla@gmail.com>
Subject: [FFmpeg-devel] [PATCH v2 16/16] swscale/aarch64: add neon {lum, chr}ConvertRange16
X-BeenThere: ffmpeg-devel@ffmpeg.org
Precedence: list
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Errors-To: ffmpeg-devel-bounces@ffmpeg.org

                                 A55               A76
chrRangeFromJpeg16_1920_c:     28840.6            6323.5
chrRangeFromJpeg16_1920_neon:   8436.5 ( 3.42x)   3365.2 ( 1.88x)
chrRangeToJpeg16_1920_c:       23075.1            9195.6
chrRangeToJpeg16_1920_neon:     9393.6 ( 2.46x)   4084.5 ( 2.25x)
lumRangeFromJpeg16_1920_c:     15383.8            4436.8
lumRangeFromJpeg16_1920_neon:   4586.0 ( 3.35x)   1814.0 ( 2.45x)
lumRangeToJpeg16_1920_c:       19225.5            6017.2
lumRangeToJpeg16_1920_neon:     5067.9 ( 3.79x)   2146.4 ( 2.80x)
---
 libswscale/aarch64/range_convert_neon.S | 98 +++++++++++++++++++++++--
 libswscale/aarch64/swscale.c            | 36 ++++++---
 2 files changed, 116 insertions(+), 18 deletions(-)

diff --git a/libswscale/aarch64/range_convert_neon.S b/libswscale/aarch64/range_convert_neon.S
index 1aadd8e04d..f1812301ed 100644
--- a/libswscale/aarch64/range_convert_neon.S
+++ b/libswscale/aarch64/range_convert_neon.S
@@ -20,12 +20,42 @@
 
 #include "libavutil/aarch64/asm.S"
 
-.macro lumConvertRange fromto
-function ff_lumRange\fromto\()Jpeg_neon, export=1
+.macro lumConvertRange fromto, bit_depth
+function ff_lumRange\fromto\()Jpeg\bit_depth\()_neon, export=1
 // x0  int16_t *dst
 // w1  int width
 // w2  int coeff
 // x3  int64_t offset
+.if \bit_depth == 16
+.ifc \fromto, To
+        movi            v25.4s, #1
+        movi            v24.4s, #1<<3, lsl #16
+        sub             v24.4s, v24.4s, v25.4s
+.endif
+        dup             v25.4s, w2
+        dup             v26.2d, x3
+1:
+        ld1             {v0.4s, v1.4s}, [x0]
+        mov             v16.16b, v26.16b
+        mov             v17.16b, v26.16b
+        mov             v18.16b, v26.16b
+        mov             v19.16b, v26.16b
+        smlal           v16.2d, v0.2s, v25.2s
+        smlal2          v17.2d, v0.4s, v25.4s
+        smlal           v18.2d, v1.2s, v25.2s
+        smlal2          v19.2d, v1.4s, v25.4s
+        shrn            v0.2s, v16.2d, 18
+        shrn2           v0.4s, v17.2d, 18
+        shrn            v1.2s, v18.2d, 18
+        shrn2           v1.4s, v19.2d, 18
+        subs            w1, w1, #8
+.ifc \fromto, To
+        smin            v0.4s, v0.4s, v24.4s
+        smin            v1.4s, v1.4s, v24.4s
+.endif
+        st1             {v0.4s, v1.4s}, [x0], #32
+        b.gt            1b
+.else
         dup             v25.4s, w2
         dup             v26.4s, w3
 1:
@@ -46,17 +76,64 @@ function ff_lumRange\fromto\()Jpeg_neon, export=1
         subs            w1, w1, #8
         st1             {v0.8h}, [x0], #16
         b.gt            1b
+.endif
         ret
 endfunc
 .endm
 
-.macro chrConvertRange fromto
-function ff_chrRange\fromto\()Jpeg_neon, export=1
+.macro chrConvertRange fromto, bit_depth
+function ff_chrRange\fromto\()Jpeg\bit_depth\()_neon, export=1
 // x0  int16_t *dstU
 // x1  int16_t *dstV
 // w2  int width
 // w3  int coeff
 // x4  int64_t offset
+.if \bit_depth == 16
+.ifc \fromto, To
+        movi            v25.4s, #1
+        movi            v24.4s, #1<<3, lsl #16
+        sub             v24.4s, v24.4s, v25.4s
+.endif
+        dup             v25.4s, w3
+        dup             v26.2d, x4
+1:
+        ld1             {v0.4s, v1.4s}, [x0]
+        ld1             {v2.4s, v3.4s}, [x1]
+        mov             v16.16b, v26.16b
+        mov             v17.16b, v26.16b
+        mov             v18.16b, v26.16b
+        mov             v19.16b, v26.16b
+        mov             v20.16b, v26.16b
+        mov             v21.16b, v26.16b
+        mov             v22.16b, v26.16b
+        mov             v23.16b, v26.16b
+        smlal           v16.2d, v0.2s, v25.2s
+        smlal2          v17.2d, v0.4s, v25.4s
+        smlal           v18.2d, v1.2s, v25.2s
+        smlal2          v19.2d, v1.4s, v25.4s
+        smlal           v20.2d, v2.2s, v25.2s
+        smlal2          v21.2d, v2.4s, v25.4s
+        smlal           v22.2d, v3.2s, v25.2s
+        smlal2          v23.2d, v3.4s, v25.4s
+        shrn            v0.2s, v16.2d, 18
+        shrn2           v0.4s, v17.2d, 18
+        shrn            v1.2s, v18.2d, 18
+        shrn2           v1.4s, v19.2d, 18
+        shrn            v2.2s, v20.2d, 18
+        shrn2           v2.4s, v21.2d, 18
+        shrn            v3.2s, v22.2d, 18
+        shrn2           v3.4s, v23.2d, 18
+        subs            w2, w2, #8
+.ifc \fromto, To
+        smin            v0.4s, v0.4s, v24.4s
+        smin            v1.4s, v1.4s, v24.4s
+        smin            v2.4s, v2.4s, v24.4s
+        smin            v3.4s, v3.4s, v24.4s
+.endif
+        st1             {v0.4s, v1.4s}, [x0], #32
+        st1             {v2.4s, v3.4s}, [x1], #32
+        b.gt            1b
+.else
         dup             v25.4s, w3
         dup             v26.4s, w4
 1:
@@ -89,11 +166,16 @@ function ff_chrRange\fromto\()Jpeg_neon, export=1
         st1             {v0.8h}, [x0], #16
         st1             {v1.8h}, [x1], #16
         b.gt            1b
+.endif
         ret
 endfunc
 .endm
 
-lumConvertRange To
-chrConvertRange To
-lumConvertRange From
-chrConvertRange From
+lumConvertRange To,    8
+lumConvertRange To,   16
+chrConvertRange To,    8
+chrConvertRange To,   16
+lumConvertRange From,  8
+lumConvertRange From, 16
+chrConvertRange From,  8
+chrConvertRange From, 16
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 98f07ecfe5..55d8ffc281 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -218,14 +218,22 @@ NEON_INPUT(bgra32);
 NEON_INPUT(rgb24);
 NEON_INPUT(rgba32);
 
-void ff_lumRangeFromJpeg_neon(int16_t *dst, int width,
+void ff_lumRangeFromJpeg8_neon(int16_t *dst, int width,
+                               int coeff, int64_t offset);
+void ff_chrRangeFromJpeg8_neon(int16_t *dstU, int16_t *dstV, int width,
+                               int coeff, int64_t offset);
+void ff_lumRangeToJpeg8_neon(int16_t *dst, int width,
+                             int coeff, int64_t offset);
+void ff_chrRangeToJpeg8_neon(int16_t *dstU, int16_t *dstV, int width,
+                             int coeff, int64_t offset);
+void ff_lumRangeFromJpeg16_neon(int16_t *dst, int width,
+                                int coeff, int64_t offset);
+void ff_chrRangeFromJpeg16_neon(int16_t *dstU, int16_t *dstV, int width,
+                                int coeff, int64_t offset);
+void ff_lumRangeToJpeg16_neon(int16_t *dst, int width,
                               int coeff, int64_t offset);
-void ff_chrRangeFromJpeg_neon(int16_t *dstU, int16_t *dstV, int width,
+void ff_chrRangeToJpeg16_neon(int16_t *dstU, int16_t *dstV, int width,
                               int coeff, int64_t offset);
-void ff_lumRangeToJpeg_neon(int16_t *dst, int width,
-                            int coeff, int64_t offset);
-void ff_chrRangeToJpeg_neon(int16_t *dstU, int16_t *dstV, int width,
-                            int coeff, int64_t offset);
 
 av_cold void ff_sws_init_range_convert_aarch64(SwsContext *c)
 {
@@ -234,11 +242,19 @@ av_cold void ff_sws_init_range_convert_aarch64(SwsContext *c)
     if (have_neon(cpu_flags)) {
         if (c->dstBpc <= 14) {
             if (c->srcRange) {
-                c->lumConvertRange = ff_lumRangeFromJpeg_neon;
-                c->chrConvertRange = ff_chrRangeFromJpeg_neon;
+                c->lumConvertRange = ff_lumRangeFromJpeg8_neon;
+                c->chrConvertRange = ff_chrRangeFromJpeg8_neon;
             } else {
-                c->lumConvertRange = ff_lumRangeToJpeg_neon;
-                c->chrConvertRange = ff_chrRangeToJpeg_neon;
+                c->lumConvertRange = ff_lumRangeToJpeg8_neon;
+                c->chrConvertRange = ff_chrRangeToJpeg8_neon;
+            }
+        } else {
+            if (c->srcRange) {
+                c->lumConvertRange = ff_lumRangeFromJpeg16_neon;
+                c->chrConvertRange = ff_chrRangeFromJpeg16_neon;
+            } else {
+                c->lumConvertRange = ff_lumRangeToJpeg16_neon;
+                c->chrConvertRange = ff_chrRangeToJpeg16_neon;
             }
         }
     }
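
Reader's note, not part of the patch: the 16-bit NEON loops above can be read as a per-sample multiply, add, shift and (for the "To" variants) clamp. The scalar model below is a rough sketch of that reading. The function name range_convert16_model and the to_jpeg flag are hypothetical and exist only for illustration. It assumes the samples are 32-bit values (the loads use .4s lanes) and is not the C reference from libswscale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of one 16-bit NEON loop body; illustration only. */
static void range_convert16_model(int32_t *dst, int width, int coeff,
                                  int64_t offset, int to_jpeg)
{
    /* movi #1<<3, lsl #16 yields 1 << 19; the following sub removes 1. */
    const int32_t max = (1 << 19) - 1;

    assert(dst != NULL && width >= 0);
    for (int i = 0; i < width; i++) {
        /* smlal/smlal2: 64-bit accumulator preloaded with offset. */
        int64_t acc = offset + (int64_t)dst[i] * coeff;
        /* shrn/shrn2 #18: shift right by 18, keep the low 32 bits. */
        int32_t v = (int32_t)(uint32_t)((uint64_t)acc >> 18);
        /* smin: only assembled for the "To" variants. */
        if (to_jpeg && v > max)
            v = max;
        dst[i] = v;
    }
}
```

The model handles exactly width samples, whereas the assembly consumes 8 samples per iteration and loops while the remaining count is positive, so it appears to rely on the buffer being padded to a multiple of 8. The chroma variant applies the same arithmetic to dstU and dstV.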