From patchwork Mon Jun 24 11:36:59 2024
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 24 Jun 2024 19:36:59 +0800
Subject: [FFmpeg-devel] [PATCH v3 1/3] swscale/aarch64: Add bgr24 to yuv

From: Zhao Zhili

Test on Apple M1 with kperf

                             : -O3      : -O3 -fno-vectorize
bgr24_to_uv_8_c              : 28.5     : 52.5
bgr24_to_uv_8_neon           : 54.5     : 59.7
bgr24_to_uv_128_c            : 294.0    : 830.7
bgr24_to_uv_128_neon         : 99.7     : 112.0
bgr24_to_uv_1080_c           : 965.0    : 6624.0
bgr24_to_uv_1080_neon        : 751.5    : 754.7
bgr24_to_uv_1920_c           : 1693.2   : 11554.5
bgr24_to_uv_1920_neon        : 1292.5   : 1307.5
bgr24_to_uv_half_8_c         : 54.2     : 37.0
bgr24_to_uv_half_8_neon      : 27.2     : 22.5
bgr24_to_uv_half_128_c       : 127.2    : 392.5
bgr24_to_uv_half_128_neon    : 63.0     : 52.0
bgr24_to_uv_half_1080_c      : 880.2    : 3329.0
bgr24_to_uv_half_1080_neon   : 401.5    : 390.7
bgr24_to_uv_half_1920_c      : 1585.7   : 6390.7
bgr24_to_uv_half_1920_neon   : 694.7    : 698.7
bgr24_to_y_8_c               : 21.7     : 22.5
bgr24_to_y_8_neon            : 797.2    : 25.5
bgr24_to_y_128_c             : 88.0     : 280.5
bgr24_to_y_128_neon          : 63.7     : 55.0
bgr24_to_y_1080_c            : 616.7    : 2208.7
bgr24_to_y_1080_neon         : 900.0    : 452.0
bgr24_to_y_1920_c            : 1093.2   : 3894.7
bgr24_to_y_1920_neon         : 777.2    : 767.5
---
 libswscale/aarch64/input.S   | 71 ++++++++++++++++++++++++++----------
 libswscale/aarch64/swscale.c | 32 +++++++++-------
 2 files changed, 71 insertions(+), 32 deletions(-)

diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
index 33afa34111..2cfec4cb6a 100644
--- a/libswscale/aarch64/input.S
+++ b/libswscale/aarch64/input.S
@@ -20,7 +20,7 @@
 
 #include "libavutil/aarch64/asm.S"
 
-.macro rgb24_to_yuv_load_rgb, src
+.macro rgb_to_yuv_load_rgb src
         ld3             { v16.16b, v17.16b, v18.16b }, [\src]
         uxtl            v19.8h, v16.8b             // v19: r
         uxtl            v20.8h, v17.8b             // v20: g
@@ -30,7 +30,7 @@
         uxtl2           v24.8h, v18.16b            // v24: b
 .endm
 
-.macro rgb24_to_yuv_product, r, g, b, dst1, dst2, dst, coef0, coef1, coef2, right_shift
+.macro rgb_to_yuv_product r, g, b, dst1, dst2, dst, coef0, coef1, coef2, right_shift
         mov             \dst1\().16b, v6.16b                    // dst1 = const_offset
         mov             \dst2\().16b, v6.16b                    // dst2 = const_offset
         smlal           \dst1\().4s, \coef0\().4h, \r\().4h     // dst1 += rx * r
@@ -43,12 +43,20 @@
         sqshrn2         \dst\().8h, \dst2\().4s, \right_shift   // dst_higher_half = dst2 >> right_shift
 .endm
 
+function ff_bgr24ToY_neon, export=1
+        cmp             w4, #0                  // check width > 0
+        ldp             w12, w11, [x5]          // w12: ry, w11: gy
+        ldr             w10, [x5, #8]           // w10: by
+        b.gt            4f
+        ret
+endfunc
+
 function ff_rgb24ToY_neon, export=1
         cmp             w4, #0                  // check width > 0
         ldp             w10, w11, [x5]          // w10: ry, w11: gy
         ldr             w12, [x5, #8]           // w12: by
         b.le            3f
-
+4:
         mov             w9, #256                // w9 = 1 << (RGB2YUV_SHIFT - 7)
         movk            w9, #8, lsl #16         // w9 += 32 << (RGB2YUV_SHIFT - 1)
         dup             v6.4s, w9               // w9: const_offset
@@ -59,9 +67,9 @@ function ff_rgb24ToY_neon, export=1
         dup             v2.8h, w12
         b.lt            2f
 1:
-        rgb24_to_yuv_load_rgb x1
-        rgb24_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
-        rgb24_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
+        rgb_to_yuv_load_rgb x1
+        rgb_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
+        rgb_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
         sub             w4, w4, #16             // width -= 16
         add             x1, x1, #48             // src += 48
         cmp             w4, #16                 // width >= 16 ?
@@ -85,10 +93,7 @@ function ff_rgb24ToY_neon, export=1
         ret
 endfunc
 
-.macro rgb24_load_uv_coeff half
-        ldp             w10, w11, [x6, #12]     // w10: ru, w11: gu
-        ldp             w12, w13, [x6, #20]     // w12: bu, w13: rv
-        ldp             w14, w15, [x6, #28]     // w14: gv, w15: bv
+.macro rgb_set_uv_coeff half
         .if \half
         mov             w9, #512
         movk            w9, #128, lsl #16       // w9: const_offset
@@ -105,12 +110,26 @@ endfunc
         dup             v6.4s, w9
 .endm
 
+function ff_bgr24ToUV_half_neon, export=1
+        cmp             w5, #0          // check width > 0
+        b.le            3f
+
+        ldp             w12, w11, [x6, #12]
+        ldp             w10, w15, [x6, #20]
+        ldp             w14, w13, [x6, #28]
+        b               4f
+endfunc
+
 function ff_rgb24ToUV_half_neon, export=1
         cmp             w5, #0          // check width > 0
         b.le            3f
 
+        ldp             w10, w11, [x6, #12]     // w10: ru, w11: gu
+        ldp             w12, w13, [x6, #20]     // w12: bu, w13: rv
+        ldp             w14, w15, [x6, #28]     // w14: gv, w15: bv
+4:
         cmp             w5, #8
-        rgb24_load_uv_coeff half=1
+        rgb_set_uv_coeff half=1
         b.lt            2f
 1:
         ld3             { v16.16b, v17.16b, v18.16b }, [x3]
@@ -118,8 +137,8 @@ function ff_rgb24ToUV_half_neon, export=1
         uaddlp          v20.8h, v17.16b         // v20: g
         uaddlp          v21.8h, v18.16b         // v21: b
 
-        rgb24_to_yuv_product v19, v20, v21, v22, v23, v16, v0, v1, v2, #10
-        rgb24_to_yuv_product v19, v20, v21, v24, v25, v17, v3, v4, v5, #10
+        rgb_to_yuv_product v19, v20, v21, v22, v23, v16, v0, v1, v2, #10
+        rgb_to_yuv_product v19, v20, v21, v24, v25, v17, v3, v4, v5, #10
         sub             w5, w5, #8              // width -= 8
         add             x3, x3, #48             // src += 48
         cmp             w5, #8                  // width >= 8 ?
@@ -158,19 +177,33 @@ function ff_rgb24ToUV_half_neon, export=1
         ret
 endfunc
 
+function ff_bgr24ToUV_neon, export=1
+        cmp             w5, #0          // check width > 0
+        b.le            3f
+
+        ldp             w12, w11, [x6, #12]
+        ldp             w10, w15, [x6, #20]
+        ldp             w14, w13, [x6, #28]
+        b               4f
+endfunc
+
 function ff_rgb24ToUV_neon, export=1
         cmp             w5, #0          // check width > 0
         b.le            3f
 
+        ldp             w10, w11, [x6, #12]     // w10: ru, w11: gu
+        ldp             w12, w13, [x6, #20]     // w12: bu, w13: rv
+        ldp             w14, w15, [x6, #28]     // w14: gv, w15: bv
+4:
         cmp             w5, #16
-        rgb24_load_uv_coeff half=0
+        rgb_set_uv_coeff half=0
         b.lt            2f
 1:
-        rgb24_to_yuv_load_rgb x3
-        rgb24_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
-        rgb24_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
-        rgb24_to_yuv_product v19, v20, v21, v25, v26, v18, v3, v4, v5, #9
-        rgb24_to_yuv_product v22, v23, v24, v27, v28, v19, v3, v4, v5, #9
+        rgb_to_yuv_load_rgb x3
+        rgb_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
+        rgb_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
+        rgb_to_yuv_product v19, v20, v21, v25, v26, v18, v3, v4, v5, #9
+        rgb_to_yuv_product v22, v23, v24, v27, v28, v19, v3, v4, v5, #9
         sub             w5, w5, #16
         add             x3, x3, #48             // src += 48
         cmp             w5, #16
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index e4ea3309ba..c6594944c3 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -201,19 +201,18 @@ void ff_yuv2plane1_8_neon(
     default: break;                                                     \
     }
 
-void ff_rgb24ToY_neon(uint8_t *_dst, const uint8_t *src, const uint8_t *unused1,
-                      const uint8_t *unused2, int width,
-                      uint32_t *rgb2yuv, void *opq);
-
-void ff_rgb24ToUV_neon(uint8_t *_dstU, uint8_t *_dstV, const uint8_t *unused0,
-                       const uint8_t *src1,
-                       const uint8_t *src2, int width, uint32_t *rgb2yuv,
-                       void *opq);
-
-void ff_rgb24ToUV_half_neon(uint8_t *_dstU, uint8_t *_dstV, const uint8_t *unused0,
-                            const uint8_t *src1,
-                            const uint8_t *src2, int width, uint32_t *rgb2yuv,
-                            void *opq);
+#define NEON_INPUT(name) \
+void ff_##name##ToY_neon(uint8_t *dst, const uint8_t *src, const uint8_t *, \
+                         const uint8_t *, int w, uint32_t *coeffs, void *); \
+void ff_##name##ToUV_neon(uint8_t *, uint8_t *, const uint8_t *, \
+                          const uint8_t *, const uint8_t *, int w, \
+                          uint32_t *coeffs, void *); \
+void ff_##name##ToUV_half_neon(uint8_t *, uint8_t *, const uint8_t *, \
+                               const uint8_t *, const uint8_t *, int w, \
+                               uint32_t *coeffs, void *)
+
+NEON_INPUT(bgr24);
+NEON_INPUT(rgb24);
 
 void ff_lumRangeFromJpeg_neon(int16_t *dst, int width);
 void ff_chrRangeFromJpeg_neon(int16_t *dstU, int16_t *dstV, int width);
@@ -247,6 +246,13 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
             c->yuv2planeX = ff_yuv2planeX_8_neon;
         }
         switch (c->srcFormat) {
+        case AV_PIX_FMT_BGR24:
+            c->lumToYV12 = ff_bgr24ToY_neon;
+            if (c->chrSrcHSubSample)
+                c->chrToYV12 = ff_bgr24ToUV_half_neon;
+            else
+                c->chrToYV12 = ff_bgr24ToUV_neon;
+            break;
         case AV_PIX_FMT_RGB24:
             c->lumToYV12 = ff_rgb24ToY_neon;
             if (c->chrSrcHSubSample)
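In this patch each bgr24 entry point loads the rgb2yuv table with the red and blue weights swapped (ry/by for luma; ru/bu and rv/bv for chroma) and then branches to label 4 inside the matching rgb24 function, so the vector loop is shared. The C below is a hypothetical scalar model of that trick for luma only, not swscale's own C code. The helper names are made up, and RGB2YUV_SHIFT = 15 is an assumption inferred from the constants the assembly builds (256 = 1 << 8 and 8 << 16 = 32 << 14); the final shift of 9 matches the `#9` passed to rgb_to_yuv_product.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: inferred from the constants built by the NEON code. */
#define RGB2YUV_SHIFT 15

/* Hypothetical scalar model of the luma loop.
 * c0, c1, c2 weight the first three bytes of each pixel in memory order;
 * bpp is the pixel size in bytes (3 for 24-bit, 4 for 32-bit layouts). */
static void packed_to_y_row(int16_t *dst, const uint8_t *src, int width,
                            int bpp, int32_t c0, int32_t c1, int32_t c2)
{
    /* Same value as "mov w9, #256; movk w9, #8, lsl #16". */
    const int32_t offset = (32 << (RGB2YUV_SHIFT - 1)) + (1 << (RGB2YUV_SHIFT - 7));

    assert(bpp == 3 || bpp == 4);
    for (int i = 0; i < width; i++) {
        const uint8_t *p = src + i * bpp;
        dst[i] = (int16_t)((c0 * p[0] + c1 * p[1] + c2 * p[2] + offset)
                           >> (RGB2YUV_SHIFT - 6));
    }
}

/* RGB24: bytes are r, g, b, so the weights are used as ry, gy, by. */
static void rgb24_to_y_row(int16_t *dst, const uint8_t *src, int width,
                           const int32_t *coeffs)
{
    packed_to_y_row(dst, src, width, 3, coeffs[0], coeffs[1], coeffs[2]);
}

/* BGR24: bytes are b, g, r, so the same loop runs with ry and by swapped,
 * mirroring what ff_bgr24ToY_neon does before branching to label 4. */
static void bgr24_to_y_row(int16_t *dst, const uint8_t *src, int width,
                           const int32_t *coeffs)
{
    packed_to_y_row(dst, src, width, 3, coeffs[2], coeffs[1], coeffs[0]);
}
```

With illustrative weights {8414, 16519, 3208} (roughly BT.601 limited-range luma in Q15, chosen only for this example), white maps to 15040 = 235 << 6 and black to 1024 = 16 << 6, and a BGR24 row gives the same output as the byte-swapped RGB24 row.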
From patchwork Mon Jun 24 11:37:00 2024
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 24 Jun 2024 19:37:00 +0800
Subject: [FFmpeg-devel] [PATCH v3 2/3] swscale/aarch64: Add bgra/rgba to yuv

From: Zhao Zhili

Test on Apple M1 with kperf

                             : -O3      : -O3 -fno-vectorize
bgra_to_uv_8_c               : 13.4     : 27.5
bgra_to_uv_8_neon            : 37.4     : 41.7
bgra_to_uv_128_c             : 155.9    : 550.2
bgra_to_uv_128_neon          : 91.7     : 92.7
bgra_to_uv_1080_c            : 1173.2   : 4558.2
bgra_to_uv_1080_neon         : 822.7    : 809.5
bgra_to_uv_1920_c            : 2078.2   : 8115.2
bgra_to_uv_1920_neon         : 1437.7   : 1438.7
bgra_to_uv_half_8_c          : 17.9     : 14.2
bgra_to_uv_half_8_neon       : 37.4     : 10.5
bgra_to_uv_half_128_c        : 103.9    : 326.0
bgra_to_uv_half_128_neon     : 73.9     : 68.7
bgra_to_uv_half_1080_c       : 850.2    : 3732.0
bgra_to_uv_half_1080_neon    : 484.2    : 490.0
bgra_to_uv_half_1920_c       : 1479.2   : 4942.7
bgra_to_uv_half_1920_neon    : 824.2    : 824.7
bgra_to_y_8_c                : 8.2      : 29.5
bgra_to_y_8_neon             : 18.2     : 32.7
bgra_to_y_128_c              : 101.4    : 361.5
bgra_to_y_128_neon           : 74.9     : 73.7
bgra_to_y_1080_c             : 739.4    : 3018.0
bgra_to_y_1080_neon          : 613.4    : 544.2
bgra_to_y_1920_c             : 1298.7   : 5326.0
bgra_to_y_1920_neon          : 918.7    : 934.2
---
 libswscale/aarch64/input.S   | 91 ++++++++++++++++++++++++++++++------
 libswscale/aarch64/swscale.c | 16 +++++++
 2 files changed, 94 insertions(+), 13 deletions(-)

diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
index 2cfec4cb6a..6d2c6034bb 100644
--- a/libswscale/aarch64/input.S
+++ b/libswscale/aarch64/input.S
@@ -20,8 +20,12 @@
 
 #include "libavutil/aarch64/asm.S"
 
-.macro rgb_to_yuv_load_rgb src
+.macro rgb_to_yuv_load_rgb src, element=3
+        .if \element == 3
         ld3             { v16.16b, v17.16b, v18.16b }, [\src]
+        .else
+        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src]
+        .endif
         uxtl            v19.8h, v16.8b             // v19: r
         uxtl            v20.8h, v17.8b             // v20: g
         uxtl            v21.8h, v18.8b             // v21: b
@@ -51,7 +55,8 @@ function ff_bgr24ToY_neon, export=1
         ret
 endfunc
 
-function ff_rgb24ToY_neon, export=1
+.macro rgbToY_neon fmt, element
+function ff_\fmt\()ToY_neon, export=1
         cmp             w4, #0                  // check width > 0
         ldp             w10, w11, [x5]          // w10: ry, w11: gy
         ldr             w12, [x5, #8]           // w12: by
@@ -67,11 +72,11 @@ function ff_rgb24ToY_neon, export=1
         dup             v2.8h, w12
         b.lt            2f
 1:
-        rgb_to_yuv_load_rgb x1
+        rgb_to_yuv_load_rgb x1, \element
         rgb_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
         rgb_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
         sub             w4, w4, #16             // width -= 16
-        add             x1, x1, #48             // src += 48
+        add             x1, x1, #(16*\element)
         cmp             w4, #16                 // width >= 16 ?
         stp             q16, q17, [x0], #32     // store to dst
         b.ge            1b
@@ -86,12 +91,25 @@ function ff_rgb24ToY_neon, export=1
         smaddl          x13, w15, w12, x13      // x13 += by * b
         asr             w13, w13, #9            // x13 >>= 9
         sub             w4, w4, #1              // width--
-        add             x1, x1, #3              // src += 3
+        add             x1, x1, #\element
         strh            w13, [x0], #2           // store to dst
         cbnz            w4, 2b
 3:
         ret
 endfunc
+.endm
+
+rgbToY_neon fmt=rgb24, element=3
+
+function ff_bgra32ToY_neon, export=1
+        cmp             w4, #0                  // check width > 0
+        ldp             w12, w11, [x5]          // w12: ry, w11: gy
+        ldr             w10, [x5, #8]           // w10: by
+        b.gt            4f
+        ret
+endfunc
+
+rgbToY_neon fmt=rgba32, element=4
 
 .macro rgb_set_uv_coeff half
         .if \half
@@ -120,7 +138,8 @@ function ff_bgr24ToUV_half_neon, export=1
         b               4f
 endfunc
 
-function ff_rgb24ToUV_half_neon, export=1
+.macro rgbToUV_half_neon fmt, element
+function ff_\fmt\()ToUV_half_neon, export=1
         cmp             w5, #0          // check width > 0
         b.le            3f
 
@@ -132,7 +151,11 @@ function ff_rgb24ToUV_half_neon, export=1
         rgb_set_uv_coeff half=1
         b.lt            2f
 1:
+        .if \element == 3
         ld3             { v16.16b, v17.16b, v18.16b }, [x3]
+        .else
+        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3]
+        .endif
         uaddlp          v19.8h, v16.16b         // v19: r
         uaddlp          v20.8h, v17.16b         // v20: g
         uaddlp          v21.8h, v18.16b         // v21: b
@@ -140,7 +163,7 @@ function ff_rgb24ToUV_half_neon, export=1
         rgb_to_yuv_product v19, v20, v21, v22, v23, v16, v0, v1, v2, #10
         rgb_to_yuv_product v19, v20, v21, v24, v25, v17, v3, v4, v5, #10
         sub             w5, w5, #8              // width -= 8
-        add             x3, x3, #48             // src += 48
+        add             x3, x3, #(16*\element)
         cmp             w5, #8                  // width >= 8 ?
         str             q16, [x0], #16          // store dst_u
         str             q17, [x1], #16          // store dst_v
@@ -148,9 +171,10 @@ function ff_rgb24ToUV_half_neon, export=1
         cbz             w5, 3f
 2:
         ldrb            w2, [x3]                // w2: r1
-        ldrb            w4, [x3, #3]            // w4: r2
+        ldrb            w4, [x3, #\element]     // w4: r2
         add             w2, w2, w4              // w2 = r1 + r2
 
+        .if \element == 3
         ldrb            w4, [x3, #1]            // w4: g1
         ldrb            w7, [x3, #4]            // w7: g2
         add             w4, w4, w7              // w4 = g1 + g2
@@ -158,6 +182,15 @@ function ff_rgb24ToUV_half_neon, export=1
         ldrb            w7, [x3, #2]            // w7: b1
         ldrb            w8, [x3, #5]            // w8: b2
         add             w7, w7, w8              // w7 = b1 + b2
+        .else
+        ldrb            w4, [x3, #1]            // w4: g1
+        ldrb            w7, [x3, #5]            // w7: g2
+        add             w4, w4, w7              // w4 = g1 + g2
+
+        ldrb            w7, [x3, #2]            // w7: b1
+        ldrb            w8, [x3, #6]            // w8: b2
+        add             w7, w7, w8              // w7 = b1 + b2
+        .endif
 
         smaddl          x8, w2, w10, x9         // dst_u = ru * r + const_offset
         smaddl          x8, w4, w11, x8         // dst_u += gu * g
@@ -170,12 +203,28 @@ function ff_rgb24ToUV_half_neon, export=1
         smaddl          x8, w7, w15, x8         // dst_v += bv * b
         asr             x8, x8, #10             // dst_v >>= 10
         sub             w5, w5, #1
-        add             x3, x3, #6              // src += 6
+        ldrb            w4, [x3, #1]            // w4: g1
+        add             x3, x3, #(2*\element)
         strh            w8, [x1], #2            // store dst_v
         cbnz            w5, 2b
 3:
         ret
 endfunc
+.endm
+
+rgbToUV_half_neon fmt=rgb24, element=3
+
+function ff_bgra32ToUV_half_neon, export=1
+        cmp             w5, #0          // check width > 0
+        b.le            3f
+
+        ldp             w12, w11, [x6, #12]
+        ldp             w10, w15, [x6, #20]
+        ldp             w14, w13, [x6, #28]
+        b               4f
+endfunc
+
+rgbToUV_half_neon fmt=rgba32, element=4
 
 function ff_bgr24ToUV_neon, export=1
         cmp             w5, #0          // check width > 0
@@ -187,7 +236,8 @@ function ff_bgr24ToUV_neon, export=1
         b               4f
 endfunc
 
-function ff_rgb24ToUV_neon, export=1
+.macro rgbToUV_neon fmt, element
+function ff_\fmt\()ToUV_neon, export=1
         cmp             w5, #0          // check width > 0
         b.le            3f
 
@@ -199,13 +249,13 @@ function ff_rgb24ToUV_neon, export=1
         rgb_set_uv_coeff half=0
         b.lt            2f
 1:
-        rgb_to_yuv_load_rgb x3
+        rgb_to_yuv_load_rgb x3, \element
         rgb_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
         rgb_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
         rgb_to_yuv_product v19, v20, v21, v25, v26, v18, v3, v4, v5, #9
         rgb_to_yuv_product v22, v23, v24, v27, v28, v19, v3, v4, v5, #9
         sub             w5, w5, #16
-        add             x3, x3, #48             // src += 48
+        add             x3, x3, #(16*\element)
         cmp             w5, #16
         stp             q16, q17, [x0], #32     // store to dst_u
         stp             q18, q19, [x1], #32     // store to dst_v
@@ -227,9 +277,24 @@ function ff_rgb24ToUV_neon, export=1
         smaddl          x8, w4, w15, x8         // x8 += bv * b
         asr             w8, w8, #9              // x8 >>= 9
         sub             w5, w5, #1              // width--
-        add             x3, x3, #3              // src += 3
+        add             x3, x3, #\element
         strh            w8, [x1], #2            // store to dst_v
         cbnz            w5, 2b
 3:
         ret
 endfunc
+.endm
+
+rgbToUV_neon fmt=rgb24, element=3
+
+function ff_bgra32ToUV_neon, export=1
+        cmp             w5, #0          // check width > 0
+        b.le            3f
+
+        ldp             w12, w11, [x6, #12]
+        ldp             w10, w15, [x6, #20]
+        ldp             w14, w13, [x6, #28]
+        b               4f
+endfunc
+
+rgbToUV_neon fmt=rgba32, element=4
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index c6594944c3..92af662014 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -212,7 +212,9 @@ void ff_##name##ToUV_half_neon(uint8_t *, uint8_t *, const uint8_t *, \
                                uint32_t *coeffs, void *)
 
 NEON_INPUT(bgr24);
+NEON_INPUT(bgra32);
 NEON_INPUT(rgb24);
+NEON_INPUT(rgba32);
 
 void ff_lumRangeFromJpeg_neon(int16_t *dst, int width);
 void ff_chrRangeFromJpeg_neon(int16_t *dstU, int16_t *dstV, int width);
@@ -253,6 +255,13 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
             else
                 c->chrToYV12 = ff_bgr24ToUV_neon;
             break;
+        case AV_PIX_FMT_BGRA:
+            c->lumToYV12 = ff_bgra32ToY_neon;
+            if (c->chrSrcHSubSample)
+                c->chrToYV12 = ff_bgra32ToUV_half_neon;
+            else
+                c->chrToYV12 = ff_bgra32ToUV_neon;
+            break;
         case AV_PIX_FMT_RGB24:
             c->lumToYV12 = ff_rgb24ToY_neon;
             if (c->chrSrcHSubSample)
@@ -260,6 +269,13 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
             else
                 c->chrToYV12 = ff_rgb24ToUV_neon;
             break;
+        case AV_PIX_FMT_RGBA:
+            c->lumToYV12 = ff_rgba32ToY_neon;
+            if (c->chrSrcHSubSample)
+                c->chrToYV12 = ff_rgba32ToUV_half_neon;
+            else
+                c->chrToYV12 = ff_rgba32ToUV_neon;
+            break;
         default:
             break;
         }
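The element parameter introduced here lets one macro body serve both 3-byte and 4-byte pixels: ld3 versus ld4 in the vector loop, a source step of 16*element bytes per 16 input pixels, and, in the scalar tail of the half-width path, second-pixel offsets of element, element+1 and element+2. Below is a hypothetical scalar model of that half-width chroma tail. The names are invented for illustration, RGB2YUV_SHIFT = 15 is an assumption, and the offset reproduces the constant built by `mov w9, #512` / `movk w9, #128, lsl #16` together with the `#10` shift.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: inferred from the constants built by the NEON code. */
#define RGB2YUV_SHIFT 15

/* Hypothetical scalar model of the horizontally subsampled chroma path.
 * width counts output samples; each one consumes two input pixels.
 * element is the pixel size in bytes (3 or 4). coeffs follows the layout the
 * assembly reads at byte offsets 12..32: coeffs[3..5] = ru, gu, bu and
 * coeffs[6..8] = rv, gv, bv. */
static void packed_to_uv_half_row(int16_t *dst_u, int16_t *dst_v,
                                  const uint8_t *src, int width, int element,
                                  int swap_rb, const int32_t *coeffs)
{
    const int32_t offset = (256 << RGB2YUV_SHIFT) + (1 << (RGB2YUV_SHIFT - 6));
    int32_t ru = coeffs[3], gu = coeffs[4], bu = coeffs[5];
    int32_t rv = coeffs[6], gv = coeffs[7], bv = coeffs[8];

    assert(element == 3 || element == 4);
    if (swap_rb) { /* BGR/BGRA: the first byte of each pixel is blue */
        int32_t t;
        t = ru; ru = bu; bu = t;
        t = rv; rv = bv; bv = t;
    }
    for (int i = 0; i < width; i++) {
        const uint8_t *p = src + 2 * i * element;
        /* Sum each channel over two horizontally adjacent pixels. */
        int32_t c0 = p[0] + p[element];
        int32_t c1 = p[1] + p[element + 1];
        int32_t c2 = p[2] + p[element + 2];
        dst_u[i] = (int16_t)((ru * c0 + gu * c1 + bu * c2 + offset) >> (RGB2YUV_SHIFT - 5));
        dst_v[i] = (int16_t)((rv * c0 + gv * c1 + bv * c2 + offset) >> (RGB2YUV_SHIFT - 5));
    }
}
```

The assembly needs no swap flag: each BGR/BGRA entry point loads the swapped coefficients into w10..w15 itself and jumps to the shared label 4, so swap_rb here only stands in for that choice of entry point. With illustrative zero-sum chroma weights, a gray pair (128, 128, 128) maps to 8192 = 128 << 6 for both U and V.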
From patchwork Mon Jun 24 11:37:01 2024
From: Zhao Zhili
To: ffmpeg-devel@ffmpeg.org
Cc: Zhao Zhili
Date: Mon, 24 Jun 2024 19:37:01 +0800
Subject: [FFmpeg-devel] [PATCH v3 3/3] swscale/aarch64: Add argb/abgr to yuv
In-Reply-To: <20240624113701.94616-1-quinkblack@foxmail.com>
References: <20240624113701.94616-1-quinkblack@foxmail.com>

From: Zhao Zhili

Tested on Apple M1 with kperf. The two timing columns are for builds
with -O3 and with -O3 -fno-vectorize:

                          : -O3     : -O3 -fno-vectorize
abgr_to_uv_8_c            : 19.4    : 26.1
abgr_to_uv_8_neon         : 29.9    : 51.1
abgr_to_uv_128_c          : 146.4   : 558.9
abgr_to_uv_128_neon       : 85.1    : 83.4
abgr_to_uv_1080_c         : 1162.6  : 4786.4
abgr_to_uv_1080_neon      : 819.6   : 826.6
abgr_to_uv_1920_c         : 2063.6  : 8492.1
abgr_to_uv_1920_neon      : 1435.1  : 1447.1
abgr_to_uv_half_8_c       : 16.4    : 11.4
abgr_to_uv_half_8_neon    : 35.6    : 20.4
abgr_to_uv_half_128_c     : 108.6   : 359.4
abgr_to_uv_half_128_neon  : 75.4    : 42.6
abgr_to_uv_half_1080_c    : 883.4   : 2885.6
abgr_to_uv_half_1080_neon : 460.6   : 481.1
abgr_to_uv_half_1920_c    : 1553.6  : 5106.9
abgr_to_uv_half_1920_neon : 817.6   : 820.4
abgr_to_y_8_c             : 6.1     : 26.4
abgr_to_y_8_neon          : 40.6    : 6.4
abgr_to_y_128_c           : 99.9    : 390.1
abgr_to_y_128_neon        : 67.4    : 55.9
abgr_to_y_1080_c          : 735.9   : 3170.4
abgr_to_y_1080_neon       : 534.6   : 536.6
abgr_to_y_1920_c          : 1279.4  : 6016.4
abgr_to_y_1920_neon       : 932.6   : 927.6
---
 libswscale/aarch64/input.S   | 114 ++++++++++++++++++++++++++++-------
 libswscale/aarch64/swscale.c |  17 ++++++
 2 files changed, 110 insertions(+), 21 deletions(-)

diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
index 6d2c6034bb..f4d587fed0 100644
--- a/libswscale/aarch64/input.S
+++ b/libswscale/aarch64/input.S
@@ -34,6 +34,16 @@
         uxtl2           v24.8h, v18.16b             // v24: b
 .endm
 
+.macro argb_to_yuv_load_rgb src
+        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [\src]
+        uxtl            v21.8h, v19.8b              // v21: b
+        uxtl2           v24.8h, v19.16b             // v24: b
+        uxtl            v19.8h, v17.8b              // v19: r
+        uxtl            v20.8h, v18.8b              // v20: g
+        uxtl2           v22.8h, v17.16b             // v22: r
+        uxtl2           v23.8h, v18.16b             // v23: g
+.endm
+
 .macro rgb_to_yuv_product r, g, b, dst1, dst2, dst, coef0, coef1, coef2, right_shift
         mov             \dst1\().16b, v6.16b        // dst1 = const_offset
         mov             \dst2\().16b, v6.16b        // dst2 = const_offset
@@ -55,7 +65,7 @@ function ff_bgr24ToY_neon, export=1
         ret
 endfunc
 
-.macro rgbToY_neon fmt, element
+.macro rgbToY_neon fmt, element, alpha_first=0
 function ff_\fmt\()ToY_neon, export=1
         cmp             w4, #0                  // check width > 0
         ldp             w10, w11, [x5]          // w10: ry, w11: gy
@@ -72,7 +82,11 @@ function ff_\fmt\()ToY_neon, export=1
         dup             v2.8h, w12
         b.lt            2f
 1:
+        .if \alpha_first
+        argb_to_yuv_load_rgb x1
+        .else
         rgb_to_yuv_load_rgb x1, \element
+        .endif
         rgb_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
         rgb_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
         sub             w4, w4, #16             // width -= 16
@@ -82,9 +96,15 @@ function ff_\fmt\()ToY_neon, export=1
         b.ge            1b
         cbz             x4, 3f
 2:
+        .if \alpha_first
+        ldrb            w13, [x1, #1]           // w13: r
+        ldrb            w14, [x1, #2]           // w14: g
+        ldrb            w15, [x1, #3]           // w15: b
+        .else
         ldrb            w13, [x1]               // w13: r
         ldrb            w14, [x1, #1]           // w14: g
         ldrb            w15, [x1, #2]           // w15: b
+        .endif
 
         smaddl          x13, w13, w10, x9       // x13 = ry * r + const_offset
         smaddl          x13, w14, w11, x13      // x13 += gy * g
@@ -101,6 +121,16 @@ endfunc
 
 rgbToY_neon fmt=rgb24, element=3
 
+function ff_abgr32ToY_neon, export=1
+        cmp             w4, #0                  // check width > 0
+        ldp             w12, w11, [x5]          // w12: ry, w11: gy
+        ldr             w10, [x5, #8]           // w10: by
+        b.gt            4f
+        ret
+endfunc
+
+rgbToY_neon fmt=argb32, element=4, alpha_first=1
+
 function ff_bgra32ToY_neon, export=1
         cmp             w4, #0                  // check width > 0
         ldp             w12, w11, [x5]          // w12: ry, w11: gy
@@ -138,7 +168,21 @@ function ff_bgr24ToUV_half_neon, export=1
         b               4f
 endfunc
 
-.macro rgbToUV_half_neon fmt, element
+.macro rgb_load_add_half off_r1, off_r2, off_g1, off_g2, off_b1, off_b2
+        ldrb            w2, [x3, #\off_r1]      // w2: r1
+        ldrb            w4, [x3, #\off_r2]      // w4: r2
+        add             w2, w2, w4              // w2 = r1 + r2
+
+        ldrb            w4, [x3, #\off_g1]      // w4: g1
+        ldrb            w7, [x3, #\off_g2]      // w7: g2
+        add             w4, w4, w7              // w4 = g1 + g2
+
+        ldrb            w7, [x3, #\off_b1]      // w7: b1
+        ldrb            w8, [x3, #\off_b2]      // w8: b2
+        add             w7, w7, w8              // w7 = b1 + b2
+.endm
+
+.macro rgbToUV_half_neon fmt, element, alpha_first=0
 function ff_\fmt\()ToUV_half_neon, export=1
         cmp             w5, #0                  // check width > 0
         b.le            3f
@@ -156,9 +200,15 @@ function ff_\fmt\()ToUV_half_neon, export=1
         .else
         ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3]
         .endif
+        .if \alpha_first
+        uaddlp          v21.8h, v19.16b
+        uaddlp          v20.8h, v18.16b
+        uaddlp          v19.8h, v17.16b
+        .else
         uaddlp          v19.8h, v16.16b         // v19: r
         uaddlp          v20.8h, v17.16b         // v20: g
         uaddlp          v21.8h, v18.16b         // v21: b
+        .endif
 
         rgb_to_yuv_product v19, v20, v21, v22, v23, v16, v0, v1, v2, #10
         rgb_to_yuv_product v19, v20, v21, v24, v25, v17, v3, v4, v5, #10
@@ -170,27 +220,15 @@ function ff_\fmt\()ToUV_half_neon, export=1
         b.ge            1b
         cbz             w5, 3f
 2:
-        ldrb            w2, [x3]                // w2: r1
-        ldrb            w4, [x3, #\element]     // w4: r2
-        add             w2, w2, w4              // w2 = r1 + r2
-
+.if \alpha_first
+        rgb_load_add_half 1, 5, 2, 6, 3, 7
+.else
 .if \element == 3
-        ldrb            w4, [x3, #1]            // w4: g1
-        ldrb            w7, [x3, #4]            // w7: g2
-        add             w4, w4, w7              // w4 = g1 + g2
-
-        ldrb            w7, [x3, #2]            // w7: b1
-        ldrb            w8, [x3, #5]            // w8: b2
-        add             w7, w7, w8              // w7 = b1 + b2
+        rgb_load_add_half 0, 3, 1, 4, 2, 5
 .else
-        ldrb            w4, [x3, #1]            // w4: g1
-        ldrb            w7, [x3, #5]            // w7: g2
-        add             w4, w4, w7              // w4 = g1 + g2
-
-        ldrb            w7, [x3, #2]            // w7: b1
-        ldrb            w8, [x3, #6]            // w8: b2
-        add             w7, w7, w8              // w7 = b1 + b2
+        rgb_load_add_half 0, 4, 1, 5, 2, 6
 .endif
+.endif
 
         smaddl          x8, w2, w10, x9         // dst_u = ru * r + const_offset
         smaddl          x8, w4, w11, x8         // dst_u += gu * g
@@ -214,6 +252,18 @@ endfunc
 
 rgbToUV_half_neon fmt=rgb24, element=3
 
+function ff_abgr32ToUV_half_neon, export=1
+        cmp             w5, #0                  // check width > 0
+        b.le            3f
+
+        ldp             w12, w11, [x6, #12]
+        ldp             w10, w15, [x6, #20]
+        ldp             w14, w13, [x6, #28]
+        b               4f
+endfunc
+
+rgbToUV_half_neon fmt=argb32, element=4, alpha_first=1
+
 function ff_bgra32ToUV_half_neon, export=1
         cmp             w5, #0                  // check width > 0
         b.le            3f
@@ -236,7 +286,7 @@ function ff_bgr24ToUV_neon, export=1
         b               4f
 endfunc
 
-.macro rgbToUV_neon fmt, element
+.macro rgbToUV_neon fmt, element, alpha_first=0
 function ff_\fmt\()ToUV_neon, export=1
         cmp             w5, #0                  // check width > 0
         b.le            3f
@@ -249,7 +299,11 @@ function ff_\fmt\()ToUV_neon, export=1
         rgb_set_uv_coeff half=0
         b.lt            2f
 1:
+        .if \alpha_first
+        argb_to_yuv_load_rgb x3
+        .else
         rgb_to_yuv_load_rgb x3, \element
+        .endif
         rgb_to_yuv_product v19, v20, v21, v25, v26, v16, v0, v1, v2, #9
         rgb_to_yuv_product v22, v23, v24, v27, v28, v17, v0, v1, v2, #9
         rgb_to_yuv_product v19, v20, v21, v25, v26, v18, v3, v4, v5, #9
@@ -262,9 +316,15 @@ function ff_\fmt\()ToUV_neon, export=1
         b.ge            1b
         cbz             w5, 3f
 2:
+        .if \alpha_first
+        ldrb            w16, [x3, #1]           // w16: r
+        ldrb            w17, [x3, #2]           // w17: g
+        ldrb            w4, [x3, #3]            // w4: b
+        .else
         ldrb            w16, [x3]               // w16: r
         ldrb            w17, [x3, #1]           // w17: g
         ldrb            w4, [x3, #2]            // w4: b
+        .endif
 
         smaddl          x8, w16, w10, x9        // x8 = ru * r + const_offset
         smaddl          x8, w17, w11, x8        // x8 += gu * g
@@ -287,6 +347,18 @@ endfunc
 
 rgbToUV_neon fmt=rgb24, element=3
 
+function ff_abgr32ToUV_neon, export=1
+        cmp             w5, #0                  // check width > 0
+        b.le            3f
+
+        ldp             w12, w11, [x6, #12]
+        ldp             w10, w15, [x6, #20]
+        ldp             w14, w13, [x6, #28]
+        b               4f
+endfunc
+
+rgbToUV_neon fmt=argb32, element=4, alpha_first=1
+
 function ff_bgra32ToUV_neon, export=1
         cmp             w5, #0                  // check width > 0
         b.le            3f
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 92af662014..eb907284e7 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -211,6 +211,8 @@ void ff_##name##ToUV_half_neon(uint8_t *, uint8_t *, const uint8_t *, \
                                   const uint8_t *, const uint8_t *, int w, \
                                   uint32_t *coeffs, void *)
 
+NEON_INPUT(abgr32);
+NEON_INPUT(argb32);
 NEON_INPUT(bgr24);
 NEON_INPUT(bgra32);
 NEON_INPUT(rgb24);
@@ -248,6 +250,21 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c)
             c->yuv2planeX = ff_yuv2planeX_8_neon;
         }
         switch (c->srcFormat) {
+        case AV_PIX_FMT_ABGR:
+            c->lumToYV12 = ff_abgr32ToY_neon;
+            if (c->chrSrcHSubSample)
+                c->chrToYV12 = ff_abgr32ToUV_half_neon;
+            else
+                c->chrToYV12 = ff_abgr32ToUV_neon;
+            break;
+
+        case AV_PIX_FMT_ARGB:
+            c->lumToYV12 = ff_argb32ToY_neon;
+            if (c->chrSrcHSubSample)
+                c->chrToYV12 = ff_argb32ToUV_half_neon;
+            else
+                c->chrToYV12 = ff_argb32ToUV_neon;
+            break;
         case AV_PIX_FMT_BGR24:
             c->lumToYV12 = ff_bgr24ToY_neon;
             if (c->chrSrcHSubSample)
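For review, a minimal scalar C sketch of what the `.if \alpha_first` tail loop of the ToY kernel computes may help. It also shows why `ff_abgr32ToY_neon` can reuse the ARGB body by loading `ry` into w12 and `by` into w10, which swaps the weights applied to bytes 1 and 3. The constants (`RGB2YUV_SHIFT` = 15 and the luma offset) are assumptions modelled on libswscale's scalar input code and do not come from this patch. `alpha_first_to_y`, `argb_to_y` and `abgr_to_y` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants, modelled on libswscale's scalar RGB input code
 * (RGB2YUV_SHIFT = 15); they are not part of this patch. */
#define RGB2YUV_SHIFT 15
#define Y_CONST_OFFSET ((32 << (RGB2YUV_SHIFT - 1)) + (1 << (RGB2YUV_SHIFT - 7)))

/* Hypothetical scalar model of the ToY tail loop for alpha-first pixels.
 * The colour bytes sit at offsets 1, 2 and 3 of each 4-byte pixel;
 * c1, c2 and c3 are the coefficients applied to those three bytes. */
static void alpha_first_to_y(int16_t *dst, const uint8_t *src, int width,
                             int32_t c1, int32_t c2, int32_t c3)
{
    assert(width >= 0);
    for (int i = 0; i < width; i++) {
        const uint8_t *px = src + 4 * i;
        int64_t acc = (int64_t)c1 * px[1] + (int64_t)c2 * px[2] +
                      (int64_t)c3 * px[3] + Y_CONST_OFFSET;
        dst[i] = (int16_t)(acc >> (RGB2YUV_SHIFT - 6)); /* >> 9, like the asm */
    }
}

/* ARGB stores R, G, B at bytes 1..3: pass (ry, gy, by) unchanged. */
static void argb_to_y(int16_t *dst, const uint8_t *src, int width,
                      int32_t ry, int32_t gy, int32_t by)
{
    alpha_first_to_y(dst, src, width, ry, gy, by);
}

/* ABGR stores B, G, R at bytes 1..3: reuse the same kernel with ry and by
 * swapped, mirroring the register loads in ff_abgr32ToY_neon. */
static void abgr_to_y(int16_t *dst, const uint8_t *src, int width,
                      int32_t ry, int32_t gy, int32_t by)
{
    alpha_first_to_y(dst, src, width, by, gy, ry);
}
```

With illustrative BT.601 limited-range weights scaled by 2^15 (ry = 8414, gy = 16519, by = 3208), black maps to 1024 (16 << 6), white to 15040 (235 << 6), and the ARGB and ABGR encodings of the same colour produce identical output.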
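A similar sketch, under the same assumed constants, models the half-width chroma tail loop. `rgb_load_add_half 1, 5, 2, 6, 3, 7` sums each colour channel of two horizontally adjacent alpha-first pixels. The product is then shifted right by 10 instead of 9, because the pair sum carries one extra bit. The doubled offset is an assumption chosen to match that shift. `alpha_first_to_uv_half` is a hypothetical name, and the weights in the usage note are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants modelled on libswscale's scalar half-width chroma code
 * (RGB2YUV_SHIFT = 15); not part of this patch. */
#define RGB2YUV_SHIFT 15
#define UV_HALF_CONST_OFFSET ((256 << RGB2YUV_SHIFT) + (1 << (RGB2YUV_SHIFT - 6)))

/* Hypothetical scalar model of the UV_half tail loop for alpha-first input.
 * Each output sample sums two adjacent 4-byte pixels, reading bytes
 * (1, 5), (2, 6) and (3, 7) like "rgb_load_add_half 1, 5, 2, 6, 3, 7".
 * coef[0..2] are the U weights for bytes 1..3, coef[3..5] the V weights. */
static void alpha_first_to_uv_half(int16_t *dst_u, int16_t *dst_v,
                                   const uint8_t *src, int width,
                                   const int32_t coef[6])
{
    assert(width >= 0);
    for (int i = 0; i < width; i++) {
        const uint8_t *px = src + 8 * i;   /* two input pixels per output */
        int32_t s1 = px[1] + px[5];        /* r1 + r2 for ARGB */
        int32_t s2 = px[2] + px[6];        /* g1 + g2 */
        int32_t s3 = px[3] + px[7];        /* b1 + b2 */
        int64_t u = (int64_t)coef[0] * s1 + (int64_t)coef[1] * s2 +
                    (int64_t)coef[2] * s3 + UV_HALF_CONST_OFFSET;
        int64_t v = (int64_t)coef[3] * s1 + (int64_t)coef[4] * s2 +
                    (int64_t)coef[5] * s3 + UV_HALF_CONST_OFFSET;
        /* Shift by 10 (not 9): the two-pixel sum is twice as large. */
        dst_u[i] = (int16_t)(u >> (RGB2YUV_SHIFT - 5));
        dst_v[i] = (int16_t)(v >> (RGB2YUV_SHIFT - 5));
    }
}
```

With the illustrative weights { -4857, -9535, 14392 } for U and { 14392, -12052, -2340 } for V, a pair of grey pixels maps to 8192 (128 << 6) in both planes. A pair of pure-red ARGB pixels gives U = 5773 and V = 15360 (240 << 6).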