From patchwork Sat Nov 18 02:06:37 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Logan.Lyu"
X-Patchwork-Id: 44702
Message-ID: <01e3c77f-56a3-4191-9637-df9999df694c@myais.com.cn>
Date: Sat, 18 Nov 2023 10:06:37 +0800
From: "Logan.Lyu"
To: ffmpeg-devel@ffmpeg.org
Cc: jdek@itanimul.li
Organization: myais
Subject: [FFmpeg-devel] [PATCH 1/6] lavc/aarch64: new optimization for 8-bit hevc_pel_bi_pixels

put_hevc_pel_bi_pixels4_8_c: 54.7
put_hevc_pel_bi_pixels4_8_neon: 43.0
put_hevc_pel_bi_pixels6_8_c: 94.7
put_hevc_pel_bi_pixels6_8_neon: 37.0
put_hevc_pel_bi_pixels8_8_c: 171.0
put_hevc_pel_bi_pixels8_8_neon: 24.0
put_hevc_pel_bi_pixels12_8_c: 354.0
put_hevc_pel_bi_pixels12_8_neon: 68.7
put_hevc_pel_bi_pixels16_8_c: 588.2
put_hevc_pel_bi_pixels16_8_neon: 77.5
put_hevc_pel_bi_pixels24_8_c: 1670.7
put_hevc_pel_bi_pixels24_8_neon: 173.0
put_hevc_pel_bi_pixels32_8_c: 2267.7
put_hevc_pel_bi_pixels32_8_neon: 281.2
put_hevc_pel_bi_pixels48_8_c: 5787.5
put_hevc_pel_bi_pixels48_8_neon: 673.5
put_hevc_pel_bi_pixels64_8_c: 9897.0
put_hevc_pel_bi_pixels64_8_neon: 1159.5

Co-Authored-By: J. Dekker
Signed-off-by: Logan Lyu
---
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 179 ++++++++++++++++++++++
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  10 +-
 2 files changed, 187 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 708b903b00..74165273d7 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -244,6 +244,185 @@ function ff_hevc_put_hevc_pel_pixels64_8_neon, export=1
 endfunc
 
+function ff_hevc_put_hevc_pel_bi_pixels4_8_neon, export=1
+        mov             x10, #(MAX_PB_SIZE * 2)
+1:      ld1             {v0.s}[0], [x2], x3 // src
+        ushll           v16.8h, v0.8b, #6
+        ld1             {v20.4h}, [x4], x10 // src2
+        sqadd           v16.8h, v16.8h, v20.8h
+        sqrshrun        v0.8b, v16.8h, #7
+        st1             {v0.s}[0], [x0], x1
+        subs            w5, w5, #1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_pel_bi_pixels6_8_neon, export=1
+        mov             x10, #(MAX_PB_SIZE * 2)
+        sub             x1, x1, #4
+1:      ld1             {v0.8b}, [x2], x3
+        ushll           v16.8h, v0.8b, #6
+        ld1             {v20.8h}, [x4], x10
+        sqadd           v16.8h, v16.8h, v20.8h
+        sqrshrun        v0.8b, v16.8h, #7
+        st1             {v0.s}[0], [x0], #4
+        st1             {v0.h}[2], [x0], x1
+        subs            w5, w5, #1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_pel_bi_pixels8_8_neon, export=1
+        mov             x10, #(MAX_PB_SIZE * 2)
+1:      ld1             {v0.8b}, [x2], x3 // src
+        ushll           v16.8h, v0.8b, #6
+        ld1             {v20.8h}, [x4], x10 // src2
+        sqadd           v16.8h, v16.8h, v20.8h
+        sqrshrun        v0.8b, v16.8h, #7
+        subs            w5, w5, #1
+        st1             {v0.8b}, [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_pel_bi_pixels12_8_neon, export=1
+        mov             x10, #(MAX_PB_SIZE * 2)
+        sub             x1, x1, #8
+1:      ld1             {v0.16b}, [x2], x3
+        ushll           v16.8h, v0.8b, #6
+        ushll2          v17.8h, v0.16b, #6
+        ld1             {v20.8h, v21.8h}, [x4], x10
+        sqadd           v16.8h, v16.8h, v20.8h
+        sqadd           v17.8h, v17.8h, v21.8h
+        sqrshrun        v0.8b, v16.8h, #7
+        sqrshrun2       v0.16b, v17.8h, #7
+        st1             {v0.8b}, [x0], #8
+        subs            w5, w5, #1
+        st1             {v0.s}[2], [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_pel_bi_pixels16_8_neon, export=1
+        mov             x10, #(MAX_PB_SIZE * 2)
+1:      ld1             {v0.16b}, [x2], x3 // src
+        ushll           v16.8h, v0.8b, #6
+        ushll2          v17.8h, v0.16b, #6
+        ld1             {v20.8h, v21.8h}, [x4], x10 // src2
+        sqadd           v16.8h, v16.8h, v20.8h
+        sqadd           v17.8h, v17.8h, v21.8h
+        sqrshrun        v0.8b, v16.8h, #7
+        sqrshrun2       v0.16b, v17.8h, #7
+        subs            w5, w5, #1
+        st1             {v0.16b}, [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_pel_bi_pixels24_8_neon, export=1
+        mov             x10, #(MAX_PB_SIZE * 2)
+1:      ld1             {v0.8b-v2.8b}, [x2], x3 // src
+        ushll           v16.8h, v0.8b, #6
+        ushll           v17.8h, v1.8b, #6
+        ushll           v18.8h, v2.8b, #6
+        ld1             {v20.8h-v22.8h}, [x4], x10 // src2
+        sqadd           v16.8h, v16.8h, v20.8h
+        sqadd           v17.8h, v17.8h, v21.8h
+        sqadd           v18.8h, v18.8h, v22.8h
+        sqrshrun        v0.8b, v16.8h, #7
+        sqrshrun        v1.8b, v17.8h, #7
+        sqrshrun        v2.8b, v18.8h, #7
+        subs            w5, w5, #1
+        st1             {v0.8b-v2.8b}, [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_pel_bi_pixels32_8_neon, export=1
+        mov             x10, #(MAX_PB_SIZE * 2)
+1:      ld1             {v0.16b-v1.16b}, [x2], x3 // src
+        ushll           v16.8h, v0.8b, #6
+        ushll2          v17.8h, v0.16b, #6
+        ushll           v18.8h, v1.8b, #6
+        ushll2          v19.8h, v1.16b, #6
+        ld1             {v20.8h-v23.8h}, [x4], x10 // src2
+        sqadd           v16.8h, v16.8h, v20.8h
+        sqadd           v17.8h, v17.8h, v21.8h
+        sqadd           v18.8h, v18.8h, v22.8h
+        sqadd           v19.8h, v19.8h, v23.8h
+        sqrshrun        v0.8b, v16.8h, #7
+        sqrshrun2       v0.16b, v17.8h, #7
+        sqrshrun        v1.8b, v18.8h, #7
+        sqrshrun2       v1.16b, v19.8h, #7
+        st1             {v0.16b-v1.16b}, [x0], x1
+        subs            w5, w5, #1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_pel_bi_pixels48_8_neon, export=1
+        mov             x10, #(MAX_PB_SIZE)
+1:      ld1             {v0.16b-v2.16b}, [x2], x3 // src
+        ushll           v16.8h, v0.8b, #6
+        ushll2          v17.8h, v0.16b, #6
+        ushll           v18.8h, v1.8b, #6
+        ushll2          v19.8h, v1.16b, #6
+        ushll           v20.8h, v2.8b, #6
+        ushll2          v21.8h, v2.16b, #6
+        ld1             {v24.8h-v27.8h}, [x4], #(MAX_PB_SIZE) // src2
+        sqadd           v16.8h, v16.8h, v24.8h
+        sqadd           v17.8h, v17.8h, v25.8h
+        sqadd           v18.8h, v18.8h, v26.8h
+        sqadd           v19.8h, v19.8h, v27.8h
+        ld1             {v24.8h-v25.8h}, [x4], x10
+        sqadd           v20.8h, v20.8h, v24.8h
+        sqadd           v21.8h, v21.8h, v25.8h
+        sqrshrun        v0.8b, v16.8h, #7
+        sqrshrun2       v0.16b, v17.8h, #7
+        sqrshrun        v1.8b, v18.8h, #7
+        sqrshrun2       v1.16b, v19.8h, #7
+        sqrshrun        v2.8b, v20.8h, #7
+        sqrshrun2       v2.16b, v21.8h, #7
+        subs            w5, w5, #1
+        st1             {v0.16b-v2.16b}, [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_pel_bi_pixels64_8_neon, export=1
+1:      ld1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x2], x3 // src
+        ushll           v16.8h, v0.8b, #6
+        ushll2          v17.8h, v0.16b, #6
+        ushll           v18.8h, v1.8b, #6
+        ushll2          v19.8h, v1.16b, #6
+        ushll           v20.8h, v2.8b, #6
+        ushll2          v21.8h, v2.16b, #6
+        ushll           v22.8h, v3.8b, #6
+        ushll2          v23.8h, v3.16b, #6
+        ld1             {v24.8h, v25.8h, v26.8h, v27.8h}, [x4], #(MAX_PB_SIZE) // src2
+        sqadd           v16.8h, v16.8h, v24.8h
+        sqadd           v17.8h, v17.8h, v25.8h
+        sqadd           v18.8h, v18.8h, v26.8h
+        sqadd           v19.8h, v19.8h, v27.8h
+        ld1             {v24.8h, v25.8h, v26.8h, v27.8h}, [x4], #(MAX_PB_SIZE)
+        sqadd           v20.8h, v20.8h, v24.8h
+        sqadd           v21.8h, v21.8h, v25.8h
+        sqadd           v22.8h, v22.8h, v26.8h
+        sqadd           v23.8h, v23.8h, v27.8h
+        sqrshrun        v0.8b, v16.8h, #7
+        sqrshrun2       v0.16b, v17.8h, #7
+        sqrshrun        v1.8b, v18.8h, #7
+        sqrshrun2       v1.16b, v19.8h, #7
+        sqrshrun        v2.8b, v20.8h, #7
+        sqrshrun2       v2.16b, v21.8h, #7
+        sqrshrun        v3.8b, v22.8h, #7
+        sqrshrun2       v3.16b, v23.8h, #7
+        st1             {v0.16b, v1.16b, v2.16b, v3.16b}, [x0], x1
+        subs            w5, w5, #1
+        b.ne            1b
+        ret
+endfunc
+
 function ff_hevc_put_hevc_epel_v4_8_neon, export=1
         load_epel_filterb x5, x4
         sub             x1, x1, x2
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index c51488275c..cf171023e7 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -156,8 +156,12 @@ NEON8_FNPROTO(pel_pixels, (int16_t *dst,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
 
-NEON8_FNPROTO(epel_v, (int16_t *dst,
-        const uint8_t *src, ptrdiff_t srcstride,
+NEON8_FNPROTO(pel_bi_pixels, (uint8_t *dst, ptrdiff_t dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
+        int height, intptr_t mx, intptr_t my, int width),);
+
+NEON8_FNPROTO(epel_v, (uint8_t *dst, ptrdiff_t dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
         int height, intptr_t mx, intptr_t my, int width),);
 
 NEON8_FNPROTO(pel_uni_pixels, (uint8_t *_dst, ptrdiff_t _dststride,
@@ -324,6 +328,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN(c->put_hevc_epel, 1, 0, epel_v,);
         NEON8_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels,);
         NEON8_FNASSIGN(c->put_hevc_qpel, 1, 0, qpel_v,);
+        NEON8_FNASSIGN(c->put_hevc_epel_bi, 0, 0, pel_bi_pixels,);
+        NEON8_FNASSIGN(c->put_hevc_qpel_bi, 0, 0, pel_bi_pixels,);
         NEON8_FNASSIGN(c->put_hevc_epel_uni, 0, 0, pel_uni_pixels,);
         NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 0, epel_uni_v,);
         NEON8_FNASSIGN(c->put_hevc_qpel_uni, 0, 0, pel_uni_pixels,);
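
For review purposes, the per-sample arithmetic that each NEON loop in this patch performs (ushll #6, sqadd, sqrshrun #7) can be modelled in scalar C. This is only a sketch under assumptions, not the project's reference code: the names pel_bi_pixels_ref and clip_u8 are hypothetical, MAX_PB_SIZE is assumed to be 64, and the unused mx/my arguments are dropped. It also does not model the 16-bit saturation of sqadd, so it matches the assembly only while the intermediate sum fits in int16_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* assumed: src2 row pitch in int16_t elements */

/* Hypothetical helper: clamp to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar model of the 8-bit bi-prediction copy:
 * widen src by 6 bits, add the 16-bit intermediate from src2,
 * then round, shift right by 7 and clamp to 0..255. */
static void pel_bi_pixels_ref(uint8_t *dst, ptrdiff_t dststride,
                              const uint8_t *src, ptrdiff_t srcstride,
                              const int16_t *src2, int height, int width)
{
    const int shift  = 7;                 /* 14 + 1 - 8 for 8-bit samples */
    const int offset = 1 << (shift - 1);  /* rounding term, 64 */

    assert(width > 0 && width <= MAX_PB_SIZE);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* relies on arithmetic right shift for negative sums */
            dst[x] = clip_u8(((src[x] << 6) + src2[x] + offset) >> shift);
        }
        dst  += dststride;
        src  += srcstride;
        src2 += MAX_PB_SIZE;  /* 128 bytes per row, as x10 in the asm */
    }
}
```

The src2 pointer advances by MAX_PB_SIZE elements per row, which corresponds to the MAX_PB_SIZE * 2 byte stride held in x10 (or the two MAX_PB_SIZE post-increments in the 48 and 64 wide functions).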