From patchwork Tue Sep 6 10:27:21 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hubert Mazur
X-Patchwork-Id: 37699
From: Hubert Mazur
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 6 Sep 2022 12:27:21 +0200
Message-Id: <20220906102722.53266-5-hum@semihalf.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20220906102722.53266-1-hum@semihalf.com>
References: <20220906102722.53266-1-hum@semihalf.com>
Subject: [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur,
 martin@martin.st, mw@semihalf.com, spop@amazon.com

Provide an optimized implementation of vsse_intra16 for arm64.

Performance tests are shown below.
- vsse_4_c: 153.7
- vsse_4_neon: 34.2

Benchmarks and tests were run with the checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  3 ++
 libavcodec/aarch64/me_cmp_neon.S         | 63 ++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index af83f7ed1e..8c295d5457 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -47,6 +47,8 @@ int vsad_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
                       ptrdiff_t stride, int h) ;
 int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
+int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
+                      ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -69,5 +71,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->vsad[4] = vsad_intra16_neon;
 
         c->vsse[0] = vsse16_neon;
+        c->vsse[4] = vsse_intra16_neon;
     }
 }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index ce198ea227..cf2b8da425 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -784,3 +784,66 @@ function vsad_intra16_neon, export=1
 
         ret
 endfunc
+
+function vsse_intra16_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *dummy
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        ld1             {v0.16b}, [x1], x3
+        movi            v16.4s, #0
+        movi            v17.4s, #0
+
+        sub             w4, w4, #1                      // we need to make h-1 iterations
+        cmp             w4, #3
+        b.lt            2f
+
+1:
+        // v = abs( pix1[0] - pix1[0 + stride] )
+        // score = sum( v * v )
+        ld1             {v1.16b}, [x1], x3
+        ld1             {v2.16b}, [x1], x3
+        uabd            v30.16b, v0.16b, v1.16b
+        ld1             {v3.16b}, [x1], x3
+        umull           v29.8h, v30.8b, v30.8b
+        umull2          v28.8h, v30.16b, v30.16b
+        uabd            v27.16b, v1.16b, v2.16b
+        uadalp          v16.4s, v29.8h
+        umull           v26.8h, v27.8b, v27.8b
+        umull2          v27.8h, v27.16b, v27.16b
+        uadalp          v17.4s, v28.8h
+        uabd            v25.16b, v2.16b, v3.16b
+        uadalp          v16.4s, v26.8h
+        umull           v24.8h, v25.8b, v25.8b
+        umull2          v25.8h, v25.16b, v25.16b
+        uadalp          v17.4s, v27.8h
+        sub             w4, w4, #3
+        uadalp          v16.4s, v24.8h
+        cmp             w4, #3
+        uadalp          v17.4s, v25.8h
+        mov             v0.16b, v3.16b
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v1.16b}, [x1], x3
+        subs            w4, w4, #1
+        uabd            v30.16b, v0.16b, v1.16b
+        mov             v0.16b, v1.16b
+        umull           v29.8h, v30.8b, v30.8b
+        umull2          v30.8h, v30.16b, v30.16b
+        uadalp          v16.4s, v29.8h
+        uadalp          v17.4s, v30.8h
+        cbnz            w4, 2b
+
+3:
+        add             v16.4s, v16.4s, v17.4S
+        uaddlv          d17, v16.4s
+        fmov            w0, s17
+
+        ret
+endfunc
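
For readers following the assembly, the value it accumulates can be sketched in scalar C. This is only an illustration based on the in-loop comment (`v = abs(pix1[0] - pix1[0 + stride])`, `score = sum(v * v)`) and on the `h - 1` iteration count. The name `vsse_intra16_ref` is hypothetical and not part of the patch, and the unused context and dummy arguments are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical scalar reference (not part of the patch): sum of squared
 * differences between vertically adjacent pixels of a 16-pixel-wide block.
 */
static int vsse_intra16_ref(const uint8_t *pix, ptrdiff_t stride, int h)
{
    int score = 0;

    /* h rows give h - 1 vertically adjacent row pairs. */
    for (int y = 1; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = pix[x] - pix[x + stride];
            score += d * d;   /* d * d equals abs(d) squared */
        }
        pix += stride;
    }
    return score;
}
```

The NEON routine appears to compute the same sum three row pairs at a time in its main loop, falling back to one pair per iteration for the remainder, with two 32-bit accumulators (`v16`, `v17`) that are combined at label `3`.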