From patchwork Mon Sep 26 09:15:01 2022
X-Patchwork-Submitter: Grzegorz Bernacki
X-Patchwork-Id: 38326
From: Grzegorz Bernacki
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 26 Sep 2022 11:15:01 +0200
Message-Id: <20220926091504.3459110-1-gjb@semihalf.com>
X-Mailer: git-send-email 2.29.0
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Subject: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: Add neon implementation for pix_abs8 functions.
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, hum@semihalf.com, martin@martin.st, mw@semihalf.com, spop@amazon.com

Provide optimized implementations of the pix_abs8 functions for arm64.

Performance comparison tests are shown below:
pix_abs_1_1_c: 162.5
pix_abs_1_1_neon: 27.0
pix_abs_1_2_c: 174.0
pix_abs_1_2_neon: 23.5
pix_abs_1_3_c: 203.2
pix_abs_1_3_neon: 34.7

Benchmarks and tests were run with the checkasm tool on AWS Graviton 3.

Signed-off-by: Grzegorz Bernacki
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |   9 ++
 libavcodec/aarch64/me_cmp_neon.S         | 193 +++++++++++++++++++++++
 2 files changed, 202 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index e143f0816e..3459403ee5 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -59,6 +59,12 @@ int pix_median_abs16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t
                           ptrdiff_t stride, int h);
 int pix_median_abs8_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
                          ptrdiff_t stride, int h);
+int ff_pix_abs8_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+                        ptrdiff_t stride, int h);
+int ff_pix_abs8_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+                        ptrdiff_t stride, int h);
+int ff_pix_abs8_xy2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2,
+                         ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -70,6 +76,9 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->pix_abs[0][2] = ff_pix_abs16_y2_neon;
         c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
         c->pix_abs[1][0] = ff_pix_abs8_neon;
+        c->pix_abs[1][1] = ff_pix_abs8_x2_neon;
+        c->pix_abs[1][2] = ff_pix_abs8_y2_neon;
+        c->pix_abs[1][3] = ff_pix_abs8_xy2_neon;
 
         c->sad[0] = ff_pix_abs16_neon;
         c->sad[1] = ff_pix_abs8_neon;
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index 11af4849f9..e03c0c26cd 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -119,6 +119,199 @@ function ff_pix_abs8_neon, export=1
         ret
 endfunc
 
+function ff_pix_abs8_x2_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        cmp             w4, #4
+        movi            v26.8h, #0
+        add             x5, x2, #1 // pix2 + 1
+        b.lt            2f
+
+// make 4 iterations at once
+1:
+        ld1             {v1.8b}, [x2], x3
+        ld1             {v2.8b}, [x5], x3
+        ld1             {v0.8b}, [x1], x3
+        ld1             {v4.8b}, [x2], x3
+        urhadd          v30.8b, v1.8b, v2.8b
+        ld1             {v5.8b}, [x5], x3
+        uabal           v26.8h, v0.8b, v30.8b
+        ld1             {v6.8b}, [x1], x3
+        urhadd          v29.8b, v4.8b, v5.8b
+        ld1             {v7.8b}, [x2], x3
+        ld1             {v20.8b}, [x5], x3
+        uabal           v26.8h, v6.8b, v29.8b
+        ld1             {v21.8b}, [x1], x3
+        urhadd          v28.8b, v7.8b, v20.8b
+        ld1             {v22.8b}, [x2], x3
+        ld1             {v23.8b}, [x5], x3
+        uabal           v26.8h, v21.8b, v28.8b
+        sub             w4, w4, #4
+        ld1             {v24.8b}, [x1], x3
+        urhadd          v27.8b, v22.8b, v23.8b
+        cmp             w4, #4
+        uabal           v26.8h, v24.8b, v27.8b
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v1.8b}, [x2], x3
+        ld1             {v2.8b}, [x5], x3
+        ld1             {v0.8b}, [x1], x3
+        urhadd          v30.8b, v1.8b, v2.8b
+        subs            w4, w4, #1
+        uabal           v26.8h, v0.8b, v30.8b
+
+        b.ne            2b
+3:
+        uaddlv          s20, v26.8h
+        fmov            w0, s20
+
+        ret
+
+endfunc
+
+function ff_pix_abs8_y2_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        cmp             w4, #4
+        movi            v26.8h, #0
+        ld1             {v1.8b}, [x2], x3
+        b.lt            2f
+
+// make 4 iterations at once
+1:
+        ld1             {v2.8b}, [x2], x3
+        ld1             {v0.8b}, [x1], x3
+        ld1             {v6.8b}, [x1], x3
+        urhadd          v30.8b, v1.8b, v2.8b
+        ld1             {v5.8b}, [x2], x3
+        ld1             {v21.8b}, [x1], x3
+        uabal           v26.8h, v0.8b, v30.8b
+        urhadd          v29.8b, v2.8b, v5.8b
+        ld1             {v20.8b}, [x2], x3
+        ld1             {v24.8b}, [x1], x3
+        uabal           v26.8h, v6.8b, v29.8b
+        urhadd          v28.8b, v5.8b, v20.8b
+        uabal           v26.8h, v21.8b, v28.8b
+        ld1             {v23.8b}, [x2], x3
+        mov             v1.8b, v23.8b
+        sub             w4, w4, #4
+        urhadd          v27.8b, v20.8b, v23.8b
+        cmp             w4, #4
+        uabal           v26.8h, v24.8b, v27.8b
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v0.8b}, [x1], x3
+        ld1             {v2.8b}, [x2], x3
+        urhadd          v30.8b, v1.8b, v2.8b
+        subs            w4, w4, #1
+        uabal           v26.8h, v0.8b, v30.8b
+        mov             v1.8b, v2.8b
+
+        b.ne            2b
+3:
+        uaddlv          s20, v26.8h
+        fmov            w0, s20
+
+        ret
+
+endfunc
+
+function ff_pix_abs8_xy2_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        movi            v31.8h, #0
+        add             x0, x2, 1  // pix2 + 1
+
+        add             x5, x2, x3 // pix2 + stride = pix3
+        cmp             w4, #4
+        add             x6, x5, 1 // pix3 + 1
+
+        ld1             {v0.8b}, [x2], x3
+        ld1             {v1.8b}, [x0], x3
+        uaddl           v2.8h, v0.8b, v1.8b
+
+        b.lt            2f // taken after v2 is set: the single-row loop needs it when h < 4
+
+// make 4 iterations at once
+1:
+        ld1             {v4.8b}, [x5], x3
+        ld1             {v5.8b}, [x6], x3
+        ld1             {v7.8b}, [x5], x3
+        uaddl           v0.8h, v4.8b, v5.8b
+        ld1             {v16.8b}, [x6], x3
+        add             v4.8h, v0.8h, v2.8h
+        ld1             {v5.8b}, [x1], x3
+        rshrn           v4.8b, v4.8h, #2
+        uaddl           v7.8h, v7.8b, v16.8b
+        uabal           v31.8h, v5.8b, v4.8b
+        add             v2.8h, v0.8h, v7.8h
+        ld1             {v17.8b}, [x1], x3
+        rshrn           v2.8b, v2.8h, #2
+        ld1             {v20.8b}, [x5], x3
+        uabal           v31.8h, v17.8b, v2.8b
+        ld1             {v21.8b}, [x6], x3
+        ld1             {v25.8b}, [x5], x3
+        uaddl           v20.8h, v20.8b, v21.8b
+        ld1             {v26.8b}, [x6], x3
+        add             v7.8h, v7.8h, v20.8h
+        uaddl           v25.8h, v25.8b, v26.8b
+        rshrn           v7.8b, v7.8h, #2
+        ld1             {v22.8b}, [x1], x3
+        mov             v2.16b, v25.16b
+        uabal           v31.8h, v22.8b, v7.8b
+        add             v20.8h, v20.8h, v25.8h
+        ld1             {v27.8b}, [x1], x3
+        sub             w4, w4, #4
+        rshrn           v20.8b, v20.8h, #2
+        cmp             w4, #4
+        uabal           v31.8h, v27.8b, v20.8b
+
+        b.ge            1b
+
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v0.8b}, [x5], x3
+        ld1             {v1.8b}, [x6], x3
+        ld1             {v4.8b}, [x1], x3
+        uaddl           v21.8h, v0.8b, v1.8b
+        subs            w4, w4, #1
+        add             v3.8h, v2.8h, v21.8h
+        mov             v2.16b, v21.16b
+        rshrn           v3.8b, v3.8h, #2
+        uabal           v31.8h, v4.8b, v3.8b
+        b.ne            2b
+
+3:
+        uaddlv          s18, v31.8h
+        fmov            w0, s18
+
+        ret
+
+endfunc
+
+
 function ff_pix_abs16_xy2_neon, export=1
         // x0           unused
         // x1           uint8_t *pix1