From patchwork Tue Sep 6 10:27:18 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Hubert Mazur
X-Patchwork-Id: 37696
From: Hubert Mazur
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 6 Sep 2022 12:27:18 +0200
Message-Id: <20220906102722.53266-2-hum@semihalf.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20220906102722.53266-1-hum@semihalf.com>
References: <20220906102722.53266-1-hum@semihalf.com>
Subject: [FFmpeg-devel] [PATCH 1/5] lavc/aarch64: Add neon implementation for vsad16
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur,
martin@martin.st, mw@semihalf.com, spop@amazon.com Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: pRksy+B9tSpW Provide optimized implementation of vsad16 function for arm64. Performance comparison tests are shown below. - vsad_0_c: 285.0 - vsad_0_neon: 42.5 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Co-authored-by: Martin Storsjö Signed-off-by: Hubert Mazur --- libavcodec/aarch64/me_cmp_init_aarch64.c | 5 ++ libavcodec/aarch64/me_cmp_neon.S | 65 ++++++++++++++++++++++++ 2 files changed, 70 insertions(+) diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c index fb7c3f5059..ddc5d05611 100644 --- a/libavcodec/aarch64/me_cmp_init_aarch64.c +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c @@ -41,6 +41,9 @@ int sse8_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2, int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2, ptrdiff_t stride, int h); +int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2, + ptrdiff_t stride, int h); + av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) { int cpu_flags = av_get_cpu_flags(); @@ -57,5 +60,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) c->sse[0] = sse16_neon; c->sse[1] = sse8_neon; c->sse[2] = sse4_neon; + + c->vsad[0] = vsad16_neon; } } diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S index 4198985c6c..1d0b166d69 100644 --- a/libavcodec/aarch64/me_cmp_neon.S +++ b/libavcodec/aarch64/me_cmp_neon.S @@ -584,3 +584,68 @@ function sse4_neon, export=1 ret endfunc + +function vsad16_neon, export=1 + // x0 unused + // x1 uint8_t *pix1 + // x2 uint8_t *pix2 + // x3 ptrdiff_t stride + // w4 int h + + ld1 {v0.16b}, [x1], x3 // Load pix1[0], first iteration + ld1 {v1.16b}, [x2], x3 // Load pix2[0], first iteration + + sub w4, w4, #1 // we need to make h-1 iterations + movi v16.8h, #0 + + cmp w4, #3 // check 
if we can make 3 iterations at once + usubl v31.8h, v0.8b, v1.8b // Signed difference pix1[0] - pix2[0], first iteration + usubl2 v30.8h, v0.16b, v1.16b // Signed difference pix1[0] - pix2[0], first iteration + + b.lt 2f + +1: + // abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride]) + ld1 {v0.16b}, [x1], x3 // Load pix1[0 + stride], first iteration + ld1 {v1.16b}, [x2], x3 // Load pix2[0 + stride], first iteration + ld1 {v2.16b}, [x1], x3 // Load pix1[0 + stride], second iteration + ld1 {v3.16b}, [x2], x3 // Load pix2[0 + stride], second iteration + usubl v29.8h, v0.8b, v1.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration + usubl2 v28.8h, v0.16b, v1.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration + ld1 {v4.16b}, [x1], x3 // Load pix1[0 + stride], third iteration + ld1 {v5.16b}, [x2], x3 // Load pix2[0 + stride], third iteration + usubl v27.8h, v2.8b, v3.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration + saba v16.8h, v31.8h, v29.8h // Signed absolute difference and accumulate the result. first iteration + usubl2 v26.8h, v2.16b, v3.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration + saba v16.8h, v30.8h, v28.8h // Signed absolute difference and accumulate the result. first iteration + usubl v25.8h, v4.8b, v5.8b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration + usubl2 v24.8h, v4.16b, v5.16b // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration + saba v16.8h, v29.8h, v27.8h // Signed absolute difference and accumulate the result. second iteration + mov v31.16b, v25.16b + saba v16.8h, v28.8h, v26.8h // Signed absolute difference and accumulate the result. second iteration + sub w4, w4, #3 // h -= 3 + mov v30.16b, v24.16b + saba v16.8h, v27.8h, v25.8h // Signed absolute difference and accumulate the result. 
third iteration + cmp w4, #3 + saba v16.8h, v26.8h, v24.8h // Signed absolute difference and accumulate the result. third iteration + + b.ge 1b + cbz w4, 3f +2: + ld1 {v0.16b}, [x1], x3 + ld1 {v1.16b}, [x2], x3 + subs w4, w4, #1 + usubl v29.8h, v0.8b, v1.8b + usubl2 v28.8h, v0.16b, v1.16b + saba v16.8h, v31.8h, v29.8h + mov v31.16b, v29.16b + saba v16.8h, v30.8h, v28.8h + mov v30.16b, v28.16b + + b.ne 2b +3: + uaddlv s17, v16.8h + fmov w0, s17 + + ret +endfunc From patchwork Tue Sep 6 10:27:19 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hubert Mazur X-Patchwork-Id: 37697 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a20:139a:b0:8f:1db5:eae2 with SMTP id w26csp3164609pzh; Tue, 6 Sep 2022 03:28:27 -0700 (PDT) X-Google-Smtp-Source: AA6agR6fBjBdZiAq47MrIKULWgTeo/H3H7/fWzjprcyjFXFA3tKPDR1kzIUotv47QTUB7qfbWPYu X-Received: by 2002:a17:907:7f21:b0:73d:6b7b:3e0 with SMTP id qf33-20020a1709077f2100b0073d6b7b03e0mr37632466ejc.680.1662460107209; Tue, 06 Sep 2022 03:28:27 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1662460107; cv=none; d=google.com; s=arc-20160816; b=fiLrqtwKjCRdWFcgtXqj23hr6hXuyRubS/7G7yihmRBtnHMjGnexX6CkaS/zDTqr6m 6oHNa5VZ+8zUP3zgYHUlri7GkgvHB+X/4sBq7F33D1Nlcvd3IYTE87wpYhS+1GxsrRx6 +6h98ZpJepF2lHH5iHNuvVeZqO5sDy+QWMOIjAsU6bGtPzulDHShRTLongcCbSt2Y6o/ yA7xC+PHjpAOA12bTncFHIOuvxB7aW4OTO5IhIA6mY3m1soar/np3TrqgtVKntlCnvlt 7LVGV5Y85U5DSyHgF3ONgA18ArHiVY+dSJL72n6T2ZNtT91jKbOTnTrTzH4deMEBh/iQ HNCA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:dkim-signature:delivered-to; bh=PB6QRnmLuS2I2+Xx1m9GYgIVbXmClb8X2J+j/Sj+ldM=; b=R4CNER+PZyTSCy5B5+1rOEcBLwXDsAaKXtdUda4j192uLNvnG0m1zM9KzS1F21m9ls 
n9X1hWMMBxBCtITJDaFqUM09BSmx0oNfmdWMPI1G2b6xGIkBOexyN6bJBwfUEBStYZtR GzCPVIRvV+ZNA3E9SxSbr+r0R//3wWMq+LaoQWNFFos7irWPpVHggfFmhEDFjJoPjTXH w6YNNaLfyQ5wLZkVSPQingSMT/HfPdTNvPzqHvsvc19nykCy29PMxiN8Es9nAOtGBiRF +0vQNitB5wy0TID+ENufjYjJ5F7+7FLWdN9C3HC6vhRlzi1qBpis2O6rk/t2fnAuaIRC WgDA== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b=l7y2HbRS; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id sg19-20020a170907a41300b0073d693a64c3si7711224ejc.370.2022.09.06.03.28.26; Tue, 06 Sep 2022 03:28:27 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b=l7y2HbRS; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 7E6B468BAFD; Tue, 6 Sep 2022 13:28:07 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from mail-lf1-f45.google.com (mail-lf1-f45.google.com [209.85.167.45]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 258E868BAC0 for ; Tue, 6 Sep 2022 13:28:01 +0300 (EEST) Received: by mail-lf1-f45.google.com with SMTP id u18so4346312lfo.8 for ; Tue, 06 Sep 2022 03:28:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=semihalf.com; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date; bh=zb6DFBASB0B/mvxrERJNyWxSs/4QHIcvLrpHolC81hw=; b=l7y2HbRSnk9f1xU1IlTK14NJqfs3rdcsF2a+Zl9mYpAA/7r5IUhn3AXppstu8ncKKt 6bg52IjS9EPKo9L29JoluFDa3FfzlBXipt3A/sVzM7sbPxa70haLAMBxeq7RWmQxkKlQ o0AntpX8EnkUXIkMW26xc35hvWeQ+80iPKrAU30MOMufRNtWHMfYbKFGRawoV7Ld3PrE NVqp8yy43cX3r26H/h73ls4NM5oiH3iFcgYKSyaVMBJwge6i2YetEYmXKAqLCMX2HXPF z3gnG13GsnETfIMhUqs/teLvUW6nnRoIKTEUFV7oh3m+2WM5XGWCnYsaDyppeCh/bV/H t6SQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date; bh=zb6DFBASB0B/mvxrERJNyWxSs/4QHIcvLrpHolC81hw=; b=qf0KvV5ozn4kzNAfn04DN3tK6Z2pH6SOQ8f+0b0d1RDEGD+UZWBTGWj9c0u6whhVkB xk3AdvSUgqvckqI6YHw6+ZQu6R3jiIkVFKBVmZKTGvnAoYjOwO1TJ9p31wi2RdBIXecV jUlWquGtvB21O05dsFS7M77pnDe1L2OU2WvxTo4nfkaCmbJotI4EZZE6h/Wl+5UxZCTi Xir1hmfg3GckV1cNLAbNeVyV4nkpCgOAgl81/xXczWsbwy+HC8RgKg7fsYTpohEPvIiX zLxRbgTOCa6/F1Rc5+wCwRKfxgGll30za3f/tDDfBIOCHxj7YbPWpZD1aeUbZq1dLB7i NuKw== X-Gm-Message-State: ACgBeo2Hw0vGOWOo2TOSZdqhQ+8e3ueOMDKfeWk61JQepX+uwp/VO3aM 7KL4lkxnOS+sIFNTV94f6DRKVDl9o066LQ== X-Received: by 2002:a05:6512:c11:b0:494:7108:3936 with SMTP id z17-20020a0565120c1100b0049471083936mr12010209lfu.150.1662460080203; Tue, 06 Sep 2022 03:28:00 -0700 (PDT) Received: from hum-HP-ProBook-440-G7.wifi.semihalf.net ([83.142.187.84]) by smtp.gmail.com with ESMTPSA id q5-20020a2eb4a5000000b0025e59f125fbsm1822751ljm.53.2022.09.06.03.27.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 06 Sep 2022 03:27:59 -0700 (PDT) From: Hubert Mazur To: ffmpeg-devel@ffmpeg.org Date: Tue, 6 Sep 2022 12:27:19 +0200 Message-Id: <20220906102722.53266-3-hum@semihalf.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20220906102722.53266-1-hum@semihalf.com> References: 
<20220906102722.53266-1-hum@semihalf.com> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 2/5] lavc/aarch64: Add neon implementation of vsse16 X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur , martin@martin.st, mw@semihalf.com, spop@amazon.com Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: xbp5b4+XqSTM Provide optimized implementation of vsse16 for arm64. Performance comparison tests are shown below. - vsse_0_c: 254.4 - vsse_0_neon: 64.7 Benchmarks and tests are run with checkasm tool on AWS Graviton 3. Signed-off-by: Hubert Mazur --- libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++ libavcodec/aarch64/me_cmp_neon.S | 87 ++++++++++++++++++++++++ 2 files changed, 91 insertions(+) diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c index ddc5d05611..7b81e48d16 100644 --- a/libavcodec/aarch64/me_cmp_init_aarch64.c +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c @@ -43,6 +43,8 @@ int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2, int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2, ptrdiff_t stride, int h); +int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2, + ptrdiff_t stride, int h); av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) { @@ -62,5 +64,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) c->sse[2] = sse4_neon; c->vsad[0] = vsad16_neon; + + c->vsse[0] = vsse16_neon; } } diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S index 1d0b166d69..b3f376aa60 100644 --- a/libavcodec/aarch64/me_cmp_neon.S +++ b/libavcodec/aarch64/me_cmp_neon.S @@ -649,3 +649,90 @@ 
function vsad16_neon, export=1 ret endfunc + +function vsse16_neon, export=1 + // x0 unused + // x1 uint8_t *pix1 + // x2 uint8_t *pix2 + // x3 ptrdiff_t stride + // w4 int h + + ld1 {v0.16b}, [x1], x3 // Load pix1[0], first iteration + ld1 {v1.16b}, [x2], x3 // Load pix2[0], first iteration + + sub w4, w4, #1 // we need to make h-1 iterations + movi v16.4s, #0 + movi v17.4s, #0 + + cmp w4, #3 // check if we can make 3 iterations at once + usubl v31.8h, v0.8b, v1.8b // Signed difference of pix1[0] - pix2[0], first iteration + usubl2 v30.8h, v0.16b, v1.16b // Signed difference of pix1[0] - pix2[0], first iteration + b.le 2f + + +1: + // x = abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride]) + // res = (x) * (x) + ld1 {v0.16b}, [x1], x3 // Load pix1[0 + stride], first iteration + ld1 {v1.16b}, [x2], x3 // Load pix2[0 + stride], first iteration + ld1 {v2.16b}, [x1], x3 // Load pix1[0 + stride], second iteration + ld1 {v3.16b}, [x2], x3 // Load pix2[0 + stride], second iteration + usubl v29.8h, v0.8b, v1.8b + usubl2 v28.8h, v0.16b, v1.16b + ld1 {v4.16b}, [x1], x3 // Load pix1[0 + stride], third iteration + ld1 {v5.16b}, [x2], x3 // Load pix1[0 + stride], third iteration + sabd v31.8h, v31.8h, v29.8h + sabd v30.8h, v30.8h, v28.8h + usubl v27.8h, v2.8b, v3.8b + usubl2 v26.8h, v2.16b, v3.16b + usubl v25.8h, v4.8b, v5.8b + usubl2 v24.8h, v4.16b, v5.16b + sabd v29.8h, v29.8h, v27.8h + sabd v27.8h, v27.8h, v25.8h + umlal v16.4s, v31.4h, v31.4h + umlal2 v17.4s, v31.8h, v31.8h + sabd v28.8h, v28.8h, v26.8h + sabd v26.8h, v26.8h, v24.8h + umlal v16.4s, v30.4h, v30.4h + umlal2 v17.4s, v30.8h, v30.8h + mov v31.16b, v25.16b + umlal v16.4s, v29.4h, v29.4h + umlal2 v17.4s, v29.8h, v29.8h + mov v30.16b, v24.16b + umlal v16.4s, v28.4h, v28.4h + umlal2 v17.4s, v28.8h, v28.8h + sub w4, w4, #3 + umlal v16.4s, v27.4h, v27.4h + umlal2 v17.4s, v27.8h, v27.8h + cmp w4, #3 + umlal v16.4s, v26.4h, v26.4h + umlal2 v17.4s, v26.8h, v26.8h + + b.ge 1b + + cbz w4, 3f + +// iterate by once 
+2: + ld1 {v0.16b}, [x1], x3 + ld1 {v1.16b}, [x2], x3 + subs w4, w4, #1 + usubl v29.8h, v0.8b, v1.8b + usubl2 v28.8h, v0.16b, v1.16b + sabd v31.8h, v31.8h, v29.8h + sabd v30.8h, v30.8h, v28.8h + umlal v16.4s, v31.4h, v31.4h + umlal2 v17.4s, v31.8h, v31.8h + mov v31.16b, v29.16b + umlal v16.4s, v30.4h, v30.4h + umlal2 v17.4s, v30.8h, v30.8h + mov v30.16b, v28.16b + b.ne 2b + +3: + add v16.4s, v16.4s, v17.4s + uaddlv d17, v16.4s + fmov w0, s17 + + ret +endfunc From patchwork Tue Sep 6 10:27:20 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hubert Mazur X-Patchwork-Id: 37698 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a20:139a:b0:8f:1db5:eae2 with SMTP id w26csp3164680pzh; Tue, 6 Sep 2022 03:28:37 -0700 (PDT) X-Google-Smtp-Source: AA6agR5XJZ1xnNWZ16FyoXoOpU/XWJue5mE1ZEt8GpkMi5O96c9KXONptC6CPyZ4efcF9swsdkY+ X-Received: by 2002:aa7:dc10:0:b0:440:b446:c0cc with SMTP id b16-20020aa7dc10000000b00440b446c0ccmr47444086edu.34.1662460117043; Tue, 06 Sep 2022 03:28:37 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1662460117; cv=none; d=google.com; s=arc-20160816; b=G6l402UhCmcSmf+Ep6B/ey73i8xhaTPwiOeGolXJ0jdQZcWevxsN3/x3R+jChuMgWo +MxqBXRXcHRGR11EXf35+hQW0MA/UCiTIlT7hOlZPwAFK2XjaLpb3QNv/2vBuTsmPwCY w7dVyfrump9j08OQ3xlRU4YSwIrN/3jX8oyPyAa2NjGxUDp7BvJ0oPALiST29hNgSTuW KsJXPvrKxVOEJxtfnGiJqBQngJp76zDZHFhI4RErsqPPzWQiKRQ7q+difhWNK0WV85N2 EbdsT9+T7+Zh+SieeIKbwrZMaedTiOGaEqwKYzVkEQgEL/J4Y0l173bgaHMHrVXrPFWC KpLQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:dkim-signature:delivered-to; bh=Ai3l7Sx+wikaaMO21luaNb0s2UDvb0ztSew/eCqKvSs=; b=InMKUoqDyWPzU8zS2y6a3L6NRe1Cf1rvNh/h4B/7LyBu+/wxrCmqby5Au171983LOv 
DDzraFfbswY97JKbo4C/QnP7Pe87ksuB4Zz/34NAdgCVZ6dBdSSOv1NnZc1ZWwgFNWfj UCASgubCfecCmfRYcPtNP33x4IgwJaf2YAS29yAOwNn1bmHLe+zGnBc8bQv6rL/sCP5H GQC38RvgJgj6R99nIIJJEScIsq+PBsHE11NjtClptO9Y4cwZkiF1TOGbBlpJ9eDAqxgv TLio4WWg28E8fN62w5T6BJLfLwr8TjxTTDJXHrTVYZecN7vS0hGyHycIX7kb6yrjV6/X LZoQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b=bpq1mT8I; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id z19-20020a05640235d300b004479f85b42esi10696060edc.509.2022.09.06.03.28.36; Tue, 06 Sep 2022 03:28:37 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@semihalf.com header.s=google header.b=bpq1mT8I; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=semihalf.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 6CD9B68BB04; Tue, 6 Sep 2022 13:28:09 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from mail-lj1-f180.google.com (mail-lj1-f180.google.com [209.85.208.180]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 8065668BAF6 for ; Tue, 6 Sep 2022 13:28:03 +0300 (EEST) Received: by mail-lj1-f180.google.com with SMTP id b19so11756456ljf.8 for ; Tue, 06 Sep 2022 03:28:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=semihalf.com; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date; bh=HP/PYdL5UfyvnhOPsjWZ29R9/VdYT2rC/LCTuCGZG8w=; b=bpq1mT8IPj8PToOr3ZfPf1rfEVH5UQgNmxSsKxDNHJm2exqxf6Eul6M7qF12LDpeRj DuzJG1WBzzeZrIDggrWFw0caL6XU9fGIX5b26vE9LWVuAE/XWkXuryBsPLGOM133Pzmz oYoAFvO0rjiHyAKyFWkmUr+b67l2TFrQZsRULNSC4PNqUIi0R2751sylzayt3sP/7vtS olPowts7/VX7BQhiu8o4p4H9wqSuWZm8H8TGPa9rVDFEMKHcT8/ENgXZF7WI+mdpTjEt B8qApW6K6HY2UB9zPHTuIYj/H80mwwYIMND7y6wVs1ErXcdZVbA8EVYTw8hQONm2a0zo CQaQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date; bh=HP/PYdL5UfyvnhOPsjWZ29R9/VdYT2rC/LCTuCGZG8w=; b=dd+75X6qCf6+nxN2YuJcA/CyZoSVohc8sTaQl2HsuHbbjARchzP9aWsIeVzxTSJfH1 8By6Tx16WRoy9QyjNbHT2ynOqUo2amAJmhfj/RdfUqLQLh1WALgNccnDp7BTrwMJo6Ta NefLz0akvt1YSaVQRdk/sl/Z1XQHDZ2d8m6ZDbGJh/Y9RQ5ojWveVaz9IOWt3fshCbR6 9C5rn4D87Jr3+/UTMqmpWL/ye1wpCf9forRrukGS5W0eLmJ/7pdN1A8UxfU7GWC1jcMC b3JXq2Emek6pN1ZNntBrYXHImXm4aIinWVMsjD3314dvjwJQJuTYMPuSkYMo4DRediEM 8Ufg== X-Gm-Message-State: ACgBeo2BJxGV3Wx7y/thGNNm03h0xA2Q8MqP00hQj1pxTzwJ15EtUfqQ AuBldw+6Djn5Bj6EO7s9ni+2i/b2IdKrSw== X-Received: by 2002:a2e:bc10:0:b0:267:b34a:52e6 with SMTP id b16-20020a2ebc10000000b00267b34a52e6mr10105440ljf.292.1662460082586; Tue, 06 Sep 2022 03:28:02 -0700 (PDT) Received: from hum-HP-ProBook-440-G7.wifi.semihalf.net ([83.142.187.84]) by smtp.gmail.com with ESMTPSA id q5-20020a2eb4a5000000b0025e59f125fbsm1822751ljm.53.2022.09.06.03.28.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 06 Sep 2022 03:28:02 -0700 (PDT) From: Hubert Mazur To: ffmpeg-devel@ffmpeg.org Date: Tue, 6 Sep 2022 12:27:20 +0200 Message-Id: <20220906102722.53266-4-hum@semihalf.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20220906102722.53266-1-hum@semihalf.com> References: 
<20220906102722.53266-1-hum@semihalf.com> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 3/5] lavc/aarch64: Add neon implementation for vsad_intra16 X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur , martin@martin.st, mw@semihalf.com, spop@amazon.com Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: T3llUzOc8koA Provide optimized implementation for vsad_intra16 function for arm64. Performance comparison tests are shown below. - vsad_4_c: 177.2 - vsad_4_neon: 24.5 Benchmarks and tests are run with checkasm tool on AWS Gravtion 3. Signed-off-by: Hubert Mazur --- libavcodec/aarch64/me_cmp_init_aarch64.c | 3 ++ libavcodec/aarch64/me_cmp_neon.S | 48 ++++++++++++++++++++++++ 2 files changed, 51 insertions(+) diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c index 7b81e48d16..af83f7ed1e 100644 --- a/libavcodec/aarch64/me_cmp_init_aarch64.c +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c @@ -43,6 +43,8 @@ int sse4_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2, int vsad16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2, ptrdiff_t stride, int h); +int vsad_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy, + ptrdiff_t stride, int h) ; int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2, ptrdiff_t stride, int h); @@ -64,6 +66,7 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) c->sse[2] = sse4_neon; c->vsad[0] = vsad16_neon; + c->vsad[4] = vsad_intra16_neon; c->vsse[0] = vsse16_neon; } diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S index b3f376aa60..ce198ea227 100644 --- 
a/libavcodec/aarch64/me_cmp_neon.S +++ b/libavcodec/aarch64/me_cmp_neon.S @@ -736,3 +736,51 @@ function vsse16_neon, export=1 ret endfunc + +function vsad_intra16_neon, export=1 + // x0 unused + // x1 uint8_t *pix1 + // x2 uint8_t *dummy + // x3 ptrdiff_t stride + // w4 int h + + ld1 {v0.16b}, [x1], x3 + sub w4, w4, #1 // we need to make h-1 iterations + cmp w4, #3 + movi v16.8h, #0 + b.lt 2f + +// make 4 iterations at once +1: + // v = abs( pix1[0] - pix1[0 + stride] ) + // score = sum(v) + ld1 {v1.16b}, [x1], x3 + ld1 {v2.16b}, [x1], x3 + uabal v16.8h, v0.8b, v1.8b + ld1 {v3.16b}, [x1], x3 + uabal2 v16.8h, v0.16b, v1.16b + sub w4, w4, #3 + uabal v16.8h, v1.8b, v2.8b + cmp w4, #3 + uabal2 v16.8h, v1.16b, v2.16b + mov v0.16b, v3.16b + uabal v16.8h, v2.8b, v3.8b + uabal2 v16.8h, v2.16b, v3.16b + b.ge 1b + cbz w4, 3f + +// iterate by one +2: + ld1 {v1.16b}, [x1], x3 + subs w4, w4, #1 + uabal v16.8h, v0.8b, v1.8b + uabal2 v16.8h, v0.16b, v1.16b + mov v0.16b, v1.16b + cbnz w4, 2b + +3: + uaddlv s17, v16.8h + fmov w0, s17 + + ret +endfunc From patchwork Tue Sep 6 10:27:21 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Hubert Mazur X-Patchwork-Id: 37699 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a20:139a:b0:8f:1db5:eae2 with SMTP id w26csp3164749pzh; Tue, 6 Sep 2022 03:28:46 -0700 (PDT) X-Google-Smtp-Source: AA6agR5lWNbpGIpNAgap3kWAu5J+1oeBg+9Y3N1N5VVDKYgi+S47rtiixijH+HRrc2ev2d4wexOy X-Received: by 2002:a05:6402:27cf:b0:448:706c:184b with SMTP id c15-20020a05640227cf00b00448706c184bmr33775655ede.1.1662460126552; Tue, 06 Sep 2022 03:28:46 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1662460126; cv=none; d=google.com; s=arc-20160816; b=VvbGN9mMfrPp51gKB426k4sjoYPWOpIQ5jJifu+kcIRw+XR6COdbyalzjTlI40DsU/ 3SHuItT9EChyKYE5vHg4DoKWP++cSU470uD19udkRE3J9pkhXHHX+hlPvo0MWXo/9rVp wZlcnbYhqziGCdiQWOWiSFyI6Ep/flc5vNne3SmWXmk6POtrRlsii4L1OZYiTgpKC4XC 
From: Hubert Mazur
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 6 Sep 2022 12:27:21 +0200
Message-Id: <20220906102722.53266-5-hum@semihalf.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20220906102722.53266-1-hum@semihalf.com>
References: <20220906102722.53266-1-hum@semihalf.com>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 4/5] lavc/aarch64: Add neon implementation for vsse_intra16
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur, martin@martin.st, mw@semihalf.com, spop@amazon.com

Provide optimized implementation for vsse_intra16 for arm64.
Performance tests are shown below.
 - vsse_4_c: 153.7
 - vsse_4_neon: 34.2

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  3 ++
 libavcodec/aarch64/me_cmp_neon.S         | 63 ++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index af83f7ed1e..8c295d5457 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -47,6 +47,8 @@ int vsad_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
                       ptrdiff_t stride, int h) ;
 int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
+int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
+                      ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -69,5 +71,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
         c->vsad[4] = vsad_intra16_neon;
 
         c->vsse[0] = vsse16_neon;
+        c->vsse[4] = vsse_intra16_neon;
     }
 }
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index ce198ea227..cf2b8da425 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -784,3 +784,66 @@ function vsad_intra16_neon, export=1
 
         ret
 endfunc
+
+function vsse_intra16_neon, export=1
+        // x0           unused
+        // x1           uint8_t *pix1
+        // x2           uint8_t *dummy
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        ld1             {v0.16b}, [x1], x3
+        movi            v16.4s, #0
+        movi            v17.4s, #0
+
+        sub             w4, w4, #1                  // we need to make h-1 iterations
+        cmp             w4, #3
+        b.lt            2f
+
+1:
+        // v = abs( pix1[0] - pix1[0 + stride] )
+        // score = sum( v * v )
+        ld1             {v1.16b}, [x1], x3
+        ld1             {v2.16b}, [x1], x3
+        uabd            v30.16b, v0.16b, v1.16b
+        ld1             {v3.16b}, [x1], x3
+        umull           v29.8h, v30.8b, v30.8b
+        umull2          v28.8h, v30.16b, v30.16b
+        uabd            v27.16b, v1.16b, v2.16b
+        uadalp          v16.4s, v29.8h
+        umull           v26.8h, v27.8b, v27.8b
+        umull2          v27.8h, v27.16b, v27.16b
+        uadalp          v17.4s, v28.8h
+        uabd            v25.16b, v2.16b, v3.16b
+        uadalp          v16.4s, v26.8h
+        umull           v24.8h, v25.8b, v25.8b
+        umull2          v25.8h, v25.16b, v25.16b
+        uadalp          v17.4s, v27.8h
+        sub             w4, w4, #3
+        uadalp          v16.4s, v24.8h
+        cmp             w4, #3
+        uadalp          v17.4s, v25.8h
+        mov             v0.16b, v3.16b
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v1.16b}, [x1], x3
+        subs            w4, w4, #1
+        uabd            v30.16b, v0.16b, v1.16b
+        mov             v0.16b, v1.16b
+        umull           v29.8h, v30.8b, v30.8b
+        umull2          v30.8h, v30.16b, v30.16b
+        uadalp          v16.4s, v29.8h
+        uadalp          v17.4s, v30.8h
+        cbnz            w4, 2b
+
+3:
+        add             v16.4s, v16.4s, v17.4S
+        uaddlv          d17, v16.4s
+        fmov            w0, s17
+
+        ret
+endfunc

From patchwork Tue Sep 6 10:27:22 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Hubert Mazur
X-Patchwork-Id: 37700
From: Hubert Mazur
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 6 Sep 2022 12:27:22 +0200
Message-Id: <20220906102722.53266-6-hum@semihalf.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20220906102722.53266-1-hum@semihalf.com>
References: <20220906102722.53266-1-hum@semihalf.com>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, Hubert Mazur, martin@martin.st, mw@semihalf.com, spop@amazon.com

Add vectorized implementation of nsse16 function.

Performance comparison tests are shown below.
 - nsse_0_c: 707.0
 - nsse_0_neon: 120.0

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur
---
 libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
 libavcodec/aarch64/me_cmp_neon.S         | 124 +++++++++++++++++++++++
 2 files changed, 139 insertions(+)

diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
index 8c295d5457..ea7f295373 100644
--- a/libavcodec/aarch64/me_cmp_init_aarch64.c
+++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
@@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                 ptrdiff_t stride, int h);
 int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
                       ptrdiff_t stride, int h);
+int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
+                ptrdiff_t stride, int h);
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                        ptrdiff_t stride, int h);
 
 av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 {
@@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
 
         c->vsse[0] = vsse16_neon;
         c->vsse[4] = vsse_intra16_neon;
+
+        c->nsse[0] = nsse16_neon_wrapper;
     }
 }
+
+int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
+                        ptrdiff_t stride, int h)
+{
+    if (c)
+        return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
+    else
+        return nsse16_neon(8, s1, s2, stride, h);
+}
diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index cf2b8da425..bd21122a21 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -847,3 +847,127 @@ function vsse_intra16_neon, export=1
 
         ret
 endfunc
+
+function nsse16_neon, export=1
+        // x0           multiplier
+        // x1           uint8_t *pix1
+        // x2           uint8_t *pix2
+        // x3           ptrdiff_t stride
+        // w4           int h
+
+        str             x0, [sp, #-0x40]!
+        stp             x1, x2, [sp, #0x10]
+        stp             x3, x4, [sp, #0x20]
+        str             x30, [sp, #0x30]
+        bl              X(sse16_neon)
+        ldr             x30, [sp, #0x30]
+        mov             w9, w0                      // here we store score1
+        ldr             x5, [sp]
+        ldp             x1, x2, [sp, #0x10]
+        ldp             x3, x4, [sp, #0x20]
+        add             sp, sp, #0x40
+
+        movi            v16.8h, #0
+        movi            v17.8h, #0
+        movi            v18.8h, #0
+        movi            v19.8h, #0
+
+        ld1             {v0.16b}, [x1], x3
+        subs            w4, w4, #1                  // we need to make h-1 iterations
+        ext             v1.16b, v0.16b, v0.16b, #1  // x1 + 1
+        ld1             {v2.16b}, [x2], x3
+        cmp             w4, #2
+        ext             v3.16b, v2.16b, v2.16b, #1  // x2 + 1
+
+        b.lt            2f
+
+// make 2 iterations at once
+1:
+        ld1             {v4.16b}, [x1], x3
+        ld1             {v20.16b}, [x1], x3
+        ext             v5.16b, v4.16b, v4.16b, #1  // x1 + stride + 1
+        ext             v21.16b, v20.16b, v20.16b, #1
+        ld1             {v6.16b}, [x2], x3
+        ld1             {v22.16b}, [x2], x3
+        ext             v7.16b, v6.16b, v6.16b, #1  // x2 + stride + 1
+        ext             v23.16b, v22.16b, v22.16b, #1
+
+        usubl           v31.8h, v0.8b, v4.8b
+        usubl2          v30.8h, v0.16b, v4.16b
+        usubl           v29.8h, v1.8b, v5.8b
+        usubl2          v28.8h, v1.16b, v5.16b
+        saba            v16.8h, v31.8h, v29.8h
+        saba            v17.8h, v30.8h, v28.8h
+        usubl           v27.8h, v2.8b, v6.8b
+        usubl2          v26.8h, v2.16b, v6.16b
+        usubl           v25.8h, v3.8b, v7.8b
+        usubl2          v24.8h, v3.16b, v7.16b
+        saba            v18.8h, v27.8h, v25.8h
+        saba            v19.8h, v26.8h, v24.8h
+
+        usubl           v31.8h, v4.8b, v20.8b
+        usubl2          v30.8h, v4.16b, v20.16b
+        usubl           v29.8h, v5.8b, v21.8b
+        usubl2          v28.8h, v5.16b, v21.16b
+        saba            v16.8h, v31.8h, v29.8h
+        saba            v17.8h, v30.8h, v28.8h
+        usubl           v27.8h, v6.8b, v22.8b
+        usubl2          v26.8h, v6.16b, v22.16b
+        usubl           v25.8h, v7.8b, v23.8b
+        usubl2          v24.8h, v7.16b, v23.16b
+        saba            v18.8h, v27.8h, v25.8h
+        saba            v19.8h, v26.8h, v24.8h
+
+        mov             v0.16b, v20.16b
+        mov             v1.16b, v21.16b
+        mov             v2.16b, v22.16b
+        mov             v3.16b, v23.16b
+
+        sub             w4, w4, #2
+        cmp             w4, #2
+
+        b.ge            1b
+        cbz             w4, 3f
+
+// iterate by one
+2:
+        ld1             {v4.16b}, [x1], x3
+        subs            w4, w4, #1
+        ext             v5.16b, v4.16b, v4.16b, #1  // x1 + stride + 1
+        ld1             {v6.16b}, [x2], x3
+        usubl           v31.8h, v0.8b, v4.8b
+        ext             v7.16b, v6.16b, v6.16b, #1  // x2 + stride + 1
+
+        usubl2          v30.8h, v0.16b, v4.16b
+        usubl           v29.8h, v1.8b, v5.8b
+        usubl2          v28.8h, v1.16b, v5.16b
+        saba            v16.8h, v31.8h, v29.8h
+        saba            v17.8h, v30.8h, v28.8h
+        usubl           v27.8h, v2.8b, v6.8b
+        usubl2          v26.8h, v2.16b, v6.16b
+        usubl           v25.8h, v3.8b, v7.8b
+        usubl2          v24.8h, v3.16b, v7.16b
+        saba            v18.8h, v27.8h, v25.8h
+        saba            v19.8h, v26.8h, v24.8h
+
+        mov             v0.16b, v4.16b
+        mov             v1.16b, v5.16b
+        mov             v2.16b, v6.16b
+        mov             v3.16b, v7.16b
+
+        cbnz            w4, 2b
+
+3:
+        sqsub           v16.8h, v16.8h, v18.8h
+        sqsub           v17.8h, v17.8h, v19.8h
+        ins             v17.h[7], wzr
+        sqadd           v16.8h, v16.8h, v17.8h
+        saddlv          s16, v16.8h
+        sqabs           s16, s16
+        fmov            w0, s16
+
+        mul             w0, w0, w5
+        add             w0, w0, w9
+
+        ret
+endfunc
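
For readers who want a scalar model to compare against, the two NEON routines from patches 4/5 and 5/5 can be approximated in C roughly as follows. This is a minimal sketch reconstructed from the assembly and its comments, not the C reference in libavcodec. The names vsse_intra16_ref and nsse16_ref are hypothetical. It assumes that sse16_neon returns the sum of squared differences over the 16 x h block, and it uses plain int accumulators, so the 16-bit saturating arithmetic in nsse16_neon is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of vsse_intra16_neon: squared vertical
 * differences between each row and the next, over 16 columns and
 * h - 1 row pairs. */
int vsse_intra16_ref(const uint8_t *pix, ptrdiff_t stride, int h)
{
    int score = 0;

    assert(h >= 1);
    for (int y = 0; y < h - 1; y++) {
        for (int x = 0; x < 16; x++) {
            int d = pix[x] - pix[x + stride];
            score += d * d;
        }
        pix += stride;
    }
    return score;
}

/* Hypothetical scalar model of nsse16_neon.
 * score1: sum of squared differences between the two blocks (assumed to be
 *         what the call to sse16_neon computes).
 * score2: difference of the 2x2 gradient magnitudes of the two blocks,
 *         over 15 columns (the 16th lane is zeroed in the assembly)
 *         and h - 1 row pairs. */
int nsse16_ref(int multiplier, const uint8_t *pix1, const uint8_t *pix2,
               ptrdiff_t stride, int h)
{
    int score1 = 0, score2 = 0;

    assert(h >= 1);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = pix1[x] - pix2[x];
            score1 += d * d;
        }
        if (y + 1 < h) {
            for (int x = 0; x < 15; x++) {
                int g1 = pix1[x] - pix1[x + stride]
                       - pix1[x + 1] + pix1[x + stride + 1];
                int g2 = pix2[x] - pix2[x + stride]
                       - pix2[x + 1] + pix2[x + stride + 1];
                score2 += abs(g1) - abs(g2);
            }
        }
        pix1 += stride;
        pix2 += stride;
    }
    return score1 + abs(score2) * multiplier;
}
```

Calling nsse16_ref with multiplier 8 corresponds to the fallback that nsse16_neon_wrapper uses when the context pointer is NULL; otherwise the wrapper passes c->avctx->nsse_weight.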