From patchwork Tue Oct 17 11:45:56 2023
X-Patchwork-Submitter: Martin Storsjö
X-Patchwork-Id: 44277
Delivered-To: ffmpegpatchwork2@gmail.com
Received: from localhost.localdomain (dsl-tkubng21-58c01c-243.dhcp.inet.fi. 
[88.192.28.243]) by smtp.gmail.com with ESMTPSA id x25-20020a19f619000000b0050797a35f8csm244532lfe.162.2023.10.17.04.46.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 17 Oct 2023 04:46:02 -0700 (PDT) From: =?utf-8?q?Martin_Storsj=C3=B6?= To: ffmpeg-devel@ffmpeg.org Date: Tue, 17 Oct 2023 14:45:56 +0300 Message-Id: <20231017114601.1374712-1-martin@martin.st> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 1/5] aarch64: Consistently use lowercase for vector element specifiers X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: jdek@itanimul.li Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: icPfv/4HYwML --- libavcodec/aarch64/aacpsdsp_neon.S | 194 ++++----- libavcodec/aarch64/h264cmc_neon.S | 406 +++++++++--------- libavcodec/aarch64/h264dsp_neon.S | 594 +++++++++++++------------- libavcodec/aarch64/h264idct_neon.S | 390 ++++++++--------- libavcodec/aarch64/h264qpel_neon.S | 556 ++++++++++++------------ libavcodec/aarch64/hpeldsp_neon.S | 362 ++++++++-------- libavcodec/aarch64/me_cmp_neon.S | 2 +- libavcodec/aarch64/neon.S | 246 +++++------ libavcodec/aarch64/sbrdsp_neon.S | 294 ++++++------- libavcodec/aarch64/simple_idct_neon.S | 386 ++++++++--------- libavfilter/aarch64/vf_nlmeans_neon.S | 70 +-- libavutil/aarch64/float_dsp_neon.S | 200 ++++----- libswresample/aarch64/resample.S | 64 +-- libswscale/aarch64/hscale.S | 432 +++++++++---------- libswscale/aarch64/output.S | 150 +++---- libswscale/aarch64/yuv2rgb_neon.S | 116 ++--- 16 files changed, 2231 insertions(+), 2231 deletions(-) diff --git a/libavcodec/aarch64/aacpsdsp_neon.S b/libavcodec/aarch64/aacpsdsp_neon.S index ff4e6e244a..686c62eb2e 100644 --- a/libavcodec/aarch64/aacpsdsp_neon.S +++ b/libavcodec/aarch64/aacpsdsp_neon.S @@ -19,82 +19,82 @@ #include "libavutil/aarch64/asm.S" function ff_ps_add_squares_neon, export=1 -1: ld1 {v0.4S,v1.4S}, [x1], #32 - fmul v0.4S, v0.4S, v0.4S - fmul v1.4S, v1.4S, v1.4S - faddp v2.4S, v0.4S, v1.4S - ld1 {v3.4S}, [x0] - fadd v3.4S, v3.4S, v2.4S - st1 {v3.4S}, [x0], #16 +1: ld1 {v0.4s,v1.4s}, [x1], #32 + fmul v0.4s, v0.4s, v0.4s + fmul v1.4s, v1.4s, v1.4s + faddp v2.4s, v0.4s, v1.4s + ld1 {v3.4s}, [x0] + fadd v3.4s, v3.4s, v2.4s + st1 {v3.4s}, [x0], #16 subs w2, w2, #4 b.gt 1b ret endfunc function ff_ps_mul_pair_single_neon, export=1 -1: ld1 {v0.4S,v1.4S}, [x1], #32 - ld1 {v2.4S}, [x2], #16 - zip1 v3.4S, v2.4S, v2.4S - zip2 v4.4S, v2.4S, v2.4S - fmul v0.4S, v0.4S, v3.4S - fmul v1.4S, v1.4S, v4.4S - st1 {v0.4S,v1.4S}, [x0], #32 +1: ld1 {v0.4s,v1.4s}, [x1], #32 + ld1 {v2.4s}, [x2], #16 + zip1 v3.4s, v2.4s, v2.4s + zip2 v4.4s, v2.4s, v2.4s + fmul v0.4s, v0.4s, v3.4s + fmul v1.4s, v1.4s, v4.4s + st1 {v0.4s,v1.4s}, [x0], #32 subs w3, w3, #4 b.gt 1b ret endfunc function ff_ps_stereo_interpolate_neon, export=1 - ld1 {v0.4S}, [x2] - ld1 {v1.4S}, [x3] - zip1 v4.4S, v0.4S, v0.4S - zip2 v5.4S, v0.4S, v0.4S - zip1 v6.4S, v1.4S, v1.4S - zip2 v7.4S, v1.4S, v1.4S -1: ld1 {v2.2S}, [x0] - ld1 {v3.2S}, [x1] - fadd v4.4S, v4.4S, v6.4S - fadd v5.4S, v5.4S, v7.4S - mov v2.D[1], v2.D[0] - mov v3.D[1], v3.D[0] - fmul v2.4S, v2.4S, v4.4S - fmla v2.4S, v3.4S, v5.4S - st1 {v2.D}[0], [x0], #8 - st1 {v2.D}[1], [x1], #8 + ld1 {v0.4s}, [x2] + ld1 {v1.4s}, [x3] + zip1 v4.4s, v0.4s, v0.4s + zip2 v5.4s, v0.4s, v0.4s + zip1 v6.4s, v1.4s, v1.4s + 
zip2 v7.4s, v1.4s, v1.4s +1: ld1 {v2.2s}, [x0] + ld1 {v3.2s}, [x1] + fadd v4.4s, v4.4s, v6.4s + fadd v5.4s, v5.4s, v7.4s + mov v2.d[1], v2.d[0] + mov v3.d[1], v3.d[0] + fmul v2.4s, v2.4s, v4.4s + fmla v2.4s, v3.4s, v5.4s + st1 {v2.d}[0], [x0], #8 + st1 {v2.d}[1], [x1], #8 subs w4, w4, #1 b.gt 1b ret endfunc function ff_ps_stereo_interpolate_ipdopd_neon, export=1 - ld1 {v0.4S,v1.4S}, [x2] - ld1 {v6.4S,v7.4S}, [x3] - fneg v2.4S, v1.4S - fneg v3.4S, v7.4S - zip1 v16.4S, v0.4S, v0.4S - zip2 v17.4S, v0.4S, v0.4S - zip1 v18.4S, v2.4S, v1.4S - zip2 v19.4S, v2.4S, v1.4S - zip1 v20.4S, v6.4S, v6.4S - zip2 v21.4S, v6.4S, v6.4S - zip1 v22.4S, v3.4S, v7.4S - zip2 v23.4S, v3.4S, v7.4S -1: ld1 {v2.2S}, [x0] - ld1 {v3.2S}, [x1] - fadd v16.4S, v16.4S, v20.4S - fadd v17.4S, v17.4S, v21.4S - mov v2.D[1], v2.D[0] - mov v3.D[1], v3.D[0] - fmul v4.4S, v2.4S, v16.4S - fmla v4.4S, v3.4S, v17.4S - fadd v18.4S, v18.4S, v22.4S - fadd v19.4S, v19.4S, v23.4S - ext v2.16B, v2.16B, v2.16B, #4 - ext v3.16B, v3.16B, v3.16B, #4 - fmla v4.4S, v2.4S, v18.4S - fmla v4.4S, v3.4S, v19.4S - st1 {v4.D}[0], [x0], #8 - st1 {v4.D}[1], [x1], #8 + ld1 {v0.4s,v1.4s}, [x2] + ld1 {v6.4s,v7.4s}, [x3] + fneg v2.4s, v1.4s + fneg v3.4s, v7.4s + zip1 v16.4s, v0.4s, v0.4s + zip2 v17.4s, v0.4s, v0.4s + zip1 v18.4s, v2.4s, v1.4s + zip2 v19.4s, v2.4s, v1.4s + zip1 v20.4s, v6.4s, v6.4s + zip2 v21.4s, v6.4s, v6.4s + zip1 v22.4s, v3.4s, v7.4s + zip2 v23.4s, v3.4s, v7.4s +1: ld1 {v2.2s}, [x0] + ld1 {v3.2s}, [x1] + fadd v16.4s, v16.4s, v20.4s + fadd v17.4s, v17.4s, v21.4s + mov v2.d[1], v2.d[0] + mov v3.d[1], v3.d[0] + fmul v4.4s, v2.4s, v16.4s + fmla v4.4s, v3.4s, v17.4s + fadd v18.4s, v18.4s, v22.4s + fadd v19.4s, v19.4s, v23.4s + ext v2.16b, v2.16b, v2.16b, #4 + ext v3.16b, v3.16b, v3.16b, #4 + fmla v4.4s, v2.4s, v18.4s + fmla v4.4s, v3.4s, v19.4s + st1 {v4.d}[0], [x0], #8 + st1 {v4.d}[1], [x1], #8 subs w4, w4, #1 b.gt 1b ret @@ -102,46 +102,46 @@ endfunc function ff_ps_hybrid_analysis_neon, export=1 lsl x3, x3, #3 - ld2 {v0.4S,v1.4S}, [x1], #32 - ld2 {v2.2S,v3.2S}, [x1], #16 - ld1 {v24.2S}, [x1], #8 - ld2 {v4.2S,v5.2S}, [x1], #16 - ld2 {v6.4S,v7.4S}, [x1] - rev64 v6.4S, v6.4S - rev64 v7.4S, v7.4S - ext v6.16B, v6.16B, v6.16B, #8 - ext v7.16B, v7.16B, v7.16B, #8 - rev64 v4.2S, v4.2S - rev64 v5.2S, v5.2S - mov v2.D[1], v3.D[0] - mov v4.D[1], v5.D[0] - mov v5.D[1], v2.D[0] - mov v3.D[1], v4.D[0] - fadd v16.4S, v0.4S, v6.4S - fadd v17.4S, v1.4S, v7.4S - fsub v18.4S, v1.4S, v7.4S - fsub v19.4S, v0.4S, v6.4S - fadd v22.4S, v2.4S, v4.4S - fsub v23.4S, v5.4S, v3.4S - trn1 v20.2D, v22.2D, v23.2D // {re4+re8, re5+re7, im8-im4, im7-im5} - trn2 v21.2D, v22.2D, v23.2D // {im4+im8, im5+im7, re4-re8, re5-re7} -1: ld2 {v2.4S,v3.4S}, [x2], #32 - ld2 {v4.2S,v5.2S}, [x2], #16 - ld1 {v6.2S}, [x2], #8 + ld2 {v0.4s,v1.4s}, [x1], #32 + ld2 {v2.2s,v3.2s}, [x1], #16 + ld1 {v24.2s}, [x1], #8 + ld2 {v4.2s,v5.2s}, [x1], #16 + ld2 {v6.4s,v7.4s}, [x1] + rev64 v6.4s, v6.4s + rev64 v7.4s, v7.4s + ext v6.16b, v6.16b, v6.16b, #8 + ext v7.16b, v7.16b, v7.16b, #8 + rev64 v4.2s, v4.2s + rev64 v5.2s, v5.2s + mov v2.d[1], v3.d[0] + mov v4.d[1], v5.d[0] + mov v5.d[1], v2.d[0] + mov v3.d[1], v4.d[0] + fadd v16.4s, v0.4s, v6.4s + fadd v17.4s, v1.4s, v7.4s + fsub v18.4s, v1.4s, v7.4s + fsub v19.4s, v0.4s, v6.4s + fadd v22.4s, v2.4s, v4.4s + fsub v23.4s, v5.4s, v3.4s + trn1 v20.2d, v22.2d, v23.2d // {re4+re8, re5+re7, im8-im4, im7-im5} + trn2 v21.2d, v22.2d, v23.2d // {im4+im8, im5+im7, re4-re8, re5-re7} +1: ld2 {v2.4s,v3.4s}, [x2], #32 + ld2 {v4.2s,v5.2s}, [x2], #16 + ld1 {v6.2s}, [x2], 
#8 add x2, x2, #8 - mov v4.D[1], v5.D[0] - mov v6.S[1], v6.S[0] - fmul v6.2S, v6.2S, v24.2S - fmul v0.4S, v2.4S, v16.4S - fmul v1.4S, v2.4S, v17.4S - fmls v0.4S, v3.4S, v18.4S - fmla v1.4S, v3.4S, v19.4S - fmla v0.4S, v4.4S, v20.4S - fmla v1.4S, v4.4S, v21.4S - faddp v0.4S, v0.4S, v1.4S - faddp v0.4S, v0.4S, v0.4S - fadd v0.2S, v0.2S, v6.2S - st1 {v0.2S}, [x0], x3 + mov v4.d[1], v5.d[0] + mov v6.s[1], v6.s[0] + fmul v6.2s, v6.2s, v24.2s + fmul v0.4s, v2.4s, v16.4s + fmul v1.4s, v2.4s, v17.4s + fmls v0.4s, v3.4s, v18.4s + fmla v1.4s, v3.4s, v19.4s + fmla v0.4s, v4.4s, v20.4s + fmla v1.4s, v4.4s, v21.4s + faddp v0.4s, v0.4s, v1.4s + faddp v0.4s, v0.4s, v0.4s + fadd v0.2s, v0.2s, v6.2s + st1 {v0.2s}, [x0], x3 subs w4, w4, #1 b.gt 1b ret diff --git a/libavcodec/aarch64/h264cmc_neon.S b/libavcodec/aarch64/h264cmc_neon.S index 88ccd727d0..5b959b87d3 100644 --- a/libavcodec/aarch64/h264cmc_neon.S +++ b/libavcodec/aarch64/h264cmc_neon.S @@ -39,10 +39,10 @@ function ff_\type\()_\codec\()_chroma_mc8_neon, export=1 lsl w10, w10, #1 add w9, w9, w10 add x6, x6, w9, UXTW - ld1r {v22.8H}, [x6] + ld1r {v22.8h}, [x6] .endif .ifc \codec,vc1 - movi v22.8H, #28 + movi v22.8h, #28 .endif mul w7, w4, w5 lsl w14, w5, #3 @@ -55,139 +55,139 @@ function ff_\type\()_\codec\()_chroma_mc8_neon, export=1 add w4, w4, #64 b.eq 2f - dup v0.8B, w4 - dup v1.8B, w12 - ld1 {v4.8B, v5.8B}, [x1], x2 - dup v2.8B, w6 - dup v3.8B, w7 - ext v5.8B, v4.8B, v5.8B, #1 -1: ld1 {v6.8B, v7.8B}, [x1], x2 - umull v16.8H, v4.8B, v0.8B - umlal v16.8H, v5.8B, v1.8B - ext v7.8B, v6.8B, v7.8B, #1 - ld1 {v4.8B, v5.8B}, [x1], x2 - umlal v16.8H, v6.8B, v2.8B + dup v0.8b, w4 + dup v1.8b, w12 + ld1 {v4.8b, v5.8b}, [x1], x2 + dup v2.8b, w6 + dup v3.8b, w7 + ext v5.8b, v4.8b, v5.8b, #1 +1: ld1 {v6.8b, v7.8b}, [x1], x2 + umull v16.8h, v4.8b, v0.8b + umlal v16.8h, v5.8b, v1.8b + ext v7.8b, v6.8b, v7.8b, #1 + ld1 {v4.8b, v5.8b}, [x1], x2 + umlal v16.8h, v6.8b, v2.8b prfm pldl1strm, [x1] - ext v5.8B, v4.8B, v5.8B, #1 - umlal v16.8H, v7.8B, v3.8B - umull v17.8H, v6.8B, v0.8B + ext v5.8b, v4.8b, v5.8b, #1 + umlal v16.8h, v7.8b, v3.8b + umull v17.8h, v6.8b, v0.8b subs w3, w3, #2 - umlal v17.8H, v7.8B, v1.8B - umlal v17.8H, v4.8B, v2.8B - umlal v17.8H, v5.8B, v3.8B + umlal v17.8h, v7.8b, v1.8b + umlal v17.8h, v4.8b, v2.8b + umlal v17.8h, v5.8b, v3.8b prfm pldl1strm, [x1, x2] .ifc \codec,h264 - rshrn v16.8B, v16.8H, #6 - rshrn v17.8B, v17.8H, #6 + rshrn v16.8b, v16.8h, #6 + rshrn v17.8b, v17.8h, #6 .else - add v16.8H, v16.8H, v22.8H - add v17.8H, v17.8H, v22.8H - shrn v16.8B, v16.8H, #6 - shrn v17.8B, v17.8H, #6 + add v16.8h, v16.8h, v22.8h + add v17.8h, v17.8h, v22.8h + shrn v16.8b, v16.8h, #6 + shrn v17.8b, v17.8h, #6 .endif .ifc \type,avg - ld1 {v20.8B}, [x8], x2 - ld1 {v21.8B}, [x8], x2 - urhadd v16.8B, v16.8B, v20.8B - urhadd v17.8B, v17.8B, v21.8B + ld1 {v20.8b}, [x8], x2 + ld1 {v21.8b}, [x8], x2 + urhadd v16.8b, v16.8b, v20.8b + urhadd v17.8b, v17.8b, v21.8b .endif - st1 {v16.8B}, [x0], x2 - st1 {v17.8B}, [x0], x2 + st1 {v16.8b}, [x0], x2 + st1 {v17.8b}, [x0], x2 b.gt 1b ret 2: adds w12, w12, w6 - dup v0.8B, w4 + dup v0.8b, w4 b.eq 5f tst w6, w6 - dup v1.8B, w12 + dup v1.8b, w12 b.eq 4f - ld1 {v4.8B}, [x1], x2 -3: ld1 {v6.8B}, [x1], x2 - umull v16.8H, v4.8B, v0.8B - umlal v16.8H, v6.8B, v1.8B - ld1 {v4.8B}, [x1], x2 - umull v17.8H, v6.8B, v0.8B - umlal v17.8H, v4.8B, v1.8B + ld1 {v4.8b}, [x1], x2 +3: ld1 {v6.8b}, [x1], x2 + umull v16.8h, v4.8b, v0.8b + umlal v16.8h, v6.8b, v1.8b + ld1 {v4.8b}, [x1], x2 + umull v17.8h, v6.8b, v0.8b + umlal v17.8h, v4.8b, 
v1.8b prfm pldl1strm, [x1] .ifc \codec,h264 - rshrn v16.8B, v16.8H, #6 - rshrn v17.8B, v17.8H, #6 + rshrn v16.8b, v16.8h, #6 + rshrn v17.8b, v17.8h, #6 .else - add v16.8H, v16.8H, v22.8H - add v17.8H, v17.8H, v22.8H - shrn v16.8B, v16.8H, #6 - shrn v17.8B, v17.8H, #6 + add v16.8h, v16.8h, v22.8h + add v17.8h, v17.8h, v22.8h + shrn v16.8b, v16.8h, #6 + shrn v17.8b, v17.8h, #6 .endif prfm pldl1strm, [x1, x2] .ifc \type,avg - ld1 {v20.8B}, [x8], x2 - ld1 {v21.8B}, [x8], x2 - urhadd v16.8B, v16.8B, v20.8B - urhadd v17.8B, v17.8B, v21.8B + ld1 {v20.8b}, [x8], x2 + ld1 {v21.8b}, [x8], x2 + urhadd v16.8b, v16.8b, v20.8b + urhadd v17.8b, v17.8b, v21.8b .endif subs w3, w3, #2 - st1 {v16.8B}, [x0], x2 - st1 {v17.8B}, [x0], x2 + st1 {v16.8b}, [x0], x2 + st1 {v17.8b}, [x0], x2 b.gt 3b ret -4: ld1 {v4.8B, v5.8B}, [x1], x2 - ld1 {v6.8B, v7.8B}, [x1], x2 - ext v5.8B, v4.8B, v5.8B, #1 - ext v7.8B, v6.8B, v7.8B, #1 +4: ld1 {v4.8b, v5.8b}, [x1], x2 + ld1 {v6.8b, v7.8b}, [x1], x2 + ext v5.8b, v4.8b, v5.8b, #1 + ext v7.8b, v6.8b, v7.8b, #1 prfm pldl1strm, [x1] subs w3, w3, #2 - umull v16.8H, v4.8B, v0.8B - umlal v16.8H, v5.8B, v1.8B - umull v17.8H, v6.8B, v0.8B - umlal v17.8H, v7.8B, v1.8B + umull v16.8h, v4.8b, v0.8b + umlal v16.8h, v5.8b, v1.8b + umull v17.8h, v6.8b, v0.8b + umlal v17.8h, v7.8b, v1.8b prfm pldl1strm, [x1, x2] .ifc \codec,h264 - rshrn v16.8B, v16.8H, #6 - rshrn v17.8B, v17.8H, #6 + rshrn v16.8b, v16.8h, #6 + rshrn v17.8b, v17.8h, #6 .else - add v16.8H, v16.8H, v22.8H - add v17.8H, v17.8H, v22.8H - shrn v16.8B, v16.8H, #6 - shrn v17.8B, v17.8H, #6 + add v16.8h, v16.8h, v22.8h + add v17.8h, v17.8h, v22.8h + shrn v16.8b, v16.8h, #6 + shrn v17.8b, v17.8h, #6 .endif .ifc \type,avg - ld1 {v20.8B}, [x8], x2 - ld1 {v21.8B}, [x8], x2 - urhadd v16.8B, v16.8B, v20.8B - urhadd v17.8B, v17.8B, v21.8B + ld1 {v20.8b}, [x8], x2 + ld1 {v21.8b}, [x8], x2 + urhadd v16.8b, v16.8b, v20.8b + urhadd v17.8b, v17.8b, v21.8b .endif - st1 {v16.8B}, [x0], x2 - st1 {v17.8B}, [x0], x2 + st1 {v16.8b}, [x0], x2 + st1 {v17.8b}, [x0], x2 b.gt 4b ret -5: ld1 {v4.8B}, [x1], x2 - ld1 {v5.8B}, [x1], x2 +5: ld1 {v4.8b}, [x1], x2 + ld1 {v5.8b}, [x1], x2 prfm pldl1strm, [x1] subs w3, w3, #2 - umull v16.8H, v4.8B, v0.8B - umull v17.8H, v5.8B, v0.8B + umull v16.8h, v4.8b, v0.8b + umull v17.8h, v5.8b, v0.8b prfm pldl1strm, [x1, x2] .ifc \codec,h264 - rshrn v16.8B, v16.8H, #6 - rshrn v17.8B, v17.8H, #6 + rshrn v16.8b, v16.8h, #6 + rshrn v17.8b, v17.8h, #6 .else - add v16.8H, v16.8H, v22.8H - add v17.8H, v17.8H, v22.8H - shrn v16.8B, v16.8H, #6 - shrn v17.8B, v17.8H, #6 + add v16.8h, v16.8h, v22.8h + add v17.8h, v17.8h, v22.8h + shrn v16.8b, v16.8h, #6 + shrn v17.8b, v17.8h, #6 .endif .ifc \type,avg - ld1 {v20.8B}, [x8], x2 - ld1 {v21.8B}, [x8], x2 - urhadd v16.8B, v16.8B, v20.8B - urhadd v17.8B, v17.8B, v21.8B + ld1 {v20.8b}, [x8], x2 + ld1 {v21.8b}, [x8], x2 + urhadd v16.8b, v16.8b, v20.8b + urhadd v17.8b, v17.8b, v21.8b .endif - st1 {v16.8B}, [x0], x2 - st1 {v17.8B}, [x0], x2 + st1 {v16.8b}, [x0], x2 + st1 {v17.8b}, [x0], x2 b.gt 5b ret endfunc @@ -209,10 +209,10 @@ function ff_\type\()_\codec\()_chroma_mc4_neon, export=1 lsl w10, w10, #1 add w9, w9, w10 add x6, x6, w9, UXTW - ld1r {v22.8H}, [x6] + ld1r {v22.8h}, [x6] .endif .ifc \codec,vc1 - movi v22.8H, #28 + movi v22.8h, #28 .endif mul w7, w4, w5 lsl w14, w5, #3 @@ -225,133 +225,133 @@ function ff_\type\()_\codec\()_chroma_mc4_neon, export=1 add w4, w4, #64 b.eq 2f - dup v24.8B, w4 - dup v25.8B, w12 - ld1 {v4.8B}, [x1], x2 - dup v26.8B, w6 - dup v27.8B, w7 - ext v5.8B, v4.8B, 
v5.8B, #1 - trn1 v0.2S, v24.2S, v25.2S - trn1 v2.2S, v26.2S, v27.2S - trn1 v4.2S, v4.2S, v5.2S -1: ld1 {v6.8B}, [x1], x2 - ext v7.8B, v6.8B, v7.8B, #1 - trn1 v6.2S, v6.2S, v7.2S - umull v18.8H, v4.8B, v0.8B - umlal v18.8H, v6.8B, v2.8B - ld1 {v4.8B}, [x1], x2 - ext v5.8B, v4.8B, v5.8B, #1 - trn1 v4.2S, v4.2S, v5.2S + dup v24.8b, w4 + dup v25.8b, w12 + ld1 {v4.8b}, [x1], x2 + dup v26.8b, w6 + dup v27.8b, w7 + ext v5.8b, v4.8b, v5.8b, #1 + trn1 v0.2s, v24.2s, v25.2s + trn1 v2.2s, v26.2s, v27.2s + trn1 v4.2s, v4.2s, v5.2s +1: ld1 {v6.8b}, [x1], x2 + ext v7.8b, v6.8b, v7.8b, #1 + trn1 v6.2s, v6.2s, v7.2s + umull v18.8h, v4.8b, v0.8b + umlal v18.8h, v6.8b, v2.8b + ld1 {v4.8b}, [x1], x2 + ext v5.8b, v4.8b, v5.8b, #1 + trn1 v4.2s, v4.2s, v5.2s prfm pldl1strm, [x1] - umull v19.8H, v6.8B, v0.8B - umlal v19.8H, v4.8B, v2.8B - trn1 v30.2D, v18.2D, v19.2D - trn2 v31.2D, v18.2D, v19.2D - add v18.8H, v30.8H, v31.8H + umull v19.8h, v6.8b, v0.8b + umlal v19.8h, v4.8b, v2.8b + trn1 v30.2d, v18.2d, v19.2d + trn2 v31.2d, v18.2d, v19.2d + add v18.8h, v30.8h, v31.8h .ifc \codec,h264 - rshrn v16.8B, v18.8H, #6 + rshrn v16.8b, v18.8h, #6 .else - add v18.8H, v18.8H, v22.8H - shrn v16.8B, v18.8H, #6 + add v18.8h, v18.8h, v22.8h + shrn v16.8b, v18.8h, #6 .endif subs w3, w3, #2 prfm pldl1strm, [x1, x2] .ifc \type,avg - ld1 {v20.S}[0], [x8], x2 - ld1 {v20.S}[1], [x8], x2 - urhadd v16.8B, v16.8B, v20.8B + ld1 {v20.s}[0], [x8], x2 + ld1 {v20.s}[1], [x8], x2 + urhadd v16.8b, v16.8b, v20.8b .endif - st1 {v16.S}[0], [x0], x2 - st1 {v16.S}[1], [x0], x2 + st1 {v16.s}[0], [x0], x2 + st1 {v16.s}[1], [x0], x2 b.gt 1b ret 2: adds w12, w12, w6 - dup v30.8B, w4 + dup v30.8b, w4 b.eq 5f tst w6, w6 - dup v31.8B, w12 - trn1 v0.2S, v30.2S, v31.2S - trn2 v1.2S, v30.2S, v31.2S + dup v31.8b, w12 + trn1 v0.2s, v30.2s, v31.2s + trn2 v1.2s, v30.2s, v31.2s b.eq 4f - ext v1.8B, v0.8B, v1.8B, #4 - ld1 {v4.S}[0], [x1], x2 -3: ld1 {v4.S}[1], [x1], x2 - umull v18.8H, v4.8B, v0.8B - ld1 {v4.S}[0], [x1], x2 - umull v19.8H, v4.8B, v1.8B - trn1 v30.2D, v18.2D, v19.2D - trn2 v31.2D, v18.2D, v19.2D - add v18.8H, v30.8H, v31.8H + ext v1.8b, v0.8b, v1.8b, #4 + ld1 {v4.s}[0], [x1], x2 +3: ld1 {v4.s}[1], [x1], x2 + umull v18.8h, v4.8b, v0.8b + ld1 {v4.s}[0], [x1], x2 + umull v19.8h, v4.8b, v1.8b + trn1 v30.2d, v18.2d, v19.2d + trn2 v31.2d, v18.2d, v19.2d + add v18.8h, v30.8h, v31.8h prfm pldl1strm, [x1] .ifc \codec,h264 - rshrn v16.8B, v18.8H, #6 + rshrn v16.8b, v18.8h, #6 .else - add v18.8H, v18.8H, v22.8H - shrn v16.8B, v18.8H, #6 + add v18.8h, v18.8h, v22.8h + shrn v16.8b, v18.8h, #6 .endif .ifc \type,avg - ld1 {v20.S}[0], [x8], x2 - ld1 {v20.S}[1], [x8], x2 - urhadd v16.8B, v16.8B, v20.8B + ld1 {v20.s}[0], [x8], x2 + ld1 {v20.s}[1], [x8], x2 + urhadd v16.8b, v16.8b, v20.8b .endif subs w3, w3, #2 prfm pldl1strm, [x1, x2] - st1 {v16.S}[0], [x0], x2 - st1 {v16.S}[1], [x0], x2 + st1 {v16.s}[0], [x0], x2 + st1 {v16.s}[1], [x0], x2 b.gt 3b ret -4: ld1 {v4.8B}, [x1], x2 - ld1 {v6.8B}, [x1], x2 - ext v5.8B, v4.8B, v5.8B, #1 - ext v7.8B, v6.8B, v7.8B, #1 - trn1 v4.2S, v4.2S, v5.2S - trn1 v6.2S, v6.2S, v7.2S - umull v18.8H, v4.8B, v0.8B - umull v19.8H, v6.8B, v0.8B +4: ld1 {v4.8b}, [x1], x2 + ld1 {v6.8b}, [x1], x2 + ext v5.8b, v4.8b, v5.8b, #1 + ext v7.8b, v6.8b, v7.8b, #1 + trn1 v4.2s, v4.2s, v5.2s + trn1 v6.2s, v6.2s, v7.2s + umull v18.8h, v4.8b, v0.8b + umull v19.8h, v6.8b, v0.8b subs w3, w3, #2 - trn1 v30.2D, v18.2D, v19.2D - trn2 v31.2D, v18.2D, v19.2D - add v18.8H, v30.8H, v31.8H + trn1 v30.2d, v18.2d, v19.2d + trn2 v31.2d, v18.2d, v19.2d + add v18.8h, 
v30.8h, v31.8h prfm pldl1strm, [x1] .ifc \codec,h264 - rshrn v16.8B, v18.8H, #6 + rshrn v16.8b, v18.8h, #6 .else - add v18.8H, v18.8H, v22.8H - shrn v16.8B, v18.8H, #6 + add v18.8h, v18.8h, v22.8h + shrn v16.8b, v18.8h, #6 .endif .ifc \type,avg - ld1 {v20.S}[0], [x8], x2 - ld1 {v20.S}[1], [x8], x2 - urhadd v16.8B, v16.8B, v20.8B + ld1 {v20.s}[0], [x8], x2 + ld1 {v20.s}[1], [x8], x2 + urhadd v16.8b, v16.8b, v20.8b .endif prfm pldl1strm, [x1] - st1 {v16.S}[0], [x0], x2 - st1 {v16.S}[1], [x0], x2 + st1 {v16.s}[0], [x0], x2 + st1 {v16.s}[1], [x0], x2 b.gt 4b ret -5: ld1 {v4.S}[0], [x1], x2 - ld1 {v4.S}[1], [x1], x2 - umull v18.8H, v4.8B, v30.8B +5: ld1 {v4.s}[0], [x1], x2 + ld1 {v4.s}[1], [x1], x2 + umull v18.8h, v4.8b, v30.8b subs w3, w3, #2 prfm pldl1strm, [x1] .ifc \codec,h264 - rshrn v16.8B, v18.8H, #6 + rshrn v16.8b, v18.8h, #6 .else - add v18.8H, v18.8H, v22.8H - shrn v16.8B, v18.8H, #6 + add v18.8h, v18.8h, v22.8h + shrn v16.8b, v18.8h, #6 .endif .ifc \type,avg - ld1 {v20.S}[0], [x8], x2 - ld1 {v20.S}[1], [x8], x2 - urhadd v16.8B, v16.8B, v20.8B + ld1 {v20.s}[0], [x8], x2 + ld1 {v20.s}[1], [x8], x2 + urhadd v16.8b, v16.8b, v20.8b .endif prfm pldl1strm, [x1] - st1 {v16.S}[0], [x0], x2 - st1 {v16.S}[1], [x0], x2 + st1 {v16.s}[0], [x0], x2 + st1 {v16.s}[1], [x0], x2 b.gt 5b ret endfunc @@ -372,51 +372,51 @@ function ff_\type\()_h264_chroma_mc2_neon, export=1 sub w4, w7, w13 sub w4, w4, w14 add w4, w4, #64 - dup v0.8B, w4 - dup v2.8B, w12 - dup v1.8B, w6 - dup v3.8B, w7 - trn1 v0.4H, v0.4H, v2.4H - trn1 v1.4H, v1.4H, v3.4H + dup v0.8b, w4 + dup v2.8b, w12 + dup v1.8b, w6 + dup v3.8b, w7 + trn1 v0.4h, v0.4h, v2.4h + trn1 v1.4h, v1.4h, v3.4h 1: - ld1 {v4.S}[0], [x1], x2 - ld1 {v4.S}[1], [x1], x2 - rev64 v5.2S, v4.2S - ld1 {v5.S}[1], [x1] - ext v6.8B, v4.8B, v5.8B, #1 - ext v7.8B, v5.8B, v4.8B, #1 - trn1 v4.4H, v4.4H, v6.4H - trn1 v5.4H, v5.4H, v7.4H - umull v16.8H, v4.8B, v0.8B - umlal v16.8H, v5.8B, v1.8B + ld1 {v4.s}[0], [x1], x2 + ld1 {v4.s}[1], [x1], x2 + rev64 v5.2s, v4.2s + ld1 {v5.s}[1], [x1] + ext v6.8b, v4.8b, v5.8b, #1 + ext v7.8b, v5.8b, v4.8b, #1 + trn1 v4.4h, v4.4h, v6.4h + trn1 v5.4h, v5.4h, v7.4h + umull v16.8h, v4.8b, v0.8b + umlal v16.8h, v5.8b, v1.8b .ifc \type,avg - ld1 {v18.H}[0], [x0], x2 - ld1 {v18.H}[2], [x0] + ld1 {v18.h}[0], [x0], x2 + ld1 {v18.h}[2], [x0] sub x0, x0, x2 .endif - rev64 v17.4S, v16.4S - add v16.8H, v16.8H, v17.8H - rshrn v16.8B, v16.8H, #6 + rev64 v17.4s, v16.4s + add v16.8h, v16.8h, v17.8h + rshrn v16.8b, v16.8h, #6 .ifc \type,avg - urhadd v16.8B, v16.8B, v18.8B + urhadd v16.8b, v16.8b, v18.8b .endif - st1 {v16.H}[0], [x0], x2 - st1 {v16.H}[2], [x0], x2 + st1 {v16.h}[0], [x0], x2 + st1 {v16.h}[2], [x0], x2 subs w3, w3, #2 b.gt 1b ret 2: - ld1 {v16.H}[0], [x1], x2 - ld1 {v16.H}[1], [x1], x2 + ld1 {v16.h}[0], [x1], x2 + ld1 {v16.h}[1], [x1], x2 .ifc \type,avg - ld1 {v18.H}[0], [x0], x2 - ld1 {v18.H}[1], [x0] + ld1 {v18.h}[0], [x0], x2 + ld1 {v18.h}[1], [x0] sub x0, x0, x2 - urhadd v16.8B, v16.8B, v18.8B + urhadd v16.8b, v16.8b, v18.8b .endif - st1 {v16.H}[0], [x0], x2 - st1 {v16.H}[1], [x0], x2 + st1 {v16.h}[0], [x0], x2 + st1 {v16.h}[1], [x0], x2 subs w3, w3, #2 b.gt 2b ret diff --git a/libavcodec/aarch64/h264dsp_neon.S b/libavcodec/aarch64/h264dsp_neon.S index ea221e6862..71c2ddfd0c 100644 --- a/libavcodec/aarch64/h264dsp_neon.S +++ b/libavcodec/aarch64/h264dsp_neon.S @@ -27,7 +27,7 @@ cmp w2, #0 ldr w6, [x4] ccmp w3, #0, #0, ne - mov v24.S[0], w6 + mov v24.s[0], w6 and w8, w6, w6, lsl #16 b.eq 1f ands w8, w8, w8, lsl #8 @@ -38,95 +38,95 @@ .endm 
.macro h264_loop_filter_luma - dup v22.16B, w2 // alpha - uxtl v24.8H, v24.8B - uabd v21.16B, v16.16B, v0.16B // abs(p0 - q0) - uxtl v24.4S, v24.4H - uabd v28.16B, v18.16B, v16.16B // abs(p1 - p0) - sli v24.8H, v24.8H, #8 - uabd v30.16B, v2.16B, v0.16B // abs(q1 - q0) - sli v24.4S, v24.4S, #16 - cmhi v21.16B, v22.16B, v21.16B // < alpha - dup v22.16B, w3 // beta - cmlt v23.16B, v24.16B, #0 - cmhi v28.16B, v22.16B, v28.16B // < beta - cmhi v30.16B, v22.16B, v30.16B // < beta - bic v21.16B, v21.16B, v23.16B - uabd v17.16B, v20.16B, v16.16B // abs(p2 - p0) - and v21.16B, v21.16B, v28.16B - uabd v19.16B, v4.16B, v0.16B // abs(q2 - q0) - and v21.16B, v21.16B, v30.16B // < beta + dup v22.16b, w2 // alpha + uxtl v24.8h, v24.8b + uabd v21.16b, v16.16b, v0.16b // abs(p0 - q0) + uxtl v24.4s, v24.4h + uabd v28.16b, v18.16b, v16.16b // abs(p1 - p0) + sli v24.8h, v24.8h, #8 + uabd v30.16b, v2.16b, v0.16b // abs(q1 - q0) + sli v24.4s, v24.4s, #16 + cmhi v21.16b, v22.16b, v21.16b // < alpha + dup v22.16b, w3 // beta + cmlt v23.16b, v24.16b, #0 + cmhi v28.16b, v22.16b, v28.16b // < beta + cmhi v30.16b, v22.16b, v30.16b // < beta + bic v21.16b, v21.16b, v23.16b + uabd v17.16b, v20.16b, v16.16b // abs(p2 - p0) + and v21.16b, v21.16b, v28.16b + uabd v19.16b, v4.16b, v0.16b // abs(q2 - q0) + and v21.16b, v21.16b, v30.16b // < beta shrn v30.8b, v21.8h, #4 mov x7, v30.d[0] - cmhi v17.16B, v22.16B, v17.16B // < beta - cmhi v19.16B, v22.16B, v19.16B // < beta + cmhi v17.16b, v22.16b, v17.16b // < beta + cmhi v19.16b, v22.16b, v19.16b // < beta cbz x7, 9f - and v17.16B, v17.16B, v21.16B - and v19.16B, v19.16B, v21.16B - and v24.16B, v24.16B, v21.16B - urhadd v28.16B, v16.16B, v0.16B - sub v21.16B, v24.16B, v17.16B - uqadd v23.16B, v18.16B, v24.16B - uhadd v20.16B, v20.16B, v28.16B - sub v21.16B, v21.16B, v19.16B - uhadd v28.16B, v4.16B, v28.16B - umin v23.16B, v23.16B, v20.16B - uqsub v22.16B, v18.16B, v24.16B - uqadd v4.16B, v2.16B, v24.16B - umax v23.16B, v23.16B, v22.16B - uqsub v22.16B, v2.16B, v24.16B - umin v28.16B, v4.16B, v28.16B - uxtl v4.8H, v0.8B - umax v28.16B, v28.16B, v22.16B - uxtl2 v20.8H, v0.16B - usubw v4.8H, v4.8H, v16.8B - usubw2 v20.8H, v20.8H, v16.16B - shl v4.8H, v4.8H, #2 - shl v20.8H, v20.8H, #2 - uaddw v4.8H, v4.8H, v18.8B - uaddw2 v20.8H, v20.8H, v18.16B - usubw v4.8H, v4.8H, v2.8B - usubw2 v20.8H, v20.8H, v2.16B - rshrn v4.8B, v4.8H, #3 - rshrn2 v4.16B, v20.8H, #3 - bsl v17.16B, v23.16B, v18.16B - bsl v19.16B, v28.16B, v2.16B - neg v23.16B, v21.16B - uxtl v28.8H, v16.8B - smin v4.16B, v4.16B, v21.16B - uxtl2 v21.8H, v16.16B - smax v4.16B, v4.16B, v23.16B - uxtl v22.8H, v0.8B - uxtl2 v24.8H, v0.16B - saddw v28.8H, v28.8H, v4.8B - saddw2 v21.8H, v21.8H, v4.16B - ssubw v22.8H, v22.8H, v4.8B - ssubw2 v24.8H, v24.8H, v4.16B - sqxtun v16.8B, v28.8H - sqxtun2 v16.16B, v21.8H - sqxtun v0.8B, v22.8H - sqxtun2 v0.16B, v24.8H + and v17.16b, v17.16b, v21.16b + and v19.16b, v19.16b, v21.16b + and v24.16b, v24.16b, v21.16b + urhadd v28.16b, v16.16b, v0.16b + sub v21.16b, v24.16b, v17.16b + uqadd v23.16b, v18.16b, v24.16b + uhadd v20.16b, v20.16b, v28.16b + sub v21.16b, v21.16b, v19.16b + uhadd v28.16b, v4.16b, v28.16b + umin v23.16b, v23.16b, v20.16b + uqsub v22.16b, v18.16b, v24.16b + uqadd v4.16b, v2.16b, v24.16b + umax v23.16b, v23.16b, v22.16b + uqsub v22.16b, v2.16b, v24.16b + umin v28.16b, v4.16b, v28.16b + uxtl v4.8h, v0.8b + umax v28.16b, v28.16b, v22.16b + uxtl2 v20.8h, v0.16b + usubw v4.8h, v4.8h, v16.8b + usubw2 v20.8h, v20.8h, v16.16b + shl v4.8h, v4.8h, #2 + shl v20.8h, v20.8h, #2 + 
uaddw v4.8h, v4.8h, v18.8b + uaddw2 v20.8h, v20.8h, v18.16b + usubw v4.8h, v4.8h, v2.8b + usubw2 v20.8h, v20.8h, v2.16b + rshrn v4.8b, v4.8h, #3 + rshrn2 v4.16b, v20.8h, #3 + bsl v17.16b, v23.16b, v18.16b + bsl v19.16b, v28.16b, v2.16b + neg v23.16b, v21.16b + uxtl v28.8h, v16.8b + smin v4.16b, v4.16b, v21.16b + uxtl2 v21.8h, v16.16b + smax v4.16b, v4.16b, v23.16b + uxtl v22.8h, v0.8b + uxtl2 v24.8h, v0.16b + saddw v28.8h, v28.8h, v4.8b + saddw2 v21.8h, v21.8h, v4.16b + ssubw v22.8h, v22.8h, v4.8b + ssubw2 v24.8h, v24.8h, v4.16b + sqxtun v16.8b, v28.8h + sqxtun2 v16.16b, v21.8h + sqxtun v0.8b, v22.8h + sqxtun2 v0.16b, v24.8h .endm function ff_h264_v_loop_filter_luma_neon, export=1 h264_loop_filter_start - ld1 {v0.16B}, [x0], x1 - ld1 {v2.16B}, [x0], x1 - ld1 {v4.16B}, [x0], x1 + ld1 {v0.16b}, [x0], x1 + ld1 {v2.16b}, [x0], x1 + ld1 {v4.16b}, [x0], x1 sub x0, x0, x1, lsl #2 sub x0, x0, x1, lsl #1 - ld1 {v20.16B}, [x0], x1 - ld1 {v18.16B}, [x0], x1 - ld1 {v16.16B}, [x0], x1 + ld1 {v20.16b}, [x0], x1 + ld1 {v18.16b}, [x0], x1 + ld1 {v16.16b}, [x0], x1 h264_loop_filter_luma sub x0, x0, x1, lsl #1 - st1 {v17.16B}, [x0], x1 - st1 {v16.16B}, [x0], x1 - st1 {v0.16B}, [x0], x1 - st1 {v19.16B}, [x0] + st1 {v17.16b}, [x0], x1 + st1 {v16.16b}, [x0], x1 + st1 {v0.16b}, [x0], x1 + st1 {v19.16b}, [x0] 9: ret endfunc @@ -135,22 +135,22 @@ function ff_h264_h_loop_filter_luma_neon, export=1 h264_loop_filter_start sub x0, x0, #4 - ld1 {v6.8B}, [x0], x1 - ld1 {v20.8B}, [x0], x1 - ld1 {v18.8B}, [x0], x1 - ld1 {v16.8B}, [x0], x1 - ld1 {v0.8B}, [x0], x1 - ld1 {v2.8B}, [x0], x1 - ld1 {v4.8B}, [x0], x1 - ld1 {v26.8B}, [x0], x1 - ld1 {v6.D}[1], [x0], x1 - ld1 {v20.D}[1], [x0], x1 - ld1 {v18.D}[1], [x0], x1 - ld1 {v16.D}[1], [x0], x1 - ld1 {v0.D}[1], [x0], x1 - ld1 {v2.D}[1], [x0], x1 - ld1 {v4.D}[1], [x0], x1 - ld1 {v26.D}[1], [x0], x1 + ld1 {v6.8b}, [x0], x1 + ld1 {v20.8b}, [x0], x1 + ld1 {v18.8b}, [x0], x1 + ld1 {v16.8b}, [x0], x1 + ld1 {v0.8b}, [x0], x1 + ld1 {v2.8b}, [x0], x1 + ld1 {v4.8b}, [x0], x1 + ld1 {v26.8b}, [x0], x1 + ld1 {v6.d}[1], [x0], x1 + ld1 {v20.d}[1], [x0], x1 + ld1 {v18.d}[1], [x0], x1 + ld1 {v16.d}[1], [x0], x1 + ld1 {v0.d}[1], [x0], x1 + ld1 {v2.d}[1], [x0], x1 + ld1 {v4.d}[1], [x0], x1 + ld1 {v26.d}[1], [x0], x1 transpose_8x16B v6, v20, v18, v16, v0, v2, v4, v26, v21, v23 @@ -160,22 +160,22 @@ function ff_h264_h_loop_filter_luma_neon, export=1 sub x0, x0, x1, lsl #4 add x0, x0, #2 - st1 {v17.S}[0], [x0], x1 - st1 {v16.S}[0], [x0], x1 - st1 {v0.S}[0], [x0], x1 - st1 {v19.S}[0], [x0], x1 - st1 {v17.S}[1], [x0], x1 - st1 {v16.S}[1], [x0], x1 - st1 {v0.S}[1], [x0], x1 - st1 {v19.S}[1], [x0], x1 - st1 {v17.S}[2], [x0], x1 - st1 {v16.S}[2], [x0], x1 - st1 {v0.S}[2], [x0], x1 - st1 {v19.S}[2], [x0], x1 - st1 {v17.S}[3], [x0], x1 - st1 {v16.S}[3], [x0], x1 - st1 {v0.S}[3], [x0], x1 - st1 {v19.S}[3], [x0], x1 + st1 {v17.s}[0], [x0], x1 + st1 {v16.s}[0], [x0], x1 + st1 {v0.s}[0], [x0], x1 + st1 {v19.s}[0], [x0], x1 + st1 {v17.s}[1], [x0], x1 + st1 {v16.s}[1], [x0], x1 + st1 {v0.s}[1], [x0], x1 + st1 {v19.s}[1], [x0], x1 + st1 {v17.s}[2], [x0], x1 + st1 {v16.s}[2], [x0], x1 + st1 {v0.s}[2], [x0], x1 + st1 {v19.s}[2], [x0], x1 + st1 {v17.s}[3], [x0], x1 + st1 {v16.s}[3], [x0], x1 + st1 {v0.s}[3], [x0], x1 + st1 {v19.s}[3], [x0], x1 9: ret endfunc @@ -377,52 +377,52 @@ function ff_h264_h_loop_filter_luma_intra_neon, export=1 endfunc .macro h264_loop_filter_chroma - dup v22.8B, w2 // alpha - dup v23.8B, w3 // beta - uxtl v24.8H, v24.8B - uabd v26.8B, v16.8B, v0.8B // abs(p0 - q0) - uabd v28.8B, v18.8B, 
v16.8B // abs(p1 - p0) - uabd v30.8B, v2.8B, v0.8B // abs(q1 - q0) - cmhi v26.8B, v22.8B, v26.8B // < alpha - cmhi v28.8B, v23.8B, v28.8B // < beta - cmhi v30.8B, v23.8B, v30.8B // < beta - uxtl v4.8H, v0.8B - and v26.8B, v26.8B, v28.8B - usubw v4.8H, v4.8H, v16.8B - and v26.8B, v26.8B, v30.8B - shl v4.8H, v4.8H, #2 + dup v22.8b, w2 // alpha + dup v23.8b, w3 // beta + uxtl v24.8h, v24.8b + uabd v26.8b, v16.8b, v0.8b // abs(p0 - q0) + uabd v28.8b, v18.8b, v16.8b // abs(p1 - p0) + uabd v30.8b, v2.8b, v0.8b // abs(q1 - q0) + cmhi v26.8b, v22.8b, v26.8b // < alpha + cmhi v28.8b, v23.8b, v28.8b // < beta + cmhi v30.8b, v23.8b, v30.8b // < beta + uxtl v4.8h, v0.8b + and v26.8b, v26.8b, v28.8b + usubw v4.8h, v4.8h, v16.8b + and v26.8b, v26.8b, v30.8b + shl v4.8h, v4.8h, #2 mov x8, v26.d[0] - sli v24.8H, v24.8H, #8 - uaddw v4.8H, v4.8H, v18.8B + sli v24.8h, v24.8h, #8 + uaddw v4.8h, v4.8h, v18.8b cbz x8, 9f - usubw v4.8H, v4.8H, v2.8B - rshrn v4.8B, v4.8H, #3 - smin v4.8B, v4.8B, v24.8B - neg v25.8B, v24.8B - smax v4.8B, v4.8B, v25.8B - uxtl v22.8H, v0.8B - and v4.8B, v4.8B, v26.8B - uxtl v28.8H, v16.8B - saddw v28.8H, v28.8H, v4.8B - ssubw v22.8H, v22.8H, v4.8B - sqxtun v16.8B, v28.8H - sqxtun v0.8B, v22.8H + usubw v4.8h, v4.8h, v2.8b + rshrn v4.8b, v4.8h, #3 + smin v4.8b, v4.8b, v24.8b + neg v25.8b, v24.8b + smax v4.8b, v4.8b, v25.8b + uxtl v22.8h, v0.8b + and v4.8b, v4.8b, v26.8b + uxtl v28.8h, v16.8b + saddw v28.8h, v28.8h, v4.8b + ssubw v22.8h, v22.8h, v4.8b + sqxtun v16.8b, v28.8h + sqxtun v0.8b, v22.8h .endm function ff_h264_v_loop_filter_chroma_neon, export=1 h264_loop_filter_start sub x0, x0, x1, lsl #1 - ld1 {v18.8B}, [x0], x1 - ld1 {v16.8B}, [x0], x1 - ld1 {v0.8B}, [x0], x1 - ld1 {v2.8B}, [x0] + ld1 {v18.8b}, [x0], x1 + ld1 {v16.8b}, [x0], x1 + ld1 {v0.8b}, [x0], x1 + ld1 {v2.8b}, [x0] h264_loop_filter_chroma sub x0, x0, x1, lsl #1 - st1 {v16.8B}, [x0], x1 - st1 {v0.8B}, [x0], x1 + st1 {v16.8b}, [x0], x1 + st1 {v0.8b}, [x0], x1 9: ret endfunc @@ -432,14 +432,14 @@ function ff_h264_h_loop_filter_chroma_neon, export=1 sub x0, x0, #2 h_loop_filter_chroma420: - ld1 {v18.S}[0], [x0], x1 - ld1 {v16.S}[0], [x0], x1 - ld1 {v0.S}[0], [x0], x1 - ld1 {v2.S}[0], [x0], x1 - ld1 {v18.S}[1], [x0], x1 - ld1 {v16.S}[1], [x0], x1 - ld1 {v0.S}[1], [x0], x1 - ld1 {v2.S}[1], [x0], x1 + ld1 {v18.s}[0], [x0], x1 + ld1 {v16.s}[0], [x0], x1 + ld1 {v0.s}[0], [x0], x1 + ld1 {v2.s}[0], [x0], x1 + ld1 {v18.s}[1], [x0], x1 + ld1 {v16.s}[1], [x0], x1 + ld1 {v0.s}[1], [x0], x1 + ld1 {v2.s}[1], [x0], x1 transpose_4x8B v18, v16, v0, v2, v28, v29, v30, v31 @@ -448,14 +448,14 @@ h_loop_filter_chroma420: transpose_4x8B v18, v16, v0, v2, v28, v29, v30, v31 sub x0, x0, x1, lsl #3 - st1 {v18.S}[0], [x0], x1 - st1 {v16.S}[0], [x0], x1 - st1 {v0.S}[0], [x0], x1 - st1 {v2.S}[0], [x0], x1 - st1 {v18.S}[1], [x0], x1 - st1 {v16.S}[1], [x0], x1 - st1 {v0.S}[1], [x0], x1 - st1 {v2.S}[1], [x0], x1 + st1 {v18.s}[0], [x0], x1 + st1 {v16.s}[0], [x0], x1 + st1 {v0.s}[0], [x0], x1 + st1 {v2.s}[0], [x0], x1 + st1 {v18.s}[1], [x0], x1 + st1 {v16.s}[1], [x0], x1 + st1 {v0.s}[1], [x0], x1 + st1 {v2.s}[1], [x0], x1 9: ret endfunc @@ -584,102 +584,102 @@ function ff_h264_h_loop_filter_chroma422_intra_neon, export=1 endfunc .macro biweight_16 macs, macd - dup v0.16B, w5 - dup v1.16B, w6 - mov v4.16B, v16.16B - mov v6.16B, v16.16B + dup v0.16b, w5 + dup v1.16b, w6 + mov v4.16b, v16.16b + mov v6.16b, v16.16b 1: subs w3, w3, #2 - ld1 {v20.16B}, [x0], x2 - \macd v4.8H, v0.8B, v20.8B + ld1 {v20.16b}, [x0], x2 + \macd v4.8h, v0.8b, v20.8b \macd\()2 
v6.8H, v0.16B, v20.16B - ld1 {v22.16B}, [x1], x2 - \macs v4.8H, v1.8B, v22.8B + ld1 {v22.16b}, [x1], x2 + \macs v4.8h, v1.8b, v22.8b \macs\()2 v6.8H, v1.16B, v22.16B - mov v24.16B, v16.16B - ld1 {v28.16B}, [x0], x2 - mov v26.16B, v16.16B - \macd v24.8H, v0.8B, v28.8B + mov v24.16b, v16.16b + ld1 {v28.16b}, [x0], x2 + mov v26.16b, v16.16b + \macd v24.8h, v0.8b, v28.8b \macd\()2 v26.8H, v0.16B, v28.16B - ld1 {v30.16B}, [x1], x2 - \macs v24.8H, v1.8B, v30.8B + ld1 {v30.16b}, [x1], x2 + \macs v24.8h, v1.8b, v30.8b \macs\()2 v26.8H, v1.16B, v30.16B - sshl v4.8H, v4.8H, v18.8H - sshl v6.8H, v6.8H, v18.8H - sqxtun v4.8B, v4.8H - sqxtun2 v4.16B, v6.8H - sshl v24.8H, v24.8H, v18.8H - sshl v26.8H, v26.8H, v18.8H - sqxtun v24.8B, v24.8H - sqxtun2 v24.16B, v26.8H - mov v6.16B, v16.16B - st1 {v4.16B}, [x7], x2 - mov v4.16B, v16.16B - st1 {v24.16B}, [x7], x2 + sshl v4.8h, v4.8h, v18.8h + sshl v6.8h, v6.8h, v18.8h + sqxtun v4.8b, v4.8h + sqxtun2 v4.16b, v6.8h + sshl v24.8h, v24.8h, v18.8h + sshl v26.8h, v26.8h, v18.8h + sqxtun v24.8b, v24.8h + sqxtun2 v24.16b, v26.8h + mov v6.16b, v16.16b + st1 {v4.16b}, [x7], x2 + mov v4.16b, v16.16b + st1 {v24.16b}, [x7], x2 b.ne 1b ret .endm .macro biweight_8 macs, macd - dup v0.8B, w5 - dup v1.8B, w6 - mov v2.16B, v16.16B - mov v20.16B, v16.16B + dup v0.8b, w5 + dup v1.8b, w6 + mov v2.16b, v16.16b + mov v20.16b, v16.16b 1: subs w3, w3, #2 - ld1 {v4.8B}, [x0], x2 - \macd v2.8H, v0.8B, v4.8B - ld1 {v5.8B}, [x1], x2 - \macs v2.8H, v1.8B, v5.8B - ld1 {v6.8B}, [x0], x2 - \macd v20.8H, v0.8B, v6.8B - ld1 {v7.8B}, [x1], x2 - \macs v20.8H, v1.8B, v7.8B - sshl v2.8H, v2.8H, v18.8H - sqxtun v2.8B, v2.8H - sshl v20.8H, v20.8H, v18.8H - sqxtun v4.8B, v20.8H - mov v20.16B, v16.16B - st1 {v2.8B}, [x7], x2 - mov v2.16B, v16.16B - st1 {v4.8B}, [x7], x2 + ld1 {v4.8b}, [x0], x2 + \macd v2.8h, v0.8b, v4.8b + ld1 {v5.8b}, [x1], x2 + \macs v2.8h, v1.8b, v5.8b + ld1 {v6.8b}, [x0], x2 + \macd v20.8h, v0.8b, v6.8b + ld1 {v7.8b}, [x1], x2 + \macs v20.8h, v1.8b, v7.8b + sshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + sshl v20.8h, v20.8h, v18.8h + sqxtun v4.8b, v20.8h + mov v20.16b, v16.16b + st1 {v2.8b}, [x7], x2 + mov v2.16b, v16.16b + st1 {v4.8b}, [x7], x2 b.ne 1b ret .endm .macro biweight_4 macs, macd - dup v0.8B, w5 - dup v1.8B, w6 - mov v2.16B, v16.16B - mov v20.16B,v16.16B + dup v0.8b, w5 + dup v1.8b, w6 + mov v2.16b, v16.16b + mov v20.16b,v16.16b 1: subs w3, w3, #4 - ld1 {v4.S}[0], [x0], x2 - ld1 {v4.S}[1], [x0], x2 - \macd v2.8H, v0.8B, v4.8B - ld1 {v5.S}[0], [x1], x2 - ld1 {v5.S}[1], [x1], x2 - \macs v2.8H, v1.8B, v5.8B + ld1 {v4.s}[0], [x0], x2 + ld1 {v4.s}[1], [x0], x2 + \macd v2.8h, v0.8b, v4.8b + ld1 {v5.s}[0], [x1], x2 + ld1 {v5.s}[1], [x1], x2 + \macs v2.8h, v1.8b, v5.8b b.lt 2f - ld1 {v6.S}[0], [x0], x2 - ld1 {v6.S}[1], [x0], x2 - \macd v20.8H, v0.8B, v6.8B - ld1 {v7.S}[0], [x1], x2 - ld1 {v7.S}[1], [x1], x2 - \macs v20.8H, v1.8B, v7.8B - sshl v2.8H, v2.8H, v18.8H - sqxtun v2.8B, v2.8H - sshl v20.8H, v20.8H, v18.8H - sqxtun v4.8B, v20.8H - mov v20.16B, v16.16B - st1 {v2.S}[0], [x7], x2 - st1 {v2.S}[1], [x7], x2 - mov v2.16B, v16.16B - st1 {v4.S}[0], [x7], x2 - st1 {v4.S}[1], [x7], x2 + ld1 {v6.s}[0], [x0], x2 + ld1 {v6.s}[1], [x0], x2 + \macd v20.8h, v0.8b, v6.8b + ld1 {v7.s}[0], [x1], x2 + ld1 {v7.s}[1], [x1], x2 + \macs v20.8h, v1.8b, v7.8b + sshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + sshl v20.8h, v20.8h, v18.8h + sqxtun v4.8b, v20.8h + mov v20.16b, v16.16b + st1 {v2.s}[0], [x7], x2 + st1 {v2.s}[1], [x7], x2 + mov v2.16b, v16.16b + st1 {v4.s}[0], [x7], x2 + st1 
{v4.s}[1], [x7], x2 b.ne 1b ret -2: sshl v2.8H, v2.8H, v18.8H - sqxtun v2.8B, v2.8H - st1 {v2.S}[0], [x7], x2 - st1 {v2.S}[1], [x7], x2 +2: sshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + st1 {v2.s}[0], [x7], x2 + st1 {v2.s}[1], [x7], x2 ret .endm @@ -689,10 +689,10 @@ function ff_biweight_h264_pixels_\w\()_neon, export=1 add w7, w7, #1 eor w8, w8, w6, lsr #30 orr w7, w7, #1 - dup v18.8H, w4 + dup v18.8h, w4 lsl w7, w7, w4 - not v18.16B, v18.16B - dup v16.8H, w7 + not v18.16b, v18.16b + dup v16.8h, w7 mov x7, x0 cbz w8, 10f subs w8, w8, #1 @@ -716,78 +716,78 @@ endfunc biweight_func 4 .macro weight_16 add - dup v0.16B, w4 + dup v0.16b, w4 1: subs w2, w2, #2 - ld1 {v20.16B}, [x0], x1 - umull v4.8H, v0.8B, v20.8B - umull2 v6.8H, v0.16B, v20.16B - ld1 {v28.16B}, [x0], x1 - umull v24.8H, v0.8B, v28.8B - umull2 v26.8H, v0.16B, v28.16B - \add v4.8H, v16.8H, v4.8H - srshl v4.8H, v4.8H, v18.8H - \add v6.8H, v16.8H, v6.8H - srshl v6.8H, v6.8H, v18.8H - sqxtun v4.8B, v4.8H - sqxtun2 v4.16B, v6.8H - \add v24.8H, v16.8H, v24.8H - srshl v24.8H, v24.8H, v18.8H - \add v26.8H, v16.8H, v26.8H - srshl v26.8H, v26.8H, v18.8H - sqxtun v24.8B, v24.8H - sqxtun2 v24.16B, v26.8H - st1 {v4.16B}, [x5], x1 - st1 {v24.16B}, [x5], x1 + ld1 {v20.16b}, [x0], x1 + umull v4.8h, v0.8b, v20.8b + umull2 v6.8h, v0.16b, v20.16b + ld1 {v28.16b}, [x0], x1 + umull v24.8h, v0.8b, v28.8b + umull2 v26.8h, v0.16b, v28.16b + \add v4.8h, v16.8h, v4.8h + srshl v4.8h, v4.8h, v18.8h + \add v6.8h, v16.8h, v6.8h + srshl v6.8h, v6.8h, v18.8h + sqxtun v4.8b, v4.8h + sqxtun2 v4.16b, v6.8h + \add v24.8h, v16.8h, v24.8h + srshl v24.8h, v24.8h, v18.8h + \add v26.8h, v16.8h, v26.8h + srshl v26.8h, v26.8h, v18.8h + sqxtun v24.8b, v24.8h + sqxtun2 v24.16b, v26.8h + st1 {v4.16b}, [x5], x1 + st1 {v24.16b}, [x5], x1 b.ne 1b ret .endm .macro weight_8 add - dup v0.8B, w4 + dup v0.8b, w4 1: subs w2, w2, #2 - ld1 {v4.8B}, [x0], x1 - umull v2.8H, v0.8B, v4.8B - ld1 {v6.8B}, [x0], x1 - umull v20.8H, v0.8B, v6.8B - \add v2.8H, v16.8H, v2.8H - srshl v2.8H, v2.8H, v18.8H - sqxtun v2.8B, v2.8H - \add v20.8H, v16.8H, v20.8H - srshl v20.8H, v20.8H, v18.8H - sqxtun v4.8B, v20.8H - st1 {v2.8B}, [x5], x1 - st1 {v4.8B}, [x5], x1 + ld1 {v4.8b}, [x0], x1 + umull v2.8h, v0.8b, v4.8b + ld1 {v6.8b}, [x0], x1 + umull v20.8h, v0.8b, v6.8b + \add v2.8h, v16.8h, v2.8h + srshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + \add v20.8h, v16.8h, v20.8h + srshl v20.8h, v20.8h, v18.8h + sqxtun v4.8b, v20.8h + st1 {v2.8b}, [x5], x1 + st1 {v4.8b}, [x5], x1 b.ne 1b ret .endm .macro weight_4 add - dup v0.8B, w4 + dup v0.8b, w4 1: subs w2, w2, #4 - ld1 {v4.S}[0], [x0], x1 - ld1 {v4.S}[1], [x0], x1 - umull v2.8H, v0.8B, v4.8B + ld1 {v4.s}[0], [x0], x1 + ld1 {v4.s}[1], [x0], x1 + umull v2.8h, v0.8b, v4.8b b.lt 2f - ld1 {v6.S}[0], [x0], x1 - ld1 {v6.S}[1], [x0], x1 - umull v20.8H, v0.8B, v6.8B - \add v2.8H, v16.8H, v2.8H - srshl v2.8H, v2.8H, v18.8H - sqxtun v2.8B, v2.8H - \add v20.8H, v16.8H, v20.8H - srshl v20.8H, v20.8h, v18.8H - sqxtun v4.8B, v20.8H - st1 {v2.S}[0], [x5], x1 - st1 {v2.S}[1], [x5], x1 - st1 {v4.S}[0], [x5], x1 - st1 {v4.S}[1], [x5], x1 + ld1 {v6.s}[0], [x0], x1 + ld1 {v6.s}[1], [x0], x1 + umull v20.8h, v0.8b, v6.8b + \add v2.8h, v16.8h, v2.8h + srshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + \add v20.8h, v16.8h, v20.8h + srshl v20.8h, v20.8h, v18.8h + sqxtun v4.8b, v20.8h + st1 {v2.s}[0], [x5], x1 + st1 {v2.s}[1], [x5], x1 + st1 {v4.s}[0], [x5], x1 + st1 {v4.s}[1], [x5], x1 b.ne 1b ret -2: \add v2.8H, v16.8H, v2.8H - srshl v2.8H, v2.8H, v18.8H - sqxtun v2.8B, v2.8H - 
st1 {v2.S}[0], [x5], x1 - st1 {v2.S}[1], [x5], x1 +2: \add v2.8h, v16.8h, v2.8h + srshl v2.8h, v2.8h, v18.8h + sqxtun v2.8b, v2.8h + st1 {v2.s}[0], [x5], x1 + st1 {v2.s}[1], [x5], x1 ret .endm @@ -796,18 +796,18 @@ function ff_weight_h264_pixels_\w\()_neon, export=1 cmp w3, #1 mov w6, #1 lsl w5, w5, w3 - dup v16.8H, w5 + dup v16.8h, w5 mov x5, x0 b.le 20f sub w6, w6, w3 - dup v18.8H, w6 + dup v18.8h, w6 cmp w4, #0 b.lt 10f weight_\w shadd 10: neg w4, w4 weight_\w shsub 20: neg w6, w3 - dup v18.8H, w6 + dup v18.8h, w6 cmp w4, #0 b.lt 10f weight_\w add @@ -825,7 +825,7 @@ endfunc ldr w6, [x4] ccmp w3, #0, #0, ne lsl w2, w2, #2 - mov v24.S[0], w6 + mov v24.s[0], w6 lsl w3, w3, #2 and w8, w6, w6, lsl #16 b.eq 1f diff --git a/libavcodec/aarch64/h264idct_neon.S b/libavcodec/aarch64/h264idct_neon.S index 375da31d65..1bab2ca7c8 100644 --- a/libavcodec/aarch64/h264idct_neon.S +++ b/libavcodec/aarch64/h264idct_neon.S @@ -25,54 +25,54 @@ function ff_h264_idct_add_neon, export=1 .L_ff_h264_idct_add_neon: AARCH64_VALID_CALL_TARGET - ld1 {v0.4H, v1.4H, v2.4H, v3.4H}, [x1] + ld1 {v0.4h, v1.4h, v2.4h, v3.4h}, [x1] sxtw x2, w2 - movi v30.8H, #0 + movi v30.8h, #0 - add v4.4H, v0.4H, v2.4H - sshr v16.4H, v1.4H, #1 - st1 {v30.8H}, [x1], #16 - sshr v17.4H, v3.4H, #1 - st1 {v30.8H}, [x1], #16 - sub v5.4H, v0.4H, v2.4H - sub v6.4H, v16.4H, v3.4H - add v7.4H, v1.4H, v17.4H - add v0.4H, v4.4H, v7.4H - add v1.4H, v5.4H, v6.4H - sub v2.4H, v5.4H, v6.4H - sub v3.4H, v4.4H, v7.4H + add v4.4h, v0.4h, v2.4h + sshr v16.4h, v1.4h, #1 + st1 {v30.8h}, [x1], #16 + sshr v17.4h, v3.4h, #1 + st1 {v30.8h}, [x1], #16 + sub v5.4h, v0.4h, v2.4h + sub v6.4h, v16.4h, v3.4h + add v7.4h, v1.4h, v17.4h + add v0.4h, v4.4h, v7.4h + add v1.4h, v5.4h, v6.4h + sub v2.4h, v5.4h, v6.4h + sub v3.4h, v4.4h, v7.4h transpose_4x4H v0, v1, v2, v3, v4, v5, v6, v7 - add v4.4H, v0.4H, v2.4H - ld1 {v18.S}[0], [x0], x2 - sshr v16.4H, v3.4H, #1 - sshr v17.4H, v1.4H, #1 - ld1 {v18.S}[1], [x0], x2 - sub v5.4H, v0.4H, v2.4H - ld1 {v19.S}[1], [x0], x2 - add v6.4H, v16.4H, v1.4H - ins v4.D[1], v5.D[0] - sub v7.4H, v17.4H, v3.4H - ld1 {v19.S}[0], [x0], x2 - ins v6.D[1], v7.D[0] + add v4.4h, v0.4h, v2.4h + ld1 {v18.s}[0], [x0], x2 + sshr v16.4h, v3.4h, #1 + sshr v17.4h, v1.4h, #1 + ld1 {v18.s}[1], [x0], x2 + sub v5.4h, v0.4h, v2.4h + ld1 {v19.s}[1], [x0], x2 + add v6.4h, v16.4h, v1.4h + ins v4.d[1], v5.d[0] + sub v7.4h, v17.4h, v3.4h + ld1 {v19.s}[0], [x0], x2 + ins v6.d[1], v7.d[0] sub x0, x0, x2, lsl #2 - add v0.8H, v4.8H, v6.8H - sub v1.8H, v4.8H, v6.8H + add v0.8h, v4.8h, v6.8h + sub v1.8h, v4.8h, v6.8h - srshr v0.8H, v0.8H, #6 - srshr v1.8H, v1.8H, #6 + srshr v0.8h, v0.8h, #6 + srshr v1.8h, v1.8h, #6 - uaddw v0.8H, v0.8H, v18.8B - uaddw v1.8H, v1.8H, v19.8B + uaddw v0.8h, v0.8h, v18.8b + uaddw v1.8h, v1.8h, v19.8b - sqxtun v0.8B, v0.8H - sqxtun v1.8B, v1.8H + sqxtun v0.8b, v0.8h + sqxtun v1.8b, v1.8h - st1 {v0.S}[0], [x0], x2 - st1 {v0.S}[1], [x0], x2 - st1 {v1.S}[1], [x0], x2 - st1 {v1.S}[0], [x0], x2 + st1 {v0.s}[0], [x0], x2 + st1 {v0.s}[1], [x0], x2 + st1 {v1.s}[1], [x0], x2 + st1 {v1.s}[0], [x0], x2 sub x1, x1, #32 ret @@ -83,22 +83,22 @@ function ff_h264_idct_dc_add_neon, export=1 AARCH64_VALID_CALL_TARGET sxtw x2, w2 mov w3, #0 - ld1r {v2.8H}, [x1] + ld1r {v2.8h}, [x1] strh w3, [x1] - srshr v2.8H, v2.8H, #6 - ld1 {v0.S}[0], [x0], x2 - ld1 {v0.S}[1], [x0], x2 - uaddw v3.8H, v2.8H, v0.8B - ld1 {v1.S}[0], [x0], x2 - ld1 {v1.S}[1], [x0], x2 - uaddw v4.8H, v2.8H, v1.8B - sqxtun v0.8B, v3.8H - sqxtun v1.8B, v4.8H + srshr v2.8h, v2.8h, #6 + ld1 {v0.s}[0], 
[x0], x2 + ld1 {v0.s}[1], [x0], x2 + uaddw v3.8h, v2.8h, v0.8b + ld1 {v1.s}[0], [x0], x2 + ld1 {v1.s}[1], [x0], x2 + uaddw v4.8h, v2.8h, v1.8b + sqxtun v0.8b, v3.8h + sqxtun v1.8b, v4.8h sub x0, x0, x2, lsl #2 - st1 {v0.S}[0], [x0], x2 - st1 {v0.S}[1], [x0], x2 - st1 {v1.S}[0], [x0], x2 - st1 {v1.S}[1], [x0], x2 + st1 {v0.s}[0], [x0], x2 + st1 {v0.s}[1], [x0], x2 + st1 {v1.s}[0], [x0], x2 + st1 {v1.s}[1], [x0], x2 ret endfunc @@ -194,71 +194,71 @@ endfunc .if \pass == 0 va .req v18 vb .req v30 - sshr v18.8H, v26.8H, #1 - add v16.8H, v24.8H, v28.8H - ld1 {v30.8H, v31.8H}, [x1] - st1 {v19.8H}, [x1], #16 - st1 {v19.8H}, [x1], #16 - sub v17.8H, v24.8H, v28.8H - sshr v19.8H, v30.8H, #1 - sub v18.8H, v18.8H, v30.8H - add v19.8H, v19.8H, v26.8H + sshr v18.8h, v26.8h, #1 + add v16.8h, v24.8h, v28.8h + ld1 {v30.8h, v31.8h}, [x1] + st1 {v19.8h}, [x1], #16 + st1 {v19.8h}, [x1], #16 + sub v17.8h, v24.8h, v28.8h + sshr v19.8h, v30.8h, #1 + sub v18.8h, v18.8h, v30.8h + add v19.8h, v19.8h, v26.8h .else va .req v30 vb .req v18 - sshr v30.8H, v26.8H, #1 - sshr v19.8H, v18.8H, #1 - add v16.8H, v24.8H, v28.8H - sub v17.8H, v24.8H, v28.8H - sub v30.8H, v30.8H, v18.8H - add v19.8H, v19.8H, v26.8H + sshr v30.8h, v26.8h, #1 + sshr v19.8h, v18.8h, #1 + add v16.8h, v24.8h, v28.8h + sub v17.8h, v24.8h, v28.8h + sub v30.8h, v30.8h, v18.8h + add v19.8h, v19.8h, v26.8h .endif - add v26.8H, v17.8H, va.8H - sub v28.8H, v17.8H, va.8H - add v24.8H, v16.8H, v19.8H - sub vb.8H, v16.8H, v19.8H - sub v16.8H, v29.8H, v27.8H - add v17.8H, v31.8H, v25.8H - sub va.8H, v31.8H, v25.8H - add v19.8H, v29.8H, v27.8H - sub v16.8H, v16.8H, v31.8H - sub v17.8H, v17.8H, v27.8H - add va.8H, va.8H, v29.8H - add v19.8H, v19.8H, v25.8H - sshr v25.8H, v25.8H, #1 - sshr v27.8H, v27.8H, #1 - sshr v29.8H, v29.8H, #1 - sshr v31.8H, v31.8H, #1 - sub v16.8H, v16.8H, v31.8H - sub v17.8H, v17.8H, v27.8H - add va.8H, va.8H, v29.8H - add v19.8H, v19.8H, v25.8H - sshr v25.8H, v16.8H, #2 - sshr v27.8H, v17.8H, #2 - sshr v29.8H, va.8H, #2 - sshr v31.8H, v19.8H, #2 - sub v19.8H, v19.8H, v25.8H - sub va.8H, v27.8H, va.8H - add v17.8H, v17.8H, v29.8H - add v16.8H, v16.8H, v31.8H + add v26.8h, v17.8h, va.8h + sub v28.8h, v17.8h, va.8h + add v24.8h, v16.8h, v19.8h + sub vb.8h, v16.8h, v19.8h + sub v16.8h, v29.8h, v27.8h + add v17.8h, v31.8h, v25.8h + sub va.8h, v31.8h, v25.8h + add v19.8h, v29.8h, v27.8h + sub v16.8h, v16.8h, v31.8h + sub v17.8h, v17.8h, v27.8h + add va.8h, va.8h, v29.8h + add v19.8h, v19.8h, v25.8h + sshr v25.8h, v25.8h, #1 + sshr v27.8h, v27.8h, #1 + sshr v29.8h, v29.8h, #1 + sshr v31.8h, v31.8h, #1 + sub v16.8h, v16.8h, v31.8h + sub v17.8h, v17.8h, v27.8h + add va.8h, va.8h, v29.8h + add v19.8h, v19.8h, v25.8h + sshr v25.8h, v16.8h, #2 + sshr v27.8h, v17.8h, #2 + sshr v29.8h, va.8h, #2 + sshr v31.8h, v19.8h, #2 + sub v19.8h, v19.8h, v25.8h + sub va.8h, v27.8h, va.8h + add v17.8h, v17.8h, v29.8h + add v16.8h, v16.8h, v31.8h .if \pass == 0 - sub v31.8H, v24.8H, v19.8H - add v24.8H, v24.8H, v19.8H - add v25.8H, v26.8H, v18.8H - sub v18.8H, v26.8H, v18.8H - add v26.8H, v28.8H, v17.8H - add v27.8H, v30.8H, v16.8H - sub v29.8H, v28.8H, v17.8H - sub v28.8H, v30.8H, v16.8H + sub v31.8h, v24.8h, v19.8h + add v24.8h, v24.8h, v19.8h + add v25.8h, v26.8h, v18.8h + sub v18.8h, v26.8h, v18.8h + add v26.8h, v28.8h, v17.8h + add v27.8h, v30.8h, v16.8h + sub v29.8h, v28.8h, v17.8h + sub v28.8h, v30.8h, v16.8h .else - sub v31.8H, v24.8H, v19.8H - add v24.8H, v24.8H, v19.8H - add v25.8H, v26.8H, v30.8H - sub v30.8H, v26.8H, v30.8H - add v26.8H, v28.8H, v17.8H 
- sub v29.8H, v28.8H, v17.8H - add v27.8H, v18.8H, v16.8H - sub v28.8H, v18.8H, v16.8H + sub v31.8h, v24.8h, v19.8h + add v24.8h, v24.8h, v19.8h + add v25.8h, v26.8h, v30.8h + sub v30.8h, v26.8h, v30.8h + add v26.8h, v28.8h, v17.8h + sub v29.8h, v28.8h, v17.8h + add v27.8h, v18.8h, v16.8h + sub v28.8h, v18.8h, v16.8h .endif .unreq va .unreq vb @@ -267,63 +267,63 @@ endfunc function ff_h264_idct8_add_neon, export=1 .L_ff_h264_idct8_add_neon: AARCH64_VALID_CALL_TARGET - movi v19.8H, #0 + movi v19.8h, #0 sxtw x2, w2 - ld1 {v24.8H, v25.8H}, [x1] - st1 {v19.8H}, [x1], #16 - st1 {v19.8H}, [x1], #16 - ld1 {v26.8H, v27.8H}, [x1] - st1 {v19.8H}, [x1], #16 - st1 {v19.8H}, [x1], #16 - ld1 {v28.8H, v29.8H}, [x1] - st1 {v19.8H}, [x1], #16 - st1 {v19.8H}, [x1], #16 + ld1 {v24.8h, v25.8h}, [x1] + st1 {v19.8h}, [x1], #16 + st1 {v19.8h}, [x1], #16 + ld1 {v26.8h, v27.8h}, [x1] + st1 {v19.8h}, [x1], #16 + st1 {v19.8h}, [x1], #16 + ld1 {v28.8h, v29.8h}, [x1] + st1 {v19.8h}, [x1], #16 + st1 {v19.8h}, [x1], #16 idct8x8_cols 0 transpose_8x8H v24, v25, v26, v27, v28, v29, v18, v31, v6, v7 idct8x8_cols 1 mov x3, x0 - srshr v24.8H, v24.8H, #6 - ld1 {v0.8B}, [x0], x2 - srshr v25.8H, v25.8H, #6 - ld1 {v1.8B}, [x0], x2 - srshr v26.8H, v26.8H, #6 - ld1 {v2.8B}, [x0], x2 - srshr v27.8H, v27.8H, #6 - ld1 {v3.8B}, [x0], x2 - srshr v28.8H, v28.8H, #6 - ld1 {v4.8B}, [x0], x2 - srshr v29.8H, v29.8H, #6 - ld1 {v5.8B}, [x0], x2 - srshr v30.8H, v30.8H, #6 - ld1 {v6.8B}, [x0], x2 - srshr v31.8H, v31.8H, #6 - ld1 {v7.8B}, [x0], x2 - uaddw v24.8H, v24.8H, v0.8B - uaddw v25.8H, v25.8H, v1.8B - uaddw v26.8H, v26.8H, v2.8B - sqxtun v0.8B, v24.8H - uaddw v27.8H, v27.8H, v3.8B - sqxtun v1.8B, v25.8H - uaddw v28.8H, v28.8H, v4.8B - sqxtun v2.8B, v26.8H - st1 {v0.8B}, [x3], x2 - uaddw v29.8H, v29.8H, v5.8B - sqxtun v3.8B, v27.8H - st1 {v1.8B}, [x3], x2 - uaddw v30.8H, v30.8H, v6.8B - sqxtun v4.8B, v28.8H - st1 {v2.8B}, [x3], x2 - uaddw v31.8H, v31.8H, v7.8B - sqxtun v5.8B, v29.8H - st1 {v3.8B}, [x3], x2 - sqxtun v6.8B, v30.8H - sqxtun v7.8B, v31.8H - st1 {v4.8B}, [x3], x2 - st1 {v5.8B}, [x3], x2 - st1 {v6.8B}, [x3], x2 - st1 {v7.8B}, [x3], x2 + srshr v24.8h, v24.8h, #6 + ld1 {v0.8b}, [x0], x2 + srshr v25.8h, v25.8h, #6 + ld1 {v1.8b}, [x0], x2 + srshr v26.8h, v26.8h, #6 + ld1 {v2.8b}, [x0], x2 + srshr v27.8h, v27.8h, #6 + ld1 {v3.8b}, [x0], x2 + srshr v28.8h, v28.8h, #6 + ld1 {v4.8b}, [x0], x2 + srshr v29.8h, v29.8h, #6 + ld1 {v5.8b}, [x0], x2 + srshr v30.8h, v30.8h, #6 + ld1 {v6.8b}, [x0], x2 + srshr v31.8h, v31.8h, #6 + ld1 {v7.8b}, [x0], x2 + uaddw v24.8h, v24.8h, v0.8b + uaddw v25.8h, v25.8h, v1.8b + uaddw v26.8h, v26.8h, v2.8b + sqxtun v0.8b, v24.8h + uaddw v27.8h, v27.8h, v3.8b + sqxtun v1.8b, v25.8h + uaddw v28.8h, v28.8h, v4.8b + sqxtun v2.8b, v26.8h + st1 {v0.8b}, [x3], x2 + uaddw v29.8h, v29.8h, v5.8b + sqxtun v3.8b, v27.8h + st1 {v1.8b}, [x3], x2 + uaddw v30.8h, v30.8h, v6.8b + sqxtun v4.8b, v28.8h + st1 {v2.8b}, [x3], x2 + uaddw v31.8h, v31.8h, v7.8b + sqxtun v5.8b, v29.8h + st1 {v3.8b}, [x3], x2 + sqxtun v6.8b, v30.8h + sqxtun v7.8b, v31.8h + st1 {v4.8b}, [x3], x2 + st1 {v5.8b}, [x3], x2 + st1 {v6.8b}, [x3], x2 + st1 {v7.8b}, [x3], x2 sub x1, x1, #128 ret @@ -334,42 +334,42 @@ function ff_h264_idct8_dc_add_neon, export=1 AARCH64_VALID_CALL_TARGET mov w3, #0 sxtw x2, w2 - ld1r {v31.8H}, [x1] + ld1r {v31.8h}, [x1] strh w3, [x1] - ld1 {v0.8B}, [x0], x2 - srshr v31.8H, v31.8H, #6 - ld1 {v1.8B}, [x0], x2 - ld1 {v2.8B}, [x0], x2 - uaddw v24.8H, v31.8H, v0.8B - ld1 {v3.8B}, [x0], x2 - uaddw v25.8H, v31.8H, v1.8B - ld1 {v4.8B}, [x0], 
x2 - uaddw v26.8H, v31.8H, v2.8B - ld1 {v5.8B}, [x0], x2 - uaddw v27.8H, v31.8H, v3.8B - ld1 {v6.8B}, [x0], x2 - uaddw v28.8H, v31.8H, v4.8B - ld1 {v7.8B}, [x0], x2 - uaddw v29.8H, v31.8H, v5.8B - uaddw v30.8H, v31.8H, v6.8B - uaddw v31.8H, v31.8H, v7.8B - sqxtun v0.8B, v24.8H - sqxtun v1.8B, v25.8H - sqxtun v2.8B, v26.8H - sqxtun v3.8B, v27.8H + ld1 {v0.8b}, [x0], x2 + srshr v31.8h, v31.8h, #6 + ld1 {v1.8b}, [x0], x2 + ld1 {v2.8b}, [x0], x2 + uaddw v24.8h, v31.8h, v0.8b + ld1 {v3.8b}, [x0], x2 + uaddw v25.8h, v31.8h, v1.8b + ld1 {v4.8b}, [x0], x2 + uaddw v26.8h, v31.8h, v2.8b + ld1 {v5.8b}, [x0], x2 + uaddw v27.8h, v31.8h, v3.8b + ld1 {v6.8b}, [x0], x2 + uaddw v28.8h, v31.8h, v4.8b + ld1 {v7.8b}, [x0], x2 + uaddw v29.8h, v31.8h, v5.8b + uaddw v30.8h, v31.8h, v6.8b + uaddw v31.8h, v31.8h, v7.8b + sqxtun v0.8b, v24.8h + sqxtun v1.8b, v25.8h + sqxtun v2.8b, v26.8h + sqxtun v3.8b, v27.8h sub x0, x0, x2, lsl #3 - st1 {v0.8B}, [x0], x2 - sqxtun v4.8B, v28.8H - st1 {v1.8B}, [x0], x2 - sqxtun v5.8B, v29.8H - st1 {v2.8B}, [x0], x2 - sqxtun v6.8B, v30.8H - st1 {v3.8B}, [x0], x2 - sqxtun v7.8B, v31.8H - st1 {v4.8B}, [x0], x2 - st1 {v5.8B}, [x0], x2 - st1 {v6.8B}, [x0], x2 - st1 {v7.8B}, [x0], x2 + st1 {v0.8b}, [x0], x2 + sqxtun v4.8b, v28.8h + st1 {v1.8b}, [x0], x2 + sqxtun v5.8b, v29.8h + st1 {v2.8b}, [x0], x2 + sqxtun v6.8b, v30.8h + st1 {v3.8b}, [x0], x2 + sqxtun v7.8b, v31.8h + st1 {v4.8b}, [x0], x2 + st1 {v5.8b}, [x0], x2 + st1 {v6.8b}, [x0], x2 + st1 {v7.8b}, [x0], x2 ret endfunc diff --git a/libavcodec/aarch64/h264qpel_neon.S b/libavcodec/aarch64/h264qpel_neon.S index 451fd8af24..21906327cd 100644 --- a/libavcodec/aarch64/h264qpel_neon.S +++ b/libavcodec/aarch64/h264qpel_neon.S @@ -27,127 +27,127 @@ .macro lowpass_const r movz \r, #20, lsl #16 movk \r, #5 - mov v6.S[0], \r + mov v6.s[0], \r .endm //trashes v0-v5 .macro lowpass_8 r0, r1, r2, r3, d0, d1, narrow=1 - ext v2.8B, \r0\().8B, \r1\().8B, #2 - ext v3.8B, \r0\().8B, \r1\().8B, #3 - uaddl v2.8H, v2.8B, v3.8B - ext v4.8B, \r0\().8B, \r1\().8B, #1 - ext v5.8B, \r0\().8B, \r1\().8B, #4 - uaddl v4.8H, v4.8B, v5.8B - ext v1.8B, \r0\().8B, \r1\().8B, #5 - uaddl \d0\().8H, \r0\().8B, v1.8B - ext v0.8B, \r2\().8B, \r3\().8B, #2 - mla \d0\().8H, v2.8H, v6.H[1] - ext v1.8B, \r2\().8B, \r3\().8B, #3 - uaddl v0.8H, v0.8B, v1.8B - ext v1.8B, \r2\().8B, \r3\().8B, #1 - mls \d0\().8H, v4.8H, v6.H[0] - ext v3.8B, \r2\().8B, \r3\().8B, #4 - uaddl v1.8H, v1.8B, v3.8B - ext v2.8B, \r2\().8B, \r3\().8B, #5 - uaddl \d1\().8H, \r2\().8B, v2.8B - mla \d1\().8H, v0.8H, v6.H[1] - mls \d1\().8H, v1.8H, v6.H[0] + ext v2.8b, \r0\().8b, \r1\().8b, #2 + ext v3.8b, \r0\().8b, \r1\().8b, #3 + uaddl v2.8h, v2.8b, v3.8b + ext v4.8b, \r0\().8b, \r1\().8b, #1 + ext v5.8b, \r0\().8b, \r1\().8b, #4 + uaddl v4.8h, v4.8b, v5.8b + ext v1.8b, \r0\().8b, \r1\().8b, #5 + uaddl \d0\().8h, \r0\().8b, v1.8b + ext v0.8b, \r2\().8b, \r3\().8b, #2 + mla \d0\().8h, v2.8h, v6.h[1] + ext v1.8b, \r2\().8b, \r3\().8b, #3 + uaddl v0.8h, v0.8b, v1.8b + ext v1.8b, \r2\().8b, \r3\().8b, #1 + mls \d0\().8h, v4.8h, v6.h[0] + ext v3.8b, \r2\().8b, \r3\().8b, #4 + uaddl v1.8h, v1.8b, v3.8b + ext v2.8b, \r2\().8b, \r3\().8b, #5 + uaddl \d1\().8h, \r2\().8b, v2.8b + mla \d1\().8h, v0.8h, v6.h[1] + mls \d1\().8h, v1.8h, v6.h[0] .if \narrow - sqrshrun \d0\().8B, \d0\().8H, #5 - sqrshrun \d1\().8B, \d1\().8H, #5 + sqrshrun \d0\().8b, \d0\().8h, #5 + sqrshrun \d1\().8b, \d1\().8h, #5 .endif .endm //trashes v0-v4 .macro lowpass_8_v r0, r1, r2, r3, r4, r5, r6, d0, d1, narrow=1 - uaddl v2.8H, \r2\().8B, 
\r3\().8B - uaddl v0.8H, \r3\().8B, \r4\().8B - uaddl v4.8H, \r1\().8B, \r4\().8B - uaddl v1.8H, \r2\().8B, \r5\().8B - uaddl \d0\().8H, \r0\().8B, \r5\().8B - uaddl \d1\().8H, \r1\().8B, \r6\().8B - mla \d0\().8H, v2.8H, v6.H[1] - mls \d0\().8H, v4.8H, v6.H[0] - mla \d1\().8H, v0.8H, v6.H[1] - mls \d1\().8H, v1.8H, v6.H[0] + uaddl v2.8h, \r2\().8b, \r3\().8b + uaddl v0.8h, \r3\().8b, \r4\().8b + uaddl v4.8h, \r1\().8b, \r4\().8b + uaddl v1.8h, \r2\().8b, \r5\().8b + uaddl \d0\().8h, \r0\().8b, \r5\().8b + uaddl \d1\().8h, \r1\().8b, \r6\().8b + mla \d0\().8h, v2.8h, v6.h[1] + mls \d0\().8h, v4.8h, v6.h[0] + mla \d1\().8h, v0.8h, v6.h[1] + mls \d1\().8h, v1.8h, v6.h[0] .if \narrow - sqrshrun \d0\().8B, \d0\().8H, #5 - sqrshrun \d1\().8B, \d1\().8H, #5 + sqrshrun \d0\().8b, \d0\().8h, #5 + sqrshrun \d1\().8b, \d1\().8h, #5 .endif .endm //trashes v0-v5, v7, v30-v31 .macro lowpass_8H r0, r1 - ext v0.16B, \r0\().16B, \r0\().16B, #2 - ext v1.16B, \r0\().16B, \r0\().16B, #3 - uaddl v0.8H, v0.8B, v1.8B - ext v2.16B, \r0\().16B, \r0\().16B, #1 - ext v3.16B, \r0\().16B, \r0\().16B, #4 - uaddl v2.8H, v2.8B, v3.8B - ext v30.16B, \r0\().16B, \r0\().16B, #5 - uaddl \r0\().8H, \r0\().8B, v30.8B - ext v4.16B, \r1\().16B, \r1\().16B, #2 - mla \r0\().8H, v0.8H, v6.H[1] - ext v5.16B, \r1\().16B, \r1\().16B, #3 - uaddl v4.8H, v4.8B, v5.8B - ext v7.16B, \r1\().16B, \r1\().16B, #1 - mls \r0\().8H, v2.8H, v6.H[0] - ext v0.16B, \r1\().16B, \r1\().16B, #4 - uaddl v7.8H, v7.8B, v0.8B - ext v31.16B, \r1\().16B, \r1\().16B, #5 - uaddl \r1\().8H, \r1\().8B, v31.8B - mla \r1\().8H, v4.8H, v6.H[1] - mls \r1\().8H, v7.8H, v6.H[0] + ext v0.16b, \r0\().16b, \r0\().16b, #2 + ext v1.16b, \r0\().16b, \r0\().16b, #3 + uaddl v0.8h, v0.8b, v1.8b + ext v2.16b, \r0\().16b, \r0\().16b, #1 + ext v3.16b, \r0\().16b, \r0\().16b, #4 + uaddl v2.8h, v2.8b, v3.8b + ext v30.16b, \r0\().16b, \r0\().16b, #5 + uaddl \r0\().8h, \r0\().8b, v30.8b + ext v4.16b, \r1\().16b, \r1\().16b, #2 + mla \r0\().8h, v0.8h, v6.h[1] + ext v5.16b, \r1\().16b, \r1\().16b, #3 + uaddl v4.8h, v4.8b, v5.8b + ext v7.16b, \r1\().16b, \r1\().16b, #1 + mls \r0\().8h, v2.8h, v6.h[0] + ext v0.16b, \r1\().16b, \r1\().16b, #4 + uaddl v7.8h, v7.8b, v0.8b + ext v31.16b, \r1\().16b, \r1\().16b, #5 + uaddl \r1\().8h, \r1\().8b, v31.8b + mla \r1\().8h, v4.8h, v6.h[1] + mls \r1\().8h, v7.8h, v6.h[0] .endm // trashes v2-v5, v30 .macro lowpass_8_1 r0, r1, d0, narrow=1 - ext v2.8B, \r0\().8B, \r1\().8B, #2 - ext v3.8B, \r0\().8B, \r1\().8B, #3 - uaddl v2.8H, v2.8B, v3.8B - ext v4.8B, \r0\().8B, \r1\().8B, #1 - ext v5.8B, \r0\().8B, \r1\().8B, #4 - uaddl v4.8H, v4.8B, v5.8B - ext v30.8B, \r0\().8B, \r1\().8B, #5 - uaddl \d0\().8H, \r0\().8B, v30.8B - mla \d0\().8H, v2.8H, v6.H[1] - mls \d0\().8H, v4.8H, v6.H[0] + ext v2.8b, \r0\().8b, \r1\().8b, #2 + ext v3.8b, \r0\().8b, \r1\().8b, #3 + uaddl v2.8h, v2.8b, v3.8b + ext v4.8b, \r0\().8b, \r1\().8b, #1 + ext v5.8b, \r0\().8b, \r1\().8b, #4 + uaddl v4.8h, v4.8b, v5.8b + ext v30.8b, \r0\().8b, \r1\().8b, #5 + uaddl \d0\().8h, \r0\().8b, v30.8b + mla \d0\().8h, v2.8h, v6.h[1] + mls \d0\().8h, v4.8h, v6.h[0] .if \narrow - sqrshrun \d0\().8B, \d0\().8H, #5 + sqrshrun \d0\().8b, \d0\().8h, #5 .endif .endm // trashed v0-v7 .macro lowpass_8.16 r0, r1, r2, r3, r4, r5 - saddl v5.4S, \r2\().4H, \r3\().4H - saddl2 v1.4S, \r2\().8H, \r3\().8H - saddl v6.4S, \r1\().4H, \r4\().4H - saddl2 v2.4S, \r1\().8H, \r4\().8H - saddl v0.4S, \r0\().4H, \r5\().4H - saddl2 v4.4S, \r0\().8H, \r5\().8H - - shl v3.4S, v5.4S, #4 - shl v5.4S, v5.4S, #2 - shl v7.4S, 
v6.4S, #2 - add v5.4S, v5.4S, v3.4S - add v6.4S, v6.4S, v7.4S - - shl v3.4S, v1.4S, #4 - shl v1.4S, v1.4S, #2 - shl v7.4S, v2.4S, #2 - add v1.4S, v1.4S, v3.4S - add v2.4S, v2.4S, v7.4S - - add v5.4S, v5.4S, v0.4S - sub v5.4S, v5.4S, v6.4S - - add v1.4S, v1.4S, v4.4S - sub v1.4S, v1.4S, v2.4S - - rshrn v5.4H, v5.4S, #10 - rshrn2 v5.8H, v1.4S, #10 - - sqxtun \r0\().8B, v5.8H + saddl v5.4s, \r2\().4h, \r3\().4h + saddl2 v1.4s, \r2\().8h, \r3\().8h + saddl v6.4s, \r1\().4h, \r4\().4h + saddl2 v2.4s, \r1\().8h, \r4\().8h + saddl v0.4s, \r0\().4h, \r5\().4h + saddl2 v4.4s, \r0\().8h, \r5\().8h + + shl v3.4s, v5.4s, #4 + shl v5.4s, v5.4s, #2 + shl v7.4s, v6.4s, #2 + add v5.4s, v5.4s, v3.4s + add v6.4s, v6.4s, v7.4s + + shl v3.4s, v1.4s, #4 + shl v1.4s, v1.4s, #2 + shl v7.4s, v2.4s, #2 + add v1.4s, v1.4s, v3.4s + add v2.4s, v2.4s, v7.4s + + add v5.4s, v5.4s, v0.4s + sub v5.4s, v5.4s, v6.4s + + add v1.4s, v1.4s, v4.4s + sub v1.4s, v1.4s, v2.4s + + rshrn v5.4h, v5.4s, #10 + rshrn2 v5.8h, v1.4s, #10 + + sqxtun \r0\().8b, v5.8h .endm function put_h264_qpel16_h_lowpass_neon_packed @@ -176,19 +176,19 @@ function \type\()_h264_qpel16_h_lowpass_neon endfunc function \type\()_h264_qpel8_h_lowpass_neon -1: ld1 {v28.8B, v29.8B}, [x1], x2 - ld1 {v16.8B, v17.8B}, [x1], x2 +1: ld1 {v28.8b, v29.8b}, [x1], x2 + ld1 {v16.8b, v17.8b}, [x1], x2 subs x12, x12, #2 lowpass_8 v28, v29, v16, v17, v28, v16 .ifc \type,avg - ld1 {v2.8B}, [x0], x3 - ld1 {v3.8B}, [x0] - urhadd v28.8B, v28.8B, v2.8B - urhadd v16.8B, v16.8B, v3.8B + ld1 {v2.8b}, [x0], x3 + ld1 {v3.8b}, [x0] + urhadd v28.8b, v28.8b, v2.8b + urhadd v16.8b, v16.8b, v3.8b sub x0, x0, x3 .endif - st1 {v28.8B}, [x0], x3 - st1 {v16.8B}, [x0], x3 + st1 {v28.8b}, [x0], x3 + st1 {v16.8b}, [x0], x3 b.ne 1b ret endfunc @@ -213,23 +213,23 @@ function \type\()_h264_qpel16_h_lowpass_l2_neon endfunc function \type\()_h264_qpel8_h_lowpass_l2_neon -1: ld1 {v26.8B, v27.8B}, [x1], x2 - ld1 {v16.8B, v17.8B}, [x1], x2 - ld1 {v28.8B}, [x3], x2 - ld1 {v29.8B}, [x3], x2 +1: ld1 {v26.8b, v27.8b}, [x1], x2 + ld1 {v16.8b, v17.8b}, [x1], x2 + ld1 {v28.8b}, [x3], x2 + ld1 {v29.8b}, [x3], x2 subs x12, x12, #2 lowpass_8 v26, v27, v16, v17, v26, v27 - urhadd v26.8B, v26.8B, v28.8B - urhadd v27.8B, v27.8B, v29.8B + urhadd v26.8b, v26.8b, v28.8b + urhadd v27.8b, v27.8b, v29.8b .ifc \type,avg - ld1 {v2.8B}, [x0], x2 - ld1 {v3.8B}, [x0] - urhadd v26.8B, v26.8B, v2.8B - urhadd v27.8B, v27.8B, v3.8B + ld1 {v2.8b}, [x0], x2 + ld1 {v3.8b}, [x0] + urhadd v26.8b, v26.8b, v2.8b + urhadd v27.8b, v27.8b, v3.8b sub x0, x0, x2 .endif - st1 {v26.8B}, [x0], x2 - st1 {v27.8B}, [x0], x2 + st1 {v26.8b}, [x0], x2 + st1 {v27.8b}, [x0], x2 b.ne 1b ret endfunc @@ -270,52 +270,52 @@ function \type\()_h264_qpel16_v_lowpass_neon endfunc function \type\()_h264_qpel8_v_lowpass_neon - ld1 {v16.8B}, [x1], x3 - ld1 {v17.8B}, [x1], x3 - ld1 {v18.8B}, [x1], x3 - ld1 {v19.8B}, [x1], x3 - ld1 {v20.8B}, [x1], x3 - ld1 {v21.8B}, [x1], x3 - ld1 {v22.8B}, [x1], x3 - ld1 {v23.8B}, [x1], x3 - ld1 {v24.8B}, [x1], x3 - ld1 {v25.8B}, [x1], x3 - ld1 {v26.8B}, [x1], x3 - ld1 {v27.8B}, [x1], x3 - ld1 {v28.8B}, [x1] + ld1 {v16.8b}, [x1], x3 + ld1 {v17.8b}, [x1], x3 + ld1 {v18.8b}, [x1], x3 + ld1 {v19.8b}, [x1], x3 + ld1 {v20.8b}, [x1], x3 + ld1 {v21.8b}, [x1], x3 + ld1 {v22.8b}, [x1], x3 + ld1 {v23.8b}, [x1], x3 + ld1 {v24.8b}, [x1], x3 + ld1 {v25.8b}, [x1], x3 + ld1 {v26.8b}, [x1], x3 + ld1 {v27.8b}, [x1], x3 + ld1 {v28.8b}, [x1] lowpass_8_v v16, v17, v18, v19, v20, v21, v22, v16, v17 lowpass_8_v v18, v19, v20, v21, v22, v23, v24, v18, v19 
lowpass_8_v v20, v21, v22, v23, v24, v25, v26, v20, v21 lowpass_8_v v22, v23, v24, v25, v26, v27, v28, v22, v23 .ifc \type,avg - ld1 {v24.8B}, [x0], x2 - ld1 {v25.8B}, [x0], x2 - ld1 {v26.8B}, [x0], x2 - urhadd v16.8B, v16.8B, v24.8B - ld1 {v27.8B}, [x0], x2 - urhadd v17.8B, v17.8B, v25.8B - ld1 {v28.8B}, [x0], x2 - urhadd v18.8B, v18.8B, v26.8B - ld1 {v29.8B}, [x0], x2 - urhadd v19.8B, v19.8B, v27.8B - ld1 {v30.8B}, [x0], x2 - urhadd v20.8B, v20.8B, v28.8B - ld1 {v31.8B}, [x0], x2 - urhadd v21.8B, v21.8B, v29.8B - urhadd v22.8B, v22.8B, v30.8B - urhadd v23.8B, v23.8B, v31.8B + ld1 {v24.8b}, [x0], x2 + ld1 {v25.8b}, [x0], x2 + ld1 {v26.8b}, [x0], x2 + urhadd v16.8b, v16.8b, v24.8b + ld1 {v27.8b}, [x0], x2 + urhadd v17.8b, v17.8b, v25.8b + ld1 {v28.8b}, [x0], x2 + urhadd v18.8b, v18.8b, v26.8b + ld1 {v29.8b}, [x0], x2 + urhadd v19.8b, v19.8b, v27.8b + ld1 {v30.8b}, [x0], x2 + urhadd v20.8b, v20.8b, v28.8b + ld1 {v31.8b}, [x0], x2 + urhadd v21.8b, v21.8b, v29.8b + urhadd v22.8b, v22.8b, v30.8b + urhadd v23.8b, v23.8b, v31.8b sub x0, x0, x2, lsl #3 .endif - st1 {v16.8B}, [x0], x2 - st1 {v17.8B}, [x0], x2 - st1 {v18.8B}, [x0], x2 - st1 {v19.8B}, [x0], x2 - st1 {v20.8B}, [x0], x2 - st1 {v21.8B}, [x0], x2 - st1 {v22.8B}, [x0], x2 - st1 {v23.8B}, [x0], x2 + st1 {v16.8b}, [x0], x2 + st1 {v17.8b}, [x0], x2 + st1 {v18.8b}, [x0], x2 + st1 {v19.8b}, [x0], x2 + st1 {v20.8b}, [x0], x2 + st1 {v21.8b}, [x0], x2 + st1 {v22.8b}, [x0], x2 + st1 {v23.8b}, [x0], x2 ret endfunc @@ -343,70 +343,70 @@ function \type\()_h264_qpel16_v_lowpass_l2_neon endfunc function \type\()_h264_qpel8_v_lowpass_l2_neon - ld1 {v16.8B}, [x1], x3 - ld1 {v17.8B}, [x1], x3 - ld1 {v18.8B}, [x1], x3 - ld1 {v19.8B}, [x1], x3 - ld1 {v20.8B}, [x1], x3 - ld1 {v21.8B}, [x1], x3 - ld1 {v22.8B}, [x1], x3 - ld1 {v23.8B}, [x1], x3 - ld1 {v24.8B}, [x1], x3 - ld1 {v25.8B}, [x1], x3 - ld1 {v26.8B}, [x1], x3 - ld1 {v27.8B}, [x1], x3 - ld1 {v28.8B}, [x1] + ld1 {v16.8b}, [x1], x3 + ld1 {v17.8b}, [x1], x3 + ld1 {v18.8b}, [x1], x3 + ld1 {v19.8b}, [x1], x3 + ld1 {v20.8b}, [x1], x3 + ld1 {v21.8b}, [x1], x3 + ld1 {v22.8b}, [x1], x3 + ld1 {v23.8b}, [x1], x3 + ld1 {v24.8b}, [x1], x3 + ld1 {v25.8b}, [x1], x3 + ld1 {v26.8b}, [x1], x3 + ld1 {v27.8b}, [x1], x3 + ld1 {v28.8b}, [x1] lowpass_8_v v16, v17, v18, v19, v20, v21, v22, v16, v17 lowpass_8_v v18, v19, v20, v21, v22, v23, v24, v18, v19 lowpass_8_v v20, v21, v22, v23, v24, v25, v26, v20, v21 lowpass_8_v v22, v23, v24, v25, v26, v27, v28, v22, v23 - ld1 {v24.8B}, [x12], x2 - ld1 {v25.8B}, [x12], x2 - ld1 {v26.8B}, [x12], x2 - ld1 {v27.8B}, [x12], x2 - ld1 {v28.8B}, [x12], x2 - urhadd v16.8B, v24.8B, v16.8B - urhadd v17.8B, v25.8B, v17.8B - ld1 {v29.8B}, [x12], x2 - urhadd v18.8B, v26.8B, v18.8B - urhadd v19.8B, v27.8B, v19.8B - ld1 {v30.8B}, [x12], x2 - urhadd v20.8B, v28.8B, v20.8B - urhadd v21.8B, v29.8B, v21.8B - ld1 {v31.8B}, [x12], x2 - urhadd v22.8B, v30.8B, v22.8B - urhadd v23.8B, v31.8B, v23.8B + ld1 {v24.8b}, [x12], x2 + ld1 {v25.8b}, [x12], x2 + ld1 {v26.8b}, [x12], x2 + ld1 {v27.8b}, [x12], x2 + ld1 {v28.8b}, [x12], x2 + urhadd v16.8b, v24.8b, v16.8b + urhadd v17.8b, v25.8b, v17.8b + ld1 {v29.8b}, [x12], x2 + urhadd v18.8b, v26.8b, v18.8b + urhadd v19.8b, v27.8b, v19.8b + ld1 {v30.8b}, [x12], x2 + urhadd v20.8b, v28.8b, v20.8b + urhadd v21.8b, v29.8b, v21.8b + ld1 {v31.8b}, [x12], x2 + urhadd v22.8b, v30.8b, v22.8b + urhadd v23.8b, v31.8b, v23.8b .ifc \type,avg - ld1 {v24.8B}, [x0], x3 - ld1 {v25.8B}, [x0], x3 - ld1 {v26.8B}, [x0], x3 - urhadd v16.8B, v16.8B, v24.8B - ld1 {v27.8B}, [x0], x3 - 
urhadd v17.8B, v17.8B, v25.8B - ld1 {v28.8B}, [x0], x3 - urhadd v18.8B, v18.8B, v26.8B - ld1 {v29.8B}, [x0], x3 - urhadd v19.8B, v19.8B, v27.8B - ld1 {v30.8B}, [x0], x3 - urhadd v20.8B, v20.8B, v28.8B - ld1 {v31.8B}, [x0], x3 - urhadd v21.8B, v21.8B, v29.8B - urhadd v22.8B, v22.8B, v30.8B - urhadd v23.8B, v23.8B, v31.8B + ld1 {v24.8b}, [x0], x3 + ld1 {v25.8b}, [x0], x3 + ld1 {v26.8b}, [x0], x3 + urhadd v16.8b, v16.8b, v24.8b + ld1 {v27.8b}, [x0], x3 + urhadd v17.8b, v17.8b, v25.8b + ld1 {v28.8b}, [x0], x3 + urhadd v18.8b, v18.8b, v26.8b + ld1 {v29.8b}, [x0], x3 + urhadd v19.8b, v19.8b, v27.8b + ld1 {v30.8b}, [x0], x3 + urhadd v20.8b, v20.8b, v28.8b + ld1 {v31.8b}, [x0], x3 + urhadd v21.8b, v21.8b, v29.8b + urhadd v22.8b, v22.8b, v30.8b + urhadd v23.8b, v23.8b, v31.8b sub x0, x0, x3, lsl #3 .endif - st1 {v16.8B}, [x0], x3 - st1 {v17.8B}, [x0], x3 - st1 {v18.8B}, [x0], x3 - st1 {v19.8B}, [x0], x3 - st1 {v20.8B}, [x0], x3 - st1 {v21.8B}, [x0], x3 - st1 {v22.8B}, [x0], x3 - st1 {v23.8B}, [x0], x3 + st1 {v16.8b}, [x0], x3 + st1 {v17.8b}, [x0], x3 + st1 {v18.8b}, [x0], x3 + st1 {v19.8b}, [x0], x3 + st1 {v20.8b}, [x0], x3 + st1 {v21.8b}, [x0], x3 + st1 {v22.8b}, [x0], x3 + st1 {v23.8b}, [x0], x3 ret endfunc @@ -417,19 +417,19 @@ endfunc function put_h264_qpel8_hv_lowpass_neon_top lowpass_const w12 - ld1 {v16.8H}, [x1], x3 - ld1 {v17.8H}, [x1], x3 - ld1 {v18.8H}, [x1], x3 - ld1 {v19.8H}, [x1], x3 - ld1 {v20.8H}, [x1], x3 - ld1 {v21.8H}, [x1], x3 - ld1 {v22.8H}, [x1], x3 - ld1 {v23.8H}, [x1], x3 - ld1 {v24.8H}, [x1], x3 - ld1 {v25.8H}, [x1], x3 - ld1 {v26.8H}, [x1], x3 - ld1 {v27.8H}, [x1], x3 - ld1 {v28.8H}, [x1] + ld1 {v16.8h}, [x1], x3 + ld1 {v17.8h}, [x1], x3 + ld1 {v18.8h}, [x1], x3 + ld1 {v19.8h}, [x1], x3 + ld1 {v20.8h}, [x1], x3 + ld1 {v21.8h}, [x1], x3 + ld1 {v22.8h}, [x1], x3 + ld1 {v23.8h}, [x1], x3 + ld1 {v24.8h}, [x1], x3 + ld1 {v25.8h}, [x1], x3 + ld1 {v26.8h}, [x1], x3 + ld1 {v27.8h}, [x1], x3 + ld1 {v28.8h}, [x1] lowpass_8H v16, v17 lowpass_8H v18, v19 lowpass_8H v20, v21 @@ -458,33 +458,33 @@ function \type\()_h264_qpel8_hv_lowpass_neon mov x10, x30 bl put_h264_qpel8_hv_lowpass_neon_top .ifc \type,avg - ld1 {v0.8B}, [x0], x2 - ld1 {v1.8B}, [x0], x2 - ld1 {v2.8B}, [x0], x2 - urhadd v16.8B, v16.8B, v0.8B - ld1 {v3.8B}, [x0], x2 - urhadd v17.8B, v17.8B, v1.8B - ld1 {v4.8B}, [x0], x2 - urhadd v18.8B, v18.8B, v2.8B - ld1 {v5.8B}, [x0], x2 - urhadd v19.8B, v19.8B, v3.8B - ld1 {v6.8B}, [x0], x2 - urhadd v20.8B, v20.8B, v4.8B - ld1 {v7.8B}, [x0], x2 - urhadd v21.8B, v21.8B, v5.8B - urhadd v22.8B, v22.8B, v6.8B - urhadd v23.8B, v23.8B, v7.8B + ld1 {v0.8b}, [x0], x2 + ld1 {v1.8b}, [x0], x2 + ld1 {v2.8b}, [x0], x2 + urhadd v16.8b, v16.8b, v0.8b + ld1 {v3.8b}, [x0], x2 + urhadd v17.8b, v17.8b, v1.8b + ld1 {v4.8b}, [x0], x2 + urhadd v18.8b, v18.8b, v2.8b + ld1 {v5.8b}, [x0], x2 + urhadd v19.8b, v19.8b, v3.8b + ld1 {v6.8b}, [x0], x2 + urhadd v20.8b, v20.8b, v4.8b + ld1 {v7.8b}, [x0], x2 + urhadd v21.8b, v21.8b, v5.8b + urhadd v22.8b, v22.8b, v6.8b + urhadd v23.8b, v23.8b, v7.8b sub x0, x0, x2, lsl #3 .endif - st1 {v16.8B}, [x0], x2 - st1 {v17.8B}, [x0], x2 - st1 {v18.8B}, [x0], x2 - st1 {v19.8B}, [x0], x2 - st1 {v20.8B}, [x0], x2 - st1 {v21.8B}, [x0], x2 - st1 {v22.8B}, [x0], x2 - st1 {v23.8B}, [x0], x2 + st1 {v16.8b}, [x0], x2 + st1 {v17.8b}, [x0], x2 + st1 {v18.8b}, [x0], x2 + st1 {v19.8b}, [x0], x2 + st1 {v20.8b}, [x0], x2 + st1 {v21.8b}, [x0], x2 + st1 {v22.8b}, [x0], x2 + st1 {v23.8b}, [x0], x2 ret x10 endfunc @@ -498,45 +498,45 @@ function \type\()_h264_qpel8_hv_lowpass_l2_neon mov x10, 
x30 bl put_h264_qpel8_hv_lowpass_neon_top - ld1 {v0.8B, v1.8B}, [x2], #16 - ld1 {v2.8B, v3.8B}, [x2], #16 - urhadd v0.8B, v0.8B, v16.8B - urhadd v1.8B, v1.8B, v17.8B - ld1 {v4.8B, v5.8B}, [x2], #16 - urhadd v2.8B, v2.8B, v18.8B - urhadd v3.8B, v3.8B, v19.8B - ld1 {v6.8B, v7.8B}, [x2], #16 - urhadd v4.8B, v4.8B, v20.8B - urhadd v5.8B, v5.8B, v21.8B - urhadd v6.8B, v6.8B, v22.8B - urhadd v7.8B, v7.8B, v23.8B + ld1 {v0.8b, v1.8b}, [x2], #16 + ld1 {v2.8b, v3.8b}, [x2], #16 + urhadd v0.8b, v0.8b, v16.8b + urhadd v1.8b, v1.8b, v17.8b + ld1 {v4.8b, v5.8b}, [x2], #16 + urhadd v2.8b, v2.8b, v18.8b + urhadd v3.8b, v3.8b, v19.8b + ld1 {v6.8b, v7.8b}, [x2], #16 + urhadd v4.8b, v4.8b, v20.8b + urhadd v5.8b, v5.8b, v21.8b + urhadd v6.8b, v6.8b, v22.8b + urhadd v7.8b, v7.8b, v23.8b .ifc \type,avg - ld1 {v16.8B}, [x0], x3 - ld1 {v17.8B}, [x0], x3 - ld1 {v18.8B}, [x0], x3 - urhadd v0.8B, v0.8B, v16.8B - ld1 {v19.8B}, [x0], x3 - urhadd v1.8B, v1.8B, v17.8B - ld1 {v20.8B}, [x0], x3 - urhadd v2.8B, v2.8B, v18.8B - ld1 {v21.8B}, [x0], x3 - urhadd v3.8B, v3.8B, v19.8B - ld1 {v22.8B}, [x0], x3 - urhadd v4.8B, v4.8B, v20.8B - ld1 {v23.8B}, [x0], x3 - urhadd v5.8B, v5.8B, v21.8B - urhadd v6.8B, v6.8B, v22.8B - urhadd v7.8B, v7.8B, v23.8B + ld1 {v16.8b}, [x0], x3 + ld1 {v17.8b}, [x0], x3 + ld1 {v18.8b}, [x0], x3 + urhadd v0.8b, v0.8b, v16.8b + ld1 {v19.8b}, [x0], x3 + urhadd v1.8b, v1.8b, v17.8b + ld1 {v20.8b}, [x0], x3 + urhadd v2.8b, v2.8b, v18.8b + ld1 {v21.8b}, [x0], x3 + urhadd v3.8b, v3.8b, v19.8b + ld1 {v22.8b}, [x0], x3 + urhadd v4.8b, v4.8b, v20.8b + ld1 {v23.8b}, [x0], x3 + urhadd v5.8b, v5.8b, v21.8b + urhadd v6.8b, v6.8b, v22.8b + urhadd v7.8b, v7.8b, v23.8b sub x0, x0, x3, lsl #3 .endif - st1 {v0.8B}, [x0], x3 - st1 {v1.8B}, [x0], x3 - st1 {v2.8B}, [x0], x3 - st1 {v3.8B}, [x0], x3 - st1 {v4.8B}, [x0], x3 - st1 {v5.8B}, [x0], x3 - st1 {v6.8B}, [x0], x3 - st1 {v7.8B}, [x0], x3 + st1 {v0.8b}, [x0], x3 + st1 {v1.8b}, [x0], x3 + st1 {v2.8b}, [x0], x3 + st1 {v3.8b}, [x0], x3 + st1 {v4.8b}, [x0], x3 + st1 {v5.8b}, [x0], x3 + st1 {v6.8b}, [x0], x3 + st1 {v7.8b}, [x0], x3 ret x10 endfunc diff --git a/libavcodec/aarch64/hpeldsp_neon.S b/libavcodec/aarch64/hpeldsp_neon.S index a491c173bb..e7c1549c40 100644 --- a/libavcodec/aarch64/hpeldsp_neon.S +++ b/libavcodec/aarch64/hpeldsp_neon.S @@ -26,295 +26,295 @@ .if \avg mov x12, x0 .endif -1: ld1 {v0.16B}, [x1], x2 - ld1 {v1.16B}, [x1], x2 - ld1 {v2.16B}, [x1], x2 - ld1 {v3.16B}, [x1], x2 +1: ld1 {v0.16b}, [x1], x2 + ld1 {v1.16b}, [x1], x2 + ld1 {v2.16b}, [x1], x2 + ld1 {v3.16b}, [x1], x2 .if \avg - ld1 {v4.16B}, [x12], x2 - urhadd v0.16B, v0.16B, v4.16B - ld1 {v5.16B}, [x12], x2 - urhadd v1.16B, v1.16B, v5.16B - ld1 {v6.16B}, [x12], x2 - urhadd v2.16B, v2.16B, v6.16B - ld1 {v7.16B}, [x12], x2 - urhadd v3.16B, v3.16B, v7.16B + ld1 {v4.16b}, [x12], x2 + urhadd v0.16b, v0.16b, v4.16b + ld1 {v5.16b}, [x12], x2 + urhadd v1.16b, v1.16b, v5.16b + ld1 {v6.16b}, [x12], x2 + urhadd v2.16b, v2.16b, v6.16b + ld1 {v7.16b}, [x12], x2 + urhadd v3.16b, v3.16b, v7.16b .endif subs w3, w3, #4 - st1 {v0.16B}, [x0], x2 - st1 {v1.16B}, [x0], x2 - st1 {v2.16B}, [x0], x2 - st1 {v3.16B}, [x0], x2 + st1 {v0.16b}, [x0], x2 + st1 {v1.16b}, [x0], x2 + st1 {v2.16b}, [x0], x2 + st1 {v3.16b}, [x0], x2 b.ne 1b ret .endm .macro pixels16_x2 rnd=1, avg=0 -1: ld1 {v0.16B, v1.16B}, [x1], x2 - ld1 {v2.16B, v3.16B}, [x1], x2 +1: ld1 {v0.16b, v1.16b}, [x1], x2 + ld1 {v2.16b, v3.16b}, [x1], x2 subs w3, w3, #2 - ext v1.16B, v0.16B, v1.16B, #1 - avg v0.16B, v0.16B, v1.16B - ext v3.16B, v2.16B, v3.16B, #1 - avg 
v2.16B, v2.16B, v3.16B + ext v1.16b, v0.16b, v1.16b, #1 + avg v0.16b, v0.16b, v1.16b + ext v3.16b, v2.16b, v3.16b, #1 + avg v2.16b, v2.16b, v3.16b .if \avg - ld1 {v1.16B}, [x0], x2 - ld1 {v3.16B}, [x0] - urhadd v0.16B, v0.16B, v1.16B - urhadd v2.16B, v2.16B, v3.16B + ld1 {v1.16b}, [x0], x2 + ld1 {v3.16b}, [x0] + urhadd v0.16b, v0.16b, v1.16b + urhadd v2.16b, v2.16b, v3.16b sub x0, x0, x2 .endif - st1 {v0.16B}, [x0], x2 - st1 {v2.16B}, [x0], x2 + st1 {v0.16b}, [x0], x2 + st1 {v2.16b}, [x0], x2 b.ne 1b ret .endm .macro pixels16_y2 rnd=1, avg=0 sub w3, w3, #2 - ld1 {v0.16B}, [x1], x2 - ld1 {v1.16B}, [x1], x2 + ld1 {v0.16b}, [x1], x2 + ld1 {v1.16b}, [x1], x2 1: subs w3, w3, #2 - avg v2.16B, v0.16B, v1.16B - ld1 {v0.16B}, [x1], x2 - avg v3.16B, v0.16B, v1.16B - ld1 {v1.16B}, [x1], x2 + avg v2.16b, v0.16b, v1.16b + ld1 {v0.16b}, [x1], x2 + avg v3.16b, v0.16b, v1.16b + ld1 {v1.16b}, [x1], x2 .if \avg - ld1 {v4.16B}, [x0], x2 - ld1 {v5.16B}, [x0] - urhadd v2.16B, v2.16B, v4.16B - urhadd v3.16B, v3.16B, v5.16B + ld1 {v4.16b}, [x0], x2 + ld1 {v5.16b}, [x0] + urhadd v2.16b, v2.16b, v4.16b + urhadd v3.16b, v3.16b, v5.16b sub x0, x0, x2 .endif - st1 {v2.16B}, [x0], x2 - st1 {v3.16B}, [x0], x2 + st1 {v2.16b}, [x0], x2 + st1 {v3.16b}, [x0], x2 b.ne 1b - avg v2.16B, v0.16B, v1.16B - ld1 {v0.16B}, [x1], x2 - avg v3.16B, v0.16B, v1.16B + avg v2.16b, v0.16b, v1.16b + ld1 {v0.16b}, [x1], x2 + avg v3.16b, v0.16b, v1.16b .if \avg - ld1 {v4.16B}, [x0], x2 - ld1 {v5.16B}, [x0] - urhadd v2.16B, v2.16B, v4.16B - urhadd v3.16B, v3.16B, v5.16B + ld1 {v4.16b}, [x0], x2 + ld1 {v5.16b}, [x0] + urhadd v2.16b, v2.16b, v4.16b + urhadd v3.16b, v3.16b, v5.16b sub x0, x0, x2 .endif - st1 {v2.16B}, [x0], x2 - st1 {v3.16B}, [x0], x2 + st1 {v2.16b}, [x0], x2 + st1 {v3.16b}, [x0], x2 ret .endm .macro pixels16_xy2 rnd=1, avg=0 sub w3, w3, #2 - ld1 {v0.16B, v1.16B}, [x1], x2 - ld1 {v4.16B, v5.16B}, [x1], x2 + ld1 {v0.16b, v1.16b}, [x1], x2 + ld1 {v4.16b, v5.16b}, [x1], x2 NRND movi v26.8H, #1 - ext v1.16B, v0.16B, v1.16B, #1 - ext v5.16B, v4.16B, v5.16B, #1 - uaddl v16.8H, v0.8B, v1.8B - uaddl2 v20.8H, v0.16B, v1.16B - uaddl v18.8H, v4.8B, v5.8B - uaddl2 v22.8H, v4.16B, v5.16B + ext v1.16b, v0.16b, v1.16b, #1 + ext v5.16b, v4.16b, v5.16b, #1 + uaddl v16.8h, v0.8b, v1.8b + uaddl2 v20.8h, v0.16b, v1.16b + uaddl v18.8h, v4.8b, v5.8b + uaddl2 v22.8h, v4.16b, v5.16b 1: subs w3, w3, #2 - ld1 {v0.16B, v1.16B}, [x1], x2 - add v24.8H, v16.8H, v18.8H + ld1 {v0.16b, v1.16b}, [x1], x2 + add v24.8h, v16.8h, v18.8h NRND add v24.8H, v24.8H, v26.8H - ext v30.16B, v0.16B, v1.16B, #1 - add v1.8H, v20.8H, v22.8H - mshrn v28.8B, v24.8H, #2 + ext v30.16b, v0.16b, v1.16b, #1 + add v1.8h, v20.8h, v22.8h + mshrn v28.8b, v24.8h, #2 NRND add v1.8H, v1.8H, v26.8H - mshrn2 v28.16B, v1.8H, #2 + mshrn2 v28.16b, v1.8h, #2 .if \avg - ld1 {v16.16B}, [x0] - urhadd v28.16B, v28.16B, v16.16B + ld1 {v16.16b}, [x0] + urhadd v28.16b, v28.16b, v16.16b .endif - uaddl v16.8H, v0.8B, v30.8B - ld1 {v2.16B, v3.16B}, [x1], x2 - uaddl2 v20.8H, v0.16B, v30.16B - st1 {v28.16B}, [x0], x2 - add v24.8H, v16.8H, v18.8H + uaddl v16.8h, v0.8b, v30.8b + ld1 {v2.16b, v3.16b}, [x1], x2 + uaddl2 v20.8h, v0.16b, v30.16b + st1 {v28.16b}, [x0], x2 + add v24.8h, v16.8h, v18.8h NRND add v24.8H, v24.8H, v26.8H - ext v3.16B, v2.16B, v3.16B, #1 - add v0.8H, v20.8H, v22.8H - mshrn v30.8B, v24.8H, #2 + ext v3.16b, v2.16b, v3.16b, #1 + add v0.8h, v20.8h, v22.8h + mshrn v30.8b, v24.8h, #2 NRND add v0.8H, v0.8H, v26.8H - mshrn2 v30.16B, v0.8H, #2 + mshrn2 v30.16b, v0.8h, #2 .if \avg - ld1 {v18.16B}, 
[x0] - urhadd v30.16B, v30.16B, v18.16B + ld1 {v18.16b}, [x0] + urhadd v30.16b, v30.16b, v18.16b .endif - uaddl v18.8H, v2.8B, v3.8B - uaddl2 v22.8H, v2.16B, v3.16B - st1 {v30.16B}, [x0], x2 + uaddl v18.8h, v2.8b, v3.8b + uaddl2 v22.8h, v2.16b, v3.16b + st1 {v30.16b}, [x0], x2 b.gt 1b - ld1 {v0.16B, v1.16B}, [x1], x2 - add v24.8H, v16.8H, v18.8H + ld1 {v0.16b, v1.16b}, [x1], x2 + add v24.8h, v16.8h, v18.8h NRND add v24.8H, v24.8H, v26.8H - ext v30.16B, v0.16B, v1.16B, #1 - add v1.8H, v20.8H, v22.8H - mshrn v28.8B, v24.8H, #2 + ext v30.16b, v0.16b, v1.16b, #1 + add v1.8h, v20.8h, v22.8h + mshrn v28.8b, v24.8h, #2 NRND add v1.8H, v1.8H, v26.8H - mshrn2 v28.16B, v1.8H, #2 + mshrn2 v28.16b, v1.8h, #2 .if \avg - ld1 {v16.16B}, [x0] - urhadd v28.16B, v28.16B, v16.16B + ld1 {v16.16b}, [x0] + urhadd v28.16b, v28.16b, v16.16b .endif - uaddl v16.8H, v0.8B, v30.8B - uaddl2 v20.8H, v0.16B, v30.16B - st1 {v28.16B}, [x0], x2 - add v24.8H, v16.8H, v18.8H + uaddl v16.8h, v0.8b, v30.8b + uaddl2 v20.8h, v0.16b, v30.16b + st1 {v28.16b}, [x0], x2 + add v24.8h, v16.8h, v18.8h NRND add v24.8H, v24.8H, v26.8H - add v0.8H, v20.8H, v22.8H - mshrn v30.8B, v24.8H, #2 + add v0.8h, v20.8h, v22.8h + mshrn v30.8b, v24.8h, #2 NRND add v0.8H, v0.8H, v26.8H - mshrn2 v30.16B, v0.8H, #2 + mshrn2 v30.16b, v0.8h, #2 .if \avg - ld1 {v18.16B}, [x0] - urhadd v30.16B, v30.16B, v18.16B + ld1 {v18.16b}, [x0] + urhadd v30.16b, v30.16b, v18.16b .endif - st1 {v30.16B}, [x0], x2 + st1 {v30.16b}, [x0], x2 ret .endm .macro pixels8 rnd=1, avg=0 -1: ld1 {v0.8B}, [x1], x2 - ld1 {v1.8B}, [x1], x2 - ld1 {v2.8B}, [x1], x2 - ld1 {v3.8B}, [x1], x2 +1: ld1 {v0.8b}, [x1], x2 + ld1 {v1.8b}, [x1], x2 + ld1 {v2.8b}, [x1], x2 + ld1 {v3.8b}, [x1], x2 .if \avg - ld1 {v4.8B}, [x0], x2 - urhadd v0.8B, v0.8B, v4.8B - ld1 {v5.8B}, [x0], x2 - urhadd v1.8B, v1.8B, v5.8B - ld1 {v6.8B}, [x0], x2 - urhadd v2.8B, v2.8B, v6.8B - ld1 {v7.8B}, [x0], x2 - urhadd v3.8B, v3.8B, v7.8B + ld1 {v4.8b}, [x0], x2 + urhadd v0.8b, v0.8b, v4.8b + ld1 {v5.8b}, [x0], x2 + urhadd v1.8b, v1.8b, v5.8b + ld1 {v6.8b}, [x0], x2 + urhadd v2.8b, v2.8b, v6.8b + ld1 {v7.8b}, [x0], x2 + urhadd v3.8b, v3.8b, v7.8b sub x0, x0, x2, lsl #2 .endif subs w3, w3, #4 - st1 {v0.8B}, [x0], x2 - st1 {v1.8B}, [x0], x2 - st1 {v2.8B}, [x0], x2 - st1 {v3.8B}, [x0], x2 + st1 {v0.8b}, [x0], x2 + st1 {v1.8b}, [x0], x2 + st1 {v2.8b}, [x0], x2 + st1 {v3.8b}, [x0], x2 b.ne 1b ret .endm .macro pixels8_x2 rnd=1, avg=0 -1: ld1 {v0.8B, v1.8B}, [x1], x2 - ext v1.8B, v0.8B, v1.8B, #1 - ld1 {v2.8B, v3.8B}, [x1], x2 - ext v3.8B, v2.8B, v3.8B, #1 +1: ld1 {v0.8b, v1.8b}, [x1], x2 + ext v1.8b, v0.8b, v1.8b, #1 + ld1 {v2.8b, v3.8b}, [x1], x2 + ext v3.8b, v2.8b, v3.8b, #1 subs w3, w3, #2 - avg v0.8B, v0.8B, v1.8B - avg v2.8B, v2.8B, v3.8B + avg v0.8b, v0.8b, v1.8b + avg v2.8b, v2.8b, v3.8b .if \avg - ld1 {v4.8B}, [x0], x2 - ld1 {v5.8B}, [x0] - urhadd v0.8B, v0.8B, v4.8B - urhadd v2.8B, v2.8B, v5.8B + ld1 {v4.8b}, [x0], x2 + ld1 {v5.8b}, [x0] + urhadd v0.8b, v0.8b, v4.8b + urhadd v2.8b, v2.8b, v5.8b sub x0, x0, x2 .endif - st1 {v0.8B}, [x0], x2 - st1 {v2.8B}, [x0], x2 + st1 {v0.8b}, [x0], x2 + st1 {v2.8b}, [x0], x2 b.ne 1b ret .endm .macro pixels8_y2 rnd=1, avg=0 sub w3, w3, #2 - ld1 {v0.8B}, [x1], x2 - ld1 {v1.8B}, [x1], x2 + ld1 {v0.8b}, [x1], x2 + ld1 {v1.8b}, [x1], x2 1: subs w3, w3, #2 - avg v4.8B, v0.8B, v1.8B - ld1 {v0.8B}, [x1], x2 - avg v5.8B, v0.8B, v1.8B - ld1 {v1.8B}, [x1], x2 + avg v4.8b, v0.8b, v1.8b + ld1 {v0.8b}, [x1], x2 + avg v5.8b, v0.8b, v1.8b + ld1 {v1.8b}, [x1], x2 .if \avg - ld1 {v2.8B}, [x0], x2 - ld1 
{v3.8B}, [x0] - urhadd v4.8B, v4.8B, v2.8B - urhadd v5.8B, v5.8B, v3.8B + ld1 {v2.8b}, [x0], x2 + ld1 {v3.8b}, [x0] + urhadd v4.8b, v4.8b, v2.8b + urhadd v5.8b, v5.8b, v3.8b sub x0, x0, x2 .endif - st1 {v4.8B}, [x0], x2 - st1 {v5.8B}, [x0], x2 + st1 {v4.8b}, [x0], x2 + st1 {v5.8b}, [x0], x2 b.ne 1b - avg v4.8B, v0.8B, v1.8B - ld1 {v0.8B}, [x1], x2 - avg v5.8B, v0.8B, v1.8B + avg v4.8b, v0.8b, v1.8b + ld1 {v0.8b}, [x1], x2 + avg v5.8b, v0.8b, v1.8b .if \avg - ld1 {v2.8B}, [x0], x2 - ld1 {v3.8B}, [x0] - urhadd v4.8B, v4.8B, v2.8B - urhadd v5.8B, v5.8B, v3.8B + ld1 {v2.8b}, [x0], x2 + ld1 {v3.8b}, [x0] + urhadd v4.8b, v4.8b, v2.8b + urhadd v5.8b, v5.8b, v3.8b sub x0, x0, x2 .endif - st1 {v4.8B}, [x0], x2 - st1 {v5.8B}, [x0], x2 + st1 {v4.8b}, [x0], x2 + st1 {v5.8b}, [x0], x2 ret .endm .macro pixels8_xy2 rnd=1, avg=0 sub w3, w3, #2 - ld1 {v0.16B}, [x1], x2 - ld1 {v1.16B}, [x1], x2 + ld1 {v0.16b}, [x1], x2 + ld1 {v1.16b}, [x1], x2 NRND movi v19.8H, #1 - ext v4.16B, v0.16B, v4.16B, #1 - ext v6.16B, v1.16B, v6.16B, #1 - uaddl v16.8H, v0.8B, v4.8B - uaddl v17.8H, v1.8B, v6.8B + ext v4.16b, v0.16b, v4.16b, #1 + ext v6.16b, v1.16b, v6.16b, #1 + uaddl v16.8h, v0.8b, v4.8b + uaddl v17.8h, v1.8b, v6.8b 1: subs w3, w3, #2 - ld1 {v0.16B}, [x1], x2 - add v18.8H, v16.8H, v17.8H - ext v4.16B, v0.16B, v4.16B, #1 + ld1 {v0.16b}, [x1], x2 + add v18.8h, v16.8h, v17.8h + ext v4.16b, v0.16b, v4.16b, #1 NRND add v18.8H, v18.8H, v19.8H - uaddl v16.8H, v0.8B, v4.8B - mshrn v5.8B, v18.8H, #2 - ld1 {v1.16B}, [x1], x2 - add v18.8H, v16.8H, v17.8H + uaddl v16.8h, v0.8b, v4.8b + mshrn v5.8b, v18.8h, #2 + ld1 {v1.16b}, [x1], x2 + add v18.8h, v16.8h, v17.8h .if \avg - ld1 {v7.8B}, [x0] - urhadd v5.8B, v5.8B, v7.8B + ld1 {v7.8b}, [x0] + urhadd v5.8b, v5.8b, v7.8b .endif NRND add v18.8H, v18.8H, v19.8H - st1 {v5.8B}, [x0], x2 - mshrn v7.8B, v18.8H, #2 + st1 {v5.8b}, [x0], x2 + mshrn v7.8b, v18.8h, #2 .if \avg - ld1 {v5.8B}, [x0] - urhadd v7.8B, v7.8B, v5.8B + ld1 {v5.8b}, [x0] + urhadd v7.8b, v7.8b, v5.8b .endif - ext v6.16B, v1.16B, v6.16B, #1 - uaddl v17.8H, v1.8B, v6.8B - st1 {v7.8B}, [x0], x2 + ext v6.16b, v1.16b, v6.16b, #1 + uaddl v17.8h, v1.8b, v6.8b + st1 {v7.8b}, [x0], x2 b.gt 1b - ld1 {v0.16B}, [x1], x2 - add v18.8H, v16.8H, v17.8H - ext v4.16B, v0.16B, v4.16B, #1 + ld1 {v0.16b}, [x1], x2 + add v18.8h, v16.8h, v17.8h + ext v4.16b, v0.16b, v4.16b, #1 NRND add v18.8H, v18.8H, v19.8H - uaddl v16.8H, v0.8B, v4.8B - mshrn v5.8B, v18.8H, #2 - add v18.8H, v16.8H, v17.8H + uaddl v16.8h, v0.8b, v4.8b + mshrn v5.8b, v18.8h, #2 + add v18.8h, v16.8h, v17.8h .if \avg - ld1 {v7.8B}, [x0] - urhadd v5.8B, v5.8B, v7.8B + ld1 {v7.8b}, [x0] + urhadd v5.8b, v5.8b, v7.8b .endif NRND add v18.8H, v18.8H, v19.8H - st1 {v5.8B}, [x0], x2 - mshrn v7.8B, v18.8H, #2 + st1 {v5.8b}, [x0], x2 + mshrn v7.8b, v18.8h, #2 .if \avg - ld1 {v5.8B}, [x0] - urhadd v7.8B, v7.8B, v5.8B + ld1 {v5.8b}, [x0] + urhadd v7.8b, v7.8b, v5.8b .endif - st1 {v7.8B}, [x0], x2 + st1 {v7.8b}, [x0], x2 ret .endm diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S index cf86e5081d..7500c324bd 100644 --- a/libavcodec/aarch64/me_cmp_neon.S +++ b/libavcodec/aarch64/me_cmp_neon.S @@ -1099,7 +1099,7 @@ function vsse_intra16_neon, export=1 cbnz w4, 2b 3: - add v16.4s, v16.4s, v17.4S + add v16.4s, v16.4s, v17.4s uaddlv d17, v16.4s fmov w0, s17 diff --git a/libavcodec/aarch64/neon.S b/libavcodec/aarch64/neon.S index bc105e4861..f6fb13bea0 100644 --- a/libavcodec/aarch64/neon.S +++ b/libavcodec/aarch64/neon.S @@ -28,146 +28,146 @@ .endm .macro 
transpose_8x8B r0, r1, r2, r3, r4, r5, r6, r7, r8, r9 - trn1 \r8\().8B, \r0\().8B, \r1\().8B - trn2 \r9\().8B, \r0\().8B, \r1\().8B - trn1 \r1\().8B, \r2\().8B, \r3\().8B - trn2 \r3\().8B, \r2\().8B, \r3\().8B - trn1 \r0\().8B, \r4\().8B, \r5\().8B - trn2 \r5\().8B, \r4\().8B, \r5\().8B - trn1 \r2\().8B, \r6\().8B, \r7\().8B - trn2 \r7\().8B, \r6\().8B, \r7\().8B - - trn1 \r4\().4H, \r0\().4H, \r2\().4H - trn2 \r2\().4H, \r0\().4H, \r2\().4H - trn1 \r6\().4H, \r5\().4H, \r7\().4H - trn2 \r7\().4H, \r5\().4H, \r7\().4H - trn1 \r5\().4H, \r9\().4H, \r3\().4H - trn2 \r9\().4H, \r9\().4H, \r3\().4H - trn1 \r3\().4H, \r8\().4H, \r1\().4H - trn2 \r8\().4H, \r8\().4H, \r1\().4H - - trn1 \r0\().2S, \r3\().2S, \r4\().2S - trn2 \r4\().2S, \r3\().2S, \r4\().2S - - trn1 \r1\().2S, \r5\().2S, \r6\().2S - trn2 \r5\().2S, \r5\().2S, \r6\().2S - - trn2 \r6\().2S, \r8\().2S, \r2\().2S - trn1 \r2\().2S, \r8\().2S, \r2\().2S - - trn1 \r3\().2S, \r9\().2S, \r7\().2S - trn2 \r7\().2S, \r9\().2S, \r7\().2S + trn1 \r8\().8b, \r0\().8b, \r1\().8b + trn2 \r9\().8b, \r0\().8b, \r1\().8b + trn1 \r1\().8b, \r2\().8b, \r3\().8b + trn2 \r3\().8b, \r2\().8b, \r3\().8b + trn1 \r0\().8b, \r4\().8b, \r5\().8b + trn2 \r5\().8b, \r4\().8b, \r5\().8b + trn1 \r2\().8b, \r6\().8b, \r7\().8b + trn2 \r7\().8b, \r6\().8b, \r7\().8b + + trn1 \r4\().4h, \r0\().4h, \r2\().4h + trn2 \r2\().4h, \r0\().4h, \r2\().4h + trn1 \r6\().4h, \r5\().4h, \r7\().4h + trn2 \r7\().4h, \r5\().4h, \r7\().4h + trn1 \r5\().4h, \r9\().4h, \r3\().4h + trn2 \r9\().4h, \r9\().4h, \r3\().4h + trn1 \r3\().4h, \r8\().4h, \r1\().4h + trn2 \r8\().4h, \r8\().4h, \r1\().4h + + trn1 \r0\().2s, \r3\().2s, \r4\().2s + trn2 \r4\().2s, \r3\().2s, \r4\().2s + + trn1 \r1\().2s, \r5\().2s, \r6\().2s + trn2 \r5\().2s, \r5\().2s, \r6\().2s + + trn2 \r6\().2s, \r8\().2s, \r2\().2s + trn1 \r2\().2s, \r8\().2s, \r2\().2s + + trn1 \r3\().2s, \r9\().2s, \r7\().2s + trn2 \r7\().2s, \r9\().2s, \r7\().2s .endm .macro transpose_8x16B r0, r1, r2, r3, r4, r5, r6, r7, t0, t1 - trn1 \t0\().16B, \r0\().16B, \r1\().16B - trn2 \t1\().16B, \r0\().16B, \r1\().16B - trn1 \r1\().16B, \r2\().16B, \r3\().16B - trn2 \r3\().16B, \r2\().16B, \r3\().16B - trn1 \r0\().16B, \r4\().16B, \r5\().16B - trn2 \r5\().16B, \r4\().16B, \r5\().16B - trn1 \r2\().16B, \r6\().16B, \r7\().16B - trn2 \r7\().16B, \r6\().16B, \r7\().16B - - trn1 \r4\().8H, \r0\().8H, \r2\().8H - trn2 \r2\().8H, \r0\().8H, \r2\().8H - trn1 \r6\().8H, \r5\().8H, \r7\().8H - trn2 \r7\().8H, \r5\().8H, \r7\().8H - trn1 \r5\().8H, \t1\().8H, \r3\().8H - trn2 \t1\().8H, \t1\().8H, \r3\().8H - trn1 \r3\().8H, \t0\().8H, \r1\().8H - trn2 \t0\().8H, \t0\().8H, \r1\().8H - - trn1 \r0\().4S, \r3\().4S, \r4\().4S - trn2 \r4\().4S, \r3\().4S, \r4\().4S - - trn1 \r1\().4S, \r5\().4S, \r6\().4S - trn2 \r5\().4S, \r5\().4S, \r6\().4S - - trn2 \r6\().4S, \t0\().4S, \r2\().4S - trn1 \r2\().4S, \t0\().4S, \r2\().4S - - trn1 \r3\().4S, \t1\().4S, \r7\().4S - trn2 \r7\().4S, \t1\().4S, \r7\().4S + trn1 \t0\().16b, \r0\().16b, \r1\().16b + trn2 \t1\().16b, \r0\().16b, \r1\().16b + trn1 \r1\().16b, \r2\().16b, \r3\().16b + trn2 \r3\().16b, \r2\().16b, \r3\().16b + trn1 \r0\().16b, \r4\().16b, \r5\().16b + trn2 \r5\().16b, \r4\().16b, \r5\().16b + trn1 \r2\().16b, \r6\().16b, \r7\().16b + trn2 \r7\().16b, \r6\().16b, \r7\().16b + + trn1 \r4\().8h, \r0\().8h, \r2\().8h + trn2 \r2\().8h, \r0\().8h, \r2\().8h + trn1 \r6\().8h, \r5\().8h, \r7\().8h + trn2 \r7\().8h, \r5\().8h, \r7\().8h + trn1 \r5\().8h, \t1\().8h, \r3\().8h + trn2 \t1\().8h, \t1\().8h, 
\r3\().8h + trn1 \r3\().8h, \t0\().8h, \r1\().8h + trn2 \t0\().8h, \t0\().8h, \r1\().8h + + trn1 \r0\().4s, \r3\().4s, \r4\().4s + trn2 \r4\().4s, \r3\().4s, \r4\().4s + + trn1 \r1\().4s, \r5\().4s, \r6\().4s + trn2 \r5\().4s, \r5\().4s, \r6\().4s + + trn2 \r6\().4s, \t0\().4s, \r2\().4s + trn1 \r2\().4s, \t0\().4s, \r2\().4s + + trn1 \r3\().4s, \t1\().4s, \r7\().4s + trn2 \r7\().4s, \t1\().4s, \r7\().4s .endm .macro transpose_4x16B r0, r1, r2, r3, t4, t5, t6, t7 - trn1 \t4\().16B, \r0\().16B, \r1\().16B - trn2 \t5\().16B, \r0\().16B, \r1\().16B - trn1 \t6\().16B, \r2\().16B, \r3\().16B - trn2 \t7\().16B, \r2\().16B, \r3\().16B - - trn1 \r0\().8H, \t4\().8H, \t6\().8H - trn2 \r2\().8H, \t4\().8H, \t6\().8H - trn1 \r1\().8H, \t5\().8H, \t7\().8H - trn2 \r3\().8H, \t5\().8H, \t7\().8H + trn1 \t4\().16b, \r0\().16b, \r1\().16b + trn2 \t5\().16b, \r0\().16b, \r1\().16b + trn1 \t6\().16b, \r2\().16b, \r3\().16b + trn2 \t7\().16b, \r2\().16b, \r3\().16b + + trn1 \r0\().8h, \t4\().8h, \t6\().8h + trn2 \r2\().8h, \t4\().8h, \t6\().8h + trn1 \r1\().8h, \t5\().8h, \t7\().8h + trn2 \r3\().8h, \t5\().8h, \t7\().8h .endm .macro transpose_4x8B r0, r1, r2, r3, t4, t5, t6, t7 - trn1 \t4\().8B, \r0\().8B, \r1\().8B - trn2 \t5\().8B, \r0\().8B, \r1\().8B - trn1 \t6\().8B, \r2\().8B, \r3\().8B - trn2 \t7\().8B, \r2\().8B, \r3\().8B - - trn1 \r0\().4H, \t4\().4H, \t6\().4H - trn2 \r2\().4H, \t4\().4H, \t6\().4H - trn1 \r1\().4H, \t5\().4H, \t7\().4H - trn2 \r3\().4H, \t5\().4H, \t7\().4H + trn1 \t4\().8b, \r0\().8b, \r1\().8b + trn2 \t5\().8b, \r0\().8b, \r1\().8b + trn1 \t6\().8b, \r2\().8b, \r3\().8b + trn2 \t7\().8b, \r2\().8b, \r3\().8b + + trn1 \r0\().4h, \t4\().4h, \t6\().4h + trn2 \r2\().4h, \t4\().4h, \t6\().4h + trn1 \r1\().4h, \t5\().4h, \t7\().4h + trn2 \r3\().4h, \t5\().4h, \t7\().4h .endm .macro transpose_4x4H r0, r1, r2, r3, r4, r5, r6, r7 - trn1 \r4\().4H, \r0\().4H, \r1\().4H - trn2 \r5\().4H, \r0\().4H, \r1\().4H - trn1 \r6\().4H, \r2\().4H, \r3\().4H - trn2 \r7\().4H, \r2\().4H, \r3\().4H - - trn1 \r0\().2S, \r4\().2S, \r6\().2S - trn2 \r2\().2S, \r4\().2S, \r6\().2S - trn1 \r1\().2S, \r5\().2S, \r7\().2S - trn2 \r3\().2S, \r5\().2S, \r7\().2S + trn1 \r4\().4h, \r0\().4h, \r1\().4h + trn2 \r5\().4h, \r0\().4h, \r1\().4h + trn1 \r6\().4h, \r2\().4h, \r3\().4h + trn2 \r7\().4h, \r2\().4h, \r3\().4h + + trn1 \r0\().2s, \r4\().2s, \r6\().2s + trn2 \r2\().2s, \r4\().2s, \r6\().2s + trn1 \r1\().2s, \r5\().2s, \r7\().2s + trn2 \r3\().2s, \r5\().2s, \r7\().2s .endm .macro transpose_4x8H r0, r1, r2, r3, t4, t5, t6, t7 - trn1 \t4\().8H, \r0\().8H, \r1\().8H - trn2 \t5\().8H, \r0\().8H, \r1\().8H - trn1 \t6\().8H, \r2\().8H, \r3\().8H - trn2 \t7\().8H, \r2\().8H, \r3\().8H - - trn1 \r0\().4S, \t4\().4S, \t6\().4S - trn2 \r2\().4S, \t4\().4S, \t6\().4S - trn1 \r1\().4S, \t5\().4S, \t7\().4S - trn2 \r3\().4S, \t5\().4S, \t7\().4S + trn1 \t4\().8h, \r0\().8h, \r1\().8h + trn2 \t5\().8h, \r0\().8h, \r1\().8h + trn1 \t6\().8h, \r2\().8h, \r3\().8h + trn2 \t7\().8h, \r2\().8h, \r3\().8h + + trn1 \r0\().4s, \t4\().4s, \t6\().4s + trn2 \r2\().4s, \t4\().4s, \t6\().4s + trn1 \r1\().4s, \t5\().4s, \t7\().4s + trn2 \r3\().4s, \t5\().4s, \t7\().4s .endm .macro transpose_8x8H r0, r1, r2, r3, r4, r5, r6, r7, r8, r9 - trn1 \r8\().8H, \r0\().8H, \r1\().8H - trn2 \r9\().8H, \r0\().8H, \r1\().8H - trn1 \r1\().8H, \r2\().8H, \r3\().8H - trn2 \r3\().8H, \r2\().8H, \r3\().8H - trn1 \r0\().8H, \r4\().8H, \r5\().8H - trn2 \r5\().8H, \r4\().8H, \r5\().8H - trn1 \r2\().8H, \r6\().8H, \r7\().8H - trn2 \r7\().8H, \r6\().8H, 
\r7\().8H - - trn1 \r4\().4S, \r0\().4S, \r2\().4S - trn2 \r2\().4S, \r0\().4S, \r2\().4S - trn1 \r6\().4S, \r5\().4S, \r7\().4S - trn2 \r7\().4S, \r5\().4S, \r7\().4S - trn1 \r5\().4S, \r9\().4S, \r3\().4S - trn2 \r9\().4S, \r9\().4S, \r3\().4S - trn1 \r3\().4S, \r8\().4S, \r1\().4S - trn2 \r8\().4S, \r8\().4S, \r1\().4S - - trn1 \r0\().2D, \r3\().2D, \r4\().2D - trn2 \r4\().2D, \r3\().2D, \r4\().2D - - trn1 \r1\().2D, \r5\().2D, \r6\().2D - trn2 \r5\().2D, \r5\().2D, \r6\().2D - - trn2 \r6\().2D, \r8\().2D, \r2\().2D - trn1 \r2\().2D, \r8\().2D, \r2\().2D - - trn1 \r3\().2D, \r9\().2D, \r7\().2D - trn2 \r7\().2D, \r9\().2D, \r7\().2D + trn1 \r8\().8h, \r0\().8h, \r1\().8h + trn2 \r9\().8h, \r0\().8h, \r1\().8h + trn1 \r1\().8h, \r2\().8h, \r3\().8h + trn2 \r3\().8h, \r2\().8h, \r3\().8h + trn1 \r0\().8h, \r4\().8h, \r5\().8h + trn2 \r5\().8h, \r4\().8h, \r5\().8h + trn1 \r2\().8h, \r6\().8h, \r7\().8h + trn2 \r7\().8h, \r6\().8h, \r7\().8h + + trn1 \r4\().4s, \r0\().4s, \r2\().4s + trn2 \r2\().4s, \r0\().4s, \r2\().4s + trn1 \r6\().4s, \r5\().4s, \r7\().4s + trn2 \r7\().4s, \r5\().4s, \r7\().4s + trn1 \r5\().4s, \r9\().4s, \r3\().4s + trn2 \r9\().4s, \r9\().4s, \r3\().4s + trn1 \r3\().4s, \r8\().4s, \r1\().4s + trn2 \r8\().4s, \r8\().4s, \r1\().4s + + trn1 \r0\().2d, \r3\().2d, \r4\().2d + trn2 \r4\().2d, \r3\().2d, \r4\().2d + + trn1 \r1\().2d, \r5\().2d, \r6\().2d + trn2 \r5\().2d, \r5\().2d, \r6\().2d + + trn2 \r6\().2d, \r8\().2d, \r2\().2d + trn1 \r2\().2d, \r8\().2d, \r2\().2d + + trn1 \r3\().2d, \r9\().2d, \r7\().2d + trn2 \r7\().2d, \r9\().2d, \r7\().2d .endm diff --git a/libavcodec/aarch64/sbrdsp_neon.S b/libavcodec/aarch64/sbrdsp_neon.S index d23717e760..1fdde6ccb6 100644 --- a/libavcodec/aarch64/sbrdsp_neon.S +++ b/libavcodec/aarch64/sbrdsp_neon.S @@ -46,49 +46,49 @@ function ff_sbr_sum64x5_neon, export=1 add x3, x0, #192*4 add x4, x0, #256*4 mov x5, #64 -1: ld1 {v0.4S}, [x0] - ld1 {v1.4S}, [x1], #16 - fadd v0.4S, v0.4S, v1.4S - ld1 {v2.4S}, [x2], #16 - fadd v0.4S, v0.4S, v2.4S - ld1 {v3.4S}, [x3], #16 - fadd v0.4S, v0.4S, v3.4S - ld1 {v4.4S}, [x4], #16 - fadd v0.4S, v0.4S, v4.4S - st1 {v0.4S}, [x0], #16 +1: ld1 {v0.4s}, [x0] + ld1 {v1.4s}, [x1], #16 + fadd v0.4s, v0.4s, v1.4s + ld1 {v2.4s}, [x2], #16 + fadd v0.4s, v0.4s, v2.4s + ld1 {v3.4s}, [x3], #16 + fadd v0.4s, v0.4s, v3.4s + ld1 {v4.4s}, [x4], #16 + fadd v0.4s, v0.4s, v4.4s + st1 {v0.4s}, [x0], #16 subs x5, x5, #4 b.gt 1b ret endfunc function ff_sbr_sum_square_neon, export=1 - movi v0.4S, #0 -1: ld1 {v1.4S}, [x0], #16 - fmla v0.4S, v1.4S, v1.4S + movi v0.4s, #0 +1: ld1 {v1.4s}, [x0], #16 + fmla v0.4s, v1.4s, v1.4s subs w1, w1, #2 b.gt 1b - faddp v0.4S, v0.4S, v0.4S - faddp v0.4S, v0.4S, v0.4S + faddp v0.4s, v0.4s, v0.4s + faddp v0.4s, v0.4s, v0.4s ret endfunc function ff_sbr_neg_odd_64_neon, export=1 mov x1, x0 - movi v5.4S, #1<<7, lsl #24 - ld2 {v0.4S, v1.4S}, [x0], #32 - eor v1.16B, v1.16B, v5.16B - ld2 {v2.4S, v3.4S}, [x0], #32 + movi v5.4s, #1<<7, lsl #24 + ld2 {v0.4s, v1.4s}, [x0], #32 + eor v1.16b, v1.16b, v5.16b + ld2 {v2.4s, v3.4s}, [x0], #32 .rept 3 - st2 {v0.4S, v1.4S}, [x1], #32 - eor v3.16B, v3.16B, v5.16B - ld2 {v0.4S, v1.4S}, [x0], #32 - st2 {v2.4S, v3.4S}, [x1], #32 - eor v1.16B, v1.16B, v5.16B - ld2 {v2.4S, v3.4S}, [x0], #32 + st2 {v0.4s, v1.4s}, [x1], #32 + eor v3.16b, v3.16b, v5.16b + ld2 {v0.4s, v1.4s}, [x0], #32 + st2 {v2.4s, v3.4s}, [x1], #32 + eor v1.16b, v1.16b, v5.16b + ld2 {v2.4s, v3.4s}, [x0], #32 .endr - eor v3.16B, v3.16B, v5.16B - st2 {v0.4S, v1.4S}, [x1], #32 - st2 {v2.4S, v3.4S}, [x1], #32 
+ eor v3.16b, v3.16b, v5.16b + st2 {v0.4s, v1.4s}, [x1], #32 + st2 {v2.4s, v3.4s}, [x1], #32 ret endfunc @@ -97,26 +97,26 @@ function ff_sbr_qmf_pre_shuffle_neon, export=1 add x2, x0, #64*4 mov x3, #-16 mov x4, #-4 - movi v6.4S, #1<<7, lsl #24 - ld1 {v0.2S}, [x0], #8 - st1 {v0.2S}, [x2], #8 + movi v6.4s, #1<<7, lsl #24 + ld1 {v0.2s}, [x0], #8 + st1 {v0.2s}, [x2], #8 .rept 7 - ld1 {v1.4S}, [x1], x3 - ld1 {v2.4S}, [x0], #16 - eor v1.16B, v1.16B, v6.16B - rev64 v1.4S, v1.4S - ext v1.16B, v1.16B, v1.16B, #8 - st2 {v1.4S, v2.4S}, [x2], #32 + ld1 {v1.4s}, [x1], x3 + ld1 {v2.4s}, [x0], #16 + eor v1.16b, v1.16b, v6.16b + rev64 v1.4s, v1.4s + ext v1.16b, v1.16b, v1.16b, #8 + st2 {v1.4s, v2.4s}, [x2], #32 .endr add x1, x1, #8 - ld1 {v1.2S}, [x1], x4 - ld1 {v2.2S}, [x0], #8 - ld1 {v1.S}[3], [x1] - ld1 {v2.S}[2], [x0] - eor v1.16B, v1.16B, v6.16B - rev64 v1.4S, v1.4S - st2 {v1.2S, v2.2S}, [x2], #16 - st2 {v1.S, v2.S}[2], [x2] + ld1 {v1.2s}, [x1], x4 + ld1 {v2.2s}, [x0], #8 + ld1 {v1.s}[3], [x1] + ld1 {v2.s}[2], [x0] + eor v1.16b, v1.16b, v6.16b + rev64 v1.4s, v1.4s + st2 {v1.2s, v2.2s}, [x2], #16 + st2 {v1.s, v2.s}[2], [x2] ret endfunc @@ -124,13 +124,13 @@ function ff_sbr_qmf_post_shuffle_neon, export=1 add x2, x1, #60*4 mov x3, #-16 mov x4, #32 - movi v6.4S, #1<<7, lsl #24 -1: ld1 {v0.4S}, [x2], x3 - ld1 {v1.4S}, [x1], #16 - eor v0.16B, v0.16B, v6.16B - rev64 v0.4S, v0.4S - ext v0.16B, v0.16B, v0.16B, #8 - st2 {v0.4S, v1.4S}, [x0], #32 + movi v6.4s, #1<<7, lsl #24 +1: ld1 {v0.4s}, [x2], x3 + ld1 {v1.4s}, [x1], #16 + eor v0.16b, v0.16b, v6.16b + rev64 v0.4s, v0.4s + ext v0.16b, v0.16b, v0.16b, #8 + st2 {v0.4s, v1.4s}, [x0], #32 subs x4, x4, #4 b.gt 1b ret @@ -141,13 +141,13 @@ function ff_sbr_qmf_deint_neg_neon, export=1 add x2, x0, #60*4 mov x3, #-32 mov x4, #32 - movi v2.4S, #1<<7, lsl #24 -1: ld2 {v0.4S, v1.4S}, [x1], x3 - eor v0.16B, v0.16B, v2.16B - rev64 v1.4S, v1.4S - ext v1.16B, v1.16B, v1.16B, #8 - st1 {v0.4S}, [x2] - st1 {v1.4S}, [x0], #16 + movi v2.4s, #1<<7, lsl #24 +1: ld2 {v0.4s, v1.4s}, [x1], x3 + eor v0.16b, v0.16b, v2.16b + rev64 v1.4s, v1.4s + ext v1.16b, v1.16b, v1.16b, #8 + st1 {v0.4s}, [x2] + st1 {v1.4s}, [x0], #16 sub x2, x2, #16 subs x4, x4, #4 b.gt 1b @@ -159,16 +159,16 @@ function ff_sbr_qmf_deint_bfly_neon, export=1 add x3, x0, #124*4 mov x4, #64 mov x5, #-16 -1: ld1 {v0.4S}, [x1], #16 - ld1 {v1.4S}, [x2], x5 - rev64 v2.4S, v0.4S - ext v2.16B, v2.16B, v2.16B, #8 - rev64 v3.4S, v1.4S - ext v3.16B, v3.16B, v3.16B, #8 - fadd v1.4S, v1.4S, v2.4S - fsub v0.4S, v0.4S, v3.4S - st1 {v0.4S}, [x0], #16 - st1 {v1.4S}, [x3], x5 +1: ld1 {v0.4s}, [x1], #16 + ld1 {v1.4s}, [x2], x5 + rev64 v2.4s, v0.4s + ext v2.16b, v2.16b, v2.16b, #8 + rev64 v3.4s, v1.4s + ext v3.16b, v3.16b, v3.16b, #8 + fadd v1.4s, v1.4s, v2.4s + fsub v0.4s, v0.4s, v3.4s + st1 {v0.4s}, [x0], #16 + st1 {v1.4s}, [x3], x5 subs x4, x4, #4 b.gt 1b ret @@ -178,32 +178,32 @@ function ff_sbr_hf_gen_neon, export=1 sxtw x4, w4 sxtw x5, w5 movrel x6, factors - ld1 {v7.4S}, [x6] - dup v1.4S, v0.S[0] - mov v2.8B, v1.8B - mov v2.S[2], v7.S[0] - mov v2.S[3], v7.S[0] - fmul v1.4S, v1.4S, v2.4S - ld1 {v0.D}[0], [x3] - ld1 {v0.D}[1], [x2] - fmul v0.4S, v0.4S, v1.4S - fmul v1.4S, v0.4S, v7.4S - rev64 v0.4S, v0.4S + ld1 {v7.4s}, [x6] + dup v1.4s, v0.s[0] + mov v2.8b, v1.8b + mov v2.s[2], v7.s[0] + mov v2.s[3], v7.s[0] + fmul v1.4s, v1.4s, v2.4s + ld1 {v0.d}[0], [x3] + ld1 {v0.d}[1], [x2] + fmul v0.4s, v0.4s, v1.4s + fmul v1.4s, v0.4s, v7.4s + rev64 v0.4s, v0.4s sub x7, x5, x4 add x0, x0, x4, lsl #3 add x1, x1, x4, lsl #3 sub x1, x1, #16 -1: 
ld1 {v2.4S}, [x1], #16 - ld1 {v3.2S}, [x1] - fmul v4.4S, v2.4S, v1.4S - fmul v5.4S, v2.4S, v0.4S - faddp v4.4S, v4.4S, v4.4S - faddp v5.4S, v5.4S, v5.4S - faddp v4.4S, v4.4S, v4.4S - faddp v5.4S, v5.4S, v5.4S - mov v4.S[1], v5.S[0] - fadd v4.2S, v4.2S, v3.2S - st1 {v4.2S}, [x0], #8 +1: ld1 {v2.4s}, [x1], #16 + ld1 {v3.2s}, [x1] + fmul v4.4s, v2.4s, v1.4s + fmul v5.4s, v2.4s, v0.4s + faddp v4.4s, v4.4s, v4.4s + faddp v5.4s, v5.4s, v5.4s + faddp v4.4s, v4.4s, v4.4s + faddp v5.4s, v5.4s, v5.4s + mov v4.s[1], v5.s[0] + fadd v4.2s, v4.2s, v3.2s + st1 {v4.2s}, [x0], #8 sub x1, x1, #8 subs x7, x7, #1 b.gt 1b @@ -215,10 +215,10 @@ function ff_sbr_hf_g_filt_neon, export=1 sxtw x4, w4 mov x5, #40*2*4 add x1, x1, x4, lsl #3 -1: ld1 {v0.2S}, [x1], x5 - ld1 {v1.S}[0], [x2], #4 - fmul v2.4S, v0.4S, v1.S[0] - st1 {v2.2S}, [x0], #8 +1: ld1 {v0.2s}, [x1], x5 + ld1 {v1.s}[0], [x2], #4 + fmul v2.4s, v0.4s, v1.s[0] + st1 {v2.2s}, [x0], #8 subs x3, x3, #1 b.gt 1b ret @@ -227,46 +227,46 @@ endfunc function ff_sbr_autocorrelate_neon, export=1 mov x2, #38 movrel x3, factors - ld1 {v0.4S}, [x3] - movi v1.4S, #0 - movi v2.4S, #0 - movi v3.4S, #0 - ld1 {v4.2S}, [x0], #8 - ld1 {v5.2S}, [x0], #8 - fmul v16.2S, v4.2S, v4.2S - fmul v17.2S, v5.2S, v4.S[0] - fmul v18.2S, v5.2S, v4.S[1] -1: ld1 {v5.D}[1], [x0], #8 - fmla v1.2S, v4.2S, v4.2S - fmla v2.4S, v5.4S, v4.S[0] - fmla v3.4S, v5.4S, v4.S[1] - mov v4.D[0], v5.D[0] - mov v5.D[0], v5.D[1] + ld1 {v0.4s}, [x3] + movi v1.4s, #0 + movi v2.4s, #0 + movi v3.4s, #0 + ld1 {v4.2s}, [x0], #8 + ld1 {v5.2s}, [x0], #8 + fmul v16.2s, v4.2s, v4.2s + fmul v17.2s, v5.2s, v4.s[0] + fmul v18.2s, v5.2s, v4.s[1] +1: ld1 {v5.d}[1], [x0], #8 + fmla v1.2s, v4.2s, v4.2s + fmla v2.4s, v5.4s, v4.s[0] + fmla v3.4s, v5.4s, v4.s[1] + mov v4.d[0], v5.d[0] + mov v5.d[0], v5.d[1] subs x2, x2, #1 b.gt 1b - fmul v19.2S, v4.2S, v4.2S - fmul v20.2S, v5.2S, v4.S[0] - fmul v21.2S, v5.2S, v4.S[1] - fadd v22.4S, v2.4S, v20.4S - fsub v22.4S, v22.4S, v17.4S - fadd v23.4S, v3.4S, v21.4S - fsub v23.4S, v23.4S, v18.4S - rev64 v23.4S, v23.4S - fmul v23.4S, v23.4S, v0.4S - fadd v22.4S, v22.4S, v23.4S - st1 {v22.4S}, [x1], #16 - fadd v23.2S, v1.2S, v19.2S - fsub v23.2S, v23.2S, v16.2S - faddp v23.2S, v23.2S, v23.2S - st1 {v23.S}[0], [x1] + fmul v19.2s, v4.2s, v4.2s + fmul v20.2s, v5.2s, v4.s[0] + fmul v21.2s, v5.2s, v4.s[1] + fadd v22.4s, v2.4s, v20.4s + fsub v22.4s, v22.4s, v17.4s + fadd v23.4s, v3.4s, v21.4s + fsub v23.4s, v23.4s, v18.4s + rev64 v23.4s, v23.4s + fmul v23.4s, v23.4s, v0.4s + fadd v22.4s, v22.4s, v23.4s + st1 {v22.4s}, [x1], #16 + fadd v23.2s, v1.2s, v19.2s + fsub v23.2s, v23.2s, v16.2s + faddp v23.2s, v23.2s, v23.2s + st1 {v23.s}[0], [x1] add x1, x1, #8 - rev64 v3.2S, v3.2S - fmul v3.2S, v3.2S, v0.2S - fadd v2.2S, v2.2S, v3.2S - st1 {v2.2S}, [x1] + rev64 v3.2s, v3.2s + fmul v3.2s, v3.2s, v0.2s + fadd v2.2s, v2.2s, v3.2s + st1 {v2.2s}, [x1] add x1, x1, #16 - faddp v1.2S, v1.2S, v1.2S - st1 {v1.S}[0], [x1] + faddp v1.2s, v1.2s, v1.2s + st1 {v1.s}[0], [x1] ret endfunc @@ -278,25 +278,25 @@ endfunc 1: and x3, x3, #0x1ff add x8, x7, x3, lsl #3 add x3, x3, #2 - ld1 {v2.4S}, [x0] - ld1 {v3.2S}, [x1], #8 - ld1 {v4.2S}, [x2], #8 - ld1 {v5.4S}, [x8] - mov v6.16B, v2.16B - zip1 v3.4S, v3.4S, v3.4S - zip1 v4.4S, v4.4S, v4.4S - fmla v6.4S, v1.4S, v3.4S - fmla v2.4S, v5.4S, v4.4S - fcmeq v7.4S, v3.4S, #0 - bif v2.16B, v6.16B, v7.16B - st1 {v2.4S}, [x0], #16 + ld1 {v2.4s}, [x0] + ld1 {v3.2s}, [x1], #8 + ld1 {v4.2s}, [x2], #8 + ld1 {v5.4s}, [x8] + mov v6.16b, v2.16b + zip1 v3.4s, v3.4s, v3.4s + zip1 v4.4s, v4.4s, 
v4.4s + fmla v6.4s, v1.4s, v3.4s + fmla v2.4s, v5.4s, v4.4s + fcmeq v7.4s, v3.4s, #0 + bif v2.16b, v6.16b, v7.16b + st1 {v2.4s}, [x0], #16 subs x5, x5, #2 b.gt 1b .endm function ff_sbr_hf_apply_noise_0_neon, export=1 movrel x9, phi_noise_0 - ld1 {v1.4S}, [x9] + ld1 {v1.4s}, [x9] apply_noise_common ret endfunc @@ -305,14 +305,14 @@ function ff_sbr_hf_apply_noise_1_neon, export=1 movrel x9, phi_noise_1 and x4, x4, #1 add x9, x9, x4, lsl #4 - ld1 {v1.4S}, [x9] + ld1 {v1.4s}, [x9] apply_noise_common ret endfunc function ff_sbr_hf_apply_noise_2_neon, export=1 movrel x9, phi_noise_2 - ld1 {v1.4S}, [x9] + ld1 {v1.4s}, [x9] apply_noise_common ret endfunc @@ -321,7 +321,7 @@ function ff_sbr_hf_apply_noise_3_neon, export=1 movrel x9, phi_noise_3 and x4, x4, #1 add x9, x9, x4, lsl #4 - ld1 {v1.4S}, [x9] + ld1 {v1.4s}, [x9] apply_noise_common ret endfunc diff --git a/libavcodec/aarch64/simple_idct_neon.S b/libavcodec/aarch64/simple_idct_neon.S index 210182ff21..a4438e9922 100644 --- a/libavcodec/aarch64/simple_idct_neon.S +++ b/libavcodec/aarch64/simple_idct_neon.S @@ -54,7 +54,7 @@ endconst prfm pldl1keep, [\data] mov x10, x30 movrel x3, idct_coeff_neon - ld1 {v0.2D}, [x3] + ld1 {v0.2d}, [x3] .endm .macro idct_end @@ -74,146 +74,146 @@ endconst .endm .macro idct_col4_top y1, y2, y3, y4, i, l - smull\i v7.4S, \y3\l, z2 - smull\i v16.4S, \y3\l, z6 - smull\i v17.4S, \y2\l, z1 - add v19.4S, v23.4S, v7.4S - smull\i v18.4S, \y2\l, z3 - add v20.4S, v23.4S, v16.4S - smull\i v5.4S, \y2\l, z5 - sub v21.4S, v23.4S, v16.4S - smull\i v6.4S, \y2\l, z7 - sub v22.4S, v23.4S, v7.4S - - smlal\i v17.4S, \y4\l, z3 - smlsl\i v18.4S, \y4\l, z7 - smlsl\i v5.4S, \y4\l, z1 - smlsl\i v6.4S, \y4\l, z5 + smull\i v7.4s, \y3\l, z2 + smull\i v16.4s, \y3\l, z6 + smull\i v17.4s, \y2\l, z1 + add v19.4s, v23.4s, v7.4s + smull\i v18.4s, \y2\l, z3 + add v20.4s, v23.4s, v16.4s + smull\i v5.4s, \y2\l, z5 + sub v21.4s, v23.4s, v16.4s + smull\i v6.4s, \y2\l, z7 + sub v22.4s, v23.4s, v7.4s + + smlal\i v17.4s, \y4\l, z3 + smlsl\i v18.4s, \y4\l, z7 + smlsl\i v5.4s, \y4\l, z1 + smlsl\i v6.4s, \y4\l, z5 .endm .macro idct_row4_neon y1, y2, y3, y4, pass - ld1 {\y1\().2D,\y2\().2D}, [x2], #32 - movi v23.4S, #1<<2, lsl #8 - orr v5.16B, \y1\().16B, \y2\().16B - ld1 {\y3\().2D,\y4\().2D}, [x2], #32 - orr v6.16B, \y3\().16B, \y4\().16B - orr v5.16B, v5.16B, v6.16B - mov x3, v5.D[1] - smlal v23.4S, \y1\().4H, z4 + ld1 {\y1\().2d,\y2\().2d}, [x2], #32 + movi v23.4s, #1<<2, lsl #8 + orr v5.16b, \y1\().16b, \y2\().16b + ld1 {\y3\().2d,\y4\().2d}, [x2], #32 + orr v6.16b, \y3\().16b, \y4\().16b + orr v5.16b, v5.16b, v6.16b + mov x3, v5.d[1] + smlal v23.4s, \y1\().4h, z4 - idct_col4_top \y1, \y2, \y3, \y4, 1, .4H + idct_col4_top \y1, \y2, \y3, \y4, 1, .4h cmp x3, #0 b.eq \pass\()f - smull2 v7.4S, \y1\().8H, z4 - smlal2 v17.4S, \y2\().8H, z5 - smlsl2 v18.4S, \y2\().8H, z1 - smull2 v16.4S, \y3\().8H, z2 - smlal2 v5.4S, \y2\().8H, z7 - add v19.4S, v19.4S, v7.4S - sub v20.4S, v20.4S, v7.4S - sub v21.4S, v21.4S, v7.4S - add v22.4S, v22.4S, v7.4S - smlal2 v6.4S, \y2\().8H, z3 - smull2 v7.4S, \y3\().8H, z6 - smlal2 v17.4S, \y4\().8H, z7 - smlsl2 v18.4S, \y4\().8H, z5 - smlal2 v5.4S, \y4\().8H, z3 - smlsl2 v6.4S, \y4\().8H, z1 - add v19.4S, v19.4S, v7.4S - sub v20.4S, v20.4S, v16.4S - add v21.4S, v21.4S, v16.4S - sub v22.4S, v22.4S, v7.4S + smull2 v7.4s, \y1\().8h, z4 + smlal2 v17.4s, \y2\().8h, z5 + smlsl2 v18.4s, \y2\().8h, z1 + smull2 v16.4s, \y3\().8h, z2 + smlal2 v5.4s, \y2\().8h, z7 + add v19.4s, v19.4s, v7.4s + sub v20.4s, v20.4s, v7.4s + sub v21.4s, v21.4s, 
v7.4s + add v22.4s, v22.4s, v7.4s + smlal2 v6.4s, \y2\().8h, z3 + smull2 v7.4s, \y3\().8h, z6 + smlal2 v17.4s, \y4\().8h, z7 + smlsl2 v18.4s, \y4\().8h, z5 + smlal2 v5.4s, \y4\().8h, z3 + smlsl2 v6.4s, \y4\().8h, z1 + add v19.4s, v19.4s, v7.4s + sub v20.4s, v20.4s, v16.4s + add v21.4s, v21.4s, v16.4s + sub v22.4s, v22.4s, v7.4s \pass: add \y3\().4S, v19.4S, v17.4S - add \y4\().4S, v20.4S, v18.4S - shrn \y1\().4H, \y3\().4S, #ROW_SHIFT - shrn \y2\().4H, \y4\().4S, #ROW_SHIFT - add v7.4S, v21.4S, v5.4S - add v16.4S, v22.4S, v6.4S - shrn \y3\().4H, v7.4S, #ROW_SHIFT - shrn \y4\().4H, v16.4S, #ROW_SHIFT - sub v22.4S, v22.4S, v6.4S - sub v19.4S, v19.4S, v17.4S - sub v21.4S, v21.4S, v5.4S - shrn2 \y1\().8H, v22.4S, #ROW_SHIFT - sub v20.4S, v20.4S, v18.4S - shrn2 \y2\().8H, v21.4S, #ROW_SHIFT - shrn2 \y3\().8H, v20.4S, #ROW_SHIFT - shrn2 \y4\().8H, v19.4S, #ROW_SHIFT - - trn1 v16.8H, \y1\().8H, \y2\().8H - trn2 v17.8H, \y1\().8H, \y2\().8H - trn1 v18.8H, \y3\().8H, \y4\().8H - trn2 v19.8H, \y3\().8H, \y4\().8H - trn1 \y1\().4S, v16.4S, v18.4S - trn1 \y2\().4S, v17.4S, v19.4S - trn2 \y3\().4S, v16.4S, v18.4S - trn2 \y4\().4S, v17.4S, v19.4S + add \y4\().4s, v20.4s, v18.4s + shrn \y1\().4h, \y3\().4s, #ROW_SHIFT + shrn \y2\().4h, \y4\().4s, #ROW_SHIFT + add v7.4s, v21.4s, v5.4s + add v16.4s, v22.4s, v6.4s + shrn \y3\().4h, v7.4s, #ROW_SHIFT + shrn \y4\().4h, v16.4s, #ROW_SHIFT + sub v22.4s, v22.4s, v6.4s + sub v19.4s, v19.4s, v17.4s + sub v21.4s, v21.4s, v5.4s + shrn2 \y1\().8h, v22.4s, #ROW_SHIFT + sub v20.4s, v20.4s, v18.4s + shrn2 \y2\().8h, v21.4s, #ROW_SHIFT + shrn2 \y3\().8h, v20.4s, #ROW_SHIFT + shrn2 \y4\().8h, v19.4s, #ROW_SHIFT + + trn1 v16.8h, \y1\().8h, \y2\().8h + trn2 v17.8h, \y1\().8h, \y2\().8h + trn1 v18.8h, \y3\().8h, \y4\().8h + trn2 v19.8h, \y3\().8h, \y4\().8h + trn1 \y1\().4s, v16.4s, v18.4s + trn1 \y2\().4s, v17.4s, v19.4s + trn2 \y3\().4s, v16.4s, v18.4s + trn2 \y4\().4s, v17.4s, v19.4s .endm .macro declare_idct_col4_neon i, l function idct_col4_neon\i - dup v23.4H, z4c + dup v23.4h, z4c .if \i == 1 - add v23.4H, v23.4H, v24.4H + add v23.4h, v23.4h, v24.4h .else - mov v5.D[0], v24.D[1] - add v23.4H, v23.4H, v5.4H + mov v5.d[0], v24.d[1] + add v23.4h, v23.4h, v5.4h .endif - smull v23.4S, v23.4H, z4 + smull v23.4s, v23.4h, z4 idct_col4_top v24, v25, v26, v27, \i, \l - mov x4, v28.D[\i - 1] - mov x5, v29.D[\i - 1] + mov x4, v28.d[\i - 1] + mov x5, v29.d[\i - 1] cmp x4, #0 b.eq 1f - smull\i v7.4S, v28\l, z4 - add v19.4S, v19.4S, v7.4S - sub v20.4S, v20.4S, v7.4S - sub v21.4S, v21.4S, v7.4S - add v22.4S, v22.4S, v7.4S + smull\i v7.4s, v28\l, z4 + add v19.4s, v19.4s, v7.4s + sub v20.4s, v20.4s, v7.4s + sub v21.4s, v21.4s, v7.4s + add v22.4s, v22.4s, v7.4s -1: mov x4, v30.D[\i - 1] +1: mov x4, v30.d[\i - 1] cmp x5, #0 b.eq 2f - smlal\i v17.4S, v29\l, z5 - smlsl\i v18.4S, v29\l, z1 - smlal\i v5.4S, v29\l, z7 - smlal\i v6.4S, v29\l, z3 + smlal\i v17.4s, v29\l, z5 + smlsl\i v18.4s, v29\l, z1 + smlal\i v5.4s, v29\l, z7 + smlal\i v6.4s, v29\l, z3 -2: mov x5, v31.D[\i - 1] +2: mov x5, v31.d[\i - 1] cmp x4, #0 b.eq 3f - smull\i v7.4S, v30\l, z6 - smull\i v16.4S, v30\l, z2 - add v19.4S, v19.4S, v7.4S - sub v22.4S, v22.4S, v7.4S - sub v20.4S, v20.4S, v16.4S - add v21.4S, v21.4S, v16.4S + smull\i v7.4s, v30\l, z6 + smull\i v16.4s, v30\l, z2 + add v19.4s, v19.4s, v7.4s + sub v22.4s, v22.4s, v7.4s + sub v20.4s, v20.4s, v16.4s + add v21.4s, v21.4s, v16.4s 3: cmp x5, #0 b.eq 4f - smlal\i v17.4S, v31\l, z7 - smlsl\i v18.4S, v31\l, z5 - smlal\i v5.4S, v31\l, z3 - smlsl\i v6.4S, v31\l, z1 + smlal\i 
v17.4s, v31\l, z7 + smlsl\i v18.4s, v31\l, z5 + smlal\i v5.4s, v31\l, z3 + smlsl\i v6.4s, v31\l, z1 -4: addhn v7.4H, v19.4S, v17.4S - addhn2 v7.8H, v20.4S, v18.4S - subhn v18.4H, v20.4S, v18.4S - subhn2 v18.8H, v19.4S, v17.4S +4: addhn v7.4h, v19.4s, v17.4s + addhn2 v7.8h, v20.4s, v18.4s + subhn v18.4h, v20.4s, v18.4s + subhn2 v18.8h, v19.4s, v17.4s - addhn v16.4H, v21.4S, v5.4S - addhn2 v16.8H, v22.4S, v6.4S - subhn v17.4H, v22.4S, v6.4S - subhn2 v17.8H, v21.4S, v5.4S + addhn v16.4h, v21.4s, v5.4s + addhn2 v16.8h, v22.4s, v6.4s + subhn v17.4h, v22.4s, v6.4s + subhn2 v17.8h, v21.4s, v5.4s ret endfunc @@ -229,33 +229,33 @@ function ff_simple_idct_put_neon, export=1 idct_row4_neon v28, v29, v30, v31, 2 bl idct_col4_neon1 - sqshrun v1.8B, v7.8H, #COL_SHIFT-16 - sqshrun2 v1.16B, v16.8H, #COL_SHIFT-16 - sqshrun v3.8B, v17.8H, #COL_SHIFT-16 - sqshrun2 v3.16B, v18.8H, #COL_SHIFT-16 + sqshrun v1.8b, v7.8h, #COL_SHIFT-16 + sqshrun2 v1.16b, v16.8h, #COL_SHIFT-16 + sqshrun v3.8b, v17.8h, #COL_SHIFT-16 + sqshrun2 v3.16b, v18.8h, #COL_SHIFT-16 bl idct_col4_neon2 - sqshrun v2.8B, v7.8H, #COL_SHIFT-16 - sqshrun2 v2.16B, v16.8H, #COL_SHIFT-16 - sqshrun v4.8B, v17.8H, #COL_SHIFT-16 - sqshrun2 v4.16B, v18.8H, #COL_SHIFT-16 + sqshrun v2.8b, v7.8h, #COL_SHIFT-16 + sqshrun2 v2.16b, v16.8h, #COL_SHIFT-16 + sqshrun v4.8b, v17.8h, #COL_SHIFT-16 + sqshrun2 v4.16b, v18.8h, #COL_SHIFT-16 - zip1 v16.4S, v1.4S, v2.4S - zip2 v17.4S, v1.4S, v2.4S + zip1 v16.4s, v1.4s, v2.4s + zip2 v17.4s, v1.4s, v2.4s - st1 {v16.D}[0], [x0], x1 - st1 {v16.D}[1], [x0], x1 + st1 {v16.d}[0], [x0], x1 + st1 {v16.d}[1], [x0], x1 - zip1 v18.4S, v3.4S, v4.4S - zip2 v19.4S, v3.4S, v4.4S + zip1 v18.4s, v3.4s, v4.4s + zip2 v19.4s, v3.4s, v4.4s - st1 {v17.D}[0], [x0], x1 - st1 {v17.D}[1], [x0], x1 - st1 {v18.D}[0], [x0], x1 - st1 {v18.D}[1], [x0], x1 - st1 {v19.D}[0], [x0], x1 - st1 {v19.D}[1], [x0], x1 + st1 {v17.d}[0], [x0], x1 + st1 {v17.d}[1], [x0], x1 + st1 {v18.d}[0], [x0], x1 + st1 {v18.d}[1], [x0], x1 + st1 {v19.d}[0], [x0], x1 + st1 {v19.d}[1], [x0], x1 idct_end endfunc @@ -267,59 +267,59 @@ function ff_simple_idct_add_neon, export=1 idct_row4_neon v28, v29, v30, v31, 2 bl idct_col4_neon1 - sshr v1.8H, v7.8H, #COL_SHIFT-16 - sshr v2.8H, v16.8H, #COL_SHIFT-16 - sshr v3.8H, v17.8H, #COL_SHIFT-16 - sshr v4.8H, v18.8H, #COL_SHIFT-16 + sshr v1.8h, v7.8h, #COL_SHIFT-16 + sshr v2.8h, v16.8h, #COL_SHIFT-16 + sshr v3.8h, v17.8h, #COL_SHIFT-16 + sshr v4.8h, v18.8h, #COL_SHIFT-16 bl idct_col4_neon2 - sshr v7.8H, v7.8H, #COL_SHIFT-16 - sshr v16.8H, v16.8H, #COL_SHIFT-16 - sshr v17.8H, v17.8H, #COL_SHIFT-16 - sshr v18.8H, v18.8H, #COL_SHIFT-16 + sshr v7.8h, v7.8h, #COL_SHIFT-16 + sshr v16.8h, v16.8h, #COL_SHIFT-16 + sshr v17.8h, v17.8h, #COL_SHIFT-16 + sshr v18.8h, v18.8h, #COL_SHIFT-16 mov x9, x0 - ld1 {v19.D}[0], [x0], x1 - zip1 v23.2D, v1.2D, v7.2D - zip2 v24.2D, v1.2D, v7.2D - ld1 {v19.D}[1], [x0], x1 - zip1 v25.2D, v2.2D, v16.2D - zip2 v26.2D, v2.2D, v16.2D - ld1 {v20.D}[0], [x0], x1 - zip1 v27.2D, v3.2D, v17.2D - zip2 v28.2D, v3.2D, v17.2D - ld1 {v20.D}[1], [x0], x1 - zip1 v29.2D, v4.2D, v18.2D - zip2 v30.2D, v4.2D, v18.2D - ld1 {v21.D}[0], [x0], x1 - uaddw v23.8H, v23.8H, v19.8B - uaddw2 v24.8H, v24.8H, v19.16B - ld1 {v21.D}[1], [x0], x1 - sqxtun v23.8B, v23.8H - sqxtun2 v23.16B, v24.8H - ld1 {v22.D}[0], [x0], x1 - uaddw v24.8H, v25.8H, v20.8B - uaddw2 v25.8H, v26.8H, v20.16B - ld1 {v22.D}[1], [x0], x1 - sqxtun v24.8B, v24.8H - sqxtun2 v24.16B, v25.8H - st1 {v23.D}[0], [x9], x1 - uaddw v25.8H, v27.8H, v21.8B - uaddw2 v26.8H, v28.8H, v21.16B - 
st1 {v23.D}[1], [x9], x1 - sqxtun v25.8B, v25.8H - sqxtun2 v25.16B, v26.8H - st1 {v24.D}[0], [x9], x1 - uaddw v26.8H, v29.8H, v22.8B - uaddw2 v27.8H, v30.8H, v22.16B - st1 {v24.D}[1], [x9], x1 - sqxtun v26.8B, v26.8H - sqxtun2 v26.16B, v27.8H - st1 {v25.D}[0], [x9], x1 - st1 {v25.D}[1], [x9], x1 - st1 {v26.D}[0], [x9], x1 - st1 {v26.D}[1], [x9], x1 + ld1 {v19.d}[0], [x0], x1 + zip1 v23.2d, v1.2d, v7.2d + zip2 v24.2d, v1.2d, v7.2d + ld1 {v19.d}[1], [x0], x1 + zip1 v25.2d, v2.2d, v16.2d + zip2 v26.2d, v2.2d, v16.2d + ld1 {v20.d}[0], [x0], x1 + zip1 v27.2d, v3.2d, v17.2d + zip2 v28.2d, v3.2d, v17.2d + ld1 {v20.d}[1], [x0], x1 + zip1 v29.2d, v4.2d, v18.2d + zip2 v30.2d, v4.2d, v18.2d + ld1 {v21.d}[0], [x0], x1 + uaddw v23.8h, v23.8h, v19.8b + uaddw2 v24.8h, v24.8h, v19.16b + ld1 {v21.d}[1], [x0], x1 + sqxtun v23.8b, v23.8h + sqxtun2 v23.16b, v24.8h + ld1 {v22.d}[0], [x0], x1 + uaddw v24.8h, v25.8h, v20.8b + uaddw2 v25.8h, v26.8h, v20.16b + ld1 {v22.d}[1], [x0], x1 + sqxtun v24.8b, v24.8h + sqxtun2 v24.16b, v25.8h + st1 {v23.d}[0], [x9], x1 + uaddw v25.8h, v27.8h, v21.8b + uaddw2 v26.8h, v28.8h, v21.16b + st1 {v23.d}[1], [x9], x1 + sqxtun v25.8b, v25.8h + sqxtun2 v25.16b, v26.8h + st1 {v24.d}[0], [x9], x1 + uaddw v26.8h, v29.8h, v22.8b + uaddw2 v27.8h, v30.8h, v22.16b + st1 {v24.d}[1], [x9], x1 + sqxtun v26.8b, v26.8h + sqxtun2 v26.16b, v27.8h + st1 {v25.d}[0], [x9], x1 + st1 {v25.d}[1], [x9], x1 + st1 {v26.d}[0], [x9], x1 + st1 {v26.d}[1], [x9], x1 idct_end endfunc @@ -333,30 +333,30 @@ function ff_simple_idct_neon, export=1 sub x2, x2, #128 bl idct_col4_neon1 - sshr v1.8H, v7.8H, #COL_SHIFT-16 - sshr v2.8H, v16.8H, #COL_SHIFT-16 - sshr v3.8H, v17.8H, #COL_SHIFT-16 - sshr v4.8H, v18.8H, #COL_SHIFT-16 + sshr v1.8h, v7.8h, #COL_SHIFT-16 + sshr v2.8h, v16.8h, #COL_SHIFT-16 + sshr v3.8h, v17.8h, #COL_SHIFT-16 + sshr v4.8h, v18.8h, #COL_SHIFT-16 bl idct_col4_neon2 - sshr v7.8H, v7.8H, #COL_SHIFT-16 - sshr v16.8H, v16.8H, #COL_SHIFT-16 - sshr v17.8H, v17.8H, #COL_SHIFT-16 - sshr v18.8H, v18.8H, #COL_SHIFT-16 - - zip1 v23.2D, v1.2D, v7.2D - zip2 v24.2D, v1.2D, v7.2D - st1 {v23.2D,v24.2D}, [x2], #32 - zip1 v25.2D, v2.2D, v16.2D - zip2 v26.2D, v2.2D, v16.2D - st1 {v25.2D,v26.2D}, [x2], #32 - zip1 v27.2D, v3.2D, v17.2D - zip2 v28.2D, v3.2D, v17.2D - st1 {v27.2D,v28.2D}, [x2], #32 - zip1 v29.2D, v4.2D, v18.2D - zip2 v30.2D, v4.2D, v18.2D - st1 {v29.2D,v30.2D}, [x2], #32 + sshr v7.8h, v7.8h, #COL_SHIFT-16 + sshr v16.8h, v16.8h, #COL_SHIFT-16 + sshr v17.8h, v17.8h, #COL_SHIFT-16 + sshr v18.8h, v18.8h, #COL_SHIFT-16 + + zip1 v23.2d, v1.2d, v7.2d + zip2 v24.2d, v1.2d, v7.2d + st1 {v23.2d,v24.2d}, [x2], #32 + zip1 v25.2d, v2.2d, v16.2d + zip2 v26.2d, v2.2d, v16.2d + st1 {v25.2d,v26.2d}, [x2], #32 + zip1 v27.2d, v3.2d, v17.2d + zip2 v28.2d, v3.2d, v17.2d + st1 {v27.2d,v28.2d}, [x2], #32 + zip1 v29.2d, v4.2d, v18.2d + zip2 v30.2d, v4.2d, v18.2d + st1 {v29.2d,v30.2d}, [x2], #32 idct_end endfunc diff --git a/libavfilter/aarch64/vf_nlmeans_neon.S b/libavfilter/aarch64/vf_nlmeans_neon.S index e69b0dd923..26d6958b82 100644 --- a/libavfilter/aarch64/vf_nlmeans_neon.S +++ b/libavfilter/aarch64/vf_nlmeans_neon.S @@ -22,19 +22,19 @@ // acc_sum_store(ABCD) = {X+A, X+A+B, X+A+B+C, X+A+B+C+D} .macro acc_sum_store x, xb - dup v24.4S, v24.S[3] // ...X -> XXXX - ext v25.16B, v26.16B, \xb, #12 // ext(0000,ABCD,12)=0ABC - add v24.4S, v24.4S, \x // XXXX+ABCD={X+A,X+B,X+C,X+D} - add v24.4S, v24.4S, v25.4S // {X+A,X+B+A,X+C+B,X+D+C} (+0ABC) - ext v25.16B, v26.16B, v25.16B, #12 // ext(0000,0ABC,12)=00AB - add v24.4S, v24.4S, v25.4S 
// {X+A,X+B+A,X+C+B+A,X+D+C+B} (+00AB) - ext v25.16B, v26.16B, v25.16B, #12 // ext(0000,00AB,12)=000A - add v24.4S, v24.4S, v25.4S // {X+A,X+B+A,X+C+B+A,X+D+C+B+A} (+000A) - st1 {v24.4S}, [x0], #16 // write 4x32-bit final values + dup v24.4s, v24.s[3] // ...X -> XXXX + ext v25.16b, v26.16b, \xb, #12 // ext(0000,ABCD,12)=0ABC + add v24.4s, v24.4s, \x // XXXX+ABCD={X+A,X+B,X+C,X+D} + add v24.4s, v24.4s, v25.4s // {X+A,X+B+A,X+C+B,X+D+C} (+0ABC) + ext v25.16b, v26.16b, v25.16b, #12 // ext(0000,0ABC,12)=00AB + add v24.4s, v24.4s, v25.4s // {X+A,X+B+A,X+C+B+A,X+D+C+B} (+00AB) + ext v25.16b, v26.16b, v25.16b, #12 // ext(0000,00AB,12)=000A + add v24.4s, v24.4s, v25.4s // {X+A,X+B+A,X+C+B+A,X+D+C+B+A} (+000A) + st1 {v24.4s}, [x0], #16 // write 4x32-bit final values .endm function ff_compute_safe_ssd_integral_image_neon, export=1 - movi v26.4S, #0 // used as zero for the "rotations" in acc_sum_store + movi v26.4s, #0 // used as zero for the "rotations" in acc_sum_store sub x3, x3, w6, UXTW // s1 padding (s1_linesize - w) sub x5, x5, w6, UXTW // s2 padding (s2_linesize - w) sub x9, x0, w1, UXTW #2 // dst_top @@ -43,31 +43,31 @@ function ff_compute_safe_ssd_integral_image_neon, export=1 1: mov w10, w6 // width copy for each line sub x0, x0, #16 // beginning of the dst line minus 4 sums sub x8, x9, #4 // dst_top-1 - ld1 {v24.4S}, [x0], #16 // load ...X (contextual last sums) -2: ld1 {v0.16B}, [x2], #16 // s1[x + 0..15] - ld1 {v1.16B}, [x4], #16 // s2[x + 0..15] - ld1 {v16.4S,v17.4S}, [x8], #32 // dst_top[x + 0..7 - 1] - usubl v2.8H, v0.8B, v1.8B // d[x + 0..7] = s1[x + 0..7] - s2[x + 0..7] - usubl2 v3.8H, v0.16B, v1.16B // d[x + 8..15] = s1[x + 8..15] - s2[x + 8..15] - ld1 {v18.4S,v19.4S}, [x8], #32 // dst_top[x + 8..15 - 1] - smull v4.4S, v2.4H, v2.4H // d[x + 0..3]^2 - smull2 v5.4S, v2.8H, v2.8H // d[x + 4..7]^2 - ld1 {v20.4S,v21.4S}, [x9], #32 // dst_top[x + 0..7] - smull v6.4S, v3.4H, v3.4H // d[x + 8..11]^2 - smull2 v7.4S, v3.8H, v3.8H // d[x + 12..15]^2 - ld1 {v22.4S,v23.4S}, [x9], #32 // dst_top[x + 8..15] - sub v0.4S, v20.4S, v16.4S // dst_top[x + 0..3] - dst_top[x + 0..3 - 1] - sub v1.4S, v21.4S, v17.4S // dst_top[x + 4..7] - dst_top[x + 4..7 - 1] - add v0.4S, v0.4S, v4.4S // + d[x + 0..3]^2 - add v1.4S, v1.4S, v5.4S // + d[x + 4..7]^2 - sub v2.4S, v22.4S, v18.4S // dst_top[x + 8..11] - dst_top[x + 8..11 - 1] - sub v3.4S, v23.4S, v19.4S // dst_top[x + 12..15] - dst_top[x + 12..15 - 1] - add v2.4S, v2.4S, v6.4S // + d[x + 8..11]^2 - add v3.4S, v3.4S, v7.4S // + d[x + 12..15]^2 - acc_sum_store v0.4S, v0.16B // accumulate and store dst[ 0..3] - acc_sum_store v1.4S, v1.16B // accumulate and store dst[ 4..7] - acc_sum_store v2.4S, v2.16B // accumulate and store dst[ 8..11] - acc_sum_store v3.4S, v3.16B // accumulate and store dst[12..15] + ld1 {v24.4s}, [x0], #16 // load ...X (contextual last sums) +2: ld1 {v0.16b}, [x2], #16 // s1[x + 0..15] + ld1 {v1.16b}, [x4], #16 // s2[x + 0..15] + ld1 {v16.4s,v17.4s}, [x8], #32 // dst_top[x + 0..7 - 1] + usubl v2.8h, v0.8b, v1.8b // d[x + 0..7] = s1[x + 0..7] - s2[x + 0..7] + usubl2 v3.8h, v0.16b, v1.16b // d[x + 8..15] = s1[x + 8..15] - s2[x + 8..15] + ld1 {v18.4s,v19.4s}, [x8], #32 // dst_top[x + 8..15 - 1] + smull v4.4s, v2.4h, v2.4h // d[x + 0..3]^2 + smull2 v5.4s, v2.8h, v2.8h // d[x + 4..7]^2 + ld1 {v20.4s,v21.4s}, [x9], #32 // dst_top[x + 0..7] + smull v6.4s, v3.4h, v3.4h // d[x + 8..11]^2 + smull2 v7.4s, v3.8h, v3.8h // d[x + 12..15]^2 + ld1 {v22.4s,v23.4s}, [x9], #32 // dst_top[x + 8..15] + sub v0.4s, v20.4s, v16.4s // dst_top[x + 0..3] - dst_top[x 
+ 0..3 - 1] + sub v1.4s, v21.4s, v17.4s // dst_top[x + 4..7] - dst_top[x + 4..7 - 1] + add v0.4s, v0.4s, v4.4s // + d[x + 0..3]^2 + add v1.4s, v1.4s, v5.4s // + d[x + 4..7]^2 + sub v2.4s, v22.4s, v18.4s // dst_top[x + 8..11] - dst_top[x + 8..11 - 1] + sub v3.4s, v23.4s, v19.4s // dst_top[x + 12..15] - dst_top[x + 12..15 - 1] + add v2.4s, v2.4s, v6.4s // + d[x + 8..11]^2 + add v3.4s, v3.4s, v7.4s // + d[x + 12..15]^2 + acc_sum_store v0.4s, v0.16b // accumulate and store dst[ 0..3] + acc_sum_store v1.4s, v1.16b // accumulate and store dst[ 4..7] + acc_sum_store v2.4s, v2.16b // accumulate and store dst[ 8..11] + acc_sum_store v3.4s, v3.16b // accumulate and store dst[12..15] subs w10, w10, #16 // width dec b.ne 2b // loop til next line add x2, x2, x3 // skip to next line (s1) diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S index 02d790c0cc..35e2715b87 100644 --- a/libavutil/aarch64/float_dsp_neon.S +++ b/libavutil/aarch64/float_dsp_neon.S @@ -25,16 +25,16 @@ function ff_vector_fmul_neon, export=1 1: subs w3, w3, #16 - ld1 {v0.4S, v1.4S}, [x1], #32 - ld1 {v2.4S, v3.4S}, [x1], #32 - ld1 {v4.4S, v5.4S}, [x2], #32 - ld1 {v6.4S, v7.4S}, [x2], #32 - fmul v16.4S, v0.4S, v4.4S - fmul v17.4S, v1.4S, v5.4S - fmul v18.4S, v2.4S, v6.4S - fmul v19.4S, v3.4S, v7.4S - st1 {v16.4S, v17.4S}, [x0], #32 - st1 {v18.4S, v19.4S}, [x0], #32 + ld1 {v0.4s, v1.4s}, [x1], #32 + ld1 {v2.4s, v3.4s}, [x1], #32 + ld1 {v4.4s, v5.4s}, [x2], #32 + ld1 {v6.4s, v7.4s}, [x2], #32 + fmul v16.4s, v0.4s, v4.4s + fmul v17.4s, v1.4s, v5.4s + fmul v18.4s, v2.4s, v6.4s + fmul v19.4s, v3.4s, v7.4s + st1 {v16.4s, v17.4s}, [x0], #32 + st1 {v18.4s, v19.4s}, [x0], #32 b.ne 1b ret endfunc @@ -42,16 +42,16 @@ endfunc function ff_vector_fmac_scalar_neon, export=1 mov x3, #-32 1: subs w2, w2, #16 - ld1 {v16.4S, v17.4S}, [x0], #32 - ld1 {v18.4S, v19.4S}, [x0], x3 - ld1 {v4.4S, v5.4S}, [x1], #32 - ld1 {v6.4S, v7.4S}, [x1], #32 - fmla v16.4S, v4.4S, v0.S[0] - fmla v17.4S, v5.4S, v0.S[0] - fmla v18.4S, v6.4S, v0.S[0] - fmla v19.4S, v7.4S, v0.S[0] - st1 {v16.4S, v17.4S}, [x0], #32 - st1 {v18.4S, v19.4S}, [x0], #32 + ld1 {v16.4s, v17.4s}, [x0], #32 + ld1 {v18.4s, v19.4s}, [x0], x3 + ld1 {v4.4s, v5.4s}, [x1], #32 + ld1 {v6.4s, v7.4s}, [x1], #32 + fmla v16.4s, v4.4s, v0.s[0] + fmla v17.4s, v5.4s, v0.s[0] + fmla v18.4s, v6.4s, v0.s[0] + fmla v19.4s, v7.4s, v0.s[0] + st1 {v16.4s, v17.4s}, [x0], #32 + st1 {v18.4s, v19.4s}, [x0], #32 b.ne 1b ret endfunc @@ -59,43 +59,43 @@ endfunc function ff_vector_fmul_scalar_neon, export=1 mov w4, #15 bics w3, w2, w4 - dup v16.4S, v0.S[0] + dup v16.4s, v0.s[0] b.eq 3f - ld1 {v0.4S, v1.4S}, [x1], #32 + ld1 {v0.4s, v1.4s}, [x1], #32 1: subs w3, w3, #16 - fmul v0.4S, v0.4S, v16.4S - ld1 {v2.4S, v3.4S}, [x1], #32 - fmul v1.4S, v1.4S, v16.4S - fmul v2.4S, v2.4S, v16.4S - st1 {v0.4S, v1.4S}, [x0], #32 - fmul v3.4S, v3.4S, v16.4S + fmul v0.4s, v0.4s, v16.4s + ld1 {v2.4s, v3.4s}, [x1], #32 + fmul v1.4s, v1.4s, v16.4s + fmul v2.4s, v2.4s, v16.4s + st1 {v0.4s, v1.4s}, [x0], #32 + fmul v3.4s, v3.4s, v16.4s b.eq 2f - ld1 {v0.4S, v1.4S}, [x1], #32 - st1 {v2.4S, v3.4S}, [x0], #32 + ld1 {v0.4s, v1.4s}, [x1], #32 + st1 {v2.4s, v3.4s}, [x0], #32 b 1b 2: ands w2, w2, #15 - st1 {v2.4S, v3.4S}, [x0], #32 + st1 {v2.4s, v3.4s}, [x0], #32 b.eq 4f -3: ld1 {v0.4S}, [x1], #16 - fmul v0.4S, v0.4S, v16.4S - st1 {v0.4S}, [x0], #16 +3: ld1 {v0.4s}, [x1], #16 + fmul v0.4s, v0.4s, v16.4s + st1 {v0.4s}, [x0], #16 subs w2, w2, #4 b.gt 3b 4: ret endfunc function ff_vector_dmul_scalar_neon, export=1 - dup v16.2D, 
v0.D[0] - ld1 {v0.2D, v1.2D}, [x1], #32 + dup v16.2d, v0.d[0] + ld1 {v0.2d, v1.2d}, [x1], #32 1: subs w2, w2, #8 - fmul v0.2D, v0.2D, v16.2D - ld1 {v2.2D, v3.2D}, [x1], #32 - fmul v1.2D, v1.2D, v16.2D - fmul v2.2D, v2.2D, v16.2D - st1 {v0.2D, v1.2D}, [x0], #32 - fmul v3.2D, v3.2D, v16.2D - ld1 {v0.2D, v1.2D}, [x1], #32 - st1 {v2.2D, v3.2D}, [x0], #32 + fmul v0.2d, v0.2d, v16.2d + ld1 {v2.2d, v3.2d}, [x1], #32 + fmul v1.2d, v1.2d, v16.2d + fmul v2.2d, v2.2d, v16.2d + st1 {v0.2d, v1.2d}, [x0], #32 + fmul v3.2d, v3.2d, v16.2d + ld1 {v0.2d, v1.2d}, [x1], #32 + st1 {v2.2d, v3.2d}, [x0], #32 b.gt 1b ret endfunc @@ -108,49 +108,49 @@ function ff_vector_fmul_window_neon, export=1 add x6, x3, x5, lsl #3 // win + 8 * (len - 2) add x5, x0, x5, lsl #3 // dst + 8 * (len - 2) mov x7, #-16 - ld1 {v0.4S}, [x1], #16 // s0 - ld1 {v2.4S}, [x3], #16 // wi - ld1 {v1.4S}, [x2], x7 // s1 -1: ld1 {v3.4S}, [x6], x7 // wj + ld1 {v0.4s}, [x1], #16 // s0 + ld1 {v2.4s}, [x3], #16 // wi + ld1 {v1.4s}, [x2], x7 // s1 +1: ld1 {v3.4s}, [x6], x7 // wj subs x4, x4, #4 - fmul v17.4S, v0.4S, v2.4S // s0 * wi - rev64 v4.4S, v1.4S - rev64 v5.4S, v3.4S - rev64 v17.4S, v17.4S - ext v4.16B, v4.16B, v4.16B, #8 // s1_r - ext v5.16B, v5.16B, v5.16B, #8 // wj_r - ext v17.16B, v17.16B, v17.16B, #8 // (s0 * wi)_rev - fmul v16.4S, v0.4S, v5.4S // s0 * wj_r - fmla v17.4S, v1.4S, v3.4S // (s0 * wi)_rev + s1 * wj + fmul v17.4s, v0.4s, v2.4s // s0 * wi + rev64 v4.4s, v1.4s + rev64 v5.4s, v3.4s + rev64 v17.4s, v17.4s + ext v4.16b, v4.16b, v4.16b, #8 // s1_r + ext v5.16b, v5.16b, v5.16b, #8 // wj_r + ext v17.16b, v17.16b, v17.16b, #8 // (s0 * wi)_rev + fmul v16.4s, v0.4s, v5.4s // s0 * wj_r + fmla v17.4s, v1.4s, v3.4s // (s0 * wi)_rev + s1 * wj b.eq 2f - ld1 {v0.4S}, [x1], #16 - fmls v16.4S, v4.4S, v2.4S // s0 * wj_r - s1_r * wi - st1 {v17.4S}, [x5], x7 - ld1 {v2.4S}, [x3], #16 - ld1 {v1.4S}, [x2], x7 - st1 {v16.4S}, [x0], #16 + ld1 {v0.4s}, [x1], #16 + fmls v16.4s, v4.4s, v2.4s // s0 * wj_r - s1_r * wi + st1 {v17.4s}, [x5], x7 + ld1 {v2.4s}, [x3], #16 + ld1 {v1.4s}, [x2], x7 + st1 {v16.4s}, [x0], #16 b 1b 2: - fmls v16.4S, v4.4S, v2.4S // s0 * wj_r - s1_r * wi - st1 {v17.4S}, [x5], x7 - st1 {v16.4S}, [x0], #16 + fmls v16.4s, v4.4s, v2.4s // s0 * wj_r - s1_r * wi + st1 {v17.4s}, [x5], x7 + st1 {v16.4s}, [x0], #16 ret endfunc function ff_vector_fmul_add_neon, export=1 - ld1 {v0.4S, v1.4S}, [x1], #32 - ld1 {v2.4S, v3.4S}, [x2], #32 - ld1 {v4.4S, v5.4S}, [x3], #32 + ld1 {v0.4s, v1.4s}, [x1], #32 + ld1 {v2.4s, v3.4s}, [x2], #32 + ld1 {v4.4s, v5.4s}, [x3], #32 1: subs w4, w4, #8 - fmla v4.4S, v0.4S, v2.4S - fmla v5.4S, v1.4S, v3.4S + fmla v4.4s, v0.4s, v2.4s + fmla v5.4s, v1.4s, v3.4s b.eq 2f - ld1 {v0.4S, v1.4S}, [x1], #32 - ld1 {v2.4S, v3.4S}, [x2], #32 - st1 {v4.4S, v5.4S}, [x0], #32 - ld1 {v4.4S, v5.4S}, [x3], #32 + ld1 {v0.4s, v1.4s}, [x1], #32 + ld1 {v2.4s, v3.4s}, [x2], #32 + st1 {v4.4s, v5.4s}, [x0], #32 + ld1 {v4.4s, v5.4s}, [x3], #32 b 1b -2: st1 {v4.4S, v5.4S}, [x0], #32 +2: st1 {v4.4s, v5.4s}, [x0], #32 ret endfunc @@ -159,44 +159,44 @@ function ff_vector_fmul_reverse_neon, export=1 add x2, x2, x3, lsl #2 sub x2, x2, #32 mov x4, #-32 - ld1 {v2.4S, v3.4S}, [x2], x4 - ld1 {v0.4S, v1.4S}, [x1], #32 + ld1 {v2.4s, v3.4s}, [x2], x4 + ld1 {v0.4s, v1.4s}, [x1], #32 1: subs x3, x3, #8 - rev64 v3.4S, v3.4S - rev64 v2.4S, v2.4S - ext v3.16B, v3.16B, v3.16B, #8 - ext v2.16B, v2.16B, v2.16B, #8 - fmul v16.4S, v0.4S, v3.4S - fmul v17.4S, v1.4S, v2.4S + rev64 v3.4s, v3.4s + rev64 v2.4s, v2.4s + ext v3.16b, v3.16b, v3.16b, #8 + ext v2.16b, v2.16b, 
v2.16b, #8 + fmul v16.4s, v0.4s, v3.4s + fmul v17.4s, v1.4s, v2.4s b.eq 2f - ld1 {v2.4S, v3.4S}, [x2], x4 - ld1 {v0.4S, v1.4S}, [x1], #32 - st1 {v16.4S, v17.4S}, [x0], #32 + ld1 {v2.4s, v3.4s}, [x2], x4 + ld1 {v0.4s, v1.4s}, [x1], #32 + st1 {v16.4s, v17.4s}, [x0], #32 b 1b -2: st1 {v16.4S, v17.4S}, [x0], #32 +2: st1 {v16.4s, v17.4s}, [x0], #32 ret endfunc function ff_butterflies_float_neon, export=1 -1: ld1 {v0.4S}, [x0] - ld1 {v1.4S}, [x1] +1: ld1 {v0.4s}, [x0] + ld1 {v1.4s}, [x1] subs w2, w2, #4 - fsub v2.4S, v0.4S, v1.4S - fadd v3.4S, v0.4S, v1.4S - st1 {v2.4S}, [x1], #16 - st1 {v3.4S}, [x0], #16 + fsub v2.4s, v0.4s, v1.4s + fadd v3.4s, v0.4s, v1.4s + st1 {v2.4s}, [x1], #16 + st1 {v3.4s}, [x0], #16 b.gt 1b ret endfunc function ff_scalarproduct_float_neon, export=1 - movi v2.4S, #0 -1: ld1 {v0.4S}, [x0], #16 - ld1 {v1.4S}, [x1], #16 + movi v2.4s, #0 +1: ld1 {v0.4s}, [x0], #16 + ld1 {v1.4s}, [x1], #16 subs w2, w2, #4 - fmla v2.4S, v0.4S, v1.4S + fmla v2.4s, v0.4s, v1.4s b.gt 1b - faddp v0.4S, v2.4S, v2.4S - faddp s0, v0.2S + faddp v0.4s, v2.4s, v2.4s + faddp s0, v0.2s ret endfunc diff --git a/libswresample/aarch64/resample.S b/libswresample/aarch64/resample.S index bbad619a81..114d1216fb 100644 --- a/libswresample/aarch64/resample.S +++ b/libswresample/aarch64/resample.S @@ -21,57 +21,57 @@ #include "libavutil/aarch64/asm.S" function ff_resample_common_apply_filter_x4_float_neon, export=1 - movi v0.4S, #0 // accumulator -1: ld1 {v1.4S}, [x1], #16 // src[0..3] - ld1 {v2.4S}, [x2], #16 // filter[0..3] - fmla v0.4S, v1.4S, v2.4S // accumulator += src[0..3] * filter[0..3] + movi v0.4s, #0 // accumulator +1: ld1 {v1.4s}, [x1], #16 // src[0..3] + ld1 {v2.4s}, [x2], #16 // filter[0..3] + fmla v0.4s, v1.4s, v2.4s // accumulator += src[0..3] * filter[0..3] subs w3, w3, #4 // filter_length -= 4 b.gt 1b // loop until filter_length - faddp v0.4S, v0.4S, v0.4S // pair adding of the 4x32-bit accumulated values - faddp v0.4S, v0.4S, v0.4S // pair adding of the 4x32-bit accumulated values - st1 {v0.S}[0], [x0], #4 // write accumulator + faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + st1 {v0.s}[0], [x0], #4 // write accumulator ret endfunc function ff_resample_common_apply_filter_x8_float_neon, export=1 - movi v0.4S, #0 // accumulator -1: ld1 {v1.4S}, [x1], #16 // src[0..3] - ld1 {v2.4S}, [x2], #16 // filter[0..3] - ld1 {v3.4S}, [x1], #16 // src[4..7] - ld1 {v4.4S}, [x2], #16 // filter[4..7] - fmla v0.4S, v1.4S, v2.4S // accumulator += src[0..3] * filter[0..3] - fmla v0.4S, v3.4S, v4.4S // accumulator += src[4..7] * filter[4..7] + movi v0.4s, #0 // accumulator +1: ld1 {v1.4s}, [x1], #16 // src[0..3] + ld1 {v2.4s}, [x2], #16 // filter[0..3] + ld1 {v3.4s}, [x1], #16 // src[4..7] + ld1 {v4.4s}, [x2], #16 // filter[4..7] + fmla v0.4s, v1.4s, v2.4s // accumulator += src[0..3] * filter[0..3] + fmla v0.4s, v3.4s, v4.4s // accumulator += src[4..7] * filter[4..7] subs w3, w3, #8 // filter_length -= 8 b.gt 1b // loop until filter_length - faddp v0.4S, v0.4S, v0.4S // pair adding of the 4x32-bit accumulated values - faddp v0.4S, v0.4S, v0.4S // pair adding of the 4x32-bit accumulated values - st1 {v0.S}[0], [x0], #4 // write accumulator + faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + st1 {v0.s}[0], [x0], #4 // write accumulator ret endfunc function ff_resample_common_apply_filter_x4_s16_neon, export=1 
- movi v0.4S, #0 // accumulator -1: ld1 {v1.4H}, [x1], #8 // src[0..3] - ld1 {v2.4H}, [x2], #8 // filter[0..3] - smlal v0.4S, v1.4H, v2.4H // accumulator += src[0..3] * filter[0..3] + movi v0.4s, #0 // accumulator +1: ld1 {v1.4h}, [x1], #8 // src[0..3] + ld1 {v2.4h}, [x2], #8 // filter[0..3] + smlal v0.4s, v1.4h, v2.4h // accumulator += src[0..3] * filter[0..3] subs w3, w3, #4 // filter_length -= 4 b.gt 1b // loop until filter_length - addp v0.4S, v0.4S, v0.4S // pair adding of the 4x32-bit accumulated values - addp v0.4S, v0.4S, v0.4S // pair adding of the 4x32-bit accumulated values - st1 {v0.S}[0], [x0], #4 // write accumulator + addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + st1 {v0.s}[0], [x0], #4 // write accumulator ret endfunc function ff_resample_common_apply_filter_x8_s16_neon, export=1 - movi v0.4S, #0 // accumulator -1: ld1 {v1.8H}, [x1], #16 // src[0..7] - ld1 {v2.8H}, [x2], #16 // filter[0..7] - smlal v0.4S, v1.4H, v2.4H // accumulator += src[0..3] * filter[0..3] - smlal2 v0.4S, v1.8H, v2.8H // accumulator += src[4..7] * filter[4..7] + movi v0.4s, #0 // accumulator +1: ld1 {v1.8h}, [x1], #16 // src[0..7] + ld1 {v2.8h}, [x2], #16 // filter[0..7] + smlal v0.4s, v1.4h, v2.4h // accumulator += src[0..3] * filter[0..3] + smlal2 v0.4s, v1.8h, v2.8h // accumulator += src[4..7] * filter[4..7] subs w3, w3, #8 // filter_length -= 8 b.gt 1b // loop until filter_length - addp v0.4S, v0.4S, v0.4S // pair adding of the 4x32-bit accumulated values - addp v0.4S, v0.4S, v0.4S // pair adding of the 4x32-bit accumulated values - st1 {v0.S}[0], [x0], #4 // write accumulator + addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + st1 {v0.s}[0], [x0], #4 // write accumulator ret endfunc diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S index 8d4dcb2541..f3c404eb5f 100644 --- a/libswscale/aarch64/hscale.S +++ b/libswscale/aarch64/hscale.S @@ -50,43 +50,43 @@ function ff_hscale8to15_X8_neon, export=1 add x12, x16, x7 // filter1 = filter0 + filterSize*2 add x13, x12, x7 // filter2 = filter1 + filterSize*2 add x4, x13, x7 // filter3 = filter2 + filterSize*2 - movi v0.2D, #0 // val sum part 1 (for dst[0]) - movi v1.2D, #0 // val sum part 2 (for dst[1]) - movi v2.2D, #0 // val sum part 3 (for dst[2]) - movi v3.2D, #0 // val sum part 4 (for dst[3]) + movi v0.2d, #0 // val sum part 1 (for dst[0]) + movi v1.2d, #0 // val sum part 2 (for dst[1]) + movi v2.2d, #0 // val sum part 3 (for dst[2]) + movi v3.2d, #0 // val sum part 4 (for dst[3]) add x17, x3, w8, UXTW // srcp + filterPos[0] add x8, x3, w0, UXTW // srcp + filterPos[1] add x0, x3, w11, UXTW // srcp + filterPos[2] add x11, x3, w9, UXTW // srcp + filterPos[3] mov w15, w6 // filterSize counter -2: ld1 {v4.8B}, [x17], #8 // srcp[filterPos[0] + {0..7}] - ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1 - ld1 {v6.8B}, [x8], #8 // srcp[filterPos[1] + {0..7}] - ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize - uxtl v4.8H, v4.8B // unpack part 1 to 16-bit - smlal v0.4S, v4.4H, v5.4H // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}] - smlal2 v0.4S, v4.8H, v5.8H // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}] - ld1 {v16.8B}, [x0], #8 // srcp[filterPos[2] + {0..7}] - ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize - uxtl v6.8H, v6.8B // unpack part 2 to 16-bit - 
smlal v1.4S, v6.4H, v7.4H // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] - uxtl v16.8H, v16.8B // unpack part 3 to 16-bit - smlal v2.4S, v16.4H, v17.4H // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] - smlal2 v2.4S, v16.8H, v17.8H // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] - ld1 {v18.8B}, [x11], #8 // srcp[filterPos[3] + {0..7}] - smlal2 v1.4S, v6.8H, v7.8H // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] - ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize +2: ld1 {v4.8b}, [x17], #8 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 + ld1 {v6.8b}, [x8], #8 // srcp[filterPos[1] + {0..7}] + ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize + uxtl v4.8h, v4.8b // unpack part 1 to 16-bit + smlal v0.4s, v4.4h, v5.4h // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}] + smlal2 v0.4s, v4.8h, v5.8h // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}] + ld1 {v16.8b}, [x0], #8 // srcp[filterPos[2] + {0..7}] + ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v6.8h, v6.8b // unpack part 2 to 16-bit + smlal v1.4s, v6.4h, v7.4h // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + uxtl v16.8h, v16.8b // unpack part 3 to 16-bit + smlal v2.4s, v16.4h, v17.4h // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + smlal2 v2.4s, v16.8h, v17.8h // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + ld1 {v18.8b}, [x11], #8 // srcp[filterPos[3] + {0..7}] + smlal2 v1.4s, v6.8h, v7.8h // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize subs w15, w15, #8 // j -= 8: processed 8/filterSize - uxtl v18.8H, v18.8B // unpack part 4 to 16-bit - smlal v3.4S, v18.4H, v19.4H // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] - smlal2 v3.4S, v18.8H, v19.8H // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + uxtl v18.8h, v18.8b // unpack part 4 to 16-bit + smlal v3.4s, v18.4h, v19.4h // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] + smlal2 v3.4s, v18.8h, v19.8h // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] b.gt 2b // inner loop if filterSize not consumed completely - addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding - addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding - addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding + addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding + addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding + addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding subs w2, w2, #4 // dstW -= 4 - sqshrn v0.4H, v0.4S, #7 // shift and clip the 2x16-bit final values - st1 {v0.4H}, [x1], #8 // write to destination part0123 + sqshrn v0.4h, v0.4s, #7 // shift and clip the 2x16-bit final values + st1 {v0.4h}, [x1], #8 // write to destination part0123 b.gt 1b // loop until end of line ret endfunc @@ -245,7 +245,7 @@ function ff_hscale8to15_4_neon, export=1 stp w14, w15, [sp, #24] // *scratch_mem = { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] } 1: - ld4 {v16.8B, v17.8B, v18.8B, v19.8B}, [sp] // transpose 8 bytes each from src into 4 registers + ld4 {v16.8b, v17.8b, v18.8b, v19.8b}, [sp] // transpose 8 bytes each from src into 4 registers // load 8 values from filterPos to be used as offsets into src ldp w8, w9, [x5] // filterPos[idx + 0][0..3], [idx + 1][0..3], next iteration @@ -253,74 +253,74 @@ function 
ff_hscale8to15_4_neon, export=1 ldp w12, w13, [x5, #16] // filterPos[idx + 4][0..3], [idx + 5][0..3], next iteration ldp w14, w15, [x5, #24] // filterPos[idx + 6][0..3], [idx + 7][0..3], next iteration - movi v0.2D, #0 // Clear madd accumulator for idx 0..3 - movi v5.2D, #0 // Clear madd accumulator for idx 4..7 + movi v0.2d, #0 // Clear madd accumulator for idx 0..3 + movi v5.2d, #0 // Clear madd accumulator for idx 4..7 - ld4 {v1.8H, v2.8H, v3.8H, v4.8H}, [x4], #64 // load filter idx + 0..7 + ld4 {v1.8h, v2.8h, v3.8h, v4.8h}, [x4], #64 // load filter idx + 0..7 add x5, x5, #32 // advance filterPos // interleaved SIMD and prefetching intended to keep ld/st and vector pipelines busy - uxtl v16.8H, v16.8B // unsigned extend long, covert src data to 16-bit - uxtl v17.8H, v17.8B // unsigned extend long, covert src data to 16-bit + uxtl v16.8h, v16.8b // unsigned extend long, covert src data to 16-bit + uxtl v17.8h, v17.8b // unsigned extend long, covert src data to 16-bit ldr w8, [x3, w8, UXTW] // src[filterPos[idx + 0]], next iteration ldr w9, [x3, w9, UXTW] // src[filterPos[idx + 1]], next iteration - uxtl v18.8H, v18.8B // unsigned extend long, covert src data to 16-bit - uxtl v19.8H, v19.8B // unsigned extend long, covert src data to 16-bit + uxtl v18.8h, v18.8b // unsigned extend long, covert src data to 16-bit + uxtl v19.8h, v19.8b // unsigned extend long, covert src data to 16-bit ldr w10, [x3, w10, UXTW] // src[filterPos[idx + 2]], next iteration ldr w11, [x3, w11, UXTW] // src[filterPos[idx + 3]], next iteration - smlal v0.4S, v1.4H, v16.4H // multiply accumulate inner loop j = 0, idx = 0..3 - smlal v0.4S, v2.4H, v17.4H // multiply accumulate inner loop j = 1, idx = 0..3 + smlal v0.4s, v1.4h, v16.4h // multiply accumulate inner loop j = 0, idx = 0..3 + smlal v0.4s, v2.4h, v17.4h // multiply accumulate inner loop j = 1, idx = 0..3 ldr w12, [x3, w12, UXTW] // src[filterPos[idx + 4]], next iteration ldr w13, [x3, w13, UXTW] // src[filterPos[idx + 5]], next iteration - smlal v0.4S, v3.4H, v18.4H // multiply accumulate inner loop j = 2, idx = 0..3 - smlal v0.4S, v4.4H, v19.4H // multiply accumulate inner loop j = 3, idx = 0..3 + smlal v0.4s, v3.4h, v18.4h // multiply accumulate inner loop j = 2, idx = 0..3 + smlal v0.4s, v4.4h, v19.4h // multiply accumulate inner loop j = 3, idx = 0..3 ldr w14, [x3, w14, UXTW] // src[filterPos[idx + 6]], next iteration ldr w15, [x3, w15, UXTW] // src[filterPos[idx + 7]], next iteration - smlal2 v5.4S, v1.8H, v16.8H // multiply accumulate inner loop j = 0, idx = 4..7 - smlal2 v5.4S, v2.8H, v17.8H // multiply accumulate inner loop j = 1, idx = 4..7 + smlal2 v5.4s, v1.8h, v16.8h // multiply accumulate inner loop j = 0, idx = 4..7 + smlal2 v5.4s, v2.8h, v17.8h // multiply accumulate inner loop j = 1, idx = 4..7 stp w8, w9, [sp] // *scratch_mem = { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] } stp w10, w11, [sp, #8] // *scratch_mem = { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] } - smlal2 v5.4S, v3.8H, v18.8H // multiply accumulate inner loop j = 2, idx = 4..7 - smlal2 v5.4S, v4.8H, v19.8H // multiply accumulate inner loop j = 3, idx = 4..7 + smlal2 v5.4s, v3.8h, v18.8h // multiply accumulate inner loop j = 2, idx = 4..7 + smlal2 v5.4s, v4.8h, v19.8h // multiply accumulate inner loop j = 3, idx = 4..7 stp w12, w13, [sp, #16] // *scratch_mem = { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] } stp w14, w15, [sp, #24] // *scratch_mem = { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] } sub w2, w2, #8 
// dstW -= 8 - sqshrn v0.4H, v0.4S, #7 // shift and clip the 2x16-bit final values - sqshrn v1.4H, v5.4S, #7 // shift and clip the 2x16-bit final values - st1 {v0.4H, v1.4H}, [x1], #16 // write to dst[idx + 0..7] + sqshrn v0.4h, v0.4s, #7 // shift and clip the 2x16-bit final values + sqshrn v1.4h, v5.4s, #7 // shift and clip the 2x16-bit final values + st1 {v0.4h, v1.4h}, [x1], #16 // write to dst[idx + 0..7] cmp w2, #16 // continue on main loop if there are at least 16 iterations left b.ge 1b // last full iteration - ld4 {v16.8B, v17.8B, v18.8B, v19.8B}, [sp] - ld4 {v1.8H, v2.8H, v3.8H, v4.8H}, [x4], #64 // load filter idx + 0..7 + ld4 {v16.8b, v17.8b, v18.8b, v19.8b}, [sp] + ld4 {v1.8h, v2.8h, v3.8h, v4.8h}, [x4], #64 // load filter idx + 0..7 - movi v0.2D, #0 // Clear madd accumulator for idx 0..3 - movi v5.2D, #0 // Clear madd accumulator for idx 4..7 + movi v0.2d, #0 // Clear madd accumulator for idx 0..3 + movi v5.2d, #0 // Clear madd accumulator for idx 4..7 - uxtl v16.8H, v16.8B // unsigned extend long, covert src data to 16-bit - uxtl v17.8H, v17.8B // unsigned extend long, covert src data to 16-bit - uxtl v18.8H, v18.8B // unsigned extend long, covert src data to 16-bit - uxtl v19.8H, v19.8B // unsigned extend long, covert src data to 16-bit + uxtl v16.8h, v16.8b // unsigned extend long, covert src data to 16-bit + uxtl v17.8h, v17.8b // unsigned extend long, covert src data to 16-bit + uxtl v18.8h, v18.8b // unsigned extend long, covert src data to 16-bit + uxtl v19.8h, v19.8b // unsigned extend long, covert src data to 16-bit - smlal v0.4S, v1.4H, v16.4H // multiply accumulate inner loop j = 0, idx = 0..3 - smlal v0.4S, v2.4H, v17.4H // multiply accumulate inner loop j = 1, idx = 0..3 - smlal v0.4S, v3.4H, v18.4H // multiply accumulate inner loop j = 2, idx = 0..3 - smlal v0.4S, v4.4H, v19.4H // multiply accumulate inner loop j = 3, idx = 0..3 + smlal v0.4s, v1.4h, v16.4h // multiply accumulate inner loop j = 0, idx = 0..3 + smlal v0.4s, v2.4h, v17.4h // multiply accumulate inner loop j = 1, idx = 0..3 + smlal v0.4s, v3.4h, v18.4h // multiply accumulate inner loop j = 2, idx = 0..3 + smlal v0.4s, v4.4h, v19.4h // multiply accumulate inner loop j = 3, idx = 0..3 - smlal2 v5.4S, v1.8H, v16.8H // multiply accumulate inner loop j = 0, idx = 4..7 - smlal2 v5.4S, v2.8H, v17.8H // multiply accumulate inner loop j = 1, idx = 4..7 - smlal2 v5.4S, v3.8H, v18.8H // multiply accumulate inner loop j = 2, idx = 4..7 - smlal2 v5.4S, v4.8H, v19.8H // multiply accumulate inner loop j = 3, idx = 4..7 + smlal2 v5.4s, v1.8h, v16.8h // multiply accumulate inner loop j = 0, idx = 4..7 + smlal2 v5.4s, v2.8h, v17.8h // multiply accumulate inner loop j = 1, idx = 4..7 + smlal2 v5.4s, v3.8h, v18.8h // multiply accumulate inner loop j = 2, idx = 4..7 + smlal2 v5.4s, v4.8h, v19.8h // multiply accumulate inner loop j = 3, idx = 4..7 subs w2, w2, #8 // dstW -= 8 - sqshrn v0.4H, v0.4S, #7 // shift and clip the 2x16-bit final values - sqshrn v1.4H, v5.4S, #7 // shift and clip the 2x16-bit final values - st1 {v0.4H, v1.4H}, [x1], #16 // write to dst[idx + 0..7] + sqshrn v0.4h, v0.4s, #7 // shift and clip the 2x16-bit final values + sqshrn v1.4h, v5.4s, #7 // shift and clip the 2x16-bit final values + st1 {v0.4h, v1.4h}, [x1], #16 // write to dst[idx + 0..7] cbnz w2, 2f // if >0 iterations remain, jump to the wrap up section @@ -332,15 +332,15 @@ function ff_hscale8to15_4_neon, export=1 // load src ldr w8, [x5], #4 // filterPos[i] add x9, x3, w8, UXTW // calculate the address for src load - ld1 {v5.S}[0], [x9] 
// src[filterPos[i] + 0..3] + ld1 {v5.s}[0], [x9] // src[filterPos[i] + 0..3] // load filter - ld1 {v6.4H}, [x4], #8 // filter[filterSize * i + 0..3] + ld1 {v6.4h}, [x4], #8 // filter[filterSize * i + 0..3] - uxtl v5.8H, v5.8B // unsigned exten long, convert src data to 16-bit - smull v0.4S, v5.4H, v6.4H // 4 iterations of src[...] * filter[...] - addv s0, v0.4S // add up products of src and filter values + uxtl v5.8h, v5.8b // unsigned exten long, convert src data to 16-bit + smull v0.4s, v5.4h, v6.4h // 4 iterations of src[...] * filter[...] + addv s0, v0.4s // add up products of src and filter values sqshrn h0, s0, #7 // shift and clip the 2x16-bit final value - st1 {v0.H}[0], [x1], #2 // dst[i] = ... + st1 {v0.h}[0], [x1], #2 // dst[i] = ... sub w2, w2, #1 // dstW-- cbnz w2, 2b @@ -445,12 +445,12 @@ function ff_hscale8to19_4_neon, export=1 smull v5.4s, v0.4h, v28.4h smull2 v6.4s, v0.8h, v28.8h uxtl v2.8h, v2.8b - smlal v5.4s, v1.4h, v29.4H - smlal2 v6.4s, v1.8h, v29.8H + smlal v5.4s, v1.4h, v29.4h + smlal2 v6.4s, v1.8h, v29.8h uxtl v3.8h, v3.8b - smlal v5.4s, v2.4h, v30.4H - smlal2 v6.4s, v2.8h, v30.8H - smlal v5.4s, v3.4h, v31.4H + smlal v5.4s, v2.4h, v30.4h + smlal2 v6.4s, v2.8h, v30.8h + smlal v5.4s, v3.4h, v31.4h smlal2 v6.4s, v3.8h, v31.8h sshr v5.4s, v5.4s, #3 @@ -472,8 +472,8 @@ function ff_hscale8to19_4_neon, export=1 ld1 {v0.s}[0], [x9] // load 4 * uint8_t* into one single ld1 {v31.4h}, [x4], #8 uxtl v0.8h, v0.8b - smull v5.4s, v0.4h, v31.4H - saddlv d0, v5.4S + smull v5.4s, v0.4h, v31.4h + saddlv d0, v5.4s sqshrn s0, d0, #3 smin v0.4s, v0.4s, v18.4s st1 {v0.s}[0], [x1], #4 @@ -499,42 +499,42 @@ function ff_hscale8to19_X8_neon, export=1 ldr w11, [x5], #4 // filterPos[idx + 2] add x4, x13, x7 // filter3 = filter2 + filterSize*2 ldr w9, [x5], #4 // filterPos[idx + 3] - movi v0.2D, #0 // val sum part 1 (for dst[0]) - movi v1.2D, #0 // val sum part 2 (for dst[1]) - movi v2.2D, #0 // val sum part 3 (for dst[2]) - movi v3.2D, #0 // val sum part 4 (for dst[3]) + movi v0.2d, #0 // val sum part 1 (for dst[0]) + movi v1.2d, #0 // val sum part 2 (for dst[1]) + movi v2.2d, #0 // val sum part 3 (for dst[2]) + movi v3.2d, #0 // val sum part 4 (for dst[3]) add x17, x3, w8, UXTW // srcp + filterPos[0] add x8, x3, w0, UXTW // srcp + filterPos[1] add x0, x3, w11, UXTW // srcp + filterPos[2] add x11, x3, w9, UXTW // srcp + filterPos[3] mov w15, w6 // filterSize counter -2: ld1 {v4.8B}, [x17], #8 // srcp[filterPos[0] + {0..7}] - ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1 - uxtl v4.8H, v4.8B // unpack part 1 to 16-bit - smlal v0.4S, v4.4H, v5.4H // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}] - ld1 {v6.8B}, [x8], #8 // srcp[filterPos[1] + {0..7}] - smlal2 v0.4S, v4.8H, v5.8H // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}] - ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize - ld1 {v16.8B}, [x0], #8 // srcp[filterPos[2] + {0..7}] - uxtl v6.8H, v6.8B // unpack part 2 to 16-bit - ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize - uxtl v16.8H, v16.8B // unpack part 3 to 16-bit - smlal v1.4S, v6.4H, v7.4H // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] - ld1 {v18.8B}, [x11], #8 // srcp[filterPos[3] + {0..7}] - smlal v2.4S, v16.4H, v17.4H // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] - ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize - smlal2 v2.4S, v16.8H, v17.8H // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] - uxtl v18.8H, v18.8B // unpack part 4 to 16-bit - 
smlal2 v1.4S, v6.8H, v7.8H // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] - smlal v3.4S, v18.4H, v19.4H // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] +2: ld1 {v4.8b}, [x17], #8 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 + uxtl v4.8h, v4.8b // unpack part 1 to 16-bit + smlal v0.4s, v4.4h, v5.4h // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}] + ld1 {v6.8b}, [x8], #8 // srcp[filterPos[1] + {0..7}] + smlal2 v0.4s, v4.8h, v5.8h // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}] + ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize + ld1 {v16.8b}, [x0], #8 // srcp[filterPos[2] + {0..7}] + uxtl v6.8h, v6.8b // unpack part 2 to 16-bit + ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v16.8h, v16.8b // unpack part 3 to 16-bit + smlal v1.4s, v6.4h, v7.4h // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + ld1 {v18.8b}, [x11], #8 // srcp[filterPos[3] + {0..7}] + smlal v2.4s, v16.4h, v17.4h // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize + smlal2 v2.4s, v16.8h, v17.8h // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + uxtl v18.8h, v18.8b // unpack part 4 to 16-bit + smlal2 v1.4s, v6.8h, v7.8h // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + smlal v3.4s, v18.4h, v19.4h // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] subs w15, w15, #8 // j -= 8: processed 8/filterSize - smlal2 v3.4S, v18.8H, v19.8H // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + smlal2 v3.4s, v18.8h, v19.8h // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] b.gt 2b // inner loop if filterSize not consumed completely - addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding - addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding - addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding + addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding + addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding + addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding subs w2, w2, #4 // dstW -= 4 - sshr v0.4s, v0.4S, #3 // shift and clip the 2x16-bit final values + sshr v0.4s, v0.4s, #3 // shift and clip the 2x16-bit final values smin v0.4s, v0.4s, v20.4s st1 {v0.4s}, [x1], #16 // write to destination part0123 b.gt 1b // loop until end of line @@ -588,16 +588,16 @@ function ff_hscale8to19_X4_neon, export=1 smlal2 v16.4s, v4.8h, v31.8h // multiplication of upper half for idx 0 ldr d6, [x10], #8 // load src values for idx 2 ldr q29, [x14, x16] // load filter values for idx 2 - smlal v17.4s, v5.4h, v30.4H // multiplication of lower half for idx 1 + smlal v17.4s, v5.4h, v30.4h // multiplication of lower half for idx 1 ldr d7, [x11], #8 // load src values for idx 3 - smlal2 v17.4s, v5.8h, v30.8H // multiplication of upper half for idx 1 - uxtl v6.8h, v6.8B // extend tpye to matchi the filter's size + smlal2 v17.4s, v5.8h, v30.8h // multiplication of upper half for idx 1 + uxtl v6.8h, v6.8b // extend tpye to matchi the filter's size ldr q28, [x15, x16] // load filter values for idx 3 smlal v18.4s, v6.4h, v29.4h // multiplication of lower half for idx 2 - uxtl v7.8h, v7.8B - smlal2 v18.4s, v6.8h, v29.8H // multiplication of upper half for idx 2 + uxtl v7.8h, v7.8b + smlal2 v18.4s, v6.8h, v29.8h // multiplication of upper half for idx 2 sub w0, w0, #8 - smlal v19.4s, v7.4h, v28.4H // multiplication of lower half for idx 3 + smlal 
v19.4s, v7.4h, v28.4h // multiplication of lower half for idx 3 cmp w0, #8 smlal2 v19.4s, v7.8h, v28.8h // multiplication of upper half for idx 3 add x16, x16, #16 // advance filter values indexing @@ -618,11 +618,11 @@ function ff_hscale8to19_X4_neon, export=1 uxtl v5.8h, v5.8b // extend type to match the filter' size ldr s6, [x10] // load src values for idx 2 smlal v17.4s, v5.4h, v30.4h - uxtl v6.8h, v6.8B // extend type to match the filter's size + uxtl v6.8h, v6.8b // extend type to match the filter's size ldr d29, [x14, x17] // load filter values for idx 2 ldr s7, [x11] // load src values for idx 3 addp v16.4s, v16.4s, v17.4s - uxtl v7.8h, v7.8B + uxtl v7.8h, v7.8b ldr d28, [x15, x17] // load filter values for idx 3 smlal v18.4s, v6.4h, v29.4h smlal v19.4s, v7.4h, v28.4h @@ -700,31 +700,31 @@ function ff_hscale16to15_4_neon_asm, export=1 // Extending to 32 bits is necessary, as unit16_t values can't // be represented as int16_t without type promotion. uxtl v26.4s, v0.4h - sxtl v27.4s, v28.4H + sxtl v27.4s, v28.4h uxtl2 v0.4s, v0.8h mul v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v28.8H + sxtl2 v28.4s, v28.8h uxtl v26.4s, v1.4h mul v6.4s, v0.4s, v28.4s - sxtl v27.4s, v29.4H + sxtl v27.4s, v29.4h uxtl2 v0.4s, v1.8h mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v29.8H + sxtl2 v28.4s, v29.8h uxtl v26.4s, v2.4h mla v6.4s, v28.4s, v0.4s - sxtl v27.4s, v30.4H + sxtl v27.4s, v30.4h uxtl2 v0.4s, v2.8h mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v30.8H + sxtl2 v28.4s, v30.8h uxtl v26.4s, v3.4h mla v6.4s, v28.4s, v0.4s - sxtl v27.4s, v31.4H + sxtl v27.4s, v31.4h uxtl2 v0.4s, v3.8h mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v31.8H + sxtl2 v28.4s, v31.8h sub w2, w2, #8 mla v6.4s, v28.4s, v0.4s @@ -775,31 +775,31 @@ function ff_hscale16to15_4_neon_asm, export=1 ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 uxtl v26.4s, v0.4h - sxtl v27.4s, v28.4H + sxtl v27.4s, v28.4h uxtl2 v0.4s, v0.8h mul v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v28.8H + sxtl2 v28.4s, v28.8h uxtl v26.4s, v1.4h mul v6.4s, v0.4s, v28.4s - sxtl v27.4s, v29.4H + sxtl v27.4s, v29.4h uxtl2 v0.4s, v1.8h mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v29.8H + sxtl2 v28.4s, v29.8h uxtl v26.4s, v2.4h mla v6.4s, v0.4s, v28.4s - sxtl v27.4s, v30.4H + sxtl v27.4s, v30.4h uxtl2 v0.4s, v2.8h mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v30.8H + sxtl2 v28.4s, v30.8h uxtl v26.4s, v3.4h mla v6.4s, v0.4s, v28.4s - sxtl v27.4s, v31.4H + sxtl v27.4s, v31.4h uxtl2 v0.4s, v3.8h mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v31.8H + sxtl2 v28.4s, v31.8h subs w2, w2, #8 mla v6.4s, v0.4s, v28.4s @@ -807,7 +807,7 @@ function ff_hscale16to15_4_neon_asm, export=1 sshl v6.4s, v6.4s, v17.4s smin v5.4s, v5.4s, v18.4s smin v6.4s, v6.4s, v18.4s - xtn v5.4h, v5.4S + xtn v5.4h, v5.4s xtn2 v5.8h, v6.4s st1 {v5.8h}, [x1], #16 @@ -826,7 +826,7 @@ function ff_hscale16to15_4_neon_asm, export=1 uxtl v0.4s, v0.4h sxtl v31.4s, v31.4h mul v5.4s, v0.4s, v31.4s - addv s0, v5.4S + addv s0, v5.4s sshl v0.4s, v0.4s, v17.4s smin v0.4s, v0.4s, v18.4s st1 {v0.h}[0], [x1], #2 @@ -865,58 +865,58 @@ function ff_hscale16to15_X8_neon_asm, export=1 add x12, x16, x7 // filter1 = filter0 + filterSize*2 add x13, x12, x7 // filter2 = filter1 + filterSize*2 add x4, x13, x7 // filter3 = filter2 + filterSize*2 - movi v0.2D, #0 // val sum part 1 (for dst[0]) - movi v1.2D, #0 // val sum part 2 (for dst[1]) - movi v2.2D, #0 // val sum part 3 (for dst[2]) - movi v3.2D, #0 // val sum part 4 (for dst[3]) + movi v0.2d, #0 // val sum part 1 (for dst[0]) + movi v1.2d, #0 // val sum part 2 (for dst[1]) + movi v2.2d, #0 // val sum 
part 3 (for dst[2]) + movi v3.2d, #0 // val sum part 4 (for dst[3]) add x17, x3, w8, UXTW // srcp + filterPos[0] add x8, x3, w10, UXTW // srcp + filterPos[1] add x10, x3, w11, UXTW // srcp + filterPos[2] add x11, x3, w9, UXTW // srcp + filterPos[3] mov w15, w6 // filterSize counter -2: ld1 {v4.8H}, [x17], #16 // srcp[filterPos[0] + {0..7}] - ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1 - ld1 {v6.8H}, [x8], #16 // srcp[filterPos[1] + {0..7}] - ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize - uxtl v24.4s, v4.4H // extend srcp lower half to 32 bits to preserve sign - sxtl v25.4s, v5.4H // extend filter lower half to 32 bits to match srcp size +2: ld1 {v4.8h}, [x17], #16 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 + ld1 {v6.8h}, [x8], #16 // srcp[filterPos[1] + {0..7}] + ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize + uxtl v24.4s, v4.4h // extend srcp lower half to 32 bits to preserve sign + sxtl v25.4s, v5.4h // extend filter lower half to 32 bits to match srcp size uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5 sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits - mla v0.4S, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 - sxtl v27.4s, v7.4H // exted filter lower half - uxtl2 v6.4s, v6.8H // extend srcp upper half + mla v0.4s, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 + sxtl v27.4s, v7.4h // exted filter lower half + uxtl2 v6.4s, v6.8h // extend srcp upper half sxtl2 v7.4s, v7.8h // extend filter upper half - ld1 {v16.8H}, [x10], #16 // srcp[filterPos[2] + {0..7}] - mla v1.4S, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] - ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize - uxtl v22.4s, v16.4H // extend srcp lower half - sxtl v23.4s, v17.4H // extend filter lower half - uxtl2 v16.4s, v16.8H // extend srcp upper half + ld1 {v16.8h}, [x10], #16 // srcp[filterPos[2] + {0..7}] + mla v1.4s, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v22.4s, v16.4h // extend srcp lower half + sxtl v23.4s, v17.4h // extend filter lower half + uxtl2 v16.4s, v16.8h // extend srcp upper half sxtl2 v17.4s, v17.8h // extend filter upper half - mla v2.4S, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] - mla v2.4S, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] - ld1 {v18.8H}, [x11], #16 // srcp[filterPos[3] + {0..7}] - mla v1.4S, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] - ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize + mla v2.4s, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + mla v2.4s, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + ld1 {v18.8h}, [x11], #16 // srcp[filterPos[3] + {0..7}] + mla v1.4s, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize subs w15, w15, #8 // j -= 8: processed 8/filterSize - uxtl v28.4s, v18.4H // extend srcp lower half - sxtl v29.4s, v19.4H // extend filter lower half - uxtl2 v18.4s, v18.8H // extend srcp upper half + uxtl v28.4s, v18.4h // extend srcp lower half + sxtl v29.4s, v19.4h // extend filter 
lower half + uxtl2 v18.4s, v18.8h // extend srcp upper half sxtl2 v19.4s, v19.8h // extend filter upper half - mla v3.4S, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] - mla v3.4S, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + mla v3.4s, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] + mla v3.4s, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] b.gt 2b // inner loop if filterSize not consumed completely - addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding - addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding - addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding + addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding + addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding + addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding subs w2, w2, #4 // dstW -= 4 sshl v0.4s, v0.4s, v21.4s // shift right (effectively rigth, as shift is negative); overflow expected smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl) xtn v0.4h, v0.4s // narrow down to 16 bits - st1 {v0.4H}, [x1], #8 // write to destination part0123 + st1 {v0.4h}, [x1], #8 // write to destination part0123 b.gt 1b // loop until end of line ret endfunc @@ -1108,31 +1108,31 @@ function ff_hscale16to19_4_neon_asm, export=1 // Extending to 32 bits is necessary, as unit16_t values can't // be represented as int16_t without type promotion. uxtl v26.4s, v0.4h - sxtl v27.4s, v28.4H + sxtl v27.4s, v28.4h uxtl2 v0.4s, v0.8h mul v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v28.8H + sxtl2 v28.4s, v28.8h uxtl v26.4s, v1.4h mul v6.4s, v0.4s, v28.4s - sxtl v27.4s, v29.4H + sxtl v27.4s, v29.4h uxtl2 v0.4s, v1.8h mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v29.8H + sxtl2 v28.4s, v29.8h uxtl v26.4s, v2.4h mla v6.4s, v28.4s, v0.4s - sxtl v27.4s, v30.4H + sxtl v27.4s, v30.4h uxtl2 v0.4s, v2.8h mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v30.8H + sxtl2 v28.4s, v30.8h uxtl v26.4s, v3.4h mla v6.4s, v28.4s, v0.4s - sxtl v27.4s, v31.4H + sxtl v27.4s, v31.4h uxtl2 v0.4s, v3.8h mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v31.8H + sxtl2 v28.4s, v31.8h sub w2, w2, #8 mla v6.4s, v28.4s, v0.4s @@ -1181,31 +1181,31 @@ function ff_hscale16to19_4_neon_asm, export=1 ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 uxtl v26.4s, v0.4h - sxtl v27.4s, v28.4H + sxtl v27.4s, v28.4h uxtl2 v0.4s, v0.8h mul v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v28.8H + sxtl2 v28.4s, v28.8h uxtl v26.4s, v1.4h mul v6.4s, v0.4s, v28.4s - sxtl v27.4s, v29.4H + sxtl v27.4s, v29.4h uxtl2 v0.4s, v1.8h mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v29.8H + sxtl2 v28.4s, v29.8h uxtl v26.4s, v2.4h mla v6.4s, v0.4s, v28.4s - sxtl v27.4s, v30.4H + sxtl v27.4s, v30.4h uxtl2 v0.4s, v2.8h mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v30.8H + sxtl2 v28.4s, v30.8h uxtl v26.4s, v3.4h mla v6.4s, v0.4s, v28.4s - sxtl v27.4s, v31.4H + sxtl v27.4s, v31.4h uxtl2 v0.4s, v3.8h mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v31.8H + sxtl2 v28.4s, v31.8h subs w2, w2, #8 mla v6.4s, v0.4s, v28.4s @@ -1232,7 +1232,7 @@ function ff_hscale16to19_4_neon_asm, export=1 sxtl v31.4s, v31.4h subs w2, w2, #1 mul v5.4s, v0.4s, v31.4s - addv s0, v5.4S + addv s0, v5.4s sshl v0.4s, v0.4s, v17.4s smin v0.4s, v0.4s, v18.4s st1 {v0.s}[0], [x1], #4 @@ -1270,52 +1270,52 @@ function ff_hscale16to19_X8_neon_asm, export=1 add x13, x12, x7 // filter2 = filter1 + filterSize*2 lsl w10, w10, #1 add x4, x13, x7 // filter3 = filter2 + filterSize*2 - movi v0.2D, #0 // val sum part 1 (for dst[0]) - movi v1.2D, #0 
// val sum part 2 (for dst[1]) - movi v2.2D, #0 // val sum part 3 (for dst[2]) - movi v3.2D, #0 // val sum part 4 (for dst[3]) + movi v0.2d, #0 // val sum part 1 (for dst[0]) + movi v1.2d, #0 // val sum part 2 (for dst[1]) + movi v2.2d, #0 // val sum part 3 (for dst[2]) + movi v3.2d, #0 // val sum part 4 (for dst[3]) add x17, x3, w8, UXTW // srcp + filterPos[0] add x8, x3, w10, UXTW // srcp + filterPos[1] add x10, x3, w11, UXTW // srcp + filterPos[2] add x11, x3, w9, UXTW // srcp + filterPos[3] mov w15, w6 // filterSize counter -2: ld1 {v4.8H}, [x17], #16 // srcp[filterPos[0] + {0..7}] - ld1 {v5.8H}, [x16], #16 // load 8x16-bit filter values, part 1 - ld1 {v6.8H}, [x8], #16 // srcp[filterPos[1] + {0..7}] - ld1 {v7.8H}, [x12], #16 // load 8x16-bit at filter+filterSize - uxtl v24.4s, v4.4H // extend srcp lower half to 32 bits to preserve sign - sxtl v25.4s, v5.4H // extend filter lower half to 32 bits to match srcp size +2: ld1 {v4.8h}, [x17], #16 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 + ld1 {v6.8h}, [x8], #16 // srcp[filterPos[1] + {0..7}] + ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize + uxtl v24.4s, v4.4h // extend srcp lower half to 32 bits to preserve sign + sxtl v25.4s, v5.4h // extend filter lower half to 32 bits to match srcp size uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5 sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits - mla v0.4S, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 - sxtl v27.4s, v7.4H // exted filter lower half - uxtl2 v6.4s, v6.8H // extend srcp upper half + mla v0.4s, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 + sxtl v27.4s, v7.4h // exted filter lower half + uxtl2 v6.4s, v6.8h // extend srcp upper half sxtl2 v7.4s, v7.8h // extend filter upper half - ld1 {v16.8H}, [x10], #16 // srcp[filterPos[2] + {0..7}] - mla v1.4S, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] - ld1 {v17.8H}, [x13], #16 // load 8x16-bit at filter+2*filterSize - uxtl v22.4s, v16.4H // extend srcp lower half - sxtl v23.4s, v17.4H // extend filter lower half - uxtl2 v16.4s, v16.8H // extend srcp upper half + ld1 {v16.8h}, [x10], #16 // srcp[filterPos[2] + {0..7}] + mla v1.4s, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v22.4s, v16.4h // extend srcp lower half + sxtl v23.4s, v17.4h // extend filter lower half + uxtl2 v16.4s, v16.8h // extend srcp upper half sxtl2 v17.4s, v17.8h // extend filter upper half - mla v2.4S, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] - mla v2.4S, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] - ld1 {v18.8H}, [x11], #16 // srcp[filterPos[3] + {0..7}] - mla v1.4S, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] - ld1 {v19.8H}, [x4], #16 // load 8x16-bit at filter+3*filterSize + mla v2.4s, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + mla v2.4s, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + ld1 {v18.8h}, [x11], #16 // srcp[filterPos[3] + {0..7}] + mla v1.4s, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize subs w15, w15, #8 // j -= 8: processed 
8/filterSize - uxtl v28.4s, v18.4H // extend srcp lower half - sxtl v29.4s, v19.4H // extend filter lower half - uxtl2 v18.4s, v18.8H // extend srcp upper half + uxtl v28.4s, v18.4h // extend srcp lower half + sxtl v29.4s, v19.4h // extend filter lower half + uxtl2 v18.4s, v18.8h // extend srcp upper half sxtl2 v19.4s, v19.8h // extend filter upper half - mla v3.4S, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] - mla v3.4S, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + mla v3.4s, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] + mla v3.4s, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] b.gt 2b // inner loop if filterSize not consumed completely - addp v0.4S, v0.4S, v1.4S // part01 horizontal pair adding - addp v2.4S, v2.4S, v3.4S // part23 horizontal pair adding - addp v0.4S, v0.4S, v2.4S // part0123 horizontal pair adding + addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding + addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding + addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding subs w2, w2, #4 // dstW -= 4 sshl v0.4s, v0.4s, v21.4s // shift right (effectively rigth, as shift is negative); overflow expected smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl) diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S index b8a2818c9b..344d0659ea 100644 --- a/libswscale/aarch64/output.S +++ b/libswscale/aarch64/output.S @@ -29,13 +29,13 @@ function ff_yuv2planeX_8_neon, export=1 // x5 - const uint8_t *dither, // w6 - int offset - ld1 {v0.8B}, [x5] // load 8x8-bit dither + ld1 {v0.8b}, [x5] // load 8x8-bit dither and w6, w6, #7 cbz w6, 1f // check if offsetting present - ext v0.8B, v0.8B, v0.8B, #3 // honor offsetting which can be 0 or 3 only -1: uxtl v0.8H, v0.8B // extend dither to 16-bit - ushll v1.4S, v0.4H, #12 // extend dither to 32-bit with left shift by 12 (part 1) - ushll2 v2.4S, v0.8H, #12 // extend dither to 32-bit with left shift by 12 (part 2) + ext v0.8b, v0.8b, v0.8b, #3 // honor offsetting which can be 0 or 3 only +1: uxtl v0.8h, v0.8b // extend dither to 16-bit + ushll v1.4s, v0.4h, #12 // extend dither to 32-bit with left shift by 12 (part 1) + ushll2 v2.4s, v0.8h, #12 // extend dither to 32-bit with left shift by 12 (part 2) cmp w1, #8 // if filterSize == 8, branch to specialized version b.eq 6f cmp w1, #4 // if filterSize == 4, branch to specialized version @@ -48,8 +48,8 @@ function ff_yuv2planeX_8_neon, export=1 mov x7, #0 // i = 0 tbnz w1, #0, 4f // if filterSize % 2 != 0 branch to specialized version // fs % 2 == 0 -2: mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value - mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value +2: mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value mov w8, w1 // tmpfilterSize = filterSize mov x9, x2 // srcp = src mov x10, x0 // filterp = filter @@ -57,12 +57,12 @@ function ff_yuv2planeX_8_neon, export=1 ldr s7, [x10], #4 // read 2x16-bit coeff X and Y at filter[j] and filter[j+1] add x11, x11, x7, lsl #1 // &src[j ][i] add x12, x12, x7, lsl #1 // &src[j+1][i] - ld1 {v5.8H}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H - ld1 {v6.8H}, [x12] // read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P - smlal v3.4S, v5.4H, v7.H[0] // val0 += {A,B,C,D} * X - smlal2 v4.4S, v5.8H, v7.H[0] // val1 += {E,F,G,H} * X - smlal v3.4S, 
v6.4H, v7.H[1] // val0 += {I,J,K,L} * Y - smlal2 v4.4S, v6.8H, v7.H[1] // val1 += {M,N,O,P} * Y + ld1 {v5.8h}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H + ld1 {v6.8h}, [x12] // read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P + smlal v3.4s, v5.4h, v7.h[0] // val0 += {A,B,C,D} * X + smlal2 v4.4s, v5.8h, v7.h[0] // val1 += {E,F,G,H} * X + smlal v3.4s, v6.4h, v7.h[1] // val0 += {I,J,K,L} * Y + smlal2 v4.4s, v6.8h, v7.h[1] // val1 += {M,N,O,P} * Y subs w8, w8, #2 // tmpfilterSize -= 2 b.gt 3b // loop until filterSize consumed @@ -77,17 +77,17 @@ function ff_yuv2planeX_8_neon, export=1 // If filter size is odd (most likely == 1), then use this section. // fs % 2 != 0 -4: mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value - mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value +4: mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value mov w8, w1 // tmpfilterSize = filterSize mov x9, x2 // srcp = src mov x10, x0 // filterp = filter 5: ldr x11, [x9], #8 // get 1 pointer: src[j] ldr h6, [x10], #2 // read 1 16 bit coeff X at filter[j] add x11, x11, x7, lsl #1 // &src[j ][i] - ld1 {v5.8H}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H - smlal v3.4S, v5.4H, v6.H[0] // val0 += {A,B,C,D} * X - smlal2 v4.4S, v5.8H, v6.H[0] // val1 += {E,F,G,H} * X + ld1 {v5.8h}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H + smlal v3.4s, v5.4h, v6.h[0] // val0 += {A,B,C,D} * X + smlal2 v4.4s, v5.8h, v6.h[0] // val1 += {E,F,G,H} * X subs w8, w8, #1 // tmpfilterSize -= 2 b.gt 5b // loop until filterSize consumed @@ -107,36 +107,36 @@ function ff_yuv2planeX_8_neon, export=1 ldp x12, x13, [x2, #48] // load 2 pointers: src[j+6] and src[j+7] // load 8x16-bit values for filter[j], where j=0..7 - ld1 {v6.8H}, [x0] + ld1 {v6.8h}, [x0] 7: - mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value - mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value - - ld1 {v24.8H}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] - ld1 {v25.8H}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] - ld1 {v26.8H}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] - ld1 {v27.8H}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] - ld1 {v28.8H}, [x10], #16 // load 8x16-bit values for src[j + 4][i + {0..7}] - ld1 {v29.8H}, [x11], #16 // load 8x16-bit values for src[j + 5][i + {0..7}] - ld1 {v30.8H}, [x12], #16 // load 8x16-bit values for src[j + 6][i + {0..7}] - ld1 {v31.8H}, [x13], #16 // load 8x16-bit values for src[j + 7][i + {0..7}] - - smlal v3.4S, v24.4H, v6.H[0] // val0 += src[0][i + {0..3}] * filter[0] - smlal2 v4.4S, v24.8H, v6.H[0] // val1 += src[0][i + {4..7}] * filter[0] - smlal v3.4S, v25.4H, v6.H[1] // val0 += src[1][i + {0..3}] * filter[1] - smlal2 v4.4S, v25.8H, v6.H[1] // val1 += src[1][i + {4..7}] * filter[1] - smlal v3.4S, v26.4H, v6.H[2] // val0 += src[2][i + {0..3}] * filter[2] - smlal2 v4.4S, v26.8H, v6.H[2] // val1 += src[2][i + {4..7}] * filter[2] - smlal v3.4S, v27.4H, v6.H[3] // val0 += src[3][i + {0..3}] * filter[3] - smlal2 v4.4S, v27.8H, v6.H[3] // val1 += src[3][i + {4..7}] * filter[3] - smlal v3.4S, v28.4H, v6.H[4] // val0 += src[4][i + {0..3}] * filter[4] - smlal2 v4.4S, v28.8H, v6.H[4] // val1 += src[4][i + {4..7}] * filter[4] - smlal v3.4S, v29.4H, v6.H[5] // val0 += src[5][i + {0..3}] * filter[5] - smlal2 v4.4S, v29.8H, v6.H[5] // val1 += src[5][i + 
{4..7}] * filter[5] - smlal v3.4S, v30.4H, v6.H[6] // val0 += src[6][i + {0..3}] * filter[6] - smlal2 v4.4S, v30.8H, v6.H[6] // val1 += src[6][i + {4..7}] * filter[6] - smlal v3.4S, v31.4H, v6.H[7] // val0 += src[7][i + {0..3}] * filter[7] - smlal2 v4.4S, v31.8H, v6.H[7] // val1 += src[7][i + {4..7}] * filter[7] + mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value + + ld1 {v24.8h}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] + ld1 {v25.8h}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] + ld1 {v26.8h}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] + ld1 {v27.8h}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] + ld1 {v28.8h}, [x10], #16 // load 8x16-bit values for src[j + 4][i + {0..7}] + ld1 {v29.8h}, [x11], #16 // load 8x16-bit values for src[j + 5][i + {0..7}] + ld1 {v30.8h}, [x12], #16 // load 8x16-bit values for src[j + 6][i + {0..7}] + ld1 {v31.8h}, [x13], #16 // load 8x16-bit values for src[j + 7][i + {0..7}] + + smlal v3.4s, v24.4h, v6.h[0] // val0 += src[0][i + {0..3}] * filter[0] + smlal2 v4.4s, v24.8h, v6.h[0] // val1 += src[0][i + {4..7}] * filter[0] + smlal v3.4s, v25.4h, v6.h[1] // val0 += src[1][i + {0..3}] * filter[1] + smlal2 v4.4s, v25.8h, v6.h[1] // val1 += src[1][i + {4..7}] * filter[1] + smlal v3.4s, v26.4h, v6.h[2] // val0 += src[2][i + {0..3}] * filter[2] + smlal2 v4.4s, v26.8h, v6.h[2] // val1 += src[2][i + {4..7}] * filter[2] + smlal v3.4s, v27.4h, v6.h[3] // val0 += src[3][i + {0..3}] * filter[3] + smlal2 v4.4s, v27.8h, v6.h[3] // val1 += src[3][i + {4..7}] * filter[3] + smlal v3.4s, v28.4h, v6.h[4] // val0 += src[4][i + {0..3}] * filter[4] + smlal2 v4.4s, v28.8h, v6.h[4] // val1 += src[4][i + {4..7}] * filter[4] + smlal v3.4s, v29.4h, v6.h[5] // val0 += src[5][i + {0..3}] * filter[5] + smlal2 v4.4s, v29.8h, v6.h[5] // val1 += src[5][i + {4..7}] * filter[5] + smlal v3.4s, v30.4h, v6.h[6] // val0 += src[6][i + {0..3}] * filter[6] + smlal2 v4.4s, v30.8h, v6.h[6] // val1 += src[6][i + {4..7}] * filter[6] + smlal v3.4s, v31.4h, v6.h[7] // val0 += src[7][i + {0..3}] * filter[7] + smlal2 v4.4s, v31.8h, v6.h[7] // val1 += src[7][i + {4..7}] * filter[7] sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) @@ -151,24 +151,24 @@ function ff_yuv2planeX_8_neon, export=1 ldp x7, x9, [x2, #16] // load 2 pointers: src[j+2] and src[j+3] // load 4x16-bit values for filter[j], where j=0..3 and replicated across lanes - ld1 {v6.4H}, [x0] + ld1 {v6.4h}, [x0] 9: - mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value - mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value - - ld1 {v24.8H}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] - ld1 {v25.8H}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] - ld1 {v26.8H}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] - ld1 {v27.8H}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] - - smlal v3.4S, v24.4H, v6.H[0] // val0 += src[0][i + {0..3}] * filter[0] - smlal2 v4.4S, v24.8H, v6.H[0] // val1 += src[0][i + {4..7}] * filter[0] - smlal v3.4S, v25.4H, v6.H[1] // val0 += src[1][i + {0..3}] * filter[1] - smlal2 v4.4S, v25.8H, v6.H[1] // val1 += src[1][i + {4..7}] * filter[1] - smlal v3.4S, v26.4H, v6.H[2] // val0 += src[2][i + {0..3}] * filter[2] - smlal2 v4.4S, v26.8H, v6.H[2] // val1 += src[2][i + {4..7}] * filter[2] - smlal v3.4S, v27.4H, v6.H[3] // val0 += 
src[3][i + {0..3}] * filter[3] - smlal2 v4.4S, v27.8H, v6.H[3] // val1 += src[3][i + {4..7}] * filter[3] + mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value + + ld1 {v24.8h}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] + ld1 {v25.8h}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] + ld1 {v26.8h}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] + ld1 {v27.8h}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] + + smlal v3.4s, v24.4h, v6.h[0] // val0 += src[0][i + {0..3}] * filter[0] + smlal2 v4.4s, v24.8h, v6.h[0] // val1 += src[0][i + {4..7}] * filter[0] + smlal v3.4s, v25.4h, v6.h[1] // val0 += src[1][i + {0..3}] * filter[1] + smlal2 v4.4s, v25.8h, v6.h[1] // val1 += src[1][i + {4..7}] * filter[1] + smlal v3.4s, v26.4h, v6.h[2] // val0 += src[2][i + {0..3}] * filter[2] + smlal2 v4.4s, v26.8h, v6.h[2] // val1 += src[2][i + {4..7}] * filter[2] + smlal v3.4s, v27.4h, v6.h[3] // val0 += src[3][i + {0..3}] * filter[3] + smlal2 v4.4s, v27.8h, v6.h[3] // val1 += src[3][i + {4..7}] * filter[3] sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) @@ -184,16 +184,16 @@ function ff_yuv2planeX_8_neon, export=1 // load 2x16-bit values for filter[j], where j=0..1 and replicated across lanes ldr s6, [x0] 11: - mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value - mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value + mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value - ld1 {v24.8H}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] - ld1 {v25.8H}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] + ld1 {v24.8h}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] + ld1 {v25.8h}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] - smlal v3.4S, v24.4H, v6.H[0] // val0 += src[0][i + {0..3}] * filter[0] - smlal2 v4.4S, v24.8H, v6.H[0] // val1 += src[0][i + {4..7}] * filter[0] - smlal v3.4S, v25.4H, v6.H[1] // val0 += src[1][i + {0..3}] * filter[1] - smlal2 v4.4S, v25.8H, v6.H[1] // val1 += src[1][i + {4..7}] * filter[1] + smlal v3.4s, v24.4h, v6.h[0] // val0 += src[0][i + {0..3}] * filter[0] + smlal2 v4.4s, v24.8h, v6.h[0] // val1 += src[0][i + {4..7}] * filter[0] + smlal v3.4s, v25.4h, v6.h[1] // val0 += src[1][i + {0..3}] * filter[1] + smlal2 v4.4s, v25.8h, v6.h[1] // val1 += src[1][i + {4..7}] * filter[1] sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) @@ -210,11 +210,11 @@ function ff_yuv2plane1_8_neon, export=1 // w2 - int dstW, // x3 - const uint8_t *dither, // w4 - int offset - ld1 {v0.8B}, [x3] // load 8x8-bit dither + ld1 {v0.8b}, [x3] // load 8x8-bit dither and w4, w4, #7 cbz w4, 1f // check if offsetting present - ext v0.8B, v0.8B, v0.8B, #3 // honor offsetting which can be 0 or 3 only -1: uxtl v0.8H, v0.8B // extend dither to 32-bit + ext v0.8b, v0.8b, v0.8b, #3 // honor offsetting which can be 0 or 3 only +1: uxtl v0.8h, v0.8b // extend dither to 32-bit uxtl v1.4s, v0.4h uxtl2 v2.4s, v0.8h 2: diff --git a/libswscale/aarch64/yuv2rgb_neon.S b/libswscale/aarch64/yuv2rgb_neon.S index f341268c5d..3fc91530b6 100644 --- a/libswscale/aarch64/yuv2rgb_neon.S +++ b/libswscale/aarch64/yuv2rgb_neon.S @@ -33,9 +33,9 @@ .macro load_args_nv12 ldr x8, [sp] // table load_yoff_ycoeff 8, 16 // y_offset, 
y_coeff - ld1 {v1.1D}, [x8] - dup v0.8H, w10 - dup v3.8H, w9 + ld1 {v1.1d}, [x8] + dup v0.8h, w10 + dup v3.8h, w9 sub w3, w3, w0, lsl #2 // w3 = linesize - width * 4 (padding) sub w5, w5, w0 // w5 = linesizeY - width (paddingY) sub w7, w7, w0 // w7 = linesizeC - width (paddingC) @@ -51,9 +51,9 @@ ldr w14, [sp, #8] // linesizeV ldr x8, [sp, #16] // table load_yoff_ycoeff 24, 32 // y_offset, y_coeff - ld1 {v1.1D}, [x8] - dup v0.8H, w10 - dup v3.8H, w9 + ld1 {v1.1d}, [x8] + dup v0.8h, w10 + dup v3.8h, w9 sub w3, w3, w0, lsl #2 // w3 = linesize - width * 4 (padding) sub w5, w5, w0 // w5 = linesizeY - width (paddingY) sub w7, w7, w0, lsr #1 // w7 = linesizeU - width / 2 (paddingU) @@ -67,9 +67,9 @@ ldr w14, [sp, #8] // linesizeV ldr x8, [sp, #16] // table load_yoff_ycoeff 24, 32 // y_offset, y_coeff - ld1 {v1.1D}, [x8] - dup v0.8H, w10 - dup v3.8H, w9 + ld1 {v1.1d}, [x8] + dup v0.8h, w10 + dup v3.8h, w9 sub w3, w3, w0, lsl #2 // w3 = linesize - width * 4 (padding) sub w5, w5, w0 // w5 = linesizeY - width (paddingY) sub w7, w7, w0, lsr #1 // w7 = linesizeU - width / 2 (paddingU) @@ -77,22 +77,22 @@ .endm .macro load_chroma_nv12 - ld2 {v16.8B, v17.8B}, [x6], #16 - ushll v18.8H, v16.8B, #3 - ushll v19.8H, v17.8B, #3 + ld2 {v16.8b, v17.8b}, [x6], #16 + ushll v18.8h, v16.8b, #3 + ushll v19.8h, v17.8b, #3 .endm .macro load_chroma_nv21 - ld2 {v16.8B, v17.8B}, [x6], #16 - ushll v19.8H, v16.8B, #3 - ushll v18.8H, v17.8B, #3 + ld2 {v16.8b, v17.8b}, [x6], #16 + ushll v19.8h, v16.8b, #3 + ushll v18.8h, v17.8b, #3 .endm .macro load_chroma_yuv420p - ld1 {v16.8B}, [ x6], #8 - ld1 {v17.8B}, [x13], #8 - ushll v18.8H, v16.8B, #3 - ushll v19.8H, v17.8B, #3 + ld1 {v16.8b}, [ x6], #8 + ld1 {v17.8b}, [x13], #8 + ushll v18.8h, v16.8b, #3 + ushll v19.8h, v17.8b, #3 .endm .macro load_chroma_yuv422p @@ -123,18 +123,18 @@ .endm .macro compute_rgba r1 g1 b1 a1 r2 g2 b2 a2 - add v20.8H, v26.8H, v20.8H // Y1 + R1 - add v21.8H, v27.8H, v21.8H // Y2 + R2 - add v22.8H, v26.8H, v22.8H // Y1 + G1 - add v23.8H, v27.8H, v23.8H // Y2 + G2 - add v24.8H, v26.8H, v24.8H // Y1 + B1 - add v25.8H, v27.8H, v25.8H // Y2 + B2 - sqrshrun \r1, v20.8H, #1 // clip_u8((Y1 + R1) >> 1) - sqrshrun \r2, v21.8H, #1 // clip_u8((Y2 + R1) >> 1) - sqrshrun \g1, v22.8H, #1 // clip_u8((Y1 + G1) >> 1) - sqrshrun \g2, v23.8H, #1 // clip_u8((Y2 + G1) >> 1) - sqrshrun \b1, v24.8H, #1 // clip_u8((Y1 + B1) >> 1) - sqrshrun \b2, v25.8H, #1 // clip_u8((Y2 + B1) >> 1) + add v20.8h, v26.8h, v20.8h // Y1 + R1 + add v21.8h, v27.8h, v21.8h // Y2 + R2 + add v22.8h, v26.8h, v22.8h // Y1 + G1 + add v23.8h, v27.8h, v23.8h // Y2 + G2 + add v24.8h, v26.8h, v24.8h // Y1 + B1 + add v25.8h, v27.8h, v25.8h // Y2 + B2 + sqrshrun \r1, v20.8h, #1 // clip_u8((Y1 + R1) >> 1) + sqrshrun \r2, v21.8h, #1 // clip_u8((Y2 + R1) >> 1) + sqrshrun \g1, v22.8h, #1 // clip_u8((Y1 + G1) >> 1) + sqrshrun \g2, v23.8h, #1 // clip_u8((Y2 + G1) >> 1) + sqrshrun \b1, v24.8h, #1 // clip_u8((Y1 + B1) >> 1) + sqrshrun \b2, v25.8h, #1 // clip_u8((Y2 + B1) >> 1) movi \a1, #255 movi \a2, #255 .endm @@ -146,47 +146,47 @@ function ff_\ifmt\()_to_\ofmt\()_neon, export=1 1: mov w8, w0 // w8 = width 2: - movi v5.8H, #4, lsl #8 // 128 * (1<<3) + movi v5.8h, #4, lsl #8 // 128 * (1<<3) load_chroma_\ifmt - sub v18.8H, v18.8H, v5.8H // U*(1<<3) - 128*(1<<3) - sub v19.8H, v19.8H, v5.8H // V*(1<<3) - 128*(1<<3) - sqdmulh v20.8H, v19.8H, v1.H[0] // V * v2r (R) - sqdmulh v22.8H, v18.8H, v1.H[1] // U * u2g - sqdmulh v19.8H, v19.8H, v1.H[2] // V * v2g - add v22.8H, v22.8H, v19.8H // U * u2g + V * v2g (G) - sqdmulh v24.8H, 
v18.8H, v1.H[3] // U * u2b (B) - zip2 v21.8H, v20.8H, v20.8H // R2 - zip1 v20.8H, v20.8H, v20.8H // R1 - zip2 v23.8H, v22.8H, v22.8H // G2 - zip1 v22.8H, v22.8H, v22.8H // G1 - zip2 v25.8H, v24.8H, v24.8H // B2 - zip1 v24.8H, v24.8H, v24.8H // B1 - ld1 {v2.16B}, [x4], #16 // load luma - ushll v26.8H, v2.8B, #3 // Y1*(1<<3) - ushll2 v27.8H, v2.16B, #3 // Y2*(1<<3) - sub v26.8H, v26.8H, v3.8H // Y1*(1<<3) - y_offset - sub v27.8H, v27.8H, v3.8H // Y2*(1<<3) - y_offset - sqdmulh v26.8H, v26.8H, v0.8H // ((Y1*(1<<3) - y_offset) * y_coeff) >> 15 - sqdmulh v27.8H, v27.8H, v0.8H // ((Y2*(1<<3) - y_offset) * y_coeff) >> 15 + sub v18.8h, v18.8h, v5.8h // U*(1<<3) - 128*(1<<3) + sub v19.8h, v19.8h, v5.8h // V*(1<<3) - 128*(1<<3) + sqdmulh v20.8h, v19.8h, v1.h[0] // V * v2r (R) + sqdmulh v22.8h, v18.8h, v1.h[1] // U * u2g + sqdmulh v19.8h, v19.8h, v1.h[2] // V * v2g + add v22.8h, v22.8h, v19.8h // U * u2g + V * v2g (G) + sqdmulh v24.8h, v18.8h, v1.h[3] // U * u2b (B) + zip2 v21.8h, v20.8h, v20.8h // R2 + zip1 v20.8h, v20.8h, v20.8h // R1 + zip2 v23.8h, v22.8h, v22.8h // G2 + zip1 v22.8h, v22.8h, v22.8h // G1 + zip2 v25.8h, v24.8h, v24.8h // B2 + zip1 v24.8h, v24.8h, v24.8h // B1 + ld1 {v2.16b}, [x4], #16 // load luma + ushll v26.8h, v2.8b, #3 // Y1*(1<<3) + ushll2 v27.8h, v2.16b, #3 // Y2*(1<<3) + sub v26.8h, v26.8h, v3.8h // Y1*(1<<3) - y_offset + sub v27.8h, v27.8h, v3.8h // Y2*(1<<3) - y_offset + sqdmulh v26.8h, v26.8h, v0.8h // ((Y1*(1<<3) - y_offset) * y_coeff) >> 15 + sqdmulh v27.8h, v27.8h, v0.8h // ((Y2*(1<<3) - y_offset) * y_coeff) >> 15 .ifc \ofmt,argb // 1 2 3 0 - compute_rgba v5.8B,v6.8B,v7.8B,v4.8B, v17.8B,v18.8B,v19.8B,v16.8B + compute_rgba v5.8b,v6.8b,v7.8b,v4.8b, v17.8b,v18.8b,v19.8b,v16.8b .endif .ifc \ofmt,rgba // 0 1 2 3 - compute_rgba v4.8B,v5.8B,v6.8B,v7.8B, v16.8B,v17.8B,v18.8B,v19.8B + compute_rgba v4.8b,v5.8b,v6.8b,v7.8b, v16.8b,v17.8b,v18.8b,v19.8b .endif .ifc \ofmt,abgr // 3 2 1 0 - compute_rgba v7.8B,v6.8B,v5.8B,v4.8B, v19.8B,v18.8B,v17.8B,v16.8B + compute_rgba v7.8b,v6.8b,v5.8b,v4.8b, v19.8b,v18.8b,v17.8b,v16.8b .endif .ifc \ofmt,bgra // 2 1 0 3 - compute_rgba v6.8B,v5.8B,v4.8B,v7.8B, v18.8B,v17.8B,v16.8B,v19.8B + compute_rgba v6.8b,v5.8b,v4.8b,v7.8b, v18.8b,v17.8b,v16.8b,v19.8b .endif - st4 { v4.8B, v5.8B, v6.8B, v7.8B}, [x2], #32 - st4 {v16.8B,v17.8B,v18.8B,v19.8B}, [x2], #32 + st4 { v4.8b, v5.8b, v6.8b, v7.8b}, [x2], #32 + st4 {v16.8b,v17.8b,v18.8b,v19.8b}, [x2], #32 subs w8, w8, #16 // width -= 16 b.gt 2b add x2, x2, w3, SXTW // dst += padding From patchwork Tue Oct 17 11:45:57 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?q?Martin_Storsj=C3=B6?= X-Patchwork-Id: 44276 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a21:3e0b:b0:15d:8365:d4b8 with SMTP id bk11csp301523pzc; Tue, 17 Oct 2023 04:46:16 -0700 (PDT) X-Google-Smtp-Source: AGHT+IE/zgQPtH58nzZMJujmr964r2GVk7AvavWDw247sR/pUaLoBmGEcfeLkYRH6mDo7b/WyMui X-Received: by 2002:a17:907:961f:b0:9b2:cf77:a105 with SMTP id gb31-20020a170907961f00b009b2cf77a105mr1644718ejc.15.1697543175940; Tue, 17 Oct 2023 04:46:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1697543175; cv=none; d=google.com; s=arc-20160816; b=zYv14Qoxu1AGJVTOU/RZ0eSgDVIe5ISEz76ARmo82J6gTdtbaSVd7I3R83yNEibrRt tYZJcZ4ZEKsGuN5OqQ6CDeVNF3CHFZCx4VjzaTwlbxVtQH3xQrUt2RqsWgcrGdfTJ8Ga 1c0NpiPeH/IOJESmsfp0yemTQoWG0GE6XBOmJlKz5oFYyY4g4EMy1nIEoBbNjrFl1V6l WFaPJilOMSOW9cjITpoYdAPbGSEppuJmexPKmRdswOV31SzNYGqljAg8eHr/KKGnHn8S 
4V4gJi7tPEmdo8Y4rBQ66a1LCGhgMulfzQY9rpXuOX+dw3qSbN71o2anAVI8Y/xyX1gi v+T/3n7K81u261+zdPlZMx0xVvOBy/BNi9w63Y+QxQiUU2xrcnWjlQwVBYOFIPK4z0zF LSnAHU35EaaCBxq/62HAFirNm6AUUW8+r4p6eaO7+EBF8fSFolzbOYlBCOhZuOzgZRp8 nECjcnQuY5nASM4TQJ2y8QrE68UYHqZ94a1B+zHy5kTADoE/Z7mQeJEucC3mNKRN6pNJ LrXA== X-Gm-Message-State: AOJu0YxFSbPPZNlfxqR+JBCylxnKcToJudNdp2auh5y+GrMzlMrai3hU hponGdAEug1Ceh1Q05GxDOKWiH+PoqYbI7kSc4hTVw== X-Received: by 2002:ac2:53b8:0:b0:507:9a08:4046 with SMTP id j24-20020ac253b8000000b005079a084046mr1604519lfh.55.1697543163580; Tue, 17 Oct 2023 04:46:03 -0700 (PDT) Received: from localhost.localdomain (dsl-tkubng21-58c01c-243.dhcp.inet.fi. [88.192.28.243]) by smtp.gmail.com with ESMTPSA id x25-20020a19f619000000b0050797a35f8csm244532lfe.162.2023.10.17.04.46.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 17 Oct 2023 04:46:03 -0700 (PDT) From: =?utf-8?q?Martin_Storsj=C3=B6?= To: ffmpeg-devel@ffmpeg.org Date: Tue, 17 Oct 2023 14:45:57 +0300 Message-Id: <20231017114601.1374712-2-martin@martin.st> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231017114601.1374712-1-martin@martin.st> References: <20231017114601.1374712-1-martin@martin.st> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 2/5] aarch64: Lowercase UXTW/SXTW and similar flags X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: jdek@itanimul.li Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: NZQcVt+9IGAx --- libavcodec/aarch64/h264cmc_neon.S | 4 +- libavcodec/aarch64/h264idct_neon.S | 2 +- libavfilter/aarch64/vf_bwdif_neon.S | 6 +- libavfilter/aarch64/vf_nlmeans_neon.S | 8 +- libswscale/aarch64/hscale.S | 176 +++++++++++++------------- libswscale/aarch64/yuv2rgb_neon.S | 14 +- 6 files changed, 105 insertions(+), 105 deletions(-) diff --git a/libavcodec/aarch64/h264cmc_neon.S b/libavcodec/aarch64/h264cmc_neon.S index 5b959b87d3..2ddd5c8a53 100644 --- a/libavcodec/aarch64/h264cmc_neon.S +++ b/libavcodec/aarch64/h264cmc_neon.S @@ -38,7 +38,7 @@ function ff_\type\()_\codec\()_chroma_mc8_neon, export=1 lsl w9, w9, #3 lsl w10, w10, #1 add w9, w9, w10 - add x6, x6, w9, UXTW + add x6, x6, w9, uxtw ld1r {v22.8h}, [x6] .endif .ifc \codec,vc1 @@ -208,7 +208,7 @@ function ff_\type\()_\codec\()_chroma_mc4_neon, export=1 lsl w9, w9, #3 lsl w10, w10, #1 add w9, w9, w10 - add x6, x6, w9, UXTW + add x6, x6, w9, uxtw ld1r {v22.8h}, [x6] .endif .ifc \codec,vc1 diff --git a/libavcodec/aarch64/h264idct_neon.S b/libavcodec/aarch64/h264idct_neon.S index 1bab2ca7c8..3f7ff2c49e 100644 --- a/libavcodec/aarch64/h264idct_neon.S +++ b/libavcodec/aarch64/h264idct_neon.S @@ -385,7 +385,7 @@ function ff_h264_idct8_add4_neon, export=1 movrel x14, .L_ff_h264_idct8_add_neon 1: ldrb w9, [x7], #4 ldrsw x0, [x5], #16 - ldrb w9, [x4, w9, UXTW] + ldrb w9, [x4, w9, uxtw] subs w9, w9, #1 b.lt 2f ldrsh w11, [x1] diff --git a/libavfilter/aarch64/vf_bwdif_neon.S b/libavfilter/aarch64/vf_bwdif_neon.S index ae9aab20cd..bf268b12f8 100644 --- a/libavfilter/aarch64/vf_bwdif_neon.S +++ b/libavfilter/aarch64/vf_bwdif_neon.S @@ -186,13 +186,13 @@ function ff_bwdif_filter_line3_neon, export=1 mov w10, w6 // w10 = loop count neg w9, w5 // w9 = mref lsl w8, w9, #1 // w8 = mref2 - add w7, w9, w9, LSL #1 // w7 = mref3 + add w7, w9, w9, lsl #1 // w7 = mref3 lsl w6, w9, #2 // w6 = mref4 mov w11, w5 // w11 = pref lsl w12, w5, #1 // w12 = 
pref2 - add w13, w5, w5, LSL #1 // w13 = pref3 + add w13, w5, w5, lsl #1 // w13 = pref3 lsl w14, w5, #2 // w14 = pref4 - add w15, w5, w5, LSL #2 // w15 = pref5 + add w15, w5, w5, lsl #2 // w15 = pref5 add w16, w14, w12 // w16 = pref6 lsl w5, w1, #1 // w5 = d_stride * 2 diff --git a/libavfilter/aarch64/vf_nlmeans_neon.S b/libavfilter/aarch64/vf_nlmeans_neon.S index 26d6958b82..a788cffd85 100644 --- a/libavfilter/aarch64/vf_nlmeans_neon.S +++ b/libavfilter/aarch64/vf_nlmeans_neon.S @@ -35,10 +35,10 @@ function ff_compute_safe_ssd_integral_image_neon, export=1 movi v26.4s, #0 // used as zero for the "rotations" in acc_sum_store - sub x3, x3, w6, UXTW // s1 padding (s1_linesize - w) - sub x5, x5, w6, UXTW // s2 padding (s2_linesize - w) - sub x9, x0, w1, UXTW #2 // dst_top - sub x1, x1, w6, UXTW // dst padding (dst_linesize_32 - w) + sub x3, x3, w6, uxtw // s1 padding (s1_linesize - w) + sub x5, x5, w6, uxtw // s2 padding (s2_linesize - w) + sub x9, x0, w1, uxtw #2 // dst_top + sub x1, x1, w6, uxtw // dst padding (dst_linesize_32 - w) lsl x1, x1, #2 // dst padding expressed in bytes 1: mov w10, w6 // width copy for each line sub x0, x0, #16 // beginning of the dst line minus 4 sums diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S index f3c404eb5f..3041d483fc 100644 --- a/libswscale/aarch64/hscale.S +++ b/libswscale/aarch64/hscale.S @@ -54,10 +54,10 @@ function ff_hscale8to15_X8_neon, export=1 movi v1.2d, #0 // val sum part 2 (for dst[1]) movi v2.2d, #0 // val sum part 3 (for dst[2]) movi v3.2d, #0 // val sum part 4 (for dst[3]) - add x17, x3, w8, UXTW // srcp + filterPos[0] - add x8, x3, w0, UXTW // srcp + filterPos[1] - add x0, x3, w11, UXTW // srcp + filterPos[2] - add x11, x3, w9, UXTW // srcp + filterPos[3] + add x17, x3, w8, uxtw // srcp + filterPos[0] + add x8, x3, w0, uxtw // srcp + filterPos[1] + add x0, x3, w11, uxtw // srcp + filterPos[2] + add x11, x3, w9, uxtw // srcp + filterPos[3] mov w15, w6 // filterSize counter 2: ld1 {v4.8b}, [x17], #8 // srcp[filterPos[0] + {0..7}] ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 @@ -231,14 +231,14 @@ function ff_hscale8to15_4_neon, export=1 add x5, x5, #32 // advance filterPos // gather random access data from src into contiguous memory - ldr w8, [x3, w8, UXTW] // src[filterPos[idx + 0]][0..3] - ldr w9, [x3, w9, UXTW] // src[filterPos[idx + 1]][0..3] - ldr w10, [x3, w10, UXTW] // src[filterPos[idx + 2]][0..3] - ldr w11, [x3, w11, UXTW] // src[filterPos[idx + 3]][0..3] - ldr w12, [x3, w12, UXTW] // src[filterPos[idx + 4]][0..3] - ldr w13, [x3, w13, UXTW] // src[filterPos[idx + 5]][0..3] - ldr w14, [x3, w14, UXTW] // src[filterPos[idx + 6]][0..3] - ldr w15, [x3, w15, UXTW] // src[filterPos[idx + 7]][0..3] + ldr w8, [x3, w8, uxtw] // src[filterPos[idx + 0]][0..3] + ldr w9, [x3, w9, uxtw] // src[filterPos[idx + 1]][0..3] + ldr w10, [x3, w10, uxtw] // src[filterPos[idx + 2]][0..3] + ldr w11, [x3, w11, uxtw] // src[filterPos[idx + 3]][0..3] + ldr w12, [x3, w12, uxtw] // src[filterPos[idx + 4]][0..3] + ldr w13, [x3, w13, uxtw] // src[filterPos[idx + 5]][0..3] + ldr w14, [x3, w14, uxtw] // src[filterPos[idx + 6]][0..3] + ldr w15, [x3, w15, uxtw] // src[filterPos[idx + 7]][0..3] stp w8, w9, [sp] // *scratch_mem = { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] } stp w10, w11, [sp, #8] // *scratch_mem = { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] } stp w12, w13, [sp, #16] // *scratch_mem = { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] } @@ -263,21 +263,21 @@ 
function ff_hscale8to15_4_neon, export=1 // interleaved SIMD and prefetching intended to keep ld/st and vector pipelines busy uxtl v16.8h, v16.8b // unsigned extend long, covert src data to 16-bit uxtl v17.8h, v17.8b // unsigned extend long, covert src data to 16-bit - ldr w8, [x3, w8, UXTW] // src[filterPos[idx + 0]], next iteration - ldr w9, [x3, w9, UXTW] // src[filterPos[idx + 1]], next iteration + ldr w8, [x3, w8, uxtw] // src[filterPos[idx + 0]], next iteration + ldr w9, [x3, w9, uxtw] // src[filterPos[idx + 1]], next iteration uxtl v18.8h, v18.8b // unsigned extend long, covert src data to 16-bit uxtl v19.8h, v19.8b // unsigned extend long, covert src data to 16-bit - ldr w10, [x3, w10, UXTW] // src[filterPos[idx + 2]], next iteration - ldr w11, [x3, w11, UXTW] // src[filterPos[idx + 3]], next iteration + ldr w10, [x3, w10, uxtw] // src[filterPos[idx + 2]], next iteration + ldr w11, [x3, w11, uxtw] // src[filterPos[idx + 3]], next iteration smlal v0.4s, v1.4h, v16.4h // multiply accumulate inner loop j = 0, idx = 0..3 smlal v0.4s, v2.4h, v17.4h // multiply accumulate inner loop j = 1, idx = 0..3 - ldr w12, [x3, w12, UXTW] // src[filterPos[idx + 4]], next iteration - ldr w13, [x3, w13, UXTW] // src[filterPos[idx + 5]], next iteration + ldr w12, [x3, w12, uxtw] // src[filterPos[idx + 4]], next iteration + ldr w13, [x3, w13, uxtw] // src[filterPos[idx + 5]], next iteration smlal v0.4s, v3.4h, v18.4h // multiply accumulate inner loop j = 2, idx = 0..3 smlal v0.4s, v4.4h, v19.4h // multiply accumulate inner loop j = 3, idx = 0..3 - ldr w14, [x3, w14, UXTW] // src[filterPos[idx + 6]], next iteration - ldr w15, [x3, w15, UXTW] // src[filterPos[idx + 7]], next iteration + ldr w14, [x3, w14, uxtw] // src[filterPos[idx + 6]], next iteration + ldr w15, [x3, w15, uxtw] // src[filterPos[idx + 7]], next iteration smlal2 v5.4s, v1.8h, v16.8h // multiply accumulate inner loop j = 0, idx = 4..7 smlal2 v5.4s, v2.8h, v17.8h // multiply accumulate inner loop j = 1, idx = 4..7 @@ -331,7 +331,7 @@ function ff_hscale8to15_4_neon, export=1 2: // load src ldr w8, [x5], #4 // filterPos[i] - add x9, x3, w8, UXTW // calculate the address for src load + add x9, x3, w8, uxtw // calculate the address for src load ld1 {v5.s}[0], [x9] // src[filterPos[i] + 0..3] // load filter ld1 {v6.4h}, [x4], #8 // filter[filterSize * i + 0..3] @@ -372,14 +372,14 @@ function ff_hscale8to19_4_neon, export=1 add x5, x5, #32 // load data from - ldr w8, [x3, w8, UXTW] - ldr w9, [x3, w9, UXTW] - ldr w10, [x3, w10, UXTW] - ldr w11, [x3, w11, UXTW] - ldr w12, [x3, w12, UXTW] - ldr w13, [x3, w13, UXTW] - ldr w14, [x3, w14, UXTW] - ldr w15, [x3, w15, UXTW] + ldr w8, [x3, w8, uxtw] + ldr w9, [x3, w9, uxtw] + ldr w10, [x3, w10, uxtw] + ldr w11, [x3, w11, uxtw] + ldr w12, [x3, w12, uxtw] + ldr w13, [x3, w13, uxtw] + ldr w14, [x3, w14, uxtw] + ldr w15, [x3, w15, uxtw] sub sp, sp, #32 @@ -399,30 +399,30 @@ function ff_hscale8to19_4_neon, export=1 ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] add x5, x5, #32 uxtl v0.8h, v0.8b - ldr w8, [x3, w8, UXTW] + ldr w8, [x3, w8, uxtw] smull v5.4s, v0.4h, v28.4h // multiply first column of src - ldr w9, [x3, w9, UXTW] + ldr w9, [x3, w9, uxtw] smull2 v6.4s, v0.8h, v28.8h stp w8, w9, [sp] uxtl v1.8h, v1.8b - ldr w10, [x3, w10, UXTW] + ldr w10, [x3, w10, uxtw] smlal v5.4s, v1.4h, v29.4h // multiply second column of src - ldr w11, [x3, w11, UXTW] + ldr w11, [x3, w11, uxtw] smlal2 v6.4s, v1.8h, v29.8h stp w10, w11, [sp, #8] uxtl v2.8h, v2.8b - ldr w12, [x3, w12, UXTW] + ldr w12, [x3, w12, uxtw] smlal 
v5.4s, v2.4h, v30.4h // multiply third column of src - ldr w13, [x3, w13, UXTW] + ldr w13, [x3, w13, uxtw] smlal2 v6.4s, v2.8h, v30.8h stp w12, w13, [sp, #16] uxtl v3.8h, v3.8b - ldr w14, [x3, w14, UXTW] + ldr w14, [x3, w14, uxtw] smlal v5.4s, v3.4h, v31.4h // multiply fourth column of src - ldr w15, [x3, w15, UXTW] + ldr w15, [x3, w15, uxtw] smlal2 v6.4s, v3.8h, v31.8h stp w14, w15, [sp, #24] @@ -468,7 +468,7 @@ function ff_hscale8to19_4_neon, export=1 2: ldr w8, [x5], #4 // load filterPos - add x9, x3, w8, UXTW // src + filterPos + add x9, x3, w8, uxtw // src + filterPos ld1 {v0.s}[0], [x9] // load 4 * uint8_t* into one single ld1 {v31.4h}, [x4], #8 uxtl v0.8h, v0.8b @@ -503,10 +503,10 @@ function ff_hscale8to19_X8_neon, export=1 movi v1.2d, #0 // val sum part 2 (for dst[1]) movi v2.2d, #0 // val sum part 3 (for dst[2]) movi v3.2d, #0 // val sum part 4 (for dst[3]) - add x17, x3, w8, UXTW // srcp + filterPos[0] - add x8, x3, w0, UXTW // srcp + filterPos[1] - add x0, x3, w11, UXTW // srcp + filterPos[2] - add x11, x3, w9, UXTW // srcp + filterPos[3] + add x17, x3, w8, uxtw // srcp + filterPos[0] + add x8, x3, w0, uxtw // srcp + filterPos[1] + add x0, x3, w11, uxtw // srcp + filterPos[2] + add x11, x3, w9, uxtw // srcp + filterPos[3] mov w15, w6 // filterSize counter 2: ld1 {v4.8b}, [x17], #8 // srcp[filterPos[0] + {0..7}] ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 @@ -567,13 +567,13 @@ function ff_hscale8to19_X4_neon, export=1 mov x12, x4 // filter + 0 add x13, x4, x7 // filter + 1 - add x8, x3, w8, UXTW // srcp + filterPos 0 + add x8, x3, w8, uxtw // srcp + filterPos 0 add x14, x13, x7 // filter + 2 - add x9, x3, w9, UXTW // srcp + filterPos 1 + add x9, x3, w9, uxtw // srcp + filterPos 1 add x15, x14, x7 // filter + 3 - add x10, x3, w10, UXTW // srcp + filterPos 2 + add x10, x3, w10, uxtw // srcp + filterPos 2 mov w0, w6 // save the filterSize to temporary variable - add x11, x3, w11, UXTW // srcp + filterPos 3 + add x11, x3, w11, uxtw // srcp + filterPos 3 add x5, x5, #16 // advance filter position mov x16, xzr // clear the register x16 used for offsetting the filter values @@ -674,14 +674,14 @@ function ff_hscale16to15_4_neon_asm, export=1 lsl x15, x15, #1 // load src with given offset - ldr x8, [x3, w8, UXTW] - ldr x9, [x3, w9, UXTW] - ldr x10, [x3, w10, UXTW] - ldr x11, [x3, w11, UXTW] - ldr x12, [x3, w12, UXTW] - ldr x13, [x3, w13, UXTW] - ldr x14, [x3, w14, UXTW] - ldr x15, [x3, w15, UXTW] + ldr x8, [x3, w8, uxtw] + ldr x9, [x3, w9, uxtw] + ldr x10, [x3, w10, uxtw] + ldr x11, [x3, w11, uxtw] + ldr x12, [x3, w12, uxtw] + ldr x13, [x3, w13, uxtw] + ldr x14, [x3, w14, uxtw] + ldr x15, [x3, w15, uxtw] sub sp, sp, #64 // push src on stack so it can be loaded into vectors later @@ -754,14 +754,14 @@ function ff_hscale16to15_4_neon_asm, export=1 lsl x14, x14, #1 lsl x15, x15, #1 - ldr x8, [x3, w8, UXTW] - ldr x9, [x3, w9, UXTW] - ldr x10, [x3, w10, UXTW] - ldr x11, [x3, w11, UXTW] - ldr x12, [x3, w12, UXTW] - ldr x13, [x3, w13, UXTW] - ldr x14, [x3, w14, UXTW] - ldr x15, [x3, w15, UXTW] + ldr x8, [x3, w8, uxtw] + ldr x9, [x3, w9, uxtw] + ldr x10, [x3, w10, uxtw] + ldr x11, [x3, w11, uxtw] + ldr x12, [x3, w12, uxtw] + ldr x13, [x3, w13, uxtw] + ldr x14, [x3, w14, uxtw] + ldr x15, [x3, w15, uxtw] stp x8, x9, [sp] stp x10, x11, [sp, #16] @@ -819,7 +819,7 @@ function ff_hscale16to15_4_neon_asm, export=1 2: ldr w8, [x5], #4 // load filterPos lsl w8, w8, #1 - add x9, x3, w8, UXTW // src + filterPos + add x9, x3, w8, uxtw // src + filterPos ld1 {v0.4h}, [x9] // load 4 * 
uint16_t ld1 {v31.4h}, [x4], #8 @@ -869,10 +869,10 @@ function ff_hscale16to15_X8_neon_asm, export=1 movi v1.2d, #0 // val sum part 2 (for dst[1]) movi v2.2d, #0 // val sum part 3 (for dst[2]) movi v3.2d, #0 // val sum part 4 (for dst[3]) - add x17, x3, w8, UXTW // srcp + filterPos[0] - add x8, x3, w10, UXTW // srcp + filterPos[1] - add x10, x3, w11, UXTW // srcp + filterPos[2] - add x11, x3, w9, UXTW // srcp + filterPos[3] + add x17, x3, w8, uxtw // srcp + filterPos[0] + add x8, x3, w10, uxtw // srcp + filterPos[1] + add x10, x3, w11, uxtw // srcp + filterPos[2] + add x11, x3, w9, uxtw // srcp + filterPos[3] mov w15, w6 // filterSize counter 2: ld1 {v4.8h}, [x17], #16 // srcp[filterPos[0] + {0..7}] ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 @@ -1082,14 +1082,14 @@ function ff_hscale16to19_4_neon_asm, export=1 lsl x15, x15, #1 // load src with given offset - ldr x8, [x3, w8, UXTW] - ldr x9, [x3, w9, UXTW] - ldr x10, [x3, w10, UXTW] - ldr x11, [x3, w11, UXTW] - ldr x12, [x3, w12, UXTW] - ldr x13, [x3, w13, UXTW] - ldr x14, [x3, w14, UXTW] - ldr x15, [x3, w15, UXTW] + ldr x8, [x3, w8, uxtw] + ldr x9, [x3, w9, uxtw] + ldr x10, [x3, w10, uxtw] + ldr x11, [x3, w11, uxtw] + ldr x12, [x3, w12, uxtw] + ldr x13, [x3, w13, uxtw] + ldr x14, [x3, w14, uxtw] + ldr x15, [x3, w15, uxtw] sub sp, sp, #64 // push src on stack so it can be loaded into vectors later @@ -1160,14 +1160,14 @@ function ff_hscale16to19_4_neon_asm, export=1 lsl x14, x14, #1 lsl x15, x15, #1 - ldr x8, [x3, w8, UXTW] - ldr x9, [x3, w9, UXTW] - ldr x10, [x3, w10, UXTW] - ldr x11, [x3, w11, UXTW] - ldr x12, [x3, w12, UXTW] - ldr x13, [x3, w13, UXTW] - ldr x14, [x3, w14, UXTW] - ldr x15, [x3, w15, UXTW] + ldr x8, [x3, w8, uxtw] + ldr x9, [x3, w9, uxtw] + ldr x10, [x3, w10, uxtw] + ldr x11, [x3, w11, uxtw] + ldr x12, [x3, w12, uxtw] + ldr x13, [x3, w13, uxtw] + ldr x14, [x3, w14, uxtw] + ldr x15, [x3, w15, uxtw] stp x8, x9, [sp] stp x10, x11, [sp, #16] @@ -1224,7 +1224,7 @@ function ff_hscale16to19_4_neon_asm, export=1 2: ldr w8, [x5], #4 // load filterPos lsl w8, w8, #1 - add x9, x3, w8, UXTW // src + filterPos + add x9, x3, w8, uxtw // src + filterPos ld1 {v0.4h}, [x9] // load 4 * uint16_t ld1 {v31.4h}, [x4], #8 @@ -1274,10 +1274,10 @@ function ff_hscale16to19_X8_neon_asm, export=1 movi v1.2d, #0 // val sum part 2 (for dst[1]) movi v2.2d, #0 // val sum part 3 (for dst[2]) movi v3.2d, #0 // val sum part 4 (for dst[3]) - add x17, x3, w8, UXTW // srcp + filterPos[0] - add x8, x3, w10, UXTW // srcp + filterPos[1] - add x10, x3, w11, UXTW // srcp + filterPos[2] - add x11, x3, w9, UXTW // srcp + filterPos[3] + add x17, x3, w8, uxtw // srcp + filterPos[0] + add x8, x3, w10, uxtw // srcp + filterPos[1] + add x10, x3, w11, uxtw // srcp + filterPos[2] + add x11, x3, w9, uxtw // srcp + filterPos[3] mov w15, w6 // filterSize counter 2: ld1 {v4.8h}, [x17], #16 // srcp[filterPos[0] + {0..7}] ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 diff --git a/libswscale/aarch64/yuv2rgb_neon.S b/libswscale/aarch64/yuv2rgb_neon.S index 3fc91530b6..379d75622e 100644 --- a/libswscale/aarch64/yuv2rgb_neon.S +++ b/libswscale/aarch64/yuv2rgb_neon.S @@ -102,7 +102,7 @@ .macro increment_nv12 ands w15, w1, #1 csel w16, w7, w11, ne // incC = (h & 1) ? paddincC : -width - add x6, x6, w16, SXTW // srcC += incC + add x6, x6, w16, sxtw // srcC += incC .endm .macro increment_nv21 @@ -113,13 +113,13 @@ ands w15, w1, #1 csel w16, w7, w11, ne // incU = (h & 1) ? paddincU : -width/2 csel w17, w14, w11, ne // incV = (h & 1) ? 
paddincV : -width/2 - add x6, x6, w16, SXTW // srcU += incU - add x13, x13, w17, SXTW // srcV += incV + add x6, x6, w16, sxtw // srcU += incU + add x13, x13, w17, sxtw // srcV += incV .endm .macro increment_yuv422p - add x6, x6, w7, SXTW // srcU += incU - add x13, x13, w14, SXTW // srcV += incV + add x6, x6, w7, sxtw // srcU += incU + add x13, x13, w14, sxtw // srcV += incV .endm .macro compute_rgba r1 g1 b1 a1 r2 g2 b2 a2 @@ -189,8 +189,8 @@ function ff_\ifmt\()_to_\ofmt\()_neon, export=1 st4 {v16.8b,v17.8b,v18.8b,v19.8b}, [x2], #32 subs w8, w8, #16 // width -= 16 b.gt 2b - add x2, x2, w3, SXTW // dst += padding - add x4, x4, w5, SXTW // srcY += paddingY + add x2, x2, w3, sxtw // dst += padding + add x4, x4, w5, sxtw // srcY += paddingY increment_\ifmt subs w1, w1, #1 // height -= 1 b.gt 1b From patchwork Tue Oct 17 11:45:58 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?q?Martin_Storsj=C3=B6?= X-Patchwork-Id: 44278 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a21:3e0b:b0:15d:8365:d4b8 with SMTP id bk11csp301763pzc; Tue, 17 Oct 2023 04:46:43 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGouO7HyhKhU4LX/dXHBF2weTXozSqfOLbNHOZ5B7FlFgf70qpRN5KqQUqYwRXXcBvlmooq X-Received: by 2002:ac2:4858:0:b0:503:2eaf:1659 with SMTP id 24-20020ac24858000000b005032eaf1659mr1542791lfy.41.1697543203478; Tue, 17 Oct 2023 04:46:43 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1697543203; cv=none; d=google.com; s=arc-20160816; b=rusiLcae/YdrfO7aVW0/PtxxSxpy+pYrJB/h2Wvw/S5BUXtsplh98dA/uo+wfT+2Wy odnQ1BGq5ETTo2hsEMSsKQvzPByJjpOVOuTZ2ad9geul+gU4U8wK9akj5QGVyVMWaT8W v598cN63PUpuOYZ0UaZiAjCzgJxoANxP4YK67ldeoUcWqTCKbLtCrONmw7yxJOfh+GuM wjU9Z61m6teyL7Vbar9dSIELBqf5KsyBw1Gh0H4lXRosrzt1P/TZBgXKoEm9gkojYB8G D8vW+XmbxnxwgEmd03B70HEwLqP/0yoZrjQuzBkKYtjlS/OC42xpJB5tDzWcKAbLgn+A hczA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:dkim-signature:delivered-to; bh=pxh0GtUP9OqB0NndwGKCjNEfX9ZKXxu8LjkIolSj4vY=; fh=4VBelKDE4DH3L7jF6H/1Jmu78FdN+YP76yfdJCQTJ30=; b=SUcot+VDulCfPUKOBedFuwEYWcF5kz9YeBaA1KQSWrk/ipYH5w6qzufIxbeSpxRyYy EU1zRvDeDAM0xnkHZPJGas8ueXHTdtX/bJfVHQY+UThc4sQWjlLZhy/kyFfECD6/wRtS CEESHntcUoZlE9hyEHlBTk3WJRaWDAMh0SudRuojp+Wcw3c6wAISPPYPEdCYdOE5iQHD 8Zfw1YUw4iMPwjhxDzmSY28oLULsv0RzYi1aQ2YZCAS5bwdm8gcR/Yvo20SjLdGX8cOA DjncNgq2z8qXzAhF2gh0SR67P1TKnj7PL1BI5jkRcH3E2z+YWDbGBK6ciGUdjEnwegOp GCSg== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@martin-st.20230601.gappssmtp.com header.s=20230601 header.b="HG/SYw3f"; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. 
[88.192.28.243]) by smtp.gmail.com with ESMTPSA id x25-20020a19f619000000b0050797a35f8csm244532lfe.162.2023.10.17.04.46.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 17 Oct 2023 04:46:03 -0700 (PDT) From: =?utf-8?q?Martin_Storsj=C3=B6?= To: ffmpeg-devel@ffmpeg.org Date: Tue, 17 Oct 2023 14:45:58 +0300 Message-Id: <20231017114601.1374712-3-martin@martin.st> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231017114601.1374712-1-martin@martin.st> References: <20231017114601.1374712-1-martin@martin.st> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 3/5] aarch64: Make the indentation more consistent X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: jdek@itanimul.li Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: OzIHKGtVS3Al Some functions have slightly different indentation styles; try to match the surrounding code. libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally uses a layered indentation style to visually show how different unrolled/interleaved phases fit together. --- libavcodec/aarch64/h264dsp_neon.S | 8 +- libavcodec/aarch64/h264qpel_neon.S | 12 +- libavcodec/aarch64/hevcdsp_idct_neon.S | 256 ++++++++++---------- libavcodec/aarch64/hevcdsp_qpel_neon.S | 24 +- libavcodec/aarch64/opusdsp_neon.S | 8 +- libavcodec/aarch64/vp8dsp_neon.S | 310 ++++++++++++------------- libavutil/aarch64/tx_float_neon.S | 12 +- 7 files changed, 315 insertions(+), 315 deletions(-) diff --git a/libavcodec/aarch64/h264dsp_neon.S b/libavcodec/aarch64/h264dsp_neon.S index 71c2ddfd0c..723b692019 100644 --- a/libavcodec/aarch64/h264dsp_neon.S +++ b/libavcodec/aarch64/h264dsp_neon.S @@ -526,7 +526,7 @@ function ff_h264_h_loop_filter_chroma_mbaff_intra_neon, export=1 ld1 {v17.8b}, [x4], x1 ld1 {v19.8b}, [x4], x1 - transpose_4x8B v18, v16, v17, v19, v26, v27, v28, v29 + transpose_4x8B v18, v16, v17, v19, v26, v27, v28, v29 h264_loop_filter_chroma_intra @@ -554,7 +554,7 @@ h_loop_filter_chroma420_intra: ld1 {v17.s}[1], [x4], x1 ld1 {v19.s}[1], [x4], x1 - transpose_4x8B v18, v16, v17, v19, v26, v27, v28, v29 + transpose_4x8B v18, v16, v17, v19, v26, v27, v28, v29 h264_loop_filter_chroma_intra @@ -1017,7 +1017,7 @@ function ff_h264_h_loop_filter_chroma_mbaff_intra_neon_10, export=1 ld1 {v16.8h}, [x4], x1 ld1 {v19.8h}, [x9], x1 - transpose_4x8H v18, v16, v17, v19, v26, v27, v28, v29 + transpose_4x8H v18, v16, v17, v19, v26, v27, v28, v29 h264_loop_filter_chroma_intra_10 @@ -1045,7 +1045,7 @@ h_loop_filter_chroma420_intra_10: ld1 {v19.4h}, [x4], x1 ld1 {v19.d}[1], [x9], x1 - transpose_4x8H v18, v16, v17, v19, v26, v27, v28, v29 + transpose_4x8H v18, v16, v17, v19, v26, v27, v28, v29 h264_loop_filter_chroma_intra_10 diff --git a/libavcodec/aarch64/h264qpel_neon.S b/libavcodec/aarch64/h264qpel_neon.S index 21906327cd..f4475d96f9 100644 --- a/libavcodec/aarch64/h264qpel_neon.S +++ b/libavcodec/aarch64/h264qpel_neon.S @@ -580,8 +580,8 @@ function \type\()_h264_qpel16_hv_lowpass_l2_neon endfunc .endm - h264_qpel16_hv put - h264_qpel16_hv avg + h264_qpel16_hv put + h264_qpel16_hv avg .macro h264_qpel8 type function ff_\type\()_h264_qpel8_mc10_neon, export=1 @@ -759,8 +759,8 @@ function ff_\type\()_h264_qpel8_mc33_neon, export=1 endfunc .endm - h264_qpel8 put - h264_qpel8 avg + h264_qpel8 put + h264_qpel8 avg .macro h264_qpel16 type function 
ff_\type\()_h264_qpel16_mc10_neon, export=1 @@ -931,5 +931,5 @@ function ff_\type\()_h264_qpel16_mc33_neon, export=1 endfunc .endm - h264_qpel16 put - h264_qpel16 avg + h264_qpel16 put + h264_qpel16 avg diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S index ba8a1ebaed..3cac6e6db9 100644 --- a/libavcodec/aarch64/hevcdsp_idct_neon.S +++ b/libavcodec/aarch64/hevcdsp_idct_neon.S @@ -239,23 +239,23 @@ function hevc_add_residual_32x32_16_neon, export=0 endfunc .macro tr_4x4 in0, in1, in2, in3, out0, out1, out2, out3, shift - sshll v20.4s, \in0, #6 - sshll v21.4s, \in0, #6 - smull v22.4s, \in1, v4.h[1] - smull v23.4s, \in1, v4.h[3] - smlal v20.4s, \in2, v4.h[0] //e0 - smlsl v21.4s, \in2, v4.h[0] //e1 - smlal v22.4s, \in3, v4.h[3] //o0 - smlsl v23.4s, \in3, v4.h[1] //o1 - - add v24.4s, v20.4s, v22.4s - sub v20.4s, v20.4s, v22.4s - add v22.4s, v21.4s, v23.4s - sub v21.4s, v21.4s, v23.4s - sqrshrn \out0, v24.4s, #\shift - sqrshrn \out3, v20.4s, #\shift - sqrshrn \out1, v22.4s, #\shift - sqrshrn \out2, v21.4s, #\shift + sshll v20.4s, \in0, #6 + sshll v21.4s, \in0, #6 + smull v22.4s, \in1, v4.h[1] + smull v23.4s, \in1, v4.h[3] + smlal v20.4s, \in2, v4.h[0] //e0 + smlsl v21.4s, \in2, v4.h[0] //e1 + smlal v22.4s, \in3, v4.h[3] //o0 + smlsl v23.4s, \in3, v4.h[1] //o1 + + add v24.4s, v20.4s, v22.4s + sub v20.4s, v20.4s, v22.4s + add v22.4s, v21.4s, v23.4s + sub v21.4s, v21.4s, v23.4s + sqrshrn \out0, v24.4s, #\shift + sqrshrn \out3, v20.4s, #\shift + sqrshrn \out1, v22.4s, #\shift + sqrshrn \out2, v21.4s, #\shift .endm .macro idct_4x4 bitdepth @@ -294,19 +294,19 @@ endfunc // uses and clobbers v28-v31 as temp registers .macro tr_4x4_8 in0, in1, in2, in3, out0, out1, out2, out3, p1, p2 - sshll\p1 v28.4s, \in0, #6 - mov v29.16b, v28.16b - smull\p1 v30.4s, \in1, v0.h[1] - smull\p1 v31.4s, \in1, v0.h[3] - smlal\p2 v28.4s, \in2, v0.h[0] //e0 - smlsl\p2 v29.4s, \in2, v0.h[0] //e1 - smlal\p2 v30.4s, \in3, v0.h[3] //o0 - smlsl\p2 v31.4s, \in3, v0.h[1] //o1 - - add \out0, v28.4s, v30.4s - add \out1, v29.4s, v31.4s - sub \out2, v29.4s, v31.4s - sub \out3, v28.4s, v30.4s + sshll\p1 v28.4s, \in0, #6 + mov v29.16b, v28.16b + smull\p1 v30.4s, \in1, v0.h[1] + smull\p1 v31.4s, \in1, v0.h[3] + smlal\p2 v28.4s, \in2, v0.h[0] //e0 + smlsl\p2 v29.4s, \in2, v0.h[0] //e1 + smlal\p2 v30.4s, \in3, v0.h[3] //o0 + smlsl\p2 v31.4s, \in3, v0.h[1] //o1 + + add \out0, v28.4s, v30.4s + add \out1, v29.4s, v31.4s + sub \out2, v29.4s, v31.4s + sub \out3, v28.4s, v30.4s .endm .macro transpose_8x8 r0, r1, r2, r3, r4, r5, r6, r7 @@ -362,11 +362,11 @@ endfunc .macro idct_8x8 bitdepth function ff_hevc_idct_8x8_\bitdepth\()_neon, export=1 //x0 - coeffs - mov x1, x0 + mov x1, x0 ld1 {v16.8h-v19.8h}, [x1], #64 ld1 {v20.8h-v23.8h}, [x1] - movrel x1, trans + movrel x1, trans ld1 {v0.8h}, [x1] tr_8x4 7, v16,.4h, v17,.4h, v18,.4h, v19,.4h, v20,.4h, v21,.4h, v22,.4h, v23,.4h @@ -379,7 +379,7 @@ function ff_hevc_idct_8x8_\bitdepth\()_neon, export=1 transpose_8x8 v16, v17, v18, v19, v20, v21, v22, v23 - mov x1, x0 + mov x1, x0 st1 {v16.8h-v19.8h}, [x1], #64 st1 {v20.8h-v23.8h}, [x1] @@ -388,8 +388,8 @@ endfunc .endm .macro butterfly e, o, tmp_p, tmp_m - add \tmp_p, \e, \o - sub \tmp_m, \e, \o + add \tmp_p, \e, \o + sub \tmp_m, \e, \o .endm .macro tr16_8x4 in0, in1, in2, in3, offset @@ -418,7 +418,7 @@ endfunc butterfly v25.4s, v29.4s, v17.4s, v22.4s butterfly v26.4s, v30.4s, v18.4s, v21.4s butterfly v27.4s, v31.4s, v19.4s, v20.4s - add x4, sp, #\offset + add x4, sp, #\offset st1 {v16.4s-v19.4s}, [x4], #64 st1 
{v20.4s-v23.4s}, [x4] .endm @@ -435,14 +435,14 @@ endfunc .endm .macro add_member in, t0, t1, t2, t3, t4, t5, t6, t7, op0, op1, op2, op3, op4, op5, op6, op7, p - sum_sub v21.4s, \in, \t0, \op0, \p - sum_sub v22.4s, \in, \t1, \op1, \p - sum_sub v23.4s, \in, \t2, \op2, \p - sum_sub v24.4s, \in, \t3, \op3, \p - sum_sub v25.4s, \in, \t4, \op4, \p - sum_sub v26.4s, \in, \t5, \op5, \p - sum_sub v27.4s, \in, \t6, \op6, \p - sum_sub v28.4s, \in, \t7, \op7, \p + sum_sub v21.4s, \in, \t0, \op0, \p + sum_sub v22.4s, \in, \t1, \op1, \p + sum_sub v23.4s, \in, \t2, \op2, \p + sum_sub v24.4s, \in, \t3, \op3, \p + sum_sub v25.4s, \in, \t4, \op4, \p + sum_sub v26.4s, \in, \t5, \op5, \p + sum_sub v27.4s, \in, \t6, \op6, \p + sum_sub v28.4s, \in, \t7, \op7, \p .endm .macro butterfly16 in0, in1, in2, in3, in4, in5, in6, in7 @@ -528,20 +528,20 @@ endfunc .macro tr_16x4 name, shift, offset, step function func_tr_16x4_\name - mov x1, x5 - add x3, x5, #(\step * 64) - mov x2, #(\step * 128) + mov x1, x5 + add x3, x5, #(\step * 64) + mov x2, #(\step * 128) load16 v16.d, v17.d, v18.d, v19.d - movrel x1, trans + movrel x1, trans ld1 {v0.8h}, [x1] tr16_8x4 v16, v17, v18, v19, \offset - add x1, x5, #(\step * 32) - add x3, x5, #(\step * 3 *32) - mov x2, #(\step * 128) + add x1, x5, #(\step * 32) + add x3, x5, #(\step * 3 *32) + mov x2, #(\step * 128) load16 v20.d, v17.d, v18.d, v19.d - movrel x1, trans, 16 + movrel x1, trans, 16 ld1 {v1.8h}, [x1] smull v21.4s, v20.4h, v1.h[0] smull v22.4s, v20.4h, v1.h[1] @@ -560,19 +560,19 @@ function func_tr_16x4_\name add_member v19.4h, v1.h[6], v1.h[3], v1.h[0], v1.h[2], v1.h[5], v1.h[7], v1.h[4], v1.h[1], +, -, +, -, +, +, -, + add_member v19.8h, v1.h[7], v1.h[6], v1.h[5], v1.h[4], v1.h[3], v1.h[2], v1.h[1], v1.h[0], +, -, +, -, +, -, +, -, 2 - add x4, sp, #\offset + add x4, sp, #\offset ld1 {v16.4s-v19.4s}, [x4], #64 butterfly16 v16.4s, v21.4s, v17.4s, v22.4s, v18.4s, v23.4s, v19.4s, v24.4s .if \shift > 0 scale v29, v30, v31, v24, v20.4s, v16.4s, v21.4s, v17.4s, v22.4s, v18.4s, v23.4s, v19.4s, \shift transpose16_4x4_2 v29, v30, v31, v24, v2, v3, v4, v5, v6, v7 - mov x1, x6 - add x3, x6, #(24 +3*32) - mov x2, #32 - mov x4, #-32 + mov x1, x6 + add x3, x6, #(24 +3*32) + mov x2, #32 + mov x4, #-32 store16 v29.d, v30.d, v31.d, v24.d, x4 .else - store_to_stack \offset, (\offset + 240), v20.4s, v21.4s, v22.4s, v23.4s, v19.4s, v18.4s, v17.4s, v16.4s + store_to_stack \offset, (\offset + 240), v20.4s, v21.4s, v22.4s, v23.4s, v19.4s, v18.4s, v17.4s, v16.4s .endif add x4, sp, #(\offset + 64) @@ -582,13 +582,13 @@ function func_tr_16x4_\name scale v29, v30, v31, v20, v20.4s, v16.4s, v25.4s, v17.4s, v26.4s, v18.4s, v27.4s, v19.4s, \shift transpose16_4x4_2 v29, v30, v31, v20, v2, v3, v4, v5, v6, v7 - add x1, x6, #8 - add x3, x6, #(16 + 3 * 32) - mov x2, #32 - mov x4, #-32 + add x1, x6, #8 + add x3, x6, #(16 + 3 * 32) + mov x2, #32 + mov x4, #-32 store16 v29.d, v30.d, v31.d, v20.d, x4 .else - store_to_stack (\offset + 64), (\offset + 176), v20.4s, v25.4s, v26.4s, v27.4s, v19.4s, v18.4s, v17.4s, v16.4s + store_to_stack (\offset + 64), (\offset + 176), v20.4s, v25.4s, v26.4s, v27.4s, v19.4s, v18.4s, v17.4s, v16.4s .endif ret @@ -601,21 +601,21 @@ function ff_hevc_idct_16x16_\bitdepth\()_neon, export=1 mov x15, x30 // allocate a temp buffer - sub sp, sp, #640 + sub sp, sp, #640 .irp i, 0, 1, 2, 3 - add x5, x0, #(8 * \i) - add x6, sp, #(8 * \i * 16) + add x5, x0, #(8 * \i) + add x6, sp, #(8 * \i * 16) bl func_tr_16x4_firstpass .endr .irp i, 0, 1, 2, 3 - add x5, sp, #(8 * \i) - add x6, x0, #(8 * \i * 
16) + add x5, sp, #(8 * \i) + add x6, x0, #(8 * \i * 16) bl func_tr_16x4_secondpass_\bitdepth .endr - add sp, sp, #640 + add sp, sp, #640 ret x15 endfunc @@ -644,10 +644,10 @@ endfunc .endm .macro add_member32 in, t0, t1, t2, t3, op0, op1, op2, op3, p - sum_sub v24.4s, \in, \t0, \op0, \p - sum_sub v25.4s, \in, \t1, \op1, \p - sum_sub v26.4s, \in, \t2, \op2, \p - sum_sub v27.4s, \in, \t3, \op3, \p + sum_sub v24.4s, \in, \t0, \op0, \p + sum_sub v25.4s, \in, \t1, \op1, \p + sum_sub v26.4s, \in, \t2, \op2, \p + sum_sub v27.4s, \in, \t3, \op3, \p .endm .macro butterfly32 in0, in1, in2, in3, out @@ -841,85 +841,85 @@ idct_32x32 8 idct_32x32 10 .macro tr4_luma_shift r0, r1, r2, r3, shift - saddl v0.4s, \r0, \r2 // c0 = src0 + src2 - saddl v1.4s, \r2, \r3 // c1 = src2 + src3 - ssubl v2.4s, \r0, \r3 // c2 = src0 - src3 - smull v3.4s, \r1, v21.4h // c3 = 74 * src1 - - saddl v7.4s, \r0, \r3 // src0 + src3 - ssubw v7.4s, v7.4s, \r2 // src0 - src2 + src3 - mul v7.4s, v7.4s, v18.4s // dst2 = 74 * (src0 - src2 + src3) - - mul v5.4s, v0.4s, v19.4s // 29 * c0 - mul v6.4s, v1.4s, v20.4s // 55 * c1 - add v5.4s, v5.4s, v6.4s // 29 * c0 + 55 * c1 - add v5.4s, v5.4s, v3.4s // dst0 = 29 * c0 + 55 * c1 + c3 - - mul v1.4s, v1.4s, v19.4s // 29 * c1 - mul v6.4s, v2.4s, v20.4s // 55 * c2 - sub v6.4s, v6.4s, v1.4s // 55 * c2 - 29 * c1 - add v6.4s, v6.4s, v3.4s // dst1 = 55 * c2 - 29 * c1 + c3 - - mul v0.4s, v0.4s, v20.4s // 55 * c0 - mul v2.4s, v2.4s, v19.4s // 29 * c2 - add v0.4s, v0.4s, v2.4s // 55 * c0 + 29 * c2 - sub v0.4s, v0.4s, v3.4s // dst3 = 55 * c0 + 29 * c2 - c3 - - sqrshrn \r0, v5.4s, \shift - sqrshrn \r1, v6.4s, \shift - sqrshrn \r2, v7.4s, \shift - sqrshrn \r3, v0.4s, \shift + saddl v0.4s, \r0, \r2 // c0 = src0 + src2 + saddl v1.4s, \r2, \r3 // c1 = src2 + src3 + ssubl v2.4s, \r0, \r3 // c2 = src0 - src3 + smull v3.4s, \r1, v21.4h // c3 = 74 * src1 + + saddl v7.4s, \r0, \r3 // src0 + src3 + ssubw v7.4s, v7.4s, \r2 // src0 - src2 + src3 + mul v7.4s, v7.4s, v18.4s // dst2 = 74 * (src0 - src2 + src3) + + mul v5.4s, v0.4s, v19.4s // 29 * c0 + mul v6.4s, v1.4s, v20.4s // 55 * c1 + add v5.4s, v5.4s, v6.4s // 29 * c0 + 55 * c1 + add v5.4s, v5.4s, v3.4s // dst0 = 29 * c0 + 55 * c1 + c3 + + mul v1.4s, v1.4s, v19.4s // 29 * c1 + mul v6.4s, v2.4s, v20.4s // 55 * c2 + sub v6.4s, v6.4s, v1.4s // 55 * c2 - 29 * c1 + add v6.4s, v6.4s, v3.4s // dst1 = 55 * c2 - 29 * c1 + c3 + + mul v0.4s, v0.4s, v20.4s // 55 * c0 + mul v2.4s, v2.4s, v19.4s // 29 * c2 + add v0.4s, v0.4s, v2.4s // 55 * c0 + 29 * c2 + sub v0.4s, v0.4s, v3.4s // dst3 = 55 * c0 + 29 * c2 - c3 + + sqrshrn \r0, v5.4s, \shift + sqrshrn \r1, v6.4s, \shift + sqrshrn \r2, v7.4s, \shift + sqrshrn \r3, v0.4s, \shift .endm function ff_hevc_transform_luma_4x4_neon_8, export=1 - ld1 {v28.4h-v31.4h}, [x0] - movi v18.4s, #74 - movi v19.4s, #29 - movi v20.4s, #55 - movi v21.4h, #74 + ld1 {v28.4h-v31.4h}, [x0] + movi v18.4s, #74 + movi v19.4s, #29 + movi v20.4s, #55 + movi v21.4h, #74 - tr4_luma_shift v28.4h, v29.4h, v30.4h, v31.4h, #7 - transpose_4x4H v28, v29, v30, v31, v22, v23, v24, v25 + tr4_luma_shift v28.4h, v29.4h, v30.4h, v31.4h, #7 + transpose_4x4H v28, v29, v30, v31, v22, v23, v24, v25 - tr4_luma_shift v28.4h, v29.4h, v30.4h, v31.4h, #12 - transpose_4x4H v28, v29, v30, v31, v22, v23, v24, v25 + tr4_luma_shift v28.4h, v29.4h, v30.4h, v31.4h, #12 + transpose_4x4H v28, v29, v30, v31, v22, v23, v24, v25 - st1 {v28.4h-v31.4h}, [x0] + st1 {v28.4h-v31.4h}, [x0] ret endfunc // void ff_hevc_idct_NxN_dc_DEPTH_neon(int16_t *coeffs) .macro idct_dc size, bitdepth function 
ff_hevc_idct_\size\()x\size\()_dc_\bitdepth\()_neon, export=1 - ld1r {v4.8h}, [x0] - srshr v4.8h, v4.8h, #1 - srshr v0.8h, v4.8h, #(14 - \bitdepth) - srshr v1.8h, v4.8h, #(14 - \bitdepth) + ld1r {v4.8h}, [x0] + srshr v4.8h, v4.8h, #1 + srshr v0.8h, v4.8h, #(14 - \bitdepth) + srshr v1.8h, v4.8h, #(14 - \bitdepth) .if \size > 4 - srshr v2.8h, v4.8h, #(14 - \bitdepth) - srshr v3.8h, v4.8h, #(14 - \bitdepth) + srshr v2.8h, v4.8h, #(14 - \bitdepth) + srshr v3.8h, v4.8h, #(14 - \bitdepth) .if \size > 16 /* dc 32x32 */ - mov x2, #4 + mov x2, #4 1: - subs x2, x2, #1 + subs x2, x2, #1 .endif add x12, x0, #64 mov x13, #128 .if \size > 8 /* dc 16x16 */ - st1 {v0.8h-v3.8h}, [x0], x13 - st1 {v0.8h-v3.8h}, [x12], x13 - st1 {v0.8h-v3.8h}, [x0], x13 - st1 {v0.8h-v3.8h}, [x12], x13 - st1 {v0.8h-v3.8h}, [x0], x13 - st1 {v0.8h-v3.8h}, [x12], x13 + st1 {v0.8h-v3.8h}, [x0], x13 + st1 {v0.8h-v3.8h}, [x12], x13 + st1 {v0.8h-v3.8h}, [x0], x13 + st1 {v0.8h-v3.8h}, [x12], x13 + st1 {v0.8h-v3.8h}, [x0], x13 + st1 {v0.8h-v3.8h}, [x12], x13 .endif /* dc 8x8 */ - st1 {v0.8h-v3.8h}, [x0], x13 - st1 {v0.8h-v3.8h}, [x12], x13 + st1 {v0.8h-v3.8h}, [x0], x13 + st1 {v0.8h-v3.8h}, [x12], x13 .if \size > 16 /* dc 32x32 */ bne 1b .endif .else /* dc 4x4 */ - st1 {v0.8h-v1.8h}, [x0] + st1 {v0.8h-v1.8h}, [x0] .endif ret endfunc diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S index 1212eae63d..f3f24ab8b0 100644 --- a/libavcodec/aarch64/hevcdsp_qpel_neon.S +++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S @@ -840,19 +840,19 @@ function ff_hevc_put_hevc_qpel_uni_v16_8_neon, export=1 endfunc function ff_hevc_put_hevc_qpel_uni_v24_8_neon, export=1 - b X(ff_hevc_put_hevc_qpel_uni_v12_8_neon) + b X(ff_hevc_put_hevc_qpel_uni_v12_8_neon) endfunc function ff_hevc_put_hevc_qpel_uni_v32_8_neon, export=1 - b X(ff_hevc_put_hevc_qpel_uni_v16_8_neon) + b X(ff_hevc_put_hevc_qpel_uni_v16_8_neon) endfunc function ff_hevc_put_hevc_qpel_uni_v48_8_neon, export=1 - b X(ff_hevc_put_hevc_qpel_uni_v16_8_neon) + b X(ff_hevc_put_hevc_qpel_uni_v16_8_neon) endfunc function ff_hevc_put_hevc_qpel_uni_v64_8_neon, export=1 - b X(ff_hevc_put_hevc_qpel_uni_v16_8_neon) + b X(ff_hevc_put_hevc_qpel_uni_v16_8_neon) endfunc function ff_hevc_put_hevc_pel_uni_w_pixels4_8_neon, export=1 @@ -1560,21 +1560,21 @@ endfunc #if HAVE_I8MM .macro calc_all2 - calc v30, v31, v16, v18, v20, v22, v24, v26, v28, v30, v17, v19, v21, v23, v25, v27, v29, v31 + calc v30, v31, v16, v18, v20, v22, v24, v26, v28, v30, v17, v19, v21, v23, v25, v27, v29, v31 b.eq 2f - calc v16, v17, v18, v20, v22, v24, v26, v28, v30, v16, v19, v21, v23, v25, v27, v29, v31, v17 + calc v16, v17, v18, v20, v22, v24, v26, v28, v30, v16, v19, v21, v23, v25, v27, v29, v31, v17 b.eq 2f - calc v18, v19, v20, v22, v24, v26, v28, v30, v16, v18, v21, v23, v25, v27, v29, v31, v17, v19 + calc v18, v19, v20, v22, v24, v26, v28, v30, v16, v18, v21, v23, v25, v27, v29, v31, v17, v19 b.eq 2f - calc v20, v21, v22, v24, v26, v28, v30, v16, v18, v20, v23, v25, v27, v29, v31, v17, v19, v21 + calc v20, v21, v22, v24, v26, v28, v30, v16, v18, v20, v23, v25, v27, v29, v31, v17, v19, v21 b.eq 2f - calc v22, v23, v24, v26, v28, v30, v16, v18, v20, v22, v25, v27, v29, v31, v17, v19, v21, v23 + calc v22, v23, v24, v26, v28, v30, v16, v18, v20, v22, v25, v27, v29, v31, v17, v19, v21, v23 b.eq 2f - calc v24, v25, v26, v28, v30, v16, v18, v20, v22, v24, v27, v29, v31, v17, v19, v21, v23, v25 + calc v24, v25, v26, v28, v30, v16, v18, v20, v22, v24, v27, v29, v31, v17, v19, v21, v23, v25 b.eq 2f - calc 
v26, v27, v28, v30, v16, v18, v20, v22, v24, v26, v29, v31, v17, v19, v21, v23, v25, v27 + calc v26, v27, v28, v30, v16, v18, v20, v22, v24, v26, v29, v31, v17, v19, v21, v23, v25, v27 b.eq 2f - calc v28, v29, v30, v16, v18, v20, v22, v24, v26, v28, v31, v17, v19, v21, v23, v25, v27, v29 + calc v28, v29, v30, v16, v18, v20, v22, v24, v26, v28, v31, v17, v19, v21, v23, v25, v27, v29 b.hi 1b .endm diff --git a/libavcodec/aarch64/opusdsp_neon.S b/libavcodec/aarch64/opusdsp_neon.S index 46c2be0874..1c88d7d123 100644 --- a/libavcodec/aarch64/opusdsp_neon.S +++ b/libavcodec/aarch64/opusdsp_neon.S @@ -34,13 +34,13 @@ endconst function ff_opus_deemphasis_neon, export=1 movrel x4, tab_st - ld1 {v4.4s}, [x4] + ld1 {v4.4s}, [x4] movrel x4, tab_x0 - ld1 {v5.4s}, [x4] + ld1 {v5.4s}, [x4] movrel x4, tab_x1 - ld1 {v6.4s}, [x4] + ld1 {v6.4s}, [x4] movrel x4, tab_x2 - ld1 {v7.4s}, [x4] + ld1 {v7.4s}, [x4] fmul v0.4s, v4.4s, v0.s[0] diff --git a/libavcodec/aarch64/vp8dsp_neon.S b/libavcodec/aarch64/vp8dsp_neon.S index 4bbf16d1a4..e385293ba7 100644 --- a/libavcodec/aarch64/vp8dsp_neon.S +++ b/libavcodec/aarch64/vp8dsp_neon.S @@ -330,32 +330,32 @@ endfunc // v17: hev // convert to signed value: - eor v3.16b, v3.16b, v21.16b // PS0 = P0 ^ 0x80 - eor v4.16b, v4.16b, v21.16b // QS0 = Q0 ^ 0x80 - - movi v20.8h, #3 - ssubl v18.8h, v4.8b, v3.8b // QS0 - PS0 - ssubl2 v19.8h, v4.16b, v3.16b // (widened to 16bit) - eor v2.16b, v2.16b, v21.16b // PS1 = P1 ^ 0x80 - eor v5.16b, v5.16b, v21.16b // QS1 = Q1 ^ 0x80 - mul v18.8h, v18.8h, v20.8h // w = 3 * (QS0 - PS0) - mul v19.8h, v19.8h, v20.8h - - sqsub v20.16b, v2.16b, v5.16b // clamp(PS1-QS1) - movi v22.16b, #4 - movi v23.16b, #3 + eor v3.16b, v3.16b, v21.16b // PS0 = P0 ^ 0x80 + eor v4.16b, v4.16b, v21.16b // QS0 = Q0 ^ 0x80 + + movi v20.8h, #3 + ssubl v18.8h, v4.8b, v3.8b // QS0 - PS0 + ssubl2 v19.8h, v4.16b, v3.16b // (widened to 16bit) + eor v2.16b, v2.16b, v21.16b // PS1 = P1 ^ 0x80 + eor v5.16b, v5.16b, v21.16b // QS1 = Q1 ^ 0x80 + mul v18.8h, v18.8h, v20.8h // w = 3 * (QS0 - PS0) + mul v19.8h, v19.8h, v20.8h + + sqsub v20.16b, v2.16b, v5.16b // clamp(PS1-QS1) + movi v22.16b, #4 + movi v23.16b, #3 .if \inner - and v20.16b, v20.16b, v17.16b // if(hev) w += clamp(PS1-QS1) + and v20.16b, v20.16b, v17.16b // if(hev) w += clamp(PS1-QS1) .endif - saddw v18.8h, v18.8h, v20.8b // w += clamp(PS1-QS1) - saddw2 v19.8h, v19.8h, v20.16b - sqxtn v18.8b, v18.8h // narrow result back into v18 - sqxtn2 v18.16b, v19.8h + saddw v18.8h, v18.8h, v20.8b // w += clamp(PS1-QS1) + saddw2 v19.8h, v19.8h, v20.16b + sqxtn v18.8b, v18.8h // narrow result back into v18 + sqxtn2 v18.16b, v19.8h .if !\inner && !\simple - eor v1.16b, v1.16b, v21.16b // PS2 = P2 ^ 0x80 - eor v6.16b, v6.16b, v21.16b // QS2 = Q2 ^ 0x80 + eor v1.16b, v1.16b, v21.16b // PS2 = P2 ^ 0x80 + eor v6.16b, v6.16b, v21.16b // QS2 = Q2 ^ 0x80 .endif - and v18.16b, v18.16b, v16.16b // w &= normal_limit + and v18.16b, v18.16b, v16.16b // w &= normal_limit // registers used at this point.. 
// v0 -> P3 (don't corrupt) @@ -375,44 +375,44 @@ endfunc // P0 = s2u(PS0 + c2); .if \simple - sqadd v19.16b, v18.16b, v22.16b // c1 = clamp((w&hev)+4) - sqadd v20.16b, v18.16b, v23.16b // c2 = clamp((w&hev)+3) - sshr v19.16b, v19.16b, #3 // c1 >>= 3 - sshr v20.16b, v20.16b, #3 // c2 >>= 3 - sqsub v4.16b, v4.16b, v19.16b // QS0 = clamp(QS0-c1) - sqadd v3.16b, v3.16b, v20.16b // PS0 = clamp(PS0+c2) - eor v4.16b, v4.16b, v21.16b // Q0 = QS0 ^ 0x80 - eor v3.16b, v3.16b, v21.16b // P0 = PS0 ^ 0x80 - eor v5.16b, v5.16b, v21.16b // Q1 = QS1 ^ 0x80 - eor v2.16b, v2.16b, v21.16b // P1 = PS1 ^ 0x80 + sqadd v19.16b, v18.16b, v22.16b // c1 = clamp((w&hev)+4) + sqadd v20.16b, v18.16b, v23.16b // c2 = clamp((w&hev)+3) + sshr v19.16b, v19.16b, #3 // c1 >>= 3 + sshr v20.16b, v20.16b, #3 // c2 >>= 3 + sqsub v4.16b, v4.16b, v19.16b // QS0 = clamp(QS0-c1) + sqadd v3.16b, v3.16b, v20.16b // PS0 = clamp(PS0+c2) + eor v4.16b, v4.16b, v21.16b // Q0 = QS0 ^ 0x80 + eor v3.16b, v3.16b, v21.16b // P0 = PS0 ^ 0x80 + eor v5.16b, v5.16b, v21.16b // Q1 = QS1 ^ 0x80 + eor v2.16b, v2.16b, v21.16b // P1 = PS1 ^ 0x80 .elseif \inner // the !is4tap case of filter_common, only used for inner blocks // c3 = ((c1&~hev) + 1) >> 1; // Q1 = s2u(QS1 - c3); // P1 = s2u(PS1 + c3); - sqadd v19.16b, v18.16b, v22.16b // c1 = clamp((w&hev)+4) - sqadd v20.16b, v18.16b, v23.16b // c2 = clamp((w&hev)+3) - sshr v19.16b, v19.16b, #3 // c1 >>= 3 - sshr v20.16b, v20.16b, #3 // c2 >>= 3 - sqsub v4.16b, v4.16b, v19.16b // QS0 = clamp(QS0-c1) - sqadd v3.16b, v3.16b, v20.16b // PS0 = clamp(PS0+c2) - bic v19.16b, v19.16b, v17.16b // c1 & ~hev - eor v4.16b, v4.16b, v21.16b // Q0 = QS0 ^ 0x80 - srshr v19.16b, v19.16b, #1 // c3 >>= 1 - eor v3.16b, v3.16b, v21.16b // P0 = PS0 ^ 0x80 - sqsub v5.16b, v5.16b, v19.16b // QS1 = clamp(QS1-c3) - sqadd v2.16b, v2.16b, v19.16b // PS1 = clamp(PS1+c3) - eor v5.16b, v5.16b, v21.16b // Q1 = QS1 ^ 0x80 - eor v2.16b, v2.16b, v21.16b // P1 = PS1 ^ 0x80 + sqadd v19.16b, v18.16b, v22.16b // c1 = clamp((w&hev)+4) + sqadd v20.16b, v18.16b, v23.16b // c2 = clamp((w&hev)+3) + sshr v19.16b, v19.16b, #3 // c1 >>= 3 + sshr v20.16b, v20.16b, #3 // c2 >>= 3 + sqsub v4.16b, v4.16b, v19.16b // QS0 = clamp(QS0-c1) + sqadd v3.16b, v3.16b, v20.16b // PS0 = clamp(PS0+c2) + bic v19.16b, v19.16b, v17.16b // c1 & ~hev + eor v4.16b, v4.16b, v21.16b // Q0 = QS0 ^ 0x80 + srshr v19.16b, v19.16b, #1 // c3 >>= 1 + eor v3.16b, v3.16b, v21.16b // P0 = PS0 ^ 0x80 + sqsub v5.16b, v5.16b, v19.16b // QS1 = clamp(QS1-c3) + sqadd v2.16b, v2.16b, v19.16b // PS1 = clamp(PS1+c3) + eor v5.16b, v5.16b, v21.16b // Q1 = QS1 ^ 0x80 + eor v2.16b, v2.16b, v21.16b // P1 = PS1 ^ 0x80 .else - and v20.16b, v18.16b, v17.16b // w & hev - sqadd v19.16b, v20.16b, v22.16b // c1 = clamp((w&hev)+4) - sqadd v20.16b, v20.16b, v23.16b // c2 = clamp((w&hev)+3) - sshr v19.16b, v19.16b, #3 // c1 >>= 3 - sshr v20.16b, v20.16b, #3 // c2 >>= 3 - bic v18.16b, v18.16b, v17.16b // w &= ~hev - sqsub v4.16b, v4.16b, v19.16b // QS0 = clamp(QS0-c1) - sqadd v3.16b, v3.16b, v20.16b // PS0 = clamp(PS0+c2) + and v20.16b, v18.16b, v17.16b // w & hev + sqadd v19.16b, v20.16b, v22.16b // c1 = clamp((w&hev)+4) + sqadd v20.16b, v20.16b, v23.16b // c2 = clamp((w&hev)+3) + sshr v19.16b, v19.16b, #3 // c1 >>= 3 + sshr v20.16b, v20.16b, #3 // c2 >>= 3 + bic v18.16b, v18.16b, v17.16b // w &= ~hev + sqsub v4.16b, v4.16b, v19.16b // QS0 = clamp(QS0-c1) + sqadd v3.16b, v3.16b, v20.16b // PS0 = clamp(PS0+c2) // filter_mbedge: // a = clamp((27*w + 63) >> 7); @@ -424,35 +424,35 @@ endfunc // a = clamp((9*w + 
63) >> 7); // Q2 = s2u(QS2 - a); // P2 = s2u(PS2 + a); - movi v17.8h, #63 - sshll v22.8h, v18.8b, #3 - sshll2 v23.8h, v18.16b, #3 - saddw v22.8h, v22.8h, v18.8b - saddw2 v23.8h, v23.8h, v18.16b - add v16.8h, v17.8h, v22.8h - add v17.8h, v17.8h, v23.8h // 9*w + 63 - add v19.8h, v16.8h, v22.8h - add v20.8h, v17.8h, v23.8h // 18*w + 63 - add v22.8h, v19.8h, v22.8h - add v23.8h, v20.8h, v23.8h // 27*w + 63 - sqshrn v16.8b, v16.8h, #7 - sqshrn2 v16.16b, v17.8h, #7 // clamp(( 9*w + 63)>>7) - sqshrn v19.8b, v19.8h, #7 - sqshrn2 v19.16b, v20.8h, #7 // clamp((18*w + 63)>>7) - sqshrn v22.8b, v22.8h, #7 - sqshrn2 v22.16b, v23.8h, #7 // clamp((27*w + 63)>>7) - sqadd v1.16b, v1.16b, v16.16b // PS2 = clamp(PS2+a) - sqsub v6.16b, v6.16b, v16.16b // QS2 = clamp(QS2-a) - sqadd v2.16b, v2.16b, v19.16b // PS1 = clamp(PS1+a) - sqsub v5.16b, v5.16b, v19.16b // QS1 = clamp(QS1-a) - sqadd v3.16b, v3.16b, v22.16b // PS0 = clamp(PS0+a) - sqsub v4.16b, v4.16b, v22.16b // QS0 = clamp(QS0-a) - eor v3.16b, v3.16b, v21.16b // P0 = PS0 ^ 0x80 - eor v4.16b, v4.16b, v21.16b // Q0 = QS0 ^ 0x80 - eor v2.16b, v2.16b, v21.16b // P1 = PS1 ^ 0x80 - eor v5.16b, v5.16b, v21.16b // Q1 = QS1 ^ 0x80 - eor v1.16b, v1.16b, v21.16b // P2 = PS2 ^ 0x80 - eor v6.16b, v6.16b, v21.16b // Q2 = QS2 ^ 0x80 + movi v17.8h, #63 + sshll v22.8h, v18.8b, #3 + sshll2 v23.8h, v18.16b, #3 + saddw v22.8h, v22.8h, v18.8b + saddw2 v23.8h, v23.8h, v18.16b + add v16.8h, v17.8h, v22.8h + add v17.8h, v17.8h, v23.8h // 9*w + 63 + add v19.8h, v16.8h, v22.8h + add v20.8h, v17.8h, v23.8h // 18*w + 63 + add v22.8h, v19.8h, v22.8h + add v23.8h, v20.8h, v23.8h // 27*w + 63 + sqshrn v16.8b, v16.8h, #7 + sqshrn2 v16.16b, v17.8h, #7 // clamp(( 9*w + 63)>>7) + sqshrn v19.8b, v19.8h, #7 + sqshrn2 v19.16b, v20.8h, #7 // clamp((18*w + 63)>>7) + sqshrn v22.8b, v22.8h, #7 + sqshrn2 v22.16b, v23.8h, #7 // clamp((27*w + 63)>>7) + sqadd v1.16b, v1.16b, v16.16b // PS2 = clamp(PS2+a) + sqsub v6.16b, v6.16b, v16.16b // QS2 = clamp(QS2-a) + sqadd v2.16b, v2.16b, v19.16b // PS1 = clamp(PS1+a) + sqsub v5.16b, v5.16b, v19.16b // QS1 = clamp(QS1-a) + sqadd v3.16b, v3.16b, v22.16b // PS0 = clamp(PS0+a) + sqsub v4.16b, v4.16b, v22.16b // QS0 = clamp(QS0-a) + eor v3.16b, v3.16b, v21.16b // P0 = PS0 ^ 0x80 + eor v4.16b, v4.16b, v21.16b // Q0 = QS0 ^ 0x80 + eor v2.16b, v2.16b, v21.16b // P1 = PS1 ^ 0x80 + eor v5.16b, v5.16b, v21.16b // Q1 = QS1 ^ 0x80 + eor v1.16b, v1.16b, v21.16b // P2 = PS2 ^ 0x80 + eor v6.16b, v6.16b, v21.16b // Q2 = QS2 ^ 0x80 .endif .endm @@ -507,48 +507,48 @@ function ff_vp8_v_loop_filter8uv\name\()_neon, export=1 sub x0, x0, x2, lsl #2 sub x1, x1, x2, lsl #2 // Load pixels: - ld1 {v0.d}[0], [x0], x2 // P3 - ld1 {v0.d}[1], [x1], x2 // P3 - ld1 {v1.d}[0], [x0], x2 // P2 - ld1 {v1.d}[1], [x1], x2 // P2 - ld1 {v2.d}[0], [x0], x2 // P1 - ld1 {v2.d}[1], [x1], x2 // P1 - ld1 {v3.d}[0], [x0], x2 // P0 - ld1 {v3.d}[1], [x1], x2 // P0 - ld1 {v4.d}[0], [x0], x2 // Q0 - ld1 {v4.d}[1], [x1], x2 // Q0 - ld1 {v5.d}[0], [x0], x2 // Q1 - ld1 {v5.d}[1], [x1], x2 // Q1 - ld1 {v6.d}[0], [x0], x2 // Q2 - ld1 {v6.d}[1], [x1], x2 // Q2 - ld1 {v7.d}[0], [x0] // Q3 - ld1 {v7.d}[1], [x1] // Q3 - - dup v22.16b, w3 // flim_E - dup v23.16b, w4 // flim_I + ld1 {v0.d}[0], [x0], x2 // P3 + ld1 {v0.d}[1], [x1], x2 // P3 + ld1 {v1.d}[0], [x0], x2 // P2 + ld1 {v1.d}[1], [x1], x2 // P2 + ld1 {v2.d}[0], [x0], x2 // P1 + ld1 {v2.d}[1], [x1], x2 // P1 + ld1 {v3.d}[0], [x0], x2 // P0 + ld1 {v3.d}[1], [x1], x2 // P0 + ld1 {v4.d}[0], [x0], x2 // Q0 + ld1 {v4.d}[1], [x1], x2 // Q0 + ld1 {v5.d}[0], [x0], x2 // 
Q1 + ld1 {v5.d}[1], [x1], x2 // Q1 + ld1 {v6.d}[0], [x0], x2 // Q2 + ld1 {v6.d}[1], [x1], x2 // Q2 + ld1 {v7.d}[0], [x0] // Q3 + ld1 {v7.d}[1], [x1] // Q3 + + dup v22.16b, w3 // flim_E + dup v23.16b, w4 // flim_I vp8_loop_filter inner=\inner, hev_thresh=w5 // back up to P2: u,v -= stride * 6 - sub x0, x0, x2, lsl #2 - sub x1, x1, x2, lsl #2 - sub x0, x0, x2, lsl #1 - sub x1, x1, x2, lsl #1 + sub x0, x0, x2, lsl #2 + sub x1, x1, x2, lsl #2 + sub x0, x0, x2, lsl #1 + sub x1, x1, x2, lsl #1 // Store pixels: - st1 {v1.d}[0], [x0], x2 // P2 - st1 {v1.d}[1], [x1], x2 // P2 - st1 {v2.d}[0], [x0], x2 // P1 - st1 {v2.d}[1], [x1], x2 // P1 - st1 {v3.d}[0], [x0], x2 // P0 - st1 {v3.d}[1], [x1], x2 // P0 - st1 {v4.d}[0], [x0], x2 // Q0 - st1 {v4.d}[1], [x1], x2 // Q0 - st1 {v5.d}[0], [x0], x2 // Q1 - st1 {v5.d}[1], [x1], x2 // Q1 - st1 {v6.d}[0], [x0] // Q2 - st1 {v6.d}[1], [x1] // Q2 + st1 {v1.d}[0], [x0], x2 // P2 + st1 {v1.d}[1], [x1], x2 // P2 + st1 {v2.d}[0], [x0], x2 // P1 + st1 {v2.d}[1], [x1], x2 // P1 + st1 {v3.d}[0], [x0], x2 // P0 + st1 {v3.d}[1], [x1], x2 // P0 + st1 {v4.d}[0], [x0], x2 // Q0 + st1 {v4.d}[1], [x1], x2 // Q0 + st1 {v5.d}[0], [x0], x2 // Q1 + st1 {v5.d}[1], [x1], x2 // Q1 + st1 {v6.d}[0], [x0] // Q2 + st1 {v6.d}[1], [x1] // Q2 ret endfunc @@ -579,7 +579,7 @@ function ff_vp8_h_loop_filter16\name\()_neon, export=1 ld1 {v6.d}[1], [x0], x1 ld1 {v7.d}[1], [x0], x1 - transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v30, v31 + transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v30, v31 dup v22.16b, w2 // flim_E .if !\simple @@ -590,7 +590,7 @@ function ff_vp8_h_loop_filter16\name\()_neon, export=1 sub x0, x0, x1, lsl #4 // backup 16 rows - transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v30, v31 + transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v30, v31 // Store pixels: st1 {v0.d}[0], [x0], x1 @@ -624,24 +624,24 @@ function ff_vp8_h_loop_filter8uv\name\()_neon, export=1 sub x1, x1, #4 // Load pixels: - ld1 {v0.d}[0], [x0], x2 // load u - ld1 {v0.d}[1], [x1], x2 // load v - ld1 {v1.d}[0], [x0], x2 - ld1 {v1.d}[1], [x1], x2 - ld1 {v2.d}[0], [x0], x2 - ld1 {v2.d}[1], [x1], x2 - ld1 {v3.d}[0], [x0], x2 - ld1 {v3.d}[1], [x1], x2 - ld1 {v4.d}[0], [x0], x2 - ld1 {v4.d}[1], [x1], x2 - ld1 {v5.d}[0], [x0], x2 - ld1 {v5.d}[1], [x1], x2 - ld1 {v6.d}[0], [x0], x2 - ld1 {v6.d}[1], [x1], x2 - ld1 {v7.d}[0], [x0], x2 - ld1 {v7.d}[1], [x1], x2 - - transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v30, v31 + ld1 {v0.d}[0], [x0], x2 // load u + ld1 {v0.d}[1], [x1], x2 // load v + ld1 {v1.d}[0], [x0], x2 + ld1 {v1.d}[1], [x1], x2 + ld1 {v2.d}[0], [x0], x2 + ld1 {v2.d}[1], [x1], x2 + ld1 {v3.d}[0], [x0], x2 + ld1 {v3.d}[1], [x1], x2 + ld1 {v4.d}[0], [x0], x2 + ld1 {v4.d}[1], [x1], x2 + ld1 {v5.d}[0], [x0], x2 + ld1 {v5.d}[1], [x1], x2 + ld1 {v6.d}[0], [x0], x2 + ld1 {v6.d}[1], [x1], x2 + ld1 {v7.d}[0], [x0], x2 + ld1 {v7.d}[1], [x1], x2 + + transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v30, v31 dup v22.16b, w3 // flim_E dup v23.16b, w4 // flim_I @@ -651,25 +651,25 @@ function ff_vp8_h_loop_filter8uv\name\()_neon, export=1 sub x0, x0, x2, lsl #3 // backup u 8 rows sub x1, x1, x2, lsl #3 // backup v 8 rows - transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v30, v31 + transpose_8x16B v0, v1, v2, v3, v4, v5, v6, v7, v30, v31 // Store pixels: - st1 {v0.d}[0], [x0], x2 // load u - st1 {v0.d}[1], [x1], x2 // load v - st1 {v1.d}[0], [x0], x2 - st1 {v1.d}[1], [x1], x2 - st1 {v2.d}[0], [x0], x2 - st1 {v2.d}[1], [x1], x2 - st1 {v3.d}[0], [x0], x2 - st1 {v3.d}[1], [x1], x2 - st1 {v4.d}[0], [x0], x2 - st1 {v4.d}[1], 
[x1], x2 - st1 {v5.d}[0], [x0], x2 - st1 {v5.d}[1], [x1], x2 - st1 {v6.d}[0], [x0], x2 - st1 {v6.d}[1], [x1], x2 - st1 {v7.d}[0], [x0] - st1 {v7.d}[1], [x1] + st1 {v0.d}[0], [x0], x2 // load u + st1 {v0.d}[1], [x1], x2 // load v + st1 {v1.d}[0], [x0], x2 + st1 {v1.d}[1], [x1], x2 + st1 {v2.d}[0], [x0], x2 + st1 {v2.d}[1], [x1], x2 + st1 {v3.d}[0], [x0], x2 + st1 {v3.d}[1], [x1], x2 + st1 {v4.d}[0], [x0], x2 + st1 {v4.d}[1], [x1], x2 + st1 {v5.d}[0], [x0], x2 + st1 {v5.d}[1], [x1], x2 + st1 {v6.d}[0], [x0], x2 + st1 {v6.d}[1], [x1], x2 + st1 {v7.d}[0], [x0] + st1 {v7.d}[1], [x1] ret diff --git a/libavutil/aarch64/tx_float_neon.S b/libavutil/aarch64/tx_float_neon.S index e5531dcc7c..9916ad4142 100644 --- a/libavutil/aarch64/tx_float_neon.S +++ b/libavutil/aarch64/tx_float_neon.S @@ -729,9 +729,9 @@ FFT16_FN ns_float, 1 .endm .macro SR_COMBINE_4 len, part, off - add x10, x1, x21 - add x11, x1, x21, lsl #1 - add x12, x1, x22 + add x10, x1, x21 + add x11, x1, x21, lsl #1 + add x12, x1, x22 ldp q0, q1, [x1, #((0 + \part)*32 + \off)] ldp q4, q5, [x1, #((2 + \part)*32 + \off)] @@ -759,9 +759,9 @@ FFT16_FN ns_float, 1 .endm .macro SR_COMBINE_FULL len, off=0 - add x10, x1, x21 - add x11, x1, x21, lsl #1 - add x12, x1, x22 + add x10, x1, x21 + add x11, x1, x21, lsl #1 + add x12, x1, x22 SR_COMBINE_4 \len, 0, \off SR_COMBINE_4 \len, 1, \off From patchwork Tue Oct 17 11:45:59 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?q?Martin_Storsj=C3=B6?= X-Patchwork-Id: 44279 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a21:3e0b:b0:15d:8365:d4b8 with SMTP id bk11csp301822pzc; Tue, 17 Oct 2023 04:46:52 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGg/sdQKHbEj9QCAsahwyeOHS5y2f//+h6lVexlf1yG0BqhsdW7RlNpUUi9eX8bJtPpvMoG X-Received: by 2002:a17:907:980c:b0:9bf:5df1:38c9 with SMTP id ji12-20020a170907980c00b009bf5df138c9mr1486948ejc.9.1697543212237; Tue, 17 Oct 2023 04:46:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1697543212; cv=none; d=google.com; s=arc-20160816; b=PerzM0klwrKQMaZHSBg4m5Ue83OX3eiGz29XghSdoalwMLF3wpN9GEhMXGNSwW00s8 B5lCBdhQ4/FDfi7FqEn2dPpJLrP+MDSxU+fzb6z2jwhEQ+tlelWROMIEV5x3wxOONnef amTvXJUKkjWtZFXBRnFcvC6RROdyFbBP039KE+US5Ce1Gl8aBBPv1Ol+bh3KKsEWofRU 5V/1JDbeXIUugWMgdTCF8k5t59quqtR++fzdGDPICl/BO+5t6YwVpZBJLWLeWd1LgBJl fCuWtGB5VHXpzoeTrIbaDUzVXZgmtq6eYVcIPE87H1ncV3LpCUjpp54DQv7wCO1hkych hXiQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:dkim-signature:delivered-to; bh=z61hFbAGAcOROFasrakAw5Mr2HIPDq96RONNBth47z0=; fh=4VBelKDE4DH3L7jF6H/1Jmu78FdN+YP76yfdJCQTJ30=; b=vuV6fPzpTngPD5H8qXM0jdexAXxTeXJWMOesTXFYFhGtjrNy/Z3qMz8EjGEv+r3OG4 QUKR6ne8hPrVdDgO+nmXYIzdhJOMyHvKisYbNNLkpu6+win/vwgkOBqtnoKQQlYp2Z9b d0uEIStphXUBfQGQizESwDT+UOaqRKg3zuHRH5lnF4O/wBEAJg/il83XymMolwdqSOLO MOZCjKYuuFOxEvrlkgfTDtTOHJXEGa1FkYjnxuixoe3NFk/kQUNxqDarqGc3rHFVeQxJ z/O+4Soju5u0jRqpBazLXX80TPqSmI2q9wQ00aWVvSU338FXrv6tZvoC9R91hYhjO2nA lS6g== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@martin-st.20230601.gappssmtp.com header.s=20230601 header.b="Ky7kphy/"; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Return-Path: Received: 
[88.192.28.243]) by smtp.gmail.com with ESMTPSA id x25-20020a19f619000000b0050797a35f8csm244532lfe.162.2023.10.17.04.46.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 17 Oct 2023 04:46:04 -0700 (PDT) From: =?utf-8?q?Martin_Storsj=C3=B6?= To: ffmpeg-devel@ffmpeg.org Date: Tue, 17 Oct 2023 14:45:59 +0300 Message-Id: <20231017114601.1374712-4-martin@martin.st> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231017114601.1374712-1-martin@martin.st> References: <20231017114601.1374712-1-martin@martin.st> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 4/5] aarch64: Manually tweak vertical alignment/indentation in tx_float_neon.S X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: jdek@itanimul.li Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: jQEzOzhk83dt Favour left aligned columns over right aligned columns. In principle either style should be ok, but some of the cases easily lead to incorrect indentation in the surrounding code (see a couple of cases fixed up in the preceding patch), and show up in automatic indentation correction attempts. --- libavutil/aarch64/tx_float_neon.S | 120 +++++++++++++++--------------- 1 file changed, 60 insertions(+), 60 deletions(-) diff --git a/libavutil/aarch64/tx_float_neon.S b/libavutil/aarch64/tx_float_neon.S index 9916ad4142..30ffa2a1d4 100644 --- a/libavutil/aarch64/tx_float_neon.S +++ b/libavutil/aarch64/tx_float_neon.S @@ -733,12 +733,12 @@ FFT16_FN ns_float, 1 add x11, x1, x21, lsl #1 add x12, x1, x22 - ldp q0, q1, [x1, #((0 + \part)*32 + \off)] - ldp q4, q5, [x1, #((2 + \part)*32 + \off)] - ldp q2, q3, [x10, #((0 + \part)*32 + \off)] - ldp q6, q7, [x10, #((2 + \part)*32 + \off)] + ldp q0, q1, [x1, #((0 + \part)*32 + \off)] + ldp q4, q5, [x1, #((2 + \part)*32 + \off)] + ldp q2, q3, [x10, #((0 + \part)*32 + \off)] + ldp q6, q7, [x10, #((2 + \part)*32 + \off)] - ldp q8, q9, [x11, #((0 + \part)*32 + \off)] + ldp q8, q9, [x11, #((0 + \part)*32 + \off)] ldp q10, q11, [x11, #((2 + \part)*32 + \off)] ldp q12, q13, [x12, #((0 + \part)*32 + \off)] ldp q14, q15, [x12, #((2 + \part)*32 + \off)] @@ -747,12 +747,12 @@ FFT16_FN ns_float, 1 v8, v9, v10, v11, v12, v13, v14, v15, \ x7, x8, x9, 0 - stp q0, q1, [x1, #((0 + \part)*32 + \off)] - stp q4, q5, [x1, #((2 + \part)*32 + \off)] - stp q2, q3, [x10, #((0 + \part)*32 + \off)] - stp q6, q7, [x10, #((2 + \part)*32 + \off)] + stp q0, q1, [x1, #((0 + \part)*32 + \off)] + stp q4, q5, [x1, #((2 + \part)*32 + \off)] + stp q2, q3, [x10, #((0 + \part)*32 + \off)] + stp q6, q7, [x10, #((2 + \part)*32 + \off)] - stp q8, q9, [x11, #((0 + \part)*32 + \off)] + stp q8, q9, [x11, #((0 + \part)*32 + \off)] stp q12, q13, [x11, #((2 + \part)*32 + \off)] stp q10, q11, [x12, #((0 + \part)*32 + \off)] stp q14, q15, [x12, #((2 + \part)*32 + \off)] @@ -775,12 +775,12 @@ FFT16_FN ns_float, 1 add x12, x15, #((\part)*32 + \off) add x13, x16, #((\part)*32 + \off) - ldp q0, q1, [x10] - ldp q4, q5, [x10, #(2*32)] - ldp q2, q3, [x11] - ldp q6, q7, [x11, #(2*32)] + ldp q0, q1, [x10] + ldp q4, q5, [x10, #(2*32)] + ldp q2, q3, [x11] + ldp q6, q7, [x11, #(2*32)] - ldp q8, q9, [x12] + ldp q8, q9, [x12] ldp q10, q11, [x12, #(2*32)] ldp q12, q13, [x13] ldp q14, q15, [x13, #(2*32)] @@ -800,10 +800,10 @@ FFT16_FN ns_float, 1 zip1 v22.2d, v3.2d, v7.2d zip2 v23.2d, v3.2d, v7.2d - ldp q0, q1, [x10, 
#(1*32)] - ldp q4, q5, [x10, #(3*32)] - ldp q2, q3, [x11, #(1*32)] - ldp q6, q7, [x11, #(3*32)] + ldp q0, q1, [x10, #(1*32)] + ldp q4, q5, [x10, #(3*32)] + ldp q2, q3, [x11, #(1*32)] + ldp q6, q7, [x11, #(3*32)] st1 { v16.4s, v17.4s, v18.4s, v19.4s }, [x10], #64 st1 { v20.4s, v21.4s, v22.4s, v23.4s }, [x11], #64 @@ -817,7 +817,7 @@ FFT16_FN ns_float, 1 zip1 v26.2d, v11.2d, v15.2d zip2 v27.2d, v11.2d, v15.2d - ldp q8, q9, [x12, #(1*32)] + ldp q8, q9, [x12, #(1*32)] ldp q10, q11, [x12, #(3*32)] ldp q12, q13, [x13, #(1*32)] ldp q14, q15, [x13, #(3*32)] @@ -875,9 +875,9 @@ function ff_tx_fft32_\name\()_neon, export=1 SETUP_SR_RECOMB 32, x7, x8, x9 SETUP_LUT \no_perm - LOAD_INPUT 0, 1, 2, 3, x2, \no_perm - LOAD_INPUT 4, 5, 6, 7, x2, \no_perm - LOAD_INPUT 8, 9, 10, 11, x2, \no_perm + LOAD_INPUT 0, 1, 2, 3, x2, \no_perm + LOAD_INPUT 4, 5, 6, 7, x2, \no_perm + LOAD_INPUT 8, 9, 10, 11, x2, \no_perm LOAD_INPUT 12, 13, 14, 15, x2, \no_perm FFT8_X2 v8, v9, v10, v11, v12, v13, v14, v15 @@ -982,37 +982,37 @@ function ff_tx_fft_sr_\name\()_neon, export=1 32: SETUP_SR_RECOMB 32, x7, x8, x9 - LOAD_INPUT 0, 1, 2, 3, x2, \no_perm - LOAD_INPUT 4, 6, 5, 7, x2, \no_perm, 1 - LOAD_INPUT 8, 9, 10, 11, x2, \no_perm + LOAD_INPUT 0, 1, 2, 3, x2, \no_perm + LOAD_INPUT 4, 6, 5, 7, x2, \no_perm, 1 + LOAD_INPUT 8, 9, 10, 11, x2, \no_perm LOAD_INPUT 12, 13, 14, 15, x2, \no_perm FFT8_X2 v8, v9, v10, v11, v12, v13, v14, v15 FFT16 v0, v1, v2, v3, v4, v6, v5, v7 - SR_COMBINE v0, v1, v2, v3, v4, v6, v5, v7, \ - v8, v9, v10, v11, v12, v13, v14, v15, \ - x7, x8, x9, 0 + SR_COMBINE v0, v1, v2, v3, v4, v6, v5, v7, \ + v8, v9, v10, v11, v12, v13, v14, v15, \ + x7, x8, x9, 0 - stp q2, q3, [x1, #32*1] - stp q6, q7, [x1, #32*3] + stp q2, q3, [x1, #32*1] + stp q6, q7, [x1, #32*3] stp q10, q11, [x1, #32*5] stp q14, q15, [x1, #32*7] cmp w20, #32 b.gt 64f - stp q0, q1, [x1, #32*0] - stp q4, q5, [x1, #32*2] - stp q8, q9, [x1, #32*4] + stp q0, q1, [x1, #32*0] + stp q4, q5, [x1, #32*2] + stp q8, q9, [x1, #32*4] stp q12, q13, [x1, #32*6] ret 64: SETUP_SR_RECOMB 64, x7, x8, x9 - LOAD_INPUT 2, 3, 10, 11, x2, \no_perm, 1 - LOAD_INPUT 6, 14, 7, 15, x2, \no_perm, 1 + LOAD_INPUT 2, 3, 10, 11, x2, \no_perm, 1 + LOAD_INPUT 6, 14, 7, 15, x2, \no_perm, 1 FFT16 v2, v3, v10, v11, v6, v14, v7, v15 @@ -1033,38 +1033,38 @@ function ff_tx_fft_sr_\name\()_neon, export=1 // TODO: investigate doing the 2 combines like in deinterleave // TODO: experiment with spilling to gprs and converting to HALF or full - SR_COMBINE_LITE v0, v1, v8, v9, \ - v2, v3, v16, v17, \ + SR_COMBINE_LITE v0, v1, v8, v9, \ + v2, v3, v16, v17, \ v24, v25, v26, v27, \ v28, v29, v30, 0 - stp q0, q1, [x1, #32* 0] - stp q8, q9, [x1, #32* 4] - stp q2, q3, [x1, #32* 8] + stp q0, q1, [x1, #32* 0] + stp q8, q9, [x1, #32* 4] + stp q2, q3, [x1, #32* 8] stp q16, q17, [x1, #32*12] - SR_COMBINE_HALF v4, v5, v12, v13, \ - v6, v7, v20, v21, \ + SR_COMBINE_HALF v4, v5, v12, v13, \ + v6, v7, v20, v21, \ v24, v25, v26, v27, \ v28, v29, v30, v0, v1, v8, 1 - stp q4, q20, [x1, #32* 2] + stp q4, q20, [x1, #32* 2] stp q12, q21, [x1, #32* 6] - stp q6, q5, [x1, #32*10] - stp q7, q13, [x1, #32*14] + stp q6, q5, [x1, #32*10] + stp q7, q13, [x1, #32*14] - ldp q2, q3, [x1, #32*1] - ldp q6, q7, [x1, #32*3] + ldp q2, q3, [x1, #32*1] + ldp q6, q7, [x1, #32*3] ldp q12, q13, [x1, #32*5] ldp q16, q17, [x1, #32*7] - SR_COMBINE v2, v3, v12, v13, v6, v16, v7, v17, \ + SR_COMBINE v2, v3, v12, v13, v6, v16, v7, v17, \ v10, v11, v14, v15, v18, v19, v22, v23, \ - x7, x8, x9, 0, \ + x7, x8, x9, 0, \ v24, v25, v26, v27, v28, v29, 
v30, v8, v0, v1, v4, v5 - stp q2, q3, [x1, #32* 1] - stp q6, q7, [x1, #32* 3] + stp q2, q3, [x1, #32* 1] + stp q6, q7, [x1, #32* 3] stp q12, q13, [x1, #32* 5] stp q16, q17, [x1, #32* 7] @@ -1198,13 +1198,13 @@ SR_TRANSFORM_DEF 131072 mov x10, v23.d[0] mov x11, v23.d[1] - SR_COMBINE_LITE v0, v1, v8, v9, \ - v2, v3, v16, v17, \ + SR_COMBINE_LITE v0, v1, v8, v9, \ + v2, v3, v16, v17, \ v24, v25, v26, v27, \ v28, v29, v30, 0 - SR_COMBINE_HALF v4, v5, v12, v13, \ - v6, v7, v20, v21, \ + SR_COMBINE_HALF v4, v5, v12, v13, \ + v6, v7, v20, v21, \ v24, v25, v26, v27, \ v28, v29, v30, v23, v24, v26, 1 @@ -1236,7 +1236,7 @@ SR_TRANSFORM_DEF 131072 zip2 v3.2d, v17.2d, v13.2d // stp is faster by a little on A53, but this is faster on M1s (theory) - ldp q8, q9, [x1, #32*1] + ldp q8, q9, [x1, #32*1] ldp q12, q13, [x1, #32*5] st1 { v23.4s, v24.4s, v25.4s, v26.4s }, [x12], #64 // 32* 0...1 @@ -1247,12 +1247,12 @@ SR_TRANSFORM_DEF 131072 mov v23.d[0], x10 mov v23.d[1], x11 - ldp q6, q7, [x1, #32*3] + ldp q6, q7, [x1, #32*3] ldp q16, q17, [x1, #32*7] - SR_COMBINE v8, v9, v12, v13, v6, v16, v7, v17, \ + SR_COMBINE v8, v9, v12, v13, v6, v16, v7, v17, \ v10, v11, v14, v15, v18, v19, v22, v23, \ - x7, x8, x9, 0, \ + x7, x8, x9, 0, \ v24, v25, v26, v27, v28, v29, v30, v4, v0, v1, v5, v20 zip1 v0.2d, v8.2d, v6.2d From patchwork Tue Oct 17 11:46:00 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: =?utf-8?q?Martin_Storsj=C3=B6?= X-Patchwork-Id: 44280 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a21:3e0b:b0:15d:8365:d4b8 with SMTP id bk11csp301925pzc; Tue, 17 Oct 2023 04:47:04 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGtPa8cKs9fGSvatYLI0UtJoab7/A/rAkiwGoq6IhjlS84qEsuAO189D5GJYaQDzbl3/vNV X-Received: by 2002:a17:907:1b22:b0:9c4:bb5f:970f with SMTP id mp34-20020a1709071b2200b009c4bb5f970fmr1439485ejc.32.1697543223975; Tue, 17 Oct 2023 04:47:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1697543223; cv=none; d=google.com; s=arc-20160816; b=Q8lk4M1ceeDb8YxMQwGbYVB1NmWyZT03kNveKpvtBhIZmkTC+b6DQ97NrUPm7iUmdC CCiTR8VIqFXEdbbH9eb8jpIjZD5BZ8hjg14BLsUD/Hv1zU6MBv9NDTCJGsIvzVwt5Qea mZladmsOJRmBtpMyx9XF4/MMTdU3zY0CyRr+PeL3CPFYNvNU387QBUffCPiZ9guJepGy 1m4hzgNJyExZvR614fAkTbWjteQhCIaCk/3askqwKcBn0oYCaPDie05jaQl0P8+lHk8H w08Yen8UdfpCA4jue+gN6kW9Qgl/wfn0V/uRttAMsfhblpjnDRaWMXo+hdURfsOUFYhP FppQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:references:in-reply-to :message-id:date:to:from:dkim-signature:delivered-to; bh=jJB3DnW+NLgGTa3DtFRxG3QsG7eQWv2dCLTRkt9S5JE=; fh=4VBelKDE4DH3L7jF6H/1Jmu78FdN+YP76yfdJCQTJ30=; b=JTPENDZN+Yr12NX6kLQJG4SxTP6g7/AV8s0iYnVIvXRreFxKLM8w7c88A6e9YSsgFm LPrgPorBg6+IVqyIHANky8oapRQsl7IhPCLebFxu1nMhOx4TiRV2FvvDFnCd/XB4bYSx OfxXth8lkD5yWvYZSRp6axWNQn3I+UtXvmwSBlTIEGUMeboR6ycrQwNcgjzGHJGiLwPk A8N5ub6t5WMArwd3nltaHh66EVT5uVBTczAQTwJmM82ZfpQFm6qcS3TRzGsC4wBngFHv DtxAeCLbLS84jNSf9Ic0fWnb2mLdtsZWRulpynKRzdACeU1eCCitfyhDG2C6CSN2Hbcz uFew== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@martin-st.20230601.gappssmtp.com header.s=20230601 header.b="WHmicHM/"; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org Return-Path: Received: from ffbox0-bg.mplayerhq.hu 
[88.192.28.243]) by smtp.gmail.com with ESMTPSA id x25-20020a19f619000000b0050797a35f8csm244532lfe.162.2023.10.17.04.46.04 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 17 Oct 2023 04:46:04 -0700 (PDT) From: =?utf-8?q?Martin_Storsj=C3=B6?= To: ffmpeg-devel@ffmpeg.org Date: Tue, 17 Oct 2023 14:46:00 +0300 Message-Id: <20231017114601.1374712-5-martin@martin.st> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20231017114601.1374712-1-martin@martin.st> References: <20231017114601.1374712-1-martin@martin.st> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH 5/5] aarch64: Reindent all assembly to 8/24 column indentation X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: jdek@itanimul.li Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: WpVYxHJIxmWx --- libavcodec/aarch64/aacpsdsp_neon.S | 218 +-- libavcodec/aarch64/opusdsp_neon.S | 102 +- libswresample/aarch64/resample.S | 80 +- libswscale/aarch64/hscale.S | 2250 ++++++++++++++-------------- libswscale/aarch64/output.S | 330 ++-- libswscale/aarch64/yuv2rgb_neon.S | 220 +-- 6 files changed, 1600 insertions(+), 1600 deletions(-) diff --git a/libavcodec/aarch64/aacpsdsp_neon.S b/libavcodec/aarch64/aacpsdsp_neon.S index 686c62eb2e..f8cb0b2959 100644 --- a/libavcodec/aarch64/aacpsdsp_neon.S +++ b/libavcodec/aarch64/aacpsdsp_neon.S @@ -19,130 +19,130 @@ #include "libavutil/aarch64/asm.S" function ff_ps_add_squares_neon, export=1 -1: ld1 {v0.4s,v1.4s}, [x1], #32 - fmul v0.4s, v0.4s, v0.4s - fmul v1.4s, v1.4s, v1.4s - faddp v2.4s, v0.4s, v1.4s - ld1 {v3.4s}, [x0] - fadd v3.4s, v3.4s, v2.4s - st1 {v3.4s}, [x0], #16 - subs w2, w2, #4 - b.gt 1b +1: ld1 {v0.4s,v1.4s}, [x1], #32 + fmul v0.4s, v0.4s, v0.4s + fmul v1.4s, v1.4s, v1.4s + faddp v2.4s, v0.4s, v1.4s + ld1 {v3.4s}, [x0] + fadd v3.4s, v3.4s, v2.4s + st1 {v3.4s}, [x0], #16 + subs w2, w2, #4 + b.gt 1b ret endfunc function ff_ps_mul_pair_single_neon, export=1 -1: ld1 {v0.4s,v1.4s}, [x1], #32 - ld1 {v2.4s}, [x2], #16 - zip1 v3.4s, v2.4s, v2.4s - zip2 v4.4s, v2.4s, v2.4s - fmul v0.4s, v0.4s, v3.4s - fmul v1.4s, v1.4s, v4.4s - st1 {v0.4s,v1.4s}, [x0], #32 - subs w3, w3, #4 - b.gt 1b +1: ld1 {v0.4s,v1.4s}, [x1], #32 + ld1 {v2.4s}, [x2], #16 + zip1 v3.4s, v2.4s, v2.4s + zip2 v4.4s, v2.4s, v2.4s + fmul v0.4s, v0.4s, v3.4s + fmul v1.4s, v1.4s, v4.4s + st1 {v0.4s,v1.4s}, [x0], #32 + subs w3, w3, #4 + b.gt 1b ret endfunc function ff_ps_stereo_interpolate_neon, export=1 - ld1 {v0.4s}, [x2] - ld1 {v1.4s}, [x3] - zip1 v4.4s, v0.4s, v0.4s - zip2 v5.4s, v0.4s, v0.4s - zip1 v6.4s, v1.4s, v1.4s - zip2 v7.4s, v1.4s, v1.4s -1: ld1 {v2.2s}, [x0] - ld1 {v3.2s}, [x1] - fadd v4.4s, v4.4s, v6.4s - fadd v5.4s, v5.4s, v7.4s - mov v2.d[1], v2.d[0] - mov v3.d[1], v3.d[0] - fmul v2.4s, v2.4s, v4.4s - fmla v2.4s, v3.4s, v5.4s - st1 {v2.d}[0], [x0], #8 - st1 {v2.d}[1], [x1], #8 - subs w4, w4, #1 - b.gt 1b + ld1 {v0.4s}, [x2] + ld1 {v1.4s}, [x3] + zip1 v4.4s, v0.4s, v0.4s + zip2 v5.4s, v0.4s, v0.4s + zip1 v6.4s, v1.4s, v1.4s + zip2 v7.4s, v1.4s, v1.4s +1: ld1 {v2.2s}, [x0] + ld1 {v3.2s}, [x1] + fadd v4.4s, v4.4s, v6.4s + fadd v5.4s, v5.4s, v7.4s + mov v2.d[1], v2.d[0] + mov v3.d[1], v3.d[0] + fmul v2.4s, v2.4s, v4.4s + fmla v2.4s, v3.4s, v5.4s + st1 {v2.d}[0], [x0], #8 + st1 {v2.d}[1], [x1], #8 + subs w4, w4, #1 + b.gt 1b ret endfunc function ff_ps_stereo_interpolate_ipdopd_neon, 
export=1 - ld1 {v0.4s,v1.4s}, [x2] - ld1 {v6.4s,v7.4s}, [x3] - fneg v2.4s, v1.4s - fneg v3.4s, v7.4s - zip1 v16.4s, v0.4s, v0.4s - zip2 v17.4s, v0.4s, v0.4s - zip1 v18.4s, v2.4s, v1.4s - zip2 v19.4s, v2.4s, v1.4s - zip1 v20.4s, v6.4s, v6.4s - zip2 v21.4s, v6.4s, v6.4s - zip1 v22.4s, v3.4s, v7.4s - zip2 v23.4s, v3.4s, v7.4s -1: ld1 {v2.2s}, [x0] - ld1 {v3.2s}, [x1] - fadd v16.4s, v16.4s, v20.4s - fadd v17.4s, v17.4s, v21.4s - mov v2.d[1], v2.d[0] - mov v3.d[1], v3.d[0] - fmul v4.4s, v2.4s, v16.4s - fmla v4.4s, v3.4s, v17.4s - fadd v18.4s, v18.4s, v22.4s - fadd v19.4s, v19.4s, v23.4s - ext v2.16b, v2.16b, v2.16b, #4 - ext v3.16b, v3.16b, v3.16b, #4 - fmla v4.4s, v2.4s, v18.4s - fmla v4.4s, v3.4s, v19.4s - st1 {v4.d}[0], [x0], #8 - st1 {v4.d}[1], [x1], #8 - subs w4, w4, #1 - b.gt 1b + ld1 {v0.4s,v1.4s}, [x2] + ld1 {v6.4s,v7.4s}, [x3] + fneg v2.4s, v1.4s + fneg v3.4s, v7.4s + zip1 v16.4s, v0.4s, v0.4s + zip2 v17.4s, v0.4s, v0.4s + zip1 v18.4s, v2.4s, v1.4s + zip2 v19.4s, v2.4s, v1.4s + zip1 v20.4s, v6.4s, v6.4s + zip2 v21.4s, v6.4s, v6.4s + zip1 v22.4s, v3.4s, v7.4s + zip2 v23.4s, v3.4s, v7.4s +1: ld1 {v2.2s}, [x0] + ld1 {v3.2s}, [x1] + fadd v16.4s, v16.4s, v20.4s + fadd v17.4s, v17.4s, v21.4s + mov v2.d[1], v2.d[0] + mov v3.d[1], v3.d[0] + fmul v4.4s, v2.4s, v16.4s + fmla v4.4s, v3.4s, v17.4s + fadd v18.4s, v18.4s, v22.4s + fadd v19.4s, v19.4s, v23.4s + ext v2.16b, v2.16b, v2.16b, #4 + ext v3.16b, v3.16b, v3.16b, #4 + fmla v4.4s, v2.4s, v18.4s + fmla v4.4s, v3.4s, v19.4s + st1 {v4.d}[0], [x0], #8 + st1 {v4.d}[1], [x1], #8 + subs w4, w4, #1 + b.gt 1b ret endfunc function ff_ps_hybrid_analysis_neon, export=1 - lsl x3, x3, #3 - ld2 {v0.4s,v1.4s}, [x1], #32 - ld2 {v2.2s,v3.2s}, [x1], #16 - ld1 {v24.2s}, [x1], #8 - ld2 {v4.2s,v5.2s}, [x1], #16 - ld2 {v6.4s,v7.4s}, [x1] - rev64 v6.4s, v6.4s - rev64 v7.4s, v7.4s - ext v6.16b, v6.16b, v6.16b, #8 - ext v7.16b, v7.16b, v7.16b, #8 - rev64 v4.2s, v4.2s - rev64 v5.2s, v5.2s - mov v2.d[1], v3.d[0] - mov v4.d[1], v5.d[0] - mov v5.d[1], v2.d[0] - mov v3.d[1], v4.d[0] - fadd v16.4s, v0.4s, v6.4s - fadd v17.4s, v1.4s, v7.4s - fsub v18.4s, v1.4s, v7.4s - fsub v19.4s, v0.4s, v6.4s - fadd v22.4s, v2.4s, v4.4s - fsub v23.4s, v5.4s, v3.4s - trn1 v20.2d, v22.2d, v23.2d // {re4+re8, re5+re7, im8-im4, im7-im5} - trn2 v21.2d, v22.2d, v23.2d // {im4+im8, im5+im7, re4-re8, re5-re7} -1: ld2 {v2.4s,v3.4s}, [x2], #32 - ld2 {v4.2s,v5.2s}, [x2], #16 - ld1 {v6.2s}, [x2], #8 - add x2, x2, #8 - mov v4.d[1], v5.d[0] - mov v6.s[1], v6.s[0] - fmul v6.2s, v6.2s, v24.2s - fmul v0.4s, v2.4s, v16.4s - fmul v1.4s, v2.4s, v17.4s - fmls v0.4s, v3.4s, v18.4s - fmla v1.4s, v3.4s, v19.4s - fmla v0.4s, v4.4s, v20.4s - fmla v1.4s, v4.4s, v21.4s - faddp v0.4s, v0.4s, v1.4s - faddp v0.4s, v0.4s, v0.4s - fadd v0.2s, v0.2s, v6.2s - st1 {v0.2s}, [x0], x3 - subs w4, w4, #1 - b.gt 1b + lsl x3, x3, #3 + ld2 {v0.4s,v1.4s}, [x1], #32 + ld2 {v2.2s,v3.2s}, [x1], #16 + ld1 {v24.2s}, [x1], #8 + ld2 {v4.2s,v5.2s}, [x1], #16 + ld2 {v6.4s,v7.4s}, [x1] + rev64 v6.4s, v6.4s + rev64 v7.4s, v7.4s + ext v6.16b, v6.16b, v6.16b, #8 + ext v7.16b, v7.16b, v7.16b, #8 + rev64 v4.2s, v4.2s + rev64 v5.2s, v5.2s + mov v2.d[1], v3.d[0] + mov v4.d[1], v5.d[0] + mov v5.d[1], v2.d[0] + mov v3.d[1], v4.d[0] + fadd v16.4s, v0.4s, v6.4s + fadd v17.4s, v1.4s, v7.4s + fsub v18.4s, v1.4s, v7.4s + fsub v19.4s, v0.4s, v6.4s + fadd v22.4s, v2.4s, v4.4s + fsub v23.4s, v5.4s, v3.4s + trn1 v20.2d, v22.2d, v23.2d // {re4+re8, re5+re7, im8-im4, im7-im5} + trn2 v21.2d, v22.2d, v23.2d // {im4+im8, im5+im7, re4-re8, re5-re7} +1: ld2 
{v2.4s,v3.4s}, [x2], #32 + ld2 {v4.2s,v5.2s}, [x2], #16 + ld1 {v6.2s}, [x2], #8 + add x2, x2, #8 + mov v4.d[1], v5.d[0] + mov v6.s[1], v6.s[0] + fmul v6.2s, v6.2s, v24.2s + fmul v0.4s, v2.4s, v16.4s + fmul v1.4s, v2.4s, v17.4s + fmls v0.4s, v3.4s, v18.4s + fmla v1.4s, v3.4s, v19.4s + fmla v0.4s, v4.4s, v20.4s + fmla v1.4s, v4.4s, v21.4s + faddp v0.4s, v0.4s, v1.4s + faddp v0.4s, v0.4s, v0.4s + fadd v0.2s, v0.2s, v6.2s + st1 {v0.2s}, [x0], x3 + subs w4, w4, #1 + b.gt 1b ret endfunc diff --git a/libavcodec/aarch64/opusdsp_neon.S b/libavcodec/aarch64/opusdsp_neon.S index 1c88d7d123..e933151ab4 100644 --- a/libavcodec/aarch64/opusdsp_neon.S +++ b/libavcodec/aarch64/opusdsp_neon.S @@ -33,81 +33,81 @@ const tab_x2, align=4 endconst function ff_opus_deemphasis_neon, export=1 - movrel x4, tab_st - ld1 {v4.4s}, [x4] - movrel x4, tab_x0 - ld1 {v5.4s}, [x4] - movrel x4, tab_x1 - ld1 {v6.4s}, [x4] - movrel x4, tab_x2 - ld1 {v7.4s}, [x4] + movrel x4, tab_st + ld1 {v4.4s}, [x4] + movrel x4, tab_x0 + ld1 {v5.4s}, [x4] + movrel x4, tab_x1 + ld1 {v6.4s}, [x4] + movrel x4, tab_x2 + ld1 {v7.4s}, [x4] - fmul v0.4s, v4.4s, v0.s[0] + fmul v0.4s, v4.4s, v0.s[0] -1: ld1 {v1.4s, v2.4s}, [x1], #32 +1: ld1 {v1.4s, v2.4s}, [x1], #32 - fmla v0.4s, v5.4s, v1.s[0] - fmul v3.4s, v7.4s, v2.s[2] + fmla v0.4s, v5.4s, v1.s[0] + fmul v3.4s, v7.4s, v2.s[2] - fmla v0.4s, v6.4s, v1.s[1] - fmla v3.4s, v6.4s, v2.s[1] + fmla v0.4s, v6.4s, v1.s[1] + fmla v3.4s, v6.4s, v2.s[1] - fmla v0.4s, v7.4s, v1.s[2] - fmla v3.4s, v5.4s, v2.s[0] + fmla v0.4s, v7.4s, v1.s[2] + fmla v3.4s, v5.4s, v2.s[0] - fadd v1.4s, v1.4s, v0.4s - fadd v2.4s, v2.4s, v3.4s + fadd v1.4s, v1.4s, v0.4s + fadd v2.4s, v2.4s, v3.4s - fmla v2.4s, v4.4s, v1.s[3] + fmla v2.4s, v4.4s, v1.s[3] - st1 {v1.4s, v2.4s}, [x0], #32 - fmul v0.4s, v4.4s, v2.s[3] + st1 {v1.4s, v2.4s}, [x0], #32 + fmul v0.4s, v4.4s, v2.s[3] - subs w2, w2, #8 - b.gt 1b + subs w2, w2, #8 + b.gt 1b - mov s0, v2.s[3] + mov s0, v2.s[3] ret endfunc function ff_opus_postfilter_neon, export=1 - ld1 {v0.4s}, [x2] - dup v1.4s, v0.s[1] - dup v2.4s, v0.s[2] - dup v0.4s, v0.s[0] + ld1 {v0.4s}, [x2] + dup v1.4s, v0.s[1] + dup v2.4s, v0.s[2] + dup v0.4s, v0.s[0] - add w1, w1, #2 - sub x1, x0, x1, lsl #2 + add w1, w1, #2 + sub x1, x0, x1, lsl #2 - ld1 {v3.4s}, [x1] - fmul v3.4s, v3.4s, v2.4s + ld1 {v3.4s}, [x1] + fmul v3.4s, v3.4s, v2.4s -1: add x1, x1, #4 - ld1 {v4.4s}, [x1] - add x1, x1, #4 - ld1 {v5.4s}, [x1] - add x1, x1, #4 - ld1 {v6.4s}, [x1] - add x1, x1, #4 - ld1 {v7.4s}, [x1] +1: add x1, x1, #4 + ld1 {v4.4s}, [x1] + add x1, x1, #4 + ld1 {v5.4s}, [x1] + add x1, x1, #4 + ld1 {v6.4s}, [x1] + add x1, x1, #4 + ld1 {v7.4s}, [x1] - fmla v3.4s, v7.4s, v2.4s - fadd v6.4s, v6.4s, v4.4s + fmla v3.4s, v7.4s, v2.4s + fadd v6.4s, v6.4s, v4.4s - ld1 {v4.4s}, [x0] - fmla v4.4s, v5.4s, v0.4s + ld1 {v4.4s}, [x0] + fmla v4.4s, v5.4s, v0.4s - fmul v6.4s, v6.4s, v1.4s - fadd v6.4s, v6.4s, v3.4s + fmul v6.4s, v6.4s, v1.4s + fadd v6.4s, v6.4s, v3.4s - fadd v4.4s, v4.4s, v6.4s - fmul v3.4s, v7.4s, v2.4s + fadd v4.4s, v4.4s, v6.4s + fmul v3.4s, v7.4s, v2.4s - st1 {v4.4s}, [x0], #16 + st1 {v4.4s}, [x0], #16 - subs w3, w3, #4 - b.gt 1b + subs w3, w3, #4 + b.gt 1b ret endfunc diff --git a/libswresample/aarch64/resample.S b/libswresample/aarch64/resample.S index 114d1216fb..6d9eaaeb23 100644 --- a/libswresample/aarch64/resample.S +++ b/libswresample/aarch64/resample.S @@ -21,57 +21,57 @@ #include "libavutil/aarch64/asm.S" function ff_resample_common_apply_filter_x4_float_neon, export=1 - movi v0.4s, #0 // accumulator -1: ld1 {v1.4s}, 
[x1], #16 // src[0..3] - ld1 {v2.4s}, [x2], #16 // filter[0..3] - fmla v0.4s, v1.4s, v2.4s // accumulator += src[0..3] * filter[0..3] - subs w3, w3, #4 // filter_length -= 4 - b.gt 1b // loop until filter_length - faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values - faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values - st1 {v0.s}[0], [x0], #4 // write accumulator + movi v0.4s, #0 // accumulator +1: ld1 {v1.4s}, [x1], #16 // src[0..3] + ld1 {v2.4s}, [x2], #16 // filter[0..3] + fmla v0.4s, v1.4s, v2.4s // accumulator += src[0..3] * filter[0..3] + subs w3, w3, #4 // filter_length -= 4 + b.gt 1b // loop until filter_length + faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + st1 {v0.s}[0], [x0], #4 // write accumulator ret endfunc function ff_resample_common_apply_filter_x8_float_neon, export=1 - movi v0.4s, #0 // accumulator -1: ld1 {v1.4s}, [x1], #16 // src[0..3] - ld1 {v2.4s}, [x2], #16 // filter[0..3] - ld1 {v3.4s}, [x1], #16 // src[4..7] - ld1 {v4.4s}, [x2], #16 // filter[4..7] - fmla v0.4s, v1.4s, v2.4s // accumulator += src[0..3] * filter[0..3] - fmla v0.4s, v3.4s, v4.4s // accumulator += src[4..7] * filter[4..7] - subs w3, w3, #8 // filter_length -= 8 - b.gt 1b // loop until filter_length - faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values - faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values - st1 {v0.s}[0], [x0], #4 // write accumulator + movi v0.4s, #0 // accumulator +1: ld1 {v1.4s}, [x1], #16 // src[0..3] + ld1 {v2.4s}, [x2], #16 // filter[0..3] + ld1 {v3.4s}, [x1], #16 // src[4..7] + ld1 {v4.4s}, [x2], #16 // filter[4..7] + fmla v0.4s, v1.4s, v2.4s // accumulator += src[0..3] * filter[0..3] + fmla v0.4s, v3.4s, v4.4s // accumulator += src[4..7] * filter[4..7] + subs w3, w3, #8 // filter_length -= 8 + b.gt 1b // loop until filter_length + faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + faddp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + st1 {v0.s}[0], [x0], #4 // write accumulator ret endfunc function ff_resample_common_apply_filter_x4_s16_neon, export=1 - movi v0.4s, #0 // accumulator -1: ld1 {v1.4h}, [x1], #8 // src[0..3] - ld1 {v2.4h}, [x2], #8 // filter[0..3] - smlal v0.4s, v1.4h, v2.4h // accumulator += src[0..3] * filter[0..3] - subs w3, w3, #4 // filter_length -= 4 - b.gt 1b // loop until filter_length - addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values - addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values - st1 {v0.s}[0], [x0], #4 // write accumulator + movi v0.4s, #0 // accumulator +1: ld1 {v1.4h}, [x1], #8 // src[0..3] + ld1 {v2.4h}, [x2], #8 // filter[0..3] + smlal v0.4s, v1.4h, v2.4h // accumulator += src[0..3] * filter[0..3] + subs w3, w3, #4 // filter_length -= 4 + b.gt 1b // loop until filter_length + addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + st1 {v0.s}[0], [x0], #4 // write accumulator ret endfunc function ff_resample_common_apply_filter_x8_s16_neon, export=1 - movi v0.4s, #0 // accumulator -1: ld1 {v1.8h}, [x1], #16 // src[0..7] - ld1 {v2.8h}, [x2], #16 // filter[0..7] - smlal v0.4s, v1.4h, v2.4h // accumulator += src[0..3] * filter[0..3] - smlal2 v0.4s, v1.8h, v2.8h // accumulator += src[4..7] * filter[4..7] - subs w3, w3, #8 // filter_length -= 8 - 
b.gt 1b // loop until filter_length - addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values - addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values - st1 {v0.s}[0], [x0], #4 // write accumulator + movi v0.4s, #0 // accumulator +1: ld1 {v1.8h}, [x1], #16 // src[0..7] + ld1 {v2.8h}, [x2], #16 // filter[0..7] + smlal v0.4s, v1.4h, v2.4h // accumulator += src[0..3] * filter[0..3] + smlal2 v0.4s, v1.8h, v2.8h // accumulator += src[4..7] * filter[4..7] + subs w3, w3, #8 // filter_length -= 8 + b.gt 1b // loop until filter_length + addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + addp v0.4s, v0.4s, v0.4s // pair adding of the 4x32-bit accumulated values + st1 {v0.s}[0], [x0], #4 // write accumulator ret endfunc diff --git a/libswscale/aarch64/hscale.S b/libswscale/aarch64/hscale.S index 3041d483fc..b49443c964 100644 --- a/libswscale/aarch64/hscale.S +++ b/libswscale/aarch64/hscale.S @@ -41,53 +41,53 @@ ;----------------------------------------------------------------------------- */ function ff_hscale8to15_X8_neon, export=1 - sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) -1: ldr w8, [x5], #4 // filterPos[idx] - ldr w0, [x5], #4 // filterPos[idx + 1] - ldr w11, [x5], #4 // filterPos[idx + 2] - ldr w9, [x5], #4 // filterPos[idx + 3] - mov x16, x4 // filter0 = filter - add x12, x16, x7 // filter1 = filter0 + filterSize*2 - add x13, x12, x7 // filter2 = filter1 + filterSize*2 - add x4, x13, x7 // filter3 = filter2 + filterSize*2 - movi v0.2d, #0 // val sum part 1 (for dst[0]) - movi v1.2d, #0 // val sum part 2 (for dst[1]) - movi v2.2d, #0 // val sum part 3 (for dst[2]) - movi v3.2d, #0 // val sum part 4 (for dst[3]) - add x17, x3, w8, uxtw // srcp + filterPos[0] - add x8, x3, w0, uxtw // srcp + filterPos[1] - add x0, x3, w11, uxtw // srcp + filterPos[2] - add x11, x3, w9, uxtw // srcp + filterPos[3] - mov w15, w6 // filterSize counter -2: ld1 {v4.8b}, [x17], #8 // srcp[filterPos[0] + {0..7}] - ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 - ld1 {v6.8b}, [x8], #8 // srcp[filterPos[1] + {0..7}] - ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize - uxtl v4.8h, v4.8b // unpack part 1 to 16-bit - smlal v0.4s, v4.4h, v5.4h // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}] - smlal2 v0.4s, v4.8h, v5.8h // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}] - ld1 {v16.8b}, [x0], #8 // srcp[filterPos[2] + {0..7}] - ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize - uxtl v6.8h, v6.8b // unpack part 2 to 16-bit - smlal v1.4s, v6.4h, v7.4h // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] - uxtl v16.8h, v16.8b // unpack part 3 to 16-bit - smlal v2.4s, v16.4h, v17.4h // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] - smlal2 v2.4s, v16.8h, v17.8h // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] - ld1 {v18.8b}, [x11], #8 // srcp[filterPos[3] + {0..7}] - smlal2 v1.4s, v6.8h, v7.8h // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] - ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize - subs w15, w15, #8 // j -= 8: processed 8/filterSize - uxtl v18.8h, v18.8b // unpack part 4 to 16-bit - smlal v3.4s, v18.4h, v19.4h // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] - smlal2 v3.4s, v18.8h, v19.8h // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] - b.gt 2b // inner loop if filterSize not consumed completely - addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding - addp v2.4s, v2.4s, 
v3.4s // part23 horizontal pair adding - addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding - subs w2, w2, #4 // dstW -= 4 - sqshrn v0.4h, v0.4s, #7 // shift and clip the 2x16-bit final values - st1 {v0.4h}, [x1], #8 // write to destination part0123 - b.gt 1b // loop until end of line + sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) +1: ldr w8, [x5], #4 // filterPos[idx] + ldr w0, [x5], #4 // filterPos[idx + 1] + ldr w11, [x5], #4 // filterPos[idx + 2] + ldr w9, [x5], #4 // filterPos[idx + 3] + mov x16, x4 // filter0 = filter + add x12, x16, x7 // filter1 = filter0 + filterSize*2 + add x13, x12, x7 // filter2 = filter1 + filterSize*2 + add x4, x13, x7 // filter3 = filter2 + filterSize*2 + movi v0.2d, #0 // val sum part 1 (for dst[0]) + movi v1.2d, #0 // val sum part 2 (for dst[1]) + movi v2.2d, #0 // val sum part 3 (for dst[2]) + movi v3.2d, #0 // val sum part 4 (for dst[3]) + add x17, x3, w8, uxtw // srcp + filterPos[0] + add x8, x3, w0, uxtw // srcp + filterPos[1] + add x0, x3, w11, uxtw // srcp + filterPos[2] + add x11, x3, w9, uxtw // srcp + filterPos[3] + mov w15, w6 // filterSize counter +2: ld1 {v4.8b}, [x17], #8 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 + ld1 {v6.8b}, [x8], #8 // srcp[filterPos[1] + {0..7}] + ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize + uxtl v4.8h, v4.8b // unpack part 1 to 16-bit + smlal v0.4s, v4.4h, v5.4h // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}] + smlal2 v0.4s, v4.8h, v5.8h // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}] + ld1 {v16.8b}, [x0], #8 // srcp[filterPos[2] + {0..7}] + ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v6.8h, v6.8b // unpack part 2 to 16-bit + smlal v1.4s, v6.4h, v7.4h // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + uxtl v16.8h, v16.8b // unpack part 3 to 16-bit + smlal v2.4s, v16.4h, v17.4h // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + smlal2 v2.4s, v16.8h, v17.8h // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + ld1 {v18.8b}, [x11], #8 // srcp[filterPos[3] + {0..7}] + smlal2 v1.4s, v6.8h, v7.8h // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize + subs w15, w15, #8 // j -= 8: processed 8/filterSize + uxtl v18.8h, v18.8b // unpack part 4 to 16-bit + smlal v3.4s, v18.4h, v19.4h // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] + smlal2 v3.4s, v18.8h, v19.8h // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + b.gt 2b // inner loop if filterSize not consumed completely + addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding + addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding + addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding + subs w2, w2, #4 // dstW -= 4 + sqshrn v0.4h, v0.4s, #7 // shift and clip the 2x16-bit final values + st1 {v0.4h}, [x1], #8 // write to destination part0123 + b.gt 1b // loop until end of line ret endfunc @@ -103,98 +103,98 @@ function ff_hscale8to15_X4_neon, export=1 // This function for filter sizes that are 4 mod 8. In other words, anything that's 0 mod 4 but not // 0 mod 8. It also assumes that dstW is 0 mod 4. 
- lsl w7, w6, #1 // w7 = filterSize * 2 + lsl w7, w6, #1 // w7 = filterSize * 2 1: - ldp w8, w9, [x5] // filterPos[idx + 0], [idx + 1] - ldp w10, w11, [x5, #8] // filterPos[idx + 2], [idx + 3] + ldp w8, w9, [x5] // filterPos[idx + 0], [idx + 1] + ldp w10, w11, [x5, #8] // filterPos[idx + 2], [idx + 3] - movi v16.2d, #0 // initialize accumulator for idx + 0 - movi v17.2d, #0 // initialize accumulator for idx + 1 - movi v18.2d, #0 // initialize accumulator for idx + 2 - movi v19.2d, #0 // initialize accumulator for idx + 3 + movi v16.2d, #0 // initialize accumulator for idx + 0 + movi v17.2d, #0 // initialize accumulator for idx + 1 + movi v18.2d, #0 // initialize accumulator for idx + 2 + movi v19.2d, #0 // initialize accumulator for idx + 3 - mov x12, x4 // filter pointer for idx + 0 - add x13, x4, x7 // filter pointer for idx + 1 - add x8, x3, w8, uxtw // srcp + filterPos[idx + 0] - add x9, x3, w9, uxtw // srcp + filterPos[idx + 1] + mov x12, x4 // filter pointer for idx + 0 + add x13, x4, x7 // filter pointer for idx + 1 + add x8, x3, w8, uxtw // srcp + filterPos[idx + 0] + add x9, x3, w9, uxtw // srcp + filterPos[idx + 1] - add x14, x13, x7 // filter pointer for idx + 2 - add x10, x3, w10, uxtw // srcp + filterPos[idx + 2] - add x11, x3, w11, uxtw // srcp + filterPos[idx + 3] + add x14, x13, x7 // filter pointer for idx + 2 + add x10, x3, w10, uxtw // srcp + filterPos[idx + 2] + add x11, x3, w11, uxtw // srcp + filterPos[idx + 3] - mov w0, w6 // copy filterSize to a temp register, w0 - add x5, x5, #16 // advance the filterPos pointer - add x15, x14, x7 // filter pointer for idx + 3 - mov x16, xzr // temp register for offsetting filter pointers + mov w0, w6 // copy filterSize to a temp register, w0 + add x5, x5, #16 // advance the filterPos pointer + add x15, x14, x7 // filter pointer for idx + 3 + mov x16, xzr // temp register for offsetting filter pointers 2: // This section loops over 8-wide chunks of filter size - ldr d4, [x8], #8 // load 8 bytes from srcp for idx + 0 - ldr q0, [x12, x16] // load 8 values, 16 bytes from filter for idx + 0 + ldr d4, [x8], #8 // load 8 bytes from srcp for idx + 0 + ldr q0, [x12, x16] // load 8 values, 16 bytes from filter for idx + 0 - ldr d5, [x9], #8 // load 8 bytes from srcp for idx + 1 - ldr q1, [x13, x16] // load 8 values, 16 bytes from filter for idx + 1 + ldr d5, [x9], #8 // load 8 bytes from srcp for idx + 1 + ldr q1, [x13, x16] // load 8 values, 16 bytes from filter for idx + 1 - uxtl v4.8h, v4.8b // unsigned extend long for idx + 0 - uxtl v5.8h, v5.8b // unsigned extend long for idx + 1 + uxtl v4.8h, v4.8b // unsigned extend long for idx + 0 + uxtl v5.8h, v5.8b // unsigned extend long for idx + 1 - ldr d6, [x10], #8 // load 8 bytes from srcp for idx + 2 - ldr q2, [x14, x16] // load 8 values, 16 bytes from filter for idx + 2 + ldr d6, [x10], #8 // load 8 bytes from srcp for idx + 2 + ldr q2, [x14, x16] // load 8 values, 16 bytes from filter for idx + 2 - smlal v16.4s, v0.4h, v4.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 0 - smlal v17.4s, v1.4h, v5.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 1 + smlal v16.4s, v0.4h, v4.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 0 + smlal v17.4s, v1.4h, v5.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 1 - ldr d7, [x11], #8 // load 8 bytes from srcp for idx + 3 - ldr q3, [x15, x16] // load 8 values, 16 bytes from filter for idx + 3 + ldr d7, [x11], #8 // load 8 bytes from srcp for idx + 3 + ldr q3, [x15, 
x16] // load 8 values, 16 bytes from filter for idx + 3 - sub w0, w0, #8 // decrement the remaining filterSize counter - smlal2 v16.4s, v0.8h, v4.8h // val += src[srcPos + j + 4..7] * filter[fs * i + j + 4..7], idx + 0 - smlal2 v17.4s, v1.8h, v5.8h // val += src[srcPos + j + 4..7] * filter[fs * i + j + 4..7], idx + 1 - uxtl v6.8h, v6.8b // unsigned extend long for idx + 2 - uxtl v7.8h, v7.8b // unsigned extend long for idx + 3 - smlal v18.4s, v2.4h, v6.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 2 - smlal v19.4s, v3.4h, v7.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 3 + sub w0, w0, #8 // decrement the remaining filterSize counter + smlal2 v16.4s, v0.8h, v4.8h // val += src[srcPos + j + 4..7] * filter[fs * i + j + 4..7], idx + 0 + smlal2 v17.4s, v1.8h, v5.8h // val += src[srcPos + j + 4..7] * filter[fs * i + j + 4..7], idx + 1 + uxtl v6.8h, v6.8b // unsigned extend long for idx + 2 + uxtl v7.8h, v7.8b // unsigned extend long for idx + 3 + smlal v18.4s, v2.4h, v6.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 2 + smlal v19.4s, v3.4h, v7.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 3 - cmp w0, #8 // are there at least 8 more elements in filter to consume? - add x16, x16, #16 // advance the offsetting register for filter values + cmp w0, #8 // are there at least 8 more elements in filter to consume? + add x16, x16, #16 // advance the offsetting register for filter values - smlal2 v18.4s, v2.8h, v6.8h // val += src[srcPos + j + 4..7] * filter[fs * i + j + 4..7], idx + 2 - smlal2 v19.4s, v3.8h, v7.8h // val += src[srcPos + j + 4..7] * filter[fs * i + j + 4..7], idx + 3 + smlal2 v18.4s, v2.8h, v6.8h // val += src[srcPos + j + 4..7] * filter[fs * i + j + 4..7], idx + 2 + smlal2 v19.4s, v3.8h, v7.8h // val += src[srcPos + j + 4..7] * filter[fs * i + j + 4..7], idx + 3 - b.ge 2b // branch back to inner loop + b.ge 2b // branch back to inner loop // complete the remaining 4 filter elements - sub x17, x7, #8 // calculate the offset of the filter pointer for the remaining 4 elements - - ldr s4, [x8] // load 4 bytes from srcp for idx + 0 - ldr d0, [x12, x17] // load 4 values, 8 bytes from filter for idx + 0 - ldr s5, [x9] // load 4 bytes from srcp for idx + 1 - ldr d1, [x13, x17] // load 4 values, 8 bytes from filter for idx + 1 - - uxtl v4.8h, v4.8b // unsigned extend long for idx + 0 - uxtl v5.8h, v5.8b // unsigned extend long for idx + 1 - - ldr s6, [x10] // load 4 bytes from srcp for idx + 2 - ldr d2, [x14, x17] // load 4 values, 8 bytes from filter for idx + 2 - smlal v16.4s, v0.4h, v4.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 0 - smlal v17.4s, v1.4h, v5.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 1 - ldr s7, [x11] // load 4 bytes from srcp for idx + 3 - ldr d3, [x15, x17] // load 4 values, 8 bytes from filter for idx + 3 - - uxtl v6.8h, v6.8b // unsigned extend long for idx + 2 - uxtl v7.8h, v7.8b // unsigned extend long for idx + 3 - addp v16.4s, v16.4s, v17.4s // horizontal pair adding for idx 0,1 - smlal v18.4s, v2.4h, v6.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 2 - smlal v19.4s, v3.4h, v7.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 3 - - addp v18.4s, v18.4s, v19.4s // horizontal pair adding for idx 2,3 - addp v16.4s, v16.4s, v18.4s // final horizontal pair adding producing one vector with results for idx = 0..3 - - subs w2, w2, #4 // dstW -= 4 - sqshrn v0.4h, v16.4s, #7 // shift 
and clip the 2x16-bit final values - st1 {v0.4h}, [x1], #8 // write to destination idx 0..3 - add x4, x4, x7, lsl #2 // filter += (filterSize*2) * 4 - b.gt 1b // loop until end of line + sub x17, x7, #8 // calculate the offset of the filter pointer for the remaining 4 elements + + ldr s4, [x8] // load 4 bytes from srcp for idx + 0 + ldr d0, [x12, x17] // load 4 values, 8 bytes from filter for idx + 0 + ldr s5, [x9] // load 4 bytes from srcp for idx + 1 + ldr d1, [x13, x17] // load 4 values, 8 bytes from filter for idx + 1 + + uxtl v4.8h, v4.8b // unsigned extend long for idx + 0 + uxtl v5.8h, v5.8b // unsigned extend long for idx + 1 + + ldr s6, [x10] // load 4 bytes from srcp for idx + 2 + ldr d2, [x14, x17] // load 4 values, 8 bytes from filter for idx + 2 + smlal v16.4s, v0.4h, v4.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 0 + smlal v17.4s, v1.4h, v5.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 1 + ldr s7, [x11] // load 4 bytes from srcp for idx + 3 + ldr d3, [x15, x17] // load 4 values, 8 bytes from filter for idx + 3 + + uxtl v6.8h, v6.8b // unsigned extend long for idx + 2 + uxtl v7.8h, v7.8b // unsigned extend long for idx + 3 + addp v16.4s, v16.4s, v17.4s // horizontal pair adding for idx 0,1 + smlal v18.4s, v2.4h, v6.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 2 + smlal v19.4s, v3.4h, v7.4h // val += src[srcPos + j + 0..3] * filter[fs * i + j + 0..3], idx + 3 + + addp v18.4s, v18.4s, v19.4s // horizontal pair adding for idx 2,3 + addp v16.4s, v16.4s, v18.4s // final horizontal pair adding producing one vector with results for idx = 0..3 + + subs w2, w2, #4 // dstW -= 4 + sqshrn v0.4h, v16.4s, #7 // shift and clip the 2x16-bit final values + st1 {v0.4h}, [x1], #8 // write to destination idx 0..3 + add x4, x4, x7, lsl #2 // filter += (filterSize*2) * 4 + b.gt 1b // loop until end of line ret endfunc @@ -219,132 +219,132 @@ function ff_hscale8to15_4_neon, export=1 // 3. Complete madd // 4. 
Complete remaining iterations when dstW % 8 != 0 - sub sp, sp, #32 // allocate 32 bytes on the stack - cmp w2, #16 // if dstW <16, skip to the last block used for wrapping up - b.lt 2f + sub sp, sp, #32 // allocate 32 bytes on the stack + cmp w2, #16 // if dstW <16, skip to the last block used for wrapping up + b.lt 2f // load 8 values from filterPos to be used as offsets into src - ldp w8, w9, [x5] // filterPos[idx + 0], [idx + 1] - ldp w10, w11, [x5, #8] // filterPos[idx + 2], [idx + 3] - ldp w12, w13, [x5, #16] // filterPos[idx + 4], [idx + 5] - ldp w14, w15, [x5, #24] // filterPos[idx + 6], [idx + 7] - add x5, x5, #32 // advance filterPos + ldp w8, w9, [x5] // filterPos[idx + 0], [idx + 1] + ldp w10, w11, [x5, #8] // filterPos[idx + 2], [idx + 3] + ldp w12, w13, [x5, #16] // filterPos[idx + 4], [idx + 5] + ldp w14, w15, [x5, #24] // filterPos[idx + 6], [idx + 7] + add x5, x5, #32 // advance filterPos // gather random access data from src into contiguous memory - ldr w8, [x3, w8, uxtw] // src[filterPos[idx + 0]][0..3] - ldr w9, [x3, w9, uxtw] // src[filterPos[idx + 1]][0..3] - ldr w10, [x3, w10, uxtw] // src[filterPos[idx + 2]][0..3] - ldr w11, [x3, w11, uxtw] // src[filterPos[idx + 3]][0..3] - ldr w12, [x3, w12, uxtw] // src[filterPos[idx + 4]][0..3] - ldr w13, [x3, w13, uxtw] // src[filterPos[idx + 5]][0..3] - ldr w14, [x3, w14, uxtw] // src[filterPos[idx + 6]][0..3] - ldr w15, [x3, w15, uxtw] // src[filterPos[idx + 7]][0..3] - stp w8, w9, [sp] // *scratch_mem = { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] } - stp w10, w11, [sp, #8] // *scratch_mem = { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] } - stp w12, w13, [sp, #16] // *scratch_mem = { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] } - stp w14, w15, [sp, #24] // *scratch_mem = { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] } + ldr w8, [x3, w8, uxtw] // src[filterPos[idx + 0]][0..3] + ldr w9, [x3, w9, uxtw] // src[filterPos[idx + 1]][0..3] + ldr w10, [x3, w10, uxtw] // src[filterPos[idx + 2]][0..3] + ldr w11, [x3, w11, uxtw] // src[filterPos[idx + 3]][0..3] + ldr w12, [x3, w12, uxtw] // src[filterPos[idx + 4]][0..3] + ldr w13, [x3, w13, uxtw] // src[filterPos[idx + 5]][0..3] + ldr w14, [x3, w14, uxtw] // src[filterPos[idx + 6]][0..3] + ldr w15, [x3, w15, uxtw] // src[filterPos[idx + 7]][0..3] + stp w8, w9, [sp] // *scratch_mem = { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] } + stp w10, w11, [sp, #8] // *scratch_mem = { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] } + stp w12, w13, [sp, #16] // *scratch_mem = { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] } + stp w14, w15, [sp, #24] // *scratch_mem = { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] } 1: - ld4 {v16.8b, v17.8b, v18.8b, v19.8b}, [sp] // transpose 8 bytes each from src into 4 registers + ld4 {v16.8b, v17.8b, v18.8b, v19.8b}, [sp] // transpose 8 bytes each from src into 4 registers // load 8 values from filterPos to be used as offsets into src - ldp w8, w9, [x5] // filterPos[idx + 0][0..3], [idx + 1][0..3], next iteration - ldp w10, w11, [x5, #8] // filterPos[idx + 2][0..3], [idx + 3][0..3], next iteration - ldp w12, w13, [x5, #16] // filterPos[idx + 4][0..3], [idx + 5][0..3], next iteration - ldp w14, w15, [x5, #24] // filterPos[idx + 6][0..3], [idx + 7][0..3], next iteration + ldp w8, w9, [x5] // filterPos[idx + 0][0..3], [idx + 1][0..3], next iteration + ldp w10, w11, [x5, #8] // filterPos[idx + 2][0..3], [idx + 3][0..3], next iteration + 
ldp w12, w13, [x5, #16] // filterPos[idx + 4][0..3], [idx + 5][0..3], next iteration + ldp w14, w15, [x5, #24] // filterPos[idx + 6][0..3], [idx + 7][0..3], next iteration - movi v0.2d, #0 // Clear madd accumulator for idx 0..3 - movi v5.2d, #0 // Clear madd accumulator for idx 4..7 + movi v0.2d, #0 // Clear madd accumulator for idx 0..3 + movi v5.2d, #0 // Clear madd accumulator for idx 4..7 - ld4 {v1.8h, v2.8h, v3.8h, v4.8h}, [x4], #64 // load filter idx + 0..7 + ld4 {v1.8h, v2.8h, v3.8h, v4.8h}, [x4], #64 // load filter idx + 0..7 - add x5, x5, #32 // advance filterPos + add x5, x5, #32 // advance filterPos // interleaved SIMD and prefetching intended to keep ld/st and vector pipelines busy - uxtl v16.8h, v16.8b // unsigned extend long, covert src data to 16-bit - uxtl v17.8h, v17.8b // unsigned extend long, covert src data to 16-bit - ldr w8, [x3, w8, uxtw] // src[filterPos[idx + 0]], next iteration - ldr w9, [x3, w9, uxtw] // src[filterPos[idx + 1]], next iteration - uxtl v18.8h, v18.8b // unsigned extend long, covert src data to 16-bit - uxtl v19.8h, v19.8b // unsigned extend long, covert src data to 16-bit - ldr w10, [x3, w10, uxtw] // src[filterPos[idx + 2]], next iteration - ldr w11, [x3, w11, uxtw] // src[filterPos[idx + 3]], next iteration - - smlal v0.4s, v1.4h, v16.4h // multiply accumulate inner loop j = 0, idx = 0..3 - smlal v0.4s, v2.4h, v17.4h // multiply accumulate inner loop j = 1, idx = 0..3 - ldr w12, [x3, w12, uxtw] // src[filterPos[idx + 4]], next iteration - ldr w13, [x3, w13, uxtw] // src[filterPos[idx + 5]], next iteration - smlal v0.4s, v3.4h, v18.4h // multiply accumulate inner loop j = 2, idx = 0..3 - smlal v0.4s, v4.4h, v19.4h // multiply accumulate inner loop j = 3, idx = 0..3 - ldr w14, [x3, w14, uxtw] // src[filterPos[idx + 6]], next iteration - ldr w15, [x3, w15, uxtw] // src[filterPos[idx + 7]], next iteration - - smlal2 v5.4s, v1.8h, v16.8h // multiply accumulate inner loop j = 0, idx = 4..7 - smlal2 v5.4s, v2.8h, v17.8h // multiply accumulate inner loop j = 1, idx = 4..7 - stp w8, w9, [sp] // *scratch_mem = { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] } - stp w10, w11, [sp, #8] // *scratch_mem = { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] } - smlal2 v5.4s, v3.8h, v18.8h // multiply accumulate inner loop j = 2, idx = 4..7 - smlal2 v5.4s, v4.8h, v19.8h // multiply accumulate inner loop j = 3, idx = 4..7 - stp w12, w13, [sp, #16] // *scratch_mem = { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] } - stp w14, w15, [sp, #24] // *scratch_mem = { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] } - - sub w2, w2, #8 // dstW -= 8 - sqshrn v0.4h, v0.4s, #7 // shift and clip the 2x16-bit final values - sqshrn v1.4h, v5.4s, #7 // shift and clip the 2x16-bit final values - st1 {v0.4h, v1.4h}, [x1], #16 // write to dst[idx + 0..7] - cmp w2, #16 // continue on main loop if there are at least 16 iterations left - b.ge 1b + uxtl v16.8h, v16.8b // unsigned extend long, covert src data to 16-bit + uxtl v17.8h, v17.8b // unsigned extend long, covert src data to 16-bit + ldr w8, [x3, w8, uxtw] // src[filterPos[idx + 0]], next iteration + ldr w9, [x3, w9, uxtw] // src[filterPos[idx + 1]], next iteration + uxtl v18.8h, v18.8b // unsigned extend long, covert src data to 16-bit + uxtl v19.8h, v19.8b // unsigned extend long, covert src data to 16-bit + ldr w10, [x3, w10, uxtw] // src[filterPos[idx + 2]], next iteration + ldr w11, [x3, w11, uxtw] // src[filterPos[idx + 3]], next iteration + + smlal v0.4s, v1.4h, 
v16.4h // multiply accumulate inner loop j = 0, idx = 0..3 + smlal v0.4s, v2.4h, v17.4h // multiply accumulate inner loop j = 1, idx = 0..3 + ldr w12, [x3, w12, uxtw] // src[filterPos[idx + 4]], next iteration + ldr w13, [x3, w13, uxtw] // src[filterPos[idx + 5]], next iteration + smlal v0.4s, v3.4h, v18.4h // multiply accumulate inner loop j = 2, idx = 0..3 + smlal v0.4s, v4.4h, v19.4h // multiply accumulate inner loop j = 3, idx = 0..3 + ldr w14, [x3, w14, uxtw] // src[filterPos[idx + 6]], next iteration + ldr w15, [x3, w15, uxtw] // src[filterPos[idx + 7]], next iteration + + smlal2 v5.4s, v1.8h, v16.8h // multiply accumulate inner loop j = 0, idx = 4..7 + smlal2 v5.4s, v2.8h, v17.8h // multiply accumulate inner loop j = 1, idx = 4..7 + stp w8, w9, [sp] // *scratch_mem = { src[filterPos[idx + 0]][0..3], src[filterPos[idx + 1]][0..3] } + stp w10, w11, [sp, #8] // *scratch_mem = { src[filterPos[idx + 2]][0..3], src[filterPos[idx + 3]][0..3] } + smlal2 v5.4s, v3.8h, v18.8h // multiply accumulate inner loop j = 2, idx = 4..7 + smlal2 v5.4s, v4.8h, v19.8h // multiply accumulate inner loop j = 3, idx = 4..7 + stp w12, w13, [sp, #16] // *scratch_mem = { src[filterPos[idx + 4]][0..3], src[filterPos[idx + 5]][0..3] } + stp w14, w15, [sp, #24] // *scratch_mem = { src[filterPos[idx + 6]][0..3], src[filterPos[idx + 7]][0..3] } + + sub w2, w2, #8 // dstW -= 8 + sqshrn v0.4h, v0.4s, #7 // shift and clip the 2x16-bit final values + sqshrn v1.4h, v5.4s, #7 // shift and clip the 2x16-bit final values + st1 {v0.4h, v1.4h}, [x1], #16 // write to dst[idx + 0..7] + cmp w2, #16 // continue on main loop if there are at least 16 iterations left + b.ge 1b // last full iteration - ld4 {v16.8b, v17.8b, v18.8b, v19.8b}, [sp] - ld4 {v1.8h, v2.8h, v3.8h, v4.8h}, [x4], #64 // load filter idx + 0..7 + ld4 {v16.8b, v17.8b, v18.8b, v19.8b}, [sp] + ld4 {v1.8h, v2.8h, v3.8h, v4.8h}, [x4], #64 // load filter idx + 0..7 - movi v0.2d, #0 // Clear madd accumulator for idx 0..3 - movi v5.2d, #0 // Clear madd accumulator for idx 4..7 + movi v0.2d, #0 // Clear madd accumulator for idx 0..3 + movi v5.2d, #0 // Clear madd accumulator for idx 4..7 - uxtl v16.8h, v16.8b // unsigned extend long, covert src data to 16-bit - uxtl v17.8h, v17.8b // unsigned extend long, covert src data to 16-bit - uxtl v18.8h, v18.8b // unsigned extend long, covert src data to 16-bit - uxtl v19.8h, v19.8b // unsigned extend long, covert src data to 16-bit + uxtl v16.8h, v16.8b // unsigned extend long, covert src data to 16-bit + uxtl v17.8h, v17.8b // unsigned extend long, covert src data to 16-bit + uxtl v18.8h, v18.8b // unsigned extend long, covert src data to 16-bit + uxtl v19.8h, v19.8b // unsigned extend long, covert src data to 16-bit - smlal v0.4s, v1.4h, v16.4h // multiply accumulate inner loop j = 0, idx = 0..3 - smlal v0.4s, v2.4h, v17.4h // multiply accumulate inner loop j = 1, idx = 0..3 - smlal v0.4s, v3.4h, v18.4h // multiply accumulate inner loop j = 2, idx = 0..3 - smlal v0.4s, v4.4h, v19.4h // multiply accumulate inner loop j = 3, idx = 0..3 + smlal v0.4s, v1.4h, v16.4h // multiply accumulate inner loop j = 0, idx = 0..3 + smlal v0.4s, v2.4h, v17.4h // multiply accumulate inner loop j = 1, idx = 0..3 + smlal v0.4s, v3.4h, v18.4h // multiply accumulate inner loop j = 2, idx = 0..3 + smlal v0.4s, v4.4h, v19.4h // multiply accumulate inner loop j = 3, idx = 0..3 - smlal2 v5.4s, v1.8h, v16.8h // multiply accumulate inner loop j = 0, idx = 4..7 - smlal2 v5.4s, v2.8h, v17.8h // multiply accumulate inner loop j = 1, idx = 4..7 - smlal2 v5.4s, 
v3.8h, v18.8h // multiply accumulate inner loop j = 2, idx = 4..7 - smlal2 v5.4s, v4.8h, v19.8h // multiply accumulate inner loop j = 3, idx = 4..7 + smlal2 v5.4s, v1.8h, v16.8h // multiply accumulate inner loop j = 0, idx = 4..7 + smlal2 v5.4s, v2.8h, v17.8h // multiply accumulate inner loop j = 1, idx = 4..7 + smlal2 v5.4s, v3.8h, v18.8h // multiply accumulate inner loop j = 2, idx = 4..7 + smlal2 v5.4s, v4.8h, v19.8h // multiply accumulate inner loop j = 3, idx = 4..7 - subs w2, w2, #8 // dstW -= 8 - sqshrn v0.4h, v0.4s, #7 // shift and clip the 2x16-bit final values - sqshrn v1.4h, v5.4s, #7 // shift and clip the 2x16-bit final values - st1 {v0.4h, v1.4h}, [x1], #16 // write to dst[idx + 0..7] + subs w2, w2, #8 // dstW -= 8 + sqshrn v0.4h, v0.4s, #7 // shift and clip the 2x16-bit final values + sqshrn v1.4h, v5.4s, #7 // shift and clip the 2x16-bit final values + st1 {v0.4h, v1.4h}, [x1], #16 // write to dst[idx + 0..7] - cbnz w2, 2f // if >0 iterations remain, jump to the wrap up section + cbnz w2, 2f // if >0 iterations remain, jump to the wrap up section - add sp, sp, #32 // clean up stack + add sp, sp, #32 // clean up stack ret // finish up when dstW % 8 != 0 or dstW < 16 2: // load src - ldr w8, [x5], #4 // filterPos[i] - add x9, x3, w8, uxtw // calculate the address for src load - ld1 {v5.s}[0], [x9] // src[filterPos[i] + 0..3] + ldr w8, [x5], #4 // filterPos[i] + add x9, x3, w8, uxtw // calculate the address for src load + ld1 {v5.s}[0], [x9] // src[filterPos[i] + 0..3] // load filter - ld1 {v6.4h}, [x4], #8 // filter[filterSize * i + 0..3] + ld1 {v6.4h}, [x4], #8 // filter[filterSize * i + 0..3] - uxtl v5.8h, v5.8b // unsigned exten long, convert src data to 16-bit - smull v0.4s, v5.4h, v6.4h // 4 iterations of src[...] * filter[...] - addv s0, v0.4s // add up products of src and filter values - sqshrn h0, s0, #7 // shift and clip the 2x16-bit final value - st1 {v0.h}[0], [x1], #2 // dst[i] = ... - sub w2, w2, #1 // dstW-- - cbnz w2, 2b + uxtl v5.8h, v5.8b // unsigned exten long, convert src data to 16-bit + smull v0.4s, v5.4h, v6.4h // 4 iterations of src[...] * filter[...] + addv s0, v0.4s // add up products of src and filter values + sqshrn h0, s0, #7 // shift and clip the 2x16-bit final value + st1 {v0.h}[0], [x1], #2 // dst[i] = ... 
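
For readers following the patch, all of the 8-bit horizontal scalers touched above compute the same inner product per output sample before the final shift/clip; a minimal scalar C sketch of that computation follows (illustrative only -- the function name and prototype are hypothetical, not the actual libswscale entry point):

    #include <stdint.h>

    /* Scalar model of one 8-bit -> 15-bit horizontal scale pass.
     * Hypothetical helper for illustration; the real code lives in libswscale. */
    static void hscale8to15_ref(int16_t *dst, int dstW, const uint8_t *src,
                                const int16_t *filter, const int32_t *filterPos,
                                int filterSize)
    {
        for (int i = 0; i < dstW; i++) {
            int val = 0;
            for (int j = 0; j < filterSize; j++)
                val += src[filterPos[i] + j] * filter[filterSize * i + j];
            /* matches the sqshrn #7 above: shift the 32-bit sum right by 7
             * and saturate it into the int16_t range */
            val >>= 7;
            if (val >  32767) val =  32767;
            if (val < -32768) val = -32768;
            dst[i] = (int16_t)val;
        }
    }

The 8-to-19-bit variants that follow differ only in the final step: instead of narrowing with sqshrn they shift right by 3 (sshr) and clamp against (1 << 19) - 1 with smin.
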
+ sub w2, w2, #1 // dstW-- + cbnz w2, 2b - add sp, sp, #32 // clean up stack + add sp, sp, #32 // clean up stack ret endfunc @@ -357,187 +357,187 @@ function ff_hscale8to19_4_neon, export=1 // x5 const int32_t *filterPos // w6 int filterSize - movi v18.4s, #1 - movi v17.4s, #1 - shl v18.4s, v18.4s, #19 - sub v18.4s, v18.4s, v17.4s // max allowed value + movi v18.4s, #1 + movi v17.4s, #1 + shl v18.4s, v18.4s, #19 + sub v18.4s, v18.4s, v17.4s // max allowed value - cmp w2, #16 - b.lt 2f // move to last block + cmp w2, #16 + b.lt 2f // move to last block - ldp w8, w9, [x5] // filterPos[0], filterPos[1] - ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] - ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] - ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] - add x5, x5, #32 + ldp w8, w9, [x5] // filterPos[0], filterPos[1] + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] + add x5, x5, #32 // load data from - ldr w8, [x3, w8, uxtw] - ldr w9, [x3, w9, uxtw] - ldr w10, [x3, w10, uxtw] - ldr w11, [x3, w11, uxtw] - ldr w12, [x3, w12, uxtw] - ldr w13, [x3, w13, uxtw] - ldr w14, [x3, w14, uxtw] - ldr w15, [x3, w15, uxtw] - - sub sp, sp, #32 - - stp w8, w9, [sp] - stp w10, w11, [sp, #8] - stp w12, w13, [sp, #16] - stp w14, w15, [sp, #24] + ldr w8, [x3, w8, uxtw] + ldr w9, [x3, w9, uxtw] + ldr w10, [x3, w10, uxtw] + ldr w11, [x3, w11, uxtw] + ldr w12, [x3, w12, uxtw] + ldr w13, [x3, w13, uxtw] + ldr w14, [x3, w14, uxtw] + ldr w15, [x3, w15, uxtw] + + sub sp, sp, #32 + + stp w8, w9, [sp] + stp w10, w11, [sp, #8] + stp w12, w13, [sp, #16] + stp w14, w15, [sp, #24] 1: - ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp] - ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] + ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] // load filterPositions into registers for next iteration - ldp w8, w9, [x5] // filterPos[0], filterPos[1] - ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] - ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] - ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] - add x5, x5, #32 - uxtl v0.8h, v0.8b - ldr w8, [x3, w8, uxtw] - smull v5.4s, v0.4h, v28.4h // multiply first column of src - ldr w9, [x3, w9, uxtw] - smull2 v6.4s, v0.8h, v28.8h - stp w8, w9, [sp] - - uxtl v1.8h, v1.8b - ldr w10, [x3, w10, uxtw] - smlal v5.4s, v1.4h, v29.4h // multiply second column of src - ldr w11, [x3, w11, uxtw] - smlal2 v6.4s, v1.8h, v29.8h - stp w10, w11, [sp, #8] - - uxtl v2.8h, v2.8b - ldr w12, [x3, w12, uxtw] - smlal v5.4s, v2.4h, v30.4h // multiply third column of src - ldr w13, [x3, w13, uxtw] - smlal2 v6.4s, v2.8h, v30.8h - stp w12, w13, [sp, #16] - - uxtl v3.8h, v3.8b - ldr w14, [x3, w14, uxtw] - smlal v5.4s, v3.4h, v31.4h // multiply fourth column of src - ldr w15, [x3, w15, uxtw] - smlal2 v6.4s, v3.8h, v31.8h - stp w14, w15, [sp, #24] - - sub w2, w2, #8 - sshr v5.4s, v5.4s, #3 - sshr v6.4s, v6.4s, #3 - smin v5.4s, v5.4s, v18.4s - smin v6.4s, v6.4s, v18.4s - - st1 {v5.4s, v6.4s}, [x1], #32 - cmp w2, #16 - b.ge 1b + ldp w8, w9, [x5] // filterPos[0], filterPos[1] + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] + add x5, x5, #32 + uxtl v0.8h, v0.8b + ldr w8, [x3, w8, uxtw] + smull v5.4s, v0.4h, v28.4h // multiply first column of src + ldr w9, [x3, w9, uxtw] + smull2 v6.4s, v0.8h, v28.8h + stp 
w8, w9, [sp] + + uxtl v1.8h, v1.8b + ldr w10, [x3, w10, uxtw] + smlal v5.4s, v1.4h, v29.4h // multiply second column of src + ldr w11, [x3, w11, uxtw] + smlal2 v6.4s, v1.8h, v29.8h + stp w10, w11, [sp, #8] + + uxtl v2.8h, v2.8b + ldr w12, [x3, w12, uxtw] + smlal v5.4s, v2.4h, v30.4h // multiply third column of src + ldr w13, [x3, w13, uxtw] + smlal2 v6.4s, v2.8h, v30.8h + stp w12, w13, [sp, #16] + + uxtl v3.8h, v3.8b + ldr w14, [x3, w14, uxtw] + smlal v5.4s, v3.4h, v31.4h // multiply fourth column of src + ldr w15, [x3, w15, uxtw] + smlal2 v6.4s, v3.8h, v31.8h + stp w14, w15, [sp, #24] + + sub w2, w2, #8 + sshr v5.4s, v5.4s, #3 + sshr v6.4s, v6.4s, #3 + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + + st1 {v5.4s, v6.4s}, [x1], #32 + cmp w2, #16 + b.ge 1b // here we make last iteration, without updating the registers - ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp] - ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] - - uxtl v0.8h, v0.8b - uxtl v1.8h, v1.8b - smull v5.4s, v0.4h, v28.4h - smull2 v6.4s, v0.8h, v28.8h - uxtl v2.8h, v2.8b - smlal v5.4s, v1.4h, v29.4h - smlal2 v6.4s, v1.8h, v29.8h - uxtl v3.8h, v3.8b - smlal v5.4s, v2.4h, v30.4h - smlal2 v6.4s, v2.8h, v30.8h - smlal v5.4s, v3.4h, v31.4h - smlal2 v6.4s, v3.8h, v31.8h - - sshr v5.4s, v5.4s, #3 - sshr v6.4s, v6.4s, #3 - - smin v5.4s, v5.4s, v18.4s - smin v6.4s, v6.4s, v18.4s - - sub w2, w2, #8 - st1 {v5.4s, v6.4s}, [x1], #32 - add sp, sp, #32 // restore stack - cbnz w2, 2f + ld4 {v0.8b, v1.8b, v2.8b, v3.8b}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] + + uxtl v0.8h, v0.8b + uxtl v1.8h, v1.8b + smull v5.4s, v0.4h, v28.4h + smull2 v6.4s, v0.8h, v28.8h + uxtl v2.8h, v2.8b + smlal v5.4s, v1.4h, v29.4h + smlal2 v6.4s, v1.8h, v29.8h + uxtl v3.8h, v3.8b + smlal v5.4s, v2.4h, v30.4h + smlal2 v6.4s, v2.8h, v30.8h + smlal v5.4s, v3.4h, v31.4h + smlal2 v6.4s, v3.8h, v31.8h + + sshr v5.4s, v5.4s, #3 + sshr v6.4s, v6.4s, #3 + + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + + sub w2, w2, #8 + st1 {v5.4s, v6.4s}, [x1], #32 + add sp, sp, #32 // restore stack + cbnz w2, 2f ret 2: - ldr w8, [x5], #4 // load filterPos - add x9, x3, w8, uxtw // src + filterPos - ld1 {v0.s}[0], [x9] // load 4 * uint8_t* into one single - ld1 {v31.4h}, [x4], #8 - uxtl v0.8h, v0.8b - smull v5.4s, v0.4h, v31.4h - saddlv d0, v5.4s - sqshrn s0, d0, #3 - smin v0.4s, v0.4s, v18.4s - st1 {v0.s}[0], [x1], #4 - sub w2, w2, #1 - cbnz w2, 2b // if iterations remain jump to beginning + ldr w8, [x5], #4 // load filterPos + add x9, x3, w8, uxtw // src + filterPos + ld1 {v0.s}[0], [x9] // load 4 * uint8_t* into one single + ld1 {v31.4h}, [x4], #8 + uxtl v0.8h, v0.8b + smull v5.4s, v0.4h, v31.4h + saddlv d0, v5.4s + sqshrn s0, d0, #3 + smin v0.4s, v0.4s, v18.4s + st1 {v0.s}[0], [x1], #4 + sub w2, w2, #1 + cbnz w2, 2b // if iterations remain jump to beginning ret endfunc function ff_hscale8to19_X8_neon, export=1 - movi v20.4s, #1 - movi v17.4s, #1 - shl v20.4s, v20.4s, #19 - sub v20.4s, v20.4s, v17.4s + movi v20.4s, #1 + movi v17.4s, #1 + shl v20.4s, v20.4s, #19 + sub v20.4s, v20.4s, v17.4s - sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) + sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) 1: - mov x16, x4 // filter0 = filter - ldr w8, [x5], #4 // filterPos[idx] - add x12, x16, x7 // filter1 = filter0 + filterSize*2 - ldr w0, [x5], #4 // filterPos[idx + 1] - add x13, x12, x7 // filter2 = filter1 + filterSize*2 - ldr w11, [x5], #4 // filterPos[idx + 2] - add x4, x13, x7 // filter3 = filter2 + filterSize*2 - ldr w9, 
[x5], #4 // filterPos[idx + 3] - movi v0.2d, #0 // val sum part 1 (for dst[0]) - movi v1.2d, #0 // val sum part 2 (for dst[1]) - movi v2.2d, #0 // val sum part 3 (for dst[2]) - movi v3.2d, #0 // val sum part 4 (for dst[3]) - add x17, x3, w8, uxtw // srcp + filterPos[0] - add x8, x3, w0, uxtw // srcp + filterPos[1] - add x0, x3, w11, uxtw // srcp + filterPos[2] - add x11, x3, w9, uxtw // srcp + filterPos[3] - mov w15, w6 // filterSize counter -2: ld1 {v4.8b}, [x17], #8 // srcp[filterPos[0] + {0..7}] - ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 - uxtl v4.8h, v4.8b // unpack part 1 to 16-bit - smlal v0.4s, v4.4h, v5.4h // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}] - ld1 {v6.8b}, [x8], #8 // srcp[filterPos[1] + {0..7}] - smlal2 v0.4s, v4.8h, v5.8h // v0 accumulates srcp[filterPos[0] + {4..7}] * filter[{4..7}] - ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize - ld1 {v16.8b}, [x0], #8 // srcp[filterPos[2] + {0..7}] - uxtl v6.8h, v6.8b // unpack part 2 to 16-bit - ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize - uxtl v16.8h, v16.8b // unpack part 3 to 16-bit - smlal v1.4s, v6.4h, v7.4h // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] - ld1 {v18.8b}, [x11], #8 // srcp[filterPos[3] + {0..7}] - smlal v2.4s, v16.4h, v17.4h // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] - ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize - smlal2 v2.4s, v16.8h, v17.8h // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] - uxtl v18.8h, v18.8b // unpack part 4 to 16-bit - smlal2 v1.4s, v6.8h, v7.8h // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] - smlal v3.4s, v18.4h, v19.4h // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] - subs w15, w15, #8 // j -= 8: processed 8/filterSize - smlal2 v3.4s, v18.8h, v19.8h // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] - b.gt 2b // inner loop if filterSize not consumed completely - addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding - addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding - addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding - subs w2, w2, #4 // dstW -= 4 - sshr v0.4s, v0.4s, #3 // shift and clip the 2x16-bit final values - smin v0.4s, v0.4s, v20.4s - st1 {v0.4s}, [x1], #16 // write to destination part0123 - b.gt 1b // loop until end of line + mov x16, x4 // filter0 = filter + ldr w8, [x5], #4 // filterPos[idx] + add x12, x16, x7 // filter1 = filter0 + filterSize*2 + ldr w0, [x5], #4 // filterPos[idx + 1] + add x13, x12, x7 // filter2 = filter1 + filterSize*2 + ldr w11, [x5], #4 // filterPos[idx + 2] + add x4, x13, x7 // filter3 = filter2 + filterSize*2 + ldr w9, [x5], #4 // filterPos[idx + 3] + movi v0.2d, #0 // val sum part 1 (for dst[0]) + movi v1.2d, #0 // val sum part 2 (for dst[1]) + movi v2.2d, #0 // val sum part 3 (for dst[2]) + movi v3.2d, #0 // val sum part 4 (for dst[3]) + add x17, x3, w8, uxtw // srcp + filterPos[0] + add x8, x3, w0, uxtw // srcp + filterPos[1] + add x0, x3, w11, uxtw // srcp + filterPos[2] + add x11, x3, w9, uxtw // srcp + filterPos[3] + mov w15, w6 // filterSize counter +2: ld1 {v4.8b}, [x17], #8 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 + uxtl v4.8h, v4.8b // unpack part 1 to 16-bit + smlal v0.4s, v4.4h, v5.4h // v0 accumulates srcp[filterPos[0] + {0..3}] * filter[{0..3}] + ld1 {v6.8b}, [x8], #8 // srcp[filterPos[1] + {0..7}] + smlal2 v0.4s, v4.8h, v5.8h // v0 accumulates srcp[filterPos[0] + {4..7}] * 
filter[{4..7}] + ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize + ld1 {v16.8b}, [x0], #8 // srcp[filterPos[2] + {0..7}] + uxtl v6.8h, v6.8b // unpack part 2 to 16-bit + ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v16.8h, v16.8b // unpack part 3 to 16-bit + smlal v1.4s, v6.4h, v7.4h // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + ld1 {v18.8b}, [x11], #8 // srcp[filterPos[3] + {0..7}] + smlal v2.4s, v16.4h, v17.4h // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize + smlal2 v2.4s, v16.8h, v17.8h // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + uxtl v18.8h, v18.8b // unpack part 4 to 16-bit + smlal2 v1.4s, v6.8h, v7.8h // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + smlal v3.4s, v18.4h, v19.4h // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] + subs w15, w15, #8 // j -= 8: processed 8/filterSize + smlal2 v3.4s, v18.8h, v19.8h // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + b.gt 2b // inner loop if filterSize not consumed completely + addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding + addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding + addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding + subs w2, w2, #4 // dstW -= 4 + sshr v0.4s, v0.4s, #3 // shift and clip the 2x16-bit final values + smin v0.4s, v0.4s, v20.4s + st1 {v0.4s}, [x1], #16 // write to destination part0123 + b.gt 1b // loop until end of line ret endfunc @@ -550,91 +550,91 @@ function ff_hscale8to19_X4_neon, export=1 // x5 const int32_t *filterPos // w6 int filterSize - movi v20.4s, #1 - movi v17.4s, #1 - shl v20.4s, v20.4s, #19 - sub v20.4s, v20.4s, v17.4s + movi v20.4s, #1 + movi v17.4s, #1 + shl v20.4s, v20.4s, #19 + sub v20.4s, v20.4s, v17.4s - lsl w7, w6, #1 + lsl w7, w6, #1 1: - ldp w8, w9, [x5] - ldp w10, w11, [x5, #8] - - movi v16.2d, #0 // initialize accumulator for idx + 0 - movi v17.2d, #0 // initialize accumulator for idx + 1 - movi v18.2d, #0 // initialize accumulator for idx + 2 - movi v19.2d, #0 // initialize accumulator for idx + 3 - - mov x12, x4 // filter + 0 - add x13, x4, x7 // filter + 1 - add x8, x3, w8, uxtw // srcp + filterPos 0 - add x14, x13, x7 // filter + 2 - add x9, x3, w9, uxtw // srcp + filterPos 1 - add x15, x14, x7 // filter + 3 - add x10, x3, w10, uxtw // srcp + filterPos 2 - mov w0, w6 // save the filterSize to temporary variable - add x11, x3, w11, uxtw // srcp + filterPos 3 - add x5, x5, #16 // advance filter position - mov x16, xzr // clear the register x16 used for offsetting the filter values + ldp w8, w9, [x5] + ldp w10, w11, [x5, #8] + + movi v16.2d, #0 // initialize accumulator for idx + 0 + movi v17.2d, #0 // initialize accumulator for idx + 1 + movi v18.2d, #0 // initialize accumulator for idx + 2 + movi v19.2d, #0 // initialize accumulator for idx + 3 + + mov x12, x4 // filter + 0 + add x13, x4, x7 // filter + 1 + add x8, x3, w8, uxtw // srcp + filterPos 0 + add x14, x13, x7 // filter + 2 + add x9, x3, w9, uxtw // srcp + filterPos 1 + add x15, x14, x7 // filter + 3 + add x10, x3, w10, uxtw // srcp + filterPos 2 + mov w0, w6 // save the filterSize to temporary variable + add x11, x3, w11, uxtw // srcp + filterPos 3 + add x5, x5, #16 // advance filter position + mov x16, xzr // clear the register x16 used for offsetting the filter values 2: - ldr d4, [x8], #8 // load src values for idx 0 - ldr q31, [x12, x16] // load filter values for idx 0 - uxtl v4.8h, v4.8b // 
extend type to match the filter' size - ldr d5, [x9], #8 // load src values for idx 1 - smlal v16.4s, v4.4h, v31.4h // multiplication of lower half for idx 0 - uxtl v5.8h, v5.8b // extend type to match the filter' size - ldr q30, [x13, x16] // load filter values for idx 1 - smlal2 v16.4s, v4.8h, v31.8h // multiplication of upper half for idx 0 - ldr d6, [x10], #8 // load src values for idx 2 - ldr q29, [x14, x16] // load filter values for idx 2 - smlal v17.4s, v5.4h, v30.4h // multiplication of lower half for idx 1 - ldr d7, [x11], #8 // load src values for idx 3 - smlal2 v17.4s, v5.8h, v30.8h // multiplication of upper half for idx 1 - uxtl v6.8h, v6.8b // extend tpye to matchi the filter's size - ldr q28, [x15, x16] // load filter values for idx 3 - smlal v18.4s, v6.4h, v29.4h // multiplication of lower half for idx 2 - uxtl v7.8h, v7.8b - smlal2 v18.4s, v6.8h, v29.8h // multiplication of upper half for idx 2 - sub w0, w0, #8 - smlal v19.4s, v7.4h, v28.4h // multiplication of lower half for idx 3 - cmp w0, #8 - smlal2 v19.4s, v7.8h, v28.8h // multiplication of upper half for idx 3 - add x16, x16, #16 // advance filter values indexing - - b.ge 2b + ldr d4, [x8], #8 // load src values for idx 0 + ldr q31, [x12, x16] // load filter values for idx 0 + uxtl v4.8h, v4.8b // extend type to match the filter' size + ldr d5, [x9], #8 // load src values for idx 1 + smlal v16.4s, v4.4h, v31.4h // multiplication of lower half for idx 0 + uxtl v5.8h, v5.8b // extend type to match the filter' size + ldr q30, [x13, x16] // load filter values for idx 1 + smlal2 v16.4s, v4.8h, v31.8h // multiplication of upper half for idx 0 + ldr d6, [x10], #8 // load src values for idx 2 + ldr q29, [x14, x16] // load filter values for idx 2 + smlal v17.4s, v5.4h, v30.4h // multiplication of lower half for idx 1 + ldr d7, [x11], #8 // load src values for idx 3 + smlal2 v17.4s, v5.8h, v30.8h // multiplication of upper half for idx 1 + uxtl v6.8h, v6.8b // extend tpye to matchi the filter's size + ldr q28, [x15, x16] // load filter values for idx 3 + smlal v18.4s, v6.4h, v29.4h // multiplication of lower half for idx 2 + uxtl v7.8h, v7.8b + smlal2 v18.4s, v6.8h, v29.8h // multiplication of upper half for idx 2 + sub w0, w0, #8 + smlal v19.4s, v7.4h, v28.4h // multiplication of lower half for idx 3 + cmp w0, #8 + smlal2 v19.4s, v7.8h, v28.8h // multiplication of upper half for idx 3 + add x16, x16, #16 // advance filter values indexing + + b.ge 2b // 4 iterations left - sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements - - ldr s4, [x8] // load src values for idx 0 - ldr d31, [x12, x17] // load filter values for idx 0 - uxtl v4.8h, v4.8b // extend type to match the filter' size - ldr s5, [x9] // load src values for idx 1 - smlal v16.4s, v4.4h, v31.4h - ldr d30, [x13, x17] // load filter values for idx 1 - uxtl v5.8h, v5.8b // extend type to match the filter' size - ldr s6, [x10] // load src values for idx 2 - smlal v17.4s, v5.4h, v30.4h - uxtl v6.8h, v6.8b // extend type to match the filter's size - ldr d29, [x14, x17] // load filter values for idx 2 - ldr s7, [x11] // load src values for idx 3 - addp v16.4s, v16.4s, v17.4s - uxtl v7.8h, v7.8b - ldr d28, [x15, x17] // load filter values for idx 3 - smlal v18.4s, v6.4h, v29.4h - smlal v19.4s, v7.4h, v28.4h - subs w2, w2, #4 - addp v18.4s, v18.4s, v19.4s - addp v16.4s, v16.4s, v18.4s - sshr v16.4s, v16.4s, #3 - smin v16.4s, v16.4s, v20.4s - - st1 {v16.4s}, [x1], #16 - add x4, x4, x7, lsl #2 - b.gt 1b + sub x17, x7, #8 // step back to wrap up the filter 
pos for last 4 elements + + ldr s4, [x8] // load src values for idx 0 + ldr d31, [x12, x17] // load filter values for idx 0 + uxtl v4.8h, v4.8b // extend type to match the filter' size + ldr s5, [x9] // load src values for idx 1 + smlal v16.4s, v4.4h, v31.4h + ldr d30, [x13, x17] // load filter values for idx 1 + uxtl v5.8h, v5.8b // extend type to match the filter' size + ldr s6, [x10] // load src values for idx 2 + smlal v17.4s, v5.4h, v30.4h + uxtl v6.8h, v6.8b // extend type to match the filter's size + ldr d29, [x14, x17] // load filter values for idx 2 + ldr s7, [x11] // load src values for idx 3 + addp v16.4s, v16.4s, v17.4s + uxtl v7.8h, v7.8b + ldr d28, [x15, x17] // load filter values for idx 3 + smlal v18.4s, v6.4h, v29.4h + smlal v19.4s, v7.4h, v28.4h + subs w2, w2, #4 + addp v18.4s, v18.4s, v19.4s + addp v16.4s, v16.4s, v18.4s + sshr v16.4s, v16.4s, #3 + smin v16.4s, v16.4s, v20.4s + + st1 {v16.4s}, [x1], #16 + add x4, x4, x7, lsl #2 + b.gt 1b ret endfunc @@ -647,191 +647,191 @@ function ff_hscale16to15_4_neon_asm, export=1 // x5 const int32_t *filterPos // w6 int filterSize - movi v18.4s, #1 - movi v17.4s, #1 - shl v18.4s, v18.4s, #15 - sub v18.4s, v18.4s, v17.4s // max allowed value - dup v17.4s, w0 // read shift - neg v17.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) + movi v18.4s, #1 + movi v17.4s, #1 + shl v18.4s, v18.4s, #15 + sub v18.4s, v18.4s, v17.4s // max allowed value + dup v17.4s, w0 // read shift + neg v17.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) - cmp w2, #16 - b.lt 2f // move to last block + cmp w2, #16 + b.lt 2f // move to last block - ldp w8, w9, [x5] // filterPos[0], filterPos[1] - ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] - ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] - ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] - add x5, x5, #32 + ldp w8, w9, [x5] // filterPos[0], filterPos[1] + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] + add x5, x5, #32 // shift all filterPos left by one, as uint16_t will be read - lsl x8, x8, #1 - lsl x9, x9, #1 - lsl x10, x10, #1 - lsl x11, x11, #1 - lsl x12, x12, #1 - lsl x13, x13, #1 - lsl x14, x14, #1 - lsl x15, x15, #1 + lsl x8, x8, #1 + lsl x9, x9, #1 + lsl x10, x10, #1 + lsl x11, x11, #1 + lsl x12, x12, #1 + lsl x13, x13, #1 + lsl x14, x14, #1 + lsl x15, x15, #1 // load src with given offset - ldr x8, [x3, w8, uxtw] - ldr x9, [x3, w9, uxtw] - ldr x10, [x3, w10, uxtw] - ldr x11, [x3, w11, uxtw] - ldr x12, [x3, w12, uxtw] - ldr x13, [x3, w13, uxtw] - ldr x14, [x3, w14, uxtw] - ldr x15, [x3, w15, uxtw] - - sub sp, sp, #64 + ldr x8, [x3, w8, uxtw] + ldr x9, [x3, w9, uxtw] + ldr x10, [x3, w10, uxtw] + ldr x11, [x3, w11, uxtw] + ldr x12, [x3, w12, uxtw] + ldr x13, [x3, w13, uxtw] + ldr x14, [x3, w14, uxtw] + ldr x15, [x3, w15, uxtw] + + sub sp, sp, #64 // push src on stack so it can be loaded into vectors later - stp x8, x9, [sp] - stp x10, x11, [sp, #16] - stp x12, x13, [sp, #32] - stp x14, x15, [sp, #48] + stp x8, x9, [sp] + stp x10, x11, [sp, #16] + stp x12, x13, [sp, #32] + stp x14, x15, [sp, #48] 1: - ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] - ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] + ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] // Each of blocks does the following: // Extend src and filter to 32 bits with uxtl and sxtl // multiply or 
multiply and accumulate results // Extending to 32 bits is necessary, as unit16_t values can't // be represented as int16_t without type promotion. - uxtl v26.4s, v0.4h - sxtl v27.4s, v28.4h - uxtl2 v0.4s, v0.8h - mul v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v28.8h - uxtl v26.4s, v1.4h - mul v6.4s, v0.4s, v28.4s - - sxtl v27.4s, v29.4h - uxtl2 v0.4s, v1.8h - mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v29.8h - uxtl v26.4s, v2.4h - mla v6.4s, v28.4s, v0.4s - - sxtl v27.4s, v30.4h - uxtl2 v0.4s, v2.8h - mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v30.8h - uxtl v26.4s, v3.4h - mla v6.4s, v28.4s, v0.4s - - sxtl v27.4s, v31.4h - uxtl2 v0.4s, v3.8h - mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v31.8h - sub w2, w2, #8 - mla v6.4s, v28.4s, v0.4s - - sshl v5.4s, v5.4s, v17.4s - sshl v6.4s, v6.4s, v17.4s - smin v5.4s, v5.4s, v18.4s - smin v6.4s, v6.4s, v18.4s - xtn v5.4h, v5.4s - xtn2 v5.8h, v6.4s - - st1 {v5.8h}, [x1], #16 - cmp w2, #16 + uxtl v26.4s, v0.4h + sxtl v27.4s, v28.4h + uxtl2 v0.4s, v0.8h + mul v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v28.8h + uxtl v26.4s, v1.4h + mul v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v29.4h + uxtl2 v0.4s, v1.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v29.8h + uxtl v26.4s, v2.4h + mla v6.4s, v28.4s, v0.4s + + sxtl v27.4s, v30.4h + uxtl2 v0.4s, v2.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v30.8h + uxtl v26.4s, v3.4h + mla v6.4s, v28.4s, v0.4s + + sxtl v27.4s, v31.4h + uxtl2 v0.4s, v3.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v31.8h + sub w2, w2, #8 + mla v6.4s, v28.4s, v0.4s + + sshl v5.4s, v5.4s, v17.4s + sshl v6.4s, v6.4s, v17.4s + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + xtn v5.4h, v5.4s + xtn2 v5.8h, v6.4s + + st1 {v5.8h}, [x1], #16 + cmp w2, #16 // load filterPositions into registers for next iteration - ldp w8, w9, [x5] // filterPos[0], filterPos[1] - ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] - ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] - ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] - add x5, x5, #32 - - lsl x8, x8, #1 - lsl x9, x9, #1 - lsl x10, x10, #1 - lsl x11, x11, #1 - lsl x12, x12, #1 - lsl x13, x13, #1 - lsl x14, x14, #1 - lsl x15, x15, #1 - - ldr x8, [x3, w8, uxtw] - ldr x9, [x3, w9, uxtw] - ldr x10, [x3, w10, uxtw] - ldr x11, [x3, w11, uxtw] - ldr x12, [x3, w12, uxtw] - ldr x13, [x3, w13, uxtw] - ldr x14, [x3, w14, uxtw] - ldr x15, [x3, w15, uxtw] - - stp x8, x9, [sp] - stp x10, x11, [sp, #16] - stp x12, x13, [sp, #32] - stp x14, x15, [sp, #48] - - b.ge 1b + ldp w8, w9, [x5] // filterPos[0], filterPos[1] + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] + add x5, x5, #32 + + lsl x8, x8, #1 + lsl x9, x9, #1 + lsl x10, x10, #1 + lsl x11, x11, #1 + lsl x12, x12, #1 + lsl x13, x13, #1 + lsl x14, x14, #1 + lsl x15, x15, #1 + + ldr x8, [x3, w8, uxtw] + ldr x9, [x3, w9, uxtw] + ldr x10, [x3, w10, uxtw] + ldr x11, [x3, w11, uxtw] + ldr x12, [x3, w12, uxtw] + ldr x13, [x3, w13, uxtw] + ldr x14, [x3, w14, uxtw] + ldr x15, [x3, w15, uxtw] + + stp x8, x9, [sp] + stp x10, x11, [sp, #16] + stp x12, x13, [sp, #32] + stp x14, x15, [sp, #48] + + b.ge 1b // here we make last iteration, without updating the registers - ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] - ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 - - uxtl v26.4s, v0.4h - sxtl v27.4s, v28.4h - uxtl2 v0.4s, v0.8h - mul v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v28.8h - uxtl v26.4s, v1.4h - mul v6.4s, v0.4s, v28.4s - - sxtl v27.4s, v29.4h - uxtl2 v0.4s, 
v1.8h - mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v29.8h - uxtl v26.4s, v2.4h - mla v6.4s, v0.4s, v28.4s - - sxtl v27.4s, v30.4h - uxtl2 v0.4s, v2.8h - mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v30.8h - uxtl v26.4s, v3.4h - mla v6.4s, v0.4s, v28.4s - - sxtl v27.4s, v31.4h - uxtl2 v0.4s, v3.8h - mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v31.8h - subs w2, w2, #8 - mla v6.4s, v0.4s, v28.4s - - sshl v5.4s, v5.4s, v17.4s - sshl v6.4s, v6.4s, v17.4s - smin v5.4s, v5.4s, v18.4s - smin v6.4s, v6.4s, v18.4s - xtn v5.4h, v5.4s - xtn2 v5.8h, v6.4s - - st1 {v5.8h}, [x1], #16 - add sp, sp, #64 // restore stack - cbnz w2, 2f + ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 + + uxtl v26.4s, v0.4h + sxtl v27.4s, v28.4h + uxtl2 v0.4s, v0.8h + mul v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v28.8h + uxtl v26.4s, v1.4h + mul v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v29.4h + uxtl2 v0.4s, v1.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v29.8h + uxtl v26.4s, v2.4h + mla v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v30.4h + uxtl2 v0.4s, v2.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v30.8h + uxtl v26.4s, v3.4h + mla v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v31.4h + uxtl2 v0.4s, v3.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v31.8h + subs w2, w2, #8 + mla v6.4s, v0.4s, v28.4s + + sshl v5.4s, v5.4s, v17.4s + sshl v6.4s, v6.4s, v17.4s + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + xtn v5.4h, v5.4s + xtn2 v5.8h, v6.4s + + st1 {v5.8h}, [x1], #16 + add sp, sp, #64 // restore stack + cbnz w2, 2f ret 2: - ldr w8, [x5], #4 // load filterPos - lsl w8, w8, #1 - add x9, x3, w8, uxtw // src + filterPos - ld1 {v0.4h}, [x9] // load 4 * uint16_t - ld1 {v31.4h}, [x4], #8 - - uxtl v0.4s, v0.4h - sxtl v31.4s, v31.4h - mul v5.4s, v0.4s, v31.4s - addv s0, v5.4s - sshl v0.4s, v0.4s, v17.4s - smin v0.4s, v0.4s, v18.4s - st1 {v0.h}[0], [x1], #2 - sub w2, w2, #1 - cbnz w2, 2b // if iterations remain jump to beginning + ldr w8, [x5], #4 // load filterPos + lsl w8, w8, #1 + add x9, x3, w8, uxtw // src + filterPos + ld1 {v0.4h}, [x9] // load 4 * uint16_t + ld1 {v31.4h}, [x4], #8 + + uxtl v0.4s, v0.4h + sxtl v31.4s, v31.4h + mul v5.4s, v0.4s, v31.4s + addv s0, v5.4s + sshl v0.4s, v0.4s, v17.4s + smin v0.4s, v0.4s, v18.4s + st1 {v0.h}[0], [x1], #2 + sub w2, w2, #1 + cbnz w2, 2b // if iterations remain jump to beginning ret endfunc @@ -845,79 +845,79 @@ function ff_hscale16to15_X8_neon_asm, export=1 // x5 const int32_t *filterPos // w6 int filterSize - movi v20.4s, #1 - movi v21.4s, #1 - shl v20.4s, v20.4s, #15 - sub v20.4s, v20.4s, v21.4s - dup v21.4s, w0 - neg v21.4s, v21.4s - - sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) -1: ldr w8, [x5], #4 // filterPos[idx] - lsl w8, w8, #1 - ldr w10, [x5], #4 // filterPos[idx + 1] - lsl w10, w10, #1 - ldr w11, [x5], #4 // filterPos[idx + 2] - lsl w11, w11, #1 - ldr w9, [x5], #4 // filterPos[idx + 3] - lsl w9, w9, #1 - mov x16, x4 // filter0 = filter - add x12, x16, x7 // filter1 = filter0 + filterSize*2 - add x13, x12, x7 // filter2 = filter1 + filterSize*2 - add x4, x13, x7 // filter3 = filter2 + filterSize*2 - movi v0.2d, #0 // val sum part 1 (for dst[0]) - movi v1.2d, #0 // val sum part 2 (for dst[1]) - movi v2.2d, #0 // val sum part 3 (for dst[2]) - movi v3.2d, #0 // val sum part 4 (for dst[3]) - add x17, x3, w8, uxtw // srcp + filterPos[0] - add x8, x3, w10, uxtw // srcp + filterPos[1] - add x10, x3, w11, uxtw // srcp + filterPos[2] - add x11, x3, w9, uxtw // srcp + filterPos[3] - mov w15, w6 // filterSize counter -2: ld1 {v4.8h}, [x17], #16 // 
srcp[filterPos[0] + {0..7}] - ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 - ld1 {v6.8h}, [x8], #16 // srcp[filterPos[1] + {0..7}] - ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize - uxtl v24.4s, v4.4h // extend srcp lower half to 32 bits to preserve sign - sxtl v25.4s, v5.4h // extend filter lower half to 32 bits to match srcp size - uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits - mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5 - sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits - uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits - mla v0.4s, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 - sxtl v27.4s, v7.4h // exted filter lower half - uxtl2 v6.4s, v6.8h // extend srcp upper half - sxtl2 v7.4s, v7.8h // extend filter upper half - ld1 {v16.8h}, [x10], #16 // srcp[filterPos[2] + {0..7}] - mla v1.4s, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] - ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize - uxtl v22.4s, v16.4h // extend srcp lower half - sxtl v23.4s, v17.4h // extend filter lower half - uxtl2 v16.4s, v16.8h // extend srcp upper half - sxtl2 v17.4s, v17.8h // extend filter upper half - mla v2.4s, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] - mla v2.4s, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] - ld1 {v18.8h}, [x11], #16 // srcp[filterPos[3] + {0..7}] - mla v1.4s, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] - ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize - subs w15, w15, #8 // j -= 8: processed 8/filterSize - uxtl v28.4s, v18.4h // extend srcp lower half - sxtl v29.4s, v19.4h // extend filter lower half - uxtl2 v18.4s, v18.8h // extend srcp upper half - sxtl2 v19.4s, v19.8h // extend filter upper half - mla v3.4s, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] - mla v3.4s, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] - b.gt 2b // inner loop if filterSize not consumed completely - addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding - addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding - addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding - subs w2, w2, #4 // dstW -= 4 - sshl v0.4s, v0.4s, v21.4s // shift right (effectively rigth, as shift is negative); overflow expected - smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl) - xtn v0.4h, v0.4s // narrow down to 16 bits - - st1 {v0.4h}, [x1], #8 // write to destination part0123 - b.gt 1b // loop until end of line + movi v20.4s, #1 + movi v21.4s, #1 + shl v20.4s, v20.4s, #15 + sub v20.4s, v20.4s, v21.4s + dup v21.4s, w0 + neg v21.4s, v21.4s + + sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) +1: ldr w8, [x5], #4 // filterPos[idx] + lsl w8, w8, #1 + ldr w10, [x5], #4 // filterPos[idx + 1] + lsl w10, w10, #1 + ldr w11, [x5], #4 // filterPos[idx + 2] + lsl w11, w11, #1 + ldr w9, [x5], #4 // filterPos[idx + 3] + lsl w9, w9, #1 + mov x16, x4 // filter0 = filter + add x12, x16, x7 // filter1 = filter0 + filterSize*2 + add x13, x12, x7 // filter2 = filter1 + filterSize*2 + add x4, x13, x7 // filter3 = filter2 + filterSize*2 + movi v0.2d, #0 // val sum part 1 (for dst[0]) + movi v1.2d, #0 // val sum part 2 (for dst[1]) + movi v2.2d, #0 // val sum part 3 (for dst[2]) + movi v3.2d, #0 // val sum part 4 (for dst[3]) + add x17, x3, w8, uxtw // srcp + filterPos[0] + add x8, x3, w10, uxtw // srcp + 
filterPos[1] + add x10, x3, w11, uxtw // srcp + filterPos[2] + add x11, x3, w9, uxtw // srcp + filterPos[3] + mov w15, w6 // filterSize counter +2: ld1 {v4.8h}, [x17], #16 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 + ld1 {v6.8h}, [x8], #16 // srcp[filterPos[1] + {0..7}] + ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize + uxtl v24.4s, v4.4h // extend srcp lower half to 32 bits to preserve sign + sxtl v25.4s, v5.4h // extend filter lower half to 32 bits to match srcp size + uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits + mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5 + sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits + uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits + mla v0.4s, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 + sxtl v27.4s, v7.4h // exted filter lower half + uxtl2 v6.4s, v6.8h // extend srcp upper half + sxtl2 v7.4s, v7.8h // extend filter upper half + ld1 {v16.8h}, [x10], #16 // srcp[filterPos[2] + {0..7}] + mla v1.4s, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v22.4s, v16.4h // extend srcp lower half + sxtl v23.4s, v17.4h // extend filter lower half + uxtl2 v16.4s, v16.8h // extend srcp upper half + sxtl2 v17.4s, v17.8h // extend filter upper half + mla v2.4s, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + mla v2.4s, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + ld1 {v18.8h}, [x11], #16 // srcp[filterPos[3] + {0..7}] + mla v1.4s, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize + subs w15, w15, #8 // j -= 8: processed 8/filterSize + uxtl v28.4s, v18.4h // extend srcp lower half + sxtl v29.4s, v19.4h // extend filter lower half + uxtl2 v18.4s, v18.8h // extend srcp upper half + sxtl2 v19.4s, v19.8h // extend filter upper half + mla v3.4s, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] + mla v3.4s, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + b.gt 2b // inner loop if filterSize not consumed completely + addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding + addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding + addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding + subs w2, w2, #4 // dstW -= 4 + sshl v0.4s, v0.4s, v21.4s // shift right (effectively rigth, as shift is negative); overflow expected + smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl) + xtn v0.4h, v0.4s // narrow down to 16 bits + + st1 {v0.4h}, [x1], #8 // write to destination part0123 + b.gt 1b // loop until end of line ret endfunc @@ -930,118 +930,118 @@ function ff_hscale16to15_X4_neon_asm, export=1 // x5 const int32_t *filterPos // w6 int filterSize - stp d8, d9, [sp, #-0x20]! - stp d10, d11, [sp, #0x10] + stp d8, d9, [sp, #-0x20]! 
+ stp d10, d11, [sp, #0x10] - movi v18.4s, #1 - movi v17.4s, #1 - shl v18.4s, v18.4s, #15 - sub v21.4s, v18.4s, v17.4s // max allowed value - dup v17.4s, w0 // read shift - neg v20.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) + movi v18.4s, #1 + movi v17.4s, #1 + shl v18.4s, v18.4s, #15 + sub v21.4s, v18.4s, v17.4s // max allowed value + dup v17.4s, w0 // read shift + neg v20.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) - lsl w7, w6, #1 + lsl w7, w6, #1 1: - ldp w8, w9, [x5] - ldp w10, w11, [x5, #8] - - movi v16.2d, #0 // initialize accumulator for idx + 0 - movi v17.2d, #0 // initialize accumulator for idx + 1 - movi v18.2d, #0 // initialize accumulator for idx + 2 - movi v19.2d, #0 // initialize accumulator for idx + 3 - - mov x12, x4 // filter + 0 - add x13, x4, x7 // filter + 1 - add x8, x3, x8, lsl #1 // srcp + filterPos 0 - add x14, x13, x7 // filter + 2 - add x9, x3, x9, lsl #1 // srcp + filterPos 1 - add x15, x14, x7 // filter + 3 - add x10, x3, x10, lsl #1 // srcp + filterPos 2 - mov w0, w6 // save the filterSize to temporary variable - add x11, x3, x11, lsl #1 // srcp + filterPos 3 - add x5, x5, #16 // advance filter position - mov x16, xzr // clear the register x16 used for offsetting the filter values + ldp w8, w9, [x5] + ldp w10, w11, [x5, #8] + + movi v16.2d, #0 // initialize accumulator for idx + 0 + movi v17.2d, #0 // initialize accumulator for idx + 1 + movi v18.2d, #0 // initialize accumulator for idx + 2 + movi v19.2d, #0 // initialize accumulator for idx + 3 + + mov x12, x4 // filter + 0 + add x13, x4, x7 // filter + 1 + add x8, x3, x8, lsl #1 // srcp + filterPos 0 + add x14, x13, x7 // filter + 2 + add x9, x3, x9, lsl #1 // srcp + filterPos 1 + add x15, x14, x7 // filter + 3 + add x10, x3, x10, lsl #1 // srcp + filterPos 2 + mov w0, w6 // save the filterSize to temporary variable + add x11, x3, x11, lsl #1 // srcp + filterPos 3 + add x5, x5, #16 // advance filter position + mov x16, xzr // clear the register x16 used for offsetting the filter values 2: - ldr q4, [x8], #16 // load src values for idx 0 - ldr q5, [x9], #16 // load src values for idx 1 - uxtl v26.4s, v4.4h - uxtl2 v4.4s, v4.8h - ldr q31, [x12, x16] // load filter values for idx 0 - ldr q6, [x10], #16 // load src values for idx 2 - sxtl v22.4s, v31.4h - sxtl2 v31.4s, v31.8h - mla v16.4s, v26.4s, v22.4s // multiplication of lower half for idx 0 - uxtl v25.4s, v5.4h - uxtl2 v5.4s, v5.8h - ldr q30, [x13, x16] // load filter values for idx 1 - ldr q7, [x11], #16 // load src values for idx 3 - mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 - uxtl v24.4s, v6.4h - sxtl v8.4s, v30.4h - sxtl2 v30.4s, v30.8h - mla v17.4s, v25.4s, v8.4s // multiplication of lower half for idx 1 - ldr q29, [x14, x16] // load filter values for idx 2 - uxtl2 v6.4s, v6.8h - sxtl v9.4s, v29.4h - sxtl2 v29.4s, v29.8h - mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 - mla v18.4s, v24.4s, v9.4s // multiplication of lower half for idx 2 - ldr q28, [x15, x16] // load filter values for idx 3 - uxtl v23.4s, v7.4h - sxtl v10.4s, v28.4h - mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 - uxtl2 v7.4s, v7.8h - sxtl2 v28.4s, v28.8h - mla v19.4s, v23.4s, v10.4s // multiplication of lower half for idx 3 - sub w0, w0, #8 - cmp w0, #8 - mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 - - add x16, x16, #16 // advance filter values indexing - - b.ge 2b + ldr q4, [x8], #16 // load src values for idx 0 + ldr q5, 
[x9], #16 // load src values for idx 1 + uxtl v26.4s, v4.4h + uxtl2 v4.4s, v4.8h + ldr q31, [x12, x16] // load filter values for idx 0 + ldr q6, [x10], #16 // load src values for idx 2 + sxtl v22.4s, v31.4h + sxtl2 v31.4s, v31.8h + mla v16.4s, v26.4s, v22.4s // multiplication of lower half for idx 0 + uxtl v25.4s, v5.4h + uxtl2 v5.4s, v5.8h + ldr q30, [x13, x16] // load filter values for idx 1 + ldr q7, [x11], #16 // load src values for idx 3 + mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 + uxtl v24.4s, v6.4h + sxtl v8.4s, v30.4h + sxtl2 v30.4s, v30.8h + mla v17.4s, v25.4s, v8.4s // multiplication of lower half for idx 1 + ldr q29, [x14, x16] // load filter values for idx 2 + uxtl2 v6.4s, v6.8h + sxtl v9.4s, v29.4h + sxtl2 v29.4s, v29.8h + mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 + mla v18.4s, v24.4s, v9.4s // multiplication of lower half for idx 2 + ldr q28, [x15, x16] // load filter values for idx 3 + uxtl v23.4s, v7.4h + sxtl v10.4s, v28.4h + mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 + uxtl2 v7.4s, v7.8h + sxtl2 v28.4s, v28.8h + mla v19.4s, v23.4s, v10.4s // multiplication of lower half for idx 3 + sub w0, w0, #8 + cmp w0, #8 + mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 + + add x16, x16, #16 // advance filter values indexing + + b.ge 2b // 4 iterations left - sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements - - ldr d4, [x8] // load src values for idx 0 - ldr d31, [x12, x17] // load filter values for idx 0 - uxtl v4.4s, v4.4h - sxtl v31.4s, v31.4h - ldr d5, [x9] // load src values for idx 1 - mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 - ldr d30, [x13, x17] // load filter values for idx 1 - uxtl v5.4s, v5.4h - sxtl v30.4s, v30.4h - ldr d6, [x10] // load src values for idx 2 - mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 - ldr d29, [x14, x17] // load filter values for idx 2 - uxtl v6.4s, v6.4h - sxtl v29.4s, v29.4h - ldr d7, [x11] // load src values for idx 3 - ldr d28, [x15, x17] // load filter values for idx 3 - mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 - uxtl v7.4s, v7.4h - sxtl v28.4s, v28.4h - addp v16.4s, v16.4s, v17.4s - mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 - subs w2, w2, #4 - addp v18.4s, v18.4s, v19.4s - addp v16.4s, v16.4s, v18.4s - sshl v16.4s, v16.4s, v20.4s - smin v16.4s, v16.4s, v21.4s - xtn v16.4h, v16.4s - - st1 {v16.4h}, [x1], #8 - add x4, x4, x7, lsl #2 - b.gt 1b - - ldp d8, d9, [sp] - ldp d10, d11, [sp, #0x10] - - add sp, sp, #0x20 + sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements + + ldr d4, [x8] // load src values for idx 0 + ldr d31, [x12, x17] // load filter values for idx 0 + uxtl v4.4s, v4.4h + sxtl v31.4s, v31.4h + ldr d5, [x9] // load src values for idx 1 + mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 + ldr d30, [x13, x17] // load filter values for idx 1 + uxtl v5.4s, v5.4h + sxtl v30.4s, v30.4h + ldr d6, [x10] // load src values for idx 2 + mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 + ldr d29, [x14, x17] // load filter values for idx 2 + uxtl v6.4s, v6.4h + sxtl v29.4s, v29.4h + ldr d7, [x11] // load src values for idx 3 + ldr d28, [x15, x17] // load filter values for idx 3 + mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 + uxtl v7.4s, v7.4h + sxtl v28.4s, v28.4h + addp v16.4s, v16.4s, v17.4s + mla v19.4s, v7.4s, v28.4s // multiplication of 
upper half for idx 3 + subs w2, w2, #4 + addp v18.4s, v18.4s, v19.4s + addp v16.4s, v16.4s, v18.4s + sshl v16.4s, v16.4s, v20.4s + smin v16.4s, v16.4s, v21.4s + xtn v16.4h, v16.4s + + st1 {v16.4h}, [x1], #8 + add x4, x4, x7, lsl #2 + b.gt 1b + + ldp d8, d9, [sp] + ldp d10, d11, [sp, #0x10] + + add sp, sp, #0x20 ret endfunc @@ -1055,188 +1055,188 @@ function ff_hscale16to19_4_neon_asm, export=1 // x5 const int32_t *filterPos // w6 int filterSize - movi v18.4s, #1 - movi v17.4s, #1 - shl v18.4s, v18.4s, #19 - sub v18.4s, v18.4s, v17.4s // max allowed value - dup v17.4s, w0 // read shift - neg v17.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) + movi v18.4s, #1 + movi v17.4s, #1 + shl v18.4s, v18.4s, #19 + sub v18.4s, v18.4s, v17.4s // max allowed value + dup v17.4s, w0 // read shift + neg v17.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) - cmp w2, #16 - b.lt 2f // move to last block + cmp w2, #16 + b.lt 2f // move to last block - ldp w8, w9, [x5] // filterPos[0], filterPos[1] - ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] - ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] - ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] - add x5, x5, #32 + ldp w8, w9, [x5] // filterPos[0], filterPos[1] + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] + add x5, x5, #32 // shift all filterPos left by one, as uint16_t will be read - lsl x8, x8, #1 - lsl x9, x9, #1 - lsl x10, x10, #1 - lsl x11, x11, #1 - lsl x12, x12, #1 - lsl x13, x13, #1 - lsl x14, x14, #1 - lsl x15, x15, #1 + lsl x8, x8, #1 + lsl x9, x9, #1 + lsl x10, x10, #1 + lsl x11, x11, #1 + lsl x12, x12, #1 + lsl x13, x13, #1 + lsl x14, x14, #1 + lsl x15, x15, #1 // load src with given offset - ldr x8, [x3, w8, uxtw] - ldr x9, [x3, w9, uxtw] - ldr x10, [x3, w10, uxtw] - ldr x11, [x3, w11, uxtw] - ldr x12, [x3, w12, uxtw] - ldr x13, [x3, w13, uxtw] - ldr x14, [x3, w14, uxtw] - ldr x15, [x3, w15, uxtw] - - sub sp, sp, #64 + ldr x8, [x3, w8, uxtw] + ldr x9, [x3, w9, uxtw] + ldr x10, [x3, w10, uxtw] + ldr x11, [x3, w11, uxtw] + ldr x12, [x3, w12, uxtw] + ldr x13, [x3, w13, uxtw] + ldr x14, [x3, w14, uxtw] + ldr x15, [x3, w15, uxtw] + + sub sp, sp, #64 // push src on stack so it can be loaded into vectors later - stp x8, x9, [sp] - stp x10, x11, [sp, #16] - stp x12, x13, [sp, #32] - stp x14, x15, [sp, #48] + stp x8, x9, [sp] + stp x10, x11, [sp, #16] + stp x12, x13, [sp, #32] + stp x14, x15, [sp, #48] 1: - ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] - ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] + ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 // filter[0..7] // Each of blocks does the following: // Extend src and filter to 32 bits with uxtl and sxtl // multiply or multiply and accumulate results // Extending to 32 bits is necessary, as unit16_t values can't // be represented as int16_t without type promotion. 
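For reference, the widening idiom described in the comment above is shared by all of the hscale16to19 variants in this file; in scalar C one output sample corresponds roughly to the following sketch (illustrative names only, not the swscale reference code):

    #include <stdint.h>

    /* One destination sample of the 16-bit-in / 19-bit-out horizontal
     * scaler: src is uint16_t, filter is int16_t, and both are widened to
     * 32 bits (uxtl/sxtl) before the multiply-accumulate (mul/mla), since
     * a uint16_t sample cannot be represented as int16_t. */
    int32_t hscale16to19_sample(const uint16_t *src, const int16_t *filter,
                                int filterSize, int sh)
    {
        int32_t val = 0;
        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[j] * (int32_t)filter[j];
        val >>= sh;                    /* sshl by the negated shift amount */
        if (val > (1 << 19) - 1)       /* smin against the 19-bit maximum  */
            val = (1 << 19) - 1;
        return val;
    }

For ff_hscale16to19_4 the loop runs over a fixed filterSize of 4; the _X8 and _X4 variants below run the same accumulation over a runtime filterSize.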
- uxtl v26.4s, v0.4h - sxtl v27.4s, v28.4h - uxtl2 v0.4s, v0.8h - mul v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v28.8h - uxtl v26.4s, v1.4h - mul v6.4s, v0.4s, v28.4s - - sxtl v27.4s, v29.4h - uxtl2 v0.4s, v1.8h - mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v29.8h - uxtl v26.4s, v2.4h - mla v6.4s, v28.4s, v0.4s - - sxtl v27.4s, v30.4h - uxtl2 v0.4s, v2.8h - mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v30.8h - uxtl v26.4s, v3.4h - mla v6.4s, v28.4s, v0.4s - - sxtl v27.4s, v31.4h - uxtl2 v0.4s, v3.8h - mla v5.4s, v27.4s, v26.4s - sxtl2 v28.4s, v31.8h - sub w2, w2, #8 - mla v6.4s, v28.4s, v0.4s - - sshl v5.4s, v5.4s, v17.4s - sshl v6.4s, v6.4s, v17.4s - smin v5.4s, v5.4s, v18.4s - smin v6.4s, v6.4s, v18.4s - - st1 {v5.4s, v6.4s}, [x1], #32 - cmp w2, #16 + uxtl v26.4s, v0.4h + sxtl v27.4s, v28.4h + uxtl2 v0.4s, v0.8h + mul v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v28.8h + uxtl v26.4s, v1.4h + mul v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v29.4h + uxtl2 v0.4s, v1.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v29.8h + uxtl v26.4s, v2.4h + mla v6.4s, v28.4s, v0.4s + + sxtl v27.4s, v30.4h + uxtl2 v0.4s, v2.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v30.8h + uxtl v26.4s, v3.4h + mla v6.4s, v28.4s, v0.4s + + sxtl v27.4s, v31.4h + uxtl2 v0.4s, v3.8h + mla v5.4s, v27.4s, v26.4s + sxtl2 v28.4s, v31.8h + sub w2, w2, #8 + mla v6.4s, v28.4s, v0.4s + + sshl v5.4s, v5.4s, v17.4s + sshl v6.4s, v6.4s, v17.4s + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + + st1 {v5.4s, v6.4s}, [x1], #32 + cmp w2, #16 // load filterPositions into registers for next iteration - ldp w8, w9, [x5] // filterPos[0], filterPos[1] - ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] - ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] - ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] - add x5, x5, #32 - - lsl x8, x8, #1 - lsl x9, x9, #1 - lsl x10, x10, #1 - lsl x11, x11, #1 - lsl x12, x12, #1 - lsl x13, x13, #1 - lsl x14, x14, #1 - lsl x15, x15, #1 - - ldr x8, [x3, w8, uxtw] - ldr x9, [x3, w9, uxtw] - ldr x10, [x3, w10, uxtw] - ldr x11, [x3, w11, uxtw] - ldr x12, [x3, w12, uxtw] - ldr x13, [x3, w13, uxtw] - ldr x14, [x3, w14, uxtw] - ldr x15, [x3, w15, uxtw] - - stp x8, x9, [sp] - stp x10, x11, [sp, #16] - stp x12, x13, [sp, #32] - stp x14, x15, [sp, #48] - - b.ge 1b + ldp w8, w9, [x5] // filterPos[0], filterPos[1] + ldp w10, w11, [x5, #8] // filterPos[2], filterPos[3] + ldp w12, w13, [x5, #16] // filterPos[4], filterPos[5] + ldp w14, w15, [x5, #24] // filterPos[6], filterPos[7] + add x5, x5, #32 + + lsl x8, x8, #1 + lsl x9, x9, #1 + lsl x10, x10, #1 + lsl x11, x11, #1 + lsl x12, x12, #1 + lsl x13, x13, #1 + lsl x14, x14, #1 + lsl x15, x15, #1 + + ldr x8, [x3, w8, uxtw] + ldr x9, [x3, w9, uxtw] + ldr x10, [x3, w10, uxtw] + ldr x11, [x3, w11, uxtw] + ldr x12, [x3, w12, uxtw] + ldr x13, [x3, w13, uxtw] + ldr x14, [x3, w14, uxtw] + ldr x15, [x3, w15, uxtw] + + stp x8, x9, [sp] + stp x10, x11, [sp, #16] + stp x12, x13, [sp, #32] + stp x14, x15, [sp, #48] + + b.ge 1b // here we make last iteration, without updating the registers - ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] - ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 - - uxtl v26.4s, v0.4h - sxtl v27.4s, v28.4h - uxtl2 v0.4s, v0.8h - mul v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v28.8h - uxtl v26.4s, v1.4h - mul v6.4s, v0.4s, v28.4s - - sxtl v27.4s, v29.4h - uxtl2 v0.4s, v1.8h - mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v29.8h - uxtl v26.4s, v2.4h - mla v6.4s, v0.4s, v28.4s - - sxtl v27.4s, v30.4h - uxtl2 v0.4s, v2.8h - mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v30.8h - uxtl 
v26.4s, v3.4h - mla v6.4s, v0.4s, v28.4s - - sxtl v27.4s, v31.4h - uxtl2 v0.4s, v3.8h - mla v5.4s, v26.4s, v27.4s - sxtl2 v28.4s, v31.8h - subs w2, w2, #8 - mla v6.4s, v0.4s, v28.4s - - sshl v5.4s, v5.4s, v17.4s - sshl v6.4s, v6.4s, v17.4s - - smin v5.4s, v5.4s, v18.4s - smin v6.4s, v6.4s, v18.4s - - st1 {v5.4s, v6.4s}, [x1], #32 - add sp, sp, #64 // restore stack - cbnz w2, 2f + ld4 {v0.8h, v1.8h, v2.8h, v3.8h}, [sp] + ld4 {v28.8h, v29.8h, v30.8h, v31.8h}, [x4], #64 + + uxtl v26.4s, v0.4h + sxtl v27.4s, v28.4h + uxtl2 v0.4s, v0.8h + mul v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v28.8h + uxtl v26.4s, v1.4h + mul v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v29.4h + uxtl2 v0.4s, v1.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v29.8h + uxtl v26.4s, v2.4h + mla v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v30.4h + uxtl2 v0.4s, v2.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v30.8h + uxtl v26.4s, v3.4h + mla v6.4s, v0.4s, v28.4s + + sxtl v27.4s, v31.4h + uxtl2 v0.4s, v3.8h + mla v5.4s, v26.4s, v27.4s + sxtl2 v28.4s, v31.8h + subs w2, w2, #8 + mla v6.4s, v0.4s, v28.4s + + sshl v5.4s, v5.4s, v17.4s + sshl v6.4s, v6.4s, v17.4s + + smin v5.4s, v5.4s, v18.4s + smin v6.4s, v6.4s, v18.4s + + st1 {v5.4s, v6.4s}, [x1], #32 + add sp, sp, #64 // restore stack + cbnz w2, 2f ret 2: - ldr w8, [x5], #4 // load filterPos - lsl w8, w8, #1 - add x9, x3, w8, uxtw // src + filterPos - ld1 {v0.4h}, [x9] // load 4 * uint16_t - ld1 {v31.4h}, [x4], #8 - - uxtl v0.4s, v0.4h - sxtl v31.4s, v31.4h - subs w2, w2, #1 - mul v5.4s, v0.4s, v31.4s - addv s0, v5.4s - sshl v0.4s, v0.4s, v17.4s - smin v0.4s, v0.4s, v18.4s - st1 {v0.s}[0], [x1], #4 - cbnz w2, 2b // if iterations remain jump to beginning + ldr w8, [x5], #4 // load filterPos + lsl w8, w8, #1 + add x9, x3, w8, uxtw // src + filterPos + ld1 {v0.4h}, [x9] // load 4 * uint16_t + ld1 {v31.4h}, [x4], #8 + + uxtl v0.4s, v0.4h + sxtl v31.4s, v31.4h + subs w2, w2, #1 + mul v5.4s, v0.4s, v31.4s + addv s0, v5.4s + sshl v0.4s, v0.4s, v17.4s + smin v0.4s, v0.4s, v18.4s + st1 {v0.s}[0], [x1], #4 + cbnz w2, 2b // if iterations remain jump to beginning ret endfunc @@ -1250,77 +1250,77 @@ function ff_hscale16to19_X8_neon_asm, export=1 // x5 const int32_t *filterPos // w6 int filterSize - movi v20.4s, #1 - movi v21.4s, #1 - shl v20.4s, v20.4s, #19 - sub v20.4s, v20.4s, v21.4s - dup v21.4s, w0 - neg v21.4s, v21.4s - - sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) -1: ldr w8, [x5], #4 // filterPos[idx] - ldr w10, [x5], #4 // filterPos[idx + 1] - lsl w8, w8, #1 - ldr w11, [x5], #4 // filterPos[idx + 2] - ldr w9, [x5], #4 // filterPos[idx + 3] - mov x16, x4 // filter0 = filter - lsl w11, w11, #1 - add x12, x16, x7 // filter1 = filter0 + filterSize*2 - lsl w9, w9, #1 - add x13, x12, x7 // filter2 = filter1 + filterSize*2 - lsl w10, w10, #1 - add x4, x13, x7 // filter3 = filter2 + filterSize*2 - movi v0.2d, #0 // val sum part 1 (for dst[0]) - movi v1.2d, #0 // val sum part 2 (for dst[1]) - movi v2.2d, #0 // val sum part 3 (for dst[2]) - movi v3.2d, #0 // val sum part 4 (for dst[3]) - add x17, x3, w8, uxtw // srcp + filterPos[0] - add x8, x3, w10, uxtw // srcp + filterPos[1] - add x10, x3, w11, uxtw // srcp + filterPos[2] - add x11, x3, w9, uxtw // srcp + filterPos[3] - mov w15, w6 // filterSize counter -2: ld1 {v4.8h}, [x17], #16 // srcp[filterPos[0] + {0..7}] - ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 - ld1 {v6.8h}, [x8], #16 // srcp[filterPos[1] + {0..7}] - ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize - uxtl v24.4s, v4.4h // extend srcp lower half to 
32 bits to preserve sign - sxtl v25.4s, v5.4h // extend filter lower half to 32 bits to match srcp size - uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits - mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5 - sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits - uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits - mla v0.4s, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 - sxtl v27.4s, v7.4h // exted filter lower half - uxtl2 v6.4s, v6.8h // extend srcp upper half - sxtl2 v7.4s, v7.8h // extend filter upper half - ld1 {v16.8h}, [x10], #16 // srcp[filterPos[2] + {0..7}] - mla v1.4s, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] - ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize - uxtl v22.4s, v16.4h // extend srcp lower half - sxtl v23.4s, v17.4h // extend filter lower half - uxtl2 v16.4s, v16.8h // extend srcp upper half - sxtl2 v17.4s, v17.8h // extend filter upper half - mla v2.4s, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] - mla v2.4s, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] - ld1 {v18.8h}, [x11], #16 // srcp[filterPos[3] + {0..7}] - mla v1.4s, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] - ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize - subs w15, w15, #8 // j -= 8: processed 8/filterSize - uxtl v28.4s, v18.4h // extend srcp lower half - sxtl v29.4s, v19.4h // extend filter lower half - uxtl2 v18.4s, v18.8h // extend srcp upper half - sxtl2 v19.4s, v19.8h // extend filter upper half - mla v3.4s, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] - mla v3.4s, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] - b.gt 2b // inner loop if filterSize not consumed completely - addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding - addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding - addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding - subs w2, w2, #4 // dstW -= 4 - sshl v0.4s, v0.4s, v21.4s // shift right (effectively rigth, as shift is negative); overflow expected - smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl) - st1 {v0.4s}, [x1], #16 // write to destination part0123 - b.gt 1b // loop until end of line + movi v20.4s, #1 + movi v21.4s, #1 + shl v20.4s, v20.4s, #19 + sub v20.4s, v20.4s, v21.4s + dup v21.4s, w0 + neg v21.4s, v21.4s + + sbfiz x7, x6, #1, #32 // filterSize*2 (*2 because int16) +1: ldr w8, [x5], #4 // filterPos[idx] + ldr w10, [x5], #4 // filterPos[idx + 1] + lsl w8, w8, #1 + ldr w11, [x5], #4 // filterPos[idx + 2] + ldr w9, [x5], #4 // filterPos[idx + 3] + mov x16, x4 // filter0 = filter + lsl w11, w11, #1 + add x12, x16, x7 // filter1 = filter0 + filterSize*2 + lsl w9, w9, #1 + add x13, x12, x7 // filter2 = filter1 + filterSize*2 + lsl w10, w10, #1 + add x4, x13, x7 // filter3 = filter2 + filterSize*2 + movi v0.2d, #0 // val sum part 1 (for dst[0]) + movi v1.2d, #0 // val sum part 2 (for dst[1]) + movi v2.2d, #0 // val sum part 3 (for dst[2]) + movi v3.2d, #0 // val sum part 4 (for dst[3]) + add x17, x3, w8, uxtw // srcp + filterPos[0] + add x8, x3, w10, uxtw // srcp + filterPos[1] + add x10, x3, w11, uxtw // srcp + filterPos[2] + add x11, x3, w9, uxtw // srcp + filterPos[3] + mov w15, w6 // filterSize counter +2: ld1 {v4.8h}, [x17], #16 // srcp[filterPos[0] + {0..7}] + ld1 {v5.8h}, [x16], #16 // load 8x16-bit filter values, part 1 + ld1 {v6.8h}, [x8], #16 // srcp[filterPos[1] 
+ {0..7}] + ld1 {v7.8h}, [x12], #16 // load 8x16-bit at filter+filterSize + uxtl v24.4s, v4.4h // extend srcp lower half to 32 bits to preserve sign + sxtl v25.4s, v5.4h // extend filter lower half to 32 bits to match srcp size + uxtl2 v4.4s, v4.8h // extend srcp upper half to 32 bits + mla v0.4s, v24.4s, v25.4s // multiply accumulate lower half of v4 * v5 + sxtl2 v5.4s, v5.8h // extend filter upper half to 32 bits + uxtl v26.4s, v6.4h // extend srcp lower half to 32 bits + mla v0.4s, v4.4s, v5.4s // multiply accumulate upper half of v4 * v5 + sxtl v27.4s, v7.4h // exted filter lower half + uxtl2 v6.4s, v6.8h // extend srcp upper half + sxtl2 v7.4s, v7.8h // extend filter upper half + ld1 {v16.8h}, [x10], #16 // srcp[filterPos[2] + {0..7}] + mla v1.4s, v26.4s, v27.4s // v1 accumulates srcp[filterPos[1] + {0..3}] * filter[{0..3}] + ld1 {v17.8h}, [x13], #16 // load 8x16-bit at filter+2*filterSize + uxtl v22.4s, v16.4h // extend srcp lower half + sxtl v23.4s, v17.4h // extend filter lower half + uxtl2 v16.4s, v16.8h // extend srcp upper half + sxtl2 v17.4s, v17.8h // extend filter upper half + mla v2.4s, v22.4s, v23.4s // v2 accumulates srcp[filterPos[2] + {0..3}] * filter[{0..3}] + mla v2.4s, v16.4s, v17.4s // v2 accumulates srcp[filterPos[2] + {4..7}] * filter[{4..7}] + ld1 {v18.8h}, [x11], #16 // srcp[filterPos[3] + {0..7}] + mla v1.4s, v6.4s, v7.4s // v1 accumulates srcp[filterPos[1] + {4..7}] * filter[{4..7}] + ld1 {v19.8h}, [x4], #16 // load 8x16-bit at filter+3*filterSize + subs w15, w15, #8 // j -= 8: processed 8/filterSize + uxtl v28.4s, v18.4h // extend srcp lower half + sxtl v29.4s, v19.4h // extend filter lower half + uxtl2 v18.4s, v18.8h // extend srcp upper half + sxtl2 v19.4s, v19.8h // extend filter upper half + mla v3.4s, v28.4s, v29.4s // v3 accumulates srcp[filterPos[3] + {0..3}] * filter[{0..3}] + mla v3.4s, v18.4s, v19.4s // v3 accumulates srcp[filterPos[3] + {4..7}] * filter[{4..7}] + b.gt 2b // inner loop if filterSize not consumed completely + addp v0.4s, v0.4s, v1.4s // part01 horizontal pair adding + addp v2.4s, v2.4s, v3.4s // part23 horizontal pair adding + addp v0.4s, v0.4s, v2.4s // part0123 horizontal pair adding + subs w2, w2, #4 // dstW -= 4 + sshl v0.4s, v0.4s, v21.4s // shift right (effectively rigth, as shift is negative); overflow expected + smin v0.4s, v0.4s, v20.4s // apply min (do not use sqshl) + st1 {v0.4s}, [x1], #16 // write to destination part0123 + b.gt 1b // loop until end of line ret endfunc @@ -1333,117 +1333,117 @@ function ff_hscale16to19_X4_neon_asm, export=1 // x5 const int32_t *filterPos // w6 int filterSize - stp d8, d9, [sp, #-0x20]! - stp d10, d11, [sp, #0x10] + stp d8, d9, [sp, #-0x20]! 
+ stp d10, d11, [sp, #0x10] - movi v18.4s, #1 - movi v17.4s, #1 - shl v18.4s, v18.4s, #19 - sub v21.4s, v18.4s, v17.4s // max allowed value - dup v17.4s, w0 // read shift - neg v20.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) + movi v18.4s, #1 + movi v17.4s, #1 + shl v18.4s, v18.4s, #19 + sub v21.4s, v18.4s, v17.4s // max allowed value + dup v17.4s, w0 // read shift + neg v20.4s, v17.4s // negate it, so it can be used in sshl (effectively shift right) - lsl w7, w6, #1 + lsl w7, w6, #1 1: - ldp w8, w9, [x5] - ldp w10, w11, [x5, #8] - - movi v16.2d, #0 // initialize accumulator for idx + 0 - movi v17.2d, #0 // initialize accumulator for idx + 1 - movi v18.2d, #0 // initialize accumulator for idx + 2 - movi v19.2d, #0 // initialize accumulator for idx + 3 - - mov x12, x4 // filter + 0 - add x13, x4, x7 // filter + 1 - add x8, x3, x8, lsl #1 // srcp + filterPos 0 - add x14, x13, x7 // filter + 2 - add x9, x3, x9, lsl #1 // srcp + filterPos 1 - add x15, x14, x7 // filter + 3 - add x10, x3, x10, lsl #1 // srcp + filterPos 2 - mov w0, w6 // save the filterSize to temporary variable - add x11, x3, x11, lsl #1 // srcp + filterPos 3 - add x5, x5, #16 // advance filter position - mov x16, xzr // clear the register x16 used for offsetting the filter values + ldp w8, w9, [x5] + ldp w10, w11, [x5, #8] + + movi v16.2d, #0 // initialize accumulator for idx + 0 + movi v17.2d, #0 // initialize accumulator for idx + 1 + movi v18.2d, #0 // initialize accumulator for idx + 2 + movi v19.2d, #0 // initialize accumulator for idx + 3 + + mov x12, x4 // filter + 0 + add x13, x4, x7 // filter + 1 + add x8, x3, x8, lsl #1 // srcp + filterPos 0 + add x14, x13, x7 // filter + 2 + add x9, x3, x9, lsl #1 // srcp + filterPos 1 + add x15, x14, x7 // filter + 3 + add x10, x3, x10, lsl #1 // srcp + filterPos 2 + mov w0, w6 // save the filterSize to temporary variable + add x11, x3, x11, lsl #1 // srcp + filterPos 3 + add x5, x5, #16 // advance filter position + mov x16, xzr // clear the register x16 used for offsetting the filter values 2: - ldr q4, [x8], #16 // load src values for idx 0 - ldr q5, [x9], #16 // load src values for idx 1 - uxtl v26.4s, v4.4h - uxtl2 v4.4s, v4.8h - ldr q31, [x12, x16] // load filter values for idx 0 - ldr q6, [x10], #16 // load src values for idx 2 - sxtl v22.4s, v31.4h - sxtl2 v31.4s, v31.8h - mla v16.4s, v26.4s, v22.4s // multiplication of lower half for idx 0 - uxtl v25.4s, v5.4h - uxtl2 v5.4s, v5.8h - ldr q30, [x13, x16] // load filter values for idx 1 - ldr q7, [x11], #16 // load src values for idx 3 - mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 - uxtl v24.4s, v6.4h - sxtl v8.4s, v30.4h - sxtl2 v30.4s, v30.8h - mla v17.4s, v25.4s, v8.4s // multiplication of lower half for idx 1 - ldr q29, [x14, x16] // load filter values for idx 2 - uxtl2 v6.4s, v6.8h - sxtl v9.4s, v29.4h - sxtl2 v29.4s, v29.8h - mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 - ldr q28, [x15, x16] // load filter values for idx 3 - mla v18.4s, v24.4s, v9.4s // multiplication of lower half for idx 2 - uxtl v23.4s, v7.4h - sxtl v10.4s, v28.4h - mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 - uxtl2 v7.4s, v7.8h - sxtl2 v28.4s, v28.8h - mla v19.4s, v23.4s, v10.4s // multiplication of lower half for idx 3 - sub w0, w0, #8 - cmp w0, #8 - mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 - - add x16, x16, #16 // advance filter values indexing - - b.ge 2b + ldr q4, [x8], #16 // load src values for idx 0 + ldr q5, 
[x9], #16 // load src values for idx 1 + uxtl v26.4s, v4.4h + uxtl2 v4.4s, v4.8h + ldr q31, [x12, x16] // load filter values for idx 0 + ldr q6, [x10], #16 // load src values for idx 2 + sxtl v22.4s, v31.4h + sxtl2 v31.4s, v31.8h + mla v16.4s, v26.4s, v22.4s // multiplication of lower half for idx 0 + uxtl v25.4s, v5.4h + uxtl2 v5.4s, v5.8h + ldr q30, [x13, x16] // load filter values for idx 1 + ldr q7, [x11], #16 // load src values for idx 3 + mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 + uxtl v24.4s, v6.4h + sxtl v8.4s, v30.4h + sxtl2 v30.4s, v30.8h + mla v17.4s, v25.4s, v8.4s // multiplication of lower half for idx 1 + ldr q29, [x14, x16] // load filter values for idx 2 + uxtl2 v6.4s, v6.8h + sxtl v9.4s, v29.4h + sxtl2 v29.4s, v29.8h + mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 + ldr q28, [x15, x16] // load filter values for idx 3 + mla v18.4s, v24.4s, v9.4s // multiplication of lower half for idx 2 + uxtl v23.4s, v7.4h + sxtl v10.4s, v28.4h + mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 + uxtl2 v7.4s, v7.8h + sxtl2 v28.4s, v28.8h + mla v19.4s, v23.4s, v10.4s // multiplication of lower half for idx 3 + sub w0, w0, #8 + cmp w0, #8 + mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 + + add x16, x16, #16 // advance filter values indexing + + b.ge 2b // 4 iterations left - sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements - - ldr d4, [x8] // load src values for idx 0 - ldr d31, [x12, x17] // load filter values for idx 0 - uxtl v4.4s, v4.4h - sxtl v31.4s, v31.4h - ldr d5, [x9] // load src values for idx 1 - mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 - ldr d30, [x13, x17] // load filter values for idx 1 - uxtl v5.4s, v5.4h - sxtl v30.4s, v30.4h - ldr d6, [x10] // load src values for idx 2 - mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 - ldr d29, [x14, x17] // load filter values for idx 2 - uxtl v6.4s, v6.4h - sxtl v29.4s, v29.4h - ldr d7, [x11] // load src values for idx 3 - ldr d28, [x15, x17] // load filter values for idx 3 - mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 - uxtl v7.4s, v7.4h - sxtl v28.4s, v28.4h - addp v16.4s, v16.4s, v17.4s - mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 3 - subs w2, w2, #4 - addp v18.4s, v18.4s, v19.4s - addp v16.4s, v16.4s, v18.4s - sshl v16.4s, v16.4s, v20.4s - smin v16.4s, v16.4s, v21.4s - - st1 {v16.4s}, [x1], #16 - add x4, x4, x7, lsl #2 - b.gt 1b - - ldp d8, d9, [sp] - ldp d10, d11, [sp, #0x10] - - add sp, sp, #0x20 + sub x17, x7, #8 // step back to wrap up the filter pos for last 4 elements + + ldr d4, [x8] // load src values for idx 0 + ldr d31, [x12, x17] // load filter values for idx 0 + uxtl v4.4s, v4.4h + sxtl v31.4s, v31.4h + ldr d5, [x9] // load src values for idx 1 + mla v16.4s, v4.4s, v31.4s // multiplication of upper half for idx 0 + ldr d30, [x13, x17] // load filter values for idx 1 + uxtl v5.4s, v5.4h + sxtl v30.4s, v30.4h + ldr d6, [x10] // load src values for idx 2 + mla v17.4s, v5.4s, v30.4s // multiplication of upper half for idx 1 + ldr d29, [x14, x17] // load filter values for idx 2 + uxtl v6.4s, v6.4h + sxtl v29.4s, v29.4h + ldr d7, [x11] // load src values for idx 3 + ldr d28, [x15, x17] // load filter values for idx 3 + mla v18.4s, v6.4s, v29.4s // multiplication of upper half for idx 2 + uxtl v7.4s, v7.4h + sxtl v28.4s, v28.4h + addp v16.4s, v16.4s, v17.4s + mla v19.4s, v7.4s, v28.4s // multiplication of upper half for idx 
3 + subs w2, w2, #4 + addp v18.4s, v18.4s, v19.4s + addp v16.4s, v16.4s, v18.4s + sshl v16.4s, v16.4s, v20.4s + smin v16.4s, v16.4s, v21.4s + + st1 {v16.4s}, [x1], #16 + add x4, x4, x7, lsl #2 + b.gt 1b + + ldp d8, d9, [sp] + ldp d10, d11, [sp, #0x10] + + add sp, sp, #0x20 ret endfunc diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S index 344d0659ea..934d62dfd0 100644 --- a/libswscale/aarch64/output.S +++ b/libswscale/aarch64/output.S @@ -29,178 +29,178 @@ function ff_yuv2planeX_8_neon, export=1 // x5 - const uint8_t *dither, // w6 - int offset - ld1 {v0.8b}, [x5] // load 8x8-bit dither - and w6, w6, #7 - cbz w6, 1f // check if offsetting present - ext v0.8b, v0.8b, v0.8b, #3 // honor offsetting which can be 0 or 3 only -1: uxtl v0.8h, v0.8b // extend dither to 16-bit - ushll v1.4s, v0.4h, #12 // extend dither to 32-bit with left shift by 12 (part 1) - ushll2 v2.4s, v0.8h, #12 // extend dither to 32-bit with left shift by 12 (part 2) - cmp w1, #8 // if filterSize == 8, branch to specialized version - b.eq 6f - cmp w1, #4 // if filterSize == 4, branch to specialized version - b.eq 8f - cmp w1, #2 // if filterSize == 2, branch to specialized version - b.eq 10f + ld1 {v0.8b}, [x5] // load 8x8-bit dither + and w6, w6, #7 + cbz w6, 1f // check if offsetting present + ext v0.8b, v0.8b, v0.8b, #3 // honor offsetting which can be 0 or 3 only +1: uxtl v0.8h, v0.8b // extend dither to 16-bit + ushll v1.4s, v0.4h, #12 // extend dither to 32-bit with left shift by 12 (part 1) + ushll2 v2.4s, v0.8h, #12 // extend dither to 32-bit with left shift by 12 (part 2) + cmp w1, #8 // if filterSize == 8, branch to specialized version + b.eq 6f + cmp w1, #4 // if filterSize == 4, branch to specialized version + b.eq 8f + cmp w1, #2 // if filterSize == 2, branch to specialized version + b.eq 10f // The filter size does not match of the of specialized implementations. It is either even or odd. If it is even // then use the first section below. 
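Both generic loops that follow (even and odd filterSize) and the specialized filterSize 8/4/2 branches compute the same per-pixel sum; a scalar C sketch of one output pixel, based on the comments in this function (illustrative names, not the exact swscale C reference), might look like:

    #include <stdint.h>

    static uint8_t clip_u8(int v)
    {
        return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
    }

    /* One output pixel of the vertical scaler: start from the dither byte
     * shifted up by 12 bits, accumulate filterSize products of 16-bit
     * source planes and 16-bit coefficients (smlal/smlal2), then drop 19
     * bits and clip to 8 bits (sqshrun #16 followed by uqshrn #3). */
    uint8_t yuv2planeX_pixel(const int16_t *filter, int filterSize,
                             const int16_t **src, int i, uint8_t dither)
    {
        int32_t val = (int32_t)dither << 12;
        for (int j = 0; j < filterSize; j++)
            val += (int32_t)src[j][i] * filter[j];
        return clip_u8(val >> 19);
    }

The even path consumes two coefficients per iteration, the odd path one; the filterSize 8, 4 and 2 branches fully unroll the same loop with the coefficients kept in a single vector register.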
- mov x7, #0 // i = 0 - tbnz w1, #0, 4f // if filterSize % 2 != 0 branch to specialized version + mov x7, #0 // i = 0 + tbnz w1, #0, 4f // if filterSize % 2 != 0 branch to specialized version // fs % 2 == 0 -2: mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value - mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value - mov w8, w1 // tmpfilterSize = filterSize - mov x9, x2 // srcp = src - mov x10, x0 // filterp = filter -3: ldp x11, x12, [x9], #16 // get 2 pointers: src[j] and src[j+1] - ldr s7, [x10], #4 // read 2x16-bit coeff X and Y at filter[j] and filter[j+1] - add x11, x11, x7, lsl #1 // &src[j ][i] - add x12, x12, x7, lsl #1 // &src[j+1][i] - ld1 {v5.8h}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H - ld1 {v6.8h}, [x12] // read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P - smlal v3.4s, v5.4h, v7.h[0] // val0 += {A,B,C,D} * X - smlal2 v4.4s, v5.8h, v7.h[0] // val1 += {E,F,G,H} * X - smlal v3.4s, v6.4h, v7.h[1] // val0 += {I,J,K,L} * Y - smlal2 v4.4s, v6.8h, v7.h[1] // val1 += {M,N,O,P} * Y - subs w8, w8, #2 // tmpfilterSize -= 2 - b.gt 3b // loop until filterSize consumed - - sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) - sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) - uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) - st1 {v3.8b}, [x3], #8 // write to destination - subs w4, w4, #8 // dstW -= 8 - add x7, x7, #8 // i += 8 - b.gt 2b // loop until width consumed +2: mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value + mov w8, w1 // tmpfilterSize = filterSize + mov x9, x2 // srcp = src + mov x10, x0 // filterp = filter +3: ldp x11, x12, [x9], #16 // get 2 pointers: src[j] and src[j+1] + ldr s7, [x10], #4 // read 2x16-bit coeff X and Y at filter[j] and filter[j+1] + add x11, x11, x7, lsl #1 // &src[j ][i] + add x12, x12, x7, lsl #1 // &src[j+1][i] + ld1 {v5.8h}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H + ld1 {v6.8h}, [x12] // read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P + smlal v3.4s, v5.4h, v7.h[0] // val0 += {A,B,C,D} * X + smlal2 v4.4s, v5.8h, v7.h[0] // val1 += {E,F,G,H} * X + smlal v3.4s, v6.4h, v7.h[1] // val0 += {I,J,K,L} * Y + smlal2 v4.4s, v6.8h, v7.h[1] // val1 += {M,N,O,P} * Y + subs w8, w8, #2 // tmpfilterSize -= 2 + b.gt 3b // loop until filterSize consumed + + sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) + sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) + uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) + st1 {v3.8b}, [x3], #8 // write to destination + subs w4, w4, #8 // dstW -= 8 + add x7, x7, #8 // i += 8 + b.gt 2b // loop until width consumed ret // If filter size is odd (most likely == 1), then use this section. 
// fs % 2 != 0 -4: mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value - mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value - mov w8, w1 // tmpfilterSize = filterSize - mov x9, x2 // srcp = src - mov x10, x0 // filterp = filter -5: ldr x11, [x9], #8 // get 1 pointer: src[j] - ldr h6, [x10], #2 // read 1 16 bit coeff X at filter[j] - add x11, x11, x7, lsl #1 // &src[j ][i] - ld1 {v5.8h}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H - smlal v3.4s, v5.4h, v6.h[0] // val0 += {A,B,C,D} * X - smlal2 v4.4s, v5.8h, v6.h[0] // val1 += {E,F,G,H} * X - subs w8, w8, #1 // tmpfilterSize -= 2 - b.gt 5b // loop until filterSize consumed - - sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) - sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) - uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) - st1 {v3.8b}, [x3], #8 // write to destination - subs w4, w4, #8 // dstW -= 8 - add x7, x7, #8 // i += 8 - b.gt 4b // loop until width consumed +4: mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value + mov w8, w1 // tmpfilterSize = filterSize + mov x9, x2 // srcp = src + mov x10, x0 // filterp = filter +5: ldr x11, [x9], #8 // get 1 pointer: src[j] + ldr h6, [x10], #2 // read 1 16 bit coeff X at filter[j] + add x11, x11, x7, lsl #1 // &src[j ][i] + ld1 {v5.8h}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H + smlal v3.4s, v5.4h, v6.h[0] // val0 += {A,B,C,D} * X + smlal2 v4.4s, v5.8h, v6.h[0] // val1 += {E,F,G,H} * X + subs w8, w8, #1 // tmpfilterSize -= 2 + b.gt 5b // loop until filterSize consumed + + sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) + sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) + uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) + st1 {v3.8b}, [x3], #8 // write to destination + subs w4, w4, #8 // dstW -= 8 + add x7, x7, #8 // i += 8 + b.gt 4b // loop until width consumed ret 6: // fs=8 - ldp x5, x6, [x2] // load 2 pointers: src[j ] and src[j+1] - ldp x7, x9, [x2, #16] // load 2 pointers: src[j+2] and src[j+3] - ldp x10, x11, [x2, #32] // load 2 pointers: src[j+4] and src[j+5] - ldp x12, x13, [x2, #48] // load 2 pointers: src[j+6] and src[j+7] + ldp x5, x6, [x2] // load 2 pointers: src[j ] and src[j+1] + ldp x7, x9, [x2, #16] // load 2 pointers: src[j+2] and src[j+3] + ldp x10, x11, [x2, #32] // load 2 pointers: src[j+4] and src[j+5] + ldp x12, x13, [x2, #48] // load 2 pointers: src[j+6] and src[j+7] // load 8x16-bit values for filter[j], where j=0..7 - ld1 {v6.8h}, [x0] + ld1 {v6.8h}, [x0] 7: - mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value - mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value - - ld1 {v24.8h}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] - ld1 {v25.8h}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] - ld1 {v26.8h}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] - ld1 {v27.8h}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] - ld1 {v28.8h}, [x10], #16 // load 8x16-bit values for src[j + 4][i + {0..7}] - ld1 {v29.8h}, [x11], #16 // load 8x16-bit values for src[j + 5][i + {0..7}] - ld1 {v30.8h}, [x12], #16 // load 8x16-bit values for src[j + 6][i + {0..7}] - ld1 {v31.8h}, [x13], #16 // load 8x16-bit values for src[j + 7][i + {0..7}] - - smlal v3.4s, v24.4h, v6.h[0] // val0 += src[0][i + {0..3}] * filter[0] - smlal2 v4.4s, v24.8h, v6.h[0] // val1 += src[0][i + {4..7}] * filter[0] - smlal v3.4s, v25.4h, v6.h[1] // val0 += src[1][i + 
{0..3}] * filter[1] - smlal2 v4.4s, v25.8h, v6.h[1] // val1 += src[1][i + {4..7}] * filter[1] - smlal v3.4s, v26.4h, v6.h[2] // val0 += src[2][i + {0..3}] * filter[2] - smlal2 v4.4s, v26.8h, v6.h[2] // val1 += src[2][i + {4..7}] * filter[2] - smlal v3.4s, v27.4h, v6.h[3] // val0 += src[3][i + {0..3}] * filter[3] - smlal2 v4.4s, v27.8h, v6.h[3] // val1 += src[3][i + {4..7}] * filter[3] - smlal v3.4s, v28.4h, v6.h[4] // val0 += src[4][i + {0..3}] * filter[4] - smlal2 v4.4s, v28.8h, v6.h[4] // val1 += src[4][i + {4..7}] * filter[4] - smlal v3.4s, v29.4h, v6.h[5] // val0 += src[5][i + {0..3}] * filter[5] - smlal2 v4.4s, v29.8h, v6.h[5] // val1 += src[5][i + {4..7}] * filter[5] - smlal v3.4s, v30.4h, v6.h[6] // val0 += src[6][i + {0..3}] * filter[6] - smlal2 v4.4s, v30.8h, v6.h[6] // val1 += src[6][i + {4..7}] * filter[6] - smlal v3.4s, v31.4h, v6.h[7] // val0 += src[7][i + {0..3}] * filter[7] - smlal2 v4.4s, v31.8h, v6.h[7] // val1 += src[7][i + {4..7}] * filter[7] - - sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) - sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) - uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) - subs w4, w4, #8 // dstW -= 8 - st1 {v3.8b}, [x3], #8 // write to destination - b.gt 7b // loop until width consumed + mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value + + ld1 {v24.8h}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] + ld1 {v25.8h}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] + ld1 {v26.8h}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] + ld1 {v27.8h}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] + ld1 {v28.8h}, [x10], #16 // load 8x16-bit values for src[j + 4][i + {0..7}] + ld1 {v29.8h}, [x11], #16 // load 8x16-bit values for src[j + 5][i + {0..7}] + ld1 {v30.8h}, [x12], #16 // load 8x16-bit values for src[j + 6][i + {0..7}] + ld1 {v31.8h}, [x13], #16 // load 8x16-bit values for src[j + 7][i + {0..7}] + + smlal v3.4s, v24.4h, v6.h[0] // val0 += src[0][i + {0..3}] * filter[0] + smlal2 v4.4s, v24.8h, v6.h[0] // val1 += src[0][i + {4..7}] * filter[0] + smlal v3.4s, v25.4h, v6.h[1] // val0 += src[1][i + {0..3}] * filter[1] + smlal2 v4.4s, v25.8h, v6.h[1] // val1 += src[1][i + {4..7}] * filter[1] + smlal v3.4s, v26.4h, v6.h[2] // val0 += src[2][i + {0..3}] * filter[2] + smlal2 v4.4s, v26.8h, v6.h[2] // val1 += src[2][i + {4..7}] * filter[2] + smlal v3.4s, v27.4h, v6.h[3] // val0 += src[3][i + {0..3}] * filter[3] + smlal2 v4.4s, v27.8h, v6.h[3] // val1 += src[3][i + {4..7}] * filter[3] + smlal v3.4s, v28.4h, v6.h[4] // val0 += src[4][i + {0..3}] * filter[4] + smlal2 v4.4s, v28.8h, v6.h[4] // val1 += src[4][i + {4..7}] * filter[4] + smlal v3.4s, v29.4h, v6.h[5] // val0 += src[5][i + {0..3}] * filter[5] + smlal2 v4.4s, v29.8h, v6.h[5] // val1 += src[5][i + {4..7}] * filter[5] + smlal v3.4s, v30.4h, v6.h[6] // val0 += src[6][i + {0..3}] * filter[6] + smlal2 v4.4s, v30.8h, v6.h[6] // val1 += src[6][i + {4..7}] * filter[6] + smlal v3.4s, v31.4h, v6.h[7] // val0 += src[7][i + {0..3}] * filter[7] + smlal2 v4.4s, v31.8h, v6.h[7] // val1 += src[7][i + {4..7}] * filter[7] + + sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) + sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) + uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) + subs w4, w4, #8 // dstW -= 8 + st1 {v3.8b}, [x3], #8 // write to destination + b.gt 7b // loop until width consumed ret 8: // fs=4 - ldp x5, x6, [x2] // load 2 pointers: src[j ] and src[j+1] - ldp x7, x9, [x2, #16] // load 
2 pointers: src[j+2] and src[j+3] + ldp x5, x6, [x2] // load 2 pointers: src[j ] and src[j+1] + ldp x7, x9, [x2, #16] // load 2 pointers: src[j+2] and src[j+3] // load 4x16-bit values for filter[j], where j=0..3 and replicated across lanes - ld1 {v6.4h}, [x0] + ld1 {v6.4h}, [x0] 9: - mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value - mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value - - ld1 {v24.8h}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] - ld1 {v25.8h}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] - ld1 {v26.8h}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] - ld1 {v27.8h}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] - - smlal v3.4s, v24.4h, v6.h[0] // val0 += src[0][i + {0..3}] * filter[0] - smlal2 v4.4s, v24.8h, v6.h[0] // val1 += src[0][i + {4..7}] * filter[0] - smlal v3.4s, v25.4h, v6.h[1] // val0 += src[1][i + {0..3}] * filter[1] - smlal2 v4.4s, v25.8h, v6.h[1] // val1 += src[1][i + {4..7}] * filter[1] - smlal v3.4s, v26.4h, v6.h[2] // val0 += src[2][i + {0..3}] * filter[2] - smlal2 v4.4s, v26.8h, v6.h[2] // val1 += src[2][i + {4..7}] * filter[2] - smlal v3.4s, v27.4h, v6.h[3] // val0 += src[3][i + {0..3}] * filter[3] - smlal2 v4.4s, v27.8h, v6.h[3] // val1 += src[3][i + {4..7}] * filter[3] - - sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) - sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) - uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) - st1 {v3.8b}, [x3], #8 // write to destination - subs w4, w4, #8 // dstW -= 8 - b.gt 9b // loop until width consumed + mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value + + ld1 {v24.8h}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] + ld1 {v25.8h}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] + ld1 {v26.8h}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] + ld1 {v27.8h}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] + + smlal v3.4s, v24.4h, v6.h[0] // val0 += src[0][i + {0..3}] * filter[0] + smlal2 v4.4s, v24.8h, v6.h[0] // val1 += src[0][i + {4..7}] * filter[0] + smlal v3.4s, v25.4h, v6.h[1] // val0 += src[1][i + {0..3}] * filter[1] + smlal2 v4.4s, v25.8h, v6.h[1] // val1 += src[1][i + {4..7}] * filter[1] + smlal v3.4s, v26.4h, v6.h[2] // val0 += src[2][i + {0..3}] * filter[2] + smlal2 v4.4s, v26.8h, v6.h[2] // val1 += src[2][i + {4..7}] * filter[2] + smlal v3.4s, v27.4h, v6.h[3] // val0 += src[3][i + {0..3}] * filter[3] + smlal2 v4.4s, v27.8h, v6.h[3] // val1 += src[3][i + {4..7}] * filter[3] + + sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) + sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) + uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) + st1 {v3.8b}, [x3], #8 // write to destination + subs w4, w4, #8 // dstW -= 8 + b.gt 9b // loop until width consumed ret 10: // fs=2 - ldp x5, x6, [x2] // load 2 pointers: src[j ] and src[j+1] + ldp x5, x6, [x2] // load 2 pointers: src[j ] and src[j+1] // load 2x16-bit values for filter[j], where j=0..1 and replicated across lanes - ldr s6, [x0] + ldr s6, [x0] 11: - mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value - mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value - - ld1 {v24.8h}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] - ld1 {v25.8h}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] - - smlal v3.4s, v24.4h, v6.h[0] // val0 += src[0][i + {0..3}] * filter[0] - smlal2 v4.4s, 
v24.8h, v6.h[0] // val1 += src[0][i + {4..7}] * filter[0] - smlal v3.4s, v25.4h, v6.h[1] // val0 += src[1][i + {0..3}] * filter[1] - smlal2 v4.4s, v25.8h, v6.h[1] // val1 += src[1][i + {4..7}] * filter[1] - - sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) - sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) - uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) - st1 {v3.8b}, [x3], #8 // write to destination - subs w4, w4, #8 // dstW -= 8 - b.gt 11b // loop until width consumed + mov v3.16b, v1.16b // initialize accumulator part 1 with dithering value + mov v4.16b, v2.16b // initialize accumulator part 2 with dithering value + + ld1 {v24.8h}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] + ld1 {v25.8h}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] + + smlal v3.4s, v24.4h, v6.h[0] // val0 += src[0][i + {0..3}] * filter[0] + smlal2 v4.4s, v24.8h, v6.h[0] // val1 += src[0][i + {4..7}] * filter[0] + smlal v3.4s, v25.4h, v6.h[1] // val0 += src[1][i + {0..3}] * filter[1] + smlal2 v4.4s, v25.8h, v6.h[1] // val1 += src[1][i + {4..7}] * filter[1] + + sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) + sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) + uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) + st1 {v3.8b}, [x3], #8 // write to destination + subs w4, w4, #8 // dstW -= 8 + b.gt 11b // loop until width consumed ret endfunc @@ -210,25 +210,25 @@ function ff_yuv2plane1_8_neon, export=1 // w2 - int dstW, // x3 - const uint8_t *dither, // w4 - int offset - ld1 {v0.8b}, [x3] // load 8x8-bit dither - and w4, w4, #7 - cbz w4, 1f // check if offsetting present - ext v0.8b, v0.8b, v0.8b, #3 // honor offsetting which can be 0 or 3 only -1: uxtl v0.8h, v0.8b // extend dither to 32-bit - uxtl v1.4s, v0.4h - uxtl2 v2.4s, v0.8h + ld1 {v0.8b}, [x3] // load 8x8-bit dither + and w4, w4, #7 + cbz w4, 1f // check if offsetting present + ext v0.8b, v0.8b, v0.8b, #3 // honor offsetting which can be 0 or 3 only +1: uxtl v0.8h, v0.8b // extend dither to 32-bit + uxtl v1.4s, v0.4h + uxtl2 v2.4s, v0.8h 2: - ld1 {v3.8h}, [x0], #16 // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H - sxtl v4.4s, v3.4h - sxtl2 v5.4s, v3.8h - add v4.4s, v4.4s, v1.4s - add v5.4s, v5.4s, v2.4s - sqshrun v4.4h, v4.4s, #6 - sqshrun2 v4.8h, v5.4s, #6 - - uqshrn v3.8b, v4.8h, #1 // clip8(val>>7) - subs w2, w2, #8 // dstW -= 8 - st1 {v3.8b}, [x1], #8 // write to destination - b.gt 2b // loop until width consumed + ld1 {v3.8h}, [x0], #16 // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H + sxtl v4.4s, v3.4h + sxtl2 v5.4s, v3.8h + add v4.4s, v4.4s, v1.4s + add v5.4s, v5.4s, v2.4s + sqshrun v4.4h, v4.4s, #6 + sqshrun2 v4.8h, v5.4s, #6 + + uqshrn v3.8b, v4.8h, #1 // clip8(val>>7) + subs w2, w2, #8 // dstW -= 8 + st1 {v3.8b}, [x1], #8 // write to destination + b.gt 2b // loop until width consumed ret endfunc diff --git a/libswscale/aarch64/yuv2rgb_neon.S b/libswscale/aarch64/yuv2rgb_neon.S index 379d75622e..89d69e7f6c 100644 --- a/libswscale/aarch64/yuv2rgb_neon.S +++ b/libswscale/aarch64/yuv2rgb_neon.S @@ -23,23 +23,23 @@ .macro load_yoff_ycoeff yoff ycoeff #if defined(__APPLE__) - ldp w9, w10, [sp, #\yoff] + ldp w9, w10, [sp, #\yoff] #else - ldr w9, [sp, #\yoff] - ldr w10, [sp, #\ycoeff] + ldr w9, [sp, #\yoff] + ldr w10, [sp, #\ycoeff] #endif .endm .macro load_args_nv12 - ldr x8, [sp] // table - load_yoff_ycoeff 8, 16 // y_offset, y_coeff - ld1 {v1.1d}, [x8] - dup v0.8h, w10 - dup v3.8h, w9 - sub w3, w3, w0, lsl #2 // w3 = linesize - width * 4 (padding) - sub w5, w5, w0 // w5 = linesizeY - width (paddingY) - sub w7, w7, w0 // w7 = 
linesizeC - width (paddingC) - neg w11, w0 + ldr x8, [sp] // table + load_yoff_ycoeff 8, 16 // y_offset, y_coeff + ld1 {v1.1d}, [x8] + dup v0.8h, w10 + dup v3.8h, w9 + sub w3, w3, w0, lsl #2 // w3 = linesize - width * 4 (padding) + sub w5, w5, w0 // w5 = linesizeY - width (paddingY) + sub w7, w7, w0 // w7 = linesizeC - width (paddingC) + neg w11, w0 .endm .macro load_args_nv21 @@ -47,52 +47,52 @@ .endm .macro load_args_yuv420p - ldr x13, [sp] // srcV - ldr w14, [sp, #8] // linesizeV - ldr x8, [sp, #16] // table - load_yoff_ycoeff 24, 32 // y_offset, y_coeff - ld1 {v1.1d}, [x8] - dup v0.8h, w10 - dup v3.8h, w9 - sub w3, w3, w0, lsl #2 // w3 = linesize - width * 4 (padding) - sub w5, w5, w0 // w5 = linesizeY - width (paddingY) - sub w7, w7, w0, lsr #1 // w7 = linesizeU - width / 2 (paddingU) - sub w14, w14, w0, lsr #1 // w14 = linesizeV - width / 2 (paddingV) - lsr w11, w0, #1 - neg w11, w11 + ldr x13, [sp] // srcV + ldr w14, [sp, #8] // linesizeV + ldr x8, [sp, #16] // table + load_yoff_ycoeff 24, 32 // y_offset, y_coeff + ld1 {v1.1d}, [x8] + dup v0.8h, w10 + dup v3.8h, w9 + sub w3, w3, w0, lsl #2 // w3 = linesize - width * 4 (padding) + sub w5, w5, w0 // w5 = linesizeY - width (paddingY) + sub w7, w7, w0, lsr #1 // w7 = linesizeU - width / 2 (paddingU) + sub w14, w14, w0, lsr #1 // w14 = linesizeV - width / 2 (paddingV) + lsr w11, w0, #1 + neg w11, w11 .endm .macro load_args_yuv422p - ldr x13, [sp] // srcV - ldr w14, [sp, #8] // linesizeV - ldr x8, [sp, #16] // table - load_yoff_ycoeff 24, 32 // y_offset, y_coeff - ld1 {v1.1d}, [x8] - dup v0.8h, w10 - dup v3.8h, w9 - sub w3, w3, w0, lsl #2 // w3 = linesize - width * 4 (padding) - sub w5, w5, w0 // w5 = linesizeY - width (paddingY) - sub w7, w7, w0, lsr #1 // w7 = linesizeU - width / 2 (paddingU) - sub w14, w14, w0, lsr #1 // w14 = linesizeV - width / 2 (paddingV) + ldr x13, [sp] // srcV + ldr w14, [sp, #8] // linesizeV + ldr x8, [sp, #16] // table + load_yoff_ycoeff 24, 32 // y_offset, y_coeff + ld1 {v1.1d}, [x8] + dup v0.8h, w10 + dup v3.8h, w9 + sub w3, w3, w0, lsl #2 // w3 = linesize - width * 4 (padding) + sub w5, w5, w0 // w5 = linesizeY - width (paddingY) + sub w7, w7, w0, lsr #1 // w7 = linesizeU - width / 2 (paddingU) + sub w14, w14, w0, lsr #1 // w14 = linesizeV - width / 2 (paddingV) .endm .macro load_chroma_nv12 - ld2 {v16.8b, v17.8b}, [x6], #16 - ushll v18.8h, v16.8b, #3 - ushll v19.8h, v17.8b, #3 + ld2 {v16.8b, v17.8b}, [x6], #16 + ushll v18.8h, v16.8b, #3 + ushll v19.8h, v17.8b, #3 .endm .macro load_chroma_nv21 - ld2 {v16.8b, v17.8b}, [x6], #16 - ushll v19.8h, v16.8b, #3 - ushll v18.8h, v17.8b, #3 + ld2 {v16.8b, v17.8b}, [x6], #16 + ushll v19.8h, v16.8b, #3 + ushll v18.8h, v17.8b, #3 .endm .macro load_chroma_yuv420p - ld1 {v16.8b}, [ x6], #8 - ld1 {v17.8b}, [x13], #8 - ushll v18.8h, v16.8b, #3 - ushll v19.8h, v17.8b, #3 + ld1 {v16.8b}, [ x6], #8 + ld1 {v17.8b}, [x13], #8 + ushll v18.8h, v16.8b, #3 + ushll v19.8h, v17.8b, #3 .endm .macro load_chroma_yuv422p @@ -100,9 +100,9 @@ .endm .macro increment_nv12 - ands w15, w1, #1 - csel w16, w7, w11, ne // incC = (h & 1) ? paddincC : -width - add x6, x6, w16, sxtw // srcC += incC + ands w15, w1, #1 + csel w16, w7, w11, ne // incC = (h & 1) ? paddincC : -width + add x6, x6, w16, sxtw // srcC += incC .endm .macro increment_nv21 @@ -110,100 +110,100 @@ .endm .macro increment_yuv420p - ands w15, w1, #1 - csel w16, w7, w11, ne // incU = (h & 1) ? paddincU : -width/2 - csel w17, w14, w11, ne // incV = (h & 1) ? 
paddincV : -width/2 - add x6, x6, w16, sxtw // srcU += incU - add x13, x13, w17, sxtw // srcV += incV + ands w15, w1, #1 + csel w16, w7, w11, ne // incU = (h & 1) ? paddincU : -width/2 + csel w17, w14, w11, ne // incV = (h & 1) ? paddincV : -width/2 + add x6, x6, w16, sxtw // srcU += incU + add x13, x13, w17, sxtw // srcV += incV .endm .macro increment_yuv422p - add x6, x6, w7, sxtw // srcU += incU - add x13, x13, w14, sxtw // srcV += incV + add x6, x6, w7, sxtw // srcU += incU + add x13, x13, w14, sxtw // srcV += incV .endm .macro compute_rgba r1 g1 b1 a1 r2 g2 b2 a2 - add v20.8h, v26.8h, v20.8h // Y1 + R1 - add v21.8h, v27.8h, v21.8h // Y2 + R2 - add v22.8h, v26.8h, v22.8h // Y1 + G1 - add v23.8h, v27.8h, v23.8h // Y2 + G2 - add v24.8h, v26.8h, v24.8h // Y1 + B1 - add v25.8h, v27.8h, v25.8h // Y2 + B2 - sqrshrun \r1, v20.8h, #1 // clip_u8((Y1 + R1) >> 1) - sqrshrun \r2, v21.8h, #1 // clip_u8((Y2 + R1) >> 1) - sqrshrun \g1, v22.8h, #1 // clip_u8((Y1 + G1) >> 1) - sqrshrun \g2, v23.8h, #1 // clip_u8((Y2 + G1) >> 1) - sqrshrun \b1, v24.8h, #1 // clip_u8((Y1 + B1) >> 1) - sqrshrun \b2, v25.8h, #1 // clip_u8((Y2 + B1) >> 1) - movi \a1, #255 - movi \a2, #255 + add v20.8h, v26.8h, v20.8h // Y1 + R1 + add v21.8h, v27.8h, v21.8h // Y2 + R2 + add v22.8h, v26.8h, v22.8h // Y1 + G1 + add v23.8h, v27.8h, v23.8h // Y2 + G2 + add v24.8h, v26.8h, v24.8h // Y1 + B1 + add v25.8h, v27.8h, v25.8h // Y2 + B2 + sqrshrun \r1, v20.8h, #1 // clip_u8((Y1 + R1) >> 1) + sqrshrun \r2, v21.8h, #1 // clip_u8((Y2 + R1) >> 1) + sqrshrun \g1, v22.8h, #1 // clip_u8((Y1 + G1) >> 1) + sqrshrun \g2, v23.8h, #1 // clip_u8((Y2 + G1) >> 1) + sqrshrun \b1, v24.8h, #1 // clip_u8((Y1 + B1) >> 1) + sqrshrun \b2, v25.8h, #1 // clip_u8((Y2 + B1) >> 1) + movi \a1, #255 + movi \a2, #255 .endm .macro declare_func ifmt ofmt function ff_\ifmt\()_to_\ofmt\()_neon, export=1 load_args_\ifmt - mov w9, w1 + mov w9, w1 1: - mov w8, w0 // w8 = width + mov w8, w0 // w8 = width 2: - movi v5.8h, #4, lsl #8 // 128 * (1<<3) + movi v5.8h, #4, lsl #8 // 128 * (1<<3) load_chroma_\ifmt - sub v18.8h, v18.8h, v5.8h // U*(1<<3) - 128*(1<<3) - sub v19.8h, v19.8h, v5.8h // V*(1<<3) - 128*(1<<3) - sqdmulh v20.8h, v19.8h, v1.h[0] // V * v2r (R) - sqdmulh v22.8h, v18.8h, v1.h[1] // U * u2g - sqdmulh v19.8h, v19.8h, v1.h[2] // V * v2g - add v22.8h, v22.8h, v19.8h // U * u2g + V * v2g (G) - sqdmulh v24.8h, v18.8h, v1.h[3] // U * u2b (B) - zip2 v21.8h, v20.8h, v20.8h // R2 - zip1 v20.8h, v20.8h, v20.8h // R1 - zip2 v23.8h, v22.8h, v22.8h // G2 - zip1 v22.8h, v22.8h, v22.8h // G1 - zip2 v25.8h, v24.8h, v24.8h // B2 - zip1 v24.8h, v24.8h, v24.8h // B1 - ld1 {v2.16b}, [x4], #16 // load luma - ushll v26.8h, v2.8b, #3 // Y1*(1<<3) - ushll2 v27.8h, v2.16b, #3 // Y2*(1<<3) - sub v26.8h, v26.8h, v3.8h // Y1*(1<<3) - y_offset - sub v27.8h, v27.8h, v3.8h // Y2*(1<<3) - y_offset - sqdmulh v26.8h, v26.8h, v0.8h // ((Y1*(1<<3) - y_offset) * y_coeff) >> 15 - sqdmulh v27.8h, v27.8h, v0.8h // ((Y2*(1<<3) - y_offset) * y_coeff) >> 15 + sub v18.8h, v18.8h, v5.8h // U*(1<<3) - 128*(1<<3) + sub v19.8h, v19.8h, v5.8h // V*(1<<3) - 128*(1<<3) + sqdmulh v20.8h, v19.8h, v1.h[0] // V * v2r (R) + sqdmulh v22.8h, v18.8h, v1.h[1] // U * u2g + sqdmulh v19.8h, v19.8h, v1.h[2] // V * v2g + add v22.8h, v22.8h, v19.8h // U * u2g + V * v2g (G) + sqdmulh v24.8h, v18.8h, v1.h[3] // U * u2b (B) + zip2 v21.8h, v20.8h, v20.8h // R2 + zip1 v20.8h, v20.8h, v20.8h // R1 + zip2 v23.8h, v22.8h, v22.8h // G2 + zip1 v22.8h, v22.8h, v22.8h // G1 + zip2 v25.8h, v24.8h, v24.8h // B2 + zip1 v24.8h, v24.8h, v24.8h 
// B1 + ld1 {v2.16b}, [x4], #16 // load luma + ushll v26.8h, v2.8b, #3 // Y1*(1<<3) + ushll2 v27.8h, v2.16b, #3 // Y2*(1<<3) + sub v26.8h, v26.8h, v3.8h // Y1*(1<<3) - y_offset + sub v27.8h, v27.8h, v3.8h // Y2*(1<<3) - y_offset + sqdmulh v26.8h, v26.8h, v0.8h // ((Y1*(1<<3) - y_offset) * y_coeff) >> 15 + sqdmulh v27.8h, v27.8h, v0.8h // ((Y2*(1<<3) - y_offset) * y_coeff) >> 15 .ifc \ofmt,argb // 1 2 3 0 - compute_rgba v5.8b,v6.8b,v7.8b,v4.8b, v17.8b,v18.8b,v19.8b,v16.8b + compute_rgba v5.8b,v6.8b,v7.8b,v4.8b, v17.8b,v18.8b,v19.8b,v16.8b .endif .ifc \ofmt,rgba // 0 1 2 3 - compute_rgba v4.8b,v5.8b,v6.8b,v7.8b, v16.8b,v17.8b,v18.8b,v19.8b + compute_rgba v4.8b,v5.8b,v6.8b,v7.8b, v16.8b,v17.8b,v18.8b,v19.8b .endif .ifc \ofmt,abgr // 3 2 1 0 - compute_rgba v7.8b,v6.8b,v5.8b,v4.8b, v19.8b,v18.8b,v17.8b,v16.8b + compute_rgba v7.8b,v6.8b,v5.8b,v4.8b, v19.8b,v18.8b,v17.8b,v16.8b .endif .ifc \ofmt,bgra // 2 1 0 3 - compute_rgba v6.8b,v5.8b,v4.8b,v7.8b, v18.8b,v17.8b,v16.8b,v19.8b + compute_rgba v6.8b,v5.8b,v4.8b,v7.8b, v18.8b,v17.8b,v16.8b,v19.8b .endif - st4 { v4.8b, v5.8b, v6.8b, v7.8b}, [x2], #32 - st4 {v16.8b,v17.8b,v18.8b,v19.8b}, [x2], #32 - subs w8, w8, #16 // width -= 16 - b.gt 2b - add x2, x2, w3, sxtw // dst += padding - add x4, x4, w5, sxtw // srcY += paddingY + st4 { v4.8b, v5.8b, v6.8b, v7.8b}, [x2], #32 + st4 {v16.8b,v17.8b,v18.8b,v19.8b}, [x2], #32 + subs w8, w8, #16 // width -= 16 + b.gt 2b + add x2, x2, w3, sxtw // dst += padding + add x4, x4, w5, sxtw // srcY += paddingY increment_\ifmt - subs w1, w1, #1 // height -= 1 - b.gt 1b - mov w0, w9 + subs w1, w1, #1 // height -= 1 + b.gt 1b + mov w0, w9 ret endfunc .endm .macro declare_rgb_funcs ifmt - declare_func \ifmt, argb - declare_func \ifmt, rgba - declare_func \ifmt, abgr - declare_func \ifmt, bgra + declare_func \ifmt, argb + declare_func \ifmt, rgba + declare_func \ifmt, abgr + declare_func \ifmt, bgra .endm declare_rgb_funcs nv12
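To make the fixed-point pipeline in declare_func easier to follow, here is a scalar C sketch of the per-pixel arithmetic (my reading of the macros above; function and coefficient names are illustrative and not part of swscale). In the assembly each chroma pair is shared by two horizontal luma samples via zip1/zip2 and the alpha byte is set to 255; the sketch shows a single pixel:

    #include <stdint.h>

    static uint8_t clip_u8(int v)
    {
        return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
    }

    /* Scalar equivalent of sqdmulh: Q15 doubling multiply, high half,
     * i.e. roughly (a * b) >> 15 (saturation ignored in this sketch). */
    static int q15_mulh(int a, int b)
    {
        return (a * b * 2) >> 16;
    }

    /* table holds the four chroma coefficients {v2r, u2g, v2g, u2b} in
     * Q15; y_offset and y_coeff come from the stack arguments loaded by
     * load_yoff_ycoeff. */
    void yuv2rgbx_pixel(uint8_t y, uint8_t u, uint8_t v,
                        const int16_t table[4], int y_offset, int y_coeff,
                        uint8_t rgb[3])
    {
        int ud = (u << 3) - (128 << 3);                /* U*(1<<3) - 128*(1<<3) */
        int vd = (v << 3) - (128 << 3);
        int r  = q15_mulh(vd, table[0]);               /* V * v2r               */
        int g  = q15_mulh(ud, table[1]) + q15_mulh(vd, table[2]); /* U*u2g + V*v2g */
        int b  = q15_mulh(ud, table[3]);               /* U * u2b               */
        int yd = q15_mulh((y << 3) - y_offset, y_coeff);
        rgb[0] = clip_u8((yd + r + 1) >> 1);           /* sqrshrun #1: round,   */
        rgb[1] = clip_u8((yd + g + 1) >> 1);           /* halve and clip to u8  */
        rgb[2] = clip_u8((yd + b + 1) >> 1);
    }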