From patchwork Mon Aug 16 10:29:18 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mikhail Nitenko
X-Patchwork-Id: 29564
From: Mikhail Nitenko
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 16 Aug 2021 15:29:18 +0500
Message-Id: <20210816102918.464463-1-mnitenko@gmail.com>
X-Mailer: git-send-email 2.32.0
Subject: [FFmpeg-devel] [PATCH v2] lavc/aarch64: add pred functions for 10-bit
Cc: Mikhail Nitenko

Benchmarks:                        A53     A72
pred8x8_dc_10_c:                   64.2    55.7
pred8x8_dc_10_neon:                61.7    53.7
pred8x8_dc_128_10_c:               26.0    20.7
pred8x8_dc_128_10_neon:            30.7    24.5
pred8x8_horizontal_10_c:           60.0    35.2
pred8x8_horizontal_10_neon:        38.0    33.0
pred8x8_left_dc_10_c:              42.5    35.5
pred8x8_left_dc_10_neon:           50.7    41.5
pred8x8_mad_cow_dc_0l0_10_c:       55.7    44.7
pred8x8_mad_cow_dc_0l0_10_neon:    47.5    37.2
pred8x8_mad_cow_dc_0lt_10_c:       89.2    75.5
pred8x8_mad_cow_dc_0lt_10_neon:    52.2    47.0
pred8x8_mad_cow_dc_l0t_10_c:       74.7    59.2
pred8x8_mad_cow_dc_l0t_10_neon:    50.5    44.7
pred8x8_mad_cow_dc_l00_10_c:       58.0    45.7
pred8x8_mad_cow_dc_l00_10_neon:    42.5    37.5
pred8x8_plane_10_c:               347.7   295.5
pred8x8_plane_10_neon:            136.2   108.0
pred8x8_top_dc_10_c:               44.5    38.5
pred8x8_top_dc_10_neon:            39.7    34.5
pred8x8_vertical_10_c:             27.5    21.7
pred8x8_vertical_10_neon:          21.0    22.2
pred16x16_plane_10_c:            1242.0  1075.7
pred16x16_plane_10_neon:          324.0   199.5

Signed-off-by: Mikhail Nitenko
---

moved to 32-bit, however, in plane the 16bit are not enough, and it
overflows, so when it overflows the code starts using 32bit wide
sections

 libavcodec/aarch64/h264pred_init.c |  40 +++-
 libavcodec/aarch64/h264pred_neon.S | 302 ++++++++++++++++++++++++++++-
 2 files changed, 335 insertions(+), 7 deletions(-)

diff --git a/libavcodec/aarch64/h264pred_init.c b/libavcodec/aarch64/h264pred_init.c
index 325a86bfcd..0ae8f70d23 100644
--- a/libavcodec/aarch64/h264pred_init.c
+++ b/libavcodec/aarch64/h264pred_init.c
@@ -45,10 +45,23 @@ void ff_pred8x8_0lt_dc_neon(uint8_t *src, ptrdiff_t stride);
 void ff_pred8x8_l00_dc_neon(uint8_t *src, ptrdiff_t stride);
 void ff_pred8x8_0l0_dc_neon(uint8_t *src, ptrdiff_t stride);
 
-void ff_pred16x16_top_dc_neon_10(uint8_t *src, ptrdiff_t stride);
-void ff_pred16x16_dc_neon_10(uint8_t *src, ptrdiff_t stride);
-void ff_pred16x16_hor_neon_10(uint8_t *src, ptrdiff_t stride);
 void ff_pred16x16_vert_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred16x16_hor_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred16x16_plane_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred16x16_dc_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred16x16_top_dc_neon_10(uint8_t *src, ptrdiff_t stride);
+
+void ff_pred8x8_vert_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_hor_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_plane_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_dc_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_128_dc_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_left_dc_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_top_dc_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_l0t_dc_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_0lt_dc_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_l00_dc_neon_10(uint8_t *src, ptrdiff_t stride);
+void ff_pred8x8_0l0_dc_neon_10(uint8_t *src, ptrdiff_t stride);
 
 static av_cold void h264_pred_init_neon(H264PredContext *h, int codec_id,
                                         const int bit_depth,
@@ -84,10 +97,31 @@ static av_cold void h264_pred_init_neon(H264PredContext *h, int codec_id,
             h->pred16x16[PLANE_PRED8x8  ] = ff_pred16x16_plane_neon;
     }
     if (bit_depth == 10) {
+        if (chroma_format_idc <= 1) {
+            h->pred8x8[VERT_PRED8x8     ] = ff_pred8x8_vert_neon_10;
+            h->pred8x8[HOR_PRED8x8      ] = ff_pred8x8_hor_neon_10;
+            if (codec_id != AV_CODEC_ID_VP7 && codec_id != AV_CODEC_ID_VP8)
+                h->pred8x8[PLANE_PRED8x8] = ff_pred8x8_plane_neon_10;
+            h->pred8x8[DC_128_PRED8x8   ] = ff_pred8x8_128_dc_neon_10;
+            if (codec_id != AV_CODEC_ID_RV40 && codec_id != AV_CODEC_ID_VP7 &&
+                codec_id != AV_CODEC_ID_VP8) {
+                h->pred8x8[DC_PRED8x8     ] = ff_pred8x8_dc_neon_10;
+                h->pred8x8[LEFT_DC_PRED8x8] = ff_pred8x8_left_dc_neon_10;
+                h->pred8x8[TOP_DC_PRED8x8 ] = ff_pred8x8_top_dc_neon_10;
+                h->pred8x8[ALZHEIMER_DC_L0T_PRED8x8] = ff_pred8x8_l0t_dc_neon_10;
+                h->pred8x8[ALZHEIMER_DC_0LT_PRED8x8] = ff_pred8x8_0lt_dc_neon_10;
+                h->pred8x8[ALZHEIMER_DC_L00_PRED8x8] = ff_pred8x8_l00_dc_neon_10;
+                h->pred8x8[ALZHEIMER_DC_0L0_PRED8x8] = ff_pred8x8_0l0_dc_neon_10;
+            }
+        }
+
         h->pred16x16[DC_PRED8x8     ] = ff_pred16x16_dc_neon_10;
         h->pred16x16[VERT_PRED8x8   ] = ff_pred16x16_vert_neon_10;
         h->pred16x16[HOR_PRED8x8    ] = ff_pred16x16_hor_neon_10;
         h->pred16x16[TOP_DC_PRED8x8 ] = ff_pred16x16_top_dc_neon_10;
+        if (codec_id != AV_CODEC_ID_SVQ3 && codec_id != AV_CODEC_ID_RV40 &&
+            codec_id != AV_CODEC_ID_VP7 && codec_id != AV_CODEC_ID_VP8)
+            h->pred16x16[PLANE_PRED8x8  ] = ff_pred16x16_plane_neon_10;
     }
 }
 
diff --git a/libavcodec/aarch64/h264pred_neon.S b/libavcodec/aarch64/h264pred_neon.S
index e40bdc8d53..712741941f 100644
--- a/libavcodec/aarch64/h264pred_neon.S
+++ b/libavcodec/aarch64/h264pred_neon.S
@@ -361,15 +361,13 @@ function ff_pred8x8_0l0_dc_neon, export=1
 endfunc
 
 .macro ldcol.16  rd,  rs,  rt,  n=4,  hi=0
-.if \n >= 4 || \hi == 0
+.if \n >= 4 && \hi == 0
         ld1             {\rd\().h}[0], [\rs], \rt
         ld1             {\rd\().h}[1], [\rs], \rt
-.endif
-.if \n >= 4 || \hi == 1
         ld1             {\rd\().h}[2], [\rs], \rt
         ld1             {\rd\().h}[3], [\rs], \rt
 .endif
-.if \n == 8
+.if \n == 8 || \hi == 1
         ld1             {\rd\().h}[4], [\rs], \rt
         ld1             {\rd\().h}[5], [\rs], \rt
         ld1             {\rd\().h}[6], [\rs], \rt
@@ -467,3 +465,299 @@ function ff_pred16x16_vert_neon_10, export=1
         b.ne            1b
         ret
 endfunc
+
+function ff_pred16x16_plane_neon_10, export=1
+        sub             x3, x0, x1
+        movrel          x4, p16weight
+        add             x2, x3, #16
+        sub             x3, x3, #2
+        ld1             {v0.8h}, [x3]
+        ld1             {v2.8h}, [x2], x1
+        ldcol.16        v1, x3, x1, 8
+        add             x3, x3, x1
+        ldcol.16        v3, x3, x1, 8
+
+        rev64           v16.8h, v0.8h
+        rev64           v17.8h, v1.8h
+        ext             v0.16b, v16.16b, v16.16b, #8
+        ext             v1.16b, v17.16b, v17.16b, #8
+
+        add             v7.8h, v2.8h, v3.8h
+        sub             v2.8h, v2.8h, v0.8h
+        sub             v3.8h, v3.8h, v1.8h
+        ld1             {v0.8h}, [x4]
+        mul             v2.8h, v2.8h, v0.8h
+        mul             v3.8h, v3.8h, v0.8h
+        addp            v2.8h, v2.8h, v3.8h
+        addp            v2.8h, v2.8h, v2.8h
+        addp            v2.4h, v2.4h, v2.4h
+        sshll           v3.4s, v2.4h, #2
+        saddw           v2.4s, v3.4s, v2.4h
+        rshrn           v4.4h, v2.4s, #6
+        trn2            v5.4h, v4.4h, v4.4h
+        add             v2.4h, v4.4h, v5.4h
+        shl             v3.4h, v2.4h, #3
+        ext             v7.16b, v7.16b, v7.16b, #14
+        sub             v3.4h, v3.4h, v2.4h   // 7 * (b + c)
+        add             v7.4h, v7.4h, v0.4h
+        shl             v2.4h, v7.4h, #4
+        ssubl           v2.4s, v2.4h, v3.4h
+        shl             v3.4h, v4.4h, #4
+        ext             v0.16b, v0.16b, v0.16b, #14
+        ssubl           v6.4s, v5.4h, v3.4h
+
+        mov             v0.h[0], wzr
+        mul             v0.8h, v0.8h, v4.h[0]
+        dup             v16.4s, v2.s[0]
+        dup             v17.4s, v2.s[0]
+        dup             v2.8h, v4.h[0]
+        dup             v3.4s, v6.s[0]
+        shl             v2.8h, v2.8h, #3
+        saddw           v16.4s, v16.4s, v0.4h
+        saddw2          v17.4s, v17.4s, v0.8h
+        saddw           v3.4s, v3.4s, v2.4h
+
+        mov             w3, #16
+        mvni            v4.8h, #0xFC, lsl #8 // 1023 for clipping
+1:
+        sqshrun         v0.4h, v16.4s, #5
+        sqshrun2        v0.8h, v17.4s, #5
+        saddw           v16.4s, v16.4s, v2.4h
+        saddw           v17.4s, v17.4s, v2.4h
+        sqshrun         v1.4h, v16.4s, #5
+        sqshrun2        v1.8h, v17.4s, #5
+        add             v16.4s, v16.4s, v3.4s
+        add             v17.4s, v17.4s, v3.4s
+
+        subs            w3, w3, #1
+
+        smin            v0.8h, v0.8h, v4.8h
+        smin            v1.8h, v1.8h, v4.8h
+        st1             {v0.8h, v1.8h}, [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_pred8x8_hor_neon_10, export=1
+        sub             x2, x0, #2
+        mov             w3, #8
+
+1:      ld1r            {v0.8h}, [x2], x1
+        subs            w3, w3, #1
+        st1             {v0.8h}, [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_pred8x8_vert_neon_10, export=1
+        sub             x2, x0, x1
+        lsl             x1, x1, #1
+
+        ld1             {v0.8h}, [x2], x1
+        mov             w3, #4
+1:      subs            w3, w3, #1
+        st1             {v0.8h}, [x0], x1
+        st1             {v0.8h}, [x2], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_pred8x8_plane_neon_10, export=1
+        sub             x3, x0, x1
+        movrel          x4, p8weight
+        movrel          x5, p16weight
+        add             x2, x3, #8
+        sub             x3, x3, #2
+        ld1             {v0.d}[0], [x3]
+        ld1             {v2.d}[0], [x2], x1
+        ldcol.16        v0, x3, x1, hi=1
+        add             x3, x3, x1
+        ldcol.16        v3, x3, x1, 4
+        add             v7.8h, v2.8h, v3.8h
+        rev64           v0.8h, v0.8h
+        trn1            v2.2d, v2.2d, v3.2d
+        sub             v2.8h, v2.8h, v0.8h
+        ld1             {v6.8h}, [x4]
+        mul             v2.8h, v2.8h, v6.8h
+        ld1             {v0.8h}, [x5]
+        saddlp          v2.4s, v2.8h
+        addp            v2.4s, v2.4s, v2.4s
+        shl             v3.4s, v2.4s, #4
+        add             v2.4s, v3.4s, v2.4s
+        rshrn           v5.4h, v2.4s, #5
+        addp            v2.4h, v5.4h, v5.4h
+        shl             v3.4h, v2.4h, #1
+        add             v3.4h, v3.4h, v2.4h
+        rev64           v7.4h, v7.4h
+        add             v7.4h, v7.4h, v0.4h
+        shl             v2.4h, v7.4h, #4
+        ssubl           v2.4s, v2.4h, v3.4h
+        ext             v0.16b, v0.16b, v0.16b, #14
+        mov             v0.h[0], wzr
+        mul             v0.8h, v0.8h, v5.h[0]
+        dup             v1.4s, v2.s[0]
+        dup             v2.4s, v2.s[0]
+        dup             v3.8h, v5.h[1]
+        saddw           v1.4s, v1.4s, v0.4h
+        saddw2          v2.4s, v2.4s, v0.8h
+        mov             w3, #8
+        mvni            v4.8h, #0xFC, lsl #8 // 1023 for clipping
+1:
+        sqshrun         v0.4h, v1.4s, #5
+        sqshrun2        v0.8h, v2.4s, #5
+
+        subs            w3, w3, #1
+
+        saddw           v1.4s, v1.4s, v3.4h
+        saddw           v2.4s, v2.4s, v3.4h
+
+        smin            v0.8h, v0.8h, v4.8h
+        st1             {v0.8h}, [x0], x1
+        b.ne            1b
+        ret
+endfunc
+
+function ff_pred8x8_128_dc_neon_10, export=1
+        movi            v0.8h, #2, lsl #8      // 512, 1 << (bit_depth - 1)
+        movi            v1.8h, #2, lsl #8
+        b               .L_pred8x8_dc_10_end
+endfunc
+
+function ff_pred8x8_top_dc_neon_10, export=1
+        sub             x2, x0, x1
+        ld1             {v0.8h}, [x2]
+
+        addp            v0.8h, v0.8h, v0.8h
+        addp            v0.4h, v0.4h, v0.4h
+        zip1            v0.4h, v0.4h, v0.4h
+        urshr           v2.4h, v0.4h, #2
+        zip1            v0.8h, v2.8h, v2.8h
+        zip1            v1.8h, v2.8h, v2.8h
+        b               .L_pred8x8_dc_10_end
+endfunc
+
+function ff_pred8x8_left_dc_neon_10, export=1
+        sub             x2, x0, #2
+        ldcol.16        v0, x2, x1, 8
+
+        addp            v0.8h, v0.8h, v0.8h
+        addp            v0.4h, v0.4h, v0.4h
+        urshr           v2.4h, v0.4h, #2
+        dup             v1.8h, v2.h[1]
+        dup             v0.8h, v2.h[0]
+        b               .L_pred8x8_dc_10_end
+endfunc
+
+function ff_pred8x8_dc_neon_10, export=1
+        sub             x2, x0, x1
+        sub             x3, x0, #2
+
+        ld1             {v0.8h}, [x2]
+        ldcol.16        v1, x3, x1, 8
+
+        addp            v0.8h, v0.8h, v0.8h
+        addp            v1.8h, v1.8h, v1.8h
+        trn1            v2.2s, v0.2s, v1.2s
+        trn2            v3.2s, v0.2s, v1.2s
+        addp            v4.4h, v2.4h, v3.4h
+        addp            v5.4h, v4.4h, v4.4h
+        urshr           v6.4h, v5.4h, #3
+        urshr           v7.4h, v4.4h, #2
+        dup             v0.8h, v6.h[0]
+        dup             v2.8h, v7.h[2]
+        dup             v1.8h, v7.h[3]
+        dup             v3.8h, v6.h[1]
+        zip1            v0.2d, v0.2d, v2.2d
+        zip1            v1.2d, v1.2d, v3.2d
+.L_pred8x8_dc_10_end:
+        mov             w3, #4
+        add             x2, x0, x1, lsl #2
+
+6:      st1             {v0.8h}, [x0], x1
+        subs            w3, w3, #1
+        st1             {v1.8h}, [x2], x1
+        b.ne            6b
+        ret
+endfunc
+
+function ff_pred8x8_l0t_dc_neon_10, export=1
+        sub             x2, x0, x1
+        sub             x3, x0, #2
+
+        ld1             {v0.8h}, [x2]
+        ldcol.16        v1, x3, x1, 4
+
+        addp            v0.8h, v0.8h, v0.8h
+        addp            v1.4h, v1.4h, v1.4h
+        addp            v0.4h, v0.4h, v0.4h
+        addp            v1.4h, v1.4h, v1.4h
+        add             v1.4h, v1.4h, v0.4h
+
+        urshr           v2.4h, v0.4h, #2
+        urshr           v3.4h, v1.4h, #3      // the pred4x4 part
+
+        dup             v4.4h, v3.h[0]
+        dup             v5.4h, v2.h[0]
+        dup             v6.4h, v2.h[1]
+
+        zip1            v0.2d, v4.2d, v6.2d
+        zip1            v1.2d, v5.2d, v6.2d
+        b               .L_pred8x8_dc_10_end
+endfunc
+
+function ff_pred8x8_l00_dc_neon_10, export=1
+        sub             x2, x0, #2
+
+        ldcol.16        v0, x2, x1, 4
+
+        addp            v0.4h, v0.4h, v0.4h
+        addp            v0.4h, v0.4h, v0.4h
+        urshr           v0.4h, v0.4h, #2
+
+        movi            v1.8h, #2, lsl #8      // 512
+        dup             v0.8h, v0.h[0]
+        b               .L_pred8x8_dc_10_end
+endfunc
+
+function ff_pred8x8_0lt_dc_neon_10, export=1
+        add             x3, x0, x1, lsl #2
+        sub             x2, x0, x1
+        sub             x3, x3, #2
+
+        ld1             {v0.8h}, [x2]
+        ldcol.16        v1, x3, x1, hi=1
+
+        addp            v0.8h, v0.8h, v0.8h
+        addp            v1.8h, v1.8h, v1.8h
+        addp            v0.4h, v0.4h, v0.4h
+        addp            v1.4h, v1.4h, v1.4h
+        zip1            v0.2s, v0.2s, v1.2s
+        add             v1.4h, v0.4h, v1.4h
+
+        urshr           v2.4h, v0.4h, #2
+        urshr           v3.4h, v1.4h, #3
+
+        dup             v4.4h, v2.h[0]
+        dup             v5.4h, v2.h[3]
+        dup             v6.4h, v2.h[1]
+        dup             v7.4h, v3.h[1]
+
+        zip1            v0.2d, v4.2d, v6.2d
+        zip1            v1.2d, v5.2d, v7.2d
+        b               .L_pred8x8_dc_10_end
+endfunc
+
+function ff_pred8x8_0l0_dc_neon_10, export=1
+        add             x2, x0, x1, lsl #2
+        sub             x2, x2, #2
+
+        ldcol.16        v1, x2, x1, 4
+
+        addp            v2.8h, v1.8h, v1.8h
+        addp            v2.4h, v2.4h, v2.4h
+        urshr           v1.4h, v2.4h, #2
+
+        movi            v0.8h, #2, lsl #8      // 512
+        dup             v1.8h, v1.h[0]
+        b               .L_pred8x8_dc_10_end
+endfunc
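
Side note on the constants and the 16-bit overflow mentioned in the patch notes. The following C sketch is not part of the patch; it only checks the arithmetic. All helper names are hypothetical, and it assumes the usual H.264 plane weights of 1..8 for 16x16 and 1..4 for 8x8 (the p16weight/p8weight tables are not shown in this diff). It models the per-lane value produced by the `movi`/`mvni` immediates used above and a worst-case bound on the weighted gradient sum for 10-bit samples.

```c
#include <assert.h>
#include <stdint.h>

/* Largest sample value at a given bit depth (1023 for 10-bit). */
int pixel_max(int bit_depth)
{
    return (1 << bit_depth) - 1;
}

/* Mid-grey value used by the DC-128 predictor (512 for 10-bit). */
int dc128_value(int bit_depth)
{
    return 1 << (bit_depth - 1);
}

/* Per-lane result of "movi vN.8h, #imm8, lsl #shift". */
uint16_t movi_h(unsigned imm8, unsigned shift)
{
    return (uint16_t)(imm8 << shift);
}

/* Per-lane result of "mvni vN.8h, #imm8, lsl #shift" (bitwise NOT of movi). */
uint16_t mvni_h(unsigned imm8, unsigned shift)
{
    return (uint16_t)~(imm8 << shift);
}

/*
 * Worst-case magnitude of the weighted gradient sum
 *     sum_{k=1..taps} k * (p[+k] - p[-k])
 * when every difference equals pixel_max. The weights 1..taps are an
 * assumption taken from the H.264 plane prediction formula.
 */
int32_t plane_gradient_bound(int taps, int bit_depth)
{
    int32_t sum = 0;
    for (int k = 1; k <= taps; k++)
        sum += k * pixel_max(bit_depth);
    return sum;
}
```

Under these assumptions the 16x16 bound for 10-bit input is 36 * 1023 = 36828, which is above INT16_MAX (32767), while the 8x8 bound is 10 * 1023 = 10230. That seems consistent with the note about 16 bits not being enough in plane.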