From patchwork Tue Jan 23 18:17:04 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wu Jianhua <toqsxw@outlook.com>
X-Patchwork-Id: 45746
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:120f:b0:199:de12:6fa6 with SMTP id v15csp801372pzf;
        Tue, 23 Jan 2024 10:17:38 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IETXcmBqWZcKNvkCi6DamygYTJdpmNXb1KrTDvc0/sPVe46JZZA9AsHWM+6tzyYchw6fZXe
X-Received: by 2002:a17:907:3345:b0:a23:8918:2399 with SMTP id
 yr5-20020a170907334500b00a2389182399mr138947ejb.130.1706033857717;
        Tue, 23 Jan 2024 10:17:37 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 bx12-20020a170906a1cc00b00a2d0d311720si10384752ejb.530.2024.01.23.10.17.37;
        Tue, 23 Jan 2024 10:17:37 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@outlook.com
 header.s=selector1 header.b=qMLsDZfG;
       arc=fail (body hash mismatch);
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=outlook.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 4822368D015;
	Tue, 23 Jan 2024 20:17:33 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from JPN01-OS0-obe.outbound.protection.outlook.com
 (mail-os0jpn01olkn2100.outbound.protection.outlook.com [40.92.98.100])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id A870168CE21
 for <ffmpeg-devel@ffmpeg.org>; Tue, 23 Jan 2024 20:17:25 +0200 (EET)
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:173::5)
 by TYWP286MB3822.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:447::6) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.36; Tue, 23 Jan
 2024 18:17:20 +0000
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080]) by TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080%3]) with mapi id 15.20.7228.022; Tue, 23 Jan 2024
 18:17:20 +0000
From: toqsxw@outlook.com
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 24 Jan 2024 02:17:04 +0800
Message-ID: 
 <TYWP286MB217273F359418CAFF7502105CA742@TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM>
X-Mailer: git-send-email 2.34.1
X-Microsoft-Original-Message-ID: <20240123181711.402946-1-toqsxw@outlook.com>
Subject: [FFmpeg-devel] [PATCH v4 1/8] avcodec/vvc/vvc_inter_template: move
 put/put_luma/put_chroma template to h2656_inter_template.c
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Wu Jianhua <toqsxw@outlook.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: idsPDy6a7sAP

From: Wu Jianhua <toqsxw@outlook.com>

Signed-off-by: Wu Jianhua <toqsxw@outlook.com>
---
 libavcodec/h26x/h2656_inter_template.c | 577 +++++++++++++++++++++++++
 libavcodec/vvc/vvc_inter_template.c    | 559 +-----------------------
 2 files changed, 578 insertions(+), 558 deletions(-)
 create mode 100644 libavcodec/h26x/h2656_inter_template.c
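
Note (illustrative only, not part of the patch): the new template file uses
FUNC, pixel, BIT_DEPTH, MAX_PB_SIZE and av_clip_pixel without defining them,
so they are presumably supplied by whatever file includes it. The sketch
below shows one way such a template could be instantiated for 8-bit input.
Every definition in it (the FUNC name-pasting scheme, the MAX_PB_SIZE value,
the sketch_demo helper and the example filter taps) is an assumption made for
the sake of a standalone example and is not taken from the FFmpeg headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for what the including file is assumed to provide. */
#define BIT_DEPTH         8
#define MAX_PB_SIZE       64   /* assumed value for this sketch */
#define LUMA_EXTRA_BEFORE 3    /* same value as in the patch */
typedef uint8_t pixel;

/* Two-level paste so BIT_DEPTH expands first: FUNC(put_luma_h) -> put_luma_h_8 */
#define FUNC3(name, depth) name##_##depth
#define FUNC2(name, depth) FUNC3(name, depth)
#define FUNC(name)         FUNC2(name, BIT_DEPTH)

/* Same 8-tap layout as LUMA_FILTER in the patch: taps cover x-3 .. x+4. */
#define LUMA_FILTER(src, stride)                                               \
    (filter[0] * src[x - 3 * stride] +                                         \
     filter[1] * src[x - 2 * stride] +                                         \
     filter[2] * src[x -     stride] +                                         \
     filter[3] * src[x             ] +                                         \
     filter[4] * src[x +     stride] +                                         \
     filter[5] * src[x + 2 * stride] +                                         \
     filter[6] * src[x + 3 * stride] +                                         \
     filter[7] * src[x + 4 * stride])

/* Horizontal luma filter, following put_luma_h from the patch. */
static void FUNC(put_luma_h)(int16_t *dst, const uint8_t *_src,
    const ptrdiff_t _src_stride, const int height,
    const int8_t *hf, const int8_t *vf, const int width)
{
    const pixel *src           = (const pixel *)_src;
    const ptrdiff_t src_stride = _src_stride / sizeof(pixel);
    const int8_t *filter       = hf;

    (void)vf;                       /* unused by the horizontal-only variant */
    assert(width <= MAX_PB_SIZE);   /* dst rows are MAX_PB_SIZE samples apart */

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
        src += src_stride;
        dst += MAX_PB_SIZE;
    }
}

/* Hypothetical helper: filter one flat row and return the first output. */
static int sketch_demo(void)
{
    /* Example taps that sum to 64, so a flat input is scaled by 64. */
    static const int8_t taps[8] = { -1, 4, -11, 40, 40, -11, 4, -1 };
    uint8_t row[16];
    int16_t out[MAX_PB_SIZE];

    for (int i = 0; i < 16; i++)
        row[i] = 100;

    /* Start 3 samples in so the x-3 tap stays inside the row. */
    FUNC(put_luma_h)(out, row + LUMA_EXTRA_BEFORE, sizeof(row), 1, taps, NULL, 8);
    return out[0];
}
```

With these assumptions a flat row of value 100 comes out as 6400, which is
the same 14-bit intermediate value that put_pixels produces for 8-bit input
(100 << 6). That agreement is why the filtered and unfiltered paths can feed
the same later stages.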

diff --git a/libavcodec/h26x/h2656_inter_template.c b/libavcodec/h26x/h2656_inter_template.c
new file mode 100644
index 0000000000..864f6c7e7d
--- /dev/null
+++ b/libavcodec/h26x/h2656_inter_template.c
@@ -0,0 +1,577 @@
+/*
+ * inter prediction template for HEVC/VVC
+ *
+ * Copyright (C) 2022 Nuo Mi
+ * Copyright (C) 2024 Wu Jianhua
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#define CHROMA_EXTRA_BEFORE     1
+#define CHROMA_EXTRA            3
+#define LUMA_EXTRA_BEFORE       3
+#define LUMA_EXTRA              7
+
+static void FUNC(put_pixels)(int16_t *dst,
+    const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    const pixel *src            = (const pixel *)_src;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = src[x] << (14 - BIT_DEPTH);
+        src += src_stride;
+        dst += MAX_PB_SIZE;
+    }
+}
+
+static void FUNC(put_uni_pixels)(uint8_t *_dst, const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride, const int height,
+     const int8_t *hf, const int8_t *vf, const int width)
+{
+    const pixel *src            = (const pixel *)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+
+    for (int y = 0; y < height; y++) {
+        memcpy(dst, src, width * sizeof(pixel));
+        src += src_stride;
+        dst += dst_stride;
+    }
+}
+
+static void FUNC(put_uni_w_pixels)(uint8_t *_dst, const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride, const int height,
+    const int denom, const int wx, const int _ox,  const int8_t *hf, const int8_t *vf,
+    const int width)
+{
+    const pixel *src            = (const pixel *)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int shift             = denom + 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++) {
+            const int v = (src[x] << (14 - BIT_DEPTH));
+            dst[x] = av_clip_pixel(((v * wx + offset) >> shift) + ox);
+        }
+        src += src_stride;
+        dst += dst_stride;
+    }
+}
+
+#define LUMA_FILTER(src, stride)                                               \
+    (filter[0] * src[x - 3 * stride] +                                         \
+     filter[1] * src[x - 2 * stride] +                                         \
+     filter[2] * src[x -     stride] +                                         \
+     filter[3] * src[x             ] +                                         \
+     filter[4] * src[x +     stride] +                                         \
+     filter[5] * src[x + 2 * stride] +                                         \
+     filter[6] * src[x + 3 * stride] +                                         \
+     filter[7] * src[x + 4 * stride])
+
+static void FUNC(put_luma_h)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    const pixel *src           = (const pixel*)_src;
+    const ptrdiff_t src_stride = _src_stride / sizeof(pixel);
+    const int8_t *filter       = hf;
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        dst += MAX_PB_SIZE;
+    }
+}
+
+static void FUNC(put_luma_v)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    const pixel *src           = (const pixel*)_src;
+    const ptrdiff_t src_stride = _src_stride / sizeof(pixel);
+    const int8_t *filter       = vf;
+
+    for (int y = 0; y < height; y++)  {
+        for (int x = 0; x < width; x++)
+            dst[x] = LUMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        dst += MAX_PB_SIZE;
+    }
+}
+
+static void FUNC(put_luma_hv)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    int16_t tmp_array[(MAX_PB_SIZE + LUMA_EXTRA) * MAX_PB_SIZE];
+    int16_t *tmp                = tmp_array;
+    const pixel *src            = (const pixel*)_src;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+
+    src   -= LUMA_EXTRA_BEFORE * src_stride;
+    for (int y = 0; y < height + LUMA_EXTRA; y++) {
+        for (int x = 0; x < width; x++)
+            tmp[x] = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        tmp += MAX_PB_SIZE;
+    }
+
+    tmp    = tmp_array + LUMA_EXTRA_BEFORE * MAX_PB_SIZE;
+    filter = vf;
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = LUMA_FILTER(tmp, MAX_PB_SIZE) >> 6;
+        tmp += MAX_PB_SIZE;
+        dst += MAX_PB_SIZE;
+    }
+}
+
+static void FUNC(put_uni_luma_h)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    const pixel *src           = (const pixel*)_src;
+    pixel *dst                 = (pixel *)_dst;
+    const ptrdiff_t src_stride = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride = _dst_stride / sizeof(pixel);
+    const int8_t *filter       = hf;
+    const int shift            = 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset           = 1 << (shift - 1);
+#else
+    const int offset           = 0;
+#endif
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++) {
+            const int val = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
+            dst[x]        = av_clip_pixel((val + offset) >> shift);
+        }
+        src   += src_stride;
+        dst   += dst_stride;
+    }
+}
+
+static void FUNC(put_uni_luma_v)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+
+    const pixel *src            = (const pixel*)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = vf;
+    const int shift             = 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++) {
+            const int val = LUMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8);
+            dst[x]        = av_clip_pixel((val + offset) >> shift);
+        }
+        src   += src_stride;
+        dst   += dst_stride;
+    }
+}
+
+static void FUNC(put_uni_luma_hv)(uint8_t *_dst, const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    int16_t tmp_array[(MAX_PB_SIZE + LUMA_EXTRA) * MAX_PB_SIZE];
+    int16_t *tmp                = tmp_array;
+    const pixel *src            = (const pixel*)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+    const int shift             =  14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    src   -= LUMA_EXTRA_BEFORE * src_stride;
+    for (int y = 0; y < height + LUMA_EXTRA; y++) {
+        for (int x = 0; x < width; x++)
+            tmp[x] = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        tmp += MAX_PB_SIZE;
+    }
+
+    tmp    = tmp_array + LUMA_EXTRA_BEFORE * MAX_PB_SIZE;
+    filter = vf;
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++) {
+            const int val = LUMA_FILTER(tmp, MAX_PB_SIZE) >> 6;
+            dst[x]  = av_clip_pixel((val  + offset) >> shift);
+        }
+        tmp += MAX_PB_SIZE;
+        dst += dst_stride;
+    }
+
+}
+
+static void FUNC(put_uni_luma_w_h)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride, int height,
+    const int denom, const int wx, const int _ox, const int8_t *hf, const int8_t *vf,
+    const int width)
+{
+    const pixel *src            = (const pixel*)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
+    const int shift             = denom + 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = av_clip_pixel((((LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
+        src += src_stride;
+        dst += dst_stride;
+    }
+}
+
+static void FUNC(put_uni_luma_w_v)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride, const int height,
+    const int denom, const int wx, const int _ox, const int8_t *hf, const int8_t *vf,
+    const int width)
+{
+    const pixel *src            = (const pixel*)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = vf;
+    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
+    const int shift             = denom + 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = av_clip_pixel((((LUMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
+        src += src_stride;
+        dst += dst_stride;
+    }
+}
+
+static void FUNC(put_uni_luma_w_hv)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride, const int height, const int denom,
+    const int wx, const int _ox, const int8_t *hf, const int8_t *vf, const int width)
+{
+    int16_t tmp_array[(MAX_PB_SIZE + LUMA_EXTRA) * MAX_PB_SIZE];
+    int16_t *tmp                = tmp_array;
+    const pixel *src            = (const pixel*)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
+    const int shift             = denom + 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    src   -= LUMA_EXTRA_BEFORE * src_stride;
+    for (int y = 0; y < height + LUMA_EXTRA; y++) {
+        for (int x = 0; x < width; x++)
+            tmp[x] = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        tmp += MAX_PB_SIZE;
+    }
+
+    tmp    = tmp_array + LUMA_EXTRA_BEFORE * MAX_PB_SIZE;
+    filter = vf;
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = av_clip_pixel((((LUMA_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox);
+        tmp += MAX_PB_SIZE;
+        dst += dst_stride;
+    }
+}
+
+#define CHROMA_FILTER(src, stride)                                             \
+    (filter[0] * src[x - stride] +                                             \
+     filter[1] * src[x]          +                                             \
+     filter[2] * src[x + stride] +                                             \
+     filter[3] * src[x + 2 * stride])
+
+static void FUNC(put_chroma_h)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    const pixel *src            = (const pixel *)_src;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        dst += MAX_PB_SIZE;
+    }
+}
+
+static void FUNC(put_chroma_v)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    const pixel *src            = (const pixel *)_src;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const int8_t *filter        = vf;
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = CHROMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        dst += MAX_PB_SIZE;
+    }
+}
+
+static void FUNC(put_chroma_hv)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    int16_t tmp_array[(MAX_PB_SIZE + CHROMA_EXTRA) * MAX_PB_SIZE];
+    int16_t *tmp                = tmp_array;
+    const pixel *src            = (const pixel *)_src;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+
+    src -= CHROMA_EXTRA_BEFORE * src_stride;
+
+    for (int y = 0; y < height + CHROMA_EXTRA; y++) {
+        for (int x = 0; x < width; x++)
+            tmp[x] = CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        tmp += MAX_PB_SIZE;
+    }
+
+    tmp    = tmp_array + CHROMA_EXTRA_BEFORE * MAX_PB_SIZE;
+    filter = vf;
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = CHROMA_FILTER(tmp, MAX_PB_SIZE) >> 6;
+        tmp += MAX_PB_SIZE;
+        dst += MAX_PB_SIZE;
+    }
+}
+
+static void FUNC(put_uni_chroma_h)(uint8_t *_dst, const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    const pixel *src            = (const pixel *)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+    const int shift             = 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = av_clip_pixel(((CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8)) + offset) >> shift);
+        src += src_stride;
+        dst += dst_stride;
+    }
+}
+
+static void FUNC(put_uni_chroma_v)(uint8_t *_dst, const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    const pixel *src            = (const pixel *)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = vf;
+    const int shift             = 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = av_clip_pixel(((CHROMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8)) + offset) >> shift);
+        src += src_stride;
+        dst += dst_stride;
+    }
+}
+
+static void FUNC(put_uni_chroma_hv)(uint8_t *_dst, const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride,
+    const int height, const int8_t *hf, const int8_t *vf, const int width)
+{
+    int16_t tmp_array[(MAX_PB_SIZE + CHROMA_EXTRA) * MAX_PB_SIZE];
+    int16_t *tmp                = tmp_array;
+    const pixel *src            = (const pixel *)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+    const int shift             = 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    src -= CHROMA_EXTRA_BEFORE * src_stride;
+
+    for (int y = 0; y < height + CHROMA_EXTRA; y++) {
+        for (int x = 0; x < width; x++)
+            tmp[x] = CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        tmp += MAX_PB_SIZE;
+    }
+
+    tmp    = tmp_array + CHROMA_EXTRA_BEFORE * MAX_PB_SIZE;
+    filter = vf;
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = av_clip_pixel(((CHROMA_FILTER(tmp, MAX_PB_SIZE) >> 6) + offset) >> shift);
+        tmp += MAX_PB_SIZE;
+        dst += dst_stride;
+    }
+}
+
+static void FUNC(put_uni_chroma_w_h)(uint8_t *_dst, ptrdiff_t _dst_stride,
+    const uint8_t *_src, ptrdiff_t _src_stride, int height, int denom, int wx, int ox,
+    const int8_t *hf, const int8_t *vf, int width)
+{
+    const pixel *src            = (const pixel *)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+    const int shift             = denom + 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    ox     = ox * (1 << (BIT_DEPTH - 8));
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++) {
+            dst[x] = av_clip_pixel((((CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
+        }
+        dst += dst_stride;
+        src += src_stride;
+    }
+}
+
+static void FUNC(put_uni_chroma_w_v)(uint8_t *_dst, const ptrdiff_t _dst_stride,
+    const uint8_t *_src, const ptrdiff_t _src_stride, const int height,
+    const int denom, const int wx, const int _ox, const int8_t *hf, const int8_t *vf,
+    const int width)
+{
+    const pixel *src            = (const pixel *)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = vf;
+    const int shift             = denom + 14 - BIT_DEPTH;
+    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
+#if BIT_DEPTH < 14
+    int offset                  = 1 << (shift - 1);
+#else
+    int offset                  = 0;
+#endif
+
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++) {
+            dst[x] = av_clip_pixel((((CHROMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
+        }
+        dst += dst_stride;
+        src += src_stride;
+    }
+}
+
+static void FUNC(put_uni_chroma_w_hv)(uint8_t *_dst, ptrdiff_t _dst_stride,
+     const uint8_t *_src, ptrdiff_t _src_stride,  int height, int denom, int wx, int ox,
+     const int8_t *hf, const int8_t *vf, int width)
+{
+    int16_t tmp_array[(MAX_PB_SIZE + CHROMA_EXTRA) * MAX_PB_SIZE];
+    int16_t *tmp                = tmp_array;
+    const pixel *src            = (const pixel *)_src;
+    pixel *dst                  = (pixel *)_dst;
+    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
+    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
+    const int8_t *filter        = hf;
+    const int shift             = denom + 14 - BIT_DEPTH;
+#if BIT_DEPTH < 14
+    const int offset            = 1 << (shift - 1);
+#else
+    const int offset            = 0;
+#endif
+
+    src -= CHROMA_EXTRA_BEFORE * src_stride;
+
+    for (int y = 0; y < height + CHROMA_EXTRA; y++) {
+        for (int x = 0; x < width; x++)
+            tmp[x] = CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
+        src += src_stride;
+        tmp += MAX_PB_SIZE;
+    }
+
+    tmp    = tmp_array + CHROMA_EXTRA_BEFORE * MAX_PB_SIZE;
+    filter = vf;
+
+    ox     = ox * (1 << (BIT_DEPTH - 8));
+    for (int y = 0; y < height; y++) {
+        for (int x = 0; x < width; x++)
+            dst[x] = av_clip_pixel((((CHROMA_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox);
+        tmp += MAX_PB_SIZE;
+        dst += dst_stride;
+    }
+}
diff --git a/libavcodec/vvc/vvc_inter_template.c b/libavcodec/vvc/vvc_inter_template.c
index 7160907778..e5cff079fb 100644
--- a/libavcodec/vvc/vvc_inter_template.c
+++ b/libavcodec/vvc/vvc_inter_template.c
@@ -20,564 +20,7 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
-////////////////////////////////////////////////////////////////////////////////
-//
-////////////////////////////////////////////////////////////////////////////////
-static void FUNC(put_pixels)(int16_t *dst,
-    const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    const pixel *src            = (const pixel *)_src;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = src[x] << (14 - BIT_DEPTH);
-        src += src_stride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_uni_pixels)(uint8_t *_dst, const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride, const int height,
-     const int8_t *hf, const int8_t *vf, const int width)
-{
-    const pixel *src            = (const pixel *)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-
-    for (int y = 0; y < height; y++) {
-        memcpy(dst, src, width * sizeof(pixel));
-        src += src_stride;
-        dst += dst_stride;
-    }
-}
-
-static void FUNC(put_uni_w_pixels)(uint8_t *_dst, const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride, const int height,
-    const int denom, const int wx, const int _ox,  const int8_t *hf, const int8_t *vf,
-    const int width)
-{
-    const pixel *src            = (const pixel *)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int shift             = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++) {
-            const int v = (src[x] << (14 - BIT_DEPTH));
-            dst[x] = av_clip_pixel(((v * wx + offset) >> shift) + ox);
-        }
-        src += src_stride;
-        dst += dst_stride;
-    }
-}
-
-////////////////////////////////////////////////////////////////////////////////
-//
-////////////////////////////////////////////////////////////////////////////////
-#define LUMA_FILTER(src, stride)                                               \
-    (filter[0] * src[x - 3 * stride] +                                         \
-     filter[1] * src[x - 2 * stride] +                                         \
-     filter[2] * src[x -     stride] +                                         \
-     filter[3] * src[x             ] +                                         \
-     filter[4] * src[x +     stride] +                                         \
-     filter[5] * src[x + 2 * stride] +                                         \
-     filter[6] * src[x + 3 * stride] +                                         \
-     filter[7] * src[x + 4 * stride])
-
-static void FUNC(put_luma_h)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    const pixel *src           = (const pixel*)_src;
-    const ptrdiff_t src_stride = _src_stride / sizeof(pixel);
-    const int8_t *filter       = hf;
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_luma_v)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    const pixel *src           = (pixel*)_src;
-    const ptrdiff_t src_stride = _src_stride / sizeof(pixel);
-    const int8_t *filter       = vf;
-
-    for (int y = 0; y < height; y++)  {
-        for (int x = 0; x < width; x++)
-            dst[x] = LUMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_luma_hv)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    int16_t tmp_array[(MAX_PB_SIZE + LUMA_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp                = tmp_array;
-    const pixel *src            = (const pixel*)_src;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-
-    src   -= LUMA_EXTRA_BEFORE * src_stride;
-    for (int y = 0; y < height + LUMA_EXTRA; y++) {
-        for (int x = 0; x < width; x++)
-            tmp[x] = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp    = tmp_array + LUMA_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = vf;
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = LUMA_FILTER(tmp, MAX_PB_SIZE) >> 6;
-        tmp += MAX_PB_SIZE;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_uni_luma_h)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    const pixel *src           = (const pixel*)_src;
-    pixel *dst                 = (pixel *)_dst;
-    const ptrdiff_t src_stride = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride = _dst_stride / sizeof(pixel);
-    const int8_t *filter       = hf;
-    const int shift            = 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset           = 1 << (shift - 1);
-#else
-    const int offset           = 0;
-#endif
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++) {
-            const int val = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
-            dst[x]        = av_clip_pixel((val + offset) >> shift);
-        }
-        src   += src_stride;
-        dst   += dst_stride;
-    }
-}
-
-static void FUNC(put_uni_luma_v)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-
-    const pixel *src            = (const pixel*)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = vf;
-    const int shift             = 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++) {
-            const int val = LUMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8);
-            dst[x]        = av_clip_pixel((val + offset) >> shift);
-        }
-        src   += src_stride;
-        dst   += dst_stride;
-    }
-}
-
-static void FUNC(put_uni_luma_hv)(uint8_t *_dst, const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    int16_t tmp_array[(MAX_PB_SIZE + LUMA_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp                = tmp_array;
-    const pixel *src            = (const pixel*)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-    const int shift             =  14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    src   -= LUMA_EXTRA_BEFORE * src_stride;
-    for (int y = 0; y < height + LUMA_EXTRA; y++) {
-        for (int x = 0; x < width; x++)
-            tmp[x] = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp    = tmp_array + LUMA_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = vf;
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++) {
-            const int val = LUMA_FILTER(tmp, MAX_PB_SIZE) >> 6;
-            dst[x]  = av_clip_pixel((val  + offset) >> shift);
-        }
-        tmp += MAX_PB_SIZE;
-        dst += dst_stride;
-    }
-
-}
-
-static void FUNC(put_uni_luma_w_h)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride, int height,
-    const int denom, const int wx, const int _ox, const int8_t *hf, const int8_t *vf,
-    const int width)
-{
-    const pixel *src            = (const pixel*)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
-    const int shift             = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = av_clip_pixel((((LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
-        src += src_stride;
-        dst += dst_stride;
-    }
-}
-
-static void FUNC(put_uni_luma_w_v)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride, const int height,
-    const int denom, const int wx, const int _ox, const int8_t *hf, const int8_t *vf,
-    const int width)
-{
-    const pixel *src            = (const pixel*)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = vf;
-    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
-    const int shift             = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = av_clip_pixel((((LUMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
-        src += src_stride;
-        dst += dst_stride;
-    }
-}
-
-static void FUNC(put_uni_luma_w_hv)(uint8_t *_dst,  const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride, const int height, const int denom,
-    const int wx, const int _ox, const int8_t *hf, const int8_t *vf, const int width)
-{
-    int16_t tmp_array[(MAX_PB_SIZE + LUMA_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp                = tmp_array;
-    const pixel *src            = (const pixel*)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
-    const int shift             = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    src   -= LUMA_EXTRA_BEFORE * src_stride;
-    for (int y = 0; y < height + LUMA_EXTRA; y++) {
-        for (int x = 0; x < width; x++)
-            tmp[x] = LUMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp    = tmp_array + LUMA_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = vf;
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = av_clip_pixel((((LUMA_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox);
-        tmp += MAX_PB_SIZE;
-        dst += dst_stride;
-    }
-}
-
-////////////////////////////////////////////////////////////////////////////////
-//
-////////////////////////////////////////////////////////////////////////////////
-#define CHROMA_FILTER(src, stride)                                               \
-    (filter[0] * src[x - stride] +                                             \
-     filter[1] * src[x]          +                                             \
-     filter[2] * src[x + stride] +                                             \
-     filter[3] * src[x + 2 * stride])
-
-static void FUNC(put_chroma_h)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    const pixel *src            = (const pixel *)_src;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_chroma_v)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    const pixel *src            = (const pixel *)_src;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const int8_t *filter        = vf;
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = CHROMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_chroma_hv)(int16_t *dst, const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    int16_t tmp_array[(MAX_PB_SIZE + CHROMA_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp                = tmp_array;
-    const pixel *src            = (const pixel *)_src;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-
-    src -= CHROMA_EXTRA_BEFORE * src_stride;
-
-    for (int y = 0; y < height + CHROMA_EXTRA; y++) {
-        for (int x = 0; x < width; x++)
-            tmp[x] = CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp    = tmp_array + CHROMA_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = vf;
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = CHROMA_FILTER(tmp, MAX_PB_SIZE) >> 6;
-        tmp += MAX_PB_SIZE;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_uni_chroma_h)(uint8_t *_dst, const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    const pixel *src            = (const pixel *)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-    const int shift             = 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = av_clip_pixel(((CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8)) + offset) >> shift);
-        src += src_stride;
-        dst += dst_stride;
-    }
-}
-
-static void FUNC(put_uni_chroma_v)(uint8_t *_dst, const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    const pixel *src            = (const pixel *)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = vf;
-    const int shift             = 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = av_clip_pixel(((CHROMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8)) + offset) >> shift);
-        src += src_stride;
-        dst += dst_stride;
-    }
-}
-
-static void FUNC(put_uni_chroma_hv)(uint8_t *_dst, const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride,
-    const int height, const int8_t *hf, const int8_t *vf, const int width)
-{
-    int16_t tmp_array[(MAX_PB_SIZE + CHROMA_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp                = tmp_array;
-    const pixel *src            = (const pixel *)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-    const int shift             = 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    src -= CHROMA_EXTRA_BEFORE * src_stride;
-
-    for (int y = 0; y < height + CHROMA_EXTRA; y++) {
-        for (int x = 0; x < width; x++)
-            tmp[x] = CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp    = tmp_array + CHROMA_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = vf;
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = av_clip_pixel(((CHROMA_FILTER(tmp, MAX_PB_SIZE) >> 6) + offset) >> shift);
-        tmp += MAX_PB_SIZE;
-        dst += dst_stride;
-    }
-}
-
-static void FUNC(put_uni_chroma_w_h)(uint8_t *_dst, ptrdiff_t _dst_stride,
-    const uint8_t *_src, ptrdiff_t _src_stride, int height, int denom, int wx, int ox,
-    const int8_t *hf, const int8_t *vf, int width)
-{
-    const pixel *src            = (const pixel *)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-    const int shift             = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    ox     = ox * (1 << (BIT_DEPTH - 8));
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++) {
-            dst[x] = av_clip_pixel((((CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
-        }
-        dst += dst_stride;
-        src += src_stride;
-    }
-}
-
-static void FUNC(put_uni_chroma_w_v)(uint8_t *_dst, const ptrdiff_t _dst_stride,
-    const uint8_t *_src, const ptrdiff_t _src_stride, const int height,
-    const int denom, const int wx, const int _ox, const int8_t *hf, const int8_t *vf,
-    const int width)
-{
-    const pixel *src            = (const pixel *)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = vf;
-    const int shift             = denom + 14 - BIT_DEPTH;
-    const int ox                = _ox * (1 << (BIT_DEPTH - 8));
-#if BIT_DEPTH < 14
-    int offset                  = 1 << (shift - 1);
-#else
-    int offset                  = 0;
-#endif
-
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++) {
-            dst[x] = av_clip_pixel((((CHROMA_FILTER(src, src_stride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
-        }
-        dst += dst_stride;
-        src += src_stride;
-    }
-}
-
-static void FUNC(put_uni_chroma_w_hv)(uint8_t *_dst, ptrdiff_t _dst_stride,
-     const uint8_t *_src, ptrdiff_t _src_stride,  int height, int denom, int wx, int ox,
-     const int8_t *hf, const int8_t *vf, int width)
-{
-    int16_t tmp_array[(MAX_PB_SIZE + CHROMA_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp                = tmp_array;
-    const pixel *src            = (const pixel *)_src;
-    pixel *dst                  = (pixel *)_dst;
-    const ptrdiff_t src_stride  = _src_stride / sizeof(pixel);
-    const ptrdiff_t dst_stride  = _dst_stride / sizeof(pixel);
-    const int8_t *filter        = hf;
-    const int shift             = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    const int offset            = 1 << (shift - 1);
-#else
-    const int offset            = 0;
-#endif
-
-    src -= CHROMA_EXTRA_BEFORE * src_stride;
-
-    for (int y = 0; y < height + CHROMA_EXTRA; y++) {
-        for (int x = 0; x < width; x++)
-            tmp[x] = CHROMA_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += src_stride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp    = tmp_array + CHROMA_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = vf;
-
-    ox     = ox * (1 << (BIT_DEPTH - 8));
-    for (int y = 0; y < height; y++) {
-        for (int x = 0; x < width; x++)
-            dst[x] = av_clip_pixel((((CHROMA_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox);
-        tmp += MAX_PB_SIZE;
-        dst += dst_stride;
-    }
-}
+#include "libavcodec/h26x/h2656_inter_template.c"
 
 static void FUNC(avg)(uint8_t *_dst, const ptrdiff_t _dst_stride,
     const int16_t *src0, const int16_t *src1, const int width, const int height)

From patchwork Tue Jan 23 18:17:05 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wu Jianhua <toqsxw@outlook.com>
X-Patchwork-Id: 45748
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:120f:b0:199:de12:6fa6 with SMTP id v15csp801553pzf;
        Tue, 23 Jan 2024 10:17:57 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IERt86nrl/kZSESnTBnSthbozPUff/QH02M4ooogEFQFa48lW4BLp/sp7HNvANfhx71YtcR
X-Received: by 2002:a17:906:3151:b0:a27:d3ee:2ef5 with SMTP id
 e17-20020a170906315100b00a27d3ee2ef5mr210301eje.24.1706033877506;
        Tue, 23 Jan 2024 10:17:57 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 p10-20020a1709066a8a00b00a27c14be748si12131473ejr.920.2024.01.23.10.17.56;
        Tue, 23 Jan 2024 10:17:57 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@outlook.com
 header.s=selector1 header.b=K36hzMj6;
       arc=fail (body hash mismatch);
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=outlook.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 968DF68D0AE;
	Tue, 23 Jan 2024 20:17:40 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from JPN01-TYC-obe.outbound.protection.outlook.com
 (mail-tycjpn01olkn2040.outbound.protection.outlook.com [40.92.99.40])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 9093A68D093
 for <ffmpeg-devel@ffmpeg.org>; Tue, 23 Jan 2024 20:17:33 +0200 (EET)
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:173::5)
 by TYWP286MB3822.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:447::6) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.36; Tue, 23 Jan
 2024 18:17:21 +0000
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080]) by TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080%3]) with mapi id 15.20.7228.022; Tue, 23 Jan 2024
 18:17:21 +0000
From: toqsxw@outlook.com
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 24 Jan 2024 02:17:05 +0800
Message-ID: 
 <TYWP286MB2172EA0CB118E576F396695ACA742@TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240123181711.402946-1-toqsxw@outlook.com>
References: <20240123181711.402946-1-toqsxw@outlook.com>
X-TMN: [nuj+6Fd59a9nOCMng7Z0tuwpMNLLy9it]
X-ClientProxiedBy: SI2P153CA0030.APCP153.PROD.OUTLOOK.COM
 (2603:1096:4:190::15) To TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 (2603:1096:400:173::5)
X-Microsoft-Original-Message-ID: <20240123181711.402946-2-toqsxw@outlook.com>
MIME-Version: 1.0
X-MS-Exchange-MessageSentRepresentingType: 1
X-MS-PublicTrafficType: Email
X-MS-TrafficTypeDiagnostic: TYWP286MB2172:EE_|TYWP286MB3822:EE_
X-MS-Office365-Filtering-Correlation-Id: bb4e17b6-6e91-42b9-df8b-08dc1c3f8a80
X-OriginatorOrg: outlook.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 bb4e17b6-6e91-42b9-df8b-08dc1c3f8a80
X-MS-Exchange-CrossTenant-AuthSource: TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 23 Jan 2024 18:17:21.3981 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa
X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 
 00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: TYWP286MB3822
Subject: [FFmpeg-devel] [PATCH v4 2/8] avcodec/hevcdsp_template: reuse
 put/put_luma/put_chroma from h2656_inter_template
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Wu Jianhua <toqsxw@outlook.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: PnfKZEjkH+LP

From: Wu Jianhua <toqsxw@outlook.com>

Signed-off-by: Wu Jianhua <toqsxw@outlook.com>
---
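Note: a minimal sketch of the forwarding pattern this patch relies on, not
code taken from the patch. The names demo_filters, demo_put_h and DEMO_FW_PUT
are hypothetical, and the sketch assumes mx and my are at least 1 so that the
[mx - 1] lookup stays inside the table. The generated wrapper keeps the
(mx, my) signature and only translates the fractional positions into filter
pointers before calling the shared kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-tap filter table standing in for the codec's real tables. */
static const int8_t demo_filters[2][4] = {
    {  0, 64,  0,  0 },  /* identity: centre sample scaled by 64 */
    { -4, 36, 36, -4 },  /* half-sample style filter, taps sum to 64 */
};

/* Shared kernel: takes filter pointers instead of fractional positions. */
static void demo_put_h(int16_t *dst, const uint8_t *src, ptrdiff_t src_stride,
                       int height, const int8_t *hf, const int8_t *vf, int width)
{
    (void)vf;  /* unused by the horizontal-only kernel */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = hf[0] * src[x - 1] + hf[1] * src[x] +
                     hf[2] * src[x + 1] + hf[3] * src[x + 2];
        src += src_stride;
        dst += width;  /* the sketch packs rows tightly */
    }
}

/* Forwarding macro: emits a wrapper that keeps the (mx, my) signature. */
#define DEMO_FW_PUT(name, kernel)                                           \
static void demo_put_codec_ ## name(int16_t *dst, const uint8_t *src,       \
                                    ptrdiff_t src_stride, int height,       \
                                    intptr_t mx, intptr_t my, int width)    \
{                                                                           \
    const int8_t *hf = demo_filters[mx - 1];  /* assumes mx >= 1 */         \
    const int8_t *vf = demo_filters[my - 1];  /* assumes my >= 1 */         \
    kernel(dst, src, src_stride, height, hf, vf, width);                    \
}

DEMO_FW_PUT(h, demo_put_h)
```

Calling demo_put_codec_h() with mx = 1 selects the identity row, so each
output is the input sample times 64; mx = 2 selects the second row. The
caller has to provide one readable sample to the left and two to the right
of each row, as the 4-tap window requires.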
 libavcodec/hevcdsp_template.c | 594 +++-------------------------------
 1 file changed, 46 insertions(+), 548 deletions(-)
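
Note: a worked example of the weighted uni-prediction arithmetic behind the
denom, wx and ox arguments taken by the FW_PUT_UNI_W wrappers, assuming
BIT_DEPTH is 8 (so the ox scaling factor is 1). demo_clip_u8 and
demo_weight_pixel are hypothetical helpers; the formula follows the
put_uni_w_pixels kernel from this series, reduced to a single sample.

```c
#include <stdint.h>

/* Clamp to the 8-bit pixel range; a stand-in for av_clip_pixel at 8 bits. */
static uint8_t demo_clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Weight one 8-bit sample: scale by wx / 2^denom with rounding, then add ox. */
static uint8_t demo_weight_pixel(int pixel, int denom, int wx, int ox)
{
    const int bit_depth = 8;                       /* assumption of this sketch */
    const int shift     = denom + 14 - bit_depth;  /* undo 14-bit scaling and denom */
    const int offset    = 1 << (shift - 1);        /* round to nearest */
    const int v         = pixel << (14 - bit_depth);

    return demo_clip_u8(((v * wx + offset) >> shift) + ox);
}
```

With denom = 6 and wx = 64 the weight is exactly 1.0, so
demo_weight_pixel(100, 6, 64, 0) returns 100. Halving wx to 32 maps 101 to
51, because the rounding offset pushes 50.5 up.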

diff --git a/libavcodec/hevcdsp_template.c b/libavcodec/hevcdsp_template.c
index 0de14e9dcf..9b48bdf08e 100644
--- a/libavcodec/hevcdsp_template.c
+++ b/libavcodec/hevcdsp_template.c
@@ -26,6 +26,7 @@
 #include "bit_depth_template.c"
 #include "hevcdsp.h"
 #include "h26x/h2656_sao_template.c"
+#include "h26x/h2656_inter_template.c"
 
 static void FUNC(put_pcm)(uint8_t *_dst, ptrdiff_t stride, int width, int height,
                           GetBitContext *gb, int pcm_bit_depth)
@@ -299,37 +300,51 @@ IDCT_DC(32)
 ////////////////////////////////////////////////////////////////////////////////
 //
 ////////////////////////////////////////////////////////////////////////////////
-static void FUNC(put_hevc_pel_pixels)(int16_t *dst,
-                                      const uint8_t *_src, ptrdiff_t _srcstride,
-                                      int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src    = (const pixel *)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = src[x] << (14 - BIT_DEPTH);
-        src += srcstride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_hevc_pel_uni_pixels)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
-                                          int height, intptr_t mx, intptr_t my, int width)
-{
-    int y;
-    const pixel *src    = (const pixel *)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-
-    for (y = 0; y < height; y++) {
-        memcpy(dst, src, width * sizeof(pixel));
-        src += srcstride;
-        dst += dststride;
-    }
-}
+#define ff_hevc_pel_filters ff_hevc_qpel_filters
+#define DECL_HV_FILTER(f)                                  \
+    const int8_t *hf = ff_hevc_ ## f ## _filters[mx - 1];  \
+    const int8_t *vf = ff_hevc_ ## f ## _filters[my - 1];
+
+#define FW_PUT(p, f, t)                                                                                   \
+static void FUNC(put_hevc_ ## f)(int16_t *dst, const uint8_t *src, ptrdiff_t srcstride, int height,       \
+                                  intptr_t mx, intptr_t my, int width)                                    \
+{                                                                                                         \
+    DECL_HV_FILTER(p)                                                                                     \
+    FUNC(put_ ## t)(dst, src, srcstride, height, hf, vf, width);                                          \
+}
+
+#define FW_PUT_UNI(p, f, t)                                                                               \
+static void FUNC(put_hevc_ ## f)(uint8_t *dst, ptrdiff_t dststride, const uint8_t *src,                   \
+                                  ptrdiff_t srcstride, int height, intptr_t mx, intptr_t my, int width)   \
+{                                                                                                         \
+    DECL_HV_FILTER(p)                                                                                     \
+    FUNC(put_ ## t)(dst, dststride, src, srcstride, height, hf, vf, width);                               \
+}
+
+#define FW_PUT_UNI_W(p, f, t)                                                                             \
+static void FUNC(put_hevc_ ## f)(uint8_t *dst, ptrdiff_t dststride, const uint8_t *src,                   \
+                                  ptrdiff_t srcstride, int height, int denom, int wx, int ox,             \
+                                  intptr_t mx, intptr_t my, int width)                                    \
+{                                                                                                         \
+    DECL_HV_FILTER(p)                                                                                     \
+    FUNC(put_ ## t)(dst, dststride, src, srcstride, height, denom, wx, ox, hf, vf, width);                \
+}
+
+#define FW_PUT_FUNCS(f, t, dir)                                       \
+    FW_PUT(f, f ## _ ## dir, t ## _ ## dir)                           \
+    FW_PUT_UNI(f, f ## _uni_ ## dir, uni_ ## t ## _ ## dir)           \
+    FW_PUT_UNI_W(f, f ## _uni_w_ ## dir, uni_ ## t ## _w_ ## dir)
+
+FW_PUT(pel, pel_pixels, pixels)
+FW_PUT_UNI(pel, pel_uni_pixels, uni_pixels)
+FW_PUT_UNI_W(pel, pel_uni_w_pixels, uni_w_pixels)
+
+FW_PUT_FUNCS(qpel, luma,   h     )
+FW_PUT_FUNCS(qpel, luma,   v     )
+FW_PUT_FUNCS(qpel, luma,   hv    )
+FW_PUT_FUNCS(epel, chroma, h     )
+FW_PUT_FUNCS(epel, chroma, v     )
+FW_PUT_FUNCS(epel, chroma, hv    )
 
 static void FUNC(put_hevc_pel_bi_pixels)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
                                          const int16_t *src2,
@@ -357,30 +372,6 @@ static void FUNC(put_hevc_pel_bi_pixels)(uint8_t *_dst, ptrdiff_t _dststride, co
     }
 }
 
-static void FUNC(put_hevc_pel_uni_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
-                                            int height, int denom, int wx, int ox, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src    = (const pixel *)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    int shift = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    ox     = ox * (1 << (BIT_DEPTH - 8));
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel((((src[x] << (14 - BIT_DEPTH)) * wx + offset) >> shift) + ox);
-        src += srcstride;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_pel_bi_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
                                            const int16_t *src2,
                                            int height, int denom, int wx0, int wx1,
@@ -420,96 +411,6 @@ static void FUNC(put_hevc_pel_bi_w_pixels)(uint8_t *_dst, ptrdiff_t _dststride,
      filter[6] * src[x + 3 * stride] +                                         \
      filter[7] * src[x + 4 * stride])
 
-static void FUNC(put_hevc_qpel_h)(int16_t *dst,
-                                  const uint8_t *_src, ptrdiff_t _srcstride,
-                                  int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel  *src       = (const pixel*)_src;
-    ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
-    const int8_t *filter    = ff_hevc_qpel_filters[mx - 1];
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_hevc_qpel_v)(int16_t *dst,
-                                  const uint8_t *_src, ptrdiff_t _srcstride,
-                                  int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel  *src       = (const pixel*)_src;
-    ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
-    const int8_t *filter    = ff_hevc_qpel_filters[my - 1];
-    for (y = 0; y < height; y++)  {
-        for (x = 0; x < width; x++)
-            dst[x] = QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_hevc_qpel_hv)(int16_t *dst,
-                                   const uint8_t *_src,
-                                   ptrdiff_t _srcstride,
-                                   int height, intptr_t mx,
-                                   intptr_t my, int width)
-{
-    int x, y;
-    const int8_t *filter;
-    const pixel *src = (const pixel*)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-    int16_t tmp_array[(MAX_PB_SIZE + QPEL_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp = tmp_array;
-
-    src   -= QPEL_EXTRA_BEFORE * srcstride;
-    filter = ff_hevc_qpel_filters[mx - 1];
-    for (y = 0; y < height + QPEL_EXTRA; y++) {
-        for (x = 0; x < width; x++)
-            tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp    = tmp_array + QPEL_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = ff_hevc_qpel_filters[my - 1];
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6;
-        tmp += MAX_PB_SIZE;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_hevc_qpel_uni_h)(uint8_t *_dst,  ptrdiff_t _dststride,
-                                      const uint8_t *_src, ptrdiff_t _srcstride,
-                                      int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel  *src       = (const pixel*)_src;
-    ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter    = ff_hevc_qpel_filters[mx - 1];
-    int shift = 14 - BIT_DEPTH;
-
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel(((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + offset) >> shift);
-        src += srcstride;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_qpel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
                                      const int16_t *src2,
                                      int height, intptr_t mx, intptr_t my, int width)
@@ -538,33 +439,6 @@ static void FUNC(put_hevc_qpel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride, const
     }
 }
 
-static void FUNC(put_hevc_qpel_uni_v)(uint8_t *_dst,  ptrdiff_t _dststride,
-                                      const uint8_t *_src, ptrdiff_t _srcstride,
-                                     int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel  *src       = (const pixel*)_src;
-    ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter    = ff_hevc_qpel_filters[my - 1];
-    int shift = 14 - BIT_DEPTH;
-
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel(((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + offset) >> shift);
-        src += srcstride;
-        dst += dststride;
-    }
-}
-
-
 static void FUNC(put_hevc_qpel_bi_v)(uint8_t *_dst, ptrdiff_t _dststride,
                                      const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                      int height, intptr_t mx, intptr_t my, int width)
@@ -593,46 +467,6 @@ static void FUNC(put_hevc_qpel_bi_v)(uint8_t *_dst, ptrdiff_t _dststride,
     }
 }
 
-static void FUNC(put_hevc_qpel_uni_hv)(uint8_t *_dst,  ptrdiff_t _dststride,
-                                       const uint8_t *_src, ptrdiff_t _srcstride,
-                                       int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const int8_t *filter;
-    const pixel *src = (const pixel*)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    int16_t tmp_array[(MAX_PB_SIZE + QPEL_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp = tmp_array;
-    int shift =  14 - BIT_DEPTH;
-
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    src   -= QPEL_EXTRA_BEFORE * srcstride;
-    filter = ff_hevc_qpel_filters[mx - 1];
-    for (y = 0; y < height + QPEL_EXTRA; y++) {
-        for (x = 0; x < width; x++)
-            tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp    = tmp_array + QPEL_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = ff_hevc_qpel_filters[my - 1];
-
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel(((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + offset) >> shift);
-        tmp += MAX_PB_SIZE;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_qpel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride,
                                       const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                       int height, intptr_t mx, intptr_t my, int width)
@@ -673,33 +507,6 @@ static void FUNC(put_hevc_qpel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride,
     }
 }
 
-static void FUNC(put_hevc_qpel_uni_w_h)(uint8_t *_dst,  ptrdiff_t _dststride,
-                                        const uint8_t *_src, ptrdiff_t _srcstride,
-                                        int height, int denom, int wx, int ox,
-                                        intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel  *src       = (const pixel*)_src;
-    ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter    = ff_hevc_qpel_filters[mx - 1];
-    int shift = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    ox = ox * (1 << (BIT_DEPTH - 8));
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel((((QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
-        src += srcstride;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_qpel_bi_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
                                        const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                        int height, int denom, int wx0, int wx1,
@@ -728,33 +535,6 @@ static void FUNC(put_hevc_qpel_bi_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
     }
 }
 
-static void FUNC(put_hevc_qpel_uni_w_v)(uint8_t *_dst,  ptrdiff_t _dststride,
-                                        const uint8_t *_src, ptrdiff_t _srcstride,
-                                        int height, int denom, int wx, int ox,
-                                        intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel  *src       = (const pixel*)_src;
-    ptrdiff_t     srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter    = ff_hevc_qpel_filters[my - 1];
-    int shift = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    ox = ox * (1 << (BIT_DEPTH - 8));
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel((((QPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
-        src += srcstride;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_qpel_bi_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
                                        const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                        int height, int denom, int wx0, int wx1,
@@ -783,47 +563,6 @@ static void FUNC(put_hevc_qpel_bi_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
     }
 }
 
-static void FUNC(put_hevc_qpel_uni_w_hv)(uint8_t *_dst,  ptrdiff_t _dststride,
-                                         const uint8_t *_src, ptrdiff_t _srcstride,
-                                         int height, int denom, int wx, int ox,
-                                         intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const int8_t *filter;
-    const pixel *src = (const pixel*)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    int16_t tmp_array[(MAX_PB_SIZE + QPEL_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp = tmp_array;
-    int shift = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    src   -= QPEL_EXTRA_BEFORE * srcstride;
-    filter = ff_hevc_qpel_filters[mx - 1];
-    for (y = 0; y < height + QPEL_EXTRA; y++) {
-        for (x = 0; x < width; x++)
-            tmp[x] = QPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp    = tmp_array + QPEL_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = ff_hevc_qpel_filters[my - 1];
-
-    ox = ox * (1 << (BIT_DEPTH - 8));
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel((((QPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox);
-        tmp += MAX_PB_SIZE;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_qpel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride,
                                         const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                         int height, int denom, int wx0, int wx1,
@@ -873,94 +612,6 @@ static void FUNC(put_hevc_qpel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride,
      filter[2] * src[x + stride] +                                             \
      filter[3] * src[x + 2 * stride])
 
-static void FUNC(put_hevc_epel_h)(int16_t *dst,
-                                  const uint8_t *_src, ptrdiff_t _srcstride,
-                                  int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src = (const pixel *)_src;
-    ptrdiff_t srcstride  = _srcstride / sizeof(pixel);
-    const int8_t *filter = ff_hevc_epel_filters[mx - 1];
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_hevc_epel_v)(int16_t *dst,
-                                  const uint8_t *_src, ptrdiff_t _srcstride,
-                                  int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src = (const pixel *)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-    const int8_t *filter = ff_hevc_epel_filters[my - 1];
-
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_hevc_epel_hv)(int16_t *dst,
-                                   const uint8_t *_src, ptrdiff_t _srcstride,
-                                   int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src = (const pixel *)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-    const int8_t *filter = ff_hevc_epel_filters[mx - 1];
-    int16_t tmp_array[(MAX_PB_SIZE + EPEL_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp = tmp_array;
-
-    src -= EPEL_EXTRA_BEFORE * srcstride;
-
-    for (y = 0; y < height + EPEL_EXTRA; y++) {
-        for (x = 0; x < width; x++)
-            tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp      = tmp_array + EPEL_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = ff_hevc_epel_filters[my - 1];
-
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6;
-        tmp += MAX_PB_SIZE;
-        dst += MAX_PB_SIZE;
-    }
-}
-
-static void FUNC(put_hevc_epel_uni_h)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
-                                      int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src = (const pixel *)_src;
-    ptrdiff_t srcstride  = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter = ff_hevc_epel_filters[mx - 1];
-    int shift = 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel(((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) + offset) >> shift);
-        src += srcstride;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_epel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride,
                                      const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                      int height, intptr_t mx, intptr_t my, int width)
@@ -988,30 +639,6 @@ static void FUNC(put_hevc_epel_bi_h)(uint8_t *_dst, ptrdiff_t _dststride,
     }
 }
 
-static void FUNC(put_hevc_epel_uni_v)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
-                                      int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src = (const pixel *)_src;
-    ptrdiff_t srcstride  = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter = ff_hevc_epel_filters[my - 1];
-    int shift = 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel(((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) + offset) >> shift);
-        src += srcstride;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_epel_bi_v)(uint8_t *_dst, ptrdiff_t _dststride,
                                      const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                      int height, intptr_t mx, intptr_t my, int width)
@@ -1038,44 +665,6 @@ static void FUNC(put_hevc_epel_bi_v)(uint8_t *_dst, ptrdiff_t _dststride,
     }
 }
 
-static void FUNC(put_hevc_epel_uni_hv)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
-                                       int height, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src = (const pixel *)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter = ff_hevc_epel_filters[mx - 1];
-    int16_t tmp_array[(MAX_PB_SIZE + EPEL_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp = tmp_array;
-    int shift = 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    src -= EPEL_EXTRA_BEFORE * srcstride;
-
-    for (y = 0; y < height + EPEL_EXTRA; y++) {
-        for (x = 0; x < width; x++)
-            tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp      = tmp_array + EPEL_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = ff_hevc_epel_filters[my - 1];
-
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel(((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) + offset) >> shift);
-        tmp += MAX_PB_SIZE;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_epel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride,
                                       const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                       int height, intptr_t mx, intptr_t my, int width)
@@ -1116,32 +705,6 @@ static void FUNC(put_hevc_epel_bi_hv)(uint8_t *_dst, ptrdiff_t _dststride,
     }
 }
 
-static void FUNC(put_hevc_epel_uni_w_h)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
-                                        int height, int denom, int wx, int ox, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src = (const pixel *)_src;
-    ptrdiff_t srcstride  = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter = ff_hevc_epel_filters[mx - 1];
-    int shift = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    ox     = ox * (1 << (BIT_DEPTH - 8));
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++) {
-            dst[x] = av_clip_pixel((((EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
-        }
-        dst += dststride;
-        src += srcstride;
-    }
-}
-
 static void FUNC(put_hevc_epel_bi_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
                                        const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                        int height, int denom, int wx0, int wx1,
@@ -1168,32 +731,6 @@ static void FUNC(put_hevc_epel_bi_w_h)(uint8_t *_dst, ptrdiff_t _dststride,
     }
 }
 
-static void FUNC(put_hevc_epel_uni_w_v)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
-                                        int height, int denom, int wx, int ox, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src = (const pixel *)_src;
-    ptrdiff_t srcstride  = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter = ff_hevc_epel_filters[my - 1];
-    int shift = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    ox     = ox * (1 << (BIT_DEPTH - 8));
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++) {
-            dst[x] = av_clip_pixel((((EPEL_FILTER(src, srcstride) >> (BIT_DEPTH - 8)) * wx + offset) >> shift) + ox);
-        }
-        dst += dststride;
-        src += srcstride;
-    }
-}
-
 static void FUNC(put_hevc_epel_bi_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
                                        const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                        int height, int denom, int wx0, int wx1,
@@ -1220,45 +757,6 @@ static void FUNC(put_hevc_epel_bi_w_v)(uint8_t *_dst, ptrdiff_t _dststride,
     }
 }
 
-static void FUNC(put_hevc_epel_uni_w_hv)(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride,
-                                         int height, int denom, int wx, int ox, intptr_t mx, intptr_t my, int width)
-{
-    int x, y;
-    const pixel *src = (const pixel *)_src;
-    ptrdiff_t srcstride = _srcstride / sizeof(pixel);
-    pixel *dst          = (pixel *)_dst;
-    ptrdiff_t dststride = _dststride / sizeof(pixel);
-    const int8_t *filter = ff_hevc_epel_filters[mx - 1];
-    int16_t tmp_array[(MAX_PB_SIZE + EPEL_EXTRA) * MAX_PB_SIZE];
-    int16_t *tmp = tmp_array;
-    int shift = denom + 14 - BIT_DEPTH;
-#if BIT_DEPTH < 14
-    int offset = 1 << (shift - 1);
-#else
-    int offset = 0;
-#endif
-
-    src -= EPEL_EXTRA_BEFORE * srcstride;
-
-    for (y = 0; y < height + EPEL_EXTRA; y++) {
-        for (x = 0; x < width; x++)
-            tmp[x] = EPEL_FILTER(src, 1) >> (BIT_DEPTH - 8);
-        src += srcstride;
-        tmp += MAX_PB_SIZE;
-    }
-
-    tmp      = tmp_array + EPEL_EXTRA_BEFORE * MAX_PB_SIZE;
-    filter = ff_hevc_epel_filters[my - 1];
-
-    ox     = ox * (1 << (BIT_DEPTH - 8));
-    for (y = 0; y < height; y++) {
-        for (x = 0; x < width; x++)
-            dst[x] = av_clip_pixel((((EPEL_FILTER(tmp, MAX_PB_SIZE) >> 6) * wx + offset) >> shift) + ox);
-        tmp += MAX_PB_SIZE;
-        dst += dststride;
-    }
-}
-
 static void FUNC(put_hevc_epel_bi_w_hv)(uint8_t *_dst, ptrdiff_t _dststride,
                                         const uint8_t *_src, ptrdiff_t _srcstride, const int16_t *src2,
                                         int height, int denom, int wx0, int wx1,

From patchwork Tue Jan 23 18:17:06 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wu Jianhua <toqsxw@outlook.com>
X-Patchwork-Id: 45749
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:120f:b0:199:de12:6fa6 with SMTP id v15csp801640pzf;
        Tue, 23 Jan 2024 10:18:07 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IFeNsxm+Cty9K1JjFFl8U/Jxy5l4+74/ZWasZvB1o+LUZUMgPVgYeuwdAV4JBqPeFPTYUKR
X-Received: by 2002:aa7:d1d8:0:b0:553:627f:4e48 with SMTP id
 g24-20020aa7d1d8000000b00553627f4e48mr1149034edp.55.1706033886922;
        Tue, 23 Jan 2024 10:18:06 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 d14-20020a50cd4e000000b0055c72de0fddsi1212852edj.91.2024.01.23.10.18.06;
        Tue, 23 Jan 2024 10:18:06 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@outlook.com
 header.s=selector1 header.b=JNRU3Of1;
       arc=fail (body hash mismatch);
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=outlook.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 7707568D093;
	Tue, 23 Jan 2024 20:17:44 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from JPN01-OS0-obe.outbound.protection.outlook.com
 (mail-os0jpn01olkn2100.outbound.protection.outlook.com [40.92.98.100])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 449D568D08E
 for <ffmpeg-devel@ffmpeg.org>; Tue, 23 Jan 2024 20:17:37 +0200 (EET)
ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none;
 b=iDEMYwcxG7tDldamlXa+A1zNs6K9LMRK6ltGN79qLjdTDloO08IwyxM2z9NbSh01ZYed+7L7Zvbbz4y9dK9CcXBcWD7a6l0JbYOIBdoKFzlaHCbx8hqUcAaVZdG0mDWkdpUDDIfXCbi2O5OZwQKYAk6t3p1wwT5OKuvtIPphOXmynGJNE6H/1nAcEKMW7syuYd0/1PAtvHcA2sMKEAfmbjhhpaHK/EmZA4HscX2wja4obXUFNpdmeqE5XmbgYi2Tn3ZiGFikSe9fPW4vR+PMbSVX7vZMtpdDvcxEzPRBIVacGCeF9Q8EJZe7Exd1G+2fb9lQlxRTPfAqJv8DCFhIag==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com;
 s=arcselector9901;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1;
 bh=8Qjvt+2+6x3/gkEweKR0e4pBQ77bIlzOJPJaAgmEyko=;
 b=f7HxLPHn1PldIpzxiWvoL5TlLASk76vZgaIGE8LskUMP3d17FUhCLs+Ldub8w+lmPqU3JjZIdJzxx1CxewM0dF33xpfL717GI00Cl3r+BnqQXGkVDaZaFeXaLP4TwmvY1rYsrZYm/c66dM9VAboJm/093iiPPw7A1FuYgqfZsocaOYaaUhBz8L9UYmHkiz0AT46nAOJ8vH+3/cZbd+gRlHhx/XwubQb9ORQkzPWjiigJXFW2IkPeUbDaDR/QD1Iho0E0z0JseUbYxeUmcEyxNpgu+xPAqvGx7WXNlUszovWF6qkoRtoW1ZhC7pimYHWjcJwRK8RrhBxD9dp2ZxLjpQ==
ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=none; dmarc=none;
 dkim=none; arc=none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=outlook.com;
 s=selector1;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=8Qjvt+2+6x3/gkEweKR0e4pBQ77bIlzOJPJaAgmEyko=;
 b=JNRU3Of1z6XB7QhvRYpGqgQqM7YIf991nNevg3zP31s3irlzOKL/2iNgMru9gLREsI8W8SCbonp3Q85cBhpCY9mUDp+pexaLsDzRKUX0e9W45oH0t0ydwocqw/t50MjfCqQUA3sCadhlJSArkr0SeZ6KYPTv8WpNqcgXtWse0Q+NsNGbF3xE9MzwilSgDZn/+nhRwS0svWgPT8Dp6L7+SUzm9VIuNdQu5rs8kgV1asrMvmYvwKFDgP6b/zo8V5gozsxldkyVoFl+Gb6jywgSAamTjXrKjuz7n/mSaToB+aO1RlgQh/C6+EI3EeTFyu3sHywhsB1OZ1mgcn8OdLH5lQ==
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:173::5)
 by TYWP286MB3822.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:447::6) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.36; Tue, 23 Jan
 2024 18:17:22 +0000
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080]) by TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080%3]) with mapi id 15.20.7228.022; Tue, 23 Jan 2024
 18:17:22 +0000
From: toqsxw@outlook.com
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 24 Jan 2024 02:17:06 +0800
Message-ID: 
 <TYWP286MB21720FABDB5F5A32F99CA59FCA742@TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240123181711.402946-1-toqsxw@outlook.com>
References: <20240123181711.402946-1-toqsxw@outlook.com>
X-TMN: [/55ZUZsfKUAGnDrdH9xatA2HKHmSkjgu]
X-ClientProxiedBy: SI2P153CA0030.APCP153.PROD.OUTLOOK.COM
 (2603:1096:4:190::15) To TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 (2603:1096:400:173::5)
X-Microsoft-Original-Message-ID: <20240123181711.402946-3-toqsxw@outlook.com>
MIME-Version: 1.0
X-MS-Exchange-MessageSentRepresentingType: 1
X-MS-PublicTrafficType: Email
X-MS-TrafficTypeDiagnostic: TYWP286MB2172:EE_|TYWP286MB3822:EE_
X-MS-Office365-Filtering-Correlation-Id: 23b1785a-420a-4e3c-f085-08dc1c3f8b0b
X-Microsoft-Antispam: BCL:0;
X-Microsoft-Antispam-Message-Info: 
 hCuvWSuOalJyVEqVClXBbag7/RalIrM5zxsqHDTaeeK717fcbuNH9CCQ4V98rR2z+B1un+U92NgsEiGFDYCjJz1UOZw5cyUp4NprwqsQmCt0bSotU1HIeeNbgrh8BiriXnY0ngrONqqSzbbLzvj9wRuhgjR1eJ9z2AuIjLpM0lyPYnNXsLgnHrkEw9uj6f1Mu52MfZO5gEmc07xbnJIjK+++SnhiHjF5YK+oPnuwDFx+QQMFXsQqQXhpOScBuyZFL/TPMgsye/beX0QOFqBAFUBsRWT1YdmTjwY/BfWWM/x/LQ0RtJ91vKAb66EpnoO679KGt61Get9gK4zz9eAWNEg1qPrlbs2/7PQiQUmo2Hpvk+05qEILiCrDlDQohoZ9QWZPWIUmAh6Gj648htjIlb4JEiBUjBUCy8q41xMSjXFoqIv3XvE2SDlqtGxoVmIKp4NoMpi9EEnAhXOLbJ0xc6AkwuWizObyja5xkzbhi29GApdDS+Owle5odxvQPJxD0FR7wZ+R5oPUpYUNEVdSj5xAh/ptU6+WEYKUO2+CYGCZSWBlQU7VnYJjL2ONJ1yqmSHOBdWSQfExaCM9feaz0i4JHtb2BRNMyXTRE/MsHsd9dF6ZO70C0DaMeab/w9sH
X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1
X-MS-Exchange-AntiSpam-MessageData-0: 
 7QlzG65xC58A+dgSO25T1dKhO/xDAmdmtIiY6TL22JTNdTIQrQLwvzbfZtYvHdHQjDwVdWgnX0lyXQo8NrXAEVgHtBdjJb96fTvTsvy5IVs/ArXOFL5nEXbbj8ocvAQI+yl7cOfaysYriihfXrWbSaoQyK/Gwl8REHQSvfdnFWQ6Km6Fe/Zi03Mjl4nC6TKEEQX+puTXZkimNIOOeLJFpAG5BCHfxNLMmLLeQJhommi+N4VckBlaPZuBQfyBdlDZWMPMHo6KB6J/MC7g4zV3/XcxcGN4q6eNkDIuowonTCQzNFadLVoM+1TyWln8uyY2/bMoA7sQKCVeBsXqG/BfS0h1Fq37oKhS1bn082s9dQRZ3gqDMjfI6NBaEgioLjRI6jIVsC7zHtFQ72wAJt0vvemokjgoAoQkkynQ8Nay7jOhm2d6y0sBQZb+pBlFYpiRu6oqtmKettxZfvQyNfMKiB4pltvJLdyUDwctW3isgzH+bBydwQVE1bqBKsFpj7LVnHaL86dQxh39Z+MrDF9CkjdNKzdY1+CyPt4Ws/siNQBpW8LX3he4MpHodBIPUWyCTe9ZhygRpUQRGy3to1uVdTQGnfcYuDfyB7G8roHVuhhrc/6dKGzELCb5HuUyAJAt3akucfSEDzUSPdIxZ3rXM7ZfoMoHuRoD00fTm2pl07cj+m00ooDHiMXEIBvz6rQ/XAMcaZ02tybzCN93pg3Jj3F9krqGbS/o5HVfh+H9r1T9qYMS6NyNp7cRMsyQ7AM8qhf+0+NegI+RMHUN6kYOUW1CBVgMBPBedE0m+rkgD5RnxLcAmeUniZmFYUqNGg4lseifyBvcP2QcdMlDE9OvUWgMG0dTyzOu5wdMS1zxjEA9UFANP5vWLFrPzbnCKjs4BCsRjgqaabkVEx3d6tDkV4SQKEz2AEklJEUt1CfQLLkuAaPWoqUpvkfXZBc/BD0uzhj5B/lPjx0mPkua5cRuNQGIArv5SK1bOR6FGhb0VS3uuNY+NYZ+ZY6j+0jSSA7XzgOKlYAspRe8WobqYOWpVesPtALyF83QziSd+CkeMw67i1qQmiDAjJdjrYWcXZxUpNplx8mZ4qxeqdwno/R3cB7GbXP2bnnaExYyCeFVGe2QGnNZ702bDnJi4z/gwDIdYDh4/ILKC0XMxCIwIh+k6Ff5X1zmWvk93tJqwAyU1DS7obLCwiQB69RMgpYxbXfBK7aSeL0xg5/r8r5hXDaxY14oyfUX4zVS5EUfZ+mmdDk=
X-OriginatorOrg: outlook.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 23b1785a-420a-4e3c-f085-08dc1c3f8b0b
X-MS-Exchange-CrossTenant-AuthSource: TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 23 Jan 2024 18:17:22.5862 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa
X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 
 00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: TYWP286MB3822
Subject: [FFmpeg-devel] [PATCH v4 3/8] avcodec/x86/hevc_mc: move put/put_uni
 to h26x/h2656_inter.asm
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Wu Jianhua <toqsxw@outlook.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: ZXs0UskObC1r

From: Wu Jianhua <toqsxw@outlook.com>

This enables the asm optimizations to be reused by VVC.

Signed-off-by: Wu Jianhua <toqsxw@outlook.com>
---
 libavcodec/x86/Makefile             |    1 +
 libavcodec/x86/h26x/h2656_inter.asm | 1145 +++++++++++++++++++++++++++
 libavcodec/x86/h26x/h2656dsp.c      |   98 +++
 libavcodec/x86/h26x/h2656dsp.h      |  103 +++
 libavcodec/x86/hevc_mc.asm          |  462 +----------
 libavcodec/x86/hevcdsp_init.c       |  108 ++-
 6 files changed, 1471 insertions(+), 446 deletions(-)
 create mode 100644 libavcodec/x86/h26x/h2656_inter.asm
 create mode 100644 libavcodec/x86/h26x/h2656dsp.c
 create mode 100644 libavcodec/x86/h26x/h2656dsp.h
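
Note on the put_uni rounding, for reviewers: UNI_COMPUTE multiplies each
16-bit intermediate by scale_8/10/12 (pw_512, pw_2048, pw_8192) with
pmulhrsw. The scalar sketch below models one pmulhrsw lane and checks that,
for these three constants, the result equals a rounding right shift by
(14 - bitdepth). It is an illustration only and not part of the patch; the
function names are hypothetical, and the code assumes that >> on negative
signed values is an arithmetic shift, as it is on the usual x86 compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of x86 PMULHRSW:
 * (((a * b) >> 14) + 1) >> 1.
 * Assumption: >> on negative signed values is an arithmetic shift. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t t = ((int32_t)a * b) >> 14;
    return (int16_t)((t + 1) >> 1);
}

/* Constant that scale_<bitdepth> resolves to in the asm:
 * 512, 2048 and 8192 for 8, 10 and 12 bits, i.e. 1 << (bitdepth + 1). */
static int16_t uni_scale(int bitdepth)
{
    return (int16_t)(1 << (bitdepth + 1));
}

/* Plain rounding shift by (14 - bitdepth), written the scalar way. */
static int16_t round_shift(int16_t v, int bitdepth)
{
    int shift = 14 - bitdepth;
    return (int16_t)((v + (1 << (shift - 1))) >> shift);
}

/* Counts inputs for which the two formulations disagree,
 * scanning every possible 16-bit intermediate value. */
static int count_scale_mismatches(int bitdepth)
{
    int mismatches = 0;
    for (int32_t v = INT16_MIN; v <= INT16_MAX; v++) {
        int16_t simd   = pmulhrsw_lane((int16_t)v, uni_scale(bitdepth));
        int16_t scalar = round_shift((int16_t)v, bitdepth);
        if (simd != scalar)
            mismatches++;
    }
    return mismatches;
}

/* Hypothetical self-check covering the three bit depths used here. */
static void check_all_bitdepths(void)
{
    assert(count_scale_mismatches(8)  == 0);
    assert(count_scale_mismatches(10) == 0);
    assert(count_scale_mismatches(12) == 0);
}
```

Calling check_all_bitdepths() from a small test program exercises all three
constants. The sketch covers only the scaling step; the clipping done by
packuswb or CLIPW afterwards is left out.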

diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index d5fb30645a..8098cd840c 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -167,6 +167,7 @@ X86ASM-OBJS-$(CONFIG_HEVC_DECODER)     += x86/hevc_add_res.o            \
                                           x86/hevc_deblock.o            \
                                           x86/hevc_idct.o               \
                                           x86/hevc_mc.o                 \
+                                          x86/h26x/h2656_inter.o        \
                                           x86/hevc_sao.o                \
                                           x86/hevc_sao_10bit.o
 X86ASM-OBJS-$(CONFIG_JPEG2000_DECODER) += x86/jpeg2000dsp.o
diff --git a/libavcodec/x86/h26x/h2656_inter.asm b/libavcodec/x86/h26x/h2656_inter.asm
new file mode 100644
index 0000000000..aa296d549c
--- /dev/null
+++ b/libavcodec/x86/h26x/h2656_inter.asm
@@ -0,0 +1,1145 @@
+; /*
+; * Provide SSE luma and chroma mc functions for HEVC/VVC decoding
+; * Copyright (c) 2013 Pierre-Edouard LEPERE
+; * Copyright (c) 2023-2024 Nuo Mi
+; * Copyright (c) 2023-2024 Wu Jianhua
+; *
+; * This file is part of FFmpeg.
+; *
+; * FFmpeg is free software; you can redistribute it and/or
+; * modify it under the terms of the GNU Lesser General Public
+; * License as published by the Free Software Foundation; either
+; * version 2.1 of the License, or (at your option) any later version.
+; *
+; * FFmpeg is distributed in the hope that it will be useful,
+; * but WITHOUT ANY WARRANTY; without even the implied warranty of
+; * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+; * Lesser General Public License for more details.
+; *
+; * You should have received a copy of the GNU Lesser General Public
+; * License along with FFmpeg; if not, write to the Free Software
+; * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+; */
+%include "libavutil/x86/x86util.asm"
+
+%define MAX_PB_SIZE 64
+
+SECTION_RODATA 32
+cextern pw_255
+cextern pw_512
+cextern pw_2048
+cextern pw_1023
+cextern pw_1024
+cextern pw_4096
+cextern pw_8192
+%define scale_8 pw_512
+%define scale_10 pw_2048
+%define scale_12 pw_8192
+%define max_pixels_8 pw_255
+%define max_pixels_10 pw_1023
+max_pixels_12:          times 16 dw ((1 << 12)-1)
+cextern pb_0
+
+SECTION .text
+%macro SIMPLE_LOAD 4    ;width, bitd, tab, r1
+%if %1 == 2 || (%2 == 8 && %1 <= 4)
+    movd              %4, [%3]                                               ; load data from source
+%elif %1 == 4 || (%2 == 8 && %1 <= 8)
+    movq              %4, [%3]                                               ; load data from source
+%elif notcpuflag(avx)
+    movu              %4, [%3]                                               ; load data from source
+%elif %1 <= 8 || (%2 == 8 && %1 <= 16)
+    movdqu           %4, [%3]
+%else
+    movu              %4, [%3]
+%endif
+%endmacro
+
+%macro VPBROADCASTW 2
+%if notcpuflag(avx2)
+    movd           %1, %2
+    pshuflw        %1, %1, 0
+    punpcklwd      %1, %1
+%else
+    vpbroadcastw   %1, %2
+%endif
+%endmacro
+
+%macro MC_4TAP_FILTER 4 ; bitdepth, filter, a, b,
+    VPBROADCASTW   %3, [%2q + 0 * 2]  ; coeff 0, 1
+    VPBROADCASTW   %4, [%2q + 1 * 2]  ; coeff 2, 3
+%if %1 != 8
+    pmovsxbw       %3, xmm%3
+    pmovsxbw       %4, xmm%4
+%endif
+%endmacro
+
+%macro MC_4TAP_HV_FILTER 1
+    VPBROADCASTW  m12, [vfq + 0 * 2]  ; vf 0, 1
+    VPBROADCASTW  m13, [vfq + 1 * 2]  ; vf 2, 3
+    VPBROADCASTW  m14, [hfq + 0 * 2]  ; hf 0, 1
+    VPBROADCASTW  m15, [hfq + 1 * 2]  ; hf 2, 3
+
+    pmovsxbw      m12, xm12
+    pmovsxbw      m13, xm13
+%if %1 != 8
+    pmovsxbw      m14, xm14
+    pmovsxbw      m15, xm15
+%endif
+    lea           r3srcq, [srcstrideq*3]
+%endmacro
+
+%macro MC_8TAP_SAVE_FILTER 5    ;offset, mm registers
+    mova [rsp + %1 + 0*mmsize], %2
+    mova [rsp + %1 + 1*mmsize], %3
+    mova [rsp + %1 + 2*mmsize], %4
+    mova [rsp + %1 + 3*mmsize], %5
+%endmacro
+
+%macro MC_8TAP_FILTER 2-3 ;bitdepth, filter, offset
+    VPBROADCASTW                      m12, [%2q + 0 * 2]  ; coeff 0, 1
+    VPBROADCASTW                      m13, [%2q + 1 * 2]  ; coeff 2, 3
+    VPBROADCASTW                      m14, [%2q + 2 * 2]  ; coeff 4, 5
+    VPBROADCASTW                      m15, [%2q + 3 * 2]  ; coeff 6, 7
+%if %0 == 3
+    MC_8TAP_SAVE_FILTER                %3, m12, m13, m14, m15
+%endif
+
+%if %1 != 8
+    pmovsxbw                          m12, xm12
+    pmovsxbw                          m13, xm13
+    pmovsxbw                          m14, xm14
+    pmovsxbw                          m15, xm15
+    %if %0 == 3
+    MC_8TAP_SAVE_FILTER     %3 + 4*mmsize, m12, m13, m14, m15
+    %endif
+%elif %0 == 3
+    pmovsxbw                          m8, xm12
+    pmovsxbw                          m9, xm13
+    pmovsxbw                         m10, xm14
+    pmovsxbw                         m11, xm15
+    MC_8TAP_SAVE_FILTER     %3 + 4*mmsize, m8, m9, m10, m11
+%endif
+
+%endmacro
+
+%macro MC_4TAP_LOAD 4
+%if (%1 == 8 && %4 <= 4)
+%define %%load movd
+%elif (%1 == 8 && %4 <= 8) || (%1 > 8 && %4 <= 4)
+%define %%load movq
+%else
+%define %%load movdqu
+%endif
+
+    %%load            m0, [%2q ]
+%ifnum %3
+    %%load            m1, [%2q+  %3]
+    %%load            m2, [%2q+2*%3]
+    %%load            m3, [%2q+3*%3]
+%else
+    %%load            m1, [%2q+  %3q]
+    %%load            m2, [%2q+2*%3q]
+    %%load            m3, [%2q+r3srcq]
+%endif
+%if %1 == 8
+%if %4 > 8
+    SBUTTERFLY        bw, 0, 1, 7
+    SBUTTERFLY        bw, 2, 3, 7
+%else
+    punpcklbw         m0, m1
+    punpcklbw         m2, m3
+%endif
+%else
+%if %4 > 4
+    SBUTTERFLY        wd, 0, 1, 7
+    SBUTTERFLY        wd, 2, 3, 7
+%else
+    punpcklwd         m0, m1
+    punpcklwd         m2, m3
+%endif
+%endif
+%endmacro
+
+%macro MC_8TAP_H_LOAD 4
+%assign %%stride (%1+7)/8
+%if %1 == 8
+%if %3 <= 4
+%define %%load movd
+%elif %3 == 8
+%define %%load movq
+%else
+%define %%load movu
+%endif
+%else
+%if %3 == 2
+%define %%load movd
+%elif %3 == 4
+%define %%load movq
+%else
+%define %%load movu
+%endif
+%endif
+    %%load            m0, [%2-3*%%stride]        ;load data from source
+    %%load            m1, [%2-2*%%stride]
+    %%load            m2, [%2-%%stride  ]
+    %%load            m3, [%2           ]
+    %%load            m4, [%2+%%stride  ]
+    %%load            m5, [%2+2*%%stride]
+    %%load            m6, [%2+3*%%stride]
+    %%load            m7, [%2+4*%%stride]
+
+%if %1 == 8
+%if %3 > 8
+    SBUTTERFLY        wd, 0, 1, %4
+    SBUTTERFLY        wd, 2, 3, %4
+    SBUTTERFLY        wd, 4, 5, %4
+    SBUTTERFLY        wd, 6, 7, %4
+%else
+    punpcklbw         m0, m1
+    punpcklbw         m2, m3
+    punpcklbw         m4, m5
+    punpcklbw         m6, m7
+%endif
+%else
+%if %3 > 4
+    SBUTTERFLY        dq, 0, 1, %4
+    SBUTTERFLY        dq, 2, 3, %4
+    SBUTTERFLY        dq, 4, 5, %4
+    SBUTTERFLY        dq, 6, 7, %4
+%else
+    punpcklwd         m0, m1
+    punpcklwd         m2, m3
+    punpcklwd         m4, m5
+    punpcklwd         m6, m7
+%endif
+%endif
+%endmacro
+
+%macro MC_8TAP_V_LOAD 5
+    lea              %5q, [%2]
+    sub              %5q, r3srcq
+    movu              m0, [%5q            ]      ;load x- 3*srcstride
+    movu              m1, [%5q+   %3q     ]      ;load x- 2*srcstride
+    movu              m2, [%5q+ 2*%3q     ]      ;load x-srcstride
+    movu              m3, [%2       ]      ;load x
+    movu              m4, [%2+   %3q]      ;load x+stride
+    movu              m5, [%2+ 2*%3q]      ;load x+2*stride
+    movu              m6, [%2+r3srcq]      ;load x+3*stride
+    movu              m7, [%2+ 4*%3q]      ;load x+4*stride
+%if %1 == 8
+%if %4 > 8
+    SBUTTERFLY        bw, 0, 1, 8
+    SBUTTERFLY        bw, 2, 3, 8
+    SBUTTERFLY        bw, 4, 5, 8
+    SBUTTERFLY        bw, 6, 7, 8
+%else
+    punpcklbw         m0, m1
+    punpcklbw         m2, m3
+    punpcklbw         m4, m5
+    punpcklbw         m6, m7
+%endif
+%else
+%if %4 > 4
+    SBUTTERFLY        wd, 0, 1, 8
+    SBUTTERFLY        wd, 2, 3, 8
+    SBUTTERFLY        wd, 4, 5, 8
+    SBUTTERFLY        wd, 6, 7, 8
+%else
+    punpcklwd         m0, m1
+    punpcklwd         m2, m3
+    punpcklwd         m4, m5
+    punpcklwd         m6, m7
+%endif
+%endif
+%endmacro
+
+%macro PEL_12STORE2 3
+    movd           [%1], %2
+%endmacro
+%macro PEL_12STORE4 3
+    movq           [%1], %2
+%endmacro
+%macro PEL_12STORE6 3
+    movq           [%1], %2
+    psrldq            %2, 8
+    movd         [%1+8], %2
+%endmacro
+%macro PEL_12STORE8 3
+    movdqu         [%1], %2
+%endmacro
+%macro PEL_12STORE12 3
+    PEL_12STORE8     %1, %2, %3
+    movq        [%1+16], %3
+%endmacro
+%macro PEL_12STORE16 3
+%if cpuflag(avx2)
+    movu            [%1], %2
+%else
+    PEL_12STORE8      %1, %2, %3
+    movdqu       [%1+16], %3
+%endif
+%endmacro
+
+%macro PEL_10STORE2 3
+    movd           [%1], %2
+%endmacro
+%macro PEL_10STORE4 3
+    movq           [%1], %2
+%endmacro
+%macro PEL_10STORE6 3
+    movq           [%1], %2
+    psrldq            %2, 8
+    movd         [%1+8], %2
+%endmacro
+%macro PEL_10STORE8 3
+    movdqu         [%1], %2
+%endmacro
+%macro PEL_10STORE12 3
+    PEL_10STORE8     %1, %2, %3
+    movq        [%1+16], %3
+%endmacro
+%macro PEL_10STORE16 3
+%if cpuflag(avx2)
+    movu            [%1], %2
+%else
+    PEL_10STORE8      %1, %2, %3
+    movdqu       [%1+16], %3
+%endif
+%endmacro
+%macro PEL_10STORE32 3
+    PEL_10STORE16     %1, %2, %3
+    movu         [%1+32], %3
+%endmacro
+
+%macro PEL_8STORE2 3
+    pextrw          [%1], %2, 0
+%endmacro
+%macro PEL_8STORE4 3
+    movd            [%1], %2
+%endmacro
+%macro PEL_8STORE6 3
+    movd            [%1], %2
+    pextrw        [%1+4], %2, 2
+%endmacro
+%macro PEL_8STORE8 3
+    movq           [%1], %2
+%endmacro
+%macro PEL_8STORE12 3
+    movq            [%1], %2
+    psrldq            %2, 8
+    movd          [%1+8], %2
+%endmacro
+%macro PEL_8STORE16 3
+%if cpuflag(avx2)
+    movdqu        [%1], %2
+%else
+    movu          [%1], %2
+%endif ; avx
+%endmacro
+%macro PEL_8STORE32 3
+    movu          [%1], %2
+%endmacro
+
+%macro LOOP_END 3
+    add              %1q, 2*MAX_PB_SIZE          ; dst += dststride
+    add              %2q, %3q                    ; src += srcstride
+    dec          heightd                         ; cmp height
+    jnz               .loop                      ; height loop
+%endmacro
+
+
+%macro MC_PIXEL_COMPUTE 2-3 ;width, bitdepth
+%if %2 == 8
+%if cpuflag(avx2) && %0 ==3
+%if %1 > 16
+    vextracti128 xm1, m0, 1
+    pmovzxbw      m1, xm1
+    psllw         m1, 14-%2
+%endif
+    pmovzxbw      m0, xm0
+%else ; not avx
+%if %1 > 8
+    punpckhbw     m1, m0, m2
+    psllw         m1, 14-%2
+%endif
+    punpcklbw     m0, m2
+%endif ; avx2
+%endif ; bitdepth == 8
+    psllw         m0, 14-%2
+%endmacro
+
+%macro MC_4TAP_COMPUTE 4-8 ; bitdepth, width, filter1, filter2, HV/m0, m2, m1, m3
+%if %0 == 8
+%define %%reg0 %5
+%define %%reg2 %6
+%define %%reg1 %7
+%define %%reg3 %8
+%else
+%define %%reg0 m0
+%define %%reg2 m2
+%define %%reg1 m1
+%define %%reg3 m3
+%endif
+%if %1 == 8
+%if cpuflag(avx2) && (%0 == 5)
+%if %2 > 16
+    vperm2i128    m10, m0, m1, q0301
+%endif
+    vinserti128    m0, m0, xm1, 1
+    mova           m1, m10
+%if %2 > 16
+    vperm2i128    m10, m2, m3, q0301
+%endif
+    vinserti128    m2, m2, xm3, 1
+    mova           m3, m10
+%endif
+    pmaddubsw      %%reg0, %3   ;x1*c1+x2*c2
+    pmaddubsw      %%reg2, %4   ;x3*c3+x4*c4
+    paddw          %%reg0, %%reg2
+%if %2 > 8
+    pmaddubsw      %%reg1, %3
+    pmaddubsw      %%reg3, %4
+    paddw          %%reg1, %%reg3
+%endif
+%else
+    pmaddwd        %%reg0, %3
+    pmaddwd        %%reg2, %4
+    paddd          %%reg0, %%reg2
+%if %2 > 4
+    pmaddwd        %%reg1, %3
+    pmaddwd        %%reg3, %4
+    paddd          %%reg1, %%reg3
+%if %1 != 8
+    psrad          %%reg1, %1-8
+%endif
+%endif
+%if %1 != 8
+    psrad          %%reg0, %1-8
+%endif
+    packssdw       %%reg0, %%reg1
+%endif
+%endmacro
+
+%macro MC_8TAP_HV_COMPUTE 4     ; width, bitdepth, filter, pack mnemonic without its leading 'p' (e.g. ackssdw)
+
+%if %2 == 8
+    pmaddubsw         m0, [%3q+0*mmsize]    ;x1*c1+x2*c2
+    pmaddubsw         m2, [%3q+1*mmsize]    ;x3*c3+x4*c4
+    pmaddubsw         m4, [%3q+2*mmsize]    ;x5*c5+x6*c6
+    pmaddubsw         m6, [%3q+3*mmsize]    ;x7*c7+x8*c8
+    paddw             m0, m2
+    paddw             m4, m6
+    paddw             m0, m4
+%else
+    pmaddwd           m0, [%3q+4*mmsize]
+    pmaddwd           m2, [%3q+5*mmsize]
+    pmaddwd           m4, [%3q+6*mmsize]
+    pmaddwd           m6, [%3q+7*mmsize]
+    paddd             m0, m2
+    paddd             m4, m6
+    paddd             m0, m4
+%if %2 != 8
+    psrad             m0, %2-8
+%endif
+%if %1 > 4
+    pmaddwd           m1, [%3q+4*mmsize]
+    pmaddwd           m3, [%3q+5*mmsize]
+    pmaddwd           m5, [%3q+6*mmsize]
+    pmaddwd           m7, [%3q+7*mmsize]
+    paddd             m1, m3
+    paddd             m5, m7
+    paddd             m1, m5
+%if %2 != 8
+    psrad             m1, %2-8
+%endif
+%endif
+    p%4               m0, m1
+%endif
+%endmacro
+
+
+%macro MC_8TAP_COMPUTE 2-3     ; width, bitdepth
+%if %2 == 8
+%if cpuflag(avx2) && (%0 == 3)
+
+    vperm2i128 m10, m0,  m1, q0301
+    vinserti128 m0, m0, xm1, 1
+    SWAP 1, 10
+
+    vperm2i128 m10, m2,  m3, q0301
+    vinserti128 m2, m2, xm3, 1
+    SWAP 3, 10
+
+
+    vperm2i128 m10, m4,  m5, q0301
+    vinserti128 m4, m4, xm5, 1
+    SWAP 5, 10
+
+    vperm2i128 m10, m6,  m7, q0301
+    vinserti128 m6, m6, xm7, 1
+    SWAP 7, 10
+%endif
+
+    pmaddubsw         m0, m12   ;x1*c1+x2*c2
+    pmaddubsw         m2, m13   ;x3*c3+x4*c4
+    pmaddubsw         m4, m14   ;x5*c5+x6*c6
+    pmaddubsw         m6, m15   ;x7*c7+x8*c8
+    paddw             m0, m2
+    paddw             m4, m6
+    paddw             m0, m4
+%if %1 > 8
+    pmaddubsw         m1, m12
+    pmaddubsw         m3, m13
+    pmaddubsw         m5, m14
+    pmaddubsw         m7, m15
+    paddw             m1, m3
+    paddw             m5, m7
+    paddw             m1, m5
+%endif
+%else
+    pmaddwd           m0, m12
+    pmaddwd           m2, m13
+    pmaddwd           m4, m14
+    pmaddwd           m6, m15
+    paddd             m0, m2
+    paddd             m4, m6
+    paddd             m0, m4
+%if %2 != 8
+    psrad             m0, %2-8
+%endif
+%if %1 > 4
+    pmaddwd           m1, m12
+    pmaddwd           m3, m13
+    pmaddwd           m5, m14
+    pmaddwd           m7, m15
+    paddd             m1, m3
+    paddd             m5, m7
+    paddd             m1, m5
+%if %2 != 8
+    psrad             m1, %2-8
+%endif
+%endif
+%endif
+%endmacro
+%macro UNI_COMPUTE 5
+    pmulhrsw          %3, %5
+%if %1 > 8 || (%2 > 8 && %1 > 4)
+    pmulhrsw          %4, %5
+%endif
+%if %2 == 8
+    packuswb          %3, %4
+%else
+    CLIPW             %3, [pb_0], [max_pixels_%2]
+%if (%1 > 8 && notcpuflag(avx)) || %1 > 16
+    CLIPW             %4, [pb_0], [max_pixels_%2]
+%endif
+%endif
+%endmacro
+
+
+; ******************************
+; void %1_put_pixels(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride,
+;                         int height, const int8_t *hf, const int8_t *vf, int width)
+; ******************************
+
+%macro PUT_PIXELS 3
+    MC_PIXELS       %1, %2, %3
+    MC_UNI_PIXELS   %1, %2, %3
+%endmacro
+
+%macro MC_PIXELS 3
+cglobal %1_put_pixels%2_%3, 4, 4, 3, dst, src, srcstride, height
+    pxor              m2, m2
+.loop:
+    SIMPLE_LOAD       %2, %3, srcq, m0
+    MC_PIXEL_COMPUTE  %2, %3, 1
+    PEL_10STORE%2     dstq, m0, m1
+    LOOP_END         dst, src, srcstride
+    RET
+%endmacro
+
+%macro MC_UNI_PIXELS 3
+cglobal %1_put_uni_pixels%2_%3, 5, 5, 2, dst, dststride, src, srcstride, height
+.loop:
+    SIMPLE_LOAD       %2, %3, srcq, m0
+    PEL_%3STORE%2   dstq, m0, m1
+    add             dstq, dststrideq             ; dst += dststride
+    add             srcq, srcstrideq             ; src += srcstride
+    dec          heightd                         ; cmp height
+    jnz               .loop                      ; height loop
+    RET
+%endmacro
+
+%macro PUT_4TAP 3
+%if cpuflag(avx2)
+%define XMM_REGS  11
+%else
+%define XMM_REGS  8
+%endif
+
+; ******************************
+; void %1_put_4tap_hX(int16_t *dst,
+;      const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width);
+; ******************************
+cglobal %1_put_4tap_h%2_%3, 5, 5, XMM_REGS, dst, src, srcstride, height, hf
+%assign %%stride ((%3 + 7)/8)
+    MC_4TAP_FILTER       %3, hf, m4, m5
+.loop:
+    MC_4TAP_LOAD         %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE      %3, %2, m4, m5, 1
+    PEL_10STORE%2      dstq, m0, m1
+    LOOP_END            dst, src, srcstride
+    RET
+
+; ******************************
+; void %1_put_uni_4tap_hX(uint8_t *dst, ptrdiff_t dststride,
+;      const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width);
+; ******************************
+cglobal %1_put_uni_4tap_h%2_%3, 6, 7, XMM_REGS, dst, dststride, src, srcstride, height, hf
+%assign %%stride ((%3 + 7)/8)
+    movdqa            m6, [scale_%3]
+    MC_4TAP_FILTER    %3, hf, m4, m5
+.loop:
+    MC_4TAP_LOAD      %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE   %3, %2, m4, m5
+    UNI_COMPUTE       %2, %3, m0, m1, m6
+    PEL_%3STORE%2   dstq, m0, m1
+    add             dstq, dststrideq             ; dst += dststride
+    add             srcq, srcstrideq             ; src += srcstride
+    dec          heightd                         ; cmp height
+    jnz               .loop                      ; height loop
+    RET
+
+; ******************************
+; void %1_put_4tap_v(int16_t *dst,
+;      const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width)
+; ******************************
+cglobal %1_put_4tap_v%2_%3, 6, 6, XMM_REGS, dst, src, srcstride, height, r3src, vf
+    sub             srcq, srcstrideq
+    MC_4TAP_FILTER    %3, vf, m4, m5
+    lea           r3srcq, [srcstrideq*3]
+.loop:
+    MC_4TAP_LOAD      %3, srcq, srcstride, %2
+    MC_4TAP_COMPUTE   %3, %2, m4, m5, 1
+    PEL_10STORE%2     dstq, m0, m1
+    LOOP_END          dst, src, srcstride
+    RET
+
+; ******************************
+; void %1_put_uni_4tap_vX(uint8_t *dst, ptrdiff_t dststride,
+;      const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width);
+; ******************************
+cglobal %1_put_uni_4tap_v%2_%3, 7, 7, XMM_REGS, dst, dststride, src, srcstride, height, r3src, vf
+    movdqa            m6, [scale_%3]
+    sub             srcq, srcstrideq
+    MC_4TAP_FILTER       %3, vf, m4, m5
+    lea           r3srcq, [srcstrideq*3]
+.loop:
+    MC_4TAP_LOAD      %3, srcq, srcstride, %2
+    MC_4TAP_COMPUTE   %3, %2, m4, m5
+    UNI_COMPUTE       %2, %3, m0, m1, m6
+    PEL_%3STORE%2   dstq, m0, m1
+    add             dstq, dststrideq             ; dst += dststride
+    add             srcq, srcstrideq             ; src += srcstride
+    dec          heightd                         ; cmp height
+    jnz               .loop                      ; height loop
+    RET
+%endmacro
+
+%macro PUT_4TAP_HV 3
+; ******************************
+; void put_4tap_hv(int16_t *dst,
+;      const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width)
+; ******************************
+cglobal %1_put_4tap_hv%2_%3, 6, 7, 16 , dst, src, srcstride, height, hf, vf, r3src
+%assign %%stride ((%3 + 7)/8)
+    sub                 srcq, srcstrideq
+    MC_4TAP_HV_FILTER    %3
+    MC_4TAP_LOAD         %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE      %3, %2, m14, m15
+%if (%2 > 8 && (%3 == 8))
+    SWAP              m8, m1
+%endif
+    SWAP              m4, m0
+    add             srcq, srcstrideq
+    MC_4TAP_LOAD         %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE      %3, %2, m14, m15
+%if (%2 > 8 && (%3 == 8))
+    SWAP              m9, m1
+%endif
+    SWAP              m5, m0
+    add             srcq, srcstrideq
+    MC_4TAP_LOAD         %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE      %3, %2, m14, m15
+%if (%2 > 8 && (%3 == 8))
+    SWAP             m10, m1
+%endif
+    SWAP              m6, m0
+    add             srcq, srcstrideq
+.loop:
+    MC_4TAP_LOAD         %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE      %3, %2, m14, m15
+%if (%2 > 8 && (%3 == 8))
+    SWAP             m11, m1
+%endif
+    SWAP              m7, m0
+    punpcklwd         m0, m4, m5
+    punpcklwd         m2, m6, m7
+%if %2 > 4
+    punpckhwd         m1, m4, m5
+    punpckhwd         m3, m6, m7
+%endif
+    MC_4TAP_COMPUTE      14, %2, m12, m13
+%if (%2 > 8 && (%3 == 8))
+    punpcklwd         m4, m8, m9
+    punpcklwd         m2, m10, m11
+    punpckhwd         m8, m8, m9
+    punpckhwd         m3, m10, m11
+    MC_4TAP_COMPUTE      14, %2, m12, m13, m4, m2, m8, m3
+%if cpuflag(avx2)
+    vinserti128       m2, m0, xm4, 1
+    vperm2i128        m3, m0, m4, q0301
+    PEL_10STORE%2     dstq, m2, m3
+%else
+    PEL_10STORE%2     dstq, m0, m4
+%endif
+%else
+    PEL_10STORE%2     dstq, m0, m1
+%endif
+    movdqa            m4, m5
+    movdqa            m5, m6
+    movdqa            m6, m7
+%if (%2 > 8 && (%3 == 8))
+    mova              m8, m9
+    mova              m9, m10
+    mova             m10, m11
+%endif
+    LOOP_END         dst, src, srcstride
+    RET
+
+cglobal %1_put_uni_4tap_hv%2_%3, 7, 8, 16 , dst, dststride, src, srcstride, height, hf, vf, r3src
+%assign %%stride ((%3 + 7)/8)
+    sub                srcq, srcstrideq
+    MC_4TAP_HV_FILTER    %3
+    MC_4TAP_LOAD         %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE      %3, %2, m14, m15
+%if (%2 > 8 && (%3 == 8))
+    SWAP                 m8, m1
+%endif
+    SWAP                 m4, m0
+    add                srcq, srcstrideq
+    MC_4TAP_LOAD         %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE      %3, %2, m14, m15
+%if (%2 > 8 && (%3 == 8))
+    SWAP                 m9, m1
+%endif
+    SWAP                 m5, m0
+    add                srcq, srcstrideq
+    MC_4TAP_LOAD         %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE      %3, %2, m14, m15
+%if (%2 > 8 && (%3 == 8))
+    SWAP                m10, m1
+%endif
+    SWAP                 m6, m0
+    add                srcq, srcstrideq
+.loop:
+    MC_4TAP_LOAD         %3, srcq-%%stride, %%stride, %2
+    MC_4TAP_COMPUTE      %3, %2, m14, m15
+%if (%2 > 8 && (%3 == 8))
+    SWAP                m11, m1
+%endif
+    mova                 m7, m0
+    punpcklwd            m0, m4, m5
+    punpcklwd            m2, m6, m7
+%if %2 > 4
+    punpckhwd            m1, m4, m5
+    punpckhwd            m3, m6, m7
+%endif
+    MC_4TAP_COMPUTE      14, %2, m12, m13
+%if (%2 > 8 && (%3 == 8))
+    punpcklwd            m4, m8, m9
+    punpcklwd            m2, m10, m11
+    punpckhwd            m8, m8, m9
+    punpckhwd            m3, m10, m11
+    MC_4TAP_COMPUTE      14, %2, m12, m13, m4, m2, m8, m3
+    UNI_COMPUTE          %2, %3, m0, m4, [scale_%3]
+%else
+    UNI_COMPUTE          %2, %3, m0, m1, [scale_%3]
+%endif
+    PEL_%3STORE%2      dstq, m0, m1
+    mova                 m4, m5
+    mova                 m5, m6
+    mova                 m6, m7
+%if (%2 > 8 && (%3 == 8))
+    mova                 m8, m9
+    mova                 m9, m10
+    mova                m10, m11
+%endif
+    add                dstq, dststrideq             ; dst += dststride
+    add                srcq, srcstrideq             ; src += srcstride
+    dec             heightd                         ; cmp height
+    jnz               .loop                         ; height loop
+    RET
+%endmacro
+
+; ******************************
+; void put_8tap_hX_X_X(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride,
+;                       int height, const int8_t *hf, const int8_t *vf, int width)
+; ******************************
+
+%macro PUT_8TAP 3
+cglobal %1_put_8tap_h%2_%3, 5, 5, 16, dst, src, srcstride, height, hf
+    MC_8TAP_FILTER          %3, hf
+.loop:
+    MC_8TAP_H_LOAD          %3, srcq, %2, 10
+    MC_8TAP_COMPUTE         %2, %3, 1
+%if %3 > 8
+    packssdw                m0, m1
+%endif
+    PEL_10STORE%2         dstq, m0, m1
+    LOOP_END               dst, src, srcstride
+    RET
+
+; ******************************
+; void put_uni_8tap_hX_X_X(uint8_t *dst, ptrdiff_t dststride, const uint8_t *_src, ptrdiff_t srcstride,
+;                       int height, const int8_t *hf, const int8_t *vf, int width)
+; ******************************
+cglobal %1_put_uni_8tap_h%2_%3, 6, 7, 16 , dst, dststride, src, srcstride, height, hf
+    mova                 m9, [scale_%3]
+    MC_8TAP_FILTER       %3, hf
+.loop:
+    MC_8TAP_H_LOAD       %3, srcq, %2, 10
+    MC_8TAP_COMPUTE      %2, %3
+%if %3 > 8
+    packssdw             m0, m1
+%endif
+    UNI_COMPUTE          %2, %3, m0, m1, m9
+    PEL_%3STORE%2      dstq, m0, m1
+    add                dstq, dststrideq             ; dst += dststride
+    add                srcq, srcstrideq             ; src += srcstride
+    dec             heightd                         ; cmp height
+    jnz               .loop                         ; height loop
+    RET
+
+
+; ******************************
+; void put_8tap_vX_X_X(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride,
+;                      int height, const int8_t *hf, const int8_t *vf, int width)
+; ******************************
+cglobal %1_put_8tap_v%2_%3, 6, 8, 16, dst, src, srcstride, height, r3src, vf
+    MC_8TAP_FILTER        %3, vf
+    lea               r3srcq, [srcstrideq*3]
+.loop:
+    MC_8TAP_V_LOAD        %3, srcq, srcstride, %2, r7
+    MC_8TAP_COMPUTE       %2, %3, 1
+%if %3 > 8
+    packssdw              m0, m1
+%endif
+    PEL_10STORE%2       dstq, m0, m1
+    LOOP_END             dst, src, srcstride
+    RET
+
+; ******************************
+; void put_uni_8tap_vX_X_X(uint8_t *dst, ptrdiff_t dststride, const uint8_t *_src, ptrdiff_t srcstride,
+;                       int height, const int8_t *hf, const int8_t *vf, int width)
+; ******************************
+cglobal %1_put_uni_8tap_v%2_%3, 7, 9, 16, dst, dststride, src, srcstride, height, r3src, vf
+    MC_8TAP_FILTER    %3, vf
+    movdqa            m9, [scale_%3]
+    lea           r3srcq, [srcstrideq*3]
+.loop:
+    MC_8TAP_V_LOAD    %3, srcq, srcstride, %2, r8
+    MC_8TAP_COMPUTE   %2, %3
+%if %3 > 8
+    packssdw          m0, m1
+%endif
+    UNI_COMPUTE       %2, %3, m0, m1, m9
+    PEL_%3STORE%2   dstq, m0, m1
+    add             dstq, dststrideq             ; dst += dststride
+    add             srcq, srcstrideq             ; src += srcstride
+    dec          heightd                         ; cmp height
+    jnz               .loop                      ; height loop
+    RET
+
+%endmacro
+
+
+; ******************************
+; void put_8tap_hvX_X(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride,
+;                     int height, const int8_t *hf, const int8_t *vf, int width)
+; ******************************
+%macro PUT_8TAP_HV 3
+cglobal %1_put_8tap_hv%2_%3, 6, 7, 16, 0 - mmsize*16, dst, src, srcstride, height, hf, vf, r3src
+    MC_8TAP_FILTER           %3, hf, 0
+    lea                     hfq, [rsp]
+    MC_8TAP_FILTER           %3, vf, 8*mmsize
+    lea                     vfq, [rsp + 8*mmsize]
+
+    lea                  r3srcq, [srcstrideq*3]
+    sub                    srcq, r3srcq
+
+    MC_8TAP_H_LOAD           %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE       %2, %3, hf, ackssdw
+    SWAP                     m8, m0
+    add                    srcq, srcstrideq
+    MC_8TAP_H_LOAD           %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE       %2, %3, hf, ackssdw
+    SWAP                     m9, m0
+    add                    srcq, srcstrideq
+    MC_8TAP_H_LOAD           %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE       %2, %3, hf, ackssdw
+    SWAP                    m10, m0
+    add                    srcq, srcstrideq
+    MC_8TAP_H_LOAD           %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE       %2, %3, hf, ackssdw
+    SWAP                    m11, m0
+    add                    srcq, srcstrideq
+    MC_8TAP_H_LOAD           %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE       %2, %3, hf, ackssdw
+    SWAP                    m12, m0
+    add                    srcq, srcstrideq
+    MC_8TAP_H_LOAD           %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE       %2, %3, hf, ackssdw
+    SWAP                    m13, m0
+    add                    srcq, srcstrideq
+    MC_8TAP_H_LOAD           %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE       %2, %3, hf, ackssdw
+    SWAP                    m14, m0
+    add                    srcq, srcstrideq
+.loop:
+    MC_8TAP_H_LOAD           %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE       %2, %3, hf, ackssdw
+    SWAP                    m15, m0
+    punpcklwd                m0, m8, m9
+    punpcklwd                m2, m10, m11
+    punpcklwd                m4, m12, m13
+    punpcklwd                m6, m14, m15
+%if %2 > 4
+    punpckhwd                m1, m8, m9
+    punpckhwd                m3, m10, m11
+    punpckhwd                m5, m12, m13
+    punpckhwd                m7, m14, m15
+%endif
+%if %2 <= 4
+    movq                     m8, m9
+    movq                     m9, m10
+    movq                    m10, m11
+    movq                    m11, m12
+    movq                    m12, m13
+    movq                    m13, m14
+    movq                    m14, m15
+%else
+    movdqa                   m8, m9
+    movdqa                   m9, m10
+    movdqa                  m10, m11
+    movdqa                  m11, m12
+    movdqa                  m12, m13
+    movdqa                  m13, m14
+    movdqa                  m14, m15
+%endif
+    MC_8TAP_HV_COMPUTE       %2, 14, vf, ackssdw
+    PEL_10STORE%2          dstq, m0, m1
+
+    LOOP_END                dst, src, srcstride
+    RET
+
+
+cglobal %1_put_uni_8tap_hv%2_%3, 7, 9, 16, 0 - 16*mmsize, dst, dststride, src, srcstride, height, hf, vf, r3src
+    MC_8TAP_FILTER           %3, hf, 0
+    lea                     hfq, [rsp]
+    MC_8TAP_FILTER           %3, vf, 8*mmsize
+    lea                     vfq, [rsp + 8*mmsize]
+    lea           r3srcq, [srcstrideq*3]
+    sub             srcq, r3srcq
+    MC_8TAP_H_LOAD       %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE   %2, %3, hf, ackssdw
+    SWAP              m8, m0
+    add             srcq, srcstrideq
+    MC_8TAP_H_LOAD       %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE   %2, %3, hf, ackssdw
+    SWAP              m9, m0
+    add             srcq, srcstrideq
+    MC_8TAP_H_LOAD       %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE   %2, %3, hf, ackssdw
+    SWAP             m10, m0
+    add             srcq, srcstrideq
+    MC_8TAP_H_LOAD       %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE   %2, %3, hf, ackssdw
+    SWAP             m11, m0
+    add             srcq, srcstrideq
+    MC_8TAP_H_LOAD       %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE   %2, %3, hf, ackssdw
+    SWAP             m12, m0
+    add             srcq, srcstrideq
+    MC_8TAP_H_LOAD       %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE   %2, %3, hf, ackssdw
+    SWAP             m13, m0
+    add             srcq, srcstrideq
+    MC_8TAP_H_LOAD       %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE   %2, %3, hf, ackssdw
+    SWAP             m14, m0
+    add             srcq, srcstrideq
+.loop:
+    MC_8TAP_H_LOAD       %3, srcq, %2, 15
+    MC_8TAP_HV_COMPUTE   %2, %3, hf, ackssdw
+    SWAP             m15, m0
+    punpcklwd         m0, m8, m9
+    punpcklwd         m2, m10, m11
+    punpcklwd         m4, m12, m13
+    punpcklwd         m6, m14, m15
+%if %2 > 4
+    punpckhwd         m1, m8, m9
+    punpckhwd         m3, m10, m11
+    punpckhwd         m5, m12, m13
+    punpckhwd         m7, m14, m15
+%endif
+    MC_8TAP_HV_COMPUTE   %2, 14, vf, ackusdw
+    UNI_COMPUTE       %2, %3, m0, m1, [scale_%3]
+    PEL_%3STORE%2   dstq, m0, m1
+
+%if %2 <= 4
+    movq              m8, m9
+    movq              m9, m10
+    movq             m10, m11
+    movq             m11, m12
+    movq             m12, m13
+    movq             m13, m14
+    movq             m14, m15
+%else
+    mova            m8, m9
+    mova            m9, m10
+    mova           m10, m11
+    mova           m11, m12
+    mova           m12, m13
+    mova           m13, m14
+    mova           m14, m15
+%endif
+    add             dstq, dststrideq             ; dst += dststride
+    add             srcq, srcstrideq             ; src += srcstride
+    dec          heightd                         ; height--
+    jnz               .loop                      ; height loop
+    RET
+
+%endmacro
+
+%macro H2656PUT_PIXELS 2
+    PUT_PIXELS h2656, %1, %2
+%endmacro
+
+%macro H2656PUT_4TAP 2
+    PUT_4TAP h2656, %1, %2
+%endmacro
+
+%macro H2656PUT_4TAP_HV 2
+    PUT_4TAP_HV h2656, %1, %2
+%endmacro
+
+%macro H2656PUT_8TAP 2
+    PUT_8TAP h2656, %1, %2
+%endmacro
+
+%macro H2656PUT_8TAP_HV 2
+    PUT_8TAP_HV h2656, %1, %2
+%endmacro
+
+%if ARCH_X86_64
+
+INIT_XMM sse4
+H2656PUT_PIXELS  2, 8
+H2656PUT_PIXELS  4, 8
+H2656PUT_PIXELS  6, 8
+H2656PUT_PIXELS  8, 8
+H2656PUT_PIXELS 12, 8
+H2656PUT_PIXELS 16, 8
+
+H2656PUT_PIXELS 2, 10
+H2656PUT_PIXELS 4, 10
+H2656PUT_PIXELS 6, 10
+H2656PUT_PIXELS 8, 10
+
+H2656PUT_PIXELS 2, 12
+H2656PUT_PIXELS 4, 12
+H2656PUT_PIXELS 6, 12
+H2656PUT_PIXELS 8, 12
+
+H2656PUT_4TAP 2,  8
+H2656PUT_4TAP 4,  8
+H2656PUT_4TAP 6,  8
+H2656PUT_4TAP 8,  8
+
+H2656PUT_4TAP 12,  8
+H2656PUT_4TAP 16, 8
+
+H2656PUT_4TAP 2, 10
+H2656PUT_4TAP 4, 10
+H2656PUT_4TAP 6, 10
+H2656PUT_4TAP 8, 10
+
+H2656PUT_4TAP 2, 12
+H2656PUT_4TAP 4, 12
+H2656PUT_4TAP 6, 12
+H2656PUT_4TAP 8, 12
+
+H2656PUT_4TAP_HV 2,  8
+H2656PUT_4TAP_HV 4,  8
+H2656PUT_4TAP_HV 6,  8
+H2656PUT_4TAP_HV 8,  8
+H2656PUT_4TAP_HV 16, 8
+
+H2656PUT_4TAP_HV 2, 10
+H2656PUT_4TAP_HV 4, 10
+H2656PUT_4TAP_HV 6, 10
+H2656PUT_4TAP_HV 8, 10
+
+H2656PUT_4TAP_HV 2, 12
+H2656PUT_4TAP_HV 4, 12
+H2656PUT_4TAP_HV 6, 12
+H2656PUT_4TAP_HV 8, 12
+
+H2656PUT_8TAP  4,  8
+H2656PUT_8TAP  8,  8
+H2656PUT_8TAP 12, 8
+H2656PUT_8TAP 16, 8
+
+H2656PUT_8TAP 4, 10
+H2656PUT_8TAP 8, 10
+
+H2656PUT_8TAP 4, 12
+H2656PUT_8TAP 8, 12
+
+H2656PUT_8TAP_HV 4, 8
+H2656PUT_8TAP_HV 8, 8
+
+H2656PUT_8TAP_HV 4, 10
+H2656PUT_8TAP_HV 8, 10
+
+H2656PUT_8TAP_HV 4, 12
+H2656PUT_8TAP_HV 8, 12
+
+%if HAVE_AVX2_EXTERNAL
+INIT_YMM avx2
+
+H2656PUT_PIXELS  32, 8
+H2656PUT_PIXELS  16, 10
+H2656PUT_PIXELS  16, 12
+
+H2656PUT_8TAP 32,  8
+H2656PUT_8TAP 16, 10
+H2656PUT_8TAP 16, 12
+
+H2656PUT_8TAP_HV 32, 8
+H2656PUT_8TAP_HV 16, 10
+H2656PUT_8TAP_HV 16, 12
+
+H2656PUT_4TAP 32,  8
+H2656PUT_4TAP 16, 10
+H2656PUT_4TAP 16, 12
+
+H2656PUT_4TAP_HV 32, 8
+H2656PUT_4TAP_HV 16, 10
+H2656PUT_4TAP_HV 16, 12
+
+%endif
+
+%endif
diff --git a/libavcodec/x86/h26x/h2656dsp.c b/libavcodec/x86/h26x/h2656dsp.c
new file mode 100644
index 0000000000..27769f9c55
--- /dev/null
+++ b/libavcodec/x86/h26x/h2656dsp.c
@@ -0,0 +1,98 @@
+/*
+ * DSP for HEVC/VVC
+ *
+ * Copyright (C) 2022-2024 Nuo Mi
+ * Copyright (c) 2023-2024 Wu Jianhua
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "h2656dsp.h"
+
+#define mc_rep_func(name, bitd, step, W, opt) \
+void ff_h2656_put_##name##W##_##bitd##_##opt(int16_t *_dst,                                                     \
+    const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width)       \
+{                                                                                                               \
+    int i;                                                                                                      \
+    int16_t *dst;                                                                                               \
+    for (i = 0; i < W; i += step) {                                                                             \
+        const uint8_t *src  = _src + (i * ((bitd + 7) / 8));                                                    \
+        dst = _dst + i;                                                                                         \
+        ff_h2656_put_##name##step##_##bitd##_##opt(dst, src, _srcstride, height, hf, vf, width);                \
+    }                                                                                                           \
+}
+
+#define mc_rep_uni_func(name, bitd, step, W, opt) \
+void ff_h2656_put_uni_##name##W##_##bitd##_##opt(uint8_t *_dst, ptrdiff_t dststride,                            \
+    const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width)       \
+{                                                                                                               \
+    int i;                                                                                                      \
+    uint8_t *dst;                                                                                               \
+    for (i = 0; i < W; i += step) {                                                                             \
+        const uint8_t *src = _src + (i * ((bitd + 7) / 8));                                                     \
+        dst = _dst + (i * ((bitd + 7) / 8));                                                                    \
+        ff_h2656_put_uni_##name##step##_##bitd##_##opt(dst, dststride, src, _srcstride,                         \
+                                                          height, hf, vf, width);                               \
+    }                                                                                                           \
+}
+
+#define mc_rep_funcs(name, bitd, step, W, opt)      \
+    mc_rep_func(name, bitd, step, W, opt)           \
+    mc_rep_uni_func(name, bitd, step, W, opt)
+
+#define MC_REP_FUNCS_SSE4(fname)                 \
+    mc_rep_funcs(fname,  8, 16,128, sse4)        \
+    mc_rep_funcs(fname,  8, 16, 64, sse4)        \
+    mc_rep_funcs(fname,  8, 16, 32, sse4)        \
+    mc_rep_funcs(fname, 10,  8,128, sse4)        \
+    mc_rep_funcs(fname, 10,  8, 64, sse4)        \
+    mc_rep_funcs(fname, 10,  8, 32, sse4)        \
+    mc_rep_funcs(fname, 10,  8, 16, sse4)        \
+    mc_rep_funcs(fname, 12,  8,128, sse4)        \
+    mc_rep_funcs(fname, 12,  8, 64, sse4)        \
+    mc_rep_funcs(fname, 12,  8, 32, sse4)        \
+    mc_rep_funcs(fname, 12,  8, 16, sse4)        \
+
+MC_REP_FUNCS_SSE4(pixels)
+MC_REP_FUNCS_SSE4(4tap_h)
+MC_REP_FUNCS_SSE4(4tap_v)
+MC_REP_FUNCS_SSE4(4tap_hv)
+MC_REP_FUNCS_SSE4(8tap_h)
+MC_REP_FUNCS_SSE4(8tap_v)
+MC_REP_FUNCS_SSE4(8tap_hv)
+mc_rep_funcs(8tap_hv, 8, 8, 16, sse4)
+
+#if HAVE_AVX2_EXTERNAL
+
+#define MC_REP_FUNCS_AVX2(fname)               \
+    mc_rep_funcs(fname, 8, 32, 64, avx2)       \
+    mc_rep_funcs(fname, 8, 32,128, avx2)       \
+    mc_rep_funcs(fname,10, 16, 32, avx2)       \
+    mc_rep_funcs(fname,10, 16, 64, avx2)       \
+    mc_rep_funcs(fname,10, 16,128, avx2)       \
+    mc_rep_funcs(fname,12, 16, 32, avx2)       \
+    mc_rep_funcs(fname,12, 16, 64, avx2)       \
+    mc_rep_funcs(fname,12, 16,128, avx2)       \
+
+MC_REP_FUNCS_AVX2(pixels)
+MC_REP_FUNCS_AVX2(8tap_h)
+MC_REP_FUNCS_AVX2(8tap_v)
+MC_REP_FUNCS_AVX2(8tap_hv)
+MC_REP_FUNCS_AVX2(4tap_h)
+MC_REP_FUNCS_AVX2(4tap_v)
+MC_REP_FUNCS_AVX2(4tap_hv)
+#endif
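Review note on the mc_rep_* wrappers above: they cover a W-wide block by calling the step-wide kernel once per column tile, advancing src by i * ((bitd + 7) / 8) bytes and the int16_t dst by i elements. The following is a minimal C sketch of that tiling only, not the patch's code. demo_put_step, demo_put_wide and DEMO_DST_PITCH are hypothetical names, the stand-in kernel merely widens pixels instead of filtering, and the intermediate row pitch of 128 elements is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed row pitch of the int16_t intermediate buffer, in elements. */
enum { DEMO_DST_PITCH = 128 };

/* Hypothetical stand-in for one fixed-width kernel: widens `step` pixels
 * per row into the intermediate buffer, with no filtering at all. */
static void demo_put_step(int16_t *dst, const uint8_t *src, ptrdiff_t srcstride,
                          int height, int step, int bitd)
{
    const int bpp = (bitd + 7) / 8;              /* bytes per source pixel */

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < step; x++) {
            uint16_t v;
            if (bpp == 1)
                v = src[x];                      /* 8-bit samples */
            else
                memcpy(&v, src + 2 * x, sizeof(v)); /* 16-bit containers */
            dst[x] = (int16_t)v;
        }
        dst += DEMO_DST_PITCH;
        src += srcstride;                        /* srcstride is in bytes */
    }
}

/* Same tiling as mc_rep_func: cover `width` columns with `step`-wide calls. */
static void demo_put_wide(int16_t *dst, const uint8_t *src, ptrdiff_t srcstride,
                          int height, int width, int step, int bitd)
{
    assert(width % step == 0);
    for (int i = 0; i < width; i += step)
        demo_put_step(dst + i, src + i * ((bitd + 7) / 8), srcstride,
                      height, step, bitd);
}
```

In the sketch, a 32-wide 8-bit block is handled as two 16-wide calls, and a 16-wide 10-bit block as two 8-wide calls whose source pointers are 16 bytes apart.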
diff --git a/libavcodec/x86/h26x/h2656dsp.h b/libavcodec/x86/h26x/h2656dsp.h
new file mode 100644
index 0000000000..8a2ab13607
--- /dev/null
+++ b/libavcodec/x86/h26x/h2656dsp.h
@@ -0,0 +1,103 @@
+/*
+ * DSP for HEVC/VVC
+ *
+ * Copyright (C) 2022-2024 Nuo Mi
+ * Copyright (c) 2023-2024 Wu Jianhua
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_X86_H26X_H2656DSP_H
+#define AVCODEC_X86_H26X_H2656DSP_H
+
+#include "config.h"
+#include "libavutil/x86/asm.h"
+#include "libavutil/x86/cpu.h"
+#include <stddef.h>
+
+#define H2656_PEL_PROTOTYPE(name, D, opt) \
+void ff_h2656_put_ ## name ## _ ## D ## _##opt(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width);                               \
+void ff_h2656_put_uni_ ## name ## _ ## D ## _##opt(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width);    \
+
+#define H2656_MC_8TAP_PROTOTYPES(fname, bitd, opt)    \
+    H2656_PEL_PROTOTYPE(fname##4,  bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##6,  bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##8,  bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##12,  bitd, opt);       \
+    H2656_PEL_PROTOTYPE(fname##16, bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##32, bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##64, bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##128, bitd, opt)
+
+H2656_MC_8TAP_PROTOTYPES(pixels  ,  8, sse4);
+H2656_MC_8TAP_PROTOTYPES(pixels  , 10, sse4);
+H2656_MC_8TAP_PROTOTYPES(pixels  , 12, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_h  ,  8, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_h  , 10, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_h  , 12, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_v  ,  8, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_v  , 10, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_v  , 12, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_hv ,  8, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_hv , 10, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_hv , 12, sse4);
+
+#define H2656_MC_4TAP_PROTOTYPES(fname, bitd, opt)    \
+    H2656_PEL_PROTOTYPE(fname##2,  bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##4,  bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##6,  bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##8,  bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##12, bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##16, bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##32, bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##64, bitd, opt);        \
+    H2656_PEL_PROTOTYPE(fname##128, bitd, opt)
+
+#define H2656_MC_4TAP_PROTOTYPES_SSE4(bitd)           \
+    H2656_PEL_PROTOTYPE(pixels2, bitd, sse4);         \
+    H2656_MC_4TAP_PROTOTYPES(4tap_h, bitd, sse4);     \
+    H2656_MC_4TAP_PROTOTYPES(4tap_v, bitd, sse4);     \
+    H2656_MC_4TAP_PROTOTYPES(4tap_hv, bitd, sse4);    \
+
+H2656_MC_4TAP_PROTOTYPES_SSE4(8)
+H2656_MC_4TAP_PROTOTYPES_SSE4(10)
+H2656_MC_4TAP_PROTOTYPES_SSE4(12)
+
+#define H2656_MC_8TAP_PROTOTYPES_AVX2(fname)              \
+    H2656_PEL_PROTOTYPE(fname##32 , 8, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##64 , 8, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##128, 8, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##16 ,10, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##32 ,10, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##64 ,10, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##128,10, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##16 ,12, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##32 ,12, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##64 ,12, avx2);             \
+    H2656_PEL_PROTOTYPE(fname##128,12, avx2)              \
+
+H2656_MC_8TAP_PROTOTYPES_AVX2(pixels);
+H2656_MC_8TAP_PROTOTYPES_AVX2(8tap_h);
+H2656_MC_8TAP_PROTOTYPES_AVX2(8tap_v);
+H2656_MC_8TAP_PROTOTYPES_AVX2(8tap_hv);
+H2656_PEL_PROTOTYPE(8tap_hv16, 8, avx2);
+
+H2656_MC_8TAP_PROTOTYPES_AVX2(4tap_h);
+H2656_MC_8TAP_PROTOTYPES_AVX2(4tap_v);
+H2656_MC_8TAP_PROTOTYPES_AVX2(4tap_hv);
+
+#endif
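Review note on the prototype macros above: every symbol name is assembled by token pasting, first appending the width to fname (fname##4 and so on), then joining name, bit depth and CPU suffix. The following is a small C sketch of the same pasting pattern, reduced to the symbol spelling. All DEMO_* macros and demo_* functions are hypothetical helpers for illustration and are not part of the patch. Names such as 8tap_h start with a digit, which still works because they are valid preprocessing numbers and the pasted result is an ordinary identifier.

```c
#include <string.h>

/* Two-level stringification so the argument is expanded before quoting. */
#define DEMO_STR_(x) #x
#define DEMO_STR(x)  DEMO_STR_(x)

/* Same pasting pattern as H2656_PEL_PROTOTYPE, reduced to the symbol name. */
#define DEMO_PUT_NAME(name, D, opt)     ff_h2656_put_ ## name ## _ ## D ## _ ## opt
#define DEMO_PUT_UNI_NAME(name, D, opt) ff_h2656_put_uni_ ## name ## _ ## D ## _ ## opt

/* Width suffix appended one level up, as in H2656_MC_8TAP_PROTOTYPES. */
#define DEMO_PUT_NAME_W4(fname, D, opt) DEMO_PUT_NAME(fname##4, D, opt)

static const char *demo_name_pixels(void)
{
    return DEMO_STR(DEMO_PUT_NAME(pixels4, 8, sse4));
}

static const char *demo_name_uni(void)
{
    return DEMO_STR(DEMO_PUT_UNI_NAME(pixels4, 8, sse4));
}

static const char *demo_name_8tap(void)
{
    return DEMO_STR(DEMO_PUT_NAME_W4(8tap_h, 10, avx2));
}

/* Returns 1 when the generated spellings match the expected symbols. */
static int demo_names_ok(void)
{
    return !strcmp(demo_name_pixels(), "ff_h2656_put_pixels4_8_sse4") &&
           !strcmp(demo_name_uni(),    "ff_h2656_put_uni_pixels4_8_sse4") &&
           !strcmp(demo_name_8tap(),   "ff_h2656_put_8tap_h4_10_avx2");
}
```

Checking the spelled-out names this way can help when matching a C prototype against the cglobal name produced on the assembly side.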
diff --git a/libavcodec/x86/hevc_mc.asm b/libavcodec/x86/hevc_mc.asm
index eb267453fe..5489701e44 100644
--- a/libavcodec/x86/hevc_mc.asm
+++ b/libavcodec/x86/hevc_mc.asm
@@ -715,35 +715,6 @@ SECTION .text
 ;                         int height, int mx, int my)
 ; ******************************
 
-%macro HEVC_PUT_HEVC_PEL_PIXELS 2
-HEVC_PEL_PIXELS     %1, %2
-HEVC_UNI_PEL_PIXELS %1, %2
-HEVC_BI_PEL_PIXELS  %1, %2
-%endmacro
-
-%macro HEVC_PEL_PIXELS 2
-cglobal hevc_put_hevc_pel_pixels%1_%2, 4, 4, 3, dst, src, srcstride,height
-    pxor               m2, m2
-.loop:
-    SIMPLE_LOAD       %1, %2, srcq, m0
-    MC_PIXEL_COMPUTE  %1, %2, 1
-    PEL_10STORE%1     dstq, m0, m1
-    LOOP_END         dst, src, srcstride
-    RET
- %endmacro
-
-%macro HEVC_UNI_PEL_PIXELS 2
-cglobal hevc_put_hevc_uni_pel_pixels%1_%2, 5, 5, 2, dst, dststride, src, srcstride,height
-.loop:
-    SIMPLE_LOAD       %1, %2, srcq, m0
-    PEL_%2STORE%1   dstq, m0, m1
-    add             dstq, dststrideq             ; dst += dststride
-    add             srcq, srcstrideq             ; src += srcstride
-    dec          heightd                         ; cmp height
-    jnz               .loop                      ; height loop
-    RET
-%endmacro
-
 %macro HEVC_BI_PEL_PIXELS 2
 cglobal hevc_put_hevc_bi_pel_pixels%1_%2, 6, 6, 6, dst, dststride, src, srcstride, src2, height
     pxor              m2, m2
@@ -777,32 +748,8 @@ cglobal hevc_put_hevc_bi_pel_pixels%1_%2, 6, 6, 6, dst, dststride, src, srcstrid
 %define XMM_REGS  8
 %endif
 
-cglobal hevc_put_hevc_epel_h%1_%2, 5, 6, XMM_REGS, dst, src, srcstride, height, mx, rfilter
-%assign %%stride ((%2 + 7)/8)
-    EPEL_FILTER       %2, mx, m4, m5, rfilter
-.loop:
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m4, m5, 1
-    PEL_10STORE%1      dstq, m0, m1
-    LOOP_END         dst, src, srcstride
-    RET
-
-cglobal hevc_put_hevc_uni_epel_h%1_%2, 6, 7, XMM_REGS, dst, dststride, src, srcstride, height, mx, rfilter
-%assign %%stride ((%2 + 7)/8)
-    movdqa            m6, [pw_%2]
-    EPEL_FILTER       %2, mx, m4, m5, rfilter
-.loop:
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m4, m5
-    UNI_COMPUTE       %1, %2, m0, m1, m6
-    PEL_%2STORE%1   dstq, m0, m1
-    add             dstq, dststrideq             ; dst += dststride
-    add             srcq, srcstrideq             ; src += srcstride
-    dec          heightd                         ; cmp height
-    jnz               .loop                      ; height loop
-    RET
-
 cglobal hevc_put_hevc_bi_epel_h%1_%2, 7, 8, XMM_REGS, dst, dststride, src, srcstride, src2, height, mx, rfilter
+%assign %%stride ((%2 + 7)/8)
     movdqa            m6, [pw_bi_%2]
     EPEL_FILTER       %2, mx, m4, m5, rfilter
 .loop:
@@ -824,36 +771,6 @@ cglobal hevc_put_hevc_bi_epel_h%1_%2, 7, 8, XMM_REGS, dst, dststride, src, srcst
 ;                      int height, int mx, int my, int width)
 ; ******************************
 
-cglobal hevc_put_hevc_epel_v%1_%2, 4, 6, XMM_REGS, dst, src, srcstride, height, r3src, my
-    movifnidn        myd, mym
-    sub             srcq, srcstrideq
-    EPEL_FILTER       %2, my, m4, m5, r3src
-    lea           r3srcq, [srcstrideq*3]
-.loop:
-    EPEL_LOAD         %2, srcq, srcstride, %1
-    EPEL_COMPUTE      %2, %1, m4, m5, 1
-    PEL_10STORE%1     dstq, m0, m1
-    LOOP_END          dst, src, srcstride
-    RET
-
-cglobal hevc_put_hevc_uni_epel_v%1_%2, 5, 7, XMM_REGS, dst, dststride, src, srcstride, height, r3src, my
-    movifnidn        myd, mym
-    movdqa            m6, [pw_%2]
-    sub             srcq, srcstrideq
-    EPEL_FILTER       %2, my, m4, m5, r3src
-    lea           r3srcq, [srcstrideq*3]
-.loop:
-    EPEL_LOAD         %2, srcq, srcstride, %1
-    EPEL_COMPUTE      %2, %1, m4, m5
-    UNI_COMPUTE       %1, %2, m0, m1, m6
-    PEL_%2STORE%1   dstq, m0, m1
-    add             dstq, dststrideq             ; dst += dststride
-    add             srcq, srcstrideq             ; src += srcstride
-    dec          heightd                         ; cmp height
-    jnz               .loop                      ; height loop
-    RET
-
-
 cglobal hevc_put_hevc_bi_epel_v%1_%2, 6, 8, XMM_REGS, dst, dststride, src, srcstride, src2, height, r3src, my
     movifnidn        myd, mym
     movdqa            m6, [pw_bi_%2]
@@ -882,135 +799,6 @@ cglobal hevc_put_hevc_bi_epel_v%1_%2, 6, 8, XMM_REGS, dst, dststride, src, srcst
 ; ******************************
 
 %macro HEVC_PUT_HEVC_EPEL_HV 2
-cglobal hevc_put_hevc_epel_hv%1_%2, 6, 7, 16 , dst, src, srcstride, height, mx, my, r3src
-%assign %%stride ((%2 + 7)/8)
-    sub             srcq, srcstrideq
-    EPEL_HV_FILTER    %2
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m14, m15
-%if (%1 > 8 && (%2 == 8))
-    SWAP              m8, m1
-%endif
-    SWAP              m4, m0
-    add             srcq, srcstrideq
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m14, m15
-%if (%1 > 8 && (%2 == 8))
-    SWAP              m9, m1
-%endif
-    SWAP              m5, m0
-    add             srcq, srcstrideq
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m14, m15
-%if (%1 > 8 && (%2 == 8))
-    SWAP             m10, m1
-%endif
-    SWAP              m6, m0
-    add             srcq, srcstrideq
-.loop:
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m14, m15
-%if (%1 > 8 && (%2 == 8))
-    SWAP             m11, m1
-%endif
-    SWAP              m7, m0
-    punpcklwd         m0, m4, m5
-    punpcklwd         m2, m6, m7
-%if %1 > 4
-    punpckhwd         m1, m4, m5
-    punpckhwd         m3, m6, m7
-%endif
-    EPEL_COMPUTE      14, %1, m12, m13
-%if (%1 > 8 && (%2 == 8))
-    punpcklwd         m4, m8, m9
-    punpcklwd         m2, m10, m11
-    punpckhwd         m8, m8, m9
-    punpckhwd         m3, m10, m11
-    EPEL_COMPUTE      14, %1, m12, m13, m4, m2, m8, m3
-%if cpuflag(avx2)
-    vinserti128       m2, m0, xm4, 1
-    vperm2i128        m3, m0, m4, q0301
-    PEL_10STORE%1     dstq, m2, m3
-%else
-    PEL_10STORE%1     dstq, m0, m4
-%endif
-%else
-    PEL_10STORE%1     dstq, m0, m1
-%endif
-    movdqa            m4, m5
-    movdqa            m5, m6
-    movdqa            m6, m7
-%if (%1 > 8 && (%2 == 8))
-    mova              m8, m9
-    mova              m9, m10
-    mova             m10, m11
-%endif
-    LOOP_END         dst, src, srcstride
-    RET
-
-cglobal hevc_put_hevc_uni_epel_hv%1_%2, 7, 8, 16 , dst, dststride, src, srcstride, height, mx, my, r3src
-%assign %%stride ((%2 + 7)/8)
-    sub             srcq, srcstrideq
-    EPEL_HV_FILTER    %2
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m14, m15
-%if (%1 > 8 && (%2 == 8))
-    SWAP              m8, m1
-%endif
-    SWAP              m4, m0
-    add             srcq, srcstrideq
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m14, m15
-%if (%1 > 8 && (%2 == 8))
-    SWAP              m9, m1
-%endif
-    SWAP              m5, m0
-    add             srcq, srcstrideq
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m14, m15
-%if (%1 > 8 && (%2 == 8))
-    SWAP             m10, m1
-%endif
-    SWAP              m6, m0
-    add             srcq, srcstrideq
-.loop:
-    EPEL_LOAD         %2, srcq-%%stride, %%stride, %1
-    EPEL_COMPUTE      %2, %1, m14, m15
-%if (%1 > 8 && (%2 == 8))
-    SWAP             m11, m1
-%endif
-    mova              m7, m0
-    punpcklwd         m0, m4, m5
-    punpcklwd         m2, m6, m7
-%if %1 > 4
-    punpckhwd         m1, m4, m5
-    punpckhwd         m3, m6, m7
-%endif
-    EPEL_COMPUTE      14, %1, m12, m13
-%if (%1 > 8 && (%2 == 8))
-    punpcklwd         m4, m8, m9
-    punpcklwd         m2, m10, m11
-    punpckhwd         m8, m8, m9
-    punpckhwd         m3, m10, m11
-    EPEL_COMPUTE      14, %1, m12, m13, m4, m2, m8, m3
-    UNI_COMPUTE       %1, %2, m0, m4, [pw_%2]
-%else
-    UNI_COMPUTE       %1, %2, m0, m1, [pw_%2]
-%endif
-    PEL_%2STORE%1   dstq, m0, m1
-    mova              m4, m5
-    mova              m5, m6
-    mova              m6, m7
-%if (%1 > 8 && (%2 == 8))
-    mova              m8, m9
-    mova              m9, m10
-    mova             m10, m11
-%endif
-    add             dstq, dststrideq             ; dst += dststride
-    add             srcq, srcstrideq             ; src += srcstride
-    dec          heightd                         ; cmp height
-    jnz               .loop                      ; height loop
-    RET
 
 cglobal hevc_put_hevc_bi_epel_hv%1_%2, 8, 9, 16, dst, dststride, src, srcstride, src2, height, mx, my, r3src
 %assign %%stride ((%2 + 7)/8)
@@ -1093,34 +881,6 @@ cglobal hevc_put_hevc_bi_epel_hv%1_%2, 8, 9, 16, dst, dststride, src, srcstride,
 ; ******************************
 
 %macro HEVC_PUT_HEVC_QPEL 2
-cglobal hevc_put_hevc_qpel_h%1_%2, 5, 6, 16, dst, src, srcstride, height, mx, rfilter
-    QPEL_FILTER       %2, mx
-.loop:
-    QPEL_H_LOAD       %2, srcq, %1, 10
-    QPEL_COMPUTE      %1, %2, 1
-%if %2 > 8
-    packssdw          m0, m1
-%endif
-    PEL_10STORE%1     dstq, m0, m1
-    LOOP_END          dst, src, srcstride
-    RET
-
-cglobal hevc_put_hevc_uni_qpel_h%1_%2, 6, 7, 16 , dst, dststride, src, srcstride, height, mx, rfilter
-    mova              m9, [pw_%2]
-    QPEL_FILTER       %2, mx
-.loop:
-    QPEL_H_LOAD       %2, srcq, %1, 10
-    QPEL_COMPUTE      %1, %2
-%if %2 > 8
-    packssdw          m0, m1
-%endif
-    UNI_COMPUTE       %1, %2, m0, m1, m9
-    PEL_%2STORE%1   dstq, m0, m1
-    add             dstq, dststrideq             ; dst += dststride
-    add             srcq, srcstrideq             ; src += srcstride
-    dec          heightd                         ; cmp height
-    jnz               .loop                      ; height loop
-    RET
 
 cglobal hevc_put_hevc_bi_qpel_h%1_%2, 7, 8, 16 , dst, dststride, src, srcstride, src2, height, mx, rfilter
     movdqa            m9, [pw_bi_%2]
@@ -1148,38 +908,6 @@ cglobal hevc_put_hevc_bi_qpel_h%1_%2, 7, 8, 16 , dst, dststride, src, srcstride,
 ;                       int height, int mx, int my, int width)
 ; ******************************
 
-cglobal hevc_put_hevc_qpel_v%1_%2, 4, 8, 16, dst, src, srcstride, height, r3src, my, rfilter
-    movifnidn        myd, mym
-    lea           r3srcq, [srcstrideq*3]
-    QPEL_FILTER       %2, my
-.loop:
-    QPEL_V_LOAD       %2, srcq, srcstride, %1, r7
-    QPEL_COMPUTE      %1, %2, 1
-%if %2 > 8
-    packssdw          m0, m1
-%endif
-    PEL_10STORE%1     dstq, m0, m1
-    LOOP_END         dst, src, srcstride
-    RET
-
-cglobal hevc_put_hevc_uni_qpel_v%1_%2, 5, 9, 16, dst, dststride, src, srcstride, height, r3src, my, rfilter
-    movifnidn        myd, mym
-    movdqa            m9, [pw_%2]
-    lea           r3srcq, [srcstrideq*3]
-    QPEL_FILTER       %2, my
-.loop:
-    QPEL_V_LOAD       %2, srcq, srcstride, %1, r8
-    QPEL_COMPUTE      %1, %2
-%if %2 > 8
-    packssdw          m0, m1
-%endif
-    UNI_COMPUTE       %1, %2, m0, m1, m9
-    PEL_%2STORE%1   dstq, m0, m1
-    add             dstq, dststrideq             ; dst += dststride
-    add             srcq, srcstrideq             ; src += srcstride
-    dec          heightd                         ; cmp height
-    jnz               .loop                      ; height loop
-    RET
 
 cglobal hevc_put_hevc_bi_qpel_v%1_%2, 6, 10, 16, dst, dststride, src, srcstride, src2, height, r3src, my, rfilter
     movifnidn        myd, mym
@@ -1210,162 +938,6 @@ cglobal hevc_put_hevc_bi_qpel_v%1_%2, 6, 10, 16, dst, dststride, src, srcstride,
 ;                       int height, int mx, int my)
 ; ******************************
 %macro HEVC_PUT_HEVC_QPEL_HV 2
-cglobal hevc_put_hevc_qpel_hv%1_%2, 6, 8, 16, dst, src, srcstride, height, mx, my, r3src, rfilter
-%if cpuflag(avx2)
-%assign %%shift  4
-%else
-%assign %%shift  3
-%endif
-    sub              mxq, 1
-    sub              myq, 1
-    shl              mxq, %%shift                ; multiply by 32
-    shl              myq, %%shift                ; multiply by 32
-    lea           r3srcq, [srcstrideq*3]
-    sub             srcq, r3srcq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP              m8, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP              m9, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m10, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m11, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m12, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m13, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m14, m0
-    add             srcq, srcstrideq
-.loop:
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m15, m0
-    punpcklwd         m0, m8, m9
-    punpcklwd         m2, m10, m11
-    punpcklwd         m4, m12, m13
-    punpcklwd         m6, m14, m15
-%if %1 > 4
-    punpckhwd         m1, m8, m9
-    punpckhwd         m3, m10, m11
-    punpckhwd         m5, m12, m13
-    punpckhwd         m7, m14, m15
-%endif
-    QPEL_HV_COMPUTE   %1, 14, my, ackssdw
-    PEL_10STORE%1     dstq, m0, m1
-%if %1 <= 4
-    movq              m8, m9
-    movq              m9, m10
-    movq             m10, m11
-    movq             m11, m12
-    movq             m12, m13
-    movq             m13, m14
-    movq             m14, m15
-%else
-    movdqa            m8, m9
-    movdqa            m9, m10
-    movdqa           m10, m11
-    movdqa           m11, m12
-    movdqa           m12, m13
-    movdqa           m13, m14
-    movdqa           m14, m15
-%endif
-    LOOP_END         dst, src, srcstride
-    RET
-
-cglobal hevc_put_hevc_uni_qpel_hv%1_%2, 7, 9, 16 , dst, dststride, src, srcstride, height, mx, my, r3src, rfilter
-%if cpuflag(avx2)
-%assign %%shift  4
-%else
-%assign %%shift  3
-%endif
-    sub              mxq, 1
-    sub              myq, 1
-    shl              mxq, %%shift                ; multiply by 32
-    shl              myq, %%shift                ; multiply by 32
-    lea           r3srcq, [srcstrideq*3]
-    sub             srcq, r3srcq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP              m8, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP              m9, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m10, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m11, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m12, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m13, m0
-    add             srcq, srcstrideq
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m14, m0
-    add             srcq, srcstrideq
-.loop:
-    QPEL_H_LOAD       %2, srcq, %1, 15
-    QPEL_HV_COMPUTE   %1, %2, mx, ackssdw
-    SWAP             m15, m0
-    punpcklwd         m0, m8, m9
-    punpcklwd         m2, m10, m11
-    punpcklwd         m4, m12, m13
-    punpcklwd         m6, m14, m15
-%if %1 > 4
-    punpckhwd         m1, m8, m9
-    punpckhwd         m3, m10, m11
-    punpckhwd         m5, m12, m13
-    punpckhwd         m7, m14, m15
-%endif
-    QPEL_HV_COMPUTE   %1, 14, my, ackusdw
-    UNI_COMPUTE       %1, %2, m0, m1, [pw_%2]
-    PEL_%2STORE%1   dstq, m0, m1
-
-%if %1 <= 4
-    movq              m8, m9
-    movq              m9, m10
-    movq             m10, m11
-    movq             m11, m12
-    movq             m12, m13
-    movq             m13, m14
-    movq             m14, m15
-%else
-    mova            m8, m9
-    mova            m9, m10
-    mova           m10, m11
-    mova           m11, m12
-    mova           m12, m13
-    mova           m13, m14
-    mova           m14, m15
-%endif
-    add             dstq, dststrideq             ; dst += dststride
-    add             srcq, srcstrideq             ; src += srcstride
-    dec          heightd                         ; cmp height
-    jnz               .loop                      ; height loop
-    RET
 
 cglobal hevc_put_hevc_bi_qpel_hv%1_%2, 8, 10, 16, dst, dststride, src, srcstride, src2, height, mx, my, r3src, rfilter
 %if cpuflag(avx2)
@@ -1613,22 +1185,22 @@ WEIGHTING_FUNCS 4, 12
 WEIGHTING_FUNCS 6, 12
 WEIGHTING_FUNCS 8, 12
 
-HEVC_PUT_HEVC_PEL_PIXELS  2, 8
-HEVC_PUT_HEVC_PEL_PIXELS  4, 8
-HEVC_PUT_HEVC_PEL_PIXELS  6, 8
-HEVC_PUT_HEVC_PEL_PIXELS  8, 8
-HEVC_PUT_HEVC_PEL_PIXELS 12, 8
-HEVC_PUT_HEVC_PEL_PIXELS 16, 8
+HEVC_BI_PEL_PIXELS  2, 8
+HEVC_BI_PEL_PIXELS  4, 8
+HEVC_BI_PEL_PIXELS  6, 8
+HEVC_BI_PEL_PIXELS  8, 8
+HEVC_BI_PEL_PIXELS 12, 8
+HEVC_BI_PEL_PIXELS 16, 8
 
-HEVC_PUT_HEVC_PEL_PIXELS 2, 10
-HEVC_PUT_HEVC_PEL_PIXELS 4, 10
-HEVC_PUT_HEVC_PEL_PIXELS 6, 10
-HEVC_PUT_HEVC_PEL_PIXELS 8, 10
+HEVC_BI_PEL_PIXELS 2, 10
+HEVC_BI_PEL_PIXELS 4, 10
+HEVC_BI_PEL_PIXELS 6, 10
+HEVC_BI_PEL_PIXELS 8, 10
 
-HEVC_PUT_HEVC_PEL_PIXELS 2, 12
-HEVC_PUT_HEVC_PEL_PIXELS 4, 12
-HEVC_PUT_HEVC_PEL_PIXELS 6, 12
-HEVC_PUT_HEVC_PEL_PIXELS 8, 12
+HEVC_BI_PEL_PIXELS 2, 12
+HEVC_BI_PEL_PIXELS 4, 12
+HEVC_BI_PEL_PIXELS 6, 12
+HEVC_BI_PEL_PIXELS 8, 12
 
 HEVC_PUT_HEVC_EPEL 2,  8
 HEVC_PUT_HEVC_EPEL 4,  8
@@ -1693,8 +1265,8 @@ HEVC_PUT_HEVC_QPEL_HV 8, 12
 %if HAVE_AVX2_EXTERNAL
 INIT_YMM avx2  ; adds ff_ and _avx2 to function name & enables 256b registers : m0 for 256b, xm0 for 128b. cpuflag(avx2) = 1 / notcpuflag(avx) = 0
 
-HEVC_PUT_HEVC_PEL_PIXELS 32, 8
-HEVC_PUT_HEVC_PEL_PIXELS 16, 10
+HEVC_BI_PEL_PIXELS 32, 8
+HEVC_BI_PEL_PIXELS 16, 10
 
 HEVC_PUT_HEVC_EPEL 32, 8
 HEVC_PUT_HEVC_EPEL 16, 10
diff --git a/libavcodec/x86/hevcdsp_init.c b/libavcodec/x86/hevcdsp_init.c
index 6f45e5e0db..5c19330e19 100644
--- a/libavcodec/x86/hevcdsp_init.c
+++ b/libavcodec/x86/hevcdsp_init.c
@@ -1,6 +1,7 @@
 /*
  * Copyright (c) 2013 Seppo Tomperi
- * Copyright (c) 2013 - 2014 Pierre-Edouard Lepere
+ * Copyright (c) 2013-2014 Pierre-Edouard Lepere
+ * Copyright (c) 2023-2024 Wu Jianhua
  *
  * This file is part of FFmpeg.
  *
@@ -27,6 +28,7 @@
 #include "libavutil/x86/cpu.h"
 #include "libavcodec/hevcdsp.h"
 #include "libavcodec/x86/hevcdsp.h"
+#include "libavcodec/x86/h26x/h2656dsp.h"
 
 #define LFC_FUNC(DIR, DEPTH, OPT) \
 void ff_hevc_ ## DIR ## _loop_filter_chroma_ ## DEPTH ## _ ## OPT(uint8_t *pix, ptrdiff_t stride, const int *tc, const uint8_t *no_p, const uint8_t *no_q);
@@ -83,6 +85,110 @@ void ff_hevc_idct_32x32_10_ ## opt(int16_t *coeffs, int col_limit);
 IDCT_FUNCS(sse2)
 IDCT_FUNCS(avx)
 
+
+#define ff_hevc_pel_filters ff_hevc_qpel_filters
+#define DECL_HV_FILTER(f)                                  \
+    const uint8_t *hf = ff_hevc_ ## f ## _filters[mx - 1]; \
+    const uint8_t *vf = ff_hevc_ ## f ## _filters[my - 1];
+
+#define FW_PUT(p, a, b, depth, opt) \
+void ff_hevc_put_hevc_ ## a ## _ ## depth ## _##opt(int16_t *dst, const uint8_t *src, ptrdiff_t srcstride,   \
+                                                    int height, intptr_t mx, intptr_t my,int width)          \
+{                                                                                                            \
+    DECL_HV_FILTER(p)                                                                                        \
+    ff_h2656_put_ ## b ## _ ## depth ## _##opt(dst, src, srcstride, height, hf, vf, width);                  \
+}
+
+#define FW_PUT_UNI(p, a, b, depth, opt) \
+void ff_hevc_put_hevc_uni_ ## a ## _ ## depth ## _##opt(uint8_t *dst, ptrdiff_t dststride,                   \
+                                                        const uint8_t *src, ptrdiff_t srcstride,             \
+                                                        int height, intptr_t mx, intptr_t my, int width)     \
+{                                                                                                            \
+    DECL_HV_FILTER(p)                                                                                        \
+    ff_h2656_put_uni_ ## b ## _ ## depth ## _##opt(dst, dststride, src, srcstride, height, hf, vf, width);   \
+}
+
+#if ARCH_X86_64 && HAVE_SSE4_EXTERNAL
+
+#define FW_PUT_FUNCS(p, a, b, depth, opt) \
+    FW_PUT(p, a, b, depth, opt) \
+    FW_PUT_UNI(p, a, b, depth, opt)
+
+#define FW_PEL(w, depth, opt) FW_PUT_FUNCS(pel, pel_pixels##w, pixels##w, depth, opt)
+
+#define FW_DIR(npel, n, w, depth, opt) \
+    FW_PUT_FUNCS(npel, npel ## _h##w,  n ## tap_h##w,  depth, opt) \
+    FW_PUT_FUNCS(npel, npel ## _v##w,  n ## tap_v##w,  depth, opt)
+
+#define FW_DIR_HV(npel, n, w, depth, opt) \
+    FW_PUT_FUNCS(npel, npel ## _hv##w,  n ## tap_hv##w,  depth, opt)
+
+FW_PEL(4,   8, sse4);
+FW_PEL(6,   8, sse4);
+FW_PEL(8,   8, sse4);
+FW_PEL(12,  8, sse4);
+FW_PEL(16,  8, sse4);
+FW_PEL(4,  10, sse4);
+FW_PEL(6,  10, sse4);
+FW_PEL(8,  10, sse4);
+FW_PEL(4,  12, sse4);
+FW_PEL(6,  12, sse4);
+FW_PEL(8,  12, sse4);
+
+#define FW_EPEL(w, depth, opt) FW_DIR(epel, 4, w, depth, opt)
+#define FW_EPEL_HV(w, depth, opt) FW_DIR_HV(epel, 4, w, depth, opt)
+#define FW_EPEL_FUNCS(w, depth, opt) \
+    FW_EPEL(w, depth, opt)           \
+    FW_EPEL_HV(w, depth, opt)
+
+FW_EPEL(12,  8, sse4);
+
+FW_EPEL_FUNCS(4,   8, sse4);
+FW_EPEL_FUNCS(6,   8, sse4);
+FW_EPEL_FUNCS(8,   8, sse4);
+FW_EPEL_FUNCS(16,  8, sse4);
+FW_EPEL_FUNCS(4,  10, sse4);
+FW_EPEL_FUNCS(6,  10, sse4);
+FW_EPEL_FUNCS(8,  10, sse4);
+FW_EPEL_FUNCS(4,  12, sse4);
+FW_EPEL_FUNCS(6,  12, sse4);
+FW_EPEL_FUNCS(8,  12, sse4);
+
+#define FW_QPEL(w, depth, opt) FW_DIR(qpel, 8, w, depth, opt)
+#define FW_QPEL_HV(w, depth, opt) FW_DIR_HV(qpel, 8, w, depth, opt)
+#define FW_QPEL_FUNCS(w, depth, opt) \
+    FW_QPEL(w, depth, opt)           \
+    FW_QPEL_HV(w, depth, opt)
+
+FW_QPEL(12, 8, sse4);
+FW_QPEL(16, 8, sse4);
+
+FW_QPEL_FUNCS(4,   8, sse4);
+FW_QPEL_FUNCS(8,   8, sse4);
+FW_QPEL_FUNCS(4,  10, sse4);
+FW_QPEL_FUNCS(8,  10, sse4);
+FW_QPEL_FUNCS(4,  12, sse4);
+FW_QPEL_FUNCS(8,  12, sse4);
+
+#if HAVE_AVX2_EXTERNAL
+
+FW_PEL(32,  8, avx2);
+FW_PUT(pel, pel_pixels16, pixels16, 10, avx2);
+
+FW_EPEL(32,  8, avx2);
+FW_EPEL(16, 10, avx2);
+
+FW_EPEL_HV(32,  8, avx2);
+FW_EPEL_HV(16, 10, avx2);
+
+FW_QPEL(32,  8, avx2);
+FW_QPEL(16, 10, avx2);
+
+FW_QPEL_HV(16, 10, avx2);
+
+#endif
+#endif
+
 #define mc_rep_func(name, bitd, step, W, opt) \
 void ff_hevc_put_hevc_##name##W##_##bitd##_##opt(int16_t *_dst,                                                 \
                                                  const uint8_t *_src, ptrdiff_t _srcstride, int height,         \

From patchwork Tue Jan 23 18:17:07 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wu Jianhua <toqsxw@outlook.com>
X-Patchwork-Id: 45747
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:120f:b0:199:de12:6fa6 with SMTP id v15csp801478pzf;
        Tue, 23 Jan 2024 10:17:48 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IGFMEFOp7QlbjlpmPEwG2KbCub4e0+8p5f4l93Xal45RgXxrlK4mTAMFtofMa4zSbVxkJtj
X-Received: by 2002:a05:6402:b57:b0:55a:a8ac:8cb9 with SMTP id
 bx23-20020a0564020b5700b0055aa8ac8cb9mr1916672edb.43.1706033868238;
        Tue, 23 Jan 2024 10:17:48 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 bl21-20020a056402211500b0055c54108811si1969990edb.222.2024.01.23.10.17.47;
        Tue, 23 Jan 2024 10:17:48 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@outlook.com
 header.s=selector1 header.b=OOTZX7bA;
       arc=fail (body hash mismatch);
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=outlook.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 746D268CE21;
	Tue, 23 Jan 2024 20:17:38 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from JPN01-OS0-obe.outbound.protection.outlook.com
 (mail-os0jpn01olkn2100.outbound.protection.outlook.com [40.92.98.100])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id B526B68CE21
 for <ffmpeg-devel@ffmpeg.org>; Tue, 23 Jan 2024 20:17:31 +0200 (EET)
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:173::5)
 by TYWP286MB3822.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:447::6) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.36; Tue, 23 Jan
 2024 18:17:23 +0000
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080]) by TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080%3]) with mapi id 15.20.7228.022; Tue, 23 Jan 2024
 18:17:23 +0000
From: toqsxw@outlook.com
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 24 Jan 2024 02:17:07 +0800
Message-ID: 
 <TYWP286MB2172B45C237A2232BD601FD7CA742@TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240123181711.402946-1-toqsxw@outlook.com>
References: <20240123181711.402946-1-toqsxw@outlook.com>
X-TMN: [IjskHIRF5+ikQ/aCTAiv3DuOanWvqgMu]
X-ClientProxiedBy: SI2P153CA0030.APCP153.PROD.OUTLOOK.COM
 (2603:1096:4:190::15) To TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 (2603:1096:400:173::5)
X-Microsoft-Original-Message-ID: <20240123181711.402946-4-toqsxw@outlook.com>
Subject: [FFmpeg-devel] [PATCH v4 4/8] avcodec/x86/h26x/h2656_inter: add
 dststride to put
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Wu Jianhua <toqsxw@outlook.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: UjrGRK3C/PSz

From: Wu Jianhua <toqsxw@outlook.com>

Signed-off-by: Wu Jianhua <toqsxw@outlook.com>
---
 libavcodec/x86/h26x/h2656_inter.asm | 32 ++++++++++++++---------------
 libavcodec/x86/h26x/h2656dsp.c      |  4 ++--
 libavcodec/x86/h26x/h2656dsp.h      |  2 +-
 libavcodec/x86/hevcdsp_init.c       |  2 +-
 4 files changed, 19 insertions(+), 21 deletions(-)
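
Note for reviewers: LOOP_END previously advanced dst by the hard-coded
2*MAX_PB_SIZE bytes per row. With this patch the put kernels take the
destination stride as an argument, and the HEVC wrappers pass
2 * MAX_PB_SIZE, so HEVC output should be unchanged. The scalar C sketch
below only illustrates that byte-stride convention. The names
put_pixels_sketch and put_pixels_hevc_sketch are hypothetical and not part
of the patch, and the shift by 6 assumes the 8-bit case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* HEVC intermediate buffer width, in int16_t samples */

/* Hypothetical scalar stand-in for the 8-bit put_pixels kernel.
 * dststride is in bytes, matching "add dstq, dststrideq" in LOOP_END. */
static void put_pixels_sketch(int16_t *dst, ptrdiff_t dststride,
                              const uint8_t *src, ptrdiff_t srcstride,
                              int height, int width)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)(src[x] << 6);  /* scale 8-bit input to 14-bit */
        /* Advance one row using the caller-supplied byte stride. */
        dst  = (int16_t *)((uint8_t *)dst + dststride);
        src += srcstride;
    }
}

/* HEVC-style call: the same stride value the FW_PUT wrapper passes. */
static void put_pixels_hevc_sketch(int16_t *dst, const uint8_t *src,
                                   ptrdiff_t srcstride, int height, int width)
{
    put_pixels_sketch(dst, 2 * MAX_PB_SIZE, src, srcstride, height, width);
}
```

A caller with a narrower intermediate buffer would pass its own row size in
bytes, for example 16 * sizeof(int16_t), and get the same samples at a
different row pitch.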

diff --git a/libavcodec/x86/h26x/h2656_inter.asm b/libavcodec/x86/h26x/h2656_inter.asm
index aa296d549c..cbba0c1ea5 100644
--- a/libavcodec/x86/h26x/h2656_inter.asm
+++ b/libavcodec/x86/h26x/h2656_inter.asm
@@ -22,8 +22,6 @@
 ; */
 %include "libavutil/x86/x86util.asm"
 
-%define MAX_PB_SIZE 64
-
 SECTION_RODATA 32
 cextern pw_255
 cextern pw_512
@@ -342,7 +340,7 @@ SECTION .text
 %endmacro
 
 %macro LOOP_END 3
-    add              %1q, 2*MAX_PB_SIZE          ; dst += dststride
+    add              %1q, dststrideq             ; dst += dststride
     add              %2q, %3q                    ; src += srcstride
     dec          heightd                         ; cmp height
     jnz               .loop                      ; height loop
@@ -539,7 +537,7 @@ SECTION .text
 
 
 ; ******************************
-; void %1_put_pixels(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride,
+; void %1_put_pixels(int16_t *dst, ptrdiff_t dststride, const uint8_t *_src, ptrdiff_t srcstride,
 ;                         int height, const int8_t *hf, const int8_t *vf, int width)
 ; ******************************
 
@@ -549,7 +547,7 @@ SECTION .text
 %endmacro
 
 %macro MC_PIXELS 3
-cglobal %1_put_pixels%2_%3, 4, 4, 3, dst, src, srcstride, height
+cglobal %1_put_pixels%2_%3, 5, 5, 3, dst, dststride, src, srcstride, height
     pxor              m2, m2
 .loop:
     SIMPLE_LOAD       %2, %3, srcq, m0
@@ -579,10 +577,10 @@ cglobal %1_put_uni_pixels%2_%3, 5, 5, 2, dst, dststride, src, srcstride, height
 %endif
 
 ; ******************************
-; void %1_put_4tap_hX(int16_t *dst,
+; void %1_put_4tap_hX(int16_t *dst, ptrdiff_t dststride,
 ;      const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width);
 ; ******************************
-cglobal %1_put_4tap_h%2_%3, 5, 5, XMM_REGS, dst, src, srcstride, height, hf
+cglobal %1_put_4tap_h%2_%3, 6, 6, XMM_REGS, dst, dststride, src, srcstride, height, hf
 %assign %%stride ((%3 + 7)/8)
     MC_4TAP_FILTER       %3, hf, m4, m5
 .loop:
@@ -612,10 +610,10 @@ cglobal %1_put_uni_4tap_h%2_%3, 6, 7, XMM_REGS, dst, dststride, src, srcstride,
     RET
 
 ; ******************************
-; void %1_put_4tap_v(int16_t *dst,
+; void %1_put_4tap_v(int16_t *dst, ptrdiff_t dststride,
 ;      const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width)
 ; ******************************
-cglobal %1_put_4tap_v%2_%3, 6, 6, XMM_REGS, dst, src, srcstride, height, r3src, vf
+cglobal %1_put_4tap_v%2_%3, 7, 7, XMM_REGS, dst, dststride, src, srcstride, height, r3src, vf
     sub             srcq, srcstrideq
     MC_4TAP_FILTER    %3, vf, m4, m5
     lea           r3srcq, [srcstrideq*3]
@@ -649,10 +647,10 @@ cglobal %1_put_uni_4tap_v%2_%3, 7, 7, XMM_REGS, dst, dststride, src, srcstride,
 
 %macro PUT_4TAP_HV 3
 ; ******************************
-; void put_4tap_hv(int16_t *dst,
+; void put_4tap_hv(int16_t *dst, ptrdiff_t dststride,
 ;      const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width)
 ; ******************************
-cglobal %1_put_4tap_hv%2_%3, 6, 7, 16 , dst, src, srcstride, height, hf, vf, r3src
+cglobal %1_put_4tap_hv%2_%3, 7, 8, 16 , dst, dststride, src, srcstride, height, hf, vf, r3src
 %assign %%stride ((%3 + 7)/8)
     sub                 srcq, srcstrideq
     MC_4TAP_HV_FILTER    %3
@@ -784,12 +782,12 @@ cglobal %1_put_uni_4tap_hv%2_%3, 7, 8, 16 , dst, dststride, src, srcstride, heig
 %endmacro
 
 ; ******************************
-; void put_8tap_hX_X_X(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride,
+; void put_8tap_hX_X_X(int16_t *dst, ptrdiff_t dststride, const uint8_t *_src, ptrdiff_t srcstride,
 ;                       int height, const int8_t *hf, const int8_t *vf, int width)
 ; ******************************
 
 %macro PUT_8TAP 3
-cglobal %1_put_8tap_h%2_%3, 5, 5, 16, dst, src, srcstride, height, hf
+cglobal %1_put_8tap_h%2_%3, 6, 6, 16, dst, dststride, src, srcstride, height, hf
     MC_8TAP_FILTER          %3, hf
 .loop:
     MC_8TAP_H_LOAD          %3, srcq, %2, 10
@@ -824,10 +822,10 @@ cglobal %1_put_uni_8tap_h%2_%3, 6, 7, 16 , dst, dststride, src, srcstride, heigh
 
 
 ; ******************************
-; void put_8tap_vX_X_X(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride,
+; void put_8tap_vX_X_X(int16_t *dst, ptrdiff_t dststride, const uint8_t *_src, ptrdiff_t srcstride,
 ;                      int height, const int8_t *hf, const int8_t *vf, int width)
 ; ******************************
-cglobal %1_put_8tap_v%2_%3, 6, 8, 16, dst, src, srcstride, height, r3src, vf
+cglobal %1_put_8tap_v%2_%3, 7, 8, 16, dst, dststride, src, srcstride, height, r3src, vf
     MC_8TAP_FILTER        %3, vf
     lea               r3srcq, [srcstrideq*3]
 .loop:
@@ -866,11 +864,11 @@ cglobal %1_put_uni_8tap_v%2_%3, 7, 9, 16, dst, dststride, src, srcstride, height
 
 
 ; ******************************
-; void put_8tap_hvX_X(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride,
+; void put_8tap_hvX_X(int16_t *dst, ptrdiff_t dststride, const uint8_t *_src, ptrdiff_t srcstride,
 ;                     int height, const int8_t *hf, const int8_t *vf, int width)
 ; ******************************
 %macro PUT_8TAP_HV 3
-cglobal %1_put_8tap_hv%2_%3, 6, 7, 16, 0 - mmsize*16, dst, src, srcstride, height, hf, vf, r3src
+cglobal %1_put_8tap_hv%2_%3, 7, 8, 16, 0 - mmsize*16, dst, dststride, src, srcstride, height, hf, vf, r3src
     MC_8TAP_FILTER           %3, hf, 0
     lea                     hfq, [rsp]
     MC_8TAP_FILTER           %3, vf, 8*mmsize
diff --git a/libavcodec/x86/h26x/h2656dsp.c b/libavcodec/x86/h26x/h2656dsp.c
index 27769f9c55..7ef1234936 100644
--- a/libavcodec/x86/h26x/h2656dsp.c
+++ b/libavcodec/x86/h26x/h2656dsp.c
@@ -24,7 +24,7 @@
 #include "h2656dsp.h"
 
 #define mc_rep_func(name, bitd, step, W, opt) \
-void ff_h2656_put_##name##W##_##bitd##_##opt(int16_t *_dst,                                                     \
+void ff_h2656_put_##name##W##_##bitd##_##opt(int16_t *_dst, ptrdiff_t dststride,                                \
     const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width)       \
 {                                                                                                               \
     int i;                                                                                                      \
@@ -32,7 +32,7 @@ void ff_h2656_put_##name##W##_##bitd##_##opt(int16_t *_dst,
     for (i = 0; i < W; i += step) {                                                                             \
         const uint8_t *src  = _src + (i * ((bitd + 7) / 8));                                                    \
         dst = _dst + i;                                                                                         \
-        ff_h2656_put_##name##step##_##bitd##_##opt(dst, src, _srcstride, height, hf, vf, width);                \
+        ff_h2656_put_##name##step##_##bitd##_##opt(dst, dststride, src, _srcstride, height, hf, vf, width);     \
     }                                                                                                           \
 }
 
diff --git a/libavcodec/x86/h26x/h2656dsp.h b/libavcodec/x86/h26x/h2656dsp.h
index 8a2ab13607..e31aae6b0d 100644
--- a/libavcodec/x86/h26x/h2656dsp.h
+++ b/libavcodec/x86/h26x/h2656dsp.h
@@ -30,7 +30,7 @@
 #include <stdlib.h>
 
 #define H2656_PEL_PROTOTYPE(name, D, opt) \
-void ff_h2656_put_ ## name ## _ ## D ## _##opt(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width);                               \
+void ff_h2656_put_ ## name ## _ ## D ## _##opt(int16_t *dst, ptrdiff_t dststride, const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width);          \
 void ff_h2656_put_uni_ ## name ## _ ## D ## _##opt(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width);    \
 
 #define H2656_MC_8TAP_PROTOTYPES(fname, bitd, opt)    \
diff --git a/libavcodec/x86/hevcdsp_init.c b/libavcodec/x86/hevcdsp_init.c
index 5c19330e19..e0dc82eef0 100644
--- a/libavcodec/x86/hevcdsp_init.c
+++ b/libavcodec/x86/hevcdsp_init.c
@@ -96,7 +96,7 @@ void ff_hevc_put_hevc_ ## a ## _ ## depth ## _##opt(int16_t *dst, const uint8_t
                                                     int height, intptr_t mx, intptr_t my,int width)          \
 {                                                                                                            \
     DECL_HV_FILTER(p)                                                                                        \
-    ff_h2656_put_ ## b ## _ ## depth ## _##opt(dst, src, srcstride, height, hf, vf, width);                  \
+    ff_h2656_put_ ## b ## _ ## depth ## _##opt(dst, 2 * MAX_PB_SIZE, src, srcstride, height, hf, vf, width); \
 }
 
 #define FW_PUT_UNI(p, a, b, depth, opt) \

From patchwork Tue Jan 23 18:17:08 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wu Jianhua <toqsxw@outlook.com>
X-Patchwork-Id: 45750
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:120f:b0:199:de12:6fa6 with SMTP id v15csp801758pzf;
        Tue, 23 Jan 2024 10:18:17 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IHAtjTGgztwIQcFwTVwdkKIXrxz+HGhoO0XI6LsEydB1gVaO3hngwY6ONIPmCQXnyrXwDpa
X-Received: by 2002:a05:6402:40d6:b0:55c:826b:70c3 with SMTP id
 z22-20020a05640240d600b0055c826b70c3mr705938edb.39.1706033897537;
        Tue, 23 Jan 2024 10:18:17 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 ee4-20020a056402290400b0055c47bd0dabsi2066938edb.146.2024.01.23.10.18.15;
        Tue, 23 Jan 2024 10:18:17 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@outlook.com
 header.s=selector1 header.b=s5DtMcfv;
       arc=fail (body hash mismatch);
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=outlook.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 28D2968D0C8;
	Tue, 23 Jan 2024 20:17:47 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from JPN01-TYC-obe.outbound.protection.outlook.com
 (mail-tycjpn01olkn2040.outbound.protection.outlook.com [40.92.99.40])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 5459E68D0CE
 for <ffmpeg-devel@ffmpeg.org>; Tue, 23 Jan 2024 20:17:39 +0200 (EET)
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:173::5)
 by TYWP286MB3822.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:447::6) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.36; Tue, 23 Jan
 2024 18:17:24 +0000
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080]) by TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080%3]) with mapi id 15.20.7228.022; Tue, 23 Jan 2024
 18:17:24 +0000
From: toqsxw@outlook.com
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 24 Jan 2024 02:17:08 +0800
Message-ID: 
 <TYWP286MB2172994D9ABA3A97CBF60D45CA742@TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240123181711.402946-1-toqsxw@outlook.com>
References: <20240123181711.402946-1-toqsxw@outlook.com>
X-TMN: [N23HyXE+nsi3/tzjeczROnvr9lrxJqhY]
X-ClientProxiedBy: SI2P153CA0030.APCP153.PROD.OUTLOOK.COM
 (2603:1096:4:190::15) To TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 (2603:1096:400:173::5)
X-Microsoft-Original-Message-ID: <20240123181711.402946-5-toqsxw@outlook.com>
Subject: [FFmpeg-devel] [PATCH v4 5/8] avcodec/vvcdec: reuse
 h26x/2656_inter.asm to enable x86 optimizations
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Wu Jianhua <toqsxw@outlook.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: qEZ45hij/nt7

From: Wu Jianhua <toqsxw@outlook.com>

Signed-off-by: Wu Jianhua <toqsxw@outlook.com>
---
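Note: a minimal sketch of the forwarding-wrapper idea behind FW_PUT, for
readers unfamiliar with the pattern. It is not the patch's code. All demo_*
names and DEMO_MAX_PB_SIZE are hypothetical, the kernel is a plain C copy
rather than the shared assembly, and treating the fixed stride as a byte
count is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MAX_PB_SIZE; the real value lives in the VVC headers. */
#define DEMO_MAX_PB_SIZE 128

/* Hypothetical shared kernel that takes an explicit destination stride in bytes.
 * It only widens pixels to int16_t; the real shared kernels are x86 assembly. */
static void demo_shared_put_pixels(int16_t *dst, ptrdiff_t dststride,
                                   const uint8_t *src, ptrdiff_t srcstride,
                                   int height, int width)
{
    /* The stride must be a whole number of int16_t elements. */
    assert(dststride % (ptrdiff_t)sizeof(*dst) == 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = src[x];
        dst += dststride / (ptrdiff_t)sizeof(*dst);
        src += srcstride;
    }
}

/* Generate a wrapper that has no dststride parameter and pins the destination
 * row pitch to a fixed value, in the same spirit as FW_PUT in this patch. */
#define DEMO_FW_PUT(name)                                                     \
static void demo_vvc_put_ ## name(int16_t *dst, const uint8_t *src,           \
                                  ptrdiff_t srcstride, int height, int width) \
{                                                                             \
    demo_shared_put_ ## name(dst, 2 * DEMO_MAX_PB_SIZE, src, srcstride,       \
                             height, width);                                  \
}

/* No trailing semicolon: the macro expands to a complete function definition. */
DEMO_FW_PUT(pixels)
```

With this shape the caller keeps the shorter VVC-style signature, while the
shared kernel stays generic over the destination stride.
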
 libavcodec/Makefile              |   1 +
 libavcodec/vvc/vvcdsp.c          |   4 +
 libavcodec/vvc/vvcdsp.h          |   2 +
 libavcodec/x86/vvc/Makefile      |   6 +
 libavcodec/x86/vvc/vvcdsp_init.c | 202 +++++++++++++++++++++++++++++++
 5 files changed, 215 insertions(+)
 create mode 100644 libavcodec/x86/vvc/Makefile
 create mode 100644 libavcodec/x86/vvc/vvcdsp_init.c

diff --git a/libavcodec/Makefile b/libavcodec/Makefile
index bb42095165..ce33631b60 100644
--- a/libavcodec/Makefile
+++ b/libavcodec/Makefile
@@ -65,6 +65,7 @@ OBJS = ac3_parser.o                                                     \
 
 # subsystems
 include $(SRC_PATH)/libavcodec/vvc/Makefile
+include $(SRC_PATH)/libavcodec/x86/vvc/Makefile
 OBJS-$(CONFIG_AANDCTTABLES)            += aandcttab.o
 OBJS-$(CONFIG_AC3DSP)                  += ac3dsp.o ac3.o ac3tab.o
 OBJS-$(CONFIG_ADTS_HEADER)             += adts_header.o mpeg4audio_sample_rates.o
diff --git a/libavcodec/vvc/vvcdsp.c b/libavcodec/vvc/vvcdsp.c
index c82ea7be30..c542be5258 100644
--- a/libavcodec/vvc/vvcdsp.c
+++ b/libavcodec/vvc/vvcdsp.c
@@ -138,4 +138,8 @@ void ff_vvc_dsp_init(VVCDSPContext *vvcdsp, int bit_depth)
         VVC_DSP(8);
         break;
     }
+
+#if ARCH_X86
+    ff_vvc_dsp_init_x86(vvcdsp, bit_depth);
+#endif
 }
diff --git a/libavcodec/vvc/vvcdsp.h b/libavcodec/vvc/vvcdsp.h
index b5a63c5833..6f59e73654 100644
--- a/libavcodec/vvc/vvcdsp.h
+++ b/libavcodec/vvc/vvcdsp.h
@@ -167,4 +167,6 @@ typedef struct VVCDSPContext {
 
 void ff_vvc_dsp_init(VVCDSPContext *hpc, int bit_depth);
 
+void ff_vvc_dsp_init_x86(VVCDSPContext *hpc, const int bit_depth);
+
 #endif /* AVCODEC_VVC_VVCDSP_H */
diff --git a/libavcodec/x86/vvc/Makefile b/libavcodec/x86/vvc/Makefile
new file mode 100644
index 0000000000..b4acc22501
--- /dev/null
+++ b/libavcodec/x86/vvc/Makefile
@@ -0,0 +1,6 @@
+clean::
+	$(RM) $(CLEANSUFFIXES:%=libavcodec/x86/vvc/%)
+
+OBJS-$(CONFIG_VVC_DECODER)             += x86/vvc/vvcdsp_init.o
+X86ASM-OBJS-$(CONFIG_VVC_DECODER)      += x86/h26x/h2656dsp.o               \
+                                          x86/h26x/h2656_inter.o
diff --git a/libavcodec/x86/vvc/vvcdsp_init.c b/libavcodec/x86/vvc/vvcdsp_init.c
new file mode 100644
index 0000000000..c197cdb4cc
--- /dev/null
+++ b/libavcodec/x86/vvc/vvcdsp_init.c
@@ -0,0 +1,202 @@
+/*
+ * VVC DSP init for x86
+ *
+ * Copyright (C) 2022-2024 Nuo Mi
+ * Copyright (c) 2023-2024 Wu Jianhua
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include "libavutil/cpu.h"
+#include "libavutil/x86/asm.h"
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/vvc/vvcdec.h"
+#include "libavcodec/vvc/vvc_ctu.h"
+#include "libavcodec/vvc/vvcdsp.h"
+#include "libavcodec/x86/h26x/h2656dsp.h"
+
+#define FW_PUT(name, depth, opt) \
+static void ff_vvc_put_ ## name ## _ ## depth ## _##opt(int16_t *dst, const uint8_t *src, ptrdiff_t srcstride, \
+                                                 int height, const int8_t *hf, const int8_t *vf, int width)    \
+{                                                                                                              \
+    ff_h2656_put_## name ## _ ## depth ## _##opt(dst, 2 * MAX_PB_SIZE, src, srcstride, height, hf, vf, width); \
+}
+
+#define FW_PUT_TAP(fname, bitd, opt ) \
+    FW_PUT(fname##4,   bitd, opt )    \
+    FW_PUT(fname##8,   bitd, opt )    \
+    FW_PUT(fname##16,  bitd, opt )    \
+    FW_PUT(fname##32,  bitd, opt )    \
+    FW_PUT(fname##64,  bitd, opt )    \
+    FW_PUT(fname##128, bitd, opt )
+
+#define FW_PUT_4TAP(fname, bitd, opt) \
+    FW_PUT(fname ## 2, bitd, opt)     \
+    FW_PUT_TAP(fname,  bitd, opt)
+
+#define FW_PUT_4TAP_SSE4(bitd)       \
+    FW_PUT_4TAP(pixels,  bitd, sse4) \
+    FW_PUT_4TAP(4tap_h,  bitd, sse4) \
+    FW_PUT_4TAP(4tap_v,  bitd, sse4) \
+    FW_PUT_4TAP(4tap_hv, bitd, sse4)
+
+#define FW_PUT_8TAP_SSE4(bitd)      \
+    FW_PUT_TAP(8tap_h,  bitd, sse4) \
+    FW_PUT_TAP(8tap_v,  bitd, sse4) \
+    FW_PUT_TAP(8tap_hv, bitd, sse4)
+
+#define FW_PUT_SSE4(bitd)  \
+    FW_PUT_4TAP_SSE4(bitd) \
+    FW_PUT_8TAP_SSE4(bitd)
+
+FW_PUT_SSE4( 8)
+FW_PUT_SSE4(10)
+FW_PUT_SSE4(12)
+
+#define FW_PUT_TAP_AVX2(n, bitd)        \
+    FW_PUT(n ## tap_h32,   bitd, avx2)  \
+    FW_PUT(n ## tap_h64,   bitd, avx2)  \
+    FW_PUT(n ## tap_h128,  bitd, avx2)  \
+    FW_PUT(n ## tap_v32,   bitd, avx2)  \
+    FW_PUT(n ## tap_v64,   bitd, avx2)  \
+    FW_PUT(n ## tap_v128,  bitd, avx2)
+
+#define FW_PUT_AVX2(bitd) \
+    FW_PUT(pixels32,  bitd, avx2) \
+    FW_PUT(pixels64,  bitd, avx2) \
+    FW_PUT(pixels128, bitd, avx2) \
+    FW_PUT_TAP_AVX2(4, bitd)      \
+    FW_PUT_TAP_AVX2(8, bitd)
+
+FW_PUT_AVX2( 8)
+FW_PUT_AVX2(10)
+FW_PUT_AVX2(12)
+
+#define FW_PUT_TAP_16BPC_AVX2(n, bitd) \
+    FW_PUT(n ## tap_h16,   bitd, avx2) \
+    FW_PUT(n ## tap_v16,   bitd, avx2) \
+    FW_PUT(n ## tap_hv16,  bitd, avx2) \
+    FW_PUT(n ## tap_hv32,  bitd, avx2) \
+    FW_PUT(n ## tap_hv64,  bitd, avx2) \
+    FW_PUT(n ## tap_hv128, bitd, avx2)
+
+#define FW_PUT_16BPC_AVX2(bitd)     \
+    FW_PUT(pixels16, bitd, avx2)    \
+    FW_PUT_TAP_16BPC_AVX2(4, bitd)  \
+    FW_PUT_TAP_16BPC_AVX2(8, bitd)
+
+FW_PUT_16BPC_AVX2(10)
+FW_PUT_16BPC_AVX2(12)
+
+#define PEL_LINK(dst, C, W, idx1, idx2, name, D, opt)                              \
+    dst[C][W][idx1][idx2] = ff_vvc_put_## name ## _ ## D ## _##opt;                \
+    dst ## _uni[C][W][idx1][idx2] = ff_h2656_put_uni_ ## name ## _ ## D ## _##opt; \
+
+#define MC_TAP_LINKS(pointer, C, my, mx, fname, bitd, opt )          \
+    PEL_LINK(pointer, C, 1, my , mx , fname##4 ,  bitd, opt );       \
+    PEL_LINK(pointer, C, 2, my , mx , fname##8 ,  bitd, opt );       \
+    PEL_LINK(pointer, C, 3, my , mx , fname##16,  bitd, opt );       \
+    PEL_LINK(pointer, C, 4, my , mx , fname##32,  bitd, opt );       \
+    PEL_LINK(pointer, C, 5, my , mx , fname##64,  bitd, opt );       \
+    PEL_LINK(pointer, C, 6, my , mx , fname##128, bitd, opt );
+
+#define MC_8TAP_LINKS(pointer, my, mx, fname, bitd, opt)             \
+    MC_TAP_LINKS(pointer, LUMA, my, mx, fname, bitd, opt)
+
+#define MC_8TAP_LINKS_SSE4(bd)                                       \
+    MC_8TAP_LINKS(c->inter.put, 0, 0, pixels, bd, sse4);             \
+    MC_8TAP_LINKS(c->inter.put, 0, 1, 8tap_h, bd, sse4);             \
+    MC_8TAP_LINKS(c->inter.put, 1, 0, 8tap_v, bd, sse4);             \
+    MC_8TAP_LINKS(c->inter.put, 1, 1, 8tap_hv, bd, sse4)
+
+#define MC_4TAP_LINKS(pointer, my, mx, fname, bitd, opt)             \
+    PEL_LINK(pointer, CHROMA, 0, my , mx , fname##2 ,  bitd, opt );  \
+    MC_TAP_LINKS(pointer, CHROMA, my, mx, fname, bitd, opt)          \
+
+#define MC_4TAP_LINKS_SSE4(bd)                                       \
+    MC_4TAP_LINKS(c->inter.put, 0, 0, pixels, bd, sse4);             \
+    MC_4TAP_LINKS(c->inter.put, 0, 1, 4tap_h, bd, sse4);             \
+    MC_4TAP_LINKS(c->inter.put, 1, 0, 4tap_v, bd, sse4);             \
+    MC_4TAP_LINKS(c->inter.put, 1, 1, 4tap_hv, bd, sse4)
+
+#define MC_LINK_SSE4(bd)                                             \
+    MC_4TAP_LINKS_SSE4(bd)                                           \
+    MC_8TAP_LINKS_SSE4(bd)
+
+#define MC_TAP_LINKS_AVX2(C,tap,bd) do {                             \
+        PEL_LINK(c->inter.put, C, 4, 0, 0, pixels32,      bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 5, 0, 0, pixels64,      bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 6, 0, 0, pixels128,     bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 4, 0, 1, tap##tap_h32,  bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 5, 0, 1, tap##tap_h64,  bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 6, 0, 1, tap##tap_h128, bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 4, 1, 0, tap##tap_v32,  bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 5, 1, 0, tap##tap_v64,  bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 6, 1, 0, tap##tap_v128, bd, avx2)  \
+    } while (0)
+
+#define MC_LINKS_AVX2(bd)                                            \
+    MC_TAP_LINKS_AVX2(LUMA,   8, bd);                                \
+    MC_TAP_LINKS_AVX2(CHROMA, 4, bd);
+
+#define MC_TAP_LINKS_16BPC_AVX2(C, tap, bd) do {                     \
+        PEL_LINK(c->inter.put, C, 3, 0, 0, pixels16, bd, avx2)       \
+        PEL_LINK(c->inter.put, C, 3, 0, 1, tap##tap_h16, bd, avx2)   \
+        PEL_LINK(c->inter.put, C, 3, 1, 0, tap##tap_v16, bd, avx2)   \
+        PEL_LINK(c->inter.put, C, 3, 1, 1, tap##tap_hv16, bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 4, 1, 1, tap##tap_hv32, bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 5, 1, 1, tap##tap_hv64, bd, avx2)  \
+        PEL_LINK(c->inter.put, C, 6, 1, 1, tap##tap_hv128, bd, avx2) \
+    } while (0)
+
+#define MC_LINKS_16BPC_AVX2(bd)                                      \
+    MC_TAP_LINKS_16BPC_AVX2(LUMA,   8, bd);                          \
+    MC_TAP_LINKS_16BPC_AVX2(CHROMA, 4, bd);
+
+void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
+{
+    const int cpu_flags = av_get_cpu_flags();
+
+    if (ARCH_X86_64) {
+        if (bd == 8) {
+            if (EXTERNAL_SSE4(cpu_flags)) {
+                MC_LINK_SSE4(8);
+            }
+            if (EXTERNAL_AVX2_FAST(cpu_flags)) {
+                MC_LINKS_AVX2(8);
+            }
+        } else if (bd == 10) {
+            if (EXTERNAL_SSE4(cpu_flags)) {
+                MC_LINK_SSE4(10);
+            }
+            if (EXTERNAL_AVX2_FAST(cpu_flags)) {
+                MC_LINKS_AVX2(10);
+                MC_LINKS_16BPC_AVX2(10);
+            }
+        } else if (bd == 12) {
+            if (EXTERNAL_SSE4(cpu_flags)) {
+                MC_LINK_SSE4(12);
+            }
+            if (EXTERNAL_AVX2_FAST(cpu_flags)) {
+                MC_LINKS_AVX2(12);
+                MC_LINKS_16BPC_AVX2(12);
+            }
+        }
+    }
+}

From patchwork Tue Jan 23 18:17:09 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wu Jianhua <toqsxw@outlook.com>
X-Patchwork-Id: 45753
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:120f:b0:199:de12:6fa6 with SMTP id v15csp802032pzf;
        Tue, 23 Jan 2024 10:18:43 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IGX7GLYuZISKl8YBcrC3fvwiQz5adfYild5cnB3C27kNV+VBknnCu8hItpozSynliJNKLsG
X-Received: by 2002:a2e:97cb:0:b0:2cc:d490:28b1 with SMTP id
 m11-20020a2e97cb000000b002ccd49028b1mr110261ljj.58.1706033922753;
        Tue, 23 Jan 2024 10:18:42 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 o12-20020aa7dd4c000000b005576bc33d82si12783530edw.517.2024.01.23.10.18.42;
        Tue, 23 Jan 2024 10:18:42 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@outlook.com
 header.s=selector1 header.b=cVqS1PTe;
       arc=fail (body hash mismatch);
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=outlook.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 9101B68D0F6;
	Tue, 23 Jan 2024 20:17:51 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from JPN01-OS0-obe.outbound.protection.outlook.com
 (mail-os0jpn01olkn2100.outbound.protection.outlook.com [40.92.98.100])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 103F768CEC8
 for <ffmpeg-devel@ffmpeg.org>; Tue, 23 Jan 2024 20:17:43 +0200 (EET)
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:173::5)
 by TYWP286MB3822.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:447::6) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.36; Tue, 23 Jan
 2024 18:17:25 +0000
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080]) by TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080%3]) with mapi id 15.20.7228.022; Tue, 23 Jan 2024
 18:17:25 +0000
From: toqsxw@outlook.com
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 24 Jan 2024 02:17:09 +0800
Message-ID: 
 <TYWP286MB2172027F81FCA99EDB23BE87CA742@TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240123181711.402946-1-toqsxw@outlook.com>
References: <20240123181711.402946-1-toqsxw@outlook.com>
X-TMN: [FbKAYr4aHhF2uUZPsAmhbtSUFJJIL6Sy]
X-ClientProxiedBy: SI2P153CA0030.APCP153.PROD.OUTLOOK.COM
 (2603:1096:4:190::15) To TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 (2603:1096:400:173::5)
X-Microsoft-Original-Message-ID: <20240123181711.402946-6-toqsxw@outlook.com>
MIME-Version: 1.0
X-MS-Exchange-MessageSentRepresentingType: 1
X-MS-PublicTrafficType: Email
X-MS-TrafficTypeDiagnostic: TYWP286MB2172:EE_|TYWP286MB3822:EE_
X-MS-Office365-Filtering-Correlation-Id: 302a8e27-dcb7-4dce-8cce-08dc1c3f8cd2
X-OriginatorOrg: outlook.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 302a8e27-dcb7-4dce-8cce-08dc1c3f8cd2
X-MS-Exchange-CrossTenant-AuthSource: TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 23 Jan 2024 18:17:25.2394 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa
X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 
 00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: TYWP286MB3822
Subject: [FFmpeg-devel] [PATCH v4 6/8] tests/checkasm: add
 checkasm_check_vvc_mc
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Wu Jianhua <toqsxw@outlook.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: MHxpO4cFsEC9

From: Wu Jianhua <toqsxw@outlook.com>

Signed-off-by: Wu Jianhua <toqsxw@outlook.com>
---
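Note: a small sketch of the two lookups the new test relies on, namely the
block-width slot (log2(width) - 1) and the random-data mask selected by bit
depth. The demo_* names are hypothetical, and demo_log2() is a portable
stand-in for av_log2(), not FFmpeg's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Same mask values as pixel_mask[] in the test, one entry per 2-bit depth step. */
static const uint32_t demo_pixel_mask[] = {
    0xffffffff, 0x03ff03ff, 0x0fff0fff, 0x3fff3fff, 0xffffffff
};

/* floor(log2(v)) for v > 0; a portable stand-in for av_log2(). */
static int demo_log2(unsigned v)
{
    int n = 0;
    assert(v > 0);
    while (v >>= 1)
        n++;
    return n;
}

/* Width-to-slot mapping used when indexing the put tables:
 * 2 -> 0, 4 -> 1, 8 -> 2, ..., 128 -> 6. */
static int demo_width_index(int width)
{
    /* Only power-of-two widths of at least 2 are meaningful here. */
    assert(width >= 2 && (width & (width - 1)) == 0);
    return demo_log2((unsigned)width) - 1;
}

/* Mask applied to each 32-bit random word: depth 8 -> entry 0,
 * depth 10 -> entry 1, depth 12 -> entry 2. */
static uint32_t demo_mask_for_depth(int bit_depth)
{
    assert(bit_depth >= 8 && bit_depth <= 16 && bit_depth % 2 == 0);
    return demo_pixel_mask[(bit_depth - 8) >> 1];
}
```

The slot values line up with the W argument of PEL_LINK in the x86 init
patch, where slot 0 is the width-2 case that is only linked for chroma.
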
 tests/checkasm/Makefile   |   1 +
 tests/checkasm/checkasm.c |   3 +
 tests/checkasm/checkasm.h |   1 +
 tests/checkasm/vvc_mc.c   | 270 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 275 insertions(+)
 create mode 100644 tests/checkasm/vvc_mc.c

diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index 3b5b54352b..3562acb2b2 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -40,6 +40,7 @@ AVCODECOBJS-$(CONFIG_V210_DECODER)      += v210dec.o
 AVCODECOBJS-$(CONFIG_V210_ENCODER)      += v210enc.o
 AVCODECOBJS-$(CONFIG_VORBIS_DECODER)    += vorbisdsp.o
 AVCODECOBJS-$(CONFIG_VP9_DECODER)       += vp9dsp.o
+AVCODECOBJS-$(CONFIG_VVC_DECODER)       += vvc_mc.o
 
 CHECKASMOBJS-$(CONFIG_AVCODEC)          += $(AVCODECOBJS-yes)
 
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 87f24c77ca..36a97957e5 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -194,6 +194,9 @@ static const struct {
     #if CONFIG_VORBIS_DECODER
         { "vorbisdsp", checkasm_check_vorbisdsp },
     #endif
+    #if CONFIG_VVC_DECODER
+        { "vvc_mc", checkasm_check_vvc_mc },
+    #endif
 #endif
 #if CONFIG_AVFILTER
     #if CONFIG_AFIR_FILTER
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index 4db8c495ea..53cb3ccfbf 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -131,6 +131,7 @@ void checkasm_check_vp8dsp(void);
 void checkasm_check_vp9dsp(void);
 void checkasm_check_videodsp(void);
 void checkasm_check_vorbisdsp(void);
+void checkasm_check_vvc_mc(void);
 
 struct CheckasmPerf;
 
diff --git a/tests/checkasm/vvc_mc.c b/tests/checkasm/vvc_mc.c
new file mode 100644
index 0000000000..711280deec
--- /dev/null
+++ b/tests/checkasm/vvc_mc.c
@@ -0,0 +1,270 @@
+/*
+ * Copyright (c) 2023-2024 Nuo Mi
+ * Copyright (c) 2023-2024 Wu Jianhua
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <string.h>
+
+#include "checkasm.h"
+#include "libavcodec/avcodec.h"
+#include "libavcodec/vvc/vvc_ctu.h"
+#include "libavcodec/vvc/vvc_data.h"
+
+#include "libavutil/common.h"
+#include "libavutil/internal.h"
+#include "libavutil/internal.h"
+#include "libavutil/intreadwrite.h"
+#include "libavutil/mem_internal.h"
+
+static const uint32_t pixel_mask[] = { 0xffffffff, 0x03ff03ff, 0x0fff0fff, 0x3fff3fff, 0xffffffff };
+static const int sizes[] = { 2, 4, 8, 16, 32, 64, 128 };
+
+#define PIXEL_STRIDE (MAX_CTU_SIZE * 2)
+#define EXTRA_BEFORE 3
+#define EXTRA_AFTER  4
+#define SRC_EXTRA    ((EXTRA_BEFORE + EXTRA_AFTER) * 2)
+#define SRC_BUF_SIZE ((PIXEL_STRIDE + SRC_EXTRA) * (PIXEL_STRIDE + SRC_EXTRA))
+#define DST_BUF_SIZE (MAX_CTU_SIZE * MAX_CTU_SIZE * 2)
+#define SRC_OFFSET   ((PIXEL_STRIDE + EXTRA_BEFORE * 2) * EXTRA_BEFORE)
+
+#define randomize_buffers(buf0, buf1, size, mask)           \
+    do {                                                    \
+        int k;                                              \
+        for (k = 0; k < size; k += 4) {                     \
+            uint32_t r = rnd() & mask;                      \
+            AV_WN32A(buf0 + k, r);                          \
+            AV_WN32A(buf1 + k, r);                          \
+        }                                                   \
+    } while (0)
+
+#define randomize_pixels(buf0, buf1, size)                  \
+    do {                                                    \
+        uint32_t mask = pixel_mask[(bit_depth - 8) >> 1];   \
+        randomize_buffers(buf0, buf1, size, mask);          \
+    } while (0)
+
+#define randomize_avg_src(buf0, buf1, size)                 \
+    do {                                                    \
+        uint32_t mask = 0x3fff3fff;                         \
+        randomize_buffers(buf0, buf1, size, mask);          \
+    } while (0)
+
+static void check_put_vvc_luma(void)
+{
+    LOCAL_ALIGNED_32(int16_t, dst0, [DST_BUF_SIZE / 2]);
+    LOCAL_ALIGNED_32(int16_t, dst1, [DST_BUF_SIZE / 2]);
+    LOCAL_ALIGNED_32(uint8_t, src0, [SRC_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, src1, [SRC_BUF_SIZE]);
+    VVCDSPContext c;
+
+    declare_func_emms(AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT, void, int16_t *dst, const uint8_t *src, const ptrdiff_t src_stride,
+        const int height, const int8_t *hf, const int8_t *vf, const int width);
+
+    for (int bit_depth = 8; bit_depth <= 12; bit_depth += 2) {
+        randomize_pixels(src0, src1, SRC_BUF_SIZE);
+        ff_vvc_dsp_init(&c, bit_depth);
+        for (int i = 0; i < 2; i++) {
+            for (int j = 0; j < 2; j++) {
+                for (int h = 4; h <= MAX_CTU_SIZE; h *= 2) {
+                    for (int w = 4; w <= MAX_CTU_SIZE; w *= 2) {
+                        const int idx       = av_log2(w) - 1;
+                        const int mx        = rnd() % VVC_INTER_LUMA_FACTS;
+                        const int my        = rnd() % VVC_INTER_LUMA_FACTS;
+                        const int8_t *hf    = ff_vvc_inter_luma_filters[rnd() % VVC_INTER_FILTER_TYPES][mx];
+                        const int8_t *vf    = ff_vvc_inter_luma_filters[rnd() % VVC_INTER_FILTER_TYPES][my];
+                        const char *type;
+                        switch ((j << 1) | i) {
+                            case 0: type = "put_luma_pixels"; break; // 0 0
+                            case 1: type = "put_luma_h"; break; // 0 1
+                            case 2: type = "put_luma_v"; break; // 1 0
+                            case 3: type = "put_luma_hv"; break; // 1 1
+                        }
+                        if (check_func(c.inter.put[LUMA][idx][j][i], "%s_%d_%dx%d", type, bit_depth, w, h)) {
+                            memset(dst0, 0, DST_BUF_SIZE);
+                            memset(dst1, 0, DST_BUF_SIZE);
+                            call_ref(dst0, src0 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                            call_new(dst1, src1 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                            if (memcmp(dst0, dst1, DST_BUF_SIZE))
+                                fail();
+                            if (w == h)
+                                bench_new(dst1, src1 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                        }
+                    }
+                }
+            }
+        }
+    }
+    report("put_luma");
+}
+
+static void check_put_vvc_luma_uni(void)
+{
+    LOCAL_ALIGNED_32(uint8_t, dst0, [DST_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, dst1, [DST_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, src0, [SRC_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, src1, [SRC_BUF_SIZE]);
+
+    VVCDSPContext c;
+    declare_func_emms(AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT, void, uint8_t *dst, ptrdiff_t dststride,
+        uint8_t *src, ptrdiff_t srcstride, int height, const int8_t *hf, const int8_t *vf, int width);
+
+    for (int bit_depth = 8; bit_depth <= 12; bit_depth += 2) {
+        ff_vvc_dsp_init(&c, bit_depth);
+        randomize_pixels(src0, src1, SRC_BUF_SIZE);
+        for (int i = 0; i < 2; i++) {
+            for (int j = 0; j < 2; j++) {
+                for (int h = 4; h <= MAX_CTU_SIZE; h *= 2) {
+                    for (int w = 4; w <= MAX_CTU_SIZE; w *= 2) {
+                        const int idx       = av_log2(w) - 1;
+                        const int mx        = rnd() % VVC_INTER_LUMA_FACTS;
+                        const int my        = rnd() % VVC_INTER_LUMA_FACTS;
+                        const int8_t *hf    = ff_vvc_inter_luma_filters[rnd() % VVC_INTER_FILTER_TYPES][mx];
+                        const int8_t *vf    = ff_vvc_inter_luma_filters[rnd() % VVC_INTER_FILTER_TYPES][my];
+                        const char *type;
+
+                        switch ((j << 1) | i) {
+                            case 0: type = "put_uni_pixels"; break; // 0 0
+                            case 1: type = "put_uni_h"; break; // 0 1
+                            case 2: type = "put_uni_v"; break; // 1 0
+                            case 3: type = "put_uni_hv"; break; // 1 1
+                        }
+
+                        if (check_func(c.inter.put_uni[LUMA][idx][j][i], "%s_luma_%d_%dx%d", type, bit_depth, w, h)) {
+                            memset(dst0, 0, DST_BUF_SIZE);
+                            memset(dst1, 0, DST_BUF_SIZE);
+                            call_ref(dst0, PIXEL_STRIDE, src0 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                            call_new(dst1, PIXEL_STRIDE, src1 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                            if (memcmp(dst0, dst1, DST_BUF_SIZE))
+                                fail();
+                            if (w == h)
+                                bench_new(dst1, PIXEL_STRIDE, src1 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                        }
+                    }
+                }
+            }
+        }
+    }
+    report("put_uni_luma");
+}
+
+static void check_put_vvc_chroma(void)
+{
+    LOCAL_ALIGNED_32(int16_t, dst0, [DST_BUF_SIZE / 2]);
+    LOCAL_ALIGNED_32(int16_t, dst1, [DST_BUF_SIZE / 2]);
+    LOCAL_ALIGNED_32(uint8_t, src0, [SRC_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, src1, [SRC_BUF_SIZE]);
+    VVCDSPContext c;
+
+    declare_func_emms(AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT, void, int16_t *dst, const uint8_t *src, const ptrdiff_t src_stride,
+        const int height, const int8_t *hf, const int8_t *vf, const int width);
+
+    for (int bit_depth = 8; bit_depth <= 12; bit_depth += 2) {
+        randomize_pixels(src0, src1, SRC_BUF_SIZE);
+        ff_vvc_dsp_init(&c, bit_depth);
+        for (int i = 0; i < 2; i++) {
+            for (int j = 0; j < 2; j++) {
+                for (int h = 2; h <= MAX_CTU_SIZE; h *= 2) {
+                    for (int w = 2; w <= MAX_CTU_SIZE; w *= 2) {
+                        const int idx       = av_log2(w) - 1;
+                        const int mx        = rnd() % VVC_INTER_CHROMA_FACTS;
+                        const int my        = rnd() % VVC_INTER_CHROMA_FACTS;
+                        const int8_t *hf    = ff_vvc_inter_chroma_filters[rnd() % VVC_INTER_FILTER_TYPES][mx];
+                        const int8_t *vf    = ff_vvc_inter_chroma_filters[rnd() % VVC_INTER_FILTER_TYPES][my];
+                        const char *type;
+                        switch ((j << 1) | i) {
+                            case 0: type = "put_chroma_pixels"; break; // 0 0
+                            case 1: type = "put_chroma_h"; break; // 0 1
+                            case 2: type = "put_chroma_v"; break; // 1 0
+                            case 3: type = "put_chroma_hv"; break; // 1 1
+                        }
+                        if (check_func(c.inter.put[CHROMA][idx][j][i], "%s_%d_%dx%d", type, bit_depth, w, h)) {
+                            memset(dst0, 0, DST_BUF_SIZE);
+                            memset(dst1, 0, DST_BUF_SIZE);
+                            call_ref(dst0, src0 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                            call_new(dst1, src1 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                            if (memcmp(dst0, dst1, DST_BUF_SIZE))
+                                fail();
+                            if (w == h)
+                                bench_new(dst1, src1 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                        }
+                    }
+                }
+            }
+        }
+    }
+    report("put_chroma");
+}
+
+static void check_put_vvc_chroma_uni(void)
+{
+    LOCAL_ALIGNED_32(uint8_t, dst0, [DST_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, dst1, [DST_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, src0, [SRC_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, src1, [SRC_BUF_SIZE]);
+
+    VVCDSPContext c;
+    declare_func_emms(AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT, void, uint8_t *dst, ptrdiff_t dststride,
+        uint8_t *src, ptrdiff_t srcstride, int height, const int8_t *hf, const int8_t *vf, int width);
+
+    for (int bit_depth = 8; bit_depth <= 12; bit_depth += 2) {
+        ff_vvc_dsp_init(&c, bit_depth);
+        randomize_pixels(src0, src1, SRC_BUF_SIZE);
+        for (int i = 0; i < 2; i++) {
+            for (int j = 0; j < 2; j++) {
+                for (int h = 4; h <= MAX_CTU_SIZE; h *= 2) {
+                    for (int w = 4; w <= MAX_CTU_SIZE; w *= 2) {
+                        const int idx       = av_log2(w) - 1;
+                        const int mx        = rnd() % VVC_INTER_CHROMA_FACTS;
+                        const int my        = rnd() % VVC_INTER_CHROMA_FACTS;
+                        const int8_t *hf    = ff_vvc_inter_chroma_filters[rnd() % VVC_INTER_FILTER_TYPES][mx];
+                        const int8_t *vf    = ff_vvc_inter_chroma_filters[rnd() % VVC_INTER_FILTER_TYPES][my];
+                        const char *type;
+
+                        switch ((j << 1) | i) {
+                            case 0: type = "put_uni_pixels"; break; // 0 0
+                            case 1: type = "put_uni_h"; break; // 0 1
+                            case 2: type = "put_uni_v"; break; // 1 0
+                            case 3: type = "put_uni_hv"; break; // 1 1
+                        }
+
+                        if (check_func(c.inter.put_uni[CHROMA][idx][j][i], "%s_chroma_%d_%dx%d", type, bit_depth, w, h)) {
+                            memset(dst0, 0, DST_BUF_SIZE);
+                            memset(dst1, 0, DST_BUF_SIZE);
+                            call_ref(dst0, PIXEL_STRIDE, src0 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                            call_new(dst1, PIXEL_STRIDE, src1 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                            if (memcmp(dst0, dst1, DST_BUF_SIZE))
+                                fail();
+                            if (w == h)
+                                bench_new(dst1, PIXEL_STRIDE, src1 + SRC_OFFSET, PIXEL_STRIDE, h, hf, vf, w);
+                        }
+                    }
+                }
+            }
+        }
+    }
+    report("put_uni_chroma");
+}
+
+void checkasm_check_vvc_mc(void)
+{
+    check_put_vvc_luma();
+    check_put_vvc_luma_uni();
+    check_put_vvc_chroma();
+    check_put_vvc_chroma_uni();
+}

From patchwork Tue Jan 23 18:17:10 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Wu Jianhua <toqsxw@outlook.com>
X-Patchwork-Id: 45751
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:120f:b0:199:de12:6fa6 with SMTP id v15csp801862pzf;
        Tue, 23 Jan 2024 10:18:25 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IHtWhI7+izGRRRBYvAIACx5oIwdukz6lEzCGXapplJBYtW7NMJlj8izB3I8qZbQnAUYUNp/
X-Received: by 2002:a17:906:d94:b0:a30:c104:3a9c with SMTP id
 m20-20020a1709060d9400b00a30c1043a9cmr254624eji.13.1706033905480;
        Tue, 23 Jan 2024 10:18:25 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 xo14-20020a170907bb8e00b00a2a18252ae7si11704361ejc.1053.2024.01.23.10.18.24;
        Tue, 23 Jan 2024 10:18:25 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@outlook.com
 header.s=selector1 header.b=tWTx5sNQ;
       arc=fail (body hash mismatch);
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=outlook.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 3BCBF68D0EF;
	Tue, 23 Jan 2024 20:17:48 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from JPN01-TYC-obe.outbound.protection.outlook.com
 (mail-tycjpn01olkn2040.outbound.protection.outlook.com [40.92.99.40])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 2605568D0CE
 for <ffmpeg-devel@ffmpeg.org>; Tue, 23 Jan 2024 20:17:45 +0200 (EET)
ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none;
 b=fpg4KPZUJn6dLD74UMXnV54nT7AkFFowdvxDVFGc2SrIE4ZX40zE+dvcAIGj/N0XTXu8LsrRRRZ8E52nhKMmIJPRAEw0lG6/KCH4KqQfseERO4K0Ob3oeC7kkvuAp1/bYW7ZVcTLlfwe40kzc9Q2g/lYOl4aKo9X9E8zWgcQb3/0ZnKRHPwqpZyQff0hdzL93+FIN78mrE8xUS4RxWngmZVQ+oQHnHqtTeQJlQgWjwzWqf36SG9DaDIWbgZxKL8mr7ZTmT62zFTuMjR/94tve6eClGpWSa+fieYlp/QFJZG3ppvpk8kCvA/grQwUdvoS4TFgB4IF21C03JrrtmX/wg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com;
 s=arcselector9901;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1;
 bh=yw/wMwjk62nhvPosdopl326zo/bb6Gwol9R53NZTh2U=;
 b=aWEKP9vufeEH2QEy0heTvyV0uFKbeg62a4FnuVzn27gkHjXCgiDZox95bh9FOVaQbtaJX3WZMLf2C3Q6yTmV45TTqFtX3aBqUWjRKfueCx5nr38V/LUpcolgWriB6f+DcbhG67aFX07n9g/1umgYEu/RJa7TpmCP7uO0QOgSYIt3DPoO0gN8zkyTNI0Hgm7EQmUX5l25CsH8ZvE9Q+r+rx+PF3ha6AxsYjGM0OHXWF3XJn2adtnhAeTbfl9XGpZ1AwEhHV0J6JxIYcqJr2p9F2V3KMmYX8vYjjk8fPnsCXRA+dcm8TnT61EKx61B/89Sbm16jyyv35XJkw5yrPGbEg==
ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=none; dmarc=none;
 dkim=none; arc=none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=outlook.com;
 s=selector1;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=yw/wMwjk62nhvPosdopl326zo/bb6Gwol9R53NZTh2U=;
 b=tWTx5sNQUGT+i5Ay8nZC/9BsPTS5ZCJoHzLypVNfZx+Cl297IjuTY/T65SUNbiK+Xjkv/zZqsVpuiw6fAoKF/p85DxHRlG7lz4ja64gMJrozj0z1a7aWuRmyUuROPiOB7T2tHuD+EjT6Jnc//mDtszkTnlv7Jk9le19sWogpePT2AE4iGh8kM0vkEGem+eDcd2htc1AbuHwg/w/wRTJF5AwqZcOL1OHYhuQKI3Hj258yd+5fAWuIxSDef0qbhki0qKq69qVv4Fyp1Or8z7zIkMoogQcpGxcSRsKDVMlpkaoauSosrlY++CTwoiNAiHBVdFI20oSGKqVCOE/iRJLFyg==
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:173::5)
 by TYWP286MB3822.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:447::6) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.36; Tue, 23 Jan
 2024 18:17:26 +0000
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080]) by TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080%3]) with mapi id 15.20.7228.022; Tue, 23 Jan 2024
 18:17:26 +0000
From: toqsxw@outlook.com
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 24 Jan 2024 02:17:10 +0800
Message-ID: 
 <TYWP286MB21729E3F0F6082B114E7CEF0CA742@TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240123181711.402946-1-toqsxw@outlook.com>
References: <20240123181711.402946-1-toqsxw@outlook.com>
X-TMN: [lMCtSFRJAFXJI1Ry8BNSlHuENeUFVw9d]
X-ClientProxiedBy: SI2P153CA0030.APCP153.PROD.OUTLOOK.COM
 (2603:1096:4:190::15) To TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 (2603:1096:400:173::5)
X-Microsoft-Original-Message-ID: <20240123181711.402946-7-toqsxw@outlook.com>
MIME-Version: 1.0
X-MS-Exchange-MessageSentRepresentingType: 1
X-MS-PublicTrafficType: Email
X-MS-TrafficTypeDiagnostic: TYWP286MB2172:EE_|TYWP286MB3822:EE_
X-MS-Office365-Filtering-Correlation-Id: 6220f569-bbfb-46de-da85-08dc1c3f8d58
X-Microsoft-Antispam: BCL:0;
X-Microsoft-Antispam-Message-Info: 
 zDLLSVddekXyPTCkPT0LMuDrcTDku37ekJj76LvHqcGxTc1EuCMawXhM7gI8GLsxCuwRaPi/DTj4b7O/fXitt+TlvGjdZIv/sA0Dt8BPNCGnRHqQ4bHPPpkWM8v1dPy4l4X0nNG/ip+xQNvK4nxts3B5Eh/ffJUZnTJ5xZnk9qZhX7JX2/DgK4nkFhCylR7DLHCXoRxmoRkf7CXhaBvwMyQeaayho3YgXa8KCTZ4/rLbkt5tRv/YGa30Bds5MPtkj67BFYSuGcUPYKqzKpxsWB+ytTZN2sq2P6jffaSH4GmJFhYBq2n9L9ImpIh65zOdiV01vkZRnQ47aV08F7PGx8yTPmfOrNGjBjjMuVIF0OMmI9IxNpJ0pajBfNQZnOACtAhXi25NPqqRmTaMCW7BNVkv5fp1HziKwVl+YqPrLxU5sCJzbKBJ2aYaXFbluhNFgkDNDsYFyo8Iz67N+13YsN0E0qSQcz+/HiDPKPgnnI90iAxNK9QLeFik1IIBfqw0YDobaj34bGW2Gk1BM7x/kVaBuxMqZuEpWMtWRAm3krwAnV0ijXVqlJHfrb+qCtGFDvfvnnX+AuNUFluxKDSmin0NXA/Yn+S+u/fz7PhSbwXgXGmu3xlEG6so4r2meE9W
X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1
X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?q?NuSbnJD0dlBAeryvoK5+v6g1DrRB?=
	=?utf-8?q?pETSasr4o5R0VHx9Ox/ZHisJq1BgW/Ui5oFeiiuMi6jlreA+WvTTWRm5NG8Hzmyzt?=
	=?utf-8?q?k7I2ksbU1fcjXafTZ/7L7lbbp94bHGZPQPsajhkfobxbZL+nUxzKoBmtYhUHf9J0j?=
	=?utf-8?q?itvgNYiM/HdEJv03eieocRm1J/1cQTFA90q+vPh1F4PiPYGVmvjjK7xjKdueyD1jE?=
	=?utf-8?q?i+DjIuapuC1SWbD4QQBx6Q13vpsrpDPtDt4LElVBV/wQPlt5n+wFh0aL86Sfc2sgp?=
	=?utf-8?q?UPCP4Vyh5j/Uit4KNLqQmw/J+ouhfvm/tfa3DeN071N1RMFPmxdGthhIPl2EWGHWk?=
	=?utf-8?q?MBSULx+zb/b9WGkb0tYpqxRRCxKXKgXTlIctZqkYnAypGqL1GkvQhcuRdIQV49xbG?=
	=?utf-8?q?lFjZV47XpdfDt/QHzlzuR/LpXy7xwLw/VgTTF+1dFUYuhREm0syDKJgLHSAoXML6v?=
	=?utf-8?q?rWLkZR+srZsiQa6WUVEd/b7DK4HRFO6omOF2fOi5YbpDNk8bvUnaCBHE41OMFxz8a?=
	=?utf-8?q?QhD8BgElk7hCpQ2MccKa/67QpBGsGrCmR+BV7UpmG+nnveHAtSZeF2G967UKzG9vL?=
	=?utf-8?q?5WofJSAyZuGHdnrCoBbJsj/URvdYGIMhyI3QCfTwYosSRm3fD5d0kBXDehTQ0bTQ+?=
	=?utf-8?q?GCI0zXuiO3asvDMcq6e6vGa9J101id4lEkYNUkyrILgIYrI3FNPPiXapr47pIctv5?=
	=?utf-8?q?jXfzuyGyFe9JrhLDwfHr4tXG8c1e24u9HCF6QjBId966XbgdX5kFYwtkop3BPS1Xs?=
	=?utf-8?q?3Z+gI3ZzRWrQD0qJH9vCHfWAknplFbDM7/9YjPUeKR/4n49LpetYcU9gzOffF20bk?=
	=?utf-8?q?s0gL5K1zQaPTcv/+5iR2DnOEBnTJK4hFTPWQP8ySu1mdjFXGXpVZl+mjQqUMmojOL?=
	=?utf-8?q?oH2s+Zq5Yp8vSn0vyYLyyzqMoifcBJO8Na1+S3JwnD6THZZVFZ8eYsY1RfGOuN+Qs?=
	=?utf-8?q?scw1o5CnsDllQ8r3PWH3/3SnRiCozY9LL+CPwdUBvyDXIxytLko0ZlsYwPceoi5/d?=
	=?utf-8?q?oM18shpindxoLjFyxd34FjDToMOWWijWMhSX0h8vAyPZztrB/hoVCkmBbo8fhTYkj?=
	=?utf-8?q?lhcUgRTFtkuOXUarCpmKfEaFF7rPFFULEn3CIkDMVi4YPamPAtzJa2DPfT3PPzum/?=
	=?utf-8?q?RjVB+A2KBBPMP+sN0fOBlV8w40uLj9HQk/33XqS1rQ4TIiDSrGFYL1Bcdqihg=3D?=
X-OriginatorOrg: outlook.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 6220f569-bbfb-46de-da85-08dc1c3f8d58
X-MS-Exchange-CrossTenant-AuthSource: TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 23 Jan 2024 18:17:26.1266 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa
X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 
 00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: TYWP286MB3822
Subject: [FFmpeg-devel] [PATCH v4 7/8] avcodec/x86/vvc: add avg and w_avg
 AVX2 optimizations
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Wu Jianhua <toqsxw@outlook.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: zjMJ5PTVlVNm

From: Wu Jianhua <toqsxw@outlook.com>

The avg and w_avg implementations are based on dav1d's AVX2 code.
See https://code.videolan.org/videolan/dav1d/-/blob/master/src/x86/mc_avx2.asm

Benchmark results (C vs. AVX2, lower is better):

vvc_avg_8_2x2_c: 71.6
vvc_avg_8_2x2_avx2: 26.8
vvc_avg_8_2x4_c: 140.8
vvc_avg_8_2x4_avx2: 34.6
vvc_avg_8_2x8_c: 410.3
vvc_avg_8_2x8_avx2: 41.3
vvc_avg_8_2x16_c: 769.3
vvc_avg_8_2x16_avx2: 60.3
vvc_avg_8_2x32_c: 1669.6
vvc_avg_8_2x32_avx2: 105.1
vvc_avg_8_2x64_c: 1978.3
vvc_avg_8_2x64_avx2: 425.8
vvc_avg_8_2x128_c: 6536.8
vvc_avg_8_2x128_avx2: 1315.1
vvc_avg_8_4x2_c: 155.6
vvc_avg_8_4x2_avx2: 26.1
vvc_avg_8_4x4_c: 250.3
vvc_avg_8_4x4_avx2: 31.3
vvc_avg_8_4x8_c: 831.8
vvc_avg_8_4x8_avx2: 41.3
vvc_avg_8_4x16_c: 1461.1
vvc_avg_8_4x16_avx2: 57.1
vvc_avg_8_4x32_c: 2821.6
vvc_avg_8_4x32_avx2: 105.1
vvc_avg_8_4x64_c: 3615.8
vvc_avg_8_4x64_avx2: 412.6
vvc_avg_8_4x128_c: 11962.6
vvc_avg_8_4x128_avx2: 1274.3
vvc_avg_8_8x2_c: 215.8
vvc_avg_8_8x2_avx2: 29.1
vvc_avg_8_8x4_c: 430.6
vvc_avg_8_8x4_avx2: 37.6
vvc_avg_8_8x8_c: 1463.3
vvc_avg_8_8x8_avx2: 51.8
vvc_avg_8_8x16_c: 2630.1
vvc_avg_8_8x16_avx2: 97.6
vvc_avg_8_8x32_c: 5813.8
vvc_avg_8_8x32_avx2: 196.6
vvc_avg_8_8x64_c: 6687.3
vvc_avg_8_8x64_avx2: 487.8
vvc_avg_8_8x128_c: 13178.6
vvc_avg_8_8x128_avx2: 1290.6
vvc_avg_8_16x2_c: 443.8
vvc_avg_8_16x2_avx2: 28.3
vvc_avg_8_16x4_c: 1253.3
vvc_avg_8_16x4_avx2: 32.1
vvc_avg_8_16x8_c: 2236.3
vvc_avg_8_16x8_avx2: 44.3
vvc_avg_8_16x16_c: 5127.8
vvc_avg_8_16x16_avx2: 63.3
vvc_avg_8_16x32_c: 6573.3
vvc_avg_8_16x32_avx2: 223.6
vvc_avg_8_16x64_c: 30311.8
vvc_avg_8_16x64_avx2: 437.8
vvc_avg_8_16x128_c: 25693.3
vvc_avg_8_16x128_avx2: 1266.8
vvc_avg_8_32x2_c: 954.6
vvc_avg_8_32x2_avx2: 32.1
vvc_avg_8_32x4_c: 2359.6
vvc_avg_8_32x4_avx2: 39.6
vvc_avg_8_32x8_c: 5703.6
vvc_avg_8_32x8_avx2: 57.1
vvc_avg_8_32x16_c: 9967.6
vvc_avg_8_32x16_avx2: 107.1
vvc_avg_8_32x32_c: 21327.6
vvc_avg_8_32x32_avx2: 272.6
vvc_avg_8_32x64_c: 39240.8
vvc_avg_8_32x64_avx2: 529.6
vvc_avg_8_32x128_c: 52580.8
vvc_avg_8_32x128_avx2: 1338.8
vvc_avg_8_64x2_c: 1647.3
vvc_avg_8_64x2_avx2: 38.8
vvc_avg_8_64x4_c: 5130.1
vvc_avg_8_64x4_avx2: 58.8
vvc_avg_8_64x8_c: 6529.3
vvc_avg_8_64x8_avx2: 88.3
vvc_avg_8_64x16_c: 19913.6
vvc_avg_8_64x16_avx2: 162.3
vvc_avg_8_64x32_c: 39360.8
vvc_avg_8_64x32_avx2: 295.8
vvc_avg_8_64x64_c: 49658.3
vvc_avg_8_64x64_avx2: 784.1
vvc_avg_8_64x128_c: 108513.1
vvc_avg_8_64x128_avx2: 1977.1
vvc_avg_8_128x2_c: 3226.1
vvc_avg_8_128x2_avx2: 61.1
vvc_avg_8_128x4_c: 10280.3
vvc_avg_8_128x4_avx2: 94.6
vvc_avg_8_128x8_c: 18079.3
vvc_avg_8_128x8_avx2: 155.3
vvc_avg_8_128x16_c: 45121.8
vvc_avg_8_128x16_avx2: 285.3
vvc_avg_8_128x32_c: 48651.8
vvc_avg_8_128x32_avx2: 581.6
vvc_avg_8_128x64_c: 165078.6
vvc_avg_8_128x64_avx2: 1942.8
vvc_avg_8_128x128_c: 339103.1
vvc_avg_8_128x128_avx2: 4332.6
vvc_avg_10_2x2_c: 144.3
vvc_avg_10_2x2_avx2: 26.8
vvc_avg_10_2x4_c: 142.6
vvc_avg_10_2x4_avx2: 45.3
vvc_avg_10_2x8_c: 478.1
vvc_avg_10_2x8_avx2: 38.1
vvc_avg_10_2x16_c: 518.3
vvc_avg_10_2x16_avx2: 58.1
vvc_avg_10_2x32_c: 2059.8
vvc_avg_10_2x32_avx2: 93.1
vvc_avg_10_2x64_c: 2383.8
vvc_avg_10_2x64_avx2: 714.8
vvc_avg_10_2x128_c: 4498.3
vvc_avg_10_2x128_avx2: 1466.3
vvc_avg_10_4x2_c: 228.6
vvc_avg_10_4x2_avx2: 26.8
vvc_avg_10_4x4_c: 378.3
vvc_avg_10_4x4_avx2: 30.6
vvc_avg_10_4x8_c: 866.8
vvc_avg_10_4x8_avx2: 44.6
vvc_avg_10_4x16_c: 1018.1
vvc_avg_10_4x16_avx2: 58.1
vvc_avg_10_4x32_c: 3590.8
vvc_avg_10_4x32_avx2: 128.8
vvc_avg_10_4x64_c: 4200.8
vvc_avg_10_4x64_avx2: 663.6
vvc_avg_10_4x128_c: 8450.8
vvc_avg_10_4x128_avx2: 1531.8
vvc_avg_10_8x2_c: 369.3
vvc_avg_10_8x2_avx2: 28.3
vvc_avg_10_8x4_c: 513.8
vvc_avg_10_8x4_avx2: 32.1
vvc_avg_10_8x8_c: 1720.3
vvc_avg_10_8x8_avx2: 49.1
vvc_avg_10_8x16_c: 1894.8
vvc_avg_10_8x16_avx2: 71.6
vvc_avg_10_8x32_c: 3931.3
vvc_avg_10_8x32_avx2: 148.1
vvc_avg_10_8x64_c: 7964.3
vvc_avg_10_8x64_avx2: 613.1
vvc_avg_10_8x128_c: 15540.1
vvc_avg_10_8x128_avx2: 1585.1
vvc_avg_10_16x2_c: 877.3
vvc_avg_10_16x2_avx2: 27.6
vvc_avg_10_16x4_c: 955.8
vvc_avg_10_16x4_avx2: 29.8
vvc_avg_10_16x8_c: 3419.6
vvc_avg_10_16x8_avx2: 62.6
vvc_avg_10_16x16_c: 3826.8
vvc_avg_10_16x16_avx2: 54.3
vvc_avg_10_16x32_c: 7655.3
vvc_avg_10_16x32_avx2: 86.3
vvc_avg_10_16x64_c: 30011.1
vvc_avg_10_16x64_avx2: 692.6
vvc_avg_10_16x128_c: 47894.8
vvc_avg_10_16x128_avx2: 1580.3
vvc_avg_10_32x2_c: 944.3
vvc_avg_10_32x2_avx2: 29.8
vvc_avg_10_32x4_c: 2022.6
vvc_avg_10_32x4_avx2: 35.1
vvc_avg_10_32x8_c: 6148.8
vvc_avg_10_32x8_avx2: 51.3
vvc_avg_10_32x16_c: 12601.6
vvc_avg_10_32x16_avx2: 70.8
vvc_avg_10_32x32_c: 15958.6
vvc_avg_10_32x32_avx2: 124.3
vvc_avg_10_32x64_c: 31784.6
vvc_avg_10_32x64_avx2: 757.3
vvc_avg_10_32x128_c: 63892.8
vvc_avg_10_32x128_avx2: 1711.3
vvc_avg_10_64x2_c: 1890.8
vvc_avg_10_64x2_avx2: 34.3
vvc_avg_10_64x4_c: 6267.3
vvc_avg_10_64x4_avx2: 42.6
vvc_avg_10_64x8_c: 12778.1
vvc_avg_10_64x8_avx2: 67.8
vvc_avg_10_64x16_c: 22304.3
vvc_avg_10_64x16_avx2: 116.8
vvc_avg_10_64x32_c: 30777.1
vvc_avg_10_64x32_avx2: 201.1
vvc_avg_10_64x64_c: 60169.1
vvc_avg_10_64x64_avx2: 1454.3
vvc_avg_10_64x128_c: 124392.8
vvc_avg_10_64x128_avx2: 3648.6
vvc_avg_10_128x2_c: 3650.1
vvc_avg_10_128x2_avx2: 41.1
vvc_avg_10_128x4_c: 22887.8
vvc_avg_10_128x4_avx2: 64.1
vvc_avg_10_128x8_c: 14622.6
vvc_avg_10_128x8_avx2: 111.6
vvc_avg_10_128x16_c: 62207.6
vvc_avg_10_128x16_avx2: 186.3
vvc_avg_10_128x32_c: 59761.3
vvc_avg_10_128x32_avx2: 374.6
vvc_avg_10_128x64_c: 117504.3
vvc_avg_10_128x64_avx2: 2684.6
vvc_avg_10_128x128_c: 236767.6
vvc_avg_10_128x128_avx2: 15278.1
vvc_avg_12_2x2_c: 78.6
vvc_avg_12_2x2_avx2: 26.1
vvc_avg_12_2x4_c: 254.1
vvc_avg_12_2x4_avx2: 30.6
vvc_avg_12_2x8_c: 261.8
vvc_avg_12_2x8_avx2: 39.1
vvc_avg_12_2x16_c: 527.6
vvc_avg_12_2x16_avx2: 57.3
vvc_avg_12_2x32_c: 1089.1
vvc_avg_12_2x32_avx2: 93.8
vvc_avg_12_2x64_c: 2337.6
vvc_avg_12_2x64_avx2: 707.1
vvc_avg_12_2x128_c: 4582.1
vvc_avg_12_2x128_avx2: 1414.6
vvc_avg_12_4x2_c: 129.6
vvc_avg_12_4x2_avx2: 26.8
vvc_avg_12_4x4_c: 427.3
vvc_avg_12_4x4_avx2: 30.6
vvc_avg_12_4x8_c: 529.6
vvc_avg_12_4x8_avx2: 36.6
vvc_avg_12_4x16_c: 1022.1
vvc_avg_12_4x16_avx2: 57.3
vvc_avg_12_4x32_c: 1987.6
vvc_avg_12_4x32_avx2: 84.3
vvc_avg_12_4x64_c: 4147.6
vvc_avg_12_4x64_avx2: 706.3
vvc_avg_12_4x128_c: 8469.3
vvc_avg_12_4x128_avx2: 1448.3
vvc_avg_12_8x2_c: 253.6
vvc_avg_12_8x2_avx2: 27.6
vvc_avg_12_8x4_c: 836.3
vvc_avg_12_8x4_avx2: 32.1
vvc_avg_12_8x8_c: 1074.6
vvc_avg_12_8x8_avx2: 45.1
vvc_avg_12_8x16_c: 3616.8
vvc_avg_12_8x16_avx2: 71.6
vvc_avg_12_8x32_c: 3823.6
vvc_avg_12_8x32_avx2: 140.1
vvc_avg_12_8x64_c: 7764.8
vvc_avg_12_8x64_avx2: 656.1
vvc_avg_12_8x128_c: 15896.1
vvc_avg_12_8x128_avx2: 1232.8
vvc_avg_12_16x2_c: 462.1
vvc_avg_12_16x2_avx2: 26.8
vvc_avg_12_16x4_c: 1732.1
vvc_avg_12_16x4_avx2: 29.1
vvc_avg_12_16x8_c: 2097.6
vvc_avg_12_16x8_avx2: 62.6
vvc_avg_12_16x16_c: 6753.1
vvc_avg_12_16x16_avx2: 47.8
vvc_avg_12_16x32_c: 7373.1
vvc_avg_12_16x32_avx2: 80.8
vvc_avg_12_16x64_c: 15046.3
vvc_avg_12_16x64_avx2: 621.1
vvc_avg_12_16x128_c: 52574.6
vvc_avg_12_16x128_avx2: 1417.1
vvc_avg_12_32x2_c: 1712.1
vvc_avg_12_32x2_avx2: 29.8
vvc_avg_12_32x4_c: 2036.8
vvc_avg_12_32x4_avx2: 37.6
vvc_avg_12_32x8_c: 4017.6
vvc_avg_12_32x8_avx2: 44.1
vvc_avg_12_32x16_c: 8018.6
vvc_avg_12_32x16_avx2: 70.8
vvc_avg_12_32x32_c: 15637.6
vvc_avg_12_32x32_avx2: 124.3
vvc_avg_12_32x64_c: 31143.3
vvc_avg_12_32x64_avx2: 830.3
vvc_avg_12_32x128_c: 75706.8
vvc_avg_12_32x128_avx2: 1604.8
vvc_avg_12_64x2_c: 3230.3
vvc_avg_12_64x2_avx2: 33.6
vvc_avg_12_64x4_c: 4139.6
vvc_avg_12_64x4_avx2: 45.1
vvc_avg_12_64x8_c: 8201.6
vvc_avg_12_64x8_avx2: 67.1
vvc_avg_12_64x16_c: 25632.3
vvc_avg_12_64x16_avx2: 110.3
vvc_avg_12_64x32_c: 30744.3
vvc_avg_12_64x32_avx2: 200.3
vvc_avg_12_64x64_c: 105554.8
vvc_avg_12_64x64_avx2: 1325.6
vvc_avg_12_64x128_c: 235254.3
vvc_avg_12_64x128_avx2: 3132.6
vvc_avg_12_128x2_c: 6194.3
vvc_avg_12_128x2_avx2: 55.1
vvc_avg_12_128x4_c: 7583.8
vvc_avg_12_128x4_avx2: 79.3
vvc_avg_12_128x8_c: 14635.6
vvc_avg_12_128x8_avx2: 104.3
vvc_avg_12_128x16_c: 29270.8
vvc_avg_12_128x16_avx2: 194.3
vvc_avg_12_128x32_c: 60113.6
vvc_avg_12_128x32_avx2: 346.3
vvc_avg_12_128x64_c: 197030.3
vvc_avg_12_128x64_avx2: 2779.6
vvc_avg_12_128x128_c: 432809.6
vvc_avg_12_128x128_avx2: 5513.3
vvc_w_avg_8_2x2_c: 84.3
vvc_w_avg_8_2x2_avx2: 42.6
vvc_w_avg_8_2x4_c: 156.3
vvc_w_avg_8_2x4_avx2: 58.8
vvc_w_avg_8_2x8_c: 310.6
vvc_w_avg_8_2x8_avx2: 73.1
vvc_w_avg_8_2x16_c: 942.1
vvc_w_avg_8_2x16_avx2: 113.3
vvc_w_avg_8_2x32_c: 1098.8
vvc_w_avg_8_2x32_avx2: 202.6
vvc_w_avg_8_2x64_c: 2414.3
vvc_w_avg_8_2x64_avx2: 467.6
vvc_w_avg_8_2x128_c: 4763.8
vvc_w_avg_8_2x128_avx2: 1333.1
vvc_w_avg_8_4x2_c: 140.1
vvc_w_avg_8_4x2_avx2: 49.8
vvc_w_avg_8_4x4_c: 276.3
vvc_w_avg_8_4x4_avx2: 58.1
vvc_w_avg_8_4x8_c: 524.3
vvc_w_avg_8_4x8_avx2: 72.3
vvc_w_avg_8_4x16_c: 1108.1
vvc_w_avg_8_4x16_avx2: 111.8
vvc_w_avg_8_4x32_c: 2149.8
vvc_w_avg_8_4x32_avx2: 199.6
vvc_w_avg_8_4x64_c: 12288.1
vvc_w_avg_8_4x64_avx2: 509.3
vvc_w_avg_8_4x128_c: 8398.6
vvc_w_avg_8_4x128_avx2: 1319.6
vvc_w_avg_8_8x2_c: 271.1
vvc_w_avg_8_8x2_avx2: 44.1
vvc_w_avg_8_8x4_c: 503.3
vvc_w_avg_8_8x4_avx2: 61.8
vvc_w_avg_8_8x8_c: 1031.1
vvc_w_avg_8_8x8_avx2: 93.8
vvc_w_avg_8_8x16_c: 2009.8
vvc_w_avg_8_8x16_avx2: 163.1
vvc_w_avg_8_8x32_c: 4161.3
vvc_w_avg_8_8x32_avx2: 292.1
vvc_w_avg_8_8x64_c: 7940.6
vvc_w_avg_8_8x64_avx2: 592.1
vvc_w_avg_8_8x128_c: 16802.3
vvc_w_avg_8_8x128_avx2: 1287.6
vvc_w_avg_8_16x2_c: 762.6
vvc_w_avg_8_16x2_avx2: 53.6
vvc_w_avg_8_16x4_c: 1486.3
vvc_w_avg_8_16x4_avx2: 67.1
vvc_w_avg_8_16x8_c: 1907.8
vvc_w_avg_8_16x8_avx2: 96.8
vvc_w_avg_8_16x16_c: 3883.6
vvc_w_avg_8_16x16_avx2: 151.3
vvc_w_avg_8_16x32_c: 7974.8
vvc_w_avg_8_16x32_avx2: 285.8
vvc_w_avg_8_16x64_c: 25160.6
vvc_w_avg_8_16x64_avx2: 589.8
vvc_w_avg_8_16x128_c: 58328.1
vvc_w_avg_8_16x128_avx2: 1169.8
vvc_w_avg_8_32x2_c: 1009.1
vvc_w_avg_8_32x2_avx2: 65.6
vvc_w_avg_8_32x4_c: 2091.1
vvc_w_avg_8_32x4_avx2: 96.8
vvc_w_avg_8_32x8_c: 3997.8
vvc_w_avg_8_32x8_avx2: 156.3
vvc_w_avg_8_32x16_c: 8216.8
vvc_w_avg_8_32x16_avx2: 269.6
vvc_w_avg_8_32x32_c: 21746.1
vvc_w_avg_8_32x32_avx2: 635.3
vvc_w_avg_8_32x64_c: 31564.8
vvc_w_avg_8_32x64_avx2: 1010.6
vvc_w_avg_8_32x128_c: 114373.3
vvc_w_avg_8_32x128_avx2: 2013.6
vvc_w_avg_8_64x2_c: 2067.3
vvc_w_avg_8_64x2_avx2: 97.6
vvc_w_avg_8_64x4_c: 3901.1
vvc_w_avg_8_64x4_avx2: 154.8
vvc_w_avg_8_64x8_c: 7911.6
vvc_w_avg_8_64x8_avx2: 268.8
vvc_w_avg_8_64x16_c: 16508.8
vvc_w_avg_8_64x16_avx2: 501.8
vvc_w_avg_8_64x32_c: 38770.3
vvc_w_avg_8_64x32_avx2: 1287.6
vvc_w_avg_8_64x64_c: 110350.6
vvc_w_avg_8_64x64_avx2: 1890.8
vvc_w_avg_8_64x128_c: 141354.6
vvc_w_avg_8_64x128_avx2: 3839.6
vvc_w_avg_8_128x2_c: 7012.1
vvc_w_avg_8_128x2_avx2: 159.3
vvc_w_avg_8_128x4_c: 8146.8
vvc_w_avg_8_128x4_avx2: 272.6
vvc_w_avg_8_128x8_c: 24596.8
vvc_w_avg_8_128x8_avx2: 501.1
vvc_w_avg_8_128x16_c: 35918.1
vvc_w_avg_8_128x16_avx2: 948.8
vvc_w_avg_8_128x32_c: 68799.6
vvc_w_avg_8_128x32_avx2: 1963.1
vvc_w_avg_8_128x64_c: 133862.1
vvc_w_avg_8_128x64_avx2: 3833.6
vvc_w_avg_8_128x128_c: 348427.8
vvc_w_avg_8_128x128_avx2: 7682.8
vvc_w_avg_10_2x2_c: 118.6
vvc_w_avg_10_2x2_avx2: 73.1
vvc_w_avg_10_2x4_c: 189.1
vvc_w_avg_10_2x4_avx2: 89.3
vvc_w_avg_10_2x8_c: 382.8
vvc_w_avg_10_2x8_avx2: 179.8
vvc_w_avg_10_2x16_c: 658.3
vvc_w_avg_10_2x16_avx2: 185.1
vvc_w_avg_10_2x32_c: 1409.3
vvc_w_avg_10_2x32_avx2: 290.8
vvc_w_avg_10_2x64_c: 2906.8
vvc_w_avg_10_2x64_avx2: 793.1
vvc_w_avg_10_2x128_c: 6292.6
vvc_w_avg_10_2x128_avx2: 1696.8
vvc_w_avg_10_4x2_c: 178.8
vvc_w_avg_10_4x2_avx2: 80.1
vvc_w_avg_10_4x4_c: 581.6
vvc_w_avg_10_4x4_avx2: 97.6
vvc_w_avg_10_4x8_c: 693.3
vvc_w_avg_10_4x8_avx2: 128.1
vvc_w_avg_10_4x16_c: 1436.6
vvc_w_avg_10_4x16_avx2: 179.8
vvc_w_avg_10_4x32_c: 2409.1
vvc_w_avg_10_4x32_avx2: 292.3
vvc_w_avg_10_4x64_c: 4925.3
vvc_w_avg_10_4x64_avx2: 746.1
vvc_w_avg_10_4x128_c: 10664.6
vvc_w_avg_10_4x128_avx2: 1647.6
vvc_w_avg_10_8x2_c: 359.3
vvc_w_avg_10_8x2_avx2: 80.1
vvc_w_avg_10_8x4_c: 925.6
vvc_w_avg_10_8x4_avx2: 97.6
vvc_w_avg_10_8x8_c: 1360.6
vvc_w_avg_10_8x8_avx2: 121.8
vvc_w_avg_10_8x16_c: 3490.3
vvc_w_avg_10_8x16_avx2: 203.3
vvc_w_avg_10_8x32_c: 5266.1
vvc_w_avg_10_8x32_avx2: 325.8
vvc_w_avg_10_8x64_c: 11127.1
vvc_w_avg_10_8x64_avx2: 747.8
vvc_w_avg_10_8x128_c: 31058.3
vvc_w_avg_10_8x128_avx2: 1424.6
vvc_w_avg_10_16x2_c: 624.8
vvc_w_avg_10_16x2_avx2: 84.6
vvc_w_avg_10_16x4_c: 1389.6
vvc_w_avg_10_16x4_avx2: 109.1
vvc_w_avg_10_16x8_c: 2688.3
vvc_w_avg_10_16x8_avx2: 137.1
vvc_w_avg_10_16x16_c: 5387.1
vvc_w_avg_10_16x16_avx2: 224.6
vvc_w_avg_10_16x32_c: 10776.3
vvc_w_avg_10_16x32_avx2: 312.1
vvc_w_avg_10_16x64_c: 18069.1
vvc_w_avg_10_16x64_avx2: 858.6
vvc_w_avg_10_16x128_c: 43460.3
vvc_w_avg_10_16x128_avx2: 1411.6
vvc_w_avg_10_32x2_c: 1232.8
vvc_w_avg_10_32x2_avx2: 99.1
vvc_w_avg_10_32x4_c: 4017.6
vvc_w_avg_10_32x4_avx2: 134.1
vvc_w_avg_10_32x8_c: 9306.3
vvc_w_avg_10_32x8_avx2: 208.1
vvc_w_avg_10_32x16_c: 8424.6
vvc_w_avg_10_32x16_avx2: 349.3
vvc_w_avg_10_32x32_c: 20787.8
vvc_w_avg_10_32x32_avx2: 655.3
vvc_w_avg_10_32x64_c: 40972.1
vvc_w_avg_10_32x64_avx2: 904.8
vvc_w_avg_10_32x128_c: 85670.3
vvc_w_avg_10_32x128_avx2: 1751.6
vvc_w_avg_10_64x2_c: 2454.1
vvc_w_avg_10_64x2_avx2: 132.6
vvc_w_avg_10_64x4_c: 5012.6
vvc_w_avg_10_64x4_avx2: 215.6
vvc_w_avg_10_64x8_c: 10811.3
vvc_w_avg_10_64x8_avx2: 361.1
vvc_w_avg_10_64x16_c: 33349.1
vvc_w_avg_10_64x16_avx2: 904.1
vvc_w_avg_10_64x32_c: 41892.3
vvc_w_avg_10_64x32_avx2: 1220.6
vvc_w_avg_10_64x64_c: 66983.3
vvc_w_avg_10_64x64_avx2: 2622.1
vvc_w_avg_10_64x128_c: 246508.8
vvc_w_avg_10_64x128_avx2: 3316.8
vvc_w_avg_10_128x2_c: 7791.6
vvc_w_avg_10_128x2_avx2: 198.8
vvc_w_avg_10_128x4_c: 10534.3
vvc_w_avg_10_128x4_avx2: 337.3
vvc_w_avg_10_128x8_c: 21142.3
vvc_w_avg_10_128x8_avx2: 614.8
vvc_w_avg_10_128x16_c: 40968.6
vvc_w_avg_10_128x16_avx2: 1160.6
vvc_w_avg_10_128x32_c: 113043.3
vvc_w_avg_10_128x32_avx2: 1644.6
vvc_w_avg_10_128x64_c: 230658.3
vvc_w_avg_10_128x64_avx2: 5065.3
vvc_w_avg_10_128x128_c: 335236.3
vvc_w_avg_10_128x128_avx2: 6450.3
vvc_w_avg_12_2x2_c: 185.3
vvc_w_avg_12_2x2_avx2: 43.6
vvc_w_avg_12_2x4_c: 340.3
vvc_w_avg_12_2x4_avx2: 55.8
vvc_w_avg_12_2x8_c: 632.3
vvc_w_avg_12_2x8_avx2: 70.1
vvc_w_avg_12_2x16_c: 728.3
vvc_w_avg_12_2x16_avx2: 108.1
vvc_w_avg_12_2x32_c: 1392.6
vvc_w_avg_12_2x32_avx2: 176.8
vvc_w_avg_12_2x64_c: 2618.3
vvc_w_avg_12_2x64_avx2: 757.3
vvc_w_avg_12_2x128_c: 6408.8
vvc_w_avg_12_2x128_avx2: 1435.1
vvc_w_avg_12_4x2_c: 349.3
vvc_w_avg_12_4x2_avx2: 44.3
vvc_w_avg_12_4x4_c: 607.1
vvc_w_avg_12_4x4_avx2: 52.6
vvc_w_avg_12_4x8_c: 1134.8
vvc_w_avg_12_4x8_avx2: 70.1
vvc_w_avg_12_4x16_c: 1378.1
vvc_w_avg_12_4x16_avx2: 115.3
vvc_w_avg_12_4x32_c: 2599.3
vvc_w_avg_12_4x32_avx2: 174.3
vvc_w_avg_12_4x64_c: 4474.8
vvc_w_avg_12_4x64_avx2: 656.1
vvc_w_avg_12_4x128_c: 11319.6
vvc_w_avg_12_4x128_avx2: 1373.1
vvc_w_avg_12_8x2_c: 595.8
vvc_w_avg_12_8x2_avx2: 44.3
vvc_w_avg_12_8x4_c: 1164.3
vvc_w_avg_12_8x4_avx2: 56.6
vvc_w_avg_12_8x8_c: 2019.6
vvc_w_avg_12_8x8_avx2: 80.1
vvc_w_avg_12_8x16_c: 4071.6
vvc_w_avg_12_8x16_avx2: 139.3
vvc_w_avg_12_8x32_c: 4485.1
vvc_w_avg_12_8x32_avx2: 250.6
vvc_w_avg_12_8x64_c: 8404.8
vvc_w_avg_12_8x64_avx2: 735.8
vvc_w_avg_12_8x128_c: 35679.8
vvc_w_avg_12_8x128_avx2: 1252.6
vvc_w_avg_12_16x2_c: 1114.8
vvc_w_avg_12_16x2_avx2: 46.6
vvc_w_avg_12_16x4_c: 2240.1
vvc_w_avg_12_16x4_avx2: 62.6
vvc_w_avg_12_16x8_c: 13174.6
vvc_w_avg_12_16x8_avx2: 88.6
vvc_w_avg_12_16x16_c: 5334.6
vvc_w_avg_12_16x16_avx2: 144.3
vvc_w_avg_12_16x32_c: 8378.1
vvc_w_avg_12_16x32_avx2: 234.6
vvc_w_avg_12_16x64_c: 21300.8
vvc_w_avg_12_16x64_avx2: 761.8
vvc_w_avg_12_16x128_c: 32786.8
vvc_w_avg_12_16x128_avx2: 1432.8
vvc_w_avg_12_32x2_c: 2154.3
vvc_w_avg_12_32x2_avx2: 61.1
vvc_w_avg_12_32x4_c: 4299.8
vvc_w_avg_12_32x4_avx2: 83.1
vvc_w_avg_12_32x8_c: 7964.8
vvc_w_avg_12_32x8_avx2: 132.6
vvc_w_avg_12_32x16_c: 13321.6
vvc_w_avg_12_32x16_avx2: 234.6
vvc_w_avg_12_32x32_c: 21149.3
vvc_w_avg_12_32x32_avx2: 433.3
vvc_w_avg_12_32x64_c: 43666.6
vvc_w_avg_12_32x64_avx2: 876.6
vvc_w_avg_12_32x128_c: 83189.8
vvc_w_avg_12_32x128_avx2: 1756.6
vvc_w_avg_12_64x2_c: 3829.8
vvc_w_avg_12_64x2_avx2: 83.1
vvc_w_avg_12_64x4_c: 8588.1
vvc_w_avg_12_64x4_avx2: 127.1
vvc_w_avg_12_64x8_c: 17027.6
vvc_w_avg_12_64x8_avx2: 310.6
vvc_w_avg_12_64x16_c: 29797.8
vvc_w_avg_12_64x16_avx2: 415.6
vvc_w_avg_12_64x32_c: 43854.3
vvc_w_avg_12_64x32_avx2: 773.3
vvc_w_avg_12_64x64_c: 137767.3
vvc_w_avg_12_64x64_avx2: 1608.6
vvc_w_avg_12_64x128_c: 316428.3
vvc_w_avg_12_64x128_avx2: 3249.8
vvc_w_avg_12_128x2_c: 8824.6
vvc_w_avg_12_128x2_avx2: 130.3
vvc_w_avg_12_128x4_c: 17173.6
vvc_w_avg_12_128x4_avx2: 219.3
vvc_w_avg_12_128x8_c: 21997.8
vvc_w_avg_12_128x8_avx2: 397.3
vvc_w_avg_12_128x16_c: 43553.8
vvc_w_avg_12_128x16_avx2: 790.1
vvc_w_avg_12_128x32_c: 89792.1
vvc_w_avg_12_128x32_avx2: 1497.6
vvc_w_avg_12_128x64_c: 226573.3
vvc_w_avg_12_128x64_avx2: 3153.1
vvc_w_avg_12_128x128_c: 332090.1
vvc_w_avg_12_128x128_avx2: 6499.6

Signed-off-by: Wu Jianhua <toqsxw@outlook.com>
---
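Note for reviewers: as a reading aid, here is a minimal scalar sketch of
the bi-prediction average that the avg kernels vectorize, for 8-bit
output only. It assumes the usual VVC rounding (shift = max(3, 15 -
bit_depth), offset = 1 << (shift - 1)) and int16_t intermediates stored
with a row stride of MAX_PB_SIZE. The names avg_sketch_8 and clip_u8 are
hypothetical and are not functions from this patch.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 128 /* row stride of the int16_t intermediates (assumed) */

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Average two intermediate prediction blocks into 8-bit pixels. */
static void avg_sketch_8(uint8_t *dst, ptrdiff_t dst_stride,
                         const int16_t *src0, const int16_t *src1,
                         int width, int height)
{
    const int shift  = 15 - 8;           /* max(3, 15 - bit_depth) for 8-bit */
    const int offset = 1 << (shift - 1); /* round to nearest */

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = clip_u8((src0[x] + src1[x] + offset) >> shift);
        src0 += MAX_PB_SIZE;
        src1 += MAX_PB_SIZE;
        dst  += dst_stride;
    }
}
```

The weighted variant (w_avg) presumably differs only in scaling each
source by a weight and using a larger shift and offset before the same
clip; that part is not sketched here.
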
 libavcodec/x86/vvc/Makefile      |   3 +-
 libavcodec/x86/vvc/vvc_mc.asm    | 303 +++++++++++++++++++++++++++++++
 libavcodec/x86/vvc/vvcdsp_init.c |  52 ++++++
 3 files changed, 357 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/x86/vvc/vvc_mc.asm

diff --git a/libavcodec/x86/vvc/Makefile b/libavcodec/x86/vvc/Makefile
index b4acc22501..29765a6c48 100644
--- a/libavcodec/x86/vvc/Makefile
+++ b/libavcodec/x86/vvc/Makefile
@@ -2,5 +2,6 @@ clean::
 	$(RM) $(CLEANSUFFIXES:%=libavcodec/x86/vvc/%)
 
 OBJS-$(CONFIG_VVC_DECODER)             += x86/vvc/vvcdsp_init.o
-X86ASM-OBJS-$(CONFIG_VVC_DECODER)      += x86/h26x/h2656dsp.o               \
+X86ASM-OBJS-$(CONFIG_VVC_DECODER)      += x86/vvc/vvc_mc.o       \
+                                          x86/h26x/h2656dsp.o    \
 										  x86/h26x/h2656_inter.o
diff --git a/libavcodec/x86/vvc/vvc_mc.asm b/libavcodec/x86/vvc/vvc_mc.asm
new file mode 100644
index 0000000000..948883b61b
--- /dev/null
+++ b/libavcodec/x86/vvc/vvc_mc.asm
@@ -0,0 +1,303 @@
+; /*
+; * Provide SIMD MC functions for VVC decoding
+; *
+; * Copyright © 2021, VideoLAN and dav1d authors
+; * Copyright © 2021, Two Orioles, LLC
+; * All rights reserved.
+; *
+; * Copyright (c) 2023-2024 Nuo Mi
+; * Copyright (c) 2023-2024 Wu Jianhua
+; *
+; * This file is part of FFmpeg.
+; *
+; * FFmpeg is free software; you can redistribute it and/or
+; * modify it under the terms of the GNU Lesser General Public
+; * License as published by the Free Software Foundation; either
+; * version 2.1 of the License, or (at your option) any later version.
+; *
+; * FFmpeg is distributed in the hope that it will be useful,
+; * but WITHOUT ANY WARRANTY; without even the implied warranty of
+; * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+; * Lesser General Public License for more details.
+; *
+; * You should have received a copy of the GNU Lesser General Public
+; * License along with FFmpeg; if not, write to the Free Software
+; * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+; */
+
+%include "libavutil/x86/x86util.asm"
+
+%define MAX_PB_SIZE 128
+
+SECTION_RODATA 32
+
+pw_0    times 2 dw   0
+pw_1    times 2 dw   1
+pw_4    times 2 dw   4
+pw_12   times 2 dw  12
+pw_256  times 2 dw 256
+
+%macro AVG_JMP_TABLE 3-*
+    %xdefine %1_%2_%3_table (%%table - 2*%4)
+    %xdefine %%base %1_%2_%3_table
+    %xdefine %%prefix mangle(private_prefix %+ _vvc_%1_%2bpc_%3)
+    %%table:
+    %rep %0 - 3
+        dd %%prefix %+ .w%4 - %%base
+        %rotate 1
+    %endrep
+%endmacro
+
+%if ARCH_X86_64
+AVG_JMP_TABLE    avg,  8, avx2,                2, 4, 8, 16, 32, 64, 128
+AVG_JMP_TABLE    avg, 16, avx2,                2, 4, 8, 16, 32, 64, 128
+AVG_JMP_TABLE  w_avg,  8, avx2,                2, 4, 8, 16, 32, 64, 128
+AVG_JMP_TABLE  w_avg, 16, avx2,                2, 4, 8, 16, 32, 64, 128
+%endif
+
+SECTION .text
+
+%macro AVG_W16_FN 3 ; bpc, op, count
+    %assign %%i 0
+    %rep %3
+        %define off %%i
+        AVG_LOAD_W16        0, off
+        %2
+        AVG_SAVE_W16       %1, 0, off
+
+
+        AVG_LOAD_W16        1, off
+        %2
+        AVG_SAVE_W16       %1, 1, off
+
+        %assign %%i %%i+1
+    %endrep
+%endmacro
+
+%macro AVG_FN 2 ; bpc, op
+    jmp                  wq
+
+.w2:
+    movd                xm0, [src0q]
+    pinsrd              xm0, [src0q + AVG_SRC_STRIDE], 1
+    movd                xm1, [src1q]
+    pinsrd              xm1, [src1q + AVG_SRC_STRIDE], 1
+    %2
+    AVG_SAVE_W2          %1
+    AVG_LOOP_END        .w2
+
+.w4:
+    movq                xm0, [src0q]
+    pinsrq              xm0, [src0q + AVG_SRC_STRIDE], 1
+    movq                xm1, [src1q]
+    pinsrq              xm1, [src1q + AVG_SRC_STRIDE], 1
+    %2
+    AVG_SAVE_W4          %1
+
+    AVG_LOOP_END        .w4
+
+.w8:
+    vinserti128         m0, m0, [src0q], 0
+    vinserti128         m0, m0, [src0q + AVG_SRC_STRIDE], 1
+    vinserti128         m1, m1, [src1q], 0
+    vinserti128         m1, m1, [src1q + AVG_SRC_STRIDE], 1
+    %2
+    AVG_SAVE_W8         %1
+
+    AVG_LOOP_END       .w8
+
+.w16:
+    AVG_W16_FN          %1, %2, 1
+
+    AVG_LOOP_END       .w16
+
+.w32:
+    AVG_W16_FN          %1, %2, 2
+
+    AVG_LOOP_END       .w32
+
+.w64:
+    AVG_W16_FN          %1, %2, 4
+
+    AVG_LOOP_END       .w64
+
+.w128:
+    AVG_W16_FN          %1, %2, 8
+
+    AVG_LOOP_END       .w128
+
+.ret:
+    RET
+%endmacro
+
+%macro AVG   0
+    paddsw               m0, m1
+    pmulhrsw             m0, m2
+    CLIPW                m0, m3, m4
+%endmacro
+
+%macro W_AVG 0
+    punpckhwd            m5, m0, m1
+    pmaddwd              m5, m3
+    paddd                m5, m4
+    psrad                m5, xm2
+
+    punpcklwd            m0, m0, m1
+    pmaddwd              m0, m3
+    paddd                m0, m4
+    psrad                m0, xm2
+
+    packssdw             m0, m5
+    CLIPW                m0, m6, m7
+%endmacro
+
+%macro AVG_LOAD_W16 2  ; line, offset
+    movu               m0, [src0q + %1 * AVG_SRC_STRIDE + %2 * 32]
+    movu               m1, [src1q + %1 * AVG_SRC_STRIDE + %2 * 32]
+%endmacro
+
+%macro AVG_SAVE_W2 1 ;bpc
+    %if %1 == 16
+        pextrd           [dstq], xm0, 0
+        pextrd [dstq + strideq], xm0, 1
+    %else
+        packuswb           m0, m0
+        pextrw           [dstq], xm0, 0
+        pextrw [dstq + strideq], xm0, 1
+    %endif
+%endmacro
+
+%macro AVG_SAVE_W4 1 ;bpc
+    %if %1 == 16
+        pextrq           [dstq], xm0, 0
+        pextrq [dstq + strideq], xm0, 1
+    %else
+        packuswb           m0, m0
+        pextrd           [dstq], xm0, 0
+        pextrd [dstq + strideq], xm0, 1
+    %endif
+%endmacro
+
+%macro AVG_SAVE_W8 1 ;bpc
+    %if %1 == 16
+        vextracti128            [dstq], m0, 0
+        vextracti128  [dstq + strideq], m0, 1
+    %else
+        packuswb                    m0, m0
+        vpermq                      m0, m0, 1000b
+        pextrq                  [dstq], xm0, 0
+        pextrq        [dstq + strideq], xm0, 1
+    %endif
+%endmacro
+
+%macro AVG_SAVE_W16 3 ; bpc, line, offset
+    %if %1 == 16
+        movu               [dstq + %2 * strideq + %3 * 32], m0
+    %else
+        packuswb                                        m0, m0
+        vpermq                                          m0, m0, 1000b
+        vextracti128       [dstq + %2 * strideq + %3 * 16], m0, 0
+    %endif
+%endmacro
+
+%macro AVG_LOOP_END 1
+    sub                  hd, 2
+    je                 .ret
+
+    lea               src0q, [src0q + 2 * AVG_SRC_STRIDE]
+    lea               src1q, [src1q + 2 * AVG_SRC_STRIDE]
+    lea                dstq, [dstq + 2 * strideq]
+    jmp                  %1
+%endmacro
+
+%define AVG_SRC_STRIDE MAX_PB_SIZE*2
+
+;void ff_vvc_avg_%1bpc_avx2(uint8_t *dst, ptrdiff_t dst_stride,
+;   const int16_t *src0, const int16_t *src1, intptr_t width, intptr_t height, intptr_t pixel_max);
+%macro VVC_AVG_AVX2 1
+cglobal vvc_avg_%1bpc, 4, 7, 5, dst, stride, src0, src1, w, h, bd
+    movifnidn            hd, hm
+
+    pxor                 m3, m3             ; pixel min
+    vpbroadcastw         m4, bdm            ; pixel max
+
+    movifnidn           bdd, bdm
+    inc                 bdd
+    tzcnt               bdd, bdd            ; bit depth
+
+    sub                 bdd, 8
+    movd                xm0, bdd
+    vpbroadcastd         m1, [pw_4]
+    pminuw               m0, m1
+    vpbroadcastd         m2, [pw_256]
+    psllw                m2, xm0                ; shift
+
+    lea                  r6, [avg_%1 %+ SUFFIX %+ _table]
+    tzcnt                wd, wm
+    movsxd               wq, dword [r6+wq*4]
+    add                  wq, r6
+    AVG_FN               %1, AVG
+%endmacro
+
+;void ff_vvc_w_avg_%1bpc_avx2(uint8_t *dst, ptrdiff_t dst_stride,
+;    const int16_t *src0, const int16_t *src1, intptr_t width, intptr_t height,
+;    intptr_t denom, intptr_t w0, intptr_t w1,  intptr_t o0, intptr_t o1, intptr_t pixel_max);
+%macro VVC_W_AVG_AVX2 1
+cglobal vvc_w_avg_%1bpc, 4, 7, 8, dst, stride, src0, src1, w, h, t0, t1
+
+    movifnidn            hd, hm
+
+    movifnidn           t0d, r8m                ; w1
+    shl                 t0d, 16
+    mov                 t0w, r7m                ; w0
+    movd                xm3, t0d
+    vpbroadcastd         m3, xm3                ; w0, w1
+
+    pxor                m6, m6                  ;pixel min
+    vpbroadcastw        m7, r11m                ;pixel max
+
+    mov                 t1q, rcx                ; save rcx
+    mov                 ecx, r11m
+    inc                 ecx                     ; pixel_max + 1 = 1 << bd
+    tzcnt               ecx, ecx                ; bd
+    sub                 ecx, 8
+    mov                 t0d, r9m                ; o0
+    add                 t0d, r10m               ; o1
+    shl                 t0d, cl
+    inc                 t0d                     ;((o0 + o1) << (BIT_DEPTH - 8)) + 1
+
+    neg                 ecx
+    add                 ecx, 4                  ; 12 - bd
+    cmovl               ecx, [pw_0]             ; max(12 - bd, 0)
+    add                 ecx, 3
+    add                 ecx, r6m
+    movd                xm2, ecx                ; shift
+
+    dec                ecx
+    shl                t0d, cl
+    movd               xm4, t0d
+    vpbroadcastd        m4, xm4                 ; offset
+    mov                rcx, t1q                 ; restore rcx
+
+    lea                 r6, [w_avg_%1 %+ SUFFIX %+ _table]
+    tzcnt               wd, wm
+    movsxd              wq, dword [r6+wq*4]
+    add                 wq, r6
+    AVG_FN              %1, W_AVG
+%endmacro
+
+%if ARCH_X86_64
+
+%if HAVE_AVX2_EXTERNAL
+INIT_YMM avx2
+
+VVC_AVG_AVX2 16
+
+VVC_AVG_AVX2 8
+
+VVC_W_AVG_AVX2 16
+
+VVC_W_AVG_AVX2 8
+%endif
+
+%endif
diff --git a/libavcodec/x86/vvc/vvcdsp_init.c b/libavcodec/x86/vvc/vvcdsp_init.c
index c197cdb4cc..909ef9f56b 100644
--- a/libavcodec/x86/vvc/vvcdsp_init.c
+++ b/libavcodec/x86/vvc/vvcdsp_init.c
@@ -169,6 +169,42 @@ FW_PUT_16BPC_AVX2(12);
     MC_TAP_LINKS_16BPC_AVX2(LUMA,   8, bd);                          \
     MC_TAP_LINKS_16BPC_AVX2(CHROMA, 4, bd);
 
+#define bf(fn, bd,  opt) fn##_##bd##_##opt
+#define BF(fn, bpc, opt) fn##_##bpc##bpc_##opt
+
+#define AVG_BPC_FUNC(bpc, opt)                                                                      \
+void BF(ff_vvc_avg, bpc, opt)(uint8_t *dst, ptrdiff_t dst_stride,                                   \
+    const int16_t *src0, const int16_t *src1, intptr_t width, intptr_t height, intptr_t pixel_max); \
+void BF(ff_vvc_w_avg, bpc, opt)(uint8_t *dst, ptrdiff_t dst_stride,                                 \
+    const int16_t *src0, const int16_t *src1, intptr_t width, intptr_t height,                      \
+    intptr_t denom, intptr_t w0, intptr_t w1,  intptr_t o0, intptr_t o1, intptr_t pixel_max);
+
+#define AVG_FUNCS(bpc, bd, opt)                                                                     \
+static void bf(avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,                                    \
+    const int16_t *src0, const int16_t *src1, int width, int height)                                \
+{                                                                                                   \
+    BF(ff_vvc_avg, bpc, opt)(dst, dst_stride, src0, src1, width, height, (1 << bd)  - 1);           \
+}                                                                                                   \
+static void bf(w_avg, bd, opt)(uint8_t *dst, ptrdiff_t dst_stride,                                  \
+    const int16_t *src0, const int16_t *src1, int width, int height,                                \
+    int denom, int w0, int w1, int o0, int o1)                                                      \
+{                                                                                                   \
+    BF(ff_vvc_w_avg, bpc, opt)(dst, dst_stride, src0, src1, width, height,                          \
+        denom, w0, w1, o0, o1, (1 << bd)  - 1);                                                     \
+}
+
+AVG_BPC_FUNC(8,   avx2)
+AVG_BPC_FUNC(16,  avx2)
+
+AVG_FUNCS(8,  8,  avx2)
+AVG_FUNCS(16, 10, avx2)
+AVG_FUNCS(16, 12, avx2)
+
+#define AVG_INIT(bd, opt) do {                                          \
+    c->inter.avg    = bf(avg, bd, opt);                                 \
+    c->inter.w_avg  = bf(w_avg, bd, opt);                               \
+} while (0)
+
 void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
 {
     const int cpu_flags = av_get_cpu_flags();
@@ -198,5 +234,21 @@ void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
                 MC_LINKS_16BPC_AVX2(12);
             }
         }
+
+        if (EXTERNAL_AVX2(cpu_flags)) {
+            switch (bd) {
+                case 8:
+                    AVG_INIT(8, avx2);
+                    break;
+                case 10:
+                    AVG_INIT(10, avx2);
+                    break;
+                case 12:
+                    AVG_INIT(12, avx2);
+                    break;
+                default:
+                    break;
+            }
+        }
     }
 }

From patchwork Tue Jan 23 18:17:11 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wu Jianhua <toqsxw@outlook.com>
X-Patchwork-Id: 45752
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:120f:b0:199:de12:6fa6 with SMTP id v15csp801938pzf;
        Tue, 23 Jan 2024 10:18:34 -0800 (PST)
X-Google-Smtp-Source: 
 AGHT+IHut7ShgDTU9vxgs6dyKhYV5fX73JlmKLOqq/kci5d1Ztcj69BxdBzEP4BV4ROz5OnrPOby
X-Received: by 2002:a17:906:fa0a:b0:a30:d9b0:ce6e with SMTP id
 lo10-20020a170906fa0a00b00a30d9b0ce6emr80663ejb.191.1706033914625;
        Tue, 23 Jan 2024 10:18:34 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 j2-20020a170906254200b00a2b12e231b1si11575173ejb.334.2024.01.23.10.18.33;
        Tue, 23 Jan 2024 10:18:34 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@outlook.com
 header.s=selector1 header.b="Dj/sqZp8";
       arc=fail (body hash mismatch);
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=QUARANTINE dis=NONE) header.from=outlook.com
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 6978F68D106;
	Tue, 23 Jan 2024 20:17:49 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from JPN01-TYC-obe.outbound.protection.outlook.com
 (mail-tycjpn01olkn2040.outbound.protection.outlook.com [40.92.99.40])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id A6E9168D0CE
 for <ffmpeg-devel@ffmpeg.org>; Tue, 23 Jan 2024 20:17:45 +0200 (EET)
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:173::5)
 by TYWP286MB3822.JPNP286.PROD.OUTLOOK.COM (2603:1096:400:447::6) with
 Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7202.36; Tue, 23 Jan
 2024 18:17:27 +0000
Received: from TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080]) by TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 ([fe80::2fb1:781b:3f26:a080%3]) with mapi id 15.20.7228.022; Tue, 23 Jan 2024
 18:17:27 +0000
From: toqsxw@outlook.com
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 24 Jan 2024 02:17:11 +0800
Message-ID: 
 <TYWP286MB2172EB4579A8E7D32CEE9535CA742@TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240123181711.402946-1-toqsxw@outlook.com>
References: <20240123181711.402946-1-toqsxw@outlook.com>
X-TMN: [Pi39DCj58ed6gbDRTEBVVeYjZ3iUvjYM]
X-ClientProxiedBy: SI2P153CA0030.APCP153.PROD.OUTLOOK.COM
 (2603:1096:4:190::15) To TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
 (2603:1096:400:173::5)
X-Microsoft-Original-Message-ID: <20240123181711.402946-8-toqsxw@outlook.com>
MIME-Version: 1.0
X-MS-Exchange-MessageSentRepresentingType: 1
X-MS-PublicTrafficType: Email
X-MS-TrafficTypeDiagnostic: TYWP286MB2172:EE_|TYWP286MB3822:EE_
X-MS-Office365-Filtering-Correlation-Id: 14b7e8f5-5910-47d3-f512-08dc1c3f8ddc
X-OriginatorOrg: outlook.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 14b7e8f5-5910-47d3-f512-08dc1c3f8ddc
X-MS-Exchange-CrossTenant-AuthSource: TYWP286MB2172.JPNP286.PROD.OUTLOOK.COM
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 23 Jan 2024 18:17:27.0245 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa
X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 
 00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: TYWP286MB3822
Subject: [FFmpeg-devel] [PATCH v4 8/8] tests/checkasm/vvc_mc: add check_avg
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Wu Jianhua <toqsxw@outlook.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: 3MzvF+UaXNiz

From: Wu Jianhua <toqsxw@outlook.com>

Signed-off-by: Wu Jianhua <toqsxw@outlook.com>
---
 tests/checkasm/vvc_mc.c | 64 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
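
Reviewer note (ignored by git am): a self-contained C sketch of the
comparison pattern check_avg follows, without the checkasm macros. Both
destination buffers are zeroed, the reference and the candidate are run
over every power-of-two width and height, and the whole buffers are
compared so that stray writes outside the w x h block are caught too.
SKETCH_MAX, avg8_fn, avg8_ref and compare_avg8 are hypothetical names;
the 8-bit rounding (+64, >> 7) and the fixed source stride of 128
int16_t elements are assumptions taken from the asm in the previous
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX 128 /* stands in for MAX_CTU_SIZE / MAX_PB_SIZE */

typedef void (*avg8_fn)(uint8_t *dst, ptrdiff_t dst_stride,
                        const int16_t *src0, const int16_t *src1,
                        int width, int height);

/* Plain 8-bit average used as the reference side of the comparison. */
static void avg8_ref(uint8_t *dst, ptrdiff_t dst_stride,
                     const int16_t *src0, const int16_t *src1,
                     int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = (src0[x] + src1[x] + 64) >> 7;
            dst[x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
        src0 += SKETCH_MAX; /* fixed source stride, in elements */
        src1 += SKETCH_MAX;
        dst  += dst_stride;
    }
}

/* Returns the number of (w, h) combinations whose outputs differ. */
static int compare_avg8(avg8_fn ref, avg8_fn new_fn)
{
    static int16_t src0[SKETCH_MAX * SKETCH_MAX], src1[SKETCH_MAX * SKETCH_MAX];
    static uint8_t dst0[SKETCH_MAX * SKETCH_MAX], dst1[SKETCH_MAX * SKETCH_MAX];
    unsigned seed = 1;
    int failures = 0;

    /* Simple LCG fill; the real test uses checkasm's rnd(). */
    for (int i = 0; i < SKETCH_MAX * SKETCH_MAX; i++) {
        seed = seed * 1103515245u + 12345u;
        src0[i] = (int16_t)((seed >> 16) & 0x3fff);
        seed = seed * 1103515245u + 12345u;
        src1[i] = (int16_t)((seed >> 16) & 0x3fff);
    }
    for (int h = 2; h <= SKETCH_MAX; h *= 2) {
        for (int w = 2; w <= SKETCH_MAX; w *= 2) {
            memset(dst0, 0, sizeof(dst0));
            memset(dst1, 0, sizeof(dst1));
            ref(dst0, SKETCH_MAX, src0, src1, w, h);
            new_fn(dst1, SKETCH_MAX, src0, src1, w, h);
            if (memcmp(dst0, dst1, sizeof(dst0)))
                failures++;
        }
    }
    return failures;
}
```

Passing the same function on both sides should report zero failures; a
candidate that rounds differently or writes past the block width would
be counted as a mismatch.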

diff --git a/tests/checkasm/vvc_mc.c b/tests/checkasm/vvc_mc.c
index 711280deec..8adb00573f 100644
--- a/tests/checkasm/vvc_mc.c
+++ b/tests/checkasm/vvc_mc.c
@@ -35,6 +35,7 @@
 static const uint32_t pixel_mask[] = { 0xffffffff, 0x03ff03ff, 0x0fff0fff, 0x3fff3fff, 0xffffffff };
 static const int sizes[] = { 2, 4, 8, 16, 32, 64, 128 };
 
+#define SIZEOF_PIXEL ((bit_depth + 7) / 8)
 #define PIXEL_STRIDE (MAX_CTU_SIZE * 2)
 #define EXTRA_BEFORE 3
 #define EXTRA_AFTER  4
@@ -261,10 +262,73 @@ static void check_put_vvc_chroma_uni(void)
     report("put_uni_chroma");
 }
 
+#define AVG_SRC_BUF_SIZE (MAX_CTU_SIZE * MAX_CTU_SIZE)
+#define AVG_DST_BUF_SIZE (MAX_PB_SIZE * MAX_PB_SIZE * 2)
+
+static void check_avg(void)
+{
+    LOCAL_ALIGNED_32(int16_t, src00, [AVG_SRC_BUF_SIZE]);
+    LOCAL_ALIGNED_32(int16_t, src01, [AVG_SRC_BUF_SIZE]);
+    LOCAL_ALIGNED_32(int16_t, src10, [AVG_SRC_BUF_SIZE]);
+    LOCAL_ALIGNED_32(int16_t, src11, [AVG_SRC_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, dst0, [AVG_DST_BUF_SIZE]);
+    LOCAL_ALIGNED_32(uint8_t, dst1, [AVG_DST_BUF_SIZE]);
+    VVCDSPContext c;
+
+    for (int bit_depth = 8; bit_depth <= 12; bit_depth += 2) {
+        randomize_avg_src((uint8_t*)src00, (uint8_t*)src10, AVG_SRC_BUF_SIZE * sizeof(int16_t));
+        randomize_avg_src((uint8_t*)src01, (uint8_t*)src11, AVG_SRC_BUF_SIZE * sizeof(int16_t));
+        ff_vvc_dsp_init(&c, bit_depth);
+        for (int h = 2; h <= MAX_CTU_SIZE; h *= 2) {
+            for (int w = 2; w <= MAX_CTU_SIZE; w *= 2) {
+                {
+                    declare_func_emms(AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT, void, uint8_t *dst, ptrdiff_t dst_stride,
+                        const int16_t *src0, const int16_t *src1, int width, int height);
+                    if (check_func(c.inter.avg, "avg_%d_%dx%d", bit_depth, w, h)) {
+                        memset(dst0, 0, AVG_DST_BUF_SIZE);
+                        memset(dst1, 0, AVG_DST_BUF_SIZE);
+                        call_ref(dst0, MAX_CTU_SIZE * SIZEOF_PIXEL, src00, src01, w, h);
+                        call_new(dst1, MAX_CTU_SIZE * SIZEOF_PIXEL, src10, src11, w, h);
+                        if (memcmp(dst0, dst1, DST_BUF_SIZE))
+                            fail();
+                        if (w == h)
+                            bench_new(dst0, MAX_CTU_SIZE * SIZEOF_PIXEL, src00, src01, w, h);
+                    }
+                }
+                {
+                    declare_func_emms(AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT, void, uint8_t *dst, ptrdiff_t dst_stride,
+                        const int16_t *src0, const int16_t *src1, int width, int height,
+                        int denom, int w0, int w1, int o0, int o1);
+                    {
+                        const int denom = rnd() % 8;
+                        const int w0    = rnd() % 256 - 128;
+                        const int w1    = rnd() % 256 - 128;
+                        const int o0    = rnd() % 256 - 128;
+                        const int o1    = rnd() % 256 - 128;
+                        if (check_func(c.inter.w_avg, "w_avg_%d_%dx%d", bit_depth, w, h)) {
+                            memset(dst0, 0, AVG_DST_BUF_SIZE);
+                            memset(dst1, 0, AVG_DST_BUF_SIZE);
+
+                            call_ref(dst0, MAX_CTU_SIZE * SIZEOF_PIXEL, src00, src01, w, h, denom, w0, w1, o0, o1);
+                            call_new(dst1, MAX_CTU_SIZE * SIZEOF_PIXEL, src10, src11, w, h, denom, w0, w1, o0, o1);
+                            if (memcmp(dst0, dst1, DST_BUF_SIZE))
+                                fail();
+                            if (w == h)
+                                bench_new(dst0, MAX_CTU_SIZE * SIZEOF_PIXEL, src00, src01, w, h, denom, w0, w1, o0, o1);
+                        }
+                    }
+                }
+            }
+        }
+    }
+    report("avg");
+}
+
 void checkasm_check_vvc_mc(void)
 {
     check_put_vvc_luma();
     check_put_vvc_luma_uni();
     check_put_vvc_chroma();
     check_put_vvc_chroma_uni();
+    check_avg();
 }