From patchwork Mon Jan 22 17:46:23 2024
X-Patchwork-Submitter: Wu Jianhua
X-Patchwork-Id: 45718
From: toqsxw@outlook.com
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 23 Jan 2024 01:46:23 +0800
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240122174628.1206503-1-toqsxw@outlook.com>
References: <20240122174628.1206503-1-toqsxw@outlook.com>
Subject: [FFmpeg-devel] [PATCH v3 3/8] avcodec/x86/hevc_mc: move put/put_uni to h26x/h2656_inter.asm

From: Wu Jianhua

This enables the asm optimizations to be reused by the VVC decoder.

Signed-off-by: Wu Jianhua
---
(A usage sketch for the new ff_h2656_* entry points is appended after the
diff.)

 libavcodec/x86/Makefile             |    1 +
 libavcodec/x86/h26x/h2656_inter.asm | 1145 +++++++++++++++++++++++++++
 libavcodec/x86/h26x/h2656dsp.c      |   98 +++
 libavcodec/x86/h26x/h2656dsp.h      |  103 +++
 libavcodec/x86/hevc_mc.asm          |  462 +----------
 libavcodec/x86/hevcdsp_init.c       |  108 ++-
 6 files changed, 1471 insertions(+), 446 deletions(-)
 create mode 100644 libavcodec/x86/h26x/h2656_inter.asm
 create mode 100644 libavcodec/x86/h26x/h2656dsp.c
 create mode 100644 libavcodec/x86/h26x/h2656dsp.h

diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index d5fb30645a..8098cd840c 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -167,6 +167,7 @@ X86ASM-OBJS-$(CONFIG_HEVC_DECODER)     += x86/hevc_add_res.o            \
                                           x86/hevc_deblock.o            \
                                           x86/hevc_idct.o               \
                                           x86/hevc_mc.o                 \
+                                          x86/h26x/h2656_inter.o        \
                                           x86/hevc_sao.o                \
                                           x86/hevc_sao_10bit.o
 X86ASM-OBJS-$(CONFIG_JPEG2000_DECODER) += x86/jpeg2000dsp.o
diff --git a/libavcodec/x86/h26x/h2656_inter.asm b/libavcodec/x86/h26x/h2656_inter.asm
new file mode 100644
index 0000000000..aa296d549c
--- /dev/null
+++ b/libavcodec/x86/h26x/h2656_inter.asm
@@ -0,0 +1,1145 @@
+; /*
+; * Provide SSE luma and
chroma mc functions for HEVC/VVC decoding +; * Copyright (c) 2013 Pierre-Edouard LEPERE +; * Copyright (c) 2023-2024 Nuo Mi +; * Copyright (c) 2023-2024 Wu Jianhua +; * +; * This file is part of FFmpeg. +; * +; * FFmpeg is free software; you can redistribute it and/or +; * modify it under the terms of the GNU Lesser General Public +; * License as published by the Free Software Foundation; either +; * version 2.1 of the License, or (at your option) any later version. +; * +; * FFmpeg is distributed in the hope that it will be useful, +; * but WITHOUT ANY WARRANTY; without even the implied warranty of +; * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +; * Lesser General Public License for more details. +; * +; * You should have received a copy of the GNU Lesser General Public +; * License along with FFmpeg; if not, write to the Free Software +; * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +; */ +%include "libavutil/x86/x86util.asm" + +%define MAX_PB_SIZE 64 + +SECTION_RODATA 32 +cextern pw_255 +cextern pw_512 +cextern pw_2048 +cextern pw_1023 +cextern pw_1024 +cextern pw_4096 +cextern pw_8192 +%define scale_8 pw_512 +%define scale_10 pw_2048 +%define scale_12 pw_8192 +%define max_pixels_8 pw_255 +%define max_pixels_10 pw_1023 +max_pixels_12: times 16 dw ((1 << 12)-1) +cextern pb_0 + +SECTION .text +%macro SIMPLE_LOAD 4 ;width, bitd, tab, r1 +%if %1 == 2 || (%2 == 8 && %1 <= 4) + movd %4, [%3] ; load data from source +%elif %1 == 4 || (%2 == 8 && %1 <= 8) + movq %4, [%3] ; load data from source +%elif notcpuflag(avx) + movu %4, [%3] ; load data from source +%elif %1 <= 8 || (%2 == 8 && %1 <= 16) + movdqu %4, [%3] +%else + movu %4, [%3] +%endif +%endmacro + +%macro VPBROADCASTW 2 +%if notcpuflag(avx2) + movd %1, %2 + pshuflw %1, %1, 0 + punpcklwd %1, %1 +%else + vpbroadcastw %1, %2 +%endif +%endmacro + +%macro MC_4TAP_FILTER 4 ; bitdepth, filter, a, b, + VPBROADCASTW %3, [%2q + 0 * 2] ; coeff 0, 1 + VPBROADCASTW %4, [%2q + 1 * 2] ; coeff 2, 3 +%if %1 != 8 + pmovsxbw %3, xmm%3 + pmovsxbw %4, xmm%4 +%endif +%endmacro + +%macro MC_4TAP_HV_FILTER 1 + VPBROADCASTW m12, [vfq + 0 * 2] ; vf 0, 1 + VPBROADCASTW m13, [vfq + 1 * 2] ; vf 2, 3 + VPBROADCASTW m14, [hfq + 0 * 2] ; hf 0, 1 + VPBROADCASTW m15, [hfq + 1 * 2] ; hf 2, 3 + + pmovsxbw m12, xm12 + pmovsxbw m13, xm13 +%if %1 != 8 + pmovsxbw m14, xm14 + pmovsxbw m15, xm15 +%endif + lea r3srcq, [srcstrideq*3] +%endmacro + +%macro MC_8TAP_SAVE_FILTER 5 ;offset, mm registers + mova [rsp + %1 + 0*mmsize], %2 + mova [rsp + %1 + 1*mmsize], %3 + mova [rsp + %1 + 2*mmsize], %4 + mova [rsp + %1 + 3*mmsize], %5 +%endmacro + +%macro MC_8TAP_FILTER 2-3 ;bitdepth, filter, offset + VPBROADCASTW m12, [%2q + 0 * 2] ; coeff 0, 1 + VPBROADCASTW m13, [%2q + 1 * 2] ; coeff 2, 3 + VPBROADCASTW m14, [%2q + 2 * 2] ; coeff 4, 5 + VPBROADCASTW m15, [%2q + 3 * 2] ; coeff 6, 7 +%if %0 == 3 + MC_8TAP_SAVE_FILTER %3, m12, m13, m14, m15 +%endif + +%if %1 != 8 + pmovsxbw m12, xm12 + pmovsxbw m13, xm13 + pmovsxbw m14, xm14 + pmovsxbw m15, xm15 + %if %0 == 3 + MC_8TAP_SAVE_FILTER %3 + 4*mmsize, m12, m13, m14, m15 + %endif +%elif %0 == 3 + pmovsxbw m8, xm12 + pmovsxbw m9, xm13 + pmovsxbw m10, xm14 + pmovsxbw m11, xm15 + MC_8TAP_SAVE_FILTER %3 + 4*mmsize, m8, m9, m10, m11 +%endif + +%endmacro + +%macro MC_4TAP_LOAD 4 +%if (%1 == 8 && %4 <= 4) +%define %%load movd +%elif (%1 == 8 && %4 <= 8) || (%1 > 8 && %4 <= 4) +%define %%load movq +%else +%define %%load movdqu +%endif + + %%load m0, [%2q ] +%ifnum %3 + %%load m1, [%2q+ %3] + 
%%load m2, [%2q+2*%3] + %%load m3, [%2q+3*%3] +%else + %%load m1, [%2q+ %3q] + %%load m2, [%2q+2*%3q] + %%load m3, [%2q+r3srcq] +%endif +%if %1 == 8 +%if %4 > 8 + SBUTTERFLY bw, 0, 1, 7 + SBUTTERFLY bw, 2, 3, 7 +%else + punpcklbw m0, m1 + punpcklbw m2, m3 +%endif +%else +%if %4 > 4 + SBUTTERFLY wd, 0, 1, 7 + SBUTTERFLY wd, 2, 3, 7 +%else + punpcklwd m0, m1 + punpcklwd m2, m3 +%endif +%endif +%endmacro + +%macro MC_8TAP_H_LOAD 4 +%assign %%stride (%1+7)/8 +%if %1 == 8 +%if %3 <= 4 +%define %%load movd +%elif %3 == 8 +%define %%load movq +%else +%define %%load movu +%endif +%else +%if %3 == 2 +%define %%load movd +%elif %3 == 4 +%define %%load movq +%else +%define %%load movu +%endif +%endif + %%load m0, [%2-3*%%stride] ;load data from source + %%load m1, [%2-2*%%stride] + %%load m2, [%2-%%stride ] + %%load m3, [%2 ] + %%load m4, [%2+%%stride ] + %%load m5, [%2+2*%%stride] + %%load m6, [%2+3*%%stride] + %%load m7, [%2+4*%%stride] + +%if %1 == 8 +%if %3 > 8 + SBUTTERFLY wd, 0, 1, %4 + SBUTTERFLY wd, 2, 3, %4 + SBUTTERFLY wd, 4, 5, %4 + SBUTTERFLY wd, 6, 7, %4 +%else + punpcklbw m0, m1 + punpcklbw m2, m3 + punpcklbw m4, m5 + punpcklbw m6, m7 +%endif +%else +%if %3 > 4 + SBUTTERFLY dq, 0, 1, %4 + SBUTTERFLY dq, 2, 3, %4 + SBUTTERFLY dq, 4, 5, %4 + SBUTTERFLY dq, 6, 7, %4 +%else + punpcklwd m0, m1 + punpcklwd m2, m3 + punpcklwd m4, m5 + punpcklwd m6, m7 +%endif +%endif +%endmacro + +%macro MC_8TAP_V_LOAD 5 + lea %5q, [%2] + sub %5q, r3srcq + movu m0, [%5q ] ;load x- 3*srcstride + movu m1, [%5q+ %3q ] ;load x- 2*srcstride + movu m2, [%5q+ 2*%3q ] ;load x-srcstride + movu m3, [%2 ] ;load x + movu m4, [%2+ %3q] ;load x+stride + movu m5, [%2+ 2*%3q] ;load x+2*stride + movu m6, [%2+r3srcq] ;load x+3*stride + movu m7, [%2+ 4*%3q] ;load x+4*stride +%if %1 == 8 +%if %4 > 8 + SBUTTERFLY bw, 0, 1, 8 + SBUTTERFLY bw, 2, 3, 8 + SBUTTERFLY bw, 4, 5, 8 + SBUTTERFLY bw, 6, 7, 8 +%else + punpcklbw m0, m1 + punpcklbw m2, m3 + punpcklbw m4, m5 + punpcklbw m6, m7 +%endif +%else +%if %4 > 4 + SBUTTERFLY wd, 0, 1, 8 + SBUTTERFLY wd, 2, 3, 8 + SBUTTERFLY wd, 4, 5, 8 + SBUTTERFLY wd, 6, 7, 8 +%else + punpcklwd m0, m1 + punpcklwd m2, m3 + punpcklwd m4, m5 + punpcklwd m6, m7 +%endif +%endif +%endmacro + +%macro PEL_12STORE2 3 + movd [%1], %2 +%endmacro +%macro PEL_12STORE4 3 + movq [%1], %2 +%endmacro +%macro PEL_12STORE6 3 + movq [%1], %2 + psrldq %2, 8 + movd [%1+8], %2 +%endmacro +%macro PEL_12STORE8 3 + movdqu [%1], %2 +%endmacro +%macro PEL_12STORE12 3 + PEL_12STORE8 %1, %2, %3 + movq [%1+16], %3 +%endmacro +%macro PEL_12STORE16 3 +%if cpuflag(avx2) + movu [%1], %2 +%else + PEL_12STORE8 %1, %2, %3 + movdqu [%1+16], %3 +%endif +%endmacro + +%macro PEL_10STORE2 3 + movd [%1], %2 +%endmacro +%macro PEL_10STORE4 3 + movq [%1], %2 +%endmacro +%macro PEL_10STORE6 3 + movq [%1], %2 + psrldq %2, 8 + movd [%1+8], %2 +%endmacro +%macro PEL_10STORE8 3 + movdqu [%1], %2 +%endmacro +%macro PEL_10STORE12 3 + PEL_10STORE8 %1, %2, %3 + movq [%1+16], %3 +%endmacro +%macro PEL_10STORE16 3 +%if cpuflag(avx2) + movu [%1], %2 +%else + PEL_10STORE8 %1, %2, %3 + movdqu [%1+16], %3 +%endif +%endmacro +%macro PEL_10STORE32 3 + PEL_10STORE16 %1, %2, %3 + movu [%1+32], %3 +%endmacro + +%macro PEL_8STORE2 3 + pextrw [%1], %2, 0 +%endmacro +%macro PEL_8STORE4 3 + movd [%1], %2 +%endmacro +%macro PEL_8STORE6 3 + movd [%1], %2 + pextrw [%1+4], %2, 2 +%endmacro +%macro PEL_8STORE8 3 + movq [%1], %2 +%endmacro +%macro PEL_8STORE12 3 + movq [%1], %2 + psrldq %2, 8 + movd [%1+8], %2 +%endmacro +%macro PEL_8STORE16 3 +%if cpuflag(avx2) + movdqu [%1], 
%2 +%else + movu [%1], %2 +%endif ; avx +%endmacro +%macro PEL_8STORE32 3 + movu [%1], %2 +%endmacro + +%macro LOOP_END 3 + add %1q, 2*MAX_PB_SIZE ; dst += dststride + add %2q, %3q ; src += srcstride + dec heightd ; cmp height + jnz .loop ; height loop +%endmacro + + +%macro MC_PIXEL_COMPUTE 2-3 ;width, bitdepth +%if %2 == 8 +%if cpuflag(avx2) && %0 ==3 +%if %1 > 16 + vextracti128 xm1, m0, 1 + pmovzxbw m1, xm1 + psllw m1, 14-%2 +%endif + pmovzxbw m0, xm0 +%else ; not avx +%if %1 > 8 + punpckhbw m1, m0, m2 + psllw m1, 14-%2 +%endif + punpcklbw m0, m2 +%endif +%endif ;avx + psllw m0, 14-%2 +%endmacro + +%macro MC_4TAP_COMPUTE 4-8 ; bitdepth, width, filter1, filter2, HV/m0, m2, m1, m3 +%if %0 == 8 +%define %%reg0 %5 +%define %%reg2 %6 +%define %%reg1 %7 +%define %%reg3 %8 +%else +%define %%reg0 m0 +%define %%reg2 m2 +%define %%reg1 m1 +%define %%reg3 m3 +%endif +%if %1 == 8 +%if cpuflag(avx2) && (%0 == 5) +%if %2 > 16 + vperm2i128 m10, m0, m1, q0301 +%endif + vinserti128 m0, m0, xm1, 1 + mova m1, m10 +%if %2 > 16 + vperm2i128 m10, m2, m3, q0301 +%endif + vinserti128 m2, m2, xm3, 1 + mova m3, m10 +%endif + pmaddubsw %%reg0, %3 ;x1*c1+x2*c2 + pmaddubsw %%reg2, %4 ;x3*c3+x4*c4 + paddw %%reg0, %%reg2 +%if %2 > 8 + pmaddubsw %%reg1, %3 + pmaddubsw %%reg3, %4 + paddw %%reg1, %%reg3 +%endif +%else + pmaddwd %%reg0, %3 + pmaddwd %%reg2, %4 + paddd %%reg0, %%reg2 +%if %2 > 4 + pmaddwd %%reg1, %3 + pmaddwd %%reg3, %4 + paddd %%reg1, %%reg3 +%if %1 != 8 + psrad %%reg1, %1-8 +%endif +%endif +%if %1 != 8 + psrad %%reg0, %1-8 +%endif + packssdw %%reg0, %%reg1 +%endif +%endmacro + +%macro MC_8TAP_HV_COMPUTE 4 ; width, bitdepth, filter + +%if %2 == 8 + pmaddubsw m0, [%3q+0*mmsize] ;x1*c1+x2*c2 + pmaddubsw m2, [%3q+1*mmsize] ;x3*c3+x4*c4 + pmaddubsw m4, [%3q+2*mmsize] ;x5*c5+x6*c6 + pmaddubsw m6, [%3q+3*mmsize] ;x7*c7+x8*c8 + paddw m0, m2 + paddw m4, m6 + paddw m0, m4 +%else + pmaddwd m0, [%3q+4*mmsize] + pmaddwd m2, [%3q+5*mmsize] + pmaddwd m4, [%3q+6*mmsize] + pmaddwd m6, [%3q+7*mmsize] + paddd m0, m2 + paddd m4, m6 + paddd m0, m4 +%if %2 != 8 + psrad m0, %2-8 +%endif +%if %1 > 4 + pmaddwd m1, [%3q+4*mmsize] + pmaddwd m3, [%3q+5*mmsize] + pmaddwd m5, [%3q+6*mmsize] + pmaddwd m7, [%3q+7*mmsize] + paddd m1, m3 + paddd m5, m7 + paddd m1, m5 +%if %2 != 8 + psrad m1, %2-8 +%endif +%endif + p%4 m0, m1 +%endif +%endmacro + + +%macro MC_8TAP_COMPUTE 2-3 ; width, bitdepth +%if %2 == 8 +%if cpuflag(avx2) && (%0 == 3) + + vperm2i128 m10, m0, m1, q0301 + vinserti128 m0, m0, xm1, 1 + SWAP 1, 10 + + vperm2i128 m10, m2, m3, q0301 + vinserti128 m2, m2, xm3, 1 + SWAP 3, 10 + + + vperm2i128 m10, m4, m5, q0301 + vinserti128 m4, m4, xm5, 1 + SWAP 5, 10 + + vperm2i128 m10, m6, m7, q0301 + vinserti128 m6, m6, xm7, 1 + SWAP 7, 10 +%endif + + pmaddubsw m0, m12 ;x1*c1+x2*c2 + pmaddubsw m2, m13 ;x3*c3+x4*c4 + pmaddubsw m4, m14 ;x5*c5+x6*c6 + pmaddubsw m6, m15 ;x7*c7+x8*c8 + paddw m0, m2 + paddw m4, m6 + paddw m0, m4 +%if %1 > 8 + pmaddubsw m1, m12 + pmaddubsw m3, m13 + pmaddubsw m5, m14 + pmaddubsw m7, m15 + paddw m1, m3 + paddw m5, m7 + paddw m1, m5 +%endif +%else + pmaddwd m0, m12 + pmaddwd m2, m13 + pmaddwd m4, m14 + pmaddwd m6, m15 + paddd m0, m2 + paddd m4, m6 + paddd m0, m4 +%if %2 != 8 + psrad m0, %2-8 +%endif +%if %1 > 4 + pmaddwd m1, m12 + pmaddwd m3, m13 + pmaddwd m5, m14 + pmaddwd m7, m15 + paddd m1, m3 + paddd m5, m7 + paddd m1, m5 +%if %2 != 8 + psrad m1, %2-8 +%endif +%endif +%endif +%endmacro +%macro UNI_COMPUTE 5 + pmulhrsw %3, %5 +%if %1 > 8 || (%2 > 8 && %1 > 4) + pmulhrsw %4, %5 +%endif +%if %2 == 8 + packuswb %3, 
%4 +%else + CLIPW %3, [pb_0], [max_pixels_%2] +%if (%1 > 8 && notcpuflag(avx)) || %1 > 16 + CLIPW %4, [pb_0], [max_pixels_%2] +%endif +%endif +%endmacro + + +; ****************************** +; void %1_put_pixels(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride, +; int height, const int8_t *hf, const int8_t *vf, int width) +; ****************************** + +%macro PUT_PIXELS 3 + MC_PIXELS %1, %2, %3 + MC_UNI_PIXELS %1, %2, %3 +%endmacro + +%macro MC_PIXELS 3 +cglobal %1_put_pixels%2_%3, 4, 4, 3, dst, src, srcstride, height + pxor m2, m2 +.loop: + SIMPLE_LOAD %2, %3, srcq, m0 + MC_PIXEL_COMPUTE %2, %3, 1 + PEL_10STORE%2 dstq, m0, m1 + LOOP_END dst, src, srcstride + RET +%endmacro + +%macro MC_UNI_PIXELS 3 +cglobal %1_put_uni_pixels%2_%3, 5, 5, 2, dst, dststride, src, srcstride, height +.loop: + SIMPLE_LOAD %2, %3, srcq, m0 + PEL_%3STORE%2 dstq, m0, m1 + add dstq, dststrideq ; dst += dststride + add srcq, srcstrideq ; src += srcstride + dec heightd ; cmp height + jnz .loop ; height loop + RET +%endmacro + +%macro PUT_4TAP 3 +%if cpuflag(avx2) +%define XMM_REGS 11 +%else +%define XMM_REGS 8 +%endif + +; ****************************** +; void %1_put_4tap_hX(int16_t *dst, +; const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width); +; ****************************** +cglobal %1_put_4tap_h%2_%3, 5, 5, XMM_REGS, dst, src, srcstride, height, hf +%assign %%stride ((%3 + 7)/8) + MC_4TAP_FILTER %3, hf, m4, m5 +.loop: + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m4, m5, 1 + PEL_10STORE%2 dstq, m0, m1 + LOOP_END dst, src, srcstride + RET + +; ****************************** +; void %1_put_uni_4tap_hX(uint8_t *dst, ptrdiff_t dststride, +; const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width); +; ****************************** +cglobal %1_put_uni_4tap_h%2_%3, 6, 7, XMM_REGS, dst, dststride, src, srcstride, height, hf +%assign %%stride ((%3 + 7)/8) + movdqa m6, [scale_%3] + MC_4TAP_FILTER %3, hf, m4, m5 +.loop: + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m4, m5 + UNI_COMPUTE %2, %3, m0, m1, m6 + PEL_%3STORE%2 dstq, m0, m1 + add dstq, dststrideq ; dst += dststride + add srcq, srcstrideq ; src += srcstride + dec heightd ; cmp height + jnz .loop ; height loop + RET + +; ****************************** +; void %1_put_4tap_v(int16_t *dst, +; const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width) +; ****************************** +cglobal %1_put_4tap_v%2_%3, 6, 6, XMM_REGS, dst, src, srcstride, height, r3src, vf + sub srcq, srcstrideq + MC_4TAP_FILTER %3, vf, m4, m5 + lea r3srcq, [srcstrideq*3] +.loop: + MC_4TAP_LOAD %3, srcq, srcstride, %2 + MC_4TAP_COMPUTE %3, %2, m4, m5, 1 + PEL_10STORE%2 dstq, m0, m1 + LOOP_END dst, src, srcstride + RET + +; ****************************** +; void %1_put_uni_4tap_vX(uint8_t *dst, ptrdiff_t dststride, +; const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width); +; ****************************** +cglobal %1_put_uni_4tap_v%2_%3, 7, 7, XMM_REGS, dst, dststride, src, srcstride, height, r3src, vf + movdqa m6, [scale_%3] + sub srcq, srcstrideq + MC_4TAP_FILTER %3, vf, m4, m5 + lea r3srcq, [srcstrideq*3] +.loop: + MC_4TAP_LOAD %3, srcq, srcstride, %2 + MC_4TAP_COMPUTE %3, %2, m4, m5 + UNI_COMPUTE %2, %3, m0, m1, m6 + PEL_%3STORE%2 dstq, m0, m1 + add dstq, dststrideq ; dst += dststride + add srcq, srcstrideq ; src += srcstride + dec heightd ; cmp height + jnz .loop ; height loop + RET 
+%endmacro + +%macro PUT_4TAP_HV 3 +; ****************************** +; void put_4tap_hv(int16_t *dst, +; const uint8_t *_src, ptrdiff_t _srcstride, int height, int8_t *hf, int8_t *vf, int width) +; ****************************** +cglobal %1_put_4tap_hv%2_%3, 6, 7, 16 , dst, src, srcstride, height, hf, vf, r3src +%assign %%stride ((%3 + 7)/8) + sub srcq, srcstrideq + MC_4TAP_HV_FILTER %3 + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m14, m15 +%if (%2 > 8 && (%3 == 8)) + SWAP m8, m1 +%endif + SWAP m4, m0 + add srcq, srcstrideq + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m14, m15 +%if (%2 > 8 && (%3 == 8)) + SWAP m9, m1 +%endif + SWAP m5, m0 + add srcq, srcstrideq + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m14, m15 +%if (%2 > 8 && (%3 == 8)) + SWAP m10, m1 +%endif + SWAP m6, m0 + add srcq, srcstrideq +.loop: + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m14, m15 +%if (%2 > 8 && (%3 == 8)) + SWAP m11, m1 +%endif + SWAP m7, m0 + punpcklwd m0, m4, m5 + punpcklwd m2, m6, m7 +%if %2 > 4 + punpckhwd m1, m4, m5 + punpckhwd m3, m6, m7 +%endif + MC_4TAP_COMPUTE 14, %2, m12, m13 +%if (%2 > 8 && (%3 == 8)) + punpcklwd m4, m8, m9 + punpcklwd m2, m10, m11 + punpckhwd m8, m8, m9 + punpckhwd m3, m10, m11 + MC_4TAP_COMPUTE 14, %2, m12, m13, m4, m2, m8, m3 +%if cpuflag(avx2) + vinserti128 m2, m0, xm4, 1 + vperm2i128 m3, m0, m4, q0301 + PEL_10STORE%2 dstq, m2, m3 +%else + PEL_10STORE%2 dstq, m0, m4 +%endif +%else + PEL_10STORE%2 dstq, m0, m1 +%endif + movdqa m4, m5 + movdqa m5, m6 + movdqa m6, m7 +%if (%2 > 8 && (%3 == 8)) + mova m8, m9 + mova m9, m10 + mova m10, m11 +%endif + LOOP_END dst, src, srcstride + RET + +cglobal %1_put_uni_4tap_hv%2_%3, 7, 8, 16 , dst, dststride, src, srcstride, height, hf, vf, r3src +%assign %%stride ((%3 + 7)/8) + sub srcq, srcstrideq + MC_4TAP_HV_FILTER %3 + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m14, m15 +%if (%2 > 8 && (%3 == 8)) + SWAP m8, m1 +%endif + SWAP m4, m0 + add srcq, srcstrideq + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m14, m15 +%if (%2 > 8 && (%3 == 8)) + SWAP m9, m1 +%endif + SWAP m5, m0 + add srcq, srcstrideq + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m14, m15 +%if (%2 > 8 && (%3 == 8)) + SWAP m10, m1 +%endif + SWAP m6, m0 + add srcq, srcstrideq +.loop: + MC_4TAP_LOAD %3, srcq-%%stride, %%stride, %2 + MC_4TAP_COMPUTE %3, %2, m14, m15 +%if (%2 > 8 && (%3 == 8)) + SWAP m11, m1 +%endif + mova m7, m0 + punpcklwd m0, m4, m5 + punpcklwd m2, m6, m7 +%if %2 > 4 + punpckhwd m1, m4, m5 + punpckhwd m3, m6, m7 +%endif + MC_4TAP_COMPUTE 14, %2, m12, m13 +%if (%2 > 8 && (%3 == 8)) + punpcklwd m4, m8, m9 + punpcklwd m2, m10, m11 + punpckhwd m8, m8, m9 + punpckhwd m3, m10, m11 + MC_4TAP_COMPUTE 14, %2, m12, m13, m4, m2, m8, m3 + UNI_COMPUTE %2, %3, m0, m4, [scale_%3] +%else + UNI_COMPUTE %2, %3, m0, m1, [scale_%3] +%endif + PEL_%3STORE%2 dstq, m0, m1 + mova m4, m5 + mova m5, m6 + mova m6, m7 +%if (%2 > 8 && (%3 == 8)) + mova m8, m9 + mova m9, m10 + mova m10, m11 +%endif + add dstq, dststrideq ; dst += dststride + add srcq, srcstrideq ; src += srcstride + dec heightd ; cmp height + jnz .loop ; height loop + RET +%endmacro + +; ****************************** +; void put_8tap_hX_X_X(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride, +; int height, const int8_t *hf, const int8_t *vf, int width) +; ****************************** + +%macro PUT_8TAP 3 +cglobal %1_put_8tap_h%2_%3, 
5, 5, 16, dst, src, srcstride, height, hf + MC_8TAP_FILTER %3, hf +.loop: + MC_8TAP_H_LOAD %3, srcq, %2, 10 + MC_8TAP_COMPUTE %2, %3, 1 +%if %3 > 8 + packssdw m0, m1 +%endif + PEL_10STORE%2 dstq, m0, m1 + LOOP_END dst, src, srcstride + RET + +; ****************************** +; void put_uni_8tap_hX_X_X(int16_t *dst, ptrdiff_t dststride, const uint8_t *_src, ptrdiff_t srcstride, +; int height, const int8_t *hf, const int8_t *vf, int width) +; ****************************** +cglobal %1_put_uni_8tap_h%2_%3, 6, 7, 16 , dst, dststride, src, srcstride, height, hf + mova m9, [scale_%3] + MC_8TAP_FILTER %3, hf +.loop: + MC_8TAP_H_LOAD %3, srcq, %2, 10 + MC_8TAP_COMPUTE %2, %3 +%if %3 > 8 + packssdw m0, m1 +%endif + UNI_COMPUTE %2, %3, m0, m1, m9 + PEL_%3STORE%2 dstq, m0, m1 + add dstq, dststrideq ; dst += dststride + add srcq, srcstrideq ; src += srcstride + dec heightd ; cmp height + jnz .loop ; height loop + RET + + +; ****************************** +; void put_8tap_vX_X_X(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride, +; int height, const int8_t *hf, const int8_t *vf, int width) +; ****************************** +cglobal %1_put_8tap_v%2_%3, 6, 8, 16, dst, src, srcstride, height, r3src, vf + MC_8TAP_FILTER %3, vf + lea r3srcq, [srcstrideq*3] +.loop: + MC_8TAP_V_LOAD %3, srcq, srcstride, %2, r7 + MC_8TAP_COMPUTE %2, %3, 1 +%if %3 > 8 + packssdw m0, m1 +%endif + PEL_10STORE%2 dstq, m0, m1 + LOOP_END dst, src, srcstride + RET + +; ****************************** +; void put_uni_8tap_vX_X_X(int16_t *dst, ptrdiff_t dststride, const uint8_t *_src, ptrdiff_t srcstride, +; int height, const int8_t *hf, const int8_t *vf, int width) +; ****************************** +cglobal %1_put_uni_8tap_v%2_%3, 7, 9, 16, dst, dststride, src, srcstride, height, r3src, vf + MC_8TAP_FILTER %3, vf + movdqa m9, [scale_%3] + lea r3srcq, [srcstrideq*3] +.loop: + MC_8TAP_V_LOAD %3, srcq, srcstride, %2, r8 + MC_8TAP_COMPUTE %2, %3 +%if %3 > 8 + packssdw m0, m1 +%endif + UNI_COMPUTE %2, %3, m0, m1, m9 + PEL_%3STORE%2 dstq, m0, m1 + add dstq, dststrideq ; dst += dststride + add srcq, srcstrideq ; src += srcstride + dec heightd ; cmp height + jnz .loop ; height loop + RET + +%endmacro + + +; ****************************** +; void put_8tap_hvX_X(int16_t *dst, const uint8_t *_src, ptrdiff_t srcstride, +; int height, const int8_t *hf, const int8_t *vf, int width) +; ****************************** +%macro PUT_8TAP_HV 3 +cglobal %1_put_8tap_hv%2_%3, 6, 7, 16, 0 - mmsize*16, dst, src, srcstride, height, hf, vf, r3src + MC_8TAP_FILTER %3, hf, 0 + lea hfq, [rsp] + MC_8TAP_FILTER %3, vf, 8*mmsize + lea vfq, [rsp + 8*mmsize] + + lea r3srcq, [srcstrideq*3] + sub srcq, r3srcq + + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m8, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m9, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m10, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m11, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m12, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m13, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m14, m0 + add srcq, srcstrideq +.loop: + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw 
+ SWAP m15, m0 + punpcklwd m0, m8, m9 + punpcklwd m2, m10, m11 + punpcklwd m4, m12, m13 + punpcklwd m6, m14, m15 +%if %2 > 4 + punpckhwd m1, m8, m9 + punpckhwd m3, m10, m11 + punpckhwd m5, m12, m13 + punpckhwd m7, m14, m15 +%endif +%if %2 <= 4 + movq m8, m9 + movq m9, m10 + movq m10, m11 + movq m11, m12 + movq m12, m13 + movq m13, m14 + movq m14, m15 +%else + movdqa m8, m9 + movdqa m9, m10 + movdqa m10, m11 + movdqa m11, m12 + movdqa m12, m13 + movdqa m13, m14 + movdqa m14, m15 +%endif + MC_8TAP_HV_COMPUTE %2, 14, vf, ackssdw + PEL_10STORE%2 dstq, m0, m1 + + LOOP_END dst, src, srcstride + RET + + +cglobal %1_put_uni_8tap_hv%2_%3, 7, 9, 16, 0 - 16*mmsize, dst, dststride, src, srcstride, height, hf, vf, r3src + MC_8TAP_FILTER %3, hf, 0 + lea hfq, [rsp] + MC_8TAP_FILTER %3, vf, 8*mmsize + lea vfq, [rsp + 8*mmsize] + lea r3srcq, [srcstrideq*3] + sub srcq, r3srcq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m8, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m9, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m10, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m11, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m12, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m13, m0 + add srcq, srcstrideq + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m14, m0 + add srcq, srcstrideq +.loop: + MC_8TAP_H_LOAD %3, srcq, %2, 15 + MC_8TAP_HV_COMPUTE %2, %3, hf, ackssdw + SWAP m15, m0 + punpcklwd m0, m8, m9 + punpcklwd m2, m10, m11 + punpcklwd m4, m12, m13 + punpcklwd m6, m14, m15 +%if %2 > 4 + punpckhwd m1, m8, m9 + punpckhwd m3, m10, m11 + punpckhwd m5, m12, m13 + punpckhwd m7, m14, m15 +%endif + MC_8TAP_HV_COMPUTE %2, 14, vf, ackusdw + UNI_COMPUTE %2, %3, m0, m1, [scale_%3] + PEL_%3STORE%2 dstq, m0, m1 + +%if %2 <= 4 + movq m8, m9 + movq m9, m10 + movq m10, m11 + movq m11, m12 + movq m12, m13 + movq m13, m14 + movq m14, m15 +%else + mova m8, m9 + mova m9, m10 + mova m10, m11 + mova m11, m12 + mova m12, m13 + mova m13, m14 + mova m14, m15 +%endif + add dstq, dststrideq ; dst += dststride + add srcq, srcstrideq ; src += srcstride + dec heightd ; cmp height + jnz .loop ; height loop + RET + +%endmacro + +%macro H2656PUT_PIXELS 2 + PUT_PIXELS h2656, %1, %2 +%endmacro + +%macro H2656PUT_4TAP 2 + PUT_4TAP h2656, %1, %2 +%endmacro + +%macro H2656PUT_4TAP_HV 2 + PUT_4TAP_HV h2656, %1, %2 +%endmacro + +%macro H2656PUT_8TAP 2 + PUT_8TAP h2656, %1, %2 +%endmacro + +%macro H2656PUT_8TAP_HV 2 + PUT_8TAP_HV h2656, %1, %2 +%endmacro + +%if ARCH_X86_64 + +INIT_XMM sse4 +H2656PUT_PIXELS 2, 8 +H2656PUT_PIXELS 4, 8 +H2656PUT_PIXELS 6, 8 +H2656PUT_PIXELS 8, 8 +H2656PUT_PIXELS 12, 8 +H2656PUT_PIXELS 16, 8 + +H2656PUT_PIXELS 2, 10 +H2656PUT_PIXELS 4, 10 +H2656PUT_PIXELS 6, 10 +H2656PUT_PIXELS 8, 10 + +H2656PUT_PIXELS 2, 12 +H2656PUT_PIXELS 4, 12 +H2656PUT_PIXELS 6, 12 +H2656PUT_PIXELS 8, 12 + +H2656PUT_4TAP 2, 8 +H2656PUT_4TAP 4, 8 +H2656PUT_4TAP 6, 8 +H2656PUT_4TAP 8, 8 + +H2656PUT_4TAP 12, 8 +H2656PUT_4TAP 16, 8 + +H2656PUT_4TAP 2, 10 +H2656PUT_4TAP 4, 10 +H2656PUT_4TAP 6, 10 +H2656PUT_4TAP 8, 10 + +H2656PUT_4TAP 2, 12 +H2656PUT_4TAP 4, 12 +H2656PUT_4TAP 6, 12 +H2656PUT_4TAP 8, 12 + +H2656PUT_4TAP_HV 2, 8 +H2656PUT_4TAP_HV 4, 8 +H2656PUT_4TAP_HV 6, 8 
+H2656PUT_4TAP_HV 8, 8 +H2656PUT_4TAP_HV 16, 8 + +H2656PUT_4TAP_HV 2, 10 +H2656PUT_4TAP_HV 4, 10 +H2656PUT_4TAP_HV 6, 10 +H2656PUT_4TAP_HV 8, 10 + +H2656PUT_4TAP_HV 2, 12 +H2656PUT_4TAP_HV 4, 12 +H2656PUT_4TAP_HV 6, 12 +H2656PUT_4TAP_HV 8, 12 + +H2656PUT_8TAP 4, 8 +H2656PUT_8TAP 8, 8 +H2656PUT_8TAP 12, 8 +H2656PUT_8TAP 16, 8 + +H2656PUT_8TAP 4, 10 +H2656PUT_8TAP 8, 10 + +H2656PUT_8TAP 4, 12 +H2656PUT_8TAP 8, 12 + +H2656PUT_8TAP_HV 4, 8 +H2656PUT_8TAP_HV 8, 8 + +H2656PUT_8TAP_HV 4, 10 +H2656PUT_8TAP_HV 8, 10 + +H2656PUT_8TAP_HV 4, 12 +H2656PUT_8TAP_HV 8, 12 + +%if HAVE_AVX2_EXTERNAL +INIT_YMM avx2 + +H2656PUT_PIXELS 32, 8 +H2656PUT_PIXELS 16, 10 +H2656PUT_PIXELS 16, 12 + +H2656PUT_8TAP 32, 8 +H2656PUT_8TAP 16, 10 +H2656PUT_8TAP 16, 12 + +H2656PUT_8TAP_HV 32, 8 +H2656PUT_8TAP_HV 16, 10 +H2656PUT_8TAP_HV 16, 12 + +H2656PUT_4TAP 32, 8 +H2656PUT_4TAP 16, 10 +H2656PUT_4TAP 16, 12 + +H2656PUT_4TAP_HV 32, 8 +H2656PUT_4TAP_HV 16, 10 +H2656PUT_4TAP_HV 16, 12 + +%endif + +%endif diff --git a/libavcodec/x86/h26x/h2656dsp.c b/libavcodec/x86/h26x/h2656dsp.c new file mode 100644 index 0000000000..27769f9c55 --- /dev/null +++ b/libavcodec/x86/h26x/h2656dsp.c @@ -0,0 +1,98 @@ +/* + * DSP for HEVC/VVC + * + * Copyright (C) 2022-2024 Nuo Mi + * Copyright (c) 2023-2024 Wu Jianhua + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +#include "h2656dsp.h" + +#define mc_rep_func(name, bitd, step, W, opt) \ +void ff_h2656_put_##name##W##_##bitd##_##opt(int16_t *_dst, \ + const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width) \ +{ \ + int i; \ + int16_t *dst; \ + for (i = 0; i < W; i += step) { \ + const uint8_t *src = _src + (i * ((bitd + 7) / 8)); \ + dst = _dst + i; \ + ff_h2656_put_##name##step##_##bitd##_##opt(dst, src, _srcstride, height, hf, vf, width); \ + } \ +} + +#define mc_rep_uni_func(name, bitd, step, W, opt) \ +void ff_h2656_put_uni_##name##W##_##bitd##_##opt(uint8_t *_dst, ptrdiff_t dststride, \ + const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width) \ +{ \ + int i; \ + uint8_t *dst; \ + for (i = 0; i < W; i += step) { \ + const uint8_t *src = _src + (i * ((bitd + 7) / 8)); \ + dst = _dst + (i * ((bitd + 7) / 8)); \ + ff_h2656_put_uni_##name##step##_##bitd##_##opt(dst, dststride, src, _srcstride, \ + height, hf, vf, width); \ + } \ +} + +#define mc_rep_funcs(name, bitd, step, W, opt) \ + mc_rep_func(name, bitd, step, W, opt) \ + mc_rep_uni_func(name, bitd, step, W, opt) + +#define MC_REP_FUNCS_SSE4(fname) \ + mc_rep_funcs(fname, 8, 16,128, sse4) \ + mc_rep_funcs(fname, 8, 16, 64, sse4) \ + mc_rep_funcs(fname, 8, 16, 32, sse4) \ + mc_rep_funcs(fname, 10, 8,128, sse4) \ + mc_rep_funcs(fname, 10, 8, 64, sse4) \ + mc_rep_funcs(fname, 10, 8, 32, sse4) \ + mc_rep_funcs(fname, 10, 8, 16, sse4) \ + mc_rep_funcs(fname, 12, 8,128, sse4) \ + mc_rep_funcs(fname, 12, 8, 64, sse4) \ + mc_rep_funcs(fname, 12, 8, 32, sse4) \ + mc_rep_funcs(fname, 12, 8, 16, sse4) \ + +MC_REP_FUNCS_SSE4(pixels) +MC_REP_FUNCS_SSE4(4tap_h) +MC_REP_FUNCS_SSE4(4tap_v) +MC_REP_FUNCS_SSE4(4tap_hv) +MC_REP_FUNCS_SSE4(8tap_h) +MC_REP_FUNCS_SSE4(8tap_v) +MC_REP_FUNCS_SSE4(8tap_hv) +mc_rep_funcs(8tap_hv, 8, 8, 16, sse4) + +#if HAVE_AVX2_EXTERNAL + +#define MC_REP_FUNCS_AVX2(fname) \ + mc_rep_funcs(fname, 8, 32, 64, avx2) \ + mc_rep_funcs(fname, 8, 32,128, avx2) \ + mc_rep_funcs(fname,10, 16, 32, avx2) \ + mc_rep_funcs(fname,10, 16, 64, avx2) \ + mc_rep_funcs(fname,10, 16,128, avx2) \ + mc_rep_funcs(fname,12, 16, 32, avx2) \ + mc_rep_funcs(fname,12, 16, 64, avx2) \ + mc_rep_funcs(fname,12, 16,128, avx2) \ + +MC_REP_FUNCS_AVX2(pixels) +MC_REP_FUNCS_AVX2(8tap_h) +MC_REP_FUNCS_AVX2(8tap_v) +MC_REP_FUNCS_AVX2(8tap_hv) +MC_REP_FUNCS_AVX2(4tap_h) +MC_REP_FUNCS_AVX2(4tap_v) +MC_REP_FUNCS_AVX2(4tap_hv) +#endif diff --git a/libavcodec/x86/h26x/h2656dsp.h b/libavcodec/x86/h26x/h2656dsp.h new file mode 100644 index 0000000000..8a2ab13607 --- /dev/null +++ b/libavcodec/x86/h26x/h2656dsp.h @@ -0,0 +1,103 @@ +/* + * DSP for HEVC/VVC + * + * Copyright (C) 2022-2024 Nuo Mi + * Copyright (c) 2023-2024 Wu Jianhua + * + * This file is part of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef AVCODEC_X86_H26X_H2656DSP_H
+#define AVCODEC_X86_H26X_H2656DSP_H
+
+#include "config.h"
+#include "libavutil/x86/asm.h"
+#include "libavutil/x86/cpu.h"
+#include <stdint.h>
+
+#define H2656_PEL_PROTOTYPE(name, D, opt) \
+void ff_h2656_put_ ## name ## _ ## D ## _##opt(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width); \
+void ff_h2656_put_uni_ ## name ## _ ## D ## _##opt(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src, ptrdiff_t _srcstride, int height, const int8_t *hf, const int8_t *vf, int width); \
+
+#define H2656_MC_8TAP_PROTOTYPES(fname, bitd, opt) \
+    H2656_PEL_PROTOTYPE(fname##4, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##6, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##8, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##12, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##16, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##32, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##64, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##128, bitd, opt)
+
+H2656_MC_8TAP_PROTOTYPES(pixels  ,  8, sse4);
+H2656_MC_8TAP_PROTOTYPES(pixels  , 10, sse4);
+H2656_MC_8TAP_PROTOTYPES(pixels  , 12, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_h  ,  8, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_h  , 10, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_h  , 12, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_v  ,  8, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_v  , 10, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_v  , 12, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_hv ,  8, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_hv , 10, sse4);
+H2656_MC_8TAP_PROTOTYPES(8tap_hv , 12, sse4);
+
+#define H2656_MC_4TAP_PROTOTYPES(fname, bitd, opt) \
+    H2656_PEL_PROTOTYPE(fname##2, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##4, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##6, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##8, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##12, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##16, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##32, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##64, bitd, opt); \
+    H2656_PEL_PROTOTYPE(fname##128, bitd, opt)
+
+#define H2656_MC_4TAP_PROTOTYPES_SSE4(bitd) \
+    H2656_PEL_PROTOTYPE(pixels2, bitd, sse4); \
+    H2656_MC_4TAP_PROTOTYPES(4tap_h, bitd, sse4); \
+    H2656_MC_4TAP_PROTOTYPES(4tap_v, bitd, sse4); \
+    H2656_MC_4TAP_PROTOTYPES(4tap_hv, bitd, sse4); \
+
+H2656_MC_4TAP_PROTOTYPES_SSE4(8)
+H2656_MC_4TAP_PROTOTYPES_SSE4(10)
+H2656_MC_4TAP_PROTOTYPES_SSE4(12)
+
+#define H2656_MC_8TAP_PROTOTYPES_AVX2(fname) \
+    H2656_PEL_PROTOTYPE(fname##32 , 8, avx2); \
+    H2656_PEL_PROTOTYPE(fname##64 , 8, avx2); \
+    H2656_PEL_PROTOTYPE(fname##128, 8, avx2); \
+    H2656_PEL_PROTOTYPE(fname##16 ,10, avx2); \
+    H2656_PEL_PROTOTYPE(fname##32 ,10, avx2); \
+    H2656_PEL_PROTOTYPE(fname##64 ,10, avx2); \
+    H2656_PEL_PROTOTYPE(fname##128,10, avx2); \
+    H2656_PEL_PROTOTYPE(fname##16 ,12, avx2); \
+    H2656_PEL_PROTOTYPE(fname##32 ,12, avx2); \
+    H2656_PEL_PROTOTYPE(fname##64 ,12, avx2); \
+    H2656_PEL_PROTOTYPE(fname##128,12, avx2) \
+
+H2656_MC_8TAP_PROTOTYPES_AVX2(pixels);
+H2656_MC_8TAP_PROTOTYPES_AVX2(8tap_h);
+H2656_MC_8TAP_PROTOTYPES_AVX2(8tap_v);
+H2656_MC_8TAP_PROTOTYPES_AVX2(8tap_hv);
+H2656_PEL_PROTOTYPE(8tap_hv16, 8, avx2);
+
+H2656_MC_8TAP_PROTOTYPES_AVX2(4tap_h);
+H2656_MC_8TAP_PROTOTYPES_AVX2(4tap_v);
+H2656_MC_8TAP_PROTOTYPES_AVX2(4tap_hv); + +#endif diff --git a/libavcodec/x86/hevc_mc.asm b/libavcodec/x86/hevc_mc.asm index eb267453fe..5489701e44 100644 --- a/libavcodec/x86/hevc_mc.asm +++ b/libavcodec/x86/hevc_mc.asm @@ -715,35 +715,6 @@ SECTION .text ; int height, int mx, int my) ; ****************************** -%macro HEVC_PUT_HEVC_PEL_PIXELS 2 -HEVC_PEL_PIXELS %1, %2 -HEVC_UNI_PEL_PIXELS %1, %2 -HEVC_BI_PEL_PIXELS %1, %2 -%endmacro - -%macro HEVC_PEL_PIXELS 2 -cglobal hevc_put_hevc_pel_pixels%1_%2, 4, 4, 3, dst, src, srcstride,height - pxor m2, m2 -.loop: - SIMPLE_LOAD %1, %2, srcq, m0 - MC_PIXEL_COMPUTE %1, %2, 1 - PEL_10STORE%1 dstq, m0, m1 - LOOP_END dst, src, srcstride - RET - %endmacro - -%macro HEVC_UNI_PEL_PIXELS 2 -cglobal hevc_put_hevc_uni_pel_pixels%1_%2, 5, 5, 2, dst, dststride, src, srcstride,height -.loop: - SIMPLE_LOAD %1, %2, srcq, m0 - PEL_%2STORE%1 dstq, m0, m1 - add dstq, dststrideq ; dst += dststride - add srcq, srcstrideq ; src += srcstride - dec heightd ; cmp height - jnz .loop ; height loop - RET -%endmacro - %macro HEVC_BI_PEL_PIXELS 2 cglobal hevc_put_hevc_bi_pel_pixels%1_%2, 6, 6, 6, dst, dststride, src, srcstride, src2, height pxor m2, m2 @@ -777,32 +748,8 @@ cglobal hevc_put_hevc_bi_pel_pixels%1_%2, 6, 6, 6, dst, dststride, src, srcstrid %define XMM_REGS 8 %endif -cglobal hevc_put_hevc_epel_h%1_%2, 5, 6, XMM_REGS, dst, src, srcstride, height, mx, rfilter -%assign %%stride ((%2 + 7)/8) - EPEL_FILTER %2, mx, m4, m5, rfilter -.loop: - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m4, m5, 1 - PEL_10STORE%1 dstq, m0, m1 - LOOP_END dst, src, srcstride - RET - -cglobal hevc_put_hevc_uni_epel_h%1_%2, 6, 7, XMM_REGS, dst, dststride, src, srcstride, height, mx, rfilter -%assign %%stride ((%2 + 7)/8) - movdqa m6, [pw_%2] - EPEL_FILTER %2, mx, m4, m5, rfilter -.loop: - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m4, m5 - UNI_COMPUTE %1, %2, m0, m1, m6 - PEL_%2STORE%1 dstq, m0, m1 - add dstq, dststrideq ; dst += dststride - add srcq, srcstrideq ; src += srcstride - dec heightd ; cmp height - jnz .loop ; height loop - RET - cglobal hevc_put_hevc_bi_epel_h%1_%2, 7, 8, XMM_REGS, dst, dststride, src, srcstride, src2, height, mx, rfilter +%assign %%stride ((%2 + 7)/8) movdqa m6, [pw_bi_%2] EPEL_FILTER %2, mx, m4, m5, rfilter .loop: @@ -824,36 +771,6 @@ cglobal hevc_put_hevc_bi_epel_h%1_%2, 7, 8, XMM_REGS, dst, dststride, src, srcst ; int height, int mx, int my, int width) ; ****************************** -cglobal hevc_put_hevc_epel_v%1_%2, 4, 6, XMM_REGS, dst, src, srcstride, height, r3src, my - movifnidn myd, mym - sub srcq, srcstrideq - EPEL_FILTER %2, my, m4, m5, r3src - lea r3srcq, [srcstrideq*3] -.loop: - EPEL_LOAD %2, srcq, srcstride, %1 - EPEL_COMPUTE %2, %1, m4, m5, 1 - PEL_10STORE%1 dstq, m0, m1 - LOOP_END dst, src, srcstride - RET - -cglobal hevc_put_hevc_uni_epel_v%1_%2, 5, 7, XMM_REGS, dst, dststride, src, srcstride, height, r3src, my - movifnidn myd, mym - movdqa m6, [pw_%2] - sub srcq, srcstrideq - EPEL_FILTER %2, my, m4, m5, r3src - lea r3srcq, [srcstrideq*3] -.loop: - EPEL_LOAD %2, srcq, srcstride, %1 - EPEL_COMPUTE %2, %1, m4, m5 - UNI_COMPUTE %1, %2, m0, m1, m6 - PEL_%2STORE%1 dstq, m0, m1 - add dstq, dststrideq ; dst += dststride - add srcq, srcstrideq ; src += srcstride - dec heightd ; cmp height - jnz .loop ; height loop - RET - - cglobal hevc_put_hevc_bi_epel_v%1_%2, 6, 8, XMM_REGS, dst, dststride, src, srcstride, src2, height, r3src, my movifnidn myd, mym movdqa m6, [pw_bi_%2] @@ -882,135 +799,6 @@ cglobal 
hevc_put_hevc_bi_epel_v%1_%2, 6, 8, XMM_REGS, dst, dststride, src, srcst ; ****************************** %macro HEVC_PUT_HEVC_EPEL_HV 2 -cglobal hevc_put_hevc_epel_hv%1_%2, 6, 7, 16 , dst, src, srcstride, height, mx, my, r3src -%assign %%stride ((%2 + 7)/8) - sub srcq, srcstrideq - EPEL_HV_FILTER %2 - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m14, m15 -%if (%1 > 8 && (%2 == 8)) - SWAP m8, m1 -%endif - SWAP m4, m0 - add srcq, srcstrideq - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m14, m15 -%if (%1 > 8 && (%2 == 8)) - SWAP m9, m1 -%endif - SWAP m5, m0 - add srcq, srcstrideq - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m14, m15 -%if (%1 > 8 && (%2 == 8)) - SWAP m10, m1 -%endif - SWAP m6, m0 - add srcq, srcstrideq -.loop: - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m14, m15 -%if (%1 > 8 && (%2 == 8)) - SWAP m11, m1 -%endif - SWAP m7, m0 - punpcklwd m0, m4, m5 - punpcklwd m2, m6, m7 -%if %1 > 4 - punpckhwd m1, m4, m5 - punpckhwd m3, m6, m7 -%endif - EPEL_COMPUTE 14, %1, m12, m13 -%if (%1 > 8 && (%2 == 8)) - punpcklwd m4, m8, m9 - punpcklwd m2, m10, m11 - punpckhwd m8, m8, m9 - punpckhwd m3, m10, m11 - EPEL_COMPUTE 14, %1, m12, m13, m4, m2, m8, m3 -%if cpuflag(avx2) - vinserti128 m2, m0, xm4, 1 - vperm2i128 m3, m0, m4, q0301 - PEL_10STORE%1 dstq, m2, m3 -%else - PEL_10STORE%1 dstq, m0, m4 -%endif -%else - PEL_10STORE%1 dstq, m0, m1 -%endif - movdqa m4, m5 - movdqa m5, m6 - movdqa m6, m7 -%if (%1 > 8 && (%2 == 8)) - mova m8, m9 - mova m9, m10 - mova m10, m11 -%endif - LOOP_END dst, src, srcstride - RET - -cglobal hevc_put_hevc_uni_epel_hv%1_%2, 7, 8, 16 , dst, dststride, src, srcstride, height, mx, my, r3src -%assign %%stride ((%2 + 7)/8) - sub srcq, srcstrideq - EPEL_HV_FILTER %2 - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m14, m15 -%if (%1 > 8 && (%2 == 8)) - SWAP m8, m1 -%endif - SWAP m4, m0 - add srcq, srcstrideq - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m14, m15 -%if (%1 > 8 && (%2 == 8)) - SWAP m9, m1 -%endif - SWAP m5, m0 - add srcq, srcstrideq - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m14, m15 -%if (%1 > 8 && (%2 == 8)) - SWAP m10, m1 -%endif - SWAP m6, m0 - add srcq, srcstrideq -.loop: - EPEL_LOAD %2, srcq-%%stride, %%stride, %1 - EPEL_COMPUTE %2, %1, m14, m15 -%if (%1 > 8 && (%2 == 8)) - SWAP m11, m1 -%endif - mova m7, m0 - punpcklwd m0, m4, m5 - punpcklwd m2, m6, m7 -%if %1 > 4 - punpckhwd m1, m4, m5 - punpckhwd m3, m6, m7 -%endif - EPEL_COMPUTE 14, %1, m12, m13 -%if (%1 > 8 && (%2 == 8)) - punpcklwd m4, m8, m9 - punpcklwd m2, m10, m11 - punpckhwd m8, m8, m9 - punpckhwd m3, m10, m11 - EPEL_COMPUTE 14, %1, m12, m13, m4, m2, m8, m3 - UNI_COMPUTE %1, %2, m0, m4, [pw_%2] -%else - UNI_COMPUTE %1, %2, m0, m1, [pw_%2] -%endif - PEL_%2STORE%1 dstq, m0, m1 - mova m4, m5 - mova m5, m6 - mova m6, m7 -%if (%1 > 8 && (%2 == 8)) - mova m8, m9 - mova m9, m10 - mova m10, m11 -%endif - add dstq, dststrideq ; dst += dststride - add srcq, srcstrideq ; src += srcstride - dec heightd ; cmp height - jnz .loop ; height loop - RET cglobal hevc_put_hevc_bi_epel_hv%1_%2, 8, 9, 16, dst, dststride, src, srcstride, src2, height, mx, my, r3src %assign %%stride ((%2 + 7)/8) @@ -1093,34 +881,6 @@ cglobal hevc_put_hevc_bi_epel_hv%1_%2, 8, 9, 16, dst, dststride, src, srcstride, ; ****************************** %macro HEVC_PUT_HEVC_QPEL 2 -cglobal hevc_put_hevc_qpel_h%1_%2, 5, 6, 16, dst, src, srcstride, height, mx, rfilter - QPEL_FILTER %2, mx -.loop: - 
QPEL_H_LOAD %2, srcq, %1, 10 - QPEL_COMPUTE %1, %2, 1 -%if %2 > 8 - packssdw m0, m1 -%endif - PEL_10STORE%1 dstq, m0, m1 - LOOP_END dst, src, srcstride - RET - -cglobal hevc_put_hevc_uni_qpel_h%1_%2, 6, 7, 16 , dst, dststride, src, srcstride, height, mx, rfilter - mova m9, [pw_%2] - QPEL_FILTER %2, mx -.loop: - QPEL_H_LOAD %2, srcq, %1, 10 - QPEL_COMPUTE %1, %2 -%if %2 > 8 - packssdw m0, m1 -%endif - UNI_COMPUTE %1, %2, m0, m1, m9 - PEL_%2STORE%1 dstq, m0, m1 - add dstq, dststrideq ; dst += dststride - add srcq, srcstrideq ; src += srcstride - dec heightd ; cmp height - jnz .loop ; height loop - RET cglobal hevc_put_hevc_bi_qpel_h%1_%2, 7, 8, 16 , dst, dststride, src, srcstride, src2, height, mx, rfilter movdqa m9, [pw_bi_%2] @@ -1148,38 +908,6 @@ cglobal hevc_put_hevc_bi_qpel_h%1_%2, 7, 8, 16 , dst, dststride, src, srcstride, ; int height, int mx, int my, int width) ; ****************************** -cglobal hevc_put_hevc_qpel_v%1_%2, 4, 8, 16, dst, src, srcstride, height, r3src, my, rfilter - movifnidn myd, mym - lea r3srcq, [srcstrideq*3] - QPEL_FILTER %2, my -.loop: - QPEL_V_LOAD %2, srcq, srcstride, %1, r7 - QPEL_COMPUTE %1, %2, 1 -%if %2 > 8 - packssdw m0, m1 -%endif - PEL_10STORE%1 dstq, m0, m1 - LOOP_END dst, src, srcstride - RET - -cglobal hevc_put_hevc_uni_qpel_v%1_%2, 5, 9, 16, dst, dststride, src, srcstride, height, r3src, my, rfilter - movifnidn myd, mym - movdqa m9, [pw_%2] - lea r3srcq, [srcstrideq*3] - QPEL_FILTER %2, my -.loop: - QPEL_V_LOAD %2, srcq, srcstride, %1, r8 - QPEL_COMPUTE %1, %2 -%if %2 > 8 - packssdw m0, m1 -%endif - UNI_COMPUTE %1, %2, m0, m1, m9 - PEL_%2STORE%1 dstq, m0, m1 - add dstq, dststrideq ; dst += dststride - add srcq, srcstrideq ; src += srcstride - dec heightd ; cmp height - jnz .loop ; height loop - RET cglobal hevc_put_hevc_bi_qpel_v%1_%2, 6, 10, 16, dst, dststride, src, srcstride, src2, height, r3src, my, rfilter movifnidn myd, mym @@ -1210,162 +938,6 @@ cglobal hevc_put_hevc_bi_qpel_v%1_%2, 6, 10, 16, dst, dststride, src, srcstride, ; int height, int mx, int my) ; ****************************** %macro HEVC_PUT_HEVC_QPEL_HV 2 -cglobal hevc_put_hevc_qpel_hv%1_%2, 6, 8, 16, dst, src, srcstride, height, mx, my, r3src, rfilter -%if cpuflag(avx2) -%assign %%shift 4 -%else -%assign %%shift 3 -%endif - sub mxq, 1 - sub myq, 1 - shl mxq, %%shift ; multiply by 32 - shl myq, %%shift ; multiply by 32 - lea r3srcq, [srcstrideq*3] - sub srcq, r3srcq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m8, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m9, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m10, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m11, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m12, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m13, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m14, m0 - add srcq, srcstrideq -.loop: - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m15, m0 - punpcklwd m0, m8, m9 - punpcklwd m2, m10, m11 - punpcklwd m4, m12, m13 - punpcklwd m6, m14, m15 -%if %1 > 4 - punpckhwd m1, m8, m9 - punpckhwd m3, m10, m11 - punpckhwd m5, m12, m13 - punpckhwd m7, m14, m15 -%endif - QPEL_HV_COMPUTE %1, 14, my, ackssdw - PEL_10STORE%1 dstq, m0, m1 -%if %1 
<= 4 - movq m8, m9 - movq m9, m10 - movq m10, m11 - movq m11, m12 - movq m12, m13 - movq m13, m14 - movq m14, m15 -%else - movdqa m8, m9 - movdqa m9, m10 - movdqa m10, m11 - movdqa m11, m12 - movdqa m12, m13 - movdqa m13, m14 - movdqa m14, m15 -%endif - LOOP_END dst, src, srcstride - RET - -cglobal hevc_put_hevc_uni_qpel_hv%1_%2, 7, 9, 16 , dst, dststride, src, srcstride, height, mx, my, r3src, rfilter -%if cpuflag(avx2) -%assign %%shift 4 -%else -%assign %%shift 3 -%endif - sub mxq, 1 - sub myq, 1 - shl mxq, %%shift ; multiply by 32 - shl myq, %%shift ; multiply by 32 - lea r3srcq, [srcstrideq*3] - sub srcq, r3srcq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m8, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m9, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m10, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m11, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m12, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m13, m0 - add srcq, srcstrideq - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m14, m0 - add srcq, srcstrideq -.loop: - QPEL_H_LOAD %2, srcq, %1, 15 - QPEL_HV_COMPUTE %1, %2, mx, ackssdw - SWAP m15, m0 - punpcklwd m0, m8, m9 - punpcklwd m2, m10, m11 - punpcklwd m4, m12, m13 - punpcklwd m6, m14, m15 -%if %1 > 4 - punpckhwd m1, m8, m9 - punpckhwd m3, m10, m11 - punpckhwd m5, m12, m13 - punpckhwd m7, m14, m15 -%endif - QPEL_HV_COMPUTE %1, 14, my, ackusdw - UNI_COMPUTE %1, %2, m0, m1, [pw_%2] - PEL_%2STORE%1 dstq, m0, m1 - -%if %1 <= 4 - movq m8, m9 - movq m9, m10 - movq m10, m11 - movq m11, m12 - movq m12, m13 - movq m13, m14 - movq m14, m15 -%else - mova m8, m9 - mova m9, m10 - mova m10, m11 - mova m11, m12 - mova m12, m13 - mova m13, m14 - mova m14, m15 -%endif - add dstq, dststrideq ; dst += dststride - add srcq, srcstrideq ; src += srcstride - dec heightd ; cmp height - jnz .loop ; height loop - RET cglobal hevc_put_hevc_bi_qpel_hv%1_%2, 8, 10, 16, dst, dststride, src, srcstride, src2, height, mx, my, r3src, rfilter %if cpuflag(avx2) @@ -1613,22 +1185,22 @@ WEIGHTING_FUNCS 4, 12 WEIGHTING_FUNCS 6, 12 WEIGHTING_FUNCS 8, 12 -HEVC_PUT_HEVC_PEL_PIXELS 2, 8 -HEVC_PUT_HEVC_PEL_PIXELS 4, 8 -HEVC_PUT_HEVC_PEL_PIXELS 6, 8 -HEVC_PUT_HEVC_PEL_PIXELS 8, 8 -HEVC_PUT_HEVC_PEL_PIXELS 12, 8 -HEVC_PUT_HEVC_PEL_PIXELS 16, 8 +HEVC_BI_PEL_PIXELS 2, 8 +HEVC_BI_PEL_PIXELS 4, 8 +HEVC_BI_PEL_PIXELS 6, 8 +HEVC_BI_PEL_PIXELS 8, 8 +HEVC_BI_PEL_PIXELS 12, 8 +HEVC_BI_PEL_PIXELS 16, 8 -HEVC_PUT_HEVC_PEL_PIXELS 2, 10 -HEVC_PUT_HEVC_PEL_PIXELS 4, 10 -HEVC_PUT_HEVC_PEL_PIXELS 6, 10 -HEVC_PUT_HEVC_PEL_PIXELS 8, 10 +HEVC_BI_PEL_PIXELS 2, 10 +HEVC_BI_PEL_PIXELS 4, 10 +HEVC_BI_PEL_PIXELS 6, 10 +HEVC_BI_PEL_PIXELS 8, 10 -HEVC_PUT_HEVC_PEL_PIXELS 2, 12 -HEVC_PUT_HEVC_PEL_PIXELS 4, 12 -HEVC_PUT_HEVC_PEL_PIXELS 6, 12 -HEVC_PUT_HEVC_PEL_PIXELS 8, 12 +HEVC_BI_PEL_PIXELS 2, 12 +HEVC_BI_PEL_PIXELS 4, 12 +HEVC_BI_PEL_PIXELS 6, 12 +HEVC_BI_PEL_PIXELS 8, 12 HEVC_PUT_HEVC_EPEL 2, 8 HEVC_PUT_HEVC_EPEL 4, 8 @@ -1693,8 +1265,8 @@ HEVC_PUT_HEVC_QPEL_HV 8, 12 %if HAVE_AVX2_EXTERNAL INIT_YMM avx2 ; adds ff_ and _avx2 to function name & enables 256b registers : m0 for 256b, xm0 for 128b. 
cpuflag(avx2) = 1 / notcpuflag(avx) = 0 -HEVC_PUT_HEVC_PEL_PIXELS 32, 8 -HEVC_PUT_HEVC_PEL_PIXELS 16, 10 +HEVC_BI_PEL_PIXELS 32, 8 +HEVC_BI_PEL_PIXELS 16, 10 HEVC_PUT_HEVC_EPEL 32, 8 HEVC_PUT_HEVC_EPEL 16, 10 diff --git a/libavcodec/x86/hevcdsp_init.c b/libavcodec/x86/hevcdsp_init.c index 6f45e5e0db..5c19330e19 100644 --- a/libavcodec/x86/hevcdsp_init.c +++ b/libavcodec/x86/hevcdsp_init.c @@ -1,6 +1,7 @@ /* * Copyright (c) 2013 Seppo Tomperi - * Copyright (c) 2013 - 2014 Pierre-Edouard Lepere + * Copyright (c) 2013-2014 Pierre-Edouard Lepere + * Copyright (c) 2023-2024 Wu Jianhua * * This file is part of FFmpeg. * @@ -27,6 +28,7 @@ #include "libavutil/x86/cpu.h" #include "libavcodec/hevcdsp.h" #include "libavcodec/x86/hevcdsp.h" +#include "libavcodec/x86/h26x/h2656dsp.h" #define LFC_FUNC(DIR, DEPTH, OPT) \ void ff_hevc_ ## DIR ## _loop_filter_chroma_ ## DEPTH ## _ ## OPT(uint8_t *pix, ptrdiff_t stride, const int *tc, const uint8_t *no_p, const uint8_t *no_q); @@ -83,6 +85,110 @@ void ff_hevc_idct_32x32_10_ ## opt(int16_t *coeffs, int col_limit); IDCT_FUNCS(sse2) IDCT_FUNCS(avx) + +#define ff_hevc_pel_filters ff_hevc_qpel_filters +#define DECL_HV_FILTER(f) \ + const uint8_t *hf = ff_hevc_ ## f ## _filters[mx - 1]; \ + const uint8_t *vf = ff_hevc_ ## f ## _filters[my - 1]; + +#define FW_PUT(p, a, b, depth, opt) \ +void ff_hevc_put_hevc_ ## a ## _ ## depth ## _##opt(int16_t *dst, const uint8_t *src, ptrdiff_t srcstride, \ + int height, intptr_t mx, intptr_t my,int width) \ +{ \ + DECL_HV_FILTER(p) \ + ff_h2656_put_ ## b ## _ ## depth ## _##opt(dst, src, srcstride, height, hf, vf, width); \ +} + +#define FW_PUT_UNI(p, a, b, depth, opt) \ +void ff_hevc_put_hevc_uni_ ## a ## _ ## depth ## _##opt(uint8_t *dst, ptrdiff_t dststride, \ + const uint8_t *src, ptrdiff_t srcstride, \ + int height, intptr_t mx, intptr_t my, int width) \ +{ \ + DECL_HV_FILTER(p) \ + ff_h2656_put_uni_ ## b ## _ ## depth ## _##opt(dst, dststride, src, srcstride, height, hf, vf, width); \ +} + +#if ARCH_X86_64 && HAVE_SSE4_EXTERNAL + +#define FW_PUT_FUNCS(p, a, b, depth, opt) \ + FW_PUT(p, a, b, depth, opt) \ + FW_PUT_UNI(p, a, b, depth, opt) + +#define FW_PEL(w, depth, opt) FW_PUT_FUNCS(pel, pel_pixels##w, pixels##w, depth, opt) + +#define FW_DIR(npel, n, w, depth, opt) \ + FW_PUT_FUNCS(npel, npel ## _h##w, n ## tap_h##w, depth, opt) \ + FW_PUT_FUNCS(npel, npel ## _v##w, n ## tap_v##w, depth, opt) + +#define FW_DIR_HV(npel, n, w, depth, opt) \ + FW_PUT_FUNCS(npel, npel ## _hv##w, n ## tap_hv##w, depth, opt) + +FW_PEL(4, 8, sse4); +FW_PEL(6, 8, sse4); +FW_PEL(8, 8, sse4); +FW_PEL(12, 8, sse4); +FW_PEL(16, 8, sse4); +FW_PEL(4, 10, sse4); +FW_PEL(6, 10, sse4); +FW_PEL(8, 10, sse4); +FW_PEL(4, 12, sse4); +FW_PEL(6, 12, sse4); +FW_PEL(8, 12, sse4); + +#define FW_EPEL(w, depth, opt) FW_DIR(epel, 4, w, depth, opt) +#define FW_EPEL_HV(w, depth, opt) FW_DIR_HV(epel, 4, w, depth, opt) +#define FW_EPEL_FUNCS(w, depth, opt) \ + FW_EPEL(w, depth, opt) \ + FW_EPEL_HV(w, depth, opt) + +FW_EPEL(12, 8, sse4); + +FW_EPEL_FUNCS(4, 8, sse4); +FW_EPEL_FUNCS(6, 8, sse4); +FW_EPEL_FUNCS(8, 8, sse4); +FW_EPEL_FUNCS(16, 8, sse4); +FW_EPEL_FUNCS(4, 10, sse4); +FW_EPEL_FUNCS(6, 10, sse4); +FW_EPEL_FUNCS(8, 10, sse4); +FW_EPEL_FUNCS(4, 12, sse4); +FW_EPEL_FUNCS(6, 12, sse4); +FW_EPEL_FUNCS(8, 12, sse4); + +#define FW_QPEL(w, depth, opt) FW_DIR(qpel, 8, w, depth, opt) +#define FW_QPEL_HV(w, depth, opt) FW_DIR_HV(qpel, 8, w, depth, opt) +#define FW_QPEL_FUNCS(w, depth, opt) \ + FW_QPEL(w, depth, opt) \ + FW_QPEL_HV(w, depth, opt) + +FW_QPEL(12, 8, 
sse4);
+FW_QPEL(16, 8, sse4);
+
+FW_QPEL_FUNCS(4, 8, sse4);
+FW_QPEL_FUNCS(8, 8, sse4);
+FW_QPEL_FUNCS(4, 10, sse4);
+FW_QPEL_FUNCS(8, 10, sse4);
+FW_QPEL_FUNCS(4, 12, sse4);
+FW_QPEL_FUNCS(8, 12, sse4);
+
+#if HAVE_AVX2_EXTERNAL
+
+FW_PEL(32, 8, avx2);
+FW_PUT(pel, pel_pixels16, pixels16, 10, avx2);
+
+FW_EPEL(32, 8, avx2);
+FW_EPEL(16, 10, avx2);
+
+FW_EPEL_HV(32, 8, avx2);
+FW_EPEL_HV(16, 10, avx2);
+
+FW_QPEL(32, 8, avx2);
+FW_QPEL(16, 10, avx2);
+
+FW_QPEL_HV(16, 10, avx2);
+
+#endif
+#endif
+
 #define mc_rep_func(name, bitd, step, W, opt) \
 void ff_hevc_put_hevc_##name##W##_##bitd##_##opt(int16_t *_dst, \
                                                  const uint8_t *_src, ptrdiff_t _srcstride, int height, \
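
Not part of the patch, but to make the reuse claim in the commit message
concrete: below is a minimal sketch of how a VVC DSP init could forward one
of its motion compensation cases into the shared entry points, in the same
way the FW_PUT/DECL_HV_FILTER wrappers above do for HEVC. The vvc_* names
and the ff_vvc_qpel_filters table are hypothetical placeholders; only
ff_h2656_put_8tap_h16_8_sse4 and its (dst, src, srcstride, height, hf, vf,
width) signature come from this patch.

#include <stddef.h>
#include <stdint.h>
#include "libavcodec/x86/h26x/h2656dsp.h"

/* Hypothetical VVC 8-tap luma filter table, one row of taps per
 * fractional sample position. */
extern const int8_t ff_vvc_qpel_filters[16][8];

/* Forward the 16-wide, 8-bit horizontal qpel case to the shared SSE4
 * kernel from h2656_inter.asm. */
static void vvc_put_qpel_h16_8_sse4(int16_t *dst, const uint8_t *src,
                                    ptrdiff_t srcstride, int height,
                                    int mx, int my, int width)
{
    /* hf selects the horizontal filter; the H-only kernel ignores vf,
     * which exists only to keep one signature across h/v/hv variants. */
    const int8_t *hf = ff_vvc_qpel_filters[mx];
    const int8_t *vf = ff_vvc_qpel_filters[my];

    ff_h2656_put_8tap_h16_8_sse4(dst, src, srcstride, height, hf, vf, width);
}

Block widths above what one kernel covers would go through a small splitting
wrapper analogous to mc_rep_func in h2656dsp.c, stepping across the block
one kernel width at a time.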