From patchwork Thu Jul 25 13:35:45 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nuo Mi
X-Patchwork-Id: 50727
From: Nuo Mi
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 25 Jul 2024 21:35:45 +0800
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240725133546.19125-1-nuomi2021@gmail.com>
References: <20240725133546.19125-1-nuomi2021@gmail.com>
X-Microsoft-Original-Message-ID: <20240725133546.19125-2-nuomi2021@gmail.com>
Subject: [FFmpeg-devel] [PATCH 2/3] x86/vvcdec: add dmvr avx2 code
Cc: Nuo Mi

Decoder-Side Motion Vector Refinement (DMVR) accounts for about 4~8% of
CPU usage on some clips.

Here are the results from a single test run:

clips                                     | before| after | delta
------------------------------------------|-------|-------|------
RitualDance_1920x1080_60_10_420_37_RA.266 | 338.7 | 354.3 |4.61%
NovosobornayaSquare_1920x1080.bin         | 320.3 | 329.3 |2.81%
Tango2_3840x2160_60_10_420_27_LD.266      | 83.3  | 83.7  |0.48%
RitualDance_1920x1080_60_10_420_32_LD.266 | 320.7 | 327.3 |2.06%
Chimera_8bit_1080P_1000_frames.vvc        | 360.7 | 381.0 |5.63%
BQTerrace_1920x1080_60_10_420_22_RA.vvc   | 161.7 | 163.0 |0.80%
---
 libavcodec/x86/vvc/Makefile      |   1 +
 libavcodec/x86/vvc/vvc_dmvr.asm  | 373 +++++++++++++++++++++++++++++++
 libavcodec/x86/vvc/vvcdsp_init.c |  25 +++
 3 files changed, 399 insertions(+)
 create mode 100644 libavcodec/x86/vvc/vvc_dmvr.asm

diff --git a/libavcodec/x86/vvc/Makefile b/libavcodec/x86/vvc/Makefile
index 832d802daf..04f16bc10c 100644
--- a/libavcodec/x86/vvc/Makefile
+++ b/libavcodec/x86/vvc/Makefile
@@ -4,6 +4,7 @@ clean::
 OBJS-$(CONFIG_VVC_DECODER)             += x86/vvc/vvcdsp_init.o \
                                           x86/h26x/h2656dsp.o
 X86ASM-OBJS-$(CONFIG_VVC_DECODER)      += x86/vvc/vvc_alf.o      \
+                                          x86/vvc/vvc_dmvr.o     \
                                           x86/vvc/vvc_mc.o       \
                                           x86/vvc/vvc_sad.o      \
                                           x86/h26x/h2656_inter.o
diff --git a/libavcodec/x86/vvc/vvc_dmvr.asm b/libavcodec/x86/vvc/vvc_dmvr.asm
new file mode 100644
index 0000000000..4c971f970b
--- /dev/null
+++ b/libavcodec/x86/vvc/vvc_dmvr.asm
@@ -0,0 +1,373 @@
+; /*
+; * Provide AVX2 luma dmvr functions for VVC decoding
+; * Copyright (c) 2024 Nuo Mi
+; *
+; * This file is part of FFmpeg.
+; *
+; * FFmpeg is free software; you can redistribute it and/or
+; * modify it under the terms of the GNU Lesser General Public
+; * License as published by the Free Software Foundation; either
+; * version 2.1 of the License, or (at your option) any later version.
+; *
+; * FFmpeg is distributed in the hope that it will be useful,
+; * but WITHOUT ANY WARRANTY; without even the implied warranty of
+; * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+; * Lesser General Public License for more details.
+; *
+; * You should have received a copy of the GNU Lesser General Public
+; * License along with FFmpeg; if not, write to the Free Software
+; * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+; */
+%include "libavutil/x86/x86util.asm"
+
+%define MAX_PB_SIZE 128
+
+SECTION_RODATA 32
+
+shift_12        times 2 dw 1 << (15 - (12 - 10))
+shift3_8        times 2 dw 1 << (15 - (8 - 6))
+shift3_10       times 2 dw 1 << (15 - (10 - 6))
+shift3_12       times 2 dw 1 << (15 - (12 - 6))
+pw_16           times 2 dw 16
+
+%if ARCH_X86_64
+
+%if HAVE_AVX2_EXTERNAL
+
+SECTION .text
+
+%define pstride (bd / 10 + 1)
+
+; LOAD_W16(dst, src)
+%macro LOAD_W16 2
+%if bd == 8
+    pmovzxbw %1, %2
+%else
+    movu %1, %2
+%endif
+%endmacro
+
+%macro SHIFT_W16 2
+%if bd == 8
+    psllw %1, (10 - bd)
+%elif bd == 10
+    ; nothing
+%else
+    pmulhrsw %1, %2
+%endif
+%endmacro
+
+%macro SAVE_W16 2
+    movu %1, %2
+%endmacro
+
+; NEXT_4_LINES(is_h)
+%macro NEXT_4_LINES 1
+    lea dstq, [dstq + dsq*4]
+    lea srcq, [srcq + ssq*4]
+%if %1
+    lea src1q, [srcq + pstride]
+%endif
+%endmacro
+
+
+; DMVR_4xW16(dst, dst_stride, dst_stride3, src, src_stride, src_stride3)
+%macro DMVR_4xW16 6
+    LOAD_W16 m0, [%4]
+    LOAD_W16 m1, [%4 + %5]
+    LOAD_W16 m2, [%4 + 2 * %5]
+    LOAD_W16 m3, [%4 + %6]
+
+    SHIFT_W16 m0, m4
+    SHIFT_W16 m1, m4
+    SHIFT_W16 m2, m4
+    SHIFT_W16 m3, m4
+
+    SAVE_W16 [%1]         , m0
+    SAVE_W16 [%1 + %2]    , m1
+    SAVE_W16 [%1 + 2 * %2], m2
+    SAVE_W16 [%1 + %3]    , m3
+%endmacro
+
+; buf += -stride * h + off
+; OFFSET_TO_W4(buf, stride, off)
+%macro OFFSET_TO_W4 3
+    mov id, hd
+    imul iq, %2
+    sub %1, iq
+    lea %1, [%1 + %3]
+%endmacro
+
+%macro OFFSET_TO_W4 0
+    OFFSET_TO_W4 srcq, ssq, 16 * (bd / 10 + 1)
+    OFFSET_TO_W4 dstq, dsq, 16 * 2
+%endmacro
+
+; void ff_vvc_dmvr_%1_avx2(int16_t *dst, const uint8_t *src, ptrdiff_t src_stride,
+;     int height, intptr_t mx, intptr_t my, int width);
+%macro DMVR_AVX2 1
+cglobal vvc_dmvr_%1, 4, 9, 5, dst, src, ss, h, ds, ds3, w, ss3, i
+%define bd %1
+
+    LOAD_STRIDES
+
+%if %1 > 10
+    vpbroadcastd m4, [shift_%1]
+%endif
+
+    mov wd, wm
+    mov id, hd
+.w16:
+    sub id, 4
+    jl .w16_end
+    DMVR_4xW16 dstq, dsq, ds3q, srcq, ssq, ss3q
+    NEXT_4_LINES 0
+    jmp .w16
+.w16_end:
+
+    sub wd, 16
+    jl .w4_end
+
+    OFFSET_TO_W4
+.w4:
+    sub hd, 4
+    jl .w4_end
+    DMVR_4xW16 dstq, dsq, ds3q, srcq, ssq, ss3q
+    NEXT_4_LINES 0
+    jmp .w4
+.w4_end:
+
+    RET
+%endmacro
+
+; LOAD_COEFFS(coeffs0, coeffs1, src)
+%macro LOAD_COEFFS 3
+    movd xm%2, %3
+    vpbroadcastw m%2, xm%2
+    vpbroadcastd m%1, [pw_16]
+    psubw m%1, m%2
+%endmacro
+
+; LOAD_SHIFT(shift, src)
+%macro LOAD_SHIFT 2
+    vpbroadcastd %1, [%2]
+%if bd == 12
+    psllw %1, 1 ; avoid signed mul for pmulhrsw
+%endif
+%endmacro
+
+; LOAD_STRIDES()
+%macro LOAD_STRIDES 0
+    mov dsq, MAX_PB_SIZE * 2
+    lea ss3q, [ssq*3]
+    lea ds3q, [dsq*3]
+%endmacro
+
+; BILINEAR(dst/src0, src1, coeff0, coeff1, round, tmp)
+%macro BILINEAR 6
+    pmullw %1, %3
+    pmullw %6, %2, %4
+    paddw %1, %6
+%if bd == 12
+    psrlw %1, 1 ; avoid signed mul for pmulhrsw
+%endif
+    pmulhrsw %1, %5
+%endmacro
+
+; DMVR_H_1xW16(dst, src0, src1, offset, tmp)
+%macro DMVR_H_1xW16 5
+    LOAD_W16 %1, [%2 + %4]
+    LOAD_W16 %5, [%3 + %4]
+    BILINEAR %1, %5, m10, m11, m12, %5
+%endmacro
+
+; DMVR_H_4xW16(dst, dst_stride, dst_stride3, src, src_stride, src_stride3, src1)
+%macro DMVR_H_4xW16 7
+    DMVR_H_1xW16 m0, %4, %7, 0, m4
+    DMVR_H_1xW16 m1, %4, %7, %5, m5
+    DMVR_H_1xW16 m2, %4, %7, 2 * %5, m6
+    DMVR_H_1xW16 m3, %4, %7, %6, m7
+
+    SAVE_W16 [%1]         , m0
+    SAVE_W16 [%1 + %2]    , m1
+    SAVE_W16 [%1 + 2 * %2], m2
+    SAVE_W16 [%1 + %3]    , m3
+%endmacro
+
+; void ff_vvc_dmvr_h_%1_avx2(int16_t *dst, const uint8_t *src, ptrdiff_t src_stride,
+;     int height, intptr_t mx, intptr_t my, int width);
+%macro DMVR_H_AVX2 1
+cglobal vvc_dmvr_h_%1, 4, 10, 13, dst, src, ss, h, ds, ds3, w, ss3, src1, i
+%define bd %1
+
+    LOAD_COEFFS 10, 11, dsm
+    LOAD_SHIFT m12, shift3_%1
+
+    LOAD_STRIDES
+    lea src1q, [srcq + pstride]
+
+    mov wd, wm
+    mov id, hd
+.w16:
+    sub id, 4
+    jl .w16_end
+    DMVR_H_4xW16 dstq, dsq, ds3q, srcq, ssq, ss3q, src1q
+    NEXT_4_LINES 1
+    jmp .w16
+.w16_end:
+
+    sub wd, 16
+    jl .w4_end
+
+    OFFSET_TO_W4
+    lea src1q, [srcq + pstride]
+.w4:
+    sub hd, 4
+    jl .w4_end
+    DMVR_H_4xW16 dstq, dsq, ds3q, srcq, ssq, ss3q, src1q
+    NEXT_4_LINES 1
+    jmp .w4
+.w4_end:
+
+    RET
+%endmacro
+
+; DMVR_V_4xW16(dst, dst_stride, dst_stride3, src, src_stride, src_stride3)
+%macro DMVR_V_4xW16 6
+    LOAD_W16 m1, [%4 + %5]
+    LOAD_W16 m2, [%4 + 2 * %5]
+    LOAD_W16 m3, [%4 + %6]
+    LOAD_W16 m4, [%4 + 4 * %5]
+
+    BILINEAR m0, m1, m8, m9, m10, m11
+    BILINEAR m1, m2, m8, m9, m10, m12
+    BILINEAR m2, m3, m8, m9, m10, m13
+    BILINEAR m3, m4, m8, m9, m10, m14
+
+    SAVE_W16 [%1]         , m0
+    SAVE_W16 [%1 + %2]    , m1
+    SAVE_W16 [%1 + 2 * %2], m2
+    SAVE_W16 [%1 + %3]    , m3
+
+    ; why can't we use SWAP m0, m4 here?
+    movaps m0, m4
+%endmacro
+
+; void ff_vvc_dmvr_v_%1_avx2(int16_t *dst, const uint8_t *src, ptrdiff_t src_stride,
+;     int height, intptr_t mx, intptr_t my, int width);
+%macro DMVR_V_AVX2 1
+cglobal vvc_dmvr_v_%1, 4, 9, 15, dst, src, ss, h, ds, ds3, w, ss3, i
+%define bd %1
+
+    LOAD_COEFFS 8, 9, ds3m
+    LOAD_SHIFT m10, shift3_%1
+
+    LOAD_STRIDES
+
+    mov wd, wm
+    mov id, hd
+    LOAD_W16 m0, [srcq]
+.w16:
+    sub id, 4
+    jl .w16_end
+    DMVR_V_4xW16 dstq, dsq, ds3q, srcq, ssq, ss3q
+    NEXT_4_LINES 0
+    jmp .w16
+.w16_end:
+
+    sub wd, 16
+    jl .w4_end
+
+    OFFSET_TO_W4
+    LOAD_W16 m0, [srcq]
+.w4:
+    sub hd, 4
+    jl .w4_end
+    DMVR_V_4xW16 dstq, dsq, ds3q, srcq, ssq, ss3q
+    NEXT_4_LINES 0
+    jmp .w4
+.w4_end:
+
+    RET
+%endmacro
+
+; DMVR_HV_4xW16(dst, dst_stride, dst_stride3, src, src_stride, src_stride3, src1)
+%macro DMVR_HV_4xW16 7
+    DMVR_H_1xW16 m1, %4, %7, %5, m6
+    DMVR_H_1xW16 m2, %4, %7, 2 * %5, m7
+    DMVR_H_1xW16 m3, %4, %7, %6, m8
+    DMVR_H_1xW16 m4, %4, %7, 4 * %5, m9
+
+    BILINEAR m0, m1, m13, m14, m15, m6
+    BILINEAR m1, m2, m13, m14, m15, m7
+    BILINEAR m2, m3, m13, m14, m15, m8
+    BILINEAR m3, m4, m13, m14, m15, m9
+
+    SAVE_W16 [%1]         , m0
+    SAVE_W16 [%1 + %2]    , m1
+    SAVE_W16 [%1 + 2 * %2], m2
+    SAVE_W16 [%1 + %3]    , m3
+
+    ; why can't we use SWAP m0, m4 here?
+    movaps m0, m4
+%endmacro
+
+; void ff_vvc_dmvr_hv_%1_avx2(int16_t *dst, const uint8_t *src, ptrdiff_t src_stride,
+;     int height, intptr_t mx, intptr_t my, int width);
+%macro DMVR_HV_AVX2 1
+cglobal vvc_dmvr_hv_%1, 7, 10, 16, dst, src, ss, h, ds, ds3, w, ss3, src1, i
+%define bd %1
+
+    LOAD_COEFFS 10, 11, dsm
+    LOAD_SHIFT m12, shift3_%1
+
+    LOAD_COEFFS 13, 14, ds3m
+    LOAD_SHIFT m15, shift3_10
+
+    LOAD_STRIDES
+    lea src1q, [srcq + pstride]
+
+    mov id, hd
+    DMVR_H_1xW16 m0, srcq, src1q, 0, m5
+.w16:
+    sub id, 4
+    jl .w16_end
+    DMVR_HV_4xW16 dstq, dsq, ds3q, srcq, ssq, ss3q, src1q
+    NEXT_4_LINES 1
+    jmp .w16
+.w16_end:
+
+    sub wd, 16
+    jl .w4_end
+
+    OFFSET_TO_W4
+    lea src1q, [srcq + pstride]
+
+    DMVR_H_1xW16 m0, srcq, src1q, 0, m5
+.w4:
+    sub hd, 4
+    jl .w4_end
+    DMVR_HV_4xW16 dstq, dsq, ds3q, srcq, ssq, ss3q, src1q
+    NEXT_4_LINES 1
+    jmp .w4
+.w4_end:
+
+    RET
+%endmacro
+
+%macro VVC_DMVR_AVX2 1
+    DMVR_AVX2 %1
+    DMVR_H_AVX2 %1
+    DMVR_V_AVX2 %1
+    DMVR_HV_AVX2 %1
+%endmacro
+
+INIT_YMM avx2
+
+VVC_DMVR_AVX2 8
+VVC_DMVR_AVX2 10
+VVC_DMVR_AVX2 12
+
+%endif ; HAVE_AVX2_EXTERNAL
+
+%endif ; ARCH_X86_64
diff --git a/libavcodec/x86/vvc/vvcdsp_init.c b/libavcodec/x86/vvc/vvcdsp_init.c
index 4b4a2aa937..d5b4f4f8a5 100644
--- a/libavcodec/x86/vvc/vvcdsp_init.c
+++ b/libavcodec/x86/vvc/vvcdsp_init.c
@@ -87,6 +87,21 @@ AVG_PROTOTYPES( 8, avx2)
 AVG_PROTOTYPES(10, avx2)
 AVG_PROTOTYPES(12, avx2)
 
+
+#define DMVR_PROTOTYPES(bd, opt) \
+void ff_vvc_dmvr_##bd##_##opt(int16_t *dst, const uint8_t *src, ptrdiff_t src_stride, \
+    int height, intptr_t mx, intptr_t my, int width); \
+void ff_vvc_dmvr_h_##bd##_##opt(int16_t *dst, const uint8_t *src, ptrdiff_t src_stride, \
+    int height, intptr_t mx, intptr_t my, int width); \
+void ff_vvc_dmvr_v_##bd##_##opt(int16_t *dst, const uint8_t *src, ptrdiff_t src_stride, \
+    int height, intptr_t mx, intptr_t my, int width); \
+void ff_vvc_dmvr_hv_##bd##_##opt(int16_t *dst, const uint8_t *src, ptrdiff_t src_stride, \
+    int height, intptr_t mx, intptr_t my, int width); \
+
+DMVR_PROTOTYPES( 8, avx2)
+DMVR_PROTOTYPES(10, avx2)
+DMVR_PROTOTYPES(12, avx2)
+
 #define ALF_BPC_PROTOTYPES(bpc, opt) \
 void BF(ff_vvc_alf_filter_luma, bpc, opt)(uint8_t *dst, ptrdiff_t dst_stride, \
     const uint8_t *src, ptrdiff_t src_stride, ptrdiff_t width, ptrdiff_t height, \
@@ -306,6 +321,13 @@ ALF_FUNCS(16, 12, avx2)
     c->inter.w_avg = bf(ff_vvc_w_avg, bd, opt); \
 } while (0)
 
+#define DMVR_INIT(bd) do { \
+    c->inter.dmvr[0][0] = ff_vvc_dmvr_##bd##_avx2; \
+    c->inter.dmvr[0][1] = ff_vvc_dmvr_h_##bd##_avx2; \
+    c->inter.dmvr[1][0] = ff_vvc_dmvr_v_##bd##_avx2; \
+    c->inter.dmvr[1][1] = ff_vvc_dmvr_hv_##bd##_avx2; \
+} while (0)
+
 #define ALF_INIT(bd) do { \
     c->alf.filter[LUMA] = ff_vvc_alf_filter_luma_##bd##_avx2; \
     c->alf.filter[CHROMA] = ff_vvc_alf_filter_chroma_##bd##_avx2; \
@@ -330,6 +352,7 @@ void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
             ALF_INIT(8);
             AVG_INIT(8, avx2);
             MC_LINKS_AVX2(8);
+            DMVR_INIT(8);
             SAD_INIT();
         }
         break;
@@ -342,6 +365,7 @@ void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
             AVG_INIT(10, avx2);
             MC_LINKS_AVX2(10);
             MC_LINKS_16BPC_AVX2(10);
+            DMVR_INIT(10);
             SAD_INIT();
         }
         break;
@@ -354,6 +378,7 @@ void ff_vvc_dsp_init_x86(VVCDSPContext *const c, const int bd)
             AVG_INIT(12, avx2);
             MC_LINKS_AVX2(12);
             MC_LINKS_16BPC_AVX2(12);
+            DMVR_INIT(12);
             SAD_INIT();
         }
         break;
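
Reviewer note on the rounding constants (not part of the patch): the shift3_* values appear to rely on pmulhrsw computing (a * b + 0x4000) >> 15 per 16-bit lane, so multiplying by 1 << (15 - s) acts as a rounded right shift by s. For 12-bit input the weighted sum can reach 4095 * 16 = 65520, which no longer fits a signed 16-bit lane, hence the psrlw by 1 in BILINEAR paired with the psllw by 1 in LOAD_SHIFT. The scalar C sketch below models one lane of BILINEAR under that reading and compares it with a plain rounded blend. All function names are hypothetical, and the reference formula is inferred from the asm rather than copied from the C implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of PMULHRSW: (((a * b) >> 14) + 1) >> 1.
 * Only non-negative inputs are used here, so the right shifts are well defined. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (int16_t)(((p >> 14) + 1) >> 1);
}

/* Assumed reference: rounded bilinear blend with weights (16 - m, m),
 * shifted down by bd - 6. */
static int bilinear_ref(int a, int b, int m, int bd)
{
    int shift = bd - 6;
    return (a * (16 - m) + b * m + (1 << (shift - 1))) >> shift;
}

/* Model of the BILINEAR macro: pmullw, pmullw, paddw, then pmulhrsw by
 * shift3_<bd> = 1 << (15 - (bd - 6)). */
static int bilinear_asm_model(int a, int b, int m, int bd)
{
    uint16_t sum = (uint16_t)(a * (16 - m) + b * m);  /* one 16-bit lane */
    int16_t  mul = (int16_t)(1 << (15 - (bd - 6)));   /* shift3_<bd>     */

    if (bd == 12) {
        sum = (uint16_t)(sum >> 1);  /* psrlw 1: keep the sum below 0x8000 */
        mul = (int16_t)(mul * 2);    /* psllw 1 in LOAD_SHIFT compensates  */
    }
    return pmulhrsw_lane((int16_t)sum, mul);
}

/* Compare model and reference over a sampled grid of pixel values and all
 * 16 fractional positions; returns the number of samples checked. */
static long verify_bilinear_model(int bd)
{
    const int max = (1 << bd) - 1;
    long n = 0;

    for (int m = 0; m < 16; m++) {
        /* Largest possible weighted sum for this bit depth. */
        assert(bilinear_asm_model(max, max, m, bd) == bilinear_ref(max, max, m, bd));
        n++;
        for (int a = 0; a <= max; a += 7) {
            for (int b = max; b >= 0; b -= 11) {
                assert(bilinear_asm_model(a, b, m, bd) == bilinear_ref(a, b, m, bd));
                n++;
            }
        }
    }
    return n;
}
```

Calling verify_bilinear_model() with 8, 10 and 12 exercises the three bit depths the patch instantiates. The grid is sampled rather than exhaustive to keep the run short, so it is a sanity check and not a proof.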