From patchwork Sat Oct 7 15:03:16 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne
X-Patchwork-Id: 44197
Date: Sat, 7 Oct 2023 17:03:16 +0200 (CEST)
From: Lynne
To: Ffmpeg Devel
Subject: [FFmpeg-devel] [PATCH 1/3] nlmeans_vulkan: fix width/height for chroma plane weights calculation
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches

Patch attached.

From 927c74d7851aafc589760a3882bef7f72b19db1c Mon Sep 17 00:00:00 2001
From: Lynne
Date: Sat, 16 Sep 2023 00:42:53 +0200
Subject: [PATCH 1/3] nlmeans_vulkan: fix width/height for chroma plane weights calculation

---
 libavfilter/vf_nlmeans_vulkan.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavfilter/vf_nlmeans_vulkan.c b/libavfilter/vf_nlmeans_vulkan.c
index 99f4f867e7..5b623eb7a6 100644
--- a/libavfilter/vf_nlmeans_vulkan.c
+++ b/libavfilter/vf_nlmeans_vulkan.c
@@ -100,7 +100,7 @@ static void insert_horizontal_pass(FFVkSPIRVShader *shd, int nb_rows, int first,
                                 gl_SemanticsMakeAvailable |
                                 gl_SemanticsMakeVisible); );
     }
-    GLSLC(1, for (y = 0; y < height[0]; y++) { );
+    GLSLF(1, for (y = 0; y < height[%i]; y++) { ,plane);
     GLSLC(2, offset = uint64_t(int_stride)*y*T_ALIGN; );
     GLSLC(2, dst = DataBuffer(uint64_t(integral_data) + offset); );
     GLSLC(0, );
@@ -127,7 +127,7 @@ static void insert_vertical_pass(FFVkSPIRVShader *shd, int nb_rows, int first, i
                                 gl_SemanticsMakeAvailable |
                                 gl_SemanticsMakeVisible); );
     }
-    GLSLC(1, for (x = 0; x < width[0]; x++) { );
+    GLSLF(1, for (x = 0; x < width[%i]; x++) { ,plane);
     GLSLC(2, dst = DataBuffer(uint64_t(integral_data) + x*T_ALIGN); );

     for (int r = 0; r < nb_rows; r++) {
@@ -156,13 +156,13 @@ static void insert_weights_pass(FFVkSPIRVShader *shd, int nb_rows, int vert,
                                 gl_SemanticsMakeVisible); );
     GLSLC(1, barrier(); );
     if (!vert) {
-        GLSLC(1, for (y = 0; y < height[0]; y++) { );
+        GLSLF(1, for (y = 0; y < height[%i]; y++) { ,plane);
         GLSLF(2, if (gl_GlobalInvocationID.x*%i >= width[%i]) ,nb_rows, plane);
         GLSLC(3, break; );
         GLSLF(2, for (r = 0; r < %i; r++) { ,nb_rows);
         GLSLF(3, x = int(gl_GlobalInvocationID.x) * %i + r; ,nb_rows);
     } else {
-        GLSLC(1, for (x = 0; x < width[0]; x++) { );
+        GLSLF(1, for (x = 0; x < width[%i]; x++) { ,plane);
         GLSLF(2, if (gl_GlobalInvocationID.x*%i >= height[%i]) ,nb_rows, plane);
         GLSLC(3, break; );
         GLSLF(2, for (r = 0; r < %i; r++) { ,nb_rows);
--
2.42.0

From patchwork Sat Oct 7 15:04:04 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne
X-Patchwork-Id: 44198
Date: Sat, 7 Oct 2023 17:04:04 +0200 (CEST)
From: Lynne
To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH 2/3] nlmeans_vulkan: reduce dispatches by parallelizing the planes
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches

Patch attached.
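
For readers skimming the series: patch 2/3 moves the per-plane loop out of the generated shader and into the Z dimension of the dispatch, so one dispatch covers all planes. The following is a minimal host-side sketch of that sizing, not the patch's code. The struct `DispatchSize` and the helpers `ceil_div` and `denoise_dispatch_size` are hypothetical names introduced here for illustration; the patch itself uses `FFALIGN(dim, wg)/wg`, which rounds up the same way when `wg` is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three workgroup counts of one dispatch. */
typedef struct DispatchSize {
    uint32_t x, y, z;
} DispatchSize;

/* Round-up integer division: number of groups of size d needed to cover n. */
static uint32_t ceil_div(uint32_t n, uint32_t d)
{
    assert(d > 0);
    return (n + d - 1) / d;
}

/* X and Y cover the output frame; Z carries one layer per plane, which the
 * shader reads back as gl_WorkGroupID.z. */
static DispatchSize denoise_dispatch_size(uint32_t width, uint32_t height,
                                          uint32_t wg_x, uint32_t wg_y,
                                          uint32_t nb_planes)
{
    DispatchSize d = { ceil_div(width, wg_x), ceil_div(height, wg_y), nb_planes };
    return d;
}
```

Because X and Y are sized from the full output frame, subsampled chroma planes receive more invocations than they have pixels; the shader in the patch handles this by returning early when `pos` lies outside `imageSize(output_img[plane])`.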
From 863a9977192abf00d131d7b0ed88569210fe0d0c Mon Sep 17 00:00:00 2001
From: Lynne
Date: Sat, 16 Sep 2023 01:04:18 +0200
Subject: [PATCH 2/3] nlmeans_vulkan: reduce dispatches by parallelizing the planes

---
 libavfilter/vf_nlmeans_vulkan.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/libavfilter/vf_nlmeans_vulkan.c b/libavfilter/vf_nlmeans_vulkan.c
index 5b623eb7a6..9741dd67ac 100644
--- a/libavfilter/vf_nlmeans_vulkan.c
+++ b/libavfilter/vf_nlmeans_vulkan.c
@@ -538,28 +538,29 @@ static av_cold int init_denoise_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     GLSLC(0, { );
     GLSLC(1, ivec2 size; );
     GLSLC(1, const ivec2 pos = ivec2(gl_GlobalInvocationID.xy); );
+    GLSLC(1, const uint plane = uint(gl_WorkGroupID.z); );
     GLSLC(0, );
     GLSLC(1, float w_sum; );
     GLSLC(1, float sum; );
     GLSLC(1, vec4 src; );
     GLSLC(1, vec4 r; );
     GLSLC(0, );
-
-    for (int i = 0; i < planes; i++) {
-        GLSLF(1, src = texture(input_img[%i], pos); ,i);
-        for (int c = 0; c < desc->nb_components; c++) {
-            if (desc->comp[c].plane == i) {
-                int off = desc->comp[c].offset / (FFALIGN(desc->comp[c].depth, 8)/8);
-                GLSLF(1, w_sum = weights_%i[pos.y*ws_stride[%i] + pos.x]; ,c, c);
-                GLSLF(1, sum = sums_%i[pos.y*ws_stride[%i] + pos.x]; ,c, c);
-                GLSLF(1, r[%i] = (sum + src[%i]*255) / (1.0 + w_sum) / 255; ,off, off);
-                GLSLC(0, );
-            }
-        }
-        GLSLF(1, imageStore(output_img[%i], pos, r); ,i);
-        GLSLC(0, );
+    GLSLC(1, size = imageSize(output_img[plane]); );
+    GLSLC(1, if (!IS_WITHIN(pos, size)) );
+    GLSLC(2, return; );
+    GLSLC(0, );
+    GLSLC(1, src = texture(input_img[plane], pos); );
+    GLSLC(0, );
+    for (int c = 0; c < desc->nb_components; c++) {
+        int off = desc->comp[c].offset / (FFALIGN(desc->comp[c].depth, 8)/8);
+        GLSLF(1, if (plane == %i) { ,desc->comp[c].plane);
+        GLSLF(2, w_sum = weights_%i[pos.y*ws_stride[%i] + pos.x]; ,c, c);
+        GLSLF(2, sum = sums_%i[pos.y*ws_stride[%i] + pos.x]; ,c, c);
+        GLSLF(2, r[%i] = (sum + src[%i]*255) / (1.0 + w_sum) / 255; ,off, off);
+        GLSLC(1, } );
+        GLSLC(0, );
     }
-
+    GLSLC(1, imageStore(output_img[plane], pos, r); );
     GLSLC(0, } );

     RET(spv->compile_shader(spv, vkctx, shd, &spv_data, &spv_len, "main", &spv_opaque));
@@ -716,7 +717,7 @@ static int denoise_pass(NLMeansVulkanContext *s, FFVkExecContext *exec,
     vk->CmdDispatch(exec->buf,
                     FFALIGN(vkctx->output_width, s->pl_denoise.wg_size[0])/s->pl_denoise.wg_size[0],
                     FFALIGN(vkctx->output_height, s->pl_denoise.wg_size[1])/s->pl_denoise.wg_size[1],
-                    1);
+                    av_pix_fmt_count_planes(s->vkctx.output_format));

     return 0;
 }
--
2.42.0

From patchwork Sat Oct 7 15:07:52 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne
X-Patchwork-Id: 44199
Date: Sat, 7 Oct 2023 17:07:52 +0200 (CEST)
From: Lynne
To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH 3/3] nlmeans_vulkan: parallelize workgroup invocations
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches

Removes the clever subgroup parallel prefix computation, and instead
just computes the prefix inline. Cuts down the number of dispatches
by a huge amount.
Provides a ~12x speedup (2.5fps to 30fps on a 7900XTX, 2.1fps to 24fps on an Ada).

Patch attached.
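
To make "computes the prefix inline" concrete, here is a small CPU-side sketch of the same idea: each row (then each column) is scanned serially with a running sum, with no cross-invocation cooperation. This is an illustration under stated assumptions, not the shader from the patch. The function names `prefix_rows` and `prefix_cols` and the use of plain `float` (instead of the shader's `DTYPE` vector type) are hypothetical choices made here; in the patch each invocation owns a few rows or columns, and this sketch simply loops over all of them.

```c
#include <assert.h>
#include <stddef.h>

/* In-place inclusive prefix sum along every row. Each row is independent,
 * so on a GPU one invocation could own one or more rows. */
static void prefix_rows(float *buf, size_t width, size_t height, size_t stride)
{
    assert(stride >= width);
    for (size_t y = 0; y < height; y++) {
        float *row = buf + y * stride;
        float running = 0.0f;            /* sum of row[0..x-1] before scanning */
        for (size_t x = 0; x < width; x++) {
            float v = row[x];
            row[x] = v + running;        /* inclusive: includes the current element */
            running += v;
        }
    }
}

/* In-place inclusive prefix sum along every column. Running both passes
 * turns the buffer into a summed-area (integral) image. */
static void prefix_cols(float *buf, size_t width, size_t height, size_t stride)
{
    assert(stride >= width);
    for (size_t x = 0; x < width; x++) {
        float running = 0.0f;
        for (size_t y = 0; y < height; y++) {
            float v = buf[y * stride + x];
            buf[y * stride + x] = v + running;
            running += v;
        }
    }
}
```

The two-statement update (`dst = v + running; running += v;`) mirrors the `dst.v[x] = s2 + prefix_sum; prefix_sum += s2;` pair that the patch emits. The serial scan does more work per invocation than a cooperative subgroup scan, but it needs no shared state between invocations, which is consistent with the patch dropping the state buffer and the Vulkan memory model requirement.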
From a51dd2ace418974c7e8b24a3762bd7495d3b3b10 Mon Sep 17 00:00:00 2001
From: Lynne
Date: Fri, 15 Sep 2023 21:55:59 +0200
Subject: [PATCH 3/3] nlmeans_vulkan: parallelize workgroup invocations

---
 libavfilter/Makefile               |   3 +-
 libavfilter/vf_nlmeans_vulkan.c    | 333 ++++++++++++++---------------
 libavfilter/vulkan/prefix_sum.comp | 151 -------------
 3 files changed, 167 insertions(+), 320 deletions(-)
 delete mode 100644 libavfilter/vulkan/prefix_sum.comp

diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 9a100cd665..603b532ad0 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -395,8 +395,7 @@ OBJS-$(CONFIG_MULTIPLY_FILTER) += vf_multiply.o
 OBJS-$(CONFIG_NEGATE_FILTER) += vf_negate.o
 OBJS-$(CONFIG_NLMEANS_FILTER) += vf_nlmeans.o
 OBJS-$(CONFIG_NLMEANS_OPENCL_FILTER) += vf_nlmeans_opencl.o opencl.o opencl/nlmeans.o
-OBJS-$(CONFIG_NLMEANS_VULKAN_FILTER) += vf_nlmeans_vulkan.o vulkan.o vulkan_filter.o \
-                                        vulkan/prefix_sum.o
+OBJS-$(CONFIG_NLMEANS_VULKAN_FILTER) += vf_nlmeans_vulkan.o vulkan.o vulkan_filter.o
 OBJS-$(CONFIG_NNEDI_FILTER) += vf_nnedi.o
 OBJS-$(CONFIG_NOFORMAT_FILTER) += vf_format.o
 OBJS-$(CONFIG_NOISE_FILTER) += vf_noise.o
diff --git a/libavfilter/vf_nlmeans_vulkan.c b/libavfilter/vf_nlmeans_vulkan.c
index 9741dd67ac..6046ff598c 100644
--- a/libavfilter/vf_nlmeans_vulkan.c
+++ b/libavfilter/vf_nlmeans_vulkan.c
@@ -38,9 +38,10 @@ typedef struct NLMeansVulkanContext {
     VkSampler sampler;

     AVBufferPool *integral_buf_pool;
-    AVBufferPool *state_buf_pool;
     AVBufferPool *ws_buf_pool;

+    FFVkBuffer xyoffsets_buf;
+
     int pl_weights_rows;
     FFVulkanPipeline pl_weights;
     FFVkSPIRVShader shd_weights;
@@ -66,107 +67,97 @@ typedef struct NLMeansVulkanContext {

 extern const char *ff_source_prefix_sum_comp;

-static void insert_first(FFVkSPIRVShader *shd, int r, int horiz, int plane, int comp)
+static void insert_first(FFVkSPIRVShader *shd, int r, const char *off, int horiz, int plane, int comp)
 {
-    GLSLF(2, s1 = texture(input_img[%i], ivec2(x + %i, y + %i))[%i];
-          ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-
-    if (TYPE_ELEMS == 4) {
-        GLSLF(2, s2[0] = texture(input_img[%i], ivec2(x + %i + xoffs[0], y + %i + yoffs[0]))[%i];
-              ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-        GLSLF(2, s2[1] = texture(input_img[%i], ivec2(x + %i + xoffs[1], y + %i + yoffs[1]))[%i];
-              ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-        GLSLF(2, s2[2] = texture(input_img[%i], ivec2(x + %i + xoffs[2], y + %i + yoffs[2]))[%i];
-              ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-        GLSLF(2, s2[3] = texture(input_img[%i], ivec2(x + %i + xoffs[3], y + %i + yoffs[3]))[%i];
-              ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-    } else {
-        for (int i = 0; i < 16; i++) {
-            GLSLF(2, s2[%i][%i] = texture(input_img[%i], ivec2(x + %i + xoffs[%i], y + %i + yoffs[%i]))[%i];
-                  ,i / 4, i % 4, plane, horiz ? r : 0, i, !horiz ? r : 0, i, comp);
-        }
-    }
-
-    GLSLC(2, s2 = (s1 - s2) * (s1 - s2); );
+    GLSLF(4, s1 = texture(input_img[%i], ivec2(x + %i + %s, y + %i + %s))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+
+    GLSLF(4, s2[0] = texture(input_img[%i], ivec2(x + %i + %s + xoffs[0], y + %i + %s + yoffs[0]))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+    GLSLF(4, s2[1] = texture(input_img[%i], ivec2(x + %i + %s + xoffs[1], y + %i + %s + yoffs[1]))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+    GLSLF(4, s2[2] = texture(input_img[%i], ivec2(x + %i + %s + xoffs[2], y + %i + %s + yoffs[2]))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+    GLSLF(4, s2[3] = texture(input_img[%i], ivec2(x + %i + %s + xoffs[3], y + %i + %s + yoffs[3]))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+
+    GLSLC(4, s2 = (s1 - s2) * (s1 - s2); );
 }

 static void insert_horizontal_pass(FFVkSPIRVShader *shd, int nb_rows, int first, int plane, int comp)
 {
-    GLSLF(1, x = int(gl_GlobalInvocationID.x) * %i; ,nb_rows);
-    if (!first) {
-        GLSLC(1, controlBarrier(gl_ScopeWorkgroup, gl_ScopeWorkgroup,
-                                gl_StorageSemanticsBuffer,
-                                gl_SemanticsAcquireRelease |
-                                gl_SemanticsMakeAvailable |
-                                gl_SemanticsMakeVisible); );
-    }
-    GLSLF(1, for (y = 0; y < height[%i]; y++) { ,plane);
-    GLSLC(2, offset = uint64_t(int_stride)*y*T_ALIGN; );
-    GLSLC(2, dst = DataBuffer(uint64_t(integral_data) + offset); );
-    GLSLC(0, );
-    if (first) {
-        for (int r = 0; r < nb_rows; r++) {
-            insert_first(shd, r, 1, plane, comp);
-            GLSLF(2, dst.v[x + %i] = s2; ,r);
-            GLSLC(0, );
-        }
-    }
-    GLSLC(2, barrier(); );
-    GLSLC(2, prefix_sum(dst, 1, dst, 1); );
-    GLSLC(1, } );
-    GLSLC(0, );
+    GLSLF(1, y = int(gl_GlobalInvocationID.x) * %i; ,nb_rows);
+    if (!first)
+        GLSLC(1, barrier(); );
+    GLSLC(0, );
+    GLSLF(1, if (y < height[%i]) { ,plane);
+    GLSLC(2, #pragma unroll(1) );
+    GLSLF(2, for (r = 0; r < %i; r++) { ,nb_rows);
+    GLSLC(3, prefix_sum = DTYPE(0); );
+    GLSLC(3, offset = uint64_t(int_stride)*(y + r)*T_ALIGN; );
+    GLSLC(3, dst = DataBuffer(uint64_t(integral_data) + offset); );
+    GLSLC(0, );
+    GLSLF(3, for (x = 0; x < width[%i]; x++) { ,plane);
+    if (first)
+        insert_first(shd, 0, "r", 0, plane, comp);
+    else
+        GLSLC(4, s2 = dst.v[x]; );
+    GLSLC(4, dst.v[x] = s2 + prefix_sum; );
+    GLSLC(4, prefix_sum += s2; );
+    GLSLC(3, } );
+    GLSLC(2, } );
+    GLSLC(1, } );
+    GLSLC(0, );
 }

 static void insert_vertical_pass(FFVkSPIRVShader *shd, int nb_rows, int first, int plane, int comp)
 {
-    GLSLF(1, y = int(gl_GlobalInvocationID.x) * %i; ,nb_rows);
-    if (!first) {
-        GLSLC(1, controlBarrier(gl_ScopeWorkgroup, gl_ScopeWorkgroup,
-                                gl_StorageSemanticsBuffer,
-                                gl_SemanticsAcquireRelease |
-                                gl_SemanticsMakeAvailable |
-                                gl_SemanticsMakeVisible); );
-    }
-    GLSLF(1, for (x = 0; x < width[%i]; x++) { ,plane);
-    GLSLC(2, dst = DataBuffer(uint64_t(integral_data) + x*T_ALIGN); );
-
-    for (int r = 0; r < nb_rows; r++) {
-        if (first) {
-            insert_first(shd, r, 0, plane, comp);
-            GLSLF(2, integral_data.v[(y + %i)*int_stride + x] = s2; ,r);
-            GLSLC(0, );
-        }
-    }
-
-    GLSLC(2, barrier(); );
-    GLSLC(2, prefix_sum(dst, int_stride, dst, int_stride); );
-    GLSLC(1, } );
-    GLSLC(0, );
+    GLSLF(1, x = int(gl_GlobalInvocationID.x) * %i; ,nb_rows);
+    GLSLC(1, #pragma unroll(1) );
+    GLSLF(1, for (r = 0; r < %i; r++) ,nb_rows);
+    GLSLC(2, psum[r] = DTYPE(0); );
+    GLSLC(0, );
+    if (!first)
+        GLSLC(1, barrier(); );
+    GLSLC(0, );
+    GLSLF(1, if (x < width[%i]) { ,plane);
+    GLSLF(2, for (y = 0; y < height[%i]; y++) { ,plane);
+    GLSLC(3, offset = uint64_t(int_stride)*y*T_ALIGN; );
+    GLSLC(3, dst = DataBuffer(uint64_t(integral_data) + offset); );
+    GLSLC(0, );
+    GLSLC(3, #pragma unroll(1) );
+    GLSLF(3, for (r = 0; r < %i; r++) { ,nb_rows);
+    if (first)
+        insert_first(shd, 0, "r", 1, plane, comp);
+    else
+        GLSLC(4, s2 = dst.v[x + r]; );
+    GLSLC(4, dst.v[x + r] = s2 + psum[r]; );
+    GLSLC(4, psum[r] += s2; );
+    GLSLC(3, } );
+    GLSLC(2, } );
+    GLSLC(1, } );
+    GLSLC(0, );
 }

 static void insert_weights_pass(FFVkSPIRVShader *shd, int nb_rows, int vert,
                                 int t, int dst_comp, int plane, int comp)
 {
-    GLSLF(1, p = patch_size[%i]; ,dst_comp);
+    GLSLF(1, p = patch_size[%i]; ,dst_comp);
     GLSLC(0, );
-    GLSLC(1, controlBarrier(gl_ScopeWorkgroup, gl_ScopeWorkgroup,
-                            gl_StorageSemanticsBuffer,
-                            gl_SemanticsAcquireRelease |
-                            gl_SemanticsMakeAvailable |
-                            gl_SemanticsMakeVisible); );
     GLSLC(1, barrier(); );
+    GLSLC(0, );
     if (!vert) {
         GLSLF(1, for (y = 0; y < height[%i]; y++) { ,plane);
         GLSLF(2, if (gl_GlobalInvocationID.x*%i >= width[%i]) ,nb_rows, plane);
         GLSLC(3, break; );
-        GLSLF(2, for (r = 0; r < %i; r++) { ,nb_rows);
-        GLSLF(3, x = int(gl_GlobalInvocationID.x) * %i + r; ,nb_rows);
+        GLSLF(2, for (r = 0; r < %i; r++) { ,nb_rows);
+        GLSLF(3, x = int(gl_GlobalInvocationID.x) * %i + r; ,nb_rows);
     } else {
         GLSLF(1, for (x = 0; x < width[%i]; x++) { ,plane);
         GLSLF(2, if (gl_GlobalInvocationID.x*%i >= height[%i]) ,nb_rows, plane);
         GLSLC(3, break; );
-        GLSLF(2, for (r = 0; r < %i; r++) { ,nb_rows);
-        GLSLF(3, y = int(gl_GlobalInvocationID.x) * %i + r; ,nb_rows);
+        GLSLF(2, for (r = 0; r < %i; r++) { ,nb_rows);
+        GLSLF(3, y = int(gl_GlobalInvocationID.x) * %i + r; ,nb_rows);
     }
     GLSLC(0, );
     GLSLC(3, a = DTYPE(0); );
@@ -223,16 +214,15 @@ static void insert_weights_pass(FFVkSPIRVShader *shd, int nb_rows, int vert,
 }

 typedef struct HorizontalPushData {
-    VkDeviceAddress integral_data;
-    VkDeviceAddress state_data;
-    int32_t xoffs[TYPE_ELEMS];
-    int32_t yoffs[TYPE_ELEMS];
     uint32_t width[4];
     uint32_t height[4];
     uint32_t ws_stride[4];
     int32_t patch_size[4];
     float strength[4];
+    VkDeviceAddress integral_base;
+    uint32_t integral_size;
     uint32_t int_stride;
+    uint32_t xyoffs_start;
 } HorizontalPushData;

 static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *exec,
@@ -249,26 +239,18 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     FFVulkanDescriptorSetBinding *desc_set;
     int max_dim = FFMAX(width, height);
     uint32_t max_wg = vkctx->props.properties.limits.maxComputeWorkGroupSize[0];
-    int max_shm = vkctx->props.properties.limits.maxComputeSharedMemorySize;
     int wg_size, wg_rows;

     /* Round the max workgroup size to the previous power of two */
-    max_wg = 1 << (31 - ff_clz(max_wg));
     wg_size = max_wg;
     wg_rows = 1;

     if (max_wg > max_dim) {
-        wg_size = max_wg / (max_wg / max_dim);
+        wg_size = max_dim;
     } else if (max_wg < max_dim) {
-        /* First, make it fit */
+        /* Make it fit */
         while (wg_size*wg_rows < max_dim)
             wg_rows++;
-
-        /* Second, make sure there's enough shared memory */
-        while ((wg_size * TYPE_SIZE + TYPE_SIZE + 2*4) > max_shm) {
-            wg_size >>= 1;
-            wg_rows++;
-        }
     }

     RET(ff_vk_shader_init(pl, shd, "nlmeans_weights", VK_SHADER_STAGE_COMPUTE_BIT, 0));
@@ -278,33 +260,24 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     if (t > 1)
         GLSLC(0, #extension GL_EXT_shader_atomic_float : require );
     GLSLC(0, #extension GL_ARB_gpu_shader_int64 : require );
-    GLSLC(0, #pragma use_vulkan_memory_model );
-    GLSLC(0, #extension GL_KHR_memory_scope_semantics : enable );
     GLSLC(0, );
-    GLSLF(0, #define N_ROWS %i ,*nb_rows);
-    GLSLC(0, #define WG_SIZE (gl_WorkGroupSize.x) );
-    GLSLF(0, #define LG_WG_SIZE %i ,ff_log2(shd->local_size[0]));
-    GLSLC(0, #define PARTITION_SIZE (N_ROWS*WG_SIZE) );
-    GLSLF(0, #define DTYPE %s ,TYPE_NAME);
-    GLSLF(0, #define T_ALIGN %i ,TYPE_SIZE);
+    GLSLF(0, #define DTYPE %s ,TYPE_NAME);
+    GLSLF(0, #define T_ALIGN %i ,TYPE_SIZE);
     GLSLC(0, );
-    GLSLC(0, layout(buffer_reference, buffer_reference_align = T_ALIGN) coherent buffer DataBuffer { );
+    GLSLC(0, layout(buffer_reference, buffer_reference_align = T_ALIGN) buffer DataBuffer { );
     GLSLC(1, DTYPE v[]; );
     GLSLC(0, }; );
     GLSLC(0, );
-    GLSLC(0, layout(buffer_reference) buffer StateData; );
-    GLSLC(0, );
     GLSLC(0, layout(push_constant, std430) uniform pushConstants { );
-    GLSLC(1, coherent DataBuffer integral_data; );
-    GLSLC(1, StateData state; );
-    GLSLF(1, uint xoffs[%i]; ,TYPE_ELEMS);
-    GLSLF(1, uint yoffs[%i]; ,TYPE_ELEMS);
     GLSLC(1, uvec4 width; );
     GLSLC(1, uvec4 height; );
     GLSLC(1, uvec4 ws_stride; );
     GLSLC(1, ivec4 patch_size; );
     GLSLC(1, vec4 strength; );
+    GLSLC(1, DataBuffer integral_base; );
+    GLSLC(1, uint integral_size; );
     GLSLC(1, uint int_stride; );
+    GLSLC(1, uint xyoffs_start; );
     GLSLC(0, }; );
     GLSLC(0, );

@@ -370,7 +343,17 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     };
     RET(ff_vk_pipeline_descriptor_set_add(vkctx, pl, shd, desc_set, 1 + 2*desc->nb_components, 0, 0));

-    GLSLD( ff_source_prefix_sum_comp );
+    desc_set = (FFVulkanDescriptorSetBinding []) {
+        {
+            .name = "xyoffsets_buffer",
+            .type = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,
+            .mem_quali = "readonly",
+            .stages = VK_SHADER_STAGE_COMPUTE_BIT,
+            .buf_content = "int xyoffsets[];",
+        },
+    };
+    RET(ff_vk_pipeline_descriptor_set_add(vkctx, pl, shd, desc_set, 1, 1, 0));
+
     GLSLC(0, );
     GLSLC(0, void main() );
     GLSLC(0, { );
@@ -378,11 +361,24 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     GLSLC(1, DataBuffer dst; );
     GLSLC(1, float s1; );
     GLSLC(1, DTYPE s2; );
+    GLSLC(1, DTYPE prefix_sum; );
+    GLSLF(1, DTYPE psum[%i]; ,*nb_rows);
     GLSLC(1, int r; );
     GLSLC(1, int x; );
     GLSLC(1, int y; );
     GLSLC(1, int p; );
     GLSLC(0, );
+    GLSLC(1, DataBuffer integral_data; );
+    GLSLF(1, int xoffs[%i]; ,TYPE_ELEMS);
+    GLSLF(1, int yoffs[%i]; ,TYPE_ELEMS);
+    GLSLC(0, );
+    GLSLC(1, int invoc_idx = int(gl_WorkGroupID.z); );
+    GLSLC(1, integral_data = DataBuffer(uint64_t(integral_base) + invoc_idx*integral_size); );
+    for (int i = 0; i < TYPE_ELEMS*2; i += 2) {
+        GLSLF(1, xoffs[%i] = xyoffsets[xyoffs_start + 2*%i*invoc_idx + %i + 0]; ,i/2,TYPE_ELEMS,i);
+        GLSLF(1, yoffs[%i] = xyoffsets[xyoffs_start + 2*%i*invoc_idx + %i + 1]; ,i/2,TYPE_ELEMS,i);
+    }
+    GLSLC(0, );
     GLSLC(1, DTYPE a; );
     GLSLC(1, DTYPE b; );
     GLSLC(1, DTYPE c; );
@@ -405,7 +401,7 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e

     for (int i = 0; i < desc->nb_components; i++) {
         int off = desc->comp[i].offset / (FFALIGN(desc->comp[i].depth, 8)/8);
-        if (width > height) {
+        if (width >= height) {
             insert_horizontal_pass(shd, *nb_rows, 1, desc->comp[i].plane, off);
             insert_vertical_pass(shd, *nb_rows, 0, desc->comp[i].plane, off);
             insert_weights_pass(shd, *nb_rows, 0, t, i, desc->comp[i].plane, off);
@@ -584,6 +580,7 @@ static av_cold int init_filter(AVFilterContext *ctx)
     FFVulkanContext *vkctx = &s->vkctx;
     const int planes = av_pix_fmt_count_planes(s->vkctx.output_format);
     FFVkSPIRVCompiler *spv;
+    int *offsets_buf;

     const AVPixFmtDescriptor *desc;
     desc = av_pix_fmt_desc_get(vkctx->output_format);
@@ -634,6 +631,20 @@ static av_cold int init_filter(AVFilterContext *ctx)
         }
     }

+    RET(ff_vk_create_buf(&s->vkctx, &s->xyoffsets_buf, 2*s->nb_offsets*sizeof(int32_t), NULL, NULL,
+                         VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT |
+                         VK_BUFFER_USAGE_STORAGE_BUFFER_BIT,
+                         VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
+                         VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT));
+    RET(ff_vk_map_buffer(&s->vkctx, &s->xyoffsets_buf, (uint8_t **)&offsets_buf, 0));
+
+    for (int i = 0; i < 2*s->nb_offsets; i += 2) {
+        offsets_buf[i + 0] = s->xoffsets[i >> 1];
+        offsets_buf[i + 1] = s->yoffsets[i >> 1];
+    }
+
+    RET(ff_vk_unmap_buffer(&s->vkctx, &s->xyoffsets_buf, 1));
+
     s->opts.t = FFMIN(s->opts.t, (FFALIGN(s->nb_offsets, TYPE_ELEMS) / TYPE_ELEMS));
     if (!vkctx->atomic_float_feats.shaderBufferFloat32AtomicAdd) {
         av_log(ctx, AV_LOG_WARNING, "Device doesn't support atomic float adds, "
@@ -641,11 +652,6 @@ static av_cold int init_filter(AVFilterContext *ctx)
         s->opts.t = 1;
     }

-    if (!vkctx->feats_12.vulkanMemoryModel) {
-        av_log(ctx, AV_LOG_ERROR, "Device doesn't support the Vulkan memory model!");
-        return AVERROR(EINVAL);;
-    }
-
     spv = ff_vk_spirv_init();
     if (!spv) {
         av_log(ctx, AV_LOG_ERROR, "Unable to initialize SPIR-V compiler!\n");
@@ -663,6 +669,10 @@ static av_cold int init_filter(AVFilterContext *ctx)
     RET(init_denoise_pipeline(vkctx, &s->e, &s->pl_denoise, &s->shd_denoise, s->sampler,
                               spv, desc, planes));

+    RET(ff_vk_set_descriptor_buffer(&s->vkctx, &s->pl_weights, NULL, 1, 0, 0,
+                                    s->xyoffsets_buf.address, s->xyoffsets_buf.size,
+                                    VK_FORMAT_UNDEFINED));
+
     av_log(ctx, AV_LOG_VERBOSE, "Filter initialized, %i x/y offsets, %i dispatches, %i parallel\n",
            s->nb_offsets, (FFALIGN(s->nb_offsets, TYPE_ELEMS) / TYPE_ELEMS) + 1,
            s->opts.t);
@@ -736,18 +746,16 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
     int plane_widths[4];
     int plane_heights[4];

+    int offsets_dispatched = 0;
+
     /* Integral */
-    AVBufferRef *state_buf;
-    FFVkBuffer *state_vk;
-    AVBufferRef *integral_buf;
+    AVBufferRef *integral_buf = NULL;
     FFVkBuffer *integral_vk;
     uint32_t int_stride;
     size_t int_size;
-    size_t state_size;
-    int t_offset = 0;

     /* Weights/sums */
-    AVBufferRef *ws_buf;
+
AVBufferRef *ws_buf = NULL; FFVkBuffer *ws_vk; VkDeviceAddress weights_addr[4]; VkDeviceAddress sums_addr[4]; @@ -773,7 +781,6 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in) /* Integral image */ int_stride = s->pl_weights.wg_size[0]*s->pl_weights_rows; int_size = int_stride * int_stride * TYPE_SIZE; - state_size = int_stride * 3 *TYPE_SIZE; /* Plane dimensions */ for (int i = 0; i < desc->nb_components; i++) { @@ -798,16 +805,6 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in) return err; integral_vk = (FFVkBuffer *)integral_buf->data; - err = ff_vk_get_pooled_buffer(&s->vkctx, &s->state_buf_pool, &state_buf, - VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | - VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT, - NULL, - s->opts.t * state_size, - VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT); - if (err < 0) - return err; - state_vk = (FFVkBuffer *)state_buf->data; - err = ff_vk_get_pooled_buffer(&s->vkctx, &s->ws_buf_pool, &ws_buf, VK_BUFFER_USAGE_STORAGE_BUFFER_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT | @@ -844,9 +841,12 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in) RET(ff_vk_exec_add_dep_frame(vkctx, exec, out, VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT, VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT)); + RET(ff_vk_exec_add_dep_buf(vkctx, exec, &integral_buf, 1, 0)); - RET(ff_vk_exec_add_dep_buf(vkctx, exec, &state_buf, 1, 0)); + integral_buf = NULL; + RET(ff_vk_exec_add_dep_buf(vkctx, exec, &ws_buf, 1, 0)); + ws_buf = NULL; /* Input frame prep */ RET(ff_vk_create_imageviews(vkctx, exec, in_views, in)); @@ -869,6 +869,7 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in) VK_IMAGE_LAYOUT_GENERAL, VK_QUEUE_FAMILY_IGNORED); + nb_buf_bar = 0; buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) { .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2, .srcStageMask = ws_vk->stage, @@ -881,6 +882,19 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in) .size = ws_vk->size, .offset = 0, }; + 
buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) { + .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2, + .srcStageMask = integral_vk->stage, + .dstStageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, + .srcAccessMask = integral_vk->access, + .dstAccessMask = VK_ACCESS_2_SHADER_STORAGE_READ_BIT | + VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT, + .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED, + .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED, + .buffer = integral_vk->buf, + .size = integral_vk->size, + .offset = 0, + }; vk->CmdPipelineBarrier2(exec->buf, &(VkDependencyInfo) { .sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO, @@ -891,10 +905,13 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in) }); ws_vk->stage = buf_bar[0].dstStageMask; ws_vk->access = buf_bar[0].dstAccessMask; + integral_vk->stage = buf_bar[1].dstStageMask; + integral_vk->access = buf_bar[1].dstAccessMask; - /* Weights/sums buffer zeroing */ + /* Buffer zeroing */ vk->CmdFillBuffer(exec->buf, ws_vk->buf, 0, ws_vk->size, 0x0); + nb_buf_bar = 0; buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) { .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2, .srcStageMask = ws_vk->stage, @@ -948,29 +965,22 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in) /* Weights pipeline */ ff_vk_exec_bind_pipeline(vkctx, exec, &s->pl_weights); - for (int i = 0; i < s->nb_offsets; i += TYPE_ELEMS) { - int *xoffs = s->xoffsets + i; - int *yoffs = s->yoffsets + i; + do { + int wg_invoc; HorizontalPushData pd = { - integral_vk->address + t_offset*int_size, - state_vk->address + t_offset*state_size, - { 0 }, - { 0 }, { plane_widths[0], plane_widths[1], plane_widths[2], plane_widths[3] }, { plane_heights[0], plane_heights[1], plane_heights[2], plane_heights[3] }, { ws_stride[0], ws_stride[1], ws_stride[2], ws_stride[3] }, { s->patch[0], s->patch[1], s->patch[2], s->patch[3] }, { s->strength[0], s->strength[1], s->strength[2], s->strength[2], }, + integral_vk->address, + int_size, int_stride, + 
offsets_dispatched * 2, }; - memcpy(pd.xoffs, xoffs, sizeof(pd.xoffs)); - memcpy(pd.yoffs, yoffs, sizeof(pd.yoffs)); - - /* Put a barrier once we run out of parallelism buffers */ - if (!t_offset) { + if (offsets_dispatched) { nb_buf_bar = 0; - /* Buffer prep/sync */ buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) { .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2, .srcStageMask = integral_vk->stage, @@ -984,39 +994,27 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in) .size = integral_vk->size, .offset = 0, }; - buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) { - .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2, - .srcStageMask = state_vk->stage, - .dstStageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT, - .srcAccessMask = state_vk->access, - .dstAccessMask = VK_ACCESS_2_SHADER_STORAGE_READ_BIT | - VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT, - .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED, - .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED, - .buffer = state_vk->buf, - .size = state_vk->size, - .offset = 0, - }; vk->CmdPipelineBarrier2(exec->buf, &(VkDependencyInfo) { .sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO, .pBufferMemoryBarriers = buf_bar, .bufferMemoryBarrierCount = nb_buf_bar, }); - integral_vk->stage = buf_bar[0].dstStageMask; - integral_vk->access = buf_bar[0].dstAccessMask; - state_vk->stage = buf_bar[1].dstStageMask; - state_vk->access = buf_bar[1].dstAccessMask; + integral_vk->stage = buf_bar[1].dstStageMask; + integral_vk->access = buf_bar[1].dstAccessMask; } - t_offset = (t_offset + 1) % s->opts.t; /* Push data */ ff_vk_update_push_exec(vkctx, exec, &s->pl_weights, VK_SHADER_STAGE_COMPUTE_BIT, 0, sizeof(pd), &pd); + wg_invoc = FFMIN((s->nb_offsets - offsets_dispatched)/TYPE_ELEMS, s->opts.t); + /* End of horizontal pass */ - vk->CmdDispatch(exec->buf, 1, 1, 1); - } + vk->CmdDispatch(exec->buf, 1, 1, wg_invoc); + + offsets_dispatched += wg_invoc * TYPE_ELEMS; + } while (offsets_dispatched < s->nb_offsets); RET(denoise_pass(s, 
exec, ws_vk, ws_stride)); @@ -1033,6 +1031,8 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in) return ff_filter_frame(outlink, out); fail: + av_buffer_unref(&integral_buf); + av_buffer_unref(&ws_buf); av_frame_free(&in); av_frame_free(&out); return err; @@ -1051,7 +1051,6 @@ static void nlmeans_vulkan_uninit(AVFilterContext *avctx) ff_vk_shader_free(vkctx, &s->shd_denoise); av_buffer_pool_uninit(&s->integral_buf_pool); - av_buffer_pool_uninit(&s->state_buf_pool); av_buffer_pool_uninit(&s->ws_buf_pool); if (s->sampler) diff --git a/libavfilter/vulkan/prefix_sum.comp b/libavfilter/vulkan/prefix_sum.comp deleted file mode 100644 index 9147cd82fb..0000000000 --- a/libavfilter/vulkan/prefix_sum.comp +++ /dev/null @@ -1,151 +0,0 @@ -#extension GL_EXT_buffer_reference : require -#extension GL_EXT_buffer_reference2 : require - -#define ACQUIRE gl_StorageSemanticsBuffer, gl_SemanticsAcquire -#define RELEASE gl_StorageSemanticsBuffer, gl_SemanticsRelease - -// These correspond to X, A, P respectively in the prefix sum paper. -#define FLAG_NOT_READY 0u -#define FLAG_AGGREGATE_READY 1u -#define FLAG_PREFIX_READY 2u - -layout(buffer_reference, buffer_reference_align = T_ALIGN) nonprivate buffer StateData { - DTYPE aggregate; - DTYPE prefix; - uint flag; -}; - -shared DTYPE sh_scratch[WG_SIZE]; -shared DTYPE sh_prefix; -shared uint sh_part_ix; -shared uint sh_flag; - -void prefix_sum(DataBuffer dst, uint dst_stride, DataBuffer src, uint src_stride) -{ - DTYPE local[N_ROWS]; - // Determine partition to process by atomic counter (described in Section 4.4 of prefix sum paper). - if (gl_GlobalInvocationID.x == 0) - sh_part_ix = gl_WorkGroupID.x; -// sh_part_ix = atomicAdd(part_counter, 1); - - barrier(); - uint part_ix = sh_part_ix; - - uint ix = part_ix * PARTITION_SIZE + gl_LocalInvocationID.x * N_ROWS; - - // TODO: gate buffer read? 
(evaluate whether shader check or CPU-side padding is better) - local[0] = src.v[ix*src_stride]; - for (uint i = 1; i < N_ROWS; i++) - local[i] = local[i - 1] + src.v[(ix + i)*src_stride]; - - DTYPE agg = local[N_ROWS - 1]; - sh_scratch[gl_LocalInvocationID.x] = agg; - for (uint i = 0; i < LG_WG_SIZE; i++) { - barrier(); - if (gl_LocalInvocationID.x >= (1u << i)) - agg += sh_scratch[gl_LocalInvocationID.x - (1u << i)]; - barrier(); - - sh_scratch[gl_LocalInvocationID.x] = agg; - } - - // Publish aggregate for this partition - if (gl_LocalInvocationID.x == WG_SIZE - 1) { - state[part_ix].aggregate = agg; - if (part_ix == 0) - state[0].prefix = agg; - } - - // Write flag with release semantics - if (gl_LocalInvocationID.x == WG_SIZE - 1) { - uint flag = part_ix == 0 ? FLAG_PREFIX_READY : FLAG_AGGREGATE_READY; - atomicStore(state[part_ix].flag, flag, gl_ScopeDevice, RELEASE); - } - - DTYPE exclusive = DTYPE(0); - if (part_ix != 0) { - // step 4 of paper: decoupled lookback - uint look_back_ix = part_ix - 1; - - DTYPE their_agg; - uint their_ix = 0; - while (true) { - // Read flag with acquire semantics. - if (gl_LocalInvocationID.x == WG_SIZE - 1) - sh_flag = atomicLoad(state[look_back_ix].flag, gl_ScopeDevice, ACQUIRE); - - // The flag load is done only in the last thread. However, because the - // translation of memoryBarrierBuffer to Metal requires uniform control - // flow, we broadcast it to all threads. 
-            barrier();
-
-            uint flag = sh_flag;
-            barrier();
-
-            if (flag == FLAG_PREFIX_READY) {
-                if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-                    DTYPE their_prefix = state[look_back_ix].prefix;
-                    exclusive = their_prefix + exclusive;
-                }
-                break;
-            } else if (flag == FLAG_AGGREGATE_READY) {
-                if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-                    their_agg = state[look_back_ix].aggregate;
-                    exclusive = their_agg + exclusive;
-                }
-                look_back_ix--;
-                their_ix = 0;
-                continue;
-            } // else spins
-
-            if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-                // Unfortunately there's no guarantee of forward progress of other
-                // workgroups, so compute a bit of the aggregate before trying again.
-                // In the worst case, spinning stops when the aggregate is complete.
-                DTYPE m = src.v[(look_back_ix * PARTITION_SIZE + their_ix)*src_stride];
-                if (their_ix == 0)
-                    their_agg = m;
-                else
-                    their_agg += m;
-
-                their_ix++;
-                if (their_ix == PARTITION_SIZE) {
-                    exclusive = their_agg + exclusive;
-                    if (look_back_ix == 0) {
-                        sh_flag = FLAG_PREFIX_READY;
-                    } else {
-                        look_back_ix--;
-                        their_ix = 0;
-                    }
-                }
-            }
-            barrier();
-            flag = sh_flag;
-            barrier();
-            if (flag == FLAG_PREFIX_READY)
-                break;
-        }
-
-        // step 5 of paper: compute inclusive prefix
-        if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-            DTYPE inclusive_prefix = exclusive + agg;
-            sh_prefix = exclusive;
-            state[part_ix].prefix = inclusive_prefix;
-        }
-
-        if (gl_LocalInvocationID.x == WG_SIZE - 1)
-            atomicStore(state[part_ix].flag, FLAG_PREFIX_READY, gl_ScopeDevice, RELEASE);
-    }
-
-    barrier();
-    if (part_ix != 0)
-        exclusive = sh_prefix;
-
-    DTYPE row = exclusive;
-    if (gl_LocalInvocationID.x > 0)
-        row += sh_scratch[gl_LocalInvocationID.x - 1];
-
-    // note - may overwrite
-    for (uint i = 0; i < N_ROWS; i++)
-        dst.v[(ix + i)*dst_stride] = row + local[i];
-}
-- 
2.42.0