From patchwork Sat Oct  7 15:03:16 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 44197
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:4e24:b0:15d:8365:d4b8 with SMTP id
 gk36csp472961pzb;
        Sat, 7 Oct 2023 08:03:34 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IG3S9dLQAHGXZ6n20MCXICU45EZqLcDRqL4skTgylKTLBthIZ+QhC0tDLyVh88jkzq4WMxt
X-Received: by 2002:a17:907:d02:b0:9ae:5a56:be32 with SMTP id
 gn2-20020a1709070d0200b009ae5a56be32mr6715410ejc.38.1696691013854;
        Sat, 07 Oct 2023 08:03:33 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1696691013; cv=none;
        d=google.com; s=arc-20160816;
        b=EMRl9178PtrBahbXAvu9+o+NOnc1q9rNKWAOFOSUXr8Ak+vYuQ2BEyjjWX5GTZjw1g
         bn093JoP4dgdQcV6Z7P3SCWdzjOZcfiNGrmyDo0C+X2+zdhkM6VIXGqm5FhB2uQbNcGe
         QigPhmg5GATTequWKVxExbu5BUgqwM1Tz0CHnABQTlVvKem+Z9XeJQKfFFcU6ooXcL/a
         mjAASG6mOeynqhfgwIigMHvMYmTvQw/tDX5g5cDIt7z4iYVI39PBF5/P/pvERvG66JQt
         hdXLKB6+0vvYCTpK0tCuk6JRASyBP6GkL0AQXGQzreViBSoYz/3TarmpBM3jw2QMqp3U
         ebrA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:message-id:to:from:date:dkim-signature:delivered-to;
        bh=538w5jhx5zBODKzuH7O5aB3OqNBtS+5wM5QZZDxwxls=;
        fh=Q46kXK7oI5D1Jhi90JBr53c7NIaTxGaU4KPeRZyM/hI=;
        b=jx1BpyuhKEkVNkAY0GYNwhqnGc1ej/ZLailO5foDHZN0G9M01Uwiwr7nvtAMZjOO5i
         +uOluBEcjHrtgDWrmd7yh99mVSo16qwjDO8wrwkPPrRtmHR8TsUVsszA19iYjf6nOhXB
         29y32018ME6sHPy9y76EKvSqFvAWXuKUubvXRflcSacPyGtoGTB8CoMqBKc9HxIPMKTi
         RB/v4v11+PoSsTf675oLX8z1kWptSbds9qmlY344uyL4QwGkhiQaXchkdc1ccvDBPjU6
         oeft88l/loKbJS/76eUm9L3L+AaBNeDMhMg+X+emRrXVTM7vDjavG0zhtyovIPZ2DEFd
         /jeA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=fmt3t+QS;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 i3-20020a1709061cc300b0099b7f483374si2566147ejh.183.2023.10.07.08.03.28;
        Sat, 07 Oct 2023 08:03:33 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=fmt3t+QS;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 6A1E268CA3B;
	Sat,  7 Oct 2023 18:03:24 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 0CE4768CA09
 for <ffmpeg-devel@ffmpeg.org>; Sat,  7 Oct 2023 18:03:17 +0300 (EEST)
Received: from tutadb.w10.tutanota.de (unknown [192.168.1.10])
 by w4.tutanota.de (Postfix) with ESMTP id 0F2D41060160
 for <ffmpeg-devel@ffmpeg.org>; Sat,  7 Oct 2023 15:03:17 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1696690996;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:Sender;
 bh=QYr7E2o/ix1wmjNERDwLnQ0NtXTy6WWP5fci3EjEm0U=;
 b=fmt3t+QSOEVfEtCsfAjnxX/U0pV1REa+hCh2nkuqzPi6LURue5IZKywIeKPZUdIp
 Oa8algh2LC3O93q6spPDRHcqwCG7ZTbOPZ0Kt1cHO9yMM5uptOR0wBztHz2dMspo39W
 GKILv5kHDyAkIbiM3AB6uSseYO7diQ6wzSBxqxxfLXDJOuW5yuBQIuEFKijzA6ER8Co
 ePclsF85ksQmGUMkc/O3CE8vY6rehUsgQlc4S/zM8JoTrmeI1ph+F+7OCmllD5qSEqw
 OqaHCmouq1Hb0QdJ3tqqR75dg3wB9NXFvgIUDGP9bJU3ygoAreWmubzAYZ+ZfIqHkyB
 yvFNaq78LQ==
Date: Sat, 7 Oct 2023 17:03:16 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: Ffmpeg Devel <ffmpeg-devel@ffmpeg.org>
Message-ID: <Ng9eVFH--3-9@lynne.ee>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 1/3] nlmeans_vulkan: fix width/height for
 chroma plane weights calculation
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: MWGdkSMBTdyH

Patch attached.
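
The diff below replaces the hard-coded height[0]/width[0] loop bounds with
height[plane]/width[plane]. As a rough illustration of why that matters,
here is a minimal sketch of how per-plane dimensions differ once chroma is
subsampled. The helper names (ceil_rshift, plane_size) and the assumption
that planes 1 and 2 are the chroma planes are hypothetical and not taken
from the patch.

```c
/* Ceiling right shift: rounds up so odd sizes keep their last sample. */
static int ceil_rshift(int v, int shift)
{
    return (v + (1 << shift) - 1) >> shift;
}

/*
 * Hypothetical helper: compute the size of one plane.
 * Assumes plane 0 is luma and planes 1 and 2 are chroma, subsampled
 * by log2_cw horizontally and log2_ch vertically.
 */
static void plane_size(int w, int h, int plane, int log2_cw, int log2_ch,
                       int *pw, int *ph)
{
    int is_chroma = (plane == 1 || plane == 2);

    *pw = is_chroma ? ceil_rshift(w, log2_cw) : w;
    *ph = is_chroma ? ceil_rshift(h, log2_ch) : h;
}
```

With a 4:2:0 layout (both shifts equal to 1), a loop bounded by the luma
size would run over twice as many rows and columns as a chroma plane
actually has.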

From 927c74d7851aafc589760a3882bef7f72b19db1c Mon Sep 17 00:00:00 2001
From: Lynne <dev@lynne.ee>
Date: Sat, 16 Sep 2023 00:42:53 +0200
Subject: [PATCH 1/3] nlmeans_vulkan: fix width/height for chroma plane weights
 calculation

---
 libavfilter/vf_nlmeans_vulkan.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavfilter/vf_nlmeans_vulkan.c b/libavfilter/vf_nlmeans_vulkan.c
index 99f4f867e7..5b623eb7a6 100644
--- a/libavfilter/vf_nlmeans_vulkan.c
+++ b/libavfilter/vf_nlmeans_vulkan.c
@@ -100,7 +100,7 @@ static void insert_horizontal_pass(FFVkSPIRVShader *shd, int nb_rows, int first,
                                 gl_SemanticsMakeAvailable |
                                 gl_SemanticsMakeVisible);                     );
     }
-    GLSLC(1, for (y = 0; y < height[0]; y++) {                                );
+    GLSLF(1, for (y = 0; y < height[%i]; y++) {                               ,plane);
     GLSLC(2,     offset = uint64_t(int_stride)*y*T_ALIGN;                     );
     GLSLC(2,     dst = DataBuffer(uint64_t(integral_data) + offset);          );
     GLSLC(0,                                                                  );
@@ -127,7 +127,7 @@ static void insert_vertical_pass(FFVkSPIRVShader *shd, int nb_rows, int first, i
                                 gl_SemanticsMakeAvailable |
                                 gl_SemanticsMakeVisible);                     );
     }
-    GLSLC(1, for (x = 0; x < width[0]; x++) {                                 );
+    GLSLF(1, for (x = 0; x < width[%i]; x++) {                                ,plane);
     GLSLC(2,     dst = DataBuffer(uint64_t(integral_data) + x*T_ALIGN);       );
 
     for (int r = 0; r < nb_rows; r++) {
@@ -156,13 +156,13 @@ static void insert_weights_pass(FFVkSPIRVShader *shd, int nb_rows, int vert,
                             gl_SemanticsMakeVisible);                         );
     GLSLC(1, barrier();                                                       );
     if (!vert) {
-        GLSLC(1, for (y = 0; y < height[0]; y++) {                            );
+        GLSLF(1, for (y = 0; y < height[%i]; y++) {                           ,plane);
         GLSLF(2,     if (gl_GlobalInvocationID.x*%i >= width[%i])             ,nb_rows, plane);
         GLSLC(3,         break;                                               );
         GLSLF(2,     for (r = 0; r < %i; r++) {                       ,nb_rows);
         GLSLF(3,         x = int(gl_GlobalInvocationID.x) * %i + r;   ,nb_rows);
     } else {
-        GLSLC(1, for (x = 0; x < width[0]; x++) {                             );
+        GLSLF(1, for (x = 0; x < width[%i]; x++) {                            ,plane);
         GLSLF(2,     if (gl_GlobalInvocationID.x*%i >= height[%i])            ,nb_rows, plane);
         GLSLC(3,         break;                                               );
         GLSLF(2,     for (r = 0; r < %i; r++) {                       ,nb_rows);
-- 
2.42.0


From patchwork Sat Oct  7 15:04:04 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 44198
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:4e24:b0:15d:8365:d4b8 with SMTP id
 gk36csp473452pzb;
        Sat, 7 Oct 2023 08:04:16 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IHi7XIjcPcG346nzhs/5iNWv167Ry0VFz1n9FrQWKOyykN9ir/PM5xrSfp0r88NPyOoxQwG
X-Received: by 2002:a17:906:24d:b0:9a5:ca42:f3a9 with SMTP id
 13-20020a170906024d00b009a5ca42f3a9mr10954896ejl.2.1696691055704;
        Sat, 07 Oct 2023 08:04:15 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1696691055; cv=none;
        d=google.com; s=arc-20160816;
        b=EvXrfXxr9BGYJZJu4YB/avs5DR0EJM0nNqSs07JnInKAq8hyGQw3SwVu3tDFiFQePR
         mSDIV2sPM8i39WsaGnotlilENQ/hb4S2OGvSisa96n0hjztNsBrnRa7OmDDlHyY0cHrK
         g0Tm+ZLuLWCQI/3lE4Cocz7xNztVvXzRbX/66Brnz1DVxKvzBAH/rbGxkInMPbYuvIV5
         6ozyv2UbJt72Mf2/y+W8oSj9hQxxfXtcu66aeA1l3+F5lxwE1E4EpEr5Em7RNz9CGlmx
         yCnDMB6vag04iHm0MC6XzXu5KR3pm0Uy9YT3q25J+QVbg0j6rKaQZX82lC2XhoCzfrPM
         QBtg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:to:from:date
         :dkim-signature:delivered-to;
        bh=t07VvuzpXwHU5uBrnwSgmBx7GhBUQMHcn4N+lKIvIIM=;
        fh=e5zN9xSzcxLA6bGo3lF+CqTbY/oLwzApV03EO/RBfgQ=;
        b=Y5QIJ2N7mSVzqKc2EgNVM9m+DhtQVDIA0wlTJA9fQ28MHelZkkxkJEmqojMCUKETqy
         65ZFQvWFDZq+ri5DsbBstPbVlAwfn0YgtB56mVjEM2fziDfVgKf80BoGzzxEcUQaioD4
         xGi4MWl3lsjG/TOxv4djH19lIXXvT5FRsvppHPu2uqB9miKXwDjZTsw1BclFdDKs2CC+
         fDQoQyyUoIm5MrizGVsYQ3SGlJmjC9xSUZ89ht/w3QNz7mPSxjalTkBbdeG4YoHYzpX/
         zoZGef64LXzYv0OFH81bwLZ3i875FvMAZ8ctvSbEcgxRKkeVfR4DQxPGSxaZufYI3Eij
         BZHg==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=1JDRNp7H;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 rs3-20020a170907036300b009b2b30dcf6asi2701410ejb.559.2023.10.07.08.04.14;
        Sat, 07 Oct 2023 08:04:15 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=1JDRNp7H;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 7E05C68CA67;
	Sat,  7 Oct 2023 18:04:12 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 5CCB968CA16
 for <ffmpeg-devel@ffmpeg.org>; Sat,  7 Oct 2023 18:04:05 +0300 (EEST)
Received: from tutadb.w10.tutanota.de (unknown [192.168.1.10])
 by w4.tutanota.de (Postfix) with ESMTP id 030001060165
 for <ffmpeg-devel@ffmpeg.org>; Sat,  7 Oct 2023 15:04:05 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1696691044;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender;
 bh=bJwYmJ+/va+H3xMsAYvCX+BJa3oY2FvZWZlcSWl2AMI=;
 b=1JDRNp7HZXgxO0wIOIRgoMPC2WxDGdQgMKwi0sG4djDCTsvaSAJNPF5DLaEvgL3Q
 JnRaPe/AOhcmVBkFV73TMlyteYxG2loDLmZqaEyucnOox9N4TZJhFZwU1oSBJ+w/lhR
 52nusRqsnIbVYCxDvGyHTtN0VNw1v8MFyONKK4IaAYEpxWeIxJs93SA8EAuCPVfoqHL
 auk+915PvSUEtX72NJ1AiKnMYBoE7vCaH/4Woeo8BkQooWHBkjNrTMgaVBuTasBs9t6
 kHKvyjk2LH7lHmyhcUA+urPONu8Ny5rO69dt1pVdRKmU4kkySyJn6LyooPz9o7MhQ0N
 JJFYCW7DgQ==
Date: Sat, 7 Oct 2023 17:04:04 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <Ng9eg6t--7-9@lynne.ee>
In-Reply-To: <Ng9eVFH--3-9@lynne.ee-Ng9eZVQ----9>
References: <Ng9eVFH--3-9@lynne.ee> <Ng9eVFH--3-9@lynne.ee-Ng9eZVQ----9>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 2/3] nlmeans_vulkan: reduce dispatches by
 parallelizing the planes
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: hsM7lcKllETp

Patch attached.
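
The diff below moves the per-plane loop out of the generated shader and into
the Z dimension of the dispatch, with each invocation checking its position
against the size of its own plane. A minimal sketch of that arithmetic
follows, assuming hypothetical names (GroupCount, dispatch_size, is_within)
and writing the patch's FFALIGN(size, wg)/wg as a plain ceiling division.

```c
typedef struct GroupCount {
    unsigned x, y, z;
} GroupCount;

/* Number of workgroups of size d needed to cover n items. */
static unsigned ceil_div(unsigned n, unsigned d)
{
    return (n + d - 1) / d;
}

/* One Z slice per plane; X and Y cover the full output frame. */
static GroupCount dispatch_size(unsigned w, unsigned h,
                                unsigned wg_x, unsigned wg_y,
                                unsigned planes)
{
    GroupCount g = { ceil_div(w, wg_x), ceil_div(h, wg_y), planes };
    return g;
}

/*
 * Bounds check, analogous to the IS_WITHIN test in the shader: smaller
 * planes receive the same X/Y grid as the largest one, so invocations
 * outside the plane must exit early.
 */
static int is_within(unsigned px, unsigned py, unsigned pw, unsigned ph)
{
    return px < pw && py < ph;
}
```

For example, a 1920x1080 frame with three planes and a hypothetical 32x32
workgroup size would be covered by a single 60x34x3 dispatch rather than by
a loop over planes inside the shader.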

From 863a9977192abf00d131d7b0ed88569210fe0d0c Mon Sep 17 00:00:00 2001
From: Lynne <dev@lynne.ee>
Date: Sat, 16 Sep 2023 01:04:18 +0200
Subject: [PATCH 2/3] nlmeans_vulkan: reduce dispatches by parallelizing the
 planes

---
 libavfilter/vf_nlmeans_vulkan.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/libavfilter/vf_nlmeans_vulkan.c b/libavfilter/vf_nlmeans_vulkan.c
index 5b623eb7a6..9741dd67ac 100644
--- a/libavfilter/vf_nlmeans_vulkan.c
+++ b/libavfilter/vf_nlmeans_vulkan.c
@@ -538,28 +538,29 @@ static av_cold int init_denoise_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     GLSLC(0, {                                                                );
     GLSLC(1,     ivec2 size;                                                  );
     GLSLC(1,     const ivec2 pos = ivec2(gl_GlobalInvocationID.xy);           );
+    GLSLC(1,     const uint plane = uint(gl_WorkGroupID.z);                   );
     GLSLC(0,                                                                  );
     GLSLC(1,     float w_sum;                                                 );
     GLSLC(1,     float sum;                                                   );
     GLSLC(1,     vec4 src;                                                    );
     GLSLC(1,     vec4 r;                                                      );
     GLSLC(0,                                                                  );
-
-    for (int i = 0; i < planes; i++) {
-        GLSLF(1, src = texture(input_img[%i], pos);                         ,i);
-        for (int c = 0; c < desc->nb_components; c++) {
-            if (desc->comp[c].plane == i) {
-                int off = desc->comp[c].offset / (FFALIGN(desc->comp[c].depth, 8)/8);
-                GLSLF(1, w_sum = weights_%i[pos.y*ws_stride[%i] + pos.x];               ,c, c);
-                GLSLF(1, sum = sums_%i[pos.y*ws_stride[%i] + pos.x];                    ,c, c);
-                GLSLF(1, r[%i] = (sum + src[%i]*255) / (1.0 + w_sum) / 255;         ,off, off);
-                GLSLC(0,                                                                     );
-            }
-        }
-        GLSLF(1, imageStore(output_img[%i], pos, r);                        ,i);
-        GLSLC(0,                                                              );
+    GLSLC(1,     size = imageSize(output_img[plane]);                         );
+    GLSLC(1,     if (!IS_WITHIN(pos, size))                                   );
+    GLSLC(2,         return;                                                  );
+    GLSLC(0,                                                                  );
+    GLSLC(1,     src = texture(input_img[plane], pos);                        );
+    GLSLC(0,                                                                  );
+    for (int c = 0; c < desc->nb_components; c++) {
+        int off = desc->comp[c].offset / (FFALIGN(desc->comp[c].depth, 8)/8);
+        GLSLF(1, if (plane == %i) {                                              ,desc->comp[c].plane);
+        GLSLF(2,     w_sum = weights_%i[pos.y*ws_stride[%i] + pos.x];                           ,c, c);
+        GLSLF(2,     sum = sums_%i[pos.y*ws_stride[%i] + pos.x];                                ,c, c);
+        GLSLF(2,     r[%i] = (sum + src[%i]*255) / (1.0 + w_sum) / 255;                     ,off, off);
+        GLSLC(1, }                                                                                   );
+        GLSLC(0,                                                                                     );
     }
-
+    GLSLC(1, imageStore(output_img[plane], pos, r);                           );
     GLSLC(0, }                                                                );
 
     RET(spv->compile_shader(spv, vkctx, shd, &spv_data, &spv_len, "main", &spv_opaque));
@@ -716,7 +717,7 @@ static int denoise_pass(NLMeansVulkanContext *s, FFVkExecContext *exec,
     vk->CmdDispatch(exec->buf,
                     FFALIGN(vkctx->output_width,  s->pl_denoise.wg_size[0])/s->pl_denoise.wg_size[0],
                     FFALIGN(vkctx->output_height, s->pl_denoise.wg_size[1])/s->pl_denoise.wg_size[1],
-                    1);
+                    av_pix_fmt_count_planes(s->vkctx.output_format));
 
     return 0;
 }
-- 
2.42.0


From patchwork Sat Oct  7 15:07:52 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lynne <dev@lynne.ee>
X-Patchwork-Id: 44199
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:4e24:b0:15d:8365:d4b8 with SMTP id
 gk36csp475975pzb;
        Sat, 7 Oct 2023 08:08:03 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IF+1NmPlur8gtNjI0AlnEXeFgDhKFUKP53NIbcGWA7K7rSmeQXXm3Z/LZMiH8bpjDD5MK/x
X-Received: by 2002:a17:906:7382:b0:9ae:284:c93d with SMTP id
 f2-20020a170906738200b009ae0284c93dmr8219388ejl.5.1696691283284;
        Sat, 07 Oct 2023 08:08:03 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1696691283; cv=none;
        d=google.com; s=arc-20160816;
        b=UVjvbrKgbjB25A37eTEG9Kl9HkVxq539KD8WYamg4etJCClLdHC7sXkdpYEJP91iUc
         6DFdH42+nYHTRMGr6oZULCsTVUl6BB9Z9tQJTuJA5FLlCqbkqpLHa3LgXMtM1u7Bm1bM
         JP0iUjUH/PExtAEhFK6oBgJ9P+JvEga7711tI97YH9ZNDU1R1q/HYu5cuWsb5PFO4MrB
         DPRG2GNupc7VBccGeiwP6p7Sl8WXm9vNG9tf5ido0TYOyTg9cLmKBP7wEviM4xaKy/LQ
         wYaUVRYBLryJdVhJ/H1iNatPyWLJuzNLR7Uw0zg/iYPJjFQt9s65OS2CxiZ9eLQqS9lS
         anBw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:reply-to:list-subscribe:list-help:list-post
         :list-archive:list-unsubscribe:list-id:precedence:subject
         :mime-version:references:in-reply-to:message-id:to:from:date
         :dkim-signature:delivered-to;
        bh=2qE5LZDZKE3y43QfIe7dGMjqE21apdMGY6MfC4rkUP8=;
        fh=e5zN9xSzcxLA6bGo3lF+CqTbY/oLwzApV03EO/RBfgQ=;
        b=paDP6C6tyZsF7ONslqYX1X/Y6Ju1jZqPdS/0n8zcWeLm6YFQic4cvM/0oBhT3f9gYc
         UOJJUyIDUBXvNTRuirVPSJ1mrXJHO3NP+5p/71Www+gtfzTala8ZfTLlBgdscvEDBXB5
         9C2qChnr904tj4uLEJxSTPnpss70Sldn4Fh9IVWx5sbk2fJMKmlnYsTAuKw8tydKqj+C
         4fkRzjg1oFccHAaHq5NGPWxrkNbWnvAJUL+M3AceyUCQzUl8SI6NNY7UEpY0k4Kn8ci6
         EBjssicfjXZeSgG1esCtF1fQsgEm9HdUzSIlGvnoLkLqkIylDRqsopGkPoUjZfrS6lki
         qMpQ==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=AC5oaxvQ;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 y25-20020a170906471900b009a633e2fae9si2634907ejq.127.2023.10.07.08.08.02;
        Sat, 07 Oct 2023 08:08:03 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@lynne.ee header.s=s1
 header.b=AC5oaxvQ;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org;
       dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=lynne.ee
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id C0BEF68CA6B;
	Sat,  7 Oct 2023 18:07:59 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from w4.tutanota.de (w4.tutanota.de [81.3.6.165])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id BE5E168C95A
 for <ffmpeg-devel@ffmpeg.org>; Sat,  7 Oct 2023 18:07:52 +0300 (EEST)
Received: from tutadb.w10.tutanota.de (unknown [192.168.1.10])
 by w4.tutanota.de (Postfix) with ESMTP id 6CC3E1060152
 for <ffmpeg-devel@ffmpeg.org>; Sat,  7 Oct 2023 15:07:52 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1696691272;
 s=s1; d=lynne.ee;
 h=From:From:To:To:Subject:Subject:Content-Description:Content-ID:Content-Type:Content-Type:Content-Transfer-Encoding:Cc:Date:Date:In-Reply-To:In-Reply-To:MIME-Version:MIME-Version:Message-ID:Message-ID:Reply-To:References:References:Sender;
 bh=Ix9QaWZpskQBntgBSfss6e4tvGHeUKclvET0ySGoTd8=;
 b=AC5oaxvQF/xrF4Pv7bOn3nOROfMvM2Qrn0cfEVH+1/UhkGMSz89rzW66JhThaqSP
 hGlSwRuGggrh3U5p16xk7kdBqBTxWyXJ/mYpe98DOncs980LhBi+ddFGktcIP+84j7J
 SdHVlDlpxHy+eomRmq78luTNM+GHnurrPZ5DdQAj7O3Er2KtF0JwRO1NEK8caIFTKeu
 5KyDJ+6FAceI+OrFugBacdPAO0/VSSqaXUedNqlnxKc3e/UQtkXVL/8ZFMB3UmzB7oh
 xUs1TiUe7e9PHU3W6IFFsfI26HxDdy8gudmWbe9oKsJPQwcA9sb9rOD0ecJsyKr0p29
 ZmTzjSHh0A==
Date: Sat, 7 Oct 2023 17:07:52 +0200 (CEST)
From: Lynne <dev@lynne.ee>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <Ng9fYdg--3-9@lynne.ee>
In-Reply-To: <Ng9eVFH--3-9@lynne.ee-Ng9eZVQ----9>
References: <Ng9eVFH--3-9@lynne.ee> <Ng9eVFH--3-9@lynne.ee-Ng9eZVQ----9>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH 3/3] nlmeans_vulkan: parallelize workgroup
 invocations
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: qKfwxIWz0w7f

Removes the subgroup-based parallel prefix sum and computes the
prefix sum inline in the shader instead, which greatly reduces
the number of dispatches.

Provides a ~12x speedup (2.5 fps to 30 fps on a 7900XTX,
2.1 fps to 24 fps on an Ada).
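
As a rough CPU-side illustration of what "inline" means here, the sketch
below runs an inclusive running sum along each row and then along each
column, which is the same accumulation pattern the new horizontal pass emits
(dst.v[x] = s2 + prefix_sum; prefix_sum += s2;). The function names, the
float element type and the plain array layout are assumptions made for the
example; in the shader each row or column group is handled by its own
invocation rather than by an outer loop.

```c
#include <stddef.h>

/* Inclusive prefix sum along every row of a width x height buffer. */
static void horizontal_pass(float *data, size_t width, size_t height,
                            size_t stride)
{
    for (size_t y = 0; y < height; y++) {
        float prefix_sum = 0.0f;
        float *row = data + y * stride;

        for (size_t x = 0; x < width; x++) {
            float s2 = row[x];
            row[x] = s2 + prefix_sum;
            prefix_sum += s2;
        }
    }
}

/* Inclusive prefix sum along every column of the same buffer. */
static void vertical_pass(float *data, size_t width, size_t height,
                          size_t stride)
{
    for (size_t x = 0; x < width; x++) {
        float prefix_sum = 0.0f;

        for (size_t y = 0; y < height; y++) {
            float s2 = data[y * stride + x];
            data[y * stride + x] = s2 + prefix_sum;
            prefix_sum += s2;
        }
    }
}
```

Running both passes over a buffer leaves each element holding the sum of
everything above and to the left of it, inclusive, without a separate
prefix-sum dispatch.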

Patch attached.

From a51dd2ace418974c7e8b24a3762bd7495d3b3b10 Mon Sep 17 00:00:00 2001
From: Lynne <dev@lynne.ee>
Date: Fri, 15 Sep 2023 21:55:59 +0200
Subject: [PATCH 3/3] nlmeans_vulkan: parallelize workgroup invocations

---
 libavfilter/Makefile               |   3 +-
 libavfilter/vf_nlmeans_vulkan.c    | 333 ++++++++++++++---------------
 libavfilter/vulkan/prefix_sum.comp | 151 -------------
 3 files changed, 167 insertions(+), 320 deletions(-)
 delete mode 100644 libavfilter/vulkan/prefix_sum.comp

diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 9a100cd665..603b532ad0 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -395,8 +395,7 @@ OBJS-$(CONFIG_MULTIPLY_FILTER)               += vf_multiply.o
 OBJS-$(CONFIG_NEGATE_FILTER)                 += vf_negate.o
 OBJS-$(CONFIG_NLMEANS_FILTER)                += vf_nlmeans.o
 OBJS-$(CONFIG_NLMEANS_OPENCL_FILTER)         += vf_nlmeans_opencl.o opencl.o opencl/nlmeans.o
-OBJS-$(CONFIG_NLMEANS_VULKAN_FILTER)         += vf_nlmeans_vulkan.o vulkan.o vulkan_filter.o \
-                                                vulkan/prefix_sum.o
+OBJS-$(CONFIG_NLMEANS_VULKAN_FILTER)         += vf_nlmeans_vulkan.o vulkan.o vulkan_filter.o
 OBJS-$(CONFIG_NNEDI_FILTER)                  += vf_nnedi.o
 OBJS-$(CONFIG_NOFORMAT_FILTER)               += vf_format.o
 OBJS-$(CONFIG_NOISE_FILTER)                  += vf_noise.o
diff --git a/libavfilter/vf_nlmeans_vulkan.c b/libavfilter/vf_nlmeans_vulkan.c
index 9741dd67ac..6046ff598c 100644
--- a/libavfilter/vf_nlmeans_vulkan.c
+++ b/libavfilter/vf_nlmeans_vulkan.c
@@ -38,9 +38,10 @@ typedef struct NLMeansVulkanContext {
     VkSampler sampler;
 
     AVBufferPool *integral_buf_pool;
-    AVBufferPool *state_buf_pool;
     AVBufferPool *ws_buf_pool;
 
+    FFVkBuffer xyoffsets_buf;
+
     int pl_weights_rows;
     FFVulkanPipeline pl_weights;
     FFVkSPIRVShader shd_weights;
@@ -66,107 +67,97 @@ typedef struct NLMeansVulkanContext {
 
 extern const char *ff_source_prefix_sum_comp;
 
-static void insert_first(FFVkSPIRVShader *shd, int r, int horiz, int plane, int comp)
+static void insert_first(FFVkSPIRVShader *shd, int r, const char *off, int horiz, int plane, int comp)
 {
-    GLSLF(2,     s1    = texture(input_img[%i], ivec2(x + %i, y + %i))[%i];
-          ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-
-    if (TYPE_ELEMS == 4) {
-        GLSLF(2, s2[0] = texture(input_img[%i], ivec2(x + %i + xoffs[0], y + %i + yoffs[0]))[%i];
-              ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-        GLSLF(2, s2[1] = texture(input_img[%i], ivec2(x + %i + xoffs[1], y + %i + yoffs[1]))[%i];
-              ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-        GLSLF(2, s2[2] = texture(input_img[%i], ivec2(x + %i + xoffs[2], y + %i + yoffs[2]))[%i];
-              ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-        GLSLF(2, s2[3] = texture(input_img[%i], ivec2(x + %i + xoffs[3], y + %i + yoffs[3]))[%i];
-              ,plane, horiz ? r : 0, !horiz ? r : 0, comp);
-    } else {
-        for (int i = 0; i < 16; i++) {
-            GLSLF(2, s2[%i][%i] = texture(input_img[%i], ivec2(x + %i + xoffs[%i], y + %i + yoffs[%i]))[%i];
-                  ,i / 4, i % 4, plane, horiz ? r : 0, i, !horiz ? r : 0, i, comp);
-        }
-    }
-
-    GLSLC(2, s2 = (s1 - s2) * (s1 - s2);                                       );
+    GLSLF(4, s1    = texture(input_img[%i], ivec2(x + %i + %s, y + %i + %s))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+
+    GLSLF(4, s2[0] = texture(input_img[%i], ivec2(x + %i + %s + xoffs[0], y + %i + %s + yoffs[0]))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+    GLSLF(4, s2[1] = texture(input_img[%i], ivec2(x + %i + %s + xoffs[1], y + %i + %s + yoffs[1]))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+    GLSLF(4, s2[2] = texture(input_img[%i], ivec2(x + %i + %s + xoffs[2], y + %i + %s + yoffs[2]))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+    GLSLF(4, s2[3] = texture(input_img[%i], ivec2(x + %i + %s + xoffs[3], y + %i + %s + yoffs[3]))[%i];
+          ,plane, horiz ? r : 0, horiz ? off : "0", !horiz ? r : 0, !horiz ? off : "0", comp);
+
+    GLSLC(4, s2 = (s1 - s2) * (s1 - s2);                                                    );
 }
 
 static void insert_horizontal_pass(FFVkSPIRVShader *shd, int nb_rows, int first, int plane, int comp)
 {
-    GLSLF(1, x = int(gl_GlobalInvocationID.x) * %i;                   ,nb_rows);
-    if (!first) {
-        GLSLC(1, controlBarrier(gl_ScopeWorkgroup, gl_ScopeWorkgroup,
-                                gl_StorageSemanticsBuffer,
-                                gl_SemanticsAcquireRelease |
-                                gl_SemanticsMakeAvailable |
-                                gl_SemanticsMakeVisible);                     );
-    }
-    GLSLF(1, for (y = 0; y < height[%i]; y++) {                               ,plane);
-    GLSLC(2,     offset = uint64_t(int_stride)*y*T_ALIGN;                     );
-    GLSLC(2,     dst = DataBuffer(uint64_t(integral_data) + offset);          );
-    GLSLC(0,                                                                  );
-    if (first) {
-        for (int r = 0; r < nb_rows; r++) {
-            insert_first(shd, r, 1, plane, comp);
-            GLSLF(2, dst.v[x + %i] = s2;                                    ,r);
-            GLSLC(0,                                                          );
-        }
-    }
-    GLSLC(2,     barrier();                                                   );
-    GLSLC(2,     prefix_sum(dst, 1, dst, 1);                                  );
-    GLSLC(1, }                                                                );
-    GLSLC(0,                                                                  );
+    GLSLF(1, y = int(gl_GlobalInvocationID.x) * %i;                               ,nb_rows);
+    if (!first)
+        GLSLC(1, barrier();                                                       );
+    GLSLC(0,                                                                      );
+    GLSLF(1, if (y < height[%i]) {                                                ,plane);
+    GLSLC(2,     #pragma unroll(1)                                                );
+    GLSLF(2,     for (r = 0; r < %i; r++) {                                       ,nb_rows);
+    GLSLC(3,         prefix_sum = DTYPE(0);                                       );
+    GLSLC(3,         offset = uint64_t(int_stride)*(y + r)*T_ALIGN;               );
+    GLSLC(3,         dst = DataBuffer(uint64_t(integral_data) + offset);          );
+    GLSLC(0,                                                                      );
+    GLSLF(3,         for (x = 0; x < width[%i]; x++) {                            ,plane);
+    if (first)
+        insert_first(shd, 0, "r", 0, plane, comp);
+    else
+        GLSLC(4,         s2 = dst.v[x];                                           );
+    GLSLC(4,             dst.v[x] = s2 + prefix_sum;                              );
+    GLSLC(4,             prefix_sum += s2;                                        );
+    GLSLC(3,         }                                                            );
+    GLSLC(2,     }                                                                );
+    GLSLC(1, }                                                                    );
+    GLSLC(0,                                                                      );
 }
 
 static void insert_vertical_pass(FFVkSPIRVShader *shd, int nb_rows, int first, int plane, int comp)
 {
-    GLSLF(1, y = int(gl_GlobalInvocationID.x) * %i;                   ,nb_rows);
-    if (!first) {
-        GLSLC(1, controlBarrier(gl_ScopeWorkgroup, gl_ScopeWorkgroup,
-                                gl_StorageSemanticsBuffer,
-                                gl_SemanticsAcquireRelease |
-                                gl_SemanticsMakeAvailable |
-                                gl_SemanticsMakeVisible);                     );
-    }
-    GLSLF(1, for (x = 0; x < width[%i]; x++) {                                ,plane);
-    GLSLC(2,     dst = DataBuffer(uint64_t(integral_data) + x*T_ALIGN);       );
-
-    for (int r = 0; r < nb_rows; r++) {
-        if (first) {
-            insert_first(shd, r, 0, plane, comp);
-            GLSLF(2, integral_data.v[(y + %i)*int_stride + x] = s2;         ,r);
-            GLSLC(0,                                                          );
-        }
-    }
-
-    GLSLC(2,     barrier();                                                   );
-    GLSLC(2,     prefix_sum(dst, int_stride, dst, int_stride);                );
-    GLSLC(1, }                                                                );
-    GLSLC(0,                                                                  );
+    GLSLF(1, x = int(gl_GlobalInvocationID.x) * %i;                               ,nb_rows);
+    GLSLC(1, #pragma unroll(1)                                                    );
+    GLSLF(1, for (r = 0; r < %i; r++)                                             ,nb_rows);
+    GLSLC(2,     psum[r] = DTYPE(0);                                              );
+    GLSLC(0,                                                                      );
+    if (!first)
+        GLSLC(1, barrier();                                                       );
+    GLSLC(0,                                                                      );
+    GLSLF(1, if (x < width[%i]) {                                                 ,plane);
+    GLSLF(2,     for (y = 0; y < height[%i]; y++) {                               ,plane);
+    GLSLC(3,         offset = uint64_t(int_stride)*y*T_ALIGN;                     );
+    GLSLC(3,         dst = DataBuffer(uint64_t(integral_data) + offset);          );
+    GLSLC(0,                                                                      );
+    GLSLC(3,         #pragma unroll(1)                                            );
+    GLSLF(3,         for (r = 0; r < %i; r++) {                                   ,nb_rows);
+    if (first)
+        insert_first(shd, 0, "r", 1, plane, comp);
+    else
+        GLSLC(4,         s2 = dst.v[x + r];                                       );
+    GLSLC(4,             dst.v[x + r] = s2 + psum[r];                             );
+    GLSLC(4,             psum[r] += s2;                                           );
+    GLSLC(3,         }                                                            );
+    GLSLC(2,     }                                                                );
+    GLSLC(1, }                                                                    );
+    GLSLC(0,                                                                      );
 }
 
 static void insert_weights_pass(FFVkSPIRVShader *shd, int nb_rows, int vert,
                                 int t, int dst_comp, int plane, int comp)
 {
-    GLSLF(1, p = patch_size[%i];                                     ,dst_comp);
+    GLSLF(1, p = patch_size[%i];                                              ,dst_comp);
     GLSLC(0,                                                                  );
-    GLSLC(1, controlBarrier(gl_ScopeWorkgroup, gl_ScopeWorkgroup,
-                            gl_StorageSemanticsBuffer,
-                            gl_SemanticsAcquireRelease |
-                            gl_SemanticsMakeAvailable |
-                            gl_SemanticsMakeVisible);                         );
     GLSLC(1, barrier();                                                       );
+    GLSLC(0,                                                                  );
     if (!vert) {
         GLSLF(1, for (y = 0; y < height[%i]; y++) {                           ,plane);
         GLSLF(2,     if (gl_GlobalInvocationID.x*%i >= width[%i])             ,nb_rows, plane);
         GLSLC(3,         break;                                               );
-        GLSLF(2,     for (r = 0; r < %i; r++) {                       ,nb_rows);
-        GLSLF(3,         x = int(gl_GlobalInvocationID.x) * %i + r;   ,nb_rows);
+        GLSLF(2,     for (r = 0; r < %i; r++) {                               ,nb_rows);
+        GLSLF(3,         x = int(gl_GlobalInvocationID.x) * %i + r;           ,nb_rows);
     } else {
         GLSLF(1, for (x = 0; x < width[%i]; x++) {                            ,plane);
         GLSLF(2,     if (gl_GlobalInvocationID.x*%i >= height[%i])            ,nb_rows, plane);
         GLSLC(3,         break;                                               );
-        GLSLF(2,     for (r = 0; r < %i; r++) {                       ,nb_rows);
-        GLSLF(3,         y = int(gl_GlobalInvocationID.x) * %i + r;   ,nb_rows);
+        GLSLF(2,     for (r = 0; r < %i; r++) {                               ,nb_rows);
+        GLSLF(3,         y = int(gl_GlobalInvocationID.x) * %i + r;           ,nb_rows);
     }
     GLSLC(0,                                                                  );
     GLSLC(3,         a = DTYPE(0);                                            );
@@ -223,16 +214,15 @@ static void insert_weights_pass(FFVkSPIRVShader *shd, int nb_rows, int vert,
 }
 
 typedef struct HorizontalPushData {
-    VkDeviceAddress integral_data;
-    VkDeviceAddress state_data;
-    int32_t  xoffs[TYPE_ELEMS];
-    int32_t  yoffs[TYPE_ELEMS];
     uint32_t width[4];
     uint32_t height[4];
     uint32_t ws_stride[4];
     int32_t  patch_size[4];
     float    strength[4];
+    VkDeviceAddress integral_base;
+    uint32_t integral_size;
     uint32_t int_stride;
+    uint32_t xyoffs_start;
 } HorizontalPushData;
 
 static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *exec,
@@ -249,26 +239,18 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     FFVulkanDescriptorSetBinding *desc_set;
     int max_dim = FFMAX(width, height);
     uint32_t max_wg = vkctx->props.properties.limits.maxComputeWorkGroupSize[0];
-    int max_shm = vkctx->props.properties.limits.maxComputeSharedMemorySize;
     int wg_size, wg_rows;
 
     /* Round the max workgroup size to the previous power of two */
-    max_wg = 1 << (31 - ff_clz(max_wg));
     wg_size = max_wg;
     wg_rows = 1;
 
     if (max_wg > max_dim) {
-        wg_size = max_wg / (max_wg / max_dim);
+        wg_size = max_dim;
     } else if (max_wg < max_dim) {
-        /* First, make it fit */
+        /* Make it fit */
         while (wg_size*wg_rows < max_dim)
             wg_rows++;
-
-        /* Second, make sure there's enough shared memory */
-        while ((wg_size * TYPE_SIZE + TYPE_SIZE + 2*4) > max_shm) {
-            wg_size >>= 1;
-            wg_rows++;
-        }
     }
 
     RET(ff_vk_shader_init(pl, shd, "nlmeans_weights", VK_SHADER_STAGE_COMPUTE_BIT, 0));
@@ -278,33 +260,24 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     if (t > 1)
         GLSLC(0, #extension GL_EXT_shader_atomic_float : require              );
     GLSLC(0, #extension GL_ARB_gpu_shader_int64 : require                     );
-    GLSLC(0, #pragma use_vulkan_memory_model                                  );
-    GLSLC(0, #extension GL_KHR_memory_scope_semantics : enable                );
     GLSLC(0,                                                                  );
-    GLSLF(0, #define N_ROWS %i                                       ,*nb_rows);
-    GLSLC(0, #define WG_SIZE (gl_WorkGroupSize.x)                             );
-    GLSLF(0, #define LG_WG_SIZE %i                ,ff_log2(shd->local_size[0]));
-    GLSLC(0, #define PARTITION_SIZE (N_ROWS*WG_SIZE)                          );
-    GLSLF(0, #define DTYPE %s                                       ,TYPE_NAME);
-    GLSLF(0, #define T_ALIGN %i                                     ,TYPE_SIZE);
+    GLSLF(0, #define DTYPE %s                                                 ,TYPE_NAME);
+    GLSLF(0, #define T_ALIGN %i                                               ,TYPE_SIZE);
     GLSLC(0,                                                                  );
-    GLSLC(0, layout(buffer_reference, buffer_reference_align = T_ALIGN) coherent buffer DataBuffer {  );
+    GLSLC(0, layout(buffer_reference, buffer_reference_align = T_ALIGN) buffer DataBuffer {  );
     GLSLC(1,     DTYPE v[];                                                   );
     GLSLC(0, };                                                               );
     GLSLC(0,                                                                  );
-    GLSLC(0, layout(buffer_reference) buffer StateData;                       );
-    GLSLC(0,                                                                  );
     GLSLC(0, layout(push_constant, std430) uniform pushConstants {            );
-    GLSLC(1,     coherent DataBuffer integral_data;                           );
-    GLSLC(1,     StateData  state;                                            );
-    GLSLF(1,     uint xoffs[%i];                                   ,TYPE_ELEMS);
-    GLSLF(1,     uint yoffs[%i];                                   ,TYPE_ELEMS);
     GLSLC(1,     uvec4 width;                                                 );
     GLSLC(1,     uvec4 height;                                                );
     GLSLC(1,     uvec4 ws_stride;                                             );
     GLSLC(1,     ivec4 patch_size;                                            );
     GLSLC(1,     vec4 strength;                                               );
+    GLSLC(1,     DataBuffer integral_base;                                    );
+    GLSLC(1,     uint integral_size;                                          );
     GLSLC(1,     uint int_stride;                                             );
+    GLSLC(1,     uint xyoffs_start;                                           );
     GLSLC(0, };                                                               );
     GLSLC(0,                                                                  );
 
@@ -370,7 +343,17 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     };
     RET(ff_vk_pipeline_descriptor_set_add(vkctx, pl, shd, desc_set, 1 + 2*desc->nb_components, 0, 0));
 
-    GLSLD(   ff_source_prefix_sum_comp                                        );
+    desc_set = (FFVulkanDescriptorSetBinding []) {
+        {
+            .name        = "xyoffsets_buffer",
+            .type        = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER,
+            .mem_quali   = "readonly",
+            .stages      = VK_SHADER_STAGE_COMPUTE_BIT,
+            .buf_content = "int xyoffsets[];",
+        },
+    };
+    RET(ff_vk_pipeline_descriptor_set_add(vkctx, pl, shd, desc_set, 1, 1, 0));
+
     GLSLC(0,                                                                  );
     GLSLC(0, void main()                                                      );
     GLSLC(0, {                                                                );
@@ -378,11 +361,24 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
     GLSLC(1,     DataBuffer dst;                                              );
     GLSLC(1,     float s1;                                                    );
     GLSLC(1,     DTYPE s2;                                                    );
+    GLSLC(1,     DTYPE prefix_sum;                                            );
+    GLSLF(1,     DTYPE psum[%i];                                              ,*nb_rows);
     GLSLC(1,     int r;                                                       );
     GLSLC(1,     int x;                                                       );
     GLSLC(1,     int y;                                                       );
     GLSLC(1,     int p;                                                       );
     GLSLC(0,                                                                  );
+    GLSLC(1,     DataBuffer integral_data;                                    );
+    GLSLF(1,     int xoffs[%i];                                               ,TYPE_ELEMS);
+    GLSLF(1,     int yoffs[%i];                                               ,TYPE_ELEMS);
+    GLSLC(0,                                                                  );
+    GLSLC(1,     int invoc_idx = int(gl_WorkGroupID.z);                       );
+    GLSLC(1,     integral_data = DataBuffer(uint64_t(integral_base) + invoc_idx*integral_size);        );
+    for (int i = 0; i < TYPE_ELEMS*2; i += 2) {
+        GLSLF(1, xoffs[%i] = xyoffsets[xyoffs_start + 2*%i*invoc_idx + %i + 0];       ,i/2,TYPE_ELEMS,i);
+        GLSLF(1, yoffs[%i] = xyoffsets[xyoffs_start + 2*%i*invoc_idx + %i + 1];       ,i/2,TYPE_ELEMS,i);
+    }
+    GLSLC(0,                                                                  );
     GLSLC(1,     DTYPE a;                                                     );
     GLSLC(1,     DTYPE b;                                                     );
     GLSLC(1,     DTYPE c;                                                     );
@@ -405,7 +401,7 @@ static av_cold int init_weights_pipeline(FFVulkanContext *vkctx, FFVkExecPool *e
 
     for (int i = 0; i < desc->nb_components; i++) {
         int off = desc->comp[i].offset / (FFALIGN(desc->comp[i].depth, 8)/8);
-        if (width > height) {
+        if (width >= height) {
             insert_horizontal_pass(shd, *nb_rows, 1, desc->comp[i].plane, off);
             insert_vertical_pass(shd, *nb_rows, 0, desc->comp[i].plane, off);
             insert_weights_pass(shd, *nb_rows, 0, t, i, desc->comp[i].plane, off);
@@ -584,6 +580,7 @@ static av_cold int init_filter(AVFilterContext *ctx)
     FFVulkanContext *vkctx = &s->vkctx;
     const int planes = av_pix_fmt_count_planes(s->vkctx.output_format);
     FFVkSPIRVCompiler *spv;
+    int *offsets_buf;
 
     const AVPixFmtDescriptor *desc;
     desc = av_pix_fmt_desc_get(vkctx->output_format);
@@ -634,6 +631,20 @@ static av_cold int init_filter(AVFilterContext *ctx)
         }
     }
 
+    RET(ff_vk_create_buf(&s->vkctx, &s->xyoffsets_buf, 2*s->nb_offsets*sizeof(int32_t), NULL, NULL,
+                         VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT |
+                         VK_BUFFER_USAGE_STORAGE_BUFFER_BIT,
+                         VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
+                         VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT));
+    RET(ff_vk_map_buffer(&s->vkctx, &s->xyoffsets_buf, (uint8_t **)&offsets_buf, 0));
+
+    for (int i = 0; i < 2*s->nb_offsets; i += 2) {
+        offsets_buf[i + 0] = s->xoffsets[i >> 1];
+        offsets_buf[i + 1] = s->yoffsets[i >> 1];
+    }
+
+    RET(ff_vk_unmap_buffer(&s->vkctx, &s->xyoffsets_buf, 1));
+
     s->opts.t = FFMIN(s->opts.t, (FFALIGN(s->nb_offsets, TYPE_ELEMS) / TYPE_ELEMS));
     if (!vkctx->atomic_float_feats.shaderBufferFloat32AtomicAdd) {
         av_log(ctx, AV_LOG_WARNING, "Device doesn't support atomic float adds, "
@@ -641,11 +652,6 @@ static av_cold int init_filter(AVFilterContext *ctx)
         s->opts.t = 1;
     }
 
-    if (!vkctx->feats_12.vulkanMemoryModel) {
-        av_log(ctx, AV_LOG_ERROR, "Device doesn't support the Vulkan memory model!");
-        return AVERROR(EINVAL);;
-    }
-
     spv = ff_vk_spirv_init();
     if (!spv) {
         av_log(ctx, AV_LOG_ERROR, "Unable to initialize SPIR-V compiler!\n");
@@ -663,6 +669,10 @@ static av_cold int init_filter(AVFilterContext *ctx)
     RET(init_denoise_pipeline(vkctx, &s->e, &s->pl_denoise, &s->shd_denoise, s->sampler,
                               spv, desc, planes));
 
+    RET(ff_vk_set_descriptor_buffer(&s->vkctx, &s->pl_weights, NULL, 1, 0, 0,
+                                    s->xyoffsets_buf.address, s->xyoffsets_buf.size,
+                                    VK_FORMAT_UNDEFINED));
+
     av_log(ctx, AV_LOG_VERBOSE, "Filter initialized, %i x/y offsets, %i dispatches, %i parallel\n",
            s->nb_offsets, (FFALIGN(s->nb_offsets, TYPE_ELEMS) / TYPE_ELEMS) + 1, s->opts.t);
 
@@ -736,18 +746,16 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
     int plane_widths[4];
     int plane_heights[4];
 
+    int offsets_dispatched = 0;
+
     /* Integral */
-    AVBufferRef *state_buf;
-    FFVkBuffer *state_vk;
-    AVBufferRef *integral_buf;
+    AVBufferRef *integral_buf = NULL;
     FFVkBuffer *integral_vk;
     uint32_t int_stride;
     size_t int_size;
-    size_t state_size;
-    int t_offset = 0;
 
     /* Weights/sums */
-    AVBufferRef *ws_buf;
+    AVBufferRef *ws_buf = NULL;
     FFVkBuffer *ws_vk;
     VkDeviceAddress weights_addr[4];
     VkDeviceAddress sums_addr[4];
@@ -773,7 +781,6 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
     /* Integral image */
     int_stride = s->pl_weights.wg_size[0]*s->pl_weights_rows;
     int_size = int_stride * int_stride * TYPE_SIZE;
-    state_size = int_stride * 3 *TYPE_SIZE;
 
     /* Plane dimensions */
     for (int i = 0; i < desc->nb_components; i++) {
@@ -798,16 +805,6 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
         return err;
     integral_vk = (FFVkBuffer *)integral_buf->data;
 
-    err = ff_vk_get_pooled_buffer(&s->vkctx, &s->state_buf_pool, &state_buf,
-                                  VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
-                                  VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT,
-                                  NULL,
-                                  s->opts.t * state_size,
-                                  VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
-    if (err < 0)
-        return err;
-    state_vk = (FFVkBuffer *)state_buf->data;
-
     err = ff_vk_get_pooled_buffer(&s->vkctx, &s->ws_buf_pool, &ws_buf,
                                   VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
                                   VK_BUFFER_USAGE_TRANSFER_DST_BIT |
@@ -844,9 +841,12 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
     RET(ff_vk_exec_add_dep_frame(vkctx, exec, out,
                                  VK_PIPELINE_STAGE_2_ALL_COMMANDS_BIT,
                                  VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT));
+
     RET(ff_vk_exec_add_dep_buf(vkctx, exec, &integral_buf, 1, 0));
-    RET(ff_vk_exec_add_dep_buf(vkctx, exec, &state_buf,    1, 0));
+    integral_buf = NULL;
+
     RET(ff_vk_exec_add_dep_buf(vkctx, exec, &ws_buf,       1, 0));
+    ws_buf = NULL;
 
     /* Input frame prep */
     RET(ff_vk_create_imageviews(vkctx, exec, in_views, in));
@@ -869,6 +869,7 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
                         VK_IMAGE_LAYOUT_GENERAL,
                         VK_QUEUE_FAMILY_IGNORED);
 
+    nb_buf_bar = 0;
     buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) {
         .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2,
         .srcStageMask = ws_vk->stage,
@@ -881,6 +882,19 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
         .size = ws_vk->size,
         .offset = 0,
     };
+    buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) {
+        .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2,
+        .srcStageMask = integral_vk->stage,
+        .dstStageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
+        .srcAccessMask = integral_vk->access,
+        .dstAccessMask = VK_ACCESS_2_SHADER_STORAGE_READ_BIT |
+                         VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT,
+        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
+        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
+        .buffer = integral_vk->buf,
+        .size = integral_vk->size,
+        .offset = 0,
+    };
 
     vk->CmdPipelineBarrier2(exec->buf, &(VkDependencyInfo) {
             .sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
@@ -891,10 +905,13 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
         });
     ws_vk->stage = buf_bar[0].dstStageMask;
     ws_vk->access = buf_bar[0].dstAccessMask;
+    integral_vk->stage = buf_bar[1].dstStageMask;
+    integral_vk->access = buf_bar[1].dstAccessMask;
 
-    /* Weights/sums buffer zeroing */
+    /* Buffer zeroing */
     vk->CmdFillBuffer(exec->buf, ws_vk->buf, 0, ws_vk->size, 0x0);
 
+    nb_buf_bar = 0;
     buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) {
         .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2,
         .srcStageMask = ws_vk->stage,
@@ -948,29 +965,22 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
     /* Weights pipeline */
     ff_vk_exec_bind_pipeline(vkctx, exec, &s->pl_weights);
 
-    for (int i = 0; i < s->nb_offsets; i += TYPE_ELEMS) {
-        int *xoffs = s->xoffsets + i;
-        int *yoffs = s->yoffsets + i;
+    do {
+        int wg_invoc;
         HorizontalPushData pd = {
-            integral_vk->address + t_offset*int_size,
-            state_vk->address + t_offset*state_size,
-            { 0 },
-            { 0 },
             { plane_widths[0], plane_widths[1], plane_widths[2], plane_widths[3] },
             { plane_heights[0], plane_heights[1], plane_heights[2], plane_heights[3] },
             { ws_stride[0], ws_stride[1], ws_stride[2], ws_stride[3] },
             { s->patch[0], s->patch[1], s->patch[2], s->patch[3] },
             { s->strength[0], s->strength[1], s->strength[2], s->strength[2], },
+            integral_vk->address,
+            int_size,
             int_stride,
+            offsets_dispatched * 2,
         };
 
-        memcpy(pd.xoffs, xoffs, sizeof(pd.xoffs));
-        memcpy(pd.yoffs, yoffs, sizeof(pd.yoffs));
-
-        /* Put a barrier once we run out of parallelism buffers */
-        if (!t_offset) {
+        if (offsets_dispatched) {
             nb_buf_bar = 0;
-            /* Buffer prep/sync */
             buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) {
                 .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2,
                 .srcStageMask = integral_vk->stage,
@@ -984,39 +994,27 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
                 .size = integral_vk->size,
                 .offset = 0,
             };
-            buf_bar[nb_buf_bar++] = (VkBufferMemoryBarrier2) {
-                .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2,
-                .srcStageMask = state_vk->stage,
-                .dstStageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
-                .srcAccessMask = state_vk->access,
-                .dstAccessMask = VK_ACCESS_2_SHADER_STORAGE_READ_BIT |
-                                 VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT,
-                .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
-                .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
-                .buffer = state_vk->buf,
-                .size = state_vk->size,
-                .offset = 0,
-            };
 
             vk->CmdPipelineBarrier2(exec->buf, &(VkDependencyInfo) {
                     .sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
                     .pBufferMemoryBarriers = buf_bar,
                     .bufferMemoryBarrierCount = nb_buf_bar,
                 });
-            integral_vk->stage = buf_bar[0].dstStageMask;
-            integral_vk->access = buf_bar[0].dstAccessMask;
-            state_vk->stage = buf_bar[1].dstStageMask;
-            state_vk->access = buf_bar[1].dstAccessMask;
+            integral_vk->stage = buf_bar[0].dstStageMask;
+            integral_vk->access = buf_bar[0].dstAccessMask;
         }
-        t_offset = (t_offset + 1) % s->opts.t;
 
         /* Push data */
         ff_vk_update_push_exec(vkctx, exec, &s->pl_weights, VK_SHADER_STAGE_COMPUTE_BIT,
                                0, sizeof(pd), &pd);
 
+        wg_invoc = FFMIN((s->nb_offsets - offsets_dispatched)/TYPE_ELEMS, s->opts.t);
+
         /* End of horizontal pass */
-        vk->CmdDispatch(exec->buf, 1, 1, 1);
-    }
+        vk->CmdDispatch(exec->buf, 1, 1, wg_invoc);
+
+        offsets_dispatched += wg_invoc * TYPE_ELEMS;
+    } while (offsets_dispatched < s->nb_offsets);
 
     RET(denoise_pass(s, exec, ws_vk, ws_stride));
 
@@ -1033,6 +1031,8 @@ static int nlmeans_vulkan_filter_frame(AVFilterLink *link, AVFrame *in)
     return ff_filter_frame(outlink, out);
 
 fail:
+    av_buffer_unref(&integral_buf);
+    av_buffer_unref(&ws_buf);
     av_frame_free(&in);
     av_frame_free(&out);
     return err;
@@ -1051,7 +1051,6 @@ static void nlmeans_vulkan_uninit(AVFilterContext *avctx)
     ff_vk_shader_free(vkctx, &s->shd_denoise);
 
     av_buffer_pool_uninit(&s->integral_buf_pool);
-    av_buffer_pool_uninit(&s->state_buf_pool);
     av_buffer_pool_uninit(&s->ws_buf_pool);
 
     if (s->sampler)
diff --git a/libavfilter/vulkan/prefix_sum.comp b/libavfilter/vulkan/prefix_sum.comp
deleted file mode 100644
index 9147cd82fb..0000000000
--- a/libavfilter/vulkan/prefix_sum.comp
+++ /dev/null
@@ -1,151 +0,0 @@
-#extension GL_EXT_buffer_reference : require
-#extension GL_EXT_buffer_reference2 : require
-
-#define ACQUIRE gl_StorageSemanticsBuffer, gl_SemanticsAcquire
-#define RELEASE gl_StorageSemanticsBuffer, gl_SemanticsRelease
-
-// These correspond to X, A, P respectively in the prefix sum paper.
-#define FLAG_NOT_READY       0u
-#define FLAG_AGGREGATE_READY 1u
-#define FLAG_PREFIX_READY    2u
-
-layout(buffer_reference, buffer_reference_align = T_ALIGN) nonprivate buffer StateData {
-    DTYPE aggregate;
-    DTYPE prefix;
-    uint flag;
-};
-
-shared DTYPE sh_scratch[WG_SIZE];
-shared DTYPE sh_prefix;
-shared uint  sh_part_ix;
-shared uint  sh_flag;
-
-void prefix_sum(DataBuffer dst, uint dst_stride, DataBuffer src, uint src_stride)
-{
-    DTYPE local[N_ROWS];
-    // Determine partition to process by atomic counter (described in Section 4.4 of prefix sum paper).
-    if (gl_GlobalInvocationID.x == 0)
-          sh_part_ix = gl_WorkGroupID.x;
-//        sh_part_ix = atomicAdd(part_counter, 1);
-
-    barrier();
-    uint part_ix = sh_part_ix;
-
-    uint ix = part_ix * PARTITION_SIZE + gl_LocalInvocationID.x * N_ROWS;
-
-    // TODO: gate buffer read? (evaluate whether shader check or CPU-side padding is better)
-    local[0] = src.v[ix*src_stride];
-    for (uint i = 1; i < N_ROWS; i++)
-        local[i] = local[i - 1] + src.v[(ix + i)*src_stride];
-
-    DTYPE agg = local[N_ROWS - 1];
-    sh_scratch[gl_LocalInvocationID.x] = agg;
-    for (uint i = 0; i < LG_WG_SIZE; i++) {
-        barrier();
-        if (gl_LocalInvocationID.x >= (1u << i))
-            agg += sh_scratch[gl_LocalInvocationID.x - (1u << i)];
-        barrier();
-
-        sh_scratch[gl_LocalInvocationID.x] = agg;
-    }
-
-    // Publish aggregate for this partition
-    if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-        state[part_ix].aggregate = agg;
-        if (part_ix == 0)
-            state[0].prefix = agg;
-    }
-
-    // Write flag with release semantics
-    if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-        uint flag = part_ix == 0 ? FLAG_PREFIX_READY : FLAG_AGGREGATE_READY;
-        atomicStore(state[part_ix].flag, flag, gl_ScopeDevice, RELEASE);
-    }
-
-    DTYPE exclusive = DTYPE(0);
-    if (part_ix != 0) {
-        // step 4 of paper: decoupled lookback
-        uint look_back_ix = part_ix - 1;
-
-        DTYPE their_agg;
-        uint their_ix = 0;
-        while (true) {
-            // Read flag with acquire semantics.
-            if (gl_LocalInvocationID.x == WG_SIZE - 1)
-                sh_flag = atomicLoad(state[look_back_ix].flag, gl_ScopeDevice, ACQUIRE);
-
-            // The flag load is done only in the last thread. However, because the
-            // translation of memoryBarrierBuffer to Metal requires uniform control
-            // flow, we broadcast it to all threads.
-            barrier();
-
-            uint flag = sh_flag;
-            barrier();
-
-            if (flag == FLAG_PREFIX_READY) {
-                if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-                    DTYPE their_prefix = state[look_back_ix].prefix;
-                    exclusive = their_prefix + exclusive;
-                }
-                break;
-            } else if (flag == FLAG_AGGREGATE_READY) {
-                if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-                    their_agg = state[look_back_ix].aggregate;
-                    exclusive = their_agg + exclusive;
-                }
-                look_back_ix--;
-                their_ix = 0;
-                continue;
-            } // else spins
-
-            if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-                // Unfortunately there's no guarantee of forward progress of other
-                // workgroups, so compute a bit of the aggregate before trying again.
-                // In the worst case, spinning stops when the aggregate is complete.
-                DTYPE m = src.v[(look_back_ix * PARTITION_SIZE + their_ix)*src_stride];
-                if (their_ix == 0)
-                    their_agg = m;
-                else
-                    their_agg += m;
-
-                their_ix++;
-                if (their_ix == PARTITION_SIZE) {
-                    exclusive = their_agg + exclusive;
-                    if (look_back_ix == 0) {
-                        sh_flag = FLAG_PREFIX_READY;
-                    } else {
-                        look_back_ix--;
-                        their_ix = 0;
-                    }
-                }
-            }
-            barrier();
-            flag = sh_flag;
-            barrier();
-            if (flag == FLAG_PREFIX_READY)
-                break;
-        }
-
-        // step 5 of paper: compute inclusive prefix
-        if (gl_LocalInvocationID.x == WG_SIZE - 1) {
-            DTYPE inclusive_prefix = exclusive + agg;
-            sh_prefix = exclusive;
-            state[part_ix].prefix = inclusive_prefix;
-        }
-
-        if (gl_LocalInvocationID.x == WG_SIZE - 1)
-            atomicStore(state[part_ix].flag, FLAG_PREFIX_READY, gl_ScopeDevice, RELEASE);
-    }
-
-    barrier();
-    if (part_ix != 0)
-        exclusive = sh_prefix;
-
-    DTYPE row = exclusive;
-    if (gl_LocalInvocationID.x > 0)
-        row += sh_scratch[gl_LocalInvocationID.x - 1];
-
-    // note - may overwrite
-    for (uint i = 0; i < N_ROWS; i++)
-        dst.v[(ix + i)*dst_stride] = row + local[i];
-}
-- 
2.42.0
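
For reference while reviewing: the new insert_horizontal_pass() and
insert_vertical_pass() replace the decoupled-lookback prefix sum with a
plain per-invocation running sum (dst.v[x] = s2 + prefix_sum; prefix_sum += s2).
Below is a minimal CPU-side sketch of the same two-pass inclusive scan, which
can be handy for checking shader output against a known-good result. It is
not part of the patch: the function name integral_two_pass, the use of plain
float instead of DTYPE, and the element (not byte) stride are assumptions
made purely for illustration.

```c
#include <stddef.h>

/* Hypothetical reference implementation, not taken from the patch.
 * Turns `data` (height rows of `stride` floats, `width` of them valid)
 * into its integral image, in place, using two sequential passes. */
static void integral_two_pass(float *data, int width, int height, size_t stride)
{
    /* Horizontal pass: inclusive running sum along each row,
     * mirroring "dst.v[x] = s2 + prefix_sum; prefix_sum += s2;" */
    for (int y = 0; y < height; y++) {
        float *row = data + (size_t)y * stride;
        float prefix_sum = 0.0f;
        for (int x = 0; x < width; x++) {
            float s2 = row[x];
            row[x] = s2 + prefix_sum;
            prefix_sum += s2;
        }
    }

    /* Vertical pass: inclusive running sum down each column,
     * mirroring "dst.v[x + r] = s2 + psum[r]; psum[r] += s2;" */
    for (int x = 0; x < width; x++) {
        float psum = 0.0f;
        for (int y = 0; y < height; y++) {
            float *p = data + (size_t)y * stride + x;
            float s2 = *p;
            *p = s2 + psum;
            psum += s2;
        }
    }
}
```

In the shader each invocation owns nb_rows rows (or columns), so the two
loops above are what a single invocation does for its slice. The barrier()
between the passes is what orders the horizontal writes before the vertical
reads.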