From patchwork Sun Mar 21 23:22:24 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lucas Clemente Vella <lvella@gmail.com>
X-Patchwork-Id: 26526
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
X-Original-To: patchwork@ffaux-bg.ffmpeg.org
Delivered-To: patchwork@ffaux-bg.ffmpeg.org
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by ffaux.localdomain (Postfix) with ESMTP id D2F7544B91E
	for <patchwork@ffaux-bg.ffmpeg.org>; Mon, 22 Mar 2021 01:53:51 +0200 (EET)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id A7F6068A9E9;
	Mon, 22 Mar 2021 01:53:51 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail-wr1-f52.google.com (mail-wr1-f52.google.com
 [209.85.221.52])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id AD58768A571
 for <ffmpeg-devel@ffmpeg.org>; Mon, 22 Mar 2021 01:53:45 +0200 (EET)
Received: by mail-wr1-f52.google.com with SMTP id j7so14860823wrd.1
 for <ffmpeg-devel@ffmpeg.org>; Sun, 21 Mar 2021 16:53:45 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=from:to:cc:subject:date:message-id:mime-version
 :content-transfer-encoding;
 bh=ltHyS4aCKRi48zLYGDSry8NZUftpbTkjKGs9IPNBxRE=;
 b=AyDvHRFVMVgLMZU4kP0ljhoVZZBgKndaHC8Wa6OfR2jwEBlSwukyvntxWpU5UDewTY
 FPU5kTwxDrLl7wVpge5W4hjJyDWMny9oYgxMoD/Cfh7wqI/6jBwilD6cvEp8kPPDIsxZ
 7Xt2aMI/CJNMRrlj0PPFFBvinkzMea9e9Avg2Pc00frzJN5QWeda9kokkm0IICs7Y24C
 qgZClBAwSUmnfxgXgeT8OEx0pRM2/xe47qSBW95o5XMN0mboXh0wAL4wy0LBgZlmirDo
 ko3/6RpwAiXxQzdHb/7+7RPlwa4qe/vDEnTHQoacHpqkAJTKbX6rCf/2cWnv6au1zIEH
 ZzSA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version
 :content-transfer-encoding;
 bh=ltHyS4aCKRi48zLYGDSry8NZUftpbTkjKGs9IPNBxRE=;
 b=eujum7Frb5hMIXlv0MfdpUy3hdW3MelvwGpfWrt+g5EFfJrtLJx7GQKIEce12mUFz9
 ZGI8AW3iQ5sUdOiXMZDMay2Y97Cg62JczXmJnkrBUmHzqJRdq80JubgYsGcQkFmKDETB
 iasRTQ+k/OospJstzkNbNbbs7hy54v++GYSmEJLGoyGmdBwg+oExbPPbD6tD2N6Ivw6b
 nsUmoankzQinh9NFOTE2e4n0vmVP9S46IM8yFLmhfKAWCKkWY/PB5iNWqq3dTTHjnd8b
 5ZZCBwPtA7PSYkmZEcDzXffpDC1LbrtSgZO5zB8slkP8ZTlxEiuwuIHPsIvZd2Wd8eGv
 05/A==
X-Gm-Message-State: AOAM533SlBmPlRzuHu6kVzSXHX+wNQ7WNqXbf1421CEytPzkUhabBVa+
 9IgWTOFvTr64MFdj6MD6Fw+S4DNpJYA=
X-Google-Smtp-Source: 
 ABdhPJwlRmh+Dlcjryy3btMBRxtIhEE788QGBcPJcN4J25rSRKmFWVSPBEqod2hY4gbuQHaW8rtzNA==
X-Received: by 2002:adf:9261:: with SMTP id 88mr14840593wrj.270.1616368999889;
 Sun, 21 Mar 2021 16:23:19 -0700 (PDT)
Received: from localhost.localdomain
 ([2001:8a0:6393:9a00:9b0f:6669:c578:bc5e])
 by smtp.gmail.com with ESMTPSA id b17sm17352326wrt.17.2021.03.21.16.23.19
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Sun, 21 Mar 2021 16:23:19 -0700 (PDT)
From: Lucas Clemente Vella <lvella@gmail.com>
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 21 Mar 2021 23:22:24 +0000
Message-Id: <20210321232223.210494-1-lvella@gmail.com>
X-Mailer: git-send-email 2.27.0
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH] avfilter/vf_nlmeans_opencl: making filter
	independent of bit depth
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Lucas Clemente Vella <lvella@gmail.com>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

This filter originally quantized OpenCL float images fetchs in 256 levels,
and computed the integral image of squared differences in 32 bit integers.

This had two consequences:
1) it could overflow if the image resolution was big enough (I got overflows
in a 4K video);
2) it dropped precision from bit depths higher than 8 bits.

Now the integral image is computed with float values in range [0, 1], instead
of integers in range [0, 255] (then squared), so there is no longer the risk
of overflow.

And even with the accumulated floating point error over the integral image,
the resulting difference between this float implementation and an experimental
uint64 implementation with 65535 quantization levels is less than 0.08%
on the worst difference (per component), and less than 0.002% on average. For
reference, the smallest variation possible on a 10-bit quantization is 0.098%
of the total intensity. This was tested on a 4K frame from an 10-bit source.

Signed-off-by: Lucas Clemente Vella <lvella@gmail.com>
---
 libavfilter/opencl/nlmeans.cl   | 31 ++++++++++++++----------------
 libavfilter/vf_nlmeans_opencl.c | 34 +++++----------------------------
 2 files changed, 19 insertions(+), 46 deletions(-)

diff --git a/libavfilter/opencl/nlmeans.cl b/libavfilter/opencl/nlmeans.cl
index 72bd681fd6..6d78a41e46 100644
--- a/libavfilter/opencl/nlmeans.cl
+++ b/libavfilter/opencl/nlmeans.cl
@@ -20,7 +20,7 @@ const sampler_t sampler = (CLK_NORMALIZED_COORDS_FALSE |
                            CLK_ADDRESS_CLAMP_TO_EDGE   |
                            CLK_FILTER_NEAREST);
 
-kernel void horiz_sum(__global uint4 *integral_img,
+kernel void horiz_sum(__global float4 *integral_img,
                       __read_only image2d_t src,
                       int width,
                       int height,
@@ -31,7 +31,7 @@ kernel void horiz_sum(__global uint4 *integral_img,
     int y = get_global_id(0);
     int work_size = get_global_size(0);
 
-    uint4 sum = (uint4)(0);
+    float4 sum = 0.0;
     float4 s2;
     for (int i = 0; i < width; i++) {
         float s1 = read_imagef(src, sampler, (int2)(i, y)).x;
@@ -39,28 +39,25 @@ kernel void horiz_sum(__global uint4 *integral_img,
         s2.y = read_imagef(src, sampler, (int2)(i + dx.y, y + dy.y)).x;
         s2.z = read_imagef(src, sampler, (int2)(i + dx.z, y + dy.z)).x;
         s2.w = read_imagef(src, sampler, (int2)(i + dx.w, y + dy.w)).x;
-        sum += convert_uint4((s1 - s2) * (s1 - s2) * 255 * 255);
+        sum += (s1 - s2) * (s1 - s2);
         integral_img[y * width + i] = sum;
     }
 }
 
-kernel void vert_sum(__global uint4 *integral_img,
-                     __global int *overflow,
+kernel void vert_sum(__global float4 *integral_img,
                      int width,
                      int height)
 {
     int x = get_global_id(0);
-    uint4 sum = 0;
+    float4 sum = 0;
     for (int i = 0; i < height; i++) {
-        if (any((uint4)UINT_MAX - integral_img[i * width + x] < sum))
-            atomic_inc(overflow);
         integral_img[i * width + x] += sum;
         sum = integral_img[i * width + x];
     }
 }
 
 kernel void weight_accum(global float *sum, global float *weight,
-                         global uint4 *integral_img, __read_only image2d_t src,
+                         global float4 *integral_img, __read_only image2d_t src,
                          int width, int height, int p, float h,
                          int4 dx, int4 dy)
 {
@@ -75,16 +72,16 @@ kernel void weight_accum(global float *sum, global float *weight,
     int y = get_global_id(1);
     int4 xoff = x + dx;
     int4 yoff = y + dy;
-    uint4 a = 0, b = 0, c = 0, d = 0;
-    uint4 src_pix = 0;
+    float4 a = 0, b = 0, c = 0, d = 0;
+    float4 src_pix = 0;
 
     // out-of-bounding-box?
     int oobb = (x - p) < 0 || (y - p) < 0 || (y + p) >= height || (x + p) >= width;
 
-    src_pix.x = (int)(255 * read_imagef(src, sampler, (int2)(xoff.x, yoff.x)).x);
-    src_pix.y = (int)(255 * read_imagef(src, sampler, (int2)(xoff.y, yoff.y)).x);
-    src_pix.z = (int)(255 * read_imagef(src, sampler, (int2)(xoff.z, yoff.z)).x);
-    src_pix.w = (int)(255 * read_imagef(src, sampler, (int2)(xoff.w, yoff.w)).x);
+    src_pix.x = read_imagef(src, sampler, (int2)(xoff.x, yoff.x)).x;
+    src_pix.y = read_imagef(src, sampler, (int2)(xoff.y, yoff.y)).x;
+    src_pix.z = read_imagef(src, sampler, (int2)(xoff.z, yoff.z)).x;
+    src_pix.w = read_imagef(src, sampler, (int2)(xoff.w, yoff.w)).x;
     if (!oobb) {
         a = integral_img[(y - p) * width + x - p];
         b = integral_img[(y + p) * width + x - p];
@@ -93,7 +90,7 @@ kernel void weight_accum(global float *sum, global float *weight,
     }
 
     float4 patch_diff = convert_float4(d + a - c - b);
-    float4 w = native_exp(-patch_diff / (h * h));
+    float4 w = native_exp(-patch_diff * (float4)(255.0f * 255.0f) / (h * h));
     float w_sum = w.x + w.y + w.z + w.w;
     weight[y * width + x] += w_sum;
     sum[y * width + x] += dot(w, convert_float4(src_pix));
@@ -109,7 +106,7 @@ kernel void average(__write_only image2d_t dst,
     float w = weight[y * dim.x + x];
     float s = sum[y * dim.x + x];
     float src_pix = read_imagef(src, sampler, (int2)(x, y)).x;
-    float r = (s + src_pix * 255) / (1.0f + w) / 255.0f;
+    float r = (s + src_pix) / (1.0f + w);
     if (x < dim.x && y < dim.y)
         write_imagef(dst, (int2)(x, y), (float4)(r, 0.0f, 0.0f, 1.0f));
 }
diff --git a/libavfilter/vf_nlmeans_opencl.c b/libavfilter/vf_nlmeans_opencl.c
index e57b5e0873..0b69f3b6c4 100644
--- a/libavfilter/vf_nlmeans_opencl.c
+++ b/libavfilter/vf_nlmeans_opencl.c
@@ -30,11 +30,9 @@
 #include "opencl_source.h"
 #include "video.h"
 
-// TODO:
-//      the integral image may overflow 32bit, consider using 64bit
-
 static const enum AVPixelFormat supported_formats[] = {
     AV_PIX_FMT_YUV420P,
+    AV_PIX_FMT_YUV420P16LE,
     AV_PIX_FMT_YUV444P,
     AV_PIX_FMT_GBRP,
 };
@@ -59,7 +57,6 @@ typedef struct NLMeansOpenCLContext {
     cl_mem                integral_img;
     cl_mem                weight;
     cl_mem                sum;
-    cl_mem                overflow; // overflow in integral image?
     double                sigma;
     float                 h;
     int                   chroma_w;
@@ -129,7 +126,7 @@ static int nlmeans_opencl_init(AVFilterContext *avctx, int width, int height)
                      "average kernel %d.\n", cle);
 
     ctx->integral_img = clCreateBuffer(ctx->ocf.hwctx->context, 0,
-                                       4 * width * height * sizeof(cl_int),
+                                       4 * width * height * sizeof(cl_float),
                                        NULL, &cle);
     CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to create "
                      "integral image %d.\n", cle);
@@ -144,11 +141,6 @@ static int nlmeans_opencl_init(AVFilterContext *avctx, int width, int height)
     CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to create "
                      "sum buffer %d.\n", cle);
 
-    ctx->overflow = clCreateBuffer(ctx->ocf.hwctx->context, 0,
-                                   sizeof(cl_int), NULL, &cle);
-    CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to create "
-                     "overflow buffer %d.\n", cle);
-
     ctx->initialised = 1;
     return 0;
 
@@ -161,7 +153,6 @@ fail:
     CL_RELEASE_MEMORY(ctx->integral_img);
     CL_RELEASE_MEMORY(ctx->weight);
     CL_RELEASE_MEMORY(ctx->sum);
-    CL_RELEASE_MEMORY(ctx->overflow);
 
     CL_RELEASE_QUEUE(ctx->command_queue);
     return err;
@@ -239,9 +230,8 @@ static int nlmeans_plane(AVFilterContext *avctx, cl_mem dst, cl_mem src,
         // vertical pass
         // integral(x, y) = sum(integral(x, v)) for v in [0, y]
         CL_SET_KERNEL_ARG(ctx->vert_kernel, 0, cl_mem, &ctx->integral_img);
-        CL_SET_KERNEL_ARG(ctx->vert_kernel, 1, cl_mem, &ctx->overflow);
-        CL_SET_KERNEL_ARG(ctx->vert_kernel, 2, cl_int, &width);
-        CL_SET_KERNEL_ARG(ctx->vert_kernel, 3, cl_int, &height);
+        CL_SET_KERNEL_ARG(ctx->vert_kernel, 1, cl_int, &width);
+        CL_SET_KERNEL_ARG(ctx->vert_kernel, 2, cl_int, &height);
         cle = clEnqueueNDRangeKernel(ctx->command_queue, ctx->vert_kernel,
                                      1, NULL, worksize2, NULL, 0, NULL, NULL);
         CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to enqueue vert_kernel: %d.\n",
@@ -293,8 +283,7 @@ static int nlmeans_opencl_filter_frame(AVFilterLink *inlink, AVFrame *input)
     const AVPixFmtDescriptor *desc;
     enum AVPixelFormat in_format;
     cl_mem src, dst;
-    const cl_int zero = 0;
-    int w, h, err, cle, overflow, p, patch, research;
+    int w, h, err, cle, p, patch, research;
 
     av_log(ctx, AV_LOG_DEBUG, "Filter input: %s, %ux%u (%"PRId64").\n",
            av_get_pix_fmt_name(input->format),
@@ -331,11 +320,6 @@ static int nlmeans_opencl_filter_frame(AVFilterLink *inlink, AVFrame *input)
             goto fail;
     }
 
-    cle = clEnqueueWriteBuffer(ctx->command_queue, ctx->overflow, CL_FALSE,
-                               0, sizeof(cl_int), &zero, 0, NULL, NULL);
-    CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to initialize overflow"
-                     "detection buffer %d.\n", cle);
-
     for (p = 0; p < FF_ARRAY_ELEMS(output->data); p++) {
         src = (cl_mem) input->data[p];
         dst = (cl_mem) output->data[p];
@@ -351,17 +335,10 @@ static int nlmeans_opencl_filter_frame(AVFilterLink *inlink, AVFrame *input)
         if (err < 0)
             goto fail;
     }
-    // overflow occurred?
-    cle = clEnqueueReadBuffer(ctx->command_queue, ctx->overflow, CL_FALSE,
-                              0, sizeof(cl_int), &overflow, 0, NULL, NULL);
-    CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to read overflow: %d.\n", cle);
 
     cle = clFinish(ctx->command_queue);
     CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to finish kernel: %d.\n", cle);
 
-    if (overflow > 0)
-        av_log(avctx, AV_LOG_ERROR, "integral image overflow %d\n", overflow);
-
     av_frame_free(&input);
 
     av_log(ctx, AV_LOG_DEBUG, "Filter output: %s, %ux%u (%"PRId64").\n",
@@ -390,7 +367,6 @@ static av_cold void nlmeans_opencl_uninit(AVFilterContext *avctx)
     CL_RELEASE_MEMORY(ctx->integral_img);
     CL_RELEASE_MEMORY(ctx->weight);
     CL_RELEASE_MEMORY(ctx->sum);
-    CL_RELEASE_MEMORY(ctx->overflow);
 
     CL_RELEASE_QUEUE(ctx->command_queue);