From patchwork Sun Mar 21 23:22:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lucas Clemente Vella X-Patchwork-Id: 26526 Return-Path: X-Original-To: patchwork@ffaux-bg.ffmpeg.org Delivered-To: patchwork@ffaux-bg.ffmpeg.org Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by ffaux.localdomain (Postfix) with ESMTP id D2F7544B91E for ; Mon, 22 Mar 2021 01:53:51 +0200 (EET) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id A7F6068A9E9; Mon, 22 Mar 2021 01:53:51 +0200 (EET) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from mail-wr1-f52.google.com (mail-wr1-f52.google.com [209.85.221.52]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id AD58768A571 for ; Mon, 22 Mar 2021 01:53:45 +0200 (EET) Received: by mail-wr1-f52.google.com with SMTP id j7so14860823wrd.1 for ; Sun, 21 Mar 2021 16:53:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=ltHyS4aCKRi48zLYGDSry8NZUftpbTkjKGs9IPNBxRE=; b=AyDvHRFVMVgLMZU4kP0ljhoVZZBgKndaHC8Wa6OfR2jwEBlSwukyvntxWpU5UDewTY FPU5kTwxDrLl7wVpge5W4hjJyDWMny9oYgxMoD/Cfh7wqI/6jBwilD6cvEp8kPPDIsxZ 7Xt2aMI/CJNMRrlj0PPFFBvinkzMea9e9Avg2Pc00frzJN5QWeda9kokkm0IICs7Y24C qgZClBAwSUmnfxgXgeT8OEx0pRM2/xe47qSBW95o5XMN0mboXh0wAL4wy0LBgZlmirDo ko3/6RpwAiXxQzdHb/7+7RPlwa4qe/vDEnTHQoacHpqkAJTKbX6rCf/2cWnv6au1zIEH ZzSA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=ltHyS4aCKRi48zLYGDSry8NZUftpbTkjKGs9IPNBxRE=; b=eujum7Frb5hMIXlv0MfdpUy3hdW3MelvwGpfWrt+g5EFfJrtLJx7GQKIEce12mUFz9 ZGI8AW3iQ5sUdOiXMZDMay2Y97Cg62JczXmJnkrBUmHzqJRdq80JubgYsGcQkFmKDETB iasRTQ+k/OospJstzkNbNbbs7hy54v++GYSmEJLGoyGmdBwg+oExbPPbD6tD2N6Ivw6b nsUmoankzQinh9NFOTE2e4n0vmVP9S46IM8yFLmhfKAWCKkWY/PB5iNWqq3dTTHjnd8b 5ZZCBwPtA7PSYkmZEcDzXffpDC1LbrtSgZO5zB8slkP8ZTlxEiuwuIHPsIvZd2Wd8eGv 05/A== X-Gm-Message-State: AOAM533SlBmPlRzuHu6kVzSXHX+wNQ7WNqXbf1421CEytPzkUhabBVa+ 9IgWTOFvTr64MFdj6MD6Fw+S4DNpJYA= X-Google-Smtp-Source: ABdhPJwlRmh+Dlcjryy3btMBRxtIhEE788QGBcPJcN4J25rSRKmFWVSPBEqod2hY4gbuQHaW8rtzNA== X-Received: by 2002:adf:9261:: with SMTP id 88mr14840593wrj.270.1616368999889; Sun, 21 Mar 2021 16:23:19 -0700 (PDT) Received: from localhost.localdomain ([2001:8a0:6393:9a00:9b0f:6669:c578:bc5e]) by smtp.gmail.com with ESMTPSA id b17sm17352326wrt.17.2021.03.21.16.23.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 21 Mar 2021 16:23:19 -0700 (PDT) From: Lucas Clemente Vella To: ffmpeg-devel@ffmpeg.org Date: Sun, 21 Mar 2021 23:22:24 +0000 Message-Id: <20210321232223.210494-1-lvella@gmail.com> X-Mailer: git-send-email 2.27.0 MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH] avfilter/vf_nlmeans_opencl: making filter independent of bit depth X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Lucas Clemente Vella Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" This filter originally quantized OpenCL float images fetchs in 256 levels, and computed the integral image of squared differences in 32 bit integers. This had two consequences: 1) it could overflow if the image resolution was big enough (I got overflows in a 4K video); 2) it dropped precision from bit depths higher than 8 bits. Now the integral image is computed with float values in range [0, 1], instead of integers in range [0, 255] (then squared), so there is no longer the risk of overflow. And even with the accumulated floating point error over the integral image, the resulting difference between this float implementation and an experimental uint64 implementation with 65535 quantization levels is less than 0.08% on the worst difference (per component), and less than 0.002% on average. For reference, the smallest variation possible on a 10-bit quantization is 0.098% of the total intensity. This was tested on a 4K frame from an 10-bit source. Signed-off-by: Lucas Clemente Vella --- libavfilter/opencl/nlmeans.cl | 31 ++++++++++++++---------------- libavfilter/vf_nlmeans_opencl.c | 34 +++++---------------------------- 2 files changed, 19 insertions(+), 46 deletions(-) diff --git a/libavfilter/opencl/nlmeans.cl b/libavfilter/opencl/nlmeans.cl index 72bd681fd6..6d78a41e46 100644 --- a/libavfilter/opencl/nlmeans.cl +++ b/libavfilter/opencl/nlmeans.cl @@ -20,7 +20,7 @@ const sampler_t sampler = (CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST); -kernel void horiz_sum(__global uint4 *integral_img, +kernel void horiz_sum(__global float4 *integral_img, __read_only image2d_t src, int width, int height, @@ -31,7 +31,7 @@ kernel void horiz_sum(__global uint4 *integral_img, int y = get_global_id(0); int work_size = get_global_size(0); - uint4 sum = (uint4)(0); + float4 sum = 0.0; float4 s2; for (int i = 0; i < width; i++) { float s1 = read_imagef(src, sampler, (int2)(i, y)).x; @@ -39,28 +39,25 @@ kernel void horiz_sum(__global uint4 *integral_img, s2.y = read_imagef(src, sampler, (int2)(i + dx.y, y + dy.y)).x; s2.z = read_imagef(src, sampler, (int2)(i + dx.z, y + dy.z)).x; s2.w = read_imagef(src, sampler, (int2)(i + dx.w, y + dy.w)).x; - sum += convert_uint4((s1 - s2) * (s1 - s2) * 255 * 255); + sum += (s1 - s2) * (s1 - s2); integral_img[y * width + i] = sum; } } -kernel void vert_sum(__global uint4 *integral_img, - __global int *overflow, +kernel void vert_sum(__global float4 *integral_img, int width, int height) { int x = get_global_id(0); - uint4 sum = 0; + float4 sum = 0; for (int i = 0; i < height; i++) { - if (any((uint4)UINT_MAX - integral_img[i * width + x] < sum)) - atomic_inc(overflow); integral_img[i * width + x] += sum; sum = integral_img[i * width + x]; } } kernel void weight_accum(global float *sum, global float *weight, - global uint4 *integral_img, __read_only image2d_t src, + global float4 *integral_img, __read_only image2d_t src, int width, int height, int p, float h, int4 dx, int4 dy) { @@ -75,16 +72,16 @@ kernel void weight_accum(global float *sum, global float *weight, int y = get_global_id(1); int4 xoff = x + dx; int4 yoff = y + dy; - uint4 a = 0, b = 0, c = 0, d = 0; - uint4 src_pix = 0; + float4 a = 0, b = 0, c = 0, d = 0; + float4 src_pix = 0; // out-of-bounding-box? int oobb = (x - p) < 0 || (y - p) < 0 || (y + p) >= height || (x + p) >= width; - src_pix.x = (int)(255 * read_imagef(src, sampler, (int2)(xoff.x, yoff.x)).x); - src_pix.y = (int)(255 * read_imagef(src, sampler, (int2)(xoff.y, yoff.y)).x); - src_pix.z = (int)(255 * read_imagef(src, sampler, (int2)(xoff.z, yoff.z)).x); - src_pix.w = (int)(255 * read_imagef(src, sampler, (int2)(xoff.w, yoff.w)).x); + src_pix.x = read_imagef(src, sampler, (int2)(xoff.x, yoff.x)).x; + src_pix.y = read_imagef(src, sampler, (int2)(xoff.y, yoff.y)).x; + src_pix.z = read_imagef(src, sampler, (int2)(xoff.z, yoff.z)).x; + src_pix.w = read_imagef(src, sampler, (int2)(xoff.w, yoff.w)).x; if (!oobb) { a = integral_img[(y - p) * width + x - p]; b = integral_img[(y + p) * width + x - p]; @@ -93,7 +90,7 @@ kernel void weight_accum(global float *sum, global float *weight, } float4 patch_diff = convert_float4(d + a - c - b); - float4 w = native_exp(-patch_diff / (h * h)); + float4 w = native_exp(-patch_diff * (float4)(255.0f * 255.0f) / (h * h)); float w_sum = w.x + w.y + w.z + w.w; weight[y * width + x] += w_sum; sum[y * width + x] += dot(w, convert_float4(src_pix)); @@ -109,7 +106,7 @@ kernel void average(__write_only image2d_t dst, float w = weight[y * dim.x + x]; float s = sum[y * dim.x + x]; float src_pix = read_imagef(src, sampler, (int2)(x, y)).x; - float r = (s + src_pix * 255) / (1.0f + w) / 255.0f; + float r = (s + src_pix) / (1.0f + w); if (x < dim.x && y < dim.y) write_imagef(dst, (int2)(x, y), (float4)(r, 0.0f, 0.0f, 1.0f)); } diff --git a/libavfilter/vf_nlmeans_opencl.c b/libavfilter/vf_nlmeans_opencl.c index e57b5e0873..0b69f3b6c4 100644 --- a/libavfilter/vf_nlmeans_opencl.c +++ b/libavfilter/vf_nlmeans_opencl.c @@ -30,11 +30,9 @@ #include "opencl_source.h" #include "video.h" -// TODO: -// the integral image may overflow 32bit, consider using 64bit - static const enum AVPixelFormat supported_formats[] = { AV_PIX_FMT_YUV420P, + AV_PIX_FMT_YUV420P16LE, AV_PIX_FMT_YUV444P, AV_PIX_FMT_GBRP, }; @@ -59,7 +57,6 @@ typedef struct NLMeansOpenCLContext { cl_mem integral_img; cl_mem weight; cl_mem sum; - cl_mem overflow; // overflow in integral image? double sigma; float h; int chroma_w; @@ -129,7 +126,7 @@ static int nlmeans_opencl_init(AVFilterContext *avctx, int width, int height) "average kernel %d.\n", cle); ctx->integral_img = clCreateBuffer(ctx->ocf.hwctx->context, 0, - 4 * width * height * sizeof(cl_int), + 4 * width * height * sizeof(cl_float), NULL, &cle); CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to create " "integral image %d.\n", cle); @@ -144,11 +141,6 @@ static int nlmeans_opencl_init(AVFilterContext *avctx, int width, int height) CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to create " "sum buffer %d.\n", cle); - ctx->overflow = clCreateBuffer(ctx->ocf.hwctx->context, 0, - sizeof(cl_int), NULL, &cle); - CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to create " - "overflow buffer %d.\n", cle); - ctx->initialised = 1; return 0; @@ -161,7 +153,6 @@ fail: CL_RELEASE_MEMORY(ctx->integral_img); CL_RELEASE_MEMORY(ctx->weight); CL_RELEASE_MEMORY(ctx->sum); - CL_RELEASE_MEMORY(ctx->overflow); CL_RELEASE_QUEUE(ctx->command_queue); return err; @@ -239,9 +230,8 @@ static int nlmeans_plane(AVFilterContext *avctx, cl_mem dst, cl_mem src, // vertical pass // integral(x, y) = sum(integral(x, v)) for v in [0, y] CL_SET_KERNEL_ARG(ctx->vert_kernel, 0, cl_mem, &ctx->integral_img); - CL_SET_KERNEL_ARG(ctx->vert_kernel, 1, cl_mem, &ctx->overflow); - CL_SET_KERNEL_ARG(ctx->vert_kernel, 2, cl_int, &width); - CL_SET_KERNEL_ARG(ctx->vert_kernel, 3, cl_int, &height); + CL_SET_KERNEL_ARG(ctx->vert_kernel, 1, cl_int, &width); + CL_SET_KERNEL_ARG(ctx->vert_kernel, 2, cl_int, &height); cle = clEnqueueNDRangeKernel(ctx->command_queue, ctx->vert_kernel, 1, NULL, worksize2, NULL, 0, NULL, NULL); CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to enqueue vert_kernel: %d.\n", @@ -293,8 +283,7 @@ static int nlmeans_opencl_filter_frame(AVFilterLink *inlink, AVFrame *input) const AVPixFmtDescriptor *desc; enum AVPixelFormat in_format; cl_mem src, dst; - const cl_int zero = 0; - int w, h, err, cle, overflow, p, patch, research; + int w, h, err, cle, p, patch, research; av_log(ctx, AV_LOG_DEBUG, "Filter input: %s, %ux%u (%"PRId64").\n", av_get_pix_fmt_name(input->format), @@ -331,11 +320,6 @@ static int nlmeans_opencl_filter_frame(AVFilterLink *inlink, AVFrame *input) goto fail; } - cle = clEnqueueWriteBuffer(ctx->command_queue, ctx->overflow, CL_FALSE, - 0, sizeof(cl_int), &zero, 0, NULL, NULL); - CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to initialize overflow" - "detection buffer %d.\n", cle); - for (p = 0; p < FF_ARRAY_ELEMS(output->data); p++) { src = (cl_mem) input->data[p]; dst = (cl_mem) output->data[p]; @@ -351,17 +335,10 @@ static int nlmeans_opencl_filter_frame(AVFilterLink *inlink, AVFrame *input) if (err < 0) goto fail; } - // overflow occurred? - cle = clEnqueueReadBuffer(ctx->command_queue, ctx->overflow, CL_FALSE, - 0, sizeof(cl_int), &overflow, 0, NULL, NULL); - CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to read overflow: %d.\n", cle); cle = clFinish(ctx->command_queue); CL_FAIL_ON_ERROR(AVERROR(EIO), "Failed to finish kernel: %d.\n", cle); - if (overflow > 0) - av_log(avctx, AV_LOG_ERROR, "integral image overflow %d\n", overflow); - av_frame_free(&input); av_log(ctx, AV_LOG_DEBUG, "Filter output: %s, %ux%u (%"PRId64").\n", @@ -390,7 +367,6 @@ static av_cold void nlmeans_opencl_uninit(AVFilterContext *avctx) CL_RELEASE_MEMORY(ctx->integral_img); CL_RELEASE_MEMORY(ctx->weight); CL_RELEASE_MEMORY(ctx->sum); - CL_RELEASE_MEMORY(ctx->overflow); CL_RELEASE_QUEUE(ctx->command_queue);