From patchwork Fri Oct  9 16:32:43 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Philip Langdale <philipl@overt.org>
X-Patchwork-Id: 22798
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
X-Original-To: patchwork@ffaux-bg.ffmpeg.org
Delivered-To: patchwork@ffaux-bg.ffmpeg.org
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by ffaux.localdomain (Postfix) with ESMTP id 4C5C3449A63
	for <patchwork@ffaux-bg.ffmpeg.org>; Fri,  9 Oct 2020 19:33:01 +0300 (EEST)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 1FDBB68B9CD;
	Fri,  9 Oct 2020 19:33:01 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail.overt.org (mail.overt.org [157.230.92.47])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 4BD7368B9B5
 for <ffmpeg-devel@ffmpeg.org>; Fri,  9 Oct 2020 19:32:55 +0300 (EEST)
Received: from authenticated-user (mail.overt.org [157.230.92.47])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits))
 (No client certificate requested)
 by mail.overt.org (Postfix) with ESMTPSA id 81B8F3F1DD;
 Fri,  9 Oct 2020 11:32:53 -0500 (CDT)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=overt.org; s=mail;
 t=1602261173; bh=cNfZwmi5yrrJjr3OSvfW8uTA/4VDJs4Cyl6Sw8eyA64=;
 h=From:To:Cc:Subject:Date:From;
 b=AHvYIYQPMKnSXVEhUI1hMBlud9/izX8eJpjpBi5K9+4cFvFiJEfoBulm49ak+ocY8
 jy5s0y3YAlSOFs0B5MxUhwZZncBtp+TVPal5EPvzAE2CswLkbthgaD4FDi+jDu8Doo
 6l/1dP6O/m9kvyD5IcnVWxKT/xnPgZlnHtO3nFQpfvwNxl1ZmmJ67fyt9uBQ5LK88Y
 Xxw1OFePe66UCJNd8QsCyRfhE9HNQbXtJJwWQHL4FAU5pZ8Wsm5e7L4WmZxrKMgyW1
 ivnDK8vTe9Udnren6h/53TbTF3XVD04QIDjv7uf0057fAE7PMEdB5XWKQ1A6S1aSfy
 fP5oAg4gajrMg==
From: Philip Langdale <philipl@overt.org>
To: ffmpeg-devel@ffmpeg.org
Date: Fri,  9 Oct 2020 09:32:43 -0700
Message-Id: <20201009163243.81167-1-philipl@overt.org>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH] avfilter/vf_bwdif_cuda: CUDA implementation
	of bwdif
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Philip Langdale <philipl@overt.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

I've been sitting on this for a couple of years now, and I figured I
should just send it out. This is what I believe is a conceptually
correct port of bwdif to cuda (modulo edge handling which is not done
in the same way because the conditional checks for edges are expensive
in cuda, but that's the same as for yadif_cuda).

However, I see glitches in some samples where black or white pixels
appear in white or black areas respectively. This seems like some
sort of under/overflow. I've tried to use the largest cuda types
everywhere, and that did appear to improve things but didn't make
it go away. This is what led to me never sending this diff over the
years, but maybe someone else has insights about this.
---
 configure                    |   2 +
 libavfilter/Makefile         |   2 +
 libavfilter/allfilters.c     |   1 +
 libavfilter/vf_bwdif_cuda.c  | 394 +++++++++++++++++++++++++++++++++++
 libavfilter/vf_bwdif_cuda.cu | 290 ++++++++++++++++++++++++++
 5 files changed, 689 insertions(+)
 create mode 100644 libavfilter/vf_bwdif_cuda.c
 create mode 100644 libavfilter/vf_bwdif_cuda.cu

diff --git a/configure b/configure
index 75f0a0fcaa..4e7a97b17e 100755
--- a/configure
+++ b/configure
@@ -3511,6 +3511,8 @@ bm3d_filter_select="dct"
 boxblur_filter_deps="gpl"
 boxblur_opencl_filter_deps="opencl gpl"
 bs2b_filter_deps="libbs2b"
+bwdif_cuda_filter_deps="ffnvcodec"
+bwdif_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 chromaber_vulkan_filter_deps="vulkan libglslang"
 colorkey_opencl_filter_deps="opencl"
 colormatrix_filter_deps="gpl"
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index e6d3c283da..db99238fce 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -178,6 +178,8 @@ OBJS-$(CONFIG_BOXBLUR_FILTER)                += vf_boxblur.o boxblur.o
 OBJS-$(CONFIG_BOXBLUR_OPENCL_FILTER)         += vf_avgblur_opencl.o opencl.o \
                                                 opencl/avgblur.o boxblur.o
 OBJS-$(CONFIG_BWDIF_FILTER)                  += vf_bwdif.o yadif_common.o
+OBJS-$(CONFIG_BWDIF_CUDA_FILTER)             += vf_bwdif_cuda.o vf_bwdif_cuda.ptx.o \
+                                                yadif_common.o
 OBJS-$(CONFIG_CAS_FILTER)                    += vf_cas.o
 OBJS-$(CONFIG_CHROMABER_VULKAN_FILTER)       += vf_chromaber_vulkan.o vulkan.o
 OBJS-$(CONFIG_CHROMAHOLD_FILTER)             += vf_chromakey.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index fa91e608e4..2da43166a5 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -169,6 +169,7 @@ extern AVFilter ff_vf_bm3d;
 extern AVFilter ff_vf_boxblur;
 extern AVFilter ff_vf_boxblur_opencl;
 extern AVFilter ff_vf_bwdif;
+extern AVFilter ff_vf_bwdif_cuda;
 extern AVFilter ff_vf_cas;
 extern AVFilter ff_vf_chromahold;
 extern AVFilter ff_vf_chromakey;
diff --git a/libavfilter/vf_bwdif_cuda.c b/libavfilter/vf_bwdif_cuda.c
new file mode 100644
index 0000000000..7651a869d5
--- /dev/null
+++ b/libavfilter/vf_bwdif_cuda.c
@@ -0,0 +1,394 @@
+/*
+ * Copyright (C) 2018 Philip Langdale <philipl@overt.org>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/hwcontext_cuda_internal.h"
+#include "libavutil/cuda_check.h"
+#include "internal.h"
+#include "yadif.h"
+
+extern char vf_bwdif_cuda_ptx[];
+
+typedef struct DeintCUDAContext {
+    YADIFContext yadif;
+
+    AVCUDADeviceContext *hwctx;
+    AVBufferRef         *device_ref;
+    AVBufferRef         *input_frames_ref;
+    AVHWFramesContext   *input_frames;
+
+    CUcontext   cu_ctx;
+    CUstream    stream;
+    CUmodule    cu_module;
+    CUfunction  cu_func_uchar;
+    CUfunction  cu_func_uchar2;
+    CUfunction  cu_func_ushort;
+    CUfunction  cu_func_ushort2;
+} DeintCUDAContext;
+
+#define DIV_UP(a, b) ( ((a) + (b) - 1) / (b) )
+#define ALIGN_UP(a, b) (((a) + (b) - 1) & ~((b) - 1))
+#define BLOCKX 32
+#define BLOCKY 16
+
+#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, s->hwctx->internal->cuda_dl, x)
+
+static CUresult call_kernel(AVFilterContext *ctx, CUfunction func,
+                            CUdeviceptr prev, CUdeviceptr cur, CUdeviceptr next,
+                            CUarray_format format, int channels,
+                            int src_width,  // Width is pixels per channel
+                            int src_height, // Height is pixels per channel
+                            int src_pitch,  // Pitch is bytes
+                            CUdeviceptr dst,
+                            int dst_width,  // Width is pixels per channel
+                            int dst_height, // Height is pixels per channel
+                            int dst_pitch,  // Pitch is pixels per channel
+                            int parity, int tff, int clip_max)
+{
+    DeintCUDAContext *s = ctx->priv;
+    CudaFunctions *cu = s->hwctx->internal->cuda_dl;
+    CUtexObject tex_prev = 0, tex_cur = 0, tex_next = 0;
+    int ret;
+    int skip_spatial_check = s->yadif.mode&2;
+
+    void *args[] = { &dst, &tex_prev, &tex_cur, &tex_next,
+                     &dst_width, &dst_height, &dst_pitch,
+                     &src_width, &src_height, &parity, &tff,
+                     &skip_spatial_check, &clip_max };
+
+    CUDA_TEXTURE_DESC tex_desc = {
+        .filterMode = CU_TR_FILTER_MODE_POINT,
+        .flags = CU_TRSF_READ_AS_INTEGER,
+    };
+
+    CUDA_RESOURCE_DESC res_desc = {
+        .resType = CU_RESOURCE_TYPE_PITCH2D,
+        .res.pitch2D.format = format,
+        .res.pitch2D.numChannels = channels,
+        .res.pitch2D.width = src_width,
+        .res.pitch2D.height = src_height,
+        .res.pitch2D.pitchInBytes = src_pitch,
+    };
+
+    res_desc.res.pitch2D.devPtr = (CUdeviceptr)prev;
+    ret = CHECK_CU(cu->cuTexObjectCreate(&tex_prev, &res_desc, &tex_desc, NULL));
+    if (ret < 0)
+        goto exit;
+
+    res_desc.res.pitch2D.devPtr = (CUdeviceptr)cur;
+    ret = CHECK_CU(cu->cuTexObjectCreate(&tex_cur, &res_desc, &tex_desc, NULL));
+    if (ret < 0)
+        goto exit;
+
+    res_desc.res.pitch2D.devPtr = (CUdeviceptr)next;
+    ret = CHECK_CU(cu->cuTexObjectCreate(&tex_next, &res_desc, &tex_desc, NULL));
+    if (ret < 0)
+        goto exit;
+
+    ret = CHECK_CU(cu->cuLaunchKernel(func,
+                                      DIV_UP(dst_width, BLOCKX), DIV_UP(dst_height, BLOCKY), 1,
+                                      BLOCKX, BLOCKY, 1,
+                                      0, s->stream, args, NULL));
+
+exit:
+    if (tex_prev)
+        CHECK_CU(cu->cuTexObjectDestroy(tex_prev));
+    if (tex_cur)
+        CHECK_CU(cu->cuTexObjectDestroy(tex_cur));
+    if (tex_next)
+        CHECK_CU(cu->cuTexObjectDestroy(tex_next));
+
+    return ret;
+}
+
+static void filter(AVFilterContext *ctx, AVFrame *dst,
+                   int parity, int tff)
+{
+    DeintCUDAContext *s = ctx->priv;
+    YADIFContext *y = &s->yadif;
+    CudaFunctions *cu = s->hwctx->internal->cuda_dl;
+    CUcontext dummy;
+    int i, ret;
+
+    ret = CHECK_CU(cu->cuCtxPushCurrent(s->cu_ctx));
+    if (ret < 0)
+        return;
+
+    for (i = 0; i < y->csp->nb_components; i++) {
+        CUfunction func;
+        CUarray_format format;
+        int pixel_size, channels;
+        const AVComponentDescriptor *comp = &y->csp->comp[i];
+
+        if (comp->plane < i) {
+            // We process planes as a whole, so don't reprocess
+            // them for additional components
+            continue;
+        }
+
+        pixel_size = (comp->depth + comp->shift) / 8;
+        channels = comp->step / pixel_size;
+        if (pixel_size > 2 || channels > 2) {
+            av_log(ctx, AV_LOG_ERROR, "Unsupported pixel format: %s\n", y->csp->name);
+            goto exit;
+        }
+        switch (pixel_size) {
+        case 1:
+            func = channels == 1 ? s->cu_func_uchar : s->cu_func_uchar2;
+            format = CU_AD_FORMAT_UNSIGNED_INT8;
+            break;
+        case 2:
+            func = channels == 1 ? s->cu_func_ushort : s->cu_func_ushort2;
+            format = CU_AD_FORMAT_UNSIGNED_INT16;
+            break;
+        default:
+            av_log(ctx, AV_LOG_ERROR, "Unsupported pixel format: %s\n", y->csp->name);
+            goto exit;
+        }
+
+        int clip_max = (1 << (y->csp->comp[i].depth)) - 1;
+
+        av_log(ctx, AV_LOG_TRACE,
+               "Deinterlacing plane %d: pixel_size: %d channels: %d\n",
+               comp->plane, pixel_size, channels);
+        call_kernel(ctx, func,
+                    (CUdeviceptr)y->prev->data[i],
+                    (CUdeviceptr)y->cur->data[i],
+                    (CUdeviceptr)y->next->data[i],
+                    format, channels,
+                    AV_CEIL_RSHIFT(y->cur->width, i ? y->csp->log2_chroma_w : 0),
+                    AV_CEIL_RSHIFT(y->cur->height, i ? y->csp->log2_chroma_h : 0),
+                    y->cur->linesize[i],
+                    (CUdeviceptr)dst->data[i],
+                    AV_CEIL_RSHIFT(dst->width, i ? y->csp->log2_chroma_w : 0),
+                    AV_CEIL_RSHIFT(dst->height, i ? y->csp->log2_chroma_h : 0),
+                    dst->linesize[i] / comp->step,
+                    parity, tff, clip_max);
+    }
+
+    if (y->current_field == YADIF_FIELD_END) {
+        y->current_field = YADIF_FIELD_NORMAL;
+    }
+
+exit:
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    return;
+}
+
+static av_cold void deint_cuda_uninit(AVFilterContext *ctx)
+{
+    CUcontext dummy;
+    DeintCUDAContext *s = ctx->priv;
+    YADIFContext *y = &s->yadif;
+
+    if (s->hwctx && s->cu_module) {
+        CudaFunctions *cu = s->hwctx->internal->cuda_dl;
+        CHECK_CU(cu->cuCtxPushCurrent(s->cu_ctx));
+        CHECK_CU(cu->cuModuleUnload(s->cu_module));
+        CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    }
+
+    av_frame_free(&y->prev);
+    av_frame_free(&y->cur);
+    av_frame_free(&y->next);
+
+    av_buffer_unref(&s->device_ref);
+    s->hwctx = NULL;
+    av_buffer_unref(&s->input_frames_ref);
+    s->input_frames = NULL;
+}
+
+static int deint_cuda_query_formats(AVFilterContext *ctx)
+{
+    enum AVPixelFormat pix_fmts[] = {
+        AV_PIX_FMT_CUDA, AV_PIX_FMT_NONE,
+    };
+    int ret;
+
+    if ((ret = ff_formats_ref(ff_make_format_list(pix_fmts),
+                              &ctx->inputs[0]->outcfg.formats)) < 0)
+        return ret;
+    if ((ret = ff_formats_ref(ff_make_format_list(pix_fmts),
+                              &ctx->outputs[0]->incfg.formats)) < 0)
+        return ret;
+
+    return 0;
+}
+
+static int config_input(AVFilterLink *inlink)
+{
+    AVFilterContext *ctx = inlink->dst;
+    DeintCUDAContext *s  = ctx->priv;
+
+    if (!inlink->hw_frames_ctx) {
+        av_log(ctx, AV_LOG_ERROR, "A hardware frames reference is "
+               "required to associate the processing device.\n");
+        return AVERROR(EINVAL);
+    }
+
+    s->input_frames_ref = av_buffer_ref(inlink->hw_frames_ctx);
+    if (!s->input_frames_ref) {
+        av_log(ctx, AV_LOG_ERROR, "A input frames reference create "
+               "failed.\n");
+        return AVERROR(ENOMEM);
+    }
+    s->input_frames = (AVHWFramesContext*)s->input_frames_ref->data;
+
+    return 0;
+}
+
+static int config_output(AVFilterLink *link)
+{
+    AVHWFramesContext *output_frames;
+    AVFilterContext *ctx = link->src;
+    DeintCUDAContext *s = ctx->priv;
+    YADIFContext *y = &s->yadif;
+    CudaFunctions *cu;
+    int ret = 0;
+    CUcontext dummy;
+
+    av_assert0(s->input_frames);
+    s->device_ref = av_buffer_ref(s->input_frames->device_ref);
+    if (!s->device_ref) {
+        av_log(ctx, AV_LOG_ERROR, "A device reference create "
+               "failed.\n");
+        return AVERROR(ENOMEM);
+    }
+    s->hwctx = ((AVHWDeviceContext*)s->device_ref->data)->hwctx;
+    s->cu_ctx = s->hwctx->cuda_ctx;
+    s->stream = s->hwctx->stream;
+    cu = s->hwctx->internal->cuda_dl;
+
+    link->hw_frames_ctx = av_hwframe_ctx_alloc(s->device_ref);
+    if (!link->hw_frames_ctx) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to create HW frame context "
+               "for output.\n");
+        ret = AVERROR(ENOMEM);
+        goto exit;
+    }
+
+    output_frames = (AVHWFramesContext*)link->hw_frames_ctx->data;
+
+    output_frames->format    = AV_PIX_FMT_CUDA;
+    output_frames->sw_format = s->input_frames->sw_format;
+    output_frames->width     = ctx->inputs[0]->w;
+    output_frames->height    = ctx->inputs[0]->h;
+
+    output_frames->initial_pool_size = 4;
+
+    ret = ff_filter_init_hw_frames(ctx, link, 10);
+    if (ret < 0)
+        goto exit;
+
+    ret = av_hwframe_ctx_init(link->hw_frames_ctx);
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to initialise CUDA frame "
+               "context for output: %d\n", ret);
+        goto exit;
+    }
+
+    link->time_base.num = ctx->inputs[0]->time_base.num;
+    link->time_base.den = ctx->inputs[0]->time_base.den * 2;
+    link->w             = ctx->inputs[0]->w;
+    link->h             = ctx->inputs[0]->h;
+
+    if(y->mode & 1)
+        link->frame_rate = av_mul_q(ctx->inputs[0]->frame_rate,
+                                    (AVRational){2, 1});
+
+    if (link->w < 3 || link->h < 3) {
+        av_log(ctx, AV_LOG_ERROR, "Video of less than 3 columns or lines is not supported\n");
+        ret = AVERROR(EINVAL);
+        goto exit;
+    }
+
+    y->csp = av_pix_fmt_desc_get(output_frames->sw_format);
+    y->filter = filter;
+
+    ret = CHECK_CU(cu->cuCtxPushCurrent(s->cu_ctx));
+    if (ret < 0)
+        goto exit;
+
+    ret = CHECK_CU(cu->cuModuleLoadData(&s->cu_module, vf_bwdif_cuda_ptx));
+    if (ret < 0)
+        goto exit;
+
+    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_uchar, s->cu_module, "bwdif_uchar"));
+    if (ret < 0)
+        goto exit;
+
+    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_uchar2, s->cu_module, "bwdif_uchar2"));
+    if (ret < 0)
+        goto exit;
+
+    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_ushort, s->cu_module, "bwdif_ushort"));
+    if (ret < 0)
+        goto exit;
+
+    ret = CHECK_CU(cu->cuModuleGetFunction(&s->cu_func_ushort2, s->cu_module, "bwdif_ushort2"));
+    if (ret < 0)
+        goto exit;
+
+exit:
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+
+    return ret;
+}
+
+static const AVClass bwdif_cuda_class = {
+    .class_name = "bwdif_cuda",
+    .item_name  = av_default_item_name,
+    .option     = ff_yadif_options,
+    .version    = LIBAVUTIL_VERSION_INT,
+    .category   = AV_CLASS_CATEGORY_FILTER,
+};
+
+static const AVFilterPad deint_cuda_inputs[] = {
+    {
+        .name          = "default",
+        .type          = AVMEDIA_TYPE_VIDEO,
+        .filter_frame  = ff_yadif_filter_frame,
+        .config_props  = config_input,
+    },
+    { NULL }
+};
+
+static const AVFilterPad deint_cuda_outputs[] = {
+    {
+        .name          = "default",
+        .type          = AVMEDIA_TYPE_VIDEO,
+        .request_frame = ff_yadif_request_frame,
+        .config_props  = config_output,
+    },
+    { NULL }
+};
+
+AVFilter ff_vf_bwdif_cuda = {
+    .name           = "bwdif_cuda",
+    .description    = NULL_IF_CONFIG_SMALL("Deinterlace CUDA frames"),
+    .priv_size      = sizeof(DeintCUDAContext),
+    .priv_class     = &bwdif_cuda_class,
+    .uninit         = deint_cuda_uninit,
+    .query_formats  = deint_cuda_query_formats,
+    .inputs         = deint_cuda_inputs,
+    .outputs        = deint_cuda_outputs,
+    .flags          = AVFILTER_FLAG_SUPPORT_TIMELINE_INTERNAL,
+    .flags_internal = FF_FILTER_FLAG_HWFRAME_AWARE,
+};
diff --git a/libavfilter/vf_bwdif_cuda.cu b/libavfilter/vf_bwdif_cuda.cu
new file mode 100644
index 0000000000..f748c630c9
--- /dev/null
+++ b/libavfilter/vf_bwdif_cuda.cu
@@ -0,0 +1,290 @@
+/*
+ * Copyright (C) 2018 Philip Langdale <philipl@overt.org>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+__device__ static const int coef_lf[2] = { 4309, 213 };
+__device__ static const int coef_hf[3] = { 5570, 3801, 1016 };
+__device__ static const int coef_sp[2] = { 5077, 981 };
+
+template<typename T>
+__inline__ __device__ T max3(T a, T b, T c)
+{
+    T x = max(a, b);
+    return max(x, c);
+}
+
+template<typename T>
+__inline__ __device__ T min3(T a, T b, T c)
+{
+    T x = min(a, b);
+    return min(x, c);
+}
+
+template<typename T>
+__inline__ __device__ T clip(T a, T min, T max)
+{
+    if (a < min) {
+        return min;
+    } else if (a > max) {
+        return max;
+    } else {
+        return a;
+    }
+}
+
+template<typename T>
+__inline__ __device__ T filter(T A, T B, T C, T D,
+                               T a, T b, T c, T d, T e, T f, T g,
+                               T h, T i, T j, T k, T l, T m, T n,
+                               int clip_max)
+{
+    T final;
+
+    int fc = C;
+    int fd = (c + l) >> 1;
+    int fe = B;
+
+    int temporal_diff0 = abs(c - l);
+    int temporal_diff1 = (abs(g - fc) + abs(f - fe)) >> 1;
+    int temporal_diff2 = (abs(i - fc) + abs(h - fe)) >> 1;
+    int diff = max3(temporal_diff0 >> 1, temporal_diff1, temporal_diff2);
+
+    if (!diff) {
+        final = fd;
+    } else {
+        int fb = ((d + m) >> 1) - fc;
+        int ff = ((c + l) >> 1) - fe;
+        int dc = fd - fc;
+        int de = fd - fe;
+        int mmax = max3(de, dc, min(fb, ff));
+        int mmin = min3(de, dc, max(fb, ff));
+        diff = max3(diff, mmin, -mmax);
+
+        int interpol;
+        if (abs(fc - fe) > temporal_diff0) {
+            interpol = (((coef_hf[0] * (c + l)
+                - coef_hf[1] * (d + m + b + k)
+                + coef_hf[2] * (e + n + a + j)) >> 2)
+                + coef_lf[0] * (C + B) - coef_lf[1] * (D + A)) >> 13;
+        } else {
+            interpol = (coef_sp[0] * (C + B) - coef_sp[1] * (D + A)) >> 13;
+        }
+        if (interpol > fd + diff) {
+            interpol = fd + diff;
+        } else if (interpol < fd - diff) {
+            interpol = fd - diff;
+        }
+        final = clip(interpol, 0, clip_max);
+    }
+
+    return final;
+}
+
+template<typename T>
+__inline__ __device__ void bwdif_single(T *dst,
+                                        cudaTextureObject_t prev,
+                                        cudaTextureObject_t cur,
+                                        cudaTextureObject_t next,
+                                        int dst_width, int dst_height, int dst_pitch,
+                                        int src_width, int src_height,
+                                        int parity, int tff, bool skip_spatial_check,
+                                        int clip_max)
+{
+    // Identify location
+    int xo = blockIdx.x * blockDim.x + threadIdx.x;
+    int yo = blockIdx.y * blockDim.y + threadIdx.y;
+
+    if (xo >= dst_width || yo >= dst_height) {
+        return;
+    }
+
+    // Don't modify the primary field
+    if (yo % 2 == parity) {
+      dst[yo*dst_pitch+xo] = tex2D<T>(cur, xo, yo);
+      return;
+    }
+
+    T A = tex2D<T>(cur, xo, yo + 3);
+    T B = tex2D<T>(cur, xo, yo + 1);
+    T C = tex2D<T>(cur, xo, yo - 1);
+    T D = tex2D<T>(cur, xo, yo - 3);
+
+    // Calculate temporal prediction
+    int is_second_field = !(parity ^ tff);
+
+    cudaTextureObject_t prev2 = prev;
+    cudaTextureObject_t prev1 = is_second_field ? cur : prev;
+    cudaTextureObject_t next1 = is_second_field ? next : cur;
+    cudaTextureObject_t next2 = next;
+
+    T a = tex2D<T>(prev2, xo,  yo + 4);
+    T b = tex2D<T>(prev2, xo,  yo + 2);
+    T c = tex2D<T>(prev2, xo,  yo + 0);
+    T d = tex2D<T>(prev2, xo,  yo - 2);
+    T e = tex2D<T>(prev2, xo,  yo - 4);
+    T f = tex2D<T>(prev1, xo,  yo + 1);
+    T g = tex2D<T>(prev1, xo,  yo - 1);
+    T h = tex2D<T>(next1, xo,  yo + 1);
+    T i = tex2D<T>(next1, xo,  yo - 1);
+    T j = tex2D<T>(next2, xo,  yo + 4);
+    T k = tex2D<T>(next2, xo,  yo + 2);
+    T l = tex2D<T>(next2, xo,  yo + 0);
+    T m = tex2D<T>(next2, xo,  yo - 2);
+    T n = tex2D<T>(next2, xo,  yo - 4);
+
+    dst[yo*dst_pitch+xo] = filter(A, B, C, D,
+                                  a, b, c, d, e, f, g,
+                                  h, i, j, k, l, m, n,
+                                  clip_max);
+}
+
+template <typename T>
+__inline__ __device__ void bwdif_double(T *dst,
+                                        cudaTextureObject_t prev,
+                                        cudaTextureObject_t cur,
+                                        cudaTextureObject_t next,
+                                        int dst_width, int dst_height, int dst_pitch,
+                                        int src_width, int src_height,
+                                        int parity, int tff, bool skip_spatial_check,
+                                        int clip_max)
+{
+    int xo = blockIdx.x * blockDim.x + threadIdx.x;
+    int yo = blockIdx.y * blockDim.y + threadIdx.y;
+
+    if (xo >= dst_width || yo >= dst_height) {
+        return;
+    }
+
+    if (yo % 2 == parity) {
+      // Don't modify the primary field
+      dst[yo*dst_pitch+xo] = tex2D<T>(cur, xo, yo);
+      return;
+    }
+
+    T A = tex2D<T>(cur, xo, yo + 3);
+    T B = tex2D<T>(cur, xo, yo + 1);
+    T C = tex2D<T>(cur, xo, yo - 1);
+    T D = tex2D<T>(cur, xo, yo - 3);
+
+    // Calculate temporal prediction
+    int is_second_field = !(parity ^ tff);
+
+    cudaTextureObject_t prev2 = prev;
+    cudaTextureObject_t prev1 = is_second_field ? cur : prev;
+    cudaTextureObject_t next1 = is_second_field ? next : cur;
+    cudaTextureObject_t next2 = next;
+
+    T a = tex2D<T>(prev2, xo,  yo + 4);
+    T b = tex2D<T>(prev2, xo,  yo + 2);
+    T c = tex2D<T>(prev2, xo,  yo + 0);
+    T d = tex2D<T>(prev2, xo,  yo - 2);
+    T e = tex2D<T>(prev2, xo,  yo - 4);
+    T f = tex2D<T>(prev1, xo,  yo + 1);
+    T g = tex2D<T>(prev1, xo,  yo - 1);
+    T h = tex2D<T>(next1, xo,  yo + 1);
+    T i = tex2D<T>(next1, xo,  yo - 1);
+    T j = tex2D<T>(next2, xo,  yo + 4);
+    T k = tex2D<T>(next2, xo,  yo + 2);
+    T l = tex2D<T>(next2, xo,  yo + 0);
+    T m = tex2D<T>(next2, xo,  yo - 2);
+    T n = tex2D<T>(next2, xo,  yo - 4);
+
+    T final;
+    final.x = filter(A.x, B.x, C.x, D.x,
+                     a.x, b.x, c.x, d.x, e.x, f.x, g.x,
+                     h.x, i.x, j.x, k.x, l.x, m.x, n.x,
+                     clip_max);
+    final.y = filter(A.y, B.y, C.y, D.y,
+                     a.y, b.y, c.y, d.y, e.y, f.y, g.y,
+                     h.y, i.y, j.y, k.y, l.y, m.y, n.y,
+                     clip_max);
+
+
+
+
+    dst[yo*dst_pitch+xo] = final;
+}
+
+extern "C" {
+
+__global__ void bwdif_uchar(unsigned char *dst,
+                            cudaTextureObject_t prev,
+                            cudaTextureObject_t cur,
+                            cudaTextureObject_t next,
+                            int dst_width, int dst_height, int dst_pitch,
+                            int src_width, int src_height,
+                            int parity, int tff, bool skip_spatial_check,
+                            int clip_max)
+{
+    bwdif_single(dst, prev, cur, next,
+                 dst_width, dst_height, dst_pitch,
+                 src_width, src_height,
+                 parity, tff, skip_spatial_check,
+                 clip_max);
+}
+
+__global__ void bwdif_ushort(unsigned short *dst,
+                            cudaTextureObject_t prev,
+                            cudaTextureObject_t cur,
+                            cudaTextureObject_t next,
+                            int dst_width, int dst_height, int dst_pitch,
+                            int src_width, int src_height,
+                            int parity, int tff, bool skip_spatial_check,
+                            int clip_max)
+{
+    bwdif_single(dst, prev, cur, next,
+                 dst_width, dst_height, dst_pitch,
+                 src_width, src_height,
+                 parity, tff, skip_spatial_check,
+                 clip_max);
+}
+
+__global__ void bwdif_uchar2(uchar2 *dst,
+                            cudaTextureObject_t prev,
+                            cudaTextureObject_t cur,
+                            cudaTextureObject_t next,
+                            int dst_width, int dst_height, int dst_pitch,
+                            int src_width, int src_height,
+                            int parity, int tff, bool skip_spatial_check,
+                            int clip_max)
+{
+    bwdif_double(dst, prev, cur, next,
+                 dst_width, dst_height, dst_pitch,
+                 src_width, src_height,
+                 parity, tff, skip_spatial_check,
+                 clip_max);
+}
+
+__global__ void bwdif_ushort2(ushort2 *dst,
+                            cudaTextureObject_t prev,
+                            cudaTextureObject_t cur,
+                            cudaTextureObject_t next,
+                            int dst_width, int dst_height, int dst_pitch,
+                            int src_width, int src_height,
+                            int parity, int tff, bool skip_spatial_check,
+                            int clip_max)
+{
+    bwdif_double(dst, prev, cur, next,
+                 dst_width, dst_height, dst_pitch,
+                 src_width, src_height,
+                 parity, tff, skip_spatial_check,
+                 clip_max);
+}
+
+} /* extern "C" */