diff mbox series

[FFmpeg-devel,v2,2/2] avfilter: add vf_overlay_cuda

Message ID 030ea47b8cf127b4ce05781aa007a75609fcd194.1584536595.git.yyyaroslav@gmail.com
State Accepted
Headers show
Series Overlay Cuda Filter
Related show

Checks

Context Check Description
andriy/ffmpeg-patchwork pending
andriy/ffmpeg-patchwork success Applied patch
andriy/ffmpeg-patchwork success Configure finished
andriy/ffmpeg-patchwork success Make finished
andriy/ffmpeg-patchwork success Make fate finished

Commit Message

Yaroslav Pogrebnyak March 19, 2020, 4:02 a.m. UTC
Signed-off-by: Yaroslav Pogrebnyak <yyyaroslav@gmail.com>
---
Changes in v2:
- Fixed switch() indentation style

 configure                      |   2 +
 libavfilter/Makefile           |   1 +
 libavfilter/allfilters.c       |   1 +
 libavfilter/vf_overlay_cuda.c  | 446 +++++++++++++++++++++++++++++++++
 libavfilter/vf_overlay_cuda.cu |  54 ++++
 5 files changed, 504 insertions(+)
 create mode 100644 libavfilter/vf_overlay_cuda.c
 create mode 100644 libavfilter/vf_overlay_cuda.cu

Comments

Timo Rothenpieler March 19, 2020, 1:15 p.m. UTC | #1
I'm currently trying to get this to work with nvdec, but seemingly can't:

./ffmpeg_g.exe -v verbose
-hwaccel_output_format cuda -hwaccel cuda -i test_h264.mp4 
-hwaccel_output_format cuda -hwaccel cuda -i test2_h264.mp4
-filter_complex
"[0:v]scale_cuda=640:-2[p],[1:v][p]overlay_cuda=x=100:y=100:shortest=true"
-an -c:v h264_nvenc -y out.mp4

It works with legacy h264_cuvid, but definitely also needs to work with 
the proper nvdec hwaccel.

I'm currently investigating as to why, but the error it produces is very 
hard to track down:

Error while filtering 2: Invalid argument
Failed to inject frame into filter network: Invalid argument
Error while processing the decoded data for stream #1:0
Yaroslav Pogrebnyak March 19, 2020, 1:35 p.m. UTC | #2
Oh, I didn't noticed that h264_cuvid is legacy.

It seems the problem in this line:

ret = av_frame_make_writable(input_main);

If removed, it starts to work with -hwaccel cuda.

I'll take a closed look why and what happens but any advice would be 
helpful. Thanks!


On 19.03.20 21:15, Timo Rothenpieler wrote:
> I'm currently trying to get this to work with nvdec, but seemingly can't:
>
> ./ffmpeg_g.exe -v verbose
> -hwaccel_output_format cuda -hwaccel cuda -i test_h264.mp4 
> -hwaccel_output_format cuda -hwaccel cuda -i test2_h264.mp4
> -filter_complex
> "[0:v]scale_cuda=640:-2[p],[1:v][p]overlay_cuda=x=100:y=100:shortest=true" 
>
> -an -c:v h264_nvenc -y out.mp4
>
> It works with legacy h264_cuvid, but definitely also needs to work 
> with the proper nvdec hwaccel.
>
> I'm currently investigating as to why, but the error it produces is 
> very hard to track down:
>
> Error while filtering 2: Invalid argument
> Failed to inject frame into filter network: Invalid argument
> Error while processing the decoded data for stream #1:0
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
Timo Rothenpieler March 19, 2020, 1:40 p.m. UTC | #3
On 19.03.2020 14:35, Yaroslav Pogrebnyak wrote:
> Oh, I didn't noticed that h264_cuvid is legacy.
> 
> It seems the problem in this line:
> 
> ret = av_frame_make_writable(input_main);
> 
> If removed, it starts to work with -hwaccel cuda.
> 
> I'll take a closed look why and what happens but any advice would be 
> helpful. Thanks!

For what I'm aware, make_writable does not work on hardware frames.
And the nvdec hwaccel returns frames that are mapped device memory, and 
thus hard read-only.

You will need to manually allocate output frames from the hw_frames_ctx.
Yaroslav Pogrebnyak March 19, 2020, 2:11 p.m. UTC | #4
On 19.03.20 21:40, Timo Rothenpieler wrote:

> For what I'm aware, make_writable does not work on hardware frames.
> And the nvdec hwaccel returns frames that are mapped device memory, 
> and thus hard read-only.
>
> You will need to manually allocate output frames from the hw_frames_ctx.

Yes I see. So it seems we can safely remove this call.

Also, I was thinking that output frame allocation is not needed because 
we can safely operate on input frame in-place saving extra memory 
allocation and copy. It seems works well. Is it ok, or should we always 
allocate output frame?

If removing call to av_frame_make_writable would be enough, I could send 
updated patch then.

P.S. Also it just strange why it worked well with h264_cuvid.
Timo Rothenpieler March 19, 2020, 2:41 p.m. UTC | #5
On 19.03.2020 15:11, Yaroslav Pogrebnyak wrote:
> On 19.03.20 21:40, Timo Rothenpieler wrote:
> 
>> For what I'm aware, make_writable does not work on hardware frames.
>> And the nvdec hwaccel returns frames that are mapped device memory, 
>> and thus hard read-only.
>>
>> You will need to manually allocate output frames from the hw_frames_ctx.
> 
> Yes I see. So it seems we can safely remove this call.
> 
> Also, I was thinking that output frame allocation is not needed because 
> we can safely operate on input frame in-place saving extra memory 
> allocation and copy. It seems works well. Is it ok, or should we always 
> allocate output frame?
> 
> If removing call to av_frame_make_writable would be enough, I could send 
> updated patch then.
> 
> P.S. Also it just strange why it worked well with h264_cuvid.

h264_cuvid copies frames back to normal VRAM, and does not pass around 
mapped nvdec surfaces, like nvdec does.
Writing around in these is documented as disallowed.

You can call av_frame_is_writable() on the frame. If it returns true, 
it's safe to write into it. If it returns false, you have to allocate a 
new output frame.
Yaroslav Pogrebnyak March 19, 2020, 2:59 p.m. UTC | #6
On 19.03.20 22:41, Timo Rothenpieler wrote:

> h264_cuvid copies frames back to normal VRAM, and does not pass around 
> mapped nvdec surfaces, like nvdec does.
> Writing around in these is documented as disallowed.
>
> You can call av_frame_is_writable() on the frame. If it returns true, 
> it's safe to write into it. If it returns false, you have to allocate 
> a new output frame.

Got it, thanks! I'll re-do it and submit updated patch soon.
Timo Rothenpieler March 19, 2020, 4:47 p.m. UTC | #7
On 19.03.2020 15:59, Yaroslav Pogrebnyak wrote:
> 
> Got it, thanks! I'll re-do it and submit updated patch soon.
> 

I'm looking into adding hardware-frame support to make_writable, so 
modifications might not be needed.
Yaroslav Pogrebnyak March 20, 2020, 11:30 a.m. UTC | #8
On 20.03.20 00:47, Timo Rothenpieler wrote:

> I'm looking into adding hardware-frame support to make_writable, so 
> modifications might not be needed.

Yep it seems to be more consistent if av_frame_make_writable could 
support hardware frames.

Please let me know if you are going to do it, or if I need to send 
modified patch. Thanks!
Timo Rothenpieler March 20, 2020, 11:40 a.m. UTC | #9
On 20.03.2020 12:30, Yaroslav Pogrebnyak wrote:
> On 20.03.20 00:47, Timo Rothenpieler wrote:
> 
>> I'm looking into adding hardware-frame support to make_writable, so 
>> modifications might not be needed.
> 
> Yep it seems to be more consistent if av_frame_make_writable could 
> support hardware frames.
> 
> Please let me know if you are going to do it, or if I need to send 
> modified patch. Thanks!
> 

It is not as simple as I anticipated to do it in a generic way.
The main issue at hand is how nvdec returns hardware frames, which I 
need to fix first to get rid of a lot of hackery that stems from it.

But I do intend to go along with it.
Timo Rothenpieler March 28, 2020, 5:43 p.m. UTC | #10
applied
diff mbox series

Patch

diff --git a/configure b/configure
index 18f2841765..b08dc7bd62 100755
--- a/configure
+++ b/configure
@@ -3026,6 +3026,8 @@  scale_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 thumbnail_cuda_filter_deps="ffnvcodec"
 thumbnail_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 transpose_npp_filter_deps="ffnvcodec libnpp"
+overlay_cuda_filter_deps="ffnvcodec"
+overlay_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 
 amf_deps_any="libdl LoadLibrary"
 nvenc_deps="ffnvcodec"
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 750412da6b..1ecaeae372 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -328,6 +328,7 @@  OBJS-$(CONFIG_OVERLAY_OPENCL_FILTER)         += vf_overlay_opencl.o opencl.o \
                                                 opencl/overlay.o framesync.o
 OBJS-$(CONFIG_OVERLAY_QSV_FILTER)            += vf_overlay_qsv.o framesync.o
 OBJS-$(CONFIG_OVERLAY_VULKAN_FILTER)         += vf_overlay_vulkan.o vulkan.o
+OBJS-$(CONFIG_OVERLAY_CUDA_FILTER)           += vf_overlay_cuda.o framesync.o vf_overlay_cuda.ptx.o
 OBJS-$(CONFIG_OWDENOISE_FILTER)              += vf_owdenoise.o
 OBJS-$(CONFIG_PAD_FILTER)                    += vf_pad.o
 OBJS-$(CONFIG_PAD_OPENCL_FILTER)             += vf_pad_opencl.o opencl.o opencl/pad.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index 501e5d041b..fb32bef788 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -312,6 +312,7 @@  extern AVFilter ff_vf_overlay;
 extern AVFilter ff_vf_overlay_opencl;
 extern AVFilter ff_vf_overlay_qsv;
 extern AVFilter ff_vf_overlay_vulkan;
+extern AVFilter ff_vf_overlay_cuda;
 extern AVFilter ff_vf_owdenoise;
 extern AVFilter ff_vf_pad;
 extern AVFilter ff_vf_pad_opencl;
diff --git a/libavfilter/vf_overlay_cuda.c b/libavfilter/vf_overlay_cuda.c
new file mode 100644
index 0000000000..63cb425b2d
--- /dev/null
+++ b/libavfilter/vf_overlay_cuda.c
@@ -0,0 +1,446 @@ 
+/*
+ * Copyright (c) 2020 Yaroslav Pogrebnyak <yyyaroslav@gmail.com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * Overlay one video on top of another using cuda hardware acceleration
+ */
+
+#include "libavutil/log.h"
+#include "libavutil/mem.h"
+#include "libavutil/opt.h"
+#include "libavutil/pixdesc.h"
+#include "libavutil/hwcontext.h"
+#include "libavutil/hwcontext_cuda_internal.h"
+#include "libavutil/cuda_check.h"
+
+#include "avfilter.h"
+#include "framesync.h"
+#include "internal.h"
+
+#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, ctx->hwctx->internal->cuda_dl, x)
+#define DIV_UP(a, b) ( ((a) + (b) - 1) / (b) )
+
+#define BLOCK_X 32
+#define BLOCK_Y 16
+
+static const enum AVPixelFormat supported_main_formats[] = {
+    AV_PIX_FMT_NV12,
+    AV_PIX_FMT_YUV420P,
+    AV_PIX_FMT_NONE,
+};
+
+static const enum AVPixelFormat supported_overlay_formats[] = {
+    AV_PIX_FMT_NV12,
+    AV_PIX_FMT_YUV420P,
+    AV_PIX_FMT_YUVA420P,
+    AV_PIX_FMT_NONE,
+};
+
+/**
+ * OverlayCUDAContext
+ */
+typedef struct OverlayCUDAContext {
+    const AVClass      *class;
+
+    enum AVPixelFormat in_format_overlay;
+    enum AVPixelFormat in_format_main;
+
+    AVBufferRef *device_ref;
+    AVCUDADeviceContext *hwctx;
+
+    CUcontext cu_ctx;
+    CUmodule cu_module;
+    CUfunction cu_func;
+    CUstream cu_stream;
+
+    FFFrameSync fs;
+
+    int x_position;
+    int y_position;
+
+} OverlayCUDAContext;
+
+/**
+ * Helper to find out if provided format is supported by filter
+ */
+static int format_is_supported(const enum AVPixelFormat formats[], enum AVPixelFormat fmt)
+{
+    for (int i = 0; formats[i] != AV_PIX_FMT_NONE; i++)
+        if (formats[i] == fmt)
+            return 1;
+    return 0;
+}
+
+/**
+ * Helper checks if we can process main and overlay pixel formats
+ */
+static int formats_match(const enum AVPixelFormat format_main, const enum AVPixelFormat format_overlay) {
+    switch(format_main) {
+    case AV_PIX_FMT_NV12:
+        return format_overlay == AV_PIX_FMT_NV12;
+    case AV_PIX_FMT_YUV420P:
+        return format_overlay == AV_PIX_FMT_YUV420P ||
+               format_overlay == AV_PIX_FMT_YUVA420P;
+    default:
+        return 0;
+    }
+}
+
+/**
+ * Call overlay kernell for a plane
+ */
+static int overlay_cuda_call_kernel(
+    OverlayCUDAContext *ctx,
+    int x_position, int y_position,
+    uint8_t* main_data, int main_linesize,
+    int main_width, int main_height,
+    uint8_t* overlay_data, int overlay_linesize,
+    int overlay_width, int overlay_height,
+    uint8_t* alpha_data, int alpha_linesize,
+    int alpha_adj_x, int alpha_adj_y) {
+
+    CudaFunctions *cu = ctx->hwctx->internal->cuda_dl;
+
+    void* kernel_args[] = {
+        &x_position, &y_position,
+        &main_data, &main_linesize,
+        &overlay_data, &overlay_linesize,
+        &overlay_width, &overlay_height,
+        &alpha_data, &alpha_linesize,
+        &alpha_adj_x, &alpha_adj_y,
+    };
+
+    return CHECK_CU(cu->cuLaunchKernel(
+        ctx->cu_func,
+        DIV_UP(main_width, BLOCK_X), DIV_UP(main_height, BLOCK_Y), 1,
+        BLOCK_X, BLOCK_Y, 1,
+        0, ctx->cu_stream, kernel_args, NULL));
+}
+
+/**
+ * Perform blend overlay picture over main picture
+ */
+static int overlay_cuda_blend(FFFrameSync *fs)
+{
+    int ret;
+
+    AVFilterContext *avctx = fs->parent;
+    OverlayCUDAContext *ctx = avctx->priv;
+    AVFilterLink *outlink = avctx->outputs[0];
+
+    CudaFunctions *cu = ctx->hwctx->internal->cuda_dl;
+    CUcontext dummy, cuda_ctx = ctx->hwctx->cuda_ctx;
+
+    AVFrame *input_main, *input_overlay, *out;
+
+    ctx->cu_ctx = cuda_ctx;
+
+    // read main and overlay frames from inputs
+
+    ret = ff_framesync_get_frame(fs, 0, &input_main, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = ff_framesync_get_frame(fs, 1, &input_overlay, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (!input_main || !input_overlay) {
+        return AVERROR_BUG;
+    }
+
+    ret = av_frame_make_writable(input_main);
+    if (ret < 0) {
+        return ret;
+    }
+
+    // push cuda context
+
+    ret = CHECK_CU(cu->cuCtxPushCurrent(cuda_ctx));
+    if (ret < 0) {
+        return ret;
+    }
+
+    // overlay first plane
+
+    overlay_cuda_call_kernel(ctx,
+        ctx->x_position, ctx->y_position,
+        input_main->data[0], input_main->linesize[0],
+        input_main->width, input_main->height,
+        input_overlay->data[0], input_overlay->linesize[0],
+        input_overlay->width, input_overlay->height,
+        input_overlay->data[3], input_overlay->linesize[3], 1, 1);
+
+    // overlay rest planes depending on pixel format
+
+    switch(ctx->in_format_overlay) {
+    case AV_PIX_FMT_NV12:
+        overlay_cuda_call_kernel(ctx,
+            ctx->x_position, ctx->y_position / 2,
+            input_main->data[1], input_main->linesize[1],
+            input_main->width, input_main->height / 2,
+            input_overlay->data[1], input_overlay->linesize[1],
+            input_overlay->width, input_overlay->height / 2,
+            0, 0, 0, 0);
+        break;
+    case AV_PIX_FMT_YUV420P:
+    case AV_PIX_FMT_YUVA420P:
+        overlay_cuda_call_kernel(ctx,
+            ctx->x_position / 2 , ctx->y_position / 2,
+            input_main->data[1], input_main->linesize[1],
+            input_main->width / 2, input_main->height / 2,
+            input_overlay->data[1], input_overlay->linesize[1],
+            input_overlay->width / 2, input_overlay->height / 2,
+            input_overlay->data[3], input_overlay->linesize[3], 2, 2);
+
+        overlay_cuda_call_kernel(ctx,
+            ctx->x_position / 2 , ctx->y_position / 2,
+            input_main->data[2], input_main->linesize[2],
+            input_main->width / 2, input_main->height / 2,
+            input_overlay->data[2], input_overlay->linesize[2],
+            input_overlay->width / 2, input_overlay->height / 2,
+            input_overlay->data[3], input_overlay->linesize[3], 2, 2);
+        break;
+    default:
+        av_log(ctx, AV_LOG_ERROR, "Passed unsupported overlay pixel format\n");
+        return AVERROR_BUG;
+    }
+
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+
+    out = av_frame_alloc();
+    av_frame_ref(out, input_main);
+    av_frame_copy_props(out, input_main);
+
+    return ff_filter_frame(outlink, out);
+}
+
+/**
+ * Initialize overlay_cuda
+ */
+static av_cold int overlay_cuda_init(AVFilterContext *avctx)
+{
+    OverlayCUDAContext* ctx = avctx->priv;
+    ctx->fs.on_event = &overlay_cuda_blend;
+
+    return 0;
+}
+
+/**
+ * Uninitialize overlay_cuda
+ */
+static av_cold void overlay_cuda_uninit(AVFilterContext *avctx)
+{
+    OverlayCUDAContext* ctx = avctx->priv;
+
+    ff_framesync_uninit(&ctx->fs);
+
+    if (ctx->hwctx && ctx->cu_module) {
+        CUcontext dummy;
+        CudaFunctions *cu = ctx->hwctx->internal->cuda_dl;
+        CHECK_CU(cu->cuCtxPushCurrent(ctx->cu_ctx));
+        CHECK_CU(cu->cuModuleUnload(ctx->cu_module));
+        CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    }
+}
+
+/**
+ * Activate overlay_cuda
+ */
+static int overlay_cuda_activate(AVFilterContext *avctx)
+{
+    OverlayCUDAContext *ctx = avctx->priv;
+
+    return ff_framesync_activate(&ctx->fs);
+}
+
+/**
+ * Query formats
+ */
+static int overlay_cuda_query_formats(AVFilterContext *avctx)
+{
+    static const enum AVPixelFormat pixel_formats[] = {
+        AV_PIX_FMT_CUDA, AV_PIX_FMT_NONE,
+    };
+
+    AVFilterFormats *pix_fmts = ff_make_format_list(pixel_formats);
+
+    return ff_set_common_formats(avctx, pix_fmts);
+}
+
+/**
+ * Configure output
+ */
+static int overlay_cuda_config_output(AVFilterLink *outlink)
+{
+
+    extern char vf_overlay_cuda_ptx[];
+
+    int err;
+    AVFilterContext* avctx = outlink->src;
+    OverlayCUDAContext* ctx = avctx->priv;
+
+    AVFilterLink *inlink = avctx->inputs[0];
+    AVHWFramesContext  *frames_ctx = (AVHWFramesContext*)inlink->hw_frames_ctx->data;
+
+    AVFilterLink *inlink_overlay = avctx->inputs[1];
+    AVHWFramesContext  *frames_ctx_overlay = (AVHWFramesContext*)inlink_overlay->hw_frames_ctx->data;
+
+    CUcontext dummy, cuda_ctx;
+    CudaFunctions *cu;
+
+    // check main input formats
+
+    if (!frames_ctx) {
+        av_log(ctx, AV_LOG_ERROR, "No hw context provided on main input\n");
+        return AVERROR(EINVAL);
+    }
+
+    ctx->in_format_main = frames_ctx->sw_format;
+    if (!format_is_supported(supported_main_formats, ctx->in_format_main)) {
+        av_log(ctx, AV_LOG_ERROR, "Unsupported main input format: %s\n",
+               av_get_pix_fmt_name(ctx->in_format_main));
+        return AVERROR(ENOSYS);
+    }
+
+    // check overlay input formats
+
+    if (!frames_ctx_overlay) {
+        av_log(ctx, AV_LOG_ERROR, "No hw context provided on overlay input\n");
+        return AVERROR(EINVAL);
+    }
+
+    ctx->in_format_overlay = frames_ctx_overlay->sw_format;
+    if (!format_is_supported(supported_overlay_formats, ctx->in_format_overlay)) {
+        av_log(ctx, AV_LOG_ERROR, "Unsupported overlay input format: %s\n",
+            av_get_pix_fmt_name(ctx->in_format_overlay));
+        return AVERROR(ENOSYS);
+    }
+
+    // check we can overlay pictures with those pixel formats
+
+    if (!formats_match(ctx->in_format_main, ctx->in_format_overlay)) {
+        av_log(ctx, AV_LOG_ERROR, "Can't overlay %s on %s \n",
+            av_get_pix_fmt_name(ctx->in_format_overlay), av_get_pix_fmt_name(ctx->in_format_main));
+        return AVERROR(EINVAL);
+    }
+
+    // initialize
+
+    ctx->hwctx = frames_ctx->device_ctx->hwctx;
+    cuda_ctx = ctx->hwctx->cuda_ctx;
+    ctx->fs.time_base = inlink->time_base;
+
+    ctx->cu_stream = ctx->hwctx->stream;
+    ctx->device_ref = ((AVHWFramesContext*)inlink->hw_frames_ctx->data)->device_ref;
+
+    outlink->hw_frames_ctx = av_buffer_ref(inlink->hw_frames_ctx);
+
+    // load functions
+
+    cu = ctx->hwctx->internal->cuda_dl;
+
+    err = CHECK_CU(cu->cuCtxPushCurrent(cuda_ctx));
+    if (err < 0) {
+        return err;
+    }
+
+    err = CHECK_CU(cu-> cuModuleLoadData(&ctx->cu_module, vf_overlay_cuda_ptx));
+    if (err < 0) {
+        return err;
+    }
+
+    err = CHECK_CU(cu->cuModuleGetFunction(&ctx->cu_func, ctx->cu_module, "Overlay_Cuda"));
+    if (err < 0) {
+        return err;
+    }
+
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+
+    // init dual input
+
+    err = ff_framesync_init_dualinput(&ctx->fs, avctx);
+    if (err < 0) {
+        return err;
+    }
+
+    return ff_framesync_configure(&ctx->fs);
+}
+
+
+#define OFFSET(x) offsetof(OverlayCUDAContext, x)
+#define FLAGS (AV_OPT_FLAG_FILTERING_PARAM | AV_OPT_FLAG_VIDEO_PARAM)
+
+static const AVOption overlay_cuda_options[] = {
+    { "x", "Overlay x position",
+      OFFSET(x_position), AV_OPT_TYPE_INT, { .i64 = 0 }, INT_MIN, INT_MAX, .flags = FLAGS },
+    { "y", "Overlay y position",
+      OFFSET(y_position), AV_OPT_TYPE_INT, { .i64 = 0 }, INT_MIN, INT_MAX, .flags = FLAGS },
+    { "eof_action", "Action to take when encountering EOF from secondary input ",
+        OFFSET(fs.opt_eof_action), AV_OPT_TYPE_INT, { .i64 = EOF_ACTION_REPEAT },
+        EOF_ACTION_REPEAT, EOF_ACTION_PASS, .flags = FLAGS, "eof_action" },
+        { "repeat", "Repeat the previous frame.",   0, AV_OPT_TYPE_CONST, { .i64 = EOF_ACTION_REPEAT }, .flags = FLAGS, "eof_action" },
+        { "endall", "End both streams.",            0, AV_OPT_TYPE_CONST, { .i64 = EOF_ACTION_ENDALL }, .flags = FLAGS, "eof_action" },
+        { "pass",   "Pass through the main input.", 0, AV_OPT_TYPE_CONST, { .i64 = EOF_ACTION_PASS },   .flags = FLAGS, "eof_action" },
+    { "shortest", "force termination when the shortest input terminates", OFFSET(fs.opt_shortest), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, FLAGS },
+    { "repeatlast", "repeat overlay of the last overlay frame", OFFSET(fs.opt_repeatlast), AV_OPT_TYPE_BOOL, {.i64=1}, 0, 1, FLAGS },
+    { NULL },
+};
+
+FRAMESYNC_DEFINE_CLASS(overlay_cuda, OverlayCUDAContext, fs);
+
+static const AVFilterPad overlay_cuda_inputs[] = {
+    {
+        .name         = "main",
+        .type         = AVMEDIA_TYPE_VIDEO,
+    },
+    {
+        .name         = "overlay",
+        .type         = AVMEDIA_TYPE_VIDEO,
+    },
+    { NULL }
+};
+
+static const AVFilterPad overlay_cuda_outputs[] = {
+    {
+        .name          = "default",
+        .type          = AVMEDIA_TYPE_VIDEO,
+        .config_props  = &overlay_cuda_config_output,
+    },
+    { NULL }
+};
+
+AVFilter ff_vf_overlay_cuda = {
+    .name            = "overlay_cuda",
+    .description     = NULL_IF_CONFIG_SMALL("Overlay one video on top of another using CUDA"),
+    .priv_size       = sizeof(OverlayCUDAContext),
+    .priv_class      = &overlay_cuda_class,
+    .init            = &overlay_cuda_init,
+    .uninit          = &overlay_cuda_uninit,
+    .activate        = &overlay_cuda_activate,
+    .query_formats   = &overlay_cuda_query_formats,
+    .inputs          = overlay_cuda_inputs,
+    .outputs         = overlay_cuda_outputs,
+    .preinit         = overlay_cuda_framesync_preinit,
+    .flags_internal  = FF_FILTER_FLAG_HWFRAME_AWARE,
+};
diff --git a/libavfilter/vf_overlay_cuda.cu b/libavfilter/vf_overlay_cuda.cu
new file mode 100644
index 0000000000..43ec36c2ed
--- /dev/null
+++ b/libavfilter/vf_overlay_cuda.cu
@@ -0,0 +1,54 @@ 
+/*
+ * Copyright (c) 2020 Yaroslav Pogrebnyak <yyyaroslav@gmail.com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+extern "C" {
+
+__global__ void Overlay_Cuda(
+    int x_position, int y_position,
+    unsigned char* main, int main_linesize,
+    unsigned char* overlay, int overlay_linesize,
+    int overlay_w, int overlay_h,
+    unsigned char* overlay_alpha, int alpha_linesize,
+    int alpha_adj_x, int alpha_adj_y)
+{
+    int x = blockIdx.x * blockDim.x + threadIdx.x;
+    int y = blockIdx.y * blockDim.y + threadIdx.y;
+
+    if (x >= overlay_w + x_position ||
+        y >= overlay_h + y_position ||
+        x < x_position ||
+        y < y_position ) {
+
+        return;
+    }
+
+    int overlay_x = x - x_position;
+    int overlay_y = y - y_position;
+
+    float alpha = 1.0;
+    if (alpha_linesize) {
+        alpha = overlay_alpha[alpha_adj_x * overlay_x  + alpha_adj_y * overlay_y * alpha_linesize] / 255.0f;
+    }
+
+    main[x + y*main_linesize] = alpha * overlay[overlay_x + overlay_y * overlay_linesize] + (1.0f - alpha) * main[x + y*main_linesize];
+}
+
+}
+