[FFmpeg-devel] avfilter: add vf_overlay_cuda

Message ID 20200318071955.2329-1-yyyaroslav@gmail.com
State Accepted
Series [FFmpeg-devel] avfilter: add vf_overlay_cuda

Checks

Context Check Description
andriy/ffmpeg-patchwork pending
andriy/ffmpeg-patchwork success Applied patch
andriy/ffmpeg-patchwork success Configure finished
andriy/ffmpeg-patchwork success Make finished
andriy/ffmpeg-patchwork success Make fate finished

Commit Message

Yaroslav Pogrebnyak March 18, 2020, 7:19 a.m. UTC
This patch adds the 'vf_overlay_cuda' filter.
It draws one picture on top of another on a CUDA GPU.
For the end user, it is similar to 'vf_overlay_opencl' and other overlay filters.
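
As a rough illustration of what "drawing one picture on top of another" means per sample, here is a minimal CPU-side sketch in C. It is not the CUDA kernel from this patch; the function name `blend_sample` and the integer rounding scheme are assumptions made for the example.

```c
#include <stdint.h>

/* Illustrative only, not the kernel from this patch: blend one 8-bit
 * overlay sample over one main sample.
 * alpha = 255 means the overlay fully replaces the main sample,
 * alpha = 0 leaves the main sample untouched. */
uint8_t blend_sample(uint8_t main_px, uint8_t overlay_px, uint8_t alpha)
{
    /* Weighted average with rounding. The largest intermediate value is
     * 255 * 255 + 127 = 65152, which fits comfortably in unsigned int. */
    unsigned v = overlay_px * alpha + main_px * (255u - alpha) + 127u;
    return (uint8_t)(v / 255u);
}
```

When the overlay has no alpha plane (NV12 or YUV420P), this reduces to copying the overlay sample, which corresponds to calling the helper with alpha = 255.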

This filter would be especially useful for building video processing pipelines that execute fully on the CUDA GPU. For example, the following pipeline would be possible: decode -> scale -> overlay -> encode, without copying frames between CPU and GPU in between.

Technical details.

Supported sw input formats are NV12 and YUV420P for the main input, and NV12, YUV420P and YUVA420P for the overlay input.
Main and overlay sw formats must match (e.g., overlaying YUVA420P on NV12 is not implemented).
All pixel format conversions must be done with the 'format' or 'scale_npp' filters before 'overlay_cuda'.
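
The format rules above can be sketched as a small check in C. This is only an illustration: the enum `sw_format` and the function `overlay_formats_supported` are hypothetical stand-ins, not identifiers from the patch, and treating a YUVA420P overlay as compatible with a YUV420P main input is inferred from the third example below rather than stated explicitly.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the AV_PIX_FMT_* values named above. */
enum sw_format { FMT_NV12, FMT_YUV420P, FMT_YUVA420P, FMT_OTHER };

/* Returns true when the (main, overlay) pair is one that the commit
 * message describes as supported. */
bool overlay_formats_supported(enum sw_format main_fmt,
                               enum sw_format overlay_fmt)
{
    switch (main_fmt) {
    case FMT_NV12:
        /* NV12 main input only accepts an NV12 overlay. */
        return overlay_fmt == FMT_NV12;
    case FMT_YUV420P:
        /* Planar main input accepts planar overlays, with or without alpha. */
        return overlay_fmt == FMT_YUV420P || overlay_fmt == FMT_YUVA420P;
    default:
        /* Anything else must be converted before 'overlay_cuda'. */
        return false;
    }
}
```

A pair that fails this check is a case where a 'format' or 'scale_npp' step would be needed ahead of the overlay.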

'hwcontext_cuda.c' had to be modified slightly to allow overlays with an alpha channel:
 - Allow AV_PIX_FMT_YUVA420P, to enable hwuploading frames with an alpha channel to the GPU.
 - Do not shift the height of the 4th plane (alpha) when uploading to the GPU.
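
The second point can be illustrated with a small C sketch of per-plane heights for a 4:2:0 frame with alpha. It is not the code from 'hwcontext_cuda.c'; the helper name `plane_height` and its parameters are assumptions for the example. The idea is that only the two chroma planes are vertically subsampled, while luma and alpha keep the full frame height.

```c
#include <assert.h>

/* Hypothetical helper, not the code from this patch: height in rows of one
 * plane of a planar 4:2:0 frame with optional alpha (YUVA420P layout).
 * Planes: 0 = Y, 1 = U, 2 = V, 3 = A. */
int plane_height(int plane, int frame_height, int log2_chroma_h)
{
    assert(plane >= 0 && plane < 4);
    assert(frame_height > 0 && log2_chroma_h >= 0);

    /* Only the chroma planes are vertically subsampled (rounding up). */
    if (plane == 1 || plane == 2)
        return (frame_height + (1 << log2_chroma_h) - 1) >> log2_chroma_h;

    /* Luma and alpha keep the full frame height. */
    return frame_height;
}
```

For a 1080-row YUVA420P frame (log2_chroma_h of 1), this gives 1080 rows for Y and A and 540 rows for U and V. Shifting the alpha plane as well would copy only half of it.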

Examples.

- Overlay picture on top of video (main: YUVJ420P->NV12, overlay: NV12)
$ ffmpeg -y -init_hw_device cuda=cuda -filter_hw_device cuda -hwaccel cuvid \
  -c:v h264_cuvid -i main.mp4 \
  -i ~/overlay.jpg \
  -filter_complex "[1:v]format=nv12, hwupload[overlay], [0:v][overlay]overlay_cuda=x=0:y=0:shortest=false" \
  -an -c:v h264_nvenc -b:v 5M output.mp4

- Overlay one video on top of another (main: NV12, overlay: NV12)
$ ffmpeg -y \
  -hwaccel cuvid -c:v h264_cuvid -i main.mp4 \
  -hwaccel cuvid -c:v h264_cuvid -i overlay.mp4 \
  -filter_complex "[1:v]scale_npp=512:-1[o], [v:0][o]overlay_cuda=x=100:y=100:shortest=true" \
  -an -c:v h264_nvenc -b:v 5M output.mp4

- Overlay picture with alpha channel on top of video (main: NV12->YUV420P, overlay: RGBA->YUVA420P)
$ ffmpeg -y \
  -init_hw_device cuda=cuda -filter_hw_device cuda -hwaccel cuvid \
  -c:v h264_cuvid -i ~/main.mp4 \
  -i ~/overlay.png \
  -filter_complex "[1:v]format=yuva420p, hwupload[o], [v:0]scale_npp=format=yuv420p[m], [m][o]overlay_cuda=x=0:y=0:shortest=false" \
  -an -c:v h264_nvenc -b:v 5M output.mp4

Patch attached.

P.S. This is my first patch, I would be grateful for any feedback to know if I'm doing things correctly or not.
Thanks!


Signed-off-by: Yaroslav Pogrebnyak <yyyaroslav@gmail.com>
---
 configure                      |   2 +
 libavfilter/Makefile           |   1 +
 libavfilter/allfilters.c       |   1 +
 libavfilter/vf_overlay_cuda.c  | 451 +++++++++++++++++++++++++++++++++
 libavfilter/vf_overlay_cuda.cu |  54 ++++
 libavutil/hwcontext_cuda.c     |   3 +-
 6 files changed, 511 insertions(+), 1 deletion(-)
 create mode 100644 libavfilter/vf_overlay_cuda.c
 create mode 100644 libavfilter/vf_overlay_cuda.cu

Comments

Carl Eugen Hoyos March 18, 2020, 9:18 a.m. UTC | #1
On Wed, 18 March 2020 at 08:29, Yaroslav Pogrebnyak
<yyyaroslav@gmail.com> wrote:

> P.S. This is my first patch, I would be grateful for any feedback to know
> if I'm doing things correctly or not.

We use a different indentation for switch() statements.

I also wanted to remind you to keep the options in sync with the overlay
filter, but from a quick look you have already done so ;-)

Carl Eugen
Timo Rothenpieler March 18, 2020, 11:15 a.m. UTC | #2
Please split the changes into two distinct patches.
One for the hwcontext changes, one for the new filter.
Alex April 1, 2020, 1:43 p.m. UTC | #3
Hi! Is it working? I have tried everything but constantly get an error from overlay_cuda:


ffmpeg -y -init_hw_device cuda=cuda -filter_hw_device cuda -hwaccel cuvid -c:v h264_cuvid -resize 1920x1080 -i 720p.mp4 -i watermark.png -filter_complex "[1:v]format=nv12,hwupload[img];[0:v][img]overlay_cuda=x=50:y=800[out]" -map [out] -c:v h264_nvenc -b:v 6M -an -preset fast  -y out_nvenc_overlay.mp4
...
ffmpeg version git-2020-04-01-afa5e38
...
[h264_cuvid @ 000001dd1b356d00] CUVID capabilities for h264_cuvid:
[h264_cuvid @ 000001dd1b356d00] 8 bit: supported: 1, min_width: 48, max_width: 4096, min_height: 16, max_height: 4096
[h264_cuvid @ 000001dd1b356d00] 10 bit: supported: 0, min_width: 0, max_width: 0, min_height: 0, max_height: 0
[h264_cuvid @ 000001dd1b356d00] 12 bit: supported: 0, min_width: 0, max_width: 0, min_height: 0, max_height: 0
Stream mapping:
  Stream #0:0 (h264_cuvid) -> overlay_cuda:main
  Stream #1:0 (png) -> format
  overlay_cuda -> Stream #0:0 (h264_nvenc)
Press [q] to stop, [?] for help
[h264_cuvid @ 000001dd1b356d00] Formats: Original: cuda | HW: cuda | SW: nv12
[graph 0 input from stream 1:0 @ 000001dd2e84a100] w:1894 h:302 pixfmt:rgba tb:1/25 fr:25/1 sar:11811/11811
[graph 0 input from stream 0:0 @ 000001dd2e84ae00] w:1920 h:1080 pixfmt:cuda tb:1/24000 fr:24000/1001 sar:1/1
[auto_scaler_0 @ 000001dd2ebf4cc0] w:iw h:ih flags:'bilinear' interl:0
[Parsed_format_0 @ 000001dd2e849780] auto-inserting filter 'auto_scaler_0' between the filter 'graph 0 input from stream 1:0' and the filter 'Parsed_format_0'
[auto_scaler_0 @ 000001dd2ebf4cc0] w:1894 h:302 fmt:rgba sar:11811/11811 -> w:1894 h:302 fmt:nv12 sar:1/1 flags:0x2
[overlay_cuda @ 000001dd2ebc87c0] cu->cuModuleLoadData(&ctx->cu_module, vf_overlay_cuda_ptx) failed -> CUDA_ERROR_INVALID_IMAGE: device kernel image is invalid
[Parsed_overlay_cuda_2 @ 000001dd2e84b6c0] Failed to configure output pad on Parsed_overlay_cuda_2
Error reinitializing filters!
Failed to inject frame into filter network: Generic error in an external library
Error while processing the decoded data for stream #0:0
...



Dennis Mungai April 1, 2020, 1:50 p.m. UTC | #4
On Wed, 1 Apr 2020 at 16:43, Alex <3.14pi@ukr.net> wrote:

> Hi!Is it working? I try everything but constantly get error from
> overlay_cuda:
> [...]

How does the NVDEC path work out?

Try this:

ffmpeg -y -init_hw_device cuda=cuda -filter_hw_device cuda -hwaccel cuda
-hwaccel_output_format cuda -i 720p.mp4 -i watermark.png -filter_complex
"[1:v]format=nv12,hwupload[img];[0:v][img]overlay_cuda=x=50:y=800[out]"
-map [out] -c:v h264_nvenc -b:v 6M -an -preset fast  -y
out_nvenc_overlay.mp4
Timo Rothenpieler April 1, 2020, 1:51 p.m. UTC | #5
On 01.04.2020 15:43, Alex wrote:
> Hi!Is it working? I try everything but constantly get error from overlay_cuda:
> [...]
> [overlay_cuda @ 000001dd2ebc87c0] cu->cuModuleLoadData(&ctx->cu_module, vf_overlay_cuda_ptx) failed -> CUDA_ERROR_INVALID_IMAGE: device kernel image is invalid

Looks like your GPU does not support some of the required features of
this filter's CUDA kernel, which seems odd given its general simplicity.
What GPU is that?
Timo Rothenpieler April 1, 2020, 1:52 p.m. UTC | #6
> How does the NVDEC path work out?

It fails to even initialize the filter on his hardware, no path will 
work any differently.
Alex April 1, 2020, 2 p.m. UTC | #7
Hi! My GPU is a GTX 1080 Ti.
I tried your command but got the same error.
I tested with a Windows build downloaded from https://ffmpeg.zeranoe.com/builds/

Stream mapping:
  Stream #0:0 (h264) -> overlay_cuda:main
  Stream #1:0 (png) -> format
  overlay_cuda -> Stream #0:0 (h264_nvenc)
Press [q] to stop, [?] for help
[h264 @ 00000231eee7ce40] NVDEC capabilities:
[h264 @ 00000231eee7ce40] format supported: yes, max_mb_count: 65536
[h264 @ 00000231eee7ce40] min_width: 48, max_width: 4096
[h264 @ 00000231eee7ce40] min_height: 16, max_height: 4096
[h264 @ 00000231eee7ce40] Reinit context to 1280x720, pix_fmt: cuda
[graph 0 input from stream 1:0 @ 0000023182422180] w:1894 h:302 pixfmt:rgba tb:1/25 fr:25/1 sar:11811/11811
[graph 0 input from stream 0:0 @ 000002318bbe1540] w:1280 h:720 pixfmt:cuda tb:1/24000 fr:24000/1001 sar:1/1
[auto_scaler_0 @ 000002318bbe55c0] w:iw h:ih flags:'bilinear' interl:0
[Parsed_format_0 @ 00000231825e4bc0] auto-inserting filter 'auto_scaler_0' between the filter 'graph 0 input from stream 1:0' and the filter 'Parsed_format_0'
[auto_scaler_0 @ 000002318bbe55c0] w:1894 h:302 fmt:rgba sar:11811/11811 -> w:1894 h:302 fmt:nv12 sar:1/1 flags:0x2
[overlay_cuda @ 0000023182798140] cu->cuModuleLoadData(&ctx->cu_module, vf_overlay_cuda_ptx) failed -> CUDA_ERROR_INVALID_IMAGE: device kernel image is invalid
[Parsed_overlay_cuda_2 @ 0000023182431d40] Failed to configure output pad on Parsed_overlay_cuda_2
Error reinitializing filters!
Failed to inject frame into filter network: Generic error in an external library
Error while processing the decoded data for stream #0:0
[AVIOContext @ 0000023182437840] Statistics: 0 seeks, 0 writeouts
[AVIOContext @ 00000231eee87b80] Statistics: 409657 bytes read, 2 seeks
[AVIOContext @ 000002318248e700] Statistics: 67602 bytes read, 0 seeks
Conversion failed!


Alex April 1, 2020, 2:01 p.m. UTC | #8
My GPU is GTX 1080Ti.

Alex April 1, 2020, 2:14 p.m. UTC | #9
Driver 445.75 (Win 10)

Timo Rothenpieler April 1, 2020, 3:03 p.m. UTC | #10
On 01.04.2020 16:14, Alex wrote:
> Driver 445.75 (Win 10)

I just tested a similar command line on the same driver:

./ffmpeg_g.exe -v verbose -init_hw_device cuda=cuda -filter_hw_device 
cuda -hwaccel_output_format cuda -hwaccel cuda -i recode.mkv -i test.png 
-filter_complex 
"[1:v]format=nv12,hwupload[p],[0:v][p]overlay_cuda=x=100:y=100" -an -c:v 
h264_nvenc -y out.mp4

And that works perfectly fine on an RTX 2070S. Make sure your driver
installation is not somehow damaged.
Alex April 1, 2020, 3:12 p.m. UTC | #11
The driver is OK (I just updated it, with the same result), and nvenc works without the CUDA filters. But with overlay_cuda or scale_cuda, ffmpeg fails. I then installed ffmpeg on a Linux machine and everything seems to work there, but not on Windows for me.

Alex April 1, 2020, 3:42 p.m. UTC | #12
But my Linux machine has an RTX 2070 card, not a 1080 Ti. Have you tested the filter on a 10xx GPU?

Timo Rothenpieler April 1, 2020, 3:45 p.m. UTC | #13
On 01.04.2020 17:12, Alex wrote:
> The driver is OK (I just updated it, with the same result), and nvenc works without the CUDA filters. But with overlay_cuda or scale_cuda, ffmpeg fails. I then installed ffmpeg on a Linux machine and everything seems to work there, but not on Windows for me.

Do not top-post on this list.

I just tested this, and with Zeranoe's Windows builds I am getting the
same kind of CUDA kernel errors.
With my own build, everything works fine.

So you will have to complain to Zeranoe or whoever made your binaries.

Patch

diff --git a/configure b/configure
index 18f2841765..b08dc7bd62 100755
--- a/configure
+++ b/configure
@@ -3026,6 +3026,8 @@  scale_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 thumbnail_cuda_filter_deps="ffnvcodec"
 thumbnail_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 transpose_npp_filter_deps="ffnvcodec libnpp"
+overlay_cuda_filter_deps="ffnvcodec"
+overlay_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
 
 amf_deps_any="libdl LoadLibrary"
 nvenc_deps="ffnvcodec"
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 750412da6b..1ecaeae372 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -328,6 +328,7 @@  OBJS-$(CONFIG_OVERLAY_OPENCL_FILTER)         += vf_overlay_opencl.o opencl.o \
                                                 opencl/overlay.o framesync.o
 OBJS-$(CONFIG_OVERLAY_QSV_FILTER)            += vf_overlay_qsv.o framesync.o
 OBJS-$(CONFIG_OVERLAY_VULKAN_FILTER)         += vf_overlay_vulkan.o vulkan.o
+OBJS-$(CONFIG_OVERLAY_CUDA_FILTER)           += vf_overlay_cuda.o framesync.o vf_overlay_cuda.ptx.o
 OBJS-$(CONFIG_OWDENOISE_FILTER)              += vf_owdenoise.o
 OBJS-$(CONFIG_PAD_FILTER)                    += vf_pad.o
 OBJS-$(CONFIG_PAD_OPENCL_FILTER)             += vf_pad_opencl.o opencl.o opencl/pad.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index 501e5d041b..fb32bef788 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -312,6 +312,7 @@  extern AVFilter ff_vf_overlay;
 extern AVFilter ff_vf_overlay_opencl;
 extern AVFilter ff_vf_overlay_qsv;
 extern AVFilter ff_vf_overlay_vulkan;
+extern AVFilter ff_vf_overlay_cuda;
 extern AVFilter ff_vf_owdenoise;
 extern AVFilter ff_vf_pad;
 extern AVFilter ff_vf_pad_opencl;
diff --git a/libavfilter/vf_overlay_cuda.c b/libavfilter/vf_overlay_cuda.c
new file mode 100644
index 0000000000..2fa4ea4443
--- /dev/null
+++ b/libavfilter/vf_overlay_cuda.c
@@ -0,0 +1,451 @@ 
+/*
+ * Copyright (c) 2020 Yaroslav Pogrebnyak <yyyaroslav@gmail.com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * Overlay one video on top of another using CUDA hardware acceleration
+ */
+
+#include "libavutil/log.h"
+#include "libavutil/mem.h"
+#include "libavutil/opt.h"
+#include "libavutil/pixdesc.h"
+#include "libavutil/hwcontext.h"
+#include "libavutil/hwcontext_cuda_internal.h"
+#include "libavutil/cuda_check.h"
+
+#include "avfilter.h"
+#include "framesync.h"
+#include "internal.h"
+
+#define CHECK_CU(x) FF_CUDA_CHECK_DL(ctx, ctx->hwctx->internal->cuda_dl, x)
+#define DIV_UP(a, b) ( ((a) + (b) - 1) / (b) )
+
+#define BLOCK_X 32
+#define BLOCK_Y 16
+
+static const enum AVPixelFormat supported_main_formats[] = {
+    AV_PIX_FMT_NV12,
+    AV_PIX_FMT_YUV420P,
+    AV_PIX_FMT_NONE,
+};
+
+static const enum AVPixelFormat supported_overlay_formats[] = {
+    AV_PIX_FMT_NV12,
+    AV_PIX_FMT_YUV420P,
+    AV_PIX_FMT_YUVA420P,
+    AV_PIX_FMT_NONE,
+};
+
+/**
+ * OverlayCUDAContext
+ */
+typedef struct OverlayCUDAContext {
+    const AVClass      *class;
+
+    enum AVPixelFormat in_format_overlay;
+    enum AVPixelFormat in_format_main;
+
+    AVBufferRef *device_ref;
+    AVCUDADeviceContext *hwctx;
+
+    CUcontext cu_ctx;
+    CUmodule cu_module;
+    CUfunction cu_func;
+    CUstream cu_stream;
+
+    FFFrameSync fs;
+
+    int x_position;
+    int y_position;
+
+} OverlayCUDAContext;
+
+/**
+ * Helper to find out if provided format is supported by filter
+ */
+static int format_is_supported(const enum AVPixelFormat formats[], enum AVPixelFormat fmt)
+{
+    for (int i = 0; formats[i] != AV_PIX_FMT_NONE; i++)
+        if (formats[i] == fmt)
+            return 1;
+    return 0;
+}
+
+/**
+ * Helper checks if we can process main and overlay pixel formats
+ */
+static int formats_match(const enum AVPixelFormat format_main, const enum AVPixelFormat format_overlay) {
+    switch(format_main) {
+        case AV_PIX_FMT_NV12:
+            return format_overlay == AV_PIX_FMT_NV12;
+        case AV_PIX_FMT_YUV420P:
+            return format_overlay == AV_PIX_FMT_YUV420P ||
+                   format_overlay == AV_PIX_FMT_YUVA420P;
+        default:
+            return 0;
+    }
+}
+
+/**
+ * Call the overlay kernel for a single plane
+ */
+static int overlay_cuda_call_kernel(
+    OverlayCUDAContext *ctx,
+    int x_position, int y_position,
+    uint8_t* main_data, int main_linesize,
+    int main_width, int main_height,
+    uint8_t* overlay_data, int overlay_linesize,
+    int overlay_width, int overlay_height,
+    uint8_t* alpha_data, int alpha_linesize,
+    int alpha_adj_x, int alpha_adj_y) {
+
+    CudaFunctions *cu = ctx->hwctx->internal->cuda_dl;
+
+    void* kernel_args[] = {
+        &x_position, &y_position,
+        &main_data, &main_linesize,
+        &overlay_data, &overlay_linesize,
+        &overlay_width, &overlay_height,
+        &alpha_data, &alpha_linesize,
+        &alpha_adj_x, &alpha_adj_y,
+    };
+
+    return CHECK_CU(cu->cuLaunchKernel(
+        ctx->cu_func,
+        DIV_UP(main_width, BLOCK_X), DIV_UP(main_height, BLOCK_Y), 1,
+        BLOCK_X, BLOCK_Y, 1,
+        0, ctx->cu_stream, kernel_args, NULL));
+}
+
+/**
+ * Blend the overlay picture over the main picture
+ */
+static int overlay_cuda_blend(FFFrameSync *fs)
+{
+    int ret;
+
+    AVFilterContext *avctx = fs->parent;
+    OverlayCUDAContext *ctx = avctx->priv;
+    AVFilterLink *outlink = avctx->outputs[0];
+
+    CudaFunctions *cu = ctx->hwctx->internal->cuda_dl;
+    CUcontext dummy, cuda_ctx = ctx->hwctx->cuda_ctx;
+
+    AVFrame *input_main, *input_overlay, *out;
+
+    ctx->cu_ctx = cuda_ctx;
+
+    // read main and overlay frames from inputs
+
+    ret = ff_framesync_get_frame(fs, 0, &input_main, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = ff_framesync_get_frame(fs, 1, &input_overlay, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (!input_main || !input_overlay) {
+        return AVERROR_BUG;
+    }
+
+    ret = av_frame_make_writable(input_main);
+    if (ret < 0) {
+        return ret;
+    }
+
+    // push cuda context
+
+    ret = CHECK_CU(cu->cuCtxPushCurrent(cuda_ctx));
+    if (ret < 0) {
+        return ret;
+    }
+
+    // overlay first plane
+
+    overlay_cuda_call_kernel(ctx,
+        ctx->x_position, ctx->y_position,
+        input_main->data[0], input_main->linesize[0],
+        input_main->width, input_main->height,
+        input_overlay->data[0], input_overlay->linesize[0],
+        input_overlay->width, input_overlay->height,
+        input_overlay->data[3], input_overlay->linesize[3], 1, 1);
+
+    // overlay the remaining planes, depending on pixel format
+
+    switch(ctx->in_format_overlay) {
+
+    case AV_PIX_FMT_NV12:
+        overlay_cuda_call_kernel(ctx,
+            ctx->x_position, ctx->y_position / 2,
+            input_main->data[1], input_main->linesize[1],
+            input_main->width, input_main->height / 2,
+            input_overlay->data[1], input_overlay->linesize[1],
+            input_overlay->width, input_overlay->height / 2,
+            0, 0, 0, 0);
+
+        break;
+
+    case AV_PIX_FMT_YUV420P:
+    case AV_PIX_FMT_YUVA420P:
+        overlay_cuda_call_kernel(ctx,
+            ctx->x_position / 2 , ctx->y_position / 2,
+            input_main->data[1], input_main->linesize[1],
+            input_main->width / 2, input_main->height / 2,
+            input_overlay->data[1], input_overlay->linesize[1],
+            input_overlay->width / 2, input_overlay->height / 2,
+            input_overlay->data[3], input_overlay->linesize[3], 2, 2);
+
+        overlay_cuda_call_kernel(ctx,
+            ctx->x_position / 2 , ctx->y_position / 2,
+            input_main->data[2], input_main->linesize[2],
+            input_main->width / 2, input_main->height / 2,
+            input_overlay->data[2], input_overlay->linesize[2],
+            input_overlay->width / 2, input_overlay->height / 2,
+            input_overlay->data[3], input_overlay->linesize[3], 2, 2);
+
+        break;
+
+    default:
+        av_log(ctx, AV_LOG_ERROR, "Passed unsupported overlay pixel format\n");
+        return AVERROR_BUG;
+    }
+
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+
+    out = av_frame_clone(input_main);
+    if (!out)
+        return AVERROR(ENOMEM);
+
+    return ff_filter_frame(outlink, out);
+}
+
+/**
+ * Initialize overlay_cuda
+ */
+static av_cold int overlay_cuda_init(AVFilterContext *avctx)
+{
+    OverlayCUDAContext* ctx = avctx->priv;
+    ctx->fs.on_event = &overlay_cuda_blend;
+
+    return 0;
+}
+
+/**
+ * Uninitialize overlay_cuda
+ */
+static av_cold void overlay_cuda_uninit(AVFilterContext *avctx)
+{
+    OverlayCUDAContext* ctx = avctx->priv;
+
+    ff_framesync_uninit(&ctx->fs);
+
+    if (ctx->hwctx && ctx->cu_module) {
+        CUcontext dummy;
+        CudaFunctions *cu = ctx->hwctx->internal->cuda_dl;
+        CHECK_CU(cu->cuCtxPushCurrent(ctx->cu_ctx));
+        CHECK_CU(cu->cuModuleUnload(ctx->cu_module));
+        CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+    }
+}
+
+/**
+ * Activate overlay_cuda
+ */
+static int overlay_cuda_activate(AVFilterContext *avctx)
+{
+    OverlayCUDAContext *ctx = avctx->priv;
+
+    return ff_framesync_activate(&ctx->fs);
+}
+
+/**
+ * Query formats
+ */
+static int overlay_cuda_query_formats(AVFilterContext *avctx)
+{
+    static const enum AVPixelFormat pixel_formats[] = {
+        AV_PIX_FMT_CUDA, AV_PIX_FMT_NONE,
+    };
+
+    AVFilterFormats *pix_fmts = ff_make_format_list(pixel_formats);
+
+    return ff_set_common_formats(avctx, pix_fmts);
+}
+
+/**
+ * Configure output
+ */
+static int overlay_cuda_config_output(AVFilterLink *outlink)
+{
+
+    extern char vf_overlay_cuda_ptx[];
+
+    int err;
+    AVFilterContext* avctx = outlink->src;
+    OverlayCUDAContext* ctx = avctx->priv;
+
+    AVFilterLink *inlink = avctx->inputs[0];
+    AVHWFramesContext  *frames_ctx = (AVHWFramesContext*)inlink->hw_frames_ctx->data;
+
+    AVFilterLink *inlink_overlay = avctx->inputs[1];
+    AVHWFramesContext  *frames_ctx_overlay = (AVHWFramesContext*)inlink_overlay->hw_frames_ctx->data;
+
+    CUcontext dummy, cuda_ctx;
+    CudaFunctions *cu;
+
+    // check main input formats
+
+    if (!frames_ctx) {
+        av_log(ctx, AV_LOG_ERROR, "No hw context provided on main input\n");
+        return AVERROR(EINVAL);
+    }
+
+    ctx->in_format_main = frames_ctx->sw_format;
+    if (!format_is_supported(supported_main_formats, ctx->in_format_main)) {
+        av_log(ctx, AV_LOG_ERROR, "Unsupported main input format: %s\n",
+               av_get_pix_fmt_name(ctx->in_format_main));
+        return AVERROR(ENOSYS);
+    }
+
+    // check overlay input formats
+
+    if (!frames_ctx_overlay) {
+        av_log(ctx, AV_LOG_ERROR, "No hw context provided on overlay input\n");
+        return AVERROR(EINVAL);
+    }
+
+    ctx->in_format_overlay = frames_ctx_overlay->sw_format;
+    if (!format_is_supported(supported_overlay_formats, ctx->in_format_overlay)) {
+        av_log(ctx, AV_LOG_ERROR, "Unsupported overlay input format: %s\n",
+            av_get_pix_fmt_name(ctx->in_format_overlay));
+        return AVERROR(ENOSYS);
+    }
+
+    // check we can overlay pictures with those pixel formats
+
+    if (!formats_match(ctx->in_format_main, ctx->in_format_overlay)) {
+        av_log(ctx, AV_LOG_ERROR, "Can't overlay %s on %s \n",
+            av_get_pix_fmt_name(ctx->in_format_overlay), av_get_pix_fmt_name(ctx->in_format_main));
+        return AVERROR(EINVAL);
+    }
+
+    // initialize
+
+    ctx->hwctx = frames_ctx->device_ctx->hwctx;
+    cuda_ctx = ctx->hwctx->cuda_ctx;
+    ctx->fs.time_base = inlink->time_base;
+
+    ctx->cu_stream = ctx->hwctx->stream;
+    ctx->device_ref = ((AVHWFramesContext*)inlink->hw_frames_ctx->data)->device_ref;
+
+    outlink->hw_frames_ctx = av_buffer_ref(inlink->hw_frames_ctx);
+
+    // load functions
+
+    cu = ctx->hwctx->internal->cuda_dl;
+
+    err = CHECK_CU(cu->cuCtxPushCurrent(cuda_ctx));
+    if (err < 0) {
+        return err;
+    }
+
+    err = CHECK_CU(cu->cuModuleLoadData(&ctx->cu_module, vf_overlay_cuda_ptx));
+    if (err < 0) {
+        return err;
+    }
+
+    err = CHECK_CU(cu->cuModuleGetFunction(&ctx->cu_func, ctx->cu_module, "Overlay_Cuda"));
+    if (err < 0) {
+        return err;
+    }
+
+    CHECK_CU(cu->cuCtxPopCurrent(&dummy));
+
+    // init dual input
+
+    err = ff_framesync_init_dualinput(&ctx->fs, avctx);
+    if (err < 0) {
+        return err;
+    }
+
+    return ff_framesync_configure(&ctx->fs);
+}
+
+
+#define OFFSET(x) offsetof(OverlayCUDAContext, x)
+#define FLAGS (AV_OPT_FLAG_FILTERING_PARAM | AV_OPT_FLAG_VIDEO_PARAM)
+
+static const AVOption overlay_cuda_options[] = {
+    { "x", "Overlay x position",
+      OFFSET(x_position), AV_OPT_TYPE_INT, { .i64 = 0 }, INT_MIN, INT_MAX, .flags = FLAGS },
+    { "y", "Overlay y position",
+      OFFSET(y_position), AV_OPT_TYPE_INT, { .i64 = 0 }, INT_MIN, INT_MAX, .flags = FLAGS },
+    { "eof_action", "Action to take when encountering EOF from secondary input ",
+        OFFSET(fs.opt_eof_action), AV_OPT_TYPE_INT, { .i64 = EOF_ACTION_REPEAT },
+        EOF_ACTION_REPEAT, EOF_ACTION_PASS, .flags = FLAGS, "eof_action" },
+        { "repeat", "Repeat the previous frame.",   0, AV_OPT_TYPE_CONST, { .i64 = EOF_ACTION_REPEAT }, .flags = FLAGS, "eof_action" },
+        { "endall", "End both streams.",            0, AV_OPT_TYPE_CONST, { .i64 = EOF_ACTION_ENDALL }, .flags = FLAGS, "eof_action" },
+        { "pass",   "Pass through the main input.", 0, AV_OPT_TYPE_CONST, { .i64 = EOF_ACTION_PASS },   .flags = FLAGS, "eof_action" },
+    { "shortest", "force termination when the shortest input terminates", OFFSET(fs.opt_shortest), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, FLAGS },
+    { "repeatlast", "repeat overlay of the last overlay frame", OFFSET(fs.opt_repeatlast), AV_OPT_TYPE_BOOL, {.i64=1}, 0, 1, FLAGS },
+    { NULL },
+};
+
+FRAMESYNC_DEFINE_CLASS(overlay_cuda, OverlayCUDAContext, fs);
+
+static const AVFilterPad overlay_cuda_inputs[] = {
+    {
+        .name         = "main",
+        .type         = AVMEDIA_TYPE_VIDEO,
+    },
+    {
+        .name         = "overlay",
+        .type         = AVMEDIA_TYPE_VIDEO,
+    },
+    { NULL }
+};
+
+static const AVFilterPad overlay_cuda_outputs[] = {
+    {
+        .name          = "default",
+        .type          = AVMEDIA_TYPE_VIDEO,
+        .config_props  = &overlay_cuda_config_output,
+    },
+    { NULL }
+};
+
+AVFilter ff_vf_overlay_cuda = {
+    .name            = "overlay_cuda",
+    .description     = NULL_IF_CONFIG_SMALL("Overlay one video on top of another using CUDA"),
+    .priv_size       = sizeof(OverlayCUDAContext),
+    .priv_class      = &overlay_cuda_class,
+    .init            = &overlay_cuda_init,
+    .uninit          = &overlay_cuda_uninit,
+    .activate        = &overlay_cuda_activate,
+    .query_formats   = &overlay_cuda_query_formats,
+    .inputs          = overlay_cuda_inputs,
+    .outputs         = overlay_cuda_outputs,
+    .preinit         = overlay_cuda_framesync_preinit,
+    .flags_internal  = FF_FILTER_FLAG_HWFRAME_AWARE,
+};
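Note on the launch geometry used by overlay_cuda_call_kernel above: the grid is sized from the main plane, not from the overlay, with each dimension rounded up to a whole number of BLOCK_X x BLOCK_Y thread blocks. A minimal host-side sketch of that arithmetic follows. BLOCK_X, BLOCK_Y and DIV_UP are copied from the patch; GridDim and plane_grid are hypothetical names introduced only for this illustration.

```c
#include <assert.h>

#define BLOCK_X 32
#define BLOCK_Y 16
#define DIV_UP(a, b) (((a) + (b) - 1) / (b))

/* Hypothetical type, not part of the patch: launch geometry for one plane. */
typedef struct GridDim {
    int blocks_x, blocks_y;   /* thread blocks per axis */
    int threads_x, threads_y; /* total threads per axis */
} GridDim;

/* Hypothetical helper: same rounding as the cuLaunchKernel call in the patch. */
static GridDim plane_grid(int plane_w, int plane_h)
{
    GridDim g;

    assert(plane_w > 0 && plane_h > 0);
    g.blocks_x  = DIV_UP(plane_w, BLOCK_X);
    g.blocks_y  = DIV_UP(plane_h, BLOCK_Y);
    g.threads_x = g.blocks_x * BLOCK_X;
    g.threads_y = g.blocks_y * BLOCK_Y;
    return g;
}
```

For a 1920x1080 luma plane this gives 60 x 68 blocks, that is 1088 thread rows for 1080 pixel rows. The kernel returns early for any thread whose coordinates lie outside the overlay rectangle.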
diff --git a/libavfilter/vf_overlay_cuda.cu b/libavfilter/vf_overlay_cuda.cu
new file mode 100644
index 0000000000..43ec36c2ed
--- /dev/null
+++ b/libavfilter/vf_overlay_cuda.cu
@@ -0,0 +1,54 @@ 
+/*
+ * Copyright (c) 2020 Yaroslav Pogrebnyak <yyyaroslav@gmail.com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+extern "C" {
+
+__global__ void Overlay_Cuda(
+    int x_position, int y_position,
+    unsigned char* main, int main_linesize,
+    unsigned char* overlay, int overlay_linesize,
+    int overlay_w, int overlay_h,
+    unsigned char* overlay_alpha, int alpha_linesize,
+    int alpha_adj_x, int alpha_adj_y)
+{
+    int x = blockIdx.x * blockDim.x + threadIdx.x;
+    int y = blockIdx.y * blockDim.y + threadIdx.y;
+
+    if (x >= overlay_w + x_position ||
+        y >= overlay_h + y_position ||
+        x < x_position ||
+        y < y_position ) {
+
+        return;
+    }
+
+    int overlay_x = x - x_position;
+    int overlay_y = y - y_position;
+
+    float alpha = 1.0;
+    if (alpha_linesize) {
+        alpha = overlay_alpha[alpha_adj_x * overlay_x  + alpha_adj_y * overlay_y * alpha_linesize] / 255.0f;
+    }
+
+    main[x + y*main_linesize] = alpha * overlay[overlay_x + overlay_y * overlay_linesize] + (1.0f - alpha) * main[x + y*main_linesize];
+}
+
+}
+
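A CPU reference may make the per-pixel arithmetic of Overlay_Cuda easier to check. The sketch below is a plain C translation of the kernel body for one plane, under these assumptions: overlay_plane_ref is a hypothetical name, and the main_w/main_h loop bounds stand in for the CUDA grid (the real grid is rounded up to whole blocks). It is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU counterpart of one Overlay_Cuda launch over one plane. */
static void overlay_plane_ref(
    int x_position, int y_position,
    uint8_t *main_data, int main_linesize, int main_w, int main_h,
    const uint8_t *overlay, int overlay_linesize,
    int overlay_w, int overlay_h,
    const uint8_t *overlay_alpha, int alpha_linesize,
    int alpha_adj_x, int alpha_adj_y)
{
    assert(main_data && overlay);

    for (int y = 0; y < main_h; y++) {
        for (int x = 0; x < main_w; x++) {
            /* Same rectangle test as the kernel: skip pixels outside the overlay. */
            if (x >= overlay_w + x_position || y >= overlay_h + y_position ||
                x < x_position || y < y_position)
                continue;

            int overlay_x = x - x_position;
            int overlay_y = y - y_position;

            /* alpha_linesize == 0 means "no alpha plane": fully opaque overlay. */
            float alpha = 1.0f;
            if (alpha_linesize)
                alpha = overlay_alpha[alpha_adj_x * overlay_x +
                                      alpha_adj_y * overlay_y * alpha_linesize] / 255.0f;

            uint8_t *dst = &main_data[x + y * main_linesize];
            /* Linear blend; the float result is truncated when stored as a byte. */
            *dst = (uint8_t)(alpha * overlay[overlay_x + overlay_y * overlay_linesize] +
                             (1.0f - alpha) * *dst);
        }
    }
}
```

The alpha_adj_x/alpha_adj_y factors map a sample of the plane being blended to a position in the full-resolution alpha plane. The luma call in vf_overlay_cuda.c passes 1, 1. The YUV420P/YUVA420P chroma calls pass 2, 2, so each chroma sample reads the alpha value at the top-left of its 2x2 luma block. The NV12 chroma call passes an alpha linesize of 0, which selects alpha = 1.0.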
diff --git a/libavutil/hwcontext_cuda.c b/libavutil/hwcontext_cuda.c
index a87c280cf7..3c4e36dde7 100644
--- a/libavutil/hwcontext_cuda.c
+++ b/libavutil/hwcontext_cuda.c
@@ -39,6 +39,7 @@  typedef struct CUDAFramesContext {
 static const enum AVPixelFormat supported_formats[] = {
     AV_PIX_FMT_NV12,
     AV_PIX_FMT_YUV420P,
+    AV_PIX_FMT_YUVA420P,
     AV_PIX_FMT_YUV444P,
     AV_PIX_FMT_P010,
     AV_PIX_FMT_P016,
@@ -274,7 +275,7 @@  static int cuda_transfer_data_to(AVHWFramesContext *ctx, AVFrame *dst,
             .srcPitch      = src->linesize[i],
             .dstPitch      = dst->linesize[i],
             .WidthInBytes  = FFMIN(src->linesize[i], dst->linesize[i]),
-            .Height        = src->height >> (i ? priv->shift_height : 0),
+            .Height        = src->height >> ( (i == 0 || i == 3) ? 0 : priv->shift_height),
         };
 
         ret = CHECK_CU(cu->cuMemcpy2DAsync(&cpy, hwctx->stream));
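The effect of the .Height change in cuda_transfer_data_to can be shown with a small sketch. It assumes that priv->shift_height holds the log2 vertical chroma subsampling of the software format (1 for 4:2:0 formats such as YUVA420P); upload_plane_height is a hypothetical helper that only mirrors the patched expression.

```c
#include <assert.h>

/* Hypothetical helper mirroring the patched .Height expression:
 * plane 0 (luma) and plane 3 (alpha) keep the full frame height,
 * planes 1 and 2 (chroma) are shifted by the vertical subsampling. */
static int upload_plane_height(int frame_height, int plane, int shift_height)
{
    assert(plane >= 0 && plane < 4);
    return frame_height >> ((plane == 0 || plane == 3) ? 0 : shift_height);
}
```

For a 1080-line YUVA420P frame this yields 1080, 540, 540 and 1080 rows for planes 0 to 3. The old expression, src->height >> (i ? priv->shift_height : 0), would give 540 for plane 3 and so copy only the top half of the alpha plane.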