diff mbox series

[FFmpeg-devel] libavcodec/nvenc.c: copy incoming hwaccel frames instead of ref count increase

Message ID BN9PR12MB52745632705BBB311428177ED2A89@BN9PR12MB5274.namprd12.prod.outlook.com
State New
Headers show
Series [FFmpeg-devel] libavcodec/nvenc.c: copy incoming hwaccel frames instead of ref count increase | expand

Checks

Context Check Description
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished
andriy/make_ppc success Make finished
andriy/make_fate_ppc success Make fate finished

Commit Message

Roman Arzumanyan Sept. 28, 2021, 4:22 p.m. UTC
Hello,

This patch makes nvenc copy incoming hwaccel frames instead of ref count increase.
It fixes the bug which may happen when on-GPU transcoding is done and encoder is set to use B frames.

How to reproduce the bug:
./ffmpeg \
  -hwaccel cuda -hwaccel_output_format cuda \
  -i input.mkv \
  -c:v h264_nvenc -preset p4 -tune hq -bf 3 \
  -y output.mkv

Expected output:
[h264 @ 0x55b14da4b4c0] No decoder surfaces left
[h264 @ 0x55b14da682c0] No decoder surfaces left
[h264 @ 0x55b14da850c0] No decoder surfaces left
[h264 @ 0x55b14daa1ec0] No decoder surfaces left
Error while decoding stream #0:0: Invalid data found when processing input
[h264 @ 0x55b14da2e6c0] No decoder surfaces left
Error while decoding stream #0:0: Invalid data found when processing input
    Last message repeated 1 times


Although fix adds extra CUDA DtoD memcopy, our internal testing results didn't show any noticeable difference in transcoding performance.
Subject: [PATCH] nvenc: copy incoming HW frame instead of referencing

---
 libavcodec/nvenc.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Comments

Timo Rothenpieler Sept. 28, 2021, 5:58 p.m. UTC | #1
On 28.09.2021 18:22, Roman Arzumanyan wrote:
> Hello,
> 
> This patch makes nvenc copy incoming hwaccel frames instead of ref count increase.
> It fixes the bug which may happen when on-GPU transcoding is done and encoder is set to use B frames.
> 
> How to reproduce the bug:
> ./ffmpeg \
>    -hwaccel cuda -hwaccel_output_format cuda \
>    -i input.mkv \
>    -c:v h264_nvenc -preset p4 -tune hq -bf 3 \
>    -y output.mkv
> 
> Expected output:
> [h264 @ 0x55b14da4b4c0] No decoder surfaces left
> [h264 @ 0x55b14da682c0] No decoder surfaces left
> [h264 @ 0x55b14da850c0] No decoder surfaces left
> [h264 @ 0x55b14daa1ec0] No decoder surfaces left
> Error while decoding stream #0:0: Invalid data found when processing input
> [h264 @ 0x55b14da2e6c0] No decoder surfaces left
> Error while decoding stream #0:0: Invalid data found when processing input
>      Last message repeated 1 times
> 
> 
> Although fix adds extra CUDA DtoD memcopy, our internal testing results didn't show any noticeable difference in transcoding performance.
> 

Hmm, so far my approach to deal with this was to inject a 
scale_cuda=passthrough=0 into the filter chain, which pretty much does 
exactly this, but only controllable by the user.

But I do agree that this is a bit of a clutch and not all that user 
friendly.

My main concern with this approach is that it will inevitably increase 
VRAM usage, depending on bframe count and resolution even quite 
significantly.
And it's surprisingly common that users show up that are highly pressed 
for memory. When bframes were switched on by default, several people 
showed up who where suddenly running out of VRAM.

I do like this approach though, since it will for the average user make 
using a full hw chain a lot less bothersome.

So what I'd propose is:

- Add an option to retain the old behaviour of just holding a reference 
to the input frame no matter what.
- Instead of explicitly copying the frame like you do right now, call 
av_frame_make_writable() on the frame, right after where you right now 
are replacing av_frame_ref with av_hwframe_transfer_data.
That is for one very easy to disable conditionally, and does not require 
you to guard all the unref calls.
Plus, it will only actually copy the frame if needed (i.e. it won't do 
anything if it comes out of a filterchain and has nothing else holding a 
ref)


Timo
Timo Rothenpieler Sept. 28, 2021, 6:28 p.m. UTC | #2
On 28.09.2021 19:58, Timo Rothenpieler wrote:
> On 28.09.2021 18:22, Roman Arzumanyan wrote:
>> Hello,
>>
>> This patch makes nvenc copy incoming hwaccel frames instead of ref 
>> count increase.
>> It fixes the bug which may happen when on-GPU transcoding is done and 
>> encoder is set to use B frames.
>>
>> How to reproduce the bug:
>> ./ffmpeg \
>>    -hwaccel cuda -hwaccel_output_format cuda \
>>    -i input.mkv \
>>    -c:v h264_nvenc -preset p4 -tune hq -bf 3 \
>>    -y output.mkv
>>
>> Expected output:
>> [h264 @ 0x55b14da4b4c0] No decoder surfaces left
>> [h264 @ 0x55b14da682c0] No decoder surfaces left
>> [h264 @ 0x55b14da850c0] No decoder surfaces left
>> [h264 @ 0x55b14daa1ec0] No decoder surfaces left
>> Error while decoding stream #0:0: Invalid data found when processing 
>> input
>> [h264 @ 0x55b14da2e6c0] No decoder surfaces left
>> Error while decoding stream #0:0: Invalid data found when processing 
>> input
>>      Last message repeated 1 times
>>
>>
>> Although fix adds extra CUDA DtoD memcopy, our internal testing 
>> results didn't show any noticeable difference in transcoding performance.
>>

Something else I just realized:

The nvenc_frame->in_ref you are copying to is never actually used anywhere.
It is purely held as a reference, so the frame is not freed before nvenc 
is done with it.

So effectively what your patch is doing right now is making a copy of 
the frame, but then never do anything with that copy.
nvEncRegisterResource() is still called on the original frame, and that 
registered resource is then used for nvenc.

I'm surprised that actually works? Is that okay to do, and then let 
nvdec continue on?
Roman Arzumanyan Sept. 28, 2021, 6:54 p.m. UTC | #3
This is a good point, let me check and come back later.
Timo Rothenpieler Sept. 28, 2021, 7:11 p.m. UTC | #4
On 28.09.2021 19:58, Timo Rothenpieler wrote:
> Hmm, so far my approach to deal with this was to inject a 
> scale_cuda=passthrough=0 into the filter chain, which pretty much does 
> exactly this, but only controllable by the user.
> 
> But I do agree that this is a bit of a clutch and not all that user 
> friendly.
> 
> My main concern with this approach is that it will inevitably increase 
> VRAM usage, depending on bframe count and resolution even quite 
> significantly.
> And it's surprisingly common that users show up that are highly pressed 
> for memory. When bframes were switched on by default, several people 
> showed up who where suddenly running out of VRAM.
> 
> I do like this approach though, since it will for the average user make 
> using a full hw chain a lot less bothersome.
> 
> So what I'd propose is:
> 
> - Add an option to retain the old behaviour of just holding a reference 
> to the input frame no matter what.
> - Instead of explicitly copying the frame like you do right now, call 
> av_frame_make_writable() on the frame, right after where you right now 
> are replacing av_frame_ref with av_hwframe_transfer_data.
> That is for one very easy to disable conditionally, and does not require 
> you to guard all the unref calls.
> Plus, it will only actually copy the frame if needed (i.e. it won't do 
> anything if it comes out of a filterchain and has nothing else holding a 
> ref)
> 
> 
> Timo

See attached patch for that approach.
I just encoded a 5 minute sample using it, and I do see a marginal 
decrease in performance (it drops by literally x0.01 speed, so pretty 
much within margin of error, but it did show that consistently) and 
increase in VRAM usage as expected.

However, given that your patch does seem to work just fine, somehow, it 
would be interesting to know if re-using a frame/CUDA buffer after 
registering it with nvenc is safe?
Given that the logic right now never unregisters buffers unless it runs 
out of free slots, it would seem weird to me if that was the case. What 
if a buffer actually does get re-used, as is common with 
non-nvdec-frames allocated from a hwframes ctx?
From e8e099dac72b1962aba7c8a4ba87f4035eb0a29d Mon Sep 17 00:00:00 2001
From: Timo Rothenpieler <timo@rothenpieler.org>
Date: Tue, 28 Sep 2021 20:45:26 +0200
Subject: [PATCH] avcodec/nvenc: copy incoming hwframes when applicable

---
 libavcodec/nvenc.c      | 21 +++++++++++++++------
 libavcodec/nvenc.h      |  1 +
 libavcodec/nvenc_h264.c |  2 ++
 libavcodec/nvenc_hevc.c |  2 ++
 4 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/libavcodec/nvenc.c b/libavcodec/nvenc.c
index edc46ed33a..e3cd67fbc1 100644
--- a/libavcodec/nvenc.c
+++ b/libavcodec/nvenc.c
@@ -1910,20 +1910,29 @@ static int nvenc_upload_frame(AVCodecContext *avctx, const AVFrame *frame,
     NvencDynLoadFunctions *dl_fn = &ctx->nvenc_dload_funcs;
     NV_ENCODE_API_FUNCTION_LIST *p_nvenc = &dl_fn->nvenc_funcs;
 
-    int res;
+    int res, reg_idx;
     NVENCSTATUS nv_status;
 
     if (avctx->pix_fmt == AV_PIX_FMT_CUDA || avctx->pix_fmt == AV_PIX_FMT_D3D11) {
-        int reg_idx = nvenc_register_frame(avctx, frame);
+        res = av_frame_ref(nvenc_frame->in_ref, frame);
+        if (res < 0)
+            return res;
+
+        if (avctx->pix_fmt == AV_PIX_FMT_CUDA && !ctx->lowmem) {
+            res = av_frame_make_writable(nvenc_frame->in_ref);
+            if (res < 0) {
+                av_frame_unref(nvenc_frame->in_ref);
+                return res;
+            }
+        }
+
+        reg_idx = nvenc_register_frame(avctx, nvenc_frame->in_ref);
         if (reg_idx < 0) {
             av_log(avctx, AV_LOG_ERROR, "Could not register an input HW frame\n");
+            av_frame_unref(nvenc_frame->in_ref);
             return reg_idx;
         }
 
-        res = av_frame_ref(nvenc_frame->in_ref, frame);
-        if (res < 0)
-            return res;
-
         if (!ctx->registered_frames[reg_idx].mapped) {
             ctx->registered_frames[reg_idx].in_map.version = NV_ENC_MAP_INPUT_RESOURCE_VER;
             ctx->registered_frames[reg_idx].in_map.registeredResource = ctx->registered_frames[reg_idx].regptr;
diff --git a/libavcodec/nvenc.h b/libavcodec/nvenc.h
index 08531e1be3..927f7aa8f3 100644
--- a/libavcodec/nvenc.h
+++ b/libavcodec/nvenc.h
@@ -235,6 +235,7 @@ typedef struct NvencContext
     int intra_refresh;
     int single_slice_intra_refresh;
     int constrained_encoding;
+    int lowmem;
 } NvencContext;
 
 int ff_nvenc_encode_init(AVCodecContext *avctx);
diff --git a/libavcodec/nvenc_h264.c b/libavcodec/nvenc_h264.c
index 7d78aa0d87..c96a26902a 100644
--- a/libavcodec/nvenc_h264.c
+++ b/libavcodec/nvenc_h264.c
@@ -194,6 +194,8 @@ static const AVOption options[] = {
                                                             OFFSET(single_slice_intra_refresh), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VE },
     { "constrained-encoding", "Enable constrainedFrame encoding where each slice in the constrained picture is independent of other slices",
                                                             OFFSET(constrained_encoding), AV_OPT_TYPE_BOOL,  { .i64 = 0 }, 0, 1, VE },
+    { "lowmem", "Avoids DtoD copies of frames, reducing memory footprint and potentially increasing performance at the cost of stability",
+                                                            OFFSET(lowmem), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VE },
     { NULL }
 };
 
diff --git a/libavcodec/nvenc_hevc.c b/libavcodec/nvenc_hevc.c
index a13ac5a395..f46c4862ce 100644
--- a/libavcodec/nvenc_hevc.c
+++ b/libavcodec/nvenc_hevc.c
@@ -175,6 +175,8 @@ static const AVOption options[] = {
                                                             OFFSET(single_slice_intra_refresh), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VE },
     { "constrained-encoding", "Enable constrainedFrame encoding where each slice in the constrained picture is independent of other slices",
                                                             OFFSET(constrained_encoding), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VE },
+    { "lowmem", "Avoids DtoD copies of frames, reducing memory footprint and potentially increasing performance at the cost of stability",
+                                                            OFFSET(lowmem), AV_OPT_TYPE_BOOL, { .i64 = 0 }, 0, 1, VE },
     { NULL }
 };
diff mbox series

Patch

diff --git a/libavcodec/nvenc.c b/libavcodec/nvenc.c
index 815b9429b3..9b4c968179 100644
--- a/libavcodec/nvenc.c
+++ b/libavcodec/nvenc.c
@@ -1849,7 +1849,7 @@  static int nvenc_upload_frame(AVCodecContext *avctx, const AVFrame *frame,
             return reg_idx;
         }
 
-        res = av_frame_ref(nvenc_frame->in_ref, frame);
+        res = av_hwframe_transfer_data(nvenc_frame->in_ref, frame, 0);
         if (res < 0)
             return res;
 
@@ -1858,7 +1858,6 @@  static int nvenc_upload_frame(AVCodecContext *avctx, const AVFrame *frame,
             ctx->registered_frames[reg_idx].in_map.registeredResource = ctx->registered_frames[reg_idx].regptr;
             nv_status = p_nvenc->nvEncMapInputResource(ctx->nvencoder, &ctx->registered_frames[reg_idx].in_map);
             if (nv_status != NV_ENC_SUCCESS) {
-                av_frame_unref(nvenc_frame->in_ref);
                 return nvenc_print_error(avctx, nv_status, "Error mapping an input resource");
             }
         }
@@ -2029,8 +2028,6 @@  static int process_output_surface(AVCodecContext *avctx, AVPacket *pkt, NvencSur
             goto error;
         }
 
-        av_frame_unref(tmpoutsurf->in_ref);
-
         tmpoutsurf->input_surface = NULL;
     }