diff mbox series

[FFmpeg-devel,3/4] avutil/cuda_check: propagate AVERROR_UNRECOVERABLE when needed

Message ID 20221122130732.403-3-jamrial@gmail.com
State New
Headers show
Series [FFmpeg-devel,1/4] avutil/error: add a new error to signal processing got into an unrecoverable state | expand

Checks

Context Check Description
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

James Almer Nov. 22, 2022, 1:07 p.m. UTC
Based on a patch by Soft Works.

Signed-off-by: James Almer <jamrial@gmail.com>
---
 libavutil/cuda_check.h | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Timo Rothenpieler Nov. 22, 2022, 1:21 p.m. UTC | #1
On 22/11/2022 14:07, James Almer wrote:
> Based on a patch by Soft Works.
> 
> Signed-off-by: James Almer <jamrial@gmail.com>
> ---
>   libavutil/cuda_check.h | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
> index f5a9234eaf..33aaf9c098 100644
> --- a/libavutil/cuda_check.h
> +++ b/libavutil/cuda_check.h
> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
>           av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name, err_string);
>       av_log(avctx, AV_LOG_ERROR, "\n");
>   
> +    // Not recoverable
> +    if (err == CUDA_ERROR_UNKNOWN)
> +        return AVERROR_UNRECOVERABLE;

Why does specifically CUDA_ERROR_UNKNOWN get mapped to unrecoverable?

>       return AVERROR_EXTERNAL;
>   }
>
James Almer Nov. 22, 2022, 1:31 p.m. UTC | #2
On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
> On 22/11/2022 14:07, James Almer wrote:
>> Based on a patch by Soft Works.
>>
>> Signed-off-by: James Almer <jamrial@gmail.com>
>> ---
>>   libavutil/cuda_check.h | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
>> index f5a9234eaf..33aaf9c098 100644
>> --- a/libavutil/cuda_check.h
>> +++ b/libavutil/cuda_check.h
>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
>>           av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name, 
>> err_string);
>>       av_log(avctx, AV_LOG_ERROR, "\n");
>> +    // Not recoverable
>> +    if (err == CUDA_ERROR_UNKNOWN)
>> +        return AVERROR_UNRECOVERABLE;
> 
> Why does specifically CUDA_ERROR_UNKNOWN get mapped to unrecoverable?

It's the code that Soft Works found out was returned repeatedly no 
matter how many packets you fed to the encoder, which meant it was stuck 
in an unrecoverable state. See 
http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html

If you know of cases where this error could be returned in valid 
recoverable scenarios that are not already handled in some other way, 
what do you suggest could be done?

> 
>>       return AVERROR_EXTERNAL;
>>   }
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
Soft Works Nov. 22, 2022, 2:33 p.m. UTC | #3
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> James Almer
> Sent: Tuesday, November 22, 2022 2:31 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check: propagate
> AVERROR_UNRECOVERABLE when needed
> 
> On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
> > On 22/11/2022 14:07, James Almer wrote:
> >> Based on a patch by Soft Works.
> >>
> >> Signed-off-by: James Almer <jamrial@gmail.com>
> >> ---
> >>   libavutil/cuda_check.h | 4 ++++
> >>   1 file changed, 4 insertions(+)
> >>
> >> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
> >> index f5a9234eaf..33aaf9c098 100644
> >> --- a/libavutil/cuda_check.h
> >> +++ b/libavutil/cuda_check.h
> >> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
> >>           av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name,
> >> err_string);
> >>       av_log(avctx, AV_LOG_ERROR, "\n");
> >> +    // Not recoverable
> >> +    if (err == CUDA_ERROR_UNKNOWN)
> >> +        return AVERROR_UNRECOVERABLE;
> >
> > Why does specifically CUDA_ERROR_UNKNOWN get mapped to
> unrecoverable?
> 
> It's the code that Soft Works found out was returned repeatedly no
> matter how many packets you fed to the encoder, which meant it was
> stuck
> in an unrecoverable state. See
> http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
> 
> If you know of cases where this error could be returned in valid
> recoverable scenarios that are not already handled in some other way,
> what do you suggest could be done?

Thanks James, for picking this up!

All I can say is that my original patch is deployed to a quite a
number of systems and there hasn't been any case where this 
would have had an adverse effect.

I hadn't reported this to Nvidia because a solution was needed
and it was an erroneous file, so the best they could 
have probably done was to return a different error code ;-)

softworkz
James Almer Nov. 22, 2022, 2:41 p.m. UTC | #4
On 11/22/2022 11:33 AM, Soft Works wrote:
> 
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> James Almer
>> Sent: Tuesday, November 22, 2022 2:31 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check: propagate
>> AVERROR_UNRECOVERABLE when needed
>>
>> On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
>>> On 22/11/2022 14:07, James Almer wrote:
>>>> Based on a patch by Soft Works.
>>>>
>>>> Signed-off-by: James Almer <jamrial@gmail.com>
>>>> ---
>>>>    libavutil/cuda_check.h | 4 ++++
>>>>    1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
>>>> index f5a9234eaf..33aaf9c098 100644
>>>> --- a/libavutil/cuda_check.h
>>>> +++ b/libavutil/cuda_check.h
>>>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
>>>>            av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name,
>>>> err_string);
>>>>        av_log(avctx, AV_LOG_ERROR, "\n");
>>>> +    // Not recoverable
>>>> +    if (err == CUDA_ERROR_UNKNOWN)
>>>> +        return AVERROR_UNRECOVERABLE;
>>>
>>> Why does specifically CUDA_ERROR_UNKNOWN get mapped to
>> unrecoverable?
>>
>> It's the code that Soft Works found out was returned repeatedly no
>> matter how many packets you fed to the encoder, which meant it was
>> stuck
>> in an unrecoverable state. See
>> http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
>>
>> If you know of cases where this error could be returned in valid
>> recoverable scenarios that are not already handled in some other way,
>> what do you suggest could be done?
> 
> Thanks James, for picking this up!
> 
> All I can say is that my original patch is deployed to a quite a
> number of systems and there hasn't been any case where this
> would have had an adverse effect.
> 
> I hadn't reported this to Nvidia because a solution was needed
> and it was an erroneous file, so the best they could
> have probably done was to return a different error code ;-)
> 
> softworkz

Can you be more specific about what kind of erroneous file it was? Are 
we talking about a completely broken stream where no packet was valid 
and even the software decoder would fail, or something that had one 
invalid packet that somehow chocked the nvdec to the point not even an 
IDR picture triggering a refresh would fix it?

If this is the former, then what you encountered was not the decoder 
entering an unrecoverable state, but just properly rejecting bad input, 
and then this patch would probably not be correct.
Andreas Rheinhardt Nov. 22, 2022, 2:48 p.m. UTC | #5
Soft Works:
> 
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> James Almer
>> Sent: Tuesday, November 22, 2022 2:31 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check: propagate
>> AVERROR_UNRECOVERABLE when needed
>>
>> On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
>>> On 22/11/2022 14:07, James Almer wrote:
>>>> Based on a patch by Soft Works.
>>>>
>>>> Signed-off-by: James Almer <jamrial@gmail.com>
>>>> ---
>>>>   libavutil/cuda_check.h | 4 ++++
>>>>   1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
>>>> index f5a9234eaf..33aaf9c098 100644
>>>> --- a/libavutil/cuda_check.h
>>>> +++ b/libavutil/cuda_check.h
>>>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
>>>>           av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name,
>>>> err_string);
>>>>       av_log(avctx, AV_LOG_ERROR, "\n");
>>>> +    // Not recoverable
>>>> +    if (err == CUDA_ERROR_UNKNOWN)
>>>> +        return AVERROR_UNRECOVERABLE;
>>>
>>> Why does specifically CUDA_ERROR_UNKNOWN get mapped to
>> unrecoverable?
>>
>> It's the code that Soft Works found out was returned repeatedly no
>> matter how many packets you fed to the encoder, which meant it was
>> stuck
>> in an unrecoverable state. See
>> http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
>>
>> If you know of cases where this error could be returned in valid
>> recoverable scenarios that are not already handled in some other way,
>> what do you suggest could be done?
> 
> Thanks James, for picking this up!
> 
> All I can say is that my original patch is deployed to a quite a
> number of systems and there hasn't been any case where this 
> would have had an adverse effect.
> 
> I hadn't reported this to Nvidia because a solution was needed
> and it was an erroneous file, so the best they could 
> have probably done was to return a different error code ;-)
> 

If this was an erroneous file, then wouldn't it be possible for future
packets to be non-erroneous and therefore decodable?
(An alternative fix for this would be for fftools to calculate how many
errors there were in a row and stop trying to decode a stream that
returns too many errors based upon some option (one could also use the
ratio of successful to non-successful decode calls or something like that).)

- Andreas
Soft Works Nov. 22, 2022, 2:55 p.m. UTC | #6
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> James Almer
> Sent: Tuesday, November 22, 2022 3:41 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check: propagate
> AVERROR_UNRECOVERABLE when needed
> 
> On 11/22/2022 11:33 AM, Soft Works wrote:
> >
> >
> >> -----Original Message-----
> >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> >> James Almer
> >> Sent: Tuesday, November 22, 2022 2:31 PM
> >> To: ffmpeg-devel@ffmpeg.org
> >> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check:
> propagate
> >> AVERROR_UNRECOVERABLE when needed
> >>
> >> On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
> >>> On 22/11/2022 14:07, James Almer wrote:
> >>>> Based on a patch by Soft Works.
> >>>>
> >>>> Signed-off-by: James Almer <jamrial@gmail.com>
> >>>> ---
> >>>>    libavutil/cuda_check.h | 4 ++++
> >>>>    1 file changed, 4 insertions(+)
> >>>>
> >>>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
> >>>> index f5a9234eaf..33aaf9c098 100644
> >>>> --- a/libavutil/cuda_check.h
> >>>> +++ b/libavutil/cuda_check.h
> >>>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
> >>>>            av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name,
> >>>> err_string);
> >>>>        av_log(avctx, AV_LOG_ERROR, "\n");
> >>>> +    // Not recoverable
> >>>> +    if (err == CUDA_ERROR_UNKNOWN)
> >>>> +        return AVERROR_UNRECOVERABLE;
> >>>
> >>> Why does specifically CUDA_ERROR_UNKNOWN get mapped to
> >> unrecoverable?
> >>
> >> It's the code that Soft Works found out was returned repeatedly no
> >> matter how many packets you fed to the encoder, which meant it was
> >> stuck
> >> in an unrecoverable state. See
> >> http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
> >>
> >> If you know of cases where this error could be returned in valid
> >> recoverable scenarios that are not already handled in some other
> way,
> >> what do you suggest could be done?
> >
> > Thanks James, for picking this up!
> >
> > All I can say is that my original patch is deployed to a quite a
> > number of systems and there hasn't been any case where this
> > would have had an adverse effect.
> >
> > I hadn't reported this to Nvidia because a solution was needed
> > and it was an erroneous file, so the best they could
> > have probably done was to return a different error code ;-)
> >
> > softworkz
> 
> Can you be more specific about what kind of erroneous file it was?

It's been a while, so I would need to look up what it was.

> Are
> we talking about a completely broken stream where no packet was valid
> and even the software decoder would fail, or something that had one
> invalid packet that somehow chocked the nvdec to the point not even
> an
> IDR picture triggering a refresh would fix it?
> 
> If this is the former, then what you encountered was not the decoder
> entering an unrecoverable state, but just properly rejecting bad
> input,
> and then this patch would probably not be correct.

What I remember is that other decoders could read over it and
recover.

But not the Nvidia decoder - no matter what it was being fed.
It was a point of no return. It kept getting fed packet 
after packet but it never recovered, and by "never" I mean
even as long as that the ffmpeg grew to 50 GB.

softworkz
Soft Works Nov. 22, 2022, 2:59 p.m. UTC | #7
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Andreas Rheinhardt
> Sent: Tuesday, November 22, 2022 3:48 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check: propagate
> AVERROR_UNRECOVERABLE when needed
> 
> Soft Works:
> >
> >
> >> -----Original Message-----
> >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> >> James Almer
> >> Sent: Tuesday, November 22, 2022 2:31 PM
> >> To: ffmpeg-devel@ffmpeg.org
> >> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check:
> propagate
> >> AVERROR_UNRECOVERABLE when needed
> >>
> >> On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
> >>> On 22/11/2022 14:07, James Almer wrote:
> >>>> Based on a patch by Soft Works.
> >>>>
> >>>> Signed-off-by: James Almer <jamrial@gmail.com>
> >>>> ---
> >>>>   libavutil/cuda_check.h | 4 ++++
> >>>>   1 file changed, 4 insertions(+)
> >>>>
> >>>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
> >>>> index f5a9234eaf..33aaf9c098 100644
> >>>> --- a/libavutil/cuda_check.h
> >>>> +++ b/libavutil/cuda_check.h
> >>>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
> >>>>           av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name,
> >>>> err_string);
> >>>>       av_log(avctx, AV_LOG_ERROR, "\n");
> >>>> +    // Not recoverable
> >>>> +    if (err == CUDA_ERROR_UNKNOWN)
> >>>> +        return AVERROR_UNRECOVERABLE;
> >>>
> >>> Why does specifically CUDA_ERROR_UNKNOWN get mapped to
> >> unrecoverable?
> >>
> >> It's the code that Soft Works found out was returned repeatedly no
> >> matter how many packets you fed to the encoder, which meant it was
> >> stuck
> >> in an unrecoverable state. See
> >> http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
> >>
> >> If you know of cases where this error could be returned in valid
> >> recoverable scenarios that are not already handled in some other
> way,
> >> what do you suggest could be done?
> >
> > Thanks James, for picking this up!
> >
> > All I can say is that my original patch is deployed to a quite a
> > number of systems and there hasn't been any case where this
> > would have had an adverse effect.
> >
> > I hadn't reported this to Nvidia because a solution was needed
> > and it was an erroneous file, so the best they could
> > have probably done was to return a different error code ;-)
> >
> 
> If this was an erroneous file, then wouldn't it be possible for
> future
> packets to be non-erroneous and therefore decodable?
> (An alternative fix for this would be for fftools to calculate how
> many
> errors there were in a row and stop trying to decode a stream that
> returns too many errors based upon some option (one could also use
> the
> ratio of successful to non-successful decode calls or something like
> that).)

That error isn't a normal decode error. Those exist as well and it 
can recover from these.
But from all experiments I did, my conclusion was that this 
is an error of final death. AFAIR.

Thanks,
softworkz
Timo Rothenpieler Nov. 22, 2022, 3:15 p.m. UTC | #8
On 22/11/2022 14:31, James Almer wrote:
> On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
>> On 22/11/2022 14:07, James Almer wrote:
>>> Based on a patch by Soft Works.
>>>
>>> Signed-off-by: James Almer <jamrial@gmail.com>
>>> ---
>>>   libavutil/cuda_check.h | 4 ++++
>>>   1 file changed, 4 insertions(+)
>>>
>>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
>>> index f5a9234eaf..33aaf9c098 100644
>>> --- a/libavutil/cuda_check.h
>>> +++ b/libavutil/cuda_check.h
>>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
>>>           av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name, 
>>> err_string);
>>>       av_log(avctx, AV_LOG_ERROR, "\n");
>>> +    // Not recoverable
>>> +    if (err == CUDA_ERROR_UNKNOWN)
>>> +        return AVERROR_UNRECOVERABLE;
>>
>> Why does specifically CUDA_ERROR_UNKNOWN get mapped to unrecoverable?
> 
> It's the code that Soft Works found out was returned repeatedly no 
> matter how many packets you fed to the encoder, which meant it was stuck 
> in an unrecoverable state. See 
> http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
> 
> If you know of cases where this error could be returned in valid 
> recoverable scenarios that are not already handled in some other way, 
> what do you suggest could be done?

It's more that that very much depends on the context the function is 
used in.
In the context of an active de/encoding loop, every non-success return 
would be unrecoverable. Or even during init.

While in other places a failure just breaks a single frame and if it 
somehow recovery, it can continue on. Though that's usually the exception.

In any case, this function does not seem the right place to map an 
arbitrary return code to such a result. Keep in mind every single 
filter/de/encoder uses this function to check CUDA call results.
Soft Works Nov. 22, 2022, 4:02 p.m. UTC | #9
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Timo Rothenpieler
> Sent: Tuesday, November 22, 2022 4:16 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check: propagate
> AVERROR_UNRECOVERABLE when needed
> 
> 
> 
> On 22/11/2022 14:31, James Almer wrote:
> > On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
> >> On 22/11/2022 14:07, James Almer wrote:
> >>> Based on a patch by Soft Works.
> >>>
> >>> Signed-off-by: James Almer <jamrial@gmail.com>
> >>> ---
> >>>   libavutil/cuda_check.h | 4 ++++
> >>>   1 file changed, 4 insertions(+)
> >>>
> >>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
> >>> index f5a9234eaf..33aaf9c098 100644
> >>> --- a/libavutil/cuda_check.h
> >>> +++ b/libavutil/cuda_check.h
> >>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
> >>>           av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name,
> >>> err_string);
> >>>       av_log(avctx, AV_LOG_ERROR, "\n");
> >>> +    // Not recoverable
> >>> +    if (err == CUDA_ERROR_UNKNOWN)
> >>> +        return AVERROR_UNRECOVERABLE;
> >>
> >> Why does specifically CUDA_ERROR_UNKNOWN get mapped to
> unrecoverable?
> >
> > It's the code that Soft Works found out was returned repeatedly no
> > matter how many packets you fed to the encoder, which meant it was
> stuck
> > in an unrecoverable state. See
> > http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
> >
> > If you know of cases where this error could be returned in valid
> > recoverable scenarios that are not already handled in some other
> way,
> > what do you suggest could be done?
> 
> It's more that that very much depends on the context the function is
> used in.
> In the context of an active de/encoding loop, every non-success
> return
> would be unrecoverable. Or even during init.
> 
> While in other places a failure just breaks a single frame and if it
> somehow recovery, it can continue on. Though that's usually the
> exception.

Can you give an example for this in ffmpeg where CUDA returns an
error and ffmpeg continues?

At the time when I had submitted the patch, I had replied this to James
after he had argued that the behavior would be correct as is and
a user who would want ffmpeg to exit in such case, must specify 
the -xerror parameter.


softworkz:
> When there's an error in a filter => ffmpeg exits
> When there's an error in an encoder => ffmpeg exits
> When there's an error initializing a decoder or an encoder
> => ffmpeg exits
>
> So why shouldn't ffmpeg exit when there's an unrecoverable error 
> in a decoder?


AFAIR, in case of filtering and encoding, any "normal" CUDA 
error is already sufficient to cause ffmpeg to exit.


> In any case, this function does not seem the right place to map an
> arbitrary return code to such a result. 

That depends on whether you consider CUDA_ERROR_UNKNOWN as 
generally unrecoverable. When you do - like I did - then
it's the right place IMO.

If you doubt it, then yes - it wouldn't be the right place.

Best regards,
softworkz
Soft Works Nov. 22, 2022, 6:08 p.m. UTC | #10
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> James Almer
> Sent: Tuesday, November 22, 2022 3:41 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check: propagate
> AVERROR_UNRECOVERABLE when needed
> 
> On 11/22/2022 11:33 AM, Soft Works wrote:
> >
> >
> >> -----Original Message-----
> >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> >> James Almer
> >> Sent: Tuesday, November 22, 2022 2:31 PM
> >> To: ffmpeg-devel@ffmpeg.org
> >> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check:
> propagate
> >> AVERROR_UNRECOVERABLE when needed
> >>
> >> On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
> >>> On 22/11/2022 14:07, James Almer wrote:
> >>>> Based on a patch by Soft Works.
> >>>>
> >>>> Signed-off-by: James Almer <jamrial@gmail.com>
> >>>> ---
> >>>>    libavutil/cuda_check.h | 4 ++++
> >>>>    1 file changed, 4 insertions(+)
> >>>>
> >>>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
> >>>> index f5a9234eaf..33aaf9c098 100644
> >>>> --- a/libavutil/cuda_check.h
> >>>> +++ b/libavutil/cuda_check.h
> >>>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
> >>>>            av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name,
> >>>> err_string);
> >>>>        av_log(avctx, AV_LOG_ERROR, "\n");
> >>>> +    // Not recoverable
> >>>> +    if (err == CUDA_ERROR_UNKNOWN)
> >>>> +        return AVERROR_UNRECOVERABLE;
> >>>
> >>> Why does specifically CUDA_ERROR_UNKNOWN get mapped to
> >> unrecoverable?
> >>
> >> It's the code that Soft Works found out was returned repeatedly no
> >> matter how many packets you fed to the encoder, which meant it was
> >> stuck
> >> in an unrecoverable state. See
> >> http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
> >>
> >> If you know of cases where this error could be returned in valid
> >> recoverable scenarios that are not already handled in some other
> way,
> >> what do you suggest could be done?
> >
> > Thanks James, for picking this up!
> >
> > All I can say is that my original patch is deployed to a quite a
> > number of systems and there hasn't been any case where this
> > would have had an adverse effect.
> >
> > I hadn't reported this to Nvidia because a solution was needed
> > and it was an erroneous file, so the best they could
> > have probably done was to return a different error code ;-)
> >
> > softworkz
> 
> Can you be more specific about what kind of erroneous file it was?
> Are
> we talking about a completely broken stream where no packet was valid
> and even the software decoder would fail, or something that had one
> invalid packet that somehow chocked the nvdec...

I was able to find the conversations where this had been reported.
There were three cases, two were investigated, both of which quite 
similar.

The first case was about an mpegts "recording" from some online
stream where the "recorder" was simply reconnecting on connection
failure and then continued writing to the same mpegts file.
It seems the server had disconnected after 30 min and the
streams have changed from then on:

11:35:35.096 frame=107726 fps=371 q=29.0 size=  588032kB time=00:29:57.59 bitrate=2682.9kbits/s throttle=off speed=6.18x    
11:35:35.596 frame=107907 fps=371 q=28.0 size=  589312kB time=00:30:00.62 bitrate=2684.2kbits/s throttle=off speed=6.18x    
11:35:35.995 [mpeg2_cuvid @ 0x699a40] AVHWFramesContext is already initialized with incompatible parameters
11:35:35.995 [mpeg2_cuvid @ 0x699a40] ctx->cvdl->cuvidParseVideoData(ctx->cuparser, &cupkt) failed -> CUDA_ERROR_UNKNOWN: unknown error
11:35:35.995 Error while decoding stream #0:0: Generic error in an external library
11:35:35.998 [mpeg2_cuvid @ 0x699a40] ctx->cvdl->cuvidParseVideoData(ctx->cuparser, &cupkt) failed -> CUDA_ERROR_UNKNOWN: unknown error
11:35:35.998 Error while decoding stream #0:0: Generic error in an external library
11:35:36.003 [mpeg2_cuvid @ 0x699a40] ctx->cvdl->cuvidParseVideoData(ctx->cuparser, &cupkt) failed -> CUDA_ERROR_UNKNOWN: unknown error

We can't know what "incompatible parameters" actually means. It could
be the frame size, but it could also be a different codec (like H264
instead of MPEG2) or both, or interlaced/non-interlaced.


The other case was similar. The user had eventually admitted:

"I used ffmpeg and a bash script to concat the 3x videos into a single episode"

and that the codecs might have been different. Here it fails right from 
the start as the "-ss 00:07:57.000" is probably jumping right into the second
segment which differs from the probe results.
(total length 22min)

I remember now that I had constructed test files like this, but with 
much shorter "bad parts". The ffmpeg parser could read over it (at least
somewhat and eventually recover, while the cuvid parser never came back.
But that was just to find out whether the cuvid error state is terminal
or not. The ability to recover doesn’t help when a stream change is 
permanent (= not an erroneous incident for a few seconds).


As such, the requirement was simply: when that happens, ffmpeg should exit.
(instead of feeding the cuvid zombie to infinity)

Best regards,
softworkz
Timo Rothenpieler Nov. 22, 2022, 10:59 p.m. UTC | #11
On 22.11.2022 17:02, Soft Works wrote:
> 
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> Timo Rothenpieler
>> Sent: Tuesday, November 22, 2022 4:16 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check: propagate
>> AVERROR_UNRECOVERABLE when needed
>>
>>
>>
>> On 22/11/2022 14:31, James Almer wrote:
>>> On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
>>>> On 22/11/2022 14:07, James Almer wrote:
>>>>> Based on a patch by Soft Works.
>>>>>
>>>>> Signed-off-by: James Almer <jamrial@gmail.com>
>>>>> ---
>>>>>    libavutil/cuda_check.h | 4 ++++
>>>>>    1 file changed, 4 insertions(+)
>>>>>
>>>>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
>>>>> index f5a9234eaf..33aaf9c098 100644
>>>>> --- a/libavutil/cuda_check.h
>>>>> +++ b/libavutil/cuda_check.h
>>>>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
>>>>>            av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name,
>>>>> err_string);
>>>>>        av_log(avctx, AV_LOG_ERROR, "\n");
>>>>> +    // Not recoverable
>>>>> +    if (err == CUDA_ERROR_UNKNOWN)
>>>>> +        return AVERROR_UNRECOVERABLE;
>>>>
>>>> Why does specifically CUDA_ERROR_UNKNOWN get mapped to
>> unrecoverable?
>>>
>>> It's the code that Soft Works found out was returned repeatedly no
>>> matter how many packets you fed to the encoder, which meant it was
>> stuck
>>> in an unrecoverable state. See
>>> http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
>>>
>>> If you know of cases where this error could be returned in valid
>>> recoverable scenarios that are not already handled in some other
>> way,
>>> what do you suggest could be done?
>>
>> It's more that that very much depends on the context the function is
>> used in.
>> In the context of an active de/encoding loop, every non-success
>> return
>> would be unrecoverable. Or even during init.
>>
>> While in other places a failure just breaks a single frame and if it
>> somehow recovery, it can continue on. Though that's usually the
>> exception.
> 
> Can you give an example for this in ffmpeg where CUDA returns an
> error and ffmpeg continues?

Almost all the filters will continue when one filtering step fails.

> At the time when I had submitted the patch, I had replied this to James
> after he had argued that the behavior would be correct as is and
> a user who would want ffmpeg to exit in such case, must specify
> the -xerror parameter.
> 
> 
> softworkz:
>> When there's an error in a filter => ffmpeg exits
>> When there's an error in an encoder => ffmpeg exits
>> When there's an error initializing a decoder or an encoder
>> => ffmpeg exits
>>
>> So why shouldn't ffmpeg exit when there's an unrecoverable error
>> in a decoder?
> 
> 
> AFAIR, in case of filtering and encoding, any "normal" CUDA
> error is already sufficient to cause ffmpeg to exit.
> 
> 
>> In any case, this function does not seem the right place to map an
>> arbitrary return code to such a result.
> 
> That depends on whether you consider CUDA_ERROR_UNKNOWN as
> generally unrecoverable. When you do - like I did - then
> it's the right place IMO.
> 
> If you doubt it, then yes - it wouldn't be the right place.

Why is the function not unconditionally returning an unrecoverable error 
then?
Like, 90% of current error returns, independent of CUDA, in 
filters/codecs can be considered UNRECOVERABLE then. Which does make 
sense that you want them to about.
But you in turn give up the informational character of error codes.

Makes me think this should be some kind of additional flag instead.
Soft Works Nov. 22, 2022, 11:49 p.m. UTC | #12
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Timo Rothenpieler
> Sent: Tuesday, November 22, 2022 11:59 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check: propagate
> AVERROR_UNRECOVERABLE when needed
> 
> On 22.11.2022 17:02, Soft Works wrote:
> >
> >
> >> -----Original Message-----
> >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> >> Timo Rothenpieler
> >> Sent: Tuesday, November 22, 2022 4:16 PM
> >> To: ffmpeg-devel@ffmpeg.org
> >> Subject: Re: [FFmpeg-devel] [PATCH 3/4] avutil/cuda_check:
> propagate
> >> AVERROR_UNRECOVERABLE when needed
> >>
> >>
> >>
> >> On 22/11/2022 14:31, James Almer wrote:
> >>> On 11/22/2022 10:21 AM, Timo Rothenpieler wrote:
> >>>> On 22/11/2022 14:07, James Almer wrote:
> >>>>> Based on a patch by Soft Works.
> >>>>>
> >>>>> Signed-off-by: James Almer <jamrial@gmail.com>
> >>>>> ---
> >>>>>    libavutil/cuda_check.h | 4 ++++
> >>>>>    1 file changed, 4 insertions(+)
> >>>>>
> >>>>> diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
> >>>>> index f5a9234eaf..33aaf9c098 100644
> >>>>> --- a/libavutil/cuda_check.h
> >>>>> +++ b/libavutil/cuda_check.h
> >>>>> @@ -49,6 +49,10 @@ static inline int ff_cuda_check(void *avctx,
> >>>>>            av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name,
> >>>>> err_string);
> >>>>>        av_log(avctx, AV_LOG_ERROR, "\n");
> >>>>> +    // Not recoverable
> >>>>> +    if (err == CUDA_ERROR_UNKNOWN)
> >>>>> +        return AVERROR_UNRECOVERABLE;
> >>>>
> >>>> Why does specifically CUDA_ERROR_UNKNOWN get mapped to
> >> unrecoverable?
> >>>
> >>> It's the code that Soft Works found out was returned repeatedly
> no
> >>> matter how many packets you fed to the encoder, which meant it
> was
> >> stuck
> >>> in an unrecoverable state. See
> >>> http://ffmpeg.org/pipermail/ffmpeg-devel/2021-October/287153.html
> >>>
> >>> If you know of cases where this error could be returned in valid
> >>> recoverable scenarios that are not already handled in some other
> >> way,
> >>> what do you suggest could be done?
> >>
> >> It's more that that very much depends on the context the function
> is
> >> used in.
> >> In the context of an active de/encoding loop, every non-success
> >> return
> >> would be unrecoverable. Or even during init.
> >>
> >> While in other places a failure just breaks a single frame and if
> it
> >> somehow recovery, it can continue on. Though that's usually the
> >> exception.
> >
> > Can you give an example for this in ffmpeg where CUDA returns an
> > error and ffmpeg continues?
> 
> Almost all the filters will continue when one filtering step fails.

When looking at vf_scale_cuda, all calls are checked and return
AVERROR_EXTERNAL back in case of error. Where is the "one error
is forgiven" handling?


> > At the time when I had submitted the patch, I had replied this to
> James
> > after he had argued that the behavior would be correct as is and
> > a user who would want ffmpeg to exit in such case, must specify
> > the -xerror parameter.
> >
> >
> > softworkz:
> >> When there's an error in a filter => ffmpeg exits
> >> When there's an error in an encoder => ffmpeg exits
> >> When there's an error initializing a decoder or an encoder
> >> => ffmpeg exits
> >>
> >> So why shouldn't ffmpeg exit when there's an unrecoverable error
> >> in a decoder?
> >
> >
> > AFAIR, in case of filtering and encoding, any "normal" CUDA
> > error is already sufficient to cause ffmpeg to exit.
> >
> >
> >> In any case, this function does not seem the right place to map an
> >> arbitrary return code to such a result.
> >
> > That depends on whether you consider CUDA_ERROR_UNKNOWN as
> > generally unrecoverable. When you do - like I did - then
> > it's the right place IMO.
> >
> > If you doubt it, then yes - it wouldn't be the right place.
> 
> Why is the function not unconditionally returning an unrecoverable
> error
> then?
> Like, 90% of current error returns, independent of CUDA, in
> filters/codecs can be considered UNRECOVERABLE then. 

Why "then"? I'm only considering CUDA_ERROR_UNKNOWN as unrecoverable
I have no knowledge about the other errors. That's why I 
would make the change only affect CUDA_ERROR_UNKNOWN.


> Which does make
> sense that you want them to about.
> But you in turn give up the informational character of error codes.

That's why I wouldn't to that.

> Makes me think this should be some kind of additional flag instead.

And update all the hundreds of usages?

I have no bias in this regard (I can easily say that because James
has submitted :-)

Best wishes,
softworkz
diff mbox series

Patch

diff --git a/libavutil/cuda_check.h b/libavutil/cuda_check.h
index f5a9234eaf..33aaf9c098 100644
--- a/libavutil/cuda_check.h
+++ b/libavutil/cuda_check.h
@@ -49,6 +49,10 @@  static inline int ff_cuda_check(void *avctx,
         av_log(avctx, AV_LOG_ERROR, " -> %s: %s", err_name, err_string);
     av_log(avctx, AV_LOG_ERROR, "\n");
 
+    // Not recoverable
+    if (err == CUDA_ERROR_UNKNOWN)
+        return AVERROR_UNRECOVERABLE;
+
     return AVERROR_EXTERNAL;
 }