
[FFmpeg-devel] pthread_frame: attempt to get frame to reduce latency

Message ID 20200310093640.79848-1-jianhui.j.dai@intel.com
State New

Checks

Context Check Description
andriy/ffmpeg-patchwork success Make fate finished

Commit Message

Dai, Jianhui J March 10, 2020, 9:36 a.m. UTC
Avoid constant N-frame latency in video streaming.

Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
---
 libavcodec/pthread_frame.c | 17 ++---------------
 1 file changed, 2 insertions(+), 15 deletions(-)

Comments

Carl Eugen Hoyos March 10, 2020, 10:18 a.m. UTC | #1
Am Di., 10. März 2020 um 10:37 Uhr schrieb Jianhui Dai
<jianhui.j.dai@intel.com>:
>
> Avoid constant N-frame latency in video streaming.

Please add some numbers to the commit message,
if possible without using hardware acceleration.

Carl Eugen
Hendrik Leppkes March 10, 2020, 11:33 a.m. UTC | #2
On Tue, Mar 10, 2020 at 10:37 AM Jianhui Dai <jianhui.j.dai@intel.com> wrote:
>
> Avoid constant N-frame latency in video streaming.
>

Personally, I'm of the opinion that a predictable constant delay in
this case is better than a variable, ever-changing delay.

- Hendrik
Kieran Kunhya March 10, 2020, 12:41 p.m. UTC | #3
On Tue, 10 Mar 2020 at 12:07, Hendrik Leppkes <h.leppkes@gmail.com> wrote:

> On Tue, Mar 10, 2020 at 10:37 AM Jianhui Dai <jianhui.j.dai@intel.com>
> wrote:
> >
> > Avoid constant N-frame latency in video streaming.
> >
>
> Personally, I'm of the opinion that a predictable constant delay in
> this case is better than a variable, ever-changing delay.


I agree. Furthermore, I do not understand how this patch could be allowed.

Kieran
Fu, Linjie March 10, 2020, 1:30 p.m. UTC | #4
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Hendrik Leppkes
> Sent: Tuesday, March 10, 2020 19:34
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH] pthread_frame: attempt to get frame
> to reduce latency
> 
> On Tue, Mar 10, 2020 at 10:37 AM Jianhui Dai <jianhui.j.dai@intel.com> wrote:
> >
> > Avoid constant N-frame latency in video streaming.
> >
> 
> Personally, I'm of the opinion that a predictable constant delay in
> this case is better than a variable, ever-changing delay.
> 
Would it be better to make this optional for the user, instead of strictly
constant or variable latency, so that the default behavior isn't broken?

(like "-flags low_delay", which adds AV_CODEC_FLAG_LOW_DELAY to the user context
and disables frame threading)

In this case, maybe a new flag (not sure whether an identical flag already exists) which allows
frame threading without the constant N frames cached.
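
A minimal sketch of how such a flag would be used from the caller's side (the low_delay flag is real; the second flag below is purely hypothetical):

#include <libavcodec/avcodec.h>

/* Sketch: request the existing low-delay path, which trades frame threading
 * away entirely.  A new "frame threading without the constant N-frame cache"
 * flag, as suggested above, would be wired up the same way. */
static int open_low_delay(AVCodecContext *avctx, const AVCodec *codec)
{
    avctx->flags |= AV_CODEC_FLAG_LOW_DELAY;          /* existing flag */
    /* avctx->flags2 |= AV_CODEC_FLAG2_EAGER_FRAMES;     hypothetical new flag */
    return avcodec_open2(avctx, codec, NULL);
}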

Thanks,

- Linjie
Derek Buitenhuis March 10, 2020, 2:37 p.m. UTC | #5
On 10/03/2020 11:33, Hendrik Leppkes wrote:
> Personally, I'm of the opinion that a predictable constant delay in
> this case is better than a variable, ever-changing delay.

I second this. A lot of software and use cases benefit from (or require) this.

Not to mention non-determinism...

- Derek
Derek Buitenhuis March 10, 2020, 2:40 p.m. UTC | #6
On 10/03/2020 09:36, Jianhui Dai wrote:
> Avoid constant N-frame latency in video streaming.
> 
> Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
> ---
>  libavcodec/pthread_frame.c | 17 ++---------------
>  1 file changed, 2 insertions(+), 15 deletions(-)

In addition to the points already raised (and which I agree with) about
calculable N-frame delay vs unknowable delay (which affects frame-accurate
seeking), I have one other point/question:

Does this not make the delay non-deterministic depending on which threads finish first?

- Derek
Dai, Jianhui J March 11, 2020, 2:29 a.m. UTC | #7
Thanks, I will update the commit message.

I tested it with FFmpeg's native software H.264 decoder.
With the previous FF_THREAD_FRAME behavior, the latency is a constant N (= thread_count - 1) frames.
It does not sync thread state until no idle threads are available, so N frames are cached internally even when some are already ready for output.
E.g. in RTSP 30fps streaming playback with 16 frame threads (the default), the internal frame caching contributes 495ms (33ms x 15 frames) of latency.
And the latency is the same whether the video resolution is 640x480 or 4320x2160.
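
As a sanity check of those numbers, a minimal standalone sketch (illustration only, not part of the patch):

#include <stdio.h>

/* Back-of-the-envelope model of the delay described above: with frame
 * threading, N = thread_count - 1 frames are buffered before the first
 * output, one frame interval each. */
static double frame_thread_latency_ms(int thread_count, double fps)
{
    int cached_frames = thread_count - 1;
    return cached_frames * (1000.0 / fps);
}

int main(void)
{
    /* 16 threads at 30fps: 15 x 33.3ms, i.e. roughly the 495ms above */
    printf("%.0f ms\n", frame_thread_latency_ms(16, 30.0));
    return 0;
}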

In this patch, I check the thread state and output frames as early as possible, to reduce the constant frame-caching latency.

Thanks,
Jianhui Dai

-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Carl Eugen Hoyos
Sent: Tuesday, March 10, 2020 6:19 PM
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH] pthread_frame: attempt to get frame to reduce latency

Am Di., 10. März 2020 um 10:37 Uhr schrieb Jianhui Dai
<jianhui.j.dai@intel.com>:
>
> Avoid constant N-frame latency in video streaming.

Please add some numbers to the commit message, if possible without using hardware acceleration.

Carl Eugen
Kieran Kunhya March 11, 2020, 2:40 a.m. UTC | #8
On Wed, 11 Mar 2020 at 02:29, Dai, Jianhui J <jianhui.j.dai@intel.com>
wrote:

> Thanks, I will update the commit message.
>
> I tested it with FFmpeg's native software H.264 decoder.
> With the previous FF_THREAD_FRAME behavior, the latency is a constant
> N (= thread_count - 1) frames.
> It does not sync thread state until no idle threads are available, so N
> frames are cached internally even when some are already ready for output.
> E.g. in RTSP 30fps streaming playback with 16 frame threads (the default),
> the internal frame caching contributes 495ms (33ms x 15 frames) of latency.
> And the latency is the same whether the video resolution is 640x480 or
> 4320x2160.
>
> In this patch, I check the thread state and output frames as early as
> possible, to reduce the constant frame-caching latency.
>
> Thanks,
> Jianhui Dai
>

Unless I misunderstand, surely this outputs frames nondeterministically?
I.e. you don't know that the thread which has finished is the next one that
is due.

Not to mention the variable latency problems.

Kieran
Dai, Jianhui J March 11, 2020, 2:59 a.m. UTC | #9
It does not break FFmpeg's output frame management logic (e.g. h264_select_output_frame); it just enters that phase earlier.

Perhaps my understanding is not correct.
In my opinion, the renderer should be the one handling the variable latency issue, not the decoder.
The decoder should try its best to generate frames ASAP (it MUST still follow the standard).

And some FFmpeg users call libavcodec's FF_THREAD_FRAME directly and handle the variable latency in their own render code.
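
For reference, a minimal sketch of that caller-side pattern with the standard send/receive API (error handling trimmed):

#include <libavcodec/avcodec.h>

/* Push one packet, then drain every frame the decoder will return, so the
 * renderer rather than the decoder absorbs any remaining latency. */
static int decode_and_drain(AVCodecContext *avctx, const AVPacket *pkt,
                            AVFrame *frame)
{
    int ret = avcodec_send_packet(avctx, pkt);
    if (ret < 0)
        return ret;

    while ((ret = avcodec_receive_frame(avctx, frame)) >= 0) {
        /* hand the frame to the renderer / video sink here */
        av_frame_unref(frame);
    }
    return (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) ? 0 : ret;
}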

Thanks,
Jianhui Dai

-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Kieran Kunhya
Sent: Wednesday, March 11, 2020 10:41 AM
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH] pthread_frame: attempt to get frame to reduce latency

On Wed, 11 Mar 2020 at 02:29, Dai, Jianhui J <jianhui.j.dai@intel.com>
wrote:

> Thanks, I will update the commit message.
>
> I tested it with FFmpeg's native software H.264 decoder.
> With the previous FF_THREAD_FRAME behavior, the latency is a constant
> N (= thread_count - 1) frames.
> It does not sync thread state until no idle threads are available, so N
> frames are cached internally even when some are already ready for output.
> E.g. in RTSP 30fps streaming playback with 16 frame threads (the default),
> the internal frame caching contributes 495ms (33ms x 15 frames) of latency.
> And the latency is the same whether the video resolution is 640x480 or
> 4320x2160.
>
> In this patch, I check the thread state and output frames as early as
> possible, to reduce the constant frame-caching latency.
>
> Thanks,
> Jianhui Dai
>

Unless I misunderstand, surely this outputs frames nondeterministically?
I.e. you don't know that the thread which has finished is the next one that is due.

Not to mention the variable latency problems.

Kieran
Dai, Jianhui J March 11, 2020, 3:28 a.m. UTC | #10
As I replied in the other thread, the sequence of output frames still follows the standard, e.g. display-order POC in H.264.
A big enough frame cache can guarantee a deterministic delay to some extent, but not always (decoding time > caching time).

E.g. in FF_THREAD_FRAME 4320x2160 30fps video streaming with 4 threads, the frame caching is 99ms (33ms x 3 frames).
If the CPU decoding time per frame is 80ms ~ 120ms (depending on video frame content),
the constant frame latency cannot be met.

Thanks,
Jianhui Dai 

-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Derek Buitenhuis
Sent: Tuesday, March 10, 2020 10:40 PM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] pthread_frame: attempt to get frame to reduce latency

On 10/03/2020 09:36, Jianhui Dai wrote:
> Avoid constant N-frame latency in video streaming.
> 
> Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
> ---
>  libavcodec/pthread_frame.c | 17 ++---------------
>  1 file changed, 2 insertions(+), 15 deletions(-)

In addition to the points already raised (and which I agree with) about calculable N-frame delay vs unknowable delay (which affects frame-accurate seeking), I have one other point/question:

Does this not make the delay non-deterministic depending on which threads finish first?

- Derek
Kieran Kunhya March 11, 2020, 3:41 a.m. UTC | #11
On Wed, 11 Mar 2020 at 02:59, Dai, Jianhui J <jianhui.j.dai@intel.com>
wrote:

> It does not break FFmpeg's output frame management logic (e.g.
> h264_select_output_frame); it just enters that phase earlier.
>
> Perhaps my understanding is not correct.
> In my opinion, the renderer should be the one handling the variable
> latency issue, not the decoder.
>

How does the renderer know what the maximum latency of the upstream
pipeline is?

Kieran
Dai, Jianhui J March 11, 2020, 5:29 a.m. UTC | #12
In offline transcoding mode, the max latency is controllable.
But in video streaming usages, we cannot expect to know the max latency in advance; even the decoder cannot know it.
With RTSP/RTMP/DASH, the latency varies with network bandwidth.
Letting the video sink or renderer at the end of the pipeline control the frame cache size is more flexible.
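
A minimal sketch of such a sink-side frame cache (purely illustrative; the names and size are hypothetical):

#include <libavutil/error.h>
#include <libavutil/frame.h>

/* Hypothetical renderer-side cache: the sink, not the decoder, decides how
 * many decoded frames to hold against jitter. */
#define SINK_CACHE_SIZE 4

typedef struct FrameCache {
    AVFrame *slot[SINK_CACHE_SIZE];
    int head, count;
} FrameCache;

static int cache_push(FrameCache *c, AVFrame *f)
{
    if (c->count == SINK_CACHE_SIZE)
        return AVERROR(EAGAIN);              /* full: present a frame first */
    c->slot[(c->head + c->count) % SINK_CACHE_SIZE] = f;
    c->count++;
    return 0;
}

static AVFrame *cache_pop(FrameCache *c)
{
    AVFrame *f;
    if (!c->count)
        return NULL;                         /* underrun: e.g. repeat the last frame */
    f = c->slot[c->head];
    c->head = (c->head + 1) % SINK_CACHE_SIZE;
    c->count--;
    return f;
}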

Thanks,
Jianhui Dai

-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Kieran Kunhya
Sent: Wednesday, March 11, 2020 11:41 AM
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH] pthread_frame: attempt to get frame to reduce latency

On Wed, 11 Mar 2020 at 02:59, Dai, Jianhui J <jianhui.j.dai@intel.com>
wrote:

> It does not break FFmpeg's output frame management logic (e.g.
> h264_select_output_frame); it just enters that phase earlier.
>
> Perhaps my understanding is not correct.
> In my opinion, the renderer should be the one handling the variable
> latency issue, not the decoder.
>

How does the renderer know what the maximum latency of the upstream pipeline is?

Kieran
Michael Niedermayer March 11, 2020, 9:55 a.m. UTC | #13
On Tue, Mar 10, 2020 at 05:36:40PM +0800, Jianhui Dai wrote:
> Avoid constant N-frame latency in video streaming.
> 
> Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
> ---
>  libavcodec/pthread_frame.c | 17 ++---------------
>  1 file changed, 2 insertions(+), 15 deletions(-)

This patch causes segfaults

./ffmpeg_g -i tickets/466/Brazil.ts -t 3 random.avi

Thread 1 "ffmpeg_g" received signal SIGSEGV, Segmentation fault.
0x0000555555e64a5d in submit_packet (p=0x5555578c5e80, user_avctx=0x55555787fac0, avpkt=0x555557845340) at libavcodec/pthread_frame.c:380
380	    PerThreadContext *prev_thread = fctx->prev_thread;
(gdb) bt
#0  0x0000555555e64a5d in submit_packet (p=0x5555578c5e80, user_avctx=0x55555787fac0, avpkt=0x555557845340) at libavcodec/pthread_frame.c:380
#1  0x0000555555e64fea in ff_thread_decode_frame (avctx=0x55555787fac0, picture=0x555557845080, got_picture_ptr=0x7fffffffd344, avpkt=0x555557845340) at libavcodec/pthread_frame.c:486
#2  0x0000555555bc5431 in decode_simple_internal (avctx=0x55555787fac0, frame=0x555557845080) at libavcodec/decode.c:430
#3  0x0000555555bc6050 in decode_simple_receive_frame (avctx=0x55555787fac0, frame=0x555557845080) at libavcodec/decode.c:628
#4  0x0000555555bc6120 in decode_receive_frame_internal (avctx=0x55555787fac0, frame=0x555557845080) at libavcodec/decode.c:646
#5  0x0000555555bc6380 in avcodec_send_packet (avctx=0x55555787fac0, avpkt=0x7fffffffd530) at libavcodec/decode.c:704
#6  0x0000555555694420 in decode (avctx=0x55555787fac0, frame=0x555557906280, got_frame=0x7fffffffd67c, pkt=0x7fffffffd530) at fftools/ffmpeg.c:2232
#7  0x0000555555694cd0 in decode_video (ist=0x5555578f6900, pkt=0x7fffffffd6c0, got_output=0x7fffffffd67c, duration_pts=0x7fffffffd6b8, eof=0, decode_failed=0x7fffffffd680) at fftools/ffmpeg.c:2374
#8  0x0000555555695dbd in process_input_packet (ist=0x5555578f6900, pkt=0x7fffffffd900, no_eof=0) at fftools/ffmpeg.c:2615
#9  0x000055555569d74a in process_input (file_index=0) at fftools/ffmpeg.c:4509
#10 0x000055555569dd0b in transcode_step () at fftools/ffmpeg.c:4629
#11 0x000055555569de88 in transcode () at fftools/ffmpeg.c:4683
#12 0x000055555569e7ae in main (argc=6, argv=0x7fffffffe268) at fftools/ffmpeg.c:4885

https://samples.ffmpeg.org/ffmpeg-bugs/trac/ticket466/

[...]
Derek Buitenhuis March 11, 2020, 2:28 p.m. UTC | #14
On 11/03/2020 03:28, Dai, Jianhui J wrote:
> As I replied in the other thread, the sequence of output frames still follows the standard, e.g. display-order POC in H.264.
> A big enough frame cache can guarantee a deterministic delay to some extent, but not always (decoding time > caching time).

That was my point. As an API user, I *do not want* the headache of a non-deterministic amount of delay,
both from a debugging standpoint and a usability standpoint. Purposely adding non-determinism in this
way gets a NAK from me, for whatever my opinion is worth on this list...

(Aside: Having a calculable and deterministic delay is important for frame-accurate seeking (using
e.g. a packet-to-frame map), since you *cannot* just 'decode until $timestamp': monotonic timestamps
aren't a guarantee in reality (e.g. MPEG-TS).)

> E.g. in FF_THREAD_FRAME 4320x2160 30fps video streaming with 4 threads, the frame caching is 99ms (33ms x 3 frames).
> If the CPU decoding time per frame is 80ms ~ 120ms (depending on video frame content),

Also aside: It is not useful to measure frame delay in time. It's also not, IMO, meaningful to use
in-flight (wallclock) time.

> the constant frame latency cannot be met.

[...]

- Derek
Derek Buitenhuis March 11, 2020, 2:32 p.m. UTC | #15
On 11/03/2020 05:29, Dai, Jianhui J wrote:
> With RTSP/RTMP/DASH, the latency varies with network bandwidth.

He is not talking about wallclock or even latency in the time domain.

Latency as in 'the number of packets passed in before a frame is output', which this
patch makes both non-deterministic and non-calculable/non-knowable.

At no point does the network, or even the time domain, factor into this definition.
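
A sketch of how an API user could measure latency in exactly that sense (illustrative only; the packet-source callback is hypothetical, and packet ownership/error handling are elided):

#include <libavcodec/avcodec.h>

/* Count how many packets go in before the first frame comes out.  With
 * current frame threading this is a fixed number for a given stream and
 * thread count; with the patch it could vary from run to run. */
static int packets_before_first_frame(AVCodecContext *avctx,
                                      AVPacket *(*next_pkt)(void *opaque),
                                      void *opaque)
{
    AVFrame *frame = av_frame_alloc();
    int n = 0;

    for (;;) {
        AVPacket *pkt = next_pkt(opaque);
        if (!pkt)
            break;
        avcodec_send_packet(avctx, pkt);
        n++;
        if (avcodec_receive_frame(avctx, frame) >= 0)
            break;                           /* first frame after n packets */
    }
    av_frame_free(&frame);
    return n;
}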

- Derek
Devin Heitmueller March 11, 2020, 2:53 p.m. UTC | #16
On Wed, Mar 11, 2020 at 10:28 AM Derek Buitenhuis
<derek.buitenhuis@gmail.com> wrote:
> > E.g. in FF_THREAD_FRAME 4320x2160 30fps video streaming with 4 threads, the frame caching is 99ms (33ms x 3 frames).
> > If the CPU decoding time per frame is 80ms ~ 120ms (depending on video frame content),
>
> Also aside: It is not useful to measure frame delay in time. It's also not, IMO, meaningful to use
> in-flight (wallclock) time.

Regardless of the actual proposed patch, I think the author's use of
wallclock time to describe the problem is very reasonable.  I do a
large amount of work where I'm measuring "glass-to-glass" latency,
where I am interested in the total pipeline (encode/network/decode),
and I definitely went through the experience of trying to figure out
why ffmpeg was introducing nearly 500ms of extra latency during
decoding.  It turned out that it was because my particular platform
had 8 cores and thus 16 decoding threads, and thus 15x33ms of delay,
regardless of the stream complexity.

You may not personally care about latency, but there are lots of
people operating in a world where actual latency matters (e.g. news
interviews) and they want to be able to use ffmpeg for decoding in
such environments.  The "problem" is not how many threads are used.
The problem is "there's way too much latency" and proposed solutions
include reducing the thread count or changing the heuristic for
dequeuing frames from the thread pool.

Devin
Kieran Kunhya March 11, 2020, 3:02 p.m. UTC | #17
>
> Regardless of the actual proposed patch, I think the author's use of
> wallclock time to describe the problem is very reasonable.  I do a
> large amount of work where I'm measuring "glass-to-glass" latency,
> where I am interested in the total pipeline (encode/network/decode),
> and I definitely went through the experience of trying to figure out
> why ffmpeg was introducing nearly 500ms of extra latency during
> decoding.  It turned out that it was because my particular platform
> had 8 cores and thus 16 decoding threads, and thus 15x33ms of delay,
> regardless of the stream complexity.
>

You shouldn't be using frame threads for this kind of application.

Kieran
Derek Buitenhuis March 11, 2020, 3:14 p.m. UTC | #18
On 11/03/2020 14:53, Devin Heitmueller wrote:
> Regardless of the actual proposed patch, I think the author's use of
> wallclock time to describe the problem is very reasonable.  I do a
> large amount of work where I'm measuring "glass-to-glass" latency,
> where I am interested in the total pipeline (encode/network/decode),
> and I definitely went through the experience of trying to figure out
> why ffmpeg was introducing nearly 500ms of extra latency during
> decoding.  It turned out that it was because my particular platform
> had 8 cores and thus 16 decoding threads, and thus 15x33ms of delay,
> regardless of the stream complexity.

Heavy disagree. The measurement is *specifically* referring to an API call
here, and it *specifically* affects the delay, in frames. The email in question
is conflating timestamps (33ms per frame) with wallclock time later on. It is
not a meaningful comparison to make. Only pain lies down the path of mixing
timestamps and wallclock like that.

Glass-to-glass measurement is important too, but don't conflate the two.

For what it's worth, I pick deterministic delay (in frames! packets-in-to-frames-out)
over the possibility of less delay but non-deterministic every day of the week.
For my own sanity. *Certainly* not as the default and only mode of operation.

> You may not personally care about latency, but there are lots of
> people operating in a world where actual latency matters (e.g. news
> interviews) and they want to be able to use ffmpeg for decoding in
> such environments.  The "problem" is not how many threads are used.
> The problem is "there's way too much latency" and proposed solutions
> include reducing the thread count or changing the heuristic for
> dequeuing frames from the thread pool.

You are again conflating wallclock latency with frame delay latency. Having
a deterministic amount of decoding delay (in frames, packets-in-to-frames-out)
will certainly cause some wallclock latency, but it *IS NOT* (N x 33ms) (or whatever
timestamp your framerate causes to be the case). It is confusing and not useful to
mix the two concepts like this.

And again, I will choose a deterministic API every day of the week. In this case,
by deterministic I mean 'the exact same code and file/stream will take the same
number of packets in before outputting a frame', not 'it takes the same amount
of time'.

I understand it's important for some use cases to have the fastest glass-to-glass
time, but it's not worth destroying the API with non-determinism to accomplish that.
At the *very least* it should be behind a flag or option so people can choose to
shoot their debugger in the foot by choice.

But mostly I want people to stop conflating timestamps and wallclock, as well as
decoding delay (in frames/packets) vs wallclock-time-to-process. They're both important
but nobody wins by mixing the two up like this.

- Derek
Martin Storsjö March 11, 2020, 8:42 p.m. UTC | #19
On Wed, 11 Mar 2020, Derek Buitenhuis wrote:

> On 11/03/2020 14:53, Devin Heitmueller wrote:
>> Regardless of the actual proposed patch, I think the author's use of
>> wallclock time to describe the problem is very reasonable.  I do a
>> large amount of work where I'm measuring "glass-to-glass" latency,
>> where I am interested in the total pipeline (encode/network/decode),
>> and I definitely went through the experience of trying to figure out
>> why ffmpeg was introducing nearly 500ms of extra latency during
>> decoding.  It turned out that it was because my particular platform
>> had 8 cores and thus 16 decoding threads, and thus 15x33ms of delay,
>> regardless of the stream complexity.
>
> Heavy disagree. The measurement is *specifically* referring to an API call
> here, and it *specifically* effects the delay, in frames. The email in question
> is conflating timestamps (33ms) per frame with wallclock time later on. It is
> not a meaningful comparison to make. Only pain lies down the path of mixing
> timestamps and wallclock like that.
>
> Glass-to-glass measurement is important too, but don't conflate the two.
>
> For what it's worth, I pick deterministic delay (in frames! packets-in-to-frames-out)
> over the possibility of less delay but non-deterministic every day of the week.
> For my own sanity. *Certainly* not as the default and only mode of operation.

FWIW, while I agree it shouldn't be the default, I have occasionally 
considered the need for this particular feature.

Consider a live stream with a very variable framerate, essentially varying 
in the full range between 0 and 60 fps. To cope with decoding the high end 
of the framerate range, one needs to have frame threading enabled - maybe 
not with something crazy like 16 threads, but say at least 5 or so.

Then you need to feed 5 packets into the decoder before you get the first 
frame output (for a stream without any reordering).

Now if packets are received at 60 fps, you get one new packet to feed the 
decoder per 16 ms, and you get the first frame to output 83 ms later, 
assuming that the decoding of that individual frame on that thread took 
less than 83 ms.

However, if the rate of input packets drops to e.g. 1 packet per second, 
it will take 5 seconds before I have 5 packets to feed to the decoder, 
before I have the first frame output, even though it actually was finished 
decoding in say less than 100 ms after the first input packet was given to 
the decoder.

So in such a setup, being able to fetch output frames from the decoder 
sooner would be very useful - giving a closer to fixed decoding time in 
wallclock time, regardless of the packet input rate.
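
A toy model of that arithmetic (assuming the per-frame decode time overlaps with packet collection):

#include <stdio.h>

/* With frame threading and no reordering, the first output frame appears
 * once thread_count packets have been fed, or once decoding of the first
 * frame finishes, whichever is later. */
static double first_frame_ms(int threads, double pkt_interval_ms,
                             double decode_ms)
{
    double fill_ms = threads * pkt_interval_ms;
    return fill_ms > decode_ms ? fill_ms : decode_ms;
}

int main(void)
{
    printf("60 fps input: %.0f ms\n", first_frame_ms(5, 16.7, 83.0));    /* ~83 ms */
    printf("1 fps input:  %.0f ms\n", first_frame_ms(5, 1000.0, 100.0)); /* ~5000 ms */
    return 0;
}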

// Martin
Derek Buitenhuis March 11, 2020, 10:43 p.m. UTC | #20
On 11/03/2020 20:42, Martin Storsjö wrote:
> FWIW, while I agree it shouldn't be the default, I have occasionally 
> considered the need for this particular feature.

Arguably slice threading should be used, but that assumes you have sane input,
which is obviously not always the case for some.

> Consider a live stream with a very variable framerate, essentially varying 
> in the full range between 0 and 60 fps. To cope with decoding the high end 
> of the framerate range, one needs to have frame threading enabled - maybe 
> not with something crazy like 16 threads, but say at least 5 or so.
> 
> Then you need to feed 5 packets into the decoder before you get the first 
> frame output (for a stream without any reordering).

That last bit is key there, but yes.

> 
> Now if packets are received at 60 fps, you get one new packet to feed the 
> decoder per 16 ms, and you get the first frame to output 83 ms later, 
> assuming that the decoding of that individual frame on that thread took 
> less than 83 ms
(I'm assuming network, etc. has been left out for example's sake. :))

> However, if the rate of input packets drops to e.g. 1 packet per second, 
> it will take 5 seconds before I have 5 packets to feed to the decoder, 
> before I have the first frame output, even though it actually was finished 
> decoding in say less than 100 ms after the first input packet was given to 
> the decoder.
> 
> So in such a setup, being able to fetch output frames from the decoder 
> sooner would be very useful - giving a closer to fixed decoding time in 
> wallclock time, regardless of the packet input rate.

Not sure I would refer to it as closer to fixed, but the use case is certainly
valid - I never claimed otherwise.

If it is added, it needs to be behind a flag/option with big bold letters stating
the risks, and off by default. And that segfault Michael saw needs to be investigated.

Thanks for the clear response that doesn't conflate the two.

- Derek
Dai, Jianhui J March 12, 2020, 2 a.m. UTC | #21
Thanks for clarifying the packets-to-frames delay. I will respect that decoder design principle.

Let me add some background on this issue.
I am working on 4K@30fps streaming playback on a 16-core host.
Initially I used FF_THREAD_SLICE, but "avcodec_send_packet" took ~100ms, giving only 10fps playback (I want 30fps).
That is not good for UHD video decoding and does not fully utilize modern multi-core CPUs.

Then I switched to FF_THREAD_FRAME (slice parallelism disabled); it achieves 30fps, but the latency increases to ~500ms.

Ideally, I wish I could have FF_THREAD_FRAME + FF_THREAD_SLICE simultaneously :)
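
(Both can at least be requested today, since thread_type is a bitmask, though libavcodec then enables only what the codec supports - a minimal sketch:)

#include <libavcodec/avcodec.h>

/* Sketch: request both threading types; for H.264 today libavcodec picks
 * one or the other, it does not run them simultaneously. */
static int open_with_threading(AVCodecContext *avctx, const AVCodec *codec)
{
    avctx->thread_count = 0;  /* 0 = auto-detect the core count */
    avctx->thread_type  = FF_THREAD_FRAME | FF_THREAD_SLICE;
    return avcodec_open2(avctx, codec, NULL);
}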

Thanks,
Jianhui Dai

-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Derek Buitenhuis
Sent: Thursday, March 12, 2020 6:44 AM
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH] pthread_frame: attempt to get frame to reduce latency

On 11/03/2020 20:42, Martin Storsjö wrote:
> FWIW, while I agree it shouldn't be the default, I have occasionally 
> considered the need for this particular feature.

Arguably slice threading should be used, but that assumes you have sane input, which is obviously not always the case for some.

> Consider a live stream with a very variable framerate, essentially 
> varying in the full range between 0 and 60 fps. To cope with decoding 
> the high end of the framerate range, one needs to have frame threading 
> enabled - maybe not with something crazy like 16 threads, but say at least 5 or so.
> 
> Then you need to feed 5 packets into the decoder before you get the 
> first frame output (for a stream without any reordering).

That last bit is key there, but yes.

> 
> Now if packets are received at 60 fps, you get one new packet to feed 
> the decoder per 16 ms, and you get the first frame to output 83 ms 
> later, assuming that the decoding of that individual frame on that 
> thread took less than 83 ms
(I'm assuming network, etc. has been left out for example's sake. :))

> However, if the rate of input packets drops to e.g. 1 packet per 
> second, it will take 5 seconds before I have 5 packets to feed to the 
> decoder, before I have the first frame output, even though it actually 
> was finished decoding in say less than 100 ms after the first input 
> packet was given to the decoder.
> 
> So in such a setup, being able to fetch output frames from the decoder 
> sooner would be very useful - giving a closer to fixed decoding time 
> in wallclock time, regardless of the packet input rate.

Not sure I would refer to it as closer to fixed, but the use case is certainly valid - I never claimed otherwise.

If it is added, it needs to be behind a flag/option with big bold letters stating the risks, and off by default. And that segfault Michael saw needs to be investigated.

Thanks for the clear response that doesn't conflate the two.

- Derek
Dai, Jianhui J March 12, 2020, 10:05 a.m. UTC | #22
Apologies for the segfaults.
I did pass FATE and some basic tests myself before submitting for review.
I have fixed it locally and will wait for other review comments.

Thanks,
Jianhui Dai

-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Michael Niedermayer
Sent: Wednesday, March 11, 2020 5:56 PM
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH] pthread_frame: attempt to get frame to reduce latency

On Tue, Mar 10, 2020 at 05:36:40PM +0800, Jianhui Dai wrote:
> Avoid constant N-frame latency in video streaming.
> 
> Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
> ---
>  libavcodec/pthread_frame.c | 17 ++---------------
>  1 file changed, 2 insertions(+), 15 deletions(-)

This patch causes segfaults

./ffmpeg_g -i tickets/466/Brazil.ts -t 3 random.avi

Thread 1 "ffmpeg_g" received signal SIGSEGV, Segmentation fault.
0x0000555555e64a5d in submit_packet (p=0x5555578c5e80, user_avctx=0x55555787fac0, avpkt=0x555557845340) at libavcodec/pthread_frame.c:380
380	    PerThreadContext *prev_thread = fctx->prev_thread;
(gdb) bt
#0  0x0000555555e64a5d in submit_packet (p=0x5555578c5e80, user_avctx=0x55555787fac0, avpkt=0x555557845340) at libavcodec/pthread_frame.c:380
#1  0x0000555555e64fea in ff_thread_decode_frame (avctx=0x55555787fac0, picture=0x555557845080, got_picture_ptr=0x7fffffffd344, avpkt=0x555557845340) at libavcodec/pthread_frame.c:486
#2  0x0000555555bc5431 in decode_simple_internal (avctx=0x55555787fac0, frame=0x555557845080) at libavcodec/decode.c:430
#3  0x0000555555bc6050 in decode_simple_receive_frame (avctx=0x55555787fac0, frame=0x555557845080) at libavcodec/decode.c:628
#4  0x0000555555bc6120 in decode_receive_frame_internal (avctx=0x55555787fac0, frame=0x555557845080) at libavcodec/decode.c:646
#5  0x0000555555bc6380 in avcodec_send_packet (avctx=0x55555787fac0, avpkt=0x7fffffffd530) at libavcodec/decode.c:704
#6  0x0000555555694420 in decode (avctx=0x55555787fac0, frame=0x555557906280, got_frame=0x7fffffffd67c, pkt=0x7fffffffd530) at fftools/ffmpeg.c:2232
#7  0x0000555555694cd0 in decode_video (ist=0x5555578f6900, pkt=0x7fffffffd6c0, got_output=0x7fffffffd67c, duration_pts=0x7fffffffd6b8, eof=0, decode_failed=0x7fffffffd680) at fftools/ffmpeg.c:2374
#8  0x0000555555695dbd in process_input_packet (ist=0x5555578f6900, pkt=0x7fffffffd900, no_eof=0) at fftools/ffmpeg.c:2615
#9  0x000055555569d74a in process_input (file_index=0) at fftools/ffmpeg.c:4509
#10 0x000055555569dd0b in transcode_step () at fftools/ffmpeg.c:4629
#11 0x000055555569de88 in transcode () at fftools/ffmpeg.c:4683
#12 0x000055555569e7ae in main (argc=6, argv=0x7fffffffe268) at fftools/ffmpeg.c:4885

https://samples.ffmpeg.org/ffmpeg-bugs/trac/ticket466/

[...]
James Almer March 12, 2020, 12:18 p.m. UTC | #23
On 3/12/2020 7:05 AM, Dai, Jianhui J wrote:
> Apologies for the segfaults.
> I did pass FATE and some basic tests myself before submitting for review.

It probably doesn't need to be mentioned, but since this affects frame
threading, make sure to run "make fate THREADS=N THREAD_TYPE=frame" or
similar.

> I have fixed it locally and will wait for other review comments.
> 
> Thanks,
> Jianhui Dai
> 
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Michael Niedermayer
> Sent: Wednesday, March 11, 2020 5:56 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH] pthread_frame: attempt to get frame to reduce latency
> 
> On Tue, Mar 10, 2020 at 05:36:40PM +0800, Jianhui Dai wrote:
>> Avoid constant N-frame latency in video streaming.
>>
>> Signed-off-by: Jianhui Dai <jianhui.j.dai@intel.com>
>> ---
>>  libavcodec/pthread_frame.c | 17 ++---------------
>>  1 file changed, 2 insertions(+), 15 deletions(-)
> 
> This patch causes segfaults
> 
> ./ffmpeg_g -i tickets/466/Brazil.ts -t 3 random.avi
> 
> Thread 1 "ffmpeg_g" received signal SIGSEGV, Segmentation fault.
> 0x0000555555e64a5d in submit_packet (p=0x5555578c5e80, user_avctx=0x55555787fac0, avpkt=0x555557845340) at libavcodec/pthread_frame.c:380
> 380	    PerThreadContext *prev_thread = fctx->prev_thread;
> (gdb) bt
> #0  0x0000555555e64a5d in submit_packet (p=0x5555578c5e80, user_avctx=0x55555787fac0, avpkt=0x555557845340) at libavcodec/pthread_frame.c:380
> #1  0x0000555555e64fea in ff_thread_decode_frame (avctx=0x55555787fac0, picture=0x555557845080, got_picture_ptr=0x7fffffffd344, avpkt=0x555557845340) at libavcodec/pthread_frame.c:486
> #2  0x0000555555bc5431 in decode_simple_internal (avctx=0x55555787fac0, frame=0x555557845080) at libavcodec/decode.c:430
> #3  0x0000555555bc6050 in decode_simple_receive_frame (avctx=0x55555787fac0, frame=0x555557845080) at libavcodec/decode.c:628
> #4  0x0000555555bc6120 in decode_receive_frame_internal (avctx=0x55555787fac0, frame=0x555557845080) at libavcodec/decode.c:646
> #5  0x0000555555bc6380 in avcodec_send_packet (avctx=0x55555787fac0, avpkt=0x7fffffffd530) at libavcodec/decode.c:704
> #6  0x0000555555694420 in decode (avctx=0x55555787fac0, frame=0x555557906280, got_frame=0x7fffffffd67c, pkt=0x7fffffffd530) at fftools/ffmpeg.c:2232
> #7  0x0000555555694cd0 in decode_video (ist=0x5555578f6900, pkt=0x7fffffffd6c0, got_output=0x7fffffffd67c, duration_pts=0x7fffffffd6b8, eof=0, decode_failed=0x7fffffffd680) at fftools/ffmpeg.c:2374
> #8  0x0000555555695dbd in process_input_packet (ist=0x5555578f6900, pkt=0x7fffffffd900, no_eof=0) at fftools/ffmpeg.c:2615
> #9  0x000055555569d74a in process_input (file_index=0) at fftools/ffmpeg.c:4509
> #10 0x000055555569dd0b in transcode_step () at fftools/ffmpeg.c:4629
> #11 0x000055555569de88 in transcode () at fftools/ffmpeg.c:4683
> #12 0x000055555569e7ae in main (argc=6, argv=0x7fffffffe268) at fftools/ffmpeg.c:4885
> 
> https://samples.ffmpeg.org/ffmpeg-bugs/trac/ticket466/
> 
> [...]
>

Patch

diff --git a/libavcodec/pthread_frame.c b/libavcodec/pthread_frame.c
index b5bd494474..aeb42800ef 100644
--- a/libavcodec/pthread_frame.c
+++ b/libavcodec/pthread_frame.c
@@ -130,11 +130,6 @@  typedef struct FrameThreadContext {
 
     int next_decoding;             ///< The next context to submit a packet to.
     int next_finished;             ///< The next context to return output from.
-
-    int delaying;                  /**<
-                                    * Set for the first N packets, where N is the number of threads.
-                                    * While it is set, ff_thread_en/decode_frame won't return any results.
-                                    */
 } FrameThreadContext;
 
 #define THREAD_SAFE_CALLBACKS(avctx) \
@@ -492,14 +487,8 @@  int ff_thread_decode_frame(AVCodecContext *avctx,
     if (err)
         goto finish;
 
-    /*
-     * If we're still receiving the initial packets, don't return a frame.
-     */
-
-    if (fctx->next_decoding > (avctx->thread_count-1-(avctx->codec_id == AV_CODEC_ID_FFV1)))
-        fctx->delaying = 0;
-
-    if (fctx->delaying) {
+    if (((fctx->next_decoding + 1) % avctx->thread_count) != finished &&
+        atomic_load(&fctx->threads[finished].state) != STATE_INPUT_READY) {
         *got_picture_ptr=0;
         if (avpkt->size) {
             err = avpkt->size;
@@ -763,7 +752,6 @@  int ff_frame_thread_init(AVCodecContext *avctx)
     pthread_cond_init(&fctx->async_cond, NULL);
 
     fctx->async_lock = 1;
-    fctx->delaying = 1;
 
     for (i = 0; i < thread_count; i++) {
         AVCodecContext *copy = av_malloc(sizeof(AVCodecContext));
@@ -854,7 +842,6 @@  void ff_thread_flush(AVCodecContext *avctx)
     }
 
     fctx->next_decoding = fctx->next_finished = 0;
-    fctx->delaying = 1;
     fctx->prev_thread = NULL;
     for (i = 0; i < avctx->thread_count; i++) {
         PerThreadContext *p = &fctx->threads[i];