diff mbox series

[FFmpeg-devel] avcodec/v4l2_m2m_dec: dequeue frame if input isn't ready

Message ID 20211214021215.456493-1-aicommander@gmail.com
State Accepted
Commit 30322ebe3c55d0fb18bea4ae04d0fcaf1f97d27f
Headers show
Series [FFmpeg-devel] avcodec/v4l2_m2m_dec: dequeue frame if input isn't ready | expand

Checks

Context Check Description
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished
andriy/make_ppc success Make finished
andriy/make_fate_ppc success Make fate finished

Commit Message

Cameron Gutman Dec. 14, 2021, 2:12 a.m. UTC
The V4L2M2M API operates asynchronously, so multiple packets can
be enqueued before getting a batch of frames back. Since it was
only possible to receive a frame by submitting another packet,
there wasn't a way to drain those excess output frames from when
avcodec_receive_frame() returned AVERROR(EAGAIN).

In my testing, this change reduced decode latency of a real-time
60 FPS H.264 stream by approximately 10x (200ms -> 20ms) on a
Raspberry Pi 4.

Signed-off-by: Cameron Gutman <aicommander@gmail.com>
---
 libavcodec/v4l2_m2m_dec.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Andriy Gelman Dec. 25, 2021, 7:28 a.m. UTC | #1
On Tue, 14. Dec 02:12, Cameron Gutman wrote:
> The V4L2M2M API operates asynchronously, so multiple packets can
> be enqueued before getting a batch of frames back. Since it was
> only possible to receive a frame by submitting another packet,
> there wasn't a way to drain those excess output frames from when
> avcodec_receive_frame() returned AVERROR(EAGAIN).
> 
> In my testing, this change reduced decode latency of a real-time
> 60 FPS H.264 stream by approximately 10x (200ms -> 20ms) on a
> Raspberry Pi 4.
> 
> Signed-off-by: Cameron Gutman <aicommander@gmail.com>
> ---
>  libavcodec/v4l2_m2m_dec.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/libavcodec/v4l2_m2m_dec.c b/libavcodec/v4l2_m2m_dec.c
> index 224eb3d5e7..b0c3d30ac8 100644
> --- a/libavcodec/v4l2_m2m_dec.c
> +++ b/libavcodec/v4l2_m2m_dec.c
> @@ -142,8 +142,12 @@ static int v4l2_receive_frame(AVCodecContext *avctx, AVFrame *frame)
>  
>      if (!s->buf_pkt.size) {
>          ret = ff_decode_get_packet(avctx, &s->buf_pkt);
> -        if (ret < 0 && ret != AVERROR_EOF)
> -            return ret;
> +        if (ret < 0) {
> +            if (ret == AVERROR(EAGAIN))
> +                return ff_v4l2_context_dequeue_frame(capture, frame, 0);
> +            else if (ret != AVERROR_EOF)
> +                return ret;
> +        }
>      }

The change lgtm. I tested on Rpi4 and odroid xu4.
Will apply this on Sunday if no one else comments.

Thanks,
Andriy Gelman Dec. 27, 2021, 6:17 a.m. UTC | #2
On Tue, 14. Dec 02:12, Cameron Gutman wrote:
> The V4L2M2M API operates asynchronously, so multiple packets can
> be enqueued before getting a batch of frames back. Since it was
> only possible to receive a frame by submitting another packet,
> there wasn't a way to drain those excess output frames from when
> avcodec_receive_frame() returned AVERROR(EAGAIN).
> 

> In my testing, this change reduced decode latency of a real-time
> 60 FPS H.264 stream by approximately 10x (200ms -> 20ms) on a
> Raspberry Pi 4.

I was doing some more tests today, but didn't have any luck dequeuing a frame
if ff_decode_get_packet() returned EAGAIN.
Could you share the dataset?

Thanks,
Cameron Gutman Dec. 27, 2021, 9:18 a.m. UTC | #3
On 12/27/21 00:17, Andriy Gelman wrote:
> On Tue, 14. Dec 02:12, Cameron Gutman wrote:
>> The V4L2M2M API operates asynchronously, so multiple packets can
>> be enqueued before getting a batch of frames back. Since it was
>> only possible to receive a frame by submitting another packet,
>> there wasn't a way to drain those excess output frames from when
>> avcodec_receive_frame() returned AVERROR(EAGAIN).
>>
> 
>> In my testing, this change reduced decode latency of a real-time
>> 60 FPS H.264 stream by approximately 10x (200ms -> 20ms) on a
>> Raspberry Pi 4.
> 
> I was doing some more tests today, but didn't have any luck dequeuing a frame
> if ff_decode_get_packet() returned EAGAIN.

Hmm, maybe there is something different about your test harness?
I'm receiving 720p 60 FPS H.264 ES in real-time from the network.

For each H.264 encoded frame I receive off the network, my basic
approach is like this (simplified for brevity):

avcodec_send_packet(&pkt);
do {
    frame = av_frame_alloc();
    if ((err = avcodec_receive_frame(frame)) == 0) {
        render_frame(frame);
    }
} while (err == 0);

I'll usually get EAGAIN immediately for the first few frames I submit
(so no output frame yet), but then I'll get a batch of output frames
back after the first completed decode. That drains the excess latency
from the pipeline to avoid always being behind.

For cases where we want to prioritize latency over throughput, I've
had success with this approach too:

avcodec_send_packet(&pkt);
while (avcodec_receive_frame(frame) == AVERROR(EAGAIN)) {
    msleep(1);
}
render_frame(frame);

In this case, we can retry avcodec_receive_frame() until we get the
frame back that we just submitted for decoding.

The patch here enables both of these use-cases by allowing V4L2M2M
to retry getting a decoded frame without new input data. Both of
these also work with MMAL after the recent decoupled dataflow patch.

> Could you share the dataset?
> 

It is 720p60.h264 from here:
https://onedrive.live.com/?authkey=%21ALoKfcPfFeKyhzs&id=C15BF9770619F56%21165617&cid=0C15BF9770619F56

> Thanks,
> 

Thank you
Andriy Gelman Dec. 28, 2021, 11:41 p.m. UTC | #4
On Mon, 27. Dec 03:18, Cameron Gutman wrote:
> On 12/27/21 00:17, Andriy Gelman wrote:
> > On Tue, 14. Dec 02:12, Cameron Gutman wrote:
> >> The V4L2M2M API operates asynchronously, so multiple packets can
> >> be enqueued before getting a batch of frames back. Since it was
> >> only possible to receive a frame by submitting another packet,
> >> there wasn't a way to drain those excess output frames from when
> >> avcodec_receive_frame() returned AVERROR(EAGAIN).
> >>
> > 
> >> In my testing, this change reduced decode latency of a real-time
> >> 60 FPS H.264 stream by approximately 10x (200ms -> 20ms) on a
> >> Raspberry Pi 4.
> > 
> > I was doing some more tests today, but didn't have any luck dequeuing a frame
> > if ff_decode_get_packet() returned EAGAIN.
> 
> Hmm, maybe there is something different about your test harness?
> I'm receiving 720p 60 FPS H.264 ES in real-time from the network.

I used ffmpeg, i.e.
./ffmpeg -codec:v h264_v4l2m2m -i 720p60.h264 -f null -

Anyway, I applied your patch but just removed the comment on latency
because I couldn't reproduce it.

> 
> For each H.264 encoded frame I receive off the network, my basic
> approach is like this (simplified for brevity):
> 
> avcodec_send_packet(&pkt);
> do {
>     frame = av_frame_alloc();
>     if ((err = avcodec_receive_frame(frame)) == 0) {
>         render_frame(frame);
>     }
> } while (err == 0);
> 
> I'll usually get EAGAIN immediately for the first few frames I submit
> (so no output frame yet), but then I'll get a batch of output frames
> back after the first completed decode. That drains the excess latency
> from the pipeline to avoid always being behind.
> 
> For cases where we want to prioritize latency over throughput, I've
> had success with this approach too:
> 
> avcodec_send_packet(&pkt);
> while (avcodec_receive_frame(frame) == AVERROR(EAGAIN)) {
>     msleep(1);
> }
> render_frame(frame);
> 
> In this case, we can retry avcodec_receive_frame() until we get the
> frame back that we just submitted for decoding.

> 
> The patch here enables both of these use-cases by allowing V4L2M2M
> to retry getting a decoded frame without new input data. Both of
> these also work with MMAL after the recent decoupled dataflow patch.
> 
> > Could you share the dataset?
> > 
> 

> It is 720p60.h264 from here:
> https://onedrive.live.com/?authkey=%21ALoKfcPfFeKyhzs&id=C15BF9770619F56%21165617&cid=0C15BF9770619F56

I could not decode this dataset on rpi4 (before or after the patch). Tried to upgrade kernel
to 5.16, but still same problem. Odroid xu4 seems to work fine.

I found a repo in github https://github.com/jc-kynesim/rpi-ffmpeg (I guess Rpi4
team), where the decoding is working. Doing a bisect points to this commit that fixes:
https://github.com/jc-kynesim/rpi-ffmpeg/commit/3e60c08552110d9da1b3b5d3637c6bd8b53eec69

There are lots of changes in the commit, so I'm not sure what exactly fixed the
problem or if it was by chance. It would be good to figure out.
Andriy Gelman Jan. 2, 2022, 4:56 p.m. UTC | #5
On Tue, 28. Dec 18:41, Andriy Gelman wrote:
> On Mon, 27. Dec 03:18, Cameron Gutman wrote:
> > On 12/27/21 00:17, Andriy Gelman wrote:
> > > On Tue, 14. Dec 02:12, Cameron Gutman wrote:
> > >> The V4L2M2M API operates asynchronously, so multiple packets can
> > >> be enqueued before getting a batch of frames back. Since it was
> > >> only possible to receive a frame by submitting another packet,
> > >> there wasn't a way to drain those excess output frames from when
> > >> avcodec_receive_frame() returned AVERROR(EAGAIN).
> > >>
> > > 
> > >> In my testing, this change reduced decode latency of a real-time
> > >> 60 FPS H.264 stream by approximately 10x (200ms -> 20ms) on a
> > >> Raspberry Pi 4.
> > > 
> > > I was doing some more tests today, but didn't have any luck dequeuing a frame
> > > if ff_decode_get_packet() returned EAGAIN.
> > 
> > Hmm, maybe there is something different about your test harness?
> > I'm receiving 720p 60 FPS H.264 ES in real-time from the network.
> 
> I used ffmpeg, i.e.
> ./ffmpeg -codec:v h264_v4l2m2m -i 720p60.h264 -f null -
> 
> Anyway, I applied your patch but just removed the comment on latency
> because I couldn't reproduce it.
> 
> > 
> > For each H.264 encoded frame I receive off the network, my basic
> > approach is like this (simplified for brevity):
> > 
> > avcodec_send_packet(&pkt);
> > do {
> >     frame = av_frame_alloc();
> >     if ((err = avcodec_receive_frame(frame)) == 0) {
> >         render_frame(frame);
> >     }
> > } while (err == 0);
> > 
> > I'll usually get EAGAIN immediately for the first few frames I submit
> > (so no output frame yet), but then I'll get a batch of output frames
> > back after the first completed decode. That drains the excess latency
> > from the pipeline to avoid always being behind.
> > 
> > For cases where we want to prioritize latency over throughput, I've
> > had success with this approach too:
> > 
> > avcodec_send_packet(&pkt);
> > while (avcodec_receive_frame(frame) == AVERROR(EAGAIN)) {
> >     msleep(1);
> > }
> > render_frame(frame);
> > 
> > In this case, we can retry avcodec_receive_frame() until we get the
> > frame back that we just submitted for decoding.
> 
> > 
> > The patch here enables both of these use-cases by allowing V4L2M2M
> > to retry getting a decoded frame without new input data. Both of
> > these also work with MMAL after the recent decoupled dataflow patch.
> > 
> > > Could you share the dataset?
> > > 
> > 
> 
> > It is 720p60.h264 from here:
> > https://onedrive.live.com/?authkey=%21ALoKfcPfFeKyhzs&id=C15BF9770619F56%21165617&cid=0C15BF9770619F56

> 
> I could not decode this dataset on rpi4 (before or after the patch). Tried to upgrade kernel
> to 5.16, but still same problem. Odroid xu4 seems to work fine.

This patch fixes decoding on the RPi4 for me:
https://patchwork.ffmpeg.org/project/ffmpeg/patch/20210819085533.1174-2-ming.qian@nxp.com/

It would be nice to get the fix into the upcoming release.
diff mbox series

Patch

diff --git a/libavcodec/v4l2_m2m_dec.c b/libavcodec/v4l2_m2m_dec.c
index 224eb3d5e7..b0c3d30ac8 100644
--- a/libavcodec/v4l2_m2m_dec.c
+++ b/libavcodec/v4l2_m2m_dec.c
@@ -142,8 +142,12 @@  static int v4l2_receive_frame(AVCodecContext *avctx, AVFrame *frame)
 
     if (!s->buf_pkt.size) {
         ret = ff_decode_get_packet(avctx, &s->buf_pkt);
-        if (ret < 0 && ret != AVERROR_EOF)
-            return ret;
+        if (ret < 0) {
+            if (ret == AVERROR(EAGAIN))
+                return ff_v4l2_context_dequeue_frame(capture, frame, 0);
+            else if (ret != AVERROR_EOF)
+                return ret;
+        }
     }
 
     if (s->draining)