diff mbox series

[FFmpeg-devel] amfenc: Use a blocking call instead of sleeping and polling

Message ID d788c258-f65b-4f87-8ae4-e9a753af6416@jkqxz.net
State New
Headers show
Series [FFmpeg-devel] amfenc: Use a blocking call instead of sleeping and polling | expand

Checks

Context Check Description
andriy/configure_x86 warning Failed to apply patch
yinshiyou/configure_loongarch64 warning Failed to apply patch

Commit Message

Mark Thompson Oct. 18, 2023, 8:36 p.m. UTC
---
On 17/10/2023 18:11, Evgeny Pavlov wrote:
> The reason for using av_usleep() here is that AMF API doesn’t provide an
> API for explicit wait. There are two modes to get output from encoder:
> 
> 1. Polling with some sleep to avoid CPU thrashing – currently used in FFmpeg
> 
> 2. Set timeout parameter on AMF encoder and QueryOutput call will block
> till output is available or the timeout happens.
> 
> #2 is the preferable way but it is designed more to be used with a separate
> polling thread. With a single-thread approach in FFmpeg, the use of timeout
> can block input submission making things slower.  This is even more
> pronounced when B-frames are enabled and several inputs are needed to produce
> the first output.

This approach seems like it should work here?  Run non-blocking until the queue is full, then switch to blocking when you need to wait for some output.

I tried the patch enclosing (H.264 only, different proprties needed for other codecs), but it doesn't seem to work - the test assert always hits immediately and timing shows that QueryOutput didn't block even though the timeout should be set?  I'm probably doing something incorrect, maybe you would know how to fix it.

> The condition of this sleep is in special events (primarily when amf input
> queue is full), not the core loop part. During the experiments the cpu
> increasing is about 2-4% or so, not a burst.

What cases are you experimenting with?

The most problematic case I can think of is multiple encodes running simultaneously sharing the same instance so that each one has to wait for others to complete and therefore all queues fill up.

The busy wait will end up being the only place where it can block (since everything else runs asynchronously), so you will peg one CPU at close to 100% per encode running.

Thanks,

- Mark

  libavcodec/amfenc.c | 22 +++++++++++++++++++---
  libavcodec/amfenc.h |  1 +
  2 files changed, 20 insertions(+), 3 deletions(-)

Comments

Evgeny Pavlov Oct. 19, 2023, 4:13 p.m. UTC | #1
On Wed, Oct 18, 2023 at 10:36 PM Mark Thompson <sw@jkqxz.net> wrote:

> ---
> On 17/10/2023 18:11, Evgeny Pavlov wrote:
> > The reason for using av_usleep() here is that AMF API doesn’t provide an
> > API for explicit wait. There are two modes to get output from encoder:
> >
> > 1. Polling with some sleep to avoid CPU thrashing – currently used in
> FFmpeg
> >
> > 2. Set timeout parameter on AMF encoder and QueryOutput call will block
> > till output is available or the timeout happens.
> >
> > #2 is the preferable way but it is designed more to be used with a
> separate
> > polling thread. With a single-thread approach in FFmpeg, the use of
> timeout
> > can block input submission making things slower.  This is even more
> > pronounced when B-frames are enabled and several inputs are needed to
> produce
> > the first output.
>
> This approach seems like it should work here?  Run non-blocking until the
> queue is full, then switch to blocking when you need to wait for some
> output.
>
> I tried the patch enclosing (H.264 only, different proprties needed for
> other codecs), but it doesn't seem to work - the test assert always hits
> immediately and timing shows that QueryOutput didn't block even though the
> timeout should be set?  I'm probably doing something incorrect, maybe you
> would know how to fix it.
>
> > The condition of this sleep is in special events (primarily when amf
> input
> > queue is full), not the core loop part. During the experiments the cpu
> > increasing is about 2-4% or so, not a burst.
>
> What cases are you experimenting with?
>
> The most problematic case I can think of is multiple encodes running
> simultaneously sharing the same instance so that each one has to wait for
> others to complete and therefore all queues fill up.
>
> The busy wait will end up being the only place where it can block (since
> everything else runs asynchronously), so you will peg one CPU at close to
> 100% per encode running.
>
> Thanks,
>
> - Mark
>
>   libavcodec/amfenc.c | 22 +++++++++++++++++++---
>   libavcodec/amfenc.h |  1 +
>   2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c
> index 061859f85c..db7ddbb083 100644
> --- a/libavcodec/amfenc.c
> +++ b/libavcodec/amfenc.c
> @@ -713,13 +713,22 @@ int ff_amf_receive_packet(AVCodecContext *avctx,
> AVPacket *avpkt)
>           }
>       }
>
> -
> +    block_and_wait = 0;
>       do {
> -        block_and_wait = 0;
>           // poll data
>           if (!avpkt->data && !avpkt->buf) {
> +            int64_t timeout = block_and_wait ? 100 : 0;
> +            if (timeout != ctx->output_query_timeout) {
> +                av_log(avctx, AV_LOG_INFO, "Set output query timeout to
> %"PRId64"\n", timeout);
> +                AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder,
> AMF_VIDEO_ENCODER_QUERY_TIMEOUT, timeout);
> +                AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN,
> "Failed to set output query timeout\n");
> +                ctx->output_query_timeout = timeout;
> +            }
> +
>               res_query = ctx->encoder->pVtbl->QueryOutput(ctx->encoder,
> &data);
>               if (data) {
> +                av_log(avctx, AV_LOG_INFO, "QueryOutput returned with
> data\n");
> +
>                   // copy data to packet
>                   AMFBuffer *buffer;
>                   AMFGuid guid = IID_AMFBuffer();
> @@ -740,7 +749,13 @@ int ff_amf_receive_packet(AVCodecContext *avctx,
> AVPacket *avpkt)
>                   data->pVtbl->Release(data);
>
>                   AMF_RETURN_IF_FALSE(ctx, ret >= 0, ret,
> "amf_copy_buffer() failed with error %d\n", ret);
> +            } else {
> +                av_log(avctx, AV_LOG_INFO, "QueryOutput returned with
> nothing (%d)\n", res_query);
> +                // For testing, shouldn't hit this unless machine is
> otherwise very loaded.
> +                av_assert0(!block_and_wait);
>               }
> +
> +            block_and_wait = 0;
>           }
>           res_resubmit = AMF_OK;
>           if (ctx->delayed_surface != NULL) { // try to resubmit frame
> @@ -769,8 +784,9 @@ int ff_amf_receive_packet(AVCodecContext *avctx,
> AVPacket *avpkt)
>
>           if (query_output_data_flag == 0) {
>               if (res_resubmit == AMF_INPUT_FULL || ctx->delayed_drain ||
> (ctx->eof && res_query != AMF_EOF) || (ctx->hwsurfaces_in_queue >=
> ctx->hwsurfaces_in_queue_max)) {
> +                av_log(avctx, AV_LOG_INFO, "Need to wait for output\n");
>                   block_and_wait = 1;
> -                av_usleep(1000);
> +                //av_usleep(1000);
>               }
>           }
>       } while (block_and_wait);
> diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h
> index 2dbd378ef8..64c77115b6 100644
> --- a/libavcodec/amfenc.h
> +++ b/libavcodec/amfenc.h
> @@ -72,6 +72,7 @@ typedef struct AmfContext {
>       int                 delayed_drain;
>       AMFSurface         *delayed_surface;
>       AVFrame            *delayed_frame;
> +    int64_t             output_query_timeout;
>
>       // shift dts back by max_b_frames in timing
>       AVFifo             *timestamp_list;
> --
> 2.39.2
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>

Dynamic switching between non-blocking & blocking approaches isn’t
supported in AMF at this time.

We might request to implement this feature for AMF team, but it might took
some time to implement this.

I would suggest using av_usleep(500) until this feature is implemented.

> What cases are you experimenting with?

This issue is very easy to reproduce when:

1) low resolution transcoding

2) hardware accelerated decoding

The command line sample:  ffmpeg -hwaccel d3d11va -hwaccel_output_format
d3d11 -i input_480x360_h264.mp4 -c:v hevc_amf  output_480x360_hevc.mp4
Mark Thompson Oct. 22, 2023, 2:30 p.m. UTC | #2
On 19/10/2023 17:13, Evgeny Pavlov wrote:
> On Wed, Oct 18, 2023 at 10:36 PM Mark Thompson <sw@jkqxz.net> wrote:
> 
>> ---
>> On 17/10/2023 18:11, Evgeny Pavlov wrote:
>>> The reason for using av_usleep() here is that AMF API doesn’t provide an
>>> API for explicit wait. There are two modes to get output from encoder:
>>>
>>> 1. Polling with some sleep to avoid CPU thrashing – currently used in
>> FFmpeg
>>>
>>> 2. Set timeout parameter on AMF encoder and QueryOutput call will block
>>> till output is available or the timeout happens.
>>>
>>> #2 is the preferable way but it is designed more to be used with a
>> separate
>>> polling thread. With a single-thread approach in FFmpeg, the use of
>> timeout
>>> can block input submission making things slower.  This is even more
>>> pronounced when B-frames are enabled and several inputs are needed to
>> produce
>>> the first output.
>>
>> This approach seems like it should work here?  Run non-blocking until the
>> queue is full, then switch to blocking when you need to wait for some
>> output.
>>
>> I tried the patch enclosing (H.264 only, different proprties needed for
>> other codecs), but it doesn't seem to work - the test assert always hits
>> immediately and timing shows that QueryOutput didn't block even though the
>> timeout should be set?  I'm probably doing something incorrect, maybe you
>> would know how to fix it.
>>
>>> The condition of this sleep is in special events (primarily when amf
>> input
>>> queue is full), not the core loop part. During the experiments the cpu
>>> increasing is about 2-4% or so, not a burst.
>>
>> What cases are you experimenting with?
>>
>> The most problematic case I can think of is multiple encodes running
>> simultaneously sharing the same instance so that each one has to wait for
>> others to complete and therefore all queues fill up.
>>
>> The busy wait will end up being the only place where it can block (since
>> everything else runs asynchronously), so you will peg one CPU at close to
>> 100% per encode running.
>>
>> Thanks,
>>
>> - Mark
>>
>>    libavcodec/amfenc.c | 22 +++++++++++++++++++---
>>    libavcodec/amfenc.h |  1 +
>>    2 files changed, 20 insertions(+), 3 deletions(-)
>>
>> ...
> 
> Dynamic switching between non-blocking & blocking approaches isn’t
> supported in AMF at this time.
> 
> We might request to implement this feature for AMF team, but it might took
> some time to implement this.

That is unfortunate, but it sounds like something like this is required.

> I would suggest using av_usleep(500) until this feature is implemented.
> 
>> What cases are you experimenting with?
> 
> This issue is very easy to reproduce when:
> 
> 1) low resolution transcoding
> 
> 2) hardware accelerated decoding
> 
> The command line sample:  ffmpeg -hwaccel d3d11va -hwaccel_output_format
> d3d11 -i input_480x360_h264.mp4 -c:v hevc_amf  output_480x360_hevc.mp4

To clarify, I meant: what cases are you experimenting with to verify that this doesn't cause problems elsewhere?

I agree (and can reproduce) that the specific case with one low-resolution stream slightly improves throughput at the cost of increased CPU use.

 >> The most problematic case I can think of is multiple encodes running
 >> simultaneously sharing the same instance so that each one has to wait for
 >> others to complete and therefore all queues fill up.
 >>
 >> The busy wait will end up being the only place where it can block (since
 >> everything else runs asynchronously), so you will peg one CPU at close to
 >> 100% per encode running.

I tried this case with two 4K streams and indeed it is a huge regression.  CPU use goes from 1-2% of one core for both streams to spinning on two cores, around a 100x increase.

Total throughput also decreased by about 10% in my testing, though since I'm running on a low-power device that might be an artefact of the CPU spinning wasting so much power that other clocks are reduced.

(My test was two instances of

$ ./ffmpeg_g.exe -extra_hw_frames 100 -hwaccel d3d11va -hwaccel_output_format d3d11 -i input-4k.mp4 -an -vf loop=loop=20:size=100:start=0 -c:v h264_amf -f null -

running simulataneously, looking at the steady state in the loop after the first hundred frames with the decoder are complete.)

Please consider this patch rejected in its current form.  IMO this is a hole in the AMF API and it needs to be improved to be able to wait for operations to complete rather than polling in the user code.

Thanks,

- Mark
diff mbox series

Patch

diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c
index 061859f85c..db7ddbb083 100644
--- a/libavcodec/amfenc.c
+++ b/libavcodec/amfenc.c
@@ -713,13 +713,22 @@  int ff_amf_receive_packet(AVCodecContext *avctx, AVPacket *avpkt)
          }
      }

-
+    block_and_wait = 0;
      do {
-        block_and_wait = 0;
          // poll data
          if (!avpkt->data && !avpkt->buf) {
+            int64_t timeout = block_and_wait ? 100 : 0;
+            if (timeout != ctx->output_query_timeout) {
+                av_log(avctx, AV_LOG_INFO, "Set output query timeout to %"PRId64"\n", timeout);
+                AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder, AMF_VIDEO_ENCODER_QUERY_TIMEOUT, timeout);
+                AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "Failed to set output query timeout\n");
+                ctx->output_query_timeout = timeout;
+            }
+
              res_query = ctx->encoder->pVtbl->QueryOutput(ctx->encoder, &data);
              if (data) {
+                av_log(avctx, AV_LOG_INFO, "QueryOutput returned with data\n");
+
                  // copy data to packet
                  AMFBuffer *buffer;
                  AMFGuid guid = IID_AMFBuffer();
@@ -740,7 +749,13 @@  int ff_amf_receive_packet(AVCodecContext *avctx, AVPacket *avpkt)
                  data->pVtbl->Release(data);

                  AMF_RETURN_IF_FALSE(ctx, ret >= 0, ret, "amf_copy_buffer() failed with error %d\n", ret);
+            } else {
+                av_log(avctx, AV_LOG_INFO, "QueryOutput returned with nothing (%d)\n", res_query);
+                // For testing, shouldn't hit this unless machine is otherwise very loaded.
+                av_assert0(!block_and_wait);
              }
+
+            block_and_wait = 0;
          }
          res_resubmit = AMF_OK;
          if (ctx->delayed_surface != NULL) { // try to resubmit frame
@@ -769,8 +784,9 @@  int ff_amf_receive_packet(AVCodecContext *avctx, AVPacket *avpkt)

          if (query_output_data_flag == 0) {
              if (res_resubmit == AMF_INPUT_FULL || ctx->delayed_drain || (ctx->eof && res_query != AMF_EOF) || (ctx->hwsurfaces_in_queue >= ctx->hwsurfaces_in_queue_max)) {
+                av_log(avctx, AV_LOG_INFO, "Need to wait for output\n");
                  block_and_wait = 1;
-                av_usleep(1000);
+                //av_usleep(1000);
              }
          }
      } while (block_and_wait);
diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h
index 2dbd378ef8..64c77115b6 100644
--- a/libavcodec/amfenc.h
+++ b/libavcodec/amfenc.h
@@ -72,6 +72,7 @@  typedef struct AmfContext {
      int                 delayed_drain;
      AMFSurface         *delayed_surface;
      AVFrame            *delayed_frame;
+    int64_t             output_query_timeout;

      // shift dts back by max_b_frames in timing
      AVFifo             *timestamp_list;