diff mbox series

[FFmpeg-devel,v2] avcodec/amfenc: increase precision of Sleep() on Windows

Message ID 20240819142355.1967-1-Primeadvice@gmail.com
State New
Headers show
Series [FFmpeg-devel,v2] avcodec/amfenc: increase precision of Sleep() on Windows | expand

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Araz Aug. 19, 2024, 2:23 p.m. UTC
From: Evgeny Pavlov <lucenticus@gmail.com>

This commit increase precision of Sleep() function on Windows.
This fix reduces the sleep time on Windows to improve AMF encoding
performance on low resolution input videos.

Fix for issue #10622

We evaluated CreateWaitableTimerExW with
CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has
the same precision level as the Sleep() function.

Usually changing the time resolution will only affect the current
process and will not impact other processes, thus it will not cause a
global effect on the current system. Here is an info from
documentation on timeBeginPeriod
https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod

"Prior to Windows 10, version 2004, this function affects a global
Windows setting. For all processes Windows uses the lowest value (that
is, highest resolution) requested by any process. Starting with
Windows 10, version 2004, this function no longer affects global timer
resolution. For processes which call this function, Windows uses the
lowest value (that is, highest resolution) requested by any process.
For processes which have not called this function, Windows does not
guarantee a higher resolution than the default system resolution."

We provide the following measurement to show performance improvements
with this patch.

1. Performance tests show that this high precision sleep will improve
performance, especially for low resolution sequences, it can get about
20% improvement.

Frames Per Second (FPS) being encoded by the hardware encoder (Navi 31
RX7900XT ):

Source Type: H.264 ,  Output Type: H.264
(Sorry for bad formatting)
No. |   Sequence Resolution | No. of Frames|    FPS Before patch    |
FPS after patch   | Difference    | Improvement %
----|-----------------------|--------------|------------------------|-------------------|---------------|----------
1   |   480x360             | 8290         |        2030            |
     2365        | 335           | 16.5%
2   |   720x576             | 8290         |        1440            |
     1790        | 350           | 24.3%
3 |     1280x720            | 8290         |        1120            |
     1190        | 70            | 6.3%
4   |   1920x1080           | 8290         |        692             |
     714         | 22            | 3.2%
5   |   3840x2160           | 8290         |        200             |
     200         | 0             | 0.0%

The sample ffmpeg command line:
$ ffmpeg.exe -y -hwaccel d3d11va -hwaccel_output_format d3d11 -i
input.mp4 -c:v h264_amf out.mp4
where input.mp4 should be changed to corresponding resolution input
H.264 format bitstream.

2. The power tests show an increase in power is within limit scope.

The purpose of the power test is to examine the increase in CPU power
consumption due to the improvement in CPU time resolution after using
this patch. We were testing a product from AMD called Phoenix, which
we refer to as an APU. It combines a general-purpose AMD CPU and a 3D
integrated graphics processing unit (IGPU) on a single die. Only the
APU has a DAP connector to the board's power rails.

We got the power test data shown below:

|                        | 480x360   |  720x576   | 1280x720 |
1920x1080 | 3840x2160 | average
|------------------------|-----------|------------|----------|-----------|-----------|--------
|CPU  power change       |  1.93%    |  2.43%     | -1.69%   | 3.49%
  | 2.92%     | 1.82%
|APU power total change  |  0.86%    |  1.34%     | -0.62%   | 1.54%
  | -0.58%    | 0.51

When using a high precision clock by applying the patch, the average
power consumption for CPU increases 1.82%, and the APU total increases
0.51%. We can see the power increase in power not very significant.

Signed-off-by: Evgeny Pavlov <lucenticus@gmail.com>
---
 libavcodec/amfenc.c | 31 +++++++++++++++++++++++++++++++
 libavcodec/amfenc.h |  3 +++
 2 files changed, 34 insertions(+)

Comments

Marvin Scholz Aug. 19, 2024, 2:43 p.m. UTC | #1
On 19 Aug 2024, at 16:23, Araz Iusubov wrote:

> From: Evgeny Pavlov <lucenticus@gmail.com>
>
> This commit increase precision of Sleep() function on Windows.
> This fix reduces the sleep time on Windows to improve AMF encoding
> performance on low resolution input videos.
>
> Fix for issue #10622
>
> We evaluated CreateWaitableTimerExW with
> CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has
> the same precision level as the Sleep() function.
>
> Usually changing the time resolution will only affect the current
> process and will not impact other processes, thus it will not cause a
> global effect on the current system. Here is an info from
> documentation on timeBeginPeriod
> https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod
>
> "Prior to Windows 10, version 2004, this function affects a global
> Windows setting. For all processes Windows uses the lowest value (that
> is, highest resolution) requested by any process. Starting with
> Windows 10, version 2004, this function no longer affects global timer
> resolution. For processes which call this function, Windows uses the
> lowest value (that is, highest resolution) requested by any process.
> For processes which have not called this function, Windows does not
> guarantee a higher resolution than the default system resolution."
>
> We provide the following measurement to show performance improvements
> with this patch.
>
> 1. Performance tests show that this high precision sleep will improve
> performance, especially for low resolution sequences, it can get about
> 20% improvement.
>
> Frames Per Second (FPS) being encoded by the hardware encoder (Navi 31
> RX7900XT ):
>
> Source Type: H.264 ,  Output Type: H.264
> (Sorry for bad formatting)
> No. |   Sequence Resolution | No. of Frames|    FPS Before patch    |
> FPS after patch   | Difference    | Improvement %
> ----|-----------------------|--------------|------------------------|-------------------|---------------|----------
> 1   |   480x360             | 8290         |        2030            |
>      2365        | 335           | 16.5%
> 2   |   720x576             | 8290         |        1440            |
>      1790        | 350           | 24.3%
> 3 |     1280x720            | 8290         |        1120            |
>      1190        | 70            | 6.3%
> 4   |   1920x1080           | 8290         |        692             |
>      714         | 22            | 3.2%
> 5   |   3840x2160           | 8290         |        200             |
>      200         | 0             | 0.0%
>

The patch itself isn’t really my area of expertise as it’s very Windows
specific APIs, but just a nit about the commit message:

These tables are unreadable like this, but you can easily trim it down to
fit the width limit:

| Resolution | # Frames | FPS before | FPS after | Diff |   %   |
|------------|----------|------------|-----------|------|-------|
| 480x360    | 8290     | 2030       | 2365      | 335  | 16.5% |

> The sample ffmpeg command line:
> $ ffmpeg.exe -y -hwaccel d3d11va -hwaccel_output_format d3d11 -i
> input.mp4 -c:v h264_amf out.mp4
> where input.mp4 should be changed to corresponding resolution input
> H.264 format bitstream.
>
> 2. The power tests show an increase in power is within limit scope.
>
> The purpose of the power test is to examine the increase in CPU power
> consumption due to the improvement in CPU time resolution after using
> this patch. We were testing a product from AMD called Phoenix, which
> we refer to as an APU. It combines a general-purpose AMD CPU and a 3D
> integrated graphics processing unit (IGPU) on a single die. Only the
> APU has a DAP connector to the board's power rails.
>
> We got the power test data shown below:
>
> |                        | 480x360   |  720x576   | 1280x720 |
> 1920x1080 | 3840x2160 | average
> |------------------------|-----------|------------|----------|-----------|-----------|--------
> |CPU  power change       |  1.93%    |  2.43%     | -1.69%   | 3.49%
>   | 2.92%     | 1.82%
> |APU power total change  |  0.86%    |  1.34%     | -0.62%   | 1.54%
>   | -0.58%    | 0.51
>
> When using a high precision clock by applying the patch, the average
> power consumption for CPU increases 1.82%, and the APU total increases
> 0.51%. We can see the power increase in power not very significant.
>
> Signed-off-by: Evgeny Pavlov <lucenticus@gmail.com>
> ---
>  libavcodec/amfenc.c | 31 +++++++++++++++++++++++++++++++
>  libavcodec/amfenc.h |  3 +++
>  2 files changed, 34 insertions(+)
>
> diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c
> index 061859f85c..55e24856e8 100644
> --- a/libavcodec/amfenc.c
> +++ b/libavcodec/amfenc.c
> @@ -42,7 +42,12 @@
>  #endif
>
>  #ifdef _WIN32
> +#include <timeapi.h>
>  #include "compat/w32dlfcn.h"
> +
> +typedef MMRESULT (*timeapi_fun)(UINT uPeriod);
> +#define WINMM_DLL "winmm.dll"
> +
>  #else
>  #include <dlfcn.h>
>  #endif
> @@ -113,6 +118,9 @@ static int amf_load_library(AVCodecContext *avctx)
>      AMFInit_Fn         init_fun;
>      AMFQueryVersion_Fn version_fun;
>      AMF_RESULT         res;
> +#ifdef _WIN32
> +    timeapi_fun time_begin_fun;
> +#endif
>
>      ctx->delayed_frame = av_frame_alloc();
>      if (!ctx->delayed_frame) {
> @@ -145,6 +153,16 @@ static int amf_load_library(AVCodecContext *avctx)
>      AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetTrace() failed with error %d\n", res);
>      res = ctx->factory->pVtbl->GetDebug(ctx->factory, &ctx->debug);
>      AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetDebug() failed with error %d\n", res);
> +
> +#ifdef _WIN32
> +    // Increase precision of Sleep() function on Windows platform
> +    ctx->winmm_lib = dlopen(WINMM_DLL, RTLD_NOW | RTLD_LOCAL);
> +    AMF_RETURN_IF_FALSE(ctx, ctx->winmm_lib != NULL, 0, "DLL %s failed to open\n", WINMM_DLL);
> +    time_begin_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeBeginPeriod");
> +    AMF_RETURN_IF_FALSE(ctx, time_begin_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeBeginPeriod");
> +    time_begin_fun(1);
> +#endif //_WIN32
> +
>      return 0;
>  }
>
> @@ -375,6 +393,9 @@ static int amf_init_encoder(AVCodecContext *avctx)
>  int av_cold ff_amf_encode_close(AVCodecContext *avctx)
>  {
>      AmfContext *ctx = avctx->priv_data;
> +#ifdef _WIN32
> +    timeapi_fun time_end_fun;
> +#endif //_WIN32
>
>      if (ctx->delayed_surface) {
>          ctx->delayed_surface->pVtbl->Release(ctx->delayed_surface);
> @@ -410,6 +431,16 @@ int av_cold ff_amf_encode_close(AVCodecContext *avctx)
>      av_frame_free(&ctx->delayed_frame);
>      av_fifo_freep2(&ctx->timestamp_list);
>
> +#ifdef _WIN32
> +    if (ctx->winmm_lib) {
> +        time_end_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeEndPeriod");
> +        AMF_RETURN_IF_FALSE(ctx, time_end_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeEndPeriod");
> +        time_end_fun(1);
> +        dlclose(ctx->winmm_lib);
> +        ctx->winmm_lib = NULL;
> +    }
> +#endif //_WIN32
> +
>      return 0;
>  }
>
> diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h
> index 2dbd378ef8..35bcf1dfe3 100644
> --- a/libavcodec/amfenc.h
> +++ b/libavcodec/amfenc.h
> @@ -50,6 +50,9 @@ typedef struct AmfContext {
>      AVClass            *avclass;
>      // access to AMF runtime
>      amf_handle          library; ///< handle to DLL library
> +#ifdef _WIN32
> +    amf_handle          winmm_lib; ///< handle to winmm DLL library
> +#endif //_WIN32
>      AMFFactory         *factory; ///< pointer to AMF factory
>      AMFDebug           *debug;   ///< pointer to AMF debug interface
>      AMFTrace           *trace;   ///< pointer to AMF trace interface
> -- 
> 2.45.2.windows.1
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
Timo Rothenpieler Aug. 19, 2024, 6:31 p.m. UTC | #2
On 19.08.2024 16:23, Araz Iusubov wrote:
> From: Evgeny Pavlov <lucenticus@gmail.com>
> 
> This commit increase precision of Sleep() function on Windows.
> This fix reduces the sleep time on Windows to improve AMF encoding
> performance on low resolution input videos.
> 
> Fix for issue #10622
> 
> We evaluated CreateWaitableTimerExW with
> CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has
> the same precision level as the Sleep() function.
> 
> Usually changing the time resolution will only affect the current
> process and will not impact other processes, thus it will not cause a
> global effect on the current system. Here is an info from
> documentation on timeBeginPeriod
> https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod
> 
> "Prior to Windows 10, version 2004, this function affects a global
> Windows setting. For all processes Windows uses the lowest value (that
> is, highest resolution) requested by any process. Starting with
> Windows 10, version 2004, this function no longer affects global timer
> resolution. For processes which call this function, Windows uses the
> lowest value (that is, highest resolution) requested by any process.
> For processes which have not called this function, Windows does not
> guarantee a higher resolution than the default system resolution."

Even modifying the current process is not acceptable, since lavc is 
frequently embedded in other applications, which might not expect 
this/be aversely impacted by it.

Why does the AMF wrapper need to sleep at all?
Cameron Gutman Aug. 21, 2024, 7:01 a.m. UTC | #3
On Mon, Aug 19, 2024 at 9:31 AM Araz Iusubov <primeadvice@gmail.com> wrote:
>
> From: Evgeny Pavlov <lucenticus@gmail.com>
>
> This commit increase precision of Sleep() function on Windows.
> This fix reduces the sleep time on Windows to improve AMF encoding
> performance on low resolution input videos.
>
> Fix for issue #10622
>
> We evaluated CreateWaitableTimerExW with
> CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has
> the same precision level as the Sleep() function.
>

The proper way to get a wait that completes under the tick period
on Windows is to use a waitable object, such as an event. Polling
for AMF output like we do now wastes power and increases latency.

I see that AMF has a QUERY_TIMEOUT option available to determine
how long QueryOutput() blocks waiting for output to be available.
Does this option do a something sane internally like waiting on
an event or is just doing the same polling that we already do?

> Usually changing the time resolution will only affect the current
> process and will not impact other processes, thus it will not cause a
> global effect on the current system. Here is an info from
> documentation on timeBeginPeriod
> https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod
>
> "Prior to Windows 10, version 2004, this function affects a global
> Windows setting. For all processes Windows uses the lowest value (that
> is, highest resolution) requested by any process. Starting with
> Windows 10, version 2004, this function no longer affects global timer
> resolution. For processes which call this function, Windows uses the
> lowest value (that is, highest resolution) requested by any process.
> For processes which have not called this function, Windows does not
> guarantee a higher resolution than the default system resolution."
>
> We provide the following measurement to show performance improvements
> with this patch.
>
> 1. Performance tests show that this high precision sleep will improve
> performance, especially for low resolution sequences, it can get about
> 20% improvement.
>
> Frames Per Second (FPS) being encoded by the hardware encoder (Navi 31
> RX7900XT ):
>
> Source Type: H.264 ,  Output Type: H.264
> (Sorry for bad formatting)
> No. |   Sequence Resolution | No. of Frames|    FPS Before patch    |
> FPS after patch   | Difference    | Improvement %
> ----|-----------------------|--------------|------------------------|-------------------|---------------|----------
> 1   |   480x360             | 8290         |        2030            |
>      2365        | 335           | 16.5%
> 2   |   720x576             | 8290         |        1440            |
>      1790        | 350           | 24.3%
> 3 |     1280x720            | 8290         |        1120            |
>      1190        | 70            | 6.3%
> 4   |   1920x1080           | 8290         |        692             |
>      714         | 22            | 3.2%
> 5   |   3840x2160           | 8290         |        200             |
>      200         | 0             | 0.0%
>
> The sample ffmpeg command line:
> $ ffmpeg.exe -y -hwaccel d3d11va -hwaccel_output_format d3d11 -i
> input.mp4 -c:v h264_amf out.mp4
> where input.mp4 should be changed to corresponding resolution input
> H.264 format bitstream.
>
> 2. The power tests show an increase in power is within limit scope.
>
> The purpose of the power test is to examine the increase in CPU power
> consumption due to the improvement in CPU time resolution after using
> this patch. We were testing a product from AMD called Phoenix, which
> we refer to as an APU. It combines a general-purpose AMD CPU and a 3D
> integrated graphics processing unit (IGPU) on a single die. Only the
> APU has a DAP connector to the board's power rails.
>
> We got the power test data shown below:
>
> |                        | 480x360   |  720x576   | 1280x720 |
> 1920x1080 | 3840x2160 | average
> |------------------------|-----------|------------|----------|-----------|-----------|--------
> |CPU  power change       |  1.93%    |  2.43%     | -1.69%   | 3.49%
>   | 2.92%     | 1.82%
> |APU power total change  |  0.86%    |  1.34%     | -0.62%   | 1.54%
>   | -0.58%    | 0.51
>
> When using a high precision clock by applying the patch, the average
> power consumption for CPU increases 1.82%, and the APU total increases
> 0.51%. We can see the power increase in power not very significant.
>
> Signed-off-by: Evgeny Pavlov <lucenticus@gmail.com>
> ---
>  libavcodec/amfenc.c | 31 +++++++++++++++++++++++++++++++
>  libavcodec/amfenc.h |  3 +++
>  2 files changed, 34 insertions(+)
>
> diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c
> index 061859f85c..55e24856e8 100644
> --- a/libavcodec/amfenc.c
> +++ b/libavcodec/amfenc.c
> @@ -42,7 +42,12 @@
>  #endif
>
>  #ifdef _WIN32
> +#include <timeapi.h>
>  #include "compat/w32dlfcn.h"
> +
> +typedef MMRESULT (*timeapi_fun)(UINT uPeriod);
> +#define WINMM_DLL "winmm.dll"
> +
>  #else
>  #include <dlfcn.h>
>  #endif
> @@ -113,6 +118,9 @@ static int amf_load_library(AVCodecContext *avctx)
>      AMFInit_Fn         init_fun;
>      AMFQueryVersion_Fn version_fun;
>      AMF_RESULT         res;
> +#ifdef _WIN32
> +    timeapi_fun time_begin_fun;
> +#endif
>
>      ctx->delayed_frame = av_frame_alloc();
>      if (!ctx->delayed_frame) {
> @@ -145,6 +153,16 @@ static int amf_load_library(AVCodecContext *avctx)
>      AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetTrace() failed with error %d\n", res);
>      res = ctx->factory->pVtbl->GetDebug(ctx->factory, &ctx->debug);
>      AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetDebug() failed with error %d\n", res);
> +
> +#ifdef _WIN32
> +    // Increase precision of Sleep() function on Windows platform
> +    ctx->winmm_lib = dlopen(WINMM_DLL, RTLD_NOW | RTLD_LOCAL);
> +    AMF_RETURN_IF_FALSE(ctx, ctx->winmm_lib != NULL, 0, "DLL %s failed to open\n", WINMM_DLL);
> +    time_begin_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeBeginPeriod");
> +    AMF_RETURN_IF_FALSE(ctx, time_begin_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeBeginPeriod");
> +    time_begin_fun(1);
> +#endif //_WIN32
> +
>      return 0;
>  }
>
> @@ -375,6 +393,9 @@ static int amf_init_encoder(AVCodecContext *avctx)
>  int av_cold ff_amf_encode_close(AVCodecContext *avctx)
>  {
>      AmfContext *ctx = avctx->priv_data;
> +#ifdef _WIN32
> +    timeapi_fun time_end_fun;
> +#endif //_WIN32
>
>      if (ctx->delayed_surface) {
>          ctx->delayed_surface->pVtbl->Release(ctx->delayed_surface);
> @@ -410,6 +431,16 @@ int av_cold ff_amf_encode_close(AVCodecContext *avctx)
>      av_frame_free(&ctx->delayed_frame);
>      av_fifo_freep2(&ctx->timestamp_list);
>
> +#ifdef _WIN32
> +    if (ctx->winmm_lib) {
> +        time_end_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeEndPeriod");
> +        AMF_RETURN_IF_FALSE(ctx, time_end_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeEndPeriod");
> +        time_end_fun(1);
> +        dlclose(ctx->winmm_lib);
> +        ctx->winmm_lib = NULL;
> +    }
> +#endif //_WIN32
> +
>      return 0;
>  }
>
> diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h
> index 2dbd378ef8..35bcf1dfe3 100644
> --- a/libavcodec/amfenc.h
> +++ b/libavcodec/amfenc.h
> @@ -50,6 +50,9 @@ typedef struct AmfContext {
>      AVClass            *avclass;
>      // access to AMF runtime
>      amf_handle          library; ///< handle to DLL library
> +#ifdef _WIN32
> +    amf_handle          winmm_lib; ///< handle to winmm DLL library
> +#endif //_WIN32
>      AMFFactory         *factory; ///< pointer to AMF factory
>      AMFDebug           *debug;   ///< pointer to AMF debug interface
>      AMFTrace           *trace;   ///< pointer to AMF trace interface
> --
> 2.45.2.windows.1
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
Kieran Kunhya Aug. 21, 2024, 5:02 p.m. UTC | #4
On Mon, Aug 19, 2024 at 7:31 PM Timo Rothenpieler <timo@rothenpieler.org> wrote:
>
> On 19.08.2024 16:23, Araz Iusubov wrote:
> > From: Evgeny Pavlov <lucenticus@gmail.com>
> >
> > This commit increase precision of Sleep() function on Windows.
> > This fix reduces the sleep time on Windows to improve AMF encoding
> > performance on low resolution input videos.
> >
> > Fix for issue #10622
> >
> > We evaluated CreateWaitableTimerExW with
> > CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has
> > the same precision level as the Sleep() function.
> >
> > Usually changing the time resolution will only affect the current
> > process and will not impact other processes, thus it will not cause a
> > global effect on the current system. Here is an info from
> > documentation on timeBeginPeriod
> > https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod
> >
> > "Prior to Windows 10, version 2004, this function affects a global
> > Windows setting. For all processes Windows uses the lowest value (that
> > is, highest resolution) requested by any process. Starting with
> > Windows 10, version 2004, this function no longer affects global timer
> > resolution. For processes which call this function, Windows uses the
> > lowest value (that is, highest resolution) requested by any process.
> > For processes which have not called this function, Windows does not
> > guarantee a higher resolution than the default system resolution."
>
> Even modifying the current process is not acceptable, since lavc is
> frequently embedded in other applications, which might not expect
> this/be aversely impacted by it.
>
> Why does the AMF wrapper need to sleep at all?

Agreed, we can't have things like this in FFmpeg. This might be
acceptable in a binary blob but not in an open source project.

Kieran
Cameron Gutman Sept. 2, 2024, 10:33 p.m. UTC | #5
On Wed, Aug 21, 2024 at 12:02 PM Kieran Kunhya via ffmpeg-devel
<ffmpeg-devel@ffmpeg.org> wrote:
>
> On Mon, Aug 19, 2024 at 7:31 PM Timo Rothenpieler <timo@rothenpieler.org> wrote:
> >
> > On 19.08.2024 16:23, Araz Iusubov wrote:
> > > From: Evgeny Pavlov <lucenticus@gmail.com>
> > >
> > > This commit increase precision of Sleep() function on Windows.
> > > This fix reduces the sleep time on Windows to improve AMF encoding
> > > performance on low resolution input videos.
> > >
> > > Fix for issue #10622
> > >
> > > We evaluated CreateWaitableTimerExW with
> > > CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has
> > > the same precision level as the Sleep() function.
> > >
> > > Usually changing the time resolution will only affect the current
> > > process and will not impact other processes, thus it will not cause a
> > > global effect on the current system. Here is an info from
> > > documentation on timeBeginPeriod
> > > https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod
> > >
> > > "Prior to Windows 10, version 2004, this function affects a global
> > > Windows setting. For all processes Windows uses the lowest value (that
> > > is, highest resolution) requested by any process. Starting with
> > > Windows 10, version 2004, this function no longer affects global timer
> > > resolution. For processes which call this function, Windows uses the
> > > lowest value (that is, highest resolution) requested by any process.
> > > For processes which have not called this function, Windows does not
> > > guarantee a higher resolution than the default system resolution."
> >
> > Even modifying the current process is not acceptable, since lavc is
> > frequently embedded in other applications, which might not expect
> > this/be aversely impacted by it.
> >
> > Why does the AMF wrapper need to sleep at all?
>
> Agreed, we can't have things like this in FFmpeg. This might be
> acceptable in a binary blob but not in an open source project.

I took a look at this using AMD's AMF EncoderLatency sample and found that
setting the QUERY_TIMEOUT option to 50 ms (as is default for the new AMF
HQ and HQLL usage values) results in latency that is better than the
current AMF code in FFmpeg *and* this patch without having to touch
the process's timer precision.

Here are the results from QUERY_TIMEOUT=0, amf_sleep(1), 1ms timer period:
Encoder: AMFVideoEncoderVCE_AVC
Total  : Frames = 500 Duration = 1157.16ms FPS = 432.09frames
Latency: First,Min,Max = 7.12ms, 1.53ms, 3.73ms
Latency: Average = 1.99ms

and the results from QUERY_TIMEOUT=50, default timer period:
Encoder: AMFVideoEncoderVCE_AVC
Total  : Frames = 500 Duration = 933.03ms FPS = 535.89frames
Latency: First,Min,Max = 5.80ms, 1.49ms, 2.50ms
Latency: Average = 1.58ms

This seems to clearly demonstrate that QUERY_TIMEOUT is a better approach
than adjusting timer resolution. It avoids process-wide effects *and*
it even performs better on top of that.


Cameron
Araz Sept. 9, 2024, 4:05 p.m. UTC | #6
>
> > I took a look at this using AMD's AMF EncoderLatency sample and found
> that
> > setting the QUERY_TIMEOUT option to 50 ms (as is default for the new AMF
> > HQ and HQLL usage values) results in latency that is better than the
> > current AMF code in FFmpeg *and* this patch without having to touch
> > the process's timer precision.
> >
> > Here are the results from QUERY_TIMEOUT=0, amf_sleep(1), 1ms timer
> period:
> > Encoder: AMFVideoEncoderVCE_AVC
> > Total  : Frames = 500 Duration = 1157.16ms FPS = 432.09frames
> > Latency: First,Min,Max = 7.12ms, 1.53ms, 3.73ms
> > Latency: Average = 1.99ms
> >
> > and the results from QUERY_TIMEOUT=50, default timer period:
> > Encoder: AMFVideoEncoderVCE_AVC
> > Total  : Frames = 500 Duration = 933.03ms FPS = 535.89frames
> > Latency: First,Min,Max = 5.80ms, 1.49ms, 2.50ms
> > Latency: Average = 1.58ms
> >
> > This seems to clearly demonstrate that QUERY_TIMEOUT is a better approach
> > than adjusting timer resolution. It avoids process-wide effects *and*
> > it even performs better on top of that.
> >
> >
> > Cameron
>

Thanks everyone and Cameron,
TIMEOUT might be a possible solution.
Need some time for evaluating the performance impact of using it.
The problem is that FFmpeg calls SubmitInput and QueryOutput
from the same thread and if B-frames and/or look-ahead enabled, initially
not enough input submitted and QueryOutput will wait till timeout value.
Cameron Gutman Sept. 9, 2024, 11:48 p.m. UTC | #7
On Mon, Sep 9, 2024 at 11:05 AM Araz <primeadvice@gmail.com> wrote:
>>
>> > I took a look at this using AMD's AMF EncoderLatency sample and found that
>> > setting the QUERY_TIMEOUT option to 50 ms (as is default for the new AMF
>> > HQ and HQLL usage values) results in latency that is better than the
>> > current AMF code in FFmpeg *and* this patch without having to touch
>> > the process's timer precision.
>> >
>> > Here are the results from QUERY_TIMEOUT=0, amf_sleep(1), 1ms timer period:
>> > Encoder: AMFVideoEncoderVCE_AVC
>> > Total  : Frames = 500 Duration = 1157.16ms FPS = 432.09frames
>> > Latency: First,Min,Max = 7.12ms, 1.53ms, 3.73ms
>> > Latency: Average = 1.99ms
>> >
>> > and the results from QUERY_TIMEOUT=50, default timer period:
>> > Encoder: AMFVideoEncoderVCE_AVC
>> > Total  : Frames = 500 Duration = 933.03ms FPS = 535.89frames
>> > Latency: First,Min,Max = 5.80ms, 1.49ms, 2.50ms
>> > Latency: Average = 1.58ms
>> >
>> > This seems to clearly demonstrate that QUERY_TIMEOUT is a better approach
>> > than adjusting timer resolution. It avoids process-wide effects *and*
>> > it even performs better on top of that.
>> >
>> >
>> > Cameron
>
>
> Thanks everyone and Cameron,
> TIMEOUT might be a possible solution.
> Need some time for evaluating the performance impact of using it.
> The problem is that FFmpeg calls SubmitInput and QueryOutput
> from the same thread and if B-frames and/or look-ahead enabled, initially
> not enough input submitted and QueryOutput will wait till timeout value.

Ah I see, I guess we can just replace the fixed 1 ms sleep with a fixed
1 ms QUERY_TIMEOUT. We won't get the full power gains of waking exactly
once per frame when encoding is complete, but it should solve the issue
of having to wait a full 15.6ms to realize output data is available.

How about something like this?

diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c
index 41eaef9758..555cdcde85 100644
--- a/libavcodec/amfenc.c
+++ b/libavcodec/amfenc.c
@@ -861,7 +861,12 @@ int ff_amf_receive_packet(AVCodecContext *avctx,
AVPacket *avpkt)
         if (query_output_data_flag == 0) {
             if (res_resubmit == AMF_INPUT_FULL || ctx->delayed_drain
|| (ctx->eof && res_query != AMF_EOF) || (ctx->hwsurfaces_in_queue >=
ctx->hwsurfaces_in_queue_max)) {
                 block_and_wait = 1;
-                av_usleep(1000);
+
+                // Only sleep if the driver doesn't support waiting
in QueryOutput()
+                // or if we already have output data so we will skip
calling it.
+                if (!ctx->query_timeout_supported || avpkt->data ||
avpkt->buf) {
+                    av_usleep(1000);
+                }
             }
         }
     } while (block_and_wait);
diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h
index d985d01bb1..a634c133c6 100644
--- a/libavcodec/amfenc.h
+++ b/libavcodec/amfenc.h
@@ -68,6 +68,7 @@ typedef struct AmfContext {

     int                 hwsurfaces_in_queue;
     int                 hwsurfaces_in_queue_max;
+    int                 query_timeout_supported;

     // helpers to handle async calls
     int                 delayed_drain;
diff --git a/libavcodec/amfenc_av1.c b/libavcodec/amfenc_av1.c
index 2a7a782063..f3058c5b96 100644
--- a/libavcodec/amfenc_av1.c
+++ b/libavcodec/amfenc_av1.c
@@ -467,6 +467,11 @@ FF_ENABLE_DEPRECATION_WARNINGS
         }
     }

+    // Wait inside QueryOutput() if supported by the driver
+    AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder,
AMF_VIDEO_ENCODER_AV1_QUERY_TIMEOUT, 1);
+    res = ctx->encoder->pVtbl->GetProperty(ctx->encoder,
AMF_VIDEO_ENCODER_AV1_QUERY_TIMEOUT, &var);
+    ctx->query_timeout_supported = res == AMF_OK && var.int64Value;
+
     // init encoder
     res = ctx->encoder->pVtbl->Init(ctx->encoder, ctx->format,
avctx->width, avctx->height);
     AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_BUG,
"encoder->Init() failed with error %d\n", res);
diff --git a/libavcodec/amfenc_h264.c b/libavcodec/amfenc_h264.c
index 8edd39c633..6d7c5808ee 100644
--- a/libavcodec/amfenc_h264.c
+++ b/libavcodec/amfenc_h264.c
@@ -482,6 +482,11 @@ FF_ENABLE_DEPRECATION_WARNINGS
         AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder,
AMF_VIDEO_ENCODER_REF_B_PIC_DELTA_QP, ctx->ref_b_frame_delta_qp);
     }

+    // Wait inside QueryOutput() if supported by the driver
+    AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder,
AMF_VIDEO_ENCODER_QUERY_TIMEOUT, 1);
+    res = ctx->encoder->pVtbl->GetProperty(ctx->encoder,
AMF_VIDEO_ENCODER_QUERY_TIMEOUT, &var);
+    ctx->query_timeout_supported = res == AMF_OK && var.int64Value;
+
     // Initialize Encoder
     res = ctx->encoder->pVtbl->Init(ctx->encoder, ctx->format,
avctx->width, avctx->height);
     AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_BUG,
"encoder->Init() failed with error %d\n", res);
diff --git a/libavcodec/amfenc_hevc.c b/libavcodec/amfenc_hevc.c
index 4898824f3a..4457990247 100644
--- a/libavcodec/amfenc_hevc.c
+++ b/libavcodec/amfenc_hevc.c
@@ -429,6 +429,11 @@ FF_ENABLE_DEPRECATION_WARNINGS
         }
     }

+    // Wait inside QueryOutput() if supported by the driver
+    AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder,
AMF_VIDEO_ENCODER_HEVC_QUERY_TIMEOUT, 1);
+    res = ctx->encoder->pVtbl->GetProperty(ctx->encoder,
AMF_VIDEO_ENCODER_HEVC_QUERY_TIMEOUT, &var);
+    ctx->query_timeout_supported = res == AMF_OK && var.int64Value;
+
     // init encoder
     res = ctx->encoder->pVtbl->Init(ctx->encoder, ctx->format,
avctx->width, avctx->height);
     AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_BUG,
"encoder->Init() failed with error %d\n", res);
Cameron Gutman Sept. 9, 2024, 11:51 p.m. UTC | #8
On Mon, Sep 9, 2024 at 6:48 PM Cameron Gutman <aicommander@gmail.com> wrote:
>
> On Mon, Sep 9, 2024 at 11:05 AM Araz <primeadvice@gmail.com> wrote:
> >>
> >> > I took a look at this using AMD's AMF EncoderLatency sample and found that
> >> > setting the QUERY_TIMEOUT option to 50 ms (as is default for the new AMF
> >> > HQ and HQLL usage values) results in latency that is better than the
> >> > current AMF code in FFmpeg *and* this patch without having to touch
> >> > the process's timer precision.
> >> >
> >> > Here are the results from QUERY_TIMEOUT=0, amf_sleep(1), 1ms timer period:
> >> > Encoder: AMFVideoEncoderVCE_AVC
> >> > Total  : Frames = 500 Duration = 1157.16ms FPS = 432.09frames
> >> > Latency: First,Min,Max = 7.12ms, 1.53ms, 3.73ms
> >> > Latency: Average = 1.99ms
> >> >
> >> > and the results from QUERY_TIMEOUT=50, default timer period:
> >> > Encoder: AMFVideoEncoderVCE_AVC
> >> > Total  : Frames = 500 Duration = 933.03ms FPS = 535.89frames
> >> > Latency: First,Min,Max = 5.80ms, 1.49ms, 2.50ms
> >> > Latency: Average = 1.58ms
> >> >
> >> > This seems to clearly demonstrate that QUERY_TIMEOUT is a better approach
> >> > than adjusting timer resolution. It avoids process-wide effects *and*
> >> > it even performs better on top of that.
> >> >
> >> >
> >> > Cameron
> >
> >
> > Thanks everyone and Cameron,
> > TIMEOUT might be a possible solution.
> > Need some time for evaluating the performance impact of using it.
> > The problem is that FFmpeg calls SubmitInput and QueryOutput
> > from the same thread and if B-frames and/or look-ahead enabled, initially
> > not enough input submitted and QueryOutput will wait till timeout value.
>
> Ah I see, I guess we can just replace the fixed 1 ms sleep with a fixed
> 1 ms QUERY_TIMEOUT. We won't get the full power gains of waking exactly
> once per frame when encoding is complete, but it should solve the issue
> of having to wait a full 15.6ms to realize output data is available.
>
> How about something like this?
>

Ugh, email client mangled it. Let's try that again.
diff mbox series

Patch

diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c
index 061859f85c..55e24856e8 100644
--- a/libavcodec/amfenc.c
+++ b/libavcodec/amfenc.c
@@ -42,7 +42,12 @@ 
 #endif
 
 #ifdef _WIN32
+#include <timeapi.h>
 #include "compat/w32dlfcn.h"
+
+typedef MMRESULT (*timeapi_fun)(UINT uPeriod);
+#define WINMM_DLL "winmm.dll"
+
 #else
 #include <dlfcn.h>
 #endif
@@ -113,6 +118,9 @@  static int amf_load_library(AVCodecContext *avctx)
     AMFInit_Fn         init_fun;
     AMFQueryVersion_Fn version_fun;
     AMF_RESULT         res;
+#ifdef _WIN32
+    timeapi_fun time_begin_fun;
+#endif
 
     ctx->delayed_frame = av_frame_alloc();
     if (!ctx->delayed_frame) {
@@ -145,6 +153,16 @@  static int amf_load_library(AVCodecContext *avctx)
     AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetTrace() failed with error %d\n", res);
     res = ctx->factory->pVtbl->GetDebug(ctx->factory, &ctx->debug);
     AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetDebug() failed with error %d\n", res);
+
+#ifdef _WIN32
+    // Increase precision of Sleep() function on Windows platform
+    ctx->winmm_lib = dlopen(WINMM_DLL, RTLD_NOW | RTLD_LOCAL);
+    AMF_RETURN_IF_FALSE(ctx, ctx->winmm_lib != NULL, 0, "DLL %s failed to open\n", WINMM_DLL);
+    time_begin_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeBeginPeriod");
+    AMF_RETURN_IF_FALSE(ctx, time_begin_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeBeginPeriod");
+    time_begin_fun(1);
+#endif //_WIN32
+
     return 0;
 }
 
@@ -375,6 +393,9 @@  static int amf_init_encoder(AVCodecContext *avctx)
 int av_cold ff_amf_encode_close(AVCodecContext *avctx)
 {
     AmfContext *ctx = avctx->priv_data;
+#ifdef _WIN32
+    timeapi_fun time_end_fun;
+#endif //_WIN32
 
     if (ctx->delayed_surface) {
         ctx->delayed_surface->pVtbl->Release(ctx->delayed_surface);
@@ -410,6 +431,16 @@  int av_cold ff_amf_encode_close(AVCodecContext *avctx)
     av_frame_free(&ctx->delayed_frame);
     av_fifo_freep2(&ctx->timestamp_list);
 
+#ifdef _WIN32
+    if (ctx->winmm_lib) {
+        time_end_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeEndPeriod");
+        AMF_RETURN_IF_FALSE(ctx, time_end_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeEndPeriod");
+        time_end_fun(1);
+        dlclose(ctx->winmm_lib);
+        ctx->winmm_lib = NULL;
+    }
+#endif //_WIN32
+
     return 0;
 }
 
diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h
index 2dbd378ef8..35bcf1dfe3 100644
--- a/libavcodec/amfenc.h
+++ b/libavcodec/amfenc.h
@@ -50,6 +50,9 @@  typedef struct AmfContext {
     AVClass            *avclass;
     // access to AMF runtime
     amf_handle          library; ///< handle to DLL library
+#ifdef _WIN32
+    amf_handle          winmm_lib; ///< handle to winmm DLL library
+#endif //_WIN32
     AMFFactory         *factory; ///< pointer to AMF factory
     AMFDebug           *debug;   ///< pointer to AMF debug interface
     AMFTrace           *trace;   ///< pointer to AMF trace interface