Message ID | 20240819142355.1967-1-Primeadvice@gmail.com |
---|---|
State | New |
Headers | show |
Series | [FFmpeg-devel,v2] avcodec/amfenc: increase precision of Sleep() on Windows | expand |
Context | Check | Description |
---|---|---|
yinshiyou/make_loongarch64 | success | Make finished |
yinshiyou/make_fate_loongarch64 | success | Make fate finished |
andriy/make_x86 | success | Make finished |
andriy/make_fate_x86 | success | Make fate finished |
On 19 Aug 2024, at 16:23, Araz Iusubov wrote: > From: Evgeny Pavlov <lucenticus@gmail.com> > > This commit increase precision of Sleep() function on Windows. > This fix reduces the sleep time on Windows to improve AMF encoding > performance on low resolution input videos. > > Fix for issue #10622 > > We evaluated CreateWaitableTimerExW with > CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has > the same precision level as the Sleep() function. > > Usually changing the time resolution will only affect the current > process and will not impact other processes, thus it will not cause a > global effect on the current system. Here is an info from > documentation on timeBeginPeriod > https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod > > "Prior to Windows 10, version 2004, this function affects a global > Windows setting. For all processes Windows uses the lowest value (that > is, highest resolution) requested by any process. Starting with > Windows 10, version 2004, this function no longer affects global timer > resolution. For processes which call this function, Windows uses the > lowest value (that is, highest resolution) requested by any process. > For processes which have not called this function, Windows does not > guarantee a higher resolution than the default system resolution." > > We provide the following measurement to show performance improvements > with this patch. > > 1. Performance tests show that this high precision sleep will improve > performance, especially for low resolution sequences, it can get about > 20% improvement. > > Frames Per Second (FPS) being encoded by the hardware encoder (Navi 31 > RX7900XT ): > > Source Type: H.264 , Output Type: H.264 > (Sorry for bad formatting) > No. | Sequence Resolution | No. of Frames| FPS Before patch | > FPS after patch | Difference | Improvement % > ----|-----------------------|--------------|------------------------|-------------------|---------------|---------- > 1 | 480x360 | 8290 | 2030 | > 2365 | 335 | 16.5% > 2 | 720x576 | 8290 | 1440 | > 1790 | 350 | 24.3% > 3 | 1280x720 | 8290 | 1120 | > 1190 | 70 | 6.3% > 4 | 1920x1080 | 8290 | 692 | > 714 | 22 | 3.2% > 5 | 3840x2160 | 8290 | 200 | > 200 | 0 | 0.0% > The patch itself isn’t really my area of expertise as it’s very Windows specific APIs, but just a nit about the commit message: These tables are unreadable like this, but you can easily trim it down to fit the width limit: | Resolution | # Frames | FPS before | FPS after | Diff | % | |------------|----------|------------|-----------|------|-------| | 480x360 | 8290 | 2030 | 2365 | 335 | 16.5% | > The sample ffmpeg command line: > $ ffmpeg.exe -y -hwaccel d3d11va -hwaccel_output_format d3d11 -i > input.mp4 -c:v h264_amf out.mp4 > where input.mp4 should be changed to corresponding resolution input > H.264 format bitstream. > > 2. The power tests show an increase in power is within limit scope. > > The purpose of the power test is to examine the increase in CPU power > consumption due to the improvement in CPU time resolution after using > this patch. We were testing a product from AMD called Phoenix, which > we refer to as an APU. It combines a general-purpose AMD CPU and a 3D > integrated graphics processing unit (IGPU) on a single die. Only the > APU has a DAP connector to the board's power rails. > > We got the power test data shown below: > > | | 480x360 | 720x576 | 1280x720 | > 1920x1080 | 3840x2160 | average > |------------------------|-----------|------------|----------|-----------|-----------|-------- > |CPU power change | 1.93% | 2.43% | -1.69% | 3.49% > | 2.92% | 1.82% > |APU power total change | 0.86% | 1.34% | -0.62% | 1.54% > | -0.58% | 0.51 > > When using a high precision clock by applying the patch, the average > power consumption for CPU increases 1.82%, and the APU total increases > 0.51%. We can see the power increase in power not very significant. > > Signed-off-by: Evgeny Pavlov <lucenticus@gmail.com> > --- > libavcodec/amfenc.c | 31 +++++++++++++++++++++++++++++++ > libavcodec/amfenc.h | 3 +++ > 2 files changed, 34 insertions(+) > > diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c > index 061859f85c..55e24856e8 100644 > --- a/libavcodec/amfenc.c > +++ b/libavcodec/amfenc.c > @@ -42,7 +42,12 @@ > #endif > > #ifdef _WIN32 > +#include <timeapi.h> > #include "compat/w32dlfcn.h" > + > +typedef MMRESULT (*timeapi_fun)(UINT uPeriod); > +#define WINMM_DLL "winmm.dll" > + > #else > #include <dlfcn.h> > #endif > @@ -113,6 +118,9 @@ static int amf_load_library(AVCodecContext *avctx) > AMFInit_Fn init_fun; > AMFQueryVersion_Fn version_fun; > AMF_RESULT res; > +#ifdef _WIN32 > + timeapi_fun time_begin_fun; > +#endif > > ctx->delayed_frame = av_frame_alloc(); > if (!ctx->delayed_frame) { > @@ -145,6 +153,16 @@ static int amf_load_library(AVCodecContext *avctx) > AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetTrace() failed with error %d\n", res); > res = ctx->factory->pVtbl->GetDebug(ctx->factory, &ctx->debug); > AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetDebug() failed with error %d\n", res); > + > +#ifdef _WIN32 > + // Increase precision of Sleep() function on Windows platform > + ctx->winmm_lib = dlopen(WINMM_DLL, RTLD_NOW | RTLD_LOCAL); > + AMF_RETURN_IF_FALSE(ctx, ctx->winmm_lib != NULL, 0, "DLL %s failed to open\n", WINMM_DLL); > + time_begin_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeBeginPeriod"); > + AMF_RETURN_IF_FALSE(ctx, time_begin_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeBeginPeriod"); > + time_begin_fun(1); > +#endif //_WIN32 > + > return 0; > } > > @@ -375,6 +393,9 @@ static int amf_init_encoder(AVCodecContext *avctx) > int av_cold ff_amf_encode_close(AVCodecContext *avctx) > { > AmfContext *ctx = avctx->priv_data; > +#ifdef _WIN32 > + timeapi_fun time_end_fun; > +#endif //_WIN32 > > if (ctx->delayed_surface) { > ctx->delayed_surface->pVtbl->Release(ctx->delayed_surface); > @@ -410,6 +431,16 @@ int av_cold ff_amf_encode_close(AVCodecContext *avctx) > av_frame_free(&ctx->delayed_frame); > av_fifo_freep2(&ctx->timestamp_list); > > +#ifdef _WIN32 > + if (ctx->winmm_lib) { > + time_end_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeEndPeriod"); > + AMF_RETURN_IF_FALSE(ctx, time_end_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeEndPeriod"); > + time_end_fun(1); > + dlclose(ctx->winmm_lib); > + ctx->winmm_lib = NULL; > + } > +#endif //_WIN32 > + > return 0; > } > > diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h > index 2dbd378ef8..35bcf1dfe3 100644 > --- a/libavcodec/amfenc.h > +++ b/libavcodec/amfenc.h > @@ -50,6 +50,9 @@ typedef struct AmfContext { > AVClass *avclass; > // access to AMF runtime > amf_handle library; ///< handle to DLL library > +#ifdef _WIN32 > + amf_handle winmm_lib; ///< handle to winmm DLL library > +#endif //_WIN32 > AMFFactory *factory; ///< pointer to AMF factory > AMFDebug *debug; ///< pointer to AMF debug interface > AMFTrace *trace; ///< pointer to AMF trace interface > -- > 2.45.2.windows.1 > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
On 19.08.2024 16:23, Araz Iusubov wrote: > From: Evgeny Pavlov <lucenticus@gmail.com> > > This commit increase precision of Sleep() function on Windows. > This fix reduces the sleep time on Windows to improve AMF encoding > performance on low resolution input videos. > > Fix for issue #10622 > > We evaluated CreateWaitableTimerExW with > CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has > the same precision level as the Sleep() function. > > Usually changing the time resolution will only affect the current > process and will not impact other processes, thus it will not cause a > global effect on the current system. Here is an info from > documentation on timeBeginPeriod > https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod > > "Prior to Windows 10, version 2004, this function affects a global > Windows setting. For all processes Windows uses the lowest value (that > is, highest resolution) requested by any process. Starting with > Windows 10, version 2004, this function no longer affects global timer > resolution. For processes which call this function, Windows uses the > lowest value (that is, highest resolution) requested by any process. > For processes which have not called this function, Windows does not > guarantee a higher resolution than the default system resolution." Even modifying the current process is not acceptable, since lavc is frequently embedded in other applications, which might not expect this/be aversely impacted by it. Why does the AMF wrapper need to sleep at all?
On Mon, Aug 19, 2024 at 9:31 AM Araz Iusubov <primeadvice@gmail.com> wrote: > > From: Evgeny Pavlov <lucenticus@gmail.com> > > This commit increase precision of Sleep() function on Windows. > This fix reduces the sleep time on Windows to improve AMF encoding > performance on low resolution input videos. > > Fix for issue #10622 > > We evaluated CreateWaitableTimerExW with > CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has > the same precision level as the Sleep() function. > The proper way to get a wait that completes under the tick period on Windows is to use a waitable object, such as an event. Polling for AMF output like we do now wastes power and increases latency. I see that AMF has a QUERY_TIMEOUT option available to determine how long QueryOutput() blocks waiting for output to be available. Does this option do a something sane internally like waiting on an event or is just doing the same polling that we already do? > Usually changing the time resolution will only affect the current > process and will not impact other processes, thus it will not cause a > global effect on the current system. Here is an info from > documentation on timeBeginPeriod > https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod > > "Prior to Windows 10, version 2004, this function affects a global > Windows setting. For all processes Windows uses the lowest value (that > is, highest resolution) requested by any process. Starting with > Windows 10, version 2004, this function no longer affects global timer > resolution. For processes which call this function, Windows uses the > lowest value (that is, highest resolution) requested by any process. > For processes which have not called this function, Windows does not > guarantee a higher resolution than the default system resolution." > > We provide the following measurement to show performance improvements > with this patch. > > 1. Performance tests show that this high precision sleep will improve > performance, especially for low resolution sequences, it can get about > 20% improvement. > > Frames Per Second (FPS) being encoded by the hardware encoder (Navi 31 > RX7900XT ): > > Source Type: H.264 , Output Type: H.264 > (Sorry for bad formatting) > No. | Sequence Resolution | No. of Frames| FPS Before patch | > FPS after patch | Difference | Improvement % > ----|-----------------------|--------------|------------------------|-------------------|---------------|---------- > 1 | 480x360 | 8290 | 2030 | > 2365 | 335 | 16.5% > 2 | 720x576 | 8290 | 1440 | > 1790 | 350 | 24.3% > 3 | 1280x720 | 8290 | 1120 | > 1190 | 70 | 6.3% > 4 | 1920x1080 | 8290 | 692 | > 714 | 22 | 3.2% > 5 | 3840x2160 | 8290 | 200 | > 200 | 0 | 0.0% > > The sample ffmpeg command line: > $ ffmpeg.exe -y -hwaccel d3d11va -hwaccel_output_format d3d11 -i > input.mp4 -c:v h264_amf out.mp4 > where input.mp4 should be changed to corresponding resolution input > H.264 format bitstream. > > 2. The power tests show an increase in power is within limit scope. > > The purpose of the power test is to examine the increase in CPU power > consumption due to the improvement in CPU time resolution after using > this patch. We were testing a product from AMD called Phoenix, which > we refer to as an APU. It combines a general-purpose AMD CPU and a 3D > integrated graphics processing unit (IGPU) on a single die. Only the > APU has a DAP connector to the board's power rails. > > We got the power test data shown below: > > | | 480x360 | 720x576 | 1280x720 | > 1920x1080 | 3840x2160 | average > |------------------------|-----------|------------|----------|-----------|-----------|-------- > |CPU power change | 1.93% | 2.43% | -1.69% | 3.49% > | 2.92% | 1.82% > |APU power total change | 0.86% | 1.34% | -0.62% | 1.54% > | -0.58% | 0.51 > > When using a high precision clock by applying the patch, the average > power consumption for CPU increases 1.82%, and the APU total increases > 0.51%. We can see the power increase in power not very significant. > > Signed-off-by: Evgeny Pavlov <lucenticus@gmail.com> > --- > libavcodec/amfenc.c | 31 +++++++++++++++++++++++++++++++ > libavcodec/amfenc.h | 3 +++ > 2 files changed, 34 insertions(+) > > diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c > index 061859f85c..55e24856e8 100644 > --- a/libavcodec/amfenc.c > +++ b/libavcodec/amfenc.c > @@ -42,7 +42,12 @@ > #endif > > #ifdef _WIN32 > +#include <timeapi.h> > #include "compat/w32dlfcn.h" > + > +typedef MMRESULT (*timeapi_fun)(UINT uPeriod); > +#define WINMM_DLL "winmm.dll" > + > #else > #include <dlfcn.h> > #endif > @@ -113,6 +118,9 @@ static int amf_load_library(AVCodecContext *avctx) > AMFInit_Fn init_fun; > AMFQueryVersion_Fn version_fun; > AMF_RESULT res; > +#ifdef _WIN32 > + timeapi_fun time_begin_fun; > +#endif > > ctx->delayed_frame = av_frame_alloc(); > if (!ctx->delayed_frame) { > @@ -145,6 +153,16 @@ static int amf_load_library(AVCodecContext *avctx) > AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetTrace() failed with error %d\n", res); > res = ctx->factory->pVtbl->GetDebug(ctx->factory, &ctx->debug); > AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetDebug() failed with error %d\n", res); > + > +#ifdef _WIN32 > + // Increase precision of Sleep() function on Windows platform > + ctx->winmm_lib = dlopen(WINMM_DLL, RTLD_NOW | RTLD_LOCAL); > + AMF_RETURN_IF_FALSE(ctx, ctx->winmm_lib != NULL, 0, "DLL %s failed to open\n", WINMM_DLL); > + time_begin_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeBeginPeriod"); > + AMF_RETURN_IF_FALSE(ctx, time_begin_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeBeginPeriod"); > + time_begin_fun(1); > +#endif //_WIN32 > + > return 0; > } > > @@ -375,6 +393,9 @@ static int amf_init_encoder(AVCodecContext *avctx) > int av_cold ff_amf_encode_close(AVCodecContext *avctx) > { > AmfContext *ctx = avctx->priv_data; > +#ifdef _WIN32 > + timeapi_fun time_end_fun; > +#endif //_WIN32 > > if (ctx->delayed_surface) { > ctx->delayed_surface->pVtbl->Release(ctx->delayed_surface); > @@ -410,6 +431,16 @@ int av_cold ff_amf_encode_close(AVCodecContext *avctx) > av_frame_free(&ctx->delayed_frame); > av_fifo_freep2(&ctx->timestamp_list); > > +#ifdef _WIN32 > + if (ctx->winmm_lib) { > + time_end_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeEndPeriod"); > + AMF_RETURN_IF_FALSE(ctx, time_end_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeEndPeriod"); > + time_end_fun(1); > + dlclose(ctx->winmm_lib); > + ctx->winmm_lib = NULL; > + } > +#endif //_WIN32 > + > return 0; > } > > diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h > index 2dbd378ef8..35bcf1dfe3 100644 > --- a/libavcodec/amfenc.h > +++ b/libavcodec/amfenc.h > @@ -50,6 +50,9 @@ typedef struct AmfContext { > AVClass *avclass; > // access to AMF runtime > amf_handle library; ///< handle to DLL library > +#ifdef _WIN32 > + amf_handle winmm_lib; ///< handle to winmm DLL library > +#endif //_WIN32 > AMFFactory *factory; ///< pointer to AMF factory > AMFDebug *debug; ///< pointer to AMF debug interface > AMFTrace *trace; ///< pointer to AMF trace interface > -- > 2.45.2.windows.1 > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
On Mon, Aug 19, 2024 at 7:31 PM Timo Rothenpieler <timo@rothenpieler.org> wrote: > > On 19.08.2024 16:23, Araz Iusubov wrote: > > From: Evgeny Pavlov <lucenticus@gmail.com> > > > > This commit increase precision of Sleep() function on Windows. > > This fix reduces the sleep time on Windows to improve AMF encoding > > performance on low resolution input videos. > > > > Fix for issue #10622 > > > > We evaluated CreateWaitableTimerExW with > > CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has > > the same precision level as the Sleep() function. > > > > Usually changing the time resolution will only affect the current > > process and will not impact other processes, thus it will not cause a > > global effect on the current system. Here is an info from > > documentation on timeBeginPeriod > > https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod > > > > "Prior to Windows 10, version 2004, this function affects a global > > Windows setting. For all processes Windows uses the lowest value (that > > is, highest resolution) requested by any process. Starting with > > Windows 10, version 2004, this function no longer affects global timer > > resolution. For processes which call this function, Windows uses the > > lowest value (that is, highest resolution) requested by any process. > > For processes which have not called this function, Windows does not > > guarantee a higher resolution than the default system resolution." > > Even modifying the current process is not acceptable, since lavc is > frequently embedded in other applications, which might not expect > this/be aversely impacted by it. > > Why does the AMF wrapper need to sleep at all? Agreed, we can't have things like this in FFmpeg. This might be acceptable in a binary blob but not in an open source project. Kieran
On Wed, Aug 21, 2024 at 12:02 PM Kieran Kunhya via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote: > > On Mon, Aug 19, 2024 at 7:31 PM Timo Rothenpieler <timo@rothenpieler.org> wrote: > > > > On 19.08.2024 16:23, Araz Iusubov wrote: > > > From: Evgeny Pavlov <lucenticus@gmail.com> > > > > > > This commit increase precision of Sleep() function on Windows. > > > This fix reduces the sleep time on Windows to improve AMF encoding > > > performance on low resolution input videos. > > > > > > Fix for issue #10622 > > > > > > We evaluated CreateWaitableTimerExW with > > > CREATE_WAITABLE_TIMER_HIGH_RESOLUTION flag. In fact, this function has > > > the same precision level as the Sleep() function. > > > > > > Usually changing the time resolution will only affect the current > > > process and will not impact other processes, thus it will not cause a > > > global effect on the current system. Here is an info from > > > documentation on timeBeginPeriod > > > https://learn.microsoft.com/en-us/windows/win32/api/timeapi/nf-timeapi-timebeginperiod > > > > > > "Prior to Windows 10, version 2004, this function affects a global > > > Windows setting. For all processes Windows uses the lowest value (that > > > is, highest resolution) requested by any process. Starting with > > > Windows 10, version 2004, this function no longer affects global timer > > > resolution. For processes which call this function, Windows uses the > > > lowest value (that is, highest resolution) requested by any process. > > > For processes which have not called this function, Windows does not > > > guarantee a higher resolution than the default system resolution." > > > > Even modifying the current process is not acceptable, since lavc is > > frequently embedded in other applications, which might not expect > > this/be aversely impacted by it. > > > > Why does the AMF wrapper need to sleep at all? > > Agreed, we can't have things like this in FFmpeg. This might be > acceptable in a binary blob but not in an open source project. I took a look at this using AMD's AMF EncoderLatency sample and found that setting the QUERY_TIMEOUT option to 50 ms (as is default for the new AMF HQ and HQLL usage values) results in latency that is better than the current AMF code in FFmpeg *and* this patch without having to touch the process's timer precision. Here are the results from QUERY_TIMEOUT=0, amf_sleep(1), 1ms timer period: Encoder: AMFVideoEncoderVCE_AVC Total : Frames = 500 Duration = 1157.16ms FPS = 432.09frames Latency: First,Min,Max = 7.12ms, 1.53ms, 3.73ms Latency: Average = 1.99ms and the results from QUERY_TIMEOUT=50, default timer period: Encoder: AMFVideoEncoderVCE_AVC Total : Frames = 500 Duration = 933.03ms FPS = 535.89frames Latency: First,Min,Max = 5.80ms, 1.49ms, 2.50ms Latency: Average = 1.58ms This seems to clearly demonstrate that QUERY_TIMEOUT is a better approach than adjusting timer resolution. It avoids process-wide effects *and* it even performs better on top of that. Cameron
> > > I took a look at this using AMD's AMF EncoderLatency sample and found > that > > setting the QUERY_TIMEOUT option to 50 ms (as is default for the new AMF > > HQ and HQLL usage values) results in latency that is better than the > > current AMF code in FFmpeg *and* this patch without having to touch > > the process's timer precision. > > > > Here are the results from QUERY_TIMEOUT=0, amf_sleep(1), 1ms timer > period: > > Encoder: AMFVideoEncoderVCE_AVC > > Total : Frames = 500 Duration = 1157.16ms FPS = 432.09frames > > Latency: First,Min,Max = 7.12ms, 1.53ms, 3.73ms > > Latency: Average = 1.99ms > > > > and the results from QUERY_TIMEOUT=50, default timer period: > > Encoder: AMFVideoEncoderVCE_AVC > > Total : Frames = 500 Duration = 933.03ms FPS = 535.89frames > > Latency: First,Min,Max = 5.80ms, 1.49ms, 2.50ms > > Latency: Average = 1.58ms > > > > This seems to clearly demonstrate that QUERY_TIMEOUT is a better approach > > than adjusting timer resolution. It avoids process-wide effects *and* > > it even performs better on top of that. > > > > > > Cameron > Thanks everyone and Cameron, TIMEOUT might be a possible solution. Need some time for evaluating the performance impact of using it. The problem is that FFmpeg calls SubmitInput and QueryOutput from the same thread and if B-frames and/or look-ahead enabled, initially not enough input submitted and QueryOutput will wait till timeout value.
On Mon, Sep 9, 2024 at 11:05 AM Araz <primeadvice@gmail.com> wrote: >> >> > I took a look at this using AMD's AMF EncoderLatency sample and found that >> > setting the QUERY_TIMEOUT option to 50 ms (as is default for the new AMF >> > HQ and HQLL usage values) results in latency that is better than the >> > current AMF code in FFmpeg *and* this patch without having to touch >> > the process's timer precision. >> > >> > Here are the results from QUERY_TIMEOUT=0, amf_sleep(1), 1ms timer period: >> > Encoder: AMFVideoEncoderVCE_AVC >> > Total : Frames = 500 Duration = 1157.16ms FPS = 432.09frames >> > Latency: First,Min,Max = 7.12ms, 1.53ms, 3.73ms >> > Latency: Average = 1.99ms >> > >> > and the results from QUERY_TIMEOUT=50, default timer period: >> > Encoder: AMFVideoEncoderVCE_AVC >> > Total : Frames = 500 Duration = 933.03ms FPS = 535.89frames >> > Latency: First,Min,Max = 5.80ms, 1.49ms, 2.50ms >> > Latency: Average = 1.58ms >> > >> > This seems to clearly demonstrate that QUERY_TIMEOUT is a better approach >> > than adjusting timer resolution. It avoids process-wide effects *and* >> > it even performs better on top of that. >> > >> > >> > Cameron > > > Thanks everyone and Cameron, > TIMEOUT might be a possible solution. > Need some time for evaluating the performance impact of using it. > The problem is that FFmpeg calls SubmitInput and QueryOutput > from the same thread and if B-frames and/or look-ahead enabled, initially > not enough input submitted and QueryOutput will wait till timeout value. Ah I see, I guess we can just replace the fixed 1 ms sleep with a fixed 1 ms QUERY_TIMEOUT. We won't get the full power gains of waking exactly once per frame when encoding is complete, but it should solve the issue of having to wait a full 15.6ms to realize output data is available. How about something like this? diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c index 41eaef9758..555cdcde85 100644 --- a/libavcodec/amfenc.c +++ b/libavcodec/amfenc.c @@ -861,7 +861,12 @@ int ff_amf_receive_packet(AVCodecContext *avctx, AVPacket *avpkt) if (query_output_data_flag == 0) { if (res_resubmit == AMF_INPUT_FULL || ctx->delayed_drain || (ctx->eof && res_query != AMF_EOF) || (ctx->hwsurfaces_in_queue >= ctx->hwsurfaces_in_queue_max)) { block_and_wait = 1; - av_usleep(1000); + + // Only sleep if the driver doesn't support waiting in QueryOutput() + // or if we already have output data so we will skip calling it. + if (!ctx->query_timeout_supported || avpkt->data || avpkt->buf) { + av_usleep(1000); + } } } } while (block_and_wait); diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h index d985d01bb1..a634c133c6 100644 --- a/libavcodec/amfenc.h +++ b/libavcodec/amfenc.h @@ -68,6 +68,7 @@ typedef struct AmfContext { int hwsurfaces_in_queue; int hwsurfaces_in_queue_max; + int query_timeout_supported; // helpers to handle async calls int delayed_drain; diff --git a/libavcodec/amfenc_av1.c b/libavcodec/amfenc_av1.c index 2a7a782063..f3058c5b96 100644 --- a/libavcodec/amfenc_av1.c +++ b/libavcodec/amfenc_av1.c @@ -467,6 +467,11 @@ FF_ENABLE_DEPRECATION_WARNINGS } } + // Wait inside QueryOutput() if supported by the driver + AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder, AMF_VIDEO_ENCODER_AV1_QUERY_TIMEOUT, 1); + res = ctx->encoder->pVtbl->GetProperty(ctx->encoder, AMF_VIDEO_ENCODER_AV1_QUERY_TIMEOUT, &var); + ctx->query_timeout_supported = res == AMF_OK && var.int64Value; + // init encoder res = ctx->encoder->pVtbl->Init(ctx->encoder, ctx->format, avctx->width, avctx->height); AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_BUG, "encoder->Init() failed with error %d\n", res); diff --git a/libavcodec/amfenc_h264.c b/libavcodec/amfenc_h264.c index 8edd39c633..6d7c5808ee 100644 --- a/libavcodec/amfenc_h264.c +++ b/libavcodec/amfenc_h264.c @@ -482,6 +482,11 @@ FF_ENABLE_DEPRECATION_WARNINGS AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder, AMF_VIDEO_ENCODER_REF_B_PIC_DELTA_QP, ctx->ref_b_frame_delta_qp); } + // Wait inside QueryOutput() if supported by the driver + AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder, AMF_VIDEO_ENCODER_QUERY_TIMEOUT, 1); + res = ctx->encoder->pVtbl->GetProperty(ctx->encoder, AMF_VIDEO_ENCODER_QUERY_TIMEOUT, &var); + ctx->query_timeout_supported = res == AMF_OK && var.int64Value; + // Initialize Encoder res = ctx->encoder->pVtbl->Init(ctx->encoder, ctx->format, avctx->width, avctx->height); AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_BUG, "encoder->Init() failed with error %d\n", res); diff --git a/libavcodec/amfenc_hevc.c b/libavcodec/amfenc_hevc.c index 4898824f3a..4457990247 100644 --- a/libavcodec/amfenc_hevc.c +++ b/libavcodec/amfenc_hevc.c @@ -429,6 +429,11 @@ FF_ENABLE_DEPRECATION_WARNINGS } } + // Wait inside QueryOutput() if supported by the driver + AMF_ASSIGN_PROPERTY_INT64(res, ctx->encoder, AMF_VIDEO_ENCODER_HEVC_QUERY_TIMEOUT, 1); + res = ctx->encoder->pVtbl->GetProperty(ctx->encoder, AMF_VIDEO_ENCODER_HEVC_QUERY_TIMEOUT, &var); + ctx->query_timeout_supported = res == AMF_OK && var.int64Value; + // init encoder res = ctx->encoder->pVtbl->Init(ctx->encoder, ctx->format, avctx->width, avctx->height); AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_BUG, "encoder->Init() failed with error %d\n", res);
On Mon, Sep 9, 2024 at 6:48 PM Cameron Gutman <aicommander@gmail.com> wrote: > > On Mon, Sep 9, 2024 at 11:05 AM Araz <primeadvice@gmail.com> wrote: > >> > >> > I took a look at this using AMD's AMF EncoderLatency sample and found that > >> > setting the QUERY_TIMEOUT option to 50 ms (as is default for the new AMF > >> > HQ and HQLL usage values) results in latency that is better than the > >> > current AMF code in FFmpeg *and* this patch without having to touch > >> > the process's timer precision. > >> > > >> > Here are the results from QUERY_TIMEOUT=0, amf_sleep(1), 1ms timer period: > >> > Encoder: AMFVideoEncoderVCE_AVC > >> > Total : Frames = 500 Duration = 1157.16ms FPS = 432.09frames > >> > Latency: First,Min,Max = 7.12ms, 1.53ms, 3.73ms > >> > Latency: Average = 1.99ms > >> > > >> > and the results from QUERY_TIMEOUT=50, default timer period: > >> > Encoder: AMFVideoEncoderVCE_AVC > >> > Total : Frames = 500 Duration = 933.03ms FPS = 535.89frames > >> > Latency: First,Min,Max = 5.80ms, 1.49ms, 2.50ms > >> > Latency: Average = 1.58ms > >> > > >> > This seems to clearly demonstrate that QUERY_TIMEOUT is a better approach > >> > than adjusting timer resolution. It avoids process-wide effects *and* > >> > it even performs better on top of that. > >> > > >> > > >> > Cameron > > > > > > Thanks everyone and Cameron, > > TIMEOUT might be a possible solution. > > Need some time for evaluating the performance impact of using it. > > The problem is that FFmpeg calls SubmitInput and QueryOutput > > from the same thread and if B-frames and/or look-ahead enabled, initially > > not enough input submitted and QueryOutput will wait till timeout value. > > Ah I see, I guess we can just replace the fixed 1 ms sleep with a fixed > 1 ms QUERY_TIMEOUT. We won't get the full power gains of waking exactly > once per frame when encoding is complete, but it should solve the issue > of having to wait a full 15.6ms to realize output data is available. > > How about something like this? > Ugh, email client mangled it. Let's try that again.
diff --git a/libavcodec/amfenc.c b/libavcodec/amfenc.c index 061859f85c..55e24856e8 100644 --- a/libavcodec/amfenc.c +++ b/libavcodec/amfenc.c @@ -42,7 +42,12 @@ #endif #ifdef _WIN32 +#include <timeapi.h> #include "compat/w32dlfcn.h" + +typedef MMRESULT (*timeapi_fun)(UINT uPeriod); +#define WINMM_DLL "winmm.dll" + #else #include <dlfcn.h> #endif @@ -113,6 +118,9 @@ static int amf_load_library(AVCodecContext *avctx) AMFInit_Fn init_fun; AMFQueryVersion_Fn version_fun; AMF_RESULT res; +#ifdef _WIN32 + timeapi_fun time_begin_fun; +#endif ctx->delayed_frame = av_frame_alloc(); if (!ctx->delayed_frame) { @@ -145,6 +153,16 @@ static int amf_load_library(AVCodecContext *avctx) AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetTrace() failed with error %d\n", res); res = ctx->factory->pVtbl->GetDebug(ctx->factory, &ctx->debug); AMF_RETURN_IF_FALSE(ctx, res == AMF_OK, AVERROR_UNKNOWN, "GetDebug() failed with error %d\n", res); + +#ifdef _WIN32 + // Increase precision of Sleep() function on Windows platform + ctx->winmm_lib = dlopen(WINMM_DLL, RTLD_NOW | RTLD_LOCAL); + AMF_RETURN_IF_FALSE(ctx, ctx->winmm_lib != NULL, 0, "DLL %s failed to open\n", WINMM_DLL); + time_begin_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeBeginPeriod"); + AMF_RETURN_IF_FALSE(ctx, time_begin_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeBeginPeriod"); + time_begin_fun(1); +#endif //_WIN32 + return 0; } @@ -375,6 +393,9 @@ static int amf_init_encoder(AVCodecContext *avctx) int av_cold ff_amf_encode_close(AVCodecContext *avctx) { AmfContext *ctx = avctx->priv_data; +#ifdef _WIN32 + timeapi_fun time_end_fun; +#endif //_WIN32 if (ctx->delayed_surface) { ctx->delayed_surface->pVtbl->Release(ctx->delayed_surface); @@ -410,6 +431,16 @@ int av_cold ff_amf_encode_close(AVCodecContext *avctx) av_frame_free(&ctx->delayed_frame); av_fifo_freep2(&ctx->timestamp_list); +#ifdef _WIN32 + if (ctx->winmm_lib) { + time_end_fun = (timeapi_fun)dlsym(ctx->winmm_lib, "timeEndPeriod"); + AMF_RETURN_IF_FALSE(ctx, time_end_fun != NULL, 0, "DLL %s failed to find function %s\n", WINMM_DLL, "timeEndPeriod"); + time_end_fun(1); + dlclose(ctx->winmm_lib); + ctx->winmm_lib = NULL; + } +#endif //_WIN32 + return 0; } diff --git a/libavcodec/amfenc.h b/libavcodec/amfenc.h index 2dbd378ef8..35bcf1dfe3 100644 --- a/libavcodec/amfenc.h +++ b/libavcodec/amfenc.h @@ -50,6 +50,9 @@ typedef struct AmfContext { AVClass *avclass; // access to AMF runtime amf_handle library; ///< handle to DLL library +#ifdef _WIN32 + amf_handle winmm_lib; ///< handle to winmm DLL library +#endif //_WIN32 AMFFactory *factory; ///< pointer to AMF factory AMFDebug *debug; ///< pointer to AMF debug interface AMFTrace *trace; ///< pointer to AMF trace interface