
[FFmpeg-devel,v3,1/2] dxva: wait until D3D11 buffer copies are done before submitting them

Message ID 20200805100712.16834-1-robux4@ycbcr.xyz
State New
Series [FFmpeg-devel,v3,1/2] dxva: wait until D3D11 buffer copies are done before submitting them

Checks

Context Check Description
andriy/default pending
andriy/make success Make finished
andriy/make_fate success Make fate finished

Commit Message

Steve Lhomme Aug. 5, 2020, 10:07 a.m. UTC
When used aggressively, calling SubmitDecoderBuffers() just after
ReleaseDecoderBuffer() may result in the buffers not being used properly,
creating decoding artifacts.
It's likely due to the time needed to copy the submitted buffer from
CPU-mapped memory to GPU memory. SubmitDecoderBuffers() doesn't appear to
wait for the state of the submitted buffer to become "ready".

For now this is not supported in the legacy API using AVD3D11VAContext; we
would need to add an ID3D11DeviceContext there, as it cannot be derived from
the other interfaces we provide (ID3D11VideoContext is not a kind of
ID3D11DeviceContext).
---
 libavcodec/dxva2.c          | 33 +++++++++++++++++++++++++++++++++
 libavcodec/dxva2_internal.h |  2 ++
 2 files changed, 35 insertions(+)

Comments

Steve Lhomme Aug. 7, 2020, 1:05 p.m. UTC | #1
I experimented a bit more with this. Here are the three scenarios, in 
order of fewest late frames:

- GetData waiting for up to 1/2 s and releasing the lock
- No use of GetData (current code)
- GetData waiting for up to 1/2 s and keeping the lock

The last option has horrible performance issues and should not be used.

The first option gives about 50% fewer late frames compared to the 
current code. *But* it requires unlocking the Video Context. There are 
two problems with this:

- the same ID3D11Asynchronous is waited on by multiple concurrent 
threads. This can confuse D3D11, which emits a warning in the logs.
- another thread might Get/Release some buffers and submit them before 
this thread is finished processing. That can result in distortions, for 
example if the second thread/frame depends on the first thread/frame, 
which is not submitted yet.

The former issue can be solved by using an ID3D11Asynchronous per 
thread. That requires some TLS storage, which FFmpeg doesn't seem to 
support yet. With this I get virtually no late frames.

The latter issue only occurs if the wait is too long. For example, 
waiting by increments of 10 ms is too long in my tests. Using increments 
of 1 ms or 2 ms works fine with the most stressful sample I have (Sony 
Camping HDR HEVC, high bitrate). But this seems hackish. There's still 
potentially a quick frame (an alt frame in VPx/AV1, for example) that 
might get through to the decoder too early. (I suppose that's the source 
of the distortions I see.)

It's also possible to change the order in which the buffers are sent, by 
starting with the biggest one (D3D11_VIDEO_DECODER_BUFFER_BITSTREAM). But 
it seems to have little influence, regardless of whether we wait for 
buffer submission or not.

The results are consistent between integrated GPU and dedicated GPU.

On 2020-08-05 12:07, Steve Lhomme wrote:
> When used aggressively, calling SubmitDecoderBuffers() just after
> ReleaseDecoderBuffer() may have the buffers not used properly and creates
> decoding artifacts.
> It's likely due to the time to copy the submitted buffer in CPU mapped memory
> to GPU memory. SubmitDecoderBuffers() doesn't appear to wait for the state
> of the buffer submitted to become "ready".
> 
> For now it's not supported in the legacy API using AVD3D11VAContext, we need to
> add a ID3D11DeviceContext in there as it cannot be derived from the other
> interfaces we provide (ID3D11VideoContext is not a kind of ID3D11DeviceContext).
> ---
>   libavcodec/dxva2.c          | 33 +++++++++++++++++++++++++++++++++
>   libavcodec/dxva2_internal.h |  2 ++
>   2 files changed, 35 insertions(+)
> 
> diff --git a/libavcodec/dxva2.c b/libavcodec/dxva2.c
> index 32416112bf..1a0e5b69b2 100644
> --- a/libavcodec/dxva2.c
> +++ b/libavcodec/dxva2.c
> @@ -692,6 +692,12 @@ int ff_dxva2_decode_init(AVCodecContext *avctx)
>           d3d11_ctx->surface       = sctx->d3d11_views;
>           d3d11_ctx->workaround    = sctx->workaround;
>           d3d11_ctx->context_mutex = INVALID_HANDLE_VALUE;
> +
> +        D3D11_QUERY_DESC query = { 0 };
> +        query.Query = D3D11_QUERY_EVENT;
> +        if (FAILED(ID3D11Device_CreateQuery(device_hwctx->device, &query,
> +                                            (ID3D11Query**)&sctx->wait_copies)))
> +            sctx->wait_copies = NULL;
>       }
>   #endif
>   
> @@ -729,6 +735,8 @@ int ff_dxva2_decode_uninit(AVCodecContext *avctx)
>       av_buffer_unref(&sctx->decoder_ref);
>   
>   #if CONFIG_D3D11VA
> +    if (sctx->wait_copies)
> +        ID3D11Asynchronous_Release(sctx->wait_copies);
>       for (i = 0; i < sctx->nb_d3d11_views; i++) {
>           if (sctx->d3d11_views[i])
>               ID3D11VideoDecoderOutputView_Release(sctx->d3d11_views[i]);
> @@ -932,6 +940,12 @@ int ff_dxva2_common_end_frame(AVCodecContext *avctx, AVFrame *frame,
>   
>   #if CONFIG_D3D11VA
>       if (ff_dxva2_is_d3d11(avctx)) {
> +        if (sctx->wait_copies) {
> +            AVHWFramesContext *frames_ctx = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
> +            AVD3D11VADeviceContext *device_hwctx = frames_ctx->device_ctx->hwctx;
> +            ID3D11DeviceContext_Begin(device_hwctx->device_context, sctx->wait_copies);
> +        }
> +
>           buffer = &buffer11[buffer_count];
>           type = D3D11_VIDEO_DECODER_BUFFER_PICTURE_PARAMETERS;
>       }
> @@ -1005,9 +1019,28 @@ int ff_dxva2_common_end_frame(AVCodecContext *avctx, AVFrame *frame,
>   
>   #if CONFIG_D3D11VA
>       if (ff_dxva2_is_d3d11(avctx))
> +    {
> +        int maxWait = 10;
> +        /* wait until all the buffer release is done copying data to the GPU
> +         * before doing the submit command */
> +        if (sctx->wait_copies) {
> +            AVHWFramesContext *frames_ctx = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
> +            AVD3D11VADeviceContext *device_hwctx = frames_ctx->device_ctx->hwctx;
> +            ID3D11DeviceContext_End(device_hwctx->device_context, sctx->wait_copies);
> +
> +            while (maxWait-- && S_FALSE ==
> +                   ID3D11DeviceContext_GetData(device_hwctx->device_context,
> +                                               sctx->wait_copies, NULL, 0, 0)) {
> +                ff_dxva2_unlock(avctx);
> +                SleepEx(2, TRUE);
> +                ff_dxva2_lock(avctx);
> +            }
> +        }
> +
>           hr = ID3D11VideoContext_SubmitDecoderBuffers(D3D11VA_CONTEXT(ctx)->video_context,
>                                                        D3D11VA_CONTEXT(ctx)->decoder,
>                                                        buffer_count, buffer11);
> +    }
>   #endif
>   #if CONFIG_DXVA2
>       if (avctx->pix_fmt == AV_PIX_FMT_DXVA2_VLD) {
> diff --git a/libavcodec/dxva2_internal.h b/libavcodec/dxva2_internal.h
> index b822af59cd..c44e8e09b0 100644
> --- a/libavcodec/dxva2_internal.h
> +++ b/libavcodec/dxva2_internal.h
> @@ -81,6 +81,8 @@ typedef struct FFDXVASharedContext {
>       ID3D11VideoDecoderOutputView  **d3d11_views;
>       int                          nb_d3d11_views;
>       ID3D11Texture2D                *d3d11_texture;
> +
> +    ID3D11Asynchronous             *wait_copies;
>   #endif
>   
>   #if CONFIG_DXVA2
> -- 
> 2.26.2
> 
Soft Works Aug. 7, 2020, 9:59 p.m. UTC | #2
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Friday, August 7, 2020 3:05 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them
> 
> I experimented a bit more with this. Here are the 3 scenarii in other of least
> frame late:
> 
> - GetData waiting for 1/2s and releasing the lock
> - No use of GetData (current code)
> - GetData waiting for 1/2s and keeping the lock
> 
> The last option has horrible perfomance issues and should not be used.
> 
> The first option gives about 50% less late frames compared to the current
> code. *But* it requires to unlock the Video Context. There are 2 problems
> with this:
> 
> - the same ID3D11Asynchronous is used to wait on multiple concurrent
> thread. This can confuse D3D11 which emits a warning in the logs.
> - another thread might Get/Release some buffers and submit them before
> this thread is finished processing. That can result in distortions, for example if
> the second thread/frame depends on the first thread/frame which is not
> submitted yet.
> 
> The former issue can be solved by using a ID3D11Asynchronous per thread.
> That requires some TLS storage which FFmpeg doesn't seem to support yet.
> With this I get virtually no frame late.
> 
> The latter issue only occur if the wait is too long. For example waiting by
> increments of 10ms is too long in my test. Using increments of 1ms or 2ms
> works fine in the most stressing sample I have (Sony Camping HDR HEVC high
> bitrate). But this seems hackish. There's still potentially a quick frame (alt
> frame in VPx/AV1 for example) that might get through to the decoder too
> early. (I suppose that's the source of the distortions I
> see)
> 
> It's also possible to change the order of the buffer sending, by starting with
> the bigger one (D3D11_VIDEO_DECODER_BUFFER_BITSTREAM). But it seems
> to have little influence, regardless if we wait for buffer submission or not.
> 
> The results are consistent between integrated GPU and dedicated GPU.

Hi Steven,

A while ago I had extended the D3D11VA implementation to support single 
(non-array) textures for interoperability with Intel QSV+DX11.

I noticed a few bottlenecks making D3D11VA significantly slower than DXVA2.

The solution was to use ID3D10Multithread_SetMultithreadProtected and
remove all the locks which are currently applied.

Hence, I don't think that your patch is the best possible way.

Regards,
softworkz
Soft Works Aug. 7, 2020, 10:05 p.m. UTC | #3
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Soft Works
> Sent: Friday, August 7, 2020 11:59 PM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them
> 
[...]

I almost forgot that I had published that change already: https://github.com/softworkz/ffmpeg_dx11/commit/c09cc37ce7f513717493e060df740aa0e7374257
Steve Lhomme Aug. 8, 2020, 5:09 a.m. UTC | #4
On 2020-08-07 23:59, Soft Works wrote:
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> Steve Lhomme
>> Sent: Friday, August 7, 2020 3:05 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
>> copies are done before submitting them
>>
>> I experimented a bit more with this. Here are the 3 scenarii in other of least
>> frame late:
>>
>> - GetData waiting for 1/2s and releasing the lock
>> - No use of GetData (current code)
>> - GetData waiting for 1/2s and keeping the lock
>>
>> The last option has horrible perfomance issues and should not be used.
>>
>> The first option gives about 50% less late frames compared to the current
>> code. *But* it requires to unlock the Video Context. There are 2 problems
>> with this:
>>
>> - the same ID3D11Asynchronous is used to wait on multiple concurrent
>> thread. This can confuse D3D11 which emits a warning in the logs.
>> - another thread might Get/Release some buffers and submit them before
>> this thread is finished processing. That can result in distortions, for example if
>> the second thread/frame depends on the first thread/frame which is not
>> submitted yet.
>>
>> The former issue can be solved by using a ID3D11Asynchronous per thread.
>> That requires some TLS storage which FFmpeg doesn't seem to support yet.
>> With this I get virtually no frame late.
>>
>> The latter issue only occur if the wait is too long. For example waiting by
>> increments of 10ms is too long in my test. Using increments of 1ms or 2ms
>> works fine in the most stressing sample I have (Sony Camping HDR HEVC high
>> bitrate). But this seems hackish. There's still potentially a quick frame (alt
>> frame in VPx/AV1 for example) that might get through to the decoder too
>> early. (I suppose that's the source of the distortions I
>> see)
>>
>> It's also possible to change the order of the buffer sending, by starting with
>> the bigger one (D3D11_VIDEO_DECODER_BUFFER_BITSTREAM). But it seems
>> to have little influence, regardless if we wait for buffer submission or not.
>>
>> The results are consistent between integrated GPU and dedicated GPU.
> 
> Hi Steven,

Hi,

> A while ago I had extended D3D11VA implementation to support single
> (non-array textures) for interoperability with Intel QSV+DX11.

Looking at your code, it seems you are copying from an array texture to 
a single-slice texture to achieve this, with double the amount of RAM. 
It may be a design issue with the new D3D11 API, which forces you to do 
that, but I'm not using that API; I'm using the old one.

In my case I directly render the texture slices coming out of the 
decoder with no copying (and no extra memory allocation). It happens 
in a different thread than the decoder thread(s).

In VLC we also support direct D3D11 to QSV encoding. It does 
require a copy to "shadow" textures to feed QSV. I never managed to make 
it work without a copy.

> I noticed a few bottlenecks making D3D11VA significantly slower than DXVA2.
> 
> The solution was to use ID3D10Multithread_SetMultithreadProtected and
> remove all the locks which are currently applied.

I am also using that.

> Hence, I don't think that your patch is the best possible way .

Removing locks and saying "it works for me" is not a correct solution 
either. At the very least the locks are needed inside libavcodec to 
avoid setting DXVA buffers concurrently from different threads; doing so 
would most likely result in very bad distortions, if not crashes. Maybe 
you're only using one decoding thread with DXVA (which a lot of people 
do) so you don't have this issue, but that is not my case.

Also, ID3D10Multithread::SetMultithreadProtected means that the 
resources can be accessed from multiple threads. It doesn't mean that 
calls to ID3D11DeviceContext are safe from multithreading, and my 
experience shows that they are not. In fact, if you have the Windows SDK 
installed and you have concurrent accesses, you'll get a big warning in 
your debug logs that you are doing something fishy. On Windows Phone it 
would even crash. This is how I ended up adding the mutex to the old API 
(e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).

The documentation for ID3D11DeviceContext is very clear about that [1]:
"Because each ID3D11DeviceContext is single threaded, only one thread 
can call a ID3D11DeviceContext at a time. If multiple threads must 
access a single ID3D11DeviceContext, they must use some synchronization 
mechanism, such as critical sections, to synchronize access to that 
ID3D11DeviceContext."

The DXVA documentation is a lot less clear on the subject. But given 
that an ID3D11VideoContext is obtained from an ID3D11DeviceContext (yet 
is not itself an ID3D11DeviceContext), it seems correct to assume it has 
the same restrictions.

[1] 
https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro
Soft Works Aug. 8, 2020, 6:24 a.m. UTC | #5
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Saturday, August 8, 2020 7:10 AM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them

[...]

> >
> > Hi Steven,
> 
> Hi,
> 
> > A while ago I had extended D3D11VA implementation to support single
> > (non-array textures) for interoperability with Intel QSV+DX11.
> 
> Looking at your code, it seems you are copying from an array texture to a
> single slice texture to achieve this. With double the amount of RAM.
> It may be a design issue with the new D3D11 API, which forces you to do

With D3D11, it's mandatory to use a staging texture, which is not only done
in my code but also in the original implementation (hwcontext_d3d11va.c)
https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_d3d11va.c

> that, but I'm not using that API. I'm using the old API.

I'm not sure whether I understand what you mean by this. To my knowledge
there are two DX hw context implementations in ffmpeg:

- DXVA2
- D3D11VA

I'm not aware of a variant like "D3D11 with old API". Could you please elaborate?

> 
> > Hence, I don't think that your patch is the best possible way .
> 
> Removing locks and saying "it works for me" is neither a correct solution. 

How did you come to the conclusion that I might be working like this?

> the very least the locks are needed inside libavcodec to avoid setting DXVA
> buffers concurrently from different threads. It will most likely result in very
> bad distortions if not crashes. Maybe you're only using 1 decoding thread
> with DXVA (which a lot of people do) so you don't have this issue, but this is
> not my case.

I see no point in employing multiple threads for hw-accelerated decoding.
To be honest, I never looked into or tried whether ffmpeg even supports 
multiple threads with dxva2 or d3d11va hw acceleration.

> Also ID3D10Multithread::SetMultithreadProtected means that the resources
> can be accessed from multiple threads. It doesn't mean that calls to
> ID3D11DeviceContext are safe from multithreading. And my experience
> shows that it is not. In fact if you have the Windows SDK installed and you
> have concurrent accesses, you'll get a big warning in your debug logs that you
> are doing something fishy. On WindowsPhone it would even crash. This is
> how I ended up adding the mutex to the old API
> (e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).
> 
> The documentation for ID3D11DeviceContext is very clear about that [1]:
> "Because each ID3D11DeviceContext is single threaded, only one thread can
> call a ID3D11DeviceContext at a time. If multiple threads must access a single
> ID3D11DeviceContext, they must use some synchronization mechanism,
> such as critical sections, to synchronize access to that ID3D11DeviceContext."

Yes, but this doesn't apply to accessing staging textures IIRC.

In fact, I had researched this in-depth, but I can't tell much more without
looking into it again.

The patch I referenced is working in production on thousands of installations 
and tested with many different hardware and driver versions from Nvidia, 
Intel and AMD.

Kind regards,
softworkz
Steve Lhomme Aug. 10, 2020, 6:18 a.m. UTC | #6
On 2020-08-08 8:24, Soft Works wrote:
> 
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> Steve Lhomme
>> Sent: Saturday, August 8, 2020 7:10 AM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
>> copies are done before submitting them
> 
> [...]
> 
>>>
>>> Hi Steven,
>>
>> Hi,
>>
>>> A while ago I had extended D3D11VA implementation to support single
>>> (non-array textures) for interoperability with Intel QSV+DX11.
>>
>> Looking at your code, it seems you are copying from an array texture to a
>> single slice texture to achieve this. With double the amount of RAM.
>> It may be a design issue with the new D3D11 API, which forces you to do
> 
> With D3D11, it's mandatory to use a staging texture, which is not only done
> in my code but also in the original implementation (hwcontext_d3d11va.c)
> https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_d3d11va.c
> 
>> that, but I'm not using that API. I'm using the old API.
> 
> I'm not sure whether I understand what you mean by this. To my knowledge
> there are two DX hw context implementations in ffmpeg:
> 
> - DXVA2
> - D3D11VA
> 
> I'm not aware of a variant like "D3D11 with old API". Could you please elaborate?

There is AV_PIX_FMT_D3D11VA_VLD (old) and AV_PIX_FMT_D3D11 (new).

>>
>>> Hence, I don't think that your patch is the best possible way .
>>
>> Removing locks and saying "it works for me" is neither a correct solution.
> 
> How did you come to the conclusion that I might be working like this?

The commented out "hwctx->lock" lines in your code.

>> the very least the locks are needed inside libavcodec to avoid setting DXVA
>> buffers concurrently from different threads. It will most likely result in very
>> bad distortions if not crashes. Maybe you're only using 1 decoding thread
>> with DXVA (which a lot of people do) so you don't have this issue, but this is
>> not my case.
> 
> I see no point in employing multiple threads for hw accelerated decoding.
> To be honest I never looked into or tried whether ffmpeg even supports
> multiple threads with dva2 or d3d11va hw acceleration.

Maybe you're in an ideal situation where all the files you play through 
libavcodec are hardware accelerated (so also with matching hardware). In 
this case you don't need to care about the case where it will fall back 
to software decoding. Using a single thread in that case would give 
terrible performance.

Even then, there's still a chance that using multiple threads might 
improve performance. All the code that runs to prepare the buffers fed 
into the hardware decoder can run in parallel for multiple frames. If 
you have an insanely fast hardware decoder, that preparation would be 
the bottleneck. In a transcoding scenario that could have an impact.

>> Also ID3D10Multithread::SetMultithreadProtected means that the resources
>> can be accessed from multiple threads. It doesn't mean that calls to
>> ID3D11DeviceContext are safe from multithreading. And my experience
>> shows that it is not. In fact if you have the Windows SDK installed and you
>> have concurrent accesses, you'll get a big warning in your debug logs that you
>> are doing something fishy. On WindowsPhone it would even crash. This is
>> how I ended up adding the mutex to the old API
>> (e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).
>>
>> The documentation for ID3D11DeviceContext is very clear about that [1]:
>> "Because each ID3D11DeviceContext is single threaded, only one thread can
>> call a ID3D11DeviceContext at a time. If multiple threads must access a single
>> ID3D11DeviceContext, they must use some synchronization mechanism,
>> such as critical sections, to synchronize access to that ID3D11DeviceContext."
> 
> Yes, but this doesn't apply to accessing staging textures IIRC.

It does. To copy to a staging texture you need to use 
ID3D11DeviceContext::CopySubresourceRegion().

You probably don't have any synchronization issues in your pipeline 
because it seems you copy from GPU to CPU. In that case it internally 
forces the equivalent of ID3D11DeviceContext::GetData() to make sure all 
the commands that produce your source texture on that video context are 
finished processing. You may not see it, but there's a wait happening 
there. In my case there's nothing happening between the decoder and the 
rendering of the texture.

> In fact, I had researched this in-depth, but I can't tell much more without
> looking into it again.
> 
> The patch I referenced is working in production on thousands of installations
> and tested with many different hardware and driver versions from Nvidia,
> Intel and AMD.

And I added the lock before because, as the specs say, it's necessary. 
That solved some issues on the hundreds of millions of VLC instances 
running on Windows, on all the hardware you can think of.

Decoding 8K 60 fps HEVC was also a good stress test of the code. The 
ID3D11DeviceContext::GetData() on the rendering side ensured that frames 
were displayed at the right time, and not whenever the pipeline was done 
processing the device context commands.

Now I realize the same thing should be done on the decoder side, with 
the improvements given earlier in this thread.
Soft Works Aug. 10, 2020, 10:04 a.m. UTC | #7
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Monday, August 10, 2020 8:19 AM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them
> 
> On 2020-08-08 8:24, Soft Works wrote:
> >
> >
> >> -----Original Message-----
> >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> >> Steve Lhomme
> >> Sent: Saturday, August 8, 2020 7:10 AM
> >> To: ffmpeg-devel@ffmpeg.org
> >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11
> >> buffer copies are done before submitting them
> >
[...]


> >> the very least the locks are needed inside libavcodec to avoid
> >> setting DXVA buffers concurrently from different threads. It will
> >> most likely result in very bad distortions if not crashes. Maybe
> >> you're only using 1 decoding thread with DXVA (which a lot of people
> >> do) so you don't have this issue, but this is not my case.
> >
> > I see no point in employing multiple threads for hw accelerated decoding.
> > To be honest I never looked into or tried whether ffmpeg even supports
> > multiple threads with dva2 or d3d11va hw acceleration.
> 
> Maybe you're in an ideal situation where all the files you play through
> libavcodec are hardware accelerated (so also with matching hardware). In
> this case you don't need to care about the case where it will fallback to
> software decoding. Using a single thread in that case would have terrible
> performance.

I think we need to clarify the use cases we're talking about.

There is no "my case". All I'm talking about is using D3D11VA hardware 
acceleration using the ffmpeg.exe CLI.

You seem to have a rather special case where you are using parts of
the ffmpeg DXVA/D3D11VA code from another application (VLC?).

Did I understand that correctly?

> >> The documentation for ID3D11DeviceContext is very clear about that [1]:
> >> "Because each ID3D11DeviceContext is single threaded, only one thread
> >> can call a ID3D11DeviceContext at a time. If multiple threads must
> >> access a single ID3D11DeviceContext, they must use some
> >> synchronization mechanism, such as critical sections, to synchronize
> access to that ID3D11DeviceContext."
> >
> > Yes, but this doesn't apply to accessing staging textures IIRC.
> 
> It does. To copy to a staging texture you need to use
> ID3D11DeviceContext::CopySubresourceRegion().

Correct. And with DX11 and using SetMultithreadProtected it is legal
to call this from multiple threads without synchronization.

> You probably don't have any synchronization issues in your pipeline because
> it seems you copy from GPU to CPU. In that case it forces the
> ID3D11DeviceContext::GetData() internally to make sure all the commands
> to produce your source texture on that video context are finished
> processing. You may not see it, but there's a wait happening there. 

I've looked back into my work history and gladly most of it came 
back to me.

Yes, it's correct, there's a "wait happening". From your wording I 
would assume that you've already realized that I was right in stating
that there's no need for external locking:

- Not for uploading 
- Not for downloading (at least not for the regular ffmpeg use case)

There is still some locking applied: internally, inside the DX11 runtime
(because we are using SetMultithreadProtected). And there's also
the "wait happening".

Let's go through an example: downloading a texture.

1. Context_CopySubresourceRegion: copy the GPU texture to the staging texture

CopySubresourceRegion is asynchronous anyway. It just puts the copy 
request into the DX11 processing queue. Using SetMultithreadProtected
avoids any race conditions, and this call always returns immediately.

2. Context_Map: make the staging texture accessible to the CPU

When called without MapFlags, this call blocks until the texture is
mapped (and we can be sure that CopySubresourceRegion has been executed
by then).

=> This is the 'wait' you've been talking about

3. av_image_copy: copy the image out of the staging texture 

Takes its time for copying, obviously.

4. Context_Unmap: release the texture mapping

Returns immediately

-----------------

We've seen that there is no locking required with regard to DX11,
but there's still one thing left: the staging texture. To resolve this,
I'm using multiple staging textures (it's system memory, not GPU
memory).

When we look at the sequence 1 - 2 - 3 - 4, it's obvious that
it can run much faster when just the individual steps are synchronized
(by DX11) than when we put one big lock around all four steps
ourselves.

I've been struggling with this for a long time, because DXVA2 was 
often much faster than D3D11VA, and this kind of parallelism
was finally the way to get them working equally fast.

-----------------

It is not really obvious from the documentation that it is legal
to use CopySubresourceRegion, Map and UnMap in (pseudo-)
parallel even on multiple indexes of the same ArrayTexture.
IIRC I got one hint at this from an internal (yet public) Nvidia
presentation about DX11 and another one from the source code
of a game engine, but I haven't saved those links.

-----------------

@Steven - I don't know anything about your specific way of
using the ffmpeg code.  Perhaps, the above information is
useful for you in some way, but maybe those locks are unavoidable
in your case.

My only concern is that your changes do not slow down normal
ffmpeg operation - like the locks you had added earlier.
Maybe those could be put into some condition?

Kind regards,
softworkz
Steve Lhomme Aug. 10, 2020, 12:15 p.m. UTC | #8
On 2020-08-10 12:04, Soft Works wrote:
>>>> the very least the locks are needed inside libavcodec to avoid
>>>> setting DXVA buffers concurrently from different threads. It will
>>>> most likely result in very bad distortions if not crashes. Maybe
>>>> you're only using 1 decoding thread with DXVA (which a lot of people
>>>> do) so you don't have this issue, but this is not my case.
>>>
>>> I see no point in employing multiple threads for hw accelerated decoding.
>>> To be honest I never looked into or tried whether ffmpeg even supports
>>> multiple threads with dva2 or d3d11va hw acceleration.
>>
>> Maybe you're in an ideal situation where all the files you play through
>> libavcodec are hardware accelerated (so also with matching hardware). In
>> this case you don't need to care about the case where it will fallback to
>> software decoding. Using a single thread in that case would have terrible
>> performance.
> 
> I think we need to clarify the use cases we're talking about.
> 
> There is no "my case". All I'm talking about is using D3D11VA hardware
> acceleration using the ffmpeg.exe CLI.
> 
> You seem to have a rather special case where you are using parts of
> the ffmpeg DXVA/D3D11VA code from another application (VLC?)

I don't think there is anything special about using libavcodec in 
another application. That's where the code we're discussing is, not the 
ffmpeg CLI or ffplay. The API has to be designed to work in all host 
apps, not just these simpler use cases.

> Did I understand that correctly?
> 
>>>> The documentation for ID3D11DeviceContext is very clear about that [2]:
>>>> "Because each ID3D11DeviceContext is single threaded, only one thread
>>>> can call a ID3D11DeviceContext at a time. If multiple threads must
>>>> access a single ID3D11DeviceContext, they must use some
>>>> synchronization mechanism, such as critical sections, to synchronize
>> access to that ID3D11DeviceContext."
>>>
>>> Yes, but this doesn't apply to accessing staging textures IIRC.
>>
>> It does. To copy to a staging texture you need to use
>> ID3D11DeviceContext::CopySubresourceRegion().
> 
> Correct. And with DX11 and using SetMultithreadProtected it is legal
> to call this from multiple threads without synchronization.

No. I already explained it and pointed to the Microsoft documentation 
[1]. SetMultithreadProtected relates to ID3D11Device. 
ID3D11DeviceContext needs to be managed as a non-thread-safe resource.
If you want, you can even create one ID3D11DeviceContext per thread [2]. 
I'd be curious to see the effect on multithreaded decoding.

Also it seems SetMultithreadProtected() is not even needed by default. 
It enables the "thread-safe layer" [3]. But in D3D11 that's the default 
behavior. See [4] "Use this flag if your application will only call 
methods of Direct3D 11 interfaces from a single thread. By default, the 
ID3D11Device object is thread-safe."
SetMultithreadProtected() only made sense for D3D10:
"Direct3D 11 has been designed from the ground up to support 
multithreading. Direct3D 10 implements limited support for 
multithreading using the thread-safe layer."

>> You probably don't have any synchronization issues in your pipeline because
>> it seems you copy from GPU to CPU. In that case it forces the
>> ID3D11DeviceContext::GetData() internally to make sure all the commands
>> to produce your source texture on that video context are finished
>> processing. You may not see it, but there's a wait happening there.
> 
> I've looked back into my work history and gladly most memory
> came back.
> 
> Yes, it's correct, there's a "wait happening". From your wording I
> would assume that you've already realized that I was right in stating
> that there's no need for an external locking:
> 
> - Not for uploading
> - Not for downloading (at least not for the regular ffmpeg use case)
> 
> There is still some locking applied: Internally inside the DX11 runtime
> (because we are using SetMultithreadProtected). And there's also
> the "wait happening".

As the doc says, you have to use some synchronization. It may work in 
your case (FFmpeg CLI I suppose). As you mentioned you only use one 
thread. There's less chance that it can fail. But copying memory to/from 
CPU/GPU is probably the slowest part of the whole decoding (hence we 
don't do any in VLC in normal playback). So if you have one decoding 
thread doing that copy and another thread reading on the same 
ID3D11DeviceContext, you're likely to run into race-condition issues. I 
don't know what the FFmpeg CLI does, so I can't tell.

> Let's go through an example: Downloading of a texture
> 
> 1. Context_CopySubresourceRegion: Copy GPU texture to staging texture
> 
> CopySubresourceRegion is asynchronous anyway. It just puts the copy
> request into the DX11 processing queue. Using SetMultithreadProtected
> avoids any race conditions, but this call always returns immediately.
> 
> 2. Context_Map: Make the staging texture accessible for the CPU
> 
> When called without MapFlags, this call blocks until the texture is
> mapped (and we can be sure that CopySubresourceRegion is executed
> by then).
> 
> => This is the 'wait' you've been talking about

Yes.

> 3. av_image_copy : Copy the image from the staging texture
> 
> Takes its time for copying obviously.
> 
> 4. Context_Unmap: Release the texture mapping
> 
> Returns immediately
> 
> -----------------
> 
> We've seen that there is no locking required with regards to DX11,
> but there's still one thing left: The staging texture. To resolve this
> I'm using multiple staging textures (it's system memory, not GPU
> memory).
> 
> When we look at the sequence 1 - 2 - 3 - 4, it's obvious that
> It can run much faster when just the individual steps are synchronized
> (by DX11) as when we would put one big lock around 1234 from
> our side.
> 
> I've been struggling a long time with this, because DXVA2 was
> often much faster than D3D11VA and this kind of parallelism
> was finally the way to get it working equally fast.

That's one very particular case where you do a copy to the CPU. There is 
some synchronization happening in the memory mapping. But that only covers 
a small part of the possibilities of D3D11VA in libavcodec. And that's 
certainly not what I use.
You can't deduce from that usage that synchronization (access to 
ID3D11DeviceContext) is not needed. In fact the Microsoft documentation 
and my experience show the exact opposite.

>
> -----------------
> 
> It is not really obvious from the documentation that it is legal
> to use CopySubresourceRegion, Map and UnMap in (pseudo-)
> parallel even on multiple indexes of the same ArrayTexture.
> IIRC I got one hint at this from an internal (yet public) Nvidia
> presentation about DX11 and another one from the source code
> of a game engine, but I haven't saved those links.
> 
> -----------------
> 
> @Steven

My name is Steve.

> I don't know anything about your specific way of
> using the ffmpeg code.  Perhaps, the above information is
> useful for you in some way, but maybe those locks are unavoidable
> in your case.
> 
> My only concern is that your changes do not slow down normal
> ffmpeg operation - like the locks you had added earlier.
> Maybe those could be put into some condition?

My change in e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 is optional. So 
much so that it even requires using the creation helper function to make 
sure the mutex is properly initialized to an empty value and to retain 
backward compatibility. If you don't want to use many threads you can 
safely ignore this field.

That being said, that's for the old API. I suppose the one you're 
talking about is the new API for which I have done nothing. If the mutex 
is always set, that's not my fault.

If you want the lock to have no effect, you can set the lock/unlock 
callbacks of AVD3D11VADeviceContext to functions that do nothing. If you 
don't set them, the documentation says:
"If unset on init, the hwcontext implementation will set them to use an 
internal mutex."

It's certainly better than commenting out a whole bunch of code [5].

> Kind regards,
> softworkz

[1] 
https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro
[2] 
https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-render
[3] 
https://docs.microsoft.com/en-us/windows/win32/api/d3d10/nn-d3d10-id3d10multithread
[4] 
https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ne-d3d11-d3d11_create_device_flag
[5] 
https://github.com/softworkz/ffmpeg_dx11/commit/c09cc37ce7f513717493e060df740aa0e7374257
Soft Works Aug. 11, 2020, 1:52 a.m. UTC | #9
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Monday, August 10, 2020 2:15 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them
> 

> > @Steven
> 
> My name is Steve.

I apologize for that.

> My change in e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 is optional. So
> much that it even requires to use the creator helper function to make sure
> the mutex is properly initialized to an empty value and retain backward
> compatibility. If you don't want to use many threads you can safely ignore
> this field.

I thought you had reviewed my code, because you said that it was you 
who had added the locks that I have removed. I just took a look at your commit 
e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 and I can say that those
are not the locks that I have removed. 
So you are actually the wrong address for the lock removal; you've just 
thrown yourself into the line of fire ;-)

But the DX11 subject still needs some clarification...

> >>>> The documentation for ID3D11DeviceContext is very clear about that
> [2]:
> >>>> "Because each ID3D11DeviceContext is single threaded, only one
> >>>> thread can call a ID3D11DeviceContext at a time. If multiple
> >>>> threads must access a single ID3D11DeviceContext, they must use
> >>>> some synchronization mechanism, such as critical sections, to
> >>>> synchronize
> >> access to that ID3D11DeviceContext."
> >>>
> >>> Yes, but this doesn't apply to accessing staging textures IIRC.
> >>
> >> It does. To copy to a staging texture you need to use
> >> ID3D11DeviceContext::CopySubresourceRegion().
> >
> > Correct. And with DX11 and using SetMultithreadProtected it is legal
> > to call this from multiple threads without synchronization.
> 
> No. I already explained it and pointed to the Microsoft documentation [1].
> SetMultithreadProtected relates to ID3D11Device.

No, that's exactly wrong. You need to re-read. ID3D11Device is thread-safe
by default. You can read that everywhere throughout the documentation.

SetMultithreadProtected initially didn't even have a D3D11 interface.
The documentation also didn't tell anything about it. There was only
the DX10 documentation for this.

But SetMultithreadProtected in fact has the effect of protecting 
the methods of ID3D11DeviceContext (not the device).
Like I explained several times already.


Luckily I've just seen that there has been a change meanwhile.
The undocumented behavior has finally been made public by
adding a DX11 version of SetMultithreadProtected
(which does the same as what happens when you use the DX10
version with DX11)

https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn-d3d11_4-id3d11multithread

Now: Read the docs!

PS: I need to say that your way of discrediting my statements
was not very respectful.

Kind regards,
softworkz
Steve Lhomme Aug. 11, 2020, 4:35 a.m. UTC | #10
On 2020-08-11 3:52, Soft Works wrote:
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> Steve Lhomme
>> Sent: Monday, August 10, 2020 2:15 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
>> copies are done before submitting them
>>
> 
>>> @Steven
>>
>> My name is Steve.
> 
> I apologize for that.
> 
>> My change in e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 is optional. So
>> much that it even requires to use the creator helper function to make sure
>> the mutex is properly initialized to an empty value and retain backward
>> compatibility. If you don't want to use many threads you can safely ignore
>> this field.
> 
> I thought you had reviewed my code, because you said that it was you
> who had added the locks that I have removed. I just took a look at your commit

I said "This is how I ended up adding the mutex to the old API 
(e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86)."
I never claimed I added the locks.

> e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 and I can say that those
> are not the locks that I have removed.
> So you are actually the wrong address for the lock removal, you've just
> thrown yourself in the line ;-)

No. You misread what I said. I was explaining that my experience on 
multiple devices lead me to add this mutex otherwise it would not work 
properly or even crash.

> But the DX11 subject still needs some clarification...
> 
>>>>>> The documentation for ID3D11DeviceContext is very clear about that
>> [2]:
>>>>>> "Because each ID3D11DeviceContext is single threaded, only one
>>>>>> thread can call a ID3D11DeviceContext at a time. If multiple
>>>>>> threads must access a single ID3D11DeviceContext, they must use
>>>>>> some synchronization mechanism, such as critical sections, to
>>>>>> synchronize
>>>> access to that ID3D11DeviceContext."
>>>>>
>>>>> Yes, but this doesn't apply to accessing staging textures IIRC.
>>>>
>>>> It does. To copy to a staging texture you need to use
>>>> ID3D11DeviceContext::CopySubresourceRegion().
>>>
>>> Correct. And with DX11 and using SetMultithreadProtected it is legal
>>> to call this from multiple threads without synchronization.
>>
>> No. I already explained it and pointed to the Microsoft documentation [1].
>> SetMultithreadProtected relates to ID3D11Device.
> 
> No, that's exactly wrong. You need to re-read. ID3D11Device is thread-safe
> by default. You can read that everywhere throughout the documentation.

I read the documentation. That's not how I understood it. ID3D11Device 
is safe "by default" meaning it can be turned off. I assumed 
SetMultithreadProtected() is the way to do that. It seems 
D3D11_CREATE_DEVICE_SINGLETHREADED is the way to make it not thread safe.

> SetMultithreadProtected initially didn't even have a D3D11 interface.
> The documentation also didn't tell anything about it. There was only
> the DX10 documentation for this.
> 
> But SetMultithreadProtected in fact has the effect of protecting
> the methods from ID3D11DeviceContext (not device).
> Like I explained several times already.
> 
> 
> Luckily I've just seen that there has been a change meanwhile.
> The undocumented behavior has finally been made public by
> adding a DX11 version of SetMultithreadProtected
> (which does the same as what happens when you use the DX10
> version with DX11)
> 
> https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn-d3d11_4-id3d11multithread

OK, this is very interesting. It certainly clarifies that it's 
ID3D11DeviceContext related. The documentation is still sketchy though. 
For example ::Enter() [1] says that it enters a device critical section, 
not a device context critical section. So if you have multiple device 
contexts, everything is blocked. This kinda defeats the use of multiple 
threads.

Also this interface is not available until the Windows 10 Creators Update. 
So for example it cannot be used on Windows 7. Maybe the D3D10 version has 
the same effect but there's no guarantee. In fact it wasn't sufficient 
on mobile devices.

I'll try without the mutex (old API) on Windows 7 and see the effect.

Anyway this thread/patch is about waiting for data to be fully copied 
between the CPU and the GPU before submitting the decoding command. Not 
about thread/command synchronization.

> Now: Read the docs!
> 
> PS: I need to say that your way of discrediting my statements
> was not very respectful.

Likewise.

> Kind regards,
> softworkz
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
> 

[1] 
https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nf-d3d11_4-id3d11multithread-enter
Soft Works Aug. 11, 2020, 6:58 a.m. UTC | #11
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Tuesday, August 11, 2020 6:35 AM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them

[...]

> > But SetMultithreadProtected in fact has the effect of protecting the
> > methods from ID3D11DeviceContext (not device).
> > Like I explained several times already.
> >
> >
> > Luckily I've just seen that there has been a change meanwhile.
> > The undocumented behavior has finally been made public by adding a
> > DX11 version of SetMultithreadProtected (which does the same as what
> > happens when you use the DX10 version with DX11)
> >
> > https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn-
> d3d11_4-
> > id3d11multithread
> 
> OK, this is very interesting. It certainly clarifies that it's ID3D11DeviceContext
> related. The documentation is still sketchy though.
> For example ::Enter() [1] says that it enters a device critical section, not a
> device context critical section. So if you have multiple device context
> everything is blocked. This kinda defeats the use of multiple threads.
> 
> Also this interface is not available until Windows 10 Creator Update. So for
> example it cannot be used on Windows 7. Maybe the D3D10 version has the
> same effect but there's no guarantee.

Still playing "the doubtful" is not a sincere reaction after having been
proven wrong.

> I'll try without the mutex (old API) on Windows 7 and see the effect.

My patch requires feature level 11_1:

https://github.com/softworkz/ffmpeg_dx11/commit/c09cc37ce7f513717493e060df740aa0e7374257#diff-08dfe4cade908944fd2815bf36309d7bR591-R592

Windows 7 does not have DX11.1
(there's a 11.1 platform update for Win 7 but it doesn't provide full 11.1)
For Win7 and below we're using D3D9 with DXVA2 only.

Kind regards,
softworkz
Steve Lhomme Aug. 11, 2020, 8:05 a.m. UTC | #12
On 2020-08-11 8:58, Soft Works wrote:
> 
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> Steve Lhomme
>> Sent: Tuesday, August 11, 2020 6:35 AM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
>> copies are done before submitting them
> 
> [...]
> 
>>> But SetMultithreadProtected in fact has the effect of protecting the
>>> methods from ID3D11DeviceContext (not device).
>>> Like I explained several times already.
>>>
>>>
>>> Luckily I've just seen that there has been a change meanwhile.
>>> The undocumented behavior has finally been made public by adding a
>>> DX11 version of SetMultithreadProtected (which does the same as what
>>> happens when you use the DX10 version with DX11)
>>>
>>> https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn-
>> d3d11_4-
>>> id3d11multithread
>>
>> OK, this is very interesting. It certainly clarifies that it's ID3D11DeviceContext
>> related. The documentation is still sketchy though.
>> For example ::Enter() [1] says that it enters a device critical section, not a
>> device context critical section. So if you have multiple device context
>> everything is blocked. This kinda defeats the use of multiple threads.
>>
>> Also this interface is not available until Windows 10 Creator Update. So for
>> example it cannot be used on Windows 7. Maybe the D3D10 version has the
>> same effect but there's no guarantee.
> 
> Still playing "the doubtful" is not a sincere reaction after having been
> proven wrong.

Sorry if you seem to know all the answers already, but I don't and so I 
have to investigate.

>> I'll try without the mutex (old API) on Windows 7 and see the effect.

I did some more tests on Win10 before trying Win7. And removing the mutex 
is not an option. There are a lot of bogus frames. It makes sense, as the 
API is not meant to be thread-safe.

SubmitDecoderBuffers() has no way of knowing exactly which buffers 
you are providing. Each buffer is tied to an index of the slice so it 
could be called without a lock, as long as 2 threads don't try to use 
the same slice at the same time. But when you submit the buffers you 
only give the type. I suppose D3D11 takes the last buffer of each type 
that was released (which is pretty much an unmap, so should have similar 
internal synchronization). But if 2 threads release a buffer of the same 
type and submit it right after, D3D11 has no way to know which to use.

Here's an example:
- thread A releases a D3D11_VIDEO_DECODER_BUFFER_BITSTREAM for slice 0
- thread B releases a D3D11_VIDEO_DECODER_BUFFER_BITSTREAM for slice 1
- thread A submit a buffer of type D3D11_VIDEO_DECODER_BUFFER_BITSTREAM
- thread B submit a buffer of type D3D11_VIDEO_DECODER_BUFFER_BITSTREAM

D3D11 doesn't seem to keep track of which thread submitted what (and in 
a single thread you could also release 2 buffers of the same type and 
submit them; I'm not sure what would happen). And so the Thread A submit 
is likely using the buffer from Thread B. I'm not even sure what Thread 
B would be using.

The mutex/lock groups the buffer get/release and submit together, so 
that no other thread can do the same thing at the same time. D3D11 
cannot do it by itself, even with multithread turned on. (D3D12 doesn't 
have this problem as you submit the actual buffer, not just the type [1][2])

None of that happens if you use a single thread for decoding.

> My patch requires feature level 11_1:
 >
> https://github.com/softworkz/ffmpeg_dx11/commit/c09cc37ce7f513717493e060df740aa0e7374257#diff-08dfe4cade908944fd2815bf36309d7bR591-R592
> 
> Windows 7 does not have DX11.1
> (there's a 11.1 platform update for Win 7 but it doesn't provide full 11.1)
> For Win7 and below we're using D3D9 with DXVA2 only.
> 
> Kind regards,
> softworkz
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

[1] 
https://docs.microsoft.com/en-us/windows/win32/api/d3d12video/nf-d3d12video-id3d12videodecodecommandlist-decodeframe
[2] 
https://docs.microsoft.com/en-us/windows/win32/api/d3d12video/ns-d3d12video-d3d12_video_decode_frame_argument
Soft Works Aug. 11, 2020, 10:25 a.m. UTC | #13
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Tuesday, August 11, 2020 10:06 AM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them
> 
> On 2020-08-11 8:58, Soft Works wrote:
> >
> >
> >> -----Original Message-----
> >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> >> Steve Lhomme
> >> Sent: Tuesday, August 11, 2020 6:35 AM
> >> To: ffmpeg-devel@ffmpeg.org
> >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11
> >> buffer copies are done before submitting them
> >
> > [...]
> >
> >>> But SetMultithreadProtected in fact has the effect of protecting the
> >>> methods from ID3D11DeviceContext (not device).
> >>> Like I explained several times already.
> >>>
> >>>
> >>> Luckily I've just seen that there has been a change meanwhile.
> >>> The undocumented behavior has finally been made public by adding a
> >>> DX11 version of SetMultithreadProtected (which does the same as what
> >>> happens when you use the DX10 version with DX11)
> >>>
> >>> https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn-
> >> d3d11_4-
> >>> id3d11multithread
> >>
> >> OK, this is very interesting. It certainly clarifies that it's
> >> ID3D11DeviceContext related. The documentation is still sketchy though.
> >> For example ::Enter() [1] says that it enters a device critical
> >> section, not a device context critical section. So if you have
> >> multiple device context everything is blocked. This kinda defeats the use
> of multiple threads.
> >>
> >> Also this interface is not available until Windows 10 Creator Update.
> >> So for example it cannot be used on Windows 7. Maybe the D3D10
> >> version has the same effect but there's no guarantee.
> >
> > Still playing "the doubtful" is not a sincere reaction after having
> > been proven wrong.
> 
> Sorry if you seem to know all the answers already, but I don't and so I have to
> investigate.

Last year, I had literally worked this down to death. I followed every slightest
hint from countless searches, read through hundreds of discussions, driven
because I was unwilling to believe that up-/downloading of video textures with
D3D11 can't be done equally fast as with D3D9. 
(the big picture was the implementation of D3D11 support for QuickSync where
the slowdown played a much bigger role than with D3D11VA decoders only).
Eventually I landed at some internal Nvidia presentation, some talks with MS
guys and some source code discussion deep inside a 3D game engine (not a 
no-name). It really bugs me that I didn't properly note the references, but
from somewhere in between I was able to gather solid evidence about what 
is legal to do and what is not. Based on that, several iterations followed to
find the optimal way for doing the texture transfer. As I had implemented
D3D11 support for QuickSync, this got pretty complicated because with
a full transcoding pipeline, all parts (decoder, encoder and filters) can (and
usually will) request textures. Only the latest Intel Drivers can work with
array textures everywhere (e.g. VPP), so I also needed to add support for
non-array texture allocation. The patch you've seen is the result of weeks
of intensive work (a small but crucial part of it) - even if it may not 
look like it.


> Sorry if you seem to know all the answers already

Obviously, I don't know all the answers, but all the answers I have given
were correct. And when I didn't have an answer I always respectfully 
said that your situation might be different.
And I didn't reply by implying that you would have done your work
by trial-and-error or from most likely invalid assumptions or deductions.


I still don't know how you are actually operating this and thus I also cannot
tell what might or might not work in your case.
All I can tell is that the procedure that I have described (1-2-3-4) can 
work rock-solid for multi-threaded DX11 texture transfer when it's done in 
the same way as I've shown.
And believe it or not - I would still be happy if it were 
of any use to you...

Kind regards,
softworkz
Steve Lhomme Aug. 11, 2020, 10:43 a.m. UTC | #14
>> Sorry if you seem to know all the answers already, but I don't and so I have to
>> investigate.
> 
> Last year, I had literally worked this down to death. I followed every slightest
> hint from countless searches, read through hundreds of discussions, driven
> because I was unwilling to believe that up-/downloading of video textures with
> D3D11 can't be done equally fast as with D3D9.
> (the big picture was the implementation of D3D11 support for QuickSync where
> the slowdown played a much bigger role than with D3D11VA decoders only).
> Eventually I landed at some internal Nvidia presentation, some talks with MS
> guys and some source code discussion deep inside a 3D game engine (not a
> no-name). It really bugs me that I didn't properly note the references, but
> from somewhere in between I was able to gather solid evidence about what
> is legal to do and what Is not. Based on that, followed several iterations to
> find the optimal way for doing the texture transfer. As I had implemented
> D3D11 support for QuickSync, this got pretty complicated because with
> a full transcoding pipeline, all parts (decoder, encoder and filters) can (and
> usually will) request textures. Only the latest Intel Drivers can work with
> array textures everywhere (e.g. VPP), so I also needed to add support for
> non-array texture allocation. The patch you've seen is the result of weeks
> of intensive work (a small but crucial part of it) - even when it may not
> look like that.
> 
> 
>> Sorry if you seem to know all the answers already
> 
> Obviously, I don't know all the answers, but all the answers I have given
> were correct. And when I didn't have an answer I always respectfully
> said that your situation might be different.
> And I didn't reply by implying that you would have done your work
> by trial-and-error or most likely invalid assumptions or deductions.
> 
> 
> I still don't know how you are actually operating this and thus I also cannot
> tell what might or might not work in your case.
> All I can tell is that the procedure that I have described (1-2-3-4) can
> work rock-solid for multi-threaded DX11 texture transfer when it's done in
> the same way as I've shown.
> And believe it or not - I would still be happy when it would be
> of any use for you...

Even though the discussion is heated (fitting with the weather here) I 
don't mind. I learned some stuff and it pushed me to dig deeper. I can't 
just accept your word for it. I need something solid if I'm going to 
remove a lock that helped me so far.

So I'm currently tooling VLC to be able to bring the decoder to its 
knees and find out what it can and cannot do safely. So far I can still 
see decoding artifacts when I don't use a lock, which would mean I still 
need the mutex, for the reasons given in the previous mail.
Soft Works Aug. 11, 2020, 11:08 a.m. UTC | #15
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Tuesday, August 11, 2020 12:44 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them


> Even though the discussion is heated (fitting with the weather here) I don't
> mind. I learned some stuff and it pushed me to dig deeper. I can't just accept
> your word for it. I need something solid if I'm going to remove a lock that
> helped me so far.

You're not supposed to accept my word. I wouldn't either. You could have
just valued it as a possibility... that's all.

> So I'm currently tooling VLC to be able to bring the decoder to its knees and
> find out what it can and cannot do safely. So far I can still see decoding
> artifacts when I don't a lock, which would mean I still need the mutex, for the
> reasons given in the previous mail.

Are you actually using D3D11 staging textures? I'm still not clear about 
that "old-api" way.

I'm curious: what's your motivation for using D3D11 over D3D9?

I mean, we are always transcoding without rendering a video to any surface.
Our reasons to use D3D11 are these two:

1. Works headless (without connected display)
2. Works without user session (like inside a Windows service)

What's yours?

When implementing D3D11 for QuickSync I managed to get very close to
DXVA2 performance eventually, but I have hardly ever been able to exceed it.

PS: Yes, it's damn hot here as well! :-)

Kind regards,
softworkz
Steve Lhomme Aug. 11, 2020, 11:28 a.m. UTC | #16
On 2020-08-11 13:08, Soft Works wrote:
> 
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> Steve Lhomme
>> Sent: Tuesday, August 11, 2020 12:44 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
>> copies are done before submitting them
> 
> 
>> Even though the discussion is heated (fitting with the weather here) I don't
>> mind. I learned some stuff and it pushed me to dig deeper. I can't just accept
>> your word for it. I need something solid if I'm going to remove a lock that
>> helped me so far.
> 
> You're not supposed to accept my word. I wouldn’t do either. You could have
> just valued it as a possibility..that's all.
> 
>> So I'm currently tooling VLC to be able to bring the decoder to its knees and
>> find out what it can and cannot do safely. So far I can still see decoding
>> artifacts when I don't a lock, which would mean I still need the mutex, for the
>> reasons given in the previous mail.
> 
> Are you actually using D3D11 staging textures? I'm still not clear about
> that "old-api" way.

No. I render the texture slices on the screen straight from the decoder. 
In a separate thread from the libavcodec thread(s).

> I'm curious: what's your motivation for using D3D11 over D3D9?

At first for UWP apps (no other way except for D3D12). But now also 
because it supports HDR.

> I mean, we are always transcoding without rendering a video to any surface.
> Our reasons to use D3D11 are these two:
> 
> 1. Works headless (without connected display)
> 2. Works without user session (like inside a Windows service)
> 
> What's yours?
> 
> When implementing D3D11 for QuickSync I managed to get very close to
> DXVA2 performance eventually, but I hardly ever have been able to exceed it..

Same here, especially because I never managed to make it work without 
copying first to another texture. But I might give it another try in 
light of all this.

> PS: Yes, it's damn hot here as well! :-)
> 
> Kind regards,
> softworkz
Steve Lhomme Aug. 12, 2020, 12:04 p.m. UTC | #17
On 2020-08-11 12:43, Steve Lhomme wrote:
>>> Sorry if you seem to know all the answers already, but I don't and so 
>>> I have to
>>> investigate.
>>
>> Last year, I had literally worked this down to death. I followed every 
>> slightest
>> hint from countless searches, read through hundreds of discussions, 
>> driven
>> because I was unwilling to believe that up-/downloading of video 
>> textures with
>> D3D11 can't be done equally fast as with D3D9.
>> (the big picture was the implementation of D3D11 support for QuickSync 
>> where
>> the slowdown played a much bigger role than with D3D11VA decoders only).
>> Eventually I landed at some internal Nvidia presentation, some talks 
>> with MS
>> guys and some source code discussion deep inside a 3D game engine (not a
>> no-name). It really bugs me that I didn't properly note the 
>> references, but
>> from somewhere in between I was able to gather solid evidence about what
>> is legal to do and what Is not. Based on that, followed several 
>> iterations to
>> find the optimal way for doing the texture transfer. As I had implemented
>> D3D11 support for QuickSync, this got pretty complicated because with
>> a full transcoding pipeline, all parts (decoder, encoder and filters) 
>> can (and
>> usually will) request textures. Only the latest Intel Drivers can work 
>> with
>> array textures everywhere (e.g. VPP), so I also needed to add support for
>> non-array texture allocation. The patch you've seen is the result of 
>> weeks
>> of intensive work (a small but crucial part of it) - even when it may not
>> look like that.
>>
>>
>>> Sorry if you seem to know all the answers already
>>
>> Obviously, I don't know all the answers, but all the answers I have given
>> were correct. And when I didn't have an answer I always respectfully
>> said that your situation might be different.
>> And I didn't reply by implying that you would have done your work
>> by trial-and-error or most likely invalid assumptions or deductions.
>>
>>
>> I still don't know how you are actually operating this and thus I also 
>> cannot
>> tell what might or might not work in your case.
>> All I can tell is that the procedure that I have described (1-2-3-4) can
>> work rock-solid for multi-threaded DX11 texture transfer when it's 
>> done in
>> the same way as I've shown.
>> And believe it or not - I would still be happy when it would be
>> of any use for you...
> 
> Even though the discussion is heated (fitting with the weather here) I 
> don't mind. I learned some stuff and it pushed me to dig deeper. I can't 
> just accept your word for it. I need something solid if I'm going to 
> remove a lock that helped me so far.
> 
> So I'm currently tooling VLC to be able to bring the decoder to its 
> knees and find out what it can and cannot do safely. So far I can still 
> see decoding artifacts when I don't a lock, which would mean I still 
> need the mutex, for the reasons given in the previous mail.

A follow-up on this. Using ID3D10Multithread seems to be enough to have 
mostly thread-safe ID3D11Device/ID3D11DeviceContext/etc. Even the 
decoder, with its odd API, seems to know what to do when submitted 
different buffers.

I did not manage to saturate the GPU, but I reached a much higher 
decoding speed/throughput, enough to validate the errors I got before. 
Many of them were due to VLC dropping data because of odd timing.

Now I still have some threading issues. For example, for deinterlacing we 
create an ID3D11VideoProcessor to handle the deinterlacing. And we create 
it after the decoding has started (as the deinterlacing can be 
enabled/disabled dynamically). Without the mutex in the decoder it 
crashes in ID3D11VideoDevice::CreateVideoProcessor() and 
ID3D11VideoContext::SubmitDecoderBuffers() as they are being called 
simultaneously. If I add the mutex between the decoder and just this 
filter (not the rendering side) it works fine.

So I guess I'm stuck with the mutex for the time being.

Here is the stack trace on an Intel 630 GPU:

igd11dxva64.dll!00007ffc384a8d24() (Unknown Source:0)
igd11dxva64.dll!00007ffc38452030() (Unknown Source:0)
igd11dxva64.dll!00007ffc3845a081() (Unknown Source:0)
igd11dxva64.dll!00007ffc38465a27() (Unknown Source:0)
igd11dxva64.dll!00007ffc386067d2() (Unknown Source:0)
igd11dxva64.dll!00007ffc3883c9f3() (Unknown Source:0)
igd11dxva64.dll!00007ffc3867145a() (Unknown Source:0)
igd11dxva64.dll!00007ffc3866ea23() (Unknown Source:0)
igd11dxva64.dll!00007ffc3881b4ac() (Unknown Source:0)
igd11dxva64.dll!00007ffc384f7bdc() (Unknown Source:0)
igd11dxva64.dll!00007ffc384fa2a5() (Unknown Source:0)
igd11dxva64.dll!00007ffc3847a334() (Unknown Source:0)
d3d11.dll!00007ffcabc33e8d() (Unknown Source:0)
d3d11.dll!00007ffcabc3389d() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc3184fa6b() (Unknown Source:0)
   calling ID3D11VideoContext::SubmitDecoderBuffers()
libavcodec_plugin.dll!ff_dxva2_common_end_frame(AVCodecContext * avctx, 
AVFrame * frame, const void * pp, unsigned int pp_size, const void * qm, 
unsigned int qm_size, int(*)(AVCodecContext *, void *, void *) 
commit_bs_si) Line 1085 
(c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\dxva2.c:1085)
libavcodec_plugin.dll!dxva2_h264_end_frame(AVCodecContext * avctx) Line 
507 
(c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\dxva2_h264.c:507)
libavcodec_plugin.dll!ff_h264_field_end(H264Context * h, 
H264SliceContext * sl, int in_setup) Line 171 
(c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\h264_picture.c:171)
libavcodec_plugin.dll!h264_decode_frame(AVCodecContext * avctx, void * 
data, int * got_frame, AVPacket * avpkt) Line 1015 
(c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\h264dec.c:1015)
libavcodec_plugin.dll!decode_simple_internal(AVCodecContext * avctx, 
AVFrame * frame) Line 432 
(c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\decode.c:432)

win32u.dll!00007ffcb0054784() (Unknown Source:0)
gdi32.dll!00007ffcb1e03860() (Unknown Source:0)
d3d11.dll!00007ffcabc756ee() (Unknown Source:0)
d3d11.dll!00007ffcabc5c811() (Unknown Source:0)
igd11dxva64.dll!00007ffc385c5043() (Unknown Source:0)
igd11dxva64.dll!00007ffc384abaa5() (Unknown Source:0)
igd11dxva64.dll!00007ffc384ab7ab() (Unknown Source:0)
igd11dxva64.dll!00007ffc38453b27() (Unknown Source:0)
igd11dxva64.dll!00007ffc384611e6() (Unknown Source:0)
igd11dxva64.dll!00007ffc385cca30() (Unknown Source:0)
igd11dxva64.dll!00007ffc384bb303() (Unknown Source:0)
igd11dxva64.dll!00007ffc3847ccff() (Unknown Source:0)
d3d11.dll!00007ffcabc3e661() (Unknown Source:0)
d3d11.dll!00007ffcabc3d39f() (Unknown Source:0)
d3d11.dll!00007ffcabc3d0cd() (Unknown Source:0)
d3d11.dll!00007ffcabc68a46() (Unknown Source:0)
d3d11.dll!00007ffcabc5955d() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc318a263c() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc3189479a() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc3184e749() (Unknown Source:0)
d3d11.dll!00007ffcabc59d0c() (Unknown Source:0)
d3d11.dll!00007ffcabc3c606() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc3187dd0e() (Unknown Source:0)
   calling ID3D11VideoDevice::CreateVideoProcessor()
libdirect3d11_filters_plugin.dll!D3D11OpenDeinterlace(vlc_object_t * 
obj) Line 297 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\modules\hw\d3d11\d3d11_deinterlace.c:297)
libvlccore.dll!generic_start(void * func, bool forced, char * ap) Line 
294 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\src\modules\modules.c:294)
libvlccore.dll!module_load(vlc_logger * log, module_t * m, int(*)(void 
*, bool, char *) init, bool forced, char * args) Line 212 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\src\modules\modules.c:212)
libvlccore.dll!vlc_module_load(vlc_logger * log, const char * 
capability, const char * name, bool strict, int(*)(void *, bool, char *) 
probe, ...) Line 265 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\src\modules\modules.c:265)
libvlccore.dll!module_need(vlc_object_t * obj, const char * cap, const 
char * name, bool strict) Line 305 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\src\modules\modules.c:305)
libvlccore.dll!filter_chain_AppendInner(filter_chain_t * chain, const 
char * name, const char * capability, config_chain_t * cfg, const 
es_format_t * fmt_out) Line 254 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\src\misc\filter_chain.c:254)
libvlccore.dll!filter_chain_AppendFilter(filter_chain_t * chain, const 
char * name, config_chain_t * cfg, const es_format_t * fmt_out) Line 299 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\src\misc\filter_chain.c:299)
libvlccore.dll!ThreadChangeFilters(vout_thread_sys_t * vout, const char 
* filters, const bool * new_deinterlace, bool is_locked) Line 992 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\src\video_output\video_output.c:992)
libvlccore.dll!Thread(void * object) Line 1891 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\src\video_output\video_output.c:1891)
libvlccore.dll!vlc_entry(void * p) Line 360 
(c:\Users\robux\Documents\Programs\Videolabs\vlc\src\win32\thread.c:360)
msvcrt.dll!00007ffcb139af5a() (Unknown Source:0)
msvcrt.dll!00007ffcb139b02c() (Unknown Source:0)
kernel32.dll!00007ffcb21d6fd4() (Unknown Source:0)
ntdll.dll!00007ffcb23bcec1() (Unknown Source:0)
Soft Works Aug. 12, 2020, 11:01 p.m. UTC | #18
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Steve Lhomme
> Sent: Wednesday, August 12, 2020 2:05 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
> copies are done before submitting them
> 
> On 2020-08-11 12:43, Steve Lhomme wrote:
> >>> Sorry if you seem to know all the answers already, but I don't and
> >>> so I have to investigate.
> >>
> >> Last year, I had literally worked this down to death. I followed
> >> every slightest hint from countless searches, read through hundreds
> >> of discussions, driven because I was unwilling to believe that
> >> up-/downloading of video textures with
> >> D3D11 can't be done equally fast as with D3D9.
> >> (the big picture was the implementation of D3D11 support for
> >> QuickSync where the slowdown played a much bigger role than with
> >> D3D11VA decoders only).
> >> Eventually I landed at some internal Nvidia presentation, some talks
> >> with MS guys and some source code discussion deep inside a 3D game
> >> engine (not a no-name). It really bugs me that I didn't properly note
> >> the references, but from somewhere in between I was able to gather
> >> solid evidence about what is legal to do and what Is not. Based on
> >> that, followed several iterations to find the optimal way for doing
> >> the texture transfer. As I had implemented
> >> D3D11 support for QuickSync, this got pretty complicated because with
> >> a full transcoding pipeline, all parts (decoder, encoder and filters)
> >> can (and usually will) request textures. Only the latest Intel
> >> Drivers can work with array textures everywhere (e.g. VPP), so I also
> >> needed to add support for non-array texture allocation. The patch
> >> you've seen is the result of weeks of intensive work (a small but
> >> crucial part of it) - even when it may not look like that.
> >>
> >>
> >>> Sorry if you seem to know all the answers already
> >>
> >> Obviously, I don't know all the answers, but all the answers I have
> >> given were correct. And when I didn't have an answer I always
> >> respectfully said that your situation might be different.
> >> And I didn't reply by implying that you would have done your work by
> >> trial-and-error or most likely invalid assumptions or deductions.
> >>
> >>
> >> I still don't know how you are actually operating this and thus I
> >> also cannot tell what might or might not work in your case.
> >> All I can tell is that the procedure that I have described (1-2-3-4)
> >> can work rock-solid for multi-threaded DX11 texture transfer when
> >> it's done in the same way as I've shown.
> >> And believe it or not - I would still be happy when it would be of
> >> any use for you...
> >
> > Even though the discussion is heated (fitting with the weather here) I
> > don't mind. I learned some stuff and it pushed me to dig deeper. I
> > can't just accept your word for it. I need something solid if I'm
> > going to remove a lock that helped me so far.
> >
> > So I'm currently tooling VLC to be able to bring the decoder to its
> > knees and find out what it can and cannot do safely. So far I can
> > still see decoding artifacts when I don't a lock, which would mean I
> > still need the mutex, for the reasons given in the previous mail.
> 
> A follow-up on this. Using ID3D10Multithread seems to be enough to have
> mostly thread safe ID3D11Device/ID3D11DeviceContext/etc. Even the
> decoding with its odd API seem to know what to do when submitted
> different buffers.
> 
> I did not manage to saturate the GPU but I much bigger decoding
> speed/throughput to validate the errors I got before. Many of them were
> due to VLC dropping data because of odd timing.
> 
> Now I still have some threading issues. For example for deinterlacing we
> create a ID3D11VideoProcessor to handle the deinterlacing. And we create it
> after the decoding started (as the deinterlacing can be enabled/disabled
> dynamically). Without the mutex in the decoder it crashes on
> ID3D11VideoDevice::CreateVideoProcessor() and
> ID3D11VideoContext::SubmitDecoderBuffers() as they are being called
> simultaneously. If I add the mutex between the decoder and just this filter
> (not the rendering side) it works fine.
> 
> So I guess I'm stuck with the mutex for the time being.

At an earlier stage I had considered the idea of adding those video
processors as ffmpeg hardware filters, but due to the vast amount of
different use cases, platforms and hw accelerations we support, 
I had made the decision that we do all filtering either by CPU or in the
hw context of the en-coder, but never in the hw context of the de-coder,
so I don't have any experience with DX11 video processors.

Maybe a too obvious idea: How about activating the mutex use only for 
a short time during the process of adding the video processor?

----

Regarding the subject in general, I'm now able to fill in some additional
blanks (I'm doing too many different things, and my RAM is limited ;-)
Specifically, the question: "Why isn't this information more clearly documented?"

Short answer: Because "our" use case is too unusual.

While I was searching for new or old evidence for my statement,
I found a book about DX11 talking about the use of
ID3D10Multithread::SetMultithreadProtected. It said, roughly, that
this will work, but that it locks only a small part, which would not be
sufficient because it does not guarantee ordered execution, and
therefore a D3D11 application will always need to do its own locking
on the application side.

Obviously, that wouldn't have been a good reference to use in the 
discussion :-)

Here comes the twist: All those books and documentation are about
writing 3D applications. And 3D applications are very different from
"our" kind of applications: Those applications may execute dozens
or hundreds of commands between two frames. And many of those
commands are sequences that need to be executed in order.

And that explains why:
- The MSDN docs suggest application-side locking
- The SetMultithreadProtected method initially didn't have a DX11 version
- The later added ID3D11Multithread interface doesn't only have 
  the SetMultithreadProtected method, but also the Enter and 
  Leave methods
- The DX11 book said that SetMultithreadProtected would not be
  sufficient
- The documentation of SetMultithreadProtected in the new 
  interface states that it may create unwanted overhead

In our case, we're doing nothing more than one or two calls
per frame (instead of hundreds). That's why we don't need to care 
about any of the bullet points above.

Unfortunately, nobody cared about us, when writing the docs.

Kind regards,
softworkz
Steve Lhomme Aug. 14, 2020, 7:12 a.m. UTC | #19
On 2020-08-13 1:01, Soft Works wrote:
> 
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> Steve Lhomme
>> Sent: Wednesday, August 12, 2020 2:05 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer
>> copies are done before submitting them
>>
>> On 2020-08-11 12:43, Steve Lhomme wrote:
>>>>> Sorry if you seem to know all the answers already, but I don't and
>>>>> so I have to investigate.
>>>>
>>>> Last year, I had literally worked this down to death. I followed
>>>> every slightest hint from countless searches, read through hundreds
>>>> of discussions, driven because I was unwilling to believe that
>>>> up-/downloading of video textures with
>>>> D3D11 can't be done equally fast as with D3D9.
>>>> (the big picture was the implementation of D3D11 support for
>>>> QuickSync where the slowdown played a much bigger role than with
>>>> D3D11VA decoders only).
>>>> Eventually I landed at some internal Nvidia presentation, some talks
>>>> with MS guys and some source code discussion deep inside a 3D game
>>>> engine (not a no-name). It really bugs me that I didn't properly note
>>>> the references, but from somewhere in between I was able to gather
>>>> solid evidence about what is legal to do and what Is not. Based on
>>>> that, followed several iterations to find the optimal way for doing
>>>> the texture transfer. As I had implemented
>>>> D3D11 support for QuickSync, this got pretty complicated because with
>>>> a full transcoding pipeline, all parts (decoder, encoder and filters)
>>>> can (and usually will) request textures. Only the latest Intel
>>>> Drivers can work with array textures everywhere (e.g. VPP), so I also
>>>> needed to add support for non-array texture allocation. The patch
>>>> you've seen is the result of weeks of intensive work (a small but
>>>> crucial part of it) - even when it may not look like that.
>>>>
>>>>
>>>>> Sorry if you seem to know all the answers already
>>>>
>>>> Obviously, I don't know all the answers, but all the answers I have
>>>> given were correct. And when I didn't have an answer I always
>>>> respectfully said that your situation might be different.
>>>> And I didn't reply by implying that you would have done your work by
>>>> trial-and-error or most likely invalid assumptions or deductions.
>>>>
>>>>
>>>> I still don't know how you are actually operating this and thus I
>>>> also cannot tell what might or might not work in your case.
>>>> All I can tell is that the procedure that I have described (1-2-3-4)
>>>> can work rock-solid for multi-threaded DX11 texture transfer when
>>>> it's done in the same way as I've shown.
>>>> And believe it or not - I would still be happy when it would be of
>>>> any use for you...
>>>
>>> Even though the discussion is heated (fitting with the weather here) I
>>> don't mind. I learned some stuff and it pushed me to dig deeper. I
>>> can't just accept your word for it. I need something solid if I'm
>>> going to remove a lock that helped me so far.
>>>
>>> So I'm currently tooling VLC to be able to bring the decoder to its
>>> knees and find out what it can and cannot do safely. So far I can
>>> still see decoding artifacts when I don't a lock, which would mean I
>>> still need the mutex, for the reasons given in the previous mail.
>>
>> A follow-up on this. Using ID3D10Multithread seems to be enough to have
>> mostly thread safe ID3D11Device/ID3D11DeviceContext/etc. Even the
>> decoding with its odd API seem to know what to do when submitted
>> different buffers.
>>
>> I did not manage to saturate the GPU but I much bigger decoding
>> speed/throughput to validate the errors I got before. Many of them were
>> due to VLC dropping data because of odd timing.
>>
>> Now I still have some threading issues. For example for deinterlacing we
>> create a ID3D11VideoProcessor to handle the deinterlacing. And we create it
>> after the decoding started (as the deinterlacing can be enabled/disabled
>> dynamically). Without the mutex in the decoder it crashes on
>> ID3D11VideoDevice::CreateVideoProcessor() and
>> ID3D11VideoContext::SubmitDecoderBuffers() as they are being called
>> simultaneously. If I add the mutex between the decoder and just this filter
>> (not the rendering side) it works fine.
>>
>> So I guess I'm stuck with the mutex for the time being.
> 
> At an earlier stage I had considered the idea of adding those video
> processors as ffmpeg hardware filters, but due to the vast amount of
> different use cases, platforms and hw accelerations we support,
> I had made the decision that we do all filtering either by CPU or in the
> hw context of the en-coder, but never in the hw context of the de-coder,
> so I don't have any experience with DX11 video processors.
> 
> Maybe a too obvious idea: How about activating the mutex use only for
> a short time during the process of adding the video processor?

This doesn't seem feasible, even with a callback system. You don't know 
when it's safe to enable/disable it.

By the way, the origin of the mutex was on Windows Phones. It's probably 
related to the fact that some phones only decode to 
DXGI_FORMAT_420_OPAQUE, which cannot be used for rendering. The only way 
to use the decoded surface is to convert it (to NV12) via a 
VideoProcessor. So in this case it was always used, even for basic decoding.
diff mbox series

Patch

diff --git a/libavcodec/dxva2.c b/libavcodec/dxva2.c
index 32416112bf..1a0e5b69b2 100644
--- a/libavcodec/dxva2.c
+++ b/libavcodec/dxva2.c
@@ -692,6 +692,12 @@  int ff_dxva2_decode_init(AVCodecContext *avctx)
         d3d11_ctx->surface       = sctx->d3d11_views;
         d3d11_ctx->workaround    = sctx->workaround;
         d3d11_ctx->context_mutex = INVALID_HANDLE_VALUE;
+
+        D3D11_QUERY_DESC query = { 0 };
+        query.Query = D3D11_QUERY_EVENT;
+        if (FAILED(ID3D11Device_CreateQuery(device_hwctx->device, &query,
+                                            (ID3D11Query**)&sctx->wait_copies)))
+            sctx->wait_copies = NULL;
     }
 #endif
 
@@ -729,6 +735,8 @@  int ff_dxva2_decode_uninit(AVCodecContext *avctx)
     av_buffer_unref(&sctx->decoder_ref);
 
 #if CONFIG_D3D11VA
+    if (sctx->wait_copies)
+        ID3D11Asynchronous_Release(sctx->wait_copies);
     for (i = 0; i < sctx->nb_d3d11_views; i++) {
         if (sctx->d3d11_views[i])
             ID3D11VideoDecoderOutputView_Release(sctx->d3d11_views[i]);
@@ -932,6 +940,12 @@  int ff_dxva2_common_end_frame(AVCodecContext *avctx, AVFrame *frame,
 
 #if CONFIG_D3D11VA
     if (ff_dxva2_is_d3d11(avctx)) {
+        if (sctx->wait_copies) {
+            AVHWFramesContext *frames_ctx = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
+            AVD3D11VADeviceContext *device_hwctx = frames_ctx->device_ctx->hwctx;
+            ID3D11DeviceContext_Begin(device_hwctx->device_context, sctx->wait_copies);
+        }
+
         buffer = &buffer11[buffer_count];
         type = D3D11_VIDEO_DECODER_BUFFER_PICTURE_PARAMETERS;
     }
@@ -1005,9 +1019,28 @@  int ff_dxva2_common_end_frame(AVCodecContext *avctx, AVFrame *frame,
 
 #if CONFIG_D3D11VA
     if (ff_dxva2_is_d3d11(avctx))
+    {
+        int maxWait = 10;
+        /* wait until all the buffer release is done copying data to the GPU
+         * before doing the submit command */
+        if (sctx->wait_copies) {
+            AVHWFramesContext *frames_ctx = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
+            AVD3D11VADeviceContext *device_hwctx = frames_ctx->device_ctx->hwctx;
+            ID3D11DeviceContext_End(device_hwctx->device_context, sctx->wait_copies);
+
+            while (maxWait-- && S_FALSE ==
+                   ID3D11DeviceContext_GetData(device_hwctx->device_context,
+                                               sctx->wait_copies, NULL, 0, 0)) {
+                ff_dxva2_unlock(avctx);
+                SleepEx(2, TRUE);
+                ff_dxva2_lock(avctx);
+            }
+        }
+
         hr = ID3D11VideoContext_SubmitDecoderBuffers(D3D11VA_CONTEXT(ctx)->video_context,
                                                      D3D11VA_CONTEXT(ctx)->decoder,
                                                      buffer_count, buffer11);
+    }
 #endif
 #if CONFIG_DXVA2
     if (avctx->pix_fmt == AV_PIX_FMT_DXVA2_VLD) {
diff --git a/libavcodec/dxva2_internal.h b/libavcodec/dxva2_internal.h
index b822af59cd..c44e8e09b0 100644
--- a/libavcodec/dxva2_internal.h
+++ b/libavcodec/dxva2_internal.h
@@ -81,6 +81,8 @@  typedef struct FFDXVASharedContext {
     ID3D11VideoDecoderOutputView  **d3d11_views;
     int                          nb_d3d11_views;
     ID3D11Texture2D                *d3d11_texture;
+
+    ID3D11Asynchronous             *wait_copies;
 #endif
 
 #if CONFIG_DXVA2