Message ID | 20200805100712.16834-1-robux4@ycbcr.xyz
---|---
State | New
Series | [FFmpeg-devel,v3,1/2] dxva: wait until D3D11 buffer copies are done before submitting them
Context | Check | Description
---|---|---
andriy/default | pending |
andriy/make | success | Make finished
andriy/make_fate | success | Make fate finished
I experimented a bit more with this. Here are the 3 scenarios, in order of fewest late frames:

- GetData waiting for 1/2s and releasing the lock
- No use of GetData (current code)
- GetData waiting for 1/2s and keeping the lock

The last option has horrible performance issues and should not be used.

The first option gives about 50% fewer late frames compared to the current code. *But* it requires unlocking the Video Context. There are 2 problems with this:

- the same ID3D11Asynchronous is used to wait on multiple concurrent threads. This can confuse D3D11, which emits a warning in the logs.
- another thread might Get/Release some buffers and submit them before this thread is finished processing. That can result in distortions, for example if the second thread/frame depends on the first thread/frame, which is not submitted yet.

The former issue can be solved by using an ID3D11Asynchronous per thread. That requires some TLS storage, which FFmpeg doesn't seem to support yet. With this I get virtually no late frames.

The latter issue only occurs if the wait is too long. For example, waiting by increments of 10ms is too long in my test. Using increments of 1ms or 2ms works fine in the most stressing sample I have (Sony Camping HDR HEVC high bitrate). But this seems hackish. There's still potentially a quick frame (alt frame in VPx/AV1 for example) that might get through to the decoder too early. (I suppose that's the source of the distortions I see.)

It's also possible to change the order of the buffer sending, by starting with the bigger one (D3D11_VIDEO_DECODER_BUFFER_BITSTREAM). But it seems to have little influence, regardless of whether we wait for buffer submission or not.

The results are consistent between integrated GPU and dedicated GPU.

On 2020-08-05 12:07, Steve Lhomme wrote:
> When used aggressively, calling SubmitDecoderBuffers() just after ReleaseDecoderBuffer() may have the buffers not used properly and creates decoding artifacts.
> It's likely due to the time to copy the submitted buffer in CPU mapped memory
> to GPU memory. SubmitDecoderBuffers() doesn't appear to wait for the state
> of the buffer submitted to become "ready".
>
> For now it's not supported in the legacy API using AVD3D11VAContext, we need to
> add a ID3D11DeviceContext in there as it cannot be derived from the other
> interfaces we provide (ID3D11VideoContext is not a kind of ID3D11DeviceContext).
> ---
>  libavcodec/dxva2.c          | 33 +++++++++++++++++++++++++++++++++
>  libavcodec/dxva2_internal.h |  2 ++
>  2 files changed, 35 insertions(+)
>
> diff --git a/libavcodec/dxva2.c b/libavcodec/dxva2.c
> index 32416112bf..1a0e5b69b2 100644
> --- a/libavcodec/dxva2.c
> +++ b/libavcodec/dxva2.c
> @@ -692,6 +692,12 @@ int ff_dxva2_decode_init(AVCodecContext *avctx)
>          d3d11_ctx->surface = sctx->d3d11_views;
>          d3d11_ctx->workaround = sctx->workaround;
>          d3d11_ctx->context_mutex = INVALID_HANDLE_VALUE;
> +
> +        D3D11_QUERY_DESC query = { 0 };
> +        query.Query = D3D11_QUERY_EVENT;
> +        if (FAILED(ID3D11Device_CreateQuery(device_hwctx->device, &query,
> +                                            (ID3D11Query**)&sctx->wait_copies)))
> +            sctx->wait_copies = NULL;
>      }
>  #endif
>
> @@ -729,6 +735,8 @@ int ff_dxva2_decode_uninit(AVCodecContext *avctx)
>      av_buffer_unref(&sctx->decoder_ref);
>
>  #if CONFIG_D3D11VA
> +    if (sctx->wait_copies)
> +        ID3D11Asynchronous_Release(sctx->wait_copies);
>      for (i = 0; i < sctx->nb_d3d11_views; i++) {
>          if (sctx->d3d11_views[i])
>              ID3D11VideoDecoderOutputView_Release(sctx->d3d11_views[i]);
> @@ -932,6 +940,12 @@ int ff_dxva2_common_end_frame(AVCodecContext *avctx, AVFrame *frame,
>
>  #if CONFIG_D3D11VA
>      if (ff_dxva2_is_d3d11(avctx)) {
> +        if (sctx->wait_copies) {
> +            AVHWFramesContext *frames_ctx = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
> +            AVD3D11VADeviceContext *device_hwctx = frames_ctx->device_ctx->hwctx;
> +            ID3D11DeviceContext_Begin(device_hwctx->device_context, sctx->wait_copies);
> +        }
> +
>          buffer = &buffer11[buffer_count];
>          type = D3D11_VIDEO_DECODER_BUFFER_PICTURE_PARAMETERS;
>      }
> @@ -1005,9 +1019,28 @@ int ff_dxva2_common_end_frame(AVCodecContext *avctx, AVFrame *frame,
>
>  #if CONFIG_D3D11VA
>      if (ff_dxva2_is_d3d11(avctx))
> +    {
> +        int maxWait = 10;
> +        /* wait until all the buffer release is done copying data to the GPU
> +         * before doing the submit command */
> +        if (sctx->wait_copies) {
> +            AVHWFramesContext *frames_ctx = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
> +            AVD3D11VADeviceContext *device_hwctx = frames_ctx->device_ctx->hwctx;
> +            ID3D11DeviceContext_End(device_hwctx->device_context, sctx->wait_copies);
> +
> +            while (maxWait-- && S_FALSE ==
> +                   ID3D11DeviceContext_GetData(device_hwctx->device_context,
> +                                               sctx->wait_copies, NULL, 0, 0)) {
> +                ff_dxva2_unlock(avctx);
> +                SleepEx(2, TRUE);
> +                ff_dxva2_lock(avctx);
> +            }
> +        }
> +
>          hr = ID3D11VideoContext_SubmitDecoderBuffers(D3D11VA_CONTEXT(ctx)->video_context,
>                                                       D3D11VA_CONTEXT(ctx)->decoder,
>                                                       buffer_count, buffer11);
> +    }
>  #endif
>  #if CONFIG_DXVA2
>      if (avctx->pix_fmt == AV_PIX_FMT_DXVA2_VLD) {
> diff --git a/libavcodec/dxva2_internal.h b/libavcodec/dxva2_internal.h
> index b822af59cd..c44e8e09b0 100644
> --- a/libavcodec/dxva2_internal.h
> +++ b/libavcodec/dxva2_internal.h
> @@ -81,6 +81,8 @@ typedef struct FFDXVASharedContext {
>      ID3D11VideoDecoderOutputView **d3d11_views;
>      int nb_d3d11_views;
>      ID3D11Texture2D *d3d11_texture;
> +
> +    ID3D11Asynchronous *wait_copies;
>  #endif
>
>  #if CONFIG_DXVA2
> --
> 2.26.2
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
> -----Original Message----- > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > Steve Lhomme > Sent: Friday, August 7, 2020 3:05 PM > To: ffmpeg-devel@ffmpeg.org > Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer > copies are done before submitting them > > I experimented a bit more with this. Here are the 3 scenarii in other of least > frame late: > > - GetData waiting for 1/2s and releasing the lock > - No use of GetData (current code) > - GetData waiting for 1/2s and keeping the lock > > The last option has horrible perfomance issues and should not be used. > > The first option gives about 50% less late frames compared to the current > code. *But* it requires to unlock the Video Context. There are 2 problems > with this: > > - the same ID3D11Asynchronous is used to wait on multiple concurrent > thread. This can confuse D3D11 which emits a warning in the logs. > - another thread might Get/Release some buffers and submit them before > this thread is finished processing. That can result in distortions, for example if > the second thread/frame depends on the first thread/frame which is not > submitted yet. > > The former issue can be solved by using a ID3D11Asynchronous per thread. > That requires some TLS storage which FFmpeg doesn't seem to support yet. > With this I get virtually no frame late. > > The latter issue only occur if the wait is too long. For example waiting by > increments of 10ms is too long in my test. Using increments of 1ms or 2ms > works fine in the most stressing sample I have (Sony Camping HDR HEVC high > bitrate). But this seems hackish. There's still potentially a quick frame (alt > frame in VPx/AV1 for example) that might get through to the decoder too > early. (I suppose that's the source of the distortions I > see) > > It's also possible to change the order of the buffer sending, by starting with > the bigger one (D3D11_VIDEO_DECODER_BUFFER_BITSTREAM). 
> But it seems to have little influence, regardless if we wait for buffer submission or not.
>
> The results are consistent between integrated GPU and dedicated GPU.

Hi Steven,

A while ago I had extended the D3D11VA implementation to support single (non-array) textures for interoperability with Intel QSV+DX11.

I noticed a few bottlenecks making D3D11VA significantly slower than DXVA2. The solution was to use ID3D10Multithread_SetMultithreadProtected and remove all the locks which are currently applied.

Hence, I don't think that your patch is the best possible way.

Regards,
softworkz
> -----Original Message----- > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > Soft Works > Sent: Friday, August 7, 2020 11:59 PM > To: FFmpeg development discussions and patches <ffmpeg- > devel@ffmpeg.org> > Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer > copies are done before submitting them > > > -----Original Message----- > > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > > Steve Lhomme > > Sent: Friday, August 7, 2020 3:05 PM > > To: ffmpeg-devel@ffmpeg.org > > Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 > > buffer copies are done before submitting them > > > > I experimented a bit more with this. Here are the 3 scenarii in other > > of least frame late: > > > > - GetData waiting for 1/2s and releasing the lock > > - No use of GetData (current code) > > - GetData waiting for 1/2s and keeping the lock > > > > The last option has horrible perfomance issues and should not be used. > > > > The first option gives about 50% less late frames compared to the > > current code. *But* it requires to unlock the Video Context. There are > > 2 problems with this: > > > > - the same ID3D11Asynchronous is used to wait on multiple concurrent > > thread. This can confuse D3D11 which emits a warning in the logs. > > - another thread might Get/Release some buffers and submit them before > > this thread is finished processing. That can result in distortions, > > for example if the second thread/frame depends on the first > > thread/frame which is not submitted yet. > > > > The former issue can be solved by using a ID3D11Asynchronous per thread. > > That requires some TLS storage which FFmpeg doesn't seem to support > yet. > > With this I get virtually no frame late. > > > > The latter issue only occur if the wait is too long. For example > > waiting by increments of 10ms is too long in my test. 
> > Using increments of 1ms or 2ms works fine in the most stressing sample I have (Sony Camping HDR HEVC high bitrate). But this seems hackish. There's still potentially a quick frame (alt frame in VPx/AV1 for example) that might get through to the decoder too early. (I suppose that's the source of the distortions I see)
> >
> > It's also possible to change the order of the buffer sending, by starting with the bigger one (D3D11_VIDEO_DECODER_BUFFER_BITSTREAM). But it seems to have little influence, regardless if we wait for buffer submission or not.
> >
> > The results are consistent between integrated GPU and dedicated GPU.
>
> Hi Steven,
>
> A while ago I had extended D3D11VA implementation to support single (non-array textures) for interoperability with Intel QSV+DX11.
>
> I noticed a few bottlenecks making D3D11VA significantly slower than DXVA2.
>
> The solution was to use ID3D10Multithread_SetMultithreadProtected and remove all the locks which are currently applied.
>
> Hence, I don't think that your patch is the best possible way.
>
> Regards,
> softworkz

I almost forgot that I had published that change already:
https://github.com/softworkz/ffmpeg_dx11/commit/c09cc37ce7f513717493e060df740aa0e7374257
On 2020-08-07 23:59, Soft Works wrote: >> -----Original Message----- >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of >> Steve Lhomme >> Sent: Friday, August 7, 2020 3:05 PM >> To: ffmpeg-devel@ffmpeg.org >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer >> copies are done before submitting them >> >> I experimented a bit more with this. Here are the 3 scenarii in other of least >> frame late: >> >> - GetData waiting for 1/2s and releasing the lock >> - No use of GetData (current code) >> - GetData waiting for 1/2s and keeping the lock >> >> The last option has horrible perfomance issues and should not be used. >> >> The first option gives about 50% less late frames compared to the current >> code. *But* it requires to unlock the Video Context. There are 2 problems >> with this: >> >> - the same ID3D11Asynchronous is used to wait on multiple concurrent >> thread. This can confuse D3D11 which emits a warning in the logs. >> - another thread might Get/Release some buffers and submit them before >> this thread is finished processing. That can result in distortions, for example if >> the second thread/frame depends on the first thread/frame which is not >> submitted yet. >> >> The former issue can be solved by using a ID3D11Asynchronous per thread. >> That requires some TLS storage which FFmpeg doesn't seem to support yet. >> With this I get virtually no frame late. >> >> The latter issue only occur if the wait is too long. For example waiting by >> increments of 10ms is too long in my test. Using increments of 1ms or 2ms >> works fine in the most stressing sample I have (Sony Camping HDR HEVC high >> bitrate). But this seems hackish. There's still potentially a quick frame (alt >> frame in VPx/AV1 for example) that might get through to the decoder too >> early. 
(I suppose that's the source of the distortions I see)
>>
>> It's also possible to change the order of the buffer sending, by starting with the bigger one (D3D11_VIDEO_DECODER_BUFFER_BITSTREAM). But it seems to have little influence, regardless if we wait for buffer submission or not.
>>
>> The results are consistent between integrated GPU and dedicated GPU.
>
> Hi Steven,

Hi,

> A while ago I had extended D3D11VA implementation to support single (non-array textures) for interoperability with Intel QSV+DX11.

Looking at your code, it seems you are copying from an array texture to a single-slice texture to achieve this, with double the amount of RAM. It may be a design issue with the new D3D11 API, which forces you to do that, but I'm not using that API. I'm using the old API. In my case I directly render the texture slices coming out of the decoder with no copying (and no extra memory allocation). It happens in a different thread than the decoder thread(s).

In VLC we also support direct D3D11 to QSV encoding. It does require a copy to "shadow" textures to feed QSV. I never managed to make it work without a copy.

> I noticed a few bottlenecks making D3D11VA significantly slower than DXVA2.
>
> The solution was to use ID3D10Multithread_SetMultithreadProtected and remove all the locks which are currently applied.

I am also using that.

> Hence, I don't think that your patch is the best possible way.

Removing locks and saying "it works for me" is not a correct solution either. At the very least the locks are needed inside libavcodec to avoid setting DXVA buffers concurrently from different threads. It would most likely result in very bad distortions if not crashes. Maybe you're only using 1 decoding thread with DXVA (which a lot of people do) so you don't have this issue, but this is not my case.

Also, ID3D10Multithread::SetMultithreadProtected means that the resources can be accessed from multiple threads.
It doesn't mean that calls to ID3D11DeviceContext are safe from multithreading. And my experience shows that they are not. In fact, if you have the Windows SDK installed and you make concurrent accesses, you'll get a big warning in your debug logs that you are doing something fishy. On Windows Phone it would even crash. This is how I ended up adding the mutex to the old API (e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).

The documentation for ID3D11DeviceContext is very clear about that [1]: "Because each ID3D11DeviceContext is single threaded, only one thread can call a ID3D11DeviceContext at a time. If multiple threads must access a single ID3D11DeviceContext, they must use some synchronization mechanism, such as critical sections, to synchronize access to that ID3D11DeviceContext."

The DXVA documentation is a lot less clear on the subject. But given that ID3D11VideoContext derives from a ID3D11DeviceContext (but is not a ID3D11DeviceContext), it seems correct to assume it has the same restrictions.

[1] https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Steve Lhomme
> Sent: Saturday, August 8, 2020 7:10 AM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer copies are done before submitting them

[...]

> >
> > Hi Steven,
>
> Hi,
>
> > A while ago I had extended D3D11VA implementation to support single (non-array textures) for interoperability with Intel QSV+DX11.
>
> Looking at your code, it seems you are copying from an array texture to a single slice texture to achieve this. With double the amount of RAM. It may be a design issue with the new D3D11 API, which forces you to do

With D3D11, it's mandatory to use a staging texture, which is not only done in my code but also in the original implementation (hwcontext_d3d11va.c):
https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_d3d11va.c

> that, but I'm not using that API. I'm using the old API.

I'm not sure whether I understand what you mean by this. To my knowledge there are two DX hw context implementations in ffmpeg:

- DXVA2
- D3D11VA

I'm not aware of a variant like "D3D11 with old API". Could you please elaborate?

> > Hence, I don't think that your patch is the best possible way.
>
> Removing locks and saying "it works for me" is neither a correct solution.

How did you come to the conclusion that I might be working like this?

> the very least the locks are needed inside libavcodec to avoid setting DXVA buffers concurrently from different threads. It will most likely result in very bad distortions if not crashes. Maybe you're only using 1 decoding thread with DXVA (which a lot of people do) so you don't have this issue, but this is not my case.

I see no point in employing multiple threads for hw accelerated decoding. To be honest I never looked into or tried whether ffmpeg even supports multiple threads with dxva2 or d3d11va hw acceleration.
> Also ID3D10Multithread::SetMultithreadProtected means that the resources can be accessed from multiple threads. It doesn't mean that calls to ID3D11DeviceContext are safe from multithreading. And my experience shows that it is not. In fact if you have the Windows SDK installed and you have concurrent accesses, you'll get a big warning in your debug logs that you are doing something fishy. On WindowsPhone it would even crash. This is how I ended up adding the mutex to the old API (e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).
>
> The documentation for ID3D11DeviceContext is very clear about that [1]: "Because each ID3D11DeviceContext is single threaded, only one thread can call a ID3D11DeviceContext at a time. If multiple threads must access a single ID3D11DeviceContext, they must use some synchronization mechanism, such as critical sections, to synchronize access to that ID3D11DeviceContext."

Yes, but this doesn't apply to accessing staging textures IIRC. In fact, I had researched this in depth, but I can't tell much more without looking into it again.

The patch I referenced is working in production on thousands of installations and has been tested with many different hardware and driver versions from Nvidia, Intel and AMD.

Kind regards,
softworkz
On 2020-08-08 8:24, Soft Works wrote: > > >> -----Original Message----- >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of >> Steve Lhomme >> Sent: Saturday, August 8, 2020 7:10 AM >> To: ffmpeg-devel@ffmpeg.org >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer >> copies are done before submitting them > > [...] > >>> >>> Hi Steven, >> >> Hi, >> >>> A while ago I had extended D3D11VA implementation to support single >>> (non-array textures) for interoperability with Intel QSV+DX11. >> >> Looking at your code, it seems you are copying from an array texture to a >> single slice texture to achieve this. With double the amount of RAM. >> It may be a design issue with the new D3D11 API, which forces you to do > > With D3D11, it's mandatory to use a staging texture, which is not only done > in my code but also in the original implementation (hwcontext_d3d11va.c) > https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_d3d11va.c > >> that, but I'm not using that API. I'm using the old API. > > I'm not sure whether I understand what you mean by this. To my knowledge > there are two DX hw context implementations in ffmpeg: > > - DXVA2 > - D3D11VA > > I'm not aware of a variant like "D3D11 with old API". Could you please elaborate? There is AV_PIX_FMT_D3D11VA_VLD (old) and AV_PIX_FMT_D3D11 (new). >> >>> Hence, I don't think that your patch is the best possible way . >> >> Removing locks and saying "it works for me" is neither a correct solution. > > How did you come to the conclusion that I might be working like this? The commented out "hwctx->lock" lines in your code. >> the very least the locks are needed inside libavcodec to avoid setting DXVA >> buffers concurrently from different threads. It will most likely result in very >> bad distortions if not crashes. Maybe you're only using 1 decoding thread >> with DXVA (which a lot of people do) so you don't have this issue, but this is >> not my case. 
>
> I see no point in employing multiple threads for hw accelerated decoding. To be honest I never looked into or tried whether ffmpeg even supports multiple threads with dxva2 or d3d11va hw acceleration.

Maybe you're in an ideal situation where all the files you play through libavcodec are hardware accelerated (so also with matching hardware). In this case you don't need to care about the case where it falls back to software decoding. Using a single thread in that case would have terrible performance.

Even then, there's still a chance using multiple threads might improve performance. All the code that is run to prepare the buffers that are fed into the hardware decoder can be run in parallel for multiple frames. If you have an insanely fast hardware decoder, that would be the bottleneck. In a transcoding scenario that could have an impact.

>> Also ID3D10Multithread::SetMultithreadProtected means that the resources can be accessed from multiple threads. It doesn't mean that calls to ID3D11DeviceContext are safe from multithreading. And my experience shows that it is not. In fact if you have the Windows SDK installed and you have concurrent accesses, you'll get a big warning in your debug logs that you are doing something fishy. On WindowsPhone it would even crash. This is how I ended up adding the mutex to the old API (e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86).
>>
>> The documentation for ID3D11DeviceContext is very clear about that [1]: "Because each ID3D11DeviceContext is single threaded, only one thread can call a ID3D11DeviceContext at a time. If multiple threads must access a single ID3D11DeviceContext, they must use some synchronization mechanism, such as critical sections, to synchronize access to that ID3D11DeviceContext."
>
> Yes, but this doesn't apply to accessing staging textures IIRC.

It does. To copy to a staging texture you need to use ID3D11DeviceContext::CopySubresourceRegion().

You probably don't have any synchronization issues in your pipeline because it seems you copy from GPU to CPU. In that case it forces the ID3D11DeviceContext::GetData() internally to make sure all the commands to produce your source texture on that video context are finished processing. You may not see it, but there's a wait happening there. In my case there's nothing happening between the decoder and the rendering of the texture.

> In fact, I had researched this in-depth, but I can't tell much more without looking into it again.
>
> The patch I referenced is working in production on thousands of installations and tested with many different hardware and driver versions from Nvidia, Intel and AMD.

And I added the lock before, because, as the specs say, it's necessary. That solved some issues on the hundreds of millions of VLC instances running on Windows, on all the hardware you can think of. Decoding 8K 60 fps HEVC was also a good stress test of the code. The ID3D11DeviceContext::GetData() on the rendering side ensured that we had the frames displayed at the right time and not whenever the pipeline is done processing the device context commands. Now I realize the same thing should be done on the decoder side, with the improvements given earlier in this thread.
> -----Original Message----- > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > Steve Lhomme > Sent: Monday, August 10, 2020 8:19 AM > To: ffmpeg-devel@ffmpeg.org > Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer > copies are done before submitting them > > On 2020-08-08 8:24, Soft Works wrote: > > > > > >> -----Original Message----- > >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > >> Steve Lhomme > >> Sent: Saturday, August 8, 2020 7:10 AM > >> To: ffmpeg-devel@ffmpeg.org > >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 > >> buffer copies are done before submitting them > > [...] > >> the very least the locks are needed inside libavcodec to avoid > >> setting DXVA buffers concurrently from different threads. It will > >> most likely result in very bad distortions if not crashes. Maybe > >> you're only using 1 decoding thread with DXVA (which a lot of people > >> do) so you don't have this issue, but this is not my case. > > > > I see no point in employing multiple threads for hw accelerated decoding. > > To be honest I never looked into or tried whether ffmpeg even supports > > multiple threads with dva2 or d3d11va hw acceleration. > > Maybe you're in an ideal situation where all the files you play through > libavcodec are hardware accelerated (so also with matching hardware). In > this case you don't need to care about the case where it will fallback to > software decoding. Using a single thread in that case would have terrible > performance. I think we need to clarify the use cases we're talking about. There is no "my case". All I'm talking about is using D3D11VA hardware acceleration using the ffmpeg.exe CLI. You seem to have a rather special case where you are using parts of the ffmpeg DXVA/D3D11VA code from another application (VLC?) Did I understand that correctly? 
> >> The documentation for ID3D11DeviceContext is very clear about that [1]: "Because each ID3D11DeviceContext is single threaded, only one thread can call a ID3D11DeviceContext at a time. If multiple threads must access a single ID3D11DeviceContext, they must use some synchronization mechanism, such as critical sections, to synchronize access to that ID3D11DeviceContext."
> >
> > Yes, but this doesn't apply to accessing staging textures IIRC.
>
> It does. To copy to a staging texture you need to use ID3D11DeviceContext::CopySubresourceRegion().

Correct. And with DX11, using SetMultithreadProtected, it is legal to call this from multiple threads without synchronization.

> You probably don't have any synchronization issues in your pipeline because it seems you copy from GPU to CPU. In that case it forces the ID3D11DeviceContext::GetData() internally to make sure all the commands to produce your source texture on that video context are finished processing. You may not see it, but there's a wait happening there.

I've looked back into my work history and gladly most memory came back.

Yes, it's correct, there's a "wait happening". From your wording I would assume that you've already realized that I was right in stating that there's no need for external locking:

- Not for uploading
- Not for downloading (at least not for the regular ffmpeg use case)

There is still some locking applied: internally inside the DX11 runtime (because we are using SetMultithreadProtected). And there's also the "wait happening".

Let's go through an example: downloading of a texture.

1. Context_CopySubresourceRegion: Copy GPU texture to staging texture

   CopySubresourceRegion is asynchronous anyway. It just puts the copy request into the DX11 processing queue. Using SetMultithreadProtected avoids any race conditions, but this call always returns immediately.

2. Context_Map: Make the staging texture accessible for the CPU

   When called without MapFlags, this call blocks until the texture is mapped (and we can be sure that CopySubresourceRegion has executed by then).

   => This is the 'wait' you've been talking about.

3. av_image_copy: Copy the image from the staging texture

   Takes its time for copying, obviously.

4. Context_Unmap: Release the texture mapping

   Returns immediately.

-----------------

We've seen that there is no locking required with regards to DX11, but there's still one thing left: the staging texture. To resolve this I'm using multiple staging textures (it's system memory, not GPU memory).

When we look at the sequence 1 - 2 - 3 - 4, it's obvious that it can run much faster when just the individual steps are synchronized (by DX11) than when we would put one big lock around 1-2-3-4 from our side.

I've been struggling a long time with this, because DXVA2 was often much faster than D3D11VA and this kind of parallelism was finally the way to get it working equally fast.

-----------------

It is not really obvious from the documentation that it is legal to use CopySubresourceRegion, Map and Unmap in (pseudo-)parallel, even on multiple indexes of the same array texture. IIRC I got one hint at this from an internal (yet public) Nvidia presentation about DX11 and another one from the source code of a game engine, but I haven't saved those links.

-----------------

@Steven - I don't know anything about your specific way of using the ffmpeg code. Perhaps the above information is useful for you in some way, but maybe those locks are unavoidable in your case. My only concern is that your changes do not slow down normal ffmpeg operation - like the locks you had added earlier. Maybe those could be put behind some condition?

Kind regards,
softworkz
On 2020-08-10 12:04, Soft Works wrote: >>>> the very least the locks are needed inside libavcodec to avoid >>>> setting DXVA buffers concurrently from different threads. It will >>>> most likely result in very bad distortions if not crashes. Maybe >>>> you're only using 1 decoding thread with DXVA (which a lot of people >>>> do) so you don't have this issue, but this is not my case. >>> >>> I see no point in employing multiple threads for hw accelerated decoding. >>> To be honest I never looked into or tried whether ffmpeg even supports >>> multiple threads with dva2 or d3d11va hw acceleration. >> >> Maybe you're in an ideal situation where all the files you play through >> libavcodec are hardware accelerated (so also with matching hardware). In >> this case you don't need to care about the case where it will fallback to >> software decoding. Using a single thread in that case would have terrible >> performance. > > I think we need to clarify the use cases we're talking about. > > There is no "my case". All I'm talking about is using D3D11VA hardware > acceleration using the ffmpeg.exe CLI. > > You seem to have a rather special case where you are using parts of > the ffmpeg DXVA/D3D11VA code from another application (VLC?) I don't think there is anything special about using libavcodec in another application. That's where the code we're discussing is, not the ffmpeg CLI or ffplay. The API has to be designed to work in all host apps, not just these simpler use cases. > Did I understand that correctly? > >>>> The documentation for ID3D11DeviceContext is very clear about that [2]: >>>> "Because each ID3D11DeviceContext is single threaded, only one thread >>>> can call a ID3D11DeviceContext at a time. If multiple threads must >>>> access a single ID3D11DeviceContext, they must use some >>>> synchronization mechanism, such as critical sections, to synchronize >> access to that ID3D11DeviceContext." >>> >>> Yes, but this doesn't apply to accessing staging textures IIRC. 
>> >> It does. To copy to a staging texture you need to use >> ID3D11DeviceContext::CopySubresourceRegion(). > > Correct. And with DX11 and using SetMultithreadProtected it is legal > to call this from multiple threads without synchronization. No. I already explained it and pointed to the Microsoft documentation [1]. SetMultithreadProtected relates to ID3D11Device. ID3D11DeviceContext needs to be managed as non-thread safe resource. If you want, you can even create one ID3D11DeviceContext per thread [2]. I'd be curious to see the effect on multithreaded decoding. Also it seems SetMultithreadProtected() is not even needed by default. It enables the "thread-safe layer" [3]. But in d3d11 that's the default behavior. See [4] "Use this flag if your application will only call methods of Direct3D 11 interfaces from a single thread. By default, the ID3D11Device object is thread-safe." SetMultithreadProtected() only made sense for D3D10: "Direct3D 11 has been designed from the ground up to support multithreading. Direct3D 10 implements limited support for multithreading using the thread-safe layer." >> You probably don't have any synchronization issues in your pipeline because >> it seems you copy from GPU to CPU. In that case it forces the >> ID3D11DeviceContext::GetData() internally to make sure all the commands >> to produce your source texture on that video context are finished >> processing. You may not see it, but there's a wait happening there. > > I've looked back into my work history and gladly most memory > came back. > > Yes, it's correct, there's a "wait happening". From your wording I > would assume that you've already realized that I was right in stating > that there's no need for an external locking: > > - Not for uploading > - Not for downloading (at least not for the regular ffmpeg use case) > > There is still some locking applied: Internally inside the DX11 runtime > (because we are using SetMultithreadProtected). And there's also > the "wait happening". 
As the doc says, you have to use some synchronization. It may work in your case (FFmpeg CLI I suppose). As you mentioned you only use one thread. There's less chance that it can fail. But copying memory to/from CPU/GPU is probably the slowest part of the whole decoding (hence we don't do any in VLC in normal playback). So if you have one decoding thread doing that copy and another thread reading on the same ID3D11DeviceContext you're likely going to run into race-condition issues. I don't know what FFmpeg CLI does, so I can't tell. > Let's go through an example: Downloading of a texture > > 1. Context_CopySubresourceRegion: Copy GPU texture to staging texture > > CopySubresourceRegion is asynchronous anyway. It just puts the copy > request into the DX11 processing queue. Using SetMultithreadProtected > avoids any race conditions, but this call always returns immediately. > > 2. Context_Map: Make the staging texture accessible for the CPU > > When called without MapFlags, this call blocks until the texture is > mapped (and we can be sure that CopySubresourceRegion is executed > by then). > > => This is the 'wait' you've been talking about Yes. > 3. av_image_copy : Copy the image from the staging texture > > Takes its time for copying obviously. > > 4. Context_Unmap: Release the texture mapping > > Returns immediately > > ----------------- > > We've seen that there is no locking required with regards to DX11, > but there's still one thing left: The staging texture. To resolve this > I'm using multiple staging textures (it's system memory, not GPU > memory). > > When we look at the sequence 1 - 2 - 3 - 4, it's obvious that > It can run much faster when just the individual steps are synchronized > (by DX11) as when we would put one big lock around 1234 from > our side. > > I've been struggling a long time with this, because DXVA2 was > often much faster than D3D11VA and this kind of parallelism > was finally the way to get it working equally fast. 
That's one very particular case where you do a copy to CPU. There is some synchronization happening during the memory mapping. But that only covers a small part of the possibilities of D3D11VA in libavcodec. And that's certainly not what I use. You can't deduce from that usage that synchronization (access to ID3D11DeviceContext) is not needed. In fact the Microsoft documentation and my experience show the exact opposite. > > ----------------- > > It is not really obvious from the documentation that it is legal > to use CopySubresourceRegion, Map and UnMap in (pseudo-) > parallel even on multiple indexes of the same ArrayTexture. > IIRC I got one hint at this from an internal (yet public) Nvidia > presentation about DX11 and another one from the source code > of a game engine, but I haven't saved those links. > > ----------------- > > @Steven My name is Steve. > I don't know anything about your specific way of > using the ffmpeg code. Perhaps, the above information is > useful for you in some way, but maybe those locks are unavoidable > in your case. > > My only concern is that your changes do not slow down normal > ffmpeg operation - like the locks you had added earlier. > Maybe those could be put into some condition? My change in e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 is optional. So much that it even requires to use the creator helper function to make sure the mutex is properly initialized to an empty value and retain backward compatibility. If you don't want to use many threads you can safely ignore this field. That being said, that's for the old API. I suppose the one you're talking about is the new API for which I have done nothing. If the mutex is always set, that's not my fault. If you want the lock to have no effect, you can set the lock/unlock callbacks of AVD3D11VADeviceContext to functions that do nothing. If you don't set them, the documentation says: "If unset on init, the hwcontext implementation will set them to use an internal mutex." 
It's certainly better than commenting out a whole bunch of code [5]. > Kind regards, > softworkz [1] https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-intro [2] https://docs.microsoft.com/en-us/windows/win32/direct3d11/overviews-direct3d-11-render-multi-thread-render [3] https://docs.microsoft.com/en-us/windows/win32/api/d3d10/nn-d3d10-id3d10multithread [4] https://docs.microsoft.com/en-us/windows/win32/api/d3d11/ne-d3d11-d3d11_create_device_flag [5] https://github.com/softworkz/ffmpeg_dx11/commit/c09cc37ce7f513717493e060df740aa0e7374257
> -----Original Message----- > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > Steve Lhomme > Sent: Monday, August 10, 2020 2:15 PM > To: ffmpeg-devel@ffmpeg.org > Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer > copies are done before submitting them > > > @Steven > > My name is Steve. I apologize for that. > My change in e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 is optional. So > much that it even requires to use the creator helper function to make sure > the mutex is properly initialized to an empty value and retain backward > compatibility. If you don't want to use many threads you can safely ignore > this field. I thought you had reviewed my code, because you said that it was you who had added the locks that I have removed. I just took a look at your commit e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 and I can say that those are not the locks that I have removed. So you are actually the wrong address for the lock removal, you've just thrown yourself in the line ;-) But the DX11 subject still needs some clarification... > >>>> The documentation for ID3D11DeviceContext is very clear about that > [2]: > >>>> "Because each ID3D11DeviceContext is single threaded, only one > >>>> thread can call a ID3D11DeviceContext at a time. If multiple > >>>> threads must access a single ID3D11DeviceContext, they must use > >>>> some synchronization mechanism, such as critical sections, to > >>>> synchronize > >> access to that ID3D11DeviceContext." > >>> > >>> Yes, but this doesn't apply to accessing staging textures IIRC. > >> > >> It does. To copy to a staging texture you need to use > >> ID3D11DeviceContext::CopySubresourceRegion(). > > > > Correct. And with DX11 and using SetMultithreadProtected it is legal > > to call this from multiple threads without synchronization. > > No. I already explained it and pointed to the Microsoft documentation [1]. > SetMultithreadProtected relates to ID3D11Device. No, that's exactly wrong. 
You need to re-read. ID3D11Device is thread-safe by default. You can read that everywhere throughout the documentation. SetMultithreadProtected initially didn't even have a D3D11 interface. The documentation also didn't tell anything about it. There was only the DX10 documentation for this. But SetMultithreadProtected in fact has the effect of protecting the methods from ID3D11DeviceContext (not device). Like I explained several times already. Luckily I've just seen that there has been a change meanwhile. The undocumented behavior has finally been made public by adding a DX11 version of SetMultithreadProtected (which does the same as what happens when you use the DX10 version with DX11) https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn-d3d11_4-id3d11multithread Now: Read the docs! PS: I need to say that your way of discrediting my statements was not very respectful. Kind regards, softworkz
On 2020-08-11 3:52, Soft Works wrote: > >> -----Original Message----- >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of >> Steve Lhomme >> Sent: Monday, August 10, 2020 2:15 PM >> To: ffmpeg-devel@ffmpeg.org >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer >> copies are done before submitting them >> > >>> @Steven >> >> My name is Steve. > > I apologize for that. > >> My change in e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 is optional. So >> much that it even requires to use the creator helper function to make sure >> the mutex is properly initialized to an empty value and retain backward >> compatibility. If you don't want to use many threads you can safely ignore >> this field. > > I thought you had reviewed my code, because you said that it was you > who had added the locks that I have removed. I just took a look at your commit I said "This is how I ended up adding the mutex to the old API (e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86)." I never claimed I added the locks. > e3d4784eb31b3ea4a97f2d4c698a75fab9bf3d86 and I can say that those > are not the locks that I have removed. > So you are actually the wrong address for the lock removal, you've just > thrown yourself in the line ;-) No. You misread what I said. I was explaining that my experience on multiple devices lead me to add this mutex otherwise it would not work properly or even crash. > But the DX11 subject still needs some clarification... > >>>>>> The documentation for ID3D11DeviceContext is very clear about that >> [2]: >>>>>> "Because each ID3D11DeviceContext is single threaded, only one >>>>>> thread can call a ID3D11DeviceContext at a time. If multiple >>>>>> threads must access a single ID3D11DeviceContext, they must use >>>>>> some synchronization mechanism, such as critical sections, to >>>>>> synchronize >>>> access to that ID3D11DeviceContext." >>>>> >>>>> Yes, but this doesn't apply to accessing staging textures IIRC. >>>> >>>> It does. 
To copy to a staging texture you need to use >>>> ID3D11DeviceContext::CopySubresourceRegion(). >>> >>> Correct. And with DX11 and using SetMultithreadProtected it is legal >>> to call this from multiple threads without synchronization. >> >> No. I already explained it and pointed to the Microsoft documentation [1]. >> SetMultithreadProtected relates to ID3D11Device. > > No, that's exactly wrong. You need to re-read. ID3D11Device is thread-safe > by default. You can read that everywhere throughout the documentation. I read the documentation. That's not how I understood it. ID3D11Device is safe "by default" meaning it can be turned off. I assumed SetMultithreadProtected() is the way to do that. It seems D3D11_CREATE_DEVICE_SINGLETHREADED is the way to make it not thread safe. > SetMultithreadProtected initially didn't even have a D3D11 interface. > The documentation also didn't tell anything about it. There was only > the DX10 documentation for this. > > But SetMultithreadProtected in fact has the effect of protecting > the methods from ID3D11DeviceContext (not device). > Like I explained several times already. > > > Luckily I've just seen that there has been a change meanwhile. > The undocumented behavior has finally been made public by > adding a DX11 version of SetMultithreadProtected > (which does the same as what happens when you use the DX10 > version with DX11) > > https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn-d3d11_4-id3d11multithread OK, this is very interesting. It certainly clarifies that it's ID3D11DeviceContext related. The documentation is still sketchy though. For example ::Enter() [1] says that it enters a device critical section, not a device context critical section. So if you have multiple device context everything is blocked. This kinda defeats the use of multiple threads. Also this interface is not available until Windows 10 Creator Update. So for example it cannot be used on Windows 7. 
Maybe the D3D10 version has the same effect but there's no guarantee. In fact it wasn't sufficient on mobile devices. I'll try without the mutex (old API) on Windows 7 and see the effect. Anyway this thread/patch is about waiting for data to be fully copied between the CPU and the GPU before submitting the decoding command. Not about thread/command synchronization. > Now: Read the docs! > > PS: I need to say that your way of discrediting my statements > was not very respectful. Likewise. > Kind regards, > softworkz > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". > [1] https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nf-d3d11_4-id3d11multithread-enter
> -----Original Message----- > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > Steve Lhomme > Sent: Tuesday, August 11, 2020 6:35 AM > To: ffmpeg-devel@ffmpeg.org > Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer > copies are done before submitting them [...] > > But SetMultithreadProtected in fact has the effect of protecting the > > methods from ID3D11DeviceContext (not device). > > Like I explained several times already. > > > > > > Luckily I've just seen that there has been a change meanwhile. > > The undocumented behavior has finally been made public by adding a > > DX11 version of SetMultithreadProtected (which does the same as what > > happens when you use the DX10 version with DX11) > > > > https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn- > d3d11_4- > > id3d11multithread > > OK, this is very interesting. It certainly clarifies that it's ID3D11DeviceContext > related. The documentation is still sketchy though. > For example ::Enter() [1] says that it enters a device critical section, not a > device context critical section. So if you have multiple device context > everything is blocked. This kinda defeats the use of multiple threads. > > Also this interface is not available until Windows 10 Creator Update. So for > example it cannot be used on Windows 7. Maybe the D3D10 version has the > same effect but there's no guarantee. Still playing "the doubtful" is not a sincere reaction after having been proven wrong. > I'll try without the mutex (old API) on Windows 7 and see the effect. My patch requires feature level 11_1: https://github.com/softworkz/ffmpeg_dx11/commit/c09cc37ce7f513717493e060df740aa0e7374257#diff-08dfe4cade908944fd2815bf36309d7bR591-R592 Windows 7 does not have DX11.1 (there's a 11.1 platform update for Win 7 but it doesn't provide full 11.1) For Win7 and below we're using D3D9 with DXVA2 only. Kind regards, softworkz
On 2020-08-11 8:58, Soft Works wrote: > > >> -----Original Message----- >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of >> Steve Lhomme >> Sent: Tuesday, August 11, 2020 6:35 AM >> To: ffmpeg-devel@ffmpeg.org >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer >> copies are done before submitting them > > [...] > >>> But SetMultithreadProtected in fact has the effect of protecting the >>> methods from ID3D11DeviceContext (not device). >>> Like I explained several times already. >>> >>> >>> Luckily I've just seen that there has been a change meanwhile. >>> The undocumented behavior has finally been made public by adding a >>> DX11 version of SetMultithreadProtected (which does the same as what >>> happens when you use the DX10 version with DX11) >>> >>> https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn- >> d3d11_4- >>> id3d11multithread >> >> OK, this is very interesting. It certainly clarifies that it's ID3D11DeviceContext >> related. The documentation is still sketchy though. >> For example ::Enter() [1] says that it enters a device critical section, not a >> device context critical section. So if you have multiple device context >> everything is blocked. This kinda defeats the use of multiple threads. >> >> Also this interface is not available until Windows 10 Creator Update. So for >> example it cannot be used on Windows 7. Maybe the D3D10 version has the >> same effect but there's no guarantee. > > Still playing "the doubtful" is not a sincere reaction after having been > proven wrong. Sorry if you seem to know all the answers already, but I don't and so I have to investigate. >> I'll try without the mutex (old API) on Windows 7 and see the effect. I did some more tests on Win10 before trying Win7. And removing the mutex is not an option. There are a lot of bogus frames. It makes sense, as the API is not meant to be thread-safe. 
The SubmitDecoderBuffers() has no way of knowing exactly which buffers you are providing. Each buffer is tied to an index of the slice so it could be called without a lock, as long as 2 threads don't try to use the same slice at the same time. But when you submit the buffers you only give the type. I suppose D3D11 takes the last buffer of each type that was released (which is pretty much an unmap, so should have similar internal synchronization). But if 2 threads release a buffer of the same type and submit it right after, D3D11 has no way to know which to use. Here's an example:

- thread A releases a D3D11_VIDEO_DECODER_BUFFER_BITSTREAM for slice 0
- thread B releases a D3D11_VIDEO_DECODER_BUFFER_BITSTREAM for slice 1
- thread A submits a buffer of type D3D11_VIDEO_DECODER_BUFFER_BITSTREAM
- thread B submits a buffer of type D3D11_VIDEO_DECODER_BUFFER_BITSTREAM

D3D11 doesn't seem to keep track of which thread submitted what (and in a single thread you could also release 2 buffers of the same type and submit them, I'm not sure what would happen). And so the Thread A submit is likely using the buffer from Thread B. I'm not even sure what Thread B would be using. The mutex/lock groups the buffer get/release and submit together, so that no other thread can do the same thing at the same time. D3D11 cannot do it by itself, even with multithread turned on. (D3D12 doesn't have this problem as you submit the actual buffer, not just the type [1][2]) None of that happens if you use a single thread for decoding.

> My patch requires feature level 11_1:
>
> https://github.com/softworkz/ffmpeg_dx11/commit/c09cc37ce7f513717493e060df740aa0e7374257#diff-08dfe4cade908944fd2815bf36309d7bR591-R592
>
> Windows 7 does not have DX11.1
> (there's a 11.1 platform update for Win 7 but it doesn't provide full 11.1)
> For Win7 and below we're using D3D9 with DXVA2 only.
> Kind regards,
> softworkz

[1] https://docs.microsoft.com/en-us/windows/win32/api/d3d12video/nf-d3d12video-id3d12videodecodecommandlist-decodeframe
[2] https://docs.microsoft.com/en-us/windows/win32/api/d3d12video/ns-d3d12video-d3d12_video_decode_frame_argument
> -----Original Message----- > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > Steve Lhomme > Sent: Tuesday, August 11, 2020 10:06 AM > To: ffmpeg-devel@ffmpeg.org > Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer > copies are done before submitting them > > On 2020-08-11 8:58, Soft Works wrote: > > > > > >> -----Original Message----- > >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > >> Steve Lhomme > >> Sent: Tuesday, August 11, 2020 6:35 AM > >> To: ffmpeg-devel@ffmpeg.org > >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 > >> buffer copies are done before submitting them > > > > [...] > > > >>> But SetMultithreadProtected in fact has the effect of protecting the > >>> methods from ID3D11DeviceContext (not device). > >>> Like I explained several times already. > >>> > >>> > >>> Luckily I've just seen that there has been a change meanwhile. > >>> The undocumented behavior has finally been made public by adding a > >>> DX11 version of SetMultithreadProtected (which does the same as what > >>> happens when you use the DX10 version with DX11) > >>> > >>> https://docs.microsoft.com/en-us/windows/win32/api/d3d11_4/nn- > >> d3d11_4- > >>> id3d11multithread > >> > >> OK, this is very interesting. It certainly clarifies that it's > >> ID3D11DeviceContext related. The documentation is still sketchy though. > >> For example ::Enter() [1] says that it enters a device critical > >> section, not a device context critical section. So if you have > >> multiple device context everything is blocked. This kinda defeats the use > of multiple threads. > >> > >> Also this interface is not available until Windows 10 Creator Update. > >> So for example it cannot be used on Windows 7. Maybe the D3D10 > >> version has the same effect but there's no guarantee. > > > > Still playing "the doubtful" is not a sincere reaction after having > > been proven wrong. 
> > Sorry if you seem to know all the answers already, but I don't and so I have to > investigate. Last year, I had literally worked this down to death. I followed every slightest hint from countless searches, read through hundreds of discussions, driven because I was unwilling to believe that up-/downloading of video textures with D3D11 can't be done equally fast as with D3D9. (the big picture was the implementation of D3D11 support for QuickSync where the slowdown played a much bigger role than with D3D11VA decoders only). Eventually I landed at some internal Nvidia presentation, some talks with MS guys and some source code discussion deep inside a 3D game engine (not a no-name). It really bugs me that I didn't properly note the references, but from somewhere in between I was able to gather solid evidence about what is legal to do and what is not. Based on that, several iterations followed to find the optimal way for doing the texture transfer. As I had implemented D3D11 support for QuickSync, this got pretty complicated because with a full transcoding pipeline, all parts (decoder, encoder and filters) can (and usually will) request textures. Only the latest Intel Drivers can work with array textures everywhere (e.g. VPP), so I also needed to add support for non-array texture allocation. The patch you've seen is the result of weeks of intensive work (a small but crucial part of it) - even when it may not look like that. > Sorry if you seem to know all the answers already Obviously, I don't know all the answers, but all the answers I have given were correct. And when I didn't have an answer I always respectfully said that your situation might be different. And I didn't reply by implying that you would have done your work by trial-and-error or most likely invalid assumptions or deductions. 
All I can tell is that the procedure that I have described (1-2-3-4) can work rock-solid for multi-threaded DX11 texture transfer when it's done in the same way as I've shown. And believe it or not - I would still be happy when it would be of any use for you... Kind regards, softworkz
>> Sorry if you seem to know all the answers already, but I don't and so I have to >> investigate. > > Last year, I had literally worked this down to death. I followed every slightest > hint from countless searches, read through hundreds of discussions, driven > because I was unwilling to believe that up-/downloading of video textures with > D3D11 can't be done equally fast as with D3D9. > (the big picture was the implementation of D3D11 support for QuickSync where > the slowdown played a much bigger role than with D3D11VA decoders only). > Eventually I landed at some internal Nvidia presentation, some talks with MS > guys and some source code discussion deep inside a 3D game engine (not a > no-name). It really bugs me that I didn't properly note the references, but > from somewhere in between I was able to gather solid evidence about what > is legal to do and what Is not. Based on that, followed several iterations to > find the optimal way for doing the texture transfer. As I had implemented > D3D11 support for QuickSync, this got pretty complicated because with > a full transcoding pipeline, all parts (decoder, encoder and filters) can (and > usually will) request textures. Only the latest Intel Drivers can work with > array textures everywhere (e.g. VPP), so I also needed to add support for > non-array texture allocation. The patch you've seen is the result of weeks > of intensive work (a small but crucial part of it) - even when it may not > look like that. > > >> Sorry if you seem to know all the answers already > > Obviously, I don't know all the answers, but all the answers I have given > were correct. And when I didn't have an answer I always respectfully > said that your situation might be different. > And I didn't reply by implying that you would have done your work > by trial-and-error or most likely invalid assumptions or deductions. 
> > > I still don't know how you are actually operating this and thus I also cannot > tell what might or might not work in your case. > All I can tell is that the procedure that I have described (1-2-3-4) can > work rock-solid for multi-threaded DX11 texture transfer when it's done in > the same way as I've shown. > And believe it or not - I would still be happy when it would be > of any use for you... Even though the discussion is heated (fitting with the weather here) I don't mind. I learned some stuff and it pushed me to dig deeper. I can't just accept your word for it. I need something solid if I'm going to remove a lock that helped me so far. So I'm currently tooling VLC to be able to bring the decoder to its knees and find out what it can and cannot do safely. So far I can still see decoding artifacts when I don't use a lock, which would mean I still need the mutex, for the reasons given in the previous mail.
> -----Original Message----- > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > Steve Lhomme > Sent: Tuesday, August 11, 2020 12:44 PM > To: ffmpeg-devel@ffmpeg.org > Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer > copies are done before submitting them > Even though the discussion is heated (fitting with the weather here) I don't > mind. I learned some stuff and it pushed me to dig deeper. I can't just accept > your word for it. I need something solid if I'm going to remove a lock that > helped me so far. You're not supposed to accept my word. I wouldn't either. You could have just valued it as a possibility... that's all. > So I'm currently tooling VLC to be able to bring the decoder to its knees and > find out what it can and cannot do safely. So far I can still see decoding > artifacts when I don't use a lock, which would mean I still need the mutex, for the > reasons given in the previous mail. Are you actually using D3D11 staging textures? I'm still not clear about that "old-api" way. I'm curious: what's your motivation for using D3D11 over D3D9? I mean, we are always transcoding without rendering a video to any surface. Our reasons to use D3D11 are these two: 1. Works headless (without connected display) 2. Works without user session (like inside a Windows service) What's yours? When implementing D3D11 for QuickSync I managed to get very close to DXVA2 performance eventually, but I hardly ever have been able to exceed it. PS: Yes, it's damn hot here as well! :-) Kind regards, softworkz
On 2020-08-11 13:08, Soft Works wrote: > > >> -----Original Message----- >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of >> Steve Lhomme >> Sent: Tuesday, August 11, 2020 12:44 PM >> To: ffmpeg-devel@ffmpeg.org >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer >> copies are done before submitting them > > >> Even though the discussion is heated (fitting with the weather here) I don't >> mind. I learned some stuff and it pushed me to dig deeper. I can't just accept >> your word for it. I need something solid if I'm going to remove a lock that >> helped me so far. > > You're not supposed to accept my word. I wouldn’t do either. You could have > just valued it as a possibility..that's all. > >> So I'm currently tooling VLC to be able to bring the decoder to its knees and >> find out what it can and cannot do safely. So far I can still see decoding >> artifacts when I don't a lock, which would mean I still need the mutex, for the >> reasons given in the previous mail. > > Are you actually using D3D11 staging textures? I'm still not clear about > that "old-api" way. No. I render the texture slices on the screen straight from the decoder. In a separate thread from the libavcodec thread(s). > I'm curious: what's your motivation for using D3D11 over D3D9? At first for UWP apps (no other way except for D3D12). But now also because it supports HDR. > I mean, we are always transcoding without rendering a video to any surface. > Our reasons to use D3D11 are these two: > > 1. Works headless (without connected display) > 2. Works without user session (like inside a Windows service) > > What's yours? > > When implementing D3D11 for QuickSync I managed to get very close to > DXVA2 performance eventually, but I hardly ever have been able to exceed it.. Same here, especially because I never managed to make it work without copying first to another texture. But I might give it another try in light of all this. 
> PS: Yes, it's damn hot here as well! :-)
>
> Kind regards,
> softworkz
On 2020-08-11 12:43, Steve Lhomme wrote: >>> Sorry if you seem to know all the answers already, but I don't and so >>> I have to >>> investigate. >> >> Last year, I had literally worked this down to death. I followed every >> slightest >> hint from countless searches, read through hundreds of discussions, >> driven >> because I was unwilling to believe that up-/downloading of video >> textures with >> D3D11 can't be done equally fast as with D3D9. >> (the big picture was the implementation of D3D11 support for QuickSync >> where >> the slowdown played a much bigger role than with D3D11VA decoders only). >> Eventually I landed at some internal Nvidia presentation, some talks >> with MS >> guys and some source code discussion deep inside a 3D game engine (not a >> no-name). It really bugs me that I didn't properly note the >> references, but >> from somewhere in between I was able to gather solid evidence about what >> is legal to do and what Is not. Based on that, followed several >> iterations to >> find the optimal way for doing the texture transfer. As I had implemented >> D3D11 support for QuickSync, this got pretty complicated because with >> a full transcoding pipeline, all parts (decoder, encoder and filters) >> can (and >> usually will) request textures. Only the latest Intel Drivers can work >> with >> array textures everywhere (e.g. VPP), so I also needed to add support for >> non-array texture allocation. The patch you've seen is the result of >> weeks >> of intensive work (a small but crucial part of it) - even when it may not >> look like that. >> >> >>> Sorry if you seem to know all the answers already >> >> Obviously, I don't know all the answers, but all the answers I have given >> were correct. And when I didn't have an answer I always respectfully >> said that your situation might be different. >> And I didn't reply by implying that you would have done your work >> by trial-and-error or most likely invalid assumptions or deductions. 
>> >> >> I still don't know how you are actually operating this and thus I also >> cannot >> tell what might or might not work in your case. >> All I can tell is that the procedure that I have described (1-2-3-4) can >> work rock-solid for multi-threaded DX11 texture transfer when it's >> done in >> the same way as I've shown. >> And believe it or not - I would still be happy when it would be >> of any use for you... > > Even though the discussion is heated (fitting with the weather here) I > don't mind. I learned some stuff and it pushed me to dig deeper. I can't > just accept your word for it. I need something solid if I'm going to > remove a lock that helped me so far. > > So I'm currently tooling VLC to be able to bring the decoder to its > knees and find out what it can and cannot do safely. So far I can still > see decoding artifacts when I don't use a lock, which would mean I still > need the mutex, for the reasons given in the previous mail.

A follow-up on this. Using ID3D10Multithread seems to be enough to have mostly thread-safe ID3D11Device/ID3D11DeviceContext/etc. Even the decoding with its odd API seems to know what to do when submitted different buffers.

I did not manage to saturate the GPU, but I got much higher decoding speed/throughput, enough to validate the errors I got before. Many of them were due to VLC dropping data because of odd timing.

Now I still have some threading issues. For example, for deinterlacing we create an ID3D11VideoProcessor to handle the deinterlacing. And we create it after the decoding started (as the deinterlacing can be enabled/disabled dynamically). Without the mutex in the decoder it crashes on ID3D11VideoDevice::CreateVideoProcessor() and ID3D11VideoContext::SubmitDecoderBuffers() as they are being called simultaneously. If I add the mutex between the decoder and just this filter (not the rendering side) it works fine.

So I guess I'm stuck with the mutex for the time being.
Here is the stack trace on an Intel 630 GPU:

igd11dxva64.dll!00007ffc384a8d24() (Unknown Source:0)
igd11dxva64.dll!00007ffc38452030() (Unknown Source:0)
igd11dxva64.dll!00007ffc3845a081() (Unknown Source:0)
igd11dxva64.dll!00007ffc38465a27() (Unknown Source:0)
igd11dxva64.dll!00007ffc386067d2() (Unknown Source:0)
igd11dxva64.dll!00007ffc3883c9f3() (Unknown Source:0)
igd11dxva64.dll!00007ffc3867145a() (Unknown Source:0)
igd11dxva64.dll!00007ffc3866ea23() (Unknown Source:0)
igd11dxva64.dll!00007ffc3881b4ac() (Unknown Source:0)
igd11dxva64.dll!00007ffc384f7bdc() (Unknown Source:0)
igd11dxva64.dll!00007ffc384fa2a5() (Unknown Source:0)
igd11dxva64.dll!00007ffc3847a334() (Unknown Source:0)
d3d11.dll!00007ffcabc33e8d() (Unknown Source:0)
d3d11.dll!00007ffcabc3389d() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc3184fa6b() (Unknown Source:0)
calling ID3D11VideoContext::SubmitDecoderBuffers()
libavcodec_plugin.dll!ff_dxva2_common_end_frame(AVCodecContext * avctx, AVFrame * frame, const void * pp, unsigned int pp_size, const void * qm, unsigned int qm_size, int(*)(AVCodecContext *, void *, void *) commit_bs_si) Line 1085 (c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\dxva2.c:1085)
libavcodec_plugin.dll!dxva2_h264_end_frame(AVCodecContext * avctx) Line 507 (c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\dxva2_h264.c:507)
libavcodec_plugin.dll!ff_h264_field_end(H264Context * h, H264SliceContext * sl, int in_setup) Line 171 (c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\h264_picture.c:171)
libavcodec_plugin.dll!h264_decode_frame(AVCodecContext * avctx, void * data, int * got_frame, AVPacket * avpkt) Line 1015 (c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\h264dec.c:1015)
libavcodec_plugin.dll!decode_simple_internal(AVCodecContext * avctx, AVFrame * frame) Line 432 (c:\Users\robux\Documents\Programs\Videolabs\build\win64\contrib\contrib-win64\ffmpeg\libavcodec\decode.c:432)

win32u.dll!00007ffcb0054784() (Unknown Source:0)
gdi32.dll!00007ffcb1e03860() (Unknown Source:0)
d3d11.dll!00007ffcabc756ee() (Unknown Source:0)
d3d11.dll!00007ffcabc5c811() (Unknown Source:0)
igd11dxva64.dll!00007ffc385c5043() (Unknown Source:0)
igd11dxva64.dll!00007ffc384abaa5() (Unknown Source:0)
igd11dxva64.dll!00007ffc384ab7ab() (Unknown Source:0)
igd11dxva64.dll!00007ffc38453b27() (Unknown Source:0)
igd11dxva64.dll!00007ffc384611e6() (Unknown Source:0)
igd11dxva64.dll!00007ffc385cca30() (Unknown Source:0)
igd11dxva64.dll!00007ffc384bb303() (Unknown Source:0)
igd11dxva64.dll!00007ffc3847ccff() (Unknown Source:0)
d3d11.dll!00007ffcabc3e661() (Unknown Source:0)
d3d11.dll!00007ffcabc3d39f() (Unknown Source:0)
d3d11.dll!00007ffcabc3d0cd() (Unknown Source:0)
d3d11.dll!00007ffcabc68a46() (Unknown Source:0)
d3d11.dll!00007ffcabc5955d() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc318a263c() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc3189479a() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc3184e749() (Unknown Source:0)
d3d11.dll!00007ffcabc59d0c() (Unknown Source:0)
d3d11.dll!00007ffcabc3c606() (Unknown Source:0)
d3d11_3SDKLayers.dll!00007ffc3187dd0e() (Unknown Source:0)
calling ID3D11VideoDevice::CreateVideoProcessor()
libdirect3d11_filters_plugin.dll!D3D11OpenDeinterlace(vlc_object_t * obj) Line 297 (c:\Users\robux\Documents\Programs\Videolabs\vlc\modules\hw\d3d11\d3d11_deinterlace.c:297)
libvlccore.dll!generic_start(void * func, bool forced, char * ap) Line 294 (c:\Users\robux\Documents\Programs\Videolabs\vlc\src\modules\modules.c:294)
libvlccore.dll!module_load(vlc_logger * log, module_t * m, int(*)(void *, bool, char *) init, bool forced, char * args) Line 212 (c:\Users\robux\Documents\Programs\Videolabs\vlc\src\modules\modules.c:212)
libvlccore.dll!vlc_module_load(vlc_logger * log, const char * capability, const char * name, bool strict, int(*)(void *, bool, char *) probe, ...) Line 265 (c:\Users\robux\Documents\Programs\Videolabs\vlc\src\modules\modules.c:265)
libvlccore.dll!module_need(vlc_object_t * obj, const char * cap, const char * name, bool strict) Line 305 (c:\Users\robux\Documents\Programs\Videolabs\vlc\src\modules\modules.c:305)
libvlccore.dll!filter_chain_AppendInner(filter_chain_t * chain, const char * name, const char * capability, config_chain_t * cfg, const es_format_t * fmt_out) Line 254 (c:\Users\robux\Documents\Programs\Videolabs\vlc\src\misc\filter_chain.c:254)
libvlccore.dll!filter_chain_AppendFilter(filter_chain_t * chain, const char * name, config_chain_t * cfg, const es_format_t * fmt_out) Line 299 (c:\Users\robux\Documents\Programs\Videolabs\vlc\src\misc\filter_chain.c:299)
libvlccore.dll!ThreadChangeFilters(vout_thread_sys_t * vout, const char * filters, const bool * new_deinterlace, bool is_locked) Line 992 (c:\Users\robux\Documents\Programs\Videolabs\vlc\src\video_output\video_output.c:992)
libvlccore.dll!Thread(void * object) Line 1891 (c:\Users\robux\Documents\Programs\Videolabs\vlc\src\video_output\video_output.c:1891)
libvlccore.dll!vlc_entry(void * p) Line 360 (c:\Users\robux\Documents\Programs\Videolabs\vlc\src\win32\thread.c:360)
msvcrt.dll!00007ffcb139af5a() (Unknown Source:0)
msvcrt.dll!00007ffcb139b02c() (Unknown Source:0)
kernel32.dll!00007ffcb21d6fd4() (Unknown Source:0)
ntdll.dll!00007ffcb23bcec1() (Unknown Source:0)
> -----Original Message----- > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of > Steve Lhomme > Sent: Wednesday, August 12, 2020 2:05 PM > To: ffmpeg-devel@ffmpeg.org > Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer > copies are done before submitting them > > On 2020-08-11 12:43, Steve Lhomme wrote: > >>> Sorry if you seem to know all the answers already, but I don't and > >>> so I have to investigate. > >> > >> Last year, I had literally worked this down to death. I followed > >> every slightest hint from countless searches, read through hundreds > >> of discussions, driven because I was unwilling to believe that > >> up-/downloading of video textures with > >> D3D11 can't be done equally fast as with D3D9. > >> (the big picture was the implementation of D3D11 support for > >> QuickSync where the slowdown played a much bigger role than with > >> D3D11VA decoders only). > >> Eventually I landed at some internal Nvidia presentation, some talks > >> with MS guys and some source code discussion deep inside a 3D game > >> engine (not a no-name). It really bugs me that I didn't properly note > >> the references, but from somewhere in between I was able to gather > >> solid evidence about what is legal to do and what Is not. Based on > >> that, followed several iterations to find the optimal way for doing > >> the texture transfer. As I had implemented > >> D3D11 support for QuickSync, this got pretty complicated because with > >> a full transcoding pipeline, all parts (decoder, encoder and filters) > >> can (and usually will) request textures. Only the latest Intel > >> Drivers can work with array textures everywhere (e.g. VPP), so I also > >> needed to add support for non-array texture allocation. The patch > >> you've seen is the result of weeks of intensive work (a small but > >> crucial part of it) - even when it may not look like that. 
> >> > >> > >>> Sorry if you seem to know all the answers already > >> > >> Obviously, I don't know all the answers, but all the answers I have > >> given were correct. And when I didn't have an answer I always > >> respectfully said that your situation might be different. > >> And I didn't reply by implying that you would have done your work by > >> trial-and-error or most likely invalid assumptions or deductions. > >> > >> > >> I still don't know how you are actually operating this and thus I > >> also cannot tell what might or might not work in your case. > >> All I can tell is that the procedure that I have described (1-2-3-4) > >> can work rock-solid for multi-threaded DX11 texture transfer when > >> it's done in the same way as I've shown. > >> And believe it or not - I would still be happy when it would be of > >> any use for you... > > > > Even though the discussion is heated (fitting with the weather here) I > > don't mind. I learned some stuff and it pushed me to dig deeper. I > > can't just accept your word for it. I need something solid if I'm > > going to remove a lock that helped me so far. > > > > So I'm currently tooling VLC to be able to bring the decoder to its > > knees and find out what it can and cannot do safely. So far I can > > still see decoding artifacts when I don't use a lock, which would mean I > > still need the mutex, for the reasons given in the previous mail. > > A follow-up on this. Using ID3D10Multithread seems to be enough to have > mostly thread-safe ID3D11Device/ID3D11DeviceContext/etc. Even the > decoding with its odd API seems to know what to do when submitted > different buffers. > > I did not manage to saturate the GPU, but I got much higher decoding > speed/throughput, enough to validate the errors I got before. Many of them > were due to VLC dropping data because of odd timing. > > Now I still have some threading issues. For example, for deinterlacing we > create an ID3D11VideoProcessor to handle the deinterlacing.
And we create it > after the decoding started (as the deinterlacing can be enabled/disabled > dynamically). Without the mutex in the decoder it crashes on > ID3D11VideoDevice::CreateVideoProcessor() and > ID3D11VideoContext::SubmitDecoderBuffers() as they are being called > simultaneously. If I add the mutex between the decoder and just this filter > (not the rendering side) it works fine. > > So I guess I'm stuck with the mutex for the time being.

At an earlier stage I had considered the idea of adding those video processors as ffmpeg hardware filters, but due to the vast amount of different use cases, platforms and hw accelerations we support, I had made the decision that we do all filtering either by CPU or in the hw context of the en-coder, but never in the hw context of the de-coder, so I don't have any experience with DX11 video processors.

Maybe a too obvious idea: How about activating the mutex use only for a short time during the process of adding the video processor?

----

Regarding the subject in general, I'm now able to fill in some additional blanks (I'm doing too many different things, and my RAM is limited ;-) Specifically, the question "Why isn't this information more clearly documented?"

Short answer: Because "our" use case is too unusual.

While I had been searching for new or old evidence for my statement, I had found a book about DX11 talking about the use of ID3D10_SetMultiThreadedProtected, but it said something like: this will work, but it will lock only a small part, and that would not be sufficient, because it does not guarantee ordered execution, and therefore a D3D11 application will always need to do its own locking at the application side. Obviously, that wouldn't have been a good reference to use in the discussion :-)

Here comes the twist: All those books and documentation are about writing 3D applications.
And 3D applications are very different from "our" kind of applications: Those applications may execute dozens or hundreds of commands between two frames. And many of those commands are sequences that need to be executed in order.

And that explains why:

- The MSDN docs suggest application-side locking
- The SetMultiThreadedProtected method initially didn't have a DX11 version
- The later added ID3D11Threading interface didn't only have the SetMultiThreadedProtected method, but also the Begin.. and End.. methods
- The DX11 book said that SetMultiThreadedProtected would not be sufficient
- The documentation of SetMultiThreadedProtected in the new interface states that it would create possibly unwanted overhead

In our case, we're doing nothing more than one or two calls per frame (instead of hundreds). That's why we don't need to care about any of the bullet points above. Unfortunately, nobody cared about us when writing the docs.

Kind regards,
softworkz
On 2020-08-13 1:01, Soft Works wrote: > >> -----Original Message----- >> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of >> Steve Lhomme >> Sent: Wednesday, August 12, 2020 2:05 PM >> To: ffmpeg-devel@ffmpeg.org >> Subject: Re: [FFmpeg-devel] [PATCH v3 1/2] dxva: wait until D3D11 buffer >> copies are done before submitting them >> >> On 2020-08-11 12:43, Steve Lhomme wrote: >>>>> Sorry if you seem to know all the answers already, but I don't and >>>>> so I have to investigate. >>>> >>>> Last year, I had literally worked this down to death. I followed >>>> every slightest hint from countless searches, read through hundreds >>>> of discussions, driven because I was unwilling to believe that >>>> up-/downloading of video textures with >>>> D3D11 can't be done equally fast as with D3D9. >>>> (the big picture was the implementation of D3D11 support for >>>> QuickSync where the slowdown played a much bigger role than with >>>> D3D11VA decoders only). >>>> Eventually I landed at some internal Nvidia presentation, some talks >>>> with MS guys and some source code discussion deep inside a 3D game >>>> engine (not a no-name). It really bugs me that I didn't properly note >>>> the references, but from somewhere in between I was able to gather >>>> solid evidence about what is legal to do and what Is not. Based on >>>> that, followed several iterations to find the optimal way for doing >>>> the texture transfer. As I had implemented >>>> D3D11 support for QuickSync, this got pretty complicated because with >>>> a full transcoding pipeline, all parts (decoder, encoder and filters) >>>> can (and usually will) request textures. Only the latest Intel >>>> Drivers can work with array textures everywhere (e.g. VPP), so I also >>>> needed to add support for non-array texture allocation. The patch >>>> you've seen is the result of weeks of intensive work (a small but >>>> crucial part of it) - even when it may not look like that. 
>>>> >>>> >>>>> Sorry if you seem to know all the answers already >>>> >>>> Obviously, I don't know all the answers, but all the answers I have >>>> given were correct. And when I didn't have an answer I always >>>> respectfully said that your situation might be different. >>>> And I didn't reply by implying that you would have done your work by >>>> trial-and-error or most likely invalid assumptions or deductions. >>>> >>>> >>>> I still don't know how you are actually operating this and thus I >>>> also cannot tell what might or might not work in your case. >>>> All I can tell is that the procedure that I have described (1-2-3-4) >>>> can work rock-solid for multi-threaded DX11 texture transfer when >>>> it's done in the same way as I've shown. >>>> And believe it or not - I would still be happy when it would be of >>>> any use for you... >>> >>> Even though the discussion is heated (fitting with the weather here) I >>> don't mind. I learned some stuff and it pushed me to dig deeper. I >>> can't just accept your word for it. I need something solid if I'm >>> going to remove a lock that helped me so far. >>> >>> So I'm currently tooling VLC to be able to bring the decoder to its >>> knees and find out what it can and cannot do safely. So far I can >>> still see decoding artifacts when I don't use a lock, which would mean I >>> still need the mutex, for the reasons given in the previous mail. >> >> A follow-up on this. Using ID3D10Multithread seems to be enough to have >> mostly thread-safe ID3D11Device/ID3D11DeviceContext/etc. Even the >> decoding with its odd API seems to know what to do when submitted >> different buffers. >> >> I did not manage to saturate the GPU, but I got much higher decoding >> speed/throughput, enough to validate the errors I got before. Many of them >> were due to VLC dropping data because of odd timing. >> >> Now I still have some threading issues. For example, for deinterlacing we >> create an ID3D11VideoProcessor to handle the deinterlacing.
And we create it >> after the decoding started (as the deinterlacing can be enabled/disabled >> dynamically). Without the mutex in the decoder it crashes on >> ID3D11VideoDevice::CreateVideoProcessor() and >> ID3D11VideoContext::SubmitDecoderBuffers() as they are being called >> simultaneously. If I add the mutex between the decoder and just this filter >> (not the rendering side) it works fine. >> >> So I guess I'm stuck with the mutex for the time being. > > At an earlier stage I had considered the idea of adding those video > processors as ffmpeg hardware filters, but due to the vast amount of > different use cases, platforms and hw accelerations we support, > I had made the decision that we do all filtering either by CPU or in the > hw context of the en-coder, but never in the hw context of the de-coder, > so I don't have any experience with DX11 video processors. > > Maybe a too obvious idea: How about activating the mutex use only for > a short time during the process of adding the video processor?

This doesn't seem feasible, even with a callback system. You don't know when it's safe to enable/disable it.

By the way, the origin of the mutex was on Windows Phones. It's probably related to the fact that some phones only decode to DXGI_FORMAT_I420_OPAQUE, which cannot be used for rendering. The only way to use the decoded surface is to convert it (to NV12) via a VideoProcessor. So in this case it was always used, even for basic decoding.
diff --git a/libavcodec/dxva2.c b/libavcodec/dxva2.c
index 32416112bf..1a0e5b69b2 100644
--- a/libavcodec/dxva2.c
+++ b/libavcodec/dxva2.c
@@ -692,6 +692,12 @@ int ff_dxva2_decode_init(AVCodecContext *avctx)
         d3d11_ctx->surface = sctx->d3d11_views;
         d3d11_ctx->workaround = sctx->workaround;
         d3d11_ctx->context_mutex = INVALID_HANDLE_VALUE;
+
+        D3D11_QUERY_DESC query = { 0 };
+        query.Query = D3D11_QUERY_EVENT;
+        if (FAILED(ID3D11Device_CreateQuery(device_hwctx->device, &query,
+                                            (ID3D11Query**)&sctx->wait_copies)))
+            sctx->wait_copies = NULL;
     }
 #endif
 
@@ -729,6 +735,8 @@ int ff_dxva2_decode_uninit(AVCodecContext *avctx)
     av_buffer_unref(&sctx->decoder_ref);
 
 #if CONFIG_D3D11VA
+    if (sctx->wait_copies)
+        ID3D11Asynchronous_Release(sctx->wait_copies);
     for (i = 0; i < sctx->nb_d3d11_views; i++) {
         if (sctx->d3d11_views[i])
             ID3D11VideoDecoderOutputView_Release(sctx->d3d11_views[i]);
@@ -932,6 +940,12 @@ int ff_dxva2_common_end_frame(AVCodecContext *avctx, AVFrame *frame,
 
 #if CONFIG_D3D11VA
     if (ff_dxva2_is_d3d11(avctx)) {
+        if (sctx->wait_copies) {
+            AVHWFramesContext *frames_ctx = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
+            AVD3D11VADeviceContext *device_hwctx = frames_ctx->device_ctx->hwctx;
+            ID3D11DeviceContext_Begin(device_hwctx->device_context, sctx->wait_copies);
+        }
+
         buffer = &buffer11[buffer_count];
         type = D3D11_VIDEO_DECODER_BUFFER_PICTURE_PARAMETERS;
     }
@@ -1005,9 +1019,28 @@ int ff_dxva2_common_end_frame(AVCodecContext *avctx, AVFrame *frame,
 
 #if CONFIG_D3D11VA
     if (ff_dxva2_is_d3d11(avctx))
+    {
+        int maxWait = 10;
+        /* wait until all the buffer release is done copying data to the GPU
+         * before doing the submit command */
+        if (sctx->wait_copies) {
+            AVHWFramesContext *frames_ctx = (AVHWFramesContext*)avctx->hw_frames_ctx->data;
+            AVD3D11VADeviceContext *device_hwctx = frames_ctx->device_ctx->hwctx;
+            ID3D11DeviceContext_End(device_hwctx->device_context, sctx->wait_copies);
+
+            while (maxWait-- && S_FALSE ==
+                   ID3D11DeviceContext_GetData(device_hwctx->device_context,
+                                               sctx->wait_copies, NULL, 0, 0)) {
+                ff_dxva2_unlock(avctx);
+                SleepEx(2, TRUE);
+                ff_dxva2_lock(avctx);
+            }
+        }
+
         hr = ID3D11VideoContext_SubmitDecoderBuffers(D3D11VA_CONTEXT(ctx)->video_context,
                                                      D3D11VA_CONTEXT(ctx)->decoder,
                                                      buffer_count, buffer11);
+    }
 #endif
 #if CONFIG_DXVA2
     if (avctx->pix_fmt == AV_PIX_FMT_DXVA2_VLD) {
diff --git a/libavcodec/dxva2_internal.h b/libavcodec/dxva2_internal.h
index b822af59cd..c44e8e09b0 100644
--- a/libavcodec/dxva2_internal.h
+++ b/libavcodec/dxva2_internal.h
@@ -81,6 +81,8 @@ typedef struct FFDXVASharedContext {
     ID3D11VideoDecoderOutputView **d3d11_views;
     int nb_d3d11_views;
     ID3D11Texture2D *d3d11_texture;
+
+    ID3D11Asynchronous *wait_copies;
 #endif
 
 #if CONFIG_DXVA2