[FFmpeg-devel] frame_thread_encoder: make task indexing deterministic.

Submitted by Ronald S. Bultje on March 31, 2017, 2:20 p.m.

Details

Message ID 1490970043-57594-1-git-send-email-rsbultje@gmail.com
State New
Headers show

Commit Message

Ronald S. Bultje March 31, 2017, 2:20 p.m.
Fixes tsan warnings in fate-utvideoenc. Example warning from
utvideoenc_rgb_left:

WARNING: ThreadSanitizer: data race (pid=19929)
  Read of size 8 at 0x7d840001cf18 by main thread:
    #0 ff_thread_video_encode_frame src/libavcodec/frame_thread_encoder.c:276 (ffmpeg+0x00000136a39d)
    #1 avcodec_encode_video2 src/libavcodec/utils.c:1986 (ffmpeg+0x000000f34a18)
[..]
  Previous write of size 8 at 0x7d840001cf18 by thread T14 (mutexes: write M1348):
    #0 worker src/libavcodec/frame_thread_encoder.c:100 (ffmpeg+0x000001369964)

We prevent that by using the same mechanism as frame-mt decoding, and
assuming that we're encoding N packets in parallel (where N=n_threads),
and thus the first N calls to encode_video() won't produce a packet.
---
 libavcodec/frame_thread_encoder.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Comments

Michael Niedermayer April 1, 2017, 10:15 a.m.
On Fri, Mar 31, 2017 at 10:20:43AM -0400, Ronald S. Bultje wrote:
> Fixes tsan warnings in fate-utvideoenc. Example warning from
> utvideoenc_rgb_left:
> 
> WARNING: ThreadSanitizer: data race (pid=19929)
>   Read of size 8 at 0x7d840001cf18 by main thread:
>     #0 ff_thread_video_encode_frame src/libavcodec/frame_thread_encoder.c:276 (ffmpeg+0x00000136a39d)
>     #1 avcodec_encode_video2 src/libavcodec/utils.c:1986 (ffmpeg+0x000000f34a18)
> [..]
>   Previous write of size 8 at 0x7d840001cf18 by thread T14 (mutexes: write M1348):
>     #0 worker src/libavcodec/frame_thread_encoder.c:100 (ffmpeg+0x000001369964)
> 

> We prevent that by using the same mechanism as frame-mt decoding, and
> assuming that we're encoding N packets in parallel (where N=n_threads),
> and thus the first N calls to encode_video() won't produce a packet.

The whole idea behind frame_thread_encoder was that the number of
threads and the delay could be changed at runtime
ive rechecked the original commit and it said
"Compared to the decoder side, this code is able to change both the
    delay and the number of threads seamlessly during encoding. Also
    any idle thread can pick up tasks, the strict round robin in order
    limit is gone too."

Making it like frame-mt decoding doesnt sound good to me.

Though the frame_thread_encoder was/is limited by the encoding API

about the patch itself, the changes done to indexing seem not to
change anything, it makes it possible for the variables to overflow
though.

IIUC the only change your patch does is to remove the outdata check
from the quoted warning

its a while ago that i worked n this code but isnt this just "missing"
a finished_task_mutex lock over the access ?

[...]
Ronald S. Bultje April 1, 2017, 12:50 p.m.
Hi,

On Sat, Apr 1, 2017 at 6:15 AM, Michael Niedermayer <michael@niedermayer.cc>
wrote:
[.]
>
> about the patch itself, the changes done to indexing seem not to
> change anything, it makes it possible for the variables to overflow
> though.
>

That's intentional, they're both unsigned and the one is guaranteed to be
larger than the other so the difference after subtract would still be
guaranteed to be <= n_threads. So although the indices overflow, the
subtraction will always be correct.


> IIUC the only change your patch does is to remove the outdata check
> from the quoted warning
>

Yes, because if you change the delay to be fixed (and equal to n_threads),
you no longer need the check. If the difference between indices is smaller
than n_threads, you don't return a packet. If it's equal, you do return a
packet.

its a while ago that i worked n this code but isnt this just "missing"
> a finished_task_mutex lock over the access ?


I don't know, it depends on what you're trying to accomplish. If you want a
fixed-frame delay, you don't need the access and so no mutex is needed. I
haven't thought much about what you need to get a variable-frame delay. To
me, a fixed-frame delay for intra-only codecs (which is what this API
currently accomplishes in practice) is exactly what you would expect,
variable-frame delay implies you're not actually using all threads. But
maybe I misunderstand.

Ronald
Michael Niedermayer April 1, 2017, 9:44 p.m.
On Sat, Apr 01, 2017 at 08:50:37AM -0400, Ronald S. Bultje wrote:
> Hi,
> 
> On Sat, Apr 1, 2017 at 6:15 AM, Michael Niedermayer <michael@niedermayer.cc>
> wrote:
> [.]
> >
> > about the patch itself, the changes done to indexing seem not to
> > change anything, it makes it possible for the variables to overflow
> > though.
> >
> 
> That's intentional, they're both unsigned and the one is guaranteed to be
> larger than the other so the difference after subtract would still be
> guaranteed to be <= n_threads. So although the indices overflow, the
> subtraction will always be correct.
> 
> 
> > IIUC the only change your patch does is to remove the outdata check
> > from the quoted warning
> >
> 
> Yes, because if you change the delay to be fixed (and equal to n_threads),
> you no longer need the check. If the difference between indices is smaller
> than n_threads, you don't return a packet. If it's equal, you do return a
> packet.
> 

> its a while ago that i worked n this code but isnt this just "missing"
> > a finished_task_mutex lock over the access ?
> 
> 
> I don't know, it depends on what you're trying to accomplish. If you want a
> fixed-frame delay, you don't need the access and so no mutex is needed. I
> haven't thought much about what you need to get a variable-frame delay. To
> me, a fixed-frame delay for intra-only codecs (which is what this API
> currently accomplishes in practice) is exactly what you would expect,
> variable-frame delay implies you're not actually using all threads. But
> maybe I misunderstand.

Lower delay is better, so is using fewer threads if theres no need for
more
it means less delay for any form of real time commuication
less memory used, less L2 cache used and consequently better use of
the CPU.
If decoding with 1 thread is fast enough for the rate at which packets
become available using 12 threads will result in more cpu cycles per
frame, more memory used, more watts, a higher energy bill, ...


[...]
Ronald S. Bultje April 1, 2017, 10:34 p.m.
Hi,

On Sat, Apr 1, 2017 at 5:44 PM, Michael Niedermayer <michael@niedermayer.cc>
wrote:

> On Sat, Apr 01, 2017 at 08:50:37AM -0400, Ronald S. Bultje wrote:
> > Hi,
> >
> > On Sat, Apr 1, 2017 at 6:15 AM, Michael Niedermayer
> <michael@niedermayer.cc>
> > wrote:
> > [.]
> > >
> > > about the patch itself, the changes done to indexing seem not to
> > > change anything, it makes it possible for the variables to overflow
> > > though.
> > >
> >
> > That's intentional, they're both unsigned and the one is guaranteed to be
> > larger than the other so the difference after subtract would still be
> > guaranteed to be <= n_threads. So although the indices overflow, the
> > subtraction will always be correct.
> >
> >
> > > IIUC the only change your patch does is to remove the outdata check
> > > from the quoted warning
> > >
> >
> > Yes, because if you change the delay to be fixed (and equal to
> n_threads),
> > you no longer need the check. If the difference between indices is
> smaller
> > than n_threads, you don't return a packet. If it's equal, you do return a
> > packet.
> >
>
> > its a while ago that i worked n this code but isnt this just "missing"
> > > a finished_task_mutex lock over the access ?
> >
> >
> > I don't know, it depends on what you're trying to accomplish. If you
> want a
> > fixed-frame delay, you don't need the access and so no mutex is needed. I
> > haven't thought much about what you need to get a variable-frame delay.
> To
> > me, a fixed-frame delay for intra-only codecs (which is what this API
> > currently accomplishes in practice) is exactly what you would expect,
> > variable-frame delay implies you're not actually using all threads. But
> > maybe I misunderstand.
>
> Lower delay is better, so is using fewer threads if theres no need for
> more
> it means less delay for any form of real time commuication
> less memory used, less L2 cache used and consequently better use of
> the CPU.
> If decoding with 1 thread is fast enough for the rate at which packets
> become available using 12 threads will result in more cpu cycles per
> frame, more memory used, more watts, a higher energy bill, ...


We already do automatic thread detection, and it's a user-settable option
also.

Can we please review the patch?

Ronald
Michael Niedermayer April 2, 2017, 12:19 a.m.
On Sat, Apr 01, 2017 at 06:34:50PM -0400, Ronald S. Bultje wrote:
> Hi,
> 
> On Sat, Apr 1, 2017 at 5:44 PM, Michael Niedermayer <michael@niedermayer.cc>
> wrote:
> 
> > On Sat, Apr 01, 2017 at 08:50:37AM -0400, Ronald S. Bultje wrote:
> > > Hi,
> > >
> > > On Sat, Apr 1, 2017 at 6:15 AM, Michael Niedermayer
> > <michael@niedermayer.cc>
> > > wrote:
> > > [.]
> > > >
> > > > about the patch itself, the changes done to indexing seem not to
> > > > change anything, it makes it possible for the variables to overflow
> > > > though.
> > > >
> > >
> > > That's intentional, they're both unsigned and the one is guaranteed to be
> > > larger than the other so the difference after subtract would still be
> > > guaranteed to be <= n_threads. So although the indices overflow, the
> > > subtraction will always be correct.
> > >
> > >
> > > > IIUC the only change your patch does is to remove the outdata check
> > > > from the quoted warning
> > > >
> > >
> > > Yes, because if you change the delay to be fixed (and equal to
> > n_threads),
> > > you no longer need the check. If the difference between indices is
> > smaller
> > > than n_threads, you don't return a packet. If it's equal, you do return a
> > > packet.
> > >
> >
> > > its a while ago that i worked n this code but isnt this just "missing"
> > > > a finished_task_mutex lock over the access ?
> > >
> > >
> > > I don't know, it depends on what you're trying to accomplish. If you
> > want a
> > > fixed-frame delay, you don't need the access and so no mutex is needed. I
> > > haven't thought much about what you need to get a variable-frame delay.
> > To
> > > me, a fixed-frame delay for intra-only codecs (which is what this API
> > > currently accomplishes in practice) is exactly what you would expect,
> > > variable-frame delay implies you're not actually using all threads. But
> > > maybe I misunderstand.
> >
> > Lower delay is better, so is using fewer threads if theres no need for
> > more
> > it means less delay for any form of real time commuication
> > less memory used, less L2 cache used and consequently better use of
> > the CPU.
> > If decoding with 1 thread is fast enough for the rate at which packets
> > become available using 12 threads will result in more cpu cycles per
> > frame, more memory used, more watts, a higher energy bill, ...
> 
> 

> We already do automatic thread detection, and it's a user-settable option
> also.

automatic thread detection detects the maximum that might be usefull
on the hardware, not the optimal

Allowing the user to set the number of threads means the user can
set the number but the user does not know the best number for each
file, the number also changes at runtime.
But even if it didnt change and the user knew the best it would be
very inconvenient for the user to set it by hand
Its better if it just uses the optimal automatically


> 
> Can we please review the patch?

This comment feels offensive
I reviewed your patch, or at least tried to.

"... the changes done to indexing seem not to change anything, ..."
If the changes are not needed, please remove them from the patch or
if they are neeed please explain why

having an index count to UINT_MAX is confusing, i prefer it to stay
within 0..BUFFER_SIZE.


'...isnt this just "missing" a finished_task_mutex lock over the access ?'

If it just is missing a lock, adding this lock is the fix i would
prefer.
Of course you can as well use an atomic access or somethig else that
keeps tsan from complaining.

Supporting changing delay and number of threads at runtime was a
big factor in the design of the frame thread encoder. Its something
i cared about, so i am not to positive towards patches removing that.

[...]
Ronald S. Bultje April 2, 2017, 11:30 a.m.
Hi,

On Sat, Apr 1, 2017 at 8:19 PM, Michael Niedermayer <michael@niedermayer.cc>
wrote:

> Supporting changing delay and number of threads at runtime was a
> big factor in the design of the frame thread encoder. Its something
> i cared about, so i am not to positive towards patches removing that.


It introduced a bug. I fixed it. Please accept the fix or fix it yourself
differently. Don't advocate keeping a bug in our tree.

Ronald
Michael Niedermayer April 3, 2017, 9:13 p.m.
On Sun, Apr 02, 2017 at 07:30:00AM -0400, Ronald S. Bultje wrote:
> Hi,
> 
> On Sat, Apr 1, 2017 at 8:19 PM, Michael Niedermayer <michael@niedermayer.cc>
> wrote:
> 
> > Supporting changing delay and number of threads at runtime was a
> > big factor in the design of the frame thread encoder. Its something
> > i cared about, so i am not to positive towards patches removing that.
> 
> 
> It introduced a bug. I fixed it. Please accept the fix or fix it yourself
> differently. Don't advocate keeping a bug in our tree.

I would but iam not able to reproduce this warning with
gcc-6 (Ubuntu 6.2.0-3ubuntu11~12.04) 6.2.0 20160901

please try to move the finished_task_mutex lock up so it covers
the part of the check that needs it

[...]

Patch hide | download patch | download mbox

diff --git a/libavcodec/frame_thread_encoder.c b/libavcodec/frame_thread_encoder.c
index 27ae356..8db9290 100644
--- a/libavcodec/frame_thread_encoder.c
+++ b/libavcodec/frame_thread_encoder.c
@@ -251,6 +251,7 @@  int ff_thread_video_encode_frame(AVCodecContext *avctx, AVPacket *pkt, const AVF
     ThreadContext *c = avctx->internal->frame_thread_encoder;
     Task task;
     int ret;
+    unsigned idx;
 
     av_assert1(!*got_packet_ptr);
 
@@ -264,32 +265,33 @@  int ff_thread_video_encode_frame(AVCodecContext *avctx, AVPacket *pkt, const AVF
             return ret;
         }
 
-        task.index = c->task_index;
+        task.index = c->task_index % BUFFER_SIZE;
         task.indata = (void*)new;
         pthread_mutex_lock(&c->task_fifo_mutex);
         av_fifo_generic_write(c->task_fifo, &task, sizeof(task), NULL);
         pthread_cond_signal(&c->task_fifo_cond);
         pthread_mutex_unlock(&c->task_fifo_mutex);
 
-        c->task_index = (c->task_index+1) % BUFFER_SIZE;
+        c->task_index++;
 
-        if(!c->finished_tasks[c->finished_task_index].outdata && (c->task_index - c->finished_task_index) % BUFFER_SIZE <= avctx->thread_count)
+        if (c->task_index - c->finished_task_index < avctx->thread_count)
             return 0;
     }
 
     if(c->task_index == c->finished_task_index)
         return 0;
 
+    idx = c->finished_task_index % BUFFER_SIZE;
     pthread_mutex_lock(&c->finished_task_mutex);
-    while (!c->finished_tasks[c->finished_task_index].outdata) {
+    while (!c->finished_tasks[idx].outdata) {
         pthread_cond_wait(&c->finished_task_cond, &c->finished_task_mutex);
     }
-    task = c->finished_tasks[c->finished_task_index];
+    task = c->finished_tasks[idx];
     *pkt = *(AVPacket*)(task.outdata);
     if(pkt->data)
         *got_packet_ptr = 1;
-    av_freep(&c->finished_tasks[c->finished_task_index].outdata);
-    c->finished_task_index = (c->finished_task_index+1) % BUFFER_SIZE;
+    av_freep(&c->finished_tasks[idx].outdata);
+    c->finished_task_index++;
     pthread_mutex_unlock(&c->finished_task_mutex);
 
     return task.return_code;