
[FFmpeg-devel,v5] Improved the performance of 1 decode + N filter graphs and adaptive bitrate.

Message ID 1550267663-8791-1-git-send-email-shaofei.wang@intel.com
State Superseded

Commit Message

Shaofei Wang Feb. 15, 2019, 9:54 p.m. UTC
It enables multiple filter graphs to run concurrently, which brings roughly
4%~20% improvement in some 1:N scenarios with CPU or GPU acceleration.

Below are some test cases and comparisons for reference.
(Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
(Software: Intel iHD driver - 16.9.00100, CentOS 7)

For 1:N transcode by GPU acceleration with vaapi:
./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
    -hwaccel_output_format vaapi \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
    -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null

    test results:
                2 encoders 5 encoders 10 encoders
    Improved       6.1%    6.9%       5.5%

For 1:N transcode by GPU acceleration with QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null

    test results:
                2 encoders  5 encoders 10 encoders
    Improved       6%       4%         15%

For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
./ffmpeg -hwaccel qsv -c:v h264_qsv \
    -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \
    -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null

    test results:
                2 scale  5 scale   10 scale
    Improved       12%     21%        21%

For CPU only 1 decode to N scaling:
./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
    -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
    -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null

    test results:
                2 scale  5 scale   10 scale
    Improved       25%    107%       148%

Signed-off-by: Wang, Shaofei <shaofei.wang@intel.com>
Reviewed-by: Zhao, Jun <jun.zhao@intel.com>
---
 fftools/ffmpeg.c        | 121 ++++++++++++++++++++++++++++++++++++++++++++----
 fftools/ffmpeg.h        |  14 ++++++
 fftools/ffmpeg_filter.c |   1 +
 3 files changed, 128 insertions(+), 8 deletions(-)

Comments

Michael Niedermayer Feb. 15, 2019, 9:21 p.m. UTC | #1
On Fri, Feb 15, 2019 at 04:54:23PM -0500, Shaofei Wang wrote:
> It enabled multiple filter graph concurrency, which bring above about
> 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration
> 
> Below are some test cases and comparison as reference.
> (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
> (Software: Intel iHD driver - 16.9.00100, CentOS 7)

breaks fate

make -j12 fate-filter-overlay-dvdsub-2397 V=2


frame=  208 fps=0.0 q=-0.0 Lsize=      48kB time=00:00:08.04 bitrate=  49.0kbits/s speed=10.6x    
video:105300kB audio:1254kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
pthread_join failed with error: No such process
Aborted (core dumped)
make: *** [fate-filter-overlay-dvdsub-2397] Error 134

[...]
Mark Thompson Feb. 16, 2019, 12:12 p.m. UTC | #2
On 15/02/2019 21:54, Shaofei Wang wrote:
> It enabled multiple filter graph concurrency, which bring above about
> 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration
> 
> Below are some test cases and comparison as reference.
> (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
> (Software: Intel iHD driver - 16.9.00100, CentOS 7)
> 
> For 1:N transcode by GPU acceleration with vaapi:
> ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
>     -hwaccel_output_format vaapi \
>     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>     -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
>     -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null
> 
>     test results:
>                 2 encoders 5 encoders 10 encoders
>     Improved       6.1%    6.9%       5.5%
> 
> For 1:N transcode by GPU acceleration with QSV:
> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
>     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>     -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null \
>     -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null /dev/null
> 
>     test results:
>                 2 encoders  5 encoders 10 encoders
>     Improved       6%       4%         15%
> 
> For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
>     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>     -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null \
>     -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f null /dev/null
> 
>     test results:
>                 2 scale  5 scale   10 scale
>     Improved       12%     21%        21%
> 
> For CPU only 1 decode to N scaling:
> ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>     -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
>     -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null
> 
>     test results:
>                 2 scale  5 scale   10 scale
>     Improved       25%    107%       148%
> 
> Signed-off-by: Wang, Shaofei <shaofei.wang@intel.com>
> Reviewed-by: Zhao, Jun <jun.zhao@intel.com>
> ---
>  fftools/ffmpeg.c        | 121 ++++++++++++++++++++++++++++++++++++++++++++----
>  fftools/ffmpeg.h        |  14 ++++++
>  fftools/ffmpeg_filter.c |   1 +
>  3 files changed, 128 insertions(+), 8 deletions(-)

On a bit more review, I don't think this patch works at all.

The existing code is all written to be run serially.  This simplistic approach to parallelising it falls down because many of those functions use variables written in what were previously other functions called at different times but have now become other threads, introducing undefined behaviour due to data races.

To consider a single example (not the only one), the function check_init_output_file() does not work at all after this change.  The test for OutputStream initialisation (so that you run exactly once after all of the output streams are ready) races with other threads setting those variables.  Since that's undefined behaviour you may get lucky sometimes and have the output file initialisation run exactly once, but in general it will fail in unknown ways.

If you want to resubmit this patch, you will need to refactor a lot of the other code in ffmpeg.c to rule out these undefined cases.

- Mark
Shaofei Wang Feb. 18, 2019, 3:22 a.m. UTC | #3
Thanks. It seems I need to cover the external samples as well.

> -----Original Message-----
> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces@ffmpeg.org] On Behalf Of
> Michael Niedermayer
> Sent: Saturday, February 16, 2019 5:22 AM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1
> decode + N filter graphs and adaptive bitrate.
> 
> On Fri, Feb 15, 2019 at 04:54:23PM -0500, Shaofei Wang wrote:
> > It enabled multiple filter graph concurrency, which bring above about
> > 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration
> >
> > Below are some test cases and comparison as reference.
> > (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
> > (Software: Intel iHD driver - 16.9.00100, CentOS 7)
> 
> breaks fate
> 
> make -j12 fate-filter-overlay-dvdsub-2397 V=2
> 
> 
> frame=  208 fps=0.0 q=-0.0 Lsize=      48kB time=00:00:08.04 bitrate=
> 49.0kbits/s speed=10.6x
> video:105300kB audio:1254kB subtitle:0kB other streams:0kB global
> headers:0kB muxing overhead: unknown pthread_join failed with error: No
> such process Aborted (core dumped)
> make: *** [fate-filter-overlay-dvdsub-2397] Error 134
> 
> [...]
> --
> Michael     GnuPG fingerprint:
> 9FF2128B147EF6730BADF133611EC787040B0FAB
> 
> Asymptotically faster algorithms should always be preferred if you have
> asymptotical amounts of data
Shaofei Wang Feb. 20, 2019, 10:17 a.m. UTC | #4
> -----Original Message-----
> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces@ffmpeg.org] On Behalf Of
> Mark Thompson
> Sent: Saturday, February 16, 2019 8:12 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1
> decode + N filter graphs and adaptive bitrate.
> On 15/02/2019 21:54, Shaofei Wang wrote:
> > It enabled multiple filter graph concurrency, which bring above about
> > 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration
> >
> > Below are some test cases and comparison as reference.
> > (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
> > (Software: Intel iHD driver - 16.9.00100, CentOS 7)
> >
> > For 1:N transcode by GPU acceleration with vaapi:
> > ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
> >     -hwaccel_output_format vaapi \
> >     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> >     -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
> >     -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null
> >
> >     test results:
> >                 2 encoders 5 encoders 10 encoders
> >     Improved       6.1%    6.9%       5.5%
> >
> > For 1:N transcode by GPU acceleration with QSV:
> > ./ffmpeg -hwaccel qsv -c:v h264_qsv \
> >     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> >     -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null
> \
> >     -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null
> > /dev/null
> >
> >     test results:
> >                 2 encoders  5 encoders 10 encoders
> >     Improved       6%       4%         15%
> >
> > For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
> > ./ffmpeg -hwaccel qsv -c:v h264_qsv \
> >     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> >     -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f
> null /dev/null \
> >     -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f
> > null /dev/null
> >
> >     test results:
> >                 2 scale  5 scale   10 scale
> >     Improved       12%     21%        21%
> >
> > For CPU only 1 decode to N scaling:
> > ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
> >     -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
> >     -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null
> >
> >     test results:
> >                 2 scale  5 scale   10 scale
> >     Improved       25%    107%       148%
> >
> > Signed-off-by: Wang, Shaofei <shaofei.wang@intel.com>
> > Reviewed-by: Zhao, Jun <jun.zhao@intel.com>
> > ---
> >  fftools/ffmpeg.c        | 121
> ++++++++++++++++++++++++++++++++++++++++++++----
> >  fftools/ffmpeg.h        |  14 ++++++
> >  fftools/ffmpeg_filter.c |   1 +
> >  3 files changed, 128 insertions(+), 8 deletions(-)
> 
> On a bit more review, I don't think this patch works at all.
> 

It has been tested and verified with a lot of cases. More fate cases need to be covered now.

> The existing code is all written to be run serially.  This simplistic approach to
> parallelising it falls down because many of those functions use variables
> written in what were previously other functions called at different times but
> have now become other threads, introducing undefined behaviour due to
> data races.
> 

Actually, this patch does not try to parallelize everything in ffmpeg. It only threads
the input filter of each filter graph (intended for simple filter graphs), which is a simple
way to improve N-filter-graph performance without introducing huge modifications. There are
still many serial function calls; the differences are that each filter graph initializes its
own output stream instead of all being initialized together, and each filter graph reaps the
filters of its own chain.

> To consider a single example (not the only one), the function
> check_init_output_file() does not work at all after this change.  The test for
> OutputStream initialisation (so that you run exactly once after all of the
> output streams are ready) races with other threads setting those variables.
> Since that's undefined behaviour you may get lucky sometimes and have the
> output file initialisation run exactly once, but in general it will fail in unknown
> ways.
> 

check_init_output_file() should only be responsible for the output file associated with
the specific output stream managed by each thread's chain, which means that even when it
is called from different threads, the data being set is different. Let me double-check.

> If you want to resubmit this patch, you will need to refactor a lot of the other
> code in ffmpeg.c to rule out these undefined cases.
> 

OK. This patch only affects SIMPLE filter graphs.

> - Mark
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Mark Thompson Feb. 20, 2019, 9:43 p.m. UTC | #5
On 20/02/2019 10:17, Wang, Shaofei wrote:
>> -----Original Message-----
>> From: ffmpeg-devel [mailto:ffmpeg-devel-bounces@ffmpeg.org] On Behalf Of
>> Mark Thompson
>> Sent: Saturday, February 16, 2019 8:12 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Subject: Re: [FFmpeg-devel] [PATCH v5] Improved the performance of 1
>> decode + N filter graphs and adaptive bitrate.
>> On 15/02/2019 21:54, Shaofei Wang wrote:
>>> It enabled multiple filter graph concurrency, which bring above about
>>> 4%~20% improvement in some 1:N scenarios by CPU or GPU acceleration
>>>
>>> Below are some test cases and comparison as reference.
>>> (Hardware platform: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz)
>>> (Software: Intel iHD driver - 16.9.00100, CentOS 7)
>>>
>>> For 1:N transcode by GPU acceleration with vaapi:
>>> ./ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel vaapi \
>>>     -hwaccel_output_format vaapi \
>>>     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>>     -vf "scale_vaapi=1280:720" -c:v h264_vaapi -f null /dev/null \
>>>     -vf "scale_vaapi=720:480" -c:v h264_vaapi -f null /dev/null
>>>
>>>     test results:
>>>                 2 encoders 5 encoders 10 encoders
>>>     Improved       6.1%    6.9%       5.5%
>>>
>>> For 1:N transcode by GPU acceleration with QSV:
>>> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
>>>     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>>     -vf "scale_qsv=1280:720:format=nv12" -c:v h264_qsv -f null /dev/null
>> \
>>>     -vf "scale_qsv=720:480:format=nv12" -c:v h264_qsv -f null
>>> /dev/null
>>>
>>>     test results:
>>>                 2 encoders  5 encoders 10 encoders
>>>     Improved       6%       4%         15%
>>>
>>> For Intel GPU acceleration case, 1 decode to N scaling, by QSV:
>>> ./ffmpeg -hwaccel qsv -c:v h264_qsv \
>>>     -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>>     -vf "scale_qsv=1280:720:format=nv12,hwdownload" -pix_fmt nv12 -f
>> null /dev/null \
>>>     -vf "scale_qsv=720:480:format=nv12,hwdownload" -pix_fmt nv12 -f
>>> null /dev/null
>>>
>>>     test results:
>>>                 2 scale  5 scale   10 scale
>>>     Improved       12%     21%        21%
>>>
>>> For CPU only 1 decode to N scaling:
>>> ./ffmpeg -i ~/Videos/1920x1080p_30.00_x264_qp28.h264 \
>>>     -vf "scale=1280:720" -pix_fmt nv12 -f null /dev/null \
>>>     -vf "scale=720:480" -pix_fmt nv12 -f null /dev/null
>>>
>>>     test results:
>>>                 2 scale  5 scale   10 scale
>>>     Improved       25%    107%       148%
>>>
>>> Signed-off-by: Wang, Shaofei <shaofei.wang@intel.com>
>>> Reviewed-by: Zhao, Jun <jun.zhao@intel.com>
>>> ---
>>>  fftools/ffmpeg.c        | 121
>> ++++++++++++++++++++++++++++++++++++++++++++----
>>>  fftools/ffmpeg.h        |  14 ++++++
>>>  fftools/ffmpeg_filter.c |   1 +
>>>  3 files changed, 128 insertions(+), 8 deletions(-)
>>
>> On a bit more review, I don't think this patch works at all.
>>
> It has been tested and verified by a lot of cases. More fate cases need to be covered now.
> 
>> The existing code is all written to be run serially.  This simplistic approach to
>> parallelising it falls down because many of those functions use variables
>> written in what were previously other functions called at different times but
>> have now become other threads, introducing undefined behaviour due to
>> data races.
>>
> Actually, this is not a patch to parallel every thing in the ffmpeg. It just thread the input filter
> of the filter graph(tend for simple filter graph), which is a simple way to improve N filter graph
> performance and also without introduce huge modification. So that there is still a lot of serial function call, differences
> are that each filter graph need to init its output stream instead of init all together and each
> filter graph will reap filters for its filter chain.

Indeed the existing encapsulation tries to keep things mostly separate, but in various places it accesses shared state which works fine in the serial case but fails when those parts are run in parallel.

Data races are undefined behaviour in C; introducing them is not acceptable.

>> To consider a single example (not the only one), the function
>> check_init_output_file() does not work at all after this change.  The test for
>> OutputStream initialisation (so that you run exactly once after all of the
>> output streams are ready) races with other threads setting those variables.
>> Since that's undefined behaviour you may get lucky sometimes and have the
>> output file initialisation run exactly once, but in general it will fail in unknown
>> ways.
>>
> 
> The check_init_output_file() should be responsible for the output file related with
> specified output stream which managed by each thread chain, that means even it
> called by different thread, the data setting are different. Let me double check.

Each output file can contain multiple streams - try a transcode with both audio and video filters.

(Incidentally, the patch as-is also crashes in this case if the transcode completes without any data from one of the streams.)

- Mark

Patch

diff --git a/fftools/ffmpeg.c b/fftools/ffmpeg.c
index 544f1a1..676c783 100644
--- a/fftools/ffmpeg.c
+++ b/fftools/ffmpeg.c
@@ -509,6 +509,15 @@  static void ffmpeg_cleanup(int ret)
                 }
                 av_fifo_freep(&fg->inputs[j]->ist->sub2video.sub_queue);
             }
+#if HAVE_THREADS
+            fg->inputs[j]->waited_frm = NULL;
+            av_frame_free(&fg->inputs[j]->input_frm);
+            pthread_mutex_lock(&fg->inputs[j]->process_mutex);
+            fg->inputs[j]->t_end = 1;
+            pthread_cond_signal(&fg->inputs[j]->process_cond);
+            pthread_mutex_unlock(&fg->inputs[j]->process_mutex);
+            pthread_join(fg->inputs[j]->abr_thread, NULL);
+#endif
             av_buffer_unref(&fg->inputs[j]->hw_frames_ctx);
             av_freep(&fg->inputs[j]->name);
             av_freep(&fg->inputs[j]);
@@ -1419,12 +1428,13 @@  static void finish_output_stream(OutputStream *ost)
  *
  * @return  0 for success, <0 for severe errors
  */
-static int reap_filters(int flush)
+static int reap_filters(int flush, InputFilter * ifilter)
 {
     AVFrame *filtered_frame = NULL;
     int i;
 
-    /* Reap all buffers present in the buffer sinks */
+    /* Reap all buffers present in the buffer sinks or just reap specified
+     * buffer which related with the filter graph who got ifilter as input*/
     for (i = 0; i < nb_output_streams; i++) {
         OutputStream *ost = output_streams[i];
         OutputFile    *of = output_files[ost->file_index];
@@ -1436,6 +1446,11 @@  static int reap_filters(int flush)
             continue;
         filter = ost->filter->filter;
 
+        if (ifilter) {
+            if (ifilter != output_streams[i]->filter->graph->inputs[0])
+                continue;
+        }
+
         if (!ost->initialized) {
             char error[1024] = "";
             ret = init_output_stream(ost, error, sizeof(error));
@@ -2179,7 +2194,8 @@  static int ifilter_send_frame(InputFilter *ifilter, AVFrame *frame)
             }
         }
 
-        ret = reap_filters(1);
+        ret = HAVE_THREADS ? reap_filters(1, ifilter) : reap_filters(1, NULL);
+
         if (ret < 0 && ret != AVERROR_EOF) {
             av_log(NULL, AV_LOG_ERROR, "Error while filtering: %s\n", av_err2str(ret));
             return ret;
@@ -2252,12 +2268,100 @@  static int decode(AVCodecContext *avctx, AVFrame *frame, int *got_frame, AVPacke
     return 0;
 }
 
+#if HAVE_THREADS
+static void *filter_pipeline(void *arg)
+{
+    InputFilter *fl = arg;
+    AVFrame *frm;
+    int ret;
+    while(1) {
+        pthread_mutex_lock(&fl->process_mutex);
+        while (fl->waited_frm == NULL && !fl->t_end)
+            pthread_cond_wait(&fl->process_cond, &fl->process_mutex);
+        pthread_mutex_unlock(&fl->process_mutex);
+
+        if (fl->t_end) break;
+
+        frm = fl->waited_frm;
+        ret = ifilter_send_frame(fl, frm);
+        if (ret == AVERROR_EOF)
+            ret = 0;
+        else if (ret < 0) {
+            av_log(NULL, AV_LOG_ERROR,
+                   "Failed to inject frame into filter network: %s\n", av_err2str(ret));
+        } else {
+            ret = reap_filters(0, fl);
+        }
+        fl->t_error = ret;
+
+        fl->waited_frm = NULL;
+        pthread_mutex_lock(&fl->finish_mutex);
+        pthread_cond_signal(&fl->finish_cond);
+        pthread_mutex_unlock(&fl->finish_mutex);
+    }
+    fl->waited_frm = NULL;
+    pthread_mutex_lock(&fl->finish_mutex);
+    pthread_cond_signal(&fl->finish_cond);
+    pthread_mutex_unlock(&fl->finish_mutex);
+    return fl;
+}
+#endif
+
 static int send_frame_to_filters(InputStream *ist, AVFrame *decoded_frame)
 {
-    int i, ret;
+    int i, ret = 0;
     AVFrame *f;
 
     av_assert1(ist->nb_filters > 0); /* ensure ret is initialized */
+#if HAVE_THREADS
+    for (i = 0; i < ist->nb_filters; i++) {
+        if (!ist->filters[i]->abr_thread_created) {
+            pthread_mutex_init(&ist->filters[i]->process_mutex, NULL);
+            pthread_mutex_init(&ist->filters[i]->finish_mutex, NULL);
+            pthread_cond_init(&ist->filters[i]->process_cond, NULL);
+            pthread_cond_init(&ist->filters[i]->finish_cond, NULL);
+            if ((ret = pthread_create(&ist->filters[i]->abr_thread, NULL, filter_pipeline,
+                            ist->filters[i]))) {
+                av_log(NULL, AV_LOG_ERROR,
+                        "abr pipeline pthread_create failed.\n");
+                return -1;
+            }
+            ist->filters[i]->input_frm = av_frame_alloc();
+            if (!ist->filters[i]->input_frm)
+                return AVERROR(ENOMEM);
+            ist->filters[i]->t_end = 0;
+            ist->filters[i]->t_error = 0;
+            ist->filters[i]->abr_thread_created = 1;
+        }
+
+        if (i < ist->nb_filters - 1) {
+            f = ist->filters[i]->input_frm;
+            ret = av_frame_ref(f, decoded_frame);
+            if (ret < 0)
+                return ret;
+        } else
+            f = decoded_frame;
+
+        pthread_mutex_lock(&ist->filters[i]->process_mutex);
+        ist->filters[i]->waited_frm = f;
+        pthread_cond_signal(&ist->filters[i]->process_cond);
+        pthread_mutex_unlock(&ist->filters[i]->process_mutex);
+    }
+
+    for (i = 0; i < ist->nb_filters; i++) {
+        pthread_mutex_lock(&ist->filters[i]->finish_mutex);
+        while(ist->filters[i]->waited_frm != NULL)
+            pthread_cond_wait(&ist->filters[i]->finish_cond,
+                    &ist->filters[i]->finish_mutex);
+        pthread_mutex_unlock(&ist->filters[i]->finish_mutex);
+    }
+    for (i = 0; i < ist->nb_filters; i++) {
+        if (ist->filters[i]->t_error < 0) {
+            ret = ist->filters[i]->t_error;
+            break;
+        }
+    }
+#else
     for (i = 0; i < ist->nb_filters; i++) {
         if (i < ist->nb_filters - 1) {
             f = ist->filter_frame;
@@ -2275,6 +2379,8 @@  static int send_frame_to_filters(InputStream *ist, AVFrame *decoded_frame)
             break;
         }
     }
+#endif
+
     return ret;
 }
 
@@ -2334,7 +2440,6 @@  static int decode_audio(InputStream *ist, AVPacket *pkt, int *got_output,
                                               (AVRational){1, avctx->sample_rate});
     ist->nb_samples = decoded_frame->nb_samples;
     err = send_frame_to_filters(ist, decoded_frame);
-
     av_frame_unref(ist->filter_frame);
     av_frame_unref(decoded_frame);
     return err < 0 ? err : ret;
@@ -4537,10 +4642,10 @@  static int transcode_from_filter(FilterGraph *graph, InputStream **best_ist)
     *best_ist = NULL;
     ret = avfilter_graph_request_oldest(graph->graph);
     if (ret >= 0)
-        return reap_filters(0);
+        return reap_filters(0, NULL);
 
     if (ret == AVERROR_EOF) {
-        ret = reap_filters(1);
+        ret = reap_filters(1, NULL);
         for (i = 0; i < graph->nb_outputs; i++)
             close_output_stream(graph->outputs[i]->ost);
         return ret;
@@ -4642,7 +4747,7 @@  static int transcode_step(void)
     if (ret < 0)
         return ret == AVERROR_EOF ? 0 : ret;
 
-    return reap_filters(0);
+    return HAVE_THREADS ? ret : reap_filters(0, NULL);
 }
 
 /*
diff --git a/fftools/ffmpeg.h b/fftools/ffmpeg.h
index eb1eaf6..43a11d4 100644
--- a/fftools/ffmpeg.h
+++ b/fftools/ffmpeg.h
@@ -253,6 +253,20 @@  typedef struct InputFilter {
 
     AVBufferRef *hw_frames_ctx;
 
+    // for abr pipeline
+    int abr_thread_created;
+#if HAVE_THREADS
+    AVFrame *waited_frm;
+    AVFrame *input_frm;
+    pthread_t abr_thread;
+    pthread_cond_t process_cond;
+    pthread_cond_t finish_cond;
+    pthread_mutex_t process_mutex;
+    pthread_mutex_t finish_mutex;
+    int t_end;
+    int t_error;
+#endif
+
     int eof;
 } InputFilter;
 
diff --git a/fftools/ffmpeg_filter.c b/fftools/ffmpeg_filter.c
index 6518d50..da80803 100644
--- a/fftools/ffmpeg_filter.c
+++ b/fftools/ffmpeg_filter.c
@@ -328,6 +328,7 @@  static void init_input_filter(FilterGraph *fg, AVFilterInOut *in)
 
     GROW_ARRAY(ist->filters, ist->nb_filters);
     ist->filters[ist->nb_filters - 1] = fg->inputs[fg->nb_inputs - 1];
+    ist->filters[ist->nb_filters - 1]->abr_thread_created = 0;
 }
 
 int init_complex_filtergraph(FilterGraph *fg)