diff mbox series

[FFmpeg-devel,1/3,GSoC] Add multithread function for dnn_backend_native_layer_conv2d.c

Message ID 20200831170341.879003-1-xujunzz@sjtu.edu.cn
State New
Series [FFmpeg-devel,1/3,GSoC] Add multithread function for dnn_backend_native_layer_conv2d.c

Checks

Context Check Description
andriy/default pending
andriy/make success Make finished
andriy/make_fate success Make fate finished

Commit Message

Xu Jun Aug. 31, 2020, 5:03 p.m. UTC
From: Xu Jun <xujunzz@sjtu.edu.cn>

Use pthread to multithread dnn_execute_layer_conv2d.
Can be tested with command "./ffmpeg_g -i input.png -vf \
format=yuvj420p,dnn_processing=dnn_backend=native:model= \
espcn.model:input=x:output=y -y sr_native.jpg -benchmark"

before patch: utime=11.238s stime=0.005s rtime=11.248s
after patch:  utime=20.817s stime=0.047s rtime=1.051s

Signed-off-by: Xu Jun <xujunzz@sjtu.edu.cn>
---
 .../dnn/dnn_backend_native_layer_conv2d.c     | 95 ++++++++++++++++---
 1 file changed, 84 insertions(+), 11 deletions(-)

Comments

Mark Thompson Aug. 31, 2020, 8:41 p.m. UTC | #1
On 31/08/2020 18:03, xujunzz@sjtu.edu.cn wrote:
> From: Xu Jun <xujunzz@sjtu.edu.cn>
> 
> Use pthread to multithread dnn_execute_layer_conv2d.
> Can be tested with command "./ffmpeg_g -i input.png -vf \
> format=yuvj420p,dnn_processing=dnn_backend=native:model= \
> espcn.model:input=x:output=y -y sr_native.jpg -benchmark"
> 
> before patch: utime=11.238s stime=0.005s rtime=11.248s
> after patch:  utime=20.817s stime=0.047s rtime=1.051s

Can you explain why it uses almost twice as much total CPU time after the patch?  That seems rather more than can be explained away as scheduling overhead.

If it's actually doing significantly more then maybe you want to document somewhere that enabling threading will improve latency at the cost of throughput.

- Mark
Xu Jun Sept. 1, 2020, 2:35 p.m. UTC | #2
Hi, Mark

----- Original Message -----
> From: "Mark Thompson" <sw@jkqxz.net>
> To: "FFmpeg development discussions and patches" <ffmpeg-devel@ffmpeg.org>
> Sent: Tuesday, September 1, 2020 4:41:06 AM
> Subject: Re: [FFmpeg-devel] [PATCH 1/3][GSoC] Add mutithread function for dnn_backend_native_layer_conv2d.c

> On 31/08/2020 18:03, xujunzz@sjtu.edu.cn wrote:
>> From: Xu Jun <xujunzz@sjtu.edu.cn>
>> 
>> Use pthread to multithread dnn_execute_layer_conv2d.
>> Can be tested with command "./ffmpeg_g -i input.png -vf \
>> format=yuvj420p,dnn_processing=dnn_backend=native:model= \
>> espcn.model:input=x:output=y -y sr_native.jpg -benchmark"
>> 
>> before patch: utime=11.238s stime=0.005s rtime=11.248s
>> after patch:  utime=20.817s stime=0.047s rtime=1.051s
> 
> Can you explain why it uses almost twice as much total CPU time after the patch?
> That seems rather more than can be explained away as scheduling overhead.
> 
> If it's actually doing significantly more then maybe you want to document
> somewhere that enabling threading will improve latency at the cost of
> throughput.

I have done some tests and found that utime is strongly correlated with CPU Hyper-Threading.

When I turn off Hyper-Threading with the command "echo off > /sys/devices/system/cpu/smt/control" as root, the utime stays stable no matter how many threads I create, and is the same as before the patch.

When Hyper-Threading is on, once the number of threads I create gets close to (or exceeds) the number of physical cores my CPU has, the utime grows accordingly. When I use as many threads as there are logical cores, the utime is twice that of before the patch.

Therefore, I think Hyper-Threading doubles the number of logical cores while the computing power is not doubled. And since ffmpeg's utime sums the runtime over all logical cores, it comes out as twice that of before the patch.

In the next version, I will expose an option for the user to choose how many threads the native backend uses. I plan to set the default thread count to the number of physical cores minus one, to get better performance without increasing utime much on platforms that support Hyper-Threading.

As for the rtime, setting the thread count to the number of logical cores minus one gives about a 20%-30% performance improvement over setting it to the number of physical cores minus one in my tests.

- Xu Jun

> 
> - Mark
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
Paul B Mahol Sept. 1, 2020, 2:46 p.m. UTC | #3
On 9/1/20, Xu Jun <xujunzz@sjtu.edu.cn> wrote:
> Hi, Mark
>
> ----- Original Message -----
>> From: "Mark Thompson" <sw@jkqxz.net>
>> To: "FFmpeg development discussions and patches" <ffmpeg-devel@ffmpeg.org>
>> Sent: Tuesday, September 1, 2020 4:41:06 AM
>> Subject: Re: [FFmpeg-devel] [PATCH 1/3][GSoC] Add mutithread function for
>> dnn_backend_native_layer_conv2d.c
>
>> On 31/08/2020 18:03, xujunzz@sjtu.edu.cn wrote:
>>> From: Xu Jun <xujunzz@sjtu.edu.cn>
>>>
>>> Use pthread to multithread dnn_execute_layer_conv2d.
>>> Can be tested with command "./ffmpeg_g -i input.png -vf \
>>> format=yuvj420p,dnn_processing=dnn_backend=native:model= \
>>> espcn.model:input=x:output=y -y sr_native.jpg -benchmark"
>>>
>>> before patch: utime=11.238s stime=0.005s rtime=11.248s
>>> after patch:  utime=20.817s stime=0.047s rtime=1.051s
>>
>> Can you explain why it uses almost twice as much total CPU time after the
>> patch?
>> That seems rather more than can be explained away as scheduling overhead.
>>
>> If it's actually doing significantly more then maybe you want to document
>> somewhere that enabling threading will improve latency at the cost of
>> throughput.
>
> I have done some test and find that utime is strongly correlated with CPU
> HyperThreading technology.
>
> When I turn off my CPU HyperThreading technology using command "echo off >
> /sys/devices/system/cpu/smt/control" in root user, the utime gets stable
> whatever the number of threads I have created, and is same to that before
> patch.
>
> When CPU HyperThreading technology is on, once the number of threads I
> create gets close to physical cores' number my cpu has, or even bigger, the
> utime will get bigger simultaneously. When I use as many threads as the
> logical cores' number of my cpu, the utime will be twice of that before
> patch.
>
> Therefore, I think HyperThreading technology make the logical cores twice
> the physical cores while the counting power is not twiced. And for ffmpeg
> utime, it sums all logical cores' runtime. So it seems to be twice of that
> before patch.
>
> In the next version, I will open an API for user to choose how many threads
> to use in native backend. And I'm going to set the default threads number to
> physical cores' number - 1 in order to get better performance while not
> increasing utime much on the plantforms which support HyperThreading.

The -threads option is already available for filters that use slice threading.

Make sure that your threads do not share the same memory for reading/writing.

>
> As for the rtime, setting threads' number to logical cores - 1 will get
> about 20%-30% performance improvement over setting threads' number to
> physical cores - 1 in my test.
>
> - Xu Jun
>
>>
>> - Mark
Xu Jun Sept. 2, 2020, 1:55 p.m. UTC | #4
Hi, Paul

----- Original Message -----
> From: "Paul B Mahol" <onemda@gmail.com>
> To: "FFmpeg development discussions and patches" <ffmpeg-devel@ffmpeg.org>
> Sent: Tuesday, September 1, 2020 10:46:54 PM
> Subject: Re: [FFmpeg-devel] [PATCH 1/3][GSoC] Add mutithread function for	dnn_backend_native_layer_conv2d.c

> On 9/1/20, Xu Jun <xujunzz@sjtu.edu.cn> wrote:
>> Hi, Mark
>>
>> ----- Original Message -----
>>> From: "Mark Thompson" <sw@jkqxz.net>
>>> To: "FFmpeg development discussions and patches" <ffmpeg-devel@ffmpeg.org>
>>> Sent: Tuesday, September 1, 2020 4:41:06 AM
>>> Subject: Re: [FFmpeg-devel] [PATCH 1/3][GSoC] Add mutithread function for
>>> dnn_backend_native_layer_conv2d.c
>>
>>> On 31/08/2020 18:03, xujunzz@sjtu.edu.cn wrote:
>>>> From: Xu Jun <xujunzz@sjtu.edu.cn>
>>>>
>>>> Use pthread to multithread dnn_execute_layer_conv2d.
>>>> Can be tested with command "./ffmpeg_g -i input.png -vf \
>>>> format=yuvj420p,dnn_processing=dnn_backend=native:model= \
>>>> espcn.model:input=x:output=y -y sr_native.jpg -benchmark"
>>>>
>>>> before patch: utime=11.238s stime=0.005s rtime=11.248s
>>>> after patch:  utime=20.817s stime=0.047s rtime=1.051s
>>>
>>> Can you explain why it uses almost twice as much total CPU time after the
>>> patch?
>>> That seems rather more than can be explained away as scheduling overhead.
>>>
>>> If it's actually doing significantly more then maybe you want to document
>>> somewhere that enabling threading will improve latency at the cost of
>>> throughput.
>>
>> I have done some test and find that utime is strongly correlated with CPU
>> HyperThreading technology.
>>
>> When I turn off my CPU HyperThreading technology using command "echo off >
>> /sys/devices/system/cpu/smt/control" in root user, the utime gets stable
>> whatever the number of threads I have created, and is same to that before
>> patch.
>>
>> When CPU HyperThreading technology is on, once the number of threads I
>> create gets close to physical cores' number my cpu has, or even bigger, the
>> utime will get bigger simultaneously. When I use as many threads as the
>> logical cores' number of my cpu, the utime will be twice of that before
>> patch.
>>
>> Therefore, I think HyperThreading technology make the logical cores twice
>> the physical cores while the counting power is not twiced. And for ffmpeg
>> utime, it sums all logical cores' runtime. So it seems to be twice of that
>> before patch.
>>
>> In the next version, I will open an API for user to choose how many threads
>> to use in native backend. And I'm going to set the default threads number to
>> physical cores' number - 1 in order to get better performance while not
>> increasing utime much on the plantforms which support HyperThreading.
> 
> -threads option is already available for filters that use slice threads.

Actually, the native backend of the DNN module in libavfilter does not support slice threading, so it does not use the -threads option. The thread function I added only takes effect in the conv2d layer of DNN's native backend, and is not the same as slice threading at the filter level.

> 
> Make sure that your threads do not share same memory for reading/writting.

I will carefully think about the timing of memory accesses while the program runs, and run adequate tests to avoid such bugs.

- Xu Jun

> 
>>
>> As for the rtime, setting threads' number to logical cores - 1 will get
>> about 20%-30% performance improvement over setting threads' number to
>> physical cores - 1 in my test.
>>
>> - Xu Jun
>>
>>>
>>> - Mark

Patch

diff --git a/libavfilter/dnn/dnn_backend_native_layer_conv2d.c b/libavfilter/dnn/dnn_backend_native_layer_conv2d.c
index d079795bf8..570b974052 100644
--- a/libavfilter/dnn/dnn_backend_native_layer_conv2d.c
+++ b/libavfilter/dnn/dnn_backend_native_layer_conv2d.c
@@ -19,10 +19,23 @@ 
  */
 
 #include "libavutil/avassert.h"
+#include "libavutil/thread.h"
+#include "libavutil/cpu.h"
 #include "dnn_backend_native_layer_conv2d.h"
 
 #define CLAMP_TO_EDGE(x, w) ((x) < 0 ? 0 : ((x) >= (w) ? (w - 1) : (x)))
 
+//struct to pass parameters
+typedef struct thread_data{
+    DnnOperand *operands;
+    const int32_t *input_operand_indexes;
+    int32_t output_operand_index;
+    const void *parameters;
+    NativeContext *ctx;
+    int32_t thread_num;
+    int32_t thread_index;
+} thread_data;
+
 int dnn_load_layer_conv2d(Layer *layer, AVIOContext *model_file_context, int file_size, int operands_num)
 {
     ConvolutionalParams *conv_params;
@@ -88,17 +101,27 @@  int dnn_load_layer_conv2d(Layer *layer, AVIOContext *model_file_context, int fil
     return dnn_size;
 }
 
-int dnn_execute_layer_conv2d(DnnOperand *operands, const int32_t *input_operand_indexes,
-                             int32_t output_operand_index, const void *parameters, NativeContext *ctx)
+static void * dnn_execute_layer_conv2d_thread(void *threadarg)
 {
+    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
+    //use mutex to protect thread_index
+
+    //pass parameters
+    struct thread_data *thread_data = (struct thread_data *)threadarg;
+    DnnOperand *operands = thread_data->operands;
+
+    int thread_stride;
+    int thread_start;
+    int thread_end;
+
     float *output;
-    int32_t input_operand_index = input_operand_indexes[0];
+    int32_t input_operand_index = thread_data->input_operand_indexes[0];
     int number = operands[input_operand_index].dims[0];
     int height = operands[input_operand_index].dims[1];
     int width = operands[input_operand_index].dims[2];
     int channel = operands[input_operand_index].dims[3];
     const float *input = operands[input_operand_index].data;
-    const ConvolutionalParams *conv_params = (const ConvolutionalParams *)parameters;
+    const ConvolutionalParams *conv_params = (const ConvolutionalParams *)(thread_data->parameters);
 
     int radius = conv_params->kernel_size >> 1;
     int src_linesize = width * conv_params->input_num;
@@ -106,7 +129,7 @@  int dnn_execute_layer_conv2d(DnnOperand *operands, const int32_t *input_operand_
     int filter_size = conv_params->kernel_size * filter_linesize;
     int pad_size = (conv_params->padding_method == VALID) ? (conv_params->kernel_size - 1) / 2 * conv_params->dilation : 0;
 
-    DnnOperand *output_operand = &operands[output_operand_index];
+    DnnOperand *output_operand = &operands[thread_data->output_operand_index];
     output_operand->dims[0] = number;
     output_operand->dims[1] = height - pad_size * 2;
     output_operand->dims[2] = width - pad_size * 2;
@@ -114,19 +137,30 @@  int dnn_execute_layer_conv2d(DnnOperand *operands, const int32_t *input_operand_
     output_operand->data_type = operands[input_operand_index].data_type;
     output_operand->length = calculate_operand_data_length(output_operand);
     if (output_operand->length <= 0) {
-        av_log(ctx, AV_LOG_ERROR, "The output data length overflow\n");
-        return DNN_ERROR;
+        av_log(thread_data->ctx, AV_LOG_ERROR, "The output data length overflow\n");
+        return (void *)DNN_ERROR;
     }
     output_operand->data = av_realloc(output_operand->data, output_operand->length);
     if (!output_operand->data) {
-        av_log(ctx, AV_LOG_ERROR, "Failed to reallocate memory for output\n");
-        return DNN_ERROR;
+        av_log(thread_data->ctx, AV_LOG_ERROR, "Failed to reallocate memory for output\n");
+        return (void *)DNN_ERROR;
     }
+
+    //calculate area for this thread
+    thread_stride = (height - pad_size * 2) / thread_data->thread_num;
+    pthread_mutex_lock(&mtx);
+    thread_start = thread_stride * thread_data->thread_index + pad_size;
+    thread_end = (thread_data->thread_index == thread_data->thread_num - 1) ? (height - pad_size) : (thread_start + thread_stride);
+    thread_data->thread_index += 1;
+    pthread_mutex_unlock(&mtx);
+
     output = output_operand->data;
+    //calculate output start pos for this thread
+    output += (conv_params->output_num) * (width - 2 * pad_size) * (thread_start - pad_size);
 
     av_assert0(channel == conv_params->input_num);
 
-    for (int y = pad_size; y < height - pad_size; ++y) {
+    for (int y = thread_start; y < thread_end; ++y) {
         for (int x = pad_size; x < width - pad_size; ++x) {
             for (int n_filter = 0; n_filter < conv_params->output_num; ++n_filter) {
                 if (conv_params->has_bias)
@@ -174,5 +208,44 @@  int dnn_execute_layer_conv2d(DnnOperand *operands, const int32_t *input_operand_
             output += conv_params->output_num;
         }
     }
-    return 0;
+    return (void *)0;
+}
+
+
+int dnn_execute_layer_conv2d(DnnOperand *operands, const int32_t *input_operand_indexes,
+                             int32_t output_operand_index, const void *parameters, NativeContext *ctx)
+{
+    //get cpu available cores, -1 for higher efficiency
+    const int thread_num = av_cpu_count() - 1;
+    pthread_t *thread_id = av_malloc(thread_num * sizeof(pthread_t));
+    void *res;
+    int error_flag = 0;
+
+    //struct used to pass parameters
+    struct thread_data *thread_data;
+    thread_data = av_malloc(sizeof(*thread_data));
+    thread_data->operands = operands;
+    thread_data->input_operand_indexes = input_operand_indexes;
+    thread_data->output_operand_index = output_operand_index;
+    thread_data->parameters = parameters;
+    thread_data->ctx = ctx;
+    thread_data->thread_num = thread_num;
+    thread_data->thread_index = 0;
+
+    //create threads
+    for (int i = 0; i < thread_num; i++){
+        pthread_create(&thread_id[i], NULL, dnn_execute_layer_conv2d_thread, (void *)thread_data);
+    }
+
+    //join threads, res gets function return
+    for (int i = 0; i < thread_num; i++){
+        pthread_join(thread_id[i], &res);
+        if ((int)res != 0)
+            error_flag = (int)res;
+    }
+
+    //release memory
+    av_free(thread_id);
+    av_free(thread_data);
+    return error_flag;
 }