[FFmpeg-devel] avfilter/vf_convolution: add 16-column operation for filter_column() to prepare for x86 SIMD.

Submitted by xujunzz@sjtu.edu.cn on Nov. 27, 2019, 2:55 p.m.

Details

Message ID 20191127145546.6873-1-xujunzz@sjtu.edu.cn
State New

Commit Message

xujunzz@sjtu.edu.cn Nov. 27, 2019, 2:55 p.m.
From: Xu Jun <xujunzz@sjtu.edu.cn>

In order to add x86 SIMD for filter_column(), I wrote a C function which processes 16 columns at a time.

Signed-off-by: Xu Jun <xujunzz@sjtu.edu.cn>
---
 libavfilter/vf_convolution.c          | 56 +++++++++++++++++++++++++++
 libavfilter/x86/vf_convolution_init.c | 23 +++++++++++
 2 files changed, 79 insertions(+)
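
To make the idea concrete, here is a minimal, self-contained sketch of a scalar kernel that filters a block of adjacent columns per call, modeled on the filter_column16() added by this patch. The names column_filter_block, clip_u8 and block_matches_single_columns are hypothetical, clip_u8 is a stand-in for av_clip_uint8(), and the tap pointers are set up by hand here rather than through the filter's setup callbacks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Clamp to 0..255; a stand-in for FFmpeg's av_clip_uint8(). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/*
 * Hypothetical block kernel: filters `ncols` adjacent columns over `height`
 * rows. taps[i] points at the first sample read by vertical tap i.
 */
void column_filter_block(uint8_t *dst, int height, float rdiv, float bias,
                         const int *matrix, const uint8_t *taps[],
                         int ncols, int radius, int dstride, int stride)
{
    for (int y = 0; y < height; y++) {
        for (int col = 0; col < ncols; col++) {
            int sum = 0;

            for (int i = 0; i < 2 * radius + 1; i++)
                sum += taps[i][y * stride + col] * matrix[i];

            dst[col] = clip_u8((int)(sum * rdiv + bias + 0.5f));
        }
        dst += dstride;
    }
}

/* Returns 1 if one 16-column call matches sixteen 1-column calls. */
int block_matches_single_columns(void)
{
    enum { W = 16, H = 5, RADIUS = 1, OUT_H = H - 2 * RADIUS };
    static const int matrix[3] = { 1, 1, 1 };
    uint8_t src[H * W], blocked[OUT_H * W], single[OUT_H * W];
    const uint8_t *taps[3];

    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++)
            src[r * W + c] = (uint8_t)(r * 10 + c);

    /* All 16 columns in one call. */
    for (int i = 0; i < 3; i++)
        taps[i] = src + i * W;
    column_filter_block(blocked, OUT_H, 1.0f / 3, 0.0f, matrix, taps,
                        W, RADIUS, W, W);

    /* Mean of rows 0..2 in column 0: (0 + 10 + 20) / 3 = 10. */
    assert(blocked[0] == 10);

    /* One column per call, as a reference. */
    for (int c = 0; c < W; c++) {
        for (int i = 0; i < 3; i++)
            taps[i] = src + i * W + c;
        column_filter_block(single + c, OUT_H, 1.0f / 3, 0.0f, matrix, taps,
                            1, RADIUS, W, W);
    }

    return memcmp(blocked, single, sizeof(blocked)) == 0;
}
```

With 8-bit samples, 16 adjacent columns are 16 consecutive bytes per tap, which is the width of one 128-bit vector load; that is presumably why the block width was chosen as 16.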

Comments

Carl Eugen Hoyos Nov. 27, 2019, 4:19 p.m.
On Wed, Nov 27, 2019 at 15:56, <xujunzz@sjtu.edu.cn> wrote:

> From: Xu Jun <xujunzz@sjtu.edu.cn>
>
> In order to add x86 SIMD for filter_column(), I wrote a C function which processes 16 columns at a time.

How does this perform compared to the existing C code?

Carl Eugen
xujunzz@sjtu.edu.cn Dec. 2, 2019, 2:42 a.m.
I'm sorry not to reply in time.

The performance of this C code is about 10% better than the existing C code.

It will have a bigger improvement after X86 SIMD optimizations.

Xu Jun

Steven Liu Dec. 2, 2019, 2:44 a.m.
> On Dec 2, 2019, at 10:42, 徐鋆 <xujunzz@sjtu.edu.cn> wrote:
> 
> I'm sorry not to reply in time.
> 
> The performance of this C code is about 10% better than the existing C code.
> 
> It will have a bigger improvement after X86 SIMD optimizations.

1. How did you test?

2. Please don't top-post: https://en.wikipedia.org/wiki/Top-posting
   Write your reply below the text you are responding to, not at the very top; otherwise readers can't tell which point you are answering.


Thanks
Steven
xujunzz@sjtu.edu.cn Dec. 2, 2019, 3:16 a.m.
Hi, Steven

----- Original Message -----
From: "Steven Liu" <lq@chinaffmpeg.org>
To: "FFmpeg development discussions and patches" <ffmpeg-devel@ffmpeg.org>
Cc: "Steven Liu" <lq@chinaffmpeg.org>
Sent: Monday, December 2, 2019, 10:44:48 AM
Subject: Re: [FFmpeg-devel] [PATCH] avfilter/vf_convolution: add 16-column operation for filter_column() to prepare for x86 SIMD.

> On Dec 2, 2019, at 10:42, 徐鋆 <xujunzz@sjtu.edu.cn> wrote:
> 
> I'm sorry not to reply in time.
> 
> The performance of this C code is about 10% better than the existing C code.
> 
> It will have a bigger improvement after X86 SIMD optimizations.

> 1. How did you test?

I tested using this command:

./ffmpeg_g -s 1280*720 -pix_fmt yuv420p -i test.yuv -vf convolution="1 2 3 4 5 6 7 8 9:1 2 3 4 5 6 7 8 9:1 2 3 4 5 6 7 8 9:1 2 3 4 5 6 7 8 9:1/45:1/45:1/45:1/45:1:2:3:4:column:column:column:column" -an -vframes 2000 -f null /dev/null 

The FPS increases from 329 to 365 on my local machine.
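
As a quick check, these two frame rates correspond to a relative speedup of just under 11%, in line with the "about 10%" figure:

```latex
\frac{365 - 329}{329} = \frac{36}{329} \approx 0.109 \quad (\text{about } 10.9\%)
```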

> 2. Please don't top-post: https://en.wikipedia.org/wiki/Top-posting
> Write your reply below the text you are responding to, not at the very top; otherwise readers can't tell which point you are answering.

Thank you for reminding me. I'm new here. Forgive me for not knowing the rules:)

Ruiling Song Dec. 2, 2019, 6:38 a.m.
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of xujunzz@sjtu.edu.cn
> Sent: Wednesday, November 27, 2019 10:56 PM
> To: ffmpeg-devel@ffmpeg.org
> Cc: xujunzz@sjtu.edu.cn
> Subject: [FFmpeg-devel] [PATCH] avfilter/vf_convolution: add 16-column operation for filter_column() to prepare for x86 SIMD.
>
> From: Xu Jun <xujunzz@sjtu.edu.cn>
>
> In order to add x86 SIMD for filter_column(), I wrote a C function which processes 16 columns at a time.
>
> Signed-off-by: Xu Jun <xujunzz@sjtu.edu.cn>
> ---
>  libavfilter/vf_convolution.c          | 56 +++++++++++++++++++++++++++
>  libavfilter/x86/vf_convolution_init.c | 23 +++++++++++
>  2 files changed, 79 insertions(+)
>
> diff --git a/libavfilter/vf_convolution.c b/libavfilter/vf_convolution.c
> index d022f1a04a..5291415d48 100644
> --- a/libavfilter/vf_convolution.c
> +++ b/libavfilter/vf_convolution.c
> @@ -520,6 +520,61 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
>              continue;
>          }
>
> +        if (mode == MATRIX_COLUMN && s->filter[plane] != filter_column){
> +            for (y = slice_start; y < slice_end - 16; y+=16) {

Please take care of the coding style: there should be whitespace between variables and operators.
Also, I think this change makes the code harder to maintain; let's avoid code duplication as much as we can.

> +                const int xoff = (y - slice_start) * bpc;
> +                const int yoff = radius * stride;
> +                for (x = 0; x < radius; x++) {
> +                    const int xoff = (y - slice_start) * bpc;
> +                    const int yoff = x * stride;
> +
> +                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
> +                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
> +                                    bias, matrix, c, 16, radius,
> +                                    dstride, stride);
> +                }
> +                s->setup[plane](radius, c, src, stride, radius, width, y, height, bpc);
> +                s->filter[plane](dst + yoff + xoff, sizew - 2 * radius,
> +                                rdiv, bias, matrix, c, 16, radius,
> +                                dstride, stride);
> +                for (x = sizew - radius; x < sizew; x++) {
> +                    const int xoff = (y - slice_start) * bpc;
> +                    const int yoff = x * stride;
> +
> +                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
> +                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
> +                                    bias, matrix, c, 16, radius,
> +                                    dstride, stride);
> +                }
> +            }
> +            if (y < slice_end){
> +                const int xoff = (y - slice_start) * bpc;
> +                const int yoff = radius * stride;
> +                for (x = 0; x < radius; x++) {
> +                    const int xoff = (y - slice_start) * bpc;
> +                    const int yoff = x * stride;
> +
> +                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
> +                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
> +                                    bias, matrix, c, slice_end - y, radius,
> +                                    dstride, stride);
> +                }
> +                s->setup[plane](radius, c, src, stride, radius, width, y, height, bpc);
> +                s->filter[plane](dst + yoff + xoff, sizew - 2 * radius,
> +                                rdiv, bias, matrix, c, slice_end - y, radius,
> +                                dstride, stride);
> +                for (x = sizew - radius; x < sizew; x++) {
> +                    const int xoff = (y - slice_start) * bpc;
> +                    const int yoff = x * stride;
> +
> +                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
> +                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
> +                                    bias, matrix, c, slice_end - y, radius,
> +                                    dstride, stride);
> +                }
> +            }
> +        }
> +        else {
>          for (y = slice_start; y < slice_end; y++) {
>              const int xoff = mode == MATRIX_COLUMN ? (y - slice_start) * bpc : radius * bpc;
>              const int yoff = mode == MATRIX_COLUMN ? radius * stride : 0;
> @@ -550,6 +605,7 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
>                  dst += dstride;
>          }
>      }
> +    }
>
>      return 0;
>  }
> diff --git a/libavfilter/x86/vf_convolution_init.c b/libavfilter/x86/vf_convolution_init.c
> index d1e8c90ceb..6b1c2f0e9f 100644
> --- a/libavfilter/x86/vf_convolution_init.c
> +++ b/libavfilter/x86/vf_convolution_init.c
> @@ -34,6 +34,27 @@ void ff_filter_row_sse4(uint8_t *dst, int width,
>                          const uint8_t *c[], int peak, int radius,
>                          int dstride, int stride);
>

This C code should not be in the x86-specific file.

Ruiling
> +static void filter_column16(uint8_t *dst, int height,
> +                          float rdiv, float bias, const int *const matrix,
> +                          const uint8_t *c[], int length, int radius,
> +                          int dstride, int stride)
> +{
> +    int y, off16;
> +
> +    for (y = 0; y < height; y++) {
> +        for (off16 = 0; off16 < length; off16++){
> +            int i, sum = 0;
> +
> +            for (i = 0; i < 2 * radius + 1; i++)
> +                sum += c[i][0 + y * stride + off16] * matrix[i];
> +
> +            sum = (int)(sum * rdiv + bias + 0.5f);
> +            dst[off16] = av_clip_uint8(sum);
> +        }
> +        dst += dstride;
> +    }
> +
> +}
>
>  av_cold void ff_convolution_init_x86(ConvolutionContext *s)
>  {
> @@ -51,6 +72,8 @@ av_cold void ff_convolution_init_x86(ConvolutionContext *s)
>                  if (EXTERNAL_SSE4(cpu_flags))
>                      s->filter[i] = ff_filter_row_sse4;
>          }
> +        if (s->mode[i] == MATRIX_COLUMN)
> +            s->filter[i] = filter_column16;
>      }
>  #endif
>  }
> --
> 2.17.1
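
The duplication pointed out in this review comes from writing the 16-column loop and its tail as two copies of the same body. One way to avoid that, sketched here under assumptions, is a single loop that clamps the block width on the last iteration. The names for_each_column_block, block_fn, Coverage, mark_columns and count_blocks are hypothetical and exist only for illustration; in filter_slice() the callback body would be the setup and filter calls from the patch.

```c
#include <assert.h>

#define BLOCK_COLS 16

/* Hypothetical callback type: process `ncols` columns starting at `first`. */
typedef void (*block_fn)(int first, int ncols, void *opaque);

/*
 * Walk [start, end) in blocks of at most BLOCK_COLS columns. The final,
 * possibly shorter block goes through the same call as the full ones,
 * so the loop body exists only once.
 */
void for_each_column_block(int start, int end, block_fn fn, void *opaque)
{
    for (int y = start; y < end; y += BLOCK_COLS) {
        const int remaining = end - y;
        const int ncols = remaining < BLOCK_COLS ? remaining : BLOCK_COLS;

        fn(y, ncols, opaque);
    }
}

/* Demo state: how often each column was visited, and the number of calls. */
typedef struct Coverage {
    unsigned char seen[64];
    int calls;
} Coverage;

/* Demo callback: marks every column of the block and counts the call. */
void mark_columns(int first, int ncols, void *opaque)
{
    Coverage *cov = opaque;

    for (int i = 0; i < ncols; i++)
        cov->seen[first + i]++;
    cov->calls++;
}

/*
 * Returns the number of blocks used for [start, end), or -1 if any column
 * inside the range was not visited exactly once or any column outside the
 * range was touched.
 */
int count_blocks(int start, int end)
{
    Coverage cov = { {0}, 0 };

    assert(start >= 0 && end <= 64);
    for_each_column_block(start, end, mark_columns, &cov);
    for (int i = 0; i < 64; i++)
        if (cov.seen[i] != (i >= start && i < end))
            return -1;
    return cov.calls;
}
```

A slice of 40 columns is covered by blocks of 16, 16 and 8 columns, all through the same callback, so no separate tail branch is needed.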
Carl Eugen Hoyos Dec. 2, 2019, 9:19 a.m.
On Mon, Dec 2, 2019 at 03:42, 徐鋆 <xujunzz@sjtu.edu.cn> wrote:

> I'm sorry not to reply in time.

Definitely in time!

> The performance of this C code is about 10% better than the existing C code.

Please add this to the commit message.

Carl Eugen

Patch

diff --git a/libavfilter/vf_convolution.c b/libavfilter/vf_convolution.c
index d022f1a04a..5291415d48 100644
--- a/libavfilter/vf_convolution.c
+++ b/libavfilter/vf_convolution.c
@@ -520,6 +520,61 @@  static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
             continue;
         }
 
+        if (mode == MATRIX_COLUMN && s->filter[plane] != filter_column){
+            for (y = slice_start; y < slice_end - 16; y+=16) {
+                const int xoff = (y - slice_start) * bpc;
+                const int yoff = radius * stride;
+                for (x = 0; x < radius; x++) {
+                    const int xoff = (y - slice_start) * bpc;
+                    const int yoff = x * stride;
+
+                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
+                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
+                                    bias, matrix, c, 16, radius,
+                                    dstride, stride);
+                }
+                s->setup[plane](radius, c, src, stride, radius, width, y, height, bpc);
+                s->filter[plane](dst + yoff + xoff, sizew - 2 * radius,
+                                rdiv, bias, matrix, c, 16, radius,
+                                dstride, stride);
+                for (x = sizew - radius; x < sizew; x++) {
+                    const int xoff = (y - slice_start) * bpc;
+                    const int yoff = x * stride;
+
+                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
+                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
+                                    bias, matrix, c, 16, radius,
+                                    dstride, stride);
+                }
+            }
+            if (y < slice_end){
+                const int xoff = (y - slice_start) * bpc;
+                const int yoff = radius * stride;
+                for (x = 0; x < radius; x++) {
+                    const int xoff = (y - slice_start) * bpc;
+                    const int yoff = x * stride;
+
+                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
+                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
+                                    bias, matrix, c, slice_end - y, radius,
+                                    dstride, stride);
+                }
+                s->setup[plane](radius, c, src, stride, radius, width, y, height, bpc);
+                s->filter[plane](dst + yoff + xoff, sizew - 2 * radius,
+                                rdiv, bias, matrix, c, slice_end - y, radius,
+                                dstride, stride);
+                for (x = sizew - radius; x < sizew; x++) {
+                    const int xoff = (y - slice_start) * bpc;
+                    const int yoff = x * stride;
+
+                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
+                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
+                                    bias, matrix, c, slice_end - y, radius,
+                                    dstride, stride);
+                }
+            }
+        }
+        else {
         for (y = slice_start; y < slice_end; y++) {
             const int xoff = mode == MATRIX_COLUMN ? (y - slice_start) * bpc : radius * bpc;
             const int yoff = mode == MATRIX_COLUMN ? radius * stride : 0;
@@ -550,6 +605,7 @@  static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
                 dst += dstride;
         }
     }
+    }
 
     return 0;
 }
diff --git a/libavfilter/x86/vf_convolution_init.c b/libavfilter/x86/vf_convolution_init.c
index d1e8c90ceb..6b1c2f0e9f 100644
--- a/libavfilter/x86/vf_convolution_init.c
+++ b/libavfilter/x86/vf_convolution_init.c
@@ -34,6 +34,27 @@  void ff_filter_row_sse4(uint8_t *dst, int width,
                         const uint8_t *c[], int peak, int radius,
                         int dstride, int stride);
 
+static void filter_column16(uint8_t *dst, int height,
+                          float rdiv, float bias, const int *const matrix,
+                          const uint8_t *c[], int length, int radius,
+                          int dstride, int stride)
+{
+    int y, off16;
+
+    for (y = 0; y < height; y++) {
+        for (off16 = 0; off16 < length; off16++){
+            int i, sum = 0;
+
+            for (i = 0; i < 2 * radius + 1; i++)
+                sum += c[i][0 + y * stride + off16] * matrix[i];
+
+            sum = (int)(sum * rdiv + bias + 0.5f);
+            dst[off16] = av_clip_uint8(sum);
+        }
+        dst += dstride;
+    }
+
+}
 
 av_cold void ff_convolution_init_x86(ConvolutionContext *s)
 {
@@ -51,6 +72,8 @@  av_cold void ff_convolution_init_x86(ConvolutionContext *s)
                 if (EXTERNAL_SSE4(cpu_flags))
                     s->filter[i] = ff_filter_row_sse4;
         }
+        if (s->mode[i] == MATRIX_COLUMN)
+            s->filter[i] = filter_column16;
     }
 #endif
 }