From patchwork Mon Dec  2 07:23:46 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: chen <chenm003@163.com>
X-Patchwork-Id: 16522
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
X-Original-To: patchwork@ffaux-bg.ffmpeg.org
Delivered-To: patchwork@ffaux-bg.ffmpeg.org
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by ffaux.localdomain (Postfix) with ESMTP id 6A9A744AC92
	for <patchwork@ffaux-bg.ffmpeg.org>;
	Mon,  2 Dec 2019 09:24:36 +0200 (EET)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 480FE68AF0B;
	Mon,  2 Dec 2019 09:24:36 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from m13-60.163.com (m13-60.163.com [220.181.13.60])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id BD55368AA15
	for <ffmpeg-devel@ffmpeg.org>; Mon,  2 Dec 2019 09:24:28 +0200 (EET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=163.com;
	s=s110527; h=Date:From:Subject:MIME-Version:Message-ID; bh=9VLq1
	0v34QevSno8eNbl9mpRHk606tJv+UTHlBZbWXE=; b=lc8vMp2N6NWx107EYbUvi
	gcoIhe+dtNzj517zb8RLKgLzkMxqDsspUmKOrzR4IQ84lEapy98YrLgmWhOKY/IQ
	Y/UxnM6sbxiQEN55fq9fzOgrVnf/oQmw7ENYSr+UgBO6JmDHsoQ9DJHULzq9BZbl
	YwLZxjfoT2w5kTfb+A97tg=
Received: from chenm003$163.com ( [103.107.216.230] ) by
	ajax-webmail-wmsvr60 (Coremail) ; Mon, 2 Dec 2019 15:23:46 +0800 (CST)
X-Originating-IP: [103.107.216.230]
Date: Mon, 2 Dec 2019 15:23:46 +0800 (CST)
From: chen  <chenm003@163.com>
To: "FFmpeg development discussions and patches" <ffmpeg-devel@ffmpeg.org>
X-Priority: 3
X-Mailer: Coremail Webmail Server Version XT5.0.10 build 20190724(ac680a23)
	Copyright (c) 2002-2019 www.mailtech.cn 163com
In-Reply-To: 
 <148B1B7A67D1C24B9EF0BE42EA497706851F5CAC@SHSMSX103.ccr.corp.intel.com>
References: <20191127145546.6873-1-xujunzz@sjtu.edu.cn>
	<148B1B7A67D1C24B9EF0BE42EA497706851F5CAC@SHSMSX103.ccr.corp.intel.com>
MIME-Version: 1.0
Message-ID: <680b05e5.8a2c.16ec57e6a3b.Coremail.chenm003@163.com>
X-Coremail-Locale: zh_CN
X-CM-TRANSID: PMGowAAHPokCvORdqmUyAQ--.57487W
X-CM-SenderInfo: xfkh0zqqqtqiywtou0bp/1tbiHQJ-nFSIjPLjGgAAsn
X-Coremail-Antispam: 1U5529EdanIXcx71UUUUU7vcSsGvfC2KfnxnUU==
X-Content-Filtered-By: Mailman/MimeDel 2.1.20
Subject: Re: [FFmpeg-devel] [PATCH] avfilter/vf_convolution: add
	16-column	operation for filter_column() to prepare for x86 SIMD.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches
	<ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

I have a small suggestion about filter_column16(..) [the function].


First, the function's name is easily confused with filter16_column(..).


Second, the function's algorithm walks across the source rows (c[i]) for every output pixel. That means extra address calculations and poorer cache behaviour, and their cost may exceed the cost of the arithmetic itself.
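
To illustrate the point about loop order, here is a minimal, self-contained sketch of the two variants for a single output row. The names column_pixel_major, column_tap_major and clip_u8 are made up for this mail and are not FFmpeg functions; clip_u8 stands in for av_clip_uint8(), and the y * stride offset is left out. Both functions should produce the same output; the second one reads each c[i] contiguously and keeps up to 16 accumulators, which is the idea behind the patch at the end of this mail.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for av_clip_uint8(). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Pixel-major order: for every output pixel, visit all taps.
 * Each inner step touches a different source row c[i]. */
static void column_pixel_major(uint8_t *dst, const uint8_t *const c[],
                               const int *matrix, int taps, int length,
                               float rdiv, float bias)
{
    for (int x = 0; x < length; x++) {
        int sum = 0;

        for (int i = 0; i < taps; i++)
            sum += c[i][x] * matrix[i];
        dst[x] = clip_u8((int)(sum * rdiv + bias + 0.5f));
    }
}

/* Tap-major order: for every tap, sweep up to 16 adjacent pixels.
 * The inner loop reads c[i] contiguously and matrix[i] is loop-invariant. */
static void column_tap_major(uint8_t *dst, const uint8_t *const c[],
                             const int *matrix, int taps, int length,
                             float rdiv, float bias)
{
    int sum[16];

    assert(length <= 16); /* this sketch handles at most 16 pixels per call */
    memset(sum, 0, sizeof(sum));

    for (int i = 0; i < taps; i++)
        for (int x = 0; x < length; x++)
            sum[x] += c[i][x] * matrix[i];

    for (int x = 0; x < length; x++)
        dst[x] = clip_u8((int)(sum[x] * rdiv + bias + 0.5f));
}
```

Whether the tap-major form is actually faster depends on the compiler and the flags used, so it is worth measuring on the real filter rather than trusting the sketch.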


To make this clearer, I am including my toy version here. I verified my patch with the command line below:


 ./ffmpeg -s 1280*720 -pix_fmt yuv420p -i ~/git/sister_720x1280.yuv -vf convolution="1 2 3 4 5 6 7 8 9:1 2 3 4 5 6 7 8 9:1 2 3 4 5 6 7 8 9:1 2 3 4 5 6 7 8 9:1/45:1/45:1/45:1/45:1:2:3:4:column:column:column:column" -an -vframes 2000 -benchmark -f null /dev/null


The results:
Original version: utime=7.359s stime=0.138s rtime=1.664s
Song version:     utime=5.320s stime=0.133s rtime=1.250s
My version:       utime=2.930s stime=0.122s rtime=0.794s


PS: since the function processes up to 16 pixels at a time, if we split the code path into a 16-pixel case and a non-16 case, I get 1.934s here.
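
A rough sketch of what such a split could look like is given below. It is not the code that produced the 1.934s figure; the names column_fixed16, column_tail, column_dispatch and clip_u8 are hypothetical, clip_u8 stands in for av_clip_uint8(), and only one output row is handled. The assumption is that a constant trip count of 16 gives the compiler a better chance to unroll or vectorize the inner loop, while the generic path handles the leftover pixels.

```c
#include <stdint.h>

/* Hypothetical stand-in for av_clip_uint8(). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Fast path: the trip count of the pixel loops is the constant 16. */
static void column_fixed16(uint8_t *dst, const uint8_t *const c[],
                           const int *matrix, int taps,
                           float rdiv, float bias)
{
    int sum[16] = { 0 };

    for (int i = 0; i < taps; i++)
        for (int x = 0; x < 16; x++)
            sum[x] += c[i][x] * matrix[i];

    for (int x = 0; x < 16; x++)
        dst[x] = clip_u8((int)(sum[x] * rdiv + bias + 0.5f));
}

/* Generic path for the remaining pixels (length < 16). */
static void column_tail(uint8_t *dst, const uint8_t *const c[],
                        const int *matrix, int taps, int length,
                        float rdiv, float bias)
{
    for (int x = 0; x < length; x++) {
        int sum = 0;

        for (int i = 0; i < taps; i++)
            sum += c[i][x] * matrix[i];
        dst[x] = clip_u8((int)(sum * rdiv + bias + 0.5f));
    }
}

/* Pick the path once per call instead of testing length in the inner loop. */
static void column_dispatch(uint8_t *dst, const uint8_t *const c[],
                            const int *matrix, int taps, int length,
                            float rdiv, float bias)
{
    if (length == 16)
        column_fixed16(dst, c, matrix, taps, rdiv, bias);
    else
        column_tail(dst, c, matrix, taps, length, rdiv, bias);
}
```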


My patch is based on today's HEAD; I have also resolved the merge conflict in Song's patch.


************ Patch Start ********************
************ End ********************


At 2019-12-02 14:38:04, "Song, Ruiling" <ruiling.song@intel.com> wrote:
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
>> xujunzz@sjtu.edu.cn
>> Sent: Wednesday, November 27, 2019 10:56 PM
>> To: ffmpeg-devel@ffmpeg.org
>> Cc: xujunzz@sjtu.edu.cn
>> Subject: [FFmpeg-devel] [PATCH] avfilter/vf_convolution: add 16-column
>> operation for filter_column() to prepare for x86 SIMD.
>> 
>> From: Xu Jun <xujunzz@sjtu.edu.cn>
>> 
>> In order to add x86 SIMD for filter_column(), I write a C function which
>> processes 16 columns at a time.
>> 
>> Signed-off-by: Xu Jun <xujunzz@sjtu.edu.cn>
>> ---
>>  libavfilter/vf_convolution.c          | 56 +++++++++++++++++++++++++++
>>  libavfilter/x86/vf_convolution_init.c | 23 +++++++++++
>>  2 files changed, 79 insertions(+)
>> 
>> diff --git a/libavfilter/vf_convolution.c b/libavfilter/vf_convolution.c
>> index d022f1a04a..5291415d48 100644
>> --- a/libavfilter/vf_convolution.c
>> +++ b/libavfilter/vf_convolution.c
>> @@ -520,6 +520,61 @@ static int filter_slice(AVFilterContext *ctx, void *arg,
>> int jobnr, int nb_jobs)
>>              continue;
>>          }
>> 
>> +        if (mode == MATRIX_COLUMN && s->filter[plane] != filter_column){
>> +            for (y = slice_start; y < slice_end - 16; y+=16) {
>Please take care of the coding style: there should be whitespace between variables and operators.
>Also, I think this change makes the code harder to maintain; let's avoid code duplication as much as we can.
>> +                const int xoff = (y - slice_start) * bpc;
>> +                const int yoff = radius * stride;
>> +                for (x = 0; x < radius; x++) {
>> +                    const int xoff = (y - slice_start) * bpc;
>> +                    const int yoff = x * stride;
>> +
>> +                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
>> +                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
>> +                                    bias, matrix, c, 16, radius,
>> +                                    dstride, stride);
>> +                }
>> +                s->setup[plane](radius, c, src, stride, radius, width, y, height, bpc);
>> +                s->filter[plane](dst + yoff + xoff, sizew - 2 * radius,
>> +                                rdiv, bias, matrix, c, 16, radius,
>> +                                dstride, stride);
>> +                for (x = sizew - radius; x < sizew; x++) {
>> +                    const int xoff = (y - slice_start) * bpc;
>> +                    const int yoff = x * stride;
>> +
>> +                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
>> +                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
>> +                                    bias, matrix, c, 16, radius,
>> +                                    dstride, stride);
>> +                }
>> +            }
>> +            if (y < slice_end){
>> +                const int xoff = (y - slice_start) * bpc;
>> +                const int yoff = radius * stride;
>> +                for (x = 0; x < radius; x++) {
>> +                    const int xoff = (y - slice_start) * bpc;
>> +                    const int yoff = x * stride;
>> +
>> +                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
>> +                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
>> +                                    bias, matrix, c, slice_end - y, radius,
>> +                                    dstride, stride);
>> +                }
>> +                s->setup[plane](radius, c, src, stride, radius, width, y, height, bpc);
>> +                s->filter[plane](dst + yoff + xoff, sizew - 2 * radius,
>> +                                rdiv, bias, matrix, c, slice_end - y, radius,
>> +                                dstride, stride);
>> +                for (x = sizew - radius; x < sizew; x++) {
>> +                    const int xoff = (y - slice_start) * bpc;
>> +                    const int yoff = x * stride;
>> +
>> +                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
>> +                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
>> +                                    bias, matrix, c, slice_end - y, radius,
>> +                                    dstride, stride);
>> +                }
>> +            }
>> +        }
>> +        else {
>>          for (y = slice_start; y < slice_end; y++) {
>>              const int xoff = mode == MATRIX_COLUMN ? (y - slice_start) * bpc :
>> radius * bpc;
>>              const int yoff = mode == MATRIX_COLUMN ? radius * stride : 0;
>> @@ -550,6 +605,7 @@ static int filter_slice(AVFilterContext *ctx, void *arg,
>> int jobnr, int nb_jobs)
>>                  dst += dstride;
>>          }
>>      }
>> +    }
>> 
>>      return 0;
>>  }
>> diff --git a/libavfilter/x86/vf_convolution_init.c
>> b/libavfilter/x86/vf_convolution_init.c
>> index d1e8c90ceb..6b1c2f0e9f 100644
>> --- a/libavfilter/x86/vf_convolution_init.c
>> +++ b/libavfilter/x86/vf_convolution_init.c
>> @@ -34,6 +34,27 @@ void ff_filter_row_sse4(uint8_t *dst, int width,
>>                          const uint8_t *c[], int peak, int radius,
>>                          int dstride, int stride);
>> 
>This C code should not be in the x86-specific file.
>
>Ruiling
>> +static void filter_column16(uint8_t *dst, int height,
>> +                          float rdiv, float bias, const int *const matrix,
>> +                          const uint8_t *c[], int length, int radius,
>> +                          int dstride, int stride)
>> +{
>> +    int y, off16;
>> +
>> +    for (y = 0; y < height; y++) {
>> +        for (off16 = 0; off16 < length; off16++){
>> +            int i, sum = 0;
>> +
>> +            for (i = 0; i < 2 * radius + 1; i++)
>> +                sum += c[i][0 + y * stride + off16] * matrix[i];
>> +
>> +            sum = (int)(sum * rdiv + bias + 0.5f);
>> +            dst[off16] = av_clip_uint8(sum);
>> +        }
>> +        dst += dstride;
>> +    }
>> +
>> +}
>> 
>>  av_cold void ff_convolution_init_x86(ConvolutionContext *s)
>>  {
>> @@ -51,6 +72,8 @@ av_cold void
>> ff_convolution_init_x86(ConvolutionContext *s)
>>                  if (EXTERNAL_SSE4(cpu_flags))
>>                      s->filter[i] = ff_filter_row_sse4;
>>          }
>> +        if (s->mode[i] == MATRIX_COLUMN)
>> +            s->filter[i] = filter_column16;
>>      }
>>  #endif
>>  }
>> --
>> 2.17.1
>>

diff --git a/libavfilter/vf_convolution.c b/libavfilter/vf_convolution.c
index 5909fea..708732a 100644
--- a/libavfilter/vf_convolution.c
+++ b/libavfilter/vf_convolution.c
@@ -521,6 +521,61 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
             continue;
         }
 
+        if (mode == MATRIX_COLUMN && s->filter[plane] != filter_column){
+            for (y = slice_start; y < slice_end - 16; y+=16) {
+                const int xoff = (y - slice_start) * bpc;
+                const int yoff = radius * stride;
+                for (x = 0; x < radius; x++) {
+                    const int xoff = (y - slice_start) * bpc;
+                    const int yoff = x * stride;
+
+                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
+                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
+                                    bias, matrix, c, 16, radius,
+                                    dstride, stride);
+                }
+                s->setup[plane](radius, c, src, stride, radius, width, y, height, bpc);
+                s->filter[plane](dst + yoff + xoff, sizew - 2 * radius,
+                                rdiv, bias, matrix, c, 16, radius,
+                                dstride, stride);
+                for (x = sizew - radius; x < sizew; x++) {
+                    const int xoff = (y - slice_start) * bpc;
+                    const int yoff = x * stride;
+
+                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
+                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
+                                    bias, matrix, c, 16, radius,
+                                    dstride, stride);
+                }
+            }
+            if (y < slice_end){
+                const int xoff = (y - slice_start) * bpc;
+                const int yoff = radius * stride;
+                for (x = 0; x < radius; x++) {
+                    const int xoff = (y - slice_start) * bpc;
+                    const int yoff = x * stride;
+
+                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
+                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
+                                    bias, matrix, c, slice_end - y, radius,
+                                    dstride, stride);
+                }
+                s->setup[plane](radius, c, src, stride, radius, width, y, height, bpc);
+                s->filter[plane](dst + yoff + xoff, sizew - 2 * radius,
+                                rdiv, bias, matrix, c, slice_end - y, radius,
+                                dstride, stride);
+                for (x = sizew - radius; x < sizew; x++) {
+                    const int xoff = (y - slice_start) * bpc;
+                    const int yoff = x * stride;
+
+                    s->setup[plane](radius, c, src, stride, x, width, y, height, bpc);
+                    s->filter[plane](dst + yoff + xoff, 1, rdiv,
+                                    bias, matrix, c, slice_end - y, radius,
+                                    dstride, stride);
+                }
+            }
+        }
+        else {
         for (y = slice_start; y < slice_end; y++) {
             const int xoff = mode == MATRIX_COLUMN ? (y - slice_start) * bpc : radius * bpc;
             const int yoff = mode == MATRIX_COLUMN ? radius * stride : 0;
@@ -551,6 +606,7 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
                 dst += dstride;
         }
     }
+    }
 
     return 0;
 }
diff --git a/libavfilter/x86/vf_convolution_init.c b/libavfilter/x86/vf_convolution_init.c
index 5143240..fcc9ae8 100644
--- a/libavfilter/x86/vf_convolution_init.c
+++ b/libavfilter/x86/vf_convolution_init.c
@@ -29,6 +29,56 @@ void ff_filter_3x3_sse4(uint8_t *dst, int width,
                         const uint8_t *c[], int peak, int radius,
                         int dstride, int stride);
 
+static void filter_column16(uint8_t *dst, int height,
+                          float rdiv, float bias, const int *const matrix,
+                          const uint8_t *c[], int length, int radius,
+                          int dstride, int stride)
+{
+    int y, off16;
+
+#if 1
+    #define __assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
+    assert(length <= 16);
+    __assume(length <= 16);
+    // NOTE: align to 64 bytes so that the 16 ints exactly fill one cache line
+    int __attribute__ ((aligned(64))) sum[16];
+    for (y = 0; y < height; y++) {
+        int i;
+        memset(sum, 0, sizeof(sum));
+
+        for (i = 0; i < 2 * radius + 1; i++) {
+            for (off16 = 0; off16 < length; off16++){
+                sum[off16] += c[i][0 + y * stride + off16] * matrix[i];
+            }
+        }
+
+        for (off16 = 0; off16 < length; off16++){
+            sum[off16] = (int)(sum[off16] * rdiv + bias + 0.5f);
+            dst[off16] = av_clip_uint8(sum[off16]);
+        }
+        dst += dstride;
+
+    }
+    #undef __assume
+
+#else
+
+    assert(length <= 16);
+    for (y = 0; y < height; y++) {
+        for (off16 = 0; off16 < length; off16++){
+            int i, sum = 0;
+
+            for (i = 0; i < 2 * radius + 1; i++)
+                sum += c[i][0 + y * stride + off16] * matrix[i];
+
+            sum = (int)(sum * rdiv + bias + 0.5f);
+            dst[off16] = av_clip_uint8(sum);
+        }
+        dst += dstride;
+    }
+#endif
+}
+
 av_cold void ff_convolution_init_x86(ConvolutionContext *s)
 {
 #if ARCH_X86_64
@@ -41,6 +91,8 @@ av_cold void ff_convolution_init_x86(ConvolutionContext *s)
                     s->filter[i] = ff_filter_3x3_sse4;
             }
         }
+        if (s->mode[i] == MATRIX_COLUMN)
+            s->filter[i] = filter_column16;
     }
 #endif
 }