From patchwork Wed Aug  4 02:06:13 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wu Jianhua <jianhua.wu@intel.com>
X-Patchwork-Id: 29227
From: Wu Jianhua <jianhua.wu@intel.com>
To: ffmpeg-devel@ffmpeg.org
Date: Wed,  4 Aug 2021 10:06:13 +0800
Message-Id: <20210804020616.82866-2-jianhua.wu@intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20210804020616.82866-1-jianhua.wu@intel.com>
References: <20210804020616.82866-1-jianhua.wu@intel.com>
Subject: [FFmpeg-devel] [PATCH v2 2/5] libavfilter/x86/vf_gblur: add
 ff_verti_slice_avx2/512()
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Wu Jianhua <jianhua.wu@intel.com>
MIME-Version: 1.0
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: lvQBme5nxD/d

Add AVX2 and AVX-512 implementations of the vertical slice filter. They
significantly speed up the vertical pass of the 2D Gaussian blur (gblur).

Performance data:
ff_verti_slice_c: 32.57
ff_verti_slice_avx2: 476.19
ff_verti_slice_avx512: 833.33

Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
---
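For reviewers, here is a minimal scalar sketch of what each SIMD lane of
VERTI_SLICE appears to compute, as read from the assembly in this patch.
The helper names verti_column_ref and tail_lane_mask are hypothetical and
are not part of the patch. The sketch assumes height >= 2, because the
assembly loops test their condition only after the first iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper (not in the patch): filters one column
 * in place. `stride` is the row pitch in floats, i.e. the image width. */
static void verti_column_ref(float *col, ptrdiff_t stride, int height,
                             int steps, float nu, float bscale)
{
    assert(height >= 2); /* assumption: mirrors the do-while asm loops */
    for (int step = 0; step < steps; step++) {
        float *p = col;
        float acc = p[0] * bscale;      /* scale the top boundary sample */
        p[0] = acc;
        for (int y = 1; y < height; y++) {   /* filter downwards */
            p += stride;
            acc = acc * nu + p[0];
            p[0] = acc;
        }
        acc *= bscale;                  /* scale the bottom boundary sample */
        p[0] = acc;
        for (int y = height - 1; y > 0; y--) { /* filter upwards */
            p -= stride;
            acc = acc * nu + p[0];
            p[0] = acc;
        }
    }
}

/* Hypothetical helper (not in the patch): expands the low `remaining` bits
 * into one all-ones or all-zeros 32-bit mask per lane, which is what the
 * AVX2 branch of PUSH_MASK stores on the stack for the tail columns.
 * Assumes 0 <= remaining < lanes <= 16. */
static void tail_lane_mask(int remaining, int lanes, int32_t *mask)
{
    uint32_t bits = (1u << remaining) - 1;
    for (int i = 0; i < lanes; i++) {
        mask[i] = -(int32_t)(bits & 1);
        bits >>= 1;
    }
}
```

With a 2-wide, 3-high buffer {1, 10, 2, 20, 3, 30}, one step, nu = 0.5 and
bscale = 2, filtering column 0 leaves 5.75, 7.5 and 9 in that column and
does not touch column 1.
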
 libavfilter/gblur.h             |   2 +
 libavfilter/vf_gblur.c          |  24 ++--
 libavfilter/x86/vf_gblur.asm    | 189 ++++++++++++++++++++++++++++++++
 libavfilter/x86/vf_gblur_init.c |   7 ++
 4 files changed, 214 insertions(+), 8 deletions(-)

diff --git a/libavfilter/gblur.h b/libavfilter/gblur.h
index dce50671f6..367575a6db 100644
--- a/libavfilter/gblur.h
+++ b/libavfilter/gblur.h
@@ -50,6 +50,8 @@ typedef struct GBlurContext {
     float nuV;
     int nb_planes;
     void (*horiz_slice)(float *buffer, int width, int height, int steps, float nu, float bscale);
+    void (*verti_slice)(float *buffer, int width, int height, int slice_start, int slice_end, int steps,
+                        float nu, float bscale);
     void (*postscale_slice)(float *buffer, int length, float postscale, float min, float max);
 } GBlurContext;
 
diff --git a/libavfilter/vf_gblur.c b/libavfilter/vf_gblur.c
index 3f61275658..de7ed82d49 100644
--- a/libavfilter/vf_gblur.c
+++ b/libavfilter/vf_gblur.c
@@ -138,6 +138,19 @@ static void do_vertical_columns(float *buffer, int width, int height,
     }
 }
 
+static void verti_slice_c(float *buffer, int width, int height,
+                          int slice_start, int slice_end, int steps,
+                          float nu, float boundaryscale)
+{
+    int aligned_end = slice_start + (((slice_end - slice_start) >> 3) << 3);
+    /* Filter vertically along columns (process 8 columns in each step) */
+    do_vertical_columns(buffer, width, height, slice_start, aligned_end,
+                        steps, nu, boundaryscale, 8);
+    /* Filter un-aligned columns one by one */
+    do_vertical_columns(buffer, width, height, aligned_end, slice_end,
+                        steps, nu, boundaryscale, 1);
+}
+
 static int filter_vertically(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
 {
     GBlurContext *s = ctx->priv;
@@ -150,16 +163,10 @@ static int filter_vertically(AVFilterContext *ctx, void *arg, int jobnr, int nb_
     const int steps = s->steps;
     const float nu = s->nuV;
     float *buffer = s->buffer;
-    int aligned_end;
 
-    aligned_end = slice_start + (((slice_end - slice_start) >> 3) << 3);
-    /* Filter vertically along columns (process 8 columns in each step) */
-    do_vertical_columns(buffer, width, height, slice_start, aligned_end,
-                        steps, nu, boundaryscale, 8);
+    s->verti_slice(buffer, width, height, slice_start, slice_end,
+                   steps, nu, boundaryscale);
 
-    /* Filter un-aligned columns one by one */
-    do_vertical_columns(buffer, width, height, aligned_end, slice_end,
-                        steps, nu, boundaryscale, 1);
     return 0;
 }
 
@@ -233,6 +240,7 @@ static int query_formats(AVFilterContext *ctx)
 void ff_gblur_init(GBlurContext *s)
 {
     s->horiz_slice = horiz_slice_c;
+    s->verti_slice = verti_slice_c;
     s->postscale_slice = postscale_c;
     if (ARCH_X86)
         ff_gblur_init_x86(s);
diff --git a/libavfilter/x86/vf_gblur.asm b/libavfilter/x86/vf_gblur.asm
index 276fe347f5..ac4debba74 100644
--- a/libavfilter/x86/vf_gblur.asm
+++ b/libavfilter/x86/vf_gblur.asm
@@ -22,6 +22,43 @@
 
 SECTION .text
 
+%xdefine AVX2_MMSIZE   32
+%xdefine AVX512_MMSIZE 64
+
+%macro MOVSXDIFNIDN 1-*
+    %rep %0
+        movsxdifnidn %1q, %1d
+        %rotate 1
+    %endrep
+%endmacro
+
+%macro PUSH_MASK 5
+%if mmsize == AVX2_MMSIZE
+    %assign %%n mmsize/4
+    %assign %%i 0
+    %rep %%n
+        mov %4, %3
+        and %4, 1
+        neg %4
+        mov dword [%5 + %%i*4], %4
+        sar %3, 1
+        %assign %%i %%i+1
+    %endrep
+    movu %1, [%5]
+%else
+    kmovd %2, %3
+%endif
+%endmacro
+
+%macro VMASKMOVPS 4
+%if mmsize == AVX2_MMSIZE
+    vpmaskmovd %1, %3, %2
+%else
+    kmovw k7, %4
+    vmovups %1{k7}, %2
+%endif
+%endmacro
+
 ; void ff_horiz_slice_sse4(float *ptr, int width, int height, int steps,
 ;                          float nu, float bscale)
 
@@ -232,3 +269,155 @@ POSTSCALE_SLICE
 INIT_ZMM avx512
 POSTSCALE_SLICE
 %endif
+
+
+;*******************************************************************************
+; void ff_verti_slice(float *buffer, int width, int height, int column_begin,
+;                     int column_end, int steps, float nu, float bscale);
+;*******************************************************************************
+%macro VERTI_SLICE 0
+%if UNIX64
+cglobal verti_slice, 6, 12, 9, 0-mmsize*2, buffer, width, height, cbegin, cend, \
+                                         steps, x, y, cwidth, step, ptr, stride
+%else
+cglobal verti_slice, 6, 12, 9, 0-mmsize*2, buffer, width, height, cbegin, cend, \
+                                         steps, nu, bscale, x, y, cwidth, step, \
+                                         ptr, stride
+%endif
+%assign cols mmsize/4
+%if WIN64
+    VBROADCASTSS m0, num
+    VBROADCASTSS m1, bscalem
+    DEFINE_ARGS buffer, width, height, cbegin, cend, \
+                steps, x, y, cwidth, step, ptr, stride
+    MOVSXDIFNIDN width, height, cbegin, cend, steps
+%else
+    VBROADCASTSS m0, xmm0 ; nu
+    VBROADCASTSS m1, xmm1 ; bscale
+%endif
+    mov cwidthq, cendq
+    sub cwidthq, cbeginq
+    lea strideq, [widthq * 4]
+
+    xor xq, xq ; x = 0
+    cmp cwidthq, 0x0
+    je .end_scalar
+    cmp cwidthq, cols
+    jl .x_scalar
+
+    sub cwidthq, cols
+.loop_x:
+    xor stepq, stepq
+    .loop_step:
+        ; ptr = buffer + x + column_begin;
+        lea ptrq, [xq + cbeginq]
+        lea ptrq, [bufferq + ptrq*4]
+
+        ; ptr[cols-1:0] *= bscale;
+        movu m2, [ptrq]
+        mulps m2, m1
+        movu [ptrq], m2
+
+        ; Filter downwards
+        mov yq, 1
+        .loop_y_down:
+            add ptrq, strideq ; ptr += width floats (strideq is in bytes)
+            movu m3, [ptrq]
+            FMULADD_PS m2, m2, m0, m3, m2
+            movu [ptrq], m2
+
+            inc yq
+            cmp yq, heightq
+            jl .loop_y_down
+
+        mulps m2, m1
+        movu [ptrq], m2
+
+        ; Filter upwards
+        dec yq
+        .loop_y_up:
+            sub ptrq, strideq
+            movu m3, [ptrq]
+            FMULADD_PS m2, m2, m0, m3, m2
+            movu [ptrq], m2
+
+            dec yq
+            cmp yq, 0
+            jg .loop_y_up
+
+        inc stepq
+        cmp stepq, stepsq
+        jl .loop_step
+
+    add xq, cols
+    cmp xq, cwidthq
+    jle .loop_x
+
+    add cwidthq, cols
+    cmp xq, cwidthq
+    jge .end_scalar
+
+.x_scalar:
+    xor stepq, stepq
+    mov qword [rsp + 0x10], xq
+    sub cwidthq, xq
+    mov xq, 1
+    shlx cwidthq, xq, cwidthq
+    sub cwidthq, 1
+    PUSH_MASK m4, k1, cwidthd, xd, rsp + 0x20
+    mov xq, qword [rsp + 0x10]
+
+    .loop_step_scalar:
+        lea ptrq, [xq + cbeginq]
+        lea ptrq, [bufferq + ptrq*4]
+
+        VMASKMOVPS m2, [ptrq], m4, k1
+        mulps m2, m1
+        VMASKMOVPS [ptrq], m2, m4, k1
+
+        ; Filter downwards
+        mov yq, 1
+        .x_scalar_loop_y_down:
+            add ptrq, strideq
+            VMASKMOVPS m3, [ptrq], m4, k1
+            FMULADD_PS m2, m2, m0, m3, m2
+            VMASKMOVPS [ptrq], m2, m4, k1
+
+            inc yq
+            cmp yq, heightq
+            jl .x_scalar_loop_y_down
+
+        mulps m2, m1
+        VMASKMOVPS [ptrq], m2, m4, k1
+
+        ; Filter upwards
+        dec yq
+        .x_scalar_loop_y_up:
+            sub ptrq, strideq
+            VMASKMOVPS m3, [ptrq], m4, k1
+            FMULADD_PS m2, m2, m0, m3, m2
+            VMASKMOVPS [ptrq], m2, m4, k1
+
+            dec yq
+            cmp yq, 0
+            jg .x_scalar_loop_y_up
+
+        inc stepq
+        cmp stepq, stepsq
+        jl .loop_step_scalar
+
+.end_scalar:
+    RET
+%endmacro
+
+%if ARCH_X86_64
+%if HAVE_AVX2_EXTERNAL
+INIT_YMM avx2
+VERTI_SLICE
+%endif
+
+%if HAVE_AVX512_EXTERNAL
+INIT_ZMM avx512
+VERTI_SLICE
+%endif
+%endif
diff --git a/libavfilter/x86/vf_gblur_init.c b/libavfilter/x86/vf_gblur_init.c
index 34aba4ca6e..3e173410c2 100644
--- a/libavfilter/x86/vf_gblur_init.c
+++ b/libavfilter/x86/vf_gblur_init.c
@@ -31,6 +31,11 @@ void ff_postscale_slice_sse(float *ptr, int length, float postscale, float min,
 void ff_postscale_slice_avx2(float *ptr, int length, float postscale, float min, float max);
 void ff_postscale_slice_avx512(float *ptr, int length, float postscale, float min, float max);
 
+void ff_verti_slice_avx2(float *buffer, int width, int height, int column_begin, int column_end,
+                         int steps, float nu, float bscale);
+void ff_verti_slice_avx512(float *buffer, int width, int height, int column_begin, int column_end,
+                           int steps, float nu, float bscale);
+
 av_cold void ff_gblur_init_x86(GBlurContext *s)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -47,9 +52,11 @@ av_cold void ff_gblur_init_x86(GBlurContext *s)
     }
     if (EXTERNAL_AVX2(cpu_flags)) {
         s->horiz_slice = ff_horiz_slice_avx2;
+        s->verti_slice = ff_verti_slice_avx2;
     }
     if (EXTERNAL_AVX512(cpu_flags)) {
         s->postscale_slice = ff_postscale_slice_avx512;
+        s->verti_slice = ff_verti_slice_avx512;
     }
 #endif
 }