From patchwork Wed Aug 4 02:06:13 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wu Jianhua X-Patchwork-Id: 29227 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a6b:6c0f:0:0:0:0:0 with SMTP id a15csp2976037ioh; Tue, 3 Aug 2021 19:06:47 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyR6i3BP2oacErcrCYUfuV9jbXj7B6w8iOQvQSij6dNl4DQ1wYA6FoPlw4lsKFYtrqD8snO X-Received: by 2002:a17:906:3b97:: with SMTP id u23mr23759716ejf.437.1628042807186; Tue, 03 Aug 2021 19:06:47 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1628042807; cv=none; d=google.com; s=arc-20160816; b=OR3Ri9kc6GDk1zxe6bZY2xClx48GLPtsigFH5j+O492FMIEYMlGLJjwYN/+QMp2zSZ 12RxcDzQKoSHcDMcAzj3M6kf60mqJBz9a+miFxlUW1x5BkDCQz94ocKaTWk6FZ+AyBi+ a62rRgC/74YeNj4jckvo73BTJCY3oCwoT5SP/8L9KQjGcf+CpqW165id6xgwBAcNsRnO E9hZOlYcYzsHvpP8kS+bNrLu5Z5ATZZyJUafnU4jkIFkK6DnzZLs352cFgQExsjD42ks Xg17KKANZIRlNpU0S72E/lf2ai+U1/KlX98j/cvJmwkx2mEfac1+dx7/rB/iB5gc9aVK 7kcQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:mime-version:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:references:in-reply-to:message-id:date :to:from:delivered-to; bh=kpp4Z7SPPx5I7QPSLyEubYj8vXn2xYCBTyTWpIczwtY=; b=ZEEEhGP5WwA6Y80yXLpZLUXVBEFeLZN51s/tgwmOA9s8V2qHPSCZd7h+UjNErmzTgj 3azzq1fy6UwdKiJD9vNsDvWciCX3JLLBSyNB6ItGyoxE3VRekzg6c1UVp7+CMmp/AxuV CaPJv8QHzZBLfhFX2drQYzLi4j+dwhDJhVYy8KXf+HGQGm0LFmoI1ZWCcT/mfTFBJMBO W56cersbX+UNXWmlL7m+//S1ty9DYhUKjmDKqOsV+hpyasF7z+n0jHK5WhUIPThE3e3O 6zWj7SF7walaPbr8OAqFdqTHHkKvh3dE3UlBTbFHPh9LsjKHWTVLA5mmtKFb2jagzFCt Fbzg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id f17si662782edw.586.2021.08.03.19.06.46; Tue, 03 Aug 2021 19:06:47 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id CB4B26898FB; Wed, 4 Aug 2021 05:06:33 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E4E0D6880FA for ; Wed, 4 Aug 2021 05:06:26 +0300 (EEST) X-IronPort-AV: E=McAfee;i="6200,9189,10065"; a="210717019" X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="210717019" Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Aug 2021 19:06:24 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.84,293,1620716400"; d="scan'208";a="667632136" Received: from skl-e5.sh.intel.com ([10.239.43.106]) by fmsmga006.fm.intel.com with ESMTP; 03 Aug 2021 19:06:23 -0700 From: Wu Jianhua To: ffmpeg-devel@ffmpeg.org Date: Wed, 4 Aug 2021 10:06:13 +0800 Message-Id: <20210804020616.82866-2-jianhua.wu@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20210804020616.82866-1-jianhua.wu@intel.com> References: <20210804020616.82866-1-jianhua.wu@intel.com> Subject: [FFmpeg-devel] [PATCH v2 2/5] libavfilter/x86/vf_gblur: add ff_verti_slice_avx2/512() X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Wu Jianhua MIME-Version: 1.0 Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: lvQBme5nxD/d The new vertical slice with AVX2/512 acceleration can significantly improve the performance of Gaussian Filter 2D. Performance data: ff_verti_slice_c: 32.57 ff_verti_slice_avx2: 476.19 ff_verti_slice_avx512: 833.33 Co-authored-by: Cheng Yanfei Co-authored-by: Jin Jun Signed-off-by: Wu Jianhua --- libavfilter/gblur.h | 2 + libavfilter/vf_gblur.c | 24 ++-- libavfilter/x86/vf_gblur.asm | 189 ++++++++++++++++++++++++++++++++ libavfilter/x86/vf_gblur_init.c | 7 ++ 4 files changed, 214 insertions(+), 8 deletions(-) diff --git a/libavfilter/gblur.h b/libavfilter/gblur.h index dce50671f6..367575a6db 100644 --- a/libavfilter/gblur.h +++ b/libavfilter/gblur.h @@ -50,6 +50,8 @@ typedef struct GBlurContext { float nuV; int nb_planes; void (*horiz_slice)(float *buffer, int width, int height, int steps, float nu, float bscale); + void (*verti_slice)(float *buffer, int width, int height, int slice_start, int slice_end, int steps, + float nu, float bscale); void (*postscale_slice)(float *buffer, int length, float postscale, float min, float max); } GBlurContext; diff --git a/libavfilter/vf_gblur.c b/libavfilter/vf_gblur.c index 3f61275658..de7ed82d49 100644 --- a/libavfilter/vf_gblur.c +++ b/libavfilter/vf_gblur.c @@ -138,6 +138,19 @@ static void do_vertical_columns(float *buffer, int width, int height, } } +static void verti_slice_c(float *buffer, int width, int height, + int slice_start, int slice_end, int steps, + float nu, float boundaryscale) +{ + int aligned_end = slice_start + (((slice_end - slice_start) >> 3) << 3); + /* Filter vertically along columns (process 8 columns in each step) */ + do_vertical_columns(buffer, width, height, slice_start, aligned_end, + steps, nu, boundaryscale, 8); + /* Filter un-aligned columns one by one */ + do_vertical_columns(buffer, width, height, aligned_end, slice_end, + steps, nu, boundaryscale, 1); +} + static int filter_vertically(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs) { GBlurContext *s = ctx->priv; @@ -150,16 +163,10 @@ static int filter_vertically(AVFilterContext *ctx, void *arg, int jobnr, int nb_ const int steps = s->steps; const float nu = s->nuV; float *buffer = s->buffer; - int aligned_end; - aligned_end = slice_start + (((slice_end - slice_start) >> 3) << 3); - /* Filter vertically along columns (process 8 columns in each step) */ - do_vertical_columns(buffer, width, height, slice_start, aligned_end, - steps, nu, boundaryscale, 8); + s->verti_slice(buffer, width, height, slice_start, slice_end, + steps, nu, boundaryscale); - /* Filter un-aligned columns one by one */ - do_vertical_columns(buffer, width, height, aligned_end, slice_end, - steps, nu, boundaryscale, 1); return 0; } @@ -233,6 +240,7 @@ static int query_formats(AVFilterContext *ctx) void ff_gblur_init(GBlurContext *s) { s->horiz_slice = horiz_slice_c; + s->verti_slice = verti_slice_c; s->postscale_slice = postscale_c; if (ARCH_X86) ff_gblur_init_x86(s); diff --git a/libavfilter/x86/vf_gblur.asm b/libavfilter/x86/vf_gblur.asm index 276fe347f5..ac4debba74 100644 --- a/libavfilter/x86/vf_gblur.asm +++ b/libavfilter/x86/vf_gblur.asm @@ -22,6 +22,43 @@ SECTION .text +%xdefine AVX2_MMSIZE 32 +%xdefine AVX512_MMSIZE 64 + +%macro MOVSXDIFNIDN 1-* + %rep %0 + movsxdifnidn %1q, %1d + %rotate 1 + %endrep +%endmacro + +%macro PUSH_MASK 5 +%if mmsize == AVX2_MMSIZE + %assign %%n mmsize/4 + %assign %%i 0 + %rep %%n + mov %4, %3 + and %4, 1 + neg %4 + mov dword [%5 + %%i*4], %4 + sar %3, 1 + %assign %%i %%i+1 + %endrep + movu %1, [%5] +%else + kmovd %2, %3 +%endif +%endmacro + +%macro VMASKMOVPS 4 +%if mmsize == AVX2_MMSIZE + vpmaskmovd %1, %3, %2 +%else + kmovw k7, %4 + vmovups %1{k7}, %2 +%endif +%endmacro + ; void ff_horiz_slice_sse4(float *ptr, int width, int height, int steps, ; float nu, float bscale) @@ -232,3 +269,155 @@ POSTSCALE_SLICE INIT_ZMM avx512 POSTSCALE_SLICE %endif + + +;******************************************************************************* +; void ff_verti_slice(float *buffer, int width, int height, int column_begin, +; int column_end, int steps, float nu, float bscale); +;******************************************************************************* +%macro VERTI_SLICE 0 +%if UNIX64 +cglobal verti_slice, 6, 12, 9, 0-mmsize*2, buffer, width, height, cbegin, cend, \ + steps, x, y, cwidth, step, ptr, stride +%else +cglobal verti_slice, 6, 12, 9, 0-mmsize*2, buffer, width, height, cbegin, cend, \ + steps, nu, bscale, x, y, cwidth, step, \ + ptr, stride +%endif +%assign cols mmsize/4 +%if WIN64 + VBROADCASTSS m0, num + VBROADCASTSS m1, bscalem + DEFINE_ARGS buffer, width, height, cbegin, cend, \ + steps, x, y, cwidth, step, ptr, stride + MOVSXDIFNIDN width, height, cbegin, cend, steps +%else + VBROADCASTSS m0, xmm0 ; nu + VBROADCASTSS m1, xmm1 ; bscale +%endif + mov cwidthq, cendq + sub cwidthq, cbeginq + lea strideq, [widthq * 4] + + xor xq, xq ; x = 0 + cmp cwidthq, cols + jl .x_scalar + cmp cwidthq, 0x0 + je .end_scalar + + sub cwidthq, cols +.loop_x: + xor stepq, stepq + .loop_step: + ; ptr = buffer + x + column_begin; + lea ptrq, [xq + cbeginq] + lea ptrq, [bufferq + ptrq*4] + + ; ptr[15:0] *= bcale; + movu m2, [ptrq] + mulps m2, m1 + movu [ptrq], m2 + + ; Filter downwards + mov yq, 1 + .loop_y_down: + add ptrq, strideq ; ptrq += width + movu m3, [ptrq] + FMULADD_PS m2, m2, m0, m3, m2 + movu [ptrq], m2 + + inc yq + cmp yq, heightq + jl .loop_y_down + + mulps m2, m1 + movu [ptrq], m2 + + ; Filter upwards + dec yq + .loop_y_up: + sub ptrq, strideq + movu m3, [ptrq] + FMULADD_PS m2, m2, m0, m3, m2 + movu [ptrq], m2 + + dec yq + cmp yq, 0 + jg .loop_y_up + + inc stepq + cmp stepq, stepsq + jl .loop_step + + add xq, cols + cmp xq, cwidthq + jle .loop_x + + add cwidthq, cols + cmp xq, cwidthq + jge .end_scalar + +.x_scalar: + xor stepq, stepq + mov qword [rsp + 0x10], xq + sub cwidthq, xq + mov xq, 1 + shlx cwidthq, xq, cwidthq + sub cwidthq, 1 + PUSH_MASK m4, k1, cwidthd, xd, rsp + 0x20 + mov xq, qword [rsp + 0x10] + + .loop_step_scalar: + lea ptrq, [xq + cbeginq] + lea ptrq, [bufferq + ptrq*4] + + VMASKMOVPS m2, [ptrq], m4, k1 + mulps m2, m1 + VMASKMOVPS [ptrq], m2, m4, k1 + + ; Filter downwards + mov yq, 1 + .x_scalar_loop_y_down: + add ptrq, strideq + VMASKMOVPS m3, [ptrq], m4, k1 + FMULADD_PS m2, m2, m0, m3, m2 + VMASKMOVPS [ptrq], m2, m4, k1 + + inc yq + cmp yq, heightq + jl .x_scalar_loop_y_down + + mulps m2, m1 + VMASKMOVPS [ptrq], m2, m4, k1 + + ; Filter upwards + dec yq + .x_scalar_loop_y_up: + sub ptrq, strideq + VMASKMOVPS m3, [ptrq], m4, k1 + FMULADD_PS m2, m2, m0, m3, m2 + VMASKMOVPS [ptrq], m2, m4, k1 + + dec yq + cmp yq, 0 + jg .x_scalar_loop_y_up + + inc stepq + cmp stepq, stepsq + jl .loop_step_scalar + +.end_scalar: + RET +%endmacro + +%if ARCH_X86_64 +%if HAVE_AVX2_EXTERNAL +INIT_YMM avx2 +VERTI_SLICE +%endif + +%if HAVE_AVX512_EXTERNAL +INIT_ZMM avx512 +VERTI_SLICE +%endif +%endif diff --git a/libavfilter/x86/vf_gblur_init.c b/libavfilter/x86/vf_gblur_init.c index 34aba4ca6e..3e173410c2 100644 --- a/libavfilter/x86/vf_gblur_init.c +++ b/libavfilter/x86/vf_gblur_init.c @@ -31,6 +31,11 @@ void ff_postscale_slice_sse(float *ptr, int length, float postscale, float min, void ff_postscale_slice_avx2(float *ptr, int length, float postscale, float min, float max); void ff_postscale_slice_avx512(float *ptr, int length, float postscale, float min, float max); +void ff_verti_slice_avx2(float *buffer, int width, int height, int column_begin, int column_end, + int steps, float nu, float bscale); +void ff_verti_slice_avx512(float *buffer, int width, int height, int column_begin, int column_end, + int steps, float nu, float bscale); + av_cold void ff_gblur_init_x86(GBlurContext *s) { int cpu_flags = av_get_cpu_flags(); @@ -47,9 +52,11 @@ av_cold void ff_gblur_init_x86(GBlurContext *s) } if (EXTERNAL_AVX2(cpu_flags)) { s->horiz_slice = ff_horiz_slice_avx2; + s->verti_slice = ff_verti_slice_avx2; } if (EXTERNAL_AVX512(cpu_flags)) { s->postscale_slice = ff_postscale_slice_avx512; + s->verti_slice = ff_verti_slice_avx512; } #endif }