From patchwork Mon Aug 2 05:34:36 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wu Jianhua
X-Patchwork-Id: 29179
From: Wu Jianhua
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 2 Aug 2021 13:34:36 +0800
Message-Id: <20210802053439.42828-2-jianhua.wu@intel.com>
In-Reply-To: <20210802053439.42828-1-jianhua.wu@intel.com>
References: <20210802053439.42828-1-jianhua.wu@intel.com>
Subject: [FFmpeg-devel] [PATCH 2/5] libavfilter/x86/vf_gblur: add ff_verti_slice_avx2/512()
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: Wu Jianhua, yanfei.cheng@intel.com

The new vertical slice with AVX2/512 acceleration can significantly
improve the performance of Gaussian Filter 2D.

Performance data (fps):
ff_verti_slice_c: 32.57
ff_verti_slice_avx2: 476.19
ff_verti_slice_avx512: 833.33

Co-authored-by: Cheng Yanfei
Co-authored-by: Jin Jun
---
 libavfilter/gblur.h             |   2 +
 libavfilter/vf_gblur.c          |  24 ++--
 libavfilter/x86/vf_gblur.asm    | 187 ++++++++++++++++++++++++++++++++
 libavfilter/x86/vf_gblur_init.c |   7 ++
 4 files changed, 212 insertions(+), 8 deletions(-)

diff --git a/libavfilter/gblur.h b/libavfilter/gblur.h
index dce50671f6..367575a6db 100644
--- a/libavfilter/gblur.h
+++ b/libavfilter/gblur.h
@@ -50,6 +50,8 @@ typedef struct GBlurContext {
     float nuV;
     int nb_planes;
     void (*horiz_slice)(float *buffer, int width, int height, int steps, float nu, float bscale);
+    void (*verti_slice)(float *buffer, int width, int height, int slice_start, int slice_end, int steps,
+                        float nu, float bscale);
     void (*postscale_slice)(float *buffer, int length, float postscale, float min, float max);
 } GBlurContext;
 
diff --git a/libavfilter/vf_gblur.c b/libavfilter/vf_gblur.c
index 3f61275658..de7ed82d49 100644
--- a/libavfilter/vf_gblur.c
+++ b/libavfilter/vf_gblur.c
@@ -138,6 +138,19 @@ static void do_vertical_columns(float *buffer, int width, int height,
     }
 }
 
+static void verti_slice_c(float *buffer, int width, int height,
+                          int slice_start, int slice_end, int steps,
+                          float nu, float boundaryscale)
+{
+    int aligned_end = slice_start + (((slice_end - slice_start) >> 3) << 3);
+    /* Filter vertically along columns (process 8 columns in each step) */
+    do_vertical_columns(buffer, width, height, slice_start, aligned_end,
+                        steps, nu, boundaryscale, 8);
+    /* Filter un-aligned columns one by one */
+    do_vertical_columns(buffer, width, height, aligned_end, slice_end,
+                        steps, nu, boundaryscale, 1);
+}
+
 static int filter_vertically(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
 {
     GBlurContext *s = ctx->priv;
@@ -150,16 +163,10 @@ static int filter_vertically(AVFilterContext *ctx, void *arg, int jobnr, int nb_
     const int steps = s->steps;
     const float nu = s->nuV;
     float *buffer = s->buffer;
-    int aligned_end;
 
-    aligned_end = slice_start + (((slice_end - slice_start) >> 3) << 3);
-    /* Filter vertically along columns (process 8 columns in each step) */
-    do_vertical_columns(buffer, width, height, slice_start, aligned_end,
-                        steps, nu, boundaryscale, 8);
+    s->verti_slice(buffer, width, height, slice_start, slice_end,
+                   steps, nu, boundaryscale);
 
-    /* Filter un-aligned columns one by one */
-    do_vertical_columns(buffer, width, height, aligned_end, slice_end,
-                        steps, nu, boundaryscale, 1);
     return 0;
 }
 
@@ -233,6 +240,7 @@ static int query_formats(AVFilterContext *ctx)
 void ff_gblur_init(GBlurContext *s)
 {
     s->horiz_slice = horiz_slice_c;
+    s->verti_slice = verti_slice_c;
     s->postscale_slice = postscale_c;
     if (ARCH_X86)
         ff_gblur_init_x86(s);
diff --git a/libavfilter/x86/vf_gblur.asm b/libavfilter/x86/vf_gblur.asm
index 276fe347f5..74174fdc43 100644
--- a/libavfilter/x86/vf_gblur.asm
+++ b/libavfilter/x86/vf_gblur.asm
@@ -22,6 +22,43 @@
 
 SECTION .text
 
+%xdefine AVX2_MMSIZE   32
+%xdefine AVX512_MMSIZE 64
+
+%macro MOVSXDIFNIDN 1-*
+    %rep %0
+        movsxdifnidn %1q, %1d
+        %rotate 1
+    %endrep
+%endmacro
+
+%macro PUSH_MASK 5
+%if mmsize == AVX2_MMSIZE
+    %assign %%n mmsize/4
+    %assign %%i 0
+    %rep %%n
+        mov %4, %3
+        and %4, 1
+        neg %4
+        mov dword [%5 + %%i*4], %4
+        sar %3, 1
+        %assign %%i %%i+1
+    %endrep
+    movu %1, [%5]
+%else
+    kmovd %2, %3
+%endif
+%endmacro
+
+%macro VMASKMOVPS 4
+%if mmsize == AVX2_MMSIZE
+    vpmaskmovd %1, %3, %2
+%else
+    kmovw k7, %4
+    vmovups %1{k7}, %2
+%endif
+%endmacro
+
 ; void ff_horiz_slice_sse4(float *ptr, int width, int height, int steps,
 ;                          float nu, float bscale)
 
@@ -232,3 +269,153 @@ POSTSCALE_SLICE
 INIT_ZMM avx512
 POSTSCALE_SLICE
 %endif
+
+
+;*******************************************************************************
+; void ff_verti_slice(float *buffer, int width, int height, int column_begin,
+;                     int column_end, int steps, float nu, float bscale);
+;*******************************************************************************
+%macro VERTI_SLICE 0
+%if UNIX64
+cglobal verti_slice, 6, 12, 9, 0-mmsize*2, buffer, width, height, cbegin, cend, \
+                                         steps, x, y, cwidth, step, ptr, stride
+%else
+cglobal verti_slice, 6, 12, 9, 0-mmsize*2, buffer, width, height, cbegin, cend, \
+                                         steps, nu, bscale, x, y, cwidth, step, \
+                                         ptr, stride
+%endif
+%assign cols mmsize/4
+%if WIN64
+    VBROADCASTSS m0, num
+    VBROADCASTSS m1, bscalem
+    DEFINE_ARGS buffer, width, height, cbegin, cend, \
+                steps, x, y, cwidth, step, ptr, stride
+    MOVSXDIFNIDN width, height, cbegin, cend, steps
+%else
+    VBROADCASTSS m0, xmm0 ; nu
+    VBROADCASTSS m1, xmm1 ; bscale
+%endif
+    mov cwidthq, cendq
+    sub cwidthq, cbeginq
+    lea strideq, [widthq * 4]
+
+    xor xq, xq ; x = 0
+    cmp cwidthq, cols
+    jl .x_scalar
+    cmp cwidthq, 0x0
+    je .end_scalar
+
+    sub cwidthq, cols
+.loop_x:
+    xor stepq, stepq
+    .loop_step:
+        ; ptr = buffer + x + column_begin;
+        lea ptrq, [xq + cbeginq]
+        lea ptrq, [bufferq + ptrq*4]
+
+        ; ptr[15:0] *= bscale;
+        movu m2, [ptrq]
+        mulps m2, m1
+        movu [ptrq], m2
+
+        ; Filter downwards
+        mov yq, 1
+        .loop_y_down:
+            add ptrq, strideq ; ptrq += width
+            movu m3, [ptrq]
+            FMULADD_PS m2, m2, m0, m3, m2
+            movu [ptrq], m2
+
+            inc yq
+            cmp yq, heightq
+            jl .loop_y_down
+
+        mulps m2, m1
+        movu [ptrq], m2
+
+        ; Filter upwards
+        dec yq
+        .loop_y_up:
+            sub ptrq, strideq
+            movu m3, [ptrq]
+            FMULADD_PS m2, m2, m0, m3, m2
+            movu [ptrq], m2
+
+            dec yq
+            cmp yq, 0
+            jg .loop_y_up
+
+        inc stepq
+        cmp stepq, stepsq
+        jl .loop_step
+
+    add xq, cols
+    cmp xq, cwidthq
+    jle .loop_x
+
+    add cwidthq, cols
+    cmp xq, cwidthq
+    jge .end_scalar
+
+.x_scalar:
+    xor stepq, stepq
+    mov qword [rsp + 0x10], xq
+    sub cwidthq, xq
+    mov xq, 1
+    shlx cwidthq, xq, cwidthq
+    sub cwidthq, 1
+    PUSH_MASK m4, k1, cwidthd, xd, rsp + 0x20
+    mov xq, qword [rsp + 0x10]
+
+    .loop_step_scalar:
+        lea ptrq, [xq + cbeginq]
+        lea ptrq, [bufferq + ptrq*4]
+
+        VMASKMOVPS m2, [ptrq], m4, k1
+        mulps m2, m1
+        VMASKMOVPS [ptrq], m2, m4, k1
+
+        ; Filter downwards
+        mov yq, 1
+        .x_scalar_loop_y_down:
+            add ptrq, strideq
+            VMASKMOVPS m3, [ptrq], m4, k1
+            FMULADD_PS m2, m2, m0, m3, m2
+            VMASKMOVPS [ptrq], m2, m4, k1
+
+            inc yq
+            cmp yq, heightq
+            jl .x_scalar_loop_y_down
+
+        mulps m2, m1
+        VMASKMOVPS [ptrq], m2, m4, k1
+
+        ; Filter upwards
+        dec yq
+        .x_scalar_loop_y_up:
+            sub ptrq, strideq
+            VMASKMOVPS m3, [ptrq], m4, k1
+            FMULADD_PS m2, m2, m0, m3, m2
+            VMASKMOVPS [ptrq], m2, m4, k1
+
+            dec yq
+            cmp yq, 0
+            jg .x_scalar_loop_y_up
+
+        inc stepq
+        cmp stepq, stepsq
+        jl .loop_step_scalar
+
+.end_scalar:
+    RET
+%endmacro
+
+%if HAVE_AVX2_EXTERNAL
+INIT_YMM avx2
+VERTI_SLICE
+%endif
+
+%if HAVE_AVX512_EXTERNAL
+INIT_ZMM avx512
+VERTI_SLICE
+%endif
diff --git a/libavfilter/x86/vf_gblur_init.c b/libavfilter/x86/vf_gblur_init.c
index 34aba4ca6e..3e173410c2 100644
--- a/libavfilter/x86/vf_gblur_init.c
+++ b/libavfilter/x86/vf_gblur_init.c
@@ -31,6 +31,11 @@ void ff_postscale_slice_sse(float *ptr, int length, float postscale, float min,
 void ff_postscale_slice_avx2(float *ptr, int length, float postscale, float min, float max);
 void ff_postscale_slice_avx512(float *ptr, int length, float postscale, float min, float max);
 
+void ff_verti_slice_avx2(float *buffer, int width, int height, int column_begin, int column_end,
+                         int steps, float nu, float bscale);
+void ff_verti_slice_avx512(float *buffer, int width, int height, int column_begin, int column_end,
+                           int steps, float nu, float bscale);
+
 av_cold void ff_gblur_init_x86(GBlurContext *s)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -47,9 +52,11 @@ av_cold void ff_gblur_init_x86(GBlurContext *s)
     }
     if (EXTERNAL_AVX2(cpu_flags)) {
         s->horiz_slice = ff_horiz_slice_avx2;
+        s->verti_slice = ff_verti_slice_avx2;
     }
     if (EXTERNAL_AVX512(cpu_flags)) {
         s->postscale_slice = ff_postscale_slice_avx512;
+        s->verti_slice = ff_verti_slice_avx512;
     }
 #endif
 }
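
Reviewer note: for anyone checking the assembly against scalar behaviour, the per-column recursion and the AVX2 tail mask can be sketched in plain C as below. This is only an illustrative reading of the VERTI_SLICE and PUSH_MASK macros in this patch, not code from the series. The names verti_columns_ref and expand_mask are hypothetical, and the sketch assumes height >= 2 (the assembly loops are do-while style and always run at least once). Results from the SIMD path may also differ in the last bits when FMULADD_PS is emitted as a fused multiply-add.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the AVX2 branch of PUSH_MASK: expand the low
 * `lanes` bits of `bits` into one 32-bit mask per lane (all ones or all
 * zeros), which is the form vpmaskmovd expects. */
static void expand_mask(uint32_t bits, int lanes, int32_t *out)
{
    for (int i = 0; i < lanes; i++) {
        out[i] = -(int32_t)(bits & 1); /* "and %4, 1" followed by "neg %4" */
        bits >>= 1;                    /* "sar %3, 1" (bits is non-negative) */
    }
}

/* Hypothetical scalar reference for one slice of columns, following the
 * order of operations in VERTI_SLICE one column at a time. */
static void verti_columns_ref(float *buffer, int width, int height,
                              int column_begin, int column_end, int steps,
                              float nu, float bscale)
{
    assert(height >= 2); /* assumption of this sketch, see the note above */

    for (int x = column_begin; x < column_end; x++) {
        for (int step = 0; step < steps; step++) {
            float *ptr = buffer + x;
            float acc  = ptr[0] * bscale;   /* scale the top boundary */
            ptr[0] = acc;

            /* Filter downwards: acc = acc * nu + current sample */
            for (int y = 1; y < height; y++) {
                ptr += width;
                acc  = acc * nu + ptr[0];
                ptr[0] = acc;
            }

            acc *= bscale;                  /* scale the bottom boundary */
            ptr[0] = acc;

            /* Filter upwards with the same recursion */
            for (int y = height - 1; y > 0; y--) {
                ptr -= width;
                acc  = acc * nu + ptr[0];
                ptr[0] = acc;
            }
        }
    }
}
```

With three leftover columns the tail mask is (1 << 3) - 1, so expand_mask() enables lanes 0 to 2 and clears the rest. That matches the shlx/sub sequence before PUSH_MASK in the .x_scalar path.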