From: Wu Jianhua
To: ffmpeg-devel@ffmpeg.org
Date: Fri, 27 Aug 2021 12:51:41 +0800
Subject: [FFmpeg-devel] [PATCH 1/4] libavfilter/x86/vf_hflip: add ff_flip_byte/short_avx512()

Performance (lower is better):

8-bit:
ff_hflip_byte_ssse3      0.61
ff_hflip_byte_avx2       0.37
ff_hflip_byte_avx512     0.19

16-bit:
ff_hflip_short_ssse3     1.27
ff_hflip_short_avx2      0.76
ff_hflip_short_avx512    0.40

Signed-off-by: Wu Jianhua
---
 libavfilter/x86/vf_hflip.asm    | 23 ++++++++++++++++++-----
 libavfilter/x86/vf_hflip_init.c |  8 ++++++++
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/libavfilter/x86/vf_hflip.asm b/libavfilter/x86/vf_hflip.asm
index 285618954f..c2237217f7 100644
--- a/libavfilter/x86/vf_hflip.asm
+++ b/libavfilter/x86/vf_hflip.asm
@@ -26,12 +26,16 @@ SECTION_RODATA

 pb_flip_byte: db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
 pb_flip_short: db 14,15,12,13,10,11,8,9,6,7,4,5,2,3,0,1
+pd_flip_indicies: dd 12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3

 SECTION .text

 ;%1 byte or short, %2 b or w, %3 size in byte (1 for byte, 2 for short)
 %macro HFLIP 3
 cglobal hflip_%1, 3, 5, 3, src, dst, w, r, x
+%if mmsize == 64
+    movu m3, [pd_flip_indicies]
+%endif
     VBROADCASTI128 m0, [pb_flip_%1]
     xor xq, xq
 %if %3 == 1
@@ -47,12 +51,15 @@ cglobal hflip_%1, 3, 5, 3, src, dst, w, r, x

     .loop0:
         neg xq
-%if mmsize == 32
-        vpermq m1, [srcq + xq - mmsize + %3], 0x4e; flip each lane at load
-        vpermq m2, [srcq + xq - 2 * mmsize + %3], 0x4e; flip each lane at load
+%if mmsize == 64
+        vpermd m1, m3, [srcq + xq - mmsize + %3]
+        vpermd m2, m3, [srcq + xq - 2 * mmsize + %3]
+%elif mmsize == 32
+        vpermq m1, [srcq + xq - mmsize + %3], 0x4e; flip each lane at load
+        vpermq m2, [srcq + xq - 2 * mmsize + %3], 0x4e; flip each lane at load
 %else
-        movu m1, [srcq + xq - mmsize + %3]
-        movu m2, [srcq + xq - 2 * mmsize + %3]
+        movu m1, [srcq + xq - mmsize + %3]
+        movu m2, [srcq + xq - 2 * mmsize + %3]
 %endif
         pshufb m1, m0
         pshufb m2, m0
@@ -88,3 +95,9 @@ INIT_YMM avx2
 HFLIP byte, b, 1
 HFLIP short, w, 2
 %endif
+
+%if HAVE_AVX512_EXTERNAL
+INIT_ZMM avx512
+HFLIP byte, b, 1
+HFLIP short, w, 2
+%endif
diff --git a/libavfilter/x86/vf_hflip_init.c b/libavfilter/x86/vf_hflip_init.c
index 0ac399b0d4..25fc40f7b0 100644
--- a/libavfilter/x86/vf_hflip_init.c
+++ b/libavfilter/x86/vf_hflip_init.c
@@ -25,8 +25,10 @@

 void ff_hflip_byte_ssse3(const uint8_t *src, uint8_t *dst, int w);
 void ff_hflip_byte_avx2(const uint8_t *src, uint8_t *dst, int w);
+void ff_hflip_byte_avx512(const uint8_t *src, uint8_t *dst, int w);
 void ff_hflip_short_ssse3(const uint8_t *src, uint8_t *dst, int w);
 void ff_hflip_short_avx2(const uint8_t *src, uint8_t *dst, int w);
+void ff_hflip_short_avx512(const uint8_t *src, uint8_t *dst, int w);

 av_cold void ff_hflip_init_x86(FlipContext *s, int step[4], int nb_planes)
 {
@@ -41,6 +43,9 @@ av_cold void ff_hflip_init_x86(FlipContext *s, int step[4], int nb_planes)
             if (EXTERNAL_AVX2_FAST(cpu_flags)) {
                 s->flip_line[i] = ff_hflip_byte_avx2;
             }
+            if (EXTERNAL_AVX512(cpu_flags)) {
+                s->flip_line[i] = ff_hflip_byte_avx512;
+            }
         } else if (step[i] == 2) {
             if (EXTERNAL_SSSE3(cpu_flags)) {
                 s->flip_line[i] = ff_hflip_short_ssse3;
@@ -48,6 +53,9 @@ av_cold void ff_hflip_init_x86(FlipContext *s, int step[4], int nb_planes)
             if (EXTERNAL_AVX2_FAST(cpu_flags)) {
                 s->flip_line[i] = ff_hflip_short_avx2;
             }
+            if (EXTERNAL_AVX512(cpu_flags)) {
+                s->flip_line[i] = ff_hflip_short_avx512;
+            }
         }
     }
 }
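
Both AVX-512 paths rest on one observation: pshufb only shuffles bytes inside each 128-bit lane, so reversing a whole register needs a cross-lane step first. The ymm path swaps its two lanes with vpermq ..., 0x4e. A zmm register has four lanes, so the patch loads pd_flip_indicies into m3 and uses vpermd to put the lanes in reverse order at load time, then applies the same in-lane pshufb. The C code below is a minimal scalar model of one 64-byte step, written only to check that reasoning. The helpers emu_vpermd, emu_pshufb, flip_block64 and flip_block64_is_reversal are hypothetical names, not FFmpeg functions, and the model ignores the loop's pointer arithmetic and the scalar tail loop. The three tables are copied from the patch's SECTION_RODATA.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Tables copied from SECTION_RODATA in the patch. */
static const uint8_t pb_flip_byte[16]  = {15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0};
static const uint8_t pb_flip_short[16] = {14,15,12,13,10,11,8,9,6,7,4,5,2,3,0,1};
static const uint32_t pd_flip_indicies[16] = {12,13,14,15,8,9,10,11,4,5,6,7,0,1,2,3};

/* Hypothetical scalar model of "vpermd zmm_dst, zmm_idx, src":
 * dword i of dst is dword idx[i] (low 4 bits) of src. */
static void emu_vpermd(uint8_t dst[64], const uint32_t idx[16], const uint8_t src[64])
{
    for (int i = 0; i < 16; i++)
        memcpy(dst + 4 * i, src + 4 * (idx[i] & 15), 4);
}

/* Hypothetical scalar model of "vpshufb zmm": each 128-bit lane is shuffled
 * independently with the same 16-byte control, as VBROADCASTI128 does for m0. */
static void emu_pshufb(uint8_t reg[64], const uint8_t ctl[16])
{
    uint8_t tmp[64];
    for (int lane = 0; lane < 4; lane++)
        for (int i = 0; i < 16; i++)
            tmp[16 * lane + i] = (ctl[i] & 0x80) ? 0 : reg[16 * lane + (ctl[i] & 15)];
    memcpy(reg, tmp, 64);
}

/* One 64-byte step of the zmm path: reverse the four lanes, then reverse inside each lane. */
static void flip_block64(uint8_t out[64], const uint8_t in[64], const uint8_t ctl[16])
{
    emu_vpermd(out, pd_flip_indicies, in);
    emu_pshufb(out, ctl);
}

/* Returns 1 if flip_block64() reverses a 64-byte block of elem_size-byte
 * elements (1 = 8-bit, 2 = 16-bit) while keeping byte order inside each element. */
static int flip_block64_is_reversal(int elem_size)
{
    assert(elem_size == 1 || elem_size == 2);
    const uint8_t *ctl = elem_size == 1 ? pb_flip_byte : pb_flip_short;
    uint8_t in[64], out[64];
    for (int i = 0; i < 64; i++)
        in[i] = (uint8_t)i;
    flip_block64(out, in, ctl);

    int n = 64 / elem_size;
    for (int e = 0; e < n; e++)
        for (int b = 0; b < elem_size; b++)
            if (out[e * elem_size + b] != in[(n - 1 - e) * elem_size + b])
                return 0;
    return 1;
}
```

Both flip_block64_is_reversal(1) and flip_block64_is_reversal(2) are expected to return 1. Replacing pd_flip_indicies with an identity table (0..15) makes both return 0, because pshufb on its own only reverses within each 16-byte lane. Since vpermd accepts a memory source, the lane reversal folds into the load, mirroring what the ymm path does with vpermq.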