From patchwork Fri Nov 25 15:17:18 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: James Darnley
X-Patchwork-Id: 39443
From: James Darnley
To: ffmpeg-devel@ffmpeg.org
Date: Fri, 25 Nov 2022 16:17:18 +0100
Message-Id: <20221125151720.1655051-3-jdarnley@obe.tv>
X-Mailer: git-send-email 2.38.0
In-Reply-To: <20221121124408.1577897-1-jdarnley@obe.tv>
References: <20221121124408.1577897-1-jdarnley@obe.tv>
Subject: [FFmpeg-devel] [PATCH v2 3/5] avcodec/v210enc: add new 10-bit function for avx512 avx512icl
Precedence: list
List-Id: FFmpeg development discussions and patches

avx512 on Skylake-X (Xeon
D-2123IT): 1.19x faster (970±91.2 vs. 817±104.4 decicycles) compared with avx2
avx512icl on Ice Lake (Xeon Silver 4316): 2.52x faster (1350±5.3 vs. 535±9.5 decicycles) compared with avx2
---
 libavcodec/x86/v210enc.asm    | 99 +++++++++++++++++++++++++++++++++++
 libavcodec/x86/v210enc_init.c | 12 +++++
 2 files changed, 111 insertions(+)

diff --git a/libavcodec/x86/v210enc.asm b/libavcodec/x86/v210enc.asm
index c2ad3d72c0..552164a8be 100644
--- a/libavcodec/x86/v210enc.asm
+++ b/libavcodec/x86/v210enc.asm
@@ -56,6 +56,36 @@ v210enc_8_permd: dd 0,1,4,5, 1,2,5,6
 v210enc_8_mult: db 4, 0, 64, 0
 v210enc_8_mask: dd 255<<12
 
+icl_perm_y: ; vpermb does not set bytes to zero when the high bit is set unlike pshufb
+%assign i 0
+%rep 8
+    db -1,i+0,i+1,-1 , i+2,i+3,i+4,i+5
+    %assign i i+6
+%endrep
+
+icl_perm_uv: ; vpermb does not set bytes to zero when the high bit is set unlike pshufb
+%assign i 0
+%rep 4
+    db i+0,i+1,i+32,i+33 , -1,i+2,i+3,-1 , i+34,i+35,i+4,i+5 , -1,i+36,i+37,-1
+    %assign i i+6
+%endrep
+
+icl_perm_y_kmask: times 8 db 0b1111_0110
+icl_perm_uv_kmask: times 8 db 0b0110_1111
+
+icl_shift_y: times 10 dw 2,0,4
+             times 4 db 0 ; padding to 64 bytes
+icl_shift_uv: times 5 dw 0,2,4
+              times 2 db 0 ; padding to 32 bytes
+              times 5 dw 4,0,2
+              times 2 db 0 ; padding to 32 bytes
+
+v210enc_10_permd_y: dd 0,1,2,-1 , 3,4,5,-1
+v210enc_10_shufb_y: db -1,0,1,-1 , 2,3,4,5 , -1,6,7,-1 , 8,9,10,11
+v210enc_10_permd_uv: dd 0,1,4,5 , 1,2,5,6
+v210enc_10_shufb_uv: db 0,1, 8, 9 , -1,2,3,-1 , 10,11,4,5 , -1,12,13,-1
+                     db 2,3,10,11 , -1,4,5,-1 , 12,13,6,7 , -1,14,15,-1
+
 SECTION .text
 
 %macro v210_planar_pack_10 0
@@ -113,6 +143,75 @@ INIT_YMM avx2
 v210_planar_pack_10
 %endif
 
+%macro v210_planar_pack_10_new 0
+
+cglobal v210_planar_pack_10, 5, 5, 8+2*notcpuflag(avx512icl), y, u, v, dst, width
+    lea yq, [yq+2*widthq]
+    add uq, widthq
+    add vq, widthq
+    neg widthq
+
+    %if cpuflag(avx512icl)
+        movu m6, [icl_perm_y]
+        movu m7, [icl_perm_uv]
+        kmovq k1, [icl_perm_y_kmask]
+        kmovq k2, [icl_perm_uv_kmask]
+    %else
+        movu m6, [v210enc_10_permd_y]
+        VBROADCASTI128 m7, [v210enc_10_shufb_y]
+        movu m8, [v210enc_10_permd_uv]
+        movu m9, [v210enc_10_shufb_uv]
+    %endif
+    movu m2, [icl_shift_y]
+    movu m3, [icl_shift_uv]
+    VBROADCASTI128 m4, [v210_enc_min_10] ; only ymm sized
+    VBROADCASTI128 m5, [v210_enc_max_10] ; only ymm sized
+
+    .loop:
+        movu m0, [yq + widthq*2]
+        %if cpuflag(avx512icl)
+            movu ym1, [uq + widthq*1]
+            vinserti32x8 zm1, [vq + widthq*1], 1
+        %else
+            movu xm1, [uq + widthq*1]
+            vinserti128 ym1, [vq + widthq*1], 1
+        %endif
+        CLIPW m0, m4, m5
+        CLIPW m1, m4, m5
+
+        vpsllvw m0, m2
+        vpsllvw m1, m3
+        %if cpuflag(avx512icl)
+            vpermb m0{k1}{z}, m6, m0 ; make space for uv where the k-mask sets to zero
+            vpermb m1{k2}{z}, m7, m1 ; interleave uv and make space for y where the k-mask sets to zero
+        %else
+            vpermd m0, m6, m0
+            pshufb m0, m7
+            vpermd m1, m8, m1
+            pshufb m1, m9
+        %endif
+        por m0, m1
+
+        movu [dstq], m0
+        add dstq, mmsize
+        add widthq, (mmsize*3)/8
+    jl .loop
+RET
+
+%endmacro
+
+%if ARCH_X86_64
+%if HAVE_AVX512_EXTERNAL
+INIT_YMM avx512
+v210_planar_pack_10_new
+%endif
+%endif
+
+%if HAVE_AVX512ICL_EXTERNAL
+INIT_ZMM avx512icl
+v210_planar_pack_10_new
+%endif
+
 %macro v210_planar_pack_8 0
 
 ; v210_planar_pack_8(const uint8_t *y, const uint8_t *u, const uint8_t *v, uint8_t *dst, ptrdiff_t width)
diff --git a/libavcodec/x86/v210enc_init.c b/libavcodec/x86/v210enc_init.c
index 6e9f8c6e61..44f22ca7fe 100644
--- a/libavcodec/x86/v210enc_init.c
+++ b/libavcodec/x86/v210enc_init.c
@@ -37,6 +37,12 @@ void ff_v210_planar_pack_10_ssse3(const uint16_t *y, const uint16_t *u,
 void ff_v210_planar_pack_10_avx2(const uint16_t *y, const uint16_t *u,
                                  const uint16_t *v, uint8_t *dst,
                                  ptrdiff_t width);
+void ff_v210_planar_pack_10_avx512(const uint16_t *y, const uint16_t *u,
+                                   const uint16_t *v, uint8_t *dst,
+                                   ptrdiff_t width);
+void ff_v210_planar_pack_10_avx512icl(const uint16_t *y, const uint16_t *u,
+                                      const uint16_t *v, uint8_t *dst,
+                                      ptrdiff_t width);
 
 av_cold void ff_v210enc_init_x86(V210EncContext *s)
 {
@@ -60,10 +66,16 @@ av_cold void ff_v210enc_init_x86(V210EncContext *s)
     if (EXTERNAL_AVX512(cpu_flags)) {
         s->sample_factor_8 = 2;
         s->pack_line_8 = ff_v210_planar_pack_8_avx512;
+#if ARCH_X86_64
+        s->sample_factor_10 = 2;
+        s->pack_line_10 = ff_v210_planar_pack_10_avx512;
+#endif
     }
 
     if (EXTERNAL_AVX512ICL(cpu_flags)) {
         s->sample_factor_8 = 4;
         s->pack_line_8 = ff_v210_planar_pack_8_avx512icl;
+        s->sample_factor_10 = 4;
+        s->pack_line_10 = ff_v210_planar_pack_10_avx512icl;
     }
 }
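
Note for readers of the tables above: the shift and permute constants are easier to follow next to a scalar model. The C sketch below is not part of the patch. It packs one group of 6 pixels twice: once with plain 10-bit shifts into four little-endian dwords, and once the way the new code appears to work, by pre-shifting each 16-bit sample by 0, 2 or 4 bits and ORing it into a byte position. The helper names (clip10, put_lane, v210_pack6_ref, v210_pack6_lanes) are hypothetical, the byte offsets are read off the first 16 bytes of icl_perm_y and icl_perm_uv, and the clamp range 4..1019 is an assumption, since the patch only names v210_enc_min_10 and v210_enc_max_10 without showing their values.

```c
#include <stdint.h>
#include <string.h>

/* Assumed clamp range for 10-bit samples (4..1019); the patch references
 * v210_enc_min_10 / v210_enc_max_10 but does not show their values. */
static uint32_t clip10(uint16_t s)
{
    return s < 4 ? 4 : s > 1019 ? 1019 : s;
}

/* Reference packing: three 10-bit samples per little-endian dword,
 * in the order (U0 Y0 V0) (Y1 U1 Y2) (V1 Y3 U2) (Y4 V2 Y5). */
static void v210_pack6_ref(const uint16_t *y, const uint16_t *u,
                           const uint16_t *v, uint8_t *dst)
{
    const uint32_t w[4] = {
        clip10(u[0]) | clip10(y[0]) << 10 | clip10(v[0]) << 20,
        clip10(y[1]) | clip10(u[1]) << 10 | clip10(y[2]) << 20,
        clip10(v[1]) | clip10(y[3]) << 10 | clip10(u[2]) << 20,
        clip10(y[4]) | clip10(v[2]) << 10 | clip10(y[5]) << 20,
    };
    for (int i = 0; i < 4; i++)
        for (int b = 0; b < 4; b++)
            dst[4 * i + b] = (uint8_t)(w[i] >> (8 * b));
}

/* OR one pre-shifted 16-bit lane into two consecutive output bytes. */
static void put_lane(uint8_t *dst, int byte_off, uint16_t s, int shift)
{
    uint32_t lane = clip10(s) << shift; /* at most 1019 << 4, fits in 16 bits */
    dst[byte_off]     |= (uint8_t)lane;
    dst[byte_off + 1] |= (uint8_t)(lane >> 8);
}

/* Scalar model of the shift + byte-permute + OR approach. */
static void v210_pack6_lanes(const uint16_t *y, const uint16_t *u,
                             const uint16_t *v, uint8_t *dst)
{
    static const int y_shift[3] = { 2, 0, 4 };          /* icl_shift_y */
    static const int u_shift[3] = { 0, 2, 4 };          /* icl_shift_uv, first half */
    static const int v_shift[3] = { 4, 0, 2 };          /* icl_shift_uv, second half */
    static const int y_off[6] = { 1, 4, 6, 9, 12, 14 }; /* read off icl_perm_y */
    static const int u_off[3] = { 0, 5, 10 };           /* read off icl_perm_uv */
    static const int v_off[3] = { 2, 8, 13 };           /* read off icl_perm_uv */

    memset(dst, 0, 16); /* zeroed bytes play the role of the k-mask zeroing */
    for (int i = 0; i < 6; i++)
        put_lane(dst, y_off[i], y[i], y_shift[i % 3]);
    for (int i = 0; i < 3; i++) {
        put_lane(dst, u_off[i], u[i], u_shift[i]);
        put_lane(dst, v_off[i], v[i], v_shift[i]);
    }
}
```

Both functions write 16 bytes for the same 6-pixel group and should agree byte for byte. Neighbouring samples share bytes but never bits, which is why a single por is enough to merge the luma and chroma halves.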