From patchwork Fri Oct 28 18:55:08 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: James Darnley
X-Patchwork-Id: 39041
From: James Darnley
To: ffmpeg-devel@ffmpeg.org
Date: Fri, 28 Oct 2022 20:55:08 +0200
Message-Id: <20221028185508.625513-1-jdarnley@obe.tv>
X-Mailer: git-send-email 2.38.0
Subject: [FFmpeg-devel] [PATCH] avcodec/v210enc: add new function for avx2
 avx512 avx512icl
List-Id: FFmpeg development discussions and patches

Negligible speed difference for avx2 on Zen 2 (Ryzen 5700X) and
Broadwell (Xeon E5-2620 v4):
    1690±4.3 decicycles vs. 1693±78.4
    1439±31.1 decicycles vs 1429±16.7

Moderate speedup with avx512 on Skylake-X (Xeon D-2123IT):
    1.22x faster (793±0.8 vs. 649±5.5 decicycles) compared with avx2

Better speedup with avx512icl on Ice Lake (Xeon Silver 4316):
    1.77x faster (784±1.8 vs. 442±11.6 decicycles) compared with avx2

Co-authors:
    Henrik Gramner
    Kieran Kunhya
---
 libavcodec/x86/v210enc.asm    | 80 ++++++++++++++++++++++++++++++++++-
 libavcodec/x86/v210enc_init.c | 14 ++++++
 2 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/libavcodec/x86/v210enc.asm b/libavcodec/x86/v210enc.asm
index 965f2bea3c..afac238ede 100644
--- a/libavcodec/x86/v210enc.asm
+++ b/libavcodec/x86/v210enc.asm
@@ -21,7 +21,7 @@
 
 %include "libavutil/x86/x86util.asm"
 
-SECTION_RODATA 32
+SECTION_RODATA 64
 
 cextern pw_4
 %define v210_enc_min_10 pw_4
@@ -46,6 +46,16 @@ v210_enc_chroma_shuf2_8: times 2 db 3,-1,4,-1,5,-1,7,-1,11,-1,12,-1,13,-1,15,-1
 
 v210_enc_chroma_mult_8: times 2 dw 4,16,64,0,64,4,16,0
 
+v210enc_8_permb: db 32, 0,48,-1 ,  1,33, 2,-1 , 49, 3,34,-1 ,  4,50, 5,-1
+                 db 35, 6,51,-1 ,  7,36, 8,-1 , 52, 9,37,-1 , 10,53,11,-1
+                 db 38,12,54,-1 , 13,39,14,-1 , 55,15,40,-1 , 16,56,17,-1
+                 db 41,18,57,-1 , 19,42,20,-1 , 58,21,43,-1 , 22,59,23,-1
+v210enc_8_shufb: db  0, 8, 1,-1 ,  9, 2,10,-1 ,  3,11, 4,-1 , 12, 5,13,-1
+                 db  2,10, 3,-1 , 11, 4,12,-1 ,  5,13, 6,-1 , 14, 7,15,-1
+v210enc_8_permd: dd 0,1,4,5, 1,2,5,6
+v210enc_8_mult: db 4, 0, 64, 0
+v210enc_8_mask: dd 255<<12
+
 SECTION .text
 
 %macro v210_planar_pack_10 0
@@ -178,7 +188,73 @@ INIT_XMM avx
 v210_planar_pack_8
 %endif
 
+%macro v210_planar_pack_8_new 0
+
+cglobal v210_planar_pack_8, 5, 5, 7+notcpuflag(avx512icl), y, u, v, dst, width
+    add yq, widthq
+    shr widthq, 1
+    add uq, widthq
+    add vq, widthq
+    neg widthq
+
+    %if cpuflag(avx512icl)
+        mova m2, [v210enc_8_permb]
+    %else
+        mova m2, [v210enc_8_permd]
+    %endif
+    vpbroadcastd m3, [v210enc_8_mult]
+    VBROADCASTI128 m4, [v210_enc_min_8] ; only ymm sized
+    VBROADCASTI128 m5, [v210_enc_max_8] ; only ymm sized
+    vpbroadcastd m6, [v210enc_8_mask]
+    %if notcpuflag(avx512icl)
+        movu m7, [v210enc_8_shufb]
+    %endif
+
+    .loop:
+        %if cpuflag(avx512icl)
+            movu ym1, [yq + 2*widthq]
+            vinserti32x4 m1, [uq + 1*widthq], 2
+            vinserti32x4 m1, [vq + 1*widthq], 3
+            vpermb m1, m2, m1 ; uyv0 yuy0 vyu0 yvy0
+        %else
+            movq xm0, [uq + 1*widthq] ; uuuu uuxx
+            movq xm1, [vq + 1*widthq] ; vvvv vvxx
+            punpcklbw xm1, xm0, xm1 ; uvuv uvuv uvuv xxxx
+            vinserti128 m1, m1, [yq + 2*widthq], 1 ; uvuv uvuv uvuv xxxx yyyy yyyy yyyy xxxx
+            vpermd m1, m2, m1 ; uvuv uvxx yyyy yyxx xxuv uvuv xxyy yyyy
+            pshufb m1, m7 ; uyv0 yuy0 vyu0 yvy0
+        %endif
+        CLIPUB m1, m4, m5
+
+        pmaddubsw m0, m1, m3
+        pslld m1, 4
+        %if cpuflag(avx512)
+            vpternlogd m0, m1, m6, 0xd8 ; C?B:A
+        %else
+            pand m1, m6, m1
+            pandn m0, m6, m0
+            por m0, m0, m1
+        %endif
+
+        movu [dstq], m0
+        add dstq, mmsize
+        add widthq, (mmsize*3)/16
+    jl .loop
+RET
+
+%endmacro
+
 %if HAVE_AVX2_EXTERNAL
 INIT_YMM avx2
-v210_planar_pack_8
+v210_planar_pack_8_new
+%endif
+
+%if HAVE_AVX512_EXTERNAL
+INIT_YMM avx512
+v210_planar_pack_8_new
+%endif
+
+%if HAVE_AVX512ICL_EXTERNAL
+INIT_ZMM avx512icl
+v210_planar_pack_8_new
 %endif
diff --git a/libavcodec/x86/v210enc_init.c b/libavcodec/x86/v210enc_init.c
index 13a351dd1d..6e9f8c6e61 100644
--- a/libavcodec/x86/v210enc_init.c
+++ b/libavcodec/x86/v210enc_init.c
@@ -27,6 +27,10 @@ void ff_v210_planar_pack_8_avx(const uint8_t *y, const uint8_t *u,
                                const uint8_t *v, uint8_t *dst, ptrdiff_t width);
 void ff_v210_planar_pack_8_avx2(const uint8_t *y, const uint8_t *u,
                                 const uint8_t *v, uint8_t *dst, ptrdiff_t width);
+void ff_v210_planar_pack_8_avx512(const uint8_t *y, const uint8_t *u,
+                                  const uint8_t *v, uint8_t *dst, ptrdiff_t width);
+void ff_v210_planar_pack_8_avx512icl(const uint8_t *y, const uint8_t *u,
+                                     const uint8_t *v, uint8_t *dst, ptrdiff_t width);
 
 void ff_v210_planar_pack_10_ssse3(const uint16_t *y, const uint16_t *u,
                                   const uint16_t *v, uint8_t *dst, ptrdiff_t width);
@@ -52,4 +56,14 @@ av_cold void ff_v210enc_init_x86(V210EncContext *s)
         s->sample_factor_10 = 2;
         s->pack_line_10 = ff_v210_planar_pack_10_avx2;
     }
+
+    if (EXTERNAL_AVX512(cpu_flags)) {
+        s->sample_factor_8 = 2;
+        s->pack_line_8 = ff_v210_planar_pack_8_avx512;
+    }
+
+    if (EXTERNAL_AVX512ICL(cpu_flags)) {
+        s->sample_factor_8 = 4;
+        s->pack_line_8 = ff_v210_planar_pack_8_avx512icl;
+    }
 }
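
Reviewer note, not part of the patch: a scalar C sketch of what the new
loop appears to compute for one 32-bit output word, for anyone checking
the v210enc_8_mult / v210enc_8_mask constants by hand. It models the
pmaddubsw, pslld and vpternlogd (imm8 0xd8) steps on a single dword and
compares the result with the direct (a<<2)|(b<<12)|(c<<22) form. All
function names are hypothetical. The clip range [1, 254] is an assumption
about v210_enc_min_8 / v210_enc_max_8, which the patch uses but does not
define, and pad stands for whatever ends up in the fourth byte after the
shuffle.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed clip range for 8-bit samples; not defined in the patch itself. */
static uint8_t clip8(uint8_t x)
{
    return x < 1 ? 1 : (x > 254 ? 254 : x);
}

/* Direct form: three 8-bit samples scaled to 10 bits, packed in one word. */
static uint32_t pack_direct(uint8_t a, uint8_t b, uint8_t c)
{
    return ((uint32_t)clip8(a) << 2) |
           ((uint32_t)clip8(b) << 12) |
           ((uint32_t)clip8(c) << 22);
}

/* Bitwise model of a three-input truth-table lookup as done by vpternlogd:
 * for every bit position the index is (a << 2) | (b << 1) | c. */
static uint32_t ternlog(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1);
        r |= (uint32_t)((imm >> idx) & 1) << i;
    }
    return r;
}

/* One dword holding the bytes a, b, c, pad (lowest byte first). */
static uint32_t pack_simd_style(uint8_t a, uint8_t b, uint8_t c, uint8_t pad)
{
    a = clip8(a);
    b = clip8(b);
    c = clip8(c);

    /* pmaddubsw with byte multipliers {4, 0, 64, 0}: two 16-bit sums. */
    uint32_t lo = (uint32_t)a * 4 + (uint32_t)b * 0;    /* a << 2 */
    uint32_t hi = (uint32_t)c * 64 + (uint32_t)pad * 0; /* c << 6 */
    uint32_t m0 = lo | (hi << 16);

    /* pslld by 4 on the original dword moves b to bits 12..19. */
    uint32_t m1 = ((uint32_t)a | (uint32_t)b << 8 |
                   (uint32_t)c << 16 | (uint32_t)pad << 24) << 4;

    /* imm8 0xd8 selects m1 where the mask is set and m0 elsewhere (C?B:A). */
    return ternlog(m0, m1, 255u << 12, 0xd8);
}

/* Compare both forms over a coarse grid of inputs and two pad values. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int a = 0; a < 256; a += 17)
        for (int b = 0; b < 256; b += 17)
            for (int c = 0; c < 256; c += 17) {
                uint32_t ref = pack_direct((uint8_t)a, (uint8_t)b, (uint8_t)c);
                bad += pack_simd_style((uint8_t)a, (uint8_t)b, (uint8_t)c, 0x00) != ref;
                bad += pack_simd_style((uint8_t)a, (uint8_t)b, (uint8_t)c, 0xff) != ref;
            }
    return bad;
}

/* Convenience wrapper for use from a test harness. */
static void self_check(void)
{
    assert(count_mismatches() == 0);
}
```

Calling self_check() from a small test program exercises both forms. In
this model the fourth byte does not affect the result, since it is
multiplied by zero in the pmaddubsw step and falls outside the 255<<12
mask after the shift.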