From patchwork Wed Apr 10 12:47:49 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Darnley
X-Patchwork-Id: 12690
From: James Darnley
To: ffmpeg-devel@ffmpeg.org
Cc: Michael Stoner
Date: Wed, 10 Apr 2019 14:47:49 +0200
Message-Id: <20190410124749.9362-4-jdarnley@obe.tv>
X-Mailer: git-send-email 2.21.0
In-Reply-To: <20190410124749.9362-1-jdarnley@obe.tv>
References: <20190410124749.9362-1-jdarnley@obe.tv>
Subject: [FFmpeg-devel] [PATCH 3/3] libavcodec Adding ff_v210_planar_unpack AVX2

From: Michael Stoner

Replaced VSHUFPS with VPBLENDD to relieve port 5 bottleneck
AVX2 is 1.4x faster than AVX
---

Mike, is this still the patch you want applied?  I had to make a small
amendment to it because you had some tabs as indentation.

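For reviewers, a minimal scalar sketch (not part of the patch) of how the
patched decode_frame() appears to split each row: the SIMD routine gets a
width rounded down to a multiple of 12, one optional 6-pixel block is read
with four READ_PIXELS calls, and whatever is left goes to the existing tail
code.  The names row_split and split_row are hypothetical and exist only for
this illustration.

```c
/* Hypothetical helper, for illustration only; it mirrors the width
 * arithmetic of the v210dec.c hunk in this patch. */
struct row_split {
    int simd_pixels;   /* passed to s->unpack_frame, a multiple of 12 */
    int block_pixels;  /* 0 or 6, read by four READ_PIXELS calls      */
    int tail_pixels;   /* remainder left for the existing tail code   */
};

struct row_split split_row(int width)
{
    struct row_split r;
    int w = (width / 12) * 12;      /* was (width / 6) * 6 before the patch */

    r.simd_pixels  = w;
    r.block_pixels = 0;
    if (w < width - 5) {            /* at least 6 pixels remain */
        r.block_pixels = 6;
        w += 6;
    }
    r.tail_pixels = width - w;
    return r;
}
```

With this model, split_row(1920) leaves everything to the SIMD routine, while
split_row(1280) gives 1272 SIMD pixels, one 6-pixel block and a 2-pixel tail.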
 libavcodec/v210dec.c       | 10 +++++-
 libavcodec/x86/v210-init.c |  8 +++++
 libavcodec/x86/v210.asm    | 72 +++++++++++++++++++++++++++++---------
 3 files changed, 73 insertions(+), 17 deletions(-)

diff --git a/libavcodec/v210dec.c b/libavcodec/v210dec.c
index fd8a6b0d78..bc1e1d34ff 100644
--- a/libavcodec/v210dec.c
+++ b/libavcodec/v210dec.c
@@ -123,7 +123,7 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
         const uint32_t *src = (const uint32_t*)psrc;
         uint32_t val;
 
-        w = (avctx->width / 6) * 6;
+        w = (avctx->width / 12) * 12;
         s->unpack_frame(src, y, u, v, w);
 
         y += w;
@@ -131,6 +131,14 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
         v += w >> 1;
         src += (w << 1) / 3;
 
+        if (w < avctx->width - 5) {
+            READ_PIXELS(u, y, v);
+            READ_PIXELS(y, u, y);
+            READ_PIXELS(v, y, u);
+            READ_PIXELS(y, v, y);
+            w += 6;
+        }
+
         if (w < avctx->width - 1) {
             READ_PIXELS(u, y, v);
 
diff --git a/libavcodec/x86/v210-init.c b/libavcodec/x86/v210-init.c
index d64dbca1a8..cb9a6cbd6a 100644
--- a/libavcodec/x86/v210-init.c
+++ b/libavcodec/x86/v210-init.c
@@ -21,9 +21,11 @@
 
 extern void ff_v210_planar_unpack_unaligned_ssse3(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
 extern void ff_v210_planar_unpack_unaligned_avx(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
+extern void ff_v210_planar_unpack_unaligned_avx2(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
 
 extern void ff_v210_planar_unpack_aligned_ssse3(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
 extern void ff_v210_planar_unpack_aligned_avx(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
+extern void ff_v210_planar_unpack_aligned_avx2(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
 
 av_cold void ff_v210_x86_init(V210DecContext *s)
 {
@@ -36,6 +38,9 @@ av_cold void ff_v210_x86_init(V210DecContext *s)
 
         if (HAVE_AVX_EXTERNAL && cpu_flags & AV_CPU_FLAG_AVX)
             s->unpack_frame = ff_v210_planar_unpack_aligned_avx;
+
+        if (HAVE_AVX2_EXTERNAL && cpu_flags & AV_CPU_FLAG_AVX2)
+            s->unpack_frame = ff_v210_planar_unpack_aligned_avx2;
     }
     else {
         if (cpu_flags & AV_CPU_FLAG_SSSE3)
@@ -43,6 +48,9 @@ av_cold void ff_v210_x86_init(V210DecContext *s)
 
         if (HAVE_AVX_EXTERNAL && cpu_flags & AV_CPU_FLAG_AVX)
             s->unpack_frame = ff_v210_planar_unpack_unaligned_avx;
+
+        if (HAVE_AVX2_EXTERNAL && cpu_flags & AV_CPU_FLAG_AVX2)
+            s->unpack_frame = ff_v210_planar_unpack_unaligned_avx2;
     }
 #endif
 }
diff --git a/libavcodec/x86/v210.asm b/libavcodec/x86/v210.asm
index c24c765e5b..706712313d 100644
--- a/libavcodec/x86/v210.asm
+++ b/libavcodec/x86/v210.asm
@@ -22,9 +22,14 @@
 
 %include "libavutil/x86/x86util.asm"
 
-SECTION_RODATA
+SECTION_RODATA 32
+
+; for AVX2 version only
+v210_luma_permute: dd 0,1,2,4,5,6,7,7  ; 32-byte alignment required
+v210_chroma_shuf2: db 0,1,2,3,4,5,8,9,10,11,12,13,-1,-1,-1,-1
+v210_luma_shuf_avx2: db 0,1,4,5,6,7,8,9,12,13,14,15,-1,-1,-1,-1
+v210_chroma_shuf_avx2: db 0,1,4,5,10,11,-1,-1,2,3,8,9,12,13,-1,-1
 
-v210_mask: times 4 dd 0x3ff
 v210_mult: dw 64,4,64,4,64,4,64,4
 v210_luma_shuf: db 8,9,0,1,2,3,12,13,4,5,6,7,-1,-1,-1,-1
 v210_chroma_shuf: db 0,1,8,9,6,7,-1,-1,2,3,4,5,12,13,-1,-1
@@ -34,40 +39,65 @@ SECTION .text
 %macro v210_planar_unpack 1
 
 ; v210_planar_unpack(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width)
-cglobal v210_planar_unpack_%1, 5, 5, 7
+cglobal v210_planar_unpack_%1, 5, 5, 8
     movsxdifnidn r4, r4d
     lea    r1, [r1+2*r4]
     add    r2, r4
     add    r3, r4
     neg    r4
 
-    mova   m3, [v210_mult]
-    mova   m4, [v210_mask]
-    mova   m5, [v210_luma_shuf]
-    mova   m6, [v210_chroma_shuf]
+    VBROADCASTI128   m3, [v210_mult]
+    VBROADCASTI128   m5, [v210_chroma_shuf]
+
+%if cpuflag(avx2)
+    VBROADCASTI128   m4, [v210_luma_shuf_avx2]
+    VBROADCASTI128   m5, [v210_chroma_shuf_avx2]
+    mova             m6, [v210_luma_permute]
+    VBROADCASTI128   m7, [v210_chroma_shuf2]
+%else
+    VBROADCASTI128   m4, [v210_luma_shuf]
+    VBROADCASTI128   m5, [v210_chroma_shuf]
+%endif
+
 .loop:
 %ifidn %1, unaligned
-    movu   m0, [r0]
+    movu   m0, [r0]                    ; yB v5 yA u5 y9 v4 y8 u4 y7 v3 y6 u3 y5 v2 y4 u2 y3 v1 y2 u1 y1 v0 y0 u0
 %else
     mova   m0, [r0]
 %endif
 
     pmullw m1, m0, m3
-    psrld  m0, 10
-    psrlw  m1, 6  ; u0 v0 y1 y2 v1 u2 y4 y5
-    pand   m0, m4 ; y0 __ u1 __ y3 __ v2 __
+    pslld  m0, 12
+    psrlw  m1, 6                       ; yB yA u5 v4 y8 y7 v3 u3 y5 y4 u2 v1 y2 y1 v0 u0
+    psrld  m0, 22                      ; 00 v5 00 y9 00 u4 00 y6 00 v2 00 y3 00 u1 00 y0
+
+%if cpuflag(avx2)
+    vpblendd m2, m1, m0, 0x55          ; yB yA 00 y9 y8 y7 00 y6 y5 y4 00 y3 y2 y1 00 y0
+    pshufb m2, m4                      ; 00 00 yB yA y9 y8 y7 y6 00 00 y5 y4 y3 y2 y1 y0
+    vpermd m2, m6, m2                  ; 00 00 00 00 yB yA y9 y8 y7 y6 y5 y4 y3 y2 y1 y0
+    movu   [r1+2*r4], m2
 
-    shufps m2, m1, m0, 0x8d ; y1 y2 y4 y5 y0 __ y3 __
-    pshufb m2, m5 ; y0 y1 y2 y3 y4 y5 __ __
+    vpblendd m1, m0, 0xaa              ; 00 v5 u5 v4 00 u4 v3 u3 00 v2 u2 v1 00 u1 v0 u0
+    pshufb m1, m5                      ; 00 v5 v4 v3 00 u5 u4 u3 00 v2 v1 v0 00 u2 u1 u0
+    vpermq m1, m1, 0xd8                ; 00 v5 v4 v3 00 v2 v1 v0 00 u5 u4 u3 00 u2 u1 u0
+    pshufb m1, m7                      ; 00 00 v5 v4 v3 v2 v1 v0 00 00 u5 u4 u3 u2 u1 u0
+
+    movu   [r2+r4], xm1
+    vextracti128 [r3+r4], m1, 1
+%else
+    shufps m2, m1, m0, 0x8d            ; 00 y9 00 y6 yB yA y8 y7 00 y3 00 y0 y5 y4 y2 y1
+    pshufb m2, m4                      ; 00 00 yB yA y9 y8 y7 y6 00 00 y5 y4 y3 y2 y1 y0
     movu   [r1+2*r4], m2
 
-    shufps m1, m0, 0xd8 ; u0 v0 v1 u2 u1 __ v2 __
-    pshufb m1, m6 ; u0 u1 u2 __ v0 v1 v2 __
+    shufps m1, m0, 0xd8                ; 00 v5 00 u4 u5 v4 v3 u3 00 v2 00 u1 u2 v1 v0 u0
+    pshufb m1, m5                      ; 00 v5 v4 v3 00 u5 u4 u3 00 v2 v1 v0 00 u2 u1 u0
+
     movq   [r2+r4], m1
     movhps [r3+r4], m1
+%endif
 
     add r0, mmsize
-    add r4, 6
+    add r4, (mmsize*3)/8
     jl  .loop
 
     REP_RET
@@ -81,6 +111,11 @@ INIT_XMM avx
 v210_planar_unpack unaligned
 %endif
 
+%if HAVE_AVX2_EXTERNAL
+INIT_YMM avx2
+v210_planar_unpack unaligned
+%endif
+
 INIT_XMM ssse3
 v210_planar_unpack aligned
 
@@ -88,3 +123,8 @@ v210_planar_unpack aligned
 INIT_XMM avx
 v210_planar_unpack aligned
 %endif
+
+%if HAVE_AVX2_EXTERNAL
+INIT_YMM avx2
+v210_planar_unpack aligned
+%endif
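
Review note: the per-word arithmetic at the top of .loop can be checked with a
small scalar model.  The sketch below is an illustration only, not code from
FFmpeg, and the names unpack_ref and unpack_lanes are hypothetical.  It
assumes each 32-bit v210 word holds three 10-bit samples in bits 0-9, 10-19
and 20-29, which is what READ_PIXELS in v210dec.c extracts.  It models pmullw
with v210_mult followed by psrlw 6 for the outer samples, and pslld 12
followed by psrld 22 for the middle sample (the pair that replaces the old
psrld 10 / pand v210_mask).

```c
#include <stdint.h>

/* Reference extraction of the three 10-bit samples with shifts and masks. */
void unpack_ref(uint32_t word, uint16_t out[3])
{
    out[0] = word & 0x3ff;
    out[1] = (word >> 10) & 0x3ff;
    out[2] = (word >> 20) & 0x3ff;
}

/* Scalar model of one 32-bit lane of the SIMD loop (illustrative only). */
void unpack_lanes(uint32_t word, uint16_t out[3])
{
    uint16_t lo = (uint16_t)word;          /* low 16-bit lane of the dword  */
    uint16_t hi = (uint16_t)(word >> 16);  /* high 16-bit lane of the dword */

    /* pmullw keeps the low 16 bits of each product: multiplying by 64 or 4
     * pushes the unwanted upper bits out of the lane. */
    uint16_t lo_mul = (uint16_t)(lo * 64u);
    uint16_t hi_mul = (uint16_t)(hi * 4u);

    /* psrlw 6 drops the unwanted lower bits and right-aligns the sample. */
    out[0] = (uint16_t)(lo_mul >> 6);
    out[2] = (uint16_t)(hi_mul >> 6);

    /* pslld 12 discards bits 20-31, psrld 22 right-aligns bits 10-19. */
    out[1] = (uint16_t)((uint32_t)(word << 12) >> 22);
}
```

Both functions should agree for every input word, including words whose two
padding bits (30-31) are set, since the multiply and the left shift discard
those bits without needing a mask constant.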