From patchwork Thu Mar 7 07:02:52 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Michael Stoner
X-Patchwork-Id: 12234
From: Michael Stoner
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 6 Mar 2019 23:02:52 -0800
Message-Id: <20190307070252.4264-1-mdstoner23@yahoo.com>
X-Mailer: git-send-email 2.20.1.windows.1
Subject: [FFmpeg-devel] [PATCH] Revised ff_v210_planar_unpack AVX2
Cc: Michael Stoner

---
 libavcodec/v210dec.c       | 10 +++++-
 libavcodec/x86/v210-init.c |  8 +++++
 libavcodec/x86/v210.asm    | 63 ++++++++++++++++++++++++++++----------
 3 files changed, 64 insertions(+), 17 deletions(-)

diff --git a/libavcodec/v210dec.c b/libavcodec/v210dec.c
index ddc5dbe8be..26954c0df3 100644
--- a/libavcodec/v210dec.c
+++ b/libavcodec/v210dec.c
@@ -119,7 +119,7 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
         const uint32_t *src = (const uint32_t*)psrc;
         uint32_t val;

-        w = (avctx->width / 6) * 6;
+        w = (avctx->width / 12) * 12;
         s->unpack_frame(src, y, u, v, w);

         y += w;
@@ -127,6 +127,14 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
         v += w >> 1;
         src += (w << 1) / 3;

+        if (w < avctx->width - 5) {
+            READ_PIXELS(u, y, v);
+            READ_PIXELS(y, u, y);
+            READ_PIXELS(v, y, u);
+            READ_PIXELS(y, v, y);
+            w += 6;
+        }
+
         if (w < avctx->width - 1) {
             READ_PIXELS(u, y, v);

diff --git a/libavcodec/x86/v210-init.c b/libavcodec/x86/v210-init.c
index d64dbca1a8..cb9a6cbd6a 100644
--- a/libavcodec/x86/v210-init.c
+++ b/libavcodec/x86/v210-init.c
@@ -21,9 +21,11 @@

 extern void ff_v210_planar_unpack_unaligned_ssse3(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
 extern void ff_v210_planar_unpack_unaligned_avx(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
+extern void ff_v210_planar_unpack_unaligned_avx2(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);

 extern void ff_v210_planar_unpack_aligned_ssse3(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
 extern void ff_v210_planar_unpack_aligned_avx(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
+extern void ff_v210_planar_unpack_aligned_avx2(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);

 av_cold void ff_v210_x86_init(V210DecContext *s)
 {
@@ -36,6 +38,9 @@ av_cold void ff_v210_x86_init(V210DecContext *s)

         if (HAVE_AVX_EXTERNAL && cpu_flags & AV_CPU_FLAG_AVX)
             s->unpack_frame = ff_v210_planar_unpack_aligned_avx;
+
+        if (HAVE_AVX2_EXTERNAL && cpu_flags & AV_CPU_FLAG_AVX2)
+            s->unpack_frame = ff_v210_planar_unpack_aligned_avx2;
     }
     else {
         if (cpu_flags & AV_CPU_FLAG_SSSE3)
@@ -43,6 +48,9 @@ av_cold void ff_v210_x86_init(V210DecContext *s)

         if (HAVE_AVX_EXTERNAL && cpu_flags & AV_CPU_FLAG_AVX)
             s->unpack_frame = ff_v210_planar_unpack_unaligned_avx;
+
+        if (HAVE_AVX2_EXTERNAL && cpu_flags & AV_CPU_FLAG_AVX2)
+            s->unpack_frame = ff_v210_planar_unpack_unaligned_avx2;
     }
 #endif
 }
diff --git a/libavcodec/x86/v210.asm b/libavcodec/x86/v210.asm
index c24c765e5b..064185354f 100644
--- a/libavcodec/x86/v210.asm
+++ b/libavcodec/x86/v210.asm
@@ -22,9 +22,12 @@

 %include "libavutil/x86/x86util.asm"

-SECTION_RODATA
+SECTION_RODATA 32
+
+; for AVX2 version only
+v210_luma_permute: dd 0,1,2,4,5,6,7,7  ; 32-byte alignment required
+v210_chroma_shuf2: db 0,1,2,3,4,5,8,9,10,11,12,13,-1,-1,-1,-1

-v210_mask: times 4 dd 0x3ff
 v210_mult: dw 64,4,64,4,64,4,64,4
 v210_luma_shuf: db 8,9,0,1,2,3,12,13,4,5,6,7,-1,-1,-1,-1
 v210_chroma_shuf: db 0,1,8,9,6,7,-1,-1,2,3,4,5,12,13,-1,-1
@@ -34,40 +37,58 @@ SECTION .text
 %macro v210_planar_unpack 1

 ; v210_planar_unpack(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width)
-cglobal v210_planar_unpack_%1, 5, 5, 7
+cglobal v210_planar_unpack_%1, 5, 5, 8
     movsxdifnidn r4, r4d
     lea    r1, [r1+2*r4]
     add    r2, r4
     add    r3, r4
     neg    r4

-    mova   m3, [v210_mult]
-    mova   m4, [v210_mask]
-    mova   m5, [v210_luma_shuf]
-    mova   m6, [v210_chroma_shuf]
+    VBROADCASTI128 m3, [v210_mult]
+    VBROADCASTI128 m4, [v210_luma_shuf]
+    VBROADCASTI128 m5, [v210_chroma_shuf]
+
+%if cpuflag(avx2)
+    mova   m6, [v210_luma_permute]
+    VBROADCASTI128 m7, [v210_chroma_shuf2]
+%endif
+
 .loop:
 %ifidn %1, unaligned
-    movu   m0, [r0]
+    movu   m0, [r0]             ; yB v5 yA u5 y9 v4 y8 u4 y7 v3 y6 u3 y5 v2 y4 u2 y3 v1 y2 u1 y1 v0 y0 u0
 %else
     mova   m0, [r0]
 %endif

     pmullw m1, m0, m3
-    psrld  m0, 10
-    psrlw  m1, 6                ; u0 v0 y1 y2 v1 u2 y4 y5
-    pand   m0, m4               ; y0 __ u1 __ y3 __ v2 __
+    pslld  m0, 12
+    psrlw  m1, 6                ; yB yA u5 v4 y8 y7 v3 u3 y5 y4 u2 v1 y2 y1 v0 u0
+    psrld  m0, 22               ; 00 v5 00 y9 00 u4 00 y6 00 v2 00 y3 00 u1 00 y0
+
+    shufps m2, m1, m0, 0x8d     ; 00 y9 00 y6 yB yA y8 y7 00 y3 00 y0 y5 y4 y2 y1
+    pshufb m2, m4               ; 00 00 yB yA y9 y8 y7 y6 00 00 y5 y4 y3 y2 y1 y0
+
+%if cpuflag(avx2)
+    vpermd m2, m6, m2           ; 00 00 00 00 yB yA y9 y8 y7 y6 y5 y4 y3 y2 y1 y0
+%endif

-    shufps m2, m1, m0, 0x8d     ; y1 y2 y4 y5 y0 __ y3 __
-    pshufb m2, m5               ; y0 y1 y2 y3 y4 y5 __ __
     movu   [r1+2*r4], m2

-    shufps m1, m0, 0xd8         ; u0 v0 v1 u2 u1 __ v2 __
-    pshufb m1, m6               ; u0 u1 u2 __ v0 v1 v2 __
+    shufps m1, m0, 0xd8         ; 00 v5 00 u4 u5 v4 v3 u3 00 v2 00 u1 u2 v1 v0 u0
+    pshufb m1, m5               ; 00 v5 v4 v3 00 u5 u4 u3 00 v2 v1 v0 00 u2 u1 u0
+
+%if cpuflag(avx2)
+    vpermq m1, m1, 0xd8         ; 00 v5 v4 v3 00 v2 v1 v0 00 u5 u4 u3 00 u2 u1 u0
+    pshufb m1, m7               ; 00 00 v5 v4 v3 v2 v1 v0 00 00 u5 u4 u3 u2 u1 u0
+    movu   [r2+r4], xm1
+    vextracti128 [r3+r4], m1, 1
+%else
     movq   [r2+r4], m1
     movhps [r3+r4], m1
+%endif

     add r0, mmsize
-    add r4, 6
+    add r4, (mmsize*3)/8
     jl  .loop

     REP_RET
@@ -81,6 +102,11 @@ INIT_XMM avx
 v210_planar_unpack unaligned
 %endif

+%if HAVE_AVX2_EXTERNAL
+INIT_YMM avx2
+v210_planar_unpack unaligned
+%endif
+
 INIT_XMM ssse3
 v210_planar_unpack aligned

@@ -88,3 +114,8 @@ v210_planar_unpack aligned
 INIT_XMM avx
 v210_planar_unpack aligned
 %endif
+
+%if HAVE_AVX2_EXTERNAL
+INIT_YMM avx2
+v210_planar_unpack aligned
+%endif
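
Reviewer note (not part of the patch): the scalar sketch below may help when checking the C-side change above. It mirrors the four READ_PIXELS calls of the new 6-pixel tail block and the new rounding of w to a multiple of 12, which is the number of pixels one 32-byte AVX2 iteration consumes ((mmsize*3)/8 with mmsize = 32). The names v210_read_word, v210_unpack_group, V210Split and v210_split_width are hypothetical and exist only for illustration; the sketch also assumes the 32-bit source words are already in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* One v210 word carries three 10-bit samples in bits 0-9, 10-19 and 20-29.
 * Assumption: 'word' is already in host byte order. */
static void v210_read_word(uint32_t word, uint16_t *a, uint16_t *b, uint16_t *c)
{
    *a =  word        & 0x3ff;
    *b = (word >> 10) & 0x3ff;
    *c = (word >> 20) & 0x3ff;
}

/* Unpack one 6-pixel group (four words, 16 bytes) in the same component
 * order as the patch's tail block:
 *   word 0: u0 y0 v0    word 1: y1 u1 y2
 *   word 2: v1 y3 u2    word 3: y4 v2 y5 */
static void v210_unpack_group(const uint32_t src[4],
                              uint16_t y[6], uint16_t u[3], uint16_t v[3])
{
    v210_read_word(src[0], &u[0], &y[0], &v[0]);
    v210_read_word(src[1], &y[1], &u[1], &y[2]);
    v210_read_word(src[2], &v[1], &y[3], &u[2]);
    v210_read_word(src[3], &y[4], &v[2], &y[5]);
}

/* Hypothetical helper: how a row of 'width' pixels is divided up. */
typedef struct V210Split {
    int simd;    /* pixels handed to unpack_frame (multiple of 12)    */
    int group6;  /* 6 if the extra scalar 6-pixel block runs, else 0  */
    int tail;    /* pixels left for the existing partial-group code   */
} V210Split;

static V210Split v210_split_width(int width)
{
    V210Split s;
    assert(width >= 0);
    s.simd   = (width / 12) * 12;             /* w = (avctx->width / 12) * 12 */
    s.group6 = (s.simd < width - 5) ? 6 : 0;  /* if (w < avctx->width - 5)    */
    s.tail   = width - s.simd - s.group6;
    return s;
}
```

For a 1290-pixel row this gives 1284 pixels to the SIMD routine, one scalar 6-pixel group, and no remaining tail; for 1920 the SIMD routine covers the whole row.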