From patchwork Sun Jun 25 14:42:08 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ilia
X-Patchwork-Id: 4109
From: Ilia Valiakhmetov
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 25 Jun 2017 21:42:08 +0700
Message-Id: <20170625144208.4428-1-zakne0ne@gmail.com>
X-Mailer: git-send-email 2.8.3
Subject: [FFmpeg-devel] [PATCH] avcodec/vp9: add 64-bit ipred_dr_32x32_16 avx2 implementation
Cc: Ilia Valiakhmetov

vp9_diag_downright_32x32_12bpp_c: 429.7
vp9_diag_downright_32x32_12bpp_sse2: 158.9
vp9_diag_downright_32x32_12bpp_ssse3: 144.6
vp9_diag_downright_32x32_12bpp_avx: 141.0
vp9_diag_downright_32x32_12bpp_avx2: 73.8

Almost 50% faster than avx implementation
---
 libavcodec/x86/vp9dsp_init_16bpp.c    |   6 +-
 libavcodec/x86/vp9intrapred_16bpp.asm | 103 +++++++++++++++++++++++++++++++++-
 2 files changed, 106 insertions(+), 3 deletions(-)

diff --git a/libavcodec/x86/vp9dsp_init_16bpp.c b/libavcodec/x86/vp9dsp_init_16bpp.c
index 8d1aa13..54216f0 100644
--- a/libavcodec/x86/vp9dsp_init_16bpp.c
+++ b/libavcodec/x86/vp9dsp_init_16bpp.c
@@ -52,8 +52,9 @@ decl_ipred_fns(dc, 16, mmxext, sse2);
 decl_ipred_fns(dc_top, 16, mmxext, sse2);
 decl_ipred_fns(dc_left, 16, mmxext, sse2);
 decl_ipred_fn(dl, 16, 16, avx2);
-decl_ipred_fn(dr, 16, 16, avx2);
 decl_ipred_fn(dl, 32, 16, avx2);
+decl_ipred_fn(dr, 16, 16, avx2);
+decl_ipred_fn(dr, 32, 16, avx2);
 
 #define decl_ipred_dir_funcs(type) \
 decl_ipred_fns(type, 16, sse2, sse2); \
@@ -137,8 +138,9 @@ av_cold void ff_vp9dsp_init_16bpp_x86(VP9DSPContext *dsp)
         init_fpel_func(1, 1, 64, avg, _16, avx2);
         init_fpel_func(0, 1, 128, avg, _16, avx2);
         init_ipred_func(dl, DIAG_DOWN_LEFT, 16, 16, avx2);
-        init_ipred_func(dr, DIAG_DOWN_RIGHT, 16, 16, avx2);
         init_ipred_func(dl, DIAG_DOWN_LEFT, 32, 16, avx2);
+        init_ipred_func(dr, DIAG_DOWN_RIGHT, 16, 16, avx2);
+        init_ipred_func(dr, DIAG_DOWN_RIGHT, 32, 16, avx2);
     }
 
 #endif /* HAVE_YASM */
diff --git a/libavcodec/x86/vp9intrapred_16bpp.asm b/libavcodec/x86/vp9intrapred_16bpp.asm
index 6d4400b..32b6982 100644
--- a/libavcodec/x86/vp9intrapred_16bpp.asm
+++ b/libavcodec/x86/vp9intrapred_16bpp.asm
@@ -1221,8 +1221,109 @@ cglobal vp9_ipred_dr_16x16_16, 4, 5, 6, dst, stride, l, a
     mova        [dstq+strideq*0], m4            ; 0
     mova        [dst3q+strideq*4], m5           ; 7
     RET
-%endif
 
+%if ARCH_X86_64
+cglobal vp9_ipred_dr_32x32_16, 4, 7, 10, dst, stride, l, a
+    mova        m0, [lq+mmsize*0+0]             ; l[0-15]
+    mova        m1, [lq+mmsize*1+0]             ; l[16-31]
+    movu        m2, [aq+mmsize*0-2]             ; *abcdefghijklmno
+    mova        m3, [aq+mmsize*0+0]             ; abcdefghijklmnop
+    mova        m4, [aq+mmsize*1+0]             ; qrstuvwxyz012345
+    vperm2i128  m5, m0, m1, q0201               ; lmnopqrstuvwxyz0
+    vpalignr    m6, m5, m0, 2                   ; mnopqrstuvwxyz01
+    vpalignr    m7, m5, m0, 4                   ; nopqrstuvwxyz012
+    LOWPASS     0, 6, 7                         ; L[0-15]
+    vperm2i128  m7, m1, m2, q0201               ; stuvwxyz*abcdefg
+    vpalignr    m5, m7, m1, 2                   ; lmnopqrstuvwxyz*
+    vpalignr    m6, m7, m1, 4                   ; mnopqrstuvwxyz*a
+    LOWPASS     1, 5, 6                         ; L[16-31]#
+    vperm2i128  m5, m3, m4, q0201               ; ijklmnopqrstuvwx
+    vpalignr    m6, m5, m3, 2                   ; bcdefghijklmnopq
+    LOWPASS     2, 3, 6                         ; A[0-15]
+    movu        m3, [aq+mmsize*1-2]             ; pqrstuvwxyz01234
+    vperm2i128  m6, m4, m4, q2001               ; yz012345........
+    vpalignr    m7, m6, m4, 2                   ; rstuvwxyz012345.
+    LOWPASS     3, 4, 7                         ; A[16-31].
+    vperm2i128  m4, m1, m2, q0201               ; TUVWXYZ#ABCDEFGH
+    vperm2i128  m5, m0, m1, q0201               ; L[7-15]L[16-23]
+    vperm2i128  m8, m2, m3, q0201               ; IJKLMNOPQRSTUVWX
+    DEFINE_ARGS dst8, stride, stride3, stride7, stride5, dst24, cnt
+    lea         stride3q, [strideq*3]
+    lea         stride5q, [stride3q+strideq*2]
+    lea         stride7q, [strideq*4+stride3q]
+    lea         dst24q, [dst8q+stride3q*8]
+    lea         dst8q, [dst8q+strideq*8]
+    mov         cntd, 2
+
+.loop:
+    mova        [dst24q+stride7q+0 ], m0        ; 31 23 15 7
+    mova        [dst24q+stride7q+32], m1
+    mova        [dst8q+stride7q+0], m1
+    mova        [dst8q+stride7q+32], m2
+    vpalignr    m6, m4, m1, 2
+    vpalignr    m7, m5, m0, 2
+    vpalignr    m9, m8, m2, 2
+    mova        [dst24q+stride3q*2+0], m7       ; 30 22 14 6
+    mova        [dst24q+stride3q*2+32], m6
+    mova        [dst8q+stride3q*2+0], m6
+    mova        [dst8q+stride3q*2+32], m9
+    vpalignr    m6, m4, m1, 4
+    vpalignr    m7, m5, m0, 4
+    vpalignr    m9, m8, m2, 4
+    mova        [dst24q+stride5q+0], m7         ; 29 21 13 5
+    mova        [dst24q+stride5q+32], m6
+    mova        [dst8q+stride5q+0], m6
+    mova        [dst8q+stride5q+32], m9
+    vpalignr    m6, m4, m1, 6
+    vpalignr    m7, m5, m0, 6
+    vpalignr    m9, m8, m2, 6
+    mova        [dst24q+strideq*4+0 ], m7       ; 28 20 12 4
+    mova        [dst24q+strideq*4+32], m6
+    mova        [dst8q+strideq*4+0], m6
+    mova        [dst8q+strideq*4+32], m9
+    vpalignr    m6, m4, m1, 8
+    vpalignr    m7, m5, m0, 8
+    vpalignr    m9, m8, m2, 8
+    mova        [dst24q+stride3q+0 ], m7        ; 27 19 11 3
+    mova        [dst24q+stride3q+32], m6
+    mova        [dst8q+stride3q+0], m6
+    mova        [dst8q+stride3q+32], m9
+    vpalignr    m6, m4, m1, 10
+    vpalignr    m7, m5, m0, 10
+    vpalignr    m9, m8, m2, 10
+    mova        [dst24q+strideq*2+0 ], m7       ; 26 18 10 2
+    mova        [dst24q+strideq*2+32], m6
+    mova        [dst8q+strideq*2+0], m6
+    mova        [dst8q+strideq*2+32], m9
+    vpalignr    m6, m4, m1, 12
+    vpalignr    m7, m5, m0, 12
+    vpalignr    m9, m8, m2, 12
+    mova        [dst24q+strideq+0 ], m7         ; 25 17 9 1
+    mova        [dst24q+strideq+32], m6
+    mova        [dst8q+strideq+0], m6
+    mova        [dst8q+strideq+32], m9
+    vpalignr    m6, m4, m1, 14
+    vpalignr    m7, m5, m0, 14
+    vpalignr    m9, m8, m2, 14
+    mova        [dst24q+strideq*0+0 ], m7       ; 24 16 8 0
+    mova        [dst24q+strideq*0+32], m6
+    mova        [dst8q+strideq*0+0], m6
+    mova        [dst8q+strideq*0+32], m9
+    mova        m0, m5
+    mova        m5, m1
+    mova        m1, m4
+    mova        m4, m2
+    mova        m2, m8
+    mova        m8, m3
+    sub         dst24q, stride7q
+    sub         dst24q, strideq
+    sub         dst8q, stride7q
+    sub         dst8q, strideq
+    dec         cntd
+    jg .loop
+    RET
+%endif
+%endif
 %macro VL_FUNCS 1 ; stack_mem_for_32x32_32bit_function
 cglobal vp9_ipred_vl_4x4_16, 2, 4, 3, dst, stride, l, a