From patchwork Mon Jul 3 11:12:50 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ilia
X-Patchwork-Id: 4197
From: Ilia Valiakhmetov
To: ffmpeg-devel@ffmpeg.org
Date: Mon, 3 Jul 2017 18:12:50 +0700
Message-Id: <20170703111250.5840-1-zakne0ne@gmail.com>
X-Mailer: git-send-email 2.8.3
Subject: [FFmpeg-devel] [PATCH] avcodec/vp9: AVX2 ipred_dl_32x32 improvement
List-Id: FFmpeg development discussions and patches
Cc: Ilia Valiakhmetov

Use symmetry properties of the ipred_dl function for better performance.

vp9_diag_downleft_32x32_12bpp_c: 1534.2
vp9_diag_downleft_32x32_12bpp_sse2: 145.9
vp9_diag_downleft_32x32_12bpp_ssse3: 140.0
vp9_diag_downleft_32x32_12bpp_avx: 134.8
vp9_diag_downleft_32x32_12bpp_avx2: 78.9

~40% faster than avx

Signed-off-by: Ilia Valiakhmetov
---
 libavcodec/x86/vp9intrapred_16bpp.asm | 47 ++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 14 deletions(-)

diff --git a/libavcodec/x86/vp9intrapred_16bpp.asm b/libavcodec/x86/vp9intrapred_16bpp.asm
index 8d8d65e..33a8a7f 100644
--- a/libavcodec/x86/vp9intrapred_16bpp.asm
+++ b/libavcodec/x86/vp9intrapred_16bpp.asm
@@ -901,49 +901,68 @@ cglobal vp9_ipred_dl_32x32_16, 2, 6, 7, dst, stride, l, a
     LOWPASS 1, 2, 3 ; RSTUVWXYZ......5
     vperm2i128 m2, m1, m4, q0201 ; Z......555555555
     vperm2i128 m5, m0, m1, q0201 ; JKLMNOPQRSTUVWXY
-    DEFINE_ARGS dst, stride, stride3, cnt
+    vperm2i128 m6, m2, m2, q0101
+    DEFINE_ARGS dst, stride, stride3, dst16, cnt
     lea stride3q, [strideq*3]
-    mov cntd, 4
+    lea dst16q, [dstq+strideq*8]
+    lea dst16q, [dst16q+strideq*8]
+    mov cntd, 2

 .loop:
     mova [dstq+strideq*0 + 0], m0
     mova [dstq+strideq*0 +32], m1
+    mova [dst16q+strideq*0+ 0], m1
+    mova [dst16q+strideq*0+32], m6
     vpalignr m3, m5, m0, 2
     vpalignr m4, m2, m1, 2
     mova [dstq+strideq*1 + 0], m3
     mova [dstq+strideq*1 +32], m4
+    mova [dst16q+strideq*1 +0], m4
+    mova [dst16q+strideq*1 +32], m6
     vpalignr m3, m5, m0, 4
     vpalignr m4, m2, m1, 4
     mova [dstq+strideq*2 + 0], m3
     mova [dstq+strideq*2 +32], m4
+    mova [dst16q+strideq*2+0], m4
+    mova [dst16q+strideq*2+32], m6
     vpalignr m3, m5, m0, 6
-    vpalignr m4, m2, m1, 6
+    vpalignr m4, m2, m1, 6
     mova [dstq+stride3q*1+ 0], m3
     mova [dstq+stride3q*1+32], m4
-    lea dstq, [dstq+strideq*4]
+    mova [dst16q+stride3q*1+0], m4
+    mova [dst16q+stride3q*1+32], m6
     vpalignr m3, m5, m0, 8
     vpalignr m4, m2, m1, 8
+    lea dstq, [dstq+strideq*4]
+    lea dst16q, [dst16q+strideq*4]
     mova [dstq+strideq*0 + 0], m3
     mova [dstq+strideq*0 +32], m4
+    mova [dst16q+strideq*0 +0], m4
+    mova [dst16q+strideq*0 +32], m6
     vpalignr m3, m5, m0, 10
     vpalignr m4, m2, m1, 10
     mova [dstq+strideq*1 + 0], m3
     mova [dstq+strideq*1 +32], m4
+    mova [dst16q+strideq*1 +0], m4
+    mova [dst16q+strideq*1 +32], m6
     vpalignr m3, m5, m0, 12
     vpalignr m4, m2, m1, 12
-    mova [dstq+strideq*2+ 0], m3
-    mova [dstq+strideq*2+32], m4
+    mova [dstq+strideq*2+ 0], m3
+    mova [dstq+strideq*2+32], m4
+    mova [dst16q+strideq*2+0], m4
+    mova [dst16q+strideq*2+32], m6
     vpalignr m3, m5, m0, 14
     vpalignr m4, m2, m1, 14
-    mova [dstq+stride3q+ 0], m3
-    mova [dstq+stride3q+ 32], m4
-    vpalignr m3, m5, m0, 16
-    vpalignr m4, m2, m1, 16
-    vperm2i128 m5, m3, m4, q0201
-    vperm2i128 m2, m4, m4, q0101
-    mova m0, m3
-    mova m1, m4
+    mova [dstq+stride3q+ 0], m3
+    mova [dstq+stride3q+ 32], m4
+    mova [dst16q+stride3q+ 0], m4
+    mova [dst16q+stride3q+32], m6
+    mova m0, m5
+    mova m1, m2
+    vperm2i128 m5, m5, m2, q0201
+    mova m2, m6
     lea dstq, [dstq+strideq*4]
+    lea dst16q, [dst16q+strideq*4]
     dec cntd
     jg .loop
     RET
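
Note on the symmetry: the property the commit message relies on can be illustrated with a scalar sketch in C. This is not FFmpeg's code. It assumes a down-left predictor in which the pixel at (x, y) depends only on x + y and equals the last top pixel once x + y >= 31. Under that assumption, the left half of row y+16 is a copy of the right half of row y, and the right half of row y+16 is constant, which is what the extra dst16q stores of m1/m4 and m6 appear to exploit. All names (dl_edge, dl_pred_ref, dl_pred_sym, dl_sym_matches_ref) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 32        /* block size */
#define H (N / 2)   /* half block: 16 rows, 16 pixels */

/* Assumed model (illustration only): edge[k] is the value of every pixel
 * with x + y == k; past the filtered top row it saturates to top[N - 1]. */
static void dl_edge(uint16_t edge[2 * N - 1], const uint16_t top[N])
{
    for (int i = 0; i < N - 2; i++)
        edge[i] = (uint16_t)((top[i] + 2 * top[i + 1] + top[i + 2] + 2) >> 2);
    edge[N - 2] = (uint16_t)((top[N - 2] + 3 * top[N - 1] + 2) >> 2);
    for (int i = N - 1; i < 2 * N - 1; i++)
        edge[i] = top[N - 1];
}

/* Straightforward version: compute all 32 rows. */
static void dl_pred_ref(uint16_t dst[N][N], const uint16_t top[N])
{
    uint16_t edge[2 * N - 1];
    dl_edge(edge, top);
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            dst[y][x] = edge[x + y];
}

/* Symmetric version: compute rows 0..15 only, derive rows 16..31. */
static void dl_pred_sym(uint16_t dst[N][N], const uint16_t top[N])
{
    uint16_t edge[2 * N - 1];
    assert(N % 2 == 0);          /* the half split needs an even size */
    dl_edge(edge, top);
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < N; x++)
            dst[y][x] = edge[x + y];
        /* left half of row y+16 == right half of row y */
        memcpy(&dst[y + H][0], &dst[y][H], H * sizeof(uint16_t));
        /* right half of row y+16 is the last top pixel everywhere */
        for (int x = H; x < N; x++)
            dst[y + H][x] = top[N - 1];
    }
}

/* Returns 1 if both versions produce identical blocks for a 12-bit test row. */
int dl_sym_matches_ref(void)
{
    uint16_t top[N], a[N][N], b[N][N];
    for (int i = 0; i < N; i++)
        top[i] = (uint16_t)((i * 131 + 7) & 0xFFF);
    dl_pred_ref(a, top);
    dl_pred_sym(b, top);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling dl_sym_matches_ref() from a small main() checks that the two versions agree. The sketch halves the number of rows that need shifted data, in the same way the patch halves the loop count from 4 to 2 while doubling the stores per iteration.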