From patchwork Tue Jul 17 11:24:45 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jun Zhao
X-Patchwork-Id: 9742
From: Jun Zhao
To: ffmpeg-devel@ffmpeg.org
Date: Tue, 17 Jul 2018 19:24:45 +0800
Message-Id: <1531826685-27801-3-git-send-email-mypopydev@gmail.com>
X-Mailer: git-send-email 1.7.1
In-Reply-To: <1531826685-27801-1-git-send-email-mypopydev@gmail.com>
References: <1531826685-27801-1-git-send-email-mypopydev@gmail.com>
Subject: [FFmpeg-devel] [PATCH 2/2] avutil/pixelutils: sad_32x32 sse2/avx2 optimizations.
Cc: Jun Zhao

Add ff_pixelutils_sad_32x32_sse2, ff_pixelutils_sad_{a,u}_32x32_sse2,
ff_pixelutils_sad_32x32_avx2 and ff_pixelutils_sad_{a,u}_32x32_avx2.

Profiling with perf record/report (event instructions:u) gives the
following for the avx2 sad_32x32:

    72.05%  pixelutils  pixelutils  [.] block_sad_32x32_c
    18.50%  pixelutils  pixelutils  [.] block_sad_16x16_c
     4.78%  pixelutils  pixelutils  [.] block_sad_8x8_c
     2.69%  pixelutils  pixelutils  [.] block_sad_4x4_c
     0.89%  pixelutils  pixelutils  [.] block_sad_2x2_c
     0.16%  pixelutils  pixelutils  [.] ff_pixelutils_sad_32x32_avx2
     0.16%  pixelutils  pixelutils  [.] ff_pixelutils_sad_u_32x32_avx2
     0.12%  pixelutils  pixelutils  [.] ff_pixelutils_sad_a_32x32_avx2

and for the sse2 sad_32x32:

    71.86%  pixelutils  pixelutils  [.] block_sad_32x32_c
    18.42%  pixelutils  pixelutils  [.] block_sad_16x16_c
     4.81%  pixelutils  pixelutils  [.] block_sad_8x8_c
     2.68%  pixelutils  pixelutils  [.] block_sad_4x4_c
     0.88%  pixelutils  pixelutils  [.] block_sad_2x2_c
     0.29%  pixelutils  pixelutils  [.] ff_pixelutils_sad_32x32_sse2
     0.26%  pixelutils  pixelutils  [.] ff_pixelutils_sad_u_32x32_sse2
     0.23%  pixelutils  pixelutils  [.] ff_pixelutils_sad_a_32x32_sse2

Signed-off-by: Jun Zhao
---
 libavutil/x86/pixelutils.asm    |  220 +++++++++++++++++++++++++++++++++++++++
 libavutil/x86/pixelutils_init.c |   30 ++++++
 2 files changed, 250 insertions(+), 0 deletions(-)

diff --git a/libavutil/x86/pixelutils.asm b/libavutil/x86/pixelutils.asm
index 171a3d1..e6dbfae 100644
--- a/libavutil/x86/pixelutils.asm
+++ b/libavutil/x86/pixelutils.asm
@@ -163,3 +163,223 @@ cglobal pixelutils_sad_%1_16x16, 4,4,3, src1, stride1, src2, stride2
 
 SAD_XMM_16x16 a
 SAD_XMM_16x16 u
+
+
+%macro PROCESS_SAD_32x4_U 0
+    movu    m1, [r2]
+    movu    m2, [r2 + 16]
+    movu    m3, [r0]
+    movu    m4, [r0 + 16]
+    psadbw  m1, m3
+    psadbw  m2, m4
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    movu    m1, [r2]
+    movu    m2, [r2 + 16]
+    movu    m3, [r0]
+    movu    m4, [r0 + 16]
+    psadbw  m1, m3
+    psadbw  m2, m4
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    movu    m1, [r2]
+    movu    m2, [r2 + 16]
+    movu    m3, [r0]
+    movu    m4, [r0 + 16]
+    psadbw  m1, m3
+    psadbw  m2, m4
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    movu    m1, [r2]
+    movu    m2, [r2 + 16]
+    movu    m3, [r0]
+    movu    m4, [r0 + 16]
+    psadbw  m1, m3
+    psadbw  m2, m4
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+%endmacro
+
+%macro PROCESS_SAD_32x4 1
+    mov%1   m1, [r2]
+    mov%1   m2, [r2 + 16]
+    psadbw  m1, [r0]
+    psadbw  m2, [r0 + 16]
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    mov%1   m1, [r2]
+    mov%1   m2, [r2 + 16]
+    psadbw  m1, [r0]
+    psadbw  m2, [r0 + 16]
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    mov%1   m1, [r2]
+    mov%1   m2, [r2 + 16]
+    psadbw  m1, [r0]
+    psadbw  m2, [r0 + 16]
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    mov%1   m1, [r2]
+    mov%1   m2, [r2 + 16]
+    psadbw  m1, [r0]
+    psadbw  m2, [r0 + 16]
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+%endmacro
+
+;-----------------------------------------------------------------------------
+; int ff_pixelutils_sad_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+;                                  const uint8_t *src2, ptrdiff_t stride2);
+;-----------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal pixelutils_sad_32x32, 4,5,5, src1, stride1, src2, stride2
+    pxor    m0, m0
+    mov     r4d, 4
+.loop:
+    PROCESS_SAD_32x4_U
+    PROCESS_SAD_32x4_U
+    dec     r4d
+    jnz     .loop
+
+    movhlps m1, m0
+    paddd   m0, m1
+    movd    eax, m0
+    RET
+
+
+;-------------------------------------------------------------------------------
+; int ff_pixelutils_sad_[au]_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+;                                       const uint8_t *src2, ptrdiff_t stride2);
+;-------------------------------------------------------------------------------
+%macro SAD_XMM_32x32 1
+INIT_XMM sse2
+cglobal pixelutils_sad_%1_32x32, 4,5,3, src1, stride1, src2, stride2
+    pxor    m0, m0
+    mov     r4d, 4
+.loop:
+    PROCESS_SAD_32x4 %1
+    PROCESS_SAD_32x4 %1
+    dec     r4d
+    jnz     .loop
+
+    movhlps m1, m0
+    paddd   m0, m1
+    movd    eax, m0
+    RET
+%endmacro
+
+SAD_XMM_32x32 a
+SAD_XMM_32x32 u
+
+;-------------------------------------------------------------------------------
+; int ff_pixelutils_sad_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+;                                  const uint8_t *src2, ptrdiff_t stride2);
+;-------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal pixelutils_sad_32x32, 4,7,5, src1, stride1, src2, stride2
+    pxor    m0, m0
+    mov     r4d, 32/4
+    lea     r5, [stride1q * 3]
+    lea     r6, [stride2q * 3]
+
+.loop:
+    movu    m1, [src1q]                 ; row 0 of pix0
+    movu    m2, [src2q]                 ; row 0 of pix1
+    movu    m3, [src1q + stride1q]      ; row 1 of pix0
+    movu    m4, [src2q + stride2q]      ; row 1 of pix1
+
+    psadbw  m1, m2
+    psadbw  m3, m4
+    paddd   m0, m1
+    paddd   m0, m3
+
+    movu    m1, [src1q + 2 * stride1q]  ; row 2 of pix0
+    movu    m2, [src2q + 2 * stride2q]  ; row 2 of pix1
+    movu    m3, [src1q + r5]            ; row 3 of pix0
+    movu    m4, [src2q + r6]            ; row 3 of pix1
+
+    psadbw  m1, m2
+    psadbw  m3, m4
+    paddd   m0, m1
+    paddd   m0, m3
+
+    lea     src2q, [src2q + 4 * stride2q]
+    lea     src1q, [src1q + 4 * stride1q]
+
+    dec     r4d
+    jnz     .loop
+
+    vextracti128 xm1, m0, 1
+    paddd        xm0, xm1
+    pshufd       xm1, xm0, 2
+    paddd        xm0, xm1
+    movd         eax, xm0
+    RET
+
+;-------------------------------------------------------------------------------
+; int ff_pixelutils_sad_[au]_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+;                                       const uint8_t *src2, ptrdiff_t stride2);
+;-------------------------------------------------------------------------------
+%macro SAD_AVX2_32x32 1
+INIT_YMM avx2
+cglobal pixelutils_sad_%1_32x32, 4,7,3, src1, stride1, src2, stride2
+    pxor    m0, m0
+    mov     r4d, 32/4
+    lea     r5, [stride1q * 3]
+    lea     r6, [stride2q * 3]
+
+.loop:
+    mov%1   m1, [src2q]                 ; row 0 of pix1
+    psadbw  m1, [src1q]
+    mov%1   m2, [src2q + stride2q]      ; row 1 of pix1
+    psadbw  m2, [src1q + stride1q]
+
+    paddd   m0, m1
+    paddd   m0, m2
+
+    mov%1   m1, [src2q + 2 * stride2q]  ; row 2 of pix1
+    psadbw  m1, [src1q + 2 * stride1q]
+    mov%1   m2, [src2q + r6]            ; row 3 of pix1
+    psadbw  m2, [src1q + r5]
+
+    paddd   m0, m1
+    paddd   m0, m2
+
+    lea     src2q, [src2q + 4 * stride2q]
+    lea     src1q, [src1q + 4 * stride1q]
+
+    dec     r4d
+    jnz     .loop
+
+    vextracti128 xm1, m0, 1
+    paddd        xm0, xm1
+    pshufd       xm1, xm0, 2
+    paddd        xm0, xm1
+    movd         eax, xm0
+    RET
+%endmacro
+
+SAD_AVX2_32x32 a
+SAD_AVX2_32x32 u
diff --git a/libavutil/x86/pixelutils_init.c b/libavutil/x86/pixelutils_init.c
index c24a533..dd05421 100644
--- a/libavutil/x86/pixelutils_init.c
+++ b/libavutil/x86/pixelutils_init.c
@@ -35,6 +35,20 @@ int ff_pixelutils_sad_a_16x16_sse2(const uint8_t *src1, ptrdiff_t stride1,
 int ff_pixelutils_sad_u_16x16_sse2(const uint8_t *src1, ptrdiff_t stride1,
                                    const uint8_t *src2, ptrdiff_t stride2);
 
+int ff_pixelutils_sad_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+                                 const uint8_t *src2, ptrdiff_t stride2);
+int ff_pixelutils_sad_a_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+                                   const uint8_t *src2, ptrdiff_t stride2);
+int ff_pixelutils_sad_u_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+                                   const uint8_t *src2, ptrdiff_t stride2);
+
+int ff_pixelutils_sad_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+                                 const uint8_t *src2, ptrdiff_t stride2);
+int ff_pixelutils_sad_a_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+                                   const uint8_t *src2, ptrdiff_t stride2);
+int ff_pixelutils_sad_u_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+                                   const uint8_t *src2, ptrdiff_t stride2);
+
 void ff_pixelutils_sad_init_x86(av_pixelutils_sad_fn *sad, int aligned)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -61,4 +75,20 @@ void ff_pixelutils_sad_init_x86(av_pixelutils_sad_fn *sad, int aligned)
         case 2: sad[3] = ff_pixelutils_sad_a_16x16_sse2; break; // src1   aligned, src2   aligned
         }
     }
+
+    if (EXTERNAL_SSE2(cpu_flags)) {
+        switch (aligned) {
+        case 0: sad[4] = ff_pixelutils_sad_32x32_sse2;   break; // src1 unaligned, src2 unaligned
+        case 1: sad[4] = ff_pixelutils_sad_u_32x32_sse2; break; // src1   aligned, src2 unaligned
+        case 2: sad[4] = ff_pixelutils_sad_a_32x32_sse2; break; // src1   aligned, src2   aligned
+        }
+    }
+
+    if (EXTERNAL_AVX2(cpu_flags)) {
+        switch (aligned) {
+        case 0: sad[4] = ff_pixelutils_sad_32x32_avx2;   break; // src1 unaligned, src2 unaligned
+        case 1: sad[4] = ff_pixelutils_sad_u_32x32_avx2; break; // src1   aligned, src2 unaligned
+        case 2: sad[4] = ff_pixelutils_sad_a_32x32_avx2; break; // src1   aligned, src2   aligned
+        }
+    }
 }
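
For reviewers who want to sanity-check the assembly above, here is a minimal
scalar sketch in C of what the 32x32 routines are expected to return. It is not
part of the patch and not FFmpeg code: the names sad_32x32_ref and
sad_32x32_lanes are hypothetical, and the second function only models, under
the assumption that psadbw yields one partial sum per 8-byte group, how the
avx2 loop keeps four 64-bit lane accumulators and reduces them at the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference (not from the patch): sum of absolute
 * differences over a 32x32 block of 8-bit samples. */
static int sad_32x32_ref(const uint8_t *src1, ptrdiff_t stride1,
                         const uint8_t *src2, ptrdiff_t stride2)
{
    int sum = 0;
    for (int y = 0; y < 32; y++) {
        for (int x = 0; x < 32; x++)
            sum += abs(src1[x] - src2[x]);
        src1 += stride1;
        src2 += stride2;
    }
    return sum;
}

/* Hypothetical model of the avx2 accumulation: each row of 32 bytes is
 * split into four 8-byte groups, one per 64-bit lane of a ymm register.
 * A lane never exceeds 32 rows * 8 bytes * 255 = 65280, so adding lanes
 * with 32-bit paddd and returning the low dword is safe. */
static int sad_32x32_lanes(const uint8_t *src1, ptrdiff_t stride1,
                           const uint8_t *src2, ptrdiff_t stride2)
{
    uint32_t lane[4] = { 0, 0, 0, 0 };
    for (int y = 0; y < 32; y++) {
        for (int x = 0; x < 32; x++)
            lane[x / 8] += (uint32_t)abs(src1[x] - src2[x]);
        src1 += stride1;
        src2 += stride2;
    }
    /* Final horizontal reduction: upper half onto lower half, then the
     * two remaining lanes are added together. */
    return (int)((lane[0] + lane[2]) + (lane[1] + lane[3]));
}
```

Both functions should agree for any input; the largest possible result is
32 * 32 * 255 = 261120, which fits comfortably in the int return type.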