From patchwork Tue Jul 10 22:37:36 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jun Zhao
X-Patchwork-Id: 9667
From: Jun Zhao
To: ffmpeg-devel@ffmpeg.org
Date: Wed, 11 Jul 2018 06:37:36 +0800
Message-Id: <1531262257-4660-2-git-send-email-mypopydev@gmail.com>
X-Mailer: git-send-email 1.7.1
In-Reply-To: <1531262257-4660-1-git-send-email-mypopydev@gmail.com>
References: <1531262257-4660-1-git-send-email-mypopydev@gmail.com>
Subject: [FFmpeg-devel] [PATCH 2/3] avutil/pixelutils: sad_32x32 sse2/avx2 optimizations.
Cc: Jun Zhao

add ff_pixelutils_sad_32x32_sse2, ff_pixelutils_sad_{a,u}_32x32_sse2,
ff_pixelutils_sad_32x32_avx2, ff_pixelutils_sad_{a,u}_32x32_avx2

Signed-off-by: Jun Zhao
---
 libavutil/x86/pixelutils.asm    |  220 +++++++++++++++++++++++++++++++++++++++
 libavutil/x86/pixelutils_init.c |   30 ++++++
 2 files changed, 250 insertions(+), 0 deletions(-)

diff --git a/libavutil/x86/pixelutils.asm b/libavutil/x86/pixelutils.asm
index 7af3007..76b1a1a 100644
--- a/libavutil/x86/pixelutils.asm
+++ b/libavutil/x86/pixelutils.asm
@@ -163,3 +163,223 @@ cglobal pixelutils_sad_%1_16x16, 4,4,3, src1, stride1, src2, stride2
 
 SAD_XMM_16x16 a
 SAD_XMM_16x16 u
+
+
+%macro PROCESS_SAD_32x4_U 0
+    movu    m1, [r2]
+    movu    m2, [r2 + 16]
+    movu    m3, [r0]
+    movu    m4, [r0 + 16]
+    psadbw  m1, m3
+    psadbw  m2, m4
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    movu    m1, [r2]
+    movu    m2, [r2 + 16]
+    movu    m3, [r0]
+    movu    m4, [r0 + 16]
+    psadbw  m1, m3
+    psadbw  m2, m4
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    movu    m1, [r2]
+    movu    m2, [r2 + 16]
+    movu    m3, [r0]
+    movu    m4, [r0 + 16]
+    psadbw  m1, m3
+    psadbw  m2, m4
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    movu    m1, [r2]
+    movu    m2, [r2 + 16]
+    movu    m3, [r0]
+    movu    m4, [r0 + 16]
+    psadbw  m1, m3
+    psadbw  m2, m4
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+%endmacro
+
+%macro PROCESS_SAD_32x4 1
+    mov%1   m1, [r2]
+    mov%1   m2, [r2 + 16]
+    psadbw  m1, [r0]
+    psadbw  m2, [r0 + 16]
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    mov%1   m1, [r2]
+    mov%1   m2, [r2 + 16]
+    psadbw  m1, [r0]
+    psadbw  m2, [r0 + 16]
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    mov%1   m1, [r2]
+    mov%1   m2, [r2 + 16]
+    psadbw  m1, [r0]
+    psadbw  m2, [r0 + 16]
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+
+    mov%1   m1, [r2]
+    mov%1   m2, [r2 + 16]
+    psadbw  m1, [r0]
+    psadbw  m2, [r0 + 16]
+    paddd   m1, m2
+    paddd   m0, m1
+    lea     r2, [r2 + r3]
+    lea     r0, [r0 + r1]
+%endmacro
+
+;-----------------------------------------------------------------------------
+; int ff_pixelutils_sad_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+;                                  const uint8_t *src2, ptrdiff_t stride2);
+;-----------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal pixelutils_sad_32x32, 4,5,5, src1, stride1, src2, stride2
+    pxor    m0, m0
+    mov     r4d, 4
+.loop:
+    PROCESS_SAD_32x4_U
+    PROCESS_SAD_32x4_U
+    dec     r4d
+    jnz     .loop
+
+    movhlps m1, m0
+    paddd   m0, m1
+    movd    eax, m0
+    RET
+
+
+;-------------------------------------------------------------------------------
+; int ff_pixelutils_sad_[au]_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+;                                       const uint8_t *src2, ptrdiff_t stride2);
+;-------------------------------------------------------------------------------
+%macro SAD_XMM_32x32 1
+INIT_XMM sse2
+cglobal pixelutils_sad_%1_32x32, 4,5,3, src1, stride1, src2, stride2
+    pxor    m0, m0
+    mov     r4d, 4
+.loop:
+    PROCESS_SAD_32x4 %1
+    PROCESS_SAD_32x4 %1
+    dec     r4d
+    jnz     .loop
+
+    movhlps m1, m0
+    paddd   m0, m1
+    movd    eax, m0
+    RET
+%endmacro
+
+SAD_XMM_32x32 a
+SAD_XMM_32x32 u
+
+;-------------------------------------------------------------------------------
+; int ff_pixelutils_sad_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+;                                  const uint8_t *src2, ptrdiff_t stride2);
+;-------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal pixelutils_sad_32x32, 4,7,5, src1, stride1, src2, stride2
+    xorps   m0, m0
+    mov     r4d, 32/4
+    lea     r5, [stride1q * 3]
+    lea     r6, [stride2q * 3]
+
+.loop:
+    movu    m1, [src1q]                 ; row 0 of pix0
+    movu    m2, [src2q]                 ; row 0 of pix1
+    movu    m3, [src1q + stride1q]      ; row 1 of pix0
+    movu    m4, [src2q + stride2q]      ; row 1 of pix1
+
+    psadbw  m1, m2
+    psadbw  m3, m4
+    paddd   m0, m1
+    paddd   m0, m3
+
+    movu    m1, [src1q + 2 * stride1q]  ; row 2 of pix0
+    movu    m2, [src2q + 2 * stride2q]  ; row 2 of pix1
+    movu    m3, [src1q + r5]            ; row 3 of pix0
+    movu    m4, [src2q + r6]            ; row 3 of pix1
+
+    psadbw  m1, m2
+    psadbw  m3, m4
+    paddd   m0, m1
+    paddd   m0, m3
+
+    lea     src2q, [src2q + 4 * stride2q]
+    lea     src1q, [src1q + 4 * stride1q]
+
+    dec     r4d
+    jnz     .loop
+
+    vextracti128 xm1, m0, 1
+    paddd   xm0, xm1
+    pshufd  xm1, xm0, 2
+    paddd   xm0, xm1
+    movd    eax, xm0
+    RET
+
+;-------------------------------------------------------------------------------
+; int ff_pixelutils_sad_[au]_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+;                                       const uint8_t *src2, ptrdiff_t stride2);
+;-------------------------------------------------------------------------------
+%macro SAD_AVX2_32x32 1
+INIT_YMM avx2
+cglobal pixelutils_sad_%1_32x32, 4,7,3, src1, stride1, src2, stride2
+    xorps   m0, m0
+    mov     r4d, 32/4
+    lea     r5, [stride1q * 3]
+    lea     r6, [stride2q * 3]
+
+.loop:
+    mov%1   m1, [src2q]                 ; row 0 of pix1
+    psadbw  m1, [src1q]
+    mov%1   m2, [src2q + stride2q]      ; row 1 of pix1
+    psadbw  m2, [src1q + stride1q]
+
+    paddd   m0, m1
+    paddd   m0, m2
+
+    mov%1   m1, [src2q + 2 * stride2q]  ; row 2 of pix1
+    psadbw  m1, [src1q + 2 * stride1q]
+    mov%1   m2, [src2q + r6]            ; row 3 of pix1
+    psadbw  m2, [src1q + r5]
+
+    paddd   m0, m1
+    paddd   m0, m2
+
+    lea     src2q, [src2q + 4 * stride2q]
+    lea     src1q, [src1q + 4 * stride1q]
+
+    dec     r4d
+    jnz     .loop
+
+    vextracti128 xm1, m0, 1
+    paddd   xm0, xm1
+    pshufd  xm1, xm0, 2
+    paddd   xm0, xm1
+    movd    eax, xm0
+    RET
+%endmacro
+
+SAD_AVX2_32x32 a
+SAD_AVX2_32x32 u
diff --git a/libavutil/x86/pixelutils_init.c b/libavutil/x86/pixelutils_init.c
index c24a533..dd05421 100644
--- a/libavutil/x86/pixelutils_init.c
+++ b/libavutil/x86/pixelutils_init.c
@@ -35,6 +35,20 @@ int ff_pixelutils_sad_a_16x16_sse2(const uint8_t *src1, ptrdiff_t stride1,
 int ff_pixelutils_sad_u_16x16_sse2(const uint8_t *src1, ptrdiff_t stride1,
                                    const uint8_t *src2, ptrdiff_t stride2);
 
+int ff_pixelutils_sad_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+                                 const uint8_t *src2, ptrdiff_t stride2);
+int ff_pixelutils_sad_a_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+                                   const uint8_t *src2, ptrdiff_t stride2);
+int ff_pixelutils_sad_u_32x32_sse2(const uint8_t *src1, ptrdiff_t stride1,
+                                   const uint8_t *src2, ptrdiff_t stride2);
+
+int ff_pixelutils_sad_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+                                 const uint8_t *src2, ptrdiff_t stride2);
+int ff_pixelutils_sad_a_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+                                   const uint8_t *src2, ptrdiff_t stride2);
+int ff_pixelutils_sad_u_32x32_avx2(const uint8_t *src1, ptrdiff_t stride1,
+                                   const uint8_t *src2, ptrdiff_t stride2);
+
 void ff_pixelutils_sad_init_x86(av_pixelutils_sad_fn *sad, int aligned)
 {
     int cpu_flags = av_get_cpu_flags();
@@ -61,4 +75,20 @@ void ff_pixelutils_sad_init_x86(av_pixelutils_sad_fn *sad, int aligned)
         case 2: sad[3] = ff_pixelutils_sad_a_16x16_sse2; break; // src1 aligned, src2 aligned
         }
     }
+
+    if (EXTERNAL_SSE2(cpu_flags)) {
+        switch (aligned) {
+        case 0: sad[4] = ff_pixelutils_sad_32x32_sse2;   break; // src1 unaligned, src2 unaligned
+        case 1: sad[4] = ff_pixelutils_sad_u_32x32_sse2; break; // src1 aligned, src2 unaligned
+        case 2: sad[4] = ff_pixelutils_sad_a_32x32_sse2; break; // src1 aligned, src2 aligned
+        }
+    }
+
+    if (EXTERNAL_AVX2(cpu_flags)) {
+        switch (aligned) {
+        case 0: sad[4] = ff_pixelutils_sad_32x32_avx2;   break; // src1 unaligned, src2 unaligned
+        case 1: sad[4] = ff_pixelutils_sad_u_32x32_avx2; break; // src1 aligned, src2 aligned
+        case 2: sad[4] = ff_pixelutils_sad_a_32x32_avx2; break; // src1 aligned, src2 aligned
+        }
+    }
 }
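For anyone cross-checking the assembly above, a scalar C version of the same 32x32 sum of absolute differences may be useful as a reference. This is only a sketch: the name sad_32x32_ref is hypothetical and not part of libavutil, and it assumes the same argument order (src1, stride1, src2, stride2) as the prototypes in pixelutils_init.c. Each inner loop covers one 32-byte row, which the SSE2 code handles with two psadbw instructions and the AVX2 code with one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar reference, not part of libavutil: returns the sum of
 * |src1[x] - src2[x]| over a 32x32 block, advancing each pointer by its own
 * stride after every row. */
static int sad_32x32_ref(const uint8_t *src1, ptrdiff_t stride1,
                         const uint8_t *src2, ptrdiff_t stride2)
{
    int sum = 0;

    assert(src1 && src2);
    for (int y = 0; y < 32; y++) {
        /* One 32-byte row of absolute differences. */
        for (int x = 0; x < 32; x++)
            sum += abs(src1[x] - src2[x]);
        src1 += stride1;
        src2 += stride2;
    }
    /* Worst case is 32 * 32 * 255 = 261120, so int cannot overflow. */
    return sum;
}
```

Comparing this against sad[4] on random blocks, for each value of aligned, would be one way to check the three variants per instruction set.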