From patchwork Mon May 7 17:24:18 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Clément Bœsch
X-Patchwork-Id: 8847
From: Clément Bœsch
To: ffmpeg-devel@ffmpeg.org
Cc: Clément Bœsch
Date: Mon, 7 May 2018 19:24:18 +0200
Message-Id: <20180507172422.11003-7-u@pkh.me>
X-Mailer: git-send-email 2.17.0
In-Reply-To: <20180507172422.11003-1-u@pkh.me>
References: <20180507172422.11003-1-u@pkh.me>
Subject: [FFmpeg-devel] [PATCH v2 06/10] lavfi/nlmeans: make
 compute_safe_ssd_integral_image_c faster

before: ssd_integral_image_c: 49204.6
after:  ssd_integral_image_c: 44272.8

Unrolling by 4 made the biggest difference on odroid-c2 (aarch64);
unrolling by 2 or 8 both came in at around 46k cycles vs 44k for 4.

Additionally, this is a much better reference when writing SIMD (SIMD
vectorization will just target 16 instead of 4).
---
 libavfilter/vf_nlmeans.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/libavfilter/vf_nlmeans.c b/libavfilter/vf_nlmeans.c
index c30e44498f..f37f1183f7 100644
--- a/libavfilter/vf_nlmeans.c
+++ b/libavfilter/vf_nlmeans.c
@@ -146,10 +146,6 @@ static inline int get_integral_patch_value(const uint32_t *ii, int ii_lz_32, int
  * function, we do not need any clipping here.
  *
  * The line above dst and the column to its left are always readable.
- *
- * This C version computes the SSD integral image using a scalar accumulator,
- * while for SIMD implementation it is likely more interesting to use the
- * two-loops algorithm variant.
  */
 static void compute_safe_ssd_integral_image_c(uint32_t *dst, ptrdiff_t dst_linesize_32,
                                               const uint8_t *s1, ptrdiff_t linesize1,
@@ -157,21 +153,32 @@ static void compute_safe_ssd_integral_image_c(uint32_t *dst, ptrdiff_t dst_lines
                                               int w, int h)
 {
     int x, y;
+    const uint32_t *dst_top = dst - dst_linesize_32;
 
     /* SIMD-friendly assumptions allowed here */
     av_assert2(!(w & 0xf) && w >= 16 && h >= 1);
 
     for (y = 0; y < h; y++) {
-        uint32_t acc = dst[-1] - dst[-dst_linesize_32 - 1];
-
-        for (x = 0; x < w; x++) {
-            const int d = s1[x] - s2[x];
-            acc += d * d;
-            dst[x] = dst[-dst_linesize_32 + x] + acc;
+        for (x = 0; x < w; x += 4) {
+            const int d0 = s1[x    ] - s2[x    ];
+            const int d1 = s1[x + 1] - s2[x + 1];
+            const int d2 = s1[x + 2] - s2[x + 2];
+            const int d3 = s1[x + 3] - s2[x + 3];
+
+            dst[x    ] = dst_top[x    ] - dst_top[x - 1] + d0*d0;
+            dst[x + 1] = dst_top[x + 1] - dst_top[x    ] + d1*d1;
+            dst[x + 2] = dst_top[x + 2] - dst_top[x + 1] + d2*d2;
+            dst[x + 3] = dst_top[x + 3] - dst_top[x + 2] + d3*d3;
+
+            dst[x    ] += dst[x - 1];
+            dst[x + 1] += dst[x    ];
+            dst[x + 2] += dst[x + 1];
+            dst[x + 3] += dst[x + 2];
         }
         s1 += linesize1;
         s2 += linesize2;
         dst += dst_linesize_32;
+        dst_top += dst_linesize_32;
     }
 }
 
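
Note for reviewers: the unrolled loop should produce the same output as
the scalar-accumulator loop it replaces, since both satisfy
dst[x] - dst_top[x] = dst[x - 1] - dst_top[x - 1] + d*d (modulo 2^32).
The standalone sketch below is one way to check that outside of FFmpeg. It
is not part of the patch: the function names, the test dimensions, the fill
pattern and the zeroed border are assumptions made for illustration, and
plain assert() stands in for av_assert2().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar-accumulator form, mirroring the loop removed by the patch. */
static void ssd_ii_acc(uint32_t *dst, ptrdiff_t dst_linesize_32,
                       const uint8_t *s1, ptrdiff_t linesize1,
                       const uint8_t *s2, ptrdiff_t linesize2,
                       int w, int h)
{
    for (int y = 0; y < h; y++) {
        uint32_t acc = dst[-1] - dst[-dst_linesize_32 - 1];

        for (int x = 0; x < w; x++) {
            const int d = s1[x] - s2[x];
            acc += d * d;
            dst[x] = dst[-dst_linesize_32 + x] + acc;
        }
        s1 += linesize1;
        s2 += linesize2;
        dst += dst_linesize_32;
    }
}

/* Unrolled-by-4 form, mirroring the loop added by the patch. */
static void ssd_ii_unrolled(uint32_t *dst, ptrdiff_t dst_linesize_32,
                            const uint8_t *s1, ptrdiff_t linesize1,
                            const uint8_t *s2, ptrdiff_t linesize2,
                            int w, int h)
{
    const uint32_t *dst_top = dst - dst_linesize_32;

    assert(!(w & 3)); /* the unrolled loop needs a width multiple of 4 */

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x += 4) {
            const int d0 = s1[x    ] - s2[x    ];
            const int d1 = s1[x + 1] - s2[x + 1];
            const int d2 = s1[x + 2] - s2[x + 2];
            const int d3 = s1[x + 3] - s2[x + 3];

            /* per-pixel term: row above, minus its left neighbour, plus d*d */
            dst[x    ] = dst_top[x    ] - dst_top[x - 1] + d0*d0;
            dst[x + 1] = dst_top[x + 1] - dst_top[x    ] + d1*d1;
            dst[x + 2] = dst_top[x + 2] - dst_top[x + 1] + d2*d2;
            dst[x + 3] = dst_top[x + 3] - dst_top[x + 2] + d3*d3;

            /* serial prefix sum along the row */
            dst[x    ] += dst[x - 1];
            dst[x + 1] += dst[x    ];
            dst[x + 2] += dst[x + 1];
            dst[x + 3] += dst[x + 2];
        }
        s1 += linesize1;
        s2 += linesize2;
        dst += dst_linesize_32;
        dst_top += dst_linesize_32;
    }
}

/* Hypothetical self-check: returns 1 if both variants agree on a small
 * image and the bottom-right cell equals the brute-force total SSD. */
int ssd_ii_variants_match(void)
{
    enum { W = 16, H = 3, STRIDE = W + 1 };
    uint8_t a[W * H], b[W * H];
    /* one zeroed border row and column, plus a spare row so the
     * pointers advanced after the last line stay inside the array */
    uint32_t ii_acc[(H + 2) * STRIDE], ii_unr[(H + 2) * STRIDE];
    uint32_t total = 0;

    for (int i = 0; i < W * H; i++) {
        a[i] = (uint8_t)(i * 37 + 11);   /* arbitrary deterministic data */
        b[i] = (uint8_t)(i * 101 + 3);
        const int d = a[i] - b[i];
        total += (uint32_t)(d * d);
    }
    memset(ii_acc, 0, sizeof(ii_acc));
    memset(ii_unr, 0, sizeof(ii_unr));

    ssd_ii_acc(ii_acc + STRIDE + 1, STRIDE, a, W, b, W, W, H);
    ssd_ii_unrolled(ii_unr + STRIDE + 1, STRIDE, a, W, b, W, W, H);

    return !memcmp(ii_acc, ii_unr, sizeof(ii_acc)) &&
           ii_unr[H * STRIDE + W] == total;
}
```

Calling ssd_ii_variants_match() from a small main() and checking that it
returns 1 is enough to exercise both loops. The second half of the unrolled
body is still a serial dependency chain, which is presumably why only the
first half is the obvious target for vectorization.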