From patchwork Sun May 6 11:40:57 2018
X-Patchwork-Submitter: Clément Bœsch
X-Patchwork-Id: 8808
From: Clément Bœsch
To: ffmpeg-devel@ffmpeg.org
Cc: Clément Bœsch
Date: Sun, 6 May 2018 13:40:57 +0200
Message-Id: <20180506114100.4223-7-u@pkh.me>
In-Reply-To: <20180506114100.4223-1-u@pkh.me>
References: <20180506114100.4223-1-u@pkh.me>
X-Mailer: git-send-email 2.17.0
Subject: [FFmpeg-devel] [PATCH 6/9] lavfi/nlmeans: make compute_safe_ssd_integral_image_c faster

before: ssd_integral_image_c: 49204.6
after:  ssd_integral_image_c: 44272.8

Unrolling by 4 made the biggest difference on odroid-c2 (aarch64);
unrolling by 2 or 8 both came in at about 46k cycles vs 44k for 4.

Additionally, this is a much better reference when writing SIMD (SIMD
vectorization will just target 16 instead of 4).
---
 libavfilter/vf_nlmeans.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/libavfilter/vf_nlmeans.c b/libavfilter/vf_nlmeans.c
index c30e44498f..f37f1183f7 100644
--- a/libavfilter/vf_nlmeans.c
+++ b/libavfilter/vf_nlmeans.c
@@ -146,10 +146,6 @@ static inline int get_integral_patch_value(const uint32_t *ii, int ii_lz_32, int
  * function, we do not need any clipping here.
  *
  * The line above dst and the column to its left are always readable.
- *
- * This C version computes the SSD integral image using a scalar accumulator,
- * while for SIMD implementation it is likely more interesting to use the
- * two-loops algorithm variant.
  */
 static void compute_safe_ssd_integral_image_c(uint32_t *dst, ptrdiff_t dst_linesize_32,
                                               const uint8_t *s1, ptrdiff_t linesize1,
@@ -157,21 +153,32 @@ static void compute_safe_ssd_integral_image_c(uint32_t *dst, ptrdiff_t dst_lines
                                               int w, int h)
 {
     int x, y;
+    const uint32_t *dst_top = dst - dst_linesize_32;
 
     /* SIMD-friendly assumptions allowed here */
     av_assert2(!(w & 0xf) && w >= 16 && h >= 1);
 
     for (y = 0; y < h; y++) {
-        uint32_t acc = dst[-1] - dst[-dst_linesize_32 - 1];
-
-        for (x = 0; x < w; x++) {
-            const int d = s1[x] - s2[x];
-            acc += d * d;
-            dst[x] = dst[-dst_linesize_32 + x] + acc;
+        for (x = 0; x < w; x += 4) {
+            const int d0 = s1[x    ] - s2[x    ];
+            const int d1 = s1[x + 1] - s2[x + 1];
+            const int d2 = s1[x + 2] - s2[x + 2];
+            const int d3 = s1[x + 3] - s2[x + 3];
+
+            dst[x    ] = dst_top[x    ] - dst_top[x - 1] + d0*d0;
+            dst[x + 1] = dst_top[x + 1] - dst_top[x    ] + d1*d1;
+            dst[x + 2] = dst_top[x + 2] - dst_top[x + 1] + d2*d2;
+            dst[x + 3] = dst_top[x + 3] - dst_top[x + 2] + d3*d3;
+
+            dst[x    ] += dst[x - 1];
+            dst[x + 1] += dst[x    ];
+            dst[x + 2] += dst[x + 1];
+            dst[x + 3] += dst[x + 2];
         }
         s1 += linesize1;
         s2 += linesize2;
         dst += dst_linesize_32;
+        dst_top += dst_linesize_32;
     }
 }
 