From patchwork Tue Jul  4 14:04:44 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Cox <jc@kynesim.co.uk>
X-Patchwork-Id: 42429
Delivered-To: ffmpegpatchwork2@gmail.com
Received: by 2002:a05:6a20:3b1e:b0:12b:9ae3:586d with SMTP id
 c30csp5126298pzh;
        Tue, 4 Jul 2023 07:06:54 -0700 (PDT)
X-Google-Smtp-Source: 
 APBJJlGxaC0d6lsUureXp/wRiUJTJ/mkXgS+mNufx27/RxHPUIls+QV17PPYYSQXK2oXSbAqdDY3
X-Received: by 2002:a17:906:a112:b0:988:71c8:9f3a with SMTP id
 t18-20020a170906a11200b0098871c89f3amr11742715ejy.16.1688479613942;
        Tue, 04 Jul 2023 07:06:53 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1688479613; cv=none;
        d=google.com; s=arc-20160816;
        b=LXJjCHf/M8ah0sKkjzzvtLzJcq/Eaef6qhoIVha6tF3MSqXfEeY50l3oMzObIgIsy8
         b1kaFL5dpyB+nNB7JDIlrYo56dAHJqvKtdzEDqAr16xzQC2wMizIwVNAv6yh/SyeiGOU
         O1ru6leFWXd839N4lGOdURmYS7nS2fSEJo33xHcxNWLUa30M0qiChVjZuiEBdmUpqfRB
         8PWo09CLpdQLVXiU00YZb9bYxUOmN5DG2rq2KGBpVDWi5513+XKHcIkRLHfgRx/cLxyR
         +7NBmM2GCfXx06Rl6aMHXB4j4xONfTg4C/iajO5uTwphAMK5pnc9RNsccldSBhXNHUKa
         Rrcg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=sender:errors-to:content-transfer-encoding:cc:reply-to
         :list-subscribe:list-help:list-post:list-archive:list-unsubscribe
         :list-id:precedence:subject:mime-version:references:in-reply-to
         :message-id:date:to:from:dkim-signature:delivered-to;
        bh=KuC+g7Q4gtIKbHE9s36pEYQFhuZUF79rHO2rGBTCv70=;
        fh=2QQVLAqz5Dgp0O7PTQ7hb1i3rOEvtuxkp5BnHStC38U=;
        b=Jr+oi/tbJxM3JOZATAgBC+6rHt15I1qrNEZHq8BOANdieRrdsSxcJgK5riT+gYbVTs
         zuMXSB/3MqOlMKgIiiWBZqCIp89olTrOLq6Uga4xuuceoMo6PM/2rOYb2ry2lC+T4zz5
         TmISBb7aHnqlkth0Lwf+4UiPa5rOB+gCTocTqv0quipQbSPYs1fVEOCu9aAgYMXHSrIv
         nxc6p8xBY7RZUUYQiKvM/Z6Np4UVtFTv/0ZXmTgYyiVZ0/xd1cfGDcxMtWVBwMBClKvd
         GNpxahpCB/CFMZ2YKkv91s8xCS7U3//3eLh+11ZSc11/80GOOlVmTNFfXUfEwKp2o4hB
         e0jg==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify)
 header.i=@kynesim-co-uk.20221208.gappssmtp.com header.s=20221208
 header.b=sEN8IAsY;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
        by mx.google.com with ESMTP id
 gg18-20020a170906e29200b0098d7390816asi14821105ejb.756.2023.07.04.07.06.53;
        Tue, 04 Jul 2023 07:06:53 -0700 (PDT)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
       dkim=neutral (body hash did not verify)
 header.i=@kynesim-co-uk.20221208.gappssmtp.com header.s=20221208
 header.b=sEN8IAsY;
       spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender)
 smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 17E9D68C5E2;
	Tue,  4 Jul 2023 17:05:43 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail-wr1-f43.google.com (mail-wr1-f43.google.com
 [209.85.221.43])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id AA9AE68C54B
 for <ffmpeg-devel@ffmpeg.org>; Tue,  4 Jul 2023 17:05:32 +0300 (EEST)
Received: by mail-wr1-f43.google.com with SMTP id
 ffacd0b85a97d-31297125334so4862728f8f.0
 for <ffmpeg-devel@ffmpeg.org>; Tue, 04 Jul 2023 07:05:32 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=kynesim-co-uk.20221208.gappssmtp.com; s=20221208; t=1688479532;
 x=1691071532;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:from:to:cc:subject:date
 :message-id:reply-to;
 bh=H687VJhKJFwkKLWCcK28jqX8tG6UW/1CdnXuKPDp1d0=;
 b=sEN8IAsYAXDocbwhsQjY0JOL05piPRdRMeDqsXHkV1bziXSjsHQJnQXDF+oCk+ygLZ
 197aJwQkt+bBl+IGHwVS1U2uYRiq5nc8ag8sCFCTpk+J0BMe7P/6wzyFZzoU+SKDbWG9
 EHySOg5tMTIbmx30TyBABofNONWn5cvrMJn1kQWPA2zDTrZTsy/Gj7Oz3xa9XjVOgyLt
 pz7tWnGMdwMoUNL/+qFl8k0Py1kUc0ukjGyI6/tVkHgg7fTsXJG2wnsoi+I7pi35D+21
 +5sw/CmusKsBLD+rrJmXKmNmPrTdCiHpviW+7riLqNlu8xV8LLGkj/9GzlYADrKnQ0H7
 1fLw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20221208; t=1688479532; x=1691071532;
 h=content-transfer-encoding:mime-version:references:in-reply-to
 :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
 :subject:date:message-id:reply-to;
 bh=H687VJhKJFwkKLWCcK28jqX8tG6UW/1CdnXuKPDp1d0=;
 b=HfImC4LSR4QHYM3M/M6oQAkpbQhgGI1zCiTPq6vme595KcIpTkAIU1PIDgeO4+LmKI
 /pksQUn2zA70im8g89q9hlYDzgOsXVy85CK8c77cCbaIYNiPfbtrqvG4c5Da1PoDIRjM
 CQdYSFo9zR5mtY1mC9OJvHl+ncE35ZZqVDCJ1gq1zMn76KD2ebA4Vw4rvGhEyAY+1adG
 XpAy72Z6FkIyEYiGZ9HcCPrPPGxzMJCVdYGH8XWdOvnD4lLAf3LJuctbLy3ZTAPqezHL
 ceS255AGPpNV3trwEbS7bekutRNCWL35cy9eTpFK+ocf/dgVcl6fJytzqyxLw1Ifz2PF
 PUTw==
X-Gm-Message-State: AC+VfDygsSaq6bT2RS4DnPvFgbb1c3e/KefbxkXKl83MGcXyHbkETDZe
 4P0bdbu4GBEWfwb+09qnzjg20BBWopPGjqS1M2s=
X-Received: by 2002:a5d:6308:0:b0:30f:c050:88dd with SMTP id
 i8-20020a5d6308000000b0030fc05088ddmr16077824wru.8.1688479532029;
 Tue, 04 Jul 2023 07:05:32 -0700 (PDT)
Received: from sucnaath.outer.uphall.net
 (cpc1-cmbg20-2-0-cust759.5-4.cable.virginm.net. [86.21.218.248])
 by smtp.gmail.com with ESMTPSA id
 m23-20020a7bca57000000b003fbc30825fbsm13585970wml.39.2023.07.04.07.05.31
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 04 Jul 2023 07:05:31 -0700 (PDT)
From: John Cox <jc@kynesim.co.uk>
To: ffmpeg-devel@ffmpeg.org
Date: Tue,  4 Jul 2023 14:04:44 +0000
Message-Id: <20230704140445.240426-7-jc@kynesim.co.uk>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230704140445.240426-1-jc@kynesim.co.uk>
References: <20230704140445.240426-1-jc@kynesim.co.uk>
MIME-Version: 1.0
Subject: [FFmpeg-devel] [PATCH v4 6/7] avfilter/vf_bwdif: Add a filter_line3
 method for optimisation
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: thomas.mundt@hr.de, John Cox <jc@kynesim.co.uk>, martin@martin.st
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
X-TUID: W/vHBCaoAkND

Add an optional filter_line3 to the available optimisations.

filter_line3 is equivalent to filter_line, memcpy, filter_line applied
to three consecutive output lines.

filter_line shares a considerable number of loads and some calculations
with its next iteration. Testing shows that with aarch64 NEON,
filter_line3's performance is 30% better than two filter_lines and a
memcpy.

Add a test for vf_bwdif filter_line3 to checkasm.

Round job start lines down to a multiple of 4. This means that if
filter_line3 exists then filter_line will no longer occasionally be
called once at the end of a slice, depending on thread count. The final
slice may do up to 3 extra lines, but filter_edge is faster than
filter_line so this is unlikely to create any noticeable thread load
variation.

Signed-off-by: John Cox <jc@kynesim.co.uk>
---
 libavfilter/bwdif.h       |  7 ++++
 libavfilter/vf_bwdif.c    | 44 +++++++++++++++++++--
 tests/checkasm/vf_bwdif.c | 81 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/libavfilter/bwdif.h b/libavfilter/bwdif.h
index cce99953f3..496cec72ef 100644
--- a/libavfilter/bwdif.h
+++ b/libavfilter/bwdif.h
@@ -35,6 +35,9 @@ typedef struct BWDIFContext {
     void (*filter_edge)(void *dst, void *prev, void *cur, void *next,
                         int w, int prefs, int mrefs, int prefs2, int mrefs2,
                         int parity, int clip_max, int spat);
+    void (*filter_line3)(void *dst, int dstride,
+                         const void *prev, const void *cur, const void *next, int prefs,
+                         int w, int parity, int clip_max);
 } BWDIFContext;
 
 void ff_bwdif_init_filter_line(BWDIFContext *bwdif, int bit_depth);
@@ -53,4 +56,8 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
                             int prefs3, int mrefs3, int prefs4, int mrefs4,
                             int parity, int clip_max);
 
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+                             const void * prev1, const void * cur1, const void * next1, int s_stride,
+                             int w, int parity, int clip_max);
+
 #endif /* AVFILTER_BWDIF_H */
diff --git a/libavfilter/vf_bwdif.c b/libavfilter/vf_bwdif.c
index 26349da1fd..6701208efe 100644
--- a/libavfilter/vf_bwdif.c
+++ b/libavfilter/vf_bwdif.c
@@ -150,6 +150,31 @@ void ff_bwdif_filter_line_c(void *dst1, void *prev1, void *cur1, void *next1,
     FILTER2()
 }
 
+#define NEXT_LINE()\
+    dst += d_stride; \
+    prev += prefs; \
+    cur  += prefs; \
+    next += prefs;
+
+void ff_bwdif_filter_line3_c(void * dst1, int d_stride,
+                             const void * prev1, const void * cur1, const void * next1, int s_stride,
+                             int w, int parity, int clip_max)
+{
+    const int prefs = s_stride;
+    uint8_t * dst  = dst1;
+    const uint8_t * prev = prev1;
+    const uint8_t * cur  = cur1;
+    const uint8_t * next = next1;
+
+    ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+                           prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+    NEXT_LINE();
+    memcpy(dst, cur, w);
+    NEXT_LINE();
+    ff_bwdif_filter_line_c(dst, (void*)prev, (void*)cur, (void*)next, w,
+                           prefs, -prefs, prefs * 2, - prefs * 2, prefs * 3, -prefs * 3, prefs * 4, -prefs * 4, parity, clip_max);
+}
+
 void ff_bwdif_filter_edge_c(void *dst1, void *prev1, void *cur1, void *next1,
                             int w, int prefs, int mrefs, int prefs2, int mrefs2,
                             int parity, int clip_max, int spat)
@@ -212,6 +237,13 @@ static void filter_edge_16bit(void *dst1, void *prev1, void *cur1, void *next1,
     FILTER2()
 }
 
+// Round job start line down to multiple of 4 so that if filter_line3 exists
+// and the frame is a multiple of 4 high then filter_line will never be called
+static inline int job_start(const int jobnr, const int nb_jobs, const int h)
+{
+    return jobnr >= nb_jobs ? h : ((h * jobnr) / nb_jobs) & ~3;
+}
+
 static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
 {
     BWDIFContext *s = ctx->priv;
@@ -221,8 +253,8 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
     int clip_max = (1 << (yadif->csp->comp[td->plane].depth)) - 1;
     int df = (yadif->csp->comp[td->plane].depth + 7) / 8;
     int refs = linesize / df;
-    int slice_start = (td->h *  jobnr   ) / nb_jobs;
-    int slice_end   = (td->h * (jobnr+1)) / nb_jobs;
+    int slice_start = job_start(jobnr, nb_jobs, td->h);
+    int slice_end   = job_start(jobnr + 1, nb_jobs, td->h);
     int y;
 
     for (y = slice_start; y < slice_end; y++) {
@@ -244,6 +276,11 @@ static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs)
                                refs << 1, -(refs << 1),
                                td->parity ^ td->tff, clip_max,
                                (y < 2) || ((y + 3) > td->h) ? 0 : 1);
+            } else if (s->filter_line3 && y + 2 < slice_end && y + 6 < td->h) {
+                s->filter_line3(dst, td->frame->linesize[td->plane],
+                                prev, cur, next, linesize, td->w,
+                                td->parity ^ td->tff, clip_max);
+                y += 2;
             } else {
                 s->filter_line(dst, prev, cur, next, td->w,
                                refs, -refs, refs << 1, -(refs << 1),
@@ -280,7 +317,7 @@ static void filter(AVFilterContext *ctx, AVFrame *dstpic,
         td.plane = i;
 
         ff_filter_execute(ctx, filter_slice, &td, NULL,
-                          FFMIN(h, ff_filter_get_nb_threads(ctx)));
+                          FFMIN((h+3)/4, ff_filter_get_nb_threads(ctx)));
     }
     if (yadif->current_field == YADIF_FIELD_END) {
         yadif->current_field = YADIF_FIELD_NORMAL;
@@ -357,6 +394,7 @@ static int config_props(AVFilterLink *link)
 
 av_cold void ff_bwdif_init_filter_line(BWDIFContext *s, int bit_depth)
 {
+    s->filter_line3 = 0;
     if (bit_depth > 8) {
         s->filter_intra = filter_intra_16bit;
         s->filter_line  = filter_line_c_16bit;
diff --git a/tests/checkasm/vf_bwdif.c b/tests/checkasm/vf_bwdif.c
index 5fdba09fdc..3399cacdf7 100644
--- a/tests/checkasm/vf_bwdif.c
+++ b/tests/checkasm/vf_bwdif.c
@@ -28,6 +28,10 @@
     for (size_t i = 0; i < count; i++) \
         buf0[i] = buf1[i] = rnd() & mask
 
+#define randomize_overflow_check(buf0, buf1, mask, count) \
+    for (size_t i = 0; i < count; i++) \
+        buf0[i] = buf1[i] = (rnd() & 1) != 0 ? mask : 0;
+
 #define BODY(type, depth)                                                      \
     do {                                                                       \
         type prev0[9*WIDTH], prev1[9*WIDTH];                                   \
@@ -83,6 +87,83 @@ void checkasm_check_vf_bwdif(void)
         report("bwdif10");
     }
 
+    if (!ctx_8.filter_line3)
+        ctx_8.filter_line3 = ff_bwdif_filter_line3_c;
+
+    {
+        LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, next0, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, next1, [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, cur0,  [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, cur1,  [11*WIDTH]);
+        LOCAL_ALIGNED_16(uint8_t, dst0,  [WIDTH*3]);
+        LOCAL_ALIGNED_16(uint8_t, dst1,  [WIDTH*3]);
+        const int stride = WIDTH;
+        const int mask = (1<<8)-1;
+        int parity;
+
+        for (parity = 0; parity != 2; ++parity) {
+            if (check_func(ctx_8.filter_line3, "bwdif8.line3.rnd.p%d", parity)) {
+
+                declare_func(void, void * dst1, int d_stride,
+                                          const void * prev1, const void * cur1, const void * next1, int prefs,
+                                          int w, int parity, int clip_max);
+
+                randomize_buffers(prev0, prev1, mask, 11*WIDTH);
+                randomize_buffers(next0, next1, mask, 11*WIDTH);
+                randomize_buffers( cur0,  cur1, mask, 11*WIDTH);
+
+                call_ref(dst0, stride,
+                         prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride,
+                         WIDTH, parity, mask);
+                call_new(dst1, stride,
+                         prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride,
+                         WIDTH, parity, mask);
+
+                if (memcmp(dst0, dst1, WIDTH*3)
+                        || memcmp(prev0, prev1, WIDTH*11)
+                        || memcmp(next0, next1, WIDTH*11)
+                        || memcmp( cur0,  cur1, WIDTH*11))
+                    fail();
+
+                bench_new(dst1, stride,
+                         prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride,
+                         WIDTH, parity, mask);
+            }
+        }
+
+        // Use just 0s and ~0s to try to provoke bad cropping or overflow
+        // Parity makes no difference to this test so just test 0
+        if (check_func(ctx_8.filter_line3, "bwdif8.line3.overflow")) {
+
+            declare_func(void, void * dst1, int d_stride,
+                                      const void * prev1, const void * cur1, const void * next1, int prefs,
+                                      int w, int parity, int clip_max);
+
+            randomize_overflow_check(prev0, prev1, mask, 11*WIDTH);
+            randomize_overflow_check(next0, next1, mask, 11*WIDTH);
+            randomize_overflow_check( cur0,  cur1, mask, 11*WIDTH);
+
+            call_ref(dst0, stride,
+                     prev0 + stride * 4, cur0 + stride * 4, next0 + stride * 4, stride,
+                     WIDTH, 0, mask);
+            call_new(dst1, stride,
+                     prev1 + stride * 4, cur1 + stride * 4, next1 + stride * 4, stride,
+                     WIDTH, 0, mask);
+
+            if (memcmp(dst0, dst1, WIDTH*3)
+                    || memcmp(prev0, prev1, WIDTH*11)
+                    || memcmp(next0, next1, WIDTH*11)
+                    || memcmp( cur0,  cur1, WIDTH*11))
+                fail();
+
+            // No point in benchmarking this case
+        }
+
+        report("bwdif8.line3");
+    }
+
     {
         LOCAL_ALIGNED_16(uint8_t, prev0, [11*WIDTH]);
         LOCAL_ALIGNED_16(uint8_t, prev1, [11*WIDTH]);