From patchwork Thu Feb 22 11:38:15 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "J. Dekker" X-Patchwork-Id: 46444 From: "J.
Dekker" To: ffmpeg-devel@ffmpeg.org Date: Thu, 22 Feb 2024 12:38:15 +0100 Message-ID: <20240222113817.51750-1-jdek@itanimul.li> X-Mailer: git-send-email 2.43.2 MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH v2 1/3] checkasm/hevc_deblock: add luma and chroma full Signed-off-by: J. Dekker --- tests/checkasm/hevc_deblock.c | 246 +++++++++++++++++++++++++++++----- 1 file changed, 215 insertions(+), 31 deletions(-) diff --git a/tests/checkasm/hevc_deblock.c b/tests/checkasm/hevc_deblock.c index 66fc8d5646..91e57f5cf5 100644 --- a/tests/checkasm/hevc_deblock.c +++ b/tests/checkasm/hevc_deblock.c @@ -19,9 +19,9 @@ #include #include "libavutil/intreadwrite.h" +#include "libavutil/macros.h" #include "libavutil/mem_internal.h" -#include "libavcodec/avcodec.h" #include "libavcodec/hevcdsp.h" #include "checkasm.h" @@ -29,10 +29,11 @@ static const uint32_t pixel_mask[3] = { 0xffffffff, 0x03ff03ff, 0x0fff0fff }; #define SIZEOF_PIXEL ((bit_depth + 7) / 8) -#define BUF_STRIDE (8 * 2) -#define BUF_LINES (8) -#define BUF_OFFSET (BUF_STRIDE * BUF_LINES) -#define BUF_SIZE (BUF_STRIDE * BUF_LINES + BUF_OFFSET * 2) +#define BUF_STRIDE (16 * 2) +#define BUF_LINES (16) +// large buffer sizes based on high bit depth +#define BUF_OFFSET (2 * BUF_STRIDE * BUF_LINES) +#define BUF_SIZE (2 * BUF_STRIDE * BUF_LINES + BUF_OFFSET * 2) #define randomize_buffers(buf0, buf1, size) \ do { \ @@ -45,57 +46,240 @@ static const uint32_t pixel_mask[3] = { 0xffffffff, 0x03ff03ff, 0x0fff0fff }; } \ } while (0) -static void check_deblock_chroma(HEVCDSPContext *h, int bit_depth) +static void check_deblock_chroma(HEVCDSPContext *h, int bit_depth, int c) { - int32_t 
tc[2] = { 0, 0 }; + // see tctable[] in hevc_filter.c, we check full range + int32_t tc[2] = { rnd() % 25, rnd() % 25 }; // no_p, no_q can only be { 0,0 } for the simpler assembly (non *_c // variant) functions, see deblocking_filter_CTB() in hevc_filter.c - uint8_t no_p[2] = { 0, 0 }; - uint8_t no_q[2] = { 0, 0 }; + uint8_t no_p[2] = { rnd() & c, rnd() & c }; + uint8_t no_q[2] = { rnd() & c, rnd() & c }; LOCAL_ALIGNED_32(uint8_t, buf0, [BUF_SIZE]); LOCAL_ALIGNED_32(uint8_t, buf1, [BUF_SIZE]); declare_func(void, uint8_t *pix, ptrdiff_t stride, int32_t *tc, uint8_t *no_p, uint8_t *no_q); - if (check_func(h->hevc_h_loop_filter_chroma, "hevc_h_loop_filter_chroma%d", bit_depth)) { - for (int i = 0; i < 4; i++) { - randomize_buffers(buf0, buf1, BUF_SIZE); - // see betatable[] in hevc_filter.c - tc[0] = (rnd() & 63) + (rnd() & 1); - tc[1] = (rnd() & 63) + (rnd() & 1); + if (check_func(c ? h->hevc_h_loop_filter_chroma_c : h->hevc_h_loop_filter_chroma, + "hevc_h_loop_filter_chroma%d%s", bit_depth, c ? "_full" : "")) + { + randomize_buffers(buf0, buf1, BUF_SIZE); - call_ref(buf0 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); - call_new(buf1 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); + call_ref(buf0 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); + call_new(buf1 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); + if (memcmp(buf0, buf1, BUF_SIZE)) + fail(); + bench_new(buf1 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); + } + + if (check_func(c ? h->hevc_v_loop_filter_chroma_c : h->hevc_v_loop_filter_chroma, + "hevc_v_loop_filter_chroma%d%s", bit_depth, c ? 
"_full" : "")) + { + randomize_buffers(buf0, buf1, BUF_SIZE); + + call_ref(buf0 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); + call_new(buf1 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); + if (memcmp(buf0, buf1, BUF_SIZE)) + fail(); + bench_new(buf1 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); + } +} + +#define P3 buf[-4 * xstride] +#define P2 buf[-3 * xstride] +#define P1 buf[-2 * xstride] +#define P0 buf[-1 * xstride] +#define Q0 buf[0 * xstride] +#define Q1 buf[1 * xstride] +#define Q2 buf[2 * xstride] +#define Q3 buf[3 * xstride] + +#define TC25(x) ((tc[x] * 5 + 1) >> 1) +#define MASK(x) (uint16_t)(x & ((1 << (bit_depth)) - 1)) +#define GET(x) ((SIZEOF_PIXEL == 1) ? *(uint8_t*)(&x) : *(uint16_t*)(&x)) +#define SET(x, y) do { \ + uint16_t z = MASK(y); \ + if (SIZEOF_PIXEL == 1) \ + *(uint8_t*)(&x) = z; \ + else \ + *(uint16_t*)(&x) = z; \ +} while (0) +#define RANDCLIP(x, diff) av_clip(GET(x) - (diff), 0, \ + (1 << (bit_depth)) - 1) + rnd() % FFMAX(2 * (diff), 1) + +// NOTE: this function doesn't work 'correctly' in that it won't always choose +// strong/strong or weak/weak, in most cases it tends to but will sometimes mix +// weak/strong or even skip sometimes. This is more useful to test correctness +// for these functions, though it does make benching them difficult. The easiest +// way to bench these functions is to check an overall decode since there are too +// many paths and ways to trigger the deblock: we would have to bench all +// permutations of weak/strong/skip/nd_q/nd_p/no_q/no_p and it quickly becomes +// too much. 
+static void randomize_luma_buffers(int type, int *beta, int32_t tc[2], + uint8_t *buf, ptrdiff_t xstride, ptrdiff_t ystride, int bit_depth) +{ + int i, j, b3, tc25, tc25diff, b3diff; + // both tc & beta are unscaled inputs + // minimum useful value is 1, full range 0-24 + tc[0] = (rnd() % 25) + 1; + tc[1] = (rnd() % 25) + 1; + // minimum useful value for 8bit is 8 + *beta = (rnd() % 57) + 8; + + switch (type) { + case 0: // strong + for (j = 0; j < 2; j++) { + tc25 = TC25(j) << (bit_depth - 8); + tc25diff = FFMAX(tc25 - 1, 0); + // 4 lines per tc + for (i = 0; i < 4; i++) { + b3 = (*beta << (bit_depth - 8)) >> 3; + + SET(P0, rnd() % (1 << bit_depth)); + SET(Q0, RANDCLIP(P0, tc25diff)); + + // p3 - p0 up to beta3 budget + b3diff = rnd() % b3; + SET(P3, RANDCLIP(P0, b3diff)); + // q3 - q0, reduced budget + b3diff = rnd() % FFMAX(b3 - b3diff, 1); + SET(Q3, RANDCLIP(Q0, b3diff)); + + // same concept, budget across 4 pixels + b3 -= b3diff = rnd() % FFMAX(b3, 1); + SET(P2, RANDCLIP(P0, b3diff)); + b3 -= b3diff = rnd() % FFMAX(b3, 1); + SET(Q2, RANDCLIP(Q0, b3diff)); + + // extra reduced budget for weighted pixels + b3 -= b3diff = rnd() % FFMAX(b3 - (1 << (bit_depth - 8)), 1); + SET(P1, RANDCLIP(P0, b3diff)); + b3 -= b3diff = rnd() % FFMAX(b3 - (1 << (bit_depth - 8)), 1); + SET(Q1, RANDCLIP(Q0, b3diff)); + + buf += ystride; + } + } + break; + case 1: // weak + for (j = 0; j < 2; j++) { + tc25 = TC25(j) << (bit_depth - 8); + tc25diff = FFMAX(tc25 - 1, 0); + // 4 lines per tc + for (i = 0; i < 4; i++) { + // Weak filtering is significantly simpler to activate as + // we only need to satisfy d0 + d3 < beta, which + // can be simplified to d0 + d0 < beta. Using the above + // derivations but substituting b3 for b1 and ensuring + // that P0/Q0 are at least 1/2 tc25diff apart (tending + // towards 1/2 range). + b3 = (*beta << (bit_depth - 8)) >> 1; + + SET(P0, rnd() % (1 << bit_depth)); + SET(Q0, RANDCLIP(P0, tc25diff >> 1) + + (tc25diff >> 1) * (P0 < (1 << (bit_depth - 1))) ? 
1 : -1); + + // p3 - p0 up to beta3 budget + b3diff = rnd() % b3; + SET(P3, RANDCLIP(P0, b3diff)); + // q3 - q0, reduced budget + b3diff = rnd() % FFMAX(b3 - b3diff, 1); + SET(Q3, RANDCLIP(Q0, b3diff)); + + // same concept, budget across 4 pixels + b3 -= b3diff = rnd() % FFMAX(b3, 1); + SET(P2, RANDCLIP(P0, b3diff)); + b3 -= b3diff = rnd() % FFMAX(b3, 1); + SET(Q2, RANDCLIP(Q0, b3diff)); + + // extra reduced budget for weighted pixels + b3 -= b3diff = rnd() % FFMAX(b3 - (1 << (bit_depth - 8)), 1); + SET(P1, RANDCLIP(P0, b3diff)); + b3 -= b3diff = rnd() % FFMAX(b3 - (1 << (bit_depth - 8)), 1); + SET(Q1, RANDCLIP(Q0, b3diff)); + + buf += ystride; + } + } + break; + case 2: // none + *beta = 0; // ensure skip + for (i = 0; i < 8; i++) { + // we can just fill with completely random data, nothing should be touched. + SET(P3, rnd()); SET(P2, rnd()); SET(P1, rnd()); SET(P0, rnd()); + SET(Q0, rnd()); SET(Q1, rnd()); SET(Q2, rnd()); SET(Q3, rnd()); + buf += ystride; + } + break; + } +} + +static void check_deblock_luma(HEVCDSPContext *h, int bit_depth, int c) +{ + const char *type; + const char *types[3] = { "strong", "weak", "skip" }; + int beta; + int32_t tc[2] = {0}; + uint8_t no_p[2] = { rnd() & c, rnd() & c }; + uint8_t no_q[2] = { rnd() & c, rnd() & c }; + LOCAL_ALIGNED_32(uint8_t, buf0, [BUF_SIZE]); + LOCAL_ALIGNED_32(uint8_t, buf1, [BUF_SIZE]); + uint8_t *ptr0 = buf0 + BUF_OFFSET, + *ptr1 = buf1 + BUF_OFFSET; + + declare_func(void, uint8_t *pix, ptrdiff_t stride, int beta, int32_t *tc, uint8_t *no_p, uint8_t *no_q); + + for (int j = 0; j < 3; j++) { + type = types[j]; + if (check_func(c ? h->hevc_h_loop_filter_luma_c : h->hevc_h_loop_filter_luma, + "hevc_h_loop_filter_luma%d_%s%s", bit_depth, type, c ? 
"_full" : "")) + { + randomize_luma_buffers(j, &beta, tc, buf0 + BUF_OFFSET, 16 * SIZEOF_PIXEL, SIZEOF_PIXEL, bit_depth); + memcpy(buf1, buf0, BUF_SIZE); + + call_ref(ptr0, 16 * SIZEOF_PIXEL, beta, tc, no_p, no_q); + call_new(ptr1, 16 * SIZEOF_PIXEL, beta, tc, no_p, no_q); if (memcmp(buf0, buf1, BUF_SIZE)) fail(); + bench_new(ptr1, 16 * SIZEOF_PIXEL, beta, tc, no_p, no_q); } - bench_new(buf1 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); - } - if (check_func(h->hevc_v_loop_filter_chroma, "hevc_v_loop_filter_chroma%d", bit_depth)) { - for (int i = 0; i < 4; i++) { - randomize_buffers(buf0, buf1, BUF_SIZE); - // see betatable[] in hevc_filter.c - tc[0] = (rnd() & 63) + (rnd() & 1); - tc[1] = (rnd() & 63) + (rnd() & 1); + if (check_func(c ? h->hevc_v_loop_filter_luma_c : h->hevc_v_loop_filter_luma, + "hevc_v_loop_filter_luma%d_%s%s", bit_depth, type, c ? "_full" : "")) + { + randomize_luma_buffers(j, &beta, tc, buf0 + BUF_OFFSET, SIZEOF_PIXEL, 16 * SIZEOF_PIXEL, bit_depth); + memcpy(buf1, buf0, BUF_SIZE); - call_ref(buf0 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); - call_new(buf1 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); + call_ref(ptr0, 16 * SIZEOF_PIXEL, beta, tc, no_p, no_q); + call_new(ptr1, 16 * SIZEOF_PIXEL, beta, tc, no_p, no_q); if (memcmp(buf0, buf1, BUF_SIZE)) fail(); + bench_new(ptr1, 16 * SIZEOF_PIXEL, beta, tc, no_p, no_q); } - bench_new(buf1 + BUF_OFFSET, BUF_STRIDE, tc, no_p, no_q); } } void checkasm_check_hevc_deblock(void) { + HEVCDSPContext h; int bit_depth; - for (bit_depth = 8; bit_depth <= 12; bit_depth += 2) { - HEVCDSPContext h; ff_hevc_dsp_init(&h, bit_depth); - check_deblock_chroma(&h, bit_depth); + check_deblock_chroma(&h, bit_depth, 0); } report("chroma"); + for (bit_depth = 8; bit_depth <= 12; bit_depth += 2) { + ff_hevc_dsp_init(&h, bit_depth); + check_deblock_chroma(&h, bit_depth, 1); + } + report("chroma_full"); + for (bit_depth = 8; bit_depth <= 12; bit_depth += 2) { + ff_hevc_dsp_init(&h, bit_depth); + check_deblock_luma(&h, bit_depth, 0); 
+ } + report("luma"); + for (bit_depth = 8; bit_depth <= 12; bit_depth += 2) { + ff_hevc_dsp_init(&h, bit_depth); + check_deblock_luma(&h, bit_depth, 1); + } + report("luma_full"); }

From patchwork Thu Feb 22 11:38:17 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "J. Dekker" X-Patchwork-Id: 46445 From: "J. Dekker" To: ffmpeg-devel@ffmpeg.org Date: Thu, 22 Feb 2024 12:38:17 +0100 Message-ID: <20240222113817.51750-3-jdek@itanimul.li> X-Mailer: git-send-email 2.43.2 In-Reply-To: <20240222113817.51750-1-jdek@itanimul.li> References: <20240222113817.51750-1-jdek@itanimul.li> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH v2 3/3] avcodec/aarch64: add hevc deblock NEON

Benched using single-threaded full decode on an Ampere Altra.

Bpp  Before  After   Speedup
8    73,3s   65,2s   1.124x
10   114,2s  104,0s  1.098x
12   125,8s  115,7s  1.087x

Signed-off-by: J.
Dekker --- libavcodec/aarch64/hevcdsp_deblock_neon.S | 421 ++++++++++++++++++++++ libavcodec/aarch64/hevcdsp_init_aarch64.c | 18 + 2 files changed, 439 insertions(+) diff --git a/libavcodec/aarch64/hevcdsp_deblock_neon.S b/libavcodec/aarch64/hevcdsp_deblock_neon.S index 8227f65649..314dbf78dc 100644 --- a/libavcodec/aarch64/hevcdsp_deblock_neon.S +++ b/libavcodec/aarch64/hevcdsp_deblock_neon.S @@ -181,3 +181,424 @@ hevc_h_loop_filter_chroma 12 hevc_v_loop_filter_chroma 8 hevc_v_loop_filter_chroma 10 hevc_v_loop_filter_chroma 12 + +.macro hevc_loop_filter_luma_body bitdepth +function hevc_loop_filter_luma_body_\bitdepth\()_neon, export=0 +.if \bitdepth > 8 + lsl w2, w2, #(\bitdepth - 8) // beta <<= BIT_DEPTH - 8 +.else + uxtl v0.8h, v0.8b + uxtl v1.8h, v1.8b + uxtl v2.8h, v2.8b + uxtl v3.8h, v3.8b + uxtl v4.8h, v4.8b + uxtl v5.8h, v5.8b + uxtl v6.8h, v6.8b + uxtl v7.8h, v7.8b +.endif + ldr w7, [x3] // tc[0] + ldr w8, [x3, #4] // tc[1] + dup v18.4h, w7 + dup v19.4h, w8 + trn1 v18.2d, v18.2d, v19.2d +.if \bitdepth > 8 + shl v18.8h, v18.8h, #(\bitdepth - 8) +.endif + dup v27.8h, w2 // beta + // tc25 + shl v19.8h, v18.8h, #2 // * 4 + add v19.8h, v19.8h, v18.8h // (tc * 5) + srshr v19.8h, v19.8h, #1 // (tc * 5 + 1) >> 1 + sshr v17.8h, v27.8h, #2 // beta2 + + ////// beta_2 check + // dp0 = abs(P2 - 2 * P1 + P0) + add v22.8h, v3.8h, v1.8h + shl v23.8h, v2.8h, #1 + sabd v30.8h, v22.8h, v23.8h + // dq0 = abs(Q2 - 2 * Q1 + Q0) + add v21.8h, v6.8h, v4.8h + shl v26.8h, v5.8h, #1 + sabd v31.8h, v21.8h, v26.8h + // d0 = dp0 + dq0 + add v20.8h, v30.8h, v31.8h + shl v25.8h, v20.8h, #1 + // (d0 << 1) < beta_2 + cmgt v23.8h, v17.8h, v25.8h + + ////// beta check + // d0 + d3 < beta + mov x9, #0xFFFF00000000FFFF + dup v24.2d, x9 + and v25.16b, v24.16b, v20.16b + addp v25.8h, v25.8h, v25.8h // 1+0 0+1 1+0 0+1 + addp v25.4h, v25.4h, v25.4h // 1+0+0+1 1+0+0+1 + cmgt v25.4h, v27.4h, v25.4h // lower/upper mask in h[0/1] + mov w9, v25.s[0] + cmp w9, #0 + sxtl v26.4s, v25.4h + sxtl v16.2d, 
v26.2s // full skip mask + b.eq 3f // skip both blocks + + // TODO: we can check the full skip mask with the weak/strong mask to + // potentially skip weak or strong calculation entirely if we only have one + + ////// beta_3 check + // abs(P3 - P0) + abs(Q3 - Q0) < beta_3 + sshr v17.8h, v17.8h, #1 // beta_3 + sabd v20.8h, v0.8h, v3.8h + saba v20.8h, v7.8h, v4.8h + cmgt v21.8h, v17.8h, v20.8h + + and v23.16b, v23.16b, v21.16b + + ////// tc25 check + // abs(P0 - Q0) < tc25 + sabd v20.8h, v3.8h, v4.8h + cmgt v21.8h, v19.8h, v20.8h + + and v23.16b, v23.16b, v21.16b + + ////// Generate low/high line max from lines 0/3/4/7 + // mask out lines 2/3/5/6 + not v20.16b, v24.16b // 0x0000FFFFFFFF0000 + orr v23.16b, v23.16b, v20.16b + + // generate weak/strong mask + uminp v23.8h, v23.8h, v23.8h // extend to singles + sxtl v23.4s, v23.4h + uminp v26.4s, v23.4s, v23.4s // check lines + // extract to gpr + ext v25.16b, v26.16b, v26.16b, #2 + zip1 v17.4s, v26.4s, v26.4s + mov w12, v25.s[0] + mov w11, #0x0000FFFF + mov w13, #0xFFFF0000 + // FFFF FFFF -> strong strong + // FFFF 0000 -> strong weak + // 0000 FFFF -> weak strong + // 0000 0000 -> weak weak + cmp w12, w13 + b.hi 0f // only strong/strong, skip weak nd_p/nd_q calc + + ////// weak nd_p/nd_q + // d0+d3 + and v30.16b, v30.16b, v24.16b // d0 __ __ d3 d4 __ __ d7 + and v31.16b, v31.16b, v24.16b + addp v30.8h, v30.8h, v30.8h // [d0+__ __+d3 d4+__ __+d7] [ ... 
] + addp v31.8h, v31.8h, v31.8h // [d0+d3 d4+d7] + addp v30.4h, v30.4h, v30.4h + addp v31.4h, v31.4h, v31.4h + + // ((beta + (beta >> 1)) >> 3) + sshr v21.8h, v27.8h, #1 + add v21.8h, v21.8h, v27.8h + sshr v21.8h, v21.8h, #3 + + // nd_p = dp0 + dp3 < ((beta + (beta >> 1)) >> 3) + cmgt v30.8h, v21.8h, v30.8h + // nd_q = dq0 + dq3 < ((beta + (beta >> 1)) >> 3) + cmgt v31.8h, v21.8h, v31.8h + + sxtl v30.4s, v30.4h + sxtl v31.4s, v31.4h + sxtl v28.2d, v30.2s + sxtl v29.2d, v31.2s + + cmp w12, w11 + b.lo 1f // can only be weak weak, skip strong + +0: // STRONG FILTER + + // P0 = p0 + av_clip(((p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3) - p0, -tc3, tc3); + add v21.8h, v2.8h, v3.8h // (p1 + p0 + add v21.8h, v4.8h, v21.8h // + q0) + shl v21.8h, v21.8h, #1 // * 2 + add v22.8h, v1.8h, v5.8h // (p2 + q1) + add v21.8h, v22.8h, v21.8h // + + srshr v21.8h, v21.8h, #3 // >> 3 + sub v21.8h, v21.8h, v3.8h // - p0 + + // P1 = p1 + av_clip(((p2 + p1 + p0 + q0 + 2) >> 2) - p1, -tc2, tc2); + + add v22.8h, v1.8h, v2.8h + add v23.8h, v3.8h, v4.8h + add v22.8h, v22.8h, v23.8h + srshr v22.8h, v22.8h, #2 + sub v22.8h, v22.8h, v2.8h + + // P2 = p2 + av_clip(((2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3) - p2, -tc, tc); + + add v23.8h, v0.8h, v1.8h // p3 + p2 + add v24.8h, v3.8h, v4.8h // p0 + q0 + shl v23.8h, v23.8h, #1 // * 2 + add v23.8h, v23.8h, v24.8h + add v24.8h, v1.8h, v2.8h // p2 + p1 + add v23.8h, v23.8h, v24.8h + srshr v23.8h, v23.8h, #3 + sub v23.8h, v23.8h, v1.8h + + // Q0 = q0 + av_clip(((p1 + 2 * p0 + 2 * q0 + 2 * q1 + q2 + 4) >> 3) - q0, -tc3, tc3); + add v24.8h, v3.8h, v4.8h // (p0 + q0 + add v24.8h, v5.8h, v24.8h // + q1) + shl v24.8h, v24.8h, #1 // * 2 + add v25.8h, v2.8h, v6.8h // (p1 + q2) + add v24.8h, v25.8h, v24.8h // + + srshr v24.8h, v24.8h, #3 // >> 3 + sub v24.8h, v24.8h, v4.8h // - q0 + + // Q1 = q1 + av_clip(((p0 + q0 + q1 + q2 + 2) >> 2) - q1, -tc2, tc2); + + add v25.8h, v6.8h, v5.8h + add v26.8h, v3.8h, v4.8h + add v25.8h, v25.8h, v26.8h + srshr v25.8h, v25.8h, 
#2 + sub v25.8h, v25.8h, v5.8h + + // Q2 = q2 + av_clip(((2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3) - q2, -tc, tc); + + add v26.8h, v7.8h, v6.8h + add v27.8h, v6.8h, v5.8h + shl v26.8h, v26.8h, #1 + add v26.8h, v26.8h, v27.8h + add v27.8h, v3.8h, v4.8h + add v26.8h, v26.8h, v27.8h + srshr v26.8h, v26.8h, #3 + sub v26.8h, v26.8h, v6.8h + + // this clip should work properly + shl v30.8h, v18.8h, #1 // tc2 + neg v31.8h, v30.8h // -tc2 + clip v31.8h, v30.8h, v21.8h, v22.8h, v23.8h, v24.8h, v25.8h, v26.8h + + and v21.16b, v21.16b, v16.16b + and v22.16b, v22.16b, v16.16b + and v23.16b, v23.16b, v16.16b + and v24.16b, v24.16b, v16.16b + and v25.16b, v25.16b, v16.16b + and v26.16b, v26.16b, v16.16b + + add v23.8h, v23.8h, v1.8h // careful + add v22.8h, v22.8h, v2.8h + add v21.8h, v21.8h, v3.8h + add v24.8h, v24.8h, v4.8h + add v25.8h, v25.8h, v5.8h + add v26.8h, v26.8h, v6.8h + + cmp w12, w13 + b.hi 2f // only strong/strong, skip weak + +1: // WEAK FILTER + + // delta0 = (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4 +.if \bitdepth < 12 + sub v27.8h, v4.8h, v3.8h // q0 - p0 + shl v30.8h, v27.8h, #3 // * 8 + add v27.8h, v27.8h, v30.8h // 9 * (q0 - p0) + + sub v30.8h, v5.8h, v2.8h // q1 - p1 + shl v31.8h, v30.8h, #1 // * 2 + + sub v27.8h, v27.8h, v31.8h + sub v27.8h, v27.8h, v30.8h // - 3 * (q1 - p1) + srshr v27.8h, v27.8h, #4 +.else + ssubl v19.4s, v4.4h, v3.4h // q0 - p0 + ssubl2 v20.4s, v4.8h, v3.8h + + shl v30.4s, v19.4s, #3 // * 8 + shl v31.4s, v20.4s, #3 + + add v30.4s, v30.4s, v19.4s // 9 * (q0 - p0) + add v31.4s, v31.4s, v20.4s + + ssubl v19.4s, v5.4h, v2.4h // q1 - p1 + ssubl2 v20.4s, v5.8h, v2.8h + + sub v30.4s, v30.4s, v19.4s + sub v31.4s, v31.4s, v20.4s + + shl v19.4s, v19.4s, #1 + shl v20.4s, v20.4s, #1 + + sub v30.4s, v30.4s, v19.4s + sub v31.4s, v31.4s, v20.4s + + sqrshrn v27.4h, v30.4s, #4 + sqrshrn2 v27.8h, v31.4s, #4 +.endif + + // delta0 10tc check mask + shl v30.8h, v18.8h, #1 // * 2 + shl v31.8h, v18.8h, #3 // * 8 + add v30.8h, v30.8h, v31.8h // 10 * tc + abs 
v31.8h, v27.8h + cmgt v20.8h, v30.8h, v31.8h // abs(delta0) < 10 * tc + + and v20.16b, v20.16b, v16.16b // combine with full mask + + neg v31.8h, v18.8h // -tc + clip v31.8h, v18.8h, v27.8h // delta0 = av_clip(delta0, -tc, tc) + + // deltap1 = av_clip((((p2 + p0 + 1) >> 1) - p1 + delta0) >> 1, -tc_2, tc_2) + add v30.8h, v1.8h, v3.8h + srshr v30.8h, v30.8h, #1 + sub v30.8h, v30.8h, v2.8h + add v30.8h, v30.8h, v27.8h + sshr v30.8h, v30.8h, #1 + + // p3 p2 p1 p0 q0 q1 q2 q3 + // v0 v1 v2 v3 v4 v5 v6 v7 + + // deltaq1 = av_clip((((q2 + q0 + 1) >> 1) - q1 - delta0) >> 1, -tc_2, tc_2); + add v31.8h, v6.8h, v4.8h + srshr v31.8h, v31.8h, #1 + sub v31.8h, v31.8h, v5.8h + sub v31.8h, v31.8h, v27.8h + sshr v31.8h, v31.8h, #1 + + // apply nd_p nd_q mask to deltap1/deltaq1 + and v30.16b, v30.16b, v28.16b + and v31.16b, v31.16b, v29.16b + + // apply full skip mask to deltap1/deltaq1/delta0 + and v30.16b, v30.16b, v20.16b + and v27.16b, v27.16b, v20.16b + and v31.16b, v31.16b, v20.16b + + // clip P1/Q1 to -tc_2, tc_2 + sshr v18.8h, v18.8h, #1 // tc2 + neg v28.8h, v18.8h + clip v28.8h, v18.8h, v30.8h, v31.8h + + // P0 = av_clip_pixel(p0 + delta0) + // Q0 = av_clip_pixel(q0 - delta0) + add v29.8h, v3.8h, v27.8h // P0 + sub v27.8h, v4.8h, v27.8h // Q0 + + // P1 = av_clip_pixel(p1 + deltap1) + // Q1 = av_clip_pixel(q1 + deltaq1) + add v30.8h, v2.8h, v30.8h // P1 + add v31.8h, v5.8h, v31.8h // Q1 + +2: // MIX WEAK/STRONG + + mov v19.16b, v1.16b + mov v20.16b, v6.16b + // copy selection mask + mov v1.16b, v17.16b + mov v2.16b, v17.16b + mov v3.16b, v17.16b + mov v4.16b, v17.16b + mov v5.16b, v17.16b + mov v6.16b, v17.16b + // select + bsl v1.16b, v23.16b, v19.16b // P2 strong/orig + bsl v2.16b, v22.16b, v30.16b // P1 strong/weak + bsl v3.16b, v21.16b, v29.16b // P0 strong/weak + bsl v4.16b, v24.16b, v27.16b // Q0 strong/weak + bsl v5.16b, v25.16b, v31.16b // Q1 strong/weak + bsl v6.16b, v26.16b, v20.16b // Q2 strong/orig + // NOTE: Q3/P3 are unchanged + +.if \bitdepth > 8 + movi 
v19.8h, #0
+        dup             v20.8h, w14
+        clip            v19.8h, v20.8h, v1.8h, v2.8h, v3.8h, v4.8h, v5.8h, v6.8h
+.else
+        sqxtun          v0.8b, v0.8h
+        sqxtun          v1.8b, v1.8h
+        sqxtun          v2.8b, v2.8h
+        sqxtun          v3.8b, v3.8h
+        sqxtun          v4.8b, v4.8h
+        sqxtun          v5.8b, v5.8h
+        sqxtun          v6.8b, v6.8h
+        sqxtun          v7.8b, v7.8h
+.endif
+        ret
+3:      ret             x6
+endfunc
+.endm
+
+hevc_loop_filter_luma_body 8
+hevc_loop_filter_luma_body 10
+hevc_loop_filter_luma_body 12
+
+// hevc_v_loop_filter_luma(uint8_t *pix, ptrdiff_t stride, int beta, const int32_t *tc, const uint8_t *no_p, const uint8_t *no_q)
+
+.macro hevc_loop_filter_luma dir, bitdepth
+function ff_hevc_\dir\()_loop_filter_luma_\bitdepth\()_neon, export=1
+        mov             x6, x30
+.ifc \dir, v
+.if \bitdepth > 8
+        sub             x0, x0, #8
+.else
+        sub             x0, x0, #4
+.endif
+.else
+        sub             x0, x0, x1, lsl #2 // -4 * xstride
+.endif
+        mov             x10, x0
+.if \bitdepth > 8
+        ld1             {v0.8h}, [x0], x1
+        ld1             {v1.8h}, [x0], x1
+        ld1             {v2.8h}, [x0], x1
+        ld1             {v3.8h}, [x0], x1
+        ld1             {v4.8h}, [x0], x1
+        ld1             {v5.8h}, [x0], x1
+        ld1             {v6.8h}, [x0], x1
+        ld1             {v7.8h}, [x0]
+        mov             w14, #((1 << \bitdepth) - 1)
+.ifc \dir, v
+        transpose_8x8H  v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+.endif
+.else
+        ld1             {v0.8b}, [x0], x1
+        ld1             {v1.8b}, [x0], x1
+        ld1             {v2.8b}, [x0], x1
+        ld1             {v3.8b}, [x0], x1
+        ld1             {v4.8b}, [x0], x1
+        ld1             {v5.8b}, [x0], x1
+        ld1             {v6.8b}, [x0], x1
+        ld1             {v7.8b}, [x0]
+.ifc \dir, v
+        transpose_8x8B  v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+.endif
+.endif
+        bl              hevc_loop_filter_luma_body_\bitdepth\()_neon
+.if \bitdepth > 8
+.ifc \dir, v
+        transpose_8x8H  v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+.endif
+        st1             {v0.8h}, [x10], x1
+        st1             {v1.8h}, [x10], x1
+        st1             {v2.8h}, [x10], x1
+        st1             {v3.8h}, [x10], x1
+        st1             {v4.8h}, [x10], x1
+        st1             {v5.8h}, [x10], x1
+        st1             {v6.8h}, [x10], x1
+        st1             {v7.8h}, [x10]
+.else
+.ifc \dir, v
+        transpose_8x8B  v0, v1, v2, v3, v4, v5, v6, v7, v16, v17
+.endif
+        st1             {v0.8b}, [x10], x1
+        st1             {v1.8b}, [x10], x1
+        st1             {v2.8b}, [x10], x1
+        st1             {v3.8b}, [x10], x1
+        st1             {v4.8b}, [x10], x1
+        st1
{v5.8b}, [x10], x1
+        st1             {v6.8b}, [x10], x1
+        st1             {v7.8b}, [x10]
+.endif
+        ret             x6
+endfunc
+.endm
+
+hevc_loop_filter_luma h, 8
+hevc_loop_filter_luma h, 10
+hevc_loop_filter_luma h, 12
+
+hevc_loop_filter_luma v, 8
+hevc_loop_filter_luma v, 10
+hevc_loop_filter_luma v, 12
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 687b6cc5c3..04692aa98e 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -38,6 +38,18 @@ void ff_hevc_h_loop_filter_chroma_10_neon(uint8_t *_pix, ptrdiff_t _stride,
                                           const int *_tc, const uint8_t *_no_p, const uint8_t *_no_q);
 void ff_hevc_h_loop_filter_chroma_12_neon(uint8_t *_pix, ptrdiff_t _stride,
                                           const int *_tc, const uint8_t *_no_p, const uint8_t *_no_q);
+void ff_hevc_v_loop_filter_luma_8_neon(uint8_t *_pix, ptrdiff_t _stride, int beta,
+                                       const int *_tc, const uint8_t *_no_p, const uint8_t *_no_q);
+void ff_hevc_v_loop_filter_luma_10_neon(uint8_t *_pix, ptrdiff_t _stride, int beta,
+                                        const int *_tc, const uint8_t *_no_p, const uint8_t *_no_q);
+void ff_hevc_v_loop_filter_luma_12_neon(uint8_t *_pix, ptrdiff_t _stride, int beta,
+                                        const int *_tc, const uint8_t *_no_p, const uint8_t *_no_q);
+void ff_hevc_h_loop_filter_luma_8_neon(uint8_t *_pix, ptrdiff_t _stride, int beta,
+                                       const int *_tc, const uint8_t *_no_p, const uint8_t *_no_q);
+void ff_hevc_h_loop_filter_luma_10_neon(uint8_t *_pix, ptrdiff_t _stride, int beta,
+                                        const int *_tc, const uint8_t *_no_p, const uint8_t *_no_q);
+void ff_hevc_h_loop_filter_luma_12_neon(uint8_t *_pix, ptrdiff_t _stride, int beta,
+                                        const int *_tc, const uint8_t *_no_p, const uint8_t *_no_q);
 void ff_hevc_add_residual_4x4_8_neon(uint8_t *_dst, const int16_t *coeffs,
                                      ptrdiff_t stride);
 void ff_hevc_add_residual_4x4_10_neon(uint8_t *_dst, const int16_t *coeffs,
@@ -291,6 +303,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
     if (!have_neon(cpu_flags)) return;
 
     if
(bit_depth == 8) {
+        c->hevc_h_loop_filter_luma     = ff_hevc_h_loop_filter_luma_8_neon;
+        c->hevc_v_loop_filter_luma     = ff_hevc_v_loop_filter_luma_8_neon;
         c->hevc_h_loop_filter_chroma   = ff_hevc_h_loop_filter_chroma_8_neon;
         c->hevc_v_loop_filter_chroma   = ff_hevc_v_loop_filter_chroma_8_neon;
         c->add_residual[0]             = ff_hevc_add_residual_4x4_8_neon;
@@ -379,6 +393,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 
     }
     if (bit_depth == 10) {
+        c->hevc_h_loop_filter_luma     = ff_hevc_h_loop_filter_luma_10_neon;
+        c->hevc_v_loop_filter_luma     = ff_hevc_v_loop_filter_luma_10_neon;
         c->hevc_h_loop_filter_chroma   = ff_hevc_h_loop_filter_chroma_10_neon;
         c->hevc_v_loop_filter_chroma   = ff_hevc_v_loop_filter_chroma_10_neon;
         c->add_residual[0]             = ff_hevc_add_residual_4x4_10_neon;
@@ -395,6 +411,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         c->idct_dc[3]                  = ff_hevc_idct_32x32_dc_10_neon;
     }
     if (bit_depth == 12) {
+        c->hevc_h_loop_filter_luma     = ff_hevc_h_loop_filter_luma_12_neon;
+        c->hevc_v_loop_filter_luma     = ff_hevc_v_loop_filter_luma_12_neon;
         c->hevc_h_loop_filter_chroma   = ff_hevc_h_loop_filter_chroma_12_neon;
         c->hevc_v_loop_filter_chroma   = ff_hevc_v_loop_filter_chroma_12_neon;
         c->add_residual[0]             = ff_hevc_add_residual_4x4_12_neon;
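
Review note (not part of the patch): for checking the weak-filter path of the NEON code by hand, a scalar sketch of the formulas quoted in the assembly comments may help. It filters a single line of samples and assumes an arithmetic right shift for negative values. The names `clip3` and `weak_filter_ref`, and the `update_p1`/`update_q1` flags (standing in for the nd_p/nd_q masks), are hypothetical and exist only for this illustration.

```c
#include <stdlib.h>

/* Hypothetical helper: clamp v to [lo, hi]. */
static int clip3(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Scalar sketch of the weak luma filter for one line of samples.
 * p[0..2] = p0, p1, p2 and q[0..2] = q0, q1, q2.
 * update_p1 / update_q1 play the role of the nd_p / nd_q masks.
 * Returns 1 if the line was filtered, 0 if the 10 * tc check rejected it.
 */
static int weak_filter_ref(int p[3], int q[3], int tc,
                           int update_p1, int update_q1, int bitdepth)
{
    const int p0 = p[0], p1 = p[1], p2 = p[2];
    const int q0 = q[0], q1 = q[1], q2 = q[2];
    const int pixel_max = (1 << bitdepth) - 1;
    const int tc_2 = tc >> 1;

    /* delta0 = (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4 */
    int delta0 = (9 * (q0 - p0) - 3 * (q1 - p1) + 8) >> 4;

    /* abs(delta0) < 10 * tc, otherwise leave the line untouched */
    if (abs(delta0) >= 10 * tc)
        return 0;

    delta0 = clip3(delta0, -tc, tc);
    p[0] = clip3(p0 + delta0, 0, pixel_max);
    q[0] = clip3(q0 - delta0, 0, pixel_max);

    if (update_p1) {
        int deltap1 = clip3((((p2 + p0 + 1) >> 1) - p1 + delta0) >> 1, -tc_2, tc_2);
        p[1] = clip3(p1 + deltap1, 0, pixel_max);
    }
    if (update_q1) {
        int deltaq1 = clip3((((q2 + q0 + 1) >> 1) - q1 - delta0) >> 1, -tc_2, tc_2);
        q[1] = clip3(q1 + deltaq1, 0, pixel_max);
    }
    return 1;
}
```

With p0..p2 = 100, q0..q2 = 120 and tc = 4 at 8 bits, delta0 is 8 before clipping and 4 after, so the sketch moves p0 to 104 and q0 to 116, and p1/q1 by 2 each. A step of 150 with tc = 2 fails the 10 * tc check and is left unfiltered.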