From patchwork Sat Aug 13 20:55:55 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Swinney, Jonathan" X-Patchwork-Id: 37258 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a20:3d0d:b0:8d:a68e:8a0e with SMTP id y13csp513514pzi; Sat, 13 Aug 2022 13:56:12 -0700 (PDT) X-Google-Smtp-Source: AA6agR4bkMBhixNzzibwanj4HIvkG3rXEHFWkGf6nrcl5YjiyhIu7BuZ2mmZ6qC/6olXImdpWmDn X-Received: by 2002:a17:907:8a0a:b0:730:a118:75de with SMTP id sc10-20020a1709078a0a00b00730a11875demr6504914ejc.189.1660424172138; Sat, 13 Aug 2022 13:56:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1660424172; cv=none; d=google.com; s=arc-20160816; b=JjoVc1FPna7O3gTvvNVH+nOqWyMJP2e7FUapQh3X4sVjSlEnlzUs5xfUyWg1tjHCaA BL7hk2xCpggAE5YKnod0jssSS0+rK9wI/sKBxCqqRI21G+T4zTOogLVqTcPII7+yWg1L esSH7uOqVsCdwCiFNbLlz6G8IrIR1EATGds4XWomZc85rSATAaw0NFIw0+XLC9EjjfNX MbSidicrWr2DaH5mKeLlnqXjwMGWWoQtffyOBWiKpJn9nNTuc6a8VmCuP4bc7sWAIrke TwnA1pa8s0sfpJZ9apEkDr/jbN+P9xA8IkP7jBVAp2PdxWrF834UWuL2AUPkUIe7/sFW gkxA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:content-language :accept-language:message-id:date:thread-index:thread-topic:to:from :dkim-signature:delivered-to; bh=v816ddX5bI2h2KFFkyNW1xOzFHn+V9Ap5xj3c1BvdzE=; b=0Gnn7zNdjOvMXWB7SqYZzMG/aj32WkmgtFgoxWMvqmNOe/JnNmw6f1HZ8XYNTpmXPl 49PBCNTdt5ZNveh1j4UZPuHsQ85HzT+M5axUUr8DflfWHLSOsEQWvOnif3C0fpPKuNET ymx9MzgbvdP/y+hLoPCqzEwaobweS4JGRjpB9dIUmSCshOu/8PDE5PAuacZwypYTTApX mjU19VEPwYERjsIH266Z5v9ogPzXqZlVyQX0vkUewJ5+/ofL8k/Bk806BF/5IwceCLut 9Cti8BOJcnVHiZxr8gkpuCZGZnE+6Vmr6JYXi9H8t9nHPI9c2MDaJNDVVfpwA0nMXLS4 gGaA== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@amazon.com header.s=amazon201209 header.b=A2R29emr; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id i20-20020a05640242d400b0043e8006a816si5639118edc.30.2022.08.13.13.56.11; Sat, 13 Aug 2022 13:56:12 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@amazon.com header.s=amazon201209 header.b=A2R29emr; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 1C38168B3B2; Sat, 13 Aug 2022 23:56:09 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from smtp-fw-80006.amazon.com (smtp-fw-80006.amazon.com [99.78.197.217]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 53C1E68B7DF for ; Sat, 13 Aug 2022 23:56:01 +0300 (EEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1660424166; x=1691960166; h=from:to:cc:subject:date:message-id: content-transfer-encoding:mime-version; bh=3ev2ify0F3v8llOdECR+94S9KtZty7AYhuCvnUlLgSI=; b=A2R29emrblsW7HaS9IJb68AKz88lbx3pzpQTpP0g3J6T4OGrK+qjixyY 3ijKv4eTlW3p5wi5NL8ganjnp2xEICgefvCFTLnlTh+pw7QutvEWKwKRB vp5nXVGZB2i5u8r318AScxdmY81oNS9I/30dy1nQvdvMn59maWX89EgcK Q=; X-IronPort-AV: E=Sophos;i="5.93,236,1654560000"; d="scan'208";a="118894192" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO email-inbound-relay-iad-1e-fc41acad.us-east-1.amazon.com) ([10.25.36.210]) by smtp-border-fw-80006.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Aug 2022 20:55:58 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-iad-1e-fc41acad.us-east-1.amazon.com (Postfix) with ESMTPS id F282AC0230; Sat, 13 Aug 2022 20:55:56 +0000 (UTC) Received: from EX19D007UWB001.ant.amazon.com (10.13.138.75) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.38; Sat, 13 Aug 2022 20:55:55 +0000 Received: from EX19D007UWB001.ant.amazon.com (10.13.138.75) by EX19D007UWB001.ant.amazon.com (10.13.138.75) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.12; Sat, 13 Aug 2022 20:55:55 +0000 Received: from EX19D007UWB001.ant.amazon.com ([fe80::bcaa:e18f:a569:3851]) by EX19D007UWB001.ant.amazon.com ([fe80::bcaa:e18f:a569:3851%6]) with mapi id 15.02.1118.012; Sat, 13 Aug 2022 20:55:55 +0000 From: "Swinney, Jonathan" To: "ffmpeg-devel@ffmpeg.org" Thread-Topic: [PATCH v3 1/3] checkasm: updated tests for sw_scale Thread-Index: AdivVusAsr+hqURDQGaVD4fKOSqJyg== Date: Sat, 13 Aug 2022 20:55:55 +0000 Message-ID: <859182400d774ed6a80829087b578b8e@amazon.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.43.162.134] MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH v3 1/3] checkasm: updated tests for sw_scale X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: =?utf-8?q?Martin_Storsj=C3=B6?= , Hubert Mazur Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: WYvmlerZxkur - added a test for yuv2plane1 - fixed test for yuv2planeX for aarch64 which was previously not working at all - updated the test for yuv2planeX to check exact results or approximated results Signed-off-by: Jonathan Swinney --- libswscale/x86/swscale.c | 8 +- tests/checkasm/sw_scale.c | 188 ++++++++++++++++++++++++++++++-------- 2 files changed, 154 insertions(+), 42 deletions(-) diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c index 628f12137c..32d441245d 100644 --- a/libswscale/x86/swscale.c +++ b/libswscale/x86/swscale.c @@ -534,7 +534,8 @@ switch(c->dstBpc){ \ ASSIGN_SSE_SCALE_FUNC(c->hcScale, c->hChrFilterSize, sse2, sse2); ASSIGN_VSCALEX_FUNC(c->yuv2planeX, sse2, , HAVE_ALIGNED_STACK || ARCH_X86_64); - ASSIGN_VSCALE_FUNC(c->yuv2plane1, sse2); + if (!(c->flags & SWS_ACCURATE_RND)) + ASSIGN_VSCALE_FUNC(c->yuv2plane1, sse2); switch (c->srcFormat) { case AV_PIX_FMT_YA8: @@ -583,14 +584,15 @@ switch(c->dstBpc){ \ ASSIGN_VSCALEX_FUNC(c->yuv2planeX, sse4, if (!isBE(c->dstFormat)) c->yuv2planeX = ff_yuv2planeX_16_sse4, HAVE_ALIGNED_STACK || ARCH_X86_64); - if (c->dstBpc == 16 && !isBE(c->dstFormat)) + if (c->dstBpc == 16 && !isBE(c->dstFormat) && !(c->flags & SWS_ACCURATE_RND)) c->yuv2plane1 = ff_yuv2plane1_16_sse4; } if (EXTERNAL_AVX(cpu_flags)) { ASSIGN_VSCALEX_FUNC(c->yuv2planeX, avx, , HAVE_ALIGNED_STACK || ARCH_X86_64); - ASSIGN_VSCALE_FUNC(c->yuv2plane1, avx); + if (!(c->flags & SWS_ACCURATE_RND)) + ASSIGN_VSCALE_FUNC(c->yuv2plane1, avx); switch (c->srcFormat) { case AV_PIX_FMT_YUYV422: diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c index b643a47c30..859993db6f 100644 --- a/tests/checkasm/sw_scale.c +++ b/tests/checkasm/sw_scale.c @@ -35,40 +35,140 @@ AV_WN32(buf + j, rnd()); \ } while (0) -// This reference function is the same approximate algorithm employed by the -// SIMD functions -static void ref_function(const int16_t *filter, int filterSize, - const int16_t **src, uint8_t *dest, int dstW, - const uint8_t *dither, int offset) +static void yuv2planeX_8_ref(const int16_t *filter, int filterSize, + const int16_t **src, uint8_t *dest, int dstW, + const uint8_t *dither, int offset) { - int i, d; - d = ((filterSize - 1) * 8 + dither[0]) >> 4; - for ( i = 0; i < dstW; i++) { - int16_t val = d; + // This corresponds to the yuv2planeX_8_c function + int i; + for (i = 0; i < dstW; i++) { + int val = dither[(i + offset) & 7] << 12; int j; - union { - int val; - int16_t v[2]; - } t; - for (j = 0; j < filterSize; j++){ - t.val = (int)src[j][i + offset] * (int)filter[j]; - val += t.v[1]; + for (j = 0; j < filterSize; j++) + val += src[j][i] * filter[j]; + + dest[i]= av_clip_uint8(val >> 19); + } +} + +static int cmp_off_by_n(const uint8_t *ref, const uint8_t *test, size_t n, int accuracy) +{ + for (size_t i = 0; i < n; i++) { + if (abs(ref[i] - test[i]) > accuracy) + return 1; + } + return 0; +} + +static void print_data(uint8_t *p, size_t len, size_t offset) +{ + size_t i = 0; + for (; i < len; i++) { + if (i % 8 == 0) { + printf("0x%04zx: ", i+offset); + } + printf("0x%02x ", (uint32_t) p[i]); + if (i % 8 == 7) { + printf("\n"); } - dest[i]= av_clip_uint8(val>>3); } + if (i % 8 != 0) { + printf("\n"); + } +} + +static size_t show_differences(uint8_t *a, uint8_t *b, size_t len) +{ + for (size_t i = 0; i < len; i++) { + if (a[i] != b[i]) { + size_t offset_of_mismatch = i; + size_t offset; + if (i >= 8) i-=8; + offset = i & (~7); + printf("test a:\n"); + print_data(&a[offset], 32, offset); + printf("\ntest b:\n"); + print_data(&b[offset], 32, offset); + printf("\n"); + return offset_of_mismatch; + } + } + return len; } -static void check_yuv2yuvX(void) +static void check_yuv2yuv1(int accurate) +{ + struct SwsContext *ctx; + int osi, isi; + int dstW, offset; + size_t fail_offset; + const int input_sizes[] = {8, 24, 128, 144, 256, 512}; + const int INPUT_SIZES = sizeof(input_sizes)/sizeof(input_sizes[0]); + #define LARGEST_INPUT_SIZE 512 + + const int offsets[] = {0, 3, 8, 11, 16, 19}; + const int OFFSET_SIZES = sizeof(offsets)/sizeof(offsets[0]); + const char *accurate_str = (accurate) ? "accurate" : "approximate"; + + declare_func_emms(AV_CPU_FLAG_MMX, void, + const int16_t *src, uint8_t *dest, + int dstW, const uint8_t *dither, int offset); + + LOCAL_ALIGNED_8(int16_t, src_pixels, [LARGEST_INPUT_SIZE]); + LOCAL_ALIGNED_8(uint8_t, dst0, [LARGEST_INPUT_SIZE]); + LOCAL_ALIGNED_8(uint8_t, dst1, [LARGEST_INPUT_SIZE]); + LOCAL_ALIGNED_8(uint8_t, dither, [8]); + + randomize_buffers((uint8_t*)dither, 8); + randomize_buffers((uint8_t*)src_pixels, LARGEST_INPUT_SIZE * sizeof(int16_t)); + ctx = sws_alloc_context(); + if (accurate) + ctx->flags |= SWS_ACCURATE_RND; + if (sws_init_context(ctx, NULL, NULL) < 0) + fail(); + + ff_sws_init_scale(ctx); + for (isi = 0; isi < INPUT_SIZES; ++isi) { + dstW = input_sizes[isi]; + for (osi = 0; osi < OFFSET_SIZES; osi++) { + offset = offsets[osi]; + if (check_func(ctx->yuv2plane1, "yuv2yuv1_%d_%d_%s", offset, dstW, accurate_str)){ + memset(dst0, 0, LARGEST_INPUT_SIZE * sizeof(dst0[0])); + memset(dst1, 0, LARGEST_INPUT_SIZE * sizeof(dst1[0])); + + call_ref(src_pixels, dst0, dstW, dither, offset); + call_new(src_pixels, dst1, dstW, dither, offset); + if (cmp_off_by_n(dst0, dst1, dstW * sizeof(dst0[0]), accurate ? 0 : 2)) { + fail(); + printf("failed: yuv2yuv1_%d_%di_%s\n", offset, dstW, accurate_str); + fail_offset = show_differences(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0])); + printf("failing values: src: 0x%04x dither: 0x%02x dst-c: %02x dst-asm: %02x\n", + (int) src_pixels[fail_offset], + (int) dither[(fail_offset + fail_offset) & 7], + (int) dst0[fail_offset], + (int) dst1[fail_offset]); + } + if(dstW == LARGEST_INPUT_SIZE) + bench_new(src_pixels, dst1, dstW, dither, offset); + } + } + } + sws_freeContext(ctx); +} + +static void check_yuv2yuvX(int accurate) { struct SwsContext *ctx; int fsi, osi, isi, i, j; int dstW; #define LARGEST_FILTER 16 -#define FILTER_SIZES 4 - static const int filter_sizes[FILTER_SIZES] = {1, 4, 8, 16}; + // ff_yuv2planeX_8_sse2 can't handle odd filter sizes + const int filter_sizes[] = {2, 4, 8, 16}; + const int FILTER_SIZES = sizeof(filter_sizes)/sizeof(filter_sizes[0]); #define LARGEST_INPUT_SIZE 512 -#define INPUT_SIZES 6 - static const int input_sizes[INPUT_SIZES] = {8, 24, 128, 144, 256, 512}; + static const int input_sizes[] = {8, 24, 128, 144, 256, 512}; + const int INPUT_SIZES = sizeof(input_sizes)/sizeof(input_sizes[0]); + const char *accurate_str = (accurate) ? "accurate" : "approximate"; declare_func_emms(AV_CPU_FLAG_MMX, void, const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, @@ -89,6 +189,8 @@ static void check_yuv2yuvX(void) randomize_buffers((uint8_t*)src_pixels, LARGEST_FILTER * LARGEST_INPUT_SIZE * sizeof(int16_t)); randomize_buffers((uint8_t*)filter_coeff, LARGEST_FILTER * sizeof(int16_t)); ctx = sws_alloc_context(); + if (accurate) + ctx->flags |= SWS_ACCURATE_RND; if (sws_init_context(ctx, NULL, NULL) < 0) fail(); @@ -96,33 +198,37 @@ static void check_yuv2yuvX(void) for(isi = 0; isi < INPUT_SIZES; ++isi){ dstW = input_sizes[isi]; for(osi = 0; osi < 64; osi += 16){ - for(fsi = 0; fsi < FILTER_SIZES; ++fsi){ + if (dstW <= osi) + continue; + for (fsi = 0; fsi < FILTER_SIZES; ++fsi) { src = av_malloc(sizeof(int16_t*) * filter_sizes[fsi]); vFilterData = av_malloc((filter_sizes[fsi] + 2) * sizeof(union VFilterData)); memset(vFilterData, 0, (filter_sizes[fsi] + 2) * sizeof(union VFilterData)); - for(i = 0; i < filter_sizes[fsi]; ++i){ + for (i = 0; i < filter_sizes[fsi]; ++i) { src[i] = &src_pixels[i * LARGEST_INPUT_SIZE]; - vFilterData[i].src = src[i]; + vFilterData[i].src = src[i] - osi; for(j = 0; j < 4; ++j) vFilterData[i].coeff[j + 4] = filter_coeff[i]; } - if (check_func(ctx->yuv2planeX, "yuv2yuvX_%d_%d_%d", filter_sizes[fsi], osi, dstW)){ + if (check_func(ctx->yuv2planeX, "yuv2yuvX_%d_%d_%d_%s", filter_sizes[fsi], osi, dstW, accurate_str)){ + // use vFilterData for the mmx function + const int16_t *filter = ctx->use_mmx_vfilter ? (const int16_t*)vFilterData : &filter_coeff[0]; memset(dst0, 0, LARGEST_INPUT_SIZE * sizeof(dst0[0])); memset(dst1, 0, LARGEST_INPUT_SIZE * sizeof(dst1[0])); - // The reference function is not the scalar function selected when mmx - // is deactivated as the SIMD functions do not give the same result as - // the scalar ones due to rounding. The SIMD functions are activated by - // the flag SWS_ACCURATE_RND - ref_function(&filter_coeff[0], filter_sizes[fsi], src, dst0, dstW - osi, dither, osi); - // There's no point in calling new for the reference function - if(ctx->use_mmx_vfilter){ - call_new((const int16_t*)vFilterData, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi); - if (memcmp(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]))) - fail(); - if(dstW == LARGEST_INPUT_SIZE) - bench_new((const int16_t*)vFilterData, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi); + // We can't use call_ref here, because we don't know if use_mmx_vfilter was set for that + // function or not, so we can't pass it the parameters correctly. + yuv2planeX_8_ref(&filter_coeff[0], filter_sizes[fsi], src, dst0, dstW - osi, dither, osi); + + call_new(filter, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi); + if (cmp_off_by_n(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]), accurate ? 0 : 2)) { + fail(); + printf("failed: yuv2yuvX_%d_%d_%d_%s\n", filter_sizes[fsi], osi, dstW, accurate_str); + show_differences(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0])); } + if(dstW == LARGEST_INPUT_SIZE) + bench_new((const int16_t*)vFilterData, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi); + } av_freep(&src); av_freep(&vFilterData); @@ -245,6 +351,10 @@ void checkasm_check_sw_scale(void) { check_hscale(); report("hscale"); - check_yuv2yuvX(); + check_yuv2yuv1(0); + check_yuv2yuv1(1); + report("yuv2yuv1"); + check_yuv2yuvX(0); + check_yuv2yuvX(1); report("yuv2yuvX"); } From patchwork Sat Aug 13 20:56:02 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Swinney, Jonathan" X-Patchwork-Id: 37259 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a20:3d0d:b0:8d:a68e:8a0e with SMTP id y13csp513552pzi; Sat, 13 Aug 2022 13:56:20 -0700 (PDT) X-Google-Smtp-Source: AA6agR7LIC1+aXHwPCSG95UKP6DvvIPxm58NQUH+bYBFY2EXzNUn5iz8qLL9qIdVqmdSaS6KLt8j X-Received: by 2002:a05:6402:287:b0:43c:c604:addb with SMTP id l7-20020a056402028700b0043cc604addbmr8584260edv.201.1660424180204; Sat, 13 Aug 2022 13:56:20 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1660424180; cv=none; d=google.com; s=arc-20160816; b=LrKTXvvTASkjyZfVaLrIhANlEK2MIqNkh2oSO9QPL0HYRjhsRzI2NgBClMzXxZGV4/ SDInoT8CjnoWpBoolPlZsAV5vwNyOACyT9ZXiaZ6H9o4i9fEJLmkUe+XJEkqHZMc6hIa NOFiUF3IlieeAFEPC28x14llVkzaicYmSD501oOj0OWrUZn0zjP5gVbx9ccrD/+rBhBL fePdgfkIFNuq4dfDmLQYZBf0811C6j9RqPFH2XnYoHM7NeB41EVFLhMODvYH/WEQXOqK Nrnoilmro+CDLG7DFpxqdN9rB+xsd5hr+jrW/0u/KPdkk+VEWvOPxX5RWB3g2LomEIOB qc8w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:content-language :accept-language:message-id:date:thread-index:thread-topic:to:from :dkim-signature:delivered-to; bh=XuNxLfetY0aghed8GPAE5ijOfWueNGZa+hcddXiDMUw=; b=n/mLzlCipY1/5VlURLmWcLfGXpkRnRY6qJV7DaxFAVTnYRRHBhKVmHeYBca5CTiLNV 7ZeWQyeG5AWxuO/B2nKNGyGq38KdxnuXYgPIFqE4TEr8/8uGCNYV/DzpUAiEXi+kEBLx kB9wiutg9U3kHREsaFgSZNBIzNUTPMlqRX/BSB/1xiKh0OESFVRxOdH4NgqP47VHs6Ge OtOmNlonWejT/AslviUay2QYO8vMKmtDK1ggWt+CEKGS52Un4yZBBn7uIibhLxpbZ6jC yIyK+5c+r4RCMwUyJD4pinUEvUUZ1JI6hGh5RIsvAZsPr0ebQfGzAGGp8tbIuuzzjygq bD2g== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@amazon.com header.s=amazon201209 header.b=tW7dkFOh; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id q17-20020a50cc91000000b0043e438a48e1si4625315edi.178.2022.08.13.13.56.19; Sat, 13 Aug 2022 13:56:20 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@amazon.com header.s=amazon201209 header.b=tW7dkFOh; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 193D468B8CC; Sat, 13 Aug 2022 23:56:13 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from smtp-fw-6001.amazon.com (smtp-fw-6001.amazon.com [52.95.48.154]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 6F71C68B20B for ; Sat, 13 Aug 2022 23:56:06 +0300 (EEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1660424172; x=1691960172; h=from:to:cc:subject:date:message-id: content-transfer-encoding:mime-version; bh=d2veYPVwfvfydEjiXV0RiCdjVX6kr4xg7SfSuR9ylAg=; b=tW7dkFOheLdM5OeVXTGvT6fbk2hMzy91Rl4Cv8O5ioqk2CaYalQEcicm vO2jawWe6faisVVwXv8+xlydBDHz4aCW0wRwPH8jdZbLqOpz9YsReP/Rq jk7Nle4pxWzyC5Vi97zrogiWK8RtFhuuPGOIW0SKgrJaIue3jGtK/xVxT s=; Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-iad-1a-b27d4a00.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-6001.iad6.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Aug 2022 20:56:05 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-iad-1a-b27d4a00.us-east-1.amazon.com (Postfix) with ESMTPS id 6C5F8803B7; Sat, 13 Aug 2022 20:56:04 +0000 (UTC) Received: from EX19D007UWB001.ant.amazon.com (10.13.138.75) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.38; Sat, 13 Aug 2022 20:56:02 +0000 Received: from EX19D007UWB001.ant.amazon.com (10.13.138.75) by EX19D007UWB001.ant.amazon.com (10.13.138.75) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.12; Sat, 13 Aug 2022 20:56:02 +0000 Received: from EX19D007UWB001.ant.amazon.com ([fe80::bcaa:e18f:a569:3851]) by EX19D007UWB001.ant.amazon.com ([fe80::bcaa:e18f:a569:3851%6]) with mapi id 15.02.1118.012; Sat, 13 Aug 2022 20:56:02 +0000 From: "Swinney, Jonathan" To: "ffmpeg-devel@ffmpeg.org" Thread-Topic: [PATCH v3 2/3] swscale/aarch64: vscale optimization Thread-Index: AdivVwIoGM530APiQviyMJ/WNteqHQ== Date: Sat, 13 Aug 2022 20:56:02 +0000 Message-ID: <17ac9282aef74f869ec9895f0d17f17e@amazon.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.43.162.134] MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH v3 2/3] swscale/aarch64: vscale optimization X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: =?utf-8?q?Martin_Storsj=C3=B6?= , Hubert Mazur Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: U4pqH0biRAvE Use scalar times vector multiply accumlate instructions instead of vector times vector to remove the need for replicating load instructions which are slightly slower. On AWS c7g (Graviton 3, Neoverse V1) instances: yuv2yuvX_8_0_512_accurate_neon: 1144.8 987.4 yuv2yuvX_16_0_512_accurate_neon: 2080.5 1869.4 Signed-off-by: Jonathan Swinney --- libswscale/aarch64/output.S | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S index af71de6050..991750cf31 100644 --- a/libswscale/aarch64/output.S +++ b/libswscale/aarch64/output.S @@ -34,16 +34,15 @@ function ff_yuv2planeX_8_neon, export=1 mov x9, x2 // srcp = src mov x10, x0 // filterp = filter 3: ldp x11, x12, [x9], #16 // get 2 pointers: src[j] and src[j+1] + ldr s7, [x10], #4 // read 2x16-bit coeff X and Y at filter[j] and filter[j+1] add x11, x11, x7, lsl #1 // &src[j ][i] add x12, x12, x7, lsl #1 // &src[j+1][i] ld1 {v5.8H}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H ld1 {v6.8H}, [x12] // read 8x16-bit @ src[j+1][i + {0..7}]: I,J,K,L,M,N,O,P - ld1r {v7.8H}, [x10], #2 // read 1x16-bit coeff X at filter[j ] and duplicate across lanes - ld1r {v16.8H}, [x10], #2 // read 1x16-bit coeff Y at filter[j+1] and duplicate across lanes - smlal v3.4S, v5.4H, v7.4H // val0 += {A,B,C,D} * X - smlal2 v4.4S, v5.8H, v7.8H // val1 += {E,F,G,H} * X - smlal v3.4S, v6.4H, v16.4H // val0 += {I,J,K,L} * Y - smlal2 v4.4S, v6.8H, v16.8H // val1 += {M,N,O,P} * Y + smlal v3.4S, v5.4H, v7.H[0] // val0 += {A,B,C,D} * X + smlal2 v4.4S, v5.8H, v7.H[0] // val1 += {E,F,G,H} * X + smlal v3.4S, v6.4H, v7.H[1] // val0 += {I,J,K,L} * Y + smlal2 v4.4S, v6.8H, v7.H[1] // val1 += {M,N,O,P} * Y subs w8, w8, #2 // tmpfilterSize -= 2 b.gt 3b // loop until filterSize consumed From patchwork Sat Aug 13 20:56:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Swinney, Jonathan" X-Patchwork-Id: 37260 Delivered-To: ffmpegpatchwork2@gmail.com Received: by 2002:a05:6a20:3d0d:b0:8d:a68e:8a0e with SMTP id y13csp513593pzi; Sat, 13 Aug 2022 13:56:29 -0700 (PDT) X-Google-Smtp-Source: AA6agR4TvqDKfT/GJf/bu/C5YLt82nMlZoQOUZpQOCGPkXn90Xyz9c8tciaCVBLf8kc73Id8UX2B X-Received: by 2002:a05:6402:28cb:b0:43b:c6d7:ef92 with SMTP id ef11-20020a05640228cb00b0043bc6d7ef92mr8742524edb.333.1660424188990; Sat, 13 Aug 2022 13:56:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1660424188; cv=none; d=google.com; s=arc-20160816; b=kbn2r6WvfA+COQU92IhI1g3OuEDzeXFaf0x02NSaMlfD97gpRcoHv8Eo3p2uHfc1zc jvhZ717yRq5KV+8Ubc4dg/t2fC8MST5HQuxTmPi7tIGYKDhoxcxTcdd7Y7HdW+XI9D9j UwP8bEo82ItpXBQOld5V4J/Qoo5VbjLS2ct6Sx7wL+J+CWqpgVLx4KAdNVAkeLuGd2je OwLa21QAoJiN2gGHNQX/7eyw2DRcY5MeZjbR67/iwaZYD3PML7srNCwk0caeHowmzK6a IDwF2CN96yZPVq0G4mUR5wQsig5aMVyQUx/Jn0dbC3LJbD6rLozzZbeBJFRt7IspgRsG PmcQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:content-transfer-encoding:cc:reply-to :list-subscribe:list-help:list-post:list-archive:list-unsubscribe :list-id:precedence:subject:mime-version:content-language :accept-language:message-id:date:thread-index:thread-topic:to:from :dkim-signature:delivered-to; bh=ScujgIZXTLJ/aw1y/QsnH6yBkAIObOadIbfviyhrtnM=; b=syhDi6/iEBbxdE/cf2nech4MwPZ7pjJWpdVj9oBCY7Q3HKVeIZsTTPYlt5htPPr5GN s2Ox0GO5XJ8NjD59HPx2Eap8FH+cbgw4p/fb4o6NSLaPMQLuoyIadZOGCRe95ddf1ZiU N79sBJt89AWjAYrr89FZd6P+epdYIQ5hNh+ePFJ+jgGb2g8wnFTI9oGBnekH0lSfy6H8 UhEV6yKlEg6F1dRTkid45Yw/MqholOrQiCoHiSJKJwM/+m6DYyWcFcxFKZUIDRfyVjbR T6IBzdACHOJHdS09PZDqeJhpMiZX31fyFI2mWFD1jCvtQ3mfCkDsFeMZUompBEkQX8nr A7aw== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@amazon.com header.s=amazon201209 header.b="fW/njcPN"; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100]) by mx.google.com with ESMTP id u1-20020a170906068100b0072eda634546si3666050ejb.560.2022.08.13.13.56.28; Sat, 13 Aug 2022 13:56:28 -0700 (PDT) Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@amazon.com header.s=amazon201209 header.b="fW/njcPN"; spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id F3DBA68B8E5; Sat, 13 Aug 2022 23:56:20 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from smtp-fw-9103.amazon.com (smtp-fw-9103.amazon.com [207.171.188.200]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E1D6B68B419 for ; Sat, 13 Aug 2022 23:56:13 +0300 (EEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1660424179; x=1691960179; h=from:to:cc:subject:date:message-id: content-transfer-encoding:mime-version; bh=4CZV234rAWqFk7WOGRyqAOhHKdEpQfUX1WSNmNxYWKs=; b=fW/njcPNRoR8uS0hZMe972jRUXb3F4JWX2AAVwJttOR2oSFUfaH6Kdk1 sSj/uZT9hkU9/gI7G9DXlErZSJ0K9W8Tb/EgZYWm39XwklRXjjaz2ddIG 7ASW90E9sSJGKzvYY7fdCNaDdxDctekJlAbSUJysjKPnWWwO1ZkJ4D+yk g=; X-IronPort-AV: E=Sophos;i="5.93,236,1654560000"; d="scan'208";a="1044016123" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO email-inbound-relay-iad-1e-b69ea591.us-east-1.amazon.com) ([10.25.36.214]) by smtp-border-fw-9103.sea19.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Aug 2022 20:56:11 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-iad-1e-b69ea591.us-east-1.amazon.com (Postfix) with ESMTPS id 40DC0C0425; Sat, 13 Aug 2022 20:56:10 +0000 (UTC) Received: from EX19D007UWB003.ant.amazon.com (10.13.138.28) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.38; Sat, 13 Aug 2022 20:56:06 +0000 Received: from EX19D007UWB001.ant.amazon.com (10.13.138.75) by EX19D007UWB003.ant.amazon.com (10.13.138.28) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.12; Sat, 13 Aug 2022 20:56:06 +0000 Received: from EX19D007UWB001.ant.amazon.com ([fe80::bcaa:e18f:a569:3851]) by EX19D007UWB001.ant.amazon.com ([fe80::bcaa:e18f:a569:3851%6]) with mapi id 15.02.1118.012; Sat, 13 Aug 2022 20:56:06 +0000 From: "Swinney, Jonathan" To: "ffmpeg-devel@ffmpeg.org" Thread-Topic: [PATCH v3 3/3] swscale/aarch64: add vscale specializations Thread-Index: AdivVxEmf6fLDA28TLO6ywGfvD7wdQ== Date: Sat, 13 Aug 2022 20:56:06 +0000 Message-ID: <70629b7632564b30a44c71bf6a903b26@amazon.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.43.162.134] MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH v3 3/3] swscale/aarch64: add vscale specializations X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: =?utf-8?q?Martin_Storsj=C3=B6?= , Hubert Mazur Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" X-TUID: 8NWbUFa+vR15 This commit adds new code paths for vscale when filterSize is 2, 4, or 8. By using specialized code with unrolling to match the filterSize we can improve performance. On AWS c7g (Graviton 3, Neoverse V1) instances: before after yuv2yuvX_2_0_512_accurate_neon: 558.8 268.9 yuv2yuvX_4_0_512_accurate_neon: 637.5 434.9 yuv2yuvX_8_0_512_accurate_neon: 1144.8 806.2 yuv2yuvX_16_0_512_accurate_neon: 2080.5 1853.7 Signed-off-by: Jonathan Swinney --- libswscale/aarch64/output.S | 177 +++++++++++++++++++++++++++++++++++ libswscale/aarch64/swscale.c | 12 +++ 2 files changed, 189 insertions(+) diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S index 991750cf31..b8a2818c9b 100644 --- a/libswscale/aarch64/output.S +++ b/libswscale/aarch64/output.S @@ -21,13 +21,33 @@ #include "libavutil/aarch64/asm.S" function ff_yuv2planeX_8_neon, export=1 +// x0 - const int16_t *filter, +// x1 - int filterSize, +// x2 - const int16_t **src, +// x3 - uint8_t *dest, +// w4 - int dstW, +// x5 - const uint8_t *dither, +// w6 - int offset + ld1 {v0.8B}, [x5] // load 8x8-bit dither + and w6, w6, #7 cbz w6, 1f // check if offsetting present ext v0.8B, v0.8B, v0.8B, #3 // honor offsetting which can be 0 or 3 only 1: uxtl v0.8H, v0.8B // extend dither to 16-bit ushll v1.4S, v0.4H, #12 // extend dither to 32-bit with left shift by 12 (part 1) ushll2 v2.4S, v0.8H, #12 // extend dither to 32-bit with left shift by 12 (part 2) + cmp w1, #8 // if filterSize == 8, branch to specialized version + b.eq 6f + cmp w1, #4 // if filterSize == 4, branch to specialized version + b.eq 8f + cmp w1, #2 // if filterSize == 2, branch to specialized version + b.eq 10f + +// The filter size does not match of the of specialized implementations. It is either even or odd. If it is even +// then use the first section below. mov x7, #0 // i = 0 + tbnz w1, #0, 4f // if filterSize % 2 != 0 branch to specialized version +// fs % 2 == 0 2: mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value mov w8, w1 // tmpfilterSize = filterSize @@ -54,4 +74,161 @@ function ff_yuv2planeX_8_neon, export=1 add x7, x7, #8 // i += 8 b.gt 2b // loop until width consumed ret + +// If filter size is odd (most likely == 1), then use this section. +// fs % 2 != 0 +4: mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value + mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value + mov w8, w1 // tmpfilterSize = filterSize + mov x9, x2 // srcp = src + mov x10, x0 // filterp = filter +5: ldr x11, [x9], #8 // get 1 pointer: src[j] + ldr h6, [x10], #2 // read 1 16 bit coeff X at filter[j] + add x11, x11, x7, lsl #1 // &src[j ][i] + ld1 {v5.8H}, [x11] // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H + smlal v3.4S, v5.4H, v6.H[0] // val0 += {A,B,C,D} * X + smlal2 v4.4S, v5.8H, v6.H[0] // val1 += {E,F,G,H} * X + subs w8, w8, #1 // tmpfilterSize -= 2 + b.gt 5b // loop until filterSize consumed + + sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) + sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) + uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) + st1 {v3.8b}, [x3], #8 // write to destination + subs w4, w4, #8 // dstW -= 8 + add x7, x7, #8 // i += 8 + b.gt 4b // loop until width consumed + ret + +6: // fs=8 + ldp x5, x6, [x2] // load 2 pointers: src[j ] and src[j+1] + ldp x7, x9, [x2, #16] // load 2 pointers: src[j+2] and src[j+3] + ldp x10, x11, [x2, #32] // load 2 pointers: src[j+4] and src[j+5] + ldp x12, x13, [x2, #48] // load 2 pointers: src[j+6] and src[j+7] + + // load 8x16-bit values for filter[j], where j=0..7 + ld1 {v6.8H}, [x0] +7: + mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value + mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value + + ld1 {v24.8H}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] + ld1 {v25.8H}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] + ld1 {v26.8H}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] + ld1 {v27.8H}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] + ld1 {v28.8H}, [x10], #16 // load 8x16-bit values for src[j + 4][i + {0..7}] + ld1 {v29.8H}, [x11], #16 // load 8x16-bit values for src[j + 5][i + {0..7}] + ld1 {v30.8H}, [x12], #16 // load 8x16-bit values for src[j + 6][i + {0..7}] + ld1 {v31.8H}, [x13], #16 // load 8x16-bit values for src[j + 7][i + {0..7}] + + smlal v3.4S, v24.4H, v6.H[0] // val0 += src[0][i + {0..3}] * filter[0] + smlal2 v4.4S, v24.8H, v6.H[0] // val1 += src[0][i + {4..7}] * filter[0] + smlal v3.4S, v25.4H, v6.H[1] // val0 += src[1][i + {0..3}] * filter[1] + smlal2 v4.4S, v25.8H, v6.H[1] // val1 += src[1][i + {4..7}] * filter[1] + smlal v3.4S, v26.4H, v6.H[2] // val0 += src[2][i + {0..3}] * filter[2] + smlal2 v4.4S, v26.8H, v6.H[2] // val1 += src[2][i + {4..7}] * filter[2] + smlal v3.4S, v27.4H, v6.H[3] // val0 += src[3][i + {0..3}] * filter[3] + smlal2 v4.4S, v27.8H, v6.H[3] // val1 += src[3][i + {4..7}] * filter[3] + smlal v3.4S, v28.4H, v6.H[4] // val0 += src[4][i + {0..3}] * filter[4] + smlal2 v4.4S, v28.8H, v6.H[4] // val1 += src[4][i + {4..7}] * filter[4] + smlal v3.4S, v29.4H, v6.H[5] // val0 += src[5][i + {0..3}] * filter[5] + smlal2 v4.4S, v29.8H, v6.H[5] // val1 += src[5][i + {4..7}] * filter[5] + smlal v3.4S, v30.4H, v6.H[6] // val0 += src[6][i + {0..3}] * filter[6] + smlal2 v4.4S, v30.8H, v6.H[6] // val1 += src[6][i + {4..7}] * filter[6] + smlal v3.4S, v31.4H, v6.H[7] // val0 += src[7][i + {0..3}] * filter[7] + smlal2 v4.4S, v31.8H, v6.H[7] // val1 += src[7][i + {4..7}] * filter[7] + + sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) + sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) + uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) + subs w4, w4, #8 // dstW -= 8 + st1 {v3.8b}, [x3], #8 // write to destination + b.gt 7b // loop until width consumed + ret + +8: // fs=4 + ldp x5, x6, [x2] // load 2 pointers: src[j ] and src[j+1] + ldp x7, x9, [x2, #16] // load 2 pointers: src[j+2] and src[j+3] + + // load 4x16-bit values for filter[j], where j=0..3 and replicated across lanes + ld1 {v6.4H}, [x0] +9: + mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value + mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value + + ld1 {v24.8H}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] + ld1 {v25.8H}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] + ld1 {v26.8H}, [x7], #16 // load 8x16-bit values for src[j + 2][i + {0..7}] + ld1 {v27.8H}, [x9], #16 // load 8x16-bit values for src[j + 3][i + {0..7}] + + smlal v3.4S, v24.4H, v6.H[0] // val0 += src[0][i + {0..3}] * filter[0] + smlal2 v4.4S, v24.8H, v6.H[0] // val1 += src[0][i + {4..7}] * filter[0] + smlal v3.4S, v25.4H, v6.H[1] // val0 += src[1][i + {0..3}] * filter[1] + smlal2 v4.4S, v25.8H, v6.H[1] // val1 += src[1][i + {4..7}] * filter[1] + smlal v3.4S, v26.4H, v6.H[2] // val0 += src[2][i + {0..3}] * filter[2] + smlal2 v4.4S, v26.8H, v6.H[2] // val1 += src[2][i + {4..7}] * filter[2] + smlal v3.4S, v27.4H, v6.H[3] // val0 += src[3][i + {0..3}] * filter[3] + smlal2 v4.4S, v27.8H, v6.H[3] // val1 += src[3][i + {4..7}] * filter[3] + + sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) + sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) + uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) + st1 {v3.8b}, [x3], #8 // write to destination + subs w4, w4, #8 // dstW -= 8 + b.gt 9b // loop until width consumed + ret + +10: // fs=2 + ldp x5, x6, [x2] // load 2 pointers: src[j ] and src[j+1] + + // load 2x16-bit values for filter[j], where j=0..1 and replicated across lanes + ldr s6, [x0] +11: + mov v3.16B, v1.16B // initialize accumulator part 1 with dithering value + mov v4.16B, v2.16B // initialize accumulator part 2 with dithering value + + ld1 {v24.8H}, [x5], #16 // load 8x16-bit values for src[j + 0][i + {0..7}] + ld1 {v25.8H}, [x6], #16 // load 8x16-bit values for src[j + 1][i + {0..7}] + + smlal v3.4S, v24.4H, v6.H[0] // val0 += src[0][i + {0..3}] * filter[0] + smlal2 v4.4S, v24.8H, v6.H[0] // val1 += src[0][i + {4..7}] * filter[0] + smlal v3.4S, v25.4H, v6.H[1] // val0 += src[1][i + {0..3}] * filter[1] + smlal2 v4.4S, v25.8H, v6.H[1] // val1 += src[1][i + {4..7}] * filter[1] + + sqshrun v3.4h, v3.4s, #16 // clip16(val0>>16) + sqshrun2 v3.8h, v4.4s, #16 // clip16(val1>>16) + uqshrn v3.8b, v3.8h, #3 // clip8(val>>19) + st1 {v3.8b}, [x3], #8 // write to destination + subs w4, w4, #8 // dstW -= 8 + b.gt 11b // loop until width consumed + ret +endfunc + +function ff_yuv2plane1_8_neon, export=1 +// x0 - const int16_t *src, +// x1 - uint8_t *dest, +// w2 - int dstW, +// x3 - const uint8_t *dither, +// w4 - int offset + ld1 {v0.8B}, [x3] // load 8x8-bit dither + and w4, w4, #7 + cbz w4, 1f // check if offsetting present + ext v0.8B, v0.8B, v0.8B, #3 // honor offsetting which can be 0 or 3 only +1: uxtl v0.8H, v0.8B // extend dither to 32-bit + uxtl v1.4s, v0.4h + uxtl2 v2.4s, v0.8h +2: + ld1 {v3.8h}, [x0], #16 // read 8x16-bit @ src[j ][i + {0..7}]: A,B,C,D,E,F,G,H + sxtl v4.4s, v3.4h + sxtl2 v5.4s, v3.8h + add v4.4s, v4.4s, v1.4s + add v5.4s, v5.4s, v2.4s + sqshrun v4.4h, v4.4s, #6 + sqshrun2 v4.8h, v5.4s, #6 + + uqshrn v3.8b, v4.8h, #1 // clip8(val>>7) + subs w2, w2, #8 // dstW -= 8 + st1 {v3.8b}, [x1], #8 // write to destination + b.gt 2b // loop until width consumed + ret endfunc diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c index ab28be4da6..321d1f844e 100644 --- a/libswscale/aarch64/swscale.c +++ b/libswscale/aarch64/swscale.c @@ -39,6 +39,12 @@ ALL_SCALE_FUNCS(neon); void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, int dstW, const uint8_t *dither, int offset); +void ff_yuv2plane1_8_neon( + const int16_t *src, + uint8_t *dest, + int dstW, + const uint8_t *dither, + int offset); #define ASSIGN_SCALE_FUNC2(hscalefn, filtersize, opt) do { \ if (c->srcBpc == 8 && c->dstBpc <= 14) { \ @@ -54,6 +60,11 @@ void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize, ASSIGN_SCALE_FUNC2(hscalefn, X8, opt); \ break; \ } +#define ASSIGN_VSCALE_FUNC(vscalefn, opt) \ + switch (c->dstBpc) { \ + case 8: vscalefn = ff_yuv2plane1_8_ ## opt; break; \ + default: break; \ + } av_cold void ff_sws_init_swscale_aarch64(SwsContext *c) { @@ -62,6 +73,7 @@ av_cold void ff_sws_init_swscale_aarch64(SwsContext *c) if (have_neon(cpu_flags)) { ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon); ASSIGN_SCALE_FUNC(c->hcScale, c->hChrFilterSize, neon); + ASSIGN_VSCALE_FUNC(c->yuv2plane1, neon); if (c->dstBpc == 8) { c->yuv2planeX = ff_yuv2planeX_8_neon; }