From patchwork Mon Jun 14 11:14:06 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Kelly
X-Patchwork-Id: 28264
Date: Mon, 14 Jun 2021 13:14:06 +0200
Message-Id: <20210614111407.1897690-1-alankelly@google.com>
X-Mailer: git-send-email 2.32.0.272.g935e593368-goog
From: Alan Kelly
To: ffmpeg-devel@ffmpeg.org
Subject: [FFmpeg-devel] [PATCH 1/2] libavutil/cpu: Adds av_cpu_has_fast_gather to detect cpus with avx fast gather instruction
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: Alan Kelly

Broadwell and later have fast gather instructions.
---
This is so that the avx2 version of ff_hscale8to15X which uses gather
instructions is only selected on machines where it will actually be faster.
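As a rough illustration of the model test behind av_cpu_has_fast_gather(), the sketch below decodes the display family and model from the EAX value returned by CPUID leaf 1, using the same shifts and masks as the patch. CpuSignature, decode_signature() and passes_fast_gather_check() are hypothetical names used only for this example; they are not part of FFmpeg.

```c
#include <stdio.h>

/* Hypothetical helper type for this example only (not an FFmpeg type). */
typedef struct CpuSignature {
    int family;
    int model;
} CpuSignature;

/* Decode display family/model from CPUID leaf 1 EAX, with the same
 * arithmetic as ff_cpu_has_fast_gather() in the patch:
 *   family = base family (bits 8-11)  + extended family (bits 20-27)
 *   model  = base model  (bits 4-7)   + extended model  (bits 16-19) << 4 */
static CpuSignature decode_signature(unsigned eax)
{
    CpuSignature s;
    s.family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
    s.model  = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);
    return s;
}

/* Same threshold as the patch: family 6 and model >= 70. */
static int passes_fast_gather_check(unsigned eax)
{
    CpuSignature s = decode_signature(eax);
    return s.family == 6 && s.model >= 70;
}

/* Print the decoded fields for one EAX value. */
static void print_signature(unsigned eax)
{
    CpuSignature s = decode_signature(eax);
    printf("eax=0x%08x family=%d model=%d fast_gather=%d\n",
           eax, s.family, s.model, passes_fast_gather_check(eax));
}
```

For example, an EAX value of 0x000406E3 decodes to family 6, model 78 and passes the check, while 0x000306D4 decodes to family 6, model 61 and does not. Whether the threshold of 70 lines up exactly with the Broadwell boundary would be worth confirming against Intel's published model numbers.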
 libavutil/cpu.c          |  6 ++++++
 libavutil/cpu.h          |  6 ++++++
 libavutil/cpu_internal.h |  1 +
 libavutil/x86/cpu.c      | 18 ++++++++++++++++++
 4 files changed, 31 insertions(+)

diff --git a/libavutil/cpu.c b/libavutil/cpu.c
index 8960415d00..0a723eeb7a 100644
--- a/libavutil/cpu.c
+++ b/libavutil/cpu.c
@@ -49,6 +49,12 @@

 static atomic_int cpu_flags = ATOMIC_VAR_INIT(-1);

+int av_cpu_has_fast_gather(void){
+    if (ARCH_X86)
+        return ff_cpu_has_fast_gather();
+    return 0;
+}
+
 static int get_cpu_flags(void)
 {
     if (ARCH_MIPS)
diff --git a/libavutil/cpu.h b/libavutil/cpu.h
index b555422dae..faf3a221f4 100644
--- a/libavutil/cpu.h
+++ b/libavutil/cpu.h
@@ -72,6 +72,7 @@
 #define AV_CPU_FLAG_MMI (1 << 0)
 #define AV_CPU_FLAG_MSA (1 << 1)

+int av_cpu_has_fast_gather(void);
 /**
  * Return the flags which specify extensions supported by the CPU.
  * The returned value is affected by av_force_cpu_flags() if that was used
@@ -107,6 +108,11 @@ int av_cpu_count(void);
  * av_set_cpu_flags_mask(), then this function will behave as if AVX is not
  * present.
  */
+
+/**
+ * Returns true if the cpu has fast gather instructions.
+ * Broadwell and later cpus have fast gather
+ */
 size_t av_cpu_max_align(void);

 #endif /* AVUTIL_CPU_H */
diff --git a/libavutil/cpu_internal.h b/libavutil/cpu_internal.h
index 889764320b..92525df0c1 100644
--- a/libavutil/cpu_internal.h
+++ b/libavutil/cpu_internal.h
@@ -46,6 +46,7 @@ int ff_get_cpu_flags_aarch64(void);
 int ff_get_cpu_flags_arm(void);
 int ff_get_cpu_flags_ppc(void);
 int ff_get_cpu_flags_x86(void);
+int ff_cpu_has_fast_gather(void);

 size_t ff_get_cpu_max_align_mips(void);
 size_t ff_get_cpu_max_align_aarch64(void);
diff --git a/libavutil/x86/cpu.c b/libavutil/x86/cpu.c
index bcd41a50a2..9724e0017b 100644
--- a/libavutil/x86/cpu.c
+++ b/libavutil/x86/cpu.c
@@ -270,3 +270,21 @@ size_t ff_get_cpu_max_align_x86(void)

     return 8;
 }
+
+int ff_cpu_has_fast_gather(void){
+    int eax, ebx, ecx;
+    int max_std_level, std_caps = 0;
+    int family = 0, model = 0;
+    cpuid(0, max_std_level, ebx, ecx, std_caps);
+
+    if (max_std_level >= 1) {
+        cpuid(1, eax, ebx, ecx, std_caps);
+        family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
+        model = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);
+        // Broadwell and later
+        if(family == 6 && model >= 70){
+            return 1;
+        }
+    }
+    return 0;
+}

From patchwork Mon Jun 14 11:14:07 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Kelly
X-Patchwork-Id: 28265
Date: Mon, 14 Jun 2021 13:14:07 +0200
In-Reply-To: <20210614111407.1897690-1-alankelly@google.com>
Message-Id: <20210614111407.1897690-2-alankelly@google.com>
References: <20210614111407.1897690-1-alankelly@google.com>
X-Mailer: git-send-email 2.32.0.272.g935e593368-goog
From: Alan Kelly
To: ffmpeg-devel@ffmpeg.org
Subject: [FFmpeg-devel] [PATCH 2/2] libswscale: Adds ff_hscale8to15_4_avx2 and ff_hscale8to15_X4_avx2 for all filter sizes.
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Cc: Alan Kelly

These functions replace all ff_hscale8to15_*_ssse3 when avx2 is available.
---
 libswscale/swscale_internal.h |   2 +
 libswscale/utils.c            |  37 +++++++++++
 libswscale/x86/Makefile       |   1 +
 libswscale/x86/scale_avx2.asm | 112 ++++++++++++++++++++++++++++++++++
 libswscale/x86/swscale.c      |  19 ++++++
 tests/checkasm/sw_scale.c     |  21 +++++--
 6 files changed, 187 insertions(+), 5 deletions(-)
 create mode 100644 libswscale/x86/scale_avx2.asm

diff --git a/libswscale/swscale_internal.h b/libswscale/swscale_internal.h
index a1de95cee0..45ef657cd4 100644
--- a/libswscale/swscale_internal.h
+++ b/libswscale/swscale_internal.h
@@ -1056,4 +1056,6 @@ void ff_init_vscale_pfn(SwsContext *c, yuv2planar1_fn yuv2plane1, yuv2planarX_fn
 //number of extra lines to process
 #define MAX_LINES_AHEAD 4

+//shuffle filter and filterPos for hyScale and hcScale filters in avx2
+void ff_shuffle_filter_coefficients(SwsContext *c, int* filterPos, int filterSize, int16_t *filter, int dstW);
 #endif /* SWSCALE_SWSCALE_INTERNAL_H */
diff --git a/libswscale/utils.c b/libswscale/utils.c
index 6bac7b658d..0dc1f7df7f 100644
--- a/libswscale/utils.c
+++ b/libswscale/utils.c
@@ -267,6 +267,41 @@ static const FormatEntry format_entries[] = {
     [AV_PIX_FMT_X2RGB10LE] = { 1, 1 },
 };

+void ff_shuffle_filter_coefficients(SwsContext *c, int *filterPos, int filterSize, int16_t *filter, int dstW){
+#if ARCH_X86_64
+    int i, j, k, l;
+    int cpu_flags = av_get_cpu_flags();
+    if (EXTERNAL_AVX2_FAST(cpu_flags) && av_cpu_has_fast_gather()){
+        if ((c->srcBpc == 8) && (c->dstBpc <= 14)){
+            if (dstW % 16 == 0){
+                if (filter != NULL){
+                    for (i = 0; i < dstW; i += 8){
+                        FFSWAP(int, filterPos[i + 2], filterPos[i+4]);
+                        FFSWAP(int, filterPos[i + 3], filterPos[i+5]);
+                    }
+                    if (filterSize > 4){
+                        int16_t *tmp2 = av_malloc(dstW * filterSize * 2);
+                        memcpy(tmp2, filter, dstW * filterSize * 2);
+                        for (i = 0; i < dstW; i += 16){//pixel
+                            for (k = 0; k < filterSize / 4; ++k){//fcoeff
+                                for (j = 0; j < 16; ++j){//inner pixel
+                                    for (l = 0; l < 4; ++l){//coeff
+                                        int from = i * filterSize + j * filterSize + k * 4 + l;
+                                        int to = (i) * filterSize + j * 4 + l + k * 64;
+                                        filter[to] = tmp2[from];
+                                    }
+                                }
+                            }
+                        }
+                        av_free(tmp2);
+                    }
+                }
+            }
+        }
+    }
+#endif
+}
+
 int sws_isSupportedInput(enum AVPixelFormat pix_fmt)
 {
     return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
@@ -1697,6 +1732,7 @@ av_cold int sws_init_context(SwsContext *c, SwsFilter *srcFilter,
                            get_local_pos(c, 0, 0, 0),
                            get_local_pos(c, 0, 0, 0))) < 0)
                 goto fail;
+            ff_shuffle_filter_coefficients(c, c->hLumFilterPos, c->hLumFilterSize, c->hLumFilter, dstW);
             if ((ret = initFilter(&c->hChrFilter, &c->hChrFilterPos,
                            &c->hChrFilterSize, c->chrXInc,
                            c->chrSrcW, c->chrDstW, filterAlign, 1 << 14,
@@ -1706,6 +1742,7 @@ av_cold int sws_init_context(SwsContext *c, SwsFilter *srcFilter,
                            get_local_pos(c, c->chrSrcHSubSample, c->src_h_chr_pos, 0),
                            get_local_pos(c, c->chrDstHSubSample, c->dst_h_chr_pos, 0))) < 0)
                 goto fail;
+            ff_shuffle_filter_coefficients(c, c->hChrFilterPos, c->hChrFilterSize, c->hChrFilter, c->chrDstW);
         }
     } // initialize horizontal stuff

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index bfe383364e..68391494be 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -11,6 +11,7 @@ OBJS-$(CONFIG_XMM_CLOBBER_TEST) += x86/w64xmmtest.o
 X86ASM-OBJS                     += x86/input.o                          \
                                    x86/output.o                         \
                                    x86/scale.o                          \
+                                   x86/scale_avx2.o                     \
                                    x86/rgb_2_rgb.o                      \
                                    x86/yuv_2_rgb.o                      \
                                    x86/yuv2yuvX.o                       \
diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
new file mode 100644
index 0000000000..d90fd2d791
--- /dev/null
+++ b/libswscale/x86/scale_avx2.asm
@@ -0,0 +1,112 @@
+;******************************************************************************
+;* x86-optimized horizontal line scaling functions
+;* Copyright 2020 Google LLC
+;* Copyright (c) 2011 Ronald S. Bultje
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "libavutil/x86/x86util.asm"
+
+SECTION_RODATA
+
+swizzle: dd 0, 4, 1, 5, 2, 6, 3, 7
+four: times 8 dd 4
+
+SECTION .text
+
+;-----------------------------------------------------------------------------
+; horizontal line scaling
+;
+; void hscale8to15__
+;         (SwsContext *c, int16_t *dst,
+;          int dstW, const uint8_t *src,
+;          const int16_t *filter,
+;          const int32_t *filterPos, int filterSize);
+;
+; Scale one horizontal line. Input is 8-bit width. Filter is 14 bits. Output is
+; 15 bits (in int16_t). Each output pixel is generated from $filterSize input
+; pixels, the position of the first pixel is given in filterPos[nOutputPixel].
+;-----------------------------------------------------------------------------
+
+%macro SCALE_FUNC 1
+cglobal hscale8to15_%1, 7, 9, 15, pos0, dst, w, srcmem, filter, fltpos, fltsize, count, inner
+    pxor m0, m0
+    movu m15, [swizzle]
+    mov countq, $0
+%ifidn %1, X4
+    movu m14, [four]
+    movsxd fltsizeq, fltsized
+    shr fltsizeq, 2
+%endif
+.loop:
+    movu m1, [fltposq]
+    movu m2, [fltposq+32]
+%ifidn %1, X4
+    pxor m9, m9
+    pxor m10, m10
+    pxor m11, m11
+    pxor m12, m12
+    mov innerq, $0
+.innerloop:
+%endif
+    vpcmpeqd m13, m13
+    vpgatherdd m3,[srcmemq + m1], m13
+    vpcmpeqd m13, m13
+    vpgatherdd m4,[srcmemq + m2], m13
+    vpunpcklbw m5, m3, m0
+    vpunpckhbw m6, m3, m0
+    vpunpcklbw m7, m4, m0
+    vpunpckhbw m8, m4, m0
+    vpmaddwd m5, m5, [filterq]
+    vpmaddwd m6, m6, [filterq + 32]
+    vpmaddwd m7, m7, [filterq + 64]
+    vpmaddwd m8, m8, [filterq + 96]
+    add filterq, $80
+%ifidn %1, X4
+    paddd m9, m5
+    paddd m10, m6
+    paddd m11, m7
+    paddd m12, m8
+    paddd m1, m14
+    paddd m2, m14
+    add innerq, $1
+    cmp innerq, fltsizeq
+    jl .innerloop
+    vphaddd m5, m9, m10
+    vphaddd m6, m11, m12
+%else
+    vphaddd m5, m5, m6
+    vphaddd m6, m7, m8
+%endif
+    vpsrad m5, 7
+    vpsrad m6, 7
+    vpackssdw m5, m5, m6
+    vpermd m5, m15, m5
+    vmovdqu [dstq + countq * 2], m5
+    add fltposq, $40
+    add countq, $10
+    cmp countq, wq
+    jl .loop
+REP_RET
+%endmacro
+
+%if ARCH_X86_64
+INIT_YMM avx2
+SCALE_FUNC 4
+SCALE_FUNC X4
+%endif
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 0848a31461..a5d2c06357 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -276,6 +276,9 @@ SCALE_FUNCS_SSE(sse2);
 SCALE_FUNCS_SSE(ssse3);
 SCALE_FUNCS_SSE(sse4);

+SCALE_FUNC(4, 8, 15, avx2);
+SCALE_FUNC(X4, 8, 15, avx2);
+
 #define VSCALEX_FUNC(size, opt) \
 void ff_yuv2planeX_ ## size ## _ ## opt(const int16_t *filter, int filterSize, \
                                         const int16_t **src, uint8_t *dest, int dstW, \
@@ -568,6 +571,22 @@ switch(c->dstBpc){ \
     }

 #if ARCH_X86_64
+#define ASSIGN_AVX2_SCALE_FUNC(hscalefn, filtersize) \
+    switch (filtersize) { \
+    case 4: hscalefn = ff_hscale8to15_4_avx2; break; \
+    default: hscalefn = ff_hscale8to15_X4_avx2; break; \
+             break; \
+    }
+
+    if (EXTERNAL_AVX2_FAST(cpu_flags) && av_cpu_has_fast_gather()){
+        if ((c->srcBpc == 8) && (c->dstBpc <= 14)){
+            if(c->chrDstW % 16 == 0)
+                ASSIGN_AVX2_SCALE_FUNC(c->hcScale, c->hChrFilterSize);
+            if(c->dstW % 16 == 0)
+                ASSIGN_AVX2_SCALE_FUNC(c->hyScale, c->hLumFilterSize);
+        }
+    }
+
     if (EXTERNAL_AVX2_FAST(cpu_flags)) {
         switch (c->dstFormat) {
         case AV_PIX_FMT_NV12:
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 3ac0f9082f..177f9df3c4 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -135,13 +135,13 @@ static void check_yuv2yuvX(void)
 }

 #undef SRC_PIXELS
-#define SRC_PIXELS 128
+#define SRC_PIXELS 512

 static void check_hscale(void)
 {
 #define MAX_FILTER_WIDTH 40
-#define FILTER_SIZES 5
-    static const int filter_sizes[FILTER_SIZES] = { 4, 8, 16, 32, 40 };
+#define FILTER_SIZES 6
+    static const int filter_sizes[FILTER_SIZES] = { 4, 8, 12, 16, 32, 40 };

 #define HSCALE_PAIRS 2
     static const int hscale_pairs[HSCALE_PAIRS][2] = {
@@ -160,6 +160,8 @@ static void check_hscale(void)
     // padded
     LOCAL_ALIGNED_32(int16_t, filter, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
     LOCAL_ALIGNED_32(int32_t, filterPos, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int16_t, filterAvx2, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
+    LOCAL_ALIGNED_32(int32_t, filterPosAvx, [SRC_PIXELS]);

     // The dst parameter here is either int16_t or int32_t but we use void* to
     // just cover both cases.
@@ -167,6 +169,8 @@ static void check_hscale(void)
                       const uint8_t *src, const int16_t *filter,
                       const int32_t *filterPos, int filterSize);

+    int cpu_flags = av_get_cpu_flags();
+
     ctx = sws_alloc_context();
     if (sws_init_context(ctx, NULL, NULL) < 0)
         fail();
@@ -180,9 +184,11 @@ static void check_hscale(void)
             ctx->srcBpc = hscale_pairs[hpi][0];
             ctx->dstBpc = hscale_pairs[hpi][1];
             ctx->hLumFilterSize = ctx->hChrFilterSize = width;
+            ctx->dstW = ctx->chrDstW = SRC_PIXELS;

             for (i = 0; i < SRC_PIXELS; i++) {
                 filterPos[i] = i;
+                filterPosAvx[i] = i;

                 // These filter cofficients are chosen to try break two corner
                 // cases, namely:
@@ -210,6 +216,11 @@ static void check_hscale(void)
                 filter[SRC_PIXELS * width + i] = rnd();
             }

+            memcpy(filterAvx2, filter, sizeof(uint16_t) * (SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH));
+            if (cpu_flags & AV_CPU_FLAG_AVX2){
+                ff_shuffle_filter_coefficients(ctx, filterPosAvx, width, filterAvx2, SRC_PIXELS);
+            }
+
             ff_getSwsFunc(ctx);

             if (check_func(ctx->hcScale, "hscale_%d_to_%d_width%d", ctx->srcBpc, ctx->dstBpc + 1, width)) {
@@ -217,10 +228,10 @@ static void check_hscale(void)
                 memset(dst1, 0, SRC_PIXELS * sizeof(dst1[0]));

                 call_ref(NULL, dst0, SRC_PIXELS, src, filter, filterPos, width);
-                call_new(NULL, dst1, SRC_PIXELS, src, filter, filterPos, width);
+                call_new(NULL, dst1, SRC_PIXELS, src, filterAvx2, filterPosAvx, width);
                 if (memcmp(dst0, dst1, SRC_PIXELS * sizeof(dst0[0])))
                     fail();
-                bench_new(NULL, dst0, SRC_PIXELS, src, filter, filterPos, width);
+                bench_new(NULL, dst0, SRC_PIXELS, src, filter, filterPosAvx, width);
             }
         }
     }
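
The reordering done by ff_shuffle_filter_coefficients() can be hard to follow from the nested loops alone, so here is a minimal standalone sketch of the same index mapping, assuming dstW is a multiple of 16 and filterSize a multiple of 4. The function name shuffle_sketch() and its return convention are hypothetical; only the swap pattern and the from/to arithmetic are taken from the patch.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical standalone version of the reordering in the patch.
 * Returns 0 on success, -1 on unsupported sizes or allocation failure. */
static int shuffle_sketch(int32_t *filter_pos, int16_t *filter,
                          int filter_size, int dst_w)
{
    if (dst_w <= 0 || filter_size <= 0 ||
        dst_w % 16 != 0 || filter_size % 4 != 0)
        return -1;

    /* Within every group of 8 positions, swap [2]<->[4] and [3]<->[5],
     * giving the order 0 1 4 5 2 3 6 7 (same swaps as the patch). */
    for (int i = 0; i < dst_w; i += 8) {
        int32_t t;
        t = filter_pos[i + 2]; filter_pos[i + 2] = filter_pos[i + 4]; filter_pos[i + 4] = t;
        t = filter_pos[i + 3]; filter_pos[i + 3] = filter_pos[i + 5]; filter_pos[i + 5] = t;
    }

    if (filter_size > 4) {
        size_t count = (size_t)dst_w * (size_t)filter_size;
        int16_t *tmp = malloc(count * sizeof(*tmp));
        if (!tmp)
            return -1;
        memcpy(tmp, filter, count * sizeof(*tmp));

        /* For each block of 16 pixels, change the layout from
         * [pixel j][coefficient] to [group k of 4 coeffs][pixel j][l]. */
        for (int i = 0; i < dst_w; i += 16)
            for (int k = 0; k < filter_size / 4; k++)
                for (int j = 0; j < 16; j++)
                    for (int l = 0; l < 4; l++) {
                        int from = (i + j) * filter_size + k * 4 + l;
                        int to   = i * filter_size + k * 64 + j * 4 + l;
                        filter[to] = tmp[from];
                    }
        free(tmp);
    }
    return 0;
}
```

With this layout, each pass of the inner loop in the X4 assembly variant can read 64 consecutive coefficients (four per pixel for 16 pixels) from filterq. Unlike the loop in the patch, the sketch checks the allocation result before copying.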