From patchwork Fri Jul 16 13:44:53 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Kelly
X-Patchwork-Id: 28936
Date: Fri, 16 Jul 2021 15:44:53 +0200
Message-Id: <20210716134453.1126957-1-alankelly@google.com>
X-Mailer: git-send-email 2.32.0.402.g57bb445576-goog
From: Alan Kelly
To: ffmpeg-devel@ffmpeg.org
Cc: Alan Kelly
Subject: [FFmpeg-devel] [PATCH 1/2] libavutil/cpu: Adds fast gather detection.
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches

Broadwell and later and Zen3 and later have fast gather instructions.
---
Haswell is now excluded from EXTERNAL_AVX2_FAST as discussed in the email thread.
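For readers following the gather check in the diff below, here is a minimal sketch of how the family and model numbers are decoded from the EAX value returned by CPUID leaf 1. The helper names (decode_family, decode_model, gather_is_slow) are hypothetical and are not part of libavutil; the bit arithmetic and the `family == 6 && model < 70` threshold are copied from this patch. The sample values in the usage note are assumed CPUID signatures of one Haswell and one Skylake part, used only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: display family = base family (bits 8-11)
 * plus extended family (bits 20-27), as computed in the patch. */
static int decode_family(uint32_t eax)
{
    return ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
}

/* Hypothetical helper: display model = base model (bits 4-7)
 * plus extended model (bits 16-19) shifted into the high nibble. */
static int decode_model(uint32_t eax)
{
    return ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);
}

/* Same condition as the patch uses before setting AV_CPU_FLAG_AVXSLOW.
 * Assumption: the caller has already confirmed AVX2 support. */
static int gather_is_slow(uint32_t eax)
{
    return decode_family(eax) == 6 && decode_model(eax) < 70;
}
```

With these helpers, an EAX of 0x000306C3 decodes to family 6, model 60 and is treated as slow, while 0x000506E3 decodes to family 6, model 94 and is not.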
 libavutil/cpu.h     |  1 +
 libavutil/x86/cpu.c | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/libavutil/cpu.h b/libavutil/cpu.h
index c069076439..ec3073d021 100644
--- a/libavutil/cpu.h
+++ b/libavutil/cpu.h
@@ -113,6 +113,7 @@ void av_force_cpu_count(int count);
  * av_set_cpu_flags_mask(), then this function will behave as if AVX is not
  * present.
  */
+
 size_t av_cpu_max_align(void);
 
 #endif /* AVUTIL_CPU_H */
diff --git a/libavutil/x86/cpu.c b/libavutil/x86/cpu.c
index bcd41a50a2..158e2170c4 100644
--- a/libavutil/x86/cpu.c
+++ b/libavutil/x86/cpu.c
@@ -146,8 +146,17 @@ int ff_get_cpu_flags_x86(void)
     if (max_std_level >= 7) {
         cpuid(7, eax, ebx, ecx, edx);
 #if HAVE_AVX2
-        if ((rval & AV_CPU_FLAG_AVX) && (ebx & 0x00000020))
+        if ((rval & AV_CPU_FLAG_AVX) && (ebx & 0x00000020)){
             rval |= AV_CPU_FLAG_AVX2;
+
+            cpuid(1, eax, ebx, ecx, std_caps);
+            family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
+            model = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);
+            // Haswell and earlier has slow gather
+            if(family == 6 && model < 70)
+                rval |= AV_CPU_FLAG_AVXSLOW;
+        }
+
 #if HAVE_AVX512 /* F, CD, BW, DQ, VL */
         if ((xcr0_lo & 0xe0) == 0xe0) { /* OPMASK/ZMM state */
             if ((rval & AV_CPU_FLAG_AVX2) && (ebx & 0xd0030000) == 0xd0030000)

From patchwork Fri Jul 16 13:48:18 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Kelly
X-Patchwork-Id: 28937
Date: Fri, 16 Jul 2021 15:48:18 +0200
Message-Id: <20210716134818.1127438-1-alankelly@google.com>
X-Mailer: git-send-email 2.32.0.402.g57bb445576-goog
From: Alan Kelly
To: ffmpeg-devel@ffmpeg.org
Cc: Alan Kelly
Subject: [FFmpeg-devel] [PATCH 2/2] libswscale: Adds ff_hscale8to15_4_avx2
 and ff_hscale8to15_X4_avx2 for all filter sizes.
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches

These functions replace all ff_hscale8to15_*_ssse3 when avx2 is available.
---
EXTERNAL_AVX2_FAST is now used instead of EXTERNAL_AVX2_FAST_GATHER as discussed
in the email thread for part 1 of this patch.
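As a reading aid for the assembly in this patch, the following is a plain C sketch of the per-pixel computation that the hscale8to15 functions perform, written from the description in the header comment of scale_avx2.asm. The name hscale8to15_ref is hypothetical and this is not FFmpeg's own C reference. The shift by 7 corresponds to the vpsrad in the assembly, and the clamp to the int16_t range is an assumption made to mirror the signed saturation of vpackssdw.

```c
#include <stdint.h>

/* Hypothetical scalar sketch: each output pixel is the dot product of
 * filterSize source bytes, starting at filterPos[i], with that pixel's
 * 14-bit coefficients, scaled down by 7 bits. */
static void hscale8to15_ref(int16_t *dst, int dstW, const uint8_t *src,
                            const int16_t *filter, const int32_t *filterPos,
                            int filterSize)
{
    for (int i = 0; i < dstW; i++) {
        int32_t val = 0;
        for (int j = 0; j < filterSize; j++)
            val += src[filterPos[i] + j] * filter[filterSize * i + j];
        /* Right shift of a negative value is implementation-defined in C;
         * common compilers shift arithmetically, like vpsrad. */
        val >>= 7;
        /* Saturate to int16_t, as vpackssdw does. */
        if (val > INT16_MAX)
            val = INT16_MAX;
        if (val < INT16_MIN)
            val = INT16_MIN;
        dst[i] = (int16_t)val;
    }
}
```

This sketch expects coefficients in the unshuffled layout, one run of filterSize values per output pixel. The AVX2 functions expect the layout produced by ff_shuffle_filter_coefficients instead.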
Benchmark results on Skylake and Haswell:

                                Skylake  Haswell
hscale_8_to_15_width4_ssse3       761.2      760
hscale_8_to_15_width4_avx2        468.7      957
hscale_8_to_15_width8_ssse3      1170.7     1032
hscale_8_to_15_width8_avx2        865.7     1979
hscale_8_to_15_width12_ssse3     2172.2     2472
hscale_8_to_15_width12_avx2      1245.7     2901
hscale_8_to_15_width16_ssse3     2244.2     2400
hscale_8_to_15_width16_avx2      1647.2     3681

 libswscale/swscale_internal.h |   2 +
 libswscale/utils.c            |  37 +++++++++++
 libswscale/x86/Makefile       |   1 +
 libswscale/x86/scale_avx2.asm | 112 ++++++++++++++++++++++++++++++++++
 libswscale/x86/swscale.c      |  19 ++++++
 tests/checkasm/sw_scale.c     |  20 ++++--
 6 files changed, 186 insertions(+), 5 deletions(-)
 create mode 100644 libswscale/x86/scale_avx2.asm

diff --git a/libswscale/swscale_internal.h b/libswscale/swscale_internal.h
index 673407636a..fba3dabe5b 100644
--- a/libswscale/swscale_internal.h
+++ b/libswscale/swscale_internal.h
@@ -1064,4 +1064,6 @@ void ff_init_vscale_pfn(SwsContext *c, yuv2planar1_fn yuv2plane1, yuv2planarX_fn
 //number of extra lines to process
 #define MAX_LINES_AHEAD 4
 
+//shuffle filter and filterPos for hyScale and hcScale filters in avx2
+void ff_shuffle_filter_coefficients(SwsContext *c, int* filterPos, int filterSize, int16_t *filter, int dstW);
 #endif /* SWSCALE_SWSCALE_INTERNAL_H */
diff --git a/libswscale/utils.c b/libswscale/utils.c
index 176fc6fd63..0577fd5490 100644
--- a/libswscale/utils.c
+++ b/libswscale/utils.c
@@ -268,6 +268,41 @@ static const FormatEntry format_entries[] = {
     [AV_PIX_FMT_X2RGB10LE]   = { 1, 1 },
 };
 
+void ff_shuffle_filter_coefficients(SwsContext *c, int *filterPos, int filterSize, int16_t *filter, int dstW){
+#if ARCH_X86_64
+    int i, j, k, l;
+    int cpu_flags = av_get_cpu_flags();
+    if (EXTERNAL_AVX2_FAST(cpu_flags)){
+        if ((c->srcBpc == 8) && (c->dstBpc <= 14)){
+            if (dstW % 16 == 0){
+                if (filter != NULL){
+                    for (i = 0; i < dstW; i += 8){
+                        FFSWAP(int, filterPos[i + 2], filterPos[i+4]);
+                        FFSWAP(int, filterPos[i + 3], filterPos[i+5]);
+                    }
+                    if (filterSize > 4){
+                        int16_t *tmp2 = av_malloc(dstW * filterSize * 2);
+                        memcpy(tmp2, filter, dstW * filterSize * 2);
+                        for (i = 0; i < dstW; i += 16){//pixel
+                            for (k = 0; k < filterSize / 4; ++k){//fcoeff
+                                for (j = 0; j < 16; ++j){//inner pixel
+                                    for (l = 0; l < 4; ++l){//coeff
+                                        int from = i * filterSize + j * filterSize + k * 4 + l;
+                                        int to = (i) * filterSize + j * 4 + l + k * 64;
+                                        filter[to] = tmp2[from];
+                                    }
+                                }
+                            }
+                        }
+                        av_free(tmp2);
+                    }
+                }
+            }
+        }
+    }
+#endif
+}
+
 int sws_isSupportedInput(enum AVPixelFormat pix_fmt)
 {
     return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
@@ -1699,6 +1734,7 @@ av_cold int sws_init_context(SwsContext *c, SwsFilter *srcFilter,
                            get_local_pos(c, 0, 0, 0),
                            get_local_pos(c, 0, 0, 0))) < 0)
                 goto fail;
+            ff_shuffle_filter_coefficients(c, c->hLumFilterPos, c->hLumFilterSize, c->hLumFilter, dstW);
             if ((ret = initFilter(&c->hChrFilter, &c->hChrFilterPos,
                            &c->hChrFilterSize, c->chrXInc,
                            c->chrSrcW, c->chrDstW, filterAlign, 1 << 14,
@@ -1708,6 +1744,7 @@ av_cold int sws_init_context(SwsContext *c, SwsFilter *srcFilter,
                            get_local_pos(c, c->chrSrcHSubSample, c->src_h_chr_pos, 0),
                            get_local_pos(c, c->chrDstHSubSample, c->dst_h_chr_pos, 0))) < 0)
                 goto fail;
+            ff_shuffle_filter_coefficients(c, c->hChrFilterPos, c->hChrFilterSize, c->hChrFilter, c->chrDstW);
         }
     } // initialize horizontal stuff
 
diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index bfe383364e..68391494be 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -11,6 +11,7 @@ OBJS-$(CONFIG_XMM_CLOBBER_TEST) += x86/w64xmmtest.o
 X86ASM-OBJS                     += x86/input.o                          \
                                    x86/output.o                         \
                                    x86/scale.o                          \
+                                   x86/scale_avx2.o                     \
                                    x86/rgb_2_rgb.o                      \
                                    x86/yuv_2_rgb.o                      \
                                    x86/yuv2yuvX.o                       \
diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
new file mode 100644
index 0000000000..d90fd2d791
--- /dev/null
+++ b/libswscale/x86/scale_avx2.asm
@@ -0,0 +1,112 @@
+;******************************************************************************
+;* x86-optimized horizontal line scaling functions
+;* Copyright 2020 Google LLC
+;* Copyright (c) 2011 Ronald S. Bultje
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "libavutil/x86/x86util.asm"
+
+SECTION_RODATA
+
+swizzle: dd 0, 4, 1, 5, 2, 6, 3, 7
+four: times 8 dd 4
+
+SECTION .text
+
+;-----------------------------------------------------------------------------
+; horizontal line scaling
+;
+; void hscale8to15__
+;                   (SwsContext *c, int16_t *dst,
+;                    int dstW, const uint8_t *src,
+;                    const int16_t *filter,
+;                    const int32_t *filterPos, int filterSize);
+;
+; Scale one horizontal line. Input is 8 bits wide. Filter is 14 bits. Output is
+; 15 bits (in int16_t). Each output pixel is generated from $filterSize input
+; pixels, the position of the first pixel is given in filterPos[nOutputPixel].
+;-----------------------------------------------------------------------------
+
+%macro SCALE_FUNC 1
+cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize, count, inner
+    pxor m0, m0
+    movu m15, [swizzle]
+    mov countq, $0
+%ifidn %1, X4
+    movu m14, [four]
+    movsxd fltsizeq, fltsized
+    shr fltsizeq, 2
+%endif
+.loop:
+    movu m1, [fltposq]
+    movu m2, [fltposq+32]
+%ifidn %1, X4
+    pxor m9, m9
+    pxor m10, m10
+    pxor m11, m11
+    pxor m12, m12
+    mov innerq, $0
+.innerloop:
+%endif
+    vpcmpeqd  m13, m13
+    vpgatherdd m3,[srcmemq + m1], m13
+    vpcmpeqd  m13, m13
+    vpgatherdd m4,[srcmemq + m2], m13
+    vpunpcklbw m5, m3, m0
+    vpunpckhbw m6, m3, m0
+    vpunpcklbw m7, m4, m0
+    vpunpckhbw m8, m4, m0
+    vpmaddwd m5, m5, [filterq]
+    vpmaddwd m6, m6, [filterq + 32]
+    vpmaddwd m7, m7, [filterq + 64]
+    vpmaddwd m8, m8, [filterq + 96]
+    add filterq, $80
+%ifidn %1, X4
+    paddd m9, m5
+    paddd m10, m6
+    paddd m11, m7
+    paddd m12, m8
+    paddd m1, m14
+    paddd m2, m14
+    add innerq, $1
+    cmp innerq, fltsizeq
+    jl .innerloop
+    vphaddd m5, m9, m10
+    vphaddd m6, m11, m12
+%else
+    vphaddd m5, m5, m6
+    vphaddd m6, m7, m8
+%endif
+    vpsrad  m5, 7
+    vpsrad  m6, 7
+    vpackssdw m5, m5, m6
+    vpermd m5, m15, m5
+    vmovdqu [dstq + countq * 2], m5
+    add fltposq, $40
+    add countq, $10
+    cmp countq, wq
+    jl .loop
+REP_RET
+%endmacro
+
+%if ARCH_X86_64
+INIT_YMM avx2
+SCALE_FUNC 4
+SCALE_FUNC X4
+%endif
diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 0848a31461..164b06d6ba 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -276,6 +276,9 @@ SCALE_FUNCS_SSE(sse2);
 SCALE_FUNCS_SSE(ssse3);
 SCALE_FUNCS_SSE(sse4);
 
+SCALE_FUNC(4, 8, 15, avx2);
+SCALE_FUNC(X4, 8, 15, avx2);
+
 #define VSCALEX_FUNC(size, opt) \
 void ff_yuv2planeX_ ## size ## _ ## opt(const int16_t *filter, int filterSize, \
                                         const int16_t **src, uint8_t *dest, int dstW, \
@@ -568,6 +571,22 @@ switch(c->dstBpc){ \
     }
 
 #if ARCH_X86_64
+#define ASSIGN_AVX2_SCALE_FUNC(hscalefn, filtersize) \
+    switch (filtersize) { \
+    case 4:  hscalefn = ff_hscale8to15_4_avx2; break; \
+    default: hscalefn = ff_hscale8to15_X4_avx2; break; \
+             break; \
+    }
+
+    if (EXTERNAL_AVX2_FAST(cpu_flags)){
+        if ((c->srcBpc == 8) && (c->dstBpc <= 14)){
+            if(c->chrDstW % 16 == 0)
+                ASSIGN_AVX2_SCALE_FUNC(c->hcScale, c->hChrFilterSize);
+            if(c->dstW % 16 == 0)
+                ASSIGN_AVX2_SCALE_FUNC(c->hyScale, c->hLumFilterSize);
+        }
+    }
+
     if (EXTERNAL_AVX2_FAST(cpu_flags)) {
         switch (c->dstFormat) {
         case AV_PIX_FMT_NV12:
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 40c5eb3aa8..103b1aa5da 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -135,13 +135,13 @@ static void check_yuv2yuvX(void)
 }
 
 #undef SRC_PIXELS
-#define SRC_PIXELS 128
+#define SRC_PIXELS 512
 
 static void check_hscale(void)
 {
 #define MAX_FILTER_WIDTH 40
-#define FILTER_SIZES 5
-    static const int filter_sizes[FILTER_SIZES] = { 4, 8, 16, 32, 40 };
+#define FILTER_SIZES 6
+    static const int filter_sizes[FILTER_SIZES] = { 4, 8, 12, 16, 32, 40 };
 
 #define HSCALE_PAIRS 2
     static const int hscale_pairs[HSCALE_PAIRS][2] = {
@@ -160,6 +160,8 @@ static void check_hscale(void)
     // padded
     LOCAL_ALIGNED_32(int16_t, filter, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
     LOCAL_ALIGNED_32(int32_t, filterPos, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int16_t, filterAvx2, [SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH]);
+    LOCAL_ALIGNED_32(int32_t, filterPosAvx, [SRC_PIXELS]);
 
     // The dst parameter here is either int16_t or int32_t but we use void* to
     // just cover both cases.
@@ -167,6 +169,8 @@ static void check_hscale(void)
                       const uint8_t *src, const int16_t *filter,
                       const int32_t *filterPos, int filterSize);
 
+    int cpu_flags = av_get_cpu_flags();
+
     ctx = sws_alloc_context();
     if (sws_init_context(ctx, NULL, NULL) < 0)
         fail();
@@ -180,9 +184,11 @@ static void check_hscale(void)
             ctx->srcBpc = hscale_pairs[hpi][0];
             ctx->dstBpc = hscale_pairs[hpi][1];
             ctx->hLumFilterSize = ctx->hChrFilterSize = width;
+            ctx->dstW = ctx->chrDstW = SRC_PIXELS;
 
             for (i = 0; i < SRC_PIXELS; i++) {
                 filterPos[i] = i;
+                filterPosAvx[i] = i;
 
                 // These filter cofficients are chosen to try break two corner
                 // cases, namely:
@@ -211,16 +217,20 @@ static void check_hscale(void)
                 filter[SRC_PIXELS * width + i] = rnd();
             }
             ff_sws_init_scale(ctx);
+            memcpy(filterAvx2, filter, sizeof(uint16_t) * (SRC_PIXELS * MAX_FILTER_WIDTH + MAX_FILTER_WIDTH));
+            if (cpu_flags & AV_CPU_FLAG_AVX2){
+                ff_shuffle_filter_coefficients(ctx, filterPosAvx, width, filterAvx2, SRC_PIXELS);
+            }
             if (check_func(ctx->hcScale, "hscale_%d_to_%d_width%d", ctx->srcBpc, ctx->dstBpc + 1, width)) {
                 memset(dst0, 0, SRC_PIXELS * sizeof(dst0[0]));
                 memset(dst1, 0, SRC_PIXELS * sizeof(dst1[0]));
 
                 call_ref(NULL, dst0, SRC_PIXELS, src, filter, filterPos, width);
-                call_new(NULL, dst1, SRC_PIXELS, src, filter, filterPos, width);
+                call_new(NULL, dst1, SRC_PIXELS, src, filterAvx2, filterPosAvx, width);
                 if (memcmp(dst0, dst1, SRC_PIXELS * sizeof(dst0[0])))
                     fail();
 
-                bench_new(NULL, dst0, SRC_PIXELS, src, filter, filterPos, width);
+                bench_new(NULL, dst0, SRC_PIXELS, src, filterAvx2, filterPosAvx, width);
             }
         }
     }
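
The coefficient reordering done by ff_shuffle_filter_coefficients for filterSize > 4 can be easier to follow in isolation. Below is a minimal standalone sketch of that step, assuming dstW is a multiple of 16 and filterSize is a multiple of 4. The name shuffle_coeffs is hypothetical, malloc stands in for av_malloc, and the allocation check is an addition that the patch does not have. The from/to index arithmetic is the same as in the patch, with `i * filterSize + j * filterSize` folded into `(i + j) * filterSize`.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical standalone version of the coefficient shuffle.
 * Input layout:  filterSize consecutive coefficients per output pixel.
 * Output layout: for each block of 16 pixels, coefficient group k
 *                (4 taps) of all 16 pixels is stored contiguously,
 *                so one inner-loop pass reads 64 int16_t values.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
static int shuffle_coeffs(int16_t *filter, int filterSize, int dstW)
{
    size_t n = (size_t)dstW * filterSize;
    int16_t *tmp = malloc(n * sizeof(*tmp));
    if (!tmp)
        return -1;
    memcpy(tmp, filter, n * sizeof(*tmp));

    for (int i = 0; i < dstW; i += 16)                /* block of 16 pixels */
        for (int k = 0; k < filterSize / 4; k++)      /* group of 4 taps */
            for (int j = 0; j < 16; j++)              /* pixel in block */
                for (int l = 0; l < 4; l++) {         /* tap in group */
                    int from = (i + j) * filterSize + k * 4 + l;
                    int to   = i * filterSize + k * 64 + j * 4 + l;
                    filter[to] = tmp[from];
                }

    free(tmp);
    return 0;
}
```

For filterSize 8 and dstW 16, the first tap of pixel 1 moves from index 8 to index 4, and the fifth tap of pixel 0 moves from index 4 to index 64. This matches the `add filterq, $80` stride (128 bytes, 64 coefficients) in the inner loop of the X4 function.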