From patchwork Mon Jan 10 14:58:34 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Kelly
X-Patchwork-Id: 33179
Date: Mon, 10 Jan 2022 15:58:34 +0100
In-Reply-To: <20220110145836.3449558-1-alankelly@google.com>
Message-Id: <20220110145836.3449558-2-alankelly@google.com>
References: <20220110145836.3449558-1-alankelly@google.com>
X-Mailer: git-send-email 2.34.1.575.g55b058a8bb-goog
From: Alan Kelly
To: ffmpeg-devel@ffmpeg.org
Subject: [FFmpeg-devel] [PATCH 2/4] libswscale: Avx2 hscale can process any input of size which is a multiple of 4.
List-Id: FFmpeg development discussions and patches
Cc: Alan Kelly

The main loop processes blocks of 16 pixels. The tail processes blocks of size 4.
---
 libswscale/x86/scale_avx2.asm | 48 +++++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
index 20acdbd633..dc42abb100 100644
--- a/libswscale/x86/scale_avx2.asm
+++ b/libswscale/x86/scale_avx2.asm
@@ -53,6 +53,9 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
     mova m14, [four]
     shr fltsized, 2
 %endif
+    cmp wq, 16
+    jl .tail_loop
+    mov countq, 0x10
 .loop:
     movu m1, [fltposq]
     movu m2, [fltposq+32]
@@ -97,11 +100,52 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
     vpsrad  m6, 7
     vpackssdw m5, m5, m6
     vpermd m5, m15, m5
-    vmovdqu [dstq + countq * 2], m5
+    vmovdqu [dstq], m5
+    add dstq, 0x20
     add fltposq, 0x40
     add countq, 0x10
     cmp countq, wq
-    jl .loop
+    jle .loop
+
+    sub countq, 0x10
+    cmp countq, wq
+    jge .end
+
+.tail_loop:
+    movu xm1, [fltposq]
+%ifidn %1, X4
+    pxor xm9, xm9
+    pxor xm10, xm10
+    xor innerq, innerq
+.tail_innerloop:
+%endif
+    vpcmpeqd  xm13, xm13
+    vpgatherdd xm3,[srcmemq + xm1], xm13
+    vpunpcklbw xm5, xm3, xm0
+    vpunpckhbw xm6, xm3, xm0
+    vpmaddwd xm5, xm5, [filterq]
+    vpmaddwd xm6, xm6, [filterq + 16]
+    add filterq, 0x20
+%ifidn %1, X4
+    paddd xm9, xm5
+    paddd xm10, xm6
+    paddd xm1, xm14
+    add innerq, 1
+    cmp innerq, fltsizeq
+    jl .tail_innerloop
+    vphaddd xm5, xm9, xm10
+%else
+    vphaddd xm5, xm5, xm6
+%endif
+    vpsrad  xm5, 7
+    vpackssdw xm5, xm5, xm5
+    vmovq [dstq], xm5
+    add dstq, 0x8
+    add fltposq, 0x10
+    add countq, 0x4
+    cmp countq, wq
+    jl .tail_loop
+.end:
     REP_RET
 %endmacro
 
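
For reviewers, the branch structure this patch adds around the main loop can be sketched in C. This is only an illustration of the control flow, not code from libswscale: `BlockCount` and `count_blocks` are hypothetical names, the sketch assumes `w` is a positive multiple of 4, and it assumes `countq` is zero when the function jumps straight to `.tail_loop` (that initialisation lies outside the hunks shown).

```c
/* Hypothetical helper: counts how many 16-pixel and 4-pixel blocks the
 * patched loop structure would process for an output width w.
 * Assumes w > 0 and w % 4 == 0. */
typedef struct {
    int main_blocks; /* iterations of .loop (16 pixels each)     */
    int tail_blocks; /* iterations of .tail_loop (4 pixels each) */
} BlockCount;

static BlockCount count_blocks(int w)
{
    BlockCount bc = {0, 0};
    int count = 0;              /* assumed initial value of countq */

    if (w >= 16) {              /* cmp wq, 16 / jl .tail_loop      */
        count = 16;             /* mov countq, 0x10                */
        do {
            bc.main_blocks++;   /* one 16-pixel block              */
            count += 16;        /* add countq, 0x10                */
        } while (count <= w);   /* cmp countq, wq / jle .loop      */
        count -= 16;            /* sub countq, 0x10                */
    }

    /* The assembly uses a do-while here, guarded by "jge .end" on the
     * main-loop path; for w > 0 a plain while loop is equivalent. */
    while (count < w) {
        bc.tail_blocks++;       /* one 4-pixel block               */
        count += 4;             /* add countq, 0x4                 */
    }
    return bc;
}
```

Under these assumptions, a width of 36 gives two main-loop blocks and one tail block, and any width below 16 is handled entirely by the tail.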