From patchwork Tue Apr 26 08:00:02 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Kelly
X-Patchwork-Id: 35439
Date: Tue, 26 Apr 2022 10:00:02 +0200
Message-Id: <20220426080002.404023-1-alankelly@google.com>
X-Mailer: git-send-email 2.36.0.rc2.479.g8af0fa9b8e-goog
From: Alan Kelly
To: ffmpeg-devel@ffmpeg.org
Subject: [FFmpeg-devel] [PATCH v2 3/5] libswscale: Avx2 hscale can process inputs of any size.
List-Id: FFmpeg development discussions and patches
Cc: Alan Kelly

The main loop processes blocks of 16 pixels. The tail processes blocks of size 4.
---
 libswscale/x86/scale_avx2.asm | 44 ++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/libswscale/x86/scale_avx2.asm b/libswscale/x86/scale_avx2.asm
index 20acdbd633..7657b2825f 100644
--- a/libswscale/x86/scale_avx2.asm
+++ b/libswscale/x86/scale_avx2.asm
@@ -53,6 +53,9 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
     mova            m14, [four]
     shr             fltsized, 2
 %endif
+    cmp             wq, 16
+    jl              .tail_loop
+    sub             wq, 0x10
 .loop:
     movu            m1, [fltposq]
     movu            m2, [fltposq+32]
@@ -101,7 +104,46 @@ cglobal hscale8to15_%1, 7, 9, 16, pos0, dst, w, srcmem, filter, fltpos, fltsize,
     add             fltposq, 0x40
     add             countq, 0x10
     cmp             countq, wq
-    jl              .loop
+    jle             .loop
+
+    add             wq, 0x10
+    cmp             countq, wq
+    jge             .end
+
+.tail_loop:
+    movu            xm1, [fltposq]
+%ifidn %1, X4
+    pxor            xm9, xm9
+    pxor            xm10, xm10
+    xor             innerq, innerq
+.tail_innerloop:
+%endif
+    vpcmpeqd        xm13, xm13
+    vpgatherdd      xm3,[srcmemq + xm1], xm13
+    vpunpcklbw      xm5, xm3, xm0
+    vpunpckhbw      xm6, xm3, xm0
+    vpmaddwd        xm5, xm5, [filterq]
+    vpmaddwd        xm6, xm6, [filterq + 16]
+    add             filterq, 0x20
+%ifidn %1, X4
+    paddd           xm9, xm5
+    paddd           xm10, xm6
+    paddd           xm1, xm14
+    add             innerq, 1
+    cmp             innerq, fltsizeq
+    jl              .tail_innerloop
+    vphaddd         xm5, xm9, xm10
+%else
+    vphaddd         xm5, xm5, xm6
+%endif
+    vpsrad          xm5, 7
+    vpackssdw       xm5, xm5, xm5
+    vmovq           [dstq + countq * 2], xm5
+    add             fltposq, 0x10
+    add             countq, 0x4
+    cmp             countq, wq
+    jl              .tail_loop
+.end:
 REP_RET
 %endmacro
 
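
For readers who want the control flow without the SIMD details, the split into a 16-pixel main loop and a 4-pixel tail can be sketched in scalar C. This is only an illustration under stated assumptions, not the code in the patch: `hscale_block` and `hscale_sketch` are hypothetical names, the filter coefficients are stored in plain row-major order (the assembly reads them in whatever layout the AVX2 path is given), and the width is assumed to be a multiple of 4, since the tail always produces a whole block of 4 outputs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for one SIMD block: computes n output
 * pixels starting at index `start`. */
static void hscale_block(int16_t *dst, int start, int n,
                         const uint8_t *src, const int16_t *filter,
                         const int32_t *filter_pos, int filter_size)
{
    for (int i = start; i < start + n; i++) {
        int val = 0;
        for (int j = 0; j < filter_size; j++)
            val += src[filter_pos[i] + j] * filter[filter_size * i + j];
        val >>= 7;                                   /* same shift as vpsrad ..., 7 */
        dst[i] = val > 32767 ? 32767 : (int16_t)val; /* clamp to the int16 maximum */
    }
}

/* Hypothetical driver mirroring the loop structure of the patch.
 * Assumption: w is a multiple of 4. */
static void hscale_sketch(int16_t *dst, int w,
                          const uint8_t *src, const int16_t *filter,
                          const int32_t *filter_pos, int filter_size)
{
    int count = 0;

    assert(w % 4 == 0);

    /* Main loop: whole blocks of 16 pixels; skipped entirely when w < 16. */
    for (; count + 16 <= w; count += 16)
        hscale_block(dst, count, 16, src, filter, filter_pos, filter_size);

    /* Tail: the remaining pixels in blocks of 4. */
    for (; count < w; count += 4)
        hscale_block(dst, count, 4, src, filter, filter_pos, filter_size);
}
```

With w = 24, for example, the main loop runs once (pixels 0 to 15) and the tail runs twice (pixels 16 to 23). With w = 4 only the tail runs, which corresponds to the early `jl .tail_loop` added before `.loop`.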