From patchwork Wed Jul 26 14:56:28 2017
X-Patchwork-Submitter: Ivan Kalvachev
X-Patchwork-Id: 4464
From: Ivan Kalvachev
Date: Wed, 26 Jul 2017 17:56:28 +0300
To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

On 7/24/17, Ivan Kalvachev wrote:
> On 7/23/17, Rostislav Pehlivanov wrote:
>> On 22 July 2017 at 12:18, Ivan Kalvachev wrote:
>>
>>> This patch is ready for review and inclusion.
>>>
>>>
>>>
>>> +%macro emu_blendvps 3 ; dst/src_a, src_b, mask
>>> +%macro lea_pic_base 2 ; reg, const_label
>>
>> Capitalize macro names.
>
> Done for the latter.
> About the former...
>
> I do not like to use capitalization for the emulated instructions:
> 1. In C, macros are always in capital letters to separate them from
> functions. In asm, functions cannot be mistaken for macros.
>
> 2. All instructions are lowercase, even the ones that are overloaded
> by x86asm.h macros.
>
> 3. There are already about 30 macros with lowercase names in
> libavcodec/x86. The rule is not absolute.
>
> 4. There are some emulated instructions in x86util that have all-uppercase
> names, and I find them very ugly when I see them in the code.
> This is why I went the "emu_" route.
> I actually want to encourage using emu_ for them too (as
> alternative names).
>
> 5. There is nothing guaranteeing that the assembler
> is case sensitive.
> Actually, the nasm manual mentions that MASM (on which
> it is based) is not case-sensitive (by default).
> While nasm and yasm are case sensitive,
> there are still %idefine and %imacro that
> could create some confusion.
>
> Anyway.
>
> Would you accept a change where the emulation macros
> are moved to a separate file and the macros are
> renamed to EMU_, aka EMU_blendvps?
> I'm thinking of "libavcodec/x86/x86emu.asm".
>
> Or maybe they should be put in libavutil/x86/x86util.asm;
> however, that could open a new can of worms,
> aka using the ugly uppercase-only names.
>
>>> +          %error sse41 blendvps uses xmm0 as default 3rd operand, you used %3
>>
>> Remove this error
>
> I'd like to keep that one.
> It helps finding out why the code does not compile.
>
> Sometimes it may not be obvious, since SWAP might
> change the registers under the hood.
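> For illustration, a minimal hypothetical sketch of the failure mode that
> %error turns into a readable build-time message (the register choices
> here are made up):
>
>     INIT_XMM sse4
>     SWAP 0, 4                  ; x86inc: the name "m0" now maps to xmm4
>     ...
>     EMU_blendvps m2, m3, m0    ; m0 is no longer xmm0, but the plain
>                                ; sse4.1 blendvps takes its mask from
>                                ; xmm0 only, so the macro stops the build
>                                ; here instead of letting the code blend
>                                ; with the wrong mask register
>
> Without the check, the mismatch is easy to miss precisely because SWAP
> renames registers behind the scenes.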
>>> +        %error "blendvps" emulation needs SSE
>>
>> Definitely remove this error too.
>
> Done
>
>>> +        %error AVX1 does not support integer 256 bit ymm operations
>>
>> And this one is particularly pointless...
>
> On the contrary. AVX1 does support 256 bit ymm floating point operations,
> but not integer ones. It is quite a trivial mistake to use one with the
> other.
>
> What is worse, without this the wrong code would compile
> without any hint of error, because avx2 ops are
> not overloaded by the current x86asm.h,
> so you won't get a warning that you are using avx2 ops in an avx1 section.
>
>>> +        %error sse41 blendvps uses xmm0 as default 3rd operand, you used %3
>>
>> Same...
>
> Again, I'd like to keep that one.
>
>>> +        %error "broadcastss" emulation needs SSE
>>
>> Same...
>
> Done
>
>>> +        %error "pbroadcastd" emulation needs SSE2
>>
>> Yep, the same...
>
> Done
>
>>> +        %error pick HSUMPS implementation
>>
>> Again...
>
> I thought I had taken care of that.
> Done
>
>> All of these are pointless and unnecessary. Always assume at least SSE2.
>
> NEVER assume anything (that you can check).
> Making wrong assumptions is the easiest way
> to make mistakes that are
> very hard to notice, find and correct.
>
>>> +
>>> +; 256bit version seems slower than the 128bit AVX on all current CPU's
>>> +; So disable it for now and keep the code in case something changes in future.
>>> +%if HAVE_AVX2_EXTERNAL > 0 && 0
>>> +INIT_YMM avx2
>>> +PVQ_FAST_SEARCH
>>> +%endif
>>
>> Remove this altogether.
>
> I think this would be a mistake.
>
> BTW, would you do proper benchmarks with the current
> code and post the results? There is a chance that the
> code has improved.
> (Please, use a real audio sample, not generated noise.)
>
> I also asked you to try something (changing "cmpless" to "cmpps"),
> which you totally forgot to do.
> (IACA says there is no penalty in my code for using this SSE op in an
> AVX2 block, but as I said, never assume.)
>
>>> +%if 1
>>> +        emu_pbroadcastd m1, xmm1
>>
>> ...
>
>>> +%else
>>> +        ; Ryzen is the only CPU at the moment that doesn't have
>>> +        ; write forwarding stall and where this code is faster,
>>> +        ; by about 7 cycles on average.
>>> +        %{1}ps  m5, mm_const_float_1
>>> +        movss   [tmpY + r4q], xmm5      ; Y[max_idx] +-= 1.0;
>>
>> Remove the %else and always use the first bit of code.
>
> I don't like removing code that is faster on a new CPU.
> But since you insist.
> Done.
>
>>> +%if cpuflag(avx2) && 01
>>> +%elif cpuflag(avx) && 01
>>> +%if cpuflag(avx2) && 01
>>
>> Remove the && 01 in these 3 places.
>
> Done
>
>>> +;       vperm2f128   %1, %1, %1, 0   ; ymm, ymm, ymm ; 2-3c 1/1
>>
>> Remove this commented out code.
>
> Done
>
>>> +cglobal pvq_search, 4, 5+num_pic_regs, 11, 256*4, inX, outY, K, N
>>> +%if ARCH_X86_64 == 0    ; sbrdsp
>>
>> You're using more than 11 xmm regs, so the entire code is always going to
>> need a 64 bit system.
>> So wrap the entire file, from top to bottom after the %include, in
>> %if ARCH_X86_64
>> ...
>> %endif
>
> I'm using exactly 8 registers on 32 bit archs.
> I'm using 3 extra registers on 64 bit ones to hold some constants.
>
> If you insist, I'll write some ifdef-ery to
> always supply the correct number of registers.
> From what I've seen in x86asm, the >8 case is only checked on WIN64.
>
>>> +SECTION_RODATA 64
>>> +
>>> +const_float_abs_mask:   times 8 dd 0x7fffffff
>>> +const_align_abs_edge:   times 8 dd 0
>>> +
>>> +const_float_0_5:        times 8 dd 0.5
>>> +const_float_1:          times 8 dd 1.0
>>> +const_float_sign_mask:  times 8 dd 0x80000000
>>> +
>>> +const_int32_offsets:
>>> +        %rep 8
>>> +                dd $-const_int32_offsets
>>> +        %endrep
>>> +SECTION .text
>>
>> This whole thing needs to go right at the very top after the %include. All
>> macros must then follow it.
>> Having read-only sections in the middle of the file is very confusing and
>> not at all what all of our asm code follows.
>
> It's very simple:
> pre-processor, constants, code.
> Your confusion comes from the fact that
> the code templates are done by macros.
>
> As I've told you before, the macros before
> the constants belong in a separate file.
>
> Also, why are you so obsessed
> with constants being the first thing in a file?
> You are not supposed to mess with them
> on a regular basis.
>
> Constant labels must be before the code that
> uses them, but because of code templates
> they could be literally a few lines before the end of the file.

In this version I removed the AVX2 disabled code and moved the
constants before the macros (but after the defines).
The emulations remain in this code, for now.

I've also used ifdef-ery to always supply the correct number of used
xmm registers to "cglobal" (and moved the relevant macros closer to it).
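For anyone reviewing the asm below, here is a rough scalar C sketch of
the searched algorithm (hedged: the function name and details are
illustrative, reconstructed from the patch description, not the exact
ffmpeg C code):

    #include <math.h>

    /* Illustrative scalar sketch of the PVQ search below; names are
     * made up, not the exact ffmpeg implementation. */
    static float pvq_search_ref(const float *X, int *y, int K, int N)
    {
        float Sx = 0.0f, Sxy = 0.0f, Syy = 0.0f;
        int i, pulses = 0;

        for (i = 0; i < N; i++)
            Sx += fabsf(X[i]);

        if (Sx == 0.0f) {        /* all-zero input is special-cased */
            for (i = 0; i < N; i++)
                y[i] = 0;
            return 1.0f;
        }

        /* Presearch: place most pulses in one pass, rounding to nearest. */
        for (i = 0; i < N; i++) {
            y[i]    = (int)lrintf(fabsf(X[i]) * K / Sx);
            pulses += y[i];
            Sxy    += fabsf(X[i]) * y[i];
            Syy    += (float)y[i] * y[i];
        }

        /* Greedy correction: add or remove one pulse at a time, picking
         * the position that maximizes p = Sxy_new^2 / Syy_new. */
        while (pulses != K) {
            int   dir   = (pulses < K) ? +1 : -1;
            int   best  = -1;
            float p_max = 0.0f;
            for (i = 0; i < N; i++) {
                float Sxy_new, Syy_new, p;
                if (dir < 0 && y[i] == 0)
                    continue;                   /* no pulse here to remove */
                Sxy_new = Sxy + dir * fabsf(X[i]);
                Syy_new = Syy + dir * 2.0f * y[i] + 1.0f;  /* (y+dir)^2 - y^2 */
                p       = Sxy_new * Sxy_new / Syy_new;
                if (p > p_max) {
                    p_max = p;
                    best  = i;
                }
            }
            Sxy     += dir * fabsf(X[best]);
            Syy     += dir * 2.0f * y[best] + 1.0f;
            y[best] += dir;
            pulses  += dir;
        }

        /* Restore the signs of the input. */
        for (i = 0; i < N; i++)
            if (X[i] < 0.0f)
                y[i] = -y[i];

        return Syy;    /* the caller uses this to normalize the gain */
    }

The asm differs in the details (it works on Syy/2, keeps the running
sums in registers and vectorizes the argmax loop), but the data flow
is the same.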
Also here are the updated benchmarks:

/===========================================================
SkyLake i7 6700HQ
//v5
1814 1817 1827  //SSE4
1818 1834 1838  //AVX default
1828 1833 1837  //AVX2 xmm
1896 1897 1905  //AVX2 default
1896 1904 1905  //AVX2 cmpps

Best Regards

From 67c93689fab23386cb4d08d4a74d3e804f07c4f4 Mon Sep 17 00:00:00 2001
From: Ivan Kalvachev
Date: Thu, 8 Jun 2017 22:24:33 +0300
Subject: [PATCH 1/5] SIMD opus pvq_search implementation

An explanation of the workings and methods used by the Pyramid Vector
Quantization Search function can be found in the following
Work-In-Progress mail threads:
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212146.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212816.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213030.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213436.html

Signed-off-by: Ivan Kalvachev
---
 libavcodec/opus_pvq.c              |   3 +
 libavcodec/opus_pvq.h              |   5 +-
 libavcodec/x86/Makefile            |   2 +
 libavcodec/x86/opus_dsp_init.c     |  43 +++
 libavcodec/x86/opus_pvq_search.asm | 538 +++++++++++++++++++++++++++++++++++++
 5 files changed, 589 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/x86/opus_dsp_init.c
 create mode 100644 libavcodec/x86/opus_pvq_search.asm

diff --git a/libavcodec/opus_pvq.c b/libavcodec/opus_pvq.c
index 2ac66a0ede..3aa502929c 100644
--- a/libavcodec/opus_pvq.c
+++ b/libavcodec/opus_pvq.c
@@ -947,6 +947,9 @@ int av_cold ff_celt_pvq_init(CeltPVQ **pvq)
     s->encode_band        = pvq_encode_band;
     s->band_cost          = pvq_band_cost;
 
+    if (ARCH_X86 && CONFIG_OPUS_ENCODER)
+        ff_opus_dsp_init_x86(s);
+
     *pvq = s;
 
     return 0;
diff --git a/libavcodec/opus_pvq.h b/libavcodec/opus_pvq.h
index 6691494838..9246337360 100644
--- a/libavcodec/opus_pvq.h
+++ b/libavcodec/opus_pvq.h
@@ -33,8 +33,8 @@ float *lowband_scratch, int fill)
 
 struct CeltPVQ {
-    DECLARE_ALIGNED(32, int,   qcoeff      )[176];
-    DECLARE_ALIGNED(32, float, hadamard_tmp)[176];
+    DECLARE_ALIGNED(32, int,   qcoeff      )[256];
+    DECLARE_ALIGNED(32, float, hadamard_tmp)[256];
 
     float (*pvq_search)(float *X, int *y, int K, int N);
 
@@ -45,6 +45,7 @@ struct CeltPVQ {
 };
 
 int  ff_celt_pvq_init  (struct CeltPVQ **pvq);
+void ff_opus_dsp_init_x86(struct CeltPVQ *s);
 void ff_celt_pvq_uninit(struct CeltPVQ **pvq);
 
 #endif /* AVCODEC_OPUS_PVQ_H */
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 0dbc46504e..9875f48797 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -52,6 +52,7 @@ OBJS-$(CONFIG_APNG_DECODER)            += x86/pngdsp_init.o
 OBJS-$(CONFIG_CAVS_DECODER)            += x86/cavsdsp.o
 OBJS-$(CONFIG_DCA_DECODER)             += x86/dcadsp_init.o x86/synth_filter_init.o
 OBJS-$(CONFIG_DNXHD_ENCODER)           += x86/dnxhdenc_init.o
+OBJS-$(CONFIG_OPUS_ENCODER)            += x86/opus_dsp_init.o
 OBJS-$(CONFIG_HEVC_DECODER)            += x86/hevcdsp_init.o
 OBJS-$(CONFIG_JPEG2000_DECODER)        += x86/jpeg2000dsp_init.o
 OBJS-$(CONFIG_MLP_DECODER)             += x86/mlpdsp_init.o
@@ -123,6 +124,7 @@ X86ASM-OBJS-$(CONFIG_MDCT15)           += x86/mdct15.o
 X86ASM-OBJS-$(CONFIG_ME_CMP)           += x86/me_cmp.o
 X86ASM-OBJS-$(CONFIG_MPEGAUDIODSP)     += x86/imdct36.o
 X86ASM-OBJS-$(CONFIG_MPEGVIDEOENC)     += x86/mpegvideoencdsp.o
+X86ASM-OBJS-$(CONFIG_OPUS_ENCODER)     += x86/opus_pvq_search.o
 X86ASM-OBJS-$(CONFIG_PIXBLOCKDSP)      += x86/pixblockdsp.o
 X86ASM-OBJS-$(CONFIG_QPELDSP)          += x86/qpeldsp.o \
                                           x86/fpel.o    \
diff --git a/libavcodec/x86/opus_dsp_init.c b/libavcodec/x86/opus_dsp_init.c
new file mode 100644
index 0000000000..f4c25822db
--- /dev/null
+++ b/libavcodec/x86/opus_dsp_init.c
@@ -0,0 +1,43 @@
+/*
+ * Opus encoder assembly optimizations
+ * Copyright (C) 2017 Ivan Kalvachev
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/opus_pvq.h"
+
+extern float ff_pvq_search_sse2(float *X, int *y, int K, int N);
+extern float ff_pvq_search_sse4(float *X, int *y, int K, int N);
+extern float ff_pvq_search_avx (float *X, int *y, int K, int N);
+
+av_cold void ff_opus_dsp_init_x86(CeltPVQ *s)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_SSE2(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse2;
+
+    if (EXTERNAL_SSE4(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse4;
+
+    if (EXTERNAL_AVX(cpu_flags))
+        s->pvq_search = ff_pvq_search_avx;
+}
diff --git a/libavcodec/x86/opus_pvq_search.asm b/libavcodec/x86/opus_pvq_search.asm
new file mode 100644
index 0000000000..359ad7a8a4
--- /dev/null
+++ b/libavcodec/x86/opus_pvq_search.asm
@@ -0,0 +1,538 @@
+;******************************************************************************
+;* SIMD optimized Opus encoder DSP function
+;*
+;* Copyright (C) 2017 Ivan Kalvachev
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "config.asm"
+%include "libavutil/x86/x86util.asm"
+
+%ifdef __NASM_VER__
+%use "smartalign"
+ALIGNMODE p6
+%endif
+
+; Use float ops that give half precision but execute in around 3 cycles.
+; On Skylake & Ryzen the division is much faster (around 11c/3),
+; which makes the full precision code only about 2% slower.
+; Opus itself also uses the rsqrt approximation in its intrinsics code.
+%define USE_APPROXIMATION   01
+
+; Presearch tries to quantize by using the property Sum( abs(x[i]*K)/Sx ) = K.
+; If we use truncation for the division result then the integer Sum would be <= K.
+; By using nearest rounding we get a closer approximation, but
+; we could also get more than K pulses and then have to remove the extra ones.
+; This method is the main improvement of the ffmpeg C function over the Opus original.
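+; For example (illustrative numbers, not from the patch): with
+; X = {5.0, 2.0, 1.0} (Sx = 8.0) and K = 7, the scaled values X[i]*K/Sx
+; are {4.375, 1.75, 0.875}. Truncation yields y = {4, 1, 0}, only 5 pulses,
+; leaving 2 to be added one by one; nearest rounding yields y = {4, 2, 1},
+; exactly 7 pulses. Rounding may also overshoot K, which is why the search
+; below must be able to remove pulses as well as add them.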
+%define PRESEARCH_ROUNDING  01
+
+
+
+SECTION_RODATA 64
+
+const_float_abs_mask:   times 8 dd 0x7fffffff
+const_align_abs_edge:   times 8 dd 0
+
+const_float_0_5:        times 8 dd 0.5
+const_float_1:          times 8 dd 1.0
+const_float_sign_mask:  times 8 dd 0x80000000
+
+const_int32_offsets:
+                        %rep 8
+                                dd $-const_int32_offsets
+                        %endrep
+SECTION .text
+
+
+
+
+;
+; Horizontal Sum Packed Single precision float
+; The resulting sum is in all elements.
+;
+%macro HSUMPS 2 ; dst/src, tmp
+%if cpuflag(avx)
+  %if sizeof%1>=32  ; avx
+        vperm2f128  %2,   %1,   %1,   (0)*16+(1)  ; %2=lo(%1)<<128+hi(%1)
+        vaddps      %1,   %2
+  %endif
+        vshufps     %2,   %1,   %1,   q1032
+        vaddps      %1,   %2
+        vshufps     %2,   %1,   %1,   q0321
+        vaddps      %1,   %2
+
+%else  ; this form is a bit faster than the short avx-like emulation.
+        movaps      %2,   %1            ;    [d,       c,       b,       a      ]
+        shufps      %1,   %1,   q1032   ; %1=[b,       a,       d,       c      ]
+        addps       %1,   %2            ; %1=[b+d,     a+c,     d+b,     c+a    ]
+        movaps      %2,   %1
+        shufps      %1,   %1,   q0321   ; %1=[c+a,     b+d,     a+c,     d+b    ]
+        addps       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a]
+        ; all %1 members should be equal for as long as float a+b==b+a
+%endif
+%endmacro
+
+;
+; Emulate blendvps if not available
+;
+;   src_b is destroyed when using emulation with logical operands
+;   the SSE41 blend instruction is hard coded to use xmm0 as mask
+;
+%macro EMU_blendvps 3 ; dst/src_a, src_b, mask
+%if cpuflag(avx)
+        vblendvps %1, %1, %2, %3
+;-------------------------
+%elif cpuflag(sse4)
+    %if notcpuflag(avx)
+        %ifnidn %3,xmm0
+          %error sse41 blendvps uses xmm0 as default 3rd operand, you used %3
+        %endif
+    %endif
+        blendvps %1, %2, %3
+;-------------------------
+%else
+        xorps  %2, %1
+        andps  %2, %3
+        xorps  %1, %2
+%endif
+%endmacro
+
+;
+; Emulate pblendvb if not available
+;
+;   src_b is destroyed when using emulation with logical operands
+;   the SSE41 blend instruction is hard coded to use xmm0 as mask
+;
+%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask
+%if cpuflag(avx)
+    %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
+        %error AVX1 does not support integer 256 bit ymm operations
+    %endif
+
+        vpblendvb %1, %1, %2, %3
+;-------------------------
+%elif cpuflag(sse4)
+    %ifnidn %3,xmm0
+        %error sse41 pblendvb uses xmm0 as default 3rd operand, you used %3
+    %endif
+        pblendvb %1, %2, %3
+;-------------------------
+%else
+        pxor  %2, %1
+        pand  %2, %3
+        pxor  %1, %2
+%endif
+%endmacro
+
+;
+; Emulate broadcastss if not available
+;
+%macro EMU_broadcastss 2 ; dst, src
+%if cpuflag(avx2)
+        vbroadcastss  %1, %2                ; ymm, xmm
+;-------------------------
+%elif cpuflag(avx)
+    %ifnum sizeof%2                         ; avx1 register
+        vpermilps    xmm%1, xmm%2, q0000    ; xmm, xmm, imm || ymm, ymm, imm
+      %if sizeof%1 >= 32                    ; mmsize>=32
+        vinsertf128  %1, %1, xmm%1, 1       ; ymm, ymm, xmm, imm
+      %endif
+    %else                                   ; avx1 memory
+        vbroadcastss %1, %2                 ; ymm, mm32 || xmm, mm32
+    %endif
+;-------------------------
+%else
+    %ifnum sizeof%2                         ; sse register
+        shufps       %1, %2, %2, q0000
+    %else                                   ; sse memory
+        movss        %1, %2                 ; this zeroes the other 3 elements
+        shufps       %1, %1, 0
+    %endif
+%endif
+%endmacro
+
+;
+; Emulate pbroadcastd if not available
+;
+%macro EMU_pbroadcastd 2 ; dst, src
+%if cpuflag(avx2)
+        vpbroadcastd %1, %2
+%elif cpuflag(avx) && mmsize >= 32
+        %error AVX1 does not support integer 256 bit ymm operations
+%else
+    %ifnum sizeof%2                         ; sse2 register
+        pshufd       %1, %2, q0000
+    %else                                   ; sse memory
+        movd         %1, %2                 ; movd zero extends
+        pshufd       %1, %1, 0
+    %endif
+%endif
+%endmacro
+
+;
+; Setup a high register to be used
+; for holding a memory constant
+;
+; %1 - the load instruction (movop)
+; %2 - the register to be used; assumes it is >= mm8
+; %3 - name of the constant
+;
+; Subsequent opcodes are going to use the constant in the form
+; "addps m0, mm_const_name" and it would be turned into:
+; "addps m0, [const_name]" on 32 bit arch or
+; "addps m0, m8" on 64 bit arch
+%macro SET_HI_REG_MM_CONSTANT 3 ; movop, reg, const_name
+%if num_mmregs > 8
+    %define  mm_%3  %2
+    %{1}     %2, [%3]               ; movaps m8, [const_name]
+%else
+    %define  mm_%3 [%3]
+%endif
+%endmacro
+
+;
+; Set
+;   Position Independent Code
+;     Base address of a constant
+;
+; %1 - the lea instruction to be used, if PIC is set
+; %2 - the register to be used
+; %3 - name of the constant
+;
+; Subsequent opcodes are going to use the base address in the form
+; "movaps m0, [pic_base_constant_name+r4q]" and it would be turned into
+; "movaps m0, [r5q + r4q]" if PIC is enabled
+; "movaps m0, [constant_name + r4q]" if textrels are used
+%macro SET_PIC_BASE 3 ; lea, reg, const_label
+%ifdef PIC
+        %{1}     %2, [%3]           ; lea r5q, [rip+const]
+        %define  pic_base_%3 %2
+%else
+        %define  pic_base_%3 %3
+%endif
+%endmacro
+
+
+
+
+%macro PULSES_SEARCH 1
+; m6 Syy_norm
+; m7 Sxy_norm
+        addps        m6,   mm_const_float_0_5 ; Syy_norm += 1.0/2
+        pxor         m1,   m1           ; max_idx
+        xorps        m3,   m3           ; p_max
+        xor         r4q,  r4q
+align 16
+%%distortion_search:
+        movd       xmm2,  dword r4d     ; movd zero extends
+%ifidn %1,add
+        movaps       m4,  [tmpY + r4q]  ; y[i]
+        movaps       m5,  [tmpX + r4q]  ; X[i]
+
+  %if USE_APPROXIMATION == 1
+        xorps        m0,   m0
+        cmpps        m0,   m0,   m5, 4  ; m0 = (X[i] != 0.0)
+  %endif
+
+        addps        m4,   m6           ; m4 = Syy_new = y[i] + Syy_norm
+        addps        m5,   m7           ; m5 = Sxy_new = X[i] + Sxy_norm
+
+  %if USE_APPROXIMATION == 1
+        andps        m5,   m0           ; if (X[i] == 0) Sxy_new = 0; prevents the approximation error from setting pulses in the array padding.
+  %endif
+
+%else
+        movaps       m5,  [tmpY + r4q]  ; m5 = y[i]
+
+        xorps        m0,   m0           ; m0 = 0;
+        cmpps        m0,   m0,   m5, 1  ; m0 = (0 < y)
+
+        subps        m4,   m6,   m5     ; m4 = Syy_new = Syy_norm - y[i]
+        subps        m5,   m7,  [tmpX + r4q] ; m5 = Sxy_new = Sxy_norm - X[i]
+        andps        m5,   m0           ; (0 < y) ? m5 : 0
+%endif
+
+%if USE_APPROXIMATION == 1
+        rcpps        m4,   m4           ; m4 = approx(1/Syy_new)
+        mulps        m5,   m5           ; m5 = Sxy_new*Sxy_new
+        mulps        m5,   m4           ; m5 = p = Sxy_new*Sxy_new*approx(1/Syy_new)
+%else
+        mulps        m5,   m5
+        divps        m5,   m4           ; m5 = p = Sxy_new*Sxy_new/Syy_new
+%endif
+        EMU_pbroadcastd m2, xmm2        ; m2 = i (all lanes; the per-lane offsets are added after the loop)
+        cmpps        m0,   m3,   m5, 1  ; m0 = (p > p_max)
+        maxps        m3,   m5           ; m3 = max(p_max, p)
+                                        ; maxps here is faster than blendvps, despite blend having lower latency.
+
+        pand         m2,   m0           ; This version seems faster than sse41 pblendvb
+        pmaxsw       m1,   m2           ; SSE2 signed word, so it would work for N < 32768/4
+
+        add         r4q,   mmsize
+        cmp         r4q,   Nq
+        jb %%distortion_search
+
+        por          m1,   mm_const_int32_offsets ; max_idx offsets per individual lane (skipped in the inner loop)
+        movdqa       m4,   m1           ; needed for the aligned y[max_idx]+=1; processing
+
+%if mmsize >= 32
+; Merge parallel maximums round 8 (4 vs 4)
+
+        vextractf128 xmm5, ymm3, 1      ; xmm5 = ymm3[1x128] = ymm3[255..128b]
+        vcmpps       xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[0x128] < p[1x128] )
+
+        vextracti128 xmm2, ymm1, 1      ; xmm2 = ymm1[1x128] = ymm1[255..128b]
+        EMU_blendvps xmm3, xmm5, xmm0   ; p       = m0 ? p[1x128]       : p[0x128]
+        EMU_pblendvb xmm1, xmm2, xmm0   ; max_idx = m0 ? max_idx[1x128] : max_idx[0x128]
+%endif
+
+; Merge parallel maximums round 4 (2 vs 2)
+                                        ; m3=p[3210]
+        movhlps      xmm5, xmm3         ; m5=p[xx32]
+        cmpps        xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[1,0] < p[3,2] )
+
+        pshufd       xmm2, xmm1, q3232
+        EMU_blendvps xmm3, xmm5, xmm0   ; p       = m0 ? p[3,2]       : p[1,0]
+        EMU_pblendvb xmm1, xmm2, xmm0   ; max_idx = m0 ? max_idx[3,2] : max_idx[1,0]
+
+; Merge parallel maximums final round (1 vs 1)
+        movaps       xmm0, xmm3         ; m0 = m3[0] = p[0]
+        shufps       xmm3, xmm3, q1111  ; m3 = m3[1] = p[1]
+        cmpless      xmm0, xmm3
+
+        pshufd       xmm2, xmm1, q1111
+        EMU_pblendvb xmm1, xmm2, xmm0   ; max_idx = m0 ? max_idx[1] : max_idx[0]
+
+        movd   dword  r4d, xmm1         ; zero extends to the rest of r4q
+
+        EMU_broadcastss m3, [tmpX + r4q]
+        %{1}ps       m7,   m3           ; Sxy += X[max_idx]
+
+        EMU_broadcastss m5, [tmpY + r4q]
+        %{1}ps       m6,   m5           ; Syy += Y[max_idx]
+
+        ; We have to update a single element in Y[i].
+        ; However writing 4 bytes and then doing a 16 byte load in the inner loop
+        ; could cause a stall due to breaking write forwarding.
+        EMU_pbroadcastd m1, xmm1
+        pcmpeqd      m1,   m1,   m4     ; exactly 1 element matches max_idx and this finds it
+
+        and         r4q,  ~(mmsize-1)   ; align address down, so the value pointed to by max_idx is inside a mmsize load
+        movaps       m5,  [tmpY + r4q]  ; m5 = Y[y3...ym...y0]
+        andps        m1,   mm_const_float_1 ; m1 =   [ 0...1.0...0]
+        %{1}ps       m5,   m1           ; m5 = Y[y3...ym...y0] +/- [0...1.0...0]
+        movaps      [tmpY + r4q], m5    ; Y[max_idx] +-= 1.0;
+%endmacro
+
+;
+; We need one more register for
+; PIC relative addressing. Use this
+; to count it in cglobal
+;
+%ifdef PIC
+  %define num_pic_regs 1
+%else
+  %define num_pic_regs 0
+%endif
+
+;
+; In x86_64 mode a few more xmm registers are used to hold constants.
+; Use this to count them in cglobal
+;
+%if num_mmregs > 8
+%define num_hireg_used 3
+%endif
+
+;
+; Pyramid Vector Quantization Search implementation
+;
+; float * inX  - Unaligned (SIMD) access; it will be overread,
+;                but the extra data is masked away.
+; int32 * outY - Should be an aligned and padded buffer.
+;                It is used as a temp buffer.
+; uint32 K     - Number of pulses to have after quantization.
+; uint32 N     - Number of vector elements. Must be 0 < N < 256
+;
+%macro PVQ_FAST_SEARCH 0
+cglobal pvq_search, 4, 5+num_pic_regs, 8+num_hireg_used, 256*4, inX, outY, K, N
+%define tmpX rsp
+%define tmpY outYq
+
+
+        movaps       m0,  [const_float_abs_mask]
+        shl          Nd,   2            ; N *= sizeof(float); also the 32 bit operation zeroes the high 32 bits in 64 bit mode.
+        mov         r4q,   Nq
+
+        neg         r4q
+        and         r4q,   mmsize-1
+
+        SET_PIC_BASE lea, r5q, const_align_abs_edge ; rip+const
+        movups       m3,  [pic_base_const_align_abs_edge + r4q - mmsize]
+
+        add          Nq,   r4q          ; Nq = align(Nq, mmsize)
+
+        lea         r4q,  [Nq - mmsize] ; Nq is rounded up (aligned up) to mmsize, so r4q can't become negative here, unless N=0.
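+; An illustrative trace (numbers not from the patch): for mmsize = 16 and
+; N = 6 floats, Nd = 24 bytes and r4q = (-24) & 15 = 8 padding bytes.
+; m3 is then loaded from const_align_abs_edge-8, i.e. it straddles the end
+; of const_float_abs_mask, giving {abs_mask, abs_mask, 0, 0}. The first
+; (unaligned tail) chunk below is masked with m3, so the two overread
+; lanes are zeroed and do not contribute to Sx.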
+        movups       m2,  [inXq + r4q]
+        andps        m2,   m3
+        movaps      [tmpX + r4q], m2
+        movaps       m1,   m2           ; Sx
+
+align 16
+%%loop_abs_sum:
+        sub         r4q,   mmsize
+        jc %%end_loop_abs_sum
+
+        movups       m2,  [inXq + r4q]
+        andps        m2,   m0
+
+        movaps      [tmpX + r4q], m2    ; tmpX[i]=abs(X[i])
+        addps        m1,   m2
+        jmp %%loop_abs_sum
+
+align 16
+%%end_loop_abs_sum:
+
+        HSUMPS       m1,   m2           ; m1 = Sx
+
+        xorps        m0,   m0
+        comiss     xmm0,  xmm1
+        jz %%zero_input                 ; if (Sx == 0) goto zero_input
+
+        cvtsi2ss   xmm0,  dword Kd      ; m0 = K
+%if USE_APPROXIMATION == 1
+        rcpss      xmm1,  xmm1          ; m1 = approx(1/Sx)
+        mulss      xmm0,  xmm1          ; m0 = K*(1/Sx)
+%else
+        divss      xmm0,  xmm1          ; b = K/Sx
+                                        ; b = K/max_x
+%endif
+
+        EMU_broadcastss m0, xmm0
+
+        lea         r4q,  [Nq - mmsize]
+        pxor         m5,   m5           ; Sy    ( Sum of abs( y[i]) )
+        xorps        m6,   m6           ; Syy   ( Sum of y[i]*y[i]  )
+        xorps        m7,   m7           ; Sxy   ( Sum of X[i]*y[i]  )
+align 16
+%%loop_guess:
+        movaps       m1,  [tmpX + r4q]  ; m1 = X[i]
+        mulps        m2,   m0,   m1     ; m2 = res*X[i]
+  %if PRESEARCH_ROUNDING == 0
+        cvttps2dq    m2,   m2           ; yt = (int)truncf( res*X[i] )
+  %else
+        cvtps2dq     m2,   m2           ; yt = (int)lrintf( res*X[i] )
+  %endif
+        paddd        m5,   m2           ; Sy += yt
+        cvtdq2ps     m2,   m2           ; yt = (float)yt
+        mulps        m1,   m2           ; m1 = X[i]*yt
+        movaps      [tmpY + r4q], m2    ; y[i] = m2
+        addps        m7,   m1           ; Sxy += m1;
+        mulps        m2,   m2           ; m2 = yt*yt
+        addps        m6,   m2           ; Syy += m2
+
+        sub         r4q,   mmsize
+        jnc %%loop_guess
+
+        HSUMPS       m6,   m1           ; Syy_norm
+        HADDD        m5,   m4           ; pulses
+
+        movd   dword  r4d, xmm5         ; zero extends to the rest of r4q
+
+        sub          Kd,   r4d          ; K -= pulses; also the 32 bit operation zeroes the high 32 bits in 64 bit mode.
+        jz %%finish                     ; K - pulses == 0
+
+        SET_HI_REG_MM_CONSTANT movaps, m8,  const_float_0_5
+        SET_HI_REG_MM_CONSTANT movaps, m9,  const_float_1
+        SET_HI_REG_MM_CONSTANT movdqa, m10, const_int32_offsets
+        ; Use Syy/2 in the distortion parameter calculations.
+        ; Saves pre- and post-calculation to correct the Y[] values.
+        ; Same precision, since the float mantissa is normalized.
+        ; The SQRT approximation does differ.
+        HSUMPS       m7,   m0           ; Sxy_norm
+        mulps        m6,   mm_const_float_0_5
+
+        jc %%remove_pulses_loop         ; K - pulses < 0
+
+align 16                                ; K - pulses > 0
+%%add_pulses_loop:
+
+        PULSES_SEARCH add               ; m6 Syy_norm ; m7 Sxy_norm
+
+        sub          Kd,   1
+        jnz %%add_pulses_loop
+
+        addps        m6,   m6           ; Syy*=2
+
+        jmp %%finish
+
+align 16
+%%remove_pulses_loop:
+
+        PULSES_SEARCH sub               ; m6 Syy_norm ; m7 Sxy_norm
+
+        add          Kd,   1
+        jnz %%remove_pulses_loop
+
+        addps        m6,   m6           ; Syy*=2
+
+align 16
+%%finish:
+        lea         r4q,  [Nq - mmsize]
+
+        movaps       m2,  [const_float_sign_mask]
+align 16
+%%restore_sign_loop:
+        movaps       m0,  [tmpY + r4q]  ; m0 = Y[i]
+        movups       m1,  [inXq + r4q]  ; m1 = X[i]
+        andps        m1,   m2           ; m1 = sign(X[i])
+        orps         m0,   m1           ; m0 = Y[i]*sign
+        cvtps2dq     m3,   m0           ; m3 = (int)m0
+        movaps      [outYq + r4q], m3
+
+        sub         r4q,   mmsize
+        jnc %%restore_sign_loop
+%%return:
+
+%if ARCH_X86_64 == 0    ; sbrdsp
+        movss       r0m,  xmm6          ; return (float)Syy_norm
+        fld  dword  r0m
+%else
+        movaps       m0,   m6           ; return (float)Syy_norm
+%endif
+        RET
+
+align 16
+%%zero_input:
+        lea         r4q,  [Nq - mmsize]
+        xorps        m0,   m0
+%%zero_loop:
+        movaps      [outYq + r4q], m0
+        sub         r4q,   mmsize
+        jnc %%zero_loop
+
+        movaps       m6,  [const_float_1]
+        jmp %%return
+%endmacro
+
+
+INIT_XMM sse2
+PVQ_FAST_SEARCH
+
+INIT_XMM sse4
+PVQ_FAST_SEARCH
+
+INIT_XMM avx
+PVQ_FAST_SEARCH
-- 
2.13.2