From patchwork Wed Jul 26 14:56:28 2017
X-Patchwork-Submitter: Ivan Kalvachev
X-Patchwork-Id: 4464
From: Ivan Kalvachev
Date: Wed, 26 Jul 2017 17:56:28 +0300
To: FFmpeg development discussions and patches
Subject: [FFmpeg-devel] [PATCH]v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

On 7/24/17, Ivan Kalvachev wrote:
> On 7/23/17, Rostislav Pehlivanov wrote:
>> On 22 July 2017 at 12:18, Ivan Kalvachev wrote:
>>
>>> This patch is ready for review and inclusion.
>>>
>>>
>>>
>>> +%macro emu_blendvps 3 ; dst/src_a, src_b, mask
>>> +%macro lea_pic_base 2 ; reg, const_label
>>
>> Capitalize macro names.
>
> Done for the latter.
> About the former...
>
> I do not like to use capitalization for the emulated instructions:
> 1. In C, macros are always in capital letters to separate them from
> functions. In asm, functions cannot be mistaken for macros.
>
> 2. All instructions are lowercase, even the ones that are overloaded
> by x86asm.h macros.
>
> 3. There are already about 30 macros with lowercase names in
> libavcodec/x86. The rule is not absolute.
>
> 4. There are some emulated instructions in x86util that have all-uppercase
> names, and I find them very ugly when I see them in the code.
> This is why I went the "emu_" route.
> I actually want to encourage using emu_ for them too (as
> alternative names).
>
> 5. There is nothing guaranteeing that the assembler
> is case sensitive.
> Actually, the nasm manual mentions that MASM (on which
> it is based) is not case-sensitive (by default).
> While nasm and yasm are case sensitive,
> there are still %idefine and %imacro that
> could create some confusion.
>
> Anyway.
>
> Would you accept a change where the emulation macros
> are moved to a separate file and the macros are
> renamed to EMU_, aka EMU_blendvps?
> I'm thinking of "libavcodec/x86/x86emu.asm".
>
> Or maybe they should be put in libavutil/x86/x86util.asm;
> however, that could open a new can of worms,
> aka using the ugly uppercase-only names.
>
>>> +          %error sse41 blendvps uses xmm0 as default 3rd operand, you used %3
>>
>> Remove this error
>
> I'd like to keep that one.
> It helps finding out why the code does not compile.
>
> Sometimes it may not be obvious, since SWAP might
> change the registers under the hood.
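> For illustration, a minimal hypothetical sketch of the failure mode that
> %error turns into a readable build-time message (the register choices
> here are made up):
>
>     INIT_XMM sse4
>     SWAP 0, 4                  ; x86inc: the name "m0" now maps to xmm4
>     ...
>     EMU_blendvps m2, m3, m0    ; m0 is no longer xmm0, but the plain
>                                ; sse4.1 blendvps takes its mask from
>                                ; xmm0 only, so the macro stops the build
>                                ; here instead of letting the code blend
>                                ; with the wrong mask register
>
> Without the check, the mismatch is easy to miss precisely because SWAP
> renames registers behind the scenes.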
>>> +        %error "blendvps" emulation needs SSE
>>
>> Definitely remove this error too.
>
> Done
>
>>> +        %error AVX1 does not support integer 256 bit ymm operations
>>
>> And this one is particularly pointless...
>
> On the contrary. AVX1 does support 256 bit ymm floating point operations,
> but not integer ones. It is quite a trivial mistake to use one with the
> other.
>
> What is worse, without this the wrong code would compile
> without any hint of error, because avx2 ops are
> not overloaded by the current x86asm.h,
> so you won't get a warning that you are using avx2 ops in an avx1 section.
>
>>> +        %error sse41 blendvps uses xmm0 as default 3rd operand, you used %3
>>
>> Same...
>
> Again, I'd like to keep that one.
>
>>> +        %error "broadcastss" emulation needs SSE
>>
>> Same...
>
> Done
>
>>> +        %error "pbroadcastd" emulation needs SSE2
>>
>> Yep, the same...
>
> Done
>
>>> +        %error pick HSUMPS implementation
>>
>> Again...
>
> I thought I had taken care of that.
> Done
>
>> All of these are pointless and unnecessary. Always assume at least SSE2.
>
> NEVER assume anything (that you can check).
> Making wrong assumptions is the easiest way
> to make mistakes that are
> very hard to notice, find and correct.
>
>>> +
>>> +; 256bit version seems slower than the 128bit AVX on all current CPU's
>>> +; So disable it for now and keep the code in case something changes in future.
>>> +%if HAVE_AVX2_EXTERNAL > 0 && 0
>>> +INIT_YMM avx2
>>> +PVQ_FAST_SEARCH
>>> +%endif
>>
>> Remove this altogether.
>
> I think this would be a mistake.
>
> BTW, would you do proper benchmarks with the current
> code and post the results? There is a chance that the
> code has improved.
> (Please, use a real audio sample, not generated noise.)
>
> I also asked you to try something (changing "cmpless" to "cmpps"),
> which you totally forgot to do.
> (IACA says there is no penalty in my code for using this SSE op in an
> AVX2 block, but as I said, never assume.)
>
>>> +%if 1
>>> +        emu_pbroadcastd m1, xmm1
>>
>> ...
>
>>> +%else
>>> +        ; Ryzen is the only CPU at the moment that doesn't have
>>> +        ; write forwarding stall and where this code is faster,
>>> +        ; by about 7 cycles on average.
>>> +        %{1}ps  m5, mm_const_float_1
>>> +        movss   [tmpY + r4q], xmm5      ; Y[max_idx] +-= 1.0;
>>
>> Remove the %else and always use the first bit of code.
>
> I don't like removing code that is faster on a new CPU.
> But since you insist.
> Done.
>
>>> +%if cpuflag(avx2) && 01
>>> +%elif cpuflag(avx) && 01
>>> +%if cpuflag(avx2) && 01
>>
>> Remove the && 01 in these 3 places.
>
> Done
>
>>> +;       vperm2f128   %1, %1, %1, 0   ; ymm, ymm, ymm ; 2-3c 1/1
>>
>> Remove this commented out code.
>
> Done
>
>>> +cglobal pvq_search, 4, 5+num_pic_regs, 11, 256*4, inX, outY, K, N
>>> +%if ARCH_X86_64 == 0    ; sbrdsp
>>
>> You're using more than 11 xmm regs, so the entire code is always going to
>> need a 64 bit system.
>> So wrap the entire file, from top to bottom after the %include, in
>> %if ARCH_X86_64
>> ...
>> %endif
>
> I'm using exactly 8 registers on 32 bit archs.
> I'm using 3 extra registers on 64 bit ones to hold some constants.
>
> If you insist, I'll write some ifdef-ery to
> always supply the correct number of registers.
> From what I've seen in x86asm, the >8 case is only checked on WIN64.
>
>>> +SECTION_RODATA 64
>>> +
>>> +const_float_abs_mask:   times 8 dd 0x7fffffff
>>> +const_align_abs_edge:   times 8 dd 0
>>> +
>>> +const_float_0_5:        times 8 dd 0.5
>>> +const_float_1:          times 8 dd 1.0
>>> +const_float_sign_mask:  times 8 dd 0x80000000
>>> +
>>> +const_int32_offsets:
>>> +        %rep 8
>>> +                dd $-const_int32_offsets
>>> +        %endrep
>>> +SECTION .text
>>
>> This whole thing needs to go right at the very top after the %include. All
>> macros must then follow it.
>> Having read-only sections in the middle of the file is very confusing and
>> not at all what all of our asm code follows.
>
> It's very simple:
> pre-processor, constants, code.
> Your confusion comes from the fact that
> the code templates are done by macros.
>
> As I've told you before, the macros before
> the constants belong in a separate file.
>
> Also, why are you so obsessed
> with constants being the first thing in a file?
> You are not supposed to mess with them
> on a regular basis.
>
> Constant labels must be before the code that
> uses them, but because of code templates
> they could be literally a few lines before the end of the file.

In this version I removed the AVX2 disabled code and moved the
constants before the macros (but after the defines).
The emulations remain in this code, for now.

I've also used ifdef-ery to always supply the correct number of used
xmm registers to "cglobal" (and moved the relevant macros closer to it).
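For anyone reviewing the asm below, here is a rough scalar C sketch of
the searched algorithm (hedged: the function name and details are
illustrative, reconstructed from the patch description, not the exact
ffmpeg C code):

    #include <math.h>

    /* Illustrative scalar sketch of the PVQ search below; names are
     * made up, not the exact ffmpeg implementation. */
    static float pvq_search_ref(const float *X, int *y, int K, int N)
    {
        float Sx = 0.0f, Sxy = 0.0f, Syy = 0.0f;
        int i, pulses = 0;

        for (i = 0; i < N; i++)
            Sx += fabsf(X[i]);

        if (Sx == 0.0f) {        /* all-zero input is special-cased */
            for (i = 0; i < N; i++)
                y[i] = 0;
            return 1.0f;
        }

        /* Presearch: place most pulses in one pass, rounding to nearest. */
        for (i = 0; i < N; i++) {
            y[i]    = (int)lrintf(fabsf(X[i]) * K / Sx);
            pulses += y[i];
            Sxy    += fabsf(X[i]) * y[i];
            Syy    += (float)y[i] * y[i];
        }

        /* Greedy correction: add or remove one pulse at a time, picking
         * the position that maximizes p = Sxy_new^2 / Syy_new. */
        while (pulses != K) {
            int   dir   = (pulses < K) ? +1 : -1;
            int   best  = -1;
            float p_max = 0.0f;
            for (i = 0; i < N; i++) {
                float Sxy_new, Syy_new, p;
                if (dir < 0 && y[i] == 0)
                    continue;                   /* no pulse here to remove */
                Sxy_new = Sxy + dir * fabsf(X[i]);
                Syy_new = Syy + dir * 2.0f * y[i] + 1.0f;  /* (y+dir)^2 - y^2 */
                p       = Sxy_new * Sxy_new / Syy_new;
                if (p > p_max) {
                    p_max = p;
                    best  = i;
                }
            }
            Sxy     += dir * fabsf(X[best]);
            Syy     += dir * 2.0f * y[best] + 1.0f;
            y[best] += dir;
            pulses  += dir;
        }

        /* Restore the signs of the input. */
        for (i = 0; i < N; i++)
            if (X[i] < 0.0f)
                y[i] = -y[i];

        return Syy;    /* the caller uses this to normalize the gain */
    }

The asm differs in the details (it works on Syy/2, keeps the running
sums in registers and vectorizes the argmax loop), but the data flow
is the same.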
Also here are the updated benchmarks:

/===========================================================
SkyLake i7 6700HQ
//v5
1814 1817 1827  //SSE4
1818 1834 1838  //AVX default
1828 1833 1837  //AVX2 xmm
1896 1897 1905  //AVX2 default
1896 1904 1905  //AVX2 cmpps

Best Regards

From 67c93689fab23386cb4d08d4a74d3e804f07c4f4 Mon Sep 17 00:00:00 2001
From: Ivan Kalvachev
Date: Thu, 8 Jun 2017 22:24:33 +0300
Subject: [PATCH 1/5] SIMD opus pvq_search implementation

An explanation of the workings and methods used by the Pyramid Vector
Quantization Search function can be found in the following
Work-In-Progress mail threads:
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212146.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212816.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213030.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213436.html

Signed-off-by: Ivan Kalvachev
---
 libavcodec/opus_pvq.c              |   3 +
 libavcodec/opus_pvq.h              |   5 +-
 libavcodec/x86/Makefile            |   2 +
 libavcodec/x86/opus_dsp_init.c     |  43 +++
 libavcodec/x86/opus_pvq_search.asm | 538 +++++++++++++++++++++++++++++++++++++
 5 files changed, 589 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/x86/opus_dsp_init.c
 create mode 100644 libavcodec/x86/opus_pvq_search.asm

diff --git a/libavcodec/opus_pvq.c b/libavcodec/opus_pvq.c
index 2ac66a0ede..3aa502929c 100644
--- a/libavcodec/opus_pvq.c
+++ b/libavcodec/opus_pvq.c
@@ -947,6 +947,9 @@ int av_cold ff_celt_pvq_init(CeltPVQ **pvq)
     s->encode_band        = pvq_encode_band;
     s->band_cost          = pvq_band_cost;
 
+    if (ARCH_X86 && CONFIG_OPUS_ENCODER)
+        ff_opus_dsp_init_x86(s);
+
     *pvq = s;
 
     return 0;
diff --git a/libavcodec/opus_pvq.h b/libavcodec/opus_pvq.h
index 6691494838..9246337360 100644
--- a/libavcodec/opus_pvq.h
+++ b/libavcodec/opus_pvq.h
@@ -33,8 +33,8 @@ float *lowband_scratch, int fill)
 
 struct CeltPVQ {
-    DECLARE_ALIGNED(32, int,   qcoeff      )[176];
-    DECLARE_ALIGNED(32, float, hadamard_tmp)[176];
+    DECLARE_ALIGNED(32, int,   qcoeff      )[256];
+    DECLARE_ALIGNED(32, float, hadamard_tmp)[256];
 
     float (*pvq_search)(float *X, int *y, int K, int N);
 
@@ -45,6 +45,7 @@ struct CeltPVQ {
 };
 
 int  ff_celt_pvq_init  (struct CeltPVQ **pvq);
+void ff_opus_dsp_init_x86(struct CeltPVQ *s);
 void ff_celt_pvq_uninit(struct CeltPVQ **pvq);
 
 #endif /* AVCODEC_OPUS_PVQ_H */
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 0dbc46504e..9875f48797 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -52,6 +52,7 @@ OBJS-$(CONFIG_APNG_DECODER)            += x86/pngdsp_init.o
 OBJS-$(CONFIG_CAVS_DECODER)            += x86/cavsdsp.o
 OBJS-$(CONFIG_DCA_DECODER)             += x86/dcadsp_init.o x86/synth_filter_init.o
 OBJS-$(CONFIG_DNXHD_ENCODER)           += x86/dnxhdenc_init.o
+OBJS-$(CONFIG_OPUS_ENCODER)            += x86/opus_dsp_init.o
 OBJS-$(CONFIG_HEVC_DECODER)            += x86/hevcdsp_init.o
 OBJS-$(CONFIG_JPEG2000_DECODER)        += x86/jpeg2000dsp_init.o
 OBJS-$(CONFIG_MLP_DECODER)             += x86/mlpdsp_init.o
@@ -123,6 +124,7 @@ X86ASM-OBJS-$(CONFIG_MDCT15)           += x86/mdct15.o
 X86ASM-OBJS-$(CONFIG_ME_CMP)           += x86/me_cmp.o
 X86ASM-OBJS-$(CONFIG_MPEGAUDIODSP)     += x86/imdct36.o
 X86ASM-OBJS-$(CONFIG_MPEGVIDEOENC)     += x86/mpegvideoencdsp.o
+X86ASM-OBJS-$(CONFIG_OPUS_ENCODER)     += x86/opus_pvq_search.o
 X86ASM-OBJS-$(CONFIG_PIXBLOCKDSP)      += x86/pixblockdsp.o
 X86ASM-OBJS-$(CONFIG_QPELDSP)          += x86/qpeldsp.o \
                                           x86/fpel.o    \
diff --git a/libavcodec/x86/opus_dsp_init.c b/libavcodec/x86/opus_dsp_init.c
new file mode 100644
index 0000000000..f4c25822db
--- /dev/null
+++ b/libavcodec/x86/opus_dsp_init.c
@@ -0,0 +1,43 @@
+/*
+ * Opus encoder assembly optimizations
+ * Copyright (C) 2017 Ivan Kalvachev
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/opus_pvq.h"
+
+extern float ff_pvq_search_sse2(float *X, int *y, int K, int N);
+extern float ff_pvq_search_sse4(float *X, int *y, int K, int N);
+extern float ff_pvq_search_avx (float *X, int *y, int K, int N);
+
+av_cold void ff_opus_dsp_init_x86(CeltPVQ *s)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_SSE2(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse2;
+
+    if (EXTERNAL_SSE4(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse4;
+
+    if (EXTERNAL_AVX(cpu_flags))
+        s->pvq_search = ff_pvq_search_avx;
+}
diff --git a/libavcodec/x86/opus_pvq_search.asm b/libavcodec/x86/opus_pvq_search.asm
new file mode 100644
index 0000000000..359ad7a8a4
--- /dev/null
+++ b/libavcodec/x86/opus_pvq_search.asm
@@ -0,0 +1,538 @@
+;******************************************************************************
+;* SIMD optimized Opus encoder DSP function
+;*
+;* Copyright (C) 2017 Ivan Kalvachev
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "config.asm"
+%include "libavutil/x86/x86util.asm"
+
+%ifdef __NASM_VER__
+%use "smartalign"
+ALIGNMODE p6
+%endif
+
+; Use float ops that give half precision but execute in around 3 cycles.
+; On Skylake & Ryzen the division is much faster (around 11c/3),
+; which makes the full precision code only about 2% slower.
+; Opus itself also uses the rsqrt approximation in its intrinsics code.
+%define USE_APPROXIMATION   01
+
+; Presearch tries to quantize by using the property Sum( abs(x[i]*K)/Sx ) = K.
+; If we use truncation for the division result then the integer Sum would be <= K.
+; By using nearest rounding we get a closer approximation, but
+; we could also get more than K pulses and then have to remove the extra ones.
+; This method is the main improvement of the ffmpeg C function over the Opus original.
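+; For example (illustrative numbers, not from the patch): with
+; X = {5.0, 2.0, 1.0} (Sx = 8.0) and K = 7, the scaled values X[i]*K/Sx
+; are {4.375, 1.75, 0.875}. Truncation yields y = {4, 1, 0}, only 5 pulses,
+; leaving 2 to be added one by one; nearest rounding yields y = {4, 2, 1},
+; exactly 7 pulses. Rounding may also overshoot K, which is why the search
+; below must be able to remove pulses as well as add them.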
+%define PRESEARCH_ROUNDING  01
+
+
+
+SECTION_RODATA 64
+
+const_float_abs_mask:   times 8 dd 0x7fffffff
+const_align_abs_edge:   times 8 dd 0
+
+const_float_0_5:        times 8 dd 0.5
+const_float_1:          times 8 dd 1.0
+const_float_sign_mask:  times 8 dd 0x80000000
+
+const_int32_offsets:
+                        %rep 8
+                                dd $-const_int32_offsets
+                        %endrep
+SECTION .text
+
+
+
+
+;
+; Horizontal Sum Packed Single precision float
+; The resulting sum is in all elements.
+;
+%macro HSUMPS 2 ; dst/src, tmp
+%if cpuflag(avx)
+  %if sizeof%1>=32  ; avx
+        vperm2f128  %2,   %1,   %1,   (0)*16+(1)  ; %2=lo(%1)<<128+hi(%1)
+        vaddps      %1,   %2
+  %endif
+        vshufps     %2,   %1,   %1,   q1032
+        vaddps      %1,   %2
+        vshufps     %2,   %1,   %1,   q0321
+        vaddps      %1,   %2
+
+%else  ; this form is a bit faster than the short avx-like emulation.
+        movaps      %2,   %1            ;    [d,       c,       b,       a      ]
+        shufps      %1,   %1,   q1032   ; %1=[b,       a,       d,       c      ]
+        addps       %1,   %2            ; %1=[b+d,     a+c,     d+b,     c+a    ]
+        movaps      %2,   %1
+        shufps      %1,   %1,   q0321   ; %1=[c+a,     b+d,     a+c,     d+b    ]
+        addps       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a]
+        ; all %1 members should be equal for as long as float a+b==b+a
+%endif
+%endmacro
+
+;
+; Emulate blendvps if not available
+;
+;   src_b is destroyed when using emulation with logical operands
+;   the SSE41 blend instruction is hard coded to use xmm0 as mask
+;
+%macro EMU_blendvps 3 ; dst/src_a, src_b, mask
+%if cpuflag(avx)
+        vblendvps %1, %1, %2, %3
+;-------------------------
+%elif cpuflag(sse4)
+    %if notcpuflag(avx)
+        %ifnidn %3,xmm0
+          %error sse41 blendvps uses xmm0 as default 3rd operand, you used %3
+        %endif
+    %endif
+        blendvps %1, %2, %3
+;-------------------------
+%else
+        xorps  %2, %1
+        andps  %2, %3
+        xorps  %1, %2
+%endif
+%endmacro
+
+;
+; Emulate pblendvb if not available
+;
+;   src_b is destroyed when using emulation with logical operands
+;   the SSE41 blend instruction is hard coded to use xmm0 as mask
+;
+%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask
+%if cpuflag(avx)
+    %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
+        %error AVX1 does not support integer 256 bit ymm operations
+    %endif
+
+        vpblendvb %1, %1, %2, %3
+;-------------------------
+%elif cpuflag(sse4)
+    %ifnidn %3,xmm0
+        %error sse41 pblendvb uses xmm0 as default 3rd operand, you used %3
+    %endif
+        pblendvb %1, %2, %3
+;-------------------------
+%else
+        pxor  %2, %1
+        pand  %2, %3
+        pxor  %1, %2
+%endif
+%endmacro
+
+;
+; Emulate broadcastss if not available
+;
+%macro EMU_broadcastss 2 ; dst, src
+%if cpuflag(avx2)
+        vbroadcastss  %1, %2                ; ymm, xmm
+;-------------------------
+%elif cpuflag(avx)
+    %ifnum sizeof%2                         ; avx1 register
+        vpermilps    xmm%1, xmm%2, q0000    ; xmm, xmm, imm || ymm, ymm, imm
+      %if sizeof%1 >= 32                    ; mmsize>=32
+        vinsertf128  %1, %1, xmm%1, 1       ; ymm, ymm, xmm, imm
+      %endif
+    %else                                   ; avx1 memory
+        vbroadcastss %1, %2                 ; ymm, mm32 || xmm, mm32
+    %endif
+;-------------------------
+%else
+    %ifnum sizeof%2                         ; sse register
+        shufps       %1, %2, %2, q0000
+    %else                                   ; sse memory
+        movss        %1, %2                 ; this zeroes the other 3 elements
+        shufps       %1, %1, 0
+    %endif
+%endif
+%endmacro
+
+;
+; Emulate pbroadcastd if not available
+;
+%macro EMU_pbroadcastd 2 ; dst, src
+%if cpuflag(avx2)
+        vpbroadcastd %1, %2
+%elif cpuflag(avx) && mmsize >= 32
+        %error AVX1 does not support integer 256 bit ymm operations
+%else
+    %ifnum sizeof%2                         ; sse2 register
+        pshufd       %1, %2, q0000
+    %else                                   ; sse memory
+        movd         %1, %2                 ; movd zero extends
+        pshufd       %1, %1, 0
+    %endif
+%endif
+%endmacro
+
+;
+; Setup a high register to be used
+; for holding a memory constant
+;
+; %1 - the load instruction (movop)
+; %2 - the register to be used; assumes it is >= mm8
+; %3 - name of the constant
+;
+; Subsequent opcodes are going to use the constant in the form
+; "addps m0, mm_const_name" and it would be turned into:
+; "addps m0, [const_name]" on 32 bit arch or
+; "addps m0, m8" on 64 bit arch
+%macro SET_HI_REG_MM_CONSTANT 3 ; movop, reg, const_name
+%if num_mmregs > 8
+    %define  mm_%3  %2
+    %{1}     %2, [%3]               ; movaps m8, [const_name]
+%else
+    %define  mm_%3 [%3]
+%endif
+%endmacro
+
+;
+; Set
+;   Position Independent Code
+;     Base address of a constant
+;
+; %1 - the lea instruction to be used, if PIC is set
+; %2 - the register to be used
+; %3 - name of the constant
+;
+; Subsequent opcodes are going to use the base address in the form
+; "movaps m0, [pic_base_constant_name+r4q]" and it would be turned into
+; "movaps m0, [r5q + r4q]" if PIC is enabled
+; "movaps m0, [constant_name + r4q]" if textrels are used
+%macro SET_PIC_BASE 3 ; lea, reg, const_label
+%ifdef PIC
+        %{1}     %2, [%3]           ; lea r5q, [rip+const]
+        %define  pic_base_%3 %2
+%else
+        %define  pic_base_%3 %3
+%endif
+%endmacro
+
+
+
+
+%macro PULSES_SEARCH 1
+; m6 Syy_norm
+; m7 Sxy_norm
+        addps        m6,   mm_const_float_0_5 ; Syy_norm += 1.0/2
+        pxor         m1,   m1           ; max_idx
+        xorps        m3,   m3           ; p_max
+        xor         r4q,  r4q
+align 16
+%%distortion_search:
+        movd       xmm2,  dword r4d     ; movd zero extends
+%ifidn %1,add
+        movaps       m4,  [tmpY + r4q]  ; y[i]
+        movaps       m5,  [tmpX + r4q]  ; X[i]
+
+  %if USE_APPROXIMATION == 1
+        xorps        m0,   m0
+        cmpps        m0,   m0,   m5, 4  ; m0 = (X[i] != 0.0)
+  %endif
+
+        addps        m4,   m6           ; m4 = Syy_new = y[i] + Syy_norm
+        addps        m5,   m7           ; m5 = Sxy_new = X[i] + Sxy_norm
+
+  %if USE_APPROXIMATION == 1
+        andps        m5,   m0           ; if (X[i] == 0) Sxy_new = 0; prevents the approximation error from setting pulses in the array padding.
+  %endif
+
+%else
+        movaps       m5,  [tmpY + r4q]  ; m5 = y[i]
+
+        xorps        m0,   m0           ; m0 = 0;
+        cmpps        m0,   m0,   m5, 1  ; m0 = (0 < y)
+
+        subps        m4,   m6,   m5     ; m4 = Syy_new = Syy_norm - y[i]
+        subps        m5,   m7,  [tmpX + r4q] ; m5 = Sxy_new = Sxy_norm - X[i]
+        andps        m5,   m0           ; (0 < y) ? m5 : 0
+%endif
+
+%if USE_APPROXIMATION == 1
+        rcpps        m4,   m4           ; m4 = approx(1/Syy_new)
+        mulps        m5,   m5           ; m5 = Sxy_new*Sxy_new
+        mulps        m5,   m4           ; m5 = p = Sxy_new*Sxy_new*approx(1/Syy_new)
+%else
+        mulps        m5,   m5
+        divps        m5,   m4           ; m5 = p = Sxy_new*Sxy_new/Syy_new
+%endif
+        EMU_pbroadcastd m2, xmm2        ; m2 = i (all lanes; the per-lane offsets are added after the loop)
+        cmpps        m0,   m3,   m5, 1  ; m0 = (p > p_max)
+        maxps        m3,   m5           ; m3 = max(p_max, p)
+                                        ; maxps here is faster than blendvps, despite blend having lower latency.
+
+        pand         m2,   m0           ; This version seems faster than sse41 pblendvb
+        pmaxsw       m1,   m2           ; SSE2 signed word, so it would work for N < 32768/4
+
+        add         r4q,   mmsize
+        cmp         r4q,   Nq
+        jb %%distortion_search
+
+        por          m1,   mm_const_int32_offsets ; max_idx offsets per individual lane (skipped in the inner loop)
+        movdqa       m4,   m1           ; needed for the aligned y[max_idx]+=1; processing
+
+%if mmsize >= 32
+; Merge parallel maximums round 8 (4 vs 4)
+
+        vextractf128 xmm5, ymm3, 1      ; xmm5 = ymm3[1x128] = ymm3[255..128b]
+        vcmpps       xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[0x128] < p[1x128] )
+
+        vextracti128 xmm2, ymm1, 1      ; xmm2 = ymm1[1x128] = ymm1[255..128b]
+        EMU_blendvps xmm3, xmm5, xmm0   ; p       = m0 ? p[1x128]       : p[0x128]
+        EMU_pblendvb xmm1, xmm2, xmm0   ; max_idx = m0 ? max_idx[1x128] : max_idx[0x128]
+%endif
+
+; Merge parallel maximums round 4 (2 vs 2)
+                                        ; m3=p[3210]
+        movhlps      xmm5, xmm3         ; m5=p[xx32]
+        cmpps        xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[1,0] < p[3,2] )
+
+        pshufd       xmm2, xmm1, q3232
+        EMU_blendvps xmm3, xmm5, xmm0   ; p       = m0 ? p[3,2]       : p[1,0]
+        EMU_pblendvb xmm1, xmm2, xmm0   ; max_idx = m0 ? max_idx[3,2] : max_idx[1,0]
+
+; Merge parallel maximums final round (1 vs 1)
+        movaps       xmm0, xmm3         ; m0 = m3[0] = p[0]
+        shufps       xmm3, xmm3, q1111  ; m3 = m3[1] = p[1]
+        cmpless      xmm0, xmm3
+
+        pshufd       xmm2, xmm1, q1111
+        EMU_pblendvb xmm1, xmm2, xmm0   ; max_idx = m0 ? max_idx[1] : max_idx[0]
+
+        movd   dword  r4d, xmm1         ; zero extends to the rest of r4q
+
+        EMU_broadcastss m3, [tmpX + r4q]
+        %{1}ps       m7,   m3           ; Sxy += X[max_idx]
+
+        EMU_broadcastss m5, [tmpY + r4q]
+        %{1}ps       m6,   m5           ; Syy += Y[max_idx]
+
+        ; We have to update a single element in Y[i].
+        ; However writing 4 bytes and then doing a 16 byte load in the inner loop
+        ; could cause a stall due to breaking write forwarding.
+        EMU_pbroadcastd m1, xmm1
+        pcmpeqd      m1,   m1,   m4     ; exactly 1 element matches max_idx and this finds it
+
+        and         r4q,  ~(mmsize-1)   ; align address down, so the value pointed to by max_idx is inside a mmsize load
+        movaps       m5,  [tmpY + r4q]  ; m5 = Y[y3...ym...y0]
+        andps        m1,   mm_const_float_1 ; m1 =   [ 0...1.0...0]
+        %{1}ps       m5,   m1           ; m5 = Y[y3...ym...y0] +/- [0...1.0...0]
+        movaps      [tmpY + r4q], m5    ; Y[max_idx] +-= 1.0;
+%endmacro
+
+;
+; We need one more register for
+; PIC relative addressing. Use this
+; to count it in cglobal
+;
+%ifdef PIC
+  %define num_pic_regs 1
+%else
+  %define num_pic_regs 0
+%endif
+
+;
+; In x86_64 mode a few more xmm registers are used to hold constants.
+; Use this to count them in cglobal
+;
+%if num_mmregs > 8
+%define num_hireg_used 3
+%endif
+
+;
+; Pyramid Vector Quantization Search implementation
+;
+; float * inX  - Unaligned (SIMD) access; it will be overread,
+;                but the extra data is masked away.
+; int32 * outY - Should be an aligned and padded buffer.
+;                It is used as a temp buffer.
+; uint32 K     - Number of pulses to have after quantization.
+; uint32 N     - Number of vector elements. Must be 0 < N < 256
+;
+%macro PVQ_FAST_SEARCH 0
+cglobal pvq_search, 4, 5+num_pic_regs, 8+num_hireg_used, 256*4, inX, outY, K, N
+%define tmpX rsp
+%define tmpY outYq
+
+
+        movaps       m0,  [const_float_abs_mask]
+        shl          Nd,   2            ; N *= sizeof(float); also the 32 bit operation zeroes the high 32 bits in 64 bit mode.
+        mov         r4q,   Nq
+
+        neg         r4q
+        and         r4q,   mmsize-1
+
+        SET_PIC_BASE lea, r5q, const_align_abs_edge ; rip+const
+        movups       m3,  [pic_base_const_align_abs_edge + r4q - mmsize]
+
+        add          Nq,   r4q          ; Nq = align(Nq, mmsize)
+
+        lea         r4q,  [Nq - mmsize] ; Nq is rounded up (aligned up) to mmsize, so r4q can't become negative here, unless N=0.
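+; An illustrative trace (numbers not from the patch): for mmsize = 16 and
+; N = 6 floats, Nd = 24 bytes and r4q = (-24) & 15 = 8 padding bytes.
+; m3 is then loaded from const_align_abs_edge-8, i.e. it straddles the end
+; of const_float_abs_mask, giving {abs_mask, abs_mask, 0, 0}. The first
+; (unaligned tail) chunk below is masked with m3, so the two overread
+; lanes are zeroed and do not contribute to Sx.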
+        movups       m2,  [inXq + r4q]
+        andps        m2,   m3
+        movaps      [tmpX + r4q], m2
+        movaps       m1,   m2           ; Sx
+
+align 16
+%%loop_abs_sum:
+        sub         r4q,   mmsize
+        jc %%end_loop_abs_sum
+
+        movups       m2,  [inXq + r4q]
+        andps        m2,   m0
+
+        movaps      [tmpX + r4q], m2    ; tmpX[i]=abs(X[i])
+        addps        m1,   m2
+        jmp %%loop_abs_sum
+
+align 16
+%%end_loop_abs_sum:
+
+        HSUMPS       m1,   m2           ; m1 = Sx
+
+        xorps        m0,   m0
+        comiss     xmm0,  xmm1
+        jz %%zero_input                 ; if (Sx == 0) goto zero_input
+
+        cvtsi2ss   xmm0,  dword Kd      ; m0 = K
+%if USE_APPROXIMATION == 1
+        rcpss      xmm1,  xmm1          ; m1 = approx(1/Sx)
+        mulss      xmm0,  xmm1          ; m0 = K*(1/Sx)
+%else
+        divss      xmm0,  xmm1          ; b = K/Sx
+                                        ; b = K/max_x
+%endif
+
+        EMU_broadcastss m0, xmm0
+
+        lea         r4q,  [Nq - mmsize]
+        pxor         m5,   m5           ; Sy    ( Sum of abs( y[i]) )
+        xorps        m6,   m6           ; Syy   ( Sum of y[i]*y[i]  )
+        xorps        m7,   m7           ; Sxy   ( Sum of X[i]*y[i]  )
+align 16
+%%loop_guess:
+        movaps       m1,  [tmpX + r4q]  ; m1 = X[i]
+        mulps        m2,   m0,   m1     ; m2 = res*X[i]
+  %if PRESEARCH_ROUNDING == 0
+        cvttps2dq    m2,   m2           ; yt = (int)truncf( res*X[i] )
+  %else
+        cvtps2dq     m2,   m2           ; yt = (int)lrintf( res*X[i] )
+  %endif
+        paddd        m5,   m2           ; Sy += yt
+        cvtdq2ps     m2,   m2           ; yt = (float)yt
+        mulps        m1,   m2           ; m1 = X[i]*yt
+        movaps      [tmpY + r4q], m2    ; y[i] = m2
+        addps        m7,   m1           ; Sxy += m1;
+        mulps        m2,   m2           ; m2 = yt*yt
+        addps        m6,   m2           ; Syy += m2
+
+        sub         r4q,   mmsize
+        jnc %%loop_guess
+
+        HSUMPS       m6,   m1           ; Syy_norm
+        HADDD        m5,   m4           ; pulses
+
+        movd   dword  r4d, xmm5         ; zero extends to the rest of r4q
+
+        sub          Kd,   r4d          ; K -= pulses; also the 32 bit operation zeroes the high 32 bits in 64 bit mode.
+        jz %%finish                     ; K - pulses == 0
+
+        SET_HI_REG_MM_CONSTANT movaps, m8,  const_float_0_5
+        SET_HI_REG_MM_CONSTANT movaps, m9,  const_float_1
+        SET_HI_REG_MM_CONSTANT movdqa, m10, const_int32_offsets
+        ; Use Syy/2 in the distortion parameter calculations.
+        ; Saves pre- and post-calculation to correct the Y[] values.
+        ; Same precision, since the float mantissa is normalized.
+        ; The SQRT approximation does differ.
+        HSUMPS       m7,   m0           ; Sxy_norm
+        mulps        m6,   mm_const_float_0_5
+
+        jc %%remove_pulses_loop         ; K - pulses < 0
+
+align 16                                ; K - pulses > 0
+%%add_pulses_loop:
+
+        PULSES_SEARCH add               ; m6 Syy_norm ; m7 Sxy_norm
+
+        sub          Kd,   1
+        jnz %%add_pulses_loop
+
+        addps        m6,   m6           ; Syy*=2
+
+        jmp %%finish
+
+align 16
+%%remove_pulses_loop:
+
+        PULSES_SEARCH sub               ; m6 Syy_norm ; m7 Sxy_norm
+
+        add          Kd,   1
+        jnz %%remove_pulses_loop
+
+        addps        m6,   m6           ; Syy*=2
+
+align 16
+%%finish:
+        lea         r4q,  [Nq - mmsize]
+
+        movaps       m2,  [const_float_sign_mask]
+align 16
+%%restore_sign_loop:
+        movaps       m0,  [tmpY + r4q]  ; m0 = Y[i]
+        movups       m1,  [inXq + r4q]  ; m1 = X[i]
+        andps        m1,   m2           ; m1 = sign(X[i])
+        orps         m0,   m1           ; m0 = Y[i]*sign
+        cvtps2dq     m3,   m0           ; m3 = (int)m0
+        movaps      [outYq + r4q], m3
+
+        sub         r4q,   mmsize
+        jnc %%restore_sign_loop
+%%return:
+
+%if ARCH_X86_64 == 0    ; sbrdsp
+        movss       r0m,  xmm6          ; return (float)Syy_norm
+        fld  dword  r0m
+%else
+        movaps       m0,   m6           ; return (float)Syy_norm
+%endif
+        RET
+
+align 16
+%%zero_input:
+        lea         r4q,  [Nq - mmsize]
+        xorps        m0,   m0
+%%zero_loop:
+        movaps      [outYq + r4q], m0
+        sub         r4q,   mmsize
+        jnc %%zero_loop
+
+        movaps       m6,  [const_float_1]
+        jmp %%return
+%endmacro
+
+
+INIT_XMM sse2
+PVQ_FAST_SEARCH
+
+INIT_XMM sse4
+PVQ_FAST_SEARCH
+
+INIT_XMM avx
+PVQ_FAST_SEARCH
-- 
2.13.2