
[FFmpeg-devel,WIP] Opus Pyramid Vector Quantization Search in x86 SIMD asm

Message ID CABA=pqdn-csj0MhTGiH=RTjdpRG+xCf6_qZjw85XMX8K=206nQ@mail.gmail.com
State Superseded

Commit Message

Ivan Kalvachev June 8, 2017, 10:36 p.m. UTC
At the request of Rostislav Pehlivanov (atomnuker), I've been working on
an SSE/AVX accelerated version of pvq_search().

The attached patch is my work so far.

At the moment, for me, at the default bitrate
the function is 2.5 times faster than the current C version
(my CPU is Intel Westmere class).
The total encoding time is about 5% shorter.

I'd like some more benchmarks on different CPUs
and maybe some advice on how to improve it.
(I've left some alternative methods in the code
that can easily be switched with defines, as they
may be faster on other CPUs.)
(I'm also quite afraid of how fast it would run on pre-Ryzen AMD CPUs.)


The code generates 4 variants: SSE2, SSE4.2, AVX and 256-bit AVX2.
I haven't tested the AVX variants on a real CPU myself;
I used Intel SDE to develop and test them.
Rostislav (atomnuker) reported some crashes with the 256-bit AVX2
variant; however, that might be related to the clang tools.




Below are some broad descriptions:

The typical use of the function at the default bitrate (96 kbps) is
K<=36 and N<=32.
N is the size of the input array (the vector), K is the number of
pulses that should be present in the output (the sum of the absolute
values of the output elements is K).
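
(For example, with N=4 and K=3 a valid output is y = [2, 0, -1, 0],
since |2| + |0| + |-1| + |0| == 3.)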

In synthetic tests, the SIMD function can be 8-12 times faster at the
maximum N=176.
I've been told that bigger sizes are more common in low-bitrate encodes
and will become more common with the upcoming RDO improvements.

A short description of how the function works:
1. A loop that calculates the sum (Sx) of the input elements (inX[]).
The same loop fills a stack-allocated temp buffer (tmpX) that is
aligned to mmsize and contains the absolute values of inX[i].

2. Pre-search loop. It uses K/Sx as an approximation of the vector gain
and fills the output vector outY[] based on it. The output is integer,
but we use outY[] to temporarily store the doubled y values as floats
(we need 2*y for the calculations). This loop also calculates a few
parameters that are needed for the distortion calculations later
(Syy = Sum of y[i]^2 ; Sxy = Sum of inX[i]*y[i]).

3. Addition of missing pulses or elimination of extra ones.
The C function uses the variable "phase" to signal whether pulses
should be added or removed; I've split this into two separate cases.
The code is shared through the macro PULSES_SEARCH.
Each case consists of 2 loops. The outer loop runs until we have K
pulses in the output.
The inner loop calculates the distortion parameter for each element and
picks the best one
(parallel search, combination of the parallel results, update of the
variables). A scalar C sketch of the whole search is given after step 5.

4. When we are done, we do one more loop to convert outY[] back to
integers and to restore the signs (using the original inX[]).

5. There is a special case when Sx==0, which happens if all elements of
the input are zeroes (in theory the input should be normalized, meaning
Sum of X[i]^2 == 1.0). In this case we return a zero output and 1.0 as
gain.
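
To make steps 1-5 easier to follow, here is a rough scalar sketch in C
of what the function computes. This is NOT the code from the patch and
the names are mine; it corresponds to the exact division metric
(USE_APPROXIMATION 0) and lrintf() rounding (PRESEARCH_ROUNDING 1):

#include <math.h>

/* Rough scalar equivalent of the search (illustration only). X is the
 * input vector, y receives the K pulses, the return value is Syy so
 * the caller can do gain /= sqrtf(...) as in celt_alg_quant(). */
float pvq_search_sketch(const float *X, int *y, int K, int N)
{
    float Sx = 0.0f, Sxy = 0.0f, Syy = 0.0f;
    int i, pulses = 0;

    for (i = 0; i < N; i++)                  /* step 1: Sx = sum of |X[i]| */
        Sx += fabsf(X[i]);

    if (Sx == 0.0f) {                        /* step 5: all-zero input */
        for (i = 0; i < N; i++)
            y[i] = 0;
        return 1.0f;
    }

    for (i = 0; i < N; i++) {                /* step 2: pre-search guess */
        y[i]    = (int)lrintf(K / Sx * fabsf(X[i]));
        pulses += y[i];
        Sxy    += fabsf(X[i]) * y[i];
        Syy    += (float)y[i] * y[i];
    }

    while (pulses < K) {                     /* step 3: add missing pulses */
        int   best   = 0;
        float best_p = -1.0f;
        for (i = 0; i < N; i++) {
            float Sxy_new = Sxy + fabsf(X[i]);
            float Syy_new = Syy + 2 * y[i] + 1;   /* (y+1)^2 = y^2 + 2y + 1 */
            float p       = Sxy_new * Sxy_new / Syy_new;
            if (p > best_p) {
                best_p = p;
                best   = i;
            }
        }
        Sxy += fabsf(X[best]);
        Syy += 2 * y[best] + 1;
        y[best]++;
        pulses++;
    }

    while (pulses > K) {                     /* step 3: remove extra pulses */
        int   best   = 0;
        float best_p = -1.0f;
        for (i = 0; i < N; i++) {
            float Sxy_new, Syy_new, p;
            if (y[i] <= 0)                   /* can only remove existing pulses */
                continue;
            Sxy_new = Sxy - fabsf(X[i]);
            Syy_new = Syy - 2 * y[i] + 1;    /* (y-1)^2 = y^2 - 2y + 1 */
            p       = Sxy_new * Sxy_new / Syy_new;
            if (p > best_p) {
                best_p = p;
                best   = i;
            }
        }
        Sxy -= fabsf(X[best]);
        Syy -= 2 * y[best] - 1;
        y[best]--;
        pulses--;
    }

    for (i = 0; i < N; i++)                  /* step 4: restore the signs */
        if (X[i] < 0.0f)
            y[i] = -y[i];

    return Syy;
}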

---
Now, I've left some defines that change the generated code.

HADDPS_IS_FAST
PHADDD_IS_FAST
I've implemented my own horizontal sum macros, and while doing it I
discovered that on my CPU (Westmere class) using the "new" SSE4.2
instructions is not faster than using my own code for doing the same.
It's not speed critical, since horizontal sums are used only 3-4 times
per function call.
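
For reference, the shuffle-based fallback that HSUMPS uses is
equivalent to the following SSE intrinsics (a sketch of the same idea,
not code from the patch):

#include <xmmintrin.h>

/* Broadcast the horizontal sum of v = [a,b,c,d] into all four lanes,
 * using the same two-shuffle pattern as the HSUMPS fallback. */
static inline __m128 hsum_ps(__m128 v)
{
    __m128 t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)); /* [c,d,a,b]         */
    v = _mm_add_ps(v, t);                                     /* [a+c,b+d,c+a,d+b] */
    t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 3, 2, 1));        /* rotate by one     */
    return _mm_add_ps(v, t);                                  /* all = a+b+c+d     */
}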

BLENDVPS_IS_FAST
PBLENDVB_IS_FAST
I think that blend on my CPU is faster than the alternative version
that I've implemented. However, I'm not sure this is true for all CPUs,
since a number of modern CPUs have latency=3 and inv_throughput=2 for
it (that's 2 clocks until another blend can start).
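
The alternative version is the classic xor/and/xor select used by the
emu_blendvps macro in the patch; in SSE intrinsics the same trick looks
roughly like this (illustration only):

#include <xmmintrin.h>

/* Per-lane select: result = mask ? b : a. Like the macro in the patch,
 * the emulation clobbers its second source. */
static inline __m128 blendv_ps_emu(__m128 a, __m128 b, __m128 mask)
{
    b = _mm_xor_ps(b, a);        /* b = a ^ b            */
    b = _mm_and_ps(b, mask);     /* b = (a ^ b) & mask   */
    return _mm_xor_ps(a, b);     /* a ^ ((a ^ b) & mask) */
}

Note that real blendvps only checks the sign bit of each mask lane,
while this emulation needs all-ones/all-zeros lanes, which is what the
cmpps results used in the patch provide.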

CONST_IN_X64_REG_IS_FASTER
The function is implemented so that only 8 registers are used. With
this define, the constants used during PULSES_SEARCH are loaded into
the upper registers available on x86_64. I could not determine whether
it is faster to do so... it should be, but sometimes I got the opposite
result. I'd probably enable it in the final version.

STALL_WRITE_FORWARDING
After the inner search finds the maximum, we add/remove a pulse in
outY[i]. However, writing a single element (sizeof(float)=4) can stall
the wide load done in the inner loop (mmsize=16), because the small
store cannot be forwarded to the larger load. This hurts a lot more at
small vector sizes.
On Skylake the penalty is only 11 cycles and Ryzen should have no
penalty at all, but older CPUs can have a penalty of up to 200 cycles.

SHORT_SYY_UPDATE
This define only has meaning when STALL_WRITE_FORWARDING is 0 (i.e.
when the longer code that avoids the stall is used).
It saves a few instructions by loading the old outY[] value with a
scalar load, instead of using HSUMPS and some 'haddps' to calculate it.
So far it looks like the short update is always faster, but I've left
the other variant just in case...

USE_APPROXIMATION
This controls the method used to calculate the distortion parameter.
"0" uses 1 multiplication and 1 division, which can be a lot slower
(14;14 cycles on my CPU, 11;7 on Skylake).
"1" uses 2 multiplications and 1 reciprocal op, which is a lot faster
than a real division but gives only half precision.
"2" uses 1 multiplication and 1 reciprocal square root op, which is
literally 1 cycle, but again gives only half precision.
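
In SSE intrinsics the three variants correspond roughly to the
following (illustration only; Sxy and Syy stand for the
Sxy_new/Syy_new vectors computed in the inner loop):

#include <xmmintrin.h>

/* Rank a candidate position; a larger value means a better position.
 * Variant 2 ranks the same as 0/1 because Sxy >= 0 here, so maximizing
 * Sxy/sqrt(Syy) is equivalent to maximizing Sxy^2/Syy. */
static inline __m128 distortion_p(__m128 Sxy, __m128 Syy, int variant)
{
    if (variant == 0)
        return _mm_div_ps(_mm_mul_ps(Sxy, Sxy), Syy);              /* exact        */
    if (variant == 1)
        return _mm_mul_ps(_mm_mul_ps(Sxy, Sxy), _mm_rcp_ps(Syy));  /* 1/Syy approx */
    return _mm_mul_ps(Sxy, _mm_rsqrt_ps(Syy));                     /* 1/sqrt(Syy)  */
}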

PRESEARCH_ROUNDING
This controls the rounding of the gain used for the initial guess.
"0" uses truncf(), which makes sure that the number of pulses never
exceeds K. It gives results identical to the original celt_* functions.
"1" uses lrintf(); this is basically the improvement of the current C
code over the celt_* one.
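
(For example, if K/Sx * |X[i]| == 2.7, truncf() gives 2 while lrintf()
gives 3. The truncated guesses sum to at most K, so that variant never
enters the pulse-removal phase; the rounded guess is closer to the
final result but may overshoot K.)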


ALL_FLOAT_PRESEARCH
The pre-search filling of outY[] can be done entirely with float ops
(using SSE4.2 'roundps' instead of two cvt*). It is mostly useful if
you want to try YMM on AVX1 (AVX1 lacks 256-bit integer ops).
For some reason enabling this makes the whole function 4 times slower
on my CPU. ^_^

I've left some commented out code. I'll remove it for sure in the final version.

I just hope I haven't made some lame mistake at the last minute...

Comments

Michael Niedermayer June 9, 2017, 10:08 a.m. UTC | #1
On Fri, Jun 09, 2017 at 01:36:07AM +0300, Ivan Kalvachev wrote:
> [...]

>  opus_pvq.c              |    9 
>  opus_pvq.h              |    5 
>  x86/Makefile            |    1 
>  x86/opus_dsp_init.c     |   47 +++
>  x86/opus_pvq_search.asm |  597 ++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 657 insertions(+), 2 deletions(-)
> 3b9648bea3f01dad2cf159382f0ffc2d992c84b2  0001-SIMD-opus-pvq_search-implementation.patch
> From 06dc798c302e90aa5b45bec5d8fbcd64ba4af076 Mon Sep 17 00:00:00 2001
> From: Ivan Kalvachev <ikalvachev@gmail.com>
> Date: Thu, 8 Jun 2017 22:24:33 +0300
> Subject: [PATCH 1/3] SIMD opus pvq_search implementation.

seems this breaks the build with mingw64, didn't investigate but it
fails with these errors:

libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
make: *** [ffmpeg_g.exe] Error 1
make: *** Waiting for unfinished jobs....
make: *** [ffprobe_g.exe] Error 1


[...]

Patch

From 06dc798c302e90aa5b45bec5d8fbcd64ba4af076 Mon Sep 17 00:00:00 2001
From: Ivan Kalvachev <ikalvachev@gmail.com>
Date: Thu, 8 Jun 2017 22:24:33 +0300
Subject: [PATCH 1/3] SIMD opus pvq_search implementation.

---
 libavcodec/opus_pvq.c              |   9 +
 libavcodec/opus_pvq.h              |   5 +-
 libavcodec/x86/Makefile            |   1 +
 libavcodec/x86/opus_dsp_init.c     |  47 +++
 libavcodec/x86/opus_pvq_search.asm | 597 +++++++++++++++++++++++++++++++++++++
 5 files changed, 657 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/x86/opus_dsp_init.c
 create mode 100644 libavcodec/x86/opus_pvq_search.asm

diff --git a/libavcodec/opus_pvq.c b/libavcodec/opus_pvq.c
index 2ac66a0ede..172223e241 100644
--- a/libavcodec/opus_pvq.c
+++ b/libavcodec/opus_pvq.c
@@ -418,7 +418,13 @@  static uint32_t celt_alg_quant(OpusRangeCoder *rc, float *X, uint32_t N, uint32_
     int *y = pvq->qcoeff;
 
     celt_exp_rotation(X, N, blocks, K, spread, 1);
+
+    {
+    START_TIMER
     gain /= sqrtf(pvq->pvq_search(X, y, K, N));
+    STOP_TIMER("pvq_search");
+    }
+
     celt_encode_pulses(rc, y,  N, K);
     celt_normalize_residual(y, X, N, gain);
     celt_exp_rotation(X, N, blocks, K, spread, 0);
@@ -947,6 +953,9 @@  int av_cold ff_celt_pvq_init(CeltPVQ **pvq)
     s->encode_band        = pvq_encode_band;
     s->band_cost          = pvq_band_cost;
 
+    if (ARCH_X86)
+        ff_opus_dsp_init_x86(s);
+
     *pvq = s;
 
     return 0;
diff --git a/libavcodec/opus_pvq.h b/libavcodec/opus_pvq.h
index 6691494838..9246337360 100644
--- a/libavcodec/opus_pvq.h
+++ b/libavcodec/opus_pvq.h
@@ -33,8 +33,8 @@ 
                                        float *lowband_scratch, int fill)
 
 struct CeltPVQ {
-    DECLARE_ALIGNED(32, int,   qcoeff      )[176];
-    DECLARE_ALIGNED(32, float, hadamard_tmp)[176];
+    DECLARE_ALIGNED(32, int,   qcoeff      )[256];
+    DECLARE_ALIGNED(32, float, hadamard_tmp)[256];
 
     float (*pvq_search)(float *X, int *y, int K, int N);
 
@@ -45,6 +45,7 @@  struct CeltPVQ {
 };
 
 int  ff_celt_pvq_init  (struct CeltPVQ **pvq);
+void ff_opus_dsp_init_x86(struct CeltPVQ *s);
 void ff_celt_pvq_uninit(struct CeltPVQ **pvq);
 
 #endif /* AVCODEC_OPUS_PVQ_H */
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 710e48b15f..f50a48dc13 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -51,6 +51,7 @@  OBJS-$(CONFIG_APNG_DECODER)            += x86/pngdsp_init.o
 OBJS-$(CONFIG_CAVS_DECODER)            += x86/cavsdsp.o
 OBJS-$(CONFIG_DCA_DECODER)             += x86/dcadsp_init.o x86/synth_filter_init.o
 OBJS-$(CONFIG_DNXHD_ENCODER)           += x86/dnxhdenc_init.o
+OBJS-$(CONFIG_OPUS_ENCODER)            += x86/opus_dsp_init.o x86/opus_pvq_search.o
 OBJS-$(CONFIG_HEVC_DECODER)            += x86/hevcdsp_init.o
 OBJS-$(CONFIG_JPEG2000_DECODER)        += x86/jpeg2000dsp_init.o
 OBJS-$(CONFIG_MLP_DECODER)             += x86/mlpdsp_init.o
diff --git a/libavcodec/x86/opus_dsp_init.c b/libavcodec/x86/opus_dsp_init.c
new file mode 100644
index 0000000000..ddccf6fe9e
--- /dev/null
+++ b/libavcodec/x86/opus_dsp_init.c
@@ -0,0 +1,47 @@ 
+/*
+ * Opus encoder assembly optimizations
+ * Copyright (C) 2017 Ivan Kalvachev <ikalvachev@gmail.com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/opus_pvq.h"
+
+extern float ff_pvq_search_sse2 (float *X, int *y, int K, int N);
+extern float ff_pvq_search_sse42(float *X, int *y, int K, int N);
+extern float ff_pvq_search_avx  (float *X, int *y, int K, int N);
+extern float ff_pvq_search_avx2 (float *X, int *y, int K, int N);
+
+av_cold void ff_opus_dsp_init_x86(CeltPVQ *s)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_SSE2(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse2;
+
+    if (EXTERNAL_SSE42(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse42;
+
+    if (EXTERNAL_AVX(cpu_flags))
+        s->pvq_search = ff_pvq_search_avx;
+
+    if (EXTERNAL_AVX2(cpu_flags))
+        s->pvq_search = ff_pvq_search_avx2;
+}
diff --git a/libavcodec/x86/opus_pvq_search.asm b/libavcodec/x86/opus_pvq_search.asm
new file mode 100644
index 0000000000..a4996529df
--- /dev/null
+++ b/libavcodec/x86/opus_pvq_search.asm
@@ -0,0 +1,597 @@ 
+; Opus encoder assembly optimizations
+; Copyright (C) 2017 Ivan Kalvachev <ikalvachev@gmail.com>
+
+%include "config.asm"
+%include "libavutil/x86/x86util.asm"
+
+
+%define HADDPS_IS_FAST 0
+%define PHADDD_IS_FAST 0
+%define BLENDVPS_IS_FAST 01
+%define PBLENDVB_IS_FAST 01
+%define CONST_IN_X64_REG_IS_FASTER 0
+
+%define STALL_WRITE_FORWARDING 0
+%define SHORT_SYY_UPDATE       01
+
+%define USE_APPROXIMATION  2
+%define PRESEARCH_ROUNDING 01
+%define ALL_FLOAT_PRESEARCH cpuflag(sse42) && 0
+
+
+;
+; Horizontal Sum Packed Single precision float
+; The resulting sum is in all elements.
+;
+%macro HSUMPS 2 ; dst, tmp
+%if cpuflag(sse42) && HADDPS_IS_FAST
+  %if sizeof%1>=32  ; avx mmsize=32
+       vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%1)<<128+hi(%1)
+        addps       %1,   %2
+  %endif            ; sse&avx mmsize=16
+        haddps      %1,   %1            ; 5-6c 2/1
+        haddps      %1,   %1
+%elif cpuflag(avx)
+  %if sizeof%1>=32  ; avx
+       vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%1)<<128+hi(%1)
+        addps       %1,   %2
+  %endif
+       vshufps      %2,   %1,   %1,   q1032 ; %2=[b,       a,       d,       c       ]
+        addps       %1,   %2                ; %1=[b+d,     a+c,     d+b,     c+a     ]
+       vshufps      %2,   %1,   %1,   q0321 ; %2=[c+a,     b+d,     a+c,     d+b     ]
+        addps       %1,   %2                ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
+
+%elif 01    ; this form is a bit faster than the short avx-like emulation.
+        movaps      %2,   %1            ;    [d,       c,       b,       a       ]
+        shufps      %1,   %1,   q1032   ; %2=[b,       a,       d,       c       ]
+        addps       %1,   %2            ; %1=[b+d,     a+c,     d+b,     c+a     ]
+        movaps      %2,   %1
+        shufps      %1,   %1,   q0321   ; %2=[c+a,     b+d,     a+c,     d+b     ]
+        addps       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
+        ; all %1 members should be equal for as long as float a+b==b+a
+%else
+    %error pick HSUMPS implementation
+%endif
+%endmacro
+
+
+;
+; Horizontal Sum integer Double word
+; The resulting sum is in all elements.
+;
+%macro HSUMDQ 2 ; dst, tmp
+%if cpuflag(avx) && notcpuflag(avx2) && sizeof%1>=32
+       %error emulating integer 256bit ymm operation with float point op. there will be speed penalty.
+      vperm2f128    %2,   %1,   %1, (0)*16+(1)   ; %2=lo(%2)<<128+hi(%2)
+       paddd        xmm%1,xmm%2
+       phaddd       xmm%1,xmm%1
+       phaddd       xmm%1,xmm%1
+      vinsertf128   %1,   %1,xmm%1,   1 ; ymm, ymm, xmm, im ; 3c   1/1
+%elif cpuflag(ssse3) && PHADDD_IS_FAST
+  %if sizeof%1>=32
+       vperm2i128   %2,   %1,   %1, (0)*16+(1)   ; %2=lo(%2)<<128+hi(%2)
+        paddd       %1,   %2
+  %endif      ;mmsize == 16
+        phaddd      %1,   %1    ;3 cycles on intel
+        phaddd      %1,   %1
+    ;6 cycles total (sse)
+%elif 01
+  %if cpuflag(avx2) && sizeof%1>=32      ;mmsize>=32
+       vperm2i128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%2)<<128+hi(%2)
+        paddd       %1,   %2
+  %endif
+        pshufd      %2,   %1,   q1032   ; %2=[b,       a,       d,       c       ]
+        paddd       %1,   %2            ; %1=[b+d,     a+c,     d+b,     c+a     ]
+        pshufd      %2,   %1,   q0321   ; %2=[c+a,     b+d,     a+c,     d+b     ]
+        paddd       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
+    ;4 cycles total (sse)
+%else
+    %error pick HSUMDQ implementation
+%endif
+%endmacro
+
+
+;
+; Emulate blendvps if not available
+;
+; src_b destroyed when using emulation with logical operands
+; SSE41 blend instruction is hard fixed to use xmm0 as mask
+;
+%macro emu_blendvps 3 ; dst_a, src_b, mask
+%if cpuflag(avx) && BLENDVPS_IS_FAST
+        blendvps    %1,   %1,   %2,   %3
+%elif cpuflag(sse42) && BLENDVPS_IS_FAST
+  %if notcpuflag(avx)
+    %ifnidn %3,xmm0
+      %error sse41 blendvps uses xmm0 as default 3d operand, you used %3
+    %endif
+  %endif
+        blendvps    %1,   %2,   %3
+%elif 01
+        xorps       %2,   %1
+        andps       %2,   %3
+        xorps       %1,   %2
+%else
+    %error pick "blendvps" emulation
+%endif
+%endmacro
+
+
+;
+; Emulate pblendvb if not available
+;
+; src_b destroyed when using emulation with logical operands
+; SSE41 blend instruction is hard fixed to use xmm0 as mask
+;
+%macro emu_pblendvb 3 ; dst_a, src_b, mask
+%if cpuflag(avx2) && PBLENDVB_IS_FAST
+       pblendvb    %1,   %1,   %2,   %3
+;-------------------------
+%elif cpuflag(avx) && notcpuflag(avx2) && PBLENDVB_IS_FAST
+  %if sizeof%1>=32
+;AVX1 does not implement pblendvb that works with ymm
+;AVX1 does not implement pxor that works with ymm either,
+;  so no emulation possible
+; Use float point blend and hope the penalty is not severe
+        %error emulating integer 256bit ymm operation with float point op. there will be speed penalty.
+       vblendvps    %1,   %1,   %2,   %3
+  %else
+       vpblendvb    %1,   %1,   %2,   %3
+  %endif
+;-------------------------
+%elif cpuflag(sse42) && PBLENDVB_IS_FAST
+    %ifnidn %3,xmm0
+      %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
+    %endif
+        pblendvb    %1,   %2,   %3
+;-------------------------
+%elif 01
+;On my cpu, this version is actually faster than pblendvb !
+;pblendvb is much slower than blendvps, go figure!
+        pxor        %2,   %1
+        pand        %2,   %3
+        pxor        %1,   %2
+%else
+    %error pick "pblendvb" emulation
+%endif
+%endmacro
+
+
+;
+; Emulate broadcastss if not available
+;
+%macro emu_broadcastss 2 ; dst_a, src_b
+%if cpuflag(avx2) && 01
+       vbroadcastss %1,   %2            ; ymm, xmm
+
+%elif cpuflag(avx) && 01
+  %ifnum sizeof%2       ; avx1 register
+       vpermilps    xmm%1, xmm%2, q0000 ; xmm, xmm, imm || ymm, ymm, imm
+    %if sizeof%1 >= 32  ;mmsize>=32
+       vinsertf128  %1,   %1,xmm%1,   1 ; ymm, ymm, xmm, im ; 3c   1/1
+;      vperm2f128   %1,   %1,   %1,   0 ; ymm, ymm, ymm     ; 2-3c 1/1
+    %endif
+  %else                 ; avx1 memory
+       vbroadcastss %1,   %2            ; ymm, mm32 || xmm, mm32
+  %endif
+%elif cpuflag(sse) && 1
+  %ifnum sizeof%2       ; sse register
+        shufps      %1,   %2,   %2,   q0000
+  %else                 ; sse memory
+;        xorps       %1,   %1            ; this seems to provide minor speedup
+        movss       %1,   %2            ; this zeroes the other 3 elements
+        shufps      %1,   %1,  0
+  %endif
+%else
+    %error invalid op, SSE needed for emulation
+%endif
+%endmacro
+
+;
+; Emulate pbroadcastd if not available
+;
+%macro emu_pbroadcastd 2 ; dst_a, src_b
+%if cpuflag(avx2) && 01
+       vpbroadcastd %1,   %2
+%elif cpuflag(avx) && mmsize >= 32
+        ;there are no integer operations with 256bit ymm registers on AVX1
+        pshufd      xmm%1, xmm%2,  q0000 ; xmm, xmm, im
+       vinsertf128  %1,   %1,xmm%1,   1  ; ymm, ymm, xmm. im
+        %error emulating integer 256bit ymm operation with float point op. there will be speed penalty.
+%elif cpuflag(sse)
+  %ifnum sizeof%2       ; sse register
+        pshufd      %1,   %2,   q0000
+  %else                 ; sse memory
+;        pxor        %1,   %1            ; this seems to provide minor speedup. On intel it is direct op, so no micro-ops, letency or throughput.
+        movd        %1,   %2            ; movd zero extends
+        pshufd      %1,   %1,   0
+  %endif
+%else
+    %error invalid op, SSE2 needed
+%endif
+%endmacro
+
+
+;big ugly hack
+%if cpuflag(avx) && notcpuflag(avx2) && mmsize == 32
+; AVX1 does not implement integer operations with 256bit ymm registers
+; Some operations however could be emulated with float point operations
+  %define ALL_FLOAT_PRESEARCH 1
+  %define pxor xorps
+  %define por orps
+  %define pand andps
+  %define vextracti128 vextractf128
+  %define emu_pblendvb emu_blendvps
+  %define emu_pbroadcastd emu_broadcastss
+%endif
+
+
+SECTION_RODATA 64
+
+const_float_abs_mask:   times 8 dd 0x7fffffff
+const_align_abs_edge:   times 8 dd 0
+
+const_float_0_5:        times 8 dd 0.5
+const_float_1111:       times 8 dd 1.0
+const_float_2:          times 8 dd 2.0
+const_float_sign_mask:  times 8 dd 0x80000000
+
+const_int32_offsets:
+                        %rep 8
+                                dd $-const_int32_offsets
+                        %endrep
+SECTION .text
+
+
+
+
+
+%macro PULSES_SEARCH 1
+; m6 Syy_norm
+; m7 Sxy_norm
+        xor         r4q,  r4q
+        pxor        m1,   m1            ; max_idx
+
+        xorps       m3,   m3            ; p_max
+        addps       m6,   mm_const_float_1111; Syy_norm += 1.0
+align 16
+%%distortion_search:
+        movd        xmm2, dword r4d     ; movd zero extends
+%ifidn %1,add
+        movaps      m4,   [outYq + r4q] ; y[i]
+        movaps      m5,   [tmpX  + r4q] ; X[i]
+        addps       m4,   m6            ; m4 = Syy_new = y[i] + Syy_norm
+        addps       m5,   m7            ; m5 = Sxy_new = X[i] + Sxy_norm
+
+;       addps       m4,   m6, [outYq + r4q] ; m4 = Syy_new = Syy_norm + y[i]
+;       addps       m5,   m7, [tmpX  + r4q] ; m5 = Sxy_new = Sxy_norm + X[i]
+
+;       movaps      m4,   m6            ; Syy_norm
+;       movaps      m5,   m7            ; Sxy_norm
+;       addps       m4,   [outYq + r4q] ; m4 = Syy_new = Syy_norm + y[i]
+;       addps       m5,   [tmpX  + r4q] ; m5 = Sxy_new = Sxy_norm + X[i]
+%else
+        movaps      m5,   [outYq + r4q]         ; m5 = y[i]
+
+        xorps       m0,   m0                    ; m0 = 0;
+        cmpps       m0,   m0,   m5,   1         ; m0 = (0<y)
+
+        subps       m4,   m6,   m5              ; m4 = Syy_new = Syy_norm - y[i]
+        subps       m5,   m7,   [tmpX  + r4q]   ; m5 = Sxy_new = Sxy_norm - X[i]
+        andps       m5,   m0                    ; (0<y)?m5:0
+%endif
+
+%if   USE_APPROXIMATION == 2
+        rsqrtps     m4,   m4
+        mulps       m5,   m4            ; m5 = p = Sxy_new*1/sqrt(Syy)
+%elif USE_APPROXIMATION == 1
+        mulps       m5,   m5
+        rcpps       m4,   m4
+        mulps       m5,   m4
+%else
+        mulps       m5,   m5
+        divps       m5,   m4
+%endif
+    emu_pbroadcastd m2,   xmm2          ; m2=i (all lanes get same values, we add the offset-per-lane, later)
+
+        cmpps       m0,   m3,   m5,   1 ; m0 = (m3 < m5) ; (p_max < p) ; (p > p_max)
+        maxps       m3,   m5        ; m3=max(p_max,p)
+                                    ; maxps here is faster than blendvps, despite blend having lower latency.
+                                    ; maxps is also shorter so the inner loop fits in icache segment
+;   emu_blendvps    m3,   m5,   m0  ; m3 = m0 ? p : p_max
+                                    ; SSE4.1 is using hardcoded xmm0 register as third argument
+%if cpuflag(avx) && notcpuflag(avx2) && mmsize == 32
+    emu_pblendvb    m1,   m2,   m0
+%elif 01
+        pand        m2,   m0    ; This version seems faster than sse41 pblendvb
+        pmaxsw      m1,   m2    ; SSE2 signed word, so it would work for N < 32768/4
+%else
+        pandn       m0,   m1    ; version for r4q decreasing. It's not faster than real blendv.
+        pmaxsw      m0,   m2
+        movdqa      m1,   m0
+%endif
+
+        add         r4q,  mmsize
+        cmp         r4q,  Nq
+        jb          %%distortion_search
+
+        por         m1,   mm_const_int32_offsets    ; max_idx offset per individual lane (skipped in the inner loop)
+        movdqa      m4,   m1                        ; needed for the aligned y[max_idx]+=2; processing
+
+%if mmsize >= 32
+;merge parallel maximums round 2 (4 vs 4)
+
+       vextractf128 xmm5, ymm3, 1       ; xmm5 = ymm3[1x128] = ymm3[255..128b]
+        cmpps       xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[0x128] < p[1x128] )
+
+       vextracti128 xmm2, ymm1, 1       ; xmm2 = ymm1[1x128] = ymm1[255..128b]
+    emu_blendvps    xmm3, xmm5, xmm0    ; max_idx = m0 ? max_idx[1x128] : max_idx[0x128]
+    emu_pblendvb    xmm1, xmm2, xmm0    ; p       = m0 ? p[1x128]       : p[0x128]
+%endif
+
+;merge parallel maximums round 1 (2 vs 2)
+                                               ; m3=p[3210]
+        movhlps     xmm5, xmm3                 ; m5=p[xx32]
+        cmpps       xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[1,0] < p[3,2] )
+
+        pshufd      xmm2, xmm1, q3232
+    emu_blendvps    xmm3, xmm5, xmm0    ; max_idx = m0 ? max_idx[3,2] : max_idx[1,0]
+    emu_pblendvb    xmm1, xmm2, xmm0    ; p       = m0 ? p[3,2]       : p[1,0]
+
+;merge parallel maximums final round (1 vs 1)
+        movaps      xmm0, xmm3          ; m0 = m3[0] = p[0]
+        shufps      xmm3, xmm3, q1111   ; m3 = m3[1] = p[1]
+        cmpless     xmm0, xmm3
+
+        pshufd      xmm2, xmm1, q1111
+    emu_pblendvb    xmm1, xmm2, xmm0
+
+        movd  dword r4d,  xmm1          ; zero extends to the rest of r4q
+
+;pextrw r4d,m1,0 ; extract word (16b) from pos0 to r4d
+
+    emu_broadcastss m3,   [tmpX + r4q]
+        %{1}ps      m7,   m3            ; Sxy +=    X[max_idx]
+
+%if STALL_WRITE_FORWARDING == 0
+  %if SHORT_SYY_UPDATE == 1
+    emu_broadcastss m5,   [outYq + r4q]
+        %{1}ps      m6,   m5            ; Syy += outY[max_idx]
+  %endif
+        ; this code is big and ugly
+        ; but it prevents stalling a store forwarding
+        ; by writing smaller size and then doing
+        ; big load from same address.
+        ; This helps with N<100 that are the typical case.
+    emu_pbroadcastd m1,   xmm1
+        pcmpeqd     m1,   m1,   m4      ; only 1 element matches the address of the new p_max
+
+
+        and         r4q,  ~(mmsize-1)   ; align down, so max_idx is (somewhere) inside mmsize load
+        movaps      m5,   [outYq+r4q]   ; m5 = Y[y3...ym...y0]
+        movaps      m3,   mm_const_float_2  ; m3 = [2.0, 2.0, 2.0, 2.0]
+  %if SHORT_SYY_UPDATE == 0
+        movaps      m2,   m5            ; m2 = m5
+  %endif
+        andps       m3,   m1            ; m3 =  [ 0...2.0...0]
+        %{1}ps      m5,   m3            ; m5 = Y[y3...ym...y0] +/- [0...2.0...0]
+        movaps      [outYq+r4q], m5     ; outY[max_idx] +-= 2.0;
+
+  %if SHORT_SYY_UPDATE == 0
+        andps       m2,   m1            ; m2 = Y[0...ym...0]
+        HSUMPS      m2,   m3            ; m2 = Y[ym, ym, ym, ym];
+        %{1}ps      m6,   m2            ; Syy+=outY[max_idx]
+  %endif
+%else
+        ; this code is shorter, simpler and slower
+        ; because it breaks store forwarding
+        ; it writes smaller size at unaligned address where we later do a bigger aligned load.
+        ; When we have bigger vectors (N>100) the store might complete before we try to load it
+    emu_broadcastss m4,   [outYq + r4q]
+        %{1}ps      m6,   m4            ; Syy += outY[max_idx]
+        %{1}ps      m4,   mm_const_float_2
+        movss       [outYq+r4q], xmm4   ; outY[max_idx] += 2.0;
+%endif
+
+%endmacro
+
+;
+; Pyramid Vector Quantization Search implementation
+;
+; float * inX   - Unaligned (SIMD) access, it will be overread,
+;                 but extra data will be masked away.
+; int32 * outY  - Should be aligned and padded buffer.
+;                 It would be used as temp buffer.
+; uint32 K      - Number of pulses to have after quantizations.
+; uint32 N      - Number of vector elements. Must be 0 < N < 8192
+;
+%macro PVQ_FAST_SEARCH 0
+cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
+%define tmpX rsp
+
+;        movsxdifnidn Nq,  Nd
+        movaps      m0,   [const_float_abs_mask]
+        shl         Nq,   2             ; N *= sizeof(float)
+        mov         r4q,  Nq
+
+        neg         r4q
+        and         r4q,  mmsize-1;
+        add         Nq,   r4q           ; Nq = align(Nq, mmsize)
+        sub         rsp,  Nq            ; allocate tmpX[Nq]
+
+        movups      m3,   [const_align_abs_edge-mmsize+r4q] ; this is the bit mask for the padded read at the end of the input
+
+        lea         r4q,  [Nq-mmsize]   ; Nq is rounded up (aligned up) to mmsize, so r4q can't become negative here, unless N=0.
+        movups      m2,   [inXq + r4q]
+        andps       m2,   m3
+        movaps      [tmpX+r4q], m2
+        movaps      m1,   m2            ; Sx
+
+align 16
+%%loop_abs_sum:
+        sub         r4q,  mmsize
+        jc          %%end_loop_abs_sum
+
+        movups      m2,   [inXq + r4q]
+        andps       m2,   m0
+
+        movaps      [tmpX+r4q], m2      ; tmpX[i]=abs(X[i])
+        addps       m1,   m2
+        jmp         %%loop_abs_sum
+
+
+align 16
+%%end_loop_abs_sum
+
+        HSUMPS      m1,   m2            ; m1  = Sx
+
+        xorps       m0,   m0
+        comiss      xmm0, xmm1          ;
+        jz          %%zero_input        ; if (Sx==0) goto zero_input
+
+        cvtsi2ss    xmm0, dword Kd      ; m0  = K
+        divss       xmm0, xmm1          ; m0  = K/Sx
+    emu_broadcastss m0,   xmm0
+
+        lea         r4q,  [Nq-mmsize]
+        pxor        m5,   m5            ; Sy    ( Sum of y[i]      )
+        xorps       m6,   m6            ; Syy   ( Sum of y[i]*y[i] )
+        xorps       m7,   m7            ; Sxy   ( Sum of X[i]*y[i] )
+align 16
+%%loop_guess
+        movaps      m1,   [tmpX + r4q]  ; m1   = X[i]
+        mulps       m2,   m0,   m1      ; m2   = res*X[i]
+%if ALL_FLOAT_PRESEARCH == 1
+  %if PRESEARCH_ROUNDING == 0
+        roundps     m2,   m2,   0       ; yt   = (float)truncf( res*X[i] )
+  %else
+        roundps     m2,   m2,   3       ; yt   = (float)lrintf( res*X[i] )
+  %endif
+        addps       m5,   m2            ; Sy  += yt
+%else
+  %if PRESEARCH_ROUNDING == 0
+        cvttps2dq   m2,   m2            ; yt   = (int)truncf( res*X[i] )
+  %else
+        cvtps2dq    m2,   m2            ; yt   = (int)lrintf( res*X[i] )
+  %endif
+        paddd       m5,   m2            ; Sy  += yt
+        cvtdq2ps    m2,   m2            ; yt   = (float)yt
+%endif
+        mulps       m1,   m2            ; m1   = X[i]*yt
+        mulps       m3,   m2,   m2      ; m3   = yt*yt
+        addps       m7,   m1            ; Sxy += m1;
+        addps       m6,   m3            ; Syy += m3
+        addps       m2,   m2            ; m2   = 2*yt
+        movaps      [outYq + r4q], m2   ; y[i] = m2
+
+        sub         r4q,  mmsize
+        jnc         %%loop_guess
+;        add         r4q,  mmsize
+;        cmp         r4q,  Nq
+;        jb          %%loop_guess
+
+
+%if ALL_FLOAT_PRESEARCH == 1
+        HSUMPS      m5,   m4        ; pulses
+%else
+        HSUMDQ      m5,   m4        ; pulses
+%endif
+        HSUMPS      m6,   m1        ; Syy_norm
+        HSUMPS      m7,   m0        ; Sxy_norm
+
+%if num_mmregs > 8 && CONST_IN_X64_REG_IS_FASTER ; it works faster for me with mem access. ^_^
+  %define  mm_const_float_1111      m11
+  %define  mm_const_float_2         m12
+  %define  mm_const_int32_offsets   m13
+        movaps      m11,  [const_float_1111]
+        movaps      m12,  [const_float_2]
+        movdqa      m13,  [const_int32_offsets]
+%else
+  %define  mm_const_float_1111     [const_float_1111]
+  %define  mm_const_float_2        [const_float_2]
+  %define  mm_const_int32_offsets  [const_int32_offsets]
+%endif
+
+%if ALL_FLOAT_PRESEARCH == 1
+        cvtss2si    r4q,  xmm5
+%else
+        movd  dword r4d,  xmm5      ; zero extends to the rest of r4q
+%endif
+
+        sub         Kq,   r4q       ; K -= pulses
+        jc          %%remove_pulses_loop
+        jz          %%finish
+
+
+align 16
+%%add_pulses_loop:
+        PULSES_SEARCH add   ; m6 Syy_norm ; m7 Sxy_norm
+
+        sub         Kq,   1
+        jnz         %%add_pulses_loop
+
+        jmp         %%finish
+
+align 16
+%%remove_pulses_loop:
+        PULSES_SEARCH sub   ; m6 Syy_norm ; m7 Sxy_norm
+
+        add         Kq,   1
+        jnz         %%remove_pulses_loop
+
+
+%%finish:
+align 16
+        lea         r4q,  [Nq-mmsize]
+
+        movaps      m2,   [const_float_sign_mask]
+        movaps      m4,   [const_float_0_5]
+align 16
+%%restore_sign_loop:
+        movaps      m0,   [outYq + r4q] ; m0 = Y[i]
+        movups      m1,   [inXq  + r4q] ; m1 = X[i]
+        andps       m1,   m2            ; m1 = sign(X[i])
+        mulps       m0,   m4            ; m0 = Y[i]*0.5
+        orps        m0,   m1            ; m0 = Y[i]*0.5*sign
+        cvtps2dq    m3,   m0            ; m3 = (int)m0
+        movaps      [outYq + r4q], m3
+
+        sub         r4q,  mmsize
+        jnc         %%restore_sign_loop
+%%return
+;release the temp buffer and restore the stack
+        add         rsp,  Nq        ; free tmpX, Nq is the same as the one used to allocate it (after Nq has been aligned up)
+
+%if ARCH_X86_64 == 0    ;sbrdsp
+        movss       r0m,  xmm6        ; return (float)Syy_norm
+        fld         dword r0m
+%else
+        movaps      m0,   m6        ; return (float)Syy_norm
+%endif
+        RET
+
+align 16
+%%zero_input:
+        lea         r4q,  [Nq-mmsize]
+        xorps       m0,   m0
+%%zero_loop:
+        movaps      [outYq + r4q], m0
+        sub         r4q,  mmsize
+        jnc         %%zero_loop
+
+        movaps      m6,   [const_float_1111]
+        jmp         %%return
+%endmacro
+
+
+INIT_XMM sse2
+PVQ_FAST_SEARCH
+
+INIT_XMM sse42
+PVQ_FAST_SEARCH
+
+INIT_XMM avx
+PVQ_FAST_SEARCH
+
+INIT_YMM avx2
+PVQ_FAST_SEARCH
-- 
2.13.0