
[FFmpeg-devel,WIP] Opus Pyramid Vector Quantization Search in x86 SIMD asm

Message ID CABA=pqdn-csj0MhTGiH=RTjdpRG+xCf6_qZjw85XMX8K=206nQ@mail.gmail.com
State Superseded

Commit Message

Ivan Kalvachev June 8, 2017, 10:36 p.m. UTC
At the request of Rostislav Pehlivanov (atomnuker), I've been working on
an SSE/AVX accelerated version of pvq_search().

The attached patch is my work so far.

At the moment, for me, at the default bitrate
the function is 2.5 times faster than the current C version
(my CPU is Intel Westmere class).
The total encoding time is about 5% shorter.

I'd like some more benchmarks on different CPUs
and maybe some advice on how to improve it.
(I've left some alternative methods in the code
that can easily be switched with defines, as they
may be faster on other CPUs.)
(I'm also quite afraid of how fast it would run on pre-Ryzen AMD CPUs.)


The code generates 4 variants: SSE2, SSE4.2, AVX and 256-bit AVX2.
I haven't tested the AVX variants on a real CPU myself;
I used Intel SDE to develop and test them.
Rostislav (atomnuker) reported some crashes with the 256-bit AVX2
variant; however, that might be related to the clang tools.




Below are some broad descriptions:

The typical use of the function at the default bitrate (96 kbps) is
K<=36 and N<=32.
N is the size of the input array (the vector), K is the number of
pulses that should be present in the output (the sum of the absolute
values of the output elements is K).
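
(For example, with N=4 and K=3 a valid output is y = [2, 0, -1, 0],
since |2| + |0| + |-1| + |0| == 3.)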

In synthetic tests, the SIMD function can be 8-12 times faster at the
maximum N=176.
I've been told that bigger sizes are more common in low-bitrate encodes
and will become more common with the upcoming RDO improvements.

A short description of how the function works:
1. A loop that calculates the sum (Sx) of the input elements (inX[]).
The same loop fills a stack-allocated temp buffer (tmpX) that is
aligned to mmsize and contains the absolute values of inX[i].

2. Pre-search loop. It uses K/Sx as an approximation of the vector gain
and fills the output vector outY[] based on it. The output is integer,
but we use outY[] to temporarily store the doubled y values as floats
(we need 2*y for the calculations). This loop also calculates a few
parameters that are needed for the distortion calculations later
(Syy = Sum of y[i]^2 ; Sxy = Sum of inX[i]*y[i]).

3. Addition of missing pulses or elimination of extra ones.
The C function uses the variable "phase" to signal whether pulses
should be added or removed; I've split this into two separate cases.
The code is shared through the macro PULSES_SEARCH.
Each case consists of 2 loops. The outer loop runs until we have K
pulses in the output.
The inner loop calculates the distortion parameter for each element and
picks the best one
(parallel search, combination of the parallel results, update of the
variables). A scalar C sketch of the whole search is given after step 5.

4. When we are done, we do one more loop to convert outY[] back to
integers and to restore the signs (using the original inX[]).

5. There is a special case when Sx==0, which happens if all elements of
the input are zeroes (in theory the input should be normalized, meaning
Sum of X[i]^2 == 1.0). In this case we return a zero output and 1.0 as
gain.
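
To make steps 1-5 easier to follow, here is a rough scalar sketch in C
of what the function computes. This is NOT the code from the patch and
the names are mine; it corresponds to the exact division metric
(USE_APPROXIMATION 0) and lrintf() rounding (PRESEARCH_ROUNDING 1):

#include <math.h>

/* Rough scalar equivalent of the search (illustration only). X is the
 * input vector, y receives the K pulses, the return value is Syy so
 * the caller can do gain /= sqrtf(...) as in celt_alg_quant(). */
float pvq_search_sketch(const float *X, int *y, int K, int N)
{
    float Sx = 0.0f, Sxy = 0.0f, Syy = 0.0f;
    int i, pulses = 0;

    for (i = 0; i < N; i++)                  /* step 1: Sx = sum of |X[i]| */
        Sx += fabsf(X[i]);

    if (Sx == 0.0f) {                        /* step 5: all-zero input */
        for (i = 0; i < N; i++)
            y[i] = 0;
        return 1.0f;
    }

    for (i = 0; i < N; i++) {                /* step 2: pre-search guess */
        y[i]    = (int)lrintf(K / Sx * fabsf(X[i]));
        pulses += y[i];
        Sxy    += fabsf(X[i]) * y[i];
        Syy    += (float)y[i] * y[i];
    }

    while (pulses < K) {                     /* step 3: add missing pulses */
        int   best   = 0;
        float best_p = -1.0f;
        for (i = 0; i < N; i++) {
            float Sxy_new = Sxy + fabsf(X[i]);
            float Syy_new = Syy + 2 * y[i] + 1;   /* (y+1)^2 = y^2 + 2y + 1 */
            float p       = Sxy_new * Sxy_new / Syy_new;
            if (p > best_p) {
                best_p = p;
                best   = i;
            }
        }
        Sxy += fabsf(X[best]);
        Syy += 2 * y[best] + 1;
        y[best]++;
        pulses++;
    }

    while (pulses > K) {                     /* step 3: remove extra pulses */
        int   best   = 0;
        float best_p = -1.0f;
        for (i = 0; i < N; i++) {
            float Sxy_new, Syy_new, p;
            if (y[i] <= 0)                   /* can only remove existing pulses */
                continue;
            Sxy_new = Sxy - fabsf(X[i]);
            Syy_new = Syy - 2 * y[i] + 1;    /* (y-1)^2 = y^2 - 2y + 1 */
            p       = Sxy_new * Sxy_new / Syy_new;
            if (p > best_p) {
                best_p = p;
                best   = i;
            }
        }
        Sxy -= fabsf(X[best]);
        Syy -= 2 * y[best] - 1;
        y[best]--;
        pulses--;
    }

    for (i = 0; i < N; i++)                  /* step 4: restore the signs */
        if (X[i] < 0.0f)
            y[i] = -y[i];

    return Syy;
}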

---
Now, I've left some defines that change the generated code.

HADDPS_IS_FAST
PHADDD_IS_FAST
I've implemented my own horizontal sum macros, and while doing it I
discovered that on my CPU (Westmere class) using the "new" SSE4.2
instructions is not faster than using my own code for doing the same.
It's not speed critical, since horizontal sums are used only 3-4 times
per function call.
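
For reference, the shuffle-based fallback that HSUMPS uses is
equivalent to the following SSE intrinsics (a sketch of the same idea,
not code from the patch):

#include <xmmintrin.h>

/* Broadcast the horizontal sum of v = [a,b,c,d] into all four lanes,
 * using the same two-shuffle pattern as the HSUMPS fallback. */
static inline __m128 hsum_ps(__m128 v)
{
    __m128 t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)); /* [c,d,a,b]         */
    v = _mm_add_ps(v, t);                                     /* [a+c,b+d,c+a,d+b] */
    t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 3, 2, 1));        /* rotate by one     */
    return _mm_add_ps(v, t);                                  /* all = a+b+c+d     */
}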

BLENDVPS_IS_FAST
PBLENDVB_IS_FAST
I think that blend on my CPU is faster than the alternative version
that I've implemented. However, I'm not sure this is true for all CPUs,
since a number of modern CPUs have latency=3 and inv_throughput=2 for
it (that's 2 clocks until another blend can start).
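
The alternative version is the classic xor/and/xor select used by the
emu_blendvps macro in the patch; in SSE intrinsics the same trick looks
roughly like this (illustration only):

#include <xmmintrin.h>

/* Per-lane select: result = mask ? b : a. Like the macro in the patch,
 * the emulation clobbers its second source. */
static inline __m128 blendv_ps_emu(__m128 a, __m128 b, __m128 mask)
{
    b = _mm_xor_ps(b, a);        /* b = a ^ b            */
    b = _mm_and_ps(b, mask);     /* b = (a ^ b) & mask   */
    return _mm_xor_ps(a, b);     /* a ^ ((a ^ b) & mask) */
}

Note that real blendvps only checks the sign bit of each mask lane,
while this emulation needs all-ones/all-zeros lanes, which is what the
cmpps results used in the patch provide.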

CONST_IN_X64_REG_IS_FASTER
The function is implemented so that only 8 registers are used. With
this define, the constants used during PULSES_SEARCH are loaded into
the upper registers available on x86_64. I could not determine whether
it is faster to do so... it should be, but sometimes I got the opposite
result. I'd probably enable it in the final version.

STALL_WRITE_FORWARDING
After the inner search finds the maximum, we add/remove a pulse in
outY[i]. However, writing a single element (sizeof(float)=4) can stall
the wide load done in the inner loop (mmsize=16), because the small
store cannot be forwarded to the larger load. This hurts a lot more at
small vector sizes.
On Skylake the penalty is only 11 cycles and Ryzen should have no
penalty at all, but older CPUs can have a penalty of up to 200 cycles.

SHORT_SYY_UPDATE
This define only has meaning when STALL_WRITE_FORWARDING is 0 (i.e.
when the longer code that avoids the stall is used).
It saves a few instructions by loading the old outY[] value with a
scalar load, instead of using HSUMPS and some 'haddps' to calculate it.
So far it looks like the short update is always faster, but I've left
the other variant just in case...

USE_APPROXIMATION
This controls the method used to calculate the distortion parameter.
"0" uses 1 multiplication and 1 division, which can be a lot slower
(14;14 cycles on my CPU, 11;7 on Skylake).
"1" uses 2 multiplications and 1 reciprocal op, which is a lot faster
than a real division but gives only half precision.
"2" uses 1 multiplication and 1 reciprocal square root op, which is
literally 1 cycle, but again gives only half precision.
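
In SSE intrinsics the three variants correspond roughly to the
following (illustration only; Sxy and Syy stand for the
Sxy_new/Syy_new vectors computed in the inner loop):

#include <xmmintrin.h>

/* Rank a candidate position; a larger value means a better position.
 * Variant 2 ranks the same as 0/1 because Sxy >= 0 here, so maximizing
 * Sxy/sqrt(Syy) is equivalent to maximizing Sxy^2/Syy. */
static inline __m128 distortion_p(__m128 Sxy, __m128 Syy, int variant)
{
    if (variant == 0)
        return _mm_div_ps(_mm_mul_ps(Sxy, Sxy), Syy);              /* exact        */
    if (variant == 1)
        return _mm_mul_ps(_mm_mul_ps(Sxy, Sxy), _mm_rcp_ps(Syy));  /* 1/Syy approx */
    return _mm_mul_ps(Sxy, _mm_rsqrt_ps(Syy));                     /* 1/sqrt(Syy)  */
}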

PRESEARCH_ROUNDING
This controls the rounding of the gain used for the initial guess.
"0" uses truncf(), which makes sure that the number of pulses never
exceeds K. It gives results identical to the original celt_* functions.
"1" uses lrintf(); this is basically the improvement of the current C
code over the celt_* one.
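
(For example, if K/Sx * |X[i]| == 2.7, truncf() gives 2 while lrintf()
gives 3. The truncated guesses sum to at most K, so that variant never
enters the pulse-removal phase; the rounded guess is closer to the
final result but may overshoot K.)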


ALL_FLOAT_PRESEARCH
The pre-search filling of outY[] can be done entirely with float ops
(using SSE4.2 'roundps' instead of two cvt*). It is mostly useful if
you want to try YMM on AVX1 (AVX1 lacks 256-bit integer ops).
For some reason enabling this makes the whole function 4 times slower
on my CPU. ^_^

I've left some commented out code. I'll remove it for sure in the final version.

I just hope I haven't made some lame mistake at the last minute...

Comments

Michael Niedermayer June 9, 2017, 10:08 a.m. UTC | #1
On Fri, Jun 09, 2017 at 01:36:07AM +0300, Ivan Kalvachev wrote:
> [...]

>  opus_pvq.c              |    9 
>  opus_pvq.h              |    5 
>  x86/Makefile            |    1 
>  x86/opus_dsp_init.c     |   47 +++
>  x86/opus_pvq_search.asm |  597 ++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 657 insertions(+), 2 deletions(-)
> 3b9648bea3f01dad2cf159382f0ffc2d992c84b2  0001-SIMD-opus-pvq_search-implementation.patch
> From 06dc798c302e90aa5b45bec5d8fbcd64ba4af076 Mon Sep 17 00:00:00 2001
> From: Ivan Kalvachev <ikalvachev@gmail.com>
> Date: Thu, 8 Jun 2017 22:24:33 +0300
> Subject: [PATCH 1/3] SIMD opus pvq_search implementation.

seems this breaks the build with mingw64, didn't investigate but it
fails with these errors:

libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
make: *** [ffmpeg_g.exe] Error 1
make: *** Waiting for unfinished jobs....
make: *** [ffprobe_g.exe] Error 1


[...]

Patch

From 06dc798c302e90aa5b45bec5d8fbcd64ba4af076 Mon Sep 17 00:00:00 2001
From: Ivan Kalvachev <ikalvachev@gmail.com>
Date: Thu, 8 Jun 2017 22:24:33 +0300
Subject: [PATCH 1/3] SIMD opus pvq_search implementation.

---
 libavcodec/opus_pvq.c              |   9 +
 libavcodec/opus_pvq.h              |   5 +-
 libavcodec/x86/Makefile            |   1 +
 libavcodec/x86/opus_dsp_init.c     |  47 +++
 libavcodec/x86/opus_pvq_search.asm | 597 +++++++++++++++++++++++++++++++++++++
 5 files changed, 657 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/x86/opus_dsp_init.c
 create mode 100644 libavcodec/x86/opus_pvq_search.asm

diff --git a/libavcodec/opus_pvq.c b/libavcodec/opus_pvq.c
index 2ac66a0ede..172223e241 100644
--- a/libavcodec/opus_pvq.c
+++ b/libavcodec/opus_pvq.c
@@ -418,7 +418,13 @@  static uint32_t celt_alg_quant(OpusRangeCoder *rc, float *X, uint32_t N, uint32_
     int *y = pvq->qcoeff;
 
     celt_exp_rotation(X, N, blocks, K, spread, 1);
+
+    {
+    START_TIMER
     gain /= sqrtf(pvq->pvq_search(X, y, K, N));
+    STOP_TIMER("pvq_search");
+    }
+
     celt_encode_pulses(rc, y,  N, K);
     celt_normalize_residual(y, X, N, gain);
     celt_exp_rotation(X, N, blocks, K, spread, 0);
@@ -947,6 +953,9 @@  int av_cold ff_celt_pvq_init(CeltPVQ **pvq)
     s->encode_band        = pvq_encode_band;
     s->band_cost          = pvq_band_cost;
 
+    if (ARCH_X86)
+        ff_opus_dsp_init_x86(s);
+
     *pvq = s;
 
     return 0;
diff --git a/libavcodec/opus_pvq.h b/libavcodec/opus_pvq.h
index 6691494838..9246337360 100644
--- a/libavcodec/opus_pvq.h
+++ b/libavcodec/opus_pvq.h
@@ -33,8 +33,8 @@ 
                                        float *lowband_scratch, int fill)
 
 struct CeltPVQ {
-    DECLARE_ALIGNED(32, int,   qcoeff      )[176];
-    DECLARE_ALIGNED(32, float, hadamard_tmp)[176];
+    DECLARE_ALIGNED(32, int,   qcoeff      )[256];
+    DECLARE_ALIGNED(32, float, hadamard_tmp)[256];
 
     float (*pvq_search)(float *X, int *y, int K, int N);
 
@@ -45,6 +45,7 @@  struct CeltPVQ {
 };
 
 int  ff_celt_pvq_init  (struct CeltPVQ **pvq);
+void ff_opus_dsp_init_x86(struct CeltPVQ *s);
 void ff_celt_pvq_uninit(struct CeltPVQ **pvq);
 
 #endif /* AVCODEC_OPUS_PVQ_H */
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 710e48b15f..f50a48dc13 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -51,6 +51,7 @@  OBJS-$(CONFIG_APNG_DECODER)            += x86/pngdsp_init.o
 OBJS-$(CONFIG_CAVS_DECODER)            += x86/cavsdsp.o
 OBJS-$(CONFIG_DCA_DECODER)             += x86/dcadsp_init.o x86/synth_filter_init.o
 OBJS-$(CONFIG_DNXHD_ENCODER)           += x86/dnxhdenc_init.o
+OBJS-$(CONFIG_OPUS_ENCODER)            += x86/opus_dsp_init.o x86/opus_pvq_search.o
 OBJS-$(CONFIG_HEVC_DECODER)            += x86/hevcdsp_init.o
 OBJS-$(CONFIG_JPEG2000_DECODER)        += x86/jpeg2000dsp_init.o
 OBJS-$(CONFIG_MLP_DECODER)             += x86/mlpdsp_init.o
diff --git a/libavcodec/x86/opus_dsp_init.c b/libavcodec/x86/opus_dsp_init.c
new file mode 100644
index 0000000000..ddccf6fe9e
--- /dev/null
+++ b/libavcodec/x86/opus_dsp_init.c
@@ -0,0 +1,47 @@ 
+/*
+ * Opus encoder assembly optimizations
+ * Copyright (C) 2017 Ivan Kalvachev <ikalvachev@gmail.com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/opus_pvq.h"
+
+extern float ff_pvq_search_sse2 (float *X, int *y, int K, int N);
+extern float ff_pvq_search_sse42(float *X, int *y, int K, int N);
+extern float ff_pvq_search_avx  (float *X, int *y, int K, int N);
+extern float ff_pvq_search_avx2 (float *X, int *y, int K, int N);
+
+av_cold void ff_opus_dsp_init_x86(CeltPVQ *s)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_SSE2(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse2;
+
+    if (EXTERNAL_SSE42(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse42;
+
+    if (EXTERNAL_AVX(cpu_flags))
+        s->pvq_search = ff_pvq_search_avx;
+
+    if (EXTERNAL_AVX2(cpu_flags))
+        s->pvq_search = ff_pvq_search_avx2;
+}
diff --git a/libavcodec/x86/opus_pvq_search.asm b/libavcodec/x86/opus_pvq_search.asm
new file mode 100644
index 0000000000..a4996529df
--- /dev/null
+++ b/libavcodec/x86/opus_pvq_search.asm
@@ -0,0 +1,597 @@ 
+; Opus encoder assembly optimizations
+; Copyright (C) 2017 Ivan Kalvachev <ikalvachev@gmail.com>
+
+%include "config.asm"
+%include "libavutil/x86/x86util.asm"
+
+
+%define HADDPS_IS_FAST 0
+%define PHADDD_IS_FAST 0
+%define BLENDVPS_IS_FAST 01
+%define PBLENDVB_IS_FAST 01
+%define CONST_IN_X64_REG_IS_FASTER 0
+
+%define STALL_WRITE_FORWARDING 0
+%define SHORT_SYY_UPDATE       01
+
+%define USE_APPROXIMATION  2
+%define PRESEARCH_ROUNDING 01
+%define ALL_FLOAT_PRESEARCH cpuflag(sse42) && 0
+
+
+;
+; Horizontal Sum Packed Single precision float
+; The resulting sum is in all elements.
+;
+%macro HSUMPS 2 ; dst, tmp
+%if cpuflag(sse42) && HADDPS_IS_FAST
+  %if sizeof%1>=32  ; avx mmsize=32
+       vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%1)<<128+hi(%1)
+        addps       %1,   %2
+  %endif            ; sse&avx mmsize=16
+        haddps      %1,   %1            ; 5-6c 2/1
+        haddps      %1,   %1
+%elif cpuflag(avx)
+  %if sizeof%1>=32  ; avx
+       vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%1)<<128+hi(%1)
+        addps       %1,   %2
+  %endif
+       vshufps      %2,   %1,   %1,   q1032 ; %2=[b,       a,       d,       c       ]
+        addps       %1,   %2                ; %1=[b+d,     a+c,     d+b,     c+a     ]
+       vshufps      %2,   %1,   %1,   q0321 ; %2=[c+a,     b+d,     a+c,     d+b     ]
+        addps       %1,   %2                ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
+
+%elif 01    ; this form is a bit faster than the short avx-like emulation.
+        movaps      %2,   %1            ;    [d,       c,       b,       a       ]
+        shufps      %1,   %1,   q1032   ; %2=[b,       a,       d,       c       ]
+        addps       %1,   %2            ; %1=[b+d,     a+c,     d+b,     c+a     ]
+        movaps      %2,   %1
+        shufps      %1,   %1,   q0321   ; %2=[c+a,     b+d,     a+c,     d+b     ]
+        addps       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
+        ; all %1 members should be equal for as long as float a+b==b+a
+%else
+    %error pick HSUMPS implementation
+%endif
+%endmacro
+
+
+;
+; Horizontal Sum integer Double word
+; The resulting sum is in all elements.
+;
+%macro HSUMDQ 2 ; dst, tmp
+%if cpuflag(avx) && notcpuflag(avx2) && sizeof%1>=32
+       %error emulating integer 256bit ymm operation with float point op. there will be speed penalty.
+      vperm2f128    %2,   %1,   %1, (0)*16+(1)   ; %2=lo(%2)<<128+hi(%2)
+       paddd        xmm%1,xmm%2
+       phaddd       xmm%1,xmm%1
+       phaddd       xmm%1,xmm%1
+      vinsertf128   %1,   %1,xmm%1,   1 ; ymm, ymm, xmm, im ; 3c   1/1
+%elif cpuflag(ssse3) && PHADDD_IS_FAST
+  %if sizeof%1>=32
+       vperm2i128   %2,   %1,   %1, (0)*16+(1)   ; %2=lo(%2)<<128+hi(%2)
+        paddd       %1,   %2
+  %endif      ;mmsize == 16
+        phaddd      %1,   %1    ;3 cycles on intel
+        phaddd      %1,   %1
+    ;6 cycles total (sse)
+%elif 01
+  %if cpuflag(avx2) && sizeof%1>=32      ;mmsize>=32
+       vperm2i128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%2)<<128+hi(%2)
+        paddd       %1,   %2
+  %endif
+        pshufd      %2,   %1,   q1032   ; %2=[b,       a,       d,       c       ]
+        paddd       %1,   %2            ; %1=[b+d,     a+c,     d+b,     c+a     ]
+        pshufd      %2,   %1,   q0321   ; %2=[c+a,     b+d,     a+c,     d+b     ]
+        paddd       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
+    ;4 cycles total (sse)
+%else
+    %error pick HSUMDQ implementation
+%endif
+%endmacro
+
+
+;
+; Emulate blendvps if not available
+;
+; src_b destroyed when using emulation with logical operands
+; SSE41 blend instruction is hard fixed to use xmm0 as mask
+;
+%macro emu_blendvps 3 ; dst_a, src_b, mask
+%if cpuflag(avx) && BLENDVPS_IS_FAST
+        blendvps    %1,   %1,   %2,   %3
+%elif cpuflag(sse42) && BLENDVPS_IS_FAST
+  %if notcpuflag(avx)
+    %ifnidn %3,xmm0
+      %error sse41 blendvps uses xmm0 as default 3d operand, you used %3
+    %endif
+  %endif
+        blendvps    %1,   %2,   %3
+%elif 01
+        xorps       %2,   %1
+        andps       %2,   %3
+        xorps       %1,   %2
+%else
+    %error pick "blendvps" emulation
+%endif
+%endmacro
+
+
+;
+; Emulate pblendvb if not available
+;
+; src_b destroyed when using emulation with logical operands
+; SSE41 blend instruction is hard fixed to use xmm0 as mask
+;
+%macro emu_pblendvb 3 ; dst_a, src_b, mask
+%if cpuflag(avx2) && PBLENDVB_IS_FAST
+       pblendvb    %1,   %1,   %2,   %3
+;-------------------------
+%elif cpuflag(avx) && notcpuflag(avx2) && PBLENDVB_IS_FAST
+  %if sizeof%1>=32
+;AVX1 does not implement pblendvb that works with ymm
+;AVX1 does not implement pxor that works with ymm either,
+;  so no emulation possible
+; Use float point blend and hope the penalty is not severe
+        %error emulating integer 256bit ymm operation with float point op. there will be speed penalty.
+       vblendvps    %1,   %1,   %2,   %3
+  %else
+       vpblendvb    %1,   %1,   %2,   %3
+  %endif
+;-------------------------
+%elif cpuflag(sse42) && PBLENDVB_IS_FAST
+    %ifnidn %3,xmm0
+      %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
+    %endif
+        pblendvb    %1,   %2,   %3
+;-------------------------
+%elif 01
+;On my cpu, this version is actually faster than pblendvb !
+;pblendvb is much slower than blendvps, go figure!
+        pxor        %2,   %1
+        pand        %2,   %3
+        pxor        %1,   %2
+%else
+    %error pick "pblendvb" emulation
+%endif
+%endmacro
+
+
+;
+; Emulate broadcastss if not available
+;
+%macro emu_broadcastss 2 ; dst_a, src_b
+%if cpuflag(avx2) && 01
+       vbroadcastss %1,   %2            ; ymm, xmm
+
+%elif cpuflag(avx) && 01
+  %ifnum sizeof%2       ; avx1 register
+       vpermilps    xmm%1, xmm%2, q0000 ; xmm, xmm, imm || ymm, ymm, imm
+    %if sizeof%1 >= 32  ;mmsize>=32
+       vinsertf128  %1,   %1,xmm%1,   1 ; ymm, ymm, xmm, im ; 3c   1/1
+;      vperm2f128   %1,   %1,   %1,   0 ; ymm, ymm, ymm     ; 2-3c 1/1
+    %endif
+  %else                 ; avx1 memory
+       vbroadcastss %1,   %2            ; ymm, mm32 || xmm, mm32
+  %endif
+%elif cpuflag(sse) && 1
+  %ifnum sizeof%2       ; sse register
+        shufps      %1,   %2,   %2,   q0000
+  %else                 ; sse memory
+;        xorps       %1,   %1            ; this seems to provide minor speedup
+        movss       %1,   %2            ; this zeroes the other 3 elements
+        shufps      %1,   %1,  0
+  %endif
+%else
+    %error invalid op, SSE needed for emulation
+%endif
+%endmacro
+
+;
+; Emulate pbroadcastd if not available
+;
+%macro emu_pbroadcastd 2 ; dst_a, src_b
+%if cpuflag(avx2) && 01
+       vpbroadcastd %1,   %2
+%elif cpuflag(avx) && mmsize >= 32
+        ;there are no integer operations with 256bit ymm registers on AVX1
+        pshufd      xmm%1, xmm%2,  q0000 ; xmm, xmm, im
+       vinsertf128  %1,   %1,xmm%1,   1  ; ymm, ymm, xmm. im
+        %error emulating integer 256bit ymm operation with float point op. there will be speed penalty.
+%elif cpuflag(sse)
+  %ifnum sizeof%2       ; sse register
+        pshufd      %1,   %2,   q0000
+  %else                 ; sse memory
+;        pxor        %1,   %1            ; this seems to provide minor speedup. On intel it is direct op, so no micro-ops, letency or throughput.
+        movd        %1,   %2            ; movd zero extends
+        pshufd      %1,   %1,   0
+  %endif
+%else
+    %error invalid op, SSE2 needed
+%endif
+%endmacro
+
+
+;big ugly hack
+%if cpuflag(avx) && notcpuflag(avx2) && mmsize == 32
+; AVX1 does not implement integer operations with 256bit ymm registers
+; Some operations however could be emulated with float point operations
+  %define ALL_FLOAT_PRESEARCH 1
+  %define pxor xorps
+  %define por orps
+  %define pand andps
+  %define vextracti128 vextractf128
+  %define emu_pblendvb emu_blendvps
+  %define emu_pbroadcastd emu_broadcastss
+%endif
+
+
+SECTION_RODATA 64
+
+const_float_abs_mask:   times 8 dd 0x7fffffff
+const_align_abs_edge:   times 8 dd 0
+
+const_float_0_5:        times 8 dd 0.5
+const_float_1111:       times 8 dd 1.0
+const_float_2:          times 8 dd 2.0
+const_float_sign_mask:  times 8 dd 0x80000000
+
+const_int32_offsets:
+                        %rep 8
+                                dd $-const_int32_offsets
+                        %endrep
+SECTION .text
+
+
+
+
+
+%macro PULSES_SEARCH 1
+; m6 Syy_norm
+; m7 Sxy_norm
+        xor         r4q,  r4q
+        pxor        m1,   m1            ; max_idx
+
+        xorps       m3,   m3            ; p_max
+        addps       m6,   mm_const_float_1111; Syy_norm += 1.0
+align 16
+%%distortion_search:
+        movd        xmm2, dword r4d     ; movd zero extends
+%ifidn %1,add
+        movaps      m4,   [outYq + r4q] ; y[i]
+        movaps      m5,   [tmpX  + r4q] ; X[i]
+        addps       m4,   m6            ; m4 = Syy_new = y[i] + Syy_norm
+        addps       m5,   m7            ; m5 = Sxy_new = X[i] + Sxy_norm
+
+;       addps       m4,   m6, [outYq + r4q] ; m4 = Syy_new = Syy_norm + y[i]
+;       addps       m5,   m7, [tmpX  + r4q] ; m5 = Sxy_new = Sxy_norm + X[i]
+
+;       movaps      m4,   m6            ; Syy_norm
+;       movaps      m5,   m7            ; Sxy_norm
+;       addps       m4,   [outYq + r4q] ; m4 = Syy_new = Syy_norm + y[i]
+;       addps       m5,   [tmpX  + r4q] ; m5 = Sxy_new = Sxy_norm + X[i]
+%else
+        movaps      m5,   [outYq + r4q]         ; m5 = y[i]
+
+        xorps       m0,   m0                    ; m0 = 0;
+        cmpps       m0,   m0,   m5,   1         ; m0 = (0<y)
+
+        subps       m4,   m6,   m5              ; m4 = Syy_new = Syy_norm - y[i]
+        subps       m5,   m7,   [tmpX  + r4q]   ; m5 = Sxy_new = Sxy_norm - X[i]
+        andps       m5,   m0                    ; (0<y)?m5:0
+%endif
+
+%if   USE_APPROXIMATION == 2
+        rsqrtps     m4,   m4
+        mulps       m5,   m4            ; m5 = p = Sxy_new*1/sqrt(Syy)
+%elif USE_APPROXIMATION == 1
+        mulps       m5,   m5
+        rcpps       m4,   m4
+        mulps       m5,   m4
+%else
+        mulps       m5,   m5
+        divps       m5,   m4
+%endif
+    emu_pbroadcastd m2,   xmm2          ; m2=i (all lanes get same values, we add the offset-per-lane, later)
+
+        cmpps       m0,   m3,   m5,   1 ; m0 = (m3 < m5) ; (p_max < p) ; (p > p_max)
+        maxps       m3,   m5        ; m3=max(p_max,p)
+                                    ; maxps here is faster than blendvps, despite blend having lower latency.
+                                    ; maxps is also shorter so the inner loop fits in icache segment
+;   emu_blendvps    m3,   m5,   m0  ; m3 = m0 ? p : p_max
+                                    ; SSE4.1 is using hardcoded xmm0 register as third argument
+%if cpuflag(avx) && notcpuflag(avx2) && mmsize == 32
+    emu_pblendvb    m1,   m2,   m0
+%elif 01
+        pand        m2,   m0    ; This version seems faster than sse41 pblendvb
+        pmaxsw      m1,   m2    ; SSE2 signed word, so it would work for N < 32768/4
+%else
+        pandn       m0,   m1    ; version for r4q decreasing. It's not faster than real blendv.
+        pmaxsw      m0,   m2
+        movdqa      m1,   m0
+%endif
+
+        add         r4q,  mmsize
+        cmp         r4q,  Nq
+        jb          %%distortion_search
+
+        por         m1,   mm_const_int32_offsets    ; max_idx offset per individual lane (skipped in the inner loop)
+        movdqa      m4,   m1                        ; needed for the aligned y[max_idx]+=2; processing
+
+%if mmsize >= 32
+;merge parallel maximums round 2 (4 vs 4)
+
+       vextractf128 xmm5, ymm3, 1       ; xmm5 = ymm3[1x128] = ymm3[255..128b]
+        cmpps       xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[0x128] < p[1x128] )
+
+       vextracti128 xmm2, ymm1, 1       ; xmm2 = ymm1[1x128] = ymm1[255..128b]
+    emu_blendvps    xmm3, xmm5, xmm0    ; max_idx = m0 ? max_idx[1x128] : max_idx[0x128]
+    emu_pblendvb    xmm1, xmm2, xmm0    ; p       = m0 ? p[1x128]       : p[0x128]
+%endif
+
+;merge parallel maximums round 1 (2 vs 2)
+                                               ; m3=p[3210]
+        movhlps     xmm5, xmm3                 ; m5=p[xx32]
+        cmpps       xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[1,0] < p[3,2] )
+
+        pshufd      xmm2, xmm1, q3232
+    emu_blendvps    xmm3, xmm5, xmm0    ; max_idx = m0 ? max_idx[3,2] : max_idx[1,0]
+    emu_pblendvb    xmm1, xmm2, xmm0    ; p       = m0 ? p[3,2]       : p[1,0]
+
+;merge parallel maximums final round (1 vs 1)
+        movaps      xmm0, xmm3          ; m0 = m3[0] = p[0]
+        shufps      xmm3, xmm3, q1111   ; m3 = m3[1] = p[1]
+        cmpless     xmm0, xmm3
+
+        pshufd      xmm2, xmm1, q1111
+    emu_pblendvb    xmm1, xmm2, xmm0
+
+        movd  dword r4d,  xmm1          ; zero extends to the rest of r4q
+
+;pextrw r4d,m1,0 ; extract word (16b) from pos0 to r4d
+
+    emu_broadcastss m3,   [tmpX + r4q]
+        %{1}ps      m7,   m3            ; Sxy +=    X[max_idx]
+
+%if STALL_WRITE_FORWARDING == 0
+  %if SHORT_SYY_UPDATE == 1
+    emu_broadcastss m5,   [outYq + r4q]
+        %{1}ps      m6,   m5            ; Syy += outY[max_idx]
+  %endif
+        ; this code is big and ugly
+        ; but it prevents stalling a store forwarding
+        ; by writing smaller size and then doing
+        ; big load from same address.
+        ; This helps with N<100 that are the typical case.
+    emu_pbroadcastd m1,   xmm1
+        pcmpeqd     m1,   m1,   m4      ; only 1 element matches the address of the new p_max
+
+
+        and         r4q,  ~(mmsize-1)   ; align down, so max_idx is (somewhere) inside mmsize load
+        movaps      m5,   [outYq+r4q]   ; m5 = Y[y3...ym...y0]
+        movaps      m3,   mm_const_float_2  ; m3 = [2.0, 2.0, 2.0, 2.0]
+  %if SHORT_SYY_UPDATE == 0
+        movaps      m2,   m5            ; m2 = m5
+  %endif
+        andps       m3,   m1            ; m3 =  [ 0...2.0...0]
+        %{1}ps      m5,   m3            ; m5 = Y[y3...ym...y0] +/- [0...2.0...0]
+        movaps      [outYq+r4q], m5     ; outY[max_idx] +-= 2.0;
+
+  %if SHORT_SYY_UPDATE == 0
+        andps       m2,   m1            ; m2 = Y[0...ym...0]
+        HSUMPS      m2,   m3            ; m2 = Y[ym, ym, ym, ym];
+        %{1}ps      m6,   m2            ; Syy+=outY[max_idx]
+  %endif
+%else
+        ; this code is shorter, simpler and slower
+        ; because it breaks store forwarding
+        ; it writes smaller size at unaligned address where we later do a bigger aligned load.
+        ; When we have bigger vectors (N>100) the store might complete before we try to load it
+    emu_broadcastss m4,   [outYq + r4q]
+        %{1}ps      m6,   m4            ; Syy += outY[max_idx]
+        %{1}ps      m4,   mm_const_float_2
+        movss       [outYq+r4q], xmm4   ; outY[max_idx] += 2.0;
+%endif
+
+%endmacro
+
+;
+; Pyramid Vector Quantization Search implementation
+;
+; float * inX   - Unaligned (SIMD) access, it will be overread,
+;                 but extra data will be masked away.
+; int32 * outY  - Should be aligned and padded buffer.
+;                 It would be used as temp buffer.
+; uint32 K      - Number of pulses to have after quantizations.
+; uint32 N      - Number of vector elements. Must be 0 < N < 8192
+;
+%macro PVQ_FAST_SEARCH 0
+cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
+%define tmpX rsp
+
+;        movsxdifnidn Nq,  Nd
+        movaps      m0,   [const_float_abs_mask]
+        shl         Nq,   2             ; N *= sizeof(float)
+        mov         r4q,  Nq
+
+        neg         r4q
+        and         r4q,  mmsize-1;
+        add         Nq,   r4q           ; Nq = align(Nq, mmsize)
+        sub         rsp,  Nq            ; allocate tmpX[Nq]
+
+        movups      m3,   [const_align_abs_edge-mmsize+r4q] ; this is the bit mask for the padded read at the end of the input
+
+        lea         r4q,  [Nq-mmsize]   ; Nq is rounded up (aligned up) to mmsize, so r4q can't become negative here, unless N=0.
+        movups      m2,   [inXq + r4q]
+        andps       m2,   m3
+        movaps      [tmpX+r4q], m2
+        movaps      m1,   m2            ; Sx
+
+align 16
+%%loop_abs_sum:
+        sub         r4q,  mmsize
+        jc          %%end_loop_abs_sum
+
+        movups      m2,   [inXq + r4q]
+        andps       m2,   m0
+
+        movaps      [tmpX+r4q], m2      ; tmpX[i]=abs(X[i])
+        addps       m1,   m2
+        jmp         %%loop_abs_sum
+
+
+align 16
+%%end_loop_abs_sum
+
+        HSUMPS      m1,   m2            ; m1  = Sx
+
+        xorps       m0,   m0
+        comiss      xmm0, xmm1          ;
+        jz          %%zero_input        ; if (Sx==0) goto zero_input
+
+        cvtsi2ss    xmm0, dword Kd      ; m0  = K
+        divss       xmm0, xmm1          ; m0  = K/Sx
+    emu_broadcastss m0,   xmm0
+
+        lea         r4q,  [Nq-mmsize]
+        pxor        m5,   m5            ; Sy    ( Sum of y[i]      )
+        xorps       m6,   m6            ; Syy   ( Sum of y[i]*y[i] )
+        xorps       m7,   m7            ; Sxy   ( Sum of X[i]*y[i] )
+align 16
+%%loop_guess
+        movaps      m1,   [tmpX + r4q]  ; m1   = X[i]
+        mulps       m2,   m0,   m1      ; m2   = res*X[i]
+%if ALL_FLOAT_PRESEARCH == 1
+  %if PRESEARCH_ROUNDING == 0
+        roundps     m2,   m2,   0       ; yt   = (float)truncf( res*X[i] )
+  %else
+        roundps     m2,   m2,   3       ; yt   = (float)lrintf( res*X[i] )
+  %endif
+        addps       m5,   m2            ; Sy  += yt
+%else
+  %if PRESEARCH_ROUNDING == 0
+        cvttps2dq   m2,   m2            ; yt   = (int)truncf( res*X[i] )
+  %else
+        cvtps2dq    m2,   m2            ; yt   = (int)lrintf( res*X[i] )
+  %endif
+        paddd       m5,   m2            ; Sy  += yt
+        cvtdq2ps    m2,   m2            ; yt   = (float)yt
+%endif
+        mulps       m1,   m2            ; m1   = X[i]*yt
+        mulps       m3,   m2,   m2      ; m3   = yt*yt
+        addps       m7,   m1            ; Sxy += m1;
+        addps       m6,   m3            ; Syy += m3
+        addps       m2,   m2            ; m2   = 2*yt
+        movaps      [outYq + r4q], m2   ; y[i] = m2
+
+        sub         r4q,  mmsize
+        jnc         %%loop_guess
+;        add         r4q,  mmsize
+;        cmp         r4q,  Nq
+;        jb          %%loop_guess
+
+
+%if ALL_FLOAT_PRESEARCH == 1
+        HSUMPS      m5,   m4        ; pulses
+%else
+        HSUMDQ      m5,   m4        ; pulses
+%endif
+        HSUMPS      m6,   m1        ; Syy_norm
+        HSUMPS      m7,   m0        ; Sxy_norm
+
+%if num_mmregs > 8 && CONST_IN_X64_REG_IS_FASTER ; it works faster for me with mem access. ^_^
+  %define  mm_const_float_1111      m11
+  %define  mm_const_float_2         m12
+  %define  mm_const_int32_offsets   m13
+        movaps      m11,  [const_float_1111]
+        movaps      m12,  [const_float_2]
+        movdqa      m13,  [const_int32_offsets]
+%else
+  %define  mm_const_float_1111     [const_float_1111]
+  %define  mm_const_float_2        [const_float_2]
+  %define  mm_const_int32_offsets  [const_int32_offsets]
+%endif
+
+%if ALL_FLOAT_PRESEARCH == 1
+        cvtss2si    r4q,  xmm5
+%else
+        movd  dword r4d,  xmm5      ; zero extends to the rest of r4q
+%endif
+
+        sub         Kq,   r4q       ; K -= pulses
+        jc          %%remove_pulses_loop
+        jz          %%finish
+
+
+align 16
+%%add_pulses_loop:
+        PULSES_SEARCH add   ; m6 Syy_norm ; m7 Sxy_norm
+
+        sub         Kq,   1
+        jnz         %%add_pulses_loop
+
+        jmp         %%finish
+
+align 16
+%%remove_pulses_loop:
+        PULSES_SEARCH sub   ; m6 Syy_norm ; m7 Sxy_norm
+
+        add         Kq,   1
+        jnz         %%remove_pulses_loop
+
+
+%%finish:
+align 16
+        lea         r4q,  [Nq-mmsize]
+
+        movaps      m2,   [const_float_sign_mask]
+        movaps      m4,   [const_float_0_5]
+align 16
+%%restore_sign_loop:
+        movaps      m0,   [outYq + r4q] ; m0 = Y[i]
+        movups      m1,   [inXq  + r4q] ; m1 = X[i]
+        andps       m1,   m2            ; m1 = sign(X[i])
+        mulps       m0,   m4            ; m0 = Y[i]*0.5
+        orps        m0,   m1            ; m0 = Y[i]*0.5*sign
+        cvtps2dq    m3,   m0            ; m3 = (int)m0
+        movaps      [outYq + r4q], m3
+
+        sub         r4q,  mmsize
+        jnc         %%restore_sign_loop
+%%return
+;release the temp buffer and restore the stack
+        add         rsp,  Nq        ; free tmpX, Nq is the same as the one used to allocate it (after Nq has been aligned up)
+
+%if ARCH_X86_64 == 0    ;sbrdsp
+        movss       r0m,  xmm6        ; return (float)Syy_norm
+        fld         dword r0m
+%else
+        movaps      m0,   m6        ; return (float)Syy_norm
+%endif
+        RET
+
+align 16
+%%zero_input:
+        lea         r4q,  [Nq-mmsize]
+        xorps       m0,   m0
+%%zero_loop:
+        movaps      [outYq + r4q], m0
+        sub         r4q,  mmsize
+        jnc         %%zero_loop
+
+        movaps      m6,   [const_float_1111]
+        jmp         %%return
+%endmacro
+
+
+INIT_XMM sse2
+PVQ_FAST_SEARCH
+
+INIT_XMM sse42
+PVQ_FAST_SEARCH
+
+INIT_XMM avx
+PVQ_FAST_SEARCH
+
+INIT_YMM avx2
+PVQ_FAST_SEARCH
-- 
2.13.0