
[FFmpeg-devel] v6 Opus Pyramid Vector Quantization Search in x86 SIMD asm

Message ID CABA=pqcmfV7+xZVNzcvgvHaFbXoOpkuhbTq4jkwzxkGg+i6pAw@mail.gmail.com
State Superseded

Commit Message

Ivan Kalvachev July 26, 2017, 2:56 p.m. UTC
On 7/24/17, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
> On 7/23/17, Rostislav Pehlivanov <atomnuker@gmail.com> wrote:
>> On 22 July 2017 at 12:18, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
>>
>>> This patch is ready for review and inclusion.
>>>
>>>
>>>
>>>+%macro emu_blendvps 3 ; dst/src_a, src_b, mask
>>>+%macro lea_pic_base 2; reg, const_label
>> Capitalize macro names.
>
> Done for the later.
> About the former...
>
> I do not like to use capitalization for the emulated instructions:
>
> 1. In C macros are always in capital letters to separate them from the
> functions.
> In ASM functions cannot be mistaken for macros.
>
> 2. All instructions are lowercase, even the ones that are overloaded
> by x86asm.h macros.
>
> 3. There are already about 30 macros with lower cases in
> libavcodec/x86. The rule is not absolute.
>
> 4. There are some emulated instructions in x86util, that are all upper
> case names and
> I do find them very ugly when I see them in the code.
> This is why I went with the "emu_<op>" route.
> I actually want to encourage using emu_<op> for them too (as
> alternative names).
>
> 5. There is nothing guaranteeing that the assembler
> should be case-sensitive.
> Actually, the nasm manual mentions that MASM (on which
> it is based) is not case-sensitive (by default).
> While nasm and yasm are case-sensitive,
> there are still %idefine and %imacro that
> could create some confusion.
>
> Anyway.
>
> Would you accept a change where the emulation macros
> are moved to a separate file and the macros are
> renamed to EMU_<op> , aka EMU_blendvps ?
> I'm thinking of "libavcodec/x86/x86emu.asm".
>
> Or maybe they should be put in libavutil/x86/x86util.asm ,
> however that could open a new can of worms.
> AKA using the ugly uppercase only names.
>
>>>+      %error sse41 blendvps uses xmm0 as default 3d operand, you used %3
>> Remove this error
>
> I'd like to keep that one.
> It helps finding out why the code does not compile.
>
> Sometimes it may not be obvious, since SWAP might
> change the registers under the hood.
>
>>
>>>+    %error "blendvps" emulation needs SSE
>> Definitely remove this error too.
>
> Done
>
>>>+        %error AVX1 does not support integer 256bit ymm operations
>> And this one is particularly pointless...
>
> On the contrary. AVX1 does support 256 bit ymm floating-point operations
> but not integer ones. It is a quite trivial mistake to use one with the
> other.
>
> What is worse, without this the wrong code would compile
> without any hint of error, because avx2 opcodes are
> not overloaded by the current x86asm.h,
> so you won't get a warning that you are using avx2 ops in an avx1 section.
>
>>>+      %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
>> Same...
>
> Again, I'd like to keep that one.
>
>>>+    %error "broadcastss" emulation needs SSE
>> Same...
>
> Done
>
>>>+    %error "pbroadcastd" emulation needs SSE2
>> Yep, the same...
>
> Done
>
>>
>>>+    %error pick HSUMPS implementation
>> Again...
>
> I thought I had taken care of that.
> Done
>
>> All of these are pointless and unnecessary. Always assume at least SSE2.
>
> NEVER assume anything (that you can check).
> Making wrong assumptions is the easiest way
> to make mistakes that are
> very hard to notice, find and correct.
>
>>>+
>>>+; 256bit version seems slower than the 128bit AVX on all current CPU's
>>>+; So disable it for now and keep the code in case something changes in
>> future.
>>>+%if HAVE_AVX2_EXTERNAL > 0 && 0
>>>+INIT_YMM avx2
>>>+PVQ_FAST_SEARCH
>>>+%endif
>>
>> Remove this altogether.
>
> I think this would be a mistake.
>
> BTW, would you do proper benchmarks with the current
> code and post the results? There is a chance that the
> code has improved.
> (Please use a real audio sample, not generated noise.)
>
> I also asked you to try something (changing "cmpless" to "cmpps"),
> which you totally forgot to do.
> (IACA says there is no penalty in my code for using this SSE op in an AVX2
> block,
> but as I said, never assume.)
>
>>>+%if 1
>>>+    emu_pbroadcastd m1,   xmm1
>> ...
>>>+%else
>>>+        ; Ryzen is the only CPU at the moment that doesn't have a
>>>+        ; write-forwarding stall and where this code is faster,
>>>+        ; by about 7 cycles on average.
>>>+        %{1}ps      m5,   mm_const_float_1
>>>+        movss       [tmpY + r4q], xmm5      ; Y[max_idx] +-= 1.0;
>>
>> Remove the %else and always use the first bit of code.
>
> I don't like removing code that is faster on a new CPU.
> But since you insist.
> Done.
>
>>>+%if cpuflag(avx2) && 01
>>>+%elif cpuflag(avx) && 01
>>>+%if cpuflag(avx2) && 01
>>
>> Remove the && 01 in these 3 places.
>
> Done
>
>
>>>+;      vperm2f128   %1,   %1,   %1,   0 ; ymm, ymm, ymm     ; 2-3c 1/1
>>
>> Remove this commented out code.
>
> Done
>
>>
>>>+cglobal pvq_search, 4, 5+num_pic_regs, 11, 256*4, inX, outY, K, N
>>>+%if ARCH_X86_64 == 0    ;sbrdsp
>>
>> You're using more than 11 xmm regs, so the entire code is always going to
>> need a 64 bit system.
>> So wrap the entire file, from top to bottom after the %include in
>> %if ARCH_X86_64
>> <everything>
>> %endif
>
> I'm using exactly 8 registers on 32 bit arch.
> I'm using 3 extra registers on 64 bit ones to hold some constants.
>
> If you insist, I'll write some ifdef-ery to
> always supply the correct number of registers.
> From what I've seen in x86asm, the >8 case is only checked on WIN64.
>
>>>+SECTION_RODATA 64
>>>+
>>>+const_float_abs_mask:   times 8 dd 0x7fffffff
>>>+const_align_abs_edge:   times 8 dd 0
>>>+
>>>+const_float_0_5:        times 8 dd 0.5
>>>+const_float_1:          times 8 dd 1.0
>>>+const_float_sign_mask:  times 8 dd 0x80000000
>>>+
>>>+const_int32_offsets:
>>>+                        %rep 8
>>>+                                dd $-const_int32_offsets
>>>+                        %endrep
>>>+SECTION .text
>>
>> This whole thing needs to go right at the very top after the %include.
>> All
>> macros must then follow it.
>> Having read only sections in the middle of the file is very confusing and
>> not at all what all of our asm code follows.
>
> It's very simple:
>   pre-processor, constants, code.
> Your confusion comes from the fact that
> the code templates are done by macros.
>
> As I've told you before, the macros before
> the constants belong to a separate file.
>
> Also, why are you so obsessed
> with constants being the first thing in a file?
> You are not supposed to mess with them
> on a regular basis.
>
> Constant labels must be before the code that
> uses them, but because of code templates
> they could be literally few lines before the end of the file.

In this version I did remove the disabled AVX2 code
and moved the constants before the macro (but after the defines).
The emulations remain in this code, for now.

I've also used ifdef-ery to always supply the correct number
of used xmm registers to "cglobal" (and moved the relevant
macros closer to it).

Also here are the updated benchmarks:
/===========================================================
 SkyLake i7 6700HQ
//v5
    1814 1817 1827 //SSE4
    1818 1834 1838 //AVX default
    1828 1833 1837 //AVX2 xmm
    1896 1897 1905 //AVX2 default
    1896 1904 1905 //AVX2 cmpps

Best Regards

Comments

Rostislav Pehlivanov July 26, 2017, 11:11 p.m. UTC | #1
On 26 July 2017 at 15:56, Ivan Kalvachev <ikalvachev@gmail.com> wrote:

> +    if (ARCH_X86 && CONFIG_OPUS_ENCODER)
> +        ff_opus_dsp_init_x86(s);
>

Just change it to
+    if (ARCH_X86)

The init function is named opus_dsp, so it'll get used for other opus
things, not just the encoder.


The assembly code looks fine to me, but other people will have to take a
look at it in case I'm missing something.
Ivan Kalvachev July 27, 2017, 8:38 a.m. UTC | #2
On 7/27/17, Rostislav Pehlivanov <atomnuker@gmail.com> wrote:
> On 26 July 2017 at 15:56, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
>
>> +    if (ARCH_X86 && CONFIG_OPUS_ENCODER)
>> +        ff_opus_dsp_init_x86(s);
>>
>
> Just change it to
> +    if (ARCH_X86)
>
> The init function is named opus_dsp, so it'll get used to other opus
> things, not just the encoder.

But at the moment it does not.
I'd prefer to leave that task to whoever
adds the opus decoder functions.

Also, this change alone would break compilation, since
it also requires changing libavcodec/x86/Makefile
and adding the guard inside opus_dsp_init.c.

Another option is to have "opus_enc_dsp_init.c" and call
the function "ff_opus_enc_dsp_init_x86()".

Do tell me which option you prefer
and whether you insist on a v7 just for that.

> The assembly code looks fine to me, but other people will have to take a
> look at it in case I'm missing something.
Rostislav Pehlivanov July 27, 2017, 8:48 a.m. UTC | #3
On 27 July 2017 at 09:38, Ivan Kalvachev <ikalvachev@gmail.com> wrote:

> On 7/27/17, Rostislav Pehlivanov <atomnuker@gmail.com> wrote:
> > On 26 July 2017 at 15:56, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
> >
> >> +    if (ARCH_X86 && CONFIG_OPUS_ENCODER)
> >> +        ff_opus_dsp_init_x86(s);
> >>
> >
> > Just change it to
> > +    if (ARCH_X86)
> >
> > The init function is named opus_dsp, so it'll get used to other opus
> > things, not just the encoder.
>
> But at the moment it does not.
> I do prefer to leave that task for the one that
> adds opus decoder functions.
>
> Also this change alone would break compilation, since
> it also requires changing the libavcodec/x86/Makefile
> and adding the guard inside the opus_dsp_init.c
>
> Another option is to have "opus_enc_dsp_init.c" and call
> the function "ff_opus_enc_dsp_init_x86()".
>
> Do tell me which option do you prefer
> and do you insist on v7 just for that.
>
> > The assembly code looks fine to me, but other people will have to take a
> > look at it in case I'm missing something.
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>

The former, but that can be changed later, after pushing.
Henrik Gramner July 31, 2017, 4:10 p.m. UTC | #4
On Wed, Jul 26, 2017 at 4:56 PM, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
> +++ b/libavcodec/x86/opus_pvq_search.asm

Generic minor stuff:

Use rN instead of rNq for numbered registers (q suffix is used for
named args only due to preprocessor limitations).

Use the same "standard" vertical alignment rules as most existing
code, e.g. instructions indented by 4 spaces and operands aligned on
the first comma.

Use xmN instead of xmmN (only really makes a difference when SWAPs
are used, but might as well do it "correctly").

Use 32-bit operands, e.g. rNd, when 64-bit math isn't required.

Unless aligning every single loop entry helps a lot I'd avoid it since
it does waste a bit of icache.

Explicitly using the carry flag as a branch condition is a bit weird.
Generally using jg/jl/jge/jle tends to be clearer.

> +%ifdef __NASM_VER__
> +%use "smartalign"
> +ALIGNMODE p6
> +%endif

Assembler-specific ifdeffery should be avoided in individual files.
Something equivalent already exists in x86inc actually but I don't
remember if it was merged to FFmpeg from upstream (x264) yet.

> +const_int32_offsets:
> +                        %rep 8
> +                                dd $-const_int32_offsets
> +                        %endrep

Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?

> +SECTION .text
> +
> +
> +
> +

Reduce some of the excessive whitespace.

> +%macro HSUMPS 2 ; dst/src, tmp
> +%if cpuflag(avx)
> +  %if sizeof%1>=32  ; avx
> +       vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%1)<<128+hi(%1)
> +       vaddps       %1,   %2
> +  %endif
> +       vshufps      %2,   %1,   %1,   q1032
> +       vaddps       %1,   %2
> +       vshufps      %2,   %1,   %1,   q0321
> +       vaddps       %1,   %2
> +
> +%else  ; this form is a bit faster than the short avx-like emulation.
> +        movaps      %2,   %1            ;    [d,       c,       b,       a       ]
> +        shufps      %1,   %1,   q1032   ; %2=[b,       a,       d,       c       ]
> +        addps       %1,   %2            ; %1=[b+d,     a+c,     d+b,     c+a     ]
> +        movaps      %2,   %1
> +        shufps      %1,   %1,   q0321   ; %2=[c+a,     b+d,     a+c,     d+b     ]
> +        addps       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
> +        ; all %1 members should be equal for as long as float a+b==b+a
> +%endif
> +%endmacro

Is reordering moves for the non-AVX path worth the additional
complexity? Microoptimizations that only affect legacy hardware are
IMO a bit questionable.

> +%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask
> +%if cpuflag(avx)
> +  %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
> +        %error AVX1 does not support integer 256 bit ymm operations
> +  %endif
> +
> +       vpblendvb    %1,   %1,   %2,   %3
> +;-------------------------
> +%elif cpuflag(sse4)
> +    %ifnidn %3,xmm0
> +      %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
> +    %endif
> +        pblendvb    %1,   %2,   %3
> +;-------------------------
> +%else
> +        pxor        %2,   %1
> +        pand        %2,   %3
> +        pxor        %1,   %2
> +%endif
> +%endmacro

The AVX1 integer warning is kind of silly and doesn't really serve any purpose.

> +%macro EMU_broadcastss 2 ; dst, src
> +%if cpuflag(avx2)
> +       vbroadcastss %1,   %2            ; ymm, xmm
> +;-------------------------
> +%elif cpuflag(avx)
> +  %ifnum sizeof%2       ; avx1 register
> +       vpermilps    xmm%1, xmm%2, q0000 ; xmm, xmm, imm || ymm, ymm, imm
> +    %if sizeof%1 >= 32  ; mmsize>=32
> +       vinsertf128  %1,   %1,xmm%1,   1 ; ymm, ymm, xmm, im
> +    %endif
> +  %else                 ; avx1 memory
> +       vbroadcastss %1,   %2            ; ymm, mm32 || xmm, mm32
> +  %endif
> +;-------------------------
> +%else
> +  %ifnum sizeof%2       ; sse register
> +        shufps      %1,   %2,   %2,   q0000
> +  %else                 ; sse memory
> +        movss       %1,   %2            ; this zeroes the other 3 elements
> +        shufps      %1,   %1,   0
> +  %endif
> +%endif
> +%endmacro

Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking
something.

> +%macro EMU_pbroadcastd 2 ; dst, src
> +%if cpuflag(avx2)
> +       vpbroadcastd %1,   %2
> +%elif cpuflag(avx) && mmsize >= 32
> +        %error AVX1 does not support integer 256 bit ymm operations
> +%else
> +  %ifnum sizeof%2       ; sse2 register
> +        pshufd      %1,   %2,   q0000
> +  %else                 ; sse memory
> +        movd        %1,   %2            ; movd zero extends
> +        pshufd      %1,   %1,   0
> +  %endif
> +%endif
> +%endmacro

Same as above, but using the SPLATD macro.

> +; Merge parallel maximums final round (1 vs 1)
> +        movaps      xmm0, xmm3          ; m0 = m3[0] = p[0]
> +        shufps      xmm3, xmm3, q1111   ; m3 = m3[1] = p[1]
> +        cmpless     xmm0, xmm3

Use 3-arg instead of reg-reg movaps.

Use the actual cmpss instruction instead of pseudo-instructions.
x86inc doesn't support automatic VEX handling for every possible
pseudo-instruction (mainly because there are so many of them), which will
result in AVX state transitions if INIT_YMM is used.

> +%macro SET_PIC_BASE 3; reg, const_label
> +%ifdef PIC
> +        %{1}        %2,   [%3]    ; lea r5q, [rip+const]
> +        %define     pic_base_%3 %2
> +%else
> +        %define     pic_base_%3 %3
> +%endif
> +%endmacro
[...]
> +        SET_PIC_BASE lea, r5q, const_align_abs_edge    ; rip+const
> +        movups      m3,   [pic_base_const_align_abs_edge + r4q - mmsize]

This macro is only ever used once, remove and inline it to reduce complexity.

> +%if num_mmregs > 8
> +%define num_hireg_used 3
> +%endif

IIRC the number of used mmregs is ignored in 32-bit mode so I believe
you can unconditionally specify 11.

> +        lea         r4q,  [Nq - mmsize] ; Nq is rounded up (aligned up) to mmsize, so r4q can't become negative here, unless N=0.
> +        movups      m2,   [inXq + r4q]
> +        andps       m2,   m3
> +        movaps      [tmpX + r4q], m2
> +        movaps      m1,   m2            ; Sx

Redundant reg-reg movaps, use m1 directly instead of using m2 at all.

> +  %if PRESEARCH_ROUNDING == 0
> +        cvttps2dq   m2,   m2            ; yt   = (int)truncf( res*X[i] )
> +  %else
> +        cvtps2dq    m2,   m2            ; yt   = (int)lrintf( res*X[i] )
> +  %endif
> +        paddd       m5,   m2            ; Sy  += yt
> +        cvtdq2ps    m2,   m2            ; yt   = (float)yt

Would using roundps instead of doing float-int-float conversion be beneficial?

> +        mulps       m1,   m2            ; m1   = X[i]*yt
> +        movaps      [tmpY + r4q], m2    ; y[i] = m2
> +        addps       m7,   m1            ; Sxy += m1;
> +        mulps       m2,   m2            ; m2   = yt*yt
> +        addps       m6,   m2            ; Syy += m2

Potential use case for 2x FMA instead of 2x mul+add.
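
For illustration only, assuming an FMA3-capable target (which the patch does not currently require), the two mul+add pairs could collapse to something like:

    vfmadd231ps m7, m1, m2          ; Sxy += X[i]*yt   (m7 += m1*m2)
    movaps      [tmpY + r4q], m2    ; y[i] = yt
    vfmadd231ps m6, m2, m2          ; Syy += yt*yt     (m6 += m2*m2)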

> +%%restore_sign_loop:
> +        movaps      m0,   [tmpY + r4q]  ; m0 = Y[i]
> +        movups      m1,   [inXq + r4q]  ; m1 = X[i]
> +        andps       m1,   m2            ; m1 = sign(X[i])
> +        orps        m0,   m1            ; m0 = Y[i]*sign
> +        cvtps2dq    m3,   m0            ; m3 = (int)m0
> +        movaps      [outYq + r4q], m3

The orps instruction could be done using a memory operand instead of
movaps+orps.

AVX supports unaligned memory operands so the same could conditionally
be done for the andps as well but that's a bit of a nit.
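
An untested sketch of the suggestion (tmpY is an aligned scratch buffer, so a memory operand is legal even for the legacy-encoded orps):

        movups      m1,   [inXq + r4q]  ; m1 = X[i]
        andps       m1,   m2            ; m1 = sign(X[i])
        orps        m1,   [tmpY + r4q]  ; m1 = Y[i]*sign, load folded into orps
        cvtps2dq    m3,   m1            ; m3 = (int)m1
        movaps      [outYq + r4q], m3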

> +%if ARCH_X86_64 == 0    ;sbrdsp
> +        movss       r0m,  xmm6      ; return (float)Syy_norm
> +        fld         dword r0m
> +%else
> +        movaps      m0,   m6        ; return (float)Syy_norm
> +%endif
> +        RET

The reg-reg movaps on x64 could be optimized away by reordering
registers so that the return value ends up in m0 instead of m6 from
the start.

> +INIT_XMM avx

The code has a bunch of if conditions for ymm regs but ymm regs are
never actually used?
Ivan Kalvachev Aug. 1, 2017, 9:46 p.m. UTC | #5
On 7/31/17, Henrik Gramner <henrik@gramner.com> wrote:
> On Wed, Jul 26, 2017 at 4:56 PM, Ivan Kalvachev <ikalvachev@gmail.com>
> wrote:
>> +++ b/libavcodec/x86/opus_pvq_search.asm
>
> Generic minor stuff:
>
> Use rN instead of rNq for numbered registers (q suffix is used for
> named args only due to preprocessor limitations).

Done.

Is this documented?

> Use the same "standard" vertical alignment rules as most existing
> code, e.g. instructions indented by 4 spaces and operands aligned on
> the first comma.

What do you mean by operands aligned on the first comma?
The longest operand I had is "xmm0"; counting comma and space,
I get 6-position alignment for operands.
Now, with "xm7" I can lower this to 5 positions. Should I do that?
(I don't have "xm15".)

Also, I've aligned the operands at 12 positions after the
instruction start, because this is the size of the longest
real instruction.

As for why I have instructions start at the 8th position:
it's because I've allocated the field at positions 4-7
for instruction prefixes, and "EMU_" is 4 positions.

Now I have:
12345678____________12345_12345_12345_
    EMU_broadcastss ym13, xm10
    EMU_blendvps    xm1,  xm2,  m3
       vblendvps
        blendvps

I can ditch the prefix and shorten the operands, e.g.:
1234_____________1234_1234_1234_
    VBROADCASTSS ym1, xm1
    BLENDVPS     m1,  m2,  m3

Is that acceptable? Or maybe you do mean:

1234_____________1234_1234_123
    VBROADCASTSS ym1, xm1
    BLENDVPS      m1, m2, m3

However I would prefer to use
1234_____________1234_1234_123
    VBROADCASTSS ym1, xm1
    BLENDVPS      m1,  m2,  m3
    PBLENDVD     xm1, xm2, xm3

(You do need fixed width font to see that correctly).

I'll wait for reply before doing that.

> Use xmN instead of xmmN (only really makes a difference when SWAP:s
> are used, but might as well do it "correctly").

Done.

> Use 32-bit operands, e.g. rNd, when 64-bit math isn't required.

Done

> Unless aligning every single loop entry helps a lot I'd avoid it since
> it does waste a bit of icache.

I'll redo my benchmarks, but I have placed these aligns for a reason.
At some point, removing debug instructions started making my code
slower, so I've placed aligns to avoid random slowdowns.

> Explicitly using the carry flag as a branch condition is a bit weird.
> Generally using jg/jl/jge/jle tends to be clearer.

I use it to take advantage of so-called macro-op fusion.
It allows cmp and jxx to be fused, as long as the branch
checks only one flag, so all signed branches are to be avoided
(as stated by the Intel manual).
The newer the CPU, the more opcodes (add/sub)
can be fused and the fewer the limitations.

>> +%ifdef __NASM_VER__
>> +%use "smartalign"
>> +ALIGNMODE p6
>> +%endif
>
> Assembler-specific ifdeffery should be avoided in individual files.
> Something equivalent already exists in x86inc actually but I don't
> remember if it was merged to FFmpeg from upstream (x264) yet.

Definitely not merged.

I hear a lot about the improved x86inc;
maybe it is time for you to merge it into FFmpeg?


>> +const_int32_offsets:
>> +                        %rep 8
>> +                                dd $-const_int32_offsets
>> +                        %endrep
>
> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?

Yes.
Do you know a way that works with "times 8"?
I've done it this way to make it easy to change the size of the constant.

Do you want me to change it?
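
For reference, one way to keep the element size in a single place without the "$ - label" trick is a %rep/%assign pair; this is an untested sketch, and "elem_size"/"i" are made-up names, not from the patch:

%assign elem_size 4
const_int32_offsets:
    %assign i 0
    %rep 8
        dd i * elem_size
        %assign i i + 1
    %endrep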

>> +SECTION .text
>> +
>> +
>> +
>> +
>
> Reduce some of the excessive whitespace.

Will do with the other cosmetics.

>> +%macro HSUMPS 2 ; dst/src, tmp
>> +%if cpuflag(avx)
>> +  %if sizeof%1>=32  ; avx
>> +       vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ;
>> %2=lo(%1)<<128+hi(%1)
>> +       vaddps       %1,   %2
>> +  %endif
>> +       vshufps      %2,   %1,   %1,   q1032
>> +       vaddps       %1,   %2
>> +       vshufps      %2,   %1,   %1,   q0321
>> +       vaddps       %1,   %2
>> +
>> +%else  ; this form is a bit faster than the short avx-like emulation.
>> +        movaps      %2,   %1            ;    [d,       c,       b,
>> a       ]
>> +        shufps      %1,   %1,   q1032   ; %2=[b,       a,       d,
>> c       ]
>> +        addps       %1,   %2            ; %1=[b+d,     a+c,     d+b,
>> c+a     ]
>> +        movaps      %2,   %1
>> +        shufps      %1,   %1,   q0321   ; %2=[c+a,     b+d,     a+c,
>> d+b     ]
>> +        addps       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b,
>> d+b+c+a ]
>> +        ; all %1 members should be equal for as long as float a+b==b+a
>> +%endif
>> +%endmacro
>
> Is reordering moves for the non-AVX path worth the additional
> complexity? Microoptimizations that only affect legacy hardware are
> IMO a bit questionable.

Yes, I'm on "legacy" hardware and the improvement is reliably measurable.

>> +%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask
>> +%if cpuflag(avx)
>> +  %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
>> +        %error AVX1 does not support integer 256 bit ymm operations
>> +  %endif
>> +
>> +       vpblendvb    %1,   %1,   %2,   %3
>> +;-------------------------
>> +%elif cpuflag(sse4)
>> +    %ifnidn %3,xmm0
>> +      %error sse41 blendvps uses xmm0 as default 3 operand, you used %3
>> +    %endif
>> +        pblendvb    %1,   %2,   %3
>> +;-------------------------
>> +%else
>> +        pxor        %2,   %1
>> +        pand        %2,   %3
>> +        pxor        %1,   %2
>> +%endif
>> +%endmacro
>
> The AVX1 integer warning is kind of silly and doesn't really serve any
> purpose.

As I've said in this thread before,
if you use avx1 ymm with this macro
you won't get any warning that an avx2 opcode has been generated,
because x86inc does not overload avx2 instructions.

I've tested it.

(And again, merge the improved avx2 x86inc)

>> +%macro EMU_broadcastss 2 ; dst, src
>> +%if cpuflag(avx2)
>> +       vbroadcastss %1,   %2            ; ymm, xmm
>> +;-------------------------
>> +%elif cpuflag(avx)
>> +  %ifnum sizeof%2       ; avx1 register
>> +       vpermilps    xmm%1, xmm%2, q0000 ; xmm, xmm, imm || ymm, ymm, imm
>> +    %if sizeof%1 >= 32  ; mmsize>=32
>> +       vinsertf128  %1,   %1,xmm%1,   1 ; ymm, ymm, xmm, im
>> +    %endif
>> +  %else                 ; avx1 memory
>> +       vbroadcastss %1,   %2            ; ymm, mm32 || xmm, mm32
>> +  %endif
>> +;-------------------------
>> +%else
>> +  %ifnum sizeof%2       ; sse register
>> +        shufps      %1,   %2,   %2,   q0000
>> +  %else                 ; sse memory
>> +        movss       %1,   %2            ; this zeroes the other 3
>> elements
>> +        shufps      %1,   %1,   0
>> +  %endif
>> +%endif
>> +%endmacro
>
> Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking
> something.

The x86util VBROADCASTSS macro implements only
the avx1 variant that is memory to register.
My variant emulates the avx2 variant,
which also does reg to reg.

My variant also avoids a write dependency on "movss reg, reg".
("movss reg, mem" clears the other elements, but
"movss reg, reg" preserves them. This creates a dependency on the old
value of the register.)
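
To illustrate the dependency difference (untested sketch; the registers here are arbitrary placeholders):

    movss   xm0, [r2]      ; load form: zeroes the upper 3 elements, no dependency on the old xm0
    movss   xm0, xm1       ; reg-reg form: preserves xm0's upper elements, so it has to
                           ; wait for the previous value of xm0
    shufps  xm0, xm0, 0    ; broadcast the low element to all 4 lanes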

And yes, I did ask if I should separate
these macros into another file and
if I should try to merge them into x86util.asm first.

What do you think on the matter?

> +%macro EMU_pbroadcastd 2 ; dst, src
> +%if cpuflag(avx2)
> +       vpbroadcastd %1,   %2
> +%elif cpuflag(avx) && mmsize >= 32
> +        %error AVX1 does not support integer 256 bit ymm operations
> +%else
> +  %ifnum sizeof%2       ; sse2 register
> +        pshufd      %1,   %2,   q0000
> +  %else                 ; sse memory
> +        movd        %1,   %2            ; movd zero extends
> +        pshufd      %1,   %1,   0
> +  %endif
> +%endif
> +%endmacro
>
> Same as above, but using the SPLATD macro.

Again, SPLATD does not handle avx2.
Also, I do think it is better to use the name of an existing instruction
instead of a made-up one.


>> +; Merge parallel maximums final round (1 vs 1)
>> +        movaps      xmm0, xmm3          ; m0 = m3[0] = p[0]
>> +        shufps      xmm3, xmm3, q1111   ; m3 = m3[1] = p[1]
>> +        cmpless     xmm0, xmm3
>
> Use 3-arg instead of reg-reg movaps.

Is this what you have in mind:
  shufps   xm0, xm3, xm3, q1111
  cmpnless xm0, xm3

(seems to work fine).
Done.

> Use the actual cmpss instruction instead of psuedo-instructions.
> x86inc doesn't support automatic VEX handling for every possible
> psuedo-instruction (mainly because there's so many of them) which will
> result in AVX state transitions if INIT_YMM is used.

Not according to the Intel Architecture Code Analyzer
and the benchmarks I've got.
The benchmarks showed that using cmpps is about 5% slower on Ryzen
and exactly the same speed on SkyLake.
That's with avx2 ymm.

Still, I'll change it.
Done.


>> +%macro SET_PIC_BASE 3; reg, const_label
>> +%ifdef PIC
>> +        %{1}        %2,   [%3]    ; lea r5q, [rip+const]
>> +        %define     pic_base_%3 %2
>> +%else
>> +        %define     pic_base_%3 %3
>> +%endif
>> +%endmacro
> [...]
>> +        SET_PIC_BASE lea, r5q, const_align_abs_edge    ; rip+const
>> +        movups      m3,   [pic_base_const_align_abs_edge + r4q - mmsize]
>
> This macro is only ever used once, remove and inline it to reduce
> complexity.

Can I move it to x86util too?

Unified handling of PIC mangling is something to be welcomed.
I do think this is the right way. ;)
I think that at least one other file had something very similar.

I find explicit checks with ifdef/else/endif a lot uglier
and harder to follow.

>> +%if num_mmregs > 8
>> +%define num_hireg_used 3
>> +%endif
>
> IIRC the number of used mmregs is ignored in 32-bit mode so I believe
> you can unconditionally specify 11.

Is that documented and guaranteed not to change in the future?
I had it at 11 unconditionally, but then decided to play it safe.

>> +        lea         r4q,  [Nq - mmsize] ; Nq is rounded up (aligned up)
>> to mmsize, so r4q can't become negative here, unless N=0.
>> +        movups      m2,   [inXq + r4q]
>> +        andps       m2,   m3
>> +        movaps      [tmpX + r4q], m2
>> +        movaps      m1,   m2            ; Sx
>
> Redundant reg-reg movaps, use m1 directly instead of using m2 at all.

Nice catch

Done.

>> +  %if PRESEARCH_ROUNDING == 0
>> +        cvttps2dq   m2,   m2            ; yt   = (int)truncf( res*X[i] )
>> +  %else
>> +        cvtps2dq    m2,   m2            ; yt   = (int)lrintf( res*X[i] )
>> +  %endif
>> +        paddd       m5,   m2            ; Sy  += yt
>> +        cvtdq2ps    m2,   m2            ; yt   = (float)yt
>
> Would using roundps instead of doing float-int-float conversion be
> beneficial?

Not on any CPU I got benchmarks from (and this includes Skylake and Ryzen).

When I rant about Intel making ops that are slower than
their multi-op equivalents... this is one of my prime examples.


>> +        mulps       m1,   m2            ; m1   = X[i]*yt
>> +        movaps      [tmpY + r4q], m2    ; y[i] = m2
>> +        addps       m7,   m1            ; Sxy += m1;
>> +        mulps       m2,   m2            ; m2   = yt*yt
>> +        addps       m6,   m2            ; Syy += m2
>
> Potential use case for 2x FMA instead of 2x mul+add.

Hum...

FMULADD_PS   m7, m1, m2, m7,  m1
+mulps m1, m1, m2
+addps m7, m7, m1

FMULADD_PS   m6, m2, m2, m6,  m2
+mulps m2, m2, m2
+addps m2, m6, m2

FMA is newer than AVX1. To actually use the opcodes
I'd need to have a 4th version just for it, or use avx2.

Without the possibility of using FMA, the macro usage
would be just an obfuscation.

Do you insist on it?

>> +%%restore_sign_loop:
>> +        movaps      m0,   [tmpY + r4q]  ; m0 = Y[i]
>> +        movups      m1,   [inXq + r4q]  ; m1 = X[i]
>> +        andps       m1,   m2            ; m1 = sign(X[i])
>> +        orps        m0,   m1            ; m0 = Y[i]*sign
>> +        cvtps2dq    m3,   m0            ; m3 = (int)m0
>> +        movaps      [outYq + r4q], m3
>
> The orps instruction could be done using a memory operand instead of
> movaps+orps.

I think I had tried that, but it was slower on my CPU.

I redid the benchmark results.

- synthetic K=3, N=176 :
old : 13170 13302 13340 13458 13562
new : 13215 13218 13273 13344 13805

- sample 1 (40min) default
old : 2293 2300 2328 2331 2350
new : 2305 2307 2325 2333 2334

- sample2 (2h) default
old : 2462 2468 2497 2519 2519
new : 2481 2486 2514 2529 2525

Do you insist on doing the change?


> AVX supports unaligned memory operands so the same could conditionally
> be done for the andps as well but that's a bit of a nit.
>
>> +%if ARCH_X86_64 == 0    ;sbrdsp
>> +        movss       r0m,  xmm6      ; return (float)Syy_norm
>> +        fld         dword r0m
>> +%else
>> +        movaps      m0,   m6        ; return (float)Syy_norm
>> +%endif
>> +        RET
>
> The reg-reg movaps on x64 could be optimized away by reordering
> registers so that the return value ends up in m0 instead of m6 from
> the start.

No, it cannot.
I need m0 for the blendv mask on sse4.

>> +INIT_XMM avx
>
> The code has a bunch of if conditions for ymm regs but ymm regs are
> never actually used?

I was asked to remove the INIT_YMM avx2.

The benchmarks showed that the resulting code is
slightly slower in the default case, so I had the code disabled.
But a lot of people cannot stand having disabled code around. :|

If you want to help, you might try to find out why the code is slower.
My main theory is that 8 floats at once might be just too much;
90% of the searches are with N<32.

I don't have an avx2-capable CPU to play with.

Best Regards
Henrik Gramner Aug. 2, 2017, 3:55 p.m. UTC | #6
On Tue, Aug 1, 2017 at 11:46 PM, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
> On 7/31/17, Henrik Gramner <henrik@gramner.com> wrote:
>> Use rN instead of rNq for numbered registers (q suffix is used for
>> named args only due to preprocessor limitations).
>
> Is this documented?

Not sure, but there's probably a bunch of small things like this that
aren't really well documented.

>> Use the same "standard" vertical alignment rules as most existing
>> code, e.g. instructions indented by 4 spaces and operands aligned on
>> the first comma.
>
> What do you mean operands aligned on the first comma?
> The longest operand I had is "xmm0" , counting comma and space
> I get 6 position alignment for operands.
> Now, with "xm7" i can lower this to 5 position. Should I do that?
> (I don't have "xm15" ).
>

1234_____________1234_1234_123
    VBROADCASTSS ym1, xm1
    BLENDVPS      m1, m2, m3

is the most commonly used alignment.

>> Unless aligning every single loop entry helps a lot I'd avoid it since
>> it does waste a bit of icache.
>
> I'll redo my benchmarks, but I have placed these aligns for a reason.
> At some point removed debug instructions started making my code
> slower. So I've placed align to avoid random slowdowns.

Ok.

>> Explicitly using the carry flag as a branch condition is a bit weird.
>> Generally using jg/jl/jge/jle tends to be clearer.
>
> I use it to take advantage of the so called MacroFusion.
> It allows the merge of cmp and jxx, as long as the branch
> checks only one flag, so all signed branches are to be avoided
> (as stated by intel manual).
> The newer the cpu, the more opcodes (add/sub)
> could be merged and less limitations.

Every µarch that can do macro-op fusion with add/sub can do so with
both signed and unsigned branch conditions.

There's actually only a single µarch that can fuse cmp with unsigned
but not signed branch conditions and that's more than a decade old
(Conroe), but if you want to check if (unsigned)a < (unsigned)b, 'jb'
is preferred over 'jc' (both produce the exact same opcode).
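
For illustration (the label is just a placeholder):

    cmp     r4d, r2d
    jb      .loop       ; unsigned "below"; assembles to the same opcode as jc
                        ; and fuses with the preceding cmp just as well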

>> Assembler-specific ifdeffery should be avoided in individual files.
>> Something equivalent already exists in x86inc actually but I don't
>> remember if it was merged to FFmpeg from upstream (x264) yet.
>
> Definitely not merged.
>
> I hear a lot about the improved x86inc,
> maybe it is time for you to merge it in FFmpeg?

Now that I think about it I believe that part wasn't merged because
smartalign is broken on some older nasm versions (which is another
reason for avoiding things like this in individual files) and FFmpeg
doesn't enforce any specific version requirements. I guess it could be
worked around with some version checking ifdeffery if we know which
versions are affected.

>> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
>> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?
>
> Yes.
> Do you know a way that works with "times 8"?
> I've done it this way to make it easy to change the size of the constant.
>
> Do you request I change it?

I'd prefer just using that simple one-liner I posted. Replacing the
numbers if you want to change things later is trivial.

>> Is reordering moves for the non-AVX path worth the additional
>> complexity? Microoptimizations that only affect legacy hardware are
>> IMO a bit questionable.
>
> Yes, I'm on "legacy" hardware and the improvement is reliably measurable.

Ok, in that case you should also use shufps/addps instead of
vshufps/vaddps in the AVX branch for cosmetic reasons though (x86inc
performs automatic VEX-encoding).

>> The AVX1 integer warning is kind of silly and doesn't really serve any
>> purpose.
>
> As I've said in this thread before,
> If you use avx1 ymm with this macro
> you won't get any warning that avx2 opcode has been generated
> because x86inc does not overload avx2.

Sure, but using any other AVX2 instruction in AVX functions wont
result in any warnings either so it doesn't make much sense to do it
in this specific case.

>> Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking
>> something.
>
> The x86util VBROADCASTSS macro implements only
> the avx1 variant that is memory to register.
> My variant emulates the avx2 variant,
> that also does reg to reg.
>
> My variant also avoids write dependency on "movss reg, reg".
> ("movss reg, mem" clears the other elements, but
> "movss reg, reg" preserves them. This creates dependency on the old
> values of the register.)
>
> And yes, I did ask if I should separate
> these macros in another file and
> if I should try to merge them in x86util.asm first.
>
> What do you think on the matter?

Add your improvements to the existing x86util macro (can be done in
the same patch) and then use that.

>> +%macro EMU_pbroadcastd 2 ; dst, src
>> +%if cpuflag(avx2)
>> +       vpbroadcastd %1,   %2
>> +%elif cpuflag(avx) && mmsize >= 32
>> +        %error AVX1 does not support integer 256 bit ymm operations
>> +%else
>> +  %ifnum sizeof%2       ; sse2 register
>> +        pshufd      %1,   %2,   q0000
>> +  %else                 ; sse memory
>> +        movd        %1,   %2            ; movd zero extends
>> +        pshufd      %1,   %1,   0
>> +  %endif
>> +%endif
>> +%endmacro
>>
>> Same as above, but using the SPLATD macro.
>
> Again, SPLATD does not handle avx2.
> Also I do think it is better to use a name of existing instruction
> instead of a made up one.

Same as above. Yes, I agree that the current naming is inconsistent
(that macro was created before broadcast instructions existed) and I'm
OK with renaming it, but if so it should be done in a separate patch.

>> Use the actual cmpss instruction instead of psuedo-instructions.
>> x86inc doesn't support automatic VEX handling for every possible
>> psuedo-instruction (mainly because there's so many of them) which will
>> result in AVX state transitions if INIT_YMM is used.
>
> Not according to Intel Architecture Code Analyzer
> and the benchmarks I've got.
> The benchmarks showed that using cmpps is about 5% slower on Ryzen
> and absolutely same speed on SkyLake.
> That's in avx2 ymm.

cmpss, not cmpps. "cmpless a, b" is a pseudo-instruction which the
assembler will convert into the actual instruction "cmpss a, b, 2".

One could of course argue whether or not x86inc should deal with
pseudo-instructions, but right now it doesn't, so it will cause AVX
state transitions if INIT_YMM is used since it won't be VEX-encoded.
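
A minimal sketch of the distinction (assuming, as the comment above implies, that x86inc overloads the real cmpss but not the pseudo-instruction):

    cmpless xm0, xm3        ; pseudo-instruction: not overloaded by x86inc, so it stays
                            ; SSE-encoded even inside an INIT_YMM/AVX function
    cmpss   xm0, xm3, 2     ; real instruction: x86inc can emit it as
                            ; "vcmpss xm0, xm0, xm3, 2" in AVX functions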

>> This macro is only ever used once, remove and inline it to reduce
>> complexity.
>
> Can I move it to x86util too?
>
> Unified handling of PIC mangling is something welcomed.
> I do think this is the right way. ;)
> I think that at least one other file had something very similar.

Doing PIC-mangling optimally will in many cases be context-dependent
which is probably why it doesn't already exist. For example if you
need to do multiple complex PIC-loads you should do a single lea and
reuse it which is harder to convey as a generic macro.
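
An untested sketch of what is meant, reusing one lea for two of the constants from the patch:

    lea      r5, [const_float_1]                         ; single rip-relative lea (PIC case)
    movaps   m1, [r5]                                    ; const_float_1
    movaps   m2, [r5 + const_float_0_5 - const_float_1]  ; reused for const_float_0_5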

>> IIRC the number of used mmregs is ignored in 32-bit mode so I believe
>> you can unconditionally specify 11.
>
> Is that documented and guaranteed to not change in future?
> I had it at 11 unconditionally. But then decided to play safe.

There are no callee-saved xmm registers or anything else in x86-32
that would require keeping track of how many of them are clobbered.

>> Potential use case for 2x FMA instead of 2x mul+add.
>
> FMA is newer than AVX1. To actually use the opcodes
> I'd need to have 4'th version just for it or use avx2.
>
> Without possibility of using FMA, the macro usage
> would be just an obfuscation.
>
> Do you insist on it?

I wouldn't bother with a separate FMA version but if you end up doing
AVX2 it should definitely be done as part of it (AVX2 is a superset of
FMA3).

>>> +%%restore_sign_loop:
>>> +        movaps      m0,   [tmpY + r4q]  ; m0 = Y[i]
>>> +        movups      m1,   [inXq + r4q]  ; m1 = X[i]
>>> +        andps       m1,   m2            ; m1 = sign(X[i])
>>> +        orps        m0,   m1            ; m0 = Y[i]*sign
>>> +        cvtps2dq    m3,   m0            ; m3 = (int)m0
>>> +        movaps      [outYq + r4q], m3
>>
>> The orps instruction could be done using a memory operand instead of
>> movaps+orps.
>
> I think I had tried that, but it was slower on my CPU.

That's quite weird if it's the case. It's literally the same amount of
µops just with a tiny bit shorter code size, so I would probably chalk
it up to random differences in code alignment or something. Either way
it's really not important so feel free to keep it as it is.

>> The reg-reg movaps on x64 could be optimized away by reordering
>> registers so that the return value ends up in m0 instead of m6 from
>> the start.
>
> No, it cannot.
> I need m0 for the blendv mask on sse4.

Oh right, I'm dumb. Never mind.

>> The code has a bunch of if conditions for ymm regs but ymm regs are
>> never actually used?
>
> I was asked to remove the INIT_YMM avx2 .
>
> The benchmarks showed that the resulting code is
> slightly slower in the default case, so I had the code disabled.
> But a lot of people cannot stand having disabled code around. :|
>
> If you want to help, you might try to find out why the code is slower.
> My main theory is that 8 floats at once might be just too much,
> 90% of the searches are with N<32.
>
> I don't have avx2 capable cpu to play with.

Likely due to the cmpless AVX state transition I mentioned earlier.
Could be worth re-testing once that is taken care of.
Ivan Kalvachev Aug. 3, 2017, 9:36 p.m. UTC | #7
On 8/2/17, Henrik Gramner <henrik@gramner.com> wrote:
> On Tue, Aug 1, 2017 at 11:46 PM, Ivan Kalvachev <ikalvachev@gmail.com>
> wrote:
>> On 7/31/17, Henrik Gramner <henrik@gramner.com> wrote:
>>> Use rN instead of rNq for numbered registers (q suffix is used for
>>> named args only due to preprocessor limitations).
>>
>> Is this documented?
>
> Not sure, but there's probably a bunch small things like this that
> aren't really well documented.
>
>>> Use the same "standard" vertical alignment rules as most existing
>>> code, e.g. instructions indented by 4 spaces and operands aligned on
>>> the first comma.
>>
>> What do you mean operands aligned on the first comma?
>> The longest operand I had is "xmm0" , counting comma and space
>> I get 6 position alignment for operands.
>> Now, with "xm7" i can lower this to 5 position. Should I do that?
>> (I don't have "xm15" ).
>>
>
> 1234_____________1234_1234_123
>     VBROADCASTSS ym1, xm1
>     BLENDVPS      m1, m2, m3
>
> is the most commonly used alignment.

I see that a lot of .asm files use different alignments.
I'll try to pick something similar that I find good-looking.

I also see that different functions/macros can have
different positions for the first comma.
This could be quite useful, to visually separate
the macros.


>>> Unless aligning every single loop entry helps a lot I'd avoid it since
>>> it does waste a bit of icache.
>>
>> I'll redo my benchmarks, but I have placed these aligns for a reason.
>> At some point removed debug instructions started making my code
>> slower. So I've placed align to avoid random slowdowns.
>
> Ok.
>
>>> Explicitly using the carry flag as a branch condition is a bit weird.
>>> Generally using jg/jl/jge/jle tends to be clearer.
>>
>> I use it to take advantage of the so called MacroFusion.
>> It allows the merge of cmp and jxx, as long as the branch
>> checks only one flag, so all signed branches are to be avoided
>> (as stated by intel manual).
>> The newer the cpu, the more opcodes (add/sub)
>> could be merged and less limitations.
>
> Every µarch that can do macro-op fusion with add/sub can do so with
> both signed and unsigned branch conditions.
> There's actually only a single µarch that can fuse cmp with unsigned
> but not signed branch conditions and that's more than a decade old
> (Conroe), but if you want to check if (unsigned)a < (unsigned)b 'jb'
> is preferred of 'jc' (both produce the exact same opcode).

One is reason enough for me :}
since the same thing works just as well or better on newer CPUs.

Also, using above/below on arithmetic operations
is a lot more confusing than carry/borrow.
You can see that I use "jc" and "jnc" after "sub".

I'll keep it as it is, unless you have a solid reason to change it.

>>> Assembler-specific ifdeffery should be avoided in individual files.
>>> Something equivalent already exists in x86inc actually but I don't
>>> remember if it was merged to FFmpeg from upstream (x264) yet.
>>
>> Definitely not merged.
>>
>> I hear a lot about the improved x86inc,
>> maybe it is time for you to merge it in FFmpeg?
>
> Now that I think about it I believe that part wasn't merged because
> smartalign is broken on some older nasm versions (which is another
> reason for avoiding things like this in individual files) and FFmpeg
> doesn't enforce any specific version requirements. I guess it could be
> worked around with some version checking ifdeffery if we know which
> versions are affected.

I don't see anything relevant in here:
http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac
I also didn't notice anything similar in the changelog.

I don't think this is a relevant reason for not merging avx2 support. ;)

>>> Isn't this just an overly complicated way of expressing "dd 0*4, 1*4,
>>> 2*4, 3*4, 4*4, 5*4, 6*4, 7*4"?
>>
>> Yes.
>> Do you know a way that works with "times 8"?
>> I've done it this way to make it easy to change the size of the constant.
>>
>> Do you request I change it?
>
> I'd prefer just using that simple one-liner I posted. Replacing the
> numbers if you want to change things later is trivial.

It might be trivial, but it also makes it easier to misspell something.
I'd prefer to keep it as it is.

>>> Is reordering moves for the non-AVX path worth the additional
>>> complexity? Microoptimizations that only affect legacy hardware are
>>> IMO a bit questionable.
>>
>> Yes, I'm on "legacy" hardware and the improvement is reliably measurable.
>
> Ok, in that case you should also use shufps/addps instead of
> vshufps/vaddps in the AVX branch for cosmetic reasons though (x86inc
> performs automatic VEX-encoding).

Only on instructions it knows about...
which brings us again to avx2.

I had it mixed before, but I changed it to be consistent.
Some new avx instructions are not overloaded or
available without their "v" prefix;
e.g. x86util uses vpbroadcastw in SPLATW.

It makes more sense to use the "v" prefix for
code that is guaranteed to be avx.

>>> The AVX1 integer warning is kind of silly and doesn't really serve any
>>> purpose.
>>
>> As I've said in this thread before,
>> If you use avx1 ymm with this macro
>> you won't get any warning that avx2 opcode has been generated
>> because x86inc does not overload avx2.
>
> Sure, but using any other AVX2 instruction in AVX functions wont
> result in any warnings either so it doesn't make much sense to do it
> in this specific case.

Isn't one of the selling points of x86inc that
it warns if unsupported instructions are used?

To be honest, at first I had that function use the
equivalent logical operations from the floating-point domain.
That imposes a domain-transition penalty, and
I do think that doing that behind the developer's back is a bad idea.

This is why I think that this error should remain.

At least until usage of avx2 in avx1 code gives errors.
At that point the error would be redundant and could
be removed.


>>> Use the existing x86util VBROADCASTSS macro. Modify it if it's lacking
>>> something.
>>
>> The x86util VBROADCASTSS macro implements only
>> the avx1 variant that is memory to register.
>> My variant emulates the avx2 variant,
>> that also does reg to reg.
>>
>> My variant also avoids write dependency on "movss reg, reg".
>> ("movss reg, mem" clears the other elements, but
>> "movss reg, reg" preserves them. This creates dependency on the old
>> values of the register.)
>>
>> And yes, I did ask if I should separate
>> these macros in another file and
>> if I should try to merge them in x86util.asm first.
>>
>> What do you think on the matter?
>
> Add your improvements to the existing x86util macro (can be done in
> the same patch) and then use that.

The FFmpeg guidelines request that it be a separate patch in a separate mail.

>>> +%macro EMU_pbroadcastd 2 ; dst, src
>>> +%if cpuflag(avx2)
>>> +       vpbroadcastd %1,   %2
>>> +%elif cpuflag(avx) && mmsize >= 32
>>> +        %error AVX1 does not support integer 256 bit ymm operations
>>> +%else
>>> +  %ifnum sizeof%2       ; sse2 register
>>> +        pshufd      %1,   %2,   q0000
>>> +  %else                 ; sse memory
>>> +        movd        %1,   %2            ; movd zero extends
>>> +        pshufd      %1,   %1,   0
>>> +  %endif
>>> +%endif
>>> +%endmacro
>>>
>>> Same as above, but using the SPLATD macro.
>>
>> Again, SPLATD does not handle avx2.
>> Also I do think it is better to use a name of existing instruction
>> instead of a made up one.
>
> Same as above. Yes, I agree that the current naming is inconsistent
> (that macro was created before broadcast instructions existed) and I'm
> OK with renaming it, but if so it should be done in a separate patch.

Just renaming it is not enough, because SPLATD
works differently (it can broadcast the second, third or fourth element too)
and it is already used in existing code.
Maybe a PBROADCASTD macro could use SPLATD internally.

>>> Use the actual cmpss instruction instead of psuedo-instructions.
>>> x86inc doesn't support automatic VEX handling for every possible
>>> psuedo-instruction (mainly because there's so many of them) which will
>>> result in AVX state transitions if INIT_YMM is used.
>>
>> Not according to Intel Architecture Code Analyzer
>> and the benchmarks I've got.
>> The benchmarks showed that using cmpps is about 5% slower on Ryzen
>> and absolutely same speed on SkyLake.
>> That's in avx2 ymm.
>
> cmpss, not cmpps. "cmpless a, b" is a psuedo-instruction which the
> assembler will convert into the actual instruction "cmpss a, b, 2".

"cmpps" is not scalar, but it is handled by x86inc macros and converted
to "vcmpps", so no SSE penalties are possible with it.

Read again what I wrote previously.


> One could of course argue whether or not x86inc should deal with
> psuedo-instructions but right now it doesn't so it will cause AVX
> state transitions if INIT_YMM is used since it wont be VEX-encoded.

Let's say that using a number to select the comparison mode
is NOT developer-friendly.


>>> This macro is only ever used once, remove and inline it to reduce
>>> complexity.
>>
>> Can I move it to x86util too?
>>
>> Unified handling of PIC mangling is something welcomed.
>> I do think this is the right way. ;)
>> I think that at least one other file had something very similar.
>
> Doing PIC-mangling optimally will in many cases be context-dependent
> which is probably why it doesn't already exist. For example if you
> need to do multiple complex PIC-loads you should do a single lea and
> reuse it which is harder to convey as a generic macro.

I think that this macro can help with that too:
e.g.
    SET_PIC_BASE lea,  r5, const_1
    movaps        m1,  [pic_base_const_1 + r2 ]
    movaps        m2,  [pic_base_const_1 + const_2 - const_1 + r2]

I guess you might want a second macro, something like:
    %define pic_mangle(a,b) (pic_base_%+ a + b - a)
    movaps        m2,  [pic_mangle(const_1, const_2) ]

>>> IIRC the number of used mmregs is ignored in 32-bit mode so I believe
>>> you can unconditionally specify 11.
>>
>> Is that documented and guaranteed to not change in future?
>> I had it at 11 unconditionally. But then decided to play safe.
>
> There are no callee-saved xmm registers or anything else in x86-32
> that would require keeping track of how many of them are clobbered.

OK, will "simplify" it then.

>>> Potential use case for 2x FMA instead of 2x mul+add.
>>
>> FMA is newer than AVX1. To actually use the opcodes
>> I'd need to have 4'th version just for it or use avx2.
>>
>> Without possibility of using FMA, the macro usage
>> would be just an obfuscation.
>>
>> Do you insist on it?
>
> I wouldn't bother with a separate FMA version but if you end up doing
> AVX2 it should definitely be done as part of it (AVX2 is a superset of
> FMA3).
>
>>>> +%%restore_sign_loop:
>>>> +        movaps      m0,   [tmpY + r4q]  ; m0 = Y[i]
>>>> +        movups      m1,   [inXq + r4q]  ; m1 = X[i]
>>>> +        andps       m1,   m2            ; m1 = sign(X[i])
>>>> +        orps        m0,   m1            ; m0 = Y[i]*sign
>>>> +        cvtps2dq    m3,   m0            ; m3 = (int)m0
>>>> +        movaps      [outYq + r4q], m3
>>>
>>> The orps instruction could be done using a memory operand instead of
>>> movaps+orps.
>>
>> I think I had tried that, but it was slower on my CPU.
>
> That's quite weird if it's the case. It's literally the same amount of
> µops just with a tiny bit shorter code size, so I would probably chalk
> it up to random differences in code alignment or something. Either way
> it's really not important so feel free to keep it as it is.

I don't leave alignment to chance ;)

Keep in mind, uops have different latencies, so
the order still matters.

I'll keep it.

>>> The reg-reg movaps on x64 could be optimized away by reordering
>>> registers so that the return value ends up in m0 instead of m6 from
>>> the start.
>>
>> No, it cannot.
>> I need m0 for the blendv mask on sse4.
>
> Oh right, I'm dumb. Never mind.
>
>>> The code has a bunch of if conditions for ymm regs but ymm regs are
>>> never actually used?
>>
>> I was asked to remove the INIT_YMM avx2 .
>>
>> The benchmarks showed that the resulting code is
>> slightly slower in the default case, so I had the code disabled.
>> But a lot of people cannot stand having disabled code around. :|
>>
>> If you want to help, you might try to find out why the code is slower.
>> My main theory is that 8 floats at once might be just too much,
>> 90% of the searches are with N<32.
>>
>> I don't have avx2 capable cpu to play with.
>
> Likely due to the cmpless AVX state transition I mentioned earlier.
> Could be worth re-testing once that is taken care of.

No, it is not.

The cmpps version definitely got converted to vcmpps
and no SSE penalty could have been imposed on it.

The "vcmpps" had absolutely the same speed on Skylake.
On Ryzen it was slower.

The reason for the slowdown is somewhere else.

As for now,
should I bring back the disabled "INIT_YMM avx2"
or leave it as it is?

Best Regards
Henrik Gramner Aug. 4, 2017, 8:20 p.m. UTC | #8
On Thu, Aug 3, 2017 at 11:36 PM, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
>> 1234_____________1234_1234_123
>>     VBROADCASTSS ym1, xm1
>>     BLENDVPS      m1, m2, m3
>>
>> is the most commonly used alignment.
>
> I see that a lot of .asm files use different alignments.
> I'll try to pick something similar that I find good looking.
>
> I also see that different function/macro can have
> different position for the first comma.
> This could be quite useful, to visually separate
> the macros.

Things like indentation can be a bit inconsistent at times, yes. Code
by different authors written over the span of many years, etc.

It's normal to diverge from "standard alignment" in cases when macro
names (or memory addresses) are fairly long. Otherwise the amount of
whitespace between instructions and operands would easily get too
large to be practical.

>> Now that I think about it I believe that part wasn't merged because
>> smartalign is broken on some older nasm versions (which is another
>> reason for avoiding things like this in individual files) and FFmpeg
>> doesn't enforce any specific version requirements. I guess it could be
>> worked around with some version checking ifdeffery if we know which
>> versions are affected.
>
> I don't see anything relevant in here:
> http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac
> I also didn't notice anything similar in the changelog.

http://repo.or.cz/nasm.git/commit/f29123b936b1751c6d72dad86b05eb293cca5c6c
https://bugzilla.nasm.us/show_bug.cgi?id=3392319

> I had it mixed before, but I changed it to be consistent.
> Some new avx instruction are not overloaded or
> available without their "v-" prefix.
> e.g. x86util uses vpbroadcastw in SPLATW.

Every instruction that has both a legacy encoding and a VEX-encoding
will be automatically converted to VEX in AVX functions when written
without the v-prefix. There is no legacy encoding for newer
instructions so there's no reason for x86inc to overload those.

> Isn't one of the selling points for x86inc that
> it would warn if unsupported instructions are used?

Not really, no. It's done in a small subset of cases as a part of the
aforementioned auto legacy/VEX conversion because it was trivial to
add that functionality, but correctly tracking every single x86
instruction in existence would be a whole different thing.

Some instructions require different instruction sets not just
depending on the vector width but the operand types as well (for
example, pextrw from mm to gpr requires mmx2, from xmm to gpr requires
sse2, from xmm to memory requires sse4.1) and there's no generic way
to handle every special corner case which would make things fairly
complex. I'm guessing doing so would also cause a significant
reduction in compilation speed which I'm sure some people would
dislike.
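
To make the pextrw example concrete (plain instruction forms, shown
only for illustration):

    pextrw  eax, mm0, 0          ; mm  register -> gpr    : needs MMXEXT
    pextrw  eax, xmm0, 0         ; xmm register -> gpr    : needs SSE2
    pextrw  word [rdi], xmm0, 0  ; xmm register -> memory : needs SSE4.1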

>> Add your improvements to the existing x86util macro (can be done in
>> the same patch) and then use that.
>
> FFmpeg guidelines request that it is a separate patch in separate mail.

Doesn't really seem to be followed historically by looking at the git
history, but doing things by the book is never wrong so making it a
separate commit might be the correct choice.

> Just renaming it is not enough, because SPLATD
> works differently (it can broadcast the second, third or fourth element too)
> and it is already used in existing code.
> Maybe PBROADCASTD macro could use SPLATD internally.

SPLATD is hardcoded to broadcast the lowest element.

http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/x86/x86util.asm;h=cc7d272cada6d73e5ddc42ae359491086a206743;hb=HEAD#l782

Ideally the SSE branch of SPLATD should be also removed and any float
code using it should be changed to use the BROADCASTSS macro instead,
but that's perhaps going a bit off-topic.
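
Roughly the distinction at the instruction level (a sketch, not the
actual macro bodies):

    pshufd  xmm1, xmm0, 0   ; integer-domain broadcast of dword 0 (what SPLATD boils down to on SSE2)
    shufps  xmm0, xmm0, 0   ; float-domain broadcast of element 0 (what a BROADCASTSS-style macro emits pre-AVX)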

> "cmpps" is not scalar, but it is handled by x86inc macros and converted
> to "vcmpps", so no SSE penalties are possible with it.
>
> Read again what I wrote previously.

'cmpps' is packed, not scalar (hence the 'p' instead of 's') so it
obviously shouldn't be used in scalar code. I believe you misread my
initial comment which said to use 'cmpss' instead of 'cmpless'.
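
For reference, the pseudo-ops are just fixed-predicate forms of
cmpss/cmpps (predicate 2 = less-or-equal):

    cmpless xmm0, xmm3      ; scalar pseudo-op, same as: cmpss xmm0, xmm3, 2
    cmpleps xmm0, xmm3      ; packed pseudo-op, same as: cmpps xmm0, xmm3, 2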

>> One could of course argue whether or not x86inc should deal with
>> pseudo-instructions but right now it doesn't, so it will cause AVX
>> state transitions if INIT_YMM is used since it won't be VEX-encoded.
>
> Let's say that using a number for selecting comparison mode
> is NOT developer friendly.

I did indeed say one can argue about it. I can see if there's any
clean way of handling it that doesn't require a hundred lines of code.

>> Doing PIC-mangling optimally will in many cases be context-dependent
>> which is probably why it doesn't already exist. For example if you
>> need to do multiple complex PIC-loads you should do a single lea and
>> reuse it which is harder to convey as a generic macro.
>
> I think that this macro can help with that too:
> e.g.
>     SET_PIC_BASE lea,  r5, const_1
>     movaps        m1,  [pic_base_const_1 + r2 ]
>     movaps        m2,  [pic_base_const_1 + const_2 - const_1 + r2]
>
> I guess you might want a second macro, something like:
>     %define pic_mangle(a,b) (pic_base_%+ a + b - a)
>     movaps        m2,  [pic_mangle(const_1, const_2) ]

Hmm, something like this might be doable (completely untested):

%macro SET_PIC_BASE 1-2 $$ ; reg, base
    %ifdef PIC
        lea %1, [%2]
        %xdefine WRT_PIC_BASE +%1-%2
    %else
        %xdefine WRT_PIC_BASE
    %endif
%endmacro

SET_PIC_BASE r5, const_1
movaps m1, [const_1 + r2 WRT_PIC_BASE]
movaps m1, [const_2 + r2 WRT_PIC_BASE]

Note that it's generally not possible to create a valid relocation
between symbols if one or more of them are external. In that case the
relocation needs to be relative to some local reference point, such
as $$, which in the above example would be done by simply omitting the
base:

SET_PIC_BASE r5

It's of course possible to just always use $$, but it wastes a bit of
code size by always forcing imm32 offsets so it'd be preferable to
leave it optional.

> The cmpps version definitely got converted to vcmpps
> and no SSE penalty could have been imposed on it.
>
> The "vcmpps" had absolutely the same speed on Skylake.
> On Ryzen it was slower.
>
> The reason for the slowdown is somewhere else.

The code you posted used cmpless, not cmpps. Pseudo-instructions do
_NOT_ automatically become VEX-encoded in AVX mode; feel free to look
at the disassembly.
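
I.e., inside an INIT_XMM avx block (untested sketch, assuming the
usual x86inc setup):

    cmpss   m0, m3, 2   ; overloaded by x86inc, emitted as vcmpss -> no state transition
    cmpless m0, m3      ; pseudo-op, not overloaded, legacy encoding -> triggers the SSE/AVX transition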

> As for now,
> should I bring back the disabled "INIT_YMM avx2"
> or leave it as it is?

If you tested it with no AVX state transitions and it was still slower
than 128-bit SIMD on a CPU with 256-bit SIMD units then it's of course
better to just stick with the existing 128-bit code.
Ivan Kalvachev Aug. 4, 2017, 10:58 p.m. UTC | #9
On 8/4/17, Henrik Gramner <henrik@gramner.com> wrote:
> On Thu, Aug 3, 2017 at 11:36 PM, Ivan Kalvachev <ikalvachev@gmail.com>
> wrote:
>>> 1234_____________1234_1234_123
>>>     VBROADCASTSS ym1, xm1
>>>     BLENDVPS      m1, m2, m3
>>>
>>> is the most commonly used alignment.
>>
>> I see that a lot of .asm files use different alignments.
>> I'll try to pick something similar that I find good looking.
>>
>> I also see that different function/macro can have
>> different position for the first comma.
>> This could be quite useful, to visually separate
>> the macros.
>
> Things like indentation can be a bit inconsistent at times, yes. Code
> by different authors, written over the span of many years, etc.
>
> It's normal to diverge from "standard alignment" in cases when macro
> names (or memory addresses) are fairly long. Otherwise the amount of
> whitespace between instructions and operands would easily get too
> large to be practical.

I tried different things. A single space after the comma indeed looks best.
I used two spaces after the instruction.

>>> Now that I think about it I believe that part wasn't merged because
>>> smartalign is broken on some older nasm versions (which is another
>>> reason for avoiding things like this in individual files) and FFmpeg
>>> doesn't enforce any specific version requirements. I guess it could be
>>> worked around with some version checking ifdeffery if we know which
>>> versions are affected.
>>
>> I don't see anything relevant in here:
>> http://repo.or.cz/nasm.git/history/HEAD:/macros/smartalign.mac
>> I also didn't notice anything similar in the changelog.
> http://repo.or.cz/nasm.git/commit/f29123b936b1751c6d72dad86b05eb293cca5c6c
> https://bugzilla.nasm.us/show_bug.cgi?id=3392319

This would break the generation of dependencies?
I think that FFmpeg uses combined mode with nasm,
so it should not invoke the preprocessor-only mode,
unless DBG=1 is used.

I'll investigate this further.
(revert the nasm fix, try building ffmpeg dependencies ...)
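
Something along these lines might be enough as a guard (untested
sketch; the version cutoff below is a placeholder until the affected
nasm versions are confirmed):

%ifdef __NASM_VERSION_ID__
  %if __NASM_VERSION_ID__ >= 0x020d0000    ; placeholder cutoff, needs checking against the bug above
    %use "smartalign"
    ALIGNMODE p6
  %endif
%endif

That would keep smartalign out of builds with a known-broken nasm while
still using it on newer versions.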

>> I had it mixed before, but I changed it to be consistent.
>> Some new avx instruction are not overloaded or
>> available without their "v-" prefix.
>> e.g. x86util uses vpbroadcastw in SPLATW.
>
> Every instruction that has both a legacy encoding and a VEX-encoding
> will be automatically converted to VEX in AVX functions when written
> without the v-prefix. There is no legacy encoding for newer
> instructions so there's no reason for x86inc to overload those.

That's great, but it doesn't address the original problem.
Do you insist on removing the "v-" prefix from AVX only instructions?

You said that this is purely cosmetic and
I think that it would make compilation a bit slower,
because of unnecessary macro expansions. ;)

>> Isn't one of the selling points for x86inc that
>> it would warn if unsupported instructions are used?
>
> Not really, no. It's done in a small subset of cases as a part of the
> aforementioned auto legacy/VEX conversion because it was trivial to
> add that functionality, but correctly tracking every single x86
> instruction in existence would be a whole different thing.
>
> Some instructions require different instruction sets not just
> depending on the vector width but the operand types as well (for
> example, pextrw from mm to gpr requires mmx2, from xmm to gpr requires
> sse2, from xmm to memory requires sse4.1) and there's no generic way
> to handle every special corner case which would make things fairly
> complex. I'm guessing doing so would also cause a significant
> reduction in compilation speed which I'm sure some people would
> dislike.

There is already a check for register size and integer operations,
because SSE1 does not have 128-bit xmm integer operations.
It should take just a few lines of code to add the same check for
AVX1 and 256-bit ymm integer operations.
(I'm not confident enough to mess with that code, yet.)
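
The kind of guard I mean, mirroring what the EMU_ macros in this patch
already do locally (a fragment of a macro body, so %1 is the
destination operand):

%if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
    %error AVX1 does not support integer 256-bit ymm operations
%endif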

As for the immediate task:
The "error" message stays.

Maybe I should also add "use PBROADCASTD instead" to it?


>>> Add your improvements to the existing x86util macro (can be done in
>>> the same patch) and then use that.
>>
>> FFmpeg guidelines request that it is a separate patch in separate mail.
>
> Doesn't really seem to be followed historically by looking at the git
> history, but doing things by the book is never wrong so making it a
> separate commit might be the correct choice.

Will do.

>> Just renaming it is not enough, because SPLATD
>> works differently (it can broadcast the second, third or fourth element too)
>> and it is already used in existing code.
>> Maybe PBROADCASTD macro could use SPLATD internally.
>
> SPLATD is hardcoded to broadcast the lowest element.
>
> http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavutil/x86/x86util.asm;h=cc7d272cada6d73e5ddc42ae359491086a206743;hb=HEAD#l782
>
> Ideally the SSE branch of SPLATD should be also removed and any float
> code using it should be changed to use the BROADCASTSS macro instead,
> but that's perhaps going a bit off-topic.

It seems I've been looking at x264 x86util.asm:

https://git.videolan.org/?p=x264.git;a=blob;f=common/x86/x86util.asm;h=7a140ebd93ccd845b61bfbe1e7e379b09d41d293;hb=HEAD#l280

I won't be using SPLATD in PBROADCASTD.

>> "cmpps" is not scalar, but it is handled by x86inc macros and converted
>> to "vcmpps", so no SSE penalties are possible with it.
>>
>> Read again what I wrote previously.
>
> 'cmpps' is packed, not scalar (hence the 'p' instead of 's') so it
> obviously shouldn't be used in scalar code. I believe you misread my
> initial comment which said to use 'cmpss' instead of 'cmpless'.

cmpss/ps seem to have the same latency/throughput on AVX-capable CPUs.
I'm using cmpss simply because I need to compare only the lowest element.
It's the only scalar instruction in that block.

Anyway, already changed the cmpless to cmpss.
Maybe I'll be able to ask for another benchmark on Ryzen.


>>> One could of course argue whether or not x86inc should deal with
>>> pseudo-instructions but right now it doesn't, so it will cause AVX
>>> state transitions if INIT_YMM is used since it won't be VEX-encoded.
>>
>> Let's say that using a number for selecting comparison mode
>> is NOT developer friendly.
>
> I did indeed say one can argue about it. I can see if there's any
> clean way of handling it that doesn't require a hundred lines of code.

Hundred??

cmpless is already handled by the assembler.
I think all you'd need to handle it is:
AVX_INSTR cmpless, sse, 1, 0, 0

and the same for cmpeqss, cmpeqps, cmpltss, cmpltps, cmpleps, etc...
8 packed, 8 scalar.

Unless I miss something (and as I've said before,
I'm not confident enough to mess with that code.)

(AVX does extend to 32 variants, but they are not
SSE compatible, so no need to emulate them.)
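
Spelled out, the SSE-compatible set would look something like this
(the trailing flags simply copy the cmpless line above and still need
to be checked against the actual AVX_INSTR parameters):

AVX_INSTR cmpeqps,    sse, 1, 0, 0
AVX_INSTR cmpeqss,    sse, 1, 0, 0
AVX_INSTR cmpltps,    sse, 1, 0, 0
AVX_INSTR cmpltss,    sse, 1, 0, 0
AVX_INSTR cmpleps,    sse, 1, 0, 0
AVX_INSTR cmpless,    sse, 1, 0, 0
AVX_INSTR cmpunordps, sse, 1, 0, 0
AVX_INSTR cmpunordss, sse, 1, 0, 0
AVX_INSTR cmpneqps,   sse, 1, 0, 0
AVX_INSTR cmpneqss,   sse, 1, 0, 0
AVX_INSTR cmpnltps,   sse, 1, 0, 0
AVX_INSTR cmpnltss,   sse, 1, 0, 0
AVX_INSTR cmpnleps,   sse, 1, 0, 0
AVX_INSTR cmpnless,   sse, 1, 0, 0
AVX_INSTR cmpordps,   sse, 1, 0, 0
AVX_INSTR cmpordss,   sse, 1, 0, 0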

>>> Doing PIC-mangling optimally will in many cases be context-dependent
>>> which is probably why it doesn't already exist. For example if you
>>> need to do multiple complex PIC-loads you should do a single lea and
>>> reuse it which is harder to convey as a generic macro.
>>
>> I think that this macro can help with that too:
>> e.g.
>>     SET_PIC_BASE lea,  r5, const_1
>>     movaps        m1,  [pic_base_const_1 + r2 ]
>>     movaps        m2,  [pic_base_const_1 + const_2 - const_1 + r2]
>>
>> I guess you might want a second macro, something like:
>>     %define pic_mangle(a,b) (pic_base_%+ a + b - a)
>>     movaps        m2,  [pic_mangle(const_1, const_2) ]
>
> Hmm, something like this might be doable (completely untested):
>
> %macro SET_PIC_BASE 1-2 $$ ; reg, base
>     %ifdef PIC
>         lea %1, [%2]
>         %xdefine WRT_PIC_BASE +%1-%2
>     %else
>         %xdefine WRT_PIC_BASE
>     %endif
> %endmacro
>
> SET_PIC_BASE r5, const_1
> movaps m1, [const_1 + r2 WRT_PIC_BASE]
> movaps m1, [const_2 + r2 WRT_PIC_BASE]

    movaps m1, [WRT_PIC_BASE + const_2 + r2 ]

Looks better. (Also not tested. Will do, later.)

> Note that it's generally not possible to create a valid relocation
> between symbols if one or more of them are external. In that case the
> relocations needs to be relative to some local reference point, such
> as $$, which in the above example would be done by simply omitting the
> base:
>
> SET_PIC_BASE r5
>
> It's of course possible to just always use $$, but it wastes a bit of
> code size by always forcing imm32 offsets so it'd be preferable to
> leave it optional.

Yeah, $$ is the start of the current section, and that is going to be
".text", not ".rodata".

Anyway, for now I'll keep my PIC macro code in my file.
OK?

>> The cmpps version definitely got converted to vcmpps
>> and no SSE penalty could have been imposed on it.
>>
>> The "vcmpps" had absolutely the same speed on Skylake.
>> On Ryzen it was slower.
>>
>> The reason for the slowdown is somewhere else.
>
> The code you posted used cmpless, not cmpps. Pseudo-instructions do
> _NOT_ automatically become VEX-encoded in AVX mode; feel free to look
> at the disassembly.

And I've been saying that I asked people to do benchmarks
with "cmpless" changed to "cmpps", just to rule out SSE vs AVX.


>> As for now,
>> should I bring back the disabled "INIT_YMM avx2"
>> or leave it as it is?
>
> If you tested it with no AVX state transitions and it was still slower

YES!

> than 128-bit SIMD on a CPU with 256-bit SIMD units then it's of course
> better to just stick with the existing 128-bit code.

This comes close, but does not exactly answer my question.

Should I restore the disabled code (as disabled code),
or should I just keep it deleted, so other people would
also wonder "What's going on here?!". ;)

Disabled or Deleted.

Best Regards
Rostislav Pehlivanov Aug. 5, 2017, 4:23 a.m. UTC | #10
On 4 August 2017 at 23:58, Ivan Kalvachev <ikalvachev@gmail.com> wrote:

>
> >> I had it mixed before, but I changed it to be consistent.
> >> Some new avx instruction are not overloaded or
> >> available without their "v-" prefix.
> >> e.g. x86util uses vpbroadcastw in SPLATW.
> >
> > Every instruction that has both a legacy encoding and a VEX-encoding
> > will be automatically converted to VEX in AVX functions when written
> > without the v-prefix. There is no legacy encoding for newer
> > instructions so there's no reason for x86inc to overload those.
>
> That's great, but it doesn't address the original problem.
> Do you insist on removing the "v-" prefix from AVX only instructions?
>
>
I insist.
Ivan Kalvachev Aug. 5, 2017, 10:21 a.m. UTC | #11
On 8/5/17, Rostislav Pehlivanov <atomnuker@gmail.com> wrote:
> On 4 August 2017 at 23:58, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
>
>>
>> >> I had it mixed before, but I changed it to be consistent.
>> >> Some new avx instruction are not overloaded or
>> >> available without their "v-" prefix.
>> >> e.g. x86util uses vpbroadcastw in SPLATW.
>> >
>> > Every instruction that has both a legacy encoding and a VEX-encoding
>> > will be automatically converted to VEX in AVX functions when written
>> > without the v-prefix. There is no legacy encoding for newer
>> > instructions so there's no reason for x86inc to overload those.
>>
>> That's great, but it doesn't address the original problem.
>> Do you insist on removing the "v-" prefix from AVX only instructions?
>>
>>
> I insist.

If you insist, then you should have a good technical reason for it.
Something that is not based on looks, cosmetics or "it has always been
like this".

For example, slow compilation because of unnecessary macro expansion is
a technical reason to keep the "v-" prefix.

;)
Hendrik Leppkes Aug. 5, 2017, 11:54 a.m. UTC | #12
On Sat, Aug 5, 2017 at 12:21 PM, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
> On 8/5/17, Rostislav Pehlivanov <atomnuker@gmail.com> wrote:
>> On 4 August 2017 at 23:58, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
>>
>>>
>>> >> I had it mixed before, but I changed it to be consistent.
>>> >> Some new avx instruction are not overloaded or
>>> >> available without their "v-" prefix.
>>> >> e.g. x86util uses vpbroadcastw in SPLATW.
>>> >
>>> > Every instruction that has both a legacy encoding and a VEX-encoding
>>> > will be automatically converted to VEX in AVX functions when written
>>> > without the v-prefix. There is no legacy encoding for newer
>>> > instructions so there's no reason for x86inc to overload those.
>>>
>>> That's great, but it doesn't address the original problem.
>>> Do you insist on removing the "v-" prefix from AVX only instructions?
>>>
>>>
>> I insist.
>
> If you insist, then you should have a good technical reason for it.
> Something that is not based on looks, cosmetics or "it has always been
> like this".
>
> For example, slow compilation because of unnecessary macro expansion is
> a technical reason to keep the "v-" prefix.
>

Consistency with the typical style of our codebase is also a reason.

- Hendrik
Ivan Kalvachev Aug. 5, 2017, 2:05 p.m. UTC | #13
On 8/5/17, Hendrik Leppkes <h.leppkes@gmail.com> wrote:
> On Sat, Aug 5, 2017 at 12:21 PM, Ivan Kalvachev <ikalvachev@gmail.com>
> wrote:
>> On 8/5/17, Rostislav Pehlivanov <atomnuker@gmail.com> wrote:
>>> On 4 August 2017 at 23:58, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
>>>
>>>>
>>>> >> I had it mixed before, but I changed it to be consistent.
>>>> >> Some new avx instruction are not overloaded or
>>>> >> available without their "v-" prefix.
>>>> >> e.g. x86util uses vpbroadcastw in SPLATW.
>>>> >
>>>> > Every instruction that has both a legacy encoding and a VEX-encoding
>>>> > will be automatically converted to VEX in AVX functions when written
>>>> > without the v-prefix. There is no legacy encoding for newer
>>>> > instructions so there's no reason for x86inc to overload those.
>>>>
>>>> That's great, but it doesn't address the original problem.
>>>> Do you insist on removing the "v-" prefix from AVX only instructions?
>>>>
>>>>
>>> I insist.
>>
>> If you insist, then you should have a good technical reason for it.
>> Something that is not based on looks, cosmetics or "it has always been
>> like this".
>>
>> For example, slow compilation because of unnecessary macro expansion is
>> a technical reason to keep the "v-" prefix.
>>
>
> Consistency with the typical style of our codebase is also a reason.

No, it is not. This falls into the category of doing things wrong
because we have always done them wrong.
Also, you brought up the matter of compilation speed;
now you don't care about it? ;)

Don't worry, I will make the requested change,
but only because in the future it might help with
the AVX1 & ymm integer test.

Please answer my other question(s).
Henrik Gramner Aug. 6, 2017, 11:39 a.m. UTC | #14
On Sat, Aug 5, 2017 at 12:58 AM, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
> 8 packed, 8 scalar.
>
> Unless I miss something (and as I've said before,
> I'm not confident enough to mess with that code.)
>
> (AVX does extend to 32 variants, but they are not
> SSE compatible, so no need to emulate them.)

Oh, right. I quickly glanced at the docs and saw 32 pseudo-ops for
each instruction for a total of 128 when adding pd, ps, sd, ss, but
the fact that only the first 8 are relevant here reduces it to 32, which
is a lot more manageable.

>     movaps m1, [WRT_PIC_BASE + const_2 + r2 ]
>
> Looks better. (Also not tested. Will do, later.)

I intentionally used the WRT define at the end because that's most
similar to the built-in wrt syntax used when accessing symbols through
the PLT or GOT, e.g.

mov eax, [external_symbol wrt ..got]

> Yeah, $$ is the start of the current section, and that is going to be
> ".text", not ".rodata".

Obviously, yes. You need a reference that results in a compile-time
constant PC-offset (which .rodata isn't) to create PC-relative
relocation records to external symbols.
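
For illustration, with the base omitted the macro from earlier expands
to roughly this (ext_const is a hypothetical external symbol):

SET_PIC_BASE r5
movaps m0, [ext_const + r2 WRT_PIC_BASE]
; expands to:
lea    r5, [$$]
movaps m0, [ext_const + r2 + r5 - $$]   ; ext_const-$$ is a compile-time constant PC offset
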
diff mbox

Patch

From 67c93689fab23386cb4d08d4a74d3e804f07c4f4 Mon Sep 17 00:00:00 2001
From: Ivan Kalvachev <ikalvachev@gmail.com>
Date: Thu, 8 Jun 2017 22:24:33 +0300
Subject: [PATCH 1/5] SIMD opus pvq_search implementation

An explanation of the workings and methods used by the
Pyramid Vector Quantization Search function
can be found in the following Work-In-Progress mail threads:
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212146.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212816.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213030.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213436.html

Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
---
 libavcodec/opus_pvq.c              |   3 +
 libavcodec/opus_pvq.h              |   5 +-
 libavcodec/x86/Makefile            |   2 +
 libavcodec/x86/opus_dsp_init.c     |  43 +++
 libavcodec/x86/opus_pvq_search.asm | 538 +++++++++++++++++++++++++++++++++++++
 5 files changed, 589 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/x86/opus_dsp_init.c
 create mode 100644 libavcodec/x86/opus_pvq_search.asm

diff --git a/libavcodec/opus_pvq.c b/libavcodec/opus_pvq.c
index 2ac66a0ede..3aa502929c 100644
--- a/libavcodec/opus_pvq.c
+++ b/libavcodec/opus_pvq.c
@@ -947,6 +947,9 @@  int av_cold ff_celt_pvq_init(CeltPVQ **pvq)
     s->encode_band        = pvq_encode_band;
     s->band_cost          = pvq_band_cost;
 
+    if (ARCH_X86 && CONFIG_OPUS_ENCODER)
+        ff_opus_dsp_init_x86(s);
+
     *pvq = s;
 
     return 0;
diff --git a/libavcodec/opus_pvq.h b/libavcodec/opus_pvq.h
index 6691494838..9246337360 100644
--- a/libavcodec/opus_pvq.h
+++ b/libavcodec/opus_pvq.h
@@ -33,8 +33,8 @@ 
                                        float *lowband_scratch, int fill)
 
 struct CeltPVQ {
-    DECLARE_ALIGNED(32, int,   qcoeff      )[176];
-    DECLARE_ALIGNED(32, float, hadamard_tmp)[176];
+    DECLARE_ALIGNED(32, int,   qcoeff      )[256];
+    DECLARE_ALIGNED(32, float, hadamard_tmp)[256];
 
     float (*pvq_search)(float *X, int *y, int K, int N);
 
@@ -45,6 +45,7 @@  struct CeltPVQ {
 };
 
 int  ff_celt_pvq_init  (struct CeltPVQ **pvq);
+void ff_opus_dsp_init_x86(struct CeltPVQ *s);
 void ff_celt_pvq_uninit(struct CeltPVQ **pvq);
 
 #endif /* AVCODEC_OPUS_PVQ_H */
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 0dbc46504e..9875f48797 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -52,6 +52,7 @@  OBJS-$(CONFIG_APNG_DECODER)            += x86/pngdsp_init.o
 OBJS-$(CONFIG_CAVS_DECODER)            += x86/cavsdsp.o
 OBJS-$(CONFIG_DCA_DECODER)             += x86/dcadsp_init.o x86/synth_filter_init.o
 OBJS-$(CONFIG_DNXHD_ENCODER)           += x86/dnxhdenc_init.o
+OBJS-$(CONFIG_OPUS_ENCODER)            += x86/opus_dsp_init.o
 OBJS-$(CONFIG_HEVC_DECODER)            += x86/hevcdsp_init.o
 OBJS-$(CONFIG_JPEG2000_DECODER)        += x86/jpeg2000dsp_init.o
 OBJS-$(CONFIG_MLP_DECODER)             += x86/mlpdsp_init.o
@@ -123,6 +124,7 @@  X86ASM-OBJS-$(CONFIG_MDCT15)           += x86/mdct15.o
 X86ASM-OBJS-$(CONFIG_ME_CMP)           += x86/me_cmp.o
 X86ASM-OBJS-$(CONFIG_MPEGAUDIODSP)     += x86/imdct36.o
 X86ASM-OBJS-$(CONFIG_MPEGVIDEOENC)     += x86/mpegvideoencdsp.o
+X86ASM-OBJS-$(CONFIG_OPUS_ENCODER)     += x86/opus_pvq_search.o
 X86ASM-OBJS-$(CONFIG_PIXBLOCKDSP)      += x86/pixblockdsp.o
 X86ASM-OBJS-$(CONFIG_QPELDSP)          += x86/qpeldsp.o                 \
                                           x86/fpel.o                    \
diff --git a/libavcodec/x86/opus_dsp_init.c b/libavcodec/x86/opus_dsp_init.c
new file mode 100644
index 0000000000..f4c25822db
--- /dev/null
+++ b/libavcodec/x86/opus_dsp_init.c
@@ -0,0 +1,43 @@ 
+/*
+ * Opus encoder assembly optimizations
+ * Copyright (C) 2017 Ivan Kalvachev <ikalvachev@gmail.com>
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/opus_pvq.h"
+
+extern float ff_pvq_search_sse2(float *X, int *y, int K, int N);
+extern float ff_pvq_search_sse4(float *X, int *y, int K, int N);
+extern float ff_pvq_search_avx (float *X, int *y, int K, int N);
+
+av_cold void ff_opus_dsp_init_x86(CeltPVQ *s)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_SSE2(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse2;
+
+    if (EXTERNAL_SSE4(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse4;
+
+    if (EXTERNAL_AVX(cpu_flags))
+        s->pvq_search = ff_pvq_search_avx;
+}
diff --git a/libavcodec/x86/opus_pvq_search.asm b/libavcodec/x86/opus_pvq_search.asm
new file mode 100644
index 0000000000..359ad7a8a4
--- /dev/null
+++ b/libavcodec/x86/opus_pvq_search.asm
@@ -0,0 +1,538 @@ 
+;******************************************************************************
+;* SIMD optimized Opus encoder DSP function
+;*
+;* Copyright (C) 2017 Ivan Kalvachev <ikalvachev@gmail.com>
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "config.asm"
+%include "libavutil/x86/x86util.asm"
+
+%ifdef __NASM_VER__
+%use "smartalign"
+ALIGNMODE p6
+%endif
+
+; Use float ops that give half precision but execute in around 3 cycles.
+; On Skylake & Ryzen the division is much faster (around 11c/3),
+; which makes the full precision code about 2% slower.
+; Opus also uses the rsqrt approximation in its intrinsics code.
+%define USE_APPROXIMATION  01
+
+; Presearch tries to quantize by using the property Sum( abs(x[i]*K)/Sx ) = K.
+; If we use truncation for the division result then the integer Sum would be <= K.
+; By using nearest rounding we get a closer approximation, but
+; we could also get more than K pulses and we have to remove the extra ones.
+; This method is the main improvement of the ffmpeg C function over the Opus original.
+%define PRESEARCH_ROUNDING 01
+
+
+
+SECTION_RODATA 64
+
+const_float_abs_mask:   times 8 dd 0x7fffffff
+const_align_abs_edge:   times 8 dd 0
+
+const_float_0_5:        times 8 dd 0.5
+const_float_1:          times 8 dd 1.0
+const_float_sign_mask:  times 8 dd 0x80000000
+
+const_int32_offsets:
+                        %rep 8
+                                dd $-const_int32_offsets
+                        %endrep
+SECTION .text
+
+
+
+
+;
+; Horizontal Sum Packed Single precision float
+; The resulting sum is in all elements.
+;
+%macro HSUMPS 2 ; dst/src, tmp
+%if cpuflag(avx)
+  %if sizeof%1>=32  ; avx
+       vperm2f128   %2,   %1,   %1,   (0)*16+(1)   ; %2=lo(%1)<<128+hi(%1)
+       vaddps       %1,   %2
+  %endif
+       vshufps      %2,   %1,   %1,   q1032
+       vaddps       %1,   %2
+       vshufps      %2,   %1,   %1,   q0321
+       vaddps       %1,   %2
+
+%else  ; this form is a bit faster than the short avx-like emulation.
+        movaps      %2,   %1            ;    [d,       c,       b,       a       ]
+        shufps      %1,   %1,   q1032   ; %2=[b,       a,       d,       c       ]
+        addps       %1,   %2            ; %1=[b+d,     a+c,     d+b,     c+a     ]
+        movaps      %2,   %1
+        shufps      %1,   %1,   q0321   ; %2=[c+a,     b+d,     a+c,     d+b     ]
+        addps       %1,   %2            ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
+        ; all %1 members should be equal for as long as float a+b==b+a
+%endif
+%endmacro
+
+;
+; Emulate blendvps if not available
+;
+; src_b is destroyed when using the emulation with logical operations
+; SSE41 blend instruction is hard coded to use xmm0 as mask
+;
+%macro EMU_blendvps 3 ; dst/src_a, src_b, mask
+%if cpuflag(avx)
+       vblendvps    %1,   %1,   %2,   %3
+;-------------------------
+%elif cpuflag(sse4)
+  %if notcpuflag(avx)
+    %ifnidn %3,xmm0
+      %error sse41 blendvps uses xmm0 as its implicit third operand, you used %3
+    %endif
+  %endif
+        blendvps    %1,   %2,   %3
+;-------------------------
+%else
+        xorps       %2,   %1
+        andps       %2,   %3
+        xorps       %1,   %2
+%endif
+%endmacro
+
+;
+; Emulate pblendvb if not available
+;
+; src_b is destroyed when using the emulation with logical operations
+; SSE41 blend instruction is hard coded to use xmm0 as mask
+;
+%macro EMU_pblendvb 3 ; dst/src_a, src_b, mask
+%if cpuflag(avx)
+  %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
+        %error AVX1 does not support integer 256 bit ymm operations
+  %endif
+
+       vpblendvb    %1,   %1,   %2,   %3
+;-------------------------
+%elif cpuflag(sse4)
+    %ifnidn %3,xmm0
+      %error sse41 pblendvb uses xmm0 as its implicit third operand, you used %3
+    %endif
+        pblendvb    %1,   %2,   %3
+;-------------------------
+%else
+        pxor        %2,   %1
+        pand        %2,   %3
+        pxor        %1,   %2
+%endif
+%endmacro
+
+;
+; Emulate broadcastss if not available
+;
+%macro EMU_broadcastss 2 ; dst, src
+%if cpuflag(avx2)
+       vbroadcastss %1,   %2            ; ymm, xmm
+;-------------------------
+%elif cpuflag(avx)
+  %ifnum sizeof%2       ; avx1 register
+       vpermilps    xmm%1, xmm%2, q0000 ; xmm, xmm, imm || ymm, ymm, imm
+    %if sizeof%1 >= 32  ; mmsize>=32
+       vinsertf128  %1,   %1,xmm%1,   1 ; ymm, ymm, xmm, im
+    %endif
+  %else                 ; avx1 memory
+       vbroadcastss %1,   %2            ; ymm, mm32 || xmm, mm32
+  %endif
+;-------------------------
+%else
+  %ifnum sizeof%2       ; sse register
+        shufps      %1,   %2,   %2,   q0000
+  %else                 ; sse memory
+        movss       %1,   %2            ; this zeroes the other 3 elements
+        shufps      %1,   %1,   0
+  %endif
+%endif
+%endmacro
+
+;
+; Emulate pbroadcastd if not available
+;
+%macro EMU_pbroadcastd 2 ; dst, src
+%if cpuflag(avx2)
+       vpbroadcastd %1,   %2
+%elif cpuflag(avx) && mmsize >= 32
+        %error AVX1 does not support integer 256 bit ymm operations
+%else
+  %ifnum sizeof%2       ; sse2 register
+        pshufd      %1,   %2,   q0000
+  %else                 ; sse memory
+        movd        %1,   %2            ; movd zero extends
+        pshufd      %1,   %1,   0
+  %endif
+%endif
+%endmacro
+
+;
+;   Setup High Register to be used
+;   for holding memory constants
+;
+; %1 - the register to be used, assumes it is >= mm8
+; %2 - name of the constant.
+;
+; Subsequent opcodes are going to use the constant in the form
+; "addps m0, mm_const_name" and it would be turned into:
+; "addps m0, [const_name]" on 32 bit arch or
+; "addps m0, m8" on 64 bit arch
+%macro SET_HI_REG_MM_CONSTANT 3 ; movop, reg, const_name
+%if num_mmregs > 8
+       %define      mm_%3 %2
+       %{1}         %2,   [%3]  ; movaps m8, [const_name]
+%else
+       %define      mm_%3 [%3]
+%endif
+%endmacro
+
+;
+;   Set
+;     Position Independent Code
+;       Base address of a constant
+; %1 - the register to be used, if PIC is set
+; %2 - name of the constant.
+;
+; Subsequent opcodes are going to use the base address in the form
+; "movaps m0, [pic_base_constant_name+r4q]" and it would be turned into
+; "movaps m0, [r5q + r4q]" if PIC is enabled
+; "movaps m0, [constant_name + r4q]" if texrel are used
+%macro SET_PIC_BASE 3; reg, const_label
+%ifdef PIC
+        %{1}        %2,   [%3]    ; lea r5q, [rip+const]
+        %define     pic_base_%3 %2
+%else
+        %define     pic_base_%3 %3
+%endif
+%endmacro
+
+
+
+
+%macro PULSES_SEARCH 1
+; m6 Syy_norm
+; m7 Sxy_norm
+        addps       m6,   mm_const_float_0_5    ; Syy_norm += 1.0/2
+        pxor        m1,   m1            ; max_idx
+        xorps       m3,   m3            ; p_max
+        xor         r4q,  r4q
+align 16
+%%distortion_search:
+        movd        xmm2, dword r4d     ; movd zero extends
+%ifidn %1,add
+        movaps      m4,   [tmpY + r4q]  ; y[i]
+        movaps      m5,   [tmpX + r4q]  ; X[i]
+
+  %if   USE_APPROXIMATION == 1
+        xorps       m0,   m0
+        cmpps       m0,   m0,   m5,   4 ; m0 = (X[i] != 0.0)
+  %endif
+
+        addps       m4,   m6            ; m4 = Syy_new = y[i] + Syy_norm
+        addps       m5,   m7            ; m5 = Sxy_new = X[i] + Sxy_norm
+
+  %if   USE_APPROXIMATION == 1
+        andps       m5,   m0            ; if(X[i] == 0) Sxy_new = 0; Prevent approximation error from setting pulses in array padding.
+  %endif
+
+%else
+        movaps      m5,   [tmpY + r4q]          ; m5 = y[i]
+
+        xorps       m0,   m0                    ; m0 = 0;
+        cmpps       m0,   m0,   m5,   1         ; m0 = (0<y)
+
+        subps       m4,   m6,   m5              ; m4 = Syy_new = Syy_norm - y[i]
+        subps       m5,   m7,   [tmpX + r4q]    ; m5 = Sxy_new = Sxy_norm - X[i]
+        andps       m5,   m0                    ; (0<y)?m5:0
+%endif
+
+%if   USE_APPROXIMATION == 1
+        rsqrtps     m4,   m4
+        mulps       m5,   m4            ; m5 = p = Sxy_new*approx(1/sqrt(Syy) )
+%else
+        mulps       m5,   m5
+        divps       m5,   m4            ; m5 = p = Sxy_new*Sxy_new/Syy
+%endif
+    EMU_pbroadcastd m2,   xmm2          ; m2=i (all lanes get same values, we add the offset-per-lane, later)
+
+        cmpps       m0,   m3,   m5,   1 ; m0 = (m3 < m5) ; (p_max < p) ; (p > p_max)
+        maxps       m3,   m5            ; m3=max(p_max,p)
+                                        ; maxps here is faster than blendvps, despite blend having lower latency.
+
+        pand        m2,   m0            ; This version seems faster than sse41 pblendvb
+        pmaxsw      m1,   m2            ; SSE2 signed word, so it would work for N < 32768/4
+
+        add         r4q,  mmsize
+        cmp         r4q,  Nq
+        jb          %%distortion_search
+
+        por         m1,   mm_const_int32_offsets    ; max_idx offsets per individual lane (skipped in the inner loop)
+        movdqa      m4,   m1                        ; needed for the aligned y[max_idx]+=1; processing
+
+%if mmsize >= 32
+; Merge parallel maximums round 8 (4 vs 4)
+
+       vextractf128 xmm5, ymm3, 1       ; xmm5 = ymm3[1x128] = ymm3[255..128b]
+       vcmpps       xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[0x128] < p[1x128] )
+
+       vextracti128 xmm2, ymm1, 1       ; xmm2 = ymm1[1x128] = ymm1[255..128b]
+    EMU_blendvps    xmm3, xmm5, xmm0    ; max_idx = m0 ? max_idx[1x128] : max_idx[0x128]
+    EMU_pblendvb    xmm1, xmm2, xmm0    ; p       = m0 ? p[1x128]       : p[0x128]
+%endif
+
+; Merge parallel maximums round 4 (2 vs 2)
+                                               ; m3=p[3210]
+        movhlps     xmm5, xmm3                 ; m5=p[xx32]
+        cmpps       xmm0, xmm3, xmm5, 1 ; m0 = (m3 < m5) = ( p[1,0] < p[3,2] )
+
+        pshufd      xmm2, xmm1, q3232
+    EMU_blendvps    xmm3, xmm5, xmm0    ; max_idx = m0 ? max_idx[3,2] : max_idx[1,0]
+    EMU_pblendvb    xmm1, xmm2, xmm0    ; p       = m0 ? p[3,2]       : p[1,0]
+
+; Merge parallel maximums final round (1 vs 1)
+        movaps      xmm0, xmm3          ; m0 = m3[0] = p[0]
+        shufps      xmm3, xmm3, q1111   ; m3 = m3[1] = p[1]
+        cmpless     xmm0, xmm3
+
+        pshufd      xmm2, xmm1, q1111
+    EMU_pblendvb    xmm1, xmm2, xmm0
+
+        movd  dword r4d,  xmm1          ; zero extends to the rest of r4q
+
+    EMU_broadcastss m3,   [tmpX + r4q]
+        %{1}ps      m7,   m3            ; Sxy += X[max_idx]
+
+    EMU_broadcastss m5,   [tmpY + r4q]
+        %{1}ps      m6,   m5            ; Syy += Y[max_idx]
+
+        ; We have to update a single element in Y[i].
+        ; However, writing 4 bytes and then doing a 16-byte load in the inner loop
+        ; could cause a stall due to a store-forwarding failure.
+    EMU_pbroadcastd m1,   xmm1
+        pcmpeqd     m1,   m1,   m4          ; exactly 1 element matches max_idx and this finds it
+
+        and         r4q,  ~(mmsize-1)       ; align address down, so the value pointed to by max_idx is inside a mmsize load
+        movaps      m5,   [tmpY + r4q]      ; m5 = Y[y3...ym...y0]
+        andps       m1,   mm_const_float_1  ; m1 =  [ 0...1.0...0]
+        %{1}ps      m5,   m1                ; m5 = Y[y3...ym...y0] +/- [0...1.0...0]
+        movaps      [tmpY + r4q], m5        ; Y[max_idx] +-= 1.0;
+
+%endmacro
+
+;
+; We need one more register for
+; PIC relative addressing. Use this
+; to count it in cglobal
+;
+%ifdef PIC
+  %define num_pic_regs 1
+%else
+  %define num_pic_regs 0
+%endif
+
+;
+; In x86_64 mode few more xmm registers are used to hold constants.
+; Use this to count them in cglobal
+;
+%if num_mmregs > 8
+%define num_hireg_used 3
+%endif
+
+;
+; Pyramid Vector Quantization Search implementation
+;
+; float * inX   - Unaligned (SIMD) access, it will be overread,
+;                 but extra data is masked away.
+; int32 * outY  - Should be aligned and padded buffer.
+;                 It is used as temp buffer.
+; uint32 K      - Number of pulses to have after quantizations.
+; uint32 N      - Number of vector elements. Must be 0 < N < 256
+;
+%macro PVQ_FAST_SEARCH 0
+cglobal pvq_search, 4, 5+num_pic_regs, 8+num_hireg_used, 256*4, inX, outY, K, N
+%define tmpX rsp
+%define tmpY outYq
+
+
+        movaps      m0,   [const_float_abs_mask]
+        shl         Nd,   2             ; N *= sizeof(float) ; also 32 bit operation zeroes the high 32 bits in 64 bit mode.
+        mov         r4q,  Nq
+
+        neg         r4q
+        and         r4q,  mmsize-1
+
+        SET_PIC_BASE lea, r5q, const_align_abs_edge    ; rip+const
+        movups      m3,   [pic_base_const_align_abs_edge + r4q - mmsize]
+
+        add         Nq,   r4q           ; Nq = align(Nq, mmsize)
+
+        lea         r4q,  [Nq - mmsize] ; Nq is rounded up (aligned up) to mmsize, so r4q can't become negative here, unless N=0.
+        movups      m2,   [inXq + r4q]
+        andps       m2,   m3
+        movaps      [tmpX + r4q], m2
+        movaps      m1,   m2            ; Sx
+
+align 16
+%%loop_abs_sum:
+        sub         r4q,  mmsize
+        jc          %%end_loop_abs_sum
+
+        movups      m2,   [inXq + r4q]
+        andps       m2,   m0
+
+        movaps      [tmpX + r4q], m2    ; tmpX[i]=abs(X[i])
+        addps       m1,   m2
+        jmp         %%loop_abs_sum
+
+align 16
+%%end_loop_abs_sum:
+
+        HSUMPS      m1,   m2            ; m1  = Sx
+
+        xorps       m0,   m0
+        comiss      xmm0, xmm1          ;
+        jz          %%zero_input        ; if (Sx==0) goto zero_input
+
+        cvtsi2ss    xmm0, dword Kd      ; m0  = K
+%if USE_APPROXIMATION == 1
+        rcpss       xmm1, xmm1          ; m1 = approx(1/Sx)
+        mulss       xmm0, xmm1          ; m0 = K*(1/Sx)
+%else
+        divss       xmm0, xmm1          ; b = K/Sx
+                                        ; b = K/max_x
+%endif
+
+    EMU_broadcastss m0,   xmm0
+
+        lea         r4q,  [Nq - mmsize]
+        pxor        m5,   m5            ; Sy    ( Sum of abs( y[i]) )
+        xorps       m6,   m6            ; Syy   ( Sum of y[i]*y[i]  )
+        xorps       m7,   m7            ; Sxy   ( Sum of X[i]*y[i]  )
+align 16
+%%loop_guess:
+        movaps      m1,   [tmpX + r4q]  ; m1   = X[i]
+        mulps       m2,   m0,   m1      ; m2   = res*X[i]
+  %if PRESEARCH_ROUNDING == 0
+        cvttps2dq   m2,   m2            ; yt   = (int)truncf( res*X[i] )
+  %else
+        cvtps2dq    m2,   m2            ; yt   = (int)lrintf( res*X[i] )
+  %endif
+        paddd       m5,   m2            ; Sy  += yt
+        cvtdq2ps    m2,   m2            ; yt   = (float)yt
+        mulps       m1,   m2            ; m1   = X[i]*yt
+        movaps      [tmpY + r4q], m2    ; y[i] = m2
+        addps       m7,   m1            ; Sxy += m1;
+        mulps       m2,   m2            ; m2   = yt*yt
+        addps       m6,   m2            ; Syy += m2
+
+        sub         r4q,  mmsize
+        jnc         %%loop_guess
+
+        HSUMPS      m6,   m1        ; Syy_norm
+        HADDD       m5,   m4        ; pulses
+
+        movd  dword r4d,  xmm5      ; zero extends to the rest of r4q
+
+        sub         Kd,   r4d       ; K -= pulses , also 32 bit operation zeroes high 32 bit in 64 bit mode.
+        jz          %%finish        ; K - pulses == 0
+
+        SET_HI_REG_MM_CONSTANT movaps, m8,  const_float_0_5
+        SET_HI_REG_MM_CONSTANT movaps, m9,  const_float_1
+        SET_HI_REG_MM_CONSTANT movdqa, m10, const_int32_offsets
+        ; Use Syy/2 in distortion parameter calculations.
+        ; Saves pre- and post-calculation needed to correct the Y[] values.
+        ; Same precision, since the float mantissa is normalized.
+        ; The SQRT approximation does differ.
+        HSUMPS      m7,   m0        ; Sxy_norm
+        mulps       m6,   mm_const_float_0_5
+
+        jc          %%remove_pulses_loop    ; K - pulses < 0
+
+align 16                                    ; K - pulses > 0
+%%add_pulses_loop:
+
+        PULSES_SEARCH add   ; m6 Syy_norm ; m7 Sxy_norm
+
+        sub         Kd,   1
+        jnz         %%add_pulses_loop
+
+        addps       m6,   m6        ; Syy*=2
+
+        jmp         %%finish
+
+align 16
+%%remove_pulses_loop:
+
+        PULSES_SEARCH sub   ; m6 Syy_norm ; m7 Sxy_norm
+
+        add         Kd,   1
+        jnz         %%remove_pulses_loop
+
+        addps       m6,   m6        ; Syy*=2
+
+align 16
+%%finish:
+        lea         r4q,  [Nq - mmsize]
+
+        movaps      m2,   [const_float_sign_mask]
+align 16
+%%restore_sign_loop:
+        movaps      m0,   [tmpY + r4q]  ; m0 = Y[i]
+        movups      m1,   [inXq + r4q]  ; m1 = X[i]
+        andps       m1,   m2            ; m1 = sign(X[i])
+        orps        m0,   m1            ; m0 = Y[i]*sign
+        cvtps2dq    m3,   m0            ; m3 = (int)m0
+        movaps      [outYq + r4q], m3
+
+        sub         r4q,  mmsize
+        jnc         %%restore_sign_loop
+%%return:
+
+%if ARCH_X86_64 == 0    ;sbrdsp
+        movss       r0m,  xmm6      ; return (float)Syy_norm
+        fld         dword r0m
+%else
+        movaps      m0,   m6        ; return (float)Syy_norm
+%endif
+        RET
+
+align 16
+%%zero_input:
+        lea         r4q,  [Nq - mmsize]
+        xorps       m0,   m0
+%%zero_loop:
+        movaps      [outYq + r4q], m0
+        sub         r4q,  mmsize
+        jnc         %%zero_loop
+
+        movaps      m6,   [const_float_1]
+        jmp         %%return
+%endmacro
+
+
+INIT_XMM sse2
+PVQ_FAST_SEARCH
+
+INIT_XMM sse4
+PVQ_FAST_SEARCH
+
+INIT_XMM avx
+PVQ_FAST_SEARCH
-- 
2.13.2