[FFmpeg-devel,WIP] Opus Piramid Vector Quantization Search in x86 SIMD asm

Submitted by Ivan Kalvachev on June 11, 2017, 9:34 a.m.

Details

Message ID CABA=pqdTJMwEynAn6onJ599M-b101NNWgv1x9rNLU8HYtgfN9A@mail.gmail.com
State New
Headers show

Commit Message

Ivan Kalvachev June 11, 2017, 9:34 a.m.
On 6/10/17, James Darnley <james.darnley@gmail.com> wrote:
> On 2017-06-09 13:41, Ivan Kalvachev wrote:
>> On 6/9/17, Michael Niedermayer <michael@niedermayer.cc> wrote:
>>> seems this breaks build with mingw64, didnt investigate but it
>>> fails with these errors:
>>>
>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d):
>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd):
>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1):
>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48):
>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d):
>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd):
>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1):
>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48):
>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>> collect2: error: ld returned 1 exit status
>>> collect2: error: ld returned 1 exit status
>>> make: *** [ffmpeg_g.exe] Error 1
>>> make: *** Waiting for unfinished jobs....
>>> make: *** [ffprobe_g.exe] Error 1
>>
>>
>> const_*_edge is used on only one place is the code.
>> Would you check if this patch fixes the issue.
>>
>> I expected that the addresses would be pre-calculated
>> by n/yasm as one value and indexed
>> relative to the section start.
>> Instead it seems that each entry is represented with
>> its own address and offset from it.
>> Since the offset is negative it uses all 64 bits and
>> it makes difference if it is truncated to 32 bits.
>>
>> Same issue could happen with clang tools.
>
> The problem is with the relative addressing.  You need to load the real
> address first before you can offset with another register at runtime. So
> something like:
>
>> mov  reg1,  [read_only_const]
lea ?
>> mova mmreg, [reg1 + reg2]

.
OK, Getting mingw for my distro is problem
and compiling one myself would take a bit more effort/time.
So I'm posting a patch that "should" work.
         lea         r4q,  [Nq-mmsize]   ; Nq is rounded up (aligned
up) to mmsize, so r4q can't become negative here, unless N=0.
         movups      m2,   [inXq + r4q]
======

What I find surprising is that PIC is enabled only on Windows and does not seem
to depend on CONFIG_PIC, so textrels are used all over assembly code.
Do I miss something?

Are there option(s) to signal/error when texrel is been used in code
that should be pic ?

Comments

Hendrik Leppkes June 11, 2017, 10:02 a.m.
On Sun, Jun 11, 2017 at 11:34 AM, Ivan Kalvachev <ikalvachev@gmail.com> wrote:
> On 6/10/17, James Darnley <james.darnley@gmail.com> wrote:
>> On 2017-06-09 13:41, Ivan Kalvachev wrote:
>>> On 6/9/17, Michael Niedermayer <michael@niedermayer.cc> wrote:
>>>> seems this breaks build with mingw64, didnt investigate but it
>>>> fails with these errors:
>>>>
>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d):
>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd):
>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1):
>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48):
>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d):
>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd):
>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1):
>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48):
>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>> collect2: error: ld returned 1 exit status
>>>> collect2: error: ld returned 1 exit status
>>>> make: *** [ffmpeg_g.exe] Error 1
>>>> make: *** Waiting for unfinished jobs....
>>>> make: *** [ffprobe_g.exe] Error 1
>>>
>>>
>>> const_*_edge is used on only one place is the code.
>>> Would you check if this patch fixes the issue.
>>>
>>> I expected that the addresses would be pre-calculated
>>> by n/yasm as one value and indexed
>>> relative to the section start.
>>> Instead it seems that each entry is represented with
>>> its own address and offset from it.
>>> Since the offset is negative it uses all 64 bits and
>>> it makes difference if it is truncated to 32 bits.
>>>
>>> Same issue could happen with clang tools.
>>
>> The problem is with the relative addressing.  You need to load the real
>> address first before you can offset with another register at runtime. So
>> something like:
>>
>>> mov  reg1,  [read_only_const]
> lea ?
>>> mova mmreg, [reg1 + reg2]
>
> .
> OK, Getting mingw for my distro is problem
> and compiling one myself would take a bit more effort/time.
> So I'm posting a patch that "should" work.
> ======
> --- a/libavcodec/x86/opus_pvq_search.asm
> +++ b/libavcodec/x86/opus_pvq_search.asm
> @@ -406,7 +406,7 @@ align 16
>  ; uint32 N      - Number of vector elements. Must be 0 < N < 8192
>  ;
>  %macro PVQ_FAST_SEARCH 0
> -cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
> +cglobal pvq_search,4,6,8, mmsize, inX, outY, K, N
>  %define tmpX rsp
>
>  ;        movsxdifnidn Nq,  Nd
> @@ -419,7 +419,12 @@ cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
>          add         Nq,   r4q           ; Nq = align(Nq, mmsize)
>          sub         rsp,  Nq            ; allocate tmpX[Nq]
>
> +%ifdef PIC
> +        lea         r5q,  [const_align_abs_edge]            ; rip+const
> +        movups      m3,   [r5q+r4q-mmsize]                  ; this is
> the bit mask for the padded read at the end of the input
> +%else
>          movups      m3,   [const_align_abs_edge-mmsize+r4q] ; this is
> the bit mask for the padded read at the end of the input
> +%endif
>
>          lea         r4q,  [Nq-mmsize]   ; Nq is rounded up (aligned
> up) to mmsize, so r4q can't become negative here, unless N=0.
>          movups      m2,   [inXq + r4q]
> ======
>
> What I find surprising is that PIC is enabled only on Windows and does not seem
> to depend on CONFIG_PIC, so textrels are used all over assembly code.
> Do I miss something?
>

Win32 code is always PIC, independent of what ffmpeg configure thinks
it should be. :)
Using the PIC define seems to be the appropriate thing, thats what the
other asm code does.

- Hendrik
Ivan Kalvachev June 11, 2017, 10:49 a.m.
On 6/11/17, Hendrik Leppkes <h.leppkes@gmail.com> wrote:
> On Sun, Jun 11, 2017 at 11:34 AM, Ivan Kalvachev <ikalvachev@gmail.com>
> wrote:
>> On 6/10/17, James Darnley <james.darnley@gmail.com> wrote:
>>> On 2017-06-09 13:41, Ivan Kalvachev wrote:
>>>> On 6/9/17, Michael Niedermayer <michael@niedermayer.cc> wrote:
>>>>> seems this breaks build with mingw64, didnt investigate but it
>>>>> fails with these errors:
>>>>>
>>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d):
>>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd):
>>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1):
>>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48):
>>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d):
>>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd):
>>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1):
>>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>>> libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48):
>>>>> relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
>>>>> collect2: error: ld returned 1 exit status
>>>>> collect2: error: ld returned 1 exit status
>>>>> make: *** [ffmpeg_g.exe] Error 1
>>>>> make: *** Waiting for unfinished jobs....
>>>>> make: *** [ffprobe_g.exe] Error 1
>>>>
>>>>
>>>> const_*_edge is used on only one place is the code.
>>>> Would you check if this patch fixes the issue.
>>>>
>>>> I expected that the addresses would be pre-calculated
>>>> by n/yasm as one value and indexed
>>>> relative to the section start.
>>>> Instead it seems that each entry is represented with
>>>> its own address and offset from it.
>>>> Since the offset is negative it uses all 64 bits and
>>>> it makes difference if it is truncated to 32 bits.
>>>>
>>>> Same issue could happen with clang tools.
>>>
>>> The problem is with the relative addressing.  You need to load the real
>>> address first before you can offset with another register at runtime. So
>>> something like:
>>>
>>>> mov  reg1,  [read_only_const]
>> lea ?
>>>> mova mmreg, [reg1 + reg2]
>>
>> .
>> OK, Getting mingw for my distro is problem
>> and compiling one myself would take a bit more effort/time.
>> So I'm posting a patch that "should" work.
>> ======
>> --- a/libavcodec/x86/opus_pvq_search.asm
>> +++ b/libavcodec/x86/opus_pvq_search.asm
>> @@ -406,7 +406,7 @@ align 16
>>  ; uint32 N      - Number of vector elements. Must be 0 < N < 8192
>>  ;
>>  %macro PVQ_FAST_SEARCH 0
>> -cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
>> +cglobal pvq_search,4,6,8, mmsize, inX, outY, K, N
>>  %define tmpX rsp
>>
>>  ;        movsxdifnidn Nq,  Nd
>> @@ -419,7 +419,12 @@ cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
>>          add         Nq,   r4q           ; Nq = align(Nq, mmsize)
>>          sub         rsp,  Nq            ; allocate tmpX[Nq]
>>
>> +%ifdef PIC
>> +        lea         r5q,  [const_align_abs_edge]            ; rip+const
>> +        movups      m3,   [r5q+r4q-mmsize]                  ; this is
>> the bit mask for the padded read at the end of the input
>> +%else
>>          movups      m3,   [const_align_abs_edge-mmsize+r4q] ; this is
>> the bit mask for the padded read at the end of the input
>> +%endif
>>
>>          lea         r4q,  [Nq-mmsize]   ; Nq is rounded up (aligned
>> up) to mmsize, so r4q can't become negative here, unless N=0.
>>          movups      m2,   [inXq + r4q]
>> ======
>>
>> What I find surprising is that PIC is enabled only on Windows and does not
>> seem
>> to depend on CONFIG_PIC, so textrels are used all over assembly code.
>> Do I miss something?
>>
>
> Win32 code is always PIC, independent of what ffmpeg configure thinks
> it should be. :)
> Using the PIC define seems to be the appropriate thing, thats what the
> other asm code does.
>

Yes, but my query is more about
why CONFIG_PIC does not enable asm PIC too?
After all gcc -fpic is supposed to generate code without texrel, isn't it?
James Darnley June 11, 2017, 11:31 p.m.
On 2017-06-11 11:34, Ivan Kalvachev wrote:
> On 6/10/17, James Darnley <james.darnley@gmail.com> wrote:
>> On 2017-06-09 13:41, Ivan Kalvachev wrote:
>>>
>>> const_*_edge is used on only one place is the code.
>>> Would you check if this patch fixes the issue.
>>>
>>> I expected that the addresses would be pre-calculated
>>> by n/yasm as one value and indexed
>>> relative to the section start.
>>> Instead it seems that each entry is represented with
>>> its own address and offset from it.
>>> Since the offset is negative it uses all 64 bits and
>>> it makes difference if it is truncated to 32 bits.
>>>
>>> Same issue could happen with clang tools.
>>
>> The problem is with the relative addressing.  You need to load the real
>> address first before you can offset with another register at runtime. So
>> something like:
>>
>>> mov  reg1,  [read_only_const]
> lea ?

Yes, sorry about that.

> ======
> --- a/libavcodec/x86/opus_pvq_search.asm
> +++ b/libavcodec/x86/opus_pvq_search.asm
> @@ -406,7 +406,7 @@ align 16
>  ; uint32 N      - Number of vector elements. Must be 0 < N < 8192
>  ;
>  %macro PVQ_FAST_SEARCH 0
> -cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
> +cglobal pvq_search,4,6,8, mmsize, inX, outY, K, N
>  %define tmpX rsp
> 
>  ;        movsxdifnidn Nq,  Nd
> @@ -419,7 +419,12 @@ cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
>          add         Nq,   r4q           ; Nq = align(Nq, mmsize)
>          sub         rsp,  Nq            ; allocate tmpX[Nq]
> 
> +%ifdef PIC
> +        lea         r5q,  [const_align_abs_edge]            ; rip+const
> +        movups      m3,   [r5q+r4q-mmsize]                  ; this is the bit mask for the padded read at the end of the input
> +%else
>          movups      m3,   [const_align_abs_edge-mmsize+r4q] ; this is the bit mask for the padded read at the end of the input
> +%endif
> 
>          lea         r4q,  [Nq-mmsize]   ; Nq is rounded up (aligned up) to mmsize, so r4q can't become negative here, unless N=0.
>          movups      m2,   [inXq + r4q]
> ======

FYI that diff works just fine, when I correct the line wrapping.  I can
compile and run on Win64.  This is not a comment about the original patch.

Patch hide | download patch | download mbox

======
--- a/libavcodec/x86/opus_pvq_search.asm
+++ b/libavcodec/x86/opus_pvq_search.asm
@@ -406,7 +406,7 @@  align 16
 ; uint32 N      - Number of vector elements. Must be 0 < N < 8192
 ;
 %macro PVQ_FAST_SEARCH 0
-cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
+cglobal pvq_search,4,6,8, mmsize, inX, outY, K, N
 %define tmpX rsp

 ;        movsxdifnidn Nq,  Nd
@@ -419,7 +419,12 @@  cglobal pvq_search,4,5,8, mmsize, inX, outY, K, N
         add         Nq,   r4q           ; Nq = align(Nq, mmsize)
         sub         rsp,  Nq            ; allocate tmpX[Nq]

+%ifdef PIC
+        lea         r5q,  [const_align_abs_edge]            ; rip+const
+        movups      m3,   [r5q+r4q-mmsize]                  ; this is
the bit mask for the padded read at the end of the input
+%else
         movups      m3,   [const_align_abs_edge-mmsize+r4q] ; this is
the bit mask for the padded read at the end of the input
+%endif