diff mbox

[FFmpeg-devel] x86/exrdsp: optimize ff_reorder_pixels_avx2()

Message ID 20170918015213.7236-1-jamrial@gmail.com
State Accepted
Commit 18821e3ba1baa8e0fe037e11c77459ebc73f7e37
Headers show

Commit Message

James Almer Sept. 18, 2017, 1:52 a.m. UTC
From: Henrik Gramner <henrik@gramner.com>

Tested with "checkasm --test=exrdsp -bench"

Before:
reorder_pixels_c: 5187.8
reorder_pixels_sse2: 377.0
reorder_pixels_avx2: 331.3

After:
reorder_pixels_c: 5181.5
reorder_pixels_sse2: 377.0
reorder_pixels_avx2: 313.8

Signed-off-by: James Almer <jamrial@gmail.com>
---
 libavcodec/x86/exrdsp.asm | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Comments

Martin Vignali Sept. 18, 2017, 7:35 a.m. UTC | #1
2017-09-18 3:52 GMT+02:00 James Almer <jamrial@gmail.com>:

> From: Henrik Gramner <henrik@gramner.com>
>
> Tested with "checkasm --test=exrdsp -bench"
>
> Before:
> reorder_pixels_c: 5187.8
> reorder_pixels_sse2: 377.0
> reorder_pixels_avx2: 331.3
>
> After:
> reorder_pixels_c: 5181.5
> reorder_pixels_sse2: 377.0
> reorder_pixels_avx2: 313.8
>
> I don't have the same result using a start/stop timer,
but your testing approach is probably better than mine.
And like, you both think it's a better way to do it, it's ok for me !

Thanks

Martin
James Almer Sept. 18, 2017, 2:27 p.m. UTC | #2
On 9/18/2017 4:35 AM, Martin Vignali wrote:
> 2017-09-18 3:52 GMT+02:00 James Almer <jamrial@gmail.com>:
> 
>> From: Henrik Gramner <henrik@gramner.com>
>>
>> Tested with "checkasm --test=exrdsp -bench"
>>
>> Before:
>> reorder_pixels_c: 5187.8
>> reorder_pixels_sse2: 377.0
>> reorder_pixels_avx2: 331.3
>>
>> After:
>> reorder_pixels_c: 5181.5
>> reorder_pixels_sse2: 377.0
>> reorder_pixels_avx2: 313.8
>>
>> I don't have the same result using a start/stop timer,
> but your testing approach is probably better than mine.

I also had a hard time getting to notice a difference with
star/stop_timer, and it's clear why seeing how little difference this
change truly makes.

You can build the above tool with "make checkasm", and the executable
will be in the tests/checkasm folder. The results tend to be less
variable and it's better detecting small differences between functions.

> And like, you both think it's a better way to do it, it's ok for me !
> 
> Thanks
> 
> Martin
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
Martin Vignali Sept. 20, 2017, 7:01 p.m. UTC | #3
>
> I also had a hard time getting to notice a difference with
> star/stop_timer, and it's clear why seeing how little difference this
> change truly makes.
>
> You can build the above tool with "make checkasm", and the executable
> will be in the tests/checkasm folder. The results tend to be less
> variable and it's better detecting small differences between functions.
>
> <http://ffmpeg.org/mailman/listinfo/ffmpeg-devel>
>
Thanks for the explanation, and the checkasm patch for exrdsp (i missed it
before)
I have similar result than you using checkasm,

Martin
diff mbox

Patch

diff --git a/libavcodec/x86/exrdsp.asm b/libavcodec/x86/exrdsp.asm
index b91a7be20d..06c629e59e 100644
--- a/libavcodec/x86/exrdsp.asm
+++ b/libavcodec/x86/exrdsp.asm
@@ -39,16 +39,15 @@  cglobal reorder_pixels, 3,4,3, dst, src1, size, src2
     neg                              sizeq                ; size = offset for dst, src1, src2
 .loop:
 
-%if cpuflag(avx2)
-    vpermq                              m0, [src1q + sizeq], 0xd8; load first part
-    vpermq                              m1, [src2q + sizeq], 0xd8; load second part
-%else
     mova                                m0, [src1q+sizeq]        ; load first part
     movu                                m1, [src2q+sizeq]        ; load second part
-%endif
     SBUTTERFLY bw, 0, 1, 2                                       ; interleaved
-    mova                 [dstq+2*sizeq   ], m0                   ; copy to dst
-    mova             [dstq+2*sizeq+mmsize], m1
+    mova                 [dstq+2*sizeq   ], xm0                  ; copy to dst
+    mova                 [dstq+2*sizeq+16], xm1
+%if cpuflag(avx2)
+    vperm2i128                          m0, m0, m1, q0301
+    mova                 [dstq+2*sizeq+32], m0
+%endif
     add     sizeq, mmsize
     jl .loop
     RET