[FFmpeg-devel,09/11] avcodec/x86: allow future 8-bit simple idct to have "DC only hack"

Submitted by James Darnley on June 19, 2017, 3:11 p.m.

Details

Message ID 20170619151104.31273-10-jdarnley@obe.tv
State New
Headers show

Commit Message

James Darnley June 19, 2017, 3:11 p.m.
Created by Ronald S. Bultje
---
 libavcodec/x86/simple_idct10_template.asm | 38 +++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

Comments

Ronald S. Bultje June 20, 2017, 11:56 a.m.
Hi,

On Mon, Jun 19, 2017 at 11:11 AM, James Darnley <jdarnley@obe.tv> wrote:

> Created by Ronald S. Bultje
> ---
>  libavcodec/x86/simple_idct10_template.asm | 38
> +++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
>
> diff --git a/libavcodec/x86/simple_idct10_template.asm
> b/libavcodec/x86/simple_idct10_template.asm
> index d8ea0bcc6b..51baf84c82 100644
> --- a/libavcodec/x86/simple_idct10_template.asm
> +++ b/libavcodec/x86/simple_idct10_template.asm
> @@ -257,6 +257,44 @@
>      pmullw      m12,[%8+96]
>
>      IDCT_1D     %1, %2, %8
> +%elif %2 == 11
> +    por     m1, m8, m13
> +    por     m1, m12
> +    por     m1, [blockq+ 16]       ; { row[1] }[0-7]
> +    por     m1, [blockq+ 48]       ; { row[3] }[0-7]
> +    por     m1, [blockq+ 80]       ; { row[5] }[0-7]
> +    por     m1, [blockq+112]       ; { row[7] }[0-7]
> +    pxor    m2, m2
> +    pcmpeqw m1, m2
> +    psllw   m2, m10, 3
> +    pand    m2, m1
> +    pcmpeqb m3, m3
> +    pxor    m1, m3
> +    mova    [rsp], m1
> +    mova    [rsp+16], m2
> +
> +    IDCT_1D     %1, %2
> +
> +    mova m5, [rsp]
> +    mova m6, [rsp+16]
> +    pand m8, m5
> +    por  m8, m6
> +    pand m0, m5
> +    por  m0, m6
> +    pand m1, m5
> +    por  m1, m6
> +    pand m2, m5
> +    por  m2, m6
> +    pand m4, m5
> +    por  m4, m6
> +    pand m11, m5
> +    por  m11, m6
> +    pand m9, m5
> +    por  m9, m6
> +    pand m10, m5
> +    por  m10, m6
> +    pand m3, m5
> +    por  m3, m6
>  %else


Can you fix the indentation while you push it? LGTM.

Ronald
James Darnley June 20, 2017, 2:26 p.m.
On 2017-06-20 13:56, Ronald S. Bultje wrote:
> Hi,
> 
> On Mon, Jun 19, 2017 at 11:11 AM, James Darnley <jdarnley@obe.tv> wrote:
> 
>> Created by Ronald S. Bultje
>> ---
>>  libavcodec/x86/simple_idct10_template.asm | 38
>> +++++++++++++++++++++++++++++++
>>  1 file changed, 38 insertions(+)
>>
>> diff --git a/libavcodec/x86/simple_idct10_template.asm
>> b/libavcodec/x86/simple_idct10_template.asm
>> index d8ea0bcc6b..51baf84c82 100644
>> --- a/libavcodec/x86/simple_idct10_template.asm
>> +++ b/libavcodec/x86/simple_idct10_template.asm
>> @@ -257,6 +257,44 @@
>>      pmullw      m12,[%8+96]
>>
>>      IDCT_1D     %1, %2, %8
>> +%elif %2 == 11
>> +    por     m1, m8, m13
>> +    por     m1, m12
>> +    por     m1, [blockq+ 16]       ; { row[1] }[0-7]
>> +    por     m1, [blockq+ 48]       ; { row[3] }[0-7]
>> +    por     m1, [blockq+ 80]       ; { row[5] }[0-7]
>> +    por     m1, [blockq+112]       ; { row[7] }[0-7]
>> +    pxor    m2, m2
>> +    pcmpeqw m1, m2
>> +    psllw   m2, m10, 3
>> +    pand    m2, m1
>> +    pcmpeqb m3, m3
>> +    pxor    m1, m3
>> +    mova    [rsp], m1
>> +    mova    [rsp+16], m2
>> +
>> +    IDCT_1D     %1, %2
>> +
>> +    mova m5, [rsp]
>> +    mova m6, [rsp+16]
>> +    pand m8, m5
>> +    por  m8, m6
>> +    pand m0, m5
>> +    por  m0, m6
>> +    pand m1, m5
>> +    por  m1, m6
>> +    pand m2, m5
>> +    por  m2, m6
>> +    pand m4, m5
>> +    por  m4, m6
>> +    pand m11, m5
>> +    por  m11, m6
>> +    pand m9, m5
>> +    por  m9, m6
>> +    pand m10, m5
>> +    por  m10, m6
>> +    pand m3, m5
>> +    por  m3, m6
>>  %else
> 
> 
> Can you fix the indentation while you push it? LGTM.

Done.  I also removed the redundant use of m3, again.  I swear that's
the 3rd time I've done that change.

Will push shortly.
Michael Niedermayer June 24, 2017, 5:21 p.m.
On Tue, Jun 20, 2017 at 07:56:40AM -0400, Ronald S. Bultje wrote:
> Hi,
> 
> On Mon, Jun 19, 2017 at 11:11 AM, James Darnley <jdarnley@obe.tv> wrote:
> 
> > Created by Ronald S. Bultje
> > ---
> >  libavcodec/x86/simple_idct10_template.asm | 38
> > +++++++++++++++++++++++++++++++
> >  1 file changed, 38 insertions(+)
> >
> > diff --git a/libavcodec/x86/simple_idct10_template.asm
> > b/libavcodec/x86/simple_idct10_template.asm
> > index d8ea0bcc6b..51baf84c82 100644
> > --- a/libavcodec/x86/simple_idct10_template.asm
> > +++ b/libavcodec/x86/simple_idct10_template.asm
> > @@ -257,6 +257,44 @@
> >      pmullw      m12,[%8+96]
> >
> >      IDCT_1D     %1, %2, %8
> > +%elif %2 == 11
> > +    por     m1, m8, m13
> > +    por     m1, m12
> > +    por     m1, [blockq+ 16]       ; { row[1] }[0-7]
> > +    por     m1, [blockq+ 48]       ; { row[3] }[0-7]
> > +    por     m1, [blockq+ 80]       ; { row[5] }[0-7]
> > +    por     m1, [blockq+112]       ; { row[7] }[0-7]
> > +    pxor    m2, m2
> > +    pcmpeqw m1, m2
> > +    psllw   m2, m10, 3
> > +    pand    m2, m1
> > +    pcmpeqb m3, m3
> > +    pxor    m1, m3
> > +    mova    [rsp], m1
> > +    mova    [rsp+16], m2
> > +
> > +    IDCT_1D     %1, %2
> > +
> > +    mova m5, [rsp]
> > +    mova m6, [rsp+16]
> > +    pand m8, m5
> > +    por  m8, m6
> > +    pand m0, m5
> > +    por  m0, m6
> > +    pand m1, m5
> > +    por  m1, m6
> > +    pand m2, m5
> > +    por  m2, m6
> > +    pand m4, m5
> > +    por  m4, m6
> > +    pand m11, m5
> > +    por  m11, m6
> > +    pand m9, m5
> > +    por  m9, m6
> > +    pand m10, m5
> > +    por  m10, m6
> > +    pand m3, m5
> > +    por  m3, m6
> >  %else
> 
> 
> Can you fix the indentation while you push it? LGTM.

as ive been asked abouz 9/10/11, no objections from me


[...]
Henrik Gramner June 24, 2017, 6:01 p.m.
On Mon, Jun 19, 2017 at 5:11 PM, James Darnley <jdarnley@obe.tv> wrote:
> +    por     m1, m8, m13
> +    por     m1, m12
> +    por     m1, [blockq+ 16]       ; { row[1] }[0-7]
> +    por     m1, [blockq+ 48]       ; { row[3] }[0-7]
> +    por     m1, [blockq+ 80]       ; { row[5] }[0-7]
> +    por     m1, [blockq+112]       ; { row[7] }[0-7]

Using a single register as destination here means that only one
instruction per cycle can be executed due to dependencies. Splitting
it across two destinations would double the (local) IPC.

OoOE might alleviate it, but no reason to unnecessarily rely on it.

Patch hide | download patch | download mbox

diff --git a/libavcodec/x86/simple_idct10_template.asm b/libavcodec/x86/simple_idct10_template.asm
index d8ea0bcc6b..51baf84c82 100644
--- a/libavcodec/x86/simple_idct10_template.asm
+++ b/libavcodec/x86/simple_idct10_template.asm
@@ -257,6 +257,44 @@ 
     pmullw      m12,[%8+96]
 
     IDCT_1D     %1, %2, %8
+%elif %2 == 11
+    por     m1, m8, m13
+    por     m1, m12
+    por     m1, [blockq+ 16]       ; { row[1] }[0-7]
+    por     m1, [blockq+ 48]       ; { row[3] }[0-7]
+    por     m1, [blockq+ 80]       ; { row[5] }[0-7]
+    por     m1, [blockq+112]       ; { row[7] }[0-7]
+    pxor    m2, m2
+    pcmpeqw m1, m2
+    psllw   m2, m10, 3
+    pand    m2, m1
+    pcmpeqb m3, m3
+    pxor    m1, m3
+    mova    [rsp], m1
+    mova    [rsp+16], m2
+
+    IDCT_1D     %1, %2
+
+    mova m5, [rsp]
+    mova m6, [rsp+16]
+    pand m8, m5
+    por  m8, m6
+    pand m0, m5
+    por  m0, m6
+    pand m1, m5
+    por  m1, m6
+    pand m2, m5
+    por  m2, m6
+    pand m4, m5
+    por  m4, m6
+    pand m11, m5
+    por  m11, m6
+    pand m9, m5
+    por  m9, m6
+    pand m10, m5
+    por  m10, m6
+    pand m3, m5
+    por  m3, m6
 %else
     IDCT_1D     %1, %2
 %endif