[FFmpeg-devel] avcodec/x86: add an 8-bit simple IDCT function based on the x86-64 high depth functions

Submitted by James Darnley on June 15, 2017, 3:08 p.m.

Details

Message ID 20170615150833.894-1-jdarnley@obe.tv
State Superseded
Headers show

Commit Message

James Darnley June 15, 2017, 3:08 p.m.
Includes add/put functions

Rounding contributed by Ronald S. Bultje
---
I must be stupid.  I dropped the stack space change somewhere.

 libavcodec/tests/x86/dct.c       |  2 +
 libavcodec/x86/idctdsp_init.c    | 23 ++++++++++
 libavcodec/x86/simple_idct.h     |  9 ++++
 libavcodec/x86/simple_idct10.asm | 94 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+)

Comments

Michael Niedermayer June 16, 2017, 1:58 a.m.
On Thu, Jun 15, 2017 at 05:08:33PM +0200, James Darnley wrote:
> Includes add/put functions
> 
> Rounding contributed by Ronald S. Bultje
> ---
> I must be stupid.  I dropped the stack space change somewhere.
> 
>  libavcodec/tests/x86/dct.c       |  2 +
>  libavcodec/x86/idctdsp_init.c    | 23 ++++++++++
>  libavcodec/x86/simple_idct.h     |  9 ++++
>  libavcodec/x86/simple_idct10.asm | 94 ++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 128 insertions(+)

theres something wrong with this
it totally breaks this:
make -j12 ffmpeg && ./ffmpeg -ss 1 -i cache:matrixbench_mpeg2.mpg -t 2 -y test.avi
./ffplay test.avi

(totally as in bitstream errors not some idct artifacts)

[...]
James Darnley June 16, 2017, 10:09 a.m.
On 2017-06-16 03:58, Michael Niedermayer wrote:
> On Thu, Jun 15, 2017 at 05:08:33PM +0200, James Darnley wrote:
>> Includes add/put functions
>>
>> Rounding contributed by Ronald S. Bultje
>> ---
>> I must be stupid.  I dropped the stack space change somewhere.
>>
>>  libavcodec/tests/x86/dct.c       |  2 +
>>  libavcodec/x86/idctdsp_init.c    | 23 ++++++++++
>>  libavcodec/x86/simple_idct.h     |  9 ++++
>>  libavcodec/x86/simple_idct10.asm | 94 ++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 128 insertions(+)
> 
> theres something wrong with this
> it totally breaks this:
> make -j12 ffmpeg && ./ffmpeg -ss 1 -i cache:matrixbench_mpeg2.mpg -t 2 -y test.avi
> ./ffplay test.avi
> 
> (totally as in bitstream errors not some idct artifacts)

How did you manage to break it that much?  I don't touch avformat.  I
don't touch the rest of the decoder.

I don't have that file to test with and the cut down version which I
think is in fate is only 0.96 seconds long.  However trying your command
on that file successfully decodes 1 frame from it.  I made decoding it
with the new functions the fate test I added.  All the other fate
samples in mpeg2/ decode seemingly fine with no extra messages.
Paul B Mahol June 16, 2017, 10:48 a.m.
On 6/16/17, James Darnley <jdarnley@obe.tv> wrote:
> On 2017-06-16 03:58, Michael Niedermayer wrote:
>> On Thu, Jun 15, 2017 at 05:08:33PM +0200, James Darnley wrote:
>>> Includes add/put functions
>>>
>>> Rounding contributed by Ronald S. Bultje
>>> ---
>>> I must be stupid.  I dropped the stack space change somewhere.
>>>
>>>  libavcodec/tests/x86/dct.c       |  2 +
>>>  libavcodec/x86/idctdsp_init.c    | 23 ++++++++++
>>>  libavcodec/x86/simple_idct.h     |  9 ++++
>>>  libavcodec/x86/simple_idct10.asm | 94
>>> ++++++++++++++++++++++++++++++++++++++++
>>>  4 files changed, 128 insertions(+)
>>
>> theres something wrong with this
>> it totally breaks this:
>> make -j12 ffmpeg && ./ffmpeg -ss 1 -i cache:matrixbench_mpeg2.mpg -t 2 -y
>> test.avi
>> ./ffplay test.avi
>>
>> (totally as in bitstream errors not some idct artifacts)
>
> How did you manage to break it that much?  I don't touch avformat.  I
> don't touch the rest of the decoder.
>
> I don't have that file to test with and the cut down version which I
> think is in fate is only 0.96 seconds long.  However trying your command
> on that file successfully decodes 1 frame from it.  I made decoding it
> with the new functions the fate test I added.  All the other fate
> samples in mpeg2/ decode seemingly fine with no extra messages.

File is in samples.ffmpeg.org/benchmark/testsuite1/matrixbench_mpeg2.mpg

Michael failed to write that.
James Darnley June 16, 2017, 11:30 a.m.
On 2017-06-16 12:48, Paul B Mahol wrote:
> On 6/16/17, James Darnley <jdarnley@obe.tv> wrote:
>> On 2017-06-16 03:58, Michael Niedermayer wrote:
>>> theres something wrong with this
>>> it totally breaks this:
>>> make -j12 ffmpeg && ./ffmpeg -ss 1 -i cache:matrixbench_mpeg2.mpg -t 2 -y
>>> test.avi
>>> ./ffplay test.avi
>>>
>>> (totally as in bitstream errors not some idct artifacts)
>>
>> How did you manage to break it that much?  I don't touch avformat.  I
>> don't touch the rest of the decoder.
>>
>> I don't have that file to test with and the cut down version which I
>> think is in fate is only 0.96 seconds long.  However trying your command
>> on that file successfully decodes 1 frame from it.  I made decoding it
>> with the new functions the fate test I added.  All the other fate
>> samples in mpeg2/ decode seemingly fine with no extra messages.
> 
> File is in samples.ffmpeg.org/benchmark/testsuite1/matrixbench_mpeg2.mpg
> 
> Michael failed to write that.

TYVM.  Even with that file I get just 3 messages when I seek 1 second.

> [mpeg @ 0x399e3c0] start time for stream 0 is not set in estimate_timings_from_pts
> [mp2 @ 0x39a07a0] Header missing
> Error while decoding stream #0:2: Invalid data found when processing input

Furthermore I get those messages on master and they remain the same with
my commits.  (Stream 0:2 is the mp2 audio in this file.)
Michael Niedermayer June 16, 2017, 12:24 p.m.
On Fri, Jun 16, 2017 at 12:09:27PM +0200, James Darnley wrote:
> On 2017-06-16 03:58, Michael Niedermayer wrote:
> > On Thu, Jun 15, 2017 at 05:08:33PM +0200, James Darnley wrote:
> >> Includes add/put functions
> >>
> >> Rounding contributed by Ronald S. Bultje
> >> ---
> >> I must be stupid.  I dropped the stack space change somewhere.
> >>
> >>  libavcodec/tests/x86/dct.c       |  2 +
> >>  libavcodec/x86/idctdsp_init.c    | 23 ++++++++++
> >>  libavcodec/x86/simple_idct.h     |  9 ++++
> >>  libavcodec/x86/simple_idct10.asm | 94 ++++++++++++++++++++++++++++++++++++++++
> >>  4 files changed, 128 insertions(+)
> > 
> > theres something wrong with this
> > it totally breaks this:
> > make -j12 ffmpeg && ./ffmpeg -ss 1 -i cache:matrixbench_mpeg2.mpg -t 2 -y test.avi
> > ./ffplay test.avi
> > 
> > (totally as in bitstream errors not some idct artifacts)
> 
> How did you manage to break it that much?  I don't touch avformat.  I
> don't touch the rest of the decoder.
> 
> I don't have that file to test with and the cut down version which I
> think is in fate is only 0.96 seconds long.  However trying your command
> on that file successfully decodes 1 frame from it.  I made decoding it
> with the new functions the fate test I added.  All the other fate
> samples in mpeg2/ decode seemingly fine with no extra messages.

i retested and it does fail with this patch.

looking at the code changing the perm_type alone is enough to trigger
it.
see dct_quantize() in libavcodec/x86/mpegvideoenc_template.c
it seems this lacks support for TRANSPOSE

[...]

Patch hide | download patch | download mbox

diff --git a/libavcodec/tests/x86/dct.c b/libavcodec/tests/x86/dct.c
index 34f5b8767b..317d973f9f 100644
--- a/libavcodec/tests/x86/dct.c
+++ b/libavcodec/tests/x86/dct.c
@@ -88,10 +88,12 @@  static const struct algo idct_tab_arch[] = {
 #if HAVE_YASM
 #if ARCH_X86_64
 #if HAVE_SSE2_EXTERNAL
+    { "SIMPLE8-SSE2",   ff_simple_idct8_sse2,  FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_SSE2},
     { "SIMPLE10-SSE2",  ff_simple_idct10_sse2, FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_SSE2},
     { "SIMPLE12-SSE2",  ff_simple_idct12_sse2, FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_SSE2, 1 },
 #endif
 #if HAVE_AVX_EXTERNAL
+    { "SIMPLE8-AVX",    ff_simple_idct8_avx,   FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_AVX},
     { "SIMPLE10-AVX",   ff_simple_idct10_avx,  FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_AVX},
     { "SIMPLE12-AVX",   ff_simple_idct12_avx,  FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_AVX,  1 },
 #endif
diff --git a/libavcodec/x86/idctdsp_init.c b/libavcodec/x86/idctdsp_init.c
index f1c915aa00..9da60d1a1e 100644
--- a/libavcodec/x86/idctdsp_init.c
+++ b/libavcodec/x86/idctdsp_init.c
@@ -94,9 +94,32 @@  av_cold void ff_idctdsp_init_x86(IDCTDSPContext *c, AVCodecContext *avctx,
                 c->idct_add  = ff_simple_idct_add_sse2;
                 c->perm_type = FF_IDCT_PERM_SIMPLE;
         }
+
+        if (ARCH_X86_64 &&
+            !high_bit_depth &&
+            avctx->lowres == 0 &&
+            (avctx->idct_algo == FF_IDCT_AUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEMMX)) {
+                c->idct      = ff_simple_idct8_sse2;
+                c->idct_put  = ff_simple_idct8_put_sse2;
+                c->idct_add  = ff_simple_idct8_add_sse2;
+                c->perm_type = FF_IDCT_PERM_TRANSPOSE;
+        }
     }
 
     if (ARCH_X86_64 && avctx->lowres == 0) {
+        if (EXTERNAL_AVX(cpu_flags) &&
+            !high_bit_depth &&
+            (avctx->idct_algo == FF_IDCT_AUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEMMX)) {
+                c->idct      = ff_simple_idct8_avx;
+                c->idct_put  = ff_simple_idct8_put_avx;
+                c->idct_add  = ff_simple_idct8_add_avx;
+                c->perm_type = FF_IDCT_PERM_TRANSPOSE;
+        }
+
         if (avctx->bits_per_raw_sample == 10 &&
             (avctx->idct_algo == FF_IDCT_AUTO ||
              avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
diff --git a/libavcodec/x86/simple_idct.h b/libavcodec/x86/simple_idct.h
index d17ef6a462..9b64cfe9bc 100644
--- a/libavcodec/x86/simple_idct.h
+++ b/libavcodec/x86/simple_idct.h
@@ -29,6 +29,15 @@  void ff_simple_idct_put_mmx(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
 void ff_simple_idct_add_sse2(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
 void ff_simple_idct_put_sse2(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
 
+void ff_simple_idct8_sse2(int16_t *block);
+void ff_simple_idct8_avx(int16_t *block);
+
+void ff_simple_idct8_put_sse2(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
+void ff_simple_idct8_put_avx(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
+
+void ff_simple_idct8_add_sse2(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
+void ff_simple_idct8_add_avx(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
+
 void ff_simple_idct10_sse2(int16_t *block);
 void ff_simple_idct10_avx(int16_t *block);
 
diff --git a/libavcodec/x86/simple_idct10.asm b/libavcodec/x86/simple_idct10.asm
index b492303a57..10c5cd4340 100644
--- a/libavcodec/x86/simple_idct10.asm
+++ b/libavcodec/x86/simple_idct10.asm
@@ -31,11 +31,14 @@  SECTION_RODATA
 
 cextern pw_2
 cextern pw_16
+cextern pw_32
 cextern pw_1023
 cextern pw_4095
+pd_round_11: times 4 dd 1<<(11-1)
 pd_round_12: times 4 dd 1<<(12-1)
 pd_round_15: times 4 dd 1<<(15-1)
 pd_round_19: times 4 dd 1<<(19-1)
+pd_round_20: times 4 dd 1<<(20-1)
 
 %macro CONST_DEC  3
 const %1
@@ -77,8 +80,99 @@  CONST_DEC  w3_min_w7_lo,    W3sh2_lo, -W7sh2
 
 SECTION .text
 
+%macro STORE_HI_LO 12
+    movq   %1, %9
+    movq   %3, %10
+    movq   %5, %11
+    movq   %7, %12
+    movhps %2, %9
+    movhps %4, %10
+    movhps %6, %11
+    movhps %8, %12
+%endmacro
+
+%macro LOAD_ZXBW_8 16
+    pmovzxbw %1, %9
+    pmovzxbw %2, %10
+    pmovzxbw %3, %11
+    pmovzxbw %4, %12
+    pmovzxbw %5, %13
+    pmovzxbw %6, %14
+    pmovzxbw %7, %15
+    pmovzxbw %8, %16
+%endmacro
+
+%macro LOAD_ZXBW_4 9
+    movh %1, %5
+    movh %2, %6
+    movh %3, %7
+    movh %4, %8
+    punpcklbw %1, %9
+    punpcklbw %2, %9
+    punpcklbw %3, %9
+    punpcklbw %4, %9
+%endmacro
+
+%define PASS4ROWS(base, stride, stride3) \
+    [base], [base + stride], [base + 2*stride], [base + stride3]
+
 %macro idct_fn 0
 
+define_constants _lo
+
+cglobal simple_idct8, 1, 1, 16, 32, block
+    IDCT_FN    "", 11, pw_32, 20, "store"
+RET
+
+; TODO: optimise by not writing the final data to the block.
+cglobal simple_idct8_put, 3, 4, 16, 32, pixels, lsize, block
+    IDCT_FN    "", 11, pw_32, 20
+    lea       r3, [3*lsizeq]
+    lea       r2, [pixelsq + r3]
+    packuswb  m8, m0
+    packuswb  m1, m2
+    packuswb  m4, m11
+    packuswb  m9, m10
+    STORE_HI_LO PASS8ROWS(pixelsq, r2, lsizeq, r3), m8, m1, m4, m9
+RET
+
+; TODO: optimise by not writing the final data to the block.
+cglobal simple_idct8_add, 3, 4, 16, 32, pixels, lsize, block
+    IDCT_FN    "", 11, pw_32, 20
+    lea r2, [3*lsizeq]
+    %if cpuflag(sse4)
+        lea r3, [pixelsq + r2]
+        LOAD_ZXBW_8 m3, m5, m6, m7, m12, m13, m14, m15, PASS8ROWS(pixelsq, r3, lsizeq, r2)
+        paddsw m8, m3
+        paddsw m0, m5
+        paddsw m1, m6
+        paddsw m2, m7
+        paddsw m4, m12
+        paddsw m11, m13
+        paddsw m9, m14
+        paddsw m10, m15
+    %else
+        pxor m12, m12
+        LOAD_ZXBW_4 m3, m5, m6, m7, PASS4ROWS(pixelsq, lsizeq, r2), m12
+        paddsw m8, m3
+        paddsw m0, m5
+        paddsw m1, m6
+        paddsw m2, m7
+        lea r3, [pixelsq + 4*lsizeq]
+        LOAD_ZXBW_4 m3, m5, m6, m7, PASS4ROWS(r3, lsizeq, r2), m12
+        paddsw m4, m3
+        paddsw m11, m5
+        paddsw m9, m6
+        paddsw m10, m7
+        lea r3, [pixelsq + r2]
+    %endif
+    packuswb  m8, m0
+    packuswb  m1, m2
+    packuswb  m4, m11
+    packuswb  m9, m10
+    STORE_HI_LO PASS8ROWS(pixelsq, r3, lsizeq, r2), m8, m1, m4, m9
+RET
+
 define_constants _hi
 
 cglobal simple_idct10, 1, 1, 16, block