diff mbox

[FFmpeg-devel,v2] swscale/output: Altivec-optimize yuv2plane1_8

Message ID 20181117101214.19917fdecfc9fc4d26854a9c@gmx.com
State Accepted
Commit 46c5693ea3a9364e24e2f5336bcdb5b191a2329f
Headers show

Commit Message

Lauri Kasanen Nov. 17, 2018, 8:12 a.m. UTC
./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
-f null -vframes 100 -v error -nostats -

1158 UNITS in planar1,   65528 runs,      8 skips

-cpuflags 0

19082 UNITS in planar1,   65533 runs,      3 skips

16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
takes as many cycles as the x86 SSE2 version, yikes it's fast.

Note that this function uses VSX instructions, but is not marked so.
This is because several existing functions also make that mistake.
I'll submit a patch moving them once this is reviewed.

v2: Remove !BE check
Signed-off-by: Lauri Kasanen <cand@gmx.com>
---
 libswscale/ppc/swscale_altivec.c | 53 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

Comments

Michael Niedermayer Nov. 18, 2018, 12:29 a.m. UTC | #1
On Sat, Nov 17, 2018 at 10:12:14AM +0200, Lauri Kasanen wrote:
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
> -f null -vframes 100 -v error -nostats -
> 
> 1158 UNITS in planar1,   65528 runs,      8 skips
> 
> -cpuflags 0
> 
> 19082 UNITS in planar1,   65533 runs,      3 skips
> 
> 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
> takes as many cycles as the x86 SSE2 version, yikes it's fast.
> 
> Note that this function uses VSX instructions, but is not marked so.
> This is because several existing functions also make that mistake.
> I'll submit a patch moving them once this is reviewed.
> 
> v2: Remove !BE check
> Signed-off-by: Lauri Kasanen <cand@gmx.com>
> ---
>  libswscale/ppc/swscale_altivec.c | 53 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)

iam no altivec guy, just wanted to reply primarly so people dont wait for
a review from me as i normally review swscale patches.
if someone who knows altivec ok-es this then its ok with me too.
assuming this was tested (fate and some real world test)

thx

[...]
Lauri Kasanen Nov. 21, 2018, 8:12 a.m. UTC | #2
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
> -f null -vframes 100 -v error -nostats -
> 
> 1158 UNITS in planar1,   65528 runs,      8 skips
> 
> -cpuflags 0
> 
> 19082 UNITS in planar1,   65533 runs,      3 skips
> 
> 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
> takes as many cycles as the x86 SSE2 version, yikes it's fast.
> 
> Note that this function uses VSX instructions, but is not marked so.
> This is because several existing functions also make that mistake.
> I'll submit a patch moving them once this is reviewed.
> 
> v2: Remove !BE check
> Signed-off-by: Lauri Kasanen <cand@gmx.com>

Ping. Seems not many ffmpeg devs interested in ppc.

- Lauri
Michael Niedermayer Nov. 21, 2018, 12:21 p.m. UTC | #3
On Wed, Nov 21, 2018 at 10:12:48AM +0200, Lauri Kasanen wrote:
> > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
> > -f null -vframes 100 -v error -nostats -
> > 
> > 1158 UNITS in planar1,   65528 runs,      8 skips
> > 
> > -cpuflags 0
> > 
> > 19082 UNITS in planar1,   65533 runs,      3 skips
> > 
> > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
> > takes as many cycles as the x86 SSE2 version, yikes it's fast.
> > 
> > Note that this function uses VSX instructions, but is not marked so.
> > This is because several existing functions also make that mistake.
> > I'll submit a patch moving them once this is reviewed.
> > 
> > v2: Remove !BE check
> > Signed-off-by: Lauri Kasanen <cand@gmx.com>
> 
> Ping. Seems not many ffmpeg devs interested in ppc.

have you tried "make fate" with this patch (note you need to configure with
fate samples" so all tests are run

thx

[...]
Lauri Kasanen Nov. 21, 2018, 12:35 p.m. UTC | #4
On Wed, 21 Nov 2018 13:21:58 +0100
Michael Niedermayer <michael@niedermayer.cc> wrote:

> On Wed, Nov 21, 2018 at 10:12:48AM +0200, Lauri Kasanen wrote:
> > > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
> > > -f null -vframes 100 -v error -nostats -
> > > 
> > > 1158 UNITS in planar1,   65528 runs,      8 skips
> > > 
> > > -cpuflags 0
> > > 
> > > 19082 UNITS in planar1,   65533 runs,      3 skips
> > > 
> > > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
> > > takes as many cycles as the x86 SSE2 version, yikes it's fast.
> > > 
> > > Note that this function uses VSX instructions, but is not marked so.
> > > This is because several existing functions also make that mistake.
> > > I'll submit a patch moving them once this is reviewed.
> > > 
> > > v2: Remove !BE check
> > > Signed-off-by: Lauri Kasanen <cand@gmx.com>
> > 
> > Ping. Seems not many ffmpeg devs interested in ppc.
> 
> have you tried "make fate" with this patch (note you need to configure with
> fate samples" so all tests are run

I ran those fate tests containing "scale" in the name, I gather the
full suite takes > 20min. Otherwise I tested with a PNG to video
conversion on LE, and Carl Eugen Hoyos tested with Lena on BE.

- Lauri
Michael Niedermayer Nov. 21, 2018, 4:22 p.m. UTC | #5
On Wed, Nov 21, 2018 at 02:35:32PM +0200, Lauri Kasanen wrote:
> On Wed, 21 Nov 2018 13:21:58 +0100
> Michael Niedermayer <michael@niedermayer.cc> wrote:
> 
> > On Wed, Nov 21, 2018 at 10:12:48AM +0200, Lauri Kasanen wrote:
> > > > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
> > > > -f null -vframes 100 -v error -nostats -
> > > > 
> > > > 1158 UNITS in planar1,   65528 runs,      8 skips
> > > > 
> > > > -cpuflags 0
> > > > 
> > > > 19082 UNITS in planar1,   65533 runs,      3 skips
> > > > 
> > > > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
> > > > takes as many cycles as the x86 SSE2 version, yikes it's fast.
> > > > 
> > > > Note that this function uses VSX instructions, but is not marked so.
> > > > This is because several existing functions also make that mistake.
> > > > I'll submit a patch moving them once this is reviewed.
> > > > 
> > > > v2: Remove !BE check
> > > > Signed-off-by: Lauri Kasanen <cand@gmx.com>
> > > 
> > > Ping. Seems not many ffmpeg devs interested in ppc.
> > 
> > have you tried "make fate" with this patch (note you need to configure with
> > fate samples" so all tests are run
> 
> I ran those fate tests containing "scale" in the name, I gather the
> full suite takes > 20min. Otherwise I tested with a PNG to video
> conversion on LE, and Carl Eugen Hoyos tested with Lena on BE.

the full fate tests must be run, many of these tests use swscale without
having "scale" in their name
and yes on lower end hardware 20min and longer is possible

[...]
Lauri Kasanen Nov. 21, 2018, 5:19 p.m. UTC | #6
On Wed, 21 Nov 2018 17:22:36 +0100
Michael Niedermayer <michael@niedermayer.cc> wrote:

> the full fate tests must be run, many of these tests use swscale without
> having "scale" in their name
> and yes on lower end hardware 20min and longer is possible

I get failures on the baseline, without my patch. What is the procedure
here? Is there a var to skip those tests, or?

First I ran with THREADS=3, baseline blew up in
fate-h264-conformance-frext-hpcafl_bcrm_c

Then I ran without THREADS, it got further, but blew up in
fate-rv20-1239

- Lauri
Michael Niedermayer Nov. 23, 2018, 2:26 a.m. UTC | #7
On Wed, Nov 21, 2018 at 07:19:45PM +0200, Lauri Kasanen wrote:
> On Wed, 21 Nov 2018 17:22:36 +0100
> Michael Niedermayer <michael@niedermayer.cc> wrote:
> 
> > the full fate tests must be run, many of these tests use swscale without
> > having "scale" in their name
> > and yes on lower end hardware 20min and longer is possible
> 
> I get failures on the baseline, without my patch. What is the procedure
> here? Is there a var to skip those tests, or?

procedure ?
First i try to convince you to attempt to fix some of these failures ;)
because well, everyone would benefit if they are fixed ...


> 
> First I ran with THREADS=3, baseline blew up in
> fate-h264-conformance-frext-hpcafl_bcrm_c
> 
> Then I ran without THREADS, it got further, but blew up in
> fate-rv20-1239

you can use make -k  to continue testing in presence of failures
this is not guranteed to run all tests, it may skip dependant tests
which dependent on a failing target.

Thanks

[...]
Lauri Kasanen Nov. 23, 2018, 8:38 a.m. UTC | #8
On Fri, 23 Nov 2018 03:26:50 +0100
Michael Niedermayer <michael@niedermayer.cc> wrote:

> On Wed, Nov 21, 2018 at 07:19:45PM +0200, Lauri Kasanen wrote:
> > On Wed, 21 Nov 2018 17:22:36 +0100
> > Michael Niedermayer <michael@niedermayer.cc> wrote:
> > > the full fate tests must be run, many of these tests use swscale without
> > > having "scale" in their name
> > > and yes on lower end hardware 20min and longer is possible
> > 
> > I get failures on the baseline, without my patch. What is the procedure
> > here? Is there a var to skip those tests, or?
> 
> procedure ?
> First i try to convince you to attempt to fix some of these failures ;)
> because well, everyone would benefit if they are fixed ...

I mean, if my patch adds no failures, is that enough to apply it?

- Lauri
Michael Niedermayer Nov. 23, 2018, 10:01 p.m. UTC | #9
On Fri, Nov 23, 2018 at 10:38:13AM +0200, Lauri Kasanen wrote:
> On Fri, 23 Nov 2018 03:26:50 +0100
> Michael Niedermayer <michael@niedermayer.cc> wrote:
> 
> > On Wed, Nov 21, 2018 at 07:19:45PM +0200, Lauri Kasanen wrote:
> > > On Wed, 21 Nov 2018 17:22:36 +0100
> > > Michael Niedermayer <michael@niedermayer.cc> wrote:
> > > > the full fate tests must be run, many of these tests use swscale without
> > > > having "scale" in their name
> > > > and yes on lower end hardware 20min and longer is possible
> > > 
> > > I get failures on the baseline, without my patch. What is the procedure
> > > here? Is there a var to skip those tests, or?
> > 
> > procedure ?
> > First i try to convince you to attempt to fix some of these failures ;)
> > because well, everyone would benefit if they are fixed ...
> 
> I mean, if my patch adds no failures, is that enough to apply it?

yes that and the tests failing should still fail the same way with the
same checksums
This of course assumes noone finds an issue in the patch

[...]
Lauri Kasanen Nov. 24, 2018, 9:33 a.m. UTC | #10
On Fri, 23 Nov 2018 23:01:02 +0100
Michael Niedermayer <michael@niedermayer.cc> wrote:

> On Fri, Nov 23, 2018 at 10:38:13AM +0200, Lauri Kasanen wrote:
> > I mean, if my patch adds no failures, is that enough to apply it?
> 
> yes that and the tests failing should still fail the same way with the
> same checksums
> This of course assumes noone finds an issue in the patch

Okay, ran both with -k. No new failures, and fate-rv20-1239 failed with
the same checksums in both cases. That was the only failing test, did
not try with THREADS.

Curiously "make CPUFLAGS=0 fate-rv20-1239" also fails, so it's not
Altivec code that breaks that test, but C (?).

- Lauri
Michael Niedermayer Nov. 26, 2018, 1:56 a.m. UTC | #11
On Sat, Nov 24, 2018 at 11:33:01AM +0200, Lauri Kasanen wrote:
> On Fri, 23 Nov 2018 23:01:02 +0100
> Michael Niedermayer <michael@niedermayer.cc> wrote:
> 
> > On Fri, Nov 23, 2018 at 10:38:13AM +0200, Lauri Kasanen wrote:
> > > I mean, if my patch adds no failures, is that enough to apply it?
> > 
> > yes that and the tests failing should still fail the same way with the
> > same checksums
> > This of course assumes noone finds an issue in the patch
> 
> Okay, ran both with -k. No new failures, and fate-rv20-1239 failed with
> the same checksums in both cases. That was the only failing test, did
> not try with THREADS.

ok, will apply

thanks


[...]
Lauri Kasanen Nov. 26, 2018, 8:08 a.m. UTC | #12
On Mon, 26 Nov 2018 11:03:55 +0300
Michael Kostylev <michael.kostylev@gmail.com> wrote:

> 
> http://fate.xffm.org/?sort=arch
> /ppc

Yeah, mentioned in the commit message. Follow-up patch coming today.

- Lauri
Carl Eugen Hoyos Nov. 26, 2018, 11:17 p.m. UTC | #13
2018-11-17 9:12 GMT+01:00, Lauri Kasanen <cand@gmx.com>:
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt
> yuv420p \
> -f null -vframes 100 -v error -nostats -
>
> 1158 UNITS in planar1,   65528 runs,      8 skips
>
> -cpuflags 0
>
> 19082 UNITS in planar1,   65533 runs,      3 skips
>
> 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
> takes as many cycles as the x86 SSE2 version, yikes it's fast.
>
> Note that this function uses VSX instructions, but is not marked so.
> This is because several existing functions also make that mistake.
> I'll submit a patch moving them once this is reviewed.
>
> v2: Remove !BE check
> Signed-off-by: Lauri Kasanen <cand@gmx.com>
> ---
>  libswscale/ppc/swscale_altivec.c | 53
> ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>
> diff --git a/libswscale/ppc/swscale_altivec.c
> b/libswscale/ppc/swscale_altivec.c
> index 2fb2337..8c6056d 100644
> --- a/libswscale/ppc/swscale_altivec.c
> +++ b/libswscale/ppc/swscale_altivec.c
> @@ -324,6 +324,53 @@ static void hScale_altivec_real(SwsContext *c, int16_t
> *dst, int dstW,
>              }
>          }
>  }
> +
> +static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW,
> +                           const uint8_t *dither, int offset, int start)
> +{
> +    int i;
> +    for (i = start; i < dstW; i++) {
> +        int val = (src[i] + dither[(i + offset) & 7]) >> 7;
> +        dest[i] = av_clip_uint8(val);
> +    }
> +}
> +
> +static void yuv2plane1_8_altivec(const int16_t *src, uint8_t *dest, int
> dstW,
> +                           const uint8_t *dither, int offset)
> +{
> +    const int dst_u = -(uintptr_t)dest & 15;
> +    int i, j;
> +    LOCAL_ALIGNED(16, int16_t, val, [16]);

> +    const vector uint16_t shifts = (vector uint16_t) {7, 7, 7, 7, 7, 7, 7,
> 7};

The patch breaks compilation with xlc, sorry for not testing earlier:
libswscale/ppc/swscale_altivec.c:344:11: error: unknown type name 'vector'
    const vector uint16_t shifts = (vector uint16_t) {7, 7, 7, 7, 7, 7, 7, 7};

Carl Eugen
Carl Eugen Hoyos Nov. 27, 2018, 1:22 a.m. UTC | #14
2018-11-21 18:19 GMT+01:00, Lauri Kasanen <cand@gmx.com>:

> First I ran with THREADS=3, baseline blew up in
> fate-h264-conformance-frext-hpcafl_bcrm_c

I opened ticket #7570, thank you for the report!

Carl Eugen
diff mbox

Patch

diff --git a/libswscale/ppc/swscale_altivec.c b/libswscale/ppc/swscale_altivec.c
index 2fb2337..8c6056d 100644
--- a/libswscale/ppc/swscale_altivec.c
+++ b/libswscale/ppc/swscale_altivec.c
@@ -324,6 +324,53 @@  static void hScale_altivec_real(SwsContext *c, int16_t *dst, int dstW,
             }
         }
 }
+
+static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW,
+                           const uint8_t *dither, int offset, int start)
+{
+    int i;
+    for (i = start; i < dstW; i++) {
+        int val = (src[i] + dither[(i + offset) & 7]) >> 7;
+        dest[i] = av_clip_uint8(val);
+    }
+}
+
+static void yuv2plane1_8_altivec(const int16_t *src, uint8_t *dest, int dstW,
+                           const uint8_t *dither, int offset)
+{
+    const int dst_u = -(uintptr_t)dest & 15;
+    int i, j;
+    LOCAL_ALIGNED(16, int16_t, val, [16]);
+    const vector uint16_t shifts = (vector uint16_t) {7, 7, 7, 7, 7, 7, 7, 7};
+    vector int16_t vi, vileft, ditherleft, ditherright;
+    vector uint8_t vd;
+
+    for (j = 0; j < 16; j++) {
+        val[j] = dither[(dst_u + offset + j) & 7];
+    }
+
+    ditherleft = vec_ld(0, val);
+    ditherright = vec_ld(0, &val[8]);
+
+    yuv2plane1_8_u(src, dest, dst_u, dither, offset, 0);
+
+    for (i = dst_u; i < dstW - 15; i += 16) {
+
+        vi = vec_vsx_ld(0, &src[i]);
+        vi = vec_adds(ditherleft, vi);
+        vileft = vec_sra(vi, shifts);
+
+        vi = vec_vsx_ld(0, &src[i + 8]);
+        vi = vec_adds(ditherright, vi);
+        vi = vec_sra(vi, shifts);
+
+        vd = vec_packsu(vileft, vi);
+        vec_st(vd, 0, &dest[i]);
+    }
+
+    yuv2plane1_8_u(src, dest, dstW, dither, offset, i);
+}
+
 #endif /* HAVE_ALTIVEC */
 
 av_cold void ff_sws_init_swscale_ppc(SwsContext *c)
@@ -367,6 +414,12 @@  av_cold void ff_sws_init_swscale_ppc(SwsContext *c)
             c->yuv2packedX = ff_yuv2rgb24_X_altivec;
             break;
         }
+
+        switch (c->dstBpc) {
+        case 8:
+            c->yuv2plane1 = yuv2plane1_8_altivec;
+            break;
+        }
     }
 #endif /* HAVE_ALTIVEC */
 }