[FFmpeg-devel,v2] swscale/output: Altivec-optimize float yuv2plane1

Message ID: 20181216110653.2cc19e502ad7037f77954a40@gmx.com
State: Accepted
Commit: 8dd9df9ecd258cff84cef559f16e682949e78e38

Commit Message

Lauri Kasanen Dec. 16, 2018, 9:06 a.m. UTC
This function wouldn't benefit from VSX instructions, so I put it
under altivec.

./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
-f null -vframes 100 -v error -nostats -

3743 UNITS in planar1,   65495 runs,     41 skips

-cpuflags 0

23511 UNITS in planar1,   65530 runs,      6 skips

grayf32be

4647 UNITS in planar1,   65449 runs,     87 skips

-cpuflags 0

28608 UNITS in planar1,   65530 runs,      6 skips

The native speedup is 6.28133, and the bswapping one 6.15623.
Fate passes, each format tested with an image to video conversion.

Signed-off-by: Lauri Kasanen <cand@gmx.com>
---

Tested on POWER8 LE. Testing on earlier ppc and/or BE appreciated.

v2: Added #undef vzero, that define broke the build on older gcc. Thanks Michael

 libswscale/ppc/swscale_altivec.c | 141 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 139 insertions(+), 2 deletions(-)
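
The speedup figures in the commit message can be reproduced by dividing each -cpuflags 0 timing by the AltiVec timing for the same format. A minimal sketch in C; the helper name speedup is hypothetical and not part of the patch:

```c
#include <assert.h>

/* Speedup = scalar (-cpuflags 0) time divided by the AltiVec time,
 * both taken from the "UNITS in planar1" lines of the timer output.
 * The name speedup() is illustrative only. */
double speedup(double scalar_units, double simd_units)
{
    assert(simd_units > 0.0);
    return scalar_units / simd_units;
}
```

With the numbers quoted above, speedup(23511, 3743) is roughly 6.2813 for grayf32le and speedup(28608, 4647) is roughly 6.1562 for grayf32be, which matches the reported values.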

Comments

Carl Eugen Hoyos Dec. 17, 2018, 12:03 a.m. UTC | #1
2018-12-16 10:06 GMT+01:00, Lauri Kasanen <cand@gmx.com>:
> This function wouldn't benefit from VSX instructions, so I put it
> under altivec.
>
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt
> grayf32le \
> -f null -vframes 100 -v error -nostats -
>
> 3743 UNITS in planar1,   65495 runs,     41 skips
>
> -cpuflags 0
>
> 23511 UNITS in planar1,   65530 runs,      6 skips
>
> grayf32be
>
> 4647 UNITS in planar1,   65449 runs,     87 skips
>
> -cpuflags 0
>
> 28608 UNITS in planar1,   65530 runs,      6 skips
>
> The native speedup is 6.28133, and the bswapping one 6.15623.

> Fate passes

I wonder a little how, given that grayf32 already breaks fate as-is...

Note that this function / this pix_fmt currently has no real use-case
afaict.

Carl Eugen

Lauri Kasanen Dec. 17, 2018, 7:37 a.m. UTC | #2
On Mon, 17 Dec 2018 01:03:36 +0100
Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:

> 2018-12-16 10:06 GMT+01:00, Lauri Kasanen <cand@gmx.com>:
> > This function wouldn't benefit from VSX instructions, so I put it
> > under altivec.
> >
> > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt
> > grayf32le \
> > -f null -vframes 100 -v error -nostats -
> >
> > 3743 UNITS in planar1,   65495 runs,     41 skips
> >
> > -cpuflags 0
> >
> > 23511 UNITS in planar1,   65530 runs,      6 skips
> >
> > grayf32be
> >
> > 4647 UNITS in planar1,   65449 runs,     87 skips
> >
> > -cpuflags 0
> >
> > 28608 UNITS in planar1,   65530 runs,      6 skips
> >
> > The native speedup is 6.28133, and the bswapping one 6.15623.
> 
> > Fate passes
> 
> I wonder a little how, given that grayf32 already breaks fate as-is...

Are the tests for it disabled? fate.ffmpeg.org reports 100% success for
many platforms.

> Note that this function / this pix_fmt currently has no real use-case
> afaict.

Is there a list of which pix fmts are useful? Of course I don't want to
waste both my and reviewers' time, if the format is considered for
removal or otherwise broken.

- Lauri

Carl Eugen Hoyos Dec. 17, 2018, 1:52 p.m. UTC | #3
2018-12-17 8:37 GMT+01:00, Lauri Kasanen <cand@gmx.com>:
> On Mon, 17 Dec 2018 01:03:36 +0100
> Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:
>
>> 2018-12-16 10:06 GMT+01:00, Lauri Kasanen <cand@gmx.com>:
>> > This function wouldn't benefit from VSX instructions, so I put it
>> > under altivec.
>> >
>> > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt
>> > grayf32le \
>> > -f null -vframes 100 -v error -nostats -
>> >
>> > 3743 UNITS in planar1,   65495 runs,     41 skips
>> >
>> > -cpuflags 0
>> >
>> > 23511 UNITS in planar1,   65530 runs,      6 skips
>> >
>> > grayf32be
>> >
>> > 4647 UNITS in planar1,   65449 runs,     87 skips
>> >
>> > -cpuflags 0
>> >
>> > 28608 UNITS in planar1,   65530 runs,      6 skips
>> >
>> > The native speedup is 6.28133, and the bswapping one 6.15623.
>>
>> > Fate passes
>>
>> I wonder a little how, given that grayf32 already breaks fate as-is...
>
> Are the tests for it disabled? fate.ffmpeg.org reports 100% success for
> many platforms.

Iirc, it is broken with --disable-sse

>> Note that this function / this pix_fmt currently has no real use-case
>> afaict.
>
> Is there a list of which pix fmts are useful? Of course I don't want to
> waste both my and reviewers' time, if the format is considered for
> removal or otherwise broken.

The pix_fmt is not deprecated (it's new); what I meant was that it is
currently only used for obscure monochrome Photoshop images
and one filter, so I am not sure optimizing this colour conversion
will help often.

But this is of course not very much related to this patch, sorry
for the noise!

Carl Eugen

Lauri Kasanen Dec. 17, 2018, 3:31 p.m. UTC | #4
On Mon, 17 Dec 2018 14:52:49 +0100
Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:

> >> Note that this function / this pix_fmt currently has no real use-case
> >> afaict.
> >
> > Is there a list of which pix fmts are useful? Of course I don't want to
> > waste both my and reviewers' time, if the format is considered for
> > removal or otherwise broken.
> 
> The pix_fmt is not deprecated (it's new); what I meant was that it is
> currently only used for obscure monochrome Photoshop images
> and one filter, so I am not sure optimizing this colour conversion
> will help often.

Oh, thanks for the clarification. I'm going roughly in difficulty
order, doing the easy functions first.

- Lauri

Lauri Kasanen Dec. 24, 2018, 5:39 p.m. UTC | #5
On Sun, 16 Dec 2018 11:06:53 +0200
Lauri Kasanen <cand@gmx.com> wrote:

> This function wouldn't benefit from VSX instructions, so I put it
> under altivec.
> 
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
> -f null -vframes 100 -v error -nostats -
> 
> 3743 UNITS in planar1,   65495 runs,     41 skips
> 
> -cpuflags 0
> 
> 23511 UNITS in planar1,   65530 runs,      6 skips
> 
> grayf32be
> 
> 4647 UNITS in planar1,   65449 runs,     87 skips
> 
> -cpuflags 0
> 
> 28608 UNITS in planar1,   65530 runs,      6 skips
> 
> The native speedup is 6.28133, and the bswapping one 6.15623.
> Fate passes, each format tested with an image to video conversion.
> 
> Signed-off-by: Lauri Kasanen <cand@gmx.com>
> ---
> 
> Tested on POWER8 LE. Testing on earlier ppc and/or BE appreciated.
> 
> v2: Added #undef vzero, that define broke the build on older gcc. Thanks Michael

Ping. And of course it's not gcc version dependent, but rather it was
the BE ifdef; it was too early in the morning.

- Lauri

Michael Niedermayer Dec. 26, 2018, 7:28 p.m. UTC | #6
On Mon, Dec 24, 2018 at 07:39:18PM +0200, Lauri Kasanen wrote:
> On Sun, 16 Dec 2018 11:06:53 +0200
> Lauri Kasanen <cand@gmx.com> wrote:
> 
> > This function wouldn't benefit from VSX instructions, so I put it
> > under altivec.
> > 
> > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt grayf32le \
> > -f null -vframes 100 -v error -nostats -
> > 
> > 3743 UNITS in planar1,   65495 runs,     41 skips
> > 
> > -cpuflags 0
> > 
> > 23511 UNITS in planar1,   65530 runs,      6 skips
> > 
> > grayf32be
> > 
> > 4647 UNITS in planar1,   65449 runs,     87 skips
> > 
> > -cpuflags 0
> > 
> > 28608 UNITS in planar1,   65530 runs,      6 skips
> > 
> > The native speedup is 6.28133, and the bswapping one 6.15623.
> > Fate passes, each format tested with an image to video conversion.
> > 
> > Signed-off-by: Lauri Kasanen <cand@gmx.com>
> > ---
> > 
> > Tested on POWER8 LE. Testing on earlier ppc and/or BE appreciated.
> > 
> > v2: Added #undef vzero, that define broke the build on older gcc. Thanks Michael
> 
> > Ping. And of course it's not gcc version dependent, but rather it was
> the BE ifdef; it was too early in the morning.

seems to be working, will apply

thx

[...]

Patch

diff --git a/libswscale/ppc/swscale_altivec.c b/libswscale/ppc/swscale_altivec.c
index 1d2b2fa..d72ed1e 100644
--- a/libswscale/ppc/swscale_altivec.c
+++ b/libswscale/ppc/swscale_altivec.c
@@ -31,7 +31,8 @@ 
 #include "yuv2rgb_altivec.h"
 #include "libavutil/ppc/util_altivec.h"
 
-#if HAVE_ALTIVEC && HAVE_BIGENDIAN
+#if HAVE_ALTIVEC
+#if HAVE_BIGENDIAN
 #define vzero vec_splat_s32(0)
 
 #define  GET_LS(a,b,c,s) {\
@@ -102,7 +103,137 @@ 
 #include "swscale_ppc_template.c"
 #undef FUNC
 
-#endif /* HAVE_ALTIVEC && HAVE_BIGENDIAN */
+#undef vzero
+
+#endif /* HAVE_BIGENDIAN */
+
+#define output_pixel(pos, val, bias, signedness) \
+    if (big_endian) { \
+        AV_WB16(pos, bias + av_clip_ ## signedness ## 16(val >> shift)); \
+    } else { \
+        AV_WL16(pos, bias + av_clip_ ## signedness ## 16(val >> shift)); \
+    }
+
+static void
+yuv2plane1_float_u(const int32_t *src, float *dest, int dstW, int start)
+{
+    static const int big_endian = HAVE_BIGENDIAN;
+    static const int shift = 3;
+    static const float float_mult = 1.0f / 65535.0f;
+    int i, val;
+    uint16_t val_uint;
+
+    for (i = start; i < dstW; ++i){
+        val = src[i] + (1 << (shift - 1));
+        output_pixel(&val_uint, val, 0, uint);
+        dest[i] = float_mult * (float)val_uint;
+    }
+}
+
+static void
+yuv2plane1_float_bswap_u(const int32_t *src, uint32_t *dest, int dstW, int start)
+{
+    static const int big_endian = HAVE_BIGENDIAN;
+    static const int shift = 3;
+    static const float float_mult = 1.0f / 65535.0f;
+    int i, val;
+    uint16_t val_uint;
+
+    for (i = start; i < dstW; ++i){
+        val = src[i] + (1 << (shift - 1));
+        output_pixel(&val_uint, val, 0, uint);
+        dest[i] = av_bswap32(av_float2int(float_mult * (float)val_uint));
+    }
+}
+
+static void yuv2plane1_float_altivec(const int32_t *src, float *dest, int dstW)
+{
+    const int dst_u = -(uintptr_t)dest & 3;
+    const int shift = 3;
+    const int add = (1 << (shift - 1));
+    const int clip = (1 << 16) - 1;
+    const float fmult = 1.0f / 65535.0f;
+    const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
+    const vector uint32_t vshift = (vector uint32_t) vec_splat_u32(shift);
+    const vector uint32_t vlargest = (vector uint32_t) {clip, clip, clip, clip};
+    const vector float vmul = (vector float) {fmult, fmult, fmult, fmult};
+    const vector float vzero = (vector float) {0, 0, 0, 0};
+    vector uint32_t v;
+    vector float vd;
+    int i;
+
+    yuv2plane1_float_u(src, dest, dst_u, 0);
+
+    for (i = dst_u; i < dstW - 3; i += 4) {
+        v = vec_ld(0, (const uint32_t *) &src[i]);
+        v = vec_add(v, vadd);
+        v = vec_sr(v, vshift);
+        v = vec_min(v, vlargest);
+
+        vd = vec_ctf(v, 0);
+        vd = vec_madd(vd, vmul, vzero);
+
+        vec_st(vd, 0, &dest[i]);
+    }
+
+    yuv2plane1_float_u(src, dest, dstW, i);
+}
+
+static void yuv2plane1_float_bswap_altivec(const int32_t *src, uint32_t *dest, int dstW)
+{
+    const int dst_u = -(uintptr_t)dest & 3;
+    const int shift = 3;
+    const int add = (1 << (shift - 1));
+    const int clip = (1 << 16) - 1;
+    const float fmult = 1.0f / 65535.0f;
+    const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
+    const vector uint32_t vshift = (vector uint32_t) vec_splat_u32(shift);
+    const vector uint32_t vlargest = (vector uint32_t) {clip, clip, clip, clip};
+    const vector float vmul = (vector float) {fmult, fmult, fmult, fmult};
+    const vector float vzero = (vector float) {0, 0, 0, 0};
+    const vector uint32_t vswapbig = (vector uint32_t) {16, 16, 16, 16};
+    const vector uint16_t vswapsmall = vec_splat_u16(8);
+    vector uint32_t v;
+    vector float vd;
+    int i;
+
+    yuv2plane1_float_bswap_u(src, dest, dst_u, 0);
+
+    for (i = dst_u; i < dstW - 3; i += 4) {
+        v = vec_ld(0, (const uint32_t *) &src[i]);
+        v = vec_add(v, vadd);
+        v = vec_sr(v, vshift);
+        v = vec_min(v, vlargest);
+
+        vd = vec_ctf(v, 0);
+        vd = vec_madd(vd, vmul, vzero);
+
+        vd = (vector float) vec_rl((vector uint32_t) vd, vswapbig);
+        vd = (vector float) vec_rl((vector uint16_t) vd, vswapsmall);
+
+        vec_st(vd, 0, (float *) &dest[i]);
+    }
+
+    yuv2plane1_float_bswap_u(src, dest, dstW, i);
+}
+
+#define yuv2plane1_float(template, dest_type, BE_LE) \
+static void yuv2plane1_float ## BE_LE ## _altivec(const int16_t *src, uint8_t *dest, \
+                                                  int dstW, \
+                                                  const uint8_t *dither, int offset) \
+{ \
+    template((const int32_t *)src, (dest_type *)dest, dstW); \
+}
+
+#if HAVE_BIGENDIAN
+yuv2plane1_float(yuv2plane1_float_altivec,       float,    BE)
+yuv2plane1_float(yuv2plane1_float_bswap_altivec, uint32_t, LE)
+#else
+yuv2plane1_float(yuv2plane1_float_altivec,       float,    LE)
+yuv2plane1_float(yuv2plane1_float_bswap_altivec, uint32_t, BE)
+#endif
+
+#endif /* HAVE_ALTIVEC */
 
 av_cold void ff_sws_init_swscale_ppc(SwsContext *c)
 {
@@ -124,6 +255,12 @@  av_cold void ff_sws_init_swscale_ppc(SwsContext *c)
     }
 #endif
 
+    if (dstFormat == AV_PIX_FMT_GRAYF32BE) {
+        c->yuv2plane1 = yuv2plane1_floatBE_altivec;
+    } else if (dstFormat == AV_PIX_FMT_GRAYF32LE) {
+        c->yuv2plane1 = yuv2plane1_floatLE_altivec;
+    }
+
     /* The following list of supported dstFormat values should
      * match what's found in the body of ff_yuv2packedX_altivec() */
     if (!(c->flags & (SWS_BITEXACT | SWS_FULL_CHR_H_INT)) && !c->needAlpha) {
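
For readers without an AltiVec machine, the per-lane arithmetic of the vector loops in this patch can be modelled in portable C. The sketch below is only an illustration: the function names (plane1_lane, bswap32_by_rotation, plane1_lane_swapped, self_check) are hypothetical and do not appear in the patch. It follows the vector path's unsigned add, shift, and clamp, and shows that rotating a 32-bit word by 16 and then each 16-bit half by 8, as the two vec_rl calls do, amounts to a full byte swap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate-left helpers; n must be in 1..31 (32-bit) or 1..15 (16-bit). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

static uint16_t rotl16(uint16_t x, unsigned n)
{
    return (uint16_t)((x << n) | (x >> (16 - n)));
}

/* Byte swap built from two rotations: the whole 32-bit word by 16,
 * then each 16-bit half by 8. */
uint32_t bswap32_by_rotation(uint32_t x)
{
    uint32_t r  = rotl32(x, 16);
    uint16_t hi = rotl16((uint16_t)(r >> 16), 8);
    uint16_t lo = rotl16((uint16_t)(r & 0xFFFFu), 8);
    return ((uint32_t)hi << 16) | lo;
}

/* One lane of the native-endian loop: add the rounding bias, shift
 * right by 3, clamp to 65535, convert to float, scale by 1/65535.
 * Unsigned arithmetic is used, mirroring the vector code. */
float plane1_lane(int32_t s)
{
    const unsigned shift = 3;
    uint32_t v = (uint32_t)s + (1u << (shift - 1));

    v >>= shift;
    if (v > 65535u)
        v = 65535u;
    return (float)v * (1.0f / 65535.0f);
}

/* One lane of the byte-swapping loop: same value, stored with the
 * opposite byte order. memcpy reads the float's bit pattern. */
uint32_t plane1_lane_swapped(int32_t s)
{
    float f = plane1_lane(s);
    uint32_t bits;

    memcpy(&bits, &f, sizeof(bits));
    return bswap32_by_rotation(bits);
}

/* Small sanity checks for the helpers above. */
void self_check(void)
{
    assert(bswap32_by_rotation(0x11223344u) == 0x44332211u);
    assert(plane1_lane(0) == 0.0f);
}
```

Because this model uses unsigned arithmetic like the vector loops, it would differ from the scalar head/tail helpers only for negative inputs, which those helpers clip to zero through av_clip_uint16.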