Message ID | 20181117101214.19917fdecfc9fc4d26854a9c@gmx.com |
---|---|
State | Accepted |
Commit | 46c5693ea3a9364e24e2f5336bcdb5b191a2329f |
Headers | show |
On Sat, Nov 17, 2018 at 10:12:14AM +0200, Lauri Kasanen wrote: > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \ > -f null -vframes 100 -v error -nostats - > > 1158 UNITS in planar1, 65528 runs, 8 skips > > -cpuflags 0 > > 19082 UNITS in planar1, 65533 runs, 3 skips > > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version > takes as many cycles as the x86 SSE2 version, yikes it's fast. > > Note that this function uses VSX instructions, but is not marked so. > This is because several existing functions also make that mistake. > I'll submit a patch moving them once this is reviewed. > > v2: Remove !BE check > Signed-off-by: Lauri Kasanen <cand@gmx.com> > --- > libswscale/ppc/swscale_altivec.c | 53 ++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 53 insertions(+) iam no altivec guy, just wanted to reply primarly so people dont wait for a review from me as i normally review swscale patches. if someone who knows altivec ok-es this then its ok with me too. assuming this was tested (fate and some real world test) thx [...]
> ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \ > -f null -vframes 100 -v error -nostats - > > 1158 UNITS in planar1, 65528 runs, 8 skips > > -cpuflags 0 > > 19082 UNITS in planar1, 65533 runs, 3 skips > > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version > takes as many cycles as the x86 SSE2 version, yikes it's fast. > > Note that this function uses VSX instructions, but is not marked so. > This is because several existing functions also make that mistake. > I'll submit a patch moving them once this is reviewed. > > v2: Remove !BE check > Signed-off-by: Lauri Kasanen <cand@gmx.com> Ping. Seems not many ffmpeg devs interested in ppc. - Lauri
On Wed, Nov 21, 2018 at 10:12:48AM +0200, Lauri Kasanen wrote: > > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \ > > -f null -vframes 100 -v error -nostats - > > > > 1158 UNITS in planar1, 65528 runs, 8 skips > > > > -cpuflags 0 > > > > 19082 UNITS in planar1, 65533 runs, 3 skips > > > > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version > > takes as many cycles as the x86 SSE2 version, yikes it's fast. > > > > Note that this function uses VSX instructions, but is not marked so. > > This is because several existing functions also make that mistake. > > I'll submit a patch moving them once this is reviewed. > > > > v2: Remove !BE check > > Signed-off-by: Lauri Kasanen <cand@gmx.com> > > Ping. Seems not many ffmpeg devs interested in ppc. have you tried "make fate" with this patch (note you need to configure with fate samples" so all tests are run thx [...]
On Wed, 21 Nov 2018 13:21:58 +0100 Michael Niedermayer <michael@niedermayer.cc> wrote: > On Wed, Nov 21, 2018 at 10:12:48AM +0200, Lauri Kasanen wrote: > > > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \ > > > -f null -vframes 100 -v error -nostats - > > > > > > 1158 UNITS in planar1, 65528 runs, 8 skips > > > > > > -cpuflags 0 > > > > > > 19082 UNITS in planar1, 65533 runs, 3 skips > > > > > > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version > > > takes as many cycles as the x86 SSE2 version, yikes it's fast. > > > > > > Note that this function uses VSX instructions, but is not marked so. > > > This is because several existing functions also make that mistake. > > > I'll submit a patch moving them once this is reviewed. > > > > > > v2: Remove !BE check > > > Signed-off-by: Lauri Kasanen <cand@gmx.com> > > > > Ping. Seems not many ffmpeg devs interested in ppc. > > have you tried "make fate" with this patch (note you need to configure with > fate samples" so all tests are run I ran those fate tests containing "scale" in the name, I gather the full suite takes > 20min. Otherwise I tested with a PNG to video conversion on LE, and Carl Eugen Hoyos tested with Lena on BE. - Lauri
On Wed, Nov 21, 2018 at 02:35:32PM +0200, Lauri Kasanen wrote: > On Wed, 21 Nov 2018 13:21:58 +0100 > Michael Niedermayer <michael@niedermayer.cc> wrote: > > > On Wed, Nov 21, 2018 at 10:12:48AM +0200, Lauri Kasanen wrote: > > > > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \ > > > > -f null -vframes 100 -v error -nostats - > > > > > > > > 1158 UNITS in planar1, 65528 runs, 8 skips > > > > > > > > -cpuflags 0 > > > > > > > > 19082 UNITS in planar1, 65533 runs, 3 skips > > > > > > > > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version > > > > takes as many cycles as the x86 SSE2 version, yikes it's fast. > > > > > > > > Note that this function uses VSX instructions, but is not marked so. > > > > This is because several existing functions also make that mistake. > > > > I'll submit a patch moving them once this is reviewed. > > > > > > > > v2: Remove !BE check > > > > Signed-off-by: Lauri Kasanen <cand@gmx.com> > > > > > > Ping. Seems not many ffmpeg devs interested in ppc. > > > > have you tried "make fate" with this patch (note you need to configure with > > fate samples" so all tests are run > > I ran those fate tests containing "scale" in the name, I gather the > full suite takes > 20min. Otherwise I tested with a PNG to video > conversion on LE, and Carl Eugen Hoyos tested with Lena on BE. the full fate tests must be run, many of these tests use swscale without having "scale" in their name and yes on lower end hardware 20min and longer is possible [...]
On Wed, 21 Nov 2018 17:22:36 +0100 Michael Niedermayer <michael@niedermayer.cc> wrote: > the full fate tests must be run, many of these tests use swscale without > having "scale" in their name > and yes on lower end hardware 20min and longer is possible I get failures on the baseline, without my patch. What is the procedure here? Is there a var to skip those tests, or? First I ran with THREADS=3, baseline blew up in fate-h264-conformance-frext-hpcafl_bcrm_c Then I ran without THREADS, it got further, but blew up in fate-rv20-1239 - Lauri
On Wed, Nov 21, 2018 at 07:19:45PM +0200, Lauri Kasanen wrote: > On Wed, 21 Nov 2018 17:22:36 +0100 > Michael Niedermayer <michael@niedermayer.cc> wrote: > > > the full fate tests must be run, many of these tests use swscale without > > having "scale" in their name > > and yes on lower end hardware 20min and longer is possible > > I get failures on the baseline, without my patch. What is the procedure > here? Is there a var to skip those tests, or? procedure ? First i try to convince you to attempt to fix some of these failures ;) because well, everyone would benefit if they are fixed ... > > First I ran with THREADS=3, baseline blew up in > fate-h264-conformance-frext-hpcafl_bcrm_c > > Then I ran without THREADS, it got further, but blew up in > fate-rv20-1239 you can use make -k to continue testing in presence of failures this is not guranteed to run all tests, it may skip dependant tests which dependent on a failing target. Thanks [...]
On Fri, 23 Nov 2018 03:26:50 +0100 Michael Niedermayer <michael@niedermayer.cc> wrote: > On Wed, Nov 21, 2018 at 07:19:45PM +0200, Lauri Kasanen wrote: > > On Wed, 21 Nov 2018 17:22:36 +0100 > > Michael Niedermayer <michael@niedermayer.cc> wrote: > > > the full fate tests must be run, many of these tests use swscale without > > > having "scale" in their name > > > and yes on lower end hardware 20min and longer is possible > > > > I get failures on the baseline, without my patch. What is the procedure > > here? Is there a var to skip those tests, or? > > procedure ? > First i try to convince you to attempt to fix some of these failures ;) > because well, everyone would benefit if they are fixed ... I mean, if my patch adds no failures, is that enough to apply it? - Lauri
On Fri, Nov 23, 2018 at 10:38:13AM +0200, Lauri Kasanen wrote: > On Fri, 23 Nov 2018 03:26:50 +0100 > Michael Niedermayer <michael@niedermayer.cc> wrote: > > > On Wed, Nov 21, 2018 at 07:19:45PM +0200, Lauri Kasanen wrote: > > > On Wed, 21 Nov 2018 17:22:36 +0100 > > > Michael Niedermayer <michael@niedermayer.cc> wrote: > > > > the full fate tests must be run, many of these tests use swscale without > > > > having "scale" in their name > > > > and yes on lower end hardware 20min and longer is possible > > > > > > I get failures on the baseline, without my patch. What is the procedure > > > here? Is there a var to skip those tests, or? > > > > procedure ? > > First i try to convince you to attempt to fix some of these failures ;) > > because well, everyone would benefit if they are fixed ... > > I mean, if my patch adds no failures, is that enough to apply it? yes that and the tests failing should still fail the same way with the same checksums This of course assumes noone finds an issue in the patch [...]
On Fri, 23 Nov 2018 23:01:02 +0100 Michael Niedermayer <michael@niedermayer.cc> wrote: > On Fri, Nov 23, 2018 at 10:38:13AM +0200, Lauri Kasanen wrote: > > I mean, if my patch adds no failures, is that enough to apply it? > > yes that and the tests failing should still fail the same way with the > same checksums > This of course assumes noone finds an issue in the patch Okay, ran both with -k. No new failures, and fate-rv20-1239 failed with the same checksums in both cases. That was the only failing test, did not try with THREADS. Curiously "make CPUFLAGS=0 fate-rv20-1239" also fails, so it's not Altivec code that breaks that test, but C (?). - Lauri
On Sat, Nov 24, 2018 at 11:33:01AM +0200, Lauri Kasanen wrote: > On Fri, 23 Nov 2018 23:01:02 +0100 > Michael Niedermayer <michael@niedermayer.cc> wrote: > > > On Fri, Nov 23, 2018 at 10:38:13AM +0200, Lauri Kasanen wrote: > > > I mean, if my patch adds no failures, is that enough to apply it? > > > > yes that and the tests failing should still fail the same way with the > > same checksums > > This of course assumes noone finds an issue in the patch > > Okay, ran both with -k. No new failures, and fate-rv20-1239 failed with > the same checksums in both cases. That was the only failing test, did > not try with THREADS. ok, will apply thanks [...]
On Mon, 26 Nov 2018 11:03:55 +0300 Michael Kostylev <michael.kostylev@gmail.com> wrote: > > http://fate.xffm.org/?sort=arch > /ppc Yeah, mentioned in the commit message. Follow-up patch coming today. - Lauri
2018-11-17 9:12 GMT+01:00, Lauri Kasanen <cand@gmx.com>: > ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt > yuv420p \ > -f null -vframes 100 -v error -nostats - > > 1158 UNITS in planar1, 65528 runs, 8 skips > > -cpuflags 0 > > 19082 UNITS in planar1, 65533 runs, 3 skips > > 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version > takes as many cycles as the x86 SSE2 version, yikes it's fast. > > Note that this function uses VSX instructions, but is not marked so. > This is because several existing functions also make that mistake. > I'll submit a patch moving them once this is reviewed. > > v2: Remove !BE check > Signed-off-by: Lauri Kasanen <cand@gmx.com> > --- > libswscale/ppc/swscale_altivec.c | 53 > ++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 53 insertions(+) > > diff --git a/libswscale/ppc/swscale_altivec.c > b/libswscale/ppc/swscale_altivec.c > index 2fb2337..8c6056d 100644 > --- a/libswscale/ppc/swscale_altivec.c > +++ b/libswscale/ppc/swscale_altivec.c > @@ -324,6 +324,53 @@ static void hScale_altivec_real(SwsContext *c, int16_t > *dst, int dstW, > } > } > } > + > +static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW, > + const uint8_t *dither, int offset, int start) > +{ > + int i; > + for (i = start; i < dstW; i++) { > + int val = (src[i] + dither[(i + offset) & 7]) >> 7; > + dest[i] = av_clip_uint8(val); > + } > +} > + > +static void yuv2plane1_8_altivec(const int16_t *src, uint8_t *dest, int > dstW, > + const uint8_t *dither, int offset) > +{ > + const int dst_u = -(uintptr_t)dest & 15; > + int i, j; > + LOCAL_ALIGNED(16, int16_t, val, [16]); > + const vector uint16_t shifts = (vector uint16_t) {7, 7, 7, 7, 7, 7, 7, > 7}; The patch breaks compilation with xlc, sorry for not testing earlier: libswscale/ppc/swscale_altivec.c:344:11: error: unknown type name 'vector' const vector uint16_t shifts = (vector uint16_t) {7, 7, 7, 7, 7, 7, 7, 7}; Carl Eugen
2018-11-21 18:19 GMT+01:00, Lauri Kasanen <cand@gmx.com>: > First I ran with THREADS=3, baseline blew up in > fate-h264-conformance-frext-hpcafl_bcrm_c I opened ticket #7570, thank you for the report! Carl Eugen
diff --git a/libswscale/ppc/swscale_altivec.c b/libswscale/ppc/swscale_altivec.c index 2fb2337..8c6056d 100644 --- a/libswscale/ppc/swscale_altivec.c +++ b/libswscale/ppc/swscale_altivec.c @@ -324,6 +324,53 @@ static void hScale_altivec_real(SwsContext *c, int16_t *dst, int dstW, } } } + +static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW, + const uint8_t *dither, int offset, int start) +{ + int i; + for (i = start; i < dstW; i++) { + int val = (src[i] + dither[(i + offset) & 7]) >> 7; + dest[i] = av_clip_uint8(val); + } +} + +static void yuv2plane1_8_altivec(const int16_t *src, uint8_t *dest, int dstW, + const uint8_t *dither, int offset) +{ + const int dst_u = -(uintptr_t)dest & 15; + int i, j; + LOCAL_ALIGNED(16, int16_t, val, [16]); + const vector uint16_t shifts = (vector uint16_t) {7, 7, 7, 7, 7, 7, 7, 7}; + vector int16_t vi, vileft, ditherleft, ditherright; + vector uint8_t vd; + + for (j = 0; j < 16; j++) { + val[j] = dither[(dst_u + offset + j) & 7]; + } + + ditherleft = vec_ld(0, val); + ditherright = vec_ld(0, &val[8]); + + yuv2plane1_8_u(src, dest, dst_u, dither, offset, 0); + + for (i = dst_u; i < dstW - 15; i += 16) { + + vi = vec_vsx_ld(0, &src[i]); + vi = vec_adds(ditherleft, vi); + vileft = vec_sra(vi, shifts); + + vi = vec_vsx_ld(0, &src[i + 8]); + vi = vec_adds(ditherright, vi); + vi = vec_sra(vi, shifts); + + vd = vec_packsu(vileft, vi); + vec_st(vd, 0, &dest[i]); + } + + yuv2plane1_8_u(src, dest, dstW, dither, offset, i); +} + #endif /* HAVE_ALTIVEC */ av_cold void ff_sws_init_swscale_ppc(SwsContext *c) @@ -367,6 +414,12 @@ av_cold void ff_sws_init_swscale_ppc(SwsContext *c) c->yuv2packedX = ff_yuv2rgb24_X_altivec; break; } + + switch (c->dstBpc) { + case 8: + c->yuv2plane1 = yuv2plane1_8_altivec; + break; + } } #endif /* HAVE_ALTIVEC */ }
./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \ -f null -vframes 100 -v error -nostats - 1158 UNITS in planar1, 65528 runs, 8 skips -cpuflags 0 19082 UNITS in planar1, 65533 runs, 3 skips 16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version takes as many cycles as the x86 SSE2 version, yikes it's fast. Note that this function uses VSX instructions, but is not marked so. This is because several existing functions also make that mistake. I'll submit a patch moving them once this is reviewed. v2: Remove !BE check Signed-off-by: Lauri Kasanen <cand@gmx.com> --- libswscale/ppc/swscale_altivec.c | 53 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 53 insertions(+)