From patchwork Fri Nov 16 12:59:38 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lauri Kasanen
X-Patchwork-Id: 11039
Return-Path:
X-Original-To: patchwork@ffaux-bg.ffmpeg.org
Delivered-To: patchwork@ffaux-bg.ffmpeg.org
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
 by ffaux.localdomain (Postfix) with ESMTP id B61BE44D117
 for ; Fri, 16 Nov 2018 14:57:21 +0200 (EET)
Received: from [127.0.1.1] (localhost [127.0.0.1])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 37AFE68A28A;
 Fri, 16 Nov 2018 14:57:22 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mout.gmx.net (mout.gmx.net [212.227.15.15])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 3EEC268A209
 for ; Fri, 16 Nov 2018 14:57:15 +0200 (EET)
Received: from Valinor ([84.250.81.169])
 by mail.gmx.com (mrgmx002 [212.227.17.184]) with ESMTPSA (Nemesis)
 id 0MYOkT-1g1vxr3ir3-00V99P
 for ; Fri, 16 Nov 2018 13:57:15 +0100
Date: Fri, 16 Nov 2018 14:59:38 +0200
From: Lauri Kasanen
To: ffmpeg-devel@ffmpeg.org
Message-Id: <20181116145938.07a893157e57837fc3e2314c@gmx.com>
X-Mailer: Sylpheed 3.5.0 (GTK+ 2.18.6; x86_64-unknown-linux-gnu)
Mime-Version: 1.0
X-Spam-Flag: NO
Subject: [FFmpeg-devel] [PATCH] swscale/output: Altivec-optimize yuv2plane1_8
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: FFmpeg development discussions and patches
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"

./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
-f null -vframes 100 -v error -nostats -

1158 UNITS in planar1,   65528 runs,      8 skips

-cpuflags 0

19082 UNITS in planar1,   65533 runs,      3 skips

16.48 speedup ratio. On x86, SSE2 is ~7. Curiously, the Power C version
takes as many cycles as the x86 SSE2 version, yikes it's fast.

Note that this function uses VSX instructions, but is not marked so.
This is because several existing functions also make that mistake.
I'll submit a patch moving them all once this is reviewed.

No BE support since I can only test LE. LE is however the common
case for POWER8 and POWER9.

Signed-off-by: Lauri Kasanen
---
 libswscale/ppc/swscale_altivec.c | 55 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/libswscale/ppc/swscale_altivec.c b/libswscale/ppc/swscale_altivec.c
index 2fb2337..a064016 100644
--- a/libswscale/ppc/swscale_altivec.c
+++ b/libswscale/ppc/swscale_altivec.c
@@ -324,6 +324,53 @@ static void hScale_altivec_real(SwsContext *c, int16_t *dst, int dstW,
         }
     }
 }
+
+static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW,
+                           const uint8_t *dither, int offset, int start)
+{
+    int i;
+    for (i = start; i < dstW; i++) {
+        int val = (src[i] + dither[(i + offset) & 7]) >> 7;
+        dest[i] = av_clip_uint8(val);
+    }
+}
+
+static void yuv2plane1_8_altivec(const int16_t *src, uint8_t *dest, int dstW,
+                                 const uint8_t *dither, int offset)
+{
+    const int dst_u = -(uintptr_t)dest & 15;
+    int i, j;
+    LOCAL_ALIGNED(16, int16_t, val, [16]);
+    const vector uint16_t shifts = (vector uint16_t) {7, 7, 7, 7, 7, 7, 7, 7};
+    vector int16_t vi, vileft, ditherleft, ditherright;
+    vector uint8_t vd;
+
+    for (j = 0; j < 16; j++) {
+        val[j] = dither[(dst_u + offset + j) & 7];
+    }
+
+    ditherleft = vec_ld(0, val);
+    ditherright = vec_ld(0, &val[8]);
+
+    yuv2plane1_8_u(src, dest, dst_u, dither, offset, 0);
+
+    for (i = dst_u; i < dstW - 15; i += 16) {
+
+        vi = vec_vsx_ld(0, &src[i]);
+        vi = vec_adds(ditherleft, vi);
+        vileft = vec_sra(vi, shifts);
+
+        vi = vec_vsx_ld(0, &src[i + 8]);
+        vi = vec_adds(ditherright, vi);
+        vi = vec_sra(vi, shifts);
+
+        vd = vec_packsu(vileft, vi);
+        vec_st(vd, 0, &dest[i]);
+    }
+
+    yuv2plane1_8_u(src, dest, dstW, dither, offset, i);
+}
+
 #endif /* HAVE_ALTIVEC */
 
 av_cold void ff_sws_init_swscale_ppc(SwsContext *c)
@@ -367,6 +414,14 @@ av_cold void ff_sws_init_swscale_ppc(SwsContext *c)
             c->yuv2packedX = ff_yuv2rgb24_X_altivec;
             break;
         }
+
+        switch (c->dstBpc) {
+        case 8:
+#if !HAVE_BIGENDIAN
+            c->yuv2plane1 = yuv2plane1_8_altivec;
+            break;
+#endif /* !HAVE_BIGENDIAN */
+        }
     }
 #endif /* HAVE_ALTIVEC */
 }
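
For readers without POWER hardware, the head / 16-wide body / tail split used
by yuv2plane1_8_altivec above can be modelled in portable C. This is only a
sketch and not part of the patch. The names plane1_8_ref, plane1_8_blocked,
adds_i16 and clip_uint8 are hypothetical. adds_i16 stands in for one lane of
vec_adds, and the clip after the shift stands in for vec_packsu. The sketch
also clamps the unaligned head to dstW, which the patch itself does not do,
and it assumes that >> on a negative int is an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an int to 0..255, in the spirit of av_clip_uint8(). */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Saturating signed 16-bit add; models one lane of vec_adds. */
static int16_t adds_i16(int16_t a, int16_t b)
{
    int s = a + b;
    return s > INT16_MAX ? INT16_MAX : s < INT16_MIN ? INT16_MIN : (int16_t)s;
}

/* Scalar reference with the same arithmetic as yuv2plane1_8_u.
 * Assumption: >> on a negative int shifts arithmetically. */
static void plane1_8_ref(const int16_t *src, uint8_t *dest, int dstW,
                         const uint8_t *dither, int offset, int start)
{
    for (int i = start; i < dstW; i++) {
        int val = (src[i] + dither[(i + offset) & 7]) >> 7;
        dest[i] = clip_uint8(val);
    }
}

/* Hypothetical portable model of the head / body / tail structure. */
static void plane1_8_blocked(const int16_t *src, uint8_t *dest, int dstW,
                             const uint8_t *dither, int offset)
{
    /* Number of pixels until dest reaches a 16-byte boundary. */
    int head = (int)(-(uintptr_t)dest & 15);
    int16_t dtab[16];
    int i, j;

    assert(dstW >= 0);
    if (head > dstW)
        head = dstW; /* clamp added in this sketch only */

    /* Every body block starts a multiple of 16 pixels after the head,
     * and 16 & 7 == 0, so one 16-entry dither table serves all blocks. */
    for (j = 0; j < 16; j++)
        dtab[j] = dither[(head + offset + j) & 7];

    /* Unaligned head, one pixel at a time. */
    plane1_8_ref(src, dest, head, dither, offset, 0);

    /* Body: 16 pixels per iteration, one inner iteration per vector lane. */
    for (i = head; i < dstW - 15; i += 16) {
        for (j = 0; j < 16; j++) {
            int16_t v = adds_i16(dtab[j], src[i + j]);
            dest[i + j] = clip_uint8(v >> 7);
        }
    }

    /* Tail: whatever is left after the last full block. */
    plane1_8_ref(src, dest, dstW, dither, offset, i);
}
```

The saturating add does not change the result relative to the scalar path:
any sum above 32767 shifts to at least 256 in the scalar code and is clipped
to 255, while the saturated value 32767 shifts to exactly 255.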