From patchwork Mon Mar 25 07:23:39 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lauri Kasanen
X-Patchwork-Id: 12431
Date: Mon, 25 Mar 2019 09:23:39 +0200
From: Lauri Kasanen
To: ffmpeg-devel@ffmpeg.org
Message-Id: <20190325092339.d51b34872776cde9b20f2c52@gmx.com>
In-Reply-To: <20190324151110.068c9933101a7e439c17d1c6@gmx.com>
References: <20190324151110.068c9933101a7e439c17d1c6@gmx.com>
Subject: [FFmpeg-devel] [PATCH 3/3 v2] swscale/ppc: VSX-optimize yuv2422_X

./ffmpeg -f lavfi -i yuvtestsrc=duration=1:size=1200x1440 \
-s 1200x720 -f null -vframes 100 -pix_fmt $i -nostats \
-cpuflags 0 -v error -

7.2x speedup:

yuyv422
 126354 UNITS in yuv2packedX,   16384 runs,      0 skips
  16383 UNITS in yuv2packedX,   16382 runs,      2 skips
yvyu422
 117669 UNITS in yuv2packedX,   16384 runs,      0 skips
  16271 UNITS in yuv2packedX,   16379 runs,      5 skips
uyvy422
 117310 UNITS in yuv2packedX,   16384 runs,      0 skips
  16226 UNITS in yuv2packedX,   16382 runs,      2 skips

Signed-off-by: Lauri Kasanen
---
 libswscale/ppc/swscale_vsx.c | 104 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 104 insertions(+)

v2: Fix accidental tabs. No code changes

--
2.6.2

diff --git a/libswscale/ppc/swscale_vsx.c b/libswscale/ppc/swscale_vsx.c
index 1c4051b..69ec63d 100644
--- a/libswscale/ppc/swscale_vsx.c
+++ b/libswscale/ppc/swscale_vsx.c
@@ -726,6 +726,93 @@ write422(const vector int16_t vy1, const vector int16_t vy2,
     }
 }
 
+static av_always_inline void
+yuv2422_X_vsx_template(SwsContext *c, const int16_t *lumFilter,
+                     const int16_t **lumSrc, int lumFilterSize,
+                     const int16_t *chrFilter, const int16_t **chrUSrc,
+                     const int16_t **chrVSrc, int chrFilterSize,
+                     const int16_t **alpSrc, uint8_t *dest, int dstW,
+                     int y, enum AVPixelFormat target)
+{
+    int i, j;
+    vector int16_t vy1, vy2, vu, vv;
+    vector int32_t vy32[4], vu32[2], vv32[2], tmp, tmp2, tmp3, tmp4;
+    vector int16_t vlumFilter[MAX_FILTER_SIZE], vchrFilter[MAX_FILTER_SIZE];
+    const vector int32_t start = vec_splats(1 << 18);
+    const vector uint32_t shift19 = vec_splats(19U);
+
+    for (i = 0; i < lumFilterSize; i++)
+        vlumFilter[i] = vec_splats(lumFilter[i]);
+    for (i = 0; i < chrFilterSize; i++)
+        vchrFilter[i] = vec_splats(chrFilter[i]);
+
+    for (i = 0; i < ((dstW + 1) >> 1); i += 8) {
+        vy32[0] =
+        vy32[1] =
+        vy32[2] =
+        vy32[3] =
+        vu32[0] =
+        vu32[1] =
+        vv32[0] =
+        vv32[1] = start;
+
+        for (j = 0; j < lumFilterSize; j++) {
+            vv = vec_ld(0, &lumSrc[j][i * 2]);
+            tmp = vec_mule(vv, vlumFilter[j]);
+            tmp2 = vec_mulo(vv, vlumFilter[j]);
+            tmp3 = vec_mergeh(tmp, tmp2);
+            tmp4 = vec_mergel(tmp, tmp2);
+
+            vy32[0] = vec_adds(vy32[0], tmp3);
+            vy32[1] = vec_adds(vy32[1], tmp4);
+
+            vv = vec_ld(0, &lumSrc[j][(i + 4) * 2]);
+            tmp = vec_mule(vv, vlumFilter[j]);
+            tmp2 = vec_mulo(vv, vlumFilter[j]);
+            tmp3 = vec_mergeh(tmp, tmp2);
+            tmp4 = vec_mergel(tmp, tmp2);
+
+            vy32[2] = vec_adds(vy32[2], tmp3);
+            vy32[3] = vec_adds(vy32[3], tmp4);
+        }
+
+        for (j = 0; j < chrFilterSize; j++) {
+            vv = vec_ld(0, &chrUSrc[j][i]);
+            tmp = vec_mule(vv, vchrFilter[j]);
+            tmp2 = vec_mulo(vv, vchrFilter[j]);
+            tmp3 = vec_mergeh(tmp, tmp2);
+            tmp4 = vec_mergel(tmp, tmp2);
+
+            vu32[0] = vec_adds(vu32[0], tmp3);
+            vu32[1] = vec_adds(vu32[1], tmp4);
+
+            vv = vec_ld(0, &chrVSrc[j][i]);
+            tmp = vec_mule(vv, vchrFilter[j]);
+            tmp2 = vec_mulo(vv, vchrFilter[j]);
+            tmp3 = vec_mergeh(tmp, tmp2);
+            tmp4 = vec_mergel(tmp, tmp2);
+
+            vv32[0] = vec_adds(vv32[0], tmp3);
+            vv32[1] = vec_adds(vv32[1], tmp4);
+        }
+
+        for (j = 0; j < 4; j++) {
+            vy32[j] = vec_sra(vy32[j], shift19);
+        }
+        for (j = 0; j < 2; j++) {
+            vu32[j] = vec_sra(vu32[j], shift19);
+            vv32[j] = vec_sra(vv32[j], shift19);
+        }
+
+        vy1 = vec_packs(vy32[0], vy32[1]);
+        vy2 = vec_packs(vy32[2], vy32[3]);
+        vu = vec_packs(vu32[0], vu32[1]);
+        vv = vec_packs(vv32[0], vv32[1]);
+
+        write422(vy1, vy2, vu, vv, &dest[i * 4], target);
+    }
+}
+
 #define SETUP(x, buf0, buf1, alpha) { \
     x = vec_ld(0, buf0); \
     tmp = vec_mule(x, alpha); \
@@ -841,7 +928,21 @@ yuv2422_1_vsx_template(SwsContext *c, const int16_t *buf0,
     }
 }
 
+#define YUV2PACKEDWRAPPERX(name, base, ext, fmt) \
+static void name ## ext ## _X_vsx(SwsContext *c, const int16_t *lumFilter, \
+                                const int16_t **lumSrc, int lumFilterSize, \
+                                const int16_t *chrFilter, const int16_t **chrUSrc, \
+                                const int16_t **chrVSrc, int chrFilterSize, \
+                                const int16_t **alpSrc, uint8_t *dest, int dstW, \
+                                int y) \
+{ \
+    name ## base ## _X_vsx_template(c, lumFilter, lumSrc, lumFilterSize, \
+                                  chrFilter, chrUSrc, chrVSrc, chrFilterSize, \
+                                  alpSrc, dest, dstW, y, fmt); \
+}
+
 #define YUV2PACKEDWRAPPER2(name, base, ext, fmt) \
+YUV2PACKEDWRAPPERX(name, base, ext, fmt) \
 static void name ## ext ## _2_vsx(SwsContext *c, const int16_t *buf[2], \
                                 const int16_t *ubuf[2], const int16_t *vbuf[2], \
                                 const int16_t *abuf[2], uint8_t *dest, int dstW, \
@@ -976,14 +1077,17 @@ av_cold void ff_sws_init_swscale_vsx(SwsContext *c)
     case AV_PIX_FMT_YUYV422:
         c->yuv2packed1 = yuv2yuyv422_1_vsx;
         c->yuv2packed2 = yuv2yuyv422_2_vsx;
+        c->yuv2packedX = yuv2yuyv422_X_vsx;
     break;
     case AV_PIX_FMT_YVYU422:
         c->yuv2packed1 = yuv2yvyu422_1_vsx;
         c->yuv2packed2 = yuv2yvyu422_2_vsx;
+        c->yuv2packedX = yuv2yvyu422_X_vsx;
     break;
     case AV_PIX_FMT_UYVY422:
         c->yuv2packed1 = yuv2uyvy422_1_vsx;
         c->yuv2packed2 = yuv2uyvy422_2_vsx;
+        c->yuv2packedX = yuv2uyvy422_X_vsx;
     break;
     }
 }
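
For readers comparing against the vector loop, a scalar sketch of the per-pixel arithmetic may help. It is not part of the patch. The names yuv2yuyv422_X_ref and clip_u8 are hypothetical, and it handles only the YUYV byte order. It assumes that write422, whose body is not shown in this diff, saturates each value to 8 bits, and that >> on a negative int is an arithmetic shift, matching vec_sra. Each accumulator starts at 1 << 18 as the rounding term, sums source times filter coefficient over the taps, and is shifted right by 19.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to 0..255; stands in for the saturating packs assumed in write422. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference for the YUYV422 case of yuv2422_X.
 * One iteration produces two luma samples and one U/V pair (4 bytes). */
static void yuv2yuyv422_X_ref(const int16_t *lum_filter,
                              const int16_t **lum_src, int lum_filter_size,
                              const int16_t *chr_filter,
                              const int16_t **chr_u_src,
                              const int16_t **chr_v_src, int chr_filter_size,
                              uint8_t *dest, int dst_w)
{
    assert(dst_w >= 0);

    for (int i = 0; i < ((dst_w + 1) >> 1); i++) {
        /* 1 << 18 is the rounding term for the final >> 19. */
        int y1 = 1 << 18, y2 = 1 << 18, u = 1 << 18, v = 1 << 18;

        for (int j = 0; j < lum_filter_size; j++) {
            y1 += lum_src[j][i * 2]     * lum_filter[j];
            y2 += lum_src[j][i * 2 + 1] * lum_filter[j];
        }
        for (int j = 0; j < chr_filter_size; j++) {
            u += chr_u_src[j][i] * chr_filter[j];
            v += chr_v_src[j][i] * chr_filter[j];
        }

        /* YUYV byte order: Y0 U Y1 V. */
        dest[4 * i + 0] = clip_u8(y1 >> 19);
        dest[4 * i + 1] = clip_u8(u  >> 19);
        dest[4 * i + 2] = clip_u8(y2 >> 19);
        dest[4 * i + 3] = clip_u8(v  >> 19);
    }
}
```

The vector version in the patch computes the same sums for eight chroma positions (sixteen luma samples) per iteration, using vec_mule/vec_mulo plus vec_mergeh/vec_mergel to widen the 16-bit products to 32 bits in the original element order.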