From patchwork Fri Nov 16 12:59:38 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lauri Kasanen <cand@gmx.com>
X-Patchwork-Id: 11039
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
X-Original-To: patchwork@ffaux-bg.ffmpeg.org
Delivered-To: patchwork@ffaux-bg.ffmpeg.org
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by ffaux.localdomain (Postfix) with ESMTP id B61BE44D117
	for <patchwork@ffaux-bg.ffmpeg.org>;
	Fri, 16 Nov 2018 14:57:21 +0200 (EET)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 37AFE68A28A;
	Fri, 16 Nov 2018 14:57:22 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mout.gmx.net (mout.gmx.net [212.227.15.15])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 3EEC268A209
	for <ffmpeg-devel@ffmpeg.org>; Fri, 16 Nov 2018 14:57:15 +0200 (EET)
Received: from Valinor ([84.250.81.169]) by mail.gmx.com (mrgmx002
	[212.227.17.184]) with ESMTPSA (Nemesis) id 0MYOkT-1g1vxr3ir3-00V99P
	for <ffmpeg-devel@ffmpeg.org>; Fri, 16 Nov 2018 13:57:15 +0100
Date: Fri, 16 Nov 2018 14:59:38 +0200
From: Lauri Kasanen <cand@gmx.com>
To: ffmpeg-devel@ffmpeg.org
Message-Id: <20181116145938.07a893157e57837fc3e2314c@gmx.com>
X-Mailer: Sylpheed 3.5.0 (GTK+ 2.18.6; x86_64-unknown-linux-gnu)
Subject: [FFmpeg-devel] [PATCH] swscale/output: Altivec-optimize yuv2plane1_8
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <http://ffmpeg.org/mailman/options/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <http://ffmpeg.org/pipermail/ffmpeg-devel/>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <http://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches
	<ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p \
-f null -vframes 100 -v error -nostats -

With the patch:
1158 UNITS in planar1,   65528 runs,      8 skips

With -cpuflags 0 (C version):
19082 UNITS in planar1,   65533 runs,      3 skips

That is a 16.48x speedup. For comparison, SSE2 gives about 7x on x86.
Curiously, the C version on Power takes about as many cycles as the
SSE2 version on x86.
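
The ratio above is simply the C-path timer value divided by the Altivec
one. A minimal sketch of that arithmetic, where speedup_ratio is a
hypothetical helper name used only for illustration and not part of the
patch:

```c
#include <assert.h>

/* Speedup of an optimized path over a baseline, given the average
 * timer units each one reports. Hypothetical helper for illustration. */
static double speedup_ratio(double baseline_units, double optimized_units)
{
    assert(optimized_units > 0.0);
    return baseline_units / optimized_units;
}
```

Calling speedup_ratio(19082, 1158) with the two measurements from this
message gives roughly 16.48.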

Note that this function uses VSX instructions (vec_vsx_ld) but is not
marked as VSX code. Several existing functions already make the same
mistake, so this one follows them for now. I'll submit a patch moving
them all once this is reviewed.

There is no big-endian support, since I can only test little-endian.
LE is, however, the common case for POWER8 and POWER9.

Signed-off-by: Lauri Kasanen <cand@gmx.com>
---
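
For reviewers without Power hardware, the structure of the new function
(scalar head up to a 16-byte aligned dest, 16-pixel blocks using a
dither table pre-rotated for the head offset, scalar tail) can be
modelled in portable C. This is only a sketch under assumptions:
clip_uint8 stands in for av_clip_uint8, adds16 models vec_adds, the
dither values are made up, and every function name here is hypothetical
and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to 0..255; stands in for av_clip_uint8(). */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scalar reference, same arithmetic as yuv2plane1_8_u() in the patch.
 * Relies on arithmetic right shift of negative ints, as FFmpeg does. */
static void plane1_8_ref(const int16_t *src, uint8_t *dest, int dstW,
                         const uint8_t *dither, int offset, int start)
{
    for (int i = start; i < dstW; i++)
        dest[i] = clip_uint8((src[i] + dither[(i + offset) & 7]) >> 7);
}

/* Saturating signed 16-bit add, modelling vec_adds. */
static int16_t adds16(int16_t a, int16_t b)
{
    int s = a + b;
    return s > INT16_MAX ? INT16_MAX : s < INT16_MIN ? INT16_MIN : (int16_t)s;
}

/* Blocked model: head, 16-wide body, tail. */
static void plane1_8_blocked(const int16_t *src, uint8_t *dest, int dstW,
                             const uint8_t *dither, int offset)
{
    int head = (int)(-(uintptr_t)dest & 15); /* bytes until dest is aligned */
    int16_t rot[16];
    int i;

    /* Guard for rows shorter than the head (an assumption of this sketch). */
    if (head > dstW)
        head = dstW;

    /* The dither period (8) divides the block size (16), so one table
     * rotated by head + offset is valid for every block. */
    for (int j = 0; j < 16; j++)
        rot[j] = dither[(head + offset + j) & 7];

    plane1_8_ref(src, dest, head, dither, offset, 0);

    for (i = head; i < dstW - 15; i += 16)
        for (int j = 0; j < 16; j++)
            dest[i + j] = clip_uint8(adds16(src[i + j], rot[j]) >> 7);

    plane1_8_ref(src, dest, dstW, dither, offset, i);
}

/* Returns the number of bytes where the blocked model and the scalar
 * reference disagree, for a dest that is `misalign` bytes off alignment. */
static int plane1_8_selfcheck(int dstW, int offset, int misalign)
{
    static const uint8_t dither[8] = { 0, 96, 24, 120, 6, 102, 30, 126 };
    static _Alignas(16) uint8_t got_buf[256 + 16];
    static uint8_t want[256];
    static int16_t src[256];
    uint8_t *got = got_buf + misalign;
    int bad = 0;

    assert(dstW >= 0 && dstW <= 256 && misalign >= 0 && misalign < 16);

    for (int i = 0; i < 256; i++)
        src[i] = (int16_t)((i * 2749) % 65536 - 32768);
    src[1]  = INT16_MAX;   /* exercises saturation in the head */
    src[2]  = INT16_MIN;
    src[40] = INT16_MAX;   /* exercises saturation in a block */

    plane1_8_ref(src, want, dstW, dither, offset, 0);
    plane1_8_blocked(src, got, dstW, dither, offset);

    for (int i = 0; i < dstW; i++)
        bad += got[i] != want[i];
    return bad;
}
```

plane1_8_selfcheck(100, 3, 5) should return 0. The saturating add and
the plain int add agree after clipping because the dither values are
non-negative, so any sum that saturates would clip to 255 either way.
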
 libswscale/ppc/swscale_altivec.c | 55 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/libswscale/ppc/swscale_altivec.c b/libswscale/ppc/swscale_altivec.c
index 2fb2337..a064016 100644
--- a/libswscale/ppc/swscale_altivec.c
+++ b/libswscale/ppc/swscale_altivec.c
@@ -324,6 +324,53 @@ static void hScale_altivec_real(SwsContext *c, int16_t *dst, int dstW,
             }
         }
 }
+
+static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW,
+                           const uint8_t *dither, int offset, int start)
+{
+    int i;
+    for (i = start; i < dstW; i++) {
+        int val = (src[i] + dither[(i + offset) & 7]) >> 7;
+        dest[i] = av_clip_uint8(val);
+    }
+}
+
+static void yuv2plane1_8_altivec(const int16_t *src, uint8_t *dest, int dstW,
+                           const uint8_t *dither, int offset)
+{
+    const int dst_u = -(uintptr_t)dest & 15;
+    int i, j;
+    LOCAL_ALIGNED(16, int16_t, val, [16]);
+    const vector uint16_t shifts = (vector uint16_t) {7, 7, 7, 7, 7, 7, 7, 7};
+    vector int16_t vi, vileft, ditherleft, ditherright;
+    vector uint8_t vd;
+
+    for (j = 0; j < 16; j++) {
+        val[j] = dither[(dst_u + offset + j) & 7];
+    }
+
+    ditherleft = vec_ld(0, val);
+    ditherright = vec_ld(0, &val[8]);
+
+    yuv2plane1_8_u(src, dest, FFMIN(dst_u, dstW), dither, offset, 0);
+
+    for (i = dst_u; i < dstW - 15; i += 16) {
+
+        vi = vec_vsx_ld(0, &src[i]);
+        vi = vec_adds(ditherleft, vi);
+        vileft = vec_sra(vi, shifts);
+
+        vi = vec_vsx_ld(0, &src[i + 8]);
+        vi = vec_adds(ditherright, vi);
+        vi = vec_sra(vi, shifts);
+
+        vd = vec_packsu(vileft, vi);
+        vec_st(vd, 0, &dest[i]);
+    }
+
+    yuv2plane1_8_u(src, dest, dstW, dither, offset, i);
+}
+
 #endif /* HAVE_ALTIVEC */
 
 av_cold void ff_sws_init_swscale_ppc(SwsContext *c)
@@ -367,6 +414,14 @@ av_cold void ff_sws_init_swscale_ppc(SwsContext *c)
             c->yuv2packedX = ff_yuv2rgb24_X_altivec;
             break;
         }
+
+        switch (c->dstBpc) {
+        case 8:
+#if !HAVE_BIGENDIAN
+            c->yuv2plane1 = yuv2plane1_8_altivec;
+#endif /* !HAVE_BIGENDIAN */
+            break;
+        }
     }
 #endif /* HAVE_ALTIVEC */
 }