From patchwork Fri Jan  4 19:43:51 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lauri Kasanen <cand@gmx.com>
X-Patchwork-Id: 11652
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
X-Original-To: patchwork@ffaux-bg.ffmpeg.org
Delivered-To: patchwork@ffaux-bg.ffmpeg.org
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by ffaux.localdomain (Postfix) with ESMTP id ED61444D7D2
	for <patchwork@ffaux-bg.ffmpeg.org>;
	Fri,  4 Jan 2019 21:46:32 +0200 (EET)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id A2869689EC0;
	Fri,  4 Jan 2019 21:46:29 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mout.gmx.net (mout.gmx.net [212.227.15.18])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 70BF4689E50
	for <ffmpeg-devel@ffmpeg.org>; Fri,  4 Jan 2019 21:46:23 +0200 (EET)
Received: from Valinor ([84.250.81.169]) by mail.gmx.com (mrgmx002
	[212.227.17.184]) with ESMTPSA (Nemesis) id 0Leux5-1h5zX10BVH-00qfLe
	for <ffmpeg-devel@ffmpeg.org>; Fri, 04 Jan 2019 20:41:24 +0100
Date: Fri, 4 Jan 2019 21:43:51 +0200
From: Lauri Kasanen <cand@gmx.com>
To: ffmpeg-devel@ffmpeg.org
Message-Id: <20190104214351.fed5441d2d08053d2151f26c@gmx.com>
X-Mailer: Sylpheed 3.5.0 (GTK+ 2.18.6; x86_64-unknown-linux-gnu)
Subject: [FFmpeg-devel] [PATCH] swscale/output: VSX-optimize 9-16 bit
	yuv2planeX
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <http://ffmpeg.org/mailman/options/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <http://ffmpeg.org/pipermail/ffmpeg-devel/>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <http://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches
	<ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \
-s 1920x1728 -f null -vframes 100 -v error -nostats -

The 9-14 bit functions get about a 6x speedup; the 16-bit function gets about 15x.
FATE passes, and each format was tested with an image-to-video conversion.

Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out
of the 16-bit function. This applies to the 32-bit vec_mulo/vec_mule
variants as well, not just vmuluwm.

Timings per format (first line with this patch, second line without):

yuv420p9le
  12341 UNITS in planarX,  130976 runs,     96 skips
  73752 UNITS in planarX,  131066 runs,      6 skips
yuv420p9be
  12364 UNITS in planarX,  131025 runs,     47 skips
  73001 UNITS in planarX,  131055 runs,     17 skips
yuv420p10le
  12386 UNITS in planarX,  131042 runs,     30 skips
  72735 UNITS in planarX,  131062 runs,     10 skips
yuv420p10be
  12337 UNITS in planarX,  131045 runs,     27 skips
  72734 UNITS in planarX,  131057 runs,     15 skips
yuv420p12le
  12236 UNITS in planarX,  131058 runs,     14 skips
  73029 UNITS in planarX,  131062 runs,     10 skips
yuv420p12be
  12218 UNITS in planarX,  130973 runs,     99 skips
  72402 UNITS in planarX,  131069 runs,      3 skips
yuv420p14le
  12168 UNITS in planarX,  131067 runs,      5 skips
  72480 UNITS in planarX,  131069 runs,      3 skips
yuv420p14be
  12358 UNITS in planarX,  130948 runs,    124 skips
  73772 UNITS in planarX,  131063 runs,      9 skips
yuv420p16le
  10439 UNITS in planarX,  130911 runs,    161 skips
 157923 UNITS in planarX,  131068 runs,      4 skips
yuv420p16be
  10463 UNITS in planarX,  130874 runs,    198 skips
 154405 UNITS in planarX,  131061 runs,     11 skips

Signed-off-by: Lauri Kasanen <cand@gmx.com>
---

The existing 16-bit VSX yuv2plane1 is also ifdefed out for POWER7, even though it
works there. This is mainly for cleanliness; separating the macros would be a bit
uglier. If there are POWER7 users who need that one, please speak up.

 libswscale/ppc/swscale_ppc_template.c |   4 +-
 libswscale/ppc/swscale_vsx.c          | 177 +++++++++++++++++++++++++++++++++-
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/libswscale/ppc/swscale_ppc_template.c b/libswscale/ppc/swscale_ppc_template.c
index 00e4b99..11decab 100644
--- a/libswscale/ppc/swscale_ppc_template.c
+++ b/libswscale/ppc/swscale_ppc_template.c
@@ -21,7 +21,7 @@
  * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
  */
 
-static void FUNC(yuv2planeX_16)(const int16_t *filter, int filterSize,
+static void FUNC(yuv2planeX_8_16)(const int16_t *filter, int filterSize,
                                   const int16_t **src, uint8_t *dest,
                                   const uint8_t *dither, int offset, int x)
 {
@@ -88,7 +88,7 @@ static void FUNC(yuv2planeX)(const int16_t *filter, int filterSize,
     yuv2planeX_u(filter, filterSize, src, dest, dst_u, dither, offset, 0);
 
     for (i = dst_u; i < dstW - 15; i += 16)
-        FUNC(yuv2planeX_16)(filter, filterSize, src, dest + i, dither,
+        FUNC(yuv2planeX_8_16)(filter, filterSize, src, dest + i, dither,
                               offset, i);
 
     yuv2planeX_u(filter, filterSize, src, dest, dstW, dither, offset, i);
diff --git a/libswscale/ppc/swscale_vsx.c b/libswscale/ppc/swscale_vsx.c
index 70da6ae..baca36c 100644
--- a/libswscale/ppc/swscale_vsx.c
+++ b/libswscale/ppc/swscale_vsx.c
@@ -83,6 +83,8 @@
 #include "swscale_ppc_template.c"
 #undef FUNC
 
+#undef vzero
+
 #endif /* !HAVE_BIGENDIAN */
 
 static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW,
@@ -180,6 +182,76 @@ static void yuv2plane1_nbps_vsx(const int16_t *src, uint16_t *dest, int dstW,
     yuv2plane1_nbps_u(src, dest, dstW, big_endian, output_bits, i);
 }
 
+static void yuv2planeX_nbps_u(const int16_t *filter, int filterSize,
+                              const int16_t **src, uint16_t *dest, int dstW,
+                              int big_endian, int output_bits, int start)
+{
+    int i;
+    int shift = 11 + 16 - output_bits;
+
+    for (i = start; i < dstW; i++) {
+        int val = 1 << (shift - 1);
+        int j;
+
+        for (j = 0; j < filterSize; j++)
+            val += src[j][i] * filter[j];
+
+        output_pixel(&dest[i], val);
+    }
+}
+
+static void yuv2planeX_nbps_vsx(const int16_t *filter, int filterSize,
+                                const int16_t **src, uint16_t *dest, int dstW,
+                                int big_endian, int output_bits)
+{
+    const int dst_u = -(uintptr_t)dest & 7;
+    const int shift = 11 + 16 - output_bits;
+    const int add = (1 << (shift - 1));
+    const int clip = (1 << output_bits) - 1;
+    const uint16_t swap = big_endian ? 8 : 0;
+    const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
+    const vector uint32_t vshift = (vector uint32_t) {shift, shift, shift, shift};
+    const vector uint16_t vswap = (vector uint16_t) {swap, swap, swap, swap, swap, swap, swap, swap};
+    const vector uint16_t vlargest = (vector uint16_t) {clip, clip, clip, clip, clip, clip, clip, clip};
+    const vector int16_t vzero = vec_splat_s16(0);
+    const vector uint8_t vperm = (vector uint8_t) {0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15};
+    vector int16_t vfilter[MAX_FILTER_SIZE], vin;
+    vector uint16_t v;
+    vector uint32_t vleft, vright, vtmp;
+    int i, j;
+
+    for (i = 0; i < filterSize; i++) {
+        vfilter[i] = (vector int16_t) {filter[i], filter[i], filter[i], filter[i],
+                                       filter[i], filter[i], filter[i], filter[i]};
+    }
+
+    yuv2planeX_nbps_u(filter, filterSize, src, dest, dst_u, big_endian, output_bits, 0);
+
+    for (i = dst_u; i < dstW - 7; i += 8) {
+        vleft = vright = vadd;
+
+        for (j = 0; j < filterSize; j++) {
+            vin = vec_vsx_ld(0, &src[j][i]);
+            vtmp = (vector uint32_t) vec_mule(vin, vfilter[j]);
+            vleft = vec_add(vleft, vtmp);
+            vtmp = (vector uint32_t) vec_mulo(vin, vfilter[j]);
+            vright = vec_add(vright, vtmp);
+        }
+
+        vleft = vec_sra(vleft, vshift);
+        vright = vec_sra(vright, vshift);
+        v = vec_packsu(vleft, vright);
+        v = (vector uint16_t) vec_max((vector int16_t) v, vzero);
+        v = vec_min(v, vlargest);
+        v = vec_rl(v, vswap);
+        v = vec_perm(v, v, vperm);
+        vec_st(v, 0, &dest[i]);
+    }
+
+    yuv2planeX_nbps_u(filter, filterSize, src, dest, dstW, big_endian, output_bits, i);
+}
+
+
 #undef output_pixel
 
 #define output_pixel(pos, val, bias, signedness) \
@@ -234,6 +306,92 @@ static void yuv2plane1_16_vsx(const int32_t *src, uint16_t *dest, int dstW,
     yuv2plane1_16_u(src, dest, dstW, big_endian, output_bits, i);
 }
 
+#ifdef __POWER8_VECTOR__
+
+static void yuv2planeX_16_u(const int16_t *filter, int filterSize,
+                            const int32_t **src, uint16_t *dest, int dstW,
+                            int big_endian, int output_bits, int start)
+{
+    int i;
+    int shift = 15;
+
+    for (i = start; i < dstW; i++) {
+        int val = 1 << (shift - 1);
+        int j;
+
+        /* range of val is [0,0x7FFFFFFF], so 31 bits, but with lanczos/spline
+         * filters (or anything with negative coeffs), the range can be slightly
+         * wider in both directions. To account for this overflow, we subtract
+         * a constant so it always fits in the signed range (assuming a
+         * reasonable filterSize), and re-add that at the end. */
+        val -= 0x40000000;
+        for (j = 0; j < filterSize; j++)
+            val += src[j][i] * (unsigned)filter[j];
+
+        output_pixel(&dest[i], val, 0x8000, int);
+    }
+}
+
+static void yuv2planeX_16_vsx(const int16_t *filter, int filterSize,
+                              const int32_t **src, uint16_t *dest, int dstW,
+                              int big_endian, int output_bits)
+{
+    const int dst_u = -(uintptr_t)dest & 7;
+    const int shift = 15;
+    const int bias = 0x8000;
+    const int add = (1 << (shift - 1)) - 0x40000000;
+    const uint16_t swap = big_endian ? 8 : 0;
+    const vector uint32_t vadd = (vector uint32_t) {add, add, add, add};
+    const vector uint32_t vshift = (vector uint32_t) {shift, shift, shift, shift};
+    const vector uint16_t vswap = (vector uint16_t) {swap, swap, swap, swap, swap, swap, swap, swap};
+    const vector uint16_t vbias = (vector uint16_t) {bias, bias, bias, bias, bias, bias, bias, bias};
+    vector int32_t vfilter[MAX_FILTER_SIZE];
+    vector uint16_t v;
+    vector uint32_t vleft, vright, vtmp;
+    vector int32_t vin32l, vin32r;
+    int i, j;
+
+    for (i = 0; i < filterSize; i++) {
+        vfilter[i] = (vector int32_t) {filter[i], filter[i], filter[i], filter[i]};
+    }
+
+    yuv2planeX_16_u(filter, filterSize, src, dest, dst_u, big_endian, output_bits, 0);
+
+    for (i = dst_u; i < dstW - 7; i += 8) {
+        vleft = vright = vadd;
+
+        for (j = 0; j < filterSize; j++) {
+            vin32l = vec_vsx_ld(0, &src[j][i]);
+            vin32r = vec_vsx_ld(0, &src[j][i + 4]);
+
+#ifdef __GNUC__
+            // GCC does not provide an intrinsic for vmuluwm yet; a bug report is open.
+            __asm__("vmuluwm	%0, %1, %2" : "=v"(vtmp) : "v"(vin32l), "v"(vfilter[j]));
+            vleft = vec_add(vleft, vtmp);
+            __asm__("vmuluwm	%0, %1, %2" : "=v"(vtmp) : "v"(vin32r), "v"(vfilter[j]));
+            vright = vec_add(vright, vtmp);
+#else
+            // Untested; unclear which compilers support this. Copied from libsimdpp.
+            vtmp = vec_vmuluwm(vin32l, vfilter[j]);
+            vleft = vec_add(vleft, vtmp);
+            vtmp = vec_vmuluwm(vin32r, vfilter[j]);
+            vright = vec_add(vright, vtmp);
+#endif
+        }
+
+        vleft = vec_sra(vleft, vshift);
+        vright = vec_sra(vright, vshift);
+        v = (vector uint16_t) vec_packs((vector int32_t) vleft, (vector int32_t) vright);
+        v = vec_add(v, vbias);
+        v = vec_rl(v, vswap);
+        vec_st(v, 0, &dest[i]);
+    }
+
+    yuv2planeX_16_u(filter, filterSize, src, dest, dstW, big_endian, output_bits, i);
+}
+
+#endif /* __POWER8_VECTOR__ */
+
 #define yuv2NBPS(bits, BE_LE, is_be, template_size, typeX_t) \
 static void yuv2plane1_ ## bits ## BE_LE ## _vsx(const int16_t *src, \
                              uint8_t *dest, int dstW, \
@@ -241,6 +399,14 @@ static void yuv2plane1_ ## bits ## BE_LE ## _vsx(const int16_t *src, \
 { \
     yuv2plane1_ ## template_size ## _vsx((const typeX_t *) src, \
                          (uint16_t *) dest, dstW, is_be, bits); \
+}\
+static void yuv2planeX_ ## bits ## BE_LE ## _vsx(const int16_t *filter, int filterSize, \
+                              const int16_t **src, uint8_t *dest, int dstW, \
+                              const uint8_t *dither, int offset)\
+{ \
+    yuv2planeX_## template_size ## _vsx(filter, \
+                         filterSize, (const typeX_t **) src, \
+                         (uint16_t *) dest, dstW, is_be, bits); \
 }
 
 yuv2NBPS( 9, BE, 1, nbps, int16_t)
@@ -251,8 +417,10 @@ yuv2NBPS(12, BE, 1, nbps, int16_t)
 yuv2NBPS(12, LE, 0, nbps, int16_t)
 yuv2NBPS(14, BE, 1, nbps, int16_t)
 yuv2NBPS(14, LE, 0, nbps, int16_t)
+#ifdef __POWER8_VECTOR__
 yuv2NBPS(16, BE, 1, 16, int32_t)
 yuv2NBPS(16, LE, 0, 16, int32_t)
+#endif
 
 #endif /* !HAVE_BIGENDIAN */
 
@@ -286,20 +454,27 @@ av_cold void ff_sws_init_swscale_vsx(SwsContext *c)
 #if !HAVE_BIGENDIAN
         case 9:
             c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_9BE_vsx  : yuv2plane1_9LE_vsx;
+            c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_9BE_vsx  : yuv2planeX_9LE_vsx;
             break;
         case 10:
             c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_10BE_vsx  : yuv2plane1_10LE_vsx;
+            c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_10BE_vsx  : yuv2planeX_10LE_vsx;
             break;
         case 12:
             c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_12BE_vsx  : yuv2plane1_12LE_vsx;
+            c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_12BE_vsx  : yuv2planeX_12LE_vsx;
             break;
         case 14:
             c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_14BE_vsx  : yuv2plane1_14LE_vsx;
+            c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_14BE_vsx  : yuv2planeX_14LE_vsx;
             break;
+#ifdef __POWER8_VECTOR__
         case 16:
             c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_16BE_vsx  : yuv2plane1_16LE_vsx;
+            c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_16BE_vsx  : yuv2planeX_16LE_vsx;
             break;
-#endif
+#endif /* __POWER8_VECTOR__ */
+#endif /* !HAVE_BIGENDIAN */
         }
     }
 #endif /* HAVE_VSX */