From patchwork Sat Jan 12 08:47:50 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Lauri Kasanen X-Patchwork-Id: 11714 Return-Path: X-Original-To: patchwork@ffaux-bg.ffmpeg.org Delivered-To: patchwork@ffaux-bg.ffmpeg.org Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by ffaux.localdomain (Postfix) with ESMTP id DE25244E194 for ; Sat, 12 Jan 2019 10:45:22 +0200 (EET) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 08D8668A100; Sat, 12 Jan 2019 10:45:11 +0200 (EET) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from mout.gmx.net (mout.gmx.net [212.227.15.15]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 345EB680CCC for ; Sat, 12 Jan 2019 10:45:04 +0200 (EET) Received: from Valinor ([84.250.81.169]) by mail.gmx.com (mrgmx001 [212.227.17.184]) with ESMTPSA (Nemesis) id 0MMCFR-1gcTc91g69-0084AG for ; Sat, 12 Jan 2019 09:45:16 +0100 Date: Sat, 12 Jan 2019 10:47:50 +0200 From: Lauri Kasanen To: ffmpeg-devel@ffmpeg.org Message-Id: <20190112104750.b02878764a3c4d4ddd027810@gmx.com> X-Mailer: Sylpheed 3.5.0 (GTK+ 2.18.6; x86_64-unknown-linux-gnu) Mime-Version: 1.0 X-Provags-ID: V03:K1:Gboj7QD8pB0kEX4fsATIGxsYSVVcRGvajr8HcWjaMtZ8/NYA4YY BTWZ4ItuxJNQFZhtTrNtXFQe0cVIPaJ6AedrPnIMaUv33vVBcqWv63FNpwF+yjCkIvQsIko XzpwQjFrosSLdD+GfQaWItUMH+9/ANap5mkkyftpfwdxDBh9CyZrHnekZpp0TogHtpTLGUM lzSymgI3LtreOoASBLaVQ== X-Spam-Flag: NO X-UI-Out-Filterresults: notjunk:1; V03:K0:a8uK6u5l7Lc=:nqFwdGZ2a3+Wqy9gVA5z97 AjG9Yp4WxVnaV96qZ1Ue9rpUEJV2g9RlP8DxvVjZ4BfK1Cduk4I+SmwSqJ7FiFFxNlIjXtEUj GK4nGtSb5mWn7WmcPAauyv6v6pQakaGRa1b4njp8JMbgFniMWKXunVxLCW0A0yoHm/MZBH5vA RDctNLKlzlbzMbIRBoCXohye7j67Bw54wePPAGW0GMT1jGo8vLFpGvK5M7Hy++xSSS/fUo2Gm SueD1rtLmNrsCWBlTP1jXHvas5uxu3mpdXa7dUKWGtdgt2HYEEvpDyUrqQStrkk0I/u+xfKKp lbvWL9RmUgwp+zr9tlETNVa75txTbbMxkzPJy4/8wH09YpaiT+/866e1suziCYtMv/wQ3lu+P iDLxlgy85rQCit8xErp0BDIUEJ7X+f5vO6WY6Qd0SOaWgzw0Y3+AeSLcEKw184HkHNwyCQUr1 yJB3SfkUoXQUy5fEsTbDeCiKpdq95ttiTn6w/jQMakBNomwycwzeUPcNOwRsKfggJZGDN1/85 3wPiOUhSzEIj9vUS8cYZnnLv4NLjSESVfpHCy0x+QnD2oI9iUsJWGFXaKng2R1KKusNBJSriJ njk9rjIKJDiIIBuqQbV3sm5hMLfzVm1sulbjlg2bx5MjP0IEMuec5ziyBXJhfjH+w0uIMN3C0 FEsPU6X9lLl1vZd+ooI5pUzgZoDn8HvYQGIJRjLdSCGuVH6asPNrl2JqC4SlUAxClKnZI3lVT ABHvusyJbOHwqQZHUA215CQAts20gx+tZpcwxwn/mSw/GtN5OYKBpXKuyh8JJKx1Lu5hRDvIJ A8ajpVoyrHlZ4YFGmfxIMjG2lsD1ZuY7iIwaY5t3Wozitmmq8gvf5eOE8opUkS+ycu/eLsYxz vrVOn2M4aZb1meA2O5lr7vJ1ildb6+SZMCxUMrJDpCYM4ZTaqa951Okqzges3r Subject: [FFmpeg-devel] [PATCH v5] libswscale/ppc: VSX-optimize 9-16 bit yuv2planeX X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" ./ffmpeg_g -f rawvideo -pix_fmt rgb24 -s hd1080 -i /dev/zero -pix_fmt yuv420p16be \ -s 1920x1728 -f null -vframes 100 -v error -nostats - 9-14 bit funcs get about 6x speedup, 16-bit gets about 15x. Fate passes, each format tested with an image to video conversion. Only POWER8 includes 32-bit vector multiplies, so POWER7 is locked out of the 16-bit function. This includes the vec_mulo/mule functions too, not just vmuluwm. yuv420p9le 12341 UNITS in planarX, 130976 runs, 96 skips 73752 UNITS in planarX, 131066 runs, 6 skips yuv420p9be 12364 UNITS in planarX, 131025 runs, 47 skips 73001 UNITS in planarX, 131055 runs, 17 skips yuv420p10le 12386 UNITS in planarX, 131042 runs, 30 skips 72735 UNITS in planarX, 131062 runs, 10 skips yuv420p10be 12337 UNITS in planarX, 131045 runs, 27 skips 72734 UNITS in planarX, 131057 runs, 15 skips yuv420p12le 12236 UNITS in planarX, 131058 runs, 14 skips 73029 UNITS in planarX, 131062 runs, 10 skips yuv420p12be 12218 UNITS in planarX, 130973 runs, 99 skips 72402 UNITS in planarX, 131069 runs, 3 skips yuv420p14le 12168 UNITS in planarX, 131067 runs, 5 skips 72480 UNITS in planarX, 131069 runs, 3 skips yuv420p14be 12358 UNITS in planarX, 130948 runs, 124 skips 73772 UNITS in planarX, 131063 runs, 9 skips yuv420p16le 10439 UNITS in planarX, 130911 runs, 161 skips 157923 UNITS in planarX, 131068 runs, 4 skips yuv420p16be 10463 UNITS in planarX, 130874 runs, 198 skips 154405 UNITS in planarX, 131061 runs, 11 skips Signed-off-by: Lauri Kasanen --- libswscale/ppc/swscale_ppc_template.c | 4 +- libswscale/ppc/swscale_vsx.c | 186 +++++++++++++++++++++++++++++++++- 2 files changed, 184 insertions(+), 6 deletions(-) v2: Separate macros so that yuv2plane1_16_vsx remains available for power7 v3: Remove accidental tabs, switch to HAVE_POWER8 from configure + runtime check v4: #if HAVE_POWER8 v5: Get rid of the mul #if, turns out gcc vec_mul works diff --git a/libswscale/ppc/swscale_ppc_template.c b/libswscale/ppc/swscale_ppc_template.c index 00e4b99..11decab 100644 --- a/libswscale/ppc/swscale_ppc_template.c +++ b/libswscale/ppc/swscale_ppc_template.c @@ -21,7 +21,7 @@ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA */ -static void FUNC(yuv2planeX_16)(const int16_t *filter, int filterSize, +static void FUNC(yuv2planeX_8_16)(const int16_t *filter, int filterSize, const int16_t **src, uint8_t *dest, const uint8_t *dither, int offset, int x) { @@ -88,7 +88,7 @@ static void FUNC(yuv2planeX)(const int16_t *filter, int filterSize, yuv2planeX_u(filter, filterSize, src, dest, dst_u, dither, offset, 0); for (i = dst_u; i < dstW - 15; i += 16) - FUNC(yuv2planeX_16)(filter, filterSize, src, dest + i, dither, + FUNC(yuv2planeX_8_16)(filter, filterSize, src, dest + i, dither, offset, i); yuv2planeX_u(filter, filterSize, src, dest, dstW, dither, offset, i); diff --git a/libswscale/ppc/swscale_vsx.c b/libswscale/ppc/swscale_vsx.c index 70da6ae..f6c7f1d 100644 --- a/libswscale/ppc/swscale_vsx.c +++ b/libswscale/ppc/swscale_vsx.c @@ -83,6 +83,8 @@ #include "swscale_ppc_template.c" #undef FUNC +#undef vzero + #endif /* !HAVE_BIGENDIAN */ static void yuv2plane1_8_u(const int16_t *src, uint8_t *dest, int dstW, @@ -180,6 +182,76 @@ static void yuv2plane1_nbps_vsx(const int16_t *src, uint16_t *dest, int dstW, yuv2plane1_nbps_u(src, dest, dstW, big_endian, output_bits, i); } +static void yuv2planeX_nbps_u(const int16_t *filter, int filterSize, + const int16_t **src, uint16_t *dest, int dstW, + int big_endian, int output_bits, int start) +{ + int i; + int shift = 11 + 16 - output_bits; + + for (i = start; i < dstW; i++) { + int val = 1 << (shift - 1); + int j; + + for (j = 0; j < filterSize; j++) + val += src[j][i] * filter[j]; + + output_pixel(&dest[i], val); + } +} + +static void yuv2planeX_nbps_vsx(const int16_t *filter, int filterSize, + const int16_t **src, uint16_t *dest, int dstW, + int big_endian, int output_bits) +{ + const int dst_u = -(uintptr_t)dest & 7; + const int shift = 11 + 16 - output_bits; + const int add = (1 << (shift - 1)); + const int clip = (1 << output_bits) - 1; + const uint16_t swap = big_endian ? 8 : 0; + const vector uint32_t vadd = (vector uint32_t) {add, add, add, add}; + const vector uint32_t vshift = (vector uint32_t) {shift, shift, shift, shift}; + const vector uint16_t vswap = (vector uint16_t) {swap, swap, swap, swap, swap, swap, swap, swap}; + const vector uint16_t vlargest = (vector uint16_t) {clip, clip, clip, clip, clip, clip, clip, clip}; + const vector int16_t vzero = vec_splat_s16(0); + const vector uint8_t vperm = (vector uint8_t) {0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15}; + vector int16_t vfilter[MAX_FILTER_SIZE], vin; + vector uint16_t v; + vector uint32_t vleft, vright, vtmp; + int i, j; + + for (i = 0; i < filterSize; i++) { + vfilter[i] = (vector int16_t) {filter[i], filter[i], filter[i], filter[i], + filter[i], filter[i], filter[i], filter[i]}; + } + + yuv2planeX_nbps_u(filter, filterSize, src, dest, dst_u, big_endian, output_bits, 0); + + for (i = dst_u; i < dstW - 7; i += 8) { + vleft = vright = vadd; + + for (j = 0; j < filterSize; j++) { + vin = vec_vsx_ld(0, &src[j][i]); + vtmp = (vector uint32_t) vec_mule(vin, vfilter[j]); + vleft = vec_add(vleft, vtmp); + vtmp = (vector uint32_t) vec_mulo(vin, vfilter[j]); + vright = vec_add(vright, vtmp); + } + + vleft = vec_sra(vleft, vshift); + vright = vec_sra(vright, vshift); + v = vec_packsu(vleft, vright); + v = (vector uint16_t) vec_max((vector int16_t) v, vzero); + v = vec_min(v, vlargest); + v = vec_rl(v, vswap); + v = vec_perm(v, v, vperm); + vec_st(v, 0, &dest[i]); + } + + yuv2planeX_nbps_u(filter, filterSize, src, dest, dstW, big_endian, output_bits, i); +} + + #undef output_pixel #define output_pixel(pos, val, bias, signedness) \ @@ -234,7 +306,88 @@ static void yuv2plane1_16_vsx(const int32_t *src, uint16_t *dest, int dstW, yuv2plane1_16_u(src, dest, dstW, big_endian, output_bits, i); } +#if HAVE_POWER8 + +static void yuv2planeX_16_u(const int16_t *filter, int filterSize, + const int32_t **src, uint16_t *dest, int dstW, + int big_endian, int output_bits, int start) +{ + int i; + int shift = 15; + + for (i = start; i < dstW; i++) { + int val = 1 << (shift - 1); + int j; + + /* range of val is [0,0x7FFFFFFF], so 31 bits, but with lanczos/spline + * filters (or anything with negative coeffs, the range can be slightly + * wider in both directions. To account for this overflow, we subtract + * a constant so it always fits in the signed range (assuming a + * reasonable filterSize), and re-add that at the end. */ + val -= 0x40000000; + for (j = 0; j < filterSize; j++) + val += src[j][i] * (unsigned)filter[j]; + + output_pixel(&dest[i], val, 0x8000, int); + } +} + +static void yuv2planeX_16_vsx(const int16_t *filter, int filterSize, + const int32_t **src, uint16_t *dest, int dstW, + int big_endian, int output_bits) +{ + const int dst_u = -(uintptr_t)dest & 7; + const int shift = 15; + const int bias = 0x8000; + const int add = (1 << (shift - 1)) - 0x40000000; + const uint16_t swap = big_endian ? 8 : 0; + const vector uint32_t vadd = (vector uint32_t) {add, add, add, add}; + const vector uint32_t vshift = (vector uint32_t) {shift, shift, shift, shift}; + const vector uint16_t vswap = (vector uint16_t) {swap, swap, swap, swap, swap, swap, swap, swap}; + const vector uint16_t vbias = (vector uint16_t) {bias, bias, bias, bias, bias, bias, bias, bias}; + vector int32_t vfilter[MAX_FILTER_SIZE]; + vector uint16_t v; + vector uint32_t vleft, vright, vtmp; + vector int32_t vin32l, vin32r; + int i, j; + + for (i = 0; i < filterSize; i++) { + vfilter[i] = (vector int32_t) {filter[i], filter[i], filter[i], filter[i]}; + } + + yuv2planeX_16_u(filter, filterSize, src, dest, dst_u, big_endian, output_bits, 0); + + for (i = dst_u; i < dstW - 7; i += 8) { + vleft = vright = vadd; + + for (j = 0; j < filterSize; j++) { + vin32l = vec_vsx_ld(0, &src[j][i]); + vin32r = vec_vsx_ld(0, &src[j][i + 4]); + + vtmp = (vector uint32_t) vec_mul(vin32l, vfilter[j]); + vleft = vec_add(vleft, vtmp); + vtmp = (vector uint32_t) vec_mul(vin32r, vfilter[j]); + vright = vec_add(vright, vtmp); + } + + vleft = vec_sra(vleft, vshift); + vright = vec_sra(vright, vshift); + v = (vector uint16_t) vec_packs((vector int32_t) vleft, (vector int32_t) vright); + v = vec_add(v, vbias); + v = vec_rl(v, vswap); + vec_st(v, 0, &dest[i]); + } + + yuv2planeX_16_u(filter, filterSize, src, dest, dstW, big_endian, output_bits, i); +} + +#endif /* HAVE_POWER8 */ + #define yuv2NBPS(bits, BE_LE, is_be, template_size, typeX_t) \ + yuv2NBPS1(bits, BE_LE, is_be, template_size, typeX_t) \ + yuv2NBPSX(bits, BE_LE, is_be, template_size, typeX_t) + +#define yuv2NBPS1(bits, BE_LE, is_be, template_size, typeX_t) \ static void yuv2plane1_ ## bits ## BE_LE ## _vsx(const int16_t *src, \ uint8_t *dest, int dstW, \ const uint8_t *dither, int offset) \ @@ -243,6 +396,16 @@ static void yuv2plane1_ ## bits ## BE_LE ## _vsx(const int16_t *src, \ (uint16_t *) dest, dstW, is_be, bits); \ } +#define yuv2NBPSX(bits, BE_LE, is_be, template_size, typeX_t) \ +static void yuv2planeX_ ## bits ## BE_LE ## _vsx(const int16_t *filter, int filterSize, \ + const int16_t **src, uint8_t *dest, int dstW, \ + const uint8_t *dither, int offset)\ +{ \ + yuv2planeX_## template_size ## _vsx(filter, \ + filterSize, (const typeX_t **) src, \ + (uint16_t *) dest, dstW, is_be, bits); \ +} + yuv2NBPS( 9, BE, 1, nbps, int16_t) yuv2NBPS( 9, LE, 0, nbps, int16_t) yuv2NBPS(10, BE, 1, nbps, int16_t) @@ -251,8 +414,13 @@ yuv2NBPS(12, BE, 1, nbps, int16_t) yuv2NBPS(12, LE, 0, nbps, int16_t) yuv2NBPS(14, BE, 1, nbps, int16_t) yuv2NBPS(14, LE, 0, nbps, int16_t) -yuv2NBPS(16, BE, 1, 16, int32_t) -yuv2NBPS(16, LE, 0, 16, int32_t) + +yuv2NBPS1(16, BE, 1, 16, int32_t) +yuv2NBPS1(16, LE, 0, 16, int32_t) +#if HAVE_POWER8 +yuv2NBPSX(16, BE, 1, 16, int32_t) +yuv2NBPSX(16, LE, 0, 16, int32_t) +#endif #endif /* !HAVE_BIGENDIAN */ @@ -262,8 +430,9 @@ av_cold void ff_sws_init_swscale_vsx(SwsContext *c) { #if HAVE_VSX enum AVPixelFormat dstFormat = c->dstFormat; + const int cpu_flags = av_get_cpu_flags(); - if (!(av_get_cpu_flags() & AV_CPU_FLAG_VSX)) + if (!(cpu_flags & AV_CPU_FLAG_VSX)) return; #if !HAVE_BIGENDIAN @@ -286,20 +455,29 @@ av_cold void ff_sws_init_swscale_vsx(SwsContext *c) #if !HAVE_BIGENDIAN case 9: c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_9BE_vsx : yuv2plane1_9LE_vsx; + c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_9BE_vsx : yuv2planeX_9LE_vsx; break; case 10: c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_10BE_vsx : yuv2plane1_10LE_vsx; + c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_10BE_vsx : yuv2planeX_10LE_vsx; break; case 12: c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_12BE_vsx : yuv2plane1_12LE_vsx; + c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_12BE_vsx : yuv2planeX_12LE_vsx; break; case 14: c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_14BE_vsx : yuv2plane1_14LE_vsx; + c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_14BE_vsx : yuv2planeX_14LE_vsx; break; case 16: c->yuv2plane1 = isBE(dstFormat) ? yuv2plane1_16BE_vsx : yuv2plane1_16LE_vsx; +#if HAVE_POWER8 + if (cpu_flags & AV_CPU_FLAG_POWER8) { + c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_16BE_vsx : yuv2planeX_16LE_vsx; + } +#endif /* HAVE_POWER8 */ break; -#endif +#endif /* !HAVE_BIGENDIAN */ } } #endif /* HAVE_VSX */