From patchwork Fri Sep 22 00:10:01 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mateusz
X-Patchwork-Id: 5231
To: ffmpeg-devel@ffmpeg.org
From: Mateusz
Date: Fri, 22 Sep 2017 02:10:01 +0200
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0
Content-Language: en-US
Subject: [FFmpeg-devel] [PATCH] swscale_unscaled: fix DITHER_COPY macro, use it only for dst_depth == 8
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"

To reduce bit depth in planar YUV or gray pixel formats, ffmpeg uses the DITHER_COPY macro. Currently it makes images greener and leaves a visible dither pattern.

In my opinion there is no point in using dither tables for destination bit depth >= 9. We can use a simple down-shift, which is neutral in both full and limited range: the resulting images have the same brightness and the same colors.

For destination bit depth == 8 we could use a new bit-exact, precise DITHER_COPY macro (which is slower).

If speed is the only problem, I've attached a second patch with Intel intrinsics for x86_64 that makes the code faster (I don't see any Intel intrinsics in ffmpeg, so it's probably for testing only).

Please review.

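A small worked check of the down-shift claim above. This is only a sketch and not part of the patch; `shift_down` is a hypothetical helper name, and the limited-range levels used below (16 and 235 for 8-bit luma, scaled to the wider depth) are the usual nominal values, taken here as an assumption about the input.

```c
#include <stdint.h>

/* Hypothetical helper (not in the patch): reduce a sample from src_depth
 * bits to dst_depth bits with a plain right shift, as SHIFT_COPY does
 * per sample when no byte swapping is involved. */
static unsigned shift_down(unsigned v, unsigned src_depth, unsigned dst_depth)
{
    return v >> (src_depth - dst_depth);
}
```

For 16-bit to 10-bit, shift_down(0, 16, 10) gives 0 and shift_down(65535, 16, 10) gives 1023, so full-range black and white stay at the ends of the range. Likewise shift_down(16 << 8, 16, 10) gives 64 and shift_down(235 << 8, 16, 10) gives 940, the nominal limited-range luma levels at 10 bits.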
Mateusz

From a52417a3817ac774eb364bbef20c954a3d278d45 Mon Sep 17 00:00:00 2001
From: Mateusz
Date: Fri, 22 Sep 2017 01:22:59 +0200
Subject: [PATCH] swscale_unscaled: fix and speed up DITHER_COPY macro for x86_64, use it only for dst_depth == 8

---
 libswscale/swscale_unscaled.c | 185 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 150 insertions(+), 35 deletions(-)

diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c
index ef36aec..7d1cbed 100644
--- a/libswscale/swscale_unscaled.c
+++ b/libswscale/swscale_unscaled.c
@@ -35,6 +35,10 @@
 #include "libavutil/avassert.h"
 #include "libavutil/avconfig.h"
 
+#if ARCH_X86_64
+#include
+#endif
+
 DECLARE_ALIGNED(8, static const uint8_t, dithers)[8][8][8]={
 {
     {   0,  1,  0,  1,  0,  1,  0,  1,},
@@ -110,24 +114,6 @@ DECLARE_ALIGNED(8, static const uint8_t, dithers)[8][8][8]={
     { 112, 16,104,  8,118, 22,110, 14,},
 }};
 
-static const uint16_t dither_scale[15][16]={
-{    2,    3,    3,    5,    5,    5,    5,    5,    5,    5,    5,    5,    5,    5,    5,    5,},
-{    2,    3,    7,    7,   13,   13,   25,   25,   25,   25,   25,   25,   25,   25,   25,   25,},
-{    3,    3,    4,   15,   15,   29,   57,   57,   57,  113,  113,  113,  113,  113,  113,  113,},
-{    3,    4,    4,    5,   31,   31,   61,  121,  241,  241,  241,  241,  481,  481,  481,  481,},
-{    3,    4,    5,    5,    6,   63,   63,  125,  249,  497,  993,  993,  993,  993,  993, 1985,},
-{    3,    5,    6,    6,    6,    7,  127,  127,  253,  505, 1009, 2017, 4033, 4033, 4033, 4033,},
-{    3,    5,    6,    7,    7,    7,    8,  255,  255,  509, 1017, 2033, 4065, 8129,16257,16257,},
-{    3,    5,    6,    8,    8,    8,    8,    9,  511,  511, 1021, 2041, 4081, 8161,16321,32641,},
-{    3,    5,    7,    8,    9,    9,    9,    9,   10, 1023, 1023, 2045, 4089, 8177,16353,32705,},
-{    3,    5,    7,    8,   10,   10,   10,   10,   10,   11, 2047, 2047, 4093, 8185,16369,32737,},
-{    3,    5,    7,    8,   10,   11,   11,   11,   11,   11,   12, 4095, 4095, 8189,16377,32753,},
-{    3,    5,    7,    9,   10,   12,   12,   12,   12,   12,   12,   13, 8191, 8191,16381,32761,},
-{    3,    5,    7,    9,   10,   12,   13,   13,   13,   13,   13,   13,   14,16383,16383,32765,},
-{    3,    5,    7,    9,   10,   12,   14,   14,   14,   14,   14,   14,   14,   15,32767,32767,},
-{    3,    5,    7,    9,   11,   12,   14,   15,   15,   15,   15,   15,   15,   15,   16,65535,},
-};
-
 
 static void fillPlane(uint8_t *plane, int stride, int width, int height, int y,
                       uint8_t val)
@@ -1502,22 +1488,127 @@ static int packedCopyWrapper(SwsContext *c, const uint8_t *src[],
 }
 
 #define DITHER_COPY(dst, dstStride, src, srcStride, bswap, dbswap)\
-    uint16_t scale= dither_scale[dst_depth-1][src_depth-1];\
-    int shift= src_depth-dst_depth + dither_scale[src_depth-2][dst_depth-1];\
+    unsigned shift= src_depth-dst_depth, tmp;\
+    if (shiftonly) {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            for (j = 0; j < length-7; j+=8) {\
+                tmp = (bswap(src[j+0]) + dither[0])>>shift; dst[j+0] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+1]) + dither[1])>>shift; dst[j+1] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+2]) + dither[2])>>shift; dst[j+2] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+3]) + dither[3])>>shift; dst[j+3] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+4]) + dither[4])>>shift; dst[j+4] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+5]) + dither[5])>>shift; dst[j+5] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+6]) + dither[6])>>shift; dst[j+6] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+7]) + dither[7])>>shift; dst[j+7] = dbswap(tmp - (tmp>>dst_depth));\
+            }\
+            for (; j < length; j++) {\
+                tmp = (bswap(src[j]) + dither[j&7])>>shift; dst[j] = dbswap(tmp - (tmp>>dst_depth));\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
+        }\
+    } else {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            for (j = 0; j < length-7; j+=8) {\
+                tmp = bswap(src[j+0]); dst[j+0] = dbswap((tmp - (tmp>>dst_depth) + dither[0])>>shift);\
+                tmp = bswap(src[j+1]); dst[j+1] = dbswap((tmp - (tmp>>dst_depth) + dither[1])>>shift);\
+                tmp = bswap(src[j+2]); dst[j+2] = dbswap((tmp - (tmp>>dst_depth) + dither[2])>>shift);\
+                tmp = bswap(src[j+3]); dst[j+3] = dbswap((tmp - (tmp>>dst_depth) + dither[3])>>shift);\
+                tmp = bswap(src[j+4]); dst[j+4] = dbswap((tmp - (tmp>>dst_depth) + dither[4])>>shift);\
+                tmp = bswap(src[j+5]); dst[j+5] = dbswap((tmp - (tmp>>dst_depth) + dither[5])>>shift);\
+                tmp = bswap(src[j+6]); dst[j+6] = dbswap((tmp - (tmp>>dst_depth) + dither[6])>>shift);\
+                tmp = bswap(src[j+7]); dst[j+7] = dbswap((tmp - (tmp>>dst_depth) + dither[7])>>shift);\
+            }\
+            for (; j < length; j++) {\
+                tmp = bswap(src[j]); dst[j] = dbswap((tmp - (tmp>>dst_depth) + dither[j&7])>>shift);\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
+        }\
+    }
+
+#define SHIFT_COPY(dst, dstStride, src, srcStride, bswap, dbswap)\
+    unsigned shift= src_depth-dst_depth;\
+    for (i = 0; i < height; i++) {\
+        for (j = 0; j < length-7; j+=8) {\
+            dst[j+0] = dbswap(bswap(src[j+0])>>shift);\
+            dst[j+1] = dbswap(bswap(src[j+1])>>shift);\
+            dst[j+2] = dbswap(bswap(src[j+2])>>shift);\
+            dst[j+3] = dbswap(bswap(src[j+3])>>shift);\
+            dst[j+4] = dbswap(bswap(src[j+4])>>shift);\
+            dst[j+5] = dbswap(bswap(src[j+5])>>shift);\
+            dst[j+6] = dbswap(bswap(src[j+6])>>shift);\
+            dst[j+7] = dbswap(bswap(src[j+7])>>shift);\
+        }\
+        for (; j < length; j++)\
+            dst[j] = dbswap(bswap(src[j])>>shift);\
+        dst += dstStride;\
+        src += srcStride;\
+    }
+
+#define MM_BSWAP16(n) _mm_or_si128(_mm_srli_epi16(n, 8), _mm_slli_epi16(n, 8))
+
+// Only for dst_depth == 8
+#define DITHER_COPY_X64(dst, dstStride, src, srcStride, bswap, mbswap)\
+    unsigned shift= src_depth-8, tmp;\
+    __m128i A0, D0;\
+    if (shiftonly) {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            D0 = _mm_loadl_epi64((__m128i const*)dither);\
+            D0 = _mm_unpacklo_epi8(D0, _mm_setzero_si128());\
+            for (j = 0; j < length-7; j+=8) {\
+                A0 = _mm_loadu_si128((__m128i const*)(src + j));\
+                A0 = mbswap(A0);\
+                A0 = _mm_adds_epu16(A0, D0);\
+                A0 = _mm_srli_epi16(A0, shift);\
+                A0 = _mm_packus_epi16(A0, A0);\
+                _mm_storel_epi64((__m128i*)(dst + j), A0);\
+            }\
+            for (; j < length; j++) {\
+                tmp = (bswap(src[j]) + dither[j&7])>>shift; dst[j] = tmp - (tmp>>8);\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
+        }\
+    } else {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            D0 = _mm_loadl_epi64((__m128i const*)dither);\
+            D0 = _mm_unpacklo_epi8(D0, _mm_setzero_si128());\
+            for (j = 0; j < length-7; j+=8) {\
+                A0 = _mm_loadu_si128((__m128i const*)(src + j));\
+                A0 = mbswap(A0);\
+                A0 = _mm_sub_epi16(A0, _mm_srli_epi16(A0, 8));\
+                A0 = _mm_add_epi16(A0, D0);\
+                A0 = _mm_srli_epi16(A0, shift);\
+                A0 = _mm_packus_epi16(A0, A0);\
+                _mm_storel_epi64((__m128i*)(dst + j), A0);\
+            }\
+            for (; j < length; j++) {\
+                tmp = bswap(src[j]); dst[j] = (tmp - (tmp>>8) + dither[j&7])>>shift;\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
+        }\
+    }
+
+// Only for dst_depth > 8
+#define SHIFT_COPY_X64(dst, dstStride, src, srcStride, bswap, dbswap, mbswap, mdbswap)\
+    unsigned shift= src_depth-dst_depth;\
+    __m128i A0;\
     for (i = 0; i < height; i++) {\
-        const uint8_t *dither= dithers[src_depth-9][i&7];\
-        for (j = 0; j < length-7; j+=8){\
-            dst[j+0] = dbswap((bswap(src[j+0]) + dither[0])*scale>>shift);\
-            dst[j+1] = dbswap((bswap(src[j+1]) + dither[1])*scale>>shift);\
-            dst[j+2] = dbswap((bswap(src[j+2]) + dither[2])*scale>>shift);\
-            dst[j+3] = dbswap((bswap(src[j+3]) + dither[3])*scale>>shift);\
-            dst[j+4] = dbswap((bswap(src[j+4]) + dither[4])*scale>>shift);\
-            dst[j+5] = dbswap((bswap(src[j+5]) + dither[5])*scale>>shift);\
-            dst[j+6] = dbswap((bswap(src[j+6]) + dither[6])*scale>>shift);\
-            dst[j+7] = dbswap((bswap(src[j+7]) + dither[7])*scale>>shift);\
+        for (j = 0; j < length-7; j+=8) {\
+            A0 = _mm_loadu_si128((__m128i const*)(src + j));\
+            A0 = mbswap(A0);\
+            A0 = _mm_srli_epi16(A0, shift);\
+            A0 = mdbswap(A0);\
+            _mm_storeu_si128((__m128i*)(dst + j), A0);\
         }\
         for (; j < length; j++)\
-            dst[j] = dbswap((bswap(src[j]) + dither[j&7])*scale>>shift);\
+            dst[j] = dbswap(bswap(src[j])>>shift);\
         dst += dstStride;\
         src += srcStride;\
     }
@@ -1561,9 +1652,17 @@ static int planarCopyWrapper(SwsContext *c, const uint8_t *src[],
 
                 if (dst_depth == 8) {
                     if(isBE(c->srcFormat) == HAVE_BIGENDIAN){
+#if ARCH_X86_64
+                        DITHER_COPY_X64(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, , )
+#else
                         DITHER_COPY(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, , )
+#endif
                     } else {
+#if ARCH_X86_64
+                        DITHER_COPY_X64(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, av_bswap16, MM_BSWAP16)
+#else
                         DITHER_COPY(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, av_bswap16, )
+#endif
                     }
                 } else if (src_depth == 8) {
                     for (i = 0; i < height; i++) {
@@ -1642,15 +1741,31 @@ static int planarCopyWrapper(SwsContext *c, const uint8_t *src[],
                 } else {
                     if(isBE(c->srcFormat) == HAVE_BIGENDIAN){
                         if(isBE(c->dstFormat) == HAVE_BIGENDIAN){
-                            DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , )
+#if ARCH_X86_64
+                            SHIFT_COPY_X64(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , , , )
+#else
+                            SHIFT_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , )
+#endif
                         } else {
-                            DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , av_bswap16)
+#if ARCH_X86_64
+                            SHIFT_COPY_X64(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , av_bswap16, , MM_BSWAP16)
+#else
+                            SHIFT_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , av_bswap16)
+#endif
                         }
                     }else{
                         if(isBE(c->dstFormat) == HAVE_BIGENDIAN){
-                            DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, )
+#if ARCH_X86_64
+                            SHIFT_COPY_X64(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, , MM_BSWAP16, )
+#else
+                            SHIFT_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, )
+#endif
                         } else {
-                            DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, av_bswap16)
+#if ARCH_X86_64
+                            SHIFT_COPY_X64(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, av_bswap16, MM_BSWAP16, MM_BSWAP16)
+#else
+                            SHIFT_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, av_bswap16)
+#endif
                         }
                     }
                 }
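
For reviewers who want to check the rounding by hand, the per-sample arithmetic of the two DITHER_COPY branches can be sketched in isolation for the 10-bit to 8-bit case (shift == 2). This is an illustration only, not code from the patch: the function names are hypothetical, byte swapping is left out, and `example_dither` is a stand-in row with values 0..3 rather than a row copied from the dithers[] table.

```c
#include <stdint.h>

/* Assumption: stand-in 2-bit dither row (values 0..3), not taken from dithers[]. */
static const uint8_t example_dither[8] = { 0, 2, 1, 3, 0, 2, 1, 3 };

/* Branch taken when shiftonly is set: add dither, shift, then subtract
 * tmp>>8 so that an intermediate 256 becomes 255. */
static uint8_t dither10_to_8_shiftonly(unsigned v, unsigned x)
{
    const unsigned shift = 10 - 8;
    unsigned tmp = (v + example_dither[x & 7]) >> shift;
    return (uint8_t)(tmp - (tmp >> 8));
}

/* Other branch: subtract v>>8 from the source first, so that adding the
 * largest dither value to the largest sample still fits after the shift. */
static uint8_t dither10_to_8_scaled(unsigned v, unsigned x)
{
    const unsigned shift = 10 - 8;
    return (uint8_t)((v - (v >> 8) + example_dither[x & 7]) >> shift);
}
```

With the largest 10-bit sample and the largest dither value (x == 3), both functions return 255: the first computes (1023 + 3) >> 2 = 256 and subtracts 1, the second computes (1023 - 3 + 3) >> 2 = 255. In the shiftonly form, 64 and 940 map to 16 and 235 for any dither value in 0..3.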