From patchwork Fri Sep 22 15:23:12 2017
X-Patchwork-Submitter: Mateusz
X-Patchwork-Id: 5238
To: ffmpeg-devel@ffmpeg.org
From: Mateusz
Date: Fri, 22 Sep 2017 17:23:12 +0200
Subject: [FFmpeg-devel] [PATCH] swscale_unscaled: fix and speed up DITHER_COPY macro for x86 with SSE2

New version of the patch -- it now uses the same logic regardless of the target bit depth.

On x86_64 it is much faster than the current code (with perfect quality). On x86_32 it is fast if you add --extra-cflags="-msse2" to configure; with the default configure options, x86_32 is slower than the current code, but with better quality.

Please review/test.
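For reference, a minimal standalone sketch of the per-sample rounding the reworked macro performs on its "shiftonly" path (the helper name and the caller-supplied dither value are illustrative, not part of the patch):

#include <stdint.h>

/* Reduce one sample from src_depth to dst_depth bits with ordered dithering,
 * following the shiftonly branch of the new DITHER_COPY logic.
 * shift = src_depth - dst_depth; dither is one entry of dithers[shift-1],
 * so it is always smaller than 1<<shift. */
static inline unsigned dither_reduce_sample(unsigned src, unsigned dither,
                                            unsigned shift, unsigned dst_depth)
{
    unsigned tmp = (src + dither) >> shift;
    /* the dither can push tmp to exactly 1<<dst_depth (e.g. 0x100 for an
     * 8-bit target); subtracting tmp>>dst_depth folds that overflow back
     * to the maximum representable value */
    return tmp - (tmp >> dst_depth);
}

The non-shiftonly branch applies the same tmp - (tmp>>dst_depth) fold to the source value before adding the dither and shifting.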
Mateusz

From 8eaa76fc82550f62f1a22e9388a51dc61c031a2c Mon Sep 17 00:00:00 2001
From: Mateusz
Date: Fri, 22 Sep 2017 14:54:53 +0200
Subject: [PATCH] swscale_unscaled: fix and speed up DITHER_COPY macro for x86 with SSE2

---
 libswscale/swscale_unscaled.c | 220 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 185 insertions(+), 35 deletions(-)

diff --git a/libswscale/swscale_unscaled.c b/libswscale/swscale_unscaled.c
index ef36aec..cd3e917 100644
--- a/libswscale/swscale_unscaled.c
+++ b/libswscale/swscale_unscaled.c
@@ -35,6 +35,10 @@
 #include "libavutil/avassert.h"
 #include "libavutil/avconfig.h"
 
+#if ARCH_X86_64 || (ARCH_X86_32 && defined(__SSE2__))
+#include <emmintrin.h>
+#endif
+
 DECLARE_ALIGNED(8, static const uint8_t, dithers)[8][8][8]={
 {
   {  0,  1,  0,  1,  0,  1,  0,  1,},
@@ -110,24 +114,6 @@ DECLARE_ALIGNED(8, static const uint8_t, dithers)[8][8][8]={
   { 112, 16,104,  8,118, 22,110, 14,},
 }};
 
-static const uint16_t dither_scale[15][16]={
-{    2,    3,    3,    5,    5,    5,    5,    5,    5,    5,    5,    5,    5,    5,    5,    5,},
-{    2,    3,    7,    7,   13,   13,   25,   25,   25,   25,   25,   25,   25,   25,   25,   25,},
-{    3,    3,    4,   15,   15,   29,   57,   57,   57,  113,  113,  113,  113,  113,  113,  113,},
-{    3,    4,    4,    5,   31,   31,   61,  121,  241,  241,  241,  241,  481,  481,  481,  481,},
-{    3,    4,    5,    5,    6,   63,   63,  125,  249,  497,  993,  993,  993,  993,  993, 1985,},
-{    3,    5,    6,    6,    6,    7,  127,  127,  253,  505, 1009, 2017, 4033, 4033, 4033, 4033,},
-{    3,    5,    6,    7,    7,    7,    8,  255,  255,  509, 1017, 2033, 4065, 8129,16257,16257,},
-{    3,    5,    6,    8,    8,    8,    8,    9,  511,  511, 1021, 2041, 4081, 8161,16321,32641,},
-{    3,    5,    7,    8,    9,    9,    9,    9,   10, 1023, 1023, 2045, 4089, 8177,16353,32705,},
-{    3,    5,    7,    8,   10,   10,   10,   10,   10,   11, 2047, 2047, 4093, 8185,16369,32737,},
-{    3,    5,    7,    8,   10,   11,   11,   11,   11,   11,   12, 4095, 4095, 8189,16377,32753,},
-{    3,    5,    7,    9,   10,   12,   12,   12,   12,   12,   12,   13, 8191, 8191,16381,32761,},
-{    3,    5,    7,    9,   10,   12,   13,   13,   13,   13,   13,   13,   14,16383,16383,32765,},
-{    3,    5,    7,    9,   10,   12,   14,   14,   14,   14,   14,   14,   14,   15,32767,32767,},
-{    3,    5,    7,    9,   11,   12,   14,   15,   15,   15,   15,   15,   15,   15,   16,65535,},
-};
-
 static void fillPlane(uint8_t *plane, int stride, int width, int height, int y,
                       uint8_t val)
 {
@@ -1502,24 +1488,164 @@ static int packedCopyWrapper(SwsContext *c, const uint8_t *src[],
 }
 
 #define DITHER_COPY(dst, dstStride, src, srcStride, bswap, dbswap)\
-    uint16_t scale= dither_scale[dst_depth-1][src_depth-1];\
-    int shift= src_depth-dst_depth + dither_scale[src_depth-2][dst_depth-1];\
-    for (i = 0; i < height; i++) {\
-        const uint8_t *dither= dithers[src_depth-9][i&7];\
-        for (j = 0; j < length-7; j+=8){\
-            dst[j+0] = dbswap((bswap(src[j+0]) + dither[0])*scale>>shift);\
-            dst[j+1] = dbswap((bswap(src[j+1]) + dither[1])*scale>>shift);\
-            dst[j+2] = dbswap((bswap(src[j+2]) + dither[2])*scale>>shift);\
-            dst[j+3] = dbswap((bswap(src[j+3]) + dither[3])*scale>>shift);\
-            dst[j+4] = dbswap((bswap(src[j+4]) + dither[4])*scale>>shift);\
-            dst[j+5] = dbswap((bswap(src[j+5]) + dither[5])*scale>>shift);\
-            dst[j+6] = dbswap((bswap(src[j+6]) + dither[6])*scale>>shift);\
-            dst[j+7] = dbswap((bswap(src[j+7]) + dither[7])*scale>>shift);\
+    unsigned shift= src_depth-dst_depth, tmp;\
+    if (shiftonly) {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            for (j = 0; j < length-7; j+=8) {\
+                tmp = (bswap(src[j+0]) + dither[0])>>shift; dst[j+0] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+1]) + dither[1])>>shift; dst[j+1] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+2]) + dither[2])>>shift; dst[j+2] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+3]) + dither[3])>>shift; dst[j+3] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+4]) + dither[4])>>shift; dst[j+4] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+5]) + dither[5])>>shift; dst[j+5] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+6]) + dither[6])>>shift; dst[j+6] = dbswap(tmp - (tmp>>dst_depth));\
+                tmp = (bswap(src[j+7]) + dither[7])>>shift; dst[j+7] = dbswap(tmp - (tmp>>dst_depth));\
+            }\
+            for (; j < length; j++) {\
+                tmp = (bswap(src[j]) + dither[j&7])>>shift; dst[j] = dbswap(tmp - (tmp>>dst_depth));\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
+        }\
+    } else {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            for (j = 0; j < length-7; j+=8) {\
+                tmp = bswap(src[j+0]); dst[j+0] = dbswap((tmp - (tmp>>dst_depth) + dither[0])>>shift);\
+                tmp = bswap(src[j+1]); dst[j+1] = dbswap((tmp - (tmp>>dst_depth) + dither[1])>>shift);\
+                tmp = bswap(src[j+2]); dst[j+2] = dbswap((tmp - (tmp>>dst_depth) + dither[2])>>shift);\
+                tmp = bswap(src[j+3]); dst[j+3] = dbswap((tmp - (tmp>>dst_depth) + dither[3])>>shift);\
+                tmp = bswap(src[j+4]); dst[j+4] = dbswap((tmp - (tmp>>dst_depth) + dither[4])>>shift);\
+                tmp = bswap(src[j+5]); dst[j+5] = dbswap((tmp - (tmp>>dst_depth) + dither[5])>>shift);\
+                tmp = bswap(src[j+6]); dst[j+6] = dbswap((tmp - (tmp>>dst_depth) + dither[6])>>shift);\
+                tmp = bswap(src[j+7]); dst[j+7] = dbswap((tmp - (tmp>>dst_depth) + dither[7])>>shift);\
+            }\
+            for (; j < length; j++) {\
+                tmp = bswap(src[j]); dst[j] = dbswap((tmp - (tmp>>dst_depth) + dither[j&7])>>shift);\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
+        }\
+    }
+
+#define MM_BSWAP16(n) _mm_or_si128(_mm_srli_epi16(n, 8), _mm_slli_epi16(n, 8))
+
+#define DITHER_COPY_X64_1(dst, dstStride, src, srcStride, bswap, mbswap)\
+    unsigned shift= src_depth-8, tmp;\
+    __m128i A0, A1, D0;\
+    if (shiftonly) {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            D0 = _mm_loadl_epi64((__m128i const*)dither);\
+            D0 = _mm_unpacklo_epi8(D0, _mm_setzero_si128());\
+            for (j = 0; j < length-15; j+=16) {\
+                A0 = _mm_loadu_si128((__m128i const*)(src + j));\
+                A1 = _mm_loadu_si128((__m128i const*)(src + j+8));\
+                A0 = mbswap(A0);\
+                A1 = mbswap(A1);\
+                A0 = _mm_adds_epu16(A0, D0);\
+                A1 = _mm_adds_epu16(A1, D0);\
+                A0 = _mm_srli_epi16(A0, shift);\
+                A1 = _mm_srli_epi16(A1, shift);\
+                A0 = _mm_packus_epi16(A0, A1);\
+                _mm_storeu_si128((__m128i*)(dst + j), A0);\
+            }\
+            if (j < length-7) {\
+                A0 = _mm_loadu_si128((__m128i const*)(src + j));\
+                A0 = mbswap(A0);\
+                A0 = _mm_adds_epu16(A0, D0);\
+                A0 = _mm_srli_epi16(A0, shift);\
+                A0 = _mm_packus_epi16(A0, A0);\
+                _mm_storel_epi64((__m128i*)(dst + j), A0);\
+                j += 8;\
+            }\
+            for (; j < length; j++) {\
+                tmp = (bswap(src[j]) + dither[j&7])>>shift; dst[j] = tmp - (tmp>>8);\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
+        }\
+    } else {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            D0 = _mm_loadl_epi64((__m128i const*)dither);\
+            D0 = _mm_unpacklo_epi8(D0, _mm_setzero_si128());\
+            for (j = 0; j < length-15; j+=16) {\
+                A0 = _mm_loadu_si128((__m128i const*)(src + j));\
+                A1 = _mm_loadu_si128((__m128i const*)(src + j+8));\
+                A0 = mbswap(A0);\
+                A1 = mbswap(A1);\
+                A0 = _mm_sub_epi16(A0, _mm_srli_epi16(A0, 8));\
+                A1 = _mm_sub_epi16(A1, _mm_srli_epi16(A1, 8));\
+                A0 = _mm_add_epi16(A0, D0);\
+                A1 = _mm_add_epi16(A1, D0);\
+                A0 = _mm_srli_epi16(A0, shift);\
+                A1 = _mm_srli_epi16(A1, shift);\
+                A0 = _mm_packus_epi16(A0, A1);\
+                _mm_storeu_si128((__m128i*)(dst + j), A0);\
+            }\
+            if (j < length-7) {\
+                A0 = _mm_loadu_si128((__m128i const*)(src + j));\
+                A0 = mbswap(A0);\
+                A0 = _mm_sub_epi16(A0, _mm_srli_epi16(A0, 8));\
+                A0 = _mm_add_epi16(A0, D0);\
+                A0 = _mm_srli_epi16(A0, shift);\
+                A0 = _mm_packus_epi16(A0, A0);\
+                _mm_storel_epi64((__m128i*)(dst + j), A0);\
+                j += 8;\
+            }\
+            for (; j < length; j++) {\
+                tmp = bswap(src[j]); dst[j] = (tmp - (tmp>>8) + dither[j&7])>>shift;\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
+        }\
+    }
+
+#define DITHER_COPY_X64_2(dst, dstStride, src, srcStride, bswap, dbswap, mbswap, mdbswap)\
+    unsigned shift= src_depth-dst_depth, tmp;\
+    __m128i A0, D0;\
+    if (shiftonly) {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            D0 = _mm_loadl_epi64((__m128i const*)dither);\
+            D0 = _mm_unpacklo_epi8(D0, _mm_setzero_si128());\
+            for (j = 0; j < length-7; j+=8) {\
+                A0 = _mm_loadu_si128((__m128i const*)(src + j));\
+                A0 = mbswap(A0);\
+                A0 = _mm_adds_epu16(A0, D0);\
+                A0 = _mm_srli_epi16(A0, shift);\
+                A0 = _mm_sub_epi16(A0, _mm_srli_epi16(A0, dst_depth));\
+                A0 = mdbswap(A0);\
+                _mm_storeu_si128((__m128i*)(dst + j), A0);\
+            }\
+            for (; j < length; j++) {\
+                tmp = (bswap(src[j]) + dither[j&7])>>shift; dst[j] = dbswap(tmp - (tmp>>dst_depth));\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
+        }\
+    } else {\
+        for (i = 0; i < height; i++) {\
+            const uint8_t *dither= dithers[shift-1][i&7];\
+            D0 = _mm_loadl_epi64((__m128i const*)dither);\
+            D0 = _mm_unpacklo_epi8(D0, _mm_setzero_si128());\
+            for (j = 0; j < length-7; j+=8) {\
+                A0 = _mm_loadu_si128((__m128i const*)(src + j));\
+                A0 = mbswap(A0);\
+                A0 = _mm_sub_epi16(A0, _mm_srli_epi16(A0, dst_depth));\
+                A0 = _mm_add_epi16(A0, D0);\
+                A0 = _mm_srli_epi16(A0, shift);\
+                A0 = mdbswap(A0);\
+                _mm_storeu_si128((__m128i*)(dst + j), A0);\
+            }\
+            for (; j < length; j++) {\
+                tmp = bswap(src[j]); dst[j] = dbswap((tmp - (tmp>>dst_depth) + dither[j&7])>>shift);\
+            }\
+            dst += dstStride;\
+            src += srcStride;\
         }\
-        for (; j < length; j++)\
-            dst[j] = dbswap((bswap(src[j]) + dither[j&7])*scale>>shift);\
-        dst += dstStride;\
-        src += srcStride;\
     }
 
 static int planarCopyWrapper(SwsContext *c, const uint8_t *src[],
@@ -1561,9 +1687,17 @@ static int planarCopyWrapper(SwsContext *c, const uint8_t *src[],
 
         if (dst_depth == 8) {
             if(isBE(c->srcFormat) == HAVE_BIGENDIAN){
+#if ARCH_X86_64 || (ARCH_X86_32 && defined(__SSE2__))
+                DITHER_COPY_X64_1(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, , )
+#else
                 DITHER_COPY(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, , )
+#endif
             } else {
+#if ARCH_X86_64 || (ARCH_X86_32 && defined(__SSE2__))
+                DITHER_COPY_X64_1(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, av_bswap16, MM_BSWAP16)
+#else
                 DITHER_COPY(dstPtr, dstStride[plane], srcPtr2, srcStride[plane]/2, av_bswap16, )
+#endif
             }
         } else if (src_depth == 8) {
             for (i = 0; i < height; i++) {
@@ -1642,15 +1776,31 @@ static int planarCopyWrapper(SwsContext *c, const uint8_t *src[],
             } else {
                 if(isBE(c->srcFormat) == HAVE_BIGENDIAN){
                     if(isBE(c->dstFormat) == HAVE_BIGENDIAN){
+#if ARCH_X86_64 || (ARCH_X86_32 && defined(__SSE2__))
+                        DITHER_COPY_X64_2(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , , , )
+#else
                         DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , )
+#endif
                     } else {
+#if ARCH_X86_64 || (ARCH_X86_32 && defined(__SSE2__))
+                        DITHER_COPY_X64_2(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , av_bswap16, , MM_BSWAP16)
+#else
                         DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, , av_bswap16)
+#endif
                     }
                 }else{
                     if(isBE(c->dstFormat) == HAVE_BIGENDIAN){
+#if ARCH_X86_64 || (ARCH_X86_32 && defined(__SSE2__))
+                        DITHER_COPY_X64_2(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, , MM_BSWAP16, )
+#else
                         DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, )
+#endif
                     } else {
+#if ARCH_X86_64 || (ARCH_X86_32 && defined(__SSE2__))
+                        DITHER_COPY_X64_2(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, av_bswap16, MM_BSWAP16, MM_BSWAP16)
+#else
                         DITHER_COPY(dstPtr2, dstStride[plane]/2, srcPtr2, srcStride[plane]/2, av_bswap16, av_bswap16)
+#endif
                     }
                 }
             }
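Not part of the patch: a small standalone check (hypothetical file dither_check.c) that runs the 16-bit-to-8-bit shiftonly path both with the scalar formula from DITHER_COPY and with the same SSE2 intrinsic sequence that DITHER_COPY_X64_1 uses, then compares the outputs. The buffer length and the dither row are arbitrary test values; the point is that the saturating _mm_adds_epu16 followed by _mm_packus_epi16 should give the same clamping as the scalar tmp - (tmp>>8) fold.

/* dither_check.c -- build with: gcc -O2 -msse2 dither_check.c -o dither_check */
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LEN   1024   /* multiple of 16 so the SSE2 loop covers everything */
#define SHIFT 8      /* 16-bit source -> 8-bit destination */

int main(void)
{
    static uint16_t src[LEN];
    static uint8_t  dst_c[LEN], dst_sse[LEN];
    /* one row of an 8x8 Bayer-style ordered-dither matrix (values < 1<<SHIFT) */
    static const uint8_t dither[8] = { 0, 128, 32, 160, 8, 136, 40, 168 };
    int j;

    for (j = 0; j < LEN; j++)
        src[j] = (uint16_t)rand();
    src[0] = 0xFFFF;   /* exercise the overflow-fold edge case */

    /* scalar reference: same arithmetic as the new DITHER_COPY shiftonly path */
    for (j = 0; j < LEN; j++) {
        unsigned tmp = ((unsigned)src[j] + dither[j & 7]) >> SHIFT;
        dst_c[j] = (uint8_t)(tmp - (tmp >> 8));
    }

    /* SSE2: same intrinsic sequence as the DITHER_COPY_X64_1 shiftonly loop */
    {
        __m128i D0 = _mm_loadl_epi64((__m128i const *)dither);
        D0 = _mm_unpacklo_epi8(D0, _mm_setzero_si128());   /* 8x u8 -> 8x u16 */
        for (j = 0; j < LEN; j += 16) {
            __m128i A0 = _mm_loadu_si128((__m128i const *)(src + j));
            __m128i A1 = _mm_loadu_si128((__m128i const *)(src + j + 8));
            A0 = _mm_adds_epu16(A0, D0);    /* saturating add of the dither */
            A1 = _mm_adds_epu16(A1, D0);
            A0 = _mm_srli_epi16(A0, SHIFT);
            A1 = _mm_srli_epi16(A1, SHIFT);
            A0 = _mm_packus_epi16(A0, A1);  /* 16 -> 8 bit, unsigned saturation */
            _mm_storeu_si128((__m128i *)(dst_sse + j), A0);
        }
    }

    for (j = 0; j < LEN; j++) {
        if (dst_c[j] != dst_sse[j]) {
            printf("mismatch at %d: C %d, SSE2 %d\n", j, dst_c[j], dst_sse[j]);
            return 1;
        }
    }
    printf("OK: scalar and SSE2 outputs match\n");
    return 0;
}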