From patchwork Mon Jan 11 16:46:31 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Kelly <alankelly@google.com>
X-Patchwork-Id: 24910
Date: Mon, 11 Jan 2021 17:46:31 +0100
Message-Id: <20210111164631.3729786-1-alankelly@google.com>
X-Mailer: git-send-email 2.30.0.284.gd98b1dd5eaa7-goog
From: Alan Kelly <alankelly@google.com>
To: ffmpeg-devel@ffmpeg.org
Subject: [FFmpeg-devel] [PATCH] Moves yuv2yuvX_sse3 to yasm, unrolls main
 loop and other small optimizations for ~20% speedup.
---
 Fixes a bug where the dither is modified if there is no offset and
 there is a tail which is not processed by the sse3/avx2 version.
 Deletes the mmx/mmxext yuv2yuvX version from swscale_template and adds
 it to yuv2yuvX.asm to reduce code duplication and so that it may be
 used to process the tail left over by the wider SIMD versions.
 The src argument of yuv2yuvX_* is now srcOffset, so that tails and
 offsets are accounted for correctly.
 Changes the input size in checkasm so that this corner case is tested.

 libswscale/x86/Makefile           |   1 +
 libswscale/x86/swscale.c          | 130 ++++++++++++----------
 libswscale/x86/swscale_template.c |  82 ------------------
 libswscale/x86/yuv2yuvX.asm       | 136 ++++++++++++++++++++++++++++++
 tests/checkasm/sw_scale.c         | 100 ++++++++++++++++++++++
 5 files changed, 291 insertions(+), 158 deletions(-)
 create mode 100644 libswscale/x86/yuv2yuvX.asm

diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index 831d5359aa..bfe383364e 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -13,3 +13,4 @@ X86ASM-OBJS     += x86/input.o                          \
                    x86/scale.o                          \
                    x86/rgb_2_rgb.o                      \
                    x86/yuv_2_rgb.o                      \
+                   x86/yuv2yuvX.o                       \

diff --git a/libswscale/x86/swscale.c b/libswscale/x86/swscale.c
index 15c0b22f20..3df193a067 100644
--- a/libswscale/x86/swscale.c
+++ b/libswscale/x86/swscale.c
@@ -63,6 +63,16 @@ DECLARE_ASM_ALIGNED(8, const uint64_t, ff_bgr2UVOffset) = 0x8080808080808080ULL;
 DECLARE_ASM_ALIGNED(8, const uint64_t, ff_w1111)        = 0x0001000100010001ULL;
 
+#define YUV2YUVX_FUNC_DECL(opt) \
+static void yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, const int16_t **src, \
+                            uint8_t *dest, int dstW, \
+                            const uint8_t *dither, int offset); \
+
+YUV2YUVX_FUNC_DECL(mmx)
+YUV2YUVX_FUNC_DECL(mmxext)
+YUV2YUVX_FUNC_DECL(sse3)
+YUV2YUVX_FUNC_DECL(avx2)
+
 //MMX versions
 #if HAVE_MMX_INLINE
 #undef RENAME
@@ -198,81 +208,44 @@ void ff_updateMMXDitherTables(SwsContext *c, int dstY)
 }
 
 #if HAVE_MMXEXT
-static void yuv2yuvX_sse3(const int16_t *filter, int filterSize,
-                          const int16_t **src, uint8_t *dest, int dstW,
-                          const uint8_t *dither, int offset)
-{
-    if(((uintptr_t)dest) & 15){
-        yuv2yuvX_mmxext(filter, filterSize, src, dest, dstW, dither, offset);
-        return;
-    }
-    filterSize--;
-#define MAIN_FUNCTION \
-        "pxor       %%xmm0, %%xmm0 \n\t" \
-        "punpcklbw  %%xmm0, %%xmm3 \n\t" \
-        "movd           %4, %%xmm1 \n\t" \
-        "punpcklwd  %%xmm1, %%xmm1 \n\t" \
-        "punpckldq  %%xmm1, %%xmm1 \n\t" \
-        "punpcklqdq %%xmm1, %%xmm1 \n\t" \
-        "psllw          $3, %%xmm1 \n\t" \
-        "paddw      %%xmm1, %%xmm3 \n\t" \
-        "psraw          $4, %%xmm3 \n\t" \
-        "movdqa     %%xmm3, %%xmm4 \n\t" \
-        "movdqa     %%xmm3, %%xmm7 \n\t" \
-        "movl           %3, %%ecx  \n\t" \
-        "mov         %0, %%"FF_REG_d"              \n\t"\
-        "mov         (%%"FF_REG_d"), %%"FF_REG_S"  \n\t"\
-        ".p2align    4                             \n\t" /* FIXME Unroll? */\
-        "1:                                        \n\t"\
-        "movddup     8(%%"FF_REG_d"), %%xmm0       \n\t" /* filterCoeff */\
-        "movdqa      (%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm2   \n\t" /* srcData */\
-        "movdqa      16(%%"FF_REG_S", %%"FF_REG_c", 2), %%xmm5 \n\t" /* srcData */\
-        "add         $16, %%"FF_REG_d"             \n\t"\
-        "mov         (%%"FF_REG_d"), %%"FF_REG_S"  \n\t"\
-        "test        %%"FF_REG_S", %%"FF_REG_S"    \n\t"\
-        "pmulhw      %%xmm0, %%xmm2                \n\t"\
-        "pmulhw      %%xmm0, %%xmm5                \n\t"\
-        "paddw       %%xmm2, %%xmm3                \n\t"\
-        "paddw       %%xmm5, %%xmm4                \n\t"\
-        " jnz        1b                            \n\t"\
-        "psraw       $3, %%xmm3                    \n\t"\
-        "psraw       $3, %%xmm4                    \n\t"\
-        "packuswb    %%xmm4, %%xmm3                \n\t"\
-        "movntdq     %%xmm3, (%1, %%"FF_REG_c")    \n\t"\
-        "add         $16, %%"FF_REG_c"             \n\t"\
-        "cmp         %2, %%"FF_REG_c"              \n\t"\
-        "movdqa      %%xmm7, %%xmm3                \n\t" \
-        "movdqa      %%xmm7, %%xmm4                \n\t" \
-        "mov         %0, %%"FF_REG_d"              \n\t"\
-        "mov         (%%"FF_REG_d"), %%"FF_REG_S"  \n\t"\
-        "jb          1b                            \n\t"
-
-    if (offset) {
-        __asm__ volatile(
-            "movq          %5, %%xmm3  \n\t"
-            "movdqa    %%xmm3, %%xmm4  \n\t"
-            "psrlq        $24, %%xmm3  \n\t"
-            "psllq        $40, %%xmm4  \n\t"
-            "por       %%xmm4, %%xmm3  \n\t"
-            MAIN_FUNCTION
-              :: "g" (filter),
-                 "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-                 "m"(filterSize), "m"(((uint64_t *) dither)[0])
-              : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,)
-                "%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-        );
-    } else {
-        __asm__ volatile(
-            "movq          %5, %%xmm3  \n\t"
-            MAIN_FUNCTION
-              :: "g" (filter),
-                 "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset),
-                 "m"(filterSize), "m"(((uint64_t *) dither)[0])
-              : XMM_CLOBBERS("%xmm0" , "%xmm1" , "%xmm2" , "%xmm3" , "%xmm4" , "%xmm5" , "%xmm7" ,)
-                "%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-        );
-    }
-}
+#define YUV2YUVX_FUNC_MMX(opt, step) \
+void ff_yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, int srcOffset, \
+                        uint8_t *dest, int dstW, \
+                        const uint8_t *dither, int offset); \
+static void yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, \
+                            const int16_t **src, uint8_t *dest, int dstW, \
+                            const uint8_t *dither, int offset) \
+{ \
+    ff_yuv2yuvX_ ##opt(filter, filterSize - 1, 0, dest - offset, dstW + offset, dither, offset); \
+    return; \
+}
+
+#define YUV2YUVX_FUNC(opt, step) \
+void ff_yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, int srcOffset, \
+                        uint8_t *dest, int dstW, \
+                        const uint8_t *dither, int offset); \
+static void yuv2yuvX_ ##opt(const int16_t *filter, int filterSize, \
+                            const int16_t **src, uint8_t *dest, int dstW, \
+                            const uint8_t *dither, int offset) \
+{ \
+    int remainder = (dstW % step); \
+    int pixelsProcessed = dstW - remainder; \
+    if(((uintptr_t)dest) & 15){ \
+        yuv2yuvX_mmx(filter, filterSize, src, dest, dstW, dither, offset); \
+        return; \
+    } \
+    ff_yuv2yuvX_ ##opt(filter, filterSize - 1, 0, dest - offset, pixelsProcessed + offset, dither, offset); \
+    if(remainder > 0){ \
+        ff_yuv2yuvX_mmx(filter, filterSize - 1, pixelsProcessed, dest - offset, pixelsProcessed + remainder + offset, dither, offset); \
+    } \
+    return; \
+}
+
+YUV2YUVX_FUNC_MMX(mmx, 16)
+YUV2YUVX_FUNC_MMX(mmxext, 16)
+YUV2YUVX_FUNC(sse3, 32)
+YUV2YUVX_FUNC(avx2, 64)
+
 #endif
 
 #endif /* HAVE_INLINE_ASM */
@@ -403,9 +376,14 @@ av_cold void ff_sws_init_swscale_x86(SwsContext *c)
 #if HAVE_MMXEXT_INLINE
     if (INLINE_MMXEXT(cpu_flags))
         sws_init_swscale_mmxext(c);
-    if (cpu_flags & AV_CPU_FLAG_SSE3){
-        if(c->use_mmx_vfilter && !(c->flags & SWS_ACCURATE_RND))
+    if (cpu_flags & AV_CPU_FLAG_AVX2){
+        if(c->use_mmx_vfilter && !(c->flags & SWS_ACCURATE_RND)){
+            c->yuv2planeX = yuv2yuvX_avx2;
+        }
+    } else if (cpu_flags & AV_CPU_FLAG_SSE3){
+        if(c->use_mmx_vfilter && !(c->flags & SWS_ACCURATE_RND)){
             c->yuv2planeX = yuv2yuvX_sse3;
+        }
     }
 #endif

diff --git a/libswscale/x86/swscale_template.c b/libswscale/x86/swscale_template.c
index 823056c2ea..cb33af97e4 100644
--- a/libswscale/x86/swscale_template.c
+++ b/libswscale/x86/swscale_template.c
@@ -38,88 +38,6 @@
 #endif
 #define MOVNTQ(a,b)  REAL_MOVNTQ(a,b)
 
-#if !COMPILE_TEMPLATE_MMXEXT
-static av_always_inline void
-dither_8to16(const uint8_t *srcDither, int rot)
-{
-    if (rot) {
-        __asm__ volatile("pxor      %%mm0, %%mm0\n\t"
-                         "movq       (%0), %%mm3\n\t"
-                         "movq      %%mm3, %%mm4\n\t"
-                         "psrlq       $24, %%mm3\n\t"
-                         "psllq       $40, %%mm4\n\t"
-                         "por       %%mm4, %%mm3\n\t"
-                         "movq      %%mm3, %%mm4\n\t"
-                         "punpcklbw %%mm0, %%mm3\n\t"
-                         "punpckhbw %%mm0, %%mm4\n\t"
-                         :: "r"(srcDither)
-                         );
-    } else {
-        __asm__ volatile("pxor      %%mm0, %%mm0\n\t"
-                         "movq       (%0), %%mm3\n\t"
-                         "movq      %%mm3, %%mm4\n\t"
-                         "punpcklbw %%mm0, %%mm3\n\t"
-                         "punpckhbw %%mm0, %%mm4\n\t"
-                         :: "r"(srcDither)
-                         );
-    }
-}
-#endif
-
-static void RENAME(yuv2yuvX)(const int16_t *filter, int filterSize,
-                             const int16_t **src, uint8_t *dest, int dstW,
-                             const uint8_t *dither, int offset)
-{
-    dither_8to16(dither, offset);
-    filterSize--;
-    __asm__ volatile(
-        "movd         %0, %%mm1\n\t"
-        "punpcklwd %%mm1, %%mm1\n\t"
-        "punpckldq %%mm1, %%mm1\n\t"
-        "psllw        $3, %%mm1\n\t"
-        "paddw     %%mm1, %%mm3\n\t"
-        "paddw     %%mm1, %%mm4\n\t"
-        "psraw        $4, %%mm3\n\t"
-        "psraw        $4, %%mm4\n\t"
-        ::"m"(filterSize)
-        );
-
-    __asm__ volatile(\
-        "movq    %%mm3, %%mm6\n\t"
-        "movq    %%mm4, %%mm7\n\t"
-        "movl       %3, %%ecx\n\t"
-        "mov      %0, %%"FF_REG_d"               \n\t"\
-        "mov      (%%"FF_REG_d"), %%"FF_REG_S"   \n\t"\
-        ".p2align 4                              \n\t" /* FIXME Unroll? */\
-        "1:                                      \n\t"\
-        "movq     8(%%"FF_REG_d"), %%mm0         \n\t" /* filterCoeff */\
-        "movq     (%%"FF_REG_S", %%"FF_REG_c", 2), %%mm2  \n\t" /* srcData */\
-        "movq     8(%%"FF_REG_S", %%"FF_REG_c", 2), %%mm5 \n\t" /* srcData */\
-        "add      $16, %%"FF_REG_d"              \n\t"\
-        "mov      (%%"FF_REG_d"), %%"FF_REG_S"   \n\t"\
-        "test     %%"FF_REG_S", %%"FF_REG_S"     \n\t"\
-        "pmulhw   %%mm0, %%mm2                   \n\t"\
-        "pmulhw   %%mm0, %%mm5                   \n\t"\
-        "paddw    %%mm2, %%mm3                   \n\t"\
-        "paddw    %%mm5, %%mm4                   \n\t"\
-        " jnz     1b                             \n\t"\
-        "psraw    $3, %%mm3                      \n\t"\
-        "psraw    $3, %%mm4                      \n\t"\
-        "packuswb %%mm4, %%mm3                   \n\t"
-        MOVNTQ2 " %%mm3, (%1, %%"FF_REG_c")      \n\t"
-        "add      $8, %%"FF_REG_c"               \n\t"\
-        "cmp      %2, %%"FF_REG_c"               \n\t"\
-        "movq     %%mm6, %%mm3                   \n\t"
-        "movq     %%mm7, %%mm4                   \n\t"
-        "mov      %0, %%"FF_REG_d"               \n\t"\
-        "mov      (%%"FF_REG_d"), %%"FF_REG_S"   \n\t"\
-        "jb       1b                             \n\t"\
-        :: "g" (filter),
-           "r" (dest-offset), "g" ((x86_reg)(dstW+offset)), "m" (offset)
-        : "%"FF_REG_d, "%"FF_REG_S, "%"FF_REG_c
-        );
-}
-
 #define YSCALEYUV2PACKEDX_UV \
     __asm__ volatile(\
         "xor %%"FF_REG_a", %%"FF_REG_a" \n\t"\

diff --git a/libswscale/x86/yuv2yuvX.asm b/libswscale/x86/yuv2yuvX.asm
new file mode 100644
index 0000000000..b3a9426c61
--- /dev/null
+++ b/libswscale/x86/yuv2yuvX.asm
@@ -0,0 +1,136 @@
+;******************************************************************************
+;* x86-optimized yuv2yuvX
+;* Copyright 2020 Google LLC
+;* Copyright (C) 2001-2011 Michael Niedermayer <michaelni@gmx.at>
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************

+%include "libavutil/x86/x86util.asm"
+
+SECTION .text
+
+;-----------------------------------------------------------------------------
+; yuv2yuvX
+;
+; void ff_yuv2yuvX_<opt>(const int16_t *filter, int filterSize,
+;                        int srcOffset, uint8_t *dest, int dstW,
+;                        const uint8_t *dither, int offset);
+;
+;-----------------------------------------------------------------------------
+
+%macro YUV2YUVX_FUNC 0
+cglobal yuv2yuvX, 7, 7, 8, filter, filterSize, src, dest, dstW, dither, offset
+%if cpuflag(mmx)
+%define movr mova
+%else
+%define movr movdqu
+%endif
+%if ARCH_X86_64
+    movsxd               dstWq, dstWd
+    movsxd             offsetq, offsetd
+    movsxd                srcq, srcd
+%endif ; x86-64
+%if cpuflag(avx2)
+    vpbroadcastq            m3, [ditherq]
+%elif cpuflag(sse3)
+    movq                  xmm3, [ditherq]
+%else
+    mova                    m3, [ditherq]
+%endif ; avx2
+    cmp                offsetd, 0
+    jz                 .offset
+
+    ; offset != 0 path.
+    psrlq                   m5, m3, $18
+    psllq                   m3, m3, $28
+    por                     m3, m3, m5
+
+.offset:
+    add                offsetq, srcq
+%if cpuflag(avx2)
+    movd                  xmm1, filterSized
+    vpbroadcastw            m1, xmm1
+%elif cpuflag(sse3)
+    movd                  xmm1, filterSized
+    pshuflw                 m1, m1, q0000
+    punpcklqdq              m1, m1
+%else
+    movd                    m1, filterSized
+    punpcklwd               m1, m1
+    punpckldq               m1, m1
+%endif ; avx2
+    pxor                    m0, m0, m0
+    mov            filterSizeq, filterq
+    mov                   srcq, [filterSizeq]
+    punpcklbw               m3, m0
+    psllw                   m1, m1, 3
+    paddw                   m3, m3, m1
+    psraw                   m7, m3, 4
+.outerloop:
+    mova                    m4, m7
+    mova                    m3, m7
+    mova                    m6, m7
+    mova                    m1, m7
+.loop:
+%if cpuflag(avx2)
+    vpbroadcastq            m0, [filterSizeq + 8]
+%elif cpuflag(sse3)
+    movddup                 m0, [filterSizeq + 8]
+%else
+    mova                    m0, [filterSizeq + 8]
+%endif
+    pmulhw                  m2, m0, [srcq + offsetq * 2]
+    pmulhw                  m5, m0, [srcq + offsetq * 2 + mmsize]
+    paddw                   m3, m3, m2
+    paddw                   m4, m4, m5
+    pmulhw                  m2, m0, [srcq + offsetq * 2 + 2 * mmsize]
+    pmulhw                  m5, m0, [srcq + offsetq * 2 + 3 * mmsize]
+    paddw                   m6, m6, m2
+    paddw                   m1, m1, m5
+    add            filterSizeq, $10
+    mov                   srcq, [filterSizeq]
+    test                  srcq, srcq
+    jnz .loop
+    psraw                   m3, m3, 3
+    psraw                   m4, m4, 3
+    psraw                   m6, m6, 3
+    psraw                   m1, m1, 3
+    packuswb                m3, m3, m4
+    packuswb                m6, m6, m1
+    mov                   srcq, [filterq]
+%if cpuflag(avx2)
+    vpermq                  m3, m3, 216
+    vpermq                  m6, m6, 216
+%endif
+    movr  [destq + offsetq], m3
+    movr  [destq + offsetq + mmsize], m6
+    add                offsetq, mmsize * 2
+    mov            filterSizeq, filterq
+    cmp                offsetq, dstWq
+    jb .outerloop
+    REP_RET
+%endmacro
+
+INIT_MMX mmx
+YUV2YUVX_FUNC
+INIT_MMX mmxext
+YUV2YUVX_FUNC
+INIT_XMM sse3
+YUV2YUVX_FUNC
+INIT_YMM avx2
+YUV2YUVX_FUNC

diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 8741b3943c..76209775da 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -36,6 +36,104 @@
                 AV_WN32(buf + j, rnd()); \
     } while (0)
 
+#define SRC_PIXELS 144
+
+// This reference function is the same approximate algorithm employed by the
+// SIMD functions
+static void ref_function(const int16_t *filter, int filterSize,
+                         const int16_t **src, uint8_t *dest, int dstW,
+                         const uint8_t *dither, int offset)
+{
+    int i, d;
+    d = ((filterSize - 1) * 8 + dither[0]) >> 4;
+    for (i = 0; i < dstW; i++) {
+        int16_t val = d;
+        int j;
+        union {
+            int val;
+            int16_t v[2];
+        } t;
+        for (j = 0; j < filterSize; j++) {
+            t.val = (int)src[j][i + offset] * (int)filter[j];
+            val += t.v[1];
+        }
+        dest[i] = av_clip_uint8(val >> 3);
+    }
+}
+
+static void check_yuv2yuvX(void)
+{
+    struct SwsContext *ctx;
+    int fsi, osi, i, j;
+#define LARGEST_FILTER 16
+#define FILTER_SIZES 4
+    static const int filter_sizes[FILTER_SIZES] = {1, 4, 8, 16};
+
+    declare_func_emms(AV_CPU_FLAG_MMX, void, const int16_t *filter,
+                      int filterSize, const int16_t **src, uint8_t *dest,
+                      int dstW, const uint8_t *dither, int offset);
+
+    int dstW = SRC_PIXELS;
+    const int16_t **src;
+    LOCAL_ALIGNED_32(int16_t, src_pixels, [LARGEST_FILTER * SRC_PIXELS]);
+    LOCAL_ALIGNED_32(int16_t, filter_coeff, [LARGEST_FILTER]);
+    LOCAL_ALIGNED_32(uint8_t, dst0, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(uint8_t, dst1, [SRC_PIXELS]);
+    LOCAL_ALIGNED_32(uint8_t, dither, [SRC_PIXELS]);
+    union VFilterData{
+        const int16_t *src;
+        uint16_t coeff[8];
+    } *vFilterData;
+    uint8_t d_val = rnd();
+    randomize_buffers(filter_coeff, LARGEST_FILTER);
+    randomize_buffers(src_pixels, LARGEST_FILTER * SRC_PIXELS);
+    ctx = sws_alloc_context();
+    if (sws_init_context(ctx, NULL, NULL) < 0)
+        fail();
+
+    ff_getSwsFunc(ctx);
+    for(i = 0; i < SRC_PIXELS; ++i){
+        dither[i] = d_val;
+    }
+    for(osi = 0; osi < 64; osi += 16){
+        for(fsi = 0; fsi < FILTER_SIZES; ++fsi){
+            src = av_malloc(sizeof(int16_t*) * filter_sizes[fsi]);
+            vFilterData = av_malloc((filter_sizes[fsi] + 2) * sizeof(union VFilterData));
+            memset(vFilterData, 0, (filter_sizes[fsi] + 2) * sizeof(union VFilterData));
+            for(i = 0; i < filter_sizes[fsi]; ++i){
+                src[i] = &src_pixels[i * SRC_PIXELS];
+                vFilterData[i].src = src[i];
+                for(j = 0; j < 4; ++j){
+                    vFilterData[i].coeff[j + 4] = filter_coeff[i];
+                }
+            }
+            if (check_func(ctx->yuv2planeX, "yuv2yuvX_%d_%d", filter_sizes[fsi], osi)){
+                memset(dst0, 0, SRC_PIXELS * sizeof(dst0[0]));
+                memset(dst1, 0, SRC_PIXELS * sizeof(dst1[0]));
+
+                // The reference function is not the scalar function selected when mmx
+                // is deactivated, as the SIMD functions do not give the same result as
+                // the scalar ones due to rounding. The SIMD functions are only used
+                // when the flag SWS_ACCURATE_RND is not set.
+                ref_function(&filter_coeff[0], filter_sizes[fsi], src, dst0, dstW - osi, dither, osi);
+                // There's no point in calling new for the reference function
+                if(ctx->use_mmx_vfilter){
+                    call_new((const int16_t*)vFilterData, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
+                    if (memcmp(dst0, dst1, SRC_PIXELS * sizeof(dst0[0]))){
+                        fail();
+                    }
+                    bench_new((const int16_t*)vFilterData, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
+                }
+            }
+            av_free(src);
+            av_free(vFilterData);
+        }
+    }
+    sws_freeContext(ctx);
+#undef FILTER_SIZES
+}
+
+#undef SRC_PIXELS
 #define SRC_PIXELS 128
 
 static void check_hscale(void)
@@ -132,4 +230,6 @@ void checkasm_check_sw_scale(void)
 {
     check_hscale();
     report("hscale");
+    check_yuv2yuvX();
+    report("yuv2yuvX");
 }
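
To see why the remainder/pixelsProcessed split in YUV2YUVX_FUNC is byte-exact, here
is a small standalone model, not part of the patch: kernel(), STEP and WIDTH are
illustrative stand-ins for the ff_yuv2yuvX_<opt> entry points and their block
sizes, and the pmulhw step is written as a plain ">> 16" instead of the union used
by ref_function above. It checks the invariant the dither fix restores: a wide pass
over the STEP-aligned body followed by a tail pass must produce exactly the same
bytes as one full-width pass, because both passes start from the same dither base
rather than the tail re-deriving (and previously corrupting) it.

/* Standalone sketch -- hypothetical names, C model of the SIMD arithmetic. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define STEP  32   /* wide-kernel block: 32 pixels for sse3, 64 for avx2 */
#define WIDTH 147  /* deliberately not a multiple of STEP, to force a tail */
#define FSIZE 4

static uint8_t clip_uint8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

/* Scalar model of one ff_yuv2yuvX_<opt> call: fills dest[start..end). */
static void kernel(const int16_t **src, const int16_t *filter, int filterSize,
                   uint8_t *dest, int start, int end, uint8_t dither)
{
    int d = ((filterSize - 1) * 8 + dither) >> 4; /* same base as ref_function */
    for (int i = start; i < end; i++) {
        int16_t val = d;
        for (int j = 0; j < filterSize; j++)
            val += (int16_t)(((int)src[j][i] * filter[j]) >> 16); /* pmulhw */
        dest[i] = clip_uint8(val >> 3); /* psraw $3 + packuswb saturation */
    }
}

int main(void)
{
    static int16_t rows[FSIZE][WIDTH];
    const int16_t *src[FSIZE];
    const int16_t filter[FSIZE] = { 8192, 8192, 8192, 8192 };
    uint8_t whole[WIDTH], split[WIDTH];
    int remainder = WIDTH % STEP;
    int pixelsProcessed = WIDTH - remainder;

    for (int j = 0; j < FSIZE; j++) {
        src[j] = rows[j];
        for (int i = 0; i < WIDTH; i++)
            rows[j][i] = rand() & 0x7fff;
    }

    kernel(src, filter, FSIZE, whole, 0, WIDTH, 0x20);               /* one pass  */
    kernel(src, filter, FSIZE, split, 0, pixelsProcessed, 0x20);     /* main body */
    kernel(src, filter, FSIZE, split, pixelsProcessed, WIDTH, 0x20); /* mmx tail  */

    for (int i = 0; i < WIDTH; i++)
        if (whole[i] != split[i])
            return printf("mismatch at pixel %d\n", i), 1;
    puts("tail split matches full pass");
    return 0;
}

This is also what the enlarged checkasm input exercises: with SRC_PIXELS at 144,
the tested widths 144 - osi (osi = 0, 16, 32, 48) include values that are not
multiples of the 32- and 64-pixel sse3/avx2 blocks, so the mmx tail path runs and
the dither corruption would be caught by the memcmp against ref_function.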