From patchwork Mon Jun 19 15:11:03 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Darnley
X-Patchwork-Id: 4040
From: James Darnley
To: FFmpeg development discussions and patches
Date: Mon, 19 Jun 2017 17:11:03 +0200
Message-Id: <20170619151104.31273-11-jdarnley@obe.tv>
X-Mailer: git-send-email 2.13.1
In-Reply-To: <20170619151104.31273-1-jdarnley@obe.tv>
References: <20170619151104.31273-1-jdarnley@obe.tv>
Subject: [FFmpeg-devel] [PATCH 10/11] avcodec/x86: add an 8-bit simple IDCT function based on the x86-64 high depth functions

Includes add/put functions

Rounding contributed by Ronald S.
Bultje
---
 libavcodec/tests/x86/dct.c                |  2 +
 libavcodec/x86/idctdsp_init.c             | 23 ++++++++
 libavcodec/x86/simple_idct.h              |  9 +++
 libavcodec/x86/simple_idct10.asm          | 92 +++++++++++++++++++++++++++++++
 libavcodec/x86/simple_idct10_template.asm |  6 +-
 5 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/libavcodec/tests/x86/dct.c b/libavcodec/tests/x86/dct.c
index 34f5b8767b..317d973f9f 100644
--- a/libavcodec/tests/x86/dct.c
+++ b/libavcodec/tests/x86/dct.c
@@ -88,10 +88,12 @@ static const struct algo idct_tab_arch[] = {
 #if HAVE_YASM
 #if ARCH_X86_64
 #if HAVE_SSE2_EXTERNAL
+    { "SIMPLE8-SSE2", ff_simple_idct8_sse2, FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_SSE2},
     { "SIMPLE10-SSE2", ff_simple_idct10_sse2, FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_SSE2},
     { "SIMPLE12-SSE2", ff_simple_idct12_sse2, FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_SSE2, 1 },
 #endif
 #if HAVE_AVX_EXTERNAL
+    { "SIMPLE8-AVX", ff_simple_idct8_avx, FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_AVX},
     { "SIMPLE10-AVX", ff_simple_idct10_avx, FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_AVX},
     { "SIMPLE12-AVX", ff_simple_idct12_avx, FF_IDCT_PERM_TRANSPOSE, AV_CPU_FLAG_AVX, 1 },
 #endif
diff --git a/libavcodec/x86/idctdsp_init.c b/libavcodec/x86/idctdsp_init.c
index f1c915aa00..9da60d1a1e 100644
--- a/libavcodec/x86/idctdsp_init.c
+++ b/libavcodec/x86/idctdsp_init.c
@@ -94,9 +94,32 @@ av_cold void ff_idctdsp_init_x86(IDCTDSPContext *c, AVCodecContext *avctx,
                 c->idct_add = ff_simple_idct_add_sse2;
                 c->perm_type = FF_IDCT_PERM_SIMPLE;
         }
+
+        if (ARCH_X86_64 &&
+            !high_bit_depth &&
+            avctx->lowres == 0 &&
+            (avctx->idct_algo == FF_IDCT_AUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEMMX)) {
+                c->idct = ff_simple_idct8_sse2;
+                c->idct_put = ff_simple_idct8_put_sse2;
+                c->idct_add = ff_simple_idct8_add_sse2;
+                c->perm_type = FF_IDCT_PERM_TRANSPOSE;
+        }
     }
 
     if (ARCH_X86_64 && avctx->lowres == 0) {
+        if (EXTERNAL_AVX(cpu_flags) &&
+            !high_bit_depth &&
+            (avctx->idct_algo == FF_IDCT_AUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
+                avctx->idct_algo == FF_IDCT_SIMPLEMMX)) {
+                c->idct = ff_simple_idct8_avx;
+                c->idct_put = ff_simple_idct8_put_avx;
+                c->idct_add = ff_simple_idct8_add_avx;
+                c->perm_type = FF_IDCT_PERM_TRANSPOSE;
+        }
+
         if (avctx->bits_per_raw_sample == 10 &&
             (avctx->idct_algo == FF_IDCT_AUTO ||
              avctx->idct_algo == FF_IDCT_SIMPLEAUTO ||
diff --git a/libavcodec/x86/simple_idct.h b/libavcodec/x86/simple_idct.h
index d17ef6a462..9b64cfe9bc 100644
--- a/libavcodec/x86/simple_idct.h
+++ b/libavcodec/x86/simple_idct.h
@@ -29,6 +29,15 @@ void ff_simple_idct_put_mmx(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
 void ff_simple_idct_add_sse2(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
 void ff_simple_idct_put_sse2(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
 
+void ff_simple_idct8_sse2(int16_t *block);
+void ff_simple_idct8_avx(int16_t *block);
+
+void ff_simple_idct8_put_sse2(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
+void ff_simple_idct8_put_avx(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
+
+void ff_simple_idct8_add_sse2(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
+void ff_simple_idct8_add_avx(uint8_t *dest, ptrdiff_t line_size, int16_t *block);
+
 void ff_simple_idct10_sse2(int16_t *block);
 void ff_simple_idct10_avx(int16_t *block);
 
diff --git a/libavcodec/x86/simple_idct10.asm b/libavcodec/x86/simple_idct10.asm
index b492303a57..069bb61378 100644
--- a/libavcodec/x86/simple_idct10.asm
+++ b/libavcodec/x86/simple_idct10.asm
@@ -31,11 +31,14 @@ SECTION_RODATA
 
 cextern pw_2
 cextern pw_16
+cextern pw_32
 cextern pw_1023
 cextern pw_4095
+pd_round_11: times 4 dd 1<<(11-1)
 pd_round_12: times 4 dd 1<<(12-1)
 pd_round_15: times 4 dd 1<<(15-1)
 pd_round_19: times 4 dd 1<<(19-1)
+pd_round_20: times 4 dd 1<<(20-1)
 
 %macro CONST_DEC 3
 const %1
@@ -77,8 +80,97 @@ CONST_DEC w3_min_w7_lo, W3sh2_lo, -W7sh2
 
 SECTION .text
 
+%macro STORE_HI_LO 12
+    movq %1, %9
+    movq %3, %10
+    movq %5, %11
+    movq %7, %12
+    movhps %2, %9
+    movhps %4, %10
+    movhps %6, %11
+    movhps %8, %12
+%endmacro
+
+%macro LOAD_ZXBW_8 16
+    pmovzxbw %1, %9
+    pmovzxbw %2, %10
+    pmovzxbw %3, %11
+    pmovzxbw %4, %12
+    pmovzxbw %5, %13
+    pmovzxbw %6, %14
+    pmovzxbw %7, %15
+    pmovzxbw %8, %16
+%endmacro
+
+%macro LOAD_ZXBW_4 9
+    movh %1, %5
+    movh %2, %6
+    movh %3, %7
+    movh %4, %8
+    punpcklbw %1, %9
+    punpcklbw %2, %9
+    punpcklbw %3, %9
+    punpcklbw %4, %9
+%endmacro
+
+%define PASS4ROWS(base, stride, stride3) \
+    [base], [base + stride], [base + 2*stride], [base + stride3]
+
 %macro idct_fn 0
 
+define_constants _lo
+
+cglobal simple_idct8, 1, 1, 16, 32, block
+    IDCT_FN "", 11, pw_32, 20, "store"
+RET
+
+cglobal simple_idct8_put, 3, 4, 16, 32, pixels, lsize, block
+    IDCT_FN "", 11, pw_32, 20
+    lea r3, [3*lsizeq]
+    lea r2, [pixelsq + r3]
+    packuswb m8, m0
+    packuswb m1, m2
+    packuswb m4, m11
+    packuswb m9, m10
+    STORE_HI_LO PASS8ROWS(pixelsq, r2, lsizeq, r3), m8, m1, m4, m9
+RET
+
+cglobal simple_idct8_add, 3, 4, 16, 32, pixels, lsize, block
+    IDCT_FN "", 11, pw_32, 20
+    lea r2, [3*lsizeq]
+    %if cpuflag(sse4)
+        lea r3, [pixelsq + r2]
+        LOAD_ZXBW_8 m3, m5, m6, m7, m12, m13, m14, m15, PASS8ROWS(pixelsq, r3, lsizeq, r2)
+        paddsw m8, m3
+        paddsw m0, m5
+        paddsw m1, m6
+        paddsw m2, m7
+        paddsw m4, m12
+        paddsw m11, m13
+        paddsw m9, m14
+        paddsw m10, m15
+    %else
+        pxor m12, m12
+        LOAD_ZXBW_4 m3, m5, m6, m7, PASS4ROWS(pixelsq, lsizeq, r2), m12
+        paddsw m8, m3
+        paddsw m0, m5
+        paddsw m1, m6
+        paddsw m2, m7
+        lea r3, [pixelsq + 4*lsizeq]
+        LOAD_ZXBW_4 m3, m5, m6, m7, PASS4ROWS(r3, lsizeq, r2), m12
+        paddsw m4, m3
+        paddsw m11, m5
+        paddsw m9, m6
+        paddsw m10, m7
+        lea r3, [pixelsq + r2]
+    %endif
+    packuswb m8, m0
+    packuswb m1, m2
+    packuswb m4, m11
+    packuswb m9, m10
+    STORE_HI_LO PASS8ROWS(pixelsq, r3, lsizeq, r2), m8, m1, m4, m9
+RET
+
 define_constants _hi
 
 cglobal simple_idct10, 1, 1, 16, block
diff --git a/libavcodec/x86/simple_idct10_template.asm b/libavcodec/x86/simple_idct10_template.asm
index 51baf84c82..02fd445ec0 100644
--- a/libavcodec/x86/simple_idct10_template.asm
+++ b/libavcodec/x86/simple_idct10_template.asm
@@ -258,6 +258,10 @@
 
     IDCT_1D %1, %2, %8
 %elif %2 == 11
+    ; This copies the DC-only shortcut. When there is only a DC coefficient the
+    ; C shifts the value and splats it to all coeffs rather than multiplying and
+    ; doing the full IDCT. This causes a difference on 8-bit because the
+    ; coefficient is 16383 rather than 16384 (which you can get with shifting).
     por m1, m8, m13
     por m1, m12
     por m1, [blockq+ 16] ; { row[1] }[0-7]
@@ -293,8 +297,6 @@
     por m9, m6
     pand m10, m5
     por m10, m6
-    pand m3, m5
-    por m3, m6
 %else
     IDCT_1D %1, %2
 %endif