From patchwork Thu Jul 19 14:52:47 2018
X-Patchwork-Submitter: James Darnley
X-Patchwork-Id: 9758
From: James Darnley
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 19 Jul 2018 16:52:47 +0200
Message-Id: <20180719145252.30613-2-jdarnley@obe.tv>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20180719145252.30613-1-jdarnley@obe.tv>
References: <20180719145252.30613-1-jdarnley@obe.tv>
Subject: [FFmpeg-devel] [PATCH 1/6] diracdec: add 10-bit Haar SIMD functions

Speed of ffmpeg when decoding a 720p yuv422p10 file encoded with the
relevant transform.
C:    119fps
SSE2: 204fps
AVX:  206fps
AVX2: 221fps
---
 libavcodec/dirac_dwt.c                |   7 +-
 libavcodec/dirac_dwt.h                |   1 +
 libavcodec/x86/Makefile               |   6 +-
 libavcodec/x86/dirac_dwt_10bit.asm    | 113 +++++++++++++++++++++
 libavcodec/x86/dirac_dwt_init_10bit.c | 136 ++++++++++++++++++++++++++
 5 files changed, 260 insertions(+), 3 deletions(-)
 create mode 100644 libavcodec/x86/dirac_dwt_10bit.asm
 create mode 100644 libavcodec/x86/dirac_dwt_init_10bit.c

diff --git a/libavcodec/dirac_dwt.c b/libavcodec/dirac_dwt.c
index cc08f8865a..86bee5bb9b 100644
--- a/libavcodec/dirac_dwt.c
+++ b/libavcodec/dirac_dwt.c
@@ -59,8 +59,13 @@ int ff_spatial_idwt_init(DWTContext *d, DWTPlane *p, enum dwt_type type,
         return AVERROR_INVALIDDATA;
     }
 
-    if (ARCH_X86 && bit_depth == 8)
+#if ARCH_X86
+    if (bit_depth == 8)
         ff_spatial_idwt_init_x86(d, type);
+    else if (bit_depth == 10)
+        ff_spatial_idwt_init_10bit_x86(d, type);
+#endif
+
     return 0;
 }
 
diff --git a/libavcodec/dirac_dwt.h b/libavcodec/dirac_dwt.h
index 994dc21d70..1ad7b9a821 100644
--- a/libavcodec/dirac_dwt.h
+++ b/libavcodec/dirac_dwt.h
@@ -88,6 +88,7 @@ enum dwt_type {
 int ff_spatial_idwt_init(DWTContext *d, DWTPlane *p, enum dwt_type type,
                          int decomposition_count, int bit_depth);
 void ff_spatial_idwt_init_x86(DWTContext *d, enum dwt_type type);
+void ff_spatial_idwt_init_10bit_x86(DWTContext *d, enum dwt_type type);
 
 void ff_spatial_idwt_slice2(DWTContext *d, int y);
 
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 2350c8bbee..590d83c167 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_BLOCKDSP)                += x86/blockdsp_init.o
 OBJS-$(CONFIG_BSWAPDSP)                += x86/bswapdsp_init.o
 OBJS-$(CONFIG_DCT)                     += x86/dct_init.o
 OBJS-$(CONFIG_DIRAC_DECODER)           += x86/diracdsp_init.o                  \
-                                          x86/dirac_dwt_init.o
+                                          x86/dirac_dwt_init.o                 \
+                                          x86/dirac_dwt_init_10bit.o
 OBJS-$(CONFIG_FDCTDSP)                 += x86/fdctdsp_init.o
 OBJS-$(CONFIG_FFT)                     += x86/fft_init.o
 OBJS-$(CONFIG_FLACDSP)                 += x86/flacdsp_init.o
@@ -153,7 +154,8 @@ X86ASM-OBJS-$(CONFIG_APNG_DECODER)     += x86/pngdsp.o
 X86ASM-OBJS-$(CONFIG_CAVS_DECODER)     += x86/cavsidct.o
 X86ASM-OBJS-$(CONFIG_DCA_DECODER)      += x86/dcadsp.o x86/synth_filter.o
 X86ASM-OBJS-$(CONFIG_DIRAC_DECODER)    += x86/diracdsp.o                       \
-                                          x86/dirac_dwt.o
+                                          x86/dirac_dwt.o                      \
+                                          x86/dirac_dwt_10bit.o
 X86ASM-OBJS-$(CONFIG_DNXHD_ENCODER)    += x86/dnxhdenc.o
 X86ASM-OBJS-$(CONFIG_EXR_DECODER)      += x86/exrdsp.o
 X86ASM-OBJS-$(CONFIG_FLAC_DECODER)     += x86/flacdsp.o
diff --git a/libavcodec/x86/dirac_dwt_10bit.asm b/libavcodec/x86/dirac_dwt_10bit.asm
new file mode 100644
index 0000000000..dc3830615e
--- /dev/null
+++ b/libavcodec/x86/dirac_dwt_10bit.asm
@@ -0,0 +1,113 @@
+;******************************************************************************
+;* x86 optimized discrete 10-bit wavelet transform
+;* Copyright (c) 2018 James Darnley
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "libavutil/x86/x86util.asm"
+
+SECTION_RODATA
+
+cextern pd_1
+
+SECTION .text
+
+%macro HAAR_VERTICAL 0
+
+cglobal vertical_compose_haar_10bit, 3, 3, 4, b0, b1, w
+    mova m2, [pd_1]
+    shl wd, 2
+    add b0q, wq
+    add b1q, wq
+    neg wq
+
+    ALIGN 16
+    .loop:
+        mova m0, [b0q + wq]
+        mova m1, [b1q + wq]
+        paddd m3, m1, m2
+        psrad m3, 1
+        psubd m0, m3
+        paddd m1, m0
+        mova [b0q + wq], m0
+        mova [b1q + wq], m1
+        add wq, mmsize
+    jl .loop
+RET
+
+%endmacro
+
+%macro HAAR_HORIZONTAL 0
+
+cglobal horizontal_compose_haar_10bit, 3, 6, 4, b, temp_, w, x, b2
+    mova m2, [pd_1]
+    xor xd, xd
+    shr wd, 1
+    lea b2q, [bq + 4*wq]
+
+    ALIGN 16
+    .loop_lo:
+        mova m0, [bq + 4*xq]
+        movu m1, [b2q + 4*xq]
+        paddd m1, m2
+        psrad m1, 1
+        psubd m0, m1
+        mova [temp_q + 4*xq], m0
+        add xd, mmsize/4
+        cmp xd, wd
+    jl .loop_lo
+
+    xor xd, xd
+    and wd, ~(mmsize/4 - 1)
+    cmp wd, mmsize/4
+    jl .end
+
+    ALIGN 16
+    .loop_hi:
+        mova m0, [temp_q + 4*xq]
+        movu m1, [b2q + 4*xq]
+        paddd m1, m0
+        paddd m0, m2
+        paddd m1, m2
+        psrad m0, 1
+        psrad m1, 1
+        SBUTTERFLY dq, 0,1,3
+        %if cpuflag(avx2)
+            SBUTTERFLY dqqq, 0,1,3
+        %endif
+        mova [bq + 8*xq], m0
+        mova [bq + 8*xq + mmsize], m1
+        add xd, mmsize/4
+        cmp xd, wd
+    jl .loop_hi
+    .end:
+REP_RET
+
+%endmacro
+
+INIT_XMM sse2
+HAAR_HORIZONTAL
+HAAR_VERTICAL
+
+INIT_XMM avx
+HAAR_HORIZONTAL
+HAAR_VERTICAL
+
+INIT_YMM avx2
+HAAR_HORIZONTAL
+HAAR_VERTICAL
diff --git a/libavcodec/x86/dirac_dwt_init_10bit.c b/libavcodec/x86/dirac_dwt_init_10bit.c
new file mode 100644
index 0000000000..939950e3ff
--- /dev/null
+++ b/libavcodec/x86/dirac_dwt_init_10bit.c
@@ -0,0 +1,136 @@
+/*
+ * x86 optimized discrete wavelet transform
+ * Copyright (c) 2018 James Darnley
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/x86/asm.h"
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/dirac_dwt.h"
+
+void ff_horizontal_compose_haar_10bit_sse2(int32_t *b0, int32_t *b1, int width_align);
+void ff_horizontal_compose_haar_10bit_avx(int32_t *b0, int32_t *b1, int width_align);
+void ff_horizontal_compose_haar_10bit_avx2(int32_t *b0, int32_t *b1, int width_align);
+
+void ff_vertical_compose_haar_10bit_sse2(int32_t *b0, int32_t *b1, int width_align);
+void ff_vertical_compose_haar_10bit_avx(int32_t *b0, int32_t *b1, int width_align);
+void ff_vertical_compose_haar_10bit_avx2(int32_t *b0, int32_t *b1, int width_align);
+
+static void vertical_compose_haar_sse2(int32_t *b0, int32_t *b1, int width)
+{
+    int i, width_align = width & ~3;
+    ff_vertical_compose_haar_10bit_sse2(b0, b1, width_align);
+    for(i=width_align; i<width; i++) {
+        b0[i] = COMPOSE_HAARiL0(b0[i], b1[i]);
+        b1[i] = COMPOSE_HAARiH0(b1[i], b0[i]);
+    }
+}
+
+static void vertical_compose_haar_avx(int32_t *b0, int32_t *b1, int width)
+{
+    int i, width_align = width & ~3;
+    ff_vertical_compose_haar_10bit_avx(b0, b1, width_align);
+    for(i=width_align; i<width; i++) {
+        b0[i] = COMPOSE_HAARiL0(b0[i], b1[i]);
+        b1[i] = COMPOSE_HAARiH0(b1[i], b0[i]);
+    }
+}
+
+static void vertical_compose_haar_avx2(int32_t *b0, int32_t *b1, int width)
+{
+    int i, width_align = width & ~7;
+    ff_vertical_compose_haar_10bit_avx2(b0, b1, width_align);
+    for(i=width_align; i<width; i++) {
+        b0[i] = COMPOSE_HAARiL0(b0[i], b1[i]);
+        b1[i] = COMPOSE_HAARiH0(b1[i], b0[i]);
+    }
+}
+
+static void horizontal_compose_haar_sse2(int32_t *b, int32_t *tmp, int width)
+{
+    int i = width/2 & ~3;
+    ff_horizontal_compose_haar_10bit_sse2(b, tmp, width);
+    for (; i < width/2; i++) {
+        b[2*i  ] = (tmp[i] + 1) >> 1;
+        b[2*i+1] = (COMPOSE_HAARiH0(b[i + width/2], tmp[i]) + 1) >> 1;
+    }
+}
+
+static void horizontal_compose_haar_avx(int32_t *b, int32_t *tmp, int width)
+{
+    int i = width/2 & ~3;
+    ff_horizontal_compose_haar_10bit_avx(b, tmp, width);
+    for (; i < width/2; i++) {
+        b[2*i  ] = (tmp[i] + 1) >> 1;
+        b[2*i+1] = (COMPOSE_HAARiH0(b[i + width/2], tmp[i]) + 1) >> 1;
+    }
+}
+
+static void horizontal_compose_haar_avx2(int32_t *b, int32_t *tmp, int width)
+{
+    int i = width/2 & ~7;
+    ff_horizontal_compose_haar_10bit_avx2(b, tmp, width);
+    for (; i < width/2; i++) {
+        b[2*i  ] = (tmp[i] + 1) >> 1;
+        b[2*i+1] = (COMPOSE_HAARiH0(b[i + width/2], tmp[i]) + 1) >> 1;
+    }
+}
+
+av_cold void ff_spatial_idwt_init_10bit_x86(DWTContext *d, enum dwt_type type)
+{
+#if HAVE_X86ASM
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_SSE2(cpu_flags)) {
+        switch (type) {
+        case DWT_DIRAC_HAAR0:
+            d->vertical_compose = (void*)vertical_compose_haar_sse2;
+            break;
+        case DWT_DIRAC_HAAR1:
+            d->horizontal_compose = (void*)horizontal_compose_haar_sse2;
+            d->vertical_compose = (void*)vertical_compose_haar_sse2;
+            break;
+        }
+    }
+
+    if (EXTERNAL_AVX(cpu_flags)) {
+        switch (type) {
+        case DWT_DIRAC_HAAR0:
+            d->vertical_compose = (void*)vertical_compose_haar_avx;
+            break;
+        case DWT_DIRAC_HAAR1:
+            d->horizontal_compose = (void*)horizontal_compose_haar_avx;
+            d->vertical_compose = (void*)vertical_compose_haar_avx;
+            break;
+        }
+    }
+
+    if (EXTERNAL_AVX2(cpu_flags)) {
+        switch (type) {
+        case DWT_DIRAC_HAAR0:
+            d->vertical_compose = (void*)vertical_compose_haar_avx2;
+            break;
+        case DWT_DIRAC_HAAR1:
+            d->horizontal_compose = (void*)horizontal_compose_haar_avx2;
+            d->vertical_compose = (void*)vertical_compose_haar_avx2;
+            break;
+        }
+    }
+
+#endif // HAVE_X86ASM
+}
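
Note for reviewers: the following is a scalar sketch of what the two assembly
macros in this patch appear to compute per coefficient, read off the
HAAR_VERTICAL and HAAR_HORIZONTAL loops above. It is not part of the patch.
The names haar_vertical_ref and haar_horizontal_ref are hypothetical and do
not exist in FFmpeg, and the sketch assumes an even width and an arithmetic
right shift of negative values, as the decoder itself does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference, not FFmpeg API.
 * Mirrors the HAAR_VERTICAL loop: b0 -= (b1 + 1) >> 1, then b1 += b0.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static void haar_vertical_ref(int32_t *b0, int32_t *b1, int width)
{
    for (int i = 0; i < width; i++) {
        b0[i] -= (b1[i] + 1) >> 1;
        b1[i] += b0[i];
    }
}

/* Hypothetical scalar reference, not FFmpeg API.
 * Mirrors the HAAR_HORIZONTAL loops: the low band is b[0 .. width/2),
 * the high band is b[width/2 .. width). The first pass stores the
 * updated low band in tmp; the second pass rounds, halves and
 * interleaves both bands back into b. */
static void haar_horizontal_ref(int32_t *b, int32_t *tmp, int width)
{
    assert(width % 2 == 0); /* assumption of this sketch */
    int w2 = width / 2;

    for (int i = 0; i < w2; i++)
        tmp[i] = b[i] - ((b[w2 + i] + 1) >> 1);

    for (int i = 0; i < w2; i++) {
        /* Read the high-band sample before b[2*i+1] is written;
         * the two indices coincide on the last iteration. */
        int32_t hi = b[w2 + i] + tmp[i];
        b[2 * i]     = (tmp[i] + 1) >> 1;
        b[2 * i + 1] = (hi + 1) >> 1;
    }
}
```

Comparing these loops against the SIMD entry points over a range of widths,
including widths that are not a multiple of the vector size, would be one way
to check the scalar tail handling in the C wrappers.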