From patchwork Thu Jul 26 11:28:06 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Darnley
X-Patchwork-Id: 9809
From: James Darnley
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 26 Jul 2018 13:28:06 +0200
Message-Id: <20180726112808.11792-2-jdarnley@obe.tv>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180726112808.11792-1-jdarnley@obe.tv>
References: <20180726112808.11792-1-jdarnley@obe.tv>
Subject: [FFmpeg-devel] [PATCH 1/3] diracdec: add 10-bit Haar SIMD functions
List-Id: FFmpeg development discussions and patches

Speed of ffmpeg when decoding a 720p yuv422p10 file encoded with the
relevant transform.
C:    119fps
SSE2: 204fps
AVX:  206fps
AVX2: 221fps

timer measurements, haar horizontal compose:
sse2: 3.68x faster (45143 vs. 12279 decicycles) compared with C
avx: 3.68x faster (45143 vs.
12275 decicycles) compared with C
avx2: 5.16x faster (45143 vs. 8742 decicycles) compared with C
haar vertical compose:
sse2: 1.64x faster (31792 vs. 19377 decicycles) compared with C
avx: 1.58x faster (31792 vs. 20090 decicycles) compared with C
avx2: 1.66x faster (31792 vs. 19157 decicycles) compared with C
---
 libavcodec/dirac_dwt.c                |   7 +-
 libavcodec/dirac_dwt.h                |   1 +
 libavcodec/x86/Makefile               |   6 +-
 libavcodec/x86/dirac_dwt_10bit.asm    | 160 ++++++++++++++++++++++++++
 libavcodec/x86/dirac_dwt_init_10bit.c |  76 ++++++++++++
 5 files changed, 247 insertions(+), 3 deletions(-)
 create mode 100644 libavcodec/x86/dirac_dwt_10bit.asm
 create mode 100644 libavcodec/x86/dirac_dwt_init_10bit.c

diff --git a/libavcodec/dirac_dwt.c b/libavcodec/dirac_dwt.c
index cc08f8865a..86bee5bb9b 100644
--- a/libavcodec/dirac_dwt.c
+++ b/libavcodec/dirac_dwt.c
@@ -59,8 +59,13 @@ int ff_spatial_idwt_init(DWTContext *d, DWTPlane *p, enum dwt_type type,
         return AVERROR_INVALIDDATA;
     }
 
-    if (ARCH_X86 && bit_depth == 8)
+#if ARCH_X86
+    if (bit_depth == 8)
         ff_spatial_idwt_init_x86(d, type);
+    else if (bit_depth == 10)
+        ff_spatial_idwt_init_10bit_x86(d, type);
+#endif
+
     return 0;
 }
 
diff --git a/libavcodec/dirac_dwt.h b/libavcodec/dirac_dwt.h
index 994dc21d70..1ad7b9a821 100644
--- a/libavcodec/dirac_dwt.h
+++ b/libavcodec/dirac_dwt.h
@@ -88,6 +88,7 @@ enum dwt_type {
 int ff_spatial_idwt_init(DWTContext *d, DWTPlane *p, enum dwt_type type,
                          int decomposition_count, int bit_depth);
 void ff_spatial_idwt_init_x86(DWTContext *d, enum dwt_type type);
+void ff_spatial_idwt_init_10bit_x86(DWTContext *d, enum dwt_type type);
 
 void ff_spatial_idwt_slice2(DWTContext *d, int y);
 
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 2350c8bbee..590d83c167 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -7,7 +7,8 @@ OBJS-$(CONFIG_BLOCKDSP)                += x86/blockdsp_init.o
 OBJS-$(CONFIG_BSWAPDSP)                += x86/bswapdsp_init.o
 OBJS-$(CONFIG_DCT)                     += x86/dct_init.o
 OBJS-$(CONFIG_DIRAC_DECODER)           +=
x86/diracdsp_init.o \
-                                          x86/dirac_dwt_init.o
+                                          x86/dirac_dwt_init.o \
+                                          x86/dirac_dwt_init_10bit.o
 OBJS-$(CONFIG_FDCTDSP)                 += x86/fdctdsp_init.o
 OBJS-$(CONFIG_FFT)                     += x86/fft_init.o
 OBJS-$(CONFIG_FLACDSP)                 += x86/flacdsp_init.o
@@ -153,7 +154,8 @@ X86ASM-OBJS-$(CONFIG_APNG_DECODER)     += x86/pngdsp.o
 X86ASM-OBJS-$(CONFIG_CAVS_DECODER)     += x86/cavsidct.o
 X86ASM-OBJS-$(CONFIG_DCA_DECODER)      += x86/dcadsp.o x86/synth_filter.o
 X86ASM-OBJS-$(CONFIG_DIRAC_DECODER)    += x86/diracdsp.o \
-                                          x86/dirac_dwt.o
+                                          x86/dirac_dwt.o \
+                                          x86/dirac_dwt_10bit.o
 X86ASM-OBJS-$(CONFIG_DNXHD_ENCODER)    += x86/dnxhdenc.o
 X86ASM-OBJS-$(CONFIG_EXR_DECODER)      += x86/exrdsp.o
 X86ASM-OBJS-$(CONFIG_FLAC_DECODER)     += x86/flacdsp.o
diff --git a/libavcodec/x86/dirac_dwt_10bit.asm b/libavcodec/x86/dirac_dwt_10bit.asm
new file mode 100644
index 0000000000..baea91329e
--- /dev/null
+++ b/libavcodec/x86/dirac_dwt_10bit.asm
@@ -0,0 +1,160 @@
+;******************************************************************************
+;* x86 optimized discrete 10-bit wavelet transform
+;* Copyright (c) 2018 James Darnley
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "libavutil/x86/x86util.asm"
+
+SECTION_RODATA
+
+cextern pd_1
+
+SECTION .text
+
+%macro HAAR_VERTICAL 0
+
+cglobal vertical_compose_haar_10bit, 3, 6, 4, b0, b1, w
+    DECLARE_REG_TMP 4,5
+
+    mova m2, [pd_1]
+    mov r3d, wd
+    and wd, ~(mmsize/4 - 1)
+    shl wd, 2
+    add b0q, wq
+    add b1q, wq
+    neg wq
+
+    ALIGN 16
+    .loop_simd:
+        mova m0, [b0q + wq]
+        mova m1, [b1q + wq]
+        paddd m3, m1, m2
+        psrad m3, 1
+        psubd m0, m3
+        paddd m1, m0
+        mova [b0q + wq], m0
+        mova [b1q + wq], m1
+        add wq, mmsize
+    jl .loop_simd
+
+    and r3d, mmsize/4 - 1
+    jz .end
+    .loop_scalar:
+        mov t0d, [b0q]
+        mov t1d, [b1q]
+        mov r2d, t1d
+        add r2d, 1
+        sar r2d, 1
+        sub t0d, r2d
+        add t1d, t0d
+        mov [b0q], t0d
+        mov [b1q], t1d
+
+        add b0q, 4
+        add b1q, 4
+        sub r3d, 1
+    jg .loop_scalar
+
+    .end:
+RET
+
+%endmacro
+
+%macro HAAR_HORIZONTAL 0
+
+cglobal horizontal_compose_haar_10bit, 3, 6+ARCH_X86_64, 4, b, temp_, w, x, b2
+    DECLARE_REG_TMP 2,5
+    %if ARCH_X86_64
+        %define tail r6d
+    %else
+        %define tail dword wm
+    %endif
+
+    mova m2, [pd_1]
+    xor xd, xd
+    shr wd, 1
+    mov tail, wd
+    lea b2q, [bq + 4*wq]
+
+    ALIGN 16
+    .loop_lo:
+        mova m0, [bq + 4*xq]
+        movu m1, [b2q + 4*xq]
+        paddd m1, m2
+        psrad m1, 1
+        psubd m0, m1
+        mova [temp_q + 4*xq], m0
+        add xd, mmsize/4
+        cmp xd, wd
+    jl .loop_lo
+
+    xor xd, xd
+    and wd, ~(mmsize/4 - 1)
+
+    ALIGN 16
+    .loop_hi:
+        mova m0, [temp_q + 4*xq]
+        movu m1, [b2q + 4*xq]
+        paddd m1, m0
+        paddd m0, m2
+        paddd m1, m2
+        psrad m0, 1
+        psrad m1, 1
+        SBUTTERFLY dq, 0,1,3
+        %if cpuflag(avx2)
+            SBUTTERFLY dqqq, 0,1,3
+        %endif
+        mova [bq + 8*xq], m0
+        mova [bq + 8*xq + mmsize], m1
+        add xd, mmsize/4
+        cmp xd, wd
+    jl .loop_hi
+
+    and tail, mmsize/4 - 1
+    jz .end
+    .loop_scalar:
+        mov t0d, [temp_q + 4*xq]
+        mov t1d, [b2q + 4*xq]
+        add t1d, t0d
+        add t0d, 1
+        add t1d, 1
+        sar t0d, 1
+        sar t1d, 1
+        mov [bq + 8*xq], t0d
+        mov [bq + 8*xq + 4], t1d
+        add xq, 1
+        sub tail, 1
+    jg .loop_scalar
+
+    .end:
+REP_RET
+
+%endmacro
+
+INIT_XMM sse2
+HAAR_HORIZONTAL
+HAAR_VERTICAL
+
+INIT_XMM avx
+HAAR_HORIZONTAL
+HAAR_VERTICAL
+
+INIT_YMM avx2
+HAAR_HORIZONTAL
+HAAR_VERTICAL
diff --git a/libavcodec/x86/dirac_dwt_init_10bit.c b/libavcodec/x86/dirac_dwt_init_10bit.c
new file mode 100644
index 0000000000..289862d728
--- /dev/null
+++ b/libavcodec/x86/dirac_dwt_init_10bit.c
@@ -0,0 +1,76 @@
+/*
+ * x86 optimized discrete wavelet transform
+ * Copyright (c) 2018 James Darnley
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/x86/asm.h"
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/dirac_dwt.h"
+
+void ff_horizontal_compose_haar_10bit_sse2(int32_t *b0, int32_t *b1, int width_align);
+void ff_horizontal_compose_haar_10bit_avx(int32_t *b0, int32_t *b1, int width_align);
+void ff_horizontal_compose_haar_10bit_avx2(int32_t *b0, int32_t *b1, int width_align);
+
+void ff_vertical_compose_haar_10bit_sse2(int32_t *b0, int32_t *b1, int width_align);
+void ff_vertical_compose_haar_10bit_avx(int32_t *b0, int32_t *b1, int width_align);
+void ff_vertical_compose_haar_10bit_avx2(int32_t *b0, int32_t *b1, int width_align);
+
+av_cold void ff_spatial_idwt_init_10bit_x86(DWTContext *d, enum dwt_type type)
+{
+#if HAVE_X86ASM
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_SSE2(cpu_flags)) {
+        switch (type) {
+        case DWT_DIRAC_HAAR0:
+            d->vertical_compose = (void*)ff_vertical_compose_haar_10bit_sse2;
+            break;
+        case DWT_DIRAC_HAAR1:
+            d->horizontal_compose = (void*)ff_horizontal_compose_haar_10bit_sse2;
+            d->vertical_compose = (void*)ff_vertical_compose_haar_10bit_sse2;
+            break;
+        }
+    }
+
+    if (EXTERNAL_AVX(cpu_flags)) {
+        switch (type) {
+        case DWT_DIRAC_HAAR0:
+            d->vertical_compose = (void*)ff_vertical_compose_haar_10bit_avx;
+            break;
+        case DWT_DIRAC_HAAR1:
+            d->horizontal_compose = (void*)ff_horizontal_compose_haar_10bit_avx;
+            d->vertical_compose = (void*)ff_vertical_compose_haar_10bit_avx;
+            break;
+        }
+    }
+
+    if (EXTERNAL_AVX2(cpu_flags)) {
+        switch (type) {
+        case DWT_DIRAC_HAAR0:
+            d->vertical_compose = (void*)ff_vertical_compose_haar_10bit_avx2;
+            break;
+        case DWT_DIRAC_HAAR1:
+            d->horizontal_compose = (void*)ff_horizontal_compose_haar_10bit_avx2;
+            d->vertical_compose = (void*)ff_vertical_compose_haar_10bit_avx2;
+            break;
+        }
+    }
+
+#endif // HAVE_X86ASM
+}