From patchwork Thu Jul 26 11:28:07 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Darnley
X-Patchwork-Id: 9808
From: James Darnley
To: ffmpeg-devel@ffmpeg.org
Date: Thu, 26 Jul 2018 13:28:07 +0200
Message-Id: <20180726112808.11792-3-jdarnley@obe.tv>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180726112808.11792-1-jdarnley@obe.tv>
References: <20180726112808.11792-1-jdarnley@obe.tv>
Subject: [FFmpeg-devel] [PATCH 2/3] diracdec: add 10-bit Legall 5, 3 (5_3) SIMD functions

Speed of ffmpeg when decoding a 720p yuv422p10 file encoded with the
relevant transform.
C:    94fps
SSE2: 118fps
AVX2: 121fps

legall vertical hi
sse2: 3.86x faster (20201 vs. 5231 decicycles) compared with C
avx2: 6.70x faster (20201 vs. 3014 decicycles) compared with C

legall vertical lo
sse2: 1.50x faster (28345 vs. 18908 decicycles) compared with C
avx2: 1.63x faster (28345 vs. 17361 decicycles) compared with C
---
 libavcodec/x86/dirac_dwt_10bit.asm    | 105 +++++++++++++++++++++++++-
 libavcodec/x86/dirac_dwt_init_10bit.c |  13 ++++
 2 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/libavcodec/x86/dirac_dwt_10bit.asm b/libavcodec/x86/dirac_dwt_10bit.asm
index baea91329e..0295e6f554 100644
--- a/libavcodec/x86/dirac_dwt_10bit.asm
+++ b/libavcodec/x86/dirac_dwt_10bit.asm
@@ -21,9 +21,10 @@
 
 %include "libavutil/x86/x86util.asm"
 
-SECTION_RODATA
+SECTION_RODATA 32
 
 cextern pd_1
+pd_2: times 8 dd 2
 
 SECTION .text
 
@@ -147,9 +148,109 @@ REP_RET
 
 %endmacro
 
+%macro LEGALL53_VERTICAL_LO 0
+
+cglobal legall53_vertical_lo, 4, 6, 4, b0, b1, b2, w
+    DECLARE_REG_TMP 3,4,5
+
+    mova m3, [pd_2]
+    mov t2d, wd
+    and wd, ~(mmsize/4 - 1)
+    shl wd, 2
+    add b0q, wq
+    add b1q, wq
+    add b2q, wq
+    neg wq
+
+    ALIGN 16
+    .loop:
+    mova m0, [b0q + wq]
+    mova m1, [b1q + wq]
+    mova m2, [b2q + wq]
+    paddd m0, m2
+    paddd m0, m3
+    psrad m0, 2
+    psubd m1, m0
+    mova [b1q + wq], m1
+    add wq, mmsize
+    jl .loop
+
+    and t2d, mmsize/4 - 1
+    jz .end
+    .loop_scalar:
+    mov t0d, [b0q]
+    mov t1d, [b1q]
+    add t0d, [b2q]
+    add t0d, 2
+    sar t0d, 2
+    sub t1d, t0d
+    mov [b1q], t1d
+
+    add b0q, 4
+    add b1q, 4
+    add b2q, 4
+    sub t2d, 1
+    jg .loop_scalar
+
+    .end:
+RET
+
+%endmacro
+
+%macro LEGALL53_VERTICAL_HI 0
+
+cglobal legall53_vertical_hi, 4, 6, 4, b0, b1, b2, w
+    DECLARE_REG_TMP 3,4,5
+
+    mova m3, [pd_1]
+    mov t2d, wd
+    and wd, ~(mmsize/4 - 1)
+    shl wd, 2
+    add b0q, wq
+    add b1q, wq
+    add b2q, wq
+    neg wq
+
+    ALIGN 16
+    .loop:
+    mova m0, [b0q + wq]
+    mova m1, [b1q + wq]
+    mova m2, [b2q + wq]
+    paddd m0, m2
+    paddd m0, m3
+    psrad m0, 1
+    paddd m1, m0
+    mova [b1q + wq], m1
+    add wq, mmsize
+    jl .loop
+
+    and t2d, mmsize/4 - 1
+    jz .end
+    .loop_scalar:
+    mov t0d, [b0q]
+    mov t1d, [b1q]
+    add t0d, [b2q]
+    add t0d, 1
+    sar t0d, 1
+    add t1d, t0d
+    mov [b1q], t1d
+
+    add b0q, 4
+    add b1q, 4
+    add b2q, 4
+    sub t2d, 1
+    jg .loop_scalar
+
+    .end:
+RET
+
+%endmacro
+
 INIT_XMM sse2
 HAAR_HORIZONTAL
 HAAR_VERTICAL
+LEGALL53_VERTICAL_HI
+LEGALL53_VERTICAL_LO
 
 INIT_XMM avx
 HAAR_HORIZONTAL
@@ -158,3 +259,5 @@ HAAR_VERTICAL
 INIT_YMM avx2
 HAAR_HORIZONTAL
 HAAR_VERTICAL
+LEGALL53_VERTICAL_HI
+LEGALL53_VERTICAL_LO
diff --git a/libavcodec/x86/dirac_dwt_init_10bit.c b/libavcodec/x86/dirac_dwt_init_10bit.c
index 289862d728..d1234efac5 100644
--- a/libavcodec/x86/dirac_dwt_init_10bit.c
+++ b/libavcodec/x86/dirac_dwt_init_10bit.c
@@ -23,6 +23,11 @@
 #include "libavutil/x86/cpu.h"
 #include "libavcodec/dirac_dwt.h"
 
+void ff_legall53_vertical_hi_sse2(int32_t *b0, int32_t *b1, int32_t *b2, int width);
+void ff_legall53_vertical_lo_sse2(int32_t *b0, int32_t *b1, int32_t *b2, int width);
+void ff_legall53_vertical_hi_avx2(int32_t *b0, int32_t *b1, int32_t *b2, int width);
+void ff_legall53_vertical_lo_avx2(int32_t *b0, int32_t *b1, int32_t *b2, int width);
+
 void ff_horizontal_compose_haar_10bit_sse2(int32_t *b0, int32_t *b1, int width_align);
 void ff_horizontal_compose_haar_10bit_avx(int32_t *b0, int32_t *b1, int width_align);
 void ff_horizontal_compose_haar_10bit_avx2(int32_t *b0, int32_t *b1, int width_align);
@@ -38,6 +43,10 @@ av_cold void ff_spatial_idwt_init_10bit_x86(DWTContext *d, enum dwt_type type)
 
     if (EXTERNAL_SSE2(cpu_flags)) {
         switch (type) {
+        case DWT_DIRAC_LEGALL5_3:
+            d->vertical_compose_h0 = (void*)ff_legall53_vertical_hi_sse2;
+            d->vertical_compose_l0 = (void*)ff_legall53_vertical_lo_sse2;
+            break;
         case DWT_DIRAC_HAAR0:
             d->vertical_compose = (void*)ff_vertical_compose_haar_10bit_sse2;
             break;
@@ -62,6 +71,10 @@ av_cold void ff_spatial_idwt_init_10bit_x86(DWTContext *d, enum dwt_type type)
 
     if (EXTERNAL_AVX2(cpu_flags)) {
         switch (type) {
+        case DWT_DIRAC_LEGALL5_3:
+            d->vertical_compose_h0 = (void*)ff_legall53_vertical_hi_avx2;
+            d->vertical_compose_l0 = (void*)ff_legall53_vertical_lo_avx2;
+            break;
         case DWT_DIRAC_HAAR0:
             d->vertical_compose = (void*)ff_vertical_compose_haar_10bit_avx2;
             break;