From patchwork Sun Jul 9 00:49:22 2017
X-Patchwork-Submitter: Ivan Kalvachev
X-Patchwork-Id: 4270
From: Ivan Kalvachev
Date: Sun, 9 Jul 2017 03:49:22 +0300
To: ffmpeg-devel@ffmpeg.org
Subject: [FFmpeg-devel] [WIP][PATCH] v4 Opus Pyramid Vector Quantization Search in x86 SIMD asm

This should be the final work-in-progress patch.

What's changed:
1. Removed the conditional macro defines. The defaults seem to be optimal on
   all machines that I got benchmarks from: HADDPS and PHADDD are always
   slower, and the "BLEND"s are never slower than their emulation.
2. Removed SHORT_SYY_UPDATE. It is always slower.
3. Removed ALL_FLOAT_PRESEARCH; it is always slower. Also removed the ugly
   hack that used 256-bit ymm with AVX1 and integer operations.
4. Removed the remaining disabled code.
5. Used the HADDD macro from "x86util"; I don't need the result in all
   lanes/elements.
6. Used the v-prefix for all AVX code.
7. Small optimization: moved some of the HSUMPS into the K!=0 branch.
8. Small optimization: instead of pre-calculating 2*Y[i] and then correcting
   it on exit, it is possible to use Syy/2 in the distortion parameter
   calculations. This saves a few multiplications in the pre-search and in
   the sign-restore loop. It does, however, give a different approximation
   of sqrt(); it is not (consistently) better or worse than the previous
   approximation.
9. Used movdqa to load "const_int32_offsets". The wrong type might explain
   why directly using memory constants is sometimes faster.
10. Moved some code around and made minor tweaks.
---

I do not intend to remove "PRESEARCH_ROUNDING" and "USE_APPROXIMATION"
(though for the latter I think I will remove method #1; I've left it this
time just for consistency). These defines control the precision and the type
of results that the function generates. E.g. this function can produce the
same results as the opus functions with "PRESEARCH_ROUNDING 0". If you want
the same results as the improved ffmpeg function, then you need "approx #0";
it uses real division and is much slower on older CPUs, but reasonably fast
on anything recent.

I've left two other defines: "CONST_IN_X64_REG_IS_FASTER" and
"STALL_WRITE_FORWARDING". On Sandy Bridge and later, "const_in_x64" has
always been faster; on my CPU it is about the same. On Ryzen, "const_in_x64"
was consistently faster in all sse/avx variants, by about 5%, but not if
"stall_write" is enabled too. Ryzen (allegedly) has no write stalling, but
that method alone is just a few cycles faster (about 0.5%).

I'd like to see whether the changes I've made this time affect the above
results. The code is much cleaner now and you are free to nitpick on it.

There is one thing I'm not exactly sure I need. The function gets two integer
parameters, and I am not sure whether I have to sign extend them in 64-bit
mode in order to clear the high 32 bits. These parameters should never be
negative, so the sign itself is not needed, and all 32-bit operations should
also clear the high bits. Still, I'm not sure there is a guarantee that the
high bits won't contain garbage.
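To illustrate the defensive option, here is a minimal sketch (not part of the
patch): it simply enables the two movsxdifnidn lines that are currently
commented out at the top of pvq_search. movsxdifnidn comes from x86inc and
only emits a movsxd when the 64-bit and 32-bit register names actually
differ, so it is a no-op on 32-bit builds:

    cglobal pvq_search, 4, 6, 8, 256*4, inX, outY, K, N
        movsxdifnidn Nq, Nd     ; extend N into the full register, clearing the high 32 bits
        movsxdifnidn Kq, Kd     ; same for K
        shl          Nq, 2      ; Nq can then be used safely as a 64-bit index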
Best Regards

From 4591ad752d1d111615f320b17ad19d5fad0d91d4 Mon Sep 17 00:00:00 2001
From: Ivan Kalvachev
Date: Thu, 8 Jun 2017 22:24:33 +0300
Subject: [PATCH 1/5] SIMD opus pvq_search implementation v4

---
 libavcodec/opus_pvq.c              |   9 +
 libavcodec/opus_pvq.h              |   5 +-
 libavcodec/x86/Makefile            |   1 +
 libavcodec/x86/opus_dsp_init.c     |  47 ++++
 libavcodec/x86/opus_pvq_search.asm | 524 +++++++++++++++++++++++++++++++++++++
 5 files changed, 584 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/x86/opus_dsp_init.c
 create mode 100644 libavcodec/x86/opus_pvq_search.asm

diff --git a/libavcodec/opus_pvq.c b/libavcodec/opus_pvq.c
index 2ac66a0ede..6c7504296d 100644
--- a/libavcodec/opus_pvq.c
+++ b/libavcodec/opus_pvq.c
@@ -418,7 +418,13 @@ static uint32_t celt_alg_quant(OpusRangeCoder *rc, float *X, uint32_t N, uint32_
     int *y = pvq->qcoeff;
 
     celt_exp_rotation(X, N, blocks, K, spread, 1);
+
+    {
+    START_TIMER
     gain /= sqrtf(pvq->pvq_search(X, y, K, N));
+    STOP_TIMER("pvq_search");
+    }
+
     celt_encode_pulses(rc, y, N, K);
     celt_normalize_residual(y, X, N, gain);
     celt_exp_rotation(X, N, blocks, K, spread, 0);
@@ -947,6 +953,9 @@ int av_cold ff_celt_pvq_init(CeltPVQ **pvq)
     s->encode_band = pvq_encode_band;
     s->band_cost   = pvq_band_cost;
 
+    if (ARCH_X86 && CONFIG_OPUS_ENCODER)
+        ff_opus_dsp_init_x86(s);
+
     *pvq = s;
 
     return 0;
diff --git a/libavcodec/opus_pvq.h b/libavcodec/opus_pvq.h
index 6691494838..9246337360 100644
--- a/libavcodec/opus_pvq.h
+++ b/libavcodec/opus_pvq.h
@@ -33,8 +33,8 @@ float *lowband_scratch, int fill)
 
 struct CeltPVQ {
-    DECLARE_ALIGNED(32, int,   qcoeff      )[176];
-    DECLARE_ALIGNED(32, float, hadamard_tmp)[176];
+    DECLARE_ALIGNED(32, int,   qcoeff      )[256];
+    DECLARE_ALIGNED(32, float, hadamard_tmp)[256];
 
     float (*pvq_search)(float *X, int *y, int K, int N);
 
@@ -45,6 +45,7 @@ struct CeltPVQ {
 };
 
 int  ff_celt_pvq_init  (struct CeltPVQ **pvq);
+void ff_opus_dsp_init_x86(struct CeltPVQ *s);
 void ff_celt_pvq_uninit(struct CeltPVQ **pvq);
 
 #endif /* AVCODEC_OPUS_PVQ_H */
diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
index 0dbc46504e..6687fbcd72 100644
--- a/libavcodec/x86/Makefile
+++ b/libavcodec/x86/Makefile
@@ -52,6 +52,7 @@ OBJS-$(CONFIG_APNG_DECODER)            += x86/pngdsp_init.o
 OBJS-$(CONFIG_CAVS_DECODER)            += x86/cavsdsp.o
 OBJS-$(CONFIG_DCA_DECODER)             += x86/dcadsp_init.o x86/synth_filter_init.o
 OBJS-$(CONFIG_DNXHD_ENCODER)           += x86/dnxhdenc_init.o
+OBJS-$(CONFIG_OPUS_ENCODER)            += x86/opus_dsp_init.o x86/opus_pvq_search.o
 OBJS-$(CONFIG_HEVC_DECODER)            += x86/hevcdsp_init.o
 OBJS-$(CONFIG_JPEG2000_DECODER)        += x86/jpeg2000dsp_init.o
 OBJS-$(CONFIG_MLP_DECODER)             += x86/mlpdsp_init.o
diff --git a/libavcodec/x86/opus_dsp_init.c b/libavcodec/x86/opus_dsp_init.c
new file mode 100644
index 0000000000..ddccf6fe9e
--- /dev/null
+++ b/libavcodec/x86/opus_dsp_init.c
@@ -0,0 +1,47 @@
+/*
+ * Opus encoder assembly optimizations
+ * Copyright (C) 2017 Ivan Kalvachev
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "config.h"
+
+#include "libavutil/x86/cpu.h"
+#include "libavcodec/opus_pvq.h"
+
+extern float ff_pvq_search_sse2 (float *X, int *y, int K, int N);
+extern float ff_pvq_search_sse42(float *X, int *y, int K, int N);
+extern float ff_pvq_search_avx  (float *X, int *y, int K, int N);
+extern float ff_pvq_search_avx2 (float *X, int *y, int K, int N);
+
+av_cold void ff_opus_dsp_init_x86(CeltPVQ *s)
+{
+    int cpu_flags = av_get_cpu_flags();
+
+    if (EXTERNAL_SSE2(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse2;
+
+    if (EXTERNAL_SSE42(cpu_flags))
+        s->pvq_search = ff_pvq_search_sse42;
+
+    if (EXTERNAL_AVX(cpu_flags))
+        s->pvq_search = ff_pvq_search_avx;
+
+    if (EXTERNAL_AVX2(cpu_flags))
+        s->pvq_search = ff_pvq_search_avx2;
+}
diff --git a/libavcodec/x86/opus_pvq_search.asm b/libavcodec/x86/opus_pvq_search.asm
new file mode 100644
index 0000000000..f752ac367b
--- /dev/null
+++ b/libavcodec/x86/opus_pvq_search.asm
@@ -0,0 +1,524 @@
+; Opus encoder assembly optimizations
+; Copyright (C) 2017 Ivan Kalvachev
+; This file is part of FFmpeg
+
+%include "config.asm"
+%include "libavutil/x86/x86util.asm"
+
+%ifdef __NASM_VER__
+%use "smartalign"
+ALIGNMODE p6
+%endif
+
+
+%define CONST_IN_X64_REG_IS_FASTER 01
+%define STALL_WRITE_FORWARDING     0
+
+%define USE_APPROXIMATION          02
+%define PRESEARCH_ROUNDING         01
+
+
+;
+; Horizontal Sum Packed Single precision float
+; The resulting sum is in all elements.
+;
+%macro HSUMPS 2 ; dst, tmp
+%if cpuflag(avx)
+    %if sizeof%1>=32                ; avx
+        vperm2f128  %2, %1, %1, (0)*16+(1) ; %2=lo(%1)<<128+hi(%1)
+        vaddps      %1, %2
+    %endif
+    vshufps     %2, %1, %1, q1032   ; %2=[b,       a,       d,       c       ]
+    vaddps      %1, %2              ; %1=[b+d,     a+c,     d+b,     c+a     ]
+    vshufps     %2, %1, %1, q0321   ; %2=[c+a,     b+d,     a+c,     d+b     ]
+    vaddps      %1, %2              ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
+
+%elif 01 ; this form is a bit faster than the short avx-like emulation.
+    movaps      %2, %1              ;    [d,       c,       b,       a       ]
+    shufps      %1, %1, q1032       ; %1=[b,       a,       d,       c       ]
+    addps       %1, %2              ; %1=[b+d,     a+c,     d+b,     c+a     ]
+    movaps      %2, %1
+    shufps      %1, %1, q0321       ; %1=[c+a,     b+d,     a+c,     d+b     ]
+    addps       %1, %2              ; %1=[c+a+b+d, b+d+a+c, a+c+d+b, d+b+c+a ]
+    ; all %1 members should be equal for as long as float a+b==b+a
+%else
+    %error pick HSUMPS implementation
+%endif
+%endmacro
+
+
+;
+; Emulate blendvps if not available
+;
+; src_b is destroyed when using the emulation with logical operands
+; The SSE4.1 blend instruction is hard-wired to use xmm0 as the mask
+;
+%macro emu_blendvps 3 ; dst_a, src_b, mask
+%if cpuflag(avx)
+    vblendvps   %1, %1, %2, %3
+;-------------------------
+%elif cpuflag(sse42)
+    %if notcpuflag(avx)
+        %ifnidn %3,xmm0
+            %error sse41 blendvps uses xmm0 as default 3rd operand, you used %3
+        %endif
+    %endif
+    blendvps    %1, %2, %3
+;-------------------------
+%elif cpuflag(sse)
+    xorps       %2, %1
+    andps       %2, %3
+    xorps       %1, %2
+%else
+    %error "blendvps" emulation needs SSE
+%endif
+%endmacro
+
+;
+; Emulate pblendvb if not available
+;
+; src_b is destroyed when using the emulation with logical operands
+; The SSE4.1 blend instruction is hard-wired to use xmm0 as the mask
+;
+%macro emu_pblendvb 3 ; dst_a, src_b, mask
+%if cpuflag(avx)
+    %if cpuflag(avx) && notcpuflag(avx2) && sizeof%1 >= 32
+        %error AVX1 does not support integer 256bit ymm operations
+    %endif
+
+    vpblendvb   %1, %1, %2, %3
+;-------------------------
+%elif cpuflag(sse42)
+    %ifnidn %3,xmm0
+        %error sse41 pblendvb uses xmm0 as default 3rd operand, you used %3
+    %endif
+    pblendvb    %1, %2, %3
+;-------------------------
+%else
+    pxor        %2, %1
+    pand        %2, %3
+    pxor        %1, %2
+%endif
+%endmacro
+
+
+;
+; Emulate broadcastss if not available
+;
+%macro emu_broadcastss 2 ; dst_a, src_b
+%if cpuflag(avx2) && 01
+    vbroadcastss  %1, %2                ; ymm, xmm
+;-------------------------
+%elif cpuflag(avx) && 01
+    %ifnum sizeof%2                     ; avx1 register
+        vpermilps    xmm%1, xmm%2, q0000 ; xmm, xmm, imm || ymm, ymm, imm
+        %if sizeof%1 >= 32              ; mmsize>=32
+            vinsertf128  %1, %1, xmm%1, 1 ; ymm, ymm, xmm, im ; 3c 1/1
+;           vperm2f128   %1, %1, %1, 0    ; ymm, ymm, ymm     ; 2-3c 1/1
+        %endif
+    %else                               ; avx1 memory
+        vbroadcastss  %1, %2            ; ymm, mm32 || xmm, mm32
+    %endif
+;-------------------------
+%elif cpuflag(sse) && 1
+    %ifnum sizeof%2                     ; sse register
+        shufps      %1, %2, %2, q0000
+    %else                               ; sse memory
+        movss       %1, %2              ; this zeroes the other 3 elements
+        shufps      %1, %1, 0
+    %endif
+%else
+    %error invalid op, SSE needed for emulation
+%endif
+%endmacro
+
+;
+; Emulate pbroadcastd if not available
+;
+%macro emu_pbroadcastd 2 ; dst_a, src_b
+%if cpuflag(avx2) && 01
+    vpbroadcastd  %1, %2
+%elif cpuflag(avx) && mmsize >= 32
+    %error AVX1 does not support integer 256bit ymm operations
+%elif cpuflag(sse2)
+    %ifnum sizeof%2                     ; sse2 register
+        pshufd      %1, %2, q0000
+    %else                               ; sse memory
+        movd        %1, %2              ; movd zero extends
+        pshufd      %1, %1, 0
+    %endif
+%else
+    %error invalid op, SSE2 needed
+%endif
+%endmacro
+
+;
+; Set up a high register to be used
+; for holding memory constants
+;
+; %2 - the register to be used, assumes it is >= mm8
+; %3 - name of the constant.
+;
+; Subsequent opcodes are going to use the constant in the form
+; "addps m0, mm_const_name" and it would be turned into:
+; "addps m0, [const_name]" on 32 bit arch or
+; "addps m0, m8" on 64 bit arch
+%macro set_hi_mm_constant 3 ; movop, reg, const_name
+%if num_mmregs > 8 && CONST_IN_X64_REG_IS_FASTER
+    %define mm_%3 %2
+    %{1}        %2, [%3]                ; movaps m8, [const_name]
+%else
+    %define mm_%3 [%3]
+%endif
+%endmacro
+
+;
+; Load Effective Address of
+; a Position Independent Code
+; base address of a constant
+; %1 - the register to be used, if PIC is set
+; %2 - name of the constant.
+;
+; Subsequent opcodes are going to use the base address in the form
+; "movaps m0, [pic_base_constant_name+r4q]" and it would be turned into
+; "movaps m0, [r5q + r4q]" if PIC is enabled
+; "movaps m0, [constant_name + r4q]" if textrels are used
+%macro lea_pic_base 2 ; reg, const_label
+%ifdef PIC
+    lea         %1, [%2]                ; rip+const
+    %define pic_base_%2 %1
+%else
+    %define pic_base_%2 %2
+%endif
+%endmacro
+
+
+
+SECTION_RODATA 64
+
+const_float_abs_mask:   times 8 dd 0x7fffffff
+const_align_abs_edge:   times 8 dd 0
+
+const_float_0_5:        times 8 dd 0.5
+const_float_1:          times 8 dd 1.0
+const_float_sign_mask:  times 8 dd 0x80000000
+
+const_int32_offsets:
+                        %rep 8
+                                dd $-const_int32_offsets
+                        %endrep
+
+SECTION .text
+
+
+
+
+
+%macro PULSES_SEARCH 1
+; m6 Syy_norm
+; m7 Sxy_norm
+    addps       m6, mm_const_float_0_5  ; Syy_norm += 1.0/2
+    pxor        m1, m1                  ; max_idx
+    xorps       m3, m3                  ; p_max
+    xor         r4q, r4q
+align 16
+%%distortion_search:
+    movd        xmm2, dword r4d         ; movd zero extends
+%ifidn %1,add
+    movaps      m4, [tmpY + r4q]        ; y[i]
+    movaps      m5, [tmpX + r4q]        ; X[i]
+
+  %if USE_APPROXIMATION >= 1
+    xorps       m0, m0
+    cmpps       m0, m0, m5, 4           ; m0 = (X[i] != 0.0)
+  %endif
+
+    addps       m4, m6                  ; m4 = Syy_new = y[i] + Syy_norm
+    addps       m5, m7                  ; m5 = Sxy_new = X[i] + Sxy_norm
+
+  %if USE_APPROXIMATION >= 1
+    andps       m5, m0                  ; if (X[i] == 0) Sxy_new = 0; prevent the approximation error from setting pulses in the array padding.
+  %endif
+
+%else
+    movaps      m5, [tmpY + r4q]        ; m5 = y[i]
+
+    xorps       m0, m0                  ; m0 = 0;
+    cmpps       m0, m0, m5, 1           ; m0 = (0 p_max)
+%endif
+
+    maxps       m3, m5                  ; m3 = max(p_max, p)
+    ; maxps here is faster than blendvps, despite blend having lower latency.
+
+    pand        m2, m0                  ; this version seems faster than sse41 pblendvb
+    pmaxsw      m1, m2                  ; SSE2 signed word, so it would work for N < 32768/4
+
+    add         r4q, mmsize
+    cmp         r4q, Nq
+    jb          %%distortion_search
+
+    por         m1, mm_const_int32_offsets ; max_idx offset per individual lane (skipped in the inner loop)
+    movdqa      m4, m1                  ; needed for the aligned y[max_idx] += 1; processing
+
+%if mmsize >= 32
+; merge parallel maximums round 2 (4 vs 4)
+
+    vextractf128 xmm5, ymm3, 1          ; xmm5 = ymm3[1x128] = ymm3[255..128b]
+    vcmpps      xmm0, xmm3, xmm5, 1     ; m0 = (m3 < m5) = ( p[0x128] < p[1x128] )
+
+    vextracti128 xmm2, ymm1, 1          ; xmm2 = ymm1[1x128] = ymm1[255..128b]
+    emu_blendvps xmm3, xmm5, xmm0       ; max_idx = m0 ? max_idx[1x128] : max_idx[0x128]
+    emu_pblendvb xmm1, xmm2, xmm0       ; p = m0 ? p[1x128] : p[0x128]
+%endif
+
+; merge parallel maximums round 1 (2 vs 2)
+                                        ; m3=p[3210]
+    movhlps     xmm5, xmm3              ; m5=p[xx32]
+    cmpps       xmm0, xmm3, xmm5, 1     ; m0 = (m3 < m5) = ( p[1,0] < p[3,2] )
+
+    pshufd      xmm2, xmm1, q3232
+    emu_blendvps xmm3, xmm5, xmm0       ; max_idx = m0 ? max_idx[3,2] : max_idx[1,0]
+    emu_pblendvb xmm1, xmm2, xmm0       ; p = m0 ? p[3,2] : p[1,0]
+
+; merge parallel maximums final round (1 vs 1)
+    movaps      xmm0, xmm3              ; m0 = m3[0] = p[0]
+    shufps      xmm3, xmm3, q1111       ; m3 = m3[1] = p[1]
+    cmpless     xmm0, xmm3
+
+    pshufd      xmm2, xmm1, q1111
+    emu_pblendvb xmm1, xmm2, xmm0
+
+    movd        dword r4d, xmm1         ; zero extends to the rest of r4q
+
+    emu_broadcastss m3, [tmpX + r4q]
+    %{1}ps      m7, m3                  ; Sxy += X[max_idx]
+
+    emu_broadcastss m5, [tmpY + r4q]
+    %{1}ps      m6, m5                  ; Syy += Y[max_idx]
+%if STALL_WRITE_FORWARDING == 0
+    ; This code is big and ugly,
+    ; but it prevents stalling on store forwarding
+    ; by writing a smaller size and then doing a
+    ; big load from the same address.
+    ; This helps with N<100, which is the typical case.
+    emu_pbroadcastd m1, xmm1
+    pcmpeqd     m1, m1, m4              ; only 1 element matches the address of the new p_max
+
+    and         r4q, ~(mmsize-1)        ; align down, so max_idx is (somewhere) inside the mmsize load
+    movaps      m5, [tmpY + r4q]        ; m5 = Y[y3...ym...y0]
+    andps       m1, mm_const_float_1    ; m1 = [ 0 ... 1.0 ... 0 ]
+    %{1}ps      m5, m1                  ; m5 = Y[y3...ym...y0] +/- [0...1.0...0]
+    movaps      [tmpY + r4q], m5        ; Y[max_idx] +-= 1.0;
+%else
+    ; This code is shorter, simpler and slower,
+    ; because it breaks store forwarding:
+    ; it writes a smaller size at an unaligned address where we later do a bigger aligned load.
+    ; When we have bigger vectors (N>100) the store might complete before we try to load it.
+    %{1}ps      m5, mm_const_float_1
+    movss       [tmpY + r4q], xmm5      ; Y[max_idx] += 1.0;
+%endif
+
+%endmacro
+
+;
+; Pyramid Vector Quantization Search implementation
+;
+; float * inX   - Unaligned (SIMD) access; it will be overread,
+;                 but the extra data will be masked away.
+; int32 * outY  - Should be an aligned and padded buffer.
+;                 It is used as a temp buffer.
+; uint32 K      - Number of pulses to have after quantization.
+; uint32 N      - Number of vector elements. Must be 0 < N < 256
+;
+%macro PVQ_FAST_SEARCH 0
+cglobal pvq_search, 4, 6, 8, 256*4, inX, outY, K, N
+%define tmpX rsp
+%define tmpY outYq
+
+
+;   movsxdifnidn Nq, Nd
+;   movsxdifnidn Kq, Kd
+    movaps      m0, [const_float_abs_mask]
+    shl         Nq, 2                   ; N *= sizeof(float)
+    mov         r4q, Nq
+
+    neg         r4q
+    and         r4q, mmsize-1
+
+    lea_pic_base r5q, const_align_abs_edge ; rip+const
+    movups      m3, [pic_base_const_align_abs_edge + r4q - mmsize]
+
+    add         Nq, r4q                 ; Nq = align(Nq, mmsize)
+
+    lea         r4q, [Nq - mmsize]      ; Nq is rounded up (aligned up) to mmsize, so r4q can't become negative here, unless N=0.
+    movups      m2, [inXq + r4q]
+    andps       m2, m3
+    movaps      [tmpX + r4q], m2
+    movaps      m1, m2                  ; Sx
+
+align 16
+%%loop_abs_sum:
+    sub         r4q, mmsize
+    jc          %%end_loop_abs_sum
+
+    movups      m2, [inXq + r4q]
+    andps       m2, m0
+
+    movaps      [tmpX + r4q], m2        ; tmpX[i] = abs(X[i])
+    addps       m1, m2
+    jmp         %%loop_abs_sum
+
+
+align 16
+%%end_loop_abs_sum:
+
+    HSUMPS      m1, m2                  ; m1 = Sx
+
+    xorps       m0, m0
+    comiss      xmm0, xmm1              ;
+    jz          %%zero_input            ; if (Sx == 0) goto zero_input
+
+    cvtsi2ss    xmm0, dword Kd          ; m0 = K
+%if USE_APPROXIMATION >= 1 && 0
+    rcpss       xmm1, xmm1              ; m1 = approx(1/Sx)
+    mulss       xmm0, xmm1              ; m0 = K*(1/Sx)
+%else
+    divss       xmm0, xmm1              ; b = K/Sx
+                                        ; b = K/max_x
+%endif
+
+    emu_broadcastss m0, xmm0
+
+    lea         r4q, [Nq - mmsize]
+    pxor        m5, m5                  ; Sy  ( Sum of y[i]      )
+    xorps       m6, m6                  ; Syy ( Sum of y[i]*y[i] )
+    xorps       m7, m7                  ; Sxy ( Sum of X[i]*y[i] )
+align 16
+%%loop_guess:
+    movaps      m1, [tmpX + r4q]        ; m1 = X[i]
+    mulps       m2, m0, m1              ; m2 = res*X[i]
+  %if PRESEARCH_ROUNDING == 0
+    cvttps2dq   m2, m2                  ; yt = (int)truncf( res*X[i] )
+  %else
+    cvtps2dq    m2, m2                  ; yt = (int)lrintf( res*X[i] )
+  %endif
+    paddd       m5, m2                  ; Sy  += yt
+    cvtdq2ps    m2, m2                  ; yt   = (float)yt
+    mulps       m1, m2                  ; m1   = X[i]*yt
+    movaps      [tmpY + r4q], m2        ; y[i] = m2
+    addps       m7, m1                  ; Sxy += m1;
+    mulps       m2, m2                  ; m2   = yt*yt
+    addps       m6, m2                  ; Syy += m2
+
+    sub         r4q, mmsize
+    jnc         %%loop_guess
+
+    HSUMPS      m6, m1                  ; Syy_norm
+
+    HADDD       m5, m4                  ; pulses
+
+    movd        dword r4d, xmm5         ; zero extends to the rest of r4q
+
+    sub         Kq, r4q                 ; K -= pulses
+    jz          %%finish                ; K - pulses == 0
+
+    set_hi_mm_constant movaps, m10, const_float_0_5
+    set_hi_mm_constant movaps, m11, const_float_1
+    set_hi_mm_constant movdqa, m13, const_int32_offsets
+    ; Use Syy/2 in the distortion parameter calculations.
+    ; Saves the pre- and post-calculation that corrects the Y[] values.
+    ; Same precision, since the float mantissa is normalized.
+    ; The SQRT approximation does differ.
+    HSUMPS      m7, m0                  ; Sxy_norm
+    mulps       m6, mm_const_float_0_5
+
+    jc          %%remove_pulses_loop    ; K - pulses < 0
+
+align 16                                ; K - pulses > 0
+%%add_pulses_loop:
+    PULSES_SEARCH add                   ; m6 Syy_norm ; m7 Sxy_norm
+
+    sub         Kq, 1
+    jnz         %%add_pulses_loop
+
+    addps       m6, m6                  ; Syy *= 2
+
+    jmp         %%finish
+
+align 16
+%%remove_pulses_loop:
+    PULSES_SEARCH sub                   ; m6 Syy_norm ; m7 Sxy_norm
+
+    add         Kq, 1
+    jnz         %%remove_pulses_loop
+
+    addps       m6, m6                  ; Syy *= 2
+
+align 16
+%%finish:
+    lea         r4q, [Nq - mmsize]
+
+    movaps      m2, [const_float_sign_mask]
+align 16
+%%restore_sign_loop:
+    movaps      m0, [tmpY + r4q]        ; m0 = Y[i]
+    movups      m1, [inXq + r4q]        ; m1 = X[i]
+    andps       m1, m2                  ; m1 = sign(X[i])
+    orps        m0, m1                  ; m0 = Y[i]*sign
+    cvtps2dq    m3, m0                  ; m3 = (int)m0
+    movaps      [outYq + r4q], m3
+
+    sub         r4q, mmsize
+    jnc         %%restore_sign_loop
+%%return:
+
+%if ARCH_X86_64 == 0    ; sbrdsp
+    movss       r0m, xmm6               ; return (float)Syy_norm
+    fld         dword r0m
+%else
+    movaps      m0, m6                  ; return (float)Syy_norm
+%endif
+    RET
+
+align 16
+%%zero_input:
+    lea         r4q, [Nq - mmsize]
+    xorps       m0, m0
+%%zero_loop:
+    movaps      [outYq + r4q], m0
+    sub         r4q, mmsize
+    jnc         %%zero_loop
+
+    movaps      m6, [const_float_1]
+    jmp         %%return
+%endmacro
+
+
+INIT_XMM sse2
+PVQ_FAST_SEARCH
+
+INIT_XMM sse42
+PVQ_FAST_SEARCH
+
+INIT_XMM avx
+PVQ_FAST_SEARCH
+
+INIT_YMM avx2
+PVQ_FAST_SEARCH
-- 
2.13.0