From patchwork Sun Nov 26 22:51:09 2017
X-Patchwork-Submitter: James Darnley
X-Patchwork-Id: 6379
From: James Darnley
To: ffmpeg-devel@ffmpeg.org
Date: Sun, 26 Nov 2017 23:51:09 +0100
Message-Id: <20171126225111.5108-7-james.darnley@gmail.com>
X-Mailer: git-send-email 2.15.0
In-Reply-To: <20171126225111.5108-1-james.darnley@gmail.com>
References: <20171126225111.5108-1-james.darnley@gmail.com>
Subject: [FFmpeg-devel] [PATCH 6/8] lavc/x86/flac_dsp_gpl: partially unroll 32-bit LPC encoder

Around 1.1 times faster and reduces runtime by up to 6%.
---
 libavcodec/x86/flac_dsp_gpl.asm | 91 ++++++++++++++++++++++++++++++++---------
 1 file changed, 72 insertions(+), 19 deletions(-)

diff --git a/libavcodec/x86/flac_dsp_gpl.asm b/libavcodec/x86/flac_dsp_gpl.asm
index 952fc8b86b..91989ce560 100644
--- a/libavcodec/x86/flac_dsp_gpl.asm
+++ b/libavcodec/x86/flac_dsp_gpl.asm
@@ -152,13 +152,13 @@ RET
  %macro FUNCTION_BODY_32 0
  %if ARCH_X86_64
- cglobal flac_enc_lpc_32, 5, 7, 8, mmsize, res, smp, len, order, coefs
+ cglobal flac_enc_lpc_32, 5, 7, 8, mmsize*4, res, smp, len, order, coefs
  DECLARE_REG_TMP 5, 6
  %define length r2d
  movsxd orderq, orderd
  %else
- cglobal flac_enc_lpc_32, 5, 6, 8, mmsize, res, smp, len, order, coefs
+ cglobal flac_enc_lpc_32, 5, 6, 8, mmsize*4, res, smp, len, order, coefs
  DECLARE_REG_TMP 2, 5
  %define length r2mp
  %endif
@@ -189,18 +189,23 @@ mova [rsp], m4 ; save sign extend mask
  %define negj t1q
  .looplen:
+ ; process "odd" samples
  pxor m0, m0
  pxor m4, m4
  pxor m6, m6
  mov posj, orderq
  xor negj, negj
- .looporder:
+ .looporder1:
  movd m2, [coefsq+posj*4] ; c = coefs[j]
  SPLATD m2
- pmovzxdq m1, [smpq+negj*4-4] ; s = smp[i-j-1]
- pmovzxdq m5, [smpq+negj*4-4+mmsize/2]
- pmovzxdq m7, [smpq+negj*4-4+mmsize]
+ movu m1, [smpq+negj*4-4] ; s = smp[i-j-1]
+ movu m5, [smpq+negj*4-4+mmsize]
+ movu m7, [smpq+negj*4-4+mmsize*2]
+ ; Rather than explicitly unpack adjacent samples into qwords we can let
+ ; the pmuldq instruction unpack the 0th and 2nd samples for us when it
+ ; does its multiply. This saves an unpack for every sample in the inner
+ ; loop meaning it should be (much) quicker.
  pmuldq m1, m2
  pmuldq m5, m2
  pmuldq m7, m2
@@ -210,7 +215,7 @@ mova [rsp], m4 ; save sign extend mask
  dec negj
  inc posj
- jnz .looporder
+ jnz .looporder1
  HACK_PSRAQ m0, m3, [rsp], m2 ; p >>= shift
  HACK_PSRAQ m4, m3, [rsp], m2
@@ -218,22 +223,70 @@ mova [rsp], m4 ; save sign extend mask
  CLIPQ m0, [pq_int_min], [pq_int_max], m2 ; clip(p >> shift)
  CLIPQ m4, [pq_int_min], [pq_int_max], m2
  CLIPQ m6, [pq_int_min], [pq_int_max], m2
- pshufd m0, m0, q0020 ; pack into first 2 dwords
- pshufd m4, m4, q0020
- pshufd m6, m6, q0020
- movh m1, [smpq]
- movh m5, [smpq+mmsize/2]
- movh m7, [smpq+mmsize]
+ movu m1, [smpq]
+ movu m5, [smpq+mmsize]
+ movu m7, [smpq+mmsize*2]
  psubd m1, m0 ; smp[i] - p
  psubd m5, m4
  psubd m7, m6
- movh [resq], m1 ; res[i] = smp[i] - (p >> shift)
- movh [resq+mmsize/2], m5
- movh [resq+mmsize], m7
+ mova [rsp+mmsize], m1 ; res[i] = smp[i] - (p >> shift)
+ mova [rsp+mmsize*2], m5
+ mova [rsp+mmsize*3], m7
+
+ ; process "even" samples
+ pxor m0, m0
+ pxor m4, m4
+ pxor m6, m6
+ mov posj, orderq
+ xor negj, negj
+
+ .looporder2:
+ movd m2, [coefsq+posj*4] ; c = coefs[j]
+ SPLATD m2
+ movu m1, [smpq+negj*4] ; s = smp[i-j-1]
+ movu m5, [smpq+negj*4+mmsize]
+ movu m7, [smpq+negj*4+mmsize*2]
+ pmuldq m1, m2
+ pmuldq m5, m2
+ pmuldq m7, m2
+ paddq m0, m1 ; p += c * s
+ paddq m4, m5
+ paddq m6, m7
+
+ dec negj
+ inc posj
+ jnz .looporder2
+
+ HACK_PSRAQ m0, m3, [rsp], m2 ; p >>= shift
+ HACK_PSRAQ m4, m3, [rsp], m2
+ HACK_PSRAQ m6, m3, [rsp], m2
+ CLIPQ m0, [pq_int_min], [pq_int_max], m2 ; clip(p >> shift)
+ CLIPQ m4, [pq_int_min], [pq_int_max], m2
+ CLIPQ m6, [pq_int_min], [pq_int_max], m2
+ movu m1, [smpq+4]
+ movu m5, [smpq+4+mmsize]
+ movu m7, [smpq+4+mmsize*2]
+ psubd m1, m0 ; smp[i] - p
+ psubd m5, m4
+ psubd m7, m6
+
+ ; interleave odd and even samples
+ pslldq m1, 4
+ pslldq m5, 4
+ pslldq m7, 4
+
+ pblendw m1, [rsp+mmsize], q0303
+ pblendw m5, [rsp+mmsize*2], q0303
+ pblendw m7, [rsp+mmsize*3], q0303
+
+ movu [resq], m1
+ movu [resq+mmsize], m5
+ movu [resq+mmsize*2], m7
+
+ add resq, 3*mmsize
+ add smpq, 3*mmsize
+ sub length, (3*mmsize)/4
- add resq, (3*mmsize)/2
- add smpq, (3*mmsize)/2
- sub length, (3*mmsize)/8
  jg .looplen
  RET
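
For readers who want to check the idea behind the comment in .looporder1, here is a rough scalar C model of it. It is only a sketch, not the code in the patch: every function name below (pmuldq_model, sra64, clip_i32, sub_wrap, residual_ref, residual_two_pass) is made up for illustration, it models a single 4-dword register where the patch processes three registers per iteration, and the pass order and load offsets are simplified. It relies on pmuldq multiplying only dwords 0 and 2 of each operand as signed values, so one pass over unpacked samples predicts every second sample and a second pass shifted by one sample predicts the rest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar stand-in for pmuldq on one 128-bit register: only dwords 0 and 2
 * of each operand are used, sign-extended, giving two 64-bit products. */
static void pmuldq_model(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    out[0] = (int64_t)a[0] * b[0];
    out[1] = (int64_t)a[2] * b[2];
}

/* Portable arithmetic right shift (stands in for HACK_PSRAQ), 0 <= s < 64. */
static int64_t sra64(int64_t x, int s)
{
    return x < 0 ? ~(~x >> s) : x >> s;
}

/* Clamp to the int32 range (stands in for CLIPQ). */
static int32_t clip_i32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Wrapping 32-bit subtraction, like psubd. */
static int32_t sub_wrap(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a - (uint32_t)b);
}

/* Plain reference: res[i] = smp[i] - clip(sum(coefs[j] * smp[i-j-1]) >> shift). */
static void residual_ref(int32_t *res, const int32_t *smp, size_t len,
                         int order, const int32_t *coefs, int shift)
{
    for (size_t i = (size_t)order; i < len; i++) {
        int64_t p = 0;
        for (int j = 0; j < order; j++)
            p += (int64_t)coefs[j] * smp[i - j - 1];
        res[i] = sub_wrap(smp[i], clip_i32(sra64(p, shift)));
    }
}

/* Two-pass variant: pass 0 predicts samples i and i+2, pass 1 predicts
 * i+1 and i+3, each from unpacked dwords; the results are then interleaved. */
static void residual_two_pass(int32_t *res, const int32_t *smp, size_t len,
                              int order, const int32_t *coefs, int shift)
{
    assert(order >= 1);
    size_t i = (size_t)order;
    for (; i + 4 <= len; i += 4) {
        int32_t out[4];
        for (int pass = 0; pass < 2; pass++) {
            int64_t acc[2] = {0, 0};
            for (int j = 0; j < order; j++) {
                /* broadcast the coefficient, as SPLATD does */
                const int32_t c[4] = {coefs[j], coefs[j], coefs[j], coefs[j]};
                int64_t prod[2];
                pmuldq_model(&smp[i + pass - j - 1], c, prod);
                acc[0] += prod[0];
                acc[1] += prod[1];
            }
            for (int k = 0; k < 2; k++) {
                size_t n = i + pass + 2 * k;
                out[pass + 2 * k] = sub_wrap(smp[n], clip_i32(sra64(acc[k], shift)));
            }
        }
        for (int k = 0; k < 4; k++)
            res[i + k] = out[k];
    }
    /* leftover samples that do not fill a block of four */
    for (; i < len; i++) {
        int64_t p = 0;
        for (int j = 0; j < order; j++)
            p += (int64_t)coefs[j] * smp[i - j - 1];
        res[i] = sub_wrap(smp[i], clip_i32(sra64(p, shift)));
    }
}
```

Calling residual_ref and residual_two_pass on the same input should give identical residuals. The model says nothing about speed; it only illustrates why the explicit unpack can be dropped from the inner loop.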