From patchwork Sun Aug 25 08:18:00 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: James Cowgill X-Patchwork-Id: 14700 Return-Path: X-Original-To: patchwork@ffaux-bg.ffmpeg.org Delivered-To: patchwork@ffaux-bg.ffmpeg.org Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by ffaux.localdomain (Postfix) with ESMTP id 3703A4481A1 for ; Sun, 25 Aug 2019 11:19:17 +0300 (EEST) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 0BB2168A7E6; Sun, 25 Aug 2019 11:19:17 +0300 (EEST) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from relay4-d.mail.gandi.net (relay4-d.mail.gandi.net [217.70.183.196]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id CBB8168A761 for ; Sun, 25 Aug 2019 11:19:09 +0300 (EEST) X-Originating-IP: 188.29.164.50 Received: from dummy (188.29.164.50.threembb.co.uk [188.29.164.50]) (Authenticated sender: jcowgill@jcowgill.uk) by relay4-d.mail.gandi.net (Postfix) with ESMTPSA id 47878E0006; Sun, 25 Aug 2019 08:19:04 +0000 (UTC) From: James Cowgill To: ffmpeg-devel@ffmpeg.org Date: Sun, 25 Aug 2019 09:18:00 +0100 Message-Id: <20190825081800.9412-1-jcowgill@debian.org> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20190728214606.15179-1-jcowgill@debian.org> References: <20190728214606.15179-1-jcowgill@debian.org> MIME-Version: 1.0 Subject: [FFmpeg-devel] [PATCH v2] avcodec/arm/sbcenc: avoid callee preserved vfp registers X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: James Cowgill Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" When compiling FFmpeg with GCC-9, some very random segfaults were observed in code which had previously called down into the SBC encoder NEON assembly routines. This was caused by these functions clobbering some of the vfp callee saved registers (d8 - d15 aka q4 - q7). GCC was using these registers to save local variables, but after these functions returned, they would contain garbage. Fix by reallocating the registers in the two affected functions in the following way: ff_sbc_analyze_4_neon: q2-q5 => q8-q11, then q1-q4 => q8-q11 ff_sbc_analyze_8_neon: q2-q9 => q8-q15 The reason for using these replacements is to keep closely related sets of registers consecutively numbered which hopefully makes the code more easy to follow. Since this commit only reallocates registers, it should have no performance impact. Signed-off-by: James Cowgill --- On 29/07/2019 19:59, Reimar Döffinger wrote: > Seems sensible to me, though extra points if you or someone has numbers on performance impact. > To know whether it would be worthwhile to check if it can be optimized... Sorry for the long delay - been on various holidays. I did a few tests on my original patch and overall it was about 2% slower than before. In any case I think this new patch is a better solution (although the diff is a lot larger). We don't actually need that many registers in either of these functions, so instead of pushing the clobbered callee saved registers, we can reallocate all the registers to avoid them in the first place. This way there is no performance impact. I couldn't find any tests for this encoder, but I have tested a few audio samples with it and verified the output is identical to what t was before (and with what I get on x86). libavcodec/arm/sbcdsp_neon.S | 220 +++++++++++++++++------------------ 1 file changed, 110 insertions(+), 110 deletions(-) diff --git a/libavcodec/arm/sbcdsp_neon.S b/libavcodec/arm/sbcdsp_neon.S index d83d21d202..914abfb6cc 100644 --- a/libavcodec/arm/sbcdsp_neon.S +++ b/libavcodec/arm/sbcdsp_neon.S @@ -38,49 +38,49 @@ function ff_sbc_analyze_4_neon, export=1 /* TODO: merge even and odd cases (or even merge all four calls to this * function) in order to have only aligned reads from 'in' array * and reduce number of load instructions */ - vld1.16 {d4, d5}, [r0, :64]! - vld1.16 {d8, d9}, [r2, :128]! + vld1.16 {d16, d17}, [r0, :64]! + vld1.16 {d20, d21}, [r2, :128]! - vmull.s16 q0, d4, d8 - vld1.16 {d6, d7}, [r0, :64]! - vmull.s16 q1, d5, d9 - vld1.16 {d10, d11}, [r2, :128]! + vmull.s16 q0, d16, d20 + vld1.16 {d18, d19}, [r0, :64]! + vmull.s16 q1, d17, d21 + vld1.16 {d22, d23}, [r2, :128]! - vmlal.s16 q0, d6, d10 - vld1.16 {d4, d5}, [r0, :64]! - vmlal.s16 q1, d7, d11 - vld1.16 {d8, d9}, [r2, :128]! + vmlal.s16 q0, d18, d22 + vld1.16 {d16, d17}, [r0, :64]! + vmlal.s16 q1, d19, d23 + vld1.16 {d20, d21}, [r2, :128]! - vmlal.s16 q0, d4, d8 - vld1.16 {d6, d7}, [r0, :64]! - vmlal.s16 q1, d5, d9 - vld1.16 {d10, d11}, [r2, :128]! + vmlal.s16 q0, d16, d20 + vld1.16 {d18, d19}, [r0, :64]! + vmlal.s16 q1, d17, d21 + vld1.16 {d22, d23}, [r2, :128]! - vmlal.s16 q0, d6, d10 - vld1.16 {d4, d5}, [r0, :64]! - vmlal.s16 q1, d7, d11 - vld1.16 {d8, d9}, [r2, :128]! + vmlal.s16 q0, d18, d22 + vld1.16 {d16, d17}, [r0, :64]! + vmlal.s16 q1, d19, d23 + vld1.16 {d20, d21}, [r2, :128]! - vmlal.s16 q0, d4, d8 - vmlal.s16 q1, d5, d9 + vmlal.s16 q0, d16, d20 + vmlal.s16 q1, d17, d21 vpadd.s32 d0, d0, d1 vpadd.s32 d1, d2, d3 vrshrn.s32 d0, q0, SBC_PROTO_FIXED_SCALE - vld1.16 {d2, d3, d4, d5}, [r2, :128]! + vld1.16 {d16, d17, d18, d19}, [r2, :128]! vdup.i32 d1, d0[1] /* TODO: can be eliminated */ vdup.i32 d0, d0[0] /* TODO: can be eliminated */ - vmull.s16 q3, d2, d0 - vmull.s16 q4, d3, d0 - vmlal.s16 q3, d4, d1 - vmlal.s16 q4, d5, d1 + vmull.s16 q10, d16, d0 + vmull.s16 q11, d17, d0 + vmlal.s16 q10, d18, d1 + vmlal.s16 q11, d19, d1 - vpadd.s32 d0, d6, d7 /* TODO: can be eliminated */ - vpadd.s32 d1, d8, d9 /* TODO: can be eliminated */ + vpadd.s32 d0, d20, d21 /* TODO: can be eliminated */ + vpadd.s32 d1, d22, d23 /* TODO: can be eliminated */ vst1.32 {d0, d1}, [r1, :128] @@ -91,57 +91,57 @@ function ff_sbc_analyze_8_neon, export=1 /* TODO: merge even and odd cases (or even merge all four calls to this * function) in order to have only aligned reads from 'in' array * and reduce number of load instructions */ - vld1.16 {d4, d5}, [r0, :64]! - vld1.16 {d8, d9}, [r2, :128]! - - vmull.s16 q6, d4, d8 - vld1.16 {d6, d7}, [r0, :64]! - vmull.s16 q7, d5, d9 - vld1.16 {d10, d11}, [r2, :128]! - vmull.s16 q8, d6, d10 - vld1.16 {d4, d5}, [r0, :64]! - vmull.s16 q9, d7, d11 - vld1.16 {d8, d9}, [r2, :128]! - - vmlal.s16 q6, d4, d8 - vld1.16 {d6, d7}, [r0, :64]! - vmlal.s16 q7, d5, d9 - vld1.16 {d10, d11}, [r2, :128]! - vmlal.s16 q8, d6, d10 - vld1.16 {d4, d5}, [r0, :64]! - vmlal.s16 q9, d7, d11 - vld1.16 {d8, d9}, [r2, :128]! - - vmlal.s16 q6, d4, d8 - vld1.16 {d6, d7}, [r0, :64]! - vmlal.s16 q7, d5, d9 - vld1.16 {d10, d11}, [r2, :128]! - vmlal.s16 q8, d6, d10 - vld1.16 {d4, d5}, [r0, :64]! - vmlal.s16 q9, d7, d11 - vld1.16 {d8, d9}, [r2, :128]! - - vmlal.s16 q6, d4, d8 - vld1.16 {d6, d7}, [r0, :64]! - vmlal.s16 q7, d5, d9 - vld1.16 {d10, d11}, [r2, :128]! - vmlal.s16 q8, d6, d10 - vld1.16 {d4, d5}, [r0, :64]! - vmlal.s16 q9, d7, d11 - vld1.16 {d8, d9}, [r2, :128]! - - vmlal.s16 q6, d4, d8 - vld1.16 {d6, d7}, [r0, :64]! - vmlal.s16 q7, d5, d9 - vld1.16 {d10, d11}, [r2, :128]! - - vmlal.s16 q8, d6, d10 - vmlal.s16 q9, d7, d11 - - vpadd.s32 d0, d12, d13 - vpadd.s32 d1, d14, d15 - vpadd.s32 d2, d16, d17 - vpadd.s32 d3, d18, d19 + vld1.16 {d16, d17}, [r0, :64]! + vld1.16 {d20, d21}, [r2, :128]! + + vmull.s16 q12, d16, d20 + vld1.16 {d18, d19}, [r0, :64]! + vmull.s16 q13, d17, d21 + vld1.16 {d22, d23}, [r2, :128]! + vmull.s16 q14, d18, d22 + vld1.16 {d16, d17}, [r0, :64]! + vmull.s16 q15, d19, d23 + vld1.16 {d20, d21}, [r2, :128]! + + vmlal.s16 q12, d16, d20 + vld1.16 {d18, d19}, [r0, :64]! + vmlal.s16 q13, d17, d21 + vld1.16 {d22, d23}, [r2, :128]! + vmlal.s16 q14, d18, d22 + vld1.16 {d16, d17}, [r0, :64]! + vmlal.s16 q15, d19, d23 + vld1.16 {d20, d21}, [r2, :128]! + + vmlal.s16 q12, d16, d20 + vld1.16 {d18, d19}, [r0, :64]! + vmlal.s16 q13, d17, d21 + vld1.16 {d22, d23}, [r2, :128]! + vmlal.s16 q14, d18, d22 + vld1.16 {d16, d17}, [r0, :64]! + vmlal.s16 q15, d19, d23 + vld1.16 {d20, d21}, [r2, :128]! + + vmlal.s16 q12, d16, d20 + vld1.16 {d18, d19}, [r0, :64]! + vmlal.s16 q13, d17, d21 + vld1.16 {d22, d23}, [r2, :128]! + vmlal.s16 q14, d18, d22 + vld1.16 {d16, d17}, [r0, :64]! + vmlal.s16 q15, d19, d23 + vld1.16 {d20, d21}, [r2, :128]! + + vmlal.s16 q12, d16, d20 + vld1.16 {d18, d19}, [r0, :64]! + vmlal.s16 q13, d17, d21 + vld1.16 {d22, d23}, [r2, :128]! + + vmlal.s16 q14, d18, d22 + vmlal.s16 q15, d19, d23 + + vpadd.s32 d0, d24, d25 + vpadd.s32 d1, d26, d27 + vpadd.s32 d2, d28, d29 + vpadd.s32 d3, d30, d31 vrshr.s32 q0, q0, SBC_PROTO_FIXED_SCALE vrshr.s32 q1, q1, SBC_PROTO_FIXED_SCALE @@ -153,38 +153,38 @@ function ff_sbc_analyze_8_neon, export=1 vdup.i32 d1, d0[1] /* TODO: can be eliminated */ vdup.i32 d0, d0[0] /* TODO: can be eliminated */ - vld1.16 {d4, d5}, [r2, :128]! - vmull.s16 q6, d4, d0 - vld1.16 {d6, d7}, [r2, :128]! - vmull.s16 q7, d5, d0 - vmull.s16 q8, d6, d0 - vmull.s16 q9, d7, d0 - - vld1.16 {d4, d5}, [r2, :128]! - vmlal.s16 q6, d4, d1 - vld1.16 {d6, d7}, [r2, :128]! - vmlal.s16 q7, d5, d1 - vmlal.s16 q8, d6, d1 - vmlal.s16 q9, d7, d1 - - vld1.16 {d4, d5}, [r2, :128]! - vmlal.s16 q6, d4, d2 - vld1.16 {d6, d7}, [r2, :128]! - vmlal.s16 q7, d5, d2 - vmlal.s16 q8, d6, d2 - vmlal.s16 q9, d7, d2 - - vld1.16 {d4, d5}, [r2, :128]! - vmlal.s16 q6, d4, d3 - vld1.16 {d6, d7}, [r2, :128]! - vmlal.s16 q7, d5, d3 - vmlal.s16 q8, d6, d3 - vmlal.s16 q9, d7, d3 - - vpadd.s32 d0, d12, d13 /* TODO: can be eliminated */ - vpadd.s32 d1, d14, d15 /* TODO: can be eliminated */ - vpadd.s32 d2, d16, d17 /* TODO: can be eliminated */ - vpadd.s32 d3, d18, d19 /* TODO: can be eliminated */ + vld1.16 {d16, d17}, [r2, :128]! + vmull.s16 q12, d16, d0 + vld1.16 {d18, d19}, [r2, :128]! + vmull.s16 q13, d17, d0 + vmull.s16 q14, d18, d0 + vmull.s16 q15, d19, d0 + + vld1.16 {d16, d17}, [r2, :128]! + vmlal.s16 q12, d16, d1 + vld1.16 {d18, d19}, [r2, :128]! + vmlal.s16 q13, d17, d1 + vmlal.s16 q14, d18, d1 + vmlal.s16 q15, d19, d1 + + vld1.16 {d16, d17}, [r2, :128]! + vmlal.s16 q12, d16, d2 + vld1.16 {d18, d19}, [r2, :128]! + vmlal.s16 q13, d17, d2 + vmlal.s16 q14, d18, d2 + vmlal.s16 q15, d19, d2 + + vld1.16 {d16, d17}, [r2, :128]! + vmlal.s16 q12, d16, d3 + vld1.16 {d18, d19}, [r2, :128]! + vmlal.s16 q13, d17, d3 + vmlal.s16 q14, d18, d3 + vmlal.s16 q15, d19, d3 + + vpadd.s32 d0, d24, d25 /* TODO: can be eliminated */ + vpadd.s32 d1, d26, d27 /* TODO: can be eliminated */ + vpadd.s32 d2, d28, d29 /* TODO: can be eliminated */ + vpadd.s32 d3, d30, d31 /* TODO: can be eliminated */ vst1.32 {d0, d1, d2, d3}, [r1, :128]