From patchwork Tue Jan 8 21:32:30 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Simon Nash X-Patchwork-Id: 11681 Return-Path: X-Original-To: patchwork@ffaux-bg.ffmpeg.org Delivered-To: patchwork@ffaux-bg.ffmpeg.org Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by ffaux.localdomain (Postfix) with ESMTP id 5363044DBF2 for ; Tue, 8 Jan 2019 23:32:31 +0200 (EET) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 0750368A3A5; Tue, 8 Jan 2019 23:32:28 +0200 (EET) X-Original-To: ffmpeg-devel@ffmpeg.org Delivered-To: ffmpeg-devel@ffmpeg.org Received: from mout.kundenserver.de (mout.kundenserver.de [212.227.126.135]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 5552C689F8B for ; Tue, 8 Jan 2019 23:32:21 +0200 (EET) Received: from [192.168.0.45] ([95.172.236.170]) by mrelayeu.kundenserver.de (mreue010 [212.227.15.163]) with ESMTPSA (Nemesis) id 1M2etD-1giMQS3zg7-004AiT for ; Tue, 08 Jan 2019 22:32:32 +0100 Message-ID: <5C3516EE.9020002@cjnash.com> Date: Tue, 08 Jan 2019 21:32:30 +0000 From: Simon Nash User-Agent: Thunderbird 2.0.0.24 (Windows/20100228) MIME-Version: 1.0 To: ffmpeg-devel@ffmpeg.org X-Provags-ID: V03:K1:qHQzcPhSVHa4H5CTyYlYx+Y4bUou8Lkus9g4Nd4UlUhmLXA/jX7 +/PiPwdFRGqeLBYdjJOJo8OiZOWDi+ybPEvOUXQxfU2NgdfX++Rh4Wbk1VmimDx7foZQ81R g+6+F1QDyAh2eeMbrhxre9WIGrpzaQzP3Ha717+g+ApRgpNxdHzY6R8+3oYZ7MloZ2q7wdA lukdA7M21KhDRBWSYZTJg== X-Spam-Flag: NO X-UI-Out-Filterresults: notjunk:1; V03:K0:ybpHEjV8oZo=:fcwkp1E6jyqEtWUaZaIDxV FrO/o7dAoI4mBJ0Ra1vPmH/f/OCA3InDO0Wd+AxOc5Mxqgg2kOLeXE9pTulk68xka8kf/S/Uu dzpzkL71hXCkAc/BoO2aR7sQ6GMcKwz+t+fdAjccMajrkjUkrCg7JjXig6A60JYLexKtGN19C L/E3y8MrNm92qY/VfIEo5A3Sv+z+SPCW2+60O5UUFpBiDIOnx61LifDhPuURGEgZc9B8RiA00 Feedx2k7WHKj7+WHICkBBQJ9FkJaI8YFxtq4C9yIzfTTZHOzmLMpJJXTNJbHPqSwuaDL2gkvy BgJlIC0ByV16oBUcUftf5BDp41fnEHi08exttXW0MfmRhA2m4/YQ9+yu6UOC4rji80hKZtIRH 0rC9+97yf2fp86oAMQmHTt3bpr4CaqUbksiLcFzu11K7IlV2Kr9W+W3bEgbfNnSxqXItYQBVQ AzlWAhaEugdhySpDJSFfVXdap9mtubLLY7jCZh5RpDmTddJ+Ca0BEzrQihXrIqoM1qGfi8KTK GA02PMaVHzTvEgg0TOxUsf3YyC5AbOJ9suQbmH57R2Z3f5nluOEzJvX/Jxf2/FMPeGOHzLYVR TVm6iGsUxpzkTy3bk2WEuNoj4mYa1vNmaKeL2u39/FNVMdoNqD2iNEKdM8ExxLVgZxA+p7nOO 09/1l/UpTz8ITlkVTunP/lC1pvHQADHVCFhdru0GcZEC/21jIQD207rAMwVkxg0K+nGUhyUq9 d6aOW9Lp835i3BHx9FfoCt9O1QpjWd9M3m9zbLSDEwU42NeotRqNKt9q4cg= Subject: [FFmpeg-devel] Armada 370 problem causes ffmpeg segmentation fault X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Hello, This is my first message to ffmpeg-devel, so please excuse me if anything here doesn't conform to the normal mailing list conventions. I have encountered a problem with ffmpeg (a segmentation fault) that occurs only when running ffmpeg on the Marvell Armada 370 processor. I used gdb to debug the cause of the problem and found that it was caused by a malfunction in the floating-point unit of the Armada 370. This is not specific to a single machine but appears to affect all Armada 370 chips. The problem occurs in the following section of code in libavfilter/af_afir.c (lines 441 to 447): for (ch = 0; ch < ctx->inputs[1]->channels; ch++) { float *time = (float *)s->in[1]->extended_data[!s->one2many * ch]; for (i = 0; i < s->nb_taps; i++) power += time[i] * time[i]; } s->gain = sqrtf(ch / power); The generated assembler code for the inner loop includes the following: 0x0018a8d4 : add.w r0, r6, #100 ; 0x64 0x0018a8d8 : movs r3, #0 0x0018a8da : vldr s15, [r0, #-100] ; 0xffffff9c 0x0018a8de : adds r3, #8 0x0018a8e0 : vldr s8, [r0, #-96] ; 0xffffffa0 0x0018a8e4 : cmp r3, lr 0x0018a8e6 : vldr s9, [r0, #-92] ; 0xffffffa4 0x0018a8ea : pld [r0] 0x0018a8ee : vldr s10, [r0, #-88] ; 0xffffffa8 0x0018a8f2 : vmla.f32 s12, s15, s15 0x0018a8f6 : vldr s11, [r0, #-84] ; 0xffffffac 0x0018a8fa : vldr s13, [r0, #-80] ; 0xffffffb0 0x0018a8fe : vldr s14, [r0, #-76] ; 0xffffffb4 0x0018a902 : vldr s15, [r0, #-72] ; 0xffffffb8 0x0018a906 : add.w r0, r0, #32 When the 32-bit floating-point multiply instruction 0x0018a8f2 : vmla.f32 s12, s15, s15 at activate+1690 is executed, there is a segmentation fault. This should never happen because the vmla.f32 instruction uses register values only. This problem happens when the value of the operand in s15 is small enough to cause an underflow when multiplied by itself (7.04930082e-28 in this case). I have a console log showing the state of the registers before and after single-stepping through the vmla.f32 instruction and I would be pleased to make this available in any convenient form. In other cases (when ffmpeg was compiled with a lower optimization level for debugging purposes), the vmla.f32 instruction completes but corrupts the unrelated register r7, causing a segmentation fault on a subsequent instruction. It seems that vmla.f32 produces unpredictable results on the Armada 370 when underflow occurs. I have investigated possible workarounds. The problem does not occur if the 32-bit float value is converted to a 64-bit value and multiplied using a vmla.f64 instruction producing a 64-bit result. This is because there is no underflow from ther multiplication. The following patch (diff file from git) does this. Changing to 64-bit multiplication would (I believe) avoid the possibility of underflow for any 32-bit coefficient value, which would mean that the code works correctly on the Armada 370 as well as all other processors. index 31919f6..3e69107 100644 This produces the following assembler code, with vmla.f32 replaced with vmla.f64: 0x0018a8d4 : add.w r0, r6, #100 ; 0x64 0x0018a8d8 : movs r3, #0 0x0018a8da : vldr s12, [r0, #-100] ; 0xffffff9c 0x0018a8de : adds r3, #8 0x0018a8e0 : vldr s0, [r0, #-96] ; 0xffffffa0 0x0018a8e4 : cmp r3, lr 0x0018a8e6 : vldr s2, [r0, #-92] ; 0xffffffa4 0x0018a8ea : pld [r0] 0x0018a8ee : vldr s4, [r0, #-88] ; 0xffffffa8 0x0018a8f2 : vcvt.f64.f32 d6, s12 0x0018a8f6 : vcvt.f64.f32 d0, s0 0x0018a8fa : vcvt.f64.f32 d1, s2 0x0018a8fe : vcvt.f64.f32 d2, s4 0x0018a902 : vmla.f64 d7, d6, d6 0x0018a906 : vldr s6, [r0, #-84] ; 0xffffffac 0x0018a90a : vldr s8, [r0, #-80] ; 0xffffffb0 0x0018a90e : vldr s10, [r0, #-76] ; 0xffffffb4 0x0018a912 : vldr s12, [r0, #-72] ; 0xffffffb8 0x0018a916 : add.w r0, r0, #32 I have tested this and it works on the Armada 370 as well as on other ARMv7 platforms. The change to using 64-bit multiplication is unlikely to be a performance issue because this code runs only during activation. However, if there are concerns about using 64-bit multiplication on processors that don't have the bug, I can modify the patch to check whether the processor is the Armada 370 and use the current 32-bit multiplication code when running on other processors. I can provide a test case if required. This will reproduce the bug reliably if ffmpeg is running on an Armada 370 machine. I realize that this is not a bug in FFmpeg but I would like to ask the FFmpeg developers to apply this patch (or equivalent) so that ffmpeg is able to run correctly on the Armada 370. My application (MinimServer) is using ffmpeg for various audio transformations including applying FIR filters and has users with devices that use the Armada 370 as well as users on many other platforms. Please advise me what I should do now to help make this change happen. Many thanks. Best regards, Simon --- a/libavfilter/af_afir.c +++ b/libavfilter/af_afir.c @@ -375,6 +375,7 @@ static int convert_coeffs(AVFilterContext *ctx) int left, offset = 0, part_size, max_part_size; int ret, i, ch, n; float power = 0; + double power_d = 0; s->nb_taps = ff_inlink_queued_samples(ctx->inputs[1]); if (s->nb_taps <= 0) @@ -441,9 +442,12 @@ static int convert_coeffs(AVFilterContext *ctx) for (ch = 0; ch < ctx->inputs[1]->channels; ch++) { float *time = (float *)s->in[1]->extended_data[!s->one2many * ch]; - for (i = 0; i < s->nb_taps; i++) - power += time[i] * time[i]; + for (i = 0; i < s->nb_taps; i++) { + double coeff_d = (double)time[i]; + power_d += coeff_d * coeff_d; // ensure no underflow + } } + power = (float)power_d; s->gain = sqrtf(ch / power); break; default: