From patchwork Tue Jan  8 21:32:30 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Simon Nash <simon@cjnash.com>
X-Patchwork-Id: 11681
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
X-Original-To: patchwork@ffaux-bg.ffmpeg.org
Delivered-To: patchwork@ffaux-bg.ffmpeg.org
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by ffaux.localdomain (Postfix) with ESMTP id 5363044DBF2
	for <patchwork@ffaux-bg.ffmpeg.org>;
	Tue,  8 Jan 2019 23:32:31 +0200 (EET)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 0750368A3A5;
	Tue,  8 Jan 2019 23:32:28 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mout.kundenserver.de (mout.kundenserver.de [212.227.126.135])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 5552C689F8B
	for <ffmpeg-devel@ffmpeg.org>; Tue,  8 Jan 2019 23:32:21 +0200 (EET)
Received: from [192.168.0.45] ([95.172.236.170]) by mrelayeu.kundenserver.de
	(mreue010 [212.227.15.163]) with ESMTPSA (Nemesis) id
	1M2etD-1giMQS3zg7-004AiT for <ffmpeg-devel@ffmpeg.org>;
	Tue, 08 Jan 2019 22:32:32 +0100
Message-ID: <5C3516EE.9020002@cjnash.com>
Date: Tue, 08 Jan 2019 21:32:30 +0000
From: Simon Nash <simon@cjnash.com>
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
To: ffmpeg-devel@ffmpeg.org
X-Provags-ID: V03:K1:qHQzcPhSVHa4H5CTyYlYx+Y4bUou8Lkus9g4Nd4UlUhmLXA/jX7
	+/PiPwdFRGqeLBYdjJOJo8OiZOWDi+ybPEvOUXQxfU2NgdfX++Rh4Wbk1VmimDx7foZQ81R
	g+6+F1QDyAh2eeMbrhxre9WIGrpzaQzP3Ha717+g+ApRgpNxdHzY6R8+3oYZ7MloZ2q7wdA
	lukdA7M21KhDRBWSYZTJg==
X-Spam-Flag: NO
X-UI-Out-Filterresults: notjunk:1; V03:K0:ybpHEjV8oZo=:fcwkp1E6jyqEtWUaZaIDxV
	FrO/o7dAoI4mBJ0Ra1vPmH/f/OCA3InDO0Wd+AxOc5Mxqgg2kOLeXE9pTulk68xka8kf/S/Uu
	dzpzkL71hXCkAc/BoO2aR7sQ6GMcKwz+t+fdAjccMajrkjUkrCg7JjXig6A60JYLexKtGN19C
	L/E3y8MrNm92qY/VfIEo5A3Sv+z+SPCW2+60O5UUFpBiDIOnx61LifDhPuURGEgZc9B8RiA00
	Feedx2k7WHKj7+WHICkBBQJ9FkJaI8YFxtq4C9yIzfTTZHOzmLMpJJXTNJbHPqSwuaDL2gkvy
	BgJlIC0ByV16oBUcUftf5BDp41fnEHi08exttXW0MfmRhA2m4/YQ9+yu6UOC4rji80hKZtIRH
	0rC9+97yf2fp86oAMQmHTt3bpr4CaqUbksiLcFzu11K7IlV2Kr9W+W3bEgbfNnSxqXItYQBVQ
	AzlWAhaEugdhySpDJSFfVXdap9mtubLLY7jCZh5RpDmTddJ+Ca0BEzrQihXrIqoM1qGfi8KTK
	GA02PMaVHzTvEgg0TOxUsf3YyC5AbOJ9suQbmH57R2Z3f5nluOEzJvX/Jxf2/FMPeGOHzLYVR
	TVm6iGsUxpzkTy3bk2WEuNoj4mYa1vNmaKeL2u39/FNVMdoNqD2iNEKdM8ExxLVgZxA+p7nOO
	09/1l/UpTz8ITlkVTunP/lC1pvHQADHVCFhdru0GcZEC/21jIQD207rAMwVkxg0K+nGUhyUq9
	d6aOW9Lp835i3BHx9FfoCt9O1QpjWd9M3m9zbLSDEwU42NeotRqNKt9q4cg=
Subject: [FFmpeg-devel] Armada 370 problem causes ffmpeg segmentation fault
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <http://ffmpeg.org/mailman/options/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <http://ffmpeg.org/pipermail/ffmpeg-devel/>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <http://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches
	<ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

Hello,
This is my first message to ffmpeg-devel, so please excuse me if
anything here doesn't conform to the normal mailing list conventions.

I have encountered a problem with ffmpeg (a segmentation fault) that
occurs only when running ffmpeg on the Marvell Armada 370 processor.

I used gdb to debug the cause of the problem and found that it was
caused by a malfunction in the floating-point unit of the Armada 370.
This is not specific to a single machine but appears to affect all
Armada 370 chips.

The problem occurs in the following section of code in
libavfilter/af_afir.c (lines 441 to 447):

    for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
        float *time = (float *)s->in[1]->extended_data[!s->one2many * ch];

        for (i = 0; i < s->nb_taps; i++)
            power += time[i] * time[i];
    }
    s->gain = sqrtf(ch / power);

The generated assembler code for the inner loop includes the following:

    0x0018a8d4 <activate+1660>:	add.w	r0, r6, #100	; 0x64
    0x0018a8d8 <activate+1664>:	movs	r3, #0
    0x0018a8da <activate+1666>:	vldr	s15, [r0, #-100]	; 0xffffff9c
    0x0018a8de <activate+1670>:	adds	r3, #8
    0x0018a8e0 <activate+1672>:	vldr	s8, [r0, #-96]	; 0xffffffa0
    0x0018a8e4 <activate+1676>:	cmp	r3, lr
    0x0018a8e6 <activate+1678>:	vldr	s9, [r0, #-92]	; 0xffffffa4
    0x0018a8ea <activate+1682>:	pld	[r0]
    0x0018a8ee <activate+1686>:	vldr	s10, [r0, #-88]	; 0xffffffa8
    0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
    0x0018a8f6 <activate+1694>:	vldr	s11, [r0, #-84]	; 0xffffffac
    0x0018a8fa <activate+1698>:	vldr	s13, [r0, #-80]	; 0xffffffb0
    0x0018a8fe <activate+1702>:	vldr	s14, [r0, #-76]	; 0xffffffb4
    0x0018a902 <activate+1706>:	vldr	s15, [r0, #-72]	; 0xffffffb8
    0x0018a906 <activate+1710>:	add.w	r0, r0, #32

When the 32-bit floating-point multiply instruction
    0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
at activate+1690 is executed, there is a segmentation fault. This
should never happen because the vmla.f32 instruction uses register
values only. This problem happens when the value of the operand in
s15 is small enough to cause an underflow when multiplied by itself
(7.04930082e-28 in this case). I have a console log showing the
state of the registers before and after single-stepping through the
vmla.f32 instruction and I would be pleased to make this available
in any convenient form.

In other cases (when ffmpeg was compiled with a lower optimization
level for debugging purposes), the vmla.f32 instruction completes
but corrupts the unrelated register r7, causing a segmentation fault
on a subsequent instruction. It seems that vmla.f32 produces
unpredictable results on the Armada 370 when underflow occurs.

I have investigated possible workarounds. The problem does not occur
if the 32-bit float value is converted to a 64-bit value and
multiplied using a vmla.f64 instruction producing a 64-bit result.
This is because there is no underflow from ther multiplication.
The following patch (diff file from git) does this. Changing to
64-bit multiplication would (I believe) avoid the possibility of
underflow for any 32-bit coefficient value, which would mean that
the code works correctly on the Armada 370 as well as all other
processors.

index 31919f6..3e69107 100644

This produces the following assembler code, with vmla.f32 replaced with
vmla.f64:

    0x0018a8d4 <activate+1660>:  add.w   r0, r6, #100    ; 0x64
    0x0018a8d8 <activate+1664>:  movs    r3, #0
    0x0018a8da <activate+1666>:  vldr    s12, [r0, #-100]        ; 0xffffff9c
    0x0018a8de <activate+1670>:  adds    r3, #8
    0x0018a8e0 <activate+1672>:  vldr    s0, [r0, #-96]  ; 0xffffffa0
    0x0018a8e4 <activate+1676>:  cmp     r3, lr
    0x0018a8e6 <activate+1678>:  vldr    s2, [r0, #-92]  ; 0xffffffa4
    0x0018a8ea <activate+1682>:  pld     [r0]
    0x0018a8ee <activate+1686>:  vldr    s4, [r0, #-88]  ; 0xffffffa8
    0x0018a8f2 <activate+1690>:  vcvt.f64.f32    d6, s12
    0x0018a8f6 <activate+1694>:  vcvt.f64.f32    d0, s0
    0x0018a8fa <activate+1698>:  vcvt.f64.f32    d1, s2
    0x0018a8fe <activate+1702>:  vcvt.f64.f32    d2, s4
    0x0018a902 <activate+1706>:  vmla.f64        d7, d6, d6
    0x0018a906 <activate+1710>:  vldr    s6, [r0, #-84]  ; 0xffffffac
    0x0018a90a <activate+1714>:  vldr    s8, [r0, #-80]  ; 0xffffffb0
    0x0018a90e <activate+1718>:  vldr    s10, [r0, #-76] ; 0xffffffb4
    0x0018a912 <activate+1722>:  vldr    s12, [r0, #-72] ; 0xffffffb8
    0x0018a916 <activate+1726>:  add.w   r0, r0, #32

I have tested this and it works on the Armada 370 as well as on
other ARMv7 platforms.

The change to using 64-bit multiplication is unlikely to be a
performance issue because this code runs only during activation.
However, if there are concerns about using 64-bit multiplication
on processors that don't have the bug, I can modify the patch to
check whether the processor is the Armada 370 and use the current
32-bit multiplication code when running on other processors.

I can provide a test case if required. This will reproduce the bug
reliably if ffmpeg is running on an Armada 370 machine.

I realize that this is not a bug in FFmpeg but I would like to ask
the FFmpeg developers to apply this patch (or equivalent) so that
ffmpeg is able to run correctly on the Armada 370. My application
(MinimServer) is using ffmpeg for various audio transformations
including applying FIR filters and has users with devices that use
the Armada 370 as well as users on many other platforms.

Please advise me what I should do now to help make this change
happen. Many thanks.

Best regards,
Simon

--- a/libavfilter/af_afir.c
+++ b/libavfilter/af_afir.c
@@ -375,6 +375,7 @@ static int convert_coeffs(AVFilterContext *ctx)
      int left, offset = 0, part_size, max_part_size;
      int ret, i, ch, n;
      float power = 0;
+    double power_d = 0;

      s->nb_taps = ff_inlink_queued_samples(ctx->inputs[1]);
      if (s->nb_taps <= 0)
@@ -441,9 +442,12 @@ static int convert_coeffs(AVFilterContext *ctx)
          for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
              float *time = (float *)s->in[1]->extended_data[!s->one2many * ch];

-            for (i = 0; i < s->nb_taps; i++)
-                power += time[i] * time[i];
+            for (i = 0; i < s->nb_taps; i++) {
+                double coeff_d = (double)time[i];
+                power_d += coeff_d * coeff_d; // ensure no underflow
+            }
          }
+        power = (float)power_d;
          s->gain = sqrtf(ch / power);
          break;
      default: