[FFmpeg-devel] Armada 370 problem causes ffmpeg segmentation fault

Submitted by Simon Nash on Jan. 8, 2019, 9:32 p.m.

Details

Message ID 5C3516EE.9020002@cjnash.com
State New
Headers show

Commit Message

Simon Nash Jan. 8, 2019, 9:32 p.m.
Hello,
This is my first message to ffmpeg-devel, so please excuse me if
anything here doesn't conform to the normal mailing list conventions.

I have encountered a problem with ffmpeg (a segmentation fault) that
occurs only when running ffmpeg on the Marvell Armada 370 processor.

I used gdb to debug the cause of the problem and found that it was
caused by a malfunction in the floating-point unit of the Armada 370.
This is not specific to a single machine but appears to affect all
Armada 370 chips.

The problem occurs in the following section of code in
libavfilter/af_afir.c (lines 441 to 447):

    for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
        float *time = (float *)s->in[1]->extended_data[!s->one2many * ch];

        for (i = 0; i < s->nb_taps; i++)
            power += time[i] * time[i];
    }
    s->gain = sqrtf(ch / power);

The generated assembler code for the inner loop includes the following:

    0x0018a8d4 <activate+1660>:	add.w	r0, r6, #100	; 0x64
    0x0018a8d8 <activate+1664>:	movs	r3, #0
    0x0018a8da <activate+1666>:	vldr	s15, [r0, #-100]	; 0xffffff9c
    0x0018a8de <activate+1670>:	adds	r3, #8
    0x0018a8e0 <activate+1672>:	vldr	s8, [r0, #-96]	; 0xffffffa0
    0x0018a8e4 <activate+1676>:	cmp	r3, lr
    0x0018a8e6 <activate+1678>:	vldr	s9, [r0, #-92]	; 0xffffffa4
    0x0018a8ea <activate+1682>:	pld	[r0]
    0x0018a8ee <activate+1686>:	vldr	s10, [r0, #-88]	; 0xffffffa8
    0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
    0x0018a8f6 <activate+1694>:	vldr	s11, [r0, #-84]	; 0xffffffac
    0x0018a8fa <activate+1698>:	vldr	s13, [r0, #-80]	; 0xffffffb0
    0x0018a8fe <activate+1702>:	vldr	s14, [r0, #-76]	; 0xffffffb4
    0x0018a902 <activate+1706>:	vldr	s15, [r0, #-72]	; 0xffffffb8
    0x0018a906 <activate+1710>:	add.w	r0, r0, #32

When the 32-bit floating-point multiply instruction
    0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
at activate+1690 is executed, there is a segmentation fault. This
should never happen because the vmla.f32 instruction uses register
values only. This problem happens when the value of the operand in
s15 is small enough to cause an underflow when multiplied by itself
(7.04930082e-28 in this case). I have a console log showing the
state of the registers before and after single-stepping through the
vmla.f32 instruction and I would be pleased to make this available
in any convenient form.

In other cases (when ffmpeg was compiled with a lower optimization
level for debugging purposes), the vmla.f32 instruction completes
but corrupts the unrelated register r7, causing a segmentation fault
on a subsequent instruction. It seems that vmla.f32 produces
unpredictable results on the Armada 370 when underflow occurs.

I have investigated possible workarounds. The problem does not occur
if the 32-bit float value is converted to a 64-bit value and
multiplied using a vmla.f64 instruction producing a 64-bit result.
This is because there is no underflow from ther multiplication.
The following patch (diff file from git) does this. Changing to
64-bit multiplication would (I believe) avoid the possibility of
underflow for any 32-bit coefficient value, which would mean that
the code works correctly on the Armada 370 as well as all other
processors.

index 31919f6..3e69107 100644

This produces the following assembler code, with vmla.f32 replaced with
vmla.f64:

    0x0018a8d4 <activate+1660>:  add.w   r0, r6, #100    ; 0x64
    0x0018a8d8 <activate+1664>:  movs    r3, #0
    0x0018a8da <activate+1666>:  vldr    s12, [r0, #-100]        ; 0xffffff9c
    0x0018a8de <activate+1670>:  adds    r3, #8
    0x0018a8e0 <activate+1672>:  vldr    s0, [r0, #-96]  ; 0xffffffa0
    0x0018a8e4 <activate+1676>:  cmp     r3, lr
    0x0018a8e6 <activate+1678>:  vldr    s2, [r0, #-92]  ; 0xffffffa4
    0x0018a8ea <activate+1682>:  pld     [r0]
    0x0018a8ee <activate+1686>:  vldr    s4, [r0, #-88]  ; 0xffffffa8
    0x0018a8f2 <activate+1690>:  vcvt.f64.f32    d6, s12
    0x0018a8f6 <activate+1694>:  vcvt.f64.f32    d0, s0
    0x0018a8fa <activate+1698>:  vcvt.f64.f32    d1, s2
    0x0018a8fe <activate+1702>:  vcvt.f64.f32    d2, s4
    0x0018a902 <activate+1706>:  vmla.f64        d7, d6, d6
    0x0018a906 <activate+1710>:  vldr    s6, [r0, #-84]  ; 0xffffffac
    0x0018a90a <activate+1714>:  vldr    s8, [r0, #-80]  ; 0xffffffb0
    0x0018a90e <activate+1718>:  vldr    s10, [r0, #-76] ; 0xffffffb4
    0x0018a912 <activate+1722>:  vldr    s12, [r0, #-72] ; 0xffffffb8
    0x0018a916 <activate+1726>:  add.w   r0, r0, #32

I have tested this and it works on the Armada 370 as well as on
other ARMv7 platforms.

The change to using 64-bit multiplication is unlikely to be a
performance issue because this code runs only during activation.
However, if there are concerns about using 64-bit multiplication
on processors that don't have the bug, I can modify the patch to
check whether the processor is the Armada 370 and use the current
32-bit multiplication code when running on other processors.

I can provide a test case if required. This will reproduce the bug
reliably if ffmpeg is running on an Armada 370 machine.

I realize that this is not a bug in FFmpeg but I would like to ask
the FFmpeg developers to apply this patch (or equivalent) so that
ffmpeg is able to run correctly on the Armada 370. My application
(MinimServer) is using ffmpeg for various audio transformations
including applying FIR filters and has users with devices that use
the Armada 370 as well as users on many other platforms.

Please advise me what I should do now to help make this change
happen. Many thanks.

Best regards,
Simon

Comments

Paul B Mahol Jan. 8, 2019, 9:47 p.m.
On 1/8/19, Simon Nash <simon@cjnash.com> wrote:
> Hello,
> This is my first message to ffmpeg-devel, so please excuse me if
> anything here doesn't conform to the normal mailing list conventions.
>
> I have encountered a problem with ffmpeg (a segmentation fault) that
> occurs only when running ffmpeg on the Marvell Armada 370 processor.
>
> I used gdb to debug the cause of the problem and found that it was
> caused by a malfunction in the floating-point unit of the Armada 370.
> This is not specific to a single machine but appears to affect all
> Armada 370 chips.
>
> The problem occurs in the following section of code in
> libavfilter/af_afir.c (lines 441 to 447):
>
>     for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
>         float *time = (float *)s->in[1]->extended_data[!s->one2many * ch];
>
>         for (i = 0; i < s->nb_taps; i++)
>             power += time[i] * time[i];
>     }
>     s->gain = sqrtf(ch / power);
>
> The generated assembler code for the inner loop includes the following:
>
>     0x0018a8d4 <activate+1660>:	add.w	r0, r6, #100	; 0x64
>     0x0018a8d8 <activate+1664>:	movs	r3, #0
>     0x0018a8da <activate+1666>:	vldr	s15, [r0, #-100]	; 0xffffff9c
>     0x0018a8de <activate+1670>:	adds	r3, #8
>     0x0018a8e0 <activate+1672>:	vldr	s8, [r0, #-96]	; 0xffffffa0
>     0x0018a8e4 <activate+1676>:	cmp	r3, lr
>     0x0018a8e6 <activate+1678>:	vldr	s9, [r0, #-92]	; 0xffffffa4
>     0x0018a8ea <activate+1682>:	pld	[r0]
>     0x0018a8ee <activate+1686>:	vldr	s10, [r0, #-88]	; 0xffffffa8
>     0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
>     0x0018a8f6 <activate+1694>:	vldr	s11, [r0, #-84]	; 0xffffffac
>     0x0018a8fa <activate+1698>:	vldr	s13, [r0, #-80]	; 0xffffffb0
>     0x0018a8fe <activate+1702>:	vldr	s14, [r0, #-76]	; 0xffffffb4
>     0x0018a902 <activate+1706>:	vldr	s15, [r0, #-72]	; 0xffffffb8
>     0x0018a906 <activate+1710>:	add.w	r0, r0, #32
>
> When the 32-bit floating-point multiply instruction
>     0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
> at activate+1690 is executed, there is a segmentation fault. This
> should never happen because the vmla.f32 instruction uses register
> values only. This problem happens when the value of the operand in
> s15 is small enough to cause an underflow when multiplied by itself
> (7.04930082e-28 in this case). I have a console log showing the
> state of the registers before and after single-stepping through the
> vmla.f32 instruction and I would be pleased to make this available
> in any convenient form.
>
> In other cases (when ffmpeg was compiled with a lower optimization
> level for debugging purposes), the vmla.f32 instruction completes
> but corrupts the unrelated register r7, causing a segmentation fault
> on a subsequent instruction. It seems that vmla.f32 produces
> unpredictable results on the Armada 370 when underflow occurs.
>
> I have investigated possible workarounds. The problem does not occur
> if the 32-bit float value is converted to a 64-bit value and
> multiplied using a vmla.f64 instruction producing a 64-bit result.
> This is because there is no underflow from ther multiplication.
> The following patch (diff file from git) does this. Changing to
> 64-bit multiplication would (I believe) avoid the possibility of
> underflow for any 32-bit coefficient value, which would mean that
> the code works correctly on the Armada 370 as well as all other
> processors.
>
> index 31919f6..3e69107 100644
> --- a/libavfilter/af_afir.c
> +++ b/libavfilter/af_afir.c
> @@ -375,6 +375,7 @@ static int convert_coeffs(AVFilterContext *ctx)
>       int left, offset = 0, part_size, max_part_size;
>       int ret, i, ch, n;
>       float power = 0;
> +    double power_d = 0;
>
>       s->nb_taps = ff_inlink_queued_samples(ctx->inputs[1]);
>       if (s->nb_taps <= 0)
> @@ -441,9 +442,12 @@ static int convert_coeffs(AVFilterContext *ctx)
>           for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
>               float *time = (float *)s->in[1]->extended_data[!s->one2many *
> ch];
>
> -            for (i = 0; i < s->nb_taps; i++)
> -                power += time[i] * time[i];
> +            for (i = 0; i < s->nb_taps; i++) {
> +                double coeff_d = (double)time[i];
> +                power_d += coeff_d * coeff_d; // ensure no underflow
> +            }
>           }
> +        power = (float)power_d;
>           s->gain = sqrtf(ch / power);
>           break;
>       default:
>
> This produces the following assembler code, with vmla.f32 replaced with
> vmla.f64:
>
>     0x0018a8d4 <activate+1660>:  add.w   r0, r6, #100    ; 0x64
>     0x0018a8d8 <activate+1664>:  movs    r3, #0
>     0x0018a8da <activate+1666>:  vldr    s12, [r0, #-100]        ;
> 0xffffff9c
>     0x0018a8de <activate+1670>:  adds    r3, #8
>     0x0018a8e0 <activate+1672>:  vldr    s0, [r0, #-96]  ; 0xffffffa0
>     0x0018a8e4 <activate+1676>:  cmp     r3, lr
>     0x0018a8e6 <activate+1678>:  vldr    s2, [r0, #-92]  ; 0xffffffa4
>     0x0018a8ea <activate+1682>:  pld     [r0]
>     0x0018a8ee <activate+1686>:  vldr    s4, [r0, #-88]  ; 0xffffffa8
>     0x0018a8f2 <activate+1690>:  vcvt.f64.f32    d6, s12
>     0x0018a8f6 <activate+1694>:  vcvt.f64.f32    d0, s0
>     0x0018a8fa <activate+1698>:  vcvt.f64.f32    d1, s2
>     0x0018a8fe <activate+1702>:  vcvt.f64.f32    d2, s4
>     0x0018a902 <activate+1706>:  vmla.f64        d7, d6, d6
>     0x0018a906 <activate+1710>:  vldr    s6, [r0, #-84]  ; 0xffffffac
>     0x0018a90a <activate+1714>:  vldr    s8, [r0, #-80]  ; 0xffffffb0
>     0x0018a90e <activate+1718>:  vldr    s10, [r0, #-76] ; 0xffffffb4
>     0x0018a912 <activate+1722>:  vldr    s12, [r0, #-72] ; 0xffffffb8
>     0x0018a916 <activate+1726>:  add.w   r0, r0, #32
>
> I have tested this and it works on the Armada 370 as well as on
> other ARMv7 platforms.
>
> The change to using 64-bit multiplication is unlikely to be a
> performance issue because this code runs only during activation.
> However, if there are concerns about using 64-bit multiplication
> on processors that don't have the bug, I can modify the patch to
> check whether the processor is the Armada 370 and use the current
> 32-bit multiplication code when running on other processors.
>
> I can provide a test case if required. This will reproduce the bug
> reliably if ffmpeg is running on an Armada 370 machine.
>
> I realize that this is not a bug in FFmpeg but I would like to ask
> the FFmpeg developers to apply this patch (or equivalent) so that
> ffmpeg is able to run correctly on the Armada 370. My application
> (MinimServer) is using ffmpeg for various audio transformations
> including applying FIR filters and has users with devices that use
> the Armada 370 as well as users on many other platforms.
>
> Please advise me what I should do now to help make this change
> happen. Many thanks.
>
> Best regards,
> Simon

This could use doubles, but perhaps same issue could happen there too
in another scenario.
Others may be against it. So better wait for others opinions.
Lauri Kasanen Jan. 9, 2019, 8:38 a.m.
On Tue, 08 Jan 2019 21:32:30 +0000
Simon Nash <simon@cjnash.com> wrote:

> I have encountered a problem with ffmpeg (a segmentation fault) that
> occurs only when running ffmpeg on the Marvell Armada 370 processor.
...
> When the 32-bit floating-point multiply instruction
>     0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
> at activate+1690 is executed, there is a segmentation fault.

You don't want to go whack-a-mole on this, since there could be 1500
other places in just ffmpeg that could hit this. You want to fix this
in your compiler, it already has similar errata workarounds for almost
every processor. Then every such case will work automatically.

So,
1) Find the errata from the processor manufacturer
2) Report bug with that to gcc/clang/whatever compiler you use

If there is no known errata for this, and you managed to find a new
one, contact the processor manufacturer.

- Lauri

Patch hide | download patch | download mbox

--- a/libavfilter/af_afir.c
+++ b/libavfilter/af_afir.c
@@ -375,6 +375,7 @@  static int convert_coeffs(AVFilterContext *ctx)
      int left, offset = 0, part_size, max_part_size;
      int ret, i, ch, n;
      float power = 0;
+    double power_d = 0;

      s->nb_taps = ff_inlink_queued_samples(ctx->inputs[1]);
      if (s->nb_taps <= 0)
@@ -441,9 +442,12 @@  static int convert_coeffs(AVFilterContext *ctx)
          for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
              float *time = (float *)s->in[1]->extended_data[!s->one2many * ch];

-            for (i = 0; i < s->nb_taps; i++)
-                power += time[i] * time[i];
+            for (i = 0; i < s->nb_taps; i++) {
+                double coeff_d = (double)time[i];
+                power_d += coeff_d * coeff_d; // ensure no underflow
+            }
          }
+        power = (float)power_d;
          s->gain = sqrtf(ch / power);
          break;
      default: