mbox series

[FFmpeg-devel,v4,0/5] avcodec/ac3: Add aarch64 NEON DSP

Message ID 51f7be0a-4267-47bf-ab0b-bd6585806da7@geoffhill.org
Headers show
Series avcodec/ac3: Add aarch64 NEON DSP | expand

Message

Geoff Hill April 6, 2024, 2:23 p.m. UTC
Thanks Martin for your review and testing.

Here's v4 with the following changes:

  * Use fmal in sum_square_butterfly_float loop. Faster.

  * Removed redundant loop bound zero checks in extract_exponents,
    sum_square_bufferfly_int32 and sum_square_bufferfly_float.

  * Fixed randomize_int24() to also use negative values.

  * Carry copyright from arm implementation over to aarch64. I
    did use this version as reference.

  * Fix indentation to match existing aarch64 assembly style.

Tested once again on aarch64 and x86.

On AWS Graviton2 (t4g.medium), GCC 12.3:

$ tests/checkasm/checkasm --bench --test=ac3dsp
...
NEON:
 - ac3dsp.ac3_exponent_min               [OK]
 - ac3dsp.ac3_extract_exponents          [OK]
 - ac3dsp.float_to_fixed24               [OK]
 - ac3dsp.ac3_sum_square_butterfly_int32 [OK]
 - ac3dsp.ac3_sum_square_butterfly_float [OK]
checkasm: all 20 tests passed
ac3_exponent_min_reuse0_c: 7.5
ac3_exponent_min_reuse0_neon: 7.5
ac3_exponent_min_reuse1_c: 1044.0
ac3_exponent_min_reuse1_neon: 57.0
ac3_exponent_min_reuse2_c: 2073.0
ac3_exponent_min_reuse2_neon: 73.7
ac3_exponent_min_reuse3_c: 2596.2
ac3_exponent_min_reuse3_neon: 154.0
ac3_exponent_min_reuse4_c: 3107.2
ac3_exponent_min_reuse4_neon: 169.2
ac3_exponent_min_reuse5_c: 3615.2
ac3_exponent_min_reuse5_neon: 185.2
ac3_extract_exponents_n512_c: 1672.0
ac3_extract_exponents_n512_neon: 517.5
ac3_extract_exponents_n768_c: 2505.0
ac3_extract_exponents_n768_neon: 770.5
ac3_extract_exponents_n1024_c: 3304.0
ac3_extract_exponents_n1024_neon: 1022.0
ac3_extract_exponents_n1280_c: 4163.5
ac3_extract_exponents_n1280_neon: 1279.5
ac3_extract_exponents_n1536_c: 5001.2
ac3_extract_exponents_n1536_neon: 1553.2
ac3_extract_exponents_n1792_c: 5823.5
ac3_extract_exponents_n1792_neon: 1851.7
ac3_extract_exponents_n2048_c: 6601.5
ac3_extract_exponents_n2048_neon: 2116.2
ac3_extract_exponents_n2304_c: 7425.2
ac3_extract_exponents_n2304_neon: 2382.7
ac3_extract_exponents_n2560_c: 8278.5
ac3_extract_exponents_n2560_neon: 2620.5
ac3_extract_exponents_n2816_c: 9079.5
ac3_extract_exponents_n2816_neon: 2893.2
ac3_extract_exponents_n3072_c: 10026.5
ac3_extract_exponents_n3072_neon: 3127.0
ac3_sum_square_bufferfly_float_c: 1647.5
ac3_sum_square_bufferfly_float_neon: 229.5
ac3_sum_square_bufferfly_int32_c: 963.5
ac3_sum_square_bufferfly_int32_neon: 546.5
float_to_fixed24_c: 2460.5
float_to_fixed24_neon: 561.5


Geoff Hill (5):
  avcodec/ac3: Implement float_to_fixed24 for aarch64 NEON
  avcodec/ac3: Implement ac3_exponent_min for aarch64 NEON
  avcodec/ac3: Implement ac3_extract_exponents for aarch64 NEON
  avcodec/ac3: Implement sum_square_butterfly_int32 for aarch64 NEON
  avcodec/ac3: Implement sum_square_butterfly_float for aarch64 NEON

 libavcodec/aarch64/Makefile              |   2 +
 libavcodec/aarch64/ac3dsp_init_aarch64.c |  50 +++++++++
 libavcodec/aarch64/ac3dsp_neon.S         | 119 ++++++++++++++++++++
 libavcodec/ac3dsp.c                      |   4 +-
 libavcodec/ac3dsp.h                      |   3 +-
 tests/checkasm/ac3dsp.c                  | 133 +++++++++++++++++++++++
 6 files changed, 309 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/aarch64/ac3dsp_init_aarch64.c
 create mode 100644 libavcodec/aarch64/ac3dsp_neon.S

Comments

Martin Storsjö April 8, 2024, 10:47 a.m. UTC | #1
On Sat, 6 Apr 2024, Geoff Hill wrote:

> Thanks Martin for your review and testing.
>
> Here's v4 with the following changes:
>
>  * Use fmal in sum_square_butterfly_float loop. Faster.
>
>  * Removed redundant loop bound zero checks in extract_exponents,
>    sum_square_bufferfly_int32 and sum_square_bufferfly_float.
>
>  * Fixed randomize_int24() to also use negative values.
>
>  * Carry copyright from arm implementation over to aarch64. I
>    did use this version as reference.
>
>  * Fix indentation to match existing aarch64 assembly style.
>
> Tested once again on aarch64 and x86.

Thanks, this set looked good, so I pushed it.

I amended the commits a bit, moving the added copyright line from 
checkasm/ac3dsp.c from patch 1 to 2, where that file actually gets 
extended.

Actually, after pushing, I realized another thing that can be done better 
in ff_ac3_sum_square_butterfly_float_neon - I'll send a patch for that.

> On AWS Graviton2 (t4g.medium), GCC 12.3:
>
> $ tests/checkasm/checkasm --bench --test=ac3dsp
> ...
> NEON:
> - ac3dsp.ac3_exponent_min               [OK]
> - ac3dsp.ac3_extract_exponents          [OK]
> - ac3dsp.float_to_fixed24               [OK]
> - ac3dsp.ac3_sum_square_butterfly_int32 [OK]
> - ac3dsp.ac3_sum_square_butterfly_float [OK]
> checkasm: all 20 tests passed
> float_to_fixed24_c: 2460.5
> float_to_fixed24_neon: 561.5

FWIW, it's usually neater to include such numbers in the commit message, 
so it gets brought along into the final git history (to show the benefit 
we got from the optimization at the time), quoting only those functions 
that are added/modified in each patch. But I didn't amend in that in the 
commit messages this time, but you can keep it in mind for the future.

Anyway, thanks for the patches!

// Martin