[FFmpeg-devel,v4,0/5] avcodec/ac3: Add aarch64 NEON DSP

Message ID	51f7be0a-4267-47bf-ab0b-bd6585806da7@geoffhill.org
Headers	show Delivered-To: ffmpegpatchwork2@gmail.com Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; Message-ID: <51f7be0a-4267-47bf-ab0b-bd6585806da7@geoffhill.org> Date: Sat, 6 Apr 2024 07:23:37 -0700 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird From: Geoff Hill <geoff@geoffhill.org> To: ffmpeg-devel@ffmpeg.org, =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st> Content-Language: en-US Subject: [FFmpeg-devel] [PATCH v4 0/5] avcodec/ac3: Add aarch64 NEON DSP Precedence: list Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Series	avcodec/ac3: Add aarch64 NEON DSP \| expand [FFmpeg-devel,v4,0/5] avcodec/ac3: Add aarch64 NEON DSP [FFmpeg-devel,v4,1/5] avcodec/ac3: Implement float_to_fixed24 for aarch64 NEON [FFmpeg-devel,v4,2/5] avcodec/ac3: Implement ac3_exponent_min for aarch64 NEON [FFmpeg-devel,v4,3/5] avcodec/ac3: Implement ac3_extract_exponents for aarch64 NEON [FFmpeg-devel,v4,4/5] avcodec/ac3: Implement sum_square_butterfly_int32 for aarch64 NEON [FFmpeg-devel,v4,5/5] avcodec/ac3: Implement sum_square_butterfly_float for aarch64 NEON

Message ID

51f7be0a-4267-47bf-ab0b-bd6585806da7@geoffhill.org

Headers

Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
 designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100;
Message-ID: <51f7be0a-4267-47bf-ab0b-bd6585806da7@geoffhill.org>
Date: Sat, 6 Apr 2024 07:23:37 -0700
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
From: Geoff Hill <geoff@geoffhill.org>
To: ffmpeg-devel@ffmpeg.org,
 =?utf-8?q?Martin_Storsj=C3=B6?= <martin@martin.st>
Content-Language: en-US
Subject: [FFmpeg-devel] [PATCH v4 0/5] avcodec/ac3: Add aarch64 NEON DSP
Precedence: list
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

Series

avcodec/ac3: Add aarch64 NEON DSP | expand

Message

Geoff Hill April 6, 2024, 2:23 p.m. UTC

Thanks Martin for your review and testing.

Here's v4 with the following changes:

  * Use fmal in sum_square_butterfly_float loop. Faster.

  * Removed redundant loop bound zero checks in extract_exponents,
    sum_square_bufferfly_int32 and sum_square_bufferfly_float.

  * Fixed randomize_int24() to also use negative values.

  * Carry copyright from arm implementation over to aarch64. I
    did use this version as reference.

  * Fix indentation to match existing aarch64 assembly style.

Tested once again on aarch64 and x86.

On AWS Graviton2 (t4g.medium), GCC 12.3:

$ tests/checkasm/checkasm --bench --test=ac3dsp
...
NEON:
 - ac3dsp.ac3_exponent_min               [OK]
 - ac3dsp.ac3_extract_exponents          [OK]
 - ac3dsp.float_to_fixed24               [OK]
 - ac3dsp.ac3_sum_square_butterfly_int32 [OK]
 - ac3dsp.ac3_sum_square_butterfly_float [OK]
checkasm: all 20 tests passed
ac3_exponent_min_reuse0_c: 7.5
ac3_exponent_min_reuse0_neon: 7.5
ac3_exponent_min_reuse1_c: 1044.0
ac3_exponent_min_reuse1_neon: 57.0
ac3_exponent_min_reuse2_c: 2073.0
ac3_exponent_min_reuse2_neon: 73.7
ac3_exponent_min_reuse3_c: 2596.2
ac3_exponent_min_reuse3_neon: 154.0
ac3_exponent_min_reuse4_c: 3107.2
ac3_exponent_min_reuse4_neon: 169.2
ac3_exponent_min_reuse5_c: 3615.2
ac3_exponent_min_reuse5_neon: 185.2
ac3_extract_exponents_n512_c: 1672.0
ac3_extract_exponents_n512_neon: 517.5
ac3_extract_exponents_n768_c: 2505.0
ac3_extract_exponents_n768_neon: 770.5
ac3_extract_exponents_n1024_c: 3304.0
ac3_extract_exponents_n1024_neon: 1022.0
ac3_extract_exponents_n1280_c: 4163.5
ac3_extract_exponents_n1280_neon: 1279.5
ac3_extract_exponents_n1536_c: 5001.2
ac3_extract_exponents_n1536_neon: 1553.2
ac3_extract_exponents_n1792_c: 5823.5
ac3_extract_exponents_n1792_neon: 1851.7
ac3_extract_exponents_n2048_c: 6601.5
ac3_extract_exponents_n2048_neon: 2116.2
ac3_extract_exponents_n2304_c: 7425.2
ac3_extract_exponents_n2304_neon: 2382.7
ac3_extract_exponents_n2560_c: 8278.5
ac3_extract_exponents_n2560_neon: 2620.5
ac3_extract_exponents_n2816_c: 9079.5
ac3_extract_exponents_n2816_neon: 2893.2
ac3_extract_exponents_n3072_c: 10026.5
ac3_extract_exponents_n3072_neon: 3127.0
ac3_sum_square_bufferfly_float_c: 1647.5
ac3_sum_square_bufferfly_float_neon: 229.5
ac3_sum_square_bufferfly_int32_c: 963.5
ac3_sum_square_bufferfly_int32_neon: 546.5
float_to_fixed24_c: 2460.5
float_to_fixed24_neon: 561.5


Geoff Hill (5):
  avcodec/ac3: Implement float_to_fixed24 for aarch64 NEON
  avcodec/ac3: Implement ac3_exponent_min for aarch64 NEON
  avcodec/ac3: Implement ac3_extract_exponents for aarch64 NEON
  avcodec/ac3: Implement sum_square_butterfly_int32 for aarch64 NEON
  avcodec/ac3: Implement sum_square_butterfly_float for aarch64 NEON

 libavcodec/aarch64/Makefile              |   2 +
 libavcodec/aarch64/ac3dsp_init_aarch64.c |  50 +++++++++
 libavcodec/aarch64/ac3dsp_neon.S         | 119 ++++++++++++++++++++
 libavcodec/ac3dsp.c                      |   4 +-
 libavcodec/ac3dsp.h                      |   3 +-
 tests/checkasm/ac3dsp.c                  | 133 +++++++++++++++++++++++
 6 files changed, 309 insertions(+), 2 deletions(-)
 create mode 100644 libavcodec/aarch64/ac3dsp_init_aarch64.c
 create mode 100644 libavcodec/aarch64/ac3dsp_neon.S

Comments

Martin Storsjö April 8, 2024, 10:47 a.m. UTC | #1

On Sat, 6 Apr 2024, Geoff Hill wrote:

> Thanks Martin for your review and testing.
>
> Here's v4 with the following changes:
>
>  * Use fmal in sum_square_butterfly_float loop. Faster.
>
>  * Removed redundant loop bound zero checks in extract_exponents,
>    sum_square_bufferfly_int32 and sum_square_bufferfly_float.
>
>  * Fixed randomize_int24() to also use negative values.
>
>  * Carry copyright from arm implementation over to aarch64. I
>    did use this version as reference.
>
>  * Fix indentation to match existing aarch64 assembly style.
>
> Tested once again on aarch64 and x86.

Thanks, this set looked good, so I pushed it.

I amended the commits a bit, moving the added copyright line from 
checkasm/ac3dsp.c from patch 1 to 2, where that file actually gets 
extended.

Actually, after pushing, I realized another thing that can be done better 
in ff_ac3_sum_square_butterfly_float_neon - I'll send a patch for that.

> On AWS Graviton2 (t4g.medium), GCC 12.3:
>
> $ tests/checkasm/checkasm --bench --test=ac3dsp
> ...
> NEON:
> - ac3dsp.ac3_exponent_min               [OK]
> - ac3dsp.ac3_extract_exponents          [OK]
> - ac3dsp.float_to_fixed24               [OK]
> - ac3dsp.ac3_sum_square_butterfly_int32 [OK]
> - ac3dsp.ac3_sum_square_butterfly_float [OK]
> checkasm: all 20 tests passed
> float_to_fixed24_c: 2460.5
> float_to_fixed24_neon: 561.5

FWIW, it's usually neater to include such numbers in the commit message, 
so it gets brought along into the final git history (to show the benefit 
we got from the optimization at the time), quoting only those functions 
that are added/modified in each patch. But I didn't amend in that in the 
commit messages this time, but you can keep it in mind for the future.

Anyway, thanks for the patches!

// Martin