mbox series

[FFmpeg-devel,00/21] aarch64: hevc: Add missing hevc_pel NEON functions

Message ID 20240325150243.59058-1-martin@martin.st
Headers show
Series aarch64: hevc: Add missing hevc_pel NEON functions | expand

Message

Martin Storsjö March 25, 2024, 3:02 p.m. UTC
Hi,

Since some time, we have pretty complete AArch64 NEON coverage
for the hevc decoder.

However, some of these functions require the I8MM instruction set
extension, and many of them (but not all) lack a plain NEON
version.

This patchset fills in a regular NEON version of all functions
where we have an I8MM function.

For context; the I8MM instruction set extension is a mandatory
part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
but Apple M1 and Ampere Altra don't.

This patchset takes decoding of a 1080p HEVC clip from 402
fps to 649 fps on an Apple M1.

Patch #2 also fixes a subtle bug in the existing implementation;
two functions relied on the contents on the stack, below the
stack pointer, being untouched within a function. If a signal
gets delivered, those parts of the stack could be clobbered.

// Martin

Martin Storsjö (21):
  aarch64: hevc: Reorder a misplaced function init line
  aarch64: hevc: Don't iterate with sp in
    ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm
  aarch64: hevc: Merge consecutive stores in
    put_hevc_\type\()_h16_8_neon
  aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal
    looping
  aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h
  aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8
  aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8
  aarch64: hevc: Split the epel_*_hv functions into two parts
  aarch64: hevc: Reorder epel_hv functions to prepare for templating
  aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm
  aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm
  aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm
  aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm
  aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8
  aarch64: hevc: Split the qpel_*_hv functions into two parts
  aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon
    functions
  aarch64: hevc: Reorder qpel_hv functions to prepare for templating
  aarch64: hevc: Produce plain neon versions of qpel_hv
  aarch64: hevc: Produce plain neon versions of qpel_uni_hv
  aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv
  aarch64: hevc: Produce plain neon versions of qpel_bi_hv

 libavcodec/aarch64/hevcdsp_epel_neon.S    | 1529 +++++++++++------
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   96 +-
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 1804 +++++++++++++--------
 3 files changed, 2291 insertions(+), 1138 deletions(-)

Comments

Martin Storsjö March 25, 2024, 9:15 p.m. UTC | #1
On Mon, 25 Mar 2024, Martin Storsjö wrote:

> Since some time, we have pretty complete AArch64 NEON coverage
> for the hevc decoder.
>
> However, some of these functions require the I8MM instruction set
> extension, and many of them (but not all) lack a plain NEON
> version.
>
> This patchset fills in a regular NEON version of all functions
> where we have an I8MM function.
>
> For context; the I8MM instruction set extension is a mandatory
> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
> but Apple M1 and Ampere Altra don't.
>
> This patchset takes decoding of a 1080p HEVC clip from 402
> fps to 649 fps on an Apple M1.
>
> Patch #2 also fixes a subtle bug in the existing implementation;
> two functions relied on the contents on the stack, below the
> stack pointer, being untouched within a function. If a signal
> gets delivered, those parts of the stack could be clobbered.

I know this is a bit short notice for a patchset of this size - but, would 
people be OK with merging this patchset before the impending 7.0 branch 
(which is made within the next 24h)?

The patches pass all my tricky build configurations, they give a very 
non-negligible speedup on many common CPUs, and patch #2 fixes a real bug 
in the existing impleemntations. (A bug fix patch can of course be 
backported after the branch too, but performance optimizations aren't 
generally relevant for backporting.)

// Martin
J. Dekker March 25, 2024, 9:56 p.m. UTC | #2
> On Mon, 25 Mar 2024, Martin Storsjö wrote:
> 
>> Since some time, we have pretty complete AArch64 NEON coverage
>> for the hevc decoder.
>> 
>> However, some of these functions require the I8MM instruction set
>> extension, and many of them (but not all) lack a plain NEON
>> version.
>> 
>> This patchset fills in a regular NEON version of all functions
>> where we have an I8MM function.
>> 
>> For context; the I8MM instruction set extension is a mandatory
>> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
>> but Apple M1 and Ampere Altra don't.
>> 
>> This patchset takes decoding of a 1080p HEVC clip from 402
>> fps to 649 fps on an Apple M1.
>> 
>> Patch #2 also fixes a subtle bug in the existing implementation;
>> two functions relied on the contents on the stack, below the
>> stack pointer, being untouched within a function. If a signal
>> gets delivered, those parts of the stack could be clobbered.
> 
> I know this is a bit short notice for a patchset of this size - but, would people be OK with merging this patchset before the impending 7.0 branch (which is made within the next 24h)?
> 
> The patches pass all my tricky build configurations, they give a very non-negligible speedup on many common CPUs, and patch #2 fixes a real bug in the existing impleemntations. (A bug fix patch can of course be backported after the branch too, but performance optimizations aren't generally relevant for backporting.)
> 
> // Martin

Yes, please. I will tomorrow morning if you didn’t already push.
Jean-Baptiste Kempf March 26, 2024, 6:01 a.m. UTC | #3
On Mon, 25 Mar 2024, at 22:56, J. Dekker wrote:
>> On Mon, 25 Mar 2024, Martin Storsjö wrote:
>> 
>>> Since some time, we have pretty complete AArch64 NEON coverage
>>> for the hevc decoder.
>>> 
>>> However, some of these functions require the I8MM instruction set
>>> extension, and many of them (but not all) lack a plain NEON
>>> version.
>>> 
>>> This patchset fills in a regular NEON version of all functions
>>> where we have an I8MM function.
>>> 
>>> For context; the I8MM instruction set extension is a mandatory
>>> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
>>> but Apple M1 and Ampere Altra don't.
>>> 
>>> This patchset takes decoding of a 1080p HEVC clip from 402
>>> fps to 649 fps on an Apple M1.
>>> 
>>> Patch #2 also fixes a subtle bug in the existing implementation;
>>> two functions relied on the contents on the stack, below the
>>> stack pointer, being untouched within a function. If a signal
>>> gets delivered, those parts of the stack could be clobbered.
>> 
>> I know this is a bit short notice for a patchset of this size - but, would people be OK with merging this patchset before the impending 7.0 branch (which is made within the next 24h)?
>> 
>> The patches pass all my tricky build configurations, they give a very non-negligible speedup on many common CPUs, and patch #2 fixes a real bug in the existing impleemntations. (A bug fix patch can of course be backported after the branch too, but performance optimizations aren't generally relevant for backporting.)
>> 
>> // Martin
>
> Yes, please. I will tomorrow morning if you didn’t already push.

+1
Martin Storsjö March 26, 2024, 7:09 a.m. UTC | #4
On Tue, 26 Mar 2024, Jean-Baptiste Kempf wrote:

> On Mon, 25 Mar 2024, at 22:56, J. Dekker wrote:
>>> On Mon, 25 Mar 2024, Martin Storsjö wrote:
>>>
>>>> Since some time, we have pretty complete AArch64 NEON coverage
>>>> for the hevc decoder.
>>>>
>>>> However, some of these functions require the I8MM instruction set
>>>> extension, and many of them (but not all) lack a plain NEON
>>>> version.
>>>>
>>>> This patchset fills in a regular NEON version of all functions
>>>> where we have an I8MM function.
>>>>
>>>> For context; the I8MM instruction set extension is a mandatory
>>>> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
>>>> but Apple M1 and Ampere Altra don't.
>>>>
>>>> This patchset takes decoding of a 1080p HEVC clip from 402
>>>> fps to 649 fps on an Apple M1.
>>>>
>>>> Patch #2 also fixes a subtle bug in the existing implementation;
>>>> two functions relied on the contents on the stack, below the
>>>> stack pointer, being untouched within a function. If a signal
>>>> gets delivered, those parts of the stack could be clobbered.
>>>
>>> I know this is a bit short notice for a patchset of this size - but, would people be OK with merging this patchset before the impending 7.0 branch (which is made within the next 24h)?
>>>
>>> The patches pass all my tricky build configurations, they give a very non-negligible speedup on many common CPUs, and patch #2 fixes a real bug in the existing impleemntations. (A bug fix patch can of course be backported after the branch too, but performance optimizations aren't generally relevant for backporting.)
>>>
>>> // Martin
>>
>> Yes, please. I will tomorrow morning if you didn’t already push.
>
> +1

Thanks, I pushed this set now.

// Martin