Message ID | 20240325150243.59058-1-martin@martin.st |
---|---|
Series | aarch64: hevc: Add missing hevc_pel NEON functions |
On Mon, 25 Mar 2024, Martin Storsjö wrote:

> Since some time, we have pretty complete AArch64 NEON coverage
> for the hevc decoder.
>
> However, some of these functions require the I8MM instruction set
> extension, and many of them (but not all) lack a plain NEON
> version.
>
> This patchset fills in a regular NEON version of all functions
> where we have an I8MM function.
>
> For context; the I8MM instruction set extension is a mandatory
> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
> but Apple M1 and Ampere Altra don't.
>
> This patchset takes decoding of a 1080p HEVC clip from 402
> fps to 649 fps on an Apple M1.
>
> Patch #2 also fixes a subtle bug in the existing implementation;
> two functions relied on the contents on the stack, below the
> stack pointer, being untouched within a function. If a signal
> gets delivered, those parts of the stack could be clobbered.

I know this is a bit short notice for a patchset of this size - but, would people be OK with merging this patchset before the impending 7.0 branch (which is made within the next 24h)?

The patches pass all my tricky build configurations, they give a very non-negligible speedup on many common CPUs, and patch #2 fixes a real bug in the existing implementations. (A bug fix patch can of course be backported after the branch too, but performance optimizations aren't generally relevant for backporting.)

// Martin
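To illustrate why the missing plain NEON versions matter on CPUs such as the Apple M1, here is a minimal sketch of runtime dispatch through a function-pointer table. It is not FFmpeg's actual init code: the flag values, the `SketchDSP` struct and the `copy_*` kernels are all hypothetical stand-ins. The assumption is that a CPU with NEON but without I8MM ends up with whatever was assigned before the I8MM check, so a function with only an I8MM version would presumably stay on the generic C path there.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits for this sketch; not FFmpeg's real flag values. */
#define SKETCH_CPU_NEON (1u << 0)
#define SKETCH_CPU_I8MM (1u << 1)

typedef void (*copy_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Generic C fallback. */
static void copy_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Stand-ins for the hand-written kernels; here they just reuse the C code. */
static void copy_neon(uint8_t *dst, const uint8_t *src, size_t n)
{
    copy_c(dst, src, n);
}

static void copy_i8mm(uint8_t *dst, const uint8_t *src, size_t n)
{
    copy_c(dst, src, n);
}

/* Hypothetical one-entry function table. */
typedef struct SketchDSP {
    copy_fn copy;
} SketchDSP;

/* Start from the C fallback and let each supported extension override it. */
static void sketch_dsp_init(SketchDSP *dsp, unsigned cpu_flags)
{
    dsp->copy = copy_c;
    if (cpu_flags & SKETCH_CPU_NEON)
        dsp->copy = copy_neon;   /* the kind of entry this patchset adds */
    if (cpu_flags & SKETCH_CPU_I8MM)
        dsp->copy = copy_i8mm;   /* only taken on I8MM-capable CPUs */
}
```

With only `SKETCH_CPU_NEON` set (the M1-like case), the table ends up pointing at `copy_neon`; if that assignment did not exist, it would remain at `copy_c`.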
> I know this is a bit short notice for a patchset of this size - but, would people be OK with merging this patchset before the impending 7.0 branch (which is made within the next 24h)?
>
> The patches pass all my tricky build configurations, they give a very non-negligible speedup on many common CPUs, and patch #2 fixes a real bug in the existing implementations. (A bug fix patch can of course be backported after the branch too, but performance optimizations aren't generally relevant for backporting.)
>
> // Martin

Yes, please. I will tomorrow morning if you didn’t already push.
On Mon, 25 Mar 2024, at 22:56, J. Dekker wrote:

> Yes, please. I will tomorrow morning if you didn’t already push.

+1
On Tue, 26 Mar 2024, Jean-Baptiste Kempf wrote:

> On Mon, 25 Mar 2024, at 22:56, J. Dekker wrote:
>> Yes, please. I will tomorrow morning if you didn’t already push.
>
> +1

Thanks, I pushed this set now.

// Martin