mbox series

[FFmpeg-devel,v4,0/7] avfilter/vf_bwdif: Add aarch64 neon functions

Message ID 20230704140445.240426-1-jc@kynesim.co.uk
Headers show
Series avfilter/vf_bwdif: Add aarch64 neon functions | expand

Message

John Cox July 4, 2023, 2:04 p.m. UTC
Also adds a filter_line3 method which on aarch64 neon yields approx 30%
speedup over 2xfilter_line and a memcpy

Differences from v3:
Remove a few lines of neon in filter_line that should have been removed
when copying from line3

Sorry about the two patch sets in quick succession, but I think I've
applied all the requested changes and I didn't want this mistake in the
final patchset. (The mistake was benign - it just wasted a few cycles.)

John Cox (7):
  tests/checkasm: Add test for vf_bwdif filter_intra
  avfilter/vf_bwdif: Add neon for filter_intra
  tests/checkasm: Add test for vf_bwdif filter_edge
  avfilter/vf_bwdif: Add neon for filter_edge
  avfilter/vf_bwdif: Add neon for filter_line
  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
  avfilter/vf_bwdif: Add neon for filter_line3

 libavfilter/aarch64/Makefile                |   2 +
 libavfilter/aarch64/vf_bwdif_init_aarch64.c | 125 ++++
 libavfilter/aarch64/vf_bwdif_neon.S         | 788 ++++++++++++++++++++
 libavfilter/bwdif.h                         |  20 +
 libavfilter/vf_bwdif.c                      |  70 +-
 tests/checkasm/vf_bwdif.c                   | 172 +++++
 6 files changed, 1162 insertions(+), 15 deletions(-)
 create mode 100644 libavfilter/aarch64/vf_bwdif_init_aarch64.c
 create mode 100644 libavfilter/aarch64/vf_bwdif_neon.S

Comments

Martin Storsjö July 5, 2023, 9:19 p.m. UTC | #1
On Tue, 4 Jul 2023, John Cox wrote:

> Also adds a filter_line3 method which on aarch64 neon yields approx 30%
> speedup over 2xfilter_line and a memcpy
>
> Differences from v3:
> Remove a few lines of neon in filter_line that should have been removed
> when copying from line3
>
> Sorry about the two patch sets in quick succession, but I think I've
> applied all the requested changes and I didn't want this mistake in the
> final patchset. (The mistake was benign - it just wasted a few cycles.)
>
> John Cox (7):
>  tests/checkasm: Add test for vf_bwdif filter_intra
>  avfilter/vf_bwdif: Add neon for filter_intra
>  tests/checkasm: Add test for vf_bwdif filter_edge
>  avfilter/vf_bwdif: Add neon for filter_edge
>  avfilter/vf_bwdif: Add neon for filter_line
>  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
>  avfilter/vf_bwdif: Add neon for filter_line3

I think this looks ok to me, so I'll go ahead and push it. The tests pass 
on x86 too, msvc/aarch64, llvm-mingw/aarch64, macOS and linux.

Just a couple notes I didn't remember to mention before:

- Regarding the int parameters on the stack; as long as you do have the C 
wrapper functions, you don't strictly need to have the same function 
signature for the NEON function as for the actual DSP function. So if 
you'd have wanted to have a different signature for the NEON function 
(changing it to intptr_t), that'd worked too. But I do see the benefit of 
keeping it identical to the DSP function interface.

- The way of making the the C function exported and calling that for the 
tail is neat, but kinda unusual within ffmpeg. In most cases (except for 
parts of swscale), we can just assume and rely on buffers being aligned 
enough for the SIMD vector length of the current platform, and freely 
overwrite a little into the padding at the end of the lines. Not sure if 
this is the case here though.

(If it is, it's easy enough to remove those bits and make the C functions 
static again as a follow-up.)

Also, checkasm coverage for >8bpp would be nice as mentioned, but if 
someone wants to write asm for that, it should be doable to factorize the 
new tests to run them for both 8 and 16 bpp.

That said, it looks ok enough to me so I'll push it.

// Martin
John Cox July 6, 2023, 9:30 a.m. UTC | #2
On Thu, 6 Jul 2023 00:19:50 +0300 (EEST), you wrote:

>On Tue, 4 Jul 2023, John Cox wrote:
>
>> Also adds a filter_line3 method which on aarch64 neon yields approx 30%
>> speedup over 2xfilter_line and a memcpy
>>
>> Differences from v3:
>> Remove a few lines of neon in filter_line that should have been removed
>> when copying from line3
>>
>> Sorry about the two patch sets in quick succession, but I think I've
>> applied all the requested changes and I didn't want this mistake in the
>> final patchset. (The mistake was benign - it just wasted a few cycles.)
>>
>> John Cox (7):
>>  tests/checkasm: Add test for vf_bwdif filter_intra
>>  avfilter/vf_bwdif: Add neon for filter_intra
>>  tests/checkasm: Add test for vf_bwdif filter_edge
>>  avfilter/vf_bwdif: Add neon for filter_edge
>>  avfilter/vf_bwdif: Add neon for filter_line
>>  avfilter/vf_bwdif: Add a filter_line3 method for optimisation
>>  avfilter/vf_bwdif: Add neon for filter_line3
>
>I think this looks ok to me, so I'll go ahead and push it. The tests pass 
>on x86 too, msvc/aarch64, llvm-mingw/aarch64, macOS and linux.
>
>Just a couple notes I didn't remember to mention before:
>
>- Regarding the int parameters on the stack; as long as you do have the C 
>wrapper functions, you don't strictly need to have the same function 
>signature for the NEON function as for the actual DSP function. So if 
>you'd have wanted to have a different signature for the NEON function 
>(changing it to intptr_t), that'd worked too. But I do see the benefit of 
>keeping it identical to the DSP function interface.

If I was going to do that what I'd actually do is only pass a single
prefs arg and no mrefs which would cut the number of args down to
something that is register only. (In the only current use case all the
other args are multiples of prefs.)

>- The way of making the the C function exported and calling that for the 
>tail is neat, but kinda unusual within ffmpeg. In most cases (except for 
>parts of swscale), we can just assume and rely on buffers being aligned 
>enough for the SIMD vector length of the current platform, and freely 
>overwrite a little into the padding at the end of the lines. Not sure if 
>this is the case here though.
>
>(If it is, it's easy enough to remove those bits and make the C functions 
>static again as a follow-up.)

Indeed - if you call the asm functions with a width that is not a
multiple of 16 then you will get the rounded up number processed so you
can just replace the helper function with the asm.

>Also, checkasm coverage for >8bpp would be nice as mentioned, but if 
>someone wants to write asm for that, it should be doable to factorize the 
>new tests to run them for both 8 and 16 bpp.

There seemed no point writing the test unless I had some asm to back it
up!

The asm for 9-14bit should be fairly straightforward (15-16-bit requires
a little more work due to overflow) if processing 8 pels at a time
(16-bytes), 16 pels at a time might be doable but there's a non-trivial
chance of running out of registers.

However my use case is Kodi on Pi4 where 10-bit interlace has never been
seen so this isn't going to happen immediately.

>That said, it looks ok enough to me so I'll push it.

Thanks

JC

>// Martin