[FFmpeg-devel,PATCHv2,0/10] RISC-V V floating point DSP

Message ID 3372981.QJadu78ljV@basile.remlab.net

Message

Rémi Denis-Courmont Sept. 4, 2022, 1:54 p.m. UTC
The following changes since commit b6e8fc1c201d58672639134a737137e1ba7b55fe:

  avcodec/speexdec: improve support for speex in non-ogg (2022-09-04 11:31:57 +0200)

are waiting thorough bashing at your express convenience up to:

  riscv: float vector dot product with RVV (2022-09-04 16:45:38 +0300)

Changes since v1:

- Removed stray define.
- Fixed mismatch between byte and element size in mul-scalar.
- Added fmul, fac, dmul, dmac, fmul-add, fmul-reverse, fmul-window.
- Added float butterfly and dot product.

All operations are unrolled to the maximum group size (8), with the
exception of overlap/add. The latter seems to require a minimum of 6
vectors (maybe 5 with extremely careful ordering), so its group size is
only 4.
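
For illustration only (not part of the series): the constraint follows
from RVV having 32 vector registers, so a group multiplier LMUL leaves
32/LMUL register groups to work with. A minimal sketch of that
arithmetic, where register_groups() and max_lmul_for() are hypothetical
helper names:

```c
/* RVV has 32 architectural vector registers (v0..v31). With a group
 * multiplier LMUL, they are used as aligned groups of LMUL registers. */
static int register_groups(int lmul)
{
    return 32 / lmul;
}

/* Largest integer LMUL (8, 4, 2 or 1) that still leaves at least
 * `needed` register groups. Returns 0 if even LMUL=1 is not enough. */
static int max_lmul_for(int needed)
{
    for (int lmul = 8; lmul >= 1; lmul /= 2)
        if (register_groups(lmul) >= needed)
            return lmul;
    return 0;
}
```

With LMUL=8 there are only 4 groups, so a kernel that keeps 6 (or even
5) vectors live has to drop to LMUL=4, which provides 8 groups.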

The pointer arithmetic could be slightly optimised with the SH2ADD and
SH3ADD instructions from the Zba extension. That would require either
more conditional code or mandatory Zba support, though, for probably
negligible performance gains.
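
As a rough sketch of what those two instructions compute (a C model for
illustration, not code from the series): SH2ADD and SH3ADD fold the
element-size shift and the pointer addition into one instruction. The
advance_float() and advance_double() helpers are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of "sh2add rd, rs1, rs2": rd = rs2 + (rs1 << 2). */
static uintptr_t sh2add(uintptr_t rs1, uintptr_t rs2)
{
    return rs2 + (rs1 << 2);
}

/* Model of "sh3add rd, rs1, rs2": rd = rs2 + (rs1 << 3). */
static uintptr_t sh3add(uintptr_t rs1, uintptr_t rs2)
{
    return rs2 + (rs1 << 3);
}

/* Advance a float pointer by n elements (4 bytes each) in one step,
 * instead of a separate shift followed by an add. */
static const float *advance_float(const float *p, size_t n)
{
    return (const float *)sh2add(n, (uintptr_t)p);
}

/* Same for doubles (8 bytes each). */
static const double *advance_double(const double *p, size_t n)
{
    return (const double *)sh3add(n, (uintptr_t)p);
}
```

Without these instructions, each such pointer update costs a shift plus
an add, which is the overhead being weighed above.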

----------------------------------------------------------------
Rémi Denis-Courmont (10):
      riscv: add CPU flags for the RISC-V Vector extension
      riscv: initial common header for assembler macros
      riscv: float vector-scalar multiplication with RVV
      riscv: float vector-vector multiplication with RVV
      riscv: float vector multiply-accumulate with RVV
      riscv: float vector multiplication-addition with RVV
      riscv: float vector sum-and-difference with RVV
      riscv: float reversed vector multiplication with RVV
      riscv: float vector windowed overlap/add with RVV
      riscv: float vector dot product with RVV

 libavutil/cpu.c                  |  14 +++
 libavutil/cpu.h                  |   6 +
 libavutil/cpu_internal.h         |   1 +
 libavutil/float_dsp.c            |   2 +
 libavutil/float_dsp.h            |   1 +
 libavutil/riscv/Makefile         |   3 +
 libavutil/riscv/asm.S            |  33 +++++
 libavutil/riscv/cpu.c            |  57 +++++++++
 libavutil/riscv/float_dsp_init.c |  67 ++++++++++
 libavutil/riscv/float_dsp_rvv.S  | 255 +++++++++++++++++++++++++++++++++++++++
 10 files changed, 439 insertions(+)
 create mode 100644 libavutil/riscv/Makefile
 create mode 100644 libavutil/riscv/asm.S
 create mode 100644 libavutil/riscv/cpu.c
 create mode 100644 libavutil/riscv/float_dsp_init.c
 create mode 100644 libavutil/riscv/float_dsp_rvv.S

Comments

Lynne Sept. 4, 2022, 5:48 p.m. UTC | #1
Sep 4, 2022, 15:54 by remi@remlab.net:

> The following changes since commit b6e8fc1c201d58672639134a737137e1ba7b55fe:
>
>  avcodec/speexdec: improve support for speex in non-ogg (2022-09-04 11:31:57 +0200)
>
> are waiting thorough bashing at your express convenience up to:
>
>  riscv: float vector dot product with RVV (2022-09-04 16:45:38 +0300)
>
> Changes since v1:
>
> - Removed stray define.
> - Fixed mismatch between byte and element size in mul-scalar.
> - Added fmul, fac, dmul, dmac, fmul-add, fmul-reverse, fmul-window.
> - Added float butterfly and dot product.
>
> All operations are unrolled to the maximum group size (8), with the
> exception of overlap/add. The latter seems to require a minimum of 6
> vectors (maybe 5 with extremely careful ordering), so its group size is
> only 4.
>
> The pointer arithmetic could be slightly optimised with the SH2ADD and
> SH3ADD instructions from the Zba extension. That would require either
> more conditional code or mandatory Zba support, though, for probably
> negligible performance gains.
>

Did you test on real hardware or a VM?
If the former, what does checkasm --bench report?

Rémi Denis-Courmont Sept. 4, 2022, 7:01 p.m. UTC | #2
On Sunday, 4 September 2022 at 20:48:26 EEST, Lynne wrote:
> > The pointer arithmetic could be slightly optimised with the SH2ADD and
> > SH3ADD instructions from the Zba extension. That would require either
> > more conditional code or mandatory Zba support, though, for probably
> > negligible performance gains.
> 
> Did you test on real hardware or a VM?

I don't think we will see real and conforming RV64GV hardware this year, 
unless you count FPGAs. I hope we can get affordable stuff in 2023H1.

This is running on a simulator. But since the code is already (AFAICT) using
the largest possible grouping, there won't be that much room for further
optimisations, other than maybe adding prefetching hints. In that sense, RVV
is really nice in how it makes unrolling almost unnecessary/effortless.

The code is also already laid out to leverage multiple issue if available. 
RISC-V does not have post-index addressing modes, so interleaving a fair 
amount of pointer arithmetic is unavoidable.
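
To make the loop shape concrete, here is a scalar C model of a
strip-mined vector-scalar multiply. It is only a sketch under
assumptions, not the assembly from the series: VLEN_BITS is a made-up
hardware width, vsetvl_e32() is a simplified stand-in for
"vsetvli ..., e32, m8", and fmul_scalar_model() is a hypothetical name.

```c
#include <stddef.h>

/* Assumed vector register width in bits; real hardware may differ. */
#define VLEN_BITS 128

/* Simplified model of vsetvli with 32-bit elements: process at most
 * VLMAX = VLEN/32 * LMUL elements per iteration. (Real hardware is
 * allowed to split the last two iterations more evenly.) */
static size_t vsetvl_e32(size_t n, unsigned lmul)
{
    size_t vlmax = (size_t)VLEN_BITS / 32 * lmul;
    return n < vlmax ? n : vlmax;
}

/* dst[i] = src[i] * mul, strip-mined with LMUL=8.
 * Returns the number of loop iterations taken. */
static size_t fmul_scalar_model(float *dst, const float *src,
                                float mul, size_t len)
{
    size_t iterations = 0;

    while (len > 0) {
        size_t vl = vsetvl_e32(len, 8);

        /* Stands in for one load, one multiply and one store. */
        for (size_t i = 0; i < vl; i++)
            dst[i] = src[i] * mul;

        /* No post-index addressing: each pointer bump is a separate
         * instruction that can be interleaved with the vector ops. */
        dst += vl;
        src += vl;
        len -= vl;
        iterations++;
    }
    return iterations;
}
```

Under these assumptions, 100 floats take 4 iterations (32 + 32 + 32 + 4)
with no separate tail loop, since the last vsetvli simply returns a
shorter length.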

> If the former, what does checkasm --bench report?

...