diff mbox series

[FFmpeg-devel,1/3] riscv: add CPU flags for the RISC-V Vector extension

Message ID 20220903190147.196927-1-remi@remlab.net
State New
Headers show
Series Float DSP for RISC-V Vector extension - part I | expand

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Rémi Denis-Courmont Sept. 3, 2022, 7:01 p.m. UTC
From: Rémi Denis-Courmont <remi@remlab.net>

RVV defines a total of 12 different extensions: V, Zvl32b, Zvl64b,
Zvl128b, Zvl256b, Zvl512b, Zvl1024b, Zve32x, Zve32f, Zve64x, Zve64f and
Zve64d.

At this stage, we don't care about the vector length extensions Zvl*,
as most or all optimisations will be running in a loop that is
independent on the data set size.

Zve64f is equivalent to Zve32f plus Zve64x, so it is exposed as a
convenience flag, but not tracked internally. Likewise V is the
equivalent of Zve64d plus Zvl128b.

Technically, Zve32f and Zve64x are both implied by Zve64d and both
imply Zve32x, leaving only 5 possibilities (including no vector
support), but we keep 4 separate bits for easy run-time checks as on
other instruction set architectures.
---
 libavutil/cpu.c          | 14 ++++++++++
 libavutil/cpu.h          |  6 +++++
 libavutil/cpu_internal.h |  1 +
 libavutil/riscv/Makefile |  1 +
 libavutil/riscv/cpu.c    | 58 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 80 insertions(+)
 create mode 100644 libavutil/riscv/Makefile
 create mode 100644 libavutil/riscv/cpu.c

Comments

Lynne Sept. 3, 2022, 7:20 p.m. UTC | #1
Sep 3, 2022, 21:01 by remi@remlab.net:

> From: Rémi Denis-Courmont <remi@remlab.net>
>
> RVV defines a total of 12 different extensions: V, Zvl32b, Zvl64b,
> Zvl128b, Zvl256b, Zvl512b, Zvl1024b, Zve32x, Zve32f, Zve64x, Zve64f and
> Zve64d.
>
> At this stage, we don't care about the vector length extensions Zvl*,
> as most or all optimisations will be running in a loop that is
> independent on the data set size.
>

I need to know the maximum length to write an FFT.
Could you add flags for it? I don't mind a 5-bit bitfield for a log2 of it,
or one flag per length (up to 65536).
Rémi Denis-Courmont Sept. 3, 2022, 7:59 p.m. UTC | #2
Le lauantaina 3. syyskuuta 2022, 22.20.20 EEST Lynne a écrit :
> Sep 3, 2022, 21:01 by remi@remlab.net:
> > From: Rémi Denis-Courmont <remi@remlab.net>
> > 
> > RVV defines a total of 12 different extensions: V, Zvl32b, Zvl64b,
> > Zvl128b, Zvl256b, Zvl512b, Zvl1024b, Zve32x, Zve32f, Zve64x, Zve64f and
> > Zve64d.
> > 
> > At this stage, we don't care about the vector length extensions Zvl*,
> > as most or all optimisations will be running in a loop that is
> > independent on the data set size.
> 
> I need to know the maximum length to write an FFT.
> Could you add flags for it?

I think we should cross that bridge if/when the need actually arises. In most 
cases, the vector length returned at run-time from VSETVL is good enough.
Lynne Sept. 3, 2022, 9:38 p.m. UTC | #3
Sep 3, 2022, 21:59 by remi@remlab.net:

> Le lauantaina 3. syyskuuta 2022, 22.20.20 EEST Lynne a écrit :
>
>> Sep 3, 2022, 21:01 by remi@remlab.net:
>> > From: Rémi Denis-Courmont <remi@remlab.net>
>> > 
>> > RVV defines a total of 12 different extensions: V, Zvl32b, Zvl64b,
>> > Zvl128b, Zvl256b, Zvl512b, Zvl1024b, Zve32x, Zve32f, Zve64x, Zve64f and
>> > Zve64d.
>> > 
>> > At this stage, we don't care about the vector length extensions Zvl*,
>> > as most or all optimisations will be running in a loop that is
>> > independent on the data set size.
>>
>> I need to know the maximum length to write an FFT.
>> Could you add flags for it?
>>
>
> I think we should cross that bridge if/when the need actually arises. In most 
> cases, the vector length returned at run-time from VSETVL is good enough.
>

I need to know the length in C, not assembly. Whilst you're at adding
initial support, I think it makes sense to support all code that's targetting
RISC-V, not just the ones it's convenient to. I'll probably write the FFT
as soon as I get access to a real machine.
Rémi Denis-Courmont Sept. 4, 2022, 5:41 a.m. UTC | #4
Le sunnuntaina 4. syyskuuta 2022, 0.38.32 EEST Lynne a écrit :
> I need to know the length in C, not assembly.

There may be some corner cases where that makes sense, but typically it 
doesn't. Even if you're dealing in fixed-size macro blocks, you should leverage 
the larger vectors to unroll and process multiple macro blocks in parallel.

And besides, how do you want to get the value if not with assembler? This is 
currently not found in ELF HWCAP and probably never will be.

So the only way to find out in pure C is in the embedded case, by checking out 
the __riscv_zlvXXXb preprocessor predefined constants. But that only tells what 
is the guaranteed minimum vector size for the compile-time target.

Outside of embedded world, that's currently always undefined because everybody 
uses RVA20 as the baseline, which does not require vector support. Going 
forward, RVA22 will require 128 bits, but that says nothing of what the run-
time CPU can actually do.

> I think it makes sense to support all code that's targetting RISC-V, not 
just the ones it's convenient to.

I disagree. There are currently no means to negotiate a vector length with the 
OS, so that seems highly premature. And even if there was such a mechanism, 
it's simply much faster to call VSETVL in an inline assembler macro where 
needed than to compute the whole set of CPU flags.
Lynne Sept. 4, 2022, 6:39 a.m. UTC | #5
Sep 4, 2022, 07:41 by remi@remlab.net:

> Le sunnuntaina 4. syyskuuta 2022, 0.38.32 EEST Lynne a écrit :
>
>> I need to know the length in C, not assembly.
>>
>
> There may be some corner cases where that makes sense, but typically it 
> doesn't. Even if you're dealing in fixed-size macro blocks, you should leverage 
> the larger vectors to unroll and process multiple macro blocks in parallel.
>

Some aspects of a split-radix FFT work better if you know how
much you could fit into a register upfront. In particular, doing
the tail, which consists of 2 equal length transforms. On AVX
we interleave the coefficients from 2x4pt transforms during
lookups since we can do them simultaneously and save on
shuffles. Doing them individually wouldn't be as efficient.
Since interleaving is done during the permute step, we have
to know from C how much to interleave.
Of course if you switched away from a split-radix algorithm (X+X/2+X/2),
you could have a very simple 100-line FFT if you had arbitrarily
long vectors (or the pretense of such), but if you didn't have
the hardware to back that up, the penalty for using a suboptimal
algorithm wouldn't be worth it.


> And besides, how do you want to get the value if not with assembler? This is 
> currently not found in ELF HWCAP and probably never will be.
>

Sucks, knowing how wide the units are is as important as
knowing how much L1 cache you have for me.


> I disagree. There are currently no means to negotiate a vector length with the 
> OS, so that seems highly premature. And even if there was such a mechanism, 
> it's simply much faster to call VSETVL in an inline assembler macro where 
> needed than to compute the whole set of CPU flags.
>

Guess that's what I'll have to do.In due time anyway, who knows how many years it'll be until
a cheap enough device appears with vector support that
doesn't merely do what SVE2 devices did by reusing old NEON
unit designs.
Rémi Denis-Courmont Sept. 4, 2022, 8:27 a.m. UTC | #6
Le sunnuntaina 4. syyskuuta 2022, 9.39.36 EEST Lynne a écrit :
> In particular, doing the tail, which consists of 2 equal length transforms.
> On AVX we interleave the coefficients from 2x4pt transforms during
> lookups since we can do them simultaneously and save on
> shuffles. Doing them individually wouldn't be as efficient.

I'm not going to boldy state that one size fits all, because I am pretty sure 
that it would come back to bite me in soft and sensitive tissue. But unlike 
SIMD extensions, RISC-V V and ARM SVE favour the use of offsets and masks to 
deal with misaligned edges, so I'm not sure how useful the insights from AVX 
are.

> > And besides, how do you want to get the value if not with assembler? This
> > is currently not found in ELF HWCAP and probably never will be.

> Sucks, knowing how wide the units are is as important as
> knowing how much L1 cache you have for me.

I understand that for some multidimensional calculations, you need to make 
special cases. The obvious case would be if the vector is too short to fit a 
column or row of elements whilst performing a transposition.

But even then, and even if we end up later on with, say, an arch_prctl() call 
to find the vector size, I don't think exposing it in CPU flags would be a good 
idea. VSETVL & VSETIVL also account for the element size and the vector group 
multiplier, so it seems better to use either of them than to reimplement the 
same logic in C based on the raw vector bit length.
diff mbox series

Patch

diff --git a/libavutil/cpu.c b/libavutil/cpu.c
index 0035e927a5..83bf513cf2 100644
--- a/libavutil/cpu.c
+++ b/libavutil/cpu.c
@@ -62,6 +62,8 @@  static int get_cpu_flags(void)
     return ff_get_cpu_flags_arm();
 #elif ARCH_PPC
     return ff_get_cpu_flags_ppc();
+#elif ARCH_RISCV
+    return ff_get_cpu_flags_riscv();
 #elif ARCH_X86
     return ff_get_cpu_flags_x86();
 #elif ARCH_LOONGARCH
@@ -178,6 +180,18 @@  int av_parse_cpu_caps(unsigned *flags, const char *s)
 #elif ARCH_LOONGARCH
         { "lsx",      NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_LSX      },    .unit = "flags" },
         { "lasx",     NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_LASX     },    .unit = "flags" },
+#elif ARCH_RISCV
+#define AV_CPU_FLAG_ZVE32F_M (AV_CPU_FLAG_ZVE32F | AV_CPU_FLAG_ZVE32X)
+#define AV_CPU_FLAG_ZVE64X_M (AV_CPU_FLAG_ZVE64X | AV_CPU_FLAG_ZVE32X)
+#define AV_CPU_FLAG_ZVE64D_M (AV_CPU_FLAG_ZVE64D | AV_CPU_FLAG_ZVE64F_M)
+#define AV_CPU_FLAG_ZVE64F_M (AV_CPU_FLAG_ZVE32F_M | AV_CPU_FLAG_ZVE64X_M)
+#define AV_CPU_FLAG_VECTORS  AV_CPU_FLAG_ZVE64D_M
+        { "vectors",  NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_VECTORS  },    .unit = "flags" },
+        { "zve32x",   NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_ZVE32X   },    .unit = "flags" },
+        { "zve32f",   NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_ZVE32F_M },    .unit = "flags" },
+        { "zve64x",   NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_ZVE64X_M },    .unit = "flags" },
+        { "zve64f",   NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_ZVE64F_M },    .unit = "flags" },
+        { "zve64d",   NULL, 0, AV_OPT_TYPE_CONST, { .i64 = AV_CPU_FLAG_ZVE64D_M },    .unit = "flags" },
 #endif
         { NULL },
     };
diff --git a/libavutil/cpu.h b/libavutil/cpu.h
index 9711e574c5..44836e50d6 100644
--- a/libavutil/cpu.h
+++ b/libavutil/cpu.h
@@ -78,6 +78,12 @@ 
 #define AV_CPU_FLAG_LSX          (1 << 0)
 #define AV_CPU_FLAG_LASX         (1 << 1)
 
+// RISC-V Vector extension
+#define AV_CPU_FLAG_ZVE32X       (1 << 0) /* 8-, 16-, 32-bit integers */
+#define AV_CPU_FLAG_ZVE32F       (1 << 1) /* single precision scalars */
+#define AV_CPU_FLAG_ZVE64X       (1 << 2) /* 64-bit integers */
+#define AV_CPU_FLAG_ZVE64D       (1 << 3) /* double precision scalars */
+
 /**
  * Return the flags which specify extensions supported by the CPU.
  * The returned value is affected by av_force_cpu_flags() if that was used
diff --git a/libavutil/cpu_internal.h b/libavutil/cpu_internal.h
index 650d47fc96..634f28bac4 100644
--- a/libavutil/cpu_internal.h
+++ b/libavutil/cpu_internal.h
@@ -48,6 +48,7 @@  int ff_get_cpu_flags_mips(void);
 int ff_get_cpu_flags_aarch64(void);
 int ff_get_cpu_flags_arm(void);
 int ff_get_cpu_flags_ppc(void);
+int ff_get_cpu_flags_riscv(void);
 int ff_get_cpu_flags_x86(void);
 int ff_get_cpu_flags_loongarch(void);
 
diff --git a/libavutil/riscv/Makefile b/libavutil/riscv/Makefile
new file mode 100644
index 0000000000..1f818043dc
--- /dev/null
+++ b/libavutil/riscv/Makefile
@@ -0,0 +1 @@ 
+OBJS += riscv/cpu.o
diff --git a/libavutil/riscv/cpu.c b/libavutil/riscv/cpu.c
new file mode 100644
index 0000000000..96726f2f85
--- /dev/null
+++ b/libavutil/riscv/cpu.c
@@ -0,0 +1,58 @@ 
+/*
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/cpu.h"
+#include "libavutil/cpu_internal.h"
+#include "config.h"
+
+#if HAVE_GETAUXVAL
+#include <sys/auxv.h>
+#endif
+
+#define HWCAP_RV(letter) (1ul << ((letter) - 'A'))
+#define ZVE_UP_TO(cap)   ((2 * (cap)) - 1)
+
+int ff_get_cpu_flags_riscv(void)
+{
+    int ret = 0;
+
+    /* If RV-V is enabled statically at compile-time, check the details. */
+#ifdef __riscv_vectors
+    ret |= AV_CPU_FLAG_ZVE32X;
+#if __riscv_v_elen >= 64
+    ret |= AV_CPU_FLAG_ZVE64X;
+#endif
+#if __riscv_v_elen_fp >= 32
+    ret |= AV_CPU_FLAG_ZVE32F;
+#endif
+#if __riscv_v_elen_fp >= 64
+    ret |= AV_CPU_FLAG_ZVE32F;
+#endif
+#endif
+
+#if HAVE_GETAUXVAL
+    const unsigned long hwcap = getauxval(AT_HWCAP);
+
+    /* The V extension implies all subsets */
+    if (hwcap & HWCAP_RV('V'))
+        ret |= AV_CPU_FLAG_ZVE32X | AV_CPU_FLAG_ZVE64X
+             | AV_CPU_FLAG_ZVE32F | AV_CPU_FLAG_ZVE64D;
+#endif
+
+    return ret;
+}