diff mbox series

[FFmpeg-devel,2/4] avcodec/get_bits: Avoid 2nd bitstream read in GET_VLC() if bits are known at build and small

Message ID 20231024150443.7438-2-michael@niedermayer.cc
State New
Series [FFmpeg-devel,1/4] avcodec/magicyuv: Use a compile time constant for vlc_bits

Checks

Context Check Description
yinshiyou/configure_loongarch64 warning Failed to apply patch
andriy/configure_x86 warning Failed to apply patch

Commit Message

Michael Niedermayer Oct. 24, 2023, 3:04 p.m. UTC
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
---
 libavcodec/get_bits.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Andreas Rheinhardt Oct. 27, 2023, 3:10 a.m. UTC | #1
Michael Niedermayer:
> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> ---
>  libavcodec/get_bits.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/libavcodec/get_bits.h b/libavcodec/get_bits.h
> index cfcf97c021c..86cea00494a 100644
> --- a/libavcodec/get_bits.h
> +++ b/libavcodec/get_bits.h
> @@ -581,8 +581,12 @@ static inline const uint8_t *align_get_bits(GetBitContext *s)
>          n     = table[index].len;                               \
>                                                                  \
>          if (max_depth > 1 && n < 0) {                           \
> -            LAST_SKIP_BITS(name, gb, bits);                     \
> -            UPDATE_CACHE(name, gb);                             \
> +            if (av_builtin_constant_p(bits <= MIN_CACHE_BITS/2) && bits <= MIN_CACHE_BITS/2) { \
> +                SKIP_BITS(name, gb, bits);                      \
> +            } else {                                            \
> +                LAST_SKIP_BITS(name, gb, bits);                 \
> +                UPDATE_CACHE(name, gb);                         \
> +            }                                                   \
>                                                                  \
>              nb_bits = -n;                                       \
>                                                                  \

This is problematic: The GET_VLC macro does not presume that
MIN_CACHE_BITS are available; there is code that directly uses GET_VLC
instead of get_vlc2().
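
For readers unfamiliar with the guard in the hunk above, here is a minimal sketch of the pattern, not the FFmpeg code itself. It assumes that av_builtin_constant_p behaves like GCC/Clang's __builtin_constant_p and that MIN_CACHE_BITS is 25; all toy_/TOY_ names are hypothetical.

```c
#define TOY_MIN_CACHE_BITS 25   /* assumed value of MIN_CACHE_BITS */

/* Hypothetical stand-in for av_builtin_constant_p: on compilers without
 * the builtin it evaluates to 0, so the slow path is always chosen. */
#if defined(__GNUC__) || defined(__clang__)
#define toy_builtin_constant_p(x) __builtin_constant_p(x)
#else
#define toy_builtin_constant_p(x) 0
#endif

/* Evaluates to 1 only if the comparison is known at build time AND the
 * number of bits is small enough; anything decided at run time yields 0,
 * which keeps the original LAST_SKIP_BITS + UPDATE_CACHE path. */
#define TOY_CAN_SKIP_RELOAD(bits)                                 \
    (toy_builtin_constant_p((bits) <= TOY_MIN_CACHE_BITS / 2) &&  \
     (bits) <= TOY_MIN_CACHE_BITS / 2)
```

With a literal such as 9 the macro selects the fast path; with a value the compiler cannot see through, it falls back to the reload. The guard says nothing about how many bits are actually in the cache when GET_VLC is entered, which is the concern raised above.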

I had the same idea when I made my VLC patchset, yet I wanted to first
apply it (which I forgot). While investigating the above issue, I found
out that all users of GET_VLC always call UPDATE_CACHE immediately
before GET_VLC, so UPDATE_CACHE should be moved into GET_VLC;
furthermore, no user of GET_VLC relies on the reloads inside of GET_VLC.
The patches for this are here:
https://github.com/mkver/FFmpeg/commits/vlc Shall I send them?

Notice that making GET_VLC more standalone enables improvements over the
current approach; yet it will not lead to optimal code: E.g. the VLCs in
decode_alpha_block() in speedhqdec.c are so short that one could read
both VLCs with only one UPDATE_CACHE(); another example is mjpegdec.c
which currently does this:

        GET_VLC(code, re, &s->gb, s->vlcs[1][ac_index].table, 9, 2);

        i += ((unsigned)code) >> 4;
            code &= 0xf;
        if (code) {
            if (code > MIN_CACHE_BITS - 16)
                UPDATE_CACHE(re, &s->gb);

            {
                int cache = GET_CACHE(re, &s->gb);
                int sign  = (~cache) >> 31;
                level     = (NEG_USR32(sign ^ cache,code) ^ sign) - sign;
            }

            LAST_SKIP_BITS(re, &s->gb, code);

Because of the reloads in GET_VLC, there will always be at least
MIN_CACHE_BITS - 9 (= 16) bits available after GET_VLC, so one can read
code (<= 15) bits without updating the cache at all (16 in
MIN_CACHE_BITS - 16 is the maximum length of a VLC code used here); this
will no longer be possible with this optimization.
Btw: I am surprised that there is a branch before UPDATE_CACHE instead
of an unconditional UPDATE_CACHE. I also do not really see why this uses
these macros directly and not the functions.
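
The bit budget above can be checked with a few lines of arithmetic. This is only a sketch of a worst-case model: it assumes MIN_CACHE_BITS is 25 (as implied by "MIN_CACHE_BITS - 9 (= 16)"), that the cache holds exactly MIN_CACHE_BITS bits right after an UPDATE_CACHE, and that codes longer than the 9-bit first-level table take a second lookup. The toy_/TOY_ names are hypothetical.

```c
#define TOY_MIN_CACHE_BITS 25  /* assumed value of MIN_CACHE_BITS */
#define TOY_TABLE_BITS      9  /* first-level table bits in the mjpegdec call */
#define TOY_MAX_CODE_LEN   16  /* maximum VLC code length mentioned above */

/* Worst-case number of bits left in the cache after reading one code of
 * code_len bits, starting from a freshly updated cache. */
static int toy_bits_left(int code_len, int reload_between_levels)
{
    /* Current macro: the reload after the first level refills the cache,
     * so only the second-level bits are missing afterwards. */
    if (code_len > TOY_TABLE_BITS && reload_between_levels)
        return TOY_MIN_CACHE_BITS - (code_len - TOY_TABLE_BITS);
    /* Single-level code, or no reload: the whole code comes out of the
     * initial MIN_CACHE_BITS. */
    return TOY_MIN_CACHE_BITS - code_len;
}

/* Minimum over all possible code lengths. */
static int toy_worst_case_left(int reload_between_levels)
{
    int worst = TOY_MIN_CACHE_BITS;
    for (int len = 1; len <= TOY_MAX_CODE_LEN; len++) {
        int left = toy_bits_left(len, reload_between_levels);
        if (left < worst)
            worst = left;
    }
    return worst;
}
```

Under these assumptions toy_worst_case_left(1) is 16 and toy_worst_case_left(0) is 9, so a following read of up to 15 bits is only covered without a further UPDATE_CACHE when the reload inside GET_VLC is kept.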

Given my objection to your patch #1, magicyuv will not benefit from
this; a different approach (see
https://github.com/mkver/FFmpeg/commit/9b5a977957968c0718dea55a5b15f060ef6201dc)
is to add a get_vlc() that takes the number of bits used to create the
VLC and a compile-time upper bound for the maximum length of a VLC code
as parameters instead of the maximum depth of the VLC.

Reading VLCs for the cached bitstream reader can btw also be improved:
https://github.com/mkver/FFmpeg/commit/fba57506a9cf6be2f4aa5eeee7b10d54729fd92a

- Andreas
Michael Niedermayer Oct. 27, 2023, 6:38 p.m. UTC | #2
On Fri, Oct 27, 2023 at 05:10:32AM +0200, Andreas Rheinhardt wrote:
> Michael Niedermayer:
[...]

> Reading VLCs for the cached bitstream reader can btw also be improved:
> https://github.com/mkver/FFmpeg/commit/fba57506a9cf6be2f4aa5eeee7b10d54729fd92a

Speaking of that, I was wondering whether the alternatives we had in get_bits.h
(A32_BITSTREAM_READER) wouldn't be worth reinvestigating.
Especially when extended to 64 bits, some of these readers might perform better:
there are then simply more bits available, so fewer reads and, for the cached
ones, fewer mispredicted branches.

It would be somewhat nice if we could avoid having 2 different APIs, as we have
now with the cached and the normal reader.
The normal one with 64 bits would also be interesting: more available bits,
so fewer reads.

Also, I was wondering about a VLC reader that's entirely free of conditional
branches: just a loop that in each iteration would step forward by 0-n symbols
and update a pointer to the table to use next.
But I don't think I will have the time to try to implement this.

Sadly, I have a lot more ideas than I have time to try; if you want to, can, or
already did try any of the above, that would be interesting.

thx

[...]
Andreas Rheinhardt Oct. 30, 2023, 8:49 p.m. UTC | #3
Michael Niedermayer:
> On Fri, Oct 27, 2023 at 05:10:32AM +0200, Andreas Rheinhardt wrote:
>> Michael Niedermayer:
[...]
> 
>> Reading VLCs for the cached bitstream reader can btw also be improved:
>> https://github.com/mkver/FFmpeg/commit/fba57506a9cf6be2f4aa5eeee7b10d54729fd92a
> 
> Speaking of that, I was wondering whether the alternatives we had in get_bits.h
> (A32_BITSTREAM_READER) wouldn't be worth reinvestigating.
> Especially when extended to 64 bits, some of these readers might perform better:
> there are then simply more bits available, so fewer reads and, for the cached
> ones, fewer mispredicted branches.
> 
> It would be somewhat nice if we could avoid having 2 different APIs, as we have
> now with the cached and the normal reader.
> The normal one with 64 bits would also be interesting: more available bits,
> so fewer reads.

I already did something like that in
b0fb8e82dde375efb8d5602f6cd479f210c1e93c. But be aware that the most
important reason we read more often than we necessarily have to is not
the number of bits in the cache, but that we stopped using the macros
(like SKIP_CACHE, SKIP_COUNTER) directly and used functions which are
more standalone (and IMO also more readable).
Furthermore, creating replacements of these macros is complicated for
cached bitstream readers, because of the implicit overread checks. Here
is the important part of read_nz:

    if (n > bc->bits_valid) {
        if (BS_FUNC(priv_refill_32)(bc) < 0)
            bc->bits_valid = n;
    }

    return BS_FUNC(priv_val_get)(bc, n);

priv_val_get presumes there to be enough valid bits available; otherwise
bits_valid wraps around.
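
To make the wrap-around concrete, here is a simplified sketch and not the actual bitstream reader code. It assumes an unsigned bits_valid counter, models only a reader whose input is already exhausted (so the refill always fails), and uses hypothetical Toy/toy_ names.

```c
#include <stdint.h>

typedef struct ToyCache {
    uint64_t bits;        /* cached bits, valid ones are the most significant */
    unsigned bits_valid;  /* number of valid bits in the cache */
} ToyCache;

/* In the spirit of priv_val_get: no checks at all; n must be in 1..63. */
static uint64_t toy_val_get(ToyCache *c, unsigned n)
{
    uint64_t ret = c->bits >> (64 - n);
    c->bits     <<= n;
    c->bits_valid -= n;   /* unsigned: wraps around if n > bits_valid */
    return ret;
}

/* In the spirit of the read_nz excerpt above, for a reader at the end of
 * its input: the failed refill is answered by setting bits_valid to n, so
 * that the subtraction in toy_val_get() ends at 0 instead of wrapping. */
static uint64_t toy_read_nz_at_eof(ToyCache *c, unsigned n)
{
    if (n > c->bits_valid)
        c->bits_valid = n;
    return toy_val_get(c, n);
}
```

Calling toy_val_get() directly with n larger than bits_valid leaves a huge bogus count behind, whereas the guarded variant leaves bits_valid at 0. This is the implicit overread check that a macro-style replacement would have to preserve.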

This does not mean that there is no hope for avoiding unnecessary reads:
We could add an unsigned get_bits_cached(GetBitContext *gb, int n)
(basically a get_bits(), but the valid bits are the most significant
bits (for BE)) and a get_bits_from_cache(unsigned *cache, int n)
(basically SHOW_UBITS+SKIP_CACHE, or priv_val_get for the cached API),
which could then be used as follows:

    unsigned cache = get_bits_cached(gb, 4 * 4);

    s->mpeg_f_code[0][0] = get_bits_from_cache(&cache, 4);
    s->mpeg_f_code[0][1] = get_bits_from_cache(&cache, 4);
    s->mpeg_f_code[1][0] = get_bits_from_cache(&cache, 4);
    s->mpeg_f_code[1][1] = get_bits_from_cache(&cache, 4);

This works because get_bits_cached() would already remove the specified
number of bits from the bitstream; but this also shows its restrictions:
It is only usable when the number of bits to be read subsequently is
known in advance.
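
A minimal sketch of how such a pair of helpers could behave, written against a hypothetical toy reader rather than GetBitContext. ToyBitReader and the toy_ prefixed functions are assumptions for illustration only; the sketch is big-endian only, zero-pads past the end of the buffer and does no other overread handling.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct ToyBitReader {
    const uint8_t *buf;
    size_t         size_in_bits;
    size_t         index;          /* current position in bits */
} ToyBitReader;

/* Consume n bits (1..25) from the reader and return them left-aligned,
 * i.e. in the most significant bits of the result. */
static uint32_t toy_get_bits_cached(ToyBitReader *r, int n)
{
    size_t   nb_bytes = (r->size_in_bits + 7) / 8;
    size_t   byte     = r->index >> 3;
    uint32_t cache    = 0;

    /* Big-endian 32-bit load, zero-padded past the end of the buffer. */
    for (int i = 0; i < 4; i++) {
        uint32_t b = byte + i < nb_bytes ? r->buf[byte + i] : 0;
        cache |= b << (24 - 8 * i);
    }
    cache <<= r->index & 7;          /* drop bits that were already read */
    cache  &= ~(UINT32_MAX >> n);    /* keep only the n requested bits */
    r->index += n;                   /* the bits are consumed right here */
    return cache;
}

/* Take the top n bits (1..31) out of a left-aligned cache. */
static unsigned toy_get_bits_from_cache(uint32_t *cache, int n)
{
    unsigned val = *cache >> (32 - n);
    *cache <<= n;
    return val;
}
```

With the two bytes 0x12 0x34 as input, one toy_get_bits_cached(&r, 16) followed by four toy_get_bits_from_cache(&cache, 4) calls yields 1, 2, 3 and 4 while touching the reader state only once.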

Another potential API is to add a show_bits_cached() that is the same as
show_bits(), but returns the valid bits in the most significant bits
(for BE). This could then be complemented with a get_bits_from_cache()
and in the end one needs to skip the number of bits actually consumed.
The latter would of course need to validate/sanitize the number of bits
consumed for both the cached and uncached reader. For the cached reader,
doing so incurs a check that is avoided in case one uses the
get_bits_cached() API outlined above (of course, this only applies to
the scenarios where one can actually use said API).

> 
> Also, I was wondering about a VLC reader that's entirely free of conditional
> branches: just a loop that in each iteration would step forward by 0-n symbols
> and update a pointer to the table to use next.

Doesn't this have the downside that short symbols need as many
iterations as the longest one?

> But I don't think I will have the time to try to implement this.
> 
> Sadly, I have a lot more ideas than I have time to try; if you want to, can, or
> already did try any of the above, that would be interesting.
>
Michael Niedermayer Oct. 31, 2023, 12:25 a.m. UTC | #4
On Mon, Oct 30, 2023 at 09:49:07PM +0100, Andreas Rheinhardt wrote:
> Michael Niedermayer:
[...]

> > 
> > Also, I was wondering about a VLC reader that's entirely free of conditional
> > branches: just a loop that in each iteration would step forward by 0-n symbols
> > and update a pointer to the table to use next.
> 
> Doesn't this have the downside that short symbols need as many
> iterations as the longest one?

I don't think so, but maybe I am thinking of something else.

Let's assume our main table is 10 bits, so we read 10 bits and look them up in
the table, and that tells us what there is. Let's also assume these 10 bits
contain 2 complete symbols and one partial one.

First, we copy a block from the table into our symbol output list (it contains 2 symbols).
Second, we take a pointer from the table to the next table.
    The incomplete symbol can either be handled in the next iteration, or we
    could point back to the main 10-bit table and only handle the
    2 complete ones in this iteration.
Third, we move the bit position (10) and the symbol output pointer (2) forward.

Now we go back to the start of the loop and continue handling the partial symbol.
First, we copy a block from the table into our symbol output list (it contains 0 symbols, as our partial one is still unfinished).
Second, we take a pointer from the table to the next table;
    this is the next table to decode the long symbol.
Third, we move the bit position (10) and the symbol output pointer (0) forward.

Now we go back to the start of the loop and continue handling the partial symbol.
First, we copy a block from the table into our symbol output list (this finally completes the long symbol, so we output it).
Second, we take a pointer from the table to the next table, which is back at the main table.
Third, we move the bit position (x) and the symbol output pointer (1) forward.
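
The loop described above could look roughly like the following. This is a toy sketch under stated assumptions, not a proposal for FFmpeg: it uses a made-up prefix code ('A' = 0, 'B' = 10, 'C' = 11), 4-bit chunks instead of 10, always consumes a full chunk per iteration, and encodes the "pointer to the next table" as a state index. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_BITS 4            /* bits consumed per iteration */
#define MAX_SYMS   CHUNK_BITS   /* each bit completes at most one symbol */
#define NB_STATES  2            /* 0: nothing pending, 1: a lone '1' pending */

typedef struct ToyEntry {
    uint8_t syms[MAX_SYMS];     /* symbols completed by this chunk */
    uint8_t nb_syms;            /* how many of them are valid */
    uint8_t next_state;         /* table to use for the next chunk */
} ToyEntry;

static ToyEntry toy_tables[NB_STATES][1 << CHUNK_BITS];

/* Fill the tables by decoding every possible chunk bit by bit. */
static void toy_build_tables(void)
{
    for (int state = 0; state < NB_STATES; state++) {
        for (int chunk = 0; chunk < (1 << CHUNK_BITS); chunk++) {
            ToyEntry *e  = &toy_tables[state][chunk];
            int pending  = state;
            memset(e, 0, sizeof(*e));
            for (int i = CHUNK_BITS - 1; i >= 0; i--) {
                int bit = (chunk >> i) & 1;
                if (pending) {
                    e->syms[e->nb_syms++] = bit ? 'C' : 'B';
                    pending = 0;
                } else if (bit) {
                    pending = 1;
                } else {
                    e->syms[e->nb_syms++] = 'A';
                }
            }
            e->next_state = pending;
        }
    }
}

/* Decode nb_chunks chunks (MSB first) from buf into out, which must have
 * room for nb_chunks * MAX_SYMS bytes. Returns the number of symbols.
 * The loop body has no data-dependent branch: copy a fixed-size block,
 * advance the output by 0..MAX_SYMS symbols, switch tables. */
static size_t toy_decode(const uint8_t *buf, size_t nb_chunks, uint8_t *out)
{
    size_t   nb_out = 0;
    unsigned state  = 0;

    for (size_t i = 0; i < nb_chunks; i++) {
        unsigned shift    = 4 * (1 - (unsigned)(i & 1));
        unsigned chunk    = (buf[i >> 1] >> shift) & 0xF;
        const ToyEntry *e = &toy_tables[state][chunk];

        memcpy(out + nb_out, e->syms, MAX_SYMS);
        nb_out += e->nb_syms;
        state   = e->next_state;
    }
    return nb_out;
}
```

For the input byte 0x5B (bits 0101 1011) the first chunk yields "AB" and leaves a '1' pending, and the second chunk completes 'C' across the chunk boundary and adds "AC". A real implementation would additionally need variable bit consumption and tables for long codes, as in the three-step description above.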


thx

[...]
Patch

diff --git a/libavcodec/get_bits.h b/libavcodec/get_bits.h
index cfcf97c021c..86cea00494a 100644
--- a/libavcodec/get_bits.h
+++ b/libavcodec/get_bits.h
@@ -581,8 +581,12 @@  static inline const uint8_t *align_get_bits(GetBitContext *s)
         n     = table[index].len;                               \
                                                                 \
         if (max_depth > 1 && n < 0) {                           \
-            LAST_SKIP_BITS(name, gb, bits);                     \
-            UPDATE_CACHE(name, gb);                             \
+            if (av_builtin_constant_p(bits <= MIN_CACHE_BITS/2) && bits <= MIN_CACHE_BITS/2) { \
+                SKIP_BITS(name, gb, bits);                      \
+            } else {                                            \
+                LAST_SKIP_BITS(name, gb, bits);                 \
+                UPDATE_CACHE(name, gb);                         \
+            }                                                   \
                                                                 \
             nb_bits = -n;                                       \
                                                                 \