
[FFmpeg-devel,10/39] lavc/ffv1dec: move the bitreader to stack

Message ID 20240716171155.31838-10-anton@khirnov.net
State New
Series [FFmpeg-devel,01/39] tests/fate/vcodec: add vsynth tests for FFV1 version 2

Checks

Context Check Description
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Anton Khirnov July 16, 2024, 5:11 p.m. UTC
There is no reason to place it in persistent state.
---
 libavcodec/ffv1.h             |  1 -
 libavcodec/ffv1dec.c          | 28 +++++++++++++++-------------
 libavcodec/ffv1dec_template.c | 23 ++++++++++++-----------
 3 files changed, 27 insertions(+), 25 deletions(-)

Comments

Michael Niedermayer July 17, 2024, 10:42 p.m. UTC | #1
On Tue, Jul 16, 2024 at 07:11:25PM +0200, Anton Khirnov wrote:
> There is no reason to place it in persistent state.
> ---
>  libavcodec/ffv1.h             |  1 -
>  libavcodec/ffv1dec.c          | 28 +++++++++++++++-------------
>  libavcodec/ffv1dec_template.c | 23 ++++++++++++-----------
>  3 files changed, 27 insertions(+), 25 deletions(-)

more complex code, an additional pointer to pass around

all the stuff should be put together close so it's efficiently
using CPU caches

thx

[...]
Anton Khirnov July 18, 2024, 9:08 a.m. UTC | #2
Quoting Michael Niedermayer (2024-07-18 00:42:05)
> all the stuff should be put together close so it's efficiently
> using CPU caches

Which is why it shares its cacheline with PutBitContext, because the
code benefits from having both in the cache, right? And the 4-byte
hole in PutBitContext is there presumably to aerate the cache for
smoother data streaming.

More seriously, this is not how caches work. Being close together
matters mainly so long as your data fits in a cacheline, beyond that
physical proximity matters little. On stack, the bitreader is likely to
share the cacheline with other data that is currently needed, thus
improving cache utilization.

Another factor that matters in efficient cache use is e.g. not having
multiple copies of the same constant data scattered around, which you're
objecting to in my other patches.
Michael Niedermayer July 18, 2024, 2:48 p.m. UTC | #3
On Thu, Jul 18, 2024 at 11:08:59AM +0200, Anton Khirnov wrote:
> Quoting Michael Niedermayer (2024-07-18 00:42:05)
> > all the stuff should be put together close so its efficiently
> > using CPU caches
> 
> Which is why it shares its cacheline with PutBitContext, because the
> code benefits from having both in the cache, right? And the 4-byte
> hole in PutBitContext is there presumably to aerate the cache for
> smoother data streaming.

thanks for spotting these, can you fix these?


> 
> More seriously, this is not how caches work. Being close together
> matters mainly so long as your data fits in a cacheline, beyond that
> physical proximity matters little. On stack, the bitreader is likely to
> share the cacheline with other data that is currently needed, thus
> improving cache utilization.

caches are complex, and being close does matter.
having things in separate allocations risks hitting aliasing cases
(that is, things that cannot be in the cache at the same time)
so when you have the bitstream, the frame buffer and the context already
in 3 independent locations, adding a few more increases the risk of hitting
these.
Also sequential memory access is faster than non-sequential; it makes
more sense to put things together in a few places than to scatter them

It's years since I've done hardcore optimization stuff, but I don't think
the principles have changed so much that random access is now faster than
sequential or that caches work fundamentally differently


> 
> Another factor that matters in efficient cache use is e.g. not having
> multiple copies of the same constant data scattered around, which you're
> objecting to in my other patches.

copying the actually used small data together per slice,
where it's accessed per pixel, should improve the speed per pixel while
making the per-slice code a little slower. now we have maybe 4 slices
and millions of pixels. That's why this can give an overall gain

thx

[...]
Anton Khirnov July 18, 2024, 3:31 p.m. UTC | #4
Quoting Michael Niedermayer (2024-07-18 16:48:06)
> On Thu, Jul 18, 2024 at 11:08:59AM +0200, Anton Khirnov wrote:
> > Quoting Michael Niedermayer (2024-07-18 00:42:05)
> > > all the stuff should be put together close so it's efficiently
> > > using CPU caches
> > 
> > Which is why it shares its cacheline with PutBitContext, because the
> > code benefits from having both in the cache, right? And the 4-byte
> > hole in PutBitContext is there presumably to aerate the cache for
> > smoother data streaming.
> 
> thanks for spotting these, can you fix these?

I have no interest in optimizing the performance of this code. My
primary goal here is to remove FFV1-specific hacks from the frame
threading code for patch 33/39, which is in turn needed for 38/39.

As a public service, I also spent some effort on making the ffv1 code
easier to understand, but if you insist on keeping the code as it is I
can also just drop its non-compliant frame threading implementation.

> > 
> > More seriously, this is not how caches work. Being close together
> > matters mainly so long as your data fits in a cacheline, beyond that
> > physical proximity matters little. On stack, the bitreader is likely to
> > share the cacheline with other data that is currently needed, thus
> > improving cache utilization.
> 
> caches are complex, and being close does matter.
> having things in separate allocations risks hitting aliasing cases
> (that is, things that cannot be in the cache at the same time)
> so when you have the bitstream, the frame buffer and the context already
> in 3 independent locations, adding a few more increases the risk of hitting
> these.
> Also sequential memory access is faster than non-sequential; it makes
> more sense to put things together in a few places than to scatter them
> 
> It's years since I've done hardcore optimization stuff, but I don't think
> the principles have changed so much that random access is now faster than
> sequential or that caches work fundamentally differently

I don't see how any of these arguments are relevant - I am not moving
the bitreader to a new allocation, but to stack, which is already highly
likely to be in cache.

> > 
> > Another factor that matters in efficient cache use is e.g. not having
> > multiple copies of the same constant data scattered around, which you're
> > objecting to in my other patches.
> 
> copying the actually used small data together per slice,
> where it's accessed per pixel, should improve the speed per pixel while
> making the per-slice code a little slower. now we have maybe 4 slices
> and millions of pixels. That's why this can give an overall gain

This all sounds like premature optimization, AKA the root of all evil.
As I said above, I intended to make this code more readable, not faster.
Yet somehow it became faster anyway, which suggests this code is not
very optimized. So then arguing whether this or that specific change
adds or removes a few cycles per frame seems like a waste of time to me.
Paul B Mahol July 18, 2024, 3:35 p.m. UTC | #5
On Thu, Jul 18, 2024 at 5:31 PM Anton Khirnov <anton@khirnov.net> wrote:

> Quoting Michael Niedermayer (2024-07-18 16:48:06)
> > On Thu, Jul 18, 2024 at 11:08:59AM +0200, Anton Khirnov wrote:
> > > Quoting Michael Niedermayer (2024-07-18 00:42:05)
> > > > all the stuff should be put together close so it's efficiently
> > > > using CPU caches
> > >
> > > Which is why it shares its cacheline with PutBitContext, because the
> > > code benefits from having both in the cache, right? And the 4-byte
> > > hole in PutBitContext is there presumably to aerate the cache for
> > > smoother data streaming.
> >
> > thanks for spotting these, can you fix these?
>
> I have no interest in optimizing the performance of this code. My
> primary goal here is to remove FFV1-specific hacks from the frame
> threading code for patch 33/39, which is in turn needed for 38/39.
>
> As a public service, I also spent some effort on making the ffv1 code
> easier to understand, but if you insist on keeping the code as it is I
> can also just drop its non-compliant frame threading implementation.
>
> > >
> > > More seriously, this is not how caches work. Being close together
> > > matters mainly so long as your data fits in a cacheline, beyond that
> > > physical proximity matters little. On stack, the bitreader is likely to
> > > share the cacheline with other data that is currently needed, thus
> > > improving cache utilization.
> >
> > caches are complex, and being close does matter.
> > having things in separate allocations risks hitting aliasing cases
> > (that is, things that cannot be in the cache at the same time)
> > so when you have the bitstream, the frame buffer and the context already
> > in 3 independent locations, adding a few more increases the risk of
> > hitting these.
> > Also sequential memory access is faster than non-sequential; it makes
> > more sense to put things together in a few places than to scatter them
> >
> > It's years since I've done hardcore optimization stuff, but I don't think
> > the principles have changed so much that random access is now faster than
> > sequential or that caches work fundamentally differently
>
> I don't see how any of these arguments are relevant - I am not moving
> the bitreader to a new allocation, but to stack, which is already highly
> likely to be in cache.
>
> > >
> > > Another factor that matters in efficient cache use is e.g. not having
> > > multiple copies of the same constant data scattered around, which
> > > you're objecting to in my other patches.
> >
> > copying the actually used small data together per slice,
> > where it's accessed per pixel, should improve the speed per pixel while
> > making the per-slice code a little slower. now we have maybe 4 slices
> > and millions of pixels. That's why this can give an overall gain
>
> This all sounds like premature optimization, AKA the root of all evil.
> As I said above, I intended to make this code more readable, not faster.
> Yet somehow it became faster anyway, which suggests this code is not
> very optimized. So then arguing whether this or that specific change
> adds or removes a few cycles per frame seems like a waste of time to me.
>


Will merge this into Librempeg if it does not get into FFmpeg.


>
> --
> Anton Khirnov
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
>
Michael Niedermayer July 18, 2024, 6:18 p.m. UTC | #6
On Thu, Jul 18, 2024 at 05:31:16PM +0200, Anton Khirnov wrote:
> Quoting Michael Niedermayer (2024-07-18 16:48:06)
> > On Thu, Jul 18, 2024 at 11:08:59AM +0200, Anton Khirnov wrote:
> > > Quoting Michael Niedermayer (2024-07-18 00:42:05)
[...]
> > >
> > > Another factor that matters in efficient cache use is e.g. not having
> > > multiple copies of the same constant data scattered around, which you're
> > > objecting to in my other patches.
> > 
> > copying the actually used small data together per slice,
> > where it's accessed per pixel, should improve the speed per pixel while
> > making the per-slice code a little slower. now we have maybe 4 slices
> > and millions of pixels. That's why this can give an overall gain
> 
> This all sounds like premature optimization, AKA the root of all evil.
> As I said above, I intended to make this code more readable, not faster.
> Yet somehow it became faster anyway, which suggests this code is not
> very optimized. So then arguing whether this or that specific change
> adds or removes a few cycles per frame seems like a waste of time to me.

Making the code faster is welcome
Making the code cleaner is welcome

what's faster is objective (mostly)
what's cleaner is not objective. So we could agree or disagree on that

Things that make the code slower, and don't provide a larger improvement
elsewhere, I do object to

About "premature optimization", I have not seen any optimizations being
removed that made the code hard to maintain or work with, so this is
not a nice statement.
And I am the first to support making the code cleaner and easier to maintain,
as I am the one maintaining it.

Either way, thanks for your efforts in improving the ffv1 code; they
are appreciated. My objections are not meant to hinder improvements,
I just want to make sure each one individually is an improvement

thx

--
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Any man who breaks a law that conscience tells him is unjust and willingly 
accepts the penalty by staying in jail in order to arouse the conscience of 
the community on the injustice of the law is at that moment expressing the 
very highest respect for law. - Martin Luther King Jr
Anton Khirnov July 20, 2024, 12:15 p.m. UTC | #7
Quoting Michael Niedermayer (2024-07-18 20:18:15)
> On Thu, Jul 18, 2024 at 05:31:16PM +0200, Anton Khirnov wrote:
> > Quoting Michael Niedermayer (2024-07-18 16:48:06)
> > > On Thu, Jul 18, 2024 at 11:08:59AM +0200, Anton Khirnov wrote:
> > > > Quoting Michael Niedermayer (2024-07-18 00:42:05)
> [...]
> > > >
> > > > Another factor that matters in efficient cache use is e.g. not having
> > > > multiple copies of the same constant data scattered around, which you're
> > > > objecting to in my other patches.
> > > 
> > > copying the actually used small data together per slice,
> > > where it's accessed per pixel, should improve the speed per pixel while
> > > making the per-slice code a little slower. now we have maybe 4 slices
> > > and millions of pixels. That's why this can give an overall gain
> > 
> > This all sounds like premature optimization, AKA the root of all evil.
> > As I said above, I intended to make this code more readable, not faster.
> > Yet somehow it became faster anyway, which suggests this code is not
> > very optimized. So then arguing whether this or that specific change
> > > adds or removes a few cycles per frame seems like a waste of time to me.
> 
> Making the code faster is welcome
> Making the code cleaner is welcome
> 
> what's faster is objective (mostly)
> what's cleaner is not objective. So we could agree or disagree on that
> 
> Things that make the code slower, and don't provide a larger improvement
> elsewhere, I do object to
> 
> About "premature optimization", I have not seen any optimizations being
> removed that made the code hard to maintain or work with, so this is
> not a nice statement.
> And I am the first to support making the code cleaner and easier to maintain,
> as I am the one maintaining it.
> 
> Either way, thanks for your efforts in improving the ffv1 code; they
> are appreciated. My objections are not meant to hinder improvements,
> I just want to make sure each one individually is an improvement

As I said in my other reply, patches 07-25 moving per-slice data to a
new per-slice struct should be considered as one change, as it does not
make sense to apply only some of them and not the others. Some of them,
taken individually, may introduce transient slowdown, because
temporarily there are two per-slice contexts.

Patch

diff --git a/libavcodec/ffv1.h b/libavcodec/ffv1.h
index 68d13a2964..c88aa8c30d 100644
--- a/libavcodec/ffv1.h
+++ b/libavcodec/ffv1.h
@@ -85,7 +85,6 @@  typedef struct FFV1Context {
     AVClass *class;
     AVCodecContext *avctx;
     RangeCoder c;
-    GetBitContext gb;
     PutBitContext pb;
     uint64_t rc_stat[256][2];
     uint64_t (*rc_stat2[MAX_QUANT_TABLES])[32][2];
diff --git a/libavcodec/ffv1dec.c b/libavcodec/ffv1dec.c
index a2971d7eea..a1f7206871 100644
--- a/libavcodec/ffv1dec.c
+++ b/libavcodec/ffv1dec.c
@@ -94,14 +94,14 @@  static inline int get_vlc_symbol(GetBitContext *gb, VlcState *const state,
     return ret;
 }
 
-static int is_input_end(FFV1Context *s)
+static int is_input_end(FFV1Context *s, GetBitContext *gb)
 {
     if (s->ac != AC_GOLOMB_RICE) {
         RangeCoder *const c = &s->c;
         if (c->overread > MAX_OVERREAD)
             return AVERROR_INVALIDDATA;
     } else {
-        if (get_bits_left(&s->gb) < 1)
+        if (get_bits_left(gb) < 1)
             return AVERROR_INVALIDDATA;
     }
     return 0;
@@ -118,6 +118,7 @@  static int is_input_end(FFV1Context *s)
 #include "ffv1dec_template.c"
 
 static int decode_plane(FFV1Context *s, FFV1SliceContext *sc,
+                        GetBitContext *gb,
                         uint8_t *src, int w, int h, int stride, int plane_index,
                          int pixel_stride)
 {
@@ -140,13 +141,13 @@  static int decode_plane(FFV1Context *s, FFV1SliceContext *sc,
         sample[0][w]  = sample[0][w - 1];
 
         if (s->avctx->bits_per_raw_sample <= 8) {
-            int ret = decode_line(s, sc, w, sample, plane_index, 8);
+            int ret = decode_line(s, sc, gb, w, sample, plane_index, 8);
             if (ret < 0)
                 return ret;
             for (x = 0; x < w; x++)
                 src[x*pixel_stride + stride * y] = sample[1][x];
         } else {
-            int ret = decode_line(s, sc, w, sample, plane_index, s->avctx->bits_per_raw_sample);
+            int ret = decode_line(s, sc, gb, w, sample, plane_index, s->avctx->bits_per_raw_sample);
             if (ret < 0)
                 return ret;
             if (s->packed_at_lsb) {
@@ -262,6 +263,7 @@  static int decode_slice(AVCodecContext *c, void *arg)
     AVFrame * const p = f->picture.f;
     const int      si = (FFV1Context**)arg - f->slice_context;
     FFV1SliceContext *sc = &f->slices[si];
+    GetBitContext gb;
 
     if (f->fsrc && !(p->flags & AV_FRAME_FLAG_KEY) && f->last_picture.f)
         ff_progress_frame_await(&f->last_picture, si);
@@ -322,7 +324,7 @@  static int decode_slice(AVCodecContext *c, void *arg)
         if (f->version == 3 && f->micro_version > 1 || f->version > 3)
             get_rac(&fs->c, (uint8_t[]) { 129 });
         fs->ac_byte_count = f->version > 2 || (!x && !y) ? fs->c.bytestream - fs->c.bytestream_start - 1 : 0;
-        init_get_bits(&fs->gb,
+        init_get_bits(&gb,
                       fs->c.bytestream_start + fs->ac_byte_count,
                       (fs->c.bytestream_end - fs->c.bytestream_start - fs->ac_byte_count) * 8);
     }
@@ -333,29 +335,29 @@  static int decode_slice(AVCodecContext *c, void *arg)
         const int chroma_height = AV_CEIL_RSHIFT(height, f->chroma_v_shift);
         const int cx            = x >> f->chroma_h_shift;
         const int cy            = y >> f->chroma_v_shift;
-        decode_plane(fs, sc, p->data[0] + ps*x + y*p->linesize[0], width, height, p->linesize[0], 0, 1);
+        decode_plane(fs, sc, &gb, p->data[0] + ps*x + y*p->linesize[0], width, height, p->linesize[0], 0, 1);
 
         if (f->chroma_planes) {
-            decode_plane(fs, sc, p->data[1] + ps*cx+cy*p->linesize[1], chroma_width, chroma_height, p->linesize[1], 1, 1);
-            decode_plane(fs, sc, p->data[2] + ps*cx+cy*p->linesize[2], chroma_width, chroma_height, p->linesize[2], 1, 1);
+            decode_plane(fs, sc, &gb, p->data[1] + ps*cx+cy*p->linesize[1], chroma_width, chroma_height, p->linesize[1], 1, 1);
+            decode_plane(fs, sc, &gb, p->data[2] + ps*cx+cy*p->linesize[2], chroma_width, chroma_height, p->linesize[2], 1, 1);
         }
         if (fs->transparency)
-            decode_plane(fs, sc, p->data[3] + ps*x + y*p->linesize[3], width, height, p->linesize[3], (f->version >= 4 && !f->chroma_planes) ? 1 : 2, 1);
+            decode_plane(fs, sc, &gb, p->data[3] + ps*x + y*p->linesize[3], width, height, p->linesize[3], (f->version >= 4 && !f->chroma_planes) ? 1 : 2, 1);
     } else if (f->colorspace == 0) {
-         decode_plane(fs, sc, p->data[0] + ps*x + y*p->linesize[0]    , width, height, p->linesize[0], 0, 2);
-         decode_plane(fs, sc, p->data[0] + ps*x + y*p->linesize[0] + 1, width, height, p->linesize[0], 1, 2);
+         decode_plane(fs, sc, &gb, p->data[0] + ps*x + y*p->linesize[0]    , width, height, p->linesize[0], 0, 2);
+         decode_plane(fs, sc, &gb, p->data[0] + ps*x + y*p->linesize[0] + 1, width, height, p->linesize[0], 1, 2);
     } else if (f->use32bit) {
         uint8_t *planes[4] = { p->data[0] + ps * x + y * p->linesize[0],
                                p->data[1] + ps * x + y * p->linesize[1],
                                p->data[2] + ps * x + y * p->linesize[2],
                                p->data[3] + ps * x + y * p->linesize[3] };
-        decode_rgb_frame32(fs, sc, planes, width, height, p->linesize);
+        decode_rgb_frame32(fs, sc, &gb, planes, width, height, p->linesize);
     } else {
         uint8_t *planes[4] = { p->data[0] + ps * x + y * p->linesize[0],
                                p->data[1] + ps * x + y * p->linesize[1],
                                p->data[2] + ps * x + y * p->linesize[2],
                                p->data[3] + ps * x + y * p->linesize[3] };
-        decode_rgb_frame(fs, sc, planes, width, height, p->linesize);
+        decode_rgb_frame(fs, sc, &gb, planes, width, height, p->linesize);
     }
     if (fs->ac != AC_GOLOMB_RICE && f->version > 2) {
         int v;
diff --git a/libavcodec/ffv1dec_template.c b/libavcodec/ffv1dec_template.c
index 8e2e38c0b9..e9d3002be9 100644
--- a/libavcodec/ffv1dec_template.c
+++ b/libavcodec/ffv1dec_template.c
@@ -23,8 +23,8 @@ 
 #include "ffv1_template.c"
 
 static av_always_inline int
-RENAME(decode_line)(FFV1Context *s, FFV1SliceContext *sc, int w,
-                    TYPE *sample[2], int plane_index, int bits)
+RENAME(decode_line)(FFV1Context *s, FFV1SliceContext *sc, GetBitContext *gb,
+                    int w, TYPE *sample[2], int plane_index, int bits)
 {
     PlaneContext *const p = &s->plane[plane_index];
     RangeCoder *const c   = &s->c;
@@ -33,7 +33,7 @@  RENAME(decode_line)(FFV1Context *s, FFV1SliceContext *sc, int w,
     int run_mode  = 0;
     int run_index = sc->run_index;
 
-    if (is_input_end(s))
+    if (is_input_end(s, gb))
         return AVERROR_INVALIDDATA;
 
     if (s->slice_coding_mode == 1) {
@@ -53,7 +53,7 @@  RENAME(decode_line)(FFV1Context *s, FFV1SliceContext *sc, int w,
         int diff, context, sign;
 
         if (!(x & 1023)) {
-            if (is_input_end(s))
+            if (is_input_end(s, gb))
                 return AVERROR_INVALIDDATA;
         }
 
@@ -74,13 +74,13 @@  RENAME(decode_line)(FFV1Context *s, FFV1SliceContext *sc, int w,
 
             if (run_mode) {
                 if (run_count == 0 && run_mode == 1) {
-                    if (get_bits1(&s->gb)) {
+                    if (get_bits1(gb)) {
                         run_count = 1 << ff_log2_run[run_index];
                         if (x + run_count <= w)
                             run_index++;
                     } else {
                         if (ff_log2_run[run_index])
-                            run_count = get_bits(&s->gb, ff_log2_run[run_index]);
+                            run_count = get_bits(gb, ff_log2_run[run_index]);
                         else
                             run_count = 0;
                         if (run_index)
@@ -105,17 +105,17 @@  RENAME(decode_line)(FFV1Context *s, FFV1SliceContext *sc, int w,
                 if (run_count < 0) {
                     run_mode  = 0;
                     run_count = 0;
-                    diff      = get_vlc_symbol(&s->gb, &p->vlc_state[context],
+                    diff      = get_vlc_symbol(gb, &p->vlc_state[context],
                                                bits);
                     if (diff >= 0)
                         diff++;
                 } else
                     diff = 0;
             } else
-                diff = get_vlc_symbol(&s->gb, &p->vlc_state[context], bits);
+                diff = get_vlc_symbol(gb, &p->vlc_state[context], bits);
 
             ff_dlog(s->avctx, "count:%d index:%d, mode:%d, x:%d pos:%d\n",
-                    run_count, run_index, run_mode, x, get_bits_count(&s->gb));
+                    run_count, run_index, run_mode, x, get_bits_count(gb));
         }
 
         if (sign)
@@ -128,6 +128,7 @@  RENAME(decode_line)(FFV1Context *s, FFV1SliceContext *sc, int w,
 }
 
 static int RENAME(decode_rgb_frame)(FFV1Context *s, FFV1SliceContext *sc,
+                                    GetBitContext *gb,
                                     uint8_t *src[4], int w, int h, int stride[4])
 {
     int x, y, p;
@@ -157,9 +158,9 @@  static int RENAME(decode_rgb_frame)(FFV1Context *s, FFV1SliceContext *sc,
             sample[p][1][-1]= sample[p][0][0  ];
             sample[p][0][ w]= sample[p][0][w-1];
             if (lbd && s->slice_coding_mode == 0)
-                ret = RENAME(decode_line)(s, sc, w, sample[p], (p + 1)/2, 9);
+                ret = RENAME(decode_line)(s, sc, gb, w, sample[p], (p + 1)/2, 9);
             else
-                ret = RENAME(decode_line)(s, sc, w, sample[p], (p + 1)/2, bits + (s->slice_coding_mode != 1));
+                ret = RENAME(decode_line)(s, sc, gb, w, sample[p], (p + 1)/2, bits + (s->slice_coding_mode != 1));
             if (ret < 0)
                 return ret;
         }