diff mbox series

[FFmpeg-devel,5/5] libavcodec/ffv1: Support storing LSB raw

Message ID 20241016132639.1958007-5-michael@niedermayer.cc
State New
Headers show
Series [FFmpeg-devel,1/5] avcodec/rangecoder: only perform renorm check/loop for callers that need it | expand

Commit Message

Michael Niedermayer Oct. 16, 2024, 1:26 p.m. UTC
This makes a 16bit RGB raw sample 25% faster at a 2% loss of compression with rawlsb=4

Please test and comment

This stores the LSB through non binary range coding, this is simpler than using a
separate coder
For cases where range coding is not wanted its probably best to use golomb rice
for everything.

We also pass the LSB through the decorrelation and context stages (which is basically free)
this leads to slightly better compression than separating them earlier.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
---
 libavcodec/ffv1.h             |  2 ++
 libavcodec/ffv1_template.c    | 19 ++++++++++---------
 libavcodec/ffv1dec.c          |  2 ++
 libavcodec/ffv1dec_template.c | 16 +++++++++++++---
 libavcodec/ffv1enc.c          | 15 ++++++++++++++-
 libavcodec/ffv1enc_template.c | 17 +++++++++++++++--
 6 files changed, 56 insertions(+), 15 deletions(-)

Comments

Michael Niedermayer Oct. 16, 2024, 1:54 p.m. UTC | #1
On Wed, Oct 16, 2024 at 03:26:39PM +0200, Michael Niedermayer wrote:
> This makes a 16bit RGB raw sample 25% faster at a 2% loss of compression with rawlsb=4
> 
> Please test and comment
> 
> This stores the LSB through non binary range coding, this is simpler than using a
> separate coder
> For cases where range coding is not wanted its probably best to use golomb rice
> for everything.
> 
> We also pass the LSB through the decorrelation and context stages (which is basically free)
> this leads to slightly better compression than separating them earlier.
> 
> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> ---
>  libavcodec/ffv1.h             |  2 ++
>  libavcodec/ffv1_template.c    | 19 ++++++++++---------
>  libavcodec/ffv1dec.c          |  2 ++
>  libavcodec/ffv1dec_template.c | 16 +++++++++++++---
>  libavcodec/ffv1enc.c          | 15 ++++++++++++++-
>  libavcodec/ffv1enc_template.c | 17 +++++++++++++++--
>  6 files changed, 56 insertions(+), 15 deletions(-)

3rd implemantation :)
you might ask why i implement this 4?! times
Heres why: (tests done with 4 rawlsb bits, 16bit per sample input)

The original (no decompression supported, bits riped out before context model and decorrelation, seperate buffer)
-rw-r----- 1 michael michael   91403202 Oct 15 22:48 film4st.nut

I dont remember
-rw-r----- 1 michael michael   91253765 Oct 15 23:14 film4st-new.nut

bits extracted after decorrelation and context model, interleaved range coder, simple but not efficient rangecoder implemantation
-rw-r----- 1 michael michael   91080109 Oct 16 00:14 film4st-new2.nut

This current, bits extracted after decorrelation and context model, simplest and most efficient range coder implemantation
-rw-r----- 1 michael michael   90996813 Oct 16 15:14 film4st-new3.nut
same but with quantization table 1
-rw-r----- 1 michael michael   89883371 Oct 16 15:50 film4st-new3q1.nut

Heres the reference without rawlsb:
-rw-r----- 1 michael michael   88090676 Oct 15 22:49 film0st.nut
-rw-r----- 1 michael michael   88168254 Oct 16 15:50 film0st-q1.nut

So this implemantation so far performs best and is also very simple

thx

[...]
Jerome Martinez Oct. 16, 2024, 4:27 p.m. UTC | #2
Le 16/10/2024 à 15:54, Michael Niedermayer a écrit :
> 3rd implemantation :)
> you might ask why i implement this 4?! times
> Heres why: (tests done with 4 rawlsb bits, 16bit per sample input)

I tested on my side also including the speed as the goal is to avoid the 
speed cost of the LSB, with 6K content from scanner (analog input, last 
bits are more or less random).
Speed     Compr
0,037x    0,471    No patch
0,051x    0,491    bitfield
0,046x    0,489    rangecoder

the 25% gain in the speed is clearly visible (actually it 27%  in my 
tests) with the bitfield version of storing LSB, but it is a lot less 
visible with the rangecoder version (the one from today):
There is a small 0.5% gain in size at the cost of 9% of speed, it is not 
worth it IMO.

I prefer by far your first version (really storing LSB as raw), it is 
fast as expected (25% less time) and adding the range coder creates 
something not really interesting (20% less time "only" for so little 
gain in size compared to 25%)
Michael Niedermayer Oct. 16, 2024, 7:51 p.m. UTC | #3
Hi Jerome

On Wed, Oct 16, 2024 at 06:27:09PM +0200, Jerome Martinez wrote:
> Le 16/10/2024 à 15:54, Michael Niedermayer a écrit :
> > 3rd implemantation :)
> > you might ask why i implement this 4?! times
> > Heres why: (tests done with 4 rawlsb bits, 16bit per sample input)
> 
> I tested on my side also including the speed as the goal is to avoid the
> speed cost of the LSB, with 6K content from scanner (analog input, last bits
> are more or less random).
> Speed     Compr
> 0,037x    0,471    No patch
> 0,051x    0,491    bitfield
> 0,046x    0,489    rangecoder
> 
> the 25% gain in the speed is clearly visible (actually it 27%  in my tests)
> with the bitfield version of storing LSB, but it is a lot less visible with
> the rangecoder version (the one from today):
> There is a small 0.5% gain in size at the cost of 9% of speed, it is not
> worth it IMO.
>
> I prefer by far your first version (really storing LSB as raw), it is fast
> as expected (25% less time) and adding the range coder creates something not
> really interesting (20% less time "only" for so little gain in size compared
> to 25%)

did you try qtable 1 ? strangely it performed better for the file i used compression wise

also i cleaned the code up a bit and reposted, some of the code was
not exactly optimized

here the old "bitfield"
real	0m5.545s
real	0m5.655s
real	0m5.643s

vs. just now posted:
real	0m5.407s
real	0m5.393s
real	0m5.404s

thx

[...]
Jerome Martinez Oct. 16, 2024, 8:29 p.m. UTC | #4
Le 16/10/2024 à 21:51, Michael Niedermayer a écrit :
> did you try qtable 1 ? strangely it performed better for the file i used compression wise

Updated with latest code (in practice, no change in previous values):
0,037x    0,471    No patch
0,051x    0,491    bitfield
0,046x    0,489    rangecoder
0,046x    0,486    rangecoder qtable=1

qtable=1 helps but still not enough for being interesting compared to 
the bitfield version.

> also i cleaned the code up a bit and reposted, some of the code was
> not exactly optimized
>
> here the old "bitfield"
> real	0m5.545s
> real	0m5.655s
> real	0m5.643s
>
> vs. just now posted:
> real	0m5.407s
> real	0m5.393s
> real	0m5.404s

Even more interesting to keep the bitfield version rather than the range 
coder!

FYI I test on 2 different sets of 2 seconds of real content, on SSD, 1 
min of encoding each so result is stable, unfortunately I don't have 
more content but I try to have more.
Michael Niedermayer Oct. 16, 2024, 8:53 p.m. UTC | #5
Hi Jerome

On Wed, Oct 16, 2024 at 10:29:18PM +0200, Jerome Martinez wrote:
> Le 16/10/2024 à 21:51, Michael Niedermayer a écrit :
> > did you try qtable 1 ? strangely it performed better for the file i used compression wise
> 
> Updated with latest code (in practice, no change in previous values):
> 0,037x    0,471    No patch
> 0,051x    0,491    bitfield
> 0,046x    0,489    rangecoder
> 0,046x    0,486    rangecoder qtable=1
> 
> qtable=1 helps but still not enough for being interesting compared to the
> bitfield version.
> 
> > also i cleaned the code up a bit and reposted, some of the code was
> > not exactly optimized
> > 
> > here the old "bitfield"
> > real	0m5.545s
> > real	0m5.655s
> > real	0m5.643s
> > 
> > vs. just now posted:
> > real	0m5.407s
> > real	0m5.393s
> > real	0m5.404s
> 
> Even more interesting to keep the bitfield version rather than the range
> coder!

what are you testing?
the new code is faster than the old code.
There is something not right here, the range coder based implementation
i posted now is 5% faster then the range coder based one i posted earlier today
thats overall speed meassured with "time ./ffmpeg"
if you see no difference there is something fishy

i simply tested this:
./ffmpeg -i rawsamples/16/01.dpx -threads 1  -c:v ffv1 -context 1 -coder 1 -strict -2 -level 4 -rawlsb 4   -y /tmp/speedtest4.nut

It uses 1 thread as the speed with more threads was very unstable between runs
and we want to know how fast ffv1 is not how multithreading behaves


> 
> FYI I test on 2 different sets of 2 seconds of real content, on SSD, 1 min
> of encoding each so result is stable, unfortunately I don't have more
> content but I try to have more.

can this content be downloaded somewhere ?

thx

[...]
Michael Niedermayer Oct. 16, 2024, 9:25 p.m. UTC | #6
On Wed, Oct 16, 2024 at 10:53:37PM +0200, Michael Niedermayer wrote:
> Hi Jerome
> 
> On Wed, Oct 16, 2024 at 10:29:18PM +0200, Jerome Martinez wrote:
> > Le 16/10/2024 à 21:51, Michael Niedermayer a écrit :
> > > did you try qtable 1 ? strangely it performed better for the file i used compression wise
> > 
> > Updated with latest code (in practice, no change in previous values):
> > 0,037x    0,471    No patch
> > 0,051x    0,491    bitfield
> > 0,046x    0,489    rangecoder
> > 0,046x    0,486    rangecoder qtable=1
> > 
> > qtable=1 helps but still not enough for being interesting compared to the
> > bitfield version.
> > 
> > > also i cleaned the code up a bit and reposted, some of the code was
> > > not exactly optimized
> > > 
> > > here the old "bitfield"
> > > real	0m5.545s
> > > real	0m5.655s
> > > real	0m5.643s
> > > 
> > > vs. just now posted:
> > > real	0m5.407s
> > > real	0m5.393s
> > > real	0m5.404s
> > 
> > Even more interesting to keep the bitfield version rather than the range
> > coder!
> 
> what are you testing?
> the new code is faster than the old code.
> There is something not right here, the range coder based implementation
> i posted now is 5% faster then the range coder based one i posted earlier today
> thats overall speed meassured with "time ./ffmpeg"
> if you see no difference there is something fishy
> 
> i simply tested this:
> ./ffmpeg -i rawsamples/16/01.dpx -threads 1  -c:v ffv1 -context 1 -coder 1 -strict -2 -level 4 -rawlsb 4   -y /tmp/speedtest4.nut

compiler was this:  gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.2)

thx

[...]
diff mbox series

Patch

diff --git a/libavcodec/ffv1.h b/libavcodec/ffv1.h
index 4f5a8ab2be7..02bfc33f680 100644
--- a/libavcodec/ffv1.h
+++ b/libavcodec/ffv1.h
@@ -83,6 +83,7 @@  typedef struct FFV1SliceContext {
     int slice_coding_mode;
     int slice_rct_by_coef;
     int slice_rct_ry_coef;
+    int rawlsb;
 
     // RefStruct reference, array of MAX_PLANES elements
     PlaneContext *plane;
@@ -139,6 +140,7 @@  typedef struct FFV1Context {
     int key_frame_ok;
     int context_model;
     int qtable;
+    int rawlsb;
 
     int bits_per_raw_sample;
     int packed_at_lsb;
diff --git a/libavcodec/ffv1_template.c b/libavcodec/ffv1_template.c
index abb90a12e49..10206702ee8 100644
--- a/libavcodec/ffv1_template.c
+++ b/libavcodec/ffv1_template.c
@@ -30,24 +30,25 @@  static inline int RENAME(predict)(TYPE *src, TYPE *last)
 }
 
 static inline int RENAME(get_context)(const int16_t quant_table[MAX_CONTEXT_INPUTS][MAX_QUANT_TABLE_SIZE],
-                                      TYPE *src, TYPE *last, TYPE *last2)
+                                      TYPE *src, TYPE *last, TYPE *last2, int rawlsb)
 {
     const int LT = last[-1];
     const int T  = last[0];
     const int RT = last[1];
     const int L  = src[-1];
+    const int rawoff = (1<<rawlsb) >> 1;
 
     if (quant_table[3][127] || quant_table[4][127]) {
         const int TT = last2[0];
         const int LL = src[-2];
-        return quant_table[0][(L - LT) & MAX_QUANT_TABLE_MASK] +
-               quant_table[1][(LT - T) & MAX_QUANT_TABLE_MASK] +
-               quant_table[2][(T - RT) & MAX_QUANT_TABLE_MASK] +
-               quant_table[3][(LL - L) & MAX_QUANT_TABLE_MASK] +
-               quant_table[4][(TT - T) & MAX_QUANT_TABLE_MASK];
+        return quant_table[0][(L - LT + rawoff >> rawlsb) & MAX_QUANT_TABLE_MASK] +
+               quant_table[1][(LT - T + rawoff >> rawlsb) & MAX_QUANT_TABLE_MASK] +
+               quant_table[2][(T - RT + rawoff >> rawlsb) & MAX_QUANT_TABLE_MASK] +
+               quant_table[3][(LL - L + rawoff >> rawlsb) & MAX_QUANT_TABLE_MASK] +
+               quant_table[4][(TT - T + rawoff >> rawlsb) & MAX_QUANT_TABLE_MASK];
     } else
-        return quant_table[0][(L - LT) & MAX_QUANT_TABLE_MASK] +
-               quant_table[1][(LT - T) & MAX_QUANT_TABLE_MASK] +
-               quant_table[2][(T - RT) & MAX_QUANT_TABLE_MASK];
+        return quant_table[0][(L - LT + rawoff >> rawlsb) & MAX_QUANT_TABLE_MASK] +
+               quant_table[1][(LT - T + rawoff >> rawlsb) & MAX_QUANT_TABLE_MASK] +
+               quant_table[2][(T - RT + rawoff >> rawlsb) & MAX_QUANT_TABLE_MASK];
 }
 
diff --git a/libavcodec/ffv1dec.c b/libavcodec/ffv1dec.c
index 5c099e49ad4..fc96bfb4cea 100644
--- a/libavcodec/ffv1dec.c
+++ b/libavcodec/ffv1dec.c
@@ -249,6 +249,8 @@  static int decode_slice_header(const FFV1Context *f,
                 return AVERROR_INVALIDDATA;
             }
         }
+        if (f->micro_version > 2)
+            sc->rawlsb = get_symbol(c, state, 0);
     }
 
     return 0;
diff --git a/libavcodec/ffv1dec_template.c b/libavcodec/ffv1dec_template.c
index 2da6bd935dc..dbdcad7768e 100644
--- a/libavcodec/ffv1dec_template.c
+++ b/libavcodec/ffv1dec_template.c
@@ -60,8 +60,13 @@  RENAME(decode_line)(FFV1Context *f, FFV1SliceContext *sc,
                 return AVERROR_INVALIDDATA;
         }
 
-        context = RENAME(get_context)(quant_table,
-                                      sample[1] + x, sample[0] + x, sample[1] + x);
+        if (sc->rawlsb) {
+            context = RENAME(get_context)(quant_table,
+                                          sample[1] + x, sample[0] + x, sample[1] + x, sc->rawlsb);
+        } else {
+            context = RENAME(get_context)(quant_table,
+                                          sample[1] + x, sample[0] + x, sample[1] + x, 0);
+        }
         if (context < 0) {
             context = -context;
             sign    = 1;
@@ -71,7 +76,12 @@  RENAME(decode_line)(FFV1Context *f, FFV1SliceContext *sc,
         av_assert2(context < p->context_count);
 
         if (ac != AC_GOLOMB_RICE) {
-            diff = get_symbol_inline(c, p->state[context], 1);
+            if (sc->rawlsb) {
+                const int rawoff = (1<<sc->rawlsb) >> 1;
+                diff = get_rac_raw(c, sc->rawlsb);
+                diff += (get_symbol_inline(c, p->state[context], 1) << sc->rawlsb) - rawoff;
+            } else
+                diff = get_symbol_inline(c, p->state[context], 1);
         } else {
             if (context == 0 && run_mode == 0)
                 run_mode = 1;
diff --git a/libavcodec/ffv1enc.c b/libavcodec/ffv1enc.c
index 0dbfebc1a1a..0548daf8c47 100644
--- a/libavcodec/ffv1enc.c
+++ b/libavcodec/ffv1enc.c
@@ -416,7 +416,7 @@  static int write_extradata(FFV1Context *f)
         if (f->version == 3) {
             f->micro_version = 4;
         } else if (f->version == 4)
-            f->micro_version = 2;
+            f->micro_version = 3;
         put_symbol(&c, state, f->micro_version, 0);
     }
 
@@ -564,6 +564,9 @@  static av_cold int encode_init(AVCodecContext *avctx)
     if (s->ec == 2)
         s->version = FFMAX(s->version, 4);
 
+    if (s->rawlsb)
+        s->version = FFMAX(s->version, 4);
+
     if ((s->version == 2 || s->version>3) && avctx->strict_std_compliance > FF_COMPLIANCE_EXPERIMENTAL) {
         av_log(avctx, AV_LOG_ERROR, "Version 2 or 4 needed for requested features but version 2 or 4 is experimental and not enabled\n");
         return AVERROR_INVALIDDATA;
@@ -716,6 +719,11 @@  static av_cold int encode_init(AVCodecContext *avctx)
         }
     }
 
+    if (s->rawlsb > s->bits_per_raw_sample) {
+        av_log(avctx, AV_LOG_ERROR, "too many raw lsb\n");
+        return AVERROR(EINVAL);
+    }
+
     if (s->ac == AC_RANGE_CUSTOM_TAB) {
         for (i = 1; i < 256; i++)
             s->state_transition[i] = ver2_state[i];
@@ -958,6 +966,7 @@  static void encode_slice_header(FFV1Context *f, FFV1SliceContext *sc)
             put_symbol(c, state, sc->slice_rct_by_coef, 0);
             put_symbol(c, state, sc->slice_rct_ry_coef, 0);
         }
+        put_symbol(c, state, sc->rawlsb, 0);
     }
 }
 
@@ -1077,6 +1086,8 @@  static int encode_slice(AVCodecContext *c, void *arg)
         sc->slice_rct_ry_coef = 1;
     }
 
+    sc->rawlsb = f->rawlsb; // we do not optimize this per slice, but other encoders could
+
 retry:
     if (f->key_frame)
         ff_ffv1_clear_slice_state(f, sc);
@@ -1291,6 +1302,8 @@  static const AVOption options[] = {
             { .i64 = 0 }, 0, 1, VE },
     { "qtable", "Quantization table", OFFSET(qtable), AV_OPT_TYPE_INT,
             { .i64 = -1 }, -1, 2, VE },
+    { "rawlsb", "number of LSBs stored RAW", OFFSET(rawlsb), AV_OPT_TYPE_INT,
+            { .i64 = 0 }, 0, 8, VE },
 
     { NULL }
 };
diff --git a/libavcodec/ffv1enc_template.c b/libavcodec/ffv1enc_template.c
index bc14926ab95..848328c70af 100644
--- a/libavcodec/ffv1enc_template.c
+++ b/libavcodec/ffv1enc_template.c
@@ -62,8 +62,14 @@  RENAME(encode_line)(FFV1Context *f, FFV1SliceContext *sc,
     for (x = 0; x < w; x++) {
         int diff, context;
 
-        context = RENAME(get_context)(f->quant_tables[p->quant_table_index],
-                                      sample[0] + x, sample[1] + x, sample[2] + x);
+        if (f->rawlsb) {
+            context = RENAME(get_context)(f->quant_tables[p->quant_table_index],
+                                        sample[0] + x, sample[1] + x, sample[2] + x, f->rawlsb);
+        } else {
+            //try to force a version with rawlsb optimized out
+            context = RENAME(get_context)(f->quant_tables[p->quant_table_index],
+                                        sample[0] + x, sample[1] + x, sample[2] + x, 0);
+        }
         diff    = sample[0][x] - RENAME(predict)(sample[0] + x, sample[1] + x);
 
         if (context < 0) {
@@ -74,6 +80,13 @@  RENAME(encode_line)(FFV1Context *f, FFV1SliceContext *sc,
         diff = fold(diff, bits);
 
         if (ac != AC_GOLOMB_RICE) {
+            if (f->rawlsb) {
+                const int rawoff = (1<<f->rawlsb) >> 1;
+                const unsigned mask = (1<<f->rawlsb) - 1;
+                diff += rawoff;
+                put_rac_raw(c, (diff & mask), f->rawlsb);
+                diff = diff >> f->rawlsb; // Note, this will be biased on small rawlsb
+            }
             if (pass1) {
                 put_symbol_inline(c, p->state[context], diff, 1, sc->rc_stat,
                                   sc->rc_stat2[p->quant_table_index][context]);