[FFmpeg-devel,1/2] libavcodec/zmbvenc: block scoring improvements/bug fixes

Message ID 20190209131021.9959-1-matthew.w.fearnley@gmail.com
State Accepted
Commit 2d80b56ce0d4bf545aeecfcc3b71f2bb2aeb3c9e

Commit Message

matthew.w.fearnley@gmail.com Feb. 9, 2019, 1:10 p.m. UTC
- Improve block choices by counting 0-bytes in the entropy score
- Make histogram use uint16_t type, to allow byte counts from 16*16
(current block size) up to 255*255 (maximum allowed 8bpp block size)
- Make sure score table is big enough for a full block's worth of bytes
- Calculate *xored without using code in inner loop
---
 libavcodec/zmbvenc.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

Comments

Tomas Härdin Feb. 10, 2019, 3:42 p.m. UTC | #1
On Sat, 2019-02-09 at 13:10 +0000, Matthew Fearnley wrote:
> - Improve block choices by counting 0-bytes in the entropy score
> - Make histogram use uint16_t type, to allow byte counts from 16*16
> (current block size) up to 255*255 (maximum allowed 8bpp block size)
> - Make sure score table is big enough for a full block's worth of bytes
> - Calculate *xored without using code in inner loop
> ---
>  libavcodec/zmbvenc.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)

Passes FATE, looks good to me

/Tomas
Michael Niedermayer Feb. 20, 2019, 9:12 p.m. UTC | #2
On Sat, Feb 09, 2019 at 01:10:20PM +0000, Matthew Fearnley wrote:
> - Improve block choices by counting 0-bytes in the entropy score
> - Make histogram use uint16_t type, to allow byte counts from 16*16
> (current block size) up to 255*255 (maximum allowed 8bpp block size)
> - Make sure score table is big enough for a full block's worth of bytes
> - Calculate *xored without using code in inner loop

This should have been split into multiple changes.

Compression seems to become slightly worse from this change:

./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y test-old.avi
./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y test-new.avi

-rw-r----- 1 michael michael 1175466 Feb 20 22:06 test-new.avi
-rw-r----- 1 michael michael 1174832 Feb 20 22:07 test-old.avi



[...]
Carl Eugen Hoyos Feb. 20, 2019, 10:33 p.m. UTC | #3
2019-02-10 16:42 GMT+01:00, Tomas Härdin <tjoppen@acc.umu.se>:
> lör 2019-02-09 klockan 13:10 +0000 skrev Matthew Fearnley:
>> - Improve block choices by counting 0-bytes in the entropy score
>> - Make histogram use uint16_t type, to allow byte counts from 16*16
>> (current block size) up to 255*255 (maximum allowed 8bpp block size)
>> - Make sure score table is big enough for a full block's worth of bytes
>> - Calculate *xored without using code in inner loop
>> ---
>>  libavcodec/zmbvenc.c | 22 ++++++++++++++++------
>>  1 file changed, 16 insertions(+), 6 deletions(-)
>
> Passes FATE, looks good to me

I believe you asked on IRC about FATE tests:
Since such a test would depend on the zlib version, you cannot test the
exact output; you could only test a round-trip (assuming the codec
really is lossless).

Carl Eugen
Tomas Härdin Feb. 21, 2019, 10:19 a.m. UTC | #4
On Wed, 2019-02-20 at 22:12 +0100, Michael Niedermayer wrote:
> On Sat, Feb 09, 2019 at 01:10:20PM +0000, Matthew Fearnley wrote:
> > - Improve block choices by counting 0-bytes in the entropy score
> > - Make histogram use uint16_t type, to allow byte counts from 16*16
> > (current block size) up to 255*255 (maximum allowed 8bpp block size)
> > - Make sure score table is big enough for a full block's worth of bytes
> > - Calculate *xored without using code in inner loop
> 
> This should have been split into multiple changes

Alas

> compression seems to become slightly worse from this change
> 
> ./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y test-old.avi
> ./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y test-new.avi
> 
> -rw-r----- 1 michael michael 1175466 Feb 20 22:06 test-new.avi
> -rw-r----- 1 michael michael 1174832 Feb 20 22:07 test-old.avi

A whopping 0.05% change, with an MPEG source rather than the intended
PAL8..

The flip side here is that we can now use larger block sizes and higher
me_range without crashing the encoder. We can add a slower mode that
tests a few different block sizes to see which one results in the
smallest output.

/Tomas
Tomas Härdin Feb. 21, 2019, 10:26 a.m. UTC | #5
On Wed, 2019-02-20 at 23:33 +0100, Carl Eugen Hoyos wrote:
> > 2019-02-10 16:42 GMT+01:00, Tomas Härdin <tjoppen@acc.umu.se>:
> > lör 2019-02-09 klockan 13:10 +0000 skrev Matthew Fearnley:
> > > - Improve block choices by counting 0-bytes in the entropy score
> > > - Make histogram use uint16_t type, to allow byte counts from 16*16
> > > (current block size) up to 255*255 (maximum allowed 8bpp block size)
> > > - Make sure score table is big enough for a full block's worth of bytes
> > > - Calculate *xored without using code in inner loop
> > > ---
> > >  libavcodec/zmbvenc.c | 22 ++++++++++++++++------
> > >  1 file changed, 16 insertions(+), 6 deletions(-)
> > 
> > Passes FATE, looks good to me
> 
> I believe you asked on irc about fate tests:
> Since such a test would depend on the zlib version, you cannot test
> exact output, you could only test a round-trip (assuming the codec
> really is lossless).

Right, we'd need to embed a specific version of zlib. But we probably
don't want to do that, for many reasons.

A dummy DEFLATE implementation that just passes the bytes along could
be useful. I know there's a mode in the DEFLATE format for blocks that
fail to compress. That way we could test many encoders that depend on
zlib, and their output should decode just fine.

/Tomas
Tomas Härdin Feb. 21, 2019, 11:55 a.m. UTC | #6
On Thu, 2019-02-21 at 11:26 +0100, Tomas Härdin wrote:
> ons 2019-02-20 klockan 23:33 +0100 skrev Carl Eugen Hoyos:
> > > > > > 2019-02-10 16:42 GMT+01:00, Tomas Härdin <tjoppen@acc.umu.se>:
> > > lör 2019-02-09 klockan 13:10 +0000 skrev Matthew Fearnley:
> > > > - Improve block choices by counting 0-bytes in the entropy score
> > > > - Make histogram use uint16_t type, to allow byte counts from 16*16
> > > > (current block size) up to 255*255 (maximum allowed 8bpp block size)
> > > > - Make sure score table is big enough for a full block's worth of bytes
> > > > - Calculate *xored without using code in inner loop
> > > > ---
> > > >  libavcodec/zmbvenc.c | 22 ++++++++++++++++------
> > > >  1 file changed, 16 insertions(+), 6 deletions(-)
> > > 
> > > Passes FATE, looks good to me
> > 
> > I believe you asked on irc about fate tests:
> > Since such a test would depend on the zlib version, you cannot test
> > exact output, you could only test a round-trip (assuming the codec
> > really is lossless).
> 
> Right, we'd need to embed a specific version of zlib. But we probably
> don't want to do that, for many reasons.
> 
> A dummy DEFLATE implementation that just passes the bytes along could
> be useful. I know there's a mode in the DEFLATE format for blocks that
> fail to compress. That way we could test many encoders that depend on
> zlib, and their output should decode just fine.

...or, upon further inspection, just -compression_level 0. Not likely to
change between zlib versions, but of course not guaranteed not to.
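
A quick way to sanity-check that idea (an illustrative standalone snippet
using plain zlib, nothing from the tree):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* At level 0 zlib only emits stored (uncompressed) blocks, so the output is
 * the input plus a small fixed overhead, and should not depend on any
 * implementation-specific match heuristics. */
int main(void)
{
    unsigned char in[4096], out[8192];
    uLongf outlen = sizeof(out);

    memset(in, 0xAB, sizeof(in));
    if (compress2(out, &outlen, in, sizeof(in), 0) != Z_OK) /* level 0 */
        return 1;
    printf("%zu bytes in, %lu bytes out\n", sizeof(in), outlen);
    return 0;
}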

/Tomas
matthew.w.fearnley@gmail.com Feb. 21, 2019, 2:08 p.m. UTC | #7
On Thu, 21 Feb 2019 at 10:26, Tomas Härdin <tjoppen@acc.umu.se> wrote:

> ons 2019-02-20 klockan 23:33 +0100 skrev Carl Eugen Hoyos:
> > > 2019-02-10 16:42 GMT+01:00, Tomas Härdin <tjoppen@acc.umu.se>:
> > > lör 2019-02-09 klockan 13:10 +0000 skrev Matthew Fearnley:
> > > > - Improve block choices by counting 0-bytes in the entropy score
> > > > - Make histogram use uint16_t type, to allow byte counts from 16*16
> > > > (current block size) up to 255*255 (maximum allowed 8bpp block size)
> > > > - Make sure score table is big enough for a full block's worth of
> bytes
> > > > - Calculate *xored without using code in inner loop
> > > > ---
> > > >  libavcodec/zmbvenc.c | 22 ++++++++++++++++------
> > > >  1 file changed, 16 insertions(+), 6 deletions(-)
> > >
> > > Passes FATE, looks good to me
> >
> > I believe you asked on irc about fate tests:
> > Since such a test would depend on the zlib version, you cannot test
> > exact output, you could only test a round-trip (assuming the codec
> > really is lossless).
>
> Right, we'd need to embed a specific version of zlib. But we probably
> don't want to do that, for many reasons.
>
> A dummy DEFLATE implementation that just passes the bytes along could
> be useful. I know there's a mode in the DEFLATE format for blocks that
> fail to compress. That way we could test many encoders that depend on
> zlib, and their output should decode just fine.
>

I know the DEFLATE format fairly well, and a dummy DEFLATE implementation
would certainly be possible.

An uncompressed block can be up to 65535 bytes, and has a simple five-byte
header (subject to some bit alignment issues, but they only come into play
when other block types are also used).

Here's a simple code example, the guts of which fit into <70 lines:
https://gist.github.com/countingpine/602453d330ea055a4bcab90ca2a7ed02

It allows data to be compressed in chunks, so a stream can be split
across frames, like in ZMBV.
The first chunk needs the START flag, and the last chunk (if any) needs the
FINISH flag.

It also allows adding an Adler32 checksum - although ZMBV itself doesn't
need this, because it never "finishes" the stream.
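
For reference, the five-byte header described above is simple enough to
sketch directly (an illustrative helper of my own, not the code from the
gist):

#include <stdint.h>
#include <string.h>

/* Emit one DEFLATE "stored" block: at most 65535 bytes of literal data.
 * Assuming the writer is byte-aligned, the header is one byte carrying
 * BFINAL (bit 0) and BTYPE=00, then LEN and its one's complement NLEN,
 * both little-endian.  Returns the number of bytes written to dst. */
static size_t put_stored_block(uint8_t *dst, const uint8_t *src,
                               uint16_t len, int final)
{
    dst[0] = final ? 1 : 0;     /* BFINAL; BTYPE=00 means "no compression" */
    dst[1] = len & 0xFF;        /* LEN, low byte first */
    dst[2] = len >> 8;
    dst[3] = ~dst[1];           /* NLEN = ~LEN */
    dst[4] = ~dst[2];
    memcpy(dst + 5, src, len);
    return 5 + (size_t)len;
}

Longer buffers would be split into a sequence of such blocks with 'final'
set only on the last one, and a zlib-style wrapper would additionally need
the two-byte zlib header up front (plus the Adler-32 trailer if the stream
is ever finished).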

The question would still remain, though, of the best way to incorporate it
into FFmpeg.  That's probably beyond my current knowledge.

Matthew


Carl Eugen Hoyos Feb. 21, 2019, 2:25 p.m. UTC | #8
2019-02-21 15:08 GMT+01:00, Matthew Fearnley <matthew.w.fearnley@gmail.com>:

> The question would still remain though, of the best way to incorporate it
> into FFMEG.  That's probably beyond my current knowledge..

It belongs in libavutil; see also the unfinished inflate implementation:
https://github.com/michaelni/FFmpeg/tree/inflate

Carl Eugen
Carl Eugen Hoyos Feb. 21, 2019, 2:26 p.m. UTC | #9
2019-02-21 11:26 GMT+01:00, Tomas Härdin <tjoppen@acc.umu.se>:
> ons 2019-02-20 klockan 23:33 +0100 skrev Carl Eugen Hoyos:
>> > 2019-02-10 16:42 GMT+01:00, Tomas Härdin <tjoppen@acc.umu.se>:
>> > lör 2019-02-09 klockan 13:10 +0000 skrev Matthew Fearnley:
>> > > - Improve block choices by counting 0-bytes in the entropy score
>> > > - Make histogram use uint16_t type, to allow byte counts from 16*16
>> > > (current block size) up to 255*255 (maximum allowed 8bpp block size)
>> > > - Make sure score table is big enough for a full block's worth of
>> > > bytes
>> > > - Calculate *xored without using code in inner loop
>> > > ---
>> > >  libavcodec/zmbvenc.c | 22 ++++++++++++++++------
>> > >  1 file changed, 16 insertions(+), 6 deletions(-)
>> >
>> > Passes FATE, looks good to me
>>
>> I believe you asked on irc about fate tests:
>> Since such a test would depend on the zlib version, you cannot test
>> exact output, you could only test a round-trip (assuming the codec
>> really is lossless).
>
> Right, we'd need to embed a specific version of zlib. But we probably
> don't want to do that, for many reasons.
>
> A dummy DEFLATE implementation that just passes the bytes along could
> be useful. I know there's a mode in the DEFLATE format for blocks that
> fail to compress. That way we could test many encoders that depend on
> zlib, and their output should decode just fine.

What's wrong with a round-trip test?

Carl Eugen
Michael Niedermayer Feb. 21, 2019, 7:28 p.m. UTC | #10
On Thu, Feb 21, 2019 at 11:19:08AM +0100, Tomas Härdin wrote:
> ons 2019-02-20 klockan 22:12 +0100 skrev Michael Niedermayer:
> > On Sat, Feb 09, 2019 at 01:10:20PM +0000, Matthew Fearnley wrote:
> > > - Improve block choices by counting 0-bytes in the entropy score
> > > - Make histogram use uint16_t type, to allow byte counts from 16*16
> > > (current block size) up to 255*255 (maximum allowed 8bpp block size)
> > > - Make sure score table is big enough for a full block's worth of bytes
> > > - Calculate *xored without using code in inner loop
> > 
> > This should have been split into multiple changes
> 
> Alas
> 

> > compression seems to become slightly worse from this change
> > 
> > ./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y test-old.avi
> > ./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y test-new.avi
> > 
> > -rw-r----- 1 michael michael 1175466 Feb 20 22:06 test-new.avi
> > -rw-r----- 1 michael michael 1174832 Feb 20 22:07 test-old.avi
> 
> A whopping 0.05% change,

No, it's just a tiny difference, but I think it still would make sense to
understand it.
Generally a patch doing 4 improvements would be expected to improve things.
This was not a hand-picked case. I mean, it's not as if I saw 10 cases
improving things and picked the 11th; I looked at one case and that was
worse.

Also, either way the patch should have been split, don't you agree?
It also would make it much easier to identify within seconds which
change actually caused the 0.05% difference. (Maybe avoiding this
discussion, as it might be obvious from that whether this is an
omission in the commit message, a bug, a rare outlier, ...)


> with an MPEG source rather than the intended
> PAL8..

The input to the encoder is always PAL8, so I assume you mean
"old computer game material" vs. "realistic video".


> 
> The flip side here is that we can now use larger block sizes and higher
> me_range without crashing the encoder. We can add a slower mode that
> tests a few different block sizes to see which one results in the
> smallest output.
> 
> /Tomas
matthew.w.fearnley@gmail.com Feb. 21, 2019, 9:26 p.m. UTC | #11
On Thu, 21 Feb 2019 at 19:28, Michael Niedermayer <michael@niedermayer.cc>
wrote:

> On Thu, Feb 21, 2019 at 11:19:08AM +0100, Tomas Härdin wrote:
> > ons 2019-02-20 klockan 22:12 +0100 skrev Michael Niedermayer:
> > > On Sat, Feb 09, 2019 at 01:10:20PM +0000, Matthew Fearnley wrote:
> > > > - Improve block choices by counting 0-bytes in the entropy score
> > > > - Make histogram use uint16_t type, to allow byte counts from 16*16
> > > > (current block size) up to 255*255 (maximum allowed 8bpp block size)
> > > > - Make sure score table is big enough for a full block's worth of
> bytes
> > > > - Calculate *xored without using code in inner loop
> > >
> > > This should have been split into multiple changes
> >
> > Alas
> >
>
> > > compression seems to become slightly worse from this change
> > >
> > > ./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y
> test-old.avi
> > > ./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y
> test-new.avi
> > >
> > > -rw-r----- 1 michael michael 1175466 Feb 20 22:06 test-new.avi
> > > -rw-r----- 1 michael michael 1174832 Feb 20 22:07 test-old.avi
> >
> > A whopping 0.05% change,
>
> no its just a tiny difference but i think it still would make sense to
> understand it.
> Generally a patch doing 4 improvments would be expected to improve things.
> This was not a hand picked case. I mean its not as if I saw 10 cases
> improving things and i picked the 11th. i looked at one case and that was
> worse
>
> Also either way the patch should have been split, dont you agree ?
> it also would make it much easier to within seconds identify which
> change actually caused the 0.05% difference. (maybe avoiding this
> discussion as it maybe would be obvious from that if this is a
> ommision in the commit message a bug a rare outlier ...)
>

Hi Michael.  First, let me say thanks for your feedback on this patch.
Let me address some of this.

- Yes: in retrospect, I think I should have split the patch.
I struggled here to work out how to split up my changes, and spent too long
staring too closely at the code to see this.
I still perhaps have some way to go before I would be able to get a proper
feel for the general FFmpeg development ecosystem.

- To clear up the effects: the change from 'i=1' to 'i=0' was the only
change that should make any difference in the output.
The rest of the changes were to improve the speed of an inner loop, and to
fix two bugs that happily cancelled each other out, but would have broken
if the code were changed to emit larger blocks.  These should not lead to
changes in the blocks chosen or the final compression size.

- Regarding the worse compression: let me start by stating that scoring on
byte frequency alone will not always predict well how Deflate will perform,
since Deflate uses pattern matching as well.
Frequency-based block scoring will only ever be a heuristic.  There may be
scope for improving the scoring, but it would add a lot of complexity, and
I think it would be easy to make things worse rather than better.  (I found
this in my brief experiments with including the frequencies from the
previous block.)

I think the main thing that persuaded me to make this particular change
from 1 to 0 was that it gave a pure entropy-based score, and allowed me,
mathematically, to use the same scoring table on both full and partial
blocks, without having to worry about the effects of omitting the 0-bytes
in the score in each case.
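
To spell the arithmetic out (my restatement, with n_v the count of byte
value v in the xored block and Z = ZMBV_BLOCK * ZMBV_BLOCK):

    \mathrm{score} = \sum_{v=0}^{255} \mathtt{score\_tab}[n_v]
                   = 256 \sum_{v=0}^{255} n_v \log_2 \frac{Z}{n_v},
    \qquad 0 \log_2 0 := 0,

so for a full block, where the n_v sum to Z, the score is exactly 256*Z
times the order-0 empirical entropy of the xored bytes.  Skipping v = 0, as
the old code did, breaks that identity whenever the block contains zero
bytes.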

I focused my testing on the available ZMBV samples, which were likely all
generated by DOSBox, and my findings were that while not all movies were
smaller, overall the gains outweighed the losses.
I'm slightly sad to hear it didn't improve this real-world movie sample.
But I think the difference in this case is acceptable, and not outside the
expected range of size differences that I've seen.  And the improvements to
the motion estimation range (from the other patch) should tend to lead to a
compression improvement overall.

Maybe I should have been clearer, though, that the scoring changes won't
bring universal improvements.  And if someone looks into this and finds that
on average results are worsened by the change, I won't complain if it gets
reverted.

Thanks for your feedback, I appreciate it.
Matthew


> > with an MPEG source rather than the intended
> > PAL8..
>
> The input to the encoder is always PAL8. so i assume you mean
> "old computer game material" vs. "realistic video".
>
>
> >
> > The flip side here is that we can now use larger block sizes and higher
> > me_range without crashing the encoder. We can add a slower mode that
> > tests a few different block sizes to see which one results in the
> > smallest output.
> >
> > /Tomas
Michael Niedermayer Feb. 21, 2019, 11:11 p.m. UTC | #12
On Thu, Feb 21, 2019 at 09:26:44PM +0000, Matthew Fearnley wrote:
> On Thu, 21 Feb 2019 at 19:28, Michael Niedermayer <michael@niedermayer.cc>
> wrote:
> 
> > On Thu, Feb 21, 2019 at 11:19:08AM +0100, Tomas Härdin wrote:
> > > ons 2019-02-20 klockan 22:12 +0100 skrev Michael Niedermayer:
> > > > On Sat, Feb 09, 2019 at 01:10:20PM +0000, Matthew Fearnley wrote:
> > > > > - Improve block choices by counting 0-bytes in the entropy score
> > > > > - Make histogram use uint16_t type, to allow byte counts from 16*16
> > > > > (current block size) up to 255*255 (maximum allowed 8bpp block size)
> > > > > - Make sure score table is big enough for a full block's worth of
> > bytes
> > > > > - Calculate *xored without using code in inner loop
> > > >
> > > > This should have been split into multiple changes
> > >
> > > Alas
> > >
> >
> > > > compression seems to become slightly worse from this change
> > > >
> > > > ./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y
> > test-old.avi
> > > > ./ffmpeg -i matrixbench_mpeg2.mpg -vframes 30 -vcodec zmbv -an -y
> > test-new.avi
> > > >
> > > > -rw-r----- 1 michael michael 1175466 Feb 20 22:06 test-new.avi
> > > > -rw-r----- 1 michael michael 1174832 Feb 20 22:07 test-old.avi
> > >
> > > A whopping 0.05% change,
> >
> > no its just a tiny difference but i think it still would make sense to
> > understand it.
> > Generally a patch doing 4 improvments would be expected to improve things.
> > This was not a hand picked case. I mean its not as if I saw 10 cases
> > improving things and i picked the 11th. i looked at one case and that was
> > worse
> >
> > Also either way the patch should have been split, dont you agree ?
> > it also would make it much easier to within seconds identify which
> > change actually caused the 0.05% difference. (maybe avoiding this
> > discussion as it maybe would be obvious from that if this is a
> > ommision in the commit message a bug a rare outlier ...)
> >
> > Hi Michael.  First, let me say thanks for your feedback on this patch.
> Let me address some of this..
> 

> - Yes: in retrospect, I think I should have split the patch.
> I struggled here to work out how to split up my changes, and spent too long
> staring too closely at the code to see this.
> I still perhaps have some way to go before I would be able to get a proper
> feel for the general FFmpeg development ecosystem.

If something is unclear, ask. Yes, I know this is often not so simple, but
especially in cases like splitting patches, if you don't know how,
it's easy to ask.


> 
> - To clear up the effects: the change from 'i=1' to 'i=0' was the only
> change that should make any difference in the output.
> The rest of the changes were to improve the speed of an inner loop, and to
> fix two bugs that happily cancelled each other out, but would have broken
> if the code were changed to emit larger blocks.  These should not lead to
> changes in the blocks chosen or the final compression size.
> 

> - Regarding the worse compression: let me start by stating that scoring on
> byte frequency alone will not always predict well how Deflate will work,
> since Deflate also uses pattern matching as well.
> Frequency-based block scoring will only ever be a heuristic.  There may be
> scope for improving the scoring, but it would add a lot of complexity, and
> I think it would be easy to make things worse rather than better.  (I found
> this in my brief experiments with including the frequencies from the
> previous block.)

Implementing the exact scoring by using Deflate itself would allow us to
understand how much of a gap there is between the heuristic and what is
achievable. If the gap is significant, then it may be interesting to
look at improving the heuristic; if the gap is small, then the whole
heuristic-improvement path can be discarded.
The obvious improvement for the heuristic would be to move to higher-order
entropy estimation. ATM, IIUC, this uses 0th order; 1st order could be
tried, or any fast compression algorithm could also be used to estimate
entropy.
It also would be interesting to see how a higher-order heuristic or some
simple, fast LZ-like compressor would perform with the previous block
included. That is, if the issue you saw with the previous-block inclusion
was because of the very simplistic entropy estimation.
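
For concreteness, a 1st-order (previous-byte context) estimate could look
roughly like the following; this is purely an illustrative sketch, not
proposed code:

#include <math.h>
#include <stdint.h>
#include <string.h>

/* Order-1 cost estimate in bits for a xored block: each byte is scored
 * against a histogram conditioned on the previous byte, instead of the
 * single global histogram the current order-0 score uses. */
static double order1_entropy_bits(const uint8_t *xored, int bw, int bh,
                                  int stride)
{
    static uint16_t hist[256][256];  /* hist[prev][cur]; sketch, not MT-safe */
    static uint32_t ctx_total[256];  /* samples seen in each context */
    double bits = 0.0;
    int prev = 0;                    /* arbitrary starting context */

    memset(hist, 0, sizeof(hist));
    memset(ctx_total, 0, sizeof(ctx_total));

    for (int j = 0; j < bh; j++) {
        for (int i = 0; i < bw; i++) {
            int cur = xored[j * stride + i];
            hist[prev][cur]++;
            ctx_total[prev]++;
            prev = cur;
        }
    }

    /* sum of -n * log2(n / total) per context; lower means more predictable */
    for (int p = 0; p < 256; p++) {
        if (!ctx_total[p])
            continue;
        for (int c = 0; c < 256; c++)
            if (hist[p][c])
                bits -= hist[p][c] *
                        log2((double)hist[p][c] / ctx_total[p]);
    }
    return bits;
}

Note that a 16x16 block only provides 256 samples for a 256x256 context
table, so per-block statistics at order 1 are extremely sparse.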


> 
> I think the main thing that persuaded me to make this particular change
> from 1 to 0 was that it gave a pure entropy-based score, and allowed me,
> mathematically, to use the same scoring table on both full and partial
> blocks, without having to worry about the effects of omitting the 0-bytes
> in the score in each case.
> 
> I focused my testing on the available ZMBV samples, which were likely all
> generated by DOSBox, and my findings were that while not all movies were
> smaller, overall the gains outweighed the losses.
> I'm slightly sad to hear it didn't improve this real-world movie sample.
> But I think this level in this case is acceptable, and not outside the
> expected range of size difference that I've seen.  And the improvements to
> the motion estimation range (from the other patch) should tend to lead to a
> compression improvement overall.
> 

> Maybe I should have been clearer though, that the scoring changes won't
> bring universal improvements.  And if someone looks into this and finds on
> average results are worsened by the change, I won't complain if it gets
> reverted.

Please be more verbose in future commit messages. For example, this mail
contains a lot of valuable information which would be interesting to
someone looking at the commit.

Also more verbose commit messages can help reviewers review changes.

Thanks

[...]
Tomas Härdin Feb. 22, 2019, 11:54 a.m. UTC | #13
On Thu, 2019-02-21 at 15:26 +0100, Carl Eugen Hoyos wrote:
> > 2019-02-21 11:26 GMT+01:00, Tomas Härdin <tjoppen@acc.umu.se>:
> > ons 2019-02-20 klockan 23:33 +0100 skrev Carl Eugen Hoyos:
> > > > > > > > 2019-02-10 16:42 GMT+01:00, Tomas Härdin <tjoppen@acc.umu.se>:
> > > > lör 2019-02-09 klockan 13:10 +0000 skrev Matthew Fearnley:
> > > > > - Improve block choices by counting 0-bytes in the entropy score
> > > > > - Make histogram use uint16_t type, to allow byte counts from 16*16
> > > > > (current block size) up to 255*255 (maximum allowed 8bpp block size)
> > > > > - Make sure score table is big enough for a full block's worth of
> > > > > bytes
> > > > > - Calculate *xored without using code in inner loop
> > > > > ---
> > > > >  libavcodec/zmbvenc.c | 22 ++++++++++++++++------
> > > > >  1 file changed, 16 insertions(+), 6 deletions(-)
> > > > 
> > > > Passes FATE, looks good to me
> > > 
> > > I believe you asked on irc about fate tests:
> > > Since such a test would depend on the zlib version, you cannot test
> > > exact output, you could only test a round-trip (assuming the codec
> > > really is lossless).
> > 
> > Right, we'd need to embed a specific version of zlib. But we probably
> > don't want to do that, for many reasons.
> > 
> > A dummy DEFLATE implementation that just passes the bytes along could
> > be useful. I know there's a mode in the DEFLATE format for blocks that
> > fail to compress. That way we could test many encoders that depend on
> > zlib, and their output should decode just fine.
> 
> What's wrong with a round-trip test?

Nothing, I suppose. But it might be nice to catch unintended changes to
the encoder's output.

/Tomas
Tomas Härdin Feb. 22, 2019, 12:09 p.m. UTC | #14
On Fri, 2019-02-22 at 00:11 +0100, Michael Niedermayer wrote:
> On Thu, Feb 21, 2019 at 09:26:44PM +0000, Matthew Fearnley wrote:
> > - To clear up the effects: the change from 'i=1' to 'i=0' was the only
> > change that should make any difference in the output.
> > The rest of the changes were to improve the speed of an inner loop, and to
> > fix two bugs that happily cancelled each other out, but would have broken
> > if the code were changed to emit larger blocks.  These should not lead to
> > changes in the blocks chosen or the final compression size.
> > 
> > - Regarding the worse compression: let me start by stating that scoring on
> > byte frequency alone will not always predict well how Deflate will work,
> > since Deflate also uses pattern matching as well.
> > Frequency-based block scoring will only ever be a heuristic.  There may be
> > scope for improving the scoring, but it would add a lot of complexity, and
> > I think it would be easy to make things worse rather than better.  (I found
> > this in my brief experiments with including the frequencies from the
> > previous block.)
> 
> implementing the exact scoring by using Deflate itself would allow us to
> understand how much there is between the heuristic and what is achievable.
> If there is a gap that is significant then it may be interresting to
> look at improving the heuristic if the gap is small than the whole
> heuristic improvment path can be discarded.
> The obvious improvment for the heuristic would be to move to higher
> order entropy estimation. ATM IIUC this uses a 0th order. 1st order could
> be tried or any fast compression algorithm could also be used to estimate
> entropy.

We tried a sliding-window order-0 model, which did worse. An order-1
model might be useful, but would need quite a bit of context for it to
be meaningful.

> It also would be interresting to see how a higher order heuristic or
> some simple fast LZ like compressor would perform with the previous block.
> That is if the issue you saw with the previous block inclusion was because
> of the very simplistic entropy estimation.

zlib at level 1, maybe? You could also have a multi-pass approach where it
first does a coarse pass, then tries to jiggle MVs and greedily accepts
any change that results in smaller output.

/Tomas
Tomas Härdin Feb. 22, 2019, 5:45 p.m. UTC | #15
On Fri, 2019-02-22 at 13:09 +0100, Tomas Härdin wrote:
> fre 2019-02-22 klockan 00:11 +0100 skrev Michael Niedermayer:
> > On Thu, Feb 21, 2019 at 09:26:44PM +0000, Matthew Fearnley wrote:
> > > - To clear up the effects: the change from 'i=1' to 'i=0' was the only
> > > change that should make any difference in the output.
> > > The rest of the changes were to improve the speed of an inner loop, and to
> > > fix two bugs that happily cancelled each other out, but would have broken
> > > if the code were changed to emit larger blocks.  These should not lead to
> > > changes in the blocks chosen or the final compression size.
> > > 
> > > - Regarding the worse compression: let me start by stating that scoring on
> > > byte frequency alone will not always predict well how Deflate will work,
> > > since Deflate also uses pattern matching as well.
> > > Frequency-based block scoring will only ever be a heuristic.  There may be
> > > scope for improving the scoring, but it would add a lot of complexity, and
> > > I think it would be easy to make things worse rather than better.  (I found
> > > this in my brief experiments with including the frequencies from the
> > > previous block.)
> > 
> > implementing the exact scoring by using Deflate itself would allow us to
> > understand how much there is between the heuristic and what is achievable.
> > If there is a gap that is significant then it may be interresting to
> > look at improving the heuristic if the gap is small than the whole
> > heuristic improvment path can be discarded.
> > The obvious improvment for the heuristic would be to move to higher
> > order entropy estimation. ATM IIUC this uses a 0th order. 1st order could
> > be tried or any fast compression algorithm could also be used to estimate
> > entropy.
> 
> We tried a sliding window order-0 model, which did worse. An order-1
> model might be useful, but would need quite a bit of context for it to
> be meaningful

I did some looking into this because I was unsure, and it turns out
evaluating the entropy of an order-1 model involves computing the
eigenvector of the transfer matrix corresponding to the eigenvalue 1,
and computing the scalar product between that and the order-0 entropy
of each column of the transfer matrix. A pricey operation.
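
In symbols (my restatement of the above, assuming a column-stochastic
transfer matrix P, i.e. each column sums to 1):

    P \pi = \pi, \qquad
    H_1 = \sum_j \pi_j H_j, \qquad
    H_j = -\sum_i P_{ij} \log_2 P_{ij},

i.e. \pi is the eigenvector for eigenvalue 1 (the stationary distribution,
normalised so its entries sum to 1) and H_1 is its scalar product with the
per-column order-0 entropies H_j.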

With some luck, computing the entropy change for a small perturbation
(MV change) over the transfer matrix for the entire frame *might* be
cheap enough to be useful. Maybe. The resulting matrix difference has a
particular structure that might be exploitable.

/Tomas
Michael Niedermayer Feb. 22, 2019, 9:53 p.m. UTC | #16
On Fri, Feb 22, 2019 at 01:09:12PM +0100, Tomas Härdin wrote:
> fre 2019-02-22 klockan 00:11 +0100 skrev Michael Niedermayer:
> > On Thu, Feb 21, 2019 at 09:26:44PM +0000, Matthew Fearnley wrote:
> > > - To clear up the effects: the change from 'i=1' to 'i=0' was the only
> > > change that should make any difference in the output.
> > > The rest of the changes were to improve the speed of an inner loop, and to
> > > fix two bugs that happily cancelled each other out, but would have broken
> > > if the code were changed to emit larger blocks.  These should not lead to
> > > changes in the blocks chosen or the final compression size.
> > > 
> > > - Regarding the worse compression: let me start by stating that scoring on
> > > byte frequency alone will not always predict well how Deflate will work,
> > > since Deflate also uses pattern matching as well.
> > > Frequency-based block scoring will only ever be a heuristic.  There may be
> > > scope for improving the scoring, but it would add a lot of complexity, and
> > > I think it would be easy to make things worse rather than better.  (I found
> > > this in my brief experiments with including the frequencies from the
> > > previous block.)
> > 
> > implementing the exact scoring by using Deflate itself would allow us to
> > understand how much there is between the heuristic and what is achievable.
> > If there is a gap that is significant then it may be interresting to
> > look at improving the heuristic if the gap is small than the whole
> > heuristic improvment path can be discarded.
> > The obvious improvment for the heuristic would be to move to higher
> > order entropy estimation. ATM IIUC this uses a 0th order. 1st order could
> > be tried or any fast compression algorithm could also be used to estimate
> > entropy.
> 
> We tried a sliding window order-0 model, which did worse. An order-1
> model might be useful, but would need quite a bit of context for it to
> be meaningful
> 

> > It also would be interresting to see how a higher order heuristic or
> > some simple fast LZ like compressor would perform with the previous block.
> > That is if the issue you saw with the previous block inclusion was because
> > of the very simplistic entropy estimation.
> 
> zlib at level 1 maybe? You could also have a multi-pass thing where it
> first does a coarse pass then tries to jiggle MVs and greedily accept
> any change that results in smaller output.

Yes, but I would avoid using external code like zlib for this. It's not
designed for fast entropy estimation of many blocks.

[...]

Patch

diff --git a/libavcodec/zmbvenc.c b/libavcodec/zmbvenc.c
index 4d9147657d..3df6e724c8 100644
--- a/libavcodec/zmbvenc.c
+++ b/libavcodec/zmbvenc.c
@@ -55,7 +55,7 @@  typedef struct ZmbvEncContext {
     int keyint, curfrm;
     z_stream zstream;
 
-    int score_tab[256];
+    int score_tab[ZMBV_BLOCK * ZMBV_BLOCK + 1];
 } ZmbvEncContext;
 
 
@@ -69,20 +69,26 @@  static inline int block_cmp(ZmbvEncContext *c, uint8_t *src, int stride,
 {
     int sum = 0;
     int i, j;
-    uint8_t histogram[256] = {0};
+    uint16_t histogram[256] = {0};
 
-    *xored = 0;
+    /* Build frequency histogram of byte values for src[] ^ src2[] */
     for(j = 0; j < bh; j++){
         for(i = 0; i < bw; i++){
             int t = src[i] ^ src2[i];
             histogram[t]++;
-            *xored |= t;
         }
         src += stride;
         src2 += stride2;
     }
 
-    for(i = 1; i < 256; i++)
+    /* If not all the xored values were 0, then the blocks are different */
+    *xored = (histogram[0] < bw * bh);
+
+    /* Exit early if blocks are equal */
+    if (!*xored) return 0;
+
+    /* Sum the entropy of all values */
+    for(i = 0; i < 256; i++)
         sum += c->score_tab[histogram[i]];
 
     return sum;
@@ -278,7 +284,11 @@  static av_cold int encode_init(AVCodecContext *avctx)
     int i;
     int lvl = 9;
 
-    for(i=1; i<256; i++)
+    /* Entropy-based score tables for comparing blocks.
+     * Suitable for blocks up to (ZMBV_BLOCK * ZMBV_BLOCK) bytes.
+     * Scores are nonnegative, lower is better.
+     */
+    for(i = 1; i <= ZMBV_BLOCK * ZMBV_BLOCK; i++)
         c->score_tab[i] = -i * log2(i / (double)(ZMBV_BLOCK * ZMBV_BLOCK)) * 256;
 
     c->avctx = avctx;