[FFmpeg-devel,1/4] zmbvenc: don't sum the entropy when blocks are equal

Message ID 20181219220003.27225-1-matthew.w.fearnley@gmail.com
State New

Commit Message

matthew.w.fearnley@gmail.com Dec. 19, 2018, 10 p.m. UTC
From: Matthew Fearnley <matthew.w.fearnley@gmail.com>

If *xored is 0, then histogram[0]==bw*bh and histogram[1..255]==0.

Because histogram[0] is skipped over for the entropy calculation, the
return value is always 0 when *xored==0, so we don't need to waste time
calculating it.

This addition both clarifies the behaviour of the code and improves
the speed when the block matches.
---
 libavcodec/zmbvenc.c | 5 +++++
 1 file changed, 5 insertions(+)
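
For reference, the entropy scoring loop in block_cmp() that the message refers
to (it appears unchanged in the patch context at the bottom of this page) is:

    for(i = 1; i < 256; i++)
        sum += c->score_tab[histogram[i]];

Because i starts at 1, histogram[0] never contributes to the sum, so when every
byte of src[] ^ src2[] is zero the return value can only be 0.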

Comments

Tomas Härdin Dec. 20, 2018, 4:29 p.m. UTC | #1
Wed 2018-12-19 at 22:00 +0000, matthew.w.fearnley@gmail.com wrote:
> > From: Matthew Fearnley <matthew.w.fearnley@gmail.com>
> 
> If *xored is 0, then histogram[0]==bw*bh and histogram[1..255]==0.
> 
> Because histogram[0] is skipped over for the entropy calculation, the
> return value is always 0 when *xored==0, so we don't need to waste time
> calculating it.
> 
> This addition both clarifies the behaviour of the code and improves
> the speed when the block matches.
> ---
>  libavcodec/zmbvenc.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/libavcodec/zmbvenc.c b/libavcodec/zmbvenc.c
> index 4d9147657d..2f041dae32 100644
> --- a/libavcodec/zmbvenc.c
> +++ b/libavcodec/zmbvenc.c
> @@ -71,6 +71,7 @@ static inline int block_cmp(ZmbvEncContext *c, uint8_t *src, int stride,
>      int i, j;
>      uint8_t histogram[256] = {0};
>  
> +    /* build frequency histogram of byte values for src[] ^ src2[] */
>      *xored = 0;
>      for(j = 0; j < bh; j++){
>          for(i = 0; i < bw; i++){
> @@ -82,6 +83,10 @@ static inline int block_cmp(ZmbvEncContext *c, uint8_t *src, int stride,
>          src2 += stride2;
>      }
>  
> +    /* early out if src and src2 are equal */
> +    if (!*xored) return 0;

I have a feeling this could be sped up further by just doing *xored =
histogram[0] == ZMBV_BLOCK*ZMBV_BLOCK after the loops, if [PATCH 3/4]
is applied before this. Computing both histogram and xored in the loop
seems pointless.

Beyond that this looks good

/Tomas
matthew.w.fearnley@gmail.com Dec. 20, 2018, 5:46 p.m. UTC | #2
On Thu, 20 Dec 2018 at 16:30, Tomas Härdin <tjoppen@acc.umu.se> wrote:

> I have a feeling this could be sped up further by just doing *xored =
> histogram[0] == ZMBV_BLOCK*ZMBV_BLOCK after the loops, if [PATCH 3/4]
> is applied before this. Computing both histogram and xored in the loop
> seems pointless.
>

You're right, that speedup didn't occur to me.  It makes the logic a bit
more tenuous, but it would be more efficient.
Note that bw,bh aren't guaranteed to equal ZMBV_BLOCK, so `histogram[0] ==
bw*bh` would have to be used to guard against those (literal) edge cases.
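
A rough sketch of how the combined suggestion might look (illustrative only,
not a patch from this thread; names are as in block_cmp()):

    /* histogram-only inner loop: no per-pixel tracking of *xored */
    for(j = 0; j < bh; j++){
        for(i = 0; i < bw; i++){
            histogram[src[i] ^ src2[i]]++;
        }
        src  += stride;
        src2 += stride2;
    }

    /* the blocks differ iff some xored byte is non-zero, i.e. iff fewer than
     * bw*bh of the xored bytes are 0 (bw*bh rather than ZMBV_BLOCK*ZMBV_BLOCK,
     * to cover the partial edge blocks mentioned above) */
    *xored = histogram[0] != bw * bh;
    if (!*xored)
        return 0;
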
Tomas Härdin Dec. 22, 2018, 12:11 p.m. UTC | #3
Thu 2018-12-20 at 17:46 +0000, Matthew Fearnley wrote:
> > On Thu, 20 Dec 2018 at 16:30, Tomas Härdin <tjoppen@acc.umu.se> wrote:
> 
> > I have a feeling this could be sped up further by just doing *xored =
> > histogram[0] == ZMBV_BLOCK*ZMBV_BLOCK after the loops, if [PATCH 3/4]
> > is applied before this. Computing both histogram and xored in the loop
> > seems pointless.
> > 
> 
> You're right, that speedup didn't occur to me.  It makes the logic a bit
> more tenuous, but it would be more efficient.

Eh, I wouldn't really call "we have xored data to output if and only if
number of zeroes < number of pixels" fairly easy to grasp. A comment
might be good tho

> Note that bw,bh aren't guaranteed to equal ZMBV_BLOCK, so `histogram[0] ==
> bw*bh` would have to be used to guard against those (literal) edge cases.

Right, yes. But if we have block sizes other than ZMBV_BLOCK^2 then we
need score tables for those sizes too.

/Tomas
matthew.w.fearnley@gmail.com Dec. 22, 2018, 3:32 p.m. UTC | #4
> On 22 Dec 2018, at 12:11, Tomas Härdin <tjoppen@acc.umu.se> wrote:
> 
> Thu 2018-12-20 at 17:46 +0000, Matthew Fearnley wrote:
>>>> On Thu, 20 Dec 2018 at 16:30, Tomas Härdin <tjoppen@acc.umu.se> wrote:
>>> 
>>> I have a feeling this could be sped up further by just doing *xored =
>>> histogram[0] == ZMBV_BLOCK*ZMBV_BLOCK after the loops, if [PATCH 3/4]
>>> is applied before this. Computing both histogram and xored in the loop
>>> seems pointless.
>>> 
>> 
>> You're right, that speedup didn't occur to me.  It makes the logic a bit
>> more tenuous, but it would be more efficient.
> 
> Eh, I wouldn't really call "we have xored data to output if and only if
> number of zeroes < number of pixels" fairly easy to grasp. A comment
> might be good tho
Agreed.
Just have to get the patch order sorted out now somehow.
>> Note that bw,bh aren't guaranteed to equal ZMBV_BLOCK, so `histogram[0] ==
>> bw*bh` would have to be used to guard against those (literal) edge cases.
> 
> Right, yes. But if we have block sizes other than ZMBV_BLOCK^2 then we
> need score tables for those sizes too.
I’ve thought about that a bit. I wondered if it would be worth it given:
- the extra code, memory and logic needed
- it would only improve the edge blocks
- the existing score table isn’t catastrophically bad for short blocks, and would still favour blocks with more common pixels.

It would be better from a correctness perspective though, and effects on running time should be negligible.

I can work on a patch, see how it looks.
Tomas Härdin Dec. 25, 2018, 9:35 a.m. UTC | #5
Sat 2018-12-22 at 15:32 +0000, Matthew Fearnley wrote:
> > > > 
> > > Note that bw,bh aren't guaranteed to equal ZMBV_BLOCK, so `histogram[0] ==
> > > bw*bh` would have to be used to guard against those (literal) edge cases.
> > 
> > Right, yes. But if we have block sizes other than ZMBV_BLOCK^2 then we
> > need score tables for those sizes too.
> 
> I’ve thought about that a bit. I wondered if it would be worth it given:
> - the extra code, memory and logic needed

If you have a huge amount of DOS captures to optimize then it might be
worth it, else probably questionable

> - it would only improve the edge blocks

I imagine large blocks would be good for scenes with mostly global
motion. You cut down on the number of MVs and thus the amount of data
zlib has to compress, if the block size is a good fit.

> - the existing score table isn’t catastrophically bad for short blocks, and would still favour blocks with more common pixels.
> 
> It would be better from a correctness perspective though, and effects on running time should be negligible.

Good point. There's also no telling whether the current model is
actually an accurate prediction of how zlib behaves :)

/Tomas
matthew.w.fearnley@gmail.com Jan. 19, 2019, 10:40 p.m. UTC | #6
On Tue, 25 Dec 2018 at 09:35, Tomas Härdin <tjoppen@acc.umu.se> wrote:

> Sat 2018-12-22 at 15:32 +0000, Matthew Fearnley wrote:
> > > > >
> > > > Note that bw,bh aren't guaranteed to equal ZMBV_BLOCK, so `histogram[0] ==
> > > > bw*bh` would have to be used to guard against those (literal) edge cases.
> > >
> > > Right, yes. But if we have block sizes other than ZMBV_BLOCK^2 then we
> > > need score tables for those sizes too.
> >
> > I’ve thought about that a bit. I wondered if it would be worth it given:
> > - the extra code, memory and logic needed
>
> If you have a huge amount of DOS captures to optimize then it might be
> worth it, else probably questionable
>
> > - it would only improve the edge blocks
>
> I imagine large blocks would be good for scenes with mostly global
> motion. You cut down on the number of MVs and thus the amount of data
> zlib has to compress, if the block size is a good fit.
>
> > - the existing score table isn’t catastrophically bad for short blocks,
> > and would still favour blocks with more common pixels.
> >
> > It would be better from a correctness perspective though, and effects on
> > running time should be negligible.
>
> Good point. There's also no telling whether the current model is
> actually an accurate prediction of how zlib behaves :)
>
>
Hi.  Just to say, I tried setting up additional score_tabs for the
bottom/right partial blocks.  Unfortunately, after implementing it and
ironing out the bugs, the new score tables actually caused a slight
increase in file size!
After a little informal investigation with a couple of videos, my findings
were that increasing the divisor - '(i / (double)(ZMBV_BLOCK * ZMBV_BLOCK *
some_factor))' - would generally lead to slight compression improvements.
Given the score table is still "valid" for smaller blocks, and given the
extra complexity of adding the score tables, plus the fact that it may
generally hurt the compression, I'm inclined to leave it with the one score
table.  But there may be potential to improve the current compression
method in future, by somehow tuning the divisor for better results
generally.

Matthew
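
For readers without the source to hand, the table being discussed is built once
when the encoder is initialised, from an expression of roughly this shape (a
paraphrase rather than the exact zmbvenc.c code; some_factor stands for the
hypothetical extra divisor Matthew describes, not an existing variable):

    /* sketch only: score for a byte value occurring i times in a block,
     * i.e. -i*log(p) with p = i / (ZMBV_BLOCK*ZMBV_BLOCK * some_factor) */
    for(i = 1; i <= ZMBV_BLOCK * ZMBV_BLOCK; i++)
        score_tab[i] = -i * log(i / (double)(ZMBV_BLOCK * ZMBV_BLOCK * some_factor));
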
Tomas Härdin Jan. 20, 2019, 3:15 p.m. UTC | #7
Sat 2019-01-19 at 22:40 +0000, Matthew Fearnley wrote:
> > On Tue, 25 Dec 2018 at 09:35, Tomas Härdin <tjoppen@acc.umu.se> wrote:
> 
> > Sat 2018-12-22 at 15:32 +0000, Matthew Fearnley wrote:
> > > > > > 
> > > > > 
> > > > > Note that bw,bh aren't guaranteed to equal ZMBV_BLOCK, so `histogram[0] ==
> > > > > bw*bh` would have to be used to guard against those (literal) edge cases.
> > > > 
> > > > Right, yes. But if we have block sizes other than ZMBV_BLOCK^2 then we
> > > > need score tables for those sizes too.
> > > 
> > > I’ve thought about that a bit. I wondered if it would be worth it given:
> > > - the extra code, memory and logic needed
> > 
> > If you have a huge amount of DOS captures to optimize then it might be
> > worth it, else probably questionable
> > 
> > > - it would only improve the edge blocks
> > 
> > I imagine large blocks would be good for scenes with mostly global
> > motion. You cut down on the number of MVs and thus the amount of data
> > zlib has to compress, if the block size is a good fit.
> > 
> > > - the existing score table isn’t catastrophically bad for short blocks,
> > > and would still favour blocks with more common pixels.
> > >
> > > It would be better from a correctness perspective though, and effects on
> > > running time should be negligible.
> > 
> > Good point. There's also no telling whether the current model is
> > actually an accurate prediction of how zlib behaves :)
> > 
> > 
> 
> Hi.  Just to say, I tried setting up additional score_tabs for the
> bottom/right partial blocks.  Unfortunately, after implementing it and
> ironing out the bugs, the new score tables actually caused a slight
> increase in file size!
> After a little informal investigation with a couple of videos, my findings
> were that increasing the divisor - '(i / (double)(ZMBV_BLOCK * ZMBV_BLOCK *
> some_factor))' - would generally lead to slight compression improvements.
> Given the score table is still "valid" for smaller blocks, and given the
> extra complexity of adding the score tables, plus the fact that it may
> generally hurt the compression, I'm inclined to leave it with the one score
> table.  But there may be potential to improve the current compression
> method in future, by somehow tuning the divisor for better results
> generally.

Huh, that's strange. Sounds like something that warrants a comment in
the code. I also see we have an answer to Carl's question: you're still
experimenting with this :) I think we can hold off on pushing anything
until you feel happy with the result

/Tomas
matthew.w.fearnley@gmail.com Jan. 31, 2019, 1:31 p.m. UTC | #8
On Sun, 20 Jan 2019 at 15:16, Tomas Härdin <tjoppen@acc.umu.se> wrote:

> > Hi.  Just to say, I tried setting up additional score_tabs for the
> > bottom/right partial blocks.  Unfortunately, after implementing it and
> > ironing out the bugs, the new score tables actually caused a slight
> > increase in file size!
> > After a little informal investigation with a couple of videos, my findings
> > were that increasing the divisor - '(i / (double)(ZMBV_BLOCK * ZMBV_BLOCK *
> > some_factor))' - would generally lead to slight compression improvements.
> > Given the score table is still "valid" for smaller blocks, and given the
> > extra complexity of adding the score tables, plus the fact that it may
> > generally hurt the compression, I'm inclined to leave it with the one score
> > table.  But there may be potential to improve the current compression
> > method in future, by somehow tuning the divisor for better results
> > generally.
>
> Huh, that's strange. Sounds like something that warrants a comment in
> the code. I also see we have an answer to Carl's question: you're still
> experimenting with this :) I think we can hold off on pushing anything
> until you feel happy with the result
>

Hi.
Sorry, I had missed Carl's question.  Regrettably my work on this has been
slow since my initial patch submissions, but I think I'm close to
submitting some final changes.  Thanks for your patience so far.

I have recently made two helpful realisations about the above score table
problem:

1. The entropy calculation in block_cmp() omits the score of histogram[0]
from the final sum.
It's tempting to do this to bias the scores in favour of 0-bytes, but in
reality, blocks with a majority of 0 (or any other byte) will already be
naturally favoured by the entropy score anyway, and this bias will fail to
penalise blocks with an "average" (i.e. high entropy) number of 0's in them.
The exact implications are difficult to ponder, but in practical terms this
error does tend to produce worse results in the video compression.  Not
massively so, but it's still noticeable.
(Math is a harsh mistress, and will often look unkindly on such flagrant
attempts to circumvent her laws.. :)
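
The change implied by realisation 1 would be roughly the following (a sketch,
assuming the score table covers counts up to the full block size):

    /* sum the entropy over all byte values, including histogram[0] */
    for(i = 0; i < 256; i++)
        sum += c->score_tab[histogram[i]];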

2. As long as the blocks being compared are the same size as each other,
the entropy-based score comparisons (without the above bias) are unaffected
by the base of the log, or the divisor used.
(Mathematically, if 'a+b=c+d', then if 'a*log(a) + b*log(b) < c*log(c) +
d*log(d)' then it is also true that 'a*log(a/N) + b*log(b/N) < c*log(c/N) +
d*log(d/N)'.
If that doesn't make sense, it helps to note that 'log(a/N) = log(a) -
log(N)', so the log(N)'s cancel out.)
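
Spelled out, the cancellation works because the counts on both sides sum to the
same block size: if a + b = c + d, then for any N > 0,

    a*log(a/N) + b*log(b/N) = a*log(a) + b*log(b) - (a+b)*log(N)
    c*log(c/N) + d*log(d/N) = c*log(c) + d*log(d) - (c+d)*log(N)

and since (a+b)*log(N) = (c+d)*log(N), the same constant is subtracted from
both scores, leaving the comparison between them unchanged.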

This means that we can use a single score table for all block sizes!  It
only needs to be big enough for the largest block size, then it produces
optimal scores for all blocks up to that size.
It does mean that partial blocks with uniform bytes won't have a score of
0, but this property of entropy is not actually important when scoring the
blocks.
(Overall, this is a significant relief, because my attempts at multiple
score tables resulted in a lot of tricky decisions about how best to keep
the code DRY and minimise complexity, and I wasn't really happy with my
solution.)


I am planning to organise my submission into two patches.  I intend for one
to focus on problems/inefficiencies in block comparisons/score tables.  The
other will be focused on motion estimation, i.e. the range of MV's that are
compared, and to a very small extent, the order they're compared in.
This avoids the overhead of patching each individual issue separately, but
it will provide a logical split between the two main areas of change.

Kind regards,

Matthew
Tomas Härdin Jan. 31, 2019, 3 p.m. UTC | #9
Thu 2019-01-31 at 13:31 +0000, Matthew Fearnley wrote:
> > On Sun, 20 Jan 2019 at 15:16, Tomas Härdin <tjoppen@acc.umu.se> wrote:
> 
> > > Hi.  Just to say, I tried setting up additional score_tabs for the
> > > bottom/right partial blocks.  Unfortunately, after implementing it and
> > > ironing out the bugs, the new score tables actually caused a slight
> > > increase in file size!
> > > After a little informal investigation with a couple of videos, my findings
> > > were that increasing the divisor - '(i / (double)(ZMBV_BLOCK * ZMBV_BLOCK *
> > > some_factor))' - would generally lead to slight compression improvements.
> > > Given the score table is still "valid" for smaller blocks, and given the
> > > extra complexity of adding the score tables, plus the fact that it may
> > > generally hurt the compression, I'm inclined to leave it with the one score
> > > table.  But there may be potential to improve the current compression
> > > method in future, by somehow tuning the divisor for better results
> > > generally.
> > 
> > Huh, that's strange. Sounds like something that warrants a comment in
> > the code. I also see we have an answer to Carl's question: you're still
> > experimenting with this :) I think we can hold off on pushing anything
> > until you feel happy with the result
> > 
> 
> Hi.
> Sorry, I had missed Carl's question.  Regrettably my work on this has been
> slow since my initial patch submissions, but I think I'm close to
> submitting some final changes.  Thanks for your patience so far.
> 
> I have recently made two helpful realisations about the above score table
> problem:
> 
> 1. The entropy calculation in block_cmp() omits the score of histogram[0]
> from the final sum.
> It's tempting to do this to bias the scores in favour of 0-bytes, but in
> reality, blocks with a majority of 0 (or any other byte) will already be
> naturally favoured by the entropy score anyway, and this bias will fail to
> penalise blocks with an "average" (i.e. high entropy) number of 0's in them.
> The exact implications are difficult to ponder, but in practical terms this
> error does tend to produce worse results in the video compression.  Not
> massively so, but it's still noticeable.

Did you try combining the statistics of the current MV with the
statistics of the previous block, to bias the choice in favor of
similar bytes? Could work like a cheap order-1 entropy model.

> 2. As long as the blocks being compared are the same size as each other,
> the entropy-based score comparisons (without the above bias) are unaffected
> by the base of the log, or the divisor used.
> (Mathematically, if 'a+b=c+d', then if 'a*log(a) + b*log(b) < c*log(c) +
> d*log(d)' then it is also true that 'a*log(a/N) + b*log(b/N) < c*log(c/N) +
> d*log(d/N)'.
> If that doesn't make sense, it helps to note that 'log(a/N) = log(a) -
> log(N)', so the log(N)'s cancel out.)

Convenient observation :)

> I am planning to organise my submission into two patches.  I intend for one
> to focus on problems/inefficiencies in block comparisons/score tables.  The
> other will be focused on motion estimation, i.e. the range of MV's that are
> compared, and to a very small extent, the order they're compared in.
> This avoids the overhead of patching each individual issue separately, but
> it will provide a logical split between the two main areas of change.

Sounds good!

/Tomas
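
One possible shape of the order-1 idea, for concreteness (purely hypothetical
code, not from zmbvenc.c or this thread; Matthew's actual experiment is in the
GitHub commit linked in the follow-up below):

    /* Bias a candidate block's entropy score by the byte statistics of the
     * previously chosen block, so candidates whose xored bytes resemble their
     * neighbour's are preferred.  prev_histogram[] would be updated with the
     * winning candidate's histogram after each block; score_tab would need
     * entries up to twice the block area for the combined counts. */
    static int block_score_ctx(const int *score_tab,
                               const uint16_t histogram[256],
                               const uint16_t prev_histogram[256])
    {
        int i, sum = 0;
        for (i = 1; i < 256; i++)
            sum += score_tab[histogram[i] + prev_histogram[i]];
        return sum;
    }
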
matthew.w.fearnley@gmail.com Feb. 10, 2019, 10:18 p.m. UTC | #10
On Thu, 31 Jan 2019 at 15:00, Tomas Härdin <tjoppen@acc.umu.se> wrote:

>
> > 1. The entropy calculation in block_cmp() omits the score of histogram[0]
> > from the final sum.
> > It's tempting to do this to bias the scores in favour of 0-bytes, but in
> > reality, blocks with a majority of 0 (or any other byte) will already be
> > naturally favoured by the entropy score anyway, and this bias will fail to
> > penalise blocks with an "average" (i.e. high entropy) number of 0's in them.
> > The exact implications are difficult to ponder, but in practical terms this
> > error does tend to produce worse results in the video compression.  Not
> > massively so, but it's still noticeable.
>
> Did you try combining the statistics of the current MV with the
> statistics of the previous block, to bias the choice in favor of
> similar bytes? Could work like a cheap order-1 entropy model.
>

I've had a go at this, but sadly, it seemed to adversely affect
compression, producing larger files.
Maybe it can be improved with some tweaking, or maybe there's a bug
somewhere.
But it feels to me like this approach on its own isn't enough to improve
scoring.  Maybe as part of something larger though..
Anyway, you can see the results of my efforts here:
https://github.com/countingpine/FFmpeg/commit/prev_histogram.

- Matthew
Tomas Härdin Feb. 11, 2019, 8:11 a.m. UTC | #11
Sun 2019-02-10 at 22:18 +0000, Matthew Fearnley wrote:
> > On Thu, 31 Jan 2019 at 15:00, Tomas Härdin <tjoppen@acc.umu.se> wrote:
> 
> > 
> > > 1. The entropy calculation in block_cmp() omits the score of histogram[0]
> > > from the final sum.
> > > It's tempting to do this to bias the scores in favour of 0-bytes, but in
> > > reality, blocks with a majority of 0 (or any other byte) will already be
> > > naturally favoured by the entropy score anyway, and this bias will fail to
> > > penalise blocks with an "average" (i.e. high entropy) number of 0's in them.
> > > The exact implications are difficult to ponder, but in practical terms this
> > > error does tend to produce worse results in the video compression.  Not
> > > massively so, but it's still noticeable.
> > 
> > Did you try combining the statistics of the current MV with the
> > statistics of the previous block, to bias the choice in favor of
> > similar bytes? Could work like a cheap order-1 entropy model.
> > 
> 
> I've had a go at this, but sadly, it seemed to adversely affect
> compression, producing larger files.
> Maybe it can be improved with some tweaking, or maybe there's a bug
> somewhere.
> But it feels to me like this approach on its own isn't enough to improve
> scoring.  Maybe as part of something larger though..
> Anyway, you can see the results of my efforts here:
> https://github.com/countingpine/FFmpeg/commit/prev_histogram.

Darn. Interesting result at least

/Tomas

Patch

diff --git a/libavcodec/zmbvenc.c b/libavcodec/zmbvenc.c
index 4d9147657d..2f041dae32 100644
--- a/libavcodec/zmbvenc.c
+++ b/libavcodec/zmbvenc.c
@@ -71,6 +71,7 @@  static inline int block_cmp(ZmbvEncContext *c, uint8_t *src, int stride,
     int i, j;
     uint8_t histogram[256] = {0};
 
+    /* build frequency histogram of byte values for src[] ^ src2[] */
     *xored = 0;
     for(j = 0; j < bh; j++){
         for(i = 0; i < bw; i++){
@@ -82,6 +83,10 @@  static inline int block_cmp(ZmbvEncContext *c, uint8_t *src, int stride,
         src2 += stride2;
     }
 
+    /* early out if src and src2 are equal */
+    if (!*xored) return 0;
+
+    /* sum the entropy of the non-zero values */
     for(i = 1; i < 256; i++)
         sum += c->score_tab[histogram[i]];