[FFmpeg-devel,V2] lavfi/vf_nlmeans: Improve the performance for nlmeans

Submitted by Jun Zhao on Feb. 1, 2019, 2:45 a.m.

Details

Message ID 1548989124-27074-1-git-send-email-mypopydev@gmail.com
State New

Commit Message

Jun Zhao Feb. 1, 2019, 2:45 a.m.
Remove pdiff_lut_scale from nlmeans and increase the weight_lut table
size from 2^9 to 800000. This avoids the pdiff_lut_scale multiply in
nlmeans_slice() when indexing the weight_lut table, improving
performance by about 12% (measured on 1080p input).

Profiling was done with a command like:

perf stat -a -d -r 5 ./ffmpeg -i input -an -vf nlmeans=s=30 -vframes 10 \
-f null /dev/null

without this change:
     s=1.0 (default value) 63s
     s=30.0                72s

after this change:
     s=1.0 (default value) 56s
     s=30.0                63s

Reviewed-by: Carl Eugen Hoyos <ceffmpeg@gmail.com>
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
---
 libavfilter/vf_nlmeans.c |   12 ++++--------
 1 files changed, 4 insertions(+), 8 deletions(-)

Comments

Clément Bœsch Feb. 1, 2019, 8:29 a.m.
On Fri, Feb 01, 2019 at 10:45:24AM +0800, Jun Zhao wrote:
> Remove the pdiff_lut_scale in nlmeans and increase weight_lut table size
> from 2^9 to 800000, this change will avoid using pdiff_lut_scale in
> nlmeans_slice() for weight_lut table search, it's will improve the
> performance about 12%. (in 1080P size picture case).
> 
> Use the profiling command like:
> 
> perf stat -a -d -r 5 ./ffmpeg -i input -an -vf nlmeans=s=30 -vframes 10 \
> -f null /dev/null
> 
> without this change:
> when s=1.0(default value) 63s
>      s=30.0               72s
> 
> after this change:
>      s=1.0(default value) 56s
>      s=30.0               63s

Nice.

I assume this is tested on x86_64?

> 
> Reviewed-by: Carl Eugen Hoyos <ceffmpeg@gmail.com>
> Signed-off-by: Jun Zhao <mypopydev@gmail.com>
> ---
>  libavfilter/vf_nlmeans.c |   12 ++++--------
>  1 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/libavfilter/vf_nlmeans.c b/libavfilter/vf_nlmeans.c
> index 82e779c..72eb819 100644
> --- a/libavfilter/vf_nlmeans.c
> +++ b/libavfilter/vf_nlmeans.c
> @@ -43,8 +43,7 @@ struct weighted_avg {
>      float sum;
>  };
>  
> -#define WEIGHT_LUT_NBITS 9
> -#define WEIGHT_LUT_SIZE  (1<<WEIGHT_LUT_NBITS)
> +#define WEIGHT_LUT_SIZE  (800000) // need to >  300 * 300 * log(255)

So the LUT is now 3.2MB?

Why 300? 300*300*log(255) is closer to 500 000 than 800 000

[...]
mypopy@gmail.com Feb. 1, 2019, 8:57 a.m.
On Fri, Feb 1, 2019 at 4:29 PM Clément Bœsch <u@pkh.me> wrote:
>
> On Fri, Feb 01, 2019 at 10:45:24AM +0800, Jun Zhao wrote:
> > Remove the pdiff_lut_scale in nlmeans and increase weight_lut table size
> > from 2^9 to 800000, this change will avoid using pdiff_lut_scale in
> > nlmeans_slice() for weight_lut table search, it's will improve the
> > performance about 12%. (in 1080P size picture case).
> >
> > Use the profiling command like:
> >
> > perf stat -a -d -r 5 ./ffmpeg -i input -an -vf nlmeans=s=30 -vframes 10 \
> > -f null /dev/null
> >
> > without this change:
> > when s=1.0(default value) 63s
> >      s=30.0               72s
> >
> > after this change:
> >      s=1.0(default value) 56s
> >      s=30.0               63s
>
> Nice.
>
> I assume this is tested on x86_64?
Yes
>
> >
> > Reviewed-by: Carl Eugen Hoyos <ceffmpeg@gmail.com>
> > Signed-off-by: Jun Zhao <mypopydev@gmail.com>
> > ---
> >  libavfilter/vf_nlmeans.c |   12 ++++--------
> >  1 files changed, 4 insertions(+), 8 deletions(-)
> >
> > diff --git a/libavfilter/vf_nlmeans.c b/libavfilter/vf_nlmeans.c
> > index 82e779c..72eb819 100644
> > --- a/libavfilter/vf_nlmeans.c
> > +++ b/libavfilter/vf_nlmeans.c
> > @@ -43,8 +43,7 @@ struct weighted_avg {
> >      float sum;
> >  };
> >
> > -#define WEIGHT_LUT_NBITS 9
> > -#define WEIGHT_LUT_SIZE  (1<<WEIGHT_LUT_NBITS)
> > +#define WEIGHT_LUT_SIZE  (800000) // need to >  300 * 300 * log(255)
>
> So the LUT is now 3.2MB?
>
> Why 300? 300*300*log(255) is closer to 500 000 than 800 000
I just picked a value > 300*300*log(255) (more precisely, 500 000)
somewhat arbitrarily for this case. The other option is to dynamically
allocate the weight_lut table based on max_meaningful_diff :), but I
think 3MB is not a big burden for nlmeans.
Clément Bœsch Feb. 1, 2019, 9:11 a.m.
On Fri, Feb 01, 2019 at 04:57:37PM +0800, mypopy@gmail.com wrote:
[...]
> > > -#define WEIGHT_LUT_NBITS 9
> > > -#define WEIGHT_LUT_SIZE  (1<<WEIGHT_LUT_NBITS)
> > > +#define WEIGHT_LUT_SIZE  (800000) // need to >  300 * 300 * log(255)
> >
> > So the LUT is now 3.2MB?
> >
> > Why 300? 300*300*log(255) is closer to 500 000 than 800 000
> I just seleted a value > 300*300*log(255) (500 000 more precise) for
> this case at liberty in fact , the other option is use a dynamic
> allocation memory for weight_lut table size base on the
> max_meaningful_diff :), but maybe seems pretty obvious, I think 3M is
> not a big burden for nlmeans

It's probably fine, yes; I'm just confused by the comment: why does it
*need* to be > 300 * 300 * log(255)?
mypopy@gmail.com Feb. 1, 2019, 9:19 a.m.
On Fri, Feb 1, 2019 at 5:11 PM Clément Bœsch <u@pkh.me> wrote:
>
> On Fri, Feb 01, 2019 at 04:57:37PM +0800, mypopy@gmail.com wrote:
> [...]
> > > > -#define WEIGHT_LUT_NBITS 9
> > > > -#define WEIGHT_LUT_SIZE  (1<<WEIGHT_LUT_NBITS)
> > > > +#define WEIGHT_LUT_SIZE  (800000) // need to >  300 * 300 *
log(255)
> > >
> > > So the LUT is now 3.2MB?
> > >
> > > Why 300? 300*300*log(255) is closer to 500 000 than 800 000
> > I just seleted a value > 300*300*log(255) (500 000 more precise) for
> > this case at liberty in fact , the other option is use a dynamic
> > allocation memory for weight_lut table size base on the
> > max_meaningful_diff :), but maybe seems pretty obvious, I think 3M is
> > not a big burden for nlmeans
>
> It's probably fine yes, I'm just confused at the comment: why does it
> *needs* to be > 300 * 300 * log(255)?
>
> --
ohhh, 300 = max(s) * 10 :), max(s) = 30, that's the reason.

In fact, WEIGHT_LUT_SIZE must be at least max(max_meaningful_diff); then we
can avoid using pdiff_lut_scale in nlmeans, because now pdiff_lut_scale == 1.
:)

and max(max_meaningful_diff) = -log(1/255.0) * h * h = log(255) * max(h) *
max(h) = log(255) * max(10*s) * max(10*s)
Clément Bœsch Feb. 1, 2019, 9:43 a.m.
On Fri, Feb 01, 2019 at 05:19:53PM +0800, mypopy@gmail.com wrote:
> On Fri, Feb 1, 2019 at 5:11 PM Clément Bœsch <u@pkh.me> wrote:
> >
> > On Fri, Feb 01, 2019 at 04:57:37PM +0800, mypopy@gmail.com wrote:
> > [...]
> > > > > -#define WEIGHT_LUT_NBITS 9
> > > > > -#define WEIGHT_LUT_SIZE  (1<<WEIGHT_LUT_NBITS)
> > > > > +#define WEIGHT_LUT_SIZE  (800000) // need to >  300 * 300 *
> log(255)
> > > >
> > > > So the LUT is now 3.2MB?
> > > >
> > > > Why 300? 300*300*log(255) is closer to 500 000 than 800 000
> > > I just seleted a value > 300*300*log(255) (500 000 more precise) for
> > > this case at liberty in fact , the other option is use a dynamic
> > > allocation memory for weight_lut table size base on the
> > > max_meaningful_diff :), but maybe seems pretty obvious, I think 3M is
> > > not a big burden for nlmeans
> >
> > It's probably fine yes, I'm just confused at the comment: why does it
> > *needs* to be > 300 * 300 * log(255)?
> >
> > --
> ohhh, 300 = max(s) * 10 :), max(s) = 30, this is the reason.
> 
> In fact, max size of WEIGHT_LUT_SIZE ==  max (max_meaningful_diff), then we
> can avoid use pdiff_lut_scale in nlmeans, becasue now pdiff_lut_scale == 1.
> :)
> 
> and max( max_meaningful_diff ) = -log(-1/255.0) * h * h = log(255) * max
> (h) * max(h) = log(255) * max (10*s) * max(10*s)

Ok, makes sense. Would you mind updating the comment to something like:

    /* Note: WEIGHT_LUT_SIZE must be larger than max_meaningful_diff
     * (log(255)*max(h)^2, which is approximately 500000 with the current
     * maximum sigma of 30). The current value is arbitrary and could be
     * tweaked or defined dynamically. */
    #define WEIGHT_LUT_SIZE 800000

I will test your patch tonight (let's say in about 10 hours given my
current timezone) and apply if everything works fine.

Thanks
mypopy@gmail.com Feb. 1, 2019, 9:50 a.m.
On Fri, Feb 1, 2019 at 5:43 PM Clément Bœsch <u@pkh.me> wrote:

> On Fri, Feb 01, 2019 at 05:19:53PM +0800, mypopy@gmail.com wrote:
> > On Fri, Feb 1, 2019 at 5:11 PM Clément Bœsch <u@pkh.me> wrote:
> > >
> > > On Fri, Feb 01, 2019 at 04:57:37PM +0800, mypopy@gmail.com wrote:
> > > [...]
> > > > > > -#define WEIGHT_LUT_NBITS 9
> > > > > > -#define WEIGHT_LUT_SIZE  (1<<WEIGHT_LUT_NBITS)
> > > > > > +#define WEIGHT_LUT_SIZE  (800000) // need to >  300 * 300 *
> > log(255)
> > > > >
> > > > > So the LUT is now 3.2MB?
> > > > >
> > > > > Why 300? 300*300*log(255) is closer to 500 000 than 800 000
> > > > I just seleted a value > 300*300*log(255) (500 000 more precise) for
> > > > this case at liberty in fact , the other option is use a dynamic
> > > > allocation memory for weight_lut table size base on the
> > > > max_meaningful_diff :), but maybe seems pretty obvious, I think 3M is
> > > > not a big burden for nlmeans
> > >
> > > It's probably fine yes, I'm just confused at the comment: why does it
> > > *needs* to be > 300 * 300 * log(255)?
> > >
> > > --
> > ohhh, 300 = max(s) * 10 :), max(s) = 30, this is the reason.
> >
> > In fact, max size of WEIGHT_LUT_SIZE ==  max (max_meaningful_diff), then
> we
> > can avoid use pdiff_lut_scale in nlmeans, becasue now pdiff_lut_scale ==
> 1.
> > :)
> >
> > and max( max_meaningful_diff ) = -log(-1/255.0) * h * h = log(255) * max
> > (h) * max(h) = log(255) * max (10*s) * max(10*s)
>
> Ok, makes sense. Would you mind updating the comment to something like:
>
>     /* Note: WEIGHT_LUT_SIZE must be larger than max_meaningful_diff
>      * (log(255)*max(h)^2, which is approximately 500000 with the current
>      * maximum sigma of 30). The current value is arbitrary and could be
>      * tweaked or defined dynamically. */
>     #define WEIGHT_LUT_SIZE 800000
>
> I will test your patch tonight (let's say in about 10 hours given my
> current timezone) and apply if everything works fine.
>
It's OK, and I think you can change 800000 to 500000 at the same time
if the patch is OK. Thanks.
Clément Bœsch Feb. 1, 2019, 8 p.m.
On Fri, Feb 01, 2019 at 05:50:37PM +0800, mypopy@gmail.com wrote:
[...]
> > Ok, makes sense. Would you mind updating the comment to something like:
> >
> >     /* Note: WEIGHT_LUT_SIZE must be larger than max_meaningful_diff
> >      * (log(255)*max(h)^2, which is approximately 500000 with the current
> >      * maximum sigma of 30). The current value is arbitrary and could be
> >      * tweaked or defined dynamically. */
> >     #define WEIGHT_LUT_SIZE 800000
> >
> > I will test your patch tonight (let's say in about 10 hours given my
> > current timezone) and apply if everything works fine.
> >
> it's OK, and I think you can change   800000  to 500000 at the same time
> if the patch is Ok. Tks

I did change it to 500000, added a comment, updated the description, and
pushed. I also made the dynamic change you suggested earlier on the weight
LUT.

Thanks for the patch,


diff --git a/libavfilter/vf_nlmeans.c b/libavfilter/vf_nlmeans.c
index 82e779c..72eb819 100644
--- a/libavfilter/vf_nlmeans.c
+++ b/libavfilter/vf_nlmeans.c
@@ -43,8 +43,7 @@  struct weighted_avg {
     float sum;
 };
 
-#define WEIGHT_LUT_NBITS 9
-#define WEIGHT_LUT_SIZE  (1<<WEIGHT_LUT_NBITS)
+#define WEIGHT_LUT_SIZE  (800000) // need to >  300 * 300 * log(255)
 
 typedef struct NLMeansContext {
     const AVClass *class;
@@ -63,7 +62,6 @@  typedef struct NLMeansContext {
     struct weighted_avg *wa;                    // weighted average of every pixel
     ptrdiff_t wa_linesize;                      // linesize for wa in struct size unit
     float weight_lut[WEIGHT_LUT_SIZE];          // lookup table mapping (scaled) patch differences to their associated weights
-    float pdiff_lut_scale;                      // scale factor for patch differences before looking into the LUT
     uint32_t max_meaningful_diff;               // maximum difference considered (if the patch difference is too high we ignore the pixel)
     NLMeansDSPContext dsp;
 } NLMeansContext;
@@ -401,8 +399,7 @@  static int nlmeans_slice(AVFilterContext *ctx, void *arg, int jobnr, int nb_jobs
             const uint32_t patch_diff_sq = e - d - b + a;
 
             if (patch_diff_sq < s->max_meaningful_diff) {
-                const unsigned weight_lut_idx = patch_diff_sq * s->pdiff_lut_scale;
-                const float weight = s->weight_lut[weight_lut_idx]; // exp(-patch_diff_sq * s->pdiff_scale)
+                const float weight = s->weight_lut[patch_diff_sq]; // exp(-patch_diff_sq * s->pdiff_scale)
                 wa[x].total_weight += weight;
                 wa[x].sum += weight * src[x];
             }
@@ -527,10 +524,9 @@  static av_cold int init(AVFilterContext *ctx)
 
     s->pdiff_scale = 1. / (h * h);
     s->max_meaningful_diff = -log(1/255.) / s->pdiff_scale;
-    s->pdiff_lut_scale = 1./s->max_meaningful_diff * WEIGHT_LUT_SIZE;
-    av_assert0((s->max_meaningful_diff - 1) * s->pdiff_lut_scale < FF_ARRAY_ELEMS(s->weight_lut));
+    av_assert0((s->max_meaningful_diff - 1) < FF_ARRAY_ELEMS(s->weight_lut));
     for (i = 0; i < WEIGHT_LUT_SIZE; i++)
-        s->weight_lut[i] = exp(-i / s->pdiff_lut_scale * s->pdiff_scale);
+        s->weight_lut[i] = exp(-i * s->pdiff_scale);
 
     CHECK_ODD_FIELD(research_size,   "Luma research window");
     CHECK_ODD_FIELD(patch_size,      "Luma patch");