[FFmpeg-devel] avcodec/proresenc_anatoliy: change quantization scaling to floating point to utilize vectorization

Message ID 6a39cc61-df29-b742-bc53-7629cb8af705@btf.de
State New

Commit Message

David Murmann Feb. 27, 2018, 8:35 p.m. UTC
Quantization scaling seems to be a slight bottleneck,
this change allows the compiler to more easily vectorize
the loop. This improves total encoding performance in my
tests by about 10-20%.

Signed-off-by: David Murmann <david@btf.de>
---
  libavcodec/proresenc_anatoliy.c | 12 ++++++++----
  1 file changed, 8 insertions(+), 4 deletions(-)

Comments

Hendrik Leppkes Feb. 27, 2018, 8:58 p.m. UTC | #1
On Tue, Feb 27, 2018 at 9:35 PM, David Murmann <david.murmann@btf.de> wrote:
> Quantization scaling seems to be a slight bottleneck,
> this change allows the compiler to more easily vectorize
> the loop. This improves total encoding performance in my
> tests by about 10-20%.
>
> Signed-off-by: David Murmann <david@btf.de>
> ---
>  libavcodec/proresenc_anatoliy.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/libavcodec/proresenc_anatoliy.c b/libavcodec/proresenc_anatoliy.c
> index 0516066163..8b296f6f1b 100644
> --- a/libavcodec/proresenc_anatoliy.c
> +++ b/libavcodec/proresenc_anatoliy.c
> @@ -232,14 +232,18 @@ static const uint8_t lev_to_cb[10] = { 0x04, 0x0A, 0x05, 0x06, 0x04, 0x28,
>  static void encode_ac_coeffs(AVCodecContext *avctx, PutBitContext *pb,
>          int16_t *in, int blocks_per_slice, int *qmat)
>  {
> +    int16_t block[64];
>      int prev_run = 4;
>      int prev_level = 2;
>       int run = 0, level, code, i, j;
> -    for (i = 1; i < 64; i++) {
> -        int indp = progressive_scan[i];
> -        for (j = 0; j < blocks_per_slice; j++) {
> -            int val = QSCALE(qmat, indp, in[(j << 6) + indp]);
> +    for (j = 0; j < blocks_per_slice; j++) {
> +        for (i = 0; i < 64; i++) {
> +            block[i] = (float)in[(j << 6) + i] / (float)qmat[i];
> +        }
> +
> +        for (i = 1; i < 64; i++) {
> +            int val = block[progressive_scan[i]];
>              if (val) {
>                  encode_codeword(pb, run, run_to_cb[FFMIN(prev_run, 15)]);

Usually, using float is best avoided. Did you test re-factoring the
loop structure without changing it to float?

- Hendrik
David Murmann Feb. 27, 2018, 9:22 p.m. UTC | #2
On 2/27/2018 9:58 PM, Hendrik Leppkes wrote:
 > On Tue, Feb 27, 2018 at 9:35 PM, David Murmann <david.murmann@btf.de> wrote:
 >> Quantization scaling seems to be a slight bottleneck,
 >> this change allows the compiler to more easily vectorize
 >> the loop. This improves total encoding performance in my
 >> tests by about 10-20%.
 >>
 >> Signed-off-by: David Murmann <david@btf.de>
 >> ---
 >>   libavcodec/proresenc_anatoliy.c | 12 ++++++++----
 >>   1 file changed, 8 insertions(+), 4 deletions(-)
 >>
[...]
 >> +    for (j = 0; j < blocks_per_slice; j++) {
 >> +        for (i = 0; i < 64; i++) {
 >> +            block[i] = (float)in[(j << 6) + i] / (float)qmat[i];
 >> +        }
 >> +
 >> +        for (i = 1; i < 64; i++) {
 >> +            int val = block[progressive_scan[i]];
 >>               if (val) {
 >>                   encode_codeword(pb, run, run_to_cb[FFMIN(prev_run, 15)]);
 >
 > Usually, using float is best avoided. Did you test re-factoring the
 > loop structure without changing it to float?

Yes, the vector instructions don't have integer division, AFAIK, and the
compiler just generates a loop with idivs. This is quite a bit slower
than converting to float, dividing and converting back, if the compiler
uses vector instructions. In the general case this wouldn't be exact,
but since the input values are int16 they should losslessly fit into
float32. On platforms where this auto-vectorization fails this might
actually be quite a bit slower, but I have not seen that in my tests
(though I have only tested on x86_64).
Rostislav Pehlivanov Feb. 27, 2018, 11:19 p.m. UTC | #3
On 27 February 2018 at 21:22, David Murmann <david.murmann@btf.de> wrote:

>
> On 2/27/2018 9:58 PM, Hendrik Leppkes wrote:
> > On Tue, Feb 27, 2018 at 9:35 PM, David Murmann <david.murmann@btf.de> wrote:
> >> Quantization scaling seems to be a slight bottleneck,
> >> this change allows the compiler to more easily vectorize
> >> the loop. This improves total encoding performance in my
> >> tests by about 10-20%.
> >>
> >> Signed-off-by: David Murmann <david@btf.de>
> >> ---
> >>   libavcodec/proresenc_anatoliy.c | 12 ++++++++----
> >>   1 file changed, 8 insertions(+), 4 deletions(-)
> >>
> [...]
> >> +    for (j = 0; j < blocks_per_slice; j++) {
> >> +        for (i = 0; i < 64; i++) {
> >> +            block[i] = (float)in[(j << 6) + i] / (float)qmat[i];
> >> +        }
> >> +
> >> +        for (i = 1; i < 64; i++) {
> >> +            int val = block[progressive_scan[i]];
> >>               if (val) {
> >>                   encode_codeword(pb, run, run_to_cb[FFMIN(prev_run, 15)]);
> >
> > Usually, using float is best avoided. Did you test re-factoring the
> > loop structure without changing it to float?
>
> Yes, the vector instructions don't have integer division, AFAIK, and the
> compiler just generates a loop with idivs. This is quite a bit slower
> than converting to float, dividing and converting back, if the compiler
> uses vector instructions. In the general case this wouldn't be exact,
> but since the input values are int16 they should losslessly fit into
> float32. On platforms where this auto-vectorization fails this might
> actually be quite a bit slower, but I have not seen that in my tests
> (though I have only tested on x86_64).
>
> --
> David Murmann
>
> david@btf.de
> Telefon +49 (0) 221 82008710
> Fax +49 (0) 221 82008799
>
> http://btf.de/
>
> --
> btf GmbH | Leyendeckerstr. 27, 50825 Köln | +49 (0) 221 82 00 87 10
> Geschäftsführer: Philipp Käßbohrer & Matthias Murmann | HR Köln | HRB 74707
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>

No, you're going about it the wrong way. Floats should most definitely be
avoided in encoders/decoders. Non-deterministic output across platforms is
the smaller issue compared to how they can obliterate performance if the
compiler emits an actual div instruction.

Instead, here's what you can do to make it even faster: replace the
division with a multiply + a shift. Keeps the output identical too. I've
just sent an old patch of mine (for a different but similar codec) you can
work off of - just take the last bit of code there, run it at init to
generate the LUTs for all quantizers and then just multiply and shift by
looking into the tables you generate. Here's the link:
http://ffmpeg.org/pipermail/ffmpeg-devel/2018-February/225867.html
Patch

diff --git a/libavcodec/proresenc_anatoliy.c b/libavcodec/proresenc_anatoliy.c
index 0516066163..8b296f6f1b 100644
--- a/libavcodec/proresenc_anatoliy.c
+++ b/libavcodec/proresenc_anatoliy.c
@@ -232,14 +232,18 @@ static const uint8_t lev_to_cb[10] = { 0x04, 0x0A, 0x05, 0x06, 0x04, 0x28,
  static void encode_ac_coeffs(AVCodecContext *avctx, PutBitContext *pb,
          int16_t *in, int blocks_per_slice, int *qmat)
  {
+    int16_t block[64];
      int prev_run = 4;
      int prev_level = 2;
       int run = 0, level, code, i, j;
-    for (i = 1; i < 64; i++) {
-        int indp = progressive_scan[i];
-        for (j = 0; j < blocks_per_slice; j++) {
-            int val = QSCALE(qmat, indp, in[(j << 6) + indp]);
+    for (j = 0; j < blocks_per_slice; j++) {
+        for (i = 0; i < 64; i++) {
+            block[i] = (float)in[(j << 6) + i] / (float)qmat[i];
+        }
+
+        for (i = 1; i < 64; i++) {
+            int val = block[progressive_scan[i]];
              if (val) {
                  encode_codeword(pb, run, run_to_cb[FFMIN(prev_run, 15)]);
  -- 2.16.2