diff mbox

[FFmpeg-devel] Efficiently support several output pixel formats in Cinepak decoder

Message ID 20170201193702.GX1516@example.net
State New
Headers show

Commit Message

u-9iep@aetey.se Feb. 1, 2017, 7:37 p.m. UTC
Hello,

On Sat, Jan 28, 2017 at 07:53:06PM +0100, Michael Niedermayer wrote:
> please provide a git compatible patch
> git format-patch / send-email

The corresponding patches (concerning comments in cinepak-related files)
have been resent in a git-compatible form 2017-01-29.
This patch applies after them.

It represents a large change in the Cinepak decoder for speed/efficiency.
Cinepak was always very good at speed, now it has become even (much) better.

Bought by the code size growth is a reduction of in-loop checking and
a capability to produce formats directly suitable for output devices,
without applying pixel format conversion afterwards to the whole frame.

Decoding to certain formats is also remarkably faster than to the default
rgb24, on i686
 to rgb32 seems to be about 15% faster than to rgb24
 to rgb565 seems to be over 20% faster than to rgb24

Besides this, on a slow device using a 16-bit depth provides a remarkable
speedup per se, compared to larger depths.

Avoiding frame pixel format conversion by generating rgb565 in the decoder
for a corresponsing video buffer yields in our tests (on MMX-capable
i*86) more than twice [sic] the playback speed compared to decoding to rgb24.

The default output format remains rgb24, with the full quality offered
by the codec.

Splitting this into multiple patches does not look very useful,
the changes make sense together.

Regards,
Rune
From 300b8a4e712d1a404983b245aac501e09326ee72 Mon Sep 17 00:00:00 2001
From: Rl <addr-see-the-website@aetey.se>
Date: Wed, 1 Feb 2017 19:44:40 +0100
Subject: [PATCH] Efficiently support several output pixel formats in Cinepak
 decoder

Optimize decoding in general and largely improve speed,
among others by the ability to produce device-native pixel formats.

The output format can be chosen at runtime by an option.

The format can be also chosen by setting environment variable
CINEPAK_OUTPUT_FORMAT_OVERRIDE, if this is enabled at build time.
---
 libavcodec/cinepak.c | 957 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 845 insertions(+), 112 deletions(-)

Comments

wm4 Feb. 2, 2017, 3:25 p.m. UTC | #1
On Wed, 1 Feb 2017 20:37:02 +0100
u-9iep@aetey.se wrote:

> From 300b8a4e712d1a404983b245aac501e09326ee72 Mon Sep 17 00:00:00 2001
> From: Rl <addr-see-the-website@aetey.se>
> Date: Wed, 1 Feb 2017 19:44:40 +0100
> Subject: [PATCH] Efficiently support several output pixel formats in Cinepak
>  decoder
> 
> Optimize decoding in general and largely improve speed,
> among others by the ability to produce device-native pixel formats.
> 
> The output format can be chosen at runtime by an option.
> 

I can see how conversion in the decoder could speed it up somewhat
(especially if limited by memory bandwidth, I guess), but it's not
really how we do things in FFmpeg. Normally, conversion is done by
libswscale (or whatever the API user uses).

> The format can be also chosen by setting environment variable
> CINEPAK_OUTPUT_FORMAT_OVERRIDE, if this is enabled at build time.

It's not an environment variable, but a preprocessor variable. It's
also not defined by default, which effectively makes some of your
additions dead code. This is also not a thing we do in FFmpeg, and it
usually creates maintenance nightmares. (For you too. Who would care
about accidentally breaking this code when making changes to live parts
of the decoder?)

>  [...]
> -typedef struct CinepakContext {
> +typedef struct CinepakContext CinepakContext;
> +struct CinepakContext {
> +    const AVClass *class;
>  
>      AVCodecContext *avctx;
>      AVFrame *frame;
> @@ -71,24 +87,111 @@ typedef struct CinepakContext {
>      int sega_film_skip_bytes;
>  
>      uint32_t pal[256];
> -} CinepakContext;

Stray change?

>  
> -static void cinepak_decode_codebook (cvid_codebook *codebook,
> -                                     int chunk_id, int size, const uint8_t *data)
> +    void (*decode_codebook)(CinepakContext *s,
> +                            cvid_codebook *codebook, int chunk_id,
> +                            int size, const uint8_t *data);
> +    int  (*decode_vectors)(CinepakContext *s, cvid_strip *strip,
> +                           int chunk_id, int size, const uint8_t *data);
> +/* options */
> +    enum AVPixelFormat out_pixfmt;
> +};
> +
> +#define OFFSET(x) offsetof(CinepakContext, x)
> +#define VD AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_DECODING_PARAM
> +static const AVOption options[] = {
> +{"output_pixel_format", "set output pixel format: rgb24/rgb32/rgb565/yuv420p/pal8; yuv420p is approximate", OFFSET(out_pixfmt), AV_OPT_TYPE_PIXEL_FMT, {.i64=AV_PIX_FMT_RGB24}, -1, INT_MAX, VD },
> +    { NULL },
> +};

This is also not how we do things in FFmpeg. Usually get_format is used
when a decoder really can output multiple formats (which is very
rarely, usually for hardware decoding).


> +static void cinepak_decode_codebook_rgb32 (CinepakContext *s,
> +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)


> +static void cinepak_decode_codebook_rgb24 (CinepakContext *s,
> +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)

> +static void cinepak_decode_codebook_rgb565 (CinepakContext *s,
> +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)


So a part of the decoder is essentially duplicated for each format?

> +/* a simplistic version to begin with, it is also fast -- rl */
> +static void cinepak_decode_codebook_yuv420 (CinepakContext *s,
> +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)
> +{
> +    const uint8_t *eod = (data + size);
> +    uint32_t flag, mask;
> +    int      i, n;
> +    uint8_t *p;
> +    int palette_video = s->palette_video;
> +    int selective_update = (chunk_id & 0x01);
> +
> +    /* check if this chunk contains 4- or 6-element vectors */
> +    n    = (chunk_id & 0x04) ? 4 : 6;
> +    flag = 0;
> +    mask = 0;
> +
> +    p = codebook->yuv420[0];
> +    for (i=0; i < 256; i++) {
> +        if (selective_update && !(mask >>= 1)) {
> +            if ((data + 4) > eod)
> +                break;
> +
> +            flag  = AV_RB32 (data);
> +            data += 4;
> +            mask  = 0x80000000;
> +        }
> +
> +        if (!selective_update || (flag & mask)) {
> +            int k;
> +
> +            if ((data + n) > eod)
> +                break;
> +
> +            if (n == 4)
> +                if (palette_video) {
> +/* here we have kind of "more" data than the output format can express */
> +                    int r, g, b, u = 0, v = 0;
> +                    for (k = 0; k < 4; ++k) {
> +                        uint32_t rr = s->pal[*data++];
> +                        r = (rr>>16)&0xff;
> +                        g = (rr>>8) &0xff;
> +                        b =  rr     &0xff;
> +/* calculate the components (https://en.wikipedia.org/wiki/YUV) */
> +                        *p++ = ((r*66+g*129+b*25+128)>>8)+16;
> +                        u += (-r*38-g*74+b*112+128)>>8;
> +                        v += (r*112-g*94-b*18+128)>>8;
> +                    }
> +                    *p++ = (u+2)/4+128;
> +                    *p++ = (v+2)/4+128;
> +                } else { /* grayscale, easy */
> +                    for (k = 0; k < 4; ++k) {
> +                        *p++ = *data++;
> +                    }
> +                    *p++ = 128;
> +                    *p++ = 128;
> +                }
> +            else { /* n == 6 */
> +/* here we'd have to handle double format conversion
> + * Cinepak=>rgb24 and then rgb24=>yuv420p, which can not be shortcut;
> + * for the moment just copying as-is, for simplicity and speed,
> + * color will be slightly off but not much */
> +                *p++ = *data++;
> +                *p++ = *data++;
> +                *p++ = *data++;
> +                *p++ = *data++;
> +                *p++ = *data++ + 128;
> +                *p++ = *data++ + 128;
> +            }
> +        } else {
> +            p += 6;
> +        }
> +    }
> +}

What we _especially_ don't do in FFmpeg is hardcoding colorspace
conversion coefficients in decoders, and doing colorspace conversion in
decoders.

(OK, we do for subtitle formats - which is actually a bad thing, but
historical mistakes.)

> +#ifdef CINEPAK_OUTPUT_FORMAT_OVERRIDE
> +    char *out_fmt_override = getenv("CINEPAK_OUTPUT_FORMAT_OVERRIDE");
> +#endif

Absolutely not acceptable.

> +    switch (avctx->pix_fmt) {
> +    case AV_PIX_FMT_RGB32:
> +        av_log(avctx, AV_LOG_INFO, "Codec output pixel format is rgb32\n");
> +        s->decode_codebook = cinepak_decode_codebook_rgb32;
> +        s->decode_vectors  = cinepak_decode_vectors_rgb32;
> +        break;

AFAIK we don't usually do INFO messages in decoders or anywhere outside
of CLI tools. I'm probably wrong, but it should be avoided anyway.
wm4 Feb. 2, 2017, 3:44 p.m. UTC | #2
On Thu, 2 Feb 2017 16:25:05 +0100
wm4 <nfxjfg@googlemail.com> wrote:

> On Wed, 1 Feb 2017 20:37:02 +0100
> u-9iep@aetey.se wrote:

> > The format can be also chosen by setting environment variable
> > CINEPAK_OUTPUT_FORMAT_OVERRIDE, if this is enabled at build time.  
> 
> It's not an environment variable, but a preprocessor variable. It's

And to correct myself on that, it's both. I didn't see the getenv at
first and forgot to correct this.

The preprocessor variable is not defined by default, but I probably got
the magnitude of code disabled by it wrong. Either way, except for very
special things (like terminal settings or HTTP proxies), we do not
check environment variables.
u-9iep@aetey.se Feb. 2, 2017, 3:59 p.m. UTC | #3
On Thu, Feb 02, 2017 at 04:25:05PM +0100, wm4 wrote:
> I can see how conversion in the decoder could speed it up somewhat
> (especially if limited by memory bandwidth, I guess), but it's not
> really how we do things in FFmpeg. Normally, conversion is done by
> libswscale (or whatever the API user uses).

I would not call "twice" for "somewhat" :)

The point is to do something that libswscale hardly can help with.

> > The format can be also chosen by setting environment variable
> > CINEPAK_OUTPUT_FORMAT_OVERRIDE, if this is enabled at build time.
> 
> It's not an environment variable, but a preprocessor variable. It's

Both are (intentionally) spelled the same which misled you to judge so.

> also not defined by default, which effectively makes some of your
> additions dead code. This is also not a thing we do in FFmpeg, and it

I am always enabling this feature anyway.
Was unsure whether activate it by default would be acceptable.

> usually creates maintenance nightmares. (For you too. Who would care
> about accidentally breaking this code when making changes to live parts
> of the decoder?)

Shall I repost the patch with this part on by default?

> > -typedef struct CinepakContext {
> > +typedef struct CinepakContext CinepakContext;
> > +struct CinepakContext {
> > +    const AVClass *class;
> >  
> >      AVCodecContext *avctx;
> >      AVFrame *frame;
> > @@ -71,24 +87,111 @@ typedef struct CinepakContext {
> >      int sega_film_skip_bytes;
> >  
> >      uint32_t pal[256];
> > -} CinepakContext;
> 
> Stray change?

No.

The compiler complained about referencing the type
being typedef'ed when I introduced function pointers into this
structure, the functions take pointers to it as args:

> > +    void (*decode_codebook)(CinepakContext *s,
> > +                            cvid_codebook *codebook, int chunk_id,
> > +                            int size, const uint8_t *data);
> > +    int  (*decode_vectors)(CinepakContext *s, cvid_strip *strip,
> > +                           int chunk_id, int size, const uint8_t *data);

> > +static void cinepak_decode_codebook_rgb32 (CinepakContext *s,
> > +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)
> 
> > +static void cinepak_decode_codebook_rgb24 (CinepakContext *s,
> > +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)
> 
> > +static void cinepak_decode_codebook_rgb565 (CinepakContext *s,
> > +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)

> So a part of the decoder is essentially duplicated for each format?

It is the irregular differences between them which are the reason
for splitting. I would not call this "duplication". If you feel
it is straightforward and important to make this more compact,
with the same performance, just go ahead.

> What we _especially_ don't do in FFmpeg is hardcoding colorspace
> conversion coefficients in decoders, and doing colorspace conversion in
> decoders.

Have you got a suggestion how to do avoid this in this case,
without sacrificing the speed?

> > +    char *out_fmt_override = getenv("CINEPAK_OUTPUT_FORMAT_OVERRIDE");

> Absolutely not acceptable.

1. Why?

2. I am open to a better solution of how to choose a format at run time
when the application lacks the knowledge for choosing the most suitable
format or does not even try to.

Any suggestion which can replace this approach?

> AFAIK we don't usually do INFO messages in decoders or anywhere outside
> of CLI tools. I'm probably wrong, but it should be avoided anyway.

This is very useful when you wish to check that you got it right
for a particular visual buffer / device, given that an apllication can
try to make its own (and possibly bad) choices. Not critical, I admit.

Regards,
Rune
Ronald S. Bultje Feb. 2, 2017, 4:16 p.m. UTC | #4
Hi Rune,

On Thu, Feb 2, 2017 at 10:59 AM, <u-9iep@aetey.se> wrote:

> On Thu, Feb 02, 2017 at 04:25:05PM +0100, wm4 wrote:
> > > +static void cinepak_decode_codebook_rgb32 (CinepakContext *s,
> > > +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t
> *data)
> >
> > > +static void cinepak_decode_codebook_rgb24 (CinepakContext *s,
> > > +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t
> *data)
> >
> > > +static void cinepak_decode_codebook_rgb565 (CinepakContext *s,
> > > +    cvid_codebook *codebook, int chunk_id, int size, const uint8_t
> *data)
>
> > So a part of the decoder is essentially duplicated for each format?
>
> It is the irregular differences between them which are the reason
> for splitting. I would not call this "duplication". If you feel
> it is straightforward and important to make this more compact,
> with the same performance, just go ahead.
>

So, typically, we wouldn't duplicate the code, we'd template it. There's
some examples in h264 how to do it. You'd have a single (av_always_inline)
decode_codebook function, which takes "format" as an argument, and then
have three av_noinline callers to it (using fmt=rgb565, fmt=rgb24 or
fmt=rgb32).

That way performance works as you want it, without the source code
duplication.

> What we _especially_ don't do in FFmpeg is hardcoding colorspace
> > conversion coefficients in decoders, and doing colorspace conversion in
> > decoders.
>
> Have you got a suggestion how to do avoid this in this case,
> without sacrificing the speed?


Ah, yes, the question. So, the code change is quite big and it does various
things, and each of these might have a better alternative or be good as-is.
fundamentally, I don't really understand how _adding_ a colorspace
conversion does any good to speed. It fundamentally always makes things
slower. So can you explain why you need to _add_ a colorspace conversion?
Why not just always output the native format? (And then do conversion in
GPU if really needed, that is always near-free.)

> > +    char *out_fmt_override = getenv("CINEPAK_OUTPUT_FORMAT_OVERRIDE");
>
> > Absolutely not acceptable.
>
> 1. Why?
>

Libavcodec is a library. Being sensitive to environment in a library, or
worse yet, affecting the environment, is typically not what is expected.
There are almost always better ways to do the same thing.

For example, in this case:


> 2. I am open to a better solution of how to choose a format at run time
> when the application lacks the knowledge for choosing the most suitable
> format or does not even try to.
>

wm4 suggested earlier to implement a get_format() function. He meant this:

https://www.ffmpeg.org/doxygen/trunk/structAVCodecContext.html#ae85c5a0e81e9f97c063881148edc28b7

Ronald
wm4 Feb. 2, 2017, 4:52 p.m. UTC | #5
On Thu, 2 Feb 2017 16:59:51 +0100
u-9iep@aetey.se wrote:

> On Thu, Feb 02, 2017 at 04:25:05PM +0100, wm4 wrote:
> > I can see how conversion in the decoder could speed it up somewhat
> > (especially if limited by memory bandwidth, I guess), but it's not
> > really how we do things in FFmpeg. Normally, conversion is done by
> > libswscale (or whatever the API user uses).  
> 
> I would not call "twice" for "somewhat" :)

Well, your original mail mentioned only speedups up to 20%.

> The point is to do something that libswscale hardly can help with.

Why not?

> > > The format can be also chosen by setting environment variable
> > > CINEPAK_OUTPUT_FORMAT_OVERRIDE, if this is enabled at build time.  
> > 
> > It's not an environment variable, but a preprocessor variable. It's  
> 
> Both are (intentionally) spelled the same which misled you to judge so.
> 
> > also not defined by default, which effectively makes some of your
> > additions dead code. This is also not a thing we do in FFmpeg, and it  
> 
> I am always enabling this feature anyway.
> Was unsure whether activate it by default would be acceptable.

The heavy code duplication has other downsides. What if someone fixes
a bug, but only in the rgb32 version and ignores the rgb565 version?

> > usually creates maintenance nightmares. (For you too. Who would care
> > about accidentally breaking this code when making changes to live parts
> > of the decoder?)  
> 
> Shall I repost the patch with this part on by default?

Not for me.

> > What we _especially_ don't do in FFmpeg is hardcoding colorspace
> > conversion coefficients in decoders, and doing colorspace conversion in
> > decoders.  
> 
> Have you got a suggestion how to do avoid this in this case,
> without sacrificing the speed?

Is there any value in this at all? It's a very old codec, that was
apparently used by some video games. What modern application of it
would there possibly be? And that in addition would require special
optimizations done by no other codec?

> > > +    char *out_fmt_override = getenv("CINEPAK_OUTPUT_FORMAT_OVERRIDE");  
> 
> > Absolutely not acceptable.  
> 
> 1. Why?

See the reply by BBB.

> 2. I am open to a better solution of how to choose a format at run time
> when the application lacks the knowledge for choosing the most suitable
> format or does not even try to.
> 
> Any suggestion which can replace this approach?

get_format would be more appropriate.

> > AFAIK we don't usually do INFO messages in decoders or anywhere outside
> > of CLI tools. I'm probably wrong, but it should be avoided anyway.  
> 
> This is very useful when you wish to check that you got it right
> for a particular visual buffer / device, given that an apllication can
> try to make its own (and possibly bad) choices. Not critical, I admit.

Add that to your application code. Or alternatively, make ffmpeg.c
print the format (this would be useful for a few things).
u-9iep@aetey.se Feb. 3, 2017, 9:08 a.m. UTC | #6
Hello Ronald,

On Thu, Feb 02, 2017 at 11:16:35AM -0500, Ronald S. Bultje wrote:
> On Thu, Feb 2, 2017 at 10:59 AM, <u-9iep@aetey.se> wrote:
> > It is the irregular differences between them which are the reason
> > for splitting. I would not call this "duplication". If you feel
> > it is straightforward and important to make this more compact,
> > with the same performance, just go ahead.

> So, typically, we wouldn't duplicate the code, we'd template it. There's
> some examples in h264 how to do it. You'd have a single (av_always_inline)
> decode_codebook function, which takes "format" as an argument, and then
> have three av_noinline callers to it (using fmt=rgb565, fmt=rgb24 or
> fmt=rgb32).
> 
> That way performance works as you want it, without the source code
> duplication.

(Thanks for the pointer. I'll look at how it is done in h264, but: )

I thought about generating the bodies of the functions from something
like a template but it did not feel like this would make the code more
understandable aka maintainable. So I wonder if there is any point in
doing this, given the same binary result.

> > What we _especially_ don't do in FFmpeg is hardcoding colorspace
> > > conversion coefficients in decoders, and doing colorspace conversion in
> > > decoders.
> >
> > Have you got a suggestion how to do avoid this in this case,
> > without sacrificing the speed?

> Ah, yes, the question. So, the code change is quite big and it does various
> things, and each of these might have a better alternative or be good as-is.
> fundamentally, I don't really understand how _adding_ a colorspace
> conversion does any good to speed. It fundamentally always makes things
> slower. So can you explain why you need to _add_ a colorspace conversion?

It moves the conversion from after the decoder, where the data to convert
is all and whole frames, to the inside where the conversion applies
to the codebooks, by the codec design much less than the output.

> Why not just always output the native format? (And then do conversion in

The "native" Cinepak format is actually unknown to swscaler, and I
seriously doubt it would make sense to add it there, which would
just largely cut down the decoding efficiency.

> GPU if really needed, that is always near-free.)

You assume one have got a GPU.

The intention is to allow usable playback on as simple/slow devices
as possible.

> > > +    char *out_fmt_override = getenv("CINEPAK_OUTPUT_FORMAT_OVERRIDE");
> >
> > > Absolutely not acceptable.
> >
> > 1. Why?
> >
> 
> Libavcodec is a library. Being sensitive to environment in a library, or
> worse yet, affecting the environment, is typically not what is expected.
> There are almost always better ways to do the same thing.

I did my best to look for a better way but it does not seem to be existing.

> For example, in this case:

> 2. I am open to a better solution of how to choose a format at run time       
> when the application lacks the knowledge for choosing the most suitable       
> format or does not even try to.                                               

> wm4 suggested earlier to implement a get_format() function. He meant this:
> 
> https://www.ffmpeg.org/doxygen/trunk/structAVCodecContext.html#ae85c5a0e81e9f97c063881148edc28b7

Unfortunately this is _not_ a solution.

I was of course aware of get_format(). Its functionality seems to
be intended for use by the applications: "decoding: Set by user".
Correct me if I am wrong (even better correct the documentation which
apparently is confusing?).

In any case, the codec has no means to know which output format is
"best" in a particular case so all it can do is to list the formats it
offers. This is done and using get_format() should work as expected,
which unfortunately does not solve anything:

There is of course hardly any application which tries to tune the codecs
to the output. (Mplayer seems to try but this does not seem to help,
possibly because its codec handling is generalized beyond ffmpeg codec
interfaces?)
Codecs efficiently supporting multiple formats are definitely a tiny
minority, so no surprize applications are not prepared to deal with this.

Applications also do lack any reliable knowledge of which format from the
codec is or is not efficient. It is only the deployment context which
can tell what is best for the particular case, which is known neither
to the codec nor to a general-purpose application.

To resum, I do not expect that there is any better solution than using
an out-of-band channel like a dedicated environment variable,
but would be of course happy to see one.

Regards,
Rune
u-9iep@aetey.se Feb. 3, 2017, 9:46 a.m. UTC | #7
On Thu, Feb 02, 2017 at 05:52:29PM +0100, wm4 wrote:
> On Thu, 2 Feb 2017 16:59:51 +0100
> u-9iep@aetey.se wrote:
> > On Thu, Feb 02, 2017 at 04:25:05PM +0100, wm4 wrote:
> > > I can see how conversion in the decoder could speed it up somewhat

> > I would not call "twice" for "somewhat" :)
> 
> Well, your original mail mentioned only speedups up to 20%.

I have to recite from the original mail:
"
Avoiding frame pixel format conversion by generating rgb565 in the decoder
for a corresponsing video buffer yields in our tests (on MMX-capable
i*86) more than twice [sic] the playback speed compared to decoding to rgb24.
"

> The heavy code duplication has other downsides. What if someone fixes
> a bug, but only in the rgb32 version and ignores the rgb565 version?

There is no guarantee that this is the same bug or that the same fix
would apply, because the functions are subtly different.

> > Have you got a suggestion how to do avoid this in this case,
> > without sacrificing the speed?
> 
> Is there any value in this at all? It's a very old codec, that was
> apparently used by some video games. What modern application of it
> would there possibly be? And that in addition would require special
> optimizations done by no other codec?

Cinepak is not a "game codec" but a solid video codec and has been used
very widely, for a long time, for a large range of applications.

For some niche applications it still provides the best result, being
a single and far away leader in decoding speed.

On a 16-bit-per-pixel output with a CPU-based decoder you will
not find _any_ over 25% of Cinepak speed. Raw video can not compete
either when indata delivery bandwidth si limited.

It has also an unused improvement margin in the encoder, still keeping
legacy decoders compatibility. The current encoder is already performing
a _way_ better than the proprietary one, so why leave this nice tool
unused where it can help?

> > Any suggestion which can replace this approach?
> 
> get_format would be more appropriate.

get_format() does not belong to the decoder, nor is it up to the task,
which I explained in a recent message.

> > This is very useful when you wish to check that you got it right
> > for a particular visual buffer / device, given that an apllication can
> > try to make its own (and possibly bad) choices. Not critical, I admit.
> 
> Add that to your application code. Or alternatively, make ffmpeg.c
> print the format (this would be useful for a few things).

I would be fine with commenting out these info-messages.

Regards,
Rune
wm4 Feb. 3, 2017, 10:14 a.m. UTC | #8
On Fri, 3 Feb 2017 10:46:12 +0100
u-9iep@aetey.se wrote:

> On Thu, Feb 02, 2017 at 05:52:29PM +0100, wm4 wrote:
> > On Thu, 2 Feb 2017 16:59:51 +0100
> > u-9iep@aetey.se wrote:  

> > The heavy code duplication has other downsides. What if someone fixes
> > a bug, but only in the rgb32 version and ignores the rgb565 version?  
> 
> There is no guarantee that this is the same bug or that the same fix
> would apply, because the functions are subtly different.

Yes, duplicated code with subtle changes is harder to maintain.

> > > Have you got a suggestion how to do avoid this in this case,
> > > without sacrificing the speed?  
> > 
> > Is there any value in this at all? It's a very old codec, that was
> > apparently used by some video games. What modern application of it
> > would there possibly be? And that in addition would require special
> > optimizations done by no other codec?  
> 
> Cinepak is not a "game codec" but a solid video codec and has been used
> very widely, for a long time, for a large range of applications.
> 
> For some niche applications it still provides the best result, being
> a single and far away leader in decoding speed.
> 
> On a 16-bit-per-pixel output with a CPU-based decoder you will
> not find _any_ over 25% of Cinepak speed. Raw video can not compete
> either when indata delivery bandwidth si limited.
> 
> It has also an unused improvement margin in the encoder, still keeping
> legacy decoders compatibility. The current encoder is already performing
> a _way_ better than the proprietary one, so why leave this nice tool
> unused where it can help?

I can't take this seriously, but I won't go and try finding a better
codec just to make a point.

> > > Any suggestion which can replace this approach?  
> > 
> > get_format would be more appropriate.  
> 
> get_format() does not belong to the decoder, nor is it up to the task,
> which I explained in a recent message.

I don't know what you're thinking, but that's just wrong.
u-9iep@aetey.se Feb. 3, 2017, 10:42 a.m. UTC | #9
On Fri, Feb 03, 2017 at 11:14:28AM +0100, wm4 wrote:
> > On a 16-bit-per-pixel output with a CPU-based decoder you will
> > not find _any_ over 25% of Cinepak speed. Raw video can not compete
> > either when indata delivery bandwidth si limited.
> > 
> > It has also an unused improvement margin in the encoder, still keeping
> > legacy decoders compatibility. The current encoder is already performing
> > a _way_ better than the proprietary one, so why leave this nice tool
> > unused where it can help?
> 
> I can't take this seriously, but I won't go and try finding a better
> codec just to make a point.

I take this as a statement that you believe something without having
the actual information.

[concerning get_format()]

> I don't know what you're thinking, but that's just wrong.

Thanks for making your point quite clear.

Have a nice day,
Rune
Ronald S. Bultje Feb. 3, 2017, 1:52 p.m. UTC | #10
Hi Rune,

On Fri, Feb 3, 2017 at 4:08 AM, <u-9iep@aetey.se> wrote:

> On Thu, Feb 02, 2017 at 11:16:35AM -0500, Ronald S. Bultje wrote:
> > On Thu, Feb 2, 2017 at 10:59 AM, <u-9iep@aetey.se> wrote:
> > > It is the irregular differences between them which are the reason
> > > for splitting. I would not call this "duplication". If you feel
> > > it is straightforward and important to make this more compact,
> > > with the same performance, just go ahead.
>
> > So, typically, we wouldn't duplicate the code, we'd template it. There's
> > some examples in h264 how to do it. You'd have a single
> (av_always_inline)
> > decode_codebook function, which takes "format" as an argument, and then
> > have three av_noinline callers to it (using fmt=rgb565, fmt=rgb24 or
> > fmt=rgb32).
> >
> > That way performance works as you want it, without the source code
> > duplication.
>
> (Thanks for the pointer. I'll look at how it is done in h264, but: )
>
> I thought about generating the bodies of the functions from something
> like a template but it did not feel like this would make the code more
> understandable aka maintainable. So I wonder if there is any point in
> doing this, given the same binary result.


wm4 has tried to make this point several times, it's about maintainability.
Let me explain, otherwise we'll keep going back and forth.

Let's assume you have a function that does a, b and c, where b depends on
the pix fmt. You're currently writing it as such:

function_565() {
a
b_565
c
}

function_24() {
a
b_24
c
}

function_32() {
a
b_32
c
}

It should be pretty obvious that a and c are triplicated in this example.
Now compare it with this:

av_always_inline function_generic(fmt) {
a
if (fmt == 565) {
b_565
} else if (fmt == 24) {
b_24
} else {
assert(fmt == 32);
b_32
}
c
}

function_565() {
funtion_generic(565);
}

function_24() {
function_generic(24);
}

function_32() {
function_generic(32);
}

It might look larger, but that's merely because we're not writing out a and
c. The key thing here is that we're no longer triplicating a and c. This is
a significant maintenance improvement. Also, because of the inline keywords
used, the performance will be identical.

Conclusion: better to maintain, identical performance. Only advantages, no
disadvantages. Should be easy to accept, right?

> > What we _especially_ don't do in FFmpeg is hardcoding colorspace
> > > > conversion coefficients in decoders, and doing colorspace conversion
> in
> > > > decoders.
> > >
> > > Have you got a suggestion how to do avoid this in this case,
> > > without sacrificing the speed?
>
> > Ah, yes, the question. So, the code change is quite big and it does
> various
> > things, and each of these might have a better alternative or be good
> as-is.
> > fundamentally, I don't really understand how _adding_ a colorspace
> > conversion does any good to speed. It fundamentally always makes things
> > slower. So can you explain why you need to _add_ a colorspace conversion?
>
> It moves the conversion from after the decoder, where the data to convert
> is all and whole frames, to the inside where the conversion applies
> to the codebooks, by the codec design much less than the output.
>
> > Why not just always output the native format? (And then do conversion in
>
> The "native" Cinepak format is actually unknown to swscaler, and I
> seriously doubt it would make sense to add it there, which would
> just largely cut down the decoding efficiency.


I see, so the codebook contains indexed entities that are re-used to
reconstruct the actual frames. That makes some sense. So, what I don't
understand then, is why below you're claiming that get_format() doesn't do
this. this is exactly what get_format() does. Why do you believe
get_format() isn't capable of helping you accomplish this?

> > > +    char *out_fmt_override = getenv("CINEPAK_OUTPUT_FORMAT_
> OVERRIDE");
> > >
> > > > Absolutely not acceptable.
> > >
> > > 1. Why?
> > >
> >
> > Libavcodec is a library. Being sensitive to environment in a library, or
> > worse yet, affecting the environment, is typically not what is expected.
> > There are almost always better ways to do the same thing.
>
> I did my best to look for a better way but it does not seem to be existing.


Look into private options, for one. But really, get_format() solves this. I
can elaborate more if you really want me to, as above, but I feel you
haven't really looked into it. I know this feeling, sometimes you've got
something that works for you and you just want to commit it and be done
with it.

But that's not going to happen. The env variable will never be committed.
Never. I guarantee it with 100% certainty. If you'd like this general
feature to be picked up in the main codebase, you'll need to change at the
very, very least how it is exposed as a selectable option. Env opts are not
acceptable, they have never been and never will. A private codec option
(av_opt_*()) might be acceptable depending on how special/non-generic it
is, but I'm still convinced that get_format() can help also.

I recommend that you play around with how get_format() is used in a few
places, or you try implementing an application that supports the
get_format() callback and use that to recommend a format to the decoder
here, then pick that up in the decoder to select the actual pix_fmt, and
use that for decoding. If you need help with that, we're obviously happy to
help. But we know this code, we know it can do this, don't tell us it's not
possible and an env variable is the only way. That's crazy.

Then make the recommended changes to prevent code duplication, and there's
an actual chance the patch will be picked up in the main codebase. That
sounds pretty straightforward, doesn't it?

(Aside, I agree with other reviewers that colorspace conversion doesn't
belong in a decoder; I'd just remove yuv420p in the codebook conversion,
the native format seems to be RGB, so yuv420p output makes no sense at all.
It appears you're primarily concerned about selecting between rgb565, 24
and 32 anyway. If you really need yuv, we need some way to import the
actual coefficients from other places, they don't belong in the decoder.)

Ronald
u-9iep@aetey.se Feb. 3, 2017, 4:36 p.m. UTC | #11
Hello Ronald,

On Fri, Feb 03, 2017 at 08:52:53AM -0500, Ronald S. Bultje wrote:
> > I thought about generating the bodies of the functions from something
> > like a template but it did not feel like this would make the code more
> > understandable aka maintainable. So I wonder if there is any point in
> > doing this, given the same binary result.
> 
> wm4 has tried to make this point several times, it's about maintainability.

I really do understand.

> Let me explain, otherwise we'll keep going back and forth.

Thanks for the constructive discussion, hopefully I can make my point
clear here, and help you see the different perspective:

> Let's assume you have a function that does a, b and c, where b depends on
> the pix fmt. You're currently writing it as such:
> 
> function_565() {
> a
> b_565
> c
> }
> 
> function_24() {
> a
> b_24
> c
> }
> 
> function_32() {
> a
> b_32
> c
> }

Rather (or even more complicated) :

function_565() {
a; x; b; y; c; z; d;
}

function_24() {
a; p; b; q; c; r; d;
}

function_32() {
a; i; b; j; c; k; d;
}

Now, a small change in any of "a", "b", "c" would not necessarily have
the same consequence for all the functions, so templating would have
made it _harder_ to make safe changes, if we'd use something like

TEMPLATE(){ a; X; b; Y; c; Z; d; }

according to your suggestion.

Do you follow me? The "common" code happens to be the same _now_ but may
have to be different due to a change in any of a,b,c,d,x,y,z,p,q,r,i,j,k;

It would be also harder to follow the code flow.

> It should be pretty obvious that a and c are triplicated in this example.

Only in this statical situation, which is the contrary to maintainability.

> Now compare it with this:

[...]

> It might look larger, but that's merely because we're not writing out a and

It is not the size but the readability and change safety.

> Conclusion: better to maintain, identical performance. Only advantages, no
> disadvantages. Should be easy to accept, right?

Hope you see from the above: this is exactly why the code is structured
the way it does.

> So, what I don't
> understand then, is why below you're claiming that get_format() doesn't do
> this. this is exactly what get_format() does. Why do you believe
> get_format() isn't capable of helping you accomplish this?

get_format() apparently returns to the caller a suitable format
from the supplied list. I read the documentation as carefully as
I can and my interpretation is that it is the application who
is to define the function and that it is the ffmpeg framework which
supplies the list of formats supported by the codec. I appreciate
if somebody corrects me and/or improves the documentation.

But actually this does not matter in the particular situation.
None of the parties (the decoder, the framework, the application)
has all the necessary knowledge to be able to make the optimal choice.
It is the human operator/administrator who may know. The choice depends
among others e.g. on how fast swscaler is on the particular hardware
(use it or not?), how much is the contents sensitive to color
depths and so on.
How can get_format() help with answering these questions??).

So get_format() is not a solution, mo matter how good or misleading
its documentation is.

> > I did my best to look for a better way but it does not seem to be existing.

> Look into private options, for one. But really, get_format() solves this. I

We must have been thinking about different problems? The problem of
choosing the optimal format to decode to is not solvable with get_format()
(unless get_format() asks the human or e.g. checks an environment
variable or uses another out-of-band channel).

> can elaborate more if you really want me to, as above, but I feel you
> haven't really looked into it. I know this feeling, sometimes you've got
> something that works for you and you just want to commit it and be done
> with it.

:) I know this too, but I am quite confident in having done my homework
properly.

You try to explain your point and educate "the newcomer", which is very
helpful and appreciated.

Hope you are also prepared to see cases where your assumptions
do not hold.

> But that's not going to happen. The env variable will never be committed.
> Never. I guarantee it with 100% certainty. If you'd like this general

Then pity for ffmpeg, rejecting a useful feature without having any
comparable functionality.

Why not reserve a namespace in envvars, like
 FFMPEG_VIDEODEC_PIXFMT_<CODEC>_<FMT>
?
This would be (of course) much more general and useful than sticking
with the name I picked. Then some other codec could easily and coherently
take advantage of the same method as well.

There are possibly other (not necessarily pixel format related) scenarios
where corresponding namespaced envvars would help as out-of-band channels.

> feature to be picked up in the main codebase, you'll need to change at the
> very, very least how it is exposed as a selectable option. Env opts are not
> acceptable, they have never been and never will. A private codec option
> (av_opt_*()) might be acceptable depending on how special/non-generic it
> is, but I'm still convinced that get_format() can help also.

Apparently, no one ever tried to regularly use the advantages of letting
the decoder convert pixel formats internally - because it hardly makes
sense outside of VQ codecs which are a minority.

Regrettably, as the result there is the lack of the corresponding framework
in ffmpeg and also the general ignorance about this strong optimization.

> I recommend that you play around with how get_format() is used in a few
> places, or you try implementing an application that supports the
> get_format() callback and use that to recommend a format to the decoder
> here, then pick that up in the decoder to select the actual pix_fmt, and
> use that for decoding. If you need help with that, we're obviously happy to
> help. But we know this code, we know it can do this, don't tell us it's not
> possible and an env variable is the only way. That's crazy.

You are thinking of a different problem than I do.
You know the code. I know the problem and have researched the solutions.

> Then make the recommended changes to prevent code duplication, and there's
> an actual chance the patch will be picked up in the main codebase. That
> sounds pretty straightforward, doesn't it?

Then I think the patch is ready for inclusion, modulo commenting out the
info messages. The environment variable support is off by default
and the maintainability of the highly similar code pieces is safe.

(no one complained about the lack of documentaton! amazing)

> (Aside, I agree with other reviewers that colorspace conversion doesn't
> belong in a decoder; I'd just remove yuv420p in the codebook conversion,
> the native format seems to be RGB, so yuv420p output makes no sense at all.

Again, you do not see the problem I am solving:

If a videobuffer is accepting yuv420p and the speed is critical, then
I want to be able to make it fast. (When you say "native" - what do
you mean? Internally Cinepak uses a yuv420-alike. For the videobuffer
"native" varies from one device to another.)

So how can I get the same speed if we'd remove yuv420p?

> It appears you're primarily concerned about selecting between rgb565, 24
> and 32 anyway. If you really need yuv, we need some way to import the
> actual coefficients from other places, they don't belong in the decoder.)

Such an import can easily be done later, if indeed this would offer an
improvement and no regression. Apparently there is no framework in place
to do such things, and in the end - _what_ would we gain?

Now I have made an honest effort to explain that I know what I am doing
and why it makes sense.

If the decision makers here do not want the contribution, it is not my fault.

Best regards and good luck,
Rune
Hendrik Leppkes Feb. 3, 2017, 4:51 p.m. UTC | #12
On Fri, Feb 3, 2017 at 5:36 PM,  <u-9iep@aetey.se> wrote:
>
>> So, what I don't
>> understand then, is why below you're claiming that get_format() doesn't do
>> this. this is exactly what get_format() does. Why do you believe
>> get_format() isn't capable of helping you accomplish this?
>
> get_format() apparently returns to the caller a suitable format
> from the supplied list. I read the documentation as carefully as
> I can and my interpretation is that it is the application who
> is to define the function and that it is the ffmpeg framework which
> supplies the list of formats supported by the codec. I appreciate
> if somebody corrects me and/or improves the documentation.
>
> But actually this does not matter in the particular situation.
> None of the parties (the decoder, the framework, the application)
> has all the necessary knowledge to be able to make the optimal choice.
> It is the human operator/administrator who may know. The choice depends
> among others e.g. on how fast swscaler is on the particular hardware
> (use it or not?), how much is the contents sensitive to color
> depths and so on.
> How can get_format() help with answering these questions??).
>
> So get_format() is not a solution, mo matter how good or misleading
> its documentation is.

"The application" can implement the get_format callback anyway it
wants, ask the user by carrier pigeon for all we care - but such user
interaction does simply not belong into the avcodec library - hence we
have the get_format callback for that, so that the decision can be
moved outside of the codec library and into the calling user-code.
Clearly whatever application you are working with should implement
these choices, and you should not try to shoe-horn this into
libavcodec, where it clearly does not belong.

We do not want hacks around the established systems just because it
doesn't fit your use-case or workflow, sorry.

- Hendrik
u-9iep@aetey.se Feb. 3, 2017, 5:37 p.m. UTC | #13
On Fri, Feb 03, 2017 at 05:51:19PM +0100, Hendrik Leppkes wrote:
> On Fri, Feb 3, 2017 at 5:36 PM,  <u-9iep@aetey.se> wrote:
> > So get_format() is not a solution, mo matter how good or misleading
> > its documentation is.
> 
> "The application" can implement the get_format callback anyway it
> wants, ask the user by carrier pigeon for all we care - but such user
> interaction does simply not belong into the avcodec library - hence we
> have the get_format callback for that, so that the decision can be
> moved outside of the codec library and into the calling user-code.
> Clearly whatever application you are working with should implement
> these choices, and you should not try to shoe-horn this into
> libavcodec, where it clearly does not belong.

You suggest I should shoe-horn this into every application.
Very helpful, thank you :)

As for "clearly", it is your personal feeling, not an argument.
Seriously.

> We do not want hacks around the established systems just because it
> doesn't fit your use-case or workflow, sorry.

You should listen more on those who actually live in their workflows.
It is there your code is being useful. Or not.

I happen to be in a suitable position (using the stuff and arranging it
for others) to estimate what is useful.

Based on this I am trying to help ffmpeg.
You are certainly free to refuse the help, for any reason :)

Rune
wm4 Feb. 3, 2017, 7:21 p.m. UTC | #14
On Fri, 3 Feb 2017 18:37:52 +0100
u-9iep@aetey.se wrote:

> On Fri, Feb 03, 2017 at 05:51:19PM +0100, Hendrik Leppkes wrote:
> > On Fri, Feb 3, 2017 at 5:36 PM,  <u-9iep@aetey.se> wrote:  
> > > So get_format() is not a solution, mo matter how good or misleading
> > > its documentation is.  
> > 
> > "The application" can implement the get_format callback anyway it
> > wants, ask the user by carrier pigeon for all we care - but such user
> > interaction does simply not belong into the avcodec library - hence we
> > have the get_format callback for that, so that the decision can be
> > moved outside of the codec library and into the calling user-code.
> > Clearly whatever application you are working with should implement
> > these choices, and you should not try to shoe-horn this into
> > libavcodec, where it clearly does not belong.  
> 
> You suggest I should shoe-horn this into every application.
> Very helpful, thank you :)
> 
> As for "clearly", it is your personal feeling, not an argument.
> Seriously.
> 
> > We do not want hacks around the established systems just because it
> > doesn't fit your use-case or workflow, sorry.  
> 
> You should listen more on those who actually live in their workflows.
> It is there your code is being useful. Or not.
> 
> I happen to be in a suitable position (using the stuff and arranging it
> for others) to estimate what is useful.
> 
> Based on this I am trying to help ffmpeg.
> You are certainly free to refuse the help, for any reason :)

He didn't refuse help, he just explained how it is. We helped you
plenty by trying to explain you the mechanisms of this library.

With your special use-case (special as in does not fit into the API
conventions of libavcodec), you might be better off with creating your
own standalone cinepak decoder. That's not a bad thing; there's plenty
of multimedia software that does not use libavcodec. Part of the reason
is that one library can't make everyone happy and can't fit all
use-cases.

Back to being "constructive" - the only way your code could get
accepted is by implementing the get_format callback. There's a
precedent, 8bps.c, which would also show you how it's done. But maybe
it should be at most only 1 other output format that handles the most
efficient case, or so. There's really no reason to add them all.
u-9iep@aetey.se Feb. 5, 2017, 10:02 a.m. UTC | #15
On Fri, Feb 03, 2017 at 08:21:37PM +0100, wm4 wrote:
> With your special use-case (special as in does not fit into the API
> conventions of libavcodec), you might be better off with creating your
> own standalone cinepak decoder. That's not a bad thing; there's plenty
> of multimedia software that does not use libavcodec. Part of the reason
> is that one library can't make everyone happy and can't fit all
> use-cases.

To make it a bit more clear, my use case is

- various devices and videobuffers
- different applications which are not feasible to patch/teach/replace

Replacing ffmpeg with something else or modifying the applications 
unfortunately does not fit there, that's also why get_format() does not help.

Fortunately there is a solution for controlling the libraries out-of-band
even if it is not exactly elegant (I am more than well aware of implications
when using environment variables). If anything better would exist I'd use
it. This matter belongs otherwise to system- and deployment administraton,
offtopic to discuss here but you can safely take my word.

This does not have to be "supported" by ffmpeg but it is useful, not
harmful to have the corresponding code at hand, at least as a reference
(disabled as it is).

An improvement would be namespacing the environment variables to be
used together with ffmpeg libraries, but as long as such use is strongly
discouraged there is no possibility to allocate a namespace.

Otherwise, you see, ffmpeg fits very well for the needs in my environment,
it _does_ make people happy.

> Back to being "constructive" - the only way your code could get
> accepted is by implementing the get_format callback. There's a

Sure.

Indeed, activating get_format() can become useful for someone,
I should not have omitted it.

This is a tiny change.

I am now posting an amended and additionally commented patch.

Regards,
Rune
Mark Thompson Feb. 5, 2017, 12:12 p.m. UTC | #16
On 05/02/17 10:02, u-9iep@aetey.se wrote:
> On Fri, Feb 03, 2017 at 08:21:37PM +0100, wm4 wrote:
>> With your special use-case (special as in does not fit into the API
>> conventions of libavcodec), you might be better off with creating your
>> own standalone cinepak decoder. That's not a bad thing; there's plenty
>> of multimedia software that does not use libavcodec. Part of the reason
>> is that one library can't make everyone happy and can't fit all
>> use-cases.
> 
> To make it a bit more clear, my use case is
> 
> - various devices and videobuffers
> - different applications which are not feasible to patch/teach/replace
> 
> Replacing ffmpeg with something else or modifying the applications 
> unfortunately does not fit there, that's also why get_format() does not help.

Even if you need to support such a use-case, doing it via the get_format() callback is still the right thing to do.  Once you've done that, all normal applications (including ffmpeg.c itself) can use the standard API as it already exists to set the output format.  For your decoding-into-framebuffer case the calling application must already be fully aware of the state of the framebuffer (after all, it has to be able to make a suitable AVFrame to pass to get_buffer2() so that you can avoid the extra copy), so adding get_format() support to also communicate the format is not onerous.

Then, if you have a proprietary application which cannot be modified because you don't have the source, you could make a shim layer like:

static enum AVPixelFormat get_format_by_env_var(pix_fmt_list)
{
    requested_pix_fmt = getenv(SOMETHING);
    if (requested_pix_fmt in pix_fmt_list)
        return requested_pix_fmt;
    else
        error;
}

int avcodec_open2(AVCodecContext *avctx)
{
    avctx->get_format = &get_format_by_env_var;
    return real_avcodec_open2(avctx);
}

and LD_PRELOAD it into the application to achieve the same result (insofar as that is possible - it seems unlikely that it will be able to use get_buffer() appropriately, so there are probably going to be more redundant copies in the application and you would need to patch it directly to eliminate them).

- Mark
u-9iep@aetey.se Feb. 5, 2017, 1:56 p.m. UTC | #17
Hello Mark,

On Sun, Feb 05, 2017 at 12:12:37PM +0000, Mark Thompson wrote:
> On 05/02/17 10:02, u-9iep@aetey.se wrote:
> > To make it a bit more clear, my use case is
> > 
> > - various devices and videobuffers
> > - different applications which are not feasible to patch/teach/replace
> > 
> > Replacing ffmpeg with something else or modifying the applications 
> > unfortunately does not fit there, that's also why get_format() does not help.

> Even if you need to support such a use-case, doing it via the get_format() callback is still the right thing to do.  Once you've done that, all normal applications (including ffmpeg.c itself) can use the standard API as it already exists to set the output format.  For your decoding-into-framebuffer case the calling application must already be fully aware of the state of the framebuffer (after all, it has to be able to make a suitable AVFrame to pass to get_buffer2() so that you can avoid the extra copy), so adding get_format() support to also communicate the format is not onerous.

Note that it is generally impossible to let the application decide
what to choose. Sometimes this may work, but the applications lack
the relevant needed knowledge, which is not guessable from
"what the layers before and after me report as supported formats
in a certain order".

> Then, if you have a proprietary application which cannot be modified because you don't have the source, you could make a shim layer like:

Proprietary or not, I can hardly modify them.

A shim layer is certainly a possible workaround, but from my
perspective it would be "moving from a mildly inelegant
to a basically broken and unreliable technology".

The problem is
<offtopic - system administration>
the technology of LD_*, an old and really bad design by itself.
Compared to a _library_specific_ envvar it is a long way father
from being usable as a general solution.
Note that an LD_* variable affects _linking_ (which is very intrusive)
for _all_ programs, related or not, in all child processes.
Note also that some applications do play tricks with LD_* on their own.
Have I said enough? :(
</offtopic>

> static enum AVPixelFormat get_format_by_env_var(pix_fmt_list)
> {
>     requested_pix_fmt = getenv(SOMETHING);
>     if (requested_pix_fmt in pix_fmt_list)
>         return requested_pix_fmt;
>     else
>         error;
> }

Exactly, but in my situation it is much more robust and easier to
enable the corresponding code in the decoder (or even add it there on
my own in the worst case) than play with binary patching on the fly,
which dynamic linking basically is.

So instead of forcing the possible fellow sysadmins in a similar situation
to patch, it would we nice to just let them build lilbavcodec with
this slightly non-standard (and pretty safe) behaviour.

> and LD_PRELOAD it into the application to achieve the same result (insofar as that is possible - it seems unlikely that it will be able to use get_buffer() appropriately, so there are probably going to be more redundant copies in the application and you would need to patch it directly to eliminate them).

You see.

In any case, thanks for understanding the original problem.

Regards,
Rune
wm4 Feb. 6, 2017, 6:41 a.m. UTC | #18
On Sun, 5 Feb 2017 14:56:26 +0100
u-9iep@aetey.se wrote:

> Hello Mark,
> 
> On Sun, Feb 05, 2017 at 12:12:37PM +0000, Mark Thompson wrote:
> > On 05/02/17 10:02, u-9iep@aetey.se wrote:  
> > > To make it a bit more clear, my use case is
> > > 
> > > - various devices and videobuffers
> > > - different applications which are not feasible to patch/teach/replace
> > > 
> > > Replacing ffmpeg with something else or modifying the applications 
> > > unfortunately does not fit there, that's also why get_format() does not help.  
> 
> > Even if you need to support such a use-case, doing it via the get_format() callback is still the right thing to do.  Once you've done that, all normal applications (including ffmpeg.c itself) can use the standard API as it already exists to set the output format.  For your decoding-into-framebuffer case the calling application must already be fully aware of the state of the framebuffer (after all, it has to be able to make a suitable AVFrame to pass to get_buffer2() so that you can avoid the extra copy), so adding get_format() support to also communicate the format is not onerous.  
> 
> Note that it is generally impossible to let the application decide
> what to choose. Sometimes this may work, but the applications lack
> the relevant needed knowledge, which is not guessable from
> "what the layers before and after me report as supported formats
> in a certain order".
> 
> > Then, if you have a proprietary application which cannot be modified because you don't have the source, you could make a shim layer like:  
> 
> Proprietary or not, I can hardly modify them.
> 
> A shim layer is certainly a possible workaround, but from my
> perspective it would be "moving from a mildly inelegant
> to a basically broken and unreliable technology".
> 
> The problem is
> <offtopic - system administration>
> the technology of LD_*, an old and really bad design by itself.
> Compared to a _library_specific_ envvar it is a long way father
> from being usable as a general solution.
> Note that an LD_* variable affects _linking_ (which is very intrusive)
> for _all_ programs, related or not, in all child processes.
> Note also that some applications do play tricks with LD_* on their own.
> Have I said enough? :(
> </offtopic>
> 
> > static enum AVPixelFormat get_format_by_env_var(pix_fmt_list)
> > {
> >     requested_pix_fmt = getenv(SOMETHING);
> >     if (requested_pix_fmt in pix_fmt_list)
> >         return requested_pix_fmt;
> >     else
> >         error;
> > }  
> 
> Exactly, but in my situation it is much more robust and easier to
> enable the corresponding code in the decoder (or even add it there on
> my own in the worst case) than play with binary patching on the fly,
> which dynamic linking basically is.
> 
> So instead of forcing the possible fellow sysadmins in a similar situation
> to patch, it would we nice to just let them build lilbavcodec with
> this slightly non-standard (and pretty safe) behaviour.
> 
> > and LD_PRELOAD it into the application to achieve the same result (insofar as that is possible - it seems unlikely that it will be able to use get_buffer() appropriately, so there are probably going to be more redundant copies in the application and you would need to patch it directly to eliminate them).  
> 
> You see.
> 
> In any case, thanks for understanding the original problem.

I think you don't get it. There will be no such environment variable
use in the libraries, ever. The discussion about this was over a long
time ago.
u-9iep@aetey.se Feb. 7, 2017, 5:04 p.m. UTC | #19
On Fri, Feb 03, 2017 at 11:42:03AM +0100, u-9iep@aetey.se wrote:
> On Fri, Feb 03, 2017 at 11:14:28AM +0100, wm4 wrote:
> > > On a 16-bit-per-pixel output with a CPU-based decoder you will
> > > not find _any_ over 25% of Cinepak speed. Raw video can not compete
> > > either when indata delivery bandwidth si limited.

> > I can't take this seriously, but I won't go and try finding a better
> > codec just to make a point.

Of course any benchmark result makes sense only within a carefully
specified context. It was incautious of me to make a statement in
such a general way.

I have to admit that the number 25% was not measured but extrapolated
from a specific test I made three years ago.
It is impossible to repeat that test (the hardware is gone) and the video
data was a specific sample which may have incurred a bias.

That's why now I have done a new test, with 15 minutes taken from a
(single) random movie as the input, compressed with ffmpeg to mpeg1video
at quality 8 and to cinepak with default quality settings, which in this
case yielded comparable quality.

When I take the user part of the output of
"time mplayer -fps 10000 -vo null -quiet -vf format=fmt=bgr16 -sws 0" [*]

I get the following numbers on two different computers with Intel CPUs
running 32-bit code:  (averaging/rounding after three runs)

mpeg1, yuv420p            23.6     (via the fast bilinear swscaler)
cinepak, rgb24            19.7     (via the fast bilinear swscaler)
cinepak, internal rgb565   6.0

mpeg1, yuv420p            52.1     (via the fast bilinear swscaler)
cinepak, rgb24            43.1     (via the fast bilinear swscaler)
cinepak, internal rgb565  10.3

This corresponds to mpeg1 being for rgb565 output about 4-5 times slower
than cinepak.

I doubt that there is any other applicable codec with a faster decoder.
Tell me if you find it, this will be appreciated.

The "high color" 15-bit VQA was near cinepak in speed but it is
unfortunately not practically usable.

Regards,
Rune

[*] Remarkably, mplayer, an application with tight ties to ffmpeg,
does not seem to choose the appropriate decoder output format itself,
despite the availability of get_format() functionality...
Would you tell me which video player does? or where the bug is?
Ronald S. Bultje Feb. 7, 2017, 5:54 p.m. UTC | #20
Hi,

On Tue, Feb 7, 2017 at 12:04 PM, <u-9iep@aetey.se> wrote:

> cinepak, rgb24            19.7     (via the fast bilinear swscaler)
> cinepak, internal rgb565   6.0


The reason that your decoder-integrated colorspace convertor is so much
faster than swscale is because swscale is converting internally to yuv420p
using a scaling filter, and then back to rgb565 using another scaling
filter. This is "easily" solved by adding a direct (possibly
x86-simd-accelerated) rgb24-to-rgb565 converter into
libswscale/swscale_unscaled.c, which would likely have nearly identical
performance to what you're seeing here. Possibly even faster, because
you're allowing for simd optimizations.

HTH,
Ronald
wm4 Feb. 7, 2017, 6:05 p.m. UTC | #21
On Tue, 7 Feb 2017 12:54:23 -0500
"Ronald S. Bultje" <rsbultje@gmail.com> wrote:

> Hi,
> 
> On Tue, Feb 7, 2017 at 12:04 PM, <u-9iep@aetey.se> wrote:
> 
> > cinepak, rgb24            19.7     (via the fast bilinear swscaler)
> > cinepak, internal rgb565   6.0  
> 
> 
> The reason that your decoder-integrated colorspace convertor is so much
> faster than swscale is because swscale is converting internally to yuv420p
> using a scaling filter, and then back to rgb565 using another scaling
> filter. This is "easily" solved by adding a direct (possibly
> x86-simd-accelerated) rgb24-to-rgb565 converter into
> libswscale/swscale_unscaled.c, which would likely have nearly identical
> performance to what you're seeing here. Possibly even faster, because
> you're allowing for simd optimizations.

I'm also wondering how much performance the 3 byte rgb24 format costs
as opposed to using a 4 byte padded format like rgb0.
Daniel Verkamp Feb. 8, 2017, 4:07 a.m. UTC | #22
On Tue, Feb 7, 2017 at 10:54 AM, Ronald S. Bultje <rsbultje@gmail.com> wrote:
> Hi,
>
> On Tue, Feb 7, 2017 at 12:04 PM, <u-9iep@aetey.se> wrote:
>
>> cinepak, rgb24            19.7     (via the fast bilinear swscaler)
>> cinepak, internal rgb565   6.0
>
>
> The reason that your decoder-integrated colorspace convertor is so much
> faster than swscale is because swscale is converting internally to yuv420p
> using a scaling filter, and then back to rgb565 using another scaling
> filter. This is "easily" solved by adding a direct (possibly
> x86-simd-accelerated) rgb24-to-rgb565 converter into
> libswscale/swscale_unscaled.c, which would likely have nearly identical
> performance to what you're seeing here. Possibly even faster, because
> you're allowing for simd optimizations.

There is already a rgb24-to-rgb565 path, but it does not get used by
default because of the needsDither check in ff_get_unscaled_swscale().
If the scaling algorithm is set to fast_bilinear, needsDither is
ignored and the RGB24 to RGB16 converter is used. (In either case, no
actual scaling is performed, so the scaling algorithm is irrelevant
aside from selecting dithering.) Looking at the mplayer docs, this is
probably equivalent to '-sws 0'.

e.g.
  ffmpeg -i <sample> -f null -c rawvideo -pix_fmt rgb565le -sws_flags
fast_bilinear /dev/null

Using matrixbench_mpeg2.mpg (720x567) encoded with ffmpeg into Cinepak
using default settings, decoding on an i5 3570K, 3.4 GHz:

bicubic (default):          ~24x realtime
fast_bilinear:              ~65x realtime
patch w/rgb565 override:    ~154x realtime

As far as I can tell, this patch doesn't do any dithering for RGB565,
so the fast_bilinear (non-dithering) swscale converter is a fair
comparison.

-- Daniel
u-9iep@aetey.se Feb. 8, 2017, 7:45 a.m. UTC | #23
On Tue, Feb 07, 2017 at 12:54:23PM -0500, Ronald S. Bultje wrote:
> On Tue, Feb 7, 2017 at 12:04 PM, <u-9iep@aetey.se> wrote:
> 
> > cinepak, rgb24            19.7     (via the fast bilinear swscaler)
> > cinepak, internal rgb565   6.0
> 
> 
> The reason that your decoder-integrated colorspace convertor is so much
> faster than swscale is because swscale is converting internally to yuv420p
> using a scaling filter, and then back to rgb565 using another scaling
> filter.

(If this indeed happens, this does not sound exactly efficient, nor
is a conversion to yuv420p lossless itself, IOW a shame?)

> This is "easily" solved by adding a direct (possibly
> x86-simd-accelerated) rgb24-to-rgb565 converter into
> libswscale/swscale_unscaled.c, which would likely have nearly identical
> performance to what you're seeing here. Possibly even faster, because
> you're allowing for simd optimizations.

This unfortunately can not come near an identical performance because
it would have to work on several times more data (frame vs codebook).

Besides that, there would be at least an extra copy operation over
each frame, even if the conversion itself would be indefinitely fast.

Generally:

I value layered design as much as you do, but it introduces limitations.

For comparison, an example from a different domain, but well known:
ZFS. It shortcut several "design layers" in the storage subsystem
which allowed a lot of improvement and did not render ZFS unmaintainable.

The shortcuts I add (not introduce, just add to the present ones) are
of exactly the same nature as a specialized converter in libswscale.
I just add them in a place where they are several times more efficient
(the amount of data to handle).

Cinepak is hardly going to make an impact similar to ZFS now :)
but it has in fact been very big before.

In certain aspects, by the original design, it is still superior to
virtualy anything else at hand. With the proposed optimization we get the
best out of its virtues. It would be a waste to ignore the possibility.

Regards,
Rune
u-9iep@aetey.se Feb. 8, 2017, 8:57 a.m. UTC | #24
On Tue, Feb 07, 2017 at 09:07:02PM -0700, Daniel Verkamp wrote:
> There is already a rgb24-to-rgb565 path, but it does not get used by
> default because of the needsDither check in ff_get_unscaled_swscale().
> If the scaling algorithm is set to fast_bilinear, needsDither is
> ignored and the RGB24 to RGB16 converter is used. (In either case, no
> actual scaling is performed, so the scaling algorithm is irrelevant
> aside from selecting dithering.) Looking at the mplayer docs, this is
> probably equivalent to '-sws 0'.
> 
> e.g.
>   ffmpeg -i <sample> -f null -c rawvideo -pix_fmt rgb565le -sws_flags
> fast_bilinear /dev/null
> 
> Using matrixbench_mpeg2.mpg (720x567) encoded with ffmpeg into Cinepak
> using default settings, decoding on an i5 3570K, 3.4 GHz:
> 
> bicubic (default):          ~24x realtime
> fast_bilinear:              ~65x realtime
> patch w/rgb565 override:    ~154x realtime
> 
> As far as I can tell, this patch doesn't do any dithering for RGB565,
> so the fast_bilinear (non-dithering) swscale converter is a fair
> comparison.

Thanks Daniel.

Yes, the patch does accurate rounding, no dithering.

Rune
Ronald S. Bultje Feb. 8, 2017, 2:02 p.m. UTC | #25
Hi,

On Wed, Feb 8, 2017 at 2:45 AM, <u-9iep@aetey.se> wrote:

> This unfortunately can not come near an identical performance because
> it would have to work on several times more data (frame vs codebook).

[..]

> I value layered design as much as you do, but it introduces limitations.

[..]

> In certain aspects, by the original design, it is still superior to
> virtualy anything else at hand. With the proposed optimization we get the
> best out of its virtues. It would be a waste to ignore the possibility.


I encourage you to fork the code and publish the faster decoder so others
with use cases like yours can use it. It's free software, you're allowed
and encouraged to do that.

I just don't think it belongs in upstream FFmpeg.

(Going back to the native format discussion, I looked at decode_codebook
and it appears the native format is actually 8-bit/component YCgCo; it's
interesting that we chose rgb24 as output format, I'm guessing it's because
YCgCo support in FFmpeg is non-existent.)

Ronald
u-9iep@aetey.se Feb. 8, 2017, 6:30 p.m. UTC | #26
On Wed, Feb 08, 2017 at 09:02:44AM -0500, Ronald S. Bultje wrote:
> I encourage you to fork the code and publish the faster decoder so others
> with use cases like yours can use it. It's free software, you're allowed
> and encouraged to do that.

I considered this possibility. My scope of contribution is though "a
useful modification". Leading/maintaining a multimedia-tool-project is
a very different thing and not something I am interested in.

> I just don't think it belongs in upstream FFmpeg.

This depends of course on the idea of what FFmpeg is.
I am not qualified to make such a definition, given my limited
engagement here.

My expectation was "the Swiss Army Knife of multimedia" and "the most
efficient tool in its class". If either of those is not or no longer
relevant, not my thing to decide, then the patch is indeed wrongly
addressed.

> (Going back to the native format discussion, I looked at decode_codebook
> and it appears the native format is actually 8-bit/component YCgCo; it's

Not exactly, it is similar but Cinepak-specific, it is also optimized for
conversion to rgb.

> interesting that we chose rgb24 as output format, I'm guessing it's because

It is the natural target format of Cinepak's design.

> YCgCo support in FFmpeg is non-existent.)

Kind of.
(https://ffmpeg.org/pipermail/ffmpeg-devel/2013-February/138811.html)

This was fortunate, allowing to notice the gain associated with filling
the codebook with the converted data, compared to doing the conversion
on the resulting frame.

From the commit message Mon Feb 18 18:00:33 2013 +0100
"
    old decoder + conversion to rgb:         fps = 2618
    old decoder, without converting to rgb:  fps = 4012
    
    new decoder, producing rgb:              fps = 4502
"

Regards,
Rune
diff mbox

Patch

diff --git a/libavcodec/cinepak.c b/libavcodec/cinepak.c
index d657e9c0c1..76105fcc0c 100644
--- a/libavcodec/cinepak.c
+++ b/libavcodec/cinepak.c
@@ -31,6 +31,8 @@ 
  *
  * Cinepak colorspace support (c) 2013 Rl, Aetey Global Technologies AB
  * @author Cinepak colorspace, Rl, Aetey Global Technologies AB
+ * Extra output formats / optimizations (c) 2017 Rl, Aetey Global Technologies AB
+ * @author Extra output formats / optimizations, Rl, Aetey Global Technologies AB
  */
 
 #include <stdio.h>
@@ -39,23 +41,37 @@ 
 
 #include "libavutil/common.h"
 #include "libavutil/intreadwrite.h"
+#include "libavutil/opt.h"
+/* #include "libavutil/avassert.h" */
 #include "avcodec.h"
 #include "internal.h"
 
+/* rounding to nearest is quite cheap in my tests
+ * and yields a remarkable quality improvement
+ * compared to simple truncation -- rl */
+#define PACK_RGB_RGB565(r,g,b) (((av_clip_uint8((r)+4)>>3)<<11)|((av_clip_uint8((g)+2)>>2)<<5)|(av_clip_uint8((b)+4)>>3))
 
-typedef uint8_t cvid_codebook[12];
+typedef union cvid_codebook {
+    uint32_t  rgb32[256][ 4];
+    uint8_t   rgb24[256][12];
+    uint16_t rgb565[256][ 4];
+    uint8_t  yuv420[256][ 6];
+    uint8_t    pal8[256][ 4];
+} cvid_codebook;
 
-#define MAX_STRIPS      32
+#define MAX_STRIPS      32    /* an arbitrary limit -- rl */
 
 typedef struct cvid_strip {
     uint16_t          id;
     uint16_t          x1, y1;
     uint16_t          x2, y2;
-    cvid_codebook     v4_codebook[256];
-    cvid_codebook     v1_codebook[256];
+    cvid_codebook     v4_codebook;
+    cvid_codebook     v1_codebook;
 } cvid_strip;
 
-typedef struct CinepakContext {
+typedef struct CinepakContext CinepakContext;
+struct CinepakContext {
+    const AVClass *class;
 
     AVCodecContext *avctx;
     AVFrame *frame;
@@ -71,24 +87,111 @@  typedef struct CinepakContext {
     int sega_film_skip_bytes;
 
     uint32_t pal[256];
-} CinepakContext;
 
-static void cinepak_decode_codebook (cvid_codebook *codebook,
-                                     int chunk_id, int size, const uint8_t *data)
+    void (*decode_codebook)(CinepakContext *s,
+                            cvid_codebook *codebook, int chunk_id,
+                            int size, const uint8_t *data);
+    int  (*decode_vectors)(CinepakContext *s, cvid_strip *strip,
+                           int chunk_id, int size, const uint8_t *data);
+/* options */
+    enum AVPixelFormat out_pixfmt;
+};
+
+#define OFFSET(x) offsetof(CinepakContext, x)
+#define VD AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_DECODING_PARAM
+static const AVOption options[] = {
+{"output_pixel_format", "set output pixel format: rgb24/rgb32/rgb565/yuv420p/pal8; yuv420p is approximate", OFFSET(out_pixfmt), AV_OPT_TYPE_PIXEL_FMT, {.i64=AV_PIX_FMT_RGB24}, -1, INT_MAX, VD },
+    { NULL },
+};
+
+static const AVClass cinepak_class = {
+    .class_name = "cinepak decoder",
+    .item_name  = av_default_item_name,
+    .option     = options,
+    .version    = LIBAVUTIL_VERSION_INT,
+};
+
+static void cinepak_decode_codebook_rgb32 (CinepakContext *s,
+    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)
+{
+    const uint8_t *eod = (data + size);
+    uint32_t flag, mask;
+    int      i, n;
+    uint32_t *p;
+    int palette_video = s->palette_video;
+    int selective_update = (chunk_id & 0x01);
+
+    /* check if this chunk contains 4- or 6-element vectors */
+    n    = (chunk_id & 0x04) ? 4 : 6;
+    flag = 0;
+    mask = 0;
+
+    p = codebook->rgb32[0];
+    for (i=0; i < 256; i++) {
+        if (selective_update && !(mask >>= 1)) {
+            if ((data + 4) > eod)
+                break;
+
+            flag  = AV_RB32 (data);
+            data += 4;
+            mask  = 0x80000000;
+        }
+
+        if (!selective_update || (flag & mask)) {
+            int k;
+
+            if ((data + n) > eod)
+                break;
+
+            if (n == 4)
+                if (palette_video)
+                    for (k = 0; k < 4; ++k)
+                        *p++ = s->pal[*data++]; /* this is easy */
+                else
+                    for (k = 0; k < 4; ++k) {
+                        int r = *data++;
+/* in some situations we might not have to set opacity */
+                        *p++ = /**/ (255<<24)| /**/ (r<<16)|(r<<8)|r;
+                    }
+            else { /* n == 6 */
+                int y, u, v;
+                u = (int8_t)data[4];
+                v = (int8_t)data[5];
+                for(k=0; k<4; ++k) {
+                    y = *data++;
+/* in some situations we might not have to set opacity */
+                    *p++ = /**/ (255<<24)| /**/
+/* here the cinepak color space excels */
+                           (av_clip_uint8(y + v*2)<<16)|
+                           (av_clip_uint8(y - (u/2) - v)<<8)|
+                            av_clip_uint8(y + u*2);
+                }
+                data += 2;
+            }
+        } else {
+            p += 4;
+        }
+    }
+}
+
+static void cinepak_decode_codebook_rgb24 (CinepakContext *s,
+    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)
 {
     const uint8_t *eod = (data + size);
     uint32_t flag, mask;
     int      i, n;
     uint8_t *p;
+    int palette_video = s->palette_video;
+    int selective_update = (chunk_id & 0x01);
 
     /* check if this chunk contains 4- or 6-element vectors */
     n    = (chunk_id & 0x04) ? 4 : 6;
     flag = 0;
     mask = 0;
 
-    p = codebook[0];
+    p = codebook->rgb24[0];
     for (i=0; i < 256; i++) {
-        if ((chunk_id & 0x01) && !(mask >>= 1)) {
+        if (selective_update && !(mask >>= 1)) {
             if ((data + 4) > eod)
                 break;
 
@@ -97,31 +200,38 @@  static void cinepak_decode_codebook (cvid_codebook *codebook,
             mask  = 0x80000000;
         }
 
-        if (!(chunk_id & 0x01) || (flag & mask)) {
+        if (!selective_update || (flag & mask)) {
             int k, kk;
 
             if ((data + n) > eod)
                 break;
 
-            for (k = 0; k < 4; ++k) {
-                int r = *data++;
-                for (kk = 0; kk < 3; ++kk)
-                    *p++ = r;
-            }
-            if (n == 6) {
-                int r, g, b, u, v;
-                u = *(int8_t *)data++;
-                v = *(int8_t *)data++;
-                p -= 12;
+            if (n == 4)
+                if (palette_video)
+                    for (k = 0; k < 4; ++k) {
+                        uint32_t r = s->pal[*data++];
+                        *p++ = (r>>16)&0xff;
+                        *p++ = (r>>8) &0xff;
+                        *p++ =  r     &0xff;
+                    }
+                else
+                    for (k = 0; k < 4; ++k) {
+                        int r = *data++;
+                        for (kk = 0; kk < 3; ++kk)
+                            *p++ = r;
+                    }
+            else { /* n == 6 */
+                int y, u, v;
+                u = (int8_t)data[4];
+                v = (int8_t)data[5];
                 for(k=0; k<4; ++k) {
-                    r = *p++ + v*2;
-                    g = *p++ - (u/2) - v;
-                    b = *p   + u*2;
-                    p -= 2;
-                    *p++ = av_clip_uint8(r);
-                    *p++ = av_clip_uint8(g);
-                    *p++ = av_clip_uint8(b);
+                    y = *data++;
+/* here the cinepak color space excels */
+                    *p++ = av_clip_uint8(y + v*2);
+                    *p++ = av_clip_uint8(y - (u/2) - v);
+                    *p++ = av_clip_uint8(y + u*2);
                 }
+                data += 2;
             }
         } else {
             p += 12;
@@ -129,14 +239,296 @@  static void cinepak_decode_codebook (cvid_codebook *codebook,
     }
 }
 
-static int cinepak_decode_vectors (CinepakContext *s, cvid_strip *strip,
+static void cinepak_decode_codebook_rgb565 (CinepakContext *s,
+    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)
+{
+    const uint8_t *eod = (data + size);
+    uint32_t flag, mask;
+    int      i, n;
+    uint16_t *p;
+    int palette_video = s->palette_video;
+    int selective_update = (chunk_id & 0x01);
+
+    /* check if this chunk contains 4- or 6-element vectors */
+    n    = (chunk_id & 0x04) ? 4 : 6;
+    flag = 0;
+    mask = 0;
+
+    p = codebook->rgb565[0];
+    for (i=0; i < 256; i++) {
+        if (selective_update && !(mask >>= 1)) {
+            if ((data + 4) > eod)
+                break;
+
+            flag  = AV_RB32 (data);
+            data += 4;
+            mask  = 0x80000000;
+        }
+
+        if (!selective_update || (flag & mask)) {
+            int k;
+
+            if ((data + n) > eod)
+                break;
+
+            if (n == 4)
+                if (palette_video)
+                    for (k = 0; k < 4; ++k) {
+                        uint32_t r = s->pal[*data++];
+                        *p++ = PACK_RGB_RGB565((r>>16)&0xff,
+                                               (r>>8)&0xff,
+                                                r&0xff);
+                    }
+                else
+                    for (k = 0; k < 4; ++k) {
+                        int r = *data++;
+                        *p++ = PACK_RGB_RGB565(r,r,r);
+                    }
+            else { /* n == 6 */
+                int y, u, v;
+                u = (int8_t)data[4];
+                v = (int8_t)data[5];
+                for(k=0; k<4; ++k) {
+                    y = *data++;
+/* here the cinepak color space excels */
+                    *p++ = PACK_RGB_RGB565(y + v*2,
+                                           y - (u/2) - v,
+                                           y + u*2);
+                }
+                data += 2;
+            }
+        } else {
+            p += 4;
+        }
+    }
+}
+
+/* a simplistic version to begin with, it is also fast -- rl */
+static void cinepak_decode_codebook_yuv420 (CinepakContext *s,
+    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)
+{
+    const uint8_t *eod = (data + size);
+    uint32_t flag, mask;
+    int      i, n;
+    uint8_t *p;
+    int palette_video = s->palette_video;
+    int selective_update = (chunk_id & 0x01);
+
+    /* check if this chunk contains 4- or 6-element vectors */
+    n    = (chunk_id & 0x04) ? 4 : 6;
+    flag = 0;
+    mask = 0;
+
+    p = codebook->yuv420[0];
+    for (i=0; i < 256; i++) {
+        if (selective_update && !(mask >>= 1)) {
+            if ((data + 4) > eod)
+                break;
+
+            flag  = AV_RB32 (data);
+            data += 4;
+            mask  = 0x80000000;
+        }
+
+        if (!selective_update || (flag & mask)) {
+            int k;
+
+            if ((data + n) > eod)
+                break;
+
+            if (n == 4)
+                if (palette_video) {
+/* here we have kind of "more" data than the output format can express */
+                    int r, g, b, u = 0, v = 0;
+                    for (k = 0; k < 4; ++k) {
+                        uint32_t rr = s->pal[*data++];
+                        r = (rr>>16)&0xff;
+                        g = (rr>>8) &0xff;
+                        b =  rr     &0xff;
+/* calculate the components (https://en.wikipedia.org/wiki/YUV) */
+                        *p++ = ((r*66+g*129+b*25+128)>>8)+16;
+                        u += (-r*38-g*74+b*112+128)>>8;
+                        v += (r*112-g*94-b*18+128)>>8;
+                    }
+                    *p++ = (u+2)/4+128;
+                    *p++ = (v+2)/4+128;
+                } else { /* grayscale, easy */
+                    for (k = 0; k < 4; ++k) {
+                        *p++ = *data++;
+                    }
+                    *p++ = 128;
+                    *p++ = 128;
+                }
+            else { /* n == 6 */
+/* here we'd have to handle double format conversion
+ * Cinepak=>rgb24 and then rgb24=>yuv420p, which can not be shortcut;
+ * for the moment just copying as-is, for simplicity and speed,
+ * color will be slightly off but not much */
+                *p++ = *data++;
+                *p++ = *data++;
+                *p++ = *data++;
+                *p++ = *data++;
+                *p++ = *data++ + 128;
+                *p++ = *data++ + 128;
+            }
+        } else {
+            p += 6;
+        }
+    }
+}
+
+/* here we do not expect anything besides palettized video,
+ * nor check the data for validity, which should be ok
+ * as long as we do not write beyond the bounds */
+static void cinepak_decode_codebook_pal8 (CinepakContext *s,
+    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data)
+{
+    const uint8_t *eod = (data + size);
+    uint32_t flag, mask;
+    int      i;
+    uint8_t *p;
+    int selective_update = (chunk_id & 0x01);
+
+#define PAL8_VECTOR_LENGTH 4
+/*    av_assert0(chunk_id & 0x04); */
+
+    flag = 0;
+    mask = 0;
+
+    p = codebook->pal8[0];
+    for (i=0; i < 256; i++) {
+        if (selective_update && !(mask >>= 1)) {
+            if ((data + 4) > eod)
+                break;
+
+            flag  = AV_RB32 (data);
+            data += 4;
+            mask  = 0x80000000;
+        }
+
+        if (!selective_update || (flag & mask)) {
+            int k;
+
+            if ((data + PAL8_VECTOR_LENGTH) > eod)
+                break;
+
+            for (k = 0; k < 4; ++k) 
+                *p++ = *data++;
+        } else {
+            p += 4;
+        }
+    }
+}
+
+static int cinepak_decode_vectors_rgb32 (CinepakContext *s, cvid_strip *strip,
+                                   int chunk_id, int size, const uint8_t *data)
+{
+    const uint8_t   *eod = (data + size);
+    uint32_t         flag, mask;
+    uint32_t         *cb0, *cb1, *cb2, *cb3;
+    int              x, y;
+    char            *ip0, *ip1, *ip2, *ip3;
+    int selective_update = (chunk_id & 0x01);
+    int v1_only          = (chunk_id & 0x02);
+
+    flag = 0;
+    mask = 0;
+
+    for (y=strip->y1; y < strip->y2; y+=4) {
+
+/* take care of y dimension not being multiple of 4, such streams exist */
+        ip0 = ip1 = ip2 = ip3 = s->frame->data[0] +
+                                strip->x1*4 + y*s->frame->linesize[0];
+        if(s->avctx->height - y > 1) {
+            ip1 = ip0 + s->frame->linesize[0];
+            if(s->avctx->height - y > 2) {
+                ip2 = ip1 + s->frame->linesize[0];
+                if(s->avctx->height - y > 3) {
+                    ip3 = ip2 + s->frame->linesize[0];
+                }
+            }
+        }
+/* to get the correct picture for not-multiple-of-4 cases let us fill each
+ * block from the bottom up, thus possibly overwriting the bottommost line
+ * more than once but ending with the correct data in place
+ * (instead of in-loop checking) */
+
+        for (x=strip->x1; x < strip->x2; x+=4) {
+            if (selective_update && !(mask >>= 1)) {
+                if ((data + 4) > eod)
+                    return AVERROR_INVALIDDATA;
+
+                flag  = AV_RB32 (data);
+                data += 4;
+                mask  = 0x80000000;
+            }
+
+            if (!selective_update || (flag & mask)) {
+                if (!v1_only && !(mask >>= 1)) {
+                    if ((data + 4) > eod)
+                        return AVERROR_INVALIDDATA;
+
+                    flag  = AV_RB32 (data);
+                    data += 4;
+                    mask  = 0x80000000;
+                }
+
+                if (v1_only || (~flag & mask)) {
+                    uint32_t *p;
+                    if (data >= eod)
+                        return AVERROR_INVALIDDATA;
+
+                    p = strip->v1_codebook.rgb32[*data++] + 2; /* ... + 8 */
+                    memcpy(ip3 + 0, p, 4); memcpy(ip3 + 4, p, 4);
+                    memcpy(ip2 + 0, p, 4); memcpy(ip2 + 4, p, 4);
+                    p += 1; /* ... + 12 */
+                    memcpy(ip3 + 8, p, 4); memcpy(ip3 + 12, p, 4);
+                    memcpy(ip2 + 8, p, 4); memcpy(ip2 + 12, p, 4);
+                    p -= 3; /* ... + 0 */
+                    memcpy(ip1 + 0, p, 4); memcpy(ip1 + 4, p, 4);
+                    memcpy(ip0 + 0, p, 4); memcpy(ip0 + 4, p, 4);
+                    p += 1; /* ... + 4 */
+                    memcpy(ip1 + 8, p, 4); memcpy(ip1 + 12, p, 4);
+                    memcpy(ip0 + 8, p, 4); memcpy(ip0 + 12, p, 4);
+
+                } else if (flag & mask) {
+                    if ((data + 4) > eod)
+                        return AVERROR_INVALIDDATA;
+
+                    cb0 = strip->v4_codebook.rgb32[*data++];
+                    cb1 = strip->v4_codebook.rgb32[*data++];
+                    cb2 = strip->v4_codebook.rgb32[*data++];
+                    cb3 = strip->v4_codebook.rgb32[*data++];
+                    memcpy(ip3 + 0, cb2 + 2, 8);
+                    memcpy(ip3 + 8, cb3 + 2, 8);
+                    memcpy(ip2 + 0, cb2 + 0, 8);
+                    memcpy(ip2 + 8, cb3 + 0, 8);
+                    memcpy(ip1 + 0, cb0 + 2, 8);
+                    memcpy(ip1 + 8, cb1 + 2, 8);
+                    memcpy(ip0 + 0, cb0 + 0, 8);
+                    memcpy(ip0 + 8, cb1 + 0, 8);
+
+                }
+            }
+
+            ip0 += 16;  ip1 += 16;
+            ip2 += 16;  ip3 += 16;
+        }
+    }
+
+    return 0;
+}
+
+static int cinepak_decode_vectors_rgb24 (CinepakContext *s, cvid_strip *strip,
                                    int chunk_id, int size, const uint8_t *data)
 {
     const uint8_t   *eod = (data + size);
     uint32_t         flag, mask;
     uint8_t         *cb0, *cb1, *cb2, *cb3;
-    int             x, y;
+    int              x, y;
     char            *ip0, *ip1, *ip2, *ip3;
+    int selective_update = (chunk_id & 0x01);
+    int v1_only          = (chunk_id & 0x02);
 
     flag = 0;
     mask = 0;
@@ -145,7 +537,7 @@  static int cinepak_decode_vectors (CinepakContext *s, cvid_strip *strip,
 
 /* take care of y dimension not being multiple of 4, such streams exist */
         ip0 = ip1 = ip2 = ip3 = s->frame->data[0] +
-          (s->palette_video?strip->x1:strip->x1*3) + (y * s->frame->linesize[0]);
+                                strip->x1*3 + y*s->frame->linesize[0];
         if(s->avctx->height - y > 1) {
             ip1 = ip0 + s->frame->linesize[0];
             if(s->avctx->height - y > 2) {
@@ -161,7 +553,7 @@  static int cinepak_decode_vectors (CinepakContext *s, cvid_strip *strip,
  * (instead of in-loop checking) */
 
         for (x=strip->x1; x < strip->x2; x+=4) {
-            if ((chunk_id & 0x01) && !(mask >>= 1)) {
+            if (selective_update && !(mask >>= 1)) {
                 if ((data + 4) > eod)
                     return AVERROR_INVALIDDATA;
 
@@ -170,8 +562,8 @@  static int cinepak_decode_vectors (CinepakContext *s, cvid_strip *strip,
                 mask  = 0x80000000;
             }
 
-            if (!(chunk_id & 0x01) || (flag & mask)) {
-                if (!(chunk_id & 0x02) && !(mask >>= 1)) {
+            if (!selective_update || (flag & mask)) {
+                if (!v1_only && !(mask >>= 1)) {
                     if ((data + 4) > eod)
                         return AVERROR_INVALIDDATA;
 
@@ -180,84 +572,357 @@  static int cinepak_decode_vectors (CinepakContext *s, cvid_strip *strip,
                     mask  = 0x80000000;
                 }
 
-                if ((chunk_id & 0x02) || (~flag & mask)) {
+                if (v1_only || (~flag & mask)) {
                     uint8_t *p;
                     if (data >= eod)
                         return AVERROR_INVALIDDATA;
 
-                    p = strip->v1_codebook[*data++];
-                    if (s->palette_video) {
-                        ip3[0] = ip3[1] = ip2[0] = ip2[1] = p[6];
-                        ip3[2] = ip3[3] = ip2[2] = ip2[3] = p[9];
-                        ip1[0] = ip1[1] = ip0[0] = ip0[1] = p[0];
-                        ip1[2] = ip1[3] = ip0[2] = ip0[3] = p[3];
-                    } else {
-                        p += 6;
-                        memcpy(ip3 + 0, p, 3); memcpy(ip3 + 3, p, 3);
-                        memcpy(ip2 + 0, p, 3); memcpy(ip2 + 3, p, 3);
-                        p += 3; /* ... + 9 */
-                        memcpy(ip3 + 6, p, 3); memcpy(ip3 + 9, p, 3);
-                        memcpy(ip2 + 6, p, 3); memcpy(ip2 + 9, p, 3);
-                        p -= 9; /* ... + 0 */
-                        memcpy(ip1 + 0, p, 3); memcpy(ip1 + 3, p, 3);
-                        memcpy(ip0 + 0, p, 3); memcpy(ip0 + 3, p, 3);
-                        p += 3; /* ... + 3 */
-                        memcpy(ip1 + 6, p, 3); memcpy(ip1 + 9, p, 3);
-                        memcpy(ip0 + 6, p, 3); memcpy(ip0 + 9, p, 3);
-                    }
+                    p = strip->v1_codebook.rgb24[*data++] + 6;
+                    memcpy(ip3 + 0, p, 3); memcpy(ip3 + 3, p, 3);
+                    memcpy(ip2 + 0, p, 3); memcpy(ip2 + 3, p, 3);
+                    p += 3; /* ... + 9 */
+                    memcpy(ip3 + 6, p, 3); memcpy(ip3 + 9, p, 3);
+                    memcpy(ip2 + 6, p, 3); memcpy(ip2 + 9, p, 3);
+                    p -= 9; /* ... + 0 */
+                    memcpy(ip1 + 0, p, 3); memcpy(ip1 + 3, p, 3);
+                    memcpy(ip0 + 0, p, 3); memcpy(ip0 + 3, p, 3);
+                    p += 3; /* ... + 3 */
+                    memcpy(ip1 + 6, p, 3); memcpy(ip1 + 9, p, 3);
+                    memcpy(ip0 + 6, p, 3); memcpy(ip0 + 9, p, 3);
 
                 } else if (flag & mask) {
                     if ((data + 4) > eod)
                         return AVERROR_INVALIDDATA;
 
-                    cb0 = strip->v4_codebook[*data++];
-                    cb1 = strip->v4_codebook[*data++];
-                    cb2 = strip->v4_codebook[*data++];
-                    cb3 = strip->v4_codebook[*data++];
-                    if (s->palette_video) {
-                        uint8_t *p;
-                        p = ip3;
-                        *p++ = cb2[6];
-                        *p++ = cb2[9];
-                        *p++ = cb3[6];
-                        *p   = cb3[9];
-                        p = ip2;
-                        *p++ = cb2[0];
-                        *p++ = cb2[3];
-                        *p++ = cb3[0];
-                        *p   = cb3[3];
-                        p = ip1;
-                        *p++ = cb0[6];
-                        *p++ = cb0[9];
-                        *p++ = cb1[6];
-                        *p   = cb1[9];
-                        p = ip0;
-                        *p++ = cb0[0];
-                        *p++ = cb0[3];
-                        *p++ = cb1[0];
-                        *p   = cb1[3];
-                    } else {
-                        memcpy(ip3 + 0, cb2 + 6, 6);
-                        memcpy(ip3 + 6, cb3 + 6, 6);
-                        memcpy(ip2 + 0, cb2 + 0, 6);
-                        memcpy(ip2 + 6, cb3 + 0, 6);
-                        memcpy(ip1 + 0, cb0 + 6, 6);
-                        memcpy(ip1 + 6, cb1 + 6, 6);
-                        memcpy(ip0 + 0, cb0 + 0, 6);
-                        memcpy(ip0 + 6, cb1 + 0, 6);
-                    }
+                    cb0 = strip->v4_codebook.rgb24[*data++];
+                    cb1 = strip->v4_codebook.rgb24[*data++];
+                    cb2 = strip->v4_codebook.rgb24[*data++];
+                    cb3 = strip->v4_codebook.rgb24[*data++];
+                    memcpy(ip3 + 0, cb2 + 6, 6);
+                    memcpy(ip3 + 6, cb3 + 6, 6);
+                    memcpy(ip2 + 0, cb2 + 0, 6);
+                    memcpy(ip2 + 6, cb3 + 0, 6);
+                    memcpy(ip1 + 0, cb0 + 6, 6);
+                    memcpy(ip1 + 6, cb1 + 6, 6);
+                    memcpy(ip0 + 0, cb0 + 0, 6);
+                    memcpy(ip0 + 6, cb1 + 0, 6);
 
                 }
             }
 
-            if (s->palette_video) {
-                ip0 += 4;  ip1 += 4;
-                ip2 += 4;  ip3 += 4;
-            } else {
-                ip0 += 12;  ip1 += 12;
-                ip2 += 12;  ip3 += 12;
+            ip0 += 12;  ip1 += 12;
+            ip2 += 12;  ip3 += 12;
+        }
+    }
+
+    return 0;
+}
+
+static int cinepak_decode_vectors_rgb565 (CinepakContext *s, cvid_strip *strip,
+                                   int chunk_id, int size, const uint8_t *data)
+{
+    const uint8_t   *eod = (data + size);
+    uint32_t         flag, mask;
+    uint16_t        *cb0, *cb1, *cb2, *cb3;
+    int              x, y;
+    char            *ip0, *ip1, *ip2, *ip3;
+    int selective_update = (chunk_id & 0x01);
+    int v1_only          = (chunk_id & 0x02);
+
+    flag = 0;
+    mask = 0;
+
+    for (y=strip->y1; y < strip->y2; y+=4) {
+
+/* take care of y dimension not being multiple of 4, such streams exist */
+        ip0 = ip1 = ip2 = ip3 = s->frame->data[0] +
+                                strip->x1*2 + y*s->frame->linesize[0];
+        if(s->avctx->height - y > 1) {
+            ip1 = ip0 + s->frame->linesize[0];
+            if(s->avctx->height - y > 2) {
+                ip2 = ip1 + s->frame->linesize[0];
+                if(s->avctx->height - y > 3) {
+                    ip3 = ip2 + s->frame->linesize[0];
+                }
             }
         }
+/* to get the correct picture for not-multiple-of-4 cases let us fill each
+ * block from the bottom up, thus possibly overwriting the bottommost line
+ * more than once but ending with the correct data in place
+ * (instead of in-loop checking) */
+
+        for (x=strip->x1; x < strip->x2; x+=4) {
+            if (selective_update && !(mask >>= 1)) {
+                if ((data + 4) > eod)
+                    return AVERROR_INVALIDDATA;
+
+                flag  = AV_RB32 (data);
+                data += 4;
+                mask  = 0x80000000;
+            }
+
+            if (!selective_update || (flag & mask)) {
+                if (!v1_only && !(mask >>= 1)) {
+                    if ((data + 4) > eod)
+                        return AVERROR_INVALIDDATA;
+
+                    flag  = AV_RB32 (data);
+                    data += 4;
+                    mask  = 0x80000000;
+                }
+
+                if (v1_only || (~flag & mask)) {
+                    uint16_t *p;
+                    if (data >= eod)
+                        return AVERROR_INVALIDDATA;
+
+                    p = strip->v1_codebook.rgb565[*data++];
+                    * (uint16_t *)ip3    = *((uint16_t *)ip3+1) =
+                    * (uint16_t *)ip2    = *((uint16_t *)ip2+1) = p[2];
+                    *((uint16_t *)ip3+2) = *((uint16_t *)ip3+3) =
+                    *((uint16_t *)ip2+2) = *((uint16_t *)ip2+3) = p[3];
+                    * (uint16_t *)ip1    = *((uint16_t *)ip1+1) =
+                    * (uint16_t *)ip0    = *((uint16_t *)ip0+1) = p[0];
+                    *((uint16_t *)ip1+2) = *((uint16_t *)ip1+3) =
+                    *((uint16_t *)ip0+2) = *((uint16_t *)ip0+3) = p[1];
+
+                } else if (flag & mask) {
+                    if ((data + 4) > eod)
+                        return AVERROR_INVALIDDATA;
+
+                    cb0 = strip->v4_codebook.rgb565[*data++];
+                    cb1 = strip->v4_codebook.rgb565[*data++];
+                    cb2 = strip->v4_codebook.rgb565[*data++];
+                    cb3 = strip->v4_codebook.rgb565[*data++];
+                    memcpy(ip3 + 0, cb2 + 2, 4);
+                    memcpy(ip3 + 4, cb3 + 2, 4);
+                    memcpy(ip2 + 0, cb2 + 0, 4);
+                    memcpy(ip2 + 4, cb3 + 0, 4);
+                    memcpy(ip1 + 0, cb0 + 2, 4);
+                    memcpy(ip1 + 4, cb1 + 2, 4);
+                    memcpy(ip0 + 0, cb0 + 0, 4);
+                    memcpy(ip0 + 4, cb1 + 0, 4);
+
+                }
+            }
+
+            ip0 += 8;  ip1 += 8;
+            ip2 += 8;  ip3 += 8;
+        }
+    }
+
+    return 0;
+}
+
+static int cinepak_decode_vectors_yuv420 (CinepakContext *s, cvid_strip *strip,
+                                   int chunk_id, int size, const uint8_t *data)
+{
+    const uint8_t   *eod = (data + size);
+    uint32_t         flag, mask;
+    uint8_t         *cb0, *cb1, *cb2, *cb3;
+    int              x, y;
+    char            *ip0, *ip1, *ip2, *ip3,
+                    *up01, *up23, *vp01, *vp23;
+    int selective_update = (chunk_id & 0x01);
+    int v1_only          = (chunk_id & 0x02);
+
+    flag = 0;
+    mask = 0;
+
+    for (y=strip->y1; y < strip->y2; y+=4) {
+
+/* take care of y dimension not being multiple of 4, such streams exist */
+        ip0 = ip1 = ip2 = ip3 = s->frame->data[0] +
+                                strip->x1*3 + y*s->frame->linesize[0];
+        up01 = up23 = s->frame->data[1] + strip->x1 + y/2*s->frame->linesize[1];
+        vp01 = vp23 = s->frame->data[2] + strip->x1 + y/2*s->frame->linesize[2];
+        if(s->avctx->height - y > 1) {
+            ip1 = ip0 + s->frame->linesize[0];
+            if(s->avctx->height - y > 2) {
+                ip2 = ip1 + s->frame->linesize[0];
+                up23 = up01 + s->frame->linesize[1];
+                vp23 = vp01 + s->frame->linesize[2];
+                if(s->avctx->height - y > 3) {
+                    ip3 = ip2 + s->frame->linesize[0];
+                }
+            }
+        }
+
+/* to get the correct picture for not-multiple-of-4 cases let us fill each
+ * block from the bottom up, thus possibly overwriting the bottommost line
+ * more than once but ending with the correct data in place
+ * (instead of in-loop checking) */
+
+        for (x=strip->x1; x < strip->x2; x+=4) {
+            if (selective_update && !(mask >>= 1)) {
+                if ((data + 4) > eod)
+                    return AVERROR_INVALIDDATA;
+
+                flag  = AV_RB32 (data);
+                data += 4;
+                mask  = 0x80000000;
+            }
+
+            if (!selective_update || (flag & mask)) {
+                if (!v1_only && !(mask >>= 1)) {
+                    if ((data + 4) > eod)
+                        return AVERROR_INVALIDDATA;
+
+                    flag  = AV_RB32 (data);
+                    data += 4;
+                    mask  = 0x80000000;
+                }
+
+                if (v1_only || (~flag & mask)) {
+                    uint8_t *p;
+                    if (data >= eod)
+                        return AVERROR_INVALIDDATA;
+
+                    p = strip->v1_codebook.yuv420[*data++];
+                    ip3[0] = ip3[1] = ip2[0] = ip2[1] = p[2];
+                    ip3[2] = ip3[3] = ip2[2] = ip2[3] = p[3];
+                    ip1[0] = ip1[1] = ip0[0] = ip0[1] = p[0];
+                    ip1[2] = ip1[3] = ip0[2] = ip0[3] = p[1];
+                    p += 4;
+                    up01[0] = up01[1] = up23[0] = up23[1] = *p++;
+                    vp01[0] = vp01[1] = vp23[0] = vp23[1] = *p++;
+
+                } else if (flag & mask) {
+                    if ((data + 4) > eod)
+                        return AVERROR_INVALIDDATA;
+
+                    cb0 = strip->v4_codebook.yuv420[*data++];
+                    cb1 = strip->v4_codebook.yuv420[*data++];
+                    cb2 = strip->v4_codebook.yuv420[*data++];
+                    cb3 = strip->v4_codebook.yuv420[*data++];
+                    memcpy(ip3 + 0, cb2 + 2, 2);
+                    memcpy(ip3 + 2, cb3 + 2, 2);
+                    memcpy(ip2 + 0, cb2 + 0, 2);
+                    memcpy(ip2 + 2, cb3 + 0, 2);
+                    memcpy(ip1 + 0, cb0 + 2, 2);
+                    memcpy(ip1 + 2, cb1 + 2, 2);
+                    memcpy(ip0 + 0, cb0 + 0, 2);
+                    memcpy(ip0 + 2, cb1 + 0, 2);
+                    cb0 += 4; cb1 += 4; cb2 += 4; cb3 += 4;
+                    up23[0] = *cb2++; vp23[0] = *cb2;
+                    up23[1] = *cb3++; vp23[1] = *cb3;
+                    up01[0] = *cb0++; vp01[0] = *cb0;
+                    up01[1] = *cb1++; vp01[1] = *cb1;
+
+                }
+            }
+
+            ip0 += 4;  ip1 += 4;
+            ip2 += 4;  ip3 += 4;
+            up01 += 2; up23 += 2;
+            vp01 += 2; vp23 += 2;
+        }
+    }
+
+    return 0;
+}
+
+static int cinepak_decode_vectors_pal8 (CinepakContext *s, cvid_strip *strip,
+                                 int chunk_id, int size, const uint8_t *data)
+{
+    const uint8_t   *eod = (data + size);
+    uint32_t         flag, mask;
+    uint8_t         *cb0, *cb1, *cb2, *cb3;
+    int              x, y;
+    char            *ip0, *ip1, *ip2, *ip3;
+    int selective_update = (chunk_id & 0x01);
+    int v1_only          = (chunk_id & 0x02);
+
+    flag = 0;
+    mask = 0;
+
+    for (y=strip->y1; y < strip->y2; y+=4) {
+
+/* take care of y dimension not being multiple of 4, such streams exist */
+        ip0 = ip1 = ip2 = ip3 = s->frame->data[0] +
+                                strip->x1 + y*s->frame->linesize[0];
+        if(s->avctx->height - y > 1) {
+            ip1 = ip0 + s->frame->linesize[0];
+            if(s->avctx->height - y > 2) {
+                ip2 = ip1 + s->frame->linesize[0];
+                if(s->avctx->height - y > 3) {
+                    ip3 = ip2 + s->frame->linesize[0];
+                }
+            }
+        }
+
+/* to get the correct picture for not-multiple-of-4 cases let us fill each
+ * block from the bottom up, thus possibly overwriting the bottommost line
+ * more than once but ending with the correct data in place
+ * (instead of in-loop checking) */
+
+        for (x=strip->x1; x < strip->x2; x+=4) {
+            if (selective_update && !(mask >>= 1)) {
+                if ((data + 4) > eod)
+                    return AVERROR_INVALIDDATA;
+
+                flag  = AV_RB32 (data);
+                data += 4;
+                mask  = 0x80000000;
+            }
+
+            if (!selective_update || (flag & mask)) {
+                if (!v1_only && !(mask >>= 1)) {
+                    if ((data + 4) > eod)
+                        return AVERROR_INVALIDDATA;
+
+                    flag  = AV_RB32 (data);
+                    data += 4;
+                    mask  = 0x80000000;
+                }
+
+                if (v1_only || (~flag & mask)) {
+                    uint8_t *p;
+                    if (data >= eod)
+                        return AVERROR_INVALIDDATA;
+
+                    p = strip->v1_codebook.pal8[*data++];
+                    ip3[0] = ip3[1] = ip2[0] = ip2[1] = p[2];
+                    ip3[2] = ip3[3] = ip2[2] = ip2[3] = p[3];
+                    ip1[0] = ip1[1] = ip0[0] = ip0[1] = p[0];
+                    ip1[2] = ip1[3] = ip0[2] = ip0[3] = p[1];
+
+                } else if (flag & mask) {
+                    uint8_t *p;
+                    if ((data + 4) > eod)
+                        return AVERROR_INVALIDDATA;
+
+                    cb0 = strip->v4_codebook.pal8[*data++];
+                    cb1 = strip->v4_codebook.pal8[*data++];
+                    cb2 = strip->v4_codebook.pal8[*data++];
+                    cb3 = strip->v4_codebook.pal8[*data++];
+                    p = ip3;
+                    *p++ = cb2[2];
+                    *p++ = cb2[3];
+                    *p++ = cb3[2];
+                    *p   = cb3[3];
+                    p = ip2;
+                    *p++ = cb2[0];
+                    *p++ = cb2[1];
+                    *p++ = cb3[0];
+                    *p   = cb3[1];
+                    p = ip1;
+                    *p++ = cb0[2];
+                    *p++ = cb0[3];
+                    *p++ = cb1[2];
+                    *p   = cb1[3];
+                    p = ip0;
+                    *p++ = cb0[0];
+                    *p++ = cb0[1];
+                    *p++ = cb1[0];
+                    *p   = cb1[1];
+
+                }
+            }
+
+            ip0 += 4;  ip1 += 4;
+            ip2 += 4;  ip3 += 4;
+        }
     }
 
     return 0;
@@ -290,22 +955,22 @@  static int cinepak_decode_strip (CinepakContext *s,
         case 0x21:
         case 0x24:
         case 0x25:
-            cinepak_decode_codebook (strip->v4_codebook, chunk_id,
-                chunk_size, data);
+            s->decode_codebook(s, &strip->v4_codebook,
+                chunk_id, chunk_size, data);
             break;
 
         case 0x22:
         case 0x23:
         case 0x26:
         case 0x27:
-            cinepak_decode_codebook (strip->v1_codebook, chunk_id,
-                chunk_size, data);
+            s->decode_codebook (s, &strip->v1_codebook,
+                chunk_id, chunk_size, data);
             break;
 
         case 0x30:
         case 0x31:
         case 0x32:
-            return cinepak_decode_vectors (s, strip, chunk_id,
+            return s->decode_vectors (s, strip, chunk_id,
                 chunk_size, data);
         }
 
@@ -385,9 +1050,9 @@  static int cinepak_decode (CinepakContext *s)
         strip_size = ((s->data + strip_size) > eod) ? (eod - s->data) : strip_size;
 
         if ((i > 0) && !(frame_flags & 0x01)) {
-            memcpy (s->strips[i].v4_codebook, s->strips[i-1].v4_codebook,
+            memcpy (&s->strips[i].v4_codebook, &s->strips[i-1].v4_codebook,
                 sizeof(s->strips[i].v4_codebook));
-            memcpy (s->strips[i].v1_codebook, s->strips[i-1].v1_codebook,
+            memcpy (&s->strips[i].v1_codebook, &s->strips[i-1].v1_codebook,
                 sizeof(s->strips[i].v1_codebook));
         }
 
@@ -402,9 +1067,24 @@  static int cinepak_decode (CinepakContext *s)
     return 0;
 }
 
+static const enum AVPixelFormat ff_cinepak_pixfmt_list[] = {
+    AV_PIX_FMT_RGB24,
+    AV_PIX_FMT_RGB32,
+    AV_PIX_FMT_RGB565,
+    AV_PIX_FMT_YUV420P,
+    AV_PIX_FMT_PAL8, /* only when input is palettized */
+    AV_PIX_FMT_NONE
+};
+
 static av_cold int cinepak_decode_init(AVCodecContext *avctx)
 {
     CinepakContext *s = avctx->priv_data;
+#ifdef CINEPAK_OUTPUT_FORMAT_OVERRIDE
+    char *out_fmt_override = getenv("CINEPAK_OUTPUT_FORMAT_OVERRIDE");
+#endif
+
+/* we take advantage of VQ to efficiently support
+ * multiple output formats */
 
     s->avctx = avctx;
     s->width = (avctx->width + 3) & ~3;
@@ -412,13 +1092,64 @@  static av_cold int cinepak_decode_init(AVCodecContext *avctx)
 
     s->sega_film_skip_bytes = -1;  /* uninitialized state */
 
-    // check for paletted data
-    if (avctx->bits_per_coded_sample != 8) {
-        s->palette_video = 0;
-        avctx->pix_fmt = AV_PIX_FMT_RGB24;
-    } else {
-        s->palette_video = 1;
-        avctx->pix_fmt = AV_PIX_FMT_PAL8;
+    /* check for paletted data */
+    s->palette_video = (avctx->bits_per_coded_sample == 8);
+    av_log(avctx, AV_LOG_INFO, "this is %sa palette video\n", s->palette_video?"":"not ");
+
+#ifdef CINEPAK_OUTPUT_FORMAT_OVERRIDE
+    if (out_fmt_override && *out_fmt_override) {
+        if (       !strcmp(out_fmt_override, "rgb32")) {
+            avctx->pix_fmt = AV_PIX_FMT_RGB32;
+        } else if (!strcmp(out_fmt_override, "rgb24")) {
+            avctx->pix_fmt = AV_PIX_FMT_RGB24;
+        } else if (!strcmp(out_fmt_override, "rgb565")) {
+            avctx->pix_fmt = AV_PIX_FMT_RGB565;
+        } else if (!strcmp(out_fmt_override, "yuv420p")) {
+            avctx->pix_fmt = AV_PIX_FMT_YUV420P;
+        } else if (!strcmp(out_fmt_override, "pal8")) {
+            avctx->pix_fmt = AV_PIX_FMT_PAL8;
+        } else {
+            av_log(avctx, AV_LOG_ERROR, "Unsupported pixel format override '%s'\n",
+                                        out_fmt_override);
+            return AVERROR(EINVAL);
+        }
+    } else
+#endif
+    avctx->pix_fmt = s->out_pixfmt;
+
+    switch (avctx->pix_fmt) {
+    case AV_PIX_FMT_RGB32:
+        av_log(avctx, AV_LOG_INFO, "Codec output pixel format is rgb32\n");
+        s->decode_codebook = cinepak_decode_codebook_rgb32;
+        s->decode_vectors  = cinepak_decode_vectors_rgb32;
+        break;
+    case AV_PIX_FMT_RGB24:
+        av_log(avctx, AV_LOG_INFO, "Codec output pixel format is rgb24\n");
+        s->decode_codebook = cinepak_decode_codebook_rgb24;
+        s->decode_vectors  = cinepak_decode_vectors_rgb24;
+        break;
+    case AV_PIX_FMT_RGB565:
+        av_log(avctx, AV_LOG_INFO, "Codec output pixel format is rgb565\n");
+        s->decode_codebook = cinepak_decode_codebook_rgb565;
+        s->decode_vectors  = cinepak_decode_vectors_rgb565;
+        break;
+    case AV_PIX_FMT_YUV420P:
+        av_log(avctx, AV_LOG_INFO, "Codec output pixel format is approximate yuv420p\n");
+        s->decode_codebook = cinepak_decode_codebook_yuv420;
+        s->decode_vectors  = cinepak_decode_vectors_yuv420;
+        break;
+    case AV_PIX_FMT_PAL8:
+        if (!s->palette_video) {
+            av_log(avctx, AV_LOG_ERROR, "Palettized output not supported without palettized input\n");
+            return AVERROR(EINVAL);
+        }
+        av_log(avctx, AV_LOG_INFO, "Codec output pixel format is pal8\n");
+        s->decode_codebook = cinepak_decode_codebook_pal8;
+        s->decode_vectors  = cinepak_decode_vectors_pal8;
+        break;
+    default:
+        av_log(avctx, AV_LOG_ERROR, "Unsupported pixel format %d\n", avctx->pix_fmt);
+        return AVERROR(EINVAL);
     }
 
     s->frame = av_frame_alloc();
@@ -457,7 +1188,7 @@  static int cinepak_decode_frame(AVCodecContext *avctx,
         av_log(avctx, AV_LOG_ERROR, "cinepak_decode failed\n");
     }
 
-    if (s->palette_video)
+    if (avctx->pix_fmt == AV_PIX_FMT_PAL8)
         memcpy (s->frame->data[1], s->pal, AVPALETTE_SIZE);
 
     if ((ret = av_frame_ref(data, s->frame)) < 0)
@@ -488,4 +1219,6 @@  AVCodec ff_cinepak_decoder = {
     .close          = cinepak_decode_end,
     .decode         = cinepak_decode_frame,
     .capabilities   = AV_CODEC_CAP_DR1,
+    .pix_fmts       = ff_cinepak_pixfmt_list,
+    .priv_class     = &cinepak_class,
 };