From patchwork Sun Mar  5 18:26:30 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: u-9iep@aetey.se
X-Patchwork-Id: 2755
Delivered-To: ffmpegpatchwork@gmail.com
Received: by 10.103.31.14 with SMTP id f14csp1099929vsf;
	Sun, 5 Mar 2017 10:27:12 -0800 (PST)
X-Received: by 10.28.172.68 with SMTP id v65mr10156711wme.111.1488738431884;
	Sun, 05 Mar 2017 10:27:11 -0800 (PST)
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org. [79.124.17.100])
	by mx.google.com with ESMTP id
	124si11434543wmc.106.2017.03.05.10.27.11;
	Sun, 05 Mar 2017 10:27:11 -0800 (PST)
Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
	designates 79.124.17.100 as permitted sender)
	client-ip=79.124.17.100;
Authentication-Results: mx.google.com;
	dkim=neutral (body hash did not verify) header.i=@fripost.org;
	spf=pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org
	designates 79.124.17.100 as permitted sender)
	smtp.mailfrom=ffmpeg-devel-bounces@ffmpeg.org
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 6136A6882CE;
	Sun,  5 Mar 2017 20:26:55 +0200 (EET)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from outgoing.fripost.org (giraff.fripost.org [178.16.208.44])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 302AC688294
	for <ffmpeg-devel@ffmpeg.org>; Sun,  5 Mar 2017 20:26:48 +0200 (EET)
Received: from localhost (localhost [127.0.0.1])
	by outgoing.fripost.org (Postfix) with ESMTP id B80EDAD05F2
	for <ffmpeg-devel@ffmpeg.org>; Sun,  5 Mar 2017 19:27:00 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=fripost.org; h=
	in-reply-to:content-disposition:content-type:content-type
	:mime-version:references:message-id:subject:subject:from:from
	:date:date; s=20140703; t=1488738420; x=1490552821; bh=k3R+jk6dP
	8IhGHTuJ8kZi4D18B/8bv61h9o4a8cuMXA=; b=Q4oEN+40sE49gWCchCV9fDGXa
	BaXhZGjnaFa+RQRlXdtO8zdLNGpmoAcYNSC64L4UVl14rcCBbF/6iT/Ufdjttuub
	INU4ogvCUkziQyF4ABACElkasEHPabUnxoXT8ujvDPUrKVvTJbvp0ae99EWLxgmu
	5EyQXWY/kH6u/mM45c=
X-Virus-Scanned: Debian amavisd-new at fripost.org
Received: from outgoing.fripost.org ([127.0.0.1])
	by localhost (giraff.fripost.org [127.0.0.1]) (amavisd-new,
	port 10040)
	with LMTP id Jk5vgH9GoJ0g for <ffmpeg-devel@ffmpeg.org>;
	Sun,  5 Mar 2017 19:27:00 +0100 (CET)
Received: from smtp.fripost.org (unknown [172.16.0.6])
	by outgoing.fripost.org (Postfix) with ESMTP id 879A6AD05EB
	for <ffmpeg-devel@ffmpeg.org>; Sun,  5 Mar 2017 19:27:00 +0100 (CET)
Received: from [127.0.0.1] (localhost [127.0.0.1])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) by smtp.fripost.org (Postfix) with ESMTPSA id 182392A96900
	for <ffmpeg-devel@ffmpeg.org>; Sun,  5 Mar 2017 19:26:59 +0100 (CET)
Received: (qmail 14190 invoked from network); 5 Mar 2017 18:25:35 -0000
Received: from localhost (HELO aetey.se) (eh1ba719@127.0.0.1)
	by mail with ESMTPA; 5 Mar 2017 18:25:35 -0000
Date: Sun, 5 Mar 2017 19:26:30 +0100
From: u-9iep@aetey.se
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <20170305182630.GL32749@example.net>
References: 
 <CAPYw7P4s06ARd=+eBw=nnz6yZgXpoPCa6YE5V-86_jF8erq9yw@mail.gmail.com>
	<20170213175139.GB32749@example.net>
	<20170214065146.0810cb48@debian>
	<20170214085154.GD32749@example.net>
	<20170214100003.013390c4@debian>
	<CAPYw7P6vOLKix3OYr0VCxCsB_h+AokvdOmNHPZ_eMs-pSCjMJA@mail.gmail.com>
	<20170214111404.GE32749@example.net>
	<CAEEMt2=shogv+wN09G1D_xvLDV2PJ44F1eHsuq7=f3sM7UfiyQ@mail.gmail.com>
	<20170215110155.GG32749@example.net>
	<20170225200358.GG32749@example.net>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20170225200358.GG32749@example.net>
Subject: [FFmpeg-devel] [PATCH 1/2] (was Re: deduplicated [PATCH] Cinepak:
	speed up decoding several-fold...) formats.
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <http://ffmpeg.org/mailman/options/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <http://ffmpeg.org/pipermail/ffmpeg-devel/>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <http://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches
	<ffmpeg-devel@ffmpeg.org>
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

For one week there was no substantial feedback on the patches.

I understand that the first patch was very large, which made it hard
to review.

That's why I have now produced a limited, minimal change which still
yields most of the improvement.

I kindly ask the reader to note that Cinepak is not FFmpeg's everyday fare,
i.e. quick shortcut judgements may not be applicable. Please take your time.

To avoid contention around some of the issues which triggered such
regrettable judgements, let me point out that

- letting a decoder produce multiple pixel formats is explicitly
  allowed and supported by the documented API (get_format())

- the patch does not introduce a new feature in this respect,
  just makes its presence more visible and much more useful

- the cinepak codec is extraordinarily well suited to produce virtually
  any desired pixel format very efficiently (iow it is quite special)

- even though we have the libswscale framework, there is no point
  in using it where it cannot help; it would be several times slower than
  cinepak itself, NOTE not by coincidence but by the very design of both

- when some functionality is desired from libswscale, it remains fully
  usable with cinepak, with no loss in the actual efficiency of libswscale

- the 3-fold practical speedup makes the difference between working/useful
  and unusable/useless, with very little complexity (aka maintenance costs)

- the feedback concerning the coding style and code duplication was not
  specified in any measurable terms but it seems that all related issues
  have been resolved

Looking forward to your analysis and feedback.

Thanks.

Rune
From cbe0664a13a615ecac302ffb0de577cd08b1f910 Mon Sep 17 00:00:00 2001
From: Rl <addr-see-the-website@aetey.se>
Date: Sun, 5 Mar 2017 15:54:06 +0100
Subject: [PATCH 1/2] Cinepak decoding: refactor for several-fold speedup

Refactored codebook generation and vector parsing
to better support the decoder API and allow for optimizations.

Decoding to rgb24 is now slightly faster.

Replaced generation of the constrained pal8 output
(which was only possible with palettized input)
with rgb565, comparably fast but working with any input format.

Decoding to rgb565 is now several-fold faster:
 (measured including overheads, so this underestimates the actual
  decoding speedup)
 --------
 matrixbench_mpeg2.mpg (720x567) encoded with ffmpeg into Cinepak
 using default settings, decoding on an i5 3570K, 3.4 GHz:
 ...
 fast_bilinear:              ~65x realtime
 patch w/rgb565 override:    ~154x realtime
 --------
 https://ffmpeg.org/pipermail/ffmpeg-devel/2017-February/206799.html

The default output format is unchanged, rgb24.
---
 libavcodec/cinepak.c | 462 ++++++++++++++++++++++++++++-----------------------
 1 file changed, 252 insertions(+), 210 deletions(-)

diff --git a/libavcodec/cinepak.c b/libavcodec/cinepak.c
index d657e9c0c1..d8bc7860eb 100644
--- a/libavcodec/cinepak.c
+++ b/libavcodec/cinepak.c
@@ -29,8 +29,8 @@
  * @see For more information on the quirky data inside Sega FILM/CPK files, visit:
  *   http://wiki.multimedia.cx/index.php?title=Sega_FILM
  *
- * Cinepak colorspace support (c) 2013 Rl, Aetey Global Technologies AB
- * @author Cinepak colorspace, Rl, Aetey Global Technologies AB
+ * Cinepak colorspace/optimizations (c) 2013-2017 Rl, Aetey Global Technologies AB
+ * @author Cinepak colorspace/optimizations, Rl, Aetey Global Technologies AB
  */
 
 #include <stdio.h>
@@ -39,24 +39,29 @@
 
 #include "libavutil/common.h"
 #include "libavutil/intreadwrite.h"
+#include "libavutil/opt.h"
+#include "libavutil/pixdesc.h"
 #include "avcodec.h"
 #include "internal.h"
 
+#define MAX_STRIPS      32    /* an arbitrary limit -- rl */
 
-typedef uint8_t cvid_codebook[12];
-
-#define MAX_STRIPS      32
+typedef union cvid_codebook {
+    uint8_t    rgb24[256][12];
+    uint16_t  rgb565[256][ 4];
+} cvid_codebook;
 
 typedef struct cvid_strip {
     uint16_t          id;
     uint16_t          x1, y1;
     uint16_t          x2, y2;
-    cvid_codebook     v4_codebook[256];
-    cvid_codebook     v1_codebook[256];
+    cvid_codebook     v4_codebook;
+    cvid_codebook     v1_codebook;
 } cvid_strip;
 
-typedef struct CinepakContext {
-
+typedef struct CinepakContext CinepakContext;
+struct CinepakContext {
+    const AVClass *class;
     AVCodecContext *avctx;
     AVFrame *frame;
 
@@ -71,195 +76,233 @@ typedef struct CinepakContext {
     int sega_film_skip_bytes;
 
     uint32_t pal[256];
-} CinepakContext;
-
-static void cinepak_decode_codebook (cvid_codebook *codebook,
-                                     int chunk_id, int size, const uint8_t *data)
-{
-    const uint8_t *eod = (data + size);
-    uint32_t flag, mask;
-    int      i, n;
-    uint8_t *p;
-
-    /* check if this chunk contains 4- or 6-element vectors */
-    n    = (chunk_id & 0x04) ? 4 : 6;
-    flag = 0;
-    mask = 0;
-
-    p = codebook[0];
-    for (i=0; i < 256; i++) {
-        if ((chunk_id & 0x01) && !(mask >>= 1)) {
-            if ((data + 4) > eod)
-                break;
-
-            flag  = AV_RB32 (data);
-            data += 4;
-            mask  = 0x80000000;
-        }
+    void (*decode_codebook)(CinepakContext *s, cvid_codebook *codebook,
+                            int chunk_id, int size, const uint8_t *data);
+    int  (*decode_vectors)(CinepakContext *s, cvid_strip *strip,
+                           int chunk_id, int size, const uint8_t *data);
+/* options */
+    enum AVPixelFormat out_pixfmt;
+};
 
-        if (!(chunk_id & 0x01) || (flag & mask)) {
-            int k, kk;
+#define OFFSET(x) offsetof(CinepakContext, x)
+#define VD AV_OPT_FLAG_VIDEO_PARAM | AV_OPT_FLAG_DECODING_PARAM
+static const AVOption options[] = {
+{"output_pixel_format", "set output pixel format as rgb24 or rgb565", OFFSET(out_pixfmt), AV_OPT_TYPE_PIXEL_FMT, {.i64=AV_PIX_FMT_NONE}, -1, INT_MAX, VD },
+    { NULL },
+};
 
-            if ((data + n) > eod)
-                break;
+static const AVClass cinepak_class = {
+    .class_name = "cinepak decoder",
+    .item_name  = av_default_item_name,
+    .option     = options,
+    .version    = LIBAVUTIL_VERSION_INT,
+};
 
-            for (k = 0; k < 4; ++k) {
-                int r = *data++;
-                for (kk = 0; kk < 3; ++kk)
-                    *p++ = r;
-            }
-            if (n == 6) {
-                int r, g, b, u, v;
-                u = *(int8_t *)data++;
-                v = *(int8_t *)data++;
-                p -= 12;
+#define CODEBOOK_PROLOGUE(pixel_format) \
+static void cinepak_decode_codebook_##pixel_format (CinepakContext *s,\
+    cvid_codebook *codebook, int chunk_id, int size, const uint8_t *data) {\
+    const uint8_t *eod;\
+    uint32_t flag, mask;\
+    int selective_update, i;\
+
+#define DECODE_CODEBOOK(pixel_format) \
+CODEBOOK_PROLOGUE(pixel_format) \
+    int n, palette_video;\
+
+#define CODEBOOK_STREAM_PARSING \
+    for (i=0; i < 256; i++) {\
+        if (selective_update && !(mask >>= 1)) {\
+            if ((data + 4) > eod) break;\
+            flag = AV_RB32 (data); data += 4; mask = 0x80000000;\
+        }\
+        if (!selective_update || (flag & mask)) {\
+            int k;\
+            if ((data + n) > eod) break;\
+
+#define CODEBOOK_INTRO \
+    selective_update = (chunk_id & 0x01);\
+    eod = (data + size); flag = mask = 0;\
+
+#define CODEBOOK_FULL_COLOR \
+    /* check if this chunk contains 4- or 6-element vectors */\
+    n = (chunk_id & 0x04) ? 4 : 6;\
+    palette_video = s->palette_video;\
+    CODEBOOK_INTRO\
+    CODEBOOK_STREAM_PARSING\
+
+#define DECODE_VECTORS(pixel_format) \
+static int cinepak_decode_vectors_##pixel_format (CinepakContext *s, cvid_strip *strip, int chunk_id, int size, const uint8_t *data) {\
+    const uint8_t   *eod;\
+    uint32_t         flag, mask;\
+    int              x, y, selective_update, v1_only;\
+    char            *ip0, *ip1, *ip2, *ip3;\
+
+#define VECTOR_INTRO \
+    CODEBOOK_INTRO\
+    v1_only = (chunk_id & 0x02);\
+    for (y=strip->y1; y < strip->y2; y+=4) {\
+
+#define VECTOR_STREAM_PARSING \
+        for (x=strip->x1; x < strip->x2; x+=4) {\
+            if (selective_update && !(mask >>= 1)) {\
+                if ((data + 4) > eod) return AVERROR_INVALIDDATA;\
+                flag  = AV_RB32 (data); data += 4; mask = 0x80000000;\
+            }\
+            if (!selective_update || (flag & mask)) {\
+                if (!v1_only && !(mask >>= 1)) {\
+                    if ((data + 4) > eod) return AVERROR_INVALIDDATA;\
+                    flag  = AV_RB32 (data); data += 4; mask = 0x80000000;\
+                }\
+                if (v1_only || (~flag & mask)) {\
+                    POINTER_TYPE *p;\
+                    if (data >= eod) return AVERROR_INVALIDDATA;\
+
+#define VECTOR_DO \
+/* take care of y dimension not being multiple of 4 */\
+        if(s->avctx->height - y > 1) {\
+            ip1 = ip0 + s->frame->linesize[0];\
+            if(s->avctx->height - y > 2) {\
+                ip2 = ip1 + s->frame->linesize[0];\
+                if(s->avctx->height - y > 3) {\
+                    ip3 = ip2 + s->frame->linesize[0];\
+                }\
+            }\
+        }\
+/* to get the correct picture for not-multiple-of-4 cases let us fill each\
+ * block from the bottom up, thus possibly overwriting the bottommost line\
+ * more than once but ending with the correct data in place\
+ * (instead of in-loop checking) */\
+        VECTOR_STREAM_PARSING\
+
+DECODE_CODEBOOK(rgb24)
+    uint8_t *p = codebook->rgb24[0];
+    CODEBOOK_FULL_COLOR
+            if (n == 4)
+                if (palette_video)
+                    for (k = 0; k < 4; ++k) {
+                        uint32_t r = s->pal[*data++];
+                        *p++ = (r>>16)&0xff; *p++ = (r>>8)&0xff; *p++ = r&0xff;
+                    }
+                else
+                    for (k = 0; k < 4; ++k) {
+                        int kk, r = *data++;
+                        for (kk = 0; kk < 3; ++kk) *p++ = r;
+                    }
+            else { /* n == 6 */
+                int y, u, v, v2, u2v, u2;
+                u = (int8_t)data[4]; v = (int8_t)data[5];
+                v2 = v*2; u2v = u/2+v; u2 = u*2;
                 for(k=0; k<4; ++k) {
-                    r = *p++ + v*2;
-                    g = *p++ - (u/2) - v;
-                    b = *p   + u*2;
-                    p -= 2;
-                    *p++ = av_clip_uint8(r);
-                    *p++ = av_clip_uint8(g);
-                    *p++ = av_clip_uint8(b);
+                    y = *data++;
+                    *p++ = av_clip_uint8(y+v2);
+                    *p++ = av_clip_uint8(y-u2v);
+                    *p++ = av_clip_uint8(y+u2);
                 }
+                data += 2;
             }
-        } else {
+        } else
             p += 12;
-        }
     }
 }
-
-static int cinepak_decode_vectors (CinepakContext *s, cvid_strip *strip,
-                                   int chunk_id, int size, const uint8_t *data)
-{
-    const uint8_t   *eod = (data + size);
-    uint32_t         flag, mask;
-    uint8_t         *cb0, *cb1, *cb2, *cb3;
-    int             x, y;
-    char            *ip0, *ip1, *ip2, *ip3;
-
-    flag = 0;
-    mask = 0;
-
-    for (y=strip->y1; y < strip->y2; y+=4) {
-
-/* take care of y dimension not being multiple of 4, such streams exist */
+DECODE_VECTORS(rgb24)
+    uint8_t *cb0, *cb1, *cb2, *cb3;
+    VECTOR_INTRO
         ip0 = ip1 = ip2 = ip3 = s->frame->data[0] +
-          (s->palette_video?strip->x1:strip->x1*3) + (y * s->frame->linesize[0]);
-        if(s->avctx->height - y > 1) {
-            ip1 = ip0 + s->frame->linesize[0];
-            if(s->avctx->height - y > 2) {
-                ip2 = ip1 + s->frame->linesize[0];
-                if(s->avctx->height - y > 3) {
-                    ip3 = ip2 + s->frame->linesize[0];
+                                strip->x1*3 + y*s->frame->linesize[0];
+#define POINTER_TYPE uint8_t
+        VECTOR_DO
+#undef POINTER_TYPE
+                    p = strip->v1_codebook.rgb24[*data++] + 6;
+                    memcpy(ip3 + 0, p, 3); memcpy(ip3 + 3, p, 3);
+                    memcpy(ip2 + 0, p, 3); memcpy(ip2 + 3, p, 3);
+                    p += 3; /* ... + 9 */
+                    memcpy(ip3 + 6, p, 3); memcpy(ip3 + 9, p, 3);
+                    memcpy(ip2 + 6, p, 3); memcpy(ip2 + 9, p, 3);
+                    p -= 9; /* ... + 0 */
+                    memcpy(ip1 + 0, p, 3); memcpy(ip1 + 3, p, 3);
+                    memcpy(ip0 + 0, p, 3); memcpy(ip0 + 3, p, 3);
+                    p += 3; /* ... + 3 */
+                    memcpy(ip1 + 6, p, 3); memcpy(ip1 + 9, p, 3);
+                    memcpy(ip0 + 6, p, 3); memcpy(ip0 + 9, p, 3);
+                } else if (flag & mask) {
+                    if ((data + 4) > eod) return AVERROR_INVALIDDATA;
+                    cb0 = strip->v4_codebook.rgb24[*data++];
+                    cb1 = strip->v4_codebook.rgb24[*data++];
+                    cb2 = strip->v4_codebook.rgb24[*data++];
+                    cb3 = strip->v4_codebook.rgb24[*data++];
+                    memcpy(ip3 + 0, cb2 + 6, 6); memcpy(ip3 + 6, cb3 + 6, 6);
+                    memcpy(ip2 + 0, cb2 + 0, 6); memcpy(ip2 + 6, cb3 + 0, 6);
+                    memcpy(ip1 + 0, cb0 + 6, 6); memcpy(ip1 + 6, cb1 + 6, 6);
+                    memcpy(ip0 + 0, cb0 + 0, 6); memcpy(ip0 + 6, cb1 + 0, 6);
                 }
             }
+            ip0 += 12; ip1 += 12; ip2 += 12; ip3 += 12;
         }
-/* to get the correct picture for not-multiple-of-4 cases let us fill each
- * block from the bottom up, thus possibly overwriting the bottommost line
- * more than once but ending with the correct data in place
- * (instead of in-loop checking) */
-
-        for (x=strip->x1; x < strip->x2; x+=4) {
-            if ((chunk_id & 0x01) && !(mask >>= 1)) {
-                if ((data + 4) > eod)
-                    return AVERROR_INVALIDDATA;
-
-                flag  = AV_RB32 (data);
-                data += 4;
-                mask  = 0x80000000;
-            }
-
-            if (!(chunk_id & 0x01) || (flag & mask)) {
-                if (!(chunk_id & 0x02) && !(mask >>= 1)) {
-                    if ((data + 4) > eod)
-                        return AVERROR_INVALIDDATA;
-
-                    flag  = AV_RB32 (data);
-                    data += 4;
-                    mask  = 0x80000000;
-                }
-
-                if ((chunk_id & 0x02) || (~flag & mask)) {
-                    uint8_t *p;
-                    if (data >= eod)
-                        return AVERROR_INVALIDDATA;
-
-                    p = strip->v1_codebook[*data++];
-                    if (s->palette_video) {
-                        ip3[0] = ip3[1] = ip2[0] = ip2[1] = p[6];
-                        ip3[2] = ip3[3] = ip2[2] = ip2[3] = p[9];
-                        ip1[0] = ip1[1] = ip0[0] = ip0[1] = p[0];
-                        ip1[2] = ip1[3] = ip0[2] = ip0[3] = p[3];
-                    } else {
-                        p += 6;
-                        memcpy(ip3 + 0, p, 3); memcpy(ip3 + 3, p, 3);
-                        memcpy(ip2 + 0, p, 3); memcpy(ip2 + 3, p, 3);
-                        p += 3; /* ... + 9 */
-                        memcpy(ip3 + 6, p, 3); memcpy(ip3 + 9, p, 3);
-                        memcpy(ip2 + 6, p, 3); memcpy(ip2 + 9, p, 3);
-                        p -= 9; /* ... + 0 */
-                        memcpy(ip1 + 0, p, 3); memcpy(ip1 + 3, p, 3);
-                        memcpy(ip0 + 0, p, 3); memcpy(ip0 + 3, p, 3);
-                        p += 3; /* ... + 3 */
-                        memcpy(ip1 + 6, p, 3); memcpy(ip1 + 9, p, 3);
-                        memcpy(ip0 + 6, p, 3); memcpy(ip0 + 9, p, 3);
+    }
+    return 0;
+}
+#define PACK_RGB_RGB565(r,g,b) (((av_clip_uint8((r)+4)>>3)<<11)|((av_clip_uint8((g)+2)>>2)<<5)|(av_clip_uint8((b)+4)>>3)) /* rounding to nearest */
+DECODE_CODEBOOK(rgb565)
+    uint16_t *p = codebook->rgb565[0];
+    CODEBOOK_FULL_COLOR
+            if (n == 4)
+                if (palette_video)
+                    for (k = 0; k < 4; ++k) {
+                        uint32_t r = s->pal[*data++];
+                        *p++ = PACK_RGB_RGB565((r>>16)&0xff,(r>>8)&0xff,r&0xff);
                     }
-
+                else
+                    for (k = 0; k < 4; ++k) {
+                        int r = *data++;
+                        *p++ = PACK_RGB_RGB565(r,r,r);
+                    }
+            else { /* n == 6 */
+                int y, u, v, v2, u2v, u2;
+                u = (int8_t)data[4]; v = (int8_t)data[5];
+                v2 = v*2; u2v = u/2+v; u2 = u*2;
+                for(k=0; k<4; ++k) {
+                    y = *data++;
+                    *p++ = PACK_RGB_RGB565(y+v2,y-u2v,y+u2);
+                }
+                data += 2;
+            }
+        } else
+            p += 4;
+    }
+}
+DECODE_VECTORS(rgb565)
+    uint16_t *cb0, *cb1, *cb2, *cb3;
+    VECTOR_INTRO
+        ip0 = ip1 = ip2 = ip3 = s->frame->data[0] +
+                                strip->x1*2 + y*s->frame->linesize[0];
+#define POINTER_TYPE uint16_t
+        VECTOR_DO
+#undef POINTER_TYPE
+                    p = strip->v1_codebook.rgb565[*data++];
+                    * (uint16_t *)ip3    = *((uint16_t *)ip3+1) =
+                    * (uint16_t *)ip2    = *((uint16_t *)ip2+1) = p[2];
+                    *((uint16_t *)ip3+2) = *((uint16_t *)ip3+3) =
+                    *((uint16_t *)ip2+2) = *((uint16_t *)ip2+3) = p[3];
+                    * (uint16_t *)ip1    = *((uint16_t *)ip1+1) =
+                    * (uint16_t *)ip0    = *((uint16_t *)ip0+1) = p[0];
+                    *((uint16_t *)ip1+2) = *((uint16_t *)ip1+3) =
+                    *((uint16_t *)ip0+2) = *((uint16_t *)ip0+3) = p[1];
                 } else if (flag & mask) {
                     if ((data + 4) > eod)
                         return AVERROR_INVALIDDATA;
-
-                    cb0 = strip->v4_codebook[*data++];
-                    cb1 = strip->v4_codebook[*data++];
-                    cb2 = strip->v4_codebook[*data++];
-                    cb3 = strip->v4_codebook[*data++];
-                    if (s->palette_video) {
-                        uint8_t *p;
-                        p = ip3;
-                        *p++ = cb2[6];
-                        *p++ = cb2[9];
-                        *p++ = cb3[6];
-                        *p   = cb3[9];
-                        p = ip2;
-                        *p++ = cb2[0];
-                        *p++ = cb2[3];
-                        *p++ = cb3[0];
-                        *p   = cb3[3];
-                        p = ip1;
-                        *p++ = cb0[6];
-                        *p++ = cb0[9];
-                        *p++ = cb1[6];
-                        *p   = cb1[9];
-                        p = ip0;
-                        *p++ = cb0[0];
-                        *p++ = cb0[3];
-                        *p++ = cb1[0];
-                        *p   = cb1[3];
-                    } else {
-                        memcpy(ip3 + 0, cb2 + 6, 6);
-                        memcpy(ip3 + 6, cb3 + 6, 6);
-                        memcpy(ip2 + 0, cb2 + 0, 6);
-                        memcpy(ip2 + 6, cb3 + 0, 6);
-                        memcpy(ip1 + 0, cb0 + 6, 6);
-                        memcpy(ip1 + 6, cb1 + 6, 6);
-                        memcpy(ip0 + 0, cb0 + 0, 6);
-                        memcpy(ip0 + 6, cb1 + 0, 6);
-                    }
-
+                    cb0 = strip->v4_codebook.rgb565[*data++];
+                    cb1 = strip->v4_codebook.rgb565[*data++];
+                    cb2 = strip->v4_codebook.rgb565[*data++];
+                    cb3 = strip->v4_codebook.rgb565[*data++];
+                    memcpy(ip3 + 0, cb2 + 2, 4); memcpy(ip3 + 4, cb3 + 2, 4);
+                    memcpy(ip2 + 0, cb2 + 0, 4); memcpy(ip2 + 4, cb3 + 0, 4);
+                    memcpy(ip1 + 0, cb0 + 2, 4); memcpy(ip1 + 4, cb1 + 2, 4);
+                    memcpy(ip0 + 0, cb0 + 0, 4); memcpy(ip0 + 4, cb1 + 0, 4);
                 }
             }
-
-            if (s->palette_video) {
-                ip0 += 4;  ip1 += 4;
-                ip2 += 4;  ip3 += 4;
-            } else {
-                ip0 += 12;  ip1 += 12;
-                ip2 += 12;  ip3 += 12;
-            }
+            ip0 += 8; ip1 += 8; ip2 += 8; ip3 += 8;
         }
     }
-
     return 0;
 }
 
@@ -286,29 +329,15 @@ static int cinepak_decode_strip (CinepakContext *s,
 
         switch (chunk_id) {
 
-        case 0x20:
-        case 0x21:
-        case 0x24:
-        case 0x25:
-            cinepak_decode_codebook (strip->v4_codebook, chunk_id,
-                chunk_size, data);
+        case 0x20: case 0x21: case 0x24: case 0x25:
+            s->decode_codebook(s, &strip->v4_codebook, chunk_id, chunk_size, data);
             break;
-
-        case 0x22:
-        case 0x23:
-        case 0x26:
-        case 0x27:
-            cinepak_decode_codebook (strip->v1_codebook, chunk_id,
-                chunk_size, data);
+        case 0x22: case 0x23: case 0x26: case 0x27:
+            s->decode_codebook (s, &strip->v1_codebook, chunk_id, chunk_size, data);
             break;
-
-        case 0x30:
-        case 0x31:
-        case 0x32:
-            return cinepak_decode_vectors (s, strip, chunk_id,
-                chunk_size, data);
+        case 0x30: case 0x31: case 0x32:
+            return s->decode_vectors (s, strip, chunk_id, chunk_size, data);
         }
-
         data += chunk_size;
     }
 
@@ -385,9 +414,9 @@ static int cinepak_decode (CinepakContext *s)
         strip_size = ((s->data + strip_size) > eod) ? (eod - s->data) : strip_size;
 
         if ((i > 0) && !(frame_flags & 0x01)) {
-            memcpy (s->strips[i].v4_codebook, s->strips[i-1].v4_codebook,
+            memcpy (&s->strips[i].v4_codebook, &s->strips[i-1].v4_codebook,
                 sizeof(s->strips[i].v4_codebook));
-            memcpy (s->strips[i].v1_codebook, s->strips[i-1].v1_codebook,
+            memcpy (&s->strips[i].v1_codebook, &s->strips[i-1].v1_codebook,
                 sizeof(s->strips[i].v1_codebook));
         }
 
@@ -402,6 +431,10 @@ static int cinepak_decode (CinepakContext *s)
     return 0;
 }
 
+static const enum AVPixelFormat pixfmt_list[] = {
+    AV_PIX_FMT_RGB24, AV_PIX_FMT_RGB565, AV_PIX_FMT_NONE
+};
+
 static av_cold int cinepak_decode_init(AVCodecContext *avctx)
 {
     CinepakContext *s = avctx->priv_data;
@@ -412,15 +445,26 @@ static av_cold int cinepak_decode_init(AVCodecContext *avctx)
 
     s->sega_film_skip_bytes = -1;  /* uninitialized state */
 
-    // check for paletted data
-    if (avctx->bits_per_coded_sample != 8) {
-        s->palette_video = 0;
-        avctx->pix_fmt = AV_PIX_FMT_RGB24;
-    } else {
-        s->palette_video = 1;
-        avctx->pix_fmt = AV_PIX_FMT_PAL8;
+    /* check for paletted data */
+    s->palette_video = (avctx->bits_per_coded_sample == 8);
+
+    if (s->out_pixfmt != AV_PIX_FMT_NONE) /* the option is set to something */
+        avctx->pix_fmt = s->out_pixfmt;
+    else
+        avctx->pix_fmt = ff_get_format(avctx, pixfmt_list);
+
+#define DECODE_TO(pixel_format) \
+ s->decode_codebook = cinepak_decode_codebook_##pixel_format;\
+ s->decode_vectors  = cinepak_decode_vectors_##pixel_format;\
+ break;\
+
+    switch (avctx->pix_fmt) {
+    case AV_PIX_FMT_RGB24:   DECODE_TO(rgb24)
+    case AV_PIX_FMT_RGB565:  DECODE_TO(rgb565)
+    default:
+        av_log(avctx, AV_LOG_ERROR, "Unsupported pixel format %s\n", av_get_pix_fmt_name(avctx->pix_fmt));
+        return AVERROR(EINVAL);
     }
-
     s->frame = av_frame_alloc();
     if (!s->frame)
         return AVERROR(ENOMEM);
@@ -457,9 +501,6 @@ static int cinepak_decode_frame(AVCodecContext *avctx,
         av_log(avctx, AV_LOG_ERROR, "cinepak_decode failed\n");
     }
 
-    if (s->palette_video)
-        memcpy (s->frame->data[1], s->pal, AVPALETTE_SIZE);
-
     if ((ret = av_frame_ref(data, s->frame)) < 0)
         return ret;
 
@@ -488,4 +529,5 @@ AVCodec ff_cinepak_decoder = {
     .close          = cinepak_decode_end,
     .decode         = cinepak_decode_frame,
     .capabilities   = AV_CODEC_CAP_DR1,
+    .priv_class     = &cinepak_class,
 };