From patchwork Sat Oct 12 10:58:21 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lance Wang <lance.lmwang@gmail.com>
X-Patchwork-Id: 15718
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
X-Original-To: patchwork@ffaux-bg.ffmpeg.org
Delivered-To: patchwork@ffaux-bg.ffmpeg.org
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by ffaux.localdomain (Postfix) with ESMTP id DD3E2447A54
	for <patchwork@ffaux-bg.ffmpeg.org>;
	Sat, 12 Oct 2019 13:58:35 +0300 (EEST)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id B6239689B4D;
	Sat, 12 Oct 2019 13:58:35 +0300 (EEST)
X-Original-To: ffmpeg-devel@ffmpeg.org
Delivered-To: ffmpeg-devel@ffmpeg.org
Received: from mail-pl1-f195.google.com (mail-pl1-f195.google.com
	[209.85.214.195])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 3F8B2687F64
	for <ffmpeg-devel@ffmpeg.org>; Sat, 12 Oct 2019 13:58:29 +0300 (EEST)
Received: by mail-pl1-f195.google.com with SMTP id s17so5678286plp.6
	for <ffmpeg-devel@ffmpeg.org>; Sat, 12 Oct 2019 03:58:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=from:to:cc:subject:date:message-id:in-reply-to:references;
	bh=axpUz9r+lO7iYkgfbEe0OtnCFU513PMDOG7aP0WeNj4=;
	b=SjWpaKZQpo9TyqNX2l/UyTA4yqUdySOuXJmQi4y09SkZ0+wV3UIA6PON/8VNfuGtY1
	yEZikRMjUJSyvkJevDVgJifTjCxsr+2Dmsx04TocMT6LUJX4mfJJWN+0+/zrewckdlV7
	XZ5QTl0LEN2Ldip2Qis3W1YUyyAXHvWqHjCxiamX9ZcROCULk/zVXh3lnByyulg6w82N
	Qdslwh9VXTypcLNzmhWP4xOJ3RoGepaSe9oGxc0XzmDf58iiFgyCDfF2b1AaE/keOdh3
	qrpAedsOEXr1L1Qvre8GccL/6ecxrGxnXKifiTMypnSZ12LWuwzgB1upZWqJYgS55BgL
	jqtQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
	:references;
	bh=axpUz9r+lO7iYkgfbEe0OtnCFU513PMDOG7aP0WeNj4=;
	b=qbpj/Lc+bWaKSo0h9hw/hYjzfeSw2XP867WDXujfx+kEXUNXrPu957SZ2tCEC1adth
	B9nHK5f4jT1I18paiI0Fc9i6irqIOD03BmqjvOzEOe7T54GmrnCPwJgOGW0YNFRYnKSp
	omymes33IDRSEh5mTMiYI8D3GfBixtU+G1ptmRpDPRX+39uaQ7rRregNJXz4JzICVXcD
	uFvVEHbpgsT+ZlItseL+wZaBHfKDbaY+6/sxf/0XqFUe5/iKFjh/tTYGD9Y6+Yudw1HD
	/+yU4JyDbc2kq5BbIW+LfBOf6qG3WV9Vdy9TzcBfzW8tQycp6VtIQzpVB3af3sW08Qnb
	Endg==
X-Gm-Message-State: APjAAAVsOKnF8SLAL36dNh3B5dT+Tt7KkuYRMdDYG/EPvNA5YjRb6Zo7
	d37J1lkN2evD3FwkmHlUYx/8cHNqUbY=
X-Google-Smtp-Source: 
 APXvYqx03ExWPy/4e6XYtWkay6OemLw8nNyMUPaodH7w8mRBiJJ2o7lMtx158TsqwVMwhBUsu6sj5w==
X-Received: by 2002:a17:902:9f83:: with SMTP id
	g3mr20181592plq.168.1570877907109;
	Sat, 12 Oct 2019 03:58:27 -0700 (PDT)
Received: from vpn.localdomain ([47.90.99.151])
	by smtp.gmail.com with ESMTPSA id
	62sm12680405pfg.164.2019.10.12.03.58.25
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Sat, 12 Oct 2019 03:58:26 -0700 (PDT)
From: lance.lmwang@gmail.com
To: ffmpeg-devel@ffmpeg.org
Date: Sat, 12 Oct 2019 18:58:21 +0800
Message-Id: <20191012105821.15523-1-lance.lmwang@gmail.com>
X-Mailer: git-send-email 2.9.5
In-Reply-To: <20190830033752.26454-2-lance.lmwang@gmail.com>
References: <20190830033752.26454-2-lance.lmwang@gmail.com>
Subject: [FFmpeg-devel] [PATCH v5] avcodec/v210dec: add the frame and slice
	threading support
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <http://ffmpeg.org/mailman/options/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <http://ffmpeg.org/pipermail/ffmpeg-devel/>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <http://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
	<mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches
	<ffmpeg-devel@ffmpeg.org>
Cc: Limin Wang <lance.lmwang@gmail.com>
MIME-Version: 1.0
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

From: Limin Wang <lance.lmwang@gmail.com>

The multithread is avoid one core cpu is full with other filter like scale etc.
About the performance, the gain is very small, below is my testing for
performance.
In order to avoid the disk bottleneck, I'll use stream_loop mode for 10 frame
only.

./ffmpeg -y -i ~/Movies/4k_Rec709_ProResHQ.mov -c:v v210 -f rawvideo -frames 10
~/Movies/1.v210

master:
./ffmpeg -threads 1 -s 4096x3072 -stream_loop 100 -i ~/Movies/1.v210 -benchmark
-f null -
frame= 1010 fps= 42 q=-0.0 Lsize=N/A time=00:00:40.40 bitrate=N/A speed=1.69x
video:529kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing
overhead: unknown
bench: utime=10.082s stime=13.784s rtime=23.889s
bench: maxrss=147836928kB

patch applied:
./ffmpeg -threads 4 -thread_type frame+slice  -s 4096x3072 -stream_loop 100 -i
~/Movies/1.v210 -benchmark -f null -

frame= 1010 fps= 55 q=-0.0 Lsize=N/A time=00:00:40.40 bitrate=N/A speed=2.22x
video:529kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing
overhead: unknown
bench: utime=11.407s stime=17.258s rtime=18.279s
bench: maxrss=442884096kB

Signed-off-by: Limin Wang <lance.lmwang@gmail.com>
---
 libavcodec/v210dec.c | 132 +++++++++++++++++++++++++++----------------
 libavcodec/v210dec.h |   1 +
 2 files changed, 85 insertions(+), 48 deletions(-)

diff --git a/libavcodec/v210dec.c b/libavcodec/v210dec.c
index 5a33d8c089..c3ef8051e6 100644
--- a/libavcodec/v210dec.c
+++ b/libavcodec/v210dec.c
@@ -28,6 +28,7 @@
 #include "libavutil/internal.h"
 #include "libavutil/mem.h"
 #include "libavutil/intreadwrite.h"
+#include "thread.h"
 
 #define READ_PIXELS(a, b, c)         \
     do {                             \
@@ -37,6 +38,12 @@
         *c++ = (val >> 20) & 0x3FF;  \
     } while (0)
 
+typedef struct ThreadData {
+    AVFrame *frame;
+    uint8_t *buf;
+    int stride;
+} ThreadData;
+
 static void v210_planar_unpack_c(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width)
 {
     uint32_t val;
@@ -64,21 +71,87 @@ static av_cold int decode_init(AVCodecContext *avctx)
     avctx->pix_fmt             = AV_PIX_FMT_YUV422P10;
     avctx->bits_per_raw_sample = 10;
 
+    s->thread_count  = av_clip(avctx->thread_count, 1, avctx->height/4);
     s->aligned_input = 0;
     ff_v210dec_init(s);
 
     return 0;
 }
 
+static int v210_decode_slice(AVCodecContext *avctx, void *arg, int jobnr, int threadnr)
+{
+    V210DecContext *s = avctx->priv_data;
+    int h, w;
+    ThreadData *td = arg;
+    AVFrame *frame = td->frame;
+    int stride = td->stride;
+    int slice_h = avctx->height / s->thread_count;
+    int slice_m = avctx->height % s->thread_count;
+    int slice_start = jobnr * slice_h;
+    int slice_end = slice_start + slice_h;
+    const uint8_t *psrc = td->buf + stride * slice_start;
+    uint16_t *y, *u, *v;
+
+    /* add the remaining slice for the last job */
+    if (jobnr == s->thread_count - 1)
+        slice_end += slice_m;
+
+    y = (uint16_t*)frame->data[0] + slice_start * frame->linesize[0] / 2;
+    u = (uint16_t*)frame->data[1] + slice_start * frame->linesize[1] / 2;
+    v = (uint16_t*)frame->data[2] + slice_start * frame->linesize[2] / 2;
+    for (h = slice_start; h < slice_end; h++) {
+        const uint32_t *src = (const uint32_t*)psrc;
+        uint32_t val;
+
+        w = (avctx->width / 12) * 12;
+        s->unpack_frame(src, y, u, v, w);
+
+        y += w;
+        u += w >> 1;
+        v += w >> 1;
+        src += (w << 1) / 3;
+
+        if (w < avctx->width - 5) {
+            READ_PIXELS(u, y, v);
+            READ_PIXELS(y, u, y);
+            READ_PIXELS(v, y, u);
+            READ_PIXELS(y, v, y);
+            w += 6;
+        }
+
+        if (w < avctx->width - 1) {
+            READ_PIXELS(u, y, v);
+
+            val  = av_le2ne32(*src++);
+            *y++ =  val & 0x3FF;
+            if (w < avctx->width - 3) {
+                *u++ = (val >> 10) & 0x3FF;
+                *y++ = (val >> 20) & 0x3FF;
+
+                val  = av_le2ne32(*src++);
+                *v++ =  val & 0x3FF;
+                *y++ = (val >> 10) & 0x3FF;
+            }
+        }
+
+        psrc += stride;
+        y += frame->linesize[0] / 2 - avctx->width + (avctx->width & 1);
+        u += frame->linesize[1] / 2 - avctx->width / 2;
+        v += frame->linesize[2] / 2 - avctx->width / 2;
+    }
+
+    return 0;
+}
+
 static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
                         AVPacket *avpkt)
 {
     V210DecContext *s = avctx->priv_data;
-
-    int h, w, ret, stride, aligned_input;
+    ThreadData td;
+    int ret, stride, aligned_input;
+    ThreadFrame frame = { .f = data };
     AVFrame *pic = data;
     const uint8_t *psrc = avpkt->data;
-    uint16_t *y, *u, *v;
 
     if (s->custom_stride )
         stride = s->custom_stride;
@@ -86,6 +159,7 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
         int aligned_width = ((avctx->width + 47) / 48) * 48;
         stride = aligned_width * 8 / 3;
     }
+    td.stride = stride;
 
     if (avpkt->size < stride * avctx->height) {
         if ((((avctx->width + 23) / 24) * 24 * 8) / 3 * avctx->height == avpkt->size) {
@@ -110,55 +184,15 @@ static int decode_frame(AVCodecContext *avctx, void *data, int *got_frame,
         ff_v210dec_init(s);
     }
 
-    if ((ret = ff_get_buffer(avctx, pic, 0)) < 0)
+    if ((ret = ff_thread_get_buffer(avctx, &frame, 0)) < 0)
         return ret;
 
-    y = (uint16_t*)pic->data[0];
-    u = (uint16_t*)pic->data[1];
-    v = (uint16_t*)pic->data[2];
     pic->pict_type = AV_PICTURE_TYPE_I;
     pic->key_frame = 1;
 
-    for (h = 0; h < avctx->height; h++) {
-        const uint32_t *src = (const uint32_t*)psrc;
-        uint32_t val;
-
-        w = (avctx->width / 12) * 12;
-        s->unpack_frame(src, y, u, v, w);
-
-        y += w;
-        u += w >> 1;
-        v += w >> 1;
-        src += (w << 1) / 3;
-
-        if (w < avctx->width - 5) {
-            READ_PIXELS(u, y, v);
-            READ_PIXELS(y, u, y);
-            READ_PIXELS(v, y, u);
-            READ_PIXELS(y, v, y);
-            w += 6;
-        }
-
-        if (w < avctx->width - 1) {
-            READ_PIXELS(u, y, v);
-
-            val  = av_le2ne32(*src++);
-            *y++ =  val & 0x3FF;
-            if (w < avctx->width - 3) {
-                *u++ = (val >> 10) & 0x3FF;
-                *y++ = (val >> 20) & 0x3FF;
-
-                val  = av_le2ne32(*src++);
-                *v++ =  val & 0x3FF;
-                *y++ = (val >> 10) & 0x3FF;
-            }
-        }
-
-        psrc += stride;
-        y += pic->linesize[0] / 2 - avctx->width + (avctx->width & 1);
-        u += pic->linesize[1] / 2 - avctx->width / 2;
-        v += pic->linesize[2] / 2 - avctx->width / 2;
-    }
+    td.buf = (uint8_t*)psrc;
+    td.frame = pic;
+    avctx->execute2(avctx, v210_decode_slice, &td, NULL, s->thread_count);
 
     if (avctx->field_order > AV_FIELD_PROGRESSIVE) {
         /* we have interlaced material flagged in container */
@@ -194,6 +228,8 @@ AVCodec ff_v210_decoder = {
     .priv_data_size = sizeof(V210DecContext),
     .init           = decode_init,
     .decode         = decode_frame,
-    .capabilities   = AV_CODEC_CAP_DR1,
+    .capabilities   = AV_CODEC_CAP_DR1 |
+                      AV_CODEC_CAP_SLICE_THREADS |
+                      AV_CODEC_CAP_FRAME_THREADS,
     .priv_class     = &v210dec_class,
 };
diff --git a/libavcodec/v210dec.h b/libavcodec/v210dec.h
index cfdb29da09..662e266315 100644
--- a/libavcodec/v210dec.h
+++ b/libavcodec/v210dec.h
@@ -27,6 +27,7 @@ typedef struct {
     AVClass *av_class;
     int custom_stride;
     int aligned_input;
+    int thread_count;
     int stride_warning_shown;
     void (*unpack_frame)(const uint32_t *src, uint16_t *y, uint16_t *u, uint16_t *v, int width);
 } V210DecContext;