From patchwork Mon Jun 12 20:38:29 2017
X-Patchwork-Submitter: Ganapathy Raman Kasi
X-Patchwork-Id: 3930
From: Ganapathy Raman Kasi
To: FFmpeg development discussions and patches
Date: Mon, 12 Jun 2017 20:38:29 +0000
Message-ID: <1497299909330.63072@nvidia.com>
Subject: [FFmpeg-devel] Sharing cuda context between transcode sessions to reduce initialization overhead

Hi,

Currently, in the case of a 1 -> N transcode (1 SW decode -> N NVENC encodes) without the HW upload filter, we end up allocating a separate CUDA context for each of the N transcode sessions, even though they all use the same underlying GPU device. Each allocation pays the CUDA context initialization overhead (~100 ms per context creation on a 4th-gen i5 with a GTX 1080 on Ubuntu 16.04). The same issue shows up with M * (1 -> N) fully HW-accelerated transcodes, where the CUDA context is not shared between the M transcode sessions. Sharing the context would greatly reduce initialization time, which matters for short-clip transcodes.

The attached patch adds a global array in libavutil/hwcontext_cuda.c that keeps track of the CUDA contexts already created and reuses an existing context when a hwdevice ctx create request arrives for the same device. Please check the approach and let me know if there is a better/cleaner way to do this.

Thanks,
Ganapathy
From 9e828c7cd943b964ccf4cc8d1059fcef014b24a3 Mon Sep 17 00:00:00 2001
From: Ganapathy Kasi
Date: Mon, 12 Jun 2017 13:14:36 -0700
Subject: [PATCH] Share cuda context across multiple transcode sessions for the same gpu

A CUDA context is allocated per decode/scale/encode session. If there are
multiple transcodes in the same process, many CUDA contexts are allocated
for the same underlying GPU device, each paying the context initialization
overhead. Sharing one CUDA context per device fixes this. Previously nvenc
also used the CUDA interface directly to create its context rather than the
av_hwdevice interface; it now obtains the context through
av_hwdevice_ctx_create().
---
 libavcodec/nvenc.c         | 33 ++++++++++++++++++---------------
 libavcodec/nvenc.h         |  3 ++-
 libavutil/hwcontext_cuda.c | 40 ++++++++++++++++++++++++++--------------
 3 files changed, 46 insertions(+), 30 deletions(-)

diff --git a/libavcodec/nvenc.c b/libavcodec/nvenc.c
index f79b9a5..d5b6978 100644
--- a/libavcodec/nvenc.c
+++ b/libavcodec/nvenc.c
@@ -326,10 +326,14 @@ static av_cold int nvenc_check_device(AVCodecContext *avctx, int idx)
     NvencDynLoadFunctions *dl_fn = &ctx->nvenc_dload_funcs;
     NV_ENCODE_API_FUNCTION_LIST *p_nvenc = &dl_fn->nvenc_funcs;
     char name[128] = { 0};
+    char device_str[20];
     int major, minor, ret;
     CUresult cu_res;
     CUdevice cu_device;
     CUcontext dummy;
+    AVHWDeviceContext *device_ctx;
+    AVCUDADeviceContext *device_hwctx;
+
     int loglevel = AV_LOG_VERBOSE;
 
     if (ctx->device == LIST_DEVICES)
@@ -364,19 +368,19 @@ static av_cold int nvenc_check_device(AVCodecContext *avctx, int idx)
     if (ctx->device != idx && ctx->device != ANY_DEVICE)
         return -1;
 
-    cu_res = dl_fn->cuda_dl->cuCtxCreate(&ctx->cu_context_internal, 0, cu_device);
-    if (cu_res != CUDA_SUCCESS) {
-        av_log(avctx, AV_LOG_FATAL, "Failed creating CUDA context for NVENC: 0x%x\n", (int)cu_res);
+    if (ctx->device == ANY_DEVICE)
+        ctx->device = 0;
+
+    sprintf(device_str, "%d", ctx->device);
+
+    ret = av_hwdevice_ctx_create(&ctx->hwdevice, AV_HWDEVICE_TYPE_CUDA, device_str, NULL, 0);
+    if (ret < 0)
         goto fail;
-    }
 
-    ctx->cu_context = ctx->cu_context_internal;
+    device_ctx = (AVHWDeviceContext *)ctx->hwdevice->data;
+    device_hwctx = device_ctx->hwctx;
 
-    cu_res = dl_fn->cuda_dl->cuCtxPopCurrent(&dummy);
-    if (cu_res != CUDA_SUCCESS) {
-        av_log(avctx, AV_LOG_FATAL, "Failed popping CUDA context: 0x%x\n", (int)cu_res);
-        goto fail2;
-    }
+    ctx->cu_context = device_hwctx->cuda_ctx;
 
     if ((ret = nvenc_open_session(avctx)) < 0)
         goto fail2;
@@ -408,8 +412,8 @@ fail3:
     }
 
 fail2:
-    dl_fn->cuda_dl->cuCtxDestroy(ctx->cu_context_internal);
-    ctx->cu_context_internal = NULL;
+    av_buffer_unref(&ctx->hwdevice);
+    ctx->cu_context = NULL;
 
 fail:
     return AVERROR(ENOSYS);
@@ -1374,9 +1378,8 @@ av_cold int ff_nvenc_encode_close(AVCodecContext *avctx)
         return AVERROR_EXTERNAL;
     }
 
-    if (ctx->cu_context_internal)
-        dl_fn->cuda_dl->cuCtxDestroy(ctx->cu_context_internal);
-    ctx->cu_context = ctx->cu_context_internal = NULL;
+    av_buffer_unref(&ctx->hwdevice);
+    ctx->cu_context = NULL;
 
     nvenc_free_functions(&dl_fn->nvenc_dl);
     cuda_free_functions(&dl_fn->cuda_dl);
diff --git a/libavcodec/nvenc.h b/libavcodec/nvenc.h
index 2e24604..327c914 100644
--- a/libavcodec/nvenc.h
+++ b/libavcodec/nvenc.h
@@ -106,7 +106,6 @@ typedef struct NvencContext
     NV_ENC_INITIALIZE_PARAMS init_encode_params;
     NV_ENC_CONFIG encode_config;
     CUcontext cu_context;
-    CUcontext cu_context_internal;
 
     int nb_surfaces;
     NvencSurface *surfaces;
@@ -116,6 +115,8 @@ typedef struct NvencContext
     AVFifoBuffer *output_surface_ready_queue;
     AVFifoBuffer *timestamp_list;
 
+    AVBufferRef *hwdevice;
+
     struct {
         CUdeviceptr ptr;
         NV_ENC_REGISTERED_PTR regptr;
diff --git a/libavutil/hwcontext_cuda.c b/libavutil/hwcontext_cuda.c
index ed595c3..16d2812 100644
--- a/libavutil/hwcontext_cuda.c
+++ b/libavutil/hwcontext_cuda.c
@@ -24,8 +24,12 @@
 #include "mem.h"
 #include "pixdesc.h"
 #include "pixfmt.h"
+#include
 
 #define CUDA_FRAME_ALIGNMENT 256
+#define NUM_DEVICES 8
+
+CUcontext cudaCtx[NUM_DEVICES] = { NULL };
 
 typedef struct CUDAFramesContext {
     int shift_width, shift_height;
@@ -363,27 +367,35 @@ static int cuda_device_create(AVHWDeviceContext *ctx, const char *device,
     cu = hwctx->internal->cuda_dl;
 
     err = cu->cuInit(0);
-    if (err != CUDA_SUCCESS) {
-        av_log(ctx, AV_LOG_ERROR, "Could not initialize the CUDA driver API\n");
-        goto error;
-    }
-
-    err = cu->cuDeviceGet(&cu_device, device_idx);
     if (err != CUDA_SUCCESS) {
-        av_log(ctx, AV_LOG_ERROR, "Could not get the device number %d\n", device_idx);
-        goto error;
-    }
-
-    err = cu->cuCtxCreate(&hwctx->cuda_ctx, CU_CTX_SCHED_BLOCKING_SYNC, cu_device);
-    if (err != CUDA_SUCCESS) {
-        av_log(ctx, AV_LOG_ERROR, "Error creating a CUDA context\n");
+        av_log(ctx, AV_LOG_ERROR, "Could not initialize the CUDA driver API\n");
         goto error;
     }
 
-    cu->cuCtxPopCurrent(&dummy);
+    if (!cudaCtx[device_idx])
+    {
+        err = cu->cuDeviceGet(&cu_device, device_idx);
+        if (err != CUDA_SUCCESS) {
+            av_log(ctx, AV_LOG_ERROR, "Could not get the device number %d\n", device_idx);
+            goto error;
+        }
 
-    hwctx->internal->is_allocated = 1;
+        err = cu->cuCtxCreate(&hwctx->cuda_ctx, 0, cu_device);
+        if (err != CUDA_SUCCESS) {
+            av_log(ctx, AV_LOG_ERROR, "Error creating a CUDA context\n");
+            goto error;
+        }
+        cu->cuCtxPopCurrent(&dummy);
+        cudaCtx[device_idx] = hwctx->cuda_ctx;
+        hwctx->internal->is_allocated = 1;
+    }
+    else
+    {
+        hwctx->cuda_ctx = cudaCtx[device_idx];
+        hwctx->internal->is_allocated = 0;
+    }
 
     return 0;
 
 error:
-- 
2.7.4