[FFmpeg-devel] hwcontext_vaapi: use the special UC copy for downloading, frames.

Submitted by Jun Zhao on April 11, 2017, 7:30 a.m.

Details

Message ID d3c4715e-a14a-e1e5-9947-10f0f8baef25@gmail.com
State New
Headers show

Commit Message

Jun Zhao April 11, 2017, 7:30 a.m.
From 9bab458006369f427fa2f4c6248ee89329e81067 Mon Sep 17 00:00:00 2001
From: Jun Zhao <jun.zhao@intel.com>
Date: Tue, 11 Apr 2017 14:37:07 +0800
Subject: [PATCH] hwcontext_vaapi: use the special UC copy for downloading
 frames.

used SSE4 UC function for copying image data from GPU mapped memory,
see https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers

before this change, VA-API HWAccel decoder copy image data from GPU
mapped memory used vaCreateImage/vaGetImage/av_frame_copy, now use
vaDeriveImage/av_image_copy_uc_from.

decoding a 3000 frames 1080p h264 stream in Intel(R) Core(TM)
i5-6260U CPU @ 1.80GHz, the CPU usage and decode fps as follow:

1. Software decoder.
./ffmpeg -i ./skyfall2-trailer.mp4 -f null /dev/null

CPU: 80%, fps: 334fps

2a. vaCreateImage/vaGetImage/av_frame_copy
./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i skyfall2-trailer.mp4 -f null /dev/null

CPU: 12%, fps: 147fps

2b. vaDeriveImage/av_image_copy_uc_from
./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i skyfall2-trailer.mp4 -f null /dev/null

CPU: 23%, fps: 628fps

Signed-off-by: Jun Zhao <jun.zhao@intel.com>
---
 libavutil/hwcontext_vaapi.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

Comments

Mark Thompson April 11, 2017, 11:26 a.m.
On 11/04/17 08:30, Jun Zhao wrote:
> From 9bab458006369f427fa2f4c6248ee89329e81067 Mon Sep 17 00:00:00 2001
> From: Jun Zhao <jun.zhao@intel.com>
> Date: Tue, 11 Apr 2017 14:37:07 +0800
> Subject: [PATCH] hwcontext_vaapi: use the special UC copy for downloading
>  frames.
> 
> used SSE4 UC function for copying image data from GPU mapped memory,
> see https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
> 
> before this change, VA-API HWAccel decoder copy image data from GPU
> mapped memory used vaCreateImage/vaGetImage/av_frame_copy, now use
> vaDeriveImage/av_image_copy_uc_from.
> 
> decoding a 3000 frames 1080p h264 stream in Intel(R) Core(TM)
> i5-6260U CPU @ 1.80GHz, the CPU usage and decode fps as follow:
> 
> 1. Software decoder.
> ./ffmpeg -i ./skyfall2-trailer.mp4 -f null /dev/null
> 
> CPU: 80%, fps: 334fps
> 
> 2a. vaCreateImage/vaGetImage/av_frame_copy
> ./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i skyfall2-trailer.mp4 -f null /dev/null
> 
> CPU: 12%, fps: 147fps
> 
> 2b. vaDeriveImage/av_image_copy_uc_from
> ./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i skyfall2-trailer.mp4 -f null /dev/null
> 
> CPU: 23%, fps: 628fps
> 
> Signed-off-by: Jun Zhao <jun.zhao@intel.com>
> ---

This change was considered in libav when the UC copy function was introduced (<https://lists.libav.org/pipermail/libav-devel/2016-August/078826.html>, <https://lists.libav.org/pipermail/libav-devel/2016-August/078825.html>), but was not in the end applied.

The reasons for this were:

* It had much worse performance on the low-power cores - try your benchmark above on Braswell.

* The performance gain on high-power cores was marginal - we had results of maybe +20% performance in the best case with +40% or more CPU time used.  I admit this seems at odds with your results above, I'll try running it again later with current ffmpeg.

* Most software uses don't actually want the native surface format (typically NV12), so you end up with an extra convert step in software anyway.  VAAPI supports downloading directly to YUV420P - try with -hwaccel_output_format yuv420p or -vf hwmap=read,format=yuv420p.

(Note that none of these arguments applied to DXVA2 because it has no GPU-driven download, so indeed it was applied there (and is why the streaming copy code exists in libavutil at all).)

Still, performance results on more cores would be welcome for comparison, particularly newer ones - can you test on a Gemini Lake or Apollo Lake?


Some patch comments as well:

>  libavutil/hwcontext_vaapi.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/libavutil/hwcontext_vaapi.c b/libavutil/hwcontext_vaapi.c
> index 3b50e95..23899f1 100644
> --- a/libavutil/hwcontext_vaapi.c
> +++ b/libavutil/hwcontext_vaapi.c
> @@ -40,6 +40,7 @@
>  #include "mem.h"
>  #include "pixdesc.h"
>  #include "pixfmt.h"
> +#include "imgutils.h"
>  
>  typedef struct VAAPIDevicePriv {
>  #if HAVE_VAAPI_X11
> @@ -720,7 +721,7 @@ static int vaapi_map_frame(AVHWFramesContext *hwfc,
>      // assume for now that the user is not aware of that and would therefore
>      // prefer not to be given direct-mapped memory if they request read access.
>      if (ctx->derive_works && dst->format == hwfc->sw_format &&
> -        ((flags & AV_HWFRAME_MAP_DIRECT) || !(flags & AV_HWFRAME_MAP_READ))) {
> +        ((flags & AV_HWFRAME_MAP_DIRECT) || (flags & AV_HWFRAME_MAP_READ))) {

This disables direct-mapping by default for write (which is always correct on current platforms, as far as I know?).

More generally, I don't think you should be modifying this function.  It also gets used for av_hwframe_map(), and defaulting to giving users uncached memory for reading when they aren't expecting it is a complete disaster for performance.

>          vas = vaDeriveImage(hwctx->display, surface_id, &map->image);
>          if (vas != VA_STATUS_SUCCESS) {
>              av_log(hwfc, AV_LOG_ERROR, "Failed to derive image from "
> @@ -736,7 +737,6 @@ static int vaapi_map_frame(AVHWFramesContext *hwfc,
>              err = AVERROR(EIO);
>              goto fail;
>          }
> -        map->flags |= AV_HWFRAME_MAP_DIRECT;
>      } else {
>          vas = vaCreateImage(hwctx->display, image_format,
>                              hwfc->width, hwfc->height, &map->image);
> @@ -806,7 +806,8 @@ static int vaapi_transfer_data_from(AVHWFramesContext *hwfc,
>                                      AVFrame *dst, const AVFrame *src)
>  {
>      AVFrame *map;
> -    int err;
> +    int i,err;
> +    ptrdiff_t src_linesize[4], dst_linesize[4];
>  
>      if (dst->width > hwfc->width || dst->height > hwfc->height)
>          return AVERROR(EINVAL);
> @@ -823,11 +824,12 @@ static int vaapi_transfer_data_from(AVHWFramesContext *hwfc,
>      map->width  = dst->width;
>      map->height = dst->height;
>  
> -    err = av_frame_copy(dst, map);
> -    if (err)
> -        goto fail;
> -
> -    err = 0;
> +    for (i = 0; i < 4; i++) {
> +        dst_linesize[i] = dst->linesize[i];
> +        src_linesize[i] = map->linesize[i];
> +    }
> +    av_image_copy_uc_from(dst->data, dst_linesize, map->data, src_linesize,
> +                          hwfc->sw_format, src->width, src->height);

This isn't necessarily sw_format, because the formats don't have to match in the download (the NV12 -> YUV420P case).

Also, I think you want to only use the streaming copy when the source is uncached.  (I assume glibc memcpy() will always win if both are normal memory, though I admit I've never tested that.)

>  fail:
>      av_frame_free(&map);
>      return err;
> -- 
> 2.9.3
Mark Thompson April 11, 2017, 9 p.m.
On 11/04/17 12:26, Mark Thompson wrote:
> On 11/04/17 08:30, Jun Zhao wrote:
>> From 9bab458006369f427fa2f4c6248ee89329e81067 Mon Sep 17 00:00:00 2001
>> From: Jun Zhao <jun.zhao@intel.com>
>> Date: Tue, 11 Apr 2017 14:37:07 +0800
>> Subject: [PATCH] hwcontext_vaapi: use the special UC copy for downloading
>>  frames.
>>
>> used SSE4 UC function for copying image data from GPU mapped memory,
>> see https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
>>
>> before this change, VA-API HWAccel decoder copy image data from GPU
>> mapped memory used vaCreateImage/vaGetImage/av_frame_copy, now use
>> vaDeriveImage/av_image_copy_uc_from.
>>
>> decoding a 3000 frames 1080p h264 stream in Intel(R) Core(TM)
>> i5-6260U CPU @ 1.80GHz, the CPU usage and decode fps as follow:
>>
>> 1. Software decoder.
>> ./ffmpeg -i ./skyfall2-trailer.mp4 -f null /dev/null
>>
>> CPU: 80%, fps: 334fps
>>
>> 2a. vaCreateImage/vaGetImage/av_frame_copy
>> ./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i skyfall2-trailer.mp4 -f null /dev/null
>>
>> CPU: 12%, fps: 147fps
>>
>> 2b. vaDeriveImage/av_image_copy_uc_from
>> ./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i skyfall2-trailer.mp4 -f null /dev/null
>>
>> CPU: 23%, fps: 628fps
>>
>> Signed-off-by: Jun Zhao <jun.zhao@intel.com>
>> ---
> 
> This change was considered in libav when the UC copy function was introduced (<https://lists.libav.org/pipermail/libav-devel/2016-August/078826.html>, <https://lists.libav.org/pipermail/libav-devel/2016-August/078825.html>), but was not in the end applied.
> 
> The reasons for this were:
> 
> * It had much worse performance on the low-power cores - try your benchmark above on Braswell.

Running on a Braswell N3700, input is 38072 frames of 1920x1080 H.264.

No download at all:        520fps,   52s CPU
Before patch, 4 threads:   107fps,  237s CPU
Before patch, 1 thread:     90fps,  233s CPU
After patch, 4 threads:     30fps, 1294s CPU
After patch, 1 thread:      28fps, 1305s CPU


- Mark
Jun Zhao April 12, 2017, 2:19 a.m.
On 2017/4/12 5:00, Mark Thompson wrote:
> On 11/04/17 12:26, Mark Thompson wrote:
>> On 11/04/17 08:30, Jun Zhao wrote:
>>> From 9bab458006369f427fa2f4c6248ee89329e81067 Mon Sep 17 00:00:00 2001
>>> From: Jun Zhao <jun.zhao@intel.com>
>>> Date: Tue, 11 Apr 2017 14:37:07 +0800
>>> Subject: [PATCH] hwcontext_vaapi: use the special UC copy for downloading
>>>  frames.
>>>
>>> used SSE4 UC function for copying image data from GPU mapped memory,
>>> see https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
>>>
>>> before this change, VA-API HWAccel decoder copy image data from GPU
>>> mapped memory used vaCreateImage/vaGetImage/av_frame_copy, now use
>>> vaDeriveImage/av_image_copy_uc_from.
>>>
>>> decoding a 3000 frames 1080p h264 stream in Intel(R) Core(TM)
>>> i5-6260U CPU @ 1.80GHz, the CPU usage and decode fps as follow:
>>>
>>> 1. Software decoder.
>>> ./ffmpeg -i ./skyfall2-trailer.mp4 -f null /dev/null
>>>
>>> CPU: 80%, fps: 334fps
>>>
>>> 2a. vaCreateImage/vaGetImage/av_frame_copy
>>> ./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i skyfall2-trailer.mp4 -f null /dev/null
>>>
>>> CPU: 12%, fps: 147fps
>>>
>>> 2b. vaDeriveImage/av_image_copy_uc_from
>>> ./ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i skyfall2-trailer.mp4 -f null /dev/null
>>>
>>> CPU: 23%, fps: 628fps
>>>
>>> Signed-off-by: Jun Zhao <jun.zhao@intel.com>
>>> ---
>>
>> This change was considered in libav when the UC copy function was introduced (<https://lists.libav.org/pipermail/libav-devel/2016-August/078826.html>, <https://lists.libav.org/pipermail/libav-devel/2016-August/078825.html>), but was not in the end applied.
>>
>> The reasons for this were:
>>
>> * It had much worse performance on the low-power cores - try your benchmark above on Braswell.
> 
> Running on a Braswell N3700, input is 38072 frames of 1920x1080 H.264.
> 
> No download at all:        520fps,   52s CPU
> Before patch, 4 threads:   107fps,  237s CPU
> Before patch, 1 thread:     90fps,  233s CPU
> After patch, 4 threads:     30fps, 1294s CPU
> After patch, 1 thread:      28fps, 1305s CPU
> 
> 

I will try to reproduce this in BSW.

> - Mark
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>

Patch hide | download patch | download mbox

diff --git a/libavutil/hwcontext_vaapi.c b/libavutil/hwcontext_vaapi.c
index 3b50e95..23899f1 100644
--- a/libavutil/hwcontext_vaapi.c
+++ b/libavutil/hwcontext_vaapi.c
@@ -40,6 +40,7 @@ 
 #include "mem.h"
 #include "pixdesc.h"
 #include "pixfmt.h"
+#include "imgutils.h"
 
 typedef struct VAAPIDevicePriv {
 #if HAVE_VAAPI_X11
@@ -720,7 +721,7 @@  static int vaapi_map_frame(AVHWFramesContext *hwfc,
     // assume for now that the user is not aware of that and would therefore
     // prefer not to be given direct-mapped memory if they request read access.
     if (ctx->derive_works && dst->format == hwfc->sw_format &&
-        ((flags & AV_HWFRAME_MAP_DIRECT) || !(flags & AV_HWFRAME_MAP_READ))) {
+        ((flags & AV_HWFRAME_MAP_DIRECT) || (flags & AV_HWFRAME_MAP_READ))) {
         vas = vaDeriveImage(hwctx->display, surface_id, &map->image);
         if (vas != VA_STATUS_SUCCESS) {
             av_log(hwfc, AV_LOG_ERROR, "Failed to derive image from "
@@ -736,7 +737,6 @@  static int vaapi_map_frame(AVHWFramesContext *hwfc,
             err = AVERROR(EIO);
             goto fail;
         }
-        map->flags |= AV_HWFRAME_MAP_DIRECT;
     } else {
         vas = vaCreateImage(hwctx->display, image_format,
                             hwfc->width, hwfc->height, &map->image);
@@ -806,7 +806,8 @@  static int vaapi_transfer_data_from(AVHWFramesContext *hwfc,
                                     AVFrame *dst, const AVFrame *src)
 {
     AVFrame *map;
-    int err;
+    int i,err;
+    ptrdiff_t src_linesize[4], dst_linesize[4];
 
     if (dst->width > hwfc->width || dst->height > hwfc->height)
         return AVERROR(EINVAL);
@@ -823,11 +824,12 @@  static int vaapi_transfer_data_from(AVHWFramesContext *hwfc,
     map->width  = dst->width;
     map->height = dst->height;
 
-    err = av_frame_copy(dst, map);
-    if (err)
-        goto fail;
-
-    err = 0;
+    for (i = 0; i < 4; i++) {
+        dst_linesize[i] = dst->linesize[i];
+        src_linesize[i] = map->linesize[i];
+    }
+    av_image_copy_uc_from(dst->data, dst_linesize, map->data, src_linesize,
+                          hwfc->sw_format, src->width, src->height);
 fail:
     av_frame_free(&map);
     return err;