[FFmpeg-devel] adding RGBA and BGRA to nvenc.c

Message ID	06309fa1-a0e9-9a79-5e87-c82a03bd16b5@sky.com
State	Superseded
Headers	show Delivered-To: ffmpegpatchwork@gmail.com Received-SPF: pass (google.com: domain of ffmpeg-devel-bounces@ffmpeg.org designates 79.124.17.100 as permitted sender) client-ip=79.124.17.100; References: <67a158c2-8043-50ab-c3f4-3bbf805deedb@sky.com> <CAB0OVGrFk3_+-oF0bUjAq_jwOw=+P_7C4W3=w17iVUYXuKdDLA@mail.gmail.com> <a10c3398-ef09-20ec-2b86-f60030972dc1@sky.com> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> From: "Sven C. Dack" <sven.c.dack@sky.com> Message-ID: <06309fa1-a0e9-9a79-5e87-c82a03bd16b5@sky.com> Date: Wed, 7 Sep 2016 01:23:57 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.2.0 MIME-Version: 1.0 In-Reply-To: <a10c3398-ef09-20ec-2b86-f60030972dc1@sky.com> Subject: Re: [FFmpeg-devel] adding RGBA and BGRA to nvenc.c Precedence: list Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> Content-Transfer-Encoding: base64 Content-Type: text/plain; charset="utf-8"; Format="flowed" Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>

Sven C. Dack Sept. 7, 2016, 12:23 a.m. UTC

On 07/09/16 01:08, Sven C. Dack wrote:
> On 07/09/16 00:43, Carl Eugen Hoyos wrote:
>>
>> Should be AV_PIX_FMT_RGB0...
>>
>>
>> ... and AV_PIX_FMT_BGR0
>
> I was wondering about that. The 0 means undefined/unused and I didn't want to 
> open a can of worms with it. Should I add BGR0 and RGB0, too? Does the 0 mean 
> it is reliably 0?

Just tested it with BGR0 and RGB0 and it does encode at ~100 fp/s from the 
screen with it.

$ ffmpeg -f x11grab -framerate 333 -s 1920x1080 -i :0.0 -c:v hevc_nvenc -b:v 
4096k -y test.mkv

Sven

          return AVERROR(EINVAL);
@@ -1350,6 +1364,14 @@ static int nvenc_copy_frame(AVCodecContext *avctx, 
NvencSurface *inSurf,
          av_image_copy_plane(buf, lockBufferParams->pitch,
              frame->data[2], frame->linesize[2],
              avctx->width << 1, avctx->height);
+    } else if (frame->format == AV_PIX_FMT_RGBA || frame->format == 
AV_PIX_FMT_RGB0) {
+      av_image_copy_plane(buf, lockBufferParams->pitch,
+           frame->data[0], frame->linesize[0],
+           avctx->width << 2, avctx->height);
+    } else if (frame->format == AV_PIX_FMT_BGRA || frame->format == 
AV_PIX_FMT_BGR0) {
+      av_image_copy_plane(buf, lockBufferParams->pitch,
+           frame->data[0], frame->linesize[0],
+           avctx->width << 2, avctx->height);
      } else {
          av_log(avctx, AV_LOG_FATAL, "Invalid pixel format!\n");
          return AVERROR(EINVAL);

Timo Rothenpieler Sept. 7, 2016, 8:23 a.m. UTC | #1

>              avctx->width << 1, avctx->height);
> +    } else if (frame->format == AV_PIX_FMT_RGBA || frame->format ==
> AV_PIX_FMT_RGB0) {
> +      av_image_copy_plane(buf, lockBufferParams->pitch,
> +           frame->data[0], frame->linesize[0],
> +           avctx->width << 2, avctx->height);
> +    } else if (frame->format == AV_PIX_FMT_BGRA || frame->format ==
> AV_PIX_FMT_BGR0) {
> +      av_image_copy_plane(buf, lockBufferParams->pitch,
> +           frame->data[0], frame->linesize[0],
> +           avctx->width << 2, avctx->height);
>      } else {

These are identical, so please put them into one if.

Also, why is the twist from AV_PIX_FMT_RGBA to NV_ENC_BUFFER_FORMAT_ABGR
necessary?

The nvenc header describes it as "8 bit Packed A8B8G8R8", so did they
mess it up?

Carl Eugen Hoyos Sept. 7, 2016, 9:06 a.m. UTC | #2

Hi!

2016-09-07 2:23 GMT+02:00 Sven C. Dack <sven.c.dack@sky.com>:
> On 07/09/16 01:08, Sven C. Dack wrote:
>>
>> On 07/09/16 00:43, Carl Eugen Hoyos wrote:
>>>
>>> Should be AV_PIX_FMT_RGB0...

As pointed out by Timo, this should be AV_PIX_FMT_0BGR32...

>>> ... and AV_PIX_FMT_BGR0

... and this AV_PIX_FMT_0RGB32: It makes no difference on
little-endian, but matches the documentation.
Sorry for not realizing this yesterday!

> +    AV_PIX_FMT_RGB0,
> +    AV_PIX_FMT_BGR0,

> +    AV_PIX_FMT_RGBA,
> +    AV_PIX_FMT_BGRA,

In any case, please remove AV_PIX_FMT_RGBA and AV_PIX_FMT_BGRA,
the encoder cannot deal with transparency.

Carl Eugen

Paul B Mahol Sept. 7, 2016, 9:36 a.m. UTC | #3

On 9/7/16, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:
> Hi!
>
> 2016-09-07 2:23 GMT+02:00 Sven C. Dack <sven.c.dack@sky.com>:
>> On 07/09/16 01:08, Sven C. Dack wrote:
>>>
>>> On 07/09/16 00:43, Carl Eugen Hoyos wrote:
>>>>
>>>> Should be AV_PIX_FMT_RGB0...
>
> As pointed out by Timo, this should be AV_PIX_FMT_0BGR32...
>
>>>> ... and AV_PIX_FMT_BGR0
>
> ... and this AV_PIX_FMT_0RGB32: It makes no difference on
> little-endian, but matches the documentation.
> Sorry for not realizing this yesterday!
>
>> +    AV_PIX_FMT_RGB0,
>> +    AV_PIX_FMT_BGR0,
>
>> +    AV_PIX_FMT_RGBA,
>> +    AV_PIX_FMT_BGRA,
>
> In any case, please remove AV_PIX_FMT_RGBA and AV_PIX_FMT_BGRA,
> the encoder cannot deal with transparency.

And how do you know that?

Sven C. Dack Sept. 7, 2016, 9:40 a.m. UTC | #4

On 07/09/16 09:23, Timo Rothenpieler wrote:
>>               avctx->width << 1, avctx->height);
>> +    } else if (frame->format == AV_PIX_FMT_RGBA || frame->format ==
>> AV_PIX_FMT_RGB0) {
>> +      av_image_copy_plane(buf, lockBufferParams->pitch,
>> +           frame->data[0], frame->linesize[0],
>> +           avctx->width << 2, avctx->height);
>> +    } else if (frame->format == AV_PIX_FMT_BGRA || frame->format ==
>> AV_PIX_FMT_BGR0) {
>> +      av_image_copy_plane(buf, lockBufferParams->pitch,
>> +           frame->data[0], frame->linesize[0],
>> +           avctx->width << 2, avctx->height);
>>       } else {
> These are identical, so please put them into one if.
>
> Also, why is the twist from AV_PIX_FMT_RGBA to NV_ENC_BUFFER_FORMAT_ABGR
> necessary?
>
> The nvenc header describes it as "8 bit Packed A8B8G8R8", so did they
> mess it up?

It is necessary in order to make it work. The twist here is intentional as I 
pointed out earlier. If you do it the other way around as described in the 
documentation then you get false and missing colours.

I'd like to keep in the transparency channel unless you know there is an actual 
problem with it. The encoder may not use it, but it is no reason not to pass it 
on. Otherwise will RGBA/BGRA have to be converted into RGB0/BGR0 and you will 
again get a performance penalty.

Sven

Carl Eugen Hoyos Sept. 7, 2016, 10:20 a.m. UTC | #5

Hi!

> Am 07.09.2016 um 11:36 schrieb Paul B Mahol <onemda@gmail.com>:
> 
>> On 9/7/16, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:
>> Hi!
>> 
>> 2016-09-07 2:23 GMT+02:00 Sven C. Dack <sven.c.dack@sky.com>:
>>>> On 07/09/16 01:08, Sven C. Dack wrote:
>>>> 
>>>>> On 07/09/16 00:43, Carl Eugen Hoyos wrote:
>>>>> 
>>>>> Should be AV_PIX_FMT_RGB0...
>> 
>> As pointed out by Timo, this should be AV_PIX_FMT_0BGR32...
>> 
>>>>> ... and AV_PIX_FMT_BGR0
>> 
>> ... and this AV_PIX_FMT_0RGB32: It makes no difference on
>> little-endian, but matches the documentation.
>> Sorry for not realizing this yesterday!
>> 
>>> +    AV_PIX_FMT_RGB0,
>>> +    AV_PIX_FMT_BGR0,
>> 
>>> +    AV_PIX_FMT_RGBA,
>>> +    AV_PIX_FMT_BGRA,
>> 
>> In any case, please remove AV_PIX_FMT_RGBA and AV_PIX_FMT_BGRA,
>> the encoder cannot deal with transparency.
> 
> And how do you know that?

If the encoder would deal with transparency (which it does not) RGB0 would be wrong but it seems everybody agrees it is correct.

Carl Eugen

Timo Rothenpieler Sept. 7, 2016, 10:20 a.m. UTC | #6

>> Also, why is the twist from AV_PIX_FMT_RGBA to NV_ENC_BUFFER_FORMAT_ABGR
>> necessary?
>>
>> The nvenc header describes it as "8 bit Packed A8B8G8R8", so did they
>> mess it up?
> 
> It is necessary in order to make it work. The twist here is intentional
> as I pointed out earlier. If you do it the other way around as described
> in the documentation then you get false and missing colours.

Carl already pointed you to the correct, native-endian pixel formats,
which match with the nvenc documentation:

https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/pixfmt.h#L320

> I'd like to keep in the transparency channel unless you know there is an
> actual problem with it. The encoder may not use it, but it is no reason
> not to pass it on. Otherwise will RGBA/BGRA have to be converted into
> RGB0/BGR0 and you will again get a performance penalty.

NVENC itself lists the alpha channel. So keeping it should be fine and
save a conversion.

Carl Eugen Hoyos Sept. 7, 2016, 10:25 a.m. UTC | #7

Hi!

> Am 07.09.2016 um 11:40 schrieb "Sven C. Dack" <sven.c.dack@sky.com>:
> 
> On 07/09/16 09:23, Timo Rothenpieler wrote:

> Otherwise will RGBA/BGRA have to 
> be converted into RGB0/BGR0 

> and you will again get a performance penalty.

What makes you think so?
(The problem is that your encoder now reports "I support transparency" and this will lead to funny effects depending in the colour space of your input file.)

Please also implement my other suggestion to make documentation and implementation match (and to match existing and new colour spaces, there is already an endianness-depending pix_fmt).

Carl Eugen

Sven C. Dack Sept. 7, 2016, 10:50 a.m. UTC | #8

On 07/09/16 11:25, Carl Eugen Hoyos wrote:
> Hi!
>
>> Am 07.09.2016 um 11:40 schrieb "Sven C. Dack" <sven.c.dack@sky.com>:
>>
>> On 07/09/16 09:23, Timo Rothenpieler wrote:
>> Otherwise will RGBA/BGRA have to
>> be converted into RGB0/BGR0
>> and you will again get a performance penalty.
> What makes you think so?

I have tested it. What makes you think it wouldn't?

> (The problem is that your encoder now reports "I support transparency" and this will lead to funny effects depending in the colour space of your input file.)

No, it doesn't. The encoder does deal with transparency. It effectively ignores 
it. So it is save to pass it on without FFmpeg needing to set the value to 0 but 
let the encoder do it.

>
> Please also implement my other suggestion to make documentation and implementation match (and to match existing and new colour spaces, there is already an endianness-depending pix_fmt).

I have no idea what you are talking about. If you want to implement a new pixel 
format and colour space for what seems to be a mere twist in Nvidia's 
documentation then you are welcome to do so. I won't.

Sven

Carl Eugen Hoyos Sept. 7, 2016, 11:27 a.m. UTC | #9

2016-09-07 12:50 GMT+02:00 Sven C. Dack <sven.c.dack@sky.com>:
> On 07/09/16 11:25, Carl Eugen Hoyos wrote:
>>
>>> Am 07.09.2016 um 11:40 schrieb "Sven C. Dack" <sven.c.dack@sky.com>:
>>>
>>> On 07/09/16 09:23, Timo Rothenpieler wrote:
>>> Otherwise will RGBA/BGRA have to
>>> be converted into RGB0/BGR0
>>> and you will again get a performance penalty.
>>
>> What makes you think so?
>
> I have tested it. What makes you think it wouldn't?

This is a bug that should be fixed independently.

>> (The problem is that your encoder now reports "I support
>> transparency" and this will lead to funny effects depending
>> in the colour space of your input file.)
>
> No, it doesn't.

If your patch adds "RGBA" as supported colour space,
the encoder (by definition) announces that it supports
transparency.
nvenc does not support transparency (if it would, it would
be a - grave - bug to use "RGB0" in the patch). So please
do not add it.

Carl Eugen

Timo Rothenpieler Sept. 7, 2016, 11:40 a.m. UTC | #10

Am 07.09.2016 um 13:27 schrieb Carl Eugen Hoyos:
> 2016-09-07 12:50 GMT+02:00 Sven C. Dack <sven.c.dack@sky.com>:
>> On 07/09/16 11:25, Carl Eugen Hoyos wrote:
>>>
>>>> Am 07.09.2016 um 11:40 schrieb "Sven C. Dack" <sven.c.dack@sky.com>:
>>>>
>>>> On 07/09/16 09:23, Timo Rothenpieler wrote:
>>>> Otherwise will RGBA/BGRA have to
>>>> be converted into RGB0/BGR0
>>>> and you will again get a performance penalty.
>>>
>>> What makes you think so?
>>
>> I have tested it. What makes you think it wouldn't?
> 
> This is a bug that should be fixed independently.

libavutil/pixfmt.h defines AV_PIX_FMT_RGB0 and the other ones like this:

packed RGB 8:8:8, 32bpp, XRGBXRGB...   X=unused/undefined

So I would expect the Alpha-Channel to be anything, and converting from
RGBA to RGB0 to be a no-op "conversion".

Sven C. Dack Sept. 7, 2016, 1:26 p.m. UTC | #11

On 07/09/16 12:40, Timo Rothenpieler wrote:
> libavutil/pixfmt.h defines AV_PIX_FMT_RGB0 and the other ones like this:
>
> packed RGB 8:8:8, 32bpp, XRGBXRGB...   X=unused/undefined
>
> So I would expect the Alpha-Channel to be anything, and converting from
> RGBA to RGB0 to be a no-op "conversion".

It is not an issue. x11grab produces BGR0 and nvenc can handle it with the 
patch. It's giving me 100fp/s (up from 47fp/s) with a 1920x1080 monitor. I'd 
imagine people with 4K displays will be happy, too, although they will have to 
live with lower speeds of perhaps 30 fp/s. Would be interesting to know how it 
performs on 4K though.

If there is really an RGBA/BGRA input then it needs to be convert to RGB0/BGR0. 
Until then is it a theoretical issue. Might be the module producing RGBA/BGRA 
can produce RGB0/BGR0, too.

Sven

Timo Rothenpieler Sept. 7, 2016, 1:33 p.m. UTC | #12

Am 07.09.2016 um 15:26 schrieb Sven C. Dack:
> On 07/09/16 12:40, Timo Rothenpieler wrote:
>> libavutil/pixfmt.h defines AV_PIX_FMT_RGB0 and the other ones like this:
>>
>> packed RGB 8:8:8, 32bpp, XRGBXRGB...   X=unused/undefined
>>
>> So I would expect the Alpha-Channel to be anything, and converting from
>> RGBA to RGB0 to be a no-op "conversion".
> 
> It is not an issue. x11grab produces BGR0 and nvenc can handle it with
> the patch. It's giving me 100fp/s (up from 47fp/s) with a 1920x1080
> monitor. I'd imagine people with 4K displays will be happy, too,
> although they will have to live with lower speeds of perhaps 30 fp/s.
> Would be interesting to know how it performs on 4K though.
> 
> If there is really an RGBA/BGRA input then it needs to be convert to
> RGB0/BGR0. Until then is it a theoretical issue. Might be the module
> producing RGBA/BGRA can produce RGB0/BGR0, too.

0RGB/0BGR does not mean the alpha bits are zeroed.
It means they are undefined, so you convert from ARGB to 0RGB by doing
nothing. There is no performance to gain by supporting a format that
falsely advertises support for an alpha channel.

Also, the correct formats to use are AV_PIX_FMT_0RGB32, which
corresponds to NV_ENC_BUFFER_FORMAT_ARGB, and AV_PIX_FMT_0BGR32 for ABGR.

Will apply with those.

For the future, please use git format-patch, and ideally also git
send-email for your patches.
Attaching the patches is just fine though, preferably only one per mail
for patchwork to pick it up cleanly.

Andy Furniss Sept. 8, 2016, 10:01 a.m. UTC | #13

Sven C. Dack wrote:

> It is not an issue. x11grab produces BGR0 and nvenc can handle it
> with the patch. It's giving me 100fp/s (up from 47fp/s) with a
> 1920x1080 monitor. I'd imagine people with 4K displays will be happy,
> too, although they will have to live with lower speeds of perhaps 30
> fp/s. Would be interesting to know how it performs on 4K though.

Maybe

xrandr --output <output-name>  --fb 3840x2160 --panning 3840x2160

I don't know what it is about x11grab/CSC with ffmpeg, but on my
old CPU gstreamer is twice as fast.

Carl Eugen Hoyos Sept. 8, 2016, 1:13 p.m. UTC | #14

2016-09-08 12:01 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:

> I don't know what it is about x11grab/CSC with ffmpeg, but
> on my old CPU gstreamer is twice as fast.

x11grab or xcb?

Carl Eugen

Andy Furniss Sept. 8, 2016, 2:17 p.m. UTC | #15

Carl Eugen Hoyos wrote:
> 2016-09-08 12:01 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:
>
>> I don't know what it is about x11grab/CSC with ffmpeg, but
>> on my old CPU gstreamer is twice as fast.
>
> x11grab or xcb?

I guess xcb as I don't --enable anything related with configure.

Carl Eugen Hoyos Sept. 8, 2016, 2:51 p.m. UTC | #16

2016-09-08 16:17 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:
> Carl Eugen Hoyos wrote:
>>
>> 2016-09-08 12:01 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:
>>
>>> I don't know what it is about x11grab/CSC with ffmpeg, but
>>> on my old CPU gstreamer is twice as fast.
>>
>> x11grab or xcb?
>
> I guess xcb as I don't --enable anything related with configure.

Then please test if this is a regression over x11grab, use:
configure --disable-libxcb --enable-x11grab --enable-gpl

(It is very likely that nobody ever tested.)

Carl Eugen

Sven C. Dack Sept. 8, 2016, 2:52 p.m. UTC | #17

On 08/09/16 15:17, Andy Furniss wrote:
> Carl Eugen Hoyos wrote:
>> 2016-09-08 12:01 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:
>>
>>> I don't know what it is about x11grab/CSC with ffmpeg, but
>>> on my old CPU gstreamer is twice as fast.
>>
>> x11grab or xcb?
>
> I guess xcb as I don't --enable anything related with configure.

In case it got mistaken...

The following is what gives the 100 fp/s (with a GTX 960):
$ ffmpeg -f x11grab -framerate 200 -s hd1080 -i :0.0 -c:v hevc_nvenc -y test.mkv

The mentioned 100 fp/s refers to the HEVS/H.265 hardware encoder (nvenc). 
X11grab is here only the input device.

Without the encoder does it give me 160 fp/s:
$ ffmpeg -f x11grab -s hd1080 -framerate 500 -i :0.0 -f null -

Different sizes then give different results:

- For hd720 is it 200 fp/s (with nvenc) and 420 fp/s (without nvenc).
- For hd480 is it 460 fp/s (with nvenc) and 890 fp/s (without nvenc).

Can you compare gstreamer to those numbers?

Is anyone also capable of running this in a true 4K display with a Pascal card? 
There the speed matters the most.

Sven

Carl Eugen Hoyos Sept. 8, 2016, 2:58 p.m. UTC | #18

2016-09-08 16:52 GMT+02:00 Sven C. Dack <sven.c.dack@sky.com>:

> X11grab is here only the input device.

Of course.
I know that and I assume Andy also knows.

Still Andy claims that (our) xcb input is slower than GStreamer's x11
input and I would like him to test if the reason is our change to xcb.

Carl Eugen

Andy Furniss Sept. 8, 2016, 5:16 p.m. UTC | #19

Carl Eugen Hoyos wrote:
> 2016-09-08 16:17 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:
>> Carl Eugen Hoyos wrote:
>>>
>>> 2016-09-08 12:01 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:
>>>
>>>> I don't know what it is about x11grab/CSC with ffmpeg, but
>>>> on my old CPU gstreamer is twice as fast.
>>>
>>> x11grab or xcb?
>>
>> I guess xcb as I don't --enable anything related with configure.
>
> Then please test if this is a regression over x11grab, use:
> configure --disable-libxcb --enable-x11grab --enable-gpl
>
> (It is very likely that nobody ever tested.)

It is a bit faster, comparing results I posted on users to new build -

(vsync 2 for consistency as that's how the old tests were done
it isn't really needed/makes no difference for -f null but was used
as I previously also tested -f rawvideo -y /dev/null and when
using -framerate with that it is essential).

xcb -> x11grab.

ffmpeg -vsync 2 -framerate 200 -f x11grab -s 1920x1080 -i :0.0 -vframes 
1000 -f null -

138 -> 179 fps

ffmpeg -vsync 2 -framerate 200 -f x11grab -s 1920x1080 -i :0.0 -vframes 
1000 -pix_fmt yuv420p -f null -

47 -> 56 fps

ffmpeg -vsync 2 -framerate 200 -f x11grab -s 1920x1080 -i :0.0 -vframes 
1000 -pix_fmt nv12 -f null -

33 -> 37 fps

Carl Eugen Hoyos Sept. 8, 2016, 5:18 p.m. UTC | #20

2016-09-08 19:16 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:

> It is a bit faster

xcb is slower?

Carl Eugen

Andy Furniss Sept. 8, 2016, 5:32 p.m. UTC | #21

Carl Eugen Hoyos wrote:
> 2016-09-08 19:16 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:
>
>> It is a bit faster
>
> xcb is slower?

Yes (unless I am mixing things up)

--disable-libxcb --enable-x11grab is faster than autodetect

Nicolas George Sept. 8, 2016, 5:34 p.m. UTC | #22

Le tridi 23 fructidor, an CCXXIV, Carl Eugen Hoyos a écrit :
> xcb is slower?

Looking at the code, I see that xcbgrab allocates and frees the shared
memory segment for each frame, while x11grab only allocates it once at the
beginning. xcbgrab does that so that it can wrap the segment in a refcounted
buffer.

On the other hand, x11grab seems to be returning the shared memory segment
as the packet payload as is. If I am not mistaken, this is just invalid, and
works just by luck. It could be tested with the following procedure: grab a
packet, dump it, grab a second packet, dump the first one again.

Still, it is probably possible to do something faster than xcbgrab while
still correct: only allocate a new segment if the first one is still in use.

Regards,

Andy Furniss Sept. 8, 2016, 11:01 p.m. UTC | #23

Sven C. Dack wrote:
> On 08/09/16 15:17, Andy Furniss wrote:
>> Carl Eugen Hoyos wrote:
>>> 2016-09-08 12:01 GMT+02:00 Andy Furniss <adf.lists@gmail.com>:
>>>
>>>> I don't know what it is about x11grab/CSC with ffmpeg, but
>>>> on my old CPU gstreamer is twice as fast.
>>>
>>> x11grab or xcb?
>>
>> I guess xcb as I don't --enable anything related with configure.
>
> In case it got mistaken...
>
> The following is what gives the 100 fp/s (with a GTX 960):
> $ ffmpeg -f x11grab -framerate 200 -s hd1080 -i :0.0 -c:v hevc_nvenc -y
> test.mkv
>
> The mentioned 100 fp/s refers to the HEVS/H.265 hardware encoder
> (nvenc). X11grab is here only the input device.
>
> Without the encoder does it give me 160 fp/s:
> $ ffmpeg -f x11grab -s hd1080 -framerate 500 -i :0.0 -f null -
>
> Different sizes then give different results:
>
> - For hd720 is it 200 fp/s (with nvenc) and 420 fp/s (without nvenc).
> - For hd480 is it 460 fp/s (with nvenc) and 890 fp/s (without nvenc).
>
> Can you compare gstreamer to those numbers?

With gstreamer 1080p I can get around 350 fps testing like -

gst-launch-1.0 -f ximagesrc use-damage=0 startx=0 starty=0 endx=1919 
endy=1079 num-buffers=5000 ! queue ! videoconvert ! 
video/x-raw,framerate=500/1,format=BGRx ! fakesink
Setting pipeline to PAUSED ...
Pipeline is live and does not need PREROLL ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:14.205120141
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...

5000/14.205120141 = 351.98....

Of course I don't know if that's "real" as such.

I do know that I have really grabbed and encoded 1080p60 with my AMD h/w
and including nv12 conversion gives a sane looking result -

gst-launch-1.0 -f ximagesrc use-damage=0 startx=0 starty=0 endx=1919 
endy=1079 num-buffers=1000 ! queue ! videoconvert ! 
video/x-raw,framerate=100/1,format=NV12  ! fakesink
Setting pipeline to PAUSED ...
Pipeline is live and does not need PREROLL ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:14.419928745
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...

1000/14.419928745 = 69.3....

I also tried (fake) 2160p and got around 90fps BGRx and 17 nv12.

My PC is old = Phenom II x4 965 3.4GHz Cpus, 1333 ram.

Andy Furniss Sept. 10, 2016, 5:35 p.m. UTC | #24

Andy Furniss wrote:

> With gstreamer 1080p I can get around 350 fps testing like -
>
> gst-launch-1.0 -f ximagesrc use-damage=0 startx=0 starty=0 endx=1919
>  endy=1079 num-buffers=5000 ! queue ! videoconvert !
> video/x-raw,framerate=500/1,format=BGRx ! fakesink Setting pipeline
> to PAUSED ... Pipeline is live and does not need PREROLL ... Setting
> pipeline to PLAYING ... New clock: GstSystemClock Got EOS from
> element "pipeline0". Execution ended after 0:00:14.205120141 Setting
> pipeline to PAUSED ... Setting pipeline to READY ... Setting pipeline
> to NULL ... Freeing pipeline ...
>
> 5000/14.205120141 = 351.98....
>
> Of course I don't know if that's "real" as such.

Looking into this more with sysprof it seems with this test gstreamer is
twice as fast because it doesn't copy from shm but ffmpeg does.

Of course I may be misunderstanding, but >90% load according to sysprof
is __memcpy_sse2_unaligned.

ffmpeg is not using xcb for this test and like this both ffmpeg and
gstreamer are using XShmGetImage.

With gstreamer all calls to __memcpy_sse2_unaligned are from X
with ffmpeg half are from X and half ffmpeg.

OK I think, maybe the test was flawed, so I changed fakesink to filesink
location=/mnt/ramdisk/out.bgr0 where ramdisk is tmpfs. Sure enough it's
slower = 188fps which is more like ffmpeg (179 fps with -f null -).

Unfortunately if I force ffmpeg to really output to tmpfs using
-f rawvideo -y /mnt/ramdisk/out.bgr0 I only get 68fps so I am still slower.

FWIW -f rawvideo -y /dev/null gives 90fps.

Nicolas George Sept. 10, 2016, 5:50 p.m. UTC | #25

Le quintidi 25 fructidor, an CCXXIV, Andy Furniss a écrit :
> Looking into this more with sysprof it seems with this test gstreamer is
> twice as fast because it doesn't copy from shm but ffmpeg does.
> 
> Of course I may be misunderstanding, but >90% load according to sysprof
> is __memcpy_sse2_unaligned.

Can you detect what call to memcpy() in the ffmpeg code is responsible for
this?

I had an inkling that it was about the packet -> frame conversion, but it
seems rawdec does the right thing in this particular case. Still, I suggest
you add a quick debug av_log() in libavcodec/rawdec.c to check that
need_copy is false.

Also, did you have a look at my last mail on the topic:
http://ffmpeg.org/pipermail/ffmpeg-devel/2016-September/199156.html
about the speed difference between xcb and xlib?

Regards,

Sven C. Dack Sept. 10, 2016, 5:55 p.m. UTC | #26

On 10/09/16 18:35, Andy Furniss wrote:
> Andy Furniss wrote:
>
>> With gstreamer 1080p I can get around 350 fps testing like -
>>
>> gst-launch-1.0 -f ximagesrc use-damage=0 startx=0 starty=0 endx=1919
>>  endy=1079 num-buffers=5000 ! queue ! videoconvert !
>> video/x-raw,framerate=500/1,format=BGRx ! fakesink Setting pipeline
>> to PAUSED ... Pipeline is live and does not need PREROLL ... Setting
>> pipeline to PLAYING ... New clock: GstSystemClock Got EOS from
>> element "pipeline0". Execution ended after 0:00:14.205120141 Setting
>> pipeline to PAUSED ... Setting pipeline to READY ... Setting pipeline
>> to NULL ... Freeing pipeline ...
>>
>> 5000/14.205120141 = 351.98....
>>
>> Of course I don't know if that's "real" as such.
>
> Looking into this more with sysprof it seems with this test gstreamer is
> twice as fast because it doesn't copy from shm but ffmpeg does.
>
> Of course I may be misunderstanding, but >90% load according to sysprof
> is __memcpy_sse2_unaligned.
>
> ffmpeg is not using xcb for this test and like this both ffmpeg and
> gstreamer are using XShmGetImage.
>
> With gstreamer all calls to __memcpy_sse2_unaligned are from X
> with ffmpeg half are from X and half ffmpeg.
>
> OK I think, maybe the test was flawed, so I changed fakesink to filesink
> location=/mnt/ramdisk/out.bgr0 where ramdisk is tmpfs. Sure enough it's
> slower = 188fps which is more like ffmpeg (179 fps with -f null -).
>
> Unfortunately if I force ffmpeg to really output to tmpfs using
> -f rawvideo -y /mnt/ramdisk/out.bgr0 I only get 68fps so I am still slower.
>
> FWIW -f rawvideo -y /dev/null gives 90fps.
I'd be more interested in grabbing the desktop synchronous to the display 
refresh rate at this point. 60Hz is a key mark, going above it not so much. What 
would be the next mark anyway? 100Hz?

Andy Furniss Sept. 10, 2016, 6:48 p.m. UTC | #27

Nicolas George wrote:
> Le quintidi 25 fructidor, an CCXXIV, Andy Furniss a écrit :
>> Looking into this more with sysprof it seems with this test
>> gstreamer is twice as fast because it doesn't copy from shm but
>> ffmpeg does.
>>
>> Of course I may be misunderstanding, but >90% load according to
>> sysprof is __memcpy_sse2_unaligned.
>
> Can you detect what call to memcpy() in the ffmpeg code is
> responsible for this?

Not with sysprof unfortunately.

> I had an inkling that it was about the packet -> frame conversion,
> but it seems rawdec does the right thing in this particular case.
> Still, I suggest you add a quick debug av_log() in
> libavcodec/rawdec.c to check that need_copy is false.

I shall have go later this evening.

> Also, did you have a look at my last mail on the topic:
> http://ffmpeg.org/pipermail/ffmpeg-devel/2016-September/199156.html
> about the speed difference between xcb and xlib?

I did and thanks for taking an interest - as I don't really know much
about the code I thought that one was more for Carl Eugen :-)

But yes I get the version I am currently testing is unsafe - just it
seems to be more like gstreamer in using XShmGetImage so I thought the
sysprofs would be easier to compare.

Andy Furniss Sept. 10, 2016, 6:54 p.m. UTC | #28

Sven C. Dack wrote:

> I'd be more interested in grabbing the desktop synchronous to the
> display refresh rate at this point. 60Hz is a key mark, going above
> it not so much. What would be the next mark anyway? 100Hz?

That would be good - I am in a different situation to you though, in
that my h/w won't currently take RGB anyway. Benching bgr0 is just a
starting point. With nv12 conversion I can't do 60fps with ffmpeg but
can with gstreamer.

As for monitors  - there are plenty that do 144Hz marketed at gamers and
even at 60Hz there's increasing use of 1440p and 2160p screens to consider.

Andy Furniss Sept. 12, 2016, 1:51 p.m. UTC | #29

Andy Furniss wrote:

> I do know that I have really grabbed and encoded 1080p60 with my AMD
> h/w and including nv12 conversion gives a sane looking result -
>
> gst-launch-1.0 -f ximagesrc use-damage=0 startx=0 starty=0 endx=1919
>  endy=1079 num-buffers=1000 ! queue ! videoconvert !
> video/x-raw,framerate=100/1,format=NV12  ! fakesink Setting pipeline
> to PAUSED ... Pipeline is live and does not need PREROLL ... Setting
> pipeline to PLAYING ... New clock: GstSystemClock Got EOS from
> element "pipeline0". Execution ended after 0:00:14.419928745 Setting
> pipeline to PAUSED ... Setting pipeline to READY ... Setting pipeline
> to NULL ... Freeing pipeline ...
>
> 1000/14.419928745 = 69.3

Over the weekend I looked at the CSC aspect of this without using
x11grab = benching bgr0 on tmpfs to nv12 and managed with a bit of luck
to get ffmpeg to beat gstreamer.

Starting point gstreamer bgr0 to nv12 = 70fps, to I420 68fps.

ffmpeg benched using -f null as -f rawvideo to ram or /dev/null is
slower and I suspect/hope for my intended usage = vaapi upload -f null
will be more representative, but of course I don't know that.

ffmpeg -f rawvideo -s 1920x1080 -pix_fmt bgr0  -i /mnt/ramdisk/out.bgr0 
-pix_fmt nv12 -f null -

=41 fps, yuv420p = 66fps

So yuv420p is close to gstreamer but nv12 is poor.

By chance I wondered how much worse it would be if I used -sws_flags as
I have done in the past. Result it was faster, it turns out that
+full_chroma_inp takes yuv420p from 66 to 84fps and nv12 to 47fps.

The reason being that with no flags time is spent in bgr32toUV_half_c
with flag above I don't use that and see various sse in use like
ff_rgbatoUV_sse2.

nv12 is still too slow though. Looking with sysprof I see that time
is spent in yuv2nv12cX_c.

Seemed slow when remembering yuv420p -> nv12 conversions from the past
so I benched 1080p yuv420p -> nv12 and got > 500fps. Doing this didn't
use yuv2nv12cX_c at all so I got to make a new command line -

ffmpeg -f rawvideo -s 1920x1080 -pix_fmt bgr0  -i /mnt/ramdisk/out.bgr0 
-vf scale=flags=+full_chroma_inp,format=yuv420p,format=nv12 -f null -

= 78fps, nice.

So at least I can beat gstreamer on CSC now. Testing the new commandline
with x11grab gets me close to gst using the legacy x11grab = 65 fps.

libxcb x11grab is 52 fps though, so it would be good if that can be 
fixed up.

Carl Eugen Hoyos Sept. 14, 2016, 3:49 p.m. UTC | #30

Hi!

2016-09-08 19:34 GMT+02:00 Nicolas George <george@nsup.org>:

> Still, it is probably possible to do something faster than xcbgrab while
> still correct: only allocate a new segment if the first one is still in use.

Will you try this?

Thank you, Carl Eugen

[FFmpeg-devel] adding RGBA and BGRA to nvenc.c

Commit Message

Comments

Patch