[FFmpeg-devel] Fix quadratic memory use in ff_h2645_extract_rbsp() when multiple NALUs exist in packet.

Message ID 20171019184647.80641-1-nbowe@google.com

Commit Message

Niki Bowe Oct. 19, 2017, 6:46 p.m. UTC
Found via fuzzing.
/tmp/poc is a 1 MB mpegts file generated via fuzzing, where 1 packet has many NALUs
Before this change:
  $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe /tmp/poc 2>&1 | tail -n 1
  	2158192 Max Resident Set Size (Kb)
After this change:
  $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe /tmp/poc 2>&1 | tail -n 1
  	1046812 Max Resident Set Size (Kb)
---
 libavcodec/h2645_parse.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

Comments

Carl Eugen Hoyos Oct. 19, 2017, 10:39 p.m. UTC | #1
2017-10-19 20:46 GMT+02:00 Nikolas Bowe <nbowe-at-google.com@ffmpeg.org>:
> Found via fuzzing.
> /tmp/poc is a 1 MB mpegts file generated via fuzzing, where 1 packet has many NALUs
> Before this change:
>   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe /tmp/poc 2>&1 | tail -n 1
>         2158192 Max Resident Set Size (Kb)
> After this change:
>   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe /tmp/poc 2>&1 | tail -n 1
>         1046812 Max Resident Set Size (Kb)

This does not look like a fix for "quadratic" memory consumption, or
do I misunderstand?

Does the patch have a measurable speed impact?

Carl Eugen
Niki Bowe Oct. 23, 2017, 11:43 p.m. UTC | #2
On Thu, Oct 19, 2017 at 3:39 PM, Carl Eugen Hoyos <ceffmpeg@gmail.com>
wrote:

> 2017-10-19 20:46 GMT+02:00 Nikolas Bowe <nbowe-at-google.com@ffmpeg.org>:
> > Found via fuzzing.
> > /tmp/poc is a 1 MB mpegts file generated via fuzzing, where 1 packet has
> many NALUs
> > Before this change:
> >   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe
> /tmp/poc 2>&1 | tail -n 1
> >         2158192 Max Resident Set Size (Kb)
> > After this change:
> >   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe
> /tmp/poc 2>&1 | tail -n 1
> >         1046812 Max Resident Set Size (Kb)
>
> This does not look like a fix for "quadratic" memory consumption, or
> do I misunderstand?
>

Before this patch, each NALU's rbsp_buffer would be sized from the start of
that NALU to the end of the packet, not to the end of the NALU.
This would occur for each NALU in the packet, so for a packet of size N the
total memory allocated across all the rbsp_buffers would be
N + (N - x1) + (N - x2) + ..., where x_i is the offset of the i-th NALU.
This is quadratic in the number of NALUs in the packet.
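
To make that concrete (round, hypothetical numbers rather than the actual
fuzzer file): a 1 MB packet containing 1000 NALUs spaced about 1 KB apart
gets rbsp_buffers of roughly 1000 KB, 999 KB, 998 KB, and so on, totalling
about 500 MB of allocations for a 1 MB input.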

The fuzzer's proof of concept file is a bit extreme. It has over 2000 small
NALUs in one packet.
An easier way to trigger this would be to put some small NALUs (perhaps
SEI) in front of large IDRs. Each small NALU added to the front adds
roughly another packet-sized rbsp_buffer to the total allocated for that
packet.


> Does the patch have a measurable speed impact?
>
>
Is there a standard set of benchmarks I can run?

For typical videos the speed impact is small, due to NALUs fitting in cache,
but for videos with many large NALUs there can be some slowdown.

Here is the decode time for some typical video and some extreme cases,
taking the result with the best user+system time of 3 runs.
Measured via:
  for i in {1..3}; do
    /usr/bin/time -f "\t%U User CPU seconds\n\t%S System CPU seconds\n\t%M Max Resident Set Size (Kb)" \
      ./ffmpeg -loglevel quiet -nostats -threads 1 -i $file -f null -
  done

Tears of Steel HD
Somewhat typical HD short movie. 728 MB, 8314kb/s
  no patch:
113.69 User CPU seconds
0.60 System CPU seconds
44784 Max Resident Set Size (Kb)
  patch:
112.52 User CPU seconds
0.40 System CPU seconds
44780 Max Resident Set Size (Kb)
  1% slower.

Tears of Steel 4k
Somewhat typical high-ish bitrate 4k movie. 73244 kb/s.
  no patch:
682.70 User CPU seconds
2.99 System CPU seconds
104420 Max Resident Set Size (Kb)
  patch:
716.06 User CPU seconds
4.08 System CPU seconds
103632 Max Resident Set Size (Kb)
  5% slower.

random 50 Mbps i-only video I had laying around
  no patch:
421.33 User CPU seconds
1.21 System CPU seconds
28284 Max Resident Set Size (Kb)
  patch
423.00 User CPU seconds
1.98 System CPU seconds
27300 Max Resident Set Size (Kb)
   <1% slower

foo_200M.ts
10 seconds of /dev/urandom at 4k, encoded at 200 Mbps using 2-pass, 2-second
GOP (it actually used P frames even though it's encoding random noise).
  no patch:
11.52 User CPU seconds
0.19 System CPU seconds
68668 Max Resident Set Size (Kb)
  patch:
11.92 User CPU seconds
0.15 System CPU seconds
68656 Max Resident Set Size (Kb)
  3% slower

large_nals.ts
This is an extreme case; I tried to come up with a very bad case for
slowdown. Every packet has very large VCL NALUs.
Generated via ./ffmpeg -f rawvideo -video_size 3840x2160 -pixel_format
yuv420p -framerate 30 -i /dev/urandom -t 5 -c:v libx264 -preset ultrafast
-crf 0 -g 1 -y large_nals.ts
Each packet in large_nals.ts is close to 20 MB. There are 5 NALUs per packet:
9 (AUD), 7 (SPS), 8 (PPS), 6 (SEI), 5 (IDR slice).
This is of course unrealistically large at over 4 Gbps, but it should be a
decent worst-case example.
  no patch:
46.76 User CPU seconds
3.10 System CPU seconds
199148 Max Resident Set Size (Kb)
  patch:
54.42 User CPU seconds
1.79 System CPU seconds
156720 Max Resident Set Size (Kb)
  13% slower when there are very large NALUs (also 21% less memory usage)



> Carl Eugen
>
Carl Eugen Hoyos Oct. 23, 2017, 11:56 p.m. UTC | #3
2017-10-24 1:43 GMT+02:00 Niki Bowe <nbowe-at-google.com@ffmpeg.org>:
> On Thu, Oct 19, 2017 at 3:39 PM, Carl Eugen Hoyos <ceffmpeg@gmail.com>
> wrote:

>> Does the patch have a measurable speed impact?
>>
> Is there a standard set of benchmarks I can run?
>
> For typical videos the speed impact is small, due to NALUs fitting in cache,
> but for videos with many large NALUs there can be some slowdown.

(A 5% overall slowdown would make any patch unacceptable,
but I doubt this is the case.)

Use the TIMER macros from libavutil/timer.h; put them around all
calls to ff_h2645_extract_rbsp().
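
For example, something like this (a rough sketch only, not a literal diff;
the call shown and its argument names are abbreviated):

  #include "libavutil/timer.h"

  /* around each call site, e.g. in ff_h2645_packet_split(): */
  {
      START_TIMER
      ret = ff_h2645_extract_rbsp(buf, extract_length, nal, small_padding);
      STOP_TIMER("ff_h2645_extract_rbsp")
  }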

Carl Eugen
Kieran Kunhya Oct. 31, 2017, 1:42 a.m. UTC | #4
On Tue, 24 Oct 2017 at 00:56 Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:

> 2017-10-24 1:43 GMT+02:00 Niki Bowe <nbowe-at-google.com@ffmpeg.org>:
> > On Thu, Oct 19, 2017 at 3:39 PM, Carl Eugen Hoyos <ceffmpeg@gmail.com>
> > wrote:
>
> >> Does the patch have a measurable speed impact?
> >>
> > Is there a standard set of benchmarks I can run?
> >
> > For typical videos the speed impact is small, due to NALUs fitting in
> cache,
> > but for videos with many large NALUs there can be some slowdown.
>
> (A 5% overall slowdown would make any patch unacceptable,
> but I doubt this is the case.)
>
> Use the TIMER macros from libavutil/timer.h; put them around all
> calls to ff_h2645_extract_rbsp().
>

Related to https://trac.ffmpeg.org/ticket/6789, we see huge memory
allocations in this code, so if this patch fixes that, it should be
committed irrespective of any speed loss.

Kieran
Kieran Kunhya Oct. 31, 2017, 1:50 a.m. UTC | #5
On Tue, 31 Oct 2017 at 01:42 Kieran Kunhya <kierank@obe.tv> wrote:

> On Tue, 24 Oct 2017 at 00:56 Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:
>
>> 2017-10-24 1:43 GMT+02:00 Niki Bowe <nbowe-at-google.com@ffmpeg.org>:
>> > On Thu, Oct 19, 2017 at 3:39 PM, Carl Eugen Hoyos <ceffmpeg@gmail.com>
>> > wrote:
>>
>> >> Does the patch have a measurable speed impact?
>> >>
>> > Is there a standard set of benchmarks I can run?
>> >
>> > For typical videos the speed impact is small, due to NALUs fitting in
>> cache,
>> > but for videos with many large NALUs there can be some slowdown.
>>
>> (A 5% overall slowdown would make any patch unacceptable,
>> but I doubt this is the case.)
>>
>> Use the TIMER macros from libavutil/timer.h; put them around all
>> calls to ff_h2645_extract_rbsp().
>>
>
> Related to https://trac.ffmpeg.org/ticket/6789, we see huge memory
> allocations in this code, so if this patch fixes that, it should be
> committed irrespective of any speed loss.
>
> Kieran
>

I confirm this patch fixes ticket 6789.

Kieran
Michael Niedermayer Oct. 31, 2017, 2:03 a.m. UTC | #6
Hi

On Mon, Oct 23, 2017 at 04:43:55PM -0700, Niki Bowe wrote:
> On Thu, Oct 19, 2017 at 3:39 PM, Carl Eugen Hoyos <ceffmpeg@gmail.com>
> wrote:
> 
> > 2017-10-19 20:46 GMT+02:00 Nikolas Bowe <nbowe-at-google.com@ffmpeg.org>:
> > > Found via fuzzing.
> > > /tmp/poc is a 1 MB mpegts file generated via fuzzing, where 1 packet has
> > many NALUs
> > > Before this change:
> > >   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe
> > /tmp/poc 2>&1 | tail -n 1
> > >         2158192 Max Resident Set Size (Kb)
> > > After this change:
> > >   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe
> > /tmp/poc 2>&1 | tail -n 1
> > >         1046812 Max Resident Set Size (Kb)
> >
> > This does not look like a fix for "quadratic" memory consumption, or
> > do I misunderstand?
> >
> 
> Before this patch, each NALU's rbsp_buffer would be sized from the start of
> that NALU to the end of the packet, not to the end of the NALU.
> This would occur for each NALU in the packet, so for a packet of size N the
> total memory allocated across all the rbsp_buffers would be
> N + (N - x1) + (N - x2) + ..., where x_i is the offset of the i-th NALU.
> This is quadratic in the number of NALUs in the packet.

A better solution would be to allocate a buffer sized based on the
whole packet and then have all the NALs point into that.
That way there's no need to know the size of the first NAL during
allocation.
This should have even lower memory overhead than your code,
fewer alloc/free calls, and no speed loss.
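
Roughly like this (a hypothetical sketch with made-up names, just to show
the shape of the idea, not actual h2645_parse.c code):

  #include <stdint.h>
  #include <stddef.h>

  typedef struct {
      uint8_t *pkt_rbsp;      /* one buffer shared by all NALs in a packet */
      size_t   pkt_rbsp_size; /* sized once: packet length + one padding   */
      size_t   used;          /* running write offset                      */
  } PktBuf;

  /* Reserve space for the next NAL's unescaped RBSP.  No per-NAL malloc,
   * and since the buffer is never grown mid-packet, earlier NALs' pointers
   * into it stay valid.  Each NAL would then do something like
   * nal->data = reserve_rbsp(&pkt_buf, nal_max_size); */
  static uint8_t *reserve_rbsp(PktBuf *b, size_t max_size)
  {
      uint8_t *p;
      if (b->used + max_size > b->pkt_rbsp_size)
          return NULL;        /* caller sized the buffer for the packet */
      p = b->pkt_rbsp + b->used;
      b->used += max_size;
      return p;
  }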

[...]
Michael Niedermayer Oct. 31, 2017, 2:11 a.m. UTC | #7
On Thu, Oct 19, 2017 at 11:46:47AM -0700, Nikolas Bowe wrote:
> Found via fuzzing.
> /tmp/poc is a 1 MB mpegts file generated via fuzzing, where 1 packet has many NALUs
> Before this change:
>   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe /tmp/poc 2>&1 | tail -n 1
>   	2158192 Max Resident Set Size (Kb)
> After this change:
>   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe /tmp/poc 2>&1 | tail -n 1
>   	1046812 Max Resident Set Size (Kb)
> ---
>  libavcodec/h2645_parse.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/libavcodec/h2645_parse.c b/libavcodec/h2645_parse.c
> index b0d9ff66f0..e77689f347 100644
> --- a/libavcodec/h2645_parse.c
> +++ b/libavcodec/h2645_parse.c
> @@ -32,7 +32,7 @@
>  int ff_h2645_extract_rbsp(const uint8_t *src, int length,
>                            H2645NAL *nal, int small_padding)
>  {
> -    int i, si, di;
> +    int i, si, di, nsc;
>      uint8_t *dst;
>      int64_t padding = small_padding ? 0 : MAX_MBPAIR_SIZE;
>  
> @@ -91,8 +91,17 @@ int ff_h2645_extract_rbsp(const uint8_t *src, int length,
>      } else if (i > length)
>          i = length;
>  
> +    // Find next NAL start code, if present, to reduce rbsp_buffer size when
> +    // multiple NALUs.
> +    for (nsc = i; nsc + 2 < length; nsc++) {
> +        if (src[nsc] == 0 && src[nsc + 1] == 0 && src[nsc + 2] == 1)
> +          break;
> +    }
> +    if (nsc + 2 == length)
> +        nsc = length;
> +
>      av_fast_padded_malloc(&nal->rbsp_buffer, &nal->rbsp_buffer_size,
> -                          length + padding);
> +                          nsc + padding);
>      if (!nal->rbsp_buffer)
>          return AVERROR(ENOMEM);

This reduces memory consumption from quadratic to linear, but I think
it can still be made to allocate very large amounts of memory.
That is, with many small NAL units, MAX_MBPAIR_SIZE would be allocated
for each in the worst case.
So this does fix the quadratic issue but not the OOM issue.
Using the same buffer for all would fix it, unless I am missing something.
Using the same buffer avoids the padding needs for all but the last,
so it's a lot less memory for many small NAL units.
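
(For scale, assuming MAX_MBPAIR_SIZE at its current 256 KiB: a packet with
2000 tiny NAL units would still cost around 500 MiB in padding alone with
per-NAL buffers, versus a single 256 KiB of padding with one shared buffer.)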

[...]
Michael Niedermayer Oct. 31, 2017, 2:25 a.m. UTC | #8
On Thu, Oct 19, 2017 at 11:46:47AM -0700, Nikolas Bowe wrote:
> Found via fuzzing.
> /tmp/poc is a 1 MB mpegts file generated via fuzzing, where 1 packet has many NALUs
> Before this change:
>   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe /tmp/poc 2>&1 | tail -n 1
>   	2158192 Max Resident Set Size (Kb)
> After this change:
>   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe /tmp/poc 2>&1 | tail -n 1
>   	1046812 Max Resident Set Size (Kb)
> ---
>  libavcodec/h2645_parse.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)

This patch also fixes 2145/clusterfuzz-testcase-minimized-5866217724182528;
that should be added to the commit message.

(though as said, this fix is not ideal or complete, I would very much
 prefer if this would be fixed by using a single buffer or any other
 solution that avoids the speed loss.)

Also please tell me in case you choose not to work on this further.

thx

[...]
Kieran Kunhya Oct. 31, 2017, 9:24 a.m. UTC | #9
On Tue, 31 Oct 2017 at 02:26 Michael Niedermayer <michael@niedermayer.cc>
wrote:

> On Thu, Oct 19, 2017 at 11:46:47AM -0700, Nikolas Bowe wrote:
> > Found via fuzzing.
> > /tmp/poc is a 1 MB mpegts file generated via fuzzing, where 1 packet has
> many NALUs
> > Before this change:
> >   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe
> /tmp/poc 2>&1 | tail -n 1
> >       2158192 Max Resident Set Size (Kb)
> > After this change:
> >   $ /usr/bin/time -f "\t%M Max Resident Set Size (Kb)"  ./ffprobe
> /tmp/poc 2>&1 | tail -n 1
> >       1046812 Max Resident Set Size (Kb)
> > ---
> >  libavcodec/h2645_parse.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
>
> This patch also fixes 2145/clusterfuzz-testcase-minimized-5866217724182528;
> that should be added to the commit message.
>
> (though as said, this fix is not ideal or complete, I would very much
>  prefer if this would be fixed by using a single buffer or any other
>  solution that avoids the speed loss.)
>
> Also please tell me in case you choose not to work on this further.
>
> thx
>
> [...]
>

Hi,

I left the sample in https://trac.ffmpeg.org/ticket/6789 running overnight;
it still leaks with this patch, just much more slowly.
So there is still a related (but separate?) bug here.

Kieran
Kieran Kunhya Oct. 31, 2017, 8:14 p.m. UTC | #10
>
> This reduces memory consumption from quadratic to linear, but I think
> it can still be made to allocate very large amounts of memory.
> That is, with many small NAL units, MAX_MBPAIR_SIZE would be allocated
> for each in the worst case.
> So this does fix the quadratic issue but not the OOM issue.
> Using the same buffer for all would fix it, unless I am missing something.
> Using the same buffer avoids the padding needs for all but the last,
> so it's a lot less memory for many small NAL units.
>

This wouldn't fix the issue because the nals_allocated cache could still
grow with a crafted stream.

Kieran
Derek Buitenhuis Oct. 31, 2017, 8:24 p.m. UTC | #11
On 10/31/2017 2:25 AM, Michael Niedermayer wrote:
> (though as said, this fix is not ideal or complete, I would very much
>  prefer if this would be fixed by using a single buffer or any other
>  solution that avoids the speed loss.)

Using a single buffer would be marginally faster, but it does not solve
the underlying problem, which is that the NAL "cache" (nals_allocated)
seems to be cumulative, and the size of each buffer in it seems to be
the largest observed size of a NAL in that position.

Consider I could craft a stream that contains, in order:

Has 1999 tiny NALs, followed by 1 10MiB NAL, in packet 1.
Has 1998 tiny NALs, followed by 1 10MiB NAL, in packet 2.
.
.
.
Has 500 tiny NALs, followed by 1 10MiB NAL, in packet 1500.
.
.
.
And so forth.

The result would be that we have 2000 10MiB buffers allocated
in the NAL memory "pool" (nals_allocated == 2000), which will
persist until the decoder is deinitialized.
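
(For scale, that is 2000 x 10 MiB, roughly 20 GiB, held for the lifetime
of the decoder.)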

Am I missing something here?

P.S. I see Kieran mailed the same thing as I wrote this.

- Derek
Kieran Kunhya Oct. 31, 2017, 8:28 p.m. UTC | #12
>
> Am I missing something here?
>
> P.S. I see Kieran mailed the same thing as I wrote this.
>

Further to Derek's excellent explanation, I think the best way is to go
back to the old "in-place" NAL parsing code for H264.
I will prepare a patch to do this.

Kieran
Niki Bowe Oct. 31, 2017, 10:30 p.m. UTC | #13
Thank you Kieran, Michael, Derek, and Carl.

I ran some benchmarks and the patch I provided really does have a speed
impact, so I was considering alternative ways to achieve the same result.
After a little thought last week I came to the same conclusion as Michael:
a single rbsp_buffer per packet, with the NALs pointing into it, would be
ideal because:
 * it allows a single allocation for all the NALs, sized to the size of the
packet (+ padding; see next point)
 * I think the padding is only to allow unchecked bitstream reading, so we
only need 1 unit of padding (MAX_MBPAIR_SIZE) if we have a single buffer,
rather than 1 per NAL.
Also, IIRC it used to be this way a long time ago.

Likewise, skipped_bytes_pos could also be made a single buffer for the
packet rather than 1 per NALU. Probably not as big a benefit there,
though, besides avoiding some allocations.

I was going to prepare a patch, but it sounds like Kieran is going to.
Thank you Kieran. Let me know if there's anything you want me to do, or if
you would prefer me to do it for the experience.

As for the NAL memory pool, I think I see what you mean. You are saying
h264dec.c reuses the H2645Packet in decode_nal_units, so any growth in
nals_allocated from previous packets carries over to later packets, and
the nals themselves also carry over even if the current packet has a
small pkt->nb_nals.
Moving rbsp_buffer (and maybe skipped_bytes_pos) out of the nals and into
H2645Packet should fix the worst of this.
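
In struct terms, something shaped like this (a sketch only; field names
are loosely modelled on the real ones in h2645_parse.h):

  typedef struct H2645NAL {
      const uint8_t *data;       /* would point into rbsp_buffer below */
      int size;
      /* ... */
  } H2645NAL;

  typedef struct H2645Packet {
      H2645NAL *nals;
      int nb_nals;
      int nals_allocated;
      uint8_t *rbsp_buffer;      /* moved here: one buffer per packet, */
      int rbsp_buffer_size;      /* not one per NAL                    */
  } H2645Packet;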




On Tue, Oct 31, 2017 at 1:28 PM, Kieran Kunhya <kierank@obe.tv> wrote:

> >
> > Am I missing something here?
> >
> > P.S. I see Kieran mailed the same thing as I wrote this.
> >
>
> Further to Derek's excellent explanation, I think the best way is to go
> back to the old "in-place" NAL parsing code for H264.
> I will prepare a patch to do this.
>
> Kieran
>
Kieran Kunhya Oct. 31, 2017, 11:13 p.m. UTC | #14
On Tue, 31 Oct 2017 at 22:36 Niki Bowe <nbowe-at-google.com@ffmpeg.org>
wrote:

> I was going to prepare a patch, but it sounds like Kieran is going to.
> Thank you Kieran. Let me know if there's anything you want me to do, or if
> you would prefer me to do it for the experience.
>
> As for the NAL memory pool, I think I see what you mean. You are saying
> h264dec.c reuses the H2645Packet in decode_nal_units, so any growth in
> nals_allocated from previous packets carries over to later packets, and
> the nals themselves also carry over even if the current packet has a
> small pkt->nb_nals.
> Moving rbsp_buffer (and maybe skipped_bytes_pos) out of the nals and into
> H2645Packet should fix the worst of this.
>

I am not sure this is possible; I will see tomorrow.
Just FYI, we are regularly seeing OOMs on some 24/7 live streams as a
result of h2645_parser.

Kieran
Michael Niedermayer Nov. 1, 2017, 1:36 a.m. UTC | #15
On Tue, Oct 31, 2017 at 08:24:37PM +0000, Derek Buitenhuis wrote:
> On 10/31/2017 2:25 AM, Michael Niedermayer wrote:
> > (though as said, this fix is not ideal or complete, I would very much
> >  prefer if this would be fixed by using a single buffer or any other
> >  solution that avoids the speed loss.)
> 
> Using a single buffer would be marginally faster, but it does not solve
> the underlying problem, which is that the NAL "cache" (nals_allocated)
> seems to be cumulative, and the size of each buffer in it seems to be
> the largest observed size of a NAL in that position.
> 
> Consider I could craft a stream that contains, in order:
> 
> Has 1999 tiny NALs, followed by 1 10MiB NAL, in packet 1.
> Has 1998 tiny NALs, followed by 1 10MiB NAL, in packet 2.
> .
> .
> .
> Has 500 tiny NALs, followed by 1 10MiB NAL, in packet 1500.
> .
> .
> .
> And so forth.
> 
> The result would be that we have 2000 10MiB buffers allocated
> in the NAL memory "pool" (nals_allocated == 2000), which will
> persist until the decoder is deinitialized.
>
> Am I missing something here?

The idea would be that there would only be one uint8_t buffer and the
2000 entries from the pool would point into that.
So as a larger NAL shifts through the 2000, the pointers would get
distributed differently, but the size would not grow.
Any variable-size buffer the H2645NAL needs would be such a "shared"
buffer.

I don't know if there's anything that would make this non-trivial to
implement.


>
> P.S. I see Kieran mailed the same thing as I wrote this.
> 
> - Derek
Derek Buitenhuis Nov. 1, 2017, 1:45 p.m. UTC | #16
On 11/1/2017 1:36 AM, Michael Niedermayer wrote:
> The idea would be that there would only be one uint8_t buffer and the
> 2000 entries from the pool would point into that.
> So as a larger NAL shifts through the 2000, the pointers would get
> distributed differently, but the size would not grow.
> Any variable-size buffer the H2645NAL needs would be such a "shared"
> buffer.

This fixes the type of case I mentioned, but it doesn't fix, for example,
a never-ending stream (e.g. a TV channel): if you hit a bunch of huge NALs
in a packet, the single buffer will be resized to fit them all, and so
instead of a momentary memory usage spike it lasts forever (or until you
restart the stream / reinit the decoder).

- Derek
Kieran Kunhya Nov. 2, 2017, 10:48 p.m. UTC | #17
>
> The idea would be that there would only be one uint8_t buffer and the
> 2000 entries from the pool would point into that.
> So as a larger NAL shifts through the 2000, the pointers would get
> distributed differently, but the size would not grow.
> Any variable-size buffer the H2645NAL needs would be such a "shared"
> buffer.
>
> I don't know if there's anything that would make this non-trivial to
> implement.
>

I have tried this using the following patch but it does not work:
https://www.irccloud.com/pastebin/qobTcW9d/

Nothing obviously seems wrong, so I suspect it's not possible to do this
whilst reusing the code between decoder and parser.
The old code used to write to a per-slice context, not a per-packet one,
so this might be related.

Kieran
Derek Buitenhuis Nov. 2, 2017, 11:38 p.m. UTC | #18
On 11/2/2017 10:48 PM, Kieran Kunhya wrote:
> I have tried this using the following patch but it does not work:
> https://www.irccloud.com/pastebin/qobTcW9d/
> 
> Nothing obviously seems wrong, so I suspect it's not possible to do this
> whilst reusing the code between decoder and parser.
> The old code used to write to a per-slice context, not a per-packet one,
> so this might be related.

As discussed on IRC, the issue was dangling pointers due to the realloc.

It can be allocated based on the full packet size, outside the function,
or on first call, I think.

Aside:

Do we have some preferred way to address the cumulative memory issue
I mentioned in my previous mail, or is it a DNC situation? I haven't
thought of a good way to solve that - only stuff like going back
to in-place NAL parsing, or shrinking the buffer after some time based
on some metric (ew...).

- Derek
Hendrik Leppkes Nov. 2, 2017, 11:51 p.m. UTC | #19
On Fri, Nov 3, 2017 at 12:38 AM, Derek Buitenhuis
<derek.buitenhuis@gmail.com> wrote:
> On 11/2/2017 10:48 PM, Kieran Kunhya wrote:
>> I have tried this using the following patch but it does not work:
>> https://www.irccloud.com/pastebin/qobTcW9d/
>>
>> Nothing obviously seems wrong, so I suspect it's not possible to do this
>> whilst reusing the code between decoder and parser.
>> The old code used to write to a per-slice context, not a per-packet one,
>> so this might be related.
>
> As discussed on IRC, the issue was dangling pointers due to the realloc.
>
> It can be allocated based on the full packet size, outside the function,
> or on first call, I think.
>
> Aside:
>
> Do we have some preferred way to address the cumulative memory issue
> I mentioned in my previous mail, or is it a DNC situation? I haven't
> thought of a good way to solve that - only stuff like going back
> to in-place NAL parsing, or shrinking the buffer after some time based
> on some metric (ew...).
>

Personally I don't think the problem of a single buffer getting way
too large is a huge problem; at least it should not accumulate and
keep growing. It's also not a "new" problem - the old code would have
suffered from the same issue as well, since it also cached the rbsp
buffer, just one instead of multiple.

Another option would be to re-arrange the API to keep it shared but go
back to iterator-style parsing of the NALs: parse one, process it,
parse the next. That way you can also avoid the issue with a multitude
of NAL structs being allocated and sitting there forever.
Obviously H264 would be compatible with that, as that's how it was used
before. I didn't check HEVC, however.
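
As a rough sketch of what iterator-style parsing could look like
(hypothetical names and simplified start-code handling, not an existing
FFmpeg API):

  #include <stdint.h>

  typedef struct {
      const uint8_t *buf;   /* the raw packet */
      int size;
      int pos;              /* current scan position */
  } NalIter;

  /* Scan for the next 00 00 01 start code at or after pos; return its
   * offset, or size if none is found. */
  static int find_start_code(const uint8_t *buf, int size, int pos)
  {
      for (; pos + 2 < size; pos++)
          if (buf[pos] == 0 && buf[pos + 1] == 0 && buf[pos + 2] == 1)
              return pos;
      return size;
  }

  /* Hand out one NAL per call; the caller processes it before asking for
   * the next, so only one RBSP buffer ever needs to be alive at a time. */
  static int nal_iter_next(NalIter *it, const uint8_t **nal, int *nal_size)
  {
      int sc, start;
      sc = find_start_code(it->buf, it->size, it->pos);
      if (sc >= it->size)
          return 0;                 /* no more NALs in this packet */
      start     = sc + 3;
      *nal      = it->buf + start;
      *nal_size = find_start_code(it->buf, it->size, start) - start;
      it->pos   = start;            /* next call resumes after this NAL */
      return 1;
  }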

- Hendrik

Patch

diff --git a/libavcodec/h2645_parse.c b/libavcodec/h2645_parse.c
index b0d9ff66f0..e77689f347 100644
--- a/libavcodec/h2645_parse.c
+++ b/libavcodec/h2645_parse.c
@@ -32,7 +32,7 @@ 
 int ff_h2645_extract_rbsp(const uint8_t *src, int length,
                           H2645NAL *nal, int small_padding)
 {
-    int i, si, di;
+    int i, si, di, nsc;
     uint8_t *dst;
     int64_t padding = small_padding ? 0 : MAX_MBPAIR_SIZE;
 
@@ -91,8 +91,17 @@  int ff_h2645_extract_rbsp(const uint8_t *src, int length,
     } else if (i > length)
         i = length;
 
+    // Find next NAL start code, if present, to reduce rbsp_buffer size when
+    // multiple NALUs.
+    for (nsc = i; nsc + 2 < length; nsc++) {
+        if (src[nsc] == 0 && src[nsc + 1] == 0 && src[nsc + 2] == 1)
+          break;
+    }
+    if (nsc + 2 == length)
+        nsc = length;
+
     av_fast_padded_malloc(&nal->rbsp_buffer, &nal->rbsp_buffer_size,
-                          length + padding);
+                          nsc + padding);
     if (!nal->rbsp_buffer)
         return AVERROR(ENOMEM);