diff mbox series

[FFmpeg-devel,1/2] decode: add ff_decode_skip_samples function

Message ID Nhyz9MY--3-9@lynne.ee
State New
Headers show
Series [FFmpeg-devel,1/2] decode: add ff_decode_skip_samples function | expand

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Lynne Oct. 30, 2023, 5:09 a.m. UTC
This is a convenience function, which is required to be called by decoders
needing to skip samples every time.
It automatically creates and increments side data.

The idea is to get rid of skip_samples eventually and replace it with this
function.

Patch attached.

Comments

Anton Khirnov Nov. 4, 2023, 10:41 a.m. UTC | #1
Quoting Lynne (2023-10-30 06:09:28)
> This is a convenience function, which is required to be called by decoders
> needing to skip samples every time.
> It automatically creates and increments side data.
> 
> The idea is to get rid of skip_samples eventually and replace it with this
> function.
> 
> Patch attached.
> 
> 
> From 41dfcbbacfa9232d2308d0229dcd172309b32f9f Mon Sep 17 00:00:00 2001
> From: Lynne <dev@lynne.ee>
> Date: Mon, 30 Oct 2023 05:38:17 +0100
> Subject: [PATCH 1/2] decode: add ff_decode_skip_samples function
> 
> This is a convenience function, which is required to be called by decoders
> needing to skip samples every time.
> It automatically creates and increments side data.
> 
> The idea is to get rid of skip_samples eventually and replace it with this
> function.
> ---
>  libavcodec/decode.c | 18 ++++++++++++++++++
>  libavcodec/decode.h |  9 +++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/libavcodec/decode.c b/libavcodec/decode.c
> index ad39021354..f971723ff7 100644
> --- a/libavcodec/decode.c
> +++ b/libavcodec/decode.c
> @@ -299,6 +299,24 @@ static int64_t guess_correct_pts(AVCodecContext *ctx,
>      return pts;
>  }
>  
> +int ff_decode_skip_samples(AVCodecContext *avctx, AVFrame *frame, uint32_t base_skip, uint32_t additional)

avctx seems unused.
Derek Buitenhuis Nov. 4, 2023, 4:22 p.m. UTC | #2
Hi,

I'm going to opine a bit here, and also comment on the mov/MP4 patch[0] that accompanies
this set.

This is for both historical purposes, and to distill IRC logs into something more
digestible for others on the list to gain context on the issue, so apologies for
re-treading ground.

On 10/30/2023 5:09 AM, Lynne wrote:
> This is a convenience function, which is required to be called by decoders
> needing to skip samples every time.
> It automatically creates and increments side data.
> 
> The idea is to get rid of skip_samples eventually and replace it with this
> function.

So there is a lot to cover here, and  lot of nuance to how things should be handled,
that I want to list out, before discussing the specific changes of these two sets of
patches. A bit of of a 'state of the world'.

The goal of this patchset seems to be to aid in gapless playback and correct seeking
in AAC streams (or later, more generally MDCT-styl audio codecs in general), but properly
skipping initial priming samples (which include pre-roll), pre-roll (both normal, and extra
incurred due to SBR) when seeking, and and, though not covered in these sets, I'll mention,
end padding.

First a note on terminology: 'Algorithmic delay' as it is being used here is not quite
correct, for two reasons:
    * Latency and pre-roll (or roll distance) are separate things. Opus, for example,
      can have a latency as low as 2.5ms, but pre-roll is always at least 80ms - they
      are different things which serve different purposes, and I confirmed this with
      people who definitely know more about audio than me[1]. Pre-roll is often larger
      than latency, and the values stored in file metadata reflect this.
    * Pre-roll, or roll distance, are the industry standard terms. Making up out own
      terms because we disagree is silly and stubborn, and makes it harder on API
      users trying to use the API correctly, or understnd our code.

Next, a quick breakdown of the AAC situation, in terms of both how this it is stored,
what we support, and the state of the ecosystem and types of files that exist:
    * 'Raw' ADTS streams have no way to store any of this. The best we can do is guess
      the pre-roll. We should not guess priming or end padding, as no matter what we do,
      it'll be wrong, and any value will be a cargo culted hack value.
    * MP4 - there are two places to store this metadata - one standard, and one proprietary
      Apple way. There are, separately, two ways to signal priming length when SBR is present.
       * MP4s may contain a user data box with 'iTunSMPB' which contains priming, pre-roll,
         and end padding data. We support reading only priming data from this at the moment,
         and we set skip samples based on this. This is 'iTunes style' metadata.
       * The standards compliant (read: non-iTunes) way is to use an edit list to trim the
         priming samples, and, opionally end padding. End padding may also be trimmed by reducing
         the sample duration of the last packet in the stts box. Pre-roll is store in the sgpb
         box with the 'roll', type, which signals the roll distance as a number of packets;
         for example, -1 indicates you should decode an discard the samples of 1 packet before
         beginning plaback. Notably, this allows the sgpd box to also be use for video like
         periodic intra refresh H.264. libavformat does not current parse or export this info,
         but even if we did, converting number of packets to audio samples can get hairy.
           * Notably, since in MP4, the edit list represents the exact presentation-level info,
             when no edit list, or an edit list startiing at 0 is present, no samples, not even
             pre-roll should be trimmed - all players in the wild handle this properly, and it
             has been standard practice among streaming services for >10 years to not output
             the AAC frames representing priming samples at all (even if there is a small hit
             quality). This is what the patch at [0] is addressing.
               * My personal opinion is that since priming samples include any inherent delay already,
                 that if we do not know how many priming samples there are, we should not trim anything
                 from the start of the file, regardless of format. I am keen on hearing others Opinions(TM)
                 here, particularily Anton and Martin (sorry for name dropping :)).
       * Further complicating matters is the fact that, again thanks to Apple, there are a lot
         of broken files around, since iTunes expects files to *not* include addition delay incurred
         by SBR in their edit list / priming info, even though, by spec, you are suppose to. This
         leads to the unfortunate case where you have tons of files in the wild that both do, and
         do not include SBR delay in their edit lists, and there is no way of detecting when this is
         the case. I do not have a good solution to this other than proposing a flag somewhere to
         switch between behaviors.

Aside, for Opus, we incorrectly read this info from the codec specific box instead of the sgpd box...
but we only ever put it in a write-only field.

Now, on to the patches / implementation (opinions warning):

    * As noted above, I don't think we should apply any guessed priming to initial samples (pre-roll,
      or 'algorithmic delay, included). No other decoders or players do this, in the world, to my
      knowledge, and violating the principal of least surpise because we think we're slightly more
      correct isn't great. I also think trying to 'fix' raw ADTS is destined to always be a hack,
      an we shouldn't. YMMV. I'd like to hear views from others here. This would make the patch in
      [0] redundant.
    * I don't think hardcoding a value of 1024 (or 960) is wise, or necessarily even correct. For example,
      all HE-AAC MP4s I have signal a seek pre-roll of 2048 samples. Ideally we would pass through seek
      pre-roll somehow - but I am not sure of a good way in our existing setup - the write-only codecpar
      field for seek pre-roll is in samples, which is kind of incompatible with the way the sgpd box stores
      pre-roll as number of packets. We could set it based one the duration of the first packet(s), assuming
      all audio codecs will have only 1 packet duration (frame size), or we could add a new field. Possibly,
      we could leave the hardcoded values in place, only for seeking, inside e.g. ADTS.
    * This kind of brings me to: I don't really think using the same side data for both priming and pre-roll
      is a great idea. Clearly this set of problems is already confusing enough to many, and conflating the
      two in the API is a sure way to only make it worse.
    * If we are going to discard at seek, we need to take this into account in the demuxer's seek code, or
      we will always be $preroll number of samples off where the user actually wanted to seek and decode.

I am almost certain I missed even more nuance, and hopefully Martin or Anton can chime in, or I remember more.

But also, given all of this, I think we need to deeply consider how we approach this, so we don't end up with
something that only covers certain cases (and I am sure I forgot more cases). To that end, I do not think rushing
to get a patchset that can change sync on all AAC files in existence into 6.1 is wise. Even when this does go in,
it should be able to sit in master for a good long time before being in a release. As I understand it, FATE is
already unhappy, and it shouldn't be treated as being a problem with FATE vs the set.

Lastly, some time this weekend/week, I will labour to create a more extensive set of AAC files we can use to
test, and use in FATE.

Hope that all made sense and I didn't forget any details in my Covid-induced haze.

Cheers,
- Derek

[0] http://ffmpeg.org/pipermail/ffmpeg-devel/2023-October/316389.html
[1] https://gist.github.com/dwbuiten/18745d6cb253a2304f776581c9f68b30
[2] https://github.com/nu774/fdkaac/blob/master/README#L165-L177
Derek Buitenhuis Nov. 4, 2023, 5:32 p.m. UTC | #3
On 11/4/2023 4:22 PM, Derek Buitenhuis wrote:
>          the sample duration of the last packet in the stts box. Pre-roll is store in the sgpb

Bah. Please try and ignore my various typos of 'sgpd'.

- Derek
Michael Niedermayer Nov. 4, 2023, 5:41 p.m. UTC | #4
On Sat, Nov 04, 2023 at 04:22:11PM +0000, Derek Buitenhuis wrote:
[...]
> I am almost certain I missed even more nuance, and hopefully Martin or Anton can chime in, or I remember more.
> 
> But also, given all of this, I think we need to deeply consider how we approach this, so we don't end up with
> something that only covers certain cases (and I am sure I forgot more cases). To that end, I do not think rushing
> to get a patchset that can change sync on all AAC files in existence into 6.1 is wise. Even when this does go in,
> it should be able to sit in master for a good long time before being in a release. As I understand it, FATE is
> already unhappy, and it shouldn't be treated as being a problem with FATE vs the set.
> 

> Lastly, some time this weekend/week, I will labour to create a more extensive set of AAC files we can use to
> test, and use in FATE.

I think thats a very good idea.
Ideally we would have a set of test samples encoded with all major encoders
samples where start, end and sync can be easily determined unambigously

once we have such a set, it becomes easy to check that we get the start/end/sync
right for every file by default.

I think the guiding principle should be that what comes out of the decoder is
as close to what went into the encoder. And of course to comply to specifications


>
> Hope that all made sense and I didn't forget any details in my Covid-induced haze.

i hope you get better soon!

thx

[...]
Lynne Nov. 4, 2023, 8:33 p.m. UTC | #5
Nov 4, 2023, 17:22 by derek.buitenhuis@gmail.com:

> Hi,
>
> I'm going to opine a bit here, and also comment on the mov/MP4 patch[0] that accompanies
> this set.
>
> This is for both historical purposes, and to distill IRC logs into something more
> digestible for others on the list to gain context on the issue, so apologies for
> re-treading ground.
>
> On 10/30/2023 5:09 AM, Lynne wrote:
>
>> This is a convenience function, which is required to be called by decoders
>> needing to skip samples every time.
>> It automatically creates and increments side data.
>>
>> The idea is to get rid of skip_samples eventually and replace it with this
>> function.
>>
>
> So there is a lot to cover here, and  lot of nuance to how things should be handled,
> that I want to list out, before discussing the specific changes of these two sets of
> patches. A bit of of a 'state of the world'.
>
> The goal of this patchset seems to be to aid in gapless playback and correct seeking
> in AAC streams (or later, more generally MDCT-styl audio codecs in general), but properly
> skipping initial priming samples (which include pre-roll), pre-roll (both normal, and extra
> incurred due to SBR) when seeking, and and, though not covered in these sets, I'll mention,
> end padding.
>
> First a note on terminology: 'Algorithmic delay' as it is being used here is not quite
> correct, for two reasons:
>  * Latency and pre-roll (or roll distance) are separate things. Opus, for example,
>  can have a latency as low as 2.5ms, but pre-roll is always at least 80ms - they
>  are different things which serve different purposes, and I confirmed this with
>  people who definitely know more about audio than me[1]. Pre-roll is often larger
>  than latency, and the values stored in file metadata reflect this.
>

Algorithmic delay is an industry standard term for those that do codec research,
and is accurate. Algorithmic delay == your definition of latency.
(latency is not a technically completely valid term, as latency has blurry definition
outside of real-time situations, which decoding may or may not be).
It is not equal to pre-roll.

It is a codec-dependent value, defined by the sum of all delay that happens within
each algorithm (block) of a codec.
It is exactly defined to:
 - 1024 samples for AAC (21.3̅ milliseconds), due to the MDCT overlap process.
 - 120 samples for Opus (2.5 milliseconds), due to the MDCT overlap process
 - 320 samples for Siren (G.719, 6.6̅ milliseconds), due to the MDCT overlap process
 - Equal to the frame size for Vorbis, due to it having flexible blocks.
- 1024 samples for ATRAC9 (21.3̅ milliseconds), due to the MDCT overlap process.
 - 256 samples for AC-3 (5.3̅ milliseconds), due to the MDCT overlap process.

Quoting from a few official documents:

RFC6716 "Definition of the Opus Audio Codec":
"The inverse MDCT implementation has no special characteristics.  The input is N frequency-domain samples and the output is 2*N time-domain samples, while scaling by 1/2.  A "low-overlap" window reduces the algorithmic delay."

ISO/IEC 14496-3, the bible itself on the very problem we're trying to solve:
"To enable coding of general audio signals with an algorithmic delay not exceeding 20 ms at 48 kHz, it uses a frame length of 512 or 480 samples (compared to the 1024 or 960 samples used in standard MPEG-2/4 AAC)."

G.719:
"The codec operates on 20 ms frames, and the algorithmic delay end-to-end is 40 ms. The encoder input and decoder output are sampled at 48 kHz."
(this one gives the algorithmic delay as the sum of both the decoder and the encoder's delays during real-time operation).


>  * Pre-roll, or roll distance, are the industry standard terms. Making up out own
>  terms because we disagree is silly and stubborn, and makes it harder on API
>  users trying to use the API correctly, or understnd our code.
>

You call me silly and stubborn for trying to use a standard term for a mathematically
defined value upon which to base other values (your definition of pre-roll, which contains
the algorithmic delay), so that there is no disagreement in what we talk about.
I defined the algorithmic delay earlier in our discussion and outlined the reason to do so.
And disagreement in what we talk about is exactly what happened.

With that out of the way, I will not stop calling what I have just defined as algorithmic delay.
Nor will I want an explanation on why after saying algorithmic delay so many times, you
keep reading pre-roll, and you keep trying to interpret every value as pre-roll, even
in the chat log you posted.

Now, let me begin by defining that the amount of samples you have to necessarily strip
from any MDCT-based audio codec at the start is exactly equal to its algorithmic delay.
**Including Opus**.
The output audio signal will be timing-accurate compared to the input signal,
as long as the encoder did not add any extra samples.

Continuing, let me define the exact number of samples that have to be stripped
from any MDCT-based audio codec after a seek is also exactly equal to its algorithmic delay.
**Including Opus**, in case the encoder signals that each coarse energy value is
independent and not delta/inter/differentially coded.

End of mathematically defined values.

In case the encoder signals that coarse energy levels are to be signaled as deltas between
each frame, the amount of frames needed varies, as each value will, as a result of a guided
random walk, converge to the encoder's value closer over each frame.
The threshold for which one decides that the value is close enough is SUBJECTIVE, and also
varies not with time, but with number of frames. Opus supports variable frame sizes.
Some may generally state that the amount of time needed for convergence at 20 millisecond
frames is 80 milliseconds (4 packets), but if the frame size the encoder uses are 2.5 milliseconds,
then a much lower amount of time would be needed to get a subjectively identical level of
convergence, perhaps 6 frames (14 milliseconds).
I keep saying subjective, as there is no exact mathematical definition for when the coarse
energy levels converge to exactly the levels the encoder has. If a specially crated stream is
used, this could be guaranteed to never happen, even within what I've been referring to
as a subjective threshold.
This is from someone who knows more about Opus than anyone else except maybe those
who created it.

Allow me to define pre-roll with a definition that matches yours:
Pre-roll shall be any amount of samples needed to be skipped at the start of decoding,
and it shall not be lower than the algorithmic delay.
Pre-roll shall contain any additional encoder-side induced delay.


> Next, a quick breakdown of the AAC situation, in terms of both how this it is stored,
> what we support, and the state of the ecosystem and types of files that exist:
>  * 'Raw' ADTS streams have no way to store any of this. The best we can do is guess
>  the pre-roll. We should not guess priming or end padding, as no matter what we do,
>  it'll be wrong, and any value will be a cargo culted hack value.
>

The amount of samples to strip off, would be, and to match
*every other codec we have*, would be 1024, the algorithmic delay.
The exact mathematically defined extra audio samples the decoder produces.
I will not accept calling this a "cargo culted hack value". It is, as I demonstrated earlier,
an exactly defined value that you can learn by reading the spec.
It is not a guess, it is a guarantee. It is a guarantee, that any codec which does not
insert any extra samples of padding, will meet, and the result will be that, compared to
the input signal pre-encoding, no samples will be present at the start.
This is not a fantasy, this is a reality when using our aac encoder.


>  * MP4 - there are two places to store this metadata - one standard, and one proprietary
>  Apple way. There are, separately, two ways to signal priming length when SBR is present.
>

I would like to add that there is a third way - that nothing is signalled. Given what you have
said about MP4 timelines, those require that nothing is stripped off from audio data.
From my point of view, this is a special case, not a general case valid for all containers.


>  * MP4s may contain a user data box with 'iTunSMPB' which contains priming, pre-roll,
>  and end padding data. We support reading only priming data from this at the moment,
>  and we set skip samples based on this. This is 'iTunes style' metadata.
>  * The standards compliant (read: non-iTunes) way is to use an edit list to trim the
>  priming samples, and, opionally end padding. End padding may also be trimmed by reducing
>  the sample duration of the last packet in the stts box. Pre-roll is store in the sgpb
>  box with the 'roll', type, which signals the roll distance as a number of packets;
>  for example, -1 indicates you should decode an discard the samples of 1 packet before
>  beginning plaback. Notably, this allows the sgpd box to also be use for video like
>  periodic intra refresh H.264. libavformat does not current parse or export this info,
>  but even if we did, converting number of packets to audio samples can get hairy.
>  * Notably, since in MP4, the edit list represents the exact presentation-level info,
>  when no edit list, or an edit list startiing at 0 is present, no samples, not even
>  pre-roll should be trimmed - all players in the wild handle this properly, and it
>  has been standard practice among streaming services for >10 years to not output
>  the AAC frames representing priming samples at all (even if there is a small hit
>  quality). This is what the patch at [0] is addressing.
>  * My personal opinion is that since priming samples include any inherent delay already,
>  that if we do not know how many priming samples there are, we should not trim anything
>  from the start of the file, regardless of format. I am keen on hearing others Opinions(TM)
>  here, particularily Anton and Martin (sorry for name dropping :)).
>

For MP4-only, I would be inclined to agree. For anything else, I will disagree.
Everything else, including users, expect no additional samples to be present at the
start, and assume so.


>  * Further complicating matters is the fact that, again thanks to Apple, there are a lot
>  of broken files around, since iTunes expects files to *not* include addition delay incurred
>  by SBR in their edit list / priming info, even though, by spec, you are suppose to. This
>  leads to the unfortunate case where you have tons of files in the wild that both do, and
>  do not include SBR delay in their edit lists, and there is no way of detecting when this is
>  the case. I do not have a good solution to this other than proposing a flag somewhere to
>  switch between behaviors.
>
> Aside, for Opus, we incorrectly read this info from the codec specific box instead of the sgpd box...
> but we only ever put it in a write-only field.
>
> Now, on to the patches / implementation (opinions warning):
>
>  * As noted above, I don't think we should apply any guessed priming to initial samples (pre-roll,
>  or 'algorithmic delay, included). No other decoders or players do this, in the world, to my
>  knowledge, and violating the principal of least surpise because we think we're slightly more
>  correct isn't great. I also think trying to 'fix' raw ADTS is destined to always be a hack,
>  an we shouldn't. YMMV. I'd like to hear views from others here. This would make the patch in
>  [0] redundant.
>

I would be willing to accept this for MP4 and maybe ADTS.
But I will not accept not stripping off mathematically defined as redundant samples,
which we strip off for every other codec with a known amount, for anything else, if no
data is present to indicate how many samples to strip.
Our encoder matches this, and so does every single other audio codec. Just because everyone
else introduces additional padding with their encoder does not mean that it's a guess.
It is a defined *minimum* value that any well-written encoder can meet.


>  * I don't think hardcoding a value of 1024 (or 960) is wise, or necessarily even correct. For example,
>  all HE-AAC MP4s I have signal a seek pre-roll of 2048 samples. Ideally we would pass through seek
>  pre-roll somehow - but I am not sure of a good way in our existing setup - the write-only codecpar
>  field for seek pre-roll is in samples, which is kind of incompatible with the way the sgpd box stores
>  pre-roll as number of packets. We could set it based one the duration of the first packet(s), assuming
>  all audio codecs will have only 1 packet duration (frame size), or we could add a new field. Possibly,
>  we could leave the hardcoded values in place, only for seeking, inside e.g. ADTS.
>

The value is not hardcoded. If side data is present, the side data will be used. The value I have put there is equal to the algorithmic delay, which, like I mentioned earlier, is a minimum.
The encoder may know more, so I would accept any container value as long as it is larger than the algorithmic delay.


>  * This kind of brings me to: I don't really think using the same side data for both priming and pre-roll
>  is a great idea. Clearly this set of problems is already confusing enough to many, and conflating the
>  two in the API is a sure way to only make it worse.
>  * If we are going to discard at seek, we need to take this into account in the demuxer's seek code, or
>  we will always be $preroll number of samples off where the user actually wanted to seek and decode.
>

Correct. We have to take this into account while muxing.


> I am almost certain I missed even more nuance, and hopefully Martin or Anton can chime in, or I remember more.
>
> But also, given all of this, I think we need to deeply consider how we approach this, so we don't end up with
> something that only covers certain cases (and I am sure I forgot more cases). To that end, I do not think rushing
> to get a patchset that can change sync on all AAC files in existence into 6.1 is wise. Even when this does go in,
>

I did agree to this, and you did see it before you sent this.


> it should be able to sit in master for a good long time before being in a release. As I understand it, FATE is
> already unhappy, and it shouldn't be treated as being a problem with FATE vs the set.
>
> Lastly, some time this weekend/week, I will labour to create a more extensive set of AAC files we can use to
> test, and use in FATE.
>
Martin Storsjö Nov. 4, 2023, 11:02 p.m. UTC | #6
Hi,

Just following up on this - I'm sorry I haven't been able to look at the 
proposed patchset myself quite in detail yet.

My prime concern is about the requests to have this merged into the 
upcoming 6.1 release; that's way too soon IMO.

These patches do change aspects of how these things behave, that have been 
working the same for a very long time, so there are all sorts of potential 
subtle breakage, or (incorrect or not) assumptions being broken, across 
libavcodec and its users.

On Sat, 4 Nov 2023, Derek Buitenhuis wrote:

> Next, a quick breakdown of the AAC situation, in terms of both how this it is stored,
> what we support, and the state of the ecosystem and types of files that exist:
>    * 'Raw' ADTS streams have no way to store any of this. The best we can do is guess
>      the pre-roll. We should not guess priming or end padding, as no matter what we do,
>      it'll be wrong, and any value will be a cargo culted hack value.

I share this concern; all the various encoders I've seen have used a 
different amount of priming samples, so guessing it will be bound to be 
wrong in a lot of the cases.

>    * MP4 - there are two places to store this metadata - one standard, and one proprietary
>      Apple way. There are, separately, two ways to signal priming length when SBR is present.
>       * MP4s may contain a user data box with 'iTunSMPB' which contains priming, pre-roll,
>         and end padding data. We support reading only priming data from this at the moment,
>         and we set skip samples based on this. This is 'iTunes style' metadata.
>       * The standards compliant (read: non-iTunes) way is to use an edit list to trim the
>         priming samples, and, opionally end padding. End padding may also be trimmed by reducing
>         the sample duration of the last packet in the stts box. Pre-roll is store in the sgpb
>         box with the 'roll', type, which signals the roll distance as a number of packets;
>         for example, -1 indicates you should decode an discard the samples of 1 packet before
>         beginning plaback. Notably, this allows the sgpd box to also be use for video like
>         periodic intra refresh H.264. libavformat does not current parse or export this info,
>         but even if we did, converting number of packets to audio samples can get hairy.
>           * Notably, since in MP4, the edit list represents the exact presentation-level info,
>             when no edit list, or an edit list startiing at 0 is present, no samples, not even
>             pre-roll should be trimmed - all players in the wild handle this properly, and it
>             has been standard practice among streaming services for >10 years to not output
>             the AAC frames representing priming samples at all (even if there is a small hit
>             quality). This is what the patch at [0] is addressing.

FWIW, MP4 isn't the only container where this might be relevant; AAC is 
frequently used in muxes together with video in FLV and MKV and others as 
well.

In the case of FLV, I'm not aware of any metadata that signals how much to 
trim off, so essentially we can't do it by guessing. On the producing 
side, this is handled by shifting the timestamps so the audio track, which 
would be starting at -<delay>, ends up starting at 0, and the video track 
ends up starting at +<audiodelay> instead.

In that case, if we trim off the priming samples (based on a guess as 
that's all we have?), I guess that'd lead us to both tracks starting at 
+<delay> (i.e. not affecting sync). As long as it doesn't change sync, I 
guess it can be tolerable.

To avoid all these effects, producers of muxed files can work around this 
in many ways. For many years, I've been doing the trick of skipping the 
first <delay> samples of input to the audio encoder, so that after 
accounting for that, I have both audio and video tracks starting at 0.0, 
without the decoder needing to do anything - working the same across all 
players, good and bad.

If we suddenly start decoding such files with the audio track starting at 
+<delay>, I guess it'll be ok for sync, but it's a mildly surprising 
change, but hopefully any reasonable player based on libavcodec would 
still not freak out by it.

>    * As noted above, I don't think we should apply any guessed priming to initial samples (pre-roll,
>      or 'algorithmic delay, included). No other decoders or players do this, in the world, to my
>      knowledge, and violating the principal of least surpise because we think we're slightly more
>      correct isn't great. I also think trying to 'fix' raw ADTS is destined to always be a hack,
>      an we shouldn't. YMMV. I'd like to hear views from others here. This would make the patch in
>      [0] redundant.

Yes, with raw ADTS there's really no good way of getting this right, other 
than plain guessing, and there's no single universally correct guess 
AFAIK.

(And even if we have a qualified guess for the amount of encoder priming, 
we have even less knowledge about how much to trim off at the end, if 
we're aiming at proper gapless playback.)

For MP4 there's at least a couple ways of signalling it.

> But also, given all of this, I think we need to deeply consider how we 
> approach this, so we don't end up with something that only covers 
> certain cases (and I am sure I forgot more cases). To that end, I do not 
> think rushing to get a patchset that can change sync on all AAC files in 
> existence into 6.1 is wise. Even when this does go in, it should be able 
> to sit in master for a good long time before being in a release.

+1. This has the potential to be surprising in many different cases, and 
may need a bunch of follow up patches to sort out cases found later. It 
definitely should sit in git master for a some time before ending up in a 
release - not be slipped into 6.1 the week before the release.

// Martin
Lynne Nov. 4, 2023, 11:27 p.m. UTC | #7
Nov 5, 2023, 00:02 by martin@martin.st:

> Hi,
>
> Just following up on this - I'm sorry I haven't been able to look at the proposed patchset myself quite in detail yet.
>
> My prime concern is about the requests to have this merged into the upcoming 6.1 release; that's way too soon IMO.
>
> These patches do change aspects of how these things behave, that have been working the same for a very long time, so there are all sorts of potential subtle breakage, or (incorrect or not) assumptions being broken, across libavcodec and its users.
>
> On Sat, 4 Nov 2023, Derek Buitenhuis wrote:
>
>> Next, a quick breakdown of the AAC situation, in terms of both how this it is stored,
>> what we support, and the state of the ecosystem and types of files that exist:
>>  * 'Raw' ADTS streams have no way to store any of this. The best we can do is guess
>>  the pre-roll. We should not guess priming or end padding, as no matter what we do,
>>  it'll be wrong, and any value will be a cargo culted hack value.
>>
>
> I share this concern; all the various encoders I've seen have used a different amount of priming samples, so guessing it will be bound to be wrong in a lot of the cases.
>
>> * MP4 - there are two places to store this metadata - one standard, and one proprietary
>>  Apple way. There are, separately, two ways to signal priming length when SBR is present.
>>  * MP4s may contain a user data box with 'iTunSMPB' which contains priming, pre-roll,
>>  and end padding data. We support reading only priming data from this at the moment,
>>  and we set skip samples based on this. This is 'iTunes style' metadata.
>>  * The standards compliant (read: non-iTunes) way is to use an edit list to trim the
>>  priming samples, and, opionally end padding. End padding may also be trimmed by reducing
>>  the sample duration of the last packet in the stts box. Pre-roll is store in the sgpb
>>  box with the 'roll', type, which signals the roll distance as a number of packets;
>>  for example, -1 indicates you should decode an discard the samples of 1 packet before
>>  beginning plaback. Notably, this allows the sgpd box to also be use for video like
>>  periodic intra refresh H.264. libavformat does not current parse or export this info,
>>  but even if we did, converting number of packets to audio samples can get hairy.
>>  * Notably, since in MP4, the edit list represents the exact presentation-level info,
>>  when no edit list, or an edit list startiing at 0 is present, no samples, not even
>>  pre-roll should be trimmed - all players in the wild handle this properly, and it
>>  has been standard practice among streaming services for >10 years to not output
>>  the AAC frames representing priming samples at all (even if there is a small hit
>>  quality). This is what the patch at [0] is addressing.
>>
>
> FWIW, MP4 isn't the only container where this might be relevant; AAC is frequently used in muxes together with video in FLV and MKV and others as well.
>
> In the case of FLV, I'm not aware of any metadata that signals how much to trim off, so essentially we can't do it by guessing. On the producing side, this is handled by shifting the timestamps so the audio track, which would be starting at -<delay>, ends up starting at 0, and the video track ends up starting at +<audiodelay> instead.
>
> In that case, if we trim off the priming samples (based on a guess as that's all we have?), I guess that'd lead us to both tracks starting at +<delay> (i.e. not affecting sync). As long as it doesn't change sync, I guess it can be tolerable.
>
> To avoid all these effects, producers of muxed files can work around this in many ways. For many years, I've been doing the trick of skipping the first <delay> samples of input to the audio encoder, so that after accounting for that, I have both audio and video tracks starting at 0.0, without the decoder needing to do anything - working the same across all players, good and bad.
>
> If we suddenly start decoding such files with the audio track starting at +<delay>, I guess it'll be ok for sync, but it's a mildly surprising change, but hopefully any reasonable player based on libavcodec would still not freak out by it.
>
>> * As noted above, I don't think we should apply any guessed priming to initial samples (pre-roll,
>>  or 'algorithmic delay, included). No other decoders or players do this, in the world, to my
>>  knowledge, and violating the principal of least surpise because we think we're slightly more
>>  correct isn't great. I also think trying to 'fix' raw ADTS is destined to always be a hack,
>>  an we shouldn't. YMMV. I'd like to hear views from others here. This would make the patch in
>>  [0] redundant.
>>
>
> Yes, with raw ADTS there's really no good way of getting this right, other than plain guessing, and there's no single universally correct guess AFAIK.
>

My point is that at least we can make an attempt at it by removing the minimum number of samples, the one the decoder produces. Which also fully fixes the case where the encoder used was ours.


> (And even if we have a qualified guess for the amount of encoder priming, we have even less knowledge about how much to trim off at the end, if we're aiming at proper gapless playback.)
>

That I agree with.


> For MP4 there's at least a couple ways of signalling it.
>
>> But also, given all of this, I think we need to deeply consider how we approach this, so we don't end up with something that only covers certain cases (and I am sure I forgot more cases). To that end, I do not think rushing to get a patchset that can change sync on all AAC files in existence into 6.1 is wise. Even when this does go in, it should be able to sit in master for a good long time before being in a release.
>>
>
> +1. This has the potential to be surprising in many different cases, and may need a bunch of follow up patches to sort out cases found later. It definitely should sit in git master for a some time before ending up in a release - not be slipped into 6.1 the week before the release.
>

For the third time, I did agree to this. This is my patch you're talking about.
diff mbox series

Patch

From 41dfcbbacfa9232d2308d0229dcd172309b32f9f Mon Sep 17 00:00:00 2001
From: Lynne <dev@lynne.ee>
Date: Mon, 30 Oct 2023 05:38:17 +0100
Subject: [PATCH 1/2] decode: add ff_decode_skip_samples function

This is a convenience function, which is required to be called by decoders
needing to skip samples every time.
It automatically creates and increments side data.

The idea is to get rid of skip_samples eventually and replace it with this
function.
---
 libavcodec/decode.c | 18 ++++++++++++++++++
 libavcodec/decode.h |  9 +++++++++
 2 files changed, 27 insertions(+)

diff --git a/libavcodec/decode.c b/libavcodec/decode.c
index ad39021354..f971723ff7 100644
--- a/libavcodec/decode.c
+++ b/libavcodec/decode.c
@@ -299,6 +299,24 @@  static int64_t guess_correct_pts(AVCodecContext *ctx,
     return pts;
 }
 
+int ff_decode_skip_samples(AVCodecContext *avctx, AVFrame *frame, uint32_t base_skip, uint32_t additional)
+{
+    uint32_t val = 0;
+    AVFrameSideData *side = av_frame_get_side_data(frame, AV_FRAME_DATA_SKIP_SAMPLES);
+    if (!side) {
+        side = av_frame_new_side_data(frame, AV_FRAME_DATA_SKIP_SAMPLES, 10);
+        if (!side)
+            return AVERROR(ENOMEM);
+        AV_WL32(side->data, base_skip);
+    }
+
+    val += AV_RL32(side->data);
+    val += additional;
+    AV_WL32(side->data, val);
+
+    return 0;
+}
+
 static int discard_samples(AVCodecContext *avctx, AVFrame *frame, int64_t *discarded_samples)
 {
     AVCodecInternal *avci = avctx->internal;
diff --git a/libavcodec/decode.h b/libavcodec/decode.h
index daf1a67444..647f091da9 100644
--- a/libavcodec/decode.h
+++ b/libavcodec/decode.h
@@ -155,4 +155,13 @@  int ff_hwaccel_frame_priv_alloc(AVCodecContext *avctx, void **hwaccel_picture_pr
 const AVPacketSideData *ff_get_coded_side_data(const AVCodecContext *avctx,
                                                enum AVPacketSideDataType type);
 
+/**
+ * Skip samples in an AVFrame.
+ *
+ * @param base_skip amount of samples to skip if no side data is present
+ * @param additional amount of samples to skip unconditionally
+ */
+int ff_decode_skip_samples(AVCodecContext *avctx, AVFrame *frame,
+                           uint32_t base_skip, uint32_t additional);
+
 #endif /* AVCODEC_DECODE_H */
-- 
2.42.0