
[FFmpeg-devel,v20,02/20] avutil/frame: Prepare AVFrame for subtitle handling

Message ID DM8P223MB03658B8D2E1841A6D1D76552BA6C9@DM8P223MB0365.NAMP223.PROD.OUTLOOK.COM
State Superseded, archived
Series [FFmpeg-devel,v20,01/20] avcodec, avutil: Move enum AVSubtitleType to avutil, add new and deprecate old values

Checks

Context Check Description
andriy/configurex86 warning Failed to apply patch
andriy/configureppc warning Failed to apply patch

Commit Message

Soft Works Dec. 5, 2021, 4:23 p.m. UTC
Root commit for adding subtitle filtering capabilities.
In detail:

- Add type (AVMediaType) field to AVFrame
  Replaces previous way of distinction which was based on checking
  width and height to determine whether a frame is audio or video
- Add subtitle fields to AVFrame
- Add new struct AVSubtitleArea, similar to AVSubtitleRect, but different
  allocation logic. Cannot and must not be used interchangeably, hence
  the new struct

Signed-off-by: softworkz <softworkz@hotmail.com>
---
 doc/APIchanges     |  10 +++
 libavutil/Makefile |   1 +
 libavutil/frame.c  | 211 ++++++++++++++++++++++++++++++++++++++++-----
 libavutil/frame.h  |  77 ++++++++++++++++-
 libavutil/subfmt.c |  50 +++++++++++
 libavutil/subfmt.h |  48 +++++++++++
 6 files changed, 373 insertions(+), 24 deletions(-)
 create mode 100644 libavutil/subfmt.c

Comments

Lynne Dec. 5, 2021, 4:39 p.m. UTC | #1
5 Dec 2021, 17:23 by softworkz@hotmail.com: 

> @@ -491,6 +499,39 @@ typedef struct AVFrame {
>  */
>  uint64_t channel_layout;
>  
> +    /**
> +     * Display start time, relative to packet pts, in ms.
> +     */
> +    uint32_t subtitle_start_time;
> +
> +    /**
> +     * Display end time, relative to packet pts, in ms.
> +     */
> +    uint32_t subtitle_end_time;
>

Milliseconds? Our entire API's based around timestamps
with time bases. Plus, we all know what happened when
Matroska settled onto milliseconds and ruined a perfectly
complex but good container.
Make this relative to the PTS field, with the same timebase
as the PTS field.
There's even a new AVFrame->time_base field for you to
set so you wouldn't forget it.
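To sketch what that could look like (a hedged example; "start_ms" and
"end_ms" are illustrative placeholders, not actual fields):

    #include <libavutil/mathematics.h> /* av_rescale_q() */
    #include <libavutil/rational.h>

    /* Convert a millisecond offset into the frame's own timebase,
     * so the fields stay proper timestamps relative to pts. */
    static int64_t ms_to_tb(int64_t ms, AVRational tb)
    {
        const AVRational ms_tb = { 1, 1000 };
        return av_rescale_q(ms, ms_tb, tb);
    }

    /* Display window in frame->time_base units, relative to frame->pts:
     *   start = frame->pts + ms_to_tb(start_ms, frame->time_base);
     *   end   = frame->pts + ms_to_tb(end_ms,   frame->time_base);   */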


> +    /**
> +     * Number of items in the @ref subtitle_areas array.
> +     */
> +    unsigned num_subtitle_areas;
> +
> +    /**
> +     * Array of subtitle areas, may be empty.
> +     */
> +    AVSubtitleArea **subtitle_areas;
>

There's no reason why this cannot be handled using the buffer
and data fields. If you need more space there, you're free to bump
the maximum number of pointers, plus this removes the horrid
malloc(1) hack. We've discussed this, and I couldn't follow why
this was bad in the email discussion you linked.


> +    /**
> +     * Usually the same as packet pts, in AV_TIME_BASE.
> +     *
> +     * @deprecated This is kept for compatibility reasons and corresponds to
> +     * AVSubtitle->pts. Might be removed in the future.
> +     */
> +    int64_t subtitle_pts;
>

I'm not going to accept a field which is instantly deprecated.
As we've discussed multiple times, please merge this into
the regular frame PTS field. We already have _2_ necessary
start/end fields.


> +    /**
> +     * Header containing style information for text subtitles.
> +     */
> +    AVBufferRef *subtitle_header;
>

Good, I like this.
Hendrik Leppkes Dec. 5, 2021, 4:44 p.m. UTC | #2
On Sun, Dec 5, 2021 at 5:40 PM Lynne <dev@lynne.ee> wrote:
> > +    /**
> > +     * Usually the same as packet pts, in AV_TIME_BASE.
> > +     *
> > +     * @deprecated This is kept for compatibility reasons and corresponds to
> > +     * AVSubtitle->pts. Might be removed in the future.
> > +     */
> > +    int64_t subtitle_pts;
> >
>
> I'm not going to accept a field which is instantly deprecated.
> As we've discussed multiple times, please merge this into
> the regular frame PTS field. We already have _2_ necessary
> start/end fields.
>

I agree with this entirely. Even ignoring the fact that adding a new
field that's deprecated is instantly a disqualification, AVSubtitle had
one pts field, AVFrame already has one pts field - both are even
documented to have the same semantics. They should just contain the
exact same data; that's how you achieve compatibility, not by claiming
you need a new field for compatibility reasons.

- Hendrik
Soft Works Dec. 5, 2021, 5:58 p.m. UTC | #3
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Lynne
> Sent: Sunday, December 5, 2021 5:40 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> 5 Dec 2021, 17:23 by softworkz@hotmail.com:
> 
> > @@ -491,6 +499,39 @@ typedef struct AVFrame {
> >  */
> >  uint64_t channel_layout;
> >
> > +    /**
> > +     * Display start time, relative to packet pts, in ms.
> > +     */
> > +    uint32_t subtitle_start_time;
> > +
> > +    /**
> > +     * Display end time, relative to packet pts, in ms.
> > +     */
> > +    uint32_t subtitle_end_time;
> >
> 
> Milliseconds? Our entire API's based around timestamps
> with time bases. Plus, we all know what happened when
> Matroska settled onto milliseconds and ruined a perfectly
> complex but good container.
> Make this relative to the PTS field, with the same timebase
> as the PTS field.
> There's even a new AVFrame->time_base field for you to
> set so you wouldn't forget it.

The internal format for text subtitles is ASS, and this is 
using a timebase of milliseconds.

All existing decoders and encoders are using this, and I'm
afraid I will not go and change them all.

> > +    /**
> > +     * Number of items in the @ref subtitle_areas array.
> > +     */
> > +    unsigned num_subtitle_areas;
> > +
> > +    /**
> > +     * Array of subtitle areas, may be empty.
> > +     */
> > +    AVSubtitleArea **subtitle_areas;
> >
> 
> There's no reason why this cannot be handled using the buffer
> and data fields. If you need more space there, you're free to bump
> the maximum number of pointers, plus this removes the horrid
> malloc(1) hack. We've discussed this, and I couldn't follow why
> this was bad in the email discussion you linked.

There are reasons. Some of those I had mentioned in an earlier
discussion with Hendrik.
The effort alone to relate the buffers to subtitle areas (which area
'owns' which buffer) is not reasonable. Too many cases to consider:
what about when there are 3 areas and the second area doesn't have any
buffer? The convention is that the buffers should be used contiguously.
Managing those relations is error-prone and would require a lot of code.

> > +    /**
> > +     * Usually the same as packet pts, in AV_TIME_BASE.
> > +     *
> > +     * @deprecated This is kept for compatibility reasons and corresponds
> to
> > +     * AVSubtitle->pts. Might be removed in the future.
> > +     */
> > +    int64_t subtitle_pts;
> >
> 
> I'm not going to accept a field which is instantly deprecated.
> As we've discussed multiple times, please merge this into
> the regular frame PTS field. We already have _2_ necessary
> start/end fields.

--

> I agree with this entirely. Even ignoring the fact that adding a new
> field that's deprecated is instantly a disqualification, AVSubtitle had
> one pts field, AVFrame already has one pts field - both are even
> documented to have the same semantics. They should just contain the
> exact same data; that's how you achieve compatibility, not by claiming
> you need a new field for compatibility reasons.
> 
> - Hendrik

I think the mistake is to declare subtitle_pts as deprecated. I had
added the deprecation at a very early point in time, when I still
thought that it could be eliminated.

Even though we are driving subtitle data through the graph attached
to AVFrame, the behavior involved is very different from audio and
video frames. Actually there's not one but many different ways in which
subtitle data can appear in a source and travel through a filtergraph:

- Sometimes, subtitle events are muxed into a stream many seconds
  ahead of display time. In this case, AVFrame.pts is the mux position
  and AVFrame.subtitle_pts is the actual presentation time.
  When filtering subtitles to modify something, it would still be desired
  to retain the offset between mux time and display start

- Sometimes, subtitle events occur in the mux "live" - right in
  the moment when they are meant to be shown. An example of this is
  closed captions, and when extracting those via the new splitcc filter,
  the subtitle_pts is equal to the frame.pts value.
  But CC events do not come regularly, while downstream filters might
  expect exactly that in order to proceed. That's why the splitcc filter
  can emit subtitle frames at a certain framerate. It does so by re-
  sending the most recent subtitle frame - basically not much different
  from the main heartbeat mechanism (which cannot be used here because
  the stream has its origin inside the filtergraph (secondary output
  of splitcc)).
  Now when splitcc sends such a repeated frame, it needs to adjust the
  frame's pts in order to match the configured output framerate.
  But that's for keeping the graph running; the actual display start
  time (subtitle_pts) must not be changed.
  It is then up to the downstream filters how to handle those frames.
  For example, a filter could detect (based on subtitle_pts) repeated
  subtitle events. Then it can skip processing and use a previously cached
  result (some of the filters are actually doing that).

- I already mentioned the subtitle_heartbeat mechanism that I have kept.
  The same applies here: it needs to resend frames to retain a certain 
  frequency of frames and for this purpose it needs to send duplicate 
  frames. Those frames need a new pts, but subtitle_pts must remain
  at its original value

- There exist extreme cases, like where the complete set of ASS subtitles 
  is muxed at the beginning of the stream with pkt-pts values like 1,2,3,..
  The pkt-pts is what subtitle frames get as AVFrame.pts while the actual
  display start-time is in subtitle_pts.
  The desired way to handle such situations surely varies from case to case.
  Having those values separate is an important key for keeping a wide range
  of options open to handle this - for current and future subtitle filters.

My patchset offers a really wide range of possibilities for subtitle 
filtering. Those are real-world production scenarios that are working.
This is not based on just a few command line experiments.

Not every scenario that would be desirable is working. With subtitles,
there are a lot of different cases for which I have given just a few 
examples above, and certain scenarios will need some work to get them
working - by some tweaking, adding a new filter or adding additional
options to existing filters (as you could see in the version history).
My experience from doing just that has made me realize one important
point: with the proposed architecture, it's possible to get almost
any desired scenario working, even those that don't work yet.

It can all be done with filters and the next change I'd want to do
after merge is to move the heartbeat functionality into a filter,
allowing much better control over its behavior (and the
possibility to choose at which point in the graph the heartbeat should
originate). It will also make sense to have a counterpart filter to this,
which eliminates duplicated subtitle frames.

Having both pts and subtitle_pts is an important element for all these
things to work.
If they were merged into a single value, probably half of the
example scenarios of my patchset would no longer work.

Kind regards,
softworkz
Soft Works Dec. 5, 2021, 6:05 p.m. UTC | #4
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Hendrik
> Leppkes
> Sent: Sunday, December 5, 2021 5:45 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> On Sun, Dec 5, 2021 at 5:40 PM Lynne <dev@lynne.ee> wrote:
> > > +    /**
> > > +     * Usually the same as packet pts, in AV_TIME_BASE.
> > > +     *
> > > +     * @deprecated This is kept for compatibility reasons and
> corresponds to
> > > +     * AVSubtitle->pts. Might be removed in the future.
> > > +     */
> > > +    int64_t subtitle_pts;
> > >
> >
> > I'm not going to accept a field which is instantly deprecated.
> > As we've discussed multiple times, please merge this into
> > the regular frame PTS field. We already have _2_ necessary
> > start/end fields.
> >
> 
> I agree with this entirely. Even ignoring the fact that adding a new
> field that's deprecated is instantly a disqualification, AVSubtitle had
> one pts field, AVFrame already has one pts field 


The short answer to why it has worked before with a single pts field
is: Because there was no filtering.

Kind regards,
softworkz
Soft Works Dec. 5, 2021, 7:21 p.m. UTC | #5
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Lynne
> Sent: Sunday, December 5, 2021 5:40 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> 5 Dec 2021, 17:23 by softworkz@hotmail.com:
> 
> > @@ -491,6 +499,39 @@ typedef struct AVFrame {
> >  */
> >  uint64_t channel_layout;
> >
> > +    /**
> > +     * Display start time, relative to packet pts, in ms.
> > +     */
> > +    uint32_t subtitle_start_time;
> > +
> > +    /**
> > +     * Display end time, relative to packet pts, in ms.
> > +     */
> > +    uint32_t subtitle_end_time;
> >
> 
> Milliseconds? Our entire API's based around timestamps
> with time bases. Plus, we all know what happened when
> Matroska settled onto milliseconds and ruined a perfectly
> complex but good container.
> Make this relative to the PTS field, with the same timebase
> as the PTS field.
> There's even a new AVFrame->time_base field for you to
> set so you wouldn't forget it.
> 
> 
> > +    /**
> > +     * Number of items in the @ref subtitle_areas array.
> > +     */
> > +    unsigned num_subtitle_areas;
> > +
> > +    /**
> > +     * Array of subtitle areas, may be empty.
> > +     */
> > +    AVSubtitleArea **subtitle_areas;
> >
> 
> There's no reason why this cannot be handled using the buffer
> and data fields. If you need more space there, you're free to bump
> the maximum number of pointers, plus this removes the horrid
> malloc(1) hack. We've discussed this, and I couldn't follow why
> this was bad in the email discussion you linked.


Hi Lynne,

let me add another note on these two topics.

One of the reasons why this subtitle filtering patchset has proceeded
rather smoothly so far, causing only a small number of regressions
that were easy to find and fix, is that I have kept as much as possible
from the current AVSubtitle implementation, projecting it to the 
new frame-based subtitle implementation, while keeping it mostly 
compatible with the current logic.
Interestingly, an earlier attempt towards subtitle filtering had
gotten stuck right at the point of storing subtitle data in
AVFrame buffers. I decided to focus on functionality rather than
academic details. What has been in place and working
for such a long time can't be so wrong that it can't be accepted.

We also need to consider that it's not my patchset that is deliberately
deciding to store data in pointers associated with rects/areas and
to store start- and end-times in milliseconds: it's all the decoders
and encoders that are doing this. I'm just adapting to the status
quo.

These two things - changing the timebase of the ms-fields and
changing the location of the buffer pointers - are a gigantic task
that would require changing everything that this patchset covers
and implements - plus even more code that this patchset hasn't even
touched. It would require changes everywhere subtitles are concerned.
It is a task that is risky as hell, is guaranteed to cause plenty of
regressions, requires extensive testing and might even
drive this whole endeavor into a dead end.

As I said before, please understand that I'm not ready to enter that
path.

Thanks,
softworkz
Lynne Dec. 5, 2021, 9:30 p.m. UTC | #6
5 Dec 2021, 20:21 by softworkz@hotmail.com:

>
>
>> -----Original Message-----
>> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Lynne
>> Sent: Sunday, December 5, 2021 5:40 PM
>> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
>> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
>> for subtitle handling
>>
>> 5 Dec 2021, 17:23 by softworkz@hotmail.com:
>>
>> > @@ -491,6 +499,39 @@ typedef struct AVFrame {
>> >  */
>> >  uint64_t channel_layout;
>> >
>> > +    /**
>> > +     * Display start time, relative to packet pts, in ms.
>> > +     */
>> > +    uint32_t subtitle_start_time;
>> > +
>> > +    /**
>> > +     * Display end time, relative to packet pts, in ms.
>> > +     */
>> > +    uint32_t subtitle_end_time;
>> >
>>
>> Milliseconds? Our entire API's based around timestamps
>> with time bases. Plus, we all know what happened when
>> Matroska settled onto milliseconds and ruined a perfectly
>> complex but good container.
>> Make this relative to the PTS field, with the same timebase
>> as the PTS field.
>> There's even a new AVFrame->time_base field for you to
>> set so you wouldn't forget it.
>>
>>
>> > +    /**
>> > +     * Number of items in the @ref subtitle_areas array.
>> > +     */
>> > +    unsigned num_subtitle_areas;
>> > +
>> > +    /**
>> > +     * Array of subtitle areas, may be empty.
>> > +     */
>> > +    AVSubtitleArea **subtitle_areas;
>> >
>>
>> There's no reason why this cannot be handled using the buffer
>> and data fields. If you need more space there, you're free to bump
>> the maximum number of pointers, plus this removes the horrid
>> malloc(1) hack. We've discussed this, and I couldn't follow why
>> this was bad in the email discussion you linked.
>>
>
>
> Hi Lynne,
>
> let me add another note on these two topics.
>
> One of the reasons why this subtitle filtering patchset has proceeded
> rather smoothly so far, causing only a small number of regressions
> that were easy to find and fix, is that I have kept as much as possible
> from the current AVSubtitle implementation, projecting it to the
> new frame-based subtitle implementation, while keeping it mostly
> compatible with the current logic.
> Interestingly, an earlier attempt towards subtitle filtering had
> gotten stuck right at the point of storing subtitle data in
> AVFrame buffers. I decided to focus on functionality rather than
> academic details. What has been in place and working
> for such a long time can't be so wrong that it can't be accepted.
>
> We also need to consider that it's not my patchset that is deliberately
> deciding to store data in pointers associated with rects/areas and
> to store start- and end-times in milliseconds: it's all the decoders
> and encoders that are doing this. I'm just adapting to the status
> quo.
>
> These two things - changing the timebase of the ms-fields and
> changing the location of the buffer pointers - are a gigantic task
> that would require changing everything that this patchset covers
> and implements - plus even more code that this patchset hasn't even
> touched. It would require changes everywhere subtitles are concerned.
> It is a task that is risky as hell, is guaranteed to cause plenty of
> regressions, requires extensive testing and might even
> drive this whole endeavor into a dead end.
>
> As I said before, please understand that I'm not ready to enter that
> path.
>

You don't have to break anything to make the start/end times proper
timestamps. We have timebases for that! If you'd like your timestamps
to have millisecond precision, you use a millisecond timebase!
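As a minimal sketch (values illustrative), a 1/1000 timebase keeps the
exact ms values while still being a real timestamp that can be rescaled
losslessly:

    AVRational ms_tb = { 1, 1000 };  /* 1 tick = 1 ms               */
    int64_t    start = 1500;         /* exactly 1.5 seconds         */
    /* rescale to e.g. the 90 kHz MPEG-TS timebase, still exact:    */
    int64_t    s_90k = av_rescale_q(start, ms_tb, (AVRational){ 1, 90000 });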

So what if we use ASS internally which rounds timestamps? We have
bitmap formats too, and those are generally following the video
timestamps. You absolutely don't want the timestamps being rounded
in that case, or you risk having your subtitles be a frame late or
a frame early.

You could add a separate timebase for the subtitle start/end fields,
in case the packet's timebase doesn't match the subtitle format's
native timebase. I won't object to that.

You also don't have to risk that much by removing the subtitle_pts
field. Once the start/end times are converted to proper timestamps,
then the AVFrame->pts field can do whatever it wants to, whether
it mirrors the demuxer's timestamp, be constant, non-existent or even
go backwards.

If you just want to have an empty subtitle rect, you can simply
zero the subtitle struct's data and buffer fields.

Calling these issues too academic when the frame API is so very
straightforward and cut-and-paste is like calling computers
too academic when you have several thousand large numbers to add.

I'd really rather have a subtitle API that's usable, maintainable and easy
to extend, even if it's momentarily broken for certain corner cases
and weird files.
We can easily fix broken files one by one. We've got plenty of heuristics,
and have mostly abstracted and documented them well. We can add more.
We cannot remove cruft from an API that's designed to be crufty, however,
because the cruft is the only thing holding it together.
Hendrik Leppkes Dec. 5, 2021, 10:38 p.m. UTC | #7
On Sun, Dec 5, 2021 at 6:58 PM Soft Works <softworkz@hotmail.com> wrote:
>
>
>
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Lynne
> > Sent: Sunday, December 5, 2021 5:40 PM
> > To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> > for subtitle handling
> >
> > 5 Dec 2021, 17:23 by softworkz@hotmail.com:
> >
> > > @@ -491,6 +499,39 @@ typedef struct AVFrame {
> > >  */
> > >  uint64_t channel_layout;
> > >
> > > +    /**
> > > +     * Display start time, relative to packet pts, in ms.
> > > +     */
> > > +    uint32_t subtitle_start_time;
> > > +
> > > +    /**
> > > +     * Display end time, relative to packet pts, in ms.
> > > +     */
> > > +    uint32_t subtitle_end_time;
> > >
> >
> > Milliseconds? Our entire API's based around timestamps
> > with time bases. Plus, we all know what happened when
> > Matroska settled onto milliseconds and ruined a perfectly
> > complex but good container.
> > Make this relative to the PTS field, with the same timebase
> > as the PTS field.
> > There's even a new AVFrame->time_base field for you to
> > set so you wouldn't forget it.
>
> The internal format for text subtitles is ASS, and this is
> using a timebase of milliseconds.
>
> All existing decoders and encoders are using this, and I'm
> afraid I will not go and change them all.
>
> > > +    /**
> > > +     * Number of items in the @ref subtitle_areas array.
> > > +     */
> > > +    unsigned num_subtitle_areas;
> > > +
> > > +    /**
> > > +     * Array of subtitle areas, may be empty.
> > > +     */
> > > +    AVSubtitleArea **subtitle_areas;
> > >
> >
> > There's no reason why this cannot be handled using the buffer
> > and data fields. If you need more space there, you're free to bump
> > the maximum number of pointers, plus this removes the horrid
> > malloc(1) hack. We've discussed this, and I couldn't follow why
> > this was bad in the email discussion you linked.
>
> There are reasons. Some of those I had mentioned in an earlier
> discussion with Hendrik.
> The effort alone to relate the buffers to subtitle areas (which area
> 'owns' which buffer) is not reasonable. Too many cases to consider:
> what about when there are 3 areas and the second area doesn't have any
> buffer? The convention is that the buffers should be used contiguously.
> Managing those relations is error-prone and would require a lot of code.
>
> > > +    /**
> > > +     * Usually the same as packet pts, in AV_TIME_BASE.
> > > +     *
> > > +     * @deprecated This is kept for compatibility reasons and corresponds
> > to
> > > +     * AVSubtitle->pts. Might be removed in the future.
> > > +     */
> > > +    int64_t subtitle_pts;
> > >
> >
> > I'm not going to accept a field which is instantly deprecated.
> > As we've discussed multiple times, please merge this into
> > the regular frame PTS field. We already have _2_ necessary
> > start/end fields.
>
> --
>
> > I agree with this entirely. Even ignoring the fact that adding a new
> > field that's deprecated is instantly a disqualification, AVSubtitle had
> > one pts field, AVFrame already has one pts field - both are even
> > documented to have the same semantics. They should just contain the
> > exact same data; that's how you achieve compatibility, not by claiming
> > you need a new field for compatibility reasons.
> >
> > - Hendrik
>
> I think the mistake is to declare subtitle_pts as deprecated. I had
> added the deprecation at a very early point in time, when I still
> thought that it could be eliminated.
>
> Even though we are driving subtitle data through the graph attached
> to AVFrame, the behavior involved is very different from audio and
> video frames. Actually there's not one but many different ways in which
> subtitle data can appear in a source and travel through a filtergraph:
>
> - Sometimes, subtitle events are muxed into a stream many seconds
>   ahead of display time. In this case, AVFrame.pts is the mux position
>   and AVFrame.subtitle_pts is the actual presentation time.
>   When filtering subtitles to modify something, it would still be desired
>   to retain the offset between mux time and display start
>

This information was impossible to convey in AVSubtitle, as it had
exactly one pts field, so where does it come from now?
Also, isn't that kind of offset what subtitle_start_time is for, a
delta on top of the pts?

Sounds like it's just duplicating logic once again.

pts is also by definition and name the "presentation time stamp"; if a
timestamp with another meaning can convey additional information, we do
have a dts field in AVFrame for the decoding timestamp.

>   Now when splitcc sends such a repeated frame, it needs to adjust the
>   frame's pts in order to match the configured output framerate.
>   But that's for keeping the graph running, the actual display start
>   time (subtitle_pts) must not be changed.

This sounds like an implementation detail, and not something we should
have in a core public structure like AVFrame.

- Hendrik
Hendrik Leppkes Dec. 5, 2021, 10:53 p.m. UTC | #8
On Sun, Dec 5, 2021 at 11:38 PM Hendrik Leppkes <h.leppkes@gmail.com> wrote:
>
> On Sun, Dec 5, 2021 at 6:58 PM Soft Works <softworkz@hotmail.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Lynne
> > > Sent: Sunday, December 5, 2021 5:40 PM
> > > To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> > > Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> > > for subtitle handling
> > >
> > > 5 Dec 2021, 17:23 by softworkz@hotmail.com:
> > >
> > > > @@ -491,6 +499,39 @@ typedef struct AVFrame {
> > > >  */
> > > >  uint64_t channel_layout;
> > > >
> > > > +    /**
> > > > +     * Display start time, relative to packet pts, in ms.
> > > > +     */
> > > > +    uint32_t subtitle_start_time;
> > > > +
> > > > +    /**
> > > > +     * Display end time, relative to packet pts, in ms.
> > > > +     */
> > > > +    uint32_t subtitle_end_time;
> > > >
> > >
> > > Milliseconds? Our entire API's based around timestamps
> > > with time bases. Plus, we all know what happened when
> > > Matroska settled onto milliseconds and ruined a perfectly
> > > complex but good container.
> > > Make this relative to the PTS field, with the same timebase
> > > as the PTS field.
> > > There's even a new AVFrame->time_base field for you to
> > > set so you wouldn't forget it.
> >
> > The internal format for text subtitles is ASS, and this is
> > using a timebase of milliseconds.
> >
> > All existing decoders and encoders are using this, and I'm
> > afraid I will not go and change them all.
> >
> > > > +    /**
> > > > +     * Number of items in the @ref subtitle_areas array.
> > > > +     */
> > > > +    unsigned num_subtitle_areas;
> > > > +
> > > > +    /**
> > > > +     * Array of subtitle areas, may be empty.
> > > > +     */
> > > > +    AVSubtitleArea **subtitle_areas;
> > > >
> > >
> > > There's no reason why this cannot be handled using the buffer
> > > and data fields. If you need more space there, you're free to bump
> > > the maximum number of pointers, plus this removes the horrid
> > > malloc(1) hack. We've discussed this, and I couldn't follow why
> > > this was bad in the email discussion you linked.
> >
> > There are reasons. Some of those I had mentioned in an earlier
> > discussion with Hendrik.
> > The effort alone to relate the buffers to subtitle areas (which area
> > 'owns' which buffer) is not reasonable. Too many cases to consider:
> > what about when there are 3 areas and the second area doesn't have any
> > buffer? The convention is that the buffers should be used contiguously.
> > Managing those relations is error-prone and would require a lot of code.
> >
> > > > +    /**
> > > > +     * Usually the same as packet pts, in AV_TIME_BASE.
> > > > +     *
> > > > +     * @deprecated This is kept for compatibility reasons and corresponds
> > > to
> > > > +     * AVSubtitle->pts. Might be removed in the future.
> > > > +     */
> > > > +    int64_t subtitle_pts;
> > > >
> > >
> > > I'm not going to accept a field which is instantly deprecated.
> > > As we've discussed multiple times, please merge this into
> > > the regular frame PTS field. We already have _2_ necessary
> > > start/end fields.
> >
> > --
> >
> > > I agree with this entirely. Even ignoring the fact that adding a new
> > > field that's deprecated is instantly a disqualification, AVSubtitle had
> > > one pts field, AVFrame already has one pts field - both are even
> > > documented to have the same semantics. They should just contain the
> > > exact same data; that's how you achieve compatibility, not by claiming
> > > you need a new field for compatibility reasons.
> > >
> > > - Hendrik
> >
> > I think the mistake is to declare subtitle_pts as deprecated. I had
> > added the deprecation at a very early point in time, when I still
> > thought that it could be eliminated.
> >
> > Even though we are driving subtitle data through the graph attached
> > to AVFrame, the behavior involved is very different from audio and
> > video frames. Actually there's not one but many different ways in which
> > subtitle data can appear in a source and travel through a filtergraph:
> >
> > - Sometimes, subtitle events are muxed into a stream many seconds
> >   ahead of display time. In this case, AVFrame.pts is the mux position
> >   and AVFrame.subtitle_pts is the actual presentation time.
> >   When filtering subtitles to modify something, it would still be desired
> >   to retain the offset between mux time and display start
> >
>
> This information was impossible to convey in AVSubtitle, as it had
> exactly one pts field, so where does it come from now?
> Also, isn't that kind of offset what subtitle_start_time is for, a
> delta on top of the pts?
>
> Sounds like it's just duplicating logic once again.
>
> pts is also by definition and name the "presentation time stamp"; if a
> timestamp with another meaning can convey additional information, we do
> have a dts field in AVFrame for the decoding timestamp.

To illustrate this, let's just look at the description of the timestamp
fields in AVFrame:

pts:
Presentation timestamp in time_base units (time when frame should be
shown to user).

pkt_dts:
DTS copied from the AVPacket that triggered returning this frame.

These comments are pretty clear on the meaning of those fields, and
shall not be violated by subtitles. pts is the time when the frame is
going to be displayed, and nothing else. We probably shouldn't even
tolerate subtitle_start_time, as it's a stupid field when it should just
get baked into the timestamp (with duration being used for the
end_time field).
pkt_dts can be used to convey an additional timestamp that was copied
from the demuxer, if for some reason you want to preserve the
interleaving position of the frame - which in itself is already
questionable, as that should be up to the muxer or the interleaving
code to decide, but the field exists, so it's fine.

subtitle_pts has to go, and these existing fields have to be used
within their definition.
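A hedged sketch of what that would mean in practice (assuming the
current patch's ms offsets as input; "start_ms"/"end_ms" are
illustrative names, not actual fields):

    /* Bake the display start into pts and express the display length
     * as a duration, both in frame->time_base units - no extra field. */
    AVRational ms_tb = { 1, 1000 };
    frame->pts         += av_rescale_q(start_ms, ms_tb, frame->time_base);
    frame->pkt_duration = av_rescale_q(end_ms - start_ms, ms_tb,
                                       frame->time_base);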

- Hendrik
Soft Works Dec. 5, 2021, 11:23 p.m. UTC | #9
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Soft Works
> Sent: Sunday, December 5, 2021 6:58 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Lynne
> > Sent: Sunday, December 5, 2021 5:40 PM
> > To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> > for subtitle handling
> >
> > 5 Dec 2021, 17:23 by softworkz@hotmail.com:
> >
> > > @@ -491,6 +499,39 @@ typedef struct AVFrame {
> > >  */
> > >  uint64_t channel_layout;
> > >
> > > +    /**
> > > +     * Display start time, relative to packet pts, in ms.
> > > +     */
> > > +    uint32_t subtitle_start_time;
> > > +
> > > +    /**
> > > +     * Display end time, relative to packet pts, in ms.
> > > +     */
> > > +    uint32_t subtitle_end_time;
> > >
> >
> > Milliseconds? Our entire API's based around timestamps
> > with time bases. Plus, we all know what happened when
> > Matroska settled onto milliseconds and ruined a perfectly
> > complex but good container.
> > Make this relative to the PTS field, with the same timebase
> > as the PTS field.
> > There's even a new AVFrame->time_base field for you to
> > set so you wouldn't forget it.
> 
> The internal format for text subtitles is ASS, and this is
> using a timebase of milliseconds.
> 
> All existing decoders and encoders are using this, and I'm
> afraid I will not go and change them all.
> 
> > > +    /**
> > > +     * Number of items in the @ref subtitle_areas array.
> > > +     */
> > > +    unsigned num_subtitle_areas;
> > > +
> > > +    /**
> > > +     * Array of subtitle areas, may be empty.
> > > +     */
> > > +    AVSubtitleArea **subtitle_areas;
> > >
> >
> > There's no reason why this cannot be handled using the buffer
> > and data fields. If you need more space there, you're free to bump
> > the maximum number of pointers, plus this removes the horrid
> > malloc(1) hack. We've discussed this, and I couldn't follow why
> > this was bad in the email discussion you linked.
> 
> There are reasons. Some of those I had mentioned in an earlier
> discussion with Hendrik.
> The effort alone to relate the buffers to subtitle areas (which area
> 'owns' which buffer) is not reasonable. Too many cases to consider:
> what about when there are 3 areas and the second area doesn't have any
> buffer? The convention is that the buffers should be used contiguously.
> Managing those relations is error-prone and would require a lot of code.
> 
> > > +    /**
> > > +     * Usually the same as packet pts, in AV_TIME_BASE.
> > > +     *
> > > +     * @deprecated This is kept for compatibility reasons and
> corresponds
> > to
> > > +     * AVSubtitle->pts. Might be removed in the future.
> > > +     */
> > > +    int64_t subtitle_pts;
> > >
> >
> > I'm not going to accept a field which is instantly deprecated.
> > As we've discussed multiple times, please merge this into
> > the regular frame PTS field. We already have _2_ necessary
> > start/end fields.
> 
> --
> 
> > I agree with this entirely. Even ignoring the fact that adding a new
> > field that's deprecated is instantly a disqualification, AVSubtitle had
> > one pts field, AVFrame already has one pts field - both are even
> > documented to have the same semantics. They should just contain the
> > exact same data; that's how you achieve compatibility, not by claiming
> > you need a new field for compatibility reasons.
> >
> > - Hendrik
> 
> I think the mistake is to declare subtitle_pts as deprecated. I had
> added the deprecation at a very early point in time, when I still
> thought that it could be eliminated.
> 
> Even though we are driving subtitle data through the graph attached
> to AVFrame, the behavior involved is very different from audio and
> video frames. Actually there's not one but many different ways in which
> subtitle data can appear in a source and travel through a filtergraph:
>
> - Sometimes, subtitle events are muxed into a stream many seconds
>   ahead of display time. In this case, AVFrame.pts is the mux position
>   and AVFrame.subtitle_pts is the actual presentation time.
>   When filtering subtitles to modify something, it would still be desired
>   to retain the offset between mux time and display start
> 
> - Sometimes, subtitle events occur in the mux "live" - right in
>   the moment when they are meant to be shown. An example of this is
>   closed captions, and when extracting those via the new splitcc filter,
>   the subtitle_pts is equal to the frame.pts value.
>   But CC events do not come regularly, while downstream filters might
>   expect exactly that in order to proceed. That's why the splitcc filter
>   can emit subtitle frames at a certain framerate. It does so by re-
>   sending the most recent subtitle frame - basically not much different
>   from the main heartbeat mechanism (which cannot be used here because
>   the stream has its origin inside the filtergraph (secondary output
>   of splitcc)).
>   Now when splitcc sends such a repeated frame, it needs to adjust the
>   frame's pts in order to match the configured output framerate.
>   But that's for keeping the graph running; the actual display start
>   time (subtitle_pts) must not be changed.
>   It is then up to the downstream filters how to handle those frames.
>   For example, a filter could detect (based on subtitle_pts) repeated
>   subtitle events. Then it can skip processing and use a previously cached
>   result (some of the filters are actually doing that).
> 
> - I already mentioned the subtitle_heartbeat mechanism that I have kept.
>   The same applies here: it needs to resend frames to retain a certain
>   frequency of frames and for this purpose it needs to send duplicate
>   frames. Those frames need a new pts, but subtitle_pts must remain
>   at its original value
> 
> - There exist extreme cases, like where the complete set of ASS subtitles
>   is muxed at the beginning of the stream with pkt-pts values like 1,2,3,..
>   The pkt-pts is what subtitle frames get as AVFrame.pts while the actual
>   display start-time is in subtitle_pts.
>   The desired way to handle such situations surely varies from case to case.
>   Having those values separate is an important key for keeping a wide range
>   of options open to handle this - for current and future subtitle filters.
> 
> My patchset offers a really wide range of possibilities for subtitle
> filtering. Those are real-world production scenarios that are working.
> This is not based on just a few command line experiments.
> 
> Not every scenario that would be desirable is working. With subtitles,
> there are a lot of different cases for which I have given just a few
> examples above, and certain scenarios will need some work to get them
> working - by some tweaking, adding a new filter or adding additional
> options to existing filters (as you could see in the version history).
> My experience from doing just that has made me realize one important
> point: with the proposed architecture, it's possible to get almost
> any desired scenario working 

Maybe this doesn't sound all that substantial; let me illustrate what
I mean with an example.
Last week, I needed to convert closed captions during a live transcoding
and deliver these as regular text subtitles (ass, srt, vtt).

As explained above, the problem with CCs is that they are not muxed
ahead of time, as is usually the case with other subtitles - they
appear in the stream right in the moment when they are meant to be
presented. And even worse: there is no duration included. The duration
is solely determined by the occurrence of the following update (which can
be an addition to the current text, a completely new text, or nothing).

The splitcc filter uses the existing cc_dec decoder, which can work in
two different modes (splitcc exposes the decoder's options):

- Default
  On each CC update, it saves the data and waits (for a certain time)
  until the next update happens. Then it emits the saved event, 
  setting the duration from the time difference.

- Realtime
  Emits subtitle events immediately with the duration set to infinite.
  A minimum interval between two subsequent outputs can be specified 
  for this mode.

None of those modes is really useful for the desired conversion:

In the default mode, subtitles are shown too late, with the delay not
even being constant but varying as the decoder doesn’t delay by a fixed
timespan but by the interval between two events which can range between
milliseconds and several seconds. The display start times are correct 
though (not delayed), and that can lead to situations where for example
an event with a 3 second duration arrives at the client at a time 
where 2.5s of the display time have elapsed already and the remaining 
0.5s display duration appears like a flicker.
So that mode is certainly suitable for offline conversions but not
for live transcoding.

When using the real_time mode on the other hand, we don't have any
durations, and when we convert that to a text subtitle format, the 
lines will just fill up the screen to the max. Setting a fixed duration
doesn’t help either: the display start times vary arbitrarily,
and when we don't specify the accurate duration, the events will overlap
in time and texts will be shown stacked while they overlap, which looks
like jumping up and down.

I've been at the same point before in the context of burning-in CCs
where I had discussed the issue with the guys from libass:
https://github.com/libass/libass/issues/562

The solution there was the 'render_latest_only' option that
I ended up adding to the overlaytextsubs and textsub2video filters.
Unfortunately this can only work when implemented at the final end
of the chain (in the former case: the libass renderer), and that option
isn't available when converting to a subtitle output stream which 
will be processed by a client we don't have control of.

Eventually I came to a new idea for which I added the scatter_realtime_output
option to splitcc. It basically works like this:

- The splitcc output is no longer driven by the cc_decoder's output
- Instead, splitcc emits subtitle events at a fixed "frame"-rate
  based on the configured real_time_latency_msec value (e.g. 250msec)
- all subtitle frames/events have a fixed duration set to this value
- subsequent events have display start times (subtitle_pts)
  increasing by this (fixed) value
- if there has been an update from the split_cc decoder between two 
  of those events, that update is taken and emitted with the next
  output. The decoded event's start time is replaced by the output
  event time
  (meaning: start times are quantized to 250ms intervals)
- if there hasn't been a change, the previous content is re-sent
  (but with the timings changed to the next 250ms interval)

Essentially, closed caption events are quantized and split ("scattered")
to match that fixed time raster and the filter output is decoupled from
the CC decoder's output timings.
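A minimal sketch of that quantization idea (not the actual splitcc
code; names illustrative):

    /* Snap an event's start time down to the fixed output raster. */
    static int64_t quantize_ms(int64_t ts_ms, int64_t interval_ms)
    {
        return ts_ms / interval_ms * interval_ms;
    }
    /* interval_ms = 250: 1390 -> 1250, 1501 -> 1500, 1750 -> 1750 */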


But that's not all. There's another, additional level of decoupling 
in the case of the splitcc filter:

This filter has one input (video) and two outputs (0: video, 1: text subs)
where the second output works in a special way.
The first output is just for passthrough of the video frames from which 
we take the CC side data without further manipulation. 
Its behavior is very simple: you push a frame to the input and the same
frame is synchronously output at output pin 0.

The second output uses the request_frame API instead, which works
as a kind of pull model. Frames are requested according to the configured
output framerate and we need to increase the frame.pts value each
time by the reciprocal value of that output framerate.

For splitcc, this is set to 5 fps by default. In our example, we have 
a 250ms interval, which makes 4 fps. That means we need to duplicate
every 4th output frame. The duplicated frame needs to have
the subtitle_pts unchanged, so it can be identified as a duplicate and
removed/ignored downstream.
Meanwhile, frame.pts needs to follow the graph timing to keep it all going.
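As a hedged sketch (illustrative names; "outlink_tb" stands for the
output link's timebase), the pts advance per requested frame is:

    #include <libavutil/rational.h> /* av_inv_q() */

    AVRational framerate = { 5, 1 };   /* 5 fps -> 200ms per frame  */
    /* one frame interval, expressed in the output timebase:        */
    frame->pts += av_rescale_q(1, av_inv_q(framerate), outlink_tb);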

Effectively, these are all different kinds of heartbeat cases. Those beats
are required to drive subtitle filtering because the subtitles' own timings
alone are not capable of driving a constant filtering flow.
All earlier discussions about subtitle filtering reached that same
conclusion, by the way: a heartbeat mechanism is required for subtitle filtering.

The purpose of that heartbeat mechanism is to ensure the filtering flow,
and this is driven by AVFrame.pts; that's why it's exactly that field
that needs to be used for this and no other - even when it might not
exactly fit the doc comment in the code.

It can only work like that. 

Kind regards,
softworkz
Hendrik Leppkes Dec. 5, 2021, 11:37 p.m. UTC | #10
On Mon, Dec 6, 2021 at 12:23 AM Soft Works <softworkz@hotmail.com> wrote:
>
> It can only work like that.
>

You literally designed subtitle filtering. You defined how it works.
All your explanations come down to some fancy form of "because I made
it required".

All I'm hearing is that some internal mechanisms in avfilter need it.
Why is it a public field in one of the most key structures in our
entire codebase? You understand this concern, right? Any public field
should first (and only) answer the question, what information does
this convey to the user? Does the user need this information? I'm not
seeing any of that in your explanations. All I'm seeing is internal
avfilter workings - cemented in public API.

You are not integrating into an existing infrastructure, you are
literally trying to add the whole thing, so you can change it to not
need this redundant field.

- Hendrik
Soft Works Dec. 6, 2021, 12:14 a.m. UTC | #11
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Hendrik
> Leppkes
> Sent: Monday, December 6, 2021 12:37 AM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> On Mon, Dec 6, 2021 at 12:23 AM Soft Works <softworkz@hotmail.com> wrote:
> >
> > It can only work like that.
> >
> 
> You literally designed subtitle filtering. You defined how it works.
> All your explanations come down to some fancy form of "because I made
> it required".

The origin was in the context of keeping sub2video and the existing 
heartbeat functionality working, and in the end it's still about heartbeat
and filter flow.

Removing the field and replacing it with the main pts field makes 18 fate
tests fail. Those tests are not testing the new filtering.
I guess that shouldn't happen if the field were redundant.

Thanks,
sw
Soft Works Dec. 6, 2021, 12:24 a.m. UTC | #12
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Hendrik
> Leppkes
> Sent: Monday, December 6, 2021 12:37 AM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> On Mon, Dec 6, 2021 at 12:23 AM Soft Works <softworkz@hotmail.com> wrote:
> >
> > It can only work like that.
> >
> 
> You literally designed subtitle filtering. You defined how it works.
> All your explanations come down to some fancy form of "because I made
> it required".
> 
> All I'm hearing is that some internal mechanisms in avfilter need it.
> Why is it a public field in one of the most key structures in our
> entire codebase? You understand this concern, right? Any public field
> should first (and only) answer the question, what information does
> this convey to the user? Does the user need this information? I'm not
> seeing any of that in your explanations. All I'm seeing is internal
> avfilter workings - cemented in public API.
> 
> You are not integrating into an existing infrastructure, you are
> literally trying to add the whole thing, so you can change it to not
> need this redundant field.
> 
> - Hendrik

OK, maybe this is better to understand. Very simple burn-down:

I send a subtitle frame into a filtergraph.

pts = 10
subtitle_pts = 10

Now I need to re-send a duplicate of that frame into the graph
(=> heartbeat - to keep the graph flowing)

Now, it's not possible (by lavf design, not mine) to send another frame 
without increasing the pts of the frame. So we send it like this:

pts = 11
subtitle_pts = 10

The pts needs to be increased. But the start-display time of the 
subtitle event must not be changed - it's still the same frame as
before, just a duplicate.
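As a hedged sketch of that duplication (not the actual patchset code;
error handling trimmed):

    /* Clone the frame for the heartbeat, advance only the graph pts. */
    AVFrame *dup = av_frame_clone(frame);   /* new ref, same data     */
    if (!dup)
        return AVERROR(ENOMEM);
    dup->pts++;              /* pts = 11: keeps the graph flowing     */
    /* dup->subtitle_pts stays 10: display start time is unchanged    */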


softworkz
Hendrik Leppkes Dec. 6, 2021, 12:37 a.m. UTC | #13
On Mon, Dec 6, 2021 at 1:24 AM Soft Works <softworkz@hotmail.com> wrote:
>
>
>
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Hendrik
> > Leppkes
> > Sent: Monday, December 6, 2021 12:37 AM
> > To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> > for subtitle handling
> >
> > On Mon, Dec 6, 2021 at 12:23 AM Soft Works <softworkz@hotmail.com> wrote:
> > >
> > > It can only work like that.
> > >
> >
> > You literally designed subtitle filtering. You defined how it works.
> > All your explanations come down to some fancy form of "because I made
> > it required".
> >
> > All I'm hearing is that some internal mechanisms in avfilter need it.
> > Why is it a public field in one of the most key structures in our
> > entire codebase? You understand this concern, right? Any public field
> > should first (and only) answer the question, what information does
> > this convey to the user? Does the user need this information? I'm not
> > seeing any of that in your explanations. All I'm seeing is internal
> > avfilter workings - cemented in public API.
> >
> > You are not integrating into an existing infrastructure, you are
> > literally trying to add the whole thing, so you can change it to not
> > need this redundant field.
> >
> > - Hendrik
>
> OK, maybe this is better to understand. Very simple burn-down:
>
> I send a subtitle frame into a filtergraph.
>
> pts = 10
> subtitle_pts = 10
>
> Now I need to re-send a duplicate of that frame into the graph
> (=> heartbeat - to keep the graph flowing)
>
> Now, it's not possible (by lavf design, not mine) to send another frame
> without increasing the pts of the frame. So we send it like this:
>
> pts = 11
> subtitle_pts = 10
>
> The pts needs to be increased. But the start-display time of the
> subtitle event must not be changed - it's still the same frame as
> before, just a duplicate.
>

You just keep re-iterating that avfilter internals need it. And I'll
just keep saying that avfilter internals shall not define public API
design. I have not heard anything that justifies yet another field
that has no benefit to the external public API - to the contrary even,
a user would assume pts is pts, and he would be wrong.

- Hendrik
Soft Works Dec. 6, 2021, 12:40 a.m. UTC | #14
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Hendrik
> Leppkes
> Sent: Monday, December 6, 2021 1:37 AM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> On Mon, Dec 6, 2021 at 1:24 AM Soft Works <softworkz@hotmail.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Hendrik
> > > Leppkes
> > > Sent: Monday, December 6, 2021 12:37 AM
> > > To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> > > Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare
> AVFrame
> > > for subtitle handling
> > >
> > > On Mon, Dec 6, 2021 at 12:23 AM Soft Works <softworkz@hotmail.com> wrote:
> > > >
> > > > It can only work like that.
> > > >
> > >
> > > You literally designed subtitle filtering. You defined how it works.
> > > All your explanations come down to some fancy form of "because I made
> > > it required".
> > >
> > > All I'm hearing is that some internal mechanisms in avfilter need it.
> > > Why is it a public field in one of the most key structures in our
> > > entire codebase? You understand this concern, right? Any public field
> > > should first (and only) answer the question, what information does
> > > this convey to the user? Does the user need this information? I'm not
> > > seeing any of that in your explanations. All I'm seeing is internal
> > > avfilter workings - cemented in public API.
> > >
> > > You are not integrating into an existing infrastructure, you are
> > > literally trying to add the whole thing, so you can change it to not
> > > need this redundant field.
> > >
> > > - Hendrik
> >
> > OK, maybe this is better to understand. Very simple burn-down:
> >
> > I send a subtitle frame into a filtergraph.
> >
> > pts = 10
> > subtitle_pts = 10
> >
> > Now I need to re-send a duplicate of that frame into the graph
> > (=> heartbeat - to keep the graph flowing)
> >
> > Now, it's not possible (by lavf design, not mine) to send another frame
> > without increasing the pts of the frame. So we send it like this:
> >
> > pts = 11
> > subtitle_pts = 10
> >
> > The pts needs to be increased. But the start-display time of the
> > subtitle event must not be changed - it's still the same frame as
> > before, just a duplicate.
> >
> 
> You just keep re-iterating that avfilter internals need it. And I'll
> just keep saying that avfilter internals shall not define public API
> design. I have not heard anything that justifies yet another field
> that has no benefit to the external public API - to the contrary even,
> a user would assume pts is pts, and he would be wrong.

So, what solution would you suggest, then?

sw
Soft Works Dec. 6, 2021, 2:25 a.m. UTC | #15
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Hendrik
> Leppkes
> Sent: Monday, December 6, 2021 1:37 AM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> On Mon, Dec 6, 2021 at 1:24 AM Soft Works <softworkz@hotmail.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Hendrik
> > > Leppkes
> > > Sent: Monday, December 6, 2021 12:37 AM
> > > To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> > > Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare
> AVFrame
> > > for subtitle handling
> > >
> > > On Mon, Dec 6, 2021 at 12:23 AM Soft Works <softworkz@hotmail.com> wrote:
> > > >
> > > > It can only work like that.
> > > >
> > >
> > > You literally designed subtitle filtering. You defined how it works.
> > > All your explanations come down to some fancy form of "because I made
> > > it required".
> > >
> > > All I'm hearing is that some internal mechanisms in avfilter need it.
> > > Why is it a public field in one of the most key structures in our
> > > entire codebase? You understand this concern, right? Any public field
> > > should first (and only) answer the question, what information does
> > > this convey to the user? Does the user need this information? I'm not
> > > seeing any of that in your explanations. All I'm seeing is internal
> > > avfilter workings - cemented in public API.
> > >
> > > You are not integrating into an existing infrastructure, you are
> > > literally trying to add the whole thing, so you can change it to not
> > > need this redundant field.
> > >
> > > - Hendrik
> >
> > OK, maybe this is better to understand. Very simple burn-down:
> >
> > I send a subtitle frame into a filtergraph.
> >
> > pts = 10
> > subtitle_pts = 10
> >
> > Now I need to re-send a duplicate of that frame into the graph
> > (=> heartbeat - to keep the graph flowing)
> >
> > Now, it's not possible (by lavfi design, not mine) to send another frame
> > without increasing the pts of the frame. So we send it like this:
> >
> > pts = 11
> > subtitle_pts = 10
> >
> > The pts needs to be increased. But the start-display time of the
> > subtitle event must not be changed - it's still the same frame as
> > before, just a duplicate.
> >
> 
> You just keep re-iterating that avfilter internals need it. And I'll
> just keep saying that avfilter internals shall not define public API
> design. I have not heard anything that justifies yet another field
> that has no benefit to the external public API - to the contrary even,
> a user would assume pts is pts, and he would be wrong.

About which use cases of the public API are we talking about actually?

Applications similar to ffmpeg CLI?
Applications that want to do subtitle filtering?
Applications that work with subtitle AVFrames?

=> The heartbeat mechanism is an intrinsic requirement for performing
   subtitle filtering. Whether you want to send frames into a filtergraph
   or process subtitle frames coming out of a filtergraph: such applications
   will often need to deal with both fields, and when they only need one,
   it may be either of the two, depending on the use case.

ffmpeg cli needs to deal with both of these fields. That's already a good
indication that other consumers will need the same.
Essentially, this is not an implementation detail that needs to be hidden;
it's an important element of subtitle filtering functionality.

I have tried to explain, in detail and with examples, the functionality and
importance of keeping these fields separate. And it's not only about the use
of those fields in the current code. I have quite a number of future use
cases in mind, and many of them wouldn't be possible without the duality
of those fields.

I admit that the meaning of those fields in a subtitle context and their
relation to each other might not be easy to understand immediately, but
that's just one tiny bit among many others in ffmpeg, and surely not
among the worst.

Maybe some improvement could be achieved by things like renaming the fields
or clarifying the doc texts; I'm open to suggestions.
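
To illustrate the interplay of the two fields, here's a minimal sketch of
the heartbeat case - the helper name and the buffersrc feeding are
illustrative assumptions, not code from this patchset:

// Hypothetical helper: re-send the last subtitle frame as a heartbeat.
// Assumes the AVFrame fields proposed in this patchset (type, subtitle_pts).
#include <libavfilter/buffersrc.h>
#include <libavutil/error.h>
#include <libavutil/frame.h>

static int send_subtitle_heartbeat(AVFilterContext *buffersrc_ctx,
                                   const AVFrame *last_sub, int64_t new_pts)
{
    AVFrame *dup = av_frame_clone(last_sub);
    int ret;

    if (!dup)
        return AVERROR(ENOMEM);

    // The filtergraph requires a higher pts for each new frame ...
    dup->pts = new_pts;
    // ... while subtitle_pts (copied by av_frame_clone) still carries the
    // unchanged display start time of the original event.

    ret = av_buffersrc_add_frame(buffersrc_ctx, dup);
    av_frame_free(&dup);
    return ret;
}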

Kind regards,
softworkz
Nicolas George Dec. 6, 2021, 1:42 p.m. UTC | #16
Hendrik Leppkes (12021-12-06):
> You just keep re-iterating that avfilter internals need it. And I'll
> just keep saying that avfilter internals shall not define public API
> design. I have not heard anything that justifies yet another field
> that has no benefit to the external public API - to the contrary even,
> a user would assume pts is pts, and he would be wrong.

Thank you.

I need to add that libavfilter internals need it only because it was
designed in a very naive way, by duplicating the frame instead of using
dummy frames without content.

This is not the only part of this patch series that is improperly
designed. Having two different filters to overlay subs depending on
whether they are text or bitmap is unacceptable. Not having any utility
filters (to shift or scale the times, to discard a segment, to
concatenate subtitles along with audio and video, to print diagnostic
information, etc.) is also a big issue — and of course, triplicating the
boilerplate code for these filters is a big no. The way the
overlaytextsubs filter synchronizes its inputs is completely broken; it
works with the obvious test case, of course, but it will fail in any
complicated case.

I have made these comments before. I have also offered to help do things
properly. But instead of welcoming my help, Soft Works has rudely
rejected it. So now I am just looking with annoyance as people waste
their time helping Soft Works polish patches whose design is wrong
from the core.

What everybody should be telling Soft Works on this subject is just
this: You do not try to redesign a major API of the project when you
have been in the project for just a few months and are still struggling
with our coding conventions and with configuring a mail client.

Especially not something that has been in the project for more than a
decade. There is a reason I have not yet implemented subtitle
filtering, or offered it as a Summer of Code project: it is hard and has
a ton of prerequisites.

Now, if I knew I had the support of the core developers here, I could
serenely get on with implementing FATE tests for the negotiation system
and refactor it, which is the necessary first step before extending it
for a third media type.

Regards,
Anton Khirnov Dec. 6, 2021, 3:51 p.m. UTC | #17
Quoting Lynne (2021-12-05 22:30:59)
> 5 Dec 2021, 20:21 by softworkz@hotmail.com:
> > One of the reasons why this subtitle filtering patchset has proceeded
> > rather smoothly so far, causing only a small number of regressions
> > that were easy to find and fix, is that I have kept as much as possible
> > from the current AVSubtitle implementation, projecting it to the 
> > new frame-based subtitle implementation, while keeping it mostly 
> > compatible with the current logic.
> > Interestingly, an earlier attempt towards subtitle filtering has
> > gotten stuck right at the point about storing subtitle data in 
> > AVFrame buffers. I decided to focus on functionality rather than
> > academic details. What has been in place and working
> > for such a long time can't be so wrong that it can't be accepted.
> >
> > We also need to consider that it's not my patchset that is deliberately
> > deciding to store data in pointers associated with rects/areas and
> > to store start- and end-times in milliseconds: it's all the decoders
> > and encoders that are doing this. I'm just adapting to the status
> > quo.
> >
> > These two things - changing the timebase of the ms-fields and 
> > changing the location of the buffer pointers - are a gigantic task
> > that would require changing everything that this patchset covers
> > and implements - plus even more code that this patchset hasn't even 
> > touched. It would require changes to everything everywhere where
> > subtitles are concerned.
> > It is a task that is risky as hell, is guaranteed to cause plenty of
> > regressions, requires extensive testing and might even
> > drive this whole endeavor into a dead end. 
> >
> > As I said before, please understand that I'm not ready to enter that
> > path.
> >
> 
> You don't have to break anything to make the start/end times proper
> timestamps. We have timebases for that! If you'd like your timestamps
> to have millisecond precision, you use a millisecond timebase!
> 
> So what if we use ASS internally which rounds timestamps? We have
> bitmap formats too, and those are generally following the video
> timestamps. You absolutely don't want the timestamps being rounded
> in that case, or you risk having your subtitles be a frame late or
> a frame early.
> 
> You could add a separate timebase for the subtitle start/end fields,
> in case the packet's timebase doesn't match the subtitle format's
> native timebase. I won't object to that.
> 
> You also don't have to risk that much by removing the subtitle_pts
> field. Once the start/end times are converted to proper timestamps,
> then the AVFrame->pts field can do whatever it wants to, whether
> it mirrors the demuxer's timestamp, be constant, non-existent or even
> go backwards.
> 
> If you just want to have an empty subtitle rect, you can simply
> zero the subtitle struct's data and buffer fields.
> 
> Calling these issues too academic when the frame API is so very
> straightforward and cut-and-paste is like calling computers
> too academic when you have several thousand large numbers to add.
> 
> I'd really rather have a subtitle API that's usable, maintainable and easy
> to extend, even if it's momentarily broken for certain corner cases
> and weird files.
> We can easily fix broken files one by one. We've got plenty of heuristics,
> and have mostly abstracted and documented them well. We can add more.
> We cannot remove cruft from an API that's designed to be crufty, however,
> because the cruft is the only thing holding it together.

+1

This set is a very significant change that will likely stay with us for
many years, if not decades. So it is of utmost importance that the API
design is clean and well-defined, because changing it later will be
extremely hard.

Keeping all the edge cases working flawlessly is IMO secondary here.
It would be nice of course, but they can always be fixed later. The API
cannot be, at least not without major pain.
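
For illustration, here is a rough sketch of the direction described above -
none of these names exist in the posted patch, they are assumptions made
for the example:

// Hypothetical: express the subtitle display window as timestamps in a
// time base instead of fixed milliseconds.
#include <stdint.h>
#include <libavutil/mathematics.h>
#include <libavutil/rational.h>

// Convert a millisecond-relative display offset (as produced by the
// current decoders) into a caller-chosen time base.
static int64_t sub_offset_to_tb(uint32_t offset_ms, AVRational tb)
{
    return av_rescale_q(offset_ms, (AVRational){ 1, 1000 }, tb);
}

// With frame->time_base = (AVRational){ 1, 90000 } and an event lasting
// 2500 ms, the display window would become:
//     start = frame->pts;
//     end   = frame->pts + sub_offset_to_tb(2500, frame->time_base);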
Soft Works Dec. 6, 2021, 5:11 p.m. UTC | #18
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Anton
> Khirnov
> Sent: Monday, December 6, 2021 4:52 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling
> 
> Quoting Lynne (2021-12-05 22:30:59)
> > 5 Dec 2021, 20:21 by softworkz@hotmail.com:
> > > One of the reasons why this subtitle filtering patchset has proceeded
> > > rather smoothly so far, causing only a small number of regressions
> > > that were easy to find and fix, is that I have kept as much as possible
> > > from the current AVSubtitle implementation, projecting it to the
> > > new frame-based subtitle implementation, while keeping it mostly
> > > compatible with the current logic.
> > > Interestingly, an earlier attempt towards subtitle filtering has
> > > gotten stuck right at the point about storing subtitle data in
> > > AVFrame buffers. I decided to focus on functionality rather than
> > > academic details. What has been in place and working
> > > for such a long time can't be so wrong that it can't be accepted.
> > >
> > > We also need to consider that it's not my patchset that is deliberately
> > > deciding to store data in pointers associated with rects/areas and
> > > to store start- and end-times in milliseconds: it's all the decoders
> > > and encoders that are doing this. I'm just adapting to the status
> > > quo.
> > >
> > > These two things - changing the timebase of the ms-fields and
> > > changing the location of the buffer pointers - are a gigantic task
> > > that would require changing everything that this patchset covers
> > > and implements - plus even more code that this patchset hasn't even
> > > touched. It would require changes to everything everywhere where
> > > subtitles are concerned.
> > > It is a task that is risky as hell, is guaranteed to cause plenty of
> > > regressions, requires extensive testing and might even
> > > drive this whole endeavor into a dead end.
> > >
> > > As I said before, please understand that I'm not ready to enter that
> > > path.
> > >
> >
> > You don't have to break anything to make the start/end times proper
> > timestamps. We have timebases for that! If you'd like your timestamps
> > to have millisecond precision, you use a millisecond timebase!
> >
> > So what if we use ASS internally which rounds timestamps? We have
> > bitmap formats too, and those are generally following the video
> > timestamps. You absolutely don't want the timestamps being rounded
> > in that case, or you risk having your subtitles be a frame late or
> > a frame early.
> >
> > You could add a separate timebase for the subtitle start/end fields,
> > in case the packet's timebase doesn't match the subtitle format's
> > native timebase. I won't object to that.
> >
> > You also don't have to risk that much by removing the subtitle_pts
> > field. Once the start/end times are converted to proper timestamps,
> > then the AVFrame->pts field can do whatever it wants to, whether
> > it mirrors the demuxer's timestamp, be constant, non-existent or even
> > go backwards.
> >
> > If you just want to have an empty subtitle rect, you can simply
> > zero the subtitle struct's data and buffer fields.
> >
> > Calling these issues too academic when the frame API is so very
> > straightforward and cut-and-paste is like calling computers
> > too academic when you have several thousand large numbers to add.
> >
> > I'd really rather have a subtitle API that's usable, maintainable and easy
> > to extend, even if it's momentarily broken for certain corner cases
> > and weird files.
> > We can easily fix broken files one by one. We've got plenty of heuristics,
> > and have mostly abstracted and documented them well. We can add more.
> > We cannot remove cruft from an API that's designed to be crufty, however,
> > because the cruft is the only thing holding it together.
> 
> +1
> 
> This set is a very significant change that will likely stay with us for
> many years, if not decades. So it is of utmost importance that the API
> design is clean and well-defined, because changing it later will be
> extremely hard.
> 
> Keeping all the edge cases working flawlessly is IMO secondary here.
> It would be nice of course, but they can always be fixed later. The API
> cannot be, at least not without major pain.

From the perspective of that logic, the problem is that almost all cases 
are "edge cases".

Kind regards,
softworkz
Soft Works Dec. 6, 2021, 5:46 p.m. UTC | #19
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Nicolas
> George
> Sent: Monday, December 6, 2021 2:43 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH v20 02/20] avutil/frame: Prepare AVFrame
> for subtitle handling

[..]

> Regards,
> --
>   Nicolas George

I have stripped those points that don't appear to be worth responding to.

If you have an example for a scenario that isn't working with my patchset
(and I'm sure such cases can be found), then I'll look into it and try to 
address it.

Regards,
softworkz
Michael Niedermayer Dec. 7, 2021, 11:01 p.m. UTC | #20
On Mon, Dec 06, 2021 at 02:42:58PM +0100, Nicolas George wrote:
> Hendrik Leppkes (12021-12-06):
> > You just keep re-iterating that avfilter internals need it. And I'll
> > just keep saying that avfilter internals shall not define public API
> > design. I have not heard anything that justifies yet another field
> > that has no benefit to the external public API - to the contrary even,
> > a user would assume pts is pts, and he would be wrong.
> 
> Thank you.
> 
> I need to add that libavfilter internals need it only because it was
> designed in a very naive way, by duplicating the frame instead of using
> dummy frames without content.
> 
> This is not the only part of these patch series that are improperly
> designed. 

> Having two different filters to overlay subs depending on
> whether they are text or bitmap is unacceptable. 

i do agree that it is inconvenient, but i would not call it unacceptable


> Not having any utility
> filters (to shift or scale the times, to discard a segment, tu
> concatenate subtitles along with audio and video, to print diagnostic
> information, etc.) is also a big issue 

i agree too


[...]

> Now, if I knew I had the support of the core developers here, I could
> serenely get on with implementing FATE tests for the negotiation system
> and refactor it, which is the necessary first step before extending it
> for a third media type.

Anyone who does that without triggering hostilities and (either works
together with other developers OR finds any other solution all involved 
are happy with) has my support.
But i am a bit scared of more fights, hostilities between 2 cooks trying
to cook one soup.
And myself having more headaches from reading more escalating threads.
Basically i don't want to end up hiding depressed in a corner after some
quite hard-headed developers fail to treat each other respectfully and
cooperate.

Also, judging from your question here, i guess that you have time and want
to work on FFmpeg? If you'd rather work on something else within FFmpeg,
i have both easy and hard things that need to be done.
I have both easy and hard things that need to be done
Easy one: fix SSL support on fate.ffmpeg.org (that is, set up letsencrypt).

Hard one: can low-latency support in libavfilter be improved?
Consider a use case like realtime streaming, cloud gaming or some cloud-based
metaverse: any filtering done for these would benefit from really low latency,
because the end user interacts with the remote side, so any filtering
adds to the latency of this loop.

Is libavfilter optimal for this use case?
Would some different threading be better, or maybe some information
exchange between filters about which lines from the previous filter
are already available for the next one to process?

thx

[...]
Patch

diff --git a/doc/APIchanges b/doc/APIchanges
index a466a2ec22..8c63ea0311 100644
--- a/doc/APIchanges
+++ b/doc/APIchanges
@@ -14,6 +14,16 @@  libavutil:     2021-04-27
 
 API changes, most recent first:
 
+2021-12-05 - xxxxxxxxxx - lavu 58.1.100 - frame.h
+  Add AVMediaType field to AVFrame
+  Add Fields for carrying subtitle data to AVFrame
+  (subtitle_areas, subtitle_header, subtitle_pts, start/end time, etc.)
+  Add av_frame_get_buffer2() and deprecate av_frame_get_buffer()
+
+2021-12-05 - xxxxxxxxxx - lavu 58.1.100 - subfmt.h
+  Add struct AVSubtitleArea (replaces AVSubtitle)
+  Add av_get_subtitle_fmt_name() and av_get_subtitle_fmt()
+
 2021-12-05 - xxxxxxxxxx - lavu 58.1.100 - subfmt.h
   Add enum AVSubtitleType (moved from lavc), add new values, deprecate existing
 
diff --git a/libavutil/Makefile b/libavutil/Makefile
index c7843db1e4..7e79936876 100644
--- a/libavutil/Makefile
+++ b/libavutil/Makefile
@@ -160,6 +160,7 @@  OBJS = adler32.o                                                        \
        slicethread.o                                                    \
        spherical.o                                                      \
        stereo3d.o                                                       \
+       subfmt.o                                                         \
        threadmessage.o                                                  \
        time.o                                                           \
        timecode.o                                                       \
diff --git a/libavutil/frame.c b/libavutil/frame.c
index 64eed5b0dd..5da082adda 100644
--- a/libavutil/frame.c
+++ b/libavutil/frame.c
@@ -26,6 +26,7 @@ 
 #include "imgutils.h"
 #include "mem.h"
 #include "samplefmt.h"
+#include "subfmt.h"
 #include "hwcontext.h"
 
 #define CHECK_CHANNELS_CONSISTENCY(frame) \
@@ -50,6 +51,9 @@  const char *av_get_colorspace_name(enum AVColorSpace val)
     return name[val];
 }
 #endif
+
+static int frame_copy_subtitles(AVFrame *dst, const AVFrame *src, int copy_data);
+
 static void get_frame_defaults(AVFrame *frame)
 {
     if (frame->extended_data != frame->data)
@@ -73,7 +77,12 @@  static void get_frame_defaults(AVFrame *frame)
     frame->colorspace          = AVCOL_SPC_UNSPECIFIED;
     frame->color_range         = AVCOL_RANGE_UNSPECIFIED;
     frame->chroma_location     = AVCHROMA_LOC_UNSPECIFIED;
     frame->flags               = 0;
+    frame->subtitle_start_time = 0;
+    frame->subtitle_end_time   = 0;
+    frame->num_subtitle_areas  = 0;
+    frame->subtitle_areas      = NULL;
+    frame->subtitle_pts        = 0;
+    frame->subtitle_header     = NULL;
 }
 
 static void free_side_data(AVFrameSideData **ptr_sd)
@@ -244,23 +253,55 @@  static int get_audio_buffer(AVFrame *frame, int align)
 
 }
 
+static int get_subtitle_buffer(AVFrame *frame)
+{
+    // Buffers in AVFrame->buf[] are not used in case of subtitle frames.
+    // To accommodate existing code, which checks ->buf[0] to determine
+    // whether a frame is ref-counted or has data, we're adding a 1-byte
+    // buffer here, which marks the subtitle frame as containing data.
+    frame->buf[0] = av_buffer_alloc(1);
+    if (!frame->buf[0]) {
+        av_frame_unref(frame);
+        return AVERROR(ENOMEM);
+    }
+
+    frame->extended_data = frame->data;
+
+    return 0;
+}
+
 int av_frame_get_buffer(AVFrame *frame, int align)
+{
+    if (frame->width > 0 && frame->height > 0)
+        frame->type = AVMEDIA_TYPE_VIDEO;
+    else if (frame->nb_samples > 0 && (frame->channel_layout || frame->channels > 0))
+        frame->type = AVMEDIA_TYPE_AUDIO;
+
+    return av_frame_get_buffer2(frame, align);
+}
+
+int av_frame_get_buffer2(AVFrame *frame, int align)
 {
     if (frame->format < 0)
         return AVERROR(EINVAL);
 
-    if (frame->width > 0 && frame->height > 0)
+    switch (frame->type) {
+    case AVMEDIA_TYPE_VIDEO:
         return get_video_buffer(frame, align);
-    else if (frame->nb_samples > 0 && (frame->channel_layout || frame->channels > 0))
+    case AVMEDIA_TYPE_AUDIO:
         return get_audio_buffer(frame, align);
-
-    return AVERROR(EINVAL);
+    case AVMEDIA_TYPE_SUBTITLE:
+        return get_subtitle_buffer(frame);
+    default:
+        return AVERROR(EINVAL);
+    }
 }
 
 static int frame_copy_props(AVFrame *dst, const AVFrame *src, int force_copy)
 {
     int ret, i;
 
+    dst->type                   = src->type;
     dst->key_frame              = src->key_frame;
     dst->pict_type              = src->pict_type;
     dst->sample_aspect_ratio    = src->sample_aspect_ratio;
@@ -292,6 +333,12 @@  static int frame_copy_props(AVFrame *dst, const AVFrame *src, int force_copy)
     dst->colorspace             = src->colorspace;
     dst->color_range            = src->color_range;
     dst->chroma_location        = src->chroma_location;
+    dst->subtitle_start_time    = src->subtitle_start_time;
+    dst->subtitle_end_time      = src->subtitle_end_time;
+    dst->subtitle_pts           = src->subtitle_pts;
+
+    if ((ret = av_buffer_replace(&dst->subtitle_header, src->subtitle_header)) < 0)
+        return ret;
 
     av_dict_copy(&dst->metadata, src->metadata, 0);
 
@@ -333,6 +380,7 @@  int av_frame_ref(AVFrame *dst, const AVFrame *src)
     av_assert1(dst->width == 0 && dst->height == 0);
     av_assert1(dst->channels == 0);
 
+    dst->type           = src->type;
     dst->format         = src->format;
     dst->width          = src->width;
     dst->height         = src->height;
@@ -346,7 +394,7 @@  int av_frame_ref(AVFrame *dst, const AVFrame *src)
 
     /* duplicate the frame data if it's not refcounted */
     if (!src->buf[0]) {
-        ret = av_frame_get_buffer(dst, 0);
+        ret = av_frame_get_buffer2(dst, 0);
         if (ret < 0)
             goto fail;
 
@@ -368,6 +416,10 @@  int av_frame_ref(AVFrame *dst, const AVFrame *src)
         }
     }
 
+    /* copy subtitle areas */
+    if (src->type == AVMEDIA_TYPE_SUBTITLE) {
+        ret = frame_copy_subtitles(dst, src, 0);
+        if (ret < 0)
+            goto fail;
+    }
+
     if (src->extended_buf) {
         dst->extended_buf = av_calloc(src->nb_extended_buf,
                                       sizeof(*dst->extended_buf));
@@ -438,7 +490,7 @@  AVFrame *av_frame_clone(const AVFrame *src)
 
 void av_frame_unref(AVFrame *frame)
 {
-    int i;
+    unsigned i, n;
 
     if (!frame)
         return;
@@ -457,6 +509,21 @@  void av_frame_unref(AVFrame *frame)
     av_buffer_unref(&frame->opaque_ref);
     av_buffer_unref(&frame->private_ref);
 
+    av_buffer_unref(&frame->subtitle_header);
+
+    for (i = 0; i < frame->num_subtitle_areas; i++) {
+        AVSubtitleArea *area = frame->subtitle_areas[i];
+
+        for (n = 0; n < FF_ARRAY_ELEMS(area->buf); n++)
+            av_buffer_unref(&area->buf[n]);
+
+        av_freep(&area->text);
+        av_freep(&area->ass);
+        av_free(area);
+    }
+
+    av_freep(&frame->subtitle_areas);
+
     get_frame_defaults(frame);
 }
 
@@ -474,18 +541,28 @@  void av_frame_move_ref(AVFrame *dst, AVFrame *src)
 
 int av_frame_is_writable(AVFrame *frame)
 {
-    int i, ret = 1;
+    AVSubtitleArea *area;
+    int ret = 1;
 
     /* assume non-refcounted frames are not writable */
     if (!frame->buf[0])
         return 0;
 
-    for (i = 0; i < FF_ARRAY_ELEMS(frame->buf); i++)
+    for (unsigned i = 0; i < FF_ARRAY_ELEMS(frame->buf); i++)
         if (frame->buf[i])
             ret &= !!av_buffer_is_writable(frame->buf[i]);
-    for (i = 0; i < frame->nb_extended_buf; i++)
+    for (unsigned i = 0; i < frame->nb_extended_buf; i++)
         ret &= !!av_buffer_is_writable(frame->extended_buf[i]);
 
+    for (unsigned i = 0; i < frame->num_subtitle_areas; i++) {
+        area = frame->subtitle_areas[i];
+        if (area) {
+            for (unsigned n = 0; n < FF_ARRAY_ELEMS(area->buf); n++)
+                if (area->buf[n])
+                    ret &= !!av_buffer_is_writable(area->buf[n]);
+        }
+    }
+
     return ret;
 }
 
@@ -501,6 +578,7 @@  int av_frame_make_writable(AVFrame *frame)
         return 0;
 
     memset(&tmp, 0, sizeof(tmp));
+    tmp.type           = frame->type;
     tmp.format         = frame->format;
     tmp.width          = frame->width;
     tmp.height         = frame->height;
@@ -511,7 +589,7 @@  int av_frame_make_writable(AVFrame *frame)
     if (frame->hw_frames_ctx)
         ret = av_hwframe_get_buffer(frame->hw_frames_ctx, &tmp, 0);
     else
-        ret = av_frame_get_buffer(&tmp, 0);
+        ret = av_frame_get_buffer2(&tmp, 0);
     if (ret < 0)
         return ret;
 
@@ -546,14 +624,22 @@  AVBufferRef *av_frame_get_plane_buffer(AVFrame *frame, int plane)
     uint8_t *data;
     int planes, i;
 
-    if (frame->nb_samples) {
-        int channels = frame->channels;
-        if (!channels)
-            return NULL;
-        CHECK_CHANNELS_CONSISTENCY(frame);
-        planes = av_sample_fmt_is_planar(frame->format) ? channels : 1;
-    } else
+    switch (frame->type) {
+    case AVMEDIA_TYPE_VIDEO:
         planes = 4;
+        break;
+    case AVMEDIA_TYPE_AUDIO:
+        {
+            int channels = frame->channels;
+            if (!channels)
+                return NULL;
+            CHECK_CHANNELS_CONSISTENCY(frame);
+            planes = av_sample_fmt_is_planar(frame->format) ? channels : 1;
+            break;
+        }
+    default:
+        return NULL;
+    }
 
     if (plane < 0 || plane >= planes || !frame->extended_data[plane])
         return NULL;
@@ -677,17 +763,98 @@  static int frame_copy_audio(AVFrame *dst, const AVFrame *src)
     return 0;
 }
 
+static int av_subtitle_area2area(AVSubtitleArea *dst, const AVSubtitleArea *src, int copy_data)
+{
+    dst->x         =  src->x;
+    dst->y         =  src->y;
+    dst->w         =  src->w;
+    dst->h         =  src->h;
+    dst->nb_colors =  src->nb_colors;
+    dst->type      =  src->type;
+    dst->flags     =  src->flags;
+
+    switch (dst->type) {
+    case AV_SUBTITLE_FMT_BITMAP:
+
+        for (unsigned i = 0; i < AV_NUM_BUFFER_POINTERS; i++) {
+            if (src->h > 0 && src->w > 0 && src->buf[i]) {
+                dst->buf[i] = av_buffer_ref(src->buf[i]);
+                if (!dst->buf[i])
+                    return AVERROR(ENOMEM);
+
+                if (copy_data) {
+                    const int ret = av_buffer_make_writable(&dst->buf[i]);
+                    if (ret < 0)
+                        return ret;
+                }
+
+                dst->linesize[i] = src->linesize[i];
+            }
+        }
+
+        memcpy(&dst->pal[0], &src->pal[0], sizeof(src->pal[0]) * 256);
+
+        break;
+    case AV_SUBTITLE_FMT_TEXT:
+
+        if (src->text) {
+            dst->text = av_strdup(src->text);
+            if (!dst->text)
+                return AVERROR(ENOMEM);
+        }
+
+        break;
+    case AV_SUBTITLE_FMT_ASS:
+
+        if (src->ass) {
+            dst->ass = av_strdup(src->ass);
+            if (!dst->ass)
+                return AVERROR(ENOMEM);
+        }
+
+        break;
+    default:
+
+        av_log(NULL, AV_LOG_ERROR, "Subtitle area has invalid format: %d\n", dst->type);
+        return AVERROR(EINVAL);
+    }
+
+    return 0;
+}
+
+static int frame_copy_subtitles(AVFrame *dst, const AVFrame *src, int copy_data)
+{
+    dst->format = src->format;
+
+    dst->num_subtitle_areas = 0;
+
+    if (src->num_subtitle_areas) {
+        dst->subtitle_areas = av_calloc(src->num_subtitle_areas, sizeof(AVSubtitleArea *));
+        if (!dst->subtitle_areas)
+            return AVERROR(ENOMEM);
+
+        for (unsigned i = 0; i < src->num_subtitle_areas; i++) {
+            int ret;
+            dst->subtitle_areas[i] = av_mallocz(sizeof(AVSubtitleArea));
+            if (!dst->subtitle_areas[i])
+                return AVERROR(ENOMEM);
+            // Count areas as they are created, so cleanup on failure works.
+            dst->num_subtitle_areas = i + 1;
+            ret = av_subtitle_area2area(dst->subtitle_areas[i], src->subtitle_areas[i], copy_data);
+            if (ret < 0)
+                return ret;
+        }
+    }
+
+    return 0;
+}
+
 int av_frame_copy(AVFrame *dst, const AVFrame *src)
 {
     if (dst->format != src->format || dst->format < 0)
         return AVERROR(EINVAL);
 
-    if (dst->width > 0 && dst->height > 0)
+    switch (dst->type) {
+    case AVMEDIA_TYPE_VIDEO:
         return frame_copy_video(dst, src);
-    else if (dst->nb_samples > 0 && dst->channels > 0)
+    case AVMEDIA_TYPE_AUDIO:
         return frame_copy_audio(dst, src);
-
-    return AVERROR(EINVAL);
+    case AVMEDIA_TYPE_SUBTITLE:
+        return frame_copy_subtitles(dst, src, 1);
+    default:
+        return AVERROR(EINVAL);
+    }
 }
 
 void av_frame_remove_side_data(AVFrame *frame, enum AVFrameSideDataType type)
diff --git a/libavutil/frame.h b/libavutil/frame.h
index 3f295f6b9e..f4ca23781b 100644
--- a/libavutil/frame.h
+++ b/libavutil/frame.h
@@ -34,6 +34,7 @@ 
 #include "rational.h"
 #include "samplefmt.h"
 #include "pixfmt.h"
+#include "subfmt.h"
 #include "version.h"
 
 
@@ -278,7 +279,7 @@  typedef struct AVRegionOfInterest {
 } AVRegionOfInterest;
 
 /**
- * This structure describes decoded (raw) audio or video data.
+ * This structure describes decoded (raw) audio, video or subtitle data.
  *
  * AVFrame must be allocated using av_frame_alloc(). Note that this only
  * allocates the AVFrame itself, the buffers for the data must be managed
@@ -309,6 +310,13 @@  typedef struct AVRegionOfInterest {
  */
 typedef struct AVFrame {
 #define AV_NUM_DATA_POINTERS 8
+    /**
+     * Media type of the frame (audio, video, subtitles..)
+     *
+     * See AVMEDIA_TYPE_xxx
+     */
+    enum AVMediaType type;
+
     /**
      * pointer to the picture/channel planes.
      * This might be different from the first allocated byte. For video,
@@ -392,7 +400,7 @@  typedef struct AVFrame {
     /**
      * format of the frame, -1 if unknown or unset
      * Values correspond to enum AVPixelFormat for video frames,
-     * enum AVSampleFormat for audio)
+     * enum AVSampleFormat for audio, AVSubtitleType for subtitles)
      */
     int format;
 
@@ -491,6 +499,39 @@  typedef struct AVFrame {
      */
     uint64_t channel_layout;
 
+    /**
+     * Display start time, relative to packet pts, in ms.
+     */
+    uint32_t subtitle_start_time;
+
+    /**
+     * Display end time, relative to packet pts, in ms.
+     */
+    uint32_t subtitle_end_time;
+
+    /**
+     * Number of items in the @ref subtitle_areas array.
+     */
+    unsigned num_subtitle_areas;
+
+    /**
+     * Array of subtitle areas, may be empty.
+     */
+    AVSubtitleArea **subtitle_areas;
+
+    /**
+     * Usually the same as packet pts, in AV_TIME_BASE.
+     *
+     * @deprecated This is kept for compatibility reasons and corresponds to
+     * AVSubtitle->pts. Might be removed in the future.
+     */
+    int64_t subtitle_pts;
+
+    /**
+     * Header containing style information for text subtitles.
+     */
+    AVBufferRef *subtitle_header;
+
     /**
      * AVBuffer references backing the data for this frame. All the pointers in
      * data and extended_data must point inside one of the buffers in buf or
@@ -750,6 +791,8 @@  void av_frame_move_ref(AVFrame *dst, AVFrame *src);
 /**
  * Allocate new buffer(s) for audio or video data.
  *
+ * Note: For subtitle data, use av_frame_get_buffer2
+ *
  * The following fields must be set on frame before calling this function:
  * - format (pixel format for video, sample format for audio)
  * - width and height for video
@@ -769,9 +812,39 @@  void av_frame_move_ref(AVFrame *dst, AVFrame *src);
  *              recommended to pass 0 here unless you know what you are doing.
  *
  * @return 0 on success, a negative AVERROR on error.
+ *
+ * @deprecated Use @ref av_frame_get_buffer2 instead and set @ref AVFrame.type
+ * before calling.
  */
+attribute_deprecated
 int av_frame_get_buffer(AVFrame *frame, int align);
 
+/**
+ * Allocate new buffer(s) for audio, video or subtitle data.
+ *
+ * The following fields must be set on frame before calling this function:
+ * - format (pixel format for video, sample format for audio)
+ * - width and height for video
+ * - nb_samples and channel_layout for audio
+ * - type (AVMediaType)
+ *
+ * This function will fill AVFrame.data and AVFrame.buf arrays and, if
+ * necessary, allocate and fill AVFrame.extended_data and AVFrame.extended_buf.
+ * For planar formats, one buffer will be allocated for each plane.
+ *
+ * @warning: if frame already has been allocated, calling this function will
+ *           leak memory. In addition, undefined behavior can occur in certain
+ *           cases.
+ *
+ * @param frame frame in which to store the new buffers.
+ * @param align Required buffer size alignment. If equal to 0, alignment will be
+ *              chosen automatically for the current CPU. It is highly
+ *              recommended to pass 0 here unless you know what you are doing.
+ *
+ * @return 0 on success, a negative AVERROR on error.
+ */
+int av_frame_get_buffer2(AVFrame *frame, int align);
+
 /**
  * Check if the frame data is writable.
  *
diff --git a/libavutil/subfmt.c b/libavutil/subfmt.c
new file mode 100644
index 0000000000..3998a6d486
--- /dev/null
+++ b/libavutil/subfmt.c
@@ -0,0 +1,50 @@ 
+/*
+ * Copyright (c) 2021 softworkz
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <string.h>
+#include "common.h"
+#include "subfmt.h"
+
+typedef struct SubtitleFmtInfo {
+    enum AVSubtitleType fmt;
+    const char* name;
+} SubtitleFmtInfo;
+
+static const char sub_fmt_info[AV_SUBTITLE_FMT_NB][24] = {
+    [AV_SUBTITLE_FMT_UNKNOWN] = "Unknown subtitle format",
+    [AV_SUBTITLE_FMT_BITMAP]  = "Graphical subtitles",
+    [AV_SUBTITLE_FMT_TEXT]    = "Text subtitles (plain)",
+    [AV_SUBTITLE_FMT_ASS]     = "Text subtitles (ass)",
+};
+
+const char *av_get_subtitle_fmt_name(enum AVSubtitleType sub_fmt)
+{
+    if (sub_fmt < 0 || sub_fmt >= AV_SUBTITLE_FMT_NB)
+        return NULL;
+    return sub_fmt_info[sub_fmt];
+}
+
+enum AVSubtitleType av_get_subtitle_fmt(const char *name)
+{
+    for (int i = 0; i < AV_SUBTITLE_FMT_NB; i++)
+        if (!strcmp(sub_fmt_info[i], name))
+            return i;
+    return AV_SUBTITLE_FMT_NONE;
+}
diff --git a/libavutil/subfmt.h b/libavutil/subfmt.h
index 88078a0d14..4e0250cd23 100644
--- a/libavutil/subfmt.h
+++ b/libavutil/subfmt.h
@@ -21,6 +21,10 @@ 
 #ifndef AVUTIL_SUBFMT_H
 #define AVUTIL_SUBFMT_H
 
+#include <stdint.h>
+
+#include "buffer.h"
+
 enum AVSubtitleType {
 
     /**
@@ -63,4 +67,48 @@  enum AVSubtitleType {
     AV_SUBTITLE_FMT_NB,         ///< number of subtitle formats, DO NOT USE THIS if you want to link with shared libav* because the number of formats might differ between versions.
 };
 
+typedef struct AVSubtitleArea {
+#define AV_NUM_BUFFER_POINTERS 1
+
+    enum AVSubtitleType type;
+    int flags;
+
+    int x;         ///< top left corner  of area.
+    int y;         ///< top left corner  of area.
+    int w;         ///< width            of area.
+    int h;         ///< height           of area.
+    int nb_colors; ///< number of colors in bitmap palette (@ref pal).
+
+    /**
+     * Buffers and line sizes for the bitmap of this subtitle.
+     *
+     * @{
+     */
+    AVBufferRef *buf[AV_NUM_BUFFER_POINTERS];
+    int linesize[AV_NUM_BUFFER_POINTERS];
+    /**
+     * @}
+     */
+
+    uint32_t pal[256]; ///< RGBA palette for the bitmap.
+
+    char *text;        ///< 0-terminated plain UTF-8 text
+    char *ass;         ///< 0-terminated ASS/SSA compatible event line.
+
+} AVSubtitleArea;
+
+/**
+ * Return the name of sub_fmt, or NULL if sub_fmt is not
+ * recognized.
+ */
+const char *av_get_subtitle_fmt_name(enum AVSubtitleType sub_fmt);
+
+/**
+ * Return a subtitle format corresponding to name, or AV_SUBTITLE_FMT_NONE
+ * on error.
+ *
+ * @param name Subtitle format name.
+ */
+enum AVSubtitleType av_get_subtitle_fmt(const char *name);
+
 #endif /* AVUTIL_SUBFMT_H */
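
For reference, a minimal usage sketch of the API added above: allocate a
subtitle frame and attach one ASS area. Error handling is abbreviated, and
the helper itself is illustrative, not part of the patch:

#include "libavutil/frame.h"
#include "libavutil/mem.h"

static AVFrame *alloc_ass_frame(const char *ass_line)
{
    AVFrame *frame = av_frame_alloc();
    if (!frame)
        return NULL;

    frame->type   = AVMEDIA_TYPE_SUBTITLE;
    frame->format = AV_SUBTITLE_FMT_ASS;

    // Allocates the 1-byte marker buffer that flags the frame as
    // ref-counted (see get_subtitle_buffer() above).
    if (av_frame_get_buffer2(frame, 0) < 0)
        goto fail;

    frame->subtitle_areas = av_calloc(1, sizeof(*frame->subtitle_areas));
    if (!frame->subtitle_areas)
        goto fail;
    frame->subtitle_areas[0] = av_mallocz(sizeof(AVSubtitleArea));
    if (!frame->subtitle_areas[0])
        goto fail;
    frame->num_subtitle_areas = 1;

    frame->subtitle_areas[0]->type = AV_SUBTITLE_FMT_ASS;
    frame->subtitle_areas[0]->ass  = av_strdup(ass_line);
    if (!frame->subtitle_areas[0]->ass)
        goto fail;

    return frame;

fail:
    av_frame_free(&frame); // also frees any areas via av_frame_unref()
    return NULL;
}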