diff mbox series

[FFmpeg-devel,1/2] avcodec/{ass, webvttdec}: fix handling of backslashes

Message ID 20220116181655.6407-1-oneric@oneric.de
State New
Headers show
Series [FFmpeg-devel,1/2] avcodec/{ass, webvttdec}: fix handling of backslashes | expand

Checks

Context Check Description
andriy/configure_aarch64_jetson warning Failed to run configure
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished
andriy/make_ppc success Make finished
andriy/make_fate_ppc success Make fate finished

Commit Message

Oneric Jan. 16, 2022, 6:16 p.m. UTC
Backslashes cannot be escaped by backslashes in any ASS renderer,
but unless followed by a few specific characters it is just printed
as a regular character. Insert a word-joiner character after a backslash
to break up the active sequences without changing the visual output.
Also the existing \{ and \} escapes are specific to libass only.
---
The  patch assumes UTF-8 encoding in ff_ass_bprint_text_event
(WebVTT requires UTF-8 per sepc). If we cannot assume a particular
encoding, please advise how to best insert a word-joiner character in 
the correct encoding.
---
 libavcodec/ass.c       | 5 ++++-
 libavcodec/webvttdec.c | 2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

Comments

Oneric Feb. 1, 2022, 5:38 p.m. UTC | #1
On Sun, Jan 16, 2022 at 19:16:54 +0100, Oneric wrote:
> [PATCH 1/2] avcodec/{ass, webvttdec}: fix handling of backslashes
>  libavcodec/ass.c       | 5 ++++-
>  libavcodec/webvttdec.c | 2 +-
>  2 files changed, 5 insertions(+), 2 deletions(-)

> [PATCH 2/2] avcodec/webvttdec: honour bidi marks
>  libavcodec/webvttdec.c     | 2 +-
>  tests/ref/fate/sub-webvtt2 | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)

ping.

In case anyone is wondering why patchwork fails to apply the second patch,
this is probably once again because the patch updates one of FATE's ASS
reference files which use CRLF line-endings.
Locally git am applies both without a hitch for me on top of current master
(and FATE passes after applying each patch).
Soft Works Feb. 1, 2022, 7:44 p.m. UTC | #2
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Oneric
> Sent: Tuesday, February 1, 2022 6:39 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> On Sun, Jan 16, 2022 at 19:16:54 +0100, Oneric wrote:
> > [PATCH 1/2] avcodec/{ass, webvttdec}: fix handling of backslashes
> >  libavcodec/ass.c       | 5 ++++-
> >  libavcodec/webvttdec.c | 2 +-
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> > [PATCH 2/2] avcodec/webvttdec: honour bidi marks
> >  libavcodec/webvttdec.c     | 2 +-
> >  tests/ref/fate/sub-webvtt2 | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> ping.
> 
> In case anyone is wondering why patchwork fails to apply the second patch,
> this is probably once again because the patch updates one of FATE's ASS
> reference files which use CRLF line-endings.
> Locally git am applies both without a hitch for me on top of current master
> (and FATE passes after applying each patch).


You can add a .gitattributes file to tests/ref/fate/ which includes the line

sub-webvtt2 -diff

Then your local git format-patch will create a binary diff for the file.


softworkz
Oneric Feb. 1, 2022, 8:06 p.m. UTC | #3
On Tue, Feb 01, 2022 at 19:44:24 +0000, Soft Works wrote:
> > On Sun, Jan 16, 2022 at 19:16:54 +0100, Oneric wrote:
> > 
> > In case anyone is wondering why patchwork fails to apply the second patch,
> > this is probably once again because the patch updates one of FATE's ASS
> > reference files which use CRLF line-endings.
> > Locally git am applies both without a hitch for me on top of current master
> > (and FATE passes after applying each patch).
> 
> 
> You can add a .gitattributes file to tests/ref/fate/ which includes the line
> 
> sub-webvtt2 -diff
> 
> Then your local git format-patch will create a binary diff for the file.

Thanks for your suggestion. However, a binary diff would look like this which 
isn't great for seeing what's going on during review:

  diff --git a/tests/ref/fate/sub-webvtt2 b/tests/ref/fate/sub-webvtt2
  index 357b8178ea1cf224ad47dcf78b24f1948ece6665..4cd1d86a9a58ccf65812131bf84a17531c2c6cfa 100644
  GIT binary patch
  delta 24
  gcmeys^NnXiIV;bjhJHgMV-r)eM-6?GYgs=70DwpeRR910
  
  delta 18
  Zcmeyy^MPkWIV+o?k+F%X+2m%{&j3Hw26O-b

Also as noted, locally plain `git am` has no issues applying the regular 
(non-binary) patch; iirc patchwork uses some additional flags to make it more 
restrictive because regular codefiles shouldn't use CRLF line-endings.

I recall using some .gitattribute settings to force crlf (without making 
the file binary?) were discussed last time this happened, but deemed to be not 
worth it because of some other issues with it.

> 
> softworkz
Soft Works Feb. 1, 2022, 8:41 p.m. UTC | #4
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Oneric
> Sent: Tuesday, February 1, 2022 9:07 PM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> On Tue, Feb 01, 2022 at 19:44:24 +0000, Soft Works wrote:
> > > On Sun, Jan 16, 2022 at 19:16:54 +0100, Oneric wrote:
> > >
> > > In case anyone is wondering why patchwork fails to apply the second
> patch,
> > > this is probably once again because the patch updates one of FATE's ASS
> > > reference files which use CRLF line-endings.
> > > Locally git am applies both without a hitch for me on top of current
> master
> > > (and FATE passes after applying each patch).
> >
> >
> > You can add a .gitattributes file to tests/ref/fate/ which includes the
> line
> >
> > sub-webvtt2 -diff
> >
> > Then your local git format-patch will create a binary diff for the file.
> 
> Thanks for your suggestion. However, a binary diff would look like this which
> isn't great for seeing what's going on during review:
> 
>   diff --git a/tests/ref/fate/sub-webvtt2 b/tests/ref/fate/sub-webvtt2
>   index
> 357b8178ea1cf224ad47dcf78b24f1948ece6665..4cd1d86a9a58ccf65812131bf84a17531c2
> c6cfa 100644
>   GIT binary patch
>   delta 24
>   gcmeys^NnXiIV;bjhJHgMV-r)eM-6?GYgs=70DwpeRR910
> 
>   delta 18
>   Zcmeyy^MPkWIV+o?k+F%X+2m%{&j3Hw26O-b

Yes, I know, but the question is probably what's more important..

..that it can be applied or that the text is viewable?

I assume that a reviewer would often apply the patch locally anyway, then 
the text changes become visible again as this doesn't affect git log.
Given that it's just a handful files from a few thousands, this should 
be acceptable.


> Also as noted, locally plain `git am` has no issues applying the regular
> (non-binary) patch; 

Yes, because it hasn't been sent around via e-mail yet. SMTP requires
CRLF line endings, so essentially it depends on the receiving e-mail client
whether it outputs LF or CRLF. It's then up to git whether it ignores
or adjust the line ending when applying a patch. Maybe that's what is 
configured @patchwork, I can't tell.

 
The most extreme example is sub-textenc (and there was another one iirc).
It has mixed line endings - some LF and some CRLF. At least at that point
there's no way out, because those endings will get unrecoverably lost 
as soon as it is sent around via e-mail as text-diff.

That's how I came to the binary diff as only working option.
(or you just don't send patches around via e-mail at all ;-)


> I recall using some .gitattribute settings to force crlf (without making
> the file binary?) were discussed last time this happened, but deemed to be
> not worth it because of some other issues with it.

Doesn't work with mixed endings. BTW, I don't find it wrong to make such files
binary because logically these are not text files, but test reference results,
which are expected to precisely match (rather than "have the same text").

If you don't want to mark the files as binary, you could probably use
.gitattributes without commiting, so it would only affect format-patch 
generation.

Best,
softworkz
Oneric Feb. 1, 2022, 11:25 p.m. UTC | #5
On Tue, Feb 01, 2022 at 20:41:37 +0000, Soft Works wrote:
> > On Tue, Feb 01, 2022 at 19:44:24 +0000, Soft Works wrote:
> > > > On Sun, Jan 16, 2022 at 19:16:54 +0100, Oneric wrote:
> > > >
> > > > In case anyone is wondering why patchwork fails to apply the second patch,
> > > > this is probably once again because the patch updates one of FATE's ASS
> > > > reference files which use CRLF line-endings.
> > > > Locally git am applies both without a hitch for me on top of current master
> > > > (and FATE passes after applying each patch).
> > >
> > >
> > > You can add a .gitattributes file to tests/ref/fate/ which includes the line
> > >
> > > sub-webvtt2 -diff
> > >
> > > Then your local git format-patch will create a binary diff for the file.
> > 
> > Thanks for your suggestion. However, a binary diff would look like this which
> > isn't great for seeing what's going on during review:
> > 
> >   [...]
> 
> Yes, I know, but the question is probably what's more important..
> 
> ..that it can be applied or that the text is viewable?

It CAN be applied (as I've now written twice) and
of course I verified this with the mail received from the list.

> > Also as noted, locally plain `git am` has no issues applying the regular
> > (non-binary) patch; 
> 
> Yes, because it hasn't been sent around via e-mail yet. SMTP requires
> CRLF line endings, so essentially it depends on the receiving e-mail client
> whether it outputs LF or CRLF. [...]

The mail content is base64 encoded during client-server and server-server 
transfer. Perhaps the raw base64 text is using all CRLF line-endings, but 
this doesn't matter. Once decoded it will byte-for-byte match the format-patch 
I created locally except for the ffmpeg-devel footer and additional headers in 
the mail version — but git am will just ignore those.
See the headers of the second patch's mail:

  Content-Type: text/plain; charset="utf-8"
  Content-Transfer-Encoding: base64

For comparison, here are the same headers for your first reply:

  Content-Type: text/plain; charset="us-ascii"
  Content-Transfer-Encoding: 7bit

If you tried to send a patch with your mail client by pasting in the diff,
then yes, presumably the line-endings would indeed get mangled. That's why 
doc/developer.texi tells you not to do this (instead recommending
git send-email or binary (i.e. base64-encoded) attachments).
  
> The most extreme example is sub-textenc (and there was another one iirc).
> It has mixed line endings - some LF and some CRLF. At least at that point
> there's no way out, because those endings will get unrecoverably lost 
> as soon as it is sent around via e-mail as text-diff.
> 
> That's how I came to the binary diff as only working option.
> (or you just don't send patches around via e-mail at all ;-)

Mixed line-endings also aren't a problem with a base64-encoded mail body
as eg git send-email will automatically use.


There's one caveat specific to (ffmpeg's) patchwork
I didn't know about before:
https://patchwork.ffmpeg.org/project/ffmpeg/patch/20220116181655.6407-2-oneric@oneric.de/

If I download the “diff” from the above page, CRLF and LF line-endings are 
preserved as they should be. However, the files served from the “mbox” and 
“series” buttons only contain LF line-endings. As everyone can test, the mail 
I sent evidently does use mixed CRLF and LF line-endings matching the file 
being patched (just like the result of the “diff” button).
I believe this is an issue separate from patchwork using --no-keep-cr and 
should be fixed on patchwork's side.


Tl;dr:
Mixed line-endings in a diff are no technical issue,
provided the patch is sent correctly.
The issue is purely with patchwork, which:
 a) uses --no-keep-cr or similar presumably to protect against CRLF sneaking 
    into regular source files. (This is a minor annoyance when CRLF is 
    actually required, but otherwise harmless)
 b) mangles the line-endings for the “mbox” and “series” options but does the 
    correct thing for the “diff” option. This is harmful and imho should be 
    fixed by patchwork. Nonetheless, the patch can still be correctly applied 
    from the _mail_ on the list.

The patch on the list can easily be applied from the received mail by a plain 
`git am` or `git am --keep-cr` if the value of am.keepcr is set to false.
(In my case am.keepcr is not set at all and plain git am does the right thing.)
If someone wants to apply the patches and has trouble with getting the patch 
out of the mailclient I can on request resend the patch with a binary diff for 
the reference file.
Soft Works Feb. 2, 2022, 4:44 a.m. UTC | #6
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Oneric
> Sent: Wednesday, February 2, 2022 12:26 AM
> To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> On Tue, Feb 01, 2022 at 20:41:37 +0000, Soft Works wrote:
> > > On Tue, Feb 01, 2022 at 19:44:24 +0000, Soft Works wrote:
> > > > > On Sun, Jan 16, 2022 at 19:16:54 +0100, Oneric wrote:
> > > > >
> > > > > In case anyone is wondering why patchwork fails to apply the second
> patch,
> > > > > this is probably once again because the patch updates one of FATE's
> ASS
> > > > > reference files which use CRLF line-endings.
> > > > > Locally git am applies both without a hitch for me on top of current
> master
> > > > > (and FATE passes after applying each patch).
> > > >
> > > >
> > > > You can add a .gitattributes file to tests/ref/fate/ which includes the
> line
> > > >
> > > > sub-webvtt2 -diff
> > > >
> > > > Then your local git format-patch will create a binary diff for the
> file.
> > >
> > > Thanks for your suggestion. However, a binary diff would look like this
> which
> > > isn't great for seeing what's going on during review:
> > >
> > >   [...]
> >
> > Yes, I know, but the question is probably what's more important..
> >
> > ..that it can be applied or that the text is viewable?
> 
> It CAN be applied (as I've now written twice) and
> of course I verified this with the mail received from the list.

I meant that it can be applied by everybody, including Patchwork
and Michael. 

> If you tried to send a patch with your mail client by pasting in the diff,
> then yes, presumably the line-endings would indeed get mangled.

I use git format-patch, just like many others and afaik it can't create base64 
encoded content.

BTW. BASE64 doesn't seem to be widely used here, just checked the frequent 
committers among the recent submissions. just quickly checked Michael,
Andreas,Gyan,Marton,James,Anton,Derek - all of

> [..] That's why 
> doc/developer.texi tells you not to do this (instead recommending
> git send-email or binary (i.e. base64-encoded) attachments).

Interestingly, even the author of those lines is sending patches with 
Content-Transfer-Encoding: 7bit :-)


> Tl;dr:
> Mixed line-endings in a diff are no technical issue,
> provided the patch is sent correctly.^

LOL - so the majority on this ML is sending patches incorrectly?
I think, marking such files as binary is a bit more practical than 
expecting everybody to change ones working process. 

Best,
softworkz
Oneric Feb. 2, 2022, 5:03 p.m. UTC | #7
It appears my previous mail has been interpreted as some sort of attack
against you; this is not the case. I am sorry for creating such a
misunderstanding.
The mail was merely meant to clear up the misconception that it would be 
impossible to send mixed line-endings via mail due to some SMTP property
using some illustrative examples from the current thread.
(I have no idea how SMTP works in detail)

I'll respond to the new misunderstandings below in more detail if you're 
interested, but I see little point in continuing to discuss how mails are 
transferred or encoded.
Can we now start to discuss the patches themselves instead?

If you want to apply them and have trouble with getting the correct patch out
of your client (and don't want to use Patchwork's diff), let me repeat my 
previous offer:
> > If someone wants to apply the patches and has trouble with getting the patch
> > out of the mailclient I can on request resend the patch with a binary diff
> > for the reference file.

~~~~~~

On Wed, Feb 02, 2022 at 04:44:54 +0000, Soft Works wrote:
> > It CAN be applied (as I've now written twice) and
> > of course I verified this with the mail received from the list.
>
> I meant that it can be applied by everybody, including Patchwork
> and Michael.

Everybody but Patchwork can apply the mail and Patchwork's _diff_ works too.
In case any mail client acts up I already offered to resend the patch
with a binary flag for the reference file on request.

> I use git format-patch, just like many others and afaik it can't create base64
> encoded content.

It doesn't need to. When attaching a format-patch (as ffmpeg's documentation 
recommends when not using send-email) the _mail client_ is supposed to
choose and perform a suitable transfer encoding.
Depending on the mail client this can also work with the main message,
but some clients have trouble preserving everything in the main message.
(Because they assume the main body to be plain text and eg
 "normalise" line-ending, automatically break long lines, 
 remove or replace special characters, etc)

> BTW. BASE64 doesn't seem to be widely used here, [...]

Note, that I wrote send-email uses the appropriate transfer-encoding 
“automatically” not “always base64”; if there are no characters in the
message beyond US-ASCII (and mayhaps only LF line-endings), using base64
or quoted-printable is not necessary.
Whether or not mixed line-endings _require_ one of the latter two encodings
due to some SMTP limitation I do not know. It's certainly _possible_ to
preserve them this way though.
(And git send-email choose to use base64 for the second patch.)

> Interestingly, even the author of those lines is sending patches with
> Content-Transfer-Encoding: 7bit :-)

see previous.
(second patch contains bidi-marks and mixed line-endings, first neither)

> LOL - so the majority on this ML is sending patches incorrectly?

No, see second last.
Soft Works Feb. 2, 2022, 10:18 p.m. UTC | #8
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Oneric
> Sent: Wednesday, February 2, 2022 6:03 PM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> It appears my previous mail has been interpreted as some sort of
> attack
> against you; this is not the case. I am sorry for creating such a
> misunderstanding.

Don't know who said that, but I don't feel attacked. It's about technical
details and opinions about it. All good.

> The mail was merely meant to clear up the misconception that it would
> be
> impossible to send mixed line-endings via mail due to some SMTP
> property
> using some illustrative examples from the current thread.
> (I have no idea how SMTP works in detail)

Where did you see such misconception? I think almost everybody knows 
that this is doable over SMTP in some way, and there are in
fact many ways to do this, even when it would be as stupid like attaching
as zip archive.

It's never been a matter of SMTP. It's the GIT tooling in the context of 
"patch-over-SMTP" transport and the way it is using plain-ascii-text 
e-mails for this purpose. IMO, this is a design flaw or just a bug
when git-send-email and format-patch are creating output with transfer-encodings
7bit/8bit but mixed EOLs. 

I suppose this is due to the tendency of GIT to consider text file basically
as files with lines of text where the line endings are exchangeable and
rather a platform detail than part of the text file's content.
And text files with mixed line endings are an accidental side effect of 
multi-platform work and the endings can be re-unified at any time.
That's also the reason why format-patch cannot warn, because bazillions 
of code files exist which have mixed EOLs (accidentally and exchangeable).

Yes, yes yes - don't need to respond that GIT can handle those cases 
flawlessly and precisely when configured appropriately (and when not
using the e-mail features). That goes without saying. 
I'm just speculating about history, why it might have happened that
it doesn't work flawlessly when using e-mail.

Via e-mail, it only works when using base64 encoding like you said, 
but this had only been added at a later time and wasn't part of the
original functionality.

Further - I could find no evidence for your claims that base64 encoding
would be chosen automatically depending on the content, neither is it
on by default.
There's an 'auto' mode which chooses between 7bit, 8bit and quoted-printable,
but not base64, which always needs to be specified explicitly (or set in
.gitconfig).

This is the way it is documented and I'm seeing exactly that documented
behavior on both Linux and Windows when testing with a mocked smtp server.


You had contradicted my statement that it would be desirable that format-patch
would be able to do base64 encoding as well, saying that this would be the 
task of a mail-client. This is not necessarily the case, though.
format-patch can generate patch e-mails with multipart-mime content, where 
the patch content is created as an attachment mime part. format-patch does
that with a transfer-encoding of 8bit, but it can't do it with base64 encoding.


(I'm stripping the rest of your reply as it is all covered above already)


Regards,
softworkz
Soft Works Feb. 2, 2022, 10:44 p.m. UTC | #9
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Soft
> Works
> Sent: Wednesday, February 2, 2022 11:19 PM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> > Oneric
> > Sent: Wednesday, February 2, 2022 6:03 PM
> > To: FFmpeg development discussions and patches <ffmpeg-
> > devel@ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> fix
> > handling of backslashes
> >
> > It appears my previous mail has been interpreted as some sort of
> > attack
> > against you; this is not the case. I am sorry for creating such a
> > misunderstanding.
> 
> Don't know who said that, but I don't feel attacked. It's about
> technical
> details and opinions about it. All good.
> 
> > The mail was merely meant to clear up the misconception that it
> would
> > be
> > impossible to send mixed line-endings via mail due to some SMTP
> > property
> > using some illustrative examples from the current thread.
> > (I have no idea how SMTP works in detail)
> 
> Where did you see such misconception? I think almost everybody knows
> that this is doable over SMTP in some way, and there are in
> fact many ways to do this, even when it would be as stupid like
> attaching
> as zip archive.
> 
> It's never been a matter of SMTP. It's the GIT tooling in the context
> of
> "patch-over-SMTP" transport and the way it is using plain-ascii-text
> e-mails for this purpose. IMO, this is a design flaw or just a bug
> when git-send-email and format-patch are creating output with
> transfer-encodings
> 7bit/8bit but mixed EOLs.
> 
> I suppose this is due to the tendency of GIT to consider text file
> basically
> as files with lines of text where the line endings are exchangeable
> and
> rather a platform detail than part of the text file's content.
> And text files with mixed line endings are an accidental side effect
> of
> multi-platform work and the endings can be re-unified at any time.
> That's also the reason why format-patch cannot warn, because
> bazillions
> of code files exist which have mixed EOLs (accidentally and
> exchangeable).
> 
> Yes, yes yes - don't need to respond that GIT can handle those cases
> flawlessly and precisely when configured appropriately (and when not
> using the e-mail features). That goes without saying.
> I'm just speculating about history, why it might have happened that
> it doesn't work flawlessly when using e-mail.
> 
> Via e-mail, it only works when using base64 encoding like you said,
> but this had only been added at a later time and wasn't part of the
> original functionality.

Haha, I almost forgot to get to my own point..

From that situation above, I would conclude that at least at an earlier
time, the git developers (when asked about a case like the ffmpeg ref
files) might have argued that such files with mixed EOL where the 
individual EOLs also form a relevant part of the 
content - are not meant to be handled as text, but as binary
files (either) or at to use binary diffs when being send via e-mail.
.

sw
Oneric Feb. 3, 2022, 2:11 a.m. UTC | #10
On Wed, Feb 02, 2022 at 22:44:18 +0000, Soft Works wrote:
> > Sent: Wednesday, February 2, 2022 11:19 PM
> > To: FFmpeg development discussions and patches <ffmpeg-
> > devel@ffmpeg.org>
> > 
> >  [claims about git and smtp: part 1]
> > 
> 
> [claims about git and smtp: part 2]
> 

None of this relates to the topic of the patches, which
is backslash-escaping in ASS and converting BIDI-marks
from WebVTT.
As written before, I see no point in advancing this offtopic talk;
please actually discuss the content of the patches in this thread.
If you are convinced that the reference files should be treated as binary, 
then submit a suitable gitattributes patch to the list and discuss it 
with whom it may interest.

I’ll ignore all further offtopic messages here.
Soft Works Feb. 3, 2022, 8:51 p.m. UTC | #11
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Oneric
> Sent: Thursday, February 3, 2022 3:11 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> On Wed, Feb 02, 2022 at 22:44:18 +0000, Soft Works wrote:
> > > Sent: Wednesday, February 2, 2022 11:19 PM
> > > To: FFmpeg development discussions and patches <ffmpeg-
> > > devel@ffmpeg.org>
> > >
> > >  [claims about git and smtp: part 1]
> > >
> >
> > [claims about git and smtp: part 2]
> >

> I’ll ignore all further offtopic messages here.

Aha, suddenly it's off-topic, but ok.

> please actually discuss the content of the patches in this thread.

I think when you inject that word-joiner as a workaround for ass
parsing, you'll also need to make sure that it gets removed
when encoding to other formats.

softworkz
Oneric Feb. 4, 2022, 1:01 a.m. UTC | #12
On Thu, Feb 03, 2022 at 20:51:16 +0000, Soft Works wrote:
> I think when you inject that word-joiner as a workaround for ass
> parsing, you'll also need to make sure that it gets removed
> when encoding to other formats.

There's no way of knowing whether the word-joiner comes from
a conversion performed by ffmpeg in the past or already existed
in the original source.
However, the wordjoiner does not alter the visually appearance and
is unlikely to change line-breaking properties; that's why I chose
a word-joiner. Therefore I don't think removing (only) the inserted
word-joiners is possible, but also not necessary.

Also, was the wrong \\ recollapsed into a single \ somehwere? If yes,
then this code can be dropped as well (but I don't know where it is).

My biggest concern with this is the unconditional assumption of UTF-8 in
ff_ass_bprint_text_event? Is there a way to improve on this or does ffmpeg 
convert all text to UTF-8 before it's passed to this function?
Andreas Rheinhardt Feb. 4, 2022, 1:30 a.m. UTC | #13
Oneric:
> On Thu, Feb 03, 2022 at 20:51:16 +0000, Soft Works wrote:
>> I think when you inject that word-joiner as a workaround for ass
>> parsing, you'll also need to make sure that it gets removed
>> when encoding to other formats.
> 
> There's no way of knowing whether the word-joiner comes from
> a conversion performed by ffmpeg in the past or already existed
> in the original source.
> However, the wordjoiner does not alter the visually appearance and
> is unlikely to change line-breaking properties; that's why I chose
> a word-joiner. Therefore I don't think removing (only) the inserted
> word-joiners is possible, but also not necessary.
> 
> Also, was the wrong \\ recollapsed into a single \ somehwere? If yes,
> then this code can be dropped as well (but I don't know where it is).
> 
> My biggest concern with this is the unconditional assumption of UTF-8 in
> ff_ass_bprint_text_event? Is there a way to improve on this or does ffmpeg 
> convert all text to UTF-8 before it's passed to this function?

All text-based subtitles are supposed to be UTF-8 when they reach the
decoder; if it isn't, the user has to set the appropriate -sub_charenc
and -sub_charenc_mode.

- Andreas
Soft Works Feb. 4, 2022, 1:57 a.m. UTC | #14
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Oneric
> Sent: Friday, February 4, 2022 2:01 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> On Thu, Feb 03, 2022 at 20:51:16 +0000, Soft Works wrote:
> > I think when you inject that word-joiner as a workaround for ass
> > parsing, you'll also need to make sure that it gets removed
> > when encoding to other formats.
> 
> There's no way of knowing whether the word-joiner comes from
> a conversion performed by ffmpeg in the past or already existed
> in the original source.

That might be true, but I think it's valid to say that such characters
are very unusual "original" subtitle sources and that's why I don't
think it's a good idea for ffmpeg to start injecting them.

Subtitle implementations are often rather minimal, especially in
hardware devices and might not always cover the full range of 
UTF-8 specifics.

> However, the wordjoiner does not alter the visually appearance and
> is unlikely to change line-breaking properties; that's why I chose
> a word-joiner. Therefore I don't think removing (only) the inserted
> word-joiners is possible,

Why not? As it seems to be required for ASS encoding only, all other
output formats should remain unaffected. 

> but also not necessary.

I'm not sure whether all ffmpeg text-sub encoders can handle 
those chars - which could be verified of course.

But what remains is the question about the effect on end devices
which are consuming that output.

Finally, those chars are a pest. I'm using them myself for a 
specific use case, but when you don't know they are there, it can
drive you totally mad, eventually even thinking your system or
software is faulty.

Example: 

Open your patch file [2/2] and search for the string
"123456\NAscending". You can see the string in two lines, but search
will only find one of them.

Or just look at the two lines directly. They are preceded by + and -
even though both appear identical. 


So, this also needs consideration of the consequences, like how 
many developers (inside and outside of ffmpeg) this would be driving
nuts over the years and make them start hating ffmpeg for doing so 
once they've found out.

softworkz
Soft Works Feb. 4, 2022, 5:34 a.m. UTC | #15
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Soft
> Works
> Sent: Friday, February 4, 2022 2:58 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> > Oneric
> > Sent: Friday, February 4, 2022 2:01 AM
> > To: FFmpeg development discussions and patches <ffmpeg-
> > devel@ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> fix
> > handling of backslashes
> >
> > On Thu, Feb 03, 2022 at 20:51:16 +0000, Soft Works wrote:
> > > I think when you inject that word-joiner as a workaround for ass
> > > parsing, you'll also need to make sure that it gets removed
> > > when encoding to other formats.
> >
> > There's no way of knowing whether the word-joiner comes from
> > a conversion performed by ffmpeg in the past or already existed
> > in the original source.
> 
> That might be true, but I think it's valid to say that such characters
> are very unusual "original" subtitle sources and that's why I don't
> think it's a good idea for ffmpeg to start injecting them.
> 
> Subtitle implementations are often rather minimal, especially in
> hardware devices and might not always cover the full range of
> UTF-8 specifics.
> 
> > However, the wordjoiner does not alter the visually appearance and
> > is unlikely to change line-breaking properties; that's why I chose
> > a word-joiner. Therefore I don't think removing (only) the inserted
> > word-joiners is possible,
> 
> Why not? As it seems to be required for ASS encoding only, all other
> output formats should remain unaffected.
> 
> > but also not necessary.
> 
> I'm not sure whether all ffmpeg text-sub encoders can handle
> those chars - which could be verified of course.
> 
> But what remains is the question about the effect on end devices
> which are consuming that output.
> 
> Finally, those chars are a pest. I'm using them myself for a
> specific use case, but when you don't know they are there, it can
> drive you totally mad, eventually even thinking your system or
> software is faulty.
> 
> Example:
> 
> Open your patch file [2/2] and search for the string
> "123456\NAscending". You can see the string in two lines, but search
> will only find one of them.
> 
> Or just look at the two lines directly. They are preceded by + and -
> even though both appear identical.
> 
> 
> So, this also needs consideration of the consequences, like how
> many developers (inside and outside of ffmpeg) this would be driving
> nuts over the years and make them start hating ffmpeg for doing so
> once they've found out.

As I really hate how many devs on this ML keep saying 'no' to submitted
code without having a better suggestion, assuming that this is all that
it takes, I don't want to assimilate in this regard.

Hence I want to propose the following solution:

First of all, the existing code in ff_ass_bprint_text_event() is totally
wrong already. Not only with regard to the backslash escaping (as you 
have already pointed out), but also the curly brace escaping is invalid.
There is no curly-brace escaping in ASS either. 

In fact it is impossible with ASS to display an opening curly brace followed
by a closing curly brace at a subsequent position (each one alone may work
depending on implementation).

If it was about ASS alone, we might just drop those braces, so we could
at least avoid the text in-between from being hidden (when outputting 
ASS), but ASS is also the internal ("uncompressed/raw") subtitle format
in ffmpeg that is used for conversion (and subtitle filtering).
So it would be hard-to-sell when curly braces would get lost when 
converting from one text-sub format to another with none of them 
even being ASS.

What we need is to stop creating invalid ASS and at the same time 
ensure proper conversion of curly braces. How? We substitute them!

And still, UTF-8 can come to the rescue. There are two suitable 
candidates for that:

SMALL LEFT CURLY BRACKET (U+FE5B, Ps): ﹛
SMALL RIGHT CURLY BRACKET (U+FE5C, Pe): ﹜
FULLWIDTH LEFT CURLY BRACKET (U+FF5B, Ps): {
FULLWIDTH RIGHT CURLY BRACKET (U+FF5D, Pe): }

Substitution of curly braces with one of those will prevent ASS from treating
any possible subtitle content as override code.

What remains to be handled now is the backslash case. Now that we can be sure
that we are never inside a sequence that ASS would consider an override code,
only 3 cases are remaining where the backslash has a meaning in ASS dialog 
text:  '\n', '\N' and '\h'.

We can simply escape those sequences by inserting a (no-op) override code
between the backslash and the char. Suitable for this is: {\r}
This code resets inline styles, but since we are coming from plain text subs
in ff_ass_bprint_text_event(), we know that we don't have any inline styles
and it's a no-op to reset the style.

Needless to say that we will of course change the substituted curly braces
back to the regular ones at the encoding side for all but ASS.
Remains the question what to do when encoding to ASS: We can either 
keep the alternate brace characters or just remove them (or maybe replace
with square brackets). 

I'm not sure about that last point, but in total, this will be a clean solution
without injecting any weird chars into the subtitle output, and it will fix
multiple incorrect behaviors in the current implementation.

Regards,
softworkz
Soft Works Feb. 4, 2022, 5:59 a.m. UTC | #16
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Soft
> Works
> Sent: Friday, February 4, 2022 6:34 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Soft
> > Works
> > Sent: Friday, February 4, 2022 2:58 AM
> > To: FFmpeg development discussions and patches <ffmpeg-
> > devel@ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> fix
> > handling of backslashes
> >
> >
> >
> > > -----Original Message-----
> > > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> > > Oneric
> > > Sent: Friday, February 4, 2022 2:01 AM
> > > To: FFmpeg development discussions and patches <ffmpeg-
> > > devel@ffmpeg.org>
> > > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> > fix
> > > handling of backslashes
> > >
> > > On Thu, Feb 03, 2022 at 20:51:16 +0000, Soft Works wrote:

[..]

> Needless to say that we will of course change the substituted curly
> braces
> back to the regular ones at the encoding side for all but ASS.
> Remains the question what to do when encoding to ASS: We can either
> keep the alternate brace characters or just remove them (or maybe
> replace
> with square brackets).
> 
> I'm not sure about that last point

The reason why I'm saying this is because we can't be sure that the
font being used at the target would include glyphs for those alternate
curly braces, and when it doesn't it would typically cause ugly 
replacement glyphs (usually a square outline) to be rendered.


But to set the record straight: we're not talking about a degradation
to current here. Right now this doesn't work at all (text inside 
curly braces being invisible)

sw
Soft Works Feb. 4, 2022, 6:48 a.m. UTC | #17
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Soft
> Works
> Sent: Friday, February 4, 2022 6:34 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Soft
> > Works
> > Sent: Friday, February 4, 2022 2:58 AM
> > To: FFmpeg development discussions and patches <ffmpeg-
> > devel@ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> fix
> > handling of backslashes
> >
> >
> >
> > > -----Original Message-----
> > > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> > > Oneric
> > > Sent: Friday, February 4, 2022 2:01 AM
> > > To: FFmpeg development discussions and patches <ffmpeg-
> > > devel@ffmpeg.org>
> > > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> > fix
> > > handling of backslashes
> > >
> > > On Thu, Feb 03, 2022 at 20:51:16 +0000, Soft Works wrote:
> > > > I think when you inject that word-joiner as a workaround for ass
> > > > parsing, you'll also need to make sure that it gets removed
> > > > when encoding to other formats.
> > >
> > > There's no way of knowing whether the word-joiner comes from
> > > a conversion performed by ffmpeg in the past or already existed
> > > in the original source.
> >
> > That might be true, but I think it's valid to say that such
> characters
> > are very unusual "original" subtitle sources and that's why I don't
> > think it's a good idea for ffmpeg to start injecting them.
> >
> > Subtitle implementations are often rather minimal, especially in
> > hardware devices and might not always cover the full range of
> > UTF-8 specifics.
> >
> > > However, the wordjoiner does not alter the visually appearance and
> > > is unlikely to change line-breaking properties; that's why I chose
> > > a word-joiner. Therefore I don't think removing (only) the
> inserted
> > > word-joiners is possible,
> >
> > Why not? As it seems to be required for ASS encoding only, all other
> > output formats should remain unaffected.
> >
> > > but also not necessary.
> >
> > I'm not sure whether all ffmpeg text-sub encoders can handle
> > those chars - which could be verified of course.
> >
> > But what remains is the question about the effect on end devices
> > which are consuming that output.
> >
> > Finally, those chars are a pest. I'm using them myself for a
> > specific use case, but when you don't know they are there, it can
> > drive you totally mad, eventually even thinking your system or
> > software is faulty.
> >
> > Example:
> >
> > Open your patch file [2/2] and search for the string
> > "123456\NAscending". You can see the string in two lines, but search
> > will only find one of them.
> >
> > Or just look at the two lines directly. They are preceded by + and -
> > even though both appear identical.
> >
> >
> > So, this also needs consideration of the consequences, like how
> > many developers (inside and outside of ffmpeg) this would be driving
> > nuts over the years and make them start hating ffmpeg for doing so
> > once they've found out.
> 
> As I really hate how many devs on this ML keep saying 'no' to
> submitted
> code without having a better suggestion, assuming that this is all
> that
> it takes, I don't want to assimilate in this regard.
> 
> Hence I want to propose the following solution:
> 
> First of all, the existing code in ff_ass_bprint_text_event() is
> totally
> wrong already. Not only with regard to the backslash escaping (as you
> have already pointed out), but also the curly brace escaping is
> invalid.
> There is no curly-brace escaping in ASS either.
> 
> In fact it is impossible with ASS to display an opening curly brace
> followed
> by a closing curly brace at a subsequent position (each one alone may
> work
> depending on implementation).
> 
> If it was about ASS alone, we might just drop those braces, so we
> could
> at least avoid the text in-between from being hidden (when outputting
> ASS), but ASS is also the internal ("uncompressed/raw") subtitle
> format
> in ffmpeg that is used for conversion (and subtitle filtering).
> So it would be hard-to-sell when curly braces would get lost when
> converting from one text-sub format to another with none of them
> even being ASS.
> 
> What we need is to stop creating invalid ASS and at the same time
> ensure proper conversion of curly braces. How? We substitute them!
> 
> And still, UTF-8 can come to the rescue. There are two suitable
> candidates for that:
> 
> SMALL LEFT CURLY BRACKET (U+FE5B, Ps): ﹛
> SMALL RIGHT CURLY BRACKET (U+FE5C, Pe): ﹜
> FULLWIDTH LEFT CURLY BRACKET (U+FF5B, Ps): {
> FULLWIDTH RIGHT CURLY BRACKET (U+FF5D, Pe): }
> 
> Substitution of curly braces with one of those will prevent ASS from
> treating
> any possible subtitle content as override code.
> 
> What remains to be handled now is the backslash case. Now that we can
> be sure
> that we are never inside a sequence that ASS would consider an
> override code,
> only 3 cases are remaining where the backslash has a meaning in ASS
> dialog
> text:  '\n', '\N' and '\h'.
> 
> We can simply escape those sequences by inserting a (no-op) override
> code
> between the backslash and the char. Suitable for this is: {\r}
> This code resets inline styles, but since we are coming from plain
> text subs
> in ff_ass_bprint_text_event(), we know that we don't have any inline
> styles
> and it's a no-op to reset the style.
> 
> Needless to say that we will of course change the substituted curly
> braces
> back to the regular ones at the encoding side for all but ASS.
> Remains the question what to do when encoding to ASS: We can either
> keep the alternate brace characters or just remove them (or maybe
> replace
> with square brackets).
> 
> I'm not sure about that last point, but in total, this will be a clean
> solution
> without injecting any weird chars into the subtitle output, and it
> will fix
> multiple incorrect behaviors in the current implementation.


I've found out where the \{ and \} escaping has come from: libass
They decided at some time to introduce this kind of escaping which
is actually incompatible with normal ASS syntax and libass specific:
https://github.com/libass/libass/issues/194#issuecomment-352213210


This doesn't mean though, that the ffmpeg internal ASS format needs
to follow the libass route in this regard. It only matters for the
libass output encoder, because \{\r}N is broken by that libass
decision, so for this case, we'll need a different way.

sw
Oneric Feb. 4, 2022, 9:19 p.m. UTC | #18
On Fri, Feb 04, 2022 at 06:48:40 +0000, Soft Works wrote:
> > [two-part message removed for brevity]
>
> I've found out where the \{ and \} escaping has come from: libass

As already written in the commit-message of the first patch..


You already noticed your proposal only works with VSFilters,
but even without this it's a terrible approach. Consider:
 - fullwidth characters have different metrics then the "regular" ones
 - fullwidth and small characters have a different visual appearance
 - support for fullwidth and small characters in fonts is much rarer
   than support for plain {}
 - fullwidth characters are commonly used _as fullwidth characters_
   e.g. in text using one of the CJK writing systems.
   Replacing them with non-fullwidth variants when transforming
   away from ASS is guaranteed to be disastrous.
 - Not sure if applies, but something to keep in mind:
   {\r} is not a noop if the source-format had any sort of per-event
   styling which got prepended to the ASS event text before
   using the plain-text conversion for the rest of the event.

As noted in the discussion of the libass issue you linked,
it’s not unusual for ASS subtitle authors to employ
fullwidth curly braces for displaying curly braces
in all ASS-renderers. However, they have tight control over the
fonts used and can carefully select them to match the visual
appearance and compensate differing metrics with bespoke
local adjustments to \fs and negative \fsp.
ffmpeg does not have tight control over the fonts and it'd be silly
to require users to pass in special fonts just to render curly braces.

If you want to make the rendering in VSFilters not complettly broken,
try to do what the libass-wiki recommends and add an empty command
block after an escaped opening curly brace. This way VSFilters
will display a lone backslash instead of a opening curly brace
but otherwise work fine without swallowing up text.
If done consistently closing curly braces won't need to be
escaped at all anymore.
However, such a VSFilter-compatibility change is unrelated to
fixing the broken \\ escape which doesn't work anywhere.

(Note that the wordjoiner doesn't have font or spacing issues as
 it’s defined to be invisible and zero-width.
 If a user supplies a special font like FontoXMLWhitespace¹
 showing the word-joiner that's not a problem anymore but a
 deliberate choice)

[1]: https://documentation.fontoxml.com/latest/whitespace-visualization-5b5c6b154c78#id-18b2211e-79cb-4b85-8390-1e54d0f45466
Soft Works Feb. 4, 2022, 10:23 p.m. UTC | #19
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Oneric
> Sent: Friday, February 4, 2022 10:19 PM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> On Fri, Feb 04, 2022 at 06:48:40 +0000, Soft Works wrote:
> > > [two-part message removed for brevity]
> >
> > I've found out where the \{ and \} escaping has come from: libass
> 
> As already written in the commit-message of the first patch..
> 
> 
> You already noticed your proposal only works with VSFilters,
> but even without this it's a terrible approach. Consider:
>  - fullwidth characters have different metrics then the "regular" ones
>  - fullwidth and small characters have a different visual appearance
>  - support for fullwidth and small characters in fonts is much rarer
>    than support for plain {}
>  - fullwidth characters are commonly used _as fullwidth characters_
>    e.g. in text using one of the CJK writing systems.
>    Replacing them with non-fullwidth variants when transforming
>    away from ASS is guaranteed to be disastrous.
>  - Not sure if applies, but something to keep in mind:
>    {\r} is not a noop if the source-format had any sort of per-event
>    styling which got prepended to the ASS event text before
>    using the plain-text conversion for the rest of the event.

     No, this doesn't apply from what I've seen, but {} might
     still be preferable.

> As noted in the discussion of the libass issue you linked,
> it’s not unusual for ASS subtitle authors to employ
> fullwidth curly braces for displaying curly braces
> in all ASS-renderers. However, they have tight control over the
> fonts used and can carefully select them to match the visual
> appearance and compensate differing metrics with bespoke
> local adjustments to \fs and negative \fsp.
> ffmpeg does not have tight control over the fonts and it'd be silly
> to require users to pass in special fonts just to render curly braces.

I basically agree to everything you say - except that there's a 
little misunderstanding. Maybe I haven't explained well enough.

We use ASS as the internal format to which all text subtitles are 
decoded and from which all text subtitles are encoded for output
and for the upcoming subtitle filtering it's also the one and only
format for text subtitles.

BUT: That does not necessarily mean that the internally used
ASS must be exactly the same that we're outputting when encoding
to ASS. And that's why we need to consider this as two separate
steps. It would also be possible to have options at the ass encoder
to control the compatibility of the encoded ASS output.

That internal ASS format already has some quirks that some had 
introduced in order to achieve certain things when other subtitle
formats are involved at the input and at the output. That's why
we should not continue adding one workaround on top of another
but try to clean those things up instead.

With your submission, you are actually pointing at a core point
of evil: the escaping of braces in combination with the backslash
logic introduces an unresolvable ambiguity. And when we don't
clean that up, it won't be possible to get on a sane path.


> If you want to make the rendering in VSFilters not complettly broken,
> try to do what the libass-wiki recommends and add an empty command
> block after an escaped opening curly brace. This way VSFilters
> will display a lone backslash instead of a opening curly brace
> but otherwise work fine without swallowing up text.
> If done consistently closing curly braces won't need to be
> escaped at all anymore.
> However, such a VSFilter-compatibility change is unrelated to
> fixing the broken \\ escape which doesn't work anywhere.

see above, I'm not into changing ffmpeg's ass output, it's all
about the internally used ass format and the escaping is  
a central problem there.

> (Note that the wordjoiner doesn't have font or spacing issues as
>  it’s defined to be invisible and zero-width.

Yes, and the use of this has already created issues, even in libass:
https://github.com/libass/libass/issues/507

So it's very likely to cause issues in other implementations 
as well, and not many are developed as actively as libass.
(and even that doesn't help when you don't get an update 
for your device/software).

I wouldn't want to close like that, but I'm getting distracted
right now. Will get back later.

softworkz
Soft Works Feb. 4, 2022, 11:24 p.m. UTC | #20
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Oneric
> Sent: Friday, February 4, 2022 10:52 PM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> On Fri, Feb 04, 2022 at 02:30:37 +0100, Andreas Rheinhardt wrote:
> > All text-based subtitles are supposed to be UTF-8 when they reach
> the
> > decoder; if it isn't, the user has to set the appropriate -
> sub_charenc
> > and -sub_charenc_mode.
> >
> > - Andreas
> 
> Thanks for the info! Then at least the UTF-8 assumption
> is no problem after all.
> 
> 

[..]

> >
> > I'm not sure whether all ffmpeg text-sub encoders can handle
> > those chars - which could be verified of course.
> 
> Since it's in the BMP and ffmpeg already seems happy to assume some
> UTF-8
> support by converting everything to it, I'm not worried about this
> until
> proven wrong.

Proven wrong: https://github.com/libass/libass/issues/507


> > Finally, those chars are a pest. I'm using them myself for a
> > specific use case, but when you don't know they are there, it can
> > drive you totally mad, eventually even thinking your system or
> > software is faulty.
> >
> > Example:
> >
> > Open your patch file [2/2] and search for the string
> > "123456\NAscending". You can see the string in two lines, but search
> > will only find one of them.
> >
> > Or just look at the two lines directly. They are preceded by + and -
> > even though both appear identical.
> 
> Actually, I see this with helpful colouring lost here:
> 
>   -Dialogue: 0,0:00:55.00,0:01:00.00,Default,,0,0,0,,Descending:
> 123456\NAscending: 123456^M
>   +Dialogue: 0,0:00:55.00,0:01:00.00,Default,,0,0,0,,Descending:
> <200f>123456<200e>\NAscending: 123456^M

I didn't say you won't be a able to find a viewer that can display 
them. :-)


> More plain-text oriented editors likely won't show them though, yes.

Yes => pest


> > That might be true, but I think it's valid to say that such
> characters
> > are very unusual "original" subtitle sources and that's why I don't
> > think it's a good idea for ffmpeg to start injecting them.
> 
> Don't underestimate what subtitle authors can come up with :)

Sure. But a subtitle author is responsible for their authored
subtitles while ffmpeg is responsible for encoding of large
part of the world's subtitles.


And from that same perspective I find the relation of this 
proposal somewhat insane:

You want to "pollute" gazillions of subtitle streams in the 
world from multiple subtitle formats with invisible 
characters in order to solve an escaping problem in ffmpeg?

softworkz
Oneric Feb. 5, 2022, 1:20 a.m. UTC | #21
On Fri, Feb 04, 2022 at 23:24:58 +0000, Soft Works wrote:
> You want to "pollute" gazillions of subtitle streams in the
> world from multiple subtitle formats with invisible
> characters in order to solve an escaping problem in ffmpeg?

I do not consider using characters that are explicitly recommended to be
used by Unicode to be “polluting”. Further consider that as mentioned
invisible characters in ASS are not uncommon anyway already and conversion
from ASS to something else are rare due to being generally lossy. Lossy 
with regards to typesetting that is, removing breaking hints in form of
plain Unicode characters would be a new form of lossyness.

> [From the other mail:]
> I'm not into changing ffmpeg's ass output, it's all
> about the internally used ass format and the escaping is
> a central problem there.

I’m not interested in reworking ffmpeg’s internal subtitle handling.
The proposed patch is a clear improvement over the status quo which
is plain incorrect. Within reasonable effort and sound arguments for
it adjustments to the patch can be made; reworking ffmpeg internals is
imo not “reasonable” effort to correct an uncontestedly wrong escape.

You have two options:
Either finally tell me what I asked about:
where (as in which file and function) removing wordjoiners should
even happen and where possible lingering “\\ → \” conversions presumably
are and if it’s simple enough I can add a removal accompanied by a comment
pointing out that this can go wrong.
Or go ahead and create your own patch.

~~~~~~

> > > I'm not sure whether all ffmpeg text-sub encoders can handle
> > > those chars - which could be verified of course.
> >
> > Since it's in the BMP and ffmpeg already seems happy to assume some
> > UTF-8
> > support by converting everything to it, I'm not worried about this
> > until
> > proven wrong.
>
> Proven wrong: https://github.com/libass/libass/issues/507

This issue is not at all wordjoiner specific despite the name.
As far as I recall this never lead to wrong rendering.
With HarfBuzz, the only fully featured shaping backend of libass,
control characters were and are handled by HarfBuzz.
And even with FriBiDi U+2060 was ignored since long before (2012)
the linked issue was opened.

What that issue really is about is a combination of two more general
issues. libass is currently not caching failure to lookup a glyph leading
to multiple messages and at worst a perf degradation if no font on the
font pool contained a glyph for a particular glyph. And the realisation
that libass’ font-fallback strategy is not ideal for prefix-type control
characters, characters which visibly affect both neighbours and a few
others.
The word-joiner is only highlighted here as due to its usage as an
backslash escape its commonly passed to libass and a high enough
percentage of fonts doesn’t contain it to create reports about it.


For further reference: U+2060 was added in Unicode 3.2 released 2002.
If you want to strip it because it might not render correctly you should
also strip most emoji, the uppercase eszett ẞ and several actively 
used writing systems in their entirety.
Soft Works Feb. 5, 2022, 2:08 a.m. UTC | #22
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Oneric
> Sent: Saturday, February 5, 2022 2:20 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> On Fri, Feb 04, 2022 at 23:24:58 +0000, Soft Works wrote:
> > You want to "pollute" gazillions of subtitle streams in the
> > world from multiple subtitle formats with invisible
> > characters in order to solve an escaping problem in ffmpeg?
> 
> I do not consider using characters that are explicitly recommended to
> be
> used by Unicode to be “polluting”. Further consider that as mentioned
> invisible characters in ASS are not uncommon anyway already and
> conversion
> from ASS to something else are rare due to being generally lossy.
> Lossy
> with regards to typesetting that is, removing breaking hints in form
> of
> plain Unicode characters would be a new form of lossyness.
> 
> > [From the other mail:]
> > I'm not into changing ffmpeg's ass output, it's all
> > about the internally used ass format and the escaping is
> > a central problem there.
> 
> I’m not interested in reworking ffmpeg’s internal subtitle handling.
> The proposed patch is a clear improvement over the status quo which
> is plain incorrect. Within reasonable effort and sound arguments for
> it adjustments to the patch can be made; reworking ffmpeg internals is
> imo not “reasonable” effort to correct an uncontestedly wrong escape.
> 
> You have two options:
> Either finally tell me what I asked about:
> where (as in which file and function) removing wordjoiners should
> even happen and where possible lingering “\\ → \” conversions
> presumably
> are and if it’s simple enough I can add a removal accompanied by a
> comment
> pointing out that this can go wrong.
> Or go ahead and create your own patch.
> 
> ~~~~~~
> 
> > > > I'm not sure whether all ffmpeg text-sub encoders can handle
> > > > those chars - which could be verified of course.
> > >
> > > Since it's in the BMP and ffmpeg already seems happy to assume
> some
> > > UTF-8
> > > support by converting everything to it, I'm not worried about this
> > > until
> > > proven wrong.
> >
> > Proven wrong: https://github.com/libass/libass/issues/507
> 
> This issue is not at all wordjoiner specific despite the name.
> As far as I recall this never lead to wrong rendering.
> With HarfBuzz, the only fully featured shaping backend of libass,
> control characters were and are handled by HarfBuzz.
> And even with FriBiDi U+2060 was ignored since long before (2012)
> the linked issue was opened.
> 
> What that issue really is about is a combination of two more general
> issues. libass is currently not caching failure to lookup a glyph
> leading
> to multiple messages and at worst a perf degradation if no font on the
> font pool contained a glyph for a particular glyph. And the
> realisation
> that libass’ font-fallback strategy is not ideal for prefix-type
> control
> characters, characters which visibly affect both neighbours and a few
> others.
> The word-joiner is only highlighted here as due to its usage as an
> backslash escape its commonly passed to libass and a high enough
> percentage of fonts doesn’t contain it to create reports about it.
> 
> 
> For further reference: U+2060 was added in Unicode 3.2 released 2002.
> If you want to strip it because it might not render correctly you
> should
> also strip most emoji, the uppercase eszett ẞ and several actively
> used writing systems in their entirety.


Let's try to approach this from a different side. Which case is 
your [1/2] commit actually supposed to fix? 
How did you test your patch?
Can we please go over an example?

Thanks,
sw
Oneric Feb. 5, 2022, 9:59 p.m. UTC | #23
On Sat, Feb 05, 2022 at 02:08:48 +0000, Soft Works wrote:
> Let's try to approach this from a different side. Which case is
> your [1/2] commit actually supposed to fix?

Escape backslashes when converting from WebVTT to not accidentally
introduce active ASS sequences and replace the wrong backslash-escape
in ff_ass_bprint_text_event with an escape that actually works.

> How did you test your patch?
> Can we please go over an example?

Take a look at the attached WebVTT file.
We expect the second event to be rendered like this,
as from WebVTT’s point of view it’s just normal text:

  our final \h approach \N into \ Coruscant.

What we currently get after conversion to ASS is like this though
(pay attention to the number of spaces):

  our final   approach
  into \ Coruscant.

If we escape \ as \\ like ff_ass_bprint_text_event currently does,
we instead get the following rendering which is also wrong:

  our final \  approach \
  into \\ Coruscant.

If instead the word-joiner is appended as in my patch, the
visual output matches the expectation (mail does not contain U+2060):

  our final \h approach \N into \ Coruscant.


You can confirm this for yourself by feeding the original WebVTT
to a native WebVTT renderer, e.g. https://zcorpan.github.io/live-webvtt-viewer/
and using ./ffmpeg -i test2.vtt -f ass tmp_conv.ass on
master, master + my patch and master + a modified patch to use \\
then watching how the converted file gets rendered by mpv, VLC or so.

I was not able to create a sample using ff_ass_bprint_text_event
as the only users seem to be teletext and some rawtext(?) subtitles
and I don't know how to create such files.
However, as nothing special happened to the \\ for WebVTT after this
sequence was inserted for the internal representation, and since
the preceding comment (wrongly) declares \\ to be an ASS rather than an
internal escape, it seems highly unlikely to behave any different.
You are very welcome to provide a sample using the
ff_ass_bprint_text_event codepath and keep_ass_markup=false
to verify or debunk this.
Soft Works Feb. 6, 2022, 1:08 a.m. UTC | #24
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> Oneric
> Sent: Saturday, February 5, 2022 11:00 PM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> On Sat, Feb 05, 2022 at 02:08:48 +0000, Soft Works wrote:
> > Let's try to approach this from a different side. Which case is
> > your [1/2] commit actually supposed to fix?
> 
> Escape backslashes when converting from WebVTT to not accidentally
> introduce active ASS sequences and replace the wrong backslash-escape
> in ff_ass_bprint_text_event with an escape that actually works.
> 
> > How did you test your patch?
> > Can we please go over an example?
> 
> Take a look at the attached WebVTT file.

Thanks a lot for the test file!


> We expect the second event to be rendered like this,
> as from WebVTT’s point of view it’s just normal text:
> 
>   our final \h approach \N into \ Coruscant.
> 
> What we currently get after conversion to ASS is like this though
> (pay attention to the number of spaces):
> 
>   our final   approach
>   into \ Coruscant.

Yup, no doubt that this is wrong. 


> If instead the word-joiner is appended as in my patch, the
> visual output matches the expectation (mail does not contain U+2060):
> 
>   our final \h approach \N into \ Coruscant.

I can also confirm that your patch would "work" with regards
to ass output when trying with both "old" libass and new libass.
I haven't tried with other implementations yet, but when this 
would turn out to be working with all usual implementations,
I might even be OK with the word-joiner.

But this is where the agreement ends. 

- If at all, the word-joiner insertion would need to be 
  limited to ASS output ONLY
- it would need to be controllable through an option in the ASS 
  encoder
- The word joiners should not be used in internal processing and 
  only be (optionally) added when encoding to ass
- Unfortunately, the FATE tests for the other subtitle formats
  do not include these sequences in the test source files, and
  that means, before making such change that might potentially 
  affect all other text subtitle encoders, those sequences would
  need to be added to make sure these conversion won't be affected
- Generally, those changes (also the BIDI mark insertion) should 
  be evaluated with regards to all text subtitle encoders,
  making sure there's no side effect.

You said:

> I’m not interested in reworking ffmpeg’s internal subtitle handling.
> The proposed patch is a clear improvement over the status quo which
> is plain incorrect. Within reasonable effort and sound arguments for
> it adjustments to the patch can be made; reworking ffmpeg internals is
> imo not “reasonable” effort to correct an uncontestedly wrong escape.

And that is a problem. The changes you are proposing are making
changes to ffmpeg’s internal subtitle handling, so you need to decide 
whether you want to work on it or not.


> You have two options:
[..]
> Or go ahead and create your own patch.

I will do this, but "only" on top of my subtitle filtering patchset
because that's my current focus area and just two weeks ago I had to
add a temporary hack for a related case which is about ASS dialog lines
like:

Dialogue: 0,0:00:00.00,0:00:05.00,Default,,0,0,0,,{comment text: \....}

Currently, ffmpeg does not recognize this as override code, even though
it's valid in ASS that a backslash with the actual code doesn't appear
immediately after the opening curly brace.
What comes on top of this is that other subtitle decoders do NOT escape
the curly braces like WebVTTdec and ass_bprint_text_event().
When you look at SubRip_capability_tester.srt from the FATE suite, you
can see that it contains ASS codes that are expected to be recognized
and applied, but when there's normal text in curly braces without 
a backslash, it should be treated at normal text.

This is quite a mess that needs to be cleaned up with a plan and it's
all related. Like I said already: A central point to this is the escaping
and what's needed is a solution that can cover all of those things.

I had put this subject aside as I've been unsure about how to do it,
but this discussion has been very helpful to see the issues more clearly.

How about separating the BIDI part from your patch? I'd see only two
things remaining:

- Go through all text subtitle encoders and think/research whether 
  those marks would be acceptable in those formats or whether they
  would need to be removed (like now)
- Think about whether the Unicode bidi marks should be replaced back
  to the html-style codes
  It's not these wouldn't work, but it's again about visibility and 
  I tend to think that it would be preferable to have them visible 
  in the output

softworkz
Soft Works Feb. 6, 2022, 1:37 a.m. UTC | #25
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Soft
> Works
> Sent: Sunday, February 6, 2022 2:09 AM
> To: FFmpeg development discussions and patches <ffmpeg-
> devel@ffmpeg.org>
> Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}: fix
> handling of backslashes
> 
> 
> 
> > -----Original Message-----
> > From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of
> > Oneric
> > Sent: Saturday, February 5, 2022 11:00 PM
> > To: FFmpeg development discussions and patches <ffmpeg-
> > devel@ffmpeg.org>
> > Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/{ass, webvttdec}:
> fix
> > handling of backslashes
> >
> > On Sat, Feb 05, 2022 at 02:08:48 +0000, Soft Works wrote:
> > > Let's try to approach this from a different side. Which case is
> > > your [1/2] commit actually supposed to fix?
> >
> > Escape backslashes when converting from WebVTT to not accidentally
> > introduce active ASS sequences and replace the wrong backslash-
> escape
> > in ff_ass_bprint_text_event with an escape that actually works.
> >
> > > How did you test your patch?
> > > Can we please go over an example?
> >
> > Take a look at the attached WebVTT file.
> 
> Thanks a lot for the test file!
> 
> 
> > We expect the second event to be rendered like this,
> > as from WebVTT’s point of view it’s just normal text:
> >
> >   our final \h approach \N into \ Coruscant.
> >
> > What we currently get after conversion to ASS is like this though
> > (pay attention to the number of spaces):
> >
> >   our final   approach
> >   into \ Coruscant.
> 
> Yup, no doubt that this is wrong.
> 
> 
> > If instead the word-joiner is appended as in my patch, the
> > visual output matches the expectation (mail does not contain
> U+2060):
> >
> >   our final \h approach \N into \ Coruscant.
> 
> I can also confirm that your patch would "work" with regards
> to ass output when trying with both "old" libass and new libass.
> I haven't tried with other implementations yet, but when this
> would turn out to be working with all usual implementations,
> I might even be OK with the word-joiner.
> 
> But this is where the agreement ends.
> 
> - If at all, the word-joiner insertion would need to be
>   limited to ASS output ONLY
> - it would need to be controllable through an option in the ASS
>   encoder
> - The word joiners should not be used in internal processing and
>   only be (optionally) added when encoding to ass
> - Unfortunately, the FATE tests for the other subtitle formats
>   do not include these sequences in the test source files, and
>   that means, before making such change that might potentially
>   affect all other text subtitle encoders, those sequences would
>   need to be added to make sure these conversion won't be affected
> - Generally, those changes (also the BIDI mark insertion) should
>   be evaluated with regards to all text subtitle encoders,
>   making sure there's no side effect.
> 
> You said:
> 
> > I’m not interested in reworking ffmpeg’s internal subtitle handling.
> > The proposed patch is a clear improvement over the status quo which
> > is plain incorrect. Within reasonable effort and sound arguments for
> > it adjustments to the patch can be made; reworking ffmpeg internals
> is
> > imo not “reasonable” effort to correct an uncontestedly wrong
> escape.
> 
> And that is a problem. The changes you are proposing are making
> changes to ffmpeg’s internal subtitle handling, so you need to decide
> whether you want to work on it or not.
> 
> 
> > You have two options:
> [..]
> > Or go ahead and create your own patch.
> 
> I will do this, but "only" on top of my subtitle filtering patchset
> because that's my current focus area and just two weeks ago I had to
> add a temporary hack for a related case which is about ASS dialog
> lines
> like:
> 
> Dialogue: 0,0:00:00.00,0:00:05.00,Default,,0,0,0,,{comment text:
> \....}
> 
> Currently, ffmpeg does not recognize this as override code, even
> though
> it's valid in ASS that a backslash with the actual code doesn't appear
> immediately after the opening curly brace.
> What comes on top of this is that other subtitle decoders do NOT
> escape
> the curly braces like WebVTTdec and ass_bprint_text_event().
> When you look at SubRip_capability_tester.srt from the FATE suite, you
> can see that it contains ASS codes that are expected to be recognized
> and applied, but when there's normal text in curly braces without
> a backslash, it should be treated at normal text.
> 
> This is quite a mess that needs to be cleaned up with a plan and it's
> all related. Like I said already: A central point to this is the
> escaping
> and what's needed is a solution that can cover all of those things.
> 
> I had put this subject aside as I've been unsure about how to do it,
> but this discussion has been very helpful to see the issues more
> clearly.
> 
> How about separating the BIDI part from your patch? I'd see only two
> things remaining:
> 
> - Go through all text subtitle encoders and think/research whether
>   those marks would be acceptable in those formats or whether they
>   would need to be removed (like now)
> - Think about whether the Unicode bidi marks should be replaced back
>   to the html-style codes
>   It's not these wouldn't work, but it's again about visibility and
>   I tend to think that it would be preferable to have them visible
>   in the output
> 
> softworkz

What I would also find acceptable is to just remove the escaping of
backslashes in ff_ass_bprint_text_event() without adding a word-joiner,
as this is clearly nonsense and it would still be an improvement.

sw
diff mbox series

Patch

diff --git a/libavcodec/ass.c b/libavcodec/ass.c
index 725e4d42ba..461e110ca4 100644
--- a/libavcodec/ass.c
+++ b/libavcodec/ass.c
@@ -157,8 +157,11 @@  void ff_ass_bprint_text_event(AVBPrint *buf, const char *p, int size,
 
         /* standard ASS escaping so random characters don't get mis-interpreted
          * as ASS */
-        } else if (!keep_ass_markup && strchr("{}\\", *p)) {
+        } else if (!keep_ass_markup && strchr("{}", *p)) {
             av_bprintf(buf, "\\%c", *p);
+        } else if (!keep_ass_markup && *p == '\\') {
+           // append word-joiner U+2060 as UTF-8 to break up sequences like \N
+           av_bprintf(buf, "\\\xe2\x81\xa0");
 
         /* some packets might end abruptly (no \0 at the end, like for example
          * in some cases of demuxing from a classic video container), some
diff --git a/libavcodec/webvttdec.c b/libavcodec/webvttdec.c
index 0093f328fa..8cb739697a 100644
--- a/libavcodec/webvttdec.c
+++ b/libavcodec/webvttdec.c
@@ -37,7 +37,7 @@  static const struct {
     {"<i>", "{\\i1}"}, {"</i>", "{\\i0}"},
     {"<b>", "{\\b1}"}, {"</b>", "{\\b0}"},
     {"<u>", "{\\u1}"}, {"</u>", "{\\u0}"},
-    {"{", "\\{"}, {"}", "\\}"}, // escape to avoid ASS markup conflicts
+    {"{", "\\{"}, {"}", "\\}"}, {"\\", "\\\xe2\x81\xa0"}, // escape to avoid ASS markup conflicts
     {"&gt;", ">"}, {"&lt;", "<"},
     {"&lrm;", ""}, {"&rlm;", ""}, // FIXME: properly honor bidi marks
     {"&amp;", "&"}, {"&nbsp;", "\\h"},