diff mbox series

[FFmpeg-devel,v4,2/4] avcodec/{ass, webvttdec}: fix handling of backslashes

Message ID 20240219214227.19814-3-oneric@oneric.de
State New
Headers show
Series Fix some active sequences in subtitles | expand

Checks

Context Check Description
yinshiyou/make_loongarch64 success Make finished
yinshiyou/make_fate_loongarch64 success Make fate finished
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished

Commit Message

Oneric Feb. 19, 2024, 9:42 p.m. UTC
Backslashes cannot be escaped by a backslash in any ASS renderer,
but unless followed by specific characters it is just printed out.
Insert a word-joiner character after a backslash to break up
active sequences without changing the visual output.
---
 libavcodec/ass.c       | 9 ++++++++-
 libavcodec/webvttdec.c | 2 +-
 2 files changed, 9 insertions(+), 2 deletions(-)

Comments

Stefano Sabatini April 4, 2024, 5:44 p.m. UTC | #1
On date Monday 2024-02-19 22:42:25 +0100, Oneric wrote:
> Backslashes cannot be escaped by a backslash in any ASS renderer,
> but unless followed by specific characters it is just printed out.
> Insert a word-joiner character after a backslash to break up
> active sequences without changing the visual output.
> ---
>  libavcodec/ass.c       | 9 ++++++++-
>  libavcodec/webvttdec.c | 2 +-
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/libavcodec/ass.c b/libavcodec/ass.c
> index 5058dc8337..a68d3568b4 100644
> --- a/libavcodec/ass.c
> +++ b/libavcodec/ass.c
> @@ -183,9 +183,16 @@ void ff_ass_bprint_text_event(AVBPrint *buf, const char *p, int size,
>  
>          /* standard ASS escaping so random characters don't get mis-interpreted
>           * as ASS */
> -        } else if (!keep_ass_markup && strchr("{}\\", *p)) {
> +        } else if (!keep_ass_markup && strchr("{}", *p)) {
>              av_bprintf(buf, "\\%c", *p);
>  

> +        /* append word-joiner U+2060 as UTF-8 to break up sequences like \N */

I'm confused by this, what kind of \N sequences might appear in an ASS
file? Can you show an offending sequence?

> +        } else if (!keep_ass_markup && *p == '\\') {
> +            if (p_end - p <= 3 || strncmp(p + 1, "\xe2\x81\xa0", 3))
> +                av_bprintf(buf, "\\\xe2\x81\xa0");
> +            else
> +                av_bprintf(buf, "\\");
> +
>          /* some packets might end abruptly (no \0 at the end, like for example
>           * in some cases of demuxing from a classic video container), some
>           * might be terminated with \n or \r\n which we have to remove (for
> diff --git a/libavcodec/webvttdec.c b/libavcodec/webvttdec.c
> index 990d150f16..6e55bc5499 100644
> --- a/libavcodec/webvttdec.c
> +++ b/libavcodec/webvttdec.c
> @@ -37,7 +37,7 @@ static const struct {
>      {"<i>", "{\\i1}"}, {"</i>", "{\\i0}"},
>      {"<b>", "{\\b1}"}, {"</b>", "{\\b0}"},
>      {"<u>", "{\\u1}"}, {"</u>", "{\\u0}"},
> -    {"{", "\\{"}, {"}", "\\}"}, // escape to avoid ASS markup conflicts
> +    {"{", "\\{"}, {"}", "\\}"}, {"\\", "\\\xe2\x81\xa0"}, // escape to avoid ASS markup conflicts
>      {"&gt;", ">"}, {"&lt;", "<"},
>      {"&lrm;", "\xe2\x80\x8e"}, {"&rlm;", "\xe2\x80\x8f"},
>      {"&amp;", "&"}, {"&nbsp;", "\\h"},
Marth64 April 4, 2024, 5:56 p.m. UTC | #2
> I'm confused by this, what kind of \N
> sequences might appear in an ASS file?
> Can you show an offending sequence?

Good day Stefano,

The one I tested with was something like :
Jim\Nancy

I made this up while testing. But I have seen similar in real world
scenarios particularly in ASS files originating from subtitles that are
“summarized”. In legacy content this is common, to fit content in a 4:3
screen the subtitle writer would shorten or abbreviate what is spoken on
the screen and slashes are a replacement for “and”.

Hope this helps.
Stefano Sabatini April 4, 2024, 6:02 p.m. UTC | #3
On date Thursday 2024-04-04 12:56:30 -0500, Marth64 wrote:
> > I'm confused by this, what kind of \N
> > sequences might appear in an ASS file?
> > Can you show an offending sequence?
> 
> Good day Stefano,
> 

> The one I tested with was something like :
> Jim\Nancy

Is N a literal or is this impacting also sequences in the form:
Jim\Joe?

Is \N a special sequence in ASS-speak or does it mean a slash followed
by any character?
Marth64 April 4, 2024, 6:26 p.m. UTC | #4
>  Is \N a special sequence in ASS-speak
\N is the special line break sequence in ASS

I’m not sure of other special sequences following the \ character, I yield
to Oneric on this question.

Best,
Marth64
Oneric April 4, 2024, 6:36 p.m. UTC | #5
On Thu, Apr 04, 2024 at 13:26:55 -0500, Marth64 wrote:
> >  Is \N a special sequence in ASS-speak
> \N is the special line break sequence in ASS
> 
> I’m not sure of other special sequences following the \ character, I yield
> to Oneric on this question.

Standard ASS(2) and SSAv4 have
  \N (forced linebreak)
  \n (linebreak hint)
  \h (nonbreaking space)

libass further has \{ and \} dealt with by the following commit.
Afaik no ASS renderer ever supported \\.

The added escaping happens for non-ASS format (in which those sequences
are just normal text) which get encoded into ASS by ffmpeg. If it weren’t
escaped we’d get sudden linebreaks, spaces etc and missing text.
It doesn’t affect ASS->ASS remuxing.
Stefano Sabatini April 4, 2024, 9:53 p.m. UTC | #6
On date Thursday 2024-04-04 20:36:50 +0200, Oneric wrote:
> On Thu, Apr 04, 2024 at 13:26:55 -0500, Marth64 wrote:
> > >  Is \N a special sequence in ASS-speak
> > \N is the special line break sequence in ASS
> > 
> > I’m not sure of other special sequences following the \ character, I yield
> > to Oneric on this question.
> 
> Standard ASS(2) and SSAv4 have
>   \N (forced linebreak)
>   \n (linebreak hint)
>   \h (nonbreaking space)
> 
> libass further has \{ and \} dealt with by the following commit.
> Afaik no ASS renderer ever supported \\.
> 
> The added escaping happens for non-ASS format (in which those sequences
> are just normal text) which get encoded into ASS by ffmpeg. If it weren’t
> escaped we’d get sudden linebreaks, spaces etc and missing text.
> It doesn’t affect ASS->ASS remuxing.

Thanks for the clarification.

I think the patcheset should be good, but I'll wait a few days for
letting more comments.

I can fix the typos before pushing, or feel free to send an update.
diff mbox series

Patch

diff --git a/libavcodec/ass.c b/libavcodec/ass.c
index 5058dc8337..a68d3568b4 100644
--- a/libavcodec/ass.c
+++ b/libavcodec/ass.c
@@ -183,9 +183,16 @@  void ff_ass_bprint_text_event(AVBPrint *buf, const char *p, int size,
 
         /* standard ASS escaping so random characters don't get mis-interpreted
          * as ASS */
-        } else if (!keep_ass_markup && strchr("{}\\", *p)) {
+        } else if (!keep_ass_markup && strchr("{}", *p)) {
             av_bprintf(buf, "\\%c", *p);
 
+        /* append word-joiner U+2060 as UTF-8 to break up sequences like \N */
+        } else if (!keep_ass_markup && *p == '\\') {
+            if (p_end - p <= 3 || strncmp(p + 1, "\xe2\x81\xa0", 3))
+                av_bprintf(buf, "\\\xe2\x81\xa0");
+            else
+                av_bprintf(buf, "\\");
+
         /* some packets might end abruptly (no \0 at the end, like for example
          * in some cases of demuxing from a classic video container), some
          * might be terminated with \n or \r\n which we have to remove (for
diff --git a/libavcodec/webvttdec.c b/libavcodec/webvttdec.c
index 990d150f16..6e55bc5499 100644
--- a/libavcodec/webvttdec.c
+++ b/libavcodec/webvttdec.c
@@ -37,7 +37,7 @@  static const struct {
     {"<i>", "{\\i1}"}, {"</i>", "{\\i0}"},
     {"<b>", "{\\b1}"}, {"</b>", "{\\b0}"},
     {"<u>", "{\\u1}"}, {"</u>", "{\\u0}"},
-    {"{", "\\{"}, {"}", "\\}"}, // escape to avoid ASS markup conflicts
+    {"{", "\\{"}, {"}", "\\}"}, {"\\", "\\\xe2\x81\xa0"}, // escape to avoid ASS markup conflicts
     {"&gt;", ">"}, {"&lt;", "<"},
     {"&lrm;", "\xe2\x80\x8e"}, {"&rlm;", "\xe2\x80\x8f"},
     {"&amp;", "&"}, {"&nbsp;", "\\h"},