[FFmpeg-devel] movtextdec: fix handling of UTF-8 subtitles

Message ID	20180324144836.29296-1-nfxjfg@googlemail.com
State	Accepted
Commit	b0644c3e1a96397ee5e2448c542fa4c3bc319537
Headers	From: wm4 <nfxjfg@googlemail.com> To: ffmpeg-devel@ffmpeg.org Date: Sat, 24 Mar 2018 15:48:36 +0100 Message-Id: <20180324144836.29296-1-nfxjfg@googlemail.com> Subject: [FFmpeg-devel] [PATCH] movtextdec: fix handling of UTF-8 subtitles

wm4 March 24, 2018, 2:48 p.m. UTC

Subtitles containing styled UTF-8 text (i.e. not just 7-bit
ASCII characters) were not handled correctly. The spec mandates that
styling start/end ranges are in "characters". It's not quite clear what
a "character" is supposed to be, but maybe they mean Unicode codepoints.

FFmpeg's decoder treated the style ranges as byte indexes, which could
lead to UTF-8 sequences being broken, and the common code dropping the
whole subtitle line.

Change this and count codepoints instead. This also means that even
if this is somehow wrong, the decoder won't break UTF-8 sequences
anymore. The sample which led me to investigate this now appears to work
correctly.
---
https://github.com/mpv-player/mpv/issues/5675
---
 libavcodec/movtextdec.c | 50 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 37 insertions(+), 13 deletions(-)
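
Editor's illustration (not part of the patch): a minimal sketch of why a style offset read as a byte index can land in the middle of a UTF-8 sequence, while the same offset read as a codepoint index cannot. The helper names is_continuation and codepoint_to_byte_offset are hypothetical, and the code assumes well-formed UTF-8 input.

```c
#include <stddef.h>

/* Hypothetical helper, not FFmpeg code: nonzero if b is a UTF-8
 * continuation byte (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: map a codepoint index to a byte offset in a
 * UTF-8 string of len bytes. Assumes well-formed input. Returns len
 * when cp_index is past the end of the string. */
static size_t codepoint_to_byte_offset(const char *s, size_t len,
                                       size_t cp_index)
{
    size_t i = 0;

    while (i < len && cp_index > 0) {
        i++;                      /* step over the lead byte */
        while (i < len && is_continuation((unsigned char)s[i]))
            i++;                  /* step over its continuation bytes */
        cp_index--;
    }
    return i;
}
```

For the 6-byte string "h\xC3\xA9llo" ("héllo"), byte index 2 is a continuation byte, so splitting there breaks the "é" sequence. Codepoint index 2 maps to byte offset 3, which is the start of the first "l".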

Jan Ekström March 24, 2018, 2:54 p.m. UTC | #1

On Sat, Mar 24, 2018 at 4:48 PM, wm4 <nfxjfg@googlemail.com> wrote:
> Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> ASCII characters) were not handled correctly. The spec mandates that
> styling start/end ranges are in "characters". It's not quite clear what
> a "character" is supposed to be, but maybe they mean unicode codepoints.
>
> FFmpeg's decoder treated the style ranges as byte indexes, which could
> lead to UTF-8 sequences being broken, and the common code dropping the
> whole subtitle line.
>
> Change this and count the codepoint instead. This also means that even
> if this is somehow wrong, the decoder won't break UTF-8 sequences
> anymore. The sample which led me to investigate this now appears to work
> correctly.
> ---
> https://github.com/mpv-player/mpv/issues/5675

For reference, the relevant specification for MOV/3GPP Timed Text
seems to be ETSI TS 126 245, which is currently at version 14
(2017-04), available at
http://www.etsi.org/deliver/etsi_ts/126200_126299/126245/14.00.00_60/ts_126245v140000p.pdf
.

It is indeed rather ambiguous in 5.2 regarding what a "character" is
in the context of UTF-8 or UTF-16.
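
Editor's illustration of the ambiguity: a rough sketch, assuming well-formed UTF-8 input, that counts the same text in three candidate units (bytes, UTF-16 code units, Unicode codepoints). The TextCounts type and count_units function are hypothetical names, not taken from the spec or from FFmpeg.

```c
#include <stddef.h>

/* Hypothetical result type: the same text measured in three units. */
typedef struct {
    size_t bytes;
    size_t utf16_units;
    size_t codepoints;
} TextCounts;

/* Count bytes, UTF-16 code units and codepoints of a well-formed
 * UTF-8 string of len bytes. */
static TextCounts count_units(const char *s, size_t len)
{
    TextCounts c = { len, 0, 0 };

    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)s[i];

        if ((b & 0xC0) == 0x80)
            continue;             /* continuation byte: same codepoint */
        c.codepoints++;
        /* A 4-byte UTF-8 sequence encodes a codepoint above U+FFFF,
         * which takes a surrogate pair (2 units) in UTF-16. */
        c.utf16_units += (b >= 0xF0) ? 2 : 1;
    }
    return c;
}
```

For the 7-byte string "a\xC3\xA9\xF0\x9F\x98\x80" ("a", "é", U+1F600) this yields 7 bytes, 4 UTF-16 code units and 3 codepoints, so the three readings of "character" give three different style offsets.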

Best regards,
Jan

Philip Langdale March 24, 2018, 3:37 p.m. UTC | #2

On Sat, 24 Mar 2018 15:48:36 +0100
wm4 <nfxjfg@googlemail.com> wrote:

> Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> ASCII characters) were not handled correctly. The spec mandates that
> styling start/end ranges are in "characters". It's not quite clear
> what a "character" is supposed to be, but maybe they mean unicode
> codepoints.
> 
> FFmpeg's decoder treated the style ranges as byte indexes, which could
> lead to UTF-8 sequences being broken, and the common code dropping the
> whole subtitle line.
> 
> Change this and count the codepoint instead. This also means that even
> if this is somehow wrong, the decoder won't break UTF-8 sequences
> anymore. The sample which led me to investigate this now appears to
> work correctly.
> ---
> https://github.com/mpv-player/mpv/issues/5675
> ---
>  libavcodec/movtextdec.c | 50 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 37 insertions(+), 13 deletions(-)
> 
> diff --git a/libavcodec/movtextdec.c b/libavcodec/movtextdec.c
> index bd19577724..89ac791602 100644
> --- a/libavcodec/movtextdec.c
> +++ b/libavcodec/movtextdec.c
> @@ -326,9 +326,24 @@ static const Box box_types[] = {
>  
>  const static size_t box_count = FF_ARRAY_ELEMS(box_types);
>  
> +// Return byte length of the UTF-8 sequence starting at text[0]. 0 on error.
> +static int get_utf8_length_at(const char *text, const char *text_end)
> +{
> +    const char *start = text;
> +    int err = 0;
> +    uint32_t c;
> +    GET_UTF8(c, text < text_end ? (uint8_t)*text++ : (err = 1, 0), goto error;);
> +    if (err)
> +        goto error;
> +    return text - start;
> +error:
> +    return 0;
> +}
> +
>  static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
> -                        MovTextContext *m)
> +                       AVCodecContext *avctx)
>  {
> +    MovTextContext *m = avctx->priv_data;
>      int i = 0;
>      int j = 0;
>      int text_pos = 0;
> @@ -342,6 +357,8 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
>      }
>  
>      while (text < text_end) {
> +        int len;
> +
>          if (m->box_flags & STYL_BOX) {
>              for (i = 0; i < m->style_entries; i++) {
>                  if (m->s[i]->style_flag && text_pos == m->s[i]->style_end) {
> @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
>              }
>          }
>  
> -        switch (*text) {
> -        case '\r':
> -            break;
> -        case '\n':
> -            av_bprintf(buf, "\\N");
> -            break;
> -        default:
> -            av_bprint_chars(buf, *text, 1);
> -            break;
> +        len = get_utf8_length_at(text, text_end);
> +        if (len < 1) {
> +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in subtitle\n");
> +            len = 1;
> +        }
> +        for (i = 0; i < len; i++) {
> +            switch (*text) {
> +            case '\r':
> +                break;
> +            case '\n':
> +                av_bprintf(buf, "\\N");
> +                break;
> +            default:
> +                av_bprint_chars(buf, *text, 1);
> +                break;
> +            }
> +            text++;
>          }
> -        text++;
>          text_pos++;
>      }
>  
> @@ -507,10 +531,10 @@ static int mov_text_decode_frame(AVCodecContext *avctx,
>              }
>              m->tracksize = m->tracksize + tsmb_size;
>          }
> -        text_to_ass(&buf, ptr, end, m);
> +        text_to_ass(&buf, ptr, end, avctx);
>          mov_text_cleanup(m);
>      } else
> -        text_to_ass(&buf, ptr, end, m);
> +        text_to_ass(&buf, ptr, end, avctx);
>  
>      ret = ff_ass_add_rect(sub, buf.str, m->readorder++, 0, NULL, NULL);
>      av_bprint_finalize(&buf, NULL);

Ship it. Thanks!


--phil

Carl Eugen Hoyos March 24, 2018, 4:05 p.m. UTC | #3

2018-03-24 15:48 GMT+01:00, wm4 <nfxjfg@googlemail.com>:
> Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> ASCII characters) were not handled correctly. The spec mandates that
> styling start/end ranges are in "characters". It's not quite clear what
> a "character" is supposed to be, but maybe they mean unicode codepoints.
>
> FFmpeg's decoder treated the style ranges as byte indexes, which could
> lead to UTF-8 sequences being broken, and the common code dropping the
> whole subtitle line.
>
> Change this and count the codepoint instead. This also means that even
> if this is somehow wrong, the decoder won't break UTF-8 sequences
> anymore. The sample which led me to investigate this now appears to work
> correctly.

Could you confirm that this is also what QT does?

Or is it impossible that the patch breaks something?

[...]

> @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
>              }
>          }
>
> -        switch (*text) {
> -        case '\r':
> -            break;
> -        case '\n':
> -            av_bprintf(buf, "\\N");
> -            break;
> -        default:
> -            av_bprint_chars(buf, *text, 1);
> -            break;
> +        len = get_utf8_length_at(text, text_end);
> +        if (len < 1) {
> +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in subtitle\n");
> +            len = 1;
> +        }
> +        for (i = 0; i < len; i++) {

> +            switch (*text) {
> +            case '\r':
> +                break;
> +            case '\n':
> +                av_bprintf(buf, "\\N");
> +                break;
> +            default:
> +                av_bprint_chars(buf, *text, 1);
> +                break;

Imo, the reindentation is not ok but this isn't my code.

Carl Eugen

Hendrik Leppkes March 24, 2018, 4:15 p.m. UTC | #4

On Sat, Mar 24, 2018 at 3:48 PM, wm4 <nfxjfg@googlemail.com> wrote:
> Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> ASCII characters) were not handled correctly. The spec mandates that
> styling start/end ranges are in "characters". It's not quite clear what
> a "character" is supposed to be, but maybe they mean unicode codepoints.
>

Well a character certainly isn't a byte in a Unicode string, even if
the C type may be called that way.

So seems fine to me.

- Hendrik

Jan Ekström March 24, 2018, 4:27 p.m. UTC | #5

On Sat, Mar 24, 2018 at 6:05 PM, Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:
> Could you confirm that this is also what QT does?
>
> Or is it impossible that the patch breaks something?
>

Everything can in theory break something. Anyways, it was widely
understood that 3GPP timed text was a better defined MOV timed text,
but I did have a quick look at Apple's qtff.pdf, which happily enough
actually defined the start and end characters (yes, they are marked as
"characters") as follows from the "Subtitle Style Atom" definition
("styl"):

Start character
    A 16-bit value that is the offset of the first character that is
to use the style specified in this record.
    Zero (0) is the first character in the subtitle.
End character
    A 16-bit value that is the offset of the character that follows
the last character to use this style.
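
Editor's illustration: read literally, the definition above describes a half-open range [start, end), since the end value is the offset of the character after the last styled one. A minimal sketch of that reading follows. The StyleRange struct and both helpers are hypothetical and do not mirror FFmpeg's or Apple's data structures.

```c
#include <stdint.h>

/* Hypothetical record holding the two 16-bit offsets described above. */
typedef struct {
    uint16_t start_char; /* offset of the first styled character */
    uint16_t end_char;   /* offset of the character after the last styled one */
} StyleRange;

/* Nonzero if the character at offset pos falls inside [start, end). */
static int style_covers(const StyleRange *r, unsigned pos)
{
    return pos >= r->start_char && pos < r->end_char;
}

/* Number of styled characters; 0 for an empty or inverted range. */
static unsigned style_length(const StyleRange *r)
{
    return r->end_char > r->start_char
               ? (unsigned)(r->end_char - r->start_char)
               : 0;
}
```

With start 0 and end 5, offsets 0 through 4 are styled and offset 5 is not, giving 5 styled characters. The definition still does not say what unit a "character" is.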

Best regards,
Jan

wm4 March 24, 2018, 4:42 p.m. UTC | #6

On Sat, 24 Mar 2018 17:05:41 +0100
Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:

> 2018-03-24 15:48 GMT+01:00, wm4 <nfxjfg@googlemail.com>:
> > Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> > ASCII characters) were not handled correctly. The spec mandates that
> > styling start/end ranges are in "characters". It's not quite clear what
> > a "character" is supposed to be, but maybe they mean unicode codepoints.
> >
> > FFmpeg's decoder treated the style ranges as byte indexes, which could
> > lead to UTF-8 sequences being broken, and the common code dropping the
> > whole subtitle line.
> >
> > Change this and count the codepoint instead. This also means that even
> > if this is somehow wrong, the decoder won't break UTF-8 sequences
> > anymore. The sample which led me to investigate this now appears to work
> > correctly.  
> 
> Could you confirm that this is also what QT does?

I can't test with QT. VLC seems to behave the same as FFmpeg with this patch applied.

> Or is it impossible that the patch breaks something?

It could probably break movtext subtitles generated by ffmpeg (I didn't
fix the movtext encoder, and it seems to have the same bug). But these
will most likely be broken on other players too. Though the worst case
is just that the styles get shifted.
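
Editor's illustration of the shift: assuming, as the message suggests but does not confirm, that an encoder wrote a style offset as a byte index and a decoder reads it as a codepoint index, the style lands later in the text by the number of continuation bytes that precede the offset. The function name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: number of UTF-8 continuation bytes (10xxxxxx)
 * in the first byte_offset bytes of s, clamped to len. Under the
 * assumption stated above, this is how many characters too late a
 * byte-based style offset lands when read as a codepoint index. */
static size_t continuation_bytes_before(const char *s, size_t len,
                                        size_t byte_offset)
{
    size_t n = 0;

    if (byte_offset > len)
        byte_offset = len;
    for (size_t i = 0; i < byte_offset; i++)
        if (((unsigned char)s[i] & 0xC0) == 0x80)
            n++;
    return n;
}
```

For "h\xC3\xA9llo" ("héllo"), a style meant to start at the first "l" has byte offset 3. One continuation byte precedes it, so a codepoint-counting decoder would start the style one character late, at the second "l". Pure ASCII text has no continuation bytes and is unaffected.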

> [...]
> 
> > @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
> >              }
> >          }
> >
> > -        switch (*text) {
> > -        case '\r':
> > -            break;
> > -        case '\n':
> > -            av_bprintf(buf, "\\N");
> > -            break;
> > -        default:
> > -            av_bprint_chars(buf, *text, 1);
> > -            break;
> > +        len = get_utf8_length_at(text, text_end);
> > +        if (len < 1) {
> > +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in subtitle\n");
> > +            len = 1;
> > +        }
> > +        for (i = 0; i < len; i++) {  
> 
> > +            switch (*text) {
> > +            case '\r':
> > +                break;
> > +            case '\n':
> > +                av_bprintf(buf, "\\N");
> > +                break;
> > +            default:
> > +                av_bprint_chars(buf, *text, 1);
> > +                break;  
> 
> Imo, the reindentation is not ok but this isn't my code.

Why not?

Carl Eugen Hoyos March 24, 2018, 4:50 p.m. UTC | #7

2018-03-24 17:42 GMT+01:00, wm4 <nfxjfg@googlemail.com>:
> On Sat, 24 Mar 2018 17:05:41 +0100
> Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:
>
>> 2018-03-24 15:48 GMT+01:00, wm4 <nfxjfg@googlemail.com>:
>> > Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
>> > ASCII characters) were not handled correctly. The spec mandates that
>> > styling start/end ranges are in "characters". It's not quite clear what
>> > a "character" is supposed to be, but maybe they mean unicode codepoints.
>> >
>> > FFmpeg's decoder treated the style ranges as byte indexes, which could
>> > lead to UTF-8 sequences being broken, and the common code dropping the
>> > whole subtitle line.
>> >
>> > Change this and count the codepoint instead. This also means that even
>> > if this is somehow wrong, the decoder won't break UTF-8 sequences
>> > anymore. The sample which led me to investigate this now appears to work
>> > correctly.
>>
>> Could you confirm that this is also what QT does?
>
> I can't test with QT. VLC seems to behave like with this patch applied.
>
>> Or is it impossible that the patch breaks something?
>
> Could probably break movtext subtitles generated by ffmpeg (I didn't
> fix the movtext encoder, and it seems to have the same bug). But these
> will most likely be broken on other players too. Though the worst case
> is just that the styles get shifted.

Thank you.

>> [...]
>>
>> > @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
>> >              }
>> >          }
>> >
>> > -        switch (*text) {
>> > -        case '\r':
>> > -            break;
>> > -        case '\n':
>> > -            av_bprintf(buf, "\\N");
>> > -            break;
>> > -        default:
>> > -            av_bprint_chars(buf, *text, 1);
>> > -            break;
>> > +        len = get_utf8_length_at(text, text_end);
>> > +        if (len < 1) {
>> > +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in subtitle\n");
>> > +            len = 1;
>> > +        }
>> > +        for (i = 0; i < len; i++) {
>>
>> > +            switch (*text) {
>> > +            case '\r':
>> > +                break;
>> > +            case '\n':
>> > +                av_bprintf(buf, "\\N");
>> > +                break;
>> > +            default:
>> > +                av_bprint_chars(buf, *text, 1);
>> > +                break;
>>
>> Imo, the reindentation is not ok but this isn't my code.
>
> Why not?

Because the patch is much easier to read without it.

Carl Eugen

wm4 March 24, 2018, 4:59 p.m. UTC | #8

On Sat, 24 Mar 2018 17:50:53 +0100
Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:

> 2018-03-24 17:42 GMT+01:00, wm4 <nfxjfg@googlemail.com>:
> > On Sat, 24 Mar 2018 17:05:41 +0100
> > Carl Eugen Hoyos <ceffmpeg@gmail.com> wrote:
> >  
> >> 2018-03-24 15:48 GMT+01:00, wm4 <nfxjfg@googlemail.com>:  
> >> > Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> >> > ASCII characters) were not handled correctly. The spec mandates that
> >> > styling start/end ranges are in "characters". It's not quite clear what
> >> > a "character" is supposed to be, but maybe they mean unicode codepoints.
> >> >
> >> > FFmpeg's decoder treated the style ranges as byte indexes, which could
> >> > lead to UTF-8 sequences being broken, and the common code dropping the
> >> > whole subtitle line.
> >> >
> >> > Change this and count the codepoint instead. This also means that even
> >> > if this is somehow wrong, the decoder won't break UTF-8 sequences
> >> > anymore. The sample which led me to investigate this now appears to work
> >> > correctly.  
> >>
> >> Could you confirm that this is also what QT does?  
> >
> > I can't test with QT. VLC seems to behave like with this patch applied.
> >  
> >> Or is it impossible that the patch breaks something?  
> >
> > Could probably break movtext subtitles generated by ffmpeg (I didn't
> > fix the movtext encoder, and it seems to have the same bug). But these
> > will most likely be broken on other players too. Though the worst case
> > is just that the styles get shifted.  
> 
> Thank you.
> 
> >> [...]
> >>  
> >> > @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
> >> >              }
> >> >          }
> >> >
> >> > -        switch (*text) {
> >> > -        case '\r':
> >> > -            break;
> >> > -        case '\n':
> >> > -            av_bprintf(buf, "\\N");
> >> > -            break;
> >> > -        default:
> >> > -            av_bprint_chars(buf, *text, 1);
> >> > -            break;
> >> > +        len = get_utf8_length_at(text, text_end);
> >> > +        if (len < 1) {
> >> > +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in subtitle\n");
> >> > +            len = 1;
> >> > +        }
> >> > +        for (i = 0; i < len; i++) {  
> >>  
> >> > +            switch (*text) {
> >> > +            case '\r':
> >> > +                break;
> >> > +            case '\n':
> >> > +                av_bprintf(buf, "\\N");
> >> > +                break;
> >> > +            default:
> >> > +                av_bprint_chars(buf, *text, 1);
> >> > +                break;  
> >>
> >> Imo, the reindentation is not ok but this isn't my code.  
> >
> > Why not?  
> 
> Because the patch is much easier to read without it.

Git repository viewers can show commits with whitespace changes hidden,
so I don't think it matters anymore for this patch.

wm4 March 25, 2018, 5:30 p.m. UTC | #9

On Sat, 24 Mar 2018 15:48:36 +0100
wm4 <nfxjfg@googlemail.com> wrote:

> Subtitles which contained styled UTF-8 subtitles (i.e. not just 7 bit
> ASCII characters) were not handled correctly. The spec mandates that
> styling start/end ranges are in "characters". It's not quite clear what
> a "character" is supposed to be, but maybe they mean unicode codepoints.
> 
> FFmpeg's decoder treated the style ranges as byte indexes, which could
> lead to UTF-8 sequences being broken, and the common code dropping the
> whole subtitle line.
> 
> Change this and count the codepoint instead. This also means that even
> if this is somehow wrong, the decoder won't break UTF-8 sequences
> anymore. The sample which led me to investigate this now appears to work
> correctly.
> ---
> https://github.com/mpv-player/mpv/issues/5675
> ---
>  libavcodec/movtextdec.c | 50 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 37 insertions(+), 13 deletions(-)
> 
> diff --git a/libavcodec/movtextdec.c b/libavcodec/movtextdec.c
> index bd19577724..89ac791602 100644
> --- a/libavcodec/movtextdec.c
> +++ b/libavcodec/movtextdec.c
> @@ -326,9 +326,24 @@ static const Box box_types[] = {
>  
>  const static size_t box_count = FF_ARRAY_ELEMS(box_types);
>  
> +// Return byte length of the UTF-8 sequence starting at text[0]. 0 on error.
> +static int get_utf8_length_at(const char *text, const char *text_end)
> +{
> +    const char *start = text;
> +    int err = 0;
> +    uint32_t c;
> +    GET_UTF8(c, text < text_end ? (uint8_t)*text++ : (err = 1, 0), goto error;);
> +    if (err)
> +        goto error;
> +    return text - start;
> +error:
> +    return 0;
> +}
> +
>  static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
> -                        MovTextContext *m)
> +                       AVCodecContext *avctx)
>  {
> +    MovTextContext *m = avctx->priv_data;
>      int i = 0;
>      int j = 0;
>      int text_pos = 0;
> @@ -342,6 +357,8 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
>      }
>  
>      while (text < text_end) {
> +        int len;
> +
>          if (m->box_flags & STYL_BOX) {
>              for (i = 0; i < m->style_entries; i++) {
>                  if (m->s[i]->style_flag && text_pos == m->s[i]->style_end) {
> @@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
>              }
>          }
>  
> -        switch (*text) {
> -        case '\r':
> -            break;
> -        case '\n':
> -            av_bprintf(buf, "\\N");
> -            break;
> -        default:
> -            av_bprint_chars(buf, *text, 1);
> -            break;
> +        len = get_utf8_length_at(text, text_end);
> +        if (len < 1) {
> +            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in subtitle\n");
> +            len = 1;
> +        }
> +        for (i = 0; i < len; i++) {
> +            switch (*text) {
> +            case '\r':
> +                break;
> +            case '\n':
> +                av_bprintf(buf, "\\N");
> +                break;
> +            default:
> +                av_bprint_chars(buf, *text, 1);
> +                break;
> +            }
> +            text++;
>          }
> -        text++;
>          text_pos++;
>      }
>  
> @@ -507,10 +531,10 @@ static int mov_text_decode_frame(AVCodecContext *avctx,
>              }
>              m->tracksize = m->tracksize + tsmb_size;
>          }
> -        text_to_ass(&buf, ptr, end, m);
> +        text_to_ass(&buf, ptr, end, avctx);
>          mov_text_cleanup(m);
>      } else
> -        text_to_ass(&buf, ptr, end, m);
> +        text_to_ass(&buf, ptr, end, avctx);
>  
>      ret = ff_ass_add_rect(sub, buf.str, m->readorder++, 0, NULL, NULL);
>      av_bprint_finalize(&buf, NULL);

Pushed.
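
Editor's illustration: a standalone sketch of what a helper like get_utf8_length_at computes, written without FFmpeg's GET_UTF8 macro so it compiles on its own. This is not the committed code. The name utf8_length_at is hypothetical, and the validation is deliberately simplified: it checks only the lead byte, the buffer bounds and the continuation bytes, and it does not reject overlong encodings or surrogate codepoints.

```c
#include <stddef.h>

/* Return the byte length (1 to 4) of the UTF-8 sequence starting at
 * text, or 0 if the sequence is truncated by text_end or malformed.
 * Simplified sketch: overlong forms and surrogates are not rejected. */
static int utf8_length_at(const char *text, const char *text_end)
{
    unsigned char lead;
    int len, i;

    if (text >= text_end)
        return 0;

    lead = (unsigned char)*text;
    if (lead < 0x80)
        len = 1;                  /* 0xxxxxxx: ASCII */
    else if ((lead & 0xE0) == 0xC0)
        len = 2;                  /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0)
        len = 3;                  /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0)
        len = 4;                  /* 11110xxx */
    else
        return 0;                 /* stray continuation or invalid lead */

    if (text_end - text < len)
        return 0;                 /* sequence runs past the buffer */

    for (i = 1; i < len; i++)
        if (((unsigned char)text[i] & 0xC0) != 0x80)
            return 0;             /* expected a continuation byte */

    return len;
}
```

A caller in the style of the patch would copy that many bytes (falling back to 1 when the result is 0) and then advance its character position by exactly one, so that style offsets are compared against codepoints rather than bytes.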
