diff mbox

[FFmpeg-devel] movtextenc: fix handling of utf-8 subtitles

Message ID 20180328030705.22815-1-philipl@overt.org
State Superseded
Headers show

Commit Message

Philip Langdale March 28, 2018, 3:07 a.m. UTC
See the earlier fix for movtextdec for details. The equivalent bug is
present on the encoder side as well.

We need to track the text length in 'characters' (which seems to really
mean codepoints) to ensure that styles are applied across the correct
ranges.

Signed-off-by: Philip Langdale <philipl@overt.org>
---
 libavcodec/movtextenc.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

Comments

wm4 March 28, 2018, 10:01 a.m. UTC | #1
On Tue, 27 Mar 2018 20:07:05 -0700
Philip Langdale <philipl@overt.org> wrote:

> See the earlier fix for movtextdec for details. The equivalent bug is
> present on the encoder side as well.
> 
> We need to track the text length in 'characters' (which seems to really
> mean codepoints) to ensure that styles are applied across the correct
> ranges.
> 
> Signed-off-by: Philip Langdale <philipl@overt.org>
> ---
>  libavcodec/movtextenc.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/libavcodec/movtextenc.c b/libavcodec/movtextenc.c
> index d795e317c3..fd0743f752 100644
> --- a/libavcodec/movtextenc.c
> +++ b/libavcodec/movtextenc.c
> @@ -304,11 +304,33 @@ static void mov_text_color_cb(void *priv, unsigned int color, unsigned int color
>       */
>  }
>  
> +static uint16_t utf8_strlen(const char *text, int len)
> +{
> +    uint16_t i = 0, ret = 0;
> +    while (i < len) {
> +        char c = text[i];
> +        if (c >= 0)
> +            i += 1;
> +        else if ((c & 0xE0) == 0xC0)
> +            i += 2;
> +        else if ((c & 0xF0) == 0xE0)
> +            i += 3;
> +        else if ((c & 0xF8) == 0xF0)
> +            i += 4;
> +        else
> +            return 0;
> +        ret++;
> +    }
> +    return ret;
> +}
> +
>  static void mov_text_text_cb(void *priv, const char *text, int len)
>  {
> +    uint16_t utf8_len = utf8_strlen(text, len);
>      MovTextContext *s = priv;
>      av_bprint_append_data(&s->buffer, text, len);
> -    s->text_pos += len;
> +    // If it's not utf-8, just use the byte length
> +    s->text_pos += utf8_len ? utf8_len : len;
>  }
>  
>  static void mov_text_new_line_cb(void *priv, int forced)

LGTM.

Maybe a later patch that adds the UTF-8 length as lavu function (or
macro for SPEEED?) eould be nice, since we seem to need that often.
Andrey Turkin March 28, 2018, 11:02 a.m. UTC | #2
I think first check (ascii case) should be explicit bit-test, or (if there
is some reason to use sign-bit approach) at least "c" should be signed char
instead of simply char.

2018-03-28 6:07 GMT+03:00 Philip Langdale <philipl@overt.org>:

> See the earlier fix for movtextdec for details. The equivalent bug is
> present on the encoder side as well.
>
> We need to track the text length in 'characters' (which seems to really
> mean codepoints) to ensure that styles are applied across the correct
> ranges.
>
> Signed-off-by: Philip Langdale <philipl@overt.org>
> ---
>  libavcodec/movtextenc.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/libavcodec/movtextenc.c b/libavcodec/movtextenc.c
> index d795e317c3..fd0743f752 100644
> --- a/libavcodec/movtextenc.c
> +++ b/libavcodec/movtextenc.c
> @@ -304,11 +304,33 @@ static void mov_text_color_cb(void *priv, unsigned
> int color, unsigned int color
>       */
>  }
>
> +static uint16_t utf8_strlen(const char *text, int len)
> +{
> +    uint16_t i = 0, ret = 0;
> +    while (i < len) {
> +        char c = text[i];
> +        if (c >= 0)
> +            i += 1;
> +        else if ((c & 0xE0) == 0xC0)
> +            i += 2;
> +        else if ((c & 0xF0) == 0xE0)
> +            i += 3;
> +        else if ((c & 0xF8) == 0xF0)
> +            i += 4;
> +        else
> +            return 0;
> +        ret++;
> +    }
> +    return ret;
> +}
> +
>  static void mov_text_text_cb(void *priv, const char *text, int len)
>  {
> +    uint16_t utf8_len = utf8_strlen(text, len);
>      MovTextContext *s = priv;
>      av_bprint_append_data(&s->buffer, text, len);
> -    s->text_pos += len;
> +    // If it's not utf-8, just use the byte length
> +    s->text_pos += utf8_len ? utf8_len : len;
>  }
>
>  static void mov_text_new_line_cb(void *priv, int forced)
> --
> 2.14.1
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
wm4 March 28, 2018, 11:14 a.m. UTC | #3
On Wed, 28 Mar 2018 14:02:04 +0300
Andrey Turkin <andrey.turkin@gmail.com> wrote:

> I think first check (ascii case) should be explicit bit-test, or (if there
> is some reason to use sign-bit approach) at least "c" should be signed char
> instead of simply char.

Good point actually. It relies on char signedness, which is not
guaranteed.
Philip Langdale March 28, 2018, 3:16 p.m. UTC | #4
On Wed, 28 Mar 2018 13:14:57 +0200
wm4 <nfxjfg@googlemail.com> wrote:

> On Wed, 28 Mar 2018 14:02:04 +0300
> Andrey Turkin <andrey.turkin@gmail.com> wrote:
> 
> > I think first check (ascii case) should be explicit bit-test, or
> > (if there is some reason to use sign-bit approach) at least "c"
> > should be signed char instead of simply char.  
> 
> Good point actually. It relies on char signedness, which is not
> guaranteed.

Fair point. I'm updating it.

--phil
Philip Langdale March 28, 2018, 3:18 p.m. UTC | #5
On Wed, 28 Mar 2018 12:01:47 +0200
wm4 <nfxjfg@googlemail.com> wrote:

> 
> LGTM.
> 
> Maybe a later patch that adds the UTF-8 length as lavu function (or
> macro for SPEEED?) eould be nice, since we seem to need that often.

I can do that.

Although, I'm not sure the SPEEED version is readable enough to justify
using it. :-)

http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

--phil
diff mbox

Patch

diff --git a/libavcodec/movtextenc.c b/libavcodec/movtextenc.c
index d795e317c3..fd0743f752 100644
--- a/libavcodec/movtextenc.c
+++ b/libavcodec/movtextenc.c
@@ -304,11 +304,33 @@  static void mov_text_color_cb(void *priv, unsigned int color, unsigned int color
      */
 }
 
+static uint16_t utf8_strlen(const char *text, int len)
+{
+    uint16_t i = 0, ret = 0;
+    while (i < len) {
+        char c = text[i];
+        if (c >= 0)
+            i += 1;
+        else if ((c & 0xE0) == 0xC0)
+            i += 2;
+        else if ((c & 0xF0) == 0xE0)
+            i += 3;
+        else if ((c & 0xF8) == 0xF0)
+            i += 4;
+        else
+            return 0;
+        ret++;
+    }
+    return ret;
+}
+
 static void mov_text_text_cb(void *priv, const char *text, int len)
 {
+    uint16_t utf8_len = utf8_strlen(text, len);
     MovTextContext *s = priv;
     av_bprint_append_data(&s->buffer, text, len);
-    s->text_pos += len;
+    // If it's not utf-8, just use the byte length
+    s->text_pos += utf8_len ? utf8_len : len;
 }
 
 static void mov_text_new_line_cb(void *priv, int forced)