diff mbox

[FFmpeg-devel] avcodec/htmlsubtitles: Be a bit more picky on syntax

Message ID 20170701220942.20818-1-michael@niedermayer.cc
State New
Headers show

Commit Message

Michael Niedermayer July 1, 2017, 10:09 p.m. UTC
This reduces the number of strstr() calls per byte
This diasalows empty tags like '< >' as well as '<' in tags like '<ab<cd<<ef>'

Fixes timeout
Fixes: 1817/clusterfuzz-testcase-minimized-5104230530547712

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
---
 libavcodec/htmlsubtitles.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

wm4 July 2, 2017, 11:14 a.m. UTC | #1
On Sun,  2 Jul 2017 00:09:42 +0200
Michael Niedermayer <michael@niedermayer.cc> wrote:

> This reduces the number of strstr() calls per byte
> This diasalows empty tags like '< >' as well as '<' in tags like '<ab<cd<<ef>'
> 
> Fixes timeout
> Fixes: 1817/clusterfuzz-testcase-minimized-5104230530547712
> 
> Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> ---
>  libavcodec/htmlsubtitles.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/libavcodec/htmlsubtitles.c b/libavcodec/htmlsubtitles.c
> index be5c9316ca..67abc94085 100644
> --- a/libavcodec/htmlsubtitles.c
> +++ b/libavcodec/htmlsubtitles.c
> @@ -110,13 +110,13 @@ int ff_htmlmarkup_to_ass(void *log_ctx, AVBPrint *dst, const char *in)
>          case '<':
>              tag_close = in[1] == '/';
>              len = 0;
> -            if (sscanf(in+tag_close+1, "%127[^>]>%n", buffer, &len) >= 1 && len > 0) {
> +            if (sscanf(in+tag_close+1, "%127[^<>]>%n", buffer, &len) >= 1 && len > 0) {
>                  const char *tagname = buffer;
>                  while (*tagname == ' ')
>                      tagname++;
>                  if ((param = strchr(tagname, ' ')))
>                      *param++ = 0;
> -                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack)) ||
> +                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack) && *tagname) ||
>                      ( tag_close && sptr > 0 && !strcmp(stack[sptr-1].tag, tagname))) {
>                      int i, j, unknown = 0;
>                      in += len + tag_close;

Invalid syntax is not unusual in SRT files. Are you sure this doesn't
make the output worse in files that do not use the syntax correctly?
Michael Niedermayer July 2, 2017, 11:43 a.m. UTC | #2
On Sun, Jul 02, 2017 at 01:14:00PM +0200, wm4 wrote:
> On Sun,  2 Jul 2017 00:09:42 +0200
> Michael Niedermayer <michael@niedermayer.cc> wrote:
> 
> > This reduces the number of strstr() calls per byte
> > This diasalows empty tags like '< >' as well as '<' in tags like '<ab<cd<<ef>'
> > 
> > Fixes timeout
> > Fixes: 1817/clusterfuzz-testcase-minimized-5104230530547712
> > 
> > Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> > Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> > ---
> >  libavcodec/htmlsubtitles.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/libavcodec/htmlsubtitles.c b/libavcodec/htmlsubtitles.c
> > index be5c9316ca..67abc94085 100644
> > --- a/libavcodec/htmlsubtitles.c
> > +++ b/libavcodec/htmlsubtitles.c
> > @@ -110,13 +110,13 @@ int ff_htmlmarkup_to_ass(void *log_ctx, AVBPrint *dst, const char *in)
> >          case '<':
> >              tag_close = in[1] == '/';
> >              len = 0;
> > -            if (sscanf(in+tag_close+1, "%127[^>]>%n", buffer, &len) >= 1 && len > 0) {
> > +            if (sscanf(in+tag_close+1, "%127[^<>]>%n", buffer, &len) >= 1 && len > 0) {
> >                  const char *tagname = buffer;
> >                  while (*tagname == ' ')
> >                      tagname++;
> >                  if ((param = strchr(tagname, ' ')))
> >                      *param++ = 0;
> > -                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack)) ||
> > +                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack) && *tagname) ||
> >                      ( tag_close && sptr > 0 && !strcmp(stack[sptr-1].tag, tagname))) {
> >                      int i, j, unknown = 0;
> >                      in += len + tag_close;
> 
> Invalid syntax is not unusual in SRT files. Are you sure this doesn't
> make the output worse in files that do not use the syntax correctly?

I do not know if this makes the output worse for files with invalid syntax
I also do not know if this makes the output better for files with invalid
syntax

i dont have a large library of (real world invalid syntax) srt files
whith which to find cases where it makes a difference.

You seem to know alot about srt, maybe you can pick some srt files
and add fate tests, so we have better coverage of odd and nasty cases

about this patch specifically, what do you suggest ?
should i try to fix this while maintaining existing behavior for
invalid syntax exactly? (i dont know how that code would look)

[...]
Paul B Mahol July 3, 2017, 9:44 a.m. UTC | #3
On 7/2/17, Michael Niedermayer <michael@niedermayer.cc> wrote:
> On Sun, Jul 02, 2017 at 01:14:00PM +0200, wm4 wrote:
>> On Sun,  2 Jul 2017 00:09:42 +0200
>> Michael Niedermayer <michael@niedermayer.cc> wrote:
>>
>> > This reduces the number of strstr() calls per byte
>> > This diasalows empty tags like '< >' as well as '<' in tags like
>> > '<ab<cd<<ef>'
>> >
>> > Fixes timeout
>> > Fixes: 1817/clusterfuzz-testcase-minimized-5104230530547712
>> >
>> > Found-by: continuous fuzzing process
>> > https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
>> > Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
>> > ---
>> >  libavcodec/htmlsubtitles.c | 4 ++--
>> >  1 file changed, 2 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/libavcodec/htmlsubtitles.c b/libavcodec/htmlsubtitles.c
>> > index be5c9316ca..67abc94085 100644
>> > --- a/libavcodec/htmlsubtitles.c
>> > +++ b/libavcodec/htmlsubtitles.c
>> > @@ -110,13 +110,13 @@ int ff_htmlmarkup_to_ass(void *log_ctx, AVBPrint
>> > *dst, const char *in)
>> >          case '<':
>> >              tag_close = in[1] == '/';
>> >              len = 0;
>> > -            if (sscanf(in+tag_close+1, "%127[^>]>%n", buffer, &len) >=
>> > 1 && len > 0) {
>> > +            if (sscanf(in+tag_close+1, "%127[^<>]>%n", buffer, &len) >=
>> > 1 && len > 0) {
>> >                  const char *tagname = buffer;
>> >                  while (*tagname == ' ')
>> >                      tagname++;
>> >                  if ((param = strchr(tagname, ' ')))
>> >                      *param++ = 0;
>> > -                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack)) ||
>> > +                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack) &&
>> > *tagname) ||
>> >                      ( tag_close && sptr > 0 &&
>> > !strcmp(stack[sptr-1].tag, tagname))) {
>> >                      int i, j, unknown = 0;
>> >                      in += len + tag_close;
>>
>> Invalid syntax is not unusual in SRT files. Are you sure this doesn't
>> make the output worse in files that do not use the syntax correctly?
>
> I do not know if this makes the output worse for files with invalid syntax
> I also do not know if this makes the output better for files with invalid
> syntax
>
> i dont have a large library of (real world invalid syntax) srt files
> whith which to find cases where it makes a difference.
>
> You seem to know alot about srt, maybe you can pick some srt files
> and add fate tests, so we have better coverage of odd and nasty cases
>
> about this patch specifically, what do you suggest ?
> should i try to fix this while maintaining existing behavior for
> invalid syntax exactly? (i dont know how that code would look)

WHat's so wrong with code that you want it changed in bad way?
wm4 July 3, 2017, 10:40 a.m. UTC | #4
On Sun, 2 Jul 2017 13:43:57 +0200
Michael Niedermayer <michael@niedermayer.cc> wrote:

> On Sun, Jul 02, 2017 at 01:14:00PM +0200, wm4 wrote:
> > On Sun,  2 Jul 2017 00:09:42 +0200
> > Michael Niedermayer <michael@niedermayer.cc> wrote:
> >   
> > > This reduces the number of strstr() calls per byte
> > > This diasalows empty tags like '< >' as well as '<' in tags like '<ab<cd<<ef>'
> > > 
> > > Fixes timeout
> > > Fixes: 1817/clusterfuzz-testcase-minimized-5104230530547712
> > > 
> > > Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> > > Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> > > ---
> > >  libavcodec/htmlsubtitles.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/libavcodec/htmlsubtitles.c b/libavcodec/htmlsubtitles.c
> > > index be5c9316ca..67abc94085 100644
> > > --- a/libavcodec/htmlsubtitles.c
> > > +++ b/libavcodec/htmlsubtitles.c
> > > @@ -110,13 +110,13 @@ int ff_htmlmarkup_to_ass(void *log_ctx, AVBPrint *dst, const char *in)
> > >          case '<':
> > >              tag_close = in[1] == '/';
> > >              len = 0;
> > > -            if (sscanf(in+tag_close+1, "%127[^>]>%n", buffer, &len) >= 1 && len > 0) {
> > > +            if (sscanf(in+tag_close+1, "%127[^<>]>%n", buffer, &len) >= 1 && len > 0) {
> > >                  const char *tagname = buffer;
> > >                  while (*tagname == ' ')
> > >                      tagname++;
> > >                  if ((param = strchr(tagname, ' ')))
> > >                      *param++ = 0;
> > > -                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack)) ||
> > > +                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack) && *tagname) ||
> > >                      ( tag_close && sptr > 0 && !strcmp(stack[sptr-1].tag, tagname))) {
> > >                      int i, j, unknown = 0;
> > >                      in += len + tag_close;  
> > 
> > Invalid syntax is not unusual in SRT files. Are you sure this doesn't
> > make the output worse in files that do not use the syntax correctly?  
> 
> I do not know if this makes the output worse for files with invalid syntax
> I also do not know if this makes the output better for files with invalid
> syntax
> 
> i dont have a large library of (real world invalid syntax) srt files
> whith which to find cases where it makes a difference.
> 
> You seem to know alot about srt, maybe you can pick some srt files
> and add fate tests, so we have better coverage of odd and nasty cases
> 
> about this patch specifically, what do you suggest ?
> should i try to fix this while maintaining existing behavior for
> invalid syntax exactly? (i dont know how that code would look)
> 
> [...]

I don't know either, but due to the messy syntax situation, I'd rather
not change the code in unpredictable ways just to get faster fuzzing.
Michael Niedermayer July 3, 2017, 10:44 a.m. UTC | #5
On Mon, Jul 03, 2017 at 11:44:26AM +0200, Paul B Mahol wrote:
> On 7/2/17, Michael Niedermayer <michael@niedermayer.cc> wrote:
> > On Sun, Jul 02, 2017 at 01:14:00PM +0200, wm4 wrote:
> >> On Sun,  2 Jul 2017 00:09:42 +0200
> >> Michael Niedermayer <michael@niedermayer.cc> wrote:
> >>
> >> > This reduces the number of strstr() calls per byte
> >> > This diasalows empty tags like '< >' as well as '<' in tags like
> >> > '<ab<cd<<ef>'
> >> >
> >> > Fixes timeout
> >> > Fixes: 1817/clusterfuzz-testcase-minimized-5104230530547712
> >> >
> >> > Found-by: continuous fuzzing process
> >> > https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> >> > Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> >> > ---
> >> >  libavcodec/htmlsubtitles.c | 4 ++--
> >> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/libavcodec/htmlsubtitles.c b/libavcodec/htmlsubtitles.c
> >> > index be5c9316ca..67abc94085 100644
> >> > --- a/libavcodec/htmlsubtitles.c
> >> > +++ b/libavcodec/htmlsubtitles.c
> >> > @@ -110,13 +110,13 @@ int ff_htmlmarkup_to_ass(void *log_ctx, AVBPrint
> >> > *dst, const char *in)
> >> >          case '<':
> >> >              tag_close = in[1] == '/';
> >> >              len = 0;
> >> > -            if (sscanf(in+tag_close+1, "%127[^>]>%n", buffer, &len) >=
> >> > 1 && len > 0) {
> >> > +            if (sscanf(in+tag_close+1, "%127[^<>]>%n", buffer, &len) >=
> >> > 1 && len > 0) {
> >> >                  const char *tagname = buffer;
> >> >                  while (*tagname == ' ')
> >> >                      tagname++;
> >> >                  if ((param = strchr(tagname, ' ')))
> >> >                      *param++ = 0;
> >> > -                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack)) ||
> >> > +                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack) &&
> >> > *tagname) ||
> >> >                      ( tag_close && sptr > 0 &&
> >> > !strcmp(stack[sptr-1].tag, tagname))) {
> >> >                      int i, j, unknown = 0;
> >> >                      in += len + tag_close;
> >>
> >> Invalid syntax is not unusual in SRT files. Are you sure this doesn't
> >> make the output worse in files that do not use the syntax correctly?
> >
> > I do not know if this makes the output worse for files with invalid syntax
> > I also do not know if this makes the output better for files with invalid
> > syntax
> >
> > i dont have a large library of (real world invalid syntax) srt files
> > whith which to find cases where it makes a difference.
> >
> > You seem to know alot about srt, maybe you can pick some srt files
> > and add fate tests, so we have better coverage of odd and nasty cases
> >
> > about this patch specifically, what do you suggest ?
> > should i try to fix this while maintaining existing behavior for
> > invalid syntax exactly? (i dont know how that code would look)
> 
> WHat's so wrong with code that you want it changed in bad way?

IIRC
If you have tags like <<<<<<<<<<<<<<<<<<<<<<a> you end up checking
<<<<<<<<<<<<<<<<<<<<<<a>
<<<<<<<<<<<<<<<<<<<<<a>
<<<<<<<<<<<<<<<<<<<<a>
...

then form the closing tag
</<<<<<<<<<<<<<<<<<<<<<a>
</<<<<<<<<<<<<<<<<<<<<a>
</<<<<<<<<<<<<<<<<<<<a>
...

and then search the whole input for each closing tag

On top of this there are special cases and limitations like
fixed size buffers and a fixed size stack, so it will not always do
all the steps above.

But in practice this is slow. The same issue should happen on real
files if they contain syntax issues and conatin a lot of unescaped '<'


[...]
Michael Niedermayer July 3, 2017, 3:50 p.m. UTC | #6
On Mon, Jul 03, 2017 at 12:40:13PM +0200, wm4 wrote:
> On Sun, 2 Jul 2017 13:43:57 +0200
> Michael Niedermayer <michael@niedermayer.cc> wrote:
> 
> > On Sun, Jul 02, 2017 at 01:14:00PM +0200, wm4 wrote:
> > > On Sun,  2 Jul 2017 00:09:42 +0200
> > > Michael Niedermayer <michael@niedermayer.cc> wrote:
> > >   
> > > > This reduces the number of strstr() calls per byte
> > > > This diasalows empty tags like '< >' as well as '<' in tags like '<ab<cd<<ef>'
> > > > 
> > > > Fixes timeout
> > > > Fixes: 1817/clusterfuzz-testcase-minimized-5104230530547712
> > > > 
> > > > Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> > > > Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> > > > ---
> > > >  libavcodec/htmlsubtitles.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/libavcodec/htmlsubtitles.c b/libavcodec/htmlsubtitles.c
> > > > index be5c9316ca..67abc94085 100644
> > > > --- a/libavcodec/htmlsubtitles.c
> > > > +++ b/libavcodec/htmlsubtitles.c
> > > > @@ -110,13 +110,13 @@ int ff_htmlmarkup_to_ass(void *log_ctx, AVBPrint *dst, const char *in)
> > > >          case '<':
> > > >              tag_close = in[1] == '/';
> > > >              len = 0;
> > > > -            if (sscanf(in+tag_close+1, "%127[^>]>%n", buffer, &len) >= 1 && len > 0) {
> > > > +            if (sscanf(in+tag_close+1, "%127[^<>]>%n", buffer, &len) >= 1 && len > 0) {
> > > >                  const char *tagname = buffer;
> > > >                  while (*tagname == ' ')
> > > >                      tagname++;
> > > >                  if ((param = strchr(tagname, ' ')))
> > > >                      *param++ = 0;
> > > > -                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack)) ||
> > > > +                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack) && *tagname) ||
> > > >                      ( tag_close && sptr > 0 && !strcmp(stack[sptr-1].tag, tagname))) {
> > > >                      int i, j, unknown = 0;
> > > >                      in += len + tag_close;  
> > > 
> > > Invalid syntax is not unusual in SRT files. Are you sure this doesn't
> > > make the output worse in files that do not use the syntax correctly?  
> > 
> > I do not know if this makes the output worse for files with invalid syntax
> > I also do not know if this makes the output better for files with invalid
> > syntax
> > 
> > i dont have a large library of (real world invalid syntax) srt files
> > whith which to find cases where it makes a difference.
> > 
> > You seem to know alot about srt, maybe you can pick some srt files
> > and add fate tests, so we have better coverage of odd and nasty cases
> > 
> > about this patch specifically, what do you suggest ?
> > should i try to fix this while maintaining existing behavior for
> > invalid syntax exactly? (i dont know how that code would look)
> > 
> > [...]
> 
> I don't know either, but due to the messy syntax situation, I'd rather
> not change the code in unpredictable ways just to get faster fuzzing.

Its primarly intended to fix the potential denial of service.
That is moderate sized files with short duration, few streams, low
resolution, low channel count should not consume huge amounts of
cpu time.


[...]
Michael Niedermayer July 8, 2017, 9:07 p.m. UTC | #7
On Mon, Jul 03, 2017 at 05:50:08PM +0200, Michael Niedermayer wrote:
> On Mon, Jul 03, 2017 at 12:40:13PM +0200, wm4 wrote:
> > On Sun, 2 Jul 2017 13:43:57 +0200
> > Michael Niedermayer <michael@niedermayer.cc> wrote:
> > 
> > > On Sun, Jul 02, 2017 at 01:14:00PM +0200, wm4 wrote:
> > > > On Sun,  2 Jul 2017 00:09:42 +0200
> > > > Michael Niedermayer <michael@niedermayer.cc> wrote:
> > > >   
> > > > > This reduces the number of strstr() calls per byte
> > > > > This diasalows empty tags like '< >' as well as '<' in tags like '<ab<cd<<ef>'
> > > > > 
> > > > > Fixes timeout
> > > > > Fixes: 1817/clusterfuzz-testcase-minimized-5104230530547712
> > > > > 
> > > > > Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
> > > > > Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
> > > > > ---
> > > > >  libavcodec/htmlsubtitles.c | 4 ++--
> > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/libavcodec/htmlsubtitles.c b/libavcodec/htmlsubtitles.c
> > > > > index be5c9316ca..67abc94085 100644
> > > > > --- a/libavcodec/htmlsubtitles.c
> > > > > +++ b/libavcodec/htmlsubtitles.c
> > > > > @@ -110,13 +110,13 @@ int ff_htmlmarkup_to_ass(void *log_ctx, AVBPrint *dst, const char *in)
> > > > >          case '<':
> > > > >              tag_close = in[1] == '/';
> > > > >              len = 0;
> > > > > -            if (sscanf(in+tag_close+1, "%127[^>]>%n", buffer, &len) >= 1 && len > 0) {
> > > > > +            if (sscanf(in+tag_close+1, "%127[^<>]>%n", buffer, &len) >= 1 && len > 0) {
> > > > >                  const char *tagname = buffer;
> > > > >                  while (*tagname == ' ')
> > > > >                      tagname++;
> > > > >                  if ((param = strchr(tagname, ' ')))
> > > > >                      *param++ = 0;
> > > > > -                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack)) ||
> > > > > +                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack) && *tagname) ||
> > > > >                      ( tag_close && sptr > 0 && !strcmp(stack[sptr-1].tag, tagname))) {
> > > > >                      int i, j, unknown = 0;
> > > > >                      in += len + tag_close;  
> > > > 
> > > > Invalid syntax is not unusual in SRT files. Are you sure this doesn't
> > > > make the output worse in files that do not use the syntax correctly?  
> > > 
> > > I do not know if this makes the output worse for files with invalid syntax
> > > I also do not know if this makes the output better for files with invalid
> > > syntax
> > > 
> > > i dont have a large library of (real world invalid syntax) srt files
> > > whith which to find cases where it makes a difference.
> > > 
> > > You seem to know alot about srt, maybe you can pick some srt files
> > > and add fate tests, so we have better coverage of odd and nasty cases
> > > 
> > > about this patch specifically, what do you suggest ?
> > > should i try to fix this while maintaining existing behavior for
> > > invalid syntax exactly? (i dont know how that code would look)
> > > 
> > > [...]
> > 
> > I don't know either, but due to the messy syntax situation, I'd rather
> > not change the code in unpredictable ways just to get faster fuzzing.
> 
> Its primarly intended to fix the potential denial of service.
> That is moderate sized files with short duration, few streams, low
> resolution, low channel count should not consume huge amounts of
> cpu time.

5 days passed with no comments and no suggestions on how to implement
this better.
does anyone object to the fix as in the patch to be applied ?

thx

[...]
Clément Bœsch July 9, 2017, 12:28 p.m. UTC | #8
On Sat, Jul 08, 2017 at 11:07:48PM +0200, Michael Niedermayer wrote:
[...]
> 5 days passed with no comments and no suggestions on how to implement
> this better.
> does anyone object to the fix as in the patch to be applied ?

misc patterns I found you may want to give a try (note that these are
completely broken garbage you may want to ignore):

82
00:04:50,430 --> 00:05:01,030
<<flash gordon
<E9>pisode 4
saison 1>
www.SeriesSub.com>


57
00:03:38,318 --> 00:03:42,778
<<THE HIGH ROLLERS>>
<<Grandes Jogadores>>


2
00:01:10,000 --> 00:01:14,500
>>> RebelSubTeam <<<


306
00:20:31,849 --> 00:20:56,839
<font color="##FFFF00">
<<<<www.egfire.com>>>></font>


37
00:02:37,750 --> 00:02:43,700
<b>~ASUKO MARCH!~
>>:<<
translation by: cangii
Retiming by: furransu</b>


1214
02:11:00,738 --> 02:11:05,538
<b><<< ÓÍÀÊÑ ÒÈÉÌ < 2015 > UNACS TEAM >>></b>


1
00:00:01,000 --> 00:00:02,308
<<<<<< Antes <<<<<<
Ela é hum...


544
00:53:43,780 --> 00:53:45,941
<<That's OK, >> - he calmed himself.


545
00:53:46,216 --> 00:53:49,083
<<lt's not a long way to the hotel,
the hotel is within easy reach.


894
00:37:59,693 --> 00:38:01,591
mint asztalt foglaltatni
a <<>Le Cirque-ben.


I really have a LOT of matches, so these are very common patterns.

Feel free to apply your patch if it generally doesn't make these cases
worse.

Regards,
Clément Bœsch July 9, 2017, 12:32 p.m. UTC | #9
On Sun, Jul 09, 2017 at 02:28:28PM +0200, Clément Bœsch wrote:
[...]
> 544
> 00:53:43,780 --> 00:53:45,941
> <<That's OK, >> - he calmed himself.
> 

Side note here: "<<" and ">>" are used for « and ». It's important to not
break this case IMO.

> 
> 545
> 00:53:46,216 --> 00:53:49,083
> <<lt's not a long way to the hotel,
> the hotel is within easy reach.
> 

Same here, the confusion between 'l' and 'I' makes me think it's the
result of an OCR, so the use of these ascii quotation should be pretty
common.
Michael Niedermayer July 11, 2017, 9:18 p.m. UTC | #10
On Sun, Jul 09, 2017 at 02:28:28PM +0200, Clément Bœsch wrote:
> On Sat, Jul 08, 2017 at 11:07:48PM +0200, Michael Niedermayer wrote:
> [...]
> > 5 days passed with no comments and no suggestions on how to implement
> > this better.
> > does anyone object to the fix as in the patch to be applied ?
> 
> misc patterns I found you may want to give a try (note that these are
> completely broken garbage you may want to ignore):
> 
> 82
> 00:04:50,430 --> 00:05:01,030
> <<flash gordon
> <E9>pisode 4
> saison 1>
> www.SeriesSub.com>
> 
> 
> 57
> 00:03:38,318 --> 00:03:42,778
> <<THE HIGH ROLLERS>>
> <<Grandes Jogadores>>
> 
> 
> 2
> 00:01:10,000 --> 00:01:14,500
> >>> RebelSubTeam <<<
> 
> 
> 306
> 00:20:31,849 --> 00:20:56,839
> <font color="##FFFF00">
> <<<<www.egfire.com>>>></font>
> 
> 
> 37
> 00:02:37,750 --> 00:02:43,700
> <b>~ASUKO MARCH!~
> >>:<<
> translation by: cangii
> Retiming by: furransu</b>
> 
> 
> 1214
> 02:11:00,738 --> 02:11:05,538
> <b><<< ÓÍÀÊÑ ÒÈÉÌ < 2015 > UNACS TEAM >>></b>
> 
> 
> 1
> 00:00:01,000 --> 00:00:02,308
> <<<<<< Antes <<<<<<
> Ela é hum...
> 
> 
> 544
> 00:53:43,780 --> 00:53:45,941
> <<That's OK, >> - he calmed himself.
> 
> 
> 545
> 00:53:46,216 --> 00:53:49,083
> <<lt's not a long way to the hotel,
> the hotel is within easy reach.
> 
> 
> 894
> 00:37:59,693 --> 00:38:01,591
> mint asztalt foglaltatni
> a <<>Le Cirque-ben.
> 
> 
> I really have a LOT of matches, so these are very common patterns.
> 
> Feel free to apply your patch if it generally doesn't make these cases
> worse.

patch posted that adds this to fate
ill upload the srt file (copy and paste of above)

the patch seems not to change the output for this
i will apply the fate test before the patch here

[...]
Michael Niedermayer July 18, 2017, 8:11 p.m. UTC | #11
On Sun, Jul 09, 2017 at 02:28:28PM +0200, Clément Bœsch wrote:
> On Sat, Jul 08, 2017 at 11:07:48PM +0200, Michael Niedermayer wrote:
> [...]
> > 5 days passed with no comments and no suggestions on how to implement
> > this better.
> > does anyone object to the fix as in the patch to be applied ?
> 
> misc patterns I found you may want to give a try (note that these are
> completely broken garbage you may want to ignore):
> 
> 82
> 00:04:50,430 --> 00:05:01,030
> <<flash gordon
> <E9>pisode 4
> saison 1>
> www.SeriesSub.com>
> 
> 
> 57
> 00:03:38,318 --> 00:03:42,778
> <<THE HIGH ROLLERS>>
> <<Grandes Jogadores>>
> 
> 
> 2
> 00:01:10,000 --> 00:01:14,500
> >>> RebelSubTeam <<<
> 
> 
> 306
> 00:20:31,849 --> 00:20:56,839
> <font color="##FFFF00">
> <<<<www.egfire.com>>>></font>
> 
> 
> 37
> 00:02:37,750 --> 00:02:43,700
> <b>~ASUKO MARCH!~
> >>:<<
> translation by: cangii
> Retiming by: furransu</b>
> 
> 
> 1214
> 02:11:00,738 --> 02:11:05,538
> <b><<< ÓÍÀÊÑ ÒÈÉÌ < 2015 > UNACS TEAM >>></b>
> 
> 
> 1
> 00:00:01,000 --> 00:00:02,308
> <<<<<< Antes <<<<<<
> Ela é hum...
> 
> 
> 544
> 00:53:43,780 --> 00:53:45,941
> <<That's OK, >> - he calmed himself.
> 
> 
> 545
> 00:53:46,216 --> 00:53:49,083
> <<lt's not a long way to the hotel,
> the hotel is within easy reach.
> 
> 
> 894
> 00:37:59,693 --> 00:38:01,591
> mint asztalt foglaltatni
> a <<>Le Cirque-ben.
> 
> 
> I really have a LOT of matches, so these are very common patterns.
> 
> Feel free to apply your patch if it generally doesn't make these cases
> worse.

the patch doesnt change the fate test,
applied

thanks

[...]
diff mbox

Patch

diff --git a/libavcodec/htmlsubtitles.c b/libavcodec/htmlsubtitles.c
index be5c9316ca..67abc94085 100644
--- a/libavcodec/htmlsubtitles.c
+++ b/libavcodec/htmlsubtitles.c
@@ -110,13 +110,13 @@  int ff_htmlmarkup_to_ass(void *log_ctx, AVBPrint *dst, const char *in)
         case '<':
             tag_close = in[1] == '/';
             len = 0;
-            if (sscanf(in+tag_close+1, "%127[^>]>%n", buffer, &len) >= 1 && len > 0) {
+            if (sscanf(in+tag_close+1, "%127[^<>]>%n", buffer, &len) >= 1 && len > 0) {
                 const char *tagname = buffer;
                 while (*tagname == ' ')
                     tagname++;
                 if ((param = strchr(tagname, ' ')))
                     *param++ = 0;
-                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack)) ||
+                if ((!tag_close && sptr < FF_ARRAY_ELEMS(stack) && *tagname) ||
                     ( tag_close && sptr > 0 && !strcmp(stack[sptr-1].tag, tagname))) {
                     int i, j, unknown = 0;
                     in += len + tag_close;