diff mbox

[FFmpeg-devel,05/21] avformat/matroskadec: Set offset of first cluster

Message ID 20190327111852.3784-6-andreas.rheinhardt@googlemail.com
State New
Headers show

Commit Message

Diego Felix de Souza via ffmpeg-devel March 27, 2019, 11:18 a.m. UTC
By default, the data_offset member of the AVFormatInternal of the
AVFormatContext associated with the MatroskaDemuxContext has not been
initialized explicitly by any Matroska-specific function, so that it was
initialized by default to the offset at the end of matroska_read_header,
i.e. usually to the offset of the length field of the first encountered
cluster. This meant that in case that the Matroska-specific seek-code
fails because there are no index entries for the target track a seek to
data_offset would be performed and ordinary parsing would start from
there which is nonsense: The length field would be treated as EBML ID and
(if the length field is not longer than four bytes (EBML numbers that
long are rejected as invalid EBML IDs)) and whatever comes next would be
treated as its EBML size although it simply isn't.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@googlemail.com>
---
 libavformat/matroskadec.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Steve Lhomme April 7, 2019, 6:35 a.m. UTC | #1
On 3/27/2019 12:18 PM, Andreas Rheinhardt via ffmpeg-devel wrote:
> By default, the data_offset member of the AVFormatInternal of the
> AVFormatContext associated with the MatroskaDemuxContext has not been
> initialized explicitly by any Matroska-specific function, so that it was
> initialized by default to the offset at the end of matroska_read_header,
> i.e. usually to the offset of the length field of the first encountered
> cluster. This meant that in case that the Matroska-specific seek-code
> fails because there are no index entries for the target track a seek to
> data_offset would be performed and ordinary parsing would start from
> there which is nonsense: The length field would be treated as EBML ID and
> (if the length field is not longer than four bytes (EBML numbers that
> long are rejected as invalid EBML IDs)) and whatever comes next would be
> treated as its EBML size although it simply isn't.
>
> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@googlemail.com>
> ---
>   libavformat/matroskadec.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/libavformat/matroskadec.c b/libavformat/matroskadec.c
> index 49f8ff4082..f9811b54a1 100644
> --- a/libavformat/matroskadec.c
> +++ b/libavformat/matroskadec.c
> @@ -2651,6 +2651,9 @@ static int matroska_read_header(AVFormatContext *s)
>           pos = avio_tell(matroska->ctx->pb);
>           res = ebml_parse(matroska, matroska_segment, matroska);
>       }
> +    /* Set data_offset as it might be needed later by seek_frame_generic. */
> +    if (matroska->current_id)

I'm surprised this doesn't error out if a (level 1) ID is not found here.

> +        s->internal->data_offset = avio_tell(matroska->ctx->pb) - 4;

The "- 4" is OK as long as level 1 elements are always 4 bytes (which is 
the case). But if matroska_resync() ever exits if it finds an EBML Void 
or CRC-32 then this will break.

The code is safe for now but may not be future proof.

>       matroska_execute_seekhead(matroska);
>   
>       if (!matroska->time_scale)
> -- 
> 2.19.2
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
Diego Felix de Souza via ffmpeg-devel April 7, 2019, 8:31 a.m. UTC | #2
Steve Lhomme:
> On 3/27/2019 12:18 PM, Andreas Rheinhardt via ffmpeg-devel wrote:
>> By default, the data_offset member of the AVFormatInternal of the
>> AVFormatContext associated with the MatroskaDemuxContext has not been
>> initialized explicitly by any Matroska-specific function, so that it
>> was
>> initialized by default to the offset at the end of
>> matroska_read_header,
>> i.e. usually to the offset of the length field of the first encountered
>> cluster. This meant that in case that the Matroska-specific seek-code
>> fails because there are no index entries for the target track a seek to
>> data_offset would be performed and ordinary parsing would start from
>> there which is nonsense: The length field would be treated as EBML
>> ID and
>> (if the length field is not longer than four bytes (EBML numbers that
>> long are rejected as invalid EBML IDs)) and whatever comes next
>> would be
>> treated as its EBML size although it simply isn't.
>>
>> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@googlemail.com>
>> ---
>>   libavformat/matroskadec.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/libavformat/matroskadec.c b/libavformat/matroskadec.c
>> index 49f8ff4082..f9811b54a1 100644
>> --- a/libavformat/matroskadec.c
>> +++ b/libavformat/matroskadec.c
>> @@ -2651,6 +2651,9 @@ static int
>> matroska_read_header(AVFormatContext *s)
>>           pos = avio_tell(matroska->ctx->pb);
>>           res = ebml_parse(matroska, matroska_segment, matroska);
>>       }
>> +    /* Set data_offset as it might be needed later by
>> seek_frame_generic. */
>> +    if (matroska->current_id)
> 
> I'm surprised this doesn't error out if a (level 1) ID is not found here.
>There are two ways how ebml_parse can return 1 in the earlier call: It
can find (and consume) a cluster ID; or it can hit EOF if it is in the
mode for parsing live header files (these don't contain clusters at
all). In the latter case, matroska->current_id is unset.

And if we are in the normal (non-live-header) mode and no cluster has
been found, then we will end up at fail anyway (when matroska_resync
eventually hits the end of input).

Thinking about this, there is something wrong with this approach: A
Matroska file needn't contain a cluster at all (e.g. a chapter- or
tags- or attachment-only file is not against the spec if I am not
mistaken). But this is orthogonal to the patch at hand.
>> +        s->internal->data_offset = avio_tell(matroska->ctx->pb) - 4;
> 
> The "- 4" is OK as long as level 1 elements are always 4 bytes (which
> is the case). But if matroska_resync() ever exits if it finds an EBML
> Void or CRC-32 then this will break.
> 
> The code is safe for now but may not be future proof.
> 
As has already been said: If matroska->current_id is set at this
point, then it contains a cluster ID, not an arbitrary ID, because the
IDs that matroska_resync finds are fed to ebml_parse before this check
and a level 1 ID different than a cluster is parsed.

Nevertheless, the check should probably be changed to explicitly check
for a cluster ID, just to make this more readable.
>>       matroska_execute_seekhead(matroska);
>>         if (!matroska->time_scale)
>> -- 
>> 2.19.2
diff mbox

Patch

diff --git a/libavformat/matroskadec.c b/libavformat/matroskadec.c
index 49f8ff4082..f9811b54a1 100644
--- a/libavformat/matroskadec.c
+++ b/libavformat/matroskadec.c
@@ -2651,6 +2651,9 @@  static int matroska_read_header(AVFormatContext *s)
         pos = avio_tell(matroska->ctx->pb);
         res = ebml_parse(matroska, matroska_segment, matroska);
     }
+    /* Set data_offset as it might be needed later by seek_frame_generic. */
+    if (matroska->current_id)
+        s->internal->data_offset = avio_tell(matroska->ctx->pb) - 4;
     matroska_execute_seekhead(matroska);
 
     if (!matroska->time_scale)