From patchwork Sat Mar 24 14:48:36 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: wm4
X-Patchwork-Id: 8140
From: wm4
To: ffmpeg-devel@ffmpeg.org
Cc: wm4
Date: Sat, 24 Mar 2018 15:48:36 +0100
Message-Id: <20180324144836.29296-1-nfxjfg@googlemail.com>
X-Mailer: git-send-email 2.16.1
Subject: [FFmpeg-devel] [PATCH] movtextdec: fix handling of UTF-8 subtitles
List-Id: FFmpeg development discussions and patches

Subtitles which contained styled UTF-8 text (i.e. not just 7 bit ASCII
characters) were not handled correctly. The spec mandates that styling
start/end ranges are in "characters". It's not quite clear what a
"character" is supposed to be, but maybe they mean Unicode codepoints.
FFmpeg's decoder treated the style ranges as byte indexes, which could
lead to UTF-8 sequences being broken, and the common code dropping the
whole subtitle line.

Change this and count codepoints instead. This also means that even if
this is somehow wrong, the decoder won't break UTF-8 sequences anymore.
The sample which led me to investigate this now appears to work
correctly.
---
https://github.com/mpv-player/mpv/issues/5675
---
 libavcodec/movtextdec.c | 50 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/libavcodec/movtextdec.c b/libavcodec/movtextdec.c
index bd19577724..89ac791602 100644
--- a/libavcodec/movtextdec.c
+++ b/libavcodec/movtextdec.c
@@ -326,9 +326,24 @@ static const Box box_types[] = {
 
 const static size_t box_count = FF_ARRAY_ELEMS(box_types);
 
+// Return byte length of the UTF-8 sequence starting at text[0]. 0 on error.
+static int get_utf8_length_at(const char *text, const char *text_end)
+{
+    const char *start = text;
+    int err = 0;
+    uint32_t c;
+    GET_UTF8(c, text < text_end ? (uint8_t)*text++ : (err = 1, 0), goto error;);
+    if (err)
+        goto error;
+    return text - start;
+error:
+    return 0;
+}
+
 static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
-                        MovTextContext *m)
+                        AVCodecContext *avctx)
 {
+    MovTextContext *m = avctx->priv_data;
     int i = 0;
     int j = 0;
     int text_pos = 0;
@@ -342,6 +357,8 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
     }
 
     while (text < text_end) {
+        int len;
+
         if (m->box_flags & STYL_BOX) {
             for (i = 0; i < m->style_entries; i++) {
                 if (m->s[i]->style_flag && text_pos == m->s[i]->style_end) {
@@ -388,17 +405,24 @@ static int text_to_ass(AVBPrint *buf, const char *text, const char *text_end,
             }
         }
 
-        switch (*text) {
-        case '\r':
-            break;
-        case '\n':
-            av_bprintf(buf, "\\N");
-            break;
-        default:
-            av_bprint_chars(buf, *text, 1);
-            break;
+        len = get_utf8_length_at(text, text_end);
+        if (len < 1) {
+            av_log(avctx, AV_LOG_ERROR, "invalid UTF-8 byte in subtitle\n");
+            len = 1;
+        }
+        for (i = 0; i < len; i++) {
+            switch (*text) {
+            case '\r':
+                break;
+            case '\n':
+                av_bprintf(buf, "\\N");
+                break;
+            default:
+                av_bprint_chars(buf, *text, 1);
+                break;
+            }
+            text++;
         }
-        text++;
         text_pos++;
     }
 
@@ -507,10 +531,10 @@ static int mov_text_decode_frame(AVCodecContext *avctx,
             }
             m->tracksize = m->tracksize + tsmb_size;
         }
-        text_to_ass(&buf, ptr, end, m);
+        text_to_ass(&buf, ptr, end, avctx);
         mov_text_cleanup(m);
     } else
-        text_to_ass(&buf, ptr, end, m);
+        text_to_ass(&buf, ptr, end, avctx);
 
     ret = ff_ass_add_rect(sub, buf.str, m->readorder++, 0, NULL, NULL);
     av_bprint_finalize(&buf, NULL);
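
Illustration only, not part of the patch: the commit message distinguishes
style offsets counted in codepoints from offsets counted in bytes. The
standalone sketch below shows one way to map a codepoint index to a byte
offset without FFmpeg's GET_UTF8 macro. The helper names utf8_seq_len() and
codepoint_to_byte_offset() are hypothetical. The validation is deliberately
minimal: it checks only the lead byte and the continuation bytes, and it
assumes, as the patch does, that an invalid byte advances the position by one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not FFmpeg API): byte length of the UTF-8 sequence
 * starting at p, or 0 if the sequence is truncated or malformed. It does
 * not reject overlong encodings or surrogates. */
static size_t utf8_seq_len(const unsigned char *p, const unsigned char *end)
{
    size_t len, i;

    assert(p <= end);
    if (p == end)
        return 0;
    if (*p < 0x80)
        len = 1;
    else if ((*p & 0xE0) == 0xC0)
        len = 2;
    else if ((*p & 0xF0) == 0xE0)
        len = 3;
    else if ((*p & 0xF8) == 0xF0)
        len = 4;
    else
        return 0;                   /* stray continuation or invalid lead */
    if ((size_t)(end - p) < len)
        return 0;                   /* sequence runs past the buffer */
    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80)
            return 0;               /* missing continuation byte */
    return len;
}

/* Hypothetical helper: byte offset of codepoint number `index` in a buffer
 * of `size` bytes. Invalid bytes count as one position each. Returns `size`
 * when `index` lies past the end of the text. */
static size_t codepoint_to_byte_offset(const char *text, size_t size,
                                       size_t index)
{
    const unsigned char *p = (const unsigned char *)text;
    const unsigned char *end = p + size;

    while (p < end && index > 0) {
        size_t len = utf8_seq_len(p, end);
        p += len ? len : 1;         /* skip one byte on invalid input */
        index--;
    }
    return (size_t)(p - (const unsigned char *)text);
}
```

For the 7-byte string "a\xC3\xA9\xE2\x82\xAC" "b" (a, e-acute, euro sign, b),
codepoint index 2 maps to byte offset 3 and index 3 maps to byte offset 6. A
decoder that used the style offsets directly as byte indexes would split the
two-byte and three-byte sequences.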