From patchwork Sat Sep 23 23:20:49 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: llyyr
X-Patchwork-Id: 43884
From: llyyr
To: ffmpeg-devel@ffmpeg.org
Cc: llyyr
Date: Sun, 24 Sep 2023 04:50:49 +0530
Message-ID: <20230923232049.14119-1-llyyr.public@gmail.com>
X-Mailer: git-send-email 2.42.0
Subject: [FFmpeg-devel] [PATCH] avformat/subtitles: check for double BOM in UTF-16 files
List-Id: FFmpeg development discussions and patches
Reply-To: FFmpeg development discussions and patches

While these files
certainly aren't the norm, and might not even be considered valid by many
programs, there are plenty of older ASS tracks in UTF-16 LE/BE encoding
that contain double BOMs.

This patch teaches ff_text_init_avio about double BOMs and makes it check
for them in UTF-16 LE/BE files. It works by reading two more bytes after
the first BOM check and seeking back if a second BOM doesn't exist. If it
does exist, we simply proceed with buf_pos two bytes ahead.

While this hack could live in assdec.c, and would be much simpler that
way, there is no harm in allowing other subtitle format readers to be
aware of double BOMs too.
---
 libavformat/subtitles.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/libavformat/subtitles.c b/libavformat/subtitles.c
index 3413763c7b..a7b83cbb69 100644
--- a/libavformat/subtitles.c
+++ b/libavformat/subtitles.c
@@ -44,6 +44,20 @@ void ff_text_init_avio(void *s, FFTextReader *r, AVIOContext *pb)
             r->buf_pos += 3;
         }
     }
+    if (r->type != FF_UTF_8) {
+        // Check for double BOM in UTF-16 LE/BE files
+        for (i = 0; i < 2; i++)
+            r->buf[r->buf_len++] = avio_r8(r->pb);
+        if (strncmp("\xFF\xFE\xFF\xFE", r->buf, 4) == 0 ||
+            strncmp("\xFE\xFF\xFE\xFF", r->buf, 4) == 0) {
+            // We did find a second BOM, so move buf_pos two bytes ahead
+            r->buf_pos += 2;
+        } else {
+            // We did not find a second BOM, undo the read
+            r->buf_len -= 2;
+            avio_seek(r->pb, -2, SEEK_CUR); // Seek back two bytes
+        }
+    }
     if (s && (r->type == FF_UTF16LE || r->type == FF_UTF16BE))
         av_log(s, AV_LOG_INFO,
                "UTF16 is automatically converted to UTF8, do not specify a character encoding\n");
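
For readers who want to try the double-BOM check outside of FFmpeg, here is
a minimal standalone sketch of the same idea on an in-memory buffer. It is
not the code from this patch: the names bom_type and detect_bom are
hypothetical, it uses memcmp rather than strncmp because the input is raw
bytes, and it reports how many leading bytes to skip rather than reading
from an AVIOContext and seeking back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for this sketch; they are not FFmpeg identifiers. */
enum bom_type { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE };

/*
 * Classify the BOM at the start of buf and store in *skip the number of
 * leading bytes to discard. A UTF-16 BOM that is immediately repeated
 * (a "double BOM") is counted in *skip as well.
 */
static enum bom_type detect_bom(const uint8_t *buf, size_t len, size_t *skip)
{
    static const uint8_t utf8[3] = { 0xEF, 0xBB, 0xBF };
    static const uint8_t le[2]   = { 0xFF, 0xFE };
    static const uint8_t be[2]   = { 0xFE, 0xFF };
    const uint8_t *bom;
    enum bom_type type;

    assert(buf != NULL && skip != NULL);
    *skip = 0;

    if (len >= 3 && memcmp(buf, utf8, 3) == 0) {
        *skip = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && memcmp(buf, le, 2) == 0) {
        bom  = le;
        type = BOM_UTF16LE;
    } else if (len >= 2 && memcmp(buf, be, 2) == 0) {
        bom  = be;
        type = BOM_UTF16BE;
    } else {
        return BOM_NONE;
    }

    *skip = 2;
    /* Only a second BOM of the same byte order is treated as a duplicate. */
    if (len >= 4 && memcmp(buf + 2, bom, 2) == 0)
        *skip = 4;
    return type;
}
```

A caller would pass the first few bytes of the file and start decoding at
buf + skip. Because nothing is consumed from a stream here, there is no
equivalent of the avio_seek() call in the patch.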