Message ID | DM8P223MB03658782AEC7729CC067D6ACBA719@DM8P223MB0365.NAMP223.PROD.OUTLOOK.COM
---|---
State | New
Series | Subtitle Filtering

Context | Check | Description
---|---|---
andriy/make_x86 | success | Make finished
andriy/make_fate_x86 | success | Make fate finished
andriy/make_ppc | success | Make finished
andriy/make_fate_ppc | success | Make fate finished
Hi there softworkz.

Having worked before with OCR filter output, I suggest a modification for your new filter. It's nothing that should delay the patch, just a nice addendum. It could be done in another patch, or I could even do it myself in the future, but I'll leave the comment here anyway for you to consider.

If you take a look at vf_ocr, you'll see that it sets the "lavfi.ocr.confidence" metadata field. Downstream filters can check that field and only consider results above a certain confidence threshold, discarding the rest. This is very useful when doing OCR with non-ASCII chars, as I do with Spanish.

So I propose an option like this:

    { "confidence", "Sets the confidence threshold for valid OCR. Default 80.", OFFSET(confidence), AV_OPT_TYPE_INT, {.i64=80}, 0, 100, FLAGS },

Then you average all the confidences reported by tesseract after OCR but before converting to a text subtitle frame, and compare that average against the option value. Something like this:

    int average = sum_of_all_confidences / number_of_confidence_items;
    if (average >= s->confidence) {
        do_your_thing();
    } else {
        av_log(ctx, AV_LOG_DEBUG, "Confidence average %d under threshold. Text detected: '%s'\n", average, text);
    }

Also, I would like to do some tests with Spanish OCR, as I had to explicitly allowlist our non-ASCII chars when using the OCR filter, and I don't know how yours will behave in that situation. Maybe having the chars allowlist option here too is a good idea.

But, again: none of this should delay the patch, as your work is much more important than these nice-to-have functionalities, which could easily be implemented later by anyone.

Thanks,
Daniel.
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Daniel Cantarín
> Sent: Saturday, December 11, 2021 4:18 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v23 19/21] avfilter/graphicsub2text: Add new graphicsub2text filter (OCR)
>
> So I propose an option like this:
>
> { "confidence", "Sets the confidence threshold for valid OCR. Default 80.", OFFSET(confidence), AV_OPT_TYPE_INT, {.i64=80}, 0, 100, FLAGS },
>
> [...]
>
> Also, I would like to do some tests with spanish OCR, as I had to
> explicitly allowlist our non-ascii chars when using OCR filter, and
> don't know how yours will behave in that situation. Maybe having the
> chars allowlist option here too is a good idea.

Hi Daniel,

I don't think any of that will be necessary. For the generic ocr filter it might make sense, because that filter is meant to work in many different situations: different text sizes, different (not necessarily uniform) backgrounds, static or moving, a wide spectrum of colours, no quantization in the time dimension, etc.

But for subtitle OCR, we have a fixed and static background, we have only about 4 to 32 palette colours, we know when an event starts and that it doesn't change until the next event, and we have a pixel density relative to the text height that is a multiple of what you get when you scan a letter, for example.

Basically, this is a pre-school situation for an OCR. If it can't recognize that reliably and you would end up needing to dissect results by confidence level, then the OCR wouldn't be worth a penny and this filter would be kind of pointless ;-)

IIUC, you haven't tried graphicsub2text yet. I suggest you look at filters.texi for instructions on setting up the model data. There's an example with a test stream that you can run right away. With that example, I haven't been able to spot a single incorrectly recognized character.

Somebody who tried my filter contacted me last week because he was getting rather bad recognition results. It turned out that the text in his case had strong outlines and the inner text was black. After removing the outlines and inverting the text, the recognition result was close to perfect.

The crucial part is the preparation of the image before doing OCR. When this is not done right, you can't remedy it later with confidence level evaluation.

What's working fine already is bright text without outlines. Left for me to do is automatic detection of outline colours and removing those before running recognition.
Second part is detection of the text (fill) color and, depending on that, replacing the transparency with either a light or dark background colour (and inverting in the latter case).

When you get a chance to try it, please let me know about your results.

PS: When positive, post here - otherwise contact me privately... LOL. Just joking, whatever you prefer.

Kind regards,
softworkz
> Hi Daniel,
>
> I don't think that any of that will be necessary. For the generic ocr
> filter, this might make sense, because it is meant to work in
> many different situations [...]
>
> But for subtitle-ocr, we have a fixed and static background, we have
> palette colours from like 4 to 32 only, we know when it starts and
> that it doesn't change until the next event and we have a pixel
> density relative to the text height that is a multiple of what
> you get when you scan a letter for example.

I see. That's a good point: this isn't generic OCR, but pretty specific. I hadn't considered that before.

> Basically, this is like a pre-school situation for an OCR. If it
> can't recognize that in a reliable way and you would end up needing
> to dissect results by confidence level, then the OCR wouldn't be
> worth a penny and this filter kind of pointless ;-)

Well... I respectfully disagree, because reality is pretty effective at messing with common sense, which makes that paragraph simply too optimistic. I'm sure we'll find some subtitle provider with awful fonts and/or subtitling practices sooner rather than later, and that day those words will turn sour.

Yet, I get your point. Please just ignore my previous comments about the new filter. I'll test it properly eventually and give you some feedback. If any change is needed, I'll try to implement it myself, so you don't have to do extra work. But just forget about it in the meantime, as your point stands so far.

> IIUC, you haven't tried graphicsub2text yet. I suggest, you to
> look at filters.texi for instructions to set up the model data.
> (...)
> The crucial part is the preparation of the image before doing
> OCR. When this is not done right, you can't remedy later with
> confidence level evaluation.

I'm aware, thanks. I'm no expert, but I have some experience with this stuff. I'm actually using vf_ocr, taking dvbsubs and doing some alchemy with lavfi: the fps filter for the sparseness (and OCR CPU usage), color tuning, creating a proper background for the OCR process, and so on. I got OK results with image prep, and lots of noise without it. So I kind of know the deal. Insights are cool anyway, and your code gives some good ideas too.

> What's working fine already is bright text without outlines.
> Left for me to do is automatic detection of outline colours
> and removing those before running recognition. Second part is
> detection of the text (fill) color and depending on that - replace
> the transparency either with a light or dark background colour
> (and invert in the latter case).

A bright (white) background with dark (black) characters has given me the best results so far.

> When you get a chance to try, please let me know about your
> results.

Most likely I'll take a look at it next week. It's easier now that you have put a public fork online (in another thread). I'm still getting used to the patch and mailing list dynamics.

> PS: When positive, post here - otherwise contact me privately...LOL
>
> Just joking..whatever you prefer.
>
> Kind regards,
> softworkz

I try not to be rude, because I know it feels awful on the other side, and I value feelings. I also tend to be chatty, in order to try to understand and be understood. However, I fear that replying a lot may be seen as spamming the mailing list, so I'll keep my interactions to a minimum. Please know there are people like me reading your work, even when we may keep silent for different reasons.

Thanks,
Daniel.
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Daniel Cantarín
> Sent: Saturday, December 11, 2021 9:24 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v23 19/21] avfilter/graphicsub2text: Add new graphicsub2text filter (OCR)
>
> Well... I respectfully disagree, because reality's pretty effective
> when it's about messing with common sense, making that paragraph
> simply too optimistic. I'm sure we'll find some subtitle provider
> with awful fonts and/or subtitling practices more sooner than later,
> and that day those words will become sour.

I meant it a bit more seriously than it sounded. If I can't get it to work at a decent degree of reliability, then I have no use for it. I'm optimistic, but the final assessment is yet to be made.

> Yet, I get your point. Please just ignore my previous comments
> about the new filter. I'll test it properly eventually, and give you
> some feedback. If any change is needed, I'll try to apply it myself,
> so you don't have to do extra work. But just forget about it in the
> meantime, as your point stands so far.

In case you have some samples that have proved difficult to recognize and that you could share, please let me know.

> I try not to be rude, because I know it feels awful on the other side,
> and I value feelings. I also tend to be chatty, in order to try to
> understand and be understood. However, I fear replying a lot may be
> seen as spaming the mailing list, so I'll keep my interactions to a
> minimum. Please know there are people like me reading your work, even
> when we may keep silent for different reasons.

This ML has seen so many unpleasant conversations. I think a less serious line should be more than acceptable. Let's follow up privately, though.

Kind regards,
sw
diff --git a/configure b/configure index 73a7267cf1..360e91d762 100755 --- a/configure +++ b/configure @@ -3640,6 +3640,7 @@ frei0r_filter_deps="frei0r" frei0r_src_filter_deps="frei0r" fspp_filter_deps="gpl" gblur_vulkan_filter_deps="vulkan spirv_compiler" +graphicsub2text_filter_deps="libtesseract" hflip_vulkan_filter_deps="vulkan spirv_compiler" histeq_filter_deps="gpl" hqdn3d_filter_deps="gpl" diff --git a/doc/filters.texi b/doc/filters.texi index 743c36c432..26bf6014d0 100644 --- a/doc/filters.texi +++ b/doc/filters.texi @@ -25820,6 +25820,61 @@ ffmpeg -i "https://streams.videolan.org/ffmpeg/mkv_subtitles.mkv" -filter_comple @end example @end itemize +@section graphicsub2text + +Converts graphic subtitles to text subtitles by performing OCR. + +For this filter to be available, ffmpeg needs to be compiled with libtesseract (see https://github.com/tesseract-ocr/tesseract). +Language models need to be downloaded from https://github.com/tesseract-ocr/tessdata and put into as subfolder named 'tessdata' or into a folder specified via the environment variable 'TESSDATA_PREFIX'. +The path can also be specified via filter option (see below). + +Note: These models are including the data for both OCR modes. + +Inputs: +- 0: Subtitles [bitmap] + +Outputs: +- 0: Subtitles [text] + +It accepts the following parameters: + +@table @option +@item ocr_mode +The character recognition mode to use. + +Supported OCR modes are: + +@table @var +@item 0, tesseract +This is the classic libtesseract operation mode. It is fast but less accurate than LSTM. +@item 1, lstm +Newer OCR implementation based on ML models. Provides usually better results, requires more processing resources. +@item 2, both +Use a combination of both modes. +@end table + +@item tessdata_path +The path to a folder containing the language models to be used. + +@item language +The recognition language. It needs to match the first three characters of a language model file in the tessdata path. 
+ +@end table + + +@subsection Examples + +@itemize +@item +Convert DVB graphic subtitles to ASS (text) subtitles + +Note: For this to work, you need to have the data file 'eng.traineddata' in a 'tessdata' subfolder (see above). +@example +ffmpeg ffmpeg -loglevel verbose -i "https://streams.videolan.org/streams/ts/video_subs_ttxt%2Bdvbsub.ts" -filter_complex "[0:13]graphicsub2text=ocr_mode=both" -c:s ass -y output.mkv +@end example +@end itemize + + @section graphicsub2video Renders graphic subtitles as video frames. diff --git a/libavfilter/Makefile b/libavfilter/Makefile index 2224e5fe5f..3b972e134b 100644 --- a/libavfilter/Makefile +++ b/libavfilter/Makefile @@ -296,6 +296,8 @@ OBJS-$(CONFIG_GBLUR_VULKAN_FILTER) += vf_gblur_vulkan.o vulkan.o vulka OBJS-$(CONFIG_GEQ_FILTER) += vf_geq.o OBJS-$(CONFIG_GRADFUN_FILTER) += vf_gradfun.o OBJS-$(CONFIG_GRAPHICSUB2VIDEO_FILTER) += vf_overlaygraphicsubs.o framesync.o +OBJS-$(CONFIG_GRAPHICSUB2TEXT_FILTER) += sf_graphicsub2text.o +OBJS-$(CONFIG_GRAPHICSUB2VIDEO_FILTER) += vf_overlaygraphicsubs.o framesync.o OBJS-$(CONFIG_GRAPHMONITOR_FILTER) += f_graphmonitor.o OBJS-$(CONFIG_GRAYWORLD_FILTER) += vf_grayworld.o OBJS-$(CONFIG_GREYEDGE_FILTER) += vf_colorconstancy.o diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c index 6adde2b9f6..f70f08dc5a 100644 --- a/libavfilter/allfilters.c +++ b/libavfilter/allfilters.c @@ -545,6 +545,7 @@ extern const AVFilter ff_avf_showwaves; extern const AVFilter ff_avf_showwavespic; extern const AVFilter ff_vaf_spectrumsynth; extern const AVFilter ff_sf_censor; +extern const AVFilter ff_sf_graphicsub2text; extern const AVFilter ff_sf_showspeaker; extern const AVFilter ff_sf_splitcc; extern const AVFilter ff_sf_stripstyles; diff --git a/libavfilter/sf_graphicsub2text.c b/libavfilter/sf_graphicsub2text.c new file mode 100644 index 0000000000..ef10d60efd --- /dev/null +++ b/libavfilter/sf_graphicsub2text.c @@ -0,0 +1,354 @@ +/* + * Copyright (c) 2021 softworkz + * + * This file is part 
of FFmpeg. + * + * FFmpeg is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * FFmpeg is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with FFmpeg; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + */ + +/** + * @file + * subtitle filter to convert graphical subs to text subs via OCR + */ + +#include <tesseract/capi.h> +#include <libavutil/ass_internal.h> + +#include "libavutil/opt.h" +#include "subtitles.h" + +typedef struct SubOcrContext { + const AVClass *class; + int w, h; + + TessBaseAPI *tapi; + TessOcrEngineMode ocr_mode; + char *tessdata_path; + char *language; + + int readorder_counter; + + AVFrame *pending_frame; +} SubOcrContext; + + +static int init(AVFilterContext *ctx) +{ + SubOcrContext *s = ctx->priv; + const char* tver = TessVersion(); + int ret; + + s->tapi = TessBaseAPICreate(); + + if (!s->tapi || !tver || !strlen(tver)) { + av_log(ctx, AV_LOG_ERROR, "Failed to access libtesseract\n"); + return AVERROR(ENOSYS); + } + + av_log(ctx, AV_LOG_VERBOSE, "Initializing libtesseract, version: %s\n", tver); + + ret = TessBaseAPIInit4(s->tapi, s->tessdata_path, s->language, s->ocr_mode, NULL, 0, NULL, NULL, 0, 1); + if (ret < 0 ) { + av_log(ctx, AV_LOG_ERROR, "Failed to initialize libtesseract. Error: %d\n", ret); + return AVERROR(ENOSYS); + } + + ret = TessBaseAPISetVariable(s->tapi, "tessedit_char_blacklist", "|"); + if (ret < 0 ) { + av_log(ctx, AV_LOG_ERROR, "Failed to set 'tessedit_char_blacklist'. 
Error: %d\n", ret);
+        return AVERROR(EINVAL);
+    }
+
+    return 0;
+}
+
+static void uninit(AVFilterContext *ctx)
+{
+    SubOcrContext *s = ctx->priv;
+
+    if (s->tapi) {
+        TessBaseAPIEnd(s->tapi);
+        TessBaseAPIDelete(s->tapi);
+    }
+}
+
+static int query_formats(AVFilterContext *ctx)
+{
+    AVFilterFormats *formats, *formats2;
+    AVFilterLink *inlink = ctx->inputs[0];
+    AVFilterLink *outlink = ctx->outputs[0];
+    static const enum AVSubtitleType in_fmts[] = { AV_SUBTITLE_FMT_BITMAP, AV_SUBTITLE_FMT_NONE };
+    static const enum AVSubtitleType out_fmts[] = { AV_SUBTITLE_FMT_ASS, AV_SUBTITLE_FMT_NONE };
+    int ret;
+
+    /* set input format */
+    formats = ff_make_format_list(in_fmts);
+    if ((ret = ff_formats_ref(formats, &inlink->outcfg.formats)) < 0)
+        return ret;
+
+    /* set output format */
+    formats2 = ff_make_format_list(out_fmts);
+    if ((ret = ff_formats_ref(formats2, &outlink->incfg.formats)) < 0)
+        return ret;
+
+    return 0;
+}
+
+static int config_input(AVFilterLink *inlink)
+{
+    AVFilterContext *ctx = inlink->dst;
+    SubOcrContext *s = ctx->priv;
+
+    if (s->w <= 0 || s->h <= 0) {
+        s->w = inlink->w;
+        s->h = inlink->h;
+    }
+    return 0;
+}
+
+static int config_output(AVFilterLink *outlink)
+{
+    const AVFilterContext *ctx = outlink->src;
+    SubOcrContext *s = ctx->priv;
+
+    outlink->format = AV_SUBTITLE_FMT_ASS;
+    outlink->w = s->w;
+    outlink->h = s->h;
+
+    return 0;
+}
+
+static uint8_t* create_grayscale_image(AVFilterContext *ctx, AVSubtitleArea *area)
+{
+    uint8_t gray_pal[256];
+    const size_t img_size = area->buf[0]->size;
+    const uint8_t* img = area->buf[0]->data;
+    uint8_t* gs_img = av_malloc(img_size);
+
+    if (!gs_img)
+        return NULL;
+
+    for (unsigned i = 0; i < 256; i++) {
+        const uint8_t *col = (uint8_t*)&area->pal[i];
+        const int val = (int)col[3] * FFMAX3(col[0], col[1], col[2]);
+        gray_pal[i] = (uint8_t)(val >> 8);
+    }
+
+    for (unsigned i = 0; i < img_size; i++)
+        gs_img[i] = 255 - gray_pal[img[i]];
+
+    return gs_img;
+}
+
+static int convert_area(AVFilterContext *ctx, AVSubtitleArea *area)
+{
+    SubOcrContext *s = ctx->priv;
+    char *ocr_text = NULL;
+    int ret;
+    uint8_t *gs_img = create_grayscale_image(ctx, area);
+
+    if (!gs_img)
+        return AVERROR(ENOMEM);
+
+    area->type = AV_SUBTITLE_FMT_ASS;
+    TessBaseAPISetImage(s->tapi, gs_img, area->w, area->h, 1, area->linesize[0]);
+    TessBaseAPISetSourceResolution(s->tapi, 70);
+
+    ret = TessBaseAPIRecognize(s->tapi, NULL);
+    if (ret == 0)
+        ocr_text = TessBaseAPIGetUTF8Text(s->tapi);
+
+    if (!ocr_text) {
+        av_log(ctx, AV_LOG_WARNING, "OCR didn't return any text. ret=%d\n", ret);
+        area->ass = NULL;
+    }
+    else {
+        const size_t len = strlen(ocr_text);
+
+        if (len > 0 && ocr_text[len - 1] == '\n')
+            ocr_text[len - 1] = 0;
+
+        av_log(ctx, AV_LOG_VERBOSE, "OCR Result: %s\n", ocr_text);
+
+        area->ass = av_strdup(ocr_text);
+
+        TessDeleteText(ocr_text);
+    }
+
+    av_freep(&gs_img);
+    av_buffer_unref(&area->buf[0]);
+    area->type = AV_SUBTITLE_FMT_ASS;
+
+    return 0;
+}
+
+static int filter_frame(AVFilterLink *inlink, AVFrame *frame)
+{
+    AVFilterContext *ctx = inlink->dst;
+    SubOcrContext *s = ctx->priv;
+    AVFilterLink *outlink = inlink->dst->outputs[0];
+    int ret, frame_sent = 0;
+
+    if (s->pending_frame) {
+        const uint64_t pts_diff = frame->subtitle_pts - s->pending_frame->subtitle_pts;
+
+        if (pts_diff == 0) {
+            // This is just a repetition of the previous frame, ignore it
+            av_frame_free(&frame);
+            return 0;
+        }
+
+        s->pending_frame->subtitle_end_time = (uint32_t)(pts_diff / 1000);
+
+        ret = ff_filter_frame(outlink, s->pending_frame);
+        s->pending_frame = NULL;
+        if (ret < 0)
+            return ret;
+
+        frame_sent = 1;
+
+        if (frame->num_subtitle_areas == 0) {
+            // No need to forward this empty frame
+            av_frame_free(&frame);
+            return 0;
+        }
+    }
+
+    ret = av_frame_make_writable(frame);
+
+    if (ret < 0) {
+        av_frame_free(&frame);
+        return ret;
+    }
+
+    frame->format = AV_SUBTITLE_FMT_ASS;
+
+    av_log(ctx, AV_LOG_DEBUG, "filter_frame sub_pts: %"PRIu64", start_time: %d, end_time: %d, num_areas: %d\n",
+           frame->subtitle_pts, frame->subtitle_start_time, frame->subtitle_end_time, frame->num_subtitle_areas);
+
+    if (frame->num_subtitle_areas > 1 &&
+        frame->subtitle_areas[0]->y > frame->subtitle_areas[frame->num_subtitle_areas - 1]->y) {
+
+        for (unsigned i = 0; i < frame->num_subtitle_areas / 2; i++)
+            FFSWAP(AVSubtitleArea*, frame->subtitle_areas[i], frame->subtitle_areas[frame->num_subtitle_areas - i - 1]);
+    }
+
+    for (unsigned i = 0; i < frame->num_subtitle_areas; i++) {
+        AVSubtitleArea *area = frame->subtitle_areas[i];
+
+        ret = convert_area(ctx, area);
+        if (ret < 0)
+            return ret;
+
+        if (area->ass && area->ass[0] != '\0') {
+            char *tmp = area->ass;
+
+            if (i == 0)
+                area->ass = avpriv_ass_get_dialog(s->readorder_counter++, 0, "Default", NULL, tmp);
+            else {
+                AVSubtitleArea* area0 = frame->subtitle_areas[0];
+                char* tmp2 = area0->ass;
+                area0->ass = av_asprintf("%s\\N%s", area0->ass, tmp);
+                av_free(tmp2);
+                area->ass = NULL;
+            }
+
+            av_free(tmp);
+        }
+    }
+
+    if (frame->num_subtitle_areas > 1) {
+        for (unsigned i = 1; i < frame->num_subtitle_areas; i++) {
+            AVSubtitleArea* area = frame->subtitle_areas[i];
+
+            for (unsigned n = 0; n < FF_ARRAY_ELEMS(area->buf); n++)
+                av_buffer_unref(&area->buf[n]);
+
+            av_freep(&area->text);
+            av_freep(&area->ass);
+            av_freep(&frame->subtitle_areas[i]);
+        }
+
+        AVSubtitleArea* area0 = frame->subtitle_areas[0];
+        av_freep(&frame->subtitle_areas);
+        frame->subtitle_areas = av_malloc_array(1, sizeof(AVSubtitleArea*));
+        frame->subtitle_areas[0] = area0;
+        frame->num_subtitle_areas = 1;
+    }
+
+    // When decoders can't determine the end time, they set it either to UINT32_MAX
+    // or to 30s (dvbsub).
+    if (frame->num_subtitle_areas > 0 && frame->subtitle_end_time >= 30000) {
+        // Can't send it without an end time; wait for the next frame to determine the end display time
+        s->pending_frame = frame;
+
+        if (frame_sent)
+            return 0;
+
+        // To keep the graph running, send an empty frame instead
+        frame = ff_get_subtitles_buffer(outlink, AV_SUBTITLE_FMT_ASS);
+        if (!frame)
+            return AVERROR(ENOMEM);
+
+        av_frame_copy_props(frame, s->pending_frame);
+        frame->subtitle_end_time = 1;
+    }
+
+    return ff_filter_frame(outlink, frame);
+}
+
+#define OFFSET(x) offsetof(SubOcrContext, x)
+#define FLAGS (AV_OPT_FLAG_SUBTITLE_PARAM | AV_OPT_FLAG_FILTERING_PARAM)
+
+static const AVOption graphicsub2text_options[] = {
+    { "ocr_mode",      "set ocr mode",             OFFSET(ocr_mode),      AV_OPT_TYPE_INT,    {.i64=OEM_TESSERACT_ONLY}, OEM_TESSERACT_ONLY, 2, FLAGS, "ocr_mode" },
+    {   "tesseract",   "classic tesseract ocr",    0,                     AV_OPT_TYPE_CONST,  {.i64=OEM_TESSERACT_ONLY}, 0, 0, FLAGS, "ocr_mode" },
+    {   "lstm",        "lstm (ML based)",          0,                     AV_OPT_TYPE_CONST,  {.i64=OEM_LSTM_ONLY}, 0, 0, FLAGS, "ocr_mode" },
+    {   "both",        "use both models combined", 0,                     AV_OPT_TYPE_CONST,  {.i64=OEM_TESSERACT_LSTM_COMBINED}, 0, 0, FLAGS, "ocr_mode" },
+    { "tessdata_path", "path to tesseract data",   OFFSET(tessdata_path), AV_OPT_TYPE_STRING, {.str = NULL}, 0, 0, FLAGS, NULL },
+    { "language",      "ocr language",             OFFSET(language),      AV_OPT_TYPE_STRING, {.str = "eng"}, 0, 0, FLAGS, NULL },
+    { NULL },
+};
+
+AVFILTER_DEFINE_CLASS(graphicsub2text);
+
+static const AVFilterPad inputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_SUBTITLE,
+        .filter_frame = filter_frame,
+        .config_props = config_input,
+    },
+};
+
+static const AVFilterPad outputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_SUBTITLE,
+        .config_props = config_output,
+    },
+};
+
+const AVFilter ff_sf_graphicsub2text = {
+    .name          = "graphicsub2text",
+    .description   = NULL_IF_CONFIG_SMALL("Convert graphical subtitles to text subtitles via OCR"),
+    .init          = init,
+    .uninit        = uninit,
+    .priv_size     = sizeof(SubOcrContext),
+    .priv_class    = &graphicsub2text_class,
+    FILTER_INPUTS(inputs),
+    FILTER_OUTPUTS(outputs),
+    FILTER_QUERY_FUNC(query_formats),
+};
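For reference, the preprocessing done in create_grayscale_image() can be modeled standalone: each palette entry is reduced to an alpha-weighted brightness and then inverted, so opaque, bright glyph pixels come out dark on a white background, the polarity Tesseract generally handles best. The helper below is a hypothetical sketch (not part of the patch) that assumes the palette bytes are ordered so that index 3 holds alpha, as the filter's code does:

```c
#include <stdint.h>

/* Hypothetical model of the per-palette-entry mapping in
 * create_grayscale_image(): gray = 255 - ((alpha * max(r,g,b)) >> 8).
 * Transparent background pixels map to 255 (white); opaque bright
 * glyph pixels map to values near 0 (black). */
static uint8_t gray_from_palette(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    uint8_t max = r > g ? r : g;
    if (b > max)
        max = b;

    /* alpha-weighted brightness, range 0..65025, scaled back to 0..255 */
    const int val = (int)a * max;

    /* invert so text is dark on white */
    return (uint8_t)(255 - (val >> 8));
}
```

Note that the mapping depends only on alpha and the brightest channel, so fully transparent pixels always become white regardless of color.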
Signed-off-by: softworkz <softworkz@hotmail.com>
---
 configure                        |   1 +
 doc/filters.texi                 |  55 +++++
 libavfilter/Makefile             |   2 +
 libavfilter/allfilters.c         |   1 +
 libavfilter/sf_graphicsub2text.c | 354 +++++++++++++++++++++++++++++++
 5 files changed, 413 insertions(+)
 create mode 100644 libavfilter/sf_graphicsub2text.c
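The end-time handling in filter_frame() above can also be sketched in isolation. This is a hypothetical model (not code from the patch) assuming subtitle_pts is in microseconds and subtitle_end_time in milliseconds, which matches the pts_diff / 1000 conversion in the patch:

```c
#include <stdint.h>

/* Duration of a pending subtitle event, derived from the pts distance to
 * the next event: microsecond pts difference converted to milliseconds. */
static uint32_t derive_end_time_ms(uint64_t prev_pts, uint64_t next_pts)
{
    return (uint32_t)((next_pts - prev_pts) / 1000);
}

/* End times of 30s or more are treated as "unknown": decoders that cannot
 * determine the display duration report either UINT32_MAX or a fixed 30s
 * (dvbsub), which is what triggers the pending-frame path in filter_frame(). */
static int end_time_is_unknown(uint32_t end_time_ms)
{
    return end_time_ms >= 30000;
}
```

For example, two consecutive events 3.5 seconds apart in pts give the first event a 3500 ms display duration, while any reported end time of 30000 ms or more causes the frame to be held back until the next event arrives.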