diff mbox series

[FFmpeg-devel,v23,19/21] avfilter/graphicsub2text: Add new graphicsub2text filter (OCR)

Message ID DM8P223MB03658782AEC7729CC067D6ACBA719@DM8P223MB0365.NAMP223.PROD.OUTLOOK.COM
State New
Headers show
Series Subtitle Filtering | expand

Checks

Context Check Description
andriy/make_x86 success Make finished
andriy/make_fate_x86 success Make fate finished
andriy/make_ppc success Make finished
andriy/make_fate_ppc success Make fate finished

Commit Message

Soft Works Dec. 10, 2021, 9:37 p.m. UTC
Signed-off-by: softworkz <softworkz@hotmail.com>
---
 configure                        |   1 +
 doc/filters.texi                 |  55 +++++
 libavfilter/Makefile             |   1 +
 libavfilter/allfilters.c         |   1 +
 libavfilter/sf_graphicsub2text.c | 354 +++++++++++++++++++++++++++++++
 5 files changed, 412 insertions(+)
 create mode 100644 libavfilter/sf_graphicsub2text.c

Comments

Daniel Cantarín Dec. 11, 2021, 3:17 p.m. UTC | #1
Hi there softworkz.

Having worked with OCR filter output before, I'd like to suggest a 
modification to your new filter.
It's not something that should delay the patch, just a nice addendum. 
It could be done in another patch, or I could even do it myself in the 
future. But I'll leave the comment here anyway, for you to consider.

If you take a look at vf_ocr, you'll see that it sets the 
"lavfi.ocr.confidence" metadata field.
Well... downstream filters can check that field in order to keep only 
results above a certain confidence threshold, discarding the rest.
This is very useful when doing OCR with non-ASCII chars, as I do with 
Spanish-language content.
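
For reference, this is roughly how such a field can be consumed
downstream with the metadata filter (a sketch from memory, untested as
written here; mode=print just logs the value for each frame):

   ffmpeg -i input.ts -vf "ocr=language=spa,metadata=mode=print:key=lavfi.ocr.confidence" -f null -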

So I propose an option like this:

   { "confidence", "Sets the confidence threshold for valid OCR. Default 
80." , OFFSET(confidence), AV_OPT_TYPE_INT, {.i64=80}, 0, 100, FLAGS },

Then you would average all the confidences reported by Tesseract after 
OCR, but before converting to a text subtitle frame, and compare the 
average against that option's value.
Something like this:

   int average = sum_of_all_confidences / number_of_confidence_items;
   if (average >= s->confidence) {
     do_your_thing();
   } else {
     av_log(ctx, AV_LOG_DEBUG, "Confidence average %d under threshold. Text detected: '%s'\n",
            average, text);
   }
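
(If I remember the libtesseract C API correctly, it can also hand you
the mean directly, so the summing might not even be needed; a minimal
sketch, untested:

   /* mean word confidence of the last recognition, range 0-100 */
   int average = TessBaseAPIMeanTextConf(s->tapi);

But don't quote me on the exact semantics.)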

Also, I would like to do some tests with Spanish OCR, as I had to 
explicitly allowlist our non-ASCII chars when using the OCR filter, and 
I don't know how yours will behave in that situation. Maybe having the 
chars allowlist option here too is a good idea. But, again: none of 
this should delay the patch, as your work is much more important than 
this kind of nice-to-have functionality, which could easily be 
implemented later by anyone.

Thanks,
Daniel.
Soft Works Dec. 11, 2021, 5:39 p.m. UTC | #2
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Daniel
> Cantarín
> Sent: Saturday, December 11, 2021 4:18 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v23 19/21] avfilter/graphicsub2text: Add
> new graphicsub2text filter (OCR)
> 
> Hi there softworkz.
> 
> Having worked with OCR filter output before, I'd like to suggest a
> modification to your new filter.
> It's not something that should delay the patch, just a nice addendum.
> It could be done in another patch, or I could even do it myself in the
> future. But I'll leave the comment here anyway, for you to consider.
> 
> If you take a look at vf_ocr, you'll see that it sets the
> "lavfi.ocr.confidence" metadata field.
> Well... downstream filters can check that field in order to keep only
> results above a certain confidence threshold, discarding the rest.
> This is very useful when doing OCR with non-ASCII chars, as I do with
> Spanish-language content.
> 
> So I propose an option like this:
> 
>    { "confidence", "Sets the confidence threshold for valid OCR. Default
> 80." , OFFSET(confidence), AV_OPT_TYPE_INT, {.i64=80}, 0, 100, FLAGS },
> 
> Then you would average all the confidences reported by Tesseract after
> OCR, but before converting to a text subtitle frame, and compare the
> average against that option's value.
> Something like this:
> 
>    int average = sum_of_all_confidences / number_of_confidence_items;
>    if (average >= s->confidence) {
>      do_your_thing();
>    } else {
>      av_log(ctx, AV_LOG_DEBUG, "Confidence average %d under threshold. Text detected: '%s'\n",
>             average, text);
>    }
> 
> Also, I would like to do some tests with Spanish OCR, as I had to
> explicitly allowlist our non-ASCII chars when using the OCR filter, and
> I don't know how yours will behave in that situation. Maybe having the
> chars allowlist option here too is a good idea. But, again: none of
> this should delay the patch, as your work is much more important than
> this kind of nice-to-have functionality, which could easily be
> implemented later by anyone.
> 

Hi Daniel,

I don't think any of that will be necessary. For the generic ocr 
filter, this might make sense, because it is meant to work in 
many different situations: different text sizes, different (not 
necessarily uniform) backgrounds, static or moving, a wide spectrum
of colours, no quantization in the time dimension, etc.

But for subtitle OCR, we have a fixed and static background, we have
only 4 to 32 palette colours, we know when the text starts and that
it doesn't change until the next event, and we have a pixel density
relative to the text height that is a multiple of what you get when
you scan a letter, for example.

Basically, this is like a pre-school situation for an OCR. If it 
can't recognize that reliably and you ended up needing to dissect
the results by confidence level, then the OCR wouldn't be worth a 
penny and this filter would be kind of pointless ;-)

IIUC, you haven't tried graphicsub2text yet. I suggest you look
at filters.texi for instructions on setting up the model data.
There's an example with a test stream that you can run right
away. With that example, I haven't been able to spot a single 
incorrectly recognized character.

Somebody who tried my filter contacted me last week because he 
was getting rather bad recognition results. It turned out that
the text in his case had strong outlines and the inner text was
black. After removing the outlines and inverting the text, the
recognition result was close to perfect.

The crucial part is the preparation of the image before doing
OCR. When this is not done right, you can't remedy it later with
confidence-level evaluation.

What's already working fine is bright text without outlines.
Left for me to do is automatic detection of outline colours and
removing them before running recognition. The second part is
detecting the text (fill) colour and, depending on that, replacing
the transparency with either a light or a dark background colour
(and inverting in the latter case).
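
To make the second part a bit more concrete, the idea is roughly the
following (an untested sketch; the fill-colour detection itself is
still open, so the detected colour is just taken as a parameter here):

    /* Sketch: ensure tesseract always sees dark glyphs on a bright
     * canvas. gs_img is the 8-bit grayscale image produced by
     * create_grayscale_image(), fill_color is the detected text
     * colour as RGBA bytes. */
    static void normalize_polarity(uint8_t *gs_img, size_t img_size,
                                   const uint8_t fill_color[4])
    {
        const int luma = FFMAX3(fill_color[0], fill_color[1], fill_color[2]);

        if (luma < 128) {
            /* dark text: flip it onto a light background */
            for (size_t i = 0; i < img_size; i++)
                gs_img[i] = 255 - gs_img[i];
        }
        /* bright text already comes out dark-on-light from the
         * existing grayscale conversion */
    }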

When you get a chance to try, please let me know about your 
results.

PS: If positive, post here - otherwise contact me privately... LOL

Just joking... whatever you prefer.

Kind regards,
softworkz
Daniel Cantarín Dec. 11, 2021, 8:23 p.m. UTC | #3
> Hi Daniel,
> 
> I don't think any of that will be necessary. For the generic ocr
> filter, this might make sense, because it is meant to work in
> many different situations: different text sizes, different (not
> necessarily uniform) backgrounds, static or moving, a wide spectrum
> of colours, no quantization in the time dimension, etc.
>
> But for subtitle OCR, we have a fixed and static background, we have
> only 4 to 32 palette colours, we know when the text starts and that
> it doesn't change until the next event, and we have a pixel density
> relative to the text height that is a multiple of what you get when
> you scan a letter, for example.
>

I see. That's a good point: this isn't generic OCR, but pretty
specific. I hadn't considered that before.

> 
> Basically, this is like a pre-school situation for an OCR. If it
> can't recognize that reliably and you ended up needing to dissect
> the results by confidence level, then the OCR wouldn't be worth a
> penny and this filter would be kind of pointless ;-)
>

Well... I respectfully disagree, because reality is pretty effective
at messing with common sense, which makes that paragraph simply too
optimistic. I'm sure we'll find some subtitle provider with awful
fonts and/or subtitling practices sooner rather than later, and on
that day those words will turn sour.

Yet, I get your point. Please just ignore my previous comments
about the new filter. I'll test it properly eventually, and give you
some feedback. If any change is needed, I'll try to apply it myself,
so you don't have to do extra work. But just forget about it in the
meantime, as your point stands so far.

>
> IIUC, you haven't tried graphicsub2text yet. I suggest you look
> at filters.texi for instructions on setting up the model data.
> (...)
> The crucial part is the preparation of the image before doing
> OCR. When this is not done right, you can't remedy it later with
> confidence-level evaluation.
> 

I'm aware, thanks. I'm no expert, but I have some experience with this stuff.

I'm actually using vf_ocr, taking dvbsubs and doing some alchemy with
lavfi: the fps filter for the sparseness (and OCR CPU usage), color
tuning, creating a proper background for the OCR process, and so on.
I got OK results with image prep, and lots of noise without it, so I
kinda know the deal. Insights are cool anyway, and your code gives
some good ideas too.
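
In case it helps picture it, the graph is roughly of this shape
(reconstructed from memory and untested as written here; the whitelist
is abbreviated, the real one also contains the plain ASCII set):

   ffmpeg -i input.ts -filter_complex \
     "color=white:size=720x576 [bg]; \
      [bg][0:s] overlay, fps=2, ocr=language=spa:whitelist=áéíóúñÁÉÍÓÚÑ¿¡" \
     -f null -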

> 
> What's already working fine is bright text without outlines.
> Left for me to do is automatic detection of outline colours and
> removing them before running recognition. The second part is
> detecting the text (fill) colour and, depending on that, replacing
> the transparency with either a light or a dark background colour
> (and inverting in the latter case).
> 

A bright (white) background with dark (black) characters has given me
the best results so far.

> 
> When you get a chance to try, please let me know about your 
> results.
> 

Most likely I'll take a look at it next week. It's easier now that
you've put a public fork online (in another thread). I'm still getting
used to the patch and mailing-list dynamics.

>
> PS: If positive, post here - otherwise contact me privately... LOL
> 
> Just joking... whatever you prefer.
> 
> Kind regards,
> softworkz


I try not to be rude, because I know it feels awful on the other side,
and I value feelings. I also tend to be chatty, in order to try to
understand and be understood. However, I fear that replying a lot may
be seen as spamming the mailing list, so I'll keep my interactions to
a minimum. Please know there are people like me reading your work,
even when we keep silent for different reasons.


Thanks,
Daniel.
Soft Works Dec. 11, 2021, 8:45 p.m. UTC | #4
> -----Original Message-----
> From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Daniel
> Cantarín
> Sent: Saturday, December 11, 2021 9:24 PM
> To: ffmpeg-devel@ffmpeg.org
> Subject: Re: [FFmpeg-devel] [PATCH v23 19/21] avfilter/graphicsub2text: Add
> new graphicsub2text filter (OCR)
> 
> > Basically, this is like a pre-school situation for an OCR. If it
> > can't recognize that reliably and you ended up needing to dissect
> > the results by confidence level, then the OCR wouldn't be worth a
> > penny and this filter would be kind of pointless ;-)
> >
> 
> Well... I respectfully disagree, because reality is pretty effective
> at messing with common sense, which makes that paragraph simply too
> optimistic. I'm sure we'll find some subtitle provider with awful
> fonts and/or subtitling practices sooner rather than later, and on
> that day those words will turn sour.

I meant that a bit more seriously than it sounded. If I can't get it
to work with a decent degree of reliability, then I have no use for
it. I'm optimistic, but the final assessment is yet to be made.

> Yet, I get your point. Please just ignore my previous comments
> about the new filter. I'll test it properly eventually, and give you
> some feedback. If any change is needed, I'll try to apply it myself,
> so you don't have to do extra work. But just forget about it in the
> meantime, as your point stands so far.

In case you have samples that have proven difficult to recognize and
that you could share, please let me know.

> I try not to be rude, because I know it feels awful on the other side,
> and I value feelings. I also tend to be chatty, in order to try to
> understand and be understood. However, I fear that replying a lot may
> be seen as spamming the mailing list, so I'll keep my interactions to
> a minimum. Please know there are people like me reading your work,
> even when we keep silent for different reasons.

This ML has seen so many unpleasant conversations; I think a less 
serious tone should be more than acceptable.

Let's follow up privately, though.

Kind regards,
sw
diff mbox series

Patch

diff --git a/configure b/configure
index 73a7267cf1..360e91d762 100755
--- a/configure
+++ b/configure
@@ -3640,6 +3640,7 @@  frei0r_filter_deps="frei0r"
 frei0r_src_filter_deps="frei0r"
 fspp_filter_deps="gpl"
 gblur_vulkan_filter_deps="vulkan spirv_compiler"
+graphicsub2text_filter_deps="libtesseract"
 hflip_vulkan_filter_deps="vulkan spirv_compiler"
 histeq_filter_deps="gpl"
 hqdn3d_filter_deps="gpl"
diff --git a/doc/filters.texi b/doc/filters.texi
index 743c36c432..26bf6014d0 100644
--- a/doc/filters.texi
+++ b/doc/filters.texi
@@ -25820,6 +25820,61 @@  ffmpeg -i "https://streams.videolan.org/ffmpeg/mkv_subtitles.mkv" -filter_comple
 @end example
 @end itemize
 
+@section graphicsub2text
+
+Converts graphic subtitles to text subtitles by performing OCR.
+
+For this filter to be available, ffmpeg needs to be compiled with libtesseract (see https://github.com/tesseract-ocr/tesseract).
+Language models need to be downloaded from https://github.com/tesseract-ocr/tessdata and put into a subfolder named 'tessdata' or into a folder specified via the environment variable 'TESSDATA_PREFIX'.
+The path can also be specified via a filter option (see below).
+
+Note: These models include the data for both OCR modes.
+
+Inputs:
+- 0: Subtitles [bitmap]
+
+Outputs:
+- 0: Subtitles [text]
+
+It accepts the following parameters:
+
+@table @option
+@item ocr_mode
+The character recognition mode to use.
+
+Supported OCR modes are:
+
+@table @var
+@item 0, tesseract
+This is the classic libtesseract operation mode. It is fast but less accurate than LSTM.
+@item 1, lstm
+Newer OCR implementation based on ML models. It usually provides better results but requires more processing resources.
+@item 2, both
+Use a combination of both modes.
+@end table
+
+@item tessdata_path
+The path to a folder containing the language models to be used.
+
+@item language
+The recognition language. It needs to match the first three characters of a language model file in the tessdata path.
+
+@end table
+
+
+@subsection Examples
+
+@itemize
+@item
+Convert DVB graphic subtitles to ASS (text) subtitles
+
+Note: For this to work, you need to have the data file 'eng.traineddata' in a 'tessdata' subfolder (see above).
+@example
+ffmpeg -loglevel verbose -i "https://streams.videolan.org/streams/ts/video_subs_ttxt%2Bdvbsub.ts" -filter_complex "[0:13]graphicsub2text=ocr_mode=both" -c:s ass -y output.mkv
+@end example
+@end itemize
+
+
 @section graphicsub2video
 
 Renders graphic subtitles as video frames.
diff --git a/libavfilter/Makefile b/libavfilter/Makefile
index 2224e5fe5f..3b972e134b 100644
--- a/libavfilter/Makefile
+++ b/libavfilter/Makefile
@@ -296,6 +296,7 @@  OBJS-$(CONFIG_GBLUR_VULKAN_FILTER)           += vf_gblur_vulkan.o vulkan.o vulka
 OBJS-$(CONFIG_GEQ_FILTER)                    += vf_geq.o
 OBJS-$(CONFIG_GRADFUN_FILTER)                += vf_gradfun.o
 OBJS-$(CONFIG_GRAPHICSUB2VIDEO_FILTER)       += vf_overlaygraphicsubs.o framesync.o
+OBJS-$(CONFIG_GRAPHICSUB2TEXT_FILTER)        += sf_graphicsub2text.o
 OBJS-$(CONFIG_GRAPHMONITOR_FILTER)           += f_graphmonitor.o
 OBJS-$(CONFIG_GRAYWORLD_FILTER)              += vf_grayworld.o
 OBJS-$(CONFIG_GREYEDGE_FILTER)               += vf_colorconstancy.o
diff --git a/libavfilter/allfilters.c b/libavfilter/allfilters.c
index 6adde2b9f6..f70f08dc5a 100644
--- a/libavfilter/allfilters.c
+++ b/libavfilter/allfilters.c
@@ -545,6 +545,7 @@  extern const AVFilter ff_avf_showwaves;
 extern const AVFilter ff_avf_showwavespic;
 extern const AVFilter ff_vaf_spectrumsynth;
 extern const AVFilter ff_sf_censor;
+extern const AVFilter ff_sf_graphicsub2text;
 extern const AVFilter ff_sf_showspeaker;
 extern const AVFilter ff_sf_splitcc;
 extern const AVFilter ff_sf_stripstyles;
diff --git a/libavfilter/sf_graphicsub2text.c b/libavfilter/sf_graphicsub2text.c
new file mode 100644
index 0000000000..ef10d60efd
--- /dev/null
+++ b/libavfilter/sf_graphicsub2text.c
@@ -0,0 +1,354 @@ 
+/*
+ * Copyright (c) 2021 softworkz
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/**
+ * @file
+ * subtitle filter to convert graphical subs to text subs via OCR
+ */
+
+#include <tesseract/capi.h>
+#include <libavutil/ass_internal.h>
+
+#include "libavutil/opt.h"
+#include "subtitles.h"
+
+typedef struct SubOcrContext {
+    const AVClass *class;
+    int w, h;
+
+    TessBaseAPI *tapi;
+    TessOcrEngineMode ocr_mode;
+    char *tessdata_path;
+    char *language;
+
+    int readorder_counter;
+
+    AVFrame *pending_frame;
+} SubOcrContext;
+
+
+static int init(AVFilterContext *ctx)
+{
+    SubOcrContext *s = ctx->priv;
+    const char* tver = TessVersion();
+    int ret;
+
+    s->tapi = TessBaseAPICreate();
+
+    if (!s->tapi || !tver || !strlen(tver)) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to access libtesseract\n");
+        return AVERROR(ENOSYS);
+    }
+
+    av_log(ctx, AV_LOG_VERBOSE, "Initializing libtesseract, version: %s\n", tver);
+
+    ret = TessBaseAPIInit4(s->tapi, s->tessdata_path, s->language, s->ocr_mode, NULL, 0, NULL, NULL, 0, 1);
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to initialize libtesseract. Error: %d\n", ret);
+        return AVERROR(ENOSYS);
+    }
+
+    ret = TessBaseAPISetVariable(s->tapi, "tessedit_char_blacklist", "|");
+    if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to set 'tessedit_char_blacklist'. Error: %d\n", ret);
+        return AVERROR(EINVAL);
+    }
+
+    return 0;
+}
+
+static void uninit(AVFilterContext *ctx)
+{
+    SubOcrContext *s = ctx->priv;
+
+    if (s->tapi) {
+        TessBaseAPIEnd(s->tapi);
+        TessBaseAPIDelete(s->tapi);
+    }
+}
+
+static int query_formats(AVFilterContext *ctx)
+{
+    AVFilterFormats *formats, *formats2;
+    AVFilterLink *inlink = ctx->inputs[0];
+    AVFilterLink *outlink = ctx->outputs[0];
+    static const enum AVSubtitleType in_fmts[] = { AV_SUBTITLE_FMT_BITMAP, AV_SUBTITLE_FMT_NONE };
+    static const enum AVSubtitleType out_fmts[] = { AV_SUBTITLE_FMT_ASS, AV_SUBTITLE_FMT_NONE };
+    int ret;
+
+    /* set input format */
+    formats = ff_make_format_list(in_fmts);
+    if ((ret = ff_formats_ref(formats, &inlink->outcfg.formats)) < 0)
+        return ret;
+
+    /* set output format */
+    formats2 = ff_make_format_list(out_fmts);
+    if ((ret = ff_formats_ref(formats2, &outlink->incfg.formats)) < 0)
+        return ret;
+
+    return 0;
+}
+
+static int config_input(AVFilterLink *inlink)
+{
+    AVFilterContext *ctx = inlink->dst;
+    SubOcrContext *s = ctx->priv;
+
+    if (s->w <= 0 || s->h <= 0) {
+        s->w = inlink->w;
+        s->h = inlink->h;
+    }
+    return 0;
+}
+
+static int config_output(AVFilterLink *outlink)
+{
+    const AVFilterContext *ctx  = outlink->src;
+    SubOcrContext *s = ctx->priv;
+
+    outlink->format = AV_SUBTITLE_FMT_ASS;
+    outlink->w = s->w;
+    outlink->h = s->h;
+
+    return 0;
+}
+
+static uint8_t* create_grayscale_image(AVFilterContext *ctx, AVSubtitleArea *area)
+{
+    uint8_t gray_pal[256];
+    const size_t img_size = area->buf[0]->size;
+    const uint8_t* img    = area->buf[0]->data;
+    uint8_t* gs_img       = av_malloc(img_size);
+
+    if (!gs_img)
+        return NULL;
+
+    for (unsigned i = 0; i < 256; i++) {
+        const uint8_t *col = (uint8_t*)&area->pal[i];
+        const int val      = (int)col[3] * FFMAX3(col[0], col[1], col[2]);
+        gray_pal[i]        = (uint8_t)(val >> 8);
+    }
+
+    for (unsigned i = 0; i < img_size; i++)
+        gs_img[i] = 255 - gray_pal[img[i]];
+
+    return gs_img;
+}
+
+static int convert_area(AVFilterContext *ctx, AVSubtitleArea *area)
+{
+    SubOcrContext *s = ctx->priv;
+    char *ocr_text = NULL;
+    int ret;
+    uint8_t *gs_img = create_grayscale_image(ctx, area);
+
+    if (!gs_img)
+        return AVERROR(ENOMEM);
+
+    area->type = AV_SUBTITLE_FMT_ASS;
+    TessBaseAPISetImage(s->tapi, gs_img, area->w, area->h, 1, area->linesize[0]);
+    TessBaseAPISetSourceResolution(s->tapi, 70);
+
+    ret = TessBaseAPIRecognize(s->tapi, NULL);
+    if (ret == 0)
+        ocr_text = TessBaseAPIGetUTF8Text(s->tapi);
+
+    if (!ocr_text) {
+        av_log(ctx, AV_LOG_WARNING, "OCR didn't return any text. ret=%d\n", ret);
+        area->ass = NULL;
+    }
+    else {
+        const size_t len = strlen(ocr_text);
+
+        if (len > 0 && ocr_text[len - 1] == '\n')
+            ocr_text[len - 1] = 0;
+
+        av_log(ctx, AV_LOG_VERBOSE, "OCR Result: %s\n", ocr_text);
+
+        area->ass = av_strdup(ocr_text);
+
+        TessDeleteText(ocr_text);
+    }
+
+    av_freep(&gs_img);
+    av_buffer_unref(&area->buf[0]);
+    area->type = AV_SUBTITLE_FMT_ASS;
+
+    return 0;
+}
+
+static int filter_frame(AVFilterLink *inlink, AVFrame *frame)
+{
+    AVFilterContext *ctx = inlink->dst;
+    SubOcrContext *s = ctx->priv;
+    AVFilterLink *outlink = inlink->dst->outputs[0];
+    int ret, frame_sent = 0;
+
+    if (s->pending_frame) {
+        const uint64_t pts_diff = frame->subtitle_pts - s->pending_frame->subtitle_pts;
+
+        if (pts_diff == 0) {
+            // This is just a repetition of the previous frame, ignore it
+            av_frame_free(&frame);
+            return 0;
+        }
+
+        s->pending_frame->subtitle_end_time = (uint32_t)(pts_diff / 1000);
+
+        ret = ff_filter_frame(outlink, s->pending_frame);
+        s->pending_frame = NULL;
+        if (ret < 0)
+            return ret;
+
+        frame_sent = 1;
+
+        if (frame->num_subtitle_areas == 0) {
+            // No need to forward this empty frame
+            av_frame_free(&frame);
+            return 0;
+        }
+    }
+
+    ret = av_frame_make_writable(frame);
+
+    if (ret < 0) {
+        av_frame_free(&frame);
+        return ret;
+    }
+
+    frame->format = AV_SUBTITLE_FMT_ASS;
+
+    av_log(ctx, AV_LOG_DEBUG, "filter_frame sub_pts: %"PRIu64", start_time: %d, end_time: %d, num_areas: %d\n",
+        frame->subtitle_pts, frame->subtitle_start_time, frame->subtitle_end_time, frame->num_subtitle_areas);
+
+    if (frame->num_subtitle_areas > 1 &&
+        frame->subtitle_areas[0]->y > frame->subtitle_areas[frame->num_subtitle_areas - 1]->y) {
+
+        for (unsigned i = 0; i < frame->num_subtitle_areas / 2; i++)
+            FFSWAP(AVSubtitleArea*, frame->subtitle_areas[i], frame->subtitle_areas[frame->num_subtitle_areas - i - 1]);
+    }
+
+    for (unsigned i = 0; i < frame->num_subtitle_areas; i++) {
+        AVSubtitleArea *area = frame->subtitle_areas[i];
+
+        ret = convert_area(ctx, area);
+        if (ret < 0)
+            return ret;
+
+        if (area->ass && area->ass[0] != '\0') {
+            char *tmp = area->ass;
+
+            if (i == 0)
+                area->ass = avpriv_ass_get_dialog(s->readorder_counter++, 0, "Default", NULL, tmp);
+            else {
+                AVSubtitleArea* area0 = frame->subtitle_areas[0];
+                char* tmp2 = area0->ass;
+                area0->ass = av_asprintf("%s\\N%s", area0->ass, tmp);
+                av_free(tmp2);
+                area->ass = NULL;
+            }
+
+            av_free(tmp);
+        }
+    }
+
+    if (frame->num_subtitle_areas > 1) {
+        for (unsigned i = 1; i < frame->num_subtitle_areas; i++) {
+            AVSubtitleArea* area = frame->subtitle_areas[i];
+
+            for (unsigned n = 0; n < FF_ARRAY_ELEMS(area->buf); n++)
+                av_buffer_unref(&area->buf[n]);
+
+            av_freep(&area->text);
+            av_freep(&area->ass);
+            av_freep(&frame->subtitle_areas[i]);
+        }
+
+        AVSubtitleArea* area0 = frame->subtitle_areas[0];
+        av_freep(&frame->subtitle_areas);
+        frame->subtitle_areas = av_malloc_array(1, sizeof(AVSubtitleArea*));
+        frame->subtitle_areas[0] = area0;
+        frame->num_subtitle_areas = 1;
+    }
+
+    // When decoders can't determine the end time, they set it either to UINT32_MAX
+    // or to 30s (dvbsub).
+    if (frame->num_subtitle_areas > 0 && frame->subtitle_end_time >= 30000) {
+        // Can't send it without an end time; wait for the next frame to determine the display end time
+        s->pending_frame = frame;
+
+        if (frame_sent)
+            return 0;
+
+        // To keep things going, send an empty frame instead
+        frame = ff_get_subtitles_buffer(outlink, AV_SUBTITLE_FMT_ASS);
+        if (!frame)
+            return AVERROR(ENOMEM);
+
+        av_frame_copy_props(frame, s->pending_frame);
+        frame->subtitle_end_time = 1;
+    }
+
+    return ff_filter_frame(outlink, frame);
+}
+
+#define OFFSET(x) offsetof(SubOcrContext, x)
+#define FLAGS (AV_OPT_FLAG_SUBTITLE_PARAM | AV_OPT_FLAG_FILTERING_PARAM)
+
+static const AVOption graphicsub2text_options[] = {
+    { "ocr_mode",       "set ocr mode",                  OFFSET(ocr_mode),      AV_OPT_TYPE_INT,    {.i64=OEM_TESSERACT_ONLY},          OEM_TESSERACT_ONLY, 2, FLAGS, "ocr_mode" },
+    {   "tesseract",    "classic tesseract ocr",         0,                     AV_OPT_TYPE_CONST,  {.i64=OEM_TESSERACT_ONLY},          0,                  0, FLAGS, "ocr_mode" },
+    {   "lstm",         "lstm (ML based)",               0,                     AV_OPT_TYPE_CONST,  {.i64=OEM_LSTM_ONLY},               0,                  0, FLAGS, "ocr_mode" },
+    {   "both",         "use both models combined",      0,                     AV_OPT_TYPE_CONST,  {.i64=OEM_TESSERACT_LSTM_COMBINED}, 0,                  0, FLAGS, "ocr_mode" },
+    { "tessdata_path",  "path to tesseract data",        OFFSET(tessdata_path), AV_OPT_TYPE_STRING, {.str = NULL},                      0,                  0, FLAGS, NULL   },
+    { "language",       "ocr language",                  OFFSET(language),      AV_OPT_TYPE_STRING, {.str = "eng"},                     0,                  0, FLAGS, NULL   },
+    { NULL },
+};
+
+AVFILTER_DEFINE_CLASS(graphicsub2text);
+
+static const AVFilterPad inputs[] = {
+    {
+        .name         = "default",
+        .type         = AVMEDIA_TYPE_SUBTITLE,
+        .filter_frame = filter_frame,
+        .config_props = config_input,
+    },
+};
+
+static const AVFilterPad outputs[] = {
+    {
+        .name          = "default",
+        .type          = AVMEDIA_TYPE_SUBTITLE,
+        .config_props  = config_output,
+    },
+};
+
+const AVFilter ff_sf_graphicsub2text = {
+    .name          = "graphicsub2text",
+    .description   = NULL_IF_CONFIG_SMALL("Convert graphical subtitles to text subtitles via OCR"),
+    .init          = init,
+    .uninit        = uninit,
+    .priv_size     = sizeof(SubOcrContext),
+    .priv_class    = &graphicsub2text_class,
+    FILTER_INPUTS(inputs),
+    FILTER_OUTPUTS(outputs),
+    FILTER_QUERY_FUNC(query_formats),
+};