OCR of DVB-sub is not recognizing umlauts; timecode issues #90

Closed
opened 2026-01-29 16:34:54 +00:00 by claunia · 2 comments

Originally created by @hurda on GitHub (Nov 19, 2015).

Originally assigned to: @bigharshrag on GitHub.

CCExtractor 0.77 and git-677fee4
Tesseract language data 3.02 and 3.04 for German
Output format: SRT
Example files: http://www.mediafire.com/download/05oh34xj3oou90d/umlauts.7z (210MB)

OCR of DVB subtitles does not handle umlauts properly.
They are either recognized as different characters (0.77 and git), or the lines containing umlauts are dropped entirely (0.77).

For example:

Teletext:
2
00:00:05,480 --> 00:00:08,100
Viele Fahrer sind überfordert,

3
00:00:08,140 --> 00:00:11,180
wissen in dem Moment nicht,
was sie machen sollen.

4
00:00:11,220 --> 00:00:17,680
Fahren weiter,
und dann kommt es zu Engpässen.

DVB 0.77:
1
00:00:00,001 --> 00:00:05,139
Viele Fahrer sind iiberfordert,


2
00:00:07,820 --> 00:00:10,899
und dann kommt es zu Engpéissen.

DVB git:
1
00:00:00,001 --> 00:00:00,000
Viele Fahrer sind fiberfordert,


2
00:00:00,001 --> 00:00:02,679
wissen in dem Moment nicht,

was sie machen sollen.



3
00:00:02,680 --> 00:00:05,759
Fahren weiter,

und dann kommt es zu Engpéissen.
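To pin down exactly which characters the OCR substitutes, the Teletext lines can serve as a reference and be diffed against the DVB output. The following is only a rough sketch using Python's standard `difflib`; the helper name `char_diffs` and the `SAMPLES` list are made up for illustration and are not part of CCExtractor.

```python
import difflib


def char_diffs(reference: str, ocr: str) -> list[tuple[str, str]]:
    """Return (reference_fragment, ocr_fragment) pairs where the strings differ."""
    # autojunk=False disables difflib's popularity heuristic, which only
    # matters for long inputs but keeps the comparison predictable.
    matcher = difflib.SequenceMatcher(None, reference, ocr, autojunk=False)
    return [
        (reference[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Teletext line paired with the OCR result quoted in this report.
SAMPLES = [
    ("Viele Fahrer sind überfordert,", "Viele Fahrer sind iiberfordert,"),
    ("Viele Fahrer sind überfordert,", "Viele Fahrer sind fiberfordert,"),
    ("und dann kommt es zu Engpässen.", "und dann kommt es zu Engpéissen."),
]

if __name__ == "__main__":
    for reference, ocr in SAMPLES:
        print(char_diffs(reference, ocr))
```

On these samples the differences are confined to the umlauts themselves (ü and ä), which suggests the OCR is running without German character data rather than misreading the lines as a whole.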

EDIT:
Now that I see them side by side, I also notice the extra line breaks and the incorrect timecodes in the DVB subtitles.
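Some of the timecode problems can be detected without a reference file, for instance the first DVB git cue, whose end time precedes its start time. Below is a minimal sketch of such a sanity check for SRT timing lines; `parse_timing` and `find_timing_problems` are hypothetical helper names, and the check only covers internal consistency, not the offset relative to the Teletext timings.

```python
import re

# One SRT timing line, e.g. "00:00:05,480 --> 00:00:08,100".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def parse_timing(line: str) -> tuple[int, int]:
    """Return (start_ms, end_ms) for one SRT timing line."""
    match = TIMING.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not an SRT timing line: {line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
    start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start, end


def find_timing_problems(timing_lines: list[str]) -> list[tuple[int, str]]:
    """Return (cue_number, description) for each inconsistent cue."""
    problems = []
    previous_end = None
    for number, line in enumerate(timing_lines, start=1):
        start, end = parse_timing(line)
        if end <= start:
            problems.append((number, "end is not after start"))
        if previous_end is not None and start < previous_end:
            problems.append((number, "starts before previous cue ends"))
        previous_end = end
    return problems


# Timing lines from the "DVB git" output in this report.
DVB_GIT = [
    "00:00:00,001 --> 00:00:00,000",
    "00:00:00,001 --> 00:00:02,679",
    "00:00:02,680 --> 00:00:05,759",
]

if __name__ == "__main__":
    print(find_timing_problems(DVB_GIT))
```

A check like this flags the zero-length first cue but says nothing about the roughly five-second shift against the Teletext cues; spotting that would need the Teletext timings as a reference.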


@bigharshrag commented on GitHub (Aug 22, 2016):

Running this with the latest git version and the parameter `-ocrlang deu`, the umlaut and line-break issues are completely fixed.
The timecode problem with DVB, however, is a known issue and remains to be fixed.


@cfsmp3 commented on GitHub (Jan 11, 2017):

Times are correct now too (in git master).

Reference: starred/ccextractor#90