Mirror of https://github.com/CCExtractor/ccextractor.git (synced 2026-02-15 05:26:07 +00:00)
OCR of DVB-sub is not recognizing umlauts; timecode issues #90
Originally created by @hurda on GitHub (Nov 19, 2015).
Originally assigned to: @bigharshrag on GitHub.
CCExtractor 0.77 and git-677fee4
Tesseract data 3.02 and 3.04, respectively, for German
Saving to SRT.
Example files: http://www.mediafire.com/download/05oh34xj3oou90d/umlauts.7z (210MB)
OCR of DVB subtitles doesn't handle umlauts properly.
They are either recognized as different characters (0.77 and git), or the lines with the umlauts are discarded completely (0.77).
E.g.:
00:00:05,480 --> 00:00:08,100
Viele Fahrer sind überfordert,
3
00:00:08,140 --> 00:00:11,180
wissen in dem Moment nicht,
was sie machen sollen.
4
00:00:11,220 --> 00:00:17,680
Fahren weiter,
und dann kommt es zu Engpässen.
00:00:00,001 --> 00:00:05,139
Viele Fahrer sind iiberfordert,
2
00:00:07,820 --> 00:00:10,899
und dann kommt es zu Engpéissen.
00:00:00,001 --> 00:00:00,000
Viele Fahrer sind fiberfordert,
2
00:00:00,001 --> 00:00:02,679
wissen in dem Moment nicht,
was sie machen sollen.
—
3
00:00:02,680 --> 00:00:05,759
Fahren weiter,
und dann kommt es zu Engpéissen.
EDIT:
Now that I'm seeing them side by side, I noticed the abundance of linebreaks and the incorrect timecodes in the DVB-subtitles.
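For anyone who wants to check their own output for the timecode problem, here is a rough sketch in Python. It is not part of CCExtractor; the names TIMING, to_ms and suspicious_cues are made up for illustration. It only flags cues whose end time is not after the start time (like the 00:00:00,001 --> 00:00:00,000 cue above), so it will not catch timecodes that are merely shifted.

```python
import re

# Matches an SRT timing line such as "00:00:05,480 --> 00:00:08,100".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert hour/minute/second/millisecond fields to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def suspicious_cues(srt_text):
    """Return (start_ms, end_ms) for every cue whose end is not after its start."""
    bad = []
    for line in srt_text.splitlines():
        match = TIMING.fullmatch(line.strip())
        if not match:
            continue  # cue number or subtitle text, not a timing line
        fields = match.groups()
        start, end = to_ms(*fields[:4]), to_ms(*fields[4:])
        if end <= start:
            bad.append((start, end))
    return bad


# Sample taken from the output quoted above.
sample = """00:00:00,001 --> 00:00:00,000
Viele Fahrer sind fiberfordert,
2
00:00:00,001 --> 00:00:02,679
wissen in dem Moment nicht,
"""
print(suspicious_cues(sample))  # [(1, 0)]
```

To use it on a real file, read the SRT with open(path, encoding="utf-8").read() and pass the text to suspicious_cues.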
@bigharshrag commented on GitHub (Aug 22, 2016):
When I run this with the latest git version and the parameter -ocrlang deu, the umlaut and linebreak issues are completely fixed. The timecode problem with DVB, however, is a known issue and still remains to be fixed.
@cfsmp3 commented on GitHub (Jan 11, 2017):
Times are correct now too (in git master).