mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-16 21:23:35 +00:00
OCR issue #57
Originally created by @okisseloff on GitHub (Jun 4, 2015).
Originally assigned to: @canihavesomecoffee, @Abhinav95 on GitHub.
I've found a problem with the OCR feature that affects these two samples: https://github.com/CCExtractor/ccextractor/issues/172 and https://github.com/CCExtractor/ccextractor/issues/151.
I compiled ccextractor with the OCR feature and extracted PNG subs using the -out=spupng option. The PNGs were extracted fine, but the srt file does not contain enough subtitles: some lines are missing. The first thing that strikes the eye is that there are no multi-line subs in it. After that I found that some single-line subs are missing too.
Then I ran the tesseract CLI tool on some of the PNG sources that were left out of the srt file, to check whether it can recognize the text. Some of the multi-line sources were recognized well, and some could not be recognized. More than that, lots of single-line sources could not be recognized either. Error messages appeared:
led me to Tesseract's bug tracker, https://code.google.com/p/tesseract-ocr/issues/detail?id=605, where they say it is a Leptonica issue.
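The manual check described above (running the tesseract CLI on each extracted PNG and noting which ones yield no text) could be scripted roughly as follows. This is only a sketch: it assumes the `tesseract` binary is on PATH and that the spupng output PNGs sit in a single directory. The function names `ocr_png` and `find_unrecognized` are hypothetical and are not part of ccextractor.

```python
import subprocess
from pathlib import Path


def ocr_png(png_path):
    """Run the tesseract CLI on one image and return (text, stderr)."""
    # Passing "stdout" as the output base makes tesseract print the
    # recognized text instead of writing it to a .txt file.
    proc = subprocess.run(
        ["tesseract", str(png_path), "stdout"],
        capture_output=True,
        text=True,
    )
    return proc.stdout.strip(), proc.stderr.strip()


def find_unrecognized(png_dir, ocr=ocr_png):
    """Return {png file name: stderr} for every PNG that produced no text.

    `ocr` is injectable so the directory scan can be exercised without
    tesseract installed (an assumption made for this sketch).
    """
    failures = {}
    for png in sorted(Path(png_dir).glob("*.png")):
        text, err = ocr(png)
        if not text:
            failures[png.name] = err
    return failures
```

Calling `find_unrecognized()` on the directory holding the extracted PNGs gives the list of images tesseract could not read, together with whatever it printed on stderr, which can then be compared against the lines missing from the srt file.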
I am not so familiar with OCR-related code in ccextractor, but probably some of you guys are.
@canihavesomecoffee commented on GitHub (Jun 4, 2015):
The issue linked in #172 is not linked to OCR. This is regular DVB, which can be decoded (and actually is) without any OCR. Not 100% sure, but I think it also applies to #151...
@okisseloff commented on GitHub (Jun 4, 2015):
@wforums, I came to this conclusion about https://github.com/CCExtractor/ccextractor/issues/172 because when I compile ccextractor without OCR support, launching it with defaults produces an empty srt file. The same happens with the second issue.
@canihavesomecoffee commented on GitHub (Jun 4, 2015):
Yeah, it looks like you are right; Windows is not saying anything about DVB, while Linux complains about not having OCR...
@canihavesomecoffee commented on GitHub (May 21, 2016):
Since the #605 issue of Tesseract isn't available anymore, here's an updated link (and verbatim copy below to ensure that it remains readable):
https://groups.google.com/forum/#!topic/tesseract-issues/lLoq9SZt5Po
Reply:
I'll see if this issue is still present in current .80
@cfsmp3 commented on GitHub (Aug 8, 2016):
Assigned to Abhinav95. Abhinav95, does this stuff still apply? Unsure, since both bugs mentioned by Kisselef are now closed.
@Abhinav95 commented on GitHub (Aug 10, 2016):
@cfsmp3 The links to the files in both the issues are currently broken.
For all the DVB samples I have encountered so far, whatever we get as the spupng output also gets an associated OCR text. So I'm unsure if it still applies (my hunch is that it doesn't, if the stream is not corrupt).
I think we can close this till a similar sample becomes available.
@cfsmp3 commented on GitHub (Aug 10, 2016):
Closing for now, impossible to track this down.