OCR issue #54

Open
opened 2026-01-29 16:34:05 +00:00 by claunia · 0 comments

Originally created by @okisseloff on GitHub (Jun 4, 2015).

Originally assigned to: @canihavesomecoffee, @Abhinav95 on GitHub.

I've found a problem with the OCR feature that affects these two samples: https://github.com/CCExtractor/ccextractor/issues/172 and https://github.com/CCExtractor/ccextractor/issues/151.

I compiled ccextractor with the OCR feature and extracted PNG subtitles using the -out=spupng option. The PNGs were extracted correctly, but the SRT file contains too few subtitles: some lines are missing. The first thing that stands out is that there are no multi-line subtitles in it. I then found that some single-line subtitles are missing too.
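To put a number on "not enough subtitles", the PNG count can be compared with the number of cues in the SRT file. This is only a rough sketch: it assumes one PNG per subtitle, and the helper names (`count_srt_cues`, `count_pngs`, `missing_cues`) and both paths are hypothetical, not part of ccextractor.

```python
import re
from pathlib import Path

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
SRT_TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3}\s+-->\s+\d{2}:\d{2}:\d{2},\d{3}"
)


def count_srt_cues(srt_text: str) -> int:
    """Count cues by counting timing lines in the SRT text."""
    return sum(
        1 for line in srt_text.splitlines() if SRT_TIMING.match(line.strip())
    )


def count_pngs(png_dir: Path) -> int:
    """Count .png files directly inside png_dir (no recursion)."""
    return sum(1 for p in png_dir.iterdir() if p.suffix.lower() == ".png")


def missing_cues(srt_path: Path, png_dir: Path) -> int:
    """Return PNG count minus SRT cue count.

    Assumption: each extracted PNG corresponds to exactly one subtitle,
    so a positive result suggests cues that OCR failed to produce.
    """
    srt_text = srt_path.read_text(encoding="utf-8-sig", errors="replace")
    return count_pngs(png_dir) - count_srt_cues(srt_text)
```

Calling `missing_cues(Path("sample.srt"), Path("sample_pngs"))` with your own paths gives the size of the gap; it does not say which subtitles are missing.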

Next, I took some of the PNG sources that were left out of the SRT file and ran them through the tesseract CLI tool to see whether it could recognize the text. Some multi-line sources were recognized well, and others could not be recognized. Moreover, many single-line sources could not be recognized either. Error messages appeared:

...
Error in pixReduceRankBinary2: hs must be at least 2
Error in pixDilateBrick: pixs not defined
Error in pixExpandReplicate: pixs not defined
Error in pixAnd: pixs1 not defined
Error in pixDilateBrick: pixs not defined
...

These messages led me to Tesseract's bug tracker, https://code.google.com/p/tesseract-ocr/issues/detail?id=605, where they say it is a Leptonica issue.
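The manual check described above could be automated along these lines. This is a sketch, not how ccextractor calls Tesseract internally: it runs the `tesseract` CLI on each PNG with `stdout` as the output base, and assumes the "Error in ..." lines are written to stderr. The function names and the `png_dir` argument are hypothetical.

```python
import re
import subprocess
from collections import Counter
from pathlib import Path

# Error lines of the form "Error in <function>: <message>",
# as seen in the log above.
LEPT_ERROR = re.compile(r"^Error in (\w+): (.*)$", re.MULTILINE)


def leptonica_errors(stderr_text: str) -> Counter:
    """Count error lines per reporting function name."""
    return Counter(name for name, _msg in LEPT_ERROR.findall(stderr_text))


def classify(stdout_text: str, stderr_text: str) -> str:
    """Label one OCR run by whether text came out and whether errors appeared."""
    has_text = bool(stdout_text.strip())
    has_errors = bool(leptonica_errors(stderr_text))
    if has_text:
        return "recognized-with-errors" if has_errors else "recognized"
    return "empty-with-errors" if has_errors else "empty"


def check_png(png: Path, tesseract: str = "tesseract") -> str:
    """Run `tesseract <png> stdout` and classify the result."""
    proc = subprocess.run(
        [tesseract, str(png), "stdout"],
        capture_output=True,
        text=True,
    )
    return classify(proc.stdout, proc.stderr)


def check_dir(png_dir: Path) -> dict:
    """Map each PNG file name in png_dir to its classification."""
    return {
        p.name: check_png(p)
        for p in sorted(png_dir.iterdir())
        if p.suffix.lower() == ".png"
    }
```

Running `check_dir` over the extracted PNGs would show whether the unrecognized images are the same ones that trigger the error messages, which is what the linked Tesseract report would suggest.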

I am not very familiar with the OCR-related code in ccextractor, but some of you probably are.


Reference: starred/ccextractor#54