OCR issue #54

claunia · 2026-01-29T16:34:05Z

Originally created by @okisseloff on GitHub (Jun 4, 2015).

Originally assigned to: @canihavesomecoffee, @Abhinav95 on GitHub.

I've found a problem with the OCR feature that affects these two samples: https://github.com/CCExtractor/ccextractor/issues/172 and https://github.com/CCExtractor/ccextractor/issues/151.

I compiled ccextractor with the OCR feature and extracted PNG subtitles using the -out=spupng option. The PNGs were extracted fine, but the SRT file does not contain enough subtitles: some lines are missing. The first thing that stands out is that there are no multi-line subtitles in it. After that, I found that some single-line subtitles are missing too.
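One way to put a number on the gap described above is to compare the count of cues in the SRT file with the count of extracted PNG images. The following is only a minimal sketch: the function names and the idea that all PNGs sit in one directory are assumptions for illustration, not something ccextractor guarantees.

```python
import re
from pathlib import Path

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def count_srt_cues(srt_text: str) -> int:
    """Count subtitle cues by counting timing lines in SRT text."""
    return sum(1 for line in srt_text.splitlines() if TIMING.match(line.strip()))


def count_pngs(directory: Path) -> int:
    """Count PNG files directly inside the given directory (hypothetical layout)."""
    return sum(1 for p in directory.iterdir() if p.suffix.lower() == ".png")


def report(srt_path: Path, png_dir: Path) -> str:
    """Summarise how many extracted images have no matching SRT cue."""
    cues = count_srt_cues(srt_path.read_text(encoding="utf-8", errors="replace"))
    pngs = count_pngs(png_dir)
    return f"{cues} SRT cues, {pngs} PNG images, {pngs - cues} possibly missing"
```

Calling `report(Path("out.srt"), Path("out_pngs"))` with the real paths (both names here are placeholders) gives a quick first impression. It only compares totals, so it cannot tell which subtitles were dropped.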

Then I ran the tesseract CLI tool on some of the PNG sources that were left out of the SRT file, to check whether it can recognize the text. Some of the multi-line sources were recognized well, and some could not be recognized. On top of that, many single-line sources could not be recognized either. Error messages appeared:

...
Error in pixReduceRankBinary2: hs must be at least 2
Error in pixDilateBrick: pixs not defined
Error in pixExpandReplicate: pixs not defined
Error in pixAnd: pixs1 not defined
Error in pixDilateBrick: pixs not defined
...

These messages led me to Tesseract's bug tracker, https://code.google.com/p/tesseract-ocr/issues/detail?id=605, where they say it is a Leptonica issue.
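To repeat the manual check over a whole directory, something like the sketch below could run the tesseract CLI on every PNG and tally the "Error in <function>: <message>" lines it prints. It assumes that `tesseract` is on the PATH and that the installed version accepts `stdout` as the output base; the function names are made up for illustration and are not part of ccextractor or Tesseract.

```python
import re
import subprocess
from collections import Counter
from pathlib import Path

# Error lines in the log above have the shape "Error in <function>: <message>".
ERROR_LINE = re.compile(r"^Error in (\w+): (.+)$")


def parse_leptonica_errors(stderr_text: str) -> Counter:
    """Count error lines per reporting function name."""
    counts = Counter()
    for line in stderr_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def ocr_png(png_path: Path):
    """Run the tesseract CLI on one image; return (recognized text, error counts).

    Assumption: `tesseract <image> stdout` writes the recognized text to stdout.
    """
    proc = subprocess.run(
        ["tesseract", str(png_path), "stdout"],
        capture_output=True,
        text=True,
    )
    return proc.stdout.strip(), parse_leptonica_errors(proc.stderr)


def failing_images(png_dir: Path):
    """Yield (path, errors) for images that gave no text or reported errors."""
    for png in sorted(png_dir.glob("*.png")):
        text, errors = ocr_png(png)
        if not text or errors:
            yield png, errors
```

Grouping by function name makes it easier to see whether every failing image hits the same first error (here `pixReduceRankBinary2`) or whether there are several distinct failure modes.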

I am not very familiar with the OCR-related code in ccextractor, but some of you probably are.

claunia referenced this issue

2026-01-29 16:57:33 +00:00

[PR #54] [MERGED] Binary search for dictionary #946

claunia referenced this issue

2026-01-29 16:57:35 +00:00

[PR #54] Binary search for dictionary #951


Reference: starred/ccextractor#54