mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-03 21:23:48 +00:00
CCextractor does not find non-English tesseract data #204
Originally created by @BBagger on GitHub (Nov 29, 2016).
I have a recording from BBC Brit. It has 3 subtitle streams: Danish, Swedish and Finnish.
When I now run CCextractor I do get a .srt file but CCextractor complains:
Opening file: Pointless.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
dan.traineddata not found! Switching to English
swe.traineddata not found! Switching to English
fin.traineddata not found! Switching to English
Creating Pointless.srt
Using English trained data on Scandinavian texts makes for funny results!
The tesseract-ocr trained data is installed in /usr/share/tessdata/.
I did some research on this. When I ran strace on CCextractor, I found that it looks in the current directory for the non-English trained data:
openat(AT_FDCWD, "./tessdata/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
write(1, "dan.traineddata not found! Switc"..., 48dan.traineddata not found! Switching to English
but globally for the English data:
open("/usr/share/tessdata/eng.traineddata", O_RDONLY) = 4
When I added a link from the current directory to /usr/share/tessdata, CCextractor stopped complaining about missing data.
So my questions are:
@Abhinav95 commented on GitHub (Nov 30, 2016):
Hi @BBagger
I'd like to bring to your attention the following parameters that seem to be appropriate for your problem:-
Assuming you want to extract the subtitles from the Danish subtitle stream with the Danish trained data file, you will want to do:-
ccextractor Pointless.ts -dvblang dan -ocrlang dan
Assuming you wanted to do something esoteric like perform OCR with the Swedish trained data on the Danish stream, you would do:-
ccextractor Pointless.ts -dvblang dan -ocrlang swe
dvblang acts as a stream selector and ocrlang is the selector for which OCR data file to use.
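If you want to script the invocations above, for example to try each of the three streams in turn, a minimal Python sketch might look like the following. The helper name build_command is hypothetical; the only ccextractor options it uses are the -dvblang and -ocrlang flags shown above. It only builds and prints the command lines, it does not run them.

```python
import shlex
from typing import List, Optional


def build_command(input_file: str, dvblang: str,
                  ocrlang: Optional[str] = None) -> List[str]:
    """Build a ccextractor argument list (hypothetical helper).

    dvblang selects the DVB subtitle stream; ocrlang selects the
    Tesseract traineddata file and is only added when given.
    """
    cmd = ["ccextractor", input_file, "-dvblang", dvblang]
    if ocrlang is not None:
        cmd += ["-ocrlang", ocrlang]
    return cmd


if __name__ == "__main__":
    # Use the matching OCR language for each stream in the recording.
    for lang in ("dan", "swe", "fin"):
        print(shlex.join(build_command("Pointless.ts", lang, lang)))
```

Passing the list to subprocess.run would execute it; whether successive runs write to the same output file is something to check before looping over streams like this.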
As far as the location of the trained data is concerned, the Tesseract traineddata search paths, in order of priority, are:-
1. tessdata in TESSDATA_PREFIX, if it is specified. Overrides others
2. tessdata in current working directory
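To check which file would be picked up for a given language, the two-step order above can be mirrored in a short Python sketch. This is not CCExtractor's or Tesseract's actual lookup code; the function name find_traineddata is hypothetical, and it assumes that "overrides" means TESSDATA_PREFIX is tried first, with the working directory as a fallback.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_traineddata(lang: str,
                     env: Optional[Mapping[str, str]] = None,
                     cwd: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing <lang>.traineddata, or None.

    Hypothetical helper that follows the order listed in this thread.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)

    candidates = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        # 1. tessdata inside TESSDATA_PREFIX, when the variable is set
        candidates.append(Path(prefix) / "tessdata")
    # 2. tessdata in the current working directory
    candidates.append(cwd / "tessdata")

    for directory in candidates:
        path = directory / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    for lang in ("dan", "swe", "fin"):
        print(lang, find_traineddata(lang))
```

With this order, setting TESSDATA_PREFIX=/usr/share before running ccextractor would be expected to have the same effect as the ./tessdata link, assuming CCExtractor honours the variable as described.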
I think what you'd want to do in your case is place the .traineddata files in the default path, e.g. /usr/share/tessdata/dan.traineddata.
Please get back to me if you have any questions or run into further issues.