CCextractor does not find non-English tesseract data #204

Closed
opened 2026-01-29 16:37:53 +00:00 by claunia · 1 comment
Owner

Originally created by @BBagger on GitHub (Nov 29, 2016).

I have a recording from BBC Brit. It has 3 subtitle streams: Danish, Swedish and Finnish.
When I now run CCextractor I do get a .srt file but CCextractor complains:

Opening file: Pointless.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
dan.traineddata not found! Switching to English
swe.traineddata not found! Switching to English
fin.traineddata not found! Switching to English
Creating Pointless.srt

Using English trained data on Scandinavian texts makes for funny results!

The tesseract-ocr trained data is installed in /usr/share/tessdata/.

I did some research on this. When I ran 'strace' on CCextractor, I found that it looks in the current directory for the trained data:

openat(AT_FDCWD, "./tessdata/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
write(1, "dan.traineddata not found! Switc"..., 48dan.traineddata not found! Switching to English

but globally to find the English data:

open("/usr/share/tessdata/eng.traineddata", O_RDONLY) = 4

When I added a link from the current directory to /usr/share/tessdata, CCextractor stopped complaining about missing data.
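
For reference, the symlink workaround described above could look roughly like the following sketch. The function name link_tessdata is hypothetical, and /usr/share/tessdata is only the default because that is where the data is installed on this system; other distributions may use a different location.

```shell
# Hypothetical helper: make ./tessdata point at the system tessdata directory.
# The default source path is the one reported in this issue; pass another path if needed.
link_tessdata() {
    src="${1:-/usr/share/tessdata}"
    if [ ! -d "$src" ]; then
        echo "tessdata directory not found: $src" >&2
        return 1
    fi
    # Refuse to touch a real directory or file that is already named ./tessdata.
    if [ -e ./tessdata ] && [ ! -L ./tessdata ]; then
        echo "./tessdata already exists and is not a symlink" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link, -n: do not follow an existing link
    ln -sfn "$src" ./tessdata
}
```

Running `link_tessdata` in the directory that holds the video creates the link once. It still has to be repeated per directory, which is exactly the inconvenience the first question below is about.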

So my questions are:

  • How do I get CCextractor to read the trained data from the 'standard' location? It is not practical to have the tessdata files in every directory that contains a video.
  • How do I specify to CCextractor which language I want? It appears that CCextractor uses the first stream encountered. That is fine in this case since I live in Denmark, but what would happen if I were Finnish?

@Abhinav95 commented on GitHub (Nov 30, 2016):

Hi @BBagger

I'd like to bring to your attention the following parameters that seem to be appropriate for your problem:-

-dvblang: For DVB subtitles, select which language's caption
          stream will be processed. e.g. 'eng' for English.
          If there are multiple languages, only this specified
          language stream will be processed
-ocrlang: Manually select the name of the Tesseract .traineddata
          file. Helpful if you want to OCR a caption stream of
          one language with the data of another language.
          e.g. '-dvblang chs -ocrlang chi_tra' will decode the
          Chinese (Simplified) caption stream but perform OCR
          using the Chinese (Traditional) trained data
          This option is also helpful when the traineddata file
          has non standard names that don't follow ISO specs

Assuming you want to extract the subtitles from the Danish subtitle stream with the Danish trained data file, you will want to do:-
ccextractor Pointless.ts -dvblang dan -ocrlang dan
Assuming you wanted to do something esoteric like perform OCR with the Swedish trained data on the Danish stream, you would do:-
ccextractor Pointless.ts -dvblang dan -ocrlang swe

dvblang acts as a stream selector and ocrlang is the selector for which OCR data file to use.
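
If you want one .srt per language rather than a single stream, the two options can be combined in a loop. The following is only a sketch: extract_all and the DRY_RUN switch are hypothetical names, and the -o output-file option is an assumption that is not mentioned elsewhere in this thread, so check it against the help output of your build.

```shell
# Hypothetical helper: run one extraction per language code.
# Set DRY_RUN=1 to print the commands instead of executing them.
extract_all() {
    input="$1"
    shift
    base="${input%.*}"   # strip the extension: Pointless.ts -> Pointless
    for lang in "$@"; do
        # Use the same code to select the DVB stream and the OCR data file.
        if [ -n "${DRY_RUN:-}" ]; then
            echo ccextractor "$input" -dvblang "$lang" -ocrlang "$lang" -o "$base.$lang.srt"
        else
            ccextractor "$input" -dvblang "$lang" -ocrlang "$lang" -o "$base.$lang.srt"
        fi
    done
}
```

For the recording in this issue, `DRY_RUN=1 extract_all Pointless.ts dan swe fin` would print the three commands so they can be reviewed before running them for real.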

As far as the location of the trained data is concerned, the Tesseract traineddata search paths, in order of priority, are:-
1. tessdata in TESSDATA_PREFIX, if it is specified. Overrides others
2. tessdata in current working directory

I think what you'd want to do in your case is place the .traineddata files in the default path, e.g. /usr/share/tessdata/dan.traineddata.
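
To check which file would be picked up under the two rules above, a small shell sketch can mimic the lookup. This is an illustration of the order as described in this comment, not CCExtractor's actual code. find_traineddata is a hypothetical name, and the sketch assumes that TESSDATA_PREFIX names the parent directory of tessdata; some Tesseract versions may instead expect it to point at the tessdata directory itself.

```shell
# Hypothetical helper: print the traineddata file that the described
# search order would select for a language code, or fail if none is found.
find_traineddata() {
    lang="$1"
    # 1. If TESSDATA_PREFIX is set it overrides everything else,
    #    so only <prefix>/tessdata is consulted.
    if [ -n "${TESSDATA_PREFIX:-}" ]; then
        candidate="${TESSDATA_PREFIX%/}/tessdata/$lang.traineddata"
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
        return 1
    fi
    # 2. Otherwise fall back to ./tessdata in the current working directory.
    if [ -f "./tessdata/$lang.traineddata" ]; then
        printf '%s\n' "./tessdata/$lang.traineddata"
        return 0
    fi
    return 1
}
```

Under the same assumption, something like `TESSDATA_PREFIX=/usr/share ccextractor Pointless.ts -dvblang dan -ocrlang dan` should make the data in /usr/share/tessdata/ visible without copying or linking it next to each video.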

Please get back to me if you have any questions or run into further issues.
