[BUG] Incorrect path for loading tesseract traineddata #746

Closed
opened 2026-01-29 16:52:37 +00:00 by claunia · 3 comments

Originally created by @ibrahim-akrab on GitHub (Mar 10, 2023).

CCExtractor version: 0.94

Necessary information

  • Is this a regression (i.e. did it work before)? {NO}
  • What platform did you use? {Linux}
  • What were the used arguments? {}

Video links

channel5-2018-02-12.ts (https://drive.google.com/file/d/1Etq-pv5G3jGqVhhRl7cNrfuw4gaKkLoV/view?usp=sharing) from the TV Samples page

Additional information

ccextractor tries to load the tesseract traineddata from the wrong location and then blames TESSDATA_PREFIX. Here's the output it produces:

Opening file: /home/ibrahim/Downloads/channel5-2018-02-12.ts
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
Error opening data file /usr/share/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Failed TessBaseAPIInit4 -1

I checked the logic in ocr.c and confirmed that probe_tessdata_location works fine by tracing the syscalls it makes for each candidate tessdata location. Running strace -e trace=openat ./ccextractor ~/Downloads/channel5-2018-02-12.ts gives the following:

Opening file: /home/ibrahim/Downloads/channel5-2018-02-12.ts
openat(AT_FDCWD, "/home/ibrahim/Downloads/channel5-2018-02-12.ts", O_RDONLY) = 3
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
openat(AT_FDCWD, "./tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
openat(AT_FDCWD, "/usr/share/eng.traineddata", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/eng.traineddata", O_RDONLY) = -1 ENOENT (No such file or directory)
Error opening data file /usr/share/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Failed TessBaseAPIInit4 -1

It checks the paths correctly and stops when it finds the directory at /usr/share/tessdata/, yet the traineddata file is then looked up in /usr/share/. So I suspect the problem is in the TessBaseAPIInit4 call.
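
The mismatch in the trace can be reproduced with a small sketch. It assumes that the library simply appends the language name and ".traineddata" to whatever data path it is handed, which is what the two failed openat calls suggest. The helper name traineddata_path is hypothetical and is not part of ccextractor or Tesseract.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not ccextractor or Tesseract code): builds the file
 * name that would be looked up when `datapath` is treated as the directory
 * holding the .traineddata files.
 * Returns 0 on success, -1 if the output buffer is too small. */
int traineddata_path(const char *datapath, const char *lang,
                     char *out, size_t out_size)
{
    size_t len = strlen(datapath);
    /* Add a separator only when datapath does not already end in one. */
    const char *sep = (len > 0 && datapath[len - 1] == '/') ? "" : "/";
    int n = snprintf(out, out_size, "%s%s%s.traineddata", datapath, sep, lang);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Called with "/usr/share/" and "eng" this yields /usr/share/eng.traineddata, the path from the error message, while "/usr/share/tessdata/" yields /usr/share/tessdata/eng.traineddata. That is consistent with the parent directory, rather than the probed tessdata directory, reaching the init call.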

Also for full reference, here's the complete output of ccextractor --version on my setup:

        Version: 0.94
        Git commit: b1cbfcea9b9c687143bf0d80bc179b563e99d025
        Compilation date: 2023-03-10
        CEA-708 decoder: Rust
        File SHA256: 03bf3b76ff69b73e18166558675278cae9b91f52acce532b80a480c6920b87f4
Libraries used by CCExtractor
        Tesseract Version: 5.3.0
        Leptonica Version: leptonica-1.82.0
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

@ibrahim-akrab commented on GitHub (Mar 10, 2023):

I investigated a bit more, and it turns out that the init_ocr function checks the major version of the installed tesseract. For some reason, unless it is exactly major version 4, it doesn't pass the full path of the traineddata to the initialization call.
I think this is a leftover from supporting old tesseract versions, because the same code appears in hardsubx.c, but there it treats versions 4.x and 5.x the same.
I just want confirmation from a maintainer before making that change, since it's my first contribution to the project. Then I'll get to fixing the #929 issue.
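
A minimal sketch of the idea, not the actual init_ocr code: derive the major version from a version string such as the one TessVersion() returns, and use the tessdata directory itself for 4.x and anything newer instead of only for 4.x. The function name select_datapath, the prefix parameter, and the assumption that pre-4 releases expect the parent directory are all illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not taken from ccextractor.
 * version: version string, e.g. "5.3.0"
 * prefix:  parent directory of tessdata, e.g. "/usr/share/"
 * Writes the data path to hand to the init call into `out`.
 * Returns 0 on success, -1 if `out` is too small. */
int select_datapath(const char *version, const char *prefix,
                    char *out, size_t out_size)
{
    long major = strtol(version, NULL, 10); /* "5.3.0" -> 5 */
    /* Compare with >= 4 so that 5.x takes the same branch as 4.x.
     * Assumption: older releases expect the parent directory. */
    const char *suffix = (major >= 4) ? "tessdata" : "";
    size_t len = strlen(prefix);
    /* Make sure exactly one '/' separates prefix and suffix. */
    const char *sep = (len > 0 && prefix[len - 1] != '/') ? "/" : "";
    int n = snprintf(out, out_size, "%s%s%s", prefix, sep, suffix);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With "5.3.0" and "/usr/share/" this produces /usr/share/tessdata, whereas a check written as major == 4 would fall through to /usr/share/ and reproduce the failure in the report.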


@cfsmp3 commented on GitHub (Mar 10, 2023):

> if it isn't major version 4, it doesn't pass the initialization the full path of the traineddata.

Go for it. We're updating both tesseract and FFmpeg (and others) to the latest version. We don't really care much about supporting old versions of anything anymore. If someone wants to run an old tesseract, they can do it with an old CCExtractor.


@ibrahim-akrab commented on GitHub (Mar 11, 2023):

I think I should mark this issue as closed since its fix is now merged.


Reference: starred/ccextractor#746