mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-14 21:23:42 +00:00
[BUG]"Error opening data file /usr/share/eng.traineddata" error, regardless of TESSDATA_PREFIX #683
Originally created by @rezad1393 on GitHub (Feb 7, 2022).
To get the version of CCExtractor, you can use --version.
CCExtractor version: CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
In raising this issue, I confirm the following:
Necessary information
Video links
anything really
Additional information
Whatever I set TESSDATA_PREFIX to, ccextractor still reports the same error with the same path.
Setting TESSDATA_PREFIX does affect tesseract itself, so I know the variable is not the problem,
but CCExtractor seems to look at a hardcoded path.
@paulshields commented on GitHub (Mar 24, 2022):
I came across this bug when facing the same issue. I noticed there was a wget for the traineddata in the ccextractor/linux/build-static.sh file:
wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
though the path appears to have changed.
I can see all the traineddata files here though https://github.com/tesseract-ocr/tessdata.git
I downloaded the eng.traineddata via GitHub and copied it to the tesseract tessdata dir
This then allowed me to run ccextractor against a file with burned-in subs (no need to set TESSDATA_PREFIX) and it (mostly) worked. It ran at least and generated an SRT file. I think I just need to play around with some thresholds to get more accurate OCR.
ccextractor CEA-608-SEI-Captiondata.mp4 -hardsubx -subcolor white -detect_italics -whiteness_thresh 90 -conf_thresh 60 -o cea.srt
btw - if you prefix ccextractor with
strace -e file
then you can see that it looks in various places for the tessdata directory.
@ocococococ commented on GitHub (Sep 4, 2022):
At least on a Mac with tesseract 5 installed via brew, the tessdata directory /usr/local/share/tessdata is never found.
This could be too simplistic, but changing the tesseract version check in ocr.c from
if (!strncmp("4.", TessVersion(), 2))
to
if (TessVersion()[0] >= '4')
seems to do the trick,
by forcing the same code path as tesseract version 4, which appends a slash to the tessdata parent path.
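To make the difference between the two checks concrete, here is a minimal sketch. It takes the version string as a parameter instead of calling TessVersion(), so it compiles without the Tesseract headers; the function names are hypothetical and not part of ocr.c. The third function is only an illustration of a numeric comparison, not what PR #1479 actually does.

```c
#include <stdlib.h>
#include <string.h>

/* Check quoted above: true only when the string starts with "4.". */
int is_tesseract_4_exact(const char *version)
{
    return strncmp("4.", version, 2) == 0;
}

/* Replacement quoted above: compares only the first character. */
int is_tesseract_4_or_later_char(const char *version)
{
    return version[0] >= '4';
}

/* Hypothetical alternative: parse the leading major number, so that a
 * two-digit major version is also handled. */
int tesseract_major_at_least(const char *version, long minimum)
{
    char *end = NULL;
    long major = strtol(version, &end, 10);
    /* end == version means no digits were parsed at all. */
    return end != version && major >= minimum;
}
```

With a version string such as "5.3.0", the first check fails while the other two pass, which matches the behaviour described in this comment. The single-character comparison would reject a string such as "10.0.0" because '1' sorts before '4'; the numeric variant avoids that.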
Minor changes in CMakeLists.txt are also required to build on mac Big Sur.
tesseract_5_mac.patch.zip
@PunitLodha commented on GitHub (Mar 14, 2023):
Could you please check if this is still an issue on the latest master? Should have been fixed by #1479
@rezad1393 commented on GitHub (Mar 15, 2023):
Can't test for hardsub
@ocococococ commented on GitHub (Mar 15, 2023):
FYI, for my use cases (built with the options -DWITH_OCR=ON -DWITHOUT_RUST=ON), it works.
tesseract 5 is used and tessdata is found correctly.
I still needed to apply minor CMake modifications to be able to build it on macOS Big Sur (see tesseract_5_mac.patch.zip above).
@PunitLodha commented on GitHub (Mar 15, 2023):
@prateekmedia could you look into this? Seems like the build_hardsubx script is broken by adding rusty_ffmpeg, which is a dependency of rsmpeg.
@prateekmedia commented on GitHub (Mar 16, 2023):
Two environment variables need to be set; see the Linux CI in build_ocr_hardsubx.
@PunitLodha commented on GitHub (Mar 20, 2023):
@rezad1393 can you try setting the env variables FFMPEG_INCLUDE_DIR and FFMPEG_PKG_CONFIG_PATH to whatever the correct paths for your machine are, and then trying again?
@rboy1 commented on GitHub (Oct 3, 2023):
@PunitLodha @cfsmp3 when do you think we could see a new release?
@cfsmp3 commented on GitHub (Oct 4, 2023):
When we can merge all the pending PRs I guess.
@cfsmp3 commented on GitHub (Dec 26, 2025):
This issue should be resolved by PR #1479 (merged March 2023) which fixed Tesseract 5 compatibility.
Summary of solutions:
- Ensure eng.traineddata (or your language's traineddata) is in one of these directories:
  - /usr/share/tesseract-ocr/4.00/tessdata/
  - /usr/local/share/tessdata/
  - or set the TESSDATA_PREFIX environment variable to point to the parent directory containing tessdata/
- Download traineddata files from: https://github.com/tesseract-ocr/tessdata
- Use strace -e file ccextractor ... to see which paths CCExtractor is searching for tessdata
If you're still experiencing this issue on the latest version (0.96+), please open a new issue with your specific setup details.
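As a quick way to check the locations listed in the summary without strace, the lookup can be sketched in C. This is not CCExtractor's actual search code: the function names are hypothetical, and the search order (TESSDATA_PREFIX first, then the two parent directories named in the summary) is an assumption made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "<parent>/tessdata/<lang>.traineddata" into out, adding the
 * slash after parent only when it is missing.
 * Returns 1 on success, 0 if the result does not fit in out. */
int build_traineddata_path(const char *parent, const char *lang,
                           char *out, size_t outlen)
{
    size_t len = strlen(parent);
    const char *sep = (len > 0 && parent[len - 1] == '/') ? "" : "/";
    int n = snprintf(out, outlen, "%s%stessdata/%s.traineddata",
                     parent, sep, lang);
    return n >= 0 && (size_t)n < outlen;
}

/* Return 1 if the file can be opened for reading, 0 otherwise. */
int file_readable(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}

/* Try each candidate parent directory in turn (assumed order).
 * On success, out holds the first path that could be opened. */
int find_traineddata(const char *lang, char *out, size_t outlen)
{
    const char *parents[3];
    size_t count = 0;
    const char *env = getenv("TESSDATA_PREFIX");

    if (env != NULL && *env != '\0')
        parents[count++] = env;
    parents[count++] = "/usr/share/tesseract-ocr/4.00";
    parents[count++] = "/usr/local/share";

    for (size_t i = 0; i < count; i++) {
        if (build_traineddata_path(parents[i], lang, out, outlen) &&
            file_readable(out))
            return 1;
    }
    return 0;
}
```

Calling find_traineddata("eng", buf, sizeof buf) from a small main and printing buf on success shows which copy of eng.traineddata would be picked up under these assumptions. Handling the trailing slash in one place mirrors the slash issue mentioned earlier in the thread.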
Closing as resolved by #1479.