[BUG] TESSDATA_PREFIX requires path separator at its end #534

Closed
opened 2026-01-29 16:46:43 +00:00 by claunia · 12 comments

Originally created by @NilsIrl on GitHub (Dec 29, 2019).

Necessary information

  • arguments: just the filename (including location):
$ TESSDATA_PREFIX=/nix/store/8lr60hp7yv0aysns056b74fsi8fm49zg-tesseract-3.05.00/share/ ./result/bin/ccextractor ~/Downloads/telecine.ts
  • platform: NixOS (Linux)
  • regression: I don't know

Video link

I hope this reproduces with any file that uses tesseract (files that store subtitles as images). If it doesn't, that would mean the location of the tesseract data is handled differently for different files.

I used the one from #1104 (https://edge1.motv.eu/telecine.ts)

Additional information

TESSDATA_PREFIX is an environment variable that points to the directory containing the tessdata directory. For some reason, ccextractor requires TESSDATA_PREFIX to end with a /. It should work without one.

e.g.

TESSDATA_PREFIX=/nix/store/8lr60hp7yv0aysns056b74fsi8fm49zg-tesseract-3.05.00/share

Should work but it doesn't.
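
One plausible cause, sketched here as an assumption rather than a quote from ocr.c, is that the data path is built by plain string concatenation. The helper name `naive_tessdata_path` is hypothetical and only illustrates how a missing separator would produce a wrong path.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, NOT taken from ccextractor: joins the prefix and
 * "tessdata/" by plain concatenation, with no separator handling. */
static void naive_tessdata_path(const char *prefix, char *out, size_t out_size)
{
    snprintf(out, out_size, "%s%s", prefix, "tessdata/");
}
```

With a prefix of `/usr/share/` this yields `/usr/share/tessdata/`, but with `/usr/share` it yields `/usr/sharetessdata/`, which does not exist.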

claunia added the difficulty: easy, OCR, and Hacktoberfest labels 2026-01-29 16:46:43 +00:00

@cfsmp3 commented on GitHub (Dec 29, 2019):

Feel free to fix :-)

char* probe_tessdata_location(int lang_index)

in ocr.c


@NilsIrl commented on GitHub (Dec 29, 2019):

This environment variable isn't documented, so I only found out about it by reading ocr.c.

It should also be documented.


@cfsmp3 commented on GitHub (Dec 30, 2019):

TESSDATA_PREFIX is a tesseract environment variable, not ours (even though we use it).


@NilsIrl commented on GitHub (Dec 30, 2019):

> TESSDATA_PREFIX is a tesseract environment variable, not ours (even though we use it).

Yes, but how is a user supposed to know they can use it? In the end, ccextractor implements it, so I believe it should be documented.


@cfsmp3 commented on GitHub (Dec 31, 2019):

> > TESSDATA_PREFIX is a tesseract environment variable, not ours (even though we use it).
>
> Yes, but how is a user supposed to know they can use it? In the end, ccextractor implements it, so I believe it should be documented.

Go ahead :-)


@NilsIrl commented on GitHub (Jan 2, 2020):

This is a regression from this line:

https://github.com/CCExtractor/ccextractor/commit/5dbbe654f05f1b3e5fcdfd6633e6258bed216345#diff-06df1969161cf1684b04764b42380ce6R52

@cfsmp3 commented on GitHub (Jan 2, 2020):

I'll let @anshul1912 comment and decide since it's his code and he knows what he's doing :-)


@cfsmp3 commented on GitHub (Jan 10, 2020):

@NilsIrl did you test with both tesseract 3 and 4?


@NilsIrl commented on GitHub (Jan 10, 2020):

> @NilsIrl did you test with both tesseract 3 and 4?

Yes.


@anshul1912 commented on GitHub (Jan 15, 2020):

I think you will break version 4 on Ubuntu with it; it may work on NixOS but break Ubuntu.
What is the location of tessdata in your NixOS installation? If you are only using the TESSDATA_PREFIX environment variable, then, as you can see in the function, the environment variable is given first priority.
If NixOS has a default tessdata location but the environment variable is not set, then you must add that location to the probe function.
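
The priority order described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `probe_tessdata_location` code: the names `dir_exists` and `probe_prefix` are hypothetical, the existence check uses POSIX `stat`, and the real function may test for language files rather than just a directory.

```c
#include <stddef.h>
#include <sys/stat.h>

/* Returns 1 if `path` names an existing directory (POSIX stat), else 0. */
static int dir_exists(const char *path)
{
    struct stat st;
    return path != NULL && stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Hypothetical probe: the environment value wins if it is usable,
 * otherwise the fallback locations are tried in order.
 * Returns NULL when nothing matches. */
static const char *probe_prefix(const char *env_prefix,
                                const char *const *fallbacks, size_t count)
{
    if (dir_exists(env_prefix))
        return env_prefix;
    for (size_t i = 0; i < count; i++) {
        if (dir_exists(fallbacks[i]))
            return fallbacks[i];
    }
    return NULL;
}
```

A caller would pass `getenv("TESSDATA_PREFIX")` as `env_prefix` and a list of distribution-specific default directories as `fallbacks`; supporting a new default location then means adding one entry to that list.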


@Rahul-2k4 commented on GitHub (Dec 6, 2025):

Hi! I’ve prepared a fix for this issue.
It adds automatic normalization to TESSDATA_PREFIX so it no longer requires a trailing slash.
The solution is cross-platform, safe (uses a static buffer), and backward compatible.
Tests pass for all cases.

I’ll open a PR shortly.


@cfsmp3 commented on GitHub (Dec 21, 2025):

This issue was fixed in PR #1674, which was merged on 2025-03-13. The search_language_pack() function in ocr.c now automatically normalizes the path by adding a trailing slash if missing before appending tessdata/.
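
The kind of normalization described here could look roughly like the sketch below. It is not the code from PR #1674: the helper name `build_tessdata_path` and its signature are hypothetical, and only the forward slash is treated as a separator.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "<prefix>/tessdata/" into `out`, adding the
 * separator only when a non-empty prefix does not already end with '/'.
 * An empty prefix yields the relative path "tessdata/".
 * Returns 0 on success, -1 if the result would not fit in `out`. */
static int build_tessdata_path(const char *prefix, char *out, size_t out_size)
{
    size_t len = strlen(prefix);
    const char *sep = (len > 0 && prefix[len - 1] != '/') ? "/" : "";
    int written = snprintf(out, out_size, "%s%stessdata/", prefix, sep);

    return (written < 0 || (size_t)written >= out_size) ? -1 : 0;
}
```

With this approach both `/usr/share` and `/usr/share/` produce `/usr/share/tessdata/`, so existing setups that already include the trailing slash keep working.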

Reference: starred/ccextractor#534