mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-08 21:24:12 +00:00
Sad situation with Windows + OCR #577
Originally created by @cfsmp3 on GitHub (Apr 12, 2020).
While testing a previous ticket regarding hardsubx on Windows, on master. Running this exact version, just compiled:
First, the reports, as usual about eng.traineddata couldn't suck more.
Seriously, would it kill us to tell the user WHERE we expect that file to be present?
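A minimal sketch of the kind of lookup that tells the user where it searched. This is illustrative Python, not CCExtractor's actual C code: the function name `find_traineddata` and the exact message wording are assumptions made for this example.

```python
import os


def find_traineddata(lang, search_dirs):
    """Return the first existing <lang>.traineddata, or raise listing every directory tried."""
    filename = lang + ".traineddata"
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    # Nothing found: report every directory that was checked (hypothetical wording).
    looked = ", ".join(search_dirs)
    raise FileNotFoundError(
        f"{filename} not found. I looked in these directories: [ {looked} ]"
    )
```

The point of the design is that the list of directories used for the search is the same list printed in the error, so the message cannot drift from what the code actually does.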
OK So since I didn't remember how this worked at all I started looking into the code a bit. We do look at TESSDATA_PREFIX, among other places such as /usr/share. Wait what? This is Windows! Why are we looking there? Also I see lots of / as path separator, but Windows uses \. Is this portable at all?
OK, so I set the env variable:
Still not working.
Problem now is that I'm missing a \ at the end of the env variable.
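A sketch of how a path could be built so that a missing (or extra) trailing separator does not matter. This is illustrative Python using the standard `ntpath` module to get Windows path semantics on any platform; the helper name `path_under_prefix` is hypothetical and does not exist in CCExtractor.

```python
import ntpath


def path_under_prefix(prefix, relative):
    """Join a user-supplied prefix and a relative name using Windows path rules."""
    # ntpath.join only inserts a separator when the prefix lacks one, and
    # ntpath.normpath converts "/" to "\" and collapses duplicate separators.
    return ntpath.normpath(ntpath.join(prefix, relative))
```

With this approach, `C:\tess`, `C:\tess\` and `C:/tess/` all resolve to the same file path, so the user does not have to guess which form the program expects.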
OK so let's set it correctly:
So now apparently it starts at least, but then it crashes.
We just need to work on OCR + Windows.
In my opinion, at the very least:
Labelling HARD because we seem to be unable to fix it once and for all.
cc: @ShraxO1
@NilsIrl commented on GitHub (Apr 12, 2020):
#1170
@NilsIrl commented on GitHub (Apr 12, 2020):
Jokes aside, try PR #1170 on Windows, it might solve the problem. Also play around with deleting the data directory. You might come across a message from tesseract that says where it looked for the data and didn't find it. If this message is not to your liking, it can be modified/suppressed using the dup2 syscall.
EDIT: tesseract may have an API which wouldn't require the use of dup2.
@apovalyaev commented on GitHub (Apr 23, 2020):
The issue might be the result of an improperly built solution. It should not appear if the solution is built properly. I checked it with VS2015 and VS2019 (using the default SDKs) and did not run into this kind of issue.
@apovalyaev commented on GitHub (Apr 26, 2020):
Now, after "Update VS project build settings", we can use the following steps (which automatically take the latest version of tesseract; as of now that is tesseract-4.1.1).
Build steps which use the latest version of Tesseract:
2.1) git clone https://github.com/Microsoft/vcpkg.git
> cd vcpkg
PS> .\bootstrap-vcpkg.bat
2.2) Modify vcpkg/triplets/x86-windows.cmake
set(VCPKG_CRT_LINKAGE static)
set(VCPKG_LIBRARY_LINKAGE static)
NOTE: Now it is tesseract-4.1.1
vcpkg install tesseract:x86-windows
vcpkg integrate install
So, further steps:
@cfsmp3 commented on GitHub (Apr 26, 2020):
I'd say there's something missing here. I followed your instructions, no errors (good), but the binary is still using tesseract-4.00dev, which makes sense - why would it pick any other version if that's the one we have inside the project?
@apovalyaev commented on GitHub (Apr 26, 2020):
Let's check if we are on the same page:
If the project compiled fine before you issued the "vcpkg install tesseract:x86-windows" command, it means you already had some other copy of tesseract installed. It makes sense to remove it;
@canihavesomecoffee commented on GitHub (Apr 26, 2020):
@apovalyaev Tesseract is included in those "cppan" dependencies; refer to https://github.com/CCExtractor/ccextractor/tree/master/windows/libs/lib/release-lib
Refer to https://github.com/CCExtractor/ccextractor/pull/592 for the PR, and maybe @Izaron could explain a bit if needed?
@apovalyaev commented on GitHub (Apr 26, 2020):
I've taken a look at #592 and can see that tesseract was manually compiled.
I can see two ways:
Some others? What would be the best fit?
@cfsmp3 commented on GitHub (Apr 27, 2020):
Both solutions are OK. Personally I favor "the least required steps when starting from scratch".
As a developer, I prefer not having to install a lot of things to build something for the first time. That makes me more likely to contribute to a project than if I have to install a whole toolchain to get to a binary.
For the end user, we should strive to provide a self-contained .msi that includes any library we use. Possibly including tesseract DLLs (so the user can replace them with new versions if he wants) would be better than statically linking tesseract.
cppan might have been the most convenient thing when it was added 3 years ago; it might not be the best solution today.
Since you are doing it, I'd say do whatever you prefer that works. If you get CCExtractor to report 4.1.1 (or whatever the current version is) and actually work, that's a better situation than what we have now.
@canihavesomecoffee is doing the GH actions integration (so we can get a full binary from GH, instead of me manually building releases). It would be great to have this working again.
@apovalyaev commented on GitHub (Apr 27, 2020):
To make things work automatically, the build should provide both the tesseract-ocr libraries and compatible tessdata (this is what this issue is about). Hence, when building the solution/package, we need to
(A) Replace the outdated "cppan" libraries with newly rebuilt versions;
(B) Add a tessdata directory to the repository cloned from https://github.com/ccextractor;
As for Step (A)...
Below are the steps to make the project use vcpkg-supplied packages instead of the precompiled "cppan" ones (in other words, all the libraries in the directories windows\libs\lib\release-lib and windows\libs\lib\debug-lib).
It is all about "Debug-Full" and "Release-Full" build modes:
windows\libs\lib\release-lib
windows\libs\lib\debug-lib
and update the project's "additional libraries" settings to remove the corresponding library dependencies (those "cppan" libraries)
vcpkg integrate remove
Then:
2.1) vcpkg export --zip tesseract:x86-windows
NOTE: of course, this assumes that the appropriate packages are already installed (see the vcpkg commands mentioned previously).
This command automatically creates a .zip archive including all the appropriate .lib files. The name of this archive will be something like "vcpkg-export-....zip" (the name can be read from the vcpkg command output).
2.2) Extract the archive to some appropriate location:
vcpkg-export-20200427-142748\installed\x86-windows\lib
2.3) Copy all libraries from "installed\x86-windows\lib" subdirectory to
ccextractor windows\libs\lib\release-lib (for release).
The same things for debug ...
${vcpkg-export-directory}\installed\x86-windows\debug\lib -> ccextractor\windows\libs\lib\debug-lib
2.4) Update project "additional libraries" settings accordingly.
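Steps 2.2 and 2.3 above could be scripted roughly as follows. This is a sketch under stated assumptions, not part of the CCExtractor build: the function name `copy_vcpkg_libs` and its parameters are hypothetical, the export archive is assumed to be already extracted, and only files ending in `.lib` are copied. The directory layout is the one described in the steps above.

```python
import shutil
from pathlib import Path


def copy_vcpkg_libs(export_dir, ccx_root, triplet="x86-windows"):
    """Copy .lib files from an extracted vcpkg export into the CCExtractor tree.

    Returns a dict with the number of files copied per configuration.
    """
    installed = Path(export_dir) / "installed" / triplet
    lib_root = Path(ccx_root) / "windows" / "libs" / "lib"
    # Source and destination directories for each build configuration.
    mapping = {
        "release": (installed / "lib", lib_root / "release-lib"),
        "debug": (installed / "debug" / "lib", lib_root / "debug-lib"),
    }
    copied = {}
    for config, (src, dst) in mapping.items():
        if not src.is_dir():
            raise FileNotFoundError(f"expected vcpkg library directory: {src}")
        dst.mkdir(parents=True, exist_ok=True)
        libs = sorted(src.glob("*.lib"))
        for lib in libs:
            shutil.copy2(lib, dst / lib.name)
        copied[config] = len(libs)
    return copied
```

The script does not remove the old libraries or touch the project settings, so step 2.4 still has to be done by hand.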
I will prepare a pull request with: (1) newly rebuilt libraries (replacing the old "cppan" ones); (2) a tessdata subdirectory added to the ccextractor project.
@mirh commented on GitHub (Mar 28, 2022):
If TESSDATA_PREFIX isn't set, the program will just look into its root folder.
And once you throw in the age appropriate models you are good. Not a big deal really.
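The lookup order described here (TESSDATA_PREFIX first, then the program's own folder) might be sketched like this. It is an illustration in Python, not the actual CCExtractor logic: the function name `tessdata_candidates` and its parameters are hypothetical, and the Unix-only `/usr/share` entry is included only to show how such a location can be skipped on Windows. The sketch does not address whether the models sit directly in these directories or in a `tessdata` subdirectory.

```python
import ntpath
import posixpath


def tessdata_candidates(env, exe_dir, is_windows):
    """Return the base directories to search, in priority order.

    env: mapping of environment variables (for example os.environ)
    exe_dir: directory containing the running executable
    is_windows: selects the path rules and whether Unix-only locations apply
    """
    path = ntpath if is_windows else posixpath
    candidates = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        # normpath strips a trailing separator, so both spellings are accepted.
        candidates.append(path.normpath(prefix))
    # Fall back to the folder the program lives in.
    candidates.append(path.normpath(exe_dir))
    if not is_windows:
        # Unix-only location (illustrative); meaningless on Windows.
        candidates.append("/usr/share")
    return candidates
```

Passing the environment and platform in as arguments keeps the function easy to test on any machine, regardless of the operating system it actually runs on.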
The problem if any is that you crash badly and without explanations after "FFMpeg Media Information".
@prateekmedia commented on GitHub (Mar 16, 2023):
@cfsmp3 What do you think of this, now that the Windows build system and CI are fixed?
@cfsmp3 commented on GitHub (Mar 17, 2023):
I think we're still missing the issue with the trained data file. If it's not found, rather than "Not found!" it should say:
"Not found. I looked in these directories: [ xxxx, xxxx, xxxx ]"
@cfsmp3 commented on GitHub (Dec 26, 2025):
0.96.2 has everything - just checked on a Windows 11 VM, I was able to process our hardsub sample from SP. Finally! I'm closing this!