Sad situation with Windows + OCR #576

Open
opened 2026-01-29 16:48:11 +00:00 by claunia · 0 comments
Owner

Originally created by @cfsmp3 on GitHub (Apr 12, 2020).

I was testing a previous ticket regarding hardsubx on Windows, on master, running this exact version, just compiled:

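C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>.\ccextractorwinfull.exe --version
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.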
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.88
        Git commit: Unknown
        Compilation date: Unknown
        File SHA256: 0a40241ddd609f5272f063d25e0f2c29c2192187aabd2592da98909463b88541
Libraries used by CCExtractor
        Tesseract Version: 4.00.00dev
        Leptonica Version: leptonica-1.74 (Dec 31 2016, 12:28:35) [MSC v.1900 LIB Debug x86]
        libGPAC Version: 0.7.2-DEV
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.35
        FreeType
        libhash
        nuklear
        libzvbi

First, the error reporting about eng.traineddata, as usual, couldn't suck more.

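C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>ccextractorwinfull.exe -hardsubx c:\Downloads\ITV1.mp4
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.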
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
HardsubX (Hard Subtitle Extractor) - Burned-in subtitle extraction subsystem
eng.traineddata not found! No Switching Possible

Seriously, would it kill us to tell the user WHERE we expect that file to be present?

OK, so since I didn't remember how this worked at all, I started looking into the code a bit. We do look at TESSDATA_PREFIX, among other places such as /usr/share. Wait, what? This is Windows! Why are we looking there? Also, I see lots of / as path separator, but Windows uses \. Is this portable at all?
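To make the point concrete, here is a minimal sketch of a search list that only includes /usr/share when we are not on Windows. The function name `tessdata_candidates` and the choice of "." as a fallback are assumptions for illustration, not what the current code does.

```c
#include <stddef.h>

/* Hypothetical helper: fill `dirs` with up to `max` candidate directories
   in priority order and return how many were written.
   `env_prefix` is whatever the caller read from TESSDATA_PREFIX (may be NULL). */
static size_t tessdata_candidates(const char *env_prefix,
                                  const char **dirs, size_t max)
{
    size_t n = 0;

    /* 1. The user's explicit choice always comes first. */
    if (env_prefix != NULL && *env_prefix != '\0' && n < max)
        dirs[n++] = env_prefix;

    /* 2. Current directory (an assumed fallback for this sketch). */
    if (n < max)
        dirs[n++] = ".";

#ifndef _WIN32
    /* 3. Unix-style system location; pointless on Windows, so skipped there. */
    if (n < max)
        dirs[n++] = "/usr/share";
#endif

    return n;
}
```

A caller would pass `getenv("TESSDATA_PREFIX")` as the first argument and then try each returned directory in order.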

OK, so I set the env variable:

set TESSDATA_PREFIX=C:\Downloads

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>dir c:\Downloads\tessdata
 Volume in drive C has no label.
 Volume Serial Number is 3A55-62AE

 Directory of c:\Downloads\tessdata

12-Apr-20  14:47    <DIR>          .
12-Apr-20  14:47    <DIR>          ..
12-Apr-20  14:46        23,466,654 eng.traineddata
               1 File(s)     23,466,654 bytes
               2 Dir(s)  92,672,598,016 bytes free

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>ccextractorwinfull.exe -hardsubx c:\Downloads\ITV1.mp4
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
HardsubX (Hard Subtitle Extractor) - Burned-in subtitle extraction subsystem
eng.traineddata not found! No Switching Possible

Still not working.
Problem now is that I'm missing a \ at the end of the env variable.
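A missing trailing separator is something we could tolerate ourselves. A rough sketch, assuming a hypothetical `join_path` helper (not an existing CCExtractor function) that adds a separator only when the directory doesn't already end in one:

```c
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#define PATH_SEP '\\'
#else
#define PATH_SEP '/'
#endif

/* Hypothetical helper: write "<dir><sep><name>" into `out`, inserting the
   separator only if `dir` does not already end in '/' or '\'.
   Returns 0 on success, -1 if the result does not fit in `out`. */
static int join_path(char *out, size_t out_size,
                     const char *dir, const char *name)
{
    size_t len = strlen(dir);
    int has_sep = len > 0 && (dir[len - 1] == '/' || dir[len - 1] == '\\');
    int n;

    if (len == 0 || has_sep)
        n = snprintf(out, out_size, "%s%s", dir, name);
    else
        n = snprintf(out, out_size, "%s%c%s", dir, PATH_SEP, name);

    /* snprintf reports the length it wanted; treat truncation as an error. */
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

With something like this, TESSDATA_PREFIX=C:\Downloads and TESSDATA_PREFIX=C:\Downloads\ would both resolve to the same tessdata directory.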

OK, so let's set it correctly:

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>set TESSDATA_PREFIX=C:\Downloads\

C:\Users\Carlos\source\repos\CCExtractor\ccextractor\windows\Debug-Full>ccextractorwinfull.exe -hardsubx c:\Downloads\ITV1.mp4
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
HardsubX (Hard Subtitle Extractor) - Burned-in subtitle extraction subsystem
lstm_recognizer_->DeSerialize(tessdata_manager.swap(), &fp):Error:Assert failed:in file C:\Users\HOME\.cppan\storage\src\42\9e\ba91\ccmain\tessedit.cpp, line 202

So now apparently it starts at least, but then it crashes.

We just need to work on OCR + Windows.

In my opinion, at the very least:

  1. Proper information to the user, including which paths are being searched. And where do the errors come from? Is it tesseract, or us? Are we bailing out before even giving tesseract a try?
  2. Update tesseract 4 to the latest version OR downgrade to 3. But using 4.00.00dev is ridiculous! It's buggy.
  3. Check if there's an officially compiled binary we can use. I remember we did our own thing a long time ago. Still needed?
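For point 1, a sketch of what I mean by proper information: if the file isn't found, print every path that was tried. The helper name `find_traineddata` and the message wording are assumptions for illustration. It uses / as separator for brevity and only checks that the file can be opened, so whatever tesseract does afterwards is out of its scope.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: look for <dir>/tessdata/<lang>.traineddata in each
   of the `ndirs` directories. Returns the index of the first directory
   where the file can be opened, or -1 after listing every attempted path
   on `log`, so the user knows where the file was expected. */
static int find_traineddata(const char *lang, const char *const *dirs,
                            size_t ndirs, FILE *log)
{
    char path[1024];
    size_t i;

    for (i = 0; i < ndirs; i++) {
        int n = snprintf(path, sizeof path, "%s/tessdata/%s.traineddata",
                         dirs[i], lang);
        FILE *f;

        if (n < 0 || (size_t)n >= sizeof path)
            continue; /* path too long for the buffer, skip this candidate */

        f = fopen(path, "rb");
        if (f != NULL) {
            fclose(f);
            return (int)i;
        }
    }

    /* Nothing found: say so, and say where we looked. */
    fprintf(log, "%s.traineddata not found. Paths searched:\n", lang);
    for (i = 0; i < ndirs; i++)
        fprintf(log, "  %s/tessdata/%s.traineddata\n", dirs[i], lang);
    return -1;
}
```

Since this message would come from us before tesseract is ever called, it would also answer the question of who is bailing out.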

Labelling HARD because we seem to be unable to fix it once and for all.

cc: @ShraxO1

claunia added the difficulty: hard and OCR labels 2026-01-29 16:48:11 +00:00