mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-03 21:23:48 +00:00
[BUG] dvblang option doesn't work #536
Originally created by @hamelg on GitHub (Dec 29, 2019).
CCExtractor detailed version info
Version: 0.88
Git commit: bc3d729e30a751feb9b854a54c085f0e81a99134
Compilation date: 2019-12-25
File SHA256: Could not open file
Libraries used by CCExtractor
Tesseract Version: 4.1.1
Leptonica Version: leptonica-1.78.0
libGPAC Version: 0.7.2-DEV
zlib: 1.2.11
utf8proc Version: 2.2.0
protobuf-c Version: 1.3.1
libpng Version: 1.6.35
FreeType
libhash
nuklear
libzvbi
In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):
My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):
Necessary information
Issue description
Some French DVB channels don't use ISO 639-2 to specify the language of the subtitle stream. Here is an example:
On the subtitle streams, the language code should be "fra", not "fre".
The following command fails to find the subtitle stream:
It fails because the code "fre" doesn't exist in the language array (see lib_ccx/ccx_common_constants.c).
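As a rough illustration of how a lookup could tolerate both spellings, here is a minimal sketch in C. ISO 639-2 defines both a bibliographic and a terminologic code for some languages ("fre"/"fra" for French, for instance). The table and the function name canonical_lang below are hypothetical and are not taken from ccextractor's source; the real language array may be organised differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical alias table (NOT ccextractor's actual data):
 * ISO 639-2 bibliographic code -> terminologic code. */
struct lang_alias {
    const char *alias;
    const char *canonical;
};

static const struct lang_alias lang_aliases[] = {
    { "fre", "fra" }, /* French */
    { "ger", "deu" }, /* German */
    { "dut", "nld" }, /* Dutch */
};

/* Return the canonical spelling of a three-letter code.
 * Codes without a known alias are returned unchanged; NULL stays NULL. */
const char *canonical_lang(const char *code)
{
    if (code == NULL)
        return NULL;
    for (size_t i = 0; i < sizeof lang_aliases / sizeof lang_aliases[0]; i++) {
        if (strcmp(code, lang_aliases[i].alias) == 0)
            return lang_aliases[i].canonical;
    }
    return code;
}
```

With something like this, both the value given on the command line and the code read from the stream could be passed through canonical_lang before they are compared, so "fre" and "fra" would select the same stream.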
@gauravahlawat81 commented on GitHub (Jan 13, 2020):
Can you please give me some video samples regarding this issue ?
@hamelg commented on GitHub (Jan 13, 2020):
The link is valid for 30 days.
http://dl.free.fr/k2j8OpZJF
@NilsIrl commented on GitHub (Jan 14, 2020):
Only -dvblang is relevant to the problem.
@hamelg commented on GitHub (Jan 16, 2020):
The fix doesn't work.
Now the -ocrlang option has no effect: it doesn't select the fra.traineddata file despite "-ocrlang fra".
@NilsIrl commented on GitHub (Jan 16, 2020):
Tested and it worked.
It seems ccextractor is unable to find the OCR data. You can set the TESSDATA_PREFIX environment variable to select another place for it to be found.
For example, here is the command I run:
@hamelg commented on GitHub (Jan 16, 2020):
I tested again, but it definitely doesn't work. The file is in the right place and TESSDATA_PREFIX makes no difference.
@NilsIrl commented on GitHub (Jan 16, 2020):
It does indeed seem that something is broken, as there is no reason ccextractor shouldn't be able to find the file by itself (in /usr/share/tessdata/).
But anyway, this isn't supposed to work. TESSDATA_PREFIX should be set to the directory above tessdata.
Try like this:
@cfsmp3 commented on GitHub (Jan 16, 2020):
Why is it trying to read eng.traineddata if you specified fra? That's definitely broken...
@cfsmp3 commented on GitHub (Jan 16, 2020):
@NilsIrl try deleting your file eng.traineddata (or rename it to fra.traineddata) and see if it still works for you.
@NilsIrl commented on GitHub (Jan 16, 2020):
I've tested that ccextractor is using fra.tessdata, but let me check again.
@hamelg commented on GitHub (Jan 16, 2020):
ditto, same result
@NilsIrl commented on GitHub (Jan 16, 2020):
It seems to have been broken before.
@cfsmp3 commented on GitHub (Jan 16, 2020):
Well, since both of you, @NilsIrl and @hamelg, are around right now, it seems like this can be solved once and for all really quickly.
By the way @hamelg, maybe running ccextractor with strace and looking for open() calls will tell us exactly where tesseract is actually looking for the file (as opposed to what we think it's doing).
@NilsIrl commented on GitHub (Jan 16, 2020):
280b4308f7 is broken for me as well (the last PR before CGI and v0.88).
@hamelg commented on GitHub (Jan 16, 2020):
@NilsIrl commented on GitHub (Jan 16, 2020):
I will not have enough time to fix it today. Could you try on 0.88 to confirm that -ocrlang doesn't work there as well?
@cfsmp3 commented on GitHub (Jan 16, 2020):
What I see (just visually inspecting the source code) is that we attempt to switch to English if we can't find the selected language:
https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ocr.c (line 169)
The function char* probe_tessdata_location(int lang_index) expects an integer that is used as an index into an array... that's probably one of the problems to begin with.
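To make the idea concrete, here is a minimal sketch of a probe that takes the language as a string rather than an array index. It is not the code in ocr.c: the function names probe_tessdata_dir and has_traineddata, and the list of directories, are assumptions made for illustration. It also assumes, as suggested elsewhere in this thread, that TESSDATA_PREFIX names the directory above tessdata.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if <dir>/tessdata/<lang>.traineddata can be opened for reading. */
static int has_traineddata(const char *dir, const char *lang)
{
    char path[1024];
    int n = snprintf(path, sizeof path, "%s/tessdata/%s.traineddata", dir, lang);
    if (n < 0 || (size_t)n >= sizeof path)
        return 0; /* formatting error or path too long */

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return 0;
    fclose(f);
    return 1;
}

/* Hypothetical string-based probe: return the first candidate directory
 * that holds data for `lang`, or NULL if none does. The directory list
 * is an assumption, not ccextractor's actual search order. */
const char *probe_tessdata_dir(const char *lang)
{
    static const char *const fixed_dirs[] = { "/usr/share", "/usr/local/share", "." };

    /* Assumed convention: TESSDATA_PREFIX is the parent of tessdata/. */
    const char *env = getenv("TESSDATA_PREFIX");
    if (env != NULL && has_traineddata(env, lang))
        return env;

    for (size_t i = 0; i < sizeof fixed_dirs / sizeof fixed_dirs[0]; i++) {
        if (has_traineddata(fixed_dirs[i], lang))
            return fixed_dirs[i];
    }
    return NULL;
}
```

A caller could try the requested language first, e.g. probe_tessdata_dir("fra"), and only retry with "eng" after printing a warning, so that a silent switch to English is easier to notice.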
@NilsIrl commented on GitHub (Jan 16, 2020):
Changing probe_tessdata_location to take a const char * and removing probe_tessdata_location_string is a good thing, I think.
@cfsmp3 commented on GitHub (Jan 16, 2020):
Well, get it working for everybody and then I'll be OK with your solution, whatever it is :-) As soon as you, @anshul1912 and @hamelg all agree that it's working, I'll merge (well, after testing on Windows myself).
@NilsIrl commented on GitHub (Jan 19, 2020):
@hamelg with the latest PR does it work?
@hamelg commented on GitHub (Jan 20, 2020):
Yes, it works fine now.
I just get a wrong message at exit:
No captions were found in input.
even though it found all the subtitles.
Thanks !
@NilsIrl commented on GitHub (Jan 20, 2020):
Okay I will look into that