[BUG] dvblang option doesn't work #536

Closed
opened 2026-01-29 16:46:48 +00:00 by claunia · 22 comments

Originally created by @hamelg on GitHub (Dec 29, 2019).

CCExtractor detailed version info
Version: 0.88
Git commit: bc3d729e30a751feb9b854a54c085f0e81a99134
Compilation date: 2019-12-25
File SHA256: Could not open file
Libraries used by CCExtractor
Tesseract Version: 4.1.1
Leptonica Version: leptonica-1.78.0
libGPAC Version: 0.7.2-DEV
zlib: 1.2.11
utf8proc Version: 2.2.0
protobuf-c Version: 1.3.1
libpng Version: 1.6.35
FreeType
libhash
nuklear
libzvbi

In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):

  • [ X] I have read and understood the contributors guide.
  • [ X] I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • [ X] I have checked that the issue I'm posting isn't already reported.
  • [ X] I have checked that the issue I'm posting isn't already solved and that no duplicates exist in closed issues or in open issues.
  • [ X] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
  • [ X] I have used the latest available version of CCExtractor to verify this issue exists.

My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):

  • [ X] I have used CCExtractor just a couple of times.

Necessary information

  • Is this a regression (did it work before)? [X ] NO | [ ] YES - please specify the last known working version
  • What platform did you use? [ ] Windows - [X ] Linux - [ ] Mac

Issue description

Some French DVB channels don't use the ISO 639-2/T code ("fra") to specify the language of the subtitle streams; they use the ISO 639-2/B code ("fre") instead. Here is an example:

Input #0, mpegts, from '1008_20191216001500.ts':
  Duration: 00:04:59.97, start: 1.400000, bitrate: 4007 kb/s
  Program 1 
    Metadata:
      service_name    : Service01
      service_provider: FFmpeg
    Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709, top first), 1920x1080 [SAR 1:1 DAR 16:9], 25 fps, 25 tbr, 90k tbn, 50 tbc
    Stream #0:1[0x101](fre): Audio: eac3 ([135][0][0][0] / 0x0087), 48000 Hz, stereo, fltp, 128 kb/s
    Stream #0:2[0x102](qaa): Audio: eac3 ([135][0][0][0] / 0x0087), 48000 Hz, stereo, fltp, 128 kb/s
    Stream #0:3[0x103](fra): Audio: eac3 ([135][0][0][0] / 0x0087), 48000 Hz, stereo, fltp, 128 kb/s (visual impaired) (descriptions)
    Stream #0:4[0x104](fre): Subtitle: dvb_subtitle ([6][0][0][0] / 0x0006) (hearing impaired)
    Stream #0:5[0x105](fre): Subtitle: dvb_subtitle ([6][0][0][0] / 0x0006)
    Stream #0:6[0x106]: Data: bin_data ([6][0][0][0] / 0x0006)

On the subtitle streams, the language code should be "fra", and not "fre".

The following command fails to find the subtitle stream:

$ ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra mythtv/1008_20191216001500/1008_20191216001500.ts
...
Analyzing data in general mode
Ignoring stream language 'und' not equal to dvblang 'fre'
Ignoring stream language 'und' not equal to dvblang 'fre'
...
No captions were found in input.

It fails because the code "fre" doesn't exist in the language array (see lib_ccx/ccx_common_constants.c).
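
For context, ISO 639-2 defines two three-letter codes for French: the bibliographic code "fre" and the terminology code "fra". A lookup table that lists only one of them will miss streams tagged with the other. The sketch below only illustrates the aliasing idea in shell; the function name `to_iso639_2t` is hypothetical, the table covers just a handful of pairs, and none of this is ccextractor's actual code, which lives in the C source.

```shell
# Hypothetical helper: map an ISO 639-2/B (bibliographic) code to its
# ISO 639-2/T (terminology) equivalent. Only a few B/T pairs are listed;
# any other code is passed through unchanged.
to_iso639_2t() {
    case "$1" in
        fre) echo fra ;;   # French
        ger) echo deu ;;   # German
        dut) echo nld ;;   # Dutch
        cze) echo ces ;;   # Czech
        gre) echo ell ;;   # Greek
        *)   echo "$1" ;;  # already a T code, or not in this short table
    esac
}
```

With this, `to_iso639_2t fre` and `to_iso639_2t fra` both print `fra`, so a comparison done on the normalized form would treat the two spellings as the same language.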


@gauravahlawat81 commented on GitHub (Jan 13, 2020):

Could you please provide some video samples that reproduce this issue?


@hamelg commented on GitHub (Jan 13, 2020):

The link is valid for 30 days.
http://dl.free.fr/k2j8OpZJF


@NilsIrl commented on GitHub (Jan 14, 2020):

Only -dvblang is relevant to the problem


@hamelg commented on GitHub (Jan 16, 2020):

The fix doesn't work.
Now the -ocrlang option has no effect:

$ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
...
eng.traineddata not found! No Switching Possible
...

It doesn't select the fra.traineddata file despite "-ocrlang fra".


@NilsIrl commented on GitHub (Jan 16, 2020):

Tested and it worked.

It seems ccextractor is unable to find the OCR data. You can set the TESSDATA_PREFIX environment variable to point it at another location.

For example, here is the command I ran:

$ TESSDATA_PREFIX=/nix/store/9yawzjj82bib4dr9x7y340w10c3k319y-tesseract-3.05.00/share/ ./ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra files/1008_20191216001500.ts

@hamelg commented on GitHub (Jan 16, 2020):

I tested again, and it definitely doesn't work. The file is in the right place, and setting TESSDATA_PREFIX makes no difference.


$ ls -l /usr/share/tessdata/fra.traineddata 
-rw-r--r-- 1 root root 14213351 Nov 11  2018 /usr/share/tessdata/fra.traineddata
$ TESSDATA_PREFIX=/usr/share/tessdata/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
...
eng.traineddata not found! No Switching Possible
...
No captions were found in input.

@NilsIrl commented on GitHub (Jan 16, 2020):

Something does indeed seem to be broken: there is no reason ccextractor shouldn't be able to find the file by itself (in /usr/share/tessdata/).

In any case, that invocation isn't supposed to work: TESSDATA_PREFIX should be set to the directory above tessdata.

Try it like this:

$ TESSDATA_PREFIX=/usr/share/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
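
The two commands in this exchange differ only in whether TESSDATA_PREFIX names the tessdata directory itself or its parent. As a quick diagnostic, the following sketch reports which of the two layouts actually contains the requested language file. It is not part of ccextractor, and the function name `check_tessdata_prefix` is made up for illustration.

```shell
# Hypothetical diagnostic: given a prefix and a language code, report where
# <lang>.traineddata is found relative to that prefix.
#   "parent"  -> <prefix>/tessdata/<lang>.traineddata exists
#   "direct"  -> <prefix>/<lang>.traineddata exists
#   "missing" -> neither exists (the function then returns non-zero)
check_tessdata_prefix() {
    prefix=${1%/}   # drop one trailing slash, if any
    lang=$2
    if [ -f "$prefix/tessdata/$lang.traineddata" ]; then
        echo parent
    elif [ -f "$prefix/$lang.traineddata" ]; then
        echo direct
    else
        echo missing
        return 1
    fi
}
```

Given the `ls` output hamelg posted, `check_tessdata_prefix /usr/share/ fra` should print `parent` and `check_tessdata_prefix /usr/share/tessdata/ fra` should print `direct` on that machine. Tesseract's own convention for this variable differs between the 3.x series (parent of tessdata) and the 4.x series (the tessdata directory itself), which may explain why the two setups in this thread behave differently. Whether ccextractor's own probing follows either convention is not established here.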

@cfsmp3 commented on GitHub (Jan 16, 2020):

Why is it trying to read eng.traineddata if you specified fra? That's definitely broken...

On Thu, Jan 16, 2020 at 1:34 PM hamelg notifications@github.com wrote:

> I tested again, and it definitely doesn't work. The file is in the right place, and setting TESSDATA_PREFIX makes no difference.
>
> $ ls -l /usr/share/tessdata/fra.traineddata
> -rw-r--r-- 1 root root 14213351 Nov 11 2018 /usr/share/tessdata/fra.traineddata
> $ TESSDATA_PREFIX=/usr/share/tessdata/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
> ...
> eng.traineddata not found! No Switching Possible
> ...
> No captions were found in input.



@cfsmp3 commented on GitHub (Jan 16, 2020):

@NilsIrl try deleting your file eng.traineddata (or rename it to fra.traineddata) and see if it still works for you.


@NilsIrl commented on GitHub (Jan 16, 2020):

> try deleting your file eng.traineddata (or rename it to fra.traineddata) and see if it still works for you.

I've tested that ccextractor is using fra.traineddata, but let me check again.


@hamelg commented on GitHub (Jan 16, 2020):

> try like this:
>
> $ TESSDATA_PREFIX=/usr/share/ ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts

Ditto, same result.


@NilsIrl commented on GitHub (Jan 16, 2020):

It seems to have been broken before.


@cfsmp3 commented on GitHub (Jan 16, 2020):

Well, since both of you, @NilsIrl and @hamelg, are around right now, it seems this can be solved once and for all really quickly.

By the way @hamelg, maybe running ccextractor with strace and looking for open() calls will tell us exactly where tesseract is actually looking for the file (as opposed to what we think it's doing).


@NilsIrl commented on GitHub (Jan 16, 2020):

280b4308f7f7ff769fd9c3fe2b03a7259644bfdb is broken for me as well (it is the last PR before CGI and v0.88).


@hamelg commented on GitHub (Jan 16, 2020):

> By the way @hamelg, maybe running ccextractor with strace and looking for open() calls will tell us exactly where tesseract is actually looking for the file (as opposed to what we think it's doing).

$ strace -e file ~/tmp/ccextractor/linux/ccextractor -o sbt -out=spupng -dvblang fre -ocrlang fra 1008_20191216001500.ts
...
Opening file: 1008_20191216001500.ts
openat(AT_FDCWD, "1008_20191216001500.ts", O_RDONLY) = 3
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode
openat(AT_FDCWD, "./tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
openat(AT_FDCWD, "/usr/local/share/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tesseract-ocr/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/share/tesseract-ocr/4.00/tessdata/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = -1 ENOENT (No such file or directory)
eng.traineddata not found! No Switching Possible
openat(AT_FDCWD, "sbt", O_RDWR|O_CREAT|O_TRUNC, 0600) = 4
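
The strace output lists the directories that get probed, but it does not show which of them, if any, holds a given language file. The sketch below walks the same list and prints the first directory containing `<lang>.traineddata`. The helper name `find_traineddata` is hypothetical, and the default directory list is copied from the trace above rather than from ccextractor's source.

```shell
# Hypothetical helper: print the first directory that contains
# <lang>.traineddata. Extra arguments override the default directory list,
# which mirrors the paths seen in the strace output.
find_traineddata() {
    [ "$#" -ge 1 ] || return 2   # a language code is required
    lang=$1
    shift
    if [ "$#" -eq 0 ]; then
        set -- ./tessdata /usr/share/tessdata /usr/local/share/tessdata \
               /usr/share/tesseract-ocr/tessdata \
               /usr/share/tesseract-ocr/4.00/tessdata
    fi
    for dir in "$@"; do
        if [ -f "$dir/$lang.traineddata" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1   # not found in any probed directory
}
```

On the reporter's machine, `find_traineddata fra` would be expected to print `/usr/share/tessdata`, while `find_traineddata eng` would print nothing and return non-zero, assuming eng.traineddata is not installed in any of those directories. That combination would match the "eng.traineddata not found" message and suggests the lookup was done for "eng" rather than for the language passed to -ocrlang.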

@NilsIrl commented on GitHub (Jan 16, 2020):

I will not have enough time to fix it today. Could you try 0.88 to confirm that -ocrlang doesn't work there either?


@cfsmp3 commented on GitHub (Jan 16, 2020):

What I see (just from visually inspecting the source code) is that we attempt to switch to English if we can't find the selected language:

https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ocr.c (line 169)

That function, char* probe_tessdata_location(int lang_index), expects an integer that is used as an index into an array; that's probably one of the problems to begin with.


@NilsIrl commented on GitHub (Jan 16, 2020):

I think changing probe_tessdata_location to take a const char * and removing probe_tessdata_location_string would be a good thing.


@cfsmp3 commented on GitHub (Jan 16, 2020):

Well, get it working for everybody and then I'll be OK with your solution, whatever it is :-) As soon as you, @anshul1912 and @hamelg all agree that it's working, I'll merge (well, after testing on Windows myself).


@NilsIrl commented on GitHub (Jan 19, 2020):

@hamelg does it work with the latest PR?


@hamelg commented on GitHub (Jan 20, 2020):

Yes, it works fine now.
I just get a wrong message at exit:
"No captions were found in input."
even though it found all the subtitles.
Thanks!


@NilsIrl commented on GitHub (Jan 20, 2020):

> No captions were found in input.

Okay, I will look into that.

Reference: starred/ccextractor#536