[Feature] Unable to properly extract (ocr) dvb_subtitles from .mkv container #444

Closed
opened 2026-01-29 16:44:04 +00:00 by claunia · 8 comments

Originally created by @agrafiodata on GitHub (Aug 27, 2018).

Please prefix your issue with one of the following: [BUG], [PROPOSAL], [QUESTION].

CCExtractor version (using the --version parameter preferably) : X.X

In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):

  • [X] I have read and understood the contributors guide.
  • [X] I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • [X] I have checked that the issue I'm posting isn't already reported.
  • [X] I have checked that the issue I'm posting isn't already solved and no duplicates exist in closed issues and in opened issues
  • [X] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
  • [X] I have used the latest available version of CCExtractor to verify this issue exists.

My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):

  • [X] I have used CCExtractor just a couple of times.

Necessary information

  • Is this a regression (did it work before)? [X] NO | [ ] YES - please specify the last known working version

  • What platform did you use? [ ] Windows - [X] Linux - [ ] Mac

  • What were the used arguments?
    Tried both without arguments and -codec dvbsub -dvblang eng -ocrlang eng. Same outputs for both.

Video links
https://www.dropbox.com/sh/7wmowjdyy2sv4n6/AADACv4q16SmC2MgAE_nqz3Da?dl=0

Additional information

Hi,
I am having an issue while trying to extract subtitles from an mkv container. I am using the default build from the ccextractor master branch with OCR enabled (git clone of master, built via the supplied linux/build script). The video has a single dvb_sub subtitle stream. However, CCExtractor shows no signs of actually trying to perform OCR, and the produced .srt file appears to contain binary data instead of the extracted text. If I convert this file to .ts (via ffmpeg -i file.mkv -c copy -map 0 file.ts) and run CCExtractor on the result, it extracts the subtitles properly. The same problem occurs in the opposite direction: CCExtractor extracts from a recorded .ts file but fails to do so from the .mkv converted from it.

PS: I have marked this issue as a [BUG] because I believe it arises from a problem in codec identification for mkv containers. However, I have not read the current implementation, so it could technically be a limitation of the existing code, which would make this a feature request.
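
For anyone hitting the same limitation, the remux workaround described above can be scripted. The following is only a minimal sketch: the ffmpeg and ccextractor arguments are the ones quoted in this report, while the helper names, the `looks_like_text` heuristic, and the assumption that CCExtractor writes its output next to the input with an .srt extension are illustrative and not taken from the project.

```python
import subprocess
from pathlib import Path


def remux_command(mkv_path: str) -> list[str]:
    """Build the ffmpeg call quoted in the report: copy all streams into a .ts file."""
    ts_path = str(Path(mkv_path).with_suffix(".ts"))
    return ["ffmpeg", "-i", mkv_path, "-c", "copy", "-map", "0", ts_path]


def extract_command(ts_path: str) -> list[str]:
    """Build the CCExtractor call with the arguments listed in the report."""
    return ["ccextractor", ts_path, "-codec", "dvbsub", "-dvblang", "eng", "-ocrlang", "eng"]


def looks_like_text(data: bytes) -> bool:
    """Hypothetical sanity check: treat output as text if it is valid UTF-8 without NUL bytes."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def extract_via_ts(mkv_path: str) -> bool:
    """Remux to .ts, run CCExtractor on it, and report whether the .srt looks like text."""
    remux = remux_command(mkv_path)
    ts_path = remux[-1]
    subprocess.run(remux, check=True)
    # The exit status is not checked here; the output file is inspected instead.
    subprocess.run(extract_command(ts_path))
    # Assumption: the subtitle file is written beside the input with an .srt extension.
    srt_path = Path(ts_path).with_suffix(".srt")
    return srt_path.exists() and looks_like_text(srt_path.read_bytes())
```

Calling `extract_via_ts("file.mkv")` requires both ffmpeg and ccextractor on the PATH. The text check is deliberately crude and only meant to catch the "binary data in the .srt" symptom described above.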

claunia added the GSoC-related label 2026-01-29 16:44:04 +00:00

@agrafiodata commented on GitHub (Aug 27, 2018):

Posting detailed version parameters that I forgot to post above

CCExtractor detailed version info
Version: 0.87
Git commit: 45ed8456eea9b30745ac71aca0a14cbb44f4d811
Compilation date: 2018-08-27
File SHA256: 56628a1805a3f2d0925f1055905b649c3899c08e9051b1805eb13da1df88b695
Libraries used by CCExtractor
Tesseract Version: 3.04.01
Leptonica Version: leptonica-1.74.1
libGPAC Version: 0.7.2-DEV
zlib: 1.2.11
utf8proc Version: 2.1.0
protobuf-c Version: 1.1.1
libpng Version: 1.6.34
FreeType
libhash
nuklear
libzvbi


@anshul1912 commented on GitHub (Aug 29, 2018):

Yes, your guess is correct: it's a feature request. CCExtractor's Matroska support does not handle DVB subtitles yet.


@agrafiodata commented on GitHub (Aug 31, 2018):

Is there a plan to add support for it in the foreseeable future?


@cfsmp3 commented on GitHub (Aug 31, 2018):

No. But no plans doesn't mean it won't ever be added. If some
developer comes along and does it we'll be happy to integrate, or if
someone pays us to do it.

But if no one in the core team needs it for himself then it's not
going to happen.


@ashutoshmishraji commented on GitHub (Mar 22, 2019):

Hi @cfsmp3
I am a GSoC 2019 participant. Can you give me some guidance on solving this bug?


@cfsmp3 commented on GitHub (Mar 22, 2019):

Sure, but please don't use a GitHub issue to ask unrelated things. Please check
out our website and join our Slack.



@thelastpolaris commented on GitHub (Mar 22, 2019):

Hey @ashutoshmishraji. I've almost finished adding this feature and will open a pull request soon. Sorry for not stating that earlier.


@ashutoshmishraji commented on GitHub (Mar 22, 2019):

ok @cfsmp3 @thelastpolaris


Reference: starred/ccextractor#444