[BUG] unable to extract multiple lines from DVB-sub using OCR #352

Closed
opened 2026-01-29 16:41:42 +00:00 by claunia · 10 comments

Originally created by @MaxEliaserAWS on GitHub (Dec 22, 2017).

CCExtractor version (using the --version parameter preferably): 0.85 and Git commit 18584259447a145a1b8c1cae6733223393dfb4f1

In raising this issue, I confirm the following (please check boxes, eg [X]):

  • [x] I have read and understood the contributors guide.
  • [x] I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • [x] I have checked that the issue I'm posting isn't already reported.
  • [x] I have checked that the issue I'm posting isn't already solved and no duplicates exist in closed issues and in opened issues.
  • [x] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
  • [x] I have used the latest available version of CCExtractor to verify this issue exists.

I found https://github.com/CCExtractor/ccextractor/issues/392 which seems like the same issue, but it was closed in mid-2016, and I was able to reproduce the issue on a much newer version than that (including a Git build from today.) Therefore I think it's OK to file this.

Fortunately, the latest Git version seems to have addressed another bug I was going to file. However, the "something messy" instrumentation that you added would have to be filtered out, disabled, or sent to stderr before I could use this version.

My familiarity with the project is as follows (check one, eg [X]):

  • [ ] I have never used CCExtractor.
  • [ ] I have used CCExtractor just a couple of times.
  • [x] I absolutely love CCExtractor, but have not contributed previously.
  • [ ] I am an active contributor to CCExtractor.

Necessary information

  • Is this a regression (did it work before)? [x] NO | [ ] YES - please specify the last known working version
  • What platform did you use? [ ] Windows - [x] Linux - [ ] Mac
    64-bit CentOS 7.3
  • What were the arguments used? ccextractor -stdout -quiet -nofc -nodvbcolor ccextractor_bugs_allcaps_29fps_leftjustify.m2ts

Video links

https://s3-us-west-2.amazonaws.com/ccextractor-dvbsub-bugreports/ccextractor_bugs_allcaps_29fps_leftjustify.m2ts

I think I have set the permissions correctly on this S3 bucket, let me know if you can't download it.

The DVB-sub captions in this video display perfectly in VLC (tested with version 2.2.4). There is a burned-in timecode on the video to help you judge the timing, and it seems dead-on to me in VLC.

Additional information

There are three problems with ccextractor's output from this file:

  • Timing is slightly off (really not a huge deal, it's only off by 10-30 ms)
  • The last caption (the twenty-fifth caption) is missing from the output completely
  • Only the first line of text from each caption is shown (this is the biggest problem)
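
The second and third problems can be checked mechanically. Below is a minimal Python sketch, assuming the text that ccextractor writes to stdout is SubRip (SRT); the helper names `parse_srt` and `summarize` are hypothetical and not part of ccextractor.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def parse_srt(text):
    """Return a list of (timing_line, [text_lines]) tuples from SRT text."""
    captions = []
    # SRT blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        # Expect: index line, timing line, then the caption text lines.
        if len(lines) >= 2 and TIMING.match(lines[1]):
            captions.append((lines[1], lines[2:]))
    return captions


def summarize(text):
    """Count captions and how many of them carry more than one text line."""
    captions = parse_srt(text)
    multi = sum(1 for _, body in captions if len(body) > 1)
    return {"captions": len(captions), "multi_line": multi}
```

For the file in this report, a correct extraction should give 25 captions, and a `multi_line` count of 0 would point to the first-line-only problem.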

When I use spupng output and run the tesseract command line program (version 3.04 tested) on the PNG images, the text is detected just fine, so I don't think it's a limitation of tesseract itself.
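
That cross-check can be scripted. This is only a sketch under assumptions: the PNGs are assumed to have been written beforehand (for example with ccextractor's `-out=spupng`), the directory name is up to you, and `tesseract_command` and `ocr_directory` are hypothetical helper names. It relies on the tesseract CLI accepting `stdout` as the output base, which prints the recognized text instead of writing a file.

```python
import subprocess
from pathlib import Path


def tesseract_command(png_path):
    """Build the command that OCRs one image and prints the text to stdout."""
    # "stdout" as the output base tells the tesseract CLI to print the result.
    return ["tesseract", str(png_path), "stdout"]


def ocr_directory(png_dir):
    """Run tesseract on every PNG in png_dir; return {filename: [text lines]}."""
    results = {}
    for png in sorted(Path(png_dir).glob("*.png")):
        done = subprocess.run(tesseract_command(png), capture_output=True,
                              text=True, check=True)
        # Keep only non-empty lines so multi-line captions are easy to spot.
        results[png.name] = [ln for ln in done.stdout.splitlines() if ln.strip()]
    return results
```

Any image whose list has two or more entries is a caption where tesseract itself sees multiple lines, which is the comparison made above.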

If you're curious about the bug that was fixed with the latest Git version, it's that several captions were missing altogether when extracted from this file:
https://s3-us-west-2.amazonaws.com/ccextractor-dvbsub-bugreports/big_buck_bunny_eac3_4.m2ts
even though they played just fine in VLC. This bug is fixed now, but you could use the video in your regression tests if you want.


@MaxEliaserAWS commented on GitHub (Dec 22, 2017):

I also notice that removing the -nodvbcolor option causes a segfault on both input files in the latest Git version (not in 0.85), but I don't want color output anyway.


@ghost commented on GitHub (Dec 25, 2017):

Gotcha

CCE gets the subs and interprets and reads them properly; it just doesn't print them properly because, for some reason, it terminates OCR text on newlines instead of only on null bytes. I presume there's a reason for this (some text is probably terminated only by a newline and not by a null byte somewhere), so I'm not sure what to do here. I'll just leave it at this.
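
To make the difference concrete, here is a small illustration in Python. It is not ccextractor's actual code (which is written in C); the function names and the sample buffer are made up for the example. It only contrasts stopping at the first newline with stopping at the null terminator.

```python
def first_line_only(buf: bytes) -> bytes:
    """Mimic the reported behaviour: stop at the first newline or NUL byte."""
    for i, byte in enumerate(buf):
        if byte in (0x0A, 0x00):
            return buf[:i]
    return buf


def whole_caption(buf: bytes) -> bytes:
    """Stop only at the NUL terminator, keeping embedded newlines."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]
```

With a two-line OCR result such as `b"FIRST LINE\nSECOND LINE\x00"`, the first function keeps only `FIRST LINE`, while the second keeps both lines.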


@MaxEliaserAWS commented on GitHub (Dec 27, 2017):

I can confirm that this issue is now fixed, and a Git snapshot of ccextractor is now working well for my purposes. Great job getting this turned around so quickly, and over the Christmas weekend too!

Are you guys going to want to copy those input files anywhere for regression testing purposes? I'm going to want to take down that S3 bucket eventually...


@cfsmp3 commented on GitHub (Dec 28, 2017):

@MaxEliaserAWS we'll copy it to our regression testing platform in the next few days, we'll let you know when done.
@canihavesomecoffee can you take care of this?


@MaxEliaserAWS commented on GitHub (Dec 28, 2017):

Cool. Might as well copy both files; it can't hurt, right?


@MaxEliaserAWS commented on GitHub (Jan 16, 2018):

Has this actually happened yet?


@cfsmp3 commented on GitHub (Jan 16, 2018):

If you mean downloading and archiving them I don't think so.
What would be an appropriate description for the files?


@MaxEliaserAWS commented on GitHub (Jan 16, 2018):

Yes, that's what I meant.

A good description for ccextractor_bugs_allcaps_29fps_leftjustify.m2ts would be "dvb-sub captions containing multiple lines of text."

A good description for big_buck_bunny_eac3_4.m2ts would be "DVB-sub captions which prior versions of ccextractor failed to extract." This is fixed in the latest Git (or at least the last Git build that I tried), but it might be useful to make sure the bug doesn't come back, and I don't know if you have a file that covers this exact problem.


@cfsmp3 commented on GitHub (Jan 16, 2018):

OK, I've added them to our official repo: https://ccextractor.org/public:general:tvsamples

@canihavesomecoffee would be the right person to add it to the regression system, too.


@MaxEliaserAWS commented on GitHub (Jan 16, 2018):

OK great, I'll be taking that S3 bucket down then.
