mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-03 21:23:48 +00:00
[BUG] unable to extract multiple lines from DVB-sub using OCR #352
Originally created by @MaxEliaserAWS on GitHub (Dec 22, 2017).
CCExtractor version (using the --version parameter preferably): 0.85 and Git commit 1858425944
In raising this issue, I confirm the following (please check boxes, eg [X]):
I found https://github.com/CCExtractor/ccextractor/issues/392 which seems like the same issue, but it was closed in mid-2016, and I was able to reproduce the issue on a much newer version than that (including a Git build from today.) Therefore I think it's OK to file this.
Fortunately, the latest Git version seems to have addressed another bug I was going to file, although the "something messy" instrumentation you added would have to be filtered out, disabled, or sent to stderr before I could use this version.
My familiarity with the project is as follows (check one, eg [X]):
Necessary information
64-bit CentOS 7.3
ccextractor -stdout -quiet -nofc -nodvbcolor ccextractor_bugs_allcaps_29fps_leftjustify.m2ts
Video links
https://s3-us-west-2.amazonaws.com/ccextractor-dvbsub-bugreports/ccextractor_bugs_allcaps_29fps_leftjustify.m2ts
I think I have set the permissions correctly on this S3 bucket, let me know if you can't download it.
The DVB-sub captions in this video display perfectly in VLC (tested with version 2.2.4). There is a burned-in timecode on the video to help you judge the timing, and it seems dead-on to me in VLC.
Additional information
There are three problems with ccextractor's output from this file:
When I use spupng output and run the tesseract command line program (version 3.04 tested) on the PNG images, the text is detected just fine, so I don't think it's a limitation of tesseract itself.
If you're curious about the bug that was fixed with the latest Git version, it's that several captions were missing altogether when extracted from this file:
https://s3-us-west-2.amazonaws.com/ccextractor-dvbsub-bugreports/big_buck_bunny_eac3_4.m2ts
even though they played just fine in VLC. This bug is fixed now, but you could use the video in your regression tests if you want.
@MaxEliaserAWS commented on GitHub (Dec 22, 2017):
I also notice that removing the -nodvbcolor option causes a segfault on both input files in the latest Git version (not in 0.85,) but I don't want color output anyway.
@ghost commented on GitHub (Dec 25, 2017):
Gotcha
CCE gets the subs and reads and interprets them properly; it just doesn't print them properly because, for some reason, it terminated OCR text on newlines instead of just null bytes. I presume there's a reason for this (some text is only terminated by newlines and not null bytes somewhere), so I'm not sure what to do here. I'll just put it at this
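For illustration only, here is a minimal C sketch of the behavior described above. It is not CCExtractor's actual code: the function names `copy_until_newline` and `copy_until_nul` are hypothetical, and the sketch assumes the OCR result is an ordinary NUL-terminated string in which a newline separates the lines of a single caption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT taken from CCExtractor: copies OCR text but
 * stops at the first newline or NUL byte. With a two-line caption this
 * keeps only the first line, which matches the symptom in this issue. */
size_t copy_until_newline(char *dst, size_t dst_size, const char *src)
{
    size_t n = 0;
    if (dst_size == 0)
        return 0;
    while (src[n] != '\0' && src[n] != '\n' && n + 1 < dst_size) {
        dst[n] = src[n];
        n++;
    }
    dst[n] = '\0';
    return n; /* number of characters copied, excluding the NUL */
}

/* Hypothetical variant: only the NUL byte ends the text, so newlines
 * embedded in a caption are preserved and every line is kept. */
size_t copy_until_nul(char *dst, size_t dst_size, const char *src)
{
    size_t n = 0;
    if (dst_size == 0)
        return 0;
    while (src[n] != '\0' && n + 1 < dst_size) {
        dst[n] = src[n];
        n++;
    }
    dst[n] = '\0';
    return n; /* number of characters copied, excluding the NUL */
}
```

Given the input "FIRST LINE\nSECOND LINE", the first function yields only "FIRST LINE", while the second returns both lines. Both truncate silently if the destination buffer is too small, which a real implementation would probably want to report.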
@MaxEliaserAWS commented on GitHub (Dec 27, 2017):
I can confirm that this issue is now fixed, and a Git snapshot of ccextractor is now working well for my purposes. Great job getting this turned around so quickly, and over the Christmas weekend too!
Are you guys going to want to copy those input files anywhere for regression testing purposes? I'm going to want to take down that S3 bucket eventually...
@cfsmp3 commented on GitHub (Dec 28, 2017):
@MaxEliaserAWS we'll copy it to our regression testing platform in the next few days, we'll let you know when done.
@canihavesomecoffee can you take care of this?
@MaxEliaserAWS commented on GitHub (Dec 28, 2017):
Cool. Might as well copy both files, it can't hurt right?
@MaxEliaserAWS commented on GitHub (Jan 16, 2018):
Has this actually happened yet?
@cfsmp3 commented on GitHub (Jan 16, 2018):
If you mean downloading and archiving them I don't think so.
What would be an appropriate description for the files?
@MaxEliaserAWS commented on GitHub (Jan 16, 2018):
Yes, that's what I meant.
A good description for ccextractor_bugs_allcaps_29fps_leftjustify.m2ts would be "DVB-sub captions containing multiple lines of text."
A good description for big_buck_bunny_eac3_4.m2ts would be "DVB-sub captions which prior versions of ccextractor failed to extract." This is fixed in the latest Git (or at least the last Git that I tried,) but it might be useful to make sure the bug doesn't come back, and I don't know if you have a file that covers this exact problem.
@cfsmp3 commented on GitHub (Jan 16, 2018):
OK, I've added them to our official repo: https://ccextractor.org/public:general:tvsamples
@canihavesomecoffee would be the right person to add it to the regression system, too.
@MaxEliaserAWS commented on GitHub (Jan 16, 2018):
OK great, I'll be taking that S3 bucket down then.