mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-13 05:25:03 +00:00
[BUG] French DVB subtitles stopped working #458
Originally created by @Liontooth on GitHub (Nov 18, 2018).
Please prefix your issue with one of the following: [BUG], [PROPOSAL], [QUESTION].
CCExtractor version (using the --version parameter preferably) : ccextractor-0.87
In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):
My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):
Necessary information
-pn 257 -tpage 888 -datets -ttxt -UCLA -noru -utf8 -parsepat -parsepmt
Video links
http://vrnewsscape.ucla.edu/dropbox/2017-07-24_1800_FR_FR2_Journal_20h00.mpg
http://vrnewsscape.ucla.edu/dropbox/2017-07-24_1800_FR_FR2_Journal_20h00.txt
Additional information
CCExtractor-0.85 compiled 2017-07-29 with liblept4 succeeds in extracting DVB captions from the file above, as shown in the accompanying txt file.
CCExtractor-0.86 and CCExtractor-0.87 fail to find any subtitles.
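A regression like this can be checked by running each build against the same sample with the flags from the report and seeing which ones produce a non-empty output file. The following is only a minimal sketch: the binary paths passed in are hypothetical, the helper names are made up for illustration, and "non-empty output" is an assumed stand-in for "found subtitles".

```python
import subprocess
from pathlib import Path

# Flags copied from the report above.
REPORT_FLAGS = ["-pn", "257", "-tpage", "888", "-datets", "-ttxt", "-UCLA",
                "-noru", "-utf8", "-parsepat", "-parsepmt"]


def build_command(binary, sample, output):
    """Return the argument list for one ccextractor run (input first, then flags)."""
    return [str(binary), str(sample), *REPORT_FLAGS, "-o", str(output)]


def has_captions(output):
    """Assumption: a missing or empty output file means no subtitles were extracted."""
    path = Path(output)
    return path.is_file() and path.stat().st_size > 0


def check_versions(binaries, sample, workdir):
    """Run every binary in {label: path} and report which ones produced output."""
    results = {}
    for label, binary in binaries.items():
        output = Path(workdir) / f"{label}.txt"
        # check=False: a failing build should be recorded, not raise.
        subprocess.run(build_command(binary, sample, output), check=False)
        results[label] = has_captions(output)
    return results
```

For example, `check_versions({"0.85": "/opt/cc085/ccextractor", "0.87": "/opt/cc087/ccextractor"}, "sample.mpg", "/tmp")` (paths invented here) would return a dictionary mapping each label to True or False.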
@anshul1912 commented on GitHub (Nov 18, 2018):
Hey,
Can you please provide the output of ./ccextractor --version?
I want to check which tesseract and leptonica versions you are using.
Also, can you let me know whether the spupng output format works for you?
I was able to get subs out of your file, though it is not that accurate:
Following is my version information:
@anshul1912 commented on GitHub (Nov 19, 2018):
I did not understand why you cross-referenced this. Do you mean that the only problem you have is duplication?
@saurabhshri commented on GitHub (Nov 22, 2018):
@anshul1912 No, the duplication problem was reported against 0.85, since @Liontooth was not able to extract French DVB subtitles in 0.86 and 0.87. 🙂 He mentioned this under additional information and hence cross-referenced this issue.
From #1040 :
@Liontooth commented on GitHub (Nov 29, 2018):
Hi -- sorry to be slow. Anshul, your attempt to extract the text demonstrates the regression. Version 0.85 does a great job -- close to perfect (Chrome no longer lets you set the character set and gets this one wrong; in fact it's UTF-8):
In comparison, your attempt shows 0.87 gets almost nothing right. So this is a clear regression.
The version I run doesn't show a lot of information:
Back then, the --version flag was still not fully supported. The downloaded file is dated 19 Jan 2017. I no longer have the version information for tesseract and leptonica (other than that it's liblept4); let me know if you'd like the binary. Strace might tell you what it's using.
It would really be a pity to lose this excellent functionality! This issue was more or less completely solved, so let's try to get back to 0.85.
Cheers,
David
@anshul1912 commented on GitHub (Dec 9, 2018):
Hi David,
I see there is a problem with quantization: the output is fine if quantization is disabled.
I ran ccextractor as follows:
/ccextractor ~/Videos/Samples/DVB/2017-07-24_1800_FR_FR2_Journal_20h00.mpg -quant 0 -pn 257 -tpage 888 -datets -ttxt -UCLA -noru -utf8 -parsepat -parsepmt -o a.txt
Can you confirm that -quant 0 works perfectly for you in 0.87?
@anshul1912 commented on GitHub (Dec 9, 2018):
Only the start of the output is fine; the complete output is still bad. It looks like the latest fra trained data is worse than the older one.
@anshul1912 commented on GitHub (Dec 9, 2018):
When I tried to compare with 0.85, my output file was completely empty.
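To see how far the -quant 0 output drifts from a known-good transcript (or from a run with the default quantization), the two text files can be diffed line by line. This is a rough sketch, not part of ccextractor: the helper names are hypothetical, and only the value 0 for -quant comes from this thread.

```python
import difflib
from pathlib import Path


def quant_args(mode=None):
    """Extra ccextractor arguments for a quantization mode; None keeps the default."""
    return [] if mode is None else ["-quant", str(mode)]


def read_lines(path):
    # errors="replace" so a mis-encoded transcript does not abort the comparison.
    return Path(path).read_text(encoding="utf-8", errors="replace").splitlines()


def diff_outputs(path_a, path_b):
    """Return unified-diff lines between two transcripts (empty list if identical)."""
    return list(difflib.unified_diff(
        read_lines(path_a), read_lines(path_b),
        fromfile=str(path_a), tofile=str(path_b), lineterm=""))
```

An empty list from `diff_outputs` means the two runs agree; a long diff that starts late in the file would match the observation that only the beginning of the output is fine.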
@thunderbolt-tom commented on GitHub (Jan 24, 2019):
I've experienced a very similar issue to this with DVB subtitles from British TV. Using v0.87 newly built on Ubuntu with tesseract 4.0.0, I get the "No captions were found in input." error. Previously, using v0.84 built against an older version of tesseract, the subtitles were converted to srt almost perfectly.
What I discovered through trial and error is that this seems to be an issue with the newer version of tesseract. Tesseract has various data files containing trained data for different languages here: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files. As that page says, the tessdata_fast file listed under the "September 15 2017" section is what is installed by default. If I instead install a language file from the plain "tessdata" folder under that section, or a file from the section "November 29, 2016", then ccextractor works as expected.
I'm uncertain why this works. Tesseract 4.0.0 has a newer "LSTM" engine which could be part of the problem, but testing with different combinations of data files and forcing different engines gave conflicting results. Some combinations using LSTM also gave extremely bad detection for some sentences, e.g. the second line should say "you never actually came here":
Ultimately using the plain tessdata file from https://github.com/tesseract-ocr/tessdata seems to work.
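One way to try an alternative data file without touching the system-wide install is to keep it in its own directory and point Tesseract at it through the TESSDATA_PREFIX environment variable. The sketch below makes several assumptions: that the ccextractor build in use honors TESSDATA_PREFIX, that the file has already been downloaded into the given directory, and that the helper name and `use_parent` switch are purely illustrative. Whether the variable should name the tessdata directory itself or its parent differs between Tesseract versions, hence the switch.

```python
import os
from pathlib import Path


def tesseract_env(tessdata_dir, lang="eng", use_parent=False, base_env=None):
    """Return an environment dict with TESSDATA_PREFIX set for one language file.

    Raises FileNotFoundError if <tessdata_dir>/<lang>.traineddata is missing.
    """
    tessdata_dir = Path(tessdata_dir)
    trained = tessdata_dir / f"{lang}.traineddata"
    if not trained.is_file():
        raise FileNotFoundError(f"no trained data at {trained}")
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: some versions expect the parent of the tessdata directory here.
    prefix = tessdata_dir.parent if use_parent else tessdata_dir
    env["TESSDATA_PREFIX"] = str(prefix)
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching ccextractor, so the change applies to that one run only.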
Having said all that, the changelog for ccextractor v0.88 says
- New: Add support for tesseract 4.0
so maybe we shouldn't expect it to work properly in 0.87. I do get other issues using ccextractor from the latest git though, so for now using the alternative tessdata in 0.87 seems to be the solution.
@ggnull35 commented on GitHub (Feb 26, 2019):
Actually, ccextractor v0.87 can be compiled with tesseract 4.0 and leptonica 1.77.0.
Just compile against libpng 1.6.34.
@cfsmp3 commented on GitHub (Nov 21, 2021):
@Liontooth Can you provide updated samples? Can't download that one. We're cleaning up issues now (overdue, I know).
@cfsmp3 commented on GitHub (Mar 22, 2023):
Closing due to no samples