mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-15 05:26:07 +00:00
[BUG] subtitles with alpha channel are badly detected #579
Originally created by @hamelg on GitHub (Apr 18, 2020).
CCExtractor version: 0.88
tesseract version: 4.1.1
leptonica version: 1.79.0
Video links
https://app.box.com/s/mhu17q37hc4ofprneydfailktp70pi4l
Additional information
Take the sample and run:
$ ccextractor -o sbt -out=spupng -dvblang fra -ocrlang fra -oem 0 sample.ts
The result is very bad: 50% of the subtitles are wrong.
Look at the first PNG, sub0001.png:
Now remove the alpha channel with the ImageMagick utility:
Removing the alpha channel fixed the issue!
I read about this in the Tesseract documentation:
https://tesseract-ocr.github.io/tessdoc/ImproveQuality#transparency--alpha-channel
It would be a great improvement to have an option that removes the alpha channel automatically.
What do you think?
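To make the request concrete, here is a minimal sketch of what "removing the alpha channel" means at the pixel level: each RGBA pixel is blended onto an opaque background so that the OCR engine only sees plain RGB. This is an illustration in Python, not CCExtractor code. The function names and the white default background are assumptions made for the example.

```python
# Hypothetical helpers illustrating alpha removal by blending onto an
# opaque background. Not part of CCExtractor, Tesseract, or ImageMagick.

def flatten_pixel(rgba, background=(255, 255, 255)):
    """Blend one 8-bit RGBA pixel onto an opaque background, return RGB."""
    r, g, b, a = rgba
    # Integer "source over" blend with rounding: out = (c*a + bg*(255-a)) / 255
    return tuple(
        (c * a + bg * (255 - a) + 127) // 255
        for c, bg in zip((r, g, b), background)
    )


def flatten_image(pixels, background=(255, 255, 255)):
    """Apply flatten_pixel to a 2D list of RGBA tuples (rows of pixels)."""
    return [[flatten_pixel(p, background) for p in row] for row in pixels]


if __name__ == "__main__":
    # One fully transparent, one fully opaque, one half-transparent pixel.
    row = [(0, 0, 0, 0), (10, 20, 30, 255), (0, 0, 0, 128)]
    print(flatten_image([row]))
```

A fully transparent pixel becomes the background colour and a fully opaque pixel is unchanged. Which background suits a given subtitle stream best (white, black, or something else) is a separate question that this sketch does not answer.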
@hamelg commented on GitHub (Apr 28, 2020):
I made the modification, but the result is still very bad.
I found out that the -oem option doesn't work: if Tesseract v4 is installed, CCExtractor forces the oem parameter to 1.
Why?
I fixed that so the Tesseract oem is set from the -oem option passed on the command line (ccx_options.ocr_oem), and now with oem=0 the result is nearly perfect. On my subtitles, oem=0 gives the best results; oem=1 and oem=2 are useless.
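The behaviour described above can be sketched as a small selection rule: an explicitly requested engine mode wins, and a version-based default applies only when nothing was requested. This is a Python illustration, not the actual CCExtractor patch. The function name, the use of None for "not set", and the fallback of 0 for Tesseract versions before 4 are assumptions made for the example.

```python
# Hypothetical sketch of honouring a user-supplied OCR engine mode (oem).
# Tesseract engine modes: 0 = legacy, 1 = LSTM, 2 = legacy + LSTM, 3 = default.

VALID_OEM = (0, 1, 2, 3)


def resolve_oem(requested_oem, tesseract_major):
    """Return the engine mode to pass to Tesseract.

    requested_oem: value given with -oem on the command line, or None if the
                   option was not used (assumed representation).
    tesseract_major: major version of the installed Tesseract (e.g. 3 or 4).
    """
    if requested_oem is not None:
        if requested_oem not in VALID_OEM:
            raise ValueError(f"unsupported oem value: {requested_oem}")
        # An explicit request always takes priority over the default.
        return requested_oem
    # Default only when the user did not ask for anything (assumed defaults).
    return 1 if tesseract_major >= 4 else 0


if __name__ == "__main__":
    print(resolve_oem(0, 4))     # explicit -oem 0 is kept even with Tesseract 4
    print(resolve_oem(None, 4))  # no option given: falls back to the default
```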
@hamelg commented on GitHub (Apr 29, 2020):
I'm closing this issue because the option -oem 0 combined with -quant 2 gives a very good result. I'll open another issue about the -oem option.