[BUG] subtitles with alpha channel are badly detected #579

Closed
opened 2026-01-29 16:48:20 +00:00 by claunia · 2 comments
Owner

Originally created by @hamelg on GitHub (Apr 18, 2020).

CCExtractor version: 0.88
tesseract version: 4.1.1
leptonica version: 1.79.0

  • Is this a regression (i.e. did it work before)? NO
  • What platform did you use? Linux

Video links

https://app.box.com/s/mhu17q37hc4ofprneydfailktp70pi4l

Additional information

Take the sample.

$ ccextractor -o sbt -out=spupng -dvblang fra -ocrlang fra -oem 0 sample.ts

The result is very bad: 50% of the subtitles are wrong.

Focus on the first PNG, sub0001.png:

$ tesseract -l fra --oem 0 sub0001.png -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 688
Entoutœs,
bm‘m…mmn

Now remove the alpha channel with the ImageMagick convert utility:

$ convert sub0001.png -alpha off output.png

$ tesseract -l fra --oem 0 output.png -
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 411
En tout cas,
je m'en souviens comme ça.

Removing the alpha channel fixed the issue!

I read about this on the Tesseract wiki:
https://tesseract-ocr.github.io/tessdoc/ImproveQuality#transparency--alpha-channel

It would be a great improvement to have an option that removes the alpha channel automatically.
What do you think?
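
For illustration, here is a minimal sketch of what removing the alpha channel means at the pixel level. It is not CCExtractor or Tesseract code; the function names `drop_alpha` and `blend_over` are hypothetical, and pixels are assumed to be 8-bit, non-premultiplied RGBA tuples. As far as I understand the ImageMagick documentation, `-alpha off` simply stops using the alpha data and keeps the stored colour values, which corresponds to `drop_alpha`; compositing over an opaque background, as in `blend_over`, is the other common approach.

```python
from typing import List, Tuple

RGBA = Tuple[int, int, int, int]
RGB = Tuple[int, int, int]


def drop_alpha(pixels: List[RGBA]) -> List[RGB]:
    """Discard the alpha component and keep the stored colour values."""
    return [(r, g, b) for r, g, b, _a in pixels]


def blend_over(pixels: List[RGBA],
               background: RGB = (255, 255, 255)) -> List[RGB]:
    """Composite each pixel over an opaque background colour.

    Assumes 8-bit, non-premultiplied alpha (0 = transparent, 255 = opaque).
    """
    out: List[RGB] = []
    for r, g, b, a in pixels:
        blended = tuple(
            # Rounded integer form of c * a/255 + bg * (1 - a/255).
            (c * a + bg * (255 - a) + 127) // 255
            for c, bg in zip((r, g, b), background)
        )
        out.append(blended)
    return out


if __name__ == "__main__":
    sample = [(0, 0, 0, 255), (0, 0, 0, 0), (200, 100, 50, 128)]
    print(drop_alpha(sample))
    print(blend_over(sample))
```

Which of the two gives better OCR input depends on how the subtitle bitmap stores its colours, so both are worth trying on a sample image.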

claunia added the difficulty: medium label 2026-01-29 16:48:20 +00:00
Author
Owner

@hamelg commented on GitHub (Apr 28, 2020):

I made the modification, but the result is still just as bad.
I found out that the -oem option doesn't work: if Tesseract v4 is installed, ccextractor forces the oem parameter to 1.
Why?
I fixed that so the Tesseract OEM is set from the -oem option passed on the command line (ccx_options.ocr_oem), and now with oem=0 the result is near perfect. On my subtitles, oem=0 gives the best results; oem=1 and oem=2 are useless.

$ tesseract --help-oem
OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.
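
The fix described above can be sketched roughly as follows. This is not the actual CCExtractor patch; `choose_oem` and its parameters are hypothetical names used only to illustrate the idea that an explicit -oem value should win over a version-based default. The fallback of 1 for Tesseract v4 mirrors the behaviour reported in this comment, while the fallback of 3 for older versions is an assumption.

```python
from typing import Optional

# Engine modes as listed by `tesseract --help-oem`.
OEM_LEGACY = 0
OEM_LSTM = 1
OEM_COMBINED = 2
OEM_DEFAULT = 3


def choose_oem(requested: Optional[int], tesseract_major: int) -> int:
    """Return the engine mode to use (hypothetical helper).

    requested: value of the -oem option, or None if it was not given.
    tesseract_major: major version of the linked Tesseract library.
    """
    if requested is not None:
        if requested not in (OEM_LEGACY, OEM_LSTM, OEM_COMBINED, OEM_DEFAULT):
            raise ValueError(f"invalid OEM value: {requested}")
        # An explicit user choice always takes precedence.
        return requested
    # No explicit choice: fall back to a version-based default.
    # 1 for v4+ matches the behaviour reported above; 3 is an assumption.
    return OEM_LSTM if tesseract_major >= 4 else OEM_DEFAULT


if __name__ == "__main__":
    print(choose_oem(0, 4))
    print(choose_oem(None, 4))
```

Keeping the version check only on the fallback path means the reported problem (a forced value of 1 despite -oem 0) cannot occur when the user supplies a value.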
Author
Owner

@hamelg commented on GitHub (Apr 29, 2020):

I'm closing this issue because the -oem 0 option combined with -quant 2 gives a very good result. I'll open another one about the -oem option.


Reference: starred/ccextractor#579