mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-04-14 01:54:01 +00:00
Terrible OCR results with Channel 5 (UK) #387
Originally created by @cfsmp3 on GitHub (Feb 13, 2018).
(current master, pre 0.87)
This file (but well, all of channel 5)
https://drive.google.com/open?id=1Etq-pv5G3jGqVhhRl7cNrfuw4gaKkLoV
Produces terrible results in the OCR, even though the bitmaps seem normal. What's going on?
@cfsmp3 commented on GitHub (Feb 13, 2018):
GSOC qualification: 5 points
@harrynull commented on GitHub (Feb 13, 2018):
The problem is in
quantize_map(alpha, palette, rect->data[0], size, 3, rect->nb_colors); (ocr.c:773). Commenting this line out solves the problem.
With quantize_map: (screenshot not preserved)

Without quantize_map: (screenshot not preserved)

By the way, for my own reference and for anyone who wants to look into the OCR, I suggest adding save_spupng("debug.png", indata, w, h, palette, alpha, 16); somewhere at the beginning of ocr_bitmap so that it's easier to see what the raw image looks like before any further processing. I don't want to open a PR just to add a single debugging line, though.
@Abhinav95 commented on GitHub (Feb 13, 2018):
This is pretty interesting. The quantize_map() function in itself was important (from my discussions with @anshul1912) for improving the DVB results. What the function essentially does is 'binarize' the input image into text and non-text regions, ignoring the gradient grayscale values at the boundary between them. With this particular set of subtitles, it seems the binarization process introduces unwanted noisy artifacts around the text regions, which throws off the OCR results. This could probably be solved by an additional filtering step to remove the 'salt noise' present in the current images.
@thealphadollar commented on GitHub (Feb 13, 2018):
@Abhinav95 I think I can look in this issue (with your help) if you don't mind :)
@Abhinav95 commented on GitHub (Feb 13, 2018):
@thealphadollar Go right ahead :)
@cfsmp3 commented on GitHub (Feb 14, 2018):
Things we could do:
@thealphadollar commented on GitHub (Feb 15, 2018):
@cfsmp3 For now, I'll be trying to incorporate the mentioned library. Let's hope something good turns up :)
@thealphadollar commented on GitHub (Feb 15, 2018):
TL;DR: I do not think it would be wise to use the libimagequant library. I also tried a library called exoquant, but it doesn't seem to be compatible with the libpng method we use for decoding PNG files. I will search a little more and see if I can find a better library; otherwise I'll resort to making quantize_map() optional.
Using libimagequant makes the process highly inefficient, as can be seen in the screenshot below. The screenshot shows an implementation of the steps involved, though they were not fully integrated into the OCR system. Nevertheless, full integration has two issues, elaborated after the screenshot.
Code after full implementation:
I looked on the web and it seemed like this is a problem with the latest version, but even downgrading did not make any difference.
Also, after partial implementation I could still see the "salt noise" (less than with quantize_map(), though) in the raw image, which might indicate that even a full implementation would leave us with the same errors we get from the current function.
Hence I think it's better to make quantization optional (though it increases the argument count... sadly :( )
@Abhinav95 Please see if I'm wrong somewhere, and suggest if there's a better way to go about this :)
@adarshshukla19 commented on GitHub (Feb 23, 2018):
@cfsmp3
I ran it on my tesseract and it ran just fine.
https://drive.google.com/open?id=1zf-Gb-v_vgMXXbQ0bgeC_1FYPCan-Qg2
@amitdo commented on GitHub (Feb 23, 2018):
https://github.com/DanBloomberg/leptonica/search?quant
@cfsmp3 commented on GitHub (Feb 23, 2018):
@adarshshukla19 that link is not public
@adarshshukla19 commented on GitHub (Feb 26, 2018):
So are you guys working further on the project or not, and are all the issues resolved?
Regards.
Adarsh
@thealphadollar commented on GitHub (Feb 26, 2018):
I'll be working on adding one more quantisation option, probably from the leptonica library since we already use it.
One small addition was done a while back which reduces the number of colours in the colour palette and improves the output slightly.
@cfsmp3 commented on GitHub (Feb 26, 2018):
@adarshshukla19 Issues are not yet solved, so yes, we're definitely going to continue working on this unless we get really reliable results.
@tsmarinov commented on GitHub (Mar 2, 2018):
After the last commit, results on my side are good, but I still have this French channel with terrible output:
./ccextractor -nofc -in=ts -datapid 0x8c3 -out=srt -stdout -nobom -trim -noteletext -codec dvbsub -dvblang fra -ocrlang fra ./merged/franceo.ts
Here are the materials: https://goo.gl/kncQUn
@krushanbauva commented on GitHub (Mar 7, 2018):
The salt noise present in the images can be removed by the method of erosion and dilation.
You can refer to this link for more clarity (Sorry for it being highly mathematical in nature😥 ).
You can see the images to get a clear picture of what happens when you apply a proper combination of these filters.
Original image: (image not preserved)
Processed image: (image not preserved)
OpenCV provides implementations of erosion and dilation filters.
P.S.: I am not very familiar with the codebase or the tesseract API either, so I might take some time to implement it. Though if anyone wants to go ahead, this might help to solve it.
@thealphadollar commented on GitHub (Mar 7, 2018):
@krushanbauva I thought of implementing this, but there were certain issues I was facing, so I'll take it up when I have a little time in hand.
It's mathematically heavy and hence requires a lot of homework up front and a lot of testing afterwards so that nothing breaks.
I need to analyze in depth how we read the images. How the images are read makes a lot of difference, and the way OpenCV does it is probably drastically different from ours, though I believe at the basic level they are somewhat similar.
We cannot add OpenCV directly, since that would be a huge dependency we don't really need.
You can surely try to implement it; go through the codebase and ask questions. I'll look back into this when I have a couple of days in hand. I spent around a week on this, so I can support you a bit on the codebase side :)
@krushanbauva commented on GitHub (Mar 7, 2018):
@thealphadollar
@thealphadollar commented on GitHub (Mar 7, 2018):
Sounds amazing :) @krushanbauva
@cfsmp3 commented on GitHub (Mar 8, 2018):
Good luck @krushanbauva :-)
@cyberdrk commented on GitHub (Mar 8, 2018):
I've got some prior experience in Tesseract and morphological operations, do you guys mind if I join in? :)
@cfsmp3 commented on GitHub (Mar 8, 2018):
You're more than welcome :-)
@krushanbauva commented on GitHub (Mar 8, 2018):
@cyberdrk You can go through the articles on the official CCExtractor page, which will get you started with the codebase; going through the recent PRs will also give you a lot of intuition as to where things are. 😄
Also, there has been some activity on this part of the code recently, so that might help you big time!
P.S.: You are always welcome to collaborate!! 😋
@amitdo commented on GitHub (Mar 8, 2018):
Tesseract uses Leptonica for image IO and image processing.
@Saiteja31597 commented on GitHub (Mar 27, 2018):
I would like to work on this.
@cfsmp3 commented on GitHub (Mar 28, 2018):
No need to ask, just go for it :-)
@thealphadollar commented on GitHub (Apr 1, 2018):
@cyberdrk @krushanbauva Any leads you guys would like to share? I'm starting back my work on this.
@thealphadollar commented on GitHub (Apr 7, 2018):
@cfsmp3 For the past few days I have tried implementing some more libraries (including Leptonica), but could not succeed; the main problem is incorporating the libraries without changing the structure of the PNG file we currently use.
Doing that would, I believe, be inefficient, since we already have three methods which work pretty much perfectly for most types of videos.
If I'm not wrong about the format compatibility, I think we can close the issue, since we have already solved the problem this issue raised :)
@cfsmp3 commented on GitHub (Apr 11, 2018):
@thealphadollar PNG here is an output format, but that is totally unrelated to the OCR, which just takes a bitmap.
@OsamaNabih commented on GitHub (Feb 7, 2020):
What happened to the suggestion of implementing dilation and erosion?
@cfsmp3 commented on GitHub (Feb 7, 2020):
If it's not done, then no one has sent a PR yet.
Go for it.
@cfsmp3 commented on GitHub (Mar 22, 2023):
Closing - confirmed fixed for the sample on the description. Great job @ziexess !