Corrupt or empty subtitles (OCR, ts, DVB) #81

Closed
opened 2026-01-29 16:34:39 +00:00 by claunia · 12 comments
Owner

Originally created by @claunia on GitHub (Nov 1, 2015).

On some files the subtitles come out empty even though the program was subtitled, or corrupt, containing garbage characters.

I tried recordings from Imagenio and from DVB-T in Spain; it happens in all tested broadcasts.

Test files have been put on /repository/Natalia

Regards

claunia added the bug, OCR, difficulty: medium labels 2026-01-29 16:34:39 +00:00
Author
Owner

@cfsmp3 commented on GitHub (Nov 7, 2016):

@claunia we're going to spend a bit of time on this. What's the current status? (with the latest CCExtractor, I mean)

Author
Owner

@ghost commented on GitHub (Nov 28, 2016):

Hello, I can't seem to find the test files repository for this one!

Author
Owner

@cfsmp3 commented on GitHub (Nov 28, 2016):

Here are the two files:
https://drive.google.com/open?id=0B_61ywKPmI0TLWRwY3Myc0pTMEE
https://drive.google.com/open?id=0B_61ywKPmI0TUGctV1hZSkFwalE

Author
Owner

@cfsmp3 commented on GitHub (Jan 20, 2017):

GSOC qualification: This issue gives 2 points.

Author
Owner

@harrynull commented on GitHub (Dec 31, 2017):

The zip files contain a total of 4 video files:

Star Wars Rebels_Disney Channel_2014-12-12_22-24.ts:
The video contains teletext subtitles.

<del>The output is generally good, except 2 lines missing.</del>

<del>It is caused by fuzzy_memcmp in telxcc.c:809, which seems to discard the
previous line if the current line has similar content to it.</del>

EDIT: with -nolevdist, the missing lines are now output.
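
For readers unfamiliar with this kind of comparison, here is a minimal sketch of how an edit-distance check can swallow a line that is merely similar to the previous one. It is not the code in telxcc.c: the function names, the 20% threshold and the "newer line replaces the older one" rule are all assumptions made for illustration, and the link to Levenshtein distance is only inferred from the flag name -nolevdist.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_fuzzy_duplicate(prev_line: str, line: str, max_ratio: float = 0.2) -> bool:
    """Hypothetical rule: two lines are 'the same' when the edit distance
    is small relative to the shorter line. The 0.2 ratio is an assumption."""
    shortest = min(len(prev_line), len(line))
    if shortest == 0:
        return prev_line == line
    return levenshtein(prev_line, line) <= max_ratio * shortest


def dedupe(lines, fuzzy=True):
    """Keep a line unless it matches the previously kept line.
    With fuzzy=False only exact matches count (the strict comparison)."""
    kept = []
    for line in lines:
        if kept:
            same = is_fuzzy_duplicate(kept[-1], line) if fuzzy else kept[-1] == line
            if same:
                kept[-1] = line  # the newer line replaces the previous one
                continue
        kept.append(line)
    return kept
```

With fuzzy matching on, two distinct but similar subtitle lines collapse into one, which is the symptom described above; with exact comparison both survive.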

In addition, I found that -out=spupng doesn't work with teletext. I don't know
whether that is expected. It crashes because of a bug in ccx_encoders_spupng.c:14. After
fixing it (Patch #864), it generates .png files of 0 bytes (i.e. empty).

Star Wars Rebels_Disney Channel_2014-12-12_22-24_cortado.ts:

It has a teletext subtitle stream but neither VLC nor Potplayer can display
any subtitle. CCExtractor can't extract anything from it.

I think this may be because the video itself doesn't actually contain any subtitles.

Cine Clan TVE Perez, el ratoncito de tus sueños 2.ts

It contains DVB subtitles, but CCExtractor isn't able to extract anything from it.
-out=spupng doesn't work either.

The cause is that the stream never sends DVBSUB_DISPLAY_SEGMENT. Although this case
is considered in the code, it is poorly handled. Patch: #866
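
To make the failure mode concrete, here is a rough sketch of one way a decoder could cope with a stream that never sends the end marker. It is not what Patch #866 does. It assumes DVBSUB_DISPLAY_SEGMENT corresponds to segment type 0x80 (the "end of display set" segment in ETSI EN 300 743) and that a new page composition segment (0x10) can serve as a fallback boundary; the function name and the list-of-bytes input are hypothetical.

```python
PAGE_COMPOSITION = 0x10    # page composition segment
END_OF_DISPLAY_SET = 0x80  # assumed to be what DVBSUB_DISPLAY_SEGMENT refers to


def group_display_sets(segment_types):
    """Group a flat sequence of segment type bytes into display sets.

    A set is normally closed by END_OF_DISPLAY_SET. As a fallback, a new
    page composition segment also closes whatever set is still pending,
    so streams that omit the end marker still produce output.
    """
    sets = []
    current = []
    for seg in segment_types:
        if seg == PAGE_COMPOSITION and current:
            sets.append(current)  # fallback: previous set never got its end marker
            current = []
        if seg == END_OF_DISPLAY_SET:
            if current:
                sets.append(current)
            current = []
            continue
        current.append(seg)
    if current:
        sets.append(current)  # flush whatever is pending at end of stream
    return sets
```

A decoder that only renders when it sees the end marker would return nothing for the second kind of stream, which matches the "isn't able to extract anything" symptom.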

Cine Clan TVE Perez, el ratoncito de tus sueños 2_cortado.ts

Same problem as "Cine Clan TVE Perez, el ratoncito de tus sueños 2.ts"

While debugging, I also discovered a heap corruption problem caused by
add_ocrtext2str (Patch: #865)

Author
Owner

@cfsmp3 commented on GitHub (Dec 31, 2017):

@harrynull use -nolevdist if you want fuzzy_memcmp to behave like memcmp

Author
Owner

@cfsmp3 commented on GitHub (Jan 11, 2018):

First one (teletext) works fine.
However the 2nd one shows a bunch of messages:

In ocr_bitmap: Failed to perform OCR. Skipped.
In ocr_bitmap: Failed to perform OCR. Skipped.
In ocr_bitmap: Failed to perform OCR. Skipped.

Takes forever, too.

Author
Owner

@harrynull commented on GitHub (Jan 12, 2018):

It is caused by some of the images being totally empty and invalid for some reason.
But it should not affect the output file.
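
As an illustration of the "empty image" case, a pre-check along these lines could skip such bitmaps before they ever reach the OCR step. This is only a sketch: the function name, the flat palette-index layout and the choice of index 0 as "transparent" are assumptions, not CCExtractor's actual data structures.

```python
def is_blank_bitmap(pixels, width, height, transparent_index=0):
    """Return True when the bitmap has no usable area, is truncated,
    or every pixel is the (assumed) transparent palette index."""
    if width <= 0 or height <= 0 or len(pixels) < width * height:
        return True
    return all(p == transparent_index for p in pixels[: width * height])
```

Skipping blank bitmaps this way would avoid the repeated "Failed to perform OCR" messages for those frames, though it would not by itself explain garbled text in frames that do contain pixels.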

Author
Owner

@cfsmp3 commented on GitHub (Jan 12, 2018):

@harrynull It does, check this out:

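670
01:00:15,877 --> 01:00:20,676
<font color="#00c8c6">Enos oi onimnro nno dnonio</font>
<font color="#00c8c6">otnpnnio pnno oroannthonio.</font>

671
01:00:20,677 --> 01:00:23,116
<font color="#00c8c6">TI‘QMG.</font>
<font color="#00c8c6">Monono. oi no“n</font>

672
01:00:23,117 --> 01:00:27,756
<font color="#00c8c6">sono wondido o onion</font>
<font color="#00c8c6">nnos olrozoo non on.</font>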

That's total gibberish :-) There's definitely a correlation between those errors and the incorrect lines.
It's definitely better than before, and there's lots of good output - but still not perfect.

Author
Owner

@harrynull commented on GitHub (Jan 13, 2018):

@cfsmp3
It works well here:

670
01:00:15,877 --> 01:00:20,676
<font color="#00c8c6">Erao ol prlmuro ono onorla</font>
<font color="#00c8c6">atoporlo pora prognnflarlo.</font>

671
01:00:20,677 --> 01:00:23,116
<font color="#00c8c6">Tardo.</font>
<font color="#00c8c6">Mahana. ol ratOn</font>

672
01:00:23,117 --> 01:00:27,756
<font color="#00c8c6">oora vondldo a onlon</font>
<font color="#00c8c6">mao ofruzoa por or.</font>

Did you forget to put spa.traineddata in the right place?
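
A quick way to rule this out is to check where the language file actually lives. The sketch below looks for `<lang>.traineddata` under the directory named by the TESSDATA_PREFIX environment variable (and a tessdata folder inside it), plus any extra directories passed in. The helper name and the exact search order are assumptions; CCExtractor's own lookup may differ.

```python
import os
from pathlib import Path


def find_traineddata(lang, search_dirs=None, env=None):
    """Return the path of <lang>.traineddata, or None if it is not found.

    Looks in TESSDATA_PREFIX, TESSDATA_PREFIX/tessdata, then search_dirs.
    The search order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    candidates = []
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        candidates += [Path(prefix), Path(prefix) / "tessdata"]
    candidates += [Path(d) for d in (search_dirs or [])]
    for directory in candidates:
        path = directory / f"{lang}.traineddata"
        if path.is_file():
            return path
    return None
```

For example, `find_traineddata("spa", search_dirs=["./tessdata"])` returning None would suggest the Spanish model is missing from the checked locations.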

But I did find that sometimes <font> doesn't close:

24
00:03:54,997 --> 00:03:57,836
<font color="#c7c800">¢Como fue Ia fiesta?</font>
<font color="#c7c800"></font><font color="#d6d6d6">-Estuvimos esperandole.

In addition, some subtitles are skipped and missing. I am not sure if it is limitation of tesseract but I will check them later.
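
The unclosed <font> tag shown in subtitle 24 is easy to detect mechanically. The following is a small sketch of a checker that counts unbalanced <font> tags in a line and appends the missing closers. It is a post-processing illustration with hypothetical function names, not a fix to the encoder itself.

```python
import re

# Matches opening tags such as <font color="#d6d6d6"> and closing </font> tags.
TAG_RE = re.compile(r"<(/?)font\b[^>]*>", re.IGNORECASE)


def unclosed_font_tags(line: str) -> int:
    """Count <font> tags that are opened but never closed in the line."""
    depth = 0
    for match in TAG_RE.finditer(line):
        if match.group(1):           # a closing </font>
            depth = max(depth - 1, 0)
        else:                        # an opening <font ...>
            depth += 1
    return depth


def close_font_tags(line: str) -> str:
    """Append as many </font> closers as the line is missing."""
    return line + "</font>" * unclosed_font_tags(line)
```

Running `unclosed_font_tags` over each text line of the generated .srt would list the affected subtitles without having to eyeball the file.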

Author
Owner

@cfsmp3 commented on GitHub (Jan 13, 2018):

That stuff in 670, 671 and 672 is not Spanish, believe me :-) (or I
suspect, any other language)

Author
Owner

@cfsmp3 commented on GitHub (Mar 22, 2023):

Status update: Still broken. Possibly differently. The file that matters is Cine Clan TVE *.ts (ignore the Disney one).

We get lots of these messages:

Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
Error in boxClipToRectangle: box outside rectangle
Warning in pixClipRectangle: box doesn't overlap pix
Error in pixConvertRGBToGray: pixs not defined
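
One plausible reading of these messages is that a subtitle region's box lies entirely outside the image it is being clipped against, so the clip yields nothing and the later grayscale conversion receives no image. The sketch below shows the clipping logic in isolation; the function name and the tuple-based box are illustrative only and this is not Leptonica's implementation.

```python
def clip_box_to_rect(x, y, w, h, img_w, img_h):
    """Intersect the box (x, y, w, h) with the image rectangle
    (0, 0, img_w, img_h). Returns the clipped (x, y, w, h),
    or None when the box does not overlap the image at all."""
    left = max(x, 0)
    top = max(y, 0)
    right = min(x + w, img_w)
    bottom = min(y + h, img_h)
    if right <= left or bottom <= top:
        return None  # box lies completely outside the image
    return (left, top, right - left, bottom - top)
```

If the caller does not check for the None case before continuing, every such region would produce the same trio of messages seen in the log.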

and a bonus:

Direct leak of 216 byte(s) in 3 object(s) allocated from:
    #0 0x7f77522bf90f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
    #1 0x556c85248761 in dvbsub_init_decoder ../src/lib_ccx/dvb_subtitle_decoder.c:424
    #2 0x556c8529ee4d in parse_PMT ../src/lib_ccx/ts_tables.c:346
    #3 0x556c85272f9e in ts_readstream ../src/lib_ccx/ts_functions.c:752
    #4 0x556c85275167 in ts_get_more_data ../src/lib_ccx/ts_functions.c:980
    #5 0x556c852a9a9f in general_loop ../src/lib_ccx/general_loop.c:1051
    #6 0x556c851a7986 in api_start ../src/ccextractor.c:205
    #7 0x556c851a9cdb in main ../src/ccextractor.c:463
    #8 0x7f775162350f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
Reference: starred/ccextractor#81