mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-14 13:35:43 +00:00
Corrupt or empty subtitles (OCR, ts, DVB) #81
Originally created by @claunia on GitHub (Nov 1, 2015).
On some files the extracted subtitles are empty even though the program was subtitled, or they are corrupt and contain garbage characters.
Tried recordings from Imagenio and from DVB-T in Spain; it happens in all tested broadcasts.
Test files have been put on /repository/Natalia
Regards
@cfsmp3 commented on GitHub (Nov 7, 2016):
@claunia we're going to spend a bit of time on this. What's the current status? (With the latest CCExtractor, I mean.)
@ghost commented on GitHub (Nov 28, 2016):
Hello, I can't seem to find the test files repository for this one!
@cfsmp3 commented on GitHub (Nov 28, 2016):
Here are the two files:
https://drive.google.com/open?id=0B_61ywKPmI0TLWRwY3Myc0pTMEE
https://drive.google.com/open?id=0B_61ywKPmI0TUGctV1hZSkFwalE
@cfsmp3 commented on GitHub (Jan 20, 2017):
GSOC qualification: This issue gives 2 points.
@harrynull commented on GitHub (Dec 31, 2017):
The zip files contain a total of 4 video files:
Star Wars Rebels_Disney Channel_2014-12-12_22-24.ts:
The video contains teletext subtitles.
The output is generally good, except that 2 lines are missing. This is caused by fuzzy_memcmp in telxcc.c:809, which seems to discard the previous line if the current line has similar content.
EDIT: with -nolevdist, the missing lines are now output.
In addition, I find that -out=spupng doesn't work with teletext. I don't know if that is expected. It crashes because of a bug in ccx_encoders_spupng.c:14. After fixing it (Patch #864), it generates .png files with a size of 0 bytes (i.e. empty).
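To illustrate why a similarity-based comparison can swallow a genuinely new line, here is a minimal sketch of an edit-distance check. It is not CCExtractor's fuzzy_memcmp: the names edit_distance and lines_similar and the max_dist threshold are hypothetical and chosen only for this example. Passing max_dist == 0 stands in for the exact-match behaviour that -nolevdist is described as enabling.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Levenshtein edit distance between byte strings a and b, computed with a
 * single rolling row. Returns (size_t)-1 if the scratch row cannot be
 * allocated. Hypothetical helper, not taken from CCExtractor. */
size_t edit_distance(const char *a, size_t la, const char *b, size_t lb)
{
    size_t *row = malloc((lb + 1) * sizeof *row);
    if (row == NULL)
        return (size_t)-1;

    for (size_t j = 0; j <= lb; j++)
        row[j] = j;                      /* distance from empty prefix of a */

    for (size_t i = 1; i <= la; i++) {
        size_t diag = row[0];            /* D[i-1][j-1] */
        row[0] = i;
        for (size_t j = 1; j <= lb; j++) {
            size_t up = row[j];          /* D[i-1][j] */
            size_t best = diag + ((a[i - 1] == b[j - 1]) ? 0 : 1);
            if (up + 1 < best)
                best = up + 1;           /* deletion */
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;   /* insertion */
            row[j] = best;
            diag = up;
        }
    }

    size_t d = row[lb];
    free(row);
    return d;
}

/* Returns 1 if cur should be treated as a repeat of prev.
 * max_dist == 0 means exact comparison only (assumed analogue of
 * -nolevdist); otherwise any line within max_dist edits counts as a repeat,
 * which is how a distinct but similar line can be dropped. */
int lines_similar(const char *prev, const char *cur, size_t max_dist)
{
    size_t lp = strlen(prev), lc = strlen(cur);
    if (max_dist == 0)
        return lp == lc && memcmp(prev, cur, lp) == 0;

    size_t d = edit_distance(prev, lp, cur, lc);
    return d != (size_t)-1 && d <= max_dist;
}
```

With a non-zero threshold, two subtitle lines that differ by only a character or two compare as equal, so one of them is lost; with a threshold of 0 they are kept as separate lines.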
Star Wars Rebels_Disney Channel_2014-12-12_22-24_cortado.ts:
It has a teletext subtitle stream, but neither VLC nor PotPlayer can display any subtitles. CCExtractor can't extract anything from it.
I think this may be because the video itself doesn't actually contain any subtitles.
Cine Clan TVE Perez, el ratoncito de tus sueños 2.ts:
It contains DVB subtitles, but CCExtractor isn't able to extract anything from it. -out=spupng doesn't work either.
The cause is that the stream doesn't send DVBSUB_DISPLAY_SEGMENT. Although this case is considered, it is poorly handled. Patch: #866
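As a rough illustration of the missing DVBSUB_DISPLAY_SEGMENT case, the sketch below emits a pending display set when the next page composition segment arrives, without waiting for an end-of-display-set segment that never comes. It does not reproduce patch #866 or CCExtractor's decoder: the struct, the function name feed_segment and the flush policy are assumptions for this example. Only the segment type values are taken from ETSI EN 300 743.

```c
#include <stdint.h>

/* Segment type values as defined in ETSI EN 300 743. */
enum {
    SEG_PAGE_COMPOSITION   = 0x10,
    SEG_REGION_COMPOSITION = 0x11,
    SEG_CLUT_DEFINITION    = 0x12,
    SEG_OBJECT_DATA        = 0x13,
    SEG_END_OF_DISPLAY_SET = 0x80
};

/* Hypothetical decoder state for this sketch. */
struct display_set_state {
    int pending;  /* nonzero while a display set is started but not emitted */
    int emitted;  /* number of display sets emitted so far */
};

/* Feeds one segment type into the state machine.
 * Returns 1 if a display set was emitted as a result, 0 otherwise. */
int feed_segment(struct display_set_state *st, uint8_t type)
{
    int flushed = 0;

    switch (type) {
    case SEG_PAGE_COMPOSITION:
        /* A new page starts. If the previous display set was never closed
         * by an end-of-display-set segment, emit it now instead of
         * silently dropping it. */
        if (st->pending) {
            st->emitted++;
            flushed = 1;
        }
        st->pending = 1;
        break;
    case SEG_END_OF_DISPLAY_SET:
        /* Normal path: the stream explicitly closes the display set. */
        if (st->pending) {
            st->emitted++;
            st->pending = 0;
            flushed = 1;
        }
        break;
    default:
        /* Region, CLUT and object segments only add data to the pending set. */
        break;
    }
    return flushed;
}
```

A real decoder would also have to flush the last pending set at end of stream and decide which timestamp to attach to a set that was closed implicitly.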
Cine Clan TVE Perez, el ratoncito de tus sueños 2_cortado.ts:
Same problem as "Cine Clan TVE Perez, el ratoncito de tus sueños 2.ts"
During debugging, I also discovered a heap corruption problem caused by add_ocrtext2str (Patch: #865).
@cfsmp3 commented on GitHub (Dec 31, 2017):
@harrynull use -nolevdist if you want fuzzy_memcmp to behave like memcmp
@cfsmp3 commented on GitHub (Jan 11, 2018):
First one (teletext) works fine.
However the 2nd one shows a bunch of messages:
In ocr_bitmap: Failed to perform OCR. Skipped.
In ocr_bitmap: Failed to perform OCR. Skipped.
In ocr_bitmap: Failed to perform OCR. Skipped.
Takes forever, too.
@harrynull commented on GitHub (Jan 12, 2018):
It is caused by some of the images being completely empty and invalid for some reason.
But it should not affect the output file.
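One way to keep completely empty images away from the OCR step is to test the bitmap before handing it over. The sketch below assumes an 8-bit indexed bitmap stored row by row with a known background palette index. The function name bitmap_is_blank and its parameters are hypothetical and do not correspond to CCExtractor's ocr_bitmap.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the bitmap has nothing to recognize: a NULL buffer, a zero
 * dimension, or every pixel equal to background_index. Returns 0 as soon as
 * one non-background pixel is found. Assumes width * height pixels are
 * stored contiguously, one byte per pixel. */
int bitmap_is_blank(const uint8_t *pixels, size_t width, size_t height,
                    uint8_t background_index)
{
    if (pixels == NULL || width == 0 || height == 0)
        return 1;

    for (size_t i = 0; i < width * height; i++) {
        if (pixels[i] != background_index)
            return 0;
    }
    return 1;
}
```

A caller could skip OCR for blank bitmaps and log a single message, rather than reporting a failed OCR attempt for each one.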
@cfsmp3 commented on GitHub (Jan 12, 2018):
@harrynull It does, check this out:
670
01:00:15,877 --> 01:00:20,676
Enos oi onimnro nno dnonio
otnpnnio pnno oroannthonio.
671
01:00:20,677 --> 01:00:23,116
TI‘QMG.
Monono. oi no“n
672
01:00:23,117 --> 01:00:27,756
sono wondido o onion
nnos olrozoo non on.
That's total gibberish :-) There's definitely a correlation between those errors and the incorrect lines.
It's definitely better than before, and there's lots of good output - but still not perfect.
@harrynull commented on GitHub (Jan 13, 2018):
@cfsmp3
It works well here:
Did you forget to put spa.traineddata in the right place?
But I did find that sometimes doesn't close
In addition, some subtitles are skipped and missing. I am not sure if it is a limitation of Tesseract, but I will check them later.
@cfsmp3 commented on GitHub (Jan 13, 2018):
That stuff in 670, 671 and 672 is not Spanish, believe me :-) (or, I suspect, any other language)
@cfsmp3 commented on GitHub (Mar 22, 2023):
Status update: Still broken. Possibly differently. The file that matters is Cine Clan TVE *.ts (ignore the Disney one).
We get lots of these messages:
and a bonus: