mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-17 05:25:33 +00:00
Here's a quandary: 608 captions in Korean (redux) #212
Originally created by @SWalkerTTU on GitHub (Dec 3, 2016).
The last issue, https://github.com/CCExtractor/ccextractor/issues/409, was closed (I've been a bit busy lately); the comment there asked for the TS file and not the BIN file. I'll attach Dropbox links to the TS file as well as the BIN file. I don't believe the problem is with the extractor but with the 608 decoder: legible caption data was extracted, but not converted to SRT.
BIN: https://www.dropbox.com/s/g72aaw3txzmto8a/DancingQueen.bin?dl=0
TS: https://www.dropbox.com/s/xvq99nyhj4rymlp/DancingQueen.ts?dl=0
@SWalkerTTU commented on GitHub (Dec 3, 2016):
Correction: these are 708 captions. Still not producing an output, though.
@Izaron commented on GitHub (Jan 2, 2017):
Can you please re-upload your video file?
@saurabhshri commented on GitHub (Jan 2, 2017):
@Izaron Don't click the links; just copy and paste them into the address bar. Those Dropbox links are still working. No idea why they redirect to a GitHub 404 page.
https://www.dropbox.com/s/xvq99nyhj4rymlp/DancingQueen.ts?dl=0
@ghost commented on GitHub (Jan 8, 2017):
Hey, just confirming this is still a thing in 0.84. Going to try to do something hopefully.
@Izaron commented on GitHub (Jan 8, 2017):
Launch ccextractor as
ccextractor DancingQueen.ts -svc all
and you will get 2 files: the first, DancingQueen.srt, is empty; the second, DancingQueen.p2.svc01.srt, contains English words and unrecognized characters from CEA-708. Maybe Korean characters? Current CEA-708 support handles Unicode poorly, unlike the CEA-608 support. We are working on complete CEA-708 support.
@SWalkerTTU commented on GitHub (Jan 11, 2017):
I did a little work over the weekend on a bespoke Java program just for this file. SBS's (the broadcaster's) encoding is a little strange (and it may not just be SBS — I'd need KBS and MBC raw streams to confirm it): two-byte encoding is based on EUC-KR, but Latin characters in both the 708 and EUC-KR standards only require one byte (ASCII, of course). SBS's encoder pads Latin characters with a null high-order byte. Also, apparently the encoder does not properly set the window.
Edit:
I feel like a total idiot now; I ran the program with -svc "all[EUC-KR]" and got a completely serviceable set of SRT subtitles. I actually did find somewhere that Korean TV uses EUC-KR encoding. I can't remember where now, though, but it is confirmed with these captions from both ccextractor (0.81) and my own little program. As Emily Litella might say, never mind.
@SWalkerTTU commented on GitHub (Jan 11, 2017):
I found it again: the Wikipedia article on EIA-608 mentioned the Norpak extension calling for KS C 5601 encoding (now KS X 1001), one variant of which is EUC-KR. The section on the Norpak extension goes some way to explain the odd encoding: in 608 captions, the entire character system is replaced, and even ASCII characters take two bytes, likely so as to avoid reusing control codes. It makes sense that captioning practice for DTV (708) follows from practice for analog TV (608), even though the 708 standard does not require doing things like null-padding single-byte characters.
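For illustration, here is a minimal Python sketch of the null-padded layout described in the comments above. It is not ccextractor's decoder or the Java program mentioned earlier. It assumes the caption payload has already been extracted as a flat run of two-byte units, that a 0x00 high byte marks a padded ASCII character, and that every other pair is an EUC-KR character. The function name `decode_padded_euc_kr` and the sample bytes are hypothetical.

```python
def decode_padded_euc_kr(payload: bytes) -> str:
    """Decode two-byte units where Latin characters carry a null high byte.

    Assumption: the payload holds only character data (no 708 commands)
    and every character occupies exactly two bytes.
    """
    if len(payload) % 2:
        raise ValueError("payload must contain whole two-byte units")

    out = []
    for i in range(0, len(payload), 2):
        hi, lo = payload[i], payload[i + 1]
        if hi == 0x00:
            # Null-padded single-byte character: the low byte is assumed ASCII.
            out.append(chr(lo))
        else:
            # Any other pair is treated as one EUC-KR (KS X 1001) character.
            out.append(bytes((hi, lo)).decode("euc_kr", errors="replace"))
    return "".join(out)


# Hypothetical sample: "A", the Hangul syllable 0xB0 0xA1, then "B".
sample = b"\x00A\xb0\xa1\x00B"
print(decode_padded_euc_kr(sample))
```

Handling the padded pairs explicitly keeps the null bytes out of the decoded text. For real files, the `-svc "all[EUC-KR]"` option reported above already produced usable SRT output.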