Here's a quandary: 608 captions in Korean (redux) #212

Closed
opened 2026-01-29 16:38:14 +00:00 by claunia · 7 comments

Originally created by @SWalkerTTU on GitHub (Dec 3, 2016).

The previous issue, https://github.com/CCExtractor/ccextractor/issues/409, has since been closed (I've been a bit busy lately). The comment there asked for the TS file rather than the BIN file, so here are Dropbox links to both. I don't believe the problem is with the extractor but with the 608 decoder: legible caption data was extracted, but it was not converted to SRT.

BIN: https://www.dropbox.com/s/g72aaw3txzmto8a/DancingQueen.bin?dl=0
TS: https://www.dropbox.com/s/xvq99nyhj4rymlp/DancingQueen.ts?dl=0


@SWalkerTTU commented on GitHub (Dec 3, 2016):

Correction: these are 708 captions. Still not producing an output, though.


@Izaron commented on GitHub (Jan 2, 2017):

Can you please re-upload your video file?


@saurabhshri commented on GitHub (Jan 2, 2017):

@Izaron Don't click the links; copy and paste them into the address bar instead. Those Dropbox links are still working. No idea why they redirect to a GitHub 404 page.

https://www.dropbox.com/s/xvq99nyhj4rymlp/DancingQueen.ts?dl=0


@ghost commented on GitHub (Jan 8, 2017):

Hey, just confirming this is still a thing in 0.84. Going to try to do something hopefully.


@Izaron commented on GitHub (Jan 8, 2017):

Launch ccextractor as `ccextractor DancingQueen.ts -svc all` and you will get two files. The first, `DancingQueen.srt`, is empty. The second, [DancingQueen.p2.svc01.srt](https://gist.github.com/Izaron/34136a8ec8216469c3c3828acdfbe53e), contains English words and unrecognized characters from the CEA-708 stream. Maybe Korean characters? Current CEA-708 support handles Unicode poorly, whereas CEA-608 support handles it well. We are working on complete CEA-708 support.
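For anyone scripting this, here is a minimal Python sketch that builds the command line from the comment above. The helper name `build_svc_command` is hypothetical and not part of ccextractor. The `-svc all` form comes from that comment, and the bracketed charset form (`all[EUC-KR]`) is the one reported later in this thread; no other ccextractor options are assumed.

```python
from typing import List, Optional


def build_svc_command(ts_path: str, charset: Optional[str] = None) -> List[str]:
    """Build a ccextractor argument list that requests all CEA-708 services.

    Hypothetical helper for illustration only. When `charset` is given,
    it is appended in brackets, as in -svc "all[EUC-KR]".
    """
    services = "all" if charset is None else f"all[{charset}]"
    return ["ccextractor", ts_path, "-svc", services]


if __name__ == "__main__":
    # Join with spaces only for display; pass the list itself to a process runner.
    print(" ".join(build_svc_command("DancingQueen.ts")))
```

The returned list can be handed to `subprocess.run` directly, which avoids shell quoting problems with the square brackets.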

@SWalkerTTU commented on GitHub (Jan 11, 2017):

I did a little work over the weekend on a bespoke Java program just for this file. SBS's (the broadcaster's) encoding is a little strange (and it may not just be SBS — I'd need KBS and MBC raw streams to confirm it): two-byte encoding is based on EUC-KR, but Latin characters in both the 708 and EUC-KR standards only require one byte (ASCII, of course). SBS's encoder pads Latin characters with a null high-order byte. Also, apparently the encoder does not properly set the window.

Edit:

I feel like a total idiot now; I ran ccextractor with `-svc "all[EUC-KR]"` and got a completely serviceable set of SRT subtitles. I actually did read somewhere that Korean TV uses EUC-KR encoding. I can't remember where now, but it is confirmed for these captions by both ccextractor (0.81) and my own little program. As Emily Litella might say, never mind.


@SWalkerTTU commented on GitHub (Jan 11, 2017):

I found it again: the Wikipedia article on EIA-608 mentioned the Norpak extension calling for KS C 5601 encoding (now KS X 1001), one variant of which is EUC-KR. The section on the Norpak extension goes some way to explain the odd encoding: in 608 captions, the entire character system is replaced, and even ASCII characters take two bytes, likely so as to avoid reusing control codes. It makes sense that captioning practice for DTV (708) follows from practice for analog TV (608), even though the 708 standard does not require doing things like null-padding single-byte characters.
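To make the KS X 1001 / EUC-KR relationship concrete: in EUC-KR, each byte of a two-byte character is the KS X 1001 row or cell number plus 0xA0. A small Python sketch using the standard `euc_kr` codec follows; the function name `ks_x_1001_position` is hypothetical and chosen just for this illustration.

```python
from typing import Tuple


def ks_x_1001_position(ch: str) -> Tuple[int, int]:
    """Return the (row, cell) of a character in the KS X 1001 94x94 table.

    Works by encoding to EUC-KR and subtracting 0xA0 from each byte.
    Raises ValueError for characters that are not a plain two-byte
    EUC-KR code (for example ASCII, which encodes to one byte).
    """
    raw = ch.encode("euc_kr")
    if len(raw) != 2:
        raise ValueError(f"{ch!r} is not a two-byte EUC-KR character")
    return raw[0] - 0xA0, raw[1] - 0xA0


if __name__ == "__main__":
    print(ks_x_1001_position("가"))  # (16, 1)
```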
