[PR #1871] fix(708): Support Korean EUC-KR encoding in CEA-708 decoder #2649


claunia · 2026-01-29T17:23:15Z


Original Pull Request: https://github.com/CCExtractor/ccextractor/pull/1871

State: closed
Merged: Yes

Summary

Korean broadcasts carry CEA-708 captions in EUC-KR, a variable-width encoding in which ASCII characters take 1 byte and Korean characters take 2 bytes. The decoder always wrote 2 bytes per character (UTF-16BE style), so a NULL byte (0x00) was inserted before every ASCII character, including spaces and punctuation.

Changes

- Add is_utf16_charset() function to detect fixed-width 16-bit encodings (UTF-16BE, UCS-2)
- Modify write_char() to accept use_utf16 flag:
  - true: Always 2 bytes (UTF-16BE for Japanese/Chinese, maintains fix for #1451)
  - false: 1 byte for ASCII, 2 bytes for extended chars (EUC-KR for Korean)
- Detect charset type in write_row() before building output buffer
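
The behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code merged in the PR: the function names come from the description, but the signatures, the `u16` symbol representation, the `Vec<u8>` output buffer, and the exact set of accepted charset names are all assumptions made for the example.

```rust
/// Hypothetical sketch: true for the fixed-width 16-bit charsets named in
/// the PR description (UTF-16BE, UCS-2). The case-insensitive match is an
/// assumption; the real function may accept other spellings.
fn is_utf16_charset(charset: &str) -> bool {
    charset.eq_ignore_ascii_case("UTF-16BE") || charset.eq_ignore_ascii_case("UCS-2")
}

/// Hypothetical sketch: append one caption symbol to `out`.
/// - use_utf16 == true:  always 2 bytes, big-endian.
/// - use_utf16 == false: 1 byte when the high byte is zero (ASCII range),
///   otherwise 2 bytes (e.g. a two-byte EUC-KR code).
fn write_char(code: u16, use_utf16: bool, out: &mut Vec<u8>) {
    let [hi, lo] = code.to_be_bytes();
    if use_utf16 || hi != 0 {
        out.push(hi);
        out.push(lo);
    } else {
        out.push(lo);
    }
}

/// Hypothetical sketch: decide the width mode once per row, then build the
/// output buffer. `None` stands in for raw output with no charset given.
fn write_row(symbols: &[u16], charset: Option<&str>) -> Vec<u8> {
    let use_utf16 = charset.map_or(false, is_utf16_charset);
    let mut out = Vec::with_capacity(symbols.len() * 2);
    for &sym in symbols {
        write_char(sym, use_utf16, &mut out);
    }
    out
}
```

Under these assumptions, a row holding an illustrative two-byte code followed by an ASCII comma yields three bytes in EUC-KR mode and four bytes (with a leading 0x00 before the comma) in UTF-16BE mode, which is the difference shown in the before and after samples.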

Before fix

     그래 ,  계집  때문이었더냐 .   # Extra spaces from NULL bytes

After fix

     그래, 계집 때문이었더냐.        # Clean Korean text

Test plan

- Tested with Korean MBC sample (mbc.ts) - drama dialog extracted correctly
- Tested with Korean KBS sample (0623_215529_CH9-1_KBS.mpg) - news broadcast extracted correctly
- All 301 Rust unit tests pass
- No NULL bytes in output with --service "1[EUC-KR]"
- Backward compatible: raw output (no charset) still works without NULL bytes

Closes #1065

🤖 Generated with Claude Code


claunia added the pull-request label 2026-01-29 17:23:15 +00:00


Reference: starred/ccextractor#2649