[PR #1871] fix(708): Support Korean EUC-KR encoding in CEA-708 decoder #2649

opened 2026-01-29 17:23:15 +00:00 by claunia · 0 comments

Original Pull Request: https://github.com/CCExtractor/ccextractor/pull/1871

State: closed
Merged: Yes


Summary

Korean broadcasts carry CEA-708 captions in EUC-KR, a variable-width encoding in which ASCII characters take 1 byte and Korean characters take 2 bytes. The decoder always wrote 2 bytes per character (UTF-16BE style), so a NULL byte (0x00) was inserted before every ASCII character (spaces, punctuation).

Changes

  • Add is_utf16_charset() function to detect fixed-width 16-bit encodings (UTF-16BE, UCS-2)
  • Modify write_char() to accept use_utf16 flag:
    • true: Always 2 bytes (UTF-16BE for Japanese/Chinese, maintains fix for #1451)
    • false: 1 byte for ASCII, 2 bytes for extended chars (EUC-KR for Korean)
  • Detect charset type in write_row() before building output buffer
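
For reviewers who want the shape of the change without opening the diff, the three bullets above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the function names come from this description, but the signatures, the `u16` symbol representation, the `Vec<u8>` output buffer, and the exact charset strings accepted are all assumptions.

```rust
/// Hypothetical sketch: true for fixed-width 16-bit charsets.
/// The accepted names here are an assumption based on the PR text.
fn is_utf16_charset(charset: &str) -> bool {
    let name = charset.to_ascii_uppercase();
    matches!(name.as_str(), "UTF-16BE" | "UCS-2")
}

/// Hypothetical sketch: append one caption symbol to `out`.
/// Assumes each symbol is held as a 16-bit value whose high byte is
/// zero for ASCII and non-zero for two-byte (e.g. EUC-KR) characters.
fn write_char(sym: u16, use_utf16: bool, out: &mut Vec<u8>) {
    let hi = (sym >> 8) as u8;
    let lo = (sym & 0xFF) as u8;
    if use_utf16 || hi != 0 {
        // Fixed-width output, or a genuine two-byte character.
        out.push(hi);
        out.push(lo);
    } else {
        // Variable-width output: ASCII is a single byte, no 0x00 prefix.
        out.push(lo);
    }
}

/// Hypothetical sketch: decide the width rule once per row, then
/// build the output buffer. `None` stands for raw output (no charset).
fn write_row(symbols: &[u16], charset: Option<&str>) -> Vec<u8> {
    let use_utf16 = charset.map_or(false, is_utf16_charset);
    let mut out = Vec::with_capacity(symbols.len() * 2);
    for &sym in symbols {
        write_char(sym, use_utf16, &mut out);
    }
    out
}
```

Under these assumptions, a row holding 가 (0xB0A1 in EUC-KR) followed by a comma and a space yields 4 bytes with `Some("EUC-KR")` or `None`, and 6 bytes (with 0x00 before each ASCII byte) with `Some("UTF-16BE")`. Deciding the flag once in `write_row()` keeps `write_char()` free of charset string comparisons.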

Before fix

     그래 ,  계집  때문이었더냐 .   # Extra spaces from NULL bytes

After fix

     그래, 계집 때문이었더냐.        # Clean Korean text

Test plan

  • Tested with Korean MBC sample (mbc.ts) - drama dialog extracted correctly
  • Tested with Korean KBS sample (0623_215529_CH9-1_KBS.mpg) - news broadcast extracted correctly
  • All 301 Rust unit tests pass
  • No NULL bytes in output with --service "1[EUC-KR]"
  • Backward compatible: raw output (no charset) still works without NULL bytes
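
The "no NULL bytes" items above can be checked mechanically. The helper below is a hypothetical sketch of such a check, not part of the test suite mentioned in this PR; it only makes sense for EUC-KR or raw output, since UTF-16BE output legitimately contains 0x00 bytes.

```rust
/// Hypothetical helper: return the offset of every NUL (0x00) byte
/// in a decoded caption buffer. An empty result means the output is clean.
fn nul_offsets(data: &[u8]) -> Vec<usize> {
    data.iter()
        .enumerate()
        .filter_map(|(i, &b)| if b == 0 { Some(i) } else { None })
        .collect()
}
```

To check a real output file, the bytes could be loaded with `std::fs::read` and passed to `nul_offsets`; any returned offset points at a stray NULL byte.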

Closes #1065

🤖 Generated with Claude Code

claunia added the pull-request label 2026-01-29 17:23:15 +00:00
Reference: starred/ccextractor#2649