[PR #1871] [MERGED] fix(708): Support Korean EUC-KR encoding in CEA-708 decoder #2644

Open
opened 2026-01-29 17:23:15 +00:00 by claunia · 0 comments

📋 Pull Request Information

Original PR: https://github.com/CCExtractor/ccextractor/pull/1871
Author: @cfsmp3
Created: 12/21/2025
Status: Merged
Merged: 12/21/2025
Merged by: @cfsmp3

Base: master ← Head: fix/korean-euc-kr-support


📝 Commits (3)

  • da3dc52 fix(708): Support Korean EUC-KR encoding in CEA-708 decoder
  • d0caf23 fix(timing): Use i64 instead of c_long for Windows compatibility
  • 73cd19f fix(rust): Use i64 instead of c_long for Windows compatibility

📊 Changes

6 files changed (+125 additions, -41 deletions)


📝 src/rust/src/avc/nal.rs (+2 -2)
📝 src/rust/src/decoder/output.rs (+86 -15)
📝 src/rust/src/decoder/tv_screen.rs (+19 -2)
📝 src/rust/src/es/pic.rs (+1 -5)
📝 src/rust/src/lib.rs (+3 -3)
📝 src/rust/src/libccxr_exports/time.rs (+14 -14)

📄 Description

Summary

Korean broadcasts use EUC-KR encoding (variable-width) in CEA-708 captions, where ASCII is 1 byte and Korean characters are 2 bytes. The decoder always wrote 2 bytes per character (UTF-16BE style), which inserted a NULL byte (0x00) before every ASCII character (spaces, punctuation).
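
To make the failure mode concrete, here is a minimal sketch of a fixed-width writer. The function name `write_fixed_width` and the representation of caption symbols as `u16` values are illustrative assumptions, not the decoder's actual code.

```rust
/// Hypothetical stand-in for the pre-fix behavior: every symbol is
/// emitted as two big-endian bytes, regardless of the target charset.
fn write_fixed_width(symbols: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(symbols.len() * 2);
    for &symbol in symbols {
        // A symbol below 0x100 (e.g. ASCII ',' = 0x002C) gets a leading 0x00.
        out.extend_from_slice(&symbol.to_be_bytes());
    }
    out
}
```

For the EUC-KR symbols `[0xB0A1, 0x2C, 0x20]` ("가" followed by a comma and a space), this yields `B0 A1 00 2C 00 20`. The two `00` bytes are the stray NULL bytes described above.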

Changes

  • Add is_utf16_charset() function to detect fixed-width 16-bit encodings (UTF-16BE, UCS-2)
  • Modify write_char() to accept use_utf16 flag:
    • true: Always 2 bytes (UTF-16BE for Japanese/Chinese, maintains fix for #1451)
    • false: 1 byte for ASCII, 2 bytes for extended chars (EUC-KR for Korean)
  • Detect charset type in write_row() before building output buffer
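
The changes listed above could be sketched roughly as follows. This is not the code from `output.rs`: the signatures, the `u16` symbol type, the `encode_row` helper (standing in for the relevant part of `write_row()`), and the rule "high byte is zero means one byte" are all assumptions made for illustration.

```rust
/// Assumed check: treat only the names mentioned in the PR as
/// fixed-width 16-bit encodings (compared case-insensitively).
fn is_utf16_charset(charset: &str) -> bool {
    let name = charset.to_ascii_uppercase();
    matches!(name.as_str(), "UTF-16BE" | "UCS-2")
}

/// Assumed shape of write_char(): append one symbol to the output buffer.
/// - use_utf16 == true:  always two bytes (big-endian).
/// - use_utf16 == false: one byte when the high byte is zero (ASCII),
///   two bytes otherwise (e.g. an EUC-KR double-byte character).
fn write_char(symbol: u16, use_utf16: bool, buf: &mut Vec<u8>) {
    let [hi, lo] = symbol.to_be_bytes();
    if use_utf16 || hi != 0 {
        buf.push(hi);
    }
    buf.push(lo);
}

/// Hypothetical helper: detect the charset type once, then encode
/// every symbol in the row with the same width rule.
fn encode_row(symbols: &[u16], charset: Option<&str>) -> Vec<u8> {
    // No charset (raw output) is treated as variable-width here.
    let use_utf16 = charset.map_or(false, is_utf16_charset);
    let mut buf = Vec::with_capacity(symbols.len() * 2);
    for &symbol in symbols {
        write_char(symbol, use_utf16, &mut buf);
    }
    buf
}
```

Under these assumptions, `encode_row(&[0xB0A1, 0x2C, 0x20], Some("EUC-KR"))` produces `B0 A1 2C 20` with no NULL bytes, while the same call with `Some("UTF-16BE")` still pads every symbol to two bytes. Deciding the width once per row keeps the per-character path free of string comparisons.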

Before fix

     그래 ,  계집  때문이었더냐 .   # Extra spaces from NULL bytes

After fix

     그래, 계집 때문이었더냐.        # Clean Korean text

Test plan

  • Tested with Korean MBC sample (mbc.ts) - drama dialog extracted correctly
  • Tested with Korean KBS sample (0623_215529_CH9-1_KBS.mpg) - news broadcast extracted correctly
  • All 301 Rust unit tests pass
  • No NULL bytes in output with --service "1[EUC-KR]"
  • Backward compatible: raw output (no charset) still works without NULL bytes
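
The "no NULL bytes" checks in the test plan could be automated with a small scan over the output bytes. The helper below is a hypothetical sketch, not part of the CCExtractor test suite. It only makes sense for variable-width output such as EUC-KR, since UTF-16BE output legitimately contains 0x00 bytes.

```rust
/// Hypothetical helper: return the offsets of all 0x00 bytes in `data`.
/// An empty result means the output is free of NULL bytes.
fn nul_offsets(data: &[u8]) -> Vec<usize> {
    data.iter()
        .enumerate()
        .filter_map(|(i, &b)| if b == 0 { Some(i) } else { None })
        .collect()
}
```

Reporting offsets rather than a plain yes/no makes it easier to locate the offending caption line when a regression appears.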

Closes #1065

🤖 Generated with Claude Code


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-29 17:23:15 +00:00
Reference: starred/ccextractor#2644