[PR #1820] [MERGED] fix(708): Write consistent 2-byte UTF-16BE encoding for CEA-708 captions #2566

Open
opened 2026-01-29 17:22:48 +00:00 by claunia · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/CCExtractor/ccextractor/pull/1820
Author: @cfsmp3
Created: 12/14/2025
Status: Merged
Merged: 12/14/2025
Merged by: @cfsmp3

Base: master ← Head: fix/issue-1451-utf16-encoding


📝 Commits (2)

  • 9e665a1 fix(708): Write consistent 2-byte UTF-16BE encoding for CEA-708 captions
  • 238f411 test(708): Update write_char test to expect 2-byte UTF-16BE output

📊 Changes

2 files changed (+19 additions, -22 deletions)

📝 src/lib_ccx/ccx_decoders_708_output.c (+7 -11)
📝 src/rust/src/decoder/output.rs (+12 -11)

📄 Description

Summary

  • Fixed the write_utf16_char function in C (ccx_decoders_708_output.c) to always write 2 bytes
  • Fixed the write_char function in Rust (decoder/output.rs) to always write 2 bytes
  • This ensures consistent UTF-16BE encoding that iconv/encoding_rs can properly convert to UTF-8

Problem

When extracting CEA-708 captions with Japanese or Chinese characters using --service all[UTF-16BE], the output was garbled:

人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

The root cause was that both C and Rust implementations wrote:

  • 1 byte for ASCII characters (high byte = 0)
  • 2 bytes for non-ASCII characters

This created an invalid mix of 8-bit and 16-bit values that couldn't be properly converted.
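The mixed-width behaviour described above can be reproduced in a few lines. This is a minimal sketch, not the code that was in CCExtractor: `write_char_mixed`, `encode_mixed`, and `decode_utf16be_lossy` are hypothetical names, and the decoder is a simplified stand-in for a real converter such as iconv or encoding_rs.

```rust
/// Hypothetical reconstruction of the pre-fix writer (not the actual
/// CCExtractor code): one byte when the high byte is zero, two otherwise.
fn write_char_mixed(unit: u16, out: &mut Vec<u8>) {
    let [hi, lo] = unit.to_be_bytes();
    if hi == 0 {
        out.push(lo); // ASCII: the zero high byte is dropped
    } else {
        out.extend_from_slice(&[hi, lo]);
    }
}

/// Encode `text` code unit by code unit with the mixed-width writer above.
fn encode_mixed(text: &str) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in text.encode_utf16() {
        write_char_mixed(unit, &mut out);
    }
    out
}

/// Simplified stand-in for a UTF-16BE converter: pairs bytes from the start
/// of the stream and silently drops an unpaired trailing byte.
fn decode_utf16be_lossy(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}
```

For the input `知 った`, `encode_mixed` yields the bytes `77 E5 20 30 63 30 5F`. The single-byte space shifts every later pair by one byte, so the stream reads back as U+77E5, U+2030, U+6330 with `5F` left over, which matches the `知‰挰` at the start of the second garbled run in the sample above.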

Solution

Always write 2 bytes per character, ensuring valid UTF-16BE encoding. After the fix:

人々が私を知 ったとき、私は 時間管理につい て書いています
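The fixed behaviour can be sketched the same way. The snippet below is an illustration under assumptions rather than the merged code: `write_char_utf16be`, `encode_utf16be`, and `decode_utf16be` are hypothetical names, and the sketch works on UTF-16 code units, writing both bytes of each unit even when the high byte is zero.

```rust
use std::string::FromUtf16Error;

/// Illustrative fixed-width writer (an assumption, not the merged code):
/// always emit both bytes of the code unit, high byte first.
fn write_char_utf16be(unit: u16, out: &mut Vec<u8>) {
    out.extend_from_slice(&unit.to_be_bytes());
}

/// Encode `text` as UTF-16BE, two bytes per code unit.
fn encode_utf16be(text: &str) -> Vec<u8> {
    let mut out = Vec::new();
    for unit in text.encode_utf16() {
        write_char_utf16be(unit, &mut out);
    }
    out
}

/// Decode a UTF-16BE byte stream; fails on invalid surrogate sequences.
/// An odd trailing byte is ignored by `chunks_exact`.
fn decode_utf16be(bytes: &[u8]) -> Result<String, FromUtf16Error> {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units)
}
```

With this writer, `知 った` becomes `77 E5 00 20 30 63 30 5F`: the space keeps its zero high byte, the pairs stay aligned, and the text round-trips unchanged.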

Test plan

  • Downloaded and tested with the sample file from issue #1451
  • Verified Japanese captions in service 2 now display correctly
  • Verified Chinese captions in service 3 now display correctly
  • Verified no encoding errors are reported
  • Verified build succeeds for both C and Rust components

Fixes #1451

🤖 Generated with Claude Code


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-29 17:22:48 +00:00
Reference: starred/ccextractor#2566