mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-04 05:44:53 +00:00
[PR #1871] fix(708): Support Korean EUC-KR encoding in CEA-708 decoder #2649
Original Pull Request: https://github.com/CCExtractor/ccextractor/pull/1871
State: closed
Merged: Yes
Summary
Korean broadcasts use EUC-KR encoding (variable-width) in CEA-708 captions, where ASCII is 1 byte and Korean characters are 2 bytes. The decoder was always writing 2 bytes per character (UTF-16BE style), causing NULL bytes (0x00) to be inserted before every ASCII character (spaces, punctuation).
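The difference between the two output modes can be illustrated with a minimal sketch. This is not the decoder's actual code: put_symbol() and encode_row() are hypothetical names, and the rule "a symbol needs two bytes when its high byte is non-zero" is an assumption based on EUC-KR lead bytes never being 0x00.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the decoder's per-character writer.
 * Appends one 16-bit caption symbol to buf and returns the bytes written. */
static size_t put_symbol(uint16_t sym, int use_utf16, unsigned char *buf)
{
	/* Fixed-width mode, or a symbol whose high byte is set:
	 * emit both bytes, high byte first. */
	if (use_utf16 || (sym >> 8) != 0) {
		buf[0] = (unsigned char)(sym >> 8);
		buf[1] = (unsigned char)(sym & 0xFF);
		return 2;
	}
	/* Variable-width mode and the symbol fits in one byte (ASCII). */
	buf[0] = (unsigned char)sym;
	return 1;
}

/* Hypothetical row encoder: out must hold at least 2 * count bytes. */
static size_t encode_row(const uint16_t *syms, size_t count,
			 int use_utf16, unsigned char *out)
{
	size_t len = 0;
	for (size_t i = 0; i < count; i++)
		len += put_symbol(syms[i], use_utf16, out + len);
	return len;
}
```

For the symbols 0xB0A1 (가 in EUC-KR), 0x20 (space) and 0x41 (A), the fixed-width mode yields the six bytes B0 A1 00 20 00 41, with a 0x00 before each ASCII character. The variable-width mode yields the four bytes B0 A1 20 41, which is valid EUC-KR.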
Changes
- Add is_utf16_charset() function to detect fixed-width 16-bit encodings (UTF-16BE, UCS-2)
- Update write_char() to accept a use_utf16 flag:
  - true: always 2 bytes (UTF-16BE for Japanese/Chinese, maintains fix for #1451)
  - false: 1 byte for ASCII, 2 bytes for extended chars (EUC-KR for Korean)
- Check the charset in write_row() before building the output buffer
Before fix
After fix
Test plan
- mbc.ts: drama dialog extracted correctly
- 0623_215529_CH9-1_KBS.mpg: news broadcast extracted correctly
- --service "1[EUC-KR]"
Closes #1065
🤖 Generated with Claude Code