[BUG] A mix of 8-bit/16-bit chars sent to iconv #714

Open
opened 2026-01-29 16:51:52 +00:00 by claunia · 0 comments
Owner

Originally created by @erankor on GitHub (Aug 24, 2022).

Necessary information

  • Is this a regression (i.e. did it work before)? NO
  • What platform did you use? Linux
  • What were the used arguments? ./ccextractor test.ts -svc all[UTF-16BE] -nofc -12

Video links

http://cdnapi.kaltura.com/p/2035982/playManifest/entryId/1_frxnu0yr/flavorId/1_tr3kiz6l/format/download/a.ts

Additional information

Hi all,

I have a TS file with 708 subtitles in Japanese & Chinese that fails to decode properly.
After some debugging, I found that if I patch the function write_utf16_char here -
https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708_output.c#L113
to always output 2-byte chars (I changed the if to if (1)) and specify an encoding of UTF-16BE, it decodes properly.

This code looks off to me, as it creates a mix of 8-bit & 16-bit chars with no clear encoding (it's neither UTF-8 nor UTF-16).
Maybe when iconv is used, the function should always output 2-byte chars?
Alternatively, it could use 2 bytes for ALL chars whenever ANY char doesn't fit in 1 byte, which would also be OK (but this sounds more complex to do...).
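
To illustrate the problem, here is a minimal sketch, not the actual ccextractor code. The names write_mixed, write_utf16be and encode are hypothetical; write_mixed only models the behavior described above (1 byte when the char fits, 2 bytes otherwise), and write_utf16be is the "always 2 bytes" variant that the if (1) patch amounts to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the behavior described above: a code unit that
 * fits in one byte is written as 1 byte, anything else as 2 bytes. */
size_t write_mixed(uint16_t c, unsigned char *out)
{
    if ((c >> 8) != 0) {
        out[0] = (unsigned char)(c >> 8);   /* high byte first */
        out[1] = (unsigned char)(c & 0xff);
        return 2;
    }
    out[0] = (unsigned char)c;
    return 1;
}

/* Suggested alternative: always 2 bytes, big-endian (UTF-16BE code unit). */
size_t write_utf16be(uint16_t c, unsigned char *out)
{
    out[0] = (unsigned char)(c >> 8);
    out[1] = (unsigned char)(c & 0xff);
    return 2;
}

/* Encode n code units with the given writer; returns the bytes written.
 * The caller must provide at least 2 * n bytes in out. */
size_t encode(const uint16_t *in, size_t n, unsigned char *out,
              size_t (*writer)(uint16_t, unsigned char *))
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += writer(in[i], out + len);
    return len;
}
```

For the input 'A' (0x0041) followed by U+65E5, the mixed writer produces the 3 bytes 41 65 E5. That has an odd length, so it cannot be valid UTF-16BE, and a UTF-16BE reader would take 41 65 as a single code unit. The fixed-width writer produces 00 41 65 E5, which matches what I pass to iconv with UTF-16BE.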

Btw, VLC decodes the Japanese & Chinese properly, after changing the 'preferred closed captions decoder' setting from 608 to 708.

Thanks!

Eran

claunia added the GSOC-2023 label 2026-01-29 16:51:52 +00:00
Reference: starred/ccextractor#714