mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-03 21:23:48 +00:00
[BUG] A mix of 8-bit/16-bit chars sent to iconv #717
Originally created by @erankor on GitHub (Aug 24, 2022).
Necessary information
./ccextractor test.ts -svc all[UTF-16BE] -nofc -12
Video links
http://cdnapi.kaltura.com/p/2035982/playManifest/entryId/1_frxnu0yr/flavorId/1_tr3kiz6l/format/download/a.ts
Additional information
Hi all,
I have a TS file with 708 subtitles in Japanese & Chinese that fails to decode properly.
After some debugging, I found that if I patch the function write_utf16_char here -
https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708_output.c#L113
to always output 2 byte chars (I changed the if to if (1)), and I specify an encoding of UTF-16BE, it decodes properly.
This code looks off to me, as it creates a mix of 8-bit & 16-bit chars with no clear encoding (it's not UTF-8 and it's not UTF-16...).
Maybe when iconv is used, the function should always output 2 byte chars?
Or, alternatively, if it would use 2-bytes for ALL chars if there is ANY char that doesn't fit in 1-byte, it would also be ok (but this sounds more complex to do...).
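To make the two behaviours concrete, here is a minimal sketch in C. It is not the code from ccx_decoders_708_output.c: put_char_mixed, put_char_utf16be and encode_units are hypothetical names, and put_char_mixed only mimics the behaviour described above (one byte when the value fits in 8 bits, two bytes otherwise), on the assumption that this is what the current function does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the behaviour described in this issue:
 * one byte when the code unit fits in 8 bits, two bytes otherwise.
 * Returns the number of bytes written. */
static size_t put_char_mixed(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    if ((unit >> 8) != 0) {
        out[0] = (unsigned char)(unit >> 8);
        out[1] = (unsigned char)(unit & 0xFF);
        return 2;
    }
    out[0] = (unsigned char)unit;
    return 1;
}

/* Proposed behaviour: always two bytes, high byte first (UTF-16BE). */
static size_t put_char_utf16be(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
    return 2;
}

/* Encode `count` code units with the given writer.
 * The caller must supply at least 2 * count bytes in `out`.
 * Returns the total number of bytes written. */
static size_t encode_units(const uint16_t *units, size_t count,
                           size_t (*put)(uint16_t, unsigned char *),
                           unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < count; i++)
        len += put(units[i], out + len);
    return len;
}
```

Encoding U+77E5, U+0020, U+3063 (知, a space, っ) with put_char_mixed gives 5 bytes, 77 E5 20 30 63, so everything after the space sits at an odd offset. With put_char_utf16be the same text is 6 bytes, 77 E5 00 20 30 63, which is valid UTF-16BE and can be handed to iconv as such.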
Btw, VLC decodes the Japanese & Chinese properly, after changing the 'preferred closed captions decoder' setting from 608 to 708.
Thanks!
Eran
@PunitLodha commented on GitHub (Aug 24, 2022):
Could you share the output of ccextractor --version?
@erankor commented on GitHub (Aug 24, 2022):
@PunitLodha commented on GitHub (Aug 24, 2022):
You are using version 0.89. Could you try using the latest version(0.94)?
@erankor commented on GitHub (Aug 24, 2022):
Reverted my change and pulled latest master, it is decoding stuff (which is better than previous version IIRC...), but still every space in the text messes it up, and I get some non-printable chars in the output.
Output without any code changes -
1
00:00:01,068 --> 00:00:03,770
人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰
Output after forcing write_utf16_char to always use 2-byte chars -
1
00:00:01,068 --> 00:00:03,770
人々が私を知 ったとき、私は 時間管理につい て書いています
I don't speak Japanese myself :) but google translate can confirm the fixed version is better.
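The garbage in the unpatched output is consistent with a one-byte shift after each space. The sketch below assumes each space was written as a single 0x20 byte and every other character as two big-endian bytes, then regroups the bytes the way a UTF-16BE consumer would. read_utf16be_units and mixed_bytes are illustrative names, not part of ccextractor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Regroup a byte stream into big-endian 16-bit units, as a UTF-16BE
 * consumer would. A trailing odd byte is ignored.
 * Returns the number of units produced. */
static size_t read_utf16be_units(const unsigned char *bytes, size_t len,
                                 uint16_t *units)
{
    assert(bytes != NULL && units != NULL);
    size_t n = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        units[n++] = (uint16_t)((bytes[i] << 8) | bytes[i + 1]);
    return n;
}

/* Assumed bytes for " ったとき、私は 時": spaces as one byte,
 * everything else as two big-endian bytes. */
static const unsigned char mixed_bytes[] = {
    0x20,        /* U+0020 space, written as ONE byte */
    0x30, 0x63,  /* U+3063 っ */
    0x30, 0x5F,  /* U+305F た */
    0x30, 0x68,  /* U+3068 と */
    0x30, 0x4D,  /* U+304D き */
    0x30, 0x01,  /* U+3001 、 */
    0x79, 0xC1,  /* U+79C1 私 */
    0x30, 0x6F,  /* U+306F は */
    0x20,        /* U+0020 space, written as ONE byte */
    0x66, 0x42   /* U+6642 時 */
};
```

Reading these 18 bytes as UTF-16BE yields the units 2030 6330 5F30 6830 4D30 0179 C130 6F20 6642, that is ‰挰弰栰䴰Ź섰漠時, which is the sequence that follows 知 in the unpatched output. The second single-byte space shifts the stream back into alignment, which would explain why 時間管理につい comes out readable before the next space breaks it again.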
Current version -
@PunitLodha commented on GitHub (Aug 24, 2022):
You could send a PR. If it doesn't cause any issues with the other tests, then we can merge it.
@ArchitBhonsle commented on GitHub (Feb 26, 2023):
Was this fixed? I could make a simple pull request with the specified changes.
@cfsmp3 commented on GitHub (Feb 26, 2023):
Probably not if it's still open :-)
Feel free to give it a shot.
@prateekmedia commented on GitHub (Sep 26, 2023):
Created a PR: #1571
@cfsmp3 commented on GitHub (Dec 14, 2025):
@canihavesomecoffee Can you add this sample to the SP and add a test for it?