[BUG] A mix of 8-bit/16-bit chars sent to iconv #717

Closed
opened 2026-01-29 16:51:54 +00:00 by claunia · 9 comments
Owner

Originally created by @erankor on GitHub (Aug 24, 2022).

Necessary information

  • Is this a regression (i.e. did it work before)? NO
  • What platform did you use? Linux
  • What were the used arguments? ./ccextractor test.ts -svc all[UTF-16BE] -nofc -12

Video links

http://cdnapi.kaltura.com/p/2035982/playManifest/entryId/1_frxnu0yr/flavorId/1_tr3kiz6l/format/download/a.ts

Additional information

Hi all,

I have a TS file with CEA-708 subtitles in Japanese and Chinese that fails to decode properly.
After some debugging, I found that if I patch the function write_utf16_char here -
https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708_output.c#L113
to always output 2-byte chars (I changed the condition to if (1)) and specify an encoding of UTF-16BE, it decodes properly.

This code looks off to me, as it creates a mix of 8-bit and 16-bit chars with no clear encoding (it's neither UTF-8 nor UTF-16).
Maybe when iconv is used, the function should always output 2-byte chars?
Alternatively, it would also be OK to use 2 bytes for ALL chars whenever ANY char doesn't fit in 1 byte (but that sounds more complex to do).
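
To make the two options concrete, here is a minimal sketch of the mixed-width behaviour described in the report next to the proposed always-2-byte behaviour. Both function names (put_mixed, put_utf16be) are hypothetical and written only for illustration. This is not the code from ccx_decoders_708_output.c, just an assumption about its general shape based on the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the CCExtractor function: always emits two
 * bytes (big-endian) per UTF-16 code unit, so the buffer stays valid
 * UTF-16BE whatever the character's value. Returns bytes written. */
size_t put_utf16be(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);   /* high byte first */
    out[1] = (unsigned char)(unit & 0xFF); /* low byte second */
    return 2;
}

/* Hypothetical illustration of the mixed-width behaviour reported in
 * this issue: one byte when the high byte is zero, two bytes otherwise.
 * The resulting stream is neither UTF-8 nor UTF-16. */
size_t put_mixed(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    if ((unit >> 8) == 0) {
        out[0] = (unsigned char)unit;
        return 1;
    }
    return put_utf16be(unit, out);
}
```

With put_utf16be, a space (U+0020) becomes the two bytes 0x00 0x20, so a consumer such as iconv that is told the input is UTF-16BE can always pair bytes correctly. With put_mixed, the same space becomes a single 0x20 and every following pair is shifted by one byte.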

Btw, VLC decodes the Japanese & Chinese properly, after changing the 'preferred closed captions decoder' setting from 608 to 708.

Thanks!

Eran

claunia added the GSOC-2023 label 2026-01-29 16:51:54 +00:00

@PunitLodha commented on GitHub (Aug 24, 2022):

Could you share the output of ccextractor --version?


@erankor commented on GitHub (Aug 24, 2022):

./ccextractor --version
CCExtractor 0.89, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.89
        Git commit: b793f16343dc442bcb977387fcef08195e71dd7c
        Compilation date: 2022-08-23
        File SHA256: 259ccd18d508a3aed03149080853f98d1bce57672ce20c9b715953227621c9d9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

@PunitLodha commented on GitHub (Aug 24, 2022):

You are using version 0.89. Could you try using the latest version (0.94)?


@erankor commented on GitHub (Aug 24, 2022):

I reverted my change and pulled the latest master. It now decodes something (which is better than the previous version, IIRC), but every space in the text still breaks the decoding, and I get some non-printable chars in the output.

Output without any code changes -
1
00:00:01,068 --> 00:00:03,770
人々が私を知‰挰弰栰䴰Ź섰漠時間管理につい‰晦<U+F830>䐰昰䐰縰

Output after forcing write_utf16_char to always use 2 bytes -
1
00:00:01,068 --> 00:00:03,770
人々が私を知 ったとき、私は 時間管理につい て書いています

I don't speak Japanese myself :) but Google Translate confirms that the fixed version is better.
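
The garbled output is consistent with a byte-alignment problem, which a small sketch can illustrate. It assumes (this is not confirmed from the CCExtractor source) that the space is emitted as the single byte 0x20 while each kana is emitted as two bytes. decode_utf16be and the two byte arrays are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a byte buffer as UTF-16BE, pairing bytes two at a time.
 * Stores up to max_units code units and returns how many were stored;
 * a trailing odd byte is ignored. */
size_t decode_utf16be(const unsigned char *in, size_t len,
                      uint16_t *units, size_t max_units)
{
    assert(in != NULL && units != NULL);
    size_t n = 0;
    for (size_t i = 0; i + 1 < len && n < max_units; i += 2)
        units[n++] = (uint16_t)((in[i] << 8) | in[i + 1]);
    return n;
}

/* Assumed bytes for a space followed by U+3063 and U+305F when the
 * space is written as one byte and each kana as two bytes. */
static const unsigned char mixed_bytes[] = {0x20, 0x30, 0x63, 0x30, 0x5F};

/* The same three characters with every code unit written as two bytes. */
static const unsigned char fixed_bytes[] = {0x00, 0x20, 0x30, 0x63, 0x30, 0x5F};
```

Pairing the bytes of mixed_bytes yields 0x2030 (the per-mille sign '‰' that appears right where each space should be in the broken output) followed by 0x6330, while fixed_bytes decodes to U+0020, U+3063, U+305F as intended. The stream only realigns when another single-byte character shows up, which would explain why the text recovers after the second space.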

Current version -

./ccextractor --version
CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.94
        Git commit: 4cb474c5a36b61bafec4a2379c4d0b240e44359b
        Compilation date: 2022-08-24
        CEA-708 decoder: C
        File SHA256: 8fd4f5625eb6aadb30532a2ff9f29adaec4b60a77916e3f001d5f4e59d4d08e9
Libraries used by CCExtractor
        Tesseract Version: 3.03
        Leptonica Version: leptonica-1.70
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

@PunitLodha commented on GitHub (Aug 24, 2022):

You could send a PR. If it doesn't cause any issues with the other tests, then we can merge it.


@ArchitBhonsle commented on GitHub (Feb 26, 2023):

Was this fixed? I could make a simple pull request with the specified changes.


@cfsmp3 commented on GitHub (Feb 26, 2023):

> Was this fixed? I could make a simple pull request with the specified changes.

Probably not if it's still open :-)
Feel free to give it a shot.


@prateekmedia commented on GitHub (Sep 26, 2023):

Created a PR: #1571


@cfsmp3 commented on GitHub (Dec 14, 2025):

@canihavesomecoffee Can you add this sample to the SP and add a test for it?

Reference: starred/ccextractor#717