[BUG] Code page problems #392

Closed
opened 2026-01-29 16:42:51 +00:00 by claunia · 5 comments
Owner

Originally created by @mkver on GitHub (Feb 25, 2018).

CCExtractor version (using the --version parameter preferably) : 0.87 (it is actually cfsmp3's build from #926)

  • [X] I have read and understood the contributors guide.
  • [X] I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • [X] I have checked that the issue I'm posting isn't already reported.
  • [X] I have checked that the issue I'm posting isn't already solved and no duplicates exist in closed issues and in opened issues.
  • [X] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
  • [X] I have used the latest available version of CCExtractor to verify this issue exists.

My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):

  • [X] I absolutely love CCExtractor, but have contributed only once previously.

Necessary information

  • Is this a regression (did it work before)? [X] NO | [ ] YES
  • What platform did you use? [X] Windows - [ ] Linux - [ ] Mac
  • What were the used arguments? Multiple combinations. See below.

Additional information

On Windows, there are currently several issues with code pages and the handling of special characters. All of the following examples were tested on a German Windows system with CP-1252 as the default code page for non-Unicode, non-CLI applications; the default CLI code page is CP 850.

a) ccextractorwin.exe -autoprogram ä.ts (make sure that a file "ä.ts" exists in your working directory or somewhere ccextractor can find it; the actual content doesn't matter): The file is handled correctly, but ccextractor prints Input: õ.ts and Opening file: õ.ts on the command line. ä is 0xE4 in CP-1252, and 0xE4 is õ in CP 850, which cmd.exe uses and expects. So ccextractor seems to omit a conversion to the console's active code page. (If I am not mistaken, these code pages are legacy and have been for quite some time; the console has full Unicode support, so using Unicode is probably a cleaner solution than using code pages.)
b) The command line is the same as in a), but this time the active console code page is CP 852 (run chcp 852 first). The file is opened correctly, but now ccextractor prints ń.ts because 0xE4 is ń in CP 852. This confirms my conclusion from a).
c) ccextractorwin.exe -autoprogram ě.ts: This time the file is not handled at all; ccextractor outputs Input: e.ts and Error: Failed to open input file: File does not exist. (as well as its configuration data ([Program : Auto ] etc.)). My guess as to what happens: because ě is in neither CP-1252 nor CP 850, somewhere in the processing there is a lossy, best-effort conversion to CP-1252 or CP 850, and the best match for ě is e.
d) echoargs.exe (http://ss64.com/ps/EchoArgs.exe) shows that when the CLI code page is CP 850, the console itself converts ě to e. This does not conclusively show whether the console or ccextractor converts the characters in case c).
e) Same as c), but this time we set chcp 852 first. The output is the same as c).
f) With chcp 852, however, echoargs shows that the console leaves the ě untouched. So in any case the input seems to be converted to CP-1252 (or, more generally, to the code page that non-Unicode, non-CLI applications use).
g) To find out whether there is an intermediate conversion to the console's code page, I set the code page to 437 and tested a file called "Ø.ts". CP 437 lacks "Ø", and echoargs shows that the console converts it to O when it converts to CP 437. ccextractor can open the file; of course problem a) occurs here, too: Ø is 0xD8 in CP-1252, and 0xD8 in CP 437 is ╪. This shows that there is no intermediate conversion to the CLI code page, so it is probably not the console that converts the input file name.
h) Same as c), but this time there is also a file e.ts next to ě.ts. Result: e.ts is opened.
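
The byte-level explanation in a), b) and g) can be reproduced without a Windows console. The following is a minimal Python sketch, not ccextractor code: it reinterprets the CP-1252 byte of each character under the console code pages mentioned above. The helper approx_best_fit is hypothetical; it only approximates the kind of lossy best-effort conversion guessed at in c) by stripping diacritics, and it is not the conversion Windows actually performs.

```python
import unicodedata

# Case a): "ä" as a single CP-1252 byte.
raw = "ä".encode("cp1252")

# The same byte as a console using CP 850 (case a) or CP 852 (case b) would show it.
as_cp850 = raw.decode("cp850")
as_cp852 = raw.decode("cp852")

# Case g): the CP-1252 byte for "Ø", shown by a console using CP 437.
as_cp437 = "Ø".encode("cp1252").decode("cp437")


def approx_best_fit(text: str, codepage: str = "cp1252") -> str:
    """Hypothetical stand-in for a lossy best-effort conversion.

    Characters that exist in the code page are kept; others are replaced by
    their base letter (diacritics stripped) if that exists, else by "?".
    """
    result = []
    for ch in text:
        try:
            ch.encode(codepage)
            result.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        base = unicodedata.normalize("NFKD", ch)[0]
        try:
            base.encode(codepage)
            result.append(base)
        except UnicodeEncodeError:
            result.append("?")
    return "".join(result)


print(raw, as_cp850, as_cp852, as_cp437)
print(approx_best_fit("ě.ts"), approx_best_fit("ä.ts"))
```

Under these assumptions, a name such as ě.ts collapses to e.ts before the file is opened, which would also explain why case h) opens the wrong file.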

I searched a bit, and it seems the usual way to solve this is to use proper Unicode by including windows.h and defining UNICODE; however, the fact that ccextractor is cross-platform might complicate things.

And finally, there is a bug in the GUI's preview window: some non-ASCII characters aren't displayed properly, while others are fine. For example, the sample (https://www.dropbox.com/s/0o2ncppc0hq8ljt/DVB-Teletext%20incomplete.ts?dl=0) I uploaded for #922 shows this in the preview box:

00:00  00:04  Sie können ihn nicht von der Schule
               schmei�Yen! Schulpflicht!
 00:05  00:09  Für diese Klassenstufe ist er nicht
               geeignet.    Stufen Sie ihn zurück!

I ran the CLI version with the gui_mode_reports parameter and redirected stderr. The subtitle-related part of the output is proper UTF-8, and all characters longer than one byte are correct, including the "ß" that is displayed as �Y above. I have no clue why the umlauts are fine but ß isn't in the final display.
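
As a possible starting point for debugging (an observation, not a diagnosis), the UTF-8 byte sequences of the affected and unaffected characters can be compared. The Python sketch below shows that the second byte of ß is 0x9F, which falls in the 0x80-0x9F range, while the second bytes of ö and ü do not. Whether the GUI actually treats that range differently is an assumption that would need to be checked against the GUI code.

```python
# UTF-8 encodings of the characters from the preview sample above.
utf8_bytes = {ch: ch.encode("utf-8") for ch in "öüß"}

# True when the second (continuation) byte lies in the 0x80-0x9F range.
second_byte_in_c1_range = {
    ch: 0x80 <= encoded[1] <= 0x9F for ch, encoded in utf8_bytes.items()
}

# For reference: how CP-1252 would interpret the second byte of "ß" on its own.
second_byte_of_eszett_cp1252 = utf8_bytes["ß"][1:].decode("cp1252")

for ch, encoded in utf8_bytes.items():
    print(ch, encoded.hex(), second_byte_in_c1_range[ch])
print(second_byte_of_eszett_cp1252)
```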


@cfsmp3 commented on GitHub (Feb 26, 2018):

GSOC qualification: 5 points


@pujanm commented on GitHub (Mar 15, 2018):

How do I work on this issue? What exactly is the issue?


@cfsmp3 commented on GitHub (Mar 15, 2018):

It's quite well explained in the original post - please read it carefully,
and ask specific questions.


@cfsmp3 commented on GitHub (Jan 25, 2020):

@mkver Is this still an issue in the current master?


@cfsmp3 commented on GitHub (Nov 21, 2021):

Closing due to original poster not answering.


Reference: starred/ccextractor#392