mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-04 05:44:53 +00:00
[BUG] Code page problems #392
Originally created by @mkver on GitHub (Feb 25, 2018).
CCExtractor version (using the --version parameter preferably) : 0.87 (it is actually cfsmp3's build from #926)
Additional information
On Windows, there are currently several issues with code pages and the handling of special characters. All of the following examples were tested on a German Windows system whose default code page for non-Unicode, non-CLI applications is CP-1252; the default CLI code page is CP 850.
a) ccextractorwin.exe -autoprogram ä.ts (make sure that there is a file "ä.ts" in your working directory or somewhere where ccextractor can find it; the actual content doesn't matter): The file is correctly handled, but ccextractor emits "Input: õ.ts" and "Opening file: õ.ts" on the command line. ä is 0xE4 in CP-1252 and 0xE4 is õ in CP 850, which cmd.exe uses and expects. So ccextractor seems to omit a conversion to the currently used codepage of the console. (If I am not mistaken then these codepages are actually legacy (and have been so for quite some time); the console has full unicode support and using unicode is therefore probably the cleaner solution than actually using codepages.)
b) The command line used is the same as a), but this time the active console CP is 852 (use "chcp 852" before). The file is correctly opened, but now it emits "ń.ts" because 0xE4 is ń in CP 852. This confirms my conclusion from a).
c) ccextractorwin.exe -autoprogram ě.ts: This time the file is not handled at all; it outputs "Input: e.ts" and "Error: Failed to open input file: File does not exist." (as well as its configuration data ("[Program : Auto ]" etc.)). My guess as to what happens: Because ě is not in CP-1252 or CP 850, somewhere in the processing there is a lossy conversion to CP-1252 or CP 850 on a best-effort basis and the best match for ě is e.
d) echoargs.exe shows that when the cli CP is 850, the console itself converts ě to e. This does not conclusively show whether the console or ccextractor converts the characters in case c).
e) Same as c), but this time we set chcp 852 first. The output is the same as c).
f) But with chcp 852 echoargs shows that the console leaves the ě untouched. So there seems to be a conversion of the input to CP 1252 (or more generally, to the CP that the non-unicode non-cli applications use) in any case.
g) In order to find out whether there is an intermediate conversion to the CP of the console, I set the console codepage to 437 and tested a file called "Ø.ts"; CP 437 lacks "Ø", and echoargs shows that the console converts it to O when it converts to CP 437. ccextractor can open the file; of course the problem from a) happens here, too: Ø is 0xD8 in CP-1252 and 0xD8 in CP 437 is ╪. This shows that there is no conversion to the cli CP in between, so it is probably not the console at all that does the conversion of the input file name.
h) Same as c), but this time there is also a file e.ts besides the ě.ts. Result: e.ts is opened.
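The byte-level claims in a), b) and g) can be checked independently of ccextractor. The following is a minimal sketch in Python, not ccextractor code: the helper name shown_on_console is hypothetical, and it assumes the program holds the file name as CP-1252 bytes and writes those bytes unchanged to a console that decodes them with its own code page.

```python
# Hypothetical helper: models a program that holds a file name in the ANSI
# code page and prints the raw bytes to a console using another code page.
def shown_on_console(name, ansi_cp, console_cp):
    raw = name.encode(ansi_cp)      # bytes as the program holds them
    return raw.decode(console_cp)   # what the console renders for those bytes

print(shown_on_console("ä.ts", "cp1252", "cp850"))  # case a): õ.ts
print(shown_on_console("ä.ts", "cp1252", "cp852"))  # case b): ń.ts
print(shown_on_console("Ø.ts", "cp1252", "cp437"))  # case g): ╪.ts
```

The results match the output reported above, which is consistent with the name being passed through as CP-1252 bytes without any conversion to the console code page.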
I searched a bit and it seems that the usual way of solving this is to use proper Unicode by including windows.h and defining UNICODE; but the fact that ccextractor is cross-platform might complicate things.
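To illustrate why a Unicode path avoids case c), here is a small Python sketch. It is only an analogy for the Windows wide-character route, not the actual fix: UTF-16LE stands in for the strings the wide APIs deliver, and the variable names are made up for this example. Note that Python's strict encoder raises an error where, according to the guess in c), Windows substitutes a best-fit character instead.

```python
name = "ě.ts"

# CP-1252 has no code point for ě, so a strict conversion fails.
# (Windows appears to fall back to "e" instead of failing; see case c.)
try:
    name.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False

# UTF-16LE stands in for what the wide-character APIs would hand over.
# Converting it to UTF-8 and back loses nothing.
wide = name.encode("utf-16-le")
utf8 = wide.decode("utf-16-le").encode("utf-8")
round_trip = utf8.decode("utf-8")

print(representable)       # False
print(utf8)                # b'\xc4\x9b.ts'
print(round_trip == name)  # True
```

Keeping names in UTF-16 or UTF-8 end to end means the name never has to fit into a single code page, so "ě.ts" and "e.ts" stay distinct.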
And finally, there is a bug in the GUI's preview window. Some non-ASCII characters aren't properly displayed; others are fine, though. For example the sample I uploaded for #922 shows this in the preview box:
I ran the cli version with the gui_mode_reports parameter and redirected stderr. The subtitle-related part is proper UTF-8, and all characters that are more than one byte long are correct, too, including the "ß" which is displayed as �Y above. I have no clue why the umlauts are fine but ß isn't in the final display.
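One data point that may help whoever debugs the preview: the UTF-8 bytes of ß differ from those of the lowercase umlauts in their second byte. The Python sketch below only lists those bytes and how CP-1252 would read the second one; the helper name is hypothetical, and the link to the �Y display is a conjecture, not a diagnosis of the GUI code.

```python
# Hypothetical helper: returns the second UTF-8 byte of a two-byte character
# and the character CP-1252 assigns to that byte.
def trailing_byte_as_cp1252(ch):
    utf8 = ch.encode("utf-8")
    assert len(utf8) == 2, "expects a character encoded as two UTF-8 bytes"
    second = utf8[1]
    return second, bytes([second]).decode("cp1252")

for ch in "äöüß":
    second, as_1252 = trailing_byte_as_cp1252(ch)
    print(ch, ch.encode("utf-8").hex(" "), hex(second), as_1252)
```

For ä, ö and ü the second byte is 0xA4, 0xB6 and 0xBC, while for ß it is 0x9F, which CP-1252 maps to Ÿ. If some stage of the GUI treats bytes below 0xA0 differently, that could explain a stray "Y", but this remains to be confirmed against the GUI code.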
@cfsmp3 commented on GitHub (Feb 26, 2018):
GSOC qualification: 5 points
@pujanm commented on GitHub (Mar 15, 2018):
How do I start working on this issue? What exactly is the problem?
@cfsmp3 commented on GitHub (Mar 15, 2018):
It's quite well explained in the original post - please read it carefully, and ask specific questions.
@cfsmp3 commented on GitHub (Jan 25, 2020):
@mkver Is this still an issue in the current master?
@cfsmp3 commented on GitHub (Nov 21, 2021):
Closing due to original poster not answering.