Changing code pages? #10936

Closed
opened 2026-01-31 02:34:07 +00:00 by claunia · 20 comments

Originally created by @vefatica on GitHub (Oct 7, 2020).

Environment

Microsoft Windows 10 Pro for Workstations
10.0.18363.1082 (1909)
WindowsTerminalPreview_1.4.2652.0_x64

Windows build number: [run `[Environment]::OSVersion` for powershell, or `ver` for cmd]
Windows Terminal version (if applicable):

Any other software?

Steps to reproduce

Expected behavior

As in a console.

Actual behavior

I don't know what to ask except for "What's happening here?". I have a 128-byte file containing the bytes 128~255. In both cases below, the font is Consolas.

Using CMD.EXE in a console I see this (which looks pretty good).

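[image: https://user-images.githubusercontent.com/61856645/95393739-1513cd00-08c9-11eb-95f0-a6a74d8312f5.png]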

Using CMD.EXE in Windows Terminal I see this (which doesn't look as good).

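[image: https://user-images.githubusercontent.com/61856645/95393823-442a3e80-08c9-11eb-95c3-9acd26eff29c.png]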


@vefatica commented on GitHub (Oct 7, 2020):

Here's the file (renamed): 128-255.txt (https://github.com/microsoft/terminal/files/5344304/128-255.txt).


@DHowett commented on GitHub (Oct 7, 2020):

Hey, a fun intersection between codepages and @skyline75489's work on C1 control codes!

Bad news: this is by design.
Double bad news: I'm not certain how to comport these things.

Notes

This file contains the bytes 128-255. When translated in codepage 1252, these values become codepoints:

0x80: 20AC
0x81: ----
0x82: 201A
...
0x8E: 017D 
0x8F: ----
0x90: ----

The translations for 81, 8F, 90, and a few others are unspecified as part of the codepage. This means that a receiving application is free to do pretty much anything.

Wikipedia notes that "MultiByteToWideChar" maps them to the corresponding C1 control codes.

> According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too.

As of #7340, the Windows Console will properly treat 80..9F as control characters. The codepage says we're supposed to treat them like control characters.

Applications that want to print literal invalid characters (characters unspecified in the output codepage are not valid characters!) to the screen should not be using VT processing mode. CMD is not such an application: type is intended to print valid ANSI data in the system's codepage to the screen. If given a file whose contents are unrepresentable in the ANSI codepage, it will behave erratically.
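
A minimal sketch of the unmapped positions, assuming Python's built-in `cp1252` codec. That codec follows the published mapping and raises an error on unassigned bytes, which is different from the `MultiByteToWideChar` behavior quoted above. The helper name `undefined_cp1252_bytes` is made up for illustration.

```python
def undefined_cp1252_bytes():
    """Return the byte values in 0x80..0xFF that Python's cp1252 codec cannot decode."""
    missing = []
    for value in range(0x80, 0x100):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            # Python treats these positions as unassigned instead of
            # mapping them to C1 control codes.
            missing.append(value)
    return missing


print([f"0x{value:02X}" for value in undefined_cp1252_bytes()])
# ['0x81', '0x8D', '0x8F', '0x90', '0x9D']
```

Decoding the same bytes as `latin-1` yields U+0081, U+008D, and so on, which is the C1 result the quote attributes to `MultiByteToWideChar`.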


@eryksun commented on GitHub (Oct 8, 2020):

> type is intended to print valid ANSI data in the system's codepage to the screen.

CMD's internal type command decodes a file using the console's output codepage (i.e. GetConsoleOutputCP()), not the process ANSI codepage. It also detects UTF-16 if the file starts with a BOM. The decoded text is written to the console via wide-character WriteConsoleW.

Here are some examples with the C1 control characters, using Python under Windows Terminal.

The most powerful of the C1 control characters is CSI (0x9b), which also works in a regular console with virtual-terminal (VT) mode enabled. Fortunately codepage 1252 maps 0x9b, so it's not an issue here.

>>> # CSI (0x9b): Control Sequence Introducer
>>> print('spam\x9b4Deggs') # CSI 4D
eggs

The following examples are currently only implemented by Windows Terminal. When writing to a system conhost.exe console (at least as of release 2004), even with VT mode enabled, the C1 codes simply appear as default glyphs (e.g. an empty rectangle). Of interest here regarding codepage 1252 are 0x8D (RI), 0x8F (SS3), 0x90 (DCS), and 0x9D (OSC). CP1252 also doesn't map 0x81, but the HOP (High Octet Preset) control code is ignored.

>>> # IND (0x84): Index (Line Feed)
>>> print('spam\x84eggs')
spam
    eggs

>>> # NEL (0x85): Next Line
>>> print('spam\x85eggs') 
spam
eggs

>>> # RI (0x8d): Reverse Index (Line Feed)
>>> print('\nspam\x8deggs\n')
    eggs
spam

>>> # SS2 (0x8e): Single-Shift G2
>>> # SS3 (0x8f): Single-Shift G3
>>> print('\x8e0\x8e1\x8e2\x8e3')
°±²³
>>> print('\x8f0\x8f1\x8f2\x8f3')
°±²³

>>> # ST (0x9C): String Terminator
>>> # DCS (0x90): Device Control String
>>> # SOS (0x98): Start of String
>>> # OSC (0x9d): Operating System Command
>>> # PM (0x9e): Privacy Message
>>> # APC (0x9f): Application Program Command
>>> print('\x90spam\x9ceggs')
eggs
>>> print('\x98spam\x9ceggs')
eggs
>>> print('\x9dspam\x9ceggs')
eggs
>>> print('\x9espam\x9ceggs')
eggs
>>> print('\x9fspam\x9ceggs')
eggs
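
Each 8-bit C1 code used above has a 7-bit equivalent: ESC followed by the code minus 0x40, so CSI (0x9b) is `ESC [` and OSC (0x9d) is `ESC ]`. A small sketch of that conversion follows. `c1_to_7bit` is a hypothetical helper, and whether a given console honors either form depends on its version and mode.

```python
ESC = "\x1b"


def c1_to_7bit(ch):
    """Convert a single C1 control character (U+0080..U+009F) to its ESC form."""
    code = ord(ch)
    if not 0x80 <= code <= 0x9F:
        raise ValueError(f"not a C1 control character: {ch!r}")
    # The 7-bit form is ESC followed by a byte in the range 0x40..0x5F.
    return ESC + chr(code - 0x40)


for name, ch in [("IND", "\x84"), ("NEL", "\x85"), ("RI", "\x8d"),
                 ("CSI", "\x9b"), ("ST", "\x9c"), ("OSC", "\x9d")]:
    print(name, repr(c1_to_7bit(ch)))
```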

@DHowett commented on GitHub (Oct 8, 2020):

Good catch on the specifics of the implementation of TYPE. Horrifyingly, it's written to call WriteConsole and hope that UNICODE is set. Those sure were the days.

The console host that comes out with the version of Windows after 2004 will contain the changes from #7317.


@eryksun commented on GitHub (Oct 8, 2020):

Unfortunately most Windows filesystems only reserve the C0 block in filenames, not the C1 block. I don't want displaying a filename to evaluate CSI sequences or IND, RI, and NEL line feeds. POSIX systems permissively allow control characters in filenames, but POSIX CLI programs such as ls address this by escaping the C0 and C1 control characters when displaying files. This is not the case for Windows PowerShell and CMD:

Python

>>> import os; os.listdir('.')
['spam\x9b4Deggs.txt']

WSL

/mnt/c/Temp/test$ ls
'spam'$'\302\233''4Deggs.txt'

PowerShell

PS C:\temp\test> gci -n
eggs.txt

CMD

C:\Temp\test>dir /b
eggs.txt
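
A rough sketch of the kind of escaping described above, written in Python. This is not what `ls` does internally; `escape_controls` is a hypothetical helper that replaces C0, DEL, and C1 characters with a visible `\xNN` form before a name is printed.

```python
def escape_controls(name):
    """Replace C0 (0x00..0x1F), DEL (0x7F), and C1 (0x80..0x9F) characters with \\xNN."""
    out = []
    for ch in name:
        code = ord(ch)
        if code < 0x20 or 0x7F <= code <= 0x9F:
            out.append(f"\\x{code:02x}")
        else:
            out.append(ch)
    return "".join(out)


print(escape_controls("spam\x9b4Deggs.txt"))
# spam\x9b4Deggs.txt
```

Printing the escaped name keeps the terminal from interpreting the embedded CSI, so the full filename stays visible.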

@vefatica commented on GitHub (Oct 8, 2020):

Thanks gentlemen. I didn't know most of that. Here's a question (I'll probably have more).

I imagine it was the OSC (0x9D) that was cutting off the tail in my example (CP 1252 in Windows Terminal). Wiki says OSC should be followed by a string of printables (0x20~0x7E). That was not the case in my example. What's up with that? And was it waiting for ST (0x9C)? CP 1252 uses 0x9C.


@j4james commented on GitHub (Oct 8, 2020):

Any escape or C1 control should terminate the sequence, as would the SUB and CAN control characters. For example, I'm assuming your prompt has an escape sequence that sets the color to green - that was most likely the terminator in your case. Without that you probably would have found the terminal stuck in a weird state where you couldn't see any output because it would think it was still processing an OSC or APC sequence (whichever came last).


@vefatica commented on GitHub (Oct 8, 2020):

j4james, as I read your comments I was trying to get it stuck as you said (not realizing that it was my prompt preventing it).

What about wiki's comment that the Operating System Command be composed of 0x20~0x7E? Is it accurate? Characters > 0x7F don't, in general, terminate the OSC string. Neither do characters < 0x20 except for ESC. ST (0x9C) does even though it's used by the CP. And seeing that OSC is honored, what happens to the string itself ... ignored?


@vefatica commented on GitHub (Oct 8, 2020):

I spoke a bit prematurely. In fact, 0x7 (BEL), 0x18 (CAN), and 0x1A (SUB) also terminate the OSC string.


@vefatica commented on GitHub (Oct 8, 2020):

And I think I was wrong about ST (0x9C) terminating OSC.


@j4james commented on GitHub (Oct 8, 2020):

BEL is a bit of a weird case. Technically it's not a standard string terminator, but at some point in the past it was used as an OSC terminator by a popular terminal emulator, and that ended up becoming a de facto standard.

As for characters > 0x7F, I believe anything in the range 0xA0 to 0xFE is technically supposed to be interpreted as 0x20 to 0x7E when included in a control sequence. I don't think we follow those rules exactly, but since we're typically dealing with Unicode the original specs don't really apply in that sense. Either way, though, I wouldn't expect a character > 0xA0 to terminate a string sequence. C1 controls should, although there may be exceptions - I'm not positive about that.

And yes ST should terminate an OSC sequence.
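
The termination rules discussed in this thread can be sketched as a tiny scanner. This is a simplified model assembled from the comments above (ST, BEL, ESC, CAN, SUB, and other C1 controls end the string), not Windows Terminal's actual parser. `split_osc_payload` is a hypothetical name.

```python
BEL, CAN, SUB, ESC = "\x07", "\x18", "\x1a", "\x1b"


def split_osc_payload(text):
    """Split text that follows an OSC introducer into (payload, terminator, rest).

    terminator is None when no terminating character is found, which models
    a terminal that is still waiting for the string to end.
    """
    for index, ch in enumerate(text):
        # C1 controls (including ST, 0x9C) occupy U+0080..U+009F.
        if ch in (BEL, CAN, SUB, ESC) or "\x80" <= ch <= "\x9f":
            # For ESC, the rest still holds the character after it
            # (for example the backslash of a 7-bit ST).
            return text[:index], ch, text[index + 1:]
    return text, None, ""


print(split_osc_payload("2;new_title\x07prompt>"))
# ('2;new_title', '\x07', 'prompt>')
```

In this model, a prompt that starts with an escape sequence ends a dangling OSC string, which matches the explanation given earlier for why the terminal did not appear stuck.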


@DHowett commented on GitHub (Oct 9, 2020):

(Closing as by design + question, but do feel free to continue the discussion!)


@vefatica commented on GitHub (Oct 9, 2020):

Do any OSC sequences work? The only one I tried was OSC2;titleBEL. That didn't work. Neither did the equivalent (?) ESC]2;titleBEL (which does change the title in a conhost console).


@zadjii-msft commented on GitHub (Oct 9, 2020):

Uh yea, a whole bunch of them should

https://github.com/microsoft/terminal/blob/d33ca7e8eb431f42913d0f617e6354cc93ab7a4e/src/terminal/parser/OutputStateMachineEngine.hpp#L156-L171

What string exactly are you trying to emit? And do you have "suppressApplicationTitle": true set?


@vefatica commented on GitHub (Oct 9, 2020):

I have: "suppressApplicationTitle": true

I've tried both of these:

L"\x009d2;new_title\x0007"
L"\x001b]2;new_title\x0007"

I'm using WriteConsoleW and a HANDLE to L"CONOUT$". The second one above works in conhost.

I can also send them from a TCC command line.

echos %@char[0x9d]2;new_title%@char[7]
echos ^e]2;new_title%@char[7]

The results are the same; neither works in WT and the second works in a conhost console.


@DHowett commented on GitHub (Oct 9, 2020):

"suppressApplicationTitle": true

This disables OSC2.


@vefatica commented on GitHub (Oct 9, 2020):

OK. I was thinking that affected only SetConsoleTitle(). So without suppressApplicationTitle = true,

L"\x001b]2;new_title\x0007" works

and

L"\x009d2;new_title\x0007" doesn't work.
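
For comparison, here is a sketch of the two title sequences built in Python, where a `\x` escape is always exactly two hex digits. `osc_title` is a hypothetical helper. It only constructs the strings, and whether a given terminal acts on the 8-bit form is a separate question.

```python
def osc_title(title, eight_bit=False):
    """Build an OSC 2 (set window title) sequence terminated by BEL."""
    # 7-bit introducer: ESC ]   8-bit introducer: the C1 code 0x9D
    introducer = "\x9d" if eight_bit else "\x1b]"
    return f"{introducer}2;{title}\x07"


seven = osc_title("new_title")
eight = osc_title("new_title", eight_bit=True)
print([hex(ord(ch)) for ch in seven[:3]])  # ['0x1b', '0x5d', '0x32']
print([hex(ord(ch)) for ch in eight[:2]])  # ['0x9d', '0x32']
```

Dumping the leading code points like this is a quick way to confirm that the string sent to the console starts with the intended introducer.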


@zadjii-msft commented on GitHub (Oct 9, 2020):

I don't believe the C1 codes work quite yet, see #7340


@vefatica commented on GitHub (Oct 9, 2020):

Actually, it's either a compiler bug (VS Community 2019) or my misunderstanding. This page (https://docs.microsoft.com/en-us/cpp/c-language/escape-sequences?view=vs-2019) seems to make it clear that L"\xhhhh" denotes a wide char. Any more than 4 hex digits doesn't make sense! Yet these two strings are different:

WCHAR sz1[32];
wsprintf(sz1, L"%c2;new_title1\x0007", 0x9d);
WCHAR sz2[32] = L"\x009d2;new_title1\x0007";

The first one above DOES work in Windows Terminal. The first character of the second one is 0x9d2, not 0x9d.


@vefatica commented on GitHub (Oct 10, 2020):

Yup, my misunderstanding (or more like ignorance). This works

L"\x009d" L"2;new_title1\x0007"

as does this

L"\u009d2;new_title1\x0007"

but I'm not sure why the second one works.
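
The difference comes from how C and C++ read the two escapes: a `\x` escape has no fixed length and consumes every hex digit that follows it, while `\u` takes exactly four. The sketch below imitates that rule in Python to show how each literal from this thread splits. `decode_c_wide_escapes` is a hypothetical helper that handles only these two escape forms.

```python
import re


def decode_c_wide_escapes(literal):
    """Decode \\x and \\u escapes the way a C wide string literal reads them (simplified)."""
    def replace(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))

    # \x: one or more hex digits, as many as are present (greedy).
    # \u: exactly four hex digits.
    return re.sub(r"\\x([0-9A-Fa-f]+)|\\u([0-9A-Fa-f]{4})", replace, literal)


greedy = decode_c_wide_escapes(r"\x009d2;new_title1\x0007")
fixed = decode_c_wide_escapes(r"\u009d2;new_title1\x0007")
print(hex(ord(greedy[0])))  # 0x9d2
print(hex(ord(fixed[0])))   # 0x9d
```

Splitting the literal, as in `L"\x009d" L"2;new_title1\x0007"`, works for the same reason: the hex escape ends where the first literal ends, before the two pieces are concatenated.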

Reference: starred/terminal#10936