C1 control characters detection breaks output (regression) #14087

Closed
opened 2026-01-31 04:00:31 +00:00 by claunia · 25 comments
Owner

Originally created by @alabuzhev on GitHub (Jun 2, 2021).

Windows Terminal version (or Windows build number)

1.9.1445.0

Other Software

No response

Steps to reproduce

Compile and run the following code:

#include <stdio.h>
#include <windows.h>

int main()
{
	const char data[] = "\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F";
	wchar_t buffer[sizeof(data)];
	// The last argument is the buffer size in wchar_t elements, not bytes.
	if (!MultiByteToWideChar(1252, MB_USEGLYPHCHARS, data, -1, buffer, sizeof(buffer) / sizeof(wchar_t)))
	{
		printf("%lu\n", GetLastError());
	}

	DWORD n;
	WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), buffer, sizeof(buffer) / sizeof(wchar_t), &n, 0);

	printf("\n\n");

	for (int i = 0; i != sizeof(data) - 1; ++i)
	{
		if (buffer[i] == (unsigned char)data[i])
		{
			printf("%04X not converted\n", (unsigned char)data[i]);
		}
	}
}

Expected Behavior

€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ or something similar, depending on your output codepage

Actual Behavior

€‚ƒ„…†‡ˆ‰Š‹Œ - most of the characters are missing.

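After f91b53d5fd the whole range 0x80 - 0x9F is considered control characters (https://github.com/microsoft/terminal/blob/f91b53d5fdc5b20387e297357668f5f14a795d4c/src/terminal/parser/stateMachine.cpp#L92).

The comment above (https://github.com/microsoft/terminal/blob/f91b53d5fdc5b20387e297357668f5f14a795d4c/src/terminal/parser/stateMachine.cpp#L78-L84) boldly claims that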

"we do not need to worry about confusion whether a single byte, for example, \x9b in a single-byte stream represents a C1 CSI or some other glyph, because by the time we get here, everything is Unicode. Knowing whether a single-byte \x9b represents a single-character C1 CSI or some other glyph is handled by MultiByteToWideChar before we get here (if the stream was not already UTF-16). For instance, in CP_ACP, if a \x9b shows up, it will get converted to \x203a. So, if we get here, and have a \x009b, we know that it unambiguously represents a C1 CSI"

, but that is simply not true: as the example code above demonstrates, \x81, \x8D, \x8F, \x90, and \x9D are not handled by MultiByteToWideChar (at least in codepage 1252 ANSI - Latin I, hopefully popular enough), so no, not everything is Unicode by the time we get here, and no, we do need to worry about such confusion and implement a proper check to avoid breaking existing applications.
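The point above can be checked without the Windows API. What follows is only a minimal, portable sketch: the table holds the Windows-1252 assignments for 0x80..0x9F, and `cp1252_decode_byte` and `is_c1_control` are hypothetical helper names, not functions from the Terminal code base. The pass-through of the five undefined bytes mirrors what the repro program reports for `MultiByteToWideChar`; it is an assumption of this sketch rather than documented behaviour.

```c
#include <stdint.h>

/* Windows-1252 mappings for bytes 0x80..0x9F. A zero entry marks one of the
 * five positions the code page leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D). */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: decode one cp1252 byte, passing the undefined bytes
 * through unchanged, as the repro program observes. */
uint16_t cp1252_decode_byte(unsigned char b)
{
    if (b < 0x80 || b > 0x9F)
        return b;                       /* ASCII and 0xA0..0xFF map 1:1 */
    uint16_t mapped = cp1252_high[b - 0x80];
    return mapped ? mapped : b;         /* undefined byte stays in U+0080..U+009F */
}

/* True when a decoded UTF-16 unit falls in the C1 control range. */
int is_c1_control(uint16_t wch)
{
    return wch >= 0x0080 && wch <= 0x009F;
}
```

With these helpers, `cp1252_decode_byte(0x9B)` yields U+203A as the quoted comment says, while `is_c1_control(cp1252_decode_byte(0x81))` is true, which is exactly the ambiguity described in the report.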


@skyline75489 commented on GitHub (Jun 2, 2021):

I think https://github.com/microsoft/terminal/issues/7854#issuecomment-705235345 explains this.


@DHowett commented on GitHub (Jun 2, 2021):

@skyline75489 has the right of it. Those characters are unspecified in 1252. They are control characters after f91b53d, but they were control characters before that, too.

What is a well-meaning application doing printing things outside of its codepage’s codepoint coverage?


@alabuzhev commented on GitHub (Jun 2, 2021):

The application basically outputs a file picked by the user using a codepage picked by the user.

If that behaviour is now by design - ok.
Although it would be nice to either mention that in the comment or remove that MultiByteToWideChar-inspired motivation altogether in favour of something like "0x80 - 0x9F are control codes now. Deal with it." to avoid further confusion.


@alabuzhev commented on GitHub (Jun 2, 2021):

A few more thoughts:

  • https://en.wikipedia.org/wiki/C0_and_C1_control_codes:

Except for SS2 and SS3 in EUC-JP text, and NEL in text transcoded from EBCDIC, the 8-bit forms of these codes are almost never used. CSI, DCS and OSC are used to control text terminals and terminal emulators, but almost always by using their 7-bit escape code representations. Their ISO/IEC 2022 compliant single-byte representations are invalid in UTF-8, and the UTF-8 encodings of their corresponding codepoints are two bytes long like their escape code forms (for instance, CSI at U+009B is encoded as the bytes 0xC2, 0x9B in UTF-8), so there is no advantage to using them rather than the equivalent two-byte escape sequence. When these codes appear in modern documents, web pages, e-mail messages, etc., they are usually intended to be printing characters at that position in a proprietary encoding such as Windows-1252 or Mac OS Roman that use the C1 codes to provide additional graphic characters.

  • Windows does allow 0x80 - 0x9F in filenames. You can literally create a file named "€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ", type dir, sit back and watch the world burn:

image
image

Don't ask "but why?" - users can, so they will.
And incorrect codepage conversions are still a thing, especially during processing of various metadata. There are lots and lots of weird file names in the wild.

Sanitising file names, everywhere, even in scenarios not related to outputting anything (https://github.com/microsoft/terminal/issues/10312) for the sake of the feature that is "almost never used"... 🤔
Supposedly sooner or later C1 will make it into the conhost and this is where the fun begins.
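The byte-level claim in the Wikipedia quote (CSI at U+009B becomes 0xC2 0x9B in UTF-8) is easy to verify. This is a small sketch of the standard two-byte UTF-8 encoding, limited to code points below U+0800; the function name `utf8_encode_small` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point below U+0800 as UTF-8. Returns the number of bytes
 * written (1 or 2), or 0 if the code point is outside this sketch's range. */
size_t utf8_encode_small(uint32_t cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                    /* plain ASCII */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* lead byte: 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* continuation: 10xxxxxx */
        return 2;
    }
    return 0;
}
```

Every C1 code point U+0080..U+009F therefore encodes as 0xC2 followed by a byte equal to the code point's low eight bits, the same length as the 7-bit `ESC [` form.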


@j4james commented on GitHub (Jun 2, 2021):

Note that your test case won't be interpreted as control characters in conhost (even the very latest build), because you have to have the ENABLE_VIRTUAL_TERMINAL_PROCESSING mode set for this functionality to apply. In Windows Terminal that's enabled by default, so if you don't want VT controls processed in WT I think you have to explicitly disable that mode.

That said, the cmd shell does enable VT mode, so control characters in a filename could be an issue there. Somebody already raised that in issue #10069, which I misdiagnosed as a conpty problem, but I've just checked with a recent OpenConsole build and can reproduce the issue there too. So that issue should probably be reopened - it's not a dup of #4363.

However, note that this has been an issue long before PR #7340, because we already supported the 8-bit CSI control before then. PR #7340 just added more controls.
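For an application that wants to opt out, disabling the mode might look roughly like the sketch below. It assumes the documented console API (`GetConsoleMode`, `SetConsoleMode`) and the documented value 0x0004 for `ENABLE_VIRTUAL_TERMINAL_PROCESSING`; the helper names `without_vt_processing` and `disable_vt_on_stdout` are hypothetical, and whether Windows Terminal honours the change in every scenario has not been verified here.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Value of ENABLE_VIRTUAL_TERMINAL_PROCESSING in the Windows console API. */
#define VT_PROCESSING_FLAG 0x0004UL

/* Hypothetical helper: return the output mode with VT processing cleared. */
unsigned long without_vt_processing(unsigned long mode)
{
    return mode & ~VT_PROCESSING_FLAG;
}

#ifdef _WIN32
/* Turn off VT processing on stdout; returns nonzero on success. */
int disable_vt_on_stdout(void)
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (out == INVALID_HANDLE_VALUE || !GetConsoleMode(out, &mode))
        return 0;
    return SetConsoleMode(out, without_vt_processing(mode)) != 0;
}
#endif
```

The bit manipulation is kept in its own function so it can be checked on any platform; only `disable_vt_on_stdout` touches the console.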


@PennRobotics commented on GitHub (Oct 22, 2021):

https://github.com/BurntSushi/ripgrep/issues/1992

At least two more users are affected by this behavior.


@DHowett commented on GitHub (Oct 22, 2021):

It seems like BurntSushi/ripgrep#1992 is another instance of "an application is printing UTF-8 to the screen without setting the console codepage to UTF-8, or converting internally and printing it as UTF-16."


@PennRobotics commented on GitHub (Oct 22, 2021):

I'm aware of chcp.com, but is there a way to change the code page in WSL2? I've tried (semi-successfully) to pipe output through iconv, but I don't know flags to get the same output as conhost. For instance, when I use -f UTF-7 and omit invalid characters, -c, the output (below) ends at letter z.

In any case, I want to run commands without escape codes causing cut off lines and random escape characters appearing on the next zsh prompt.

This escape code behavior is not occurring in the default WSL terminal (via conhost, I believe), where I see instead
let identchars_ok = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™šžŸ ¡¢£¤¥¦§µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ' (Interesting enough, the Github edit window for this post shows a different Unicode replacement character than this post.)


Update: piping iconv -f UTF-8 -t UNICODE -c makes all the extended characters show up as replacement symbols, which is nice. Not using the -c flag causes an error at line 396, so I opened up vim to this line, and it's a total, utter disaster. Lines 391, 392, 396, 397, and 401 will change persistently just by moving the cursor around in this area. This is a function called Run_regexp_multibyte_magic, and the hexdump of the offending characters from these lines follows:

00000000  e0 b8 ab e0 b8 a1 e0 b9  88 78 20 e0 b8 ad e0 b8  |.........x .....|
00000010  a1 78 ** e0 b8 ad e0 b8  a1 78 20 e0 b8 ab e0 b8  |.x.......x .....|
00000020  a1 e0 b9 88 78 ** fc 92  8d 85 99 b8 79 ** fc 92  |....x.......y...|
00000030  8d 8a af 8d 7a ** c3 a4  c3 b6 20 c3 bc ce b1 cc  |....z..... .....|
00000040  84 cc 86 cc 81 ** **                              |.......|

I replaced 0a with ** so the linebreaks are easier to spot.

This snippet is displayed in vim:

  1 หม่x อมx
  2 อมx หม่x
  3 ������y
  4 ������z
  5 äö üᾱ̆
  6            

Moving the cursor around causes all sorts of trouble. Moving the cursor down, line-by-line, from the top line to the bottom without entering insert mode:

  1 หม่x อม
  2 อมx หม่
  3 ��
  4 ��
  5 äö üα
  6         

On further inspection, this same file shows up screwy in conhost, too, so vim's handling of UTF-8 could be imperfect. Also,
using iconv -f UTF-8 -t ASCII -c will strip away information that might be needed, so I don't believe this is a good solution. The string
AÀÁÂÃÄÅĀĂĄǍǞǠǺȂȦȺḀẠẢẤẦẨẪẬẮẰẲẴẶ BƁɃḂḄḆ CÇĆĈĊČƇȻḈꞒ DĎĐƊḊḌḎḐḒ EÈÉÊËĒĔĖĘĚȄȆȨɆḔḖḘḚḜẸẺẼẾỀỂỄỆ FƑḞꞘ GĜĞĠĢƓǤǦǴḠꞠ HĤĦȞḢḤḦḨḪⱧ IÌÍÎÏĨĪĬĮİƗǏȈȊḬḮỈỊ JĴɈ KĶƘǨḰḲḴⱩꝀ LĹĻĽĿŁȽḶḸḺḼⱠ MḾṀṂ NÑŃŅŇǸṄṆṈṊꞤ OÒÓÔÕÖØŌŎŐƟƠǑǪǬǾȌȎȪȬȮȰṌṎṐṒỌỎỐỒỔỖỘỚỜỞỠỢ PƤṔṖⱣ QɊ RŔŖŘȐȒɌṘṚṜṞⱤꞦ SŚŜŞŠȘṠṢṤṦṨⱾꞨ TŢŤŦƬƮȚȾṪṬṮṰ UÙÚÛÜŨŪŬŮŰƯǕǙǛǓǗȔȖɄṲṴṶṸṺỤỦỨỪỬỮỰ VƲṼṾ WŴẀẂẄẆẈ XẊẌ YÝŶŸƳȲɎẎỲỴỶỸ ZŹŻŽƵẐẒẔⱫ aàáâãäåāăąǎǟǡǻȃȧᶏḁẚạảấầẩẫậắằẳẵặⱥ bƀɓᵬᶀḃḅḇ cçćĉċčƈȼḉꞓꞔ dďđɗᵭᶁᶑḋḍḏḑḓ eèéêëēĕėęěȅȇȩɇᶒḕḗḙḛḝẹẻẽếềểễệ fƒᵮᶂḟꞙ gĝğġģǥǧǵɠᶃḡꞡ hĥħȟḣḥḧḩḫẖⱨꞕ iìíîïĩīĭįǐȉȋɨᶖḭḯỉị jĵǰɉ kķƙǩᶄḱḳḵⱪꝁ lĺļľŀłƚḷḹḻḽⱡ mᵯḿṁṃ nñńņňʼnǹᵰᶇṅṇṉṋꞥ oòóôõöøōŏőơǒǫǭǿȍȏȫȭȯȱɵṍṏṑṓọỏốồổỗộớờởỡợ pƥᵱᵽᶈṕṗ qɋʠ rŕŗřȑȓɍɽᵲᵳᶉṛṝṟꞧ sśŝşšșȿᵴᶊṡṣṥṧṩꞩ tţťŧƫƭțʈᵵṫṭṯṱẗⱦ uùúûüũūŭůűųǚǖưǔǘǜȕȗʉᵾᶙṳṵṷṹṻụủứừửữự vʋᶌṽṿ wŵẁẃẅẇẉẘ xẋẍ yýÿŷƴȳɏẏẙỳỵỷỹ zźżžƶᵶᶎẑẓẕⱬ
becomes
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z


@DHowett commented on GitHub (Oct 22, 2021):

> change the code page in WSL2

Hmm. I let my assumptions cloud my interpretation of the linked issue and missed that this was under WSL. I'm sorry.

conhost will unfortunately exhibit this behavior on Windows 11 (in before I've accidentally missed that detail too) or a later update, since the code in this repo is the code for conhost and huge swaths of the terminal emulation code are straight-up shared.

I'll have to come back to the vim bits after the weekend for a deeper investigation.


@PennRobotics commented on GitHub (Oct 22, 2021):

I've gotta link this file I'm using as a test for reference: https://github.com/vim/vim/blob/master/src/testdir/test_regexp_utf8.vim
This might be the boss level of encoding tests. On my system, the characters of many of the strings are different when shown in Github and when viewed raw.

I'm not even as much worried about the proper character display as much as the 1;0c that pops out at the next prompt whenever I search certain repositories with the right terms. I'm not sure how that's getting from stdout to the input buffer.

Also, sorry for hijacking @alabuzhev 's thread


@j4james commented on GitHub (Oct 22, 2021):

> I'm not even as much worried about the proper character display as much as the 1;0c that pops out at the next prompt whenever I search certain repositories with the right terms. I'm not sure how that's getting from stdout to the input buffer.

That's a response to the DECID query which is triggered by the C1 control character U+009A (see https://invisible-island.net/xterm/ctlseqs/ctlseqs.html#h3-C1-_8-Bit_-Control-Characters).


@j4james commented on GitHub (Oct 23, 2021):

  1 หม่x อมx
  2 อมx หม่x
  3 ������y
  4 ������z
  5 äö üᾱ̆
  6            

@PennRobotics FYI, this particular case has got nothing to do with C1 controls. Lines 1, 2 and 5 have got combining characters or non-spacing marks, that we are probably not handling correctly, or at least not in the same way as vim. Lines 3 and 4 are all just invalid code points in UTF-8, which I guess could also be an issue if there is a discrepancy in the way vim and WT handle the erroneous values.

I'm not seeing the content changing when moving the cursor up and down in vim, but maybe that depends on the version or other configuration differences. It certainly wouldn't surprise me if it did something weird with those characters though. In any event, I think these problems are possibly more on topic in issue #8000, but @DHowett can correct me on that.
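To illustrate why lines 3 and 4 count as invalid: both start with the byte `fc` in the hexdump, which modern UTF-8 (RFC 3629) never allows as a lead byte. A rough sketch of lead-byte classification follows; the function name `utf8_sequence_length` is made up for this example, and it checks only the first byte, not the continuation bytes.

```c
/* Expected sequence length for a UTF-8 lead byte per RFC 3629,
 * or 0 if the byte can never start a valid sequence. */
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* U+0080..U+07FF */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* U+0800..U+FFFF */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* U+10000..U+10FFFF */
    return 0;   /* continuation bytes, 0xC0/0xC1, and 0xF5..0xFF */
}
```

The `e0` bytes that open lines 1 and 2 classify as three-byte sequences, the `c3` bytes on line 5 as two-byte sequences, and `fc` returns 0, so a decoder has to fall back to replacement characters for it.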


@PennRobotics commented on GitHub (Oct 25, 2021):

@j4james set cursorline (and potentially also set cursorcolumn) is the offending .vimrc option

Sure enough, echo -ne '\u009a' is enough to add characters to the beginning of the next prompt, so piping to sed "s/$(echo -ne '\u009a')//g" will suppress this behavior.

Unrelated/unimportant: There's still another stray character sequence that---when running ripgrep ok on the file I linked above---causes two instances of install-from-source (the beginning of the outputted file path; on the next line) to appear as énstall-from-source, and a line break is also stripped away. In general, now I know what to look for when I start to get extra characters at the prompt. Thanks!
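The same idea as the `sed` filter can be written as a small C routine. Note that this sketch goes further than the command above: it drops every UTF-8 encoded C1 control (U+0080..U+009F), not just U+009A, and it assumes the input is otherwise valid UTF-8. The name `strip_c1_utf8` is hypothetical.

```c
#include <stddef.h>

/* Copy `len` bytes from `in` to `out`, dropping every UTF-8 encoded C1 control
 * (U+0080..U+009F, i.e. 0xC2 followed by 0x80..0x9F). `out` must have room for
 * at least `len` bytes. Returns the number of bytes written. */
size_t strip_c1_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t w = 0;
    for (size_t i = 0; i < len; ++i) {
        if (in[i] == 0xC2 && i + 1 < len &&
            in[i + 1] >= 0x80 && in[i + 1] <= 0x9F) {
            ++i;            /* skip both bytes of the C1 control */
            continue;
        }
        out[w++] = in[i];
    }
    return w;
}
```

Because 0xC2 can only appear as a lead byte in valid UTF-8, matching the two-byte pattern does not disturb neighbouring characters such as `é` (0xC3 0xA9).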


@PennRobotics commented on GitHub (Oct 25, 2021):

image

The top is Ubuntu via Windows Terminal and the bottom is Ubuntu started as an app.


@j4james commented on GitHub (Oct 25, 2021):

> The top is Ubuntu via Windows Terminal and the bottom is Ubuntu started as an app.

@PennRobotics That's because full C1 support was only added in PR #7340 and assumedly hasn't made its way into the inbox conhost yet, or at least not the version you're using. But U+009B should at least work. Try echo -ne '\u009Bc'.


@PennRobotics commented on GitHub (Oct 26, 2021):

Is there an easy way to universally disable the aspect of C1 support that is putting characters into the input buffer? An environmental var? I understand this could be the side effects of an exciting new/upcoming feature, but I can't say I would want the old terminal to start doing what this newer Terminal does---prefixing my next command with some gibberish.

I do have the use case where I need to grep a mixed-ASCII-and-binary file (.elf w/ symbols and strings, .pdf, etc.) as part of a larger codebase search, so this has become a rare but occasional annoyance. (I also understand that I'm in a minority of users, and the workaround of pressing Ctrl+U to clear the input line is much, much faster than any alternatives that I can conceive.)


@j4james commented on GitHub (Oct 26, 2021):

If you're outputting raw control characters to the screen, there is always the possibility you're going to trigger a query response, even if you could disable C1 support. You'd really need to disable all VT processing entirely.

So if you're using grep on binary files, I'd suggest piping the output through less, which should handle the control character filtering for you. Instead of the controls being interpreted, you'll just see something like <U+009A> in the output.
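For tools that cannot be piped through a pager, a similar "make it visible instead of interpreting it" filter could be sketched as below. This is not how `less` is implemented; it only imitates the `<U+009A>` style of output mentioned above, and the function name `show_c1_utf8` is invented for the example.

```c
#include <stdio.h>
#include <stddef.h>

/* Copy UTF-8 input to `out` as a NUL-terminated string, replacing each encoded
 * C1 control (0xC2 0x80..0x9F) with a visible "<U+00NN>" marker.
 * Returns the length of the result, or 0 if `cap` is too small
 * (an empty input also yields 0). */
size_t show_c1_utf8(const unsigned char *in, size_t len, char *out, size_t cap)
{
    size_t w = 0;
    for (size_t i = 0; i < len; ++i) {
        int is_c1 = in[i] == 0xC2 && i + 1 < len &&
                    in[i + 1] >= 0x80 && in[i + 1] <= 0x9F;
        size_t need = is_c1 ? 8 : 1;
        if (w + need + 1 > cap)
            return 0;                   /* not enough room, give up */
        if (is_c1) {
            /* 0xC2 0xNN decodes to U+00NN for NN in 0x80..0x9F */
            snprintf(out + w, cap - w, "<U+%04X>", (unsigned)in[i + 1]);
            ++i;
        } else {
            out[w] = (char)in[i];
        }
        w += need;
    }
    if (cap == 0)
        return 0;
    out[w] = '\0';
    return w;
}
```

Unlike stripping, this keeps a trace of where the control characters were, which can help when hunting for the byte that triggered a query response.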


@PennRobotics commented on GitHub (Oct 26, 2021):

Eww. less on its own will erase its buffer after exiting. less -X will leave the output in stdout but forces the user to scroll to the bottom to leave the full (up to the argumented buffer size, at least) results in the active buffer. less -X | grep $ eliminates the scrolling and brings back the unwanted control characters!

In time, I'll find a good alternative to sed and less.

It doesn't make sense why a query response shows up at the prompt as the next command. In contrast, something like ((sleep 1 && echo -n a) &) will echo "a" at the prompt before the cursor (or mid-command if you type immediately after), but this echo'd character is never part of the next command and is overwritten by the correct, typed character if you move the cursor back.

If you don't delete the query response's 1; from the next prompt, you'll jump to the last pushed directory and run a (likely) invalid command. It's not a major bug, but it is an annoyance.

I fail to understand why the control character must become part of stdin instead of stdout, but even Wikipedia (https://en.wikipedia.org/wiki/ANSI_escape_code#Terminal_input_sequences) indicates a terminal can generate sequences that seem to come from the user.

However, for the sake of your time and my time, I have no more interest in this subject and reserve hope that stray characters never land in the input buffer at some point in the future.

Thanks for sharing all this reasonably obscure info on terminal nuances.


@PennRobotics commented on GitHub (Oct 27, 2021):

Piping to `preconv -r` works without removing color codes, although umlauts and other non-English letters also get converted. This is close enough for me.


@j4james commented on GitHub (Oct 31, 2021):

@DHowett If we want to try and do something about this, I have a proposal that I think might make most people happy.

  1. We start with C1 controls disabled by default. They're not particularly useful in UTF-8, and it's unlikely anyone is expecting a random subset of them to work in the unmapped portions of the DOS/Windows code pages.
  2. If ISO-2022 mode is requested, that's when we enable the C1 support in the parser, since that's the one time they're actually likely to be needed.
  3. Optionally add support for the `DECAC1` (Accept C1 Controls) escape sequence, so they can also be manually enabled in the UTF-8 codepage, in case anyone actually does need that.

Hopefully this will cut down on the bug reports, without actually losing any significant functionality.

It's also worth mentioning that XTerm doesn't support C1 controls in UTF-8 either, so it's unlikely to cause compatibility issues with Linux apps. While there are a few Linux terminals that do support UTF-8 C1 (VTE being the most well known), I think they're probably in the minority.

Anyway, I don't feel that strongly about this either way, but I'd be happy to put together a PR if you like the idea.
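
As a rough illustration of the three steps in the proposal above, the gating could look something like the following. This is only a sketch: `ParserState`, `accepts_c1`, and `treat_as_c1_control` are hypothetical names made up for this example and are not taken from the Windows Terminal `StateMachine`.

```c
#include <stdbool.h>

/* Hypothetical parser state; field names are illustrative only. */
typedef struct {
    bool iso2022_mode;   /* step 2: set when ISO-2022 mode is requested */
    bool decac1_enabled; /* step 3: set when DECAC1 (Accept C1 Controls) is received */
} ParserState;

/* C1 controls occupy the code points U+0080..U+009F. */
static bool is_c1_code_point(unsigned int ch)
{
    return ch >= 0x80 && ch <= 0x9F;
}

/* Step 1: C1 handling is off unless one of the two opt-ins is active. */
static bool accepts_c1(const ParserState *state)
{
    return state->iso2022_mode || state->decac1_enabled;
}

/* Returns true when ch should be dispatched as a control instead of printed. */
static bool treat_as_c1_control(const ParserState *state, unsigned int ch)
{
    return is_c1_code_point(ch) && accepts_c1(state);
}
```

With both flags cleared (the proposed default), a code point such as `0x9B` falls through to the normal print path, which is the behaviour the original report was asking for.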


@DHowett commented on GitHub (Nov 1, 2021):

@j4james I'm totally on board with this proposal. I love it.

The only thing that gives me pause (and it is not enough pause for me to care) is that I think we supported one C1 control when we initially went open-source, and I somewhat wondered if there was a reason we chose to support that one. It's likely a "this one seems common, maybe we should support it we guess?" situation.


@DHowett commented on GitHub (Nov 1, 2021):

I'll probably prepare this one for isolated ingestion into Windows so it can be released in a servicing update. 😄 Folks may eventually be more broadly upset that conhost is "acting weird" and "displaying corrupt text" and "stole my lunch."


@j4james commented on GitHub (Nov 2, 2021):

> The only thing that gives me pause (and it is not enough pause for me to care) is that I _think_ we supported one C1 control when we initially went open-source, and I somewhat wondered if there was a reason we chose to support that one. It's likely a "this one seems common, maybe we should support it we guess?" situation.

At the time you went open-source, there were only 5 supported sequences that could potentially have been implemented as C1 as far as I could see: `HTS`, `RI`, `CSI`, `OSC`, and `ST`. Of those five, `HTS` and `RI` aren't handled at the `StateMachine` level, so I can understand why they might not even have been considered.

That leaves `CSI`, `ST`, and `OSC`, the first two of which were actually supported as C1 controls. So the odd one out is `OSC`, which is kind of weird, because the only time you would use `ST` is when coupled with `OSC`. But the bottom line is that you supported almost all of the controls at the state machine level that were implemented at that time.
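
For reference, the five controls mentioned above have standard 8-bit C1 code points, and each one corresponds to a 7-bit two-character form, `ESC` followed by the C1 value minus `0x40`. The sketch below only illustrates that correspondence; the enum and the function name `c1_to_escape_final` are invented for this example and do not come from the terminal's source.

```c
/* 8-bit C1 code points of the controls discussed above. */
enum {
    C1_HTS = 0x88, /* Horizontal Tab Set;          7-bit form: ESC H */
    C1_RI  = 0x8D, /* Reverse Index;               7-bit form: ESC M */
    C1_CSI = 0x9B, /* Control Sequence Introducer; 7-bit form: ESC [ */
    C1_ST  = 0x9C, /* String Terminator;           7-bit form: ESC backslash */
    C1_OSC = 0x9D  /* Operating System Command;    7-bit form: ESC ] */
};

/*
 * For a C1 control in 0x80..0x9F, returns the final character F of the
 * equivalent 7-bit sequence "ESC F". Returns 0 for anything else.
 */
static char c1_to_escape_final(unsigned int c1)
{
    if (c1 < 0x80 || c1 > 0x9F) {
        return 0;
    }
    return (char)(c1 - 0x40);
}
```

A parser that has C1 handling disabled can still accept all of these controls through their 7-bit forms, which is why turning the 8-bit forms off by default loses little functionality.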

What's giving me pause now, though, is I just went to dig up an issue I remembered from the VTE tracker, where they were considering removing their C1 handling, thinking it would support this decision, but they ultimately chose not to (this was issue [209](https://gitlab.gnome.org/GNOME/vte/-/issues/209)). Ironically they cited the Windows support of C1 as one of the reasons for keeping it.

It wasn't just the fact that they chose to keep it, though, but they linked to a [bug report in Alacritty](https://github.com/alacritty/alacritty/issues/3105) where Fedora was using an OSC sequence with a C1 control, and since Alacritty didn't support C1, it got a bunch of garbage output on the screen. So that is a situation we could end up facing as well.

That said, I think this particular case only arose because Alacritty was started from a VTE shell, and was misrecognized as VTE because the VTE_VERSION environment variable was set. So I still think we are more likely to get bug reports from having C1 enabled by default, than we will if we remove it, but I don't want to pretend there are no downsides.


@alabuzhev commented on GitHub (Nov 5, 2021):

Guys, any recommendations on which characters should be used for C1 replacement?
For C0 it's what [MB_USEGLYPHCHARS](http://archives.miloush.net/michkap/archive/2005/02/26/381020.html) does, but, as far as I can see, there's no established equivalent for C1.
I've tried to just remap them to a private range (E080 - E09F), but it looks like the host [has performance issues with some of those](https://github.com/FarGroup/FarManager/issues/469), at least with E098.
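
A minimal sketch of the private-range remapping described above might look like this. The function name `remap_c1_to_private_use` and the macros are assumptions for illustration, not the actual Far Manager code. It only shows the U+0080..U+009F to U+E080..U+E09F mapping and does nothing about the performance issue mentioned.

```c
#include <stddef.h>
#include <wchar.h>

#define C1_FIRST 0x0080
#define C1_LAST  0x009F
#define PUA_BASE 0xE000 /* U+0080 -> U+E080, ..., U+009F -> U+E09F */

/*
 * Rewrites every C1 code unit in text[0..length) to the corresponding
 * Private Use Area code point, so the host cannot interpret it as a control.
 * All other characters are left untouched.
 */
static void remap_c1_to_private_use(wchar_t *text, size_t length)
{
    for (size_t i = 0; i < length; ++i) {
        if (text[i] >= C1_FIRST && text[i] <= C1_LAST) {
            text[i] = (wchar_t)(PUA_BASE + text[i]);
        }
    }
}
```

The function would be called on the buffer just before it is handed to `WriteConsoleW`. Whether the remapped characters display as anything useful depends on the font, which is part of why the question of a proper replacement set is still open.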


@ghost commented on GitHub (Feb 3, 2022):

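🎉 This issue was addressed in #11690, which has now been successfully released as `Windows Terminal Preview v1.13.10336.0`. 🎉

Handy links:

* [Release Notes](https://github.com/microsoft/terminal/releases/tag/v1.13.10336.0)
* [Store Download](https://www.microsoft.com/store/apps/9n8g5rfz9xk3?cid=storebadge&ocid=badge)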

Reference: starred/terminal#14087