Some Unicode characters are improperly accepted or dropped #12430

Closed
opened 2026-01-31 03:15:27 +00:00 by claunia · 13 comments
Owner

Originally created by @chitoku-k on GitHub (Feb 5, 2021).

Originally assigned to: @DHowett on GitHub.

Environment

Windows build number: 10.0.19042.0
Windows Terminal version (if applicable): 1.5.x and 1.6.x
PowerShell: 5.1.19041.610 and 7.1.1

Steps to reproduce

  1. Run Windows Terminal built after https://github.com/microsoft/terminal/pull/8035 gets merged.
  2. Input 0123456789 (FULLWIDTH DIGIT) by any methods such as right click to paste, Ctrl + V, or from keyboard.

Expected behavior

  • 0123456789 is input.

Actual behavior

  • Only the first letter ( in this case) is input.

Detailed Explanation

The implementation of GetQuickCharWidth has been changed in https://github.com/microsoft/terminal/pull/8035 and affected the following invocations:

The former is totally acceptable because it falls back to looking up from Unicode table in CodepointWidthDetector later on; however, the latter one has been broken in this PR.

When the scanned key is considered invalid in CharToKeyEvents, it tries to detect character width by calling GetQuickCharWidth and make it process as keyboard events if the width is CodepointWidth::Wide. In the aforementioned PR, however, GetQuickCharWidth no longer returns CodepointWidth::Wide and instead returns CodepointWidth::Invalid for the characters other than ASCII, and results in SynthesizeNumpadEvents being called for 0123456789. Since this function processes the given characters as if it were typed with Alt key + numpad, the result becomes nondeterministic (such as some applications like cmd.exe process this normally but powershell.exe only accepts the first letter).

1df3182865/src/types/convert.cpp (L156)

One option is to return a new value like CodepointWidth::Unknown when GetQuickCharWidth cannot detect the character width immediately,

  CodepointWidth GetQuickCharWidth(const wchar_t wch) noexcept
  {
      if (0x20 <= wch && wch <= 0x7e)
      {
          /* ASCII */
          return CodepointWidth::Narrow;
      }
+     else if (wch < 0xffff)
+     {
+         return CodepointWidth::Unknown;
+     }
      return CodepointWidth::Invalid;
  }

because the definition of CodepointWidth::Invalid is not a valid unicode codepoint.

1df3182865/src/types/inc/convert.hpp (L23-L29)

In CharToKeyEvents, the expression should be corrected to:

- if (WI_IsFlagSet(CharType, C3_ALPHA) || GetQuickCharWidth(wch) == CodepointWidth::Wide)
+ if (WI_IsFlagSet(CharType, C3_ALPHA) || GetQuickCharWidth(wch) == CodepointWidth::Unknown)

and in CodepointWidthDetector::GetWidth(), it has to support the new value:

-         // If it's invalid, the quick width had no opinion, so go to the lookup table.
-         if (width == CodepointWidth::Invalid)
+         // If it's unknown or invalid, the quick width had no opinion, so go to the lookup table.
+         if (width == CodepointWidth::Unknown || width == CodepointWidth::Invalid)

I can make a PR if the way of fix I've suggested is acceptable. Thanks in advance.

Appendix

Note that this issue is not related to the issue in PSReadLine at all because it can be reproducible in applications launched from PowerShell.

Originally created by @chitoku-k on GitHub (Feb 5, 2021). Originally assigned to: @DHowett on GitHub. <!-- 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨 I ACKNOWLEDGE THE FOLLOWING BEFORE PROCEEDING: 1. If I delete this entire template and go my own path, the core team may close my issue without further explanation or engagement. 2. If I list multiple bugs/concerns in this one issue, the core team may close my issue without further explanation or engagement. 3. If I write an issue that has many duplicates, the core team may close my issue without further explanation or engagement (and without necessarily spending time to find the exact duplicate ID number). 4. If I leave the title incomplete when filing the issue, the core team may close my issue without further explanation or engagement. 5. If I file something completely blank in the body, the core team may close my issue without further explanation or engagement. All good? Then proceed! --> <!-- This bug tracker is monitored by Windows Terminal development team and other technical folks. **Important: When reporting BSODs or security issues, DO NOT attach memory dumps, logs, or traces to Github issues**. Instead, send dumps/traces to secure@microsoft.com, referencing this GitHub issue. If this is an application crash, please also provide a Feedback Hub submission link so we can find your diagnostic data on the backend. Use the category "Apps > Windows Terminal (Preview)" and choose "Share My Feedback" after submission to get the link. Please use this form and describe your issue, concisely but precisely, with as much detail as possible. --> # Environment ```none Windows build number: 10.0.19042.0 Windows Terminal version (if applicable): 1.5.x and 1.6.x PowerShell: 5.1.19041.610 and 7.1.1 ``` # Steps to reproduce <!-- A description of how to trigger this bug. --> 1. Run Windows Terminal built after https://github.com/microsoft/terminal/pull/8035 gets merged. 1. Input `0123456789` ([FULLWIDTH DIGIT](https://unicode.org/charts/PDF/UFF00.pdf)) by any methods such as right click to paste, `Ctrl + V`, or from keyboard. # Expected behavior <!-- A description of what you're expecting, possibly containing screenshots or reference material. --> - `0123456789` is input. # Actual behavior <!-- What's actually happening? --> - Only the first letter (`0` in this case) is input. # Detailed Explanation The implementation of `GetQuickCharWidth` has been changed in https://github.com/microsoft/terminal/pull/8035 and affected the following invocations: - to determine which type of width the given character has for rendering - [CodepointWidthDetector::GetWidth(const std::wstring_view)](https://github.com/microsoft/terminal/blob/1df3182865fb089bd653763cd0abbea811545365/src/types/CodepointWidthDetector.cpp#L340-L376) - to convert the given character(s) into a queue of key events for input - [CharToKeyEvents(const wchar_t, const unsigned int)](https://github.com/microsoft/terminal/blob/1df3182865fb089bd653763cd0abbea811545365/src/types/convert.cpp#L142-L175) The former is totally acceptable because it falls back to looking up from Unicode table in `CodepointWidthDetector` later on; however, the latter one has been broken in this PR. When the scanned key is considered invalid in `CharToKeyEvents`, it tries to detect character width by calling `GetQuickCharWidth` and make it process as keyboard events if the width is `CodepointWidth::Wide`. In the aforementioned PR, however, `GetQuickCharWidth` no longer returns `CodepointWidth::Wide` and instead returns `CodepointWidth::Invalid` for the characters other than ASCII, and results in `SynthesizeNumpadEvents` being called for `0123456789`. Since this function processes the given characters as if it were typed with `Alt` key + numpad, the result becomes nondeterministic (such as some applications like cmd.exe process this normally but powershell.exe only accepts the first letter). https://github.com/microsoft/terminal/blob/1df3182865fb089bd653763cd0abbea811545365/src/types/convert.cpp#L156 One option is to return a new value like `CodepointWidth::Unknown` when `GetQuickCharWidth` cannot detect the character width immediately, ```diff CodepointWidth GetQuickCharWidth(const wchar_t wch) noexcept { if (0x20 <= wch && wch <= 0x7e) { /* ASCII */ return CodepointWidth::Narrow; } + else if (wch < 0xffff) + { + return CodepointWidth::Unknown; + } return CodepointWidth::Invalid; } ``` because the definition of `CodepointWidth::Invalid` is **not a valid unicode codepoint**. https://github.com/microsoft/terminal/blob/1df3182865fb089bd653763cd0abbea811545365/src/types/inc/convert.hpp#L23-L29 In `CharToKeyEvents`, the expression should be corrected to: ```diff - if (WI_IsFlagSet(CharType, C3_ALPHA) || GetQuickCharWidth(wch) == CodepointWidth::Wide) + if (WI_IsFlagSet(CharType, C3_ALPHA) || GetQuickCharWidth(wch) == CodepointWidth::Unknown) ``` and in `CodepointWidthDetector::GetWidth()`, it has to support the new value: ```diff - // If it's invalid, the quick width had no opinion, so go to the lookup table. - if (width == CodepointWidth::Invalid) + // If it's unknown or invalid, the quick width had no opinion, so go to the lookup table. + if (width == CodepointWidth::Unknown || width == CodepointWidth::Invalid) ``` I can make a PR if the way of fix I've suggested is acceptable. Thanks in advance. ## Appendix Note that this issue is not related to the issue in PSReadLine at all because it can be reproducible in applications launched from PowerShell.
claunia added the Product-ConhostIssue-BugArea-InputNeeds-Tag-FixProduct-Terminal labels 2026-01-31 03:15:27 +00:00
Author
Owner

@chitoku-k commented on GitHub (Feb 12, 2021):

@DHowett
Could you look into this since you are the author of the PR (#8035)?

@chitoku-k commented on GitHub (Feb 12, 2021): @DHowett Could you look into this since you are the author of the PR (#8035)?
Author
Owner

@zadjii-msft commented on GitHub (Feb 12, 2021):

FYI, DHowett is a tad bit behind on his inbox - he's been busy with managerial stuff. He'll get to this eventually, sorry for the delays ☺️

@zadjii-msft commented on GitHub (Feb 12, 2021): FYI, DHowett is a tad bit behind on his inbox - he's been busy with managerial stuff. He'll get to this eventually, sorry for the delays ☺️
Author
Owner

@chitoku-k commented on GitHub (Feb 12, 2021):

@zadjii-msft Thanks to let me know about that. I'll hang tight then.

@chitoku-k commented on GitHub (Feb 12, 2021): @zadjii-msft Thanks to let me know about that. I'll hang tight then.
Author
Owner

@DHowett commented on GitHub (Mar 28, 2021):

Sorry for taking so long to get to this. This is one of the horrible places where we misuse a character's literal screen width to mean something related to its encoding.

Great find! I suspect that this is related to @skyline75489's recent work in numpad event synthesis, too.

@DHowett commented on GitHub (Mar 28, 2021): Sorry for taking so long to get to this. This is one of the horrible places where we misuse a character's _literal screen width_ to mean something related to its encoding. Great find! I suspect that this is related to @skyline75489's recent work in numpad event synthesis, too.
Author
Owner

@DHowett commented on GitHub (Mar 28, 2021):

(This will also impact Windows, so I'll make sure the fix gets into Windows as appropriate!)

@DHowett commented on GitHub (Mar 28, 2021): (This will also impact Windows, so I'll make sure the fix gets into Windows as appropriate!)
Author
Owner

@DHowett commented on GitHub (Mar 28, 2021):

@miniksa would you mind noodling over this with me this week? I don't understand why CharToKeyEvent takes the display width as a signal to determine whether VkKeyScanW was "wrong" (???) about whether the character could be entered via keyboard events.

It looks like the decision tree is...

  1. Do a VkKeyScan to turn the character into a key state
  2. Was it invalid? If it was valid, go to 5.
  3. Is it a C3_ALPHA or does it take up two columns? If either is true, go to 5.
  4. Convert the character into numpad events. Exit.
  5. Convert the character into KeyEvents. Exit.

Since we changed the definition of GQCW, we hit 4 more often. The "dumb" solution would be to hook CharToKeyEvent up to the entire unicode lookup table, but that just kicks the "why are we using the display width?" question down the road a bit. The "smart" solution is to use some other intrinsic about the character other than how many columns it takes up.

Converting it into a Key Event later will do something weird with the scancode -- it'll present an input that simply doesn't have a scancode. Not incorrect, but also not correct? But then, why would we use the numpad input generator at all if it was okay to yeet random UCS-2 data into the KeyEvent?

@DHowett commented on GitHub (Mar 28, 2021): @miniksa would you mind noodling over this with me this week? I don't understand why CharToKeyEvent takes the _display width_ as a signal to determine whether VkKeyScanW was "wrong" (???) about whether the character could be entered via keyboard events. It looks like the decision tree is... 1. Do a VkKeyScan to turn the character into a key state 2. Was it invalid? If it was **valid**, go to 5. 3. Is it a `C3_ALPHA` or does it take up two columns? If either is true, go to 5. 4. Convert the character into numpad events. Exit. 5. Convert the character into KeyEvents. Exit. Since we changed the definition of GQCW, we hit 4 more often. The "dumb" solution would be to hook CharToKeyEvent up to the entire unicode lookup table, but that just kicks the "why are we using the display width?" question down the road a bit. The "smart" solution is to use some other intrinsic about the character other than how many columns it takes up. Converting it into a Key Event later will do something weird with the scancode -- it'll present an input that simply doesn't _have_ a scancode. Not incorrect, but also not correct? But then, why would we use the numpad input generator at all if it was okay to yeet random UCS-2 data into the KeyEvent?
Author
Owner

@miniksa commented on GitHub (Mar 28, 2021):

@DHowett sure let's talk it out at some point. On my todo list.

@miniksa commented on GitHub (Mar 28, 2021): @DHowett sure let's talk it out at some point. On my todo list.
Author
Owner

@DHowett commented on GitHub (Mar 28, 2021):

This decision (width = key event instead of numpad event) first appeared in the clipboard handler in Windows 7, with the integration of "direct read" from the "_concurrency" branch. Conhost didn't formally exist before that, so I'm having trouble figuring out whether it was new in 7 or carried over from Vista. Hmm.

@DHowett commented on GitHub (Mar 28, 2021): This decision (width = key event instead of numpad event) first appeared in the clipboard handler in Windows 7, with the integration of "direct read" from the "_concurrency" branch. Conhost didn't formally exist before that, so I'm having trouble figuring out whether it was new in 7 or carried over from Vista. Hmm.
Author
Owner

@skyline75489 commented on GitHub (Mar 30, 2021):

I don't understand why CharToKeyEvent takes the display width as a signal to determine whether VkKeyScanW was "wrong" (???) about whether the character could be entered via keyboard events.

Count me in. What's worse is that, even this is the correct way, the code does not even handle the detection part comprehensively (per the chart in https://github.com/microsoft/terminal/pull/9613#issuecomment-806465276) . I assume a lot of Unicode characters do not belong to C3_ALPHA and will not be synthesized correctly.

@skyline75489 commented on GitHub (Mar 30, 2021): > I don't understand why CharToKeyEvent takes the display width as a signal to determine whether VkKeyScanW was "wrong" (???) about whether the character could be entered via keyboard events. Count me in. What's worse is that, even this *is* the correct way, the code does not even handle the detection part comprehensively (per the chart in https://github.com/microsoft/terminal/pull/9613#issuecomment-806465276) . I assume a lot of Unicode characters do not belong to `C3_ALPHA` and will not be synthesized correctly.
Author
Owner

@DHowett commented on GitHub (Apr 13, 2021):

Okay. I'm going to do a bit of a brain dump of the past couple hours' work here.

  • Numpad synthesis does not make any sense.
  • This code has existed since Windows 2000 (we think?)
  • When we synthesize a numpad event, we generate [alt down] [numpad x down] [numpad x up] (repeat...) [alt up, UnicodeChar = *ACTUAL CHARACTER*]
  • Most consumers of the ReadConsoleInput API that we could find ignore the numpad events and wait for the alt up to read the UnicodeChar
  • COOKED_READ uses the UnicodeChar from this event as well, but it mistakenly converts it as two bytes of Output codepage data instead of treating it as unicode (lol)
    • I think that Alt Up + UnicodeChar was a secret internal backdoor that conhost used to communicate a uchar to itself, and applications were intended to use the numpad events
    • @miniksa thinks that it was intended to mimic win32 key events (where input would be entered via numpad if it was not on the keyboard) for apps that replaced their win32 message loop with a ReadConsoleInput loop

Remediations:

  1. Yeet numpad synthesis into space.
  2. Reintroduce a copy of GetQuickCharWidth, build it inside windows only, and synthesize numpad events on conhost only. Yeet numpad synthesis into space in Terminal. SAFEST OPTION.
@DHowett commented on GitHub (Apr 13, 2021): Okay. I'm going to do a bit of a brain dump of the past couple hours' work here. * Numpad synthesis does not make any sense. * This code has existed since Windows 2000 (we think?) * When we synthesize a numpad event, we generate `[alt down] [numpad x down] [numpad x up] (repeat...) [alt up, UnicodeChar = *ACTUAL CHARACTER*]` * Most consumers of the ReadConsoleInput API that we could find ignore the numpad events and wait for the alt up to read the `UnicodeChar` * [dotnet ignored numpad input](https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/libraries/System.Console/src/System/ConsolePal.Windows.cs#L376-L382) * [vim short-circuits out if char != 0](https://github.com/vim/vim/blob/e7bebc495d4014d7bc81f863939c35268cb8e97d/src/os_win32.c#L1009-L1010) * **notable exception**: the Windows in-box telnet client actually does seem to accumulate numpad data and call `atoi` on it (holy sh*t, what on earth) * `COOKED_READ` uses the `UnicodeChar` from this event as well, but it mistakenly converts it as two bytes of Output codepage data instead of treating it as unicode (lol) * I think that Alt Up + UnicodeChar was a secret internal backdoor that conhost used to communicate a uchar to itself, and applications were intended to use the numpad events * @miniksa thinks that it was intended to mimic win32 key events (where input would be entered via numpad if it was not on the keyboard) for apps that replaced their win32 message loop with a ReadConsoleInput loop Remediations: 1. Yeet numpad synthesis into space. 2. Reintroduce a copy of GetQuickCharWidth, build it inside windows only, and synthesize numpad events on conhost only. Yeet numpad synthesis into space in Terminal. **SAFEST OPTION**.
Author
Owner

@miniksa commented on GitHub (Apr 13, 2021):

As badly as I want to do number 1, this feels awfully CJK_DbcsTests.cpp to me where the weird munging is going to cause some dumb compat issue. So I vote number 2 for sure even though I hate it.

@miniksa commented on GitHub (Apr 13, 2021): As badly as I want to do number 1, this feels awfully `CJK_DbcsTests.cpp` to me where the weird munging is going to cause some dumb compat issue. So I vote number 2 for sure even though I hate it.
Author
Owner

@DHowett commented on GitHub (Apr 13, 2021):

  • this was only used in the clipboard input path, until ConPTY came along — now all input from a connected terminal undergoes synthesis
@DHowett commented on GitHub (Apr 13, 2021): * this was only used in the clipboard input path, until ConPTY came along — now all input from a connected terminal undergoes synthesis
Author
Owner

@dAu6jARL commented on GitHub (Apr 13, 2021):

FULLWIDTH Solidus. / (U+FF0F) makes same.
original:

FOO/BAR

pasted result:

FOOBAR

@dAu6jARL commented on GitHub (Apr 13, 2021): FULLWIDTH Solidus. / (U+FF0F) makes same. original: > FOO/BAR pasted result: > FOOBAR
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/terminal#12430