[Epic] COOKED READ and character conversion fun with all our different Convert* functions #10825

Open
opened 2026-01-31 02:31:21 +00:00 by claunia · 0 comments
Owner

Originally created by @DHowett on GitHub (Sep 29, 2020).

Originally assigned to: @lhecker on GitHub.

[Claimed by miniksa]
This issue exists to resolve many issues around cooked read data processing including choking on the return of bytes in various A function calls from Read methods (with both DBCS and UTF-8 codepages), the improper display of Emoji or other U+10000 characters on the edit line (commonly seen in cmd.exe), and the general buffer grossness of mixmatching both char and wchar_t text within the same client-owned buffer space during waiting read operations.

These issues should be easier to resolve after this work or be fixed by it:

Uncategorized:

This is what I expect to do:

  • Start with _handlePostCharInputLoop and revise it so it doesn't attempt to precount the number of "bytes" by counting the "width" of characters. Instead move to just translating the text and storing the excess, much like an aliased command would do.
  • Realize TranslateUnicodeToOem has almost no usages anymore (and ConvertToOem) and could likely be converged with ConvertToA. Also that the fallback behavior of TranslateUnicodeToOem can be accomplished just by asking WC2MB to replace with the default character anyway.
  • Realize that if we don't need to know the CP's default character, then why are we loading it on startup? Oh, it's still because we need the lead byte table to know when something IsDBCSLeadByteConsole or not.
  • Realize that now that we don't use custom codepages anymore... there doesn't really seem to be a difference between IsDBCSLeadByteConsole and just calling the winnls.h exported IsDBCSLeadByteEx with the same codepage (and then not holding onto the CPInfo stuff at all.)
  • Doing the same thing for output codepage as input codepage in respect to dumping the load of that information and just asking winnls.h directly.
  • Look closely at CheckBisectStringA because it only seems to have one consumer that's really just checking if the final cell of a row IsDBCSLeadByteEx...
  • The ConvertInputToUnicode and ConvertOutputToUnicode are pretty darn close to ConvertToA and ConvertToW anyway. The only variations I can see in the pattern of using MB2WC and WC2MB are: no flags at all, putting the default character in when choking, or using glyph chars instead of ctrl chars. So why can't we just have ConvertToA and ConvertToW have those modes and run all the translations through those and use the safer and more sensible string-in-string-out pattern to translate everything.
  • If we're going all in on ConvertToA and ConvertToW... perhaps it's time for til::u8u16 and til::u16u8 to get their til::au16 and til::u16a variants brought in, have the flags added, and just have a unified way of converting things around here.
  • Can we also torch CharToWchar since its just translation of a short string (single character) but with the glyph for ctrl char substitution into a til::au16 with the glyph chars flag?

The expected effects are primarily:

  • Returning classic codepage or UTF-8 information out of ReadConsoleA, ReadFile and such should work correctly
  • We should eliminate a giant pile of static analysis suppressions
  • The cooked line should be able to just hold/display/convert emoji correctly and erase them correctly

More will show up in each of these headings as I discover it or we have feedback

Originally created by @DHowett on GitHub (Sep 29, 2020). Originally assigned to: @lhecker on GitHub. [Claimed by miniksa] This issue exists to resolve many issues around cooked read data processing including choking on the return of bytes in various A function calls from Read methods (with both DBCS and UTF-8 codepages), the improper display of Emoji or other U+10000 characters on the edit line (commonly seen in `cmd.exe`), and the general buffer grossness of mixmatching both `char` and `wchar_t` text within the same client-owned buffer space during waiting read operations. ## Related issues ### These issues are related and should be improved or resolved by this work: - #1503 - #4551 - #7589 - #3786 - #190 - #5618 - #1058 - #4493 ### These issues should be easier to resolve after this work or be fixed by it: - #4975 - #6555 - #780 - #1802 ### Uncategorized: - #4628 ## This is what I expect to do: - Start with `_handlePostCharInputLoop` and revise it so it doesn't attempt to precount the number of "bytes" by counting the "width" of characters. Instead move to just translating the text and storing the excess, much like an aliased command would do. - Realize `TranslateUnicodeToOem` has almost no usages anymore (and `ConvertToOem`) and could likely be converged with `ConvertToA`. Also that the fallback behavior of `TranslateUnicodeToOem` can be accomplished just by asking WC2MB to replace with the default character anyway. - Realize that if we don't need to know the CP's default character, then why are we loading it on startup? Oh, it's still because we need the lead byte table to know when something `IsDBCSLeadByteConsole` or not. - Realize that now that we don't use custom codepages anymore... there doesn't really seem to be a difference between `IsDBCSLeadByteConsole` and just calling the `winnls.h` exported `IsDBCSLeadByteEx` with the same codepage (and then not holding onto the `CPInfo` stuff at all.) - Doing the same thing for output codepage as input codepage in respect to dumping the load of that information and just asking `winnls.h` directly. - Look closely at `CheckBisectStringA` because it only seems to have one consumer that's really just checking if the final cell of a row `IsDBCSLeadByteEx`... - The `ConvertInputToUnicode` and `ConvertOutputToUnicode` are pretty darn close to `ConvertToA` and `ConvertToW` anyway. The only variations I can see in the pattern of using MB2WC and WC2MB are: no flags at all, putting the default character in when choking, or using glyph chars instead of ctrl chars. So why can't we just have `ConvertToA` and `ConvertToW` have those modes and run all the translations through those and use the safer and more sensible string-in-string-out pattern to translate everything. - If we're going all in on `ConvertToA` and `ConvertToW`... perhaps it's time for `til::u8u16` and `til::u16u8` to get their `til::au16` and `til::u16a` variants brought in, have the flags added, and just have a unified way of converting things around here. - Can we also torch `CharToWchar` since its just translation of a short string (single character) but with the glyph for ctrl char substitution into a `til::au16` with the glyph chars flag? ## The expected effects are primarily: - Returning classic codepage or UTF-8 information out of `ReadConsoleA`, `ReadFile` and such should work correctly - We should eliminate a giant pile of static analysis suppressions - The cooked line should be able to just hold/display/convert emoji correctly and erase them correctly **More will show up in each of these headings as I discover it or we have feedback**
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/terminal#10825