WriteConsoleOutputCharacterA doesn't merge UTF-8 partials in successive calls #2602

Open
opened 2026-01-30 22:59:26 +00:00 by claunia · 7 comments
Owner

Originally created by @german-one on GitHub (Jul 6, 2019).

Environment

Windows build number: [run "ver" at a command prompt]
Microsoft Windows [Version 10.0.18362.207]

Steps to reproduce

If a UTF-8 stream gets buffered in a loop, characters that consume more than one byte may get split at the buffer boundaries. Passing the buffer to WriteConsoleOutputCharacterA will corrupt the text because a conversion to UTF-16 is in place where these partials are treated as invalid UTF-8 characters and replaced with U+FFFD characters.

Expected behavior

WriteConsoleOutputCharacterA should cache the partials and prepend them to the characters passed at the next call of this function, similar to the behavior of WriteConsoleA.
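To make the expected behavior concrete, here is a minimal, portable sketch of what "cache the partials and prepend them at the next call" could look like. The names IncompleteTailLength and Utf8ChunkMerger are made up for this illustration and are not conhost code; the sketch only looks at lead and continuation bit patterns and does not otherwise validate the UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not conhost code): number of bytes at the end of
// `chunk` that start a UTF-8 sequence which is not yet complete.
inline std::size_t IncompleteTailLength(std::string_view chunk)
{
    const std::size_t size = chunk.size();
    // A UTF-8 sequence has at most 4 bytes, so an incomplete one has at most 3.
    for (std::size_t back = 1; back <= 3 && back <= size; ++back)
    {
        const unsigned char byte = static_cast<unsigned char>(chunk[size - back]);
        if ((byte & 0xC0) == 0x80)
        {
            continue; // continuation byte, keep looking for its lead byte
        }
        std::size_t expected = 1;
        if ((byte & 0xE0) == 0xC0) expected = 2;
        else if ((byte & 0xF0) == 0xE0) expected = 3;
        else if ((byte & 0xF8) == 0xF0) expected = 4;
        // Incomplete only if the lead byte announces more bytes than we have.
        return expected > back ? back : 0;
    }
    return 0;
}

// Hypothetical stateful wrapper: holds back a trailing partial and prepends
// it to the next chunk, so only complete sequences reach the converter.
class Utf8ChunkMerger
{
public:
    std::string Merge(std::string_view chunk)
    {
        std::string merged = _partials;
        merged.append(chunk);
        const std::size_t tail = IncompleteTailLength(merged);
        _partials.assign(merged, merged.size() - tail, tail);
        merged.resize(merged.size() - tail);
        return merged; // safe to convert to UTF-16 now
    }

private:
    std::string _partials; // bytes cached between calls
};
```

With this, passing "\xE2" and then "\x98\xBA\n" yields an empty string from the first call and the complete three-byte character plus the newline from the second.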

Actual behavior

UTF-8 partials result in corrupted text.
A discussion about this already began in #386 but was rather out of scope in that issue.

claunia added the Product-Conhost, Area-Output, Issue-Bug, Area-Server labels 2026-01-30 22:59:26 +00:00
Author
Owner

@german-one commented on GitHub (Jul 6, 2019):

@miniksa
This is continuing #386.

I was able to include the Utf8ToWideCharParser class into ApiRoutines::WriteConsoleOutputCharacterAImpl. This resolves the problem of merging UTF-8 partials but raises another issue. I'm struggling with the used parameter. I won't start a PR unless we have the same understanding of what number has to be assigned to it. My current understanding is that it is the number of bytes consumed from chars. But does it include the number of bytes of the partials (either from the previous or from the current cache)? And what's the benefit of getting this information as the caller of this function? In my tests I tried to calculate the coordinates of the end of the previously written chunk in order to continue writing at this position. This doesn't work at all if I try to use the bytes written. I would rather need the number of character cells written. Would it be reasonable to use the number of characters created that I get from Utf8ToWideCharParser::Parse (likewise the characters written by WriteConsoleOutputCharacterWImpl)?
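To illustrate why the number of bytes consumed is not usable for computing the next write position, here is a small sketch that reports both the byte count and the number of UTF-16 code units a piece of text would convert to. CountUtf8 and Utf8Counts are hypothetical names for this example, and the function assumes well-formed, complete UTF-8. Note that even the UTF-16 count is only an approximation of cells, since full-width glyphs may occupy more than one cell.

```cpp
#include <cstddef>
#include <string_view>

struct Utf8Counts
{
    std::size_t bytes;      // bytes consumed from the input
    std::size_t utf16Units; // UTF-16 code units the input would convert to
};

// Hypothetical helper; assumes `text` is well-formed, complete UTF-8.
inline Utf8Counts CountUtf8(std::string_view text)
{
    Utf8Counts counts{ text.size(), 0 };
    for (const char ch : text)
    {
        const unsigned char byte = static_cast<unsigned char>(ch);
        if ((byte & 0xC0) == 0x80)
        {
            continue; // continuation bytes do not start a new character
        }
        // A 4-byte sequence becomes a surrogate pair (2 units), all others 1.
        counts.utf16Units += ((byte & 0xF8) == 0xF0) ? 2 : 1;
    }
    return counts;
}
```

For the smiling face "\xE2\x98\xBA" this gives 3 bytes but only 1 UTF-16 code unit, which is the mismatch described above.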

Author
Owner

@j4james commented on GitHub (Feb 20, 2020):

The discussion in #386 seemed to suggest that this problem had been around since Windows 7, but what I'm seeing definitely looks like a recent regression, so I'm not sure if it's the same issue. In my current Windows version (10.0.18362.535), the following sequences both output a smiling face when executed in a bash shell.

printf "\xE2\x98\xBA\n"
printf "\xE2"; printf "\x98\xBA\n"

However, when I start a conhost shell built from the latest source (commit 39d3c65420), the second sequence fails to decode the UTF-8 correctly, and three error glyphs are output instead.

Screenshots of the old and new consoles...

image

Author
Owner

@german-one commented on GitHub (Feb 20, 2020):

Not sure what functions are invoked under the hood if you call printf in WSL. Although I'm pretty sure that the behavior shouldn't have changed if it was WriteConsoleOutputCharacterAImpl. The ConvertToW function is called there which converts the passed string to UTF-16 without caching UTF-8 partials. That's the reason why I once filed this issue.

However, issue #4086 has been fixed with PR #4422 which indicates that WriteConsoleAImpl is involved. Not sure if I did something wrong when I updated this function.

Author
Owner

@j4james commented on GitHub (Feb 20, 2020):

I'm afraid it does look like PR #4422 is to blame, at least for my particular test case. It works in commit 0d92f71e45, but fails after #4422 is applied in commit 06b3931418.

It seems that the old Utf8ToWideCharParser::Parse method does actually detect \xE2 as a partial (and then successfully combines it with the following two bytes when they later arrive), but the new u8u16state code doesn't. The utf8 conversion method just returns the string as if it were a complete set of utf8 code points, leaving _partialsLen as zero. It's then passed on to the MultiByteToWideChar API where it gets converted into an error glyph.
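The three error glyphs can be reproduced with a simplified model of what happens when each chunk is decoded on its own, with no state carried over between calls. CountDecodeErrors is a hypothetical function for this illustration; it is not the algorithm MultiByteToWideChar uses, it just counts one replacement character per stray continuation byte and one per truncated sequence.

```cpp
#include <cstddef>
#include <string_view>

// Simplified model (an assumption, not the real converter): how many
// replacement characters result if `chunk` is decoded in isolation.
inline std::size_t CountDecodeErrors(std::string_view chunk)
{
    std::size_t errors = 0;
    std::size_t i = 0;
    while (i < chunk.size())
    {
        const unsigned char lead = static_cast<unsigned char>(chunk[i]);
        std::size_t length = 0; // 0 means "not a valid lead byte"
        if (lead < 0x80) length = 1;
        else if ((lead & 0xE0) == 0xC0) length = 2;
        else if ((lead & 0xF0) == 0xE0) length = 3;
        else if ((lead & 0xF8) == 0xF0) length = 4;

        if (length == 0)
        {
            ++errors; // stray continuation byte or invalid lead byte
            ++i;
            continue;
        }

        // Count the lead byte plus the continuation bytes actually present.
        std::size_t have = 1;
        while (have < length && i + have < chunk.size() &&
               (static_cast<unsigned char>(chunk[i + have]) & 0xC0) == 0x80)
        {
            ++have;
        }
        if (have < length)
        {
            ++errors; // sequence was cut off
        }
        i += have;
    }
    return errors;
}
```

In this model "\xE2" alone counts as one error and "\x98\xBA\n" as two, three in total, while the unsplit "\xE2\x98\xBA\n" decodes without errors.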

Author
Owner

@german-one commented on GitHub (Feb 21, 2020):

I updated the unit test with these sequences and it failed. So it's definitely reproducible. I'll file an issue referring to your comment here. And I'll fix it as soon as I can. Thanks for letting me know!

Author
Owner

@german-one commented on GitHub (Jan 15, 2022):

@zadjii-msft
I'm hesitant to close this myself because it is an accepted issue and has been added to a milestone. However, I'm no longer sure about it.

WriteConsoleOutputCharacterA is no longer a part of the ecosystem roadmap, as remarked in the docs.
WriteConsoleOutputCharacterA does not, and never did, respect the character boundaries of DBCSs.

The docs also read

"Copies a number of characters to consecutive cells of a console screen buffer, ..."

  • What is the meaning of "characters" here? (Is it actually "the representation of code points"? Or is it rather "values of a C/C++ character type"?)
  • What separates WriteConsoleOutputCharacterA from WriteConsoleA? (There might be a reason why they behave differently.)
  • How inconsistent is it to fix UTF-8 handling without fixing DBCS handling?
  • Do we break code if we try to fix it?

Consider using the answers to these questions as reasons to close this one out.

Steffen

Author
Owner

@miniksa commented on GitHub (Mar 11, 2022):

  • What is the meaning of "characters" here? (Is it actually "the representation of code points"? Or is it rather "values of a C/C++ character type"?)

In the context of WriteConsoleOutputCharacterA, it means one char into one cell of the buffer. Does it make wonderful sense? No. But it is the way that it is when we inherited it.

  • What separates WriteConsoleOutputCharacterA from WriteConsoleA? (There might be a reason why they behave differently.)

WriteConsoleOutputCharacterA is just going to modify the character without modifying the color attributes in the cell. It will insert pretty much exactly what you give it into exactly the cell you tell it to.
WriteConsoleA will run normal character handling against it as if it were otherwise streamed in including often processing backspaces, newlines, etc. and applying the active/default color attributes to whatever you're inserting.
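The difference can be pictured with a toy cell grid. This is only a sketch of the two behaviors as described here, not how conhost stores its buffer; ToyBuffer, PokeChars and StreamWrite are invented names, and the stream path handles nothing but newlines.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

struct Cell
{
    char ch = ' ';
    unsigned short attr = 0x07; // assumed default color attribute
};

// Toy model (hypothetical, not conhost's buffer).
class ToyBuffer
{
public:
    ToyBuffer(std::size_t width, std::size_t height)
        : _width(width), _cells(width * height) {}

    // WriteConsoleOutputCharacterA-like: one char into one cell starting at
    // (x, y), attributes untouched, no control-character processing.
    std::size_t PokeChars(std::string_view chars, std::size_t x, std::size_t y)
    {
        std::size_t index = y * _width + x;
        std::size_t written = 0;
        for (; written < chars.size() && index < _cells.size(); ++written, ++index)
        {
            _cells[index].ch = chars[written];
        }
        return written;
    }

    // WriteConsoleA-like: writes at the cursor, applies the active attribute
    // and interprets '\n' as "move to the start of the next row".
    void StreamWrite(std::string_view chars, unsigned short activeAttr)
    {
        for (const char ch : chars)
        {
            if (ch == '\n')
            {
                _cursor = (_cursor / _width + 1) * _width;
                continue;
            }
            if (_cursor >= _cells.size())
            {
                break; // no scrolling in this toy model
            }
            _cells[_cursor].ch = ch;
            _cells[_cursor].attr = activeAttr;
            ++_cursor;
        }
    }

    const Cell& At(std::size_t x, std::size_t y) const { return _cells[y * _width + x]; }

private:
    std::size_t _width;
    std::vector<Cell> _cells;
    std::size_t _cursor = 0;
};
```

In this model a newline passed to PokeChars simply ends up stored in a cell, while StreamWrite moves the cursor to the next row and never stores it.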

  • How inconsistent is it to fix UTF-8 handling without fixing DBCS handling?

It's probably fine honestly to fix the UTF-8 handling as a separate path. We did that in a few places... there's 3 states... W, A, and A when 65001 (UTF-8) is set. That way we can maintain the nice and broken A state for compatibility reasons and fix up the UTF-8 that we actually care about independently.
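As a rough sketch of the "3 states" idea, the choice of path could be expressed as a small selector. TextPath and SelectPath are hypothetical names for illustration only; the one value taken from the discussion is code page 65001 for UTF-8.

```cpp
// Hypothetical names; only the code page value 65001 comes from the thread.
enum class TextPath
{
    Wide,       // W functions, already UTF-16
    AnsiLegacy, // A functions with a non-UTF-8 code page, kept as-is for compatibility
    AnsiUtf8    // A functions while code page 65001 is set, fixed independently
};

constexpr unsigned int kCodePageUtf8 = 65001;

constexpr TextPath SelectPath(bool isWideApi, unsigned int outputCodePage)
{
    return isWideApi
               ? TextPath::Wide
               : (outputCodePage == kCodePageUtf8 ? TextPath::AnsiUtf8
                                                  : TextPath::AnsiLegacy);
}
```

Keeping the UTF-8 case as its own branch is what would allow partial caching to be added there without touching the legacy A behavior.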

  • Do we break code if we try to fix it?

Oh probably.

Consider using the answers to these questions as reasons to close this one out.

I'll let it hang around for now, but don't feel obligated to further bump it. If it's on 22H2 we're going to keep looking at it ourselves.


Reference: starred/terminal#2602