UTF-8 decoding problem when a codepoint straddles an i/o boundary #22217

Open
opened 2026-01-31 08:06:49 +00:00 by claunia · 6 comments

Originally created by @aidtopia on GitHub (Sep 5, 2024).

Originally assigned to: @lhecker on GitHub.

Windows Terminal version

1.20.11781.0

Windows build number

10.0.19045.4780

Other Software

No response

Steps to reproduce

  1. Create a text file with 180 instances of the Unicode character U+20B0 (German Penny Sign) on one line and save it as UTF-8 (with or without BOM). Call it foo.txt. (I've attached a sample.)
  2. Open a Command Prompt profile in Terminal.
  3. chcp 65001
  4. type foo.txt

Note that near the end of the output there are a couple of Unicode replacement characters (U+FFFD).
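For anyone without the attachment, a file matching the description in step 1 can be generated with a short script. This is a minimal sketch; the helper names `make_sample` and `write_sample` are made up for illustration, and the result is not necessarily byte-identical to the attached sample (which may, for example, end with a newline).

```python
import sys

PENNY_SIGN = "\u20b0"  # GERMAN PENNY SIGN; encodes to 3 bytes in UTF-8


def make_sample(count=180, bom=False):
    """Return `count` penny signs on one line, encoded as UTF-8."""
    data = (PENNY_SIGN * count).encode("utf-8")
    return (b"\xef\xbb\xbf" + data) if bom else data


def write_sample(path, count=180, bom=False):
    """Write the sample bytes to `path` in binary mode (no newline translation)."""
    with open(path, "wb") as f:
        f.write(make_sample(count, bom))


# Only writes a file when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    write_sample(sys.argv[1])
```

Without a BOM the data is 540 bytes long, so a 512-byte block boundary falls inside the 171st character.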

What's happening is that type sends the text to the terminal in 512-byte blocks. The UTF-8 encoding of U+20B0 takes 3 bytes. Since 512 isn't a multiple of 3, the 171st German Penny Sign is split across the boundary of the first and second write operations issued by type. The UTF-8 decoder appears to reset its state with each write.
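The failure mode described here can be modelled outside the terminal. The sketch below is only an illustration of decoding each block independently versus carrying decoder state across blocks; it is not the code that cmd.exe or Windows Terminal actually runs, and the function names are hypothetical.

```python
import codecs


def split_writes(data, block_size=512):
    """Cut the byte stream into fixed-size blocks, as a stand-in for separate writes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def decode_per_write(writes):
    """Decode every block on its own, so state is lost at each boundary."""
    return "".join(w.decode("utf-8", errors="replace") for w in writes)


def decode_streaming(writes):
    """Decode with one incremental decoder that buffers a partial sequence."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(w) for w in writes]
    parts.append(decoder.decode(b"", final=True))  # flush anything left over
    return "".join(parts)


data = ("\u20b0" * 180).encode("utf-8")
writes = split_writes(data)

print([len(w) for w in writes])                       # block sizes
print("\ufffd" in decode_per_write(writes))           # replacement chars appear
print(decode_streaming(writes) == data.decode("utf-8"))  # round-trips cleanly
```

The per-write decoder sees two bytes of a three-byte sequence at the end of the first block and a lone continuation byte at the start of the second, which is where the replacement characters come from.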

But it's not just UTF-8 decoding. If one write ends with a complete character and the next write begins with a combining character, then either (1) they won't be composed or (2) they will be composed but there will be an empty cell immediately after them.

These problems occur less frequently with applications that issue larger writes, but they do still happen. They can even happen with applications that normally flush the output on line boundaries if a single line grows so long that an intermediate flush occurs.
Attachment: foo.txt (https://github.com/user-attachments/files/16882048/foo.txt)

Expected Behavior

I expected UTF-8 decoding and composition of combining characters to resync if a sequence of bytes that represents a single codepoint or grapheme cluster happens to fall on the boundary between two consecutive writes.

Actual Behavior

Note the replacement characters in the output.

image: https://github.com/user-attachments/assets/9ba7193a-2fda-463a-9850-551480295b44

claunia added the Issue-Bug, Product-Cmd.exe, Tracking-External labels 2026-01-31 08:06:49 +00:00

@lhecker commented on GitHub (Sep 5, 2024):

To my horror I found that cmd.exe uses the current codepage to translate the output of type into wchar_t first and then calls WriteConsoleW. Since cmd.exe hasn't had major changes in (a) decade(s?) it's not compatible with UTF-16 and it only supports the preceding UCS-2 standard.

To fix this issue we need to do two things:

  • Add support for joining DBCS just like we join broken up UTF-8 sequences.
  • Remove the FileIsConsole(STDOUT) calls from TyWork in cinfo.cpp (internal cmd source code). This will cause it to use WriteFile.

It may be possible to do only the latter, given that we don't support joining DBCS in any API except for stdin.
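As background to the UCS-2 remark: the difference between UCS-2 and UTF-16 only shows up for code points above U+FFFF, which UTF-16 encodes as a surrogate pair of two 16-bit units. A small sketch (the helper name `utf16_units` is made up for illustration) makes this visible. The penny sign from the repro fits in a single unit either way.

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# U+20B0 is in the Basic Multilingual Plane: one 16-bit unit.
print([hex(u) for u in utf16_units("\u20b0")])
# U+1F600 is outside the BMP: a high and a low surrogate.
print([hex(u) for u in utf16_units("\U0001F600")])
```

Code that assumes one `wchar_t` per character handles the first case but not the second.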


@german-one commented on GitHub (Sep 6, 2024):

This is #386 by the way.
It once drove me crazy. At the same time, it also made me get involved with C++, and the proto-u8u16 was an attempt to fix this (without success of course 😄).

@lhecker you eventually confirmed that this is a CMD bug. 👍

@aidtopia if you need to work around this in a CMD shell then use another command line tool to write the file content. Windows ships with findstr.exe which is suitable.

findstr "^" "foo.txt"

@aidtopia commented on GitHub (Sep 8, 2024):

I'm glad to hear the UTF-8 problem is understood.

The problem with the combining characters is more subtle than I realized. It's probably a separate issue. If I find a more illustrative repro, I'll file another bug report for just that.

@german-one: That's a clever use of findstr, but I'll just whip up my own clone of type.
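A clone of type along these lines could be as small as a raw byte copy. The sketch below is one possible shape, not the reporter's actual tool: it copies the file to stdout without decoding it per block. Whether this avoids the problem on a real console depends on how the language runtime writes to the console, which is not verified here; `copy_bytes` is a hypothetical helper name.

```python
import sys


def copy_bytes(src, dst, block_size=65536):
    """Copy a binary stream to another in blocks, without decoding; return byte count."""
    total = 0
    while True:
        block = src.read(block_size)
        if not block:
            break
        dst.write(block)
        total += len(block)
    dst.flush()
    return total


# Usage: python mytype.py foo.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        copy_bytes(f, sys.stdout.buffer)
```

Because the bytes are never interpreted, a block boundary inside a multi-byte sequence does not matter to the copy itself; any decoding is left to whatever receives the stream.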


@lhecker commented on GitHub (Sep 8, 2024):

> The problem with the combining characters is more subtle than I realized.

Just to be sure, since it wasn't yet mentioned here: "Windows Terminal Preview" 1.22 is the first version that supports combining characters. You can find it in the Microsoft Store app and in our releases page (https://github.com/microsoft/terminal/releases).


@aidtopia commented on GitHub (Sep 8, 2024):

> "Windows Terminal Preview" 1.22 is the first version that supports combining characters.

Ah, that explains why I haven't been able to reproduce exactly what I saw before. I must've first spotted the combining bug while using a (probably quite old) version of Preview, but just very recently switched to the mainstream release because ... reasons.

When I get a chance, I'll try the current Preview release.


@lhecker commented on GitHub (Sep 25, 2024):

Note to self, TODO: File an internal issue and then fix it with containment.


Reference: starred/terminal#22217