[PR #3380] Compensate for non-minimal UTF-8 encodings #25347

Closed
opened 2026-01-31 09:08:55 +00:00 by claunia · 0 comments
Owner

Original Pull Request: https://github.com/microsoft/terminal/pull/3380

State: closed
Merged: Yes


Summary of the Pull Request

  • Permits substitution of Unicode Replacement for non-minimal codepoint encodings in UTF-8.

PR Checklist

Detailed Description of the Pull Request / Additional comments

  • We do a lot of work to figure out whether or not we have some invalid UTF-8 inside our own internal parser. We're correctly identifying in the first full-conversion pass that something is amiss with the stream of text we were given. And inside the second pass for involved parsing, we are identifying and removing obviously wrong sequences like those that have a lead byte without the correct number of continuations, continuations that come from nowhere, and so on. But there's one big gap:

  • We're not correctly identifying non-minimal forms of characters. Specifically, what is causing the crash, is non-minimal representations of the null character U+0000. The minimal form of this character in UTF-8 is 0x00. But technically, it could also be written as any of the following:

  1. 0xC0 0x80 - Lead byte of 2 byte sequence and a single continuation.
  2. 0xE0 0x80 0x80 - Lead byte of 3 byte sequences and two continuations.
  3. 0xF0 0x80 0x80 0x80 - Lead byte of 4 byte sequences and three continuations.
    All of which didn't fill any of the payload bits.
  • The OS does identify these as invalid non-minimal forms when the data is sent to MultiByteToWideChar after we believe we've removed all invalid data and then it errors out because we set the MB_ERR_INVALID_CHARS flag.

  • If we remove the flag, the error goes away and the OS will substitute one or more U+FFFDs for these sequences and continue past them.

  • This is inconsistent with the rest of our invalid behavior (where we just eat the invalid bytes and walk on instead of substituting them) but the OS doesn't offer that provision as an option.

  • We also can't straight up just call the OS in all cases because we want to be available for the case where a caller sends us part of a valid sequence at the end of the buffer and then continues with the next valid pieces in the next call. That is, think of the putc case where someone drops 0xe3, 0x81, and 0x99 on three calls in a row. We want to form those together into the correct U+3059 once the third one comes in. Just using MultiByteToWideChar straight up will convert each of them into their own U+FFFD on the three calls. Not OK. So we must have some knowledge of UTF-8 to allow this valid scenario to happen.

  • The solution here is to let the somewhat inconsistent behavior of "replacements for non-minimal sequences but suppress clearly invalid things" to happen. We're doing this because the fix is needed in the Windows product for 20H1, which is today subject to ever-tightening requirements to prepare to ship. The smallest and least risky fix possible is preferable right now.

  • #3378 is filed as a follow on to investigate reconciling the somewhat inconsistent behavior as well as other things noted during this investigation in the future to be consumed into Terminal and whatever Windows release comes after 20H1.

Validation Steps Performed

Ran the repro steps in #3320 on 20h1 previews with WSL2. No longer crashes after (because it no longer returns the error).

Added automated test in utf8ToWideCharParserTests.cpp to ensure that non-minimal forms don't cause .Parse to throw an error and turn into some number of replacements instead

**Original Pull Request:** https://github.com/microsoft/terminal/pull/3380 **State:** closed **Merged:** Yes --- <!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? --> ## Summary of the Pull Request - Permits substitution of Unicode Replacement for non-minimal codepoint encodings in UTF-8. <!-- Please review the items on the PR checklist before submitting--> ## PR Checklist * [x] Closes #3320 * [x] I work here * [x] Tests added/passed * [x] Comments added, follow on issue filed #3378 * [x] I'm a core contributor <!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here --> ## Detailed Description of the Pull Request / Additional comments - We do a lot of work to figure out whether or not we have some invalid UTF-8 inside our own internal parser. We're correctly identifying in the first full-conversion pass that something is amiss with the stream of text we were given. And inside the second pass for involved parsing, we are identifying and removing obviously wrong sequences like those that have a lead byte without the correct number of continuations, continuations that come from nowhere, and so on. But there's one big gap: - We're not correctly identifying non-minimal forms of characters. Specifically, what is causing the crash, is non-minimal representations of the null character U+0000. The minimal form of this character in UTF-8 is 0x00. But technically, it could also be written as any of the following: 1. 0xC0 0x80 - Lead byte of 2 byte sequence and a single continuation. 2. 0xE0 0x80 0x80 - Lead byte of 3 byte sequences and two continuations. 3. 0xF0 0x80 0x80 0x80 - Lead byte of 4 byte sequences and three continuations. All of which didn't fill any of the payload bits. - The OS does identify these as invalid non-minimal forms when the data is sent to `MultiByteToWideChar` after we believe we've removed all invalid data and then it errors out because we set the `MB_ERR_INVALID_CHARS` flag. - If we remove the flag, the error goes away and the OS will substitute one or more U+FFFDs for these sequences and continue past them. - This is inconsistent with the rest of our invalid behavior (where we just eat the invalid bytes and walk on instead of substituting them) but the OS doesn't offer that provision as an option. - We also can't straight up just call the OS in all cases because we want to be available for the case where a caller sends us part of a valid sequence at the end of the buffer and then continues with the next valid pieces in the next call. That is, think of the `putc` case where someone drops 0xe3, 0x81, and 0x99 on three calls in a row. We want to form those together into the correct U+3059 once the third one comes in. Just using `MultiByteToWideChar` straight up will convert each of them into their own U+FFFD on the three calls. Not OK. So we must have *some* knowledge of UTF-8 to allow this valid scenario to happen. - The solution here is to let the somewhat inconsistent behavior of "replacements for non-minimal sequences but suppress clearly invalid things" to happen. We're doing this because the fix is needed in the Windows product for 20H1, which is today subject to ever-tightening requirements to prepare to ship. The smallest and least risky fix possible is preferable right now. - #3378 is filed as a follow on to investigate reconciling the somewhat inconsistent behavior as well as other things noted during this investigation in the future to be consumed into Terminal and whatever Windows release comes after 20H1. <!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well --> ## Validation Steps Performed Ran the repro steps in #3320 on 20h1 previews with WSL2. No longer crashes after (because it no longer returns the error). Added automated test in utf8ToWideCharParserTests.cpp to ensure that non-minimal forms don't cause `.Parse` to throw an error and turn into some number of replacements instead
claunia added the pull-request label 2026-01-31 09:08:55 +00:00
Sign in to join this conversation.
No Label pull-request
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/terminal#25347