Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #4720

Closed
opened 2026-01-30 23:54:53 +00:00 by claunia · 7 comments
Owner

Originally created by @miniksa on GitHub (Oct 30, 2019).

Outcomes from #3320... (this is all related to utf8ToWideCharParser.cpp)

  1. When we encounter obviously invalid UTF-8 (wrong number of continuation bytes for a lead, lead without continuations, etc.), we straight up discard them from the stream in _InvolvedParse.
  • We need to consider replacing them with U+FFFD or having the option to do so.
  1. When we get not-obviously invalid UTF-8 (non-minimal forms like 0xC0 0x80 for null), we didn't detect and remove those.
  • We now (as of #3320) replace them with U+FFFD, though that's inconsistent with eating them above. This was chosen as a quick fix to be easy to port into the Windows 20H1 cycle during our bug fixing phase.
  • We should reconcile this difference with part 1 above (being able to toggle both of them on or off)
  1. We seem to need to walk though this entire sequence multiple to many times inside our code and then again inside the kernel as the MultiByteToWideChar call eventually thunks to RtlUTF8ToUnicodeN.
  • The reason we have to walk through it ourselves is to attempt to preserve partial sequences that just happen to fall across a call boundary. If the client application emits one UTF-8 character at a time, we still want to aggregate them and turn it into the correct result if they properly gave us "3-sequence-lead, continuation, continuation" in 3 separate calls. MultiByteToWideChar will not do this. It will either error (MB_ERR_INVALID_CHARS) or replace (U+FFFD).
  • The call to RtlUTF8ToUnicodeN is a kernel syscall on a hot path in our codebase that is quite probably slowing us down.

Given that we're already doing almost everything required to understand the UTF-8 sequence such that we can store partials across calls, perhaps we just add the final two steps of:

  1. Detecting non-minimal forms of UTF-8 as invalid
  2. Just doing the conversion to UTF-8 ourselves in user mode as it's just an algorithmic translation and we're already most of the way there bit twiddling to identify sequence length and continuations

This issue represents investigating:

  1. Is there a serious performance detriment to the kernel syscall for UTF-8 conversion that will start biting us especially hard as more and more things need UTF-8 <--> UTF-16?
  2. How hard is it for us to detect the non-minimal forms?
  3. Can we just finish the algorithm on our side in user mode relatively simply?
  4. Can we decide on a replacement strategy and make it consistent between clearly invalid sequences and non-minimal forms?
Originally created by @miniksa on GitHub (Oct 30, 2019). Outcomes from #3320... (this is all related to `utf8ToWideCharParser.cpp`) 1. When we encounter obviously invalid UTF-8 (wrong number of continuation bytes for a lead, lead without continuations, etc.), we straight up discard them from the stream in `_InvolvedParse`. - We need to consider replacing them with U+FFFD or having the option to do so. 2. When we get not-obviously invalid UTF-8 (non-minimal forms like 0xC0 0x80 for null), we didn't detect and remove those. - We now (as of #3320) replace them with U+FFFD, though that's inconsistent with eating them above. This was chosen as a quick fix to be easy to port into the Windows 20H1 cycle during our bug fixing phase. - We should reconcile this difference with part 1 above (being able to toggle both of them on or off) 3. We seem to need to walk though this entire sequence multiple to many times inside our code and then again inside the kernel as the `MultiByteToWideChar` call eventually thunks to `RtlUTF8ToUnicodeN`. - The reason we have to walk through it ourselves is to attempt to preserve partial sequences that just happen to fall across a call boundary. If the client application emits one UTF-8 character at a time, we still want to aggregate them and turn it into the correct result if they properly gave us "3-sequence-lead, continuation, continuation" in 3 separate calls. `MultiByteToWideChar` will not do this. It will either error (MB_ERR_INVALID_CHARS) or replace (U+FFFD). - The call to `RtlUTF8ToUnicodeN` is a kernel syscall on a hot path in our codebase that is quite probably slowing us down. Given that we're already doing almost everything required to understand the UTF-8 sequence such that we can store partials across calls, perhaps we just add the final two steps of: 1. Detecting non-minimal forms of UTF-8 as invalid 2. Just doing the conversion to UTF-8 ourselves in user mode as it's just an algorithmic translation and we're already most of the way there bit twiddling to identify sequence length and continuations This issue represents investigating: 1. Is there a serious performance detriment to the kernel syscall for UTF-8 conversion that will start biting us especially hard as more and more things need UTF-8 <--> UTF-16? 2. How hard is it for us to detect the non-minimal forms? 3. Can we just finish the algorithm on our side in user mode relatively simply? 4. Can we decide on a replacement strategy and make it consistent between clearly invalid sequences and non-minimal forms?
Author
Owner

@DHowett-MSFT commented on GitHub (Oct 30, 2019):

We also have half of a UTF-8 parser in UTF8OutputPipeReader, where it's used for detecting valid but partial encoded characters and shuffling them to the beginning of a string for the next parse cycle. It's unnecessarily coupled with doing a ReadFile. It seems like a solution to utf8ToWideCharParser would also be able to unify our disparate UTF-8 parsers.

@DHowett-MSFT commented on GitHub (Oct 30, 2019): We also have half of a UTF-8 parser in UTF8OutputPipeReader, where it's used for detecting valid but partial encoded characters and shuffling them to the beginning of a string for the next parse cycle. It's unnecessarily coupled with doing a `ReadFile`. It seems like a solution to `utf8ToWideCharParser` would also be able to unify our disparate UTF-8 parsers.
Author
Owner

@german-one commented on GitHub (Dec 28, 2019):

@miniksa @DHowett-MSFT

I spent a little time over Xmas to transpose my own UTF-8 <--> UTF-16 functions that I wrote in C into C++. I'm afraid you will still discern the C origin 😊 I'll provide them here first in order to get feedback

  • if this is close to what you're looking for
  • if this has the slightest chance to pass a code review with all the magic numbers and static_casts

If so, they could probably land in types/convert.cpp and be used instead of MultiByteToWideChar and WideCharToMultiByte.

Also it should be relatively easy to write a functor class based on UTF8OutputPipeReader which just detects partials at the string end, caches them, and completes them at the next call. The conversion functions and the cache class might be either used directly or rather for the composition of more specialized classes like Utf8ToWideCharParser ...

Edit: Nevermind it's in PR #4093 wich should make it easier to comment 🙂
BTW Happy New Year!

@german-one commented on GitHub (Dec 28, 2019): @miniksa @DHowett-MSFT I spent a little time over Xmas to transpose my own UTF-8 <--> UTF-16 functions that I wrote in C into C++. I'm afraid you will still discern the C origin 😊 I'll provide them here first in order to get feedback * if this is close to what you're looking for * if this has the slightest chance to pass a code review with all the magic numbers and static_casts If so, they could probably land in `types/convert.cpp` and be used instead of `MultiByteToWideChar` and `WideCharToMultiByte`. Also it should be relatively easy to write a functor class based on `UTF8OutputPipeReader` which just detects partials at the string end, caches them, and completes them at the next call. The conversion functions and the cache class might be either used directly or rather for the composition of more specialized classes like `Utf8ToWideCharParser` ... Edit: Nevermind it's in PR #4093 wich should make it easier to comment 🙂 BTW Happy New Year!
Author
Owner

@miniksa commented on GitHub (Jan 2, 2020):

@miniksa @DHowett-MSFT

I spent a little time over Xmas to transpose my own UTF-8 <--> UTF-16 functions that I wrote in C into C++. I'm afraid you will still discern the C origin 😊 I'll provide them here first in order to get feedback

  • if this is close to what you're looking for
  • if this has the slightest chance to pass a code review with all the magic numbers and static_casts

If so, they could probably land in types/convert.cpp and be used instead of MultiByteToWideChar and WideCharToMultiByte.

Also it should be relatively easy to write a functor class based on UTF8OutputPipeReader which just detects partials at the string end, caches them, and completes them at the next call. The conversion functions and the cache class might be either used directly or rather for the composition of more specialized classes like Utf8ToWideCharParser ...

Edit: Nevermind it's in PR #4093 wich should make it easier to comment 🙂
BTW Happy New Year!

Oooh. I am excited for this. Thank you very much! I'll put it on my list to look at.

@miniksa commented on GitHub (Jan 2, 2020): > @miniksa @DHowett-MSFT > > I spent a little time over Xmas to transpose my own UTF-8 <--> UTF-16 functions that I wrote in C into C++. I'm afraid you will still discern the C origin 😊 I'll provide them here first in order to get feedback > > * if this is close to what you're looking for > * if this has the slightest chance to pass a code review with all the magic numbers and static_casts > > If so, they could probably land in `types/convert.cpp` and be used instead of `MultiByteToWideChar` and `WideCharToMultiByte`. > > Also it should be relatively easy to write a functor class based on `UTF8OutputPipeReader` which just detects partials at the string end, caches them, and completes them at the next call. The conversion functions and the cache class might be either used directly or rather for the composition of more specialized classes like `Utf8ToWideCharParser` ... > > Edit: Nevermind it's in PR #4093 wich should make it easier to comment 🙂 > BTW Happy New Year! Oooh. I am excited for this. Thank you very much! I'll put it on my list to look at.
Author
Owner

@german-one commented on GitHub (Jan 2, 2020):

Thanks for your interest @miniksa !
I'm struggling to think out loud because I guess Utf8ToWideCharParser became a good friend, but ... if you guys like my implementation then we are only a few updated lines away to supersede Utf8ToWideCharParser.

@german-one commented on GitHub (Jan 2, 2020): Thanks for your interest @miniksa ! I'm struggling to think out loud because I guess `Utf8ToWideCharParser` became a good friend, but ... if you guys like my implementation then we are only a few updated lines away to supersede `Utf8ToWideCharParser`.
Author
Owner

@DHowett-MSFT commented on GitHub (Feb 4, 2020):

@miniksa do the changes as merged fulfill the brief for this bug?

@DHowett-MSFT commented on GitHub (Feb 4, 2020): @miniksa do the changes as merged fulfill the brief for this bug?
Author
Owner

@miniksa commented on GitHub (Feb 4, 2020):

@miniksa Michael Niksa FTE do the changes as merged fulfill the brief for this bug?

Close enough.

I think we proved that we can't do it faster in user and we now have a way to reconcile them thanks to @german-one's utility functions. And it's been unified in the major places. So I believe we're good.

@miniksa commented on GitHub (Feb 4, 2020): > @miniksa Michael Niksa FTE do the changes as merged fulfill the brief for this bug? Close enough. I think we proved that we can't do it faster in user and we now have a way to reconcile them thanks to @german-one's utility functions. And it's been unified in the major places. So I believe we're good.
Author
Owner

@german-one commented on GitHub (Feb 5, 2020):

@miniksa Actually I left two questions in the PR. One was if it really closes this issue (which is now answered), and the other one was if we still need utf8ToWideCharParser in the code base or if I should commit its removal.

BTW You encouraged me to write extensions au16 and u16a. I'm already about to implement them 🙂

@german-one commented on GitHub (Feb 5, 2020): @miniksa Actually I left two questions in the PR. One was if it really closes this issue (which is now answered), and the other one was if we still need `utf8ToWideCharParser` in the code base or if I should commit its removal. BTW You encouraged me to write extensions `au16` and `u16a`. I'm already about to implement them 🙂
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/terminal#4720