[PR #2228] [CLOSED] Correctly process composite characters in Terminal #24851

Open
opened 2026-01-31 09:05:45 +00:00 by claunia · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/microsoft/terminal/pull/2228
Author: @german-one
Created: 8/3/2019
Status: Closed

Base: masterHead: composite_characters


📝 Commits (2)

  • 2281536 Correctly process composite characters in Terminal
  • ab88dfa update outdated figure

📊 Changes

4 files changed (+304 additions, -80 deletions)

View changed files

📝 src/cascadia/TerminalConnection/ConhostConnection.cpp (+2 -2)
📝 src/types/UTF8OutPipeReader.cpp (+174 -20)
📝 src/types/inc/UTF8OutPipeReader.hpp (+57 -6)
📝 src/types/ut_types/UTF8OutPipeReaderTests.cpp (+71 -52)

📄 Description

Summary of the Pull Request

Convert composite characters to their precomposed equivalents in the Windows Terminal to avoid faulty output.

References

Potentially closes #2003, but I'm not familiar with emacs and thus, I only printed the content of the mentioned re.rs file to the Terminal in my tests.
pws
wsl

Briefly discussed in the latest comments in #1850.

PR Checklist

  • Closes #2003 (partially)
  • CLA signed. If not, go over here and sign the CLA
  • Tests added/passed
  • Requires documentation to be updated
  • I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #1850

Detailed Description of the Pull Request / Additional comments

The winrt::to_hstring method has been used to convert UTF-8 to UTF-16. This method does not convert composite characters to their precomposed equivalents.
This PR extends the UTF8OutPipeReader class with an overload of the Read method that takes a reference of a std::wsting_view variable which receives the chunks read out of the pipe, already converted to UTF-16. The conversion makes composite characters to their precomposed equivalents.
Furthermore the class methods take care that composite characters don't get split into different chunks.

That's possibly the last thing we can do using conversions of the text data.

  • We have the conversion of UTF-8 to UTF-16. This changes the encoding but leaves the codepoints the same.
  • Now we have the conversion of composite characters to precomposed characters, too. This leaves the encoding UTF-16 but changes the code points.
  • The updated code can only solve display problems where a precomposed character for a pair of base and combining character exists. That means, a conversion is only possible if another code point for the composite characters exists.
  • It can't solve problems caused by grapheme clusters in general. E.g. combinations with a non-spacing mark where no precomposed character exists will still consist of two characters in the output. Also zero-width characters (such as the byte order mark) still take the width of a character in the output. Things like intersecting base characters with non-spacing marks is a graphical task. It can't be done using conversions of the text data anymore.

Validation Steps Performed

  • A lot of manual tests where I used UTF8OutPipeReader in a separate code which displayed the chunk boudaries and the bytes that belong to them, which displayed the converted text, and the return values of the methods.
  • Manual tests in the Terminal to see how it behaves using keyboard input and redirected input from files and sockets.
  • I updated UTF8OutPipeReaderTest to address the new features, and successfully run the unit test.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

## 📋 Pull Request Information **Original PR:** https://github.com/microsoft/terminal/pull/2228 **Author:** [@german-one](https://github.com/german-one) **Created:** 8/3/2019 **Status:** ❌ Closed **Base:** `master` ← **Head:** `composite_characters` --- ### 📝 Commits (2) - [`2281536`](https://github.com/microsoft/terminal/commit/228153698a2b6c1bbde10f7991510e4475527100) Correctly process composite characters in Terminal - [`ab88dfa`](https://github.com/microsoft/terminal/commit/ab88dfae0c94c5d145f2346702a5f80ce81494a8) update outdated figure ### 📊 Changes **4 files changed** (+304 additions, -80 deletions) <details> <summary>View changed files</summary> 📝 `src/cascadia/TerminalConnection/ConhostConnection.cpp` (+2 -2) 📝 `src/types/UTF8OutPipeReader.cpp` (+174 -20) 📝 `src/types/inc/UTF8OutPipeReader.hpp` (+57 -6) 📝 `src/types/ut_types/UTF8OutPipeReaderTests.cpp` (+71 -52) </details> ### 📄 Description <!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? --> ## Summary of the Pull Request Convert composite characters to their precomposed equivalents in the Windows Terminal to avoid faulty output. <!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> ## References Potentially closes #2003, but I'm not familiar with emacs and thus, I only printed the content of the mentioned `re.rs` file to the Terminal in my tests. ![pws](https://user-images.githubusercontent.com/46659645/62416304-63c83600-b638-11e9-93c4-eb6721344e47.png) ![wsl](https://user-images.githubusercontent.com/46659645/62416308-72165200-b638-11e9-92d7-49f7ea85ba8f.png) Briefly discussed in the latest comments in #1850. <!-- Please review the items on the PR checklist before submitting--> ## PR Checklist * [X] Closes #2003 (partially) * [X] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA * [X] Tests added/passed * [ ] Requires documentation to be updated * [X] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #1850 <!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here --> ## Detailed Description of the Pull Request / Additional comments The `winrt::to_hstring` method has been used to convert UTF-8 to UTF-16. This method does not convert composite characters to their precomposed equivalents. This PR extends the `UTF8OutPipeReader` class with an overload of the `Read` method that takes a reference of a `std::wsting_view` variable which receives the chunks read out of the pipe, already converted to UTF-16. The conversion makes composite characters to their precomposed equivalents. Furthermore the class methods take care that composite characters don't get split into different chunks. That's possibly the last thing we can do using conversions of the text data. - We have the conversion of UTF-8 to UTF-16. This changes the encoding but leaves the codepoints the same. - Now we have the conversion of composite characters to precomposed characters, too. This leaves the encoding UTF-16 but changes the code points. - The updated code can only solve display problems where a precomposed character for a pair of base and combining character exists. That means, a conversion is only possible if another code point for the composite characters exists. - It can't solve problems caused by grapheme clusters in general. E.g. combinations with a non-spacing mark where no precomposed character exists will still consist of two characters in the output. Also zero-width characters (such as the byte order mark) still take the width of a character in the output. Things like intersecting base characters with non-spacing marks is a graphical task. It can't be done using conversions of the text data anymore. <!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well --> ## Validation Steps Performed - A lot of manual tests where I used `UTF8OutPipeReader` in a separate code which displayed the chunk boudaries and the bytes that belong to them, which displayed the converted text, and the return values of the methods. - Manual tests in the Terminal to see how it behaves using keyboard input and redirected input from files and sockets. - I updated `UTF8OutPipeReaderTest` to address the new features, and successfully run the unit test. --- <sub>🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.</sub>
claunia added the pull-request label 2026-01-31 09:05:45 +00:00
Sign in to join this conversation.
No Label pull-request
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/terminal#24851