The status of UTF-16 vs UTF-8 output #13745

Open
opened 2026-01-31 03:50:59 +00:00 by claunia · 0 comments

Originally created by @d8928fcddcd54a2eb616c93261f24d97 on GitHub (May 8, 2021).

Windows Terminal version (or Windows build number)

Terminal: 1.7.1033.0, Windows: 10.0.19041.928

Other Software

cmd.exe

Steps to reproduce

  1. Set up Unicode (UTF-16) output

         DWORD mode = 0;
         fflush(stdout);
         // Switch the CRT streams to UTF-16 text mode
         _setmode(_fileno(stdout), _O_U16TEXT);
         _setmode(_fileno(stdin), _O_U16TEXT);
         // Enable VT sequence processing on the console output handle
         GetConsoleMode(GetStdHandle(STD_OUTPUT_HANDLE), &mode);
         SetConsoleMode(GetStdHandle(STD_OUTPUT_HANDLE), mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

  2. Print some surrogate pairs via WriteConsoleW

         const wchar_t* str = L"\U0002002C\U0001F495"; // 𠀬💕 - both are encoded in UTF-16 as surrogate pairs
         DWORD nwritten;
         // 4 = number of UTF-16 code units (two surrogate pairs)
         WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str, 4, &nwritten, NULL);

  3. Print the same string via any CRT facility (fputws, fwrite, wprintf)
  4. Alternatively, set up UTF-8 output via SetConsoleOutputCP(CP_UTF8)
  5. Print the same code points (UTF-8 encoded, compile with /utf-8) via WriteConsoleA or any CRT facility
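For reference, the length of 4 passed to WriteConsoleW in step 2 follows from how the two code points are encoded in UTF-16. The following is a minimal, portable sketch of that encoding, not part of the original reproduction; the names `Utf16Units` and `encode_utf16` are made up for illustration, and the input is assumed to be a valid Unicode scalar value.

```cpp
#include <cstdint>

// Result of encoding one code point as UTF-16: one or two code units.
// (Hypothetical helper type, used only for this illustration.)
struct Utf16Units {
    char16_t units[2];
    int count;  // 1 for BMP code points, 2 for a surrogate pair
};

// Encode a Unicode scalar value as UTF-16.
// Assumes cp <= 0x10FFFF and that cp is not itself a surrogate.
constexpr Utf16Units encode_utf16(std::uint32_t cp) {
    return cp < 0x10000
        // Basic Multilingual Plane: a single code unit.
        ? Utf16Units{{static_cast<char16_t>(cp), 0}, 1}
        // Supplementary planes: subtract 0x10000, then split the remaining
        // 20 bits into a high (top 10 bits) and low (bottom 10 bits) surrogate.
        : Utf16Units{{static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
                      static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))},
                     2};
}
```

With this, U+2002C encodes as D840 DC2C and U+1F495 as D83D DC95, so the test string is four code units long.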

Expected Behavior

2/3. Supplementary Plane characters are displayed correctly in Windows Terminal regardless of the printing function used. The behavior of "normal" conhost is consistent between functions.
5. The behavior is the same as the UTF-16 / wchar_t one.

Actual Behavior

UTF-16 / Wide output

Windows Terminal

Supplementary Plane characters are displayed correctly by Windows Terminal only if both elements of the surrogate pair are printed by a single WriteConsoleW call. Emitting a surrogate pair by two consecutive WriteConsoleW calls results in REPLACEMENT CHARACTER (U+FFFD) being displayed. CRT functions result in U+FFFD being displayed by Windows Terminal in any scenario I've tested.
When copying, U+FFFD characters get copied.
Redirecting the output to a text file, however, produces correct, uncorrupted UTF-16 (no BOM) text for all output functions used. Subsequently printing that file to Windows Terminal with pwsh -c "get-content -encoding Unicode output.txt" displays the text correctly.
Saving it with the BOM in any capable text editor and subsequently printing to the console via type works as well.
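Since the single-call case works while the split case does not, one possible application-side workaround is to never end a write on a high surrogate. The sketch below is only an illustration of that idea under my own assumptions: `pair_safe_length` and `PairSafeWriter` are hypothetical names, and the `Sink` is a stand-in for whatever finally calls WriteConsoleW. It has not been tested against Windows Terminal.

```cpp
#include <cstddef>
#include <string>

// True if u is the first half (high surrogate) of a UTF-16 surrogate pair.
constexpr bool is_high_surrogate(char16_t u) {
    return u >= 0xD800 && u <= 0xDBFF;
}

// Number of leading code units of data[0..len) that can be written without
// leaving a high surrogate dangling at the very end of the write.
constexpr std::size_t pair_safe_length(const char16_t* data, std::size_t len) {
    return (len > 0 && is_high_surrogate(data[len - 1])) ? len - 1 : len;
}

// Hypothetical wrapper: holds back a trailing high surrogate until the
// matching low surrogate arrives, so the sink always sees whole pairs.
// Sink is any callable taking (const char16_t*, std::size_t); on Windows it
// would presumably forward to WriteConsoleW.
template <typename Sink>
class PairSafeWriter {
public:
    explicit PairSafeWriter(Sink sink) : sink_(sink) {}

    void write(const char16_t* data, std::size_t len) {
        pending_.append(data, len);
        const std::size_t n = pair_safe_length(pending_.data(), pending_.size());
        if (n > 0) sink_(pending_.data(), n);
        pending_.erase(0, n);  // keeps at most one dangling high surrogate
    }

private:
    Sink sink_;
    std::u16string pending_;
};
```

A high surrogate that is never followed by another write stays in the pending buffer; a real implementation would need to decide whether to flush or drop it on shutdown.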

"Normal" conhost

In "normal" conhost, printing Supplementary Plane characters via a single WriteConsoleW call results in "wide"  being displayed. However, copying from the cmd.exe window produces uncorrupted characters.
The rest of the behavior is analogous.

UTF-8 char output

Windows Terminal

Supplementary Plane characters are displayed correctly regardless of the function used.
Redirecting produces UTF-8 encoded files.
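For comparison with the UTF-16 case, the same two code points take four bytes each in UTF-8. The sketch below shows the standard UTF-8 encoding so the bytes in a redirected file can be checked by hand; `Utf8Bytes` and `encode_utf8` are illustrative names, not part of any Windows or CRT API, and the input is assumed to be a valid Unicode scalar value.

```cpp
#include <cstdint>

// Result of encoding one code point as UTF-8: one to four bytes.
// (Hypothetical helper type, used only for this illustration.)
struct Utf8Bytes {
    unsigned char bytes[4];
    int count;
};

// Encode a Unicode scalar value as UTF-8.
// Assumes cp <= 0x10FFFF and that cp is not a surrogate.
constexpr Utf8Bytes encode_utf8(std::uint32_t cp) {
    return cp < 0x80
        // 1 byte: 0xxxxxxx
        ? Utf8Bytes{{static_cast<unsigned char>(cp), 0, 0, 0}, 1}
        : cp < 0x800
        // 2 bytes: 110xxxxx 10xxxxxx
        ? Utf8Bytes{{static_cast<unsigned char>(0xC0 | (cp >> 6)),
                     static_cast<unsigned char>(0x80 | (cp & 0x3F)), 0, 0}, 2}
        : cp < 0x10000
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        ? Utf8Bytes{{static_cast<unsigned char>(0xE0 | (cp >> 12)),
                     static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
                     static_cast<unsigned char>(0x80 | (cp & 0x3F)), 0}, 3}
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        : Utf8Bytes{{static_cast<unsigned char>(0xF0 | (cp >> 18)),
                     static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F)),
                     static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
                     static_cast<unsigned char>(0x80 | (cp & 0x3F))}, 4};
}
```

U+2002C comes out as F0 A0 80 AC and U+1F495 as F0 9F 92 95, which is what the redirected UTF-8 output would be expected to contain for the test string.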

"Normal" conhost

The behavior is analogous.

Clarify the status of Unicode beyond UCS-2

What method of outputting Unicode text should be used by newly written software? Why do wide CRT functions (and writing one element of a surrogate pair at a time via WriteConsoleW) result in incorrect Windows Terminal behavior?
Is it expected to be fixed?
What is the internal encoding used by Windows Terminal and "normal" conhost? Which output method would allow an application to avoid conversions being performed by conhost?


Reference: starred/terminal#13745