The status of UTF-16 vs UTF-8 output #13750

Closed
opened 2026-01-31 03:51:07 +00:00 by claunia · 4 comments

Originally created by @d8928fcddcd54a2eb616c93261f24d97 on GitHub (May 8, 2021).

Windows Terminal version (or Windows build number)

Terminal: 1.7.1033.0, Windows: 10.0.19041.928

Other Software

cmd.exe

Steps to reproduce

  1. Set up Unicode (UTF-16) output
     fflush(stdout);
     _setmode(_fileno(stdout), _O_U16TEXT);
     _setmode(_fileno(stdin), _O_U16TEXT);
     GetConsoleMode(GetStdHandle(STD_OUTPUT_HANDLE), &mode);
     SetConsoleMode(GetStdHandle(STD_OUTPUT_HANDLE), mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);
  2. Print some surrogate pairs via WriteConsoleW
     const wchar_t* str = L"\U0002002C\U0001F495"; // 𠀬💕 - both are encoded in UTF-16 as surrogate pairs
     DWORD nwritten;
     WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), str, 4, &nwritten, NULL);
  3. Print the same string via any CRT facility (fputws, fwrite, wprintf)
  4. Alternatively, set up UTF-8 output via SetConsoleOutputCP(CP_UTF8)
  5. Print the same codepoints (UTF-8 encoded, compile with /utf-8) via WriteConsoleA or any CRT facility
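
As a side note on why the WriteConsoleW call above passes a length of 4 for two visible characters: each code point above U+FFFF takes two UTF-16 code units. The following is a minimal, portable sketch of that encoding. The helper names `append_utf16` and `to_utf16` are hypothetical, it uses `char16_t` instead of `wchar_t` so it behaves the same on platforms where `wchar_t` is 32 bits wide, and it assumes the input contains only valid Unicode scalar values.

```cpp
#include <string>

// Hypothetical helper: append one Unicode scalar value as UTF-16.
// Assumes cp is valid (not itself a surrogate, not above U+10FFFF).
inline void append_utf16(std::u16string& out, char32_t cp)
{
    if (cp < 0x10000) {
        // BMP code point: a single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary plane: split the 20-bit offset into two halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
}

// Hypothetical helper: encode a whole UTF-32 string as UTF-16.
inline std::u16string to_utf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in) {
        append_utf16(out, cp);
    }
    return out;
}
```

Calling `to_utf16(U"\U0002002C\U0001F495")` yields four code units: D840 DC2C for U+2002C and D83D DC95 for U+1F495.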

Expected Behavior

2/3. Supplementary Plane characters are correctly displayed in Windows Terminal regardless of printing function used. The behavior of "normal" conhost is consistent between functions.
5. The behavior is the same as in the UTF-16 / wchar_t case.

Actual Behavior

UTF-16 / Wide output

Windows Terminal

Supplementary Plane characters are displayed correctly by Windows Terminal only if both elements of the surrogate pair are printed by a single WriteConsoleW call. Emitting a surrogate pair by two consecutive WriteConsoleW calls results in REPLACEMENT CHARACTER (U+FFFD) being displayed. CRT functions result in U+FFFD being displayed by Windows Terminal in any scenario I've tested.
When copying, U+FFFD characters get copied.
Redirecting the output to a text file, however, produces correct/uncorrupted UTF-16 (no BOM) text for all output functions used. Subsequently printing it to Windows Terminal by pwsh -c "get-content -encoding Unicode output.txt" displays the text correctly.
Saving it with the BOM in any capable text editor and subsequently printing to the console via type works as well.
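
One way to back up a claim like "the redirected file is uncorrupted" is to check the captured code units for lone surrogates. The sketch below is only an illustration under stated assumptions: `is_well_formed_utf16` is a hypothetical helper, it works on an in-memory `std::u16string` (reading the file and handling byte order are left out), and it checks surrogate pairing only, not whether the code points are assigned.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every high surrogate is immediately
// followed by a low surrogate and no low surrogate stands alone.
inline bool is_well_formed_utf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            // High surrogate: the next unit must exist and be a low surrogate.
            if (i + 1 == s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                return false;
            }
            ++i; // skip the low surrogate that completes the pair
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            // Low surrogate without a preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

For the test string from the steps above, the full four-unit sequence passes, while any prefix or suffix that cuts through a pair is rejected.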

"Normal" conhost

In "normal" conhost printing Supplementary Plane characters via a single WriteConsoleW call results in "wide"  being displayed. However, copying from cmd.exe window produces uncorrupted characters.
The rest of the behavior is analogous.

UTF-8 char output

Windows Terminal

Supplementary Plane characters are displayed correctly regardless of function employed.
Redirecting produces UTF-8 encoded files.

"Normal" conhost

The behavior is analogous.

Clarify the status of Unicode beyond UCS-2

What method of outputting Unicode text should be used by newly written software? Why do wide CRT functions (and writing one element of a surrogate pair at a time via WriteConsoleW) result in incorrect Windows Terminal behavior?
Is it expected to be fixed?
What is the internal encoding used by Windows Terminal and "normal" conhost? Which output method would allow an application to avoid conversions being performed by conhost?


@miniksa commented on GitHub (May 14, 2021):

> What method of outputting Unicode text should be used by newly written software?

The goal is for newly written software to succeed with either WriteConsoleW and UTF-16 text, or WriteFile/WriteConsoleA after setting SetConsoleOutputCP to 65001 (UTF-8).

> Why do wide CRT functions (and writing one element of a surrogate pair at a time via WriteConsoleW) result in incorrect Windows Terminal behavior?

The CRT functions often do their own internal conversions before writing to the console. See information on setlocale to explicitly choose a locale before writing with the CRT if the fidelity of byte output is very important to your application.

> Is it expected to be fixed?

It what?

The CRT? Probably not for compatibility reasons. You will probably have to always declare your intent to the CRT to distinguish yourself from a classic application that just assumed.

Flaws in Conhost and Terminal that lead to the mishandling of any particular Unicode character or sequence? Yes, hopefully, eventually. We tend to make slow and steady progress over time improving the buffers, renderers, and translators to cover top-requested issues on this tracker.

> What is the internal encoding used by Windows Terminal and "normal" conhost?

They store things internally as wchar_t arrays which should fully support UCS-2 and support some amount of UTF-16 characters via surrogate pairs to ensure wide characters like extended Chinese and Emoji graphics can be handled.

The PTY mechanism always submits between the processes as UTF-8. Data is converted back and forth from rest to transit to rest again, but because UTF8/UTF16 is an algorithmic conversion, we believe this is lossless.
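
To illustrate what "algorithmic conversion" means here, the following is a minimal sketch of a UTF-16 to UTF-8 round trip. It is not the code conhost or Terminal uses. The function names are hypothetical, `char16_t` stands in for the 16-bit `wchar_t` on Windows, and both functions assume well-formed input (no lone surrogates, no truncated or overlong UTF-8 sequences).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: UTF-16 -> UTF-8, assuming well-formed input.
inline std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Hypothetical helper: UTF-8 -> UTF-16, assuming well-formed input.
inline std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        // Sequence length is determined by the lead byte.
        const int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (int k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

For well-formed text, `utf8_to_utf16(utf16_to_utf8(s))` returns `s` unchanged. The round trip is only guaranteed for well-formed input: a lone surrogate has no valid UTF-8 form, which is one plausible place for U+FFFD to appear when a pair arrives split.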

> Which output method would allow an application to avoid conversions being performed by conhost?

WriteConsoleW into the conhost should not be converted and should be inserted into the buffer as it can manage. However, as we continue to improve the buffer to support things like joiners, flow direction, and other specialities of Unicode... we may be required to interpret the characters and I cannot guarantee that there is a scenario where all characters will never be translated.

For a conhost acting in PTY, something is always translated as the rest buffer is ideally UTF-16 and the PTY communication channel is UTF-8. Though again, algorithmic conversions SHOULD be no problem.


@LuanVSO commented on GitHub (May 14, 2021):

Shouldn't setlocale(... ,".utf8") also call SetConsoleCP(CP_UTF8) and SetConsoleOutputCP(CP_UTF8)?


@d8928fcddcd54a2eb616c93261f24d97 commented on GitHub (May 15, 2021):

I was always under the impression that wide CRT functions are lossless and do not locale-convert (This is AFAIK true for BMP).

> > Is it expected to be fixed?
>
> It what?

Conhost as well as Terminal failing to display/output non-BMP characters in the majority of UTF-16 scenarios.

Here is the sample code (compile with /utf-8):

#define _CRT_SECURE_NO_WARNINGS
#define WIN32_LEAN_AND_MEAN
#define NOMINMAX
#include <Windows.h>
#include <iostream>
#include <fcntl.h>
#include <io.h>
#include <cstdio>
#include <cstring> // strlen
#include <tuple>

constexpr auto str_utf16 = L"\U0002002C\U0001F495";
constexpr auto str_utf8 = "\U0002002C\U0001F495";
constexpr auto bmp_utf16 = L"BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος\n";
constexpr auto bmp_utf8 = "BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος\n";

int main(int argc, char** argv)
{
    if (argc == 2)
    {
		switch (argv[1][0])
		{
		case '1': // Single WriteConsoleW
		{
			DWORD nWritten;
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			fflush(stdout);
			std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
			std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			WriteConsoleW(hOut, bmp_utf16, 46, &nWritten, NULL);
			WriteConsoleW(hOut, L"Single WriteConsoleW call: ", 27, &nWritten, NULL);
			WriteConsoleW(hOut, str_utf16, 4, &nWritten, NULL);
			WriteConsoleW(hOut, L"\n", 1, &nWritten, NULL);

			SetConsoleMode(hOut, mode);
			break;
		}
		case '2': // Multiple WriteConsoleW
		{
			DWORD nWritten;
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			fflush(stdout);
			std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
			std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			WriteConsoleW(hOut, bmp_utf16, 46, &nWritten, NULL);
			WriteConsoleW(hOut, L"Multiple WriteConsoleW calls (1st split): ", 42, &nWritten, NULL);
			WriteConsoleW(hOut, str_utf16, 1, &nWritten, NULL);
			WriteConsoleW(hOut, str_utf16 + 1, 3, &nWritten, NULL);
			WriteConsoleW(hOut, L"\n", 1, &nWritten, NULL);

			WriteConsoleW(hOut, L"Multiple WriteConsoleW calls (2nd split): ", 42, &nWritten, NULL);
			WriteConsoleW(hOut, str_utf16, 3, &nWritten, NULL);
			WriteConsoleW(hOut, str_utf16 + 3, 1, &nWritten, NULL);
			WriteConsoleW(hOut, L"\n", 1, &nWritten, NULL);

			SetConsoleMode(hOut, mode);
			break;
		}
		case '3': // Single fwrite
		{
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			fflush(stdout);
			std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
			std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			fwrite(bmp_utf16, 2, 46, stdout);
			fwrite(L"Single fwrite call: ", 2, 20, stdout);
			fwrite(str_utf16, 2, 4, stdout);
			fwrite(L"\n", 2, 1, stdout);

			SetConsoleMode(hOut, mode);
			break;
		}
		case '4': // Multiple fwrite
		{
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			fflush(stdout);
			std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
			std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			wchar_t str1[] = { str_utf16[0], L'\0', str_utf16[1], str_utf16[2], str_utf16[3], L'\0' };
			wchar_t str2[] = { str_utf16[0], str_utf16[1], str_utf16[2], L'\0', str_utf16[3], L'\0' };

			fwrite(bmp_utf16, 2, 46, stdout);
			fwrite(L"Multiple fwrite calls (1st split): ", 2, 35, stdout);
			fwrite(str1, 2, 1, stdout);
			fwrite(str1 + 2, 2, 3, stdout);
			fwrite(L"\n", 2, 1, stdout);

			fwrite(L"Multiple fwrite calls (2nd split): ", 2, 35, stdout);
			fwrite(str2, 2, 3, stdout);
			fwrite(str2 + 4, 2, 1, stdout);
			fwrite(L"\n", 2, 1, stdout);

			SetConsoleMode(hOut, mode);
			break;
		}
		case '5': // Single fputws
		{
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			fflush(stdout);
			std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
			std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			fputws(bmp_utf16, stdout);
			fputws(L"Single fputws call: ", stdout);
			fputws(str_utf16, stdout);
			fputws(L"\n", stdout);

			SetConsoleMode(hOut, mode);
			break;
		}
		case '6': // Multiple fputws
		{
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			fflush(stdout);
			std::ignore = _setmode(_fileno(stdout), _O_U16TEXT);
			std::ignore = _setmode(_fileno(stdin), _O_U16TEXT);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			wchar_t str1[] = { str_utf16[0], L'\0', str_utf16[1], str_utf16[2], str_utf16[3], L'\0' };
			wchar_t str2[] = { str_utf16[0], str_utf16[1], str_utf16[2], L'\0', str_utf16[3], L'\0' };

			fputws(bmp_utf16, stdout);
			fputws(L"Multiple fputws calls (1st split): ", stdout);
			fputws(str1, stdout);
			fputws(str1 + 2, stdout);
			fputws(L"\n", stdout);

			fputws(L"Multiple fputws calls (2nd split): ", stdout);
			fputws(str2, stdout);
			fputws(str2 + 4, stdout);
			fputws(L"\n", stdout);

			SetConsoleMode(hOut, mode);
			break;
		}
		case 'a': // Single WriteConsoleA
		{
			DWORD nWritten;
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			SetConsoleOutputCP(CP_UTF8);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			WriteConsoleA(hOut, bmp_utf8, strlen(bmp_utf8), &nWritten, NULL);
			WriteConsoleA(hOut, "Single WriteConsoleA call: ", 27, &nWritten, NULL);
			WriteConsoleA(hOut, str_utf8, strlen(str_utf8), &nWritten, NULL);
			WriteConsoleA(hOut, "\n", 1, &nWritten, NULL);

			SetConsoleMode(hOut, mode);
			break;
		}
		case 'b': // Multiple WriteConsoleA
		{
			DWORD nWritten;
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			SetConsoleOutputCP(CP_UTF8);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			WriteConsoleA(hOut, bmp_utf8, 13, &nWritten, NULL);
			WriteConsoleA(hOut, bmp_utf8 + 13, strlen(bmp_utf8) - 13, &nWritten, NULL);
			WriteConsoleA(hOut, "Multiple WriteConsoleA calls (1st split): ", 42, &nWritten, NULL);
			WriteConsoleA(hOut, str_utf8, 1, &nWritten, NULL);
			WriteConsoleA(hOut, str_utf8 + 1, 7, &nWritten, NULL);
			WriteConsoleA(hOut, "\n", 1, &nWritten, NULL);

			WriteConsoleA(hOut, "Multiple WriteConsoleA calls (2nd split): ", 42, &nWritten, NULL);
			WriteConsoleA(hOut, str_utf8, 5, &nWritten, NULL);
			WriteConsoleA(hOut, str_utf8 + 5, 3, &nWritten, NULL);
			WriteConsoleA(hOut, "\n", 1, &nWritten, NULL);

			SetConsoleMode(hOut, mode);
			break;
		}
		case 'c': // Single fwrite
		{
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			SetConsoleOutputCP(CP_UTF8);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			fwrite(bmp_utf8, 1, strlen(bmp_utf8), stdout);
			fwrite("Single fwrite call: ", 1, 20, stdout);
			fwrite(str_utf8, 1, 8, stdout);
			fwrite("\n", 1, 1, stdout);

			SetConsoleMode(hOut, mode);
			break;
		}
		case 'd': // Multiple fwrite
		{
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			SetConsoleOutputCP(CP_UTF8);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			char str1[] = { str_utf8[0], '\0', str_utf8[1], str_utf8[2], str_utf8[3], str_utf8[4], str_utf8[5], str_utf8[6], str_utf8[7], '\0' };
			char str2[] = { str_utf8[0], str_utf8[1], str_utf8[2], str_utf8[3], str_utf8[4], '\0', str_utf8[5], str_utf8[6], str_utf8[7], '\0' };

			fwrite(bmp_utf8, 1, strlen(bmp_utf8), stdout);
			fwrite("Multiple fwrite calls (1st split): ", 1, 35, stdout);
			fwrite(str1, 1, 1, stdout);
			fwrite(str1 + 2, 1, 7, stdout);
			fwrite("\n", 1, 1, stdout);

			fwrite("Multiple fwrite calls (2nd split): ", 1, 35, stdout);
			fwrite(str2, 1, 5, stdout);
			fwrite(str2 + 6, 1, 3, stdout);
			fwrite("\n", 1, 1, stdout);

			SetConsoleMode(hOut, mode);
			break;
		}
		case 'e': // Single fputs
		{
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			SetConsoleOutputCP(CP_UTF8);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			fputs(bmp_utf8, stdout);
			fputs("Single fputs call: ", stdout);
			fputs(str_utf8, stdout);
			fputs("\n", stdout);

			SetConsoleMode(hOut, mode);
			break;
		}
		case 'f': // Multiple fputs
		{
			DWORD mode;
			HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

			SetConsoleOutputCP(CP_UTF8);
			GetConsoleMode(hOut, &mode);
			SetConsoleMode(hOut, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | ENABLE_PROCESSED_OUTPUT);

			char str1[] = { str_utf8[0], '\0', str_utf8[1], str_utf8[2], str_utf8[3], str_utf8[4], str_utf8[5], str_utf8[6], str_utf8[7], '\0' };
			char str2[] = { str_utf8[0], str_utf8[1], str_utf8[2], str_utf8[3], str_utf8[4], '\0', str_utf8[5], str_utf8[6], str_utf8[7], '\0' };

			fputs(bmp_utf8, stdout);
			fputs("Multiple fputs calls (1st split): ", stdout);
			fputs(str1, stdout);
			fputs(str1 + 2, stdout);
			fputs("\n", stdout);

			fputs("Multiple fputs calls (2nd split): ", stdout);
			fputs(str2, stdout);
			fputs(str2 + 6, stdout);
			fputs("\n", stdout);

			SetConsoleMode(hOut, mode);
			break;
		}
		default:
			break;
		}
    }
}


@d8928fcddcd54a2eb616c93261f24d97 commented on GitHub (May 15, 2021):

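Terminal output:
https://user-images.githubusercontent.com/75134015/118373616-8e427380-b5c0-11eb-9a6b-b3a909026ef2.png
https://user-images.githubusercontent.com/75134015/118373635-a1554380-b5c0-11eb-8d1b-ae02bd721424.png

Conhost output:
https://user-images.githubusercontent.com/75134015/118373650-b16d2300-b5c0-11eb-9384-603c618f82e4.png
https://user-images.githubusercontent.com/75134015/118373659-b7fb9a80-b5c0-11eb-9d07-12cdd73724de.png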

Redirecting to a file produces correct output in all scenarios. So it is not the CRT functions that corrupt the output, but rather the rendering / screen text buffer part shared between Conhost and Terminal, as the behavior of the two is exactly analogous. Copying from the Terminal / Conhost window yields � REPLACEMENT CHARACTERs:

>output_test.exe 1
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Single WriteConsoleW call: 𠀬💕
>output_test.exe 2
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Multiple WriteConsoleW calls (1st split): �💕�
Multiple WriteConsoleW calls (2nd split): 𠀬��
>output_test.exe 3
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Single fwrite call: ����
>output_test.exe 4
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Multiple fwrite calls (1st split): ����
Multiple fwrite calls (2nd split): ����
>output_test.exe 5
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Single fputws call: ����
>output_test.exe 6
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Multiple fputws calls (1st split): ����
Multiple fputws calls (2nd split): ����
>output_test.exe a
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Single WriteConsoleA call: 𠀬💕
>output_test.exe b
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Multiple WriteConsoleA calls (1st split): 𠀬💕
Multiple WriteConsoleA calls (2nd split): 𠀬💕
>output_test.exe c
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Single fwrite call: 𠀬💕
>output_test.exe d
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Multiple fwrite calls (1st split): 𠀬💕
Multiple fwrite calls (2nd split): 𠀬💕
>output_test.exe e
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Single fputs call: 𠀬💕
>output_test.exe f
BMP test: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος
Multiple fputs calls (1st split): 𠀬💕
Multiple fputs calls (2nd split): 𠀬💕

The fact that UTF-8 is handled correctly suggests to me that this issue is specific to the wide interface. It prevents applications from outputting text without first detecting surrogate pairs in it, since surrogate pairs split between WriteConsoleW calls get corrupted, and it also blocks wide-interface libraries that use the wide CRT facilities for output.
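
Until split pairs are handled on the console side, an application could keep pairs together itself before calling WriteConsoleW. The following is a minimal sketch of that idea, not an official workaround. `SurrogateCarry` is a hypothetical class, `char16_t` stands in for the 16-bit `wchar_t` on Windows, and the actual console write is left to the caller.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: holds back a trailing high surrogate so that the
// next chunk can complete the pair before anything is written.
class SurrogateCarry {
public:
    // Returns the code units that are safe to write right now.
    std::u16string feed(const std::u16string& chunk)
    {
        std::u16string ready = std::move(pending_);
        pending_.clear();
        ready += chunk;
        if (!ready.empty() && is_high_surrogate(ready.back())) {
            // Keep the unfinished pair for the next call.
            pending_.push_back(ready.back());
            ready.pop_back();
        }
        return ready;
    }

    // Returns whatever is still held back (for example at end of output).
    std::u16string flush()
    {
        std::u16string rest = std::move(pending_);
        pending_.clear();
        return rest;
    }

private:
    static bool is_high_surrogate(char16_t c)
    {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    std::u16string pending_;
};
```

With this in front of the output call, feeding the test string one code unit first and the remaining three afterwards produces an empty first result and the complete four-unit string as the second, so each pair reaches the console in a single write. The cost is that a trailing high surrogate is delayed until the next call or an explicit `flush()`.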
