Issue with Printing Emojis on Windows Console #22469

Closed
opened 2026-01-31 08:14:19 +00:00 by claunia · 5 comments

Originally created by @nadrojpeg on GitHub (Oct 30, 2024).

Windows Terminal version

1.21.2911.0

Windows build number

10.0.22631.4391

Other Software

gcc (Rev1, Built by MSYS2 project) 14.2.0
clang version 18.1.3 (https://github.com/llvm/llvm-project.git c13b7485b87909fcf739f62cfa382b55407433c0)
MSVC v143
Visual Studio Code Version: 1.94.2 (user setup)

Steps to reproduce

I'm developing a Windows console application in C that aims to print Unicode (UTF-8) characters, including emojis.
Here’s a simplified version of my code to illustrate the issue:

#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <io.h>
#include <fcntl.h>

int main(void)
{
   _setmode(_fileno(stdout), _O_U8TEXT);
   const wchar_t* string = L"😀🌸你好Приветשלוםمرحبا";
   // Sleep(100);
   wprintf(L"%ls\n", string);

   return EXIT_SUCCESS;
}

When I run the program for the first time after compiling it, the output in Windows Terminal is ��🌸你好Приветשלוםمرحبا. On subsequent runs, the terminal sometimes prints the correct string, sometimes shows ��🌸你好Приветשלוםمرحبا, and occasionally prints 😀��你好Приветשלוםمرحبا, where the second emoji is not displayed correctly. I also noticed another interesting behavior: if I add a Sleep(100); before the wprintf, the issue occurs on every execution. To trigger these different output variants, one can place the wprintf function inside a loop, as shown in the following example:

#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
#include <io.h>
#include <fcntl.h>

int main(void)
{
   _setmode(_fileno(stdout), _O_U8TEXT);
   const wchar_t* string = L"😀🌸你好Приветשלוםمرحبا";
   for (size_t i = 0; i < 999; ++i) { 
      wprintf(L"%ls\n", string);
   }

   return EXIT_SUCCESS;
}

If I try to print the string a😀🌸你好Приветשלוםمرحبا instead of the previous string 😀🌸你好Приветשלוםمرحبا, the issue occurs less frequently, but it does not disappear: when wprintf is placed inside a for loop, one can occasionally see the output as a😀��你好Приветשלוםمرحبا.
I have already tried using SetConsoleOutputCP(CP_UTF8);, but that didn’t change anything.
I am using GCC on Windows, but I have tested with MSVC and clang as well and encountered the same issue. I compiled the program using the command gcc -std=c2x -Wall -pedantic -Werror -g -o file file.c
Is this a known issue with the Windows console, or is there something in my code that I may be overlooking?

Expected Behavior

To see the output string 😀🌸你好Приветשלוםمرحبا

Actual Behavior

Sometimes Windows Terminal outputs ��🌸你好Приветשלוםمرحبا; other times, it shows 😀��你好Приветשלוםمرحبا.

claunia added the Needs-Triage, Issue-Bug, Needs-Attention labels 2026-01-31 08:14:20 +00:00

@carlos-zamora commented on GitHub (Oct 30, 2024):

Thanks for filing! Does it make a difference if you use _O_U16TEXT? We have seen issues where libc is doing text transformations before text is printed to the console. This mode applies when you're using wprintf, whereas _O_U8TEXT only applies when using printf.


@nadrojpeg commented on GitHub (Oct 30, 2024):

Thank you for your response.
No, using _O_U16TEXT (or _O_WTEXT) doesn't make any difference. As for _O_U8TEXT, I thought it couldn’t be used directly with narrow print functions like printf, at least based on what I found in the _setmode documentation (https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setmode?view=msvc-170), which states: "You can also pass _O_U16TEXT, _O_U8TEXT, or _O_WTEXT to enable Unicode mode [...] Unicode mode is for wide print functions (for example, wprintf) and is not supported for narrow print functions."
I have just tested on my system, and using printf with _O_U8TEXT doesn't produce any output at all.


@nadrojpeg commented on GitHub (Nov 1, 2024):

After experimenting with GDB, I discovered that the wprintf function, when called with _setmode(_fileno(stdout), _O_U8TEXT); (or equivalently with _O_WTEXT or _O_U16TEXT), works by calling WriteConsoleW on each individual wchar_t in the string. This approach can cause characters represented by two wchar_ts to display incorrectly. I was able to reproduce this behavior with the following short example:

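#include <windows.h>
#include <wchar.h>
int main(void)
{
   HANDLE hStdOut = GetStdHandle(STD_OUTPUT_HANDLE);
   const wchar_t* string = L"😀🌸你好Приветשלוםمرحبا\n";
   size_t size = wcslen(string);
   // One WriteConsoleW call per UTF-16 code unit: surrogate pairs get split.
   for (size_t i = 0; i < size; ++i) {
      wchar_t wc = string[i];
      WriteConsoleW(hStdOut, &wc, 1, NULL, NULL);
   }
   return 0;
}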

This code displays the string as ��🌸你好Приветשלוםمرحبا, just as the wprintf version does. At this point, I believe this issue is not with Windows Terminal itself but rather with libc.
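
To make the surrogate-pair point concrete: on Windows, wchar_t is 16 bits wide, so any code point above U+FFFF is stored as two UTF-16 code units (a high and a low surrogate). The following is a minimal, portable sketch of that encoding. The function name utf16_encode is hypothetical and not part of any Windows or CRT API, and it uses uint16_t rather than wchar_t so that it behaves the same on platforms where wchar_t is wider.

```c
#include <assert.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Returns the number of 16-bit code units written to out (1 or 2).
   Hypothetical helper for illustration only. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);              /* must be inside the Unicode range */
    assert(cp < 0xD800 || cp > 0xDFFF);  /* surrogate values are not valid input */

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;           /* BMP: a single code unit */
        return 1;
    }

    cp -= 0x10000;                       /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For 😀 (U+1F600) this yields the two units 0xD83D and 0xDE00. If each unit reaches the console in its own WriteConsoleW call, as in the loop above, neither half is a valid character on its own, which would be consistent with the �� seen in the output.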

I've also noticed that when wprintf calls WriteConsoleW, the rdx register (the second argument, which points to the text to be printed) refers to a UTF-16 encoding of the character, even with _O_U8TEXT mode enabled. Honestly, I don’t understand what _O_U8TEXT is supposed to accomplish.

Given all this, I’ll be using the Console API functions instead of those from the standard library.


@lhecker commented on GitHub (Nov 1, 2024):

> Given all this, I’ll be using the Console API functions instead of those from the standard library.

Not quite. While the implementation of (w)printf in the stdlib is definitely quite terrible [1], we should also fix our surrogate pair support. I think we'll naturally get there about next year, because I'm planning to rewrite the entire console server layer, which will also fix this issue.


[1] A solid >10000000% CPU overhead per character. Yes, really. printf on Windows can be slower than a 56K modem. I'm neither joking nor exaggerating. While Windows Terminal can chug >200MB/s, printf with _O_U16TEXT runs at <30kB/s. Avoid it at all costs. Write your own wrapper with vsnprintf, for instance. I can also recommend adding a TCP_CORK-style buffering layer, or something akin to fflush, so that you can bottle up a series of printf-style calls and only flush them at the end, because the flush overhead is very high. The ideal stdout buffer size is at least 4KiB (~100MB/s), and anything above 128KiB (~200MB/s) stops giving a benefit.
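
The buffering advice in the footnote can be sketched roughly as follows. This is a minimal, portable illustration, not code from the Windows Terminal team: the names out_buf, out_init, out_printf and out_flush are hypothetical, the 4 KiB capacity is taken from the lower bound mentioned above, and the sink is a plain FILE* to keep the example self-contained. On Windows one would presumably replace the fwrite in out_flush with a direct console or file write, which is an assumption rather than something stated in this thread.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define OUT_CAP 4096  /* assumption: the 4 KiB lower bound suggested above */

typedef struct {
    FILE  *sink;           /* where flushed bytes go */
    size_t len;            /* bytes currently buffered */
    char   data[OUT_CAP];
} out_buf;

void out_init(out_buf *b, FILE *sink)
{
    assert(b != NULL && sink != NULL);
    b->sink = sink;
    b->len = 0;
}

/* Hand everything buffered so far to the sink in one write. */
void out_flush(out_buf *b)
{
    if (b->len > 0) {
        fwrite(b->data, 1, b->len, b->sink);
        fflush(b->sink);
        b->len = 0;
    }
}

/* printf-style append. Flushes only when the new text would not fit.
   Returns the formatted length, or -1 on error or if a single message
   is larger than the whole buffer. */
int out_printf(out_buf *b, const char *fmt, ...)
{
    va_list ap, retry;
    va_start(ap, fmt);
    va_copy(retry, ap);    /* a va_list cannot be reused after vsnprintf */

    size_t room = OUT_CAP - b->len;
    int n = vsnprintf(b->data + b->len, room, fmt, ap);
    va_end(ap);

    if (n >= 0 && (size_t)n >= room) {
        /* Truncated: flush what we have and format again from the start. */
        out_flush(b);
        n = vsnprintf(b->data, OUT_CAP, fmt, retry);
        if (n >= OUT_CAP)
            n = -1;
    }
    va_end(retry);

    if (n > 0)
        b->len += (size_t)n;
    return n;
}
```

Typical use would be out_init(&b, stdout), a series of out_printf(&b, ...) calls, and a single out_flush(&b) at the end, so that the expensive flush happens once per batch rather than once per call.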


@nadrojpeg commented on GitHub (Nov 6, 2024):

Thank you for the information! Regarding surrogate pair support, I wanted to share that, based on my tests with WriteConsoleW, the handling of surrogate pairs seems solid. When I use WriteConsoleW to print the entire string at once, everything works as expected, even with strings containing surrogate pairs. This is why I mentioned in my previous message, "Given all this, I’ll be using the Console API functions instead of those from the standard library."
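
Writing the whole string in one WriteConsoleW call keeps both halves of every surrogate pair in the same write. If output ever has to be split across several calls, the same property can be preserved by never ending a chunk on a high surrogate. A minimal sketch under that assumption follows. safe_chunk_len and write_console_utf16 are hypothetical helper names, the chunk size of 4096 code units is arbitrary, and the Windows-only part is guarded by _WIN32 so that the helper itself stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if u is a high (leading) surrogate, i.e. the first half of a pair. */
int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Length of the next chunk of s (len code units remaining, at most max per
   call) that does not end in the middle of a surrogate pair. */
size_t safe_chunk_len(const uint16_t *s, size_t len, size_t max)
{
    assert(max >= 2);                 /* a chunk must be able to hold a pair */
    size_t n = len < max ? len : max;
    if (n < len && is_high_surrogate(s[n - 1]))
        --n;                          /* defer the high surrogate to the next chunk */
    return n;
}

#ifdef _WIN32
#include <windows.h>

/* Windows-only: write a UTF-16 string in chunks that keep pairs intact.
   Relies on wchar_t being 16 bits wide, which holds on Windows. */
void write_console_utf16(const wchar_t *s, size_t len)
{
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    while (len > 0) {
        size_t n = safe_chunk_len((const uint16_t *)s, len, 4096);
        WriteConsoleW(h, s, (DWORD)n, NULL, NULL);
        s += n;
        len -= n;
    }
}
#endif
```

For short strings like the one in this issue the loop runs once, which is equivalent to the single WriteConsoleW call described above.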

Reference: starred/terminal#22469