Terminal should force pseudoconsole host into UTF-8 codepage by default #2524

Open
opened 2026-01-30 22:57:25 +00:00 by claunia · 24 comments
Owner

Originally created by @DHowett-MSFT on GitHub (Jul 3, 2019).

Originally assigned to: @DHowett on GitHub.

It's 2019, after all. Maybe we should introduce a flag that starts up the pseudoconsole host in codepage 65001 so that we make good on our promise of "emoji just work and everything else works like it should too," and use WT as a real opportunity to push the boundaries here.

  🌛
💪 💪
  👖

maintainer note, Aug 2023

It's {{current_year}}, after all

Also, we want to take into account arbitrary codepages, ala #15678


@miniksa commented on GitHub (Jul 3, 2019):

I_am_okay_with_this.jpg


@MicheleCicciottiWork commented on GitHub (Jul 4, 2019):

Does cmd support batch scripts in codepage 65001 now?


@hcoona commented on GitHub (Jul 5, 2019):

+1 for running *nix tools with CJK outputs

For example: fc-list in texlive


@rivy commented on GitHub (Jul 7, 2019):

A work-around in the meanwhile is ... enable BETA: Use Unicode UTF-8 for worldwide language support in the "Control Panel \ Clock and Region \ Region \ Administrative \ Change system locale..." dialog box and rebooting. Do note the BETA prefix.


@zadjii-msft commented on GitHub (Jul 8, 2019):

I'm on board with this, esp. if we add a "disableAutoCp65001" (boy that needs a better name) setting to disable this behavior, set to false by default.


@driver1998 commented on GitHub (Sep 26, 2019):

Maybe we need to somehow enable localized messages in CMD while codepage is 65001.
Now it is forced to be English.


@driver1998 commented on GitHub (Oct 4, 2019):

> A work-around in the meanwhile is ... enable BETA: Use Unicode UTF-8 for worldwide language support in the "Control Panel \ Clock and Region \ Region \ Administrative \ Change system locale..." dialog box and rebooting. Do note the BETA prefix.

This changes the system code page to 65001, so any ANSI applications you have will be forced to use UTF-8.
That should be fine for English users, but for CJK users it will cause big trouble.


@methane commented on GitHub (Jan 22, 2020):

Is there any plan to add an environment variable for "UTF-8 mode"?

For example, Python has PYTHONUTF8. Git for Windows uses LANG. Some tools use the GetConsoleCP.
It seems there is no standard way to indicate "I want to use UTF-8 in this session!".

Is the ConsoleCP the best way?
When we launch a GUI application from console, the ConsoleCP can be used too?


@eryksun commented on GitHub (Jan 22, 2020):

Note that the console host is still broken for non-ASCII input when the input codepage is set to UTF-8 (65001). Unfortunately both the registry "CodePage" value and chcp.com set both the input and output codepage to a single value, so an admin or user can't easily set just the output codepage to UTF-8.

In particular, the calculation of `NumBytes` in [`_handlePostCharInputLoop`](https://github.com/microsoft/terminal/blob/cbb87b98b75e5cad68dd9e575b6a4e67c1e45841/src/host/readDataCooked.cpp#L1120) assumes that only wide glyphs with DBCS will need more than 1 byte per character. Then in [`TranslateUnicodeToOem`](https://github.com/microsoft/terminal/blob/cbb87b98b75e5cad68dd9e575b6a4e67c1e45841/src/host/dbcs.cpp#L155) we see the same assumption applied. It ends up calling [`ConvertToOem`](https://github.com/microsoft/terminal/blob/cbb87b98b75e5cad68dd9e575b6a4e67c1e45841/src/host/misc.cpp#L250) -- a thin wrapper around WINAPI `WideCharToMultiByte` -- with only 1 byte for the conversion. Non-ASCII UTF-8 requires 2-4 bytes per character, so all non-ASCII characters are translated as null bytes (`"\x00"`) in the result.
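
The size mismatch described in this comment can be illustrated outside the console host. The following is a minimal Python sketch, not the conhost code: it only counts how many bytes UTF-8 needs per character, which is why a conversion buffer sized at one byte per character cannot hold non-ASCII input. The helper name `utf8_len` and the sample characters are made up for this illustration.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode one character as UTF-8."""
    return len(ch.encode("utf-8"))


# One sample each from the ASCII, Latin, CJK, and emoji ranges.
samples = ["A", "é", "日", "🙂"]

for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_len(ch)} byte(s) in UTF-8")

# Characters that would still fit if the buffer allowed only 1 byte per character.
fits_in_one_byte = [ch for ch in samples if utf8_len(ch) <= 1]
```

Only the ASCII sample fits in a one-byte budget, which matches the observation that every non-ASCII character is lost.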

@driver1998 commented on GitHub (Jan 23, 2020):

@eryksun MS Pinyin IME works fine on CP65001 on ConHost though.


@driver1998 commented on GitHub (Jan 23, 2020):

> Is there any plan to add an environment variable for "UTF-8 mode"?
>
> For example, Python has PYTHONUTF8. Git for Windows uses LANG. Some tools use the GetConsoleCP.
> It seems there is no standard way to indicate "I want to use UTF-8 in this session!".
>
> Is the ConsoleCP the best way?
> When we launch a GUI application from console, the ConsoleCP can be used too?

Now (19H1+) there is a way to force UTF-8 CP in application manifest (so your -A APIs will use UTF-8), although the console CP is still separated from application CP...
https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

For Application CP, use GetACP(). For console, ConsoleCP should be fine.


@methane commented on GitHub (Jan 23, 2020):

> Now (19H1+) there is a way to force UTF-8 CP in application manifest (so your -A APIs will use UTF-8), although the console CP is still separated from application CP...
> https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

I know it but it doesn't help me because:

  • User can not change the manifest easily.
  • It is not pragmatic to change the manifest of all executables.

What I want is an environment variable that indicates the user wants to use UTF-8 in this session.

Python has PYTHONUTF8. But it changes only Python.
I want to have one environment variable which many application use.


@methane commented on GitHub (Jan 23, 2020):

> For Application CP, use GetACP(). For console, ConsoleCP should be fine.

This is not stable though. Windows Terminal starts OpenConsole, but wsl.exe starts another terminal.

![image](https://user-images.githubusercontent.com/199592/72972138-394c7e00-3e0e-11ea-8cbe-59bac0bc6dc6.png)

And when a Windows command is executed from WSL, a new conhost is started again. Its current code page is the legacy one, regardless of the code page of the first OpenConsole.

![image](https://user-images.githubusercontent.com/199592/72972420-c42d7880-3e0e-11ea-9f59-2f795dd7449c.png)

We are mixing many conhosts, and each has its own code page. When writing to and reading from a pipe, which encoding should be used?

In WSL, Linux commands use UTF-8 in most cases, because a UTF-8 locale is almost standard.

That's why I want a standard environment variable for UTF-8 mode. Let's call it WINUTF8 for now (any better name?).

  • When Windows Terminal (or mintty, VSCode Terminal, etc.) starts a shell (cmd.exe, PowerShell Core, git bash, etc.) in a console with code page 65001, WINUTF8=1 will be set.
  • WSL uses code page 65001 when WINUTF8=1 is set.
  • Some Windows applications will use WINUTF8 to opt in to UTF-8 mode.

When all modern developer tools support WINUTF8, Windows users can have a "UTF-8 everywhere" environment.
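
As a rough illustration of the opt-in described in this proposal, here is a small Python sketch of how a tool might consult such a variable. `WINUTF8` is only the placeholder name suggested in this comment, not an existing Windows convention, and the helper names and the `cp936` fallback are assumptions chosen for the example.

```python
import os
from typing import Mapping, Optional


def wants_utf8(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the hypothetical WINUTF8 variable is set to "1"."""
    env = os.environ if env is None else env
    return env.get("WINUTF8") == "1"


def pick_encoding(env: Optional[Mapping[str, str]] = None,
                  legacy: str = "cp936") -> str:
    """Choose UTF-8 when the session opted in, otherwise a legacy codepage.

    The legacy default is an arbitrary example value for this sketch.
    """
    return "utf-8" if wants_utf8(env) else legacy


if __name__ == "__main__":
    print(pick_encoding())
```

A shell started by a terminal in code page 65001 would export the variable once, and every child process that follows the convention would then agree on the encoding used for pipes.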

@eryksun commented on GitHub (Jan 23, 2020):

> MS Pinyin IME works fine on CP65001 on ConHost though.

@driver1998, I'm not familiar enough with the code to know the pathways taken in East-Asian locales. Maybe IME hacks the console service routine for ReadFile / ReadConsole? Can you write a little test app that sets the input codepage to 65001, writes the full range of the BMP (up to U+FFFF) to console input via WriteConsoleInputW (wide character), and reads it back via ReadFile?


@eryksun commented on GitHub (Jan 23, 2020):

> Python has PYTHONUTF8. But it changes only Python.
> I want to have one environment variable which many application use.

This is something to be addressed by the team that's in charge of the NLS API.

Locales address language and regional formatting rules. They really shouldn't concern text encodings, at least not in Windows NT, which was always a Unicode system. It shouldn't matter to the locale whether it's English or Hindi text -- except for the intersection with language rules (e.g. collation).

That said, most locales have legacy ANSI and OEM codepages because this was necessary prior to Unicode back in the 80s, and the NLS team needed to support non-Unicode applications, especially on Windows 9x systems that had hardly any Unicode support. But not all locales have legacy ANSI/OEM codepages. Some locales are Unicode only, i.e. their ANSI codepage is CP_ACP (0), which means they use the ANSI codepage of the system locale. It used to be that these locales couldn't be set as the system locale. In Windows 10, selecting one of these locales as the system locale implicitly enables beta UTF-8 support. For example, hi-IN (Hindi, India) is a Unicode-only locale:

```python
>>> GetLocaleInfoEx('hi-IN', LOCALE_IDEFAULTANSICODEPAGE, cp, 8)
2
>>> cp[:1]
'0'
```

Nowadays the Universal C Runtime (ucrt) supports using UTF-8 (CP_UTF8, i.e. 65001) in locales, and it defaults to UTF-8 for Unicode-only locales. For example:

```python
>>> setlocale(LC_ALL, 'en_UK.utf8')
'en_UK.utf8'
>>> ucrt.___lc_codepage_func()
65001

>>> setlocale(LC_ALL, 'hi_IN')
'hi-IN'
>>> ucrt.___lc_codepage_func()
65001
```

The NLS team could provide a setting for the per-user locale that overrides the active codepage (CP_ACP) and OEM codepage (CP_OEMCP) of a user's processes to use CP_UTF8 instead of the legacy codepage. This would be like what they already support in Windows 10 for the system locale (at least as a beta feature), but with a scope narrowed to a particular user --- and also to thread locales in a user's session (i.e. CP_THREAD_ACP). Implementing it per-user also allows the convenience of changing the value without requiring administrator access or a reboot. For command line environments, it could also allow the registry setting to be overridden by an environment variable. It could be a POSIX variable like LANG or LC_CTYPE, or they could come up with their own name(s) that don't interfere with existing usage of the POSIX names.

Applications that query either the active codepage via GetACP() or the active OEM codepage via GetOEMCP() will use UTF-8. For those that use the C runtime library, note that ucrt uses UTF-8 instead of the user locale's legacy ANSI codepage if GetACP() returns CP_UTF8. So applications that set and query the codepage via C setlocale and/or use CRT encoding functions such as wcstombs and mbstowcs will also use UTF-8 in accordance with the new per-user locale setting.


@driver1998 commented on GitHub (Feb 8, 2020):

> Some Windows applications will use the WINUTF8 to opt-in UTF-8 mode.

The problem is, there is no way to opt in to UTF-8 mode in the console on an app-by-app basis. Apps can output either UTF-16 (which does not care about ConsoleCP) or "ANSI" (which has to be in the ConsoleCP regardless of the Application CP, and UTF-8 is one of those CPs).

Well maybe WriteConsoleOutputA can output UTF-8 on a CP936 console, but I don't think printf can.

Maybe make WSL start conhost in UTF-8 CP when WINUTF8=1?


@driver1998 commented on GitHub (Feb 8, 2020):

Well WriteConsoleA honors Console CP, so no UTF-8 on CP936.

Some tests:
System CP: Chinese Simplified GBK, CP936
Console CP: CP936

A bog-standard ANSI app, which uses GBK
![Annotation 2020-02-09 051144](https://user-images.githubusercontent.com/22699485/74092101-afcdc900-4afa-11ea-80e4-578bb31d1878.png)

An ANSI app with UTF-8 codepage specified
![Annotation 2020-02-09 051306](https://user-images.githubusercontent.com/22699485/74092118-e0adfe00-4afa-11ea-96ba-7fc5d3dc67dd.png)

@eryksun commented on GitHub (Feb 10, 2020):

> Well WriteConsoleA honors Console CP, so no UTF-8 on CP936.

The console's multibyte-string functions such as ReadConsoleA and WriteConsoleA are special in the Windows "ANSI" API because they're I/O functions that read and write from a file. As a necessity for terminal applications, the console API needs the flexibility to handle arbitrary codepages.

The default codepage is the active OEM codepage of the conhost.exe process. In principle this should be configurable as the "CodePage" value in "HKCU\Console", but that still doesn't work as intended. This value only works in the per-window subkey settings. (For convenience the codepage value should be made configurable in the property dialog, and also in the shell-link property dialog.) Anyway, if the system locale is set to UTF-8, the active OEM codepage is UTF-8, and the console follows suit. If the NLS team implemented the extension for the user locale that I suggested above, the result would be similar given the user enables UTF-8. Then all that's needed is for the console to finally support reading non-ASCII text from the input buffer as UTF-8 -- at least 24 years after UTF-8 was introduced.

Note that since the console is a shared I/O resource, any one process attached to it doesn't get to dictate its global input and output codepages. Any program that attaches to the console at any time can change these values, or even library code in your process might sneak in a change. Thus in principle the values can't be relied on as constants, but many programs check once and assume the values are constant. For example, classic Python prior to 3.6 works like this when setting the encoding of the sys.std* streams. It's a recipe for mojibake.
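
The mojibake risk mentioned in this comment can be reproduced with plain byte strings. This is a minimal Python sketch rather than a console test: codepage 437 merely stands in for whatever legacy codepage a program cached at startup, and the sample text is arbitrary.

```python
# Text written by a program that believes the console output codepage is UTF-8.
text = "é"
raw = text.encode("utf-8")

# A reader that still assumes the codepage it saw earlier (cp437 in this sketch)
# interprets the same bytes as different characters.
seen_by_stale_reader = raw.decode("cp437")

# A reader that re-checks the codepage and finds UTF-8 recovers the text.
seen_by_fresh_reader = raw.decode("utf-8")

print(repr(seen_by_stale_reader), repr(seen_by_fresh_reader))
```

The bytes never change; only the assumed codepage does, which is why a value checked once and treated as constant is fragile.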


@lanyizi commented on GitHub (Apr 16, 2020):

Any updates on this?


@zadjii-msft commented on GitHub (Apr 16, 2020):

@BSG-75 Nope, when there is an update to this, we'll make sure to post in this thread 😉


@driver1998 commented on GitHub (Jul 26, 2020):

I still want to know how many real-world console apps expect legacy codepages (especially ones that output legacy-codepage strings like GBK), and whether it is a good idea to break them.

Because that's why system-wide UTF-8 is still labeled as beta.

Given that many modern Windows command line apps are ported from the *nix world, and the modern principle seems to be command line apps should use English, I guess it is acceptable?


@methane commented on GitHub (Jul 27, 2020):

> I still want to know how many real-world console apps expect legacy codepages (especially ones that output legacy-codepage strings like GBK), and whether it is a good idea to break them.
>
> Because that's why system-wide UTF-8 is still labeled as beta.

A system-wide setting breaks legacy applications. That's why we need a per-session option instead.

> Given that many modern Windows command line apps are ported from the *nix world, and the modern principle seems to be command line apps should use English, I guess it is acceptable?

Some modern CLI applications (Python and Go) use WriteConsoleW to write to the console. But Python still uses the legacy encoding for pipes.
And some modern applications from Unix always use UTF-8. Cygwin/MSYS also use UTF-8 (they use environment variables like LANG).

To use such applications, we want a UTF-8 session in the VSCode terminal and Windows Terminal.
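
To see the console-versus-pipe difference on a given machine, a short Python script can report what the interpreter picked. This is only a diagnostic sketch: the function name `describe_stdout` is made up here, and the values it reports depend on the platform, the Python version, and whether output is redirected.

```python
import locale
import sys


def describe_stdout() -> dict:
    """Collect the encoding-related settings Python applies to standard output."""
    return {
        # True when stdout is attached to a console/terminal, False for a pipe or file.
        "is_console": sys.stdout.isatty(),
        # Encoding Python actually uses when writing text to stdout.
        "stream_encoding": sys.stdout.encoding,
        # Locale encoding that Python falls back to for redirected streams.
        "locale_encoding": locale.getpreferredencoding(False),
        # True when UTF-8 mode is on (for example via PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


if __name__ == "__main__":
    for key, value in describe_stdout().items():
        print(f"{key}: {value}")
```

Running it once directly and once piped into another command shows whether the two cases end up with different encodings.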


@o-sdn-o commented on GitHub (May 20, 2023):

Since std input works well with UTF-8 encoding today (#14745), maybe it's worth adding some sort of syntactic sugar to push/pop the initial state of the system code page for new Windows console applications? Wrap it up in a single API call or put it into a specific header file (e.g. <iostream>) to reduce the following boilerplate code at the beginning:

```c++
#define UTF8_EVERYWHERE
#include <cstdlib>   // std::atexit
#include <iostream>
#include <string>

// Put this block inside <iostream> on windows?
#ifdef _WIN32
    #ifdef UTF8_EVERYWHERE
        #include <windows.h>
        namespace winapi_cp_state
        {
            static UINT ou_state = GetConsoleOutputCP(); // Save the console's original output code page.
            static UINT in_state = GetConsoleCP();       // Save the console's original input code page.
            static void set_page(UINT out, UINT in) { SetConsoleOutputCP(out); SetConsoleCP(in); }
            static void restore_page() { set_page(ou_state, in_state); }
            // Switch to UTF-8 during static initialization and restore the original code pages at exit.
            static int _state = (set_page(CP_UTF8, CP_UTF8), std::atexit(restore_page));
        }
    #endif
#endif

// x-platform code
int main()
{
    std::cout << "Test: あああ🙂🙂🙂日本👌中文👍Кириллица\n"; // Make sure you save your source file with 65001 (UTF-8) encoding.
    std::cout << "Enter text: ";
    std::string utf8;
    std::cin >> utf8; // Reads a single whitespace-delimited word.
    std::cout << "UTF-8 text: " << utf8 << std::endl;
    return 0;
}
```

The #define UTF8_EVERYWHERE key is used to indicate the programmer’s intention to use UTF-8 encoding instead of original system code page.
This would be extremely helpful for newbie console programmers. All of them are completely confused with text encodings. In addition, cross-platform compatibility is achieved automatically.


@zadjii-msft commented on GitHub (Jul 5, 2023):

xref some discussion in #15504

pretty sure our plan was to do:

  • compatibility.defaultToutf8 in the Terminal settings, which sets
  • a --flag to conpty to tell it to default to CP65001 instead.

Now it's just a matter of plumbing, and deciding if we really want to do the --flags thing or the --defaultToUtf8 thing.


More notes

  • #11591
  • #10870
  • #9174
  • #15678
  • #1802

conhost --codepage 65001 to start in utf-8, or accept an arbitrary one.
conhost --codepage WITHOUT AN ARG to use the one in HKCU/Console/Codepage
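
The two `--codepage` forms in these notes amount to a three-way choice. The Python sketch below only models that decision and is not conhost's argument parser: the function name, the idea of passing the registry value in as a parameter, and falling back to the OEM codepage when the flag is absent are all assumptions made for illustration.

```python
import argparse

# Sentinel meaning "--codepage was given without a value".
FROM_REGISTRY = object()


def resolve_codepage(argv, registry_codepage, oem_codepage):
    """Pick the startup codepage from a hypothetical --codepage flag.

    --codepage N     -> use N (for example 65001 for UTF-8)
    --codepage       -> use the value stored in the registry
    (flag absent)    -> keep the default OEM codepage (assumed fallback)
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--codepage", nargs="?", type=int,
                        const=FROM_REGISTRY, default=None)
    args, _unknown = parser.parse_known_args(argv)

    if args.codepage is None:
        return oem_codepage
    if args.codepage is FROM_REGISTRY:
        return registry_codepage
    return args.codepage


if __name__ == "__main__":
    print(resolve_codepage(["--codepage", "65001"], registry_codepage=936, oem_codepage=437))
```

Keeping the registry lookup outside the function makes the three cases easy to test without touching real settings.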
Reference: starred/terminal#2524