[PR #4496] Improve support for VT character sets #25801

Closed
opened 2026-01-31 09:11:52 +00:00 by claunia · 0 comments

Original Pull Request: https://github.com/microsoft/terminal/pull/4496

State: closed
Merged: Yes


This PR improves our VT character set support, enabling the SCS
escape sequences to designate into all four G-sets with both 94- and
96-character sets, and supports invoking those G-sets into both the GL
and GR areas of the code table, with locking shifts and single
shifts. It also adds DOCS sequences to switch between UTF-8 and the
ISO-2022 coding system (which is what the VT character sets require),
and adds support for a lot more character sets, up to around the level
of a VT510.

Detailed Description of the Pull Request / Additional comments

To make it easier for us to declare a bunch of character sets, I've made
a little constexpr class that can build up a mapping table from a base
character set (ASCII or Latin1), along with a collection of mappings for
the characters that deviate from the base set. Many of the character sets
are simple variations of ASCII, so they're easy to define this way.
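
As a rough illustration of the idea (not the class from this PR), a table
builder along these lines could look like the sketch below. The names
CharsetTableSketch, BritishSketch and BritishView are hypothetical, and the
sketch assumes the table starts at 0x20 so that it lines up with the GL
range described later.

```cpp
#include <array>
#include <cstddef>
#include <initializer_list>
#include <string_view>
#include <utility>

// Hypothetical builder, not the class from this PR. It fills a table with
// consecutive code points starting at `first`, then applies the overrides.
template<std::size_t Size>
class CharsetTableSketch
{
public:
    using Override = std::pair<wchar_t, wchar_t>; // { base character, replacement }

    constexpr CharsetTableSketch(wchar_t first, std::initializer_list<Override> overrides) :
        _chars{}
    {
        for (std::size_t i = 0; i < Size; ++i)
        {
            _chars[i] = static_cast<wchar_t>(first + i);
        }
        for (const auto& entry : overrides)
        {
            _chars[static_cast<std::size_t>(entry.first - first)] = entry.second;
        }
    }

    // Lets the table be used wherever a wstring_view is expected.
    constexpr operator std::wstring_view() const noexcept
    {
        return { _chars.data(), Size };
    }

private:
    std::array<wchar_t, Size> _chars;
};

// A British-style set: ASCII from 0x20 to 0x7E, except that '#' becomes the pound sign.
inline constexpr CharsetTableSketch<95> BritishSketch{ L'\x20', { { L'#', L'\u00A3' } } };
inline constexpr std::wstring_view BritishView = BritishSketch;
```

Because everything is constexpr, a table declared this way costs nothing at
run time; the view simply points at static data.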

This class then casts directly to a wstring_view which is how the
translation tables are represented in most of the code. We have an array
of four of these tables representing the four G-sets, two instances for
the active left and right tables, and one instance for the single shift
table.

Initially we had just one DesignateCharset method, which could select
the active character set. We now have two designate methods (for 94- and
96- character sets), and each takes a G-set number specifying the target
of the designation, and a pair of characters identifying the character
set that will be designated (at the higher VT levels, character sets are
often identified by more than one character).
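
For context, the G-set number and the 94/96 distinction both come from the
intermediate byte of the SCS sequence. The sketch below shows that mapping
as defined for VT terminals; the struct and function names are hypothetical
and are not taken from this PR.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical helper types: names are illustrative only.
struct DesignationTarget
{
    std::size_t gset;    // 0..3 for G0..G3
    std::size_t setSize; // 94 or 96
};

// Maps the intermediate byte of an SCS sequence (ESC <intermediate> <id...>)
// to the G-set it targets and the size of the set being designated.
// A 96-character set cannot be designated into G0.
constexpr std::optional<DesignationTarget> ParseScsIntermediate(wchar_t intermediate) noexcept
{
    switch (intermediate)
    {
    case L'(':
        return DesignationTarget{ 0, 94 };
    case L')':
        return DesignationTarget{ 1, 94 };
    case L'*':
        return DesignationTarget{ 2, 94 };
    case L'+':
        return DesignationTarget{ 3, 94 };
    case L'-':
        return DesignationTarget{ 1, 96 };
    case L'.':
        return DesignationTarget{ 2, 96 };
    case L'/':
        return DesignationTarget{ 3, 96 };
    default:
        return std::nullopt;
    }
}
```

With this mapping, ESC ) 0 would designate a 94-character set identified by
"0" into G1, and ESC ( % 5 shows why two identifying characters are needed.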

There are then two new LockingShift methods to invoke these G-sets
into either the GL or GR area of the code table, and a SingleShift
method which invokes a G-set temporarily (for just the next character
that is output).
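
To make the relationship between designating and invoking concrete, here is
a minimal sketch of the state involved. It is an assumption-laden model, not
the PR's code: the class name is hypothetical, it stores G-set numbers
rather than copies of the tables, and it assumes the defaults of G0 in GL
and G2 in GR.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical state holder; the real class and member names may differ.
class CharsetStateSketch
{
public:
    // SCS: record a table in G0..G3. Nothing changes on screen unless that
    // G-set is currently invoked into GL or GR.
    constexpr bool Designate(std::size_t gset, std::wstring_view table) noexcept
    {
        if (gset >= GsetCount) { return false; }
        _gsets[gset] = table;
        return true;
    }

    // LS0..LS3 (SI, SO, ESC n, ESC o): invoke a G-set into GL.
    constexpr bool LockingShift(std::size_t gset) noexcept
    {
        if (gset >= GsetCount) { return false; }
        _gl = gset;
        return true;
    }

    // LS1R..LS3R (ESC ~, ESC }, ESC |): invoke a G-set into GR. There is no LS0R.
    constexpr bool LockingShiftRight(std::size_t gset) noexcept
    {
        if (gset == 0 || gset >= GsetCount) { return false; }
        _gr = gset;
        return true;
    }

    // SS2 / SS3 (ESC N, ESC O): use G2 or G3 for the next character only.
    constexpr bool SingleShift(std::size_t gset) noexcept
    {
        if (gset != 2 && gset != 3) { return false; }
        _singleShift = gset;
        return true;
    }

    constexpr std::wstring_view Left() const noexcept { return _gsets[_gl]; }
    constexpr std::wstring_view Right() const noexcept { return _gsets[_gr]; }

    // Table for the next GL character. A pending single shift is consumed here.
    constexpr std::wstring_view NextCharacterTable() noexcept
    {
        if (_singleShift != NoSingleShift)
        {
            const auto table = _gsets[_singleShift];
            _singleShift = NoSingleShift;
            return table;
        }
        return Left();
    }

private:
    static constexpr std::size_t GsetCount = 4;
    static constexpr std::size_t NoSingleShift = GsetCount;

    std::array<std::wstring_view, GsetCount> _gsets{};
    std::size_t _gl = 0; // assumed default: G0 in GL
    std::size_t _gr = 2; // assumed default: G2 in GR
    std::size_t _singleShift = NoSingleShift;
};

// Usage: the strings are placeholder tables, chosen only to be recognizable.
constexpr CharsetStateSketch ExampleState()
{
    CharsetStateSketch state{};
    state.Designate(0, L"ascii");
    state.Designate(1, L"graphics");
    state.Designate(2, L"supplemental");
    state.LockingShift(1); // LS1: G1 now occupies GL
    state.SingleShift(2);  // SS2: G2 applies to the next character only
    return state;
}
```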

I should mention here that I had to make some changes to the state
machine to make these single shift sequences work. The problem is that
the input state machine treats SS3 as the start of a control sequence,
while the output state machine needs it to be dispatched immediately
(it's literally the Single Shift 3 escape sequence). To make that
work, I've added a ParseControlSequenceAfterSs3 callback in the
IStateMachineEngine interface to decide which behavior is appropriate.
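
The decision can be pictured roughly as follows. This is only a model of
the behavior described above: the two engine structs and the Ss3Action enum
are hypothetical, and static functions stand in for the real callback on
the IStateMachineEngine interface.

```cpp
// Hypothetical stand-ins for the two engines. In the PR the decision is made
// by a ParseControlSequenceAfterSs3 callback on IStateMachineEngine; the
// static functions below only imitate that choice.
struct InputEngineSketch
{
    // Input: ESC O introduces a longer sequence (for example ESC O P for F1).
    static constexpr bool ParseControlSequenceAfterSs3() noexcept { return true; }
};

struct OutputEngineSketch
{
    // Output: ESC O is complete on its own and means Single Shift 3.
    static constexpr bool ParseControlSequenceAfterSs3() noexcept { return false; }
};

enum class Ss3Action
{
    ParseControlSequence, // keep collecting parameters and a final character
    DispatchImmediately   // dispatch the escape sequence right away
};

// What the state machine would do after seeing ESC O, for a given engine.
template<typename Engine>
constexpr Ss3Action ActionAfterSs3() noexcept
{
    return Engine::ParseControlSequenceAfterSs3() ? Ss3Action::ParseControlSequence :
                                                    Ss3Action::DispatchImmediately;
}
```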

When it comes to mapping a character, it's simply an array reference
into the appropriate wstring_view table. If the single shift table is
set, that takes precedence. Otherwise the GL table is used for
characters in the range 0x20 to 0x7F, and the GR table for characters
0xA0 to 0xFF (technically some character sets will only map up to 0x7E
and 0xFE, but that's easily controlled by the length of the
wstring_view).
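
A minimal sketch of that lookup, under the assumption that GL tables are
indexed from 0x20 and GR tables from 0xA0, and that a character past the
end of a table is passed through unchanged. The function names and demo
tables are hypothetical; clearing the single shift table after use is left
to the caller.

```cpp
#include <cstddef>
#include <string_view>

// Index into a table whose first entry corresponds to `first`.
// Characters beyond the end of the table are returned unchanged.
constexpr wchar_t LookUpSketch(std::wstring_view table, wchar_t ch, wchar_t first) noexcept
{
    const auto index = static_cast<std::size_t>(ch - first);
    return index < table.size() ? table[index] : ch;
}

// Hypothetical translation step: an empty singleShift view means "not set".
constexpr wchar_t TranslateSketch(wchar_t ch,
                                  std::wstring_view gl,
                                  std::wstring_view gr,
                                  std::wstring_view singleShift = {}) noexcept
{
    const bool inLeft = ch >= 0x20 && ch <= 0x7F;
    const bool inRight = ch >= 0xA0 && ch <= 0xFF;
    if (!inLeft && !inRight)
    {
        return ch; // control characters and anything above 0xFF are untouched
    }

    const wchar_t first = inLeft ? L'\x20' : L'\xA0';
    if (!singleShift.empty())
    {
        return LookUpSketch(singleShift, ch, first);
    }
    return LookUpSketch(inLeft ? gl : gr, ch, first);
}

// Tiny demo tables, deliberately short to show the length check.
inline constexpr std::wstring_view DemoGl = L" !\"\u00A3";      // '#' maps to the pound sign
inline constexpr std::wstring_view DemoGr = L"\u00A0\u00BF";    // 0xA1 maps to U+00BF
inline constexpr std::wstring_view DemoSingle = L" \u2666";     // '!' maps to U+2666
```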

The DEL character is a bit of a special case. By default it's meant to
be ignored like the NUL character (it's essentially a time-fill
character). However, it's possible that it could be remapped to a
printable character in a 96-character set, so we need to check for that
after the translation. This is handled in the AdaptDispatch::Print
method, so it doesn't interfere with the primary PrintString code
path.
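
The check amounts to "translate first, then drop the character only if it
is still DEL". The sketch below illustrates that ordering with hypothetical
helper names and a made-up 96-character table; it is not the code in
AdaptDispatch::Print.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

constexpr wchar_t DelChar = L'\x7F';

// Hypothetical GL lookup: tables start at 0x20, unmapped characters pass through.
constexpr wchar_t TranslateGlSketch(wchar_t ch, std::wstring_view table) noexcept
{
    if (ch < 0x20 || ch > 0x7F)
    {
        return ch;
    }
    const auto index = static_cast<std::size_t>(ch - 0x20);
    return index < table.size() ? table[index] : ch;
}

// True if the character should still be printed after translation.
constexpr bool ShouldPrintSketch(wchar_t ch, std::wstring_view table) noexcept
{
    return TranslateGlSketch(ch, table) != DelChar;
}

// Demo 96-character table: identity mapping, except 0x7F becomes U+00FF.
constexpr std::array<wchar_t, 96> MakeDemoSet96() noexcept
{
    std::array<wchar_t, 96> chars{};
    for (std::size_t i = 0; i < chars.size(); ++i)
    {
        chars[i] = static_cast<wchar_t>(0x20 + i);
    }
    chars[95] = L'\u00FF';
    return chars;
}

inline constexpr auto DemoSet96Chars = MakeDemoSet96();
inline constexpr std::wstring_view DemoSet96{ DemoSet96Chars.data(), DemoSet96Chars.size() };
// The same table cut off at 0x7E, as a 94-character set would be.
inline constexpr std::wstring_view DemoSet94 = DemoSet96.substr(0, 95);
```

With the shorter table DEL is left untranslated and dropped; with the full
96-entry table it becomes a printable character and is kept.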

The biggest problem with this whole process, though, is that the GR
mappings only really make sense if you have access to the raw output,
but by the time the output gets to us, it would already have been
translated to Unicode by the active code page. And in the case of UTF-8,
the characters we eventually receive may originally have been composed
from two or more code points.

The way I've dealt with this is to disable the GR translations by
default, and then add support for a pair of ISO-2022 DOCS sequences,
which can switch the code page between UTF-8 and ISO-8859-1. When the
code page is ISO-8859-1, we're essentially receiving the raw output
bytes, so it's safe to enable the GR translations. This is not strictly
correct ISO-2022 behavior, and there are edge cases where it's not going
to work, but it's the best solution I could come up with.
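
In outline, the DOCS handling described here could be modeled like this.
The final characters come from ISO 2022 (ESC % G for UTF-8, ESC % @ to
return to ISO 2022) and the numbers are the Windows code page identifiers
for UTF-8 and ISO-8859-1; the enum and function names are hypothetical.

```cpp
#include <optional>

// Hypothetical names; only the final characters and code page numbers are standard.
enum class CodingSystemSketch
{
    Utf8,
    Iso2022
};

// DOCS is ESC % <final>. 'G' selects UTF-8 and '@' returns to ISO 2022.
constexpr std::optional<CodingSystemSketch> ParseDocsFinal(wchar_t finalChar) noexcept
{
    switch (finalChar)
    {
    case L'G':
        return CodingSystemSketch::Utf8;
    case L'@':
        return CodingSystemSketch::Iso2022;
    default:
        return std::nullopt;
    }
}

// Windows code page identifiers: 65001 is UTF-8, 28591 is ISO-8859-1.
constexpr unsigned int CodePageFor(CodingSystemSketch system) noexcept
{
    return system == CodingSystemSketch::Utf8 ? 65001u : 28591u;
}

// GR translations are only safe when the raw bytes arrive unmodified.
constexpr bool GrTranslationEnabled(CodingSystemSketch system) noexcept
{
    return system == CodingSystemSketch::Iso2022;
}
```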

Validation Steps Performed

As a result of the SS3 changes in the state machine engine, I've had
to move the existing SS3 tests from the OutputEngineTest to the
InputEngineTest, otherwise they would now fail (technically they
should never have been output tests).

I've added no additional unit tests, but I have done a lot of manual
testing, and made sure we passed all the character set tests in Vttest
(at least for the character sets we currently support). Note that this
required a slightly hacked version of the app, since by default it
doesn't expose a lot of the tests to low-level terminals, and we
currently identify as a VT100.

Closes #3377
Closes #3487

claunia added the pull-request label 2026-01-31 09:11:52 +00:00

Reference: starred/terminal#25801