Refactor the way VT control sequences are identified and dispatched #10139

Closed
opened 2026-01-31 02:13:31 +00:00 by claunia · 6 comments
Owner

Originally created by @j4james on GitHub (Aug 13, 2020).

Description of the new feature/enhancement

The way VT sequences are currently processed, we build up a list of intermediate characters (and private parameter prefixes) in a wchar_t vector, and then pass both that vector and the final character through to the state machine engine to dispatch. The engine goes through a rather complicated process where it has to check for intermediates, and branch off into separate methods, just to figure out which command has actually been requested.

So to make things simpler, I want to propose we identify each control sequence with a single integer value, which is a combination of the private parameter prefix, the intermediates, and the final char. This identifier can then be used in a simple switch statement when dispatching commands, without having to worry about special case handling for intermediates.

Right now we only have a few commands that depend on intermediates and private prefixes, and it's already a bit of a pain to work with. Once we start adding more advanced commands, though, it's going to get a lot worse. So before we get to that point, I thought it would be worth refactoring the code to make it a little easier to extend.

Proposed technical implementation details (optional)

I've been investigating different ways of implementing this, and the option I found best was a VTID class with a single size_t field holding the sequence value, and a size_t cast operator that will return that value. It would also have a constexpr constructor that can build the value from a string, so the action code enums can easily be defined with something like DECSTR_SoftReset = VTID("!p"). Currently we're pretending that DECSTR is identified by p, which is actually something else entirely.

The actual structure of the sequence value is simply made up from the intermediate and final characters stored one byte at a time in reverse order. For example, the DECTLTC control has a private parameter prefix of ?, one intermediate of ', and a final character of s. The ASCII values of those characters are 0x3F, 0x27, and 0x73 respectively, and reversing them gets you 0x73273F, so that would then be the identifier for the control.

The reason for storing them in reverse, is because sometimes we need to look at the first intermediate to determine the operation, and treat the rest of the sequence as a kind of sub-identifier (the character set designation sequences are one example of this). This is easily achieved by masking off the low byte to get the first intermediate, and then shifting the value right by 8 bits to get a new identifier with the rest of the sequence.

The downside of the reverse order is it makes constructing the sequence in the state machine a little more complicated, because you need to keep track of how many characters have already been stored in the identifier (each new character needs to be shifted 8-bit further left). The way I've approached that was with a little VTIDBuilder class, that has both an accumulator for the value being built up, and a shift value that determine the offset at which each new character is added.

And note that we don't need to worry too much about overflowing, because any sequence that wouldn't fit in a size_t would not be a valid sequence anyway. If it looks like it's going to overflow, we simply set the accumulator to zero, and stop accumulating further characters. When it's eventually dispatched, it'll just be ignored as an unrecognised identifier.

This is quite a big change to the code base, but I do think it makes things simpler, and should make the process of adding new multicharacter control sequences a lot less hassle.

Originally created by @j4james on GitHub (Aug 13, 2020). # Description of the new feature/enhancement The way VT sequences are currently processed, we build up a list of intermediate characters (and private parameter prefixes) in a wchar_t vector, and then pass both that vector and the final character through to the state machine engine to dispatch. The engine goes through a rather complicated process where it has to check for intermediates, and branch off into separate methods, just to figure out which command has actually been requested. So to make things simpler, I want to propose we identify each control sequence with a single integer value, which is a combination of the private parameter prefix, the intermediates, and the final char. This identifier can then be used in a simple switch statement when dispatching commands, without having to worry about special case handling for intermediates. Right now we only have a few commands that depend on intermediates and private prefixes, and it's already a bit of a pain to work with. Once we start adding more advanced commands, though, it's going to get a lot worse. So before we get to that point, I thought it would be worth refactoring the code to make it a little easier to extend. # Proposed technical implementation details (optional) I've been investigating different ways of implementing this, and the option I found best was a `VTID` class with a single `size_t` field holding the sequence value, and a `size_t` cast operator that will return that value. It would also have a constexpr constructor that can build the value from a string, so the action code enums can easily be defined with something like `DECSTR_SoftReset = VTID("!p")`. Currently we're pretending that `DECSTR` is identified by `p`, which is actually something else entirely. The actual structure of the sequence value is simply made up from the intermediate and final characters stored one byte at a time in reverse order. For example, the `DECTLTC` control has a private parameter prefix of `?`, one intermediate of `'`, and a final character of `s`. The ASCII values of those characters are 0x3F, 0x27, and 0x73 respectively, and reversing them gets you 0x73273F, so that would then be the identifier for the control. The reason for storing them in reverse, is because sometimes we need to look at the first intermediate to determine the operation, and treat the rest of the sequence as a kind of sub-identifier (the character set designation sequences are one example of this). This is easily achieved by masking off the low byte to get the first intermediate, and then shifting the value right by 8 bits to get a new identifier with the rest of the sequence. The downside of the reverse order is it makes constructing the sequence in the state machine a little more complicated, because you need to keep track of how many characters have already been stored in the identifier (each new character needs to be shifted 8-bit further left). The way I've approached that was with a little `VTIDBuilder` class, that has both an _accumulator_ for the value being built up, and a _shift_ value that determine the offset at which each new character is added. And note that we don't need to worry too much about overflowing, because any sequence that wouldn't fit in a `size_t` would not be a valid sequence anyway. If it looks like it's going to overflow, we simply set the accumulator to zero, and stop accumulating further characters. When it's eventually dispatched, it'll just be ignored as an unrecognised identifier. This is quite a big change to the code base, but I do think it makes things simpler, and should make the process of adding new multicharacter control sequences a lot less hassle.
Author
Owner

@DHowett commented on GitHub (Aug 13, 2020):

I'm in favor. @zadjii-msft / @miniksa if you're in favor, one of you yank Triage 😄

@DHowett commented on GitHub (Aug 13, 2020): I'm in favor. @zadjii-msft / @miniksa if you're in favor, one of you yank Triage :smile:
Author
Owner

@DHowett commented on GitHub (Aug 13, 2020):

It's clever, and I like it. It also sets us up for tabled dispatch (if we ever want something like that).

Are there only ever 0-2 intermediates? I noticed that we support N where N fits in a size_t apparently o_O

@DHowett commented on GitHub (Aug 13, 2020): It's clever, and I like it. It also sets us up for tabled dispatch (if we ever want something like that). Are there only ever 0-2 intermediates? I noticed that we support `N` where `N` fits in a `size_t` apparently o_O
Author
Owner

@j4james commented on GitHub (Aug 13, 2020):

According to the DEC STD 070 specs, "conforming implementations shall recognize at least 3 intermediates", although the most I think I've seen in practice is 2 (e.g. designating the Portuguese character set is ESC ( % 6). So worst case theoretically would be 5 bytes (a private prefix, 3 intermediates, and a final), so maybe uint64_t would be better than size_t to be safe in a 32-bit build. In practice we'd probably never need that much though.

@j4james commented on GitHub (Aug 13, 2020): According to the DEC STD 070 specs, "conforming implementations shall recognize at least 3 intermediates", although the most I think I've seen in practice is 2 (e.g. designating the Portuguese character set is `ESC ( % 6`). So worst case theoretically would be 5 bytes (a private prefix, 3 intermediates, and a final), so maybe `uint64_t` would be better than `size_t` to be safe in a 32-bit build. In practice we'd probably never need that much though.
Author
Owner

@miniksa commented on GitHub (Aug 13, 2020):

I'd rather not get bitten by the difference in size_t on 32 vs 64, no matter how unlikely. I would feel absolutely awful discovering that was the issue at the bottom of the bug. Rather you do uint64_t in that case.

@miniksa commented on GitHub (Aug 13, 2020): I'd rather not get bitten by the difference in `size_t` on 32 vs 64, no matter how unlikely. I would feel absolutely awful discovering that was the issue at the bottom of the bug. Rather you do `uint64_t` in that case.
Author
Owner

@miniksa commented on GitHub (Aug 13, 2020):

Also, I'm fine with this proposal. When I originally put the state machine together this way, I didn't necessarily think quite this far ahead nor imagine that the parser would become quite this feature filled. It is likely about time that we have a better mechanism for sorting these things out.

I like the proposal with the VTID() and the rationale behind the storage. I knew the single letter thing wouldn't last forever the moment I saw the first collision. Let's do it.

@miniksa commented on GitHub (Aug 13, 2020): Also, I'm fine with this proposal. When I originally put the state machine together this way, I didn't necessarily think quite this far ahead nor imagine that the parser would become quite this feature filled. It is likely about time that we have a better mechanism for sorting these things out. I like the proposal with the `VTID()` and the rationale behind the storage. I knew the single letter thing wouldn't last forever the moment I saw the first collision. Let's do it.
Author
Owner

@ghost commented on GitHub (Aug 26, 2020):

:tada:This issue was addressed in #7304, which has now been successfully released as Windows Terminal Preview v1.3.2382.0.🎉

Handy links:

@ghost commented on GitHub (Aug 26, 2020): :tada:This issue was addressed in #7304, which has now been successfully released as `Windows Terminal Preview v1.3.2382.0`.:tada: Handy links: * [Release Notes](https://github.com/microsoft/terminal/releases/tag/v1.3.2382.0) * [Store Download](https://www.microsoft.com/store/apps/9n8g5rfz9xk3?cid=storebadge&ocid=badge)
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/terminal#10139