Extremely slow performance when processing virtual terminal sequences #14160

Closed
opened 2026-01-31 04:02:30 +00:00 by claunia · 47 comments
Owner

Originally created by @cmuratori on GitHub (Jun 8, 2021).

Windows Terminal version (or Windows build number)

1.8.1521.0

Other Software

No response

Steps to reproduce

Using any command line utility that produces virtual terminal sequences for setting the colors of individual characters, the performance of the terminal drops by a factor of around 40.

To measure this effect precisely, you can use the F2 key in termbench and observe the performance difference between color-per-character output and single-color output:

https://github.com/cmuratori/termbench/releases/tag/V1

Expected Behavior

Despite the increased parsing load, modern CPUs should not have a problem parsing per-character color escape codes quickly. I would expect the performance of the terminal to be able to sustain roughly the same frame rate with per-character color codes as without, and if there was a performance drop, I wouldn't expect it to be anything close to 40x.

Actual Behavior

The speed of per-character color output is 40x slower than the speed of single-color output.

claunia added the Needs-Triage, Needs-Tag-Fix, Area-Performance labels 2026-01-31 04:02:30 +00:00

@skyline75489 commented on GitHub (Jun 8, 2021):

Thanks for the amazing benchmark tool! I'm sure @miniksa will be interested in trying it out.

Yeah, the current performance of colored output is not as fast as non-colored output. However, the bottleneck is not in *parsing* but rather in *rendering*, in my opinion. To be specific, when it comes to a terminal there are a lot of things that can hurt performance, for example ConPTY, the DxRenderer, memory allocation, etc. Anyway, this is a good tool for measuring the performance of both conhost and the terminal.


@cmuratori commented on GitHub (Jun 8, 2021):

Can you explain what you think is the slow part of rendering characters in multiple colors (for those of us unfamiliar with how the renderer of your console works)?


@skyline75489 commented on GitHub (Jun 8, 2021):

If it means anything, here's a sample WPR trace running your benchmark tool:

![image](https://user-images.githubusercontent.com/4710575/121150157-dc702d00-c875-11eb-850c-6740b6f81882.png)

As you can see, roughly 70% of CPU time is consumed by `RenderThread`, which in this case (Windows Terminal) relies on `DxRenderer` for the actual rendering job. The `OutputThread` (where the VT parsing and related work reside) only takes 10%–20% of the CPU time. Note that this pattern of CPU usage can be seen in almost every program with color-heavy output (cmatrix, for example).

Regarding colored vs. non-colored text, my initial observation is that drawing a colored text frame takes longer than a non-colored one, which leaves less CPU for the `OutputThread` and causes the FPS to drop.


@skyline75489 commented on GitHub (Jun 8, 2021):

Deep down in the `RenderThread`, we use DirectX to do the actual drawing. In a perfect world, we'd like to see `RenderThread` take most of the CPU time, which would mean we are not wasting too much CPU on VT processing. But even if we managed to make `RenderThread` consume 90% of the CPU, that still wouldn't give you rendering performance for colored text anywhere near that of non-colored text.

Another example that might help: if we `cat` a very long file, which gives us non-colored output, the WPR trace usually shows about 60%–70% of CPU consumed by `OutputThread` and only 30% consumed by `RenderThread`. See where this is going? Rendering non-colored text is simply a much cheaper operation for the renderer, which helps the FPS a lot.

In conclusion, I know there's a lot of room for performance tuning in the Windows Terminal, but I honestly can't expect the performance of colored text to be close to that of non-colored text. Even with a hardware-accelerated solution like DirectX, we still face performance bottlenecks from the rendering stack and GPUs.


@cmuratori commented on GitHub (Jun 8, 2021):

Two things:

  1. Can you expand that entire Render::PaintFrame trace? I assume it has more detailed attribution of time there?

  2. I'm not sure I understand what you're suggesting. Are you actually saying that you think a GPU will slow down substantially if it has to use a different color for each character?


@mmozeiko commented on GitHub (Jun 8, 2021):

I cannot profile the Terminal .exe as there are no symbols for it available on Microsoft's PDB servers.
But I wrote a dumb VT terminal using `CreatePseudoConsole` that does not render anything - all it does is sit in a loop doing `ReadFile` from the application output. Basically `for (;;) { ReadFile(pipe, ...); }`, completely ignoring the output. So there is no VT parsing, no rendering, no formatting. Nothing. The only thing that runs is conhost.exe.

In Task Manager I see ~4MB/s of traffic to my "terminal" from Casey's termbench.exe, and I get around 250-300ms per frame, which seems pretty slow. Task Manager shows that conhost.exe is the bottleneck. It is using 100% of one core (3.12% on a 32-core Ryzen):

![image](https://user-images.githubusercontent.com/1665010/121154141-8eb6de80-c7fb-11eb-822d-edad8ecd26db.png)

When I run ETW on conhost.exe for my "terminal", I see the following:

  1. A lot of time is spent in std::vector resizing:

![image](https://user-images.githubusercontent.com/1665010/121154304-b9089c00-c7fb-11eb-8136-473e0cb33eb4.png)

  2. A lot of time is spent in std::stringstream:

![image](https://user-images.githubusercontent.com/1665010/121154369-c9207b80-c7fb-11eb-9f50-0687d27e6da1.png)

  3. A lot of time is spent in RenderThread:

![image](https://user-images.githubusercontent.com/1665010/121154534-e5241d00-c7fb-11eb-9d89-d1b5be9c97f0.png)

The confusing part to me in this RenderThread call stack is all those string formatting functions like [_SetGraphicsRenditionRGBColor](https://github.com/microsoft/terminal/blob/v1.8.1444.0/src/renderer/vt/VtSequences.cpp#L266). Why is conhost.exe formatting VT sequences like this? Shouldn't my "terminal" receive the bytes in the pipe directly from what the `termbench` application is sending?

Again - there is **zero** rendering happening in my terminal. There are no DirectX calls at all. Only conhost.exe is the bottleneck here.


@skyline75489 commented on GitHub (Jun 8, 2021):

@cmuratori To answer your question:

  1. The deepest call that consumes the CPU is https://github.com/microsoft/terminal/blob/fb597ed304ec6eef245405c9652e9b8a029b821f/src/renderer/dx/CustomTextRenderer.cpp#L798
    This is essentially how Windows Terminal draws text.
  2. What I'm seeing is that drawing colored text consumes more CPU time in the renderer. I can't say whether it's because of DirectX, the GPU, or both. I'm no DX expert, so I can only guess both.

@skyline75489 commented on GitHub (Jun 8, 2021):

@mmozeiko To help you understand how things work under the hood: the `RenderThread` you see is actually used by conhost.exe to "render" VT sequences for the Windows Terminal. That's why you see all the text-related things. This is part of the ConPTY mechanism. A more detailed introduction can be found [here](https://devblogs.microsoft.com/commandline/windows-command-line-introducing-the-windows-pseudo-console-conpty/).

> Shouldn't my "terminal" receive direct bytes in pipe from what termbench application is sending?

Unfortunately no. There's currently no way to bypass the ConPTY layer. This does hurt performance, but it's necessary at the moment for the terminal to work properly. There's discussion about ConPTY "passthrough" in #1173.


@cmuratori commented on GitHub (Jun 8, 2021):

> 1. The deepest call that consumes the CPU is https://github.com/microsoft/terminal/blob/fb597ed304ec6eef245405c9652e9b8a029b821f/src/renderer/dx/CustomTextRenderer.cpp#L798

Is it not possible to post the entire trace?


@skyline75489 commented on GitHub (Jun 8, 2021):

@cmuratori it's possible. It's just me being lazy about it, because I've seen too many of those traces.

The WPR traces actually vary, depending on the content being drawn, the font, and probably also the GPU performance. Check out the screenshot at https://github.com/microsoft/terminal/pull/6206#issue-423245376 if you're interested. That PR helps the performance with cacafire, but for cmatrix (or, more practically, vim) it doesn't mean too much.


@mmozeiko commented on GitHub (Jun 8, 2021):

@skyline75489
I cannot test this, because I have no idea how to rebuild conhost.exe, but from my benchmark it is visible that all this terminal stuff can be sped up a lot by improving the string processing in the ConPTY layer. Nothing needs to be changed in the DirectX rendering. Just avoid the expensive string allocations and operations. This will help all terminal applications - not only Windows Terminal. Currently, if somebody wants to implement more efficient rendering they really can't, because they will be bottlenecked by these issues in the ConPTY layer.

For example, [_SetGraphicsRenditionRGBColor](https://github.com/microsoft/terminal/blob/v1.8.1444.0/src/renderer/vt/VtSequences.cpp#L266) should be changed to something like this (it could probably be done even better, but I wrote this in a GitHub comment):

HRESULT VtEngine::_SetGraphicsRenditionRGBColor(const COLORREF color, const bool fIsForeground) noexcept
{
    DWORD const r = GetRValue(color);
    DWORD const g = GetGValue(color);
    DWORD const b = GetBValue(color);

// Append the decimal digits of a byte value (0-255) without sprintf.
#define FMT_BYTE(x)                              \
    if (x >= 100) *ptr++ = (x / 100) + '0';      \
    if (x >= 10)  *ptr++ = ((x / 10) % 10) + '0'; \
    *ptr++ = (x % 10) + '0';

    char buffer[10 + 3 + 3 + 3];

    char* ptr = buffer;
    *ptr++ = '\x1b';
    *ptr++ = '[';
    *ptr++ = fIsForeground ? '3' : '4';
    *ptr++ = '8';
    *ptr++ = ';';
    *ptr++ = '2';
    *ptr++ = ';';
    FMT_BYTE(r);
    *ptr++ = ';';
    FMT_BYTE(g);
    *ptr++ = ';';
    FMT_BYTE(b);
    *ptr++ = 'm';

#undef FMT_BYTE

    return _Write({ buffer, static_cast<size_t>(ptr - buffer) });
}

No std::string and no vsnprintf-family functions. The rest of the file also uses too much std::string just for trivial constant string literals in format strings.
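The gist of this suggestion can be checked in isolation. The following self-contained sketch (function names are illustrative, not from the terminal codebase) compares a stack-buffer formatter against an snprintf-based reference and confirms they produce identical bytes:

```cpp
#include <cstdio>
#include <string>

// Append the decimal digits of a byte (0-255) without printf machinery.
static char* appendByte(char* ptr, unsigned x)
{
    if (x >= 100) *ptr++ = static_cast<char>(x / 100 + '0');
    if (x >= 10)  *ptr++ = static_cast<char>(x / 10 % 10 + '0');
    *ptr++ = static_cast<char>(x % 10 + '0');
    return ptr;
}

// Stack-buffer version: no heap allocation, no format-string parsing.
std::string sgrFast(unsigned r, unsigned g, unsigned b, bool foreground)
{
    char buffer[10 + 3 + 3 + 3]; // worst case: ESC[38;2;255;255;255m = 19 bytes
    char* ptr = buffer;
    *ptr++ = '\x1b'; *ptr++ = '[';
    *ptr++ = foreground ? '3' : '4';
    *ptr++ = '8'; *ptr++ = ';'; *ptr++ = '2'; *ptr++ = ';';
    ptr = appendByte(ptr, r); *ptr++ = ';';
    ptr = appendByte(ptr, g); *ptr++ = ';';
    ptr = appendByte(ptr, b); *ptr++ = 'm';
    return std::string(buffer, ptr - buffer);
}

// Reference version using printf-style formatting.
std::string sgrSlow(unsigned r, unsigned g, unsigned b, bool foreground)
{
    char buf[32];
    int n = std::snprintf(buf, sizeof(buf), "\x1b[%c8;2;%u;%u;%um",
                          foreground ? '3' : '4', r, g, b);
    return std::string(buf, n);
}
```

Whether the speedup carries over to the real code path depends on how hot this function actually is, but the transformation itself is behavior-preserving.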


@skyline75489 commented on GitHub (Jun 8, 2021):

@mmozeiko Yeah, you actually have a good point here. What I want to say is that there's a *lot* more going on under the hood than just "processing terminal sequences". The performance of the ConPTY layer is also very important for the console subsystem; as you mentioned, it helps all terminal applications.

Do understand that some of these tuning tricks are avoided for the sake of both readability and maintainability of the project. That being said, if we find something worth investigating, I think we'd all be happy to squeeze out as much CPU as we can to improve performance.


@superninjakiwi commented on GitHub (Jun 8, 2021):

If all terminals have to go through this code to function on Windows, I feel like performance should be treated as more important than it seems to be. It's one thing if only the default terminal suffers performance issues, but if an entire class of applications on Windows suffers negative effects from the way strings are handled, and gets none of the benefits, I'm not sure that's the best way to prioritize things.


@skyline75489 commented on GitHub (Jun 8, 2021):

Here’s something interesting. I tried to port the benchmark to Linux, and I’m seeing an even larger performance gap between colored and non-colored text there (roughly 60x-80x). The overall rendering performance is better on Windows, of course.

Will see if I can port it to macOS tomorrow.


@vaualbus commented on GitHub (Jun 8, 2021):

So the overall result is: rendering colored text in a console is not easy, and it requires a supercomputer to reach good frame rates/performance.


@skyline75489 commented on GitHub (Jun 8, 2021):

Oops. Accidentally closed this.

@vaualbus haha I see what you mean. Basically rendering colored text will be slower than non-colored text. But IMO it’s still fast enough for daily usage, be it on Linux or Windows.


@forksnd commented on GitHub (Jun 8, 2021):

> Basically rendering colored text will be slower than non-colored text.

@skyline75489 Modern AAA games can render millions of polygons and do ray-traced lighting at 60 fps, while Windows Terminal is able to render some text at 2-3 fps. Clearly, there is some part of this terminal emulator that is pushing the available hardware beyond its limits, either intentionally or not.

Can you (or someone else familiar with this code base) explain what part of the code needs this much computing power, so that the community can see about potentially filing a pull request to fix this performance issue?

> But IMO it’s still fast enough for daily usage, be it on Linux or Windows.

With all due respect, if Windows console infrastructure is not (and has never been) performant, then you do not know what daily-usage applications you are missing, which are impossible to write currently due to the pervasive slowness.


@jfhs commented on GitHub (Jun 8, 2021):

I've just benchmarked running Terminal + OpenConsole with @mmozeiko's change, and I see a 3x improvement in TermMarkV1.
Here are the raw numbers. Before:

```
Glyphs: 9k  Bytes: 335kb  Frame: 66  Prep: 0ms  Write: 75ms  Read: 0ms  Total: 75ms
TermMarkV1: 48kcg/s  (Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz Win32 VTS)
```

After:

```
Glyphs: 9k  Bytes: 364kb  Frame: 187  Prep: 0ms  Write: 53ms  Read: 0ms  Total: 53ms
TermMarkV1: 153kcg/s  (Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz Win32 VTS)
```

Given such a significant improvement (even if only in a benchmark app), and the functional equivalence, I think you should consider that change (and similar changes for other trivial formatting code paths), despite some detriment to readability.


@skyline75489 commented on GitHub (Jun 8, 2021):

@forksnd I totally get your point. I for one have been trying to improve the rendering performance of the terminal since 2019; I feel eligible to say a few words here.

For those who don’t quite get how text rendering works, it may seem unreasonable, but one major performance cost comes from text layout & rendering. I have filed several PRs trying to minimize the impact of text layout. As of now, in order to support all the fancy Unicode features that people expect from a modern terminal (emoji, CJK languages, RTL, etc.), a significant amount of time is needed for text layout. This gets worse when the text is colored, because it forces us to split the text into different runs according to their color.
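To illustrate the run-splitting point: text can only be shaped and drawn in runs of uniform attributes, so a line where every cell has a different color degenerates into one run per cell. A simplified sketch (the types here are illustrative, not the terminal's actual data model):

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Cell { char ch; uint32_t color; };
struct Run  { std::string text; uint32_t color; };

// Split a row of cells into maximal runs of uniform color. Each run
// becomes a separate layout + draw operation downstream, so a
// color-per-character line produces one run per cell.
std::vector<Run> splitRuns(const std::vector<Cell>& row)
{
    std::vector<Run> runs;
    for (const Cell& c : row) {
        if (runs.empty() || runs.back().color != c.color)
            runs.push_back({ std::string(1, c.ch), c.color });
        else
            runs.back().text += c.ch;
    }
    return runs;
}
```

A monochrome 80-column line is a single run; the same line with a distinct color per cell is 80 runs, which is where much of the extra layout cost comes from.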

> Modern AAA games can render millions of polygons and do ray-traced lighting at 60 fps,

If you understand what I mean about text layout, you’ll understand that drawing text is a very different task from drawing polygons and the like. I may sound naive, but I don’t actually know of a AAA game that draws millions of glyphs. The only kind of application (that I know of) that draws a lot of text is terminal applications. I mentioned DirectX, and maybe that is a bit misleading: the terminal needs DirectX for both rendering (Direct2D) and text layout (DirectWrite). So the technology behind games and terminals is not exactly the same, nor do they share the same performance metrics, IMO.

> which are impossible to write currently due to the pervasive slowness

It may surprise you, but up until now we’ve mostly been targeting *Linux* applications & terminals as a sort of benchmarking standard, since people have been using the Linux tools forever. Also, with the help of WSL, this can easily be done without firing up a VM. After 3 years of open-source development, the Windows Terminal can handle applications like cmatrix & cacafire easily. I wouldn’t call that a perfect example, but I think it can be seen as an indicator of how performant the terminal currently is.


@superninjakiwi commented on GitHub (Jun 8, 2021):

I can understand your position here, and I don't think you've taken an unreasonable one. That said, considering the vast amount of string handling a terminal does, the poor performance of std::string hits this kind of application harder than any other. More than any other program, I believe the terminal would benefit from at least considering alternatives to the current string handling, if that can be shown to be a significant enough performance boost, even if the current string handling is judged to be a bit more readable.


@skyline75489 commented on GitHub (Jun 8, 2021):

@superninjakiwi thanks for the kind words and the suggestions. I will see if there’s anything I can do in the future.

Excuse me for being wordy here. I swear this is my last comment for the day. Most people tend to underestimate how hard text layout is. Turns out it’s really, really hard. When it comes to text layout, it’s really, really hard to even be correct, let alone be performant. In this modern world where you can almost find anything in Unicode, you’ll be surprised how many things are needed to correctly layout complicated text. It’s so hard that it requires a dedicated framework for just text layout (DirectWrite on Windows, CoreText on iOS/macOS)

Again, I’m no expert in text layout. One thing I found crucial when it comes to the performance of text layout is that there isn’t much room for parallelism, because sometimes (maybe most of the time) you have to do it sequentially (think about ligatures, for example). So not much of the modern multi-core compute power can be used, be it on CPU or GPU.

When we talk about the challenges in terminal performance, this is just the tip of the iceberg. I’m super happy that so many people are interested in the project and in this particular area. I’m glad I can explain things so people can better understand what stage we are at, and hopefully where we’re headed.


@cmuratori commented on GitHub (Jun 8, 2021):

Most people tend to underestimate how hard text layout is.
Turns out it’s really, really hard. When it comes to text layout,
it’s really, really hard to even be correct, let alone be performant.
In this modern world where you can almost find anything in
Unicode, you’ll be surprised how many things are needed to
correctly layout complicated text. It’s so hard that it requires
a dedicated framework for just text layout (DirectWrite on
Windows, CoreText on iOS/macOS)

Can you be more specific here about what you are talking about? While text output can be "hard", for some not particularly hard definition of "hard", it is usually because of things that terminals don't do, such as aesthetically-pleasing justification; preventing rivers, widows, and orphans; precisely aligning internal character features with other characters; proper ligatures; etc. So I guess I'm not sure I know where the "hard" part would come in for rasterizing a monospace font into a fixed character grid?

Can you point me to or post some examples of "difficult text output" I can make happen in the Windows Terminal so I can see what you mean?


@DHowett commented on GitHub (Jun 8, 2021):

Sorry -- I'm going to try to corral this thread before it gets further out of hand.

  1. The translation from the console buffer--which we need to keep for compatibility reasons--to VT does too much string math to be properly performant.
  2. Per https://github.com/microsoft/terminal/issues/10362#issuecomment-856589469 the bottleneck identified here is not in the DirectWrite renderer
    • Our DirectWrite renderer is somewhat inefficient and causes more command list flushes than should be necessary. I think "text rendering is hard" (https://github.com/microsoft/terminal/issues/10362#issuecomment-856736267) because we've made it hard, not because of some intrinsic quality of the universe.
  3. Measuring throughput here is somewhat annoying because everything goes through conhost/OpenConsole first (the "VT Renderer") and then through Terminal second (the "DirectWrite renderer")
    • Both of these impact perceived performance, but per https://github.com/microsoft/terminal/issues/10362#issuecomment-856589469 (again) the effect is compounded since Terminal requires both; other ConPTY consumers only require one.

Is this an acceptable summary?

Notes

  • In response to https://github.com/microsoft/terminal/issues/10362#issuecomment-856625365: compiling conhost is relatively easy: from OpenConsole.sln, set Host.EXE as the startup project and hit F5. Make sure that you're running Release/x64 or something that matches your local architecture, as conhost is sensitive to the architecture of the kernel.
  • In response to https://github.com/microsoft/terminal/issues/10362#issuecomment-856589469: We should be publishing PDBs, and I'm sorry for the miss here. When I find a thread that's asking for them I usually upload them, but we don't have an automatic process in place for making them publicly available.
    • I'd like to inch us ever closer to "reproducible builds", but that's a long-term goal.

All in all, this sounds like a more general case of #410. Chafa does exactly this ("change colors a lot, try to render as fast as possible") and we are now, at least, profiling and optimizing using it as a test case (PR #10071).


@cmuratori commented on GitHub (Jun 8, 2021):

I think "text rendering is hard" because we've made it hard,
not because of some intrinsic quality of the universe.

That sounds much more sensible, yes.


@DHowett commented on GitHub (Jun 8, 2021):

I will take your terse response as accepting the summary. Thanks!


@cmuratori commented on GitHub (Jun 8, 2021):

I apologize for the terseness, but I don't feel like I am in any position to accept or reject a summary, since most of what you were replying to was other people's comments (@mmozeiko, for example, was the person posting about how the processing is slow right now due to unnecessary string manipulation). They would be the ones to accept or reject a summary :)

In general, all I wanted to have open with this particular bug report was "color text rendering should not be slower than uncolored text rendering". While it may be true that architectural decisions made in how Windows Terminal works could mean that it will always be slow in this regard, that is a different thing from it being slow because the actual processing is substantial. The processing required here is definitely insubstantial.

Parsing a 1mb buffer of control codes and outputting the GPU buffer necessary to encode ~30k colored fixed-width glyphs is something I would expect to run in the thousands of frames per second on a modern machine, not five frames per second as it does currently. So the difference between the reasonably expected performance and the realized performance here is several orders of magnitude, which would at least suggest to me that a great deal of improvement could be made to the performance of the product if one were so inclined.

That may not be a priority, however, which is fine, and you are welcome to close this report as not being something you're interested in fixing, etc.


@DHowett commented on GitHub (Jun 8, 2021):

Parsing a 1mb buffer of control codes and outputting the GPU buffer necessary to encode ~30k colored fixed-width glyphs is something I would expect to run in the thousands of frames per second ... that a great deal of improvement could be made to the performance of the product if one were so inclined.

I completely agree. I'll use this report as the tracking issue for any performance improvements we make here.

Thanks for raising this -- and I'm excited to get termbench going.

EDIT: And, that's fair, the bit about the summary. Sorry. 😄


@cmuratori commented on GitHub (Jun 8, 2021):

I completely agree. I'll use this report as the tracking issue
for any performance improvements we make here.

Awesome! Let me know if you need me to modify termbench to test other things at some point. It is obviously very simple at the moment because many (most?) terminals already struggle with its current output.


@skyline75489 commented on GitHub (Jun 8, 2021):

Parsing a 1mb buffer of control codes and outputting the GPU buffer necessary to encode ~30k colored fixed-width glyphs is something I would expect to run in the thousands of frames per second

This is what I see on Linux. You’re totally right about Terminal not being that performant because of the existence of ConPTY & co. But on Linux, non-colored text is also way faster than colored text.

Man, we need this on Linux. I’ll send a PR later.


@cmuratori commented on GitHub (Jun 8, 2021):

Man, we need this on Linux. I’ll send a PR later.

I will go ahead and post a Linux version as well, if that is useful.


@cmuratori commented on GitHub (Jun 9, 2021):

  1. The translation from the console buffer--which we need to keep for compatibility reasons--
    to VT does too much string math to be properly performant.

Before I forget, I just wanted to mention: this part was a little confusing to me, because my understanding was that up until recently, you could not even use VT codes in Windows terminal - hence the need to set the console mode to ENABLE_VIRTUAL_TERMINAL_PROCESSING (which doesn't even exist in Windows 8). So, is there a reason you couldn't just bypass the entire pipeline when the person on the other end sets that flag? Because then you know they aren't expecting any backwards compatibility, because it is obviously a new app?

Maybe I'm missing something here, but I just thought I'd mention it, because it seemed odd. It seems like you have an explicit flag that tells you the person doesn't need the old console behavior (or that you can easily define to be that, because it is a brand new flag), and that seems like that might solve the entire string processing problem that is currently going on in the conduit?


@skyline75489 commented on GitHub (Jun 9, 2021):

@cmuratori that is explained in #1173, which is also a very lengthy thread and requires some background knowledge.


@stephc-int13 commented on GitHub (Jun 9, 2021):

@skyline75489 I understand that text layout can be difficult and that support for emoji, ligatures, etc. is needed, but I also think that it should not be too difficult to process the VT stream in chunks and have a quick path for the common cases when the layout is a simple fixed grid to avoid paying for unnecessary processing all the time.


@cmuratori commented on GitHub (Jun 9, 2021):

@cmuratori that is explained in #1173, which is also a very lengthy thread and requires some background knowledge.

Reading through that, ENABLE_PASSTHROUGH_MODE actually sounds like it would improve the termbench performance substantially without anyone needing to optimize the current VT-to-non-VT-and-back-again problems that @mmozeiko was observing. Is this still a planned feature? I would love to add a toggle for it in termbench if it becomes a reality.


@mmozeiko commented on GitHub (Jun 9, 2021):

Just to see how fast the terminal can go, I patched the conhost source to drop all incoming data - so there is no VT string parsing & processing happening. I commented out the call to ProcessString on this line: https://github.com/microsoft/terminal/blob/v1.8.1444.0/src/host/_stream.cpp#L972

All the terminal will now show is a black window, but we can see how fast a terminal application like termbench can run.

What I got is 2-3 msec per frame, TermMark shows a 7800 score, and termmark.exe is sending data at ~270MB/s to OpenConsole.exe.
Compared to the previous 300msec, TermMark=80, and 3MB/s.
All three numbers show a ~100x speedup.

CPU usage also dropped - before, conhost was using 100% of one core and termmark.exe was almost idle at 0%. Now both are doing some work at ~50% of a core, so the CPU is no longer the bottleneck for conhost.

Here's the screenshot in task manager:
![image](https://user-images.githubusercontent.com/1665010/121282251-06354e00-c88e-11eb-8c7a-91cfad92c4af.png)

What this means is that with good incoming text parsing code and good rendering code - which should really take no time for the 27K characters I'm using - the terminal could easily render at 60fps or even upwards of 100fps. Not only that - it would also save power & battery for laptop users because of the lower CPU usage.


@ped7g commented on GitHub (Jun 9, 2021):

One thing made me curious - just a mental exercise: colored vs non-colored throughput. Shouldn't the colored text actually be faster in terms of processed bandwidth? (I guess when comparing the number of characters, it's fair to assume colored will be much slower, but some comments made it sound as if the actual throughput is slower.)

Let's say we have 4MB of input data for the terminal to render. If it is colored, the number of glyphs to render will be considerably lower. Rendering glyphs may be complex because of RTL/ligatures/... (all the nice Unicode stuff), while processing a color code means "just" parsing it and modifying the color of further rendering - there's no pixel-length calculation or layout-positioning of a glyph.

So if you think about it like this, processing the same amount of input should be faster with color codes, because there are far fewer actual characters to render?

/end of mental exercise
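The footprint difference behind this exercise is easy to quantify. Here is a small Python sketch (the 80x24 grid size is an arbitrary assumption; the `\x1b[38;2;R;G;Bm` sequence is the standard 24-bit SGR foreground color code) that builds one screenful of plain vs per-character-colored output and compares byte counts:

```python
# Build one "screenful" of output two ways and compare byte counts.
# Assumes an arbitrary 80x24 grid of ASCII cells; "\x1b[38;2;R;G;Bm"
# is the standard 24-bit SGR foreground color sequence.

COLS, ROWS = 80, 24

def plain_frame():
    # One color for the whole frame: just the characters themselves.
    return ("X" * COLS + "\n") * ROWS

def colored_frame():
    # A fresh SGR color sequence before every single character.
    out = []
    for y in range(ROWS):
        for x in range(COLS):
            r, g, b = x % 256, y % 256, (x + y) % 256
            out.append(f"\x1b[38;2;{r};{g};{b}mX")
        out.append("\n")
    return "".join(out)

plain = plain_frame()
colored = colored_frame()
# Same number of visible glyphs, but the colored input is ~15x larger.
print(len(plain), len(colored), round(len(colored) / len(plain), 1))
```

Both frames carry the same 1,920 visible glyphs, so per-character color multiplies the bytes the parser must chew through without adding any rendering work per glyph.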


@nico-abram commented on GitHub (Jun 9, 2021):

because of RTL

Does terminal actually handle RTL? I was under the impression it didn't


@cmuratori commented on GitHub (Jun 9, 2021):

One thing made me curious - just a mental exercise: colored vs non-colored
throughput. Shouldn't the colored text be actually faster in terms of
processed bandwidth?

"Shouldn't" is not really something you can say definitively about this particular situation because it would depend on the implementation details.

If the processing for the input is significantly slower than the glyph rendering, then you would expect the FPS vs. input footprint to be the same or worse for colored vs non-colored, because your performance will be entirely dependent on the input processing, which costs more proportional to the footprint.

On the other hand, if the glyph rendering is significantly slower than the input processing, then you would expect the FPS vs. input footprint to improve substantially for colored glyphs because the performance would stay the same but the footprint would increase, leading to a faster "score" by your metric.

The reason nobody is concerned about "processed bandwidth" in this particular thread is because the memory bandwidth necessary to retrieve the input is insubstantial in both the colored and non-colored cases. The terminal would have to be several orders of magnitude faster before you would be looking at input bandwidth as a metric.

The largest VT-coded input in question is around 1mb of data for a full screen of color-per-glyph output. On a modern machine you would expect to read a cold 1mb buffer at ~20gb/s, or a hot one (which this would be, at least partially) at ~80gb/s, so a single core would expect to read the input somewhere between twenty and eighty thousand times a second. Since the observed frame rate was around five frames per second, we know that input memory bandwidth is not implicated in the performance problems.

(And note that when I say "input memory bandwidth", I am talking only about the bandwidth necessary to get the data from the application to the terminal. Obviously we know the terminal itself is taking a long time to process the data, so that processing may itself be generating large amounts of unnecessary memory traffic which then implicates memory bandwidth as a bottleneck, etc., etc.)

Not sure if that is what you were asking, but hopefully that provides enough information to answer the question.
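The back-of-the-envelope numbers in this comment can be checked directly (the 20-80 GB/s figures are the comment's stated assumptions, not measurements):

```python
# Reads-per-second for a ~1 MB input buffer at the quoted bandwidths.
BUFFER_BYTES = 1_000_000   # ~1 MB of VT-coded input (assumed)
COLD_BW = 20e9             # ~20 GB/s, cold buffer (assumed)
HOT_BW = 80e9              # ~80 GB/s, hot buffer (assumed)

cold_reads_per_sec = COLD_BW / BUFFER_BYTES   # 20,000 reads/s
hot_reads_per_sec = HOT_BW / BUFFER_BYTES     # 80,000 reads/s

OBSERVED_FPS = 5
# The observed frame rate is thousands of times below either bound,
# so raw input memory bandwidth cannot be the limiting factor.
headroom = cold_reads_per_sec / OBSERVED_FPS
print(cold_reads_per_sec, hot_reads_per_sec, headroom)
```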


@skyline75489 commented on GitHub (Jun 10, 2021):

Is this (ENABLE_PASSTHROUGH_MODE) still a planned feature?

I think it's the right direction, but considering the amount of backlog items & the limited developer time, I wouldn't really expect to see it implemented before the year 2023. We'll have to live with the ConPTY layer for a reasonably long time.


@cmuratori commented on GitHub (Jun 10, 2021):

I wouldn't really expect to see it implemented before the year 2023.

Ouch.


@DHowett commented on GitHub (Jun 10, 2021):

I wouldn't really expect to see it implemented before the year 2023.

Ouch.

The hang-up is that this needs OS changes and, while we do contribute the console host code from this repository back into Windows, the OS moves much slower than this project does. 😄


@lhecker commented on GitHub (Jun 17, 2021):

FYI We investigated this today and the slowdown likely occurs because we draw each run of consecutive characters with identical text attributes (colors, etc.) at once. If the background color changes for each character, each character will be drawn independently, which makes rendering slow. This affects us more than other terminals, as our parsing and rendering loops still work in sync - the former can't proceed until the latter is finished.

The situation of the submitter of this issue will vastly improve with https://github.com/microsoft/terminal/issues/6193.
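The run-coalescing behavior described here can be illustrated with a minimal sketch (the cell/attribute representation below is invented for illustration and is not the terminal's actual data structure):

```python
# Group consecutive cells that share identical text attributes into
# draw "runs". With one color per character, every run has length 1,
# so the renderer issues one draw per character instead of per line.
from itertools import groupby

def coalesce_runs(cells):
    """cells: list of (char, attr) pairs; returns (text, attr) runs."""
    runs = []
    for attr, group in groupby(cells, key=lambda c: c[1]):
        runs.append(("".join(ch for ch, _ in group), attr))
    return runs

same_color = [("A", 1), ("B", 1), ("C", 1)]   # one attribute throughout
per_char = [("A", 1), ("B", 2), ("C", 3)]     # attribute changes per cell

print(coalesce_runs(same_color))  # a single 3-character run
print(coalesce_runs(per_char))    # three 1-character runs
```

With synchronous parsing and rendering loops, every extra run is an extra round trip the parser must wait out, which is why the per-character case degrades so sharply.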


@cmuratori commented on GitHub (Jun 17, 2021):

For what it's worth, #6193 sounds like a step in the wrong direction. Drawing something in multiple passes that could have been drawn in a single pass wastes GPU render target bandwidth.

Drawing a monospace terminal display is straightforward. You have two textures that encode your data. You have a pixel shader that div-floors the screen coordinate to figure out a cell index then looks up into the first texture. It encodes one background color, one foreground color, and one cell-glyph index per terminal cell.

The cell-glyph index is then used for a single dependent texture fetch which loads a per-pixel glyph out of the second texture, which is a glyph atlas encoding the cell-glyph coverage in whatever way makes it easiest to compute your ClearType blending values. Combine the background and foreground color using the ClearType algorithm and blending values, output final pixel color, done. (I am assuming the terminal has to support ClearType - if it doesn't, you just blend with a regular coverage value directly and it's even easier).

There would only be one dispatch for the entire terminal display, which is a single full-window quad. Note also that I say "cell-glyph", not glyph, because obviously if you want glyphs that span two cells, you split those into two cell-glyphs accordingly (but the renderer doesn't care).

That's it, right? I mean that is the entire renderer. It'd be a very short pixel shader, modulo the fact that you have a couple different ClearType patterns, so you'd need a few different conditional compilations of the shader.

This would render at thousands of frames per second. The only bandwidth to the card would be downloading texture updates. The parser outputs these - one texture update to change the cell contents in the cell contents texture, and then occasional texture updates to add glyphs to the cell-glyph coverage atlas whenever the parser detects a codepoint that has not previously been rasterized (in normal usage this would happen only at the beginning, and then all relevant glyphs would soon be in the atlas and you'd never need any more updates to it).

Am I missing something? Why is all this stuff with "runs of characters" happening at all? Why would you ever need to separate the background from the foreground for performance reasons? It really seems like most of the code in the parser/renderer part of the terminal is unnecessary and just slows things down. What this code needs to do is extremely simple and it seems like it has been massively overcomplicated.
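
The per-pixel logic described above can be sketched on the CPU as follows. This is a simulation of the proposed pixel shader, with hypothetical data layouts (cell tuples and a coverage atlas); it uses plain alpha blending rather than ClearType for simplicity.

```python
# Simulate the per-pixel logic of the proposed single-pass renderer:
# map a screen pixel to a cell, look up that cell's colors and glyph
# index, then blend foreground over background by glyph coverage.
CELL_W, CELL_H = 8, 16

def shade_pixel(x, y, cells, cols, atlas):
    cx, cy = x // CELL_W, y // CELL_H          # div-floor to a cell index
    bg, fg, glyph = cells[cy * cols + cx]      # one fetch from the cell texture
    gx, gy = x % CELL_W, y % CELL_H
    cov = atlas[glyph][gy][gx]                 # dependent fetch: coverage in [0,1]
    # Plain alpha blend; a ClearType path would blend per subpixel channel.
    return tuple(round(b + (f - b) * cov) for b, f in zip(bg, fg))

# One cell: black background, white foreground, glyph 0 = solid block.
cells = [((0, 0, 0), (255, 255, 255), 0)]
atlas = {0: [[1.0] * CELL_W for _ in range(CELL_H)]}
print(shade_pixel(3, 5, cells, 1, atlas))  # fully covered pixel
```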


@DHowett commented on GitHub (Jun 17, 2021):

I believe you are, somewhat combatively, describing something that might be considered an entire doctoral research project in performant terminal emulation as “extremely simple”. I am not aware of the body of work around performant GPU terminal emulation, but I’m somewhat surprised that other accelerated terminals aren’t already doing this (as I imagine we would have heard about it before now had they done so.)

Is there not a significant startup cost to this? Rendering the entire glyph closure available from the font and all of its fallbacks to a texture seems prohibitively expensive, but if you’re removing the pipeline stage that determines exactly which glyphs to shape and where, you’ll need to do that work up front, as well as reimplement a large portion of a text shaper, no?

I expect that DirectWrite performs incredible optimizations on its own, and that we are impeding it by not commanding it intelligently, but I don’t believe that it’s quite that advanced.

_Setting the technical merits of your suggestion aside, though: peppering your comments with clauses like “it’s that simple” or “extremely simple” and, somewhat unexpectedly, “am I missing something?” can be read as impugning the reader. Some folks may be a little put off by your style here. I certainly am, but I am still trying to process exactly why that is._


@DHowett commented on GitHub (Jun 17, 2021):

To address Leonard’s specific reason for calling out background rendering: right now, we don’t have a single-stage pipeline that uses a pixel shader to pull cell-glyphs from a texture. What we have instead is a rendering pipeline that emits up to 7,200 individual draw calls, and we’re talking about reducing that[1]. I’m not aiming for instant perfection, but simply trying to converge on a better solution. I can’t justify taking somebody offline for the months it would take to retool the entire renderer, and then further justify dealing with the inevitable globalization issues that would follow, just to push _thousands_ of frames per second, when decoupling the renderer from the output pipeline gets the major performance bottleneck out of the way and better local draw-call batching can get us within throwing distance of _hundreds_ of FPS.

[1]: At the very least, introducing a stage specifically for rendering backgrounds lets us _better_ batch draw calls and gets the CPU and our drawing-pipeline stalls out of the way.


@cmuratori commented on GitHub (Jun 17, 2021):

When we're at the stage when something that can be implemented in a weekend is described as "a doctoral research project", and then I am accused of "impugning the reader" for describing something as simple that is extremely simple, we're done. Consider the bug report closed.


@lhecker commented on GitHub (Jun 17, 2021):

Discussion may continue here: https://github.com/microsoft/terminal/issues/10461
I deeply apologize for the condescending comment below.

**Unedited original comment:**

@cmuratori Apart from what Dustin said, frankly, you seem misguided about how text rendering with DirectWrite works. When you call [`DrawGlyphRun`](https://docs.microsoft.com/en-us/windows/win32/api/dwrite/nf-dwrite-idwritebitmaprendertarget-drawglyphrun), it lays down glyphs in your "texture" _by using a backing glyph atlas internally already_. Basically, the thing you suggest we do is already part of the framework we use.

Now obviously there's a difference between doing thousands of glyph layouts and just a few dozen.
Calling `DrawGlyphRun` doesn't equate to a full render stage on your GPU either. In fact, your GPU is barely involved in text rendering!

_Side note: DirectWrite doesn't necessarily cache glyphs between renderings. This is indeed something we could consider doing, but it just isn't worth it when the problem is caused by the number of calls and not the complexity of laying out a couple of ASCII letters._

👉 Also, ClearType can't be trivially alpha blended, which makes it impossible to render into a separate glyph atlas.
👉 Finally, Firefox used to use alpha blending, but they moved away from it towards the DirectWrite style of things because... you guessed it... alpha blending was an absolute nightmare of complexity and unmaintainability. It was certainly not something created in a weekend.

If you don't believe me, I invite you to `cat` [this file](https://norvig.com/big.txt) in a WSL2 instance. It'll finish drawing the entire 6 MB file within about a second or two. From that I can already estimate that once we implement the issue I linked, your program will render at about ~30 FPS. Significantly more than the current performance, right?

Lastly, I can only suggest everyone read: https://gankra.github.io/blah/text-hates-you/
You were overly confident in your opinion, but I hope this site helps you understand that text rendering is actually really damn hard.
The reason your program shows a high FPS under other terminal emulators is simply because their rendering pipelines work independently of VT ingestion. Gnome Terminal is not laying out text faster than your display refresh rate either. And of course, again, this is something WT will probably do as well in the future... but this project is _nowhere_ near as old as Gnome Terminal.
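
The glyph caching mentioned as a possibility in the side note above could look like the following minimal sketch (all names are hypothetical, not Windows Terminal or DirectWrite code): a map from codepoint to an atlas slot, rasterizing only on a miss so each distinct glyph is laid out once.

```python
# Minimal glyph-atlas cache: rasterize a codepoint only the first time it
# is seen; afterwards return the cached atlas slot. All names hypothetical.
class GlyphAtlas:
    def __init__(self, rasterize):
        self._rasterize = rasterize   # codepoint -> coverage bitmap
        self._slots = {}              # codepoint -> slot index
        self._bitmaps = []            # slot index -> bitmap
        self.misses = 0

    def slot_for(self, codepoint):
        slot = self._slots.get(codepoint)
        if slot is None:              # cache miss: rasterize and store once
            self.misses += 1
            slot = len(self._bitmaps)
            self._bitmaps.append(self._rasterize(codepoint))
            self._slots[codepoint] = slot
        return slot

atlas = GlyphAtlas(rasterize=lambda cp: [[1.0]])  # stub rasterizer
for ch in "hello hello":
    atlas.slot_for(ord(ch))
print(atlas.misses)  # distinct codepoints rasterized once each
```

In steady-state terminal use the working set of codepoints stabilizes quickly, so after warm-up the cache rarely misses.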

Reference: starred/terminal#14160