Exceptional slowness of low level input of console since 09.19 #4570

Closed
opened 2026-01-30 23:50:53 +00:00 by claunia · 3 comments

Originally created by @Agrael1 on GitHub (Oct 20, 2019).

Environment

Windows build number: 10.0.18362.418
Windows Terminal version (if applicable):

Steps to reproduce

I had a 640:360 buffer, resized from the original by making the font smaller (2x2). The buffer does not have to be monolithic, but even a single-colored buffer takes 15 ms to fill, with the 7th system call on NtDeviceIOControl from WriteConsoleOutputW. When it contains a picture, render time grows drastically, to 79-84 ms.

If asked, I can provide a serialized picture from my engine, drawn using char_info with a custom palette.

Expected behavior

Prior to the latest updates, low-level output was projecting the same symbol buffer at 100-200 actions per second. Timings were 25 ms at most. I'm making a console game engine, and those timings are crucial. If there is a better function for passing the console output buffer (CHAR_INFO*), please tell me.

Actual behavior

Low-level output takes 74-76 ms to complete, and I have found no faster alternatives. The buffer size is 640:360.

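![image](https://user-images.githubusercontent.com/23313867/67164049-8edc3f00-f376-11e9-8f7c-91a10762e65e.png)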
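For anyone trying to reproduce timings like the ones above, a minimal measurement sketch might look like the following. It is not taken from the reporter's engine: `present_fn`, `mean_frame_ms`, `elapsed_ms`, and `count_call` are hypothetical names, and `count_call` is only a stand-in for whatever actually pushes a frame to the console (for example a wrapper around WriteConsoleOutputW). It uses C11 `timespec_get`, which reads wall-clock time, so it is a rough check and not a substitute for a profiler.

```c
#include <time.h>

/* Hypothetical callback type: presents one frame (e.g. wraps WriteConsoleOutputW). */
typedef void (*present_fn)(void *ctx);

/* Milliseconds elapsed between two timestamps a (earlier) and b (later). */
static double elapsed_ms(const struct timespec *a, const struct timespec *b) {
    return (double)(b->tv_sec - a->tv_sec) * 1e3
         + (double)(b->tv_nsec - a->tv_nsec) / 1e6;
}

/* Calls present(ctx) `frames` times and returns the mean wall-clock
   time per call in milliseconds (0.0 if frames is not positive). */
static double mean_frame_ms(present_fn present, void *ctx, int frames) {
    struct timespec t0, t1;
    if (frames <= 0) return 0.0;
    timespec_get(&t0, TIME_UTC);
    for (int i = 0; i < frames; i++) present(ctx);
    timespec_get(&t1, TIME_UTC);
    return elapsed_ms(&t0, &t1) / frames;
}

/* Stand-in presenter for illustration: only counts how often it was called. */
static void count_call(void *ctx) {
    ++*(int *)ctx;
}
```

Averaging over many frames smooths out scheduler noise. Measuring the render step and the console write separately would show which of the two accounts for the 74-76 ms.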

claunia added the Product-Conhost, Issue-Bug, Needs-Tag-Fix, Area-Performance, Priority-2 labels 2026-01-30 23:50:53 +00:00

@parkovski commented on GitHub (Oct 26, 2019):

Hey, I'd really like to see profiler data for that project... I'm not necessarily ruling out conhost doing something slowly, as I've experienced those kinds of issues myself. However, I looked at your Veritas-2D code, and while it's hard to understand, I can tell you right away that you have a lot more allocations than is normal for C, and your shared_ptr can easily result in unaligned reads/writes of the counter.

Cache locality and memory alignment are two of the most important things to get right: if you mess them up, your code will be extremely slow (or will not work at all, depending on the architecture; Intel is forgiving here but ARM is not). Coupling this with a denser grid (e.g. more allocations, presumably more memory fragmentation), I could easily see it running 4x-8x slower. I couldn't really say if there are other issues because there's a lot of obfuscation behind macros.
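To make the allocation and cache-locality point concrete, here is a minimal sketch of a grid stored in one contiguous block instead of many small allocations. It is not taken from Veritas-2D: `Cell`, `Grid`, `grid_init`, `grid_at`, and `grid_free` are hypothetical names, and `Cell` only imitates the shape of a console cell.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a console cell (not the real CHAR_INFO). */
typedef struct {
    uint16_t ch;    /* UTF-16 code unit */
    uint16_t attr;  /* color attributes */
} Cell;

typedef struct {
    int width, height;
    Cell *cells;    /* one contiguous, row-major block */
} Grid;

/* Allocates the whole grid with a single calloc call. Returns 0 on success. */
static int grid_init(Grid *g, int width, int height) {
    if (width <= 0 || height <= 0) return -1;
    g->cells = calloc((size_t)width * (size_t)height, sizeof *g->cells);
    if (!g->cells) return -1;
    g->width = width;
    g->height = height;
    return 0;
}

/* Returns a pointer to the cell at column x, row y (no bounds checking). */
static Cell *grid_at(Grid *g, int x, int y) {
    return &g->cells[(size_t)y * (size_t)g->width + (size_t)x];
}

/* Releases the block and resets the grid to an empty state. */
static void grid_free(Grid *g) {
    free(g->cells);
    g->cells = NULL;
    g->width = g->height = 0;
}
```

With this layout, neighboring cells in a row sit next to each other in memory, and a resolution change costs one `grid_free` plus one `grid_init` instead of one allocation per row or per cell.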


@Agrael1 commented on GitHub (Oct 28, 2019):

> Hey, I'd really like to see profiler data for that project... I'm not necessarily ruling out conhost doing something slowly, as I've experienced those kinds of issues myself. However, I looked at your Veritas-2D code, and while it's hard to understand, I can tell you right away that you have a lot more allocations than is normal for C, and your shared_ptr can easily result in unaligned reads/writes of the counter.
>
> Cache locality and memory alignment are two of the most important things to get right: if you mess them up, your code will be extremely slow (or will not work at all, depending on the architecture; Intel is forgiving here but ARM is not). Coupling this with a denser grid (e.g. more allocations, presumably more memory fragmentation), I could easily see it running 4x-8x slower. I couldn't really say if there are other issues because there's a lot of obfuscation behind macros.

OK, I'll try to allocate as little memory as possible. Also, shared_ptr is just modeled on what I've seen in C++ (I'm trying to learn the internals of C++, because its mechanics are obfuscated the same way as macros, and I hate magic). The buffer also has to be dynamic; otherwise buffer resizing won't work, e.g. when changing resolution...

Most of the macros are just automation for naming that emulates mangling, and ENDCLASSDESC just makes a class table definition. But there are no problems in the class code: as I've tested, rendering the nanosuite model takes around 6 ms with dynamic lights and a complete texture. The question concerns solely WriteConsoleOutput, which takes the buffer produced by the render and passes it to the console. From here I can't predict what alignment it requires: those buffers are allocated by the default malloc, meaning the buffer is 16-byte aligned, so I can't really tell what is wrong with it.

My diagnostic tools in VS15 are broken for no apparent reason, but the launchable tests show average memory usage of about 7.6 MB, including all the loaded assets (nanosuite and its textures), all the allocated buffers, and classes. Processor usage is about 7%, most of it spent drawing triangles.
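![image](https://user-images.githubusercontent.com/23313867/68087921-636b5100-fe5a-11e9-927a-028ccd0b5683.png)
![image](https://user-images.githubusercontent.com/23313867/68088063-9104ca00-fe5b-11e9-9136-a50d69a44411.png)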
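The alignment claim above can be checked directly and need not be assumed. The following is a minimal sketch, with `Cell` and `is_aligned` as hypothetical names (`Cell` only imitates a 16-bit character plus a 16-bit attribute word and is not the real CHAR_INFO). The C standard guarantees that `malloc` returns storage suitably aligned for any object with a fundamental alignment requirement. Whether that works out to 16 bytes depends on the platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in with the shape of a console cell:
   a 16-bit character plus a 16-bit attribute word. */
typedef struct {
    uint16_t ch;
    uint16_t attr;
} Cell;

/* Returns 1 if the address p is a multiple of `alignment` bytes, else 0. */
static int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}
```

Calling `is_aligned(buf, _Alignof(Cell))` on a buffer obtained from `malloc` should always return 1. If it does, misalignment of the cell buffer itself is unlikely to explain the slowdown, although that says nothing about other structures such as the reference counter mentioned earlier.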


@lhecker commented on GitHub (Oct 10, 2024):

We're still not down to where we used to be but it's a lot better now:

![Image](https://github.com/user-attachments/assets/0fd8b343-5cac-408b-a2dd-f08a68c18c16)

I'll be closing this issue then, as I continue to noodle on our performance. It'll get a lot better in the next version, once I've rewritten the GDI renderer and detached it from the global console mutex (= async rendering). FYI: For optimal performance, please make sure to use VT output.
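As a rough illustration of what "VT output" can mean for a cell-based engine, the sketch below encodes a whole frame as virtual terminal sequences into one buffer, so that it can be sent with a single write. It is an assumption-laden example, not the approach conhost or the reporter's engine uses: `Cell` and `encode_frame` are hypothetical names, characters are limited to ASCII, and colors use the 256-color palette sequence `ESC[38;5;Nm`. Emitting a color sequence only when the color changes is a design choice of this sketch.

```c
#include <stdio.h>

/* Hypothetical cell: an ASCII character plus a 256-color palette index. */
typedef struct {
    char ch;
    unsigned char fg;
} Cell;

/* Encodes a width x height grid as VT sequences into out (capacity cap).
   Returns the number of bytes written (excluding the terminating NUL),
   or 0 if the buffer is too small. */
static size_t encode_frame(const Cell *cells, int width, int height,
                           char *out, size_t cap) {
    size_t n = 0;
    int last_fg = -1;
    /* ESC [ H moves the cursor to the top-left corner. */
    int w = snprintf(out, cap, "\x1b[H");
    if (w < 0 || (size_t)w >= cap) return 0;
    n += (size_t)w;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            const Cell *c = &cells[(size_t)y * (size_t)width + (size_t)x];
            if (c->fg != last_fg) {
                /* ESC [ 38;5;N m selects foreground color N. */
                w = snprintf(out + n, cap - n, "\x1b[38;5;%um", (unsigned)c->fg);
                if (w < 0 || (size_t)w >= cap - n) return 0;
                n += (size_t)w;
                last_fg = c->fg;
            }
            if (n + 1 >= cap) return 0;
            out[n++] = c->ch;
        }
        if (y + 1 < height) {
            /* Move to the start of the next row. */
            if (n + 2 >= cap) return 0;
            out[n++] = '\r';
            out[n++] = '\n';
        }
    }
    /* ESC [ 0 m resets attributes so later output is unaffected. */
    w = snprintf(out + n, cap - n, "\x1b[0m");
    if (w < 0 || (size_t)w >= cap - n) return 0;
    return n + (size_t)w;
}
```

The encoded frame can then be sent in one call, for example `fwrite(out, 1, n, stdout)`. On Windows, the console must have virtual terminal processing enabled first (`SetConsoleMode` with `ENABLE_VIRTUAL_TERMINAL_PROCESSING`), otherwise the sequences are printed as text.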

Reference: starred/terminal#4570