7Zip extraction performs significantly worse in 0.43.0 #753

Closed
opened 2026-01-29 22:16:56 +00:00 by claunia · 45 comments

Originally created by @Camble on GitHub (Jan 4, 2026).

Originally assigned to: @adamhathcock, @Copilot on GitHub.

As the title suggests, 7Zip extraction in 0.43.0 is significantly slower than in 0.41.0.

I extract the contents in memory for later writing to disk.

Image

Benchmark sample:

```csharp
FileStream fileStream = File.OpenRead(filename);
IArchive archive = ArchiveFactory.Open(fileStream);
IReader reader = archive.ExtractAllEntries();
while (reader.MoveToNextEntry())
{
  // etc...
  EntryStream source = reader.OpenEntryStream();
  // etc...
}
```
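
For context, a self-contained version of this loop might look like the sketch below; the results dictionary and the MemoryStream copy stand in for the elided `// etc...` parts and are assumptions, not the original benchmark code.

```csharp
// Hypothetical, self-contained version of the loop above; the MemoryStream copy
// and the results dictionary stand in for the elided "// etc..." parts.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using SharpCompress.Archives;
using SharpCompress.Readers;

var filename = args.Length > 0 ? args[0] : "sample.7z";   // assumed path
var results = new Dictionary<string, byte[]>();
var sw = Stopwatch.StartNew();

using (FileStream fileStream = File.OpenRead(filename))
using (IArchive archive = ArchiveFactory.Open(fileStream))
using (IReader reader = archive.ExtractAllEntries())
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
        {
            continue;
        }
        using EntryStream source = reader.OpenEntryStream();
        using var buffer = new MemoryStream();
        source.CopyTo(buffer);                               // extract into memory
        results[reader.Entry.Key ?? ""] = buffer.ToArray();  // write to disk later
    }
}

Console.WriteLine($"Extracted {results.Count} entries in {sw.Elapsed}");
```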
claunia added the bug label 2026-01-29 22:16:56 +00:00

@rube200 commented on GitHub (Jan 6, 2026):

I also noticed it from 0.42.0 to 0.43.0.
In my case it is extremely slow.
(I have a lot of small files.)


@adamhathcock commented on GitHub (Jan 6, 2026):

I'm looking to narrow this down but I think I found a disposal issue that caused a lot of allocations because LZMA never returned some memory to a pool.


@adamhathcock commented on GitHub (Jan 6, 2026):

Try https://www.nuget.org/packages/SharpCompress/0.44.0-beta.33

If that's better, I'll push it out as a hotfix to the release.


@Camble commented on GitHub (Jan 6, 2026):

Allocations are much lower, but the duration is unchanged.

Image

@adamhathcock commented on GitHub (Jan 6, 2026):

I'll have to take a closer look and think about what changed to make it slower. Having done perf testing and looked at the slowest bits, they haven't changed in a while, I think.


@julianxhokaxhiu commented on GitHub (Jan 18, 2026):

I've tried to look into this a bit so far, but unfortunately without any meaningful results.

The curious thing is that I did manage to generate a repro archive where I can "feel" the slowness; however, if I run the unit test used to read 7zip.solid.7z, the difference is very minimal (~2s).

Here are the snippets to generate the test archives (you need PowerShell and 7z installed on the system).

512MB (LZMA2:24 - 16MB dictionary - 1 block)

```pwsh
$dir="$PWD\random_files_512mb";New-Item -ItemType Directory -Force $dir|Out-Null;1..512|%{$b=New-Object byte[] 1048576;[Security.Cryptography.RandomNumberGenerator]::Create().GetBytes($b);[IO.File]::WriteAllBytes("$dir\file_{0:D4}.bin"-f $_,$b)};7z a -t7z "$dir\random_data_512mb.7z" "$dir\*" -ms=on -ms=512m -m0=LZMA2 -md=16m -mx=9
```

1024MB (LZMA2:24 - 16MB dictionary - 1 block)

```pwsh
$dir="$PWD\random_files_1gb";New-Item -ItemType Directory -Force $dir|Out-Null;1..1024|%{$b=New-Object byte[] 1048576;[Security.Cryptography.RandomNumberGenerator]::Create().GetBytes($b);[IO.File]::WriteAllBytes("$dir\file_{0:D4}.bin"-f $_,$b)};7z a -t7z "$dir\random_data_1gb.7z" "$dir\*" -ms=on -ms=1g -m0=LZMA2 -md=16m -mx=9
```

In the next few days I'll try to build a simple test case like the one suggested by @Camble, so I can benchmark this easily across 0.41 and 0.44.0. Also, if you can share the example archive (or mimic it), I'd appreciate it.
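
In the meantime, a minimal sketch of such a timing harness could look like this (the archive path below is an assumption); run it once per package version and compare the wall-clock time:

```csharp
// Minimal timing harness sketch; the archive path is an assumption.
using System;
using System.Diagnostics;
using System.IO;
using SharpCompress.Archives;

var path = Path.Combine("random_files_512mb", "random_data_512mb.7z");
var sw = Stopwatch.StartNew();
long totalBytes = 0;

using (var archive = ArchiveFactory.Open(path))
using (var reader = archive.ExtractAllEntries())
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
        {
            continue;
        }
        using var entryStream = reader.OpenEntryStream();
        entryStream.CopyTo(Stream.Null);               // decompress and discard
        totalBytes += reader.Entry.Size;
    }
}

Console.WriteLine($"{totalBytes / (1024.0 * 1024.0):F0} MB decompressed in {sw.Elapsed}");
```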

On our side, we ran into this bug because this archive ( https://qhimm.7thheaven.rocks/Catalog%204.0/FieldTextures/Final_Fantasy_VII_HD_Field_Scenes.7z ) takes just a few minutes to unpack under 0.41.0, while on 0.42.0+ it takes hours (5+ hours on my system: AMD 9800X3D, 64GB DDR5 6000MHz RAM, and a 1TB NVMe PCIe 5.0 Samsung Evo 990 SSD). Clearly something is going on in the code, but I have not yet been able to spot the hot path.

The issue can be seen using either the generated test archives or the one I linked. Unfortunately it cannot be seen with the small built-in test archives in this repo, as they are too small; I suspect it is about the dictionary size (the more you increase it, the slower extraction becomes).

Thanks in advance for any support.


@Camble commented on GitHub (Jan 18, 2026):

@julianxhokaxhiu I dug out the sample file I was using. It contains two file types: DDS and Lua. I figured I'd remove the Lua files and noticed something unusual with 0.44.0: the benchmark improved. So I ran the same benchmark on 0.41.0, but the results were identical for both archives. I have no idea if this is useful or relevant information, but the archive that contained only one file type performed worse only on 0.44.0.

I also have no idea why one benchmark has no GC.

The archives and benchmark are here: https://drive.google.com/drive/folders/15pYIfOAwfSWIiw-YMZEXLVnkD506VJTm?usp=sharing

Image

@julianxhokaxhiu commented on GitHub (Jan 18, 2026):

Thanks for sharing your archives and the interesting finding. I noticed that your archives share the same specs as mine: LZMA2:24, solid, 1 block.

Ironically, the archive I linked also contains DDS files, and those are very slow to unpack, so I wonder if the slowdown has to do with 7z archives containing highly compressible DDS images (compressors tend to "like" them, as the color data they contain repeats often).

I'll use your archives as additional test cases to see if I can reproduce the same timings. Out of curiosity, are the tests you are running based on this unit test? ( https://github.com/adamhathcock/sharpcompress/blob/master/tests/SharpCompress.Test/SevenZip/SevenZipArchiveTests.cs#L16 )

I'm using that one, replacing the original 7zip.solid.7z archive with the examples above and then profiling in Visual Studio. If not, can you please share yours so I can reproduce your timings exactly?

Thank you in advance


@Camble commented on GitHub (Jan 19, 2026):

No, they're not related to those tests. I've included the benchmark project in the link above.


@adamhathcock commented on GitHub (Jan 19, 2026):

I've done some perf changes here: https://github.com/adamhathcock/sharpcompress/pull/1131

Mostly, I've reduced some allocations.

I'm gonna do some looking around at that test case. You can see if anything has helped by getting the 0.45 betas from NuGet.


@Camble commented on GitHub (Jan 19, 2026):

Worth pointing out that all my benchmarks were carried out with the non-async implementation.


@adamhathcock commented on GitHub (Jan 19, 2026):

I'm having trouble tracking down the smoking gun in this case. Allocations don't seem to be the issue, and I've done speed tests. I see stuff taking a while, but nothing that looks like it changed recently.


@Camble commented on GitHub (Jan 19, 2026):

I re-ran the benchmark comparing 0.41.0 and 0.44.0 with CPU profiler, hopefully this is helpful. These are both synchronous.

Image

@julianxhokaxhiu commented on GitHub (Jan 19, 2026):

The first thing I noticed from the perf tests is that you mentioned both runs are non-async; however, the code path there clearly goes into async mode. Could it be that spawning a new thread for every entry now adds some overhead? That might explain why it now takes a bit longer and does more allocations.

//EDIT: In other words, it looks like SharpCompress takes the async/await pathway even if we use it in "sync" mode.


@adamhathcock commented on GitHub (Jan 19, 2026):

> The first thing I noticed from the perf tests is that you mentioned both runs are non-async; however, the code path there clearly goes into async mode. Could it be that spawning a new thread for every entry now adds some overhead? That might explain why it now takes a bit longer and does more allocations.
>
> //EDIT: In other words, it looks like SharpCompress takes the async/await pathway even if we use it in "sync" mode.

I'll take a look at it from this angle. I'm currently working on going full async (and not hitting the sync paths) but haven't worried about the other way around. Maybe I should!


@Camble commented on GitHub (Jan 19, 2026):

If the async overhead ends up performing as my benchmarks have demonstrated, I'd likely continue to go synchronous. Unfortunately, as @julianxhokaxhiu said, it would seem the sync path now uses async, so an upgrade is off the table for now.

Happy to help any way I can.


@julianxhokaxhiu commented on GitHub (Jan 19, 2026):

I personally think that, if used wisely, it could in practice lead to much better extraction performance. Maybe one way to solve it would be to yield while running the extraction flow per item using async/await. This way we should be able to squeeze the CPU cores to our advantage, allowing each file to be extracted on its own core, which would in fact mimic what the native 7-Zip implementation does.

In any case, providing the sync pattern back would help cases where the async pattern is not wanted, so I think there are still some wins to be had here. I haven't had the time to look at the code that handles the extraction, but if multi-threading could be achieved it would be a massive win for this project.
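
Purely to illustrate the idea, and only for non-solid archives where entries can be decoded independently (a solid archive is one stream internally, so this does not apply there), a rough sketch could partition the entries and give each worker its own archive instance; the names, worker count, and round-robin partitioning below are hypothetical:

```csharp
// Rough sketch of per-entry parallelism for a NON-solid archive; hypothetical
// names, worker count, and round-robin partitioning. Each worker opens its own
// archive instance so no stream is shared between threads.
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using SharpCompress.Archives;

internal static class ParallelExtractionSketch
{
    public static void ExtractAll(string archivePath, string outputDir, int workers = 4)
    {
        Parallel.For(0, workers, worker =>
        {
            using var archive = ArchiveFactory.Open(archivePath);
            var index = 0;
            foreach (var entry in archive.Entries.Where(e => !e.IsDirectory))
            {
                // Round-robin partition; entry enumeration order is assumed stable.
                if (index++ % workers != worker)
                {
                    continue;
                }

                var target = Path.Combine(outputDir, entry.Key ?? $"entry_{index}");
                var directory = Path.GetDirectoryName(target);
                if (!string.IsNullOrEmpty(directory))
                {
                    Directory.CreateDirectory(directory);
                }

                using var source = entry.OpenEntryStream();
                using var destination = File.Create(target);
                source.CopyTo(destination);
            }
        });
    }
}
```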


@adamhathcock commented on GitHub (Jan 20, 2026):

> If the async overhead ends up performing as my benchmarks have demonstrated, I'd likely continue to go synchronous. Unfortunately, as @julianxhokaxhiu said, it would seem the sync path now uses async, so an upgrade is off the table for now.
>
> Happy to help any way I can.

Async was always going to add overhead. I'm rethinking removing the path, but I hate having two paths. Interacting with Streams is going to be a problem (which may be the cause of this).

A new implementation is probably what I need to find or let an LLM convert.

The downside is that a sync implementation won't be great for multi-threaded operations.


@adamhathcock commented on GitHub (Jan 20, 2026):

Quick analysis from me is that the sync path is actually slower. Test by not awaiting anything and see.

This is moving beyond me, as I'm not sure the LZMA implementation is the best. I still think I haven't significantly changed it, so something outside of it may be the cause of the slowdown, but I'm not sure where else to look.


@Camble commented on GitHub (Jan 20, 2026):

> The downside is that a sync implementation won't be great for multi-threaded operations.

Does the async implementation actually leverage multi-threading?

> Quick analysis from me is that the sync path is actually slower. Test by not awaiting anything and see.

This is not my experience. The sync path in 0.41.0 is 4-5x faster than the sync path in 0.42.0+, which seems to be using async internally.
Link again to my benchmark: https://drive.google.com/drive/folders/15pYIfOAwfSWIiw-YMZEXLVnkD506VJTm?usp=sharing

I noticed heavy Copilot usage for 0.42.0. While I wasn't able to test 0.42.0 because of the LZMA bug, I suspect the async refactor has introduced this issue. I'm really not familiar with the repo, so I've had AI analyse the diffs for 0.42.0 and 0.43.0.

If I get some time I'll clone the repo and look into each of the findings, but you'll probably make more sense of them than I will.


Based on the diffs and your profiling data, the performance regression was introduced in 0.42.0 by the core async refactor, and then significantly exacerbated (or made visible) in 0.43.0 by the specific way LZMA was "fixed" to accommodate that new architecture.

Here is the breakdown of the issues found in those two transitions:

0.41.0 -> 0.42.0

This is where the "Copilot async refactor" happened. The primary issue here is Unification of Sync/Async Paths.

  • The Change: To support async without duplicating the entire library, many core methods in AbstractReader and BufferedSubStream were refactored to share logic.
  • The Issue: This refactor introduced AsyncTaskMethodBuilder and ValueTask state machine logic into the synchronous path.
  • Proof in your trace: Your 0.44.0 trace shows AsyncTaskMethodBuilder.Start being called. This proves that even when you call a sync method, the code is paying the "tax" of setting up an async state machine that it doesn't actually need.

0.42.0 -> 0.43.0

While PR #912 and #913 were merged in 0.40.0 to optimize ReadByte, the 0.42.0 async refactor broke the way these methods interact with the LZMA decoder, leading to the exceptions you saw. The "fix" in 0.43.0 to stop these crashes effectively made the previous performance optimizations irrelevant.

  • The Change: Implementations for ReadByte were added to BufferedSubStream and LzOutWindow to fix positioning bugs.
  • The Issue: 7Zip/LZMA decompression is unique because it relies on a Range Decoder that reads millions of individual bytes.
  • The Regression: By introducing a specialized ReadByte that has to track position and check bounds within the new async-aware BufferedSubStream architecture, the library added extra instructions to the single most executed loop in the entire application.
  • Why 0.43.0 is worse: The fix for the exception in 0.43.0 likely involved adding more rigorous bounds checking or "safety" wrappers that are technically correct (fixing the 0.42.0 crash) but computationally expensive.

Which version caused the issue?

It is a combination:

  1. 0.42.0 added the "State Machine" overhead to every entry open (explaining why OpenEntryStream is so slow).
  2. 0.43.0 added the "Per-Byte" overhead to the LZMA decoder (explaining why the actual extraction takes so much longer once the stream is open); the sketch after this list makes that per-byte cost concrete.
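
To make the per-byte cost concrete, here is a small stand-alone illustration (not SharpCompress code): reading one byte at a time through a virtual ReadByte call versus consuming bulk reads from a local buffer. Any extra instructions added to the per-byte path are multiplied by the number of decompressed bytes.

```csharp
// Stand-alone illustration of per-byte call overhead; this is NOT SharpCompress code.
using System;
using System.Diagnostics;
using System.IO;

var data = new byte[64 * 1024 * 1024];                 // 64 MB is enough to time
Stream stream = new MemoryStream(data);
long sum = 0;

// 1) One virtual ReadByte call per byte (roughly what a range decoder does).
var sw = Stopwatch.StartNew();
for (int i = 0; i < data.Length; i++)
{
    sum += stream.ReadByte();
}
Console.WriteLine($"ReadByte per byte : {sw.ElapsedMilliseconds} ms");

// 2) Bulk reads into a local buffer, consumed byte by byte from managed memory.
sw.Restart();
stream.Position = 0;
var buffer = new byte[32 * 1024];
int read;
while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    for (int i = 0; i < read; i++)
    {
        sum += buffer[i];
    }
}
Console.WriteLine($"Buffered reads    : {sw.ElapsedMilliseconds} ms (checksum {sum})");
```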

Technical summary for the dev:

"The diff shows that 0.42.0 introduced a dual-path architecture where synchronous calls now incur ValueTask and AsyncTaskMethodBuilder overhead. The 0.43.0 fix for the LZMA exception further regressed performance by adding specialized ReadByte logic to BufferedSubStream that is called millions of times in the LZMA Range Decoder loop. Essentially, the 'fix' in 0.43.0 traded correctness for a massive increase in per-byte instruction count."


@adamhathcock commented on GitHub (Jan 20, 2026):

> This is not my experience. The sync path in 0.41.0 is 4-5x faster than the sync path in 0.42.0+, which seems to be using async internally.

If you don't use CopyToAsync then it doesn't use an asynchronous path. There is no asynchronous path pre-0.42.0 so there was no choice.
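
To illustrate the point, here is a minimal sketch of the consumer side, where the copy call chosen by the caller decides which path is exercised (EntryStream is just a Stream, so both calls are the standard Stream API):

```csharp
// Sketch: the copy call chosen by the caller decides which path is exercised.
using System.IO;
using System.Threading.Tasks;
using SharpCompress.Readers;

internal static class CopyPathSketch
{
    public static async Task ExtractEntryAsync(IReader reader, Stream destination, bool useAsync)
    {
        using EntryStream source = reader.OpenEntryStream();
        if (useAsync)
        {
            await source.CopyToAsync(destination);   // asynchronous path
        }
        else
        {
            source.CopyTo(destination);              // synchronous path
        }
    }
}
```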

Thanks for the AI summary, it does help focus.


@adamhathcock commented on GitHub (Jan 20, 2026):

You can follow what I'm doing in a branch. I've reverted to being sync only and run some tests with an older BufferedSubStream, but the results aren't much different.

Sync only made the biggest difference.


@Camble commented on GitHub (Jan 20, 2026):

> If you don't use CopyToAsync then it doesn't use an asynchronous path.

You're right, I've re-run the CPU profiler and we have no async calls. Performance is the same, however. Maybe that was a red herring.


@julianxhokaxhiu commented on GitHub (Jan 25, 2026):

I spent some time today (with Claude Sonnet 4.5) analyzing this issue more carefully, and I think we came to a conclusion which might make sense and might fix the root cause.

First of all, the patch looks something like this:

```diff
diff --git a/src/SharpCompress/IO/BufferedSubStream.cs b/src/SharpCompress/IO/BufferedSubStream.cs
index 7a51bb0f..5e2286c7 100755
--- a/src/SharpCompress/IO/BufferedSubStream.cs
+++ b/src/SharpCompress/IO/BufferedSubStream.cs
@@ -68,7 +68,12 @@ internal class BufferedSubStream : SharpCompressStream, IStreamStack
             _cacheLength = 0;
             return;
         }
-        Stream.Position = origin;
+        // Only seek if we're not already at the correct position
+        // This avoids expensive seek operations when reading sequentially
+        if (Stream.CanSeek && Stream.Position != origin)
+        {
+            Stream.Position = origin;
+        }
         _cacheLength = Stream.Read(_cache, 0, count);
         origin += _cacheLength;
         BytesLeftToRead -= _cacheLength;
@@ -83,7 +88,12 @@ internal class BufferedSubStream : SharpCompressStream, IStreamStack
             _cacheLength = 0;
             return;
         }
-        Stream.Position = origin;
+        // Only seek if we're not already at the correct position
+        // This avoids expensive seek operations when reading sequentially
+        if (Stream.CanSeek && Stream.Position != origin)
+        {
+            Stream.Position = origin;
+        }
         _cacheLength = await Stream
             .ReadAsync(_cache, 0, count, cancellationToken)
             .ConfigureAwait(false);
```

Root Cause

The performance problem was caused by unnecessary seek operations in BufferedSubStream.RefillCache(). Every time the cache needed to be refilled (every 32KB), the code was doing:

```csharp
Stream.Position = origin;  // SEEK EVERY 32KB!
```

This was happening even when the stream was already at the correct position (sequential reading). For large 7zip archives being decompressed with LZMA, this means:

  • Thousands of unnecessary seek operations
  • Each seek can be expensive on file streams
  • The overhead compounds with multiple buffer refills

The issue became visible with the changes in 0.42.0+ because of how streams were being layered or managed, making these seeks more expensive.
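
As a rough order-of-magnitude check of the "thousands of seeks" figure (assuming the 32KB cache size mentioned above and a ~700MB solid block like the archive discussed earlier in this thread):

```csharp
// Back-of-the-envelope count of cache refills (and, before the patch, seeks).
using System;

const long solidBlockBytes = 700L * 1024 * 1024;   // ~700 MB solid block (assumed)
const int cacheBytes = 32 * 1024;                  // 32 KB refill size
Console.WriteLine(solidBlockBytes / cacheBytes);   // => 22400 refills, i.e. ~22,400 seeks
```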

Test Results

BEFORE THE PATCH: ( Debug )
Image

AFTER THE PATCH: ( Debug )
Image

BEFORE THIS PATCH: ( Release )
Image

AFTER THIS PATCH: ( Release )
Image

Notice how some of the tests now even finish with a < 1ms result, which never happened before.

One final note: this patch is more effective with big archives (100MB+), as the seek path is hit more often, especially for archives compressed as a single block (which means one stream to seek).

Feel free to test it and let me know how it goes :)


@adamhathcock commented on GitHub (Jan 26, 2026):

This definitely smells like the smoking gun. I can even patch 0.44 with a simple change.

I'll give it a test and maybe a PR. Thanks for this!


@adamhathcock commented on GitHub (Jan 26, 2026):

Using BenchmarkDotNet on the sample it doesn't change much for me, but perf testing shows it definitely reduces calls. Still researching, I guess, but maybe others will see a win.

Gonna patch it regardless. Thanks!


@adamhathcock commented on GitHub (Jan 26, 2026):

0.44.4 should have this little fix in it. Maybe it will make things tolerable: https://github.com/adamhathcock/sharpcompress/releases/tag/0.44.4


@julianxhokaxhiu commented on GitHub (Jan 26, 2026):

Thanks a lot, I'll give it a spin and let you know how it goes. Appreciated


@julianxhokaxhiu commented on GitHub (Jan 26, 2026):

I asked the community to test the new release I made using 0.44.4, but unfortunately they still report that version 0.41.0, which I use in our current stable release, is much faster at extracting. I'll continue investigating what causes this slowdown. Thanks!


@julianxhokaxhiu commented on GitHub (Jan 27, 2026):

I opened a PR which I think qualifies as a potential candidate to solve this issue. I again tested every change using the CPU profiler in combination with Claude Sonnet 4.5, which detected various minor optimizations that could be applied.

The end result is that, in my huge 700MB archive, I now reach in 1 minute the same set of extracted files that previously took at least 3 minutes. Tomorrow I'll run a full extraction test to measure how long it takes to fully extract it, but I'd appreciate it if you could both have a review and let me know what you think. Thanks!


@Camble commented on GitHub (Jan 27, 2026):

@julianxhokaxhiu Nice one! I haven't had a chance to look at the code at all, but if this PR goes into a pre-release, I'll re-run my benchmarks. Sounds very promising, thanks!


@eve-atum commented on GitHub (Jan 27, 2026):

> Tomorrow I'll run a full extraction test to measure how long it takes to fully extract it, but I'd appreciate it if you could both have a review and let me know what you think. Thanks!

Performed the test mentioned yesterday evening using our application and got these results using https://qhimm.7thheaven.rocks/Catalog%204.0/FieldTextures/Final_Fantasy_VII_HD_Field_Scenes.7z

7th Heaven v4.4.0.42 (sharpcompress 0.44.4) - 5h2m20s
7th Heaven v4.4.0.0 (sharpcompress 0.41) - 1m18s
7zip - 38s


@adamhathcock commented on GitHub (Jan 27, 2026):

> > Tomorrow I'll run a full extraction test to measure how long it takes to fully extract it, but I'd appreciate it if you could both have a review and let me know what you think. Thanks!
>
> Performed the test mentioned yesterday evening using our application and got these results using https://qhimm.7thheaven.rocks/Catalog%204.0/FieldTextures/Final_Fantasy_VII_HD_Field_Scenes.7z
>
> 7th Heaven v4.4.0.42 (sharpcompress 0.44.4) - 5h2m20s
> 7th Heaven v4.4.0.0 (sharpcompress 0.41) - 1m18s
> 7zip - 38s

Thanks for this. Gonna use this large file for testing.


@adamhathcock commented on GitHub (Jan 27, 2026):

Okay. The basic problem is that for SOLID files, I'm creating/reseeking to the desired file after each iteration when I should be just holding the state of where I am in the file and waiting.

I'm not sure why this behavior changed for 7z, but it did. I do this for SOLID rar files as well, so it's an understood problem and something I need to fix. It's definitely dramatically visible with a large file that's one stream internally.
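
A stand-alone way to picture the difference (an illustration, not SharpCompress internals): re-locating an entry in a solid stream from scratch means decompressing and discarding everything before it again for every entry, whereas holding the decoder state only requires reading forward by the gap between entries.

```csharp
// Illustration only. "openDecoder" stands for restarting decompression of the
// single solid stream; "decoded" is an already-open decoder positioned mid-stream.
using System;
using System.IO;

internal static class SolidStreamSketch
{
    // Re-locating an entry from scratch: decode and discard everything before it.
    // The cost grows with the total number of bytes already decoded, for every entry.
    public static void SeekFromScratch(Func<Stream> openDecoder, long entryOffset)
    {
        using var decoded = openDecoder();
        Skip(decoded, entryOffset);
    }

    // Holding state: keep one decoder open and only ever read forward.
    // The cost is just the gap between consecutive entries.
    public static void ReadForward(Stream decoded, long gapToNextEntry) => Skip(decoded, gapToNextEntry);

    private static void Skip(Stream stream, long bytesToSkip)
    {
        var scratch = new byte[32 * 1024];
        while (bytesToSkip > 0)
        {
            int n = stream.Read(scratch, 0, (int)Math.Min(scratch.Length, bytesToSkip));
            if (n == 0)
            {
                break;
            }
            bytesToSkip -= n;
        }
    }
}
```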


@julianxhokaxhiu commented on GitHub (Jan 27, 2026):

That's also what Claude told me yesterday, and it's part of the fix in my PR. However, if you know better what to touch, please go ahead and just use my PR as a reference. Thanks :)


@adamhathcock commented on GitHub (Jan 27, 2026):

> That's also what Claude told me yesterday, and it's part of the fix in my PR. However, if you know better what to touch, please go ahead and just use my PR as a reference. Thanks :)

Well, I know what's supposed to happen. Getting it to happen is something else. I'll have to try tomorrow or later to fix it and find a way to test it.


@schland commented on GitHub (Jan 27, 2026):

Same problem here: 0.42.1 is fine, but since 0.43 there has been a massive performance decrease. Extracting a 100MB 7z file with a lot of little files takes 8 seconds on 0.42.1; 0.44.4 needs over 360 seconds. Same code, same file.


@adamhathcock commented on GitHub (Jan 28, 2026):

> As the title suggests, 7Zip extraction in 0.43.0 is significantly slower than in 0.41.0.
>
> I extract the contents in memory for later writing to disk.
>
> Image
>
> Benchmark sample:
>
> ```csharp
> FileStream fileStream = File.OpenRead(filename);
> IArchive archive = ArchiveFactory.Open(fileStream);
> IReader reader = archive.ExtractAllEntries();
> while (reader.MoveToNextEntry())
> {
>   // etc...
>   EntryStream source = reader.OpenEntryStream();
>   // etc...
> }
> ```

I'm now getting roughly the same results with the fix:

| Method                      | Mean     | Error    | StdDev   | Gen0      | Gen1      | Gen2      | Allocated |
|---------------------------- |---------:|---------:|---------:|----------:|----------:|----------:|----------:|
| SharpCompress_0_44_Original | 78.15 ms | 1.533 ms | 2.342 ms | 2714.2857 | 2714.2857 | 2714.2857 |  52.71 MB |

Try with the latest beta https://www.nuget.org/packages/SharpCompress/0.44.5-beta.27


@adamhathcock commented on GitHub (Jan 28, 2026):

My benchmark was allocating a GB for buffer size :(


@Camble commented on GitHub (Jan 28, 2026):

> Try with the latest beta https://www.nuget.org/packages/SharpCompress/0.44.5-beta.27

Nice one, I'll give this a try once NuGet indexes this version.


@Camble commented on GitHub (Jan 28, 2026):

I think you guys have cracked it!

Image

@adamhathcock commented on GitHub (Jan 28, 2026):

Released 0.44.5 and closing this. Hopefully it doesn't happen again!


@Camble commented on GitHub (Jan 28, 2026):

@adamhathcock @julianxhokaxhiu Thank you both for all your hard work in resolving this!


@julianxhokaxhiu commented on GitHub (Jan 28, 2026):

Thanks a lot @adamhathcock. I can also confirm 0.44.5 is blazing fast again; the archive we used to test here now finally extracts entirely in less than 1 minute :) Great work everyone, we nailed it!


@schland commented on GitHub (Jan 28, 2026):

Confirm too. Back from 360+ seconds to 8 seconds :) Thanks a lot.

Reference: starred/sharpcompress#753