Performance degradation when decompressing a file from a 7z archive #559

Open
opened 2026-01-29 22:13:44 +00:00 by claunia · 13 comments

Originally created by @ghost on GitHub (Feb 20, 2023).

I have conducted some tests, and sharpcompress does suffer from performance degradation when reading some 7z files.

For example, with 100 files in a 7z archive, reading the earlier files, such as the first 10, is very fast, but the further into the archive a file is, the slower its decompression becomes.

There seems to be a pattern: the later a file is stored in the archive, the slower it is to decompress.

For example, with a 300 MB 7z archive containing 100 files, extracting the 10th file is very quick, but extracting only the 70th file is very slow, even if both files are the same size.

This problem does not occur in the zip format.
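
For illustration, a per-entry (random access) extraction like the following exhibits the described behaviour on solid 7z archives. This is a hypothetical sketch, not code from the reporter; the file names are made up and exact namespaces may vary between SharpCompress versions:

```cs
using System.Linq;
using SharpCompress.Archives;          // WriteToFile extension method
using SharpCompress.Archives.SevenZip;
using SharpCompress.Common;

// Random access to a single entry: on a solid archive, everything stored
// earlier in the same solid block has to be decompressed and discarded first,
// so a "late" entry is much slower to extract than an early one of equal size.
using var archive = SevenZipArchive.Open(@"test.7z");
var entry = archive.Entries.First(e => !e.IsDirectory && e.Key == "file_070.bin");
entry.WriteToFile(@"C:\temp\file_070.bin", new ExtractionOptions { Overwrite = true });
```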

claunia added the enhancement and up for grabs labels 2026-01-29 22:13:44 +00:00

@adamhathcock commented on GitHub (Mar 1, 2023):

PRs are welcome. I've been away for personal reasons.


@Nanook commented on GitHub (Mar 1, 2023):

The speed degradation described above could be due to 7zip grouping files with "Solid Block Size" to achieve better compression. Files at the end of these blocks cannot be directly decompressed. Earlier files must be decompressed first.


@ghost commented on GitHub (Mar 2, 2023):

> The speed degradation described above could be due to 7zip grouping files with "Solid Block Size" to achieve better compression. Files at the end of these blocks cannot be directly decompressed. Earlier files must be decompressed first.

@Nanook Thank you. I used WinRAR to test the same files (i.e. using WinRAR to decompress the .7z files), and WinRAR works normally; it does not have this bug.


@Nanook commented on GitHub (Mar 2, 2023):

It's not necessarily a bug, but rather a trade-off between compression and flexibility. RAR also has a SOLID mode which makes the full archive non-seekable. 7zip's is at least configurable. Just for info :)


@ghost commented on GitHub (Mar 3, 2023):

@Nanook No, I meant using WinRAR to decompress .7z files. It has almost no slowdown problem; WinRAR is fast at unpacking both .zip and .7z files.


@Nanook commented on GitHub (Mar 3, 2023):

Thanks for the info. Perhaps the SharpCompress implementation can be improved. I'll take a look next time I'm in there.


@ghost commented on GitHub (Mar 3, 2023):

Thank you very much. It looks like this is an old bug; there is detailed data in this issue:

https://github.com/adamhathcock/sharpcompress/issues/399


@Erior commented on GitHub (Mar 12, 2023):

Looking at the cases, it seems to be `archive.ExtractAllEntries()` vs `archive.Entries.Where(entry => !entry.IsDirectory)`.

Performance might not be great, but the latter does a skip to find the entry you are trying to decompress. The more files in the 7z archive, the more data has to be re-decompressed to reach your file.

archive.ExtractAllEntries()
while (reader.MoveToNextEntry())
....

does move along a bit faster.
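
Spelled out, the sequential pattern described above looks roughly like this. It is a sketch only; option names may differ slightly between SharpCompress versions:

```cs
using SharpCompress.Archives.SevenZip;
using SharpCompress.Common;
using SharpCompress.Readers;           // WriteEntryToDirectory extension method

using var archive = SevenZipArchive.Open(@"test.7z");

// ExtractAllEntries returns a forward-only reader, so each solid block is
// decompressed once, in order, instead of being re-decompressed per entry.
using var reader = archive.ExtractAllEntries();
while (reader.MoveToNextEntry())
{
    if (reader.Entry.IsDirectory)
        continue;

    reader.WriteEntryToDirectory(@"C:\temp",
        new ExtractionOptions { ExtractFullPath = true, Overwrite = true });
}
```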


@Lombra commented on GitHub (Mar 15, 2023):

Not sure whether this is related, but I've been using SevenZipSharp for my 7z needs, which has an issue of being extremely slow to process solid 7z archives on a file entry level. However, it also has an [ExtractArchive](https://github.com/squid-box/SevenZipSharp/blob/dev/SevenZip/SevenZipExtractor.cs#L1337) method which works as fast as you might expect.

SevenZipSharp is basically a wrapper around 7z.dll, so I don't know whether that's unhelpful. The DLL is not supported on Linux however, which is how I found out about this project.

This archive extracts in an instant using SevenZipSharp:

```cs
var archive = new SevenZipExtractor(@"mod.7z");
archive.ExtractArchive(@"C:\temp");
```

The same archive takes nine seconds to extract using SharpCompress:

```cs
var archive = SevenZipArchive.Open(@"mod.7z");
archive.WriteToDirectory(@"C:\temp");
```

@bodgit commented on GitHub (May 3, 2023):

I randomly stumbled on this issue. I wrote a similar Golang library for reading .7z archives, so I'm familiar with this particular phenomenon.

The most efficient way to extract a .7z archive is to iterate over the files in the order they're stored in the archive. If you offer some sort of random access API to the files and implement it naively then you will get this performance degradation. The problem comes from the fact that in order to read file `n` in a solid block, you have to read and discard all of the decompressed data for files `0` through `n-1`. As `n` increases, you have to read and discard more and more data. This is exacerbated if files near the beginning of the block are quite large and you're only interested in some files near the end. You also can't just seek forward into the compressed stream, as the state machine of the decompression routine(s) will be confused.

So if you implement your "extract everything in the archive" API in terms of your random access API, i.e. extracting file 0, then file 1, etc. it will have worse performance the larger the archive becomes. Whereas a dedicated "extract everything" API that iterates over the archive in one shot will be quick relative to the size of the archive.

I had the exact same problem and fixed this in my library by caching the reader of the decompressed data after reading file `n`, so that when reading any file `> n` it would use that cached reader instead of recreating it again from scratch. This means that iterating over the files in archive order always results in a cache hit, as there's always a cached reader positioned at the end of file `n-1`. If I were to sort the files in any way then that would potentially introduce performance degradation again; the worst scenario being to sort the files in reverse order to that of the archive.

Hope that helps.
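
A rough C# sketch of the caching idea described above, with hypothetical type and member names (this is neither the SharpCompress internals nor the Go implementation):

```cs
using System;
using System.IO;

// Keep the decompressed stream of the current solid block together with the
// index of the entry it is positioned at, so that reading entries in archive
// order never restarts decompression from the beginning of the block.
sealed class SolidBlockReaderCache
{
    private Stream _stream;   // decompressed data of the current block
    private int _nextEntry;   // index of the entry the stream is positioned at

    public Stream GetStreamFor(int entryIndex, Func<Stream> openBlock, long[] entrySizes)
    {
        // Cache miss: no stream yet, or the caller wants an entry we already passed.
        if (_stream == null || entryIndex < _nextEntry)
        {
            _stream?.Dispose();
            _stream = openBlock();   // restart decompression at the block start
            _nextEntry = 0;
        }

        // Decompress and discard the entries between the cached position and the
        // requested one; this is the unavoidable cost of a solid block.
        for (; _nextEntry < entryIndex; _nextEntry++)
            Skip(_stream, entrySizes[_nextEntry]);

        _nextEntry = entryIndex + 1;   // caller is expected to consume the entry fully
        return _stream;
    }

    private static void Skip(Stream stream, long count)
    {
        var buffer = new byte[81920];
        while (count > 0)
        {
            var read = stream.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (read <= 0) throw new EndOfStreamException();
            count -= read;
        }
    }
}
```

With this, iterating entries in archive order always hits the cache, and only jumping backwards forces the block to be decompressed again from the start.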


@adamhathcock commented on GitHub (May 3, 2023):

`CreateReaderForSolidExtraction` and `SevenZipReader` will access things sequentially in the archive.

7z has the worst of both worlds, and the `IArchive` interface might not be the best for it.


@FlsZen commented on GitHub (Jul 19, 2023):

I created a PR #750 with the extension method that supports my use case of extracting large `.7z` files to a new directory. It's super fast compared to `WriteToDirectory`, but might not be as feature-rich as needed.


@adamhathcock commented on GitHub (Jul 19, 2023):

Your PR uses a `Task.Run` which is just putting the thread on a different pool. If you want to do that, fine, but that's beyond the scope of this library. True Async needs to go all the way to the Stream.

If you want a different Extract to happen using the Reader, I'd be up for a PR for that.
