multithread support? #136

Open
opened 2026-01-29 22:07:13 +00:00 by claunia · 5 comments

Originally created by @skyyearxp on GitHub (Oct 13, 2016).

I use a 7z file as a data store. There are about 10,000 files in the 7z file; it is about 500 MB compressed, about 5 GB before compression.

I first get every file entry, then use a multithreaded Parallel.For to extract each entry stream to a byte[]. But it failed: the data came out corrupt. An entry stream's base stream position can apparently be affected by other threads.

```
var files = archive.Entries;
int nCount = files.Count;
for (int i = 0; i < nCount; i++)
{
    var file = files.ElementAt(i);
    _HisDataFileIndex.Add(file.Key, i);
}
```

...

```
Parallel.For(...)
{
    ...
    var stream = archive.Entries.ElementAt(fileIndex).OpenEntryStream();
    stream.CopyTo(_loadCache_Stream);
    stream.Dispose();
    stream = null;
    _loadCache_Stream.Position = 0;
    ...
}
```

So I used an object pool holding several SevenZipArchive instances; every thread picks an instance from the pool, uses it, and returns it afterwards. That works.
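A minimal sketch of that pooling pattern, assuming SevenZipArchive.Open(path) and a fixed pool size (the class name ArchivePool and the exhaustion handling are illustrative, not from the original post):

```csharp
using System;
using System.Collections.Concurrent;
using SharpCompress.Archives.SevenZip;

// Illustrative sketch of the pooling approach described above. Each
// archive instance has its own underlying file stream and is only ever
// used by one thread at a time.
sealed class ArchivePool : IDisposable
{
    private readonly ConcurrentBag<SevenZipArchive> _pool = new();

    public ArchivePool(string path, int size)
    {
        for (int i = 0; i < size; i++)
            _pool.Add(SevenZipArchive.Open(path));
    }

    public SevenZipArchive Rent() =>
        _pool.TryTake(out var archive)
            ? archive
            : throw new InvalidOperationException("pool exhausted"); // real code would wait or grow

    public void Return(SevenZipArchive archive) => _pool.Add(archive);

    public void Dispose()
    {
        while (_pool.TryTake(out var archive))
            archive.Dispose();
    }
}
```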

But every SevenZipArchive instance loads the whole entry table (EntryInfos) into memory; each instance takes about 80 MB. When the thread count climbs to 500, the pool grows to about 100 instances, taking about 8 GB. That's not so good.

So I want to know: how is SharpCompress meant to be used from multiple threads? Or could you try to make it work multithreaded?

Thanks a lot.

@adamhathcock commented on GitHub (Oct 13, 2016):

Sorry, doing this isn't on my mental roadmap. Especially for 7Zip as it's painful. Happy to accept features though.

@skyyearxp commented on GitHub (Oct 13, 2016):

Could I get an entry's position in the base stream?

Then I could use a FileStream to read the data at that position for the entry's length, and decompress it with LZMA2 myself. Would that work, and how? Some example code would help.

@elgonzo commented on GitHub (Dec 11, 2016):

@skyyearxp,

it is not as straightforward as you think with the 7z archive format. The 7z format organizes compressed data in so-called [solid blocks](https://en.wikipedia.org/wiki/Solid_compression). A 7z archive can have one or more solid blocks.

One solid block represents a compressed data stream. However (and that's where it gets complicated), such a solid block (i.e., a compressed stream) does not necessarily correspond to one compressed file. When adding files to a 7z archive, all those files are concatenated, forming one long continuous data stream. This data stream is then LZMA-compressed and forms a solid block. (Please note that my explanation here is a grossly simplified and incomplete illustration of a far more complex subject.)

It is unfortunately not possible to start decompressing somewhere in the middle of a compressed LZMA stream. Decompression of an LZMA stream always has to start from the beginning; that is the nature of the LZMA algorithm. (By the way, LZMA is a variation of LZ77, and LZ77 has this same characteristic.)

You can notice this characteristic when trying to extract a file located closer to the end of a large(r) solid block: the 7z decompressor has to spend significant time decompressing the block and discarding all the decompressed data until it reaches the start position of the desired data (i.e., the file to be extracted) within the uncompressed stream.

If you understand this, then you will also understand that concurrent/parallel extraction of two file entries can promise a performance benefit only if the two file entries reside in different solid blocks. If the two file entries reside in the same solid block, concurrent extraction of these two files cannot be faster than single-threaded sequential extraction.

A brief example:

Imagine a solid block that contains file A (of size *size_A*) followed by file B (of size *size_B*).
To extract B, decompression starts at the beginning of the block. The first *size_A* uncompressed bytes are discarded, and the following *size_B* uncompressed bytes are the file data of B.
So yes, to extract B, file A has to be decompressed as well. The only difference is that the data of A is discarded, whereas the data of B is copied/written to a buffer or target stream.

Now imagine you have two tasks running concurrently. The job of task 1 is to extract A, and the job of task 2 is to extract B. This is what those tasks will do, if they run perfectly parallel:

```
Task 1: Start decompressing block -> decompress data of A and write it to buffer
Task 2: Start decompressing block -> decompress data of A and discard/ignore it -> continue with decompression of data for file B and write it to buffer
```

Note that the work done by task 1 is more or less done by task 2 as well. Essentially, task 1 is just a waste of time in this scenario, decreasing overall performance. It would be much more efficient not to use concurrent tasks in such a scenario and just do:

```
Task: Start decompressing block -> decompress data of A and write it to buffer -> continue with decompression of data for file B and write it to buffer
```

The same applies whenever you try to extract three or more file entries from the same solid block in parallel: all but one of the parallel tasks will more or less repeat work that the task for the last file entry in the block has to do anyway.
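For completeness, here is a minimal sketch of that single sequential pass using SharpCompress's forward-only reader (assuming the ExtractAllEntries() API; the paths "data.7z" and "out" are placeholders), which walks each solid block exactly once:

```csharp
using System.IO;
using SharpCompress.Archives.SevenZip;

// One sequential pass over the archive: every solid block is decompressed
// once, and each entry is written out as it is reached in stream order.
using (var archive = SevenZipArchive.Open("data.7z"))
using (var reader = archive.ExtractAllEntries())
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
            continue;

        using (var entryStream = reader.OpenEntryStream())
        using (var output = File.Create(Path.Combine("out", reader.Entry.Key)))
        {
            entryStream.CopyTo(output);
        }
    }
}
```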

Now, you probably think: how are the files in my 7z archives organized? Are they all lumped together in one or a few solid blocks, or does each reside in its own solid block? You should probably expect the former (after all, concatenating many files into one or a few solid blocks is one of the things that enables the good compression ratios achieved by 7z). Anyway, you don't need to speculate: you can check it easily yourself. The GUI of the 7-Zip archiver shows for each file which block it is located in (you might need to enable the "Block" column).
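If your archive does turn out to contain many solid blocks, parallelizing across blocks can pay off. A rough sketch of that idea, with the caveat that GetSolidBlockIndex is a hypothetical helper (SharpCompress does not expose the block index publicly, as far as I know), and that skipping unwanted entries inside a forward-only reader still costs decompression time:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using SharpCompress.Archives.SevenZip;

// Hypothetical: group entry keys by solid block, then handle the groups in
// parallel with one private archive instance per task, so no stream is
// ever shared between threads.
using var archive = SevenZipArchive.Open("data.7z");
var groupedKeys = archive.Entries
    .Where(e => !e.IsDirectory)
    // GetSolidBlockIndex is NOT a SharpCompress API; it stands in for
    // whatever block lookup you implement yourself.
    .GroupBy(e => GetSolidBlockIndex(e), e => e.Key)
    .Select(g => g.ToArray())
    .ToArray();

Parallel.ForEach(groupedKeys, keys =>
{
    var wanted = new HashSet<string>(keys);

    // Private archive per task; a smarter implementation would seek
    // straight to the block instead of walking the whole archive.
    using var privateArchive = SevenZipArchive.Open("data.7z");
    using var reader = privateArchive.ExtractAllEntries();
    while (reader.MoveToNextEntry())
    {
        if (!wanted.Contains(reader.Entry.Key))
            continue;
        using var entryStream = reader.OpenEntryStream();
        using var output = File.Create(Path.Combine("out", reader.Entry.Key));
        entryStream.CopyTo(output);
    }
});
```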

Also, if you want to know in detail what has to be done to extract a file from a 7z archive, take a look at the SharpCompress source code. Make a small project that uses SharpCompress to extract a file from a 7z archive. Don't use the NuGet package for this little project; rather, pull the SharpCompress source code directly from GitHub. Having the SharpCompress source code in your project/solution allows you to single-step through the code that is executed during the extraction of a file...

@rmontoya12 commented on GitHub (Sep 17, 2018):

I'm also having threading issues with an uncompressed TAR, so this is not specific to 7z. My code also gets all the entries up front and then uses multiple threads to access the files from the TAR in parallel. Occasionally I get the following error:

```
System.IndexOutOfRangeException: Probable I/O race condition detected while copying memory. The I/O package is not thread safe by default. In multithreaded applications, a stream must be accessed in a thread-safe way, such as a thread-safe wrapper returned by TextReader's or TextWriter's Synchronized methods. This also applies to classes like StreamWriter and StreamReader.
   at System.Buffer.InternalBlockCopy(Array src, Int32 srcOffsetBytes, Array dst, Int32 dstOffsetBytes, Int32 byteCount)
   at System.IO.FileStream.Read(Byte[] array, Int32 offset, Int32 count)
   at SharpCompress.IO.ReadOnlySubStream.Read(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Stream.<>c.<BeginReadInternal>b__39_0(Object )
   at System.Threading.Tasks.Task`1.InnerInvoke()
   at System.Threading.Tasks.Task.Execute()
```
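One way to dodge this particular race is to give every worker its own archive handle. A minimal sketch, assuming a file handle per worker is affordable (TarArchive.Open and the path "data.tar" are illustrative, not from the original report):

```csharp
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using SharpCompress.Archives.Tar;

// Sketch: collect the entry keys once, then give every worker its own
// TarArchive (and therefore its own FileStream), so no stream is shared
// between threads.
string[] keys;
using (var archive = TarArchive.Open("data.tar"))
{
    keys = archive.Entries
        .Where(e => !e.IsDirectory)
        .Select(e => e.Key)
        .ToArray();
}

Parallel.ForEach(keys, key =>
{
    using var privateArchive = TarArchive.Open("data.tar");
    var entry = privateArchive.Entries.First(e => e.Key == key);

    using var entryStream = entry.OpenEntryStream();
    using var buffer = new MemoryStream();
    entryStream.CopyTo(buffer);
    // ... consume buffer.ToArray() ...
});
```

For uncompressed TAR this should be cheap, since opening the archive only reads headers rather than decompressing anything.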

@adamhathcock commented on GitHub (Sep 18, 2018):

None of this code is thread-safe. If you're somehow doing multi-threaded access then you need to lock things yourself or just pray it works.
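Concretely, "lock things yourself" could look like the following minimal sketch (the class and method names are illustrative):

```csharp
using System.IO;
using System.Linq;
using SharpCompress.Archives.SevenZip;

// Sketch: one lock object guards ALL access to the shared archive, so the
// decompressor's internal stream position can't be moved by another thread
// mid-read. Entry enumeration is guarded too, not just the stream copy.
static class LockedExtractor
{
    private static readonly object ArchiveLock = new object();

    public static byte[] ExtractEntry(SevenZipArchive archive, int fileIndex)
    {
        lock (ArchiveLock)
        {
            var entry = archive.Entries.ElementAt(fileIndex);
            using var entryStream = entry.OpenEntryStream();
            using var buffer = new MemoryStream();
            entryStream.CopyTo(buffer);
            return buffer.ToArray();
        }
    }
}
```

Note the trade-off: with a single shared archive, the lock serializes all decompression, so this is safe but no faster than single-threaded extraction; only the work done on the returned bytes runs in parallel. The pooled or per-thread approaches above trade memory for actual parallelism instead.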
