Best way to achieve parallel processing of individual entries inside a tarball #520
Originally created by @arvindshmicrosoft on GitHub (May 22, 2022).
My goal is to use separate threads to extract and process individual files from the tarball. Each thread does CPU-intensive post-processing on the data read from its file, hence the motivation to keep this multi-threaded. Given that individual TarReader instances are assumed not to be thread-safe, I took the approach of opening separate TarReader instances and then starting a separate task for each EntryStream.
The approach works, but performance rapidly becomes a concern for larger files. My logic of skipping entries that already have a Task queued up is wasteful, because each pass starts from scratch, so later entries in the tarball by definition take longer to even start processing. For example, when SkipEntry() is called for larger files (several tens of GiB each), the stream still appears to be read all the way to the next entry, which is painfully slow. This is despite setting a larger buffer size on the underlying FileStream and reading the source tarball from an SSD.
I would appreciate any advice on doing this in a more performant manner. For example, is there a way to start reading from a later position in the stream on subsequent iterations of the outer loop? When opening the same tarball with 7-Zip, for example, it is very quick to extract entries that are much "later" in the file, so I am hoping there is a more performant way to achieve this with SharpCompress. The whole sample is below.
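(A minimal sketch standing in for the sample; the tarball path, buffer size, and the ProcessEntry helper are hypothetical placeholders.)

```csharp
// Sketch only (not the original sample): one forward-only TarReader per pass,
// re-scanning from the start and skipping entries that already have a task.
// The tarball path, buffer size, and ProcessEntry helper are hypothetical.
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using SharpCompress.Readers.Tar;

const string tarPath = @"C:\data\sample.tar";            // hypothetical path
var queued = new ConcurrentDictionary<string, bool>();
var tasks = new List<Task>();

bool queuedThisPass = true;
while (queuedThisPass)
{
    queuedThisPass = false;

    // Every pass re-opens the tarball from the start; readers are forward-only
    // and not thread-safe, so each queued entry keeps its own reader/stream.
    var stream = new FileStream(tarPath, FileMode.Open, FileAccess.Read,
                                FileShare.Read, bufferSize: 1 << 20);
    var reader = TarReader.Open(stream);

    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
            continue;

        // Entries already queued are skipped, but the reader still has to
        // stream past their bytes to reach the next header -- the slow part.
        if (!queued.TryAdd(reader.Entry.Key, true))
            continue;

        queuedThisPass = true;
        string key = reader.Entry.Key;
        var entryStream = reader.OpenEntryStream();
        tasks.Add(Task.Run(() =>
        {
            using (stream)
            using (reader)
            using (entryStream)
            {
                ProcessEntry(key, entryStream);          // hypothetical CPU-bound work
            }
        }));
        break;                                           // next entry needs a fresh reader
    }

    if (!queuedThisPass)
    {
        reader.Dispose();
        stream.Dispose();
    }
}

Task.WaitAll(tasks.ToArray());

static void ProcessEntry(string name, Stream data)
{
    // Placeholder for the CPU-intensive post-processing of each file's data.
}
```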
@adamhathcock commented on GitHub (May 23, 2022):
If you use TarArchive (which you should, because it's a file that can be randomly accessed?), then skipping will be faster, because the code should be using Seek to skip entries instead of reading the bytes out to a Null stream (Reader is forward-only, so we can't seek in that scenario). If you use multiple TarArchives to process entries, then you might be able to achieve what you want. Maybe also memory-mapped file streams.
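For illustration, a minimal sketch of the memory-mapped variant (the tarball path is a hypothetical placeholder); TarArchive only needs a seekable stream, and a MemoryMappedViewStream qualifies:

```csharp
// Sketch only: open the tarball through a read-only memory-mapped view and
// hand the seekable view stream to TarArchive. The path is hypothetical.
using System.IO;
using System.IO.MemoryMappedFiles;
using SharpCompress.Archives.Tar;

using var mmf = MemoryMappedFile.CreateFromFile(
    @"C:\data\sample.tar", FileMode.Open, mapName: null, capacity: 0,
    MemoryMappedFileAccess.Read);
using var view = mmf.CreateViewStream(0, 0, MemoryMappedFileAccess.Read); // seekable view
using var archive = TarArchive.Open(view);

foreach (var entry in archive.Entries)
{
    if (entry.IsDirectory)
        continue;
    using var entryStream = entry.OpenEntryStream();
    // read/process entryStream here
}
```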
@arvindshmicrosoft commented on GitHub (May 23, 2022):
Thank you @adamhathcock for your precise suggestion. Using TarArchive worked like a charm, and only minimal changes were required to my code. For completeness, the refactored code is below.
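(Again a minimal sketch rather than the exact refactored code; the tarball path and ProcessEntry helper are hypothetical placeholders.)

```csharp
// Sketch only (not the original refactored code): TarArchive over a seekable
// FileStream lets each task seek directly to its entry instead of streaming
// past everything before it. One archive/stream per task, since neither is
// thread-safe. The tarball path and ProcessEntry helper are hypothetical.
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using SharpCompress.Archives.Tar;

const string tarPath = @"C:\data\sample.tar";             // hypothetical path

// Pass 1: enumerate entry names once (header reads only; cheap on a seekable stream).
List<string> entryKeys;
using (var archive = TarArchive.Open(tarPath))
{
    entryKeys = archive.Entries
                       .Where(e => !e.IsDirectory)
                       .Select(e => e.Key)
                       .ToList();
}

// Pass 2: one task per entry, each with its own archive over its own FileStream,
// so OpenEntryStream can seek without contention between tasks.
var tasks = entryKeys.Select(key => Task.Run(() =>
{
    using var stream = new FileStream(tarPath, FileMode.Open, FileAccess.Read, FileShare.Read);
    using var archive = TarArchive.Open(stream);
    var entry = archive.Entries.First(e => e.Key == key);
    using var entryStream = entry.OpenEntryStream();
    ProcessEntry(key, entryStream);                        // hypothetical CPU-bound work
})).ToArray();

Task.WaitAll(tasks);

static void ProcessEntry(string name, Stream data)
{
    // Placeholder for the CPU-intensive post-processing of each file's data.
}
```

In practice the number of concurrent tasks would likely be capped (for example with a SemaphoreSlim) so that only a handful of FileStreams are open at once while the CPU-bound work runs.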