mirror of
https://github.com/adamhathcock/sharpcompress.git
synced 2026-02-10 05:31:29 +00:00
XZ seeking support #240
Originally created by @frabar666 on GitHub (Sep 19, 2017).
I added XZ seeking support on my XZSeek branch.
Given the structure of XZ files, good performance relies on the XZ file being small, or having an appropriately-sized index (using an indexed XZ file with 1MB-size blocks means that a seek+read may need to decompress and "throw away" up to 1MB of data before returning the first requested byte).
It doesn't look like indexed XZ files are very common (7-zip doesn't create them). Seeking in a large non-indexed XZ file will work, but slowly: the implementation will start at the beginning of the file, and then decompress and throw away everything before reaching the requested seek position.
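To illustrate the slow path described above, here is a minimal sketch in Python using the standard `lzma` module. It is not SharpCompress code; `read_at` and its parameters are hypothetical names chosen for this example. It always starts at the beginning of the stream and discards everything before the requested position.

```python
import io
import lzma


def read_at(compressed: bytes, offset: int, size: int, chunk: int = 64 * 1024) -> bytes:
    """Return up to `size` uncompressed bytes starting at `offset`.

    Hypothetical helper: it has no index, so it decompresses from the
    start of the stream and throws away the first `offset` bytes.
    """
    with lzma.open(io.BytesIO(compressed), "rb") as stream:
        remaining = offset
        while remaining > 0:
            discarded = stream.read(min(chunk, remaining))
            if not discarded:
                # Reached the end of the data before the requested offset.
                return b""
            remaining -= len(discarded)
        return stream.read(size)
```

The cost grows with `offset`, which is why this approach is only acceptable for small files.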
Is there interest in a PR for this?
If there is, the API would need validation first:
- XZStream, it can still be used with unseekable input streams (it only checks BaseStream.CanRead on creation).
- XZSeekableStream class (which checks BaseStream.CanRead and CanSeek). It uses a new index implementation (XZPackedIndex which puts less pressure on the GC by allocating few large arrays, instead of XZIndex which allocates a lot of small objects).
- XZSeekableStream behaves like XZStream while reading in "streaming" mode, and only seeks in the underlying stream when Length is read, Seek is called or Position is written to.
Seeking support could be merged into XZStream in order to have just one implementation, with a new constructor argument indicating whether seeking is requested (throws if !BaseStream.CanSeek) or not (throws if Length/Seek/Position are called)... but I'm not convinced that is better than keeping 2 distinct classes.
@adamhathcock commented on GitHub (Sep 20, 2017):
I'm not sure what the value is. You still can't reverse, right? Currently, a user can manually seek forward by decompressing and throwing away the data themselves until they reach the desired position.
Please tell me about the use-case here. XZ doesn't seem to be highly used and LZip seems like a superior format.
@frabar666 commented on GitHub (Sep 21, 2017):
You can go forward as well as backward. When you seek, regardless of the direction, the implementation locates the beginning of the block containing the seek position (thanks to the index), starts decompressing there and skips data until the requested position is reached.
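The lookup step described above can be sketched as follows, in Python rather than C#. This is not the code from the XZSeek branch; `BlockEntry` and `locate` are hypothetical names, and the index is assumed to be a list of block start offsets sorted by uncompressed position.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class BlockEntry(NamedTuple):
    """One index record (hypothetical layout): where a block starts."""
    compressed_offset: int
    uncompressed_offset: int


def locate(index: List[BlockEntry], position: int) -> Tuple[BlockEntry, int]:
    """Find the block containing `position` and the bytes to skip inside it."""
    starts = [entry.uncompressed_offset for entry in index]
    # The last block whose start is <= position contains the position.
    i = bisect_right(starts, position) - 1
    if i < 0:
        raise ValueError("position is before the first block")
    entry = index[i]
    return entry, position - entry.uncompressed_offset
```

A caller would then seek the underlying stream to `compressed_offset`, start decompressing there, and discard the returned number of bytes. The sketch does not check for positions past the end of the last block, since that requires the total uncompressed length.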
I need to compress huge text files (up to tens of TB) in order to reduce storage costs, but I still need to do quick seeks to extract a few KB from anywhere in the file. Basically the whole text file is a sequence of KB- to MB-sized records, and extracting any record must be quick (<5s). Given the size of the files and the speed goal, an index seems mandatory, so XZ seemed like the best choice as it can have an in-file index.
It's a pretty specific use-case :) Indexed XZ files seem very rare, so I realize seeking support is probably not of much interest.
I quickly discarded LZip: I had never heard of it until a few days ago, and standard 7-zip doesn't support it, even though the format has existed for over 8 years, which makes it hard to convince anyone to use it. It also supports independent chunks but has no in-file index. But I suppose I could use LZip and maintain an index myself in a separate file... I'll look into it more, thanks for the suggestion.
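A rough sketch of the "independent chunks plus a separately maintained index" idea, in Python. The standard library has no LZip support, so this uses independent XZ streams from the `lzma` module as a stand-in; `compress_chunks` and `read_range` are hypothetical names, and the index is kept as a plain in-memory list rather than a real side file.

```python
import lzma
from bisect import bisect_right


def compress_chunks(data: bytes, chunk_size: int):
    """Compress `data` as independent chunks and build an external index."""
    blob = bytearray()
    index = []  # (uncompressed_offset, compressed_offset, compressed_length)
    for start in range(0, len(data), chunk_size):
        member = lzma.compress(data[start:start + chunk_size])
        index.append((start, len(blob), len(member)))
        blob += member
    return bytes(blob), index


def read_range(blob: bytes, index, offset: int, size: int) -> bytes:
    """Read `size` uncompressed bytes at `offset`, touching only the chunks needed."""
    starts = [entry[0] for entry in index]
    i = bisect_right(starts, offset) - 1
    out = bytearray()
    while 0 <= i < len(index) and len(out) < size:
        u_off, c_off, c_len = index[i]
        chunk = lzma.decompress(blob[c_off:c_off + c_len])
        # Skip into the first chunk only; later chunks are read from their start.
        begin = max(offset - u_off, 0)
        out += chunk[begin:begin + size - len(out)]
        i += 1
    return bytes(out)
```

With this layout, the worst-case cost of a read is bounded by the chunk size rather than by the file size, at the price of keeping the index in sync with the compressed file.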
@adamhathcock commented on GitHub (Sep 21, 2017):
I think I better understand and I don't think this is something to support. This seems very specific and not really a standard. If there's something out there that does do something similar to what you need, let me know.
I hadn't heard of XZ or LZip until more recently either. LZip seems to have been made to replace XZ http://lzip.nongnu.org/xz_inadequate.html