XZ seeking support #240

claunia commented

2026-01-29 22:08:52 +00:00

Originally created by @frabar666 on GitHub (Sep 19, 2017).

I added XZ seeking support on my XZSeek branch (https://github.com/frabar666/sharpcompress/tree/XZSeek).
Given the structure of XZ files, good performance relies on the XZ file being small, or having an appropriately sized index (using an indexed XZ file with 1 MB blocks means that a seek+read may need to decompress and "throw away" up to 1 MB of data before returning the first requested byte).
It doesn't look like indexed XZ files are very common (7-zip doesn't create them). Seeking in a large non-indexed XZ file will work, but slowly: the implementation will start at the beginning of the file, and then decompress and throw away everything before reaching the requested seek position.
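
To illustrate the slow path for non-indexed files, here is a minimal sketch in Python (not the SharpCompress API, and not the code on the branch). The function name `read_at_linear` and its parameters are made up for this example; it only relies on the standard `lzma` module.

```python
import io
import lzma


def read_at_linear(compressed: bytes, offset: int, length: int,
                   chunk_size: int = 64 * 1024) -> bytes:
    """Return up to `length` bytes starting at uncompressed `offset`.

    Hypothetical helper: without a usable index, the only option is to
    decompress from the very beginning and discard everything before `offset`.
    """
    with lzma.open(io.BytesIO(compressed), "rb") as stream:
        remaining = offset
        while remaining > 0:
            skipped = stream.read(min(chunk_size, remaining))
            if not skipped:
                return b""  # offset lies past the end of the data
            remaining -= len(skipped)  # these bytes are thrown away
        return stream.read(length)
```

The cost of a call grows with `offset`, which is why this approach is only acceptable for small files.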

Is there interest in a PR for this?

If there is, the API would need validation first:

- I didn't modify XZStream; it can still be used with unseekable input streams (it only checks BaseStream.CanRead on creation).
- I added a new XZSeekableStream class (which checks BaseStream.CanRead and CanSeek). It uses a new index implementation (XZPackedIndex, which puts less pressure on the GC by allocating a few large arrays, instead of XZIndex, which allocates a lot of small objects).
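
The "few large arrays" idea can be sketched as follows. This is an illustration in Python, not the actual XZPackedIndex from the branch: the class name `PackedIndex`, its fields and its methods are all assumptions made for the example.

```python
from array import array
from bisect import bisect_right


class PackedIndex:
    """Hypothetical block index stored in two flat arrays of 64-bit integers
    instead of one small object per block."""

    def __init__(self) -> None:
        self._uncompressed_starts = array("Q")  # where each block starts in the output
        self._compressed_offsets = array("Q")   # where each block starts in the file
        self._total_uncompressed = 0

    def add_block(self, compressed_offset: int, uncompressed_size: int) -> None:
        self._uncompressed_starts.append(self._total_uncompressed)
        self._compressed_offsets.append(compressed_offset)
        self._total_uncompressed += uncompressed_size

    def locate(self, position: int) -> tuple[int, int, int]:
        """Return (block number, compressed offset, bytes to skip in the block)."""
        if not 0 <= position < self._total_uncompressed:
            raise ValueError("position is outside the uncompressed data")
        # Binary search for the last block starting at or before `position`.
        block = bisect_right(self._uncompressed_starts, position) - 1
        skip = position - self._uncompressed_starts[block]
        return block, self._compressed_offsets[block], skip
```

Lookups stay logarithmic in the number of blocks, and the index for a file with millions of blocks is held in two contiguous buffers.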

XZSeekableStream behaves like XZStream while reading in "streaming" mode, and only seeks in the underlying stream when Length is read, Seek is called or Position is written to.

Seeking support could be merged into XZStream in order to have just one implementation, with a new constructor argument indicating whether seeking is requested (throws if !BaseStream.CanSeek) or not (throws if Length/Seek/Position are called)... but I'm not convinced that is better than keeping 2 distinct classes.

claunia closed this issue

2026-01-29 22:08:52 +00:00

claunia commented

2026-01-29 22:08:53 +00:00

@adamhathcock commented on GitHub (Sep 20, 2017):

I'm not sure what the value is. You still can't reverse, right? Currently, a user can manually seek forward by just throwing away the decompressed data themselves to get to a desired position.

Please tell me about the use-case here. XZ doesn't seem to be highly used and LZip seems like a superior format.

claunia commented

2026-01-29 22:08:53 +00:00

@frabar666 commented on GitHub (Sep 21, 2017):

You can go forward as well as backward. When you seek, regardless of the direction, the implementation locates the beginning of the block containing the seek position (thanks to the index), starts decompressing there and skips data until the requested position is reached.
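
A rough sketch of that lookup-then-skip logic, in Python rather than C#. It makes one important simplification: independently compressed `.xz` streams stand in for the blocks of a single XZ file, and the index is kept in plain Python lists, because the standard `lzma` module does not expose the real XZ index. The function names `compress_in_blocks` and `read_at` are made up for the example.

```python
import lzma
from bisect import bisect_right


def compress_in_blocks(data: bytes, block_size: int):
    """Compress `data` as independent chunks and build a small index.

    Returns (blob, uncompressed_starts, compressed_offsets); the last entry
    of compressed_offsets is a sentinel marking the end of the blob.
    """
    blob = bytearray()
    uncompressed_starts, compressed_offsets = [], []
    for start in range(0, len(data), block_size):
        uncompressed_starts.append(start)
        compressed_offsets.append(len(blob))
        blob += lzma.compress(data[start:start + block_size])
    compressed_offsets.append(len(blob))
    return bytes(blob), uncompressed_starts, compressed_offsets


def read_at(blob, uncompressed_starts, compressed_offsets, position, length):
    """Read `length` bytes at uncompressed `position`, forward or backward."""
    if position < 0:
        raise ValueError("position must be non-negative")
    if not uncompressed_starts:
        return b""
    # Locate the block containing `position`, then skip inside that block only.
    block = bisect_right(uncompressed_starts, position) - 1
    skip = position - uncompressed_starts[block]
    out = bytearray()
    while length > 0 and block < len(uncompressed_starts):
        chunk = blob[compressed_offsets[block]:compressed_offsets[block + 1]]
        piece = lzma.decompress(chunk)[skip:skip + length]
        out += piece
        length -= len(piece)
        skip = 0      # later blocks are read from their first byte
        block += 1
    return bytes(out)
```

With this layout, the amount of data decompressed and discarded per seek is bounded by the block size, regardless of the seek direction or the total file size.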

I need to compress huge text files (up to tens of TB) in order to reduce storage costs, but I still need to do quick seeks to extract a few KB from anywhere in the file. Basically the whole text file is a sequence of KB- to MB-sized records, and extracting any record must be quick (<5s). Given the size of the files and the speed goal, an index seems mandatory, so XZ seemed like the best choice as it can have an in-file index.
It's a pretty specific use-case :) Indexed XZ files seem very rare, so I realize seeking support is probably not of much interest.

I quickly discarded LZip for two reasons. First, I had never heard of it until a few days ago and standard 7-zip doesn't support it, even though the format has existed for over 8 years, which makes it hard to convince anyone to use it. Second, it supports independent chunks but has no in-file index. But I suppose I could use LZip and maintain an index myself in a separate file... I'll look into it more, thanks for the suggestion.

claunia commented

2026-01-29 22:08:54 +00:00

@adamhathcock commented on GitHub (Sep 21, 2017):

I think I better understand and I don't think this is something to support. This seems very specific and not really a standard. If there's something out there that does do something similar to what you need, let me know.

I hadn't heard of XZ or LZip until recently either. LZip seems to have been made to replace XZ: http://lzip.nongnu.org/xz_inadequate.html


Reference: starred/sharpcompress#240