mirror of
https://github.com/adamhathcock/sharpcompress.git
synced 2026-02-04 05:25:00 +00:00
Seek Optimized Zip #555
Originally created by @Techmaven-geospatial on GitHub (Jan 17, 2023).
Originally assigned to: @Copilot on GitHub.
Are there plans to support reading and writing Seek-Optimized ZIP (SOZip) archives?
@adamhathcock commented on GitHub (Jan 17, 2023):
What is this? A new Zip style/feature?
PRs are welcome.
@cocoon commented on GitHub (Nov 20, 2025):
I was also searching for seekable compression, so that large compressed disk images can be opened as a stream:
https://github.com/sozip/sozip-spec
https://github.com/madler/zlib/blob/master/examples/zran.c
https://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files
https://github.com/lh3/samtools-legacy/blob/master/razf.h
https://github.com/circulosmeos/gztool
https://www.artpol-software.com/ZipArchive/KB/0711101739.aspx
@adamhathcock commented on GitHub (Nov 26, 2025):
Version
License
This specification document is (C) 2022-2023 Even Rouault and licensed under the
CC-BY-4.0 terms.
Note: the scope of the copyrighted material does, of course, not extend onto
any source or binary code derived from the specification.
What is SOZip ?
A Seek-Optimized ZIP file (SOZip) is a
ZIP file that contains one
or several Deflate-compressed files
that are organized and annotated such that a SOZip-aware reader can perform
very fast random access (seek) within a compressed file.
SOZip makes it possible to access large compressed files directly from a .zip
file without prior decompression. It is not a new file format, but a profile
of the existing ZIP format, done in a fully backward compatible way. ZIP
readers that are non-SOZip aware can read a SOZip-enabled file
normally and ignore the extended features that support efficient seek
capability.
Use cases
This specification is intended to be general purpose / not domain specific.
SOZip was first developed to serve geospatial use cases, which commonly
have large compressed files inside of ZIP archives. In particular, it makes it
possible for users to read large Geographic Information Systems (GIS) files using the
Shapefile,
GeoPackage or
FlatGeobuf formats (which have no native provision
for compression) compressed in .zip files without prior decompression.
Efficient random access and selective decompression are a requirement to provide
acceptable performance in many usage scenarios: spatial index filtering, access to a
feature by its identifier, etc.
High-level specification
The SOZip optimization relies on two independent and combined mechanisms:
The first mechanism is the generation of a Deflate
compressed stream that is structured in such a way that it contains data chunks
that can be compressed and uncompressed independently from
preceding and following chunks in the compression stream. Conceptually, such
a compressed file could be split into multiple independently compressed
files, but from the point of view of a non-SOZip-aware ZIP reader, it is
a single, fully valid compressed stream for the whole file.
The chunking relies on the block flush mechanisms of the ZLib library, an
example of which is the pigz utility with its
--independent option. Block flushes are done at a regular interval of
the input uncompressed stream, with the consequence of slight degradation of
the compression rate. In the rest of this document, this interval is
called chunk size. A typical value for it is 32 kilobytes.
The second mechanism is the creation of a hidden index file containing an
array that maps file offsets of the uncompressed file, at every chunk size
interval, to the corresponding offset in the Deflate compressed stream. This
index is the structure that allows SOZip-aware readers to skip about throughout
the file.
The diagram below shows the high-level record organization of a SOZip-enabled
ZIP file containing a SOZip-optimized file my.gpkg:
my.gpkg file preceded by its local header.
.my.gpkg.sozip.idx index file preceded by its local header
Central directory with an entry corresponding to my.gpkg (and none for the index file)
End of Central directory
If we zoom in on the content of those 2 files, we can see that:
the Deflate stream of my.gpkg consists of many concatenated independent chunks.
the .my.gpkg.sozip.idx index file contains the offsets to the beginning of each chunk.
Detailed specification
A ZIP file is said to be SOZip-enabled if it contains one or several Deflate
compressed files meeting the following requirements, in addition to the
requirements of the .ZIP File Format
Specification.
A SOZip-enabled file may contain a mix of SOZip-compressed and regular compressed
or uncompressed files.
A file may be SOZip-compressed only if its uncompressed size is strictly greater
than the chunk size (otherwise there is no point in doing the SOZip optimization).
Chunked Deflate-compressed stream
A SOZip-compressed file MUST be created with compression_method = 8 (Deflate).
A SOZip-compressed file MUST have a corresponding local file
header
and a central directory file
header.
Those headers MAY use extended fields, typically the
ZIP64 extension if
the compressed and/or uncompressed size of a file exceeds 4 GB, and/or the
"Info-ZIP Unicode Path Extra Field" file name extension.
A SOZip writer MUST issue a call to the deflate() ZLib method with the
Z_SYNC_FLUSH mode, followed by a call with the Z_FULL_FLUSH flag, at a fixed
interval (called chunk size) of the data read from the input uncompressed stream.
Z_SYNC_FLUSH and Z_FULL_FLUSH are not required (but may be used) for
the final chunk, whose size may be smaller than or equal to the chunk size.
However the last call to deflate() to encode the last chunk MUST be done
with the Z_FINISH flag, to finalize a valid Deflate stream.
Note: an explanation of the Z_SYNC_FLUSH and Z_FULL_FLUSH modes can be found at
https://www.bolet.org/~pornin/deflate-flush-fr.html
The writer MUST collect the offset of each chunk, except for the initial
chunk, whose offset is zero. Offsets MUST be relative to the start of
the compressed data.
Note: a pseudo-code (among many possible variations) written in C++, using
zlib, can be found in Annex E
Hidden index file
Storage of the index file
The index file MUST be stored as an uncompressed file.
The index file name MUST be:
${path_to_filename}/.${filename}.sozip.idx where ${path_to_filename} is the name of the directory, if
${filename} contains directory paths. For example
my_dir/.rivers.gpkg.sozip.idx if the filename stored in the archive is
my_dir/rivers.gpkg
.${filename}.sozip.idx if there is no directory path in the filename. For example
.rivers.gpkg.sozip.idx
Note the leading dot ('.') character preceding the index filename, to indicate
its hidden status.
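The naming rule above can be sketched in a few lines of Python. The function name `sozip_index_name` is a hypothetical helper, not part of any SOZip library, and it assumes member names use forward slashes as ZIP entries normally do.

```python
import posixpath


def sozip_index_name(member_name: str) -> str:
    """Return the hidden index name for a ZIP member name.

    Hypothetical helper illustrating the naming rule; ZIP member names
    are assumed to use '/' as the directory separator.
    """
    directory, filename = posixpath.split(member_name)
    hidden = f".{filename}.sozip.idx"  # leading dot marks the file as hidden
    return posixpath.join(directory, hidden) if directory else hidden


print(sozip_index_name("my_dir/rivers.gpkg"))
print(sozip_index_name("rivers.gpkg"))
```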
The index file MUST be preceded by a ZIP local file header (cf paragraph 4.3.7
of the .ZIP File Format Specification)
That local file header MAY use extended fields, typically for the
"Info-ZIP Unicode Path Extra Field" file name extension. If the local file
header of the indexed file uses the "Info-ZIP Unicode Path Extra Field"
extension, the local file header of the index file MUST also use the
"Info-ZIP Unicode Path Extra Field" extension.
That local file header MUST be immediately placed after the compressed file.
That file MUST NOT be listed as a central directory file header, to remain invisible.
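Because the index is absent from the central directory, a reader has to look for it right after the compressed data of the indexed file. The following Python sketch shows one way to do that. `find_hidden_index` is a hypothetical helper; it assumes the indexed entry has no data descriptor (general purpose flag bit 3 unset) and ignores ZIP64 sizes in the index's own local header, so it is an illustration rather than a complete reader.

```python
import posixpath
import struct
import zipfile

# Fixed 30-byte part of a ZIP local file header: signature, version, flags,
# method, time, date, crc32, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def find_hidden_index(fileobj, member_name):
    """Return (data_offset, size) of the hidden index payload, or None."""
    with zipfile.ZipFile(fileobj) as zf:
        info = zf.getinfo(member_name)
    directory, filename = posixpath.split(member_name)
    expected = posixpath.join(directory, f".{filename}.sozip.idx")

    # Read the local header of the indexed file to find where its data ends.
    fileobj.seek(info.header_offset)
    fields = LOCAL_HEADER.unpack(fileobj.read(LOCAL_HEADER.size))
    flags, name_len, extra_len = fields[2], fields[9], fields[10]
    if flags & 0x08:
        return None  # data descriptor present: not handled by this sketch
    pos = (info.header_offset + LOCAL_HEADER.size + name_len + extra_len
           + info.compress_size)

    # The index, if any, starts with its own local header at that position.
    fileobj.seek(pos)
    raw = fileobj.read(LOCAL_HEADER.size)
    if len(raw) < LOCAL_HEADER.size:
        return None
    fields = LOCAL_HEADER.unpack(raw)
    if fields[0] != b"PK\x03\x04" or fields[3] != 0:  # must be a stored entry
        return None
    name = fileobj.read(fields[9]).decode("utf-8", "replace")
    if name != expected:
        return None
    return pos + LOCAL_HEADER.size + fields[9] + fields[10], fields[8]
```

The function takes a seekable binary file object, for example one returned by `open(path, "rb")`.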
Content of the index file
The hidden index file is made of a 32-byte header followed by a section of
varying size (called offset section) which contains the values of the offsets
collected during the generation of the Deflate-compressed data.
In the following,
uint32 is a 32-bit unsigned integer, encoded in little-endian
order (least significant byte first).
uint64 is a 64-bit unsigned integer, encoded in little-endian order.
Header
Specification of fields:
version: MUST be set to 1 for this specification
skip_bytes: number of bytes between the end of the header and the beginning
of the offset section. Generally set to 0. This could be set to a non-zero
value to store extra content, unspecified currently. SOZip readers SHOULD skip
over such extra content.
chunk_size: Interval, in uncompressed stream, at which Z_SYNC_FLUSH +
Z_FULL_FLUSH are performed. It MUST NOT be zero (a value
of 4096 or bigger is strongly recommended). A value lower than 100 MB
is strongly recommended for performance and compatibility with SOZip readers.
32 KB is a generally safe default value.
offset_size: Number of bytes to encode each entry of the offset section.
This MUST be 8 (uint64 values).
uncompress_size: Size in bytes of the uncompressed file (not the index, but
the file subject to SOZip compression). This field is redundant with other
information found in the local and central directory file headers of the compressed file,
and provides a reader with a consistency check of the SOZip index
with the compressed file.
uncompress_size MUST be strictly greater than chunk_size
compress_size: Size in bytes of the compressed file (not the index, but
the file subject to SOZip compression). This field is redundant with other
information found in the local and central directory file headers of the compressed file,
and is here so that a reader can check the consistency of the SOZip index
with the compressed file.
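As an illustration, the header can be decoded and validated with Python's `struct` module. The field table itself is not reproduced in this copy, so the layout used here is an assumption: the six fields in the order listed above, the first four as uint32 and the last two as uint64, which adds up to the stated 32 bytes. `parse_sozip_header` is a hypothetical helper name.

```python
import struct

# Assumed layout: version, skip_bytes, chunk_size, offset_size as uint32,
# then uncompress_size and compress_size as uint64, all little-endian.
HEADER = struct.Struct("<IIIIQQ")


def parse_sozip_header(buf: bytes) -> dict:
    """Decode the 32-byte index header and apply the rules listed above."""
    if len(buf) < HEADER.size:
        raise ValueError("index shorter than the 32-byte header")
    (version, skip_bytes, chunk_size, offset_size,
     uncompress_size, compress_size) = HEADER.unpack_from(buf)
    if version != 1:
        raise ValueError(f"unsupported version {version}")
    if chunk_size == 0:
        raise ValueError("chunk_size must not be zero")
    if offset_size != 8:
        raise ValueError("offset_size must be 8")
    if uncompress_size <= chunk_size:
        raise ValueError("uncompress_size must be greater than chunk_size")
    return {
        "version": version,
        "skip_bytes": skip_bytes,
        "chunk_size": chunk_size,
        "offset_size": offset_size,
        "uncompress_size": uncompress_size,
        "compress_size": compress_size,
    }
```

A reader would additionally compare `uncompress_size` and `compress_size` with the values in the ZIP headers of the indexed file, and ignore the index if they disagree.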
Offset section
The offset section MUST contain exactly
(uncompress_size - 1) / chunk_size (floor rounding) entries, each of size
offset_size bytes.
Each entry MUST be a uint64 expressing the offset at which a compressed chunk
starts. That offset is a relative offset, with respect to the start of the
compressed stream (consequently, the indexed file and its index could
potentially be relocated within the .zip file without requiring the index
to be regenerated).
The offset of the first compressed chunk MUST be omitted, as it is always 0.
The first offset value is thus the offset in the compressed stream where
uncompressed bytes in the range
[chunk_size, min(2 * chunk_size, uncompress_size)[ are encoded.
The second offset value is the offset in the compressed stream where
uncompressed bytes in the range
[2 * chunk_size, min(3 * chunk_size, uncompress_size)[ are encoded. And so on.
As a consequence of the generation of the index, values in the offset
section MUST be in strictly ascending order, and MUST be strictly lower than
compress_size.
Normative references
.ZIP File Format Specification
DEFLATE Compressed Data Format Specification version 1.3
Annex A: Software implementations
GDAL (C/C++ library)
The sozip development branch of GDAL
contains:
a CPLAddFileInZip() C function that can compress a file and add it to a
new or existing ZIP file, and enable the SOZip optimization when relevant.
an implementation of the
VSIGetFileMetadata()
method that can be used on a filename of the form "/vsizip//path/to/my.zip/filename/inside" and with
domain = "ZIP" to get information if a SOZip index is available for that file.
a modified /vsizip/
virtual file system that can use the SOZip optimization to perform fast random
access to a compressed file within a ZIP.
a new command line utility,
sozip,
that can be used to create a seek-optimized ZIP file, to append files to an
existing ZIP file, list the contents of a ZIP file and display the SOZip
optimization status or validate a SOZip file.
Updated Shapefile and GeoPackage drivers that can directly generate SOZip-enabled
.shz/.shp.zip or .gpkg.zip files.
This development branch is available in the rouault/sozip Docker image.
Examples:
Create a new ZIP file with an input file called in.gpkg:
Create a SOZip-optimized zip file called out.zip from an existing ZIP file
called in.zip.
List the content of a ZIP file and check if files in it are SOZip-optimized:
Validate a SOZip file:
Extract the polygons in a georeferenced window of interest from a
remote SOZip-optimized GeoPackage file:
sozipfile (Python module)
sozipfile is a fork of Python's
zipfile module, which
implements the SOZip optimization in the ZIP reader and writer, and can be used to
check if a file within a ZIP is SOZip-optimized.
MapServer (Web mapping server written in C/C++, using GDAL)
The sozip development branch
of MapServer, when built against a SOZip-capable GDAL,
can generate SOZip-enabled output files if the mapfile has a ZIP output format,
such as:
QGIS (Geographic Information System desktop and server application, using GDAL)
QGIS can efficiently read SOZip files when built against a
SOZip-capable GDAL, through the use of the GDAL
/vsizip/ virtual file system.
Annex B: Advantages and limitations
Advantages
SOZip allows multithreaded compression of independent chunks. This is for
example used in the GDAL implementation.
SOZip allows multithreaded decompression of independent chunks.
For decompression, faster alternatives to zlib can be used, such as
libdeflate. This is for example used
in the GDAL implementation.
SOZip has excellent backward compatibility. A data producer may deliver a
SOZip enabled file with good confidence that nearly all existing ZIP readers
can decompress it (at time of writing, we are not aware of ZIP readers that
reject a SOZip enabled file.)
Limitations
Compression efficiency is reduced by the flushes done to isolate chunks.
The larger the chunk size, the more efficient the compression, but
random seeking will be less efficient due to more data being decompressed.
SOZip inherits all the limitations of the base ZIP format: in particular
update in place of a SOZip optimized file requires rewriting the entire ZIP,
or appending the updated version of the modified file at the end of the ZIP
(with rewriting of the central header records and end of central directory
record).
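The trade-off between chunk size and compression efficiency can be observed with a small experiment. This is a sketch using Python's `zlib` module on synthetic data; `deflate_chunked` is a hypothetical helper and the sample text is made up, so the ratios it prints only indicate the trend, not what real GIS files would show.

```python
import zlib


def deflate_chunked(data: bytes, chunk_size: int) -> bytes:
    """Raw Deflate with a sync flush + full flush after every chunk but the last."""
    co = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15: raw Deflate, as in ZIP
    out = []
    for pos in range(0, len(data), chunk_size):
        out.append(co.compress(data[pos:pos + chunk_size]))
        if pos + chunk_size < len(data):
            out.append(co.flush(zlib.Z_SYNC_FLUSH))
            out.append(co.flush(zlib.Z_FULL_FLUSH))
    out.append(co.flush(zlib.Z_FINISH))
    return b"".join(out)


# Synthetic, fairly redundant input (an assumption made for the demo).
sample = b"".join(
    b"feature %d: name=river_%d type=polyline\n" % (i, i % 50)
    for i in range(20000)
)

# A chunk size equal to the input length means no intermediate flush at all.
baseline = len(deflate_chunked(sample, len(sample)))
for chunk_size in (4096, 32768, 262144):
    size = len(deflate_chunked(sample, chunk_size))
    print(f"chunk_size={chunk_size:>7}: {size} bytes "
          f"({size / baseline:.3f}x unchunked)")
```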
Annex C: Discussion about design choices
Why use the Deflate compression and not an alternative compression method?
Deflate has been chosen as it is supported by all existing ZIP implementations.
Other compression methods (LZMA, BZIP2, etc.) are supported more sparsely.
Furthermore, given that the SOZip optimization results in non-optimal
compression rate, it is likely that compression schemes that offer higher
compression than Deflate would perform in a suboptimal way, due to the chunking
mechanism resetting the dictionary at each chunk boundary.
Why encode the index as hidden and not visible?
This design decision has pros and cons.
Pros:
(some readers could for example expect a precise list of files to be in a
.zip)
if the index were visible, a non SOZip-aware writer that edits an existing archive
by recreating a new file would preserve the index file, but could
potentially recompress the compressed file without using the chunked Deflate
techniques, which could confuse SOZip readers (although a SOZip reader must
use information from the header of the SOZip index to check its consistency
with the compressed file).
Cons:
the hidden index is liable to be lost when using some non SOZip-aware ZIP writer.
However, ZIP implementations that have an append-in-place strategy will generally
preserve the hidden index. Refer to
Annex D for a
list of known implementations that can append-in-place.
Why encode the hidden index as a file preceded by a local header?
We have found at least one reader, Java's
java.util.zip.ZipInputStream,
which operates in a streaming way, and stops its enumeration of the content of
a ZIP file at the first encountered content that is not a local header.
Why place the hidden index after the compressed file and not before?
Both can make sense. It has been observed though that if a hidden local header
(that is not listed in the central directory entries) is located immediately
at the beginning of a zip file, the 7zip utility will expose the hidden file.
And, combined with the question in the previous paragraph, if the content
is not preceded by a local header, 7zip will emit a warning
("The archive is open with offset").
It is also slightly easier to generate the index after the compressed file,
given that the content of the index depends on information collected during
creation of the chunked Deflate compressed stream. This also makes it
potentially possible for a streaming writer to write a SOZip optimized file.
Why not compress the index file?
Having it uncompressed makes it easier for implementations. And for very large
files, having it uncompressed makes it possible to seek at a random location
in a truly constant time. A 64 GB compressed file, with a chunk size of 32 KB,
requires a 15.6 MB index. For scenarios where a SOZip file is read in an
on-demand piece-wise way from network storage, it would be costly in bandwidth
to have to download and decompress those 15 MB to read the last chunk of the
64 GB compressed file.
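The 15.6 MB figure follows from the entry-count rule of the offset section. The short calculation below is a sketch that makes two assumptions: "64 GB" is taken as 64 × 10^9 bytes, and it is applied to the uncompressed size, since the number of entries depends on `uncompress_size`. `sozip_index_size` is a hypothetical helper name.

```python
def sozip_index_size(uncompress_size: int, chunk_size: int,
                     offset_size: int = 8) -> int:
    """Size in bytes of the index: 32-byte header plus one entry per chunk
    except the first (skip_bytes assumed to be 0)."""
    entries = (uncompress_size - 1) // chunk_size
    return 32 + entries * offset_size


size = sozip_index_size(64 * 10**9, 32 * 1024)
print(f"{size} bytes, about {size / 1e6:.1f} MB")  # 15625024 bytes, about 15.6 MB
```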
Annex D: Compatibility with existing ZIP implementations
SOZip-enabled files have been tested with the following ZIP capable utilities
to check the backward compatibility (non exhaustive list!):
Compatible readers:
Info-ZIP unzip command line utility.
libzip: C library for reading, creating, and modifying zip archives
7zip command line utility or graphical interface.
WinZip graphical interface.
WinRAR graphical interface.
Windows Explorer default ZIP reader.
MacOSX default ZIP extractor.
MacOSX zipinfo command line utility
ark KDE (Linux/Unix typically) graphical interface
file-roller GNOME (Linux/Unix typically) graphical interface
Java's java.util.zip.ZipFile class.
Java's java.util.zip.ZipInputStream class,
with the caveat that it sees the hidden index files (being a streaming reader,
it only takes into account local file records)
Python zipfile module
GDAL /vsizip/
virtual file system, in all existing GDAL versions, can read SOZip-enabled files.
Versions >= 3.7 will be able to take advantage of the SOZip index (earlier
versions will ignore it.)
Partially compatible readers:
main branch, or a version greater than
2.108. Currently released versions (2.108 or earlier) will error out on SOZip
files, rejecting them because the Local header record of a .sozip.idx file has
no matching Central header record.
Compatible writers:
Info-ZIP zip command line utility,
used to create / edit zip files, can append new files to a SOZip-enabled file,
if using the -g/--grow option, while preserving their existing SOZip-optimization
Python zipfile module can
append new files to a SOZip-enabled file, while preserving their existing
SOZip-optimization.
GDAL sozip
command line utility can create SOZip-enabled files, and append new files to
a SOZip-enabled file, while preserving their existing SOZip-optimization.
GDAL /vsizip/
virtual file system, in all existing GDAL versions, can append new files to a
SOZip-enabled file, while preserving their existing SOZip-optimization.
However, a number of writers, while attempting to "append" to a SOZip-enabled file,
actually create a new file from scratch, and will lose the hidden index.
Annex E: Pseudo-code for SOZip Deflate stream generation
Licensed under CC0 or
MIT at the choice of the user (that is, feel free
to reuse and adapt without any constraint).
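The specification's own pseudo-code (C++ with zlib) is not reproduced in this copy of the thread. As a stand-in, here is a minimal Python sketch of the same idea using the standard `zlib` module. It is not the specification's code: `sozip_deflate` is a hypothetical name, the whole input is held in memory, and the index header layout (four uint32 followed by two uint64) is assumed from the field list in the detailed specification.

```python
import struct
import zlib


def sozip_deflate(data: bytes, chunk_size: int = 32768, level: int = 6):
    """Return (compressed_stream, index_bytes) for `data`.

    Produces a raw Deflate stream with a Z_SYNC_FLUSH followed by a
    Z_FULL_FLUSH after every chunk except the last, and collects the
    compressed offset at which each chunk (except the first) starts.
    """
    if len(data) <= chunk_size:
        raise ValueError("uncompressed size must exceed the chunk size")
    co = zlib.compressobj(level, zlib.DEFLATED, -15)  # -15: raw Deflate
    out = bytearray()
    offsets = []
    for pos in range(0, len(data), chunk_size):
        if pos > 0:
            offsets.append(len(out))  # start of this chunk, relative to stream
        out += co.compress(data[pos:pos + chunk_size])
        if pos + chunk_size < len(data):
            out += co.flush(zlib.Z_SYNC_FLUSH)
            out += co.flush(zlib.Z_FULL_FLUSH)
    out += co.flush(zlib.Z_FINISH)  # finalize a valid Deflate stream

    # Assumed header layout: version, skip_bytes, chunk_size, offset_size
    # (uint32 each), then uncompress_size and compress_size (uint64 each).
    header = struct.pack("<IIIIQQ", 1, 0, chunk_size, 8, len(data), len(out))
    index = header + struct.pack(f"<{len(offsets)}Q", *offsets)
    return bytes(out), index
```

Writing the two results into a ZIP (local headers, central directory entry for the compressed file only) is left out of the sketch.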
Annex F: Notes to independently decode a SOZip chunk
The chunking process, involving Z_FULL_FLUSH, used during SOZip generation
makes it possible to decompress chunks in an independent way.
Given uncompressed_offset, an offset in the uncompressed file that is a
multiple of chunk_size, the reader should retrieve from the offset section
of the .sozip.idx (stored in an array compressed_offset_table[])
the start and end offsets of the chunk, like below:
Given that each chunk, except the last one, is not a standalone Deflate stream,
reading code must be careful to present it in a way that will not confuse the
Deflate decoding library.
The only precaution to take is to dynamically patch the last flush Deflate block
at the end of the compressed chunk to be advertized as the final Deflate block
of the chunk.
Given the use of Z_FULL_FLUSH, the last 5 bytes of a chunk (that is not the last
one) should be \x00, \x00, \x00, \xFF, \xFF. The patching
operation consists in changing the first byte of this 5 byte sequence from
\x00 to \x01 as below:
With the above, if using zlib, a chunk can then be decompressed with:
Or if using libdeflate:
All pseudo-code in this annex is licensed under CC0 or
MIT at the choice of the user (that is, feel free
to reuse and adapt without any constraint).
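The original snippets of this annex are not reproduced in this copy. The steps it describes (offset lookup, patching of the final flush block, decompression) can be sketched in Python as follows. `read_chunk` is a hypothetical helper, the compressed stream and the index are assumed to be already loaded in memory, and the header layout (four uint32 followed by two uint64) is assumed from the field list in the detailed specification.

```python
import struct
import zlib


def read_chunk(stream: bytes, index: bytes, uncompressed_offset: int) -> bytes:
    """Decompress the single chunk starting at `uncompressed_offset`."""
    (_version, skip_bytes, chunk_size, _offset_size,
     uncompress_size, compress_size) = struct.unpack_from("<IIIIQQ", index)
    if uncompressed_offset % chunk_size or uncompressed_offset >= uncompress_size:
        raise ValueError("offset must be a multiple of chunk_size within the file")

    # Offset section: one uint64 per chunk, the first chunk (offset 0) omitted.
    n_entries = (uncompress_size - 1) // chunk_size
    table = struct.unpack_from(f"<{n_entries}Q", index, 32 + skip_bytes)

    i = uncompressed_offset // chunk_size
    start = 0 if i == 0 else table[i - 1]
    end = compress_size if i == n_entries else table[i]

    chunk = bytearray(stream[start:end])
    if i < n_entries:
        # Not the last chunk: it ends with the empty stored block left by the
        # flushes. Mark that block as final so the decoder sees a complete stream.
        if chunk[-5:] != b"\x00\x00\x00\xff\xff":
            raise ValueError("chunk does not end with the expected flush marker")
        chunk[-5] = 0x01
    return zlib.decompress(bytes(chunk), -15)  # -15: raw Deflate
```

Serving an arbitrary byte range then amounts to calling this for each chunk the range overlaps and slicing the result.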
Annex G: Examples
Examples of SOZip-enabled files can be found in the
sozip-examples repository.
Annex H: commented dump of a dummy SOZip file
The following invocation of GDAL's sozip utility generates a dummy
SOZip-enabled file that contains a tiny file "foo" with "foo" as content,
using a chunk size of 2 bytes.
[Commented byte dump of foo.zip not preserved in this copy. The surviving annotations refer to the local headers of the compressed file and of the index file, the index file name ending in ".sozip.idx", the central directory and the end of central directory record.]
Output of "zipdetails foo.zip" (using main branch of https://github.com/pmqs/zipdetails):