GZip Header / Footer info #432

claunia · 2026-01-29T22:11:59Z

claunia commented

2026-01-29 22:11:59 +00:00

Originally created by @jzabroski on GitHub (Jan 8, 2021).

Hi @adamhathcock,

Small world.

I have about 75GB of gzip data I need to decompress and then load into SQL tables. Since the data vendor could in theory update any historical data at any time, ideally I want to:

1. Get list of gz files in both history and daily folders
2. Read the list entries from the zip file and get filenames / sizes (not sure if size is needed)
3. Compare filename / sizes to what we have in the database
4. Anything not in the database -> extract to temp folder
5. Import data in temp folder into database
6. Clean up temp folder
7. Repeat
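
Steps 1 to 4 could be sketched roughly as below. This is only an illustration under assumptions: `GzipBacklog`, `FindFilesToImport` and `knownSizes` are made-up names, the dictionary stands in for whatever database query records what was already imported, and the comparison uses the compressed size on disk rather than the uncompressed size.

```c#
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class GzipBacklog
{
    // Returns the .gz files under the given folders that are either unknown
    // to the database or whose size on disk differs from the recorded size.
    // knownSizes is a hypothetical stand-in for a database lookup
    // (file name -> compressed size recorded at the last import).
    public static List<string> FindFilesToImport(
        IEnumerable<string> folders,
        IReadOnlyDictionary<string, long> knownSizes)
    {
        return folders
            .SelectMany(folder => Directory.EnumerateFiles(folder, "*.gz", SearchOption.AllDirectories))
            .Where(path => !knownSizes.TryGetValue(Path.GetFileName(path), out long known)
                           || known != new FileInfo(path).Length)
            .OrderBy(path => path, StringComparer.Ordinal)
            .ToList();
    }
}
```

Keying on the bare file name assumes names are unique across the history and daily folders; if they are not, the relative path would be a safer key.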

But, to do this, ideally I would only read the GZip header and footer, so I know how big of a file I am extracting, but I don't see any clean .NET APIs that let you do something like the following pseudo-code:

```c#
await using var fileStream = File.OpenAsync("myfile.gz");
await using var gzipStream = new GZipStream(fileStream, ZipMode.Read);
var fileSize = gzipStream.Header.UncompressedFileSize;
var fileName = gzipStream.Header.FileName;
```

But... this is likely slightly incorrect. Even so, I see that is approximately how Go models it: https://golang.org/src/compress/gzip/gunzip.go?s=1297:1500#L42

I'm a little surprised there doesn't seem to be a .NET library with an API for such an obvious use case, when Go has one.

StackOverflow seems to suggest that an older version of this library supported a concrete FilePath value. https://stackoverflow.com/a/39081983/1040437

This is the code I have written so far, but `reader.Entry.Key` is blank and `reader.Entry.LinkTarget` is also blank, and I don't see a FilePath option anywhere.

```c#
var files = Directory.EnumerateFiles(historicalDataLocalPath, "*.gz", SearchOption.AllDirectories);
foreach (var file in files)
{
    var readerOptions = new ReaderOptions();
    readerOptions.LookForHeader = true; // It looks like this only applies to RarArchive for some reason.
    using Stream stream = File.OpenRead(file);
    using var reader = GZipReader.Open(stream, readerOptions);
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
        {
            continue;
        }

        using var entryStream = reader.OpenEntryStream();
        var outputPath = Path.Combine(configuration.WorkingPath, reader.Entry.Key ?? reader.Entry.LinkTarget);
        using Stream writeStream = File.OpenWrite(outputPath);
        entryStream.CopyTo(writeStream);
    }
}
```


claunia commented

2026-01-29 22:11:59 +00:00

@adamhathcock commented on GitHub (Jan 8, 2021):

The header stuff that Go references is actually there:
https://github.com/adamhathcock/sharpcompress/blob/master/src/SharpCompress/Compressors/Deflate/ZlibBaseStream.cs#L63

It just needs exposing or some other minor updates as it looks like the whole header is gathered. This implementation was based on another and I haven't really touched it since.


claunia commented

2026-01-29 22:12:00 +00:00

@jzabroski commented on GitHub (Jan 8, 2021):

Is that why my GZipEntry records don't match PeaZip? Since I am not very familiar with the GZip standard and am learning as I go, I don't fully understand why these don't match.

![image](https://user-images.githubusercontent.com/447485/104054203-a80b1e80-51ba-11eb-8bcb-9468cd89d128.png)

![image](https://user-images.githubusercontent.com/447485/104054458-249dfd00-51bb-11eb-883e-65b0d2b9eb0f.png)

claunia commented

2026-01-29 22:12:00 +00:00

@jzabroski commented on GitHub (Jan 8, 2021):

For what it's worth, this .gz file is from a Redshift table export. https://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html

So, I think this would be a fairly common task to want to use GZip decompression for, as more and more people use data lakes.


claunia commented

2026-01-29 22:12:00 +00:00

@adamhathcock commented on GitHub (Jan 8, 2021):

There might be a bug in that the GZipEntry isn't picking up the file name correctly.

It's also possible that there is no file name embedded and PeaZip just gives it a default.

I'm guessing I have a bug but would need a sample (and time) to validate.

Any chance you can get the source of sharpcompress to debug? I'd put a breakpoint on the gzip header read spot to see what comes out. If the header does contain a name, then I have a bug to fix. Otherwise there's just no info in the file.


claunia commented

2026-01-29 22:12:01 +00:00

@jzabroski commented on GitHub (Jan 8, 2021):

I think it's not a bug. I did the following to try to analyze further:

1. Installed the GNU gzip library via chocolatey: `choco install -y gzip`
2. Ran `refreshenv` to add gzip command path to `$env:PATH`
3. Ran `gzip --list "\\fileshare\path\to\file.gz"`
4. Got the following:

     compressed        uncompressed  ratio uncompressed_name
       26393242            78119087  66.2% \\fileshare\path\to\file

From reading online, the 4th bit of the 4th byte (the FNAME flag, mask 0x08, in the header's FLG byte) determines whether the original filename is stored. When it is not set, the "correct" behavior used by various tools is to use the gz filename without the gz extension. Unfortunately, the `gzip` command line program doesn't directly display that header info either, which sucks. Adding `-v` only adds three new columns:

method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 72c113c2 Jun  8 07:59            26393242            78119087  66.2% \\fileshare\path\to\file
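
To check the flag without a debugger, something like the following might work. It is a sketch based on my reading of the header layout in RFC 1952, not SharpCompress code: `GzipHeaderName`, `ReadStoredName` and `NameOrFallback` are names I made up, and it skips the optional extra field but does not validate anything beyond the two magic bytes.

```c#
#nullable enable
using System.IO;
using System.Text;

public static class GzipHeaderName
{
    private const byte FlagExtra = 0x04; // FEXTRA: an extra field precedes the name
    private const byte FlagName = 0x08;  // FNAME: a zero-terminated original name is stored

    // Returns the name stored in the gzip header, or null when FNAME is not set.
    public static string? ReadStoredName(Stream stream)
    {
        using var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true);
        byte[] header = reader.ReadBytes(10); // ID1 ID2 CM FLG MTIME(4) XFL OS
        if (header.Length < 10 || header[0] != 0x1f || header[1] != 0x8b)
            throw new InvalidDataException("Not a gzip stream.");

        byte flags = header[3];
        if ((flags & FlagExtra) != 0)
        {
            ushort extraLength = reader.ReadUInt16(); // XLEN, little-endian
            reader.ReadBytes(extraLength);            // skip the extra field
        }
        if ((flags & FlagName) == 0)
            return null;

        var name = new StringBuilder();
        for (byte b = reader.ReadByte(); b != 0; b = reader.ReadByte())
            name.Append((char)b); // ISO 8859-1: each byte maps to one code point
        return name.ToString();
    }

    // Mimics what other tools appear to do: fall back to the .gz file name
    // with its last extension removed when no name is stored.
    public static string NameOrFallback(string gzPath)
    {
        using var stream = File.OpenRead(gzPath);
        return ReadStoredName(stream) ?? Path.GetFileNameWithoutExtension(gzPath);
    }
}
```

If `ReadStoredName` returns null for the Redshift export, that would suggest the name PeaZip shows is derived from the .gz file name rather than read from the header.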

> Any chance you get the source of sharpcompress to debug? I'd put a breakpoint on the gzip header read spot to see what comes out. If that's correct then I need to fix. Else there's just no info.

This is a good idea. Will fork and see.


claunia commented

2026-01-29 22:12:01 +00:00

@jzabroski commented on GitHub (Jan 8, 2021):

I was able to figure it out with gzip

gzip -dkv "\\fileshare\path\to\file.gz"

outputs:

\\fileshare\path\to\file.gz:
 66.2% -- replaced with \\fileshare\path\to\file

Which, upon reading online, `-` is the default "file name" if none is given.

I'll still fork the repo and look at contributing a patch sometime soon. Seems like fun.

I also think the API could use some changes to make it more friendly to generic programming and async/await all the way.


claunia commented

2026-01-29 22:12:01 +00:00

@adamhathcock commented on GitHub (Jan 8, 2021):

Showing the filename when that byte is present isn't what this library does. The API doesn't know the name. It knows streams.

That said, all the info should be exposed on GZipStream and/or GZipEntry.

Exposing the info should be easy.

I'm happy to rework the API. I haven't given it critical thought for 10 years! Any thoughts on issues or PRs are welcome.

Async/await has been on the TODO list but I've just never made a start as it seems like a lot of grunt work. Again, PRs welcome! Even partial ones where I could help with this big task.

I'm with a startup and have 3 young kids in lockdown. Not much free time.


claunia commented

2026-01-29 22:12:02 +00:00

@adamhathcock commented on GitHub (Jan 9, 2021):

I got some time and got curious so I dug into the code and spec and then reread your use case.

It looks like all you want is the name and uncompressed size. The name could be the "default" one, in which case GZipStream doesn't know the file name.

As for the size, it's basically the last 4 bytes of the file, assuming there's only one "member" in the file, as is the case for most GZ files. I don't think there's any value this library can add to that use case other than a static method on GZipArchive or something. I guess I could make the entries on the archive read the footers to load size and CRC data.
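
For anyone who only needs those trailer values, a standalone sketch could look like this. It is not part of SharpCompress: `GzipTrailer` and its `Read` method are hypothetical names, and it assumes a single-member file where, per RFC 1952, the last 8 bytes are the CRC-32 followed by ISIZE, both little-endian.

```c#
using System;
using System.Buffers.Binary;
using System.IO;

public static class GzipTrailer
{
    // Reads the CRC-32 and ISIZE fields from the last 8 bytes of a gzip file
    // without decompressing anything.
    public static (uint Crc32, uint SizeMod32) Read(string gzPath)
    {
        using var stream = File.OpenRead(gzPath);
        if (stream.Length < 18) // 10-byte header plus 8-byte trailer
            throw new InvalidDataException("Too short to be a gzip file.");

        stream.Seek(-8, SeekOrigin.End);
        var trailer = new byte[8];
        int read = 0;
        while (read < trailer.Length)
        {
            int n = stream.Read(trailer, read, trailer.Length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }

        uint crc32 = BinaryPrimitives.ReadUInt32LittleEndian(trailer.AsSpan(0, 4));
        uint size = BinaryPrimitives.ReadUInt32LittleEndian(trailer.AsSpan(4, 4));
        return (crc32, size);
    }
}
```

ISIZE is the uncompressed size modulo 2^32, so it wraps around for inputs of 4 GiB or more, and for a file with several concatenated members it only describes the last one.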

I started a branch here that just exposes LastModified but I don't see anything obvious to do: https://github.com/adamhathcock/sharpcompress/pull/560


claunia commented

2026-01-29 22:12:02 +00:00

@adamhathcock commented on GitHub (Jan 9, 2021):

Never mind, I take that back: if you use GZipArchive it will read the trailer for CRC and size info now.

I really need to refactor this lib for nullables and async.


claunia commented

2026-01-29 22:12:03 +00:00

@jzabroski commented on GitHub (Jan 9, 2021):

Do you use Resharper? Feel like I can blaze through refactoring it.

The thing I don't understand is, reading online, is there really such a thing as a GZipArchive? Isn't that just tar.gz? I didn't know what GZipFilePart was either.

I think the ReaderFactory would be nicer if it supported generic types. That would clean up ReaderOptions too since you could have options per format.


claunia commented

2026-01-29 22:12:03 +00:00

@adamhathcock commented on GitHub (Jan 9, 2021):

It's really that in sharpcompress:
Archive = random access
Reader = forward only streaming

I've kludged a common API over the different formats as best I could, for fun.

FilePart was a way to have the same file in an archive across multiple physical files. For example, Rar and Zip can divide an archive into multi-file archives. You might have a compressed file split over 2 or more physical archive files because of it. FilePart made sense at the time.

I use Rider so basically I use Resharper :)

I'll merge in my gzip changes soon and release, then prepare for breaking changes. I want more nullables anyway.



Reference: starred/sharpcompress#432