ZipReader/StreamingZipReaderFactory fails for archive entries which are uncompressed files in ZIP format #58
Originally created by @elgonzo on GitHub (Sep 3, 2015).
The issue has already been documented on CodePlex (see here). I added my comments and an example ZIP file exhibiting the problem to that CodePlex issue, to keep the information in one place.
A short summary:
If a ZIP archive contains an uncompressed file entry which happens to be a file in ZIP format (such as another ZIP archive, or an epub, docx, or xlsx document, etc.), then trying to extract the ZIP archive using ReaderFactory.Open(...) and IReader.OpenEntryStream(...) will fail.
I am using SharpCompress version 0.11.1, available on NuGet.
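A minimal sketch of the failing extraction loop (the file name is illustrative and the namespaces are as in recent SharpCompress releases; any ZIP that stores another ZIP-format file uncompressed triggers it):

    using System.IO;
    using SharpCompress.Readers;

    // Open the archive in streaming (forward-only) mode.
    using (Stream stream = File.OpenRead("outer.zip"))
    using (IReader reader = ReaderFactory.Open(stream))
    {
        while (reader.MoveToNextEntry())      // scans forward for the next header
        {
            if (reader.Entry.IsDirectory)
                continue;
            // Fails (or returns corrupt data) when an entry is an uncompressed
            // ZIP-format file: the scanner latches onto the nested archive's
            // local headers instead of the outer archive's next entry.
            using (Stream entryStream = reader.OpenEntryStream())
            using (Stream output = File.OpenWrite(Path.GetFileName(reader.Entry.Key)))
            {
                entryStream.CopyTo(output);
            }
        }
    }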
@adamhathcock commented on GitHub (Sep 4, 2015):
I had totally forgotten about that issue as it wasn't on GitHub. I'll look into it when I get some free time.
@adamhathcock commented on GitHub (Sep 5, 2015):
I looked at this today and this is the basic cause: https://github.com/adamhathcock/sharpcompress/blob/master/SharpCompress/Common/Zip/StreamingZipHeaderFactory.cs#L50
However, this was to fix an issue with zero-byte files in a zip file. I'm not sure I like that fix anyway, as it buffers in memory.
Need to think more.
@matt-richardson commented on GitHub (Nov 7, 2016):
Have you had a chance to think about the solution to this issue, @adamhathcock? We've got a customer (of Octopus Deploy) hitting extraction issues with a nested zip file.
The sample file they provided is available at http://help.octopusdeploy.com/discussions/problems/48574, if you want to take a look.
Thanks in advance, and thanks for a great library!
@adamhathcock commented on GitHub (Nov 7, 2016):
The basic issue is that for the Streaming Reader case, it's trying to read all entries. To do that, it's scanning the file for something that matches a ZIP file header.
There are possibly two ZIP headers to read: the primary header that's before the data (which may not have a length if it's created using a POST header) and the POST header. POST headers are written when ZIPs are created with a streaming write, to avoid having to know the entire file size in advance.
If the file is being skipped and I don't know the length (because the primary header has no length), then I have to just scan bytes until the next file header is found. Since there's no compression, I can find a nested ZIP instead.
I'm not sure this is a solvable problem, because in the POST-header, no-compression scenario a ZipReader never knows the current file's length. So it doesn't know there's a nested ZIP.
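For reference, the primary header signals the POST-header case (the ZIP data descriptor) via bit 3 of its general purpose flag; a minimal sketch of that check, with offsets per the ZIP appnote (the helper name is illustrative):

    using System.IO;

    // Reads the fixed part of one ZIP local file header and reports whether
    // the sizes are deferred to the data descriptor (the "POST header").
    static bool SizesAreInPostHeader(BinaryReader reader)
    {
        uint signature = reader.ReadUInt32();
        if (signature != 0x04034b50)          // local header magic "PK\x03\x04"
            throw new InvalidDataException("Not a local file header");
        reader.ReadUInt16();                  // version needed to extract
        ushort flags = reader.ReadUInt16();
        // Bit 3 set => crc-32 and both sizes are zero in this header and are
        // only written after the entry data, so a streaming reader can't skip.
        return (flags & 0x0008) != 0;
    }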
Two fixes for this:
ZipArchive uses the trailing Zip central directory and should always know the file lengths. If this isn't true, then that's an issue that can be fixed (in theory).
As for me, I'm not sure I'll have the time to fix this myself.
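The first fix describes what the random-access API already does; a sketch of the ZipArchive path, which reads lengths from the central directory and so sidesteps the header scan (file name illustrative):

    using System.IO;
    using SharpCompress.Archives.Zip;

    // Random-access path: entry sizes come from the trailing central
    // directory, so the reader never has to scan for headers.
    using (var archive = ZipArchive.Open("outer.zip"))
    {
        foreach (var entry in archive.Entries)
        {
            if (entry.IsDirectory)
                continue;
            using (Stream entryStream = entry.OpenEntryStream())
            using (Stream output = File.OpenWrite(Path.GetFileName(entry.Key)))
            {
                entryStream.CopyTo(output);
            }
        }
    }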
@matt-richardson commented on GitHub (Nov 9, 2016):
Thanks for the detailed info, @adamhathcock.
Turns out we didn't need to use the streaming ZipReader - ZipArchive works well for us.
@SmilingCatEntertainment commented on GitHub (Apr 16, 2018):
Any progress on solving this? I am also encountering this in common code used to extract 7z, rar, and zip files. If I don't use ZipReader, I get terrible performance on extracting 7z files. If I do use ZipReader, I hit this bug on zip files.
If I special-case the code to implement a workaround and handle zip and 7z differently, that defeats a lot of the purpose of choosing this library - being able to write code once to handle all 3 archive types.
@adamhathcock commented on GitHub (Apr 16, 2018):
As explained above, this isn’t fixable as it’s not really a bug.
When creating streaming zips with any library, you don't know the length for the header. Therefore, when extracting again with ZipReader, you can't skip using the length. To find entries, I have to scan for headers, which will find the nested archives.
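A sketch of why that scan misfires on stored nested archives: the inner file's bytes contain the very signature being scanned for (this is illustrative, not the library's actual scanner):

    using System.IO;

    // Naive forward scan for the next local-header signature "PK\x03\x04".
    // With a Stored (uncompressed) nested ZIP, the first match found is the
    // inner archive's first local header, not the outer archive's next entry.
    static long ScanForNextLocalHeader(Stream stream)
    {
        uint window = 0;
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            window = (window << 8) | (uint)b; // rolling 4-byte window
            if (window == 0x504b0304)         // "PK\x03\x04" in stream order
                return stream.Position - 4;
        }
        return -1;                            // no further header found
    }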
I’m open to suggestions.
In your case, I don't know what you mean about 7z files, because you can't ever access them in a forward-only way, which means there's no 7zReader - only 7zArchive with random access.
@SmilingCatEntertainment commented on GitHub (Apr 16, 2018):
For the 7z issue, I was referring to the situation described by this SO and this post, where the issue was fixed by using ExtractAllEntries(), which results in a reader being created. Perhaps I misspoke about the exact terminology, sorry.
Unfortunately, using ExtractAllEntries() on certain zip files in the wild results in the issue described here. This means that for my program to function correctly, I need to handle 7z files with ExtractAllEntries(), and zip files by iterating the Entries collection directly.
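For context, the 7z pattern from those posts looks roughly like this; ExtractAllEntries() yields a reader that decompresses the solid block in one sequential pass instead of restarting it per entry (file name illustrative):

    using System.IO;
    using SharpCompress.Archives.SevenZip;
    using SharpCompress.Readers;

    using (var archive = SevenZipArchive.Open("input.7z"))
    using (IReader reader = archive.ExtractAllEntries())
    {
        while (reader.MoveToNextEntry())
        {
            if (reader.Entry.IsDirectory)
                continue;
            // One forward pass over the solid block; each entry's data is
            // decompressed exactly once.
            using (Stream entryStream = reader.OpenEntryStream())
            using (Stream output = File.OpenWrite(Path.GetFileName(reader.Entry.Key)))
            {
                entryStream.CopyTo(output);
            }
        }
    }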
I'm not enough of an expert on archive file formats to be able to offer any advice, hence I am relying on your expertise. It's a shame this can't be addressed somehow, because other than this issue I've been very happy with this library for the last couple of years, especially with its flexibility in handling multiple archive formats.
@adamhathcock commented on GitHub (Apr 16, 2018):
7z is dumb: in your case, it's compressing the headers alongside the files into a single "entry" to achieve something more efficient (RAR does this as SOLID). This can be a good thing for many small files. However, it's not good if you're just trying to extract a single file, because it requires you to read and decompress everything.
Honestly, if you just used Zip (non-streaming) with LZMA 1 or 2, you'd get something similar and better for random access.
LZMA is what achieves the good compression, not the 7z format itself. It's a painful format to deal with. LZip with Tar is better and more straightforward as a streamed file.
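A sketch of that suggestion, assuming the SharpCompress build in use accepts CompressionType.LZMA for zip writing (names and paths are illustrative):

    using System.IO;
    using SharpCompress.Common;
    using SharpCompress.Writers;

    // Write an ordinary (non-streaming-read) zip whose entries use LZMA.
    using (Stream stream = File.OpenWrite("output.zip"))
    using (var writer = WriterFactory.Open(stream, ArchiveType.Zip,
               new WriterOptions(CompressionType.LZMA)))
    using (Stream input = File.OpenRead("data.bin"))
    {
        writer.Write("data.bin", input);      // (entry key, source stream)
    }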
Sorry to continue this discussion on this issue. Please create a new issue about understanding 7Z better or something.
@SmilingCatEntertainment commented on GitHub (Apr 16, 2018):
I realize that both issues could be resolved if I had the luxury of creating the input archives myself. However, in my case, it needs to be able to handle archives created in the wild. 7z can die a fiery death for all I care, but people insist on using it to distribute stuff. Shutting up about 7z now, as it's off-topic except for how it relates to how I happened across the zip issue.
@andewall1175 commented on GitHub (Mar 19, 2019):
Hi!
This seems to be related to an issue I am facing.
How to reproduce:
Workaround when using SharpCompress: filter on Reader.Entry.Key using a regex and only add the appropriate entries to the list (sketch below). However, this assumes there is actually a file-naming standard, which might not be the case; that is now a problem, as there is seemingly no way of telling whether an entry comes from the outer or the inner zip archive.
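In sketch form (the pattern and file names are illustrative; it only helps while inner-archive entry names are distinguishable):

    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;
    using SharpCompress.Readers;

    // Keep only entries whose keys match the expected outer-archive naming
    // scheme; entries surfaced from the nested zip fail the pattern.
    var wanted = new Regex(@"^report_\d{4}\.xml$");   // illustrative pattern
    var keys = new List<string>();
    using (Stream stream = File.OpenRead("outer.zip"))
    using (IReader reader = ReaderFactory.Open(stream))
    {
        while (reader.MoveToNextEntry())
        {
            if (wanted.IsMatch(reader.Entry.Key))
                keys.Add(reader.Entry.Key);
        }
    }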
As my software gets ZIP archives from an external system, I can never tell whether the archive is going to use LZMA, Deflate, or be uncompressed. One option would have been to just use LZMA, but that is not possible in my case.
Would it be possible to tell MoveToNextEntry to include files down to a certain sub-level: 0 meaning only go through top-level entries, 1 meaning going one step further, and so on?
@adamhathcock commented on GitHub (Mar 19, 2019):
The basic issue is still:
@Erior commented on GitHub (Jan 14, 2020):
Fixes the issue we had at least, and does not break any tests:
StreamingZipHeaderFactory.cs

    // entry could be zero bytes so we need to know that.
    if (header.ZipHeaderType == ZipHeaderType.LocalEntry)
    {
        // Do we have a size?
        if (((LocalEntryHeader)header).CompressedSize > 0)
        {
            header.HasData = true;
        }
        else // Try to detect the next header
        {
            // Peek at the next four bytes; if they form a known header
            // signature, the current entry has no data.
            bool isRecording = rewindableStream.IsRecording;
            if (!isRecording)
            {
                rewindableStream.StartRecording();
            }
            uint nextHeaderBytes = reader.ReadUInt32();
            header.HasData = !IsHeader(nextHeaderBytes);
            rewindableStream.Rewind(!isRecording);
        }
    }

@adamhathcock commented on GitHub (Jan 15, 2020):
Any chance for a pull request?
@Erior commented on GitHub (Jan 16, 2020):
Pull request #500 created - first time for everything; I hope I didn't mess things up too much.
@adamhathcock commented on GitHub (Jan 17, 2020):
Thanks!