ZipReader/StreamingZipReaderFactory fails for archive entries which are uncompressed files in ZIP format #58
Originally created by @elgonzo on GitHub (Sep 3, 2015).
The issue has already been documented on CodePlex (see here). I added my comments and an example ZIP file exhibiting the problem to that CodePlex issue, to keep the information in one place.
A short summary:
If a ZIP archive contains an uncompressed file entry which happens to be a file in ZIP format (such as another ZIP archive, or an epub, docx, or xlsx document, etc.), then trying to extract the ZIP archive using ReaderFactory.Open(...) and IReader.OpenEntryStream(...) will fail.
I am using SharpCompress version 0.11.1, available on NuGet.
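A minimal sketch of the failing extraction loop (the file name is illustrative and the namespaces are as in recent SharpCompress releases; any ZIP that stores another ZIP-format file uncompressed triggers it):

    using System.IO;
    using SharpCompress.Readers;

    // Open the archive in streaming (forward-only) mode.
    using (Stream stream = File.OpenRead("outer.zip"))
    using (IReader reader = ReaderFactory.Open(stream))
    {
        while (reader.MoveToNextEntry())      // scans forward for the next header
        {
            if (reader.Entry.IsDirectory)
                continue;
            // Fails (or returns corrupt data) when an entry is an uncompressed
            // ZIP-format file: the scanner latches onto the nested archive's
            // local headers instead of the outer archive's next entry.
            using (Stream entryStream = reader.OpenEntryStream())
            using (Stream output = File.OpenWrite(Path.GetFileName(reader.Entry.Key)))
            {
                entryStream.CopyTo(output);
            }
        }
    }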
@adamhathcock commented on GitHub (Sep 4, 2015):
I had totally forgotten about that issue as it wasn't on GitHub. I'll look into it when I get some free time.
@adamhathcock commented on GitHub (Sep 5, 2015):
I looked at this today and this is the basic cause: https://github.com/adamhathcock/sharpcompress/blob/master/SharpCompress/Common/Zip/StreamingZipHeaderFactory.cs#L50
However, this was to fix an issue with zero-byte files in a zip file. I'm not sure I like that fix anyway, as it buffers in memory.
Need to think more.
@matt-richardson commented on GitHub (Nov 7, 2016):
Have you had a chance to think about the solution to this issue, @adamhathcock? We've got a customer (of Octopus Deploy) hitting extraction issues with a nested zip file.
The sample file they provided is available at http://help.octopusdeploy.com/discussions/problems/48574, if you want to take a look.
Thanks in advance, and thanks for a great library!
@adamhathcock commented on GitHub (Nov 7, 2016):
The basic issue is that for the Streaming Reader case, it's trying to read all entries. To do that, it's scanning the file for something that matches a ZIP file header.
There are possibly two ZIP headers to read: the primary header that's before the data (which may not have a length if it's created using a POST header) and the POST header. POST headers are written when ZIPs are created with a streaming write, to avoid having to know the entire file size in advance.
If the file is being skipped and I don't know the length (because the primary header has no length), then I have to just scan bytes until the next file header is found. Since there's no compression, I can find a nested ZIP instead.
I'm not sure this is a solvable problem, because in the POST-header, no-compression scenario a ZipReader never knows the current file's length. So it doesn't know there's a nested ZIP.
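For reference, the primary header signals the POST-header case (the ZIP data descriptor) via bit 3 of its general purpose flag; a minimal sketch of that check, with offsets per the ZIP appnote (the helper name is illustrative):

    using System.IO;

    // Reads the fixed part of one ZIP local file header and reports whether
    // the sizes are deferred to the data descriptor (the "POST header").
    static bool SizesAreInPostHeader(BinaryReader reader)
    {
        uint signature = reader.ReadUInt32();
        if (signature != 0x04034b50)          // local header magic "PK\x03\x04"
            throw new InvalidDataException("Not a local file header");
        reader.ReadUInt16();                  // version needed to extract
        ushort flags = reader.ReadUInt16();
        // Bit 3 set => crc-32 and both sizes are zero in this header and are
        // only written after the entry data, so a streaming reader can't skip.
        return (flags & 0x0008) != 0;
    }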
Two fixes for this:
ZipArchive uses the trailing Zip central directory and should always know the file lengths. If this isn't true, then that's an issue that can be fixed (in theory).
As for me, I'm not sure I'll have the time to fix this myself.
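The first fix describes what the random-access API already does; a sketch of the ZipArchive path, which reads lengths from the central directory and so sidesteps the header scan (file name illustrative):

    using System.IO;
    using SharpCompress.Archives.Zip;

    // Random-access path: entry sizes come from the trailing central
    // directory, so the reader never has to scan for headers.
    using (var archive = ZipArchive.Open("outer.zip"))
    {
        foreach (var entry in archive.Entries)
        {
            if (entry.IsDirectory)
                continue;
            using (Stream entryStream = entry.OpenEntryStream())
            using (Stream output = File.OpenWrite(Path.GetFileName(entry.Key)))
            {
                entryStream.CopyTo(output);
            }
        }
    }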
@matt-richardson commented on GitHub (Nov 9, 2016):
Thanks for the detailed info, @adamhathcock.
Turns out we didn't need to use the streaming ZipReader - ZipArchive works well for us.
@SmilingCatEntertainment commented on GitHub (Apr 16, 2018):
Any progress on solving this? I am also encountering this in common code used to extract 7z, rar, and zip files. If I don't use ZipReader, I get terrible performance on extracting 7z files. If I do use ZipReader, I hit this bug on zip files.
If I special-case the code to implement a workaround and handle zip and 7z differently, that defeats a lot of the purpose of choosing this library - being able to write code once to handle all 3 archive types.
@adamhathcock commented on GitHub (Apr 16, 2018):
As explained above, this isn’t fixable as it’s not really a bug.
When creating streaming zips with any library, you don't know the length for the header. Therefore, when extracting again with ZipReader, you can't skip using the length. To find entries, I have to scan for headers, which will find the nested archives.
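A sketch of why that scan misfires on stored nested archives: the inner file's bytes contain the very signature being scanned for (this is illustrative, not the library's actual scanner):

    using System.IO;

    // Naive forward scan for the next local-header signature "PK\x03\x04".
    // With a Stored (uncompressed) nested ZIP, the first match found is the
    // inner archive's first local header, not the outer archive's next entry.
    static long ScanForNextLocalHeader(Stream stream)
    {
        uint window = 0;
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            window = (window << 8) | (uint)b; // rolling 4-byte window
            if (window == 0x504b0304)         // "PK\x03\x04" in stream order
                return stream.Position - 4;
        }
        return -1;                            // no further header found
    }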
I’m open to suggestions.
In your case, I don't know what you mean about 7z files, because you can't ever access them in a forward-only way, which means there's no 7zReader - only 7zArchive with random access.
@SmilingCatEntertainment commented on GitHub (Apr 16, 2018):
For the 7z issue, I was referring to the situation described by this SO and this post, where the issue was fixed by using ExtractAllEntries(), which results in a reader being created. Perhaps I misspoke about the exact terminology, sorry.
Unfortunately, using ExtractAllEntries() on certain zip files in the wild results in the issue described here. This means that for my program to function correctly, I need to handle 7z files with ExtractAllEntries(), and zip files by iterating the Entries collection directly.
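For context, the 7z pattern from those posts looks roughly like this; ExtractAllEntries() yields a reader that decompresses the solid block in one sequential pass instead of restarting it per entry (file name illustrative):

    using System.IO;
    using SharpCompress.Archives.SevenZip;
    using SharpCompress.Readers;

    using (var archive = SevenZipArchive.Open("input.7z"))
    using (IReader reader = archive.ExtractAllEntries())
    {
        while (reader.MoveToNextEntry())
        {
            if (reader.Entry.IsDirectory)
                continue;
            // One forward pass over the solid block; each entry's data is
            // decompressed exactly once.
            using (Stream entryStream = reader.OpenEntryStream())
            using (Stream output = File.OpenWrite(Path.GetFileName(reader.Entry.Key)))
            {
                entryStream.CopyTo(output);
            }
        }
    }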
I'm not enough of an expert on archive file formats to be able to offer any advice, hence I am relying on your expertise. It's a shame this can't be addressed somehow, because other than this issue I've been very happy with this library for the last couple of years, especially with its flexibility in handling multiple archive formats.
@adamhathcock commented on GitHub (Apr 16, 2018):
7z is dumb: in your case, it's compressing the headers alongside the files into a single "entry" to achieve something more efficient (RAR does this as SOLID). This can be a good thing for many small files. However, it's not good if you're just trying to extract a single file, because it requires you to read and decompress everything.
Honestly, if you just used Zip (non-streaming) with LZMA 1 or 2, you'd get something similar and better for random access.
LZMA is what achieves the good compression, not the 7z format itself. It's a painful format to deal with. LZip with Tar is better and more straightforward as a streamed file.
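A sketch of that suggestion, assuming the SharpCompress build in use accepts CompressionType.LZMA for zip writing (names and paths are illustrative):

    using System.IO;
    using SharpCompress.Common;
    using SharpCompress.Writers;

    // Write an ordinary (non-streaming-read) zip whose entries use LZMA.
    using (Stream stream = File.OpenWrite("output.zip"))
    using (var writer = WriterFactory.Open(stream, ArchiveType.Zip,
               new WriterOptions(CompressionType.LZMA)))
    using (Stream input = File.OpenRead("data.bin"))
    {
        writer.Write("data.bin", input);      // (entry key, source stream)
    }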
Sorry to continue this discussion on this issue. Please create a new issue about understanding 7Z better or something.
@SmilingCatEntertainment commented on GitHub (Apr 16, 2018):
I realize that both issues could be resolved if I had the luxury of creating the input archives myself. However, in my case, it needs to be able to handle archives created in the wild. 7z can die a fiery death for all I care, but people insist on using it to distribute stuff. Shutting up about 7z now, as it's off-topic except for how it relates to how I happened across the zip issue.
@andewall1175 commented on GitHub (Mar 19, 2019):
Hi!
This seems to be related to an issue I am facing.
How to reproduce:
Workaround when using SharpCompress: filter on Reader.Entry.Key using a regex and only add the appropriate entries to the list (sketch below). However, this assumes there is actually a file-naming standard, which might not be the case; that is now a problem, as there is seemingly no way of telling whether an entry comes from the outer or the inner zip archive.
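In sketch form (the pattern and file names are illustrative; it only helps while inner-archive entry names are distinguishable):

    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;
    using SharpCompress.Readers;

    // Keep only entries whose keys match the expected outer-archive naming
    // scheme; entries surfaced from the nested zip fail the pattern.
    var wanted = new Regex(@"^report_\d{4}\.xml$");   // illustrative pattern
    var keys = new List<string>();
    using (Stream stream = File.OpenRead("outer.zip"))
    using (IReader reader = ReaderFactory.Open(stream))
    {
        while (reader.MoveToNextEntry())
        {
            if (wanted.IsMatch(reader.Entry.Key))
                keys.Add(reader.Entry.Key);
        }
    }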
As my software gets ZIP archives from an external system, I can never tell whether the archive is going to use LZMA, Deflate, or be uncompressed. One option would have been to just use LZMA, but that is not possible in my case.
Would it be possible to tell MoveToNextEntry to include files down to a certain sub-level: 0 meaning only go through top-level entries, 1 meaning going one step further, and so on?
@adamhathcock commented on GitHub (Mar 19, 2019):
The basic issue is still:
@Erior commented on GitHub (Jan 14, 2020):
Fixes the issue we had at least, and does not break any tests:
StreamingZipHeaderFactory.cs

    // entry could be zero bytes so we need to know that.
    if (header.ZipHeaderType == ZipHeaderType.LocalEntry)
    {
        // Do we have a size?
        if (((LocalEntryHeader)header).CompressedSize > 0)
        {
            header.HasData = true;
        }
        else // Try to detect the next header
        {
            // Peek at the next four bytes; if they form a known header
            // signature, the current entry has no data.
            bool isRecording = rewindableStream.IsRecording;
            if (!isRecording)
            {
                rewindableStream.StartRecording();
            }
            uint nextHeaderBytes = reader.ReadUInt32();
            header.HasData = !IsHeader(nextHeaderBytes);
            rewindableStream.Rewind(!isRecording);
        }
    }

@adamhathcock commented on GitHub (Jan 15, 2020):
Any chance for a pull request?
@Erior commented on GitHub (Jan 16, 2020):
Pull request #500 created - first time for everything; I hope I didn't mess things up too much.
@adamhathcock commented on GitHub (Jan 17, 2020):
Thanks!