Can't extract big tbz file (408 MB) - C# BZip2 implementations won't extract everything #121

Open
opened 2026-01-29 22:06:53 +00:00 by claunia · 21 comments
Owner

Originally created by @agoretsky on GitHub (Sep 2, 2016).

Hello! I am trying to extract this file:
https://drive.google.com/open?id=0B_hcLdxD3U8Td2dXRGhSaHl0cDg
In the destination folder I get only 879 KB of output and it is not growing, but the program is still running.
It looks like the program is stuck, or the whole archive is being extracted in memory (but I can't see a process with growing memory).
My code:

        private void ExtractArchiveFileToTempFolder(string path, string feedDate)
        {
            var destination = Path.Combine(this.tempFolderPath, feedDate);
            if (!Directory.Exists(destination))
            {
                Directory.CreateDirectory(destination);
            }

            using (Stream stream = File.OpenRead(path))
            using (var reader = ReaderFactory.Open(stream)) // IReader is IDisposable
            {
                while (reader.MoveToNextEntry())
                {
                    if (!reader.Entry.IsDirectory)
                    {
                        Console.WriteLine(reader.Entry.Key);
                        reader.WriteEntryToDirectory(destination);
                    }
                }
            }
        }
claunia added the bug label 2026-01-29 22:06:53 +00:00
@adamhathcock commented on GitHub (Sep 27, 2016):

I got this file, and it doesn't seem WinRAR can open it either.

SharpCompress doesn't detect errors.

@erikcturner commented on GitHub (Sep 27, 2016):

It's true that WinRAR can't open the file. But 7zip opens the file just fine. I was able to extract a file called collection.tar. And then I was able to extract a file "itunes20160831\collection" that was 2,126,385,940 bytes long.

I suspect that this is the same error that I was having (see #165, submitted on Aug 28). The issue for me was that the ".tar.gz" file was truncated during transfer, which caused the SharpCompress decoder to hang (throwing an exception would definitely have been better).

EDIT: Upon further review, I don't think your issue is the same as mine in #165. The "collection.tar" file that 7zip extracted is a perfectly formatted TAR file.

@adamhathcock commented on GitHub (Sep 28, 2016):

It looks like my BZip2 implementation only extracts 879 KB. WinRAR doesn't like the file either; apparently 7Zip does.

@agoretsky commented on GitHub (Sep 28, 2016):

I am using WinRAR 5.40 beta; right-clicking on the archive file and choosing "Extract files" works fine (I downloaded this archive from the provided link too).

@adamhathcock commented on GitHub (Sep 28, 2016):

Actually, I see that too now. Just opening it in the UI doesn't look right.

I'll do some more investigation tomorrow

@adamhathcock commented on GitHub (Sep 29, 2016):

This is weird. All implementations of Bzip2 in C# I've tried will only extract 879k of the file. It just stops after that.

@plusulica commented on GitHub (Oct 10, 2016):

Hi Adam.
Do you have any news about your last comment? I'm getting the same weird behaviour: everything I try to decompress is truncated to 879 KB. Thanks in advance.

@adamhathcock commented on GitHub (Oct 10, 2016):

I don't have any updates, sorry. It's likely a bug that needs to be taken up with someone who implements bzip2.

@EdSF commented on GitHub (Oct 11, 2016):

> All implementations of Bzip2 in C# I've tried will only extract 879k of the file. It just stops after that.

Yup, seeing this myself just today (first time dealing with `bz2`). I haven't tried your lib, but this is the case (so far) for DotNetZip and SharpZipLib...

## Update

Odder: if you just use 7Zip to decompress and re-compress into BZip2, this oddity does not occur: DotNetZip etc. will extract the entire compressed file.

Reproduce:

  1. Extract the above using 7Zip (app)
  2. (Re)Archive the extracted file (collection.tar) using 7Zip (app) using BZip2 -> e.g. collection.tar.bz2
  3. Use a lib (e.g. DotNetZip, etc.) to programmatically extract the `collection.tar.bz2` file
  4. It works - the entire file (2 GB) is extracted.

BTW, just for clarity, it's really not just "this" file that exhibits this oddity. I hit the exact same issue with a "standard" XML file (large: 400 MB bzip2, around 7 GB uncompressed). Using the above steps "fixes" things - in other words, as long as 7Zip is used to archive to BZip2, there is no problem regardless of the size of the archive...

@adamhathcock commented on GitHub (May 29, 2017):

Looks like another project is having this issue https://github.com/duplicati/duplicati/issues/1735

@jensschmidbauer commented on GitHub (Oct 4, 2017):

Hi Adam,

I ran into the same problem trying to decompress the big bz2 files at http://planet.openstreetmap.org/ . I also tested several libraries, all with the same effect: exactly 900,000 bytes of the uncompressed file were retrieved. Since this is exactly the byte count of the biggest block size, I started further investigations.

These files (as well as agoretsky's file above) contain not just one but several "streams", each of them containing one 900k block. So the six "end of stream" magic bytes are not the end of the file. They are followed by the stream's 4-byte checksum and the 1-7 padding bit(s) which end the stream. Afterwards the next stream begins with another "BZh9" header.

Found a little documentation about this at chapter 3.4.8 of http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html#embed
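
Since the padding bits align each stream to a byte boundary, the concatenated-stream structure described above can be spot-checked by scanning for the stream signature. A minimal sketch (a hypothetical helper, not part of any library; the signature can also occur by chance inside compressed payload, so treat the count as a heuristic):

```csharp
using System;
using System.IO;

public class Bzip2StreamCounter
{
    // Counts occurrences of the bzip2 stream header: "BZh" followed by an
    // ASCII digit '1'-'9' (the 100k block-size multiplier). Concatenated
    // files contain one header per stream; a single-stream file has one.
    public static int CountStreams(string path)
    {
        byte[] data = File.ReadAllBytes(path);
        int count = 0;
        for (int i = 0; i + 3 < data.Length; i++)
        {
            if (data[i] == (byte)'B' && data[i + 1] == (byte)'Z' &&
                data[i + 2] == (byte)'h' &&
                data[i + 3] >= (byte)'1' && data[i + 3] <= (byte)'9')
            {
                count++;
            }
        }
        return count;
    }
}
```

A count greater than one on a file that only yields 900,000 bytes of output is a strong hint that the decoder stopped after the first stream.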

@jensschmidbauer commented on GitHub (Oct 10, 2017):

Hi again,

Okay, the solution (for my case) is already available. While debugging I found the optional decompressContacted flag in BZip2Stream (a typo? it's called decompressConcatenated in CBZip2InputStream, which makes more sense), and setting it to true solved my problem. What I asked myself is why it is not true by default, or whether there is a case where false makes sense at all.

Regarding agoretsky's problem: unfortunately I was not able to test the code above, because I somehow get a NotSupportedException from ReaderFactory.Open on the specified file. Nevertheless, looking at the source code I noticed that BZip2Stream is also created with decompressContacted = false there.

@adamhathcock commented on GitHub (Oct 10, 2017):

Hi @jensschmidbauer

I haven't implemented any of the BZip2 algorithm myself. Most implementations of the compression/decompression algorithms are from other open-source projects. I'm afraid I know very little about the algorithms, as I usually keep my head in the archive format space.

If you've got ideas or fixes, please contribute but I don't think I can offer much help with BZip2 itself.

@rpm61 commented on GitHub (Apr 10, 2018):

I am having this issue now. I have tried DotNetZip and SharpZipLib; both left me with the dreaded 879 KB file. I tried looking for the decompressConcatenated (or contacted?) flag in BZip2InputStream in both and found it in neither. Are you using a different lib, or where do I set that flag?

@jensschmidbauer commented on GitHub (Apr 11, 2018):

Hi @rpm61

it's the last parameter in [BZip2Stream](https://github.com/adamhathcock/sharpcompress/blob/master/src/SharpCompress/Compressors/BZip2/BZip2Stream.cs#L18)'s constructor.

The [ReaderFactory](https://github.com/adamhathcock/sharpcompress/blob/master/src/SharpCompress/Readers/ReaderFactory.cs#L61) (as used by agoretsky) does not provide this parameter, so the default of false is used. But it has to be true in our case.

Therefore, I did not use the ReaderFactory and used BZip2Stream directly.
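
To illustrate the approach (a sketch against SharpCompress's public API as I understand it, not a tested drop-in; check the constructor and option names against your version): wrap the file stream in a BZip2Stream with decompressConcatenated set to true, then hand the decompressed stream to TarReader for the inner tar.

```csharp
using System.IO;
using SharpCompress.Common;
using SharpCompress.Compressors;
using SharpCompress.Compressors.BZip2;
using SharpCompress.Readers;
using SharpCompress.Readers.Tar;

public class TbzExtractor
{
    public static void Extract(string archivePath, string destination)
    {
        Directory.CreateDirectory(destination); // no-op if it already exists

        using (Stream fileStream = File.OpenRead(archivePath))
        // true: keep decoding after each ~900k bzip2 stream ends
        // instead of stopping at the first one
        using (var bzip2 = new BZip2Stream(fileStream, CompressionMode.Decompress,
                                           decompressConcatenated: true))
        using (var reader = TarReader.Open(bzip2))
        {
            while (reader.MoveToNextEntry())
            {
                if (!reader.Entry.IsDirectory)
                {
                    reader.WriteEntryToDirectory(destination,
                        new ExtractionOptions { ExtractFullPath = true, Overwrite = true });
                }
            }
        }
    }
}
```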

@adamhathcock commented on GitHub (Apr 11, 2018):

@jensschmidbauer should there be an option for this or does your use case just not require the ReaderFactory?

I'm struggling to understand when the usage of the flag is appropriate.

@jensschmidbauer commented on GitHub (Apr 11, 2018):

@adamhathcock

As far as I understand, the flag is required when the data contains more than one block of data (each block holds at most 900,000 bytes uncompressed). But I cannot guarantee there would be no harm in other cases, though it could easily be tested on smaller files that worked with "false".

In the case of bzip2 (well, if you know it's one) the ReaderFactory is not required at all, because the file contains no archive entries, just the compressed data of a single file.
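
For that single-file case, a minimal sketch (hypothetical file names; BZip2Stream usage as in SharpCompress, not a verified snippet):

```csharp
using System.IO;
using SharpCompress.Compressors;
using SharpCompress.Compressors.BZip2;

public class Bz2Decompressor
{
    // Decompresses a plain .bz2 (no archive entries, just one compressed file)
    // straight to disk, without going through ReaderFactory.
    public static void Decompress(string bz2Path, string outputPath)
    {
        using (var input = File.OpenRead(bz2Path))
        using (var bzip2 = new BZip2Stream(input, CompressionMode.Decompress,
                                           decompressConcatenated: true))
        using (var output = File.Create(outputPath))
        {
            bzip2.CopyTo(output);
        }
    }
}
```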

@rpm61 commented on GitHub (Apr 11, 2018):

@jensschmidbauer

Thank you very much, that fixed me right up!

@gusarov commented on GitHub (Jan 8, 2021):

Hello, we are facing the same issue. Most tbz archives that we get from a vendor can't be extracted with BZip2 in this lib, but are perfectly extracted with the 7zip app. The files are truncated at exactly 900,000 bytes (which is one bz2 block). The first extracted block matches the data extracted by 7zip 100%.

@gusarov commented on GitHub (Jan 8, 2021):

FYI: `decompressConcatenated: true` solved this issue for me. The bug is 4 years old; maybe there was no such flag at that time...

@adamhathcock commented on GitHub (Jan 8, 2021):

Thanks for adding what worked for you

Reference: starred/sharpcompress#121