TarArchive fails to find subsequent entries if you open an entry stream during entry iteration #405

Closed
opened 2026-01-29 22:11:22 +00:00 by claunia · 9 comments
Owner

Originally created by @DannyBoyk on GitHub (Jul 1, 2020).

We usually use TarReader, not TarArchive, so we hadn't seen this before. When using TarArchive (directly or via ArchiveFactory), if you open an entry stream while iterating over the entries, TarArchive fails to find the next header and silently reports that there are no more entries. The following code demonstrates this with any TAR archive containing more than one file.

using (var streamReader = File.OpenRead(args[0]))
using (var archiveFactory = ArchiveFactory.Open(streamReader))
{
    foreach (var entry in archiveFactory.Entries)
    {
        if (entry.IsDirectory)
        {
            continue;
        }

        var outputPath = Path.Combine(@"D:\temp\tar", entry.Key);

        Directory.CreateDirectory(Path.GetDirectoryName(outputPath));

        using (var tarEntryStream = entry.OpenEntryStream())
        using (var outputStream = File.OpenWrite(outputPath))
        {
            tarEntryStream.CopyTo(outputStream);
        }
    }
}

I stepped through the source and TarHeaderFactory.ReadHeader fails silently for the second entry in the archive. Without debugging further, it appears the archive is somehow losing its position as a result of opening the entry stream. If you don't open the stream, all the entries are iterated over just fine.
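
For comparison, the TarReader-based loop we normally use handles this fine. A minimal sketch (assuming ReaderFactory.Open and the IReader API, with the same example output path as above):

using (var stream = File.OpenRead(args[0]))
using (var reader = ReaderFactory.Open(stream))
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
        {
            continue;
        }

        var outputPath = Path.Combine(@"D:\temp\tar", reader.Entry.Key);

        Directory.CreateDirectory(Path.GetDirectoryName(outputPath));

        // The reader hands out a sub-stream per entry, so the underlying
        // archive position stays consistent between entries.
        using (var entryStream = reader.OpenEntryStream())
        using (var outputStream = File.OpenWrite(outputPath))
        {
            entryStream.CopyTo(outputStream);
        }
    }
}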

claunia added the bug label 2026-01-29 22:11:22 +00:00
Author
Owner

@DannyBoyk commented on GitHub (Jul 16, 2020):

Stepping through further, it appears that the EntryStream (which just wraps the base stream of the whole file) is left right after the last valid byte of the entry, but it does not consume the padding bytes. So when TarHeader.Read goes to read the next header, the next "block" it reads is not actually aligned to a block boundary, and it concludes that no valid header was found. The fundamental issue seems to be that OpenEntryStream doesn't read through to the end of the entry's last block.
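
To put concrete numbers on it (hypothetical offsets, assuming a 100-byte first entry whose data starts at byte 512):

    header block : bytes    0 - 511
    entry data   : bytes  512 - 611   (100 bytes)
    padding      : bytes  612 - 1023  (412 zero bytes)
    next header  : byte  1024

After the entry stream has been fully read, the base stream sits at byte 612, so the next 512-byte read covers bytes 612-1123 instead of starting at 1024, and no valid header is found.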

Author
Owner

@DannyBoyk commented on GitHub (Jul 16, 2020):

More information. In TarHeaderFactory.ReadHeader, you set the base stream to get up to the next header:

                                header.DataStartPosition = reader.BaseStream.Position;

                                //skip to nearest 512
                                reader.BaseStream.Position += PadTo512(header.Size);

But when someone uses OpenEntryStream, reading from it moves BaseStream.Position as well, since they are one and the same stream.

When using TarReader, a standalone TarReadOnlySubStream is created and assigned to Header.PackedStream, which is why TarReader works and TarArchive does not.
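
For context, PadTo512 presumably rounds the entry size up to the next 512-byte block boundary, something along these lines (a guess at the intent, not the actual implementation):

    // Hypothetical sketch: the entry's on-disk footprint rounded up to a
    // whole number of 512-byte TAR blocks.
    private static long PadTo512(long size)
    {
        long remainder = size % 512;
        return remainder == 0 ? size : size + (512 - remainder);
    }

Since the enumeration relies on BaseStream.Position sitting exactly at the next header, anything that moves that position in between (such as reading the entry stream) throws the scan off.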

Author
Owner

@DannyBoyk commented on GitHub (Jul 16, 2020):

Actually, the final piece of the puzzle is that TarReadOnlySubStream advances the position in its Dispose method.

            if (disposing)
            {
                long skipBytes = _amountRead % 512;
                if (skipBytes == 0)
                {
                    return;
                }
                skipBytes = 512 - skipBytes;
                if (skipBytes == 0)
                {
                    return;
                }
                var buffer = new byte[skipBytes];
                Stream.ReadFully(buffer);
            }

It also still points at the one and only BaseStream, but it takes care of the padding bytes. The path taken by TarArchive does not, since it never creates a sub-stream that skips the padding bytes.

Author
Owner

@adamhathcock commented on GitHub (Jul 17, 2020):

Thanks for tracking this down. Any chance of a PR and test? I'm surprised my own tests don't cover it.

Author
Owner

@DannyBoyk commented on GitHub (Jul 17, 2020):

@adamhathcock , I think I found a quick and easy solution, but I was thinking about a different solution that might be better.

The quick and easy solution is to change TarFilePart.GetCompressedStream from this:

        internal override Stream GetCompressedStream()
        {
            if (_seekableStream != null)
            {
                _seekableStream.Position = Header.DataStartPosition.Value;
                return new ReadOnlySubStream(_seekableStream, Header.Size);
            }
            return Header.PackedStream;
        }

to this:

        internal override Stream GetCompressedStream()
        {
            if (_seekableStream != null)
            {
                _seekableStream.Position = Header.DataStartPosition.Value;
                return new TarReadOnlySubStream(_seekableStream, Header.Size);
            }
            return Header.PackedStream;
        }

Namely, changing from ReadOnlySubStream to TarReadOnlySubStream. That fixed the problem, but I have the concerns outlined below.

My slightly more complicated proposal would be to track BaseStream.Position separately from the sub-stream position. As a thought experiment, consider what happens with TarReadOnlySubStream if the caller doesn't read all the blocks of the entry, e.g. they open the entry Stream but only read a few bytes for some reason. The Dispose method only advances to the next block boundary, so the stream might not yet be at the next header.

I think BaseStream.Position needs to be managed separately from the sub-stream position, somehow.
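
Something like this is what I have in mind for the sub-stream's dispose path (a rough sketch only; _entrySize is an invented field name here, captured from the header when the sub-stream is created, and the real fix may look different):

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            // Total bytes the entry occupies on disk, rounded up to a whole
            // 512-byte block (same math as the PadTo512 sketch above).
            long paddedSize = _entrySize % 512 == 0
                ? _entrySize
                : _entrySize + (512 - _entrySize % 512);

            // Skip whatever the caller did not read, including the padding,
            // so the base stream always ends up at the next header.
            long remaining = paddedSize - _amountRead;
            var buffer = new byte[4096];
            while (remaining > 0)
            {
                int read = Stream.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read <= 0)
                {
                    break;
                }
                remaining -= read;
            }
        }
        base.Dispose(disposing);
    }

With that, disposing the entry stream, fully read or not, would leave the base stream at the next header.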

Author
Owner

@DannyBoyk commented on GitHub (Jul 17, 2020):

I just realized TarReadOnlySubStream has CanSeek set to false. I'm not sure my solution is OK, since the change sits in the branch that requires _seekableStream to be non-null.

Author
Owner

@DannyBoyk commented on GitHub (Jul 17, 2020):

Oh, ReadOnlySubStream also has CanSeek set to false. Never mind then; my solution should be valid so long as the user reads up to the last block of the entry. Otherwise, the BaseStream.Position will still be in the wrong place.

Author
Owner

@DannyBoyk commented on GitHub (Jul 20, 2020):

I think I have a decent fix in place. I'll put together a PR and a new test case. The existing test cases don't fail because all of the TarArchive tests read through all of the entries before attempting to read any entry streams. That happens in IArchiveEntryExtensions.WriteTo, if not earlier. I wasn't even aware of all these helper methods for writing files out. The consequence of using them, though, is either a lot of seeking within the file or loading many things into memory.
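
For reference, the helper-based pattern the existing tests exercise looks roughly like this (a usage sketch; I'm assuming the current shape of WriteToDirectory and ExtractionOptions):

using (var stream = File.OpenRead(args[0]))
using (var archive = ArchiveFactory.Open(stream))
{
    foreach (var entry in archive.Entries)
    {
        if (entry.IsDirectory)
        {
            continue;
        }

        // Extraction goes through IArchiveEntryExtensions.WriteTo internally.
        entry.WriteToDirectory(@"D:\temp\tar", new ExtractionOptions
        {
            ExtractFullPath = true,
            Overwrite = true
        });
    }
}

This sidesteps the bug only because the helper forces the full entry list to be read first, at the cost of the extra seeking mentioned above.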

Author
Owner

@adamhathcock commented on GitHub (Jul 26, 2020):

Released https://www.nuget.org/packages/sharpcompress/0.26.0
