TarArchive fails to find subsequent entries if you open an entry stream during entry iteration #405

Closed
opened 2026-01-29 22:11:22 +00:00 by claunia · 9 comments
Owner

Originally created by @DannyBoyk on GitHub (Jul 1, 2020).

We usually use TarReader, not TarArchive, so we hadn't seen this before. When using TarArchive (directly or via ArchiveFactory), if you open an entry stream while iterating over the entries, TarArchive fails to find the next header and silently reports that there are no more entries. The following code demonstrates this with any TAR archive containing more than one file.

using (var streamReader = File.OpenRead(args[0]))
using (var archiveFactory = ArchiveFactory.Open(streamReader))
{
    foreach (var entry in archiveFactory.Entries)
    {
        if (entry.IsDirectory)
        {
            continue;
        }

        var outputPath = Path.Combine(@"D:\temp\tar", entry.Key);

        Directory.CreateDirectory(Path.GetDirectoryName(outputPath));

        using (var tarEntryStream = entry.OpenEntryStream())
        using (var outputStream = File.OpenWrite(outputPath))
        {
            tarEntryStream.CopyTo(outputStream);
        }
    }
}

I stepped through the source and TarHeaderFactory.ReadHeader fails silently for the second entry in the archive. Without debugging further, it appears the archive is somehow losing its position as a result of opening the entry stream. If you don't open the stream, all the entries are iterated over just fine.
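
For comparison, the TarReader-based loop we normally use handles this fine. A minimal sketch (assuming ReaderFactory.Open and the IReader API, with the same example output path as above):

using (var stream = File.OpenRead(args[0]))
using (var reader = ReaderFactory.Open(stream))
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
        {
            continue;
        }

        var outputPath = Path.Combine(@"D:\temp\tar", reader.Entry.Key);

        Directory.CreateDirectory(Path.GetDirectoryName(outputPath));

        // The reader hands out a sub-stream per entry, so the underlying
        // archive position stays consistent between entries.
        using (var entryStream = reader.OpenEntryStream())
        using (var outputStream = File.OpenWrite(outputPath))
        {
            entryStream.CopyTo(outputStream);
        }
    }
}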

claunia added the bug label 2026-01-29 22:11:22 +00:00
Author
Owner

@DannyBoyk commented on GitHub (Jul 16, 2020):

Stepping through further, it appears that the EntryStream (which just wraps the base stream of the whole file) is left right after the last valid byte of the entry, but it does not consume the padding bytes. So when TarHeader.Read goes to read the next header, the next "block" it reads is not actually aligned to a block boundary, and it concludes that no valid header was found. The fundamental issue seems to be that OpenEntryStream doesn't read through to the end of the entry's last block.
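
To put concrete numbers on it (hypothetical offsets, assuming a 100-byte first entry whose data starts at byte 512):

    header block : bytes    0 - 511
    entry data   : bytes  512 - 611   (100 bytes)
    padding      : bytes  612 - 1023  (412 zero bytes)
    next header  : byte  1024

After the entry stream has been fully read, the base stream sits at byte 612, so the next 512-byte read covers bytes 612-1123 instead of starting at 1024, and no valid header is found.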

Author
Owner

@DannyBoyk commented on GitHub (Jul 16, 2020):

More information. In TarHeaderFactory.ReadHeader, you set the base stream to get up to the next header:

                                header.DataStartPosition = reader.BaseStream.Position;

                                //skip to nearest 512
                                reader.BaseStream.Position += PadTo512(header.Size);

But when someone uses OpenEntryStream, reading from it moves BaseStream.Position as well, since they are one and the same stream.

When using TarReader, a standalone TarReadOnlySubStream is created and assigned to Header.PackedStream, which is why TarReader works and TarArchive does not.
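
For context, PadTo512 presumably rounds the entry size up to the next 512-byte block boundary, something along these lines (a guess at the intent, not the actual implementation):

    // Hypothetical sketch: the entry's on-disk footprint rounded up to a
    // whole number of 512-byte TAR blocks.
    private static long PadTo512(long size)
    {
        long remainder = size % 512;
        return remainder == 0 ? size : size + (512 - remainder);
    }

Since the enumeration relies on BaseStream.Position sitting exactly at the next header, anything that moves that position in between (such as reading the entry stream) throws the scan off.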

Author
Owner

@DannyBoyk commented on GitHub (Jul 16, 2020):

Actually, the final piece of the puzzle is that TarReadOnlySubStream advances the position in its Dispose method.

            if (disposing)
            {
                long skipBytes = _amountRead % 512;
                if (skipBytes == 0)
                {
                    return;
                }
                skipBytes = 512 - skipBytes;
                if (skipBytes == 0)
                {
                    return;
                }
                var buffer = new byte[skipBytes];
                Stream.ReadFully(buffer);
            }

It also still points at the one and only BaseStream, but it takes care of the padding bytes. The path taken by TarArchive does not, since it never creates a sub-stream that skips the padding bytes.

Author
Owner

@adamhathcock commented on GitHub (Jul 17, 2020):

Thanks for tracking this down. Any chance of a PR and test? I'm surprised my own tests don't cover it.

Author
Owner

@DannyBoyk commented on GitHub (Jul 17, 2020):

@adamhathcock , I think I found a quick and easy solution, but I was thinking about a different solution that might be better.

The quick and easy solution is to change TarFilePart.GetCompressedStream from this:

        internal override Stream GetCompressedStream()
        {
            if (_seekableStream != null)
            {
                _seekableStream.Position = Header.DataStartPosition.Value;
                return new ReadOnlySubStream(_seekableStream, Header.Size);
            }
            return Header.PackedStream;
        }

to this:

        internal override Stream GetCompressedStream()
        {
            if (_seekableStream != null)
            {
                _seekableStream.Position = Header.DataStartPosition.Value;
                return new TarReadOnlySubStream(_seekableStream, Header.Size);
            }
            return Header.PackedStream;
        }

Namely, changing from ReadOnlySubStream to TarReadOnlySubStream. That fixed the problem, but I have the concerns outlined below.

My slightly more complicated proposal would be to track BaseStream.Position separately from the sub-stream position. As a thought experiment, consider what happens with TarReadOnlySubStream if the caller doesn't read all the blocks of the entry, e.g. they open the entry Stream but only read a few bytes for some reason. The Dispose method only advances to the next block boundary, so the stream might not yet be at the next header.

I think BaseStream.Position needs to be managed separately from the sub-stream position, somehow.
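
Something like this is what I have in mind for the sub-stream's dispose path (a rough sketch only; _entrySize is an invented field name here, captured from the header when the sub-stream is created, and the real fix may look different):

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            // Total bytes the entry occupies on disk, rounded up to a whole
            // 512-byte block (same math as the PadTo512 sketch above).
            long paddedSize = _entrySize % 512 == 0
                ? _entrySize
                : _entrySize + (512 - _entrySize % 512);

            // Skip whatever the caller did not read, including the padding,
            // so the base stream always ends up at the next header.
            long remaining = paddedSize - _amountRead;
            var buffer = new byte[4096];
            while (remaining > 0)
            {
                int read = Stream.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read <= 0)
                {
                    break;
                }
                remaining -= read;
            }
        }
        base.Dispose(disposing);
    }

With that, disposing the entry stream, fully read or not, would leave the base stream at the next header.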

Author
Owner

@DannyBoyk commented on GitHub (Jul 17, 2020):

I just realized TarReadOnlySubStream has CanSeek set to false. I'm not sure my solution is OK, since the change sits in the branch that requires _seekableStream to be non-null.

Author
Owner

@DannyBoyk commented on GitHub (Jul 17, 2020):

Oh, ReadOnlySubStream also has CanSeek set to false. Never mind then; my solution should be valid so long as the user reads up to the last block of the entry. Otherwise, the BaseStream.Position will still be in the wrong place.

Author
Owner

@DannyBoyk commented on GitHub (Jul 20, 2020):

I think I have a decent fix in place. I'll put together a PR and a new test case. The existing test cases don't fail because all of the TarArchive tests read through all of the entries before attempting to read any entry streams. That happens in IArchiveEntryExtensions.WriteTo, if not earlier. I wasn't even aware of all these helper methods for writing files out. The consequence of using them, though, is either a lot of seeking within the file or loading many things into memory.
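
For reference, the helper-based pattern the existing tests exercise looks roughly like this (a usage sketch; I'm assuming the current shape of WriteToDirectory and ExtractionOptions):

using (var stream = File.OpenRead(args[0]))
using (var archive = ArchiveFactory.Open(stream))
{
    foreach (var entry in archive.Entries)
    {
        if (entry.IsDirectory)
        {
            continue;
        }

        // Extraction goes through IArchiveEntryExtensions.WriteTo internally.
        entry.WriteToDirectory(@"D:\temp\tar", new ExtractionOptions
        {
            ExtractFullPath = true,
            Overwrite = true
        });
    }
}

This sidesteps the bug only because the helper forces the full entry list to be read first, at the cost of the extra seeking mentioned above.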

Author
Owner

@adamhathcock commented on GitHub (Jul 26, 2020):

Released https://www.nuget.org/packages/sharpcompress/0.26.0
