[PR #1024] [MERGED] Fix memory exhaustion in TAR header auto-detection #1447

Closed
opened 2026-01-29 22:20:37 +00:00 by claunia · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/adamhathcock/sharpcompress/pull/1024
Author: @Copilot
Created: 11/19/2025
Status: Merged
Merged: 11/19/2025
Merged by: @adamhathcock

Base: master ← Head: copilot/fix-memory-exhaustion-bug


📝 Commits (2)

  • 0698031 Initial plan
  • 7b06652 Add validation to prevent memory exhaustion in TAR long name headers

📊 Changes

3 files changed (+69 additions, -1 deletions)


📝 .gitignore (+2 -1)
📝 src/SharpCompress/Common/Tar/Headers/TarHeader.cs (+13 -0)
📝 tests/SharpCompress.Test/Tar/TarReaderTests.cs (+54 -0)

📄 Description

During auto-detection without extension hints, random bytes in compressed files (e.g., tar.lz) can be misinterpreted as TAR LongName/LongLink headers with multi-gigabyte sizes, causing memory exhaustion.

Changes

  • Added size validation in TarHeader.ReadLongName()

    • Introduced MAX_LONG_NAME_SIZE constant (32KB), comfortably above real-world path limits
    • Validates size before allocation: if (size < 0 || size > MAX_LONG_NAME_SIZE)
    • Throws InvalidFormatException, which IsTarFile catches and converts to a false result so auto-detection continues (see the sketch after this list)
  • Added regression test

    • Tar_Malformed_LongName_Excessive_Size creates a malformed header whose size field claims 8GB (a sketch of this construction follows the example below)
    • Verifies graceful failure instead of memory exhaustion
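
A minimal sketch of the validated read, assuming the method shape from the issue's excerpt below; the constant name and range check are from this PR, while the exception message is illustrative:

// Sketch only; surrounding code abbreviated.
private const int MAX_LONG_NAME_SIZE = 32 * 1024; // 32 KB

private string ReadLongName(BinaryReader reader, byte[] buffer)
{
    var size = ReadSize(buffer);
    if (size < 0 || size > MAX_LONG_NAME_SIZE)
    {
        // IsTarFile catches this, returns false, and auto-detection
        // moves on to the next candidate format.
        throw new InvalidFormatException("Invalid long name size: " + size);
    }

    var nameLength = (int)size;
    var nameBytes = reader.ReadBytes(nameLength); // allocation is now bounded
    // ... skip the remaining block padding and decode as before ...
    return ArchiveEncoding.Decode(nameBytes, 0, nameBytes.Length).TrimNulls();
}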

Example

// Before: Could attempt to allocate 8GB+ when detecting tar.lz
using var stream = File.OpenRead("archive.tar.lz");
using var reader = ReaderFactory.Open(stream);  // OutOfMemoryException

// After: Validation prevents excessive allocation, detection succeeds
using var stream = File.OpenRead("archive.tar.lz");
using var reader = ReaderFactory.Open(stream);  // Works correctly
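
For reference, the regression test presumably hand-builds a 512-byte TAR header whose octal size field claims roughly 8GB; a hypothetical reconstruction of that setup (the test name is from this PR, the construction details are assumed):

using System;
using System.IO;
using System.Linq;
using System.Text;
using SharpCompress.Common;
using SharpCompress.Readers.Tar;
using Xunit;

public class TarLongNameTests
{
    [Fact]
    public void Tar_Malformed_LongName_Excessive_Size()
    {
        // 512-byte TAR header: name at offset 0, octal size field at 124,
        // checksum at 148, typeflag at 156.
        var header = new byte[512];
        Encoding.ASCII.GetBytes("evil").CopyTo(header, 0);
        Encoding.ASCII.GetBytes("77777777777").CopyTo(header, 124); // 8^11 - 1 ≈ 8 GB
        header[156] = (byte)'L'; // GNU LongName typeflag

        // Standard checksum: sum of all header bytes with the checksum
        // field itself treated as spaces, stored as 6 octal digits + NUL.
        for (var i = 148; i < 156; i++) header[i] = (byte)' ';
        var checksum = header.Sum(b => b);
        Encoding.ASCII.GetBytes(Convert.ToString(checksum, 8).PadLeft(6, '0'))
            .CopyTo(header, 148);
        header[154] = 0;

        using var stream = new MemoryStream(header);
        using var reader = TarReader.Open(stream);
        // Must fail fast with a format error instead of allocating gigabytes.
        Assert.Throws<InvalidFormatException>(() => reader.MoveToNextEntry());
    }
}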

All existing tests pass. No breaking changes.

Fixes adamhathcock/sharpcompress#1021.

Original prompt

This section details the original issue you should resolve.

Issue title: Bug: Memory exhaustion when auto-detecting a specific tar.lz archive

Summary

When reading a specific .tar.lz file without providing an extension hint, the library attempts to auto-detect the format. This process incorrectly identifies the file as a Tar archive with a LongLink header, leading to an attempt to allocate a massive amount of memory (e.g., 20GB). This causes the application to either crash or fail to open the archive. Standard compression utilities can open this same file without any issues.

The root cause appears to be a lack of validation in TarHeader.Read() and its helper methods.

Steps to Reproduce

  1. Use the library to open a specially crafted .tar.lz file.
  2. Do not specify ReaderOptions.ExtensionHint, forcing the library to auto-detect the archive type.
  3. The library will fail to open the file after a massive memory spike.

Root Cause Analysis

The problem occurs because the auto-detection mechanism first tries to parse the file as a standard Tar archive. My file is a .tar.lz, but a byte at a specific offset is misinterpreted.

  1. In TarHeader.Read(), the code enters a loop to process headers.

    internal bool Read(BinaryReader reader)
    {
        string? longName = null;
        string? longLinkName = null;
        var hasLongValue = true;
        byte[] buffer;
        EntryType entryType;
    
        do
        {
            buffer = ReadBlock(reader);
    
            if (buffer.Length == 0)
            {
                return false;
            }
    
            entryType = ReadEntryType(buffer);
    
            // In my file, the byte at offset 157 is misinterpreted as EntryType.LongLink
            if (entryType == EntryType.LongName)
            {
                longName = ReadLongName(reader, buffer); // <- THIS LINE
                continue;
            }
            else if (entryType == EntryType.LongLink)
            {
                longLinkName = ReadLongName(reader, buffer); // <- THIS LINE
                continue;
            }
    
            hasLongValue = false;
        } while (hasLongValue);
    //...
    }
    
  2. For my specific file, the byte at offset 157 (read as entryType) happens to match EntryType.LongLink. This triggers a call to TarHeader.ReadLongName().

  3. Inside ReadLongName(), the ReadSize(buffer) method calculates an extremely large value for nameLength based on the misinterpreted header data. The subsequent call to reader.ReadBytes(nameLength) attempts to allocate a massive array without any sanity checks.

    private string ReadLongName(BinaryReader reader, byte[] buffer)
    {
        var size = ReadSize(buffer); // Calculates a huge size
        var nameLength = (int)size;
        var nameBytes = reader.ReadBytes(nameLength); // <- ATTEMPTS HUGE ALLOCATION
        var remainingBytesToRead = BLOCK_SIZE - (nameLength % BLOCK_SIZE);
    
        // ...
        return ArchiveEncoding.Decode(nameBytes, 0, nameBytes.Length).TrimNulls();
    }
    
  4. The BinaryReader.ReadBytes() method directly allocates memory based on the provided count.

    public virtual byte[] ReadBytes(int count)
    {
        ArgumentOutOfRangeException.ThrowIfNegative(count);
        ThrowIfDisposed();
    
        if (count == 0)
        {
            return Array.Empty<byte>();
        }
    
        byte[] result = new byte[count]; // <- HUGE MEMORY ALLOCATION HAPPENS HERE
        int numRead = _stream.ReadAtLeast(result, result.Length, throwOnEndOfStream: false);
    
        // ...
        return result;
    }
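
For context on steps 3 and 4: TAR stores the entry size as a 12-byte octal ASCII field at offsets 124-135, and the entry type as a single flag byte at offset 156. A hypothetical parser (names assumed; not SharpCompress's actual ReadSize) shows why arbitrary compressed bytes so easily decode to a huge value:

// Illustrative only: any run of ASCII digits '0'-'7' in the size field
// parses as a valid size. Eleven '7' bytes already decode to
// 8^11 - 1 = 8,589,934,591 (about 8 GB), so random lzip output that
// happens to land in this byte range yields multi-gigabyte "sizes".
static long ParseOctalSize(byte[] header)
{
    long size = 0;
    for (var i = 124; i < 136; i++) // size field: offsets 124-135
    {
        var b = header[i];
        if (b == 0 || b == (byte)' ')
        {
            break; // NUL or space terminates the field
        }
        size = (size << 3) + (b - (byte)'0');
    }
    return size;
}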
    

Stream Corruption

After the Tar parsing attempt fails (likely due to an EndOfStreamException or I/O error from Stream.ReadAtLeast()), the underlying Stream or SharpCompressStream appears to be left in a corrupted state.

When the auto-detection logic proceeds to the correct tar.lz format, it fails to read the header correctly. For example, it does not see the "LZIP" magic bytes at the beginning of the stream, even though debugging shows the bytes are present in the buffer. This strongly suggests that the stream's internal position or state has been irrecoverably altered by the failed read attempt.
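
A common defense against this failure mode is to rewind a seekable stream before each probe so that one failed attempt cannot poison the next. A generic sketch of that pattern follows (hypothetical helper, not SharpCompress's actual detection code):

using System;
using System.IO;

static class FormatDetector
{
    // Hypothetical illustration of rewind-between-probes detection.
    public static string? Detect(Stream stream, params (string Name, Func<Stream, bool> Probe)[] probes)
    {
        if (!stream.CanSeek)
        {
            throw new ArgumentException("Detection requires a seekable stream", nameof(stream));
        }

        var start = stream.Position;
        foreach (var (name, probe) in probes)
        {
            stream.Position = start; // rewind before every probe
            try
            {
                if (probe(stream))
                {
                    stream.Position = start; // hand the real reader a clean stream
                    return name;
                }
            }
            catch (IOException)
            {
                // A failed probe (including EndOfStreamException) must not
                // corrupt the stream state seen by later probes.
            }
        }
        return null;
    }
}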

Workaround

The issue can be avoided by explicitly setting ReaderOptions.ExtensionHint to guide the parser. This skips the problematic Tar auto-detection step.

// Example workaround
var options = new ReaderOptions { ExtensionHint = "tar.lz" };
using (var archive = ArchiveFactory.Open(filePath, options))
{
    // ...
}

However, most users expect auto-detection to be robust and would not think to set this option unless they had investigated the source code.

Comments on the Issue (you are @copilot in this section)

@adamhathcock Please make a P...



🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-29 22:20:37 +00:00

Reference: starred/sharpcompress#1447