Tar parsing is too benevolent #459

Open
opened 2026-01-29 22:12:25 +00:00 by claunia · 2 comments
Owner

Originally created by @IS4Code on GitHub (May 6, 2021).

I would like to use this library to detect and open common archive formats, yet there is the issue that Tar is recognized far too often and invalid broken objects are produced from files in completely different formats.

First, the checksum in the header is not read and verified at all. In my opinion, doing this would remove a lot of the false positives, but it seems there are implementations that store the checksum there in different formats. Perhaps it could be controlled by a specific option, but I feel it should be enabled by default.

Next, the Magic property in the header is not exposed in any way. Modern tar files should have ustar there, so perhaps another option could be to reject old files that do not have this signature.

Originally created by @IS4Code on GitHub (May 6, 2021). I would like to use this library to detect and open common archive formats, yet there is the issue that Tar is recognized far too often and invalid broken objects are produced from files in completely different formats. First, the checksum in the header is not read and verified at all. In my opinion, doing this would remove a lot of the false positives, but it seems there are implementations that store the checksum there in different formats. Perhaps it could be controlled by a specific option, but I feel it should be enabled by default. Next, the `Magic` property in the header is not exposed in any way. Modern tar files should have `ustar` there, so perhaps another option could be to reject old files that do not have this signature.
Author
Owner

@adamhathcock commented on GitHub (Jun 4, 2021):

Tar parsing isn't too great. I never liked my implementation which was based on something really old. I need to adapt another more complete implementation from another library.

Suggestions and PRs welcome.

@adamhathcock commented on GitHub (Jun 4, 2021): Tar parsing isn't too great. I never liked my implementation which was based on something really old. I need to adapt another more complete implementation from another library. Suggestions and PRs welcome.
Author
Owner

@IS4Code commented on GitHub (Jun 4, 2021):

@adamhathcock I tried to come up with a better checking of the header based on the checksum:

const int headerLength = 512;

static readonly byte[] leadingDigits = Enumerable.Range('1', 9).Select(i => (byte)i).ToArray();

static readonly byte[] terminator = { 0, (byte)' ' };

public bool CheckHeader(Span<byte> header)
{
    if(header.Length < headerLength) return false;
    if(header.Slice(154, 2).IndexOfAny(terminator) != 0)
    {
        // Checksum not terminated
        return false;
    }
    // Locate the checksum
    var checksumSpan = header.Slice(148, 6);
    var finalDigit = checksumSpan[5];
    if(finalDigit < '0' || finalDigit > '7') return false;
    // Find the leading digit
    int start = checksumSpan.IndexOfAny(leadingDigits);
    if(start == -1)
    {
        // Checksum not present
        return false;
    }
    for(int i = 0; i < start; i++)
    {
        var digit = checksumSpan[i];
        if(digit != 0 && digit != ' ' && digit != '0')
        {
            // Unexpected padding character
            return false;
        }
    }
    // Parse the checksum
    int targetChecksum = 0;
    for(int i = start; i < checksumSpan.Length; i++)
    {
        var digit = checksumSpan[i];
        if(digit < '0' || digit > '7')
        {
            // Not an octal digit
            return false;
        }
        targetChecksum = targetChecksum * 8 + (digit - '0');
    }
    if(targetChecksum > headerLength * Byte.MaxValue)
    {
        // Not a valid header checksum
        return false;
    }
    // Compute both unsigned and signed checksum
    int unsignedChecksum = 0;
    int signedChecksum = 0;
    var sheader = MemoryMarshal.Cast<byte, sbyte>(header);
    for(int i = 0; i < headerLength; i++)
    {
        if(i >= 148 && i <= 155)
        {
            // Inside the checksum span
            unsignedChecksum += ' ';
            signedChecksum += ' ';
        }else{
            unsignedChecksum += header[i];
            signedChecksum += sheader[i];
        }
        if(i % (headerLength / 8) == 0)
        {
            // Check if not too far from the target checksum
            int unsignedDist = targetChecksum - unsignedChecksum;
            int signedDist = targetChecksum - signedChecksum;
            int remaining = headerLength - i - 1;
            if(
                (unsignedDist < 0 || unsignedDist > Byte.MaxValue * remaining) &&
                (signedDist < SByte.MinValue * remaining || signedDist > SByte.MaxValue * remaining)
            ) return false;
        }
    }
    return true;
}

I tried this on a lot of files and so far found none that would be mistaken as Tar. I haven't however checked if all valid Tar files are correctly accepted. I tried to still account for some different implementations: the checksum could have either spaces or zeros as leading digits, the termination could differ, and both the signed and unsigned checksum is computed. Plus every 64 bytes, it checks if there is even enough remaining bytes to reach the checksum.

It's focused more on preventing false positives, so if something like this is included (and you certainly have my permission), I'd prefer to have this toggleable in case of some weird Tar versions.

@IS4Code commented on GitHub (Jun 4, 2021): @adamhathcock I tried to come up with a better checking of the header based on the checksum: ```cs const int headerLength = 512; static readonly byte[] leadingDigits = Enumerable.Range('1', 9).Select(i => (byte)i).ToArray(); static readonly byte[] terminator = { 0, (byte)' ' }; public bool CheckHeader(Span<byte> header) { if(header.Length < headerLength) return false; if(header.Slice(154, 2).IndexOfAny(terminator) != 0) { // Checksum not terminated return false; } // Locate the checksum var checksumSpan = header.Slice(148, 6); var finalDigit = checksumSpan[5]; if(finalDigit < '0' || finalDigit > '7') return false; // Find the leading digit int start = checksumSpan.IndexOfAny(leadingDigits); if(start == -1) { // Checksum not present return false; } for(int i = 0; i < start; i++) { var digit = checksumSpan[i]; if(digit != 0 && digit != ' ' && digit != '0') { // Unexpected padding character return false; } } // Parse the checksum int targetChecksum = 0; for(int i = start; i < checksumSpan.Length; i++) { var digit = checksumSpan[i]; if(digit < '0' || digit > '7') { // Not an octal digit return false; } targetChecksum = targetChecksum * 8 + (digit - '0'); } if(targetChecksum > headerLength * Byte.MaxValue) { // Not a valid header checksum return false; } // Compute both unsigned and signed checksum int unsignedChecksum = 0; int signedChecksum = 0; var sheader = MemoryMarshal.Cast<byte, sbyte>(header); for(int i = 0; i < headerLength; i++) { if(i >= 148 && i <= 155) { // Inside the checksum span unsignedChecksum += ' '; signedChecksum += ' '; }else{ unsignedChecksum += header[i]; signedChecksum += sheader[i]; } if(i % (headerLength / 8) == 0) { // Check if not too far from the target checksum int unsignedDist = targetChecksum - unsignedChecksum; int signedDist = targetChecksum - signedChecksum; int remaining = headerLength - i - 1; if( (unsignedDist < 0 || unsignedDist > Byte.MaxValue * remaining) && (signedDist < SByte.MinValue * remaining || signedDist > SByte.MaxValue * remaining) ) return false; } } return true; } ``` I tried this on a lot of files and so far found none that would be mistaken as Tar. I haven't however checked if all valid Tar files are correctly accepted. I tried to still account for some different implementations: the checksum could have either spaces or zeros as leading digits, the termination could differ, and both the signed and unsigned checksum is computed. Plus every 64 bytes, it checks if there is even enough remaining bytes to reach the checksum. It's focused more on preventing false positives, so if something like this is included (and you certainly have my permission), I'd prefer to have this toggleable in case of some weird Tar versions.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/sharpcompress#459