False tar file detection #304

Open
opened 2026-01-29 22:09:46 +00:00 by claunia · 10 comments
Owner

Originally created by @TheYarin on GitHub (Jun 8, 2018).

TL;DR: TarArchive.IsTarFile() says a file is a tar file when it actually isn't.

I tried extracting a .gz file (not .tar.gz) using the ReaderFactory.Open() method, and the extraction produced very strange file names. That led me to investigate, and I discovered that the decompressed gz file is detected as a valid tar file.

Steps to reproduce:

var bytes = new List<byte>(System.Text.Encoding.ASCII.GetBytes("hello world"));

while (bytes.Count < 512)
    bytes.Add(0);

var stream = new MemoryStream(bytes.ToArray());
bool result = SharpCompress.Archives.Tar.TarArchive.IsTarFile(stream); // returns true

This behaviour occurred with multiple (similar) files.
The code above generates a simplified version of one of the files I tried to decompress. It behaves the same even if I read the file from disk.

claunia added the bug and up for grabs labels 2026-01-29 22:09:46 +00:00

@adamhathcock commented on GitHub (Jun 8, 2018):

Unfortunately, Tar detection is going to be imperfect. I can probably do more, but I don't know what.

Currently, as the format has no header, I just see if the 100 bytes of Name result in something and if the `EntryType` part is a valid value. The answer is going to be yes above because the first 100 bytes are a valid string and a `0` byte is a valid `EntryType` value.
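
To make the false positive concrete, here is a minimal sketch of a check along the lines described above. It is a simplified model for illustration, not SharpCompress's actual code: the class and method names are hypothetical, and the set of accepted type flags (NUL and `'0'`..`'7'`) is an assumption. The field offsets (name at bytes 0..99, type flag at byte 156) come from the tar header layout.

```csharp
using System;
using System.Text;

// Hypothetical model of the heuristic described in the comment above;
// NOT the real SharpCompress implementation.
static class NaiveTarCheck
{
    const int BlockSize = 512;
    const int NameLength = 100;      // name field occupies bytes 0..99
    const int TypeFlagOffset = 156;  // single-byte entry type

    public static bool LooksLikeTar(byte[] block)
    {
        if (block == null || block.Length < BlockSize)
            return false;

        // The name is the ASCII text up to the first NUL inside the name field.
        int end = Array.IndexOf(block, (byte)0, 0, NameLength);
        int nameLength = end < 0 ? NameLength : end;
        string name = Encoding.ASCII.GetString(block, 0, nameLength);

        // Assumption: NUL and '0'..'7' count as valid entry types.
        byte type = block[TypeFlagOffset];
        bool validType = type == 0 || (type >= (byte)'0' && type <= (byte)'7');

        return name.Length > 0 && validType;
    }
}
```

Under this model, the "hello world" block from the report passes: the name field decodes to a non-empty string and the type flag byte is NUL.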


@TheYarin commented on GitHub (Jun 8, 2018):

I see... any chance you'd use a third party library to detect the file type?


@adamhathcock commented on GitHub (Jun 8, 2018):

I'm saying it's not really possible because of the data format, not because I can't be bothered to do it. Auto-detection of archive formats isn't really a thing done by other libraries. They mostly just look at file extensions.


@simmotech commented on GitHub (Jun 8, 2018):

Tell me I'm an amateur fool if you think so, but couldn't IsTarFile use the Checksum at offset 148 in the block as a further check?


@adamhathcock commented on GitHub (Jun 8, 2018):

I was about to say that's the CRC of the file, but I think that's the CRC of the 512-byte block. So that could work.

Now, this is going into "I can't be bothered" territory :)

Should be easy to do: copy the 512-byte block, zero out the 8 bytes starting at 148, then CRC check the block against the value at 148.
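
A note on the details, with a sketch: in the ustar/POSIX header layout the value at offset 148 is not a CRC but a plain sum of all 512 header bytes, computed with the 8 checksum bytes treated as ASCII spaces and stored as octal digits. The code below is a minimal sketch under that assumption. The class and method names are hypothetical, and it only handles the unsigned-byte sum.

```csharp
using System;

// Hypothetical helper; not part of SharpCompress.
static class TarChecksum
{
    const int BlockSize = 512;
    const int ChecksumOffset = 148;
    const int ChecksumLength = 8;

    public static bool HasValidChecksum(byte[] block)
    {
        if (block == null || block.Length < BlockSize)
            return false;

        // Sum every header byte, counting the checksum field as spaces.
        int sum = 0;
        for (int i = 0; i < BlockSize; i++)
        {
            bool inField = i >= ChecksumOffset && i < ChecksumOffset + ChecksumLength;
            sum += inField ? (byte)' ' : block[i];
        }

        return TryParseOctal(block, ChecksumOffset, ChecksumLength, out int stored)
            && stored == sum;
    }

    // Parses ASCII octal digits, skipping leading spaces and stopping
    // at the first byte that is not an octal digit (usually NUL or space).
    static bool TryParseOctal(byte[] buffer, int offset, int length, out int value)
    {
        value = 0;
        int i = offset;
        int end = offset + length;
        while (i < end && buffer[i] == (byte)' ')
            i++;

        int digits = 0;
        for (; i < end && buffer[i] >= (byte)'0' && buffer[i] <= (byte)'7'; i++, digits++)
            value = value * 8 + (buffer[i] - '0');

        return digits > 0;
    }
}
```

With this check, the "hello world" block from the report is rejected, because its checksum field is all NUL bytes and contains no octal digits.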


@TheYarin commented on GitHub (Jun 8, 2018):

While I don't know the specifics of how they detect a file type, I've had a good experience with the following file detection tools/libraries:

- https://github.com/hey-red/Mime (uses Linux's `file` command's database)
- ExifTool (CLI tool)
- TRiD (CLI tool with a few APIs)

None of these three recognizes that file as a TAR file.

@adamhathcock commented on GitHub (Jun 8, 2018):

Read the source in is_tar.c in libmagic, which calculates the checksum of the block. So easy enough.


@SourabhChakraborty commented on GitHub (Feb 4, 2021):

I'm having an issue related to https://github.com/TASVideos/BizHawk/issues/2587 where TarArchive.IsTarFile() returns true for files that don't have the tar magic number. I traced the issue to TarHeader.Read(): it looks like Read() can return true even if the file doesn't have the magic number in the header (i.e. even if `(!string.IsNullOrEmpty(Magic)&& "ustar".Equals(Magic))` is false). Is there a reason it does that, or is that an error?
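
For reference, a strict magic test can be sketched as below. In the ustar layout the magic field starts at offset 257 and begins with the five ASCII characters `ustar` (followed by NUL in POSIX archives, or by spaces in GNU ones). The class and method names are hypothetical. One possible reason for not requiring the magic, offered here as an assumption rather than something confirmed in this thread, is that pre-POSIX (v7) tar headers carry no magic at all, so a strict test would reject those archives.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of SharpCompress.
static class TarMagic
{
    const int BlockSize = 512;
    const int MagicOffset = 257; // start of the ustar magic field

    public static bool HasUstarMagic(byte[] block)
    {
        if (block == null || block.Length < BlockSize)
            return false;

        // Compare only the 5 characters shared by the POSIX and GNU variants.
        string magic = Encoding.ASCII.GetString(block, MagicOffset, 5);
        return magic == "ustar";
    }
}
```

A detector could treat a valid magic as strong evidence and fall back to a weaker test only when the magic is absent.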


@adamhathcock commented on GitHub (Feb 5, 2021):

Yeah, it looks like Tar detection isn't good enough. Need a better implementation. I'd rather not rely on something like libmagic just for Tar.


@adamhathcock commented on GitHub (Feb 5, 2021):

Should port the `is_tar` detection from https://github.com/threatstack/libmagic/blob/master/src/is_tar.c

Reference: starred/sharpcompress#304