mirror of
https://github.com/adamhathcock/sharpcompress.git
synced 2026-02-09 13:34:58 +00:00
False tar file detection #304
Originally created by @TheYarin on GitHub (Jun 8, 2018).
TL;DR: TarArchive.IsTarFile() says a file is a tar file when it actually isn't.
I tried extracting a .gz file (not .tar.gz) using the ReaderFactory.Open() method and the extraction resulted in very strange file names, which led me to investigate, and I discovered that the decompressed gz file is detected as a valid tar file.
Steps to reproduce:
This behaviour occurred with multiple (similar) files.
The code above generates a simplified version of one of the files I tried to decompress. It behaves the same even if I read the file from disk.
@adamhathcock commented on GitHub (Jun 8, 2018):
Unfortunately, Tar detection is going to be imperfect. I can probably do more, but I don't know what.
Currently, as the format has no header, I just see if the 100 bytes of Name results in something as well as if the EntryType part is a valid value. The answer is going to be yes above because the first 100 bytes is a valid string and 0 byte is a valid EntryType value.
@TheYarin commented on GitHub (Jun 8, 2018):
I see... any chance you'd use a third party library to detect the file type?
@adamhathcock commented on GitHub (Jun 8, 2018):
I'm saying it's not really possible because of the data format, not because I can't be bothered to do it. Auto-detection of archive formats isn't really a thing done by other libraries. They mostly just look at file extensions.
@simmotech commented on GitHub (Jun 8, 2018):
Tell me I'm an amateur fool if you think so but couldn't IsTarFile use the Checksum at offset 148 in the block as a further check?
@adamhathcock commented on GitHub (Jun 8, 2018):
I was about to say that's the CRC of the file, but I think that's the CRC of the 512 byte block. So that could work.
Now, this is going in to "I can't be bothered" territory :)
Should be easy to do: copy the 512-byte block, zero out the 8 bytes starting at 148, then CRC check the block against the value at 148.
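A minimal sketch of the check described above, written in Python for illustration rather than as SharpCompress code. It assumes the standard tar header layout: the checksum is stored as octal text in the 8 bytes at offset 148, and it is a plain sum of the 512 header bytes (not a CRC) computed with the checksum field itself counted as eight ASCII spaces rather than zeros. The names `parse_octal` and `header_checksum_ok` are hypothetical.

```python
import tarfile
from typing import Optional

BLOCK_SIZE = 512      # size of one tar header block
CHKSUM_OFFSET = 148   # start of the checksum field
CHKSUM_LENGTH = 8     # length of the checksum field


def parse_octal(field: bytes) -> Optional[int]:
    """Parse a NUL/space padded octal field; return None if it is not octal."""
    digits = field.strip(b" \x00")
    if not digits or any(d not in b"01234567" for d in digits):
        return None
    return int(digits.decode("ascii"), 8)


def header_checksum_ok(block: bytes) -> bool:
    """Return True if the first 512 bytes carry a matching tar header checksum."""
    if len(block) < BLOCK_SIZE:
        return False
    header = block[:BLOCK_SIZE]
    field = header[CHKSUM_OFFSET:CHKSUM_OFFSET + CHKSUM_LENGTH]
    stored = parse_octal(field)
    if stored is None:
        return False
    # Sum every byte, but count the checksum field as eight spaces (0x20).
    computed = sum(header) - sum(field) + CHKSUM_LENGTH * 0x20
    return computed == stored


if __name__ == "__main__":
    # A real header produced by the standard library's tarfile module.
    real_header = tarfile.TarInfo("a.txt").tobuf(format=tarfile.USTAR_FORMAT)
    # Plain text padded with zero bytes, similar in spirit to the reported file.
    not_tar = b"just some text, not an archive".ljust(BLOCK_SIZE, b"\x00")
    print(header_checksum_ok(real_header))
    print(header_checksum_ok(not_tar))
```

A buffer that is only text followed by zero bytes has no octal digits at offset 148, so it is rejected before the sum is even compared.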
@TheYarin commented on GitHub (Jun 8, 2018):
While I don't know the specifics of how they detect a file type, I've had a good experience with the following file detection tools/libraries:
file command's database)
All three of these don't recognize that file as a TAR file.
@adamhathcock commented on GitHub (Jun 8, 2018):
I read the source of is_tar.c in libmagic, which calculates the checksum of the block. So, easy enough.
@SourabhChakraborty commented on GitHub (Feb 4, 2021):
I'm having an issue related to https://github.com/TASVideos/BizHawk/issues/2587 where TarArchive.IsTarFile() returns true for files that don't have the tar magic number. I traced the issue to TarHeader.Read(): it looks like Read() can return true even if the file doesn't have the magic number in the header (i.e. even if (!string.IsNullOrEmpty(Magic) && "ustar".Equals(Magic)) is false). Is there a reason it does that, or is that an error?
@adamhathcock commented on GitHub (Feb 5, 2021):
Yeah, it looks like Tar detection isn't good enough. Need a better implementation. I'd rather not rely on something like libmagic just for Tar.
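For reference, a small Python sketch of what a magic-number test on a tar header could look like. It is not SharpCompress code. It assumes the usual header layout, where the magic field starts at offset 257: POSIX ustar headers store "ustar" followed by a NUL byte, and GNU tar headers store "ustar" followed by two spaces and a NUL. The function name `tar_flavor` and its return strings are hypothetical. Pre-POSIX (v7) archives have no magic field at all, which is one possible reason a reader might accept headers without it; the thread does not say whether that is the reason here.

```python
import tarfile

BLOCK_SIZE = 512
MAGIC_OFFSET = 257  # magic (6 bytes) followed by version (2 bytes)


def tar_flavor(block: bytes) -> str:
    """Classify a header block by its magic field: 'posix', 'gnu', or 'none'."""
    if len(block) < BLOCK_SIZE:
        return "none"
    magic_and_version = block[MAGIC_OFFSET:MAGIC_OFFSET + 8]
    if magic_and_version == b"ustar  \x00":
        return "gnu"
    if magic_and_version[:6] == b"ustar\x00":
        return "posix"
    # Either an old v7 header (no magic) or not a tar header at all.
    return "none"


if __name__ == "__main__":
    info = tarfile.TarInfo("a.txt")
    print(tar_flavor(info.tobuf(format=tarfile.USTAR_FORMAT)))
    print(tar_flavor(info.tobuf(format=tarfile.GNU_FORMAT)))
    print(tar_flavor(b"plain text".ljust(BLOCK_SIZE, b"\x00")))
```

Requiring the magic alone would reject old v7 archives, so a detector that wants to keep accepting them would need to pair the "none" case with another test, such as the header checksum.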
@adamhathcock commented on GitHub (Feb 5, 2021):
Should port the is_tar detection from https://github.com/threatstack/libmagic/blob/master/src/is_tar.c