SharpCompress Performance Guide
This guide helps you optimize SharpCompress for performance in various scenarios.
API Selection Guide
Archive API vs Reader API
Choose the right API based on your use case:
| Aspect | Archive API | Reader API |
|---|---|---|
| Stream Type | Seekable only | Non-seekable OK |
| Memory Usage | Entry index held in memory | One entry at a time |
| Random Access | ✓ Yes | ✗ No |
| Best For | Small-to-medium archives | Large or streaming data |
| Performance | Fast for random access | Better for large files |
Archive API (Fast for Random Access)
// Use when:
// - Archive fits in memory
// - You need random access to entries
// - Stream is seekable (file, MemoryStream)
using (var archive = ZipArchive.Open("archive.zip"))
{
// Random access - all entries available
var specific = archive.Entries.FirstOrDefault(e => e.Key == "file.txt");
if (specific != null)
{
specific.WriteToFile(@"C:\output\file.txt");
}
}
Performance Characteristics:
- ✓ Instant entry lookup
- ✓ Parallel extraction possible (sketched below)
- ✗ Entry index for the whole archive is held in memory
- ✗ Can't process while downloading
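When entries are independent, extraction can also be parallelized. A minimal sketch, assuming a flat zip on local disk; because a single archive instance shares one underlying stream, each worker opens its own ZipArchive rather than sharing one instance across threads:
// Collect the entry names first, then fan out
List<string> keys;
using (var archive = ZipArchive.Open("archive.zip"))
{
    keys = archive.Entries
        .Where(e => !e.IsDirectory)
        .Select(e => e.Key)
        .ToList();
}
Parallel.ForEach(keys, key =>
{
    // One archive instance per worker avoids sharing a stream across threads
    using (var archive = ZipArchive.Open("archive.zip"))
    {
        var entry = archive.Entries.First(e => e.Key == key);
        entry.WriteToFile(Path.Combine(@"C:\output", key));
    }
});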
Reader API (Best for Large Files)
// Use when:
// - Processing large archives (>100 MB)
// - Streaming from network/pipe
// - Memory is constrained
// - Forward-only processing is acceptable
using (var stream = File.OpenRead("large.zip"))
using (var reader = ReaderFactory.Open(stream))
{
while (reader.MoveToNextEntry())
{
// Process one entry at a time
reader.WriteEntryToDirectory(@"C:\output");
}
}
Performance Characteristics:
- ✓ Minimal memory footprint
- ✓ Works with non-seekable streams
- ✓ Can process while downloading
- ✗ Forward-only (no random access)
- ✗ Entry lookup requires iteration (see the sketch below)
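When you do need a single entry from a stream, the lookup is a forward scan. A minimal sketch (file and entry names are placeholders):
using (var stream = File.OpenRead("large.zip"))
using (var reader = ReaderFactory.Open(stream))
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.Key == "file.txt")
        {
            reader.WriteEntryToDirectory(@"C:\output");
            break; // stop scanning once the entry is found
        }
    }
}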
Buffer Sizing
Understanding Buffers
SharpCompress uses internal buffers for reading compressed data. Buffer size affects:
- Speed: Larger buffers = fewer I/O operations = faster
- Memory: Larger buffers = higher memory usage
Recommended Buffer Sizes
| Scenario | Size | Notes |
|---|---|---|
| Embedded/IoT devices | 4-8 KB | Minimal memory usage |
| Memory-constrained | 16-32 KB | Conservative default |
| Standard use (default) | 64 KB | Recommended default |
| Large file streaming | 256 KB | Better throughput |
| High-speed SSD | 512 KB - 1 MB | Maximum throughput |
How Buffer Size Affects Performance
// SharpCompress manages buffers internally
// You can't directly set buffer size, but you can:
// 1. Use Stream.CopyTo with explicit buffer size
using (var entryStream = reader.OpenEntryStream())
using (var fileStream = File.Create(@"C:\output\file.txt"))
{
// Default CopyTo buffer is 81,920 bytes (80 KB)
entryStream.CopyTo(fileStream);
// Or specify larger buffer for faster copy
entryStream.CopyTo(fileStream, bufferSize: 262144); // 256 KB
}
// 2. Use custom buffer for writing
using (var entryStream = reader.OpenEntryStream())
using (var fileStream = File.Create(@"C:\output\file.txt"))
{
byte[] buffer = new byte[262144]; // 256 KB
int bytesRead;
while ((bytesRead = entryStream.Read(buffer, 0, buffer.Length)) > 0)
{
fileStream.Write(buffer, 0, bytesRead);
}
}
Streaming Large Files
Non-Seekable Stream Patterns
For processing archives from downloads or pipes:
// Download stream (non-seekable)
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var reader = ReaderFactory.Open(httpStream))
{
// Process entries as they arrive
while (reader.MoveToNextEntry())
{
if (!reader.Entry.IsDirectory)
{
reader.WriteEntryToDirectory(@"C:\output");
}
}
}
Performance Tips:
- Don't try to buffer the entire stream
- Process entries immediately
- Use async APIs for better responsiveness (see the sketch below)
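A minimal async sketch of this pattern, assuming a flat archive (entries in subdirectories would need their directories created first):
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var reader = ReaderFactory.Open(httpStream))
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory) continue;
        var outPath = Path.Combine(@"C:\output", reader.Entry.Key);
        using (var entryStream = reader.OpenEntryStream())
        using (var fileStream = File.Create(outPath))
        {
            // Async copy frees the thread during network waits
            await entryStream.CopyToAsync(fileStream);
        }
    }
}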
Download-Then-Extract vs Streaming
Choose based on your constraints:
| Approach | When to Use |
|---|---|
| Download then extract | Moderate size, need random access |
| Stream during download | Large files, bandwidth limited, memory constrained |
// Download then extract (requires disk space)
var archivePath = await DownloadFile(url, @"C:\temp\archive.zip");
using (var archive = ZipArchive.Open(archivePath))
{
archive.WriteToDirectory(@"C:\output");
}
// Stream during download (on-the-fly extraction)
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var reader = ReaderFactory.Open(httpStream))
{
while (reader.MoveToNextEntry())
{
reader.WriteEntryToDirectory(@"C:\output");
}
}
Solid Archive Optimization
Why Solid Archives Are Slow
Solid archives (Rar, 7Zip) group files together in a single compressed stream:
Solid Archive Layout:
[Header] [Compressed Stream] [Footer]
├─ File1 compressed data
├─ File2 compressed data
├─ File3 compressed data
└─ File4 compressed data
Extracting File3 requires decompressing File1 and File2 first.
Sequential vs Random Extraction
Random Extraction (Slow):
using (var archive = RarArchive.Open("solid.rar"))
{
foreach (var entry in archive.Entries)
{
entry.WriteToFile(@"C:\output\" + entry.Key); // ✗ Slow!
// Each entry triggers full decompression from start
}
}
Sequential Extraction (Fast):
using (var archive = RarArchive.Open("solid.rar"))
{
// Method 1: Use WriteToDirectory (recommended)
archive.WriteToDirectory(@"C:\output", new ExtractionOptions
{
ExtractFullPath = true,
Overwrite = true
});
// Method 2: Use ExtractAllEntries
archive.ExtractAllEntries();
// Method 3: Use Reader API (also sequential)
using (var reader = RarReader.Open(File.OpenRead("solid.rar")))
{
while (reader.MoveToNextEntry())
{
reader.WriteEntryToDirectory(@"C:\output");
}
}
}
Performance Impact:
- Random extraction: O(n²) - very slow for many files
- Sequential extraction: O(n) - 10-100x faster
Best Practices for Solid Archives
- Always extract sequentially when possible
- Use Reader API for large solid archives
- Process entries in order from the archive
- Consider using 7Zip command-line for scripted extractions
Compression Level Trade-offs
Deflate/GZip Levels
// Level 1 = Fastest, largest size
// Level 6 = Default (balanced)
// Level 9 = Slowest, best compression
// Write with different compression levels
using (var archive = ZipArchive.Create())
{
archive.AddAllFromDirectory(@"D:\data");
// Fast compression (level 1)
archive.SaveTo("fast.zip", new WriterOptions(CompressionType.Deflate)
{
CompressionLevel = 1
});
// Default compression (level 6)
archive.SaveTo("default.zip", CompressionType.Deflate);
// Best compression (level 9)
archive.SaveTo("best.zip", new WriterOptions(CompressionType.Deflate)
{
CompressionLevel = 9
});
}
Speed vs Size:
| Level | Relative Speed | Size (% of original) | Use Case |
|---|---|---|---|
| 1 | 10x | 90% | Network, streaming |
| 6 | 1x | 75% | Default (good balance) |
| 9 | 0.1x | 65% | Archival, static storage |
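These ratios depend heavily on the data, so measure on your own files before committing to a level. A sketch using the same WriterOptions pattern as above (paths are placeholders):
foreach (var level in new[] { 1, 6, 9 })
{
    var sw = Stopwatch.StartNew();
    using (var archive = ZipArchive.Create())
    {
        archive.AddAllFromDirectory(@"D:\data");
        archive.SaveTo($"level{level}.zip", new WriterOptions(CompressionType.Deflate)
        {
            CompressionLevel = level
        });
    }
    sw.Stop();
    var kb = new FileInfo($"level{level}.zip").Length / 1024;
    Console.WriteLine($"Level {level}: {sw.ElapsedMilliseconds} ms, {kb} KB");
}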
BZip2 Block Size
// BZip2 block size affects memory and compression
// 100K to 900K (default 900K)
// Smaller block size = lower memory, faster
// Larger block size = better compression, slower
using (var archive = TarArchive.Create())
{
archive.AddAllFromDirectory(@"D:\data");
// SharpCompress writes BZip2 with its default block size
archive.SaveTo("archive.tar.bz2", CompressionType.BZip2);
}
LZMA Settings
LZMA compression is very powerful but memory-intensive:
// LZMA (7Zip, .tar.lzma, .tar.xz):
// - Dictionary size: 16 KB to 1 GB (default 32 MB)
// - Smaller dictionary = faster, lower memory
// - Larger dictionary = better compression, slower
// SharpCompress reads these formats but does not write 7Zip or XZ,
// so plan LZMA-heavy pipelines around decompression:
using (var stream = File.OpenRead("archive.tar.xz"))
using (var reader = ReaderFactory.Open(stream))
{
    while (reader.MoveToNextEntry())
    {
        reader.WriteEntryToDirectory(@"C:\output");
    }
}
Async Performance
When Async Helps
Async is beneficial when:
- Long I/O operations (network, slow disks)
- UI responsiveness needed (Windows Forms, WPF, Blazor)
- Server applications (ASP.NET, multiple concurrent operations)
// Async extraction (non-blocking)
using (var archive = ZipArchive.Open("archive.zip"))
{
await archive.WriteToDirectoryAsync(
@"C:\output",
new ExtractionOptions { ExtractFullPath = true, Overwrite = true },
cancellationToken
);
}
// Thread can handle other work while I/O happens
When Async Doesn't Help
Async doesn't improve performance for:
- CPU-bound work such as decompression (a thread stays busy computing either way)
- Fast local SSD I/O (waits are too short to matter)
- Single-threaded batch jobs (nothing else to overlap with the I/O)
// Sync extraction (simpler, same performance on fast I/O)
using (var archive = ZipArchive.Open("archive.zip"))
{
archive.WriteToDirectory(
@"C:\output",
new ExtractionOptions { ExtractFullPath = true, Overwrite = true }
);
}
// Simple and fast - no async needed
Cancellation Pattern
var cts = new CancellationTokenSource();
// Cancel after 5 minutes
cts.CancelAfter(TimeSpan.FromMinutes(5));
try
{
using (var archive = ZipArchive.Open("archive.zip"))
{
await archive.WriteToDirectoryAsync(
@"C:\output",
new ExtractionOptions { ExtractFullPath = true, Overwrite = true },
cts.Token
);
}
}
catch (OperationCanceledException)
{
Console.WriteLine("Extraction cancelled");
// Clean up partial extraction if needed
}
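The cleanup can be as simple as discarding whatever was partially written, assuming the output directory is dedicated to this one extraction:
// Inside the catch block above:
if (Directory.Exists(@"C:\output"))
{
    Directory.Delete(@"C:\output", recursive: true); // discard partial output
}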
Memory Efficiency
Reducing Allocations
// ✗ Wrong - creates new options object each iteration
foreach (var archiveFile in archiveFiles)
{
using (var archive = ZipArchive.Open(archiveFile))
{
archive.WriteToDirectory(outputDir, new ExtractionOptions
{
ExtractFullPath = true,
Overwrite = true
});
}
}
// ✓ Better - reuse options object
var options = new ExtractionOptions
{
ExtractFullPath = true,
Overwrite = true
};
foreach (var archiveFile in archiveFiles)
{
using (var archive = ZipArchive.Open(archiveFile))
{
archive.WriteToDirectory(outputDir, options);
}
}
Object Pooling for Repeated Operations
// For very high-throughput scenarios, consider pooling
public class ArchiveExtractionPool
{
    private readonly ArrayPool<byte> _bufferPool = ArrayPool<byte>.Shared;

    public void ExtractMany(IEnumerable<string> archiveFiles, string outputDir)
    {
        // Rent one buffer up front and reuse it for every copy
        var buffer = _bufferPool.Rent(262144); // 256 KB
        try
        {
            foreach (var archiveFile in archiveFiles)
            {
                using (var stream = File.OpenRead(archiveFile))
                using (var reader = ReaderFactory.Open(stream))
                {
                    while (reader.MoveToNextEntry())
                    {
                        if (reader.Entry.IsDirectory) continue;
                        var outPath = Path.Combine(outputDir, reader.Entry.Key);
                        var dir = Path.GetDirectoryName(outPath);
                        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);
                        using (var entryStream = reader.OpenEntryStream())
                        using (var fileStream = File.Create(outPath))
                        {
                            int read;
                            while ((read = entryStream.Read(buffer, 0, buffer.Length)) > 0)
                            {
                                fileStream.Write(buffer, 0, read);
                            }
                        }
                    }
                }
            }
        }
        finally
        {
            _bufferPool.Return(buffer);
        }
    }
}
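Usage is then one call per batch (paths are placeholders):
var pool = new ArchiveExtractionPool();
pool.ExtractMany(Directory.EnumerateFiles(@"C:\archives", "*.zip"), @"C:\output");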
Practical Performance Tips
1. Choose the Right API
| Scenario | API | Why |
|---|---|---|
| Small archives | Archive | Faster random access |
| Large archives | Reader | Lower memory |
| Streaming | Reader | Works on non-seekable streams |
| Download streams | Reader | Extraction can overlap the download |
2. Batch Operations
// ✗ Slow - reopens the same archive once per entry
foreach (var entryName in entryNames)
{
    using (var archive = ZipArchive.Open("archive.zip"))
    {
        var entry = archive.Entries.First(e => e.Key == entryName);
        entry.WriteToFile(Path.Combine(@"C:\output", entryName));
    }
}
// ✓ Better - open once and extract everything in one pass
using (var archive = ZipArchive.Open("archive.zip"))
{
    archive.WriteToDirectory(@"C:\output");
}
3. Use Appropriate Compression
// For distribution/storage: Best compression
archive.SaveTo("archive.zip", new WriterOptions(CompressionType.Deflate)
{
CompressionLevel = 9
});
// For daily backups: Balanced compression
archive.SaveTo("backup.zip", CompressionType.Deflate); // Default level 6
// For temporary/streaming: Fast compression
archive.SaveTo("temp.zip", new WriterOptions(CompressionType.Deflate)
{
CompressionLevel = 1
});
4. Profile Your Code
var sw = Stopwatch.StartNew();
using (var archive = ZipArchive.Open("large.zip"))
{
archive.WriteToDirectory(@"C:\output");
}
sw.Stop();
Console.WriteLine($"Extraction took {sw.ElapsedMilliseconds}ms");
// Measure memory before/after
var beforeMem = GC.GetTotalMemory(true);
// ... do work ...
var afterMem = GC.GetTotalMemory(true);
Console.WriteLine($"Memory used: {(afterMem - beforeMem) / 1024 / 1024}MB");
Troubleshooting Performance
Extraction is Slow
- Check if solid archive → Use sequential extraction
- Check API → Reader API might be faster for large files
- Check compression level → Higher levels are slower to decompress
- Check I/O → Network drives are much slower than SSD
- Check buffer size → May need larger buffers for network
High Memory Usage
- Use Reader API instead of Archive API
- Process entries immediately rather than buffering
- Reduce compression level if writing
- Check for memory leaks in your code
CPU Usage at 100%
- Normal for compression - especially with high compression levels
- Consider lower level for faster processing
- Reduce parallelism if processing multiple archives (see the sketch below)
- Check if awaiting properly in async code
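A minimal sketch of capping parallelism when extracting many archives; the degree of 2 is an arbitrary starting point to tune against your core count and disk:
var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(archiveFiles, parallelOptions, file =>
{
    // One output subdirectory per archive keeps results separate
    var outDir = Path.Combine(@"C:\output", Path.GetFileNameWithoutExtension(file));
    Directory.CreateDirectory(outDir);
    using (var archive = ZipArchive.Open(file))
    {
        archive.WriteToDirectory(outDir);
    }
});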
Related Documentation
- USAGE.md - Usage examples with performance considerations
- FORMATS.md - Format-specific performance notes
- TROUBLESHOOTING.md - Solving common issues