SharpCompress Performance Guide

This guide helps you get the best performance out of SharpCompress across common scenarios.

API Selection Guide

Archive API vs Reader API

Choose the right API based on your use case:

| Aspect | Archive API | Reader API |
|---|---|---|
| Stream Type | Seekable only | Non-seekable OK |
| Memory Usage | All entry headers in memory | One entry at a time |
| Random Access | ✓ Yes | ✗ No |
| Best For | Small-to-medium archives | Large or streaming data |
| Performance | Fast for random access | Better for large files |
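
If the deciding factor is simply whether your stream can seek, a small routing helper works well. The sketch below follows the OpenArchive/OpenReader naming used throughout this guide; the format-agnostic ArchiveFactory call is an assumption and may differ in your version.

// Minimal sketch: route between the two APIs based on stream capabilities.
// ArchiveFactory.OpenArchive is assumed here to mirror ReaderFactory.OpenReader;
// substitute a concrete archive type (e.g. ZipArchive) if needed.
void Extract(Stream source, string outputDirectory)
{
    if (source.CanSeek)
    {
        // Seekable: Archive API gives random access to entries
        using (var archive = ArchiveFactory.OpenArchive(source))
        {
            archive.WriteToDirectory(outputDirectory);
        }
    }
    else
    {
        // Non-seekable: Reader API processes entries forward-only
        using (var reader = ReaderFactory.OpenReader(source))
        {
            while (reader.MoveToNextEntry())
            {
                if (!reader.Entry.IsDirectory)
                {
                    reader.WriteEntryToDirectory(outputDirectory);
                }
            }
        }
    }
}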

Archive API (Fast for Random Access)

// Use when:
// - Archive fits in memory
// - You need random access to entries
// - Stream is seekable (file, MemoryStream)

using (var archive = ZipArchive.OpenArchive("archive.zip"))
{
    // Random access - all entries available
    var specific = archive.Entries.FirstOrDefault(e => e.Key == "file.txt");
    if (specific != null)
    {
        specific.WriteToFile(@"C:\output\file.txt");
    }
}

Performance Characteristics:

  • ✓ Instant entry lookup
  • ✓ Parallel extraction possible (see the sketch below)
  • ✗ Entry headers for the whole archive are held in memory
  • ✗ Can't process while downloading
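
Parallel extraction is possible because entries can be read in any order, but a single archive instance shares one underlying stream, so the safest pattern is one archive instance per worker. The sketch below assumes that constraint; it is not a guaranteed thread-safety recipe for every format.

// Sketch: parallel extraction with one archive instance per worker,
// so no two workers share a stream position.
var keys = new List<string>();
using (var archive = ZipArchive.OpenArchive("archive.zip"))
{
    keys.AddRange(archive.Entries.Where(e => !e.IsDirectory).Select(e => e.Key));
}

Parallel.ForEach(keys, key =>
{
    // Each worker opens its own copy of the archive
    using (var archive = ZipArchive.OpenArchive("archive.zip"))
    {
        var entry = archive.Entries.First(e => e.Key == key);
        entry.WriteToFile(@"C:\output\" + key);
    }
});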

Reader API (Best for Large Files)

// Use when:
// - Processing large archives (>100 MB)
// - Streaming from network/pipe
// - Memory is constrained
// - Forward-only processing is acceptable

using (var stream = File.OpenRead("large.zip"))
using (var reader = ReaderFactory.OpenReader(stream))
{
    while (reader.MoveToNextEntry())
    {
        // Process one entry at a time
        reader.WriteEntryToDirectory(@"C:\output");
    }
}

Performance Characteristics:

  • ✓ Minimal memory footprint
  • ✓ Works with non-seekable streams
  • ✓ Can process while downloading
  • ✗ Forward-only (no random access)
  • ✗ Entry lookup requires iteration (see the lookup sketch below)
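
Because there is no entry index, finding a single file with the Reader API means scanning forward until the key matches. A minimal sketch:

// Sketch: locate one entry with the Reader API by scanning forward.
// There is no index, so in the worst case the whole archive is read.
using (var stream = File.OpenRead("large.zip"))
using (var reader = ReaderFactory.OpenReader(stream))
{
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory && reader.Entry.Key == "file.txt")
        {
            reader.WriteEntryToDirectory(@"C:\output");
            break;  // Stop once the entry has been written
        }
    }
}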

Buffer Sizing

Understanding Buffers

SharpCompress uses internal buffers for reading compressed data. Buffer size affects:

  • Speed: Larger buffers = fewer I/O operations = faster
  • Memory: Larger buffers = higher memory usage

Recommended buffer sizes:

| Scenario | Buffer size | Notes |
|---|---|---|
| Embedded/IoT devices | 4-8 KB | Minimal memory usage |
| Memory-constrained | 16-32 KB | Conservative default |
| Standard use (default) | 64 KB | Recommended default |
| Large file streaming | 256 KB | Better throughput |
| High-speed SSD | 512 KB - 1 MB | Maximum throughput |

How Buffer Size Affects Performance

// SharpCompress manages buffers internally
// You can't directly set buffer size, but you can:

// 1. Use Stream.CopyTo with explicit buffer size
using (var entryStream = reader.OpenEntryStream())
using (var fileStream = File.Create(@"C:\output\file.txt"))
{
    // 64 KB buffer (default)
    entryStream.CopyTo(fileStream);
    
    // Or specify larger buffer for faster copy
    entryStream.CopyTo(fileStream, bufferSize: 262144);  // 256 KB
}

// 2. Use custom buffer for writing
using (var entryStream = reader.OpenEntryStream())
using (var fileStream = File.Create(@"C:\output\file.txt"))
{
    byte[] buffer = new byte[262144];  // 256 KB
    int bytesRead;
    while ((bytesRead = entryStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        fileStream.Write(buffer, 0, bytesRead);
    }
}

Streaming Large Files

Non-Seekable Stream Patterns

For processing archives from downloads or pipes:

// Download stream (non-seekable)
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var reader = ReaderFactory.OpenReader(httpStream))
{
    // Process entries as they arrive
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory)
        {
            reader.WriteEntryToDirectory(@"C:\output");
        }
    }
}

Performance Tips:

  • Don't try to buffer the entire stream
  • Process entries immediately
  • Use async APIs for better responsiveness (see the async copy sketch below)
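
A hedged sketch of the async variant: each entry is copied to disk with standard CopyToAsync while the download is still streaming in, so no SharpCompress-specific async methods are assumed.

// Sketch: async copy of each entry while the download stream arrives.
// Only standard .NET async stream APIs are used here.
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var reader = ReaderFactory.OpenReader(httpStream))
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
        {
            continue;
        }

        var targetPath = Path.Combine(@"C:\output", reader.Entry.Key);
        Directory.CreateDirectory(Path.GetDirectoryName(targetPath));

        using (var entryStream = reader.OpenEntryStream())
        using (var fileStream = File.Create(targetPath))
        {
            await entryStream.CopyToAsync(fileStream);
        }
    }
}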

Download-Then-Extract vs Streaming

Choose based on your constraints:

| Approach | When to Use |
|---|---|
| Download then extract | Moderate size, need random access |
| Stream during download | Large files, bandwidth limited, memory constrained |

// Download then extract (requires disk space)
var archivePath = await DownloadFile(url, @"C:\temp\archive.zip");
using (var archive = ZipArchive.OpenArchive(archivePath))
{
    archive.WriteToDirectory(@"C:\output");
}

// Stream during download (on-the-fly extraction)
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var reader = ReaderFactory.OpenReader(httpStream))
{
    while (reader.MoveToNextEntry())
    {
        reader.WriteEntryToDirectory(@"C:\output");
    }
}

Solid Archive Optimization

Why Solid Archives Are Slow

Solid archives (Rar, 7Zip) group files together in a single compressed stream:

Solid Archive Layout:
[Header] [Compressed Stream] [Footer]
         ├─ File1 compressed data
         ├─ File2 compressed data
         ├─ File3 compressed data
         └─ File4 compressed data

Extracting File3 requires decompressing File1 and File2 first.

Sequential vs Random Extraction

Random Extraction (Slow):

using (var archive = RarArchive.OpenArchive("solid.rar"))
{
    foreach (var entry in archive.Entries)
    {
        entry.WriteToFile(@"C:\output\" + entry.Key);  // ✗ Slow!
        // Each entry triggers full decompression from start
    }
}

Sequential Extraction (Fast):

using (var archive = RarArchive.OpenArchive("solid.rar"))
{
    // Method 1: Use WriteToDirectory (recommended)
    archive.WriteToDirectory(@"C:\output", new ExtractionOptions
    {
        ExtractFullPath = true,
        Overwrite = true
    });
    
    // Method 2: Use ExtractAllEntries, which returns a forward-only reader
    using (var solidReader = archive.ExtractAllEntries())
    {
        while (solidReader.MoveToNextEntry())
        {
            solidReader.WriteEntryToDirectory(@"C:\output");
        }
    }
    
    // Method 3: Use Reader API (also sequential)
    using (var reader = RarReader.Open(File.OpenRead("solid.rar")))
    {
        while (reader.MoveToNextEntry())
        {
            reader.WriteEntryToDirectory(@"C:\output");
        }
    }
}

Performance Impact:

  • Random extraction: O(n²) - each entry re-decompresses everything stored before it, so roughly n(n+1)/2 entries' worth of data is decompressed for n entries
  • Sequential extraction: O(n) - each entry is decompressed exactly once, typically 10-100x faster
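
To confirm the gap on your own data, a rough timing sketch like this compares both approaches on the same solid archive (it extracts the archive twice, so use test data):

// Sketch: compare random vs sequential extraction time on a solid archive.
var sw = Stopwatch.StartNew();
using (var archive = RarArchive.OpenArchive("solid.rar"))
{
    foreach (var entry in archive.Entries.Where(e => !e.IsDirectory))
    {
        entry.WriteToFile(@"C:\output-random\" + entry.Key);  // Per-entry random access
    }
}
Console.WriteLine($"Random:     {sw.ElapsedMilliseconds} ms");

sw.Restart();
using (var archive = RarArchive.OpenArchive("solid.rar"))
{
    // Single sequential pass over the solid stream
    archive.WriteToDirectory(@"C:\output-sequential", new ExtractionOptions
    {
        ExtractFullPath = true,
        Overwrite = true
    });
}
Console.WriteLine($"Sequential: {sw.ElapsedMilliseconds} ms");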

Best Practices for Solid Archives

  1. Always extract sequentially when possible
  2. Use Reader API for large solid archives
  3. Process entries in order from the archive
  4. Consider using 7Zip command-line for scripted extractions

Compression Level Trade-offs

Deflate/GZip Levels

// Level 1 = Fastest, largest size
// Level 6 = Default (balanced)
// Level 9 = Slowest, best compression

// Write with different compression levels
using (var archive = ZipArchive.CreateArchive())
{
    archive.AddAllFromDirectory(@"D:\data");
    
    // Fast compression (level 1)
    archive.SaveTo("fast.zip", new WriterOptions(CompressionType.Deflate)
    {
        CompressionLevel = 1
    });
    
    // Default compression (level 6)
    archive.SaveTo("default.zip", CompressionType.Deflate);
    
    // Best compression (level 9)
    archive.SaveTo("best.zip", new WriterOptions(CompressionType.Deflate)
    {
        CompressionLevel = 9
    });
}

Speed vs Size:

| Level | Relative speed | Compressed size (% of original) | Use case |
|---|---|---|---|
| 1 | 10x | 90% | Network, streaming |
| 6 | 1x | 75% | Default (good balance) |
| 9 | 0.1x | 65% | Archival, static storage |

BZip2 Block Size

// BZip2 block size affects memory and compression
// 100K to 900K (default 900K)

// Smaller block size = lower memory, faster
// Larger block size = better compression, slower

using (var archive = TarArchive.CreateArchive())
{
    archive.AddAllFromDirectory(@"D:\data");
    
    // Block size is preset via the writer's CompressionLevel rather than set directly
    archive.SaveTo("archive.tar.bz2", CompressionType.BZip2);
}

LZMA Settings

LZMA compression is very powerful but memory-intensive:

// LZMA (7Zip, .tar.lzma):
// - Dictionary size: 16 KB to 1 GB (default 32 MB)
// - Faster preset: smaller dictionary
// - Better compression: larger dictionary

// Preset via CompressionType
using (var archive = TarArchive.CreateArchive())
{
    archive.AddAllFromDirectory(@"D:\data");
    archive.SaveTo("archive.tar.xz", CompressionType.LZMA);  // Default settings
}

Async Performance

When Async Helps

Async is beneficial when:

  • Long I/O operations (network, slow disks)
  • UI responsiveness needed (Windows Forms, WPF, Blazor)
  • Server applications (ASP.NET, multiple concurrent operations)

// Async extraction (non-blocking)
using (var archive = ZipArchive.OpenArchive("archive.zip"))
{
    await archive.WriteToDirectoryAsync(
        @"C:\output",
        new ExtractionOptions { ExtractFullPath = true, Overwrite = true },
        cancellationToken
    );
}
// Thread can handle other work while I/O happens

When Async Doesn't Help

Async doesn't improve performance for:

  • CPU-bound work such as decompression (async frees the thread during I/O waits, not during CPU work)
  • Fast local SSD I/O (waits are too short for the async overhead to pay off)
  • Simple console/batch scenarios where the thread has nothing else to do

// Sync extraction (simpler, same performance on fast I/O)
using (var archive = ZipArchive.OpenArchive("archive.zip"))
{
    archive.WriteToDirectory(
        @"C:\output",
        new ExtractionOptions { ExtractFullPath = true, Overwrite = true }
    );
}
// Simple and fast - no async needed

Cancellation Pattern

var cts = new CancellationTokenSource();

// Cancel after 5 minutes
cts.CancelAfter(TimeSpan.FromMinutes(5));

try
{
    using (var archive = ZipArchive.OpenArchive("archive.zip"))
    {
        await archive.WriteToDirectoryAsync(
            @"C:\output",
            new ExtractionOptions { ExtractFullPath = true, Overwrite = true },
            cts.Token
        );
    }
}
catch (OperationCanceledException)
{
    Console.WriteLine("Extraction cancelled");
    // Clean up partial extraction if needed
}
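
The Reader API is shown synchronously in this guide, but the same token can still be honored cooperatively by checking it between entries. A sketch:

// Sketch: cooperative cancellation with the Reader API.
// The token is checked between entries; an in-progress entry still completes.
using (var stream = File.OpenRead("archive.zip"))
using (var reader = ReaderFactory.OpenReader(stream))
{
    while (reader.MoveToNextEntry())
    {
        cts.Token.ThrowIfCancellationRequested();
        reader.WriteEntryToDirectory(@"C:\output");
    }
}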

Practical Performance Tips

1. Choose the Right API

| Scenario | API | Why |
|---|---|---|
| Small archives | Archive | Faster random access |
| Large archives | Reader | Lower memory |
| Streaming | Reader | Works on non-seekable streams |
| Download streams | Reader | Async extraction while downloading |

2. Batch Operations

// ✗ Slow - reopens the archive once per entry
foreach (var entryKey in wantedEntries)
{
    using (var archive = ZipArchive.OpenArchive("archive.zip"))
    {
        var entry = archive.Entries.First(e => e.Key == entryKey);
        entry.WriteToFile(@"C:\output\" + entryKey);
    }
}

// ✓ Better - open the archive once and extract everything in a single pass
using (var archive = ZipArchive.OpenArchive("archive.zip"))
{
    archive.WriteToDirectory(@"C:\output");
}

3. Profile Your Code

var sw = Stopwatch.StartNew();
using (var archive = ZipArchive.OpenArchive("large.zip"))
{
    archive.WriteToDirectory(@"C:\output");
}
sw.Stop();

Console.WriteLine($"Extraction took {sw.ElapsedMilliseconds}ms");

// Measure memory before/after
var beforeMem = GC.GetTotalMemory(true);
// ... do work ...
var afterMem = GC.GetTotalMemory(true);
Console.WriteLine($"Memory used: {(afterMem - beforeMem) / 1024 / 1024}MB");

Troubleshooting Performance

Extraction is Slow

  1. Check if solid archive → Use sequential extraction
  2. Check API → Reader API might be faster for large files
  3. Check compression level → Higher levels are slower to decompress
  4. Check I/O → Network drives are much slower than SSD
  5. Check buffer size → Larger read buffers can help on network streams (see the sketch below)
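
One low-risk way to try a larger read buffer on a slow network source is to wrap the input in a BufferedStream before handing it to the reader; the sketch below uses only standard .NET types.

// Sketch: give a slow network stream a larger read buffer before decompression.
// 256 KB is an arbitrary starting point to tune, not a recommendation.
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var buffered = new BufferedStream(httpStream, 262144))  // 256 KB
using (var reader = ReaderFactory.OpenReader(buffered))
{
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory)
        {
            reader.WriteEntryToDirectory(@"C:\output");
        }
    }
}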

High Memory Usage

  1. Use Reader API instead of Archive API
  2. Process entries immediately rather than buffering
  3. Reduce compression level if writing
  4. Check for memory leaks in your code

CPU Usage at 100%

  1. Normal for compression - especially with high compression levels
  2. Consider lower level for faster processing
  3. Reduce parallelism if processing multiple archives
  4. Check if awaiting properly in async code