SharpCompress Performance Guide
This guide helps you optimize SharpCompress for performance in various scenarios.
API Selection Guide
Archive API vs Reader API
Choose the right API based on your use case:
| Aspect | Archive API | Reader API |
|---|---|---|
| Stream Type | Seekable only | Non-seekable OK |
| Memory Usage | Entry index held in memory | One entry at a time |
| Random Access | ✓ Yes | ✗ No |
| Best For | Small-to-medium archives | Large or streaming data |
| Performance | Fast for random access | Better for large files |
Archive API (Fast for Random Access)
// Use when:
// - Archive fits in memory
// - You need random access to entries
// - Stream is seekable (file, MemoryStream)
using (var archive = ZipArchive.Open("archive.zip"))
{
// Random access - all entries available
var specific = archive.Entries.FirstOrDefault(e => e.Key == "file.txt");
if (specific != null)
{
specific.WriteToFile(@"C:\output\file.txt");
}
}
Performance Characteristics:
- ✓ Instant entry lookup
- ✓ Parallel extraction possible (sketched below)
- ✗ Entry index for the whole archive is held in memory
- ✗ Can't process while downloading
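When entries are independent, extraction can also be parallelized. A minimal sketch, assuming a flat zip on local disk; because a single archive instance shares one underlying stream, each worker opens its own ZipArchive rather than sharing one instance across threads:
// Collect the entry names first, then fan out
List<string> keys;
using (var archive = ZipArchive.Open("archive.zip"))
{
    keys = archive.Entries
        .Where(e => !e.IsDirectory)
        .Select(e => e.Key)
        .ToList();
}
Parallel.ForEach(keys, key =>
{
    // One archive instance per worker avoids sharing a stream across threads
    using (var archive = ZipArchive.Open("archive.zip"))
    {
        var entry = archive.Entries.First(e => e.Key == key);
        entry.WriteToFile(Path.Combine(@"C:\output", key));
    }
});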
Reader API (Best for Large Files)
// Use when:
// - Processing large archives (>100 MB)
// - Streaming from network/pipe
// - Memory is constrained
// - Forward-only processing is acceptable
using (var stream = File.OpenRead("large.zip"))
using (var reader = ReaderFactory.Open(stream))
{
while (reader.MoveToNextEntry())
{
// Process one entry at a time
reader.WriteEntryToDirectory(@"C:\output");
}
}
Performance Characteristics:
- ✓ Minimal memory footprint
- ✓ Works with non-seekable streams
- ✓ Can process while downloading
- ✗ Forward-only (no random access)
- ✗ Entry lookup requires iteration (see the sketch below)
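When you do need a single entry from a stream, the lookup is a forward scan. A minimal sketch (file and entry names are placeholders):
using (var stream = File.OpenRead("large.zip"))
using (var reader = ReaderFactory.Open(stream))
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.Key == "file.txt")
        {
            reader.WriteEntryToDirectory(@"C:\output");
            break; // stop scanning once the entry is found
        }
    }
}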
Buffer Sizing
Understanding Buffers
SharpCompress uses internal buffers for reading compressed data. Buffer size affects:
- Speed: Larger buffers = fewer I/O operations = faster
- Memory: Larger buffers = higher memory usage
Recommended Buffer Sizes
| Scenario | Size | Notes |
|---|---|---|
| Embedded/IoT devices | 4-8 KB | Minimal memory usage |
| Memory-constrained | 16-32 KB | Conservative default |
| Standard use (default) | 64 KB | Recommended default |
| Large file streaming | 256 KB | Better throughput |
| High-speed SSD | 512 KB - 1 MB | Maximum throughput |
How Buffer Size Affects Performance
// SharpCompress manages buffers internally
// You can't directly set buffer size, but you can:
// 1. Use Stream.CopyTo with explicit buffer size
using (var entryStream = reader.OpenEntryStream())
using (var fileStream = File.Create(@"C:\output\file.txt"))
{
// Default CopyTo buffer is 81,920 bytes (80 KB)
entryStream.CopyTo(fileStream);
// Or specify larger buffer for faster copy
entryStream.CopyTo(fileStream, bufferSize: 262144); // 256 KB
}
// 2. Use custom buffer for writing
using (var entryStream = reader.OpenEntryStream())
using (var fileStream = File.Create(@"C:\output\file.txt"))
{
byte[] buffer = new byte[262144]; // 256 KB
int bytesRead;
while ((bytesRead = entryStream.Read(buffer, 0, buffer.Length)) > 0)
{
fileStream.Write(buffer, 0, bytesRead);
}
}
Streaming Large Files
Non-Seekable Stream Patterns
For processing archives from downloads or pipes:
// Download stream (non-seekable)
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var reader = ReaderFactory.Open(httpStream))
{
// Process entries as they arrive
while (reader.MoveToNextEntry())
{
if (!reader.Entry.IsDirectory)
{
reader.WriteEntryToDirectory(@"C:\output");
}
}
}
Performance Tips:
- Don't try to buffer the entire stream
- Process entries immediately
- Use async APIs for better responsiveness (see the sketch below)
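A minimal async sketch of this pattern, assuming a flat archive (entries in subdirectories would need their directories created first):
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var reader = ReaderFactory.Open(httpStream))
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory) continue;
        var outPath = Path.Combine(@"C:\output", reader.Entry.Key);
        using (var entryStream = reader.OpenEntryStream())
        using (var fileStream = File.Create(outPath))
        {
            // Async copy frees the thread during network waits
            await entryStream.CopyToAsync(fileStream);
        }
    }
}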
Download-Then-Extract vs Streaming
Choose based on your constraints:
| Approach | When to Use |
|---|---|
| Download then extract | Moderate size, need random access |
| Stream during download | Large files, bandwidth limited, memory constrained |
// Download then extract (requires disk space)
var archivePath = await DownloadFile(url, @"C:\temp\archive.zip");
using (var archive = ZipArchive.Open(archivePath))
{
archive.WriteToDirectory(@"C:\output");
}
// Stream during download (on-the-fly extraction)
using (var httpStream = await httpClient.GetStreamAsync(url))
using (var reader = ReaderFactory.Open(httpStream))
{
while (reader.MoveToNextEntry())
{
reader.WriteEntryToDirectory(@"C:\output");
}
}
Solid Archive Optimization
Why Solid Archives Are Slow
Solid archives (Rar, 7Zip) group files together in a single compressed stream:
Solid Archive Layout:
[Header] [Compressed Stream] [Footer]
├─ File1 compressed data
├─ File2 compressed data
├─ File3 compressed data
└─ File4 compressed data
Extracting File3 requires decompressing File1 and File2 first.
Sequential vs Random Extraction
Random Extraction (Slow):
using (var archive = RarArchive.Open("solid.rar"))
{
foreach (var entry in archive.Entries)
{
entry.WriteToFile(@"C:\output\" + entry.Key); // ✗ Slow!
// Each entry triggers full decompression from start
}
}
Sequential Extraction (Fast):
using (var archive = RarArchive.Open("solid.rar"))
{
// Method 1: Use WriteToDirectory (recommended)
archive.WriteToDirectory(@"C:\output", new ExtractionOptions
{
ExtractFullPath = true,
Overwrite = true
});
// Method 2: Use ExtractAllEntries
archive.ExtractAllEntries();
// Method 3: Use Reader API (also sequential)
using (var reader = RarReader.Open(File.OpenRead("solid.rar")))
{
while (reader.MoveToNextEntry())
{
reader.WriteEntryToDirectory(@"C:\output");
}
}
}
Performance Impact:
- Random extraction: O(n²) - very slow for many files
- Sequential extraction: O(n) - 10-100x faster
Best Practices for Solid Archives
- Always extract sequentially when possible
- Use Reader API for large solid archives
- Process entries in order from the archive
- Consider using 7Zip command-line for scripted extractions
Compression Level Trade-offs
Deflate/GZip Levels
// Level 1 = Fastest, largest size
// Level 6 = Default (balanced)
// Level 9 = Slowest, best compression
// Write with different compression levels
using (var archive = ZipArchive.Create())
{
archive.AddAllFromDirectory(@"D:\data");
// Fast compression (level 1)
archive.SaveTo("fast.zip", new WriterOptions(CompressionType.Deflate)
{
CompressionLevel = 1
});
// Default compression (level 6)
archive.SaveTo("default.zip", CompressionType.Deflate);
// Best compression (level 9)
archive.SaveTo("best.zip", new WriterOptions(CompressionType.Deflate)
{
CompressionLevel = 9
});
}
Speed vs Size:
| Level | Relative Speed | Size (% of original) | Use Case |
|---|---|---|---|
| 1 | 10x | 90% | Network, streaming |
| 6 | 1x | 75% | Default (good balance) |
| 9 | 0.1x | 65% | Archival, static storage |
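These ratios depend heavily on the data, so measure on your own files before committing to a level. A sketch using the same WriterOptions pattern as above (paths are placeholders):
foreach (var level in new[] { 1, 6, 9 })
{
    var sw = Stopwatch.StartNew();
    using (var archive = ZipArchive.Create())
    {
        archive.AddAllFromDirectory(@"D:\data");
        archive.SaveTo($"level{level}.zip", new WriterOptions(CompressionType.Deflate)
        {
            CompressionLevel = level
        });
    }
    sw.Stop();
    var kb = new FileInfo($"level{level}.zip").Length / 1024;
    Console.WriteLine($"Level {level}: {sw.ElapsedMilliseconds} ms, {kb} KB");
}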
BZip2 Block Size
// BZip2 block size affects memory and compression
// 100K to 900K (default 900K)
// Smaller block size = lower memory, faster
// Larger block size = better compression, slower
using (var archive = TarArchive.Create())
{
archive.AddAllFromDirectory(@"D:\data");
// SharpCompress writes BZip2 with its default block size
archive.SaveTo("archive.tar.bz2", CompressionType.BZip2);
}
LZMA Settings
LZMA compression is very powerful but memory-intensive:
// LZMA (7Zip, .tar.lzma, .tar.xz):
// - Dictionary size: 16 KB to 1 GB (default 32 MB)
// - Smaller dictionary = faster, lower memory
// - Larger dictionary = better compression, slower
// SharpCompress reads these formats but does not write 7Zip or XZ,
// so plan LZMA-heavy pipelines around decompression:
using (var stream = File.OpenRead("archive.tar.xz"))
using (var reader = ReaderFactory.Open(stream))
{
    while (reader.MoveToNextEntry())
    {
        reader.WriteEntryToDirectory(@"C:\output");
    }
}
Async Performance
When Async Helps
Async is beneficial when:
- Long I/O operations (network, slow disks)
- UI responsiveness needed (Windows Forms, WPF, Blazor)
- Server applications (ASP.NET, multiple concurrent operations)
// Async extraction (non-blocking)
using (var archive = ZipArchive.Open("archive.zip"))
{
await archive.WriteToDirectoryAsync(
@"C:\output",
new ExtractionOptions { ExtractFullPath = true, Overwrite = true },
cancellationToken
);
}
// Thread can handle other work while I/O happens
When Async Doesn't Help
Async doesn't improve performance for:
- CPU-bound work such as decompression (a thread stays busy computing either way)
- Fast local SSD I/O (waits are too short to matter)
- Single-threaded batch jobs (nothing else to overlap with the I/O)
// Sync extraction (simpler, same performance on fast I/O)
using (var archive = ZipArchive.Open("archive.zip"))
{
archive.WriteToDirectory(
@"C:\output",
new ExtractionOptions { ExtractFullPath = true, Overwrite = true }
);
}
// Simple and fast - no async needed
Cancellation Pattern
var cts = new CancellationTokenSource();
// Cancel after 5 minutes
cts.CancelAfter(TimeSpan.FromMinutes(5));
try
{
using (var archive = ZipArchive.Open("archive.zip"))
{
await archive.WriteToDirectoryAsync(
@"C:\output",
new ExtractionOptions { ExtractFullPath = true, Overwrite = true },
cts.Token
);
}
}
catch (OperationCanceledException)
{
Console.WriteLine("Extraction cancelled");
// Clean up partial extraction if needed
}
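The cleanup can be as simple as discarding whatever was partially written, assuming the output directory is dedicated to this one extraction:
// Inside the catch block above:
if (Directory.Exists(@"C:\output"))
{
    Directory.Delete(@"C:\output", recursive: true); // discard partial output
}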
Memory Efficiency
Reducing Allocations
// ✗ Wrong - creates new options object each iteration
foreach (var archiveFile in archiveFiles)
{
using (var archive = ZipArchive.Open(archiveFile))
{
archive.WriteToDirectory(outputDir, new ExtractionOptions
{
ExtractFullPath = true,
Overwrite = true
});
}
}
// ✓ Better - reuse options object
var options = new ExtractionOptions
{
ExtractFullPath = true,
Overwrite = true
};
foreach (var archiveFile in archiveFiles)
{
using (var archive = ZipArchive.Open(archiveFile))
{
archive.WriteToDirectory(outputDir, options);
}
}
Object Pooling for Repeated Operations
// For very high-throughput scenarios, consider pooling
public class ArchiveExtractionPool
{
    private readonly ArrayPool<byte> _bufferPool = ArrayPool<byte>.Shared;

    public void ExtractMany(IEnumerable<string> archiveFiles, string outputDir)
    {
        // Rent one buffer up front and reuse it for every copy
        var buffer = _bufferPool.Rent(262144); // 256 KB
        try
        {
            foreach (var archiveFile in archiveFiles)
            {
                using (var stream = File.OpenRead(archiveFile))
                using (var reader = ReaderFactory.Open(stream))
                {
                    while (reader.MoveToNextEntry())
                    {
                        if (reader.Entry.IsDirectory) continue;
                        var outPath = Path.Combine(outputDir, reader.Entry.Key);
                        var dir = Path.GetDirectoryName(outPath);
                        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);
                        using (var entryStream = reader.OpenEntryStream())
                        using (var fileStream = File.Create(outPath))
                        {
                            int read;
                            while ((read = entryStream.Read(buffer, 0, buffer.Length)) > 0)
                            {
                                fileStream.Write(buffer, 0, read);
                            }
                        }
                    }
                }
            }
        }
        finally
        {
            _bufferPool.Return(buffer);
        }
    }
}
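Usage is then one call per batch (paths are placeholders):
var pool = new ArchiveExtractionPool();
pool.ExtractMany(Directory.EnumerateFiles(@"C:\archives", "*.zip"), @"C:\output");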
Practical Performance Tips
1. Choose the Right API
| Scenario | API | Why |
|---|---|---|
| Small archives | Archive | Faster random access |
| Large archives | Reader | Lower memory |
| Streaming | Reader | Works on non-seekable streams |
| Download streams | Reader | Extraction can overlap the download |
2. Batch Operations
// ✗ Slow - reopens the same archive once per entry
foreach (var entryName in entryNames)
{
    using (var archive = ZipArchive.Open("archive.zip"))
    {
        var entry = archive.Entries.First(e => e.Key == entryName);
        entry.WriteToFile(Path.Combine(@"C:\output", entryName));
    }
}
// ✓ Better - open once and extract everything in one pass
using (var archive = ZipArchive.Open("archive.zip"))
{
    archive.WriteToDirectory(@"C:\output");
}
3. Use Appropriate Compression
// For distribution/storage: Best compression
archive.SaveTo("archive.zip", new WriterOptions(CompressionType.Deflate)
{
CompressionLevel = 9
});
// For daily backups: Balanced compression
archive.SaveTo("backup.zip", CompressionType.Deflate); // Default level 6
// For temporary/streaming: Fast compression
archive.SaveTo("temp.zip", new WriterOptions(CompressionType.Deflate)
{
CompressionLevel = 1
});
4. Profile Your Code
var sw = Stopwatch.StartNew();
using (var archive = ZipArchive.Open("large.zip"))
{
archive.WriteToDirectory(@"C:\output");
}
sw.Stop();
Console.WriteLine($"Extraction took {sw.ElapsedMilliseconds}ms");
// Measure memory before/after
var beforeMem = GC.GetTotalMemory(true);
// ... do work ...
var afterMem = GC.GetTotalMemory(true);
Console.WriteLine($"Memory used: {(afterMem - beforeMem) / 1024 / 1024}MB");
Troubleshooting Performance
Extraction is Slow
- Check if solid archive → Use sequential extraction
- Check API → Reader API might be faster for large files
- Check compression level → Higher levels are slower to decompress
- Check I/O → Network drives are much slower than SSD
- Check buffer size → May need larger buffers for network
High Memory Usage
- Use Reader API instead of Archive API
- Process entries immediately rather than buffering
- Reduce compression level if writing
- Check for memory leaks in your code
CPU Usage at 100%
- Normal for compression - especially with high compression levels
- Consider lower level for faster processing
- Reduce parallelism if processing multiple archives (see the sketch below)
- Check if awaiting properly in async code
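A minimal sketch of capping parallelism when extracting many archives; the degree of 2 is an arbitrary starting point to tune against your core count and disk:
var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(archiveFiles, parallelOptions, file =>
{
    // One output subdirectory per archive keeps results separate
    var outDir = Path.Combine(@"C:\output", Path.GetFileNameWithoutExtension(file));
    Directory.CreateDirectory(outDir);
    using (var archive = ZipArchive.Open(file))
    {
        archive.WriteToDirectory(outDir);
    }
});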
Related Documentation
- USAGE.md - Usage examples with performance considerations
- FORMATS.md - Format-specific performance notes
- TROUBLESHOOTING.md - Solving common issues