diff --git a/AGENTS.md b/AGENTS.md index ae36265a..659dfdd6 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -110,7 +110,7 @@ SharpCompress supports multiple archive and compression formats: - **Archive Formats**: Zip, Tar, 7Zip, Rar (read-only) - **Compression**: DEFLATE, BZip2, LZMA/LZMA2, PPMd, ZStandard (decompress only), Deflate64 (decompress only) - **Combined Formats**: Tar.GZip, Tar.BZip2, Tar.LZip, Tar.XZ, Tar.ZStandard -- See FORMATS.md for complete format support matrix +- See [docs/FORMATS.md](docs/FORMATS.md) for complete format support matrix ### Stream Handling Rules - **Disposal**: As of version 0.21, SharpCompress closes wrapped streams by default diff --git a/README.md b/README.md index 67cba2cd..ecaa15e4 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ SharpCompress is a compression library in pure C# for .NET Framework 4.8, .NET 8 The major feature is support for non-seekable streams so large files can be processed on the fly (i.e. download stream). -**NEW:** All I/O operations now support async/await for improved performance and scalability. See the [USAGE.md](USAGE.md#async-examples) for examples. +**NEW:** All I/O operations now support async/await for improved performance and scalability. See the [USAGE.md](docs/USAGE.md#async-examples) for examples. GitHub Actions Build - [![SharpCompress](https://github.com/adamhathcock/sharpcompress/actions/workflows/dotnetcore.yml/badge.svg)](https://github.com/adamhathcock/sharpcompress/actions/workflows/dotnetcore.yml) @@ -14,7 +14,7 @@ GitHub Actions Build - Post Issues on Github! -Check the [Supported Formats](FORMATS.md) and [Basic Usage.](USAGE.md) +Check the [Supported Formats](docs/FORMATS.md) and [Basic Usage.](docs/USAGE.md) ## Recommended Formats diff --git a/docs/API.md b/docs/API.md new file mode 100644 index 00000000..d22f723b --- /dev/null +++ b/docs/API.md @@ -0,0 +1,490 @@ +# API Quick Reference + +Quick reference for commonly used SharpCompress APIs. 
+ +## Factory Methods + +### Opening Archives + +```csharp +// Auto-detect format +using (var reader = ReaderFactory.Open(stream)) +{ + // Works with Zip, Tar, GZip, Rar, 7Zip, etc. +} + +// Specific format - Archive API +using (var archive = ZipArchive.Open("file.zip")) +using (var archive = TarArchive.Open("file.tar")) +using (var archive = RarArchive.Open("file.rar")) +using (var archive = SevenZipArchive.Open("file.7z")) +using (var archive = GZipArchive.Open("file.gz")) + +// With options +var options = new ReaderOptions +{ + Password = "password", + LeaveStreamOpen = true, + ArchiveEncoding = new ArchiveEncoding { Default = Encoding.GetEncoding(932) } +}; +using (var archive = ZipArchive.Open("encrypted.zip", options)) +``` + +### Creating Archives + +```csharp +// Writer Factory +using (var writer = WriterFactory.Open(stream, ArchiveType.Zip, CompressionType.Deflate)) +{ + // Write entries +} + +// Specific writer +using (var archive = ZipArchive.Create()) +using (var archive = TarArchive.Create()) +using (var archive = GZipArchive.Create()) + +// With options +var options = new WriterOptions(CompressionType.Deflate) +{ + CompressionLevel = 9, + LeaveStreamOpen = false +}; +using (var archive = ZipArchive.Create()) +{ + archive.SaveTo("output.zip", options); +} +``` + +--- + +## Archive API Methods + +### Reading/Extracting + +```csharp +using (var archive = ZipArchive.Open("file.zip")) +{ + // Get all entries + IEnumerable entries = archive.Entries; + + // Find specific entry + var entry = archive.Entries.FirstOrDefault(e => e.Key == "file.txt"); + + // Extract all + archive.WriteToDirectory(@"C:\output", new ExtractionOptions + { + ExtractFullPath = true, + Overwrite = true + }); + + // Extract single entry (reuses the variable declared above) + entry = archive.Entries.First(); + entry.WriteToFile(@"C:\output\file.txt"); + entry.WriteToFile(@"C:\output\file.txt", new ExtractionOptions { Overwrite = true }); + + // Get entry stream + using (var stream = entry.OpenEntryStream()) + { + 
stream.CopyTo(outputStream); + } +} + +// Async variants +await archive.WriteToDirectoryAsync(@"C:\output", options, cancellationToken); +using (var stream = await entry.OpenEntryStreamAsync(cancellationToken)) +{ + // ... +} +``` + +### Entry Properties + +```csharp +foreach (var entry in archive.Entries) +{ + string name = entry.Key; // Entry name/path + long size = entry.Size; // Uncompressed size + long compressedSize = entry.CompressedSize; + bool isDir = entry.IsDirectory; + DateTime? modTime = entry.LastModifiedTime; + CompressionType compression = entry.CompressionType; +} +``` + +### Creating Archives + +```csharp +using (var archive = ZipArchive.Create()) +{ + // Add file + archive.AddEntry("file.txt", "C:\\source\\file.txt"); + + // Add multiple files + archive.AddAllFromDirectory("C:\\source"); + archive.AddAllFromDirectory("C:\\source", "*.txt"); // Pattern + + // Save to file + archive.SaveTo("output.zip", CompressionType.Deflate); + + // Save to stream + archive.SaveTo(outputStream, new WriterOptions(CompressionType.Deflate) + { + CompressionLevel = 9, + LeaveStreamOpen = true + }); +} +``` + +--- + +## Reader API Methods + +### Forward-Only Reading + +```csharp +using (var stream = File.OpenRead("file.zip")) +using (var reader = ReaderFactory.Open(stream)) +{ + while (reader.MoveToNextEntry()) + { + IEntry entry = reader.Entry; + + if (!entry.IsDirectory) + { + // Extract entry + reader.WriteEntryToDirectory(@"C:\output"); + reader.WriteEntryToFile(@"C:\output\file.txt"); + + // Or get stream + using (var entryStream = reader.OpenEntryStream()) + { + entryStream.CopyTo(outputStream); + } + } + } +} + +// Async variants +while (await reader.MoveToNextEntryAsync()) +{ + await reader.WriteEntryToFileAsync(@"C:\output\" + reader.Entry.Key, cancellationToken); +} + +// Async extraction +await reader.WriteAllToDirectoryAsync(@"C:\output", + new ExtractionOptions { ExtractFullPath = true, Overwrite = true }, + cancellationToken); +``` + +--- + +## Writer 
API Methods + +### Creating Archives (Streaming) + +```csharp +using (var stream = File.Create("output.zip")) +using (var writer = WriterFactory.Open(stream, ArchiveType.Zip, CompressionType.Deflate)) +{ + // Write single file + using (var fileStream = File.OpenRead("source.txt")) + { + writer.Write("entry.txt", fileStream, DateTime.Now); + } + + // Write directory + writer.WriteAll("C:\\source", "*", SearchOption.AllDirectories); + writer.WriteAll("C:\\source", "*.txt", SearchOption.TopDirectoryOnly); + + // Async variants + using (var fileStream = File.OpenRead("source.txt")) + { + await writer.WriteAsync("entry.txt", fileStream, DateTime.Now, cancellationToken); + } + + await writer.WriteAllAsync("C:\\source", "*", SearchOption.AllDirectories, cancellationToken); +} +``` + +--- + +## Common Options + +### ReaderOptions + +```csharp +var options = new ReaderOptions +{ + Password = "password", // For encrypted archives + LeaveStreamOpen = true, // Don't close wrapped stream + ArchiveEncoding = new ArchiveEncoding // Custom character encoding + { + Default = Encoding.GetEncoding(932) + } +}; +using (var archive = ZipArchive.Open("file.zip", options)) +{ + // ... 
+} +``` + +### WriterOptions + +```csharp +var options = new WriterOptions(CompressionType.Deflate) +{ + CompressionLevel = 9, // 0-9 for Deflate + LeaveStreamOpen = true, // Don't close stream +}; +archive.SaveTo("output.zip", options); +``` + +### ExtractionOptions + +```csharp +var options = new ExtractionOptions +{ + ExtractFullPath = true, // Recreate directory structure + Overwrite = true, // Overwrite existing files + PreserveFileTime = true // Keep original timestamps +}; +archive.WriteToDirectory(@"C:\output", options); +``` + +--- + +## Compression Types + +### Available Compressions + +```csharp +// For creating archives +CompressionType.None // No compression (store) +CompressionType.Deflate // DEFLATE (default for ZIP/GZip) +CompressionType.BZip2 // BZip2 +CompressionType.LZMA // LZMA (for 7Zip, LZip, XZ) +CompressionType.PPMd // PPMd (for ZIP) +CompressionType.Rar // RAR compression (read-only) + +// For Tar archives +// Use CompressionType in TarWriter constructor +using (var writer = new TarWriter(stream, CompressionType.GZip)) // Tar.GZip +using (var writer = new TarWriter(stream, CompressionType.BZip2)) // Tar.BZip2 +``` + +### Archive Types + +```csharp +ArchiveType.Zip +ArchiveType.Tar +ArchiveType.GZip +ArchiveType.BZip2 +ArchiveType.Rar +ArchiveType.SevenZip +ArchiveType.XZ +ArchiveType.ZStandard +``` + +--- + +## Patterns & Examples + +### Extract with Error Handling + +```csharp +try +{ + using (var archive = ZipArchive.Open("archive.zip", + new ReaderOptions { Password = "password" })) + { + archive.WriteToDirectory(@"C:\output", new ExtractionOptions + { + ExtractFullPath = true, + Overwrite = true + }); + } +} +catch (PasswordRequiredException) +{ + Console.WriteLine("Password required"); +} +catch (InvalidArchiveException) +{ + Console.WriteLine("Archive is invalid"); +} +catch (SharpCompressException ex) +{ + Console.WriteLine($"Error: {ex.Message}"); +} +``` + +### Extract with Progress + +```csharp +var progress = new Progress(report => +{ + 
Console.WriteLine($"Extracting {report.EntryPath}: {report.PercentComplete}%"); +}); + +var options = new ReaderOptions { Progress = progress }; +using (var archive = ZipArchive.Open("archive.zip", options)) +{ + archive.WriteToDirectory(@"C:\output"); +} +``` + +### Async Extract with Cancellation + +```csharp +var cts = new CancellationTokenSource(); +cts.CancelAfter(TimeSpan.FromMinutes(5)); + +try +{ + using (var archive = ZipArchive.Open("archive.zip")) + { + await archive.WriteToDirectoryAsync(@"C:\output", + new ExtractionOptions { ExtractFullPath = true, Overwrite = true }, + cts.Token); + } +} +catch (OperationCanceledException) +{ + Console.WriteLine("Extraction cancelled"); +} +``` + +### Create with Custom Compression + +```csharp +using (var archive = ZipArchive.Create()) +{ + archive.AddAllFromDirectory(@"D:\source"); + + // Fastest + archive.SaveTo("fast.zip", new WriterOptions(CompressionType.Deflate) + { + CompressionLevel = 1 + }); + + // Balanced (default) + archive.SaveTo("normal.zip", CompressionType.Deflate); + + // Best compression + archive.SaveTo("best.zip", new WriterOptions(CompressionType.Deflate) + { + CompressionLevel = 9 + }); +} +``` + +### Stream Processing (No File I/O) + +```csharp +using (var outputStream = new MemoryStream()) +using (var archive = ZipArchive.Create()) +{ + // Add content from memory + using (var contentStream = new MemoryStream(Encoding.UTF8.GetBytes("Hello"))) + { + archive.AddEntry("file.txt", contentStream); + } + + // Save to memory + archive.SaveTo(outputStream, CompressionType.Deflate); + + // Get bytes + byte[] archiveBytes = outputStream.ToArray(); +} +``` + +### Extract Specific Files + +```csharp +using (var archive = ZipArchive.Open("archive.zip")) +{ + var filesToExtract = new[] { "file1.txt", "file2.txt" }; + + foreach (var entry in archive.Entries.Where(e => filesToExtract.Contains(e.Key))) + { + entry.WriteToFile(@"C:\output\" + entry.Key); + } +} +``` + +### List Archive Contents + +```csharp 
+using (var archive = ZipArchive.Open("archive.zip")) +{ + foreach (var entry in archive.Entries) + { + if (entry.IsDirectory) + Console.WriteLine($"[DIR] {entry.Key}"); + else + Console.WriteLine($"[FILE] {entry.Key} ({entry.Size} bytes)"); + } +} +``` + +--- + +## Common Mistakes + +### ✗ Wrong - Stream not disposed + +```csharp +var stream = File.OpenRead("archive.zip"); +var archive = ZipArchive.Open(stream); +archive.WriteToDirectory(@"C:\output"); +// stream not disposed - leaked resource +``` + +### ✓ Correct - Using blocks + +```csharp +using (var stream = File.OpenRead("archive.zip")) +using (var archive = ZipArchive.Open(stream)) +{ + archive.WriteToDirectory(@"C:\output"); +} +// Both properly disposed +``` + +### ✗ Wrong - Mixing API styles + +```csharp +// Loading entire archive then iterating +using (var archive = ZipArchive.Open("large.zip")) +{ + var entries = archive.Entries.ToList(); // Loads all in memory + foreach (var e in entries) + { + e.WriteToFile(...); // Then extracts each + } +} +``` + +### ✓ Correct - Use Reader for large files + +```csharp +// Streaming iteration +using (var stream = File.OpenRead("large.zip")) +using (var reader = ReaderFactory.Open(stream)) +{ + while (reader.MoveToNextEntry()) + { + reader.WriteEntryToDirectory(@"C:\output"); + } +} +``` + +--- + +## Related Documentation + +- [USAGE.md](USAGE.md) - Complete code examples +- [FORMATS.md](FORMATS.md) - Supported formats +- [PERFORMANCE.md](PERFORMANCE.md) - API selection guide +- [ERRORS.md](ERRORS.md) - Exception handling diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 00000000..ed96eda0 --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,660 @@ +# SharpCompress Architecture Guide + +This guide explains the internal architecture and design patterns of SharpCompress for contributors. 
+ +## Overview + +SharpCompress is organized into three main layers: + +``` +┌─────────────────────────────────────────┐ +│ User-Facing APIs (Top Layer) │ +│ Archive, Reader, Writer Factories │ +├─────────────────────────────────────────┤ +│ Format-Specific Implementations │ +│ ZipArchive, TarReader, GZipWriter, │ +│ RarArchive, SevenZipArchive, etc. │ +├─────────────────────────────────────────┤ +│ Compression & Crypto (Bottom Layer) │ +│ Deflate, LZMA, BZip2, AES, CRC32 │ +└─────────────────────────────────────────┘ +``` + +--- + +## Directory Structure + +### `src/SharpCompress/` + +#### `Archives/` - Archive Implementations +Contains `IArchive` implementations for seekable, random-access APIs. + +**Key Files:** +- `AbstractArchive.cs` - Base class for all archives +- `IArchive.cs` - Archive interface definition +- `ArchiveFactory.cs` - Factory for opening archives +- Format-specific: `ZipArchive.cs`, `TarArchive.cs`, `RarArchive.cs`, `SevenZipArchive.cs`, `GZipArchive.cs` + +**Use Archive API when:** +- Stream is seekable (file, memory) +- Need random access to entries +- Archive fits in memory +- Simplicity is important + +#### `Readers/` - Reader Implementations +Contains `IReader` implementations for forward-only, non-seekable APIs. + +**Key Files:** +- `AbstractReader.cs` - Base reader class +- `IReader.cs` - Reader interface +- `ReaderFactory.cs` - Auto-detection factory +- `ReaderOptions.cs` - Configuration for readers +- Format-specific: `ZipReader.cs`, `TarReader.cs`, `GZipReader.cs`, `RarReader.cs`, etc. + +**Use Reader API when:** +- Stream is non-seekable (network, pipe, compressed) +- Processing large files +- Memory is limited +- Forward-only processing is acceptable + +#### `Writers/` - Writer Implementations +Contains `IWriter` implementations for forward-only writing. 
+ +**Key Files:** +- `AbstractWriter.cs` - Base writer class +- `IWriter.cs` - Writer interface +- `WriterFactory.cs` - Factory for creating writers +- `WriterOptions.cs` - Configuration for writers +- Format-specific: `ZipWriter.cs`, `TarWriter.cs`, `GZipWriter.cs` + +#### `Factories/` - Format Detection +Factory classes for auto-detecting archive format and creating appropriate readers/writers. + +**Key Files:** +- `Factory.cs` - Base factory class +- `IFactory.cs` - Factory interface +- Format-specific: `ZipFactory.cs`, `TarFactory.cs`, `RarFactory.cs`, etc. + +**How It Works:** +1. `ReaderFactory.Open(stream)` probes stream signatures +2. Identifies format by magic bytes +3. Creates appropriate reader instance +4. Returns generic `IReader` interface + +#### `Common/` - Shared Types +Common types, options, and enumerations used across formats. + +**Key Files:** +- `IEntry.cs` - Entry interface (file within archive) +- `Entry.cs` - Entry implementation +- `ArchiveType.cs` - Enum for archive formats +- `CompressionType.cs` - Enum for compression methods +- `ArchiveEncoding.cs` - Character encoding configuration +- `ExtractionOptions.cs` - Extraction configuration +- Format-specific headers: `Zip/Headers/`, `Tar/Headers/`, `Rar/Headers/`, etc. + +#### `Compressors/` - Compression Algorithms +Low-level compression streams implementing specific algorithms. 
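
To make "compression stream" concrete, the sketch below round-trips a buffer through a DEFLATE stream. It deliberately uses the .NET base class library's `System.IO.Compression.DeflateStream` as a stand-in, not SharpCompress's own classes under `Compressors/`, so only the stream-wrapping shape carries over; the variable names are illustrative.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Highly repetitive sample data, so the compressed form is clearly smaller.
byte[] input = Encoding.UTF8.GetBytes(new string('a', 1000));

// Compress: the DEFLATE stream wraps the destination stream and transforms writes.
using var compressed = new MemoryStream();
using (var deflate = new DeflateStream(compressed, CompressionMode.Compress, leaveOpen: true))
{
    deflate.Write(input, 0, input.Length);
} // Disposing flushes the final block; leaveOpen keeps 'compressed' usable.

// Decompress: the same kind of wrapper, now transforming reads.
compressed.Position = 0;
using var restored = new MemoryStream();
using (var inflate = new DeflateStream(compressed, CompressionMode.Decompress, leaveOpen: true))
{
    inflate.CopyTo(restored);
}

byte[] output = restored.ToArray();
Console.WriteLine($"{input.Length} bytes -> {compressed.Length} bytes -> {output.Length} bytes");
```

The `leaveOpen: true` argument plays the same role as the stream-ownership options discussed elsewhere in these docs: the wrapper is disposed, but the underlying stream stays open for the caller.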
+ +**Algorithms:** +- `Deflate/` - DEFLATE compression (Zip default) +- `BZip2/` - BZip2 compression +- `LZMA/` - LZMA compression (7Zip, XZ, LZip) +- `PPMd/` - Prediction by Partial Matching (Zip, 7Zip) +- `ZStandard/` - ZStandard compression (decompression only) +- `Xz/` - XZ format (decompression only) +- `Rar/` - RAR-specific unpacking +- `Arj/`, `Arc/`, `Ace/` - Legacy format decompression +- `Filters/` - BCJ/BCJ2 filters for executable compression + +**Each Compressor:** +- Implements a `Stream` subclass +- Provides both compression and decompression +- Some are read-only (decompression only) + +#### `Crypto/` - Encryption & Hashing +Cryptographic functions and stream wrappers. + +**Key Files:** +- `Crc32Stream.cs` - CRC32 calculation wrapper +- `BlockTransformer.cs` - Block cipher transformations +- AES, PKWare, WinZip encryption implementations + +#### `IO/` - Stream Utilities +Stream wrappers and utilities. + +**Key Classes:** +- `SharpCompressStream` - Base stream class +- `ProgressReportingStream` - Progress tracking wrapper +- `MarkingBinaryReader` - Binary reader with position marks +- `BufferedSubStream` - Buffered read-only substream +- `ReadOnlySubStream` - Read-only view of parent stream +- `NonDisposingStream` - Prevents wrapped stream disposal + +--- + +## Design Patterns + +### 1. Factory Pattern + +**Purpose:** Auto-detect format and create appropriate reader/writer. + +**Example:** +```csharp +// User calls factory +using (var reader = ReaderFactory.Open(stream)) // Returns IReader +{ + while (reader.MoveToNextEntry()) + { + // Process entry + } +} + +// Behind the scenes: +// 1. Factory.Open() probes stream signatures +// 2. Detects format (Zip, Tar, Rar, etc.) +// 3. Creates appropriate reader (ZipReader, TarReader, etc.) +// 4. Returns as generic IReader interface +``` + +**Files:** +- `src/SharpCompress/Factories/ReaderFactory.cs` +- `src/SharpCompress/Factories/WriterFactory.cs` +- `src/SharpCompress/Factories/ArchiveFactory.cs` + +### 2. 
Strategy Pattern + +**Purpose:** Encapsulate compression algorithms as swappable strategies. + +**Example:** +```csharp +// Different compression strategies +CompressionType.Deflate // DEFLATE +CompressionType.BZip2 // BZip2 +CompressionType.LZMA // LZMA +CompressionType.PPMd // PPMd + +// Writer uses strategy pattern +var archive = ZipArchive.Create(); +archive.SaveTo("output.zip", CompressionType.Deflate); // Use Deflate +archive.SaveTo("output.bz2", CompressionType.BZip2); // Use BZip2 +``` + +**Files:** +- `src/SharpCompress/Compressors/` - Strategy implementations + +### 3. Decorator Pattern + +**Purpose:** Wrap streams with additional functionality. + +**Example:** +```csharp +// Progress reporting decorator +var progressStream = new ProgressReportingStream(baseStream, progressReporter); +progressStream.Read(buffer, 0, buffer.Length); // Reports progress + +// Non-disposing decorator +var nonDisposingStream = new NonDisposingStream(baseStream); +using (var compressor = new DeflateStream(nonDisposingStream)) +{ + // baseStream won't be disposed when compressor is disposed +} +``` + +**Files:** +- `src/SharpCompress/IO/ProgressReportingStream.cs` +- `src/SharpCompress/IO/NonDisposingStream.cs` + +### 4. Template Method Pattern + +**Purpose:** Define algorithm skeleton in base class, let subclasses fill details. + +**Example:** +```csharp +// AbstractArchive defines common archive operations +public abstract class AbstractArchive : IArchive +{ + // Template methods + public virtual void WriteToDirectory(string destinationDirectory, ExtractionOptions options) + { + // Common extraction logic + foreach (var entry in Entries) + { + // Call subclass method + entry.WriteToFile(destinationPath, options); + } + } + + // Subclasses override format-specific details + protected abstract Entry CreateEntry(EntryData data); +} +``` + +**Files:** +- `src/SharpCompress/Archives/AbstractArchive.cs` +- `src/SharpCompress/Readers/AbstractReader.cs` + +### 5. 
Iterator Pattern + +**Purpose:** Provide sequential access to entries. + +**Example:** +```csharp +// Archive API - provides collection +IEnumerable entries = archive.Entries; +foreach (var entry in entries) +{ + // Random access - entries already in memory +} + +// Reader API - provides iterator +IReader reader = ReaderFactory.Open(stream); +while (reader.MoveToNextEntry()) +{ + // Forward-only iteration - one entry at a time + var entry = reader.Entry; +} +``` + +--- + +## Key Interfaces + +### IArchive - Random Access API + +```csharp +public interface IArchive : IDisposable +{ + IEnumerable Entries { get; } + + void WriteToDirectory(string destinationDirectory, + ExtractionOptions options = null); + + IEntry FirstOrDefault(Func predicate); + + // ... format-specific methods +} +``` + +**Implementations:** `ZipArchive`, `TarArchive`, `RarArchive`, `SevenZipArchive`, `GZipArchive` + +### IReader - Forward-Only API + +```csharp +public interface IReader : IDisposable +{ + IEntry Entry { get; } + + bool MoveToNextEntry(); + + void WriteEntryToDirectory(string destinationDirectory, + ExtractionOptions options = null); + + Stream OpenEntryStream(); + + // ... async variants +} +``` + +**Implementations:** `ZipReader`, `TarReader`, `RarReader`, `GZipReader`, etc. + +### IWriter - Writing API + +```csharp +public interface IWriter : IDisposable +{ + void Write(string entryPath, Stream source, + DateTime? modificationTime = null); + + void WriteAll(string sourceDirectory, string searchPattern, + SearchOption searchOption); + + // ... async variants +} +``` + +**Implementations:** `ZipWriter`, `TarWriter`, `GZipWriter` + +### IEntry - Archive Entry + +```csharp +public interface IEntry +{ + string Key { get; } + uint Size { get; } + uint CompressedSize { get; } + bool IsDirectory { get; } + DateTime? 
LastModifiedTime { get; } + CompressionType CompressionType { get; } + + void WriteToFile(string fullPath, ExtractionOptions options = null); + void WriteToStream(Stream destinationStream); + Stream OpenEntryStream(); + + // ... async variants +} +``` + +--- + +## Adding Support for a New Format + +### Step 1: Understand the Format +- Research format specification +- Understand compression/encryption used +- Study existing similar formats in codebase + +### Step 2: Create Format Structure Classes + +**Create:** `src/SharpCompress/Common/NewFormat/` + +```csharp +// Headers and data structures +public class NewFormatHeader +{ + public uint Magic { get; set; } + public ushort Version { get; set; } + // ... other fields + + public static NewFormatHeader Read(BinaryReader reader) + { + // Deserialize from binary + } +} + +public class NewFormatEntry +{ + public string FileName { get; set; } + public uint CompressedSize { get; set; } + public uint UncompressedSize { get; set; } + // ... other fields +} +``` + +### Step 3: Create Archive Implementation + +**Create:** `src/SharpCompress/Archives/NewFormat/NewFormatArchive.cs` + +```csharp +public class NewFormatArchive : AbstractArchive +{ + private NewFormatHeader _header; + private List<NewFormatEntry> _entries; + + public static NewFormatArchive Open(Stream stream) + { + var archive = new NewFormatArchive(); + // NewFormatHeader.Read expects a BinaryReader, so wrap the stream + archive._header = NewFormatHeader.Read(new BinaryReader(stream)); + archive.LoadEntries(stream); + return archive; + } + + public override IEnumerable Entries => _entries.Select(e => new Entry(e)); + + protected override Stream OpenEntryStream(Entry entry) + { + // Return decompressed stream for entry + } + + // ... 
other abstract method implementations +} +``` + +### Step 4: Create Reader Implementation + +**Create:** `src/SharpCompress/Readers/NewFormat/NewFormatReader.cs` + +```csharp +public class NewFormatReader : AbstractReader +{ + private NewFormatHeader _header; + private BinaryReader _reader; + + public NewFormatReader(Stream stream) + { + _reader = new BinaryReader(stream); + _header = NewFormatHeader.Read(_reader); + } + + public override bool MoveToNextEntry() + { + // Read next entry header + if (!_reader.BaseStream.CanRead) return false; + + var entryData = NewFormatEntry.Read(_reader); + // ... set this.Entry + return entryData != null; + } + + // ... other abstract method implementations +} +``` + +### Step 5: Create Factory + +**Create:** `src/SharpCompress/Factories/NewFormatFactory.cs` + +```csharp +public class NewFormatFactory : Factory, IArchiveFactory, IReaderFactory +{ + // Archive format magic bytes (signature) + private static readonly byte[] NewFormatSignature = new byte[] { 0x4E, 0x46 }; // "NF" + + public static NewFormatFactory Instance { get; } = new(); + + public IArchive CreateArchive(Stream stream) + => NewFormatArchive.Open(stream); + + public IReader CreateReader(Stream stream, ReaderOptions options) + => new NewFormatReader(stream) { Options = options }; + + public bool Matches(Stream stream, ReadOnlySpan signature) + => signature.StartsWith(NewFormatSignature); +} +``` + +### Step 6: Register Factory + +**Update:** `src/SharpCompress/Factories/ArchiveFactory.cs` + +```csharp +private static readonly IFactory[] Factories = +{ + ZipFactory.Instance, + TarFactory.Instance, + RarFactory.Instance, + SevenZipFactory.Instance, + GZipFactory.Instance, + NewFormatFactory.Instance, // Add here + // ... 
other factories +}; +``` + +### Step 7: Add Tests + +**Create:** `tests/SharpCompress.Test/NewFormat/NewFormatTests.cs` + +```csharp +public class NewFormatTests : TestBase +{ + [Fact] + public void NewFormat_Extracts_Successfully() + { + var archivePath = Path.Combine(TEST_ARCHIVES_PATH, "archive.newformat"); + using (var archive = NewFormatArchive.Open(archivePath)) + { + archive.WriteToDirectory(SCRATCH_FILES_PATH); + // Assert extraction + } + } + + [Fact] + public void NewFormat_Reader_Works() + { + var archivePath = Path.Combine(TEST_ARCHIVES_PATH, "archive.newformat"); + using (var stream = File.OpenRead(archivePath)) + using (var reader = new NewFormatReader(stream)) + { + Assert.True(reader.MoveToNextEntry()); + Assert.NotNull(reader.Entry); + } + } +} +``` + +### Step 8: Add Test Archives + +Place test files in `tests/TestArchives/Archives/NewFormat/` directory. + +### Step 9: Document + +Update `docs/FORMATS.md` with format support information. + +--- + +## Compression Algorithm Implementation + +### Creating a New Compression Stream + +**Example:** Creating `CustomStream` for a custom compression algorithm + +```csharp +public class CustomStream : Stream +{ + private readonly Stream _baseStream; + private readonly bool _leaveOpen; + + public CustomStream(Stream baseStream, bool leaveOpen = false) + { + _baseStream = baseStream; + _leaveOpen = leaveOpen; + } + + public override int Read(byte[] buffer, int offset, int count) + { + // Decompress data from _baseStream into buffer + // Return number of decompressed bytes + } + + public override void Write(byte[] buffer, int offset, int count) + { + // Compress data from buffer into _baseStream + } + + protected override void Dispose(bool disposing) + { + if (disposing && !_leaveOpen) + { + _baseStream?.Dispose(); + } + base.Dispose(disposing); + } +} +``` + +--- + +## Stream Handling Best Practices + +### Disposal Pattern + +```csharp +// Correct: Nested using blocks +using (var fileStream = 
File.OpenRead("archive.zip")) +using (var archive = ZipArchive.Open(fileStream)) +{ + archive.WriteToDirectory(@"C:\output"); +} +// Both archive and fileStream properly disposed + +// Correct: Using with options +var options = new ReaderOptions { LeaveStreamOpen = true }; +var stream = File.OpenRead("archive.zip"); +using (var archive = ZipArchive.Open(stream, options)) +{ + archive.WriteToDirectory(@"C:\output"); +} +stream.Dispose(); // Manually dispose if LeaveStreamOpen = true +``` + +### NonDisposingStream Wrapper + +```csharp +// Prevent unwanted stream closure +var baseStream = File.OpenRead("data.bin"); +var nonDisposing = new NonDisposingStream(baseStream); + +using (var compressor = new DeflateStream(nonDisposing)) +{ + // Compressor won't close baseStream when disposed +} + +// baseStream still usable +baseStream.Position = 0; // Works +baseStream.Dispose(); // Manual disposal +``` + +--- + +## Performance Considerations + +### Memory Efficiency + +1. **Avoid loading entire archive in memory** - Use Reader API for large files +2. **Process entries sequentially** - Especially for solid archives +3. **Use appropriate buffer sizes** - Larger buffers for network I/O +4. **Dispose streams promptly** - Free resources when done + +### Algorithm Selection + +1. **Archive API** - Fast for small archives with random access +2. **Reader API** - Efficient for large files or streaming +3. **Solid archives** - Sequential extraction much faster +4. **Compression levels** - Trade-off between speed and size + +--- + +## Testing Guidelines + +### Test Coverage + +1. **Happy path** - Normal extraction works +2. **Edge cases** - Empty archives, single file, many files +3. **Corrupted data** - Handle gracefully +4. **Error cases** - Missing passwords, unsupported compression +5. 
**Async operations** - Both sync and async code paths + +### Test Archives + +- Use `tests/TestArchives/` for test data +- Create format-specific subdirectories +- Include encrypted, corrupted, and edge case archives +- Don't recreate existing archives + +### Test Patterns + +```csharp +[Fact] +public void Archive_Extraction_Works() +{ + // Arrange + var testArchive = Path.Combine(TEST_ARCHIVES_PATH, "test.zip"); + + // Act + using (var archive = ZipArchive.Open(testArchive)) + { + archive.WriteToDirectory(SCRATCH_FILES_PATH); + } + + // Assert + Assert.True(File.Exists(Path.Combine(SCRATCH_FILES_PATH, "file.txt"))); +} +``` + +--- + +## Related Documentation + +- [CONTRIBUTING.md](CONTRIBUTING.md) - How to contribute +- [AGENTS.md](../AGENTS.md) - Development guidelines +- [FORMATS.md](FORMATS.md) - Supported formats diff --git a/docs/ENCODING.md b/docs/ENCODING.md new file mode 100644 index 00000000..b421000d --- /dev/null +++ b/docs/ENCODING.md @@ -0,0 +1,611 @@ +# SharpCompress Character Encoding Guide + +This guide explains how SharpCompress handles character encoding for archive entries (filenames, comments, etc.). + +## Overview + +Most archive formats store filenames and metadata as bytes. SharpCompress must convert these bytes to strings using the appropriate character encoding. + +**Common Problem:** Archives created on systems with non-UTF8 encodings (especially Japanese, Chinese systems) appear with corrupted filenames when extracted on systems that assume UTF8. 
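
The problem can be reproduced without any archive at all. The following is a minimal sketch using only the .NET base class library; the sample filename is a hypothetical stand-in for an entry name written by a Japanese Windows tool. It assumes .NET Core or .NET 5+, where legacy code pages such as 932 are only available after the code-pages provider has been registered.

```csharp
using System;
using System.Text;

// On .NET Core / .NET 5+, code page 932 is not available until the
// code-pages provider is registered (it is built in on .NET Framework).
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Encoding cp932 = Encoding.GetEncoding(932);

// Hypothetical entry name, written with Unicode escapes so the example does
// not depend on how this source file itself is encoded.
string original = "\u65E5\u672C\u8A9E.txt";

// These are the bytes an archive written with cp932 would store for the name.
byte[] storedBytes = cp932.GetBytes(original);

// Decoding with the wrong encoding yields replacement characters (U+FFFD).
string wrong = Encoding.UTF8.GetString(storedBytes);

// Decoding with the encoding the name was written in restores it.
string right = cp932.GetString(storedBytes);

Console.WriteLine($"Decoded as UTF-8: {wrong}");
Console.WriteLine($"Decoded as cp932: {right}");
Console.WriteLine($"Round trip ok:    {right == original}");
```

The same `RegisterProvider` call is likely needed once at application startup before any `Encoding.GetEncoding(932)` call shown in this guide will succeed on modern .NET.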
+ +--- + +## ArchiveEncoding Class + +### Basic Usage + +```csharp +using SharpCompress.Common; +using SharpCompress.Readers; + +// Configure encoding before opening archive +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding(932) // cp932 for Japanese + } +}; + +using (var archive = ZipArchive.Open("japanese.zip", options)) +{ + foreach (var entry in archive.Entries) + { + Console.WriteLine(entry.Key); // Now shows correct characters + } +} +``` + +### ArchiveEncoding Properties + +| Property | Purpose | +|----------|---------| +| `Default` | Default encoding for filenames (fallback) | +| `CustomDecoder` | Custom decoding function for special cases | + +### Setting for Different APIs + +**Archive API:** +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding { Default = Encoding.GetEncoding(932) } +}; +using (var archive = ZipArchive.Open("file.zip", options)) +{ + // Use archive with correct encoding +} +``` + +**Reader API:** +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding { Default = Encoding.GetEncoding(932) } +}; +using (var stream = File.OpenRead("file.zip")) +using (var reader = ReaderFactory.Open(stream, options)) +{ + while (reader.MoveToNextEntry()) + { + // Filenames decoded correctly + } +} +``` + +--- + +## Common Encodings + +### Asian Encodings + +#### cp932 (Japanese) +```csharp +// Windows-31J, Shift-JIS variant used on Japanese Windows +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding(932) + } +}; +using (var archive = ZipArchive.Open("japanese.zip", options)) +{ + // Correctly decodes Japanese filenames +} +``` + +**When to use:** +- Archives from Japanese Windows systems +- Files with Japanese characters in names + +#### gb2312 (Simplified Chinese) +```csharp +// Simplified Chinese +var options = new ReaderOptions +{ + ArchiveEncoding = new 
ArchiveEncoding + { + Default = Encoding.GetEncoding("gb2312") + } +}; +``` + +#### gbk (Extended Simplified Chinese) +```csharp +// Extended Simplified Chinese (more characters than gb2312) +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("gbk") + } +}; +``` + +#### big5 (Traditional Chinese) +```csharp +// Traditional Chinese (Taiwan, Hong Kong) +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("big5") + } +}; +``` + +#### euc-jp (Japanese, Unix) +```csharp +// Extended Unix Code for Japanese +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("euc-jp") + } +}; +``` + +#### euc-kr (Korean) +```csharp +// Extended Unix Code for Korean +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("euc-kr") + } +}; +``` + +### Western European Encodings + +#### iso-8859-1 (Latin-1) +```csharp +// Western European (includes accented characters) +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("iso-8859-1") + } +}; +``` + +**When to use:** +- Archives from French, German, Spanish systems +- Files with accented characters (é, ñ, ü, etc.) 
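
One practical note on the legacy code pages used in this section: on .NET Core and .NET 5+, only a handful of encodings (UTF-8, UTF-16, UTF-32, ASCII, ISO-8859-1) are available out of the box, so a call such as `Encoding.GetEncoding(932)` may throw until the code pages provider is registered. The sketch below assumes a modern .NET target (on .NET Framework the provider type comes from the `System.Text.Encoding.CodePages` package); the `CodePageSetup` class and its method names are hypothetical, and only `System.Text` APIs are used:

```csharp
using System;
using System.Text;

public static class CodePageSetup
{
    // Registers the legacy code pages once, then resolves cp932.
    public static Encoding GetCp932()
    {
        // Safe to call more than once; typically done at application startup.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(932);
    }

    // Encodes a name to bytes and decodes it again with the same encoding.
    public static string RoundTrip(string name, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(name));
    }

    public static void Main()
    {
        Encoding cp932 = GetCp932();
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        string name = "日本語.txt";
        byte[] raw = cp932.GetBytes(name);

        // Same encoding on both sides: the filename survives.
        Console.WriteLine(RoundTrip(name, cp932) == name);

        // cp932 bytes read as Latin-1: the filename is garbled.
        Console.WriteLine(latin1.GetString(raw) == name);
    }
}
```

The first comparison should print `True` and the second `False`. Registering the provider before constructing `ReaderOptions` keeps the encoding lookups in the examples above from failing at runtime.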
+ +#### cp1252 (Windows-1252) +```csharp +// Windows Western European +// Very similar to iso-8859-1 but with additional printable characters +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("cp1252") + } +}; +``` + +**When to use:** +- Archives from older Western European Windows systems +- Files with smart quotes and other Windows-specific characters + +#### iso-8859-15 (Latin-9) +```csharp +// Western European with Euro symbol support +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("iso-8859-15") + } +}; +``` + +### Cyrillic Encodings + +#### cp1251 (Windows Cyrillic) +```csharp +// Russian, Serbian, Bulgarian, etc. +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("cp1251") + } +}; +``` + +#### koi8-r (KOI8 Russian) +```csharp +// Russian (Unix standard) +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("koi8-r") + } +}; +``` + +### UTF Encodings (Modern) + +#### UTF-8 (Default) +```csharp +// Modern standard - usually correct for new archives +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.UTF8 + } +}; +``` + +#### UTF-16 +```csharp +// Unicode - rarely used in archives +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.Unicode + } +}; +``` + +--- + +## Encoding Auto-Detection + +SharpCompress attempts to auto-detect encoding, but this isn't always reliable: + +```csharp +// Auto-detection (default) +using (var archive = ZipArchive.Open("file.zip")) // Uses UTF8 by default +{ + // May show corrupted characters if archive uses different encoding +} + +// Explicit encoding (more reliable) +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding { Default = Encoding.GetEncoding(932) } +}; 
+using (var archive = ZipArchive.Open("file.zip", options)) +{ + // Correct characters displayed +} +``` + +### When Manual Override is Needed + +| Situation | Solution | +|-----------|----------| +| Archive shows corrupted characters | Specify the encoding explicitly | +| Archives from specific region | Use that region's encoding | +| Mixed encodings in archive | Use CustomDecoder | +| Testing with international files | Try different encodings | + +--- + +## Custom Decoder + +For complex scenarios where a single encoding isn't sufficient: + +### Basic Custom Decoder + +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + CustomDecoder = (data, offset, length) => + { + // Custom decoding logic + var bytes = new byte[length]; + Array.Copy(data, offset, bytes, 0, length); + + // Try strict UTF8 first (Encoding.UTF8 never throws; it substitutes U+FFFD) + try + { + return new UTF8Encoding(false, true).GetString(bytes); + } + catch (DecoderFallbackException) + { + // Fallback to cp932 if the bytes are not valid UTF8 + return Encoding.GetEncoding(932).GetString(bytes); + } + } + } +}; + +using (var archive = ZipArchive.Open("mixed.zip", options)) +{ + foreach (var entry in archive.Entries) + { + Console.WriteLine(entry.Key); // Uses custom decoder + } +} +``` + +### Advanced: Detect Encoding by Content + +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + CustomDecoder = DetectAndDecode + } +}; + +private static string DetectAndDecode(byte[] data, int offset, int length) +{ + var bytes = new byte[length]; + Array.Copy(data, offset, bytes, 0, length); + + // Try UTF8 (most modern archives) + try + { + var str = Encoding.UTF8.GetString(bytes); + // Verify it decoded correctly (no replacement characters) + if (!str.Contains('\uFFFD')) + return str; + } + catch { } + + // Try cp932 (Japanese) + try + { + var str = Encoding.GetEncoding(932).GetString(bytes); + if (!str.Contains('\uFFFD')) + return str; + } + catch { } + + // Fallback to iso-8859-1 (always succeeds) + return
Encoding.GetEncoding("iso-8859-1").GetString(bytes); +} +``` + +--- + +## Code Examples + +### Extract Archive with Japanese Filenames + +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding(932) // cp932 + } +}; + +using (var archive = ZipArchive.Open("japanese_files.zip", options)) +{ + archive.WriteToDirectory(@"C:\output", new ExtractionOptions + { + ExtractFullPath = true, + Overwrite = true + }); +} +// Files extracted with correct Japanese names +``` + +### Extract Archive with Western European Filenames + +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("iso-8859-1") + } +}; + +using (var archive = ZipArchive.Open("french_files.zip", options)) +{ + archive.WriteToDirectory(@"C:\output"); +} +// Accented characters (é, è, ê, etc.) display correctly +``` + +### Extract Archive with Chinese Filenames + +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("gbk") // Simplified Chinese + } +}; + +using (var archive = ZipArchive.Open("chinese_files.zip", options)) +{ + archive.WriteToDirectory(@"C:\output"); +} +``` + +### Extract Archive with Russian Filenames + +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("cp1251") // Windows Cyrillic + } +}; + +using (var archive = ZipArchive.Open("russian_files.zip", options)) +{ + archive.WriteToDirectory(@"C:\output"); +} +``` + +### Reader API with Encoding + +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding(932) + } +}; + +using (var stream = File.OpenRead("japanese.zip")) +using (var reader = ReaderFactory.Open(stream, options)) +{ + while (reader.MoveToNextEntry()) + { + if (!reader.Entry.IsDirectory) + { + Console.WriteLine(reader.Entry.Key); // Correct 
characters + reader.WriteEntryToDirectory(@"C:\output"); + } + } +} +``` + +--- + +## Creating Archives with Correct Encoding + +When creating archives, SharpCompress uses UTF8 by default (recommended): + +```csharp +// Create with UTF8 (default, recommended) +using (var archive = ZipArchive.Create()) +{ + archive.AddAllFromDirectory(@"D:\my_files"); + archive.SaveTo("output.zip", CompressionType.Deflate); + // Archives created with UTF8 encoding +} +``` + +If you need to create archives for systems that expect specific encodings: + +```csharp +// Note: SharpCompress Writer API uses UTF8 for encoding +// To create archives with other encodings, consider: +// 1. Let users on those systems create archives +// 2. Use system tools (7-Zip, WinRAR) with desired encoding +// 3. Post-process archives if absolutely necessary + +// For now, recommend modern UTF8-based archives +``` + +--- + +## Troubleshooting Encoding Issues + +### Filenames Show Question Marks (?) + +``` +✗ Wrong encoding detected +test文件.txt → test???.txt +``` + +**Solution:** Specify correct encoding explicitly + +```csharp +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + Default = Encoding.GetEncoding("gbk") // Try different encodings + } +}; +``` + +### Filenames Show Replacement Character (�) + +``` +✗ Invalid bytes for selected encoding +café.txt → caf�.txt +``` + +**Solution:** +1. Try a different encoding (see Common Encodings table) +2. Use CustomDecoder with fallback encoding +3.
Archive might be corrupted + +### Mixed Encodings in Single Archive + +```csharp +// Use CustomDecoder to handle mixed encodings +var options = new ReaderOptions +{ + ArchiveEncoding = new ArchiveEncoding + { + CustomDecoder = (data, offset, length) => + { + // Try multiple encodings in priority order + var bytes = new byte[length]; + Array.Copy(data, offset, bytes, 0, length); + + foreach (var encoding in new[] + { + Encoding.UTF8, + Encoding.GetEncoding(932), + Encoding.GetEncoding("iso-8859-1") + }) + { + try + { + var str = encoding.GetString(bytes); + if (!str.Contains('\uFFFD')) + return str; + } + catch { } + } + + // Final fallback + return Encoding.GetEncoding("iso-8859-1").GetString(bytes); + } + } +}; +``` + +--- + +## Encoding Reference Table + +| Encoding | Code | Use Case | +|----------|------|----------| +| UTF-8 | (default) | Modern archives, recommended | +| cp932 | 932 | Japanese Windows | +| gb2312 | "gb2312" | Simplified Chinese | +| gbk | "gbk" | Extended Simplified Chinese | +| big5 | "big5" | Traditional Chinese | +| iso-8859-1 | "iso-8859-1" | Western European | +| cp1252 | "cp1252" | Windows Western European | +| cp1251 | "cp1251" | Russian/Cyrillic | +| euc-jp | "euc-jp" | Japanese Unix | +| euc-kr | "euc-kr" | Korean | + +--- + +## Best Practices + +1. **Use UTF-8 for new archives** - Most modern systems support it +2. **Ask the archive creator** - When receiving archives with corrupted names +3. **Provide encoding options** - If your app handles user archives +4. **Document your assumption** - Tell users what encoding you're using +5.
**Test with international files** - Before releasing production code + +--- + +## Related Documentation + +- [TROUBLESHOOTING.md](TROUBLESHOOTING.md#garbled-filenames) - Encoding troubleshooting +- [USAGE.md](USAGE.md#extract-zip-which-has-non-utf8-encoded-filenamycp932) - Usage examples diff --git a/FORMATS.md b/docs/FORMATS.md similarity index 100% rename from FORMATS.md rename to docs/FORMATS.md diff --git a/OLD_CHANGELOG.md b/docs/OLD_CHANGELOG.md similarity index 100% rename from OLD_CHANGELOG.md rename to docs/OLD_CHANGELOG.md diff --git a/docs/PERFORMANCE.md b/docs/PERFORMANCE.md new file mode 100644 index 00000000..7fc46853 --- /dev/null +++ b/docs/PERFORMANCE.md @@ -0,0 +1,557 @@ +# SharpCompress Performance Guide + +This guide helps you optimize SharpCompress for performance in various scenarios. + +## API Selection Guide + +### Archive API vs Reader API + +Choose the right API based on your use case: + +| Aspect | Archive API | Reader API | +|--------|------------|-----------| +| **Stream Type** | Seekable only | Non-seekable OK | +| **Memory Usage** | All entries in memory | One entry at a time | +| **Random Access** | ✓ Yes | ✗ No | +| **Best For** | Small-to-medium archives | Large or streaming data | +| **Performance** | Fast for random access | Better for large files | + +### Archive API (Fast for Random Access) + +```csharp +// Use when: +// - Archive fits in memory +// - You need random access to entries +// - Stream is seekable (file, MemoryStream) + +using (var archive = ZipArchive.Open("archive.zip")) +{ + // Random access - all entries available + var specific = archive.Entries.FirstOrDefault(e => e.Key == "file.txt"); + if (specific != null) + { + specific.WriteToFile(@"C:\output\file.txt"); + } +} +``` + +**Performance Characteristics:** +- ✓ Instant entry lookup +- ✓ Parallel extraction possible +- ✗ Entire archive in memory +- ✗ Can't process while downloading + +### Reader API (Best for Large Files) + +```csharp +// Use when: +// - 
Processing large archives (>100 MB) +// - Streaming from network/pipe +// - Memory is constrained +// - Forward-only processing is acceptable + +using (var stream = File.OpenRead("large.zip")) +using (var reader = ReaderFactory.Open(stream)) +{ + while (reader.MoveToNextEntry()) + { + // Process one entry at a time + reader.WriteEntryToDirectory(@"C:\output"); + } +} +``` + +**Performance Characteristics:** +- ✓ Minimal memory footprint +- ✓ Works with non-seekable streams +- ✓ Can process while downloading +- ✗ Forward-only (no random access) +- ✗ Entry lookup requires iteration + +--- + +## Buffer Sizing + +### Understanding Buffers + +SharpCompress uses internal buffers for reading compressed data. Buffer size affects: +- **Speed:** Larger buffers = fewer I/O operations = faster +- **Memory:** Larger buffers = higher memory usage + +### Recommended Buffer Sizes + +| Scenario | Size | Notes | +|----------|------|-------| +| Embedded/IoT devices | 4-8 KB | Minimal memory usage | +| Memory-constrained | 16-32 KB | Conservative default | +| Standard use (default) | 64 KB | Recommended default | +| Large file streaming | 256 KB | Better throughput | +| High-speed SSD | 512 KB - 1 MB | Maximum throughput | + +### How Buffer Size Affects Performance + +```csharp +// SharpCompress manages buffers internally +// You can't directly set buffer size, but you can: + +// 1. Use Stream.CopyTo with explicit buffer size +using (var entryStream = reader.OpenEntryStream()) +using (var fileStream = File.Create(@"C:\output\file.txt")) +{ + // 64 KB buffer (default) + entryStream.CopyTo(fileStream); + + // Or specify larger buffer for faster copy + entryStream.CopyTo(fileStream, bufferSize: 262144); // 256 KB +} + +// 2. 
Use custom buffer for writing +using (var entryStream = reader.OpenEntryStream()) +using (var fileStream = File.Create(@"C:\output\file.txt")) +{ + byte[] buffer = new byte[262144]; // 256 KB + int bytesRead; + while ((bytesRead = entryStream.Read(buffer, 0, buffer.Length)) > 0) + { + fileStream.Write(buffer, 0, bytesRead); + } +} +``` + +--- + +## Streaming Large Files + +### Non-Seekable Stream Patterns + +For processing archives from downloads or pipes: + +```csharp +// Download stream (non-seekable) +using (var httpStream = await httpClient.GetStreamAsync(url)) +using (var reader = ReaderFactory.Open(httpStream)) +{ + // Process entries as they arrive + while (reader.MoveToNextEntry()) + { + if (!reader.Entry.IsDirectory) + { + reader.WriteEntryToDirectory(@"C:\output"); + } + } +} +``` + +**Performance Tips:** +- Don't try to buffer the entire stream +- Process entries immediately +- Use async APIs for better responsiveness + +### Download-Then-Extract vs Streaming + +Choose based on your constraints: + +| Approach | When to Use | +|----------|------------| +| **Download then extract** | Moderate size, need random access | +| **Stream during download** | Large files, bandwidth limited, memory constrained | + +```csharp +// Download then extract (requires disk space) +var archivePath = await DownloadFile(url, @"C:\temp\archive.zip"); +using (var archive = ZipArchive.Open(archivePath)) +{ + archive.WriteToDirectory(@"C:\output"); +} + +// Stream during download (on-the-fly extraction) +using (var httpStream = await httpClient.GetStreamAsync(url)) +using (var reader = ReaderFactory.Open(httpStream)) +{ + while (reader.MoveToNextEntry()) + { + reader.WriteEntryToDirectory(@"C:\output"); + } +} +``` + +--- + +## Solid Archive Optimization + +### Why Solid Archives Are Slow + +Solid archives (Rar, 7Zip) group files together in a single compressed stream: + +``` +Solid Archive Layout: +[Header] [Compressed Stream] [Footer] + ├─ File1 compressed data + ├─ File2 
compressed data + ├─ File3 compressed data + └─ File4 compressed data +``` + +Extracting File3 requires decompressing File1 and File2 first. + +### Sequential vs Random Extraction + +**Random Extraction (Slow):** +```csharp +using (var archive = RarArchive.Open("solid.rar")) +{ + foreach (var entry in archive.Entries) + { + entry.WriteToFile(@"C:\output\" + entry.Key); // ✗ Slow! + // Each entry triggers full decompression from start + } +} +``` + +**Sequential Extraction (Fast):** +```csharp +using (var archive = RarArchive.Open("solid.rar")) +{ + // Method 1: Use WriteToDirectory (recommended) + archive.WriteToDirectory(@"C:\output", new ExtractionOptions + { + ExtractFullPath = true, + Overwrite = true + }); + + // Method 2: Use ExtractAllEntries + archive.ExtractAllEntries(); + + // Method 3: Use Reader API (also sequential) + using (var reader = RarReader.Open(File.OpenRead("solid.rar"))) + { + while (reader.MoveToNextEntry()) + { + reader.WriteEntryToDirectory(@"C:\output"); + } + } +} +``` + +**Performance Impact:** +- Random extraction: O(n²) - very slow for many files +- Sequential extraction: O(n) - 10-100x faster + +### Best Practices for Solid Archives + +1. **Always extract sequentially** when possible +2. **Use Reader API** for large solid archives +3. **Process entries in order** from the archive +4. 
**Consider using 7Zip command-line** for scripted extractions + +--- + +## Compression Level Trade-offs + +### Deflate/GZip Levels + +```csharp +// Level 1 = Fastest, largest size +// Level 6 = Default (balanced) +// Level 9 = Slowest, best compression + +// Write with different compression levels +using (var archive = ZipArchive.Create()) +{ + archive.AddAllFromDirectory(@"D:\data"); + + // Fast compression (level 1) + archive.SaveTo("fast.zip", new WriterOptions(CompressionType.Deflate) + { + CompressionLevel = 1 + }); + + // Default compression (level 6) + archive.SaveTo("default.zip", CompressionType.Deflate); + + // Best compression (level 9) + archive.SaveTo("best.zip", new WriterOptions(CompressionType.Deflate) + { + CompressionLevel = 9 + }); +} +``` + +**Speed vs Size:** +| Level | Speed | Size | Use Case | +|-------|-------|------|----------| +| 1 | 10x | 90% | Network, streaming | +| 6 | 1x | 75% | Default (good balance) | +| 9 | 0.1x | 65% | Archival, static storage | + +### BZip2 Block Size + +```csharp +// BZip2 block size affects memory and compression +// 100K to 900K (default 900K) + +// Smaller block size = lower memory, faster +// Larger block size = better compression, slower + +using (var archive = TarArchive.Create()) +{ + archive.AddAllFromDirectory(@"D:\data"); + + // These are preset in WriterOptions via CompressionLevel + archive.SaveTo("archive.tar.bz2", CompressionType.BZip2); +} +``` + +### LZMA Settings + +LZMA compression is very powerful but memory-intensive: + +```csharp +// LZMA (7Zip, .tar.lzma): +// - Dictionary size: 16 KB to 1 GB (default 32 MB) +// - Faster preset: smaller dictionary +// - Better compression: larger dictionary + +// Preset via CompressionType +using (var archive = TarArchive.Create()) +{ + archive.AddAllFromDirectory(@"D:\data"); + archive.SaveTo("archive.tar.xz", CompressionType.LZMA); // Default settings +} +``` + +--- + +## Async Performance + +### When Async Helps + +Async is beneficial when: +- **Long I/O 
operations** (network, slow disks) +- **UI responsiveness** needed (Windows Forms, WPF, Blazor) +- **Server applications** (ASP.NET, multiple concurrent operations) + +```csharp +// Async extraction (non-blocking) +using (var archive = ZipArchive.Open("archive.zip")) +{ + await archive.WriteToDirectoryAsync( + @"C:\output", + new ExtractionOptions { ExtractFullPath = true, Overwrite = true }, + cancellationToken + ); +} +// Thread can handle other work while I/O happens +``` + +### When Async Doesn't Help + +Async doesn't improve performance for: +- **CPU-bound operations** (already fast) +- **Local SSD I/O** (I/O is fast enough) +- **Single-threaded scenarios** (no parallelism benefit) + +```csharp +// Sync extraction (simpler, same performance on fast I/O) +using (var archive = ZipArchive.Open("archive.zip")) +{ + archive.WriteToDirectory( + @"C:\output", + new ExtractionOptions { ExtractFullPath = true, Overwrite = true } + ); +} +// Simple and fast - no async needed +``` + +### Cancellation Pattern + +```csharp +var cts = new CancellationTokenSource(); + +// Cancel after 5 minutes +cts.CancelAfter(TimeSpan.FromMinutes(5)); + +try +{ + using (var archive = ZipArchive.Open("archive.zip")) + { + await archive.WriteToDirectoryAsync( + @"C:\output", + new ExtractionOptions { ExtractFullPath = true, Overwrite = true }, + cts.Token + ); + } +} +catch (OperationCanceledException) +{ + Console.WriteLine("Extraction cancelled"); + // Clean up partial extraction if needed +} +``` + +--- + +## Memory Efficiency + +### Reducing Allocations + +```csharp +// ✗ Wrong - creates new options object each iteration +foreach (var archiveFile in archiveFiles) +{ + using (var archive = ZipArchive.Open(archiveFile)) + { + archive.WriteToDirectory(outputDir, new ExtractionOptions + { + ExtractFullPath = true, + Overwrite = true + }); + } +} + +// ✓ Better - reuse options object +var options = new ExtractionOptions +{ + ExtractFullPath = true, + Overwrite = true +}; +foreach (var 
archiveFile in archiveFiles) +{ + using (var archive = ZipArchive.Open(archiveFile)) + { + archive.WriteToDirectory(outputDir, options); + } +} +``` + +### Object Pooling for Repeated Operations + +```csharp +// For very high-throughput scenarios, consider pooling +public class ArchiveExtractionPool +{ + private readonly ArrayPool<byte> _bufferPool = ArrayPool<byte>.Shared; + + public void ExtractMany(IEnumerable<string> archiveFiles, string outputDir) + { + var options = new ExtractionOptions + { + ExtractFullPath = true, + Overwrite = true + }; + + foreach (var archiveFile in archiveFiles) + { + using (var stream = File.OpenRead(archiveFile)) + using (var archive = ZipArchive.Open(stream)) + { + archive.WriteToDirectory(outputDir, options); + } + } + } +} +``` + +--- + +## Practical Performance Tips + +### 1. Choose the Right API + +| Scenario | API | Why | +|----------|-----|-----| +| Small archives | Archive | Faster random access | +| Large archives | Reader | Lower memory | +| Streaming | Reader | Works on non-seekable streams | +| Download streams | Reader | Async extraction while downloading | + +### 2. Batch Operations + +```csharp +// ✗ Slow - reopens the same archive for every file +foreach (var file in files) +{ + using (var archive = ZipArchive.Open("archive.zip")) + { + archive.WriteToDirectory(@"C:\output"); + } +} + +// ✓ Better - open the archive once and extract all entries together +using (var archive = ZipArchive.Open("archive.zip")) +{ + archive.WriteToDirectory(@"C:\output"); +} +``` + +### 3. Use Appropriate Compression + +```csharp +// For distribution/storage: Best compression +archive.SaveTo("archive.zip", new WriterOptions(CompressionType.Deflate) +{ + CompressionLevel = 9 +}); + +// For daily backups: Balanced compression +archive.SaveTo("backup.zip", CompressionType.Deflate); // Default level 6 + +// For temporary/streaming: Fast compression +archive.SaveTo("temp.zip", new WriterOptions(CompressionType.Deflate) +{ + CompressionLevel = 1 +}); +``` + +### 4.
Profile Your Code + +```csharp +var sw = Stopwatch.StartNew(); +using (var archive = ZipArchive.Open("large.zip")) +{ + archive.WriteToDirectory(@"C:\output"); +} +sw.Stop(); + +Console.WriteLine($"Extraction took {sw.ElapsedMilliseconds}ms"); + +// Measure memory before/after +var beforeMem = GC.GetTotalMemory(true); +// ... do work ... +var afterMem = GC.GetTotalMemory(true); +Console.WriteLine($"Memory used: {(afterMem - beforeMem) / 1024 / 1024}MB"); +``` + +--- + +## Troubleshooting Performance + +### Extraction is Slow + +1. **Check if solid archive** → Use sequential extraction +2. **Check API** → Reader API might be faster for large files +3. **Check compression level** → Higher levels are slower to decompress +4. **Check I/O** → Network drives are much slower than SSD +5. **Check buffer size** → May need larger buffers for network + +### High Memory Usage + +1. **Use Reader API** instead of Archive API +2. **Process entries immediately** rather than buffering +3. **Reduce compression level** if writing +4. **Check for memory leaks** in your code + +### CPU Usage at 100% + +1. **Normal for compression** - especially with high compression levels +2. **Consider lower level** for faster processing +3. **Reduce parallelism** if processing multiple archives +4. **Check if awaiting properly** in async code + +--- + +## Related Documentation + +- [USAGE.md](USAGE.md) - Usage examples with performance considerations +- [FORMATS.md](FORMATS.md) - Format-specific performance notes +- [TROUBLESHOOTING.md](TROUBLESHOOTING.md) - Solving common issues diff --git a/USAGE.md b/docs/USAGE.md similarity index 99% rename from USAGE.md rename to docs/USAGE.md index 11d31a9c..8e636cab 100644 --- a/USAGE.md +++ b/docs/USAGE.md @@ -1,6 +1,6 @@ # SharpCompress Usage -## Async/Await Support +## Async/Await Support (Beta) SharpCompress now provides full async/await support for all I/O operations.
All `Read`, `Write`, and extraction operations have async equivalents ending in `Async` that accept an optional `CancellationToken`. This enables better performance and scalability for I/O-bound operations. @@ -13,7 +13,7 @@ SharpCompress now provides full async/await support for all I/O operations. All See [Async Examples](#async-examples) section below for usage patterns. -## Stream Rules (changed with 0.21) +## Stream Rules When dealing with Streams, the rule should be that you don't close a stream you didn't create. This, in effect, should mean you should always put a Stream in a using block to dispose it.
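
The ownership rule above ("don't close a stream you didn't create") can be illustrated with base class library types alone. This is a sketch under stated assumptions, not SharpCompress code: the `StreamOwnership` class and `ReadFirstLine` helper are hypothetical, and `StreamReader`'s `leaveOpen` argument stands in for the role an option like `LeaveStreamOpen` plays when a library wraps a caller's stream.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamOwnership
{
    // Borrows the stream: reads one line but leaves disposal to the caller.
    public static string ReadFirstLine(Stream borrowed)
    {
        // leaveOpen: true keeps the reader from closing a stream it did not create.
        using (var reader = new StreamReader(borrowed, Encoding.UTF8, true, 1024, leaveOpen: true))
        {
            return reader.ReadLine();
        }
    }

    public static void Main()
    {
        // The caller created the stream, so the caller's using block disposes it.
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes("first\nsecond\n")))
        {
            Console.WriteLine(ReadFirstLine(stream)); // first
            Console.WriteLine(stream.CanRead);        // True: still open after the helper returns
        }
    }
}
```

The design choice is that whoever opens a stream wraps it in `using`, and any code that merely receives the stream leaves it open.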