libaaruformat 1.0
Aaru Data Preservation Suite - Format Library
#include <stdbool.h>
#include <stdint.h>
Data Structures

struct aaru_options
    Parsed user-specified tunables controlling compression, deduplication, hashing and DDT geometry.
stdbool.h: for the bool type used in aaru_options.
stdint.h: for fixed-width integer types.
Image creation / open tuning options structure and related semantics.
The library accepts a semicolon-delimited key=value options string (see parse_options()). Recognized keys:

  compress=true|false      Enable/disable block compression (LZMA for data blocks, FLAC for audio tracks).
  deduplicate=true|false   If true, identical (duplicate) sectors are stored once (DDT entries point to the same physical block). If false, duplicates are still tracked in the DDT but each occurrence is stored independently (no storage savings). The DDT itself is always present.
  dictionary=<bytes>       LZMA dictionary size in bytes (falls back to the default 33554432 if 0 or invalid).
  table_shift=<n>          DDT v2 table shift (default 9); items per primary entry = 2^n when multi-level.
  data_shift=<n>           Global data shift (default 12). Defines per-block address granularity: the low 2^n range encodes the sector (or unit) offset within a block; higher bits combine with block_alignment to derive block file offsets. Used by the DDT but not limited to it.
  block_alignment=<n>      log2 alignment of underlying data blocks (default 9 => 512 bytes); block size = 2^n.
  md5=true|false           Generate MD5 checksum (stored in the checksum block if true).
  sha1=true|false          Generate SHA-1 checksum.
  sha256=true|false        Generate SHA-256 checksum.
  blake3=true|false        Generate BLAKE3 checksum (may require build-time support; ignored if unsupported).
  spamsum=true|false       Generate SpamSum fuzzy hash.
Defaults (when the option string is NULL or a key is omitted): compress=true, deduplicate=true, dictionary=33554432, table_shift=9, data_shift=12, block_alignment=9, md5=false, sha1=false, sha256=false, blake3=false, spamsum=false.
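A minimal usage sketch, assuming a single const char* parameter for parse_options() and an umbrella header name (the thread-safety note below only confirms that parse_options() returns the struct by value; the exact prototype may differ):

    #include "aaruformat.h"  /* assumed header exposing aaru_options and parse_options() */

    int main(void)
    {
        /* Semicolon-delimited key=value pairs; unknown keys are ignored. */
        const char *opts =
            "compress=true;deduplicate=true;dictionary=33554432;"
            "table_shift=9;data_shift=12;block_alignment=9;"
            "md5=true;sha256=true";

        /* Assumed prototype: aaru_options parse_options(const char *); */
        aaru_options options = parse_options(opts);
        (void)options; /* hand to the image create / open path as appropriate */
        return 0;
    }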
Validation / normalization is performed in parse_options(); for example, a zero or invalid dictionary value falls back to the 33554432-byte default.
Performance / space trade-offs (deduplicate=false): duplicate sectors are each stored in full, so the image gains no space savings from repeated content; in exchange, the in-RAM duplicate-detection hash map sized below is not required.
Approximate in-RAM hash map usage for deduplication (deduplicate=true): the on-disk DDT can span many secondary tables, but only the primary table plus the currently loaded secondary table (and possibly a small cache) reside in memory; their footprint is typically well under 5% of the total indexed media space and is usually negligible next to the hash map used to detect duplicate sectors. The sizing below therefore covers only the hash / lookup structure ("hash_map"), not the DDT's on-disk size.
Worst-case (all sectors unique) per 1 GiB of user data:

  sectors_per_GiB = 2^30 / sector_size
  hash_bytes     ≈ sectors_per_GiB * H    (H ≈ 16 bytes: 8-byte fingerprint + ~8 bytes map overhead)
Resulting hash_map RAM per GiB (unique sectors):

  +-------------+---------------+----------------------------+
  | Sector size | Sectors / GiB | Hash map (~16 B / sector)  |
  +-------------+---------------+----------------------------+
  | 512 bytes   | 2,097,152     | ~32.0 MiB (≈32.0–36.0 MiB) |
  | 2048 bytes  | 524,288       |  ~8.0 MiB (≈7.5–8.5 MiB)   |
  | 4096 bytes  | 262,144      |  ~4.0 MiB (≈3.8–4.3 MiB)   |
  +-------------+---------------+----------------------------+
(Range reflects allocator + load factor variation.)
Targeted projections (hash map only, R=1):

2048-byte sectors (~8 MiB per GiB unique):

  Capacity | Hash map (MiB) | Hash map (GiB)
  ---------+----------------+---------------
  25 GiB   | ~200           | 0.20
  50 GiB   | ~400           | 0.39
512-byte sectors (~32 MiB per GiB unique):

  Capacity | Hash map (MiB) | Hash map (GiB)
  ---------+----------------+---------------
  128 GiB  | ~4096          | 4.00
  500 GiB  | ~16000         | 15.63
  1 TiB*   | ~32768         | 32.00
  2 TiB*   | ~65536         | 64.00
*TiB = 1024 GiB (binary). For decimal TB, reduce by ~9% (×0.91).
Duplicate ratio scaling: effective hash RAM ≈ (table value at R=1) × R, where R = unique_sectors / total_sectors. Example: 500 GiB @ 512 B with R = 0.4 ⇒ ~16000 MiB × 0.4 ≈ 6400 MiB (~6.25 GiB).
Quick rule of thumb (hash only): hash_bytes_per_GiB ≈ 16 × (2^30 / sector_size) = 2^34 / sector_size ≈ 1.718×10^10 / sector_size bytes → ≈32 MiB (512 B), 8 MiB (2048 B), 4 MiB (4096 B) per GiB unique.
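The rule of thumb and the duplicate-ratio scaling fold into a few lines of C; the 16-byte-per-sector constant below is the estimate from this section, not a library value:

    #include <stdint.h>
    #include <stdio.h>

    /* Estimated duplicate-detection hash map size in MiB.
     * sector_size in bytes; r = unique_sectors / total_sectors (1.0 = all unique). */
    static double hash_map_mib(double capacity_gib, uint32_t sector_size, double r)
    {
        double sectors_per_gib = (double)(UINT64_C(1) << 30) / sector_size;
        double bytes_per_gib   = sectors_per_gib * 16.0; /* H ≈ 16 B/sector */
        return capacity_gib * bytes_per_gib * r / (1024.0 * 1024.0);
    }

    int main(void)
    {
        /* Reproduces the projections above. */
        printf("500 GiB @ 512 B, R=1.0: ~%.0f MiB\n", hash_map_mib(500.0, 512, 1.0));
        printf("500 GiB @ 512 B, R=0.4: ~%.0f MiB\n", hash_map_mib(500.0, 512, 0.4));
        printf(" 25 GiB @ 2048 B, R=1.0: ~%.0f MiB\n", hash_map_mib(25.0, 2048, 1.0));
        return 0;
    }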
Memory planning tip: if the projected hash_map usage risks exceeding available RAM, consider disabling deduplication (deduplicate=false); storage savings are lost, but the duplicate-detection hash map is no longer required.
NOTE: DDT in-RAM portion (primary + one secondary) usually adds only a few additional MiB even for very large images, hence omitted from sizing tables. Include +5% safety margin if extremely tight on memory.
Guidance for table_shift / data_shift selection. Let:

  S  = total logical sectors expected in the image (estimate if unknown)
  T  = table_shift (items per primary DDT entry = 2^T when multi-level; 0 => single-level)
  D  = data_shift (in-block sector offset span = 2^D)
  BA = block_alignment in bytes = 2^block_alignment
  SS = sector size in bytes
data_shift constraints:
Recommended starting points: see the preset table below.
Block capacity within an entry: approximate requiredBlockIndex ≈ TotalUniqueBlocks, where

  TotalUniqueBlocks ≈ (S * SS) / (BA * 2^D) = S / (2^D * (BA / SS))

Simplified (assuming BA = SS): TotalUniqueBlocks ≈ S / 2^D.
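A worked sketch of that estimate, with illustrative numbers (the helper name is local to this example):

    #include <stdint.h>
    #include <stdio.h>

    /* TotalUniqueBlocks ≈ (S * SS) / (BA * 2^D), per the formula above. */
    static uint64_t total_unique_blocks(uint64_t s, unsigned d, uint64_t ba, uint64_t ss)
    {
        return (s * ss) / ((UINT64_C(1) << d) * ba);
    }

    int main(void)
    {
        /* S = 1,000,000 sectors of 512 B, D = 12, BA = SS = 512:
         * reduces to S / 2^D = 1,000,000 / 4096 ≈ 244 blocks. */
        printf("%llu\n",
               (unsigned long long)total_unique_blocks(1000000, 12, 512, 512));
        return 0;
    }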
table_shift considerations (multi-level DDT):
Memory rough estimates:

  Single-level SMALL DDT: bytes ≈ S * 2    (each small entry is 2 bytes)
  Single-level BIG DDT:   bytes ≈ S * 4
  Multi-level:            primary table bytes ≈ (S / 2^T) * entrySize, plus the sum of the secondary tables.
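Those estimates in code, using the 2- and 4-byte entry sizes quoted above (the 100M-sector example is illustrative):

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t small_ddt_bytes(uint64_t s) { return s * 2; }  /* single-level SMALL */
    static uint64_t big_ddt_bytes(uint64_t s)   { return s * 4; }  /* single-level BIG   */

    /* Multi-level: primary table only; secondary tables add on top. */
    static uint64_t primary_bytes(uint64_t s, unsigned t, uint64_t entry_size)
    {
        return (s >> t) * entry_size;
    }

    int main(void)
    {
        uint64_t s = UINT64_C(100000000); /* 100M logical sectors */
        printf("small DDT: ~%llu MiB\n", (unsigned long long)(small_ddt_bytes(s) >> 20));
        printf("big DDT:   ~%llu MiB\n", (unsigned long long)(big_ddt_bytes(s) >> 20));
        printf("primary (T=11, 4 B entries): ~%llu KiB\n",
               (unsigned long long)(primary_bytes(s, 11, 4) >> 10));
        return 0;
    }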
Recommended presets (approximate bands):

  +-----------------------+-----------------+----------------+--------------------------------+
  | Total logical sectors | table_shift (T) | data_shift (D) | Notes                          |
  +-----------------------+-----------------+----------------+--------------------------------+
  | < 50,000              | 0               | 8 – 10         | Single-level small DDT likely  |
  | 50K – 1,000,000       | 8 – 9           | 9 – 10         | Still feasible small DDT       |
  | 1M – 10,000,000       | 9 – 10          | 10 – 12        | Borderline small -> big DDT    |
  | 10M – 100,000,000     | 10 – 11         | 11 – 12        | Prefer big DDT; tune T for mem |
  | > 100,000,000         | 11 – 12         | 12             | Big DDT; higher T saves memory |
  +-----------------------+-----------------+----------------+--------------------------------+

Ranges show typical stable regions; pick the lower end of table_shift if memory is ample, the higher end to minimize primary table size. Always validate the actual unique block count against the payload bits. A helper that picks a band from this table is sketched below.
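Purely as an illustration of reading the band table (thresholds mirror the table, and the picks take the lower end of each range; this is not a library API):

    #include <stdint.h>

    /* Illustrative preset picker; not part of libaaruformat. */
    static void pick_presets(uint64_t total_sectors,
                             unsigned *table_shift, unsigned *data_shift)
    {
        if (total_sectors < UINT64_C(50000))          { *table_shift = 0;  *data_shift = 8;  }
        else if (total_sectors < UINT64_C(1000000))   { *table_shift = 8;  *data_shift = 9;  }
        else if (total_sectors < UINT64_C(10000000))  { *table_shift = 9;  *data_shift = 10; }
        else if (total_sectors < UINT64_C(100000000)) { *table_shift = 10; *data_shift = 11; }
        else                                          { *table_shift = 11; *data_shift = 12; }
    }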
NOTE: The library will automatically fall back to BIG DDT where needed; these settings bias structure, they do not guarantee small DDT retention.
Thread-safety: aaru_options is a plain POD struct; caller may copy freely. parse_options() returns by value.
Future compatibility: unknown keys are ignored by the current parser; consumers should preserve original option strings if round-tripping is required.
Definition in file options.h.