
/*
* This file is part of the Aaru Data Preservation Suite.
* Copyright (c) 2019-2025 Natalia Portillo.
*
* This library is free software; you can redistribute it and/or modify
* it under the terms of the GNU Lesser General Public License as
* published by the Free Software Foundation; either version 2.1 of the
* License, or (at your option) any later version.
*
* This library is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public
* License along with this library; if not, see <http://www.gnu.org/licenses/>.
*/
#ifndef LIBAARUFORMAT_OPTIONS_H
#define LIBAARUFORMAT_OPTIONS_H
#include <stdbool.h> ///< For bool type used in aaru_options.
#include <stdint.h> ///< For fixed-width integer types.
/** \file aaruformat/structs/options.h
* \brief Image creation / open tuning options structure and related semantics.
*
* The library accepts a semicolon-delimited key=value options string (see parse_options()). Recognized keys:
* compress=true|false Enable/disable block compression (LZMA for data blocks, FLAC for audio tracks).
* deduplicate=true|false If true, identical (duplicate) sectors are stored once (DDT entries point to same
* physical block). If false, duplicates are still tracked in DDT but each occurrence
* is stored independently (no storage savings). DDT itself is always present.
* dictionary=<bytes> LZMA dictionary size in bytes (fallback default 33554432 if 0 or invalid).
* table_shift=<n> DDT v2 table shift (default 9) (items per primary entry = 2^n when multi-level).
 * data_shift=<n> Global data shift (default 12). Defines per-block address granularity: the low
 * n bits encode the sector (or unit) offset within a block (a span of 2^n offsets); higher bits
 * combine with block_alignment to derive block file offsets. Used by DDT but not limited to it.
* block_alignment=<n> log2 alignment of underlying data blocks (default 9 => 512 bytes) (block size = 2^n).
* md5=true|false Generate MD5 checksum (stored in checksum block if true).
* sha1=true|false Generate SHA-1 checksum.
* sha256=true|false Generate SHA-256 checksum.
* blake3=true|false Generate BLAKE3 checksum (may require build-time support; ignored if unsupported).
* spamsum=true|false Generate SpamSum fuzzy hash.
*
* Defaults (when option string NULL or key omitted):
* compress=true, deduplicate=true, dictionary=33554432, table_shift=9, data_shift=12,
* block_alignment=9, md5=false, sha1=false, sha256=false, blake3=false, spamsum=false.
*
* Validation / normalization done in parse_options():
* - Zero / missing dictionary resets to default 33554432.
* - Zero table_shift resets to 9.
* - Zero data_shift resets to 12.
* - Zero block_alignment resets to 9.
*
* Rationale:
* - table_shift, data_shift and block_alignment mirror fields stored in on-disk headers (see AaruHeaderV2 &
* DdtHeader2); data_shift is a global per-block granularity exponent (not DDT-specific) governing how in-block offsets
* are encoded.
* - compress selects adaptive codec usage: LZMA applied to generic/data blocks, FLAC applied to audio track payloads.
* - deduplicate toggles storage optimization only: the DDT directory is always built for addressing; disabling simply
* forces each sector's content to be written even if already present (useful for forensic byte-for-byte
* duplication).
* - dictionary tunes compression ratio/memory use; large values increase memory footprint.
* - Checksums are optional; enabling multiple increases CPU time at write finalization.
*
* Performance / space trade-offs (deduplicate=false):
* - Significantly larger image size: every repeated sector payload is written again.
* - Higher write I/O and longer creation time for highly redundant sources (e.g., zero-filled regions) compared to
* deduplicate=true, although CPU time spent on duplicate detection/hash lookups is reduced.
* - Potentially simpler post-process forensic validation (physical ordering preserved without logical coalescing).
* - Use when exact physical repetition is more critical than storage efficiency, or to benchmark raw device
* throughput.
* - For typical archival use-cases with large zero / repeated patterns, deduplicate=true markedly reduces footprint.
*
* Approximate in-RAM hash map usage for deduplication (deduplicate=true):
* The on-disk DDT can span many secondary tables, but only the primary table plus a currently loaded secondary (and
* possibly a small cache) reside in memory; their footprint is typically <<5% of total indexed media space and is
* often negligible compared to the hash map used to detect duplicate sectors. Therefore we focus here on the hash /
* lookup structure ("hash_map") memory, not the entire DDT on-disk size.
*
* Worst-case (all sectors unique) per 1 GiB of user data:
* sectors_per_GiB = 2^30 / sector_size
* hash_bytes ≈ sectors_per_GiB * H (H ≈ 16 bytes: 8-byte fingerprint + ~8 bytes map overhead)
*
* Resulting hash_map RAM per GiB (unique sectors):
* +--------------+------------------+------------------------------+
* | Sector size | Sectors / GiB | Hash map (~16 B / sector) |
* +--------------+------------------+------------------------------+
 * | 512 bytes    | 2,097,152        | ~33.5 MiB (≈32.0–36.0 MiB)   |
 * | 2048 bytes   | 524,288          | ~ 8.0 MiB (≈7.5–8.5 MiB)     |
 * | 4096 bytes   | 262,144          | ~ 4.0 MiB (≈3.8–4.3 MiB)     |
* +--------------+------------------+------------------------------+
*
* (Range reflects allocator + load factor variation.)
*
* Targeted projections (hash map only, R=1):
 * 2048-byte sectors (~8 MiB per GiB unique)
* Capacity | Hash map (MiB) | Hash map (GiB)
* ---------+---------------+----------------
* 25 GiB | ~200 | 0.20
* 50 GiB | ~400 | 0.39
*
 * 512-byte sectors (~34 MiB per GiB unique; using 33.5 MiB for calc)
* Capacity | Hash map (MiB) | Hash map (GiB)
* ---------+---------------+----------------
* 128 GiB | ~4288 | 4.19
* 500 GiB | ~16750 | 16.36
* 1 TiB* | ~34304 | 33.50
* 2 TiB* | ~68608 | 67.00
*
 * *TiB = 1024 GiB binary. For decimal TB reduce by ~9% (×0.91, since 1 TB = 10^12 B ≈ 0.9095 TiB).
*
* Duplicate ratio scaling:
* Effective hash RAM ≈ table_value * R, where R = unique_sectors / total_sectors.
* Example: 500 GiB @512 B, R=0.4 ⇒ ~16750 MiB * 0.4 ≈ 6700 MiB (~6.54 GiB).
*
* Quick rule of thumb (hash only):
 * hash_bytes_per_GiB ≈ 16 * (2^30 / sector_size) ≈ (1.718e10 / sector_size) bytes
* → ≈ 33.6 MiB (512 B), 8.4 MiB (2048 B), 4.2 MiB (4096 B) per GiB unique.
*
* Memory planning tip:
* If projected hash_map usage risks exceeding available RAM, consider:
* - Increasing table_shift (reduces simultaneous secondary loads / contention)
* - Lowering data_shift (if practical) to encourage earlier big DDT adoption with fewer unique blocks
* - Segmenting the dump into phases (if workflow permits)
* - Accepting higher duplicate ratio by pre-zero detection or sparse treatment externally.
* - Resuming the dump in multiple passes: each resume rebuilds the hash_map from scratch, so peak RAM still
* matches a single-pass estimate, but average RAM over total wall time can drop if you unload between passes.
*
* NOTE: DDT in-RAM portion (primary + one secondary) usually adds only a few additional MiB even for very large
* images, hence omitted from sizing tables. Include +5% safety margin if extremely tight on memory.
*
* Guidance for table_shift / data_shift selection:
* Let:
* S = total logical sectors expected in image (estimate if unknown).
* T = table_shift (items per primary DDT entry = 2^T when multi-level; 0 => single-level).
* D = data_shift (in-block sector offset span = 2^D).
* BA = block_alignment (bytes) = 2^block_alignment.
* SS = sector size (bytes).
*
* 1. data_shift constraints:
* - For SMALL DDT entries (12 payload bits after status): D must satisfy 0 < D < 12 and (12 - D) >= 1 so that at
* least one bit remains for block index. Practical range for small DDT: 6..10 (leaves 2+ bits for block index).
* - For BIG DDT entries (28 payload bits after status): D may be larger (up to 27) but values >16 rarely useful.
* - Effective address granularity inside a block = min(2^D * SS, physical block span implied by BA).
* - Choosing D too large wastes bits (larger offset range than block actually contains) and reduces the number of
* block index bits within a small entry, potentially forcing upgrade to big DDT earlier.
*
* Recommended starting points:
 * * 512-byte sectors, 512-byte block alignment: D=9 (512 offsets) or D=8 (256 offsets) keeps small DDT viable.
 * * 2048-byte optical sectors, 2048-byte alignment: D=8 (256 offsets) typically sufficient.
* * Mixed / large logical block sizes: keep D so that (2^D * SS) ≈ typical dedup block region you want
* addressable.
*
* 2. block capacity within an entry:
* - SMALL DDT: usable block index bits = 12 - D.
* Max representable block index (small) = 2^(12-D) - 1.
* - BIG DDT: usable block index bits = 28 - D.
* Max representable block index (big) = 2^(28-D) - 1.
* - If (requiredBlockIndex > max) you must either reduce D or rely on big DDT.
*
 * Approximate requiredBlockIndex ≈ TotalUniqueBlocks, where
 * TotalUniqueBlocks ≈ (S * SS) / (BA * 2^D) = S / (2^D * (BA / SS)).
 * Simplified (assuming BA = SS): TotalUniqueBlocks ≈ S / 2^D.
*
* 3. table_shift considerations (multi-level DDT):
* - Primary entries count ≈ ceil(S / 2^T). Choose T so this count fits memory and keeps lookup fast.
* - Larger T reduces primary table size, increasing secondary table dereferences.
* - Typical balanced values: T in [8..12] (256..4096 sectors per primary entry).
* - Set T=0 for single-level when S is small enough that all entries fit comfortably in memory.
*
* Memory rough estimate for single-level SMALL DDT:
* bytes ≈ S * 2 (each small entry 2 bytes). For BIG DDT: bytes ≈ S * 4.
* Multi-level: primary table bytes ≈ (S / 2^T) * entrySize + sum(secondary tables).
*
* 4. Example scenarios:
 * - 50M sectors (≈25 GiB @512B), want small DDT: pick D=8 (256); block index bits = 12-8 = 4 (max index 15),
 * far short of the ≈195k unique blocks implied by S / 2^D; lowering D only raises the required block count,
 * so accept BIG DDT (28-8 = 20 bits => ~1M block indices, ample here). Prefer BIG DDT.
 * - 2M sectors, 2048B alignment, optical: D=8 gives S / 2^D ≈ 7812 unique blocks; small DDT block index bits=4
 * (max index 15) inadequate, and even D=6 (6 bits, max index 63) falls short of the ≈31k blocks then needed,
 * so big DDT is the practical choice.
*
* 5. Practical recommendations:
* - If unsure and image > ~1M sectors: keep defaults (data_shift=12, table_shift=9) and allow big DDT.
* - For small archival (<100k sectors): T=0 (single-level), D≈8..10 to keep small DDT feasible.
* - Benchmark before lowering D purely to stay in small DDT; increased secondary lookups or larger primary tables
* can offset saved space.
*
* Recommended presets (approximate bands):
* +----------------------+----------------------+---------------------------+-------------------------------+
* | Total logical sectors | table_shift (T) | data_shift (D) | Notes |
* +----------------------+----------------------+---------------------------+-------------------------------+
* | < 50,000 | 0 | 8 10 | Single-level small DDT likely |
* | 50K 1,000,000 | 8 9 | 9 10 | Still feasible small DDT |
* | 1M 10,000,000 | 9 10 | 10 12 | Borderline small -> big DDT |
* | 10M 100,000,000 | 10 11 | 11 12 | Prefer big DDT; tune T for mem|
* | > 100,000,000 | 11 12 | 12 | Big DDT; higher T saves memory|
* +----------------------+----------------------+---------------------------+-------------------------------+
* Ranges show typical stable regions; pick the lower end of table_shift if memory is ample, higher if minimizing
* primary table size. Always validate actual unique block count vs payload bits.
*
* NOTE: The library will automatically fall back to BIG DDT where needed; these settings bias structure, they do not
* guarantee small DDT retention.
*
* Thread-safety: aaru_options is a plain POD struct; caller may copy freely. parse_options() returns by value.
*
* Future compatibility: unknown keys are ignored by current parser; consumers should preserve original option
* strings if round-tripping is required.
*/
/** \struct aaru_options
* \brief Parsed user-specified tunables controlling compression, deduplication, hashing and DDT geometry.
*
* All shifts are exponents of two.
*/
typedef struct
{
bool compress; ///< Enable adaptive compression (LZMA for data blocks, FLAC for audio). Default: true.
bool deduplicate; ///< Storage dedup flag (DDT always exists). true=share identical sector content, false=store
///< each instance.
uint32_t dictionary; ///< LZMA dictionary size in bytes (>= 4096 recommended). Default: 33554432 (32 MiB).
uint8_t table_shift; ///< DDT table shift (multi-level fan-out exponent). Default: 9.
uint8_t data_shift; ///< Global data shift: low bits encode sector offset inside a block (2^data_shift span). Default: 12.
uint8_t block_alignment; ///< log2 underlying block alignment (2^n bytes). Default: 9 (512 bytes).
bool md5; ///< Generate MD5 checksum (ChecksumAlgorithm::Md5) when finalizing image.
bool sha1; ///< Generate SHA-1 checksum (ChecksumAlgorithm::Sha1) when finalizing image.
bool sha256; ///< Generate SHA-256 checksum (ChecksumAlgorithm::Sha256) when finalizing image.
bool blake3; ///< Generate BLAKE3 checksum if supported (not stored if algorithm unavailable).
bool spamsum; ///< Generate SpamSum fuzzy hash (ChecksumAlgorithm::SpamSum) if enabled.
} aaru_options;
#endif // LIBAARUFORMAT_OPTIONS_H