ccextractor/docs/VOBSUB.md
Carlos Fernandez 6f2a73d706 docs: Add VOBSUB extraction documentation and subtile-ocr Dockerfile
- Add docs/VOBSUB.md explaining the VOBSUB extraction workflow
- Add tools/vobsubocr/Dockerfile for building subtile-ocr OCR tool
- Document how to convert VOBSUB (.idx/.sub) to SRT using OCR

The Dockerfile uses subtile-ocr (https://github.com/gwen-lg/subtile-ocr),
an actively maintained fork of vobsubocr with better accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 10:26:41 +01:00


# VOBSUB Subtitle Extraction from MKV Files
CCExtractor supports extracting VOBSUB (S_VOBSUB) subtitles from Matroska (MKV) containers. VOBSUB is an image-based subtitle format originally from DVD video.
## Overview
VOBSUB subtitles consist of two files:
- `.idx` - Index file containing metadata, palette, and timestamp/position entries
- `.sub` - Binary file containing the actual subtitle bitmap data in MPEG Program Stream format
## Basic Usage
```bash
ccextractor movie.mkv
```
This will extract all VOBSUB tracks and create paired `.idx` and `.sub` files:
- `movie_eng.idx` + `movie_eng.sub` (first English track)
- `movie_eng_1.idx` + `movie_eng_1.sub` (second English track, if present)
- etc.
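
Because each track is only usable when both files are present, a quick sanity check after extraction can help. The sketch below assumes the output files sit in one directory; `check_vobsub_pairs` is a hypothetical helper name for illustration, not something CCExtractor ships.

```bash
#!/usr/bin/env bash
# check_vobsub_pairs DIR: for each .idx in DIR, report whether a non-empty
# .sub partner exists. Hypothetical helper, not part of CCExtractor.
check_vobsub_pairs() {
    local dir="${1:-.}" idx sub status=0
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue          # the glob matched nothing
        sub="${idx%.idx}.sub"
        if [ -s "$sub" ]; then
            echo "OK      $idx"
        else
            echo "MISSING $sub"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_vobsub_pairs .` after extraction returns a non-zero status if any `.idx` lacks its `.sub` file.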
## Converting VOBSUB to SRT (Text)
Since VOBSUB subtitles are images, you need OCR (Optical Character Recognition) to convert them to text-based formats like SRT.
### Using subtile-ocr (Recommended)
[subtile-ocr](https://github.com/gwen-lg/subtile-ocr) is an actively maintained Rust tool that provides accurate OCR conversion.
#### Option 1: Docker (Easiest)
We provide a Dockerfile that builds subtile-ocr with all dependencies:
```bash
# Build the Docker image (one-time)
cd tools/vobsubocr
docker build -t subtile-ocr .
# Extract VOBSUB from MKV
ccextractor movie.mkv
# Convert to SRT using OCR
docker run --rm -v "$(pwd)":/data subtile-ocr -l eng -o /data/movie_eng.srt /data/movie_eng.idx
```
#### Option 2: Install subtile-ocr Natively
If you have Rust and Tesseract development libraries installed:
```bash
# Install dependencies (Ubuntu/Debian)
sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr tesseract-ocr-eng
# Install subtile-ocr
cargo install --git https://github.com/gwen-lg/subtile-ocr
# Convert
subtile-ocr -l eng -o movie_eng.srt movie_eng.idx
```
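When a file has several VOBSUB tracks, the conversion can be looped over every `.idx` file. This is a rough sketch that assumes the file naming shown under Basic Usage (`movie_eng.idx`, `movie_eng_1.idx`) and that the language part of the name is a valid Tesseract code. That second assumption may not always hold: the container's tag could be, for example, `fre` where Tesseract expects `fra`. The helper names `lang_from_idx` and `ocr_all_idx` are hypothetical.

```bash
#!/usr/bin/env bash
# lang_from_idx FILE: guess the language code from a name such as
# movie_eng.idx or movie_eng_1.idx (assumed naming pattern).
lang_from_idx() {
    local base
    base="$(basename "$1" .idx)"
    # Drop a trailing _<number> track suffix, if present.
    base="$(printf '%s\n' "$base" | sed -E 's/_[0-9]+$//')"
    # Keep whatever follows the last underscore.
    printf '%s\n' "${base##*_}"
}

# ocr_all_idx DIR: run subtile-ocr on every .idx file in DIR,
# writing an .srt file next to each one.
ocr_all_idx() {
    local dir="${1:-.}" idx
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue          # the glob matched nothing
        subtile-ocr -l "$(lang_from_idx "$idx")" -o "${idx%.idx}.srt" "$idx"
    done
}
```

Running `ocr_all_idx .` converts every track in the current directory. If a guessed code is not installed in Tesseract, pass `-l` by hand for that file instead.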
### subtile-ocr Options
| Option | Description |
|--------|-------------|
| `-l, --lang <LANG>` | Tesseract language code (required). Examples: `eng`, `fra`, `deu`, `chi_sim` |
| `-o, --output <FILE>` | Output SRT file (stdout if not specified) |
| `-t, --threshold <0.0-1.0>` | Binarization threshold (default: 0.6) |
| `-d, --dpi <DPI>` | Image DPI for OCR (default: 150) |
| `--dump` | Save processed subtitle images as PNG files |
### Language Codes
Install additional Tesseract language packs as needed:
```bash
# Examples
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-spa # Spanish
sudo apt-get install tesseract-ocr-chi-sim # Simplified Chinese
```
## Technical Details
### .idx File Format
The index file contains:
1. Header with metadata (size, palette, alignment settings)
2. Language identifier line
3. Timestamp entries with file positions
Example:
```
# VobSub index file, v7 (do not modify this line!)
size: 720x576
palette: 000000, 828282, ...
id: eng, index: 0
timestamp: 00:01:12:920, filepos: 000000000
timestamp: 00:01:18:640, filepos: 000000800
...
```
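To make the timestamp entries concrete, the following sketch turns each one into a start time in milliseconds and a byte offset into the `.sub` file. It assumes `filepos` is written in hexadecimal, so the second entry in the sample (`000000800`) corresponds to byte 2048. `idx_entries` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# idx_entries FILE: print "<start_ms> <byte_offset>" for each timestamp line.
# Assumes HH:MM:SS:mmm timestamps and a hexadecimal filepos value.
idx_entries() {
    local line h m s ms pos
    local re='^timestamp: ([0-9]+):([0-9]+):([0-9]+):([0-9]+), filepos: ([0-9A-Fa-f]+)'
    while IFS= read -r line || [ -n "$line" ]; do
        [[ $line =~ $re ]] || continue     # skip header and comment lines
        h=${BASH_REMATCH[1]} m=${BASH_REMATCH[2]}
        s=${BASH_REMATCH[3]} ms=${BASH_REMATCH[4]}
        pos=${BASH_REMATCH[5]}
        # 10# forces base 10 so that values like "08" are not read as octal.
        printf '%d %d\n' \
            "$(( (10#$h * 3600 + 10#$m * 60 + 10#$s) * 1000 + 10#$ms ))" \
            "$(( 16#$pos ))"
    done < "$1"
}
```

For the sample index above, `idx_entries movie_eng.idx` would print `72920 0` and `78640 2048`.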
### .sub File Format
The binary file contains MPEG Program Stream packets:
- Each subtitle is wrapped in a PS Pack header (14 bytes) + PES header (15 bytes)
- Subtitles are aligned to 2048-byte boundaries
- Contains raw SPU (SubPicture Unit) bitmap data
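
One way to spot-check a `.sub` file is to look for the MPEG Program Stream pack start code (`00 00 01 BA`) at the start of the file or at an offset taken from the `.idx` file. The sketch below uses only `od` and `tr`; `has_pack_header` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# has_pack_header FILE [OFFSET]: succeed if the four bytes at OFFSET
# (default 0) are the Program Stream pack start code 00 00 01 BA.
has_pack_header() {
    local file="$1" offset="${2:-0}" bytes
    # Dump 4 bytes as hex, then strip the whitespace od adds.
    bytes="$(od -An -tx1 -j "$offset" -N 4 "$file" | tr -d ' \n')"
    [ "$bytes" = "000001ba" ]
}
```

For example, `has_pack_header movie_eng.sub 2048 && echo "pack header found"` checks the position that a `filepos` value of `000000800` would point to, assuming that value is hexadecimal.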
## Troubleshooting
### Empty output files
- Ensure the MKV file actually contains VOBSUB tracks (check with `mediainfo` or `ffprobe`)
- CCExtractor will report "No VOBSUB subtitles to write" if the track is empty
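
As a sketch of the `ffprobe` check, the function below lists only the VOBSUB subtitle streams. It assumes FFmpeg reports VOBSUB under the codec name `dvd_subtitle` and that the CSV writer emits the stream index, codec name, and language tag in that order; `filter_vobsub` and `list_vobsub_streams` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# filter_vobsub: read "<index>,<codec>,<language>" lines on stdin and print
# "<index>,<language>" for streams whose codec is dvd_subtitle.
filter_vobsub() {
    awk -F, '$2 == "dvd_subtitle" { print $1 "," $3 }'
}

# list_vobsub_streams FILE: ask ffprobe for all subtitle streams in FILE
# and keep only the VOBSUB ones.
list_vobsub_streams() {
    ffprobe -v error -select_streams s \
        -show_entries stream=index,codec_name:stream_tags=language \
        -of csv=p=0 "$1" | filter_vobsub
}
```

If `list_vobsub_streams movie.mkv` prints nothing, the file most likely has no VOBSUB tracks for CCExtractor to extract.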
### OCR quality issues
- Try adjusting the `-t` binarization threshold (default 0.6) in small steps
- Ensure the correct language pack is installed
- Use `--dump` to inspect the processed images
### Docker permission issues
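- Output files written by the container may be owned by root; fix ownership with `sudo chown "$(id -u):$(id -g)" movie_eng.srt`
- Alternatively, add `--user "$(id -u):$(id -g)"` to the `docker run` command so the files are created as your own user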
## See Also
- [OCR.md](OCR.md) - General OCR support in CCExtractor
- [subtile-ocr GitHub](https://github.com/gwen-lg/subtile-ocr) - OCR tool documentation