ccextractor/docs/VOBSUB.md
Carlos Fernandez 6f2a73d706 docs: Add VOBSUB extraction documentation and subtile-ocr Dockerfile
- Add docs/VOBSUB.md explaining the VOBSUB extraction workflow
- Add tools/vobsubocr/Dockerfile for building subtile-ocr OCR tool
- Document how to convert VOBSUB (.idx/.sub) to SRT using OCR

The Dockerfile uses subtile-ocr (https://github.com/gwen-lg/subtile-ocr),
an actively maintained fork of vobsubocr with better accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-28 10:26:41 +01:00

VOBSUB Subtitle Extraction from MKV Files

CCExtractor supports extracting VOBSUB (S_VOBSUB) subtitles from Matroska (MKV) containers. VOBSUB is an image-based subtitle format originally from DVD video.

Overview

VOBSUB subtitles consist of two files:

  • .idx - Index file containing metadata, palette, and timestamp/position entries
  • .sub - Binary file containing the actual subtitle bitmap data in MPEG Program Stream format

Basic Usage

ccextractor movie.mkv

This will extract all VOBSUB tracks and create paired .idx and .sub files:

  • movie_eng.idx + movie_eng.sub (first English track)
  • movie_eng_1.idx + movie_eng_1.sub (second English track, if present)
  • etc.
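
Because each track is written as a pair, a quick sanity check after extraction is to confirm that every .idx file has a non-empty .sub file beside it. The following is a minimal sketch in POSIX shell; check_vobsub_pairs is a hypothetical helper name, not something CCExtractor ships.

```shell
# check_vobsub_pairs [DIR]: report, for every .idx file in DIR (default: the
# current directory), whether a non-empty .sub file with the same base name
# exists. Hypothetical helper, not part of CCExtractor.
# The body runs in a subshell so its variables do not leak into the caller.
check_vobsub_pairs() (
    dir="${1:-.}"
    status=0
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue            # the glob matched nothing
        sub="${idx%.idx}.sub"
        if [ -s "$sub" ]; then               # exists and is not empty
            printf 'ok       %s\n' "${idx##*/}"
        else
            printf 'missing  %s\n' "${sub##*/}"
            status=1
        fi
    done
    [ "$status" -eq 0 ]                      # exit status: 0 only if all pairs are complete
)
```

Calling check_vobsub_pairs . after ccextractor movie.mkv prints one line per track and returns non-zero if any .sub file is missing or empty.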

Converting VOBSUB to SRT (Text)

Since VOBSUB subtitles are images, you need OCR (Optical Character Recognition) to convert them to text-based formats like SRT.

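subtile-ocr (https://github.com/gwen-lg/subtile-ocr) is an actively maintained Rust tool, a fork of vobsubocr, that runs Tesseract OCR on each subtitle image and writes the recognized text as SRT.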

Option 1: Docker (Easiest)

We provide a Dockerfile that builds subtile-ocr with all dependencies:

# Build the Docker image (one-time)
cd tools/vobsubocr
docker build -t subtile-ocr .

# Extract VOBSUB from MKV
ccextractor movie.mkv

# Convert to SRT using OCR
docker run --rm -v "$(pwd)":/data subtile-ocr -l eng -o /data/movie_eng.srt /data/movie_eng.idx

Option 2: Install subtile-ocr Natively

If you have Rust and Tesseract development libraries installed:

# Install dependencies (Ubuntu/Debian)
sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr tesseract-ocr-eng

# Install subtile-ocr
cargo install --git https://github.com/gwen-lg/subtile-ocr

# Convert
subtile-ocr -l eng -o movie_eng.srt movie_eng.idx
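
When an MKV yields several tracks, the conversion command above can be applied to every .idx file in turn. The sketch below is one way to do that in POSIX shell. The function name ocr_all_idx and the OCR_CMD override variable are hypothetical. The sketch also assumes a single Tesseract language is appropriate for all tracks, which will not hold for multi-language discs.

```shell
# ocr_all_idx DIR LANG: run the OCR tool on every .idx file in DIR and write
# an .srt file next to it. Only the -l and -o options documented above are used.
# OCR_CMD (hypothetical) names the executable to call; it defaults to
# subtile-ocr and must be a single command name, such as a wrapper script.
ocr_all_idx() (
    dir="$1"
    lang="$2"
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue            # the glob matched nothing
        "${OCR_CMD:-subtile-ocr}" -l "$lang" -o "${idx%.idx}.srt" "$idx" \
            || printf 'OCR failed for %s\n' "$idx" >&2
    done
)
```

For example, ocr_all_idx . eng would produce movie_eng.srt from movie_eng.idx, movie_eng_1.srt from movie_eng_1.idx, and so on.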

subtile-ocr Options

Option                       Description
-l, --lang <LANG>            Tesseract language code (required). Examples: eng, fra, deu, chi_sim
-o, --output <FILE>          Output SRT file (stdout if not specified)
-t, --threshold <0.0-1.0>    Binarization threshold (default: 0.6)
-d, --dpi <DPI>              Image DPI for OCR (default: 150)
--dump                       Save processed subtitle images as PNG files

Language Codes

Install additional Tesseract language packs as needed:

# Examples
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-spa  # Spanish
sudo apt-get install tesseract-ocr-chi-sim  # Simplified Chinese
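
Note that the Tesseract code passed to -l uses an underscore (chi_sim) while the Debian/Ubuntu package name uses a hyphen (tesseract-ocr-chi-sim). The sketch below derives the package name from the language code. It assumes the naming pattern seen in the examples above holds for the language you need, which is worth confirming with apt-cache search tesseract-ocr. The function name tess_pkg is hypothetical.

```shell
# tess_pkg LANG: print the Debian/Ubuntu package name assumed to provide the
# given Tesseract language code (underscores become hyphens).
# Hypothetical helper; the mapping is inferred from the examples above.
tess_pkg() {
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$1" | tr '_' '-')"
}
```

It could then be used as sudo apt-get install "$(tess_pkg chi_sim)".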

Technical Details

.idx File Format

The index file contains:

  1. Header with metadata (size, palette, alignment settings)
  2. Language identifier line
  3. Timestamp entries (HH:MM:SS:mmm) with file positions (hexadecimal byte offsets into the .sub file)

Example:

# VobSub index file, v7 (do not modify this line!)
size: 720x576
palette: 000000, 828282, ...

id: eng, index: 0
timestamp: 00:01:12:920, filepos: 000000000
timestamp: 00:01:18:640, filepos: 000000800
...
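
To make the entry layout concrete, the sketch below reads an .idx file and prints each entry as a start time in milliseconds and a decimal byte offset. It treats filepos as hexadecimal and the timestamp as HH:MM:SS:mmm, matching the example above. The function name idx_entries is hypothetical, and the parsing assumes the exact "timestamp: ..., filepos: ..." spelling shown here.

```shell
# idx_entries FILE: print "<start in ms> <byte offset in decimal>" for every
# timestamp line of a VobSub .idx file. Hypothetical helper for illustration.
idx_entries() {
    # Drop carriage returns (some .idx files use CRLF line endings), then keep
    # only the timestamp and filepos fields of each entry line.
    tr -d '\r' < "$1" |
    sed -n 's/^timestamp: \([0-9:]*\), filepos: \([0-9A-Fa-f]*\).*/\1 \2/p' |
    while read -r ts pos; do
        # awk reads "08" or "009" as decimal, avoiding shell octal pitfalls.
        ms=$(printf '%s\n' "$ts" |
             awk -F: '{ print (($1 * 60 + $2) * 60 + $3) * 1000 + $4 }')
        # The 0x prefix makes printf interpret filepos as hexadecimal.
        printf '%d %d\n' "$ms" "0x$pos"
    done
}
```

For the two entries in the example, this prints 72920 0 and 78640 2048, so the second subtitle starts 2048 bytes into the .sub file.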

.sub File Format

The binary file contains MPEG Program Stream packets:

  • Each subtitle is wrapped in a PS Pack header (14 bytes) + PES header (15 bytes)
  • Subtitles are aligned to 2048-byte boundaries
  • Contains raw SPU (SubPicture Unit) bitmap data
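
One way to check this layout by hand is to look for the MPEG Program Stream pack start code (bytes 00 00 01 BA) at an offset where a subtitle is expected to begin. The sketch below does this with od. The function name has_pack_start is hypothetical, and the check only inspects the first four bytes of the pack header, not the PES header or the SPU payload.

```shell
# has_pack_start FILE OFFSET: succeed if the four bytes at decimal OFFSET in
# FILE are the MPEG Program Stream pack start code 00 00 01 BA.
# Hypothetical helper for illustration.
has_pack_start() {
    # od prints the bytes as lowercase hex pairs; strip the whitespace.
    [ "$(od -An -tx1 -j "$2" -N 4 "$1" | tr -d ' \n')" = "000001ba" ]
}
```

For instance, has_pack_start movie_eng.sub 2048 && echo "pack header found" tests the offset that corresponds to filepos 000000800 in an .idx file.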

Troubleshooting

Empty output files

  • Ensure the MKV file actually contains VOBSUB tracks (check with mediainfo or ffprobe)
  • CCExtractor will report "No VOBSUB subtitles to write" if the track is empty

OCR quality issues

  • Try adjusting the -t threshold parameter
  • Ensure the correct language pack is installed
  • Use --dump to inspect the processed images
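
When tuning the threshold, it can help to produce one SRT file per candidate value and compare them side by side. The sketch below does this using only the -l, -t, and -o options listed above. The function name sweep_thresholds, the OCR_CMD override variable, and the .t<value>.srt output naming are all hypothetical.

```shell
# sweep_thresholds IDX LANG T1 [T2 ...]: run the OCR tool once per threshold
# and write each result to <base>.t<threshold>.srt for comparison.
# OCR_CMD (hypothetical) names the executable; it defaults to subtile-ocr.
sweep_thresholds() (
    idx="$1"
    lang="$2"
    shift 2
    for t in "$@"; do
        "${OCR_CMD:-subtile-ocr}" -l "$lang" -t "$t" \
            -o "${idx%.idx}.t$t.srt" "$idx"
    done
)
```

For example, sweep_thresholds movie_eng.idx eng 0.4 0.5 0.6 0.7 would write movie_eng.t0.4.srt through movie_eng.t0.7.srt, which can then be compared with diff.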

Docker permission issues

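  • The output files may be owned by root; fix ownership with sudo chown, for example sudo chown "$(id -u):$(id -g)" movie_eng.srt
  • Or run Docker with --user "$(id -u):$(id -g)" so the files are created with your own user and group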

See Also