Mirror of https://github.com/CCExtractor/ccextractor.git (synced 2026-02-04 05:26:31 +00:00)
docs: Add VOBSUB extraction documentation and subtile-ocr Dockerfile
- Add docs/VOBSUB.md explaining the VOBSUB extraction workflow
- Add tools/vobsubocr/Dockerfile for building subtile-ocr OCR tool
- Document how to convert VOBSUB (.idx/.sub) to SRT using OCR

The Dockerfile uses subtile-ocr (https://github.com/gwen-lg/subtile-ocr), an actively maintained fork of vobsubocr with better accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
129
docs/VOBSUB.md
Normal file
@@ -0,0 +1,129 @@
# VOBSUB Subtitle Extraction from MKV Files

CCExtractor supports extracting VOBSUB (S_VOBSUB) subtitles from Matroska (MKV) containers. VOBSUB is an image-based subtitle format originally from DVD video.

## Overview

VOBSUB subtitles consist of two files:
- `.idx` - Index file containing metadata, palette, and timestamp/position entries
- `.sub` - Binary file containing the actual subtitle bitmap data in MPEG Program Stream format

## Basic Usage

```bash
ccextractor movie.mkv
```

This will extract all VOBSUB tracks and create paired `.idx` and `.sub` files:
- `movie_eng.idx` + `movie_eng.sub` (first English track)
- `movie_eng_1.idx` + `movie_eng_1.sub` (second English track, if present)
- etc.

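To confirm that every extracted index file has its bitmap file next to it, a check along the following lines may help. This is only a sketch: `check_pairs` is a hypothetical helper name, not a CCExtractor feature, and it assumes that each `.idx`/`.sub` pair shares a basename as in the list above.

```bash
# Report every .idx file in a directory whose matching .sub file is
# missing or empty. check_pairs is a hypothetical helper, not part of
# CCExtractor.
check_pairs() {
    local dir="${1:-.}" idx missing=0
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue              # directory has no .idx files
        if [ ! -s "${idx%.idx}.sub" ]; then    # -s: exists and is non-empty
            echo "missing or empty: ${idx%.idx}.sub"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_pairs .` after extraction returns a non-zero status if any pair is incomplete, which makes it usable as a guard before the OCR step.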
## Converting VOBSUB to SRT (Text)

Since VOBSUB subtitles are images, you need OCR (Optical Character Recognition) to convert them to text-based formats like SRT.

### Using subtile-ocr (Recommended)

[subtile-ocr](https://github.com/gwen-lg/subtile-ocr) is an actively maintained Rust tool that provides accurate OCR conversion.

#### Option 1: Docker (Easiest)

We provide a Dockerfile that builds subtile-ocr with all dependencies:

```bash
# Build the Docker image (one-time)
cd tools/vobsubocr
docker build -t subtile-ocr .

# Extract VOBSUB from MKV
ccextractor movie.mkv

# Convert to SRT using OCR (quote the mount path so directories with spaces work)
docker run --rm -v "$(pwd)":/data subtile-ocr -l eng -o /data/movie_eng.srt /data/movie_eng.idx
```

#### Option 2: Install subtile-ocr Natively

If you have Rust and Tesseract development libraries installed:

```bash
# Install dependencies (Ubuntu/Debian)
sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr tesseract-ocr-eng

# Install subtile-ocr
cargo install --git https://github.com/gwen-lg/subtile-ocr

# Convert
subtile-ocr -l eng -o movie_eng.srt movie_eng.idx
```

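When an MKV file yields several tracks, the native conversion command can be wrapped in a loop. The following is a minimal sketch under stated assumptions: `convert_all` and the `OCR_CMD` override are hypothetical names introduced here for illustration, and only the `-l` and `-o` options already shown above are used.

```bash
# Convert every .idx file in a directory to an .srt file beside it.
# convert_all and OCR_CMD are hypothetical names; OCR_CMD defaults to
# the natively installed subtile-ocr binary.
convert_all() {
    local dir="${1:-.}" lang="${2:-eng}" idx
    for idx in "$dir"/*.idx; do
        [ -e "$idx" ] || continue    # nothing to convert
        # OCR_CMD is deliberately unquoted so it may hold a multi-word command
        ${OCR_CMD:-subtile-ocr} -l "$lang" -o "${idx%.idx}.srt" "$idx"
    done
}
```

For example, `convert_all . eng` converts all tracks in the current directory with the English language pack. Note that every track is processed with the same language code, so mixed-language extractions need one call per language.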
### subtile-ocr Options

| Option | Description |
|--------|-------------|
| `-l, --lang <LANG>` | Tesseract language code (required). Examples: `eng`, `fra`, `deu`, `chi_sim` |
| `-o, --output <FILE>` | Output SRT file (stdout if not specified) |
| `-t, --threshold <0.0-1.0>` | Binarization threshold (default: 0.6) |
| `-d, --dpi <DPI>` | Image DPI for OCR (default: 150) |
| `--dump` | Save processed subtitle images as PNG files |

### Language Codes

Install additional Tesseract language packs as needed:

```bash
# Examples
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-spa  # Spanish
sudo apt-get install tesseract-ocr-chi-sim  # Simplified Chinese
```

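Before running OCR it can be useful to check that the language pack is actually installed. The sketch below filters the output of `tesseract --list-langs`; `lang_available` is a hypothetical helper name, and it assumes the command prints one language code per line after a header line.

```bash
# Succeed if the language code given as $1 appears as a whole line in the
# list read from stdin. lang_available is a hypothetical helper name.
lang_available() {
    grep -qx -- "$1"
}

# Usage (older Tesseract releases may print the list on stderr, hence 2>&1):
#   tesseract --list-langs 2>&1 | lang_available fra \
#       || echo "language pack missing: try tesseract-ocr-fra"
```

Matching whole lines (`-x`) avoids false positives such as `chi_sim` matching a request for `chi`.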
## Technical Details

### .idx File Format

The index file contains:
1. Header with metadata (size, palette, alignment settings)
2. Language identifier line
3. Timestamp entries with file positions

Example:
```
# VobSub index file, v7 (do not modify this line!)
size: 720x576
palette: 000000, 828282, ...

id: eng, index: 0
timestamp: 00:01:12:920, filepos: 000000000
timestamp: 00:01:18:640, filepos: 000000800
...
```

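The timestamp entries in the example above can be read with standard shell tools. The following sketch converts each entry to a start time in milliseconds and a decimal byte offset. It assumes the `HH:MM:SS:mmm` layout shown in the example and treats `filepos` as hexadecimal, which is the usual VobSub convention; `idx_entries` is a hypothetical helper name.

```bash
# Print "start_ms byte_offset" for every timestamp entry in an .idx file.
# idx_entries is a hypothetical helper; filepos is assumed to be hexadecimal.
idx_entries() {
    # Split on runs of ':', ',' and ' ' so that
    # $2..$5 = HH MM SS mmm and $7 = filepos
    awk -F'[:, ]+' '/^timestamp:/ {
        ms = (($2 * 60 + $3) * 60 + $4) * 1000 + $5
        print ms, $7
    }' "$1" |
    while read -r ms hex; do
        printf '%d %d\n' "$ms" "0x$hex"    # convert the hex offset to decimal
    done
}
```

For the two entries in the example, this prints `72920 0` and `78640 2048`.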
### .sub File Format

The binary file contains MPEG Program Stream packets:
- Each subtitle is wrapped in a PS Pack header (14 bytes) + PES header (15 bytes)
- Subtitles are aligned to 2048-byte boundaries
- Contains raw SPU (SubPicture Unit) bitmap data

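The 2048-byte alignment explains why `filepos` values in the index are multiples of 0x800. As a rough illustration of the arithmetic only, the sketch below rounds a size up to the next boundary. `align_up_2048` is a hypothetical helper name, and treating a subtitle as a single pack with one 14-byte and one 15-byte header is a simplifying assumption.

```bash
# Round a byte count up to the next multiple of 2048.
# align_up_2048 is a hypothetical helper used only to illustrate the padding.
align_up_2048() {
    echo $(( ($1 + 2047) / 2048 * 2048 ))
}

# Example: a 1500-byte SPU plus the 14-byte pack header and the 15-byte
# PES header occupies 1529 bytes, which pads out to one 2048-byte block.
align_up_2048 $(( 1500 + 14 + 15 ))
```

Under these assumptions, the next subtitle would start at offset 2048 (0x800), matching the second entry in the `.idx` example.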
## Troubleshooting

### Empty output files
- Ensure the MKV file actually contains VOBSUB tracks (check with `mediainfo` or `ffprobe`)
- CCExtractor will report "No VOBSUB subtitles to write" if the track is empty

### OCR quality issues
- Try adjusting the `-t` threshold parameter
- Ensure the correct language pack is installed
- Use `--dump` to inspect the processed images

### Docker permission issues
- The output files may be owned by root; use `sudo chown` to fix ownership
- Or run Docker with `--user $(id -u):$(id -g)`

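For the empty-output case, the track check can be scripted. The sketch below counts VOBSUB tracks in `ffprobe` output; it assumes FFmpeg reports VOBSUB streams under the codec name `dvd_subtitle`, and `count_vobsub` is a hypothetical helper name.

```bash
# Count lines on stdin that mention the dvd_subtitle codec.
# count_vobsub is a hypothetical helper; it prints 0 when nothing matches.
count_vobsub() {
    grep -c 'dvd_subtitle'
}

# Usage: list subtitle streams as "index,codec_name" and count VOBSUB ones:
#   ffprobe -v error -select_streams s \
#       -show_entries stream=index,codec_name -of csv=p=0 movie.mkv | count_vobsub
```

A result of `0` suggests the file has no VOBSUB tracks, so the empty output is expected rather than an extraction failure.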
## See Also

- [OCR.md](OCR.md) - General OCR support in CCExtractor
- [subtile-ocr GitHub](https://github.com/gwen-lg/subtile-ocr) - OCR tool documentation

35
tools/vobsubocr/Dockerfile
Normal file
@@ -0,0 +1,35 @@
# Dockerfile for subtile-ocr - VOBSUB to SRT converter
# Uses subtile-ocr, an actively maintained fork of vobsubocr
# https://github.com/gwen-lg/subtile-ocr

FROM ubuntu:22.04

# Prevent interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive

# Install build dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    clang \
    pkg-config \
    libleptonica-dev \
    libtesseract-dev \
    tesseract-ocr \
    tesseract-ocr-eng \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Install subtile-ocr from git
RUN cargo install --git https://github.com/gwen-lg/subtile-ocr

# Create working directory
WORKDIR /data

# Default command shows help
ENTRYPOINT ["subtile-ocr"]
CMD ["--help"]