[PR #1943] perf(dvb): Lazy OCR initialization for DVB subtitle decoder #2744

Open
opened 2026-01-29 17:23:42 +00:00 by claunia · 0 comments
Owner

Original Pull Request: https://github.com/CCExtractor/ccextractor/pull/1943

State: closed
Merged: Yes


Summary

  • Defers Tesseract OCR initialization until a DVB bitmap region actually needs OCR processing
  • Eliminates ~10 second startup overhead for files with DVB streams that don't produce bitmap output

Problem

Previously, OCR was initialized eagerly in dvbsub_init_decoder() whenever a DVB subtitle stream was detected. This caused performance issues:

  1. Unnecessary startup cost: Files with DVB streams but no actual bitmap subtitles (or alongside CEA-608 text captions) paid a ~10 second Tesseract initialization penalty
  2. Valgrind test timeouts: Tesseract's OpenMP thread pool generated 747,000+ futex syscalls, causing valgrind tests 238/239 to take 15+ minutes and timeout

Solution

Move init_ocr() call from dvbsub_init_decoder() to the first actual OCR usage point in dvbsub_decode_region_segment(). An ocr_initialized flag ensures single initialization.

Performance Results

File Type Before After
Pure CEA-608 (no DVB streams) ~10s 0.1s
DVB + CEA-608 (11MB M2TS) ~10s 3s
DVB + CEA-608 (18MB M2TS) ~15s 1s

Test plan

  • Build succeeds
  • Pure CEA-608 files: No OCR initialization, instant processing
  • DVB+CEA-608 files: OCR initialized only when bitmap regions processed
  • OCR still works correctly when DVB bitmaps are present

🤖 Generated with Claude Code

**Original Pull Request:** https://github.com/CCExtractor/ccextractor/pull/1943 **State:** closed **Merged:** Yes --- ## Summary - Defers Tesseract OCR initialization until a DVB bitmap region actually needs OCR processing - Eliminates ~10 second startup overhead for files with DVB streams that don't produce bitmap output ## Problem Previously, OCR was initialized eagerly in `dvbsub_init_decoder()` whenever a DVB subtitle stream was detected. This caused performance issues: 1. **Unnecessary startup cost**: Files with DVB streams but no actual bitmap subtitles (or alongside CEA-608 text captions) paid a ~10 second Tesseract initialization penalty 2. **Valgrind test timeouts**: Tesseract's OpenMP thread pool generated 747,000+ futex syscalls, causing valgrind tests 238/239 to take 15+ minutes and timeout ## Solution Move `init_ocr()` call from `dvbsub_init_decoder()` to the first actual OCR usage point in `dvbsub_decode_region_segment()`. An `ocr_initialized` flag ensures single initialization. ## Performance Results | File Type | Before | After | |-----------|--------|-------| | Pure CEA-608 (no DVB streams) | ~10s | **0.1s** | | DVB + CEA-608 (11MB M2TS) | ~10s | **3s** | | DVB + CEA-608 (18MB M2TS) | ~15s | **1s** | ## Test plan - [x] Build succeeds - [x] Pure CEA-608 files: No OCR initialization, instant processing - [x] DVB+CEA-608 files: OCR initialized only when bitmap regions processed - [x] OCR still works correctly when DVB bitmaps are present 🤖 Generated with [Claude Code](https://claude.com/claude-code)
claunia added the pull-request label 2026-01-29 17:23:42 +00:00
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/ccextractor#2744