[PR #1826] fix(dvb): Multiple fixes for DVB subtitle extraction from Chinese broadcasts (#224) #2583

Open
opened 2026-01-29 17:22:55 +00:00 by claunia · 0 comments

Original Pull Request: https://github.com/CCExtractor/ccextractor/pull/1826

State: closed
Merged: Yes


Summary

This PR addresses multiple issues with DVB subtitle extraction from Chinese broadcasts as reported in #224:

  • Fix PMT parsing crash: Added bounds checks to prevent segfault on malformed PMT data
  • Fix negative timestamps: Properly initialize min_pts for DVB subtitle streams
  • Fix OCR crash: Rewrote ignore_alpha_at_edge() to handle edge cases correctly
  • Improve OCR accuracy: Added image inversion and contrast normalization for DVB subtitles
  • Fix --ocrlang parameter: Accept Tesseract language names (chi_tra, chi_sim) directly
  • Fix --dvblang case sensitivity: Added case-insensitive matching for language codes
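As an illustration of the first item, here is a minimal sketch of what bounds-checked PMT parsing can look like. It is written in Python for brevity and is not the code in `parse_PMT()`. The function name `parse_pmt_streams` is hypothetical, and the field layout follows the standard MPEG-TS PMT section syntax. Every length field read from the stream is checked against the bytes actually available before it is used as an offset.

```python
HEADER_LEN = 12  # bytes up to and including program_info_length
CRC_LEN = 4      # trailing CRC32 (not validated in this sketch)


def parse_pmt_streams(section: bytes) -> list[tuple[int, int]]:
    """Return (stream_type, elementary_PID) pairs from one PMT section.

    Hypothetical helper: malformed input yields a shorter (or empty)
    list instead of an out-of-range read.
    """
    if len(section) < HEADER_LEN + CRC_LEN:
        return []  # too short to hold a header and a CRC

    # section_length counts the bytes that follow the 3-byte prefix.
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    total = 3 + section_length
    if total > len(section) or total < HEADER_LEN + CRC_LEN:
        return []  # declared length disagrees with the buffer

    end = total - CRC_LEN  # first byte of the CRC
    program_info_length = ((section[10] & 0x0F) << 8) | section[11]
    pos = HEADER_LEN + program_info_length
    if pos > end:
        return []  # program descriptors overrun the section

    streams = []
    while pos + 5 <= end:  # each ES entry has a 5-byte fixed part
        stream_type = section[pos]
        pid = ((section[pos + 1] & 0x1F) << 8) | section[pos + 2]
        es_info_length = ((section[pos + 3] & 0x0F) << 8) | section[pos + 4]
        if pos + 5 + es_info_length > end:
            break  # ES descriptors overrun the section
        streams.append((stream_type, pid))
        pos += 5 + es_info_length
    return streams
```

The same pattern applies in C: compare each declared length with the remaining buffer size before advancing the read pointer.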

Changes

| File | Description |
|------|-------------|
| `src/lib_ccx/ts_tables.c` | Added bounds checks in `parse_PMT()` |
| `src/lib_ccx/general_loop.c` | Fixed DVB subtitle timing initialization |
| `src/lib_ccx/ocr.c` | Fixed cropping, added inversion + contrast enhancement |
| `src/rust/lib_ccxr/src/common/constants.rs` | Added case-insensitive Language enum |
| `src/rust/lib_ccxr/src/common/options.rs` | Changed ocrlang to String type |
| `src/rust/src/parser.rs` | Updated ocrlang handling |
| `src/rust/src/common.rs` | Updated ocrlang C binding |
| `docs/CHANGES.TXT` | Added changelog entry |
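To make the `ocr.c` entry more concrete, the sketch below shows one common way to invert an image and normalize its contrast before OCR. It is an assumption about the general technique, not the code added in this PR. The function name `invert_and_stretch` is hypothetical, and the image is modeled as a plain list of rows of 8-bit grayscale values. The motivation is that subtitle bitmaps are often light text on a dark background, while OCR engines generally do better with dark text on a light background.

```python
def invert_and_stretch(gray: list[list[int]]) -> list[list[int]]:
    """Invert an 8-bit grayscale image, then stretch it to 0..255.

    Hypothetical helper for illustration only.
    """
    # Inversion: light-on-dark becomes dark-on-light.
    inverted = [[255 - p for p in row] for row in gray]

    flat = [p for row in inverted for p in row]
    if not flat:
        return inverted  # empty image: nothing to normalize

    lo, hi = min(flat), max(flat)
    if hi == lo:
        return inverted  # uniform image: avoid dividing by zero

    # Linear contrast stretch with integer arithmetic:
    # the darkest pixel maps to 0, the brightest to 255.
    span = hi - lo
    return [[(p - lo) * 255 // span for p in row] for row in inverted]
```

For example, `invert_and_stretch([[200, 220], [240, 200]])` returns `[[255, 127], [0, 255]]`: the narrow input range 200..240 is spread across the full 8-bit range.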

Test Results

Tested with the 12GB sample file from issue #224:

  • All timestamps now positive (0.235s, 2.594s, etc. instead of -95000s)
  • OCR accuracy improved from ~70% to ~80-90% for Traditional Chinese
  • No crashes during full file processing (52 seconds)
  • --ocrlang chi_tra now works correctly
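The language-related fixes come down to how command-line values are matched. The sketch below illustrates the idea in Python; it is not the Rust code in `constants.rs` or `parser.rs`. The names `KNOWN_DVB_LANGS`, `match_dvb_lang`, and `resolve_ocr_lang` are hypothetical, and the code set is a small made-up subset.

```python
from typing import Optional

# Hypothetical subset of three-letter codes; the real list is the
# Language enum in constants.rs.
KNOWN_DVB_LANGS = {"chi", "zho", "eng", "fra", "deu"}


def match_dvb_lang(user_value: str) -> Optional[str]:
    """Match a --dvblang value without regard to letter case."""
    code = user_value.strip().lower()
    return code if code in KNOWN_DVB_LANGS else None


def resolve_ocr_lang(user_value: str) -> str:
    """Treat --ocrlang as a free-form string.

    The value is passed through unchanged, so Tesseract names such as
    chi_tra or chi_sim are not rejected for missing from a fixed list.
    """
    return user_value.strip()
```

With this approach `match_dvb_lang("CHI")` and `match_dvb_lang("chi")` give the same result, and `resolve_ocr_lang("chi_tra")` simply returns `"chi_tra"`.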

Test Commands

# Extract with OCR
ccextractor input.ts --codec dvbsub --out=srt -o output.srt --ocrlang chi_tra --no-fontcolor

# Extract without OCR (PNG images)
ccextractor input.ts --codec dvbsub --out=spupng -o output.xml --no-spupngocr
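To check the timestamp result on the SRT produced by the first command, a small script along these lines can help. It is a sketch: the helper name `bad_cues` is hypothetical, and it assumes a negative time would be written with a leading minus sign, which the PR does not confirm.

```python
import re

# SRT timing line, with an optional leading minus (an assumption).
TIMING = re.compile(
    r"^(-?)(\d+):(\d{2}):(\d{2}),(\d{3}) --> (-?)(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(sign: str, h: str, m: str, s: str, ms: str) -> int:
    """Convert the captured timestamp fields to signed milliseconds."""
    value = ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)
    return -value if sign == "-" else value


def bad_cues(srt_text: str) -> list[str]:
    """Return timing lines that are negative or end before they start."""
    bad = []
    for line in srt_text.splitlines():
        match = TIMING.match(line.strip())
        if not match:
            continue  # cue number, subtitle text, or blank line
        fields = match.groups()
        start, end = to_ms(*fields[:5]), to_ms(*fields[5:])
        if start < 0 or end < start:
            bad.append(line.strip())
    return bad
```

Reading `output.srt` as UTF-8 text and passing its contents to `bad_cues` should give an empty list if every cue has a non-negative start time and does not end before it starts.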

Fixes #224

🤖 Generated with Claude Code

claunia added the pull-request label 2026-01-29 17:22:55 +00:00
Reference: starred/ccextractor#2583