[PR #1826] [MERGED] fix(dvb): Multiple fixes for DVB subtitle extraction from Chinese broadcasts (#224) #2579

opened 2026-01-29 17:22:54 +00:00 by claunia · 0 comments

📋 Pull Request Information

Original PR: https://github.com/CCExtractor/ccextractor/pull/1826
Author: @cfsmp3
Created: 12/14/2025
Status: Merged
Merged: 12/15/2025
Merged by: @cfsmp3

Base: master ← Head: fix/issue-224-chinese-dvb


📝 Commits (3)

  • e4cd0db fix(dvb): Multiple fixes for DVB subtitle extraction from Chinese broadcasts (#224)
  • fdef1d9 fix(ocr): Fix crashes in DVB subtitle color detection
  • e84369c fix(dvb): Fix zero-duration subtitles and overlaps during PTS jumps

📊 Changes

11 files changed (+284 additions, -64 deletions)

📝 .gitignore (+6 -0)
📝 docs/CHANGES.TXT (+7 -0)
📝 src/lib_ccx/ccx_common_structs.h (+30 -25)
📝 src/lib_ccx/dvb_subtitle_decoder.c (+53 -7)
📝 src/lib_ccx/general_loop.c (+34 -1)
📝 src/lib_ccx/ocr.c (+123 -20)
📝 src/lib_ccx/ts_tables.c (+15 -0)
📝 src/rust/lib_ccxr/src/common/constants.rs (+1 -1)
📝 src/rust/lib_ccxr/src/common/options.rs (+2 -1)
📝 src/rust/src/common.rs (+4 -7)
📝 src/rust/src/parser.rs (+9 -2)

📄 Description

Summary

This PR addresses multiple issues with DVB subtitle extraction from Chinese broadcasts as reported in #224:

  • Fix PMT parsing crash: Added bounds checks to prevent segfault on malformed PMT data
  • Fix negative timestamps: Properly initialize min_pts for DVB subtitle streams
  • Fix OCR crash: Rewrote ignore_alpha_at_edge() to handle edge cases correctly
  • Improve OCR accuracy: Added image inversion and contrast normalization for DVB subtitles
  • Fix --ocrlang parameter: Accept Tesseract language names (chi_tra, chi_sim) directly
  • Fix --dvblang case sensitivity: Added case-insensitive matching for language codes
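The OCR preprocessing mentioned in the list above can be illustrated with a small sketch. It assumes an 8-bit grayscale buffer and treats "contrast normalization" as a simple min-max stretch; both are assumptions for illustration, and the function names are hypothetical. The actual code in `src/lib_ccx/ocr.c` is not reproduced here and may work differently.

```rust
/// Inverts an 8-bit grayscale image in place, so light text on a dark
/// background becomes dark text on a light background.
fn invert(pixels: &mut [u8]) {
    for p in pixels.iter_mut() {
        *p = 255 - *p;
    }
}

/// Stretches pixel values so the darkest pixel maps to 0 and the
/// brightest to 255 (min-max normalization; an assumed technique).
/// Empty or flat images are left unchanged.
fn stretch_contrast(pixels: &mut [u8]) {
    let Some(&lo) = pixels.iter().min() else { return };
    let Some(&hi) = pixels.iter().max() else { return };
    if hi == lo {
        return; // Flat image: nothing to stretch, and avoids dividing by zero.
    }
    let range = u32::from(hi - lo);
    for p in pixels.iter_mut() {
        *p = (u32::from(*p - lo) * 255 / range) as u8;
    }
}
```

Running `invert` followed by `stretch_contrast` on a low-contrast bitmap spreads its values over the full 0..=255 range before the image is handed to the OCR engine.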

Changes

| File | Description |
|------|-------------|
| `src/lib_ccx/ts_tables.c` | Added bounds checks in `parse_PMT()` |
| `src/lib_ccx/general_loop.c` | Fixed DVB subtitle timing initialization |
| `src/lib_ccx/ocr.c` | Fixed cropping, added inversion + contrast enhancement |
| `src/rust/lib_ccxr/src/common/constants.rs` | Added case-insensitive Language enum |
| `src/rust/lib_ccxr/src/common/options.rs` | Changed ocrlang to String type |
| `src/rust/src/parser.rs` | Updated ocrlang handling |
| `src/rust/src/common.rs` | Updated ocrlang C binding |
| `docs/CHANGES.TXT` | Added changelog entry |
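To illustrate the kind of bounds checking added to `parse_PMT()`, here is a minimal Rust sketch of walking the elementary-stream loop of a PMT section. It is not the C code from `ts_tables.c`: the `EsEntry` type and `parse_es_loop` function are hypothetical, and the sketch assumes the caller has already skipped the PMT header and program descriptors and stripped the trailing CRC. The 5-byte entry layout (stream type, 13-bit PID, 12-bit ES info length) follows the MPEG-2 transport stream format.

```rust
/// One elementary-stream entry from a PMT stream loop (illustrative type).
#[derive(Debug, PartialEq, Eq)]
struct EsEntry {
    stream_type: u8,
    pid: u16,
    descriptors: Vec<u8>,
}

/// Walks the stream loop and returns `None` as soon as a declared
/// length would run past the end of the buffer, instead of reading
/// out of bounds.
fn parse_es_loop(buf: &[u8]) -> Option<Vec<EsEntry>> {
    const ES_HEADER_LEN: usize = 5;
    let mut entries = Vec::new();
    let mut pos = 0usize;
    while pos < buf.len() {
        // Bounds check 1: the fixed 5-byte entry header must fit.
        let header = buf.get(pos..pos + ES_HEADER_LEN)?;
        let stream_type = header[0];
        let pid = (u16::from(header[1] & 0x1F) << 8) | u16::from(header[2]);
        let es_info_len = (usize::from(header[3] & 0x0F) << 8) | usize::from(header[4]);
        // Bounds check 2: the declared descriptor bytes must fit.
        let start = pos + ES_HEADER_LEN;
        let end = start.checked_add(es_info_len)?;
        let descriptors = buf.get(start..end)?.to_vec();
        entries.push(EsEntry { stream_type, pid, descriptors });
        pos = end;
    }
    Some(entries)
}
```

A malformed section whose ES info length exceeds the remaining bytes yields `None`, so the caller can skip the table rather than crash.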

Test Results

Tested with the 12GB sample file from issue #224:

  • All timestamps now positive (0.235s, 2.594s, etc. instead of -95000s)
  • OCR accuracy improved from ~70% to ~80-90% for Traditional Chinese
  • No crashes during full file processing (52 seconds)
  • --ocrlang chi_tra now works correctly
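The timestamp result above can be made concrete with a short sketch. MPEG presentation timestamps tick at 90 kHz, and subtitle times are reported relative to a minimum PTS. The `SubtitleClock` type below is hypothetical and only illustrates why the reference point must be set before it is used; it does not reproduce the logic in `general_loop.c`, and the PTS values in the usage note are invented to match the reported times.

```rust
/// MPEG PTS values tick at 90 kHz.
const PTS_CLOCK_HZ: i64 = 90_000;

/// Hypothetical holder for the reference PTS of a subtitle stream.
#[derive(Debug, Default)]
struct SubtitleClock {
    min_pts: Option<i64>,
}

impl SubtitleClock {
    /// Records a PTS, keeping the smallest value seen so far.
    fn observe(&mut self, pts: i64) {
        self.min_pts = Some(self.min_pts.map_or(pts, |m| m.min(pts)));
    }

    /// Milliseconds since the reference PTS, or `None` if no reference
    /// has been recorded yet (rather than computing against a stale
    /// or uninitialized value).
    fn relative_ms(&self, pts: i64) -> Option<i64> {
        self.min_pts.map(|m| (pts - m) * 1000 / PTS_CLOCK_HZ)
    }
}
```

With a reference PTS of 8,550,000,000, a subtitle at PTS 8,550,021,150 maps to 235 ms, matching the 0.235s figure reported above.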

Test Commands

# Extract with OCR
ccextractor input.ts --codec dvbsub --out=srt -o output.srt --ocrlang chi_tra --no-fontcolor

# Extract without OCR (PNG images)
ccextractor input.ts --codec dvbsub --out=spupng -o output.xml --no-spupngocr
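As a rough illustration of the two language options this PR touches, the sketch below keeps the OCR language as a free-form string (so a Tesseract name such as `chi_tra` passes through unchanged) and matches the DVB language code case-insensitively. The `Language` subset, `LangOptions`, `parse_dvb_lang`, and `build_options` are hypothetical names; the real types in `lib_ccxr` are not shown here and may differ.

```rust
/// Hypothetical subset of DVB language codes, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Chi,
    Eng,
    Fra,
}

impl Language {
    const ALL: [Language; 3] = [Language::Chi, Language::Eng, Language::Fra];

    fn code(self) -> &'static str {
        match self {
            Language::Chi => "chi",
            Language::Eng => "eng",
            Language::Fra => "fra",
        }
    }
}

/// Matches a user-supplied code against the known codes, ignoring ASCII case.
fn parse_dvb_lang(input: &str) -> Option<Language> {
    let wanted = input.trim();
    Language::ALL
        .iter()
        .copied()
        .find(|lang| lang.code().eq_ignore_ascii_case(wanted))
}

/// Hypothetical options holder: the OCR language stays a plain string.
#[derive(Debug, PartialEq, Eq)]
struct LangOptions {
    ocrlang: Option<String>,
    dvblang: Option<Language>,
}

fn build_options(ocrlang: Option<&str>, dvblang: Option<&str>) -> LangOptions {
    LangOptions {
        ocrlang: ocrlang.map(|s| s.trim().to_string()),
        dvblang: dvblang.and_then(parse_dvb_lang),
    }
}
```

Keeping the OCR language as a string means any Tesseract language name can be passed through, while the DVB code is still validated against a closed set.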

Related Issues

Fixes #224

🤖 Generated with Claude Code


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-29 17:22:54 +00:00
Reference: starred/ccextractor#2579