ccextractor/.gitignore
Carlos Fernandez Sanz 42885caedd fix(dvb): Multiple fixes for DVB subtitles - timing, OCR quality, memory access bugs (#224) (#1826)
* fix(dvb): Multiple fixes for DVB subtitle extraction from Chinese broadcasts (#224)

This commit addresses multiple issues with DVB subtitle extraction reported in #224:

1. **PMT parsing crash fix** (ts_tables.c):
   - Added minimum length check (16 bytes) to prevent out-of-bounds access
   - Added bounds check before memcpy to prevent buffer overflow when section > 1021 bytes

2. **Negative subtitle timing fix** (general_loop.c):
   - For DVB subtitle streams, properly initialize min_pts from audio/subtitle PTS
   - This fixes the issue where all timestamps were negative (~95000 seconds off)

3. **OCR improvements** (ocr.c):
   - Fixed ignore_alpha_at_edge() which could create invalid crop windows
   - Added image inversion for DVB subtitles (light text on dark background)
     to improve Tesseract OCR accuracy
   - Added contrast normalization to further improve character recognition
   - Fixed nofontcolor check to respect --no-fontcolor parameter
   - Added iteration safety limit in color detection loop

4. **--ocrlang parameter fix** (Rust files):
   - Changed ocrlang from Language enum to String to accept Tesseract language
     names directly (e.g., "chi_tra", "chi_sim", "eng")
   - Added case-insensitive matching for --dvblang parameter
   - Added better error messages for invalid language codes
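
The two PMT checks in item 1 can be illustrated with a minimal sketch. This is not the code from ts_tables.c: the helper name `copy_pmt_section` and the constants `PMT_MIN_LEN` and `PMT_BUF_SIZE` are hypothetical, and the buffer size is simply assumed from the 1021-byte limit mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PMT_MIN_LEN 16    /* minimum length named in the commit message */
#define PMT_BUF_SIZE 1021 /* assumed buffer size, from the 1021-byte limit */

/* Hypothetical helper: copy a PMT section into dst only when it is long
 * enough to parse and short enough to fit. Returns 0 on success, -1 if
 * the section is rejected. */
static int copy_pmt_section(unsigned char dst[PMT_BUF_SIZE],
                            const unsigned char *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    if (len < PMT_MIN_LEN)
        return -1; /* too short: header fields would be read out of bounds */
    if (len > PMT_BUF_SIZE)
        return -1; /* too long: memcpy would overflow dst */
    memcpy(dst, src, len);
    return 0;
}
```

Rejecting the section before the copy keeps the failure local: the caller can skip the malformed table instead of continuing with a corrupted buffer.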

Tested with 12GB Chinese DVB broadcast file:
- Timing: All timestamps now positive (0.235s, 2.594s, etc.)
- OCR: ~80-90% accuracy with chi_tra traineddata (improved from ~70%)
- No crashes during full file processing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(ocr): Fix crashes in DVB subtitle color detection

Two issues fixed in the OCR color detection code:

1. Tesseract crash during iteration:
   - The color detection pass used raw color images without preprocessing
   - Tesseract expects dark text on light background, but DVB subtitles
     have light text on dark background
   - Added grayscale conversion, inversion, and contrast enhancement
     (same preprocessing as the main OCR pass)

2. Heap corruption in histogram calculation:
   - The histogram loop had no bounds checking on array accesses
   - Tesseract could return invalid bounding boxes causing buffer overflows
   - Added validation of bounding box coordinates before processing
   - Added safe index checking for copy->data and histogram arrays

Also added skip_color_detection label for clean error handling and
proper cleanup of the preprocessed image.
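
A rough sketch of the histogram hardening described above, assuming a palette-indexed image stored row by row. The `struct box` type and the functions `box_is_valid` and `build_histogram` are hypothetical names for illustration and do not come from ocr.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounding box, half-open: [x1, x2) x [y1, y2). */
struct box { int x1, y1, x2, y2; };

/* A box is usable only if it is non-empty and lies inside the image. */
static int box_is_valid(const struct box *b, int width, int height)
{
    return b->x1 >= 0 && b->y1 >= 0 &&
           b->x1 < b->x2 && b->y1 < b->y2 &&
           b->x2 <= width && b->y2 <= height;
}

/* Count palette indices inside the box. Returns -1 without touching
 * hist when the box is invalid; indices >= ncolors are skipped so the
 * histogram array is never written out of bounds. */
static int build_histogram(const unsigned char *data, int width, int height,
                           const struct box *b, unsigned int *hist, int ncolors)
{
    assert(data != NULL && b != NULL && hist != NULL);
    if (!box_is_valid(b, width, height))
        return -1;
    for (int y = b->y1; y < b->y2; y++) {
        for (int x = b->x1; x < b->x2; x++) {
            unsigned char idx = data[(size_t)y * (size_t)width + (size_t)x];
            if (idx < ncolors)
                hist[idx]++;
        }
    }
    return 0;
}
```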

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(dvb): Fix zero-duration subtitles and overlaps during PTS jumps

Add start_pts field to cc_subtitle struct to track raw PTS values
independent of FTS timeline resets. Modify end_time calculation in
dvbsub_handle_display_segment() to cap duration at 4 seconds when
PTS jumps cause timeline discontinuities, preventing zero-duration
and overlapping subtitles.
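
One way to picture the duration cap is the sketch below. It is an interpretation, not the logic of dvbsub_handle_display_segment(): the function `pick_end_time`, the constant `MAX_SUB_DURATION_MS`, and the choice to fall back to the cap whenever the computed duration is non-positive or too long are all assumptions made for illustration.

```c
#include <stdint.h>

#define MAX_SUB_DURATION_MS 4000 /* the 4-second cap from the commit message */

/* Hypothetical helper: choose an end time for a subtitle that starts at
 * start_ms, given the time next_ms at which the next display segment
 * arrives. A PTS jump can make next_ms land before start_ms or far
 * after it; in both cases the duration falls back to the cap. */
static int64_t pick_end_time(int64_t start_ms, int64_t next_ms)
{
    int64_t duration = next_ms - start_ms;
    if (duration <= 0 || duration > MAX_SUB_DURATION_MS)
        return start_ms + MAX_SUB_DURATION_MS;
    return next_ms;
}
```

With this rule, a subtitle starting at 1000 ms ends at 3000 ms when the next segment arrives at 3000 ms, and at 5000 ms when the next timestamp is earlier than the start or more than 4 seconds later.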

Also update .gitignore to exclude plans/ directory and temp files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-14 20:03:55 -08:00

164 lines
2.4 KiB
Plaintext

####
# Ignore tests tmp files and results
tests/runtest
tests/**/*.gcda
tests/**/*.gcno
####
# Ignore CVS related files
.CVS
CVS
####
# Linux Ignored binary and build folder
*.o
*.so
mac/ccextractor
linux/ccextractor
linux/depend
windows/x86_64-pc-windows-msvc/**
windows/Debug/**
windows/Debug-OCR/**
windows/release-with-debug/**
windows/Release/**
windows/Release-Full/**
windows/Release-OCR/**
windows/Debug-Full/**
windows/x64/**
windows/ccextractor.VC.db
build/
####
# Python
*.pyc
####
# Visual Studio project Ignored files
.vs/**
windows/.vs/**
!windows/.vs/config/applicationhost.config
*.suo
*.sdf
*.opensdf
*.user
*.opendb
*.db
*.vscode
####
# Ignore the header file that is updated upon build
src/lib_ccx/compile_info_real.h
####
# Ignore windows OCR libraries and folders
windows/libs/leptonica/**
windows/libs/tesseract/**
# Ctags
*.tags*
tags
# Vagrant
.vagrant/
# Eclipse stuff
.cproject
.project
.settings/
# Mac
.DS_Store
windows/enc_temp_folder/*
#CMake
src/cmake-build-debug/
src/.idea/
#Autotools
linux/config.h
linux/config.log
linux/config.status
linux/Makefile
linux/autom4te.cache
linux/aclocal.m4
linux/*.in
linux/configure
linux/build-conf/
mac/rust/
mac/config.h
mac/config.log
mac/config.status
mac/Makefile
mac/autom4te.cache
mac/aclocal.m4
mac/*.in
mac/configure
mac/build-conf/
package_creators/*tar.gz
package_creators/build/*.deb
src/.deps/
src/.dirstamp
src/lib_ccx/.deps/
src/lib_ccx/.dirstamp
src/lib_hash/.deps/
src/lib_hash/.dirstamp
src/libpng/.deps/
src/libpng/.dirstamp
src/utf8proc/.deps/
src/utf8proc/.dirstamp
src/zlib/.deps/
src/zlib/.dirstamp
src/zvbi/.deps/
src/zvbi/.dirstamp
# Arch
package_creators/*.pkg.tar.xz
#RPMs
package_creators/*.rpm
src/lib_ccx/ccx.pc
windows/combase.pdb/
src/**/.deps
src/**/.dirstamp
mac/ccextractorGUI
linux/ccextractorGUI
linux/ccxGUI.ini
linux/CMakeCache.txt
linux/CMakeFiles/
linux/cmake_install.cmake
linux/install_manifest.txt
linux/lib_ccx/
mac/lib_ccx/
mac/install_manifest.txt
mac/cmake_install.cmake
mac/CMakeFiles/
mac/CMakeCache.txt
*.py.bak
# Bazel
bazel*
#Intellij IDEs
.idea/
# Rust build and MakeFiles (and CMake files)
src/rust/CMakeFiles/
src/rust/CMakeCache.txt
src/rust/Makefile
src/rust/cmake_install.cmake
src/rust/target/
src/rust/lib_ccxr/target/
windows/ccx_rust.lib
windows/*/debug/*
windows/*/CACHEDIR.TAG
windows/.rustc_info.json
linux/configure~
# Plans and temporary files
plans/
tess.log
**/tess.log
ut=srt*