[PR #1925] [MERGED] feat(ocr): Add character blacklist and line-split options for better accuracy #2727

opened 2026-01-29 17:23:37 +00:00 by claunia · 0 comments

📋 Pull Request Information

Original PR: https://github.com/CCExtractor/ccextractor/pull/1925
Author: @cfsmp3
Created: 12/29/2025
Status: Merged
Merged: 12/31/2025
Merged by: @cfsmp3

Base: master ← Head: feat/ocr-blacklist-default


📝 Commits (2)

  • 8c586bc feat(ocr): Add character blacklist and line-split options for better accuracy
  • d28bc4e style: Fix formatting issues in ocr.c and options.rs

📊 Changes

8 files changed (+320 additions, -0 deletions)

📝 src/lib_ccx/ccx_common_option.c (+2 -0)
📝 src/lib_ccx/ccx_common_option.h (+2 -0)
📝 src/lib_ccx/ocr.c (+279 -0)
📝 src/lib_ccx/params.c (+7 -0)
📝 src/rust/lib_ccxr/src/common/options.rs (+6 -0)
📝 src/rust/src/args.rs (+12 -0)
📝 src/rust/src/common.rs (+4 -0)
📝 src/rust/src/parser.rs (+8 -0)

📄 Description

Summary

  • Add OCR character blacklist (enabled by default) to prevent common misrecognition errors like I → |
  • Add optional --ocr-line-split mode for multi-line subtitle images
  • Inspired by subtile-ocr's proven approach
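The line-split idea from the summary can be illustrated with a row-projection pass over a binarized subtitle bitmap: consecutive rows that contain ink form one text line, and fully blank rows separate lines. This is a minimal sketch under stated assumptions, not the code added to ocr.c; the function name `split_into_lines` and the `bool` bitmap layout are hypothetical.

```rust
/// Hypothetical helper: find text-line bands in a binarized bitmap.
///
/// `pixels` is row-major with `width * height` entries; `true` means an
/// ink (text) pixel. Returns inclusive `(top, bottom)` row ranges, one
/// per run of consecutive rows that contain at least one ink pixel.
pub fn split_into_lines(pixels: &[bool], width: usize, height: usize) -> Vec<(usize, usize)> {
    assert_eq!(pixels.len(), width * height, "bitmap size mismatch");

    let mut bands = Vec::new();
    let mut start: Option<usize> = None; // top row of the band being scanned

    for y in 0..height {
        let row = &pixels[y * width..(y + 1) * width];
        let has_ink = row.iter().any(|&p| p);
        match (has_ink, start) {
            // First inked row after a gap opens a new band.
            (true, None) => start = Some(y),
            // First blank row after ink closes the current band.
            (false, Some(top)) => {
                bands.push((top, y - 1));
                start = None;
            }
            _ => {}
        }
    }
    // A band that touches the bottom edge is still open here.
    if let Some(top) = start {
        bands.push((top, height - 1));
    }
    bands
}
```

Each returned band could then be cropped and passed to the OCR engine as a single-line image. A real implementation would likely also need padding around each band and a minimum band height to ignore noise rows; both are omitted here.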

New Options

| Option | Default | Description |
|--------|---------|-------------|
| `--no-ocr-blacklist` | Blacklist ON | Disable the character blacklist (`\|`, `\`, `` ` ``, `_`, `~`) |
| `--ocr-line-split` | OFF | Split images into lines, use PSM 7 for each |
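As a rough illustration of how the two flags could translate into OCR engine settings, the sketch below parses them and derives Tesseract configuration variables. The struct `OcrOptions` and both function names are hypothetical and do not mirror the PR's actual types. The PR text does not say how the blacklist is applied, so the use of Tesseract's `tessedit_char_blacklist` variable is an assumption; `tessedit_pageseg_mode` with value 7 corresponds to Tesseract's single-text-line mode.

```rust
/// Hypothetical option struct; field names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct OcrOptions {
    /// True unless `--no-ocr-blacklist` is given (blacklist is ON by default).
    pub blacklist: bool,
    /// True only when `--ocr-line-split` is given (OFF by default).
    pub line_split: bool,
}

impl Default for OcrOptions {
    fn default() -> Self {
        OcrOptions { blacklist: true, line_split: false }
    }
}

/// The five characters listed in the options table: | \ ` _ ~
pub const BLACKLIST_CHARS: &str = "|\\`_~";

/// Parse only the two flags from the table; any other argument is ignored.
pub fn parse_ocr_flags<'a, I>(args: I) -> OcrOptions
where
    I: IntoIterator<Item = &'a str>,
{
    let mut opts = OcrOptions::default();
    for arg in args {
        match arg {
            "--no-ocr-blacklist" => opts.blacklist = false,
            "--ocr-line-split" => opts.line_split = true,
            _ => {}
        }
    }
    opts
}

/// Assumed mapping from options to Tesseract (name, value) variables.
pub fn tesseract_variables(opts: &OcrOptions) -> Vec<(&'static str, String)> {
    let mut vars = Vec::new();
    if opts.blacklist {
        vars.push(("tessedit_char_blacklist", BLACKLIST_CHARS.to_string()));
    }
    if opts.line_split {
        // PSM 7: treat the image as a single text line.
        vars.push(("tessedit_pageseg_mode", "7".to_string()));
    }
    vars
}
```

With no flags, only the blacklist variable is produced; with both flags, only the page segmentation mode is. In the Tesseract C API such pairs would typically be applied through `TessBaseAPISetVariable`, though how CCExtractor wires this up is not shown in the PR description.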

Test Results (VOBSUB MKV sample)

| Metric | Before | After (blacklist) |
|--------|--------|-------------------|
| Pipe `\|` errors | 14 | 0 |

On this sample, the blacklist eliminates all 14 pipe character misrecognitions, matching subtile-ocr's accuracy.
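The PR does not say how the pipe errors were counted. One plausible way to reproduce such a metric is to count occurrences of each blacklisted character in the extracted subtitle text before and after the change. The sketch below assumes that approach; `SUSPECT_CHARS` and `count_suspect_chars` are hypothetical names.

```rust
/// The five blacklisted characters from the options table.
pub const SUSPECT_CHARS: [char; 5] = ['|', '\\', '`', '_', '~'];

/// Count how often each suspect character appears in OCR output.
/// Returns one (character, count) pair per entry of `SUSPECT_CHARS`,
/// in the same order.
pub fn count_suspect_chars(text: &str) -> Vec<(char, usize)> {
    SUSPECT_CHARS
        .iter()
        .map(|&c| (c, text.chars().filter(|&t| t == c).count()))
        .collect()
}
```

A raw count treats every occurrence as an error, which is only an approximation: a subtitle could legitimately contain one of these characters, so a careful comparison would check the hits against the source images.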

Test plan

  • [x] Build with OCR support
  • [x] Test VOBSUB extraction with blacklist (default)
  • [x] Test with --no-ocr-blacklist to verify it can be disabled
  • [x] Test --ocr-line-split option
  • [x] Verify help text displays correctly

🤖 Generated with Claude Code


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-29 17:23:37 +00:00
Reference: starred/ccextractor#2727