[PR #1601] [MERGED] Add flag for Page Segmentation Modes control #2314

Closed
opened 2026-01-29 17:21:27 +00:00 by claunia · 0 comments

📋 Pull Request Information

Original PR: https://github.com/CCExtractor/ccextractor/pull/1601
Author: @Neo2SHYAlien
Created: 3/5/2024
Status: Merged
Merged: 9/3/2024
Merged by: @PunitLodha

Base: master ← Head: master


📝 Commits (10+)

  • ba09cb4 Add flag for Page Segmentation Modes control
  • f60841f Merge branch 'master' into master
  • 2121165 feat: add psm for rust parser
  • 1ac6cc7 Merge pull request #1 from prateekmedia/add-psm-rust
  • 47b3757 Merge branch 'master' into master
  • bc8c86f fix: add psm to options
  • d5d703c Merge branch 'master' into master
  • 036d34c fix: add default value of psm to 3
  • b8c24a9 fix: correct type of ocr oem
  • 7222fca fix(rust): use fatal! instead of exit

📊 Changes

9 files changed (+78 additions, -0 deletions)


📝 docs/CHANGES.TXT (+1 -0)
📝 src/lib_ccx/ccx_common_option.c (+1 -0)
📝 src/lib_ccx/ccx_common_option.h (+1 -0)
📝 src/lib_ccx/ocr.c (+3 -0)
📝 src/lib_ccx/params.c (+38 -0)
📝 src/lib_ccx/params_dump.c (+2 -0)
📝 src/rust/lib_ccxr/src/common/options.rs (+3 -0)
📝 src/rust/src/args.rs (+19 -0)
📝 src/rust/src/parser.rs (+10 -0)

📄 Description

In raising this pull request, I confirm the following (please check boxes):

  • [X] I have read and understood the contributors guide.
  • [X] I have checked that another pull request for this purpose does not exist.
  • [X] I have considered, and confirmed that this submission will be valuable to others.
  • [X] I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • [X] I give this submission freely, and claim no ownership to its content.
  • [X] I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

  • [ ] I have never used CCExtractor.
  • [ ] I have used CCExtractor just a couple of times.
  • [X] I absolutely love CCExtractor, but have not contributed previously.
  • [ ] I am an active contributor to CCExtractor.

I added a flag, -psm, for controlling PSM (Page Segmentation Modes) in Tesseract. The default mode (3) gives me quite bad results. When I use 6, 11, or 12 for Bulgarian, I get much better OCR results. I haven't tested other languages yet, but I expect improvements there as well if a different mode is used.

P.S. This PR is a continuation of #1544, which was closed after the rebase 🥲
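
To make the modes mentioned above concrete, here is a minimal Rust sketch of how a PSM argument could be validated and defaulted. It is not the code from this PR: `parse_psm`, `psm_label`, and the two constants are hypothetical names chosen for illustration. The only facts assumed are that Tesseract numbers its page segmentation modes 0 through 13 and that 3 is the default, as stated in the description.

```rust
/// Tesseract's page segmentation modes are numbered 0 through 13.
const PSM_MAX: u8 = 13;
/// Mode 3 (fully automatic segmentation, no OSD) is the default named in the description.
const PSM_DEFAULT: u8 = 3;

/// Hypothetical helper: parse an optional PSM argument, falling back to the default
/// when the flag is absent and rejecting anything outside 0..=13.
fn parse_psm(arg: Option<&str>) -> Result<u8, String> {
    match arg {
        None => Ok(PSM_DEFAULT),
        Some(raw) => match raw.trim().parse::<u8>() {
            Ok(mode) if mode <= PSM_MAX => Ok(mode),
            _ => Err(format!(
                "invalid PSM value '{}': expected an integer from 0 to {}",
                raw, PSM_MAX
            )),
        },
    }
}

/// Hypothetical helper: short labels for the modes mentioned in the description.
fn psm_label(mode: u8) -> Option<&'static str> {
    match mode {
        3 => Some("fully automatic page segmentation, no OSD (default)"),
        6 => Some("assume a single uniform block of text"),
        11 => Some("sparse text, in no particular order"),
        12 => Some("sparse text with OSD"),
        _ => None,
    }
}

fn main() {
    // Try the absent flag, two of the modes from the description, and an invalid value.
    for arg in [None, Some("6"), Some("11"), Some("42")] {
        match parse_psm(arg) {
            Ok(mode) => println!(
                "psm {}: {}",
                mode,
                psm_label(mode).unwrap_or("see the Tesseract documentation")
            ),
            Err(e) => eprintln!("{}", e),
        }
    }
}
```

Rejecting out-of-range values up front, rather than passing them through to the OCR engine, keeps the error message close to the command line where the mistake was made; whether the merged code does the same is not shown on this page.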


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-29 17:21:27 +00:00
Reference: starred/ccextractor#2314