[PR #1675] [MERGED] [FIX] DVB OCR: Memory Leak & Quantization Issues #2387


claunia · 2026-01-29T17:21:53Z


📋 Pull Request Information

Original PR: https://github.com/CCExtractor/ccextractor/pull/1675
Author: @hrideshmg
Created: 3/15/2025
Status: ✅ Merged
Merged: 3/22/2025
Merged by: @cfsmp3

Base: master ← Head: fix_985

📝 Commits (2)

580a168 fix: do not free ocr text before return
435de6c fix(OCR): erode and dilate function

📊 Changes

1 file changed (+24 additions, -27 deletions)


📝 src/lib_ccx/ocr.c (+24 -27)

📄 Description

In raising this pull request, I confirm the following (please check boxes):

- [x] I have read and understood the contributors guide.
- [x] I have checked that another pull request for this purpose does not exist.
- [x] I have considered, and confirmed that this submission will be valuable to others.
- [x] I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
- [x] I give this submission freely, and claim no ownership to its content.
- [ ] I have mentioned this change in the changelog.

My familiarity with the project is as follows (check one):

- [ ] I have never used CCExtractor.
- [x] I have used CCExtractor just a couple of times.
- [ ] I absolutely love CCExtractor, but have not contributed previously.
- [ ] I am an active contributor to CCExtractor.

Closes #985

DVB subtitle extraction is currently broken on the latest master build. I've verified this by testing it on the following few files:

09-ITV_Red_Heat.ts
2016-12-15-BBC4.ts
CHANNEL_4_2016-06-21.ts
chan7_BBC NEWS.ts

I've found two root causes:

1. The OCR'd text in ocr_bitmap() is freed before being returned. Simply removing this free causes memory leaks (as pointed out by #1511).
2. The second issue lies within the quantize_map() function.
   - Passing --quant 0 (or 2) together with the first fix enables proper extraction of DVB subtitles.
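The first issue is an ownership problem, and a minimal sketch may make it easier to follow. The functions below (`recognize_text` and `emit_subtitle`) are hypothetical stand-ins, not the actual code in `ocr.c`. They only illustrate the pattern: the producer returns a heap buffer without freeing it, and the caller frees it on every path, including the early exit for empty text.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an OCR routine: returns a heap-allocated
 * string owned by the caller. The result may be an empty string. */
static char *recognize_text(const char *raw)
{
    size_t len = strlen(raw);
    char *text = malloc(len + 1);
    if (!text)
        return NULL;
    memcpy(text, raw, len + 1);
    /* Pattern to avoid: free(text) followed by return text would hand
     * the caller a dangling pointer. */
    return text;
}

/* Hypothetical caller: returns 1 if a subtitle was written to out,
 * 0 otherwise. Every path releases the buffer exactly once. */
static int emit_subtitle(const char *raw, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    char *text = recognize_text(raw);
    if (!text)
        return 0;
    if (text[0] == '\0')
    {
        free(text); /* the early exit for empty text must free too */
        return 0;
    }
    snprintf(out, out_size, "%s", text);
    free(text);
    return 1;
}
```

Keeping the `free` in the caller means the producer never returns memory it has already released, and the empty-string branch no longer leaks.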

I've spent the past two days trying to understand quantize_map() and have narrowed the problem down to the erode() function introduced in PR 1510. I believe this is better explained visually, so here are the subtitle bitmaps before and after the erode() call for two different video files:

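Before

[Images: before_erosion, before_erosion_CHANNEL_4_2016-06-21]

After

[Images: after_erosion, after_erosion_CHANNEL_4_2016-06-21]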

Fixes

1. The memory leaks were caused by empty strings that were never freed, because an if condition returned early. I've handled this case and tested it on the files mentioned in #1511.
2. After analyzing the erode() function, I noticed that the text was being eroded based on transparency rather than on the text background. That approach only works for bitmaps whose quantized text color is transparent.

if (alpha[bitmap[row * w + col]] || alpha[bitmap[(row + 1) * w + col]] ||
    alpha[bitmap[row * w + (col + 1)]] || alpha[bitmap[(row + 1) * w + (col + 1)]])

I've modified erode and dilate so that they now use the text color and the text background color rather than the alpha.
I'm getting these colors from the loop that populates the mcit variable. This approach has been fairly successful in my limited testing. However, it relies on the assumption that the background and text colors will always be the second and third most frequently occurring colors, respectively.
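To illustrate the idea of eroding against the background color instead of the alpha channel, here is a minimal sketch. It is not the code from this PR: `erode_by_color`, its parameters, and the separate `dst` buffer are all assumptions made for the example. It uses the same 2x2 window as the condition quoted above, but compares palette indices against a background index.

```c
#include <assert.h>
#include <stdint.h>

/* Erode a palette-indexed bitmap with a 2x2 window (pixel, right, below,
 * diagonal). A pixel keeps its value only if no pixel in its window is the
 * background index. Neighbors outside the bitmap are clamped to the edge.
 * Hypothetical helper, not the function in ocr.c. */
static void erode_by_color(const uint8_t *src, uint8_t *dst,
                           int w, int h, uint8_t bg_index)
{
    assert(src != dst); /* writing in place would corrupt later windows */
    for (int row = 0; row < h; row++)
    {
        for (int col = 0; col < w; col++)
        {
            int r1 = (row + 1 < h) ? row + 1 : row;
            int c1 = (col + 1 < w) ? col + 1 : col;
            int touches_bg = src[row * w + col] == bg_index ||
                             src[r1 * w + col] == bg_index ||
                             src[row * w + c1] == bg_index ||
                             src[r1 * w + c1] == bg_index;
            dst[row * w + col] = touches_bg ? bg_index : src[row * w + col];
        }
    }
}
```

Because the test is on the color index, the result does not depend on whether the text or background entry happens to be transparent in the palette.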

channel5-2018-02-12.ts is one exception: there the text color is the fourth most frequently occurring color (black, the background color, is repeated twice for some reason). As a result, erosion succeeds but dilation fails. The result is still better than the raw quantized output, but it might be worthwhile to disable quantization by default.
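The frequency-rank assumption can be sketched as follows. This is an illustrative helper, not the loop that populates mcit: `index_by_rank` and its signature are hypothetical. It builds a histogram of palette indices and returns the index at a given rank, so rank 1 and rank 2 would correspond to the assumed background and text colors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the palette index at the given frequency rank (0 = most common),
 * or -1 if fewer than rank + 1 distinct indices occur in the bitmap.
 * Ties are resolved in favor of the lower index. Hypothetical helper. */
static int index_by_rank(const uint8_t *bitmap, size_t n, int rank)
{
    size_t hist[256] = {0};
    int used[256] = {0};
    int best = -1;
    assert(rank >= 0 && rank < 256);
    for (size_t i = 0; i < n; i++)
        hist[bitmap[i]]++;
    for (int r = 0; r <= rank; r++)
    {
        best = -1;
        for (int i = 0; i < 256; i++)
            if (!used[i] && hist[i] > 0 && (best < 0 || hist[i] > hist[best]))
                best = i;
        if (best < 0)
            return -1;
        used[best] = 1; /* exclude this index from the next rank */
    }
    return best;
}
```

A lookup like this ranks palette entries, not distinct colors. If a palette holds the same color in two entries, each entry takes its own rank, which would push the text color one rank lower, consistent with the exception described above.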

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


claunia added the pull-request label 2026-01-29 17:21:53 +00:00


Reference: starred/ccextractor#2387