OCR status summary - tests not passing, a fix broke something else #643

Closed
opened 2026-01-29 16:49:55 +00:00 by claunia · 4 comments

Originally created by @cfsmp3 on GitHub (Jun 11, 2021).

Originally assigned to: @PunitLodha on GitHub.

Summarizing the situation here so we have all the information handy.

One of the tests has been failing for a while. Specifically, we're getting garbage in some of the subtitle frames (but not all) for one specific sample. The failing test is here:

https://sampleplatform.ccextractor.org/test/3308

We know that the guilty commit is this:

https://github.com/CCExtractor/ccextractor/commit/84a9ea5572da4728fc4ad01b88978808c925ad9f

That commit itself fixed something else, so simply reverting it would probably fix this test at the expense of breaking the original sample again.

I spent a bit of time on it yesterday, and it's clearly a problem with the OCR; however, the input images are correct. Enabling DEBUG_OCR (which writes the massaged images as the OCR engine, tesseract, gets them) shows that the input contains what we expect.

So currently I suspect a problem with the internal state of the OCR (possibly we're not reinitializing something, who knows).

Since we have all samples, the previous code, the new code, etc, I think troubleshooting this should take a reasonable amount of time (and patience).
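
As a starting point for that troubleshooting, here is a minimal sketch (not part of CCExtractor) of how the output of the previous code and the new code could be compared to find exactly which subtitle frames turn into garbage. It assumes both builds were run on the failing sample with SRT output; the function names `parse_cues` and `divergent_cues` are hypothetical helpers made up for this illustration.

```python
from itertools import zip_longest


def parse_cues(srt_text):
    """Split SRT text into a list of (timing, text) tuples, one per cue."""
    cues = []
    for block in srt_text.replace("\r\n", "\n").strip().split("\n\n"):
        lines = [ln for ln in block.split("\n") if ln.strip()]
        if len(lines) < 2:
            continue  # skip empty or malformed blocks
        # lines[0] is the cue counter, lines[1] the timing, the rest is text
        cues.append((lines[1].strip(), "\n".join(lines[2:]).strip()))
    return cues


def divergent_cues(good_srt, bad_srt):
    """Return (index, good_cue, bad_cue) for every cue whose text differs.

    A cue that exists in only one of the two outputs is reported with
    None on the missing side.
    """
    diffs = []
    pairs = zip_longest(parse_cues(good_srt), parse_cues(bad_srt))
    for index, (good, bad) in enumerate(pairs):
        good_text = good[1] if good else None
        bad_text = bad[1] if bad else None
        if good_text != bad_text:
            diffs.append((index, good, bad))
    return diffs


if __name__ == "__main__":
    # Made-up sample data, only to show the call shape.
    good = "1\n00:00:01,000 --> 00:00:02,000\nHello\n\n2\n00:00:03,000 --> 00:00:04,000\nWorld\n"
    bad = "1\n00:00:01,000 --> 00:00:02,000\nHello\n\n2\n00:00:03,000 --> 00:00:04,000\nW0r|d\n"
    for index, good_cue, bad_cue in divergent_cues(good, bad):
        print(index, good_cue, bad_cue)
```

Knowing the indices of the broken frames would make it possible to check whether garbage only ever appears after some earlier frame has been processed, which is what a stale-state problem would look like.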

We want to release 0.89 in the next couple of days, with 0.90 following shortly after. This should be fixed (properly) in one of the two releases.

claunia added the GSoC 2021, OCR, difficulty: medium, and regression labels 2026-01-29 16:49:55 +00:00

@cfsmp3 commented on GitHub (Jun 11, 2021):

I'm assigning this to @harrynull (don't know if he's around though, haven't seen him in a while) because he sent that commit, and to @PunitLodha since at some point this code will be rewritten in Rust anyway and Punit is working on preparing things for the Rust work.


@MauryaRitesh commented on GitHub (Dec 4, 2021):

Is the issue still not resolved? I would like to work on this issue (or any other issue).


@cfsmp3 commented on GitHub (Dec 4, 2021):

Not solved, go for it.



@cfsmp3 commented on GitHub (Mar 22, 2023):

Tested it just now. Unfortunately, still broken.

Reference: starred/ccextractor#643