[BUG/PROPOSAL] Levenshtein algorithm automatically merges lines #308

Closed
opened 2026-01-29 16:40:33 +00:00 by claunia · 1 comment
Owner

Originally created by @mkver on GitHub (Apr 24, 2017).

CCExtractor detailed version info
Version: 0.85
Git commit: 5fa83394a0c0300244eb5707b30d344d7687bb2c
Compilation date: 2017-01-26
File SHA256: Could not open file

  • I have read and understood the contributors guide.

  • I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.

  • I have checked that the issue I'm posting isn't already reported.

  • I have checked that the issue I'm posting isn't already solved and that no duplicates exist in closed issues or in open issues.

  • I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.

  • I have used the latest available version of CCExtractor to verify this issue exists.

  • I absolutely love CCExtractor, but have not contributed previously.

Necessary information

  • Is this a regression (did it work before)? [ ] NO | [X] YES - in 0.84
  • What platform did you use? [X] Windows - [ ] Linux - [ ] Mac
  • What were the arguments used? --gui_mode_reports -autoprogram -out=srt -bom -deblev -utf8 -trim -levdistmincnt 0 -levdistmaxpct 0

Video links
ccextractor.issue.737.ts (https://www.dropbox.com/s/zk7yzursi5nqpbb/ccextractor.issue.737.ts?dl=0)

Additional information
Up to and including version 0.84, the Levenshtein-based deduplication of teletext lines was only performed when the output format was transcript. At least it was not applied to srt output: in that mode no Levenshtein debug information was printed, even with the -deblev parameter. This changed in version 0.85, where srt output also goes through the Levenshtein algorithm. The number of matches the algorithm finds can be reduced by lowering levdistmincnt and levdistmaxpct, but even with both set to 0, identical lines (edit distance 0) are still merged, which is undesired in my file. The audio contains several bangs, and the subtitles contain one * Knall * for each of them. Version 0.84 extracts these as individual lines:

1
00:00:06,360 --> 00:00:07,720
* Knall *

2
00:00:09,000 --> 00:00:10,560
* Knall *

3
00:00:12,480 --> 00:00:14,040
* Knall *

4
00:00:14,880 --> 00:00:16,120
* Knall *

5
00:00:16,240 --> 00:00:20,120
Was klopft denn da?<font color="#00ff00">Weiß nicht.</font>
Das ist unheimlich.

6
00:00:20,240 --> 00:00:20,560
<font color="#00ff00">Geh ins Bett!</font>

But in version 0.85 the first four lines are merged:

1
00:00:06,480 --> 00:00:16,239
* Knall *

2
00:00:16,360 --> 00:00:20,239
Was klopft denn da?<font color="#00ff00">Weiß nicht.</font>
Das ist unheimlich.

3
00:00:20,360 --> 00:00:20,679
<font color="#00ff00">Geh ins Bett!</font>

Is it possible to add an option to disable the Levenshtein algorithm completely?
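To illustrate why zero thresholds cannot switch the merging off, here is a minimal Python sketch. It is not CCExtractor's code: the threshold rule (allowed distance = the larger of levdistmincnt and levdistmaxpct percent of the shorter line) and the helper names `treated_as_same` and `merge_cues` are assumptions made for illustration. Under that assumed rule, two identical lines have distance 0, which still satisfies `0 <= 0`, so consecutive * Knall * cues collapse into one.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def treated_as_same(a: str, b: str, min_cnt: int, max_pct: int) -> bool:
    """Hypothetical threshold rule (an assumption, not taken from CCExtractor)."""
    shorter = min(len(a), len(b))
    allowed = max(min_cnt, shorter * max_pct // 100)
    return levenshtein(a, b) <= allowed


def merge_cues(cues, min_cnt, max_pct):
    """Merge consecutive cues judged 'the same', extending the end time."""
    merged = []
    for start, end, text in cues:
        if merged and treated_as_same(merged[-1][2], text, min_cnt, max_pct):
            first_start, _, first_text = merged[-1]
            merged[-1] = (first_start, end, first_text)
        else:
            merged.append((start, end, text))
    return merged


# Cue times and texts taken from the 0.84 output in this report.
cues = [
    ("00:00:06,360", "00:00:07,720", "* Knall *"),
    ("00:00:09,000", "00:00:10,560", "* Knall *"),
    ("00:00:12,480", "00:00:14,040", "* Knall *"),
    ("00:00:14,880", "00:00:16,120", "* Knall *"),
    ("00:00:16,240", "00:00:20,120", "Was klopft denn da?"),
]

# Even with both thresholds at 0, the four identical cues become one.
result = merge_cues(cues, min_cnt=0, max_pct=0)
```

With any rule of this shape, only a separate on/off switch can keep identical consecutive lines apart, because no threshold value can be lower than a distance of 0.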
PS: Given that the text in the GUI still says "In transcript mode, this causes duplicated lines. CCExtractor tries to remove these duplicates...", I regard the use of this deduplication on srt subtitles as a bug. It could also be intended behaviour, in which case my issue is a proposal. Therefore I filed this under [BUG/PROPOSAL].
PPS: Thanks for your great program!

Author
Owner

@cfsmp3 commented on GitHub (Apr 24, 2017):

Added -nolevdist to disable automatic typo fixing.

Reference: starred/ccextractor#308