mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-03 21:23:48 +00:00
[BUG/PROPOSAL] Levenshtein algorithm automatically merges lines #308
Originally created by @mkver on GitHub (Apr 24, 2017).
CCExtractor detailed version info
Version: 0.85
Git commit: 5fa83394a0
Compilation date: 2017-01-26
File SHA256: Could not open file
I have read and understood the contributors guide.
I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
I have checked that the issue I'm posting isn't already reported.
I have checked that the issue I'm posting isn't already solved and that no duplicates exist in closed or open issues.
I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
I have used the latest available version of CCExtractor to verify this issue exists.
I absolutely love CCExtractor, but have not contributed previously.
Necessary information
--gui_mode_reports -autoprogram -out=srt -bom -deblev -utf8 -trim -levdistmincnt 0 -levdistmaxpct 0
Video links
ccextractor.issue.737.ts
Additional information
Up until version 0.84, the Levenshtein-based deduplication of teletext lines was only performed when the output format was transcript. At least it was not used when the output was srt: in that mode not even the Levenshtein debug info was printed, even with the -deblev parameter. This changed in version 0.85. Now srt output goes through the Levenshtein algorithm as well. One can reduce the number of matches the algorithm finds by lowering levdistmincnt and levdistmaxpct, but even with both set to 0 some lines (namely identical ones) are still merged, even where this is undesired, as in my file. The audio contains several bangs and the subtitles contain one "* Knall *" for each of them. Version 0.84 extracts these as individual lines:
But in version 0.85 the first four lines are merged:
Is it possible to add an option to disable the Levenshtein algorithm completely?
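To illustrate why thresholds of 0 cannot switch the merging off, here is a minimal Python sketch. It is not CCExtractor's actual code: the function names and the matching rule (allowed distance is the larger of levdistmincnt and levdistmaxpct percent of the shorter line) are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_duplicate(a: str, b: str, min_cnt: int, max_pct: int) -> bool:
    # Hypothetical rule, assumed for illustration: two lines count as
    # duplicates when their distance does not exceed the larger of the
    # absolute threshold and the percentage of the shorter line's length.
    allowed = max(min_cnt, min(len(a), len(b)) * max_pct // 100)
    return levenshtein(a, b) <= allowed


# Identical lines have distance 0, so they match even with both thresholds at 0.
print(is_duplicate("* Knall *", "* Knall *", 0, 0))
# A line differing by one character is no longer matched at thresholds of 0.
print(is_duplicate("* Knall *", "* Knall!*", 0, 0))
```

Under this assumed rule, a distance of 0 always satisfies `distance <= allowed`, so repeated identical subtitles are merged no matter how low the two parameters are set. Only a separate switch that skips the comparison entirely would keep them apart.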
PS: Given that the text in the GUI still says "In transcript mode, this causes duplicated lines. CCExtractor tries to remove these duplicates...", I regard the fact that this deduplication is applied to srt subtitles as a bug; but it could also be intended behaviour, in which case my issue is a proposal. Therefore I filed this under [BUG/PROPOSAL].
PPS: Thanks for your great program!
@cfsmp3 commented on GitHub (Apr 24, 2017):
Added -dolevdist to disable automatic typo fixing.