mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-13 05:25:03 +00:00
Improve built-in dictionary code #28
Originally created by @cfsmp3 on GitHub (May 28, 2014).
Originally assigned to: @anshul1912 on GitHub.
CCExtractor has a small (maybe 20 words) dictionary that is used to correct capitalization.
Because the dictionary is so small, the implementation that uses it to do the correction is extremely trivial and also slow: it basically checks against each word in the dictionary rather than doing a binary search.
We didn't care about this until someone sent us an 11,000-word dictionary to use :-)
The function is in 608_helpers.
void correct_case (int line_num, struct eia608_screen *data)
Job: Implement sort and binary search so we can use that dictionary efficiently.
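A minimal sketch of what this could look like, using the standard library's qsort and bsearch. The names dict_sort, dict_lookup, cmp_nocase and cmp_entries are hypothetical, and so is the assumption that the dictionary is held as an array of C strings; none of this is taken from the CCExtractor source. The comparison is case-insensitive so that an all-uppercase caption word can find its correctly capitalized dictionary entry.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive comparison of two C strings (portable stand-in for strcasecmp). */
static int cmp_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);
        if (ca != cb)
            return ca - cb;
        a++;
        b++;
    }
    return tolower((unsigned char)*a) - tolower((unsigned char)*b);
}

/* Comparator for qsort/bsearch: both arguments point at array elements
   (or at the key), each of which is a pointer to a string. */
static int cmp_entries(const void *a, const void *b)
{
    return cmp_nocase(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: sort the dictionary once, right after loading it. */
void dict_sort(const char **words, size_t count)
{
    qsort(words, count, sizeof(*words), cmp_entries);
}

/* Hypothetical helper: return the dictionary spelling of `word`,
   or NULL if the word is not in the dictionary. O(log n) per lookup. */
const char *dict_lookup(const char *const *words, size_t count, const char *word)
{
    const char *const *hit =
        bsearch(&word, words, count, sizeof(*words), cmp_entries);
    return hit ? *hit : NULL;
}
```

With an 11,000-word dictionary this is roughly 14 comparisons per caption word instead of up to 11,000. Sorting at load time also means the code does not depend on the dictionary file being sorted already.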
@anshul1912 commented on GitHub (Jun 18, 2014):
The new dictionary, MattS_dictionary.txt, already looks sorted to me. Is there any chance we will get an unsorted dictionary? I was thinking of implementing a hash table or an indexed binary tree for the search.
If there is any chance of getting an unsorted dictionary, I have already implemented a shell sort for quantization that generates a sorted hash table, so we don't actually move anything and just keep the index sorted. The algorithm is quite entangled with the quantization code, so I would need to make it more generic.
@cfsmp3 commented on GitHub (Jun 19, 2014):
The problem is not that it's not sorted; it's that we aren't doing a good
search. We're just reading each word in a loop :-) No binary search.
@cfsmp3 commented on GitHub (Jun 20, 2014):
correct_case is enabled via parameter (--sentencecap). Any video will work
to test this - the correct case stuff converts from all uppercase (as
captions come in) to correct case by using some (basic) rules and the
dictionary.
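To make the idea concrete, here is a rough sketch of that kind of correction on a plain text buffer. The three rules shown (lowercase everything, capitalize the first word of a sentence, let the dictionary spelling win) are assumptions for illustration, not the actual rules in CCExtractor. The function sentence_cap and the helper same_word are hypothetical names, and the real correct_case works on a struct eia608_screen rather than a char buffer.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return nonzero if the first n bytes of a and b match, ignoring case. */
static int same_word(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Hypothetical helper: convert an all-uppercase caption line to
   sentence case in place, using `dict` for known capitalizations. */
void sentence_cap(char *line, const char *const *dict, size_t dict_len)
{
    int start_of_sentence = 1;
    char *p = line;

    while (*p) {
        if (!isalpha((unsigned char)*p)) {
            /* Assumed rule: '.', '?' and '!' end a sentence. */
            if (*p == '.' || *p == '?' || *p == '!')
                start_of_sentence = 1;
            p++;
            continue;
        }

        /* Find the end of the current word. */
        char *end = p;
        while (isalpha((unsigned char)*end))
            end++;
        size_t len = (size_t)(end - p);

        /* Rule 1: lowercase the whole word. */
        for (char *q = p; q < end; q++)
            *q = (char)tolower((unsigned char)*q);

        /* Rule 2: capitalize the first word of a sentence. */
        if (start_of_sentence) {
            *p = (char)toupper((unsigned char)*p);
            start_of_sentence = 0;
        }

        /* Rule 3: a dictionary entry overrides the rules above.
           Linear scan kept for brevity; this is the lookup the
           issue asks to replace with a binary search. */
        for (size_t i = 0; i < dict_len; i++) {
            if (strlen(dict[i]) == len && same_word(p, dict[i], len)) {
                memcpy(p, dict[i], len);
                break;
            }
        }

        p = end;
    }
}
```

For example, with a dictionary containing "NASA" and "Carlos", the line "HELLO FROM NASA. THIS IS CARLOS." becomes "Hello from NASA. This is Carlos."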
On Fri, Jun 20, 2014 at 3:48 AM, Anshul Maheshwari <notifications@github.com