Improve built-in dictionary code #28

Closed
opened 2026-01-29 16:33:14 +00:00 by claunia · 3 comments
Owner

Originally created by @cfsmp3 on GitHub (May 28, 2014).

Originally assigned to: @anshul1912 on GitHub.

CCExtractor has a small (maybe 20 words) dictionary that is used to correct capitalization.

Because the dictionary is so small, the implementation that uses it to do the correction is extremely trivial, and also time consuming (basically it checks against each word in the dictionary rather than doing a binary search).

We didn't care about this until someone sent an 11,000-word dictionary for us to use :-)

The function is in 608_helpers.
void correct_case (int line_num, struct eia608_screen *data)

Job: Implement sort and binary search so we can use that dictionary efficiently.
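
A minimal sketch of the sort-plus-binary-search idea, using the standard qsort and bsearch. The names cmp_nocase, cmp_entries, dict_sort and dict_lookup are hypothetical, and the dictionary is assumed to be an in-memory array of C strings; this is not the code in 608_helpers.

```c
#include <ctype.h>
#include <stdlib.h>

/* Case-insensitive string compare (portable stand-in for strcasecmp). */
static int cmp_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        int ca = tolower((unsigned char)*a);
        int cb = tolower((unsigned char)*b);
        if (ca != cb)
            return ca - cb;
        a++;
        b++;
    }
    return tolower((unsigned char)*a) - tolower((unsigned char)*b);
}

/* Comparator shared by qsort and bsearch: each element is a `const char *`. */
static int cmp_entries(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return cmp_nocase(a, b);
}

/* Sort once, right after the dictionary has been loaded. */
static void dict_sort(const char **words, size_t count)
{
    qsort(words, count, sizeof(words[0]), cmp_entries);
}

/* Return the stored (correctly cased) spelling, or NULL if the word is unknown. */
static const char *dict_lookup(const char **words, size_t count, const char *word)
{
    const char **hit = bsearch(&word, words, count, sizeof(words[0]), cmp_entries);
    return hit ? *hit : NULL;
}
```

With 11,000 entries, each lookup needs about 14 comparisons instead of up to 11,000. The same comparator must be used for sorting and searching, otherwise bsearch may miss entries.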

Author
Owner

@anshul1912 commented on GitHub (Jun 18, 2014):

The new dictionary, MattS_dictionary.txt, already looks sorted to me. Is there any chance that we will get an unsorted dictionary? I was thinking of implementing a hash table or an indexed binary tree as the search algorithm.
If there is any chance of an unsorted dictionary, I have already implemented a shell sort for quantization that generates a sorted hash table, so that we don't actually move anything and just keep the index sorted. The algorithm is quite tightly merged with quantization, so I would need to make it more generic.
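
A rough sketch of the "keep only the index sorted" idea, written as a generic shell sort over an index array. The function name sort_index is hypothetical and this is not the quantization code mentioned above; it only illustrates sorting without moving the words themselves.

```c
#include <stddef.h>
#include <string.h>

/* Shell sort over an index array: words[] itself is never reordered.
   On return, words[idx[0]], words[idx[1]], ... are in ascending strcmp order. */
static void sort_index(const char *const *words, size_t *idx, size_t count)
{
    for (size_t i = 0; i < count; i++)
        idx[i] = i;

    /* Gapped insertion sort, halving the gap each round. */
    for (size_t gap = count / 2; gap > 0; gap /= 2) {
        for (size_t i = gap; i < count; i++) {
            size_t cur = idx[i];
            size_t j = i;
            while (j >= gap && strcmp(words[idx[j - gap]], words[cur]) > 0) {
                idx[j] = idx[j - gap];
                j -= gap;
            }
            idx[j] = cur;
        }
    }
}
```

A binary search would then compare against words[idx[mid]] rather than words[mid].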

Author
Owner

@cfsmp3 commented on GitHub (Jun 19, 2014):

The problem is not that it's not sorted, it's that we aren't doing a good
search. We're just reading each word in a loop :-) No binary search.


Author
Owner

@cfsmp3 commented on GitHub (Jun 20, 2014):

correct_case is enabled via parameter (--sentencecap). Any video will work
to test this - the correct case stuff converts from all uppercase (as
captions come in) to correct case by using some (basic) rules and the
dictionary.
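
To make the description concrete, here is a simplified sketch of that kind of conversion. It is not CCExtractor's correct_case(): the function names, the single rule used (capitalize the first letter after '.', '?' or '!') and the in-memory dictionary are all assumptions for illustration. The dictionary lookup is a plain loop only to keep the sketch short.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first len characters. */
static int equal_nocase(const char *a, const char *b, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Hypothetical helper: lower-cases an all-caps line, capitalizes the first
   letter of each sentence, then overwrites every dictionary word with its
   stored spelling. The line is modified in place. */
static void sentence_case(char *line, const char *const *dict, size_t dict_count)
{
    int new_sentence = 1;
    for (char *p = line; *p; p++) {
        unsigned char c = (unsigned char)*p;
        if (isalpha(c)) {
            *p = (char)(new_sentence ? toupper(c) : tolower(c));
            new_sentence = 0;
        } else if (c == '.' || c == '?' || c == '!') {
            new_sentence = 1;
        }
    }

    /* Second pass: walk the words and apply the dictionary spelling. */
    char *p = line;
    while (*p) {
        while (*p && !isalpha((unsigned char)*p))
            p++;
        char *start = p;
        while (isalpha((unsigned char)*p))
            p++;
        size_t len = (size_t)(p - start);
        for (size_t i = 0; len > 0 && i < dict_count; i++) {
            if (strlen(dict[i]) == len && equal_nocase(start, dict[i], len)) {
                memcpy(start, dict[i], len);
                break;
            }
        }
    }
}
```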

On Fri, Jun 20, 2014 at 3:48 AM, Anshul Maheshwari <notifications@github.com> wrote:

> Can you please point to one video in our repository which uses the
> correct_case function?

Reference: starred/ccextractor#28