mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-03 21:23:48 +00:00
[BUG] Capitalization doesn't work with tags #547
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @NilsIrl on GitHub (Jan 23, 2020).
CCExtractor version:
1764aa1f92In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):
My familiarity with the project is as follows:
Necessary information
I don't know though it probably isn't./ccextractor -sc files/CNN.tsVideo links (replace text below with your links)
CNN.tshere: https://drive.google.com/drive/folders/0B_61ywKPmI0Ta2diT3l0eTlHc2cAdditional information
Notice how the word "take" isn't capitalized correctly (It should be "Take").
(This snippet is from line 54.)
@kashparty commented on GitHub (Jan 23, 2020):
I don't think this is platform-specific because the same behaviour is observed on windows. The issue may be due to clever_capitalize, but I am not 100% sure:
7b038ab649/src/lib_ccx/ccx_encoders_helpers.c (L183)This behaviour doesn't always happen. For example:
I haven't contributed before but I'd be willing to try and fix this (if I can). I would really appreciate it if someone could point me in the right direction.
@NilsIrl commented on GitHub (Jan 23, 2020):
I am of the same opinion.
Don't hesitate to ask questions. If you need additinal help you can join the ccextractor slack as well.
@cfsmp3 commented on GitHub (Jan 24, 2020):
@KashParty clever_capitalize() just uses the character matrix, it doesn't know anything about fonts.
But a quick way to test what's going to would be to add this line (just typing here, so it might need fixing):
And see what it's doing.
@kashparty commented on GitHub (Jan 24, 2020):
Thank you for the help. I can confirm that the problem arises from
clever_capitalize. I used the following code:When I ran this, it produced:
So it is obvious that clever_capitalize is causing the issue. After investigating a little more, I don't think that this has anything to do with tags at all. The value of
context->new_sentencepersists across lines, so if a previous line ended in an exclamation mark, then the first character on the next line will be capitalized. The issue that is seen here is probably because the previous line doesn't end with any punctuation that needs capitalization after it:Because the previous line ends with
),clever_capitalizedoesn't capitalize the next letter. Closing brackets don't always have capital letters after them. Does this behaviour even need to be fixed, and if so, what should be done about it?@kashparty commented on GitHub (Jan 24, 2020):
I could potentially try and change
clever_capitalizeso that after a closing bracket, it checks whether that is the last non-whitespace character in the line, and setscontext->new_sentenceaccordingly. That way it would only capitalize the word after the brackets if the word is on the next line. Obviously we don't want it to always capitalize after brackets(because it) Would cause behaviour like this.@kashparty commented on GitHub (Jan 24, 2020):
I tried implementing the change I described in my previous comment by adding a case for
')':This seems to work. I haven't tested it fully but it works with the video provided by @NilsIrl in the issue description:
This is a rough first attempt just to ensure that this approach would theoretically work. I used
<ctype.h>but I probably don't really need it, and I'm not sure ifdata->characters[line_num][j] != 0x89is necessary in the if statement. However, if this is a change that you would want to make, I would be more than happy to clean this up properly and make a pull request.@cfsmp3 commented on GitHub (Jan 25, 2020):
Most likely it won't work for many other samples. I don't think it's possible to come up with a reasonably naive algorithms (I mean, just looking at the previous character and/or lines) that works well for all cases.
It might be just simpler to have a post-processing program that uses a proper NLP library to correct capitalization.
@cfsmp3 commented on GitHub (Jan 25, 2020):
I'm closing this since I really think it's not something we can solve inside CCExtractor, but if someone proves me wrong and sends a cool PR I'll merge :-)