[BUG] Capitalization doesn't work with tags #547

Closed
opened 2026-01-29 16:47:12 +00:00 by claunia · 8 comments
Owner

Originally created by @NilsIrl on GitHub (Jan 23, 2020).

CCExtractor version: 1764aa1f92

In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):

  • I have read and understood the contributors guide.
  • I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • I have checked that the issue I'm posting isn't already reported.
  • I have checked that the issue I'm porting isn't already solved and no duplicates exist in closed issues and in opened issues
  • I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
  • I have used the latest available version of CCExtractor to verify this issue exists.

My familiarity with the project is as follows:

  • I am an active contributor to CCExtractor.

Necessary information

  • Is this a regression (did it work before)? [ ] NO | [ ] YES - I don't know though it probably isn't
  • What platform did you use? [ ] Windows - [X] Linux - [ ] Mac
  • What were the used arguments? ./ccextractor -sc files/CNN.ts

Video links (replace text below with your links)

Additional information

Notice how the word "take" isn't capitalized correctly (It should be "Take").

54c54
<   <i> Take a closer look at GEICO.</i>
---
>   <i> take a closer look at geico.</i>

(This snippet is from line 54.)

Originally created by @NilsIrl on GitHub (Jan 23, 2020). CCExtractor version: 1764aa1f92a2c3c201d7f37b67ee18b551b982ef **In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):** - [X] I have read and understood the [contributors guide](https://github.com/CCExtractor/ccextractor/blob/master/.github/CONTRIBUTING.md). - [X] I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present. - [X] I have checked that the issue I'm posting isn't already reported. - [X] I have checked that the issue I'm porting isn't already solved and no duplicates exist in [closed issues](https://github.com/CCExtractor/ccextractor/issues?q=is%3Aissue+is%3Aclosed) and in [opened issues](https://github.com/CCExtractor/ccextractor/issues) - [X] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion. - [X] I have used the latest available version of CCExtractor to verify this issue exists. **My familiarity with the project is as follows:** - [X] I am an active contributor to CCExtractor. **Necessary information** - Is this a regression (did it work before)? [ ] NO | [ ] YES - `I don't know though it probably isn't` - What platform did you use? [ ] Windows - [X] Linux - [ ] Mac - What were the used arguments? `./ccextractor -sc files/CNN.ts` **Video links (replace text below with your links)** * `CNN.ts` here: https://drive.google.com/drive/folders/0B_61ywKPmI0Ta2diT3l0eTlHc2c **Additional information** Notice how the word "take" isn't capitalized correctly (It should be "Take"). ```patch 54c54 < <i> Take a closer look at GEICO.</i> --- > <i> take a closer look at geico.</i> ``` (This snippet is from line 54.)
Author
Owner

@kashparty commented on GitHub (Jan 23, 2020):

I don't think this is platform-specific because the same behaviour is observed on windows. The issue may be due to clever_capitalize, but I am not 100% sure:
7b038ab649/src/lib_ccx/ccx_encoders_helpers.c (L183)

This behaviour doesn't always happen. For example:

13
00:00:24,558 --> 00:00:25,924
  <i> take a closer look at geico.</i> 

14
00:00:25,960 --> 00:00:26,658
         <i> Great savings.</i>

I haven't contributed before but I'd be willing to try and fix this (if I can). I would really appreciate it if someone could point me in the right direction.

@kashparty commented on GitHub (Jan 23, 2020): I don't think this is platform-specific because the same behaviour is observed on windows. The issue _may_ be due to clever_capitalize, but I am not 100% sure: https://github.com/CCExtractor/ccextractor/blob/7b038ab64905750b7da5320b599504b1ca890aab/src/lib_ccx/ccx_encoders_helpers.c#L183 This behaviour doesn't always happen. For example: ``` 13 00:00:24,558 --> 00:00:25,924 <i> take a closer look at geico.</i> 14 00:00:25,960 --> 00:00:26,658 <i> Great savings.</i> ``` I haven't contributed before but I'd be willing to try and fix this (if I can). I would really appreciate it if someone could point me in the right direction.
Author
Owner

@NilsIrl commented on GitHub (Jan 23, 2020):

The issue may be due to clever_capitalize

I am of the same opinion.

I haven't contributed before but I'd be willing to try and fix this if someone could point me in the right direction

Don't hesitate to ask questions. If you need additinal help you can join the ccextractor slack as well.

@NilsIrl commented on GitHub (Jan 23, 2020): > The issue may be due to clever_capitalize I am of the same opinion. > I haven't contributed before but I'd be willing to try and fix this if someone could point me in the right direction Don't hesitate to ask questions. If you need additinal help you can join the ccextractor slack as well.
Author
Owner

@cfsmp3 commented on GitHub (Jan 24, 2020):

@KashParty clever_capitalize() just uses the character matrix, it doesn't know anything about fonts.

But a quick way to test what's going to would be to add this line (just typing here, so it might need fixing):

	for (int i = 0; i < CCX_DECODER_608_SCREEN_WIDTH; i++)
	{
+                printf ("Clever: Column: %d Char: %c  do_it: %d\n", i, data->characters[line_num][i]), do_it);

And see what it's doing.

@cfsmp3 commented on GitHub (Jan 24, 2020): @KashParty clever_capitalize() just uses the character matrix, it doesn't know anything about fonts. But a quick way to test what's going to would be to add this line (just typing here, so it might need fixing): ``` for (int i = 0; i < CCX_DECODER_608_SCREEN_WIDTH; i++) { + printf ("Clever: Column: %d Char: %c do_it: %d\n", i, data->characters[line_num][i]), do_it); ``` And see what it's doing.
Author
Owner

@kashparty commented on GitHub (Jan 24, 2020):

Thank you for the help. I can confirm that the problem arises from clever_capitalize. I used the following code:

void correct_spelling_and_censor_words_608(struct encoder_ctx *context, int line_number, struct eia608_screen *data)
{
	if (context->sentence_cap)
	{
		printf("0: %s\n", data->characters[line_number]);
		if (clever_capitalize(context, line_number, data))
		{
			printf("1: %s\n", data->characters[line_number]);
			correct_case_with_dictionary(line_number, data);
			printf("2: %s\n", data->characters[line_number]);
		}
	}

When I ran this, it produced:

0:    Take a closer look at GEICO.
1:    take a closer look at geico.
2:    take a closer look at geico.

So it is obvious that clever_capitalize is causing the issue. After investigating a little more, I don't think that this has anything to do with tags at all. The value of context->new_sentence persists across lines, so if a previous line ended in an exclamation mark, then the first character on the next line will be capitalized. The issue that is seen here is probably because the previous line doesn't end with any punctuation that needs capitalization after it:

12
00:00:23,491 --> 00:00:24,523
           (humming)            

13
00:00:24,558 --> 00:00:25,924
  <i> take a closer look at geico.</i> 

14
00:00:25,960 --> 00:00:26,658
         <i> Great savings.</i>

Because the previous line ends with ), clever_capitalize doesn't capitalize the next letter. Closing brackets don't always have capital letters after them. Does this behaviour even need to be fixed, and if so, what should be done about it?

@kashparty commented on GitHub (Jan 24, 2020): Thank you for the help. I can confirm that the problem arises from `clever_capitalize`. I used the following code: ```c void correct_spelling_and_censor_words_608(struct encoder_ctx *context, int line_number, struct eia608_screen *data) { if (context->sentence_cap) { printf("0: %s\n", data->characters[line_number]); if (clever_capitalize(context, line_number, data)) { printf("1: %s\n", data->characters[line_number]); correct_case_with_dictionary(line_number, data); printf("2: %s\n", data->characters[line_number]); } } ``` When I ran this, it produced: ``` 0: Take a closer look at GEICO. 1: take a closer look at geico. 2: take a closer look at geico. ``` So it is obvious that clever_capitalize is causing the issue. After investigating a little more, I don't think that this has anything to do with tags at all. The value of `context->new_sentence` persists across lines, so if a previous line ended in an exclamation mark, then the first character on the next line will be capitalized. The issue that is seen here is probably because the previous line _doesn't_ end with any punctuation that needs capitalization after it: ``` 12 00:00:23,491 --> 00:00:24,523 (humming) 13 00:00:24,558 --> 00:00:25,924 <i> take a closer look at geico.</i> 14 00:00:25,960 --> 00:00:26,658 <i> Great savings.</i> ``` Because the previous line ends with `)`, `clever_capitalize` doesn't capitalize the next letter. Closing brackets don't always have capital letters after them. Does this behaviour even need to be fixed, and if so, what should be done about it?
Author
Owner

@kashparty commented on GitHub (Jan 24, 2020):

I could potentially try and change clever_capitalize so that after a closing bracket, it checks whether that is the last non-whitespace character in the line, and sets context->new_sentence accordingly. That way it would only capitalize the word after the brackets if the word is on the next line. Obviously we don't want it to always capitalize after brackets (because it) Would cause behaviour like this.

@kashparty commented on GitHub (Jan 24, 2020): I could potentially try and change `clever_capitalize` so that after a closing bracket, it checks whether that is the last non-whitespace character in the line, and sets `context->new_sentence` accordingly. That way it would only capitalize the word after the brackets if the word is on the next line. Obviously we don't want it to _always_ capitalize after brackets `(because it) Would cause behaviour like this.`
Author
Owner

@kashparty commented on GitHub (Jan 24, 2020):

I tried implementing the change I described in my previous comment by adding a case for ')':

#include <ctype.h>

...

case ')':
	context->new_sentence = 1;
	for (int j = i + 1; j < CCX_DECODER_608_SCREEN_WIDTH; j++)
	{
		if (!isspace(data->characters[line_num][j]) && data->characters[line_num][j] != 0x89)
			context->new_sentence = 0;
	}
	break;

This seems to work. I haven't tested it fully but it works with the video provided by @NilsIrl in the issue description:

12
00:00:23,491 --> 00:00:24,523
           (humming)            

13
00:00:24,558 --> 00:00:25,924
  <i> Take a closer look at geico.</i> 

14
00:00:25,960 --> 00:00:26,658
         <i> Great savings.</i>

This is a rough first attempt just to ensure that this approach would theoretically work. I used <ctype.h> but I probably don't really need it, and I'm not sure if data->characters[line_num][j] != 0x89 is necessary in the if statement. However, if this is a change that you would want to make, I would be more than happy to clean this up properly and make a pull request.

@kashparty commented on GitHub (Jan 24, 2020): I tried implementing the change I described in my previous comment by adding a case for `')'`: ```c #include <ctype.h> ... case ')': context->new_sentence = 1; for (int j = i + 1; j < CCX_DECODER_608_SCREEN_WIDTH; j++) { if (!isspace(data->characters[line_num][j]) && data->characters[line_num][j] != 0x89) context->new_sentence = 0; } break; ``` _This seems to work_. I haven't tested it fully but it works with the video provided by @NilsIrl in the issue description: ``` 12 00:00:23,491 --> 00:00:24,523 (humming) 13 00:00:24,558 --> 00:00:25,924 <i> Take a closer look at geico.</i> 14 00:00:25,960 --> 00:00:26,658 <i> Great savings.</i> ``` This is a rough first attempt just to ensure that this approach would theoretically work. I used `<ctype.h>` but I probably don't really need it, and I'm not sure if `data->characters[line_num][j] != 0x89` is necessary in the if statement. However, if this is a change that you would want to make, I would be more than happy to clean this up properly and make a pull request.
Author
Owner

@cfsmp3 commented on GitHub (Jan 25, 2020):

This seems to work. I haven't tested it fully but it works with the video provided by @NilsIrl in the issue description:

Most likely it won't work for many other samples. I don't think it's possible to come up with a reasonably naive algorithms (I mean, just looking at the previous character and/or lines) that works well for all cases.

It might be just simpler to have a post-processing program that uses a proper NLP library to correct capitalization.

@cfsmp3 commented on GitHub (Jan 25, 2020): > _This seems to work_. I haven't tested it fully but it works with the video provided by @NilsIrl in the issue description: Most likely it won't work for many other samples. I don't think it's possible to come up with a reasonably naive algorithms (I mean, just looking at the previous character and/or lines) that works well for all cases. It might be just simpler to have a post-processing program that uses a proper NLP library to correct capitalization.
Author
Owner

@cfsmp3 commented on GitHub (Jan 25, 2020):

I'm closing this since I really think it's not something we can solve inside CCExtractor, but if someone proves me wrong and sends a cool PR I'll merge :-)

@cfsmp3 commented on GitHub (Jan 25, 2020): I'm closing this since I really think it's not something we can solve inside CCExtractor, but if someone proves me wrong and sends a cool PR I'll merge :-)
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/ccextractor#547