[BUG] French DVB subtitles need deduplication #462

Closed
opened 2026-01-29 16:44:37 +00:00 by claunia · 1 comment

Originally created by @Liontooth on GitHub (Nov 18, 2018).

CCExtractor version: 0.85

In raising this issue, I confirm the following:

  • I have read and understood the contributors guide.
  • I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • I have checked that the issue I'm posting isn't already reported.
  • I have checked that the issue I'm reporting isn't already solved and no duplicates exist in closed issues and in opened issues
  • I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.

My familiarity with the project is as follows:

  • I am an active contributor to CCExtractor.

Necessary information

  • Is this a regression (did it work before)? [X] NO
  • What platform did you use? [ ] Windows - [X] Linux - [ ] Mac
  • What were the used arguments? -datets -ttxt -UCLA -noru -utf8

Video links
http://vrnewsscape.ucla.edu/dropbox/2017-07-14_1100_FR_TF1_Journal.mpg
http://vrnewsscape.ucla.edu/dropbox/2017-07-14_1100_FR_TF1_Journal.txt

Additional information
CCExtractor-0.85 compiled 2017-07-29 with liblept4 succeeds in extracting DVB captions from the file above, as shown in the accompanying txt file (Chrome gets the encoding wrong and no longer has a way to correct it; in fact the file is UTF-8). (CCExtractor-0.86 and CCExtractor-0.87 fail to find any subtitles, see issue #1039.)

However, each line appears in part several times before it completes, and also at times partially repeats in the following line:

20170714110001.000|20170714110001.360|CC1|distribués gratuitement pour petits,
20170714110001.360|20170714110001.480|CC1|distribués, gratuitement pour petits et
20170714110001.480|20170714110001.880|CC1|distribués, gratuitement pour petits et grands,
20170714110001.880|20170714110002.280|CC1|distribués, gratuitement pour …
20170714110002.280|20170714110002.440|CC1|distribués, gratuitement pour petits et grands,, histoire que
20170714110002.440|20170714110002.840|CC1|petits et grands,, histoire que pe rd u re,
20170714110002.840|20170714110003.120|CC1|petits et grands,, histoire que pe rd u re, cette
20170714110003.120|20170714110003.400|CC1|petits et grands,, histoire que pe rd u re, cette a n n ée
20170714110003.400|20170714110003.800|CC1|petits et grands,, histoire que perdure, cette année encore,
20170714110003.800|20170714110003.880|CC1|petits et grands,, histoire que perdure, cette année encore, la

CCExtractor has solved this duplication problem in teletext; it's clearly also present in some DVB subtitles, notably the French network TF1.
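As an illustration of the kind of deduplication being requested, here is a minimal post-processing sketch that works on the `-ttxt` output shown above. It is not CCExtractor's teletext deduplication code, and the names `Cue`, `parse_cue`, `normalize`, `collapse_partials` and `trim_overlap` are hypothetical. It assumes that a partial line can be recognised because, once spaces and punctuation are ignored, it is a prefix of the line that follows it (or a truncated repeat of the line before it), and that a repeat carried into the next line shows up as a run of identical leading words.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: str
    end: str
    channel: str
    text: str


def parse_cue(line: str) -> Cue:
    # Format seen in the sample above: start|end|channel|text
    start, end, channel, text = line.rstrip("\n").split("|", 3)
    return Cue(start, end, channel, text)


def normalize(text: str) -> str:
    # Keep only letters and digits so OCR noise (stray spaces, doubled
    # commas, a trailing ellipsis) does not prevent a match.
    return re.sub(r"[\W_]+", "", text.lower())


def collapse_partials(cues: List[Cue]) -> List[Cue]:
    """Merge consecutive cues when one is a partial version of the other."""
    merged: List[Cue] = []
    for cue in cues:
        if merged and merged[-1].channel == cue.channel:
            prev = merged[-1]
            old, new = normalize(prev.text), normalize(cue.text)
            if new.startswith(old):
                # The new cue extends the previous one: keep the longer text.
                merged[-1] = Cue(prev.start, cue.end, prev.channel, cue.text)
                continue
            if old.startswith(new):
                # The new cue is a truncated repeat: only extend the end time.
                merged[-1] = prev._replace(end=cue.end)
                continue
        merged.append(cue)
    return merged


def trim_overlap(cues: List[Cue]) -> List[Cue]:
    """Drop leading words of a cue that repeat the tail of the previous cue."""
    trimmed: List[Cue] = []
    prev_keys: List[str] = []
    for cue in cues:
        words = cue.text.split()
        keys = [normalize(word) for word in words]
        overlap = 0
        # Look for the longest tail of the previous cue that opens this one.
        for k in range(min(len(prev_keys), len(keys)), 0, -1):
            if prev_keys[-k:] == keys[:k]:
                overlap = k
                break
        trimmed.append(cue._replace(text=" ".join(words[overlap:])))
        prev_keys = keys  # compare against the untrimmed text next time
    return trimmed


SAMPLE = """\
20170714110001.000|20170714110001.360|CC1|distribués gratuitement pour petits,
20170714110001.360|20170714110001.480|CC1|distribués, gratuitement pour petits et
20170714110001.480|20170714110001.880|CC1|distribués, gratuitement pour petits et grands,
20170714110001.880|20170714110002.280|CC1|distribués, gratuitement pour …
20170714110002.280|20170714110002.440|CC1|distribués, gratuitement pour petits et grands,, histoire que
20170714110002.440|20170714110002.840|CC1|petits et grands,, histoire que pe rd u re,
20170714110002.840|20170714110003.120|CC1|petits et grands,, histoire que pe rd u re, cette
20170714110003.120|20170714110003.400|CC1|petits et grands,, histoire que pe rd u re, cette a n n ée
20170714110003.400|20170714110003.800|CC1|petits et grands,, histoire que perdure, cette année encore,
20170714110003.800|20170714110003.880|CC1|petits et grands,, histoire que perdure, cette année encore, la
"""

if __name__ == "__main__":
    cues = [parse_cue(line) for line in SAMPLE.splitlines()]
    for cue in trim_overlap(collapse_partials(cues)):
        print("|".join(cue))
```

On the ten sample lines this leaves two cues: one ending in "histoire que" and one that begins at "perdure,". The doubled comma after "grands" survives, because the sketch only removes repeats and does not try to correct OCR errors.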

claunia added the difficulty: medium label 2026-01-29 16:44:37 +00:00

@cfsmp3 commented on GitHub (Dec 27, 2025):

Closing - samples are no longer working, I suppose this corner case is no longer important? @Liontooth let us know if you still would like to see this happen.


Reference: starred/ccextractor#462