[BUG] French captions lack accents #303

Closed
opened 2026-01-29 16:40:26 +00:00 by claunia · 12 comments

Originally created by @Liontooth on GitHub (Apr 16, 2017).

Please prefix your issue with one of the following: [BUG], [PROPOSAL], [QUESTION].

CCExtractor version (using the --version parameter preferably) : 0.84

In raising this issue, I confirm the following (please check boxes, eg [X]):

  • [X] I have read and understood the contributors guide.
  • [X] I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • [X] I have checked that the issue I'm posting isn't already reported.
  • [X] I have checked that the issue I'm posting isn't already solved and no duplicates exist in closed issues and in opened issues.
  • [X] I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
  • [X] I have used the latest available version of CCExtractor to verify this issue exists.

My familiarity with the project is as follows (check one, eg [X]):

  • [ ] I have never used CCExtractor.
  • [ ] I have used CCExtractor just a couple of times.
  • [ ] I absolutely love CCExtractor, but have not contributed previously.
  • [X] I am an active contributor to CCExtractor.

Necessary information

  • Is this a regression (did it work before)? [X] NO | [ ] YES - please specify the last known working version
  • What platform did you use? [ ] Windows - [X] Linux - [ ] Mac
  • What were the arguments used? -debug -ts -noru -out=ttxt -utf8

Video links

http://vrnewsscape.ucla.edu/dropbox/2017-03-28_0102_FR_TV5_Temps_pr%c3%a9sent.mpg

Additional information

The file contains both English and French subtitles, and they extract perfectly. However, the French subtitles lack accents -- for instance, the word "eleves" in line four should be "élevés", and "defraichies" in line five should really be "défraîchies", among many other examples. It's conceivable that these accents are missing in the transmission; however, this is official French state television, so we think that is unlikely. Could you have a look?

00:06:00,120|00:06:02,240|TLT|Bonsoir. Ce soir, nous vous emmenons
00:06:02,440|00:06:04,520|TLT|dans le monde des hotels et du tourisme.
00:06:04,720|00:06:07,680|TLT|L'image que nous en avons concernant la Suisse,
00:06:07,960|00:06:10,040|TLT|c'est celui de prix excessivement eleves
00:06:10,200|00:06:11,760|TLT|pour des chambres defraichies
00:06:11,960|00:06:14,240|TLT|et un accueil pas toujours a la hauteur
00:06:14,400|00:06:16,200|TLT|de ce qu'on vit a l'etranger.
00:06:16,400|00:06:19,360|TLT|Cette impression d'un tourisme et d'une hotellerie helvetiques
00:06:19,560|00:06:21,800|TLT|legerement assoupis n'est pas qu'une idee.
00:06:22,000|00:06:24,600|TLT|La realite est que l'hotellerie suisse de moyenne gamme

PS: Make sure you set an alert in GitHub so you get notifications about your ticket. We may need to ask questions and we do everything inside GitHub's system.

@saurabhshri commented on GitHub (Apr 22, 2017):

> The file contains both English and French subtitles

I downloaded the file and checked it. It has only one subtitle stream (teletext), along with burned-in subtitles. I couldn't find English subtitles.

Further, I could not get VLC to display the teletext subtitles, as it yielded a codec error.

screenshot 318

So I downloaded smplayer, and it couldn't display those teletext subtitles either, although the media information clearly shows that the subtitle stream is present.

screenshot 324

I am not familiar with any language other than English and Hindi, so I couldn't tell whether they are in a different language, but the subtitles extracted by CCExtractor differ from the burned-in ones.

screenshot 320

But this is probably the way they are supplied.

TL;DR

  • No video player could play that subtitle stream.
  • Couldn't confirm if the supplied teletext subtitles have accents in them.
  • The burned-in subtitles have accents in a few places.

screenshot 325
(The red coloured subtitles are extracted by CCExtractor).

@Liontooth commented on GitHub (Apr 23, 2017):

French subtitles are on teletext page 891 and English on 892. For some reason VLC doesn't see them; instead, it displays only the German DVB subtitles. This is not the subject of this bug report, though I agree it would be helpful to see if VLC shows the captions correctly, with accents.

The issue is that CCExtractor successfully extracts French teletext on page 891, but they lack accents (diacritics). This is an official news broadcast from French state television, so it would be extremely surprising (but not impossible) if accents are missing in the transport stream.

@saurabhshri commented on GitHub (Apr 23, 2017):

@Liontooth True. Before digging into the file, I just thought checking with video players would be a good idea.

@mkver commented on GitHub (Apr 24, 2017):

VLC does not give me an error. Instead it shows the subtitles -- without accents:
vlcsnap-3879-05-27-22h58m24s352
vlcsnap-4524-05-20-19h16m39s827
So I guess that this is no bug in ccextractor.

@cfsmp3 commented on GitHub (Apr 24, 2017):

Confirmed. This issue happens here:

(in telxcc.c)

uint16_t telx_to_ucs2(uint8_t c)
{
    if (PARITY_8[c] == 0)
    {
        dbg_print(CCX_DMT_TELETEXT, "- Unrecoverable data error; PARITY(%02x)\n", c);
        return 0x20;
    }

    uint16_t r = c & 0x7f;
    if (r >= 0x20)
        r = G0[default_g0_charset][r - 0x20];
    return r;
}

For that 'é' in the 4th line, the value of c is 229 (0xE5), which becomes 0x65 once the parity bit is masked off and is then converted to 'e' via the G0 conversion table.
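The arithmetic in that comment can be checked in isolation. The following is a minimal sketch, not CCExtractor code: `has_odd_parity` is a hypothetical stand-in for the `PARITY_8` lookup table, and `strip_parity` mirrors the `c & 0x7f` step. It assumes Teletext text bytes carry odd parity in the top bit.

```c
#include <stdint.h>

/* Hypothetical stand-in for the PARITY_8 table: returns 1 when the byte
   has an odd number of set bits, 0 otherwise. */
static int has_odd_parity(uint8_t c)
{
    int bits = 0;
    for (int i = 0; i < 8; i++)
        bits += (c >> i) & 1;
    return bits & 1;
}

/* Drop the parity bit and keep the 7-bit character code,
   as telx_to_ucs2() does with c & 0x7f. */
static uint8_t strip_parity(uint8_t c)
{
    return (uint8_t)(c & 0x7f);
}
```

Under these assumptions, `has_odd_parity(229)` is 1 (0xE5 has five set bits) and `strip_parity(229)` is 0x65, the code for a plain 'e'. In other words, the byte itself carries no accent information; any accent would have to come from somewhere else in the stream.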

We seem to be missing completely the part of the Teletext specs:
15.6.2 Latin National Option Sub-Sets

Also this part from the specs might be relevant:

15.1 Overview of designation requirements
In general, there is a G0 basic character set and a G2 supplementary character set, each of 96 entries, for
each alphabet. Level 1 transmissions are restricted to using the G0 set and some characters in the
table may be substituted to accommodate the requirements of the local languages. These national option
sub-sets are selected by the C12, C13 and C14 control bits in the page header. At levels 2.5 and 3.5 a
more precise method of designating the required G0 and G2 character sets and the national option subset
is available via packets X/28 and M/29, as described in subclause 15.2.
Where the local language requirements require more than 96 alphanumeric characters to provide a basic
service, additional packets X/26 may be introduced to form Level 1.5 transmissions. Typically these
incorporate a few characters from the G2 supplementary set of characters, plus a few G0 characters with
diacritical marks.
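To make the national option mechanism concrete, here is a rough sketch of how a French sub-set would be applied to a parity-stripped code. The helper name `g0_french_to_ucs2` is hypothetical, and both tables are written from memory of the Latin national option layout, so they are assumptions to be verified against the spec before being relied on.

```c
#include <stdint.h>

/* The 13 G0 positions that a Latin national option sub-set may replace
   (assumed layout; verify against the spec). */
static const uint8_t NATIONAL_POSITIONS[13] = {
    0x23, 0x24, 0x40, 0x5b, 0x5c, 0x5d, 0x5e,
    0x5f, 0x60, 0x7b, 0x7c, 0x7d, 0x7e
};

/* Assumed French replacements as UCS-2 code points, in the same order:
   é ï à ë ê ù î # è â ô û ç */
static const uint16_t FRENCH_SUBSET[13] = {
    0x00e9, 0x00ef, 0x00e0, 0x00eb, 0x00ea, 0x00f9, 0x00ee,
    0x0023, 0x00e8, 0x00e2, 0x00f4, 0x00fb, 0x00e7
};

/* Hypothetical helper: map a 7-bit code in 0x20..0x7e to UCS-2 using the
   French sub-set. Codes outside the 13 replaceable positions keep their
   basic Latin value (0x7f, a block glyph in G0, is not handled here). */
static uint16_t g0_french_to_ucs2(uint8_t code)
{
    for (int i = 0; i < 13; i++)
        if (code == NATIONAL_POSITIONS[i])
            return FRENCH_SUBSET[i];
    return code;
}
```

If these tables are right, code 0x65 is not one of the replaceable positions, so it stays 'e' whichever sub-set is selected. An 'é' would have to be transmitted as 0x23 with the French sub-set active, or be composed through an X/26 packet.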

@cfsmp3 commented on GitHub (Apr 24, 2017):

7 GSoC points.

@cfsmp3 commented on GitHub (Apr 24, 2017):

Assigned to @bigharshrag since he made the G0 changes and just offered :-) Thanks a lot, Rishabh.

@thealphadollar commented on GitHub (Jan 22, 2018):

@cfsmp3 @canihavesomecoffee I'm willing to work on this issue and wish to be the assignee.

@cfsmp3 commented on GitHub (Jan 22, 2018):

Just work on it :-) We cannot assign you unless we add you to the team, but
we cannot add you until you have worked a bit, etc... but don't worry, just
raising your hand and saying you're working on it is good enough.

@thealphadollar commented on GitHub (Feb 3, 2018):

I think the video file lacks the accented characters. I say so because, for accented characters (those with diacritical marks), the mode of the cell should be between 17 and 31 (both inclusive). Here 31 is the termination mark.

screenshot from 2018-02-03 23-19-51

I modified the code so that it prints the mode in which each cell is interpreted. I found that only modes 0, 4 and 31 were present. This means that the transmission does not include any diacritical marks.

screenshot from 2018-02-03 23-25-20

The screenshot above shows that the program was set to print the mode value whenever it is not equal to 0, 4 or 31. Nothing was printed in the console window, which shows that no mode other than these three is present.

To confirm, I also checked separately whether the condition with mode less than 31 and greater than 6 is ever reached. It is not.

An added observation is that the interpreted values for accented and unaccented 'e' are the same, so the two cannot be distinguished from the data in the video file: the accents are simply not present.
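For readers following along, the mode check described here can be sketched as below. This is an illustration, not the code from the screenshots: the struct and function names are hypothetical, and the bit layout (6-bit address, 5-bit mode, 7-bit data in an 18-bit triplet after Hamming 24/18 decoding) is my understanding of the X/26 format and should be treated as an assumption.

```c
#include <stdint.h>

/* Fields of one X/26 triplet (18 data bits, assumed layout). */
struct x26_triplet {
    uint8_t address; /* bits 0-5   */
    uint8_t mode;    /* bits 6-10  */
    uint8_t data;    /* bits 11-17 */
};

static struct x26_triplet unpack_triplet(uint32_t t)
{
    struct x26_triplet r;
    r.address = (uint8_t)(t & 0x3f);
    r.mode = (uint8_t)((t >> 6) & 0x1f);
    r.data = (uint8_t)((t >> 11) & 0x7f);
    return r;
}

/* Addresses 40-63 address a row; 0-39 address a column. */
static int is_row_address(uint8_t address)
{
    return address >= 40 && address <= 63;
}

/* Column-addressed modes 0x11-0x1f (17-31) carry a G0 character
   together with a diacritical mark. */
static int is_diacritic_triplet(struct x26_triplet t)
{
    return !is_row_address(t.address) && t.mode >= 0x11 && t.mode <= 0x1f;
}
```

With this layout, a stream that only ever produces modes 0, 4 and row-addressed 31 never satisfies `is_diacritic_triplet`, which is consistent with the observation that no accented characters are transmitted.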

@thealphadollar commented on GitHub (Feb 11, 2018):

@cfsmp3 have a look please?

@cfsmp3 commented on GitHub (Feb 16, 2018):

@thealphadollar - You seem to be right. I've been looking into it for a while, and the code in telxcc that would use G2 is there; it's just not called.

`
// ETS 300 706, chapter 12.3.1, table 27: character from G2 set
if ((mode == 0x0f) && (row_address_group == NO))
{
    x26_col = address;
    if (data > 31)
    {
        ctx->page_buffer.text[x26_row][x26_col] = G2[0][data - 0x20];
        ctx->page_buffer.g2_char_present[x26_row][x26_col] = 1;
    }
}

`

An obvious explanation is that we aren't parsing the mode correctly, but if that were true, nothing would work...

So unless someone produces any software generating or displaying these accents I'm inclined to just close this ticket.
