mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-15 05:26:07 +00:00
[BUG] French captions lack accents #303
Originally created by @Liontooth on GitHub (Apr 16, 2017).
Please prefix your issue with one of the following: [BUG], [PROPOSAL], [QUESTION].
CCExtractor version (using the --version parameter preferably) : 0.84
In raising this issue, I confirm the following (please check boxes, eg [X]):
My familiarity with the project is as follows (check one, eg [X]):
Necessary information
-debug -ts -noru -out=ttxt -utf8
Video links
http://vrnewsscape.ucla.edu/dropbox/2017-03-28_0102_FR_TV5_Temps_pr%c3%a9sent.mpg
Additional information
The file contains both English and French subtitles, and they extract perfectly. However, the French subtitles lack accents -- for instance, the word "eleves" in line four should be "élevés", and "defraichies" in line five should really be "défraîchies", among many other examples. It's conceivable that these accents are missing in the transmission; however, this is official French state television, so we think that is unlikely. Could you have a look?
PS: Make sure you set an alert in GitHub so you get notifications about your ticket. We may need to ask questions and we do everything inside GitHub's system.
@saurabhshri commented on GitHub (Apr 22, 2017):
I downloaded the file, and checked. The file has only one subtitle stream (teletext) along with burned in subtitles. I couldn't find English subtitles.
Further, I could not get VLC to display the teletext subtitles, as it yielded a codec error.
So I downloaded smplayer, and it too couldn't display those teletext subtitles, though the media information clearly shows that a subtitle stream is present.
I am not familiar with any language other than English and Hindi, so I couldn't tell whether they are in a different language, but the subtitles extracted by CCExtractor differ from the burned-in ones.
But this is probably the way they are supplied.
TL;DR
(The red coloured subtitles are extracted by CCExtractor).
@Liontooth commented on GitHub (Apr 23, 2017):
French subtitles are on teletext page 891 and English on 892. For some reason VLC doesn't see them; instead, it displays only the German DVB subtitles. This is not the subject of this bug report, though I agree it would be helpful to see if VLC shows the captions correctly, with accents.
The issue is that CCExtractor successfully extracts French teletext on page 891, but they lack accents (diacritics). This is an official news broadcast from French state television, so it would be extremely surprising (but not impossible) if accents are missing in the transport stream.
@saurabhshri commented on GitHub (Apr 23, 2017):
@Liontooth True. Before digging into the file, I just thought checking with video players would be a good idea.
@mkver commented on GitHub (Apr 24, 2017):
VLC does not give me an error. Instead it shows the subtitles -- without accents:


So I guess that this is no bug in ccextractor.
@cfsmp3 commented on GitHub (Apr 24, 2017):
Confirmed. This issue happens here:
(in telxcc.c)
uint16_t telx_to_ucs2(uint8_t c)
{
    if (PARITY_8[c] == 0)
    {
        dbg_print (CCX_DMT_TELETEXT, "- Unrecoverable data error; PARITY(%02x)\n", c);
        return 0x20;
    }
}
For that 'è' in the 4th line, the value of c is 229 (ä), which is converted to 'e' via the G0 conversion table.
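To make the arithmetic concrete, here is a minimal sketch of why a byte value of 229 ends up as a plain 'e'. It assumes the usual Teletext odd-parity scheme, where the top bit is a parity bit and the low seven bits carry the character code. The helper names `has_odd_parity` and `strip_parity` are hypothetical and are not taken from telxcc.c.

```c
#include <stdint.h>

/* Returns 1 if the byte contains an odd number of set bits. */
static int has_odd_parity(uint8_t c)
{
    int ones = 0;
    for (int bit = 0; bit < 8; bit++)
        ones += (c >> bit) & 1;
    return ones & 1;
}

/* Hypothetical helper: returns the 7-bit character code,
 * or a space (0x20) when the parity check fails. */
static uint16_t strip_parity(uint8_t c)
{
    if (!has_odd_parity(c))
        return 0x20;
    return c & 0x7f; /* drop the parity bit */
}
```

229 is 0xE5, which has five set bits and therefore passes the parity check; 0xE5 & 0x7f is 0x65, the code for an unaccented 'e'. In other words, the top bit of the byte carries no accent information.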
We seem to be missing completely the part of the Teletext specs:
15.6.2 Latin National Option Sub-Sets
Also this part from the specs might be relevant:
15.1 Overview of designation requirements
In general, there is a G0 basic character set and a G2 supplementary character set, each of 96 entries, for
each alphabet. Level 1 transmissions are restricted to using the G0 set and some characters in the
table may be substituted to accommodate the requirements of the local languages. These national option
sub-sets are selected by the C12, C13 and C14 control bits in the page header. At levels 2.5 and 3.5 a
more precise method of designating the required G0 and G2 character sets and the national option subset
is available via packets X/28 and M/29, as described in subclause 15.2.
Where the local language requirements require more than 96 alphanumeric characters to provide a basic
service, additional packets X/26 may be introduced to form Level 1.5 transmissions. Typically these
incorporate a few characters from the G2 supplementary set of characters, plus a few G0 characters with
diacritical marks.
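As a rough illustration of the national option mechanism described in that excerpt, the sketch below overrides thirteen G0 positions with French characters. The position list and the French replacement values are written from memory of the national option table in the spec and should be checked against it before use. The function name `g0_french_to_ucs2` is hypothetical, and treating every other G0 code as ASCII is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

/* 7-bit G0 codes that a national option sub-set may replace
 * (assumed; verify against ETS 300 706). */
static const uint8_t NATIONAL_POSITIONS[13] = {
    0x23, 0x24, 0x40, 0x5b, 0x5c, 0x5d, 0x5e,
    0x5f, 0x60, 0x7b, 0x7c, 0x7d, 0x7e
};

/* Assumed French replacements, as UCS-2 code points. */
static const uint16_t FRENCH_SUBSET[13] = {
    0x00e9, 0x00ef, 0x00e0, 0x00eb, 0x00ea, 0x00f9, 0x00ee,
    0x0023, 0x00e8, 0x00e2, 0x00f4, 0x00fb, 0x00e7
};

/* Hypothetical lookup: map a G0 code to UCS-2 with the
 * French sub-set applied. */
static uint16_t g0_french_to_ucs2(uint8_t code)
{
    code &= 0x7f; /* ignore the parity bit */
    for (size_t i = 0; i < 13; i++)
        if (NATIONAL_POSITIONS[i] == code)
            return FRENCH_SUBSET[i];
    return code; /* simplification: remaining codes treated as ASCII */
}
```

Under this scheme an accented letter occupies its own code (0x60 for 'è' in the assumed table), so a transmitted 0x65 can only ever decode to an unaccented 'e'.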
@cfsmp3 commented on GitHub (Apr 24, 2017):
7 GSoC points.
@cfsmp3 commented on GitHub (Apr 24, 2017):
Assigned to @bigharshrag since he just offered and he made the G0 changes :-) Thanks a lot Rishabh
@thealphadollar commented on GitHub (Jan 22, 2018):
@cfsmp3 @canihavesomecoffee I'm willing to work on this issue and wish to be the assignee.
@cfsmp3 commented on GitHub (Jan 22, 2018):
Just work on it :-) We cannot assign you unless we add you to the team, but
we cannot add you until you have worked a bit, etc... but don't worry, just
raising your hand and saying you're working on it is good enough.
@thealphadollar commented on GitHub (Feb 3, 2018):
I think the video file lacks the accented characters. I say so because, for accented (i.e. diacritical) letters, the mode of the cell should be between 17 and 31 (both inclusive). Here 31 is the termination mark.
I modified the code so that it prints the mode in which each cell is interpreted. I found that only modes 0, 4 and 31 were present. This means that the transmission does not carry any diacritical marks.
The above screenshot shows that the program was set to print the mode value if it is not equal to 0, 4 or 31. Nothing was printed in the console window, which shows that no modes other than these three are present.
I also separately checked, to confirm, whether the condition with mode less than 31 and greater than 6 is ever reached. It never was.
An added observation is that the interpreted values for accented and unaccented 'e' are the same, so the two cannot be distinguished from the data in the video file.
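The check described above can be sketched roughly as follows. This is not the actual patch. It assumes the triplets have already been Hamming 24/18 decoded into 18-bit values with the address in bits 0-5, the mode in bits 6-10 and the data in bits 11-17, and that addresses 40 to 63 form the row address group. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Tally of X/26 triplets by what they could contribute. */
typedef struct {
    size_t g2_chars;   /* column triplet, mode 0x0f: G2 character */
    size_t diacritics; /* column triplet, modes 0x11..0x1f */
    size_t other;      /* row-group triplets and everything else */
} x26_mode_counts;

static x26_mode_counts count_x26_modes(const uint32_t *triplets, size_t n)
{
    x26_mode_counts counts = {0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        uint8_t address = triplets[i] & 0x3f;
        uint8_t mode = (triplets[i] >> 6) & 0x1f;
        int row_address_group = (address >= 40); /* 6 bits, so at most 63 */

        if (row_address_group)
            counts.other++;      /* includes the mode 31 termination marker */
        else if (mode == 0x0f)
            counts.g2_chars++;
        else if (mode >= 0x11)
            counts.diacritics++; /* the mask keeps mode at or below 0x1f */
        else
            counts.other++;
    }
    return counts;
}
```

If both `g2_chars` and `diacritics` stay at zero over a whole recording, the stream carries no X/26 enhancement data that could add accents, which would match the finding that only modes 0, 4 and 31 occur.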
@thealphadollar commented on GitHub (Feb 11, 2018):
@cfsmp3 have a look please?
@cfsmp3 commented on GitHub (Feb 16, 2018):
@thealphadollar - You seem to be right. I've been looking into it for a while, and the code in telxcc that would use G2 is there; it's just not called.
```
// ETS 300 706, chapter 12.3.1, table 27: character from G2 set
if ((mode == 0x0f) && (row_address_group == NO))
{
    x26_col = address;
    if (data > 31)
    {
        ctx->page_buffer.text[x26_row][x26_col] = G2[0][data - 0x20];
        ctx->page_buffer.g2_char_present[x26_row][x26_col] = 1;
    }
}
```
An obvious explanation is that we aren't parsing the mode correctly, but if that were true nothing would work...
So unless someone produces any software generating or displaying these accents I'm inclined to just close this ticket.