mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-03 21:23:48 +00:00
@ in teletext-subs saved as asterisk #89
Originally created by @hurda on GitHub (Nov 18, 2015).
CCExtractor 0.77 and git-677fee4
File: http://www.mediafire.com/download/q6ebvmzwe1prvi3/cce-at-sign.7z (25MB)
Teletext:

SRT:
@hurda commented on GitHub (Dec 7, 2015):
Addendum: Also affects other output-formats, like SAMI and TTXT.
@anshul1912 commented on GitHub (Dec 7, 2015):
When I try to download the file above, it shows as deleted.
@cfsmp3 commented on GitHub (Jan 7, 2016):
Please reopen when a working link is available.
@hurda commented on GitHub (Jan 7, 2016):
http://www.mediafire.com/download/q6ebvmzwe1prvi3/cce-at-sign.7z
@cfsmp3 commented on GitHub (Jan 7, 2016):
I'm looking into this. That * is written to the buffer here:
ctx->page_buffer.text[y][i] = telx_to_ucs2(packet->data[i]);
packet->data[i] contains 42 (0x2a) which is indeed an asterisk
http://www.columbia.edu/kermit/ucs2.html
@Dhrumil2910 commented on GitHub (Mar 9, 2016):
I'm not able to download the file. Could you please re-upload it?
@cfsmp3 commented on GitHub (Mar 9, 2016):
It seems to work fine for me... is there an error message when trying to
download the file?
@Dhrumil2910 commented on GitHub (Mar 9, 2016):
It is because the college proxy server is denying the download.
@cfsmp3 commented on GitHub (Mar 9, 2016):
OK, I've uploaded it to slack. Maybe it can be downloaded from there?
@Dhrumil2910 commented on GitHub (Mar 10, 2016):
OK, thanks a lot for your support.
@isacdaavid commented on GitHub (Mar 11, 2016):
I'm trying to track that 0x2a byte back through the pipeline to find where things went wrong, if anywhere; but I need to know what the adequate value would be.
For this test video telx_to_ucs2() converts 0x40 to '§' rather than '@' because of the local language (German) substitutions in the basic character set specified in ETS 300-706. Is that OK?
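The substitution described above can be sketched as a lookup keyed on the national option subset. This is only an illustration, not ccextractor's actual code: the names `latin_subset`, `g0_to_ucs2` and `demo_g0_at_position` are hypothetical, and the table is reduced to the single position 0x40 for the English and German rows of the Latin National Option table in ETS 300 706.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily reduced national option selector: only the
   English and German rows are modelled in this sketch. */
typedef enum { SUBSET_ENGLISH, SUBSET_GERMAN } latin_subset;

/* Map a 7-bit G0 code to UCS-2 for the given subset.
   Assumption: every position other than 0x40 is treated as plain ASCII,
   which is a simplification (the real subsets replace several more positions). */
uint16_t g0_to_ucs2(uint8_t code, latin_subset subset)
{
    assert(code < 0x80); /* parity bit must already be stripped */
    if (code == 0x40)
        return subset == SUBSET_GERMAN ? 0x00A7 /* section sign */ : 0x0040 /* at sign */;
    return code;
}

/* Usage example: the same input byte yields different characters. */
void demo_g0_at_position(void)
{
    assert(g0_to_ucs2(0x40, SUBSET_GERMAN) == 0x00A7);
    assert(g0_to_ucs2(0x40, SUBSET_ENGLISH) == 0x0040);
}
```

With the German subset selected, no 7-bit input produces U+0040 in this sketch, which matches the observation that 0x40 comes out as '§' for this stream.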
@abhishek-vinjamoori commented on GitHub (Mar 11, 2016):
@isacdaavid, ideally telx_to_ucs2() should receive an input of 64 to return "@". But with the current decoding there is no possible input for which the output is "@".
@isacdaavid commented on GitHub (Mar 12, 2016):
I think this bug is invalid after all. The asterisk is really there at offset 0xDBF164 in the file (the value is 0x54, which is 0x2A with its bit order reversed), and OP's software is responsible for outputting the at sign.
I followed that particular 0x2A back to tlt_process_pes_packet(), where the bit reversal turns the raw 0x54 into 0x2A. After failing to find any other transformation through several function calls and buffers, all the way back to get_cinfo(), I became suspicious that ccextractor had been doing the right thing all along. Sure enough, I searched the binary file for "a7 37 54 f7" after printing the adjacent values according to ccextractor, and after changing that "54" to "02" (0x40 bit-reversed) the '§' appeared in the .srt output, as expected from my previous post.
I can provide the hex-edited video and my debugging patch/pull request. If you like, I could also implement a change to telx_to_ucs2() that would output an at sign when it finds an asterisk, but I guess you don't want to introduce such behaviour. I'm interested in proving that I know some git and can make useful changes to your codebase, but I fear that if this bug gets closed without needing a patch then I will not have earned the points for my GSoC application :(
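The byte values quoted above (0x2A stored as 0x54, 0x40 stored as 0x02) are related by reversing the order of the bits within one byte. A minimal sketch of that operation follows; `reverse_bits8` and `demo_thread_values` are hypothetical names chosen for illustration and are not taken from the ccextractor source.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bit order of one byte (bit 0 <-> bit 7, bit 1 <-> bit 6, ...). */
uint8_t reverse_bits8(uint8_t b)
{
    b = (uint8_t)(((b & 0xF0) >> 4) | ((b & 0x0F) << 4)); /* swap nibbles */
    b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2)); /* swap bit pairs */
    b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1)); /* swap adjacent bits */
    return b;
}

/* Usage example: checks the values mentioned in the comment above. */
void demo_thread_values(void)
{
    assert(reverse_bits8(0x54) == 0x2A); /* raw file byte -> asterisk */
    assert(reverse_bits8(0x02) == 0x40); /* hex-edited byte -> position 0x40 */
}
```

Because the operation is its own inverse, the same function converts in both directions, which is handy when searching a raw stream for a decoded value.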
@cfsmp3 commented on GitHub (Mar 12, 2016):
The issue is with supplementary charsets.
A good starting point to research this is to google "supplementary charsets
teletext". There are other teletext applications around; surely some of
them get this right and we can learn from them.
This is not about generically replacing one char with another (obviously
that might work for this specific file but would break many others)
but rather about completing the supplementary charset implementation.
A good thing is that the teletext specifications are public and totally free,
so this also serves as an introduction to standards documents :-)
Notes to GSoC applicants
specific solution is not used. We'll need to pick just one and that might
come down to personal taste. As long as your solution is valid, you'll get
points.
where in the standard that character appear mention it here, don't keep it
for yourself.
@hurda commented on GitHub (Mar 13, 2016):
Extracting the subtitles using ProjectX 0.91.0.10 portable also outputs "@". Maybe it helps.
EDIT:
http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/subtitle/Teletext.java?view=annotate
http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/subtitle/CharSet.java?view=markup
http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/parser/StreamProcessTeletext.java?view=annotate
http://project-x.cvs.sourceforge.net/viewvc/project-x/Project-X/src/net/sourceforge/dvb/projectx/parser/StreamProcessSubpicture.java?view=annotate
EDIT2:
VLC (2.2.2) shows @ too.
http://git.videolan.org/?p=vlc.git;a=blob;f=modules/codec/telx.c;h=4f8842a95f4a94cb326d3e48234014852f04c235;hb=HEAD
To check this, you'll have to use this file: http://www.mediafire.com/download/7fwbqdw57sxykby/at-sign_teletext_pcr-pts.ts
The other has a difference between the PCR-timestamps of A/V and Teletext of almost three hours, which VLC apparently can't handle.
EDIT3:
While this changes what is being output by ccextractor, DVBViewer, ProjectX and VLC are still showing @.
How's that possible?
@isacdaavid commented on GitHub (Mar 14, 2016):
Quick update:
I couldn't find the @ in any of the supplementary (AKA G2) character sets. I still need to find more information on the second G0 sets (mentioned in section 15.3 in the ETS 300 706) and modified G0 and G2 sets (part of Teletext level 2.5 and 3.5, mentioned in section 15.4 in the standard).
@hurda Thanks. I will definitely see what other projects are doing if I fail to find a satisfactory explanation in those extra character sets.
EDIT:
I found it. This weird behaviour seems to have been introduced in ETS 300 706 version 1.2.1 from 2003 as a marginal note in section 15.6.1 about the basic G0 Latin character set. Quoting from it:
Time to implement it!
@cfsmp3 commented on GitHub (Mar 14, 2016):
It's in table 36 (Latin National Option subset), in the English row.
Page 115 of ETS 300 706: May 1997
@hurda commented on GitHub (Mar 14, 2016):
Good catch! It doesn't really help that the first search engine results for "ets 300 706" are for the 1997 version of the spec.
telxcc.c references the 1997 spec, but not the 2003 one.
Here's the link: http://www.etsi.org/deliver/etsi_en/300700_300799/300706/01.02.01_60/en_300706v010201p.pdf
PS: It's actually clause 12.3.4, at the bottom of table 29.
@cfsmp3 commented on GitHub (Mar 14, 2016):
This seems like the best possible explanation. It's a one liner fix
probably. Points will be awarded to the first GSoC applicant that sends a
proper PR :-)
@abhishek-vinjamoori commented on GitHub (Mar 14, 2016):
According to the standard, only when the packet is X/26 must the "*" (42) be replaced with "@":
    if (y == 26) // but currently the * is addressed at y=22
    {
        // and Mode Description = 10000 and Data = 0101010
        if (data == 64 && mode == 0x10)
        {
            ctx->page_buffer.text[i][j] = 0x0040;
        }
    }
According to the current state of the decoding:
    if (data == 10 && mode == 2 && ctx->page_buffer.text[y][k] == 42 && default_g0_charset == LATIN)
        ctx->page_buffer.text[y][k] = 0x0040; // special case only for @
Here k is iterated from 0 to 39.
@abhishek-vinjamoori commented on GitHub (Mar 14, 2016):
I need a file in which "*" is actually used. (With that, it can be verified whether the data/mode values are different when * actually appears.)
@hurda commented on GitHub (Mar 14, 2016):
http://www.mediafire.com/download/apc078mz884gbkr/teletext_subtitles_with_asterisk_pcr-pts.ts
@abhishek-vinjamoori commented on GitHub (Mar 14, 2016):
Could this be hosted somewhere else? Mediafire is blocked here.
@hurda commented on GitHub (Mar 14, 2016):
http://www111.zippyshare.com/v/Gc65zLfD/file.html
@abhishek-vinjamoori commented on GitHub (Mar 14, 2016):
1
00:00:03,560 --> 00:00:06,160
schon wieder Streit, Doris.
2
00:00:10,360 --> 00:00:12,000
Is this the desired output?
@hurda commented on GitHub (Mar 14, 2016):
Have you omitted some lines for brevity?
Here are all subtitles of that sample-video:
That's with ccextractor 0.79.
@abhishek-vinjamoori commented on GitHub (Mar 15, 2016):
Yes. That is the output I'm getting. Is there any other file with "@"? It would be really helpful.
@hurda commented on GitHub (Mar 15, 2016):
I only got files with the same "untertitel*orf.at"-output.
@abhishek-vinjamoori commented on GitHub (Mar 15, 2016):
But are they different files?
@hurda commented on GitHub (Mar 15, 2016):
Yes. http://www36.zippyshare.com/v/BwLLxb8i/file.html
@cfsmp3 commented on GitHub (Mar 18, 2016):
Solved in current github version.