mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-17 05:25:33 +00:00
Cyrillic support. Missing or broken. #117
Originally created by @cfsmp3 on GitHub (Feb 20, 2016).
When processing a Russian TV capture we're getting Western characters instead of Cyrillic. I see we need Cyrillic G0 Primary Set - Option 2 - Russian/Bulgarian -- how does that get turned on?
The first test file is available here:
http://vrnewsscape.ucla.edu/dropbox/2016-02-09_1800_RU_TVC_Test.mpg
http://vrnewsscape.ucla.edu/dropbox/2016-02-09_1800_RU_TVC_Test.txt
VLC plays the teletext just fine


@bigharshrag commented on GitHub (Mar 13, 2016):
There is no support yet for Greek, Arabic, or Hebrew, nor for accented character sets (G2 character sets) for languages other than Latin-based ones.
I'd be willing to add the support but I do not have any test files for the same.
@cfsmp3 commented on GitHub (Mar 13, 2016):
I don't think we have any of those either, but we'll try to find some.
Are you applying to GSoC?
@bigharshrag commented on GitHub (Mar 13, 2016):
Yes, I am indeed applying for GSoC.
Should I go ahead and try to add the above-mentioned character set support without files to test on?
@cfsmp3 commented on GitHub (Mar 13, 2016):
Sure, it's better to have theoretical implementations than nothing :-) That
way when we actually get samples we'll have a good starting point.
@ghost commented on GitHub (Mar 13, 2016):
The patch looks like it's working perfectly for Russian TVC -- I'm just waiting to get that confirmed by a native speaker. The other Russian channel we sampled is 1TV, and we're still getting Latin1. Please see http://vrnewsscape.ucla.edu/dropbox/2016-02-09_1900_RU_1TV_Test.mpg
@bigharshrag commented on GitHub (Mar 15, 2016):
@littleredhen Could you please tell me if a video player like VLC produces the correct subtitles? I could not get VLC to do so.
@abhishek-vinjamoori commented on GitHub (Mar 15, 2016):
@bigharshrag, please check my comment on your PR.
@ghost commented on GitHub (Mar 18, 2016):
I haven't had a chance to play this file in VLC, but I can confirm that CCExtractor is able to extract text from the file. However, this text is still in a Latin character set. Are you not able to get text out with CCExtractor?
@abhishek-vinjamoori commented on GitHub (Mar 18, 2016):
@littleredhen, this issue was updated again today with bug fixes. Please check whether the latest code is being used, although some problems are still not solved.
@bigharshrag commented on GitHub (Mar 18, 2016):
@littleredhen I can confirm that CCExtractor does extract subtitles from the file and that they do use the Latin character set. I looked into the file provided and found that it does not actually send the packet that would tell CCExtractor which character set to use (the X/28 or M/29 packet). Hence CCExtractor falls back to the Latin character set, which the ETS 300 706 documentation specifies as the default in such a scenario. The TVC file provided earlier had this data.
Hence I was wondering whether you could in fact get any video player to display the subtitles correctly in Cyrillic script.
Also, the new bug fixes don't fix this problem.
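For illustration only, the fallback described in the comment above can be sketched in C roughly as follows. All names here (`g0_charset`, `designation`, `select_g0`) are hypothetical and are not CCExtractor's actual identifiers. The sketch also assumes that a page-level X/28 designation is preferred over a magazine-level M/29 one, and it ignores the national option bits a real decoder would also consult.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not CCExtractor's real identifiers. */
typedef enum {
    G0_LATIN,              /* default when nothing is designated */
    G0_CYRILLIC_OPTION_2   /* Russian/Bulgarian */
} g0_charset;

typedef struct {
    bool has_designation;  /* true if this packet was seen in the stream */
    g0_charset charset;    /* G0 set the packet designates */
} designation;

/*
 * Pick the G0 character set for a page.
 * Assumption: X/28 (page-level) is checked before M/29 (magazine-level).
 * If neither packet was received, fall back to Latin.
 */
g0_charset select_g0(const designation *x28, const designation *m29)
{
    if (x28 != NULL && x28->has_designation)
        return x28->charset;
    if (m29 != NULL && m29->has_designation)
        return m29->charset;
    return G0_LATIN;
}
```

Under these assumptions, a stream like the 1TV sample, which carries neither packet, would always come out as `G0_LATIN`, whereas the TVC sample would yield whatever set its designation packet names.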