Cyrillic support. Missing or broken. #117


claunia · 2026-01-29T16:35:42Z

claunia commented

2026-01-29 16:35:42 +00:00

Originally created by @cfsmp3 on GitHub (Feb 20, 2016).

When processing a Russian TV capture we're getting Western characters instead of Cyrillic. I see we need Cyrillic G0 Primary Set - Option 2 - Russian/Bulgarian -- how does that get turned on?

The first test file is available here:

http://vrnewsscape.ucla.edu/dropbox/2016-02-09_1800_RU_TVC_Test.mpg
http://vrnewsscape.ucla.edu/dropbox/2016-02-09_1800_RU_TVC_Test.txt

VLC plays the teletext just fine

![cyrilic1](https://cloud.githubusercontent.com/assets/5949913/13196200/03c11a94-d7c8-11e5-8a76-0c0273281e34.png)
![cyrilic2](https://cloud.githubusercontent.com/assets/5949913/13196201/0a7470a2-d7c8-11e5-9d3a-28b82e075bde.png)

claunia closed this issue

2026-01-29 16:35:42 +00:00

claunia commented

2026-01-29 16:35:43 +00:00

@bigharshrag commented on GitHub (Mar 13, 2016):

There is no support yet for Greek, Arabic, and Hebrew, nor for accented character sets (G2 character sets) for languages other than Latin-based ones.
I'd be willing to add the support, but I do not have any test files for them.


claunia commented

2026-01-29 16:35:43 +00:00

@cfsmp3 commented on GitHub (Mar 13, 2016):

I don't think we have any of those either, but we'll try to find some.

Are you applying to GSoC?



claunia commented

2026-01-29 16:35:43 +00:00

@bigharshrag commented on GitHub (Mar 13, 2016):

Yes, I am indeed applying for GSoC.
Should I go ahead and try to add the above-mentioned character set support without files to test on?


claunia commented

2026-01-29 16:35:44 +00:00

@cfsmp3 commented on GitHub (Mar 13, 2016):

Sure, it's better to have theoretical implementations than nothing :-) That
way when we actually get samples we'll have a good starting point.



claunia commented

2026-01-29 16:35:44 +00:00

@ghost commented on GitHub (Mar 13, 2016):

The patch looks like it's working perfectly for Russian TVC -- I'm just waiting to get that confirmed by a native speaker. The other Russian channel we sampled is 1TV, and there we're still getting Latin1. Please see http://vrnewsscape.ucla.edu/dropbox/2016-02-09_1900_RU_1TV_Test.mpg


claunia commented

2026-01-29 16:35:44 +00:00

@bigharshrag commented on GitHub (Mar 15, 2016):

@littleredhen Could you please tell me if a video player like VLC produces the correct subtitles? I could not get VLC to do so.


claunia commented

2026-01-29 16:35:44 +00:00

@abhishek-vinjamoori commented on GitHub (Mar 15, 2016):

@bigharshrag, please check my comment on your PR.


claunia commented

2026-01-29 16:35:44 +00:00

@ghost commented on GitHub (Mar 18, 2016):

I haven't had a chance to play this file in VLC, but I can confirm that CCExtractor is able to extract text from the file. However, this text is still in a Latin character set. Are you not able to get text out with CCExtractor?


claunia commented

2026-01-29 16:35:45 +00:00

@abhishek-vinjamoori commented on GitHub (Mar 18, 2016):

@littleredhen, this issue was updated again today with bug fixes. Please check that the latest code is being used, although some problems are still not solved.


claunia commented

2026-01-29 16:35:45 +00:00

@bigharshrag commented on GitHub (Mar 18, 2016):

@littleredhen I can confirm that CCExtractor does extract subtitles from the file and that they use the Latin character set. I looked into the file and found that it never sends the packet that tells CCExtractor which character set to use (the X/28 or M/29 packet). CCExtractor therefore falls back to the Latin character set, which the ETS 300 706 documentation specifies as the default in such a scenario. The TVC file provided earlier did contain this data.

Hence I was wondering whether you could in fact get any video player to display the correct subtitles in Cyrillic script.

Also, the new bug fixes don't fix this problem.
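
To illustrate the fallback described above, here is a minimal Python sketch. It is not CCExtractor's actual code: the function name, the string labels for the character sets, and the assumption that a page-level X/28 designation takes precedence over a magazine-level M/29 one are all illustrative and should be checked against ETS 300 706.

```python
from typing import Optional

# Fallback used when the stream carries no designation packet at all.
DEFAULT_G0 = "latin"


def select_g0_charset(x28: Optional[str], m29: Optional[str]) -> str:
    """Pick a G0 set: page-level X/28 first, then magazine-level M/29, else the default.

    Both arguments are hypothetical, already-decoded designations
    (None means the packet was never received).
    """
    if x28 is not None:
        return x28
    if m29 is not None:
        return m29
    return DEFAULT_G0


# TVC-like case: some designation packet is present (which one is an assumption here).
print(select_g0_charset(None, "cyrillic-2"))
# 1TV-like case: neither packet is present, so the decoder falls back to Latin.
print(select_g0_charset(None, None))
```

With this logic, a stream like the 1TV sample can only come out as Latin, since nothing in the stream tells the decoder to use Cyrillic.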


claunia referenced this issue

2026-01-29 16:58:17 +00:00

[PR #117] [MERGED] Bugfix for output filename fix #1014


Reference: starred/ccextractor#117