Ver 0.85 CEA-708: 16 bit charset (Korean) Not support #277

Closed
opened 2026-01-29 16:39:43 +00:00 by claunia · 19 comments

Originally created by @gkehstn on GitHub (Feb 17, 2017).

Originally assigned to: @PunitLodha on GitHub.

0.78 (2015-12-12)
  - CEA-708: 16 bit charset support (tested on Korean).
0.84: test result normal.
0.85: not supported.

  • See issue #286.
claunia added the CEA-708, difficulty: medium, and GSoC-related labels 2026-01-29 16:39:43 +00:00

@cfsmp3 commented on GitHub (Feb 18, 2017):

GSoC qualification: 2 points


@Izaron commented on GitHub (Feb 19, 2017):

Well, I changed this part of the code because I got wrong output on many videos.
Link 1 (before my changes) - https://gist.github.com/Izaron/34136a8ec8216469c3c3828acdfbe53e
Link 2 (my change) - https://github.com/CCExtractor/ccextractor/pull/623/commits/d60baf18953f1501e2d450fd7e97406cd9624c58
Link 3 (after my changes - absolutely correct) - https://gist.github.com/Izaron/44c030eae8c6c1049ae6d3e6c6d0dd32

Can you please attach the video file that produces the wrong text? If it worked correctly in 0.84 and doesn't work in 0.85, I will try to fix this error.


@HaneolLee commented on GitHub (Feb 20, 2017):

  1. When I run it in version 0.84, the Korean output is good.
    link 1 : https://drive.google.com/open?id=0BxFzM3fSXVOiZEo2R1E4MEFFY1U

  2. When I run it in version 0.85, I do not see any Korean.
    link 2 : https://drive.google.com/open?id=0BxFzM3fSXVOiSnVBZkc4RlBzVkE

Both were run with the same options.

  3. I uploaded the tested video file.
    https://drive.google.com/open?id=0BxFzM3fSXVOiV3hUTnVoVVRjeDg

@Izaron commented on GitHub (Feb 20, 2017):

I wrote a patch.
Remember that you should call it as "ccextractor <file> -svc all[EUC-KR]" or so.
Resulting file - https://paste.fedoraproject.org/paste/imMCT5qPdsAk8TlL8Qa35V5M1UNdIGYhyRLivL9gydE=/raw

> See issue # 286

Yes, that's bad... I guess I'll wait for a new GSoC student to come and fix it 😄


@unicode45 commented on GitHub (Dec 25, 2017):

Version 0.85 still cannot extract Korean characters properly.

I've attached sample SRT files generated from the samples below.
https://drive.google.com/drive/folders/0B_61ywKPmI0TZU00VjRYWENfYjg

Files whose names start with Ver079 are correct. 0.85 produces broken characters for everything except ASCII characters.
cea708.zip (https://github.com/CCExtractor/ccextractor/files/1585153/cea708.zip)


@ghost commented on GitHub (Dec 26, 2017):

Further regressions since 0.85. Using mbc.ts linked above, I get
00:00:01,234 --> 00:00:01,368
җס, Ѩ½ ֧ٮLߺյԄ.
instead of
00:00:01,601 --> 00:00:01,735
뇗랡, 냨쇽 뚧릮샌뻺듵도.

This is caused by using write_utf16_char instead of utf16_to_utf8 in https://github.com/CCExtractor/ccextractor/commit/29180a95b17996f64d2107d8adcb8d773d150921

Attempting fix now.


@ghost commented on GitHub (Dec 26, 2017):

....mate, I don't know how Korean encoding works, but I'm not getting Korean in the previous versions either.

Here's a byte-by-byte comparison of the 0.85 and 0.84 output, respectively:

EB 87 97 EB 9E A1 2C 20 EB 83 A8 EC 87 BD 20 EB 9A A7 EB A6 AE EC 83 8C EB BB BA EB 93 B5 EB 8F 84 2E

B1 D7 B7 A1 2C 20 B0 E8 C1 FD 20 B6 A7 B9 AE C0 CC BE FA B4 F5 B3 C4

0.84 literally does not produce valid Unicode characters, so either the 0.85 change was actually a fix (I doubt it, since 0.85 produces completely illegible strings of random syllables) or the 0.84 output is in some encoding other than Unicode. Can someone confirm exactly which encoding Korean 708 subs use: EUC-KR, UTF-16, or something else?
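
The two dumps above can be compared with a short sketch. It rests on an assumption inferred from the bytes, not on anything confirmed in the thread: that 0.84 passed raw two-byte EUC-KR codes through unchanged, while 0.85 read each two-byte code as a single 16-bit Unicode code unit and wrote it out as UTF-8. The names `OUT_085`, `OUT_084` and `undo_utf16_misread` are illustrative only.

```python
# Byte dumps copied from the comment above.
OUT_085 = bytes.fromhex(
    "EB 87 97 EB 9E A1 2C 20 EB 83 A8 EC 87 BD 20 EB 9A A7 "
    "EB A6 AE EC 83 8C EB BB BA EB 93 B5 EB 8F 84 2E"
)
OUT_084 = bytes.fromhex(
    "B1 D7 B7 A1 2C 20 B0 E8 C1 FD 20 B6 A7 B9 AE C0 CC BE FA B4 F5 B3 C4"
)


def undo_utf16_misread(utf8_bytes: bytes) -> bytes:
    """Turn each non-ASCII code point back into the two bytes it was built from.

    Assumption: every non-ASCII character came from one big-endian byte pair.
    """
    raw = bytearray()
    for ch in utf8_bytes.decode("utf-8"):
        code_point = ord(ch)
        if code_point < 0x80:
            raw.append(code_point)                 # ASCII passes through as one byte
        else:
            raw += code_point.to_bytes(2, "big")   # e.g. U+B1D7 -> B1 D7
    return bytes(raw)


recovered = undo_utf16_misread(OUT_085)

# If the assumption holds, the recovered bytes begin with the 0.84 dump ...
print(recovered.startswith(OUT_084))
# ... and decode cleanly as EUC-KR.
print(recovered.decode("euc_kr"))
```

If the first line prints `True`, the 0.84 and 0.85 outputs carry the same underlying byte pairs and differ only in how those pairs were interpreted.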


@unicode45 commented on GitHub (Dec 26, 2017):

EUC-KR is the common case, but both Unicode and EUC-KR can be used.
You can tell which encoding is in use by checking the Caption Service Descriptor in the PMT.
If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (1 means EUC-KR).
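
The rule described here can be written down as a small decision function. This is only a sketch of the decision itself: it assumes the language and korean_code values have already been parsed out of the descriptor (the parsing is not shown and the descriptor layout is not assumed), and the function name and return labels are made up for illustration.

```python
from typing import Optional


def korean_caption_encoding(language: str, korean_code: int) -> Optional[str]:
    """Return "unicode" or "euc-kr" per the rule above, or None if it does not apply.

    `language` and `korean_code` are assumed to be already-parsed descriptor fields.
    """
    if language not in ("kor", "KOR"):
        return None          # rule only covers Korean services
    if korean_code == 0:
        return "unicode"
    if korean_code == 1:
        return "euc-kr"
    return None              # unexpected value: leave the decision to the caller


print(korean_caption_encoding("kor", 1))
print(korean_caption_encoding("eng", 0))
```

Returning `None` for anything outside the stated rule keeps the function from guessing; a caller could then fall back to a user-supplied charset.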


@ghost commented on GitHub (Dec 26, 2017):

OK, I'm fairly sure we don't have EUC-KR support, and that definitely wasn't EUC-KR since it was in Notepad of all things, so I'm just as stumped here. I guess I'll work on EUC-KR support (there's a cool free lib for that), but other than that I'm stumped: none of these outputs are legible, and I have no idea what encoding @HaneolLee used to get that output on 0.84.


@unicode45 commented on GitHub (Dec 26, 2017):

I think ccextractor requires iconv (libiconv) for this. I was able to convert it by adding "-svc all[EUC-KR]".


@ghost commented on GitHub (Dec 27, 2017):

Confirming: on the latest builds, both samples linked by @unicode45 convert successfully if I add -svc all[EUC-KR].

mystery solved
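
For anyone scripting this, the invocation quoted in this thread can be assembled as an argument list. This is a minimal sketch: the helper name is hypothetical, only the `-svc all[EUC-KR]` form mentioned above is used, and the command is built and printed, not executed.

```python
import shlex
from typing import List


def build_ccextractor_command(input_path: str,
                              charset: str = "EUC-KR",
                              executable: str = "ccextractor") -> List[str]:
    """Build the argument list for the '-svc all[<charset>]' form quoted in this thread."""
    return [executable, input_path, "-svc", f"all[{charset}]"]


cmd = build_ccextractor_command("mbc.ts")
# shlex.join quotes the bracketed argument, which matters in shells
# that treat square brackets as glob patterns.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting altogether, since no shell is involved.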


@cfsmp3 commented on GitHub (Dec 27, 2017):

Can we make it work without the user passing EUC-KR? (i.e. detect the correct encoding ourselves)


@ghost commented on GitHub (Dec 27, 2017):

Doesn't seem possible: valid EUC-KR characters are also valid Unicode characters, and I reckon it would be very hard to tell automatically what the correct encoding is.
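
A quick way to see the ambiguity is to decode one byte pair from the dumps earlier in this thread both ways. This is a sketch, not a proof that detection is impossible: it assumes "Unicode" here means a big-endian 16-bit code unit, and the helper name is illustrative.

```python
def is_hangul_syllable(ch: str) -> bool:
    """True if ch lies in the Unicode Hangul Syllables block (U+AC00..U+D7A3)."""
    return 0xAC00 <= ord(ch) <= 0xD7A3


pair = bytes.fromhex("B1D7")            # first two bytes of the 0.84 dump

as_euc_kr = pair.decode("euc_kr")       # read the pair as one EUC-KR character
as_utf16 = pair.decode("utf-16-be")     # read the same pair as one UTF-16 code unit

# Both readings yield a well-formed Hangul syllable, just not the same one.
print(as_euc_kr, is_hangul_syllable(as_euc_kr))
print(as_utf16, is_hangul_syllable(as_utf16))
```

Because both results are legitimate Hangul, a validity check alone cannot pick the encoding; some outside signal is needed.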


@cfsmp3 commented on GitHub (Dec 28, 2017):

@gray-v did you read this?

> EUC-KR is the common case, but both Unicode and EUC-KR can be used.
> You can tell which encoding is in use by checking the Caption Service Descriptor in the PMT.
> If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (1 means EUC-KR).


@thetransformerr commented on GitHub (Jul 9, 2018):

Hi all, @unicode45, @cfsmp3

I tested with -svc all and it worked fine. However, regarding the suggestion above:

> EUC-KR is the common case, but both Unicode and EUC-KR can be used.
> You can tell which encoding is in use by checking the Caption Service Descriptor in the PMT.
> If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (1 means EUC-KR).

I cannot find any entry for, or reference to, such a field in the PMT, either in the code or in the PMT standard (ISO 13818, table 2.24), though I may have missed it. Could anyone please point out where I can find references that would make the above changes possible?
All I could determine is that PMTs are used to store the program information guide, and that the table location can be defined for each service in the PAT, but ISO 13818 recommends it as 0x0002.

The following lines from the code look related, but I can't work out how to modify them:

https://github.com/CCExtractor/ccextractor/blob/25a8b53ff55f904f29e4810bdaedd4f154567677/src/lib_ccx/ts_tables.c#L94

Please point out what I am missing.


@unicode45 commented on GitHub (Jul 9, 2018):

Hi, @thetransformerr

I've found some information, but I'm sorry, it's written in Korean (Google Translate should help).
http://www.nl.go.kr/app/nl/search/common/download.jsp?file_id=FILE-00008442489

Here's a summary of the parts related to the PMT.

  • Page No.25, Chapter B.1
    "PMT is an optional value."
    (I think that's the reason you could not find the PMT.)

  • Page No.25, Chapter B.2 to Page No.28
    Describes the caption service descriptor.

  • Page No.28, Chapter B.3
    "DTVCC Default Mode in Korea: if DTVCC subtitle data exists in the DTVCC transmission channel but neither the PMT nor the EIT has a caption service descriptor, it is treated as Service 1 and EUC-KR."
    (So, if you cannot find any PMT information for it, please treat it as Service 1 and EUC-KR.)

In my experience so far, I have not come across any Korean subtitles encoded in Unicode.
I hope this helps.
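
The default mode from Chapter B.3 amounts to a simple fallback, sketched below. Everything here is hypothetical naming: the function, the `(service number, encoding)` tuples, and in particular the `korean_broadcast` flag, since how a program would know that a stream is a Korean broadcast is not covered by the summary above.

```python
from typing import List, Tuple

Service = Tuple[int, str]                # (service number, encoding label)

DEFAULT_KOREAN_SERVICE: Service = (1, "euc-kr")


def services_to_decode(declared: List[Service], korean_broadcast: bool) -> List[Service]:
    """Prefer declared caption services; otherwise apply the B.3 default for Korea."""
    if declared:
        return list(declared)            # a caption service descriptor was found
    if korean_broadcast:
        return [DEFAULT_KOREAN_SERVICE]  # no descriptor in PMT or EIT: Service 1, EUC-KR
    return []                            # nothing declared and no default applies


print(services_to_decode([], korean_broadcast=True))
print(services_to_decode([(2, "unicode")], korean_broadcast=True))
```

Keeping the Korean default behind an explicit flag means streams in other languages are not affected by it.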


@thetransformerr commented on GitHub (Jul 9, 2018):

Hey @unicode45,

Thanks very much for your reply and help. Given that we can extract Korean with -svc, wouldn't it be useful to make svc 1, EUC-KR the default?
In case of failure, the user can specify Unicode manually.


@unicode45 commented on GitHub (Jul 9, 2018):

Hi, @thetransformerr

> Wouldn't it be useful to make svc 1, EUC-KR the default?

Yes, I think so, because in my several years of experience all broadcasts have been svc 1, EUC-KR.

Thanks,


@PunitLodha commented on GitHub (Nov 24, 2021):

> Wouldn't it be useful to make svc 1, EUC-KR the default?

We cannot default to EUC-KR for all videos, since they come in many languages, not just Korean.
I think the best solution here is to pass the EUC-KR parameter manually.

Same as #286


Reference: starred/ccextractor#277