mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-17 05:25:33 +00:00
Ver 0.85 CEA-708: 16 bit charset (Korean) Not support #277
Originally created by @gkehstn on GitHub (Feb 17, 2017).
Originally assigned to: @PunitLodha on GitHub.
0.78 (2015-12-12)
- CEA-708: 16 bit charset support (tested on Korean).
0.84: test result is normal.
0.85: not supported.
@cfsmp3 commented on GitHub (Feb 18, 2017):
GSoC qualification: 2 points
@Izaron commented on GitHub (Feb 19, 2017):
Well, I changed this part of the code because I got wrong output in many videos.
Link 1 (before my changes) - https://gist.github.com/Izaron/34136a8ec8216469c3c3828acdfbe53e
Link 2 (my change) - d60baf1895
Link 3 (after my changes - absolutely correct) - https://gist.github.com/Izaron/44c030eae8c6c1049ae6d3e6c6d0dd32
Can you please attach the video file that produces the wrong text? If this worked correctly in 0.84 and doesn't work in 0.85, I will try to fix the error.
@HaneolLee commented on GitHub (Feb 20, 2017):
When I run it with version 0.84, the Korean output is correct.
link 1 : https://drive.google.com/open?id=0BxFzM3fSXVOiZEo2R1E4MEFFY1U
When I run it with version 0.85, I do not see any Korean.
link 2 : https://drive.google.com/open?id=0BxFzM3fSXVOiSnVBZkc4RlBzVkE
All runs used the same options.
https://drive.google.com/open?id=0BxFzM3fSXVOiV3hUTnVoVVRjeDg
@Izaron commented on GitHub (Feb 20, 2017):
I wrote a patch
Remember to call it as "ccextractor -svc all[EUC-KR]" or so.
Resulting file - https://paste.fedoraproject.org/paste/imMCT5qPdsAk8TlL8Qa35V5M1UNdIGYhyRLivL9gydE=/raw
Yes, that's bad... I can only say I'm waiting for a new GSoC student to come and fix it 😄
@unicode45 commented on GitHub (Dec 25, 2017):
Version 0.85 still cannot extract proper Korean characters.
I've attached sample srt files generated from the samples below.
https://drive.google.com/drive/folders/0B_61ywKPmI0TZU00VjRYWENfYjg
Files whose names start with Ver079 are correct. 0.85 produces broken characters for everything except ASCII characters.
cea708.zip
@ghost commented on GitHub (Dec 26, 2017):
Further regressions since 0.85: using mbc.ts linked above, I get
00:00:01,234 --> 00:00:01,368
җס, Ѩ½ ֧ٮLߺյԄ.
instead of
00:00:01,601 --> 00:00:01,735
뇗랡, 냨쇽 뚧릮샌뻺듵도.
This is caused by using write_utf16_char instead of utf16_to_utf8 in 29180a95b1.
Attempting fix now.
@ghost commented on GitHub (Dec 26, 2017):
...mate, I don't know how Korean encoding works, but I'm not getting Korean in the previous versions either.
Here's a byte-by-byte comparison of the 0.85 and 0.84 output, respectively:
EB 87 97 EB 9E A1 2C 20 EB 83 A8 EC 87 BD 20 EB 9A A7 EB A6 AE EC 83 8C EB BB BA EB 93 B5 EB 8F 84 2E
B1 D7 B7 A1 2C 20 B0 E8 C1 FD 20 B6 A7 B9 AE C0 CC BE FA B4 F5 B3 C4
The 0.84 output is literally not valid Unicode, so either the 0.85 change was actually a fix (I doubt it, since 0.85 produces completely illegible strings of random words) or the captions use some encoding other than Unicode. Can someone confirm which encoding Korean 708 subs actually use: EUC-KR, UTF-16, or something else?
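One way to read the two byte dumps above, as a rough sketch rather than anything taken from the CCExtractor source: if every two-byte sequence in the 0.84 bytes is treated as a single 16-bit code point and written out as UTF-8, the result is the 0.85 bytes (minus the trailing 2E, which only the 0.85 dump contains). The helper name `pairs_as_utf16` is made up for illustration. The same 0.84 bytes also decode without error using Python's built-in `euc_kr` codec.

```python
# Byte dumps copied from the comparison above.
V085 = bytes.fromhex(
    "EB 87 97 EB 9E A1 2C 20 EB 83 A8 EC 87 BD 20 EB 9A A7 EB A6 AE "
    "EC 83 8C EB BB BA EB 93 B5 EB 8F 84 2E"
)
V084 = bytes.fromhex(
    "B1 D7 B7 A1 2C 20 B0 E8 C1 FD 20 B6 A7 B9 AE C0 CC BE FA B4 F5 B3 C4"
)


def pairs_as_utf16(data: bytes) -> bytes:
    """Hypothetical helper: treat each non-ASCII byte pair as one 16-bit
    code point and re-encode the result as UTF-8."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # ASCII byte: a single character on its own.
            out.append(chr(data[i]))
            i += 1
        else:
            # Lead byte + trail byte: combine into one 16-bit value.
            out.append(chr((data[i] << 8) | data[i + 1]))
            i += 2
    return "".join(out).encode("utf-8")


# The 0.85 dump carries one extra trailing byte (2E), so compare without it.
matches = pairs_as_utf16(V084) == V085[:-1]

# The same 0.84 bytes decode cleanly with the built-in EUC-KR codec.
as_euc_kr = V084.decode("euc_kr")
```

If `matches` is true, the 0.85 output is consistent with EUC-KR byte pairs being misread as 16-bit Unicode values, which would fit the question about which encoding the stream really uses.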
@unicode45 commented on GitHub (Dec 26, 2017):
Basically, EUC-KR is the common case, but both Unicode and EUC-KR can be used.
You can find out which encoding is used by checking the Caption Service Descriptor in the PMT.
If the language field contains 'kor' or 'KOR' and the korean_code field is 0, it's Unicode (while 1 is EUC-KR).
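The rule above could be sketched roughly as follows. This assumes the `language` and `korean_code` values have already been pulled out of the caption service descriptor; the parsing itself is not shown, and the function name `encoding_from_descriptor` and the returned labels are made up for illustration, not part of CCExtractor.

```python
from typing import Optional


def encoding_from_descriptor(language: str, korean_code: int) -> Optional[str]:
    """Return "unicode" or "euc-kr" for a Korean caption service, else None.

    Both arguments are assumed to be already extracted from the caption
    service descriptor; how to extract them is outside this sketch.
    """
    if language not in ("kor", "KOR"):
        # The rule only covers services flagged as Korean.
        return None
    if korean_code == 0:
        return "unicode"
    if korean_code == 1:
        return "euc-kr"
    # Unexpected value: leave the decision to the caller.
    return None
```

Returning `None` for anything the rule does not cover keeps the sketch from guessing an encoding for non-Korean services.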
@ghost commented on GitHub (Dec 26, 2017):
OK, I'm fairly sure we don't have EUC-KR support, and that output definitely wasn't EUC-KR since it was opened in Notepad of all things, so I'm just as stumped here. I'll work on EUC-KR support, I guess; there's a cool free lib for that. Other than that I'm stumped, since none of these outputs are legible and I have no idea what encoding @HaneolLee used to get that output on 0.84.
@unicode45 commented on GitHub (Dec 26, 2017):
I think ccextractor requires iconv (libiconv) for it. I could convert it by adding "-svc all[EUC-KR]".
@ghost commented on GitHub (Dec 27, 2017):
Confirming: on the latest builds, both samples linked by @unicode45 convert successfully if I add -svc all[EUC-KR]
mystery solved
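For anyone scripting this, a minimal sketch of invoking the command from Python follows. It assumes `ccextractor` is on PATH; the helper names `build_command` and `extract` are made up for illustration. Passing the arguments as a list avoids the shell, so the square brackets in `all[EUC-KR]` cannot be interpreted as a glob pattern (in an interactive shell, quoting the argument has the same effect).

```python
import subprocess
from typing import List


def build_command(input_file: str, charset: str = "EUC-KR") -> List[str]:
    """Build the argv list; `build_command` is a hypothetical helper."""
    # A list (no shell) keeps "all[EUC-KR]" literal, brackets included.
    return ["ccextractor", "-svc", f"all[{charset}]", input_file]


def extract(input_file: str) -> int:
    """Run ccextractor (assumed to be on PATH) and return its exit code."""
    return subprocess.run(build_command(input_file)).returncode
```

For example, `extract("mbc.ts")` would run the same command discussed in this thread against the sample file.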
@cfsmp3 commented on GitHub (Dec 27, 2017):
Can we make it work without the user passing EUC-KR? (i.e. detect the correct encoding ourselves)
@ghost commented on GitHub (Dec 27, 2017):
Doesn't seem possible: valid EUC-KR characters are also valid Unicode characters, and I reckon it would be very hard to tell automatically what the correct encoding is.
@cfsmp3 commented on GitHub (Dec 28, 2017):
@gray-v did you read this?
Basically EUC-KR is common but both Unicode and EUC-KR can be used.
You can find which encoding is used by checking Caption Service Descriptor in PMT.
If language field contains 'kor' or 'KOR' and korean_code field is 0, it's unicode(while 1 is EUC-KR).
@thetransformerr commented on GitHub (Jul 9, 2018):
Hi all, @unicode45, @cfsmp3,
When I tested with -svc all, it worked fine. As for the suggestion above,
I cannot find any entry for or reference to such a field in the PMT, either in the code or in the PMT definition in ISO 13818 (table 2.24), though I may have missed it. Could anyone please point me to references that would make the above change possible?
All I could determine is that PMTs are used to store program information, and that the table location can be defined for each service in the PAT, but ISO 13818 recommends it as 0x0002.
The following lines of code look related, but I can't understand how to modify them:
25a8b53ff5/src/lib_ccx/ts_tables.c (L94)
Please point out what I am missing.
@unicode45 commented on GitHub (Jul 9, 2018):
Hi, @thetransformerr
I've found a document, but I'm sorry, it's written in Korean (Google Translate will be helpful).
http://www.nl.go.kr/app/nl/search/common/download.jsp?file_id=FILE-00008442489
Here's a summary of the parts related to the PMT.
Page No.25, Chapter B.1
"PMT is an optional value."
(I think that's the reason you could not find PMT.)
Page No.25, Chapter B.2 to Page No.28
Describes the caption service descriptor.
Page No.28, Chapter B.3
"DTVCC default mode in Korea: if DTVCC subtitle data exists in the DTVCC transmission channel but neither the PMT nor the EIT has a caption service descriptor, it will be treated as Service 1 and EUC-KR."
(So, if you cannot find any PMT information for it, please treat it as Service 1 and EUC-KR.)
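The default mode quoted above could be sketched like this. It is only an illustration under stated assumptions: `resolve_service` and `is_korean_broadcast` are hypothetical names, and nothing here detects whether a stream is a Korean broadcast; that flag has to come from the caller.

```python
from typing import Optional, Tuple

# Service 1 and EUC-KR, as stated in the quoted default mode.
KOREAN_DEFAULT: Tuple[int, str] = (1, "EUC-KR")


def resolve_service(
    signalled: Optional[Tuple[int, str]],
    is_korean_broadcast: bool,
) -> Optional[Tuple[int, str]]:
    """Pick (service number, encoding) for DTVCC captions.

    `signalled` is what a caption service descriptor announced, or None if
    neither the PMT nor the EIT carried one. `is_korean_broadcast` is a
    hypothetical caller-supplied flag; this sketch does not detect it.
    """
    if signalled is not None:
        # A descriptor was present: trust what it says.
        return signalled
    if is_korean_broadcast:
        # No descriptor: the quoted default mode applies, but only in Korea.
        return KOREAN_DEFAULT
    # No descriptor and not known to be Korean: no basis for a guess.
    return None
```

Keeping the fallback behind an explicit flag means streams in other languages are never silently decoded as EUC-KR.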
In my experience so far, I have not come across any Korean subtitles written in Unicode.
I hope this helps.
@thetransformerr commented on GitHub (Jul 9, 2018):
Hey @unicode45,
Thanks very much for your reply and help. Given that we are able to extract Korean with -svc, wouldn't it be useful to make Service 1 with EUC-KR the default?
In case of failure, the user can specify Unicode manually.
@unicode45 commented on GitHub (Jul 9, 2018):
Hi, @thetransformerr
Thanks,
@PunitLodha commented on GitHub (Nov 24, 2021):
We cannot default to EUC-KR for all videos, since they come in many different languages, not just Korean.
I think the best solution here is to just pass the EUC-KR parameter manually.
Same as #286