Mirror of https://github.com/CCExtractor/ccextractor.git (synced 2026-02-12 13:35:15 +00:00)
Issue with Russian recordings #490
Originally created by @cfsmp3 on GitHub (May 13, 2019).
This came from our friends at Red Hen:
You may remember that there is a set of Russian recordings whose captions are broadcast with non-Cyrillic characters in place of the Cyrillic text, for instance 2017-07-17_1158_RU_Первый_Новости_с_субтитрами.txt.
AP has kindly provided a mapping table for these broken Russian files, and I was able to run it on our existing dataset. However, the files also contain HTML tags, and a simple search and replace converts the characters inside those tags to Cyrillic as well, effectively destroying them. This happens to lines such as the following:
I can of course fix this (and will probably do that for our current files anyway), but still it would be great if CCExtractor was able to provide the mapped text correctly.
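As a rough illustration of the post-processing fix mentioned above, the sketch below applies a character mapping only to the text outside HTML tags, so the tags themselves survive. It is not CCExtractor code, and `PLACEHOLDER_MAP` is a made-up three-entry stand-in, not the mapping table from this issue. It also assumes the mapping is one character to one character and that tags never contain a literal `>` inside an attribute value.

```python
import re

# Hypothetical placeholder mapping for illustration only.
# This is NOT the mapping table referred to in the issue.
PLACEHOLDER_MAP = {"A": "А", "B": "Б", "i": "и"}

# Capturing group so that re.split keeps the tags in its result.
TAG_RE = re.compile(r"(<[^>]*>)")


def remap_outside_tags(line, mapping):
    """Translate characters via `mapping`, leaving <...> tags untouched."""
    table = str.maketrans(mapping)
    parts = TAG_RE.split(line)
    # With one capturing group, re.split alternates:
    # even indices are plain text, odd indices are the matched tags.
    return "".join(
        part if index % 2 else part.translate(table)
        for index, part in enumerate(parts)
    )


naive = "<i>AB</i>".translate(str.maketrans(PLACEHOLDER_MAP))
safe = remap_outside_tags("<i>AB</i>", PLACEHOLDER_MAP)
print(naive)  # <и>АБ</и>  (tag destroyed)
print(safe)   # <i>АБ</i>  (tag preserved)
```

Splitting on tags rather than parsing the HTML keeps the sketch short; files with more complex markup would probably need a real parser.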
I think you should be able to simply use the following mapping table and be good:
@thelastpolaris commented on GitHub (May 15, 2019):
@cfsmp3 Could you please provide one of the video files from this set? I would like to take a look at it to be sure that this mapping will cover all possible variants.