DVB subtitles from China #75

Closed
opened 2026-01-29 16:34:31 +00:00 by claunia · 8 comments
Owner

Originally created by @cfsmp3 on GitHub (Sep 11, 2015).

We have a nice sample in the video-samples repository from a Chinese station that comes with DVB subtitles in Chinese.

Chinese DVB/Yan Oi Tong Charity Show 2014 [Live] - High Definition Jade - 2014-10-18.ts

Anshul, can you take a preliminary look?

claunia added the difficulty: hard, OCR and DVB labels 2026-01-29 16:34:31 +00:00

@cfsmp3 commented on GitHub (Nov 28, 2016):

Link to the file:
https://drive.google.com/open?id=0B_61ywKPmI0TZFNPWmVrbGk2WDA

Alternative URL: https://sampleplatform.ccextractor.org/sample/176


@Abhinav95 commented on GitHub (Nov 30, 2016):

For students working on the Code In task, you will have to make sure that the DVB subtitle extraction system is working and that you have the necessary Chinese recognition data files ('chi_sim' and 'chi_tra' for simplified and traditional Chinese, available at https://github.com/tesseract-ocr/tessdata). Place these files in the correct location (along with the English .traineddata file), then call ccextractor with the parameters '-dvblang' and '-ocrlang' (read the help screen for details on those options). Let us know how it goes!
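The setup described above can be sketched as a small helper that verifies the language data files exist before assembling the command line. The '-dvblang'/'-ocrlang' flags and the .traineddata file names come from the comment; the helper function, its parameters, and the example paths are illustrative assumptions, not part of CCExtractor.

```python
import os

def build_ccextractor_cmd(input_ts, srt_out, lang="chi_tra",
                          tessdata_dir="tessdata"):
    """Assemble the ccextractor invocation described above.

    `input_ts` and `srt_out` are placeholder paths. '-dvblang' selects
    which DVB subtitle language to extract; '-ocrlang' selects the
    Tesseract model used for OCR.
    """
    # Tesseract expects <lang>.traineddata files in its tessdata
    # directory; fail early if they are not where we expect them.
    required = (f"{lang}.traineddata", "eng.traineddata")
    missing = [f for f in required
               if not os.path.isfile(os.path.join(tessdata_dir, f))]
    if missing:
        raise FileNotFoundError(f"missing in {tessdata_dir}: {missing}")
    return ["ccextractor", input_ts,
            "-dvblang", lang, "-ocrlang", lang,
            "-o", srt_out]
```

Running the returned list through `subprocess.run` would then perform the actual extraction, assuming `ccextractor` is on the PATH.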


@ghost commented on GitHub (Dec 1, 2016):

Hey, I can read Chinese. I'll give this one a look. I can't claim the task, though, because I'm currently doing another.

Edit: oh god, 11 GB; I need to free up space on my drive.


@harrynull commented on GitHub (Dec 10, 2017):

The video is in Cantonese; the subtitles are in Traditional Chinese.

The results are unsatisfactory. Although some (~20%) of the subtitles are extracted correctly, most are just random characters. The timing is off as well, and the video itself appears to be somewhat damaged. CCExtractor crashes halfway through with -ocrlang (printing garbled output), but it works with the parameter -out=spupng. This could be a CCExtractor bug: if -out=spupng handles the video file fine, OCR should work too.

Example of bad OCR (completely irrelevant; the only thing that matches is the number of characters):
Generated: 跩鯉頤鮨嵐噩圉胸囍武蓿儡蘑意凰
Correct: 以罐頭作為主題的菜式有什麼意見
(screenshot of the subtitle frame)

Example of good OCR (although it's not completely correct):
Generated: 煎爛了嗎?那我屹掉不要浪賣
Correct: 煎爛了嗎?那我吃掉不要浪賣
(screenshot of the subtitle frame)

Generated by -out=spupng (the output is squeezed, which could be why OCR misbehaves):
(screenshot of the squeezed spupng output, sub0005)

Example of bad timing:

```
16
00:00:32,720 --> 00:00:34,079
來看看評判對第二組
```

This subtitle should start at 00:34, instead of 00:32
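If the drift above turned out to be a roughly constant offset, one post-processing workaround would be to shift every SRT timestamp after extraction. This is a hypothetical sketch, not a CCExtractor feature, and it assumes a uniform offset, which may not hold if the drift varies across the file:

```python
import re

# Matches SRT timestamps of the form hh:mm:ss,mmm
_TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def _shift_match(match, offset_ms):
    h, m, s, ms = (int(g) for g in match.groups())
    total = ((h * 60 + m) * 60 + s) * 1000 + ms + offset_ms
    total = max(total, 0)  # clamp instead of producing negative times
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def shift_srt(text, offset_ms):
    """Shift every hh:mm:ss,mmm timestamp in SRT text by offset_ms."""
    return _TS.sub(lambda m: _shift_match(m, offset_ms), text)
```

For the cue above, `shift_srt(srt_text, 1280)` would move the 00:00:32,720 start to 00:00:34,000, close to where the subtitle should appear.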


@RaXorX commented on GitHub (Nov 19, 2019):

@Abhinav95 What is the correct location for the traineddata files? I am running Tesseract v4.1.0 (5.0.0 was still in alpha, so I wasn't sure whether it was a good idea to use it).
Anyway, I am on Windows 10 x64, the Tesseract installation folder is in my PATH variable, and I have the required traineddata files downloaded into the tessdata folder inside the installation path.

I'm not sure what I am doing wrong, but ccextractor just reports that it can't find the traineddata files. eng.traineddata was detected normally after I created a tessdata folder inside the CCExtractor folder, but apparently the other files aren't detected.
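One way to diagnose this kind of problem is to check every candidate tessdata location for the expected files before running ccextractor. A minimal sketch, assuming Tesseract resolves `<lang>.traineddata` via the TESSDATA_PREFIX environment variable; the directory names here are illustrative, not a statement of where CCExtractor actually looks:

```python
import os

def find_missing_traineddata(langs, search_dirs):
    """Return {lang: filename} for every <lang>.traineddata that is
    absent from all of the candidate tessdata directories."""
    missing = {}
    for lang in langs:
        fname = f"{lang}.traineddata"
        if not any(os.path.isfile(os.path.join(d, fname))
                   for d in search_dirs):
            missing[lang] = fname
    return missing

# Candidate locations: $TESSDATA_PREFIX if set, plus a tessdata folder
# next to the ccextractor binary (the location that worked for
# eng.traineddata in the comment above).
dirs = [p for p in (os.environ.get("TESSDATA_PREFIX"), "tessdata") if p]
```

Calling `find_missing_traineddata(["eng", "chi_sim", "chi_tra"], dirs)` would then show at a glance which files Tesseract cannot see from those directories.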


@cfsmp3 commented on GitHub (Mar 22, 2023):

I'm going to merge all Chinese tasks here.

These two are very related so I'll be closing them:
https://github.com/CCExtractor/ccextractor/issues/1379
https://github.com/CCExtractor/ccextractor/issues/918


@esp0r commented on GitHub (Feb 20, 2024):

I recently explored the GSoC 2024 projects and came across this issue regarding DVB subtitles from China. I noticed that there have been challenges with the accuracy of Tesseract for Chinese character recognition. I'd like to suggest considering PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) as a potential alternative. PaddleOCR is a multi-language OCR toolkit that leverages deep learning, and it is actively maintained by Baidu. It has proven particularly effective for Chinese text recognition.

I have practical experience with PaddleOCR; I've successfully used it to extract text from Chinese and Japanese textbooks with high accuracy. The toolkit is user-friendly and straightforward to implement.

Would the integration of PaddleOCR be something the team is willing to consider? Please let me know your thoughts on this proposal.


@cfsmp3 commented on GitHub (Feb 20, 2024):

Sure. I've never heard of it personally, but that's precisely the problem: we don't have any knowledge of the Chinese ecosystem.

Reference: starred/ccextractor#75