[BUG] ISDB subtitles issues with encoding of special characters #438

Closed
opened 2026-01-29 16:43:56 +00:00 by claunia · 5 comments
Owner

Originally created by @jakubvojacek on GitHub (Aug 26, 2018).

CCExtractor version (using the --version parameter preferably) : e9d2a89768f10e6d269dcd0b9245895f3899a72d

In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):

  • I have read and understood the contributors guide.
  • I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • I have checked that the issue I'm posting isn't already reported.
  • I have checked that the issue I'm posting isn't already solved and no duplicates exist in closed issues and in opened issues
  • I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
  • I have used the latest available version of CCExtractor to verify this issue exists.

My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):

  • I absolutely love CCExtractor, but have not contributed previously.

Necessary information

  • Is this a regression (did it work before)? [x] NO | [ ] YES - please specify the last known working version
  • What platform did you use? [ ] Windows - [x] Linux - [ ] Mac
  • What were the used arguments? ccextractor -datapid 0x116 -o test.vtt nsc.mp4

Video links (replace text below with your links)

nsc.mp4 - https://goo.gl/iiKTAQ

Additional information
Hello,

the issue is with Portuguese accented characters, such as á, ã, ê, .... Instead of these characters, ccextractor shows ?. Probably some issue with encoding? Am I doing something wrong or is this an issue with ccextractor? Please find below samples from the generated test.vtt file and a manually fixed comparison

4
00:00:14,421 --> 00:00:15,925
Ah, vamos l?!                   


5
00:00:15,926 --> 00:00:17,430
Que horas s?o agora?       

vs

4
00:00:14,421 --> 00:00:15,925
Ah, vamos lá!                   


5
00:00:15,926 --> 00:00:17,430
Que horas são agora?    

Thank you
Jakub


@anshul1912 commented on GitHub (Aug 29, 2018):

I checked the SRT file of the same video in the vim editor, and I was able to see the letters correctly.


@anshul1912 commented on GitHub (Aug 29, 2018):

In Notepad++, if you select the ANSI encoding, you can see those characters correctly.


@jakubvojacek commented on GitHub (Aug 29, 2018):

@anshul1912 ccextractor should be using utf8 by default, therefore no encoding change should be required I believe. Also, extracting these special characters from other subtitles sources such as dvb subtitles or CEA608 is working perfectly, the issue is only with ISDBT source.

I only have access to Linux (we're using Debian) and Mac, so I cannot try Notepad++. I tried vim and some other editors and changed the encoding, but that did not help.
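Before changing encodings in an editor, it can help to check what bytes the file really contains. The following is a minimal Python sketch, not part of ccextractor; the function names are hypothetical, and it only assumes that the output file can be read as raw bytes. A lone byte such as 0xE1 (á in Latin-1) is not valid UTF-8, which would explain why a UTF-8 editor shows a placeholder instead.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def non_ascii_bytes(data: bytes) -> list:
    """Return the sorted, distinct byte values above 0x7F found in data."""
    return sorted({b for b in data if b >= 0x80})


if __name__ == "__main__":
    # Hypothetical sample: "Ah, vamos lá!" with á stored as the single byte 0xE1.
    sample = b"Ah, vamos l\xe1!"
    print(is_valid_utf8(sample))
    print([hex(b) for b in non_ascii_bytes(sample)])
```

For a real file, the bytes could be read with `open("test.vtt", "rb").read()` and passed to the same functions.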

Is it possible that ccextractor might work differently on windows vs linux platform? Since you can see the characters properly and I cannot.

Thank you


@anshul1912 commented on GitHub (Aug 29, 2018):

Hi

I was using vim on Ubuntu and notepad++ on windows.

The problem comes from the source. ISDB is expected to carry UTF-8 in its data,
but ANSI is actually present, so editors do not interpret the characters
correctly.

You can also use iconv to convert the file encoding.
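As a rough illustration of what such a conversion does, here is a small Python sketch. It assumes that "ANSI" here means ISO-8859-1 (Latin-1); that is a guess, since the thread does not name the exact code page, and the function name is hypothetical.

```python
def to_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode bytes from an assumed single-byte encoding to UTF-8."""
    # Decode with the assumed source encoding, then encode as UTF-8.
    return data.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    # "Que horas são agora?" with ã stored as the single byte 0xE3.
    raw = b"Que horas s\xe3o agora?"
    print(to_utf8(raw).decode("utf-8"))
```

With the same assumption about the source encoding, the iconv equivalent would be along the lines of `iconv -f ISO-8859-1 -t UTF-8 test.vtt > test.utf8.vtt`, where the output file name is only an example.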

Thanks
Anshul



@jakubvojacek commented on GitHub (Aug 29, 2018):

Thank you, using iconv I was able to fix the encoding.


Reference: starred/ccextractor#438