[CEA-708] Is there a way to extract subtitles #271

Closed
opened 2026-01-29 16:39:28 +00:00 by claunia · 16 comments
Owner

Originally created by @fsh240led on GitHub (Feb 2, 2017).

Other files are extracted without problems,
but no subtitles are extracted from the video below:
https://docs.google.com/uc?id=0B42jO8cBUe6baEtqLTNkRnExbVk&export=download

PotPlayer displays subtitles
http://i.imgur.com/SX9KApM.png


@cfsmp3 commented on GitHub (Feb 7, 2017):

Are you sure you aren't using an external subtitle file? I just installed
PotPlayer for this and it says there are no subtitles available.

On Wed, Feb 1, 2017 at 10:42 PM, fsh240led notifications@github.com wrote:

The others are extracted without problems
But this is not an extraction
https://docs.google.com/uc?id=0B42jO8cBUe6baEtqLTNkRnExbVk&export=download

PotPlayer displays subtitles
http://i.imgur.com/SX9KApM.png



@fsh240led commented on GitHub (Feb 8, 2017):

A setting has to be enabled in Preferences:

Preferences(F5) -> Filter Control -> Video Decoder -> Built-in codec/DXVA settings -> Enable Closed Captioning (Check)


@cfsmp3 commented on GitHub (Feb 8, 2017):

I'm probably missing some fonts; this is what I see after enabling CC.

Screenshot (sbsplus): https://cloud.githubusercontent.com/assets/5949913/22720320/e8414a66-ed5e-11e6-81e9-34209d27298e.png

@fsh240led commented on GitHub (Feb 8, 2017):

Korean font download link:
http://ko.cooltext.com/Fonts-Unicode-Korean


@cfsmp3 commented on GitHub (Feb 9, 2017):

Assigning to @Izaron since he was the last one looking into Korean.


@cfsmp3 commented on GitHub (Feb 9, 2017):

GSoC qualification: 3 points.


@Izaron commented on GitHub (Feb 9, 2017):

@fsh240led Is this output at 00:09 from another video player also incorrect compared with PotPlayer? Can you confirm?

Screenshot (default): https://cloud.githubusercontent.com/assets/5406399/22783469/034ae50e-eedd-11e6-8fcb-bd1b77a0baad.png

@cfsmp3 and @AlexBratosin2001 (since you're the PTS man)
The problem is here: https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/stream_functions.c#L307
Can you give me the specs for this binary stream when (nextheader[7]&0xC0) == 0xC0? Maybe it's just a problem with the hardcoded hexadecimal constants.
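
For context on the check quoted above: in an MPEG-2 PES header, the byte at offset 7 (counting from the 00 00 01 start code) carries the two-bit PTS_DTS_flags field in its top bits, so a masked value of 0xC0 means that both a PTS and a DTS follow. Below is a minimal sketch of that test. It assumes the buffer starts at the start code, which is how nextheader appears to be used in stream_functions.c; the enum and function names are made up for illustration and are not ccextractor identifiers.

```c
#include <stdint.h>

/* Hypothetical names; only the bit layout comes from the MPEG-2 PES header. */
enum pts_dts_flags {
    PES_NO_TIMESTAMPS = 0, /* '00': neither PTS nor DTS present */
    PES_FORBIDDEN     = 1, /* '01': forbidden value */
    PES_PTS_ONLY      = 2, /* '10': PTS only */
    PES_PTS_AND_DTS   = 3  /* '11': PTS followed by DTS */
};

/* header must point at the 00 00 01 start code and hold at least 8 bytes
 * (assumption about the buffer layout, see the note above). */
static enum pts_dts_flags pes_pts_dts_flags(const uint8_t *header)
{
    /* Top two bits of byte 7; (header[7] & 0xC0) == 0xC0 is the '11' case. */
    return (enum pts_dts_flags)((header[7] & 0xC0) >> 6);
}
```

With this layout, 0xC0 and 0x80 are the only masked values that announce timestamps, so the constants themselves look plausible if the buffer really starts at the start code.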


@fsh240led commented on GitHub (Feb 9, 2017):

Those are unknown characters.
It should be output as:
http://i.imgur.com/SX9KApM.png

@mahalwal commented on GitHub (Dec 21, 2017):

@fsh240led Can you tell me which font from this link you are using? ( http://ko.cooltext.com/Fonts-Unicode-Korean ) I tried Batang(che) and Gungsuh(che), and they still give random characters.


@MatejMecka commented on GitHub (Jan 3, 2018):

@fsh240led Can you tell me which command you used to get these subtitles?


@navimakarov commented on GitHub (Dec 11, 2018):

The problem was that processing this file failed with the error "Window has to be defined" because decoder->current_window == -1:
https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708.c#L1375
I found that ccx_decoders_708.c contains a condition that should be impossible according to the author's comment, but for this file it evaluates to true. That broke caption extraction and left ret = 0, which is reported as the "No captions found in input" error.
Here is the problem:
https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708.c#L1728
The obvious fix is to change:
_dtvcc_decoders_reset(dtvcc);
return;
to
dtvcc->current_packet_length = len;

  • Note: I think it would also be better to delete the comment ("Is this possible"), because as we can see, it is possible.

After these changes, ccextractor is able to extract captions from this file.


@cfsmp3 commented on GitHub (Dec 11, 2018):

OK so it's good research here... but let's not be happy with that, we need to find out what's actually going on.

if (dtvcc->current_packet_length != len) // Is this possible?

Here, dtvcc->current_packet_length is how much data we have collected for that packet.
len is the packet length according to the packet header.

So we have 3 possibilities:
a) We have more data than the declared packet length. If yes - what's that data, where did it come from, what is it for? Can we ignore it?
b) We have LESS data than the declared packet length. This is really not good, we can't process the packet at all (we would read out of bounds)
c) They match, which is the expected thing, but we know that's not the case here.

So first - let's check if it's a or b.
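
The three cases above can be told apart with a plain comparison. The following is only a sketch of that check, with hypothetical names that do not exist in ccextractor; collected stands for dtvcc->current_packet_length and declared for len.

```c
#include <stddef.h>

/* Hypothetical helper, not part of ccextractor. */
enum length_check {
    LENGTH_MATCH,    /* (c) collected == declared */
    LENGTH_SURPLUS,  /* (a) more data collected than declared */
    LENGTH_SHORTFALL /* (b) less data than declared; parsing would read out of bounds */
};

static enum length_check compare_packet_length(size_t collected, size_t declared)
{
    if (collected > declared)
        return LENGTH_SURPLUS;
    if (collected < declared)
        return LENGTH_SHORTFALL;
    return LENGTH_MATCH;
}
```

Logging the result of such a check for every packet of the sample file would show whether it is case a or case b.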


@navimakarov commented on GitHub (Dec 12, 2018):

So our case is (a): we have more data than the declared packet length. After debugging, we consistently get dtvcc->current_packet_length = len + 2 for this particular file. But before that check there is the statement len = len * 2. That would mean the value we read here: https://github.com/CCExtractor/ccextractor/blob/master/src/lib_ccx/ccx_decoders_708.c#L1712 is wrong and len would have to equal len - 1. Maybe this is a problem with a false PTS/DTS or a false PES header that we get earlier, or maybe the problem is with the hardcoded value in the line of code I linked above. I'm working on it and have no idea what this data is for. But I compared my output to PotPlayer's output and they are absolutely identical, so I think we can skip this data.


@cfsmp3 commented on GitHub (Dec 12, 2018):

That line:

int len = dtvcc->current_packet[0] & 0x3F; // 6 least significants bits

is correct. The packet header has 6 bits for the packet length and that mask gives you those 6 bits.

You may need to go over the actual specs to understand that code.
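
To make the mask concrete, here is a minimal sketch of decoding the first byte of a packet: the top two bits hold the sequence number and the low six bits hold the size code. The doubling matches the len = len * 2 step mentioned earlier in the thread. Treating a size code of 0 as 128 bytes is my reading of CEA-708 and should be checked against the spec. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the two header fields. */
struct dtvcc_header {
    int sequence; /* 2 most significant bits */
    int length;   /* packet length in bytes, header byte included */
};

static struct dtvcc_header parse_dtvcc_header(uint8_t first_byte)
{
    struct dtvcc_header h;
    int size_code = first_byte & 0x3F; /* 6 least significant bits */

    h.sequence = (first_byte & 0xC0) >> 6;
    /* Assumption: size code 0 stands for the maximum, 128 bytes. */
    h.length = (size_code == 0) ? 128 : size_code * 2;
    return h;
}
```

For example, a header byte of 0x4B has size code 11, which gives a declared length of 22 bytes.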


@PunitLodha commented on GitHub (Mar 11, 2021):

What I found is that there are 2 extra bytes (both always 0) after the len specified in the header.

Example:
Here, len is specified as 22 in the header.

| Number | cc valid | cc type | cc data1 | cc data2 |
| --- | --- | --- | --- | --- |
| .. | .. | .. | .. | .. |
| 3 | 1 | 3 | pkt header | pkt data |
| 4 | 1 | 2 | pkt data | pkt data |
| 5 | 1 | 2 | pkt data | pkt data |
| 6 | 1 | 2 | pkt data | pkt data |
| .. | .. | .. | .. | .. |
| 13 | 1 | 2 | pkt data | pkt data |
| 14 | 1 | 2 | pkt data (should be padding) | pkt data (should be padding) |
| 15 | 0 | 2 | pkt data (padding) | pkt data (padding) |
| 16 | 0 | 2 | pkt data (padding) | pkt data (padding) |
| .. | .. | .. | .. | .. |
| 20 | 0 | 2 | pkt data (padding) | pkt data (padding) |

While parsing row 13, current_packet_length also becomes 22, which means that everything after that should be padding (cc_valid = 0).
But in row 14 we have cc_valid = 1, which leads to it being parsed and current_packet_length becoming 24, i.e. current_packet_length = len + 2.
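
The arithmetic can be reproduced with a small sketch that counts two bytes for every valid triplet and skips the invalid ones. This is a simplified model of the accumulation described above, not ccextractor's code; the struct and function are hypothetical.

```c
#include <stddef.h>

/* Hypothetical, simplified view of one cc_data triplet. */
struct cc_triplet {
    int cc_valid;
    int cc_type; /* 3 = packet start, 2 = packet data */
};

/* Counts the bytes collected for one packet: every valid triplet
 * contributes its two data bytes, invalid (padding) triplets are skipped. */
static int collected_length(const struct cc_triplet *t, size_t count)
{
    int length = 0;
    for (size_t i = 0; i < count; i++) {
        if (t[i].cc_valid && (t[i].cc_type == 2 || t[i].cc_type == 3))
            length += 2;
    }
    return length;
}
```

Rows 3 to 13 are 11 valid triplets, which gives the declared 22 bytes; the extra valid triplet in row 14 raises the count to 24.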

I guess we can safely ignore the two bytes, and change,

if (dtvcc->current_packet_length != len) // Is this possible?
{
    _dtvcc_decoders_reset(dtvcc);
    return; 
}

to

if (dtvcc->current_packet_length != len) 
{
    // Extra padding data, can be ignored
    dtvcc->current_packet_length = len; 
}

Should I go ahead with the change?


@cfsmp3 commented on GitHub (Mar 11, 2021):

@PunitLodha Good research.
Well, give it a go and see if it makes things better or worse.


Reference: starred/ccextractor#271