mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-16 13:35:45 +00:00
[BUG] DVB Teletext subtitle incomplete #376
Originally created by @mkver on GitHub (Jan 28, 2018).
CCExtractor version (using the --version parameter preferably) : 0.86 (Git commit 5fa83394a0)
In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):
My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):
Necessary information
ccextractorwin.exe --gui_mode_reports -autoprogram -out=srt -bom -utf8 -tpage 150
Link to the file
Additional information
When I extract the teletext from page 150 of the sample file (that actually contains another subtitle at page 799, but that is irrelevant), the subtitles are incomplete:
The second subtitle is incomplete. Here is what the libzvbi teletext decoder inside ffmpeg produces:
It unfortunately doesn't emit colours; but it includes the "geeignet."

FYI VLC also detects the "geeignet." -- and it can show the colours:
@cfsmp3 commented on GitHub (Jan 29, 2018):
I can confirm this happens and I've looked into it a bit but I'm leaving the solution for GSoC students. Some pointers though.
Processing row: 22
0 | D |
01 | B |
02 | B |
03 | 67 | g
04 | 65 | e
05 | 65 | e
06 | 69 | i
07 | 67 | g
08 | 6E | n
09 | 65 | e
10 | 74 | t
11 | 2E | .
12 | A |
13 | 20 |
14 | 6 |
15 | B |
16 | B |
17 | 53 | S
18 | 74 | t
19 | 75 | u
20 | 66 | f
21 | 65 | e
22 | 6E | n
23 | 20 |
24 | 53 | S
25 | 69 | i
26 | 65 | e
27 | 20 |
28 | 69 | i
29 | 68 | h
30 | 6E | n
31 | 20 |
32 | 7A | z
33 | 75 | u
34 | 72 | r
35 | FC | ü
36 | 63 | c
37 | 6B | k
38 | 21 | !
39 | A |
0/B Start Box ("Set-After")
On pages with the C5 or C6 bits set (Newsflash or subtitle), this code defines (on
each appropriate row) the start of an area that is to be boxed into the normal video
picture. Characters outside this area are not displayed, but changes in display
mode, colour, height etc., will affect the boxed area. Cancelled by an End Box
code (0/A) or by the start of a new row.
NOTE: Protection against false operation is provided by double transmission
of Start Box control characters, with the action taking place between
them.
So you can see that everything to the left of the right-most '0xb' is going to be ignored.
Also note that if we just comment out that loop we'll always have col_start==40 so the line will be considered to be empty. Removing the loop is not a good solution.
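The effect described above can be reproduced with a minimal Python sketch. It is not CCExtractor's actual code: the function name `rightmost_box_text` is hypothetical, and mapping byte values straight to Latin-1 characters is a simplifying assumption. It scans row 22 from the right for a Start Box code and keeps only what follows it.

```python
START_BOX = 0x0B  # 0/B in ETS 300 706
END_BOX = 0x0A    # 0/A in ETS 300 706

# Row 22 from the dump above (40 columns).
ROW_22 = list(
    bytes([0x0D, 0x0B, 0x0B])
    + b"geeignet."
    + bytes([0x0A, 0x20, 0x06, 0x0B, 0x0B])
    + "Stufen Sie ihn zur\u00fcck!".encode("latin-1")
    + bytes([0x0A])
)


def rightmost_box_text(row):
    """Keep only the text after the right-most Start Box code (hypothetical helper)."""
    col_start = 40  # 40 means "no Start Box found", so the row counts as empty
    for col in range(39, -1, -1):
        if row[col] == START_BOX:
            col_start = col
            break
    if col_start == 40:
        return ""
    chars = []
    for code in row[col_start + 1:]:
        if code == END_BOX:
            break
        # Control codes occupy a character cell and are shown as a space.
        chars.append(chr(code) if code >= 0x20 else " ")
    return "".join(chars).strip()


print(rightmost_box_text(ROW_22))  # prints "Stufen Sie ihn zurück!"
```

The first boxed segment ("geeignet.") never reaches the output because it lies to the left of the right-most 0x0B, and a row without any 0x0B yields an empty string.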
@mkver commented on GitHub (Feb 12, 2018):
Thanks to @BPYap and @cfsmp3 for the time you have put into this. I thought it worthwhile to upload a few more samples showing this behaviour. They are here. (I have stripped all the audio and video tracks away to save space.)
@BPYap commented on GitHub (Feb 12, 2018):
Hi @mkver , thank you for more samples.
I have modified my original pull request to replace all 0xA appearing before 0xB with 0x20 characters instead of replacing only one 0xA character.
The test results are as follows:
Sample 1
The row which the extractor ignored in this sample is row 22 shown below:
0d | 02 | 0b | 0b | 6d | 61 | 6e | 20 | 64 | 61 | 7a | 75 | 20 | 6e | 6f | 63 | 68 | 3f | 0a | 0a |
05 | 0b | 0b | 22 | 4c | 65 | 74 | 7a | 74 | 65 | 72 | 20 | 57 | 69 | 6c | 6c | 65 | 2e | 22 | 0a
The characters in bold are bounded by a start box (denoted by the double 0/B characters) but are ignored by CCExtractor. According to ETS 300 706, a 0/A cancels the effect of 0/B. Hence, I replace both the 0/A and the 0/B characters if a 0/A is found within the start box.
The correct output for this row should be man dazu noch? "Letzter Wille."
===================================================================
Sample 2
Nothing seems to be detected when I run ccextractor, with or without the -teletext option, for both the original and the fixed version.
===================================================================
Sample 3
The row concerned is row 22 as shown below:
20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 0d | 06 | 0b | 0b | 50 | 61 |
70 | 61 | 21 | 20 | 20 | 20 | 20 | 07 | 0b | 0b | 50 | 61 | 70 | 61 | 21 | 0a | 0a | 20 | 20 | 20
The characters in bold are bounded by a start box (denoted by the double 0/B characters). Here no 0/A is found within the first start box area, so I take it as intended that only | 50 | 61 | 70 | 61 | 21 | 0a | 0a | 20 | 20 | 20 is extracted.
The correct output for this row would be Papa!
same as the original output above
===================================================================
Sample 4
The row concerned is row 22 as shown below:
20 | 20 | 20 | 0d | 05 | 0b | 0b | 48 | 61 | 6c | 6c | 6f | 21 | 06 | 20 | 20 | 20 | 20 | 20 | 20 |
20 | 20 | 20 | 20 | 20 | 0b | 0b | 48 | 69 | 21 | 0a | 0a | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20
The characters in bold are bounded by a start box (denoted by the double 0/B characters). Here no 0/A is found within the first start box area, so I take it as intended that only | 48 | 69 | 21 | 0a | 0a | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 is extracted.
The correct output for this row would be Hi!
original version output (extracted without any argument):

fixed version output (extracted without any argument):
same as the original output above
===================================================================

Sample 5
CCextractor is able to extract the teletext completely
===================================================================
Sample 6
The row concerned is row 22 as shown below:
20 | 20 | 20 | 20 | 20 | 20 | 20 | 0d | 02 | 0b | 0b | 57 | 61 | 73 | 3f | 0a | 0a | 07 | 0b | 0b |
53 | 65 | 69 | 20 | 6e | 69 | 63 | 68 | 74 | 20 | 73 | 6f | 20 | 6e | 61 | 69 | 76 | 21 | 0a | 0a
The correct output for this row should be Was? Sei nicht so naiv!
original version output (extracted without any argument):

fixed version output (extracted without any argument):

===================================================================
Sample 7
The row concerned is row 22 as shown below:
06 | 0b | 0b | 53 | 69 | 65 | 68 | 73 | 74 | 20 | 64 | 75 | 3f | 20 | 20 | 05 | 0b | 0b | 44 | 75 |
20 | 68 | 21 | 74 | 74 | 65 | 73 | 74 | 20 | 72 | 65 | 63 | 68 | 74 | 2e | 0a | 0a | 20 | 20 | 20
Here no 0/A is found within the first start box area, so I take it as intended that this area is ignored.
original version output (extracted without any argument):

fixed version output (extracted without any argument):
same as the original output above
I also noticed that, although captions are found in the output srt file, the extraction process for samples 1, 3, 4, 5 and 6 shows "No captions were found in input." Is this a possible bug?
@cfsmp3
Please correct me if my interpretation of the ETS 300 706 documentation is wrong. Thanks :)
@mkver commented on GitHub (Feb 12, 2018):
Once again, thank you @BPYap!
Sample 2 needs -in=ts in order to be detected as a transport stream. (Alternatively, you can delete bytes 376-544 of the file. At first, I extracted just the teletext PID and the PAT; ccextractor couldn't work with the resulting files, and it turned out that the PMT was missing. So I inserted the PMT packets manually and apparently messed it up with the second sample: the 169 bytes mentioned above are an incomplete TS packet, which results in wrong file type detection.)
libzvbi gives me "Papa! Papa!" for the third sample. Actually, these two words come from two different speakers and should be shown in two different colours; these subtitles use colour information to indicate who is speaking, and VLC shows them in the colours I expect. I'm not saying that your interpretation (that a missing 0xA indicates that parts should not be displayed) is incorrect, but this is a bit odd.
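To illustrate why a stray 169-byte fragment can confuse detection, here is a minimal Python sketch. It relies only on the general MPEG transport stream layout (188-byte packets that each start with the sync byte 0x47). The helper name `first_misaligned_offset` and the synthetic data are hypothetical, and this is not a description of how ccextractor itself detects the file type.

```python
TS_PACKET_SIZE = 188  # size of one MPEG-TS packet in bytes
SYNC_BYTE = 0x47      # first byte of every TS packet


def first_misaligned_offset(data):
    """Return the offset of the first 188-byte slot that lacks the sync byte, or None."""
    for offset in range(0, len(data), TS_PACKET_SIZE):
        if data[offset] != SYNC_BYTE:
            return offset
    return None


# Synthetic stand-in for the sample: two good packets, a 169-byte fragment, three good packets.
good_packet = bytes([SYNC_BYTE]) + bytes(TS_PACKET_SIZE - 1)
fragment = bytes(169)
broken = good_packet * 2 + fragment + good_packet * 3

print(first_misaligned_offset(broken))    # prints 376

# Deleting bytes 376-544 (inclusive) restores the 188-byte alignment.
repaired = broken[:376] + broken[545:]
print(first_misaligned_offset(repaired))  # prints None
```

Offset 376 is exactly two packets into the file, and the fragment shifts every later packet off the 188-byte grid until it is removed.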
For sample 4 libzvbi gives "Hallo! Hi!".
For sample 5 the first line from libzvbi is "Huhu, Ha-We! Claudia!".
For sample 7 the first line from libzvbi is "Siehst du? Du hattest recht.".
In all these instances, the text produced by libzvbi matches what is actually said.
@BPYap commented on GitHub (Feb 12, 2018):
Thank you @mkver for your response 😄
I ran extraction on sample 2 with -in=ts and got:
Bestimmt 1 Jahr. Vielleicht ...
Soll ich mich erkundigen?
This should be correct because here the 0/A is within the start box.
I will research deeper into startbox and endbox behavior, hopefully I can come up with a solution tomorrow 😄
@mkver commented on GitHub (Feb 12, 2018):
Yes, this agrees with libzvbi.
I have also started reading ETS 300706.
@mkver commented on GitHub (Feb 12, 2018):
The relevant part of the standard is section 12.2. It contains two important pieces of information: "Unless operating in "Hold Mosaics" mode, each character space occupied by a spacing attribute is displayed as a SPACE." So your approach to replace 0xA and 0xB with spaces is what the standard says.
And I have three interpretations concerning the boxes that are not explicitly closed:
1. The first 0xB 0xB starts a box that is not explicitly closed, hence the box reaches until the end of the row and all characters after 0xB 0xB are displayed. That there are further 0xB 0xB inside this box doesn't change that the characters between the two pairs of 0xB 0xB are intended for display.
2. The standard also says: "The action of an attribute persists until the end of a row or until the transmission of a further attribute that modifies its action." I think if there is the start of another box, then we could see this as a further attribute that modifies its action (by opening a new box). Hence the box is implicitly closed.
3. This teletext is against the spec. I doubt it, given that I haven't found anything that says that a box has to be closed before another one can be opened, and given that this comes from HR, a German public broadcaster, and that these broadcasters were very active in the development of the teletext specifications.
The first two interpretations directly imply that one should not require the presence of 0xA. The third one does not speak against that either, but in that case one should check whether not requiring 0xA has adverse effects on other samples. If there are no such samples, not requiring 0xA would be wise. If there are, the best thing would be an option that lets the user choose.
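For the visible text, the first two interpretations give the same result, which a small Python sketch can show. It is a simplified reading of section 12.2, not a full decoder: the name `boxed_text` is hypothetical, the double transmission of 0xB is not checked, colours are ignored, byte values are mapped directly to Latin-1, and runs of spaces are collapsed for readability.

```python
START_BOX = 0x0B  # 0/B
END_BOX = 0x0A    # 0/A

# Row 22 of samples 3 and 1, copied from the dumps earlier in this thread.
SAMPLE_3 = bytes.fromhex(
    "20 20 20 20 20 20 20 20 20 20 20 20 20 20 0d 06 0b 0b 50 61 "
    "70 61 21 20 20 20 20 07 0b 0b 50 61 70 61 21 0a 0a 20 20 20"
)
SAMPLE_1 = bytes.fromhex(
    "0d 02 0b 0b 6d 61 6e 20 64 61 7a 75 20 6e 6f 63 68 3f 0a 0a "
    "05 0b 0b 22 4c 65 74 7a 74 65 72 20 57 69 6c 6c 65 2e 22 0a"
)


def boxed_text(row):
    """Return the text inside boxes; a box stays open until End Box or end of row."""
    in_box = False
    cells = []
    for code in row:
        if code == START_BOX:
            in_box = True      # a repeated Start Box leaves the box open
            cells.append(" ")  # spacing attributes are displayed as a space
        elif code == END_BOX:
            in_box = False
            cells.append(" ")
        elif in_box and code >= 0x20:
            cells.append(chr(code))
        else:
            cells.append(" ")  # outside the box, or another control code
    return " ".join("".join(cells).split())


print(boxed_text(SAMPLE_3))  # prints: Papa! Papa!
print(boxed_text(SAMPLE_1))  # prints: man dazu noch? "Letzter Wille."
```

Whether a second 0xB 0xB is treated as staying inside the open box or as closing it and opening a new one, the same character cells end up boxed, so the extracted text is identical.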
@BPYap commented on GitHub (Feb 13, 2018):
I agree that the subtitle box should be implicitly closed when another 0xB is encountered.
I modified the loop so that it starts from the first column and runs until it encounters the first 0xB character (marking this index as the starting index for further processing), then replaces all subsequent 0xB and 0xA characters (except the last 0xA) with 0x20.
I then tested samples 1 to 7 again, and the results matched what libzvbi produces.
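The approach described above can be sketched in Python as follows. This is an illustration of the description, not the code from the pull request: the name `box_text_first_start` is hypothetical, bytes are mapped directly to Latin-1, and collapsing runs of spaces is an extra step added here for readability.

```python
START_BOX = 0x0B  # 0/B
END_BOX = 0x0A    # 0/A

# Row 22 of samples 6 and 4, copied from the dumps earlier in this thread.
SAMPLE_6 = bytes.fromhex(
    "20 20 20 20 20 20 20 0d 02 0b 0b 57 61 73 3f 0a 0a 07 0b 0b "
    "53 65 69 20 6e 69 63 68 74 20 73 6f 20 6e 61 69 76 21 0a 0a"
)
SAMPLE_4 = bytes.fromhex(
    "20 20 20 0d 05 0b 0b 48 61 6c 6c 6f 21 06 20 20 20 20 20 20 "
    "20 20 20 20 20 0b 0b 48 69 21 0a 0a 20 20 20 20 20 20 20 20"
)


def box_text_first_start(row):
    """Start at the first 0x0B and blank later 0x0B/0x0A codes, except the last 0x0A."""
    row = list(row)
    if START_BOX not in row:
        return ""
    col_start = row.index(START_BOX)  # first Start Box, scanning from column 0

    # Find the last End Box after the start; it marks where the text stops.
    last_end = None
    for col in range(len(row) - 1, col_start, -1):
        if row[col] == END_BOX:
            last_end = col
            break

    # Replace every other box code after the start with a space (0x20).
    for col in range(col_start + 1, len(row)):
        if row[col] in (START_BOX, END_BOX) and col != last_end:
            row[col] = 0x20

    col_stop = last_end if last_end is not None else len(row)
    cells = [chr(c) if c >= 0x20 else " " for c in row[col_start + 1:col_stop]]
    return " ".join("".join(cells).split())


print(box_text_first_start(SAMPLE_6))  # prints: Was? Sei nicht so naiv!
print(box_text_first_start(SAMPLE_4))  # prints: Hallo! Hi!
```

Both rows now yield the full line, matching the libzvbi output quoted earlier in the thread for sample 4 and the expected text given for sample 6.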