[BUG] WebVTT-Full formatted output splits multi-byte UTF-8 characters #755

Closed
opened 2026-01-29 16:52:52 +00:00 by claunia · 12 comments

Originally created by @dhouck on GitHub (Mar 25, 2023).

CCExtractor version:

CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.94
        Git commit: Unknown
        Compilation date: 2022-08-10
        CEA-708 decoder: Rust
        File SHA256: Could not open file
Libraries used by CCExtractor
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

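Since some of that is listed as unknown, Iʼll add that this is from the Arch Linux AUR, compiled as shown [here](https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=ccextractor&id=f1cf3b736614441d9373e63cb0eb8b45714493fd).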

Necessary information

  • Is this a regression (i.e. did it work before)? Unknown
  • What platform did you use? Linux
  • What were the used arguments? ccextractor test.bin -out=webvtt-full

Video links

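Input and output files are attached, both separately gzipped and in a combined zip file (for some reason GitHub wouldnʼt let me attach a TGZ despite claiming it was supported):
[test.bin.gz](https://github.com/CCExtractor/ccextractor/files/11068120/test.bin.gz)
[test.vtt.gz](https://github.com/CCExtractor/ccextractor/files/11068121/test.vtt.gz)
[test.zip](https://github.com/CCExtractor/ccextractor/files/11068123/test.zip)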

Additional information

The captions contain an italicized eighth note, with one space on either side (_` ♪ `_). The SRT output is correct (`<i> ♪ </i>`); the WebVTT-full output should contain the same line, but instead it puts only three bytes inside the `<i>` element. The UTF-8 encoding of U+266A EIGHTH NOTE is 0xE2 0x99 0xAA, so the output puts one space and the first two of the noteʼs three bytes inside the italics, and the last byte outside them. The file is therefore no longer valid UTF-8 and instead contains two invalid byte sequences.
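
To make the byte arithmetic concrete, here is a minimal Python sketch. It is not CCExtractor code and the variable names are purely illustrative; it encodes the caption text and closes the `<i>` element after three bytes, as the faulty output described above does.

```python
# Illustrative only: reproduces the byte split described in the report,
# it is not taken from CCExtractor.
text = " \u266a "                 # space, U+266A EIGHTH NOTE, space
encoded = text.encode("utf-8")    # 5 bytes: 20 E2 99 AA 20

# The broken output closes <i> after three *bytes*, which lands in the
# middle of the three-byte encoding of the note.
inside, outside = encoded[:3], encoded[3:]
broken = b"<i>" + inside + b"</i>" + outside


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(encoded.hex(" "))           # 20 e2 99 aa 20
print(is_valid_utf8(encoded))     # True
print(is_valid_utf8(broken))      # False
```

The lone 0xE2 0x99 before `</i>` and the lone 0xAA after it are the two invalid sequences mentioned in the report.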

claunia added the good-first-task label 2026-01-29 16:52:52 +00:00

@dhouck commented on GitHub (Mar 25, 2023):

I did not mean to create this issue yet; I hit the wrong keys on my keyboard and accidentally submitted it before it was ready.


@dhouck commented on GitHub (Mar 25, 2023):

There, now I actually have an issue that other people can read and attached relevant files. Sorry for creating it prematurely.


@dhouck commented on GitHub (Mar 25, 2023):

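It appears this is caused by [this loop](https://github.com/CCExtractor/ccextractor/blob/d379d72685959859db797621f270aeeb01a50021/src/lib_ccx/ccx_encoders_webvtt.c#L470-L518) reading the font events and color events per-608-grid-cell but iterating per-output-byte.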

(Also incidentally it can lead to mis-nested italics and underlines; I donʼt know how common it is for webvtt readers to be able to handle that)
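
The mismatch between per-cell attributes and per-byte iteration can be sketched in Python. This is a simplified model under stated assumptions, not the CCExtractor loop: the parallel `chars`/`italics` arrays and both function names are hypothetical and only loosely mirror a 608 grid row.

```python
# Hypothetical model of one caption row: parallel per-cell arrays.
# None of these names come from CCExtractor.
chars = [" ", "\u266a", " ", "l", "a"]
italics = [True, True, True, False, False]


def render_per_byte(chars, italics):
    """Buggy pattern: walks output bytes but indexes the per-cell array."""
    data = "".join(chars).encode("utf-8")
    out = bytearray()
    is_open = False
    for i, byte in enumerate(data):
        want = italics[i] if i < len(italics) else False
        if want and not is_open:
            out += b"<i>"
            is_open = True
        elif not want and is_open:
            out += b"</i>"
            is_open = False
        out.append(byte)
    if is_open:
        out += b"</i>"
    return bytes(out)


def render_per_cell(chars, italics):
    """Safer pattern: walks cells and encodes each cell as a whole."""
    out = bytearray()
    is_open = False
    for ch, want in zip(chars, italics):
        if want and not is_open:
            out += b"<i>"
            is_open = True
        elif not want and is_open:
            out += b"</i>"
            is_open = False
        out += ch.encode("utf-8")
    if is_open:
        out += b"</i>"
    return bytes(out)


print(render_per_byte(chars, italics))            # b'<i> \xe2\x99</i>\xaa la'
print(render_per_cell(chars, italics).decode())   # <i> ♪ </i>la
```

In the per-byte version the third byte of the output is the second byte of the note, so the closing tag lands inside the character; the per-cell version can only place tags on character boundaries.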


@dhouck commented on GitHub (Mar 25, 2023):

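Meanwhile, SRT uses [`get_decoder_line_encoded`](https://github.com/CCExtractor/ccextractor/blob/d379d72685959859db797621f270aeeb01a50021/src/lib_ccx/ccx_encoders_helpers.c#L284), which solves both the bytes-vs-characters problem and the mis-nested tags problem, but it assumes that color changes use a `<font>` tag, which is not true in WebVTT, so it canʼt just be used as-is.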
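
One way to avoid mis-nested tags is to close every open tag, innermost first, whenever the style changes and then reopen what is needed. The Python sketch below is an assumption-laden illustration, not what `get_decoder_line_encoded` does: `open_tags` and `render` are hypothetical names, and the `<c.red>` form uses WebVTT class-span syntax, with the exact class names CCExtractor emits being an assumption.

```python
def open_tags(style):
    """Return (opener, closer) pairs for a style, outermost first.

    style is (italic, underline, color_or_None); the layout is hypothetical.
    """
    italic, underline, color = style
    tags = []
    if color:
        tags.append((f"<c.{color}>", "</c>"))  # WebVTT class span, not <font>
    if italic:
        tags.append(("<i>", "</i>"))
    if underline:
        tags.append(("<u>", "</u>"))
    return tags


def render(cells):
    """cells: list of (char, style). Emits tags that are always well nested."""
    out = []
    stack = []                       # closers for currently open tags
    current = (False, False, None)
    for ch, style in cells:
        if style != current:
            while stack:
                out.append(stack.pop())   # close innermost first
            for opener, closer in open_tags(style):
                out.append(opener)
                stack.append(closer)
            current = style
        out.append(ch)
    while stack:
        out.append(stack.pop())
    return "".join(out)


cells = [
    ("a", (True, False, None)),
    ("b", (True, True, None)),
    ("c", (False, True, None)),
]
print(render(cells))   # <i>a</i><i><u>b</u></i><u>c</u>
```

The output is more verbose than strictly necessary, but closing and reopening at every change means an `<i>` can never be closed while a later-opened `<u>` is still open.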


@CheeksTheGeek commented on GitHub (Mar 25, 2023):

Working on this. Thank you for providing so much info for a good start. Just one quick thing: the function is `get_decoder_line_encoded`.


@dhouck commented on GitHub (Mar 26, 2023):

Oops, yeah, fixed the typo there.


@cfsmp3 commented on GitHub (Mar 26, 2023):

@dhouck Does this fix it for you? https://github.com/CCExtractor/ccextractor/pull/1518


@dhouck commented on GitHub (Mar 27, 2023):

From reading it, it looks like itʼll work for UTF-8 output but not other supported character sets; I canʼt actually try it out right now but Iʼll check soon and get back to you then.


@dhouck commented on GitHub (Mar 27, 2023):

Okay it looks like the -unicode and -latin1 options are already ignored completely anyway (unless -unicode was always supposed to be a synonym for -utf8, which would theoretically make sense because UTF-8 is a way of representing unicode, but usually that means one of the UTF-16s) for both webvtt-full and srt, and I donʼt need them, so Iʼll let someone else worry about that if necessary.

For the sample file above, this still isnʼt working quite right because it doesnʼt put a space between the music note and the </i> so it still seems to be an off-by-one issue. Not a big deal in this sample (nobody cares if a space is italicized) but indicates there might be a more general off-by-one issue.


@cfsmp3 commented on GitHub (Mar 27, 2023):

> Okay it looks like the -unicode and -latin1 options are already ignored completely anyway (unless -unicode was always supposed to be a synonym for -utf8, which would theoretically make sense because UTF-8 is a way of representing unicode, but usually that means one of the UTF-16s) for both webvtt-full and srt, and I donʼt need them, so Iʼll let someone else worry about that if necessary.

The spec says that WebVTT must be UTF-8. We added latin1 etc. because the SRT spec doesn't say and there seemed to be differences between players, but since WebVTT is properly defined, let's just say it must always be UTF-8. Still, it would be nice if we just exited with an error if out==webvtt and encoding != utf8.

> For the sample file above, this still isnʼt working quite right because it doesnʼt put a space between the music note and the </i> so it still seems to be an off-by-one issue. Not a big deal in this sample (nobody cares if a space is italicized) but indicates there might be a more general off-by-one issue.

Looking at the PR, the problem doesn't seem to come from the PR itself; it looks like a more general issue. Since you are looking into this carefully, could you check whether it also happens when exporting to SRT and, if so, create a new issue so we can take a look?

I'm closing this one since the specific issue on the subject is fixed, but happy to work on making this perfect.
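
The suggested check (exit with an error when the output is WebVTT and the encoding is not UTF-8) could look roughly like the Python sketch below. It is only an illustration of the idea: `check_output_encoding`, the format set, and the accepted encoding spellings are all hypothetical and do not reflect CCExtractor's actual option parsing.

```python
# Hypothetical names throughout; this is not CCExtractor's option handling.
WEBVTT_FORMATS = {"webvtt", "webvtt-full"}
UTF8_NAMES = {"utf8", "utf-8"}


def check_output_encoding(out_format: str, encoding: str) -> None:
    """Raise ValueError if a WebVTT output is paired with a non-UTF-8 encoding."""
    if out_format.lower() in WEBVTT_FORMATS and encoding.lower() not in UTF8_NAMES:
        raise ValueError(
            f"{out_format} output must be UTF-8, but encoding {encoding!r} was requested"
        )


check_output_encoding("srt", "latin1")        # accepted: SRT does not mandate UTF-8
check_output_encoding("webvtt-full", "utf8")  # accepted

try:
    check_output_encoding("webvtt-full", "latin1")
except ValueError as err:
    print(f"error: {err}")
```

Failing early this way keeps an ignored encoding option from silently producing output the user did not ask for.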


@dhouck commented on GitHub (Mar 27, 2023):

It does not happen with SRT; the equivalent line is ` <i> ♪ </i> `.


@cfsmp3 commented on GitHub (Mar 27, 2023):

> It does not happen with SRT; the equivalent line is ` <i> ♪ </i> `.

On visual inspection I can't find a bug, but I saw that the font handling is much more involved in WebVTT than in SRT, so maybe... definitely worth checking out. Could you open a bug with samples, issue info, etc.?

Reference: starred/ccextractor#755