mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-15 05:26:07 +00:00
[BUG] WebVTT-Full formatted output splits multi-byte UTF-8 characters #755
Originally created by @dhouck on GitHub (Mar 25, 2023).
CCExtractor version:
Since some of that is listed as unknown, Iʼll add that this is from the Arch Linux AUR compiled as shown here
Necessary information
ccextractor test.bin -out=webvtt-full
Video links
Input and output files attached, both separately gzipped and in a combined zip file (for some reason GitHub wouldnʼt let me attach a TGZ despite claiming it was supported)
test.bin.gz
test.vtt.gz
test.zip
Additional information
The captions contain an italicized eighth note, with one space on either side ( ♪ ). The SRT output is correct (<i> ♪ </i>); the WebVTT-full output format should have the same line, but instead it puts three bytes inside the <i> element. The UTF-8 encoding for U+266A EIGHTH NOTE is 0xE2 0x99 0xAA, so this puts one space and 2/3 of the eighth note bytes inside the italics, and 1/3 of the eighth note bytes outside the italics, making the file no longer valid UTF-8 and instead having two unknown byte sequences.
@dhouck commented on GitHub (Mar 25, 2023):
I did not mean to create this issue yet; I hit the wrong keys on my keyboard and accidentally submitted it before it was ready.
@dhouck commented on GitHub (Mar 25, 2023):
There, now I actually have an issue that other people can read and attached relevant files. Sorry for creating it prematurely.
@dhouck commented on GitHub (Mar 25, 2023):
It appears this is caused by this loop reading the font events and color events per-608-grid-cell but iterating per-output-byte.
(Also incidentally it can lead to mis-nested italics and underlines; I donʼt know how common it is for webvtt readers to be able to handle that)
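A minimal sketch of the mismatch described above, assuming a much simplified row model. `struct row`, `ROW_WIDTH`, `cell_text`, `append`, `render_per_byte`, and `render_per_cell` are hypothetical names for illustration and are not CCExtractor's actual code; only italics are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROW_WIDTH 8 /* hypothetical; a real CEA-608 row has 32 columns */

/* Hypothetical, simplified model of one caption row: per-cell text and style. */
struct row {
    const char *cell[ROW_WIDTH]; /* UTF-8 bytes of each cell, NULL or "" when empty */
    int italic[ROW_WIDTH];       /* nonzero when the cell is italic */
};

static const char *cell_text(const struct row *r, size_t c)
{
    return r->cell[c] ? r->cell[c] : "";
}

/* Append n bytes of s to the NUL-terminated string in out (capacity cap). */
static void append(char *out, size_t cap, const char *s, size_t n)
{
    size_t len = strlen(out);
    assert(len + n < cap); /* sketch only: overflow is treated as a programming error */
    memcpy(out + len, s, n);
    out[len + n] = '\0';
}

/* Buggy shape: iterate over encoded BYTES but index the style array by byte offset. */
void render_per_byte(const struct row *r, char *out, size_t cap)
{
    char line[ROW_WIDTH * 4 + 1] = "";
    int open = 0;
    out[0] = '\0';
    for (size_t c = 0; c < ROW_WIDTH; c++)
        append(line, sizeof line, cell_text(r, c), strlen(cell_text(r, c)));
    size_t n = strlen(line);
    for (size_t j = 0; j < n; j++) {
        int it = j < ROW_WIDTH && r->italic[j]; /* wrong: j is a byte offset, not a cell */
        if (it && !open) { append(out, cap, "<i>", 3); open = 1; }
        if (!it && open) { append(out, cap, "</i>", 4); open = 0; }
        append(out, cap, &line[j], 1);
    }
    if (open) append(out, cap, "</i>", 4);
}

/* Fixed shape: iterate over CELLS and emit each cell's bytes as one unit. */
void render_per_cell(const struct row *r, char *out, size_t cap)
{
    int open = 0;
    out[0] = '\0';
    for (size_t c = 0; c < ROW_WIDTH; c++) {
        const char *s = cell_text(r, c);
        if (*s == '\0')
            continue; /* empty cell: nothing to emit */
        if (r->italic[c] && !open) { append(out, cap, "<i>", 3); open = 1; }
        if (!r->italic[c] && open) { append(out, cap, "</i>", 4); open = 0; }
        append(out, cap, s, strlen(s)); /* all bytes of the character stay together */
    }
    if (open) append(out, cap, "</i>", 4);
}
```

For a row whose first three cells are an italic space, an italic U+266A, and an italic space, `render_per_byte` closes the tag after the bytes 0x20 0xE2 0x99 and leaves 0xAA outside it, while `render_per_cell` keeps the three-byte character whole inside the `<i>` element.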
@dhouck commented on GitHub (Mar 25, 2023):
Meanwhile, SRT uses get_decoder_line_encoded, which solves both the bytes-vs-characters problem and the mis-nested tags problem, but it assumes that color changes use a <font> tag, which is not true in WebVTT, so canʼt just be used as-is.
@CheeksTheGeek commented on GitHub (Mar 25, 2023):
Working on this. Thanks for providing so much info for a good start. Just one quick thing: the function is get_decoder_line_encoded.
@dhouck commented on GitHub (Mar 26, 2023):
Oops, yeah, fixed the typo there.
@cfsmp3 commented on GitHub (Mar 26, 2023):
@dhouck Does this fix it for you? https://github.com/CCExtractor/ccextractor/pull/1518
@dhouck commented on GitHub (Mar 27, 2023):
From reading it, it looks like itʼll work for UTF-8 output but not other supported character sets; I canʼt actually try it out right now but Iʼll check soon and get back to you then.
@dhouck commented on GitHub (Mar 27, 2023):
Okay it looks like the -unicode and -latin1 options are already ignored completely anyway (unless -unicode was always supposed to be a synonym for -utf8, which would theoretically make sense because UTF-8 is a way of representing unicode, but usually that means one of the UTF-16s) for both webvtt-full and srt, and I donʼt need them, so Iʼll let someone else worry about that if necessary.
For the sample file above, this still isnʼt working quite right because it doesnʼt put a space between the music note and the </i> so it still seems to be an off-by-one issue. Not a big deal in this sample (nobody cares if a space is italicized) but indicates there might be a more general off-by-one issue.
@cfsmp3 commented on GitHub (Mar 27, 2023):
The specs say that webvtt must be UTF-8. We added latin1 etc. because the srt spec doesn't say and there seemed to be differences between players, but since webvtt is properly defined, let's just say it must always be utf-8. Still, it would be nice if we just exited with an error if out==webvtt and encoding != utf8.
Looking at the PR it doesn't seem like the problem could be with the PR but a more general issue. Since you are looking into this carefully, could you check if the issue happens exporting to srt, and if yes, create a new issue so we can take a look?
I'm closing this one since the specific issue on the subject is fixed, but happy to work on making this perfect.
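The check suggested above (fail when the output is WebVTT and the encoding is not UTF-8) could look roughly like this. It is only a sketch: the enums, the function name `check_output_encoding`, and the error text are hypothetical, and CCExtractor's real option types may differ.

```c
#include <stdio.h>

/* Hypothetical option types; the real ones in CCExtractor may be named differently. */
enum out_format { OUT_SRT, OUT_WEBVTT };
enum out_encoding { ENC_UTF8, ENC_LATIN1, ENC_UCS2 };

/* Returns 0 when the format/encoding pair is acceptable, -1 otherwise.
 * The caller would be expected to exit with an error on -1. */
int check_output_encoding(enum out_format fmt, enum out_encoding enc)
{
    if (fmt == OUT_WEBVTT && enc != ENC_UTF8) {
        fprintf(stderr, "Error: WebVTT output must be UTF-8.\n");
        return -1;
    }
    return 0; /* SRT keeps accepting the other encodings */
}
```

Returning a status instead of calling `exit` directly keeps the check easy to test; the option-parsing code can decide how to terminate.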
@dhouck commented on GitHub (Mar 27, 2023):
It does not happen with SRT; the equivalent line is <i> ♪ </i>.
@cfsmp3 commented on GitHub (Mar 27, 2023):
On visual inspection I can't find a bug, but I saw that the font stuff is much more involved in webvtt than in srt, so maybe... def. worth checking out. Could you open a bug with samples, issue info, etc?