mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-04 05:44:53 +00:00
[BUG] WebVTT style/characters get out of sync when non-ASCII characters are used #756
Originally created by @dhouck on GitHub (Mar 27, 2023).
CCExtractor version: Compiled myself from fa85a527 (I double-checked several times); I donʼt know why it thinks itʼs on d379d726.
Necessary information
{replace with the arguments}
Video links
[Same test input #1516; no need to re-upload]
Current output after #1518: test.vtt.gz
Expected output: there should be a space between the ♪ and the </i>; see the corresponding line of the g608 output.
The SRT line has the space before the </i>; the WebVTT-full one has it after. This isnʼt a big deal in this sample, but the same thing would happen in a visible way in most circumstances. For example, if the line were supposed to be <i> ♪♪ [epic music] ♪♪ </i>, then it would instead be <i> ♪♪ [epic music</i>] ♪♪.
Additional information
This is a follow-up for #1516. Its fix, #1518, does prevent splitting characters, but the styling will still always get out of sync with the text if there are any multibyte characters. This is because it still uses j as both a byte index and a screen index; I think a more comprehensive fix would be to use j as only a screen index, like the SRT decoder does, and decode each symbol separately.
I have a fix Iʼm planning to upload shortly, although a better fix is probably possible and I wonʼt be disappointed if someone comes along and refactors the entire loop or function away.
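To illustrate the idea of keeping j as a screen index only, here is a minimal sketch in C. It is not CCExtractor's actual code: render_line, utf8_symbol_len, append_bytes, and the per-cell italic array are hypothetical names invented for this example, and it assumes the row text is NUL-terminated UTF-8 with one style flag per screen cell. The byte offset into the text is tracked separately and advanced by the length of each decoded symbol, so style tags are placed by cell rather than by byte.

```c
#include <stddef.h>
#include <string.h>

/* Byte length of the UTF-8 sequence starting with lead byte c.
 * Invalid lead bytes are treated as one byte so the loop always advances. */
static size_t utf8_symbol_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Append n bytes to out at *pos; drop the piece whole if it does not fit,
 * so a multibyte symbol is never split. */
static void append_bytes(char *out, size_t cap, size_t *pos,
                         const char *src, size_t n)
{
    if (n >= cap - *pos)
        return;
    memcpy(out + *pos, src, n);
    *pos += n;
    out[*pos] = '\0';
}

/* Hypothetical renderer: text is UTF-8, italic[j] is the style of screen
 * cell j. Returns the number of bytes written to out (excluding the NUL). */
size_t render_line(const char *text, const unsigned char *italic,
                   int cells, char *out, size_t cap)
{
    size_t b = 0;      /* byte offset into text */
    size_t pos = 0;    /* byte offset into out */
    int in_italic = 0;

    if (cap == 0)
        return 0;
    out[0] = '\0';

    for (int j = 0; j < cells && text[b] != '\0'; j++) { /* j: screen index */
        int want = italic[j] != 0;
        if (want && !in_italic) {
            append_bytes(out, cap, &pos, "<i>", 3);
            in_italic = 1;
        } else if (!want && in_italic) {
            append_bytes(out, cap, &pos, "</i>", 4);
            in_italic = 0;
        }

        /* Decode one symbol; stop early if the sequence is truncated. */
        size_t n = utf8_symbol_len((unsigned char)text[b]);
        size_t k = 1;
        while (k < n && text[b + k] != '\0')
            k++;
        append_bytes(out, cap, &pos, text + b, k);
        b += k;        /* the byte offset advances by the symbol length */
    }
    if (in_italic)
        append_bytes(out, cap, &pos, "</i>", 4);
    return pos;
}
```

With this shape, a fully italic 20-cell row containing four ♪ symbols gets its closing tag after the last cell, regardless of how many bytes each symbol occupies. A real fix would also need to handle whatever other styles and colors the WebVTT writer emits, which this sketch leaves out.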