[BUG] WebVTT style/characters get out of sync when non-ASCII characters are used #756

Open
opened 2026-01-29 16:52:52 +00:00 by claunia · 0 comments
Owner

Originally created by @dhouck on GitHub (Mar 27, 2023).

CCExtractor version: Compiled myself from fa85a527 (I double-checked several times); I donʼt know why it thinks itʼs on d379d726

CCExtractor 0.94, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
CCExtractor detailed version info
        Version: 0.94
        Git commit: d379d72685959859db797621f270aeeb01a50021
        Compilation date: 2023-03-26
        CEA-708 decoder: Rust
        File SHA256: f7edb9796bf45c48bf3fe80db340293854e394f4ed0960f0f730d2ab5eec9028
Libraries used by CCExtractor
        libGPAC Version: 1.0.1
        zlib: 1.2.11
        utf8proc Version: 2.4.0
        protobuf-c Version: 1.3.1
        libpng Version: 1.6.37
        FreeType
        libhash
        nuklear
        libzvbi

Necessary information

  • Is this a regression (i.e. did it work before)? New behavior, but it was even worse before
  • What platform did you use? {Window/Linux/Mac} Linux
  • What were the used arguments? {replace with the arguments}

Video links

[Same test input #1516; no need to re-upload]
Current output after #1518: test.vtt.gz (https://github.com/CCExtractor/ccextractor/files/11075494/test.vtt.gz)

Expected output: there should be a space between the ♪ and the </i>; see this line of the g608:

               ♪                ^@99999999999999000999999999999999RRRRRRRRRRRRRRIIIRRRRRRRRRRRRRRR

The SRT line has the space before the </i>; the WebVTT-full one has it after. This isnʼt a big deal in this sample, but the same thing would happen visibly in most circumstances. For example, if the line were supposed to be <i> ♪♪ [epic music] ♪♪ </i>, then it would instead be <i> ♪♪ [epic music</i>] ♪♪ .

Additional information

This is a follow-up for #1516. Its fix, #1518, does prevent splitting characters, but the styling will still always get out of sync with the text if there are any multibyte characters. This is because it still uses j as both a byte index and a screen index; I think a more comprehensive fix would be to use j as only a screen index, like the SRT decoder does, and decode each symbol separately.
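
To make the idea concrete, here is a minimal sketch of a row renderer where j only ever counts screen columns and a separate variable tracks the byte offset into the output. This is not CCExtractorʼs actual code: the names render_row and utf8_encode, the 'R'/'I' font markers (borrowed from the g608 dump above), and the array-of-code-points row layout are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode one code point (BMP only, for brevity)
 * as UTF-8 and return the number of bytes written (1 to 3). */
static size_t utf8_encode(uint32_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical row renderer. chars[j] and fonts[j] both describe screen
 * column j; fonts[j] is 'I' for italics and anything else for regular.
 * Returns the number of bytes written, excluding the terminating NUL. */
static size_t render_row(const uint32_t *chars, const char *fonts,
                         size_t width, char *out, size_t out_size)
{
    size_t len = 0;  /* byte offset into out; never used to index the screen */
    int italic = 0;  /* whether an <i> tag is currently open */

    if (out_size == 0)
        return 0;

    for (size_t j = 0; j < width; j++) { /* j is a screen column only */
        int want = (fonts[j] == 'I');
        if (want != italic) {
            const char *tag = want ? "<i>" : "</i>";
            size_t n = strlen(tag);
            if (len + n >= out_size)
                break; /* no room left: stop rather than overflow */
            memcpy(out + len, tag, n);
            len += n;
            italic = want;
        }
        if (len + 3 >= out_size)
            break; /* keep room for a 3-byte symbol plus the NUL */
        len += utf8_encode(chars[j], out + len); /* one symbol per column */
    }

    /* Close a tag that is still open at the end of the row, if it fits. */
    if (italic && len + 4 < out_size) {
        memcpy(out + len, "</i>", 4);
        len += 4;
    }
    out[len] = '\0';
    return len;
}
```

With a four-column row containing a space, ♪ (U+266A), a space, and x, where the first three columns are italic, this sketch puts the closing tag after the second space, because the three bytes of ♪ advance len but only one step of j.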

I have a fix Iʼm planning to upload shortly, although probably a better fix is possible and I wonʼt be disappointed if someone comes along and refactors the entire loop or function away.
