[QUESTION] Valid xml document #556

Closed
opened 2026-01-29 16:47:29 +00:00 by claunia · 10 comments
Owner

Originally created by @jamoore5 on GitHub (Feb 21, 2020).

Should the XML generator be updated to guarantee that it is generating valid XML?

Example output from one of my video that I exacted subtitles from

1029 | <!--
1030 | And that's exactly it--   that
1031 | human beings perceive patterns
1032 | -->

I was getting an invalid token error from 3 different parsers that I tried to read it with. When I validated the XML in a validator I found

1030: | 25 | The string "--" is not permitted within comments.

Is this a possible issue in the code base?

Originally created by @jamoore5 on GitHub (Feb 21, 2020). Should the XML generator be updated to guarantee that it is generating valid XML? Example output from one of my video that I exacted subtitles from ``` 1029 | <!-- 1030 | And that's exactly it--  that 1031 | human beings perceive patterns 1032 | --> ``` I was getting an invalid token error from 3 different parsers that I tried to read it with. When I validated the XML in a validator I found `1030: | 25 | The string "--" is not permitted within comments.` Is this a possible issue in the code base?
claunia added the difficulty: easyHacktoberfest labels 2026-01-29 16:47:29 +00:00
Author
Owner

@cfsmp3 commented on GitHub (Feb 21, 2020):

Yes, I don't think we're escaping anything - just happily concatenating strings without a care in the world :-)

@cfsmp3 commented on GitHub (Feb 21, 2020): Yes, I don't think we're escaping anything - just happily concatenating strings without a care in the world :-)
Author
Owner

@jamoore5 commented on GitHub (Feb 21, 2020):

Is there a reason that the subtitles are a comments instead of text within the element?

Sorry, I do not know much about subtitles and formats.

@jamoore5 commented on GitHub (Feb 21, 2020): Is there a reason that the subtitles are a comments instead of text within the <spu> element? Sorry, I do not know much about subtitles and formats.
Author
Owner

@cfsmp3 commented on GitHub (Feb 21, 2020):

Is there a reason that the subtitles are a comments instead of text within the element?

If you mean when writing spupng, because they're not supposed there at all - the XML is just an index to the image files that contain the subtitles. We do add them to the XML for debugging purposes, and only if the OCR subsystem is present. spupng doesn't need a text representation of the subtitles.

@cfsmp3 commented on GitHub (Feb 21, 2020): > Is there a reason that the subtitles are a comments instead of text within the element? If you mean when writing spupng, because they're not supposed there at all - the XML is just an index to the image files that contain the subtitles. We do add them to the XML for debugging purposes, and only if the OCR subsystem is present. spupng doesn't need a text representation of the subtitles.
Author
Owner

@jamoore5 commented on GitHub (Feb 21, 2020):

@cfsmp3 thanks that was a helpful explanation.

I got introduced to this library and the concept of spupng when following this raspberry pi tutorial. https://magpi.raspberrypi.org/articles/make-comics-from-tv-recordings

The funny thing is that the content in the XML is a lot cleaner then what I exact out of my pngs using the above tutorial. On my raspberry pi, I get boxes instead of text in the png, but the right text in the xml. On my mac, I get the text in the pngs but it is not clear as the text in the xml.

For this project I was going to take advantage of the helpful debug addition and parse the text from the xml.

@jamoore5 commented on GitHub (Feb 21, 2020): @cfsmp3 thanks that was a helpful explanation. I got introduced to this library and the concept of spupng when following this raspberry pi tutorial. https://magpi.raspberrypi.org/articles/make-comics-from-tv-recordings The funny thing is that the content in the XML is a lot cleaner then what I exact out of my pngs using the above tutorial. On my raspberry pi, I get boxes instead of text in the png, but the right text in the xml. On my mac, I get the text in the pngs but it is not clear as the text in the xml. For this project I was going to take advantage of the helpful debug addition and parse the text from the xml.
Author
Owner

@PulkitMishra commented on GitHub (Feb 22, 2020):

hey @cfsmp3 do we need the -- in the xml file . I mean clearly -- is not allowed in XML files. So do we remove it or add a space between the two hyphens?

@PulkitMishra commented on GitHub (Feb 22, 2020): hey @cfsmp3 do we need the -- in the xml file . I mean clearly -- is not allowed in XML files. So do we remove it or add a space between the two hyphens?
Author
Owner

@cfsmp3 commented on GitHub (Feb 22, 2020):

@PulkitMishra if we're doing any work on this it should be to generate proper XML. But honestly I don't think this is high priority (based on number of users), but it's low hanging fruit so if you want to do it to get started by all means go ahead.

Just remember that the goal is always to improve quality of output -just removing things we don't like is not a good idea :-)

@cfsmp3 commented on GitHub (Feb 22, 2020): @PulkitMishra if we're doing any work on this it should be to generate proper XML. But honestly I don't think this is high priority (based on number of users), but it's low hanging fruit so if you want to do it to get started by all means go ahead. Just remember that the goal is always to improve quality of output -just removing things we don't like is not a good idea :-)
Author
Owner

@dak-x commented on GitHub (Mar 21, 2021):

@PulkitMishra From your discussion I conclude that subtitle text is provided inside the spu-xml only for debugging purposes. I also assume that the source for invalid xml tokens is from the subtitle itself (implying this is not a decoding bug). I suggest the solutions:

  1. Remove the subtitle text comments from the spu-xml for production. Only keep it for debug builds. We can display a warning message while using spu-xml on debug builds for the case when someone might accidentally generate it.

  2. Filter the text written to the spu-xml to avoid having invalid token. Might escape the strings or update/remove some tokens.

Please suggest what you think is appropriate. And also confirm my above assumption.

@dak-x commented on GitHub (Mar 21, 2021): @PulkitMishra From your discussion I conclude that subtitle text is provided inside the spu-xml only for debugging purposes. I also assume that the source for invalid xml tokens is from the subtitle itself (implying this is not a decoding bug). I suggest the solutions: 1. Remove the subtitle text comments from the spu-xml for production. Only keep it for debug builds. We can display a warning message while using spu-xml on debug builds for the case when someone might accidentally generate it. 2. Filter the text written to the spu-xml to avoid having invalid token. Might escape the strings or update/remove some tokens. Please suggest what you think is appropriate. And also confirm my above assumption.
Author
Owner

@cfsmp3 commented on GitHub (Mar 21, 2021):

@PulkitMishra From your discussion I conclude that subtitle text is provided inside the spu-xml only for debugging purposes. I also assume that the source for invalid xml tokens is from the subtitle itself (implying this is not a decoding bug). I suggest the solutions:

The invalid XML is caused by our writer not escaping some characters at all to comply with XML specifications. It's not a problem with the subtitles (which don't care at all about XML) or the format itself.

It's our own code being very sloppy here.

  1. Filter the text written to the spu-xml to avoid having invalid token. Might escape the strings or update/remove some tokens.

Yes, this is the right approach.

@cfsmp3 commented on GitHub (Mar 21, 2021): > @PulkitMishra From your discussion I conclude that subtitle text is provided inside the spu-xml only for debugging purposes. I also assume that the source for invalid xml tokens is from the subtitle itself (implying this is not a decoding bug). I suggest the solutions: The invalid XML is caused by our writer not escaping some characters at all to comply with XML specifications. It's not a problem with the subtitles (which don't care at all about XML) or the format itself. It's our own code being very sloppy here. > 2. Filter the text written to the spu-xml to avoid having invalid token. Might escape the strings or update/remove some tokens. > Yes, this is the right approach.
Author
Owner

@peach280 commented on GitHub (Jan 10, 2025):

I have generated a pull request which generates valid xml tokens. Please review.

@peach280 commented on GitHub (Jan 10, 2025): I have generated a pull request which generates valid xml tokens. Please review.
Author
Owner

@cfsmp3 commented on GitHub (Dec 20, 2025):

Fixed in PR #1783 which was merged on December 8, 2025. The write_spucomment function now sanitizes double-hyphens (--) in XML comments to produce valid XML output.

@cfsmp3 commented on GitHub (Dec 20, 2025): Fixed in PR #1783 which was merged on December 8, 2025. The `write_spucomment` function now sanitizes double-hyphens (`--`) in XML comments to produce valid XML output.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: starred/ccextractor#556