Consider Non-breaking space (unicode spaces) as regular spaces for formatting #576

Closed
opened 2026-01-29 14:40:08 +00:00 by claunia · 5 comments

Originally created by @linkdotnet on GitHub (Oct 18, 2022).

Hey,

if we have, for example, no-break spaces in ## (aka h2) elements, they don't get transformed into <h2> elements because the library checks whether or not a "regular space" is used. A no-break space is the character U+00A0 (info here: https://www.compart.com/en/unicode/U+00A0).

Also GitHub does not consider them (whereas stackedit.io does recognize them).
## This should be an h2 element

If you have a Mac you can easily reproduce this with Option+SpaceBar, which produces the non-breaking space.

My request would be to treat those characters as valid spaces so that the above example would work.
I am not sure what the Markdown specification says about those "unicode spaces".


@MihaZupan commented on GitHub (Oct 18, 2022):

I think GitHub may have stripped these characters from your example.
Do you mean if the heading is "#\u00A0# foo"? Or ## f\u00A0oo?

For the former, that doesn't make sense as input, just like it wouldn't make sense with regular spaces: # # foo. The spec (https://spec.commonmark.org/0.30/#atx-headings) doesn't mention that anything can be in between the # characters.


@linkdotnet commented on GitHub (Oct 18, 2022):

The latter example ## f\u00A0oo.
This example "should" work: ## My h2 title.

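You can copy and paste this into a website like https://www.soscisurvey.de/tools/view-chars.php, which shows you non-printable characters.
Image: https://user-images.githubusercontent.com/26365461/196433954-75fc595c-57c1-4252-91fe-cbace1c1b991.png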


@xoofx commented on GitHub (Oct 18, 2022):

I think the spec says ## must be followed by a space or a tab (not by any Unicode character in the space category), so if you look at the other CommonMark-compliant parsers here (https://babelmark.github.io/?text=%23%23%C2%A0foo), they don't parse ##\u00A0foo as a heading either.

So I would say, it's constrained as per spec, and so won't fix. Thoughts @MihaZupan ?


@MihaZupan commented on GitHub (Oct 18, 2022):

> The opening sequence of # characters must be followed by spaces or tabs

My interpretation of the spec is the same as @xoofx's.
If the spec intends the broader interpretation of whitespace, it generally calls it whitespace or Unicode whitespace explicitly (https://spec.commonmark.org/0.30/#unicode-whitespace-character).
Given that commonmark.js and GitHub don't support it either, I would also lean towards not changing anything in Markdig.
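To make the distinction concrete, here is a minimal sketch of the two character classes the spec separates. The class and method names are made up for illustration, and `OpensAtxHeading` is a deliberately simplified check (it ignores the up to three leading spaces the spec allows); it is not Markdig's actual parser code.

```csharp
using System;
using System.Globalization;

static class WhitespaceSketch
{
    // "Spaces or tabs" as the ATX heading rule words it: only U+0020 and U+0009.
    public static bool IsSpaceOrTab(char c) => c == ' ' || c == '\t';

    // "Unicode whitespace character" as CommonMark 0.30 defines it:
    // any code point in general category Zs, or tab, line feed, form feed, carriage return.
    public static bool IsUnicodeWhitespace(char c) =>
        c == '\t' || c == '\n' || c == '\f' || c == '\r'
        || char.GetUnicodeCategory(c) == UnicodeCategory.SpaceSeparator;

    // Hypothetical, simplified heading check: 1 to 6 '#' characters,
    // followed by a space, a tab, or the end of the line.
    public static bool OpensAtxHeading(string line)
    {
        int hashes = 0;
        while (hashes < line.Length && line[hashes] == '#') hashes++;
        if (hashes < 1 || hashes > 6) return false;
        return hashes == line.Length || IsSpaceOrTab(line[hashes]);
    }

    static void Main()
    {
        Console.WriteLine(IsSpaceOrTab('\u00A0'));          // False
        Console.WriteLine(IsUnicodeWhitespace('\u00A0'));   // True
        Console.WriteLine(OpensAtxHeading("## foo"));       // True
        Console.WriteLine(OpensAtxHeading("##\u00A0foo"));  // False
    }
}
```

U+00A0 falls in the Zs category, so it counts as Unicode whitespace but not as a space or tab, which is why the heading rule as quoted does not accept it.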

In any case, you can apply the fix on your side:

markdown = markdown.Replace('\u00A0', ' ');

or a more targeted replacement:

markdown = markdown.Replace("#\u00A0", "# ", StringComparison.Ordinal);
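If the string replacement above is still too broad (it also rewrites a "#\u00A0" that appears in the middle of a line), a regular expression can restrict the change to the start of a line. The following is only a sketch: the `HeadingNbspFix` class and its `Normalize` method are hypothetical names, and the pattern assumes the heading opener is up to three spaces followed by one to six # characters. It does not know about fenced code blocks, so a matching line inside a fence would be rewritten too.

```csharp
using System;
using System.Text.RegularExpressions;

static class HeadingNbspFix
{
    // At the start of any line: up to 3 spaces, 1 to 6 '#', then a no-break space.
    private static readonly Regex NbspAfterHashes = new Regex(
        @"^( {0,3}#{1,6})\u00A0",
        RegexOptions.Multiline | RegexOptions.CultureInvariant);

    // Replaces only the no-break space that directly follows the opening '#' run.
    public static string Normalize(string markdown) =>
        NbspAfterHashes.Replace(markdown, "$1 ");

    static void Main()
    {
        string input = "##\u00A0My h2 title\nprice #\u00A05";
        // The first line becomes "## My h2 title"; the second line is left unchanged.
        Console.WriteLine(Normalize(input));
    }
}
```

Run this before handing the text to the Markdown parser, in the same place where the one-line `Replace` calls would go.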

@linkdotnet commented on GitHub (Oct 18, 2022):

Fair point. I guess for now we can close the issue. Thanks for your input!

Reference: starred/markdig#576