mirror of
https://github.com/xoofx/markdig.git
synced 2026-02-03 21:36:36 +00:00
Differences in URL Encoding for links, text and Ids #338
Originally created by @RickStrahl on GitHub (Oct 29, 2019).
I'm running into issues trying to consolidate links that require encoding and jumping between them in a page. The problem is that it looks like the encoding for links ([]()) and for generated text and, more importantly, element IDs is not handled in the same way.
No good way to show the ID handling in Babelmark, but the problem shows itself in actual text rendering. Notice the difference in the text/URL encoding for the hash above vs. the code:
https://babelmark.github.io/?text=*+%5BNamenskonventionen+f%C3%BCr+Forms%5D(%23namenskonventionen-f%C3%BCr-f%22%2C%26orms)%0A%0A%23%23%23+Namenskonventionen+f%C3%BCr+F%22%2C%26orms
If you render with Auto-Ids the IDs use the same encoding as the text on the bottom which doesn't match the link encoding used above.
As a use case: if I generate a link in a page and want to link it automatically to a header below, there's no single way that I can encode that link. The very specific scenario is a TOC generator where I pick out all the topic headers and then generate a TOC of links that point to those same headers. But because the encoding is different, the links don't work.
turns into:
The differences in encoding cause the link to not navigate.
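The mismatch described here can be sketched outside of Markdig. The snippet below is a minimal illustration, not Markdig's actual output: the heading id is a made-up example, and Python's `urllib.parse.quote` stands in for whatever encoder produces the href. It shows that a percent-encoded fragment no longer equals a raw Unicode id as plain text, while the two do agree once the fragment is decoded.

```python
from urllib.parse import quote, unquote

# Hypothetical heading id, kept as raw Unicode (ids are left unencoded in this thread).
heading_id = "namenskonventionen-für-forms"

# Percent-encode the fragment as UTF-8 bytes, as URL encoders typically do.
href = "#" + quote(heading_id)


def fragment_of(link: str) -> str:
    """Return the link text without its leading '#', if present."""
    return link[1:] if link.startswith("#") else link


def matches_raw(link: str, element_id: str) -> bool:
    """Naive comparison: the fragment text must equal the id exactly."""
    return fragment_of(link) == element_id


def matches_decoded(link: str, element_id: str) -> bool:
    """Comparison after percent-decoding the fragment."""
    return unquote(fragment_of(link)) == element_id


print(href)
print(matches_raw(href, heading_id))
print(matches_decoded(href, heading_id))
```

Any consumer that compares the fragment to the id without decoding first will behave like `matches_raw` and fail to find the heading.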
There are a number of differences in how things are encoded, but in the above the umlaut probably shouldn't be encoded.
So the question is - should there be a consistent way to encode links that matches what the id generators are using?
Edge case for sure, but this has bitten me for a number of things related to creating reliable intra-document cross links. As it is I have to take over link navigation manually in my document solutions, but I'm not sure how to deal with the above.
@MihaZupan commented on GitHub (Oct 30, 2019):
I believe the HTML you are seeing is correct and it works on regular browsers.
If you look at the html source GitHub is using you can see that it also encodes the href while leaving the characters in the id as-is.
The escaping in the href is done according to the CommonMark spec and I believe these characters should be escaped.
I would recommend using the AutoLink functionality of AutoIdentifiersExtension instead of trying to guess what the generated id will be like.
When using the default UseAutoIdentifiers you avoid this problem as it uses AllowOnlyAscii, normalizing the id and thus avoiding character escaping in the href.
Since you are using the GitHub way of generating the id, non-ASCII characters are preserved in the id and then escaped in the href.
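To illustrate why an ASCII-only id sidesteps the escaping problem, here is a rough Python sketch. It is not Markdig's normalization algorithm: the `ascii_slug` helper is hypothetical and the exact ids Markdig produces may differ. The point is only that an id made of ASCII letters, digits and hyphens is unchanged by percent-encoding, so href and id stay identical.

```python
import re
import unicodedata
from urllib.parse import quote


def ascii_slug(heading: str) -> str:
    """Hypothetical ASCII-only id: decompose accents, drop non-ASCII, hyphenate."""
    # NFKD splits 'ü' into 'u' plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", heading)
    # Dropping non-ASCII code points removes the combining mark and keeps 'u'.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower())
    return slug.strip("-")


slug = ascii_slug("Namenskonventionen für Forms")
print(slug)
# Percent-encoding leaves an ASCII-only slug untouched.
print(quote(slug) == slug)
```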
Does the preview renderer you are using correctly establish the link if the html looks like the following - with the value of the heading id also url encoded, thus matching the href?
If that does work in your use-case, an extra setting controlling whether heading IDs are URL-encoded could be exposed, off by default (I tested it locally and the change needed is rather trivial).
@RickStrahl commented on GitHub (Oct 30, 2019):
I am already using the AutoLinks pipeline extension and that's how the ID gets generated, but as mentioned the IDs are not getting URL encoded the same way as the links. Note that there is some URL encoding happening.
Checked out your example, and sure enough, even if the URL encoding matches it doesn't work, so encoding isn't a solution either.
doesn't work:
this works:
this also works:
So it looks like if the link umlaut is URL Encoded the navigation just doesn't work.
Note although I'm using a tool for previewing (Markdown Monster) which uses the IE WebBrowser control in WPF, the same behavior happens in Chrome both with local file URLs as well as running against local Web urls.
@RickStrahl commented on GitHub (Oct 30, 2019):
Sigh...
More info. It looks like the <base href="" /> tag in the document messes with all of this. If I don't have a base tag in the document at all, then both URL-encoded to raw text and URL-encoded to URL-encoded work.
In the application previewer the base tag is required in order to properly find all the related resources relative to the document. However, with the base tag the navigation fails as soon as the hash is URL encoded. No encoded characters - it works fine.
I already intercept navigation of the tag and manually try to locate elements, so I guess it's possible to do a bit more work to normalize the IDs and URLs by explicitly url-decoding them, but that will then fail if somebody just dumps out the preview locally. Exports try to avoid the base tag, so that's all good and on a typical Web page there likely won't be a base tag.
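As background on why a base tag can affect hash links at all: relative references, including fragment-only ones, are resolved against the base URL rather than the document's own URL. The sketch below uses Python's `urljoin` with made-up URLs to show that resolution step. It does not explain why only URL-encoded hashes fail in the previewer; that part remains an observed quirk.

```python
from urllib.parse import urljoin

# Hypothetical URLs for illustration only.
document_url = "http://localhost/preview/topic.html"
base_href = "http://localhost/docs/"

# Without a base tag, a fragment-only link resolves against the document itself.
without_base = urljoin(document_url, "#section")

# With a base tag, the same link resolves against the base URL instead.
with_base = urljoin(base_href, "#section")

print(without_base)
print(with_base)
```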
While I still think that it would be better to not URL encode upper Unicode characters (just for the sheer overhead of it), I think that Markdig is actually doing the right thing, and I'm dealing with an HTML DOM quirk related to the <base> tag.
After some more thought I think we can probably close this but I'll leave it open a little longer in case somebody has any other ideas on a good way to deal with this.
At the end of the day this may bite others as well: any time there are base tags in a page plus some URL-encoded hash content in a link, this will show up. Based on the observations above, I don't think there's a good workaround for this short of using {#explicit-id} with extra attributes.
@MihaZupan commented on GitHub (Oct 30, 2019):
Since your preview differs from the actual export, there is a way (a bit of a hack).
1. Don't manually add a link destination when referring to a header.
2. In the preview pipeline, use
and in the release/export pipeline, use
The html will obviously differ in such a case between the pipelines, but characters like umlauts will be normalized during preview. Preview HTML looks like
And the release HTML stays the same
3.
While this does mean markdown like this can't work in the preview as there will be normalization happening, it doesn't work right now either so I don't see this as a real regression.
@RickStrahl commented on GitHub (Oct 31, 2019):
@MihaZupan Thank you - yes that would work. However the easier solution was to modify the render script that drives the preview and already intercepts hash navigation, which is inconsistent anyway due to the file-based nature (file:/// links) of the previewer.
The solution was actually quite simple: URL-decoding the hash. Since auto-linking tends to strip spaces, quotes and other symbols, the only encoded content should be Unicode characters, so decoding should work fine.
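The decode-then-look-up idea can be sketched as follows. This is an illustrative Python version, not the previewer's actual render script: `resolve_hash` and the list of ids are hypothetical. It percent-decodes the hash first, then falls back to the raw text in case an id was itself stored in encoded form.

```python
from typing import Iterable, Optional
from urllib.parse import unquote


def resolve_hash(hash_value: str, element_ids: Iterable[str]) -> Optional[str]:
    """Return the id a hash points at, trying the decoded form first."""
    fragment = hash_value[1:] if hash_value.startswith("#") else hash_value
    decoded = unquote(fragment)
    ids = set(element_ids)
    if decoded in ids:
        return decoded
    # Fall back to the literal text for ids that were stored encoded.
    if fragment in ids:
        return fragment
    return None


# Hypothetical ids found in a rendered page.
page_ids = ["einführung", "namenskonventionen-für-forms"]

print(resolve_hash("#namenskonventionen-f%C3%BCr-forms", page_ids))
print(resolve_hash("#missing", page_ids))
```

Decoding is a no-op for hashes that contain no percent escapes, so unencoded links keep working with the same lookup.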
@MihaZupan commented on GitHub (Oct 31, 2019):
Glad to hear you've found a solution