mirror of
https://github.com/xoofx/markdig.git
synced 2026-02-09 21:42:15 +00:00
German Umlaut handling in urilize #764
Originally created by @mos379 on GitHub (Oct 29, 2025).
When generating proper IDs for headers, I ran into the problem that German umlauts are not handled properly, which causes some problems.
For example, the current output is
<h2 id="-revolution-re-funktionen">🚀 Revolutionäre Funktionen</h2>
while
<h2 id="-revolutionaere-funktionen">🚀 Revolutionäre Funktionen</h2>
would be appropriate.
Since IAutoIdentifierGenerator is not public, it is not possible to override it, so it would be great to include this in the standard implementation.
So, for example, in the LinkHelper:
Urilize.txt
@xoofx commented on GitHub (Oct 29, 2025):
Interesting, could you open a proper Pull-Request?
I'm also curious: is it really only for the German language, or is there something that we are missing with the .NET implementation regarding such a transform?
@MihaZupan commented on GitHub (Oct 29, 2025):
Can you share the code you're using?
This is the current behavior I see
with the mapping to ASCII defined here for AutoLink:
8c01cf0549/src/Markdig/Helpers/CharNormalizer.cs (L67)
or just char.ToLowerInvariant for GitHub.
@mos379 commented on GitHub (Oct 29, 2025):
Yes, there is a standard .NET approach, but it has an important limitation for German umlauts.
The standard .NET way is to use String.Normalize(NormalizationForm.FormD), which splits characters into their base characters and combining diacritic marks, and then to filter out the marks using CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark (sources: Stack Overflow, Levibotelho).
However, there's a critical issue: this approach doesn't work correctly for German umlauts. Characters like ä, ö, and ü are stripped to just a, o, and u instead of the proper ae, oe, and ue transliterations (source: Stack Overflow).
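To make the limitation concrete, here is a minimal sketch of the decompose-and-filter approach described above. The class and method names (DiacriticsDemo, RemoveDiacritics) are made up for illustration and are not part of Markdig; only String.Normalize and CharUnicodeInfo.GetUnicodeCategory are real .NET APIs.

```csharp
using System;
using System.Globalization;
using System.Text;

// The umlaut is reduced to its base letter: "ä" comes out as "a", not "ae".
Console.WriteLine(DiacriticsDemo.RemoveDiacritics("Revolutionäre Funktionen"));

// Hypothetical helper for illustration only; not Markdig code.
static class DiacriticsDemo
{
    public static string RemoveDiacritics(string text)
    {
        // FormD splits "ä" (U+00E4) into "a" followed by the combining diaeresis U+0308.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Keep everything except the combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Letters without a canonical decomposition, such as ß and ø, pass through this filter unchanged, so an explicit mapping is needed for them in any case.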
@xoofx commented on GitHub (Oct 29, 2025):
Thanks for the context, I have started to add a comment to your PR.
I have two questions though. I'm a bit worried that this change is going to introduce an unexpected change of behavior for many folks.
For example, the translation 'Å' => "Aa" would introduce a double Aa, while I have seen just recently an example from someone named Åkesson who uses the term Akesson in ASCII, and not Aakesson. Thoughts?
Another question: can't it be language specific (some languages flattening to a different version), so that it would require providing an optional "language selector" for configuring such behavior?
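As a rough illustration of what such a selector could look like, the sketch below applies two alternative mappings to the same name. Everything here (the Translit class, the StripMarks and Digraphs profiles, the Apply method) is hypothetical and only meant to show the two outcomes side by side; it is not an existing Markdig option.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

Console.WriteLine(Translit.Apply("Åkesson", Translit.StripMarks)); // Akesson
Console.WriteLine(Translit.Apply("Åkesson", Translit.Digraphs));   // Aakesson

// Hypothetical profiles for illustration only; not Markdig code.
static class Translit
{
    // Profile that flattens the letter to its base character.
    public static readonly IReadOnlyDictionary<char, string> StripMarks =
        new Dictionary<char, string> { ['Å'] = "A", ['å'] = "a" };

    // Profile that uses the two-letter transliteration.
    public static readonly IReadOnlyDictionary<char, string> Digraphs =
        new Dictionary<char, string> { ['Å'] = "Aa", ['å'] = "aa" };

    // Assumes precomposed input (Å as the single code point U+00C5).
    public static string Apply(string text, IReadOnlyDictionary<char, string> profile)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            sb.Append(profile.TryGetValue(c, out var replacement) ? replacement : c.ToString());
        }
        return sb.ToString();
    }
}
```

Passing the profile as a parameter keeps the default behavior unchanged for existing users while letting callers opt into a different mapping.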
@mos379 commented on GitHub (Oct 30, 2025):
For me it is ok to just keep the German parts, because these are the proper transliterations.
When it comes to the Scandinavian letters, in my experience two letters are mostly used.
For context, I have been working with Norwegian for more than 15 years.
In addition, here is some official context.
The letter å was introduced in Norwegian in 1917 and in Danish in 1948, replacing the digraph 'aa', and 'aa' remains in use as the standard transliteration when the letter is not available for technical reasons
Example transformations:
There's actually some debate among Scandinavians about the "correct" transliteration - some prefer just removing the special marks (ø→o, å→a) for shorter URLs, while others insist on the official transliteration (ø→oe, å→aa).
The code I wrote uses the official, standard transliteration which is what's used in passports and when these characters are unavailable.
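For readers without access to the attachment, the following is a minimal sketch of how such a digraph table could feed into a heading ID. It is not the code from the PR and not Markdig's Urilize: the SlugSketch class, the ToSlug method, and the exact table are assumptions for illustration, and unlike the output quoted at the top of the thread this sketch drops leading separators.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

Console.WriteLine(SlugSketch.ToSlug("Revolutionäre Funktionen")); // revolutionaere-funktionen

// Hypothetical helper for illustration only; not Markdig code.
static class SlugSketch
{
    // Lowercase table only: the input is lowercased before lookup.
    static readonly Dictionary<char, string> Digraphs = new Dictionary<char, string>
    {
        ['ä'] = "ae", ['ö'] = "oe", ['ü'] = "ue", ['ß'] = "ss",
        ['å'] = "aa", ['ø'] = "oe", ['æ'] = "ae",
    };

    public static string ToSlug(string heading)
    {
        var sb = new StringBuilder(heading.Length);
        foreach (char c in heading.ToLowerInvariant())
        {
            if (Digraphs.TryGetValue(c, out var replacement))
            {
                sb.Append(replacement);            // transliterate mapped letters
            }
            else if (c < 128 && char.IsLetterOrDigit(c))
            {
                sb.Append(c);                      // keep ASCII letters and digits
            }
            else if (sb.Length > 0 && sb[sb.Length - 1] != '-')
            {
                sb.Append('-');                    // collapse anything else into one dash
            }
        }
        return sb.ToString().TrimEnd('-');
    }
}
```

Swapping the table for one that maps ø to o and å to a would give the shorter variant mentioned above.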
@mos379 commented on GitHub (Nov 25, 2025):
@xoofx with the PR merged, this should be closed as completed, and could this make it into a release please? :)
@xoofx commented on GitHub (Nov 25, 2025):
Yes, sorry, I was waiting for #913, but it seems to be stuck. Gonna push a release now.