German Umlaut handling in urilize #764

Closed
opened 2026-01-29 14:45:00 +00:00 by claunia · 7 comments

Originally created by @mos379 on GitHub (Oct 29, 2025).

When generating IDs for headings, I ran into a problem: German umlauts are not transliterated properly.

For example, I get
`<h2 id="-revolution-re-funktionen">🚀 Revolutionäre Funktionen</h2>`
while
`<h2 id="-revolutionaere-funktionen">🚀 Revolutionäre Funktionen</h2>`
would be appropriate.

Since IAutoIdentifierGenerator is not public, it's not possible to override this behavior, so it would be great to include it in the standard implementation, for example in the LinkHelper.

[Urilize.txt](https://github.com/user-attachments/files/23208116/Urilize.txt)

claunia added the bug, PR Welcome! labels 2026-01-29 14:45:01 +00:00

@xoofx commented on GitHub (Oct 29, 2025):

Interesting, could you open a proper Pull-Request?

I'm also curious: is this really only about German, or is there something we are missing in the .NET implementation regarding such a transform?


@MihaZupan commented on GitHub (Oct 29, 2025):

Can you share the code you're using?

This is the current behavior I see

AutoLink
<h1 id="revolutionäre-funktionen">🚀 Revolutionäre Funktionen</h1>

AllowOnlyAscii, AutoLink (default)
<h1 id="revolutionare-funktionen">🚀 Revolutionäre Funktionen</h1>

GitHub
<h1 id="-revolutionäre-funktionen">🚀 Revolutionäre Funktionen</h1>

AllowOnlyAscii, GitHub (the GitHub format currently ignores AllowOnlyAscii)
<h1 id="-revolutionäre-funktionen">🚀 Revolutionäre Funktionen</h1>

with the mapping to ASCII defined here for AutoLink:
https://github.com/xoofx/markdig/blob/8c01cf054971a1ec4cd663edaca6e2d035236133/src/Markdig/Helpers/CharNormalizer.cs#L67
or just `char.ToLowerInvariant` for GitHub.


@mos379 commented on GitHub (Oct 29, 2025):

> Interesting, could you open a proper Pull-Request?
>
> I'm curious also, is it really only for the German language, or there is something that we are missing with the .NET implementation regarding such transform?

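Yes, there is a standard .NET approach, but it has an important limitation for German umlauts.
The standard .NET way is to call `String.Normalize(NormalizationForm.FormD)`, which splits characters into their base characters and combining diacritical marks, and then to filter out the marks by keeping only characters where `CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark` ([Stack Overflow](https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net), [Levibotelho](https://www.levibotelho.com/development/c-remove-diacritics-accents-from-a-string/)).

However, there's a critical issue: this approach doesn't give the conventional result for German umlauts. Characters like ä, ö, and ü are stripped to just a, o, and u instead of the proper transliterations ae, oe, and ue ([Stack Overflow](https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net)).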
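
To make the limitation concrete, here is a minimal sketch in Python rather than C#, since `unicodedata.normalize("NFD", ...)` and the `Mn` category correspond to `NormalizationForm.FormD` and `UnicodeCategory.NonSpacingMark`. The `GERMAN_MAP` table and both function names are assumptions made up for this illustration; they are not Markdig's actual mapping or API.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop nonspacing marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical table for illustration only; not taken from Markdig.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_german(text: str) -> str:
    """Apply the explicit German mapping first, then strip remaining marks."""
    return strip_diacritics(text.translate(GERMAN_MAP))


print(strip_diacritics("Revolutionäre"))      # umlaut flattened to a plain "a"
print(transliterate_german("Revolutionäre"))  # umlaut expanded to "ae"
```

The explicit mapping has to run before the decomposition step, because after NFD the umlaut is already split into a base letter plus a combining mark and can no longer be matched as a single character.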


@xoofx commented on GitHub (Oct 29, 2025):

Thanks for the context, I have started to add a comment to your PR.

I have two questions though. I'm a bit worried that this change is going to introduce an unexpected change of behavior for many folks.

For example, the translation `'Å' => "Aa"` would introduce a double a, while I just recently saw someone named `Åkesson` write their name as `Akesson` in ASCII, not `Aakesson`. Thoughts?

Another question: can't it be language specific (some languages flattening to a different form)? If so, it would require providing an optional "language selector" for configuring such behavior.


@mos379 commented on GitHub (Oct 30, 2025):

For me it is OK to keep just the German parts, because these are the proper transliterations.
When it comes to the Scandinavian letters, in my experience the two-letter form is the one mostly used.
For context, I have been working with Norwegian for more than 15 years.
In addition, here is some official context:
_The letter å was introduced in Norwegian in 1917 and in Danish in 1948, replacing the digraph 'aa', and 'aa' remains in use as the standard transliteration when the letter is not available for technical reasons._
Example transformations:

  • Norwegian: "Åse på øya" → "aase-paa-oeya"
  • Danish: "Kærlighed på første møde" → "kaerlighed-paa-foerste-moede"
  • Icelandic: "Þórsmörk" → "thorsmork"
  • Icelandic: "Ísafjörður" → "isafjordur"

There's actually some debate among Scandinavians about the "correct" transliteration: some prefer just removing the special marks (ø→o, å→a) for shorter URLs, while others insist on the official transliteration (ø→oe, å→aa).
The code I wrote uses the official, standard transliteration, which is what's used in passports and when these characters are unavailable.
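
The example transformations above can be reproduced with a small sketch like the following. It is written in Python for brevity, and the `TABLES` dictionary, its language keys, and the `slugify` function are all hypothetical names invented for this illustration; none of them reflects the code in the PR. It also shows why the mapping may need to be language specific: the Icelandic examples flatten ö to o, while German expects oe.

```python
import re
import unicodedata

# Hypothetical per-language tables, for illustration only.
TABLES = {
    "scandinavian": {"å": "aa", "ø": "oe", "æ": "ae"},
    "icelandic": {"þ": "th", "ð": "d", "æ": "ae"},
}


def slugify(text: str, language: str) -> str:
    """Lowercase, apply the language table, strip marks, join with hyphens."""
    # Normalize to NFC first so precomposed letters match the table keys.
    lowered = unicodedata.normalize("NFC", text).lower()
    table = TABLES[language]
    mapped = "".join(table.get(ch, ch) for ch in lowered)
    # Drop any remaining combining marks (for example ó -> o, ö -> o).
    decomposed = unicodedata.normalize("NFD", mapped)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", stripped).strip("-")


print(slugify("Åse på øya", "scandinavian"))
print(slugify("Kærlighed på første møde", "scandinavian"))
print(slugify("Þórsmörk", "icelandic"))
print(slugify("Ísafjörður", "icelandic"))
```

With a single shared table, ö could only map to one of "o" or "oe", so at least one of the German or Icelandic results would differ from what the examples expect.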


@mos379 commented on GitHub (Nov 25, 2025):

@xoofx with the PR merged, this should be closed as completed, and could this make it into a release please? :)


@xoofx commented on GitHub (Nov 25, 2025):

> @xoofx with the PR merged, this should be closed as completed, and could this make it into a release please? :)

Yes, sorry, I was waiting for #913 but it seems to be stuck. Gonna push a release now.

Reference: starred/markdig#764