mirror of
https://github.com/xoofx/markdig.git
synced 2026-02-10 21:40:00 +00:00
Recognize supplementary (non-BMP) punctuations around emphasis delimiter runs #732
Originally created by @tats-u on GitHub (Mar 17, 2025).
fb3fe8b261/src/Markdig/Helpers/CharHelper.cs (L37-L60)
The spec says "A character is a Unicode code point." However, the `char` type is just a UTF-16 code unit, which does not cover supplementary characters (whose code points are U+10000 or greater).
https://spec.commonmark.org/0.31.2/#character
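To make the gap concrete, here is a small sketch (not Markdig code; the class and member names are made up for illustration). It uses U+10100 AEGEAN WORD SEPARATOR LINE, a supplementary character in category Po, and shows that inspecting a single `char` only ever sees a surrogate:

```csharp
using System;
using System.Globalization;

public static class SupplementaryDemo
{
    // U+10100 AEGEAN WORD SEPARATOR LINE (general category Po), as a UTF-16 string.
    public static readonly string Sample = char.ConvertFromUtf32(0x10100);

    // Category seen when only the first UTF-16 code unit is inspected.
    public static UnicodeCategory CategoryOfFirstChar(string s) =>
        CharUnicodeInfo.GetUnicodeCategory(s[0]);

    // Code point recovered when the surrogate pair is decoded as a whole.
    public static int CodePointAtStart(string s) => char.ConvertToUtf32(s, 0);
}
```

`SupplementaryDemo.Sample.Length` is 2, `CategoryOfFirstChar(Sample)` is `UnicodeCategory.Surrogate`, and `CodePointAtStart(Sample)` is `0x10100`.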
@tats-u commented on GitHub (Mar 17, 2025):
The problems in `System.Text.Rune`:
@MihaZupan commented on GitHub (Mar 17, 2025):
`Rune` does support surrogates, you can do `new Rune(high, low)`. It's also thankfully one of those types that's trivial to polyfill.
If limited to e.g. `EmphasisInline`, supporting non-BMP punctuation seems fine to me.
@tats-u commented on GitHub (Mar 17, 2025):
It's how we get a supplementary code point from a well-formed surrogate pair. What I mean is how to handle an ill-formed isolated surrogate code unit.
Sufficient.
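As a rough sketch of the two cases discussed here (a well-formed pair versus an isolated surrogate), assuming a runtime where `System.Text.Rune` is available (.NET Core 3.0 or later, or a polyfill). `RuneDecoding` and `TryDecodeAt` are hypothetical names, not existing Markdig APIs:

```csharp
using System;
using System.Text;

public static class RuneDecoding
{
    // Hypothetical helper: decodes the scalar value starting at `index`.
    // Returns false for an isolated (ill-formed) surrogate, so the caller
    // can keep treating it as an ordinary, uninteresting code unit.
    public static bool TryDecodeAt(string text, int index, out Rune rune)
    {
        char c = text[index];

        if (char.IsHighSurrogate(c) && index + 1 < text.Length)
        {
            // Well-formed pair: same result as new Rune(high, low), but
            // TryCreate reports failure instead of throwing.
            return Rune.TryCreate(c, text[index + 1], out rune);
        }

        // BMP scalar values succeed; lone surrogates fail here.
        return Rune.TryCreate(c, out rune);
    }
}
```

With this shape, a caller only asks for the Unicode category when the method returns true, and leaves the existing per-`char` behavior untouched otherwise.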
@tats-u commented on GitHub (Mar 17, 2025):
https://learn.microsoft.com/en-US/dotnet/api/system.globalization.charunicodeinfo.getunicodecategory?view=net-8.0#system-globalization-charunicodeinfo-getunicodecategory(system-int32)
Another problem is that `CharUnicodeInfo.GetUnicodeCategory` has an overload that takes `int`, but it's exclusive to .NET.
Should we substitute the `(string, int)` overload for that in .NET Standard 2.0?
@MihaZupan commented on GitHub (Mar 17, 2025):
I'm not sure I understand the concern. If we happen to see a valid surrogate pair we can check its category. If it's not a valid pair the behavior stays the same - it's currently treated as an arbitrary uninteresting `char`.
That's okay, older runtimes just have to deal with being slow :)
That's how `Rune.GetUnicodeCategory` is implemented as well.
@tats-u commented on GitHub (Mar 18, 2025):
That will be acceptable because the Unicode category of all surrogate code points is Cs, i.e. not P or S.
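A minimal sketch of that check using only the `(string, int)` overload, which is available on .NET Standard 2.0. `CategoryHelper` and `IsPunctuationOrSymbolAt` are hypothetical names, and working on a `string` rather than Markdig's own slice type is an assumption made to keep the example self-contained:

```csharp
using System.Globalization;

public static class CategoryHelper
{
    // Hypothetical helper: true when the code point starting at `index`
    // is in a Unicode P (punctuation) or S (symbol) general category.
    public static bool IsPunctuationOrSymbolAt(string text, int index)
    {
        // The (string, int) overload classifies a valid surrogate pair as a
        // whole and reports Surrogate (Cs) for an isolated surrogate.
        switch (CharUnicodeInfo.GetUnicodeCategory(text, index))
        {
            case UnicodeCategory.ConnectorPunctuation:    // Pc
            case UnicodeCategory.DashPunctuation:         // Pd
            case UnicodeCategory.OpenPunctuation:         // Ps
            case UnicodeCategory.ClosePunctuation:        // Pe
            case UnicodeCategory.InitialQuotePunctuation: // Pi
            case UnicodeCategory.FinalQuotePunctuation:   // Pf
            case UnicodeCategory.OtherPunctuation:        // Po
            case UnicodeCategory.MathSymbol:              // Sm
            case UnicodeCategory.CurrencySymbol:          // Sc
            case UnicodeCategory.ModifierSymbol:          // Sk
            case UnicodeCategory.OtherSymbol:             // So
                return true;
            default:
                return false; // includes Surrogate (Cs)
        }
    }
}
```

Because an isolated surrogate falls into the `default` branch, it is treated like any other non-punctuation character without a separate code path.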
It looks like you prefer `Rune` to `int` or `uint`. How about `Rune?` (nullable)? `Rune` is a struct, so its nullable type is fully supported in older C# or .NET Standard 2.0. If the adjacent code point is an isolated surrogate, the value should be null. (Falling back to another value expressing a non-punctuation scalar value is fine, though.)
@tats-u commented on GitHub (Mar 18, 2025):
Probably some methods that return `Rune`, `int`, or `uint` need to be added in `StringSlice`.
@tats-u commented on GitHub (Mar 18, 2025):
memo: how to call `internal unsafe static UnicodeCategory InternalGetUnicodeCategory(int ch)` in .NET Framework via reflection:
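A minimal sketch of such a call, assuming the non-public method keeps exactly the name and `(int)` signature quoted above on the target .NET Framework version. `FrameworkCategory` and `GetCategory` are hypothetical names, and the fallback path is an addition for runtimes where the lookup returns null:

```csharp
using System;
using System.Globalization;
using System.Reflection;

public static class FrameworkCategory
{
    // Assumption: the non-public static method exists with this name and an
    // (int) parameter; on runtimes where it does not, this field is null.
    private static readonly MethodInfo InternalMethod =
        typeof(CharUnicodeInfo).GetMethod(
            "InternalGetUnicodeCategory",
            BindingFlags.NonPublic | BindingFlags.Static,
            binder: null,
            types: new[] { typeof(int) },
            modifiers: null);

    public static UnicodeCategory GetCategory(int codePoint)
    {
        if (InternalMethod != null)
        {
            return (UnicodeCategory)InternalMethod.Invoke(null, new object[] { codePoint });
        }

        // Fallback through the public (string, int) overload.
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
        {
            // char.ConvertFromUtf32 rejects surrogate code points.
            return UnicodeCategory.Surrogate;
        }
        return CharUnicodeInfo.GetUnicodeCategory(char.ConvertFromUtf32(codePoint), 0);
    }
}
```

Caching the `MethodInfo` in a static field avoids repeating the lookup, though each `Invoke` still boxes its argument, so this is likely slower than a direct call.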