Recognize supplementary (non-BMP) punctuations around emphasis delimiter runs #732

Closed
opened 2026-01-29 14:44:07 +00:00 by claunia · 8 comments

Originally created by @tats-u on GitHub (Mar 17, 2025).

https://github.com/xoofx/markdig/blob/fb3fe8b261dfdec82edc2d1b07735931df04bbd4/src/Markdig/Helpers/CharHelper.cs#L37-L60

  • https://github.com/commonmark/commonmark.js/pull/297
  • https://github.com/commonmark/commonmark-java/pull/198
  • https://github.com/commonmark/commonmark-spec/pull/794

The spec says "A character is a Unicode code point." However, the char type is just a UTF-16 code unit, which does not cover supplementary characters (whose code points are U+10000 or greater).

https://spec.commonmark.org/0.31.2/#character
claunia added the "enhancement" and "PR Welcome!" labels 2026-01-29 14:44:07 +00:00

@tats-u commented on GitHub (Mar 17, 2025):

The problems with System.Text.Rune:

  • It does not accept surrogate code points (U+D800–U+DFFF) because it represents Unicode scalar values, and surrogate code points are outside that range.
  • It is exclusive to .NET (unavailable in .NET Framework).
  • There is no official NuGet backport package for .NET Standard 2.0.

@MihaZupan commented on GitHub (Mar 17, 2025):

Rune does support surrogates, you can do new Rune(high, low). It's also thankfully one of those types that's trivial to polyfill.

If limited to e.g. EmphasisInline, supporting non-BMP punctuations seems fine to me.


@tats-u commented on GitHub (Mar 17, 2025):

> Rune does support surrogates, you can do new Rune(high, low).

That is how we get a supplementary code point from a well-formed surrogate pair. What I mean is how to handle an ill-formed, isolated surrogate code unit.

> If limited to e.g. EmphasisInline, supporting non-BMP punctuations seems fine to me.

Sufficient.


@tats-u commented on GitHub (Mar 17, 2025):

https://learn.microsoft.com/en-US/dotnet/api/system.globalization.charunicodeinfo.getunicodecategory?view=net-8.0#system-globalization-charunicodeinfo-getunicodecategory(system-int32)

Another problem is that CharUnicodeInfo.GetUnicodeCategory has an overload that takes int, but it's exclusive to .NET.

Should we substitute the (string, int) overload for that in .NET Standard 2.0?
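
A minimal sketch of that substitution, for illustration only. The helper name `CodePointCategory.Get` is hypothetical and not part of Markdig; it relies only on the `(char)` and `(string, int)` overloads of `CharUnicodeInfo.GetUnicodeCategory`, which should be available when targeting .NET Standard 2.0:

```cs
using System;
using System.Globalization;

static class CodePointCategory
{
    // Hypothetical helper (not a Markdig API): returns the Unicode category of a
    // code point without using the .NET-only GetUnicodeCategory(int) overload.
    public static UnicodeCategory Get(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        // BMP code points, including isolated surrogates, fit in a single char.
        if (codePoint <= 0xFFFF)
            return CharUnicodeInfo.GetUnicodeCategory((char)codePoint);

        // Supplementary code points: build the surrogate pair as a short string
        // (this allocates) and let the (string, int) overload decode it.
        return CharUnicodeInfo.GetUnicodeCategory(char.ConvertFromUtf32(codePoint), 0);
    }
}
```

The allocation on the supplementary path is the cost of this fallback; a real implementation could avoid it by calling the `(string, int)` overload directly on the source text instead of a temporary string.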


@MihaZupan commented on GitHub (Mar 17, 2025):

> how to handle an ill-formed isolated surrogate code unit.

I'm not sure I understand the concern. If we happen to see a valid surrogate pair we can check its category. If it's not a valid pair the behavior stays the same - it's currently treated as an arbitrary uninteresting char.

> Another problem is the CharUnicodeInfo.GetUnicodeCategory has an overload that takes int but it's exclusive to .NET.

That's okay, older runtimes just have to deal with being slow :)
That's how Rune.GetUnicodeCategory is implemented as well.
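
To make the behavior described above concrete, here is a rough sketch, not Markdig's actual code. The class and method names are hypothetical. A well-formed surrogate pair next to the delimiter is classified as one code point; anything else falls back to the single `char`, so an isolated surrogate reports category Cs and is treated as non-punctuation:

```cs
using System.Globalization;

static class AdjacentCodePoint
{
    // Category of the code point that ends just before `index` (caller ensures index >= 1).
    public static UnicodeCategory CategoryBefore(string text, int index)
    {
        if (index >= 2 && char.IsSurrogatePair(text[index - 2], text[index - 1]))
            return CharUnicodeInfo.GetUnicodeCategory(text, index - 2);
        return CharUnicodeInfo.GetUnicodeCategory(text[index - 1]);
    }

    // Category of the code point that starts at `index` (caller ensures index < text.Length).
    public static UnicodeCategory CategoryAt(string text, int index)
    {
        if (index + 1 < text.Length && char.IsSurrogatePair(text[index], text[index + 1]))
            return CharUnicodeInfo.GetUnicodeCategory(text, index);
        return CharUnicodeInfo.GetUnicodeCategory(text[index]);
    }

    // General categories P* and S*; their enum values are contiguous
    // (ConnectorPunctuation through OtherSymbol).
    public static bool IsPunctuationOrSymbol(UnicodeCategory category) =>
        category >= UnicodeCategory.ConnectorPunctuation &&
        category <= UnicodeCategory.OtherSymbol;
}
```

For example, with `"a" + char.ConvertFromUtf32(0x10100) + "*"`, calling `CategoryBefore(text, 3)` inspects the pair at indices 1 and 2 rather than only the low surrogate at index 2.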


@tats-u commented on GitHub (Mar 18, 2025):

> If it's not a valid pair the behavior stays the same - it's currently treated as an arbitrary uninteresting char.

That will be acceptable because the Unicode category of all surrogate code points is Cs, i.e. not P or S.

It looks like you prefer Rune to int or uint. How about a nullable Rune (Rune?)? Rune is a struct, so its nullable type is fully supported in older C# or .NET Standard 2.0. If the adjacent code point is an isolated surrogate, the value should be null. (Falling back to another value expressing a non-punctuation scalar value is fine though.)
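
A sketch of what the nullable idea could look like, assuming a runtime where `System.Text.Rune` exists (or a polyfill with the same `TryCreate` methods). `RuneReader.RuneAt` is a hypothetical name, and it works on a plain `string` rather than `StringSlice` to stay self-contained:

```cs
using System.Text;

static class RuneReader
{
    // Hypothetical helper: decodes the scalar value that starts at `index`.
    // Returns null when text[index] is an isolated surrogate, or a low
    // surrogate (i.e. `index` points into the middle of a pair).
    public static Rune? RuneAt(string text, int index)
    {
        char c = text[index];

        // Succeeds for any non-surrogate UTF-16 code unit.
        if (Rune.TryCreate(c, out Rune single))
            return single;

        // Succeeds only for a well-formed high + low surrogate pair.
        if (index + 1 < text.Length && Rune.TryCreate(c, text[index + 1], out Rune pair))
            return pair;

        return null;
    }
}
```

A caller could then treat `null` as "not punctuation" and otherwise pass the value to `Rune.GetUnicodeCategory`.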


@tats-u commented on GitHub (Mar 18, 2025):

Probably some methods that return Rune, int, or uint need to be added to StringSlice (https://github.com/xoofx/markdig/blob/master/src/Markdig/Helpers/StringSlice.cs).

@tats-u commented on GitHub (Mar 18, 2025):

memo: how to call internal unsafe static UnicodeCategory InternalGetUnicodeCategory(int ch) in .NET Framework via reflection:

```cs
using System;
using System.Globalization;
using System.Reflection;

// Look up the non-public static method by name and exact parameter list.
var method = typeof(CharUnicodeInfo).GetMethod(
    "InternalGetUnicodeCategory",
    BindingFlags.NonPublic | BindingFlags.Static,
    null,                        // default binder
    new Type[] { typeof(int) },  // parameter types
    null);                       // no parameter modifiers
if (method is not null)
{
    // Bind once to a delegate so later calls avoid MethodInfo.Invoke overhead.
    var getUnicodeCategory = (Func<int, UnicodeCategory>)method.CreateDelegate(
        typeof(Func<int, UnicodeCategory>));
}
```
Reference: starred/markdig#732