Incorrect parsing with: .UseListExtras() - depth limit exceeded #753

Open
opened 2026-01-29 14:44:47 +00:00 by claunia · 4 comments

Originally created by @r-Larch on GitHub (Sep 22, 2025).

I convert scanned PDFs to Markdown using Mistral OCR. The OCR sometimes produces sequences like D. M. M. M. … on a single line.
With UseListExtras() enabled, Markdig appears to interpret each X. as an (alpha) list marker, causing pathological nesting and throwing:

Markdown elements in the input are too deeply nested - depth limit exceeded.

Input that triggers the error

Krankenhaus Gepedale di
Brixen Bressanone
ARZTLICHE DIATO
GESCHÄFTE VIT-DE
ARZTLICHEA DIATO
D. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M

Expected behavior

  • The problematic line is treated as plain text (a paragraph), or at worst, a single list item like D. <remaining text>, without creating deep nesting.

Actual behavior

  • Markdown.ToHtml(...) throws ArgumentException: Markdown elements in the input are too deeply nested - depth limit exceeded.

Minimal repro (NUnit)

using Markdig;
using NUnit.Framework;

[TestFixture]
public class OcrListFalsePositiveTests
{
    [Test]
    public void FalsePositiveDepthLimit()
    {
        var mistralOcrMarkdown = """
            Krankenhaus Gepedale di
            Brixen Bressanone
            ARZTLICHE DIATO
            GESCHÄFTE VIT-DE
            ARZTLICHEA DIATO
            D. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M. M
            """;

        var pipeline = new MarkdownPipelineBuilder()
            .UseListExtras() // enables alpha list markers like "a. "
            .Build();

        Assert.DoesNotThrow(() => {
            Markdig.Markdown.ToHtml(mistralOcrMarkdown, pipeline);
        });
    }
}

Current result: the test throws with the “depth limit exceeded” exception.

Environment

  • Markdig version: 0.42.0
  • .NET: 9

Notes / Hypothesis

  • With UseListExtras(), sequences like D. , M. are valid alpha list markers. When many appear in a single line, the parser seems to treat them as nested structures, quickly tripping the depth limit.
  • This looks like a false positive for list parsing when multiple alpha markers occur inline on the same line.

Possible directions

  • Only treat alpha list markers as list starts when they appear at the beginning of a line (or after indentation consistent with list blocks), not repeatedly within the same line.
  • Alternatively, short-circuit nested list interpretation when multiple markers appear on a single line without intervening newlines.
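
As a caller-side stopgap, independent of any change in Markdig, the input could be pre-processed before parsing. The following is a minimal sketch and not part of Markdig: `ListMarkerGuard`, `EscapeMarkerRuns`, and the `maxMarkers` threshold are hypothetical names. It assumes that backslash-escaping the delimiter of the first marker is enough to turn the line into plain text, since the later markers only act as list starts when they begin a list item's content. It only recognizes single ASCII letters and digit sequences, not Roman numerals.

```csharp
using System.Text.RegularExpressions;

public static class ListMarkerGuard
{
    // A run of list-like markers at the start of a line: a single ASCII letter
    // or up to nine digits, followed by '.' or ')' and at least one space or tab.
    private static readonly Regex MarkerRun = new Regex(
        @"^[ \t]*(?:(?:[A-Za-z]|[0-9]{1,9})(?<delim>[.)])[ \t]+)+",
        RegexOptions.Compiled);

    // Hypothetical helper: escapes lines that start with more than
    // maxMarkers consecutive list-like markers.
    public static string EscapeMarkerRuns(string markdown, int maxMarkers = 16)
    {
        var lines = markdown.Split('\n');
        for (var i = 0; i < lines.Length; i++)
        {
            var match = MarkerRun.Match(lines[i]);
            if (!match.Success) continue;

            // Each repetition of the group records one capture of "delim".
            var delimiters = match.Groups["delim"].Captures;
            if (delimiters.Count <= maxMarkers) continue;

            // Escape only the first delimiter ("D." becomes "D\."),
            // which breaks the chain of nested list starts.
            lines[i] = lines[i].Insert(delimiters[0].Index, "\\");
        }
        return string.Join("\n", lines);
    }
}
```

Calling `ListMarkerGuard.EscapeMarkerRuns(mistralOcrMarkdown)` before `Markdown.ToHtml(...)` would leave short, legitimate lists untouched and only rewrite lines with an implausibly long marker run. The threshold of 16 is an arbitrary choice for illustration.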
claunia added the bug and PR Welcome! labels 2026-01-29 14:44:47 +00:00

@prozolic commented on GitHub (Jan 1, 2026):

A similar sequence with ordered list markers causes the same error.
The nesting depth for list markers appears to be limited to a maximum of 128, and exceeding this limit triggers this error.

var orderedListMarkdown = """
            1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128.
            """;

var pipeline = new MarkdownPipelineBuilder().Build();

Markdig.Markdown.ToHtml(orderedListMarkdown, pipeline);
System.ArgumentException: 
Markdown elements in the input are too deeply nested - depth limit exceeded. Input is most likely not sensible or is a very large table.

https://github.com/xoofx/markdig/blob/master/src/Markdig/Helpers/ThrowHelper.cs#L66-L81

In CommonMark, when the content of a list item begins with another list marker, it appears to be interpreted as a nested list.
In that case, the nesting itself seems correct, but what do you think?

https://spec.commonmark.org/dingus/?text=1.%202.%203.%204.%205.%206.%207.%208.%209.%2010.%2011.%2012.%2013.%2014.%2015.%2016.%2017.%2018.%2019.%2020.%2021.%2022.%2023.%2024.%2025.%2026.%2027.%2028.%2029.%2030.%2031.%2032.%2033.%2034.%2035.%2036.%2037.%2038.%2039.%2040.%2041.%2042.%2043.%2044.%2045.%2046.%2047.%2048.%2049.%2050.%2051.%2052.%2053.%2054.%2055.%2056.%2057.%2058.%2059.%2060.%2061.%2062.%2063.%2064.%2065.%2066.%2067.%2068.%2069.%2070.%2071.%2072.%2073.%2074.%2075.%2076.%2077.%2078.%2079.%2080.%2081.%2082.%2083.%2084.%2085.%2086.%2087.%2088.%2089.%2090.%2091.%2092.%2093.%2094.%2095.%2096.%2097.%2098.%2099.%20100.%20101.%20102.%20103.%20104.%20105.%20106.%20107.%20108.%20109.%20110.%20111.%20112.%20113.%20114.%20115.%20116.%20117.%20118.%20119.%20120.%20121.%20122.%20123.%20124.%20125.%20126.%20127.%20128.
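
To probe the limit at different depths without pasting long literals, the repro line can be generated. This is a small sketch; `NestedListRepro` and `OrderedMarkers` are hypothetical names used only for illustration.

```csharp
using System.Linq;

public static class NestedListRepro
{
    // Builds "1. 2. 3. ... depth." on a single line, i.e. `depth`
    // ordered list markers, each starting the content of the previous one.
    public static string OrderedMarkers(int depth) =>
        string.Join(" ", Enumerable.Range(1, depth).Select(i => $"{i}."));
}
```

Passing `NestedListRepro.OrderedMarkers(128)` to `Markdig.Markdown.ToHtml(...)` should correspond to the input shown above; smaller values can be used to check where the exception starts to appear.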


@r-Larch commented on GitHub (Jan 5, 2026):

Okay, it looks like parsing the text as a deeply nested list is spec-conformant. However, I could not trigger any depth limit error in the https://spec.commonmark.org/ dingus, so I assume that to be spec-conformant we need at least a way to configure the depth limit.

I would be very happy if I could configure the hard-coded depth limit.

I'm open to providing a PR in case the change is welcome.


@MihaZupan commented on GitHub (Jan 5, 2026):

Increasing the depth limit here feels like a band-aid. The limit is there for a reason: what do you do once the OCR spits out a few more characters on that line?

The approach of letting the parser know its current depth, with the option to just bail in such cases could be interesting here.

Is the result from Markdown parsing of such text even valuable at all?
Even if you parsed a list that's nested 200 times, is that more valuable than just throwing it away if you hit the exception?


Re: spec-conformity, my 2 cents is that there will always be practical limits that implementations may impose on weird inputs. An example of this comes from the spec itself when talking about links:

> Implementations may impose limits on parentheses nesting to avoid performance issues, but at least three levels of nesting should be supported.


@r-Larch commented on GitHub (Jan 9, 2026):

> Is the result from Markdown parsing of such text even valuable at all?
> Even if you parsed a list that's nested 200 times, is that more valuable than just throwing it away if you hit the exception?

I'm working with lawyers, and they need to convert terrible scanned documents into Microsoft Word Documents.
They expect the output to contain all text and as much of the formatting as possible.

So my pipeline is like:

Scan -> OCR-to-markdown -> markdown-to-html (Markdig) -> html-to-docx

I'm fine with every solution where I can get at least something out of Markdig.

For me, the ideal scenario for this edge case would be to parse it as a paragraph after exceeding the depth limit.
But that's probably not spec-conforming.

So, making the limit configurable looks like the best choice to stay spec-conformant while letting the business logic decide about resource constraints (e.g., max memory footprint and max execution time, the consequences of deep nesting). As a rule of thumb: the higher the limit, the more stack (recursion) and heap (AST) memory is required for bad or malicious input, and the more time is spent in the parser.
In business logic, such a limit can be tuned based on testing and measuring.

If you prefer to give an option to decide based on the current depth, that's also fine for me 👌

> The approach of letting the parser know its current depth, with the option to just bail in such cases could be interesting here.
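
Until either option exists, the "get at least something out of Markdig" goal can be approximated on the caller side by catching the exception and falling back to plain text. The following is a minimal sketch under stated assumptions: `SafeMarkdown` and `RenderOrFallback` are hypothetical names, the renderer is passed in as a delegate so the helper does not depend on any particular Markdig overload, and it relies on the exception being an `ArgumentException` as reported in this issue. Note that catching `ArgumentException` is broad and also covers unrelated argument errors.

```csharp
using System;
using System.Net;

public static class SafeMarkdown
{
    // Hypothetical helper: tries the given Markdown renderer and, if it
    // throws ArgumentException (as reported for the depth limit), returns
    // the HTML-encoded source text wrapped in a single paragraph instead.
    public static string RenderOrFallback(string markdown, Func<string, string> render)
    {
        try
        {
            return render(markdown);
        }
        catch (ArgumentException)
        {
            // Formatting is lost, but all of the text is preserved.
            return "<p>" + WebUtility.HtmlEncode(markdown) + "</p>";
        }
    }
}
```

With Markdig, the call could look like `SafeMarkdown.RenderOrFallback(text, md => Markdig.Markdown.ToHtml(md, pipeline))`. Applied to the whole document, the fallback drops all formatting; applying it per OCR page or per block of text would limit the loss to the offending part.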

Reference: starred/markdig#753