Incorrect parsing with: .UseListExtras() - depth limit exceeded
#753
Originally created by @r-Larch on GitHub (Sep 22, 2025).
I convert scanned PDFs to Markdown using Mistral OCR. The OCR sometimes produces sequences like D. M. M. M. … on a single line. With UseListExtras() enabled, Markdig appears to interpret each X. as an (alpha) list marker, causing pathological nesting and throwing the depth-limit exception below.

Input that triggers the error
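The original input sample did not survive extraction; a minimal sketch of the kind of line that triggers it, assuming OCR noise of repeated single-letter markers:

```markdown
D. M. M. M. M. M. M. M. M. M. M. M. M. M. … (one line, repeated enough times to exceed the nesting limit)
```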
Expected behavior
The line should render as plain text, i.e. D. <remaining text>, without creating deep nesting.

Actual behavior

Markdown.ToHtml(...) throws ArgumentException: Markdown elements in the input are too deeply nested - depth limit exceeded.

Minimal repro (NUnit)
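The original test body was lost in extraction; the following is a minimal sketch of what such an NUnit repro could look like, assuming a pipeline built with UseListExtras() (the repeat count and test names are illustrative):

```csharp
using System.Linq;
using Markdig;
using NUnit.Framework;

[TestFixture]
public class ListExtrasDepthLimitTests
{
    [Test]
    public void OcrStyleAlphaMarkers_ShouldNotThrow()
    {
        // One line of repeated alpha list markers, as the OCR emits them.
        var markdown = string.Join(" ", Enumerable.Repeat("D.", 200));

        var pipeline = new MarkdownPipelineBuilder()
            .UseListExtras()
            .Build();

        // Currently throws ArgumentException:
        // "Markdown elements in the input are too deeply nested - depth limit exceeded"
        Assert.DoesNotThrow(() => Markdown.ToHtml(markdown, pipeline));
    }
}
```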
Current result: the test throws with the “depth limit exceeded” exception.
Environment
Notes / Hypothesis
With UseListExtras(), sequences like D. and M. are valid alpha list markers. When many appear on a single line, the parser seems to treat them as nested structures, quickly tripping the depth limit.

Possible directions

Make the depth limit configurable, or give the parser a way to bail out gracefully once the limit is reached (both are discussed below).
@prozolic commented on GitHub (Jan 1, 2026):
A similar sequence with ordered list markers causes the same error.
The nesting depth for list markers appears to be limited to a maximum of 128, and exceeding this limit triggers this error.
https://github.com/xoofx/markdig/blob/master/src/Markdig/Helpers/ThrowHelper.cs#L66-L81
In CommonMark, when the content of a list item begins with another list marker, it appears to be interpreted as a nested list.
In that case, the nesting itself seems correct, but what do you think?
https://spec.commonmark.org/dingus/?text=1.%202.%203.%204.%205.%206.%207.%208.%209.%2010.%2011.%2012.%2013.%2014.%2015.%2016.%2017.%2018.%2019.%2020.%2021.%2022.%2023.%2024.%2025.%2026.%2027.%2028.%2029.%2030.%2031.%2032.%2033.%2034.%2035.%2036.%2037.%2038.%2039.%2040.%2041.%2042.%2043.%2044.%2045.%2046.%2047.%2048.%2049.%2050.%2051.%2052.%2053.%2054.%2055.%2056.%2057.%2058.%2059.%2060.%2061.%2062.%2063.%2064.%2065.%2066.%2067.%2068.%2069.%2070.%2071.%2072.%2073.%2074.%2075.%2076.%2077.%2078.%2079.%2080.%2081.%2082.%2083.%2084.%2085.%2086.%2087.%2088.%2089.%2090.%2091.%2092.%2093.%2094.%2095.%2096.%2097.%2098.%2099.%20100.%20101.%20102.%20103.%20104.%20105.%20106.%20107.%20108.%20109.%20110.%20111.%20112.%20113.%20114.%20115.%20116.%20117.%20118.%20119.%20120.%20121.%20122.%20123.%20124.%20125.%20126.%20127.%20128.
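To make the nesting concrete, here is a small illustration (the HTML is sketched from the dingus behavior linked above, not verbatim Markdig output):

```csharp
using Markdig;

// Each ordered-list marker at the start of the previous item's content
// opens another nested list, so three markers yield three levels deep.
var html = Markdown.ToHtml("1. 2. 3.");
// Roughly:
// <ol><li>
//   <ol start="2"><li>
//     <ol start="3"><li></li></ol>
//   </li></ol>
// </li></ol>
```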
@r-Larch commented on GitHub (Jan 5, 2026):
Okay, it looks like the behavior is spec-conformant as far as parsing the text into a deeply nested list goes. However, I could not trigger any depth-limit error in the https://spec.commonmark.org/ fiddle, so I assume that to stay spec-conformant we need at least a way to configure the depth limit.
I would be very happy if I could configure the hard-coded depth limit.
I'm open to providing a PR in case the change is welcome.
@MihaZupan commented on GitHub (Jan 5, 2026):
Increasing the depth limit here feels like a band-aid. The limit is there for a reason - what do you do once the OCR spits out a few more characters on that line?
The approach of letting the parser know its current depth, with the option to just bail in such cases, could be interesting here.
Is the result from Markdown parsing of such text even valuable at all?
Even if you parsed a list that's nested 200 times, is that more valuable than just throwing it away if you hit the exception?
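As a sketch of the "throw it away" option as it exists today, a caller can catch the exception and fall back to plain text (ToHtmlSafe is an illustrative helper, not a Markdig API):

```csharp
using System.Net;
using Markdig;

// Illustrative helper, not part of Markdig: fall back to escaped
// plain text when the parser rejects pathologically nested input.
static string ToHtmlSafe(string markdown, MarkdownPipeline pipeline)
{
    try
    {
        return Markdown.ToHtml(markdown, pipeline);
    }
    catch (ArgumentException)
    {
        // The "depth limit exceeded" error surfaces as ArgumentException.
        return "<p>" + WebUtility.HtmlEncode(markdown) + "</p>";
    }
}
```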
Re: spec-conformity, my 2 cents is that there will always be practical limits that implementations may impose on weird inputs. An example of this comes from the spec itself when talking about links.
@r-Larch commented on GitHub (Jan 9, 2026):
I'm working with lawyers, and they need to convert terrible scanned documents into Microsoft Word Documents.
They expect the output to contain all text and as much of the formatting as possible.
So my pipeline is roughly: scanned PDF → Mistral OCR → Markdown → Markdig → Microsoft Word document.
I'm fine with every solution where I can get at least something out of Markdig.
For me, the ideal scenario for this edge case would be to parse it as a paragraph after exceeding the depth limit.
But that's probably not spec-conforming.
So, making the limit configurable looks like the best choice: it stays spec-conform while letting the business logic decide about resource constraints (e.g., max memory footprint and max execution time, the consequences of deep nesting). The rule of thumb: the higher the limit, the more stack (recursion) and heap (AST) memory bad or malicious input can consume, and the more time is spent in the parser.
In business logic, such a limit can be tuned based on testing and measuring.
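A hypothetical shape for such an option (WithDepthLimit is illustrative only; no such API exists in Markdig today):

```csharp
using Markdig;

// Hypothetical sketch, not current Markdig API: a pipeline-level
// depth knob that business logic could tune per workload.
var pipeline = new MarkdownPipelineBuilder()
    .UseListExtras()
    .WithDepthLimit(512) // hypothetical extension method
    .Build();
```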
If you prefer to give an option to decide based on the current depth, that's also fine for me 👌