Make HeadingBlockParser configurable #264

Closed
opened 2026-01-29 14:32:12 +00:00 by claunia · 9 comments

Originally created by @dustinmoris on GitHub (Jan 27, 2019).

Hi,

First of all, this is a great library. I've been using it a lot for my blog and other projects and absolutely love it. I have a use case where I would like to use Markdig to parse user-entered Markdown into HTML, but only a limited subset of it. One of the things I would like to limit is the heading tags (e.g. users should only be able to create h3, h4, h5, and h6, but not h1 and h2).

I would like to propose the following change in the HeadingBlockParser:

Add an optional parameter to the constructor (this is pseudo code):

private readonly HashSet<int> _allowedHeaders;

public HeadingBlockParser(HashSet<int> allowedHeaders = null)
{
    OpeningCharacters = new[] {'#'};
    _allowedHeaders = allowedHeaders != null
        ? allowedHeaders // Probably validate that it only contains valid values (1-6) here
        : new HashSet<int>() { 1, 2, 3, 4, 5, 6 };
}

And then change this if statement:

Before

if (leadingCount > 0 && leadingCount <= MaxLeadingCount && (c.IsSpaceOrTab() || c == '\0'))

After

if (_allowedHeaders.Contains(leadingCount) && (c.IsSpaceOrTab() || c == '\0'))

Then I could configure the HeadingBlockParser to limit the allowed headers via the constructor when setting up BlockParsers.

What do you think?
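The validation hinted at in the constructor comment above could look roughly like this. It is a minimal sketch under stated assumptions, not part of Markdig: `HeadingLevels` and `Normalize` are hypothetical names chosen for illustration, and the only rule enforced is the 1-6 range mentioned in the proposal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not a Markdig type): validates a set of allowed heading levels.
public static class HeadingLevels
{
    public const int Min = 1;
    public const int Max = 6;

    // Returns a validated copy of the requested levels,
    // or all six levels when no restriction is given.
    public static HashSet<int> Normalize(IEnumerable<int> allowed = null)
    {
        if (allowed == null)
        {
            return new HashSet<int>(Enumerable.Range(Min, Max));
        }

        var result = new HashSet<int>(allowed);
        foreach (var level in result)
        {
            if (level < Min || level > Max)
            {
                throw new ArgumentOutOfRangeException(
                    nameof(allowed), level, "Heading levels must be between 1 and 6.");
            }
        }

        return result;
    }
}
```

With a helper like this, the proposed constructor body could shrink to `_allowedHeaders = HeadingLevels.Normalize(allowedHeaders);`.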

claunia added the enhancement label 2026-01-29 14:32:12 +00:00

@MihaZupan commented on GitHub (Jan 27, 2019):

I would think that it's not too common to not want h1, but want h3 for example. Therefore I believe that, rather than changing the parser, post-processing the document AST to remove said headers would be more appropriate, since you can then decide whether to discard them altogether or replace them with regular text.


@xoofx commented on GitHub (Jan 27, 2019):

I agree, otherwise someone else could come up with a different rule here. It is quite straightforward to process the AST document afterwards and adapt it to your specific requirements.


@dustinmoris commented on GitHub (Jan 27, 2019):

> I would think that it's not too common to not want h1, but want h3 for example

I think that is a very common scenario, because h1 has a special meaning in an HTML document or a particular HTML block. For example, you shouldn't have more than one h1 in an article or an hgroup. If you want to display user-entered Markdown in an HTML block where you need to ensure that there is only one h1, then this is a VERY COMMON use case.

But nevertheless...

> post-processing the document AST to remove said headers would be more appropriate

If this is easy and fast then this would be good enough for me too. Have you got an example somewhere?

Thanks for the fast replies!


@dustinmoris commented on GitHub (Jan 27, 2019):

> I agree, otherwise someone else could come up with a different rule here.

My suggestion would have given the flexibility to implement a different rule, because anyone could pass any combination of heading levels to the constructor (e.g. 1, 2, 3 vs. 3, 4, 5, 6). But I get your point: if post-processing is already the correct way to apply such rules quickly and efficiently, then I'm happy to do it. An example would be much appreciated!


@MihaZupan commented on GitHub (Jan 27, 2019):

Something like this

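// Namespaces needed by this snippet
using System.IO;
using System.Linq;
using Markdig;
using Markdig.Renderers;
using Markdig.Syntax;
using Markdig.Syntax.Inlines;

var pipeline = new MarkdownPipelineBuilder().UseAdvancedExtensions().Build();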
var markdown = @"
# header 1

## header 2

### header 3

#### header 4

##### header 5

###### header 6
";

bool replaceWithText = true;
//                                    1      2      3     4     5     6
bool[] allowedHeadings = new bool[] { false, false, true, true, true, true };

var document = Markdown.Parse(markdown, pipeline);
// Snapshot the headings first (ToList() needs System.Linq) so that the tree
// is not modified while it is being enumerated
foreach (var heading in document.Descendants<HeadingBlock>().ToList())
{
    if (allowedHeadings[heading.Level - 1])
    {
        continue;
    }

    // Fall back to the document in case the heading has no parent set
    var parent = heading.Parent ?? document;
    int index = parent.IndexOf(heading);

    // Detach the disallowed heading from the tree
    parent.RemoveAt(index);

    if (replaceWithText)
    {
        // Re-emit the original source text of the heading as a plain paragraph
        var literal = new LiteralInline(markdown.Substring(heading.Span.Start, heading.Span.Length));
        var inline = new ContainerInline();
        inline.AppendChild(literal);
        var paragraph = new ParagraphBlock
        {
            Inline = inline
        };

        parent.Insert(index, paragraph);
    }
}

var writer = new StringWriter();
var renderer = new HtmlRenderer(writer);
pipeline.Setup(renderer);
renderer.Render(document);

string html = writer.ToString();

@dustinmoris commented on GitHub (Jan 27, 2019):

Ok thanks for providing this code! I think this will work for my use case, but from an architectural POV I think that my proposed solution still makes sense. Apologies for trying to convince you, but I just want to throw in a few more thoughts and then you can decide if you agree or disagree :)

I think my proposed solution to include some h-tags but not others sounds weird because you have created one parser for all. In reality these tags should be considered as different elements. For example you treat a pre block differently than a blockquote or p. Why is that? It "feels" different because you know that most pages would want to style them differently and also, because you know that they have a different meaning.

Well, with headers it is actually the same. IMHO it would be more correct to have a Header1BlockParser and a Header2BlockParser, etc., but I do get why it made more sense to have a single HeadingBlockParser. In reality an h1 tag is a different block than an h2 or h3, and they DO obviously have different meanings. That's why HTML doesn't have a single header tag (e.g. <h order="1"></h> and <h order="3"></h>), but different ones (h1, h2, etc.).

If you made the architectural decision to wrap them all together in a single HeadingBlockParser class, then I think it would only be fair to expose some additional control via the constructor over which tags should actually get parsed.

I don't think this opens the door to more or different rules, because there are only six different headers. If you allow a user to configure in the constructor which of the six should get parsed, then you have already exhausted all the possibilities that one might want to configure.

I personally think there would be value in doing this. I can also live with the longer workaround of post-processing, but it feels a bit wrong to parse elements into the AST when I know that they should never have been parsed in the first place.


@xoofx commented on GitHub (Jan 27, 2019):

Let's keep the code as it is today. You can either copy/paste the HeadingBlockParser into your own project and modify it, or go the post-processing route if you need special handling. We can revisit this later if more people come here looking for a similar use case.
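For the first route, the thread does not show how a modified parser would be wired into a pipeline. The following is a rough sketch under stated assumptions rather than the maintainers' recommendation: instead of a full copy of the parser, it uses a small subclass that counts the leading `#` characters and declines disallowed levels before delegating to the stock parser. `RestrictedHeadingBlockParser` and `RestrictedHeadings.BuildPipeline` are hypothetical names, and the sketch assumes `HeadingBlockParser.TryOpen` can be overridden in the Markdig version in use. It only affects `#`-style headings; as far as I know, setext headings (text underlined with `===` or `---`) are produced elsewhere in the parser and would still yield h1 and h2.

```csharp
using System.Collections.Generic;
using Markdig;
using Markdig.Parsers;

// Hypothetical subclass (not part of Markdig): rejects '#' headings whose level is not allowed.
public sealed class RestrictedHeadingBlockParser : HeadingBlockParser
{
    private readonly HashSet<int> _allowedLevels;

    public RestrictedHeadingBlockParser(IEnumerable<int> allowedLevels)
    {
        _allowedLevels = new HashSet<int>(allowedLevels);
    }

    public override BlockState TryOpen(BlockProcessor processor)
    {
        if (!processor.IsCodeIndent)
        {
            // StringSlice is a struct, so this copy can be advanced
            // without moving the processor's own position.
            var line = processor.Line;
            int leadingCount = 0;
            while (line.CurrentChar == '#')
            {
                line.NextChar();
                leadingCount++;
            }

            // Declining here lets the line fall through to the other block parsers,
            // so it ends up as ordinary paragraph text.
            if (!_allowedLevels.Contains(leadingCount))
            {
                return BlockState.None;
            }
        }

        return base.TryOpen(processor);
    }
}

public static class RestrictedHeadings
{
    // Builds a default pipeline with the stock heading parser swapped for the restricted one.
    public static MarkdownPipeline BuildPipeline(params int[] allowedLevels)
    {
        var builder = new MarkdownPipelineBuilder();
        int index = builder.BlockParsers.FindIndex(p => p is HeadingBlockParser);
        builder.BlockParsers[index] = new RestrictedHeadingBlockParser(allowedLevels);
        return builder.Build();
    }
}
```

A call such as `Markdown.ToHtml(text, RestrictedHeadings.BuildPipeline(3, 4, 5, 6))` would then leave `# title` lines as plain paragraph text while still rendering `### section` as a heading.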


@dustinmoris commented on GitHub (Jan 28, 2019):

Ok thanks, I went with copy-pasting the current HeadingBlockParser and adding my proposed changes, and it works like a charm! Thanks for considering my proposal and keep up the great work!


@petefox commented on GitHub (Feb 10, 2020):

I would like to give a +1 to @dustinmoris's suggestion.

For use cases like blogs or support/documentation pages, where multiple contributors write content intended for public-facing pages, it's important from an SEO point of view to have only one H1 tag, which describes the title of the document.

This H1 would typically be rendered outside the Markdown by the wrapping page, and defined in a separate input field on the editing page.

And yes, I know you could "just tell your users" to only use one H1 tag and always place it at the top of the page - but that defeats the purpose of using something like Markdown, where you would want to restrict the editing/design options in order to create a more uniform layout and resulting markup.

I think it would make great sense to make an option that restricts the use of certain tags.

Reference: starred/markdig#264