Parsing HTML into tree nodes instead of HtmlInline or HtmlBlock objects #413

Open
opened 2026-01-29 14:36:06 +00:00 by claunia · 3 comments

Originally created by @KvanTTT on GitHub (Nov 9, 2020).

Consider parsing the following string:

```md
<a href="https://github.com/KvanTTT/MarkConv">Html Link</a>
```

The parser returns a `ContainerInline` with the following children:

```
[0] = {HtmlInline} <a href="https://github.com/KvanTTT/MarkConv">
[1] = {LiteralInline} Html Link
[2] = {HtmlInline} </a>
```

However, I want to get not just a flat mix of `HtmlInline` and other Markdown objects, but an HTML tree structure like this:

```
HtmlElement (a, href="https://github.com/KvanTTT/MarkConv"):
    LiteralInline (Html Link)
```

The same applies to `HtmlBlock`:

```md
<details>
<summary>Title</summary>

Content

</details>
```

Should such a feature be included in the base library or implemented as an external extension?
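
To make the requested transformation concrete, here is a minimal, library-independent sketch in Python of folding a flat sequence of inline items into a tree with a stack. It does not use the Markdig API: the `("html", ...)` / `("literal", ...)` tuples are hypothetical stand-ins for `HtmlInline` and `LiteralInline`, and `HtmlElement`, `Literal`, and `fold` are names invented for illustration. The regexes assume simple double-quoted attributes only.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Literal:
    text: str


@dataclass
class HtmlElement:
    tag: str
    attrs: dict
    children: list = field(default_factory=list)


# Assumption: tags carry only double-quoted attributes, e.g. <a href="...">.
OPEN_TAG = re.compile(r'<([A-Za-z][\w-]*)((?:\s+[\w-]+="[^"]*")*)\s*>')
CLOSE_TAG = re.compile(r'</([A-Za-z][\w-]*)\s*>')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def fold(flat):
    """Fold a flat list of (kind, value) items into a list of tree nodes."""
    root = HtmlElement("#root", {})
    stack = [root]  # the innermost open element is stack[-1]
    for kind, value in flat:
        if kind == "literal":
            stack[-1].children.append(Literal(value))
            continue
        m_open = OPEN_TAG.fullmatch(value)
        m_close = CLOSE_TAG.fullmatch(value)
        if m_open:
            element = HtmlElement(m_open.group(1), dict(ATTR.findall(m_open.group(2))))
            stack[-1].children.append(element)
            stack.append(element)
        elif m_close and len(stack) > 1 and stack[-1].tag == m_close.group(1):
            stack.pop()  # closing tag matches the innermost open element
        else:
            # Unmatched or unrecognised markup is kept as raw text.
            stack[-1].children.append(Literal(value))
    return root.children


flat = [
    ("html", '<a href="https://github.com/KvanTTT/MarkConv">'),
    ("literal", "Html Link"),
    ("html", "</a>"),
]
tree = fold(flat)
```

With the input from the example above, `tree` holds a single `HtmlElement` for `a` whose only child is the literal `Html Link`. An element that is never closed simply keeps whatever children were collected before the input ended.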

@gagahpangeran commented on GitHub (Aug 1, 2021):

I am also interested in this feature.

Is there an extension that achieves this?


@MihaZupan commented on GitHub (Aug 1, 2021):

I would turn to the AngleSharp library for parsing the `HtmlInline`.


@KvanTTT commented on GitHub (Aug 1, 2021):

@gagahpangeran I've implemented such a parser in one of my projects; see [Parser](https://github.com/KvanTTT/MarkConv/blob/master/MarkConv/Parser.cs) and an example [test file](https://raw.githubusercontent.com/KvanTTT/MarkConv/master/MarkConv.Tests/Resources/Html.md) that is parsed correctly. It uses an ANTLR-based lexer and parser for HTML. You can extract this into your project and/or convert it into an extension.

@MihaZupan I tried different HTML-parsing libraries: [Html Agility Pack](https://html-agility-pack.net/) and AngleSharp. But they work poorly with HTML fragments and with invalid or unknown tags. Eventually, I decided to write my own HTML lexer/parser based on ANTLR. It works fine and is much more customizable.
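
For readers who want to see what tolerant fragment parsing involves, here is a rough sketch using Python's standard-library `html.parser`. It is not the ANTLR-based approach described above and not the MarkConv code. `Node`, `FragmentTreeBuilder`, and `parse_fragment` are hypothetical names. The sketch accepts unknown tags and ignores stray closing tags, but it does not special-case void elements such as `<br>`, which would stay open until their parent closes.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain text strings


class FragmentTreeBuilder(HTMLParser):
    """Builds a tree from an HTML fragment without requiring a full document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#fragment")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax (<x/>) adds a leaf and does not open a scope.
        self.stack[-1].children.append(Node(tag, attrs))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return
        # Otherwise it is a stray closing tag and is ignored.

    def handle_data(self, data):
        text = data.strip()
        if text:  # whitespace-only runs between tags are dropped
            self.stack[-1].children.append(text)


def parse_fragment(text):
    builder = FragmentTreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


fragment = "<details>\n<summary>Title</summary>\n\nContent\n\n</details>"
details = parse_fragment(fragment).children[0]
```

For the `<details>` fragment from the original post, `details` has two children: a `summary` node containing `Title`, followed by the text `Content`. The inner text is kept as a plain string here; in a Markdown setting it would still need to be parsed as Markdown.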
Reference: starred/markdig#413