PipeTableParser strips opening and ending characters #658

Closed
opened 2026-01-29 14:42:16 +00:00 by claunia · 9 comments

Originally created by @doggy8088 on GitHub (Feb 22, 2024).

[Here is my code](https://share.linqpad.net/ugvsf9ot.linq), which can be run under LINQPad:

void Main()
{
	var builder = new MarkdownPipelineBuilder();
	
	var pt = new PipeTableExtension();
	pt.Setup(builder);

	var pipeline = builder.Build();
	
	var markdownText = GetText();

	var document = Markdown.Parse(markdownText, pipeline);

	var blocks = document.ToList();
	
	foreach (var item in blocks)
	{
		string blockText = ExtractText(markdownText, item);
		blockText.Pre().Dump("Parsed Output");
	}
}

public static string ExtractText(string text, Block item)
{
	var pos = item.ToPositionText(); // $295, 0, 13301-13375

	// parse "$295, 0, 13301-13375" into lines, columns, range variables
	var line = pos.Split(",")[0].TrimStart('$').Trim();
	var column = pos.Split(",")[1].Trim();

	var range = pos.Split(",")[2].Trim();
	var start = range.Split("-")[0];
	var end = range.Split("-")[1];

	var blockText = text.Substring(int.Parse(start), int.Parse(end) - int.Parse(start) + 1);
	return blockText;
}

string GetText() => """

| Attributes               | Details                                           |
|:---                      |:---                                               |
| `<docs-card-container>`  | All cards must be nested inside a container       |
| `title`                  | Card title                                        |
| card body contents       | Anything between `<docs-card>` and `</docs-card>` |
| `link`                   | (Optional) Call to Action link text               |
| `href`                   | (Optional) Call to Action link href               |

""";

When I run this code, the extracted text is missing the table's opening and ending characters. This looks like a bug.

claunia added the bug and PR Welcome! labels 2026-01-29 14:42:17 +00:00

@xoofx commented on GitHub (Feb 29, 2024):

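It might be a bug in the position information. Not all elements are necessarily tested for their text position, especially non-CommonMark ones such as tables.

That being said, this `ExtractText` function feels like a complicated way to get the line/column/span information. Not sure why you are not using the underlying properties (`Span`, `Line`, `Column`) defined [here](https://github.com/xoofx/markdig/blob/201aa4ef738e5faa1bee35592a665b1bdf0fd90e/src/Markdig/Syntax/MarkdownObject.cs#L96-L108):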

public static string ExtractText(string text, Block item)
{
	var pos = item.ToPositionText(); // $295, 0, 13301-13375

	// parse "$295, 0, 13301-13375" into lines, columns, range variables
	var line = pos.Split(",")[0].TrimStart('$').Trim();
	var column = pos.Split(",")[1].Trim();

	var range = pos.Split(",")[2].Trim();
	var start = range.Split("-")[0];
	var end = range.Split("-")[1];

	var blockText = text.Substring(int.Parse(start), int.Parse(end) - int.Parse(start) + 1);
	return blockText;
}

@doggy8088 commented on GitHub (Jul 2, 2024):

@xoofx Yes, you're right. This is so much easier.

public static string ExtractText(string text, Block item)
{
    return text.Substring(item.Span.Start, item.Span.Length);
}

But the span of `Markdig.Extensions.Tables.Table` is still missing the first and last characters of the table. Here is my workaround for this issue:

public static string ExtractText(string text, Markdig.Syntax.Block item)
{
	var start = item.Span.Start;
	var len = item.Span.Length;

	// Workaround: the span of Markdig.Extensions.Tables.Table is missing
	// the first and last characters, so widen it by one on each side.
	if (item is Markdig.Extensions.Tables.Table)
	{
		start--;
		len += 2;
	}

	return text.Substring(start, len);
}
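
A more defensive variant of the workaround above, as a minimal sketch: it widens the span only when the neighboring character really is a `|`, so it would keep working if the span were ever corrected upstream or if the table has no outer pipes. `PipeSpanHelper` and `ExtractWithOuterPipes` are hypothetical names and not part of Markdig. The method takes plain inclusive offsets, which are assumed to come from `item.Span.Start` and `item.Span.End`.

```csharp
using System;

public static class PipeSpanHelper
{
    // Hypothetical helper (not a Markdig API): returns text[start..end] (inclusive),
    // widened by one character on each side only if that neighbor is a pipe.
    public static string ExtractWithOuterPipes(string text, int start, int end)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        if (start < 0 || end >= text.Length || start > end)
            throw new ArgumentOutOfRangeException(nameof(start));

        // Include the leading pipe if the span stopped just after it.
        if (start > 0 && text[start - 1] == '|') start--;

        // Include the trailing pipe if the span stopped just before it.
        if (end < text.Length - 1 && text[end + 1] == '|') end++;

        return text.Substring(start, end - start + 1);
    }
}
```

For a table block, the call would be `PipeSpanHelper.ExtractWithOuterPipes(text, item.Span.Start, item.Span.End)`; other blocks can keep using `text.Substring(item.Span.Start, item.Span.Length)`.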

@doggy8088 commented on GitHub (Jul 2, 2024):

As far as I know, a `GridTable` looks like this:

+-------------+-------------+
| Header 1    | Header 2    |
| ----------- | ----------- |
| Row 1 Col 1 | Row 1 Col 2 |
| Row 1 Col 1 |             |
+-------------+-------------+

And a `PipeTable` should look like this:

| Header 1    | Header 2    |
| ----------- | ----------- |
| Row 1 Col 1 | Row 1 Col 2 |
| Row 2 Col 1 | Row 2 Col 2 |

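In [GridTableParser.cs](https://github.com/xoofx/markdig/blob/1a1bbecc467a800dd6b39e68825df50309f6065c/src/Markdig/Extensions/Tables/GridTableParser.cs#L15-L16), `OpeningCharacters` is `+`.

But in [PipeTableBlockParser.cs](https://github.com/xoofx/markdig/blob/1a1bbecc467a800dd6b39e68825df50309f6065c/src/Markdig/Extensions/Tables/PipeTableBlockParser.cs#L25-L26), `OpeningCharacters` is `-`. Why `-`?

The PipeTable code is still too complicated for me, and I haven't been able to find the bug.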


@xoofx commented on GitHub (Jul 2, 2024):

> In the GridTableParser.cs, the OpeningCharacters is `+`.

Because this is coming from https://pandoc.org/MANUAL.html#extension-grid_tables

> But in PipeTableBlockParser.cs, the OpeningCharacters is `-`. Why `-`?

Because this is coming from GitHub behavior and also https://pandoc.org/MANUAL.html#extension-grid_tables


@doggy8088 commented on GitHub (Jul 2, 2024):

Do you mean this format?

- | Header 1    | Header 2    |
  | ----------- | ----------- |
  | Row 1 Col 1 | Row 1 Col 2 |
  | Row 2 Col 1 | Row 2 Col 2 |
  • Header 1 Header 2
    Row 1 Col 1 Row 1 Col 2
    Row 2 Col 1 Row 2 Col 2

@xoofx commented on GitHub (Jul 2, 2024):

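As explained in the comment in PipeTableBlockParser [here](https://github.com/xoofx/markdig/blob/1a1bbecc467a800dd6b39e68825df50309f6065c/src/Markdig/Extensions/Tables/PipeTableBlockParser.cs#L11-L17), it is there to bypass list parsing (a list item can start with `-`) for a table such as: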

a | b
- | -
0 | 1

which is not supported by GitHub but was supported by pandoc. See the comparison [here](https://babelmark.github.io/?text=a+%7C+b%0A-+%7C+-%0A0+%7C+1).

The parser for pipe tables is more complicated because we can only detect a table after we have processed a paragraph. That's why it is an inline parser and not a block parser.


@doggy8088 commented on GitHub (Jul 2, 2024):

I never knew that. I always thought it was a block parser. Can you implement another, block-based PipeTable parser? I didn't know there was a scenario for inline usage. At least I have never used it this way. 😅


@xoofx commented on GitHub (Jul 2, 2024):

> I never knew that. I always thought it was a block parser. Can you implement another, block-based PipeTable parser? I didn't know there was a scenario for inline usage. At least I have never used it this way. 😅

It is not an inline usage. To parse a "block" pipe table, we can only rely on `|`, because we can only tell whether a paragraph is actually a table once we have parsed its content. That means that:

a | b
- | -
0 | 1

is initially parsed as a paragraph, because when parsing `a` we don't yet know that it is actually a table (e.g. a backtick code span such as `a | b` would escape the pipe, so the line would not be a table row).

That's why the pipe table parser is so complicated: we treat `|` as a delimiter (similar to `*` or `_` for emphasis), and from there we try to rebuild a table.

A naive implementation could have said "I'm just going to split the line by `|`", but that's not the solution that was taken.
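
To illustrate why splitting on `|` is not enough, here is a small self-contained sketch. It is not how Markdig parses pipe tables (as described above, Markdig treats `|` as an inline delimiter and rebuilds the table afterwards). `PipeSplitDemo` and both methods are hypothetical names, and the second method is itself a simplification that only understands single-backtick code spans.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PipeSplitDemo
{
    // Naive approach: every '|' is treated as a cell boundary,
    // including pipes that sit inside a code span.
    public static string[] NaiveSplit(string row) => row.Split('|');

    // Slightly less naive: ignore '|' while inside a single-backtick code span.
    // Simplification: no support for multi-backtick spans or backslash escapes.
    public static List<string> SplitOutsideCode(string row)
    {
        var cells = new List<string>();
        var current = new StringBuilder();
        bool inCode = false;

        foreach (char c in row)
        {
            if (c == '`') inCode = !inCode;

            if (c == '|' && !inCode)
            {
                // A pipe outside code closes the current cell.
                cells.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        cells.Add(current.ToString());
        return cells;
    }
}
```

For the row `` `a | b` | c ``, `NaiveSplit` yields three pieces and cuts the code span in half, while `SplitOutsideCode` yields the two intended cells. Handling every inline construct correctly is what pushes a real implementation toward the inline parser.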


@doggy8088 commented on GitHub (Jul 3, 2024):

Can you take a look at why PipeTable is missing 2 characters? I'd like to fix the bug, but I can't find where the position info is set.

Reference: starred/markdig#658