Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some text node has not position property #16

Closed
azu opened this issue May 12, 2021 · 12 comments
Closed

Some text node has not position property #16

azu opened this issue May 12, 2021 · 12 comments
Labels
🙅 no/wontfix This is not (enough of) an issue for this project 👎 phase/no Post cannot or will not be acted on

Comments

@azu
Copy link

azu commented May 12, 2021

Subject of the issue

When parsing some text, the text node has not position property.

Your environment

  • OS: macOs 10.15.7
  • Packages:
    "remark-gfm": "^1.0.0",
    "remark-parse": "^9.0.0",
    "unified": "^9.2.1"

Env:

Node: 14.16.1 - ~/.volta/tools/image/node/14.16.1/bin/node
Yarn: 1.22.10 - ~/.volta/tools/image/yarn/1.22.10/bin/yarn
npm: 6.14.12 - ~/.volta/tools/image/node/14.16.1/bin/npm

Steps to reproduce

Reproducing repo: https://github.com/azu/remark-gfm-no-position-bug

import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";

const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`http://user:password@host:port/path?key=value#fragment`);
console.log(JSON.stringify(ast, null, 4));

output:

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "http://user:password@host:port/path?key=value#fragment"
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 60,
                    "offset": 59
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 60,
            "offset": 59
        }
    }
}

Expected behavior

The text node should have position property.

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "http://user:password@host:port/path?key=value#fragment",
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 1,
                            "offset": 0
                        },
                        "end": {
                            "line": 1,
                            "column": 60,
                            "offset": 59
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 60,
                    "offset": 59
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 60,
            "offset": 59
        }
    }
}

Actual behavior

The text node has not position property.

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "http://user:password@host:port/path?key=value#fragment"
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 60,
                    "offset": 59
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 60,
            "offset": 59
        }
    }
}

Escaping + http:// is something wrong:
http://a

@wooorm
Copy link
Member

wooorm commented May 14, 2021

Unfortunately we can’t add them.

GH’s uses two algorithms to add autolinks (see this readme and some of the other issues there). One at parse time, another at “AST” time.

The one at parse time nicely adds positional info. The one at AST-time can’t infer the original position of certain URLs from text nodes, because the AST only has the starting and ending point, but no knowledge in this case:

- > a block quote in a list
    with https://example.com a URL

But, the fact that http://user:password@host:port/path?key=value#fragment is seen as a URL is incorrect 🤔

http://user:password@host:port/path?key=value#fragment
http://user:password@host:port/path?key=value#fragment

http://user:password@host:port/path?key=value#fragment
http://user:password@host:port/path?key=value#fragment

^-- it shouldn’t link

@azu azu changed the title Some text node has not position property Auto link's text node has not position property May 14, 2021
@azu
Copy link
Author

azu commented May 14, 2021

However, It seems that gfm plugin can infer the position when parsing valid http://~ URL string.

import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";

const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`This is https://example.com`);
console.log(JSON.stringify(ast, null, 4));

to

{
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {
                    "type": "text",
                    "value": "This is ",
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 1,
                            "offset": 0
                        },
                        "end": {
                            "line": 1,
                            "column": 9,
                            "offset": 8
                        }
                    }
                },
                {
                    "type": "link",
                    "title": null,
                    "url": "https://example.com",
                    "children": [
                        {
                            "type": "text",
                            "value": "https://example.com",
                            "position": {
                                "start": {
                                    "line": 1,
                                    "column": 9,
                                    "offset": 8
                                },
                                "end": {
                                    "line": 1,
                                    "column": 28,
                                    "offset": 27
                                }
                            }
                        }
                    ],
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 9,
                            "offset": 8
                        },
                        "end": {
                            "line": 1,
                            "column": 28,
                            "offset": 27
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 1,
                    "column": 28,
                    "offset": 27
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 1,
            "column": 28,
            "offset": 27
        }
    }
}

I think that it is a bug related to http://.

result of list + blockquote + autolink
import unified from "unified";
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";

const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`- > a block quote in a list
    with https://example.com a URL`);
console.log(JSON.stringify(ast, null, 4));

to

{
    "type": "root",
    "children": [
        {
            "type": "list",
            "ordered": false,
            "start": null,
            "spread": false,
            "children": [
                {
                    "type": "listItem",
                    "spread": false,
                    "checked": null,
                    "children": [
                        {
                            "type": "blockquote",
                            "children": [
                                {
                                    "type": "paragraph",
                                    "children": [
                                        {
                                            "type": "text",
                                            "value": "a block quote in a list",
                                            "position": {
                                                "start": {
                                                    "line": 1,
                                                    "column": 5,
                                                    "offset": 4
                                                },
                                                "end": {
                                                    "line": 1,
                                                    "column": 28,
                                                    "offset": 27
                                                }
                                            }
                                        }
                                    ],
                                    "position": {
                                        "start": {
                                            "line": 1,
                                            "column": 5,
                                            "offset": 4
                                        },
                                        "end": {
                                            "line": 1,
                                            "column": 28,
                                            "offset": 27
                                        }
                                    }
                                }
                            ],
                            "position": {
                                "start": {
                                    "line": 1,
                                    "column": 3,
                                    "offset": 2
                                },
                                "end": {
                                    "line": 1,
                                    "column": 28,
                                    "offset": 27
                                }
                            }
                        },
                        {
                            "type": "paragraph",
                            "children": [
                                {
                                    "type": "text",
                                    "value": "with ",
                                    "position": {
                                        "start": {
                                            "line": 2,
                                            "column": 5,
                                            "offset": 32
                                        },
                                        "end": {
                                            "line": 2,
                                            "column": 10,
                                            "offset": 37
                                        }
                                    }
                                },
                                {
                                    "type": "link",
                                    "title": null,
                                    "url": "https://example.com",
                                    "children": [
                                        {
                                            "type": "text",
                                            "value": "https://example.com",
                                            "position": {
                                                "start": {
                                                    "line": 2,
                                                    "column": 10,
                                                    "offset": 37
                                                },
                                                "end": {
                                                    "line": 2,
                                                    "column": 29,
                                                    "offset": 56
                                                }
                                            }
                                        }
                                    ],
                                    "position": {
                                        "start": {
                                            "line": 2,
                                            "column": 10,
                                            "offset": 37
                                        },
                                        "end": {
                                            "line": 2,
                                            "column": 29,
                                            "offset": 56
                                        }
                                    }
                                },
                                {
                                    "type": "text",
                                    "value": " a URL",
                                    "position": {
                                        "start": {
                                            "line": 2,
                                            "column": 29,
                                            "offset": 56
                                        },
                                        "end": {
                                            "line": 2,
                                            "column": 35,
                                            "offset": 62
                                        }
                                    }
                                }
                            ],
                            "position": {
                                "start": {
                                    "line": 2,
                                    "column": 5,
                                    "offset": 32
                                },
                                "end": {
                                    "line": 2,
                                    "column": 35,
                                    "offset": 62
                                }
                            }
                        }
                    ],
                    "position": {
                        "start": {
                            "line": 1,
                            "column": 1,
                            "offset": 0
                        },
                        "end": {
                            "line": 2,
                            "column": 35,
                            "offset": 62
                        }
                    }
                }
            ],
            "position": {
                "start": {
                    "line": 1,
                    "column": 1,
                    "offset": 0
                },
                "end": {
                    "line": 2,
                    "column": 35,
                    "offset": 62
                }
            }
        }
    ],
    "position": {
        "start": {
            "line": 1,
            "column": 1,
            "offset": 0
        },
        "end": {
            "line": 2,
            "column": 35,
            "offset": 62
        }
    }
}

@azu azu changed the title Auto link's text node has not position property Some text node has not position property May 14, 2021
@wooorm
Copy link
Member

wooorm commented May 18, 2021

However, It seems that gfm plugin can infer the position when parsing valid http://~ URL string.

This is exactly what I described in my previous comment!

I think that it is a bug related to http://.

I think it’s acceptable, because it’s impossible to solve.
GH algorithm 2 works on the AST, and in that phase it’s impossible to know the original positional info


The list + blockquote example was to illustrate, here is an actual working example: https://runkit.com/embed/tyohiws7jhem. There is no positional info on the link, because it has to use algorithm 2, and it would be hard to figure out from the AST information where that link was originally.

There is a bug: http://... should not turn into a link.

@azu
Copy link
Author

azu commented May 20, 2021

Thanks for the details.

The one at AST-time can’t infer the original position of certain URLs from text nodes, because the AST only has the starting and ending point, but no knowledge in this case

Does it mean that transforming AST to AST?
Previously, I've written a similar AST to AST parser.
This parser walks AST and parses the node's value for getting text offset. The offset is based on the node's starting point.

I expected that all node has position property.
If some node has not position, we need to always add if(!node.position) { return; /* can not handle */ } statement.
I feel it is an invalid node because it is the same text type but some nodes have not position property.
So I want to remove the invalid node. However, removing the node and will break another one.

Actually, remark@12 has position property on this example.
https://runkit.com/azu/60a5b5b63d9183001a1c5670

Is there a workaround for avoiding creating non-postion node?
It is blocker to update remark-parser, and I need to investigate it.

@azu
Copy link
Author

azu commented May 22, 2021

I have looked into this.
I think that it is possible, but the current missing some parts for implementation.

We may need following changes to transformGfmAutolinkLiterals.
This is pseudo-code diff for https://github.com/syntax-tree/mdast-util-gfm-autolink-literal/blob/515f945feda9ec57d53ab09dfd458a86e9ff8eb8/from-markdown.js#L81-L86

diff --git a/from-markdown.js b/from-markdown.js
index 8a24bcc..cc02eab 100644
--- a/from-markdown.js
+++ b/from-markdown.js
@@ -82,7 +82,22 @@ function findUrl($0, protocol, domain, path, match) {
     type: 'link',
     title: null,
     url: prefix + protocol + parts[0],
-    children: [{type: 'text', value: protocol + parts[0]}]
+    children: [{
+      type: 'text',
+      value: protocol + parts[0],
+      position: {
+        start: {
+          line: calcLineFromOffset(node.position.start.offset + match.index),
+          column: calcColumnFromOffset(node.position.start.offset + match.index),
+          offset: node.position.start.offset + match.index
+        },
+        end: {
+          line: calcLineFromOffset(node.position.start.offset + match.index + $0.length),
+          column: calcColumnFromOffset(node.position.start.offset + match.index + $0.length),
+          offset: node.position.start.offset + match.index + $0.length
+        }
+      }
+    }]
   }
 
   if (parts[1]) {

Lack of implementation:

  • mast-util-find-and-replace does not pass the matched node to replace function
    • It will needed to infer start offset correctly.
    • This is node in pseudo-code
  • mdast-util-from-markdown does not pass raw text(parsing text) to MdastExtension
    • It will needed to calculate line and column from offset
    • Or, get raw text from the node.
    • This is calcLineFromOffset and calcColumnFromOffset in pseudo-code

@azu
Copy link
Author

azu commented May 22, 2021

📝 Workaround: disable mdast-util-gfm-autolink-literal transform.

import unified from "unified";
// @ts-ignore
import autolinkLiteral from "mdast-util-gfm-autolink-literal/from-markdown";
// disable autolink transforms
autolinkLiteral.transforms = [];
import remarkGfm from "remark-gfm";
import remarkParse from "remark-parse";
const remark = unified().use(remarkParse).use(remarkGfm);
const ast = remark.parse(`http://user:password@host:port/path?key=value#fragment`);
console.log(JSON.stringify(ast, null, 4));

azu added a commit to textlint/textlint that referenced this issue May 22, 2021
azu added a commit to textlint/textlint that referenced this issue May 22, 2021
* feat(markdown-to-ast): update remark-parse@9

* test: update snapshots

* fix: support footnote [^1] again

* fix: create range from offset

* fix: treat BOM text

* chore: rename footnoteReference to FootnoteReference

* test: add broken example

* test: add broken example

* fix(markdown-to-ast): add a workaround to fix broken node

workaround for remarkjs/remark-gfm#16
@wooorm
Copy link
Member

wooorm commented May 22, 2021

Does it mean that transforming AST to AST?
Previously, I've written a similar AST to AST parser.
This parser walks AST and parses the node's value for getting text offset. The offset is based on the > node's starting point.

Given this markdown:

- > a block quote in a list
    with https://example.com a URL

The text node has this value: a block quote in a list\nwith https://example.com a URL.

It starts at 1:5, and ends at 2:35.

The markdown could look like this:

- > a block quote in a list
with https://example.com a URL

This:

- > a block quote in a list
                                  with https://example.com a URL

This:

:::some custom markdown structure
    a block quote in a list
    with https://example.com a URL
:::

How do you know where the link starts and ends?

I expected that all node has position property.
If some node has not position, we need to always add if(!node.position) { return; /* can not handle > */ } statement.
I feel it is an invalid node because it is the same text type but some nodes have not position property.
So I want to remove the invalid node. However, removing the node and will break another one.

The position field on nodes is optional: https://github.com/syntax-tree/unist#node.
It is valid, and it is heavily used when doing more complex tasks. For example, when processing MDX, the input is markdown-like (remark), which goes through to HTML (rehype), and then compiles to JavaScript (recma).
Several nodes in this process are added (one example), without positional information, because they were not in the original document.

Actually, remark@12 has position property on this example.

Yes it did, but it also didn’t match GFM. To properly match GFM, the two algorithms need to be used: one at parse time, one at “AST time”.

@github-actions

This comment has been minimized.

@wooorm
Copy link
Member

wooorm commented Aug 15, 2021

I fixed what seemed like the bug.
However, unfortunately I don’t see a way to “solve” this issue (text nodes w/o position) :(

@wooorm wooorm added 🙅 no/wontfix This is not (enough of) an issue for this project and removed 🐛 type/bug This is a problem 🙉 open/needs-info This needs some more info labels Aug 15, 2021
@github-actions github-actions bot added the 👎 phase/no Post cannot or will not be acted on label Aug 15, 2021
@azu
Copy link
Author

azu commented Aug 15, 2021

Thanks

However, unfortunately I don’t see a way to “solve” this issue (text nodes w/o position) :(

Would it be possible to provide a option that disable mdast-util-gfm-autolink-literal?
https://github.com/remarkjs/remark-gfm#unifieduseremarkgfm-options
If remark-gfm has the option, we can avoid to use a workaround.
(📝 Probably, this workaround will not work with Node.js exports filed)

@wooorm
Copy link
Member

wooorm commented Aug 15, 2021

I don’t see a reason for an option to turn all of mdast-util-gfm-autolink-literal off in mdast-util-gfm or remark-gfm. The whole point of these is that they match GFM. And the separate packages can be used and matched however you please (see § When to use this).

I somewhat see an option to turn algorithm2 off, which is what you do in the workaround, in mdast-util-gfm-autolink-literal, which would be a major (because the API has to change to accept options), but bubble through in mdast-util-gfm and remark-gfm as a minor release.

The reason is that algorithm2 does not work in comments, and thus that option would somewhat match some flavor of GFM. But not 100%, because email links do work in comments: https://gist.github.com/wooorm/076fd173c31ba6837f17591d5932476e.
so this option would not match comments.

I’m not convinced yet of such an option. @ChristianMurphy?
And if it would be added: a question would be how to name it.

(and I think your workaround will work fine with ESM!)

@ChristianMurphy
Copy link
Member

ChristianMurphy commented Aug 15, 2021

I don't really see the option as making much sense. 🤔
*-gfm plugins should, to the best of their ability, match GFM, having an option for "be less like GFM" seems to go against its goal. 🤔

Taking a step back, position on Node is optional. This is far from the only case where it can happen. I'd be curious to hear how it's being used, and discussing what alternative(s) exist for handling these cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🙅 no/wontfix This is not (enough of) an issue for this project 👎 phase/no Post cannot or will not be acted on
Development

No branches or pull requests

3 participants