Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested anchor tags translate into invalid markdown #25

Closed
kj800x opened this issue Aug 22, 2021 · 4 comments
Closed

Nested anchor tags translate into invalid markdown #25

kj800x opened this issue Aug 22, 2021 · 4 comments

Comments

@kj800x
Copy link

kj800x commented Aug 22, 2021

Technically I believe this is invalid HTML, since anchor tags should not be nested. That said, there's lots of HTML out there which isn't spec compliant so it'd be nice to support this. Based on my testing, browsers will stop the first link when the nested link starts, so in the example below, b would link to https://example.com and c would link to https://github.com but d would not be a link. I tested this behavior on both Chrome and Firefox and they were both consistent.

I'm happy to try and contribute a fix, although I'm not sure what the best way to approach the problem is. Perhaps it would be best to apply some sort of link-fixing transform to the html tree first before any of the markdown conversion starts? I was also looking at the a default translator and wondering if either trying to postprocess the child content to adjust links with some sort of regex or using childTranslators to provide a different implementation for any nested a tags, but since this whole issue is about the tree being invalid I think it would be pretty hard to fix inside the translator.

Perhaps this entire bug is out of scope for this library and I should be trying to look at another library to sanitize the HTML prior to feeding it to node-html-markdown

Raw Test HTML:

image

Resulting DOM as shown in the inspector:

image

Example:

a<a href="https://example.com">b<a href="https://github.com">c</a>d</a>e

Expected Output:

a[b](https://example.com)[c](https://github.com)de

Actual Output:

a[b[c](https://github.com)d](https://example.com)e

Which renders into garbage when put through a markdown renderer:

a[bcd](https://example.com)e

@nonara
Copy link
Collaborator

nonara commented Aug 23, 2021

Thanks for the report! We definitely don't want broken markdown.

In the interest of performance, adding the ability to essentially process child nodes as if they were a sibling would not be ideal in order to support out of spec html.

I've written a fix which makes use of the new childTranslators property of TranslatorConfig using the following:

export const aTagTranslatorConfig: TranslatorConfigObject = {
  // (postprocessing is now set to replace \r?\n+ with a single space 
  'br': { content: '\n', recurse: false },
  'hr': { content: '\n', recurse: false },
  'pre': defaultTranslators['pre'],
  'strong,b': defaultTranslators['strong,b'],
  'del,s,strike': defaultTranslators['del,s,strike'],
  'em,i': defaultTranslators['em,i'],
  'img': defaultTranslators['img']
}

This effectively means that all nodes not in that list will simply grab the text content, with no special processing.

So in your case, the output would become: a[bcd](https://example.com)e

That said, I can see where transform would be useful, especially in cases like yours. It would make sense to add the hook right into the software, so you didn't have to parse twice.

If you're interested in contributing, that's a good place to start!

Proposal

Add transformer property to NodeHtmlMarkdownOptions

Probable signature: transformer?: (srcFile: ElementNode) => ElementNode.

It would implement right after the parseHTML call in NodeHtmlMarkdown#translateWorker

I'm going to commit the fix I have now, and if you decide you'd like to add it, I'll be getting back to this next weekend to publish the current changes + another fix I need to do.

@nonara nonara closed this as completed in dcd6b20 Aug 23, 2021
@nonara
Copy link
Collaborator

nonara commented Aug 23, 2021

It just occurred to me that this is probably a parser issue.

Here's what astexplorer.net shows using parse5

image

It matches what your DOM inspection shows.

Update

Issue filed: taoqf/node-html-parser#144

I will have a look as soon as possible

@nonara
Copy link
Collaborator

nonara commented Sep 5, 2021

Quick update — submitted a PR for a fix to the parser. This will cause nodes to be parsed correctly, per OP's example, and will close the issue.

As I believe it adds some value, I will also add the transformer option.

@nonara
Copy link
Collaborator

nonara commented Sep 18, 2021

Fixed in v1.1.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants