Nested anchor tags translate into invalid markdown #25

kj800x · 2021-08-22T21:49:05Z

Technically I believe this is invalid HTML, since anchor tags should not be nested. That said, there's lots of HTML out there which isn't spec compliant so it'd be nice to support this. Based on my testing, browsers will stop the first link when the nested link starts, so in the example below, b would link to https://example.com and c would link to https://github.com but d would not be a link. I tested this behavior on both Chrome and Firefox and they were both consistent.

I'm happy to try and contribute a fix, although I'm not sure what the best way to approach the problem is. Perhaps it would be best to apply some sort of link-fixing transform to the html tree first before any of the markdown conversion starts? I was also looking at the a default translator and wondering if either trying to postprocess the child content to adjust links with some sort of regex or using childTranslators to provide a different implementation for any nested a tags, but since this whole issue is about the tree being invalid I think it would be pretty hard to fix inside the translator.

Perhaps this entire bug is out of scope for this library and I should be trying to look at another library to sanitize the HTML prior to feeding it to node-html-markdown

Raw Test HTML:

Resulting DOM as shown in the inspector:

Example:

a<a href="https://example.com">b<a href="https://github.com">c</a>d</a>e

Expected Output:

a[b](https://example.com)[c](https://github.com)de

Actual Output:

a[b[c](https://github.com)d](https://example.com)e

Which renders into garbage when put through a markdown renderer:

a[bcd](https://example.com)e

The text was updated successfully, but these errors were encountered:

nonara · 2021-08-23T01:57:51Z

Thanks for the report! We definitely don't want broken markdown.

In the interest of performance, adding the ability to essentially process child nodes as if they were a sibling would not be ideal in order to support out of spec html.

I've written a fix which makes use of the new childTranslators property of TranslatorConfig using the following:

export const aTagTranslatorConfig: TranslatorConfigObject = {
  // (postprocessing is now set to replace \r?\n+ with a single space 
  'br': { content: '\n', recurse: false },
  'hr': { content: '\n', recurse: false },
  'pre': defaultTranslators['pre'],
  'strong,b': defaultTranslators['strong,b'],
  'del,s,strike': defaultTranslators['del,s,strike'],
  'em,i': defaultTranslators['em,i'],
  'img': defaultTranslators['img']
}

This effectively means that all nodes not in that list will simply grab the text content, with no special processing.

So in your case, the output would become: a[bcd](https://example.com)e

That said, I can see where transform would be useful, especially in cases like yours. It would make sense to add the hook right into the software, so you didn't have to parse twice.

If you're interested in contributing, that's a good place to start!

Proposal

Add transformer property to NodeHtmlMarkdownOptions

Probable signature: transformer?: (srcFile: ElementNode) => ElementNode.

It would implement right after the parseHTML call in NodeHtmlMarkdown#translateWorker

I'm going to commit the fix I have now, and if you decide you'd like to add it, I'll be getting back to this next weekend to publish the current changes + another fix I need to do.

nonara · 2021-08-23T03:20:19Z

It just occurred to me that this is probably a parser issue.

Here's what astexplorer.net shows using parse5

It matches what your DOM inspection shows.

Update

Issue filed: taoqf/node-html-parser#144

I will have a look as soon as possible

nonara · 2021-09-05T20:51:34Z

Quick update — submitted a PR for a fix to the parser. This will cause nodes to be parsed correctly, per OP's example, and will close the issue.

As I believe it adds some value, I will also add the transformer option.

nonara · 2021-09-18T17:57:04Z

Fixed in v1.1.2

nonara closed this as completed in dcd6b20 Aug 23, 2021

nonara added the awaiting-upstream-fix label Aug 23, 2021

nonara reopened this Aug 23, 2021

nonara mentioned this issue Aug 24, 2021

Nested A tags parsed improperly taoqf/node-html-parser#144

Closed

nonara mentioned this issue Sep 6, 2021

Fix 144 taoqf/node-html-parser#148

Merged

nonara closed this as completed in bdc7caa Sep 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nested anchor tags translate into invalid markdown #25

Nested anchor tags translate into invalid markdown #25

kj800x commented Aug 22, 2021

nonara commented Aug 23, 2021

nonara commented Aug 23, 2021 •

edited

Loading

nonara commented Sep 5, 2021 •

edited

Loading

nonara commented Sep 18, 2021

Nested anchor tags translate into invalid markdown #25

Nested anchor tags translate into invalid markdown #25

Comments

kj800x commented Aug 22, 2021

Raw Test HTML:

Resulting DOM as shown in the inspector:

Example:

Expected Output:

Actual Output:

nonara commented Aug 23, 2021

Proposal

nonara commented Aug 23, 2021 • edited Loading

Update

nonara commented Sep 5, 2021 • edited Loading

nonara commented Sep 18, 2021

nonara commented Aug 23, 2021 •

edited

Loading

nonara commented Sep 5, 2021 •

edited

Loading