-
-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nested anchor tags translate into invalid markdown #25
Comments
Thanks for the report! We definitely don't want broken markdown. In the interest of performance, adding the ability to essentially process child nodes as if they were a sibling would not be ideal in order to support out of spec html. I've written a fix which makes use of the new export const aTagTranslatorConfig: TranslatorConfigObject = {
// (postprocessing is now set to replace \r?\n+ with a single space
'br': { content: '\n', recurse: false },
'hr': { content: '\n', recurse: false },
'pre': defaultTranslators['pre'],
'strong,b': defaultTranslators['strong,b'],
'del,s,strike': defaultTranslators['del,s,strike'],
'em,i': defaultTranslators['em,i'],
'img': defaultTranslators['img']
} This effectively means that all nodes not in that list will simply grab the text content, with no special processing. So in your case, the output would become: That said, I can see where transform would be useful, especially in cases like yours. It would make sense to add the hook right into the software, so you didn't have to parse twice. If you're interested in contributing, that's a good place to start! ProposalAdd Probable signature: It would implement right after the I'm going to commit the fix I have now, and if you decide you'd like to add it, I'll be getting back to this next weekend to publish the current changes + another fix I need to do. |
It just occurred to me that this is probably a parser issue. Here's what It matches what your DOM inspection shows. UpdateIssue filed: taoqf/node-html-parser#144 I will have a look as soon as possible |
Quick update — submitted a PR for a fix to the parser. This will cause nodes to be parsed correctly, per OP's example, and will close the issue. As I believe it adds some value, I will also add the transformer option. |
Fixed in v1.1.2 |
Technically I believe this is invalid HTML, since anchor tags should not be nested. That said, there's lots of HTML out there which isn't spec compliant so it'd be nice to support this. Based on my testing, browsers will stop the first link when the nested link starts, so in the example below,
b
would link to https://example.com andc
would link to https://github.com butd
would not be a link. I tested this behavior on both Chrome and Firefox and they were both consistent.I'm happy to try and contribute a fix, although I'm not sure what the best way to approach the problem is. Perhaps it would be best to apply some sort of link-fixing transform to the html tree first before any of the markdown conversion starts? I was also looking at the
a
default translator and wondering if either trying topostprocess
the child content to adjust links with some sort of regex or usingchildTranslators
to provide a different implementation for any nesteda
tags, but since this whole issue is about the tree being invalid I think it would be pretty hard to fix inside the translator.Perhaps this entire bug is out of scope for this library and I should be trying to look at another library to sanitize the HTML prior to feeding it to
node-html-markdown
Raw Test HTML:
Resulting DOM as shown in the inspector:
Example:
Expected Output:
Actual Output:
Which renders into garbage when put through a markdown renderer:
a[bcd](https://example.com)e
The text was updated successfully, but these errors were encountered: