Add more robust support for HTML5 anchor tags. #16

Merged
merged 1 commit into from
May 10, 2023

xStrom
Contributor

@xStrom xStrom commented May 4, 2023

This PR makes the anchor tag regex more robust so it can handle more of the HTML5 spec.

Old   a.*href=('([^']*?)'|"([^"]*?)")
New   ^(?i:a)(?:$|\s).*(?i:href)\s*=\s*('([^']*?)'|"([^"]*?)"|([^\s"'`=<>]+))

To break down the changes, in order:

Regex   Reasoning
^   Anchor the match at the start of the tag, so the a must be the first character of the tag name rather than a letter inside another name such as head.
(?i:a)   Tag and attribute names are case-insensitive, so the check must be case-insensitive too.
(?:$|\s)   Match only the a tag and not, say, article, by requiring the name to end here, the same way badTagnamesRE does.
(?i:href)   Attribute names get the same case-insensitive check.
\s*=\s*   The equals sign can be surrounded by zero or more whitespace characters.
|([^\s"'`=<>]+)   Attribute values don't have to be quoted if they contain no whitespace and none of the characters " ' ` = < >.
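
The breakdown above can be checked with a small standalone sketch. Both patterns are copied from this description. The `extractHref` helper and the sample inputs are hypothetical and exist only for illustration, and the sketch assumes the regex is applied to the text between `<` and `>`, which is not shown in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Old pattern, as quoted in the PR description.
	oldAnchorRE = regexp.MustCompile(`a.*href=('([^']*?)'|"([^"]*?)")`)
	// New pattern. It contains a backtick, so the raw string literal is
	// split around it and the backtick is spliced in separately.
	newAnchorRE = regexp.MustCompile(
		`^(?i:a)(?:$|\s).*(?i:href)\s*=\s*('([^']*?)'|"([^"]*?)"|([^\s"'` +
			"`" + `=<>]+))`)
)

// extractHref is a hypothetical helper (not part of the library).
// It reports whether re matches tag and returns the captured href value.
func extractHref(re *regexp.Regexp, tag string) (string, bool) {
	m := re.FindStringSubmatch(tag)
	if m == nil {
		return "", false
	}
	// Group 1 is the whole value including quotes. The groups after it hold
	// the bare value, one per quoting style; at most one of them is non-empty.
	for _, g := range m[2:] {
		if g != "" {
			return g, true
		}
	}
	return "", true // matched, but the value is empty (e.g. href="")
}

func main() {
	// Assumed inputs: tag contents without the surrounding angle brackets.
	tags := []string{
		`a href="https://example.com"`, // plain double-quoted value
		`A HREF='x'`,                   // upper-case tag and attribute names
		`a href = /path`,               // spaces around '=' and an unquoted value
		`head data-href="x"`,           // not an anchor, but contains "a...href="
		`article href="x"`,             // tag name merely starts with "a"
	}
	for _, tag := range tags {
		oldVal, oldOK := extractHref(oldAnchorRE, tag)
		newVal, newOK := extractHref(newAnchorRE, tag)
		fmt.Printf("%-32q old=(%q, %v) new=(%q, %v)\n", tag, oldVal, oldOK, newVal, newOK)
	}
}
```

With these inputs the old pattern misses the upper-case and unquoted forms and wrongly accepts `head` and `article`, while the new pattern accepts only the three real anchor tags.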

@k3a
Owner

k3a commented May 10, 2023

Thanks for the nice PR! I will make the other tag regexps case-insensitive as well in follow-up commits.

@k3a k3a merged commit a58537e into k3a:master May 10, 2023
@xStrom xStrom deleted the anchor branch May 10, 2023 21:23