[gelbooru] regex for extractor not allowing alternate parameter order in url. #2821

Merged 5 commits on Oct 21, 2022

@KJNeko (Contributor) commented Aug 13, 2022

The current regex does not match URLs in which the page, s, and id parameters appear in a different order.

Moving page=post and s=view into look-aheads placed right after the ? allows the parameters to appear in any order while still requiring both to be present.

URLs tested

1: https://gelbooru.com/index.php?page=post&s=view&id=7583325
2: https://gelbooru.com/index.php?page=post&s=view&id=7583325&tags=cat
3: https://gelbooru.com/index.php?id=7583325&page=post&s=view

URL 2 was obtained by searching for 'cat' and then clicking on an image. It seems Gelbooru embeds the search parameters in the post URL.

Old regex

(?:https?://)?(?:www\.)?gelbooru\.com/(?:index\.php)?\?page=post&s=view&id=(?P<post>\d+)

Results:
1: Match: https://gelbooru.com/index.php?page=post&s=view&id=7586232, Group post: 7586232
2: Match: https://gelbooru.com/index.php?page=post&s=view&id=7586232, Group post: 7586232
3: No match

New regex

(?:https?:\/\/)?(?:www\.)?gelbooru\.com\/(?:index\.php)?\?(?=.*page=post)(?=.*s=view).*id=(?P<post>\d+).*

Results:
1: Match: https://gelbooru.com/index.php?page=post&s=view&id=7586232, Group post: 7586232
2: Match: https://gelbooru.com/index.php?page=post&s=view&id=7582526&tags=cat, Group post: 7586232
3: Match: https://gelbooru.com/index.php?id=7586232&page=post&s=view, Group post: 7586232
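The comparison above can be reproduced with a short Python sketch. It runs both patterns against the three URLs from the "URLs tested" list; the helper name post_id and the use of re.match are assumptions made for illustration, not necessarily how the extractor applies its pattern.

```python
import re

# Old pattern: requires the exact order page, s, id.
OLD = re.compile(
    r"(?:https?://)?(?:www\.)?gelbooru\.com/(?:index\.php)?"
    r"\?page=post&s=view&id=(?P<post>\d+)"
)

# New pattern: look-aheads check for page=post and s=view anywhere after "?".
NEW = re.compile(
    r"(?:https?:\/\/)?(?:www\.)?gelbooru\.com\/(?:index\.php)?"
    r"\?(?=.*page=post)(?=.*s=view).*id=(?P<post>\d+).*"
)

URLS = [
    "https://gelbooru.com/index.php?page=post&s=view&id=7583325",
    "https://gelbooru.com/index.php?page=post&s=view&id=7583325&tags=cat",
    "https://gelbooru.com/index.php?id=7583325&page=post&s=view",
]


def post_id(pattern, url):
    """Hypothetical helper: return the 'post' group, or None if no match."""
    match = pattern.match(url)
    return match.group("post") if match else None


old_results = [post_id(OLD, url) for url in URLS]
new_results = [post_id(NEW, url) for url in URLS]

for number, url in enumerate(URLS, start=1):
    print(number, "old:", old_results[number - 1], "new:", new_results[number - 1])
```

With re.match, the old pattern still matches URL 2 because it only needs to match a prefix of the string; the trailing &tags=cat is simply left unconsumed.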

I was unable to determine whether the old regex was written with the intention of stripping additional arguments (such as &tags=cat) or whether that was just a side effect or a 'bonus' of the regex.

@thatfuckingbird (Contributor) commented

@mikf Can you take a look at this sometime? I can confirm this parameter order issue. I think SauceNAO and possibly some other sites sometimes return such rearranged URLs; at least, I encounter them semi-regularly.

@mikf merged commit 300bc03 into mikf:master on Oct 21, 2022
@mikf (Owner) commented Oct 21, 2022

Using regular expressions to "parse" URL query strings is generally a bad idea, as is using .* everywhere, which is why I kind of didn't want to merge this.
And then there is the "problem" that this only fixes the issue for post URLs, and only for Gelbooru.
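To illustrate the concern about .* and regex-based query parsing, here is a minimal sketch that reads the query string with Python's urllib.parse instead. The function name gelbooru_post_id and the pool_id parameter in the last URL are hypothetical, chosen only for this example; this is not how gallery-dl implements its extractors.

```python
import re
from urllib.parse import parse_qs, urlsplit

# The merged pattern, repeated here for contrast.
NEW = re.compile(
    r"(?:https?:\/\/)?(?:www\.)?gelbooru\.com\/(?:index\.php)?"
    r"\?(?=.*page=post)(?=.*s=view).*id=(?P<post>\d+).*"
)


def gelbooru_post_id(url):
    """Hypothetical helper: return the post ID string, or None."""
    if "://" not in url:
        url = "https://" + url  # the regex treats the scheme as optional
    parts = urlsplit(url)
    if parts.hostname not in ("gelbooru.com", "www.gelbooru.com"):
        return None
    if parts.path not in ("/", "/index.php"):
        return None
    query = parse_qs(parts.query)  # parameter order no longer matters
    if query.get("page") != ["post"] or query.get("s") != ["view"]:
        return None
    ids = query.get("id", [])
    if len(ids) == 1 and ids[0].isascii() and ids[0].isdigit():
        return ids[0]
    return None


reordered = "https://gelbooru.com/index.php?id=7583325&page=post&s=view"
# Assumed URL with a made-up pool_id parameter and no id parameter at all.
tricky = "https://gelbooru.com/index.php?page=post&s=view&pool_id=5"

print(gelbooru_post_id(reordered))
print(gelbooru_post_id(tricky))
print(NEW.match(tricky).group("post"))
```

For the second URL, the regex reports a post ID of 5 because .*id= also matches the tail of pool_id=, whereas the parsed query has no id key and the helper returns None.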
