Implement re_first for Selector #52
Comments
I've started implementing this change here: I'm also looking to see if I can improve the performance a bit, specifically in the extract_regex function found in the utils module:

    def extract_regex(regex, text):
        """Extract a list of unicode strings from the given text/encoding using the following policies:
        * if the regex contains a named group called "extract" that will be returned
        * if the regex contains multiple numbered groups, all those will be returned (flattened)
        * if the regex doesn't contain any group the entire regex matching is returned
        """
        if isinstance(regex, six.string_types):
            regex = re.compile(regex, re.UNICODE)
        try:
            strings = [regex.search(text).group('extract')]  # named group
        except:
            strings = regex.findall(text)  # full regex or numbered groups
        return [replace_entities(s, keep=['lt', 'amp']) for s in flatten(strings)]

The try-except means that regexes without a named extract group execute twice: once in search and again in findall. For large documents, or when the function is called often, this might become expensive. In addition, fetching all matches is unnecessary for a match-first function. I created another branch for benchmarking (using pytest-benchmark). I'll test there whether I can improve it.
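One way to picture the concern raised above is a first-match variant that runs the pattern only once. The following is a minimal sketch, not the code from the branch mentioned in the comment: the name `extract_regex_first` and its `default` parameter are hypothetical, it uses only the standard `re` module, and it leaves out the `replace_entities` and `flatten` steps of the original function.

```python
import re


def extract_regex_first(regex, text, default=None):
    """Hypothetical helper: return only the first extracted string.

    Mirrors the policies of extract_regex, but inspects the compiled
    pattern instead of relying on try/except, and never calls findall.
    """
    if isinstance(regex, str):
        regex = re.compile(regex, re.UNICODE)

    # A single search: the pattern runs once and stops at the first match.
    match = regex.search(text)
    if match is None:
        return default

    # Named group "extract" takes priority when the pattern defines it.
    if "extract" in regex.groupindex:
        return match.group("extract")

    # No groups at all: return the whole match.
    if regex.groups == 0:
        return match.group(0)

    # Numbered groups: first group of the first match, which is the first
    # item findall would yield after flattening ('' if it did not participate).
    return match.groups(default="")[0]
```

Checking `regex.groupindex` up front avoids the second pass for patterns without a named group, and `search` avoids collecting matches that a match-first call would discard. Whether this is measurably faster would still need the benchmark the comment mentions.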
Regarding

    class Selector(object):
        def re_first(self, regex, default=None):
            return SelectorList([self]).re_first(regex, default)
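The delegation in the snippet above can be tried out in isolation. This is a toy sketch under stated assumptions: `ToySelector` and `ToySelectorList` are hypothetical stand-ins written for illustration, not parsel's classes, and the list's `re_first` here is only a guess at the general shape (return the first regex result across the contained selectors).

```python
import re


class ToySelectorList(list):
    """Hypothetical stand-in for SelectorList: a plain list of selectors."""

    def re_first(self, regex, default=None):
        # Walk the selectors in order and return the first regex result.
        for selector in self:
            for value in selector.re(regex):
                return value
        return default


class ToySelector:
    """Hypothetical stand-in for Selector: wraps a piece of text."""

    def __init__(self, text):
        self.text = text

    def re(self, regex):
        # All matches in this selector's text.
        return re.findall(regex, self.text)

    def re_first(self, regex, default=None):
        # Same idea as the snippet above: wrap self in a one-element list
        # and reuse the list's implementation instead of duplicating it.
        return ToySelectorList([self]).re_first(regex, default)


single = ToySelector("price: 42 USD")
print(single.re_first(r"\d+"))
print(single.re_first(r"[a-z]+@", default="no match"))
```

The appeal of this approach is that the single-selector method stays a one-liner and cannot drift from the list behaviour; the cost is one short-lived list allocation per call.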
Fixed in #86
Copied from scrapy/scrapy#1907
Currently only SelectorList supports the re_first shortcut method. It would be useful to have this method in Selector too.