Implement re_first for Selector #52

Tethik · 2016-08-08T10:24:46Z

Copied from scrapy/scrapy#1907

Currently only SelectorList supports the re_first shortcut method. It would be useful to have this method in Selector too.

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).re_first
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'Selector' object has no attribute 're_first'

Tethik · 2016-08-08T18:58:23Z

I've started implementing this change here:
https://github.com/Tethik/parsel/tree/re_first_for_selector

I'm also looking to see if I can improve the performance a bit. Specifically this extract_regex function found in the utils module.

def extract_regex(regex, text):
    """Extract a list of unicode strings from the given text/encoding using the following policies:
    * if the regex contains a named group called "extract" that will be returned
    * if the regex contains multiple numbered groups, all those will be returned (flattened)
    * if the regex doesn't contain any group the entire regex matching is returned
    """
    if isinstance(regex, six.string_types):
        regex = re.compile(regex, re.UNICODE)

    try:
        strings = [regex.search(text).group('extract')]   # named group
    except:
        strings = regex.findall(text)    # full regex or numbered groups
    return [replace_entities(s, keep=['lt', 'amp']) for s in flatten(strings)]

The try-except means that regexes without named extract groups are going to execute twice. For large documents or when run quite often, this might become expensive. In addition, fetching all matches is unnecessary for running a match-first function.
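To illustrate the point, here is a minimal sketch of a single-pass variant, not the code from either branch. It checks `regex.groupindex` up front instead of relying on the exception, and yields matches lazily so a match-first caller can stop early. The names `iter_regex_matches` and `first_regex_match` are hypothetical, and the `replace_entities` post-processing is left out to keep the example stdlib-only.

```python
import re


def iter_regex_matches(regex, text):
    """Lazily yield matched strings, scanning the text only once.

    Hypothetical helper: follows the same three policies as extract_regex,
    but without the entity replacement step.
    """
    if isinstance(regex, str):
        regex = re.compile(regex, re.UNICODE)

    if "extract" in regex.groupindex:
        # Named group "extract": only the first match is used,
        # as in the original function.
        match = regex.search(text)
        if match is not None:
            yield match.group("extract")
        return

    for match in regex.finditer(text):
        if regex.groups == 0:
            # No groups: the whole match is returned.
            yield match.group(0)
        else:
            # Numbered groups, flattened; default="" mirrors what
            # findall returns for groups that did not participate.
            for group in match.groups(default=""):
                yield group


def first_regex_match(regex, text, default=None):
    """Return the first match, or default, without collecting the rest."""
    return next(iter_regex_matches(regex, text), default)
```

With this shape, a regex without an "extract" group is run over the text once, and `first_regex_match` stops at the first hit instead of building the full list.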

I created another branch for benchmarking (using pytest-benchmark). I'll test there if I can improve it.
https://github.com/Tethik/parsel/tree/re_benchmark_tests

starrify · 2017-05-14T00:38:16Z

Regarding Selector.re_first / Selector.extract_first, is there any obvious disadvantage if we simply use something like this:

class Selector(object):
    def re_first(self, regex, default=None):
        return SelectorList([self]).re_first(regex, default)
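As a rough way to reason about that question, the delegation can be tried out on stand-in classes. `MiniSelector` and `MiniSelectorList` below are hypothetical toys, not parsel's real classes; they only assume that a selector exposes `re()` returning a list and that the list type implements `re_first` by taking the first result.

```python
import re


class MiniSelectorList(list):
    """Stand-in for SelectorList (hypothetical, for illustration only)."""

    def re_first(self, regex, default=None):
        # Return the first match from the first selector that has one.
        for selector in self:
            for match in selector.re(regex):
                return match
        return default


class MiniSelector:
    """Stand-in for Selector (hypothetical, for illustration only)."""

    def __init__(self, text):
        self.text = text

    def re(self, regex):
        # Collects every match, like a list-returning re() would.
        return re.findall(regex, self.text)

    def re_first(self, regex, default=None):
        # The delegation proposed above: wrap self in a one-element list.
        return MiniSelectorList([self]).re_first(regex, default)
```

In this toy version the only visible costs are one temporary list per call and the fact that `re()` still collects every match before the first one is returned; whether that matters for the real classes would need measuring.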

redapple · 2017-05-17T11:21:10Z

Fixed in #86

kmike mentioned this issue Aug 8, 2016

'Selector' object has no attribute 're_first' scrapy/scrapy#1907

Closed

Tethik mentioned this issue Aug 18, 2016

SelectorList re_first and regex optimizations #55

Closed

starrify mentioned this issue May 14, 2017

[MRG+1] Added: parsel.Selector.re_first #86

Merged

redapple closed this as completed May 17, 2017

barrio mentioned this issue Apr 30, 2024

Parsel import causes crash #294

Closed
