
URL regex in tokenizer_exceptions.py too broad #840

Closed · rappdw opened this issue Feb 16, 2017 · 8 comments

@rappdw (Contributor) commented Feb 16, 2017

The URL regex used for token_match during tokenization treats strings that are not URLs as URLs. For example, I'd expect the following text:

"This is the ticker symbol for Google, (NASDAQ:GOOG). Google's homepage is http://www.google.com"

to produce 'NASDAQ', ':', 'GOOG' as separate tokens while producing 'http://www.google.com' as a single token.
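
A minimal reproduction sketch (spaCy 1.x API assumed; the 'en' model must be installed):

import spacy

nlp = spacy.load('en')
doc = nlp("This is the ticker symbol for Google, (NASDAQ:GOOG). "
          "Google's homepage is http://www.google.com")
print([t.text for t in doc])
# With the 1.6.0 regex, 'NASDAQ:GOOG' is (incorrectly) matched as a URL
# and kept as a single token.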

Using the URL regex proposed in https://gist.github.com/dperini/729294 yields better results:

import re

token_match = re.compile(
    r"^"
    # protocol identifier
    r"(?:(?:https?|ftp)://)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    r"$"
).match
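
As a quick sanity check of the examples above, with token_match compiled as shown:

assert token_match("http://www.google.com")        # real URL matches
assert token_match("NASDAQ:GOOG") is None          # no protocol, so no URL match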

Your Environment

  • Operating System: OSX & Linux
  • Python Version Used: 3.6.0
  • spaCy Version Used: 1.6.0
  • Environment Information:
@honnibal (Member) commented

Thanks, we'll definitely patch this case.

@oroszgy : Do you have thoughts on the regex in this gist, vs. the current one?

@oroszgy (Contributor) commented Feb 20, 2017

Looks good to me (cf. the dperini column here). However, I'd make the protocol identifier optional; my assumption is that things like "google.com" can easily appear in less formal text (see the sketch after the list below).

If we touch the code, I think we should:

  • add new test cases based on the link above
  • check how the tokenizer's throughput changes
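
A minimal sketch of the protocol-optional variant (names are illustrative and the IP-address branch is omitted for brevity; this is not the actual patch):

import re

token_match_optional = re.compile(
    r"^"
    # protocol identifier, now optional (note the trailing '?')
    r"(?:(?:https?|ftp)://)?"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    r"$"
).match

# New test cases in the spirit of the first bullet:
assert token_match_optional("google.com")               # bare hostname now matches
assert token_match_optional("http://www.google.com")    # full URL still matches
assert token_match_optional("NASDAQ:GOOG") is None      # still rejected: lowercase-only
                                                        # host class and no dotted TLD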

@rappdw (Contributor, Author) commented Feb 24, 2017

While perhaps not rigorous, I have a data point on throughput.

Processing the same full English Wikipedia dump on an AWS EC2 m4.16xlarge instance:

with the regex proposed above: throughput == 1,390,586 words/sec
with the 1.6.0 regex: throughput == 1,379,929 words/sec

Some details:
27 individual Python processes running concurrently, each running a multithreaded generator
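
For reproducing this kind of measurement, a single-process timing sketch might look like the following (illustrative only; the run above used 27 concurrent processes over the full dump; spaCy 1.x API assumed):

import time
import spacy

nlp = spacy.load('en')

def words_per_sec(texts):
    # Tokenize raw strings and report tokenizer throughput in words/sec.
    start = time.perf_counter()
    n_words = sum(len(nlp.tokenizer(text)) for text in texts)
    return n_words / (time.perf_counter() - start)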

@oroszgy (Contributor) commented Mar 2, 2017

@rappdw Do you mind making a PR on this? :)

@rappdw (Contributor, Author) commented Mar 6, 2017

@oroszgy Yes, I'll submit a PR.

@rappdw (Contributor, Author) commented Mar 9, 2017

@oroszgy #879 submitted

honnibal added a commit that referenced this issue Mar 9, 2017
@honnibal (Member) commented Mar 9, 2017

Merged!

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018