
URL regex in tokenizer_exceptions.py too broad #840

Closed · rappdw opened this issue Feb 16, 2017 · 8 comments

@rappdw (Contributor) commented Feb 16, 2017

The URL regex used for token_match during tokenization treats strings that are not URLs as URLs. For example, I'd expect the following text:

"This is the ticker symbol for Google, (NASDAQ:GOOG). Google's homepage is http://www.google.com"

to produce 'NASDAQ', ':', 'GOOG' as separate tokens while producing 'http://www.google.com' as a single token.
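
A minimal reproduction sketch (spaCy 1.x API assumed; the 'en' model must be installed):

import spacy

nlp = spacy.load('en')
doc = nlp("This is the ticker symbol for Google, (NASDAQ:GOOG). "
          "Google's homepage is http://www.google.com")
print([t.text for t in doc])
# With the 1.6.0 regex, 'NASDAQ:GOOG' is (incorrectly) matched as a URL
# and kept as a single token.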

Using the URL regex proposed in https://gist.github.com/dperini/729294 yields better results:

import re

token_match = re.compile(
    r"^"
    # protocol identifier
    r"(?:(?:https?|ftp)://)"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    r"(?:"
    # IP address exclusion
    # private & local networks
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
    # IP address dotted notation octets
    # excludes loopback network 0.0.0.0
    # excludes reserved space >= 224.0.0.0
    # excludes network & broadcast addresses
    # (first & last IP address of each class)
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
    r"|"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    r")"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    r"$"
).match
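
As a quick sanity check of the examples above, with token_match compiled as shown:

assert token_match("http://www.google.com")        # real URL matches
assert token_match("NASDAQ:GOOG") is None          # no protocol, so no URL match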

Your Environment

  • Operating System: OSX & Linux
  • Python Version Used: 3.6.0
  • spaCy Version Used: 1.6.0
  • Environment Information:
@honnibal (Member) commented

Thanks, we'll definitely patch this case.

@oroszgy : Do you have thoughts on the regex in this gist, vs. the current one?

@oroszgy (Contributor) commented Feb 20, 2017

Looks good to me (cf. the dperini column here). However, I'd make the protocol identifier optional; my assumption is that things like "google.com" can easily appear in less formal text (see the sketch after the list below).

If we touch the code, I think we should:

  • add new test cases based on the link above
  • check how the tokenizer's throughput changes
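
A minimal sketch of the protocol-optional variant (names are illustrative and the IP-address branch is omitted for brevity; this is not the actual patch):

import re

token_match_optional = re.compile(
    r"^"
    # protocol identifier, now optional (note the trailing '?')
    r"(?:(?:https?|ftp)://)?"
    # user:pass authentication
    r"(?:\S+(?::\S*)?@)?"
    # host name
    r"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
    # domain name
    r"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
    # TLD identifier
    r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
    # port number
    r"(?::\d{2,5})?"
    # resource path
    r"(?:/\S*)?"
    r"$"
).match

# New test cases in the spirit of the first bullet:
assert token_match_optional("google.com")               # bare hostname now matches
assert token_match_optional("http://www.google.com")    # full URL still matches
assert token_match_optional("NASDAQ:GOOG") is None      # still rejected: lowercase-only
                                                        # host class and no dotted TLD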

@rappdw (Contributor, Author) commented Feb 24, 2017

While perhaps not rigorous, I have a data point on throughput.

Processing the same full English Wikipedia dump on an AWS EC2 m4.16xlarge instance:

with the regex proposed above: throughput == 1,390,586 words/sec
with the 1.6.0 regex: throughput == 1,379,929 words/sec

Some details:
27 individual Python processes running concurrently, each running a multithreaded generator
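
For reproducing this kind of measurement, a single-process timing sketch might look like the following (illustrative only; the run above used 27 concurrent processes over the full dump; spaCy 1.x API assumed):

import time
import spacy

nlp = spacy.load('en')

def words_per_sec(texts):
    # Tokenize raw strings and report tokenizer throughput in words/sec.
    start = time.perf_counter()
    n_words = sum(len(nlp.tokenizer(text)) for text in texts)
    return n_words / (time.perf_counter() - start)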

@oroszgy (Contributor) commented Mar 2, 2017

@rappdw Do you mind making a PR on this? :)

@rappdw (Contributor, Author) commented Mar 6, 2017

@oroszgy Yes, I'll submit a PR.

@rappdw (Contributor, Author) commented Mar 9, 2017

@oroszgy #879 submitted

honnibal added a commit that referenced this issue Mar 9, 2017
@honnibal (Member) commented Mar 9, 2017

Merged!

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018