URL regex in tokenizer_exceptions.py too broad #840
The regex for the URL used for token_match during tokenization treats strings that are not URLs as URLs. For example, I'd expect the following text:

"This is the ticker symbol for Google, (NASDAQ:GOOG). Google's homepage is http://www.google.com"

to produce 'NASDAQ', ':', 'GOOG' as separate tokens while producing 'http://www.google.com' as a single token. Using the URL regex proposed by https://gist.github.com/dperini/729294 yields better results.
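As an illustration of the behavior at issue, here is a minimal sketch of overriding the tokenizer's token_match hook with a stricter pattern. The model name `en_core_web_sm` and the simplified scheme-requiring pattern are assumptions for the example, not details from this thread; the actual fix would use the full regex from the gist.

```python
# Hedged sketch: plug a stricter URL pattern into spaCy's token_match hook.
# URL_PATTERN is a simplified stand-in for the dperini gist regex, and the
# model name is an assumption, not something from this thread.
import re

import spacy

nlp = spacy.load("en_core_web_sm")

# Require an explicit http(s) scheme, unlike the overly broad default.
URL_PATTERN = re.compile(r"https?://\S+")
nlp.tokenizer.token_match = URL_PATTERN.match

doc = nlp("This is the ticker symbol for Google, (NASDAQ:GOOG). "
          "Google's homepage is http://www.google.com")
print([t.text for t in doc])
# Expected: 'http://www.google.com' survives as one token, while
# 'NASDAQ', ':', 'GOOG' are split by the ordinary punctuation rules.
```

Comments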
Thanks, we'll definitely patch this case. @oroszgy: Do you have thoughts on the regex in this gist, vs. the current one?
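To make that comparison concrete, a sketch like the following could contrast a permissive pattern with a scheme-requiring one on the reporter's example. Both patterns here are illustrative stand-ins, not the actual regexes under discussion.

```python
# Both patterns are simplified stand-ins: 'broad' mimics a pattern that
# accepts anything colon- or dot-joined, 'strict' demands a scheme.
import re

broad = re.compile(r"\S+\.\S+|\w+:\S+")
strict = re.compile(r"https?://\S+")

for text in ["NASDAQ:GOOG", "http://www.google.com"]:
    print(f"{text!r}: broad={bool(broad.fullmatch(text))}, "
          f"strict={bool(strict.fullmatch(text))}")
# 'NASDAQ:GOOG' matches the broad pattern (hence one token) but not the
# strict one; the real URL matches both.
```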
Looks good to me. (cf. If we touch the code, I think we should …)
While perhaps not rigorous, I have a data point on throughput. Processing the same full English Wikipedia dump on an AWS EC2 m4.16xlarge instance with the regex proposed above: throughput == 1,390,586 words/sec. Some details: …
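For reference, a rough way to measure a number like this: time the tokenizer alone over a corpus. Loading the Wikipedia dump is elided; the model name and the toy input are assumptions for the sketch.

```python
# Rough throughput-measurement sketch; in practice 'texts' would be the
# Wikipedia dump split into documents, which is elided here.
import time

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model name

def words_per_sec(texts):
    """Run only the tokenizer over texts and report tokens per second."""
    start = time.perf_counter()
    n_tokens = sum(len(doc) for doc in nlp.tokenizer.pipe(texts))
    return n_tokens / (time.perf_counter() - start)

print(f"{words_per_sec(['Google is at http://www.google.com'] * 10_000):,.0f} words/sec")
```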
@rappdw Do you mind making a PR on this? :)

@oroszgy Yes, I'll submit a PR.
Fix for Issue #840 - URL pattern too broad
Merged!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.