spaCy tokenizer does not split correctly tokens separated by a slash (/) ending in a digit #2926

arimbr · 2018-11-14T19:38:17Z

The spaCy tokenizer does not seem to correctly split tokens separated by a slash (/) when some of them end with a digit.

How to reproduce the behaviour

In [57]: import spacy
In [58]: nlp = spacy.load('fr')

In [59]: [t for t in nlp('Learn html5/css3/javascript/jquery')]
Out[59]: [Learn, html5/css3/javascript, /, jquery] # UNEXPECTED

In [60]: [t for t in nlp('Learn html/css/javascript/jquery')]
Out[60]: [Learn, html, /, css, /, javascript, /, jquery] # EXPECTED
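
One possible explanation for the difference between the two outputs, sketched with plain `re` rather than spaCy itself: if the slash infix rule only fires when a letter sits on both sides, a slash preceded by a digit is never treated as a split point. The patterns and the `split_on_infix` helper below are hypothetical and simplified for illustration; they are not taken from spaCy's rule set.

```python
import re

# Hypothetical, simplified infix rule (an assumption, not spaCy's actual rule):
# a slash is a split point only when a letter sits on both sides of it.
STRICT_SLASH = re.compile(r"(?<=[a-zA-Z])/(?=[a-zA-Z])")

# Relaxed variant that also accepts a digit immediately before the slash.
RELAXED_SLASH = re.compile(r"(?<=[a-zA-Z0-9])/(?=[a-zA-Z])")


def split_on_infix(chunk, pattern):
    """Split chunk at every pattern match, keeping the matches as tokens."""
    tokens = []
    start = 0
    for match in pattern.finditer(chunk):
        # Text between the previous split point and this match.
        if match.start() > start:
            tokens.append(chunk[start:match.start()])
        tokens.append(match.group())
        start = match.end()
    # Whatever remains after the last match.
    if start < len(chunk):
        tokens.append(chunk[start:])
    return tokens


strict_tokens = split_on_infix("html5/css3/javascript/jquery", STRICT_SLASH)
relaxed_tokens = split_on_infix("html5/css3/javascript/jquery", RELAXED_SLASH)
print(strict_tokens)   # ['html5/css3/javascript', '/', 'jquery']
print(relaxed_tokens)  # ['html5', '/', 'css3', '/', 'javascript', '/', 'jquery']
```

Under this assumption, the strict pattern reproduces the unexpected split from In [59], and allowing a digit in the lookbehind gives the expected one.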

Your Environment

spaCy version: 2.0.11
Platform: Linux-4.15.0-36-generic-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.5
Models: fr, en

Related issue #891


ines · 2019-01-07T12:52:56Z

Merging this with the master issue in #1642!

lock · 2019-02-06T13:50:20Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the "feat / tokenizer" (Feature: Tokenizer) and "perf / accuracy" (Performance: accuracy) labels Nov 14, 2018

ines closed this as completed Jan 7, 2019

lock bot locked as resolved and limited conversation to collaborators Feb 6, 2019
