
stop_words assigned but not used? #639

Closed
ExplodingCabbage opened this issue Nov 20, 2016 · 5 comments
Labels
bug Bugs and behaviour differing from documentation

Comments

@ExplodingCabbage
Contributor

Maybe I'm being dense, but when I search the entire repo case-insensitively for stop_words, it looks like you're defining a list of stop words but never using it. Every match in the Sublime search below is an assignment; all you seem to do is define STOP_WORDS constants in language_data.py files and then assign those constants to the stop_words class property of a Language's Defaults, without ever then reading from it:

Searching 535 files for "stop_words"

/home/mark/spaCy/spacy/language.py:
  153      tagger_features = Tagger.feature_templates # TODO -- fix this
  154  
  155:     stop_words = set()
  156  
  157      lex_attr_getters = {

/home/mark/spaCy/spacy/de/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  
   28  

/home/mark/spaCy/spacy/de/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

/home/mark/spaCy/spacy/en/__init__.py:
   29          tag_map = dict(language_data.TAG_MAP)
   30  
   31:         stop_words = set(language_data.STOP_WORDS)
   32  

/home/mark/spaCy/spacy/en/language_data.py:
    4  
    5  # improved list from Stone, Denis, Kwantes (2010)
    6: STOP_WORDS = set("""
    7  a about above across after afterwards again against all almost alone 
    8  along already also although always am among amongst amoungst amount 

/home/mark/spaCy/spacy/es/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  

/home/mark/spaCy/spacy/es/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

/home/mark/spaCy/spacy/fr/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  
   28  

/home/mark/spaCy/spacy/fr/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

/home/mark/spaCy/spacy/it/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  
   28  

/home/mark/spaCy/spacy/it/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

/home/mark/spaCy/spacy/pt/__init__.py:
   24          tag_map = dict(language_data.TAG_MAP)
   25  
   26:         stop_words = set(language_data.STOP_WORDS)
   27  
   28  

/home/mark/spaCy/spacy/pt/language_data.py:
    4  
    5  
    6: STOP_WORDS = set()
    7  
    8  

19 matches across 13 files

Does this list still have a purpose, or should it be culled? I thought I'd flag this up before any of you dutifully hunt down stop word lists for the new languages you're adding!

Apologies if there's some reason for this to exist that I'm missing.
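To condense what the search above shows (a simplified sketch: the STOP_WORDS contents are abbreviated to the fragment visible in the English listing, and Defaults is reduced to a stand-in class):

```python
# language_data.py (per language): the stop list is defined...
STOP_WORDS = set("""
a about above across after afterwards again against all almost alone
""".split())


# __init__.py (per language): ...and copied onto the Defaults class...
class Defaults:  # stand-in for the nested Defaults of each Language subclass
    stop_words = set(STOP_WORDS)


# ...but nothing anywhere reads Defaults.stop_words back out again.
print("about" in Defaults.stop_words)  # the set is populated, just never consulted
```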

@ExplodingCabbage
Contributor Author

ExplodingCabbage commented Nov 20, 2016

Aside: the English STOP_WORDS list contains some surprising entries, like "computer", "fire", and "mill", that it seems bizarre and arbitrary to treat as stop words. I've tracked the source down to http://onlinelibrary.wiley.com/store/10.1111/j.1756-8765.2010.01108.x/asset/supinfo/TOPS_1108_sm_supmat.pdf?v=1&s=715bd019aab0c2df0c269b487209c1342143a0a6, and it does indeed seem to be the stop word list used in http://onlinelibrary.wiley.com/doi/10.1111/j.1756-8765.2010.01108.x/full. Regardless, if this list is sticking around, the presence of these seemingly inappropriate entries should perhaps be addressed.

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Nov 20, 2016
@honnibal
Member

Thanks.

What's supposed to happen is that the IS_STOP attribute in the Language class maps to a function that looks the word up in the stop list. I see that this got broken somewhere.

Agree about the English stop list.
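The intended wiring might look something like this (a hypothetical sketch only: `IS_STOP` here is a plain string standing in for the real attribute ID in spacy.attrs, and the getter signature is assumed):

```python
STOP_WORDS = set("a about above across after all almost alone".split())

IS_STOP = "IS_STOP"  # stand-in for the real attribute ID in spacy.attrs

# The Language class keeps a table of per-attribute getter functions;
# hooking the stop list in would mean an entry along these lines:
lex_attr_getters = {
    IS_STOP: lambda string: string.lower() in STOP_WORDS,
}

# Each lexeme's string would then be passed through the getter:
is_stop = lex_attr_getters[IS_STOP]
print(is_stop("About"))   # True
print(is_stop("spaCy"))   # False
```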

@ines
Member

ines commented Nov 22, 2016

Re English stopwords: I'm currently in the process of reorganising the language data. Just posted an update here: #649

honnibal added a commit that referenced this issue Nov 23, 2016
…messy, but it's better not to change too much until the language data loading can be properly refactored.
@honnibal
Member

Put a band-aid on this for now. A more satisfying fix will come alongside the data reorganisation.

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018