This repository has been archived by the owner on Nov 7, 2023. It is now read-only.

Project for filtering stopwords minus fn file that throws errors in main repo

GhostGroup/wm-stopwords-filter

Stopwords Filter

This project is a very simple and naive implementation of a stopwords filter that removes a list of banned words (stopwords) from a sentence.

Quick guide

  • Install

just type

gem install stopwords-filter

or

# Don't forget the 'require:'
gem 'stopwords-filter', require: 'stopwords'

in your Gemfile.

  • Use it

    1. Simple version
stopwords = ['by', 'written', 'from']
filter = Stopwords::Filter.new stopwords

filter.filter 'guide by douglas adams'.split
# ['guide', 'douglas', 'adams']

filter.stopword? 'by'
# true
    2. Snowball version
filter = Stopwords::Snowball::Filter.new "en"
filter.filter 'guide by douglas adams'.split
# ['guide', 'douglas', 'adams']

filter.stopword? 'by'
# true

2.1 Snowball version with Sieve class (thanks to @s2gatev)

sieve = Stopwords::Snowball::WordSieve.new

filtered = sieve.filter lang: :en, words: 'guide by douglas adams'.split
# filtered = ['guide', 'douglas', 'adams']

sieve.stopword? lang: :en, word: 'by'
# true

What is a Stopword?

According to Wikipedia

In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text).

And that's it. Words that are removed before you perform some task on the rest of them.

Why would I want to remove anything?

Imagine you have a database of products and you want your customers to be able to search them. You can't use a proper search engine (such as Solr, Sphinx or even Google), nor the full-text search features of popular database systems such as PostgreSQL. You are left alone with LIKEs and %.

You have your fake search engine working. Someone searches 'Guide Douglas Adams' and you find 'Douglas Adams - Hitchhiker's guide to the galaxy'. Everything is perfect.

But then someone searches 'guide by douglas adams' and you don't find anything. You don't have any 'by' in the description or title of the book! Most importantly, you don't need that 'by'!

You wish you could get rid of all those 'by' or 'written' or 'from', huh? That's why we are here!
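As a rough illustration of the scenario above, the sketch below turns a search phrase into LIKE patterns after dropping stopwords. It uses plain array subtraction as a stand-in for the gem so that it runs on its own; the `like_patterns` helper, the `title` column and the stopword list are hypothetical and not part of this library.

```ruby
# Hypothetical stopword list, used here only for illustration.
STOPWORDS = %w[by written from].freeze

# Build one LIKE pattern per remaining search term.
# Plain array subtraction stands in for the gem's filter here.
def like_patterns(query, stopwords = STOPWORDS)
  terms = query.downcase.split - stopwords
  terms.map { |term| "%#{term}%" }
end

patterns = like_patterns('guide by douglas adams')

# One placeholder per pattern; the patterns themselves would be passed
# as bound parameters, never interpolated into the SQL string.
where_clause = Array.new(patterns.size, 'title LIKE ?').join(' AND ')
```

With the 'by' removed, every remaining term has a chance of matching the book title, which is exactly what the naive search needed.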

How does this thing work?

The main class of this 'library' is Stopwords::Filter. You just create a new object with an array of stopwords

stopwords = ['by', 'written', 'from']
filter = Stopwords::Filter.new stopwords

And then you have it; you can just filter

filter.filter 'guide by douglas adams'.split  #-> ['guide', 'douglas', 'adams']
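To make the idea concrete, here is a minimal sketch of what a filter with this interface could look like. The class name `NaiveStopwordFilter` is hypothetical and this is not the gem's actual source; it only assumes the two methods shown in this README (`filter` and `stopword?`) and an exact, case-sensitive match.

```ruby
require 'set'

# Hypothetical stand-in, not the gem's actual source.
class NaiveStopwordFilter
  def initialize(stopwords)
    # A Set keeps membership checks fast even for long lists.
    @stopwords = Set.new(stopwords)
  end

  # Return the words that are not stopwords, preserving their order.
  def filter(words)
    words.reject { |word| stopword?(word) }
  end

  # True when the word is in the list (exact, case-sensitive match).
  def stopword?(word)
    @stopwords.include?(word)
  end
end

naive = NaiveStopwordFilter.new(%w[by written from])
result = naive.filter('guide by douglas adams'.split)
```

Under this assumption 'By' would not be treated as a stopword, so downcasing the input before filtering is probably a good idea.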

That's all?

I know what you're thinking: it takes one line of Ruby code to subtract one array from another. That's why we have added some extra functionality: Snowball stopword lists, already built for you and ready to use.

At least, in the beginning we were using Snowball stopwords, but several collaborators have improved this humble gem by including new languages or adding new stopwords. So now, the Snowball version is more of a "Snowball and friends" version.

How do I use that snowball thing?

You just create the filter with the locale you want to use

filter = Stopwords::Snowball::Filter.new "en"

And then you filter without worrying about the exact stopwords used

filter.filter 'guide by douglas adams'.split  #-> ['guide', 'douglas', 'adams']
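Conceptually, a locale-based filter only needs a mapping from locale code to word list. The sketch below shows that idea with a tiny hand-written table; the `LISTS` hash, its contents and the `LocaleFilter` class are hypothetical and far smaller than the real lists. How the gem actually stores and loads its lists is not covered here.

```ruby
# Tiny hypothetical word lists; the real Snowball lists are much longer.
LISTS = {
  'en' => %w[by the of from],
  'es' => %w[por el la de]
}.freeze

# Hypothetical class, for illustration only.
class LocaleFilter
  def initialize(locale)
    # Accept both strings and symbols; fail loudly on unknown locales.
    @stopwords = LISTS.fetch(locale.to_s) do
      raise ArgumentError, "unsupported locale: #{locale}"
    end
  end

  # Remove every word that appears in the list for this locale.
  def filter(words)
    words - @stopwords
  end
end

english = LocaleFilter.new('en').filter('guide by douglas adams'.split)
spanish = LocaleFilter.new(:es).filter('guia por douglas adams'.split)
```

Raising on an unknown locale is a design choice of this sketch; silently returning the input unfiltered would hide typos in the locale code.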

Which languages are supported with snowball?

Currently we have support for:

  • Afrikaans (af)
  • Arabic (ar)
  • Bengali (bn)
  • Breton (br)
  • Catalan (ca)
  • Chinese (zh)
  • Czech (cs)
  • Danish (da)
  • German (de)
  • Greek (el)
  • English (en)
  • Spanish (es)
  • Finnish (fi): due to an earlier error, it can also be referenced with the fn locale
  • French (fr)
  • Hebrew (he)
  • Hungarian (hu)
  • Indonesian (id)
  • Italian (it)
  • Korean (ko)
  • Dutch (nl)
  • Polish (pl)
  • Portuguese (pt)
  • Romanian (ro)
  • Russian (ru)
  • Swedish (sv)
  • Thai (th)
  • Turkish (tr)
  • Vietnamese (vi)

In the changelog you can see the collaborators for each language.

Anything else?

In a future version I would like to include a chaining filter, where you define a series of operations and they are executed in linear order, just like the Pipes and Filters design pattern.
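Since this feature does not exist yet, the following is only one possible shape for it, written in plain Ruby. The `Pipeline` class and the individual steps are hypothetical and not part of the gem; array subtraction again stands in for the stopword filter.

```ruby
# Hypothetical pipeline: each step is any object responding to #call.
class Pipeline
  def initialize(*steps)
    @steps = steps
  end

  # Feed the input through every step, in the order they were given.
  def call(input)
    @steps.reduce(input) { |value, step| step.call(value) }
  end
end

stopwords = %w[by written from]

pipeline = Pipeline.new(
  ->(text)  { text.downcase },     # normalize case
  ->(text)  { text.split },        # tokenize on whitespace
  ->(words) { words - stopwords }  # drop stopwords
)

output = pipeline.call('Guide BY Douglas Adams')
```

Because each step only has to respond to `call`, lambdas, method objects and small filter classes could all be mixed in the same chain.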

Acknowledgments

Thanks to @s2gatev, who added the stopword? method and the sieve class to this gem.

Thanks to @bettysteger, @fauno, @vrypan, @woto, @grzegorzblaszczyk, @nerde, @sbeckeriv and @zackxu1 for language support and other features.
