Add pt-br wordlist #60

Merged 1 commit on Apr 26, 2019
drebs (Contributor) commented Apr 22, 2019

The wordlist was generated from 2 different sources of words:

  • The file /usr/share/dict/brazilian from Debian's wbrazilian package.
  • A dump of the pages of Wikipedia in Portuguese.

The final pt-br wordlist was generated as follows:

  1. Download a dump of Portuguese Wikipedia pages, process all pages
    and determine the frequency of each word.
  2. Start from /usr/share/dict/brazilian and filter out:
    • words not matching /^[a-z]+$/,
    • words shorter than 4 characters, and
    • words longer than 8 characters.
  3. Remove all words that are a suffix of any other word in the list.
  4. Sort the remaining words by their frequency in Portuguese Wikipedia.
  5. Take the 7776 most frequent words.
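
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used for this PR: the function names, the tokenizer regex, the single-pass suffix removal, and the alphabetical tie-breaking are all assumptions, and parsing the Wikipedia dump itself is left out.

```python
import re
from collections import Counter


def count_frequencies(texts):
    """Step 1: count lowercase word occurrences over an iterable of page texts.

    The tokenizer (runs of Unicode letters) is an assumption.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


def drop_suffixes(words):
    """Step 3: drop every word that is a proper suffix of another word.

    Assumption: done in one pass against the already filtered list.
    """
    words = set(words)
    suffixes = {word[i:] for word in words for i in range(1, len(word))}
    return words - suffixes


def build_wordlist(dictionary_words, frequencies, size=7776):
    """Steps 2 to 5: filter, remove suffixes, rank by frequency, truncate."""
    # Step 2: keep only plain a-z words with 4 to 8 characters.
    candidates = {
        w for w in dictionary_words
        if re.fullmatch(r"[a-z]+", w) and 4 <= len(w) <= 8
    }
    remaining = drop_suffixes(candidates)
    # Step 4: most frequent first; ties broken alphabetically (assumption).
    ranked = sorted(remaining, key=lambda w: (-frequencies.get(w, 0), w))
    # Step 5: keep the top `size` words.
    return ranked[:size]
```

With this sketch, `dictionary_words` would be the lines of /usr/share/dict/brazilian and `frequencies` the result of `count_frequencies` over the extracted page texts. Words absent from Wikipedia get a frequency of 0 and sort last.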

No further curation was done.

There are obvious drawbacks to this approach (e.g., many very frequent
words are left out because they are too short, too long, or contain
accents or a cedilla), but it was the best cost-benefit trade-off I
could think of.

@ulif ulif merged commit 7743ed5 into ulif:master Apr 26, 2019
ulif (Owner) commented Apr 26, 2019

Awesome, @drebs! What a fine piece of work :) Collecting words from Wikipedia pages... a really nice idea.

ulif (Owner) commented Apr 27, 2019

@drebs, what license do you want to see applied for your list? Would CC-BY-3.0 be okay for you?

drebs (Contributor, Author) commented Apr 28, 2019

> @drebs, what license do you want to see applied for your list? Would CC-BY-3.0 be okay for you?

CC-BY-3.0 is great, thanks for taking care of that.
