Add pt-br wordlist #60

drebs · 2019-04-22T00:05:27Z

The wordlist was generated from 2 different sources of words:

- The file /usr/share/dict/brazilian from Debian's wbrazilian package.
- A dump of the pages of the Portuguese Wikipedia.

The final pt-br wordlist was generated as follows:

1. Download a dump of Portuguese Wikipedia pages, process all pages
   and determine the frequency of each word.
2. Start from /usr/share/dict/brazilian and filter out:
   - words not matching /^[a-z]+$/,
   - words shorter than 4 characters, and
   - words longer than 8 characters.
3. Remove all words that are a suffix of any other word in the list.
4. Sort the remaining words by their pt Wikipedia frequencies.
5. Take the 7776 most frequent words.
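
For illustration, the steps above can be sketched roughly as follows in Python. This is a minimal sketch, not the script actually used for this PR: the function names (`count_words`, `filter_candidates`, `drop_suffixes`, `build_wordlist`) and the tokenizer are assumptions, and parsing the Wikipedia dump itself is left out.

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercased word occurrences in plain text.

    Assumption: a simple "runs of letters" tokenizer; the real
    processing of the Wikipedia dump may have differed.
    """
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def filter_candidates(words, min_len=4, max_len=8):
    """Keep words made only of a-z with min_len <= length <= max_len."""
    return {
        w for w in words
        if re.fullmatch(r"[a-z]+", w) and min_len <= len(w) <= max_len
    }


def drop_suffixes(words):
    """Remove every word that is a proper suffix of another word."""
    suffixes = set()
    for w in words:
        for i in range(1, len(w)):
            suffixes.add(w[i:])
    return {w for w in words if w not in suffixes}


def build_wordlist(dict_words, freqs, size=7776):
    """Filter, drop suffixes, rank by frequency and keep the top `size`."""
    candidates = drop_suffixes(filter_candidates(dict_words))
    # Most frequent first; ties are broken alphabetically (an assumption).
    ranked = sorted(candidates, key=lambda w: (-freqs.get(w, 0), w))
    return ranked[:size]


if __name__ == "__main__":
    # Tiny stand-ins for /usr/share/dict/brazilian and the Wikipedia text.
    dict_words = ["casa", "asa", "mesa", "amesa", "livro", "computador"]
    freqs = count_words("casa casa livro mesa casa livro")
    print(build_wordlist(dict_words, freqs, size=2))
```

The target size of 7776 equals 6^5, so each word in the final list corresponds to one outcome of five six-sided dice rolls.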

No further curation was done.

There are obvious drawbacks to this approach (e.g. many very frequent
words are left out because they are too short, too long, or contain
accents or a cedilla), but it was the best cost-benefit trade-off I
could think of.


ulif · 2019-04-26T22:52:21Z

Awesome, @drebs! What a fine piece of work :) Collecting words from Wikipedia pages... a really nice idea.

ulif · 2019-04-27T00:17:25Z

@drebs, what license do you want to see applied for your list? Would CC-BY-3.0 be okay for you?

drebs · 2019-04-28T09:40:08Z

> @drebs, what license do you want to see applied for your list? Would CC-BY-3.0 be okay for you?

CC-BY-3.0 is great, thanks for caring for that.

drebs mentioned this pull request Apr 22, 2019

Add support for wordlists in other languages #50

Open

drebs force-pushed the wordlist-pt-br branch from ee49ffb to e3fe095 Compare April 22, 2019 10:51

ulif merged commit 7743ed5 into ulif:master Apr 26, 2019
