Merge branch 'master' of github.com:totalgood/nlpia
Hobson Lane committed Oct 14, 2018
2 parents 26b5fad + 37003a5 commit 1809d80
Showing 1 changed file with 81 additions and 31 deletions.
112 changes: 81 additions & 31 deletions README.md
@@ -42,7 +42,7 @@ Once you have Git installed, launch a bash terminal.
It will usually be found among your other applications with the name `git-bash`.


1. Install [Anaconda3 (Python3.6)](https://docs.anaconda.com/anaconda/install/)
### Step 1. Install [Anaconda3 (Python3.6)](https://docs.anaconda.com/anaconda/install/)

* [Linux](https://repo.anaconda.com/archive/Anaconda3-5.2.0-Linux-x86_64.sh)
* [MacOSX](https://repo.anaconda.com/archive/Anaconda3-5.2.0-MacOSX-x86_64.pkg)
@@ -52,15 +52,15 @@ If you're installing Anaconda3 using a GUI, be sure to check the box that update
Also, at the end, the Anaconda3 installer will ask if you want to install VSCode.
Microsoft's VSCode is supposed to be an OK editor for Python so feel free to use it.

2. Install an Editor
### Step 2. Install an Editor

You can skip this step if you are happy using `jupyter notebook` or `VSCode` or the editor built into Anaconda3.

I like [Sublime Text](https://www.sublimetext.com/3).
It's a lot cleaner and more mature.
Plus it has more plugins written by individual developers like you.

3. Install Git and Bash
### Step 3. Install Git and Bash

* Linux -- already installed
* MacOSX -- already installed
@@ -70,25 +70,24 @@ If you're on Linux or Mac OS, you're good to go. Just figure out how to launch a

On Windows you have a bit more work to do. Supposedly Windows 10 will let you install Ubuntu with a terminal and bash. But the terminal and shell that come with [`git`](https://git-scm.com/downloads) are probably a safer bet. They're maintained by a broader open source community.

4. Clone this repository
### Step 4. Clone this repository

```bash
git clone https://github.com/totalgood/nlpia.git
```

5. Install `nlpia`
### Step 5. Install `nlpia`

You have two tools you can use to install `nlpia`:
You have two alternative package managers you can use to install `nlpia`:

5.1. `conda`
5.2. `pip`

In most cases, `conda` will be able to install python packages faster and more reliably than pip. Without `conda`, some packages, such as `python-levenshtein`, require you to compile a C library during installation. Windows doesn't have an installer that will "just work."

### 5.1. `conda`
#### Alternative 5.1. `conda`

In most cases, conda will be able to install python packages faster and more reliably than pip, because packages like `python-levenshtein` require you to compile a C library during installation, and Windows doesn't have an installer that will "just work."

So use conda (part of the Anaconda package that we already installed) to create an environment called `nlpiaenv`:
Use conda (part of the Anaconda package that you installed in Step 1 above) to create an environment called `nlpiaenv`:

```bash
cd nlpia # make sure you're in the nlpia directory that contains `setup.py`
@@ -101,13 +100,17 @@ Whenever you want to be able to import or run any `nlpia` modules, you'll need t

```bash
source activate nlpiaenv
python -c "import nlpia; print(nlpia)"
```

Make sure you can import nlpia with:

```bash
python -c "import nlpia; print(nlpia)"
```
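
If you would rather check from inside Python (for example in a Jupyter notebook), here is a minimal sketch that uses only the standard library. The helper name `is_importable` is made up for this illustration and is not part of `nlpia`:

```python
import importlib.util


def is_importable(package_name):
    """Return True if a top-level package can be found in the active environment."""
    # find_spec returns None when no top-level module of that name is installed
    return importlib.util.find_spec(package_name) is not None


# Report whether the active environment can see the package
for name in ('nlpia',):
    status = 'found' if is_importable(name) else 'NOT found'
    print(f'{name}: {status}')
```

If `nlpia` is reported as not found, the most likely cause is that the `nlpiaenv` environment has not been activated in the shell that launched Python.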

Skip to Step 4 if you have successfully created and activated an environment containing the `nlpia` package.
Skip to Step 6 ("Have fun!") if you have successfully created and activated an environment containing the `nlpia` package and its dependencies.

### 5.2. `pip`
#### Alternative 5.2. `pip`

Unix-like OSes like Ubuntu and OSX come with C++ compilers built-in, so you may be able to install the dependencies using pip instead of `conda`.
But if you're on Windows and you want to install packages like `python-levenshtein` that need compiled C++ libraries, you'll need a compiler.
@@ -133,8 +136,7 @@ If you are on a Linux or Darwin(Mac OSX) system or want to try to help us debug
# pip install -r requirements-voice.txt
```


6. Have Fun!
### 6. Have Fun!

Check out the code examples from the book in `nlpia/nlpia/book/examples` to get ideas:

@@ -143,43 +145,33 @@ cd nlpia/book/examples
ls
```

## Contributing
### 7. Contribute

Help your fellow readers by contributing to your shared code and knowledge.
Help other NLP practitioners by contributing your code and knowledge.
Here are some ideas for a few features others might find handy.

### Feature 1: Glossary Compiler
#### Feature 1: Glossary Compiler

Skeleton code and APIs that could be added to the [`transcoders.py`](https://github.com/totalgood/nlpia/blob/master/src/nlpia/transcoders.py) module.


```python


def find_acronym(text):
    """Find parenthetical noun phrases in a sentence and return the acronym/abbreviation/term as a pair of strings.

    >>> find_acronym('Support Vector Machine (SVM) are a great tool.')
    ('SVM', 'Support Vector Machine')
    """
    return (abbreviation, noun_phrase)


```
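
One way the `find_acronym` skeleton could be filled in is sketched below. It is only an illustration, not the implementation in `nlpia`: it assumes the acronym sits in parentheses right after its noun phrase, and that the phrase has one word per uppercase letter in the acronym. The `ACRONYM_PATTERN` name is made up for this example.

```python
import re

# Hypothetical pattern: a parenthesized token that starts with an uppercase letter
ACRONYM_PATTERN = re.compile(r'\(([A-Z][A-Za-z0-9]{1,9})\)')


def find_acronym(text):
    """Return (abbreviation, noun_phrase) for the first parenthetical acronym, or None.

    >>> find_acronym('Support Vector Machine (SVM) are a great tool.')
    ('SVM', 'Support Vector Machine')
    """
    match = ACRONYM_PATTERN.search(text)
    if match is None:
        return None
    abbreviation = match.group(1)
    # Assumption: one word of the noun phrase per uppercase letter in the acronym
    num_words = sum(1 for char in abbreviation if char.isupper())
    preceding_words = text[:match.start()].split()
    if len(preceding_words) < num_words:
        return None
    noun_phrase = ' '.join(preceding_words[-num_words:])
    return (abbreviation, noun_phrase)
```

This heuristic will miss acronyms that skip stop words (such as "of" or "and"), so a contribution based on a real noun-phrase parser would be more robust.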

```python


def glossary_from_dict(dict, format='asciidoc'):
    """ Given a dict of word/acronym: definition compose a Glossary string in ASCIIDOC format """
    return text


```
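
A possible body for `glossary_from_dict` is sketched here. It assumes the glossary should be an AsciiDoc description list (`term:: definition`) under a `[glossary]` section, sorted case-insensitively. The parameter is renamed to `terms` in this sketch to avoid shadowing the built-in `dict`; that rename is a suggestion, not the existing API.

```python
def glossary_from_dict(terms, format='asciidoc'):
    """Compose an AsciiDoc glossary string from a {term: definition} mapping."""
    if format != 'asciidoc':
        # Only the AsciiDoc layout is sketched here
        raise ValueError(f'Unsupported glossary format: {format!r}')
    lines = ['[glossary]', '== Glossary', '']
    # Sort case-insensitively so "NLP" and "nlpia" end up next to each other
    for term in sorted(terms, key=str.lower):
        lines.append(f'{term}:: {terms[term]}')
    return '\n'.join(lines) + '\n'


print(glossary_from_dict({
    'SVM': 'Support Vector Machine',
    'NLP': 'Natural Language Processing',
}))
```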

```python


def glossary_from_file(path, format='asciidoc'):
    """ Given an asciidoc file path compose a Glossary string in ASCIIDOC format """
    return text
@@ -188,15 +180,73 @@ def glossary_from_file(path, format='asciidoc'):
def glossary_from_dir(path, format='asciidoc'):
    """ Given a path to a directory of asciidoc files compose a Glossary string in ASCIIDOC format """
    return text


```

### Feature 2: Semantic Search
#### Feature 2: Semantic Search

Use a parser to extract only natural language sentences and headings/titles from a list of lines/sentences from an asciidoc book like "Natural Language Processing in Action".
Use a sentence segmenter in [nlpia.transcoders](https://github.com/totalgood/nlpia/blob/master/src/nlpia/transcoders.py) to split a book, like _NLPIA_, into a sequence of sentences.
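
To give a feel for the task, here is a rough, regex-based sketch of such a segmenter. It is not the segmenter in `nlpia.transcoders`; the function name `extract_sentences`, the list of markup prefixes, and the sentence-boundary rule are all assumptions made for this illustration.

```python
import re

# Assumed asciidoc conventions: these prefixes mark markup lines rather than prose
MARKUP_PREFIXES = ('//', '[', ':', '|', 'image::')
# Split after ., ! or ? when followed by whitespace and a capital letter or quote
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')


def extract_sentences(lines):
    """Return headings and prose sentences from asciidoc lines, in document order."""
    sentences, paragraph, in_listing = [], [], False

    def flush():
        # Join the buffered paragraph lines and split them into sentences
        if paragraph:
            sentences.extend(SENTENCE_BOUNDARY.split(' '.join(paragraph)))
            paragraph.clear()

    for line in lines:
        stripped = line.strip()
        if stripped.startswith('----'):      # listing delimiter toggles code mode
            in_listing = not in_listing
        elif in_listing or stripped.startswith(MARKUP_PREFIXES):
            continue                         # skip code and markup lines
        elif stripped.startswith('='):       # heading: keep the title text
            flush()
            sentences.append(stripped.lstrip('=').strip())
        elif stripped:
            paragraph.append(stripped)
        else:                                # blank line ends a paragraph
            flush()
    flush()
    return sentences


example = ['= Chapter 1', '', 'NLP is fun. It is also hard!', 'Really?', '',
           '[source,python]', '----', 'print("hi. there")', '----', 'Done.']
print(extract_sentences(example))
```

A rule this simple will split incorrectly on abbreviations such as "Dr. Smith", which is one reason a trained segmenter is a worthwhile contribution.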

#### Feature 3: Semantic Spectrograms

A sequence of word vectors or topic vectors forms a 2D array or matrix which can be displayed as an image. I used `word2vec` (`nlpia.loaders.get_data('word2vec')`) to embed the words in the last four paragraphs of Chapter 1 in NLPIA. It produced a spectrogram that was a lot noisier than I expected. Nonetheless, stripes and blotches of meaning are clearly visible.

First, the imports:

```python
>>> import numpy as np
>>> from nlpia.loaders import get_data
>>> from nltk.tokenize import casual_tokenize
>>> from matplotlib import pyplot as plt
>>> import seaborn
```

Next, get the raw text and tokenize it:

```python
>>> lines = get_data('ch1_conclusion')
>>> txt = "\n".join(lines)
>>> tokens = casual_tokenize(txt)
>>> tokens[-10:]
['you',
'accomplish',
'your',
'goals',
'in',
'business',
'and',
'in',
'life',
'.']
```

Then you'll have to download a word vector model like word2vec:

```python
>>> wv = get_data('w2v') # this could take several minutes
>>> wordvectors = np.array([wv[tok] for tok in tokens if tok in wv])
>>> wordvectors.shape
(307, 300)
```
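
The `if tok in wv` filter silently drops any token that the model has no vector for, which is why the matrix can have fewer rows than there are tokens. The toy sketch below shows the same pattern with a made-up three-dimensional vocabulary (`toy_wv`) standing in for the real 300-dimensional model, so it runs without downloading anything:

```python
import numpy as np

# Made-up 3-D "embedding" standing in for the 300-D word2vec model
toy_wv = {
    'you': [0.1, 0.2, 0.3],
    'accomplish': [0.4, 0.5, 0.6],
    'goals': [0.7, 0.8, 0.9],
}
tokens = ['you', 'accomplish', 'your', 'goals', '.']

# Keep only in-vocabulary tokens, and remember which ones were dropped
kept = [tok for tok in tokens if tok in toy_wv]
skipped = [tok for tok in tokens if tok not in toy_wv]

# One row per kept token, one column per embedding dimension
wordvectors = np.array([toy_wv[tok] for tok in kept])
print(wordvectors.shape, skipped)
```

Keeping the `kept` list around is handy if you later want to label the rows of the spectrogram with their tokens.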

Now you can display your 307x300 spectrogram or "wordogram":

```python
>>> plt.imshow(wordvectors)
>>> plt.show()
```

Can you think of some image processing or deep learning algorithms you could run on images of natural language text?

Once you've mastered word vectors, you can play around with Google's Universal Sentence Encoder and create spectrograms of entire books.

#### Feature 5: Build your own Sequence-to-Sequence translator

If you have pairs of statements or words in two languages, you can build a sequence-to-sequence translator. You could even design your own language, like you did in grade school with Pig Latin, or build yourself a L337 translator.
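
If you don't have a parallel corpus handy, you can generate one with a rule-based translator and then see whether a sequence-to-sequence model can learn the rules. The sketch below uses one simplified set of Pig Latin rules (there are several dialects); `to_pig_latin` and `make_pairs` are names invented for this example.

```python
VOWELS = 'aeiou'


def to_pig_latin(word):
    """Translate one word using a simplified Pig Latin rule set."""
    word = word.lower()
    if not word.isalpha():
        return word                      # leave punctuation and numbers alone
    if word[0] in VOWELS:
        return word + 'way'              # vowel-initial words just get a suffix
    for i, char in enumerate(word):
        if char in VOWELS:
            # Move the leading consonant cluster to the end, then add "ay"
            return word[i:] + word[:i] + 'ay'
    return word + 'ay'                   # no vowels at all, e.g. "rhythm"


def make_pairs(sentences):
    """Build (source, target) training pairs for a sequence-to-sequence model."""
    return [(s, ' '.join(to_pig_latin(w) for w in s.split())) for s in sentences]


print(make_pairs(['pig latin', 'apple string']))
```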


#### Other Ideas

There are a lot more project ideas mentioned in NLPIA "Appendix E -- Resources". Here's an early draft of [that resource list](https://github.com/totalgood/nlpia/blob/master/src/nlpia/data/book/Appendix%20E%20--%20Resources.asc.md).


