Initial commit for public release
recluze committed Jul 26, 2017
0 parents commit c8de796
Showing 29 changed files with 608,399 additions and 0 deletions.
73 changes: 73 additions & 0 deletions README.md
@@ -0,0 +1,73 @@
# Authors:

- Nauman ([email protected], [email protected], recluze.wordpress.com) -- Queries about ML should go here.
- Hafeez ur Rehman ([email protected]) -- Queries about Bioinformatics should go here.



# Important points:
- Requires Python 2.7.
- See `requirements.txt` for the exact versions of the libraries used. Keras v1.2.1 gives errors, so use Keras 1.2.0.
- I've used the Theano backend for Keras. If you use TensorFlow, I think you will have issues.
- It's suggested that you use `virtualenv` to create a new environment and then install the required packages in it:
```
pip install virtualenv
virtualenv bi
cd bi
. bin/activate
git clone <git_repo_url>
pip install -r <git_repo_name>/src/requirements.txt
```
- Set Keras/Theano to use the GPU (only do this on a machine that has a GPU). Put the following in `~/.theanorc`:
```
[global]
device = gpu
floatX = float32
```
- Set Keras to use Theano. In `~/.keras/keras.json`, put the following:
```
{
"image_dim_ordering": "th",
"epsilon": 1e-07,
"floatx": "float32",
"backend": "theano"
}
```
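Before training, it can help to confirm that both configuration files contain what the points above describe. The following is only a minimal sketch using the Python standard library; the helper names `read_theano_settings` and `read_keras_backend` are hypothetical and are not part of this repository. It only reads the files and does not check that Theano can actually reach the GPU.

```python
import json
import os

try:  # Python 3
    import configparser
except ImportError:  # Python 2.7, which this project targets
    import ConfigParser as configparser


def read_theano_settings(path):
    """Return the [global] section of a .theanorc file as a dict.

    Note: the parser lowercases option names, so `floatX` comes back as `floatx`.
    """
    parser = configparser.RawConfigParser()
    parser.read(path)
    if not parser.has_section("global"):
        return {}
    return dict(parser.items("global"))


def read_keras_backend(path):
    """Return the "backend" value from a keras.json file, or None if it is absent."""
    with open(path) as handle:
        return json.load(handle).get("backend")


if __name__ == "__main__":
    theanorc = os.path.expanduser("~/.theanorc")
    keras_json = os.path.expanduser("~/.keras/keras.json")
    if os.path.exists(theanorc):
        print("theano [global]: %s" % read_theano_settings(theanorc))
    if os.path.exists(keras_json):
        print("keras backend: %s" % read_keras_backend(keras_json))
```

With the files shown above in place, the first helper should report `device` as `gpu` and the second should report `theano`.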

# Execution
The source is executed in several steps.

1. First, data needs to be downloaded to the `data-scrapes` folder.
   - It needs to be in FASTA format, along with an annotation file in `.txt` format.
   - This has already been done for human proteins.


2. These scrapes need to be converted to a format that we read later.
   - This is done by running `python src/scrape2vec.py`.
   - Variables to set: `scrape_dir`, `out_file`, `out_file_fns`, `out_file_unique_functions`.
   - Use the `function_usage_cutoff` variable to remove functions used fewer times than this number.
   - This step has already been done for the human proteins downloaded in step 1.


3. Once the output files from the step above have been created, you can run training/validation.
   - This is done through `python train.py`.
   - Some variables need to be set (although the current `train.py` can be executed as is to reproduce our experiments):
     * See the top of `train.py` for the training parameters that you can set.
     * `target_function` can be set to train for a particular function. Set it to an empty string to train for all functions.
     * To quickly check the code on slow machines, set `restrict_sample_size` to, say, 10.
     * `results_dir` is where results will be stored. These will be `-console.txt`, `-results.txt` and `-saved-model.h5`, each prefixed with `exp_id`, i.e. the experiment ID.
     * At the bottom of `train.py`, you need to set `sequences_file` and `funtions_file` to the files created in step 2 above.
     * In `utils.py`, you need to set the `unique_function_file` variable. (Sorry for this clumsiness. I'm too lazy to fix this.)
   - The actual model is defined in the `get_*_model` functions in `models.py`. These are called from `train.py` during training.
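
To make steps 1 and 2 more concrete, here is a rough sketch of the kind of processing they describe: reading FASTA records and dropping functions that are used fewer times than `function_usage_cutoff`. This is not the code in `scrape2vec.py`. The function names, the `annotations` dictionary layout and the example labels are all assumptions made for illustration, and the format of the annotation `.txt` file is not covered here.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_rare_functions(annotations, function_usage_cutoff):
    """Remove functions annotated on fewer than `function_usage_cutoff` proteins.

    `annotations` is a hypothetical dict mapping a protein id to a list of
    function labels; the real intermediate format may differ.
    """
    usage = Counter(fn for fns in annotations.values() for fn in set(fns))
    kept = {fn for fn, count in usage.items() if count >= function_usage_cutoff}
    return {pid: [fn for fn in fns if fn in kept]
            for pid, fns in annotations.items()}


if __name__ == "__main__":
    # Made-up records and labels, for illustration only.
    fasta = [">P1 example", "MKT", "AYI", ">P2 example", "GAV"]
    print(list(read_fasta(fasta)))
    labels = {"P1": ["F1", "F2"], "P2": ["F1"]}
    print(drop_rare_functions(labels, 2))
```

Counting each function once per protein (via `set(fns)`) is a design choice of this sketch; counting raw occurrences instead would only matter if a protein lists the same function more than once.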

# License

This code is provided under the MIT License.

Copyright 2017 Mohammad Nauman, Hafeez-ur-Rehman

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
68,157 changes: 68,157 additions & 0 deletions data-scrapes/all-human-0001-annotations.txt

Large diffs are not rendered by default.

504,348 changes: 504,348 additions & 0 deletions data-scrapes/all-human-0001.fasta

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/protein-functions-2017-01-14-081641.txt

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/protein-functions-2017-01-23-203946.txt

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/protein-functions-2017-01-26-191058.txt

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions data/protein-functions-2017-01-29-080137.txt

Large diffs are not rendered by default.

20,124 changes: 20,124 additions & 0 deletions data/protein-seqs-2017-01-26-191058.txt

Large diffs are not rendered by default.
