This repository has been archived by the owner on Feb 22, 2020. It is now read-only.

Alex |Andre |Gilberto |Shize algorithm
Fixed dependencies

Former-commit-id: 8a196b2

Added better test cases

Former-commit-id: f00d01d

Added classification algorithm

Former-commit-id: ed57689

Updated style

Former-commit-id: e714ecf

Changed names and added comments to clarify that the code is largely adapted from https://github.com/lfz/DSB2017

Former-commit-id: 803e562

Added skip_slow

Former-commit-id: f5da4e6

Enabled tracking of ckpt files for lfs

Former-commit-id: a63e97c

Converted to LFS

Former-commit-id: 95986890b22dcca14779ee8ffc5ba89298c5035c

Converted to google style docstring

Converted to new style classes

Fixed change requested by @reubano

Fixed flake8 errors

Fixed the last flake8 errors

Updated classify/trained_model to use the model_path. Fixed a flake8 issue with extract_lungs

Update of classification model

Reintroduced model_path

Reintroduced model_path

Fixed flake8 errors

Commit attempting to fix the issues with pytest segfaulting

Forced loading of torch

Changed docker image to custom image based on ubuntu

Fixed flake8 issues
vessemer authored and dchansen committed Sep 28, 2017
1 parent 51a5b8a commit 11b8146
Showing 16 changed files with 866 additions and 186 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -9,3 +9,4 @@ test/assets/* filter=lfs diff=lfs merge=lfs -text
*.hd5 filter=lfs diff=lfs merge=lfs -text
*.mhd filter=lfs diff=lfs merge=lfs -text
*.raw filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
7 changes: 5 additions & 2 deletions compose/prediction/Dockerfile-dev
@@ -1,6 +1,9 @@
FROM python:3.6
FROM ubuntu:rolling
ENV PYTHONUNBUFFERED 1

RUN apt-get update && apt-get install -y tcl tk python3.6 python3.6-tk wget python-opencv
RUN wget https://bootstrap.pypa.io/get-pip.py
RUN python3.6 get-pip.py
RUN ln -s /usr/bin/python3.6 /usr/local/bin/python
# Requirements have to be pulled and installed here, otherwise caching won't work
COPY ./prediction/requirements /requirements
RUN pip install -r /requirements/local.txt
199 changes: 199 additions & 0 deletions docs/algorithm_alex_andre_gilberto_shize.md
@@ -0,0 +1,199 @@
# Alex |Andre |Gilberto |Shize algorithm

## Summary
The approach consists of a 3D CNN data model that slides along the z coordinate of a CT volume, followed by xgboost and extraTree models trained on different subsets of the extracted features.
![](https://preview.ibb.co/fUERDk/Screenshot_from_2017_09_27_01_11_49.png)
> A sliding 3D data model was custom built to reflect how radiologists review lung CT scans to diagnose cancer risk. As part of this data model - which allows for any nodule to be analyzed multiple times - a neural network nodule identifier has been implemented and trained using the Luna CT dataset. Non-traditional, unsegmented (i.e. full CT scans) were used for training, in order to ensure no nodules, in particular, those on the lung perimeter are missed.
## Source

**Authors:** Alexander Ryzhkov, Gilberto Titericz Junior, Andre, Shize Su <br/>
**Repository:** [https://github.com/astoc/kaggle_dsb2017](https://github.com/astoc/kaggle_dsb2017) <br/>
The approach took 8th place in the Data Science Bowl 2017.

## License
[MIT License](https://github.com/astoc/kaggle_dsb2017/blob/master/LICENSE)


## Prerequisites

<table>
<thead>
<tr>
<th colspan="3">Andre</th>
<th colspan="3">Shize</th>
</tr>
<tr>
<th>Dependency</th>
<th>Name</th>
<th>Version</th>
<th>Dependency</th>
<th>Name</th>
<th>Version</th>
</tr>
</thead>

<tbody>
<tr>
<td>Language</td>
<td>Python</td>
<td>3.5</td>
<td>Language</td>
<td>Python</td>
<td>2.7</td>
</tr>
<tr>
<td>ML engine</td>
<td>Keras</td>
<td>1.2.2</td>
<td>ML engine</td>
<td>Keras</td>
<td>1.2.2</td>
</tr>
<tr>
<td>ML backend</td>
<td>Theano</td>
<td>0.8+</td>
<td>ML backend</td>
<td>Theano</td>
<td></td>
</tr>
<tr>
<td>OS</td>
<td>PC Linux <br/>AWS Linux</td>
<td><br/>P2</td>
<td>OS</td>
<td>AWS Linux</td>
<td>C3.8</td>
</tr>
<tr>
<td>Processor</td>
<td>CPU</td>
<td>PC i7<br/>P2 vCPU</td>
<td>Processor</td>
<td>CPU</td>
<td>Intel Xeon</td>
</tr>
<tr>
<td></td>
<td>GPU</td>
<td>PC NVIDIA 8GB<br/>P2 NVIDIA K80 12GB</td>
<td></td>
<td>GPU</td>
<td>no</td>
</tr>
<tr>
<td>GPU driver</td>
<td>CUDA<br/>cuDNN</td>
<td>7.5<br/>6.0</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Some of the cell values were inferred from the AWS instance configurations and from CUDA compatibility.

**Dependency packages:** Neither the repository nor the authors specified exact versions of the Python packages:


Andre | Shize
--------------|---------------------------------
Keras 1.2.2 | numpy
Theano | pandas
spyder | xgboost
opencv | scikit-learn
pydicom |
scipy |
scikit-image |
SimpleITK |
numpy |
pandas |

## Algorithm design


### Preprocessing
1. Resample all patient CT scans to a relatively coarse resolution of 2x2x2 mm.
2. Standardise the CT voxel values to the Hounsfield scale.
3. Segment the lungs.
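
The first two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the function names, the default rescale slope and intercept, and the use of nearest-neighbour sampling are all hypothetical choices (the original pipeline may use a higher-order interpolation), and lung segmentation is left out.

```python
import numpy as np

TARGET_SPACING_MM = (2.0, 2.0, 2.0)  # coarse isotropic spacing named in the text


def to_hounsfield(raw, slope=1.0, intercept=-1024.0):
    """Map stored voxel values to Hounsfield units with a linear rescale.

    slope/intercept are assumed defaults; real scans carry their own values
    (for DICOM, the RescaleSlope and RescaleIntercept attributes).
    """
    return np.asarray(raw, dtype=np.float32) * slope + intercept


def resample_nearest(volume, spacing_mm, target_mm=TARGET_SPACING_MM):
    """Nearest-neighbour resample of a (z, y, x) volume to target_mm spacing."""
    spacing = np.asarray(spacing_mm, dtype=np.float64)
    target = np.asarray(target_mm, dtype=np.float64)
    old_shape = np.asarray(volume.shape)
    new_shape = np.maximum(np.round(old_shape * spacing / target), 1).astype(int)
    # For each output index, take the source voxel that contains the same
    # physical position along that axis (clamped to the last valid index).
    index = [
        np.minimum((np.arange(n) * target[a] / spacing[a]).astype(int),
                   volume.shape[a] - 1)
        for a, n in enumerate(new_shape)
    ]
    return volume[np.ix_(*index)]


raw = np.zeros((8, 20, 20), dtype=np.int16)   # toy scan with 2 x 1 x 1 mm voxels
hu = to_hounsfield(raw)                       # every voxel becomes -1024
coarse = resample_nearest(hu, spacing_mm=(2.0, 1.0, 1.0))
```

Halving the in-plane resolution in this toy case shrinks the volume fourfold, which is the trade-off the 2x2x2 mm choice makes: less detail in exchange for cheaper 3D convolutions.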

### Nodule detection
> Train a nodule identifier on a slicing architecture using Luna dataset and intermediate files created (3 options provided).

The slicing architecture itself is made of UNets. One of the aforementioned options also serves as a good data augmentation method:

> [..] Special mosaic-type aggregation of training of the nodule identifier has been deployed, as illustrated below.
![](https://preview.ibb.co/ivOTtk/Screenshot_from_2017_09_27_01_35_56.png)
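
The sliding idea described in the summary (any nodule can be analysed multiple times) can be pictured as overlapping slabs cut along the z axis. The sketch below is an assumption-laden illustration: `z_slabs`, the slab depth of 8 and the stride of 4 are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def z_slabs(volume, depth=8, stride=4):
    """Yield (z_start, slab) pairs of `depth` consecutive slices along z.

    Trailing slices are not covered when (n_z - depth) is not a multiple
    of the stride; a real pipeline would pad or add a final slab.
    """
    n_z = volume.shape[0]
    for z_start in range(0, n_z - depth + 1, stride):
        yield z_start, volume[z_start:z_start + depth]


volume = np.zeros((20, 64, 64), dtype=np.float32)  # toy (z, y, x) volume
slabs = list(z_slabs(volume))

# With stride < depth the slabs overlap, so a given slice is seen more than once.
covering_z10 = [z for z, _ in slabs if z <= 10 < z + 8]
```

Each slab would then be passed to the nodule identifier, and the per-slab outputs for the same location could be aggregated.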

### Prediction of cancer probability
> The most important feature is the existence of nodule(s), followed by their size, location and their other characteristics. For instance, a very significant number of patients for which no nodule has been found, proved to be no cancer cases. [..] Key features include existence/size of the largest nodule, and its vertical location, existence of emphysema, volume of all nodules, and their diversity.

The authors also mention that encoding the location of nodules relative to the centre of gravity of the segmented lungs gives a more significant feature than the conventional upper/lower lung part feature.

> As outlined, our combined approach uses the neural network as a feature generator and then applying xgboost and extraTree models on the extracted features to generate predictions and submissions. To make the model performance more stable, we also run some of the models with multiple random seeds (e.g., for xgb, use 50 random runs; for extraTree, use 10 random runs) and take the average. Our final winning submission (private LB0.430) is a linear combination of a couple of xgb models and extraTree models.
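
One way to read the centre-of-gravity feature mentioned above is as the offset between a nodule centre and the mean coordinate of the lung mask. The following is a minimal sketch of that reading only; the function names are hypothetical, and the authors may normalise or encode the offset differently.

```python
import numpy as np


def lung_centre_of_gravity(lung_mask):
    """Mean (z, y, x) coordinate of all voxels inside a binary lung mask."""
    return np.argwhere(lung_mask).mean(axis=0)


def relative_nodule_location(nodule_centre, lung_mask):
    """Nodule offset from the lung centre of gravity, in voxels (z, y, x)."""
    centre = lung_centre_of_gravity(lung_mask)
    return np.asarray(nodule_centre, dtype=np.float64) - centre


mask = np.zeros((8, 8, 8), dtype=bool)
mask[2:6, 2:6, 2:6] = True                  # toy "lungs": a cube centred at 3.5
offset = relative_nodule_location((5.5, 3.5, 3.5), mask)
```

Unlike a binary upper/lower flag, the offset is continuous and adapts to each patient's lung geometry, which may be why it carries more signal.
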
## Trained model

**Source:** [nodule_identifiers](https://github.com/astoc/kaggle_dsb2017/blob/master/code/Andre/nodule_identifiers/d8_2x2x2_best_weights.h5) <br/>

**Usage instructions:** [Shize algorithm](https://github.com/astoc/kaggle_dsb2017/blob/master/code/Shize/00ReadMe.txt), [Andre algorithm](https://github.com/astoc/kaggle_dsb2017/blob/master/code/Andre/ReadMe.txt) <br/>
## Model Performance

### Training / prediction time

**Test system:** <br/>

| Component | Spec | Count |
|-----------|-------|-------|
| CPU | C3.8 Intel Xeon | |
| GPU | P2 NVIDIA K80 12GB | >1 |
| RAM | | |

**Training time:** days on AWS

> Training some of the nodule models took days using high end 12GB GPUs.

**Prediction time:** unknown, but it must be at most about 14 min per CT, since all 506 CTs were processed within 5 days (5 × 24 × 60 min / 506 ≈ 14.2 min). <br/>

### Model Evaluation

**Dataset:** Data Science Bowl evaluation dataset <br/>

| Metric | Score |
|----------|-------|
| Log Loss | 0.43019 |
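
For context on the score above, binary log loss is $-\frac{1}{N}\sum_i \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$. The sketch below computes it with NumPy; the clipping constant `eps` is an assumption made here to avoid `log(0)`, not a documented detail of the competition scoring.

```python
import numpy as np


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y = np.asarray(y_true, dtype=np.float64)
    p = np.clip(np.asarray(p_pred, dtype=np.float64), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


# Predicting 0.5 for every patient gives ln 2, roughly 0.693, whatever the labels.
baseline = log_loss([1, 0, 0, 1], [0.5, 0.5, 0.5, 0.5])
```

A constant 0.5 prediction is a weak reference point; a constant equal to the class prior would score better on an imbalanced set.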

## Use cases


### When to use this algorithm

- The nodule detection system appears to be a good contribution to a Concept to Clinic ensemble, for the reasons listed in the Comments section.

### When to avoid this algorithm

- The nodule detection method provided by the authors requires an inconveniently coarse CT spacing (`2x2x2mm`), which may conflict with other pipelines. If high-order interpolation polynomials are employed, the additional spacing conversion may considerably increase computation time.
- Training from scratch, as the authors mention, may take days for just one of the sliding architectures, even on AWS P2 instances equipped with NVIDIA K80 12GB GPUs.

## Adaptation into Concept To Clinic

### Porting to Python 3.5+
Andre's part was already written in Python 3.5, whereas Shize used Python 2. The main difficulty seems to be the lack of specified versions for the packages employed by Shize. Nonetheless, Shize's part consists merely of ensembling already extracted features via xgboost and extraTree models, and does not require GPUs.

### Porting to run on CPU and GPU
The nodule detector is written in `Keras` with `Theano` as the backend, so it should run on CPU out of the box.
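
If the device needs to be pinned explicitly, Keras and Theano both read environment variables at import time. The snippet below is a sketch assuming the Keras 1.2.2 / Theano setup listed in the prerequisites; it only sets the variables, and the Keras import is left commented out because those packages may not be installed.

```python
import os

# These must be set before Keras/Theano are imported for the first time.
os.environ["KERAS_BACKEND"] = "theano"
os.environ["THEANO_FLAGS"] = "device=cpu,floatX=float32"

# import keras  # uncomment in an environment with Keras and Theano installed
```

Setting the flags in code overrides any value already present in the shell, so a GPU run would need a different `device` value rather than removing the line.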

### Improvements on the code base


### Adapting the model
It is worth noting that a simpler model consisting of only a [single xgb](https://github.com/astoc/kaggle_dsb2017/blob/master/code/Shize/0Shize_DSB_feat3_xgb_v5.py) performed similarly (0.434 on the private LB). It would therefore be better to drop the cumbersome combination of different xgb and extraTree models, some of which used predictions averaged over 50 or 10 random runs (i.e., 50 or 10 different random seeds).
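
The seed-averaging scheme mentioned above amounts to training the same randomised model several times and taking the mean of the predicted probabilities. The sketch below shows the pattern with a toy NumPy model standing in for xgboost or extraTree; `average_over_seeds`, `toy_fit_predict` and the synthetic data are all hypothetical.

```python
import numpy as np


def average_over_seeds(fit_predict, seeds):
    """Average the probability vectors returned by fit_predict(seed)."""
    runs = np.stack([fit_predict(seed) for seed in seeds])
    return runs.mean(axis=0)


# Synthetic data; a real run would use the features extracted by the network.
data_rng = np.random.default_rng(0)
X_train = data_rng.normal(size=(200, 5))
y_train = (X_train[:, 0] + 0.5 * data_rng.normal(size=200) > 0).astype(int)
X_test = data_rng.normal(size=(50, 5))


def toy_fit_predict(seed):
    """Fit class centroids on a bootstrap sample, return P(class 1) for X_test."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, len(X_train), size=len(X_train))
    Xb, yb = X_train[rows], y_train[rows]
    c0, c1 = Xb[yb == 0].mean(axis=0), Xb[yb == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return 1.0 / (1.0 + np.exp(d1 - d0))  # closer to c1 gives a value above 0.5


averaged = average_over_seeds(toy_fit_predict, seeds=range(10))
```

Averaging reduces the variance introduced by the random seed but multiplies training cost by the number of runs, which is part of why a single xgb model is the simpler option to adopt.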

## Comments
The whole pipeline relies on the nodule detector, and the approach reached 8th place on the DSB17 private LB, so the method deserves recognition and consideration. Moreover, the authors stated that they did not use nodule malignancy information because they incorrectly assumed it was unavailable; training the model from scratch, or fine-tuning it, on data that includes malignancy status therefore seems likely to be beneficial.

## References
- Repository: https://github.com/astoc/kaggle_dsb2017
- Technical Report: https://github.com/astoc/kaggle_dsb2017/blob/master/Solution_description_DSB2017_8th_place.pdf
- Luna16 dataset: https://luna16.grand-challenge.org/home/
6 changes: 3 additions & 3 deletions prediction/requirements/base.txt
@@ -6,8 +6,8 @@ keras==2.0.8
tensorflow==1.3.0
h5py==2.7.0
scipy==0.19.1
torch >= 0.2.0
opencv-python==3.3.0.10
pandas==0.20.3
scikit-image==0.13.0
SimpleITK==1.0.1
http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp36-cp36m-manylinux1_x86_64.whl
opencv-python==3.3.0.10
pandas==0.20.3
3 changes: 3 additions & 0 deletions prediction/src/algorithms/classify/assets/gtr123_model.ckpt
Git LFS file not shown