Merge pull request #85 from HDI-Project/issue_58_intermediate_outputs
Issue 58 intermediate outputs
csala authored May 16, 2019
2 parents e1ca77b + 7112016 commit abe6e25
Showing 14 changed files with 539 additions and 158 deletions.
34 changes: 18 additions & 16 deletions CONTRIBUTING.rst
@@ -172,24 +172,26 @@ The process of releasing a new version involves several steps combining both ``g

1. Merge what is in ``master`` branch into ``stable`` branch.
2. Update the version in ``setup.cfg``, ``mlblocks/__init__.py`` and ``HISTORY.md`` files.
3. Create a new TAG pointing at the correspoding commit in ``stable`` branch.
3. Create a new git tag pointing at the corresponding commit in ``stable`` branch.
4. Merge the new commit from ``stable`` into ``master``.
5. Update the version in ``setup.cfg`` and ``mlblocks/__init__.py`` to open the next
development interation.
5. Update the version in ``setup.cfg`` and ``mlblocks/__init__.py``
to open the next development iteration.

**Note:** Before starting the process, make sure that ``HISTORY.md`` has a section titled
**Unreleased** with the list of changes that will be included in the new version, and that
these changes are committed and available in ``master`` branch.
Normally this is just a list of the Pull Requests that have been merged since the latest version.
.. note:: Before starting the process, make sure that ``HISTORY.md`` has been updated with a new
entry that explains the changes that will be included in the new version.
Normally this is just a list of the Pull Requests that have been merged to master
since the last release.

Once this is done, just run the following commands::
Once this is done, run one of the following commands:

1. If you are releasing a patch version::

git checkout stable
git merge --no-ff master # This creates a merge commit
bumpversion release # This creates a new commit and a TAG
git push --tags origin stable
make release
git checkout master
git merge stable
bumpversion --no-tag patch
git push

2. If you are releasing a minor version::

make release-minor

3. If you are releasing a major version::

make release-major
6 changes: 5 additions & 1 deletion Makefile
@@ -98,6 +98,11 @@ fix-lint: ## fix lint issues using autoflake, autopep8, and isort
autopep8 --in-place --recursive --aggressive tests
isort --apply --atomic --recursive tests

.PHONY: lint-docs
lint-docs: ## check docs formatting with doc8 and pydocstyle
doc8 mlblocks/
pydocstyle mlblocks/


# TEST TARGETS

@@ -122,7 +127,6 @@ coverage: ## check code coverage quickly with the default Python
.PHONY: docs
docs: clean-docs ## generate Sphinx HTML documentation, including API docs
$(MAKE) -C docs html
touch docs/_build/html/.nojekyll

.PHONY: view-docs
view-docs: docs ## view docs in browser
2 changes: 1 addition & 1 deletion docs/changelog.rst
@@ -1 +1 @@
.. include:: ../HISTORY.md
.. mdinclude:: ../HISTORY.md
23 changes: 9 additions & 14 deletions docs/conf.py
@@ -18,18 +18,9 @@
# relative to the documentation root, use os.path.abspath to make it
# absolute, like shown here.

import os
import sys

import sphinx_rtd_theme # For read the docs theme
from recommonmark.parser import CommonMarkParser
# from recommonmark.transform import AutoStructify

# sys.path.insert(0, os.path.abspath('..'))

import mlblocks
#
# mlblocks.add_primitives_path('../mlblocks_primitives')

# -- General configuration ---------------------------------------------

@@ -40,13 +31,21 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
'sphinx.ext.napoleon',
'm2r',
'sphinx.ext.autodoc',
'sphinx.ext.githubpages',
'sphinx.ext.viewcode',
'sphinx.ext.napoleon',
'sphinx.ext.graphviz',
'IPython.sphinxext.ipython_console_highlighting',
'IPython.sphinxext.ipython_directive',
'autodocsumm',
]

autodoc_default_options = {
'autosummary': True,
}

ipython_execlines = ["import pandas as pd", "pd.set_option('display.width', 1000000)"]

# Add any paths that contain templates here, relative to this directory.
@@ -56,10 +55,6 @@
# You can specify multiple suffix as a list of string:
source_suffix = ['.rst', '.md', '.ipynb']

source_parsers = {
'.md': CommonMarkParser,
}

# The master toctree document.
master_doc = 'index'

2 changes: 1 addition & 1 deletion docs/getting_started/quickstart.rst
@@ -24,7 +24,7 @@ them to the `MLPipeline class`_:
from mlblocks import MLPipeline
primitives = [
'mlprimitives.feature_extraction.StringVectorizer',
'mlprimitives.custom.feature_extraction.StringVectorizer',
'sklearn.ensemble.RandomForestClassifier',
]
pipeline = MLPipeline(primitives)
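
As context for this rename, a minimal sketch of how the quickstart pipeline is used end to end, assuming the usual ``fit``/``predict`` flow and placeholder ``X_train``, ``y_train`` and ``X_test`` variables prepared elsewhere::

    from mlblocks import MLPipeline

    primitives = [
        'mlprimitives.custom.feature_extraction.StringVectorizer',
        'sklearn.ensemble.RandomForestClassifier',
    ]
    pipeline = MLPipeline(primitives)

    # X_train, y_train and X_test are placeholders, not part of this diff.
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
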
4 changes: 2 additions & 2 deletions docs/pipeline_examples/graph.rst
@@ -39,7 +39,7 @@ additional information not found inside `X`.
primitives = [
'networkx.link_prediction_feature_extraction',
'mlprimitives.feature_extraction.CategoricalEncoder',
'mlprimitives.custom.feature_extraction.CategoricalEncoder',
'sklearn.preprocessing.StandardScaler',
'xgboost.XGBClassifier'
]
@@ -69,6 +69,6 @@ additional information not found inside `X`.
.. _NetworkX Link Prediction: https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.link_prediction.html
.. _CategoricalEncoder from MLPrimitives: https://github.com/HDI-Project/MLPrimitives/blob/master/mlblocks_primitives/mlprimitives.feature_extraction.CategoricalEncoder.json
.. _CategoricalEncoder from MLPrimitives: https://github.com/HDI-Project/MLPrimitives/blob/master/mlblocks_primitives/mlprimitives.custom.feature_extraction.CategoricalEncoder.json
.. _StandardScaler from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
.. _XGBClassifier: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
22 changes: 11 additions & 11 deletions docs/pipeline_examples/text.rst
@@ -40,31 +40,31 @@ for later ones.
# set up the pipeline
primitives = [
"mlprimitives.counters.UniqueCounter",
"mlprimitives.text.TextCleaner",
"mlprimitives.counters.VocabularyCounter",
"mlprimitives.custom.counters.UniqueCounter",
"mlprimitives.custom.text.TextCleaner",
"mlprimitives.custom.counters.VocabularyCounter",
"keras.preprocessing.text.Tokenizer",
"keras.preprocessing.sequence.pad_sequences",
"keras.Sequential.LSTMTextClassifier"
]
input_names = {
"mlprimitives.counters.UniqueCounter#1": {
"mlprimitives.custom.counters.UniqueCounter#1": {
"X": "y"
}
}
output_names = {
"mlprimitives.counters.UniqueCounter#1": {
"mlprimitives.custom.counters.UniqueCounter#1": {
"counts": "classes"
},
"mlprimitives.counters.VocabularyCounter#1": {
"mlprimitives.custom.counters.VocabularyCounter#1": {
"counts": "vocabulary_size"
}
}
init_params = {
"mlprimitives.counters.VocabularyCounter#1": {
"mlprimitives.custom.counters.VocabularyCounter#1": {
"add": 1
},
"mlprimitives.text.TextCleaner#1": {
"mlprimitives.custom.text.TextCleaner#1": {
"language": "en"
},
"keras.preprocessing.sequence.pad_sequences#1": {
@@ -116,12 +116,12 @@ to encode all the string features, and go directly into the
nltk.download('stopwords')
primitives = [
'mlprimitives.text.TextCleaner',
'mlprimitives.feature_extraction.StringVectorizer',
'mlprimitives.custom.text.TextCleaner',
'mlprimitives.custom.feature_extraction.StringVectorizer',
'sklearn.ensemble.RandomForestClassifier',
]
init_params = {
'mlprimitives.text.TextCleaner': {
'mlprimitives.custom.text.TextCleaner': {
'column': 'text',
'language': 'nl'
},
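
A hedged sketch of how the renamed primitive paths and the ``init_params`` dictionary from the second example are wired into a pipeline; data loading and the NLTK setup shown above are omitted::

    from mlblocks import MLPipeline

    primitives = [
        'mlprimitives.custom.text.TextCleaner',
        'mlprimitives.custom.feature_extraction.StringVectorizer',
        'sklearn.ensemble.RandomForestClassifier',
    ]
    init_params = {
        'mlprimitives.custom.text.TextCleaner': {
            'column': 'text',
            'language': 'nl',
        },
    }

    # input_names and output_names dictionaries, as in the first example above,
    # can be passed the same way when primitive inputs or outputs need renaming.
    pipeline = MLPipeline(primitives, init_params=init_params)
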
19 changes: 8 additions & 11 deletions mlblocks/datasets.py
@@ -100,6 +100,7 @@ class Dataset():
**kwargs: Any additional keyword argument passed on initialization will be made
available as instance attributes.
"""

def __init__(self, description, data, target, score, shuffle=True, stratify=False, **kwargs):

self.name = description.splitlines()[0]
@@ -115,10 +116,10 @@ def __init__(self, description, data, target, score, shuffle=True, stratify=Fals
self.__dict__.update(kwargs)

def score(self, *args, **kwargs):
"""Scoring function for this dataset.
r"""Scoring function for this dataset.
Args:
\\*args, \\*\\*kwargs: Any given arguments and keyword arguments will be
\*args, \*\*kwargs: Any given arguments and keyword arguments will be
directly passed to the given scoring function.
Returns:
@@ -141,7 +142,7 @@ def _get_split(data, index):
else:
return data[index]

def get_splits(self, n_splits=1):
def get_splits(self, n_splits=1, random_state=0):
"""Return splits of this dataset ready for Cross Validation.
If n_splits is 1, a tuple containing the X for train and test
@@ -166,12 +167,12 @@ def get_splits(self, n_splits=1):
self.data,
self.target,
shuffle=self._shuffle,
stratify=stratify
stratify=stratify,
random_state=random_state
)

else:
cv_class = StratifiedKFold if self._stratify else KFold
cv = cv_class(n_splits=n_splits, shuffle=self._shuffle)
cv = cv_class(n_splits=n_splits, shuffle=self._shuffle, random_state=random_state)

splits = list()
for train, test in cv.split(self.data, self.target):
@@ -314,7 +316,6 @@ def load_dic28():
There exist 52,652 words (vertices in a network) having 2 up to 8 characters
in the dictionary. The obtained network has 89038 edges.
"""

dataset_path = _load('dic28')

X = _load_csv(dataset_path, 'data')
@@ -343,7 +344,6 @@ def load_nomination():
Data consists of one graph whose nodes contain two attributes, attr1 and attr2.
Associated with each node is a label that has to be learned and predicted.
"""

dataset_path = _load('nomination')

X = _load_csv(dataset_path, 'data')
@@ -362,7 +362,6 @@ def load_amazon():
co-purchased with product j, the graph contains an undirected edge from i to j.
Each product category provided by Amazon defines each ground-truth community.
"""

dataset_path = _load('amazon')

X = _load_csv(dataset_path, 'data')
@@ -382,7 +381,6 @@ def load_jester():
source: "University of California Berkeley, CA"
sourceURI: "http://eigentaste.berkeley.edu/dataset/"
"""

dataset_path = _load('jester')

X = _load_csv(dataset_path, 'data')
@@ -392,15 +390,14 @@


def load_wikiqa():
"""A Challenge Dataset for Open-Domain Question Answering.
"""Challenge Dataset for Open-Domain Question Answering.
WikiQA dataset is a publicly available set of question and sentence (QS) pairs,
collected and annotated for research on open-domain question answering.
source: "Microsoft"
sourceURI: "https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/#"
""" # noqa

dataset_path = _load('wikiqa')

data = _load_csv(dataset_path, 'data', set_index=True)
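
A minimal usage sketch of the new ``random_state`` argument added to ``get_splits``; ``load_dic28`` is one of the loaders shown above, and the return shapes follow the docstring (a single tuple when ``n_splits`` is 1, a list of tuples otherwise)::

    from mlblocks.datasets import load_dic28

    dataset = load_dic28()

    # Single train/test split; random_state makes the shuffling reproducible.
    X_train, X_test, y_train, y_test = dataset.get_splits(1, random_state=0)

    # Several folds for cross validation, one tuple per fold.
    for X_train, X_test, y_train, y_test in dataset.get_splits(n_splits=5, random_state=0):
        pass  # fit and score a pipeline on each fold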