
Issue 58 intermediate outputs #85

Merged
merged 11 commits on May 16, 2019
34 changes: 18 additions & 16 deletions CONTRIBUTING.rst
@@ -172,24 +172,26 @@ The process of releasing a new version involves several steps combining both ``g

1. Merge what is in ``master`` branch into ``stable`` branch.
2. Update the version in ``setup.cfg``, ``mlblocks/__init__.py`` and ``HISTORY.md`` files.
3. Create a new TAG pointing at the correspoding commit in ``stable`` branch.
3. Create a new git tag pointing at the corresponding commit in ``stable`` branch.
4. Merge the new commit from ``stable`` into ``master``.
5. Update the version in ``setup.cfg`` and ``mlblocks/__init__.py`` to open the next
development interation.
5. Update the version in ``setup.cfg`` and ``mlblocks/__init__.py``
to open the next development iteration.

**Note:** Before starting the process, make sure that ``HISTORY.md`` has a section titled
**Unreleased** with the list of changes that will be included in the new version, and that
these changes are committed and available in ``master`` branch.
Normally this is just a list of the Pull Requests that have been merged since the latest version.
.. note:: Before starting the process, make sure that ``HISTORY.md`` has been updated with a new
entry that explains the changes that will be included in the new version.
Normally this is just a list of the Pull Requests that have been merged to master
since the last release.

Once this is done, just run the following commands::
Once this is done, run one of the following commands:

1. If you are releasing a patch version::

git checkout stable
git merge --no-ff master # This creates a merge commit
bumpversion release # This creates a new commit and a TAG
git push --tags origin stable
make release
git checkout master
git merge stable
bumpversion --no-tag patch
git push

2. If you are releasing a minor version::

make release-minor

3. If you are releasing a major version::

make release-major
6 changes: 5 additions & 1 deletion Makefile
@@ -98,6 +98,11 @@ fix-lint: ## fix lint issues using autoflake, autopep8, and isort
autopep8 --in-place --recursive --aggressive tests
isort --apply --atomic --recursive tests

.PHONY: lint-docs
lint-docs: ## check docs formatting with doc8 and pydocstyle
doc8 mlblocks/
pydocstyle mlblocks/


# TEST TARGETS

@@ -122,7 +127,6 @@ coverage: ## check code coverage quickly with the default Python
.PHONY: docs
docs: clean-docs ## generate Sphinx HTML documentation, including API docs
$(MAKE) -C docs html
touch docs/_build/html/.nojekyll

.PHONY: view-docs
view-docs: docs ## view docs in browser
2 changes: 1 addition & 1 deletion docs/changelog.rst
@@ -1 +1 @@
.. include:: ../HISTORY.md
.. mdinclude:: ../HISTORY.md
23 changes: 9 additions & 14 deletions docs/conf.py
@@ -18,18 +18,9 @@
# relative to the documentation root, use os.path.abspath to make it
# absolute, like shown here.

import os
import sys

import sphinx_rtd_theme # For read the docs theme
from recommonmark.parser import CommonMarkParser
# from recommonmark.transform import AutoStructify

# sys.path.insert(0, os.path.abspath('..'))

import mlblocks
#
# mlblocks.add_primitives_path('../mlblocks_primitives')

# -- General configuration ---------------------------------------------

@@ -40,13 +31,21 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
'sphinx.ext.napoleon',
'm2r',
'sphinx.ext.autodoc',
'sphinx.ext.githubpages',
'sphinx.ext.viewcode',
'sphinx.ext.napoleon',
'sphinx.ext.graphviz',
'IPython.sphinxext.ipython_console_highlighting',
'IPython.sphinxext.ipython_directive',
'autodocsumm',
]

autodoc_default_options = {
'autosummary': True,
}

ipython_execlines = ["import pandas as pd", "pd.set_option('display.width', 1000000)"]

# Add any paths that contain templates here, relative to this directory.
@@ -56,10 +55,6 @@
# You can specify multiple suffix as a list of string:
source_suffix = ['.rst', '.md', '.ipynb']

source_parsers = {
'.md': CommonMarkParser,
}

# The master toctree document.
master_doc = 'index'

2 changes: 1 addition & 1 deletion docs/getting_started/quickstart.rst
@@ -24,7 +24,7 @@ them to the `MLPipeline class`_:

from mlblocks import MLPipeline
primitives = [
'mlprimitives.feature_extraction.StringVectorizer',
'mlprimitives.custom.feature_extraction.StringVectorizer',
'sklearn.ensemble.RandomForestClassifier',
]
pipeline = MLPipeline(primitives)
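Once instantiated, a pipeline like this is typically fitted on training data and then used for
prediction. A minimal sketch of that usage, assuming pre-split ``X_train``/``X_test`` and
``y_train`` variables (hypothetical names, not part of this change)::

    # Learn all the primitives in order, then produce predictions on unseen data
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)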
4 changes: 2 additions & 2 deletions docs/pipeline_examples/graph.rst
@@ -39,7 +39,7 @@ additional information not found inside `X`.

primitives = [
'networkx.link_prediction_feature_extraction',
'mlprimitives.feature_extraction.CategoricalEncoder',
'mlprimitives.custom.feature_extraction.CategoricalEncoder',
'sklearn.preprocessing.StandardScaler',
'xgboost.XGBClassifier'
]
@@ -69,6 +69,6 @@ additional information not found inside `X`.


.. _NetworkX Link Prediction: https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.link_prediction.html
.. _CategoricalEncoder from MLPrimitives: https://github.com/HDI-Project/MLPrimitives/blob/master/mlblocks_primitives/mlprimitives.feature_extraction.CategoricalEncoder.json
.. _CategoricalEncoder from MLPrimitives: https://github.com/HDI-Project/MLPrimitives/blob/master/mlblocks_primitives/mlprimitives.custom.feature_extraction.CategoricalEncoder.json
.. _StandardScaler from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
.. _XGBClassifier: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
22 changes: 11 additions & 11 deletions docs/pipeline_examples/text.rst
@@ -40,31 +40,31 @@ for later ones.

# set up the pipeline
primitives = [
"mlprimitives.counters.UniqueCounter",
"mlprimitives.text.TextCleaner",
"mlprimitives.counters.VocabularyCounter",
"mlprimitives.custom.counters.UniqueCounter",
"mlprimitives.custom.text.TextCleaner",
"mlprimitives.custom.counters.VocabularyCounter",
"keras.preprocessing.text.Tokenizer",
"keras.preprocessing.sequence.pad_sequences",
"keras.Sequential.LSTMTextClassifier"
]
input_names = {
"mlprimitives.counters.UniqueCounter#1": {
"mlprimitives.custom.counters.UniqueCounter#1": {
"X": "y"
}
}
output_names = {
"mlprimitives.counters.UniqueCounter#1": {
"mlprimitives.custom.counters.UniqueCounter#1": {
"counts": "classes"
},
"mlprimitives.counters.VocabularyCounter#1": {
"mlprimitives.custom.counters.VocabularyCounter#1": {
"counts": "vocabulary_size"
}
}
init_params = {
"mlprimitives.counters.VocabularyCounter#1": {
"mlprimitives.custom.counters.VocabularyCounter#1": {
"add": 1
},
"mlprimitives.text.TextCleaner#1": {
"mlprimitives.custom.text.TextCleaner#1": {
"language": "en"
},
"keras.preprocessing.sequence.pad_sequences#1": {
@@ -116,12 +116,12 @@ to encode all the string features, and go directly into the
nltk.download('stopwords')

primitives = [
'mlprimitives.text.TextCleaner',
'mlprimitives.feature_extraction.StringVectorizer',
'mlprimitives.custom.text.TextCleaner',
'mlprimitives.custom.feature_extraction.StringVectorizer',
'sklearn.ensemble.RandomForestClassifier',
]
init_params = {
'mlprimitives.text.TextCleaner': {
'mlprimitives.custom.text.TextCleaner': {
'column': 'text',
'language': 'nl'
},
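The ``primitives`` list and the ``input_names``, ``output_names`` and ``init_params`` dictionaries
shown in this file are meant to be handed to ``MLPipeline`` together. A minimal sketch of that
wiring, with the exact call kept generic on purpose (the keyword names follow the documented
examples and should be treated as illustrative)::

    from mlblocks import MLPipeline

    # init_params sets hyperparameters used when each primitive is instantiated,
    # while input_names / output_names rename the pipeline variables that
    # specific blocks consume and produce.
    pipeline = MLPipeline(
        primitives,
        init_params=init_params,
        input_names=input_names,
        output_names=output_names,
    )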
19 changes: 8 additions & 11 deletions mlblocks/datasets.py
@@ -100,6 +100,7 @@ class Dataset():
**kwargs: Any additional keyword argument passed on initialization will be made
available as instance attributes.
"""

def __init__(self, description, data, target, score, shuffle=True, stratify=False, **kwargs):

self.name = description.splitlines()[0]
@@ -115,10 +116,10 @@ def __init__(self, description, data, target, score, shuffle=True, stratify=Fals
self.__dict__.update(kwargs)

def score(self, *args, **kwargs):
"""Scoring function for this dataset.
r"""Scoring function for this dataset.

Args:
\\*args, \\*\\*kwargs: Any given arguments and keyword arguments will be
\*args, \*\*kwargs: Any given arguments and keyword arguments will be
directly passed to the given scoring function.

Returns:
@@ -141,7 +142,7 @@ def _get_split(data, index):
else:
return data[index]

def get_splits(self, n_splits=1):
def get_splits(self, n_splits=1, random_state=0):
"""Return splits of this dataset ready for Cross Validation.

If n_splits is 1, a tuple containing the X for train and test
@@ -166,12 +167,13 @@ def get_splits(self, n_splits=1):
self.data,
self.target,
shuffle=self._shuffle,
stratify=stratify
stratify=stratify,
random_state=random_state
)

else:
cv_class = StratifiedKFold if self._stratify else KFold
cv = cv_class(n_splits=n_splits, shuffle=self._shuffle)
cv = cv_class(n_splits=n_splits, shuffle=self._shuffle, random_state=random_state)

splits = list()
for train, test in cv.split(self.data, self.target):
@@ -314,7 +316,6 @@ def load_dic28():
There exist 52,652 words (vertices in a network) having 2 up to 8 characters
in the dictionary. The obtained network has 89038 edges.
"""

dataset_path = _load('dic28')

X = _load_csv(dataset_path, 'data')
@@ -343,7 +344,6 @@ def load_nomination():
Data consists of one graph whose nodes contain two attributes, attr1 and attr2.
Associated with each node is a label that has to be learned and predicted.
"""

dataset_path = _load('nomination')

X = _load_csv(dataset_path, 'data')
@@ -362,7 +362,6 @@ def load_amazon():
co-purchased with product j, the graph contains an undirected edge from i to j.
Each product category provided by Amazon defines each ground-truth community.
"""

dataset_path = _load('amazon')

X = _load_csv(dataset_path, 'data')
@@ -382,7 +381,6 @@ def load_jester():
source: "University of California Berkeley, CA"
sourceURI: "http://eigentaste.berkeley.edu/dataset/"
"""

dataset_path = _load('jester')

X = _load_csv(dataset_path, 'data')
@@ -392,15 +390,14 @@


def load_wikiqa():
"""A Challenge Dataset for Open-Domain Question Answering.
"""Challenge Dataset for Open-Domain Question Answering.

WikiQA dataset is a publicly available set of question and sentence (QS) pairs,
collected and annotated for research on open-domain question answering.

source: "Microsoft"
sourceURI: "https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/#"
""" # noqa

dataset_path = _load('wikiqa')

data = _load_csv(dataset_path, 'data', set_index=True)
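As a usage note for the ``get_splits`` change above: the new ``random_state`` argument makes the
generated partitions reproducible across runs. A minimal sketch, using one of the loaders shown
in this module (the call pattern is illustrative)::

    from mlblocks.datasets import load_wikiqa

    # With n_splits=1 a single train/test partition is returned; the splits are
    # shuffled (and stratified when the dataset asks for it), and reusing the
    # same random_state yields the same partition every time.
    dataset = load_wikiqa()
    X_train, X_test, y_train, y_test = dataset.get_splits(n_splits=1, random_state=0)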