Merge pull request #85 from HDI-Project/issue_58_intermediate_outputs
Issue 58 intermediate outputs
csala authored May 16, 2019
2 parents e1ca77b + 7112016 commit abe6e25
Showing 14 changed files with 539 additions and 158 deletions.
34 changes: 18 additions & 16 deletions CONTRIBUTING.rst
@@ -172,24 +172,26 @@ The process of releasing a new version involves several steps combining both ``g

1. Merge what is in ``master`` branch into ``stable`` branch.
2. Update the version in ``setup.cfg``, ``mlblocks/__init__.py`` and ``HISTORY.md`` files.
3. Create a new TAG pointing at the correspoding commit in ``stable`` branch.
3. Create a new git tag pointing at the corresponding commit in ``stable`` branch.
4. Merge the new commit from ``stable`` into ``master``.
5. Update the version in ``setup.cfg`` and ``mlblocks/__init__.py`` to open the next
development interation.
5. Update the version in ``setup.cfg`` and ``mlblocks/__init__.py``
to open the next development iteration.

**Note:** Before starting the process, make sure that ``HISTORY.md`` has a section titled
**Unreleased** with the list of changes that will be included in the new version, and that
these changes are committed and available in ``master`` branch.
Normally this is just a list of the Pull Requests that have been merged since the latest version.
.. note:: Before starting the process, make sure that ``HISTORY.md`` has been updated with a new
entry that explains the changes that will be included in the new version.
Normally this is just a list of the Pull Requests that have been merged to master
since the last release.

Once this is done, just run the following commands::
Once this is done, run one of the following commands:

1. If you are releasing a patch version::

git checkout stable
git merge --no-ff master # This creates a merge commit
bumpversion release # This creates a new commit and a TAG
git push --tags origin stable
make release
git checkout master
git merge stable
bumpversion --no-tag patch
git push

2. If you are releasing a minor version::

make release-minor

3. If you are releasing a major version::

make release-major
6 changes: 5 additions & 1 deletion Makefile
@@ -98,6 +98,11 @@ fix-lint: ## fix lint issues using autoflake, autopep8, and isort
autopep8 --in-place --recursive --aggressive tests
isort --apply --atomic --recursive tests

.PHONY: lint-docs
lint-docs: ## check docs formatting with doc8 and pydocstyle
doc8 mlblocks/
pydocstyle mlblocks/


# TEST TARGETS

@@ -122,7 +127,6 @@ coverage: ## check code coverage quickly with the default Python
.PHONY: docs
docs: clean-docs ## generate Sphinx HTML documentation, including API docs
$(MAKE) -C docs html
touch docs/_build/html/.nojekyll

.PHONY: view-docs
view-docs: docs ## view docs in browser
2 changes: 1 addition & 1 deletion docs/changelog.rst
@@ -1 +1 @@
.. include:: ../HISTORY.md
.. mdinclude:: ../HISTORY.md
23 changes: 9 additions & 14 deletions docs/conf.py
@@ -18,18 +18,9 @@
# relative to the documentation root, use os.path.abspath to make it
# absolute, like shown here.

import os
import sys

import sphinx_rtd_theme # For read the docs theme
from recommonmark.parser import CommonMarkParser
# from recommonmark.transform import AutoStructify

# sys.path.insert(0, os.path.abspath('..'))

import mlblocks
#
# mlblocks.add_primitives_path('../mlblocks_primitives')

# -- General configuration ---------------------------------------------

@@ -40,13 +31,21 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
'sphinx.ext.napoleon',
'm2r',
'sphinx.ext.autodoc',
'sphinx.ext.githubpages',
'sphinx.ext.viewcode',
'sphinx.ext.napoleon',
'sphinx.ext.graphviz',
'IPython.sphinxext.ipython_console_highlighting',
'IPython.sphinxext.ipython_directive',
'autodocsumm',
]

autodoc_default_options = {
'autosummary': True,
}

ipython_execlines = ["import pandas as pd", "pd.set_option('display.width', 1000000)"]

# Add any paths that contain templates here, relative to this directory.
@@ -56,10 +55,6 @@
# You can specify multiple suffix as a list of string:
source_suffix = ['.rst', '.md', '.ipynb']

source_parsers = {
'.md': CommonMarkParser,
}

# The master toctree document.
master_doc = 'index'

2 changes: 1 addition & 1 deletion docs/getting_started/quickstart.rst
@@ -24,7 +24,7 @@ them to the `MLPipeline class`_:
from mlblocks import MLPipeline
primitives = [
'mlprimitives.feature_extraction.StringVectorizer',
'mlprimitives.custom.feature_extraction.StringVectorizer',
'sklearn.ensemble.RandomForestClassifier',
]
pipeline = MLPipeline(primitives)
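
As context for this rename, a minimal sketch of how the quickstart pipeline is used end to end, assuming the usual ``fit``/``predict`` flow and placeholder ``X_train``, ``y_train`` and ``X_test`` variables prepared elsewhere::

    from mlblocks import MLPipeline

    primitives = [
        'mlprimitives.custom.feature_extraction.StringVectorizer',
        'sklearn.ensemble.RandomForestClassifier',
    ]
    pipeline = MLPipeline(primitives)

    # X_train, y_train and X_test are placeholders, not part of this diff.
    pipeline.fit(X_train, y_train)
    predictions = pipeline.predict(X_test)
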
4 changes: 2 additions & 2 deletions docs/pipeline_examples/graph.rst
@@ -39,7 +39,7 @@ additional information not found inside `X`.
primitives = [
'networkx.link_prediction_feature_extraction',
'mlprimitives.feature_extraction.CategoricalEncoder',
'mlprimitives.custom.feature_extraction.CategoricalEncoder',
'sklearn.preprocessing.StandardScaler',
'xgboost.XGBClassifier'
]
@@ -69,6 +69,6 @@ additional information not found inside `X`.
.. _NetworkX Link Prediction: https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.link_prediction.html
.. _CategoricalEncoder from MLPrimitives: https://github.com/HDI-Project/MLPrimitives/blob/master/mlblocks_primitives/mlprimitives.feature_extraction.CategoricalEncoder.json
.. _CategoricalEncoder from MLPrimitives: https://github.com/HDI-Project/MLPrimitives/blob/master/mlblocks_primitives/mlprimitives.custom.feature_extraction.CategoricalEncoder.json
.. _StandardScaler from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
.. _XGBClassifier: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
22 changes: 11 additions & 11 deletions docs/pipeline_examples/text.rst
@@ -40,31 +40,31 @@ for later ones.
# set up the pipeline
primitives = [
"mlprimitives.counters.UniqueCounter",
"mlprimitives.text.TextCleaner",
"mlprimitives.counters.VocabularyCounter",
"mlprimitives.custom.counters.UniqueCounter",
"mlprimitives.custom.text.TextCleaner",
"mlprimitives.custom.counters.VocabularyCounter",
"keras.preprocessing.text.Tokenizer",
"keras.preprocessing.sequence.pad_sequences",
"keras.Sequential.LSTMTextClassifier"
]
input_names = {
"mlprimitives.counters.UniqueCounter#1": {
"mlprimitives.custom.counters.UniqueCounter#1": {
"X": "y"
}
}
output_names = {
"mlprimitives.counters.UniqueCounter#1": {
"mlprimitives.custom.counters.UniqueCounter#1": {
"counts": "classes"
},
"mlprimitives.counters.VocabularyCounter#1": {
"mlprimitives.custom.counters.VocabularyCounter#1": {
"counts": "vocabulary_size"
}
}
init_params = {
"mlprimitives.counters.VocabularyCounter#1": {
"mlprimitives.custom.counters.VocabularyCounter#1": {
"add": 1
},
"mlprimitives.text.TextCleaner#1": {
"mlprimitives.custom.text.TextCleaner#1": {
"language": "en"
},
"keras.preprocessing.sequence.pad_sequences#1": {
@@ -116,12 +116,12 @@ to encode all the string features, and go directly into the
nltk.download('stopwords')
primitives = [
'mlprimitives.text.TextCleaner',
'mlprimitives.feature_extraction.StringVectorizer',
'mlprimitives.custom.text.TextCleaner',
'mlprimitives.custom.feature_extraction.StringVectorizer',
'sklearn.ensemble.RandomForestClassifier',
]
init_params = {
'mlprimitives.text.TextCleaner': {
'mlprimitives.custom.text.TextCleaner': {
'column': 'text',
'language': 'nl'
},
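
A hedged sketch of how the renamed primitive paths and the ``init_params`` dictionary from the second example are wired into a pipeline; data loading and the NLTK setup shown above are omitted::

    from mlblocks import MLPipeline

    primitives = [
        'mlprimitives.custom.text.TextCleaner',
        'mlprimitives.custom.feature_extraction.StringVectorizer',
        'sklearn.ensemble.RandomForestClassifier',
    ]
    init_params = {
        'mlprimitives.custom.text.TextCleaner': {
            'column': 'text',
            'language': 'nl',
        },
    }

    # input_names and output_names dictionaries, as in the first example above,
    # can be passed the same way when primitive inputs or outputs need renaming.
    pipeline = MLPipeline(primitives, init_params=init_params)
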
19 changes: 8 additions & 11 deletions mlblocks/datasets.py
@@ -100,6 +100,7 @@ class Dataset():
**kwargs: Any additional keyword argument passed on initialization will be made
available as instance attributes.
"""

def __init__(self, description, data, target, score, shuffle=True, stratify=False, **kwargs):

self.name = description.splitlines()[0]
@@ -115,10 +116,10 @@ def __init__(self, description, data, target, score, shuffle=True, stratify=Fals
self.__dict__.update(kwargs)

def score(self, *args, **kwargs):
"""Scoring function for this dataset.
r"""Scoring function for this dataset.
Args:
\\*args, \\*\\*kwargs: Any given arguments and keyword arguments will be
\*args, \*\*kwargs: Any given arguments and keyword arguments will be
directly passed to the given scoring function.
Returns:
@@ -141,7 +142,7 @@ def _get_split(data, index):
else:
return data[index]

def get_splits(self, n_splits=1):
def get_splits(self, n_splits=1, random_state=0):
"""Return splits of this dataset ready for Cross Validation.
If n_splits is 1, a tuple containing the X for train and test
@@ -166,12 +167,12 @@ def get_splits(self, n_splits=1):
self.data,
self.target,
shuffle=self._shuffle,
stratify=stratify
stratify=stratify,
random_state=random_state
)

else:
cv_class = StratifiedKFold if self._stratify else KFold
cv = cv_class(n_splits=n_splits, shuffle=self._shuffle)
cv = cv_class(n_splits=n_splits, shuffle=self._shuffle, random_state=random_state)

splits = list()
for train, test in cv.split(self.data, self.target):
@@ -314,7 +316,6 @@ def load_dic28():
There exist 52,652 words (vertices in a network) having 2 up to 8 characters
in the dictionary. The obtained network has 89038 edges.
"""

dataset_path = _load('dic28')

X = _load_csv(dataset_path, 'data')
@@ -343,7 +344,6 @@ def load_nomination():
Data consists of one graph whose nodes contain two attributes, attr1 and attr2.
Associated with each node is a label that has to be learned and predicted.
"""

dataset_path = _load('nomination')

X = _load_csv(dataset_path, 'data')
@@ -362,7 +362,6 @@ def load_amazon():
co-purchased with product j, the graph contains an undirected edge from i to j.
Each product category provided by Amazon defines each ground-truth community.
"""

dataset_path = _load('amazon')

X = _load_csv(dataset_path, 'data')
@@ -382,7 +381,6 @@ def load_jester():
source: "University of California Berkeley, CA"
sourceURI: "http://eigentaste.berkeley.edu/dataset/"
"""

dataset_path = _load('jester')

X = _load_csv(dataset_path, 'data')
@@ -392,15 +390,14 @@


def load_wikiqa():
"""A Challenge Dataset for Open-Domain Question Answering.
"""Challenge Dataset for Open-Domain Question Answering.
WikiQA dataset is a publicly available set of question and sentence (QS) pairs,
collected and annotated for research on open-domain question answering.
source: "Microsoft"
sourceURI: "https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/#"
""" # noqa

dataset_path = _load('wikiqa')

data = _load_csv(dataset_path, 'data', set_index=True)
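
A minimal usage sketch of the new ``random_state`` argument added to ``get_splits``; ``load_dic28`` is one of the loaders shown above, and the return shapes follow the docstring (a single tuple when ``n_splits`` is 1, a list of tuples otherwise)::

    from mlblocks.datasets import load_dic28

    dataset = load_dic28()

    # Single train/test split; random_state makes the shuffling reproducible.
    X_train, X_test, y_train, y_test = dataset.get_splits(1, random_state=0)

    # Several folds for cross validation, one tuple per fold.
    for X_train, X_test, y_train, y_test in dataset.get_splits(n_splits=5, random_state=0):
        pass  # fit and score a pipeline on each fold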