Documentation update 1.1 (#340)
* General readthedocs documentation rewrite to make things easier to find.
* Fix bug in assigning privacy
qubixes authored Nov 15, 2024
1 parent 1e53df6 commit 04d3901
Showing 31 changed files with 846 additions and 1,339 deletions.
48 changes: 47 additions & 1 deletion docs/source/about/license.rst → docs/source/about.rst
@@ -1,5 +1,51 @@
About
=====

.. _contact us:

Contact us
----------

**Metasyn** is a project by the `ODISSEI Social Data Science (SoDa) <https://odissei-data.nl/nl/soda/>`_ team and is currently being maintained by Erik-Jan van Kesteren, Raoul Schram and Samuel Spithorst.

.. image:: /images/logos/soda.png
   :alt: SODA_logo
   :width: 200
   :align: center
   :target: https://odissei-data.nl/nl/soda/

Do you have questions, suggestions, or remarks on the technical implementation? We welcome your feedback and contributions!

GitHub repository and issue tracker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Feel free to check out our `GitHub repository <https://github.com/sodascience/metasyn>`_ for the latest updates and to browse the source code.
If you encounter any problems or have ideas for improvements, please file an issue on our `GitHub issue tracker <https://github.com/sodascience/metasyn/issues>`_.

.. image:: https://img.shields.io/badge/GitHub-blue?logo=github&link=https%3A%2F%2Fgithub.com%2Fsodascience%2Fmetasyn
   :alt: GitHub Repository Button
   :target: https://github.com/sodascience/metasyn

.. image:: https://img.shields.io/badge/GitHub-Issue_Tracker-blue?logo=github&link=https%3A%2F%2Fgithub.com%2Fsodascience%2Fmetasyn%2Fissues
   :alt: GitHub Issue Tracker Button
   :target: https://github.com/sodascience/metasyn/issues

Contributing
^^^^^^^^^^^^

We highly encourage and appreciate pull requests (PRs) from the community. If you want to discover how you can contribute to metasyn, please refer to our :doc:`/developer/developer`.

Maintainers
^^^^^^^^^^^

Feel free to contact one of the maintainers directly:

* Erik-Jan van Kesteren: `https://github.com/vankesteren <https://github.com/vankesteren>`_

* Raoul Schram: `https://github.com/qubixes <https://github.com/qubixes>`_

License
========
-------

Metasyn is released under the MIT License.

11 changes: 0 additions & 11 deletions docs/source/about/about.rst

This file was deleted.

40 changes: 0 additions & 40 deletions docs/source/about/contact.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/api/developer_reference.rst
@@ -1,4 +1,4 @@
Developer Reference
Developer reference
===================

This section is intended for those who want to contribute to the metasyn package, or who simply want a deeper understanding of how it works. It contains the classes, functions and modules that are not covered in the rest of the API reference. These are mostly elements that users do not call directly, but that developers of the metasyn package need in order to understand the architecture.
2 changes: 1 addition & 1 deletion docs/source/api/metasyn.rst
@@ -1,4 +1,4 @@
API Reference
API reference
=============

This section aims to give an overview of all classes, functions and methods in the metasyn package.
38 changes: 10 additions & 28 deletions docs/source/usage/cli.rst → docs/source/cli.rst
@@ -1,4 +1,4 @@
Command-line Interface
Command-Line Interface
======================
Metasyn provides a command-line interface (CLI) for accessing core functionality without needing to write any Python code.

@@ -28,7 +28,7 @@ Accessing the CLI
-----------------
If you have installed the main ``metasyn`` package, the CLI should be available automatically. You can find instructions on how to install ``metasyn`` in the :doc:`installation` section of the documentation.

Alternatively, the CLI can be accessed through a Docker container, allowing you to run ``metasyn`` in an isolated environment without installing the package on your system. This can be useful, for example, when trying out ``metasyn`` without affecting your existing Python environment.
Alternatively, the CLI can be accessed through a Docker container, allowing you to run ``metasyn`` in an isolated environment without installing the package on your system.

Here's how you can use Docker to access Metasyn's CLI:

@@ -81,9 +81,9 @@ The ``metasyn`` CLI should now be up and running within the Docker container and
docker run -v $(pwd):/wd sodateam/metasyn:v1.0.3 --help
Creating Generative Metadata
Creating generative metadata
----------------------------
The ``create-meta`` subcommand combines the :doc:`estimation </usage/generating_metaframes>` and :doc:`serialization </usage/saving_metaframes>` steps in the pipeline into one. This allows you to generate generative metadata for a tabular dataset (in CSV format) and store it in a GMF (Generative Metadata Format) file.
The ``create-meta`` subcommand combines the estimation and serialization steps in the pipeline into one. This allows you to generate generative metadata for a tabular dataset (in CSV format) and store it in a GMF (Generative Metadata Format) file.

.. image:: /images/pipeline_cli_create_meta.png
   :alt: Creating Generative Metadata using the CLI
@@ -130,35 +130,16 @@ An example of how to use the ``create-meta`` subcommand is as follows:
The ``create-meta`` command also takes one optional argument:

* ``--config [config-file]``: The filepath and name of a .toml configuration file that specifies distribution behavior. For example, if we want to set a column to be unique or to have a specific distribution, we can do so by specifying it in the configuration file. Information on how to use these files can be found in the :doc:`/usage/config_files` section.
* ``--config [config-file]``: The filepath and name of a .toml configuration file that specifies distribution behavior. For example, if we want to set a column to be unique or to have a specific distribution, we can do so by specifying it in the configuration file. Information on how to use these files can be found in the :doc:`improve_synth` section.

.. admonition:: Generating a GMF file without a dataset

It is also possible to create a GMF file (and to generate synthetic data from there) without ever inputting a dataset. Adding columns not present in the input dataset is also possible using the same method.
See our :doc:`section<datafree>` for how to create a configuration file without using a dataset. In this case, you will not need to supply any ``[input]`` to the ``create-meta`` command.

This can be done by supplying a configuration file that fully specifies the columns that should be generated. For each column to be generated, you also need to set the ``data_free`` parameter to ``true``.

It is also required to set the number of rows under the ``general`` section.

For example, to create a GMF file that can be used to generate 100 rows of synthetic data with a single column ``PassengerId`` that is unique and has a discrete distribution, you can use the following configuration file:

.. code-block:: toml

   n_rows = 100

   [[var]]
   name = "PassengerId"
   data_free = true
   prop_missing = 0.0
   description = "ID of the unfortunate passenger."
   var_type = "discrete"
   distribution = {implements = "core.unique_key", unique = true, parameters = {consecutive = true, low = 0}}

Generating Synthetic Data
Generating synthetic data
-------------------------
The ``synthesize`` subcommand combines the :doc:`deserialization </usage/saving_metaframes>` and :doc:`generation </usage/generating_synthetic_data>` steps in the pipeline into one, and allows you to generate a synthetic dataset from a previously saved MetaFrame (stored as a GMF file).
The ``synthesize`` subcommand combines the deserialization and generation steps in the pipeline into one, and allows you to generate a synthetic dataset from a previously saved MetaFrame (stored as a GMF file).

.. image:: /images/pipeline_cli_synthesize.png
   :alt: Creating Synthetic Data from a GMF file using the CLI
@@ -207,6 +188,7 @@ An example of how to use the ``synthesize`` subcommand is as follows:
The ``synthesize`` command also takes two optional arguments:

- ``-n [rows]`` or ``--num_rows [rows]``: To generate a specific number of data rows.
- ``-p`` or ``--preview``: To preview the first six rows of synthesized data. This can be extremely useful for quick data validation without saving it to a file.

@@ -217,7 +199,7 @@ The ``synthesize`` command also takes two optional arguments:



Creating Validation schemas
Creating validation schemas
---------------------------

The ``schema`` subcommand generates a schema that describes the expected format of the GMF files. These can be used to validate GMF files before importing and loading them into a :obj:`MetaFrame<metasyn.metaframe.MetaFrame>`.
58 changes: 58 additions & 0 deletions docs/source/datafree.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
Synthetic data without raw data
===============================

It is also possible to create a GMF file without any input dataset, or to add additional fictive columns to those already present in a dataset.

To do so, you need to fully specify each column (variable) you want to generate. You will also need to set the ``data_free`` parameter to true,
to indicate that the variable will be generated from scratch, instead of being based on existing data.
Finally, you will need to set the number of rows to generate.

For example, the following configuration file will generate a GMF file with 100 rows of synthetic data, with a unique key column named ``PassengerId``:

.. code-block:: toml

   n_rows = 100

   [[var]]
   name = "PassengerId"
   data_free = true
   prop_missing = 0.0
   description = "ID of the unfortunate passenger."
   var_type = "discrete"
   distribution = {implements = "core.unique_key", unique = true, parameters = {consecutive = true, low = 0}}

See the :doc:`distribution page </api/metasyn.distribution>` for a list of distributions that can be chosen from.
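
As a rough illustration of what "fully specified" means here, the following Python sketch parses the configuration shown above with the standard-library ``tomllib`` module (Python 3.11 or newer) and checks that the number of rows is set and that every data-free variable has a name, a type and a distribution. The helper ``check_data_free_config`` and the list of required keys are assumptions made for this example. They are not part of metasyn, which performs its own validation.

.. code-block:: python

   import tomllib  # standard library since Python 3.11

   CONFIG = """
   n_rows = 100

   [[var]]
   name = "PassengerId"
   data_free = true
   prop_missing = 0.0
   description = "ID of the unfortunate passenger."
   var_type = "discrete"
   distribution = {implements = "core.unique_key", unique = true, parameters = {consecutive = true, low = 0}}
   """

   # Keys this sketch treats as required for a data-free variable (an assumption).
   REQUIRED_KEYS = ("name", "var_type", "distribution")


   def check_data_free_config(config: dict) -> list[str]:
       """Return a list of problems found; an empty list means none were found."""
       problems = []
       if "n_rows" not in config:
           problems.append("n_rows is not set")
       for var in config.get("var", []):
           if not var.get("data_free", False):
               continue  # variables based on real data are not checked here
           for key in REQUIRED_KEYS:
               if key not in var:
                   problems.append(f"{var.get('name', '<unnamed>')}: missing {key}")
       return problems


   config = tomllib.loads(CONFIG)
   problems = check_data_free_config(config)
   print(problems)

For the configuration above the sketch finds nothing missing. Removing ``n_rows`` or the ``distribution`` entry would add a message to the returned list.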

Setting defaults
----------------

Repeating the same settings for every variable can be tedious, so you can also define defaults for
variables. The following can be set by default: ``data_free``, ``prop_missing``, ``distribution`` and ``privacy``.
Since the distribution depends on the type of the column, you can set the default distribution per column type.
Below is an example of how to set defaults:

.. code-block:: toml

   [defaults]
   data_free = true
   prop_missing = 0.1

   [defaults.distribution]
   discrete = {implements = "core.uniform", parameters = {lower = 1, upper = 30}}
   continuous = {implements = "core.normal", parameters = {mean = 0, stdev = 1}}
   string = {implements = "core.faker", parameters = {faker_type = "name", locale = "en_US"}}

With this block, you won't have to set the ``data_free`` parameter, and the default
proportion of missing values is set to 0.1. For discrete columns, the distribution
will be set to a uniform distribution between 1 and 30, etc.

With the defaults set as above, you only need to specify the ``name`` and ``var_type`` of
the columns:

.. code-block:: toml

   [[var]]
   name = "some discrete variable"
   var_type = "discrete"

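
One way to picture how these defaults combine with per-variable settings is the Python sketch below. It is only an illustration: ``apply_defaults`` is a hypothetical helper, not metasyn's implementation, and it assumes that values set explicitly on a variable take precedence over the defaults. It uses the standard-library ``tomllib`` module (Python 3.11 or newer).

.. code-block:: python

   import tomllib  # standard library since Python 3.11

   CONFIG = """
   [defaults]
   data_free = true
   prop_missing = 0.1

   [defaults.distribution]
   discrete = {implements = "core.uniform", parameters = {lower = 1, upper = 30}}
   continuous = {implements = "core.normal", parameters = {mean = 0, stdev = 1}}

   [[var]]
   name = "some discrete variable"
   var_type = "discrete"
   """


   def apply_defaults(var: dict, defaults: dict) -> dict:
       """Merge defaults into one variable; explicit variable settings win (assumed)."""
       # Start with the scalar defaults such as data_free and prop_missing.
       resolved = {key: value for key, value in defaults.items() if key != "distribution"}
       # The default distribution is looked up by the variable's type.
       per_type = defaults.get("distribution", {})
       if var.get("var_type") in per_type:
           resolved["distribution"] = per_type[var["var_type"]]
       # Anything set on the variable itself overrides the defaults.
       resolved.update(var)
       return resolved


   config = tomllib.loads(CONFIG)
   resolved = [apply_defaults(var, config.get("defaults", {})) for var in config["var"]]
   print(resolved[0])

In this sketch the resolved variable ends up with ``data_free = true``, ``prop_missing = 0.1`` and the uniform distribution between 1 and 30, even though only its ``name`` and ``var_type`` were written out.
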
30 changes: 0 additions & 30 deletions docs/source/developer/GMF.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/developer/contributing.rst
@@ -5,7 +5,7 @@ Thank you for your interest in contributing to metasyn! We greatly appreciate an

Feedback, suggestions and issues:
---------------------------------
If you encounter a bug or have a feature request, you can report it in the `issue tracker <https://github.com/sodascience/metasyn/issues>`_. Detailed bug reports and well-defined feature requests are highly appreciated. Additionally, you can help us by leaving suggestions or feedback on how to enhance ``metasyn`` on the project's `GitHub repository <https://github.com/sodascience/metasyn>`_. More information on getting in touch with us can be found on our :doc:`contact page </about/contact>`.
If you encounter a bug or have a feature request, you can report it in the `issue tracker <https://github.com/sodascience/metasyn/issues>`_. Detailed bug reports and well-defined feature requests are highly appreciated. Additionally, you can help us by leaving suggestions or feedback on how to enhance ``metasyn`` on the project's `GitHub repository <https://github.com/sodascience/metasyn>`_. More information on getting in touch with us can be found on our :ref:`contact page <contact us>`.

.. image:: https://img.shields.io/badge/GitHub-blue?logo=github&link=https%3A%2F%2Fgithub.com%2Fsodascience%2Fmetasyn
   :alt: GitHub Repository Button
7 changes: 1 addition & 6 deletions docs/source/developer/developer.rst
Original file line number Diff line number Diff line change
@@ -1,20 +1,15 @@
Developer Guide
Developer guide
=================

This guide is mainly directed at developers working on the ``metasyn`` package, but it may be useful for users that want
more control over the workings of the ``metasyn`` package.

.. warning::
   This section is intended for developers and advanced users. If you are new to ``metasyn``, please refer to the :doc:`/usage/usage` first.


.. toctree::
   :maxdepth: 1
   :caption: Sections:

   contributing
   overview
   GMF
   distributions
   plugins

8 changes: 0 additions & 8 deletions docs/source/developer/distributions.rst
@@ -62,10 +62,6 @@ If the distribution has subsequently draws that are not independent, it is recom

It is recommended to also implement :meth:`~metasyn.distribution.base.BaseDistribution.information_criterion`. This is a class method used to determine which distribution gets selected during the fitting process for a series of values. The distribution with the lowest information criterion of the correct variable type will be selected. For discrete and continuous distributions it is currently implemented as `BIC <https://en.wikipedia.org/wiki/Bayesian_information_criterion>`_.
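
For reference, the BIC of a fitted model with ``k`` free parameters, ``n`` observations and maximized log-likelihood ``ln L`` is ``k * ln(n) - 2 * ln L``. The Python sketch below shows the general idea of picking the candidate with the lowest value. The function names and the example numbers are made up for illustration, and metasyn's own ``information_criterion`` implementation may differ in its details.

.. code-block:: python

   import math


   def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
       """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
       return n_params * math.log(n_obs) - 2.0 * log_likelihood


   def select_lowest(candidates: dict, n_obs: int):
       """Pick the candidate with the lowest BIC.

       ``candidates`` maps a label to a (log_likelihood, n_params) tuple.
       """
       scores = {
           label: bic(log_lik, n_params, n_obs)
           for label, (log_lik, n_params) in candidates.items()
       }
       return min(scores, key=scores.get), scores


   # Hypothetical fits: the three-parameter model fits slightly better,
   # but not by enough to offset the penalty for its extra parameter.
   candidates = {"two_param": (-120.0, 2), "three_param": (-119.5, 3)}
   best, scores = select_lowest(candidates, n_obs=100)
   print(best, scores)

The penalty term ``k * ln(n)`` is what keeps a more flexible distribution from being selected unless it improves the likelihood enough to justify its additional parameters.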

.. warning::

   Despite not being an abstract method in :class:`~metasyn.distribution.base.BaseDistribution`, it is recommended to implement a constructor (``__init__``) method in derived classes to initialize a distribution with a set of (distribution specific) parameters.

There are more methods, but this is a good starting point when implementing a new distribution.
For an overview of the rest of the methods and implementation details, refer to the :class:`~metasyn.distribution.base.BaseDistribution` class.

@@ -121,10 +117,6 @@ For example, the unique variants of the :class:`~metasyn.distribution.regex.Rege
@metadist(implements="core.faker", var_type="string")
class UniqueFakerDistribution(UniqueDistributionMixin, FakerDistribution):
.. warning::

   This mixin class has a default implementation that will work for many distributions, but it may not be appropriate for all. Be sure to check the implementation before using it.

Other modules
~~~~~~~~~~~~~