Merge branch 'var-toml' into docstrings-class-level
Samuwhale committed Jan 22, 2024
2 parents c4e7167 + c10ece1 commit 4ff7fe8
Showing 33 changed files with 974 additions and 409 deletions.
6 changes: 3 additions & 3 deletions docs/source/developer/overview.rst
Expand Up @@ -28,8 +28,7 @@ A :obj:`~metasyn.MetaVar` contains information on the variable type (``var_type`

This class is considered a passthrough class used by the :obj:`~metasyn.MetaFrame` class, and is not intended to be used directly by the user. It contains the following functionality:

- **Detecting variable types**: The :meth:`~metasyn.MetaVar.detect` method detects the variable class(es) of a series or dataframe. This method does not fit any distribution, but it does infer the correct types for the :obj:`~metasyn.MetaVar` and saves the ``Series`` for later fitting.
- **Fitting distributions**: The :meth:`~metasyn.MetaVar.fit` method fits distributions to the data. Here you can set the distribution, privacy package and uniqueness for the variable again.
- **Fitting distributions**: The :meth:`~metasyn.MetaVar.fit` method fits distributions to the data. Here you can set the distribution, privacy package and uniqueness for the variable.
- **Drawing values and series**: The :meth:`~metasyn.MetaVar.draw` method draws a random item for the variable in whatever type is required. The :meth:`~metasyn.MetaVar.draw_series` method draws a new synthetic series from the metadata. For this to work, the variable has to be fitted.
- **Converting to and from a dictionary**: The :meth:`~metasyn.MetaVar.to_dict` method creates a dictionary from the variable. The :meth:`~metasyn.MetaVar.from_dict` method restores a variable from a dictionary.

Expand All @@ -54,4 +53,5 @@ The ``metasyn`` package is organized into several submodules, each focusing on d
* The :mod:`metasyn.testutils` module provides testing utilities for plugins. It includes functions for checking distributions and distribution providers.
* The :mod:`metasyn.validation` module contains tools for validating distribution outputs and GMF file formats.
* The :mod:`metasyn.privacy` module contains the basis for implementing privacy features. A system to incorporate privacy features such as differential privacy or other forms of disclosure control is still being implemented.

* The :mod:`metasyn.util` module contains utility classes :class:`~metasyn.util.DistributionSpec` and :class:`~metasyn.util.VarConfig`.
* The :mod:`metasyn.config` module contains the :class:`~metasyn.config.MetaConfig` class that can read .toml configuration files.
42 changes: 34 additions & 8 deletions docs/source/usage/cli.rst
Expand Up @@ -4,7 +4,7 @@ Metasyn provides a command-line interface (CLI) for accessing core functionality

The CLI currently has three subcommands:

* The ``create-meta`` subcommand, which allows you to **create generative metadata** from a ``CSV file``
* The ``create-meta`` subcommand, which allows you to **create generative metadata** from a ``.csv file`` and/or a ``.toml`` configuration file.
* The ``synthesize`` subcommand, which allows you to **generate synthetic data** from a ``GMF file``
* The ``schema`` subcommand, which allows you to **create validation schemas** for GMF files.

Expand Down Expand Up @@ -85,7 +85,7 @@ The ``create-meta`` command can be used as follows:

.. code-block:: bash
metasyn create-meta [input] [output]
metasyn create-meta --input [input] --output [output]
This will:

Expand Down Expand Up @@ -126,17 +126,43 @@ The ``create-meta`` command also takes one optional argument:

.. note::

The configuration file must be in the `.ini` format. For more information on the format, please refer to the `Python documentation <https://docs.python.org/3/library/configparser.html>`_.
The configuration file must be in the `.toml` format. For more information on reading this format in Python, please refer to the `Python documentation <https://docs.python.org/3/library/tomllib.html>`_.

An example of a configuration file that specifies the ``PassengerId`` column to be unique and the ``Fare`` column to have a log-normal distribution is as follows:

.. code-block:: ini
.. code-block:: toml
[var.PassengerId]
unique = True
[[var]]
name = "PassengerId"
distribution = {unique = true} # Notice lower capitalization for .toml files.
[var.Fare]
distribution=LogNormalDistribution
[[var]]
name = "Fare"
distribution = {implements = "core.log_normal"}
It is also possible to create a GMF file without any input CSV. For this to work, you need to supply a configuration
file that fully specifies all wanted columns. You will need to tell ``metasyn`` in the configuration file that the
column is ``data_free``. It is also required to set the number of rows under the `general` section, for example:

.. code-block:: toml
[general]
n_rows = 100
[[var]]
name = "PassengerId"
data_free = true
unique = true
prop_missing = 0.0
description = "ID of the unfortunate passenger."
var_type = "discrete"
distribution = {implements = "core.unique_key", unique = true, parameters = {consecutive = 1, low = 0}}
The example will generate a GMF file that can be used to generate new synthetic data with the ``synthesize``
subcommand described below.


Generating Synthetic Data
62 changes: 38 additions & 24 deletions docs/source/usage/generating_metaframes.rst
Expand Up @@ -19,13 +19,15 @@ One of the main features of ``metasyn`` is to create a :obj:`MetaFrame <metasyn.
Basics
------

Metasyn can generate metadata from any given dataset (provided as Polars or Pandas DataFrame), using the :meth:`metasyn.MetaFrame.fit_dataframe(df) <metasyn.metaframe.MetaFrame.fit_dataframe>` classmethod.
Metasyn can generate metadata from any given dataset (provided as Polars or Pandas DataFrame),
using the :meth:`metasyn.MetaFrame.fit_dataframe(df) <metasyn.metaframe.MetaFrame.fit_dataframe>` classmethod.

.. image:: /images/pipeline_estimation_code.png
:alt: MetaFrame Generation With Code Snippet
:align: center

This function requires a :obj:`DataFrame` to be specified as parameter. The following code returns a :obj:`MetaFrame<metasyn.metaframe.MetaFrame>` object named :obj:`mf`, based on a DataFrame named :obj:`df`.
This function requires a :obj:`DataFrame` to be specified as parameter. The following code returns
a :obj:`MetaFrame<metasyn.metaframe.MetaFrame>` object named :obj:`mf`, based on a DataFrame named :obj:`df`.

.. code-block:: python
Expand All @@ -47,29 +49,42 @@ It is possible to print the (statistical metadata contained in the) :obj:`MetaFr

Optional Parameters
----------------------
The :meth:`metasyn.MetaFrame.fit_dataframe() <metasyn.metaframe.MetaFrame.fit_dataframe>` class method allows you to have more control over how your synthetic dataset is generated with additional (optional) parameters:
The :meth:`metasyn.MetaFrame.fit_dataframe() <metasyn.metaframe.MetaFrame.fit_dataframe>` class method
allows you to have more control over how your synthetic dataset is generated with additional (optional)
parameters:

Besides the required `df` parameter, :meth:`metasyn.MetaFrame.fit_dataframe() <metasyn.metaframe.MetaFrame.fit_dataframe>` accepts three parameters: ``spec``, ``dist_providers`` and ``privacy``.
Besides the required `df` parameter, :meth:`metasyn.MetaFrame.fit_dataframe() <metasyn.metaframe.MetaFrame.fit_dataframe>`
accepts four parameters: ``meta_config``, ``var_specs``, ``dist_providers`` and ``privacy``.

Let's take a look at each optional parameter individually:

spec
^^^^
**spec** is an optional dictionary that outlines specific directives for each DataFrame column (variable). The potential directives include:

meta_config
^^^^^^^^^^^
**meta_config** is an optional parameter that encompasses all the other parameters; it contains information on the
``var_specs``, ``dist_providers`` and ``privacy``. This parameter is generally used when the configuration is loaded
from a .toml file. Otherwise it is recommended to leave ``meta_config`` at its default value (None) and specify
the other optional parameters.

var_specs
^^^^^^^^^
**var_specs** is an optional list that outlines specific directives for columns (variables) in the DataFrame.
The potential directives include:

- ``name``: This specifies the column name and is mandatory.

- ``distribution``: Allows you to specify the statistical distribution of each column. To see what distributions are available refer to the :doc:`distribution package API reference</api/metasyn.distribution>`.

- ``unique``: Declare whether the column in the synthetic dataset should contain unique values. By default no column is set to unique.

.. admonition:: Detection of unique variables

When generating a MetaFrame, ``metasyn`` will automatically analyze the columns of the input DataFrame to detect ones that contain only unique values.
If such a column is found, and it has not manually been set to unique in the ``spec`` dictionary, the user will be notified with the following warning:
If such a column is found, and it has not manually been set to unique in the ``var_specs`` list, the user will be notified with the following warning:
``Warning: Variable [column_name] seems unique, but not set to be unique. Set the variable to be either unique or not unique to remove this warning``

It is safe to ignore this warning - however, be aware that without setting the column as unique, ``metasyn`` may generate duplicate values for that column when synthesizing data.

To remove the warning and ensure the values in the synthesized column are unique, set the column to be unique (``"column" = {"unique": True}``) in the ``spec`` dictionary.
To remove the warning and ensure the values in the synthesized column are unique, set the column to be unique (``"column" = {"unique": True}``) in the ``var_specs`` list.

- ``description``: Includes a description for each column in the DataFrame.

Expand All @@ -80,7 +95,7 @@ spec
- ``prop_missing``: Set the intended proportion of missing values in the synthetic data for each column.
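
The idea behind the "Detection of unique variables" note above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not ``metasyn``'s actual implementation: the helper name ``seems_unique`` and the choice to ignore missing values (``None``) are assumptions made for the example.

```python
from typing import Hashable, Optional, Sequence


def seems_unique(values: Sequence[Optional[Hashable]]) -> bool:
    """Return True if all non-missing values in the column are distinct.

    Hypothetical helper: missing values (None) are ignored, and an
    all-missing or empty column is not reported as unique.
    """
    non_missing = [v for v in values if v is not None]
    if not non_missing:
        return False
    return len(set(non_missing)) == len(non_missing)


passenger_ids = [1, 2, 3, 4, 5]
cabins = ["C85", None, "C85", "E46", None]

for name, column in [("PassengerId", passenger_ids), ("Cabin", cabins)]:
    if seems_unique(column):
        print(f"Variable {name} seems unique, consider setting unique=True.")
```

A column flagged this way is only a candidate: whether the synthetic values should also be unique remains a decision for the user, which is why the warning asks for the setting to be made explicit.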


.. admonition:: Example use of the ``spec`` parameter
.. admonition:: Example use of the ``var_specs`` parameter

- For the column ``PassengerId``, we want unique values.
- The ``Name`` column should be populated with realistic fake names using the `Faker <https://faker.readthedocs.io/en/master/>`_ library.
Expand All @@ -91,30 +106,29 @@ spec
The following code to achieve this would look like:

.. code-block:: python
from metasyn.distribution import FakerDistribution, DiscreteUniformDistribution, RegexDistribution
from metasyn.config import VarConfig, DistributionSpec
# Create a specification dictionary for generating synthetic data
var_spec = {
var_specs = [
# Ensure unique values for the `PassengerId` column
"PassengerId": {"unique": True},
VarConfig(name="PassengerId", dist_spec=DistributionSpec(unique=True)),
# Utilize the Faker library to synthesize realistic names for the `Name` column
"Name": {"distribution": FakerDistribution("name")},
VarConfig(name="Name", dist_spec=FakerDistribution("name")),
# Fit `Fare` to an exponential distribution based on the data
"Fare": {"distribution": "ExponentialDistribution"},
# Fit `Fare` to a log-normal distribution, but base the parameters on the data
VarConfig(name="Fare", dist_spec="LogNormalDistribution"),
# Fit `Age` to a discrete uniform distribution ranging from 20 to 40
"Age": {"distribution": DiscreteUniformDistribution(20, 40)},
# Set the `Age` column to a discrete uniform distribution ranging from 20 to 40
VarConfig(name="Age", dist_spec=DiscreteUniformDistribution(20, 40)),
# Use a regex-based distribution to generate `Cabin` values following [A-F][0-9]{2,3}
"Cabin": {"distribution": RegexDistribution(r"[A-F][0-9]{2,3}")}
}
VarConfig(name="Cabin", dist_spec=cabin_distribution, description="The cabin number of the passenger."),
]
mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
dist_providers
1 change: 1 addition & 0 deletions examples/basic_example.py
@@ -1,4 +1,5 @@
import polars as pl

from metasyn import MetaFrame

# example dataframe from polars website
33 changes: 33 additions & 0 deletions examples/example_config.toml
@@ -0,0 +1,33 @@
# Example toml file as input for metasyn

[general]
dist_providers = ["builtin", "metasyn-disclosure"]

[general.privacy]
name = "disclosure"
parameters = {n_avg = 11}


[[var]]
name = "PassengerId"
distribution = {unique = true} # Notice lower capitalization for .toml files.

[[var]]
name = "Name"
prop_missing = 0.1
description = "Name of the unfortunate passenger of the titanic."
distribution = {implements = "core.faker", parameters = {faker_type = "name", locale = "en_US"}}

[[var]]
name = "Fare"
distribution = {implements = "core.exponential"}

[[var]]
name = "Age"
distribution = {implements = "core.uniform", parameters = {low = 20, high = 40}}


[[var]]
name = "Cabin"
distribution = {implements = "core.regex", parameters = {regex_data = "[A-F][0-9]{2,3}"}}
privacy = {name = "disclosure", parameters = {n_avg = 21}}
4 changes: 2 additions & 2 deletions examples/example_gmf_titanic.json
Expand Up @@ -4,9 +4,9 @@
"provenance": {
"created by": {
"name": "metasyn",
"version": "0.6.1.dev45+gd3708ea.d20231212"
"version": "0.6.1.dev44+g2ce6998.d20240115"
},
"creation time": "2023-12-12T15:30:02.834410"
"creation time": "2024-01-17T12:11:01.120007"
},
"vars": [
{