Convert export -> save
qubixes committed Oct 3, 2024
1 parent 876bb4c commit cdb952b
Showing 20 changed files with 89 additions and 88 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -17,7 +17,7 @@

__Generate synthetic tabular data__ in a transparent, understandable, and privacy-friendly way. Metasyn makes it possible for owners of sensitive data to create test data, do open science, improve code reproducibility, encourage data reuse, and enhance accessibility of their datasets, without worrying about leaking private information.

With metasyn you can __fit__ a model to an existing dataframe, __export__ it to a transparent and auditable `.json` file, and __synthesize__ a dataframe that looks a lot like the real one. In contrast to most other synthetic data software, we make the explicit choice to strictly limit the statistical information in our model in order to adhere to the highest privacy standards.
With metasyn you can __fit__ a model to an existing dataframe, __save__ it to a transparent and auditable `.json` file, and __synthesize__ a dataframe that looks a lot like the real one. In contrast to most other synthetic data software, we make the explicit choice to strictly limit the statistical information in our model in order to adhere to the highest privacy standards.

## Highlights
- 👋 __Accessible__. Metasyn is designed to be easy to use and understand, and we do our best to be welcoming to newcomers and novice users. [Let us know](https://github.com/sodascience/metasyn/issues/new) if we can improve!
@@ -71,7 +71,7 @@ mf = MetaFrame.fit_dataframe(df)
# Generate a new DataFrame with 5 rows from the MetaFrame.
df_synth = mf.synthesize(5)

# This DataFrame can be exported to csv, parquet, excel and more.
# This DataFrame can be saved to csv, parquet, excel and more.
df_synth.write_csv("output.csv")
```

8 changes: 4 additions & 4 deletions docs/paper/paper.md
@@ -46,7 +46,7 @@ These choices enable the software to generate synthetic data with __privacy and
At its core, `metasyn` has three main functions:

1. __Estimation__: Fit a generative model to a properly formatted tabular dataset, optionally with privacy guarantees.
2. __(De)serialization__: Create an intermediate representation of the fitted model for auditing, editing, and exporting.
2. __(De)serialization__: Create an intermediate representation of the fitted model for auditing, editing, and saving.
3. __Generation__: Synthesize new datasets based on a fitted model.

## Estimation
@@ -117,11 +117,11 @@ After fitting a model, `metasyn` can transparently store it in a human- and mach
}
```

This `.json` can be manually audited, edited, and after exporting this file, an unlimited number of synthetic records can be created without incurring additional privacy risks. Serialization and deserialization with `metasyn` can be performed as follows:
This `.json` can be manually audited, edited, and after saving this file, an unlimited number of synthetic records can be created without incurring additional privacy risks. Serialization and deserialization with `metasyn` can be performed as follows:

```python
mf.export("fruits.json")
mf_new = MetaFrame.from_json("fruits.json")
mf.save("fruits.json")
mf_new = MetaFrame.load("fruits.json")
```

## Data generation
2 changes: 1 addition & 1 deletion docs/source/developer/GMF.rst
@@ -1,7 +1,7 @@
Generative Metadata Format (GMF)
================================

At the core of ``metasyn`` lies its ability to :doc:`generate </usage/generating_metaframes>`, :doc:`export</usage/exporting_metaframes>` and :doc:`import</usage/exporting_metaframes>` statistical metadata for a given dataset, which can then be used to :doc:`generate synthetic datasets </usage/generating_synthetic_data>`. To achieve this, ``metasyn`` uses the Generative Metadata Format (GMF), an open source format (available on `GitHub <https://github.com/sodascience/generative_metadata_format>`_) designed to store statistical metadata for tabular datasets. The GMF standard is designed to be modular and extensible, with more distributions and privacy-enhancing mechanisms. Due to its open nature, GMF can be used by other software too.
At the core of ``metasyn`` lies its ability to :doc:`generate </usage/generating_metaframes>`, :doc:`save</usage/saving_metaframes>` and :doc:`load</usage/loading_metaframes>` statistical metadata for a given dataset, which can then be used to :doc:`generate synthetic datasets </usage/generating_synthetic_data>`. To achieve this, ``metasyn`` uses the Generative Metadata Format (GMF), an open source format (available on `GitHub <https://github.com/sodascience/generative_metadata_format>`_) designed to store statistical metadata for tabular datasets. The GMF standard is designed to be modular and extensible, with more distributions and privacy-enhancing mechanisms. Due to its open nature, GMF can be used by other software too.



2 changes: 1 addition & 1 deletion docs/source/developer/overview.rst
@@ -15,7 +15,7 @@ The :class:`~metasyn.MetaFrame` class is a core component of the ``metasyn`` pac
Essentially, a :obj:`~metasyn.MetaFrame` is a collection of :obj:`~metasyn.MetaVar` objects, each representing a column in a dataset. It contains methods that allow for the following:

- **Fitting to a DataFrame**: The :meth:`~metasyn.MetaFrame.fit_dataframe` method allows for fitting a Polars DataFrame to create a :obj:`~metasyn.MetaFrame` object. This method takes several parameters including the DataFrame, column specifications, distribution providers, privacy level, and a progress bar flag.
- **Exporting and importing**: The :meth:`~metasyn.MetaFrame.export` method serializes and exports the :obj:`~metasyn.MetaFrame` to a JSON file, following the GMF format. The :meth:`~metasyn.MetaFrame.from_json` method reads a :obj:`~metasyn.MetaFrame` from a JSON file.
- **Saving and loading**: The :meth:`~metasyn.MetaFrame.save` method serializes and saves the :obj:`~metasyn.MetaFrame` to a JSON or TOML file, following the GMF format. The :meth:`~metasyn.MetaFrame.load` method reads a :obj:`~metasyn.MetaFrame` from a JSON file.
- **Synthesizing to a DataFrame**: The :meth:`~metasyn.MetaFrame.synthesize` method creates a synthetic Polars DataFrame based on the :obj:`~metasyn.MetaFrame`.


2 changes: 1 addition & 1 deletion docs/source/faq.rst
@@ -14,7 +14,7 @@ A MetaFrame is a fitted model that describes the aggregate structure and charact

Key elements encapsulated in a MetaFrame include variable names, their data types, the proportion of missing values, and the parameters of the distributions that these variables follow in the dataset. This information is sufficient to understand the overall structure and attributes of the data, without divulging the exact data points.

When a MetaFrame is created from an input dataset, it can be exported for auditing or manual editing.
When a MetaFrame is created from an input dataset, it can be saved for auditing or manual editing.

In the ``metasyn`` workflow, once you have a MetaFrame, ``metasyn`` can generate synthetic data that aligns with the MetaFrame. This synthetic data shares the structural and distributional characteristics (as defined in the MetaFrame) with the original data but does not contain any actual data points from the original dataset, thus preserving privacy.

6 changes: 3 additions & 3 deletions docs/source/index.rst
@@ -38,16 +38,16 @@ Welcome to the `metasyn <https://github.com/sodascience/metasyn/>`_ documentatio

1. **Estimation**: ``Metasyn`` can **create a MetaFrame**, from a dataset. A MetaFrame is metadata describing a table, augmented with statistical information on the columns. It captures individual distributions and features and enables the generation of synthetic data based on it.
2. **Generation**: ``Metasyn`` can **generate synthetic data** based on a MetaFrame. The synthetic data produced solely depends on the MetaFrame, thereby maintaining a critical separation between the original sensitive data and the generated synthetic data.
3. **Serialization**: ``Metasyn`` can **export a MetaFrame** into an easy-to-read :doc:`/developer/GMF` file. This allows users to audit, understand, and modify their data generation model. These GMF files can also be imported back into Metasyn to generate synthetic data.
3. **Serialization**: ``Metasyn`` can **save a MetaFrame** into an easy-to-read :doc:`/developer/GMF` file. This allows users to audit, understand, and modify their data generation model. These GMF files can also be imported back into Metasyn to generate synthetic data.

Researchers and data owners can use ``metasyn`` to generate and share synthetic versions of their sensitive datasets, mitigating privacy concerns. Additionally, ``metasyn`` facilitates transparency and reproducibility by allowing the underlying MetaFrames to be exported and shared. Other researchers can use these to regenerate consistent synthetic datasets, validating published work without requiring sensitive data.
Researchers and data owners can use ``metasyn`` to generate and share synthetic versions of their sensitive datasets, mitigating privacy concerns. Additionally, ``metasyn`` facilitates transparency and reproducibility by allowing the underlying MetaFrames to be saved and shared. Other researchers can use these to regenerate consistent synthetic datasets, validating published work without requiring sensitive data.



.. admonition:: Key Features

- **MetaFrame Generation**: ``Metasyn`` allows the creation of a MetaFrame from a dataset provided as a `Polars <https://pola.rs/>`_ or `Pandas <https://pandas.pydata.org/>`_ DataFrame. MetaFrames include key characteristics such as *variable names*, *data types*, *percentage of missing values*, and *distribution parameters*.
- **Exporting MetaFrames**: ``Metasyn`` can export and import MetaFrames to GMF files. These are JSON files that follow the easy-to-read and understand :doc:`/developer/GMF`.
- **Saving MetaFrames**: ``Metasyn`` can save and load MetaFrames to GMF files. These are JSON files that follow the easy-to-read and understand :doc:`/developer/GMF`.
- **Synthetic Data Generation**: ``Metasyn`` allows for the generation of a Polars DataFrame with synthetic data that resembles the original data.
- **Distribution Fitting**: ``Metasyn`` allows for manual and automatic distribution fitting.
- **Data Type Support**: ``Metasyn`` supports generating synthetic data for a variety of common data types including ``categorical``, ``string``, ``integer``, ``float``, ``date``, ``time``, and ``datetime``.
4 changes: 2 additions & 2 deletions docs/source/metasyn_in_detail.rst
@@ -158,7 +158,7 @@ This allows for manual and automatic editing, as well as sharing.
.. raw:: html

<details>
<summary> An example of an exported MetaFrame [click to expand]: </summary>
<summary> An example of a saved MetaFrame [click to expand]: </summary>

.. code-block:: json
@@ -268,7 +268,7 @@ This allows for manual and automatic editing, as well as sharing.
<br>

.. note::
See the :doc:`/usage/exporting_metaframes` page for information on *how* to export and load MetaFrame to and from JSON files.
See the :doc:`/usage/saving_metaframes` page for information on *how* to save and load MetaFrame to and from JSON files.

Data generation
^^^^^^^^^^^^^^^^
4 changes: 2 additions & 2 deletions docs/source/usage/cli.rst
@@ -83,7 +83,7 @@ The ``metasyn`` CLI should now be up and running within the Docker container and
Creating Generative Metadata
----------------------------
The ``create-meta`` subcommand combines the :doc:`estimation </usage/generating_metaframes>` and :doc:`serialization </usage/exporting_metaframes>` steps in the pipeline into one, this allows you to generate generative metadata for a tabular dataset (in CSV format), and store it in a GMF (Generative Metadata Format) file.
The ``create-meta`` subcommand combines the :doc:`estimation </usage/generating_metaframes>` and :doc:`serialization </usage/saving_metaframes>` steps in the pipeline into one, this allows you to generate generative metadata for a tabular dataset (in CSV format), and store it in a GMF (Generative Metadata Format) file.

.. image:: /images/pipeline_cli_create_meta.png
:alt: Creating Generative Metadata using the CLI
@@ -158,7 +158,7 @@ The ``create-meta`` command also takes one optional argument:
Generating Synthetic Data
-------------------------
The ``synthesize`` subcommand combines the :doc:`deserialization </usage/exporting_metaframes>` and :doc:`generation </usage/generating_synthetic_data>` steps in the pipeline into one, and allows you to generate a synthetic dataset from a previously exported MetaFrame (stored as GMF file).
The ``synthesize`` subcommand combines the :doc:`deserialization </usage/saving_metaframes>` and :doc:`generation </usage/generating_synthetic_data>` steps in the pipeline into one, and allows you to generate a synthetic dataset from a previously saved MetaFrame (stored as GMF file).

.. image:: /images/pipeline_cli.png
:alt: Creating Synthetic Data from a GMF file using the CLI
22 changes: 11 additions & 11 deletions docs/source/usage/exporting_metaframes.rst
@@ -5,26 +5,26 @@
Exporting and importing MetaFrames
===================================

Metasyn can serialize and **export a MetaFrame** into a GMF file. GMF files are JSON files that follow the :doc:`/developer/GMF` and have been designed to be easy to read and understand. This allows users to audit, understand, modify and share their data generation model with ease.
Metasyn can serialize and **save a MetaFrame** into a GMF file. GMF files are JSON files that follow the :doc:`/developer/GMF` and have been designed to be easy to read and understand. This allows users to audit, understand, modify and share their data generation model with ease.

.. image:: /images/pipeline_serialization_simple.png
:alt: MetaFrame Serialization Flow
:align: center

Exporting a MetaFrame
----------------------
MetaFrames can be serialized and exported to a GMF file by calling the :meth:`metasyn.metaframe.MetaFrame.to_json` method on a :obj:`MetaFrame<metasyn.metaframe.MetaFrame>`.
MetaFrames can be serialized and saved to a GMF file by calling the :meth:`metasyn.metaframe.MetaFrame.save` method on a :obj:`MetaFrame<metasyn.metaframe.MetaFrame>`.

The following code exports a generated :obj:`MetaFrame<metasyn.metaframe.MetaFrame>` object named ``mf`` to a GMF file named ``exported_metaframe``.
The following code saves a generated :obj:`MetaFrame<metasyn.metaframe.MetaFrame>` object named ``mf`` to a GMF file named ``saved_metaframe``.

.. code-block:: python
mf.to_json("exported_metaframe.json")
mf.save("saved_metaframe.json")
.. raw:: html

<details>
<summary> <em><b>An example of a MetaFrame that has been exported to a GMF file: </em></b></summary>
<summary> <em><b>An example of a MetaFrame that has been saved to a GMF file: </em></b></summary>

.. code-block:: json
@@ -164,7 +164,7 @@ The following code exports a generated :obj:`MetaFrame<metasyn.metaframe.MetaFra
|break|


It is possible to preview the GMF file, without having to export it. This can be done by calling the Python built-in :func:`repr <python:repr>` function on a :obj:`MetaFrame<metasyn.metaframe.MetaFrame>` object, and printing its output.
It is possible to preview the GMF file, without having to save it. This can be done by calling the Python built-in :func:`repr <python:repr>` function on a :obj:`MetaFrame<metasyn.metaframe.MetaFrame>` object, and printing its output.

.. code-block:: python
@@ -173,16 +173,16 @@ It is possible to preview the GMF file, without having to export it. This can be
Loading a MetaFrame
-------------------
You can load a MetaFrame from a GMF file using the :meth:`MetaFrame.from_json <metasyn.metaframe.MetaFrame.from_json>` classmethod.
You can load a MetaFrame from a GMF file using the :meth:`MetaFrame.load <metasyn.metaframe.MetaFrame.load>` classmethod.

The following code loads a :obj:`MetaFrame<metasyn.metaframe.MetaFrame>` object named ``mf`` from a GMF file named ``exported_metaframe``.
The following code loads a :obj:`MetaFrame<metasyn.metaframe.MetaFrame>` object named ``mf`` from a GMF file named ``saved_metaframe``.

.. code-block:: python
mf = metasyn.MetaFrame.from_json("exported_metaframe.json")
mf = metasyn.MetaFrame.load("saved_metaframe.json")
Tweaking an exported MetaFrame
Tweaking a saved MetaFrame
-----------------------------------
Since the JSON is formatted in an easy to read way (for both humans *and* computers), it is easy to manually edit the metadata, or to automatically edit the metadata using a script.
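
The scripted route mentioned above can be sketched with nothing more than Python's standard ``json`` module. This is a minimal illustration, not part of ``metasyn`` itself: the helper ``set_categories`` is hypothetical, and the trimmed-down layout it assumes (a top-level ``vars`` list whose entries hold ``distribution`` and then ``parameters`` with ``labels`` and ``probs``) should be checked against a real saved file before reuse.

```python
import json

# Hypothetical, trimmed-down GMF content. A file saved by metasyn carries
# more fields per variable; only the parts edited below are reproduced.
gmf_text = """
{
  "vars": [
    {
      "name": "fruits",
      "distribution": {
        "parameters": {"labels": ["apple", "banana"], "probs": [0.5, 0.5]}
      }
    }
  ]
}
"""


def set_categories(gmf: dict, var_name: str, categories: dict) -> dict:
    """Replace the labels and probabilities of one categorical variable.

    Assumed helper for illustration; it edits the dictionary in place and
    returns it for convenience.
    """
    if abs(sum(categories.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    for var in gmf["vars"]:
        if var["name"] == var_name:
            params = var["distribution"]["parameters"]
            params["labels"] = list(categories)
            params["probs"] = list(categories.values())
            return gmf
    raise KeyError(f"no variable named {var_name!r}")


gmf = json.loads(gmf_text)
edited = set_categories(
    gmf, "fruits", {"apple": 0.3, "banana": 0.3, "orange": 0.4}
)
print(json.dumps(edited, indent=2))
```

In practice the text would come from the saved GMF file and the edited dictionary would be written back with ``json.dump`` before loading it into ``metasyn`` again.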

@@ -231,7 +231,7 @@ Let's say we import a MetaFrame from the GMF (from earlier on this page) and use
- audi
- 87

Well, what if we wanted to change the distribution of the ``fruits`` variable to instead be 30% ``apple``, 30% ``banana``, and introduce a new fruit ``orange`` with a distribution of 40%? We can do this by editing the ``probs`` and ``labels`` attributes of the ``fruits`` variable in the exported MetaFrame. The following is the edited MetaFrame:
Well, what if we wanted to change the distribution of the ``fruits`` variable to instead be 30% ``apple``, 30% ``banana``, and introduce a new fruit ``orange`` with a distribution of 40%? We can do this by editing the ``probs`` and ``labels`` attributes of the ``fruits`` variable in the saved MetaFrame. The following is the edited MetaFrame:


.. tab:: GMF file before
2 changes: 1 addition & 1 deletion docs/source/usage/generating_synthetic_data.rst
@@ -16,7 +16,7 @@ The generated data does **not** preserve any relationships between variables.

.. admonition:: Prerequisite

Before synthetic data can be generated, a :obj:`MetaFrame <metasyn.metaframe.MetaFrame>` object must be :doc:`created </usage/generating_metaframes>` or :doc:`loaded </usage/exporting_metaframes>`.
Before synthetic data can be generated, a :obj:`MetaFrame <metasyn.metaframe.MetaFrame>` object must be :doc:`created </usage/generating_metaframes>` or :doc:`loaded </usage/saving_metaframes>`.

To generate a synthetic dataset, simply call the :meth:`MetaFrame.synthesize(n) <metasyn.metaframe.MetaFrame.synthesize>` method on a :obj:`MetaFrame <metasyn.metaframe.MetaFrame>` object. This method takes a parameter `n` which represents the number of rows of data that should be generated. By default (when `n` is not provided), metasyn tries to generate as many rows as in the original dataset.

8 changes: 4 additions & 4 deletions docs/source/usage/quick_start.rst
@@ -112,17 +112,17 @@ We can inspect the MetaFrame by simply printing it (``print(mf)``). This will pr
Saving and Loading the MetaFrame
--------------------------------

The MetaFrame can be saved to a JSON file for future use, to do so we simply use the :func:`~metasyn.metaframe.MetaFrame.to_json` function on the MetaFrame (which in our case is named ``mf``), and pass in the filepath as a parameter. The following code saves the MetaFrame to a JSON file named "exported_metaframe.json":
The MetaFrame can be saved to a JSON file for future use, to do so we simply use the :func:`~metasyn.metaframe.MetaFrame.save` method on the MetaFrame (which in our case is named ``mf``), and pass in the filepath as a parameter. The following code saves the MetaFrame to a JSON file named "saved_metaframe.json":

.. code-block:: python
mf.to_json("exported_metaframe.json")
mf.save("saved_metaframe.json")
Inversely, we can load a MetaFrame from a JSON file using the :func:`~metasyn.metaframe.MetaFrame.from_json` function, passing in the filepath as a parameter. To load our previously saved MetaFrame, we use the following code:
Inversely, we can load a MetaFrame from a JSON file using the :func:`~metasyn.metaframe.MetaFrame.load` method, passing in the filepath as a parameter. To load our previously saved MetaFrame, we use the following code:

.. code-block:: python
mf = MetaFrame.from_json("exported_metaframe.json")
mf = MetaFrame.load("saved_metaframe.json")
Synthesizing the Data
---------------------
2 changes: 1 addition & 1 deletion docs/source/usage/usage.rst
@@ -11,7 +11,7 @@ information on installation, quickstart, tutorials and information on the core f
installation
quick_start
generating_metaframes
exporting_metaframes
saving_metaframes
generating_synthetic_data
config_files
cli
4 changes: 2 additions & 2 deletions examples/basic_example.py
@@ -17,12 +17,12 @@

# write to json
gmf_path = Path("examples", "gmf_files", "example_gmf_simple.json")
mf.export(gmf_path)
mf.save(gmf_path)

# then, export json from secure environment

# outside secure environment, load json
mf_out = MetaFrame.from_json(gmf_path)
mf_out = MetaFrame.load_json(gmf_path)

# create a fake dataset
df_syn = mf_out.synthesize(10)