Add export functionality to TOML in addition to JSON (#327)
* Add support for simplified configuration file.
* Rename methods:
   * MetaFrame.export -> MetaFrame.save
   * MetaFrame.to_json -> MetaFrame.save_json
   * MetaFrame.from_json -> MetaFrame.load_json
* Add support for TOML GMF files. They work almost exactly the same as the JSON GMF files. Some comments are automatically included to make the file more easily understandable.
qubixes authored Oct 10, 2024
1 parent 789ae63 commit cd0e8ac
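
For downstream code that still calls the old method names, the renames listed in the commit message can be applied mechanically. The following is a minimal sketch and not part of this commit: `RENAMES` and `migrate_calls` are hypothetical helper names, and the rewrite is purely textual, so it will also touch same-named methods on unrelated objects (for example a dataframe's own `to_json`). Review the result by hand.

```python
import re

# Renames taken from the commit message (old method name -> new method name).
RENAMES = {
    "export": "save",
    "to_json": "save_json",
    "from_json": "load_json",
}

# Match ".old_name(" so that only method-call sites are rewritten.
_CALL = re.compile(r"\.(" + "|".join(map(re.escape, RENAMES)) + r")\(")


def migrate_calls(source: str) -> str:
    """Hypothetical helper: rewrite old method calls to the new names."""
    return _CALL.sub(lambda m: "." + RENAMES[m.group(1)] + "(", source)


print(migrate_calls('mf.export("fruits.json")'))  # mf.save("fruits.json")
```

Names that merely start with an old name, such as `.exporter(`, are left untouched because the pattern requires the opening parenthesis directly after the method name.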
Showing 54 changed files with 1,066 additions and 1,196 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/core-ci.yml
@@ -21,9 +21,9 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
-python -m pip install ".[check]"
+python -m pip install ".[check,extra]"
- name: Lint with Ruff
-run: ruff check
+run: ruff check metasyn
- name: Check types with MyPy
run: mypy metasyn

@@ -61,7 +61,7 @@ jobs:
run: |
pip install git+https://github.com/sodascience/metasyn-disclosure-control
pip install .
-metasyn create-meta metasyn/demo/demo_titanic.csv --config examples/example_config.toml
+metasyn create-meta metasyn/demo/demo_titanic.csv --config examples/config_files/example_config.toml
build-docs:
name: Build documentation
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@

__Generate synthetic tabular data__ in a transparent, understandable, and privacy-friendly way. Metasyn makes it possible for owners of sensitive data to create test data, do open science, improve code reproducibility, encourage data reuse, and enhance accessibility of their datasets, without worrying about leaking private information.

-With metasyn you can __fit__ a model to an existing dataframe, __export__ it to a transparent and auditable `.json` file, and __synthesize__ a dataframe that looks a lot like the real one. In contrast to most other synthetic data software, we make the explicit choice to strictly limit the statistical information in our model in order to adhere to the highest privacy standards.
+With metasyn you can __fit__ a model to an existing dataframe, __save__ it to a transparent and auditable `.json` file, and __synthesize__ a dataframe that looks a lot like the real one. In contrast to most other synthetic data software, we make the explicit choice to strictly limit the statistical information in our model in order to adhere to the highest privacy standards.

## Highlights
- 👋 __Accessible__. Metasyn is designed to be easy to use and understand, and we do our best to be welcoming to newcomers and novice users. [Let us know](https://github.com/sodascience/metasyn/issues/new) if we can improve!
8 changes: 4 additions & 4 deletions docs/paper/paper.md
@@ -46,7 +46,7 @@ These choices enable the software to generate synthetic data with __privacy and
At its core, `metasyn` has three main functions:

1. __Estimation__: Fit a generative model to a properly formatted tabular dataset, optionally with privacy guarantees.
-2. __(De)serialization__: Create an intermediate representation of the fitted model for auditing, editing, and exporting.
+2. __(De)serialization__: Create an intermediate representation of the fitted model for auditing, editing, and saving.
3. __Generation__: Synthesize new datasets based on a fitted model.

## Estimation
@@ -117,11 +117,11 @@ After fitting a model, `metasyn` can transparently store it in a human- and mach
}
```

-This `.json` can be manually audited, edited, and after exporting this file, an unlimited number of synthetic records can be created without incurring additional privacy risks. Serialization and deserialization with `metasyn` can be performed as follows:
+This `.json` can be manually audited, edited, and after saving this file, an unlimited number of synthetic records can be created without incurring additional privacy risks. Serialization and deserialization with `metasyn` can be performed as follows:

```python
-mf.export("fruits.json")
-mf_new = MetaFrame.from_json("fruits.json")
+mf.save("fruits.json")
+mf_new = MetaFrame.load("fruits.json")
```

## Data generation
5 changes: 4 additions & 1 deletion docs/source/developer/GMF.rst
@@ -1,7 +1,10 @@
Generative Metadata Format (GMF)
================================

-At the core of ``metasyn`` lies its ability to :doc:`generate </usage/generating_metaframes>`, :doc:`export</usage/exporting_metaframes>` and :doc:`import</usage/exporting_metaframes>` statistical metadata for a given dataset, which can then be used to :doc:`generate synthetic datasets </usage/generating_synthetic_data>`. To achieve this, ``metasyn`` uses the Generative Metadata Format (GMF), an open source format (available on `GitHub <https://github.com/sodascience/generative_metadata_format>`_) designed to store statistical metadata for tabular datasets. The GMF standard is designed to be modular and extensible, with more distributions and privacy-enhancing mechanisms. Due to its open nature, GMF can be used by other software too.
+At the core of ``metasyn`` lies its ability to :doc:`generate </usage/generating_metaframes>`,
+:doc:`save</usage/saving_metaframes>` and :ref:`load<loading-a-metaframe>` statistical metadata
+for a given dataset, which can then be used to :doc:`generate synthetic datasets </usage/generating_synthetic_data>`. To achieve this, ``metasyn`` uses the Generative Metadata Format (GMF), an open source format (available on `GitHub <https://github.com/sodascience/generative_metadata_format>`_) designed to store statistical metadata for tabular datasets. The GMF standard is designed to be modular and extensible, with more distributions and privacy-enhancing mechanisms.
+Due to its open nature, GMF can be used by other software too.



2 changes: 1 addition & 1 deletion docs/source/developer/overview.rst
@@ -15,7 +15,7 @@ The :class:`~metasyn.MetaFrame` class is a core component of the ``metasyn`` pac
Essentially, a :obj:`~metasyn.MetaFrame` is a collection of :obj:`~metasyn.MetaVar` objects, each representing a column in a dataset. It contains methods that allow for the following:

- **Fitting to a DataFrame**: The :meth:`~metasyn.MetaFrame.fit_dataframe` method allows for fitting a Polars DataFrame to create a :obj:`~metasyn.MetaFrame` object. This method takes several parameters including the DataFrame, column specifications, distribution providers, privacy level, and a progress bar flag.
-- **Exporting and importing**: The :meth:`~metasyn.MetaFrame.export` method serializes and exports the :obj:`~metasyn.MetaFrame` to a JSON file, following the GMF format. The :meth:`~metasyn.MetaFrame.from_json` method reads a :obj:`~metasyn.MetaFrame` from a JSON file.
+- **Saving and loading**: The :meth:`~metasyn.MetaFrame.save` method serializes and saves the :obj:`~metasyn.MetaFrame` to a JSON or TOML file, following the GMF format. The :meth:`~metasyn.MetaFrame.load` method reads a :obj:`~metasyn.MetaFrame` from a JSON or TOML file.
- **Synthesizing to a DataFrame**: The :meth:`~metasyn.MetaFrame.synthesize` method creates a synthetic Polars DataFrame based on the :obj:`~metasyn.MetaFrame`.


2 changes: 1 addition & 1 deletion docs/source/faq.rst
@@ -14,7 +14,7 @@ A MetaFrame is a fitted model that describes the aggregate structure and charact

Key elements encapsulated in a MetaFrame include variable names, their data types, the proportion of missing values, and the parameters of the distributions that these variables follow in the dataset. This information is sufficient to understand the overall structure and attributes of the data, without divulging the exact data points.

-When a MetaFrame is created from an input dataset, it can be exported for auditing or manual editing.
+When a MetaFrame is created from an input dataset, it can be saved for auditing or manual editing.

In the ``metasyn`` workflow, once you have a MetaFrame, ``metasyn`` can generate synthetic data that aligns with the MetaFrame. This synthetic data shares the structural and distributional characteristics (as defined in the MetaFrame) with the original data but does not contain any actual data points from the original dataset, thus preserving privacy.
