update paper, add github workflow for auto build
vankesteren committed Aug 9, 2024
1 parent 6aee4c3 commit 0a7be74
Showing 3 changed files with 43 additions and 15 deletions.
28 changes: 28 additions & 0 deletions .github/workflows/joss-paper-draft.yml
@@ -0,0 +1,28 @@
name: Draft PDF
on:
  push:
    paths:
      - docs/paper/**
      - .github/workflows/joss-paper-draft.yml

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          # This should be the path to the paper within your repo.
          paper-path: docs/paper/paper.md
      - name: Upload
        uses: actions/upload-artifact@v4
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: docs/paper/paper.pdf
30 changes: 15 additions & 15 deletions docs/paper/paper.md
@@ -29,25 +29,25 @@ bibliography: paper.bib
---

# Summary
Synthetic data is a promising tool for improving the accessibility of datasets that are otherwise too sensitive to be shared publicly. To this end, we introduce `metasyn`, a Python package for generating synthetic data from tabular datasets. Unlike existing synthetic data generation software, `metasyn` is built on a simple generative model with a "naive" marginal independence assumption --- an explicit choice that lowers the multivariate precision of the synthetic data in order to maintain transparency and auditability, to keep information leakage to a minimum, and even to enable privacy or disclosure risk guarantees through a plug-in system. While the analytical validity of the generated data is thus intentionally limited, its potential uses are broad, including exploratory analyses, code development and testing, and external communication and teaching [@vankesteren2024democratize]. `Metasyn` is flexible, scalable, and easily extended to meet diverse privacy needs.
Synthetic data is a promising tool for improving the accessibility of datasets that are otherwise too sensitive to be shared publicly. To this end, we introduce `metasyn`, a Python package for generating synthetic data from tabular datasets. Unlike existing synthetic data generation software, `metasyn` is built on a simple generative model with a "naïve" marginal independence assumption --- an explicit choice that removes multivariate information from the synthetic data. It makes this trade-off in order to maintain transparency and auditability, to keep information leakage to a minimum, and even to enable privacy or disclosure risk guarantees through a plug-in system. While the analytical validity of the generated data is thus intentionally limited, its potential uses are broad, including exploratory analyses, code development and testing, and external communication and teaching [@vankesteren2024democratize]. `Metasyn` is flexible, scalable, and easily extended to meet diverse privacy needs.

![Logo of the `metasyn` project.](img/logo.svg)

# Statement of need

`Metasyn` is a python package for generating synthetic data with a focus on privacy and disclosure control. It is aimed at owners of sensitive datasets such as public organisations, research groups, and individual researchers who want to improve the accessibility of their data for research and reproducibility by others. The goal of `metasyn` is to make it easy for data owners to share the structure and and approximation of the content of their data with others while keeping privacy concerns to a minimum.
`Metasyn` is a Python package for generating synthetic data with a focus on privacy and disclosure control. It is aimed at owners of sensitive datasets such as public organisations, research groups, and individual researchers who want to improve the accessibility of their data for research and reproducibility by others. The goal of `metasyn` is to make it easy for data owners to share the structure and an approximation of the content of their data with others while keeping privacy concerns to a minimum.

With this goal in mind, `metasyn` distinguishes itself from existing software for generating synthetic data [e.g., @nowok2016synthpop; @templ2017simulation; @ping2017datasynthesizer] by restricting itself to the "augmented plausible" category of synthetic data [@bates2019ons]. This choice enables the software to generate synthetic data with __privacy and disclosure guarantees__ through a plug-in system. Moreover, our system provides an __auditable and editable intermediate representation__ in the form of a human- and machine-readable `.json` metadata file from which new data can be synthesized.
With this goal in mind, `metasyn` distinguishes itself from existing software for generating synthetic data [e.g., @nowok2016synthpop; @templ2017simulation; @ping2017datasynthesizer] by strictly limiting the statistical information from the real data in the produced synthetic data. This choice enables the software to generate synthetic data with __privacy and disclosure guarantees__ through a plug-in system. Moreover, our system provides an __auditable and editable intermediate representation__ in the form of a human- and machine-readable `.json` metadata file from which new data can be synthesized.

Through our focus on privacy and transparency, `metasyn` explicitly avoids generating synthetic data with high analytical validity. The data generated by our system is realistic in terms of data structure and plausible in terms of values for each variable, but any multivariate relations or conditional patterns are excluded. This has implications for how this synthetic data can be used: not for statistical analysis and inference, but rather for initial exploration, analysis script development, and communication outside the data owner's institution. In the intended use case, an external researcher can make use of the synthetic data to assess the feasibility of their intended research before making the (often time-consuming) step of requesting access to the sensitive source data for the final analysis.
Through our focus on privacy and transparency, `metasyn` explicitly avoids generating synthetic data with high analytical validity. The data generated by our system is realistic in terms of data structure and plausible in terms of values for each variable --- the "augmented plausible" category of synthetic data [@bates2019ons] --- but multivariate relations or conditional patterns are not learnt from the real data. This has implications for how this synthetic data can be used: not for statistical analysis and inference, but rather for initial exploration, analysis script development, and communication outside the data owner's institution. In the intended use case, an external researcher can make use of the synthetic data to assess the feasibility of their intended research before making the (often time-consuming) step of requesting access to the sensitive source data for the final analysis.

As mentioned before,the privacy capacities of `metasyn` are extensible through a plug-in system, recognizing that different data owners have different needs and definitions of privacy. A data owner can define under which conditions they would accept open distribution of their synthetic data --- be it based on differential privacy [@dwork2006differential], statistical disclosure control [@dewolf2012statistical], k-anonymity [@sweeney2002k], or another specific definition of privacy. As part of the initial release of `metasyn`, we publish a proof-of-concept plugin following the disclosure control guidelines from Eurostat [@bond2015guidelines].
As mentioned before, the privacy capacities of `metasyn` are extensible through a plug-in system, recognizing that different data owners have different needs and definitions of privacy. A data owner can define under which conditions they would accept open distribution of their synthetic data --- be it based on differential privacy [@dwork2006differential], statistical disclosure control [@dewolf2012statistical], k-anonymity [@sweeney2002k], or another specific definition of privacy. As part of the initial release of `metasyn`, we publish a plugin following the disclosure control guidelines from Eurostat [@bond2015guidelines].

# Software features

At its core, `metasyn` is designed for three functions, which are briefly described in this section:

1. __Estimation__: Automatically select univariate distributions and fit them to a well-defined tabular dataset, optionally with additional privacy guarantees.
1. __Estimation__: Automatically select univariate distributions and fit them to a properly formatted tabular dataset, optionally with additional privacy guarantees.
2. __(De)serialization__: Create an intermediate representation of the fitted model for auditing, editing, and exporting.
3. __Generation__: Generate new synthetic datasets based on the fitted model or its serialized representation.
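
A minimal sketch of these three steps, assuming method names `save`, `load`, and `synthesize` (only `MetaFrame.fit_dataframe` appears verbatim elsewhere in this diff):

```python
# Minimal sketch of the estimate -> (de)serialize -> generate workflow.
# `save`, `load`, and `synthesize` are assumed names and may differ from
# the actual metasyn API.
import polars as pl
from metasyn import MetaFrame

df = pl.read_csv("sensitive_data.csv")   # the real, sensitive dataset
mf = MetaFrame.fit_dataframe(df)         # 1. estimation: fit univariate distributions
mf.save("metadata.json")                 # 2. serialization: auditable, editable .json
mf = MetaFrame.load("metadata.json")     #    ... which can be loaded again later
df_syn = mf.synthesize(10)               # 3. generation: 10 synthetic rows
```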

@@ -80,13 +80,13 @@ For each data type supported by `metasyn`, there is a set of candidate distributions

Table: \label{tbl:dist} Candidate distributions associated with data types in the core `metasyn` package.

| Variable type | Example | Candidate distributions |
| :------------ | :--------------------- | :------------------------------------------------------- |
| categorical | yes/no, country | Categorical (Multinoulli) |
| continuous | 1.0, 2.1, ... | Uniform, Normal, LogNormal, TruncatedNormal, Exponential |
| discrete | 1, 2, ... | Poisson, Uniform, Normal, TruncatedNormal, Categorical |
| string | A108, C122, some words | Regex, Categorical, Faker, FreeText |
| date/time | 2021-01-13, 01:40:12 | Uniform |
| Variable type | Example | Candidate distributions |
| :------------ | :--------------------- | :----------------------------------------------------------------- |
| categorical | yes/no, country | Categorical (Multinoulli), Constant |
| continuous | 1.0, 2.1, ... | Uniform, Normal, LogNormal, TruncatedNormal, Exponential, Constant |
| discrete | 1, 2, ... | Poisson, Uniform, Normal, TruncatedNormal, Categorical, Constant |
| string | A108, C122, some words | Regex, Categorical, Faker, FreeText, Constant |
| date/time | 2021-01-13, 01:40:12 | Uniform, Constant |

From this table, the string distributions deserve special attention as they are not commonly encountered as probability distributions. Regex (regular expression) inference is performed on structured strings using the companion package [RegexModel](https://pypi.org/project/regexmodel/). It is able to automatically detect structure such as room numbers (A108, C122, B109), e-mail addresses, websites, and more, which it summarizes using a probabilistic variant of regular expressions. Another option, should Regex inference fail for lack of structure, is to detect the language (using [lingua](https://pypi.org/project/lingua-language-detector/)) and randomly pick words from that language. We call this approach FreeText. The final alternative is for the data owner to specify that a certain variable should be synthesized using the popular [Faker](https://pypi.org/project/Faker/) package, which can generate specific data types such as localized addresses.
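
A minimal sketch of the kind of localized values Faker can generate on its own (this uses the Faker package directly; how `metasyn` maps a column onto a Faker provider is not shown in this excerpt):

```python
# Minimal sketch: Faker producing localized fake values, independent of metasyn.
from faker import Faker

fake = Faker("nl_NL")    # Dutch locale
print(fake.address())    # a plausible but entirely fake street address
print(fake.name())       # a plausible but entirely fake person name
```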

@@ -199,7 +199,7 @@ shape: (10, 5)
[^1]: This `polars` dataframe can be easily converted to a `pandas` dataframe using `df_syn.to_pandas()`

# Plug-ins and automatic privacy
In addition to the core features described above, the `metasyn` package allows for plug-ins: add-on packages that alter the behaviour of the parameter estimation. Through this system, privacy guarantees can be built into `metasyn`. For example, a package called [`metasyn-disclosure-control`](https://github.com/sodascience/metasyn-disclosure-control) implements the disclosure control output guidelines from Eurostat [@bond2015guidelines] by re-implementing the `fit()` method of the candidate distributions shown in Table \autoref{tbl:dist} to include a micro-aggregation step. In this way, information transfer from the sensitive real data to the synthetic public data can be further reduced.
In addition to the core features described above, the `metasyn` package allows for plug-ins: add-on packages that alter the behaviour of the parameter estimation. Through this system, privacy guarantees can be built into `metasyn` ([privacy plugin template](https://github.com/sodascience/metasyn-privacy-template)) and additional distributions can be supported ([distribution plugin template](https://github.com/sodascience/metasyn-distribution-template)). For example, a plugin package called [`metasyn-disclosure-control`](https://github.com/sodascience/metasyn-disclosure-control) implements the disclosure control output guidelines from Eurostat [@bond2015guidelines] by re-implementing the `fit()` method of the candidate distributions shown in Table \autoref{tbl:dist} to include a micro-aggregation step. In this way, information transfer from the sensitive real data to the synthetic public data can be further reduced.

This plug-in system is user-friendly: the user only needs to `pip install` the package and then `metasyn` can automatically find it to make the methods accessible:

@@ -211,7 +211,7 @@ mf = MetaFrame.fit_dataframe(df, privacy=DisclosurePrivacy())
```

# Conclusion
Synthetic data is a valuable tool for communicating about sensitive datasets. In this work, we have presented the software `metasyn`, which allows data owners to generate a synthetic version of their sensitive tabular data with a focus on privacy and transparency. Unlike existing tools for generating synthetic data, we choose to aim for low analytic validity to enable high privacy guarantees: the underlying model makes a simplifying independence assumption, resulting in few parameters and thus a very small information transfer. This approach additionally allows for disclosure guarantees through a plug-in system.
Synthetic data is a valuable tool for communicating about sensitive datasets. In this work, we have presented the software `metasyn`, which allows data owners to generate a synthetic version of their sensitive tabular data with a focus on privacy and transparency. Unlike existing tools for generating synthetic data, we choose to aim for low analytic validity to enable strong privacy guarantees: the underlying model makes a simplifying independence assumption, resulting in few parameters and thus a very limited information transfer. This approach additionally allows for disclosure guarantees through a plug-in system.

Further documentation and examples can be found on [metasyn.readthedocs.io](https://metasyn.readthedocs.io/).

Binary file modified docs/paper/paper.pdf
Binary file not shown.
