diff --git a/.gitignore b/.gitignore index b03d5aea..fb8f0c2d 100644 --- a/.gitignore +++ b/.gitignore @@ -151,3 +151,6 @@ docs/paper/media # Generated api docs stuff docs/source/api/generated + +# uv stuff +uv.lock diff --git a/docs/paper/paper.bib b/docs/paper/paper.bib index 7b49c275..304e0ab9 100644 --- a/docs/paper/paper.bib +++ b/docs/paper/paper.bib @@ -150,9 +150,16 @@ @inproceedings{ping2017datasynthesizer } @article{vankesteren2024democratize, - title={To democratize research with sensitive data, we should make synthetic data more accessible}, - author={{van Kesteren}, Erik-Jan}, - journal={arXiv preprint arXiv:2404.17271}, - year={2024}, - doi={10.48550/arXiv.2404.17271} + title = {To democratize research with sensitive data, we should make synthetic data more accessible}, + volume = {5}, + ISSN = {2666-3899}, + url = {http://dx.doi.org/10.1016/j.patter.2024.101049}, + DOI = {10.1016/j.patter.2024.101049}, + number = {9}, + journal = {Patterns}, + publisher = {Elsevier BV}, + author = {{van Kesteren}, Erik-Jan}, + year = {2024}, + month = sep, + pages = {101049} } \ No newline at end of file diff --git a/docs/paper/paper.md b/docs/paper/paper.md index 3c3804cf..ef7cde22 100644 --- a/docs/paper/paper.md +++ b/docs/paper/paper.md @@ -43,36 +43,33 @@ These choices enable the software to generate synthetic data with __privacy and # Software features -At its core, `metasyn` has three main functions: - -1. __Estimation__: Fit a generative model to a properly formatted tabular dataset, optionally with privacy guarantees. -2. __(De)serialization__: Create an intermediate representation of the fitted model for auditing, editing, and saving. -3. __Generation__: Synthesize new datasets based on a fitted model. 
+At its core, `metasyn` has three main functions: __estimation__, to fit a model to a properly formatted tabular dataset; __generation__, to synthesize new datasets based on a fitted model; and __(de)serialization__, to write the fitted model to a file for auditing, editing, and storage.
 
 ## Estimation
-The generative model in `metasyn` makes the assumption of marginal independence: each column is considered separately, similar to naïve Bayes classifiers [@hastie2009elements]. Some key advantages of this naïve approach are transparency and explainability, flexibility in handling mixed data types, and computational scalability to high-dimensional datasets. Formally, the generative model for $K$-variate data $\mathbf{x}$ is:
+Model estimation starts with an appropriately pre-processed data frame, meaning it is tidy [@wickham2014tidy], each column has the correct data type, and missing data are represented by a missing value. Accordingly, `metasyn` is built on the `polars` data frame library [@vink2024polars].
As an example, the first records of the "hospital" data built into `metasyn` are printed below: -\begin{equation} \label{eq:model} -p(\mathbf{x}) = \prod_{k = 1}^K p(x_k) -\end{equation} +``` +┌────────────┬───────────────┬───────────────┬──────┬──────┬───────────────┐ +│ patient_id ┆ date_admitted ┆ time_admitted ┆ type ┆ age ┆ hours_in_room │ +│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ +│ str ┆ date ┆ time ┆ cat ┆ i64 ┆ f64 │ +╞════════════╪═══════════════╪═══════════════╪══════╪══════╪═══════════════╡ +│ A5909X0 ┆ 2024-01-01 ┆ 10:30:00 ┆ IVT ┆ null ┆ 3.633531 │ +│ B4025X2 ┆ 2024-01-01 ┆ 11:23:00 ┆ IVT ┆ 59 ┆ 6.932891 │ +│ B6999X2 ┆ 2024-01-01 ┆ 11:58:00 ┆ IVT ┆ 77 ┆ 1.970654 │ +│ B9525X2 ┆ 2024-01-01 ┆ 16:56:00 ┆ MYE ┆ null ┆ 1.620047 │ +│ … ┆ … ┆ … ┆ … ┆ … ┆ … │ +└────────────┴───────────────┴───────────────┴──────┴──────┴───────────────┘ +``` -Model estimation starts with an appropriately pre-processed data frame, meaning it is tidy [@wickham2014tidy], each column has the correct data type, and missing data are represented by a missing value. Internally, our software uses the `polars` data frame library [@vink2024polars], as it is performant, has consistent data types, and natively supports missing data (i.e., `null` values). An example source table is printed below (NB: categorical data are appropriately encoded as `cat`, not `str`): +Note that categorical data are encoded as `cat` (not `str`) and missing data is represented by `null` values. 
Model estimation with `metasyn` is then performed as follows: -``` -┌─────┬────────┬─────┬────────┬──────────┐ -│ ID ┆ fruits ┆ B ┆ cars ┆ optional │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ i64 ┆ cat ┆ i64 ┆ cat ┆ i64 │ -╞═════╪════════╪═════╪════════╪══════════╡ -│ 1 ┆ banana ┆ 5 ┆ beetle ┆ 28 │ -│ 2 ┆ banana ┆ 4 ┆ audi ┆ 300 │ -│ 3 ┆ apple ┆ 3 ┆ beetle ┆ null │ -│ 4 ┆ apple ┆ 2 ┆ beetle ┆ 2 │ -│ 5 ┆ banana ┆ 1 ┆ beetle ┆ -30 │ -└─────┴────────┴─────┴────────┴──────────┘ +```python +from metasyn import MetaFrame +mf = MetaFrame.fit_dataframe(df_hospital) ``` -For each data type, a set of candidate distributions is fitted (see \autoref{tbl:dist}), and then `metasyn` selects the one with the lowest BIC [@neath2012bayesian]. For distributions where BIC computation is impossible (e.g., for the string data type) a pseudo-BIC is created that trades off fit and complexity of the underlying models. +The generative model in `metasyn` makes the simplifying assumption of _marginal independence_: each column is considered separately, similar to naïve Bayes classifiers [@hastie2009elements]. For each column, a set of candidate distributions is fitted (see \autoref{tbl:dist}), and then `metasyn` selects the one that fits best (usually having the lowest BIC [@neath2012bayesian]). Key advantages of this approach are transparency and explainability, flexibility in handling mixed data types, and computational scalability to high-dimensional datasets. Table: \label{tbl:dist} Candidate distributions associated with data types in the core `metasyn` package. @@ -84,77 +81,71 @@ Table: \label{tbl:dist} Candidate distributions associated with data types in th | String | Regex, Categorical, Faker, FreeText, Constant | | Date/time | Uniform, Constant | -From this table, the string distributions deserve special attention as they are not common probability distributions. 
The regex (regular expression) distribution uses the package [`regexmodel`](https://pypi.org/project/regexmodel/) to automatically detect structure such as room numbers (A108, C122, B109), e-mail addresses, or websites. The FreeText distribution detects the language (using [lingua](https://pypi.org/project/lingua-language-detector/)) and randomly picks words from that language. The [Faker](https://pypi.org/project/Faker/) distribution can generate specific data types such as localized names and addresses pre-specified by the user. +From this table, the string distributions deserve special attention as they are not common probability distributions. The regex (regular expression) distribution uses the package [`regexmodel`](https://pypi.org/project/regexmodel/) to automatically detect structure such as room numbers (A108, C122, B109), identifiers, e-mail addresses, or websites. The FreeText distribution detects the language (using [lingua](https://pypi.org/project/lingua-language-detector/)) and randomly picks words from that language. The [Faker](https://pypi.org/project/Faker/) distribution can generate specific data types such as localized names and addresses pre-specified by the user. -Generative model estimation with `metasyn` can be performed as follows: -```python -from metasyn import MetaFrame -mf = MetaFrame.fit_dataframe(df) -``` +## Data generation -## Serialization and deserialization -After fitting a model, `metasyn` can transparently store it in a human- and machine-readable `.json` metadata file. 
This file contains dataset-level descriptive information as well as the following variable-level information: - -```json -{ - "name": "fruits", - "type": "categorical", - "dtype": "Categorical(ordering='physical')", - "prop_missing": 0.0, - "distribution": { - "implements": "core.multinoulli", - "version": "1.0", - "provenance": "builtin", - "class_name": "MultinoulliDistribution", - "unique": false, - "parameters": { - "labels": ["apple", "banana"], - "probs": [0.4, 0.6] - } - }, - "creation_method": { "created_by": "metasyn" } -} +After creating a `MetaFrame`, `metasyn` can randomly sample synthetic datapoints from it. This is done using the `synthesize()` method: + +```python +df_syn = mf.synthesize(3) ``` -This `.json` can be manually audited, edited, and after saving this file, an unlimited number of synthetic records can be created without incurring additional privacy risks. Serialization and deserialization with `metasyn` can be performed as follows: +This may result in the following data frame. Note that missing values in the `age` column are appropriately reproduced as well. -```python -mf.save("fruits.json") -mf_new = MetaFrame.load("fruits.json") +``` +┌────────────┬───────────────┬───────────────┬──────┬──────┬───────────────┐ +│ patient_id ┆ date_admitted ┆ time_admitted ┆ type ┆ age ┆ hours_in_room │ +│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ +│ str ┆ date ┆ time ┆ cat ┆ i64 ┆ f64 │ +╞════════════╪═══════════════╪═══════════════╪══════╪══════╪═══════════════╡ +│ B7906X1 ┆ 2024-01-04 ┆ 13:32:00 ┆ IVT ┆ 37 ┆ 4.955418 │ +│ B0553X2 ┆ 2024-01-02 ┆ 10:54:00 ┆ IVT ┆ 39 ┆ 3.872872 │ +│ A5397X7 ┆ 2024-01-03 ┆ 18:16:00 ┆ CAT ┆ null ┆ 6.569082 │ +└────────────┴───────────────┴───────────────┴──────┴──────┴───────────────┘ ``` -## Data generation -For each variable in a `MetaFrame` object, `metasyn` can randomly sample synthetic datapoints. 
Data generation (or synthetization) in `metasyn` can be performed as follows:
 
+## Serialization and deserialization
+`MetaFrame`s can also be transparently stored in a human- and machine-readable `.json` metadata file. This file contains dataset-level descriptive information as well as variable-level information. The `.json` file can be manually audited and edited; once saved, it can be used to generate an unlimited number of synthetic records without incurring additional privacy risks. Serialization and deserialization with `metasyn` are done using the `save()` and `load()` methods:
 
 ```python
-df_syn = mf.synthesize(3)
+mf.save("hospital_admissions.json")
+mf_new = MetaFrame.load("hospital_admissions.json")
 ```
 
-This may result in the following data frame. Note that missing values in the `optional` column are appropriately reproduced as well.
+# Privacy
+As a general principle, `metasyn` errs on the side of privacy by default, aiming to recreate the structure but not all content and relations in the source data.
For example, take the following sensitive dataset where study participants state how they use drugs in daily life: ``` -┌─────┬────────┬─────┬────────┬──────────┐ -│ ID ┆ fruits ┆ B ┆ cars ┆ optional │ -│ --- ┆ --- ┆ --- ┆ --- ┆ --- │ -│ i64 ┆ cat ┆ i64 ┆ cat ┆ i64 │ -╞═════╪════════╪═════╪════════╪══════════╡ -│ 1 ┆ banana ┆ 4 ┆ beetle ┆ null │ -│ 2 ┆ banana ┆ 3 ┆ audi ┆ null │ -│ 3 ┆ banana ┆ 2 ┆ beetle ┆ 172 │ -└─────┴────────┴─────┴────────┴──────────┘ +┌────────────────┬─────────────────────────────────┐ +│ participant_id ┆ drug_use │ +│ --- ┆ --- │ +│ str ┆ str │ +╞════════════════╪═════════════════════════════════╡ +│ OOWJAHA4 ┆ I use marijuana in the evening… │ +│ 8CA1RV4P ┆ I occasionally take CBD to hel… │ +│ FMSVAKPM ┆ Prescription medication helps … │ +│ … ┆ … │ +└────────────────┴─────────────────────────────────┘ ``` -# Plug-ins and automatic privacy -The `metasyn` package also allows for plug-ins: packages that alter the distribution fitting behaviour. Through this system, privacy guarantees can be built into `metasyn` ([privacy plug-in template](https://github.com/sodascience/metasyn-privacy-template)) and additional distributions can be supported ([distribution plug-in template](https://github.com/sodascience/metasyn-distribution-template)). The [`metasyn-disclosure-control`](https://github.com/sodascience/metasyn-disclosure-control) plug-in implements output guidelines from Eurostat [@bond2015guidelines] by including micro-aggregation. In this way, information transfer from the sensitive real data to the synthetic public data can be further limited. 
Disclosure control is performed as follows:
-
-```python
-from metasyn import MetaFrame
-from metasyncontrib.disclosure import DisclosurePrivacy
+When creating synthetic data for this example, the information in the open answers is removed; the standard `FreeText` distribution replaces it with randomly picked words from the detected language (English):
 
-mf = MetaFrame.fit_dataframe(df, privacy=DisclosurePrivacy())
 ```
+┌────────────────┬─────────────────────────────────┐
+│ participant_id ┆ drug_use                        │
+│ ---            ┆ ---                             │
+│ str            ┆ str                             │
+╞════════════════╪═════════════════════════════════╡
+│ ZQJZQAB7       ┆ Lawyer let sort her yet line e… │
+│ 7KDLEL0S       ┆ Particularly third myself edge… │
+│ QBZKGXC7       ┆ Put color against call researc… │
+└────────────────┴─────────────────────────────────┘
+```
+
+Additionally, the `metasyn` package supports [plug-ins](https://github.com/sodascience/metasyn-privacy-template) that alter the estimation behaviour. Through this system, privacy guarantees can be built into `metasyn` and additional distributions can be supported. For example, [`metasyn-disclosure-control`](https://github.com/sodascience/metasyn-disclosure-control) implements output guidelines from Eurostat [@bond2015guidelines] through _micro-aggregation_.
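The idea behind micro-aggregation can be sketched in a few lines (a simplified illustration, not the plug-in's actual implementation): values are pooled into groups of at least `k` similar records and replaced by their group mean before a distribution is fitted, so no single exact value reaches the model.

```python
import numpy as np

def micro_aggregate(values: np.ndarray, k: int = 5) -> np.ndarray:
    """Replace each value by the mean of a group of at least k similar values.

    Simplified sketch of micro-aggregation: sort the values, partition them
    into consecutive groups of at least k records, and substitute every
    value with its group mean.
    """
    order = np.argsort(values)
    out = np.empty(len(values), dtype=float)
    start = 0
    while start < len(values):
        # Fold a too-small remainder into the final group.
        end = len(values) if len(values) - start < 2 * k else start + k
        idx = order[start:end]
        out[idx] = values[idx].mean()
        start = end
    return out

ages = np.array([30, 37, 39, 40, 49, 59, 60, 68, 77, 77, 82, 84, 87])
print(micro_aggregate(ages, k=5))
```

Because each value is replaced by a group mean, column-level summary statistics such as the mean are preserved while individual values are coarsened.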
# Acknowledgements diff --git a/docs/paper/paper.pdf b/docs/paper/paper.pdf index d5c6ea79..864a26a3 100644 Binary files a/docs/paper/paper.pdf and b/docs/paper/paper.pdf differ diff --git a/metasyn/demo/dataset.py b/metasyn/demo/dataset.py index 6f820d29..4e4e6867 100644 --- a/metasyn/demo/dataset.py +++ b/metasyn/demo/dataset.py @@ -2,7 +2,7 @@ # import random import string -from abc import ABC, abstractproperty +from abc import ABC, abstractmethod from datetime import date, datetime, time, timedelta from pathlib import Path @@ -21,16 +21,19 @@ def register(*args): """Register a dataset so that it can be found by name.""" + def _wrap(cls): _AVAILABLE_DATASETS[cls().name] = cls() return cls + return _wrap(*args) class BaseDataset(ABC): """Base class for demo datasets.""" - @abstractproperty + @property + @abstractmethod def name(self): pass @@ -39,10 +42,10 @@ def file_location(self): return files(__package__) / f"demo_{self.name}.csv" def get_dataframe(self): - return pl.read_csv(self.file_location, schema_overrides=self.schema, - try_parse_dates=True) + return pl.read_csv(self.file_location, schema_overrides=self.schema, try_parse_dates=True) - @abstractproperty + @property + @abstractmethod def schema(self): pass @@ -67,6 +70,7 @@ def schema(self): def var_specs(self): return [VarSpec("PassengerId", unique=True)] + @register class SpaceShipDataset(BaseDataset): """CC-BY from https://www.kaggle.com/competitions/spaceship-titanic.""" @@ -135,6 +139,42 @@ def name(self): def schema(self): return {"SOP_DESCRIPTION": pl.Categorical, "BODYSITE_DESCRIPTION": pl.Categorical} + +@register +class HospitalDataset(BaseDataset): + """Example electronic health record hospital dataset. + + This dataset was created manually by the metasyn team. 
+ """ + + @property + def name(self): + return "hospital" + + @property + def schema(self): + return {"date_admitted": pl.Date, "time_admitted": pl.Time, "type": pl.Categorical} + + +@register +class DrugUseDataset(BaseDataset): + """Example dataset with answers to an open question on study participants' daily drug use. + + This example dataset was generated through ChatGPT-4o on 07-11-2024 using the following prompt: + > Create a csv with 12 rows and 2 columns: participant_id, and drug_use. The participant_id + has a standard alphanumeric structure, and the drug_use contains participant's responses on + how they use drugs in their daily life. + """ + + @property + def name(self): + return "druguse" + + @property + def schema(self): + return {} + + @register class TestDataset(BaseDataset): """Test dataset with all supported data types.""" @@ -146,8 +186,10 @@ def name(self): @property def schema(self): columns = pl.read_csv(self.file_location).columns - return {col_name: (getattr(pl, col_name[3:]) if col_name != "NA" else pl.String) - for col_name in columns} + return { + col_name: (getattr(pl, col_name[3:]) if col_name != "NA" else pl.String) + for col_name in columns + } @classmethod def create(cls, csv_file): @@ -155,47 +197,81 @@ def create(cls, csv_file): n_rows = 100 for int_val in [8, 16, 32, 64]: - all_series.append(pl.Series(f"pl.Int{int_val}", - [np.random.randint(-10, 10) for _ in range(n_rows)], - dtype=getattr(pl, f"Int{int_val}"))) - all_series.append(pl.Series(f"pl.UInt{int_val}", - [np.random.randint(10) for _ in range(n_rows)], - dtype=getattr(pl, f"UInt{int_val}"))) + all_series.append( + pl.Series( + f"pl.Int{int_val}", + [np.random.randint(-10, 10) for _ in range(n_rows)], + dtype=getattr(pl, f"Int{int_val}"), + ) + ) + all_series.append( + pl.Series( + f"pl.UInt{int_val}", + [np.random.randint(10) for _ in range(n_rows)], + dtype=getattr(pl, f"UInt{int_val}"), + ) + ) for float_val in [32, 64]: - 
all_series.append(pl.Series(f"pl.Float{float_val}", - np.random.randn(n_rows), - dtype=getattr(pl, f"Float{float_val}"))) - - all_series.append(pl.Series("pl.Date", [date(2024, 9, 4) + timedelta(days=i) - for i in range(n_rows)], - dtype=pl.Date)) - all_series.append(pl.Series("pl.Datetime", - [datetime(2024, 9, 4, 12, 30, 12) - + timedelta(hours=i, minutes=i*2, seconds=i*3) - for i in range(n_rows)], - dtype=pl.Datetime)) - all_series.append(pl.Series("pl.Time", - [time(3+i//20, 6+i//12, 12+i//35) for i in range(n_rows)], - dtype=pl.Time)) - all_series.append(pl.Series("pl.String", - np.random.choice(list(string.printable), size=n_rows), - dtype=pl.String)) - all_series.append(pl.Series("pl.Utf8", - np.random.choice(list(string.printable), size=n_rows), - dtype=pl.Utf8)) - all_series.append(pl.Series("pl.Categorical", - np.random.choice(list(string.ascii_uppercase[:5]), size=n_rows), - dtype=pl.Categorical)) - all_series.append(pl.Series("pl.Boolean", - np.random.choice([True, False], size=n_rows), - dtype=pl.Boolean)) + all_series.append( + pl.Series( + f"pl.Float{float_val}", + np.random.randn(n_rows), + dtype=getattr(pl, f"Float{float_val}"), + ) + ) + + all_series.append( + pl.Series( + "pl.Date", + [date(2024, 9, 4) + timedelta(days=i) for i in range(n_rows)], + dtype=pl.Date, + ) + ) + all_series.append( + pl.Series( + "pl.Datetime", + [ + datetime(2024, 9, 4, 12, 30, 12) + + timedelta(hours=i, minutes=i * 2, seconds=i * 3) + for i in range(n_rows) + ], + dtype=pl.Datetime, + ) + ) + all_series.append( + pl.Series( + "pl.Time", + [time(3 + i // 20, 6 + i // 12, 12 + i // 35) for i in range(n_rows)], + dtype=pl.Time, + ) + ) + all_series.append( + pl.Series( + "pl.String", np.random.choice(list(string.printable), size=n_rows), dtype=pl.String + ) + ) + all_series.append( + pl.Series( + "pl.Utf8", np.random.choice(list(string.printable), size=n_rows), dtype=pl.Utf8 + ) + ) + all_series.append( + pl.Series( + "pl.Categorical", + 
np.random.choice(list(string.ascii_uppercase[:5]), size=n_rows),
+                dtype=pl.Categorical,
+            )
+        )
+        all_series.append(
+            pl.Series("pl.Boolean", np.random.choice([True, False], size=n_rows), dtype=pl.Boolean)
+        )
         all_series.append(pl.Series("NA", [None for _ in range(n_rows)], dtype=pl.String))
 
         # Add NA's for all series except the categorical
         for series in all_series:
             if series.name != "pl.Categorical":
-                none_idx = np.random.choice(np.arange(n_rows), size=n_rows//10, replace=False)
+                none_idx = np.random.choice(np.arange(n_rows), size=n_rows // 10, replace=False)
                 none_idx.sort()
                 series[none_idx] = None
diff --git a/metasyn/demo/demo_druguse.csv b/metasyn/demo/demo_druguse.csv
new file mode 100644
index 00000000..1eaea06b
--- /dev/null
+++ b/metasyn/demo/demo_druguse.csv
@@ -0,0 +1,13 @@
+participant_id,drug_use
+OOWJAHA4,I use marijuana in the evenings to relax and manage my anxiety.
+8CA1RV4P,I occasionally take CBD to help with sleep and muscle recovery after workouts.
+FMSVAKPM,Prescription medication helps me manage chronic pain; I take it as directed daily.
+JLTGIXS2,I use caffeine throughout the day to keep my energy up at work.
+U5CI8Y5F,I smoke cigarettes during breaks; it’s part of my daily routine.
+8YMDTW83,I use Adderall as prescribed to help with focus for my studies.
+C4UVCR0B,"I drink alcohol socially, but sometimes have a glass of wine at dinner to unwind."
+A0Q515ZQ,I take anti-anxiety medication every morning as part of my mental health regimen.
+OY50H1JG,"Occasionally, I use edibles on weekends to help me relax after a stressful week."
+ADU9F363,I use nicotine patches to help reduce cigarette cravings.
+P66WD839,"I use melatonin at night to help with sleep, especially on stressful days."
+68WPHNB4,"I take a prescription stimulant daily for ADHD, which helps me stay productive."
diff --git a/metasyn/demo/demo_hospital.csv b/metasyn/demo/demo_hospital.csv new file mode 100644 index 00000000..182492ff --- /dev/null +++ b/metasyn/demo/demo_hospital.csv @@ -0,0 +1,19 @@ +patient_id,date_admitted,time_admitted,type,age,hours_in_room +A5909X0,2024-01-01,10:30:00.000000000,IVT,,3.6335314585568366 +B4025X2,2024-01-01,11:23:00.000000000,IVT,59,6.93289094489419 +B6999X2,2024-01-01,11:58:00.000000000,IVT,77,1.970654097305811 +B9525X2,2024-01-01,16:56:00.000000000,MYE,,1.6200472977499425 +B0453X1,2024-01-02,11:50:00.000000000,IVT,,3.344935472600846 +A6441X2,2024-01-02,16:11:00.000000000,IVT,,5.28264145694029 +A9260X4,2024-01-02,17:08:00.000000000,IVT,68,2.2405604565294372 +B7526X0,2024-01-02,17:12:00.000000000,IVT,,7.268823575595286 +B4675X8,2024-01-03,11:10:00.000000000,MYE,49,2.8933899306166166 +A3206X1,2024-01-04,10:12:00.000000000,CAT,40,5.766570385883389 +A4363X0,2024-01-04,10:59:00.000000000,IVT,77,6.931131403828807 +A9418X9,2024-01-04,13:12:00.000000000,IVT,82,3.1923453535590203 +B4309X2,2024-01-04,13:51:00.000000000,IVT,30,1.828827719602017 +B3830X7,2024-01-04,16:03:00.000000000,CAT,87,3.8709200461975373 +A7952X3,2024-01-04,18:00:00.000000000,IVT,60,4.332882463068666 +A0013X2,2024-01-05,10:06:00.000000000,CAT,,2.1333391075365316 +B2077X8,2024-01-05,10:43:00.000000000,IVT,,6.026762991290889 +A4700X7,2024-01-05,14:53:00.000000000,MYE,84,6.193777477178557