diff --git a/docs/source/developer/overview.rst b/docs/source/developer/overview.rst index f3d7252b..efd6a48b 100644 --- a/docs/source/developer/overview.rst +++ b/docs/source/developer/overview.rst @@ -28,8 +28,7 @@ A :obj:`~metasyn.MetaVar` contains information on the variable type (``var_type` This class is considered a passthrough class used by the :obj:`~metasyn.MetaFrame` class, and is not intended to be used directly by the user. It contains the following functionality: -- **Detecting variable types**: The :meth:`~metasyn.MetaVar.detect` method detects the variable class(es) of a series or dataframe. This method does not fit any distribution, but it does infer the correct types for the :obj:`~metasyn.MetaVar` and saves the ``Series`` for later fitting. -- **Fitting distributions**: The :meth:`~metasyn.MetaVar.fit` method fits distributions to the data. Here you can set the distribution, privacy package and uniqueness for the variable again. +- **Fitting distributions**: The :meth:`~metasyn.MetaVar.fit` method fits distributions to the data. Here you can set the distribution, privacy package and uniqueness for the variable. - **Drawing values and series**: The :meth:`~metasyn.MetaVar.draw` method draws a random item for the variable in whatever type is required. The :meth:`~metasyn.MetaVar.draw_series` method draws a new synthetic series from the metadata. For this to work, the variable has to be fitted. - **Converting to and from a dictionary**: The :meth:`~metasyn.MetaVar.to_dict` method creates a dictionary from the variable. The :meth:`~metasyn.MetaVar.from_dict` method restores a variable from a dictionary. @@ -54,4 +53,5 @@ The ``metasyn`` package is organized into several submodules, each focusing on d * The :mod:`metasyn.testutils` module provides testing utilities for plugins. It includes functions for checking distributions and distribution providers. 
* The :mod:`metasyn.validation` module contains tools for validating distribution outputs and GMF file formats. * The :mod:`metasyn.privacy` module contains the basis for implementing privacy features. A system to incorporate privacy features such as differential privacy or other forms of disclosure control is still being implemented. - +* The :mod:`metasyn.util` module contains utility classes :class:`~metasyn.util.DistributionSpec` and :class:`~metasyn.util.VarConfig`. +* The :mod:`metasyn.config` module contains the :class:`~metasyn.config.MetaConfig` class that can read .toml configuration files. diff --git a/docs/source/usage/cli.rst b/docs/source/usage/cli.rst index 1fa68d0e..5d91b38c 100644 --- a/docs/source/usage/cli.rst +++ b/docs/source/usage/cli.rst @@ -4,7 +4,7 @@ Metasyn provides a command-line interface (CLI) for accessing core functionality The CLI currently has three subcommands: -* The ``create-meta`` subcommand, which allows you to **create generative metadata** from a ``CSV file`` +* The ``create-meta`` subcommand, which allows you to **create generative metadata** from a ``.csv file`` and/or a ``.toml`` configuration file. * The ``synthesize`` subcommand, which allows you to **generate synthetic data** from a ``GMF file`` * The ``schema`` subcommand, which allows you to **create validation schemas** for GMF files. @@ -85,7 +85,7 @@ The ``create-meta`` command can be used as follows: .. code-block:: bash - metasyn create-meta [input] [output] + metasyn create-meta --input [input] --output [output] This will: @@ -126,17 +126,43 @@ The ``create-meta`` command also takes one optional argument: .. note:: - The configuration file must be in the `.ini` format. For more information on the format, please refer to the `Python documentation `_. + The configuration file must be in the `.toml` format. For more information on the format, please refer to the `TOML documentation `_. 
An example of a configuration file that specifies the ``PassengerId`` column to be unique and the ``Fare`` column to have a log-normal distribution is as follows: - .. code-block:: ini + .. code-block:: toml - [var.PassengerId] - unique = True + [[var]] + name = "PassengerId" + distribution = {unique = true} # Note that booleans are lowercase in .toml files. - [var.Fare] - distribution=LogNormalDistribution + + [[var]] + name = "Fare" + distribution = {implements = "core.log_normal"} + +It is also possible to create a GMF file without any input CSV. For this to work, you need to supply a configuration +file that fully specifies all desired columns. You will need to tell ``metasyn`` in the configuration file that the +column is ``data_free``. It is also required to set the number of rows under the ``general`` section, for example: + + .. code-block:: toml + + [general] + n_rows = 100 + + + [[var]] + + name = "PassengerId" + data_free = true + unique = true + prop_missing = 0.0 + description = "ID of the unfortunate passenger." + var_type = "discrete" + distribution = {implements = "core.unique_key", unique = true, parameters = {consecutive = 1, low = 0}} + +The example above will generate a GMF file that can be used to generate new synthetic data with the ``synthesize`` +subcommand described below. Generating Synthetic Data diff --git a/docs/source/usage/generating_metaframes.rst b/docs/source/usage/generating_metaframes.rst index 89a51507..ceb6f964 100644 --- a/docs/source/usage/generating_metaframes.rst +++ b/docs/source/usage/generating_metaframes.rst @@ -19,13 +19,15 @@ One of the main features of ``metasyn`` is to create a :obj:`MetaFrame ` classmethod. +Metasyn can generate metadata from any given dataset (provided as a Polars or Pandas DataFrame), +using the :meth:`metasyn.MetaFrame.fit_dataframe(df) ` classmethod. ..
image:: /images/pipeline_estimation_code.png :alt: MetaFrame Generation With Code Snippet :align: center -This function requires a :obj:`DataFrame` to be specified as parameter. The following code returns a :obj:`MetaFrame` object named :obj:`mf`, based on a DataFrame named :obj:`df`. +This function requires a :obj:`DataFrame` to be specified as a parameter. The following code returns +a :obj:`MetaFrame` object named :obj:`mf`, based on a DataFrame named :obj:`df`. .. code-block:: python @@ -47,16 +49,29 @@ It is possible to print the (statistical metadata contained in the) :obj:`MetaFr Optional Parameters ---------------------- -The :meth:`metasyn.MetaFrame.fit_dataframe() ` class method allows you to have more control over how your synthetic dataset is generated with additional (optional) parameters: +The :meth:`metasyn.MetaFrame.fit_dataframe() ` class method +allows you to have more control over how your synthetic dataset is generated with additional (optional) +parameters: -Besides the required `df` parameter, :meth:`metasyn.MetaFrame.fit_dataframe() ` accepts three parameters: ``spec``, ``dist_providers`` and ``privacy``. +Besides the required `df` parameter, :meth:`metasyn.MetaFrame.fit_dataframe() ` +accepts four parameters: ``meta_config``, ``var_specs``, ``dist_providers`` and ``privacy``. Let's take a look at each optional parameter individually: -spec -^^^^ -**spec** is an optional dictionary that outlines specific directives for each DataFrame column (variable). The potential directives include: - +meta_config +^^^^^^^^^^^ +**meta_config** is an optional parameter that encompasses all the other parameters; it contains information on the +``var_specs``, ``dist_providers`` and ``privacy``. This parameter is generally used when the configuration is loaded +from a .toml file. Otherwise, it is recommended to leave ``meta_config`` at its default value (None) and specify +the other optional parameters.
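Since ``meta_config`` is usually populated from a .toml file, a standalone sketch of how such a configuration parses may help; this uses only the standard library (not metasyn itself) and a config shaped like the CLI examples in these docs:

```python
# Sketch: parsing a metasyn-style .toml configuration with the standard library.
# Assumes Python 3.11+ for tomllib; the tomli backport mirrors its API.
try:
    import tomllib  # Python 3.11+
except ImportError:
    import tomli as tomllib  # backport, same approach as metasyn.config

CONFIG = """
[general]
n_rows = 100

[[var]]
name = "PassengerId"
distribution = {unique = true}

[[var]]
name = "Fare"
distribution = {implements = "core.log_normal"}
"""

config = tomllib.loads(CONFIG)
general = config.get("general", {})   # global defaults such as n_rows
var_list = config.get("var", [])      # one dict per [[var]] table

print(general["n_rows"])
print([v["name"] for v in var_list])
```

Each ``[[var]]`` table becomes one per-column dictionary, which is the structure the configuration machinery consumes.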
+ +var_specs +^^^^^^^^^ +**var_specs** is an optional list that outlines specific directives for columns (variables) in the DataFrame. +The potential directives include: + + - ``name``: This specifies the column name and is mandatory. + - ``distribution``: Allows you to specify the statistical distribution of each column. To see what distributions are available, refer to the :doc:`distribution package API reference`. - ``unique``: Declare whether the column in the synthetic dataset should contain unique values. By default no column is set to unique. @@ -64,12 +79,12 @@ spec .. admonition:: Detection of unique variables When generating a MetaFrame, ``metasyn`` will automatically analyze the columns of the input DataFrame to detect ones that contain only unique values. - If such a column is found, and it has not manually been set to unique in the ``spec`` dictionary, the user will be notified with the following warning: + If such a column is found, and it has not manually been set to unique in the ``var_specs`` list, the user will be notified with the following warning: ``Warning: Variable [column_name] seems unique, but not set to be unique. Set the variable to be either unique or not unique to remove this warning`` It is safe to ignore this warning - however, be aware that without setting the column as unique, ``metasyn`` may generate duplicate values for that column when synthesizing data. - To remove the warning and ensure the values in the synthesized column are unique, set the column to be unique (``"column" = {"unique": True}``) in the ``spec`` dictionary. + To remove the warning and ensure the values in the synthesized column are unique, set the column to be unique (``VarConfig(name="column", dist_spec=DistributionSpec(unique=True))``) in the ``var_specs`` list. - ``description``: Includes a description for each column in the DataFrame. @@ -80,7 +95,7 @@ spec - ``prop_missing``: Set the intended proportion of missing values in the synthetic data for each column. -..
admonition:: Example use of the ``spec`` parameter +.. admonition:: Example use of the ``var_specs`` parameter - For the column ``PassengerId``, we want unique values. - The ``Name`` column should be populated with realistic fake names using the `Faker `_ library. @@ -91,30 +106,29 @@ spec The following code to achieve this would look like: .. code-block:: python - + from metasyn.distribution import FakerDistribution, DiscreteUniformDistribution, RegexDistribution + from metasyn.util import VarConfig, DistributionSpec # Create a specification dictionary for generating synthetic data - var_spec = { - + var_specs = [ # Ensure unique values for the `PassengerId` column - "PassengerId": {"unique": True}, + VarConfig(name="PassengerId", dist_spec=DistributionSpec(unique=True)), # Utilize the Faker library to synthesize realistic names for the `Name` column - "Name": {"distribution": FakerDistribution("name")}, + VarConfig(name="Name", dist_spec=FakerDistribution("name")), - # Fit `Fare` to an exponential distribution based on the data - "Fare": {"distribution": "ExponentialDistribution"}, + # Fit `Fare` to a log-normal distribution, but base the parameters on the data + VarConfig(name="Fare", dist_spec="LogNormalDistribution"), - # Fit `Age` to a discrete uniform distribution ranging from 20 to 40 - "Age": {"distribution": DiscreteUniformDistribution(20, 40)}, + # Set the `Age` column to a discrete uniform distribution ranging from 20 to 40 + VarConfig(name="Age", dist_spec=DiscreteUniformDistribution(20, 40)), # Use a regex-based distribution to generate `Cabin` values following [A-F][0-9]{2,3} - "Cabin": {"distribution": RegexDistribution(r"[A-F][0-9]{2,3}")} - - } + VarConfig(name="Cabin", dist_spec=RegexDistribution(r"[A-F][0-9]{2,3}"), description="The cabin number of the passenger."), + ] - mf = MetaFrame.fit_dataframe(df, spec=var_spec) + mf = MetaFrame.fit_dataframe(df, var_specs=var_specs) dist_providers diff --git a/examples/basic_example.py b/examples/basic_example.py index
1d45a847..180165b7 100644 --- a/examples/basic_example.py +++ b/examples/basic_example.py @@ -1,4 +1,5 @@ import polars as pl + from metasyn import MetaFrame # example dataframe from polars website diff --git a/examples/example_config.toml b/examples/example_config.toml new file mode 100644 index 00000000..be9c4c7d --- /dev/null +++ b/examples/example_config.toml @@ -0,0 +1,32 @@ +# Example toml file as input for metasyn + +[general] +dist_providers = ["builtin", "metasyn-disclosure"] + +[general.privacy] +name = "disclosure" +parameters = {n_avg = 11} + + +[[var]] +name = "PassengerId" +distribution = {unique = true} # Notice booleans are lower case in .toml files. + +[[var]] +name = "Name" +prop_missing = 0.1 +description = "Name of the unfortunate passenger of the titanic." +distribution = {implements = "core.faker", parameters = {faker_type = "name", locale = "en_US"}} + +[[var]] +name = "Fare" +distribution = {implements = "core.exponential"} + +[[var]] +name = "Age" +distribution = {implements = "core.uniform", parameters = {low = 20, high = 40}} + +[[var]] +name = "Cabin" +distribution = {implements = "core.regex", parameters = {regex_data = "[A-F][0-9]{2,3}"}} +privacy = {name = "disclosure", parameters = {n_avg = 21}} diff --git a/examples/example_gmf_titanic.json b/examples/example_gmf_titanic.json index 4abeec39..83e9e8b0 100644 --- a/examples/example_gmf_titanic.json +++ b/examples/example_gmf_titanic.json @@ -4,9 +4,9 @@ "provenance": { "created by": { "name": "metasyn", - "version": "0.6.1.dev45+gd3708ea.d20231212" + "version": "0.6.1.dev44+g2ce6998.d20240115" }, - "creation time": "2023-12-12T15:30:02.834410" + "creation time": "2024-01-17T12:11:01.120007" }, "vars": [ { diff --git a/examples/getting_started.ipynb b/examples/getting_started.ipynb index 60363c09..49ebf12a 100644 --- a/examples/getting_started.ipynb +++ b/examples/getting_started.ipynb @@ -42,7 +42,7 @@ "outputs": [], "source": [ "# Run the following line to install metasyn\n", - "%pip 
install metasyn" + "# %pip install metasyn" ] }, { @@ -63,9 +63,11 @@ "outputs": [], "source": [ "# import required packages\n", - "import datetime as dt\n", "import polars as pl\n", - "from metasyn import MetaFrame, demo_file" + "\n", + "from metasyn import MetaFrame, demo_file\n", + "from metasyn.config import VarConfig\n", + "from metasyn.util import DistributionSpec" ] }, { @@ -388,12 +390,10 @@ "outputs": [], "source": [ "# First, we create a specification dictionary for the variables\n", - "var_spec = {\n", - " \"PassengerId\": {\"unique\": True}\n", - "}\n", + "var_spec = [VarConfig(name=\"PassengerId\", dist_spec=DistributionSpec(unique=True))]\n", "\n", "# then, we add that dictionary as the `spec` argument\n", - "mf = MetaFrame.fit_dataframe(df, spec=var_spec)\n", + "mf = MetaFrame.fit_dataframe(df, var_specs=var_spec)\n", "\n", "# then, let's check what the metadata about PassengerId contains!\n", "mf[\"PassengerId\"].to_dict()" @@ -447,12 +447,12 @@ "# First, we create a specification dictionary for the variables\n", "from metasyn.distribution import FakerDistribution\n", "\n", - "var_spec = {\n", - " \"PassengerId\": {\"unique\": True}, \n", - " \"Name\": {\"distribution\": FakerDistribution(\"name\")}\n", - "}\n", + "var_specs = [\n", + " VarConfig(name=\"PassengerId\", dist_spec=DistributionSpec(unique=True)),\n", + " VarConfig(name=\"Name\", dist_spec=FakerDistribution(\"name\")),\n", + "]\n", "\n", - "mf = MetaFrame.fit_dataframe(df, spec=var_spec)\n", + "mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)\n", "mf.synthesize(5)" ] }, @@ -485,14 +485,14 @@ "source": [ "from metasyn.distribution import DiscreteUniformDistribution\n", "\n", - "var_spec = {\n", - " \"PassengerId\": {\"unique\": True}, \n", - " \"Name\": {\"distribution\": FakerDistribution(\"name\")},\n", - " \"Fare\": {\"distribution\": \"LogNormalDistribution\"}, # estimate / fit an exponential distribution based on the data\n", - " \"Age\": {\"distribution\": 
DiscreteUniformDistribution(20, 40)} # fully specify a distribution for age (uniform between 20 and 40)\n", -"}\n", + "var_specs = [\n", + " VarConfig(name=\"PassengerId\", dist_spec=DistributionSpec(unique=True)),\n", + " VarConfig(name=\"Name\", dist_spec=FakerDistribution(\"name\")),\n", + " VarConfig(name=\"Fare\", dist_spec=\"LogNormalDistribution\"), # estimate / fit a log-normal distribution based on the data\n", + " VarConfig(name=\"Age\", dist_spec=DiscreteUniformDistribution(20, 40)) # fully specify a distribution for age (uniform between 20 and 40)\n", + "]\n", "\n", - "mf = MetaFrame.fit_dataframe(df, spec=var_spec)\n", + "mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)\n", "mf.synthesize(5)" ] }, @@ -521,15 +521,15 @@ "cabin_distribution = RegexDistribution(r\"[A-F][0-9]{2,3}\") # Add the r so that it becomes a literal string.\n", "# just for completeness: data generated from this distribution will always match the regex [A-F]?(\\d{2,3})?\n", "\n", - "var_spec = {\n", - " \"PassengerId\": {\"unique\": True}, \n", - " \"Name\": {\"distribution\": FakerDistribution(\"name\")},\n", - " \"Fare\": {\"distribution\": \"ExponentialDistribution\"}, # estimate / fit an exponential distribution based on the data\n", - " \"Age\": {\"distribution\": DiscreteUniformDistribution(20, 40)}, # fully specify a distribution for age (uniform between 20 and 40)\n", - " \"Cabin\": {\"distribution\": cabin_distribution}\n", - "}\n", + "var_specs = [\n", + " VarConfig(name=\"PassengerId\", dist_spec=DistributionSpec(unique=True)),\n", + " VarConfig(name=\"Name\", dist_spec=FakerDistribution(\"name\")),\n", + " VarConfig(name=\"Fare\", dist_spec=\"LogNormalDistribution\"), # estimate / fit a log-normal distribution based on the data\n", + " VarConfig(name=\"Age\", dist_spec=DiscreteUniformDistribution(20, 40)), # fully specify a distribution for age (uniform between 20 and 40)\n", + " VarConfig(name=\"Cabin\", dist_spec=cabin_distribution), # Use the regex
distribution for the cabin\n", + "]\n", "\n", - "mf = MetaFrame.fit_dataframe(df, spec=var_spec)\n", + "mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)\n", "mf.synthesize(10)" ] }, @@ -628,15 +628,24 @@ }, "outputs": [], "source": [ - "var_spec = {\n", - " \"PassengerId\": {\"unique\": True}, \n", - " \"Name\": {\"distribution\": FakerDistribution(\"name\")},\n", - " \"Fare\": {\"distribution\": \"ExponentialDistribution\"}, # estimate / fit an exponential distribution based on the data\n", - " \"Age\": {\"distribution\": DiscreteUniformDistribution(20, 40)}, # fully specify a distribution for age (uniform between 20 and 40)\n", - " \"Cabin\": {\"distribution\": cabin_distribution, \"description\": \"The cabin number of the passenger.\"},\n", - "}\n", + "var_specs = [\n", + " # Ensure unique values for the `PassengerId` column\n", + " VarConfig(name=\"PassengerId\", dist_spec=DistributionSpec(unique=True)),\n", + "\n", + " # Utilize the Faker library to synthesize realistic names for the `Name` column\n", + " VarConfig(name=\"Name\", dist_spec=FakerDistribution(\"name\")),\n", + "\n", + " # Fit `Fare` to a log-normal distribution, but base the parameters on the data\n", + " VarConfig(name=\"Fare\", dist_spec=\"LogNormalDistribution\"),\n", + "\n", + " # Set the `Age` column to a discrete uniform distribution ranging from 20 to 40\n", + " VarConfig(name=\"Age\", dist_spec=DiscreteUniformDistribution(20, 40)),\n", + "\n", + " # Use a regex-based distribution to generate `Cabin` values following [A-F][0-9]{2,3}\n", + " VarConfig(name=\"Cabin\", dist_spec=cabin_distribution, description=\"The cabin number of the passenger.\"),\n", + "]\n", "\n", - "mf = MetaFrame.fit_dataframe(df, spec=var_spec) " + "mf = MetaFrame.fit_dataframe(df, var_specs=var_specs) " ] }, { @@ -769,7 +778,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.5" + "version": "3.11.6" }, "vscode": { "interpreter": { diff --git
a/metasyn/__main__.py b/metasyn/__main__.py index fa3d136d..18f4f9eb 100644 --- a/metasyn/__main__.py +++ b/metasyn/__main__.py @@ -8,7 +8,6 @@ import pathlib import pickle import sys -from configparser import ConfigParser try: # Python < 3.10 (backport) from importlib_metadata import entry_points, version @@ -18,6 +17,7 @@ import polars as pl from metasyn import MetaFrame +from metasyn.config import MetaConfig from metasyn.validation import create_schema MAIN_HELP_MESSAGE = f""" @@ -56,38 +56,11 @@ def main() -> None: elif subcommand == "create-meta": create_metadata() - else: print(f"Invalid subcommand ({subcommand}). For help see metasyn --help") sys.exit(1) -def _parse_config(config_fp): - config = ConfigParser() - config.read(config_fp) - spec = {} - for section in config.sections(): - if section.startswith("var."): - new_dict = {} - for key, val in dict(config[section]).items(): - try: - new_dict[key] = config.getboolean(section, key) - except ValueError: - pass - try: - new_dict[key] = config.getfloat(section, key) - except ValueError: - pass - try: - new_dict[key] = config.getint(section, key) - except ValueError: - pass - if key not in new_dict: - new_dict[key] = val - spec[section[4:]] = new_dict - return spec - - def create_metadata(): """Program to create and export metadata from a DataFrame to a GMF file (.json).""" parser = argparse.ArgumentParser( @@ -95,28 +68,35 @@ def create_metadata(): description="Create a Generative Metadata Format file from a CSV file.", ) parser.add_argument( - "input", + "--input", help="input file; a CSV file that you want to synthesize later.", type=pathlib.Path, + default=None, ) parser.add_argument( - "output", + "--output", help="output file: .json", type=pathlib.Path, + default=None, ) parser.add_argument( "--config", - help="Configuration file to specify distribution behavior.", + help="Configuration file (*.toml) to specify distribution behavior.", type=pathlib.Path, default=None, ) + args, _ = 
parser.parse_known_args() if args.config is not None: - spec = _parse_config(args.config) + meta_config = MetaConfig.from_toml(args.config) + else: + meta_config = None + + if args.input is None: + meta_frame = MetaFrame.from_config(meta_config) else: - spec = {} - data_frame = pl.read_csv(args.input, try_parse_dates=True) - meta_frame = MetaFrame.fit_dataframe(data_frame, spec=spec) + data_frame = pl.read_csv(args.input, try_parse_dates=True) + meta_frame = MetaFrame.fit_dataframe(data_frame, meta_config) meta_frame.export(args.output) diff --git a/metasyn/config.py b/metasyn/config.py new file mode 100644 index 00000000..aff515e4 --- /dev/null +++ b/metasyn/config.py @@ -0,0 +1,175 @@ +"""Module defining configuration classes for creating MetaFrames.""" +from __future__ import annotations + +from pathlib import Path +from typing import Iterable, Optional, Union + +try: + import tomllib +except ImportError: + import tomli as tomllib # type: ignore # noqa + +from metasyn.privacy import BasePrivacy, get_privacy +from metasyn.provider import DistributionProviderList +from metasyn.util import VarConfig + + +class MetaConfig(): + """Configuration class for creating MetaFrames. + + This class is used to create, manipulate, and retrieve configurations for + individual variables in a MetaFrame. It also provides methods for loading + configurations from .toml files and converting them to dictionaries. + + Parameters + ---------- + var_configs: + List of configurations for individual variables. The order does not + matter for variables that are found in the DataFrame, but in the case + of variables that are data-free, the order is also the order of columns + for the eventual synthesized dataframe. See the VarConfigAccess class on + how the dictionary can be constructed. + dist_providers: + Distribution providers to use when fitting distributions to variables. + Can be a string, provider, or provider type. 
+ privacy: + Privacy method/level to use as a default setting for the privacy. Can be + overridden in the var_config for a particular column. + n_rows: + Number of rows for synthesization at a later stage. Can be unspecified by + leaving the value at None. + """ + + def __init__( + self, + var_configs: Union[list[dict], list[VarConfig]], + dist_providers: Union[DistributionProviderList, list[str], str], + privacy: Union[BasePrivacy, dict], + n_rows: Optional[int] = None): + self.var_configs = [self._parse_var_config(v) for v in var_configs] + + if not isinstance(dist_providers, DistributionProviderList): + dist_providers = DistributionProviderList(dist_providers) + self.dist_providers = dist_providers + + if not isinstance(privacy, BasePrivacy): + privacy = get_privacy(**privacy) + + self.privacy = privacy + self.n_rows = n_rows + + @staticmethod + def _parse_var_config(var_cfg): + if isinstance(var_cfg, VarConfig): + return var_cfg + return VarConfig.from_dict(var_cfg) + + @classmethod + def from_toml(cls, config_fp: Union[str, Path]) -> MetaConfig: + """Create a MetaConfig class from a .toml file. + + Parameters + ---------- + config_fp: + Path to the file containing the configuration. + + Returns + ------- + meta_config: + A fully initialized MetaConfig instance. + """ + with open(config_fp, "rb") as handle: + config_dict = tomllib.load(handle) + general = config_dict.get("general", {}) + var_list = config_dict.pop("var", []) + n_rows = general.pop("n_rows", None) + dist_providers = general.pop("dist_providers", ["builtin"]) + privacy = general.pop("privacy", {"name": "none", "parameters": {}}) + if len(general) > 0: + raise ValueError(f"Error parsing configuration file '{config_fp}'." + f" Unknown keys detected: '{list(general)}'") + return cls(var_list, dist_providers, privacy, n_rows=n_rows) + + def to_dict(self) -> dict: + """Convert the configuration to a dictionary. + + Returns + ------- + config_dict: + Configuration in dictionary form. 
+ """ + return { + "general": { + "privacy": self.privacy, + "dist_providers": self.dist_providers, + }, + "var": self.var_configs + } + + def get(self, name: str) -> VarConfigAccess: + """Create a VarConfigAccess object pointing to a var with that name. + + If the variable does not exist, then a new variable config is created that + has the default values. + + Parameters + ---------- + name: + Name of the variable configuration to retrieve. + + Returns + ------- + var_cfg: + A variable config access object. + """ + for var_cfg in self.var_configs: + if var_cfg.name == name: + return VarConfigAccess(var_cfg, self) + return VarConfigAccess(VarConfig(name=name), self) + + def iter_var(self, exclude: Optional[list[str]] = None) -> Iterable[VarConfigAccess]: + """Iterate over all variables in the configuration. + + Parameters + ---------- + exclude: + Exclude variables with names in that list. + + Returns + ------- + var_cfg: + VarConfigAccess class for each of the available variable configurations. + """ + exclude = exclude if exclude is not None else [] + for var_spec in self.var_configs: + if var_spec.name not in exclude: + yield VarConfigAccess(var_spec, self) + + +class VarConfigAccess(): # pylint: disable=too-few-public-methods + """Access for variable configuration object. + + They take into account what the defaults are from the MetaConfig object. + Otherwise they pass through all the attributes as normal and thus behave + exactly as a variable config object themselves. + + Parameters + ---------- + var_config + The variable configuration to access. + meta_config + The meta configuration instance to get default values from. 
+ """ + + def __init__(self, var_config: VarConfig, meta_config: MetaConfig): + self.var_config = var_config + self.meta_config = meta_config + + def __getattribute__(self, attr): + if attr == "privacy": + if self.var_config.privacy is None: + return self.meta_config.privacy + return self.var_config.privacy + if attr not in ("var_config", "meta_config") and hasattr(self.var_config, attr): + return getattr(self.var_config, attr) + return super().__getattribute__(attr) diff --git a/metasyn/distribution/base.py b/metasyn/distribution/base.py index 88956b6f..138deca1 100644 --- a/metasyn/distribution/base.py +++ b/metasyn/distribution/base.py @@ -18,11 +18,12 @@ class attributes of a distribution. from abc import ABC, abstractmethod from copy import deepcopy -from typing import Iterable, Optional, Sequence, Union +from typing import Optional, Union import numpy as np import pandas as pd import polars as pl +from numpy import typing as npt class BaseDistribution(ABC): @@ -40,7 +41,7 @@ class BaseDistribution(ABC): version: str = "1.0" @classmethod - def fit(cls, series: Union[Sequence, pl.Series], + def fit(cls, series: Union[pd.Series, pl.Series, npt.NDArray], *args, **kwargs) -> BaseDistribution: """Fit the distribution to the series. @@ -54,13 +55,13 @@ def fit(cls, series: Union[Sequence, pl.Series], BaseDistribution: Fitted distribution. 
""" - pd_series = cls._to_series(series) - if len(pd_series) == 0: + pl_series = cls._to_series(series) + if len(pl_series) == 0: return cls.default_distribution() - return cls._fit(pd_series, *args, **kwargs) + return cls._fit(pl_series, *args, **kwargs) @staticmethod - def _to_series(values: Union[Sequence, pl.Series, pd.Series]): + def _to_series(values: Union[npt.NDArray, pl.Series, pd.Series]) -> pl.Series: if isinstance(values, pl.Series): series = values.drop_nulls() elif isinstance(values, pd.Series): @@ -144,7 +145,7 @@ def from_dict(cls, dist_dict: dict) -> BaseDistribution: """Create a distribution from a dictionary.""" return cls(**dist_dict["parameters"]) - def information_criterion(self, values: Iterable) -> float: # pylint: disable=unused-argument + def information_criterion(self, values: Union[pd.Series, pl.Series, npt.NDArray]) -> float: # pylint: disable=unused-argument """Get the BIC value for a particular set of values. Parameters diff --git a/metasyn/distribution/categorical.py b/metasyn/distribution/categorical.py index ffb169a5..8776936a 100644 --- a/metasyn/distribution/categorical.py +++ b/metasyn/distribution/categorical.py @@ -59,7 +59,7 @@ def draw(self): return np.random.choice(self.labels, p=self.probs) def information_criterion(self, - values: Union[pd.Series, pl.Series, npt.NDArray[np.str_]] + values: Union[pd.Series, pl.Series, npt.NDArray] ) -> float: series = self._to_series(values) labels, counts = np.unique(series, return_counts=True) diff --git a/metasyn/distribution/continuous.py b/metasyn/distribution/continuous.py index 36796e16..ed5f4ba2 100644 --- a/metasyn/distribution/continuous.py +++ b/metasyn/distribution/continuous.py @@ -17,29 +17,29 @@ class UniformDistribution(ScipyDistribution): Parameters ---------- - min_val: float + low: float Lower bound for uniform distribution. - max_val: float + high: float Upper bound for uniform distribution. 
""" dist_class = uniform - def __init__(self, min_val: float, max_val: float): - self.par = {"min_val": min_val, "max_val": max_val} - self.dist = uniform(loc=self.min_val, scale=max(self.max_val-self.min_val, 1e-8)) + def __init__(self, low: float, high: float): + self.par = {"low": low, "high": high} + self.dist = uniform(loc=self.low, scale=max(self.high-self.low, 1e-8)) @classmethod def _fit(cls, values): return cls(values.min(), values.max()) def _information_criterion(self, values): - if np.any(np.array(values) < self.min_val) or np.any(np.array(values) > self.max_val): + if np.any(np.array(values) < self.low) or np.any(np.array(values) > self.high): return np.log(len(values))*self.n_par + 100*len(values) - if np.fabs(self.max_val-self.min_val) < 1e-8: + if np.fabs(self.high-self.low) < 1e-8: return np.log(len(values))*self.n_par - 100*len(values) return (np.log(len(values))*self.n_par - - 2*len(values)*np.log((self.max_val-self.min_val)**-1)) + - 2*len(values)*np.log((self.high-self.low)**-1)) @classmethod def default_distribution(cls): @@ -48,8 +48,8 @@ def default_distribution(cls): @classmethod def _param_schema(cls): return { - "min_val": {"type": "number"}, - "max_val": {"type": "number"}, + "low": {"type": "number"}, + "high": {"type": "number"}, } diff --git a/metasyn/distribution/faker.py b/metasyn/distribution/faker.py index 75ba0665..7d4b870b 100644 --- a/metasyn/distribution/faker.py +++ b/metasyn/distribution/faker.py @@ -159,7 +159,8 @@ def draw(self): def information_criterion(self, values) -> float: series = self._to_series(values) # Check the average number of characters - if series.str.len_chars().mean() >= 25: + avg_chars = series.str.len_chars().mean() + if avg_chars is not None and avg_chars >= 25: lang = self.detect_language(series) if lang is not None: return -1.0 diff --git a/metasyn/metaframe.py b/metasyn/metaframe.py index 186c2615..59d95c1a 100644 --- a/metasyn/metaframe.py +++ b/metasyn/metaframe.py @@ -4,17 +4,17 @@ import json 
import pathlib -from copy import deepcopy from datetime import datetime from importlib.metadata import version from typing import Any, Dict, List, Optional, Sequence, Union import numpy as np +import pandas as pd import polars as pl from tqdm import tqdm -from metasyn.privacy import BasePrivacy, BasicPrivacy -from metasyn.provider import BaseDistributionProvider +from metasyn.config import MetaConfig +from metasyn.privacy import BasePrivacy from metasyn.validation import validate_gmf_dict from metasyn.var import MetaVar @@ -53,11 +53,11 @@ def n_columns(self) -> int: @classmethod def fit_dataframe( cls, - df: pl.DataFrame, - spec: Optional[dict[str, dict]] = None, - dist_providers: Union[str, list[str], BaseDistributionProvider, - list[BaseDistributionProvider]] = "builtin", - privacy: Optional[BasePrivacy] = None, + df: Optional[Union[pl.DataFrame, pd.DataFrame]], + meta_config: Optional[MetaConfig] = None, + var_specs: Optional[list[dict]] = None, + dist_providers: Optional[list[str]] = None, + privacy: Optional[Union[BasePrivacy, dict]] = None, progress_bar: bool = True): """Create a metasyn object from a polars (or pandas) dataframe. @@ -68,7 +68,9 @@ def fit_dataframe( ---------- df: Polars dataframe with the correct column dtypes. - spec: + meta_config: + Column specification in MetaConfig format. + var_specs: Column specifications to modify the defaults. For each of the columns additional directives can be supplied here. There are 3 different directives currently supported: @@ -120,46 +122,70 @@ def fit_dataframe( MetaFrame: Initialized metasyn metaframe. 
""" - if privacy is None: - privacy = BasicPrivacy() - if spec is None: - spec = {} + if meta_config is None: + if privacy is None: + privacy = {"name": "none"} + elif isinstance(privacy, BasePrivacy): + privacy = privacy.to_dict() + var_specs = [] if var_specs is None else var_specs + dist_providers = dist_providers if dist_providers is not None else ["builtin"] + meta_config = MetaConfig(var_specs, dist_providers, privacy) else: - spec = deepcopy(spec) + assert privacy is None - if set(list(spec)) - set(df.columns): - raise ValueError( - "Argument 'spec' includes the specifications for column names that do " - "not exist in the supplied dataframe:" - f" '{set(list(spec)) - set(df.columns)}'") + if isinstance(df, pd.DataFrame): + df = pl.DataFrame(df) all_vars = [] - for col_name in tqdm(df.columns, disable=not progress_bar): - series = df[col_name] - col_spec = spec.get(col_name, {}) - dist = col_spec.pop("distribution", None) - unq = col_spec.pop("unique", None) - description = col_spec.pop("description", None) - prop_missing = col_spec.pop("prop_missing", None) - cur_privacy = col_spec.pop("privacy", privacy) - fit_kwargs = col_spec.pop("fit_kwargs", {}) - if len(col_spec) != 0: + columns = df.columns if df is not None else [] + if df is not None: + for col_name in tqdm(columns, disable=not progress_bar): + var_spec = meta_config.get(col_name) + var = MetaVar.fit( + df[col_name], + var_spec.dist_spec, + meta_config.dist_providers, + var_spec.privacy, + var_spec.prop_missing, + var_spec.description) + all_vars.append(var) + + # Data free columns to be appended + for var_spec in meta_config.iter_var(exclude=columns): + if not var_spec.data_free: raise ValueError( - f"Unknown spec items '{col_spec}' for variable '{col_name}'.") - var = MetaVar.detect( - series, - description=description, - prop_missing=prop_missing) - var.fit( - dist=dist, - dist_providers=dist_providers, - unique=unq, - privacy=cur_privacy, - fit_kwargs=fit_kwargs) - + f"Column with name 
'{var_spec.name}' not found and not declared as " + "data_free.") + distribution = meta_config.dist_providers.create(var_spec) + var = MetaVar( + var_spec.name, + var_spec.var_type, + distribution, + description=var_spec.description, + prop_missing=var_spec.prop_missing, + ) all_vars.append(var) - + if df is None: + if meta_config.n_rows is None: + raise ValueError("Please provide the number of rows in the configuration, " + "or supply a DataFrame.") + return cls(all_vars, meta_config.n_rows) return cls(all_vars, len(df)) + @classmethod + def from_config(cls, meta_config: MetaConfig) -> MetaFrame: + """Create a MetaFrame using a configuration, but without a DataFrame. + + Parameters + ---------- + meta_config + Configuration to be used for creating the new MetaFrame. + + Returns + ------- + A created MetaFrame. + """ + return cls.fit_dataframe(None, meta_config) + def to_dict(self) -> Dict[str, Any]: """Create dictionary with the properties for recreation.""" return { @@ -219,7 +245,7 @@ def descriptions( for i_desc, new_desc in enumerate(new_descriptions): self[i_desc].description = new_desc - def export(self, fp: Union[pathlib.Path, str], + def export(self, fp: Optional[Union[pathlib.Path, str]], validate: bool = True) -> None: """Serialize and export the MetaFrame to a JSON file, following the GMF format. 
@@ -236,8 +262,11 @@ def export(self, fp: Union[pathlib.Path, str], self_dict = _jsonify(self.to_dict()) if validate: validate_gmf_dict(self_dict) - with open(fp, "w", encoding="utf-8") as f: - json.dump(self_dict, f, indent=4) + if fp is None: + print(json.dumps(self_dict, indent=4)) + else: + with open(fp, "w", encoding="utf-8") as f: + json.dump(self_dict, f, indent=4) def to_json(self, fp: Union[pathlib.Path, str], validate: bool = True) -> None: diff --git a/metasyn/privacy.py b/metasyn/privacy.py index d9b9a481..ba355cbc 100644 --- a/metasyn/privacy.py +++ b/metasyn/privacy.py @@ -1,7 +1,12 @@ """Module with privacy classes to be used for creating GMF files.""" from abc import ABC, abstractmethod -from typing import Type, Union +from typing import Optional, Type, Union + +try: + from importlib_metadata import entry_points +except ImportError: + from importlib.metadata import entry_points # type: ignore from metasyn.distribution.base import BaseDistribution @@ -52,3 +57,33 @@ class BasicPrivacy(BasePrivacy): def to_dict(self) -> dict: return BasePrivacy.to_dict(self) + + +def get_privacy(name: str, parameters: Optional[dict] = None) -> BasePrivacy: + """Create a new privacy object using a name and parameters. + + Parameters + ---------- + name + Name of the privacy type, use "none" for no specific type of privacy. + parameters, optional + The parameters for the privacy type. This could be the epsilon for differential + privacy or n_avg for disclosure control, by default None. + + Returns + ------- + A new instantiated object for privacy. + + Raises + ------ + KeyError + If the name of the privacy type cannot be found. + """ + parameters = parameters if parameters is not None else {} + for entry in entry_points(group="metasyn.privacy"): + if name == entry.name: + return entry.load()(**parameters) + privacy_names = [entry.name for entry in entry_points(group="metasyn.privacy")] + raise KeyError(f"Unknown privacy type with name '{name}'. 
" + "Ensure that you have installed the privacy package." + f"Available privacy names: {privacy_names}.") diff --git a/metasyn/provider.py b/metasyn/provider.py index 4dbe3885..d1070959 100644 --- a/metasyn/provider.py +++ b/metasyn/provider.py @@ -6,10 +6,9 @@ from __future__ import annotations -import inspect import warnings from abc import ABC -from typing import Any, List, Optional, Type, Union +from typing import TYPE_CHECKING, Any, List, Optional, Type, Union try: from importlib_metadata import EntryPoint, entry_points @@ -57,7 +56,10 @@ from metasyn.distribution.na import NADistribution from metasyn.distribution.regex import RegexDistribution, UniqueRegexDistribution from metasyn.privacy import BasePrivacy, BasicPrivacy +from metasyn.util import DistributionSpec +if TYPE_CHECKING: + from metasyn.config import VarConfig, VarConfigAccess class BaseDistributionProvider(ABC): """Class that encapsulates a set of distributions. @@ -167,6 +169,7 @@ class DistributionProviderList(): def __init__( self, dist_providers: Union[ + list[str], None, str, type[BaseDistributionProvider], BaseDistributionProvider, list[Union[str, type[BaseDistributionProvider], BaseDistributionProvider]]]): if dist_providers is None: @@ -188,10 +191,8 @@ def __init__( def fit(self, series: pl.Series, var_type: str, - dist: Optional[Union[str, BaseDistribution, type]] = None, - privacy: BasePrivacy = BasicPrivacy(), - unique: Optional[bool] = None, - fit_kwargs: Optional[dict] = None): + dist_spec: DistributionSpec, + privacy: BasePrivacy = BasicPrivacy()): """Fit a distribution to a column/series. Parameters @@ -200,29 +201,39 @@ def fit(self, series: pl.Series, The data to fit the distributions to. var_type: The variable type of the data. - dist: + dist_spec: Distribution to fit. If not supplied or None, the information criterion will be used to determine which distribution is the most suitable. 
For most variable types, the information criterion is based on the BIC (Bayesian Information Criterion). privacy: Level of privacy that will be used in the fit. - unique: - Whether the distribution should be unique or not. - fit_kwargs: - Extra options for distributions during the fitting stage. """ - if fit_kwargs is None: - fit_kwargs = {} - if dist is not None: - unique = unique if unique else False - return self._fit_distribution(series, dist, var_type, privacy, - unique=unique, **fit_kwargs) - if len(fit_kwargs) > 0: - raise ValueError(f"Got fit arguments for variable '{series.name}', but no " - "distribution. Set the distribution manually to fix.") + if dist_spec.implements is not None: + return self._fit_distribution(series, dist_spec, var_type, privacy) + unique = dist_spec.unique if dist_spec.unique is True else False return self._find_best_fit(series, var_type, unique, privacy) + def create(self, var_cfg: Union[VarConfig, VarConfigAccess]) -> BaseDistribution: + """Create a distribution without any data. + + Parameters + ---------- + var_cfg + A variable configuration that provides all the information to create the distribution. + + Returns + ------- + A distribution according to the variable specifications.
+ """ + dist_spec = var_cfg.dist_spec + unique = dist_spec.unique if dist_spec.unique else False + assert dist_spec.implements is not None and var_cfg.var_type is not None + dist_class = self.find_distribution( + dist_spec.implements, var_cfg.var_type, + privacy=BasicPrivacy(), unique=unique) + return dist_class(**dist_spec.parameters) + def _find_best_fit(self, series: pl.Series, var_type: str, unique: Optional[bool], privacy: BasePrivacy) -> BaseDistribution: @@ -355,49 +366,41 @@ def find_distribution(self, # pylint: disable=too-many-branches return all_dist[i_max] def _fit_distribution(self, series: pl.Series, - dist: Union[str, Type[BaseDistribution], BaseDistribution], + dist_spec: DistributionSpec, var_type: str, - privacy: BasePrivacy, - unique: bool = False, - **fit_kwargs) -> BaseDistribution: + privacy: BasePrivacy) -> BaseDistribution: """Fit a specific distribution to a series. In contrast the fit method, this needs a supplied distribution(type). Parameters ---------- - dist: + series: + Series to fit the distribution to. + dist_spec: Distribution to fit (if it is not already fitted). var_type: Type of variable to fit the distribution for. - series: - Series to fit the distribution to. privacy: Privacy level to fit the distribution with. - unique: - Whether the distribution to be fit is unique. - fit_kwargs: - Extra keyword arguments to modify the way the distribution is fit. Returns ------- BaseDistribution: Fitted distribution. 
""" - dist_instance = None - if isinstance(dist, BaseDistribution): - return dist - - if isinstance(dist, str): - dist_class = self.find_distribution(dist, var_type, privacy=privacy, unique=unique) - elif inspect.isclass(dist) and issubclass(dist, BaseDistribution): - dist_class = dist - else: - raise TypeError( - f"Distribution {dist} with type {type(dist)} is not a BaseDistribution") + unique = dist_spec.unique + unique = unique if unique else False + assert dist_spec.implements is not None + dist_class = self.find_distribution(dist_spec.implements, var_type, privacy=privacy, + unique=unique) + if dist_spec.parameters is not None: + return dist_class(**dist_spec.parameters) + if issubclass(dist_class, NADistribution): dist_instance = dist_class.default_distribution() else: + fit_kwargs = dist_spec.fit_kwargs dist_instance = dist_class.fit(series, **privacy.fit_kwargs, **fit_kwargs) return dist_instance diff --git a/metasyn/testutils.py b/metasyn/testutils.py index 31db7836..43601bbf 100644 --- a/metasyn/testutils.py +++ b/metasyn/testutils.py @@ -30,7 +30,6 @@ def check_distribution_provider(provider_name: str): Name of the provider to be tested. """ provider = get_distribution_provider(provider_name) - print(type(provider)) assert isinstance(provider, BaseDistributionProvider) assert len(provider.distributions) > 0 assert all(issubclass(dist, BaseDistribution) for dist in provider.distributions) diff --git a/metasyn/util.py b/metasyn/util.py new file mode 100644 index 00000000..27b5553b --- /dev/null +++ b/metasyn/util.py @@ -0,0 +1,141 @@ +"""Utility module for metasyn. + +This module provides utility classes that are used across metasyn, +including classes for specifying distributions and storing variable +configurations. 
+""" +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Optional, Union + +from metasyn.distribution.base import BaseDistribution +from metasyn.privacy import BasePrivacy, get_privacy + + +@dataclass +class DistributionSpec(): + """Specification that determines which distribution is selected. + + It has the following attributes: + - implements: Which distribution is chosen. + - unique: Whether the distribution should be unique. + - parameters: The parameters of the distribution as defined by implements. + - fit_kwargs: Fitting keyword arguments to be used while fitting the distribution. + - version: Version of the distribution to fit. + """ + + implements: Optional[str] = None + unique: Optional[bool] = None + parameters: Optional[dict] = None + fit_kwargs: dict = field(default_factory=dict) + version: Optional[str] = None + + def __post_init__(self): + if self.implements is None: + if self.version is not None: + raise ValueError("Cannot create DistributionSpec with attribute 'version' but " + "without attribute 'implements'.") + if self.parameters is not None: + raise ValueError("Cannot create DistributionSpec with attribute 'parameters' but " + "without attribute 'implements'.") + if len(self.fit_kwargs) > 0: + raise ValueError("Cannot create DistributionSpec with attribute 'fit_kwargs' that" + " is not empty but without attribute 'implements'.") + + + @classmethod + def parse(cls, dist_spec: Optional[Union[dict, type[BaseDistribution], BaseDistribution, + DistributionSpec, str]] + ) -> DistributionSpec: + """Create a DistributionSpec instance from a variety of inputs. + + Parameters + ---------- + dist_spec + Specification for the distribution in several types. + + Returns + ------- + A instantiated version of the dist_spec that has the DistributionSpec type. + + Raises + ------ + TypeError + If the input has the wrong type and cannot be parsed. 
+ """ + if isinstance(dist_spec, BaseDistribution): + dist_dict = {key: value for key, value in dist_spec.to_dict().items() + if key in ["implements", "version", "is_unique", "parameters"]} + dist_dict["unique"] = dist_dict.pop("is_unique") + return cls(**dist_dict) + if isinstance(dist_spec, str): + return cls(implements=dist_spec) + if dist_spec is None: + return cls() + if isinstance(dist_spec, dict): + return cls(**dist_spec) + if isinstance(dist_spec, DistributionSpec): + return dist_spec + if issubclass(dist_spec, BaseDistribution): + return cls(implements=dist_spec.implements, unique=dist_spec.is_unique) + raise TypeError("Error parsing distribution specification of unknown type " + f"'{type(dist_spec)}' with value '{dist_spec}'") + + @property + def fully_specified(self) -> bool: + """Indicate whether the distribution is suitable for datafree creation. + + Returns + ------- + A flag that indicates whether a distribution can be generated from the values + that are specified (not None). + """ + return self.implements is not None and self.parameters is not None + +@dataclass +class VarConfig(): + """Data class for storing the configurations for variables. + + It contains the following attributes: + - name: Name of the variable/column. + - dist_spec: DistributionSpec object that determines the distribution. + - privacy: Privacy object that determines which implementation can be used. + - prop_missing: Proportion of missing values. + - description: Description of the variable. + - var_type: Type of the variable in question. + """ + + name: str + dist_spec: DistributionSpec = field(default_factory=DistributionSpec) + privacy: Optional[BasePrivacy] = None + prop_missing: Optional[float] = None + description: Optional[str] = None + data_free: bool = False + var_type: Optional[str] = None + + def __post_init__(self): + # Convert the the privacy attribute if it is a dictionary. 
+ if isinstance(self.privacy, dict): + self.privacy = get_privacy(**self.privacy) + if self.data_free and not self.dist_spec.fully_specified: + raise ValueError("Error creating variable specification: data free variable should have" + f" 'implements' and 'parameters'. {self}") + + @classmethod + def from_dict(cls, var_dict: dict) -> VarConfig: + """Create a variable configuration from a dictionary. + + Parameters + ---------- + var_dict + Dictionary to parse the configuration from. + + Returns + ------- + A new VarConfig instance. + """ + dist_spec = var_dict.pop("distribution", None) + if dist_spec is None: + return cls(**var_dict) + return cls(**var_dict, dist_spec=DistributionSpec.parse(dist_spec)) diff --git a/metasyn/var.py b/metasyn/var.py index bdf9e353..5079cedb 100644 --- a/metasyn/var.py +++ b/metasyn/var.py @@ -11,6 +11,7 @@ from metasyn.distribution.base import BaseDistribution from metasyn.privacy import BasePrivacy, BasicPrivacy from metasyn.provider import BaseDistributionProvider, DistributionProviderList +from metasyn.util import DistributionSpec class MetaVar(): @@ -47,82 +48,23 @@ class MetaVar(): User-provided description of the variable. 
""" - dtype = "unknown" - def __init__(self, # pylint: disable=too-many-arguments + name: str, var_type: str, - series: Optional[Union[pl.Series, pd.Series]] = None, - name: Optional[str] = None, - distribution: Optional[BaseDistribution] = None, - prop_missing: Optional[float] = None, - dtype: Optional[str] = None, - description: Optional[str] = None): + distribution: BaseDistribution, + dtype: str = "unknown", + description: Optional[str] = None, + prop_missing: float = 0.0): + self.name = name self.var_type = var_type - self.prop_missing = prop_missing - if series is None: - self.name = name - if dtype is not None: - self.dtype = dtype - else: - series = _to_polars(series) - self.name = series.name - if prop_missing is None: - self.prop_missing = ( - len(series) - len(series.drop_nulls())) / len(series) - self.dtype = str(series.dtype) - - self.series = series self.distribution = distribution + self.dtype = dtype self.description = description - - if self.prop_missing is None: - raise ValueError(f"Error while initializing variable {self.name}." - " prop_missing is None.") + self.prop_missing = prop_missing if self.prop_missing < -1e-8 or self.prop_missing > 1+1e-8: raise ValueError(f"Cannot create variable '{self.name}' with proportion missing " "outside range [0, 1]") - @classmethod - def detect(cls, - series_or_dataframe: Union[pd.Series, - pl.Series, - pl.DataFrame], - description: Optional[str] = None, - prop_missing: Optional[float] = None): - """Detect variable class(es) of series or dataframe. - - This method does not fit any distribution, but it does infer the - correct types for the MetaVar and saves the Series for later fitting. - - Parameters - ---------- - series_or_dataframe: pd.Series or pd.Dataframe - If the variable is a pandas Series, then find the correct - variable type and create an instance of that variable. - If a Dataframe is supplied instead, a list of of variables is - returned: one for each column in the dataframe. 
- description: - User description of the variable. - prop_missing: - Proportion of the values missing. If None, detect it from the series. - Otherwise prop_missing should be a float between 0 and 1. - - Returns - ------- - MetaVar: - It returns a meta data variable of the correct type. - """ - if isinstance(series_or_dataframe, (pl.DataFrame, pd.DataFrame)): - if isinstance(series_or_dataframe, pd.DataFrame): - return [MetaVar.detect(series_or_dataframe[col]) - for col in series_or_dataframe] - return [MetaVar.detect(series) for series in series_or_dataframe] - - series = _to_polars(series_or_dataframe) - var_type = cls.get_var_type(series) - - return cls(var_type, series, description=description, prop_missing=prop_missing) - @staticmethod def get_var_type(series: pl.Series) -> str: """Convert polars dtype to metasyn variable type. @@ -197,13 +139,14 @@ def __str__(self) -> str: f'- Distribution:\n{distribution_formatted}\n' ) - def fit(self, # pylint: disable=too-many-arguments - dist: Optional[Union[str, BaseDistribution, type]] = None, - dist_providers: Union[str, type, - BaseDistributionProvider] = "builtin", + @classmethod + def fit(cls, # pylint: disable=too-many-arguments + series: Union[pl.Series, pd.Series], + dist_spec: Optional[Union[dict, type, BaseDistribution, DistributionSpec]] = None, + provider_list: DistributionProviderList = DistributionProviderList("builtin"), privacy: BasePrivacy = BasicPrivacy(), - unique: Optional[bool] = None, - fit_kwargs: Optional[dict] = None): + prop_missing: Optional[float] = None, + description: Optional[str] = None) -> MetaVar: """Fit distributions to the data. If multiple distributions are available for the current data type, @@ -214,35 +157,34 @@ def fit(self, # pylint: disable=too-many-arguments Parameters ---------- - dist: + series: + Data series to fit a distribution to. + dist_spec: The distribution to fit. In case of a string, search for it using the aliases of all distributions. 
Otherwise use the supplied distribution (class). Examples of allowed strings are: "normal", "uniform", "faker.city.nl_NL". If not supplied, fit the best available distribution for the variable type. - dist_providers: + provider_list: Distribution providers that are used for fitting. privacy: Privacy level to use for fitting the series. - unique: - Whether the variable should be unique. If not supplied, it will be - inferred from the data. - fit_kwargs: - Extra options for distributions during the fitting stage. + prop_missing: + Proportion of the values missing, default None. + description: + Description for the variable. """ - if self.series is None: - raise ValueError("Cannot fit distribution if we don't have the" - "original data.") - - provider_list = DistributionProviderList(dist_providers) - self.distribution = provider_list.fit( - self.series, self.var_type, dist, privacy, unique, fit_kwargs) + series = _to_polars(series) + var_type = cls.get_var_type(series) + dist_spec = DistributionSpec.parse(dist_spec) + distribution = provider_list.fit(series, var_type, dist_spec, privacy) + if prop_missing is None: + prop_missing = (len(series) - len(series.drop_nulls())) / len(series) + return cls(series.name, var_type, distribution=distribution, dtype=str(series.dtype), + description=description, prop_missing=prop_missing) def draw(self) -> Any: """Draw a random item for the variable in whatever type is required.""" - if self.distribution is None: - raise ValueError("Cannot draw without distribution") - # Return NA's -> None if self.prop_missing is not None and np.random.rand() < self.prop_missing: return None @@ -261,8 +203,6 @@ def draw_series(self, n: int) -> pl.Series: pandas.Series: Pandas series with the synthetic data. 
""" - if not isinstance(self.distribution, BaseDistribution): - raise ValueError("Cannot draw without distribution.") self.distribution.draw_reset() value_list = [self.draw() for _ in range(n)] if "Categorical" in self.dtype: @@ -294,10 +234,11 @@ def from_dict(cls, provider_list = DistributionProviderList(distribution_providers) dist = provider_list.from_dict(var_dict) return cls( - var_dict["type"], name=var_dict["name"], + var_type=var_dict["type"], distribution=dist, - prop_missing=var_dict["prop_missing"], dtype=var_dict["dtype"], + prop_missing=var_dict["prop_missing"], + dtype=var_dict["dtype"], description=var_dict.get("description", None) ) diff --git a/pyproject.toml b/pyproject.toml index 6713087e..d621f1f3 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -38,6 +38,7 @@ dependencies = [ "jsonschema", "importlib-metadata;python_version<'3.10'", "importlib-resources;python_version<'3.9'", + "tomli;python_version<'3.11'", "wget", "regexmodel>=0.2.1" ] @@ -62,6 +63,9 @@ metasyn = "metasyn.__main__:main" [project.entry-points."metasyn.distribution_provider"] builtin = "metasyn.provider:BuiltinDistributionProvider" +[project.entry-points."metasyn.privacy"] +none = "metasyn.privacy:BasicPrivacy" + [tool.setuptools] packages = ["metasyn"] obsoletes = ["metasynth"] @@ -79,6 +83,7 @@ module = [ "importlib_resources.*", "wget.*", "lingua.*", + "tomllib.*", ] ignore_missing_imports = true diff --git a/tests/data/example_config.toml b/tests/data/example_config.toml new file mode 100644 index 00000000..dab20a81 --- /dev/null +++ b/tests/data/example_config.toml @@ -0,0 +1,28 @@ +# Example toml file as input for metasyn + +[general] +dist_providers = ["builtin"] + + +[[var]] +name = "PassengerId" +distribution = {unique = true} # Notice lower capitalization for .toml files. + +[[var]] +name = "Name" +prop_missing = 0.1 +description = "Name of the unfortunate passenger of the titanic." 
+distribution = {implements = "core.faker", parameters = {faker_type = "name", locale = "en_US"}} + +[[var]] +name = "Fare" +distribution = {implements = "core.log_normal"} + +[[var]] +name = "Age" +distribution = {implements = "core.uniform", parameters = {low = 20, high = 40}} + + +[[var]] +name = "Cabin" +distribution = {implements = "core.regex", parameters = {regex_data = "[A-F][0-9]{2,3}"}} diff --git a/tests/data/no_data_config.toml b/tests/data/no_data_config.toml new file mode 100644 index 00000000..fedc49d4 --- /dev/null +++ b/tests/data/no_data_config.toml @@ -0,0 +1,31 @@ +# Example toml file as input for metasyn + +[general] +n_rows = 100 + + +[[var]] + +name = "PassengerId" +data_free = true +prop_missing = 0.0 +description = "ID of the unfortunate passenger." +var_type = "discrete" +distribution = {implements = "core.unique_key", unique = true, parameters = {consecutive = 1, low = 0}} + + +[[var]] + +name = "Name" +data_free = true +prop_missing = 0.1 +description = "Name of the unfortunate passenger of the titanic." 
+var_type = "string" +distribution = {implements = "core.faker", parameters = {faker_type = "name", locale = "en_US"}} + +[[var]] +name = "Cabin" +data_free = true +prop_missing = 0.2 +var_type = "string" +distribution = {implements = "core.regex", parameters = {regex_data = "[A-F][0-9]{2,3}"}} diff --git a/tests/test_builtin.py b/tests/test_builtin.py index 7f7ba9c5..7d489cef 100644 --- a/tests/test_builtin.py +++ b/tests/test_builtin.py @@ -1,9 +1,9 @@ from pytest import mark, raises +from metasyn.distribution import UniformDistribution +from metasyn.privacy import BasicPrivacy from metasyn.provider import get_distribution_provider from metasyn.testutils import check_distribution, check_distribution_provider -from metasyn.privacy import BasicPrivacy -from metasyn.distribution import UniformDistribution def test_builtin_provider(): diff --git a/tests/test_cli.py b/tests/test_cli.py index 0f58fe21..bac467e5 100644 --- a/tests/test_cli.py +++ b/tests/test_cli.py @@ -1,11 +1,12 @@ import json -import sys import subprocess +import sys from pathlib import Path import jsonschema -from pytest import mark, fixture import polars as pl +from pytest import fixture, mark + from metasyn import MetaFrame from metasyn.validation import validate_gmf_dict @@ -35,17 +36,20 @@ def tmp_dir(tmp_path_factory) -> Path: "Fare": float } data_frame = pl.read_csv(csv_fp, dtypes=csv_dt)[:100] - meta_frame = MetaFrame.fit_dataframe(data_frame, spec={"PassengerId": {"unique": True}}) + meta_frame = MetaFrame.fit_dataframe(data_frame, var_specs=[{"name": "PassengerId", "distribution": {"unique": True}}]) meta_frame.to_json(json_path) config_fp = TMP_DIR_PATH / "config.ini" with open(config_fp, "w") as handle: handle.write(""" -[var.PassengerId] -unique = True - -[var.Fare] -distribution=LogNormalDistribution -prop_missing=0.2""") +[[var]] +name = "PassengerId" +distribution = {unique = true} + +[[var]] +name = "Fare" +prop_missing = 0.2 +distribution = {implements = "lognormal"} +""") return 
TMP_DIR_PATH @@ -82,8 +86,10 @@ def test_create_meta(tmp_dir, config): Path(sys.executable).resolve(), # the python executable Path("metasyn", "__main__.py"), # the cli script "create-meta", # the subcommand - Path("tests", "data", "titanic.csv"), # the input file - out_file # the output file + "--input", + Path("tests", "data", "titanic.csv"), # the input file + "--output", + out_file # the output file ] if config: cmd.extend(["--config", Path(tmp_dir) / 'config.ini']) @@ -125,3 +131,31 @@ def test_schema_gen(tmp_dir): cmd.append("non-existent-plugin") result = subprocess.run(cmd, check=False, capture_output=True) assert result.returncode != 0 + + +def test_datafree(tmp_dir): + gmf_fp = tmp_dir / "gmf_out.json" + syn_fp = tmp_dir / "test_out.csv" + cmd = [ + Path(sys.executable).resolve(), # the python executable + Path("metasyn", "__main__.py"), # the cli script + "create-meta", # the subcommand + "--output", gmf_fp, # the output file + "--config", Path("tests", "data", "no_data_config.toml") + ] + result = subprocess.run(cmd, check=False, capture_output=True) + assert result.returncode == 0 + meta_frame = MetaFrame.from_json(gmf_fp) + assert meta_frame.n_rows == 100 + assert len(meta_frame.meta_vars) == 3 + cmd2 = [ + Path(sys.executable).resolve(), # the python executable + Path("metasyn", "__main__.py"), # the cli script + "synthesize", + gmf_fp, syn_fp + ] + result = subprocess.run(cmd2, check=False, capture_output=True) + assert result.returncode == 0 + df = pl.read_csv(syn_fp) + assert list(df.columns) == ["PassengerId", "Name", "Cabin"] + assert len(df) == 100 diff --git a/tests/test_config.py b/tests/test_config.py new file mode 100644 index 00000000..37b0bd96 --- /dev/null +++ b/tests/test_config.py @@ -0,0 +1,77 @@ +from pathlib import Path + +import pytest +from pytest import mark + +from metasyn.config import MetaConfig, VarConfigAccess +from metasyn.distribution import UniformDistribution +from metasyn.privacy import BasePrivacy +from metasyn.util 
import DistributionSpec, VarConfig + + +@mark.parametrize( + "input,error", + [ + ("uniform", False), + (None, False), + ({"implements": {"uniform": False}}, False), + (UniformDistribution, False), + (UniformDistribution(0, 2), False), + (DistributionSpec(), False), + ({"fit_kwargs": {"param": 3}}, True), + ({"version": "2.0"}, True), + ({"parameters": {"param": 2}}, True), + (1, True), + ] +) +def test_dist_spec(input, error): + if error: + with pytest.raises(Exception): + DistributionSpec.parse(input) + else: + dist_spec = DistributionSpec.parse(input) + assert isinstance(dist_spec, DistributionSpec) + +def test_var_config(): + var_cfg = VarConfig("test", privacy={"name": "none", "parameters": {}}) + assert var_cfg.name == "test" + assert isinstance(var_cfg.privacy, BasePrivacy) + with pytest.raises(ValueError): + var_cfg = VarConfig("test", data_free=True) + var_cfg = VarConfig("test") + assert isinstance(var_cfg.dist_spec, DistributionSpec) + var_cfg = VarConfig.from_dict({"name": "test"}) + assert var_cfg.name == "test" + var_cfg = VarConfig.from_dict({"name": "test", "distribution": "uniform"}) + assert var_cfg.dist_spec.implements == "uniform" + + +def test_meta_config_datafree(): + meta_config = MetaConfig.from_toml(Path("tests", "data", "no_data_config.toml")) + assert meta_config.n_rows == 100 + assert len(meta_config.var_configs) == 3 + assert isinstance(meta_config.var_configs[0], VarConfig) + assert meta_config.var_configs[0].privacy is None + assert isinstance(meta_config.var_configs[0].dist_spec, DistributionSpec) + assert isinstance(meta_config.to_dict(), dict) + var_cfg = meta_config.get("PassengerId") + assert isinstance(var_cfg, VarConfigAccess) + print(var_cfg.var_config) + assert var_cfg.data_free is True + var_cfg = meta_config.get("unknown") + assert var_cfg.name == "unknown" + + all_var_cfg = list(meta_config.iter_var()) + assert len(all_var_cfg) == 3 + assert isinstance(all_var_cfg[0], VarConfigAccess) + assert all_var_cfg[0].meta_config == 
meta_config + assert len(list(meta_config.iter_var(exclude=["PassengerId"]))) == 2 + + +def test_meta_config(): + meta_config = MetaConfig.from_toml(Path("tests", "data", "example_config.toml")) + assert len(meta_config.var_configs) == 5 + var_cfg = meta_config.get("Cabin") + assert var_cfg.data_free is False + assert var_cfg.var_type is None + assert var_cfg.dist_spec.implements == "core.regex" diff --git a/tests/test_continuous.py b/tests/test_continuous.py index 09b5a541..d24db5b3 100644 --- a/tests/test_continuous.py +++ b/tests/test_continuous.py @@ -23,8 +23,8 @@ def test_uniform(lower_bound, upper_bound): scale = upper_bound-lower_bound values = stats.uniform(loc=lower_bound, scale=scale).rvs(100) dist = UniformDistribution.fit(values) - assert dist.min_val <= values.min() - assert dist.max_val >= values.max() + assert dist.low <= values.min() + assert dist.high >= values.max() assert dist.information_criterion(values) < 2*np.log(len(values)) - 200*np.log((upper_bound-lower_bound)**-1) assert isinstance(dist.draw(), float) diff --git a/tests/test_dataset.py b/tests/test_dataset.py index d379cbb0..1278fd5d 100644 --- a/tests/test_dataset.py +++ b/tests/test_dataset.py @@ -1,15 +1,14 @@ -from random import random from pathlib import Path +from random import random -import pytest import pandas as pd import polars as pl +import pytest from pytest import mark from metasyn.metaframe import MetaFrame -from metasyn.var import MetaVar from metasyn.provider import get_distribution_provider - +from metasyn.var import MetaVar dtypes = { "PassengerId": "int", @@ -43,12 +42,12 @@ def test_dataset(tmp_path, dataframe_lib): df = _read_csv(titanic_fp, dataframe_lib) dataset = MetaFrame.fit_dataframe( df, - spec={ - "Name": {"prop_missing": 0.5}, - "Ticket": {"description": "test_description"}, - "Fare": {"distribution": "normal"}, - "PassengerId": {"unique": True}, - }) + var_specs=[ + {"name": "Name", "prop_missing": 0.5}, + {"name": "Ticket", "description": 
"test_description"}, + {"name": "Fare", "distribution": {"implements": "normal"}}, + {"name": "PassengerId", "distribution": {"unique": True}}, + ]) def check_dataset(dataset): assert dataset.n_columns == 12 @@ -102,7 +101,7 @@ def check_dataset(dataset): # Check whether non-columns raise an error with pytest.raises(ValueError): - dataset = MetaFrame.fit_dataframe(df, spec={"unicorn": {"prop_missing": 0.5}}) + dataset = MetaFrame.fit_dataframe(df, var_specs=[{"name": "unicorn", "prop_missing": 0.5}]) def test_distributions(tmp_path): @@ -111,7 +110,7 @@ def test_distributions(tmp_path): provider = get_distribution_provider() for var_type in provider.all_var_types: for dist in provider.get_dist_list(var_type): - var = MetaVar(var_type, name="None", distribution=dist.default_distribution(), + var = MetaVar(name="None", var_type=var_type, distribution=dist.default_distribution(), prop_missing=random()) dataset = MetaFrame([var], n_rows=10) dataset.to_json(tmp_fp) diff --git a/tests/test_datetime.py b/tests/test_datetime.py index f56dd0aa..92c93c5d 100644 --- a/tests/test_datetime.py +++ b/tests/test_datetime.py @@ -1,12 +1,15 @@ import datetime as dt -from pytest import mark +import numpy as np import pandas as pd import polars as pl -import numpy as np +from pytest import mark -from metasyn.distribution.datetime import DateUniformDistribution, DateTimeUniformDistribution -from metasyn.distribution.datetime import TimeUniformDistribution +from metasyn.distribution.datetime import ( + DateTimeUniformDistribution, + DateUniformDistribution, + TimeUniformDistribution, +) all_precision = ["microseconds", "seconds", "minutes", "hours"] start = ["10", ""] diff --git a/tests/test_demo.py b/tests/test_demo.py index ad16e112..011a2216 100644 --- a/tests/test_demo.py +++ b/tests/test_demo.py @@ -1,7 +1,9 @@ from pathlib import Path -from pytest import mark, raises -from metasyn.demo.dataset import demo_file, create_titanic_demo + import polars as pl +from pytest import mark, 
raises + +from metasyn.demo.dataset import create_titanic_demo, demo_file @mark.parametrize("dataset", ["titanic"]) diff --git a/tests/test_distribution.py b/tests/test_distribution.py index 9afaab9b..41778836 100644 --- a/tests/test_distribution.py +++ b/tests/test_distribution.py @@ -1,11 +1,10 @@ from pytest import mark, raises from metasyn.distribution.categorical import MultinoulliDistribution -from metasyn.distribution.continuous import UniformDistribution,\ - NormalDistribution +from metasyn.distribution.continuous import NormalDistribution, UniformDistribution +from metasyn.distribution.discrete import DiscreteNormalDistribution, DiscreteUniformDistribution from metasyn.distribution.faker import FakerDistribution from metasyn.distribution.regex import RegexDistribution, UniqueRegexDistribution -from metasyn.distribution.discrete import DiscreteUniformDistribution, DiscreteNormalDistribution from metasyn.provider import DistributionProviderList diff --git a/tests/test_privacy.py b/tests/test_privacy.py index e0e99503..5b7d455d 100644 --- a/tests/test_privacy.py +++ b/tests/test_privacy.py @@ -1,6 +1,6 @@ -from metasyn.privacy import BasicPrivacy from metasyn.distribution import MultinoulliDistribution from metasyn.distribution.base import metadist +from metasyn.privacy import BasicPrivacy @metadist(privacy="test") diff --git a/tests/test_provider.py b/tests/test_provider.py index 05cc9b09..562d6700 100644 --- a/tests/test_provider.py +++ b/tests/test_provider.py @@ -1,8 +1,10 @@ -from pytest import mark import pytest -from metasyn.provider import DistributionProviderList, BuiltinDistributionProvider +from pytest import mark + +from metasyn.distribution import MultinoulliDistribution, UniformDistribution from metasyn.distribution.base import BaseDistribution, metadist -from metasyn.distribution import UniformDistribution, MultinoulliDistribution +from metasyn.provider import BuiltinDistributionProvider, DistributionProviderList + @mark.parametrize("input", 
["builtin", "fake-name", BuiltinDistributionProvider, BuiltinDistributionProvider()]) @@ -30,7 +32,7 @@ class UniformTest2(UniformDistribution): pass -class TestProvider(BuiltinDistributionProvider): +class CheckProvider(BuiltinDistributionProvider): distributions = [UniformTest2] legacy_distributions = [UniformTest1, UniformTest11] @@ -41,7 +43,7 @@ class LegacyOnly(BuiltinDistributionProvider): def test_legacy(): - plist = DistributionProviderList(TestProvider) + plist = DistributionProviderList(CheckProvider) assert issubclass(plist.find_distribution("core.uniform", var_type="continuous"), UniformTest2) with pytest.warns(): assert issubclass(plist.find_distribution("core.uniform", var_type="continuous", version="1.1"), UniformTest11) @@ -56,4 +58,4 @@ def test_legacy(): plist = DistributionProviderList(LegacyOnly) with pytest.warns(): - assert issubclass(plist.find_distribution("core.uniform", var_type="continuous"), UniformTest2) \ No newline at end of file + assert issubclass(plist.find_distribution("core.uniform", var_type="continuous"), UniformTest2) diff --git a/tests/test_var.py b/tests/test_var.py index eca0296e..9a1dc91c 100644 --- a/tests/test_var.py +++ b/tests/test_var.py @@ -1,19 +1,24 @@ +import datetime as dt import json +import numpy as np import pandas as pd import polars as pl -import numpy as np -from pytest import mark, raises import pytest +from pytest import mark, raises -from metasyn.var import MetaVar -from metasyn.distribution import NormalDistribution, RegexDistribution, UniqueRegexDistribution -from metasyn.distribution import DiscreteUniformDistribution -from metasyn.distribution import UniformDistribution -from metasyn.metaframe import _jsonify -from metasyn.distribution.discrete import UniqueKeyDistribution -from metasyn.distribution.continuous import TruncatedNormalDistribution +from metasyn.distribution import ( + DiscreteUniformDistribution, + NormalDistribution, + RegexDistribution, + UniformDistribution, + 
UniqueRegexDistribution,
+)
 from metasyn.distribution.categorical import MultinoulliDistribution
+from metasyn.distribution.continuous import TruncatedNormalDistribution
+from metasyn.distribution.discrete import UniqueKeyDistribution
+from metasyn.metaframe import _jsonify
+from metasyn.var import MetaVar
 
 
 def _series_drop_nans(series):
@@ -41,13 +46,8 @@ def check_similar(series_a, series_b):
     assert (len(series_a)-len(_series_drop_nans(series_a)) > 0) == (len(series_b) - len(_series_drop_nans(series_b)) > 0)
 
     assert isinstance(series, (pd.Series, pl.Series))
-    var = MetaVar.detect(series)
-    assert isinstance(str(var), str)
-    assert "Proportion of Missing Values" in str(var)
-    with raises(ValueError):
-        var.draw_series(100)
-    var.fit()
+    var = MetaVar.fit(series)
     new_series = var.draw_series(len(series))
     check_similar(series, new_series)
     assert var.var_type == var_type
@@ -66,8 +66,8 @@
     newer_series = new_var.draw_series(len(series))
     check_similar(newer_series, series)
 
-    with raises(ValueError):
-        new_var.fit()
+    # Refitting is no longer needed: MetaVar.fit is now a classmethod and
+    # a variable restored from a dictionary can draw values immediately.
 
     assert type(new_var) == type(var)
     assert new_var.dtype == var.dtype
@@ -159,31 +159,31 @@ def test_bool(tmp_path, series_type):
     [-1, -0.1, 1.2],
 )
 def test_invalid_prop(prop_missing):
+    # Constructing a MetaVar with an explicit distribution is now valid.
+    MetaVar("test", "discrete", DiscreteUniformDistribution.default_distribution())
     with raises(ValueError):
-        MetaVar("continuous")
-    with raises(ValueError):
-        MetaVar("continuous", prop_missing=prop_missing)
+        MetaVar("test", "discrete", DiscreteUniformDistribution.default_distribution(),
+                prop_missing=prop_missing)
 
 
 @mark.parametrize(
-    "dataframe",
+    "series,var_type",
     [
-        pd.DataFrame({
-            "int": pd.Series([np.random.randint(0, 10) for _ in range(100)]),
-            "float": pd.Series([np.random.rand() for _ in range(100)])
-        }),
-        pl.DataFrame({
-            "int": [np.random.randint(0, 10) for _ in range(100)],
-            "float": [np.random.rand() for _ in range(100)]
-        })
+        (pl.Series([1, 
2, 3]), "discrete"),
+        (pl.Series([1.0, 2.0, 3.0]), "continuous"),
+        (pl.Series(["1", "2", "3"]), "string"),
+        (pl.Series(["1", "2", "3"], dtype=pl.Categorical), "categorical"),
+        (pl.Series([dt.time.fromisoformat("10:38:12"), dt.time.fromisoformat("12:52:11")]),
+         "time"),
+        (pl.Series([dt.datetime.fromisoformat("2022-07-15T10:39:36"),
+                    dt.datetime.fromisoformat("2022-08-15T10:39:36")]),
+         "datetime"),
+        (pl.Series([dt.date.fromisoformat("1903-07-15"), dt.date.fromisoformat("1940-07-16")]),
+         "date"),
     ]
 )
-def test_dataframe(dataframe):
-    variables = MetaVar.detect(dataframe)
-    assert len(variables) == 2
-    assert isinstance(variables, list)
-    assert variables[0].var_type == "discrete"
-    assert variables[1].var_type == "continuous"
+def test_get_var_type(series, var_type):
+    assert MetaVar.get_var_type(series) == var_type
 
 
 @mark.parametrize(
@@ -192,17 +192,16 @@ def test_dataframe(dataframe):
     pd.Series([np.random.rand() for _ in range(5000)]),
     pl.Series([np.random.rand() for _ in range(5000)])]
 )
 def test_manual_fit(series):
-    var = MetaVar.detect(series)
-    var.fit()
+    var = MetaVar.fit(series)
     assert isinstance(var.distribution, (UniformDistribution, TruncatedNormalDistribution))
-    var.fit("normal")
+    var = MetaVar.fit(series, dist_spec={"implements": "normal"})
     assert isinstance(var.distribution, NormalDistribution)
-    var.fit(UniformDistribution)
+    var = MetaVar.fit(series, dist_spec=UniformDistribution)
     assert isinstance(var.distribution, UniformDistribution)
-    var.fit(NormalDistribution(0, 1))
+    var = MetaVar.fit(series, dist_spec=NormalDistribution(0, 1))
     assert isinstance(var.distribution, NormalDistribution)
-    with raises(TypeError):
-        var.fit(10)
+    # The old in-place var.fit(10) TypeError check is gone: fitting now
+    # goes through the MetaVar.fit classmethod with a dist_spec argument.
 
 
 @mark.parametrize(
@@ -211,8 +210,7 @@
     pl.Series([None for _ in range(10)])]
 )
 def test_na_zero(series):
-    var = MetaVar.detect(series)
-    var.fit()
+    var = MetaVar.fit(series)
     assert var.var_type == "continuous"
     assert var.prop_missing == 1.0
@@ -223,8 +221,7 @@ 
pl.Series([None if i != 0 else 1.0 for i in range(10)])]
 )
 def test_na_one(series):
-    var = MetaVar.detect(series)
-    var.fit()
+    var = MetaVar.fit(series)
     assert var.var_type == "continuous"
     assert abs(var.prop_missing-0.9) < 1e7
@@ -235,8 +232,7 @@
     pl.Series([None if i < 2 else 0.123*i for i in range(10)])]
 )
 def test_na_two(series):
-    var = MetaVar.detect(series)
-    var.fit()
+    var = MetaVar.fit(series)
     assert var.var_type == "continuous"
     assert abs(var.prop_missing-0.8) < 1e7
@@ -247,10 +243,9 @@
     pl.Series(np.random.randint(0, 100000, size=1000))]
 )
 def test_manual_unique_integer(series):
-    var = MetaVar.detect(series)
-    var.fit()
+    var = MetaVar.fit(series)
     assert isinstance(var.distribution, DiscreteUniformDistribution)
-    var.fit(unique=True)
+    var = MetaVar.fit(series, dist_spec={"unique": True})
     assert isinstance(var.distribution, UniqueKeyDistribution)
@@ -263,10 +258,9 @@
 )
 def test_manual_unique_string(series):
     # series = pd.Series(["x213", "2dh2", "4k2kk"])
-    var = MetaVar.detect(series)
-    var.fit()
+    var = MetaVar.fit(series)
     assert isinstance(var.distribution, RegexDistribution)
-    var.fit(unique=True)
+    var = MetaVar.fit(series, dist_spec={"unique": True})
     assert isinstance(var.distribution, UniqueRegexDistribution)
@@ -280,8 +274,7 @@
     ]
 )
 def test_int_multinoulli(series):
-    var = MetaVar.detect(series)
-    var.fit()
+    var = MetaVar.fit(series)
     assert isinstance(var.distribution, MultinoulliDistribution)
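
A recurring change across the ``test_dataset.py`` hunks above is the move from the ``spec`` keyword (a mapping from column name to options) to ``var_specs`` (a list of dicts carrying an explicit ``"name"`` key). For callers porting older code, the translation can be sketched as follows; ``spec_to_var_specs`` is a hypothetical helper written for illustration, not part of the metasyn API:

```python
# Hypothetical helper: convert the old-style ``spec`` mapping used before
# this change into the new-style ``var_specs`` list of dicts.
def spec_to_var_specs(spec: dict) -> list:
    """Flatten a {column_name: options} mapping into a var_specs list."""
    return [{"name": name, **options} for name, options in spec.items()]


old_spec = {
    "Name": {"prop_missing": 0.5},
    "Fare": {"distribution": {"implements": "normal"}},
}
var_specs = spec_to_var_specs(old_spec)
print(var_specs[0])  # {'name': 'Name', 'prop_missing': 0.5}
```

Because each entry now names its own column, the same list format also works for columns that do not exist yet (the data-free configs exercised in ``test_meta_config_datafree``), which a dict keyed by existing columns could not express as cleanly.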