Add get_default_configuration and defaults yaml #4

mattkappel · 2023-03-09T02:28:41Z


Description

Category: feature
JIRA issue: MIC-3860
Research reference: https://vivarium-research.readthedocs.io/en/latest/models/concept_models/vivarium_census_synthdata/concept_model.html#noise-functions

Changes

Adds a get_default_configuration utility function that returns a ConfigTree object
Adds a yaml for holding defaults. It is partially complete and will be updated with RT feedback and additional development.

Testing

Imported the utility, ran it, and got the expected ConfigTree:

import pandas as pd
from pathlib import Path
from pseudopeople import utilities

utilities.get_default_configuration().to_dict()

gave output:

{'decennial_census': {'omission': 0.0145,
  'duplication': 0.05,
  'first_name': {'nickname': {'row_noise_level': 0.01},
   'fake_names': {'row_noise_level': 0.01},
   'missing_data': {'row_noise_level': 0.01},
   'phonetic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1}},
  'age': {'missing_data': {'row_noise_level': 0.01},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'age_miswriting': {'row_noise_level': 0.01, 'age_miswriting': [1, -1]}},
  'zipcode': {'missing_data': {'row_noise_level': 0.01},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'zipcode_miswriting': {'row_noise_level': 0.01,
    'zipcode_miswriting': [0.04, 0.04, 0.2, 0.36, 0.36]}}},
 'american_communities_survey': {'omission': 0.0145,
  'duplication': 0.05,
  'first_name': {'nickname': {'row_noise_level': 0.01},
   'fake_names': {'row_noise_level': 0.01},
   'missing_data': {'row_noise_level': 0.01},
   'phonetic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1}},
  'age': {'missing_data': {'row_noise_level': 0.01},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'age_miswriting': {'row_noise_level': 0.01, 'age_miswriting': [1, -1]}},
  'zipcode': {'missing_data': {'row_noise_level': 0.01},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'zipcode_miswriting': {'row_noise_level': 0.01,
    'zipcode_miswriting': [0.04, 0.04, 0.2, 0.36, 0.36]}}},
 'current_population_survey': {'omission': 0.2905,
  'duplication': 0.05,
  'first_name': {'nickname': {'row_noise_level': 0.01},
   'fake_names': {'row_noise_level': 0.01},
   'missing_data': {'row_noise_level': 0.01},
   'phonetic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1}},
  'age': {'missing_data': {'row_noise_level': 0.01},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'age_miswriting': {'row_noise_level': 0.01, 'age_miswriting': [1, -1]}},
  'zipcode': {'missing_data': {'row_noise_level': 0.01},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'zipcode_miswriting': {'row_noise_level': 0.01,
    'zipcode_miswriting': [0.04, 0.04, 0.2, 0.36, 0.36]}}},
 'women_infants_and_children': {'omission': 0.0,
  'duplication': 0.05,
  'first_name': {'nickname': {'row_noise_level': 0.01},
   'fake_names': {'row_noise_level': 0.01},
   'missing_data': {'row_noise_level': 0.01},
   'phonetic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1}},
  'age': {'missing_data': {'row_noise_level': 0.01},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'age_miswriting': {'row_noise_level': 0.01, 'age_miswriting': [1, -1]}},
  'zipcode': {'missing_data': {'row_noise_level': 0.01},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'zipcode_miswriting': {'row_noise_level': 0.01,
    'zipcode_miswriting': [0.04, 0.04, 0.2, 0.36, 0.36]}}}}

ramittal

Should we not create a unit test from the "testing" that was done?

stevebachmeier · 2023-03-09T17:30:50Z

@ramittal there's a separate ticket for building out the testing framework https://jira.ihme.washington.edu/browse/MIC-3862

stevebachmeier · 2023-03-09T17:31:26Z

src/pseudopeople/utilities.py

+
+
+def get_default_configuration() -> ConfigTree:
+    import pseudopeople


Why import in this scope?

Because I just need it for the location of the file.

I have added a paths.py that we could add the yaml location to at some point.

stevebachmeier · 2023-03-09T17:32:52Z

src/pseudopeople/utilities.py

+    noising_configuration = ConfigTree(layers=default_config_layers)
+    BASE_DIR = Path(pseudopeople.__file__).resolve().parent
+    yaml_path = BASE_DIR / "default_configuration.yaml"
+    noising_configuration.update(yaml_path, layer="base")


So this update method adds the new keys from the yaml to the base layer? Does it behave just like a dict.update?

Yes, I believe that is what it does, according to the docs. The important thing to note is that you can update at different levels.

Yes, more or less. The layers also matter, since they decide the "true" desired configuration.
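To illustrate the dict.update comparison and why layers matter, here is a minimal sketch of a layered configuration. `LayeredConfig` is a hypothetical toy class, not vivarium's ConfigTree; it handles flat keys only and assumes layers are listed from lowest to highest priority.

```python
from typing import Any, Dict, List


class LayeredConfig:
    """Toy stand-in for a layered configuration (hypothetical, not ConfigTree)."""

    def __init__(self, layers: List[str]) -> None:
        # Assumption: layers are listed from lowest to highest priority.
        self._layers = list(layers)
        self._values: Dict[str, Dict[str, Any]] = {name: {} for name in layers}

    def update(self, data: Dict[str, Any], layer: str) -> None:
        # Behaves like dict.update, but only within the named layer.
        if layer not in self._values:
            raise KeyError(f"unknown layer: {layer}")
        self._values[layer].update(data)

    def to_dict(self) -> Dict[str, Any]:
        # Merge layers in priority order so higher layers win on conflicts.
        merged: Dict[str, Any] = {}
        for name in self._layers:
            merged.update(self._values[name])
        return merged


config = LayeredConfig(layers=["base", "user"])
config.update({"omission": 0.0145, "duplication": 0.05}, layer="base")
config.update({"omission": 0.0}, layer="user")
resolved = config.to_dict()
```

In this sketch the "user" value for `omission` wins over the "base" value, while `duplication` falls through from "base". A later update to "base" would not change that outcome.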

stevebachmeier · 2023-03-09T17:50:29Z

src/pseudopeople/default_configuration.yaml

+    zipcode_miswriting:
+        zipcode:
+            row_noise_level: 0.01
+            zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36]


How are these lists working? Is this the chance of a miswrite per digit?

This is defined in the research docs. The first two digits share one probability and the last two digits share another.
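As a rough illustration of the per-digit reading of that list, the sketch below treats each entry as the probability that the digit in the same position is replaced by a random digit. That interpretation and the `miswrite_zipcode` helper are assumptions for illustration only; the research docs remain the authority on the actual noise function.

```python
import random
from typing import Sequence


def miswrite_zipcode(
    zipcode: str, digit_probabilities: Sequence[float], rng: random.Random
) -> str:
    """Hypothetical helper: noise each digit with its own probability."""
    if len(zipcode) != len(digit_probabilities):
        raise ValueError("need exactly one probability per zipcode digit")
    noised = []
    for digit, probability in zip(zipcode, digit_probabilities):
        if rng.random() < probability:
            # Assumption: a miswritten digit is drawn uniformly from 0-9,
            # so it can occasionally equal the original digit.
            noised.append(str(rng.randrange(10)))
        else:
            noised.append(digit)
    return "".join(noised)


rng = random.Random(0)  # fixed seed so the example is repeatable
example = miswrite_zipcode("98105", [0.04, 0.04, 0.2, 0.36, 0.36], rng)
```

With these defaults, the leading digits are rarely touched while the trailing digits are noised much more often.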

stevebachmeier · 2023-03-09T17:51:06Z

src/pseudopeople/default_configuration.yaml

@@ -0,0 +1,200 @@
+# Default noising configuration


This is very, very long. Do we want a default config per form?

RT discussion needed.

ramittal · 2023-03-09T17:52:45Z

@ramittal there's a separate ticket for building out the testing framework https://jira.ihme.washington.edu/browse/MIC-3862

While the testing framework can come later, moving the ad hoc test code into a unit test should be trivial. It would give us code coverage from the get-go and ensure any unintended changes are caught.
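A minimal sketch of what such a unit test might look like is below. The helper `iter_noise_levels` and the test name are hypothetical, and a small excerpt of the output from the PR description is inlined so the sketch runs without the package installed. In the repository, the dict would instead come from `utilities.get_default_configuration().to_dict()`.

```python
from typing import Any, Dict, Iterator, Tuple

NOISE_LEVEL_KEYS = ("row_noise_level", "token_noise_level")


def iter_noise_levels(
    config: Dict[str, Any], path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], float]]:
    """Hypothetical helper: walk a nested config and yield each noise level."""
    for key, value in config.items():
        if isinstance(value, dict):
            yield from iter_noise_levels(value, path + (key,))
        elif key in NOISE_LEVEL_KEYS:
            yield path + (key,), value


def test_noise_levels_are_probabilities() -> None:
    # Excerpt of the output shown in the PR description, inlined here
    # instead of calling utilities.get_default_configuration().to_dict().
    config = {
        "decennial_census": {
            "omission": 0.0145,
            "duplication": 0.05,
            "first_name": {
                "nickname": {"row_noise_level": 0.01},
                "phonetic": {"row_noise_level": 0.01, "token_noise_level": 0.1},
            },
        }
    }
    levels = list(iter_noise_levels(config))
    assert levels, "expected at least one noise level in the configuration"
    for path, level in levels:
        assert 0.0 <= level <= 1.0, f"{'.'.join(path)} is not a probability: {level}"
```

A check like this would catch a default that drifts outside the range 0 to 1 without pinning every value in the YAML.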

stevebachmeier · 2023-03-09T17:53:33Z

src/pseudopeople/entities.py

@@ -44,6 +47,21 @@ class __NoiseTypes(NamedTuple):
    PHONETIC: ColumnNoiseType = ColumnNoiseType(
        "phonetic", noise_functions.generate_phonetic_errors
    )
+    MISSING_DATA: ColumnNoiseType = ColumnNoiseType(


Why did this get added in this PR?

Because I need to have agreement on strings that are in the YAML.

mattkappel added 2 commits March 8, 2023 18:18

implementation for discussion

112447a

get getter getting

ad67b47

mattkappel requested review from albrja, hussain-jafari, ramittal, rmudambi and stevebachmeier as code owners March 9, 2023 02:28

ramittal reviewed Mar 9, 2023


stevebachmeier reviewed Mar 9, 2023


albrja approved these changes Mar 9, 2023


update per RT feedback key ordering

aba0929

mattkappel merged commit 4d0d501 into main Mar 9, 2023

mattkappel deleted the feature/mic-3860 branch March 9, 2023 22:08


