Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add get_default_configuration and defaults yaml #4

Merged
merged 3 commits into from
Mar 9, 2023

Conversation

mattkappel
Copy link
Contributor

@mattkappel mattkappel commented Mar 9, 2023

Add get_default_configuration and defaults yaml

Description

Changes

  • Adds a get_default_configuration utility function that returns a ConfigTree object
  • Adds a yaml for holding defaults. It is partially complete and will be updated with RT feedback and additional development.

Testing

Imported the utility, executed, and got an expected ConfigTree:

import pandas as pd
from pathlib import Path
from pseudopeople import utilities

utilities.get_default_configuration().to_dict()

gave output:

{'decennial_census': {'omission': 0.0145,
  'duplication': 0.05,
  'first_name': {'nickname': {'row_noise_level': 0.01},
   'fake_names': {'row_noise_level': 0.01},
   'missing_data': {'row_noise_level': 0.01},
   'phonetic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1}},
  'age': {'missing_data': {'row_noise_level': 0.01},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'age_miswriting': {'row_noise_level': 0.01, 'age_miswriting': [1, -1]}},
  'zipcode': {'missing_data': {'row_noise_level': 0.01},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'zipcode_miswriting': {'row_noise_level': 0.01,
    'zipcode_miswriting': [0.04, 0.04, 0.2, 0.36, 0.36]}}},
 'american_communities_survey': {'omission': 0.0145,
  'duplication': 0.05,
  'first_name': {'nickname': {'row_noise_level': 0.01},
   'fake_names': {'row_noise_level': 0.01},
   'missing_data': {'row_noise_level': 0.01},
   'phonetic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1}},
  'age': {'missing_data': {'row_noise_level': 0.01},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'age_miswriting': {'row_noise_level': 0.01, 'age_miswriting': [1, -1]}},
  'zipcode': {'missing_data': {'row_noise_level': 0.01},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'zipcode_miswriting': {'row_noise_level': 0.01,
    'zipcode_miswriting': [0.04, 0.04, 0.2, 0.36, 0.36]}}},
 'current_population_survey': {'omission': 0.2905,
  'duplication': 0.05,
  'first_name': {'nickname': {'row_noise_level': 0.01},
   'fake_names': {'row_noise_level': 0.01},
   'missing_data': {'row_noise_level': 0.01},
   'phonetic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1}},
  'age': {'missing_data': {'row_noise_level': 0.01},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'age_miswriting': {'row_noise_level': 0.01, 'age_miswriting': [1, -1]}},
  'zipcode': {'missing_data': {'row_noise_level': 0.01},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'zipcode_miswriting': {'row_noise_level': 0.01,
    'zipcode_miswriting': [0.04, 0.04, 0.2, 0.36, 0.36]}}},
 'women_infants_and_children': {'omission': 0.0,
  'duplication': 0.05,
  'first_name': {'nickname': {'row_noise_level': 0.01},
   'fake_names': {'row_noise_level': 0.01},
   'missing_data': {'row_noise_level': 0.01},
   'phonetic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1}},
  'age': {'missing_data': {'row_noise_level': 0.01},
   'ocr': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'age_miswriting': {'row_noise_level': 0.01, 'age_miswriting': [1, -1]}},
  'zipcode': {'missing_data': {'row_noise_level': 0.01},
   'typographic': {'row_noise_level': 0.01, 'token_noise_level': 0.1},
   'zipcode_miswriting': {'row_noise_level': 0.01,
    'zipcode_miswriting': [0.04, 0.04, 0.2, 0.36, 0.36]}}}}

Copy link

@ramittal ramittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we not create a unit-test from the "testing" that is done?

@stevebachmeier
Copy link
Contributor

@ramittal there's a separate ticket for building out the testing framework https://jira.ihme.washington.edu/browse/MIC-3862



def get_default_configuration() -> ConfigTree:
import pseudopeople
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why import in this scope?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I just need it for the location of the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a paths.py that we could add the yaml location to at some point.

noising_configuration = ConfigTree(layers=default_config_layers)
BASE_DIR = Path(pseudopeople.__file__).resolve().parent
yaml_path = BASE_DIR / "default_configuration.yaml"
noising_configuration.update(yaml_path, layer="base")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this update method adds the new keys from the yaml to the base layer? Does it behave just like a dict.update?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I believe that is what it is doing according to the docs. I think the important thing to note is you can update at different levels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; more-or-less. You have the layers, which matter in deciding the "true", desired configuration.

zipcode_miswriting:
zipcode:
row_noise_level: 0.01
zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these lists working? Is this the chance of a miswrite per digit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defined in the research docs. First two and last two digits have the same probabilities, respectively.

@@ -0,0 +1,200 @@
# Default noising configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very, very long. Do we want a default config per form?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RT discussion needed.

@ramittal
Copy link

ramittal commented Mar 9, 2023

@ramittal there's a separate ticket for building out the testing framework https://jira.ihme.washington.edu/browse/MIC-3862

While the testing framework can come in later, just moving the ad-hoc test code to a unit-test should be trivial and get us code-coverage from get-go and ensure any un-intended changes are caught.

@@ -44,6 +47,21 @@ class __NoiseTypes(NamedTuple):
PHONETIC: ColumnNoiseType = ColumnNoiseType(
"phonetic", noise_functions.generate_phonetic_errors
)
MISSING_DATA: ColumnNoiseType = ColumnNoiseType(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this get added in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I need to have agreement on strings that are in the YAML.

@mattkappel mattkappel merged commit 4d0d501 into main Mar 9, 2023
@mattkappel mattkappel deleted the feature/mic-3860 branch March 9, 2023 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants