Add get_default_configuration and defaults yaml #4

Merged
merged 3 commits into from
Mar 9, 2023
Changes from 2 commits
200 changes: 200 additions & 0 deletions src/pseudopeople/default_configuration.yaml
@@ -0,0 +1,200 @@
# Default noising configuration
Contributor

This is very, very long. Do we want a default config per form?

Contributor Author

RT discussion needed.

# structure follows:
# Row noise: `form.noise_type`
# Column-wise parameters: `form.noise_type.column.noise_parameter`

decennial_census:
    omission: 0.0145
    duplication: 0.05
    nickname:
        first_name:
            row_noise_level: 0.01
    fake_names:
        first_name:
            row_noise_level: 0.01
    missing_data:
        first_name:
            row_noise_level: 0.01
        age:
            row_noise_level: 0.01
        zipcode:
            row_noise_level: 0.01
    phonetic:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
    ocr:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
        age:
            row_noise_level: 0.01
            token_noise_level: 0.1
    typographic:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
        age:
            row_noise_level: 0.01
            token_noise_level: 0.1
        zipcode:
            row_noise_level: 0.01
            token_noise_level: 0.1
    zipcode_miswriting:
        zipcode:
            row_noise_level: 0.01
            zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36]
    age_miswriting:
        age:
            row_noise_level: 0.01
            age_miswriting: [1, -1]





american_communities_survey:
    omission: 0.0145
    duplication: 0.05
    nickname:
        first_name:
            row_noise_level: 0.01
    fake_names:
        first_name:
            row_noise_level: 0.01
    missing_data:
        first_name:
            row_noise_level: 0.01
        age:
            row_noise_level: 0.01
        zipcode:
            row_noise_level: 0.01
    phonetic:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
    ocr:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
        age:
            row_noise_level: 0.01
            token_noise_level: 0.1
    typographic:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
        age:
            row_noise_level: 0.01
            token_noise_level: 0.1
        zipcode:
            row_noise_level: 0.01
            token_noise_level: 0.1
    zipcode_miswriting:
        zipcode:
            row_noise_level: 0.01
            zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36]
Contributor

How are these lists working? Is this the chance of a miswrite per digit?

Contributor Author

This is defined in the research docs. The first two digits share one probability, and the last two digits share another.
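One way to read the list, as the reviewer's question suggests, is as one miswrite probability per zipcode digit. The sketch below illustrates that reading only; the research docs referenced above are the authority on the real behavior. The helper `miswrite_zipcode` is hypothetical and not part of pseudopeople, and replacing a digit with a uniformly random one is an assumption.

```python
import random
from typing import Sequence

# Per-digit probabilities copied from the default configuration in this PR.
ZIPCODE_MISWRITING = [0.04, 0.04, 0.2, 0.36, 0.36]


def miswrite_zipcode(
    zipcode: str, probabilities: Sequence[float], rng: random.Random
) -> str:
    """Hypothetical helper: independently miswrite each digit of a zipcode.

    Assumption: digit i is replaced with probability probabilities[i].
    The replacement is drawn uniformly from 0-9, so it can happen to
    equal the original digit.
    """
    digits = []
    for digit, probability in zip(zipcode, probabilities):
        if rng.random() < probability:
            digits.append(str(rng.randrange(10)))
        else:
            digits.append(digit)
    return "".join(digits)


# Seeded generator so repeated runs give the same result.
print(miswrite_zipcode("98101", ZIPCODE_MISWRITING, random.Random(0)))
```

Under this reading, the last two digits (0.36 each) are far more likely to be altered than the first two (0.04 each).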

    age_miswriting:
        age:
            row_noise_level: 0.01
            age_miswriting: [1, -1]

current_population_survey:
    omission: 0.2905
    duplication: 0.05
    nickname:
        first_name:
            row_noise_level: 0.01
    fake_names:
        first_name:
            row_noise_level: 0.01
    missing_data:
        first_name:
            row_noise_level: 0.01
        age:
            row_noise_level: 0.01
        zipcode:
            row_noise_level: 0.01
    phonetic:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
    ocr:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
        age:
            row_noise_level: 0.01
            token_noise_level: 0.1
    typographic:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
        age:
            row_noise_level: 0.01
            token_noise_level: 0.1
        zipcode:
            row_noise_level: 0.01
            token_noise_level: 0.1
    zipcode_miswriting:
        zipcode:
            row_noise_level: 0.01
            zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36]
    age_miswriting:
        age:
            row_noise_level: 0.01
            age_miswriting: [1, -1]


women_infants_and_children:
    omission: 0.0
    duplication: 0.05
    nickname:
        first_name:
            row_noise_level: 0.01
    fake_names:
        first_name:
            row_noise_level: 0.01
    missing_data:
        first_name:
            row_noise_level: 0.01
        age:
            row_noise_level: 0.01
        zipcode:
            row_noise_level: 0.01
    phonetic:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
    ocr:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
        age:
            row_noise_level: 0.01
            token_noise_level: 0.1
    typographic:
        first_name:
            row_noise_level: 0.01
            token_noise_level: 0.1
        age:
            row_noise_level: 0.01
            token_noise_level: 0.1
        zipcode:
            row_noise_level: 0.01
            token_noise_level: 0.1
    zipcode_miswriting:
        zipcode:
            row_noise_level: 0.01
            zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36]
    age_miswriting:
        age:
            row_noise_level: 0.01
            age_miswriting: [1, -1]

# TODO: add the rest of observers/forms with RT input
#social_security:
#
#taxes_w2_and_1099:
#
#taxes_1040:
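The key layout described in the YAML header (`form.noise_type` for row noise, `form.noise_type.column.noise_parameter` for column noise) can be read back with a dotted-path lookup. This is a minimal sketch, assuming the file has already been parsed into nested dicts. The `lookup` helper is hypothetical and not part of pseudopeople, and the dict holds only a small excerpt of the values above.

```python
from functools import reduce
from typing import Any, Mapping

# Excerpt of the default configuration, written out as nested dicts.
config = {
    "decennial_census": {
        "omission": 0.0145,
        "typographic": {
            "zipcode": {"row_noise_level": 0.01, "token_noise_level": 0.1},
        },
    },
}


def lookup(tree: Mapping[str, Any], dotted_key: str) -> Any:
    """Hypothetical helper: walk a nested mapping along a dotted path."""
    return reduce(lambda node, key: node[key], dotted_key.split("."), tree)


# Row noise: form.noise_type
omission = lookup(config, "decennial_census.omission")

# Column noise: form.noise_type.column.noise_parameter
token_noise = lookup(
    config, "decennial_census.typographic.zipcode.token_noise_level"
)
```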
18 changes: 18 additions & 0 deletions src/pseudopeople/entities.py
@@ -5,6 +5,7 @@
from pseudopeople.entity_types import ColumnNoiseType, RowNoiseType


# todo: is "form" the right word? Ask RT
class Form(Enum):
    CENSUS = "decennial_census"
    ACS = "american_communities_survey"
@@ -20,7 +21,9 @@ class __Columns(NamedTuple):
    MIDDLE_INITIAL: str = "middle_initial"
    LAST_NAME: str = "last_name"
    STREET_NAME: str = "street_name"
    ZIP_CODE: str = "zipcode"
    CITY: str = "city"
    AGE: str = "age"
    # todo finish filling in columns


@@ -44,6 +47,21 @@ class __NoiseTypes(NamedTuple):
    PHONETIC: ColumnNoiseType = ColumnNoiseType(
        "phonetic", noise_functions.generate_phonetic_errors
    )
    MISSING_DATA: ColumnNoiseType = ColumnNoiseType(
Contributor

why did this get added in this PR?

Contributor Author

Because I need to have agreement on strings that are in the YAML.

        # todo: implement the noise fn
        "missing_data",
        lambda: (_ for _ in ()).throw(NotImplementedError("TBD!")),
    )
    TYPOGRAPHIC: ColumnNoiseType = ColumnNoiseType(
        # todo: implement the noise fn
        "typographic",
        lambda: (_ for _ in ()).throw(NotImplementedError("TBD!")),
    )
    OCR: ColumnNoiseType = ColumnNoiseType(
        # todo: implement the noise fn
        "ocr",
        lambda: (_ for _ in ()).throw(NotImplementedError("TBD!")),
    )


NOISE_TYPES = __NoiseTypes()
16 changes: 16 additions & 0 deletions src/pseudopeople/utilities.py
@@ -1,8 +1,24 @@
from pathlib import Path

import pandas as pd
from vivarium.framework.configuration import ConfigTree, ConfigurationError
from vivarium.framework.randomness import RandomnessStream

from pseudopeople.entities import Form


def get_randomness_stream(form: Form, seed: int) -> RandomnessStream:
    return RandomnessStream(form.value, lambda: pd.Timestamp("2020-04-01"), seed)


def get_default_configuration() -> ConfigTree:
    import pseudopeople
Contributor

why import in this scope?

Contributor Author

Because I just need it for the location of the file.

Contributor

I have added a paths.py that we could add the yaml location to at some point.


    default_config_layers = [
        "base",
    ]
    noising_configuration = ConfigTree(layers=default_config_layers)
    BASE_DIR = Path(pseudopeople.__file__).resolve().parent
    yaml_path = BASE_DIR / "default_configuration.yaml"
    noising_configuration.update(yaml_path, layer="base")
Contributor

So this update method adds the new keys from the yaml to the base layer? Does it behave just like a dict.update?

Contributor

Yes, I believe that is what it is doing, according to the docs. I think the important thing to note is that you can update at different levels.

Contributor Author

Yes, more or less. The layers are what matter in deciding the "true", desired configuration.
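As a rough analogy for how layers decide the final value, the sketch below uses the standard library's `collections.ChainMap`, not vivarium's `ConfigTree`. The `user` layer is hypothetical (this PR defines only `"base"`), and the flat dotted keys are a simplification of the nested configuration.

```python
from collections import ChainMap

# "base" layer: defaults, as loaded from default_configuration.yaml.
base = {
    "decennial_census.omission": 0.0145,
    "decennial_census.duplication": 0.05,
}

# Hypothetical higher-priority layer carrying a single override.
user = {"decennial_census.omission": 0.02}

# ChainMap searches its maps left to right, so `user` wins over `base`.
config = ChainMap(user, base)

omission = config["decennial_census.omission"]        # taken from `user`
duplication = config["decennial_census.duplication"]  # falls back to `base`
```

Unlike a plain `dict.update`, the lower layer keeps its original value, and the lookup picks whichever layer has priority.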

    return noising_configuration