Add get_default_configuration and defaults yaml #4
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
# Default noising configuration | ||
# structure follows: | ||
# Row noise: `form.noise_type` | ||
# Column-wise parameters: `form.noise_type.column.noise_parameter` | ||
|
||
decennial_census: | ||
omission: 0.0145 | ||
duplication: 0.05 | ||
nickname: | ||
first_name: | ||
row_noise_level: 0.01 | ||
fake_names: | ||
first_name: | ||
row_noise_level: 0.01 | ||
missing_data: | ||
first_name: | ||
row_noise_level: 0.01 | ||
age: | ||
row_noise_level: 0.01 | ||
zipcode: | ||
row_noise_level: 0.01 | ||
phonetic: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
ocr: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
age: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
typographic: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
age: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
zipcode: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
zipcode_miswriting: | ||
zipcode: | ||
row_noise_level: 0.01 | ||
zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36] | ||
age_miswriting: | ||
age: | ||
row_noise_level: 0.01 | ||
age_miswriting: [1, -1] | ||
|
||
|
||
|
||
|
||
|
||
american_communities_survey: | ||
omission: 0.0145 | ||
duplication: 0.05 | ||
nickname: | ||
first_name: | ||
row_noise_level: 0.01 | ||
fake_names: | ||
first_name: | ||
row_noise_level: 0.01 | ||
missing_data: | ||
first_name: | ||
row_noise_level: 0.01 | ||
age: | ||
row_noise_level: 0.01 | ||
zipcode: | ||
row_noise_level: 0.01 | ||
phonetic: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
ocr: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
age: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
typographic: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
age: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
zipcode: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
zipcode_miswriting: | ||
zipcode: | ||
row_noise_level: 0.01 | ||
zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36] | ||
Review comment: How do these lists work? Is each value the chance of a miswrite per digit?
Reply: This is defined in the research docs. The first two digits share one probability, and the last two digits share another.
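One way to read the reply above, as a minimal sketch: each of the five list entries is taken as the independent probability that the digit at that position is miswritten. That reading, the helper name `miswrite_zipcode`, and the choice to redraw a miswritten digit uniformly from 0-9 are all assumptions for illustration; the research docs define the actual behavior.

```python
import random
from typing import Sequence

# Per-digit probabilities copied from the YAML above.
ZIPCODE_MISWRITING = [0.04, 0.04, 0.2, 0.36, 0.36]


def miswrite_zipcode(
    zipcode: str, probabilities: Sequence[float], rng: random.Random
) -> str:
    """Hypothetical helper: redraw each digit independently with its own probability."""
    if len(zipcode) != len(probabilities):
        raise ValueError("expected one probability per zipcode digit")
    digits = []
    for digit, probability in zip(zipcode, probabilities):
        if rng.random() < probability:
            # Assumption: a miswritten digit is redrawn uniformly from 0-9,
            # so it can occasionally come out unchanged.
            digits.append(str(rng.randrange(10)))
        else:
            digits.append(digit)
    return "".join(digits)


rng = random.Random(0)  # seeded so repeated runs give the same result
noised = miswrite_zipcode("98101", ZIPCODE_MISWRITING, rng)
```

Under this reading, the last two digits are the most likely to change and the first two the least.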
||
age_miswriting: | ||
age: | ||
row_noise_level: 0.01 | ||
age_miswriting: [1, -1] | ||
|
||
current_population_survey: | ||
omission: 0.2905 | ||
duplication: 0.05 | ||
nickname: | ||
first_name: | ||
row_noise_level: 0.01 | ||
fake_names: | ||
first_name: | ||
row_noise_level: 0.01 | ||
missing_data: | ||
first_name: | ||
row_noise_level: 0.01 | ||
age: | ||
row_noise_level: 0.01 | ||
zipcode: | ||
row_noise_level: 0.01 | ||
phonetic: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
ocr: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
age: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
typographic: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
age: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
zipcode: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
zipcode_miswriting: | ||
zipcode: | ||
row_noise_level: 0.01 | ||
zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36] | ||
age_miswriting: | ||
age: | ||
row_noise_level: 0.01 | ||
age_miswriting: [1, -1] | ||
|
||
|
||
women_infants_and_children: | ||
omission: 0.0 | ||
duplication: 0.05 | ||
nickname: | ||
first_name: | ||
row_noise_level: 0.01 | ||
fake_names: | ||
first_name: | ||
row_noise_level: 0.01 | ||
missing_data: | ||
first_name: | ||
row_noise_level: 0.01 | ||
age: | ||
row_noise_level: 0.01 | ||
zipcode: | ||
row_noise_level: 0.01 | ||
phonetic: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
ocr: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
age: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
typographic: | ||
first_name: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
age: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
zipcode: | ||
row_noise_level: 0.01 | ||
token_noise_level: 0.1 | ||
zipcode_miswriting: | ||
zipcode: | ||
row_noise_level: 0.01 | ||
zipcode_miswriting: [0.04, 0.04, 0.2, 0.36, 0.36] | ||
age_miswriting: | ||
age: | ||
row_noise_level: 0.01 | ||
age_miswriting: [1, -1] | ||
|
||
# TODO: add the rest of observers/forms with RT input | ||
#social_security: | ||
# | ||
#taxes_w2_and_1099: | ||
# | ||
#taxes_1040: |
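The key structure described in the header comment of this file (`form.noise_type` for row noise, `form.noise_type.column.noise_parameter` for column noise) can be read as dotted paths into a nested mapping. The sketch below uses a plain dict holding an excerpt of the defaults; the nesting is inferred from that header comment, and `lookup` is a hypothetical helper, not part of pseudopeople or vivarium.

```python
from functools import reduce
from typing import Any, Mapping

# A small excerpt of the defaults above, written as a nested dict.
config = {
    "decennial_census": {
        "omission": 0.0145,
        "typographic": {
            "zipcode": {"row_noise_level": 0.01, "token_noise_level": 0.1},
        },
    },
}


def lookup(tree: Mapping[str, Any], dotted_path: str) -> Any:
    """Hypothetical helper: follow a dotted path through nested mappings."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), tree)


# Row noise: form.noise_type
row_level = lookup(config, "decennial_census.omission")
# Column noise: form.noise_type.column.noise_parameter
column_level = lookup(
    config, "decennial_census.typographic.zipcode.token_noise_level"
)
```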
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,7 @@ | |
from pseudopeople.entity_types import ColumnNoiseType, RowNoiseType | ||
|
||
|
||
# todo: is "form" the right word? Ask RT | ||
class Form(Enum): | ||
CENSUS = "decennial_census" | ||
ACS = "american_communities_survey" | ||
|
@@ -20,7 +21,9 @@ class __Columns(NamedTuple): | |
MIDDLE_INITIAL: str = "middle_initial" | ||
LAST_NAME: str = "last_name" | ||
STREET_NAME: str = "street_name" | ||
ZIP_CODE: str = "zipcode" | ||
CITY: str = "city" | ||
AGE: str = "age" | ||
# todo finish filling in columns | ||
|
||
|
||
|
@@ -44,6 +47,21 @@ class __NoiseTypes(NamedTuple): | |
PHONETIC: ColumnNoiseType = ColumnNoiseType( | ||
"phonetic", noise_functions.generate_phonetic_errors | ||
) | ||
MISSING_DATA: ColumnNoiseType = ColumnNoiseType( | ||
Review comment: Why did this get added in this PR?
Reply: Because I need the strings here to agree with the ones in the YAML.
||
# todo: implement the noise fn | ||
"missing_data", | ||
lambda: (_ for _ in ()).throw(NotImplementedError("TBD!")), | ||
) | ||
TYPOGRAPHIC: ColumnNoiseType = ColumnNoiseType( | ||
# todo: implement the noise fn | ||
"typographic", | ||
lambda: (_ for _ in ()).throw(NotImplementedError("TBD!")), | ||
) | ||
OCR: ColumnNoiseType = ColumnNoiseType( | ||
# todo: implement the noise fn | ||
"ocr", | ||
lambda: (_ for _ in ()).throw(NotImplementedError("TBD!")), | ||
) | ||
|
||
|
||
NOISE_TYPES = __NoiseTypes() | ||
|
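The placeholder lambdas in this file take no arguments, so calling one with data would fail with a `TypeError` before the intended exception is reached. A possible alternative, sketched here under the assumption that noise functions are eventually called with some arguments (their signature is not shown in this diff), is a small factory that returns a named placeholder. `make_placeholder` is a hypothetical name, not something in pseudopeople.

```python
from typing import Any, Callable, NoReturn


def make_placeholder(noise_name: str) -> Callable[..., NoReturn]:
    """Hypothetical factory: build a stand-in for a noise function not yet written."""

    def placeholder(*args: Any, **kwargs: Any) -> NoReturn:
        # NotImplementedError is the exception class; the NotImplemented
        # singleton is a comparison sentinel and cannot be called or raised.
        raise NotImplementedError(
            f"Noise function for '{noise_name}' is not implemented yet"
        )

    return placeholder


ocr_placeholder = make_placeholder("ocr")
try:
    ocr_placeholder("some column data")
except NotImplementedError as error:
    message = str(error)
```

Accepting `*args` and `**kwargs` means the placeholder fails with the same exception however it is called.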
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,8 +1,24 @@ | ||
from pathlib import Path | ||
|
||
import pandas as pd | ||
from vivarium.framework.configuration import ConfigTree, ConfigurationError | ||
from vivarium.framework.randomness import RandomnessStream | ||
|
||
from pseudopeople.entities import Form | ||
|
||
|
||
def get_randomness_stream(form: Form, seed: int) -> RandomnessStream: | ||
return RandomnessStream(form.value, lambda: pd.Timestamp("2020-04-01"), seed) | ||
|
||
|
||
def get_default_configuration() -> ConfigTree: | ||
import pseudopeople | ||
Review comment: Why import in this scope?
Reply: Because I only need it for the location of the file.
Reply: I have added a paths.py that we could add the YAML location to at some point.
||
|
||
default_config_layers = [ | ||
"base", | ||
] | ||
noising_configuration = ConfigTree(layers=default_config_layers) | ||
BASE_DIR = Path(pseudopeople.__file__).resolve().parent | ||
yaml_path = BASE_DIR / "default_configuration.yaml" | ||
noising_configuration.update(yaml_path, layer="base") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes I believe that is what it is doing according to the docs. I think the important thing to note is you can update at different levels. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes; more-or-less. You have the layers, which matter in deciding the "true", desired configuration. |
||
return noising_configuration |
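The layer behavior discussed in the thread above can be illustrated without vivarium. This sketch uses `collections.ChainMap` from the standard library as a stand-in for layered lookup; it is not how `ConfigTree` is implemented. The `override_layer` and its value are hypothetical, and the sketch assumes a higher-priority layer shadows the same key in `base`.

```python
from collections import ChainMap

# Values taken from the decennial_census defaults in the YAML.
base_layer = {"omission": 0.0145, "duplication": 0.05}

# Hypothetical higher-priority layer, e.g. a user-supplied override.
override_layer = {"omission": 0.02}

# ChainMap searches its mappings left to right, so the override layer wins
# for keys it defines, and everything else falls through to the base layer.
effective = ChainMap(override_layer, base_layer)

omission = effective["omission"]
duplication = effective["duplication"]
```

Here the overridden key resolves to the override layer's value, while untouched keys keep their base defaults.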
Review comment: This is very, very long. Do we want a default config per form?
Reply: RT discussion needed.
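As input to that discussion, here is one possible way to shorten the defaults, sketched in plain Python. In the YAML as written, only `omission` differs between the four forms, so the defaults could in principle be generated from a shared template plus per-form overrides. The names `SHARED_DEFAULTS`, `OMISSION_BY_FORM`, and `build_defaults` are hypothetical, and the template below is only an excerpt of the full set of noise types.

```python
import copy
from typing import Any, Dict

# Defaults shared by every form (excerpt of the YAML above).
SHARED_DEFAULTS: Dict[str, Any] = {
    "duplication": 0.05,
    "nickname": {"first_name": {"row_noise_level": 0.01}},
    "zipcode_miswriting": {
        "zipcode": {
            "row_noise_level": 0.01,
            "zipcode_miswriting": [0.04, 0.04, 0.2, 0.36, 0.36],
        }
    },
}

# The only value that currently varies by form.
OMISSION_BY_FORM: Dict[str, float] = {
    "decennial_census": 0.0145,
    "american_communities_survey": 0.0145,
    "current_population_survey": 0.2905,
    "women_infants_and_children": 0.0,
}


def build_defaults() -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: expand the shared template into one config per form."""
    defaults: Dict[str, Dict[str, Any]] = {}
    for form, omission in OMISSION_BY_FORM.items():
        # Deep copy so later edits to one form do not leak into another.
        form_config = copy.deepcopy(SHARED_DEFAULTS)
        form_config["omission"] = omission
        defaults[form] = form_config
    return defaults


defaults = build_defaults()
```

The trade-off is that the explicit YAML is easier to read and edit per form, while a template avoids repeating identical values four times.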