DOC: Reorder noise functions in user docs #254
I like your order
@@ -110,60 +110,60 @@ Noise types for each column
:header-rows: 1

* - Column name
- Datasets present
- Applicable datasets
I went ahead and changed this column name as discussed at the doc session last week. This is unrelated to the noise function order.
- Leave a field blank, use a fake name, use a nickname, make typos, make OCR errors, make phonetic errors
-
- Leave a field blank, use a nickname, use a fake name, make phonetic errors, make OCR errors, make typos
- Middle names use the same lists of nicknames and fake names used for first names
I thought it would be good to specify that middle names use the same data as first names for these two noise functions. This is unrelated to the noise function order.
- Leave a field blank, use a fake name, make typos, make OCR errors, make phonetic errors
- In the 1040 form, the same noise types apply to the last name columns for the joint filer and dependents
- Leave a field blank, use a fake name, make phonetic errors, make OCR errors, make typos
- Last names use a different list of fake names than the list for first names. In the 1040 form, the same noise types apply to the last name columns for the joint filer and dependents
Likewise, I thought it would be good to clarify that last names use a different fake name list than first names. This is unrelated to the noise function order.
I agree, this order makes sense to me!
Title: Re-order the noise functions on the noise page and column noise page
Description
Edits the user docs to list the noise functions in the order in which they're applied in the code. Also makes a couple of unrelated minor edits to one of the noise tables (see code comments below).
Technically, I put the noise types in an order that should be equivalent to the order in which they're actually applied, but which I think will be more intuitive to a user reading the docs. I would like feedback on whether the order here looks good or whether we want to switch anything. If this ordering looks good, I'll edit the concept model accordingly and create an engineering ticket to update the code to match.
In particular, I have the following questions about which order most logically matches the actual data generation process:
I think it makes sense that OCR errors and typos are the last two types of noise, but I don't know which of these should come first. Is it more likely that a document is read in by OCR, and then someone types the OCR'ed data into an administrative file, or is it more likely that someone types data into a file, the file is printed, and then the printed file is read in by OCR and input directly into the administrative file?
I currently have nicknames applied before fake names. Does this make sense, or would it make more sense to apply fake names and then nicknames? (In practice, it probably doesn't make much difference because there are probably very few, if any, fake names that are also eligible for nicknames.)
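To make the ordering question concrete, here is a toy sketch of applying noise functions in sequence. Everything in it is hypothetical: `use_nickname`, `make_ocr_errors`, `make_typos`, `apply_in_order`, and the lookup tables are deterministic stand-ins I made up for illustration, not the project's actual noise functions (which would be probabilistic). It only shows that swapping the OCR and typo steps can change the output when both can act on the same character.

```python
from typing import Callable, List, Optional

# Hypothetical toy lookup tables, assumed purely for illustration.
NICKNAMES = {"William": "Bill"}
OCR_CONFUSIONS = {"l": "1"}   # a lowercase "l" misread as the digit "1"
TYPO_NEIGHBORS = {"l": "k"}   # "k" sits next to "l" on a QWERTY keyboard

NoiseFn = Callable[[Optional[str]], Optional[str]]


def _substitute(value: Optional[str], table: dict) -> Optional[str]:
    """Replace every character found in `table`; pass blanks through."""
    if value is None:
        return None
    return "".join(table.get(ch, ch) for ch in value)


def use_nickname(value: Optional[str]) -> Optional[str]:
    """Swap a name for its nickname when one is listed."""
    if value is None:
        return None
    return NICKNAMES.get(value, value)


def make_ocr_errors(value: Optional[str]) -> Optional[str]:
    """Deterministic stand-in for OCR noise."""
    return _substitute(value, OCR_CONFUSIONS)


def make_typos(value: Optional[str]) -> Optional[str]:
    """Deterministic stand-in for typo noise."""
    return _substitute(value, TYPO_NEIGHBORS)


def apply_in_order(value: Optional[str], noise_fns: List[NoiseFn]) -> Optional[str]:
    """Apply each noise function to the output of the previous one."""
    for fn in noise_fns:
        value = fn(value)
    return value


ocr_then_typos = apply_in_order("William", [use_nickname, make_ocr_errors, make_typos])
typos_then_ocr = apply_in_order("William", [use_nickname, make_typos, make_ocr_errors])
print(ocr_then_typos, typos_then_ocr)
```

In this toy setup whichever of the two steps runs first consumes the `l` characters, so the later step has nothing left to change. Whether the same kind of interaction matters for the real noise functions is an open question for reviewers.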
Testing
Built docs locally.