DOC: Reorder noise functions in user docs #254

Merged
merged 7 commits into from
Aug 18, 2023


NathanielBlairStahn
Contributor

@NathanielBlairStahn NathanielBlairStahn commented Aug 15, 2023

Title: Re-order the noise functions on the noise page and column noise page

Description

  • Category: documentation
  • JIRA issue: SSCI-1457

Edits the user docs to list the noise functions in the order in which they're applied in the code. Also makes a couple of unrelated minor edits to one of the noise tables (see code comments below).

Technically, I put the noise types in an order that should be equivalent to the order in which they're actually applied, but which I think will be more intuitive to a user reading the docs. I would like feedback on whether the order here looks good or whether we want to switch anything. If this ordering looks good, I'll edit the concept model accordingly and create an engineering ticket to update the code to match.

In particular, I have the following questions about what is the most logical order to try to match the actual data generation process:

  1. I think it makes sense that OCR errors and typos are the last two types of noise, but I don't know which of these should come first. Is it more likely that a document is read in by OCR, and then someone types the OCR'ed data into an administrative file, or is it more likely that someone types data into a file, the file is printed, and then the printed file is read in by OCR and input directly into the administrative file?

  2. I currently have nicknames applied before fake names. Does this make sense, or would it make more sense to apply fake names and then nicknames? (In practice, it probably doesn't make much difference because there are probably very few, if any, fake names that are also eligible for nicknames.)
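
To make question 2 concrete, here is a minimal sketch of applying noise functions in a fixed order. Everything in it is hypothetical: the names `use_nickname`, `use_fake_name`, and `apply_in_order`, the nickname table, and the placeholder fake name are invented for illustration and are not the project's actual code or data. The sketch is also deterministic, whereas the real noise functions are presumably applied to only a fraction of cells.

```python
from typing import Callable, Dict, List

NoiseFn = Callable[[str], str]

# Toy lookup data, invented for this example (not the project's lists).
NICKNAMES: Dict[str, str] = {"William": "Bill", "Elizabeth": "Liz"}
FAKE_NAME = "Baby Boy"


def use_nickname(name: str) -> str:
    """Replace a name with its nickname if one is listed; otherwise keep it."""
    return NICKNAMES.get(name, name)


def use_fake_name(name: str) -> str:
    """Replace any name with a fixed placeholder fake name."""
    return FAKE_NAME


def apply_in_order(value: str, noise_fns: List[NoiseFn]) -> str:
    """Apply each noise function to the output of the previous one."""
    for fn in noise_fns:
        value = fn(value)
    return value


# Compare the two candidate orderings on the same input.
nickname_first = apply_in_order("William", [use_nickname, use_fake_name])
fake_name_first = apply_in_order("William", [use_fake_name, use_nickname])
print(nickname_first, "|", fake_name_first)
```

In this toy setup both orderings give the same result, because the fake name has no entry in the nickname table. The two orderings would only diverge for a fake name that is itself eligible for a nickname.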

Testing

Built docs locally.

Member

@aflaxman aflaxman left a comment


I like your order

@NathanielBlairStahn NathanielBlairStahn added the documentation label Aug 17, 2023
@@ -110,60 +110,60 @@ Noise types for each column
:header-rows: 1

* - Column name
- Datasets present
- Applicable datasets
Contributor Author


I went ahead and changed this column name as discussed at the doc session last week. This is unrelated to the noise function order.

- Leave a field blank, use a fake name, use a nickname, make typos, make OCR errors, make phonetic errors
-
- Leave a field blank, use a nickname, use a fake name, make phonetic errors, make OCR errors, make typos
- Middle names use the same lists of nicknames and fake names used for first names
Contributor Author


I thought it would be good to specify that middle names use the same data as first names for these two noise functions. This is unrelated to the noise function order.

- Leave a field blank, use a fake name, make typos, make OCR errors, make phonetic errors
- In the 1040 form, the same noise types apply to the last name columns for the joint filer and dependents
- Leave a field blank, use a fake name, make phonetic errors, make OCR errors, make typos
- Last names use a different list of fake names than the list for first names. In the 1040 form, the same noise types apply to the last name columns for the joint filer and dependents
Contributor Author


Likewise, I thought it would be good to clarify that last names use a different fake name list than first names. This is unrelated to the noise function order.

@NathanielBlairStahn NathanielBlairStahn marked this pull request as ready for review August 17, 2023 19:55
@NathanielBlairStahn NathanielBlairStahn requested review from zmbc, pletale and a team as code owners August 17, 2023 19:55
Collaborator

@zmbc zmbc left a comment


I agree, this order makes sense to me!

@NathanielBlairStahn NathanielBlairStahn merged commit 2201d71 into develop Aug 18, 2023
@NathanielBlairStahn NathanielBlairStahn deleted the reorder-noise-fns branch August 18, 2023 20:20
Labels
documentation Improvements or additions to documentation
4 participants