Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement typographic noise function #19

Merged
merged 6 commits into from
Mar 29, 2023

Conversation

stevebachmeier
Copy link
Contributor

Title: Implement typographnic noising

Description

  • Category: implementation
  • JIRA issue: MIC-3881

This is one of probably a few PRs to implement typographic noising.
This one includes the function as well as some unit tests.
Followup PRs may include:

  • an integration test if it's decided to do one
  • numba implementation to speed up the unit test (it's currently a very slow
    loop over every character in every string of potentially-noised rows)

Testing

This PR includes a (very slow) unit test that is passing
To discuss: how to add an integration test?

@stevebachmeier stevebachmeier changed the title naive implementation of typographic noise function implement typographic noise function Mar 28, 2023
"""Abie's implementation of typographical noising"""
err = ""
i = 0
while i < len(truth):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where a followup PR will try and implement numba or otherwise refactor.

)

assert noised_data.equals(noised_data_same_seed)
assert not noised_data.equals(noised_data_different_seed)
assert not data.equals(noised_data)
# TODO: Confirm correct columns exist once the interface functions
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this deletion from Jim's open PR since we decided this is all too complicated given the number of columns and functions to be implemented.



# TODO: refactor this into its own test parameterized by noise functions
def _validate_seed_and_noise_data(func, column, config):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A followup PR will move this method out of the existing tests and do these checks as their own tests.

@stevebachmeier stevebachmeier marked this pull request as ready for review March 28, 2023 22:46
tests/integration/conftest.py Outdated Show resolved Hide resolved
tests/integration/conftest.py Outdated Show resolved Hide resolved
tests/unit/test_column_noise.py Outdated Show resolved Hide resolved
@stevebachmeier stevebachmeier merged commit 6b217e5 into develop Mar 29, 2023
@stevebachmeier stevebachmeier deleted the feature/sbachmei/MIC-3881-typographic-noise branch March 29, 2023 18:27
Copy link
Collaborator

@zmbc zmbc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! The only concern I have with this is that the code that transformed the keyboard layout CSV -> qwerty_errors.yaml isn't here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants