Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace negative IMD values with NULLs #1548

Closed
inglesp opened this issue Sep 1, 2023 · 1 comment · Fixed by #1639
Closed

Replace negative IMD values with NULLs #1548

inglesp opened this issue Sep 1, 2023 · 1 comment · Fixed by #1639
Assignees

Comments

@inglesp
Copy link
Contributor

inglesp commented Sep 1, 2023

In TPP, -1 is used to indicate a missing IMD value. We should replace this with NULL in the table definition in the TPP backend. As with #1523, we should check the performance impact before making any changes.

There's unlikely to be a problem with backwards compatibility, because if we write a dataset to arrow, we'll try to use an unsigned int for this column (because of this constraint) and it'll crash. But it's possible users are writing to CSV, so we should double check, through GH code search.

@inglesp inglesp added this to Data Team Sep 1, 2023
@evansd
Copy link
Contributor

evansd commented Sep 1, 2023

Thanks for catching this. We should check the performance impact, but I'm less worried about a change like this because (a) the addresses table is much smaller than, say, clinical_events and (b) we don't tend to filter on imd_rounded in any case.

What would have caught this earlier is the thing we've discussed several times, some sort of validate command which checks that the constraints we assert in the schema are actually met.

Maybe I should write a placeholder ticket for this so we've got somewhere to link relevant issues to.

@iaindillingham iaindillingham self-assigned this Oct 2, 2023
@iaindillingham iaindillingham moved this to Next in Data Team Oct 2, 2023
iaindillingham added a commit that referenced this issue Oct 2, 2023
Although the issue mentions negative IMD values, the only negative IMD
values are -1s. So, we can use NULLIF to replace -1s with NULLs.

I've checked the performance impact of this change by comparing the
execution time of the new query (NULLIF) to the old query.

The first template query counted the number of rows where imd_rounded
was GTE zero. In this case, there was an 88ms difference in elapsed time
between new and old, favouring old.

The second template query counted the number of rows where imd_rounded
was NULL (new query) or was -1 (old query). In this case, there was a
33ms difference in elapsed time between new and old, favouring new.

Fixes #1548
@iaindillingham iaindillingham moved this from Next to Under Review in Data Team Oct 2, 2023
@github-project-automation github-project-automation bot moved this from Under Review to Done in Data Team Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants