-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace ceiling dates with NULL in TPP tables #1523
Comments
Just to add extra consideration here (as mentioned in this comment), we should be mindful of the performance impact of making these transformations and make sure the updated queries perform adequately. |
New query:
Original query:
Run with 1 month (2022-01-01 to 2022-01-31) and 6 months (2022-01-01 to 2022-06-30) ranges
|
Wow, what a confusing set of results! But it goes to show that doing this kind of thing is worthwhile, I think. It's just occurred to me however that is a bit of an unrepresentative test because it doesn't filter on any codes at all, and that means it actually has to do the union and query both of the underlying tables. In practice, we always (or near enough to always as makes no difference) supply either a At the risk of being a massive pain, I think it would be worth re-running with a The other thing I'd be interested in is whether it makes any difference using |
New query:
Original Query:
Filtering on CTV3codesFilters on the top 3 CTV3 Codes from 2022 that aren't Y codes
Filtering on SnomedCt CodeFilters on the top 3 Snomed Codes from 2022 that aren't Y codes
|
@evansd As suggested ☝️ with filters on CTV3 and Snomed codes. For the CTV3 one, the new query beats the original in elapsed time over 1 month, and in both elapsed and cpu over 6 months. The Snomed one is similar for one month, but over 6 months the elapsed time is similar for both queries, and cpu lower for the original one, but not by a vast amount. Scans on the CodedEvent_SNOMED table go up for the 6 month one for the original query those, and I've no idea why. |
Thanks for doing all this. These results make so little sense to me. The first row of the CTV3 table shows CPU usage jumping 4-5x with the new query, but elapsed time consistently dropping 10-50x. I think this just shows that we don't have much an intuition for how the db will perform on any given query and we really need access to more detailed query planner stats. But all that said, the aim here was just to make sure that the change wasn't going to make things obviously much worse than they were before and on the basis of this I'd say we're probably OK to go ahead. Would welcome other thoughts (@Jongmassey @inglesp). |
Using NULLIFFiltering on CTV3codes
Filtering on Snomed codes
|
Yes, on the (rather baffling) evidence I'd agree. |
This has a bit of a smell of this being a partitioned table or index(es) (probably on the basis of date), but given that "scan count" hasn't actually meant "how many times the table was accessed" for ages it can sometimes be a slightly misleading metric. What were the physical/logical read stats? I have a slight hesitation here because of measures, which quite often run month-by-month and if we're increasing the IO by 60x+ and CPU by 3x+ then this might make the already-not-great performance a fair bit worse. For queries with a larger time span I'd say the difference is largely moot. |
I prefer |
It'd be really nice if replacement of NULL-equivalent dates was done during ETL rather than every single query! |
It really would, wouldn't it? Another option, which I think would be achievable but a non-trivial amount of work, would be to do the date transformation in ehrQL (rather than in an blob of SQL text which is opaque to ehrQL). That would give ehrQL the option to optimise away the transformation if it can determine that it isn't needed e.g. if you're querying for dates in a particular range. Or replace queries like Longer term I'd like to move as much transformation logic as we can out of the |
See issue description for a list of date fields in TPP (for tables that we use in cohort-extractor/ehrql). Of those, the ones to investigate (i.e. that contain ceiling dates and are used/will probably be used in ehrql) are:
Other questions:
|
RegistrationHistory.EndDateCurrent query:
New query:
Results:
|
Patient.DateOfDeathCurrent query:
New query:
Results:
|
PatientAddress.StartDate, PatientAddress.EndDateCurrent query:
New query:
Results:
|
Closing this issue:
|
In some TPP tables, missing or unknown dates are represented by a ceiling date (9999-12-31), which can cause confusion or coding errors (see e.g. https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1692786760750299)
We already replace ceiling end dates in the practice_registrations table, and consultation date in the clinical_events table (as of #1522). For any other date fields in TPP tables we should consider whether ceiling dates should be replaced with NULL.
For each date/datetime field, we should check:
Date/datetime fields in TPP (based on database schema report):
OPENPROMPT - Non-nullable; no ceiling dates in data
ICNARC - All nullable, no ceiling dates
ISARIC_Patient_Data_Topline (all varchar, all nullable - the following are converted to date in cohort-extractor):
(Only used in cohort-extractor, ehrql uses ISARIC_New)
ISARIC_New
All varchar, all nullable, date-like fields can contain "NA"; currently in exploration stage in ehrQL
Datefield used in ehrql:
APCS
CPNS (all nullable, no ceiling dates)
EC (all Nullable)
OPA (all nullable)
SGSS_AllTests_Negative (All nullable, no floor/ceiling dates)
SGSS_AllTests_Positive (All nullable, no floor/ceiling dates)
SGSS_Negative (All nullable, no floor/ceiling dates)
SGSS_Positive (All nullable, no floor/ceiling dates)
Therapeutics (All nullable, no floor/ceiling dates)
WL_ClockStops (datelike fields, all varchar, all nullable, all except RTT have min date 1900-01-01, several have impossible future dates, no ceiling dates)
WL_Diagnostics (datelike fields, all varchar, all nullable. A couple have ceiling dates - noted below - several have impossible min dates that look like defaults (1900-01-01) and impossible max dates that are not ceiling)
WL_OpenPathways (datelike fields, all varchar, all nullable. A couple have ceiling dates - noted below - several have impossible min dates that look like defaults (1900-01-01) and impossible max dates that are not ceiling)
ONS_Deaths (Nullable, no ceiling dates)
DecisionSupportValue (Nullable, no ceiling dates)
Appointment (all non Nullable, most contain default min 1900-01-01)
Consultation
MedicationIssue (all non-nullable)
Organisation (non nullable, no ceiling dates)
Patient (non nullable)
PatientAddress (non nullable)
RegistrationHistory
Vaccination
UKRR
The text was updated successfully, but these errors were encountered: