Add WL_Clockstops and WL_OpenPathways #1651
@iaindillingham There is a link to documentation below. While it provides useful background information and explains some of the terminology/concepts, it doesn't have a straightforward description of the datasets. So I've put together brief descriptions. Let me know if it's still not clear.

WL_Clockstops: This dataset contains all completed referral-to-treatment (RTT) pathways with a “clock stop” date between May 2021 and May 2022. Patients referred for non-emergency consultant-led treatment are on RTT pathways. The “clock start” date is the date of the first referral that starts the pathway. The “clock stop” date is when the patient either: receives treatment; declines treatment; enters a period of active monitoring; no longer requires treatment; or dies. The time spent waiting is the difference between these two dates. A patient may have multiple rows if they have multiple completed RTT pathways; however, there is only one row per unique pathway. Because referral identifiers aren’t necessarily unique between hospitals, unique RTT pathways can be identified using a combination of pseudo_referral_identifier, pseudo_patient_pathway_identifier, pseudo_organisation_code_patient_pathway_identifier_issuer and referral_to_treatment_period_start_date.

WL_OpenPathways: This dataset contains all people on open (incomplete) RTT or not current RTT (non-RTT) pathways as of May 2022. It is a snapshot of everyone still awaiting treatment as of May 2022 (i.e., the clock hasn’t stopped). Patients referred for non-emergency consultant-led treatment are on RTT pathways, while patients referred for non-consultant-led treatment are on non-RTT pathways. For each pathway, there is one row for every week that the patient is still waiting. Because referral identifiers aren’t necessarily unique between hospitals, unique RTT pathways can be identified using a combination of pseudo_referral_identifier, pseudo_patient_pathway_identifier, pseudo_organisation_code_patient_pathway_identifier_issuer and referral_to_treatment_period_start_date.

More information can be found in the Recording and Reporting RTT guidance here: https://www.england.nhs.uk/statistics/statistical-work-areas/rtt-waiting-times/rtt-guidance/
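To make the description above concrete, here is a minimal sketch of identifying a unique pathway by the four-column combination and computing the time waited as the difference between clock start and clock stop. The rows are invented for illustration, and the name referral_to_treatment_period_end_date for the clock stop column is an assumption; only the four key columns are named in the comment above.

```python
from datetime import date

# The four columns that together identify a unique RTT pathway.
KEY_COLUMNS = (
    "pseudo_referral_identifier",
    "pseudo_patient_pathway_identifier",
    "pseudo_organisation_code_patient_pathway_identifier_issuer",
    "referral_to_treatment_period_start_date",
)

# Assumed name for the "clock stop" column; not confirmed in this thread.
CLOCK_STOP_COLUMN = "referral_to_treatment_period_end_date"


def pathway_key(row):
    """Return the composite key that identifies one pathway."""
    return tuple(row[column] for column in KEY_COLUMNS)


def days_waited(row):
    """Return the days between clock start and clock stop."""
    start = row["referral_to_treatment_period_start_date"]
    return (row[CLOCK_STOP_COLUMN] - start).days


# Invented rows: the same referral identifier issued by two organisations.
rows = [
    {
        "pseudo_referral_identifier": "R1",
        "pseudo_patient_pathway_identifier": "P1",
        "pseudo_organisation_code_patient_pathway_identifier_issuer": "ORG_A",
        "referral_to_treatment_period_start_date": date(2021, 1, 4),
        CLOCK_STOP_COLUMN: date(2021, 6, 7),
    },
    {
        "pseudo_referral_identifier": "R1",
        "pseudo_patient_pathway_identifier": "P1",
        "pseudo_organisation_code_patient_pathway_identifier_issuer": "ORG_B",
        "referral_to_treatment_period_start_date": date(2021, 1, 4),
        CLOCK_STOP_COLUMN: date(2021, 5, 10),
    },
]

# One entry per unique pathway, mapped to its waiting time in days.
waits_by_pathway = {pathway_key(row): days_waited(row) for row in rows}
```

Keying on the referral identifier alone would merge these two rows into one pathway; the full combination keeps them distinct.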
Also, for those variables where you provided the possible values from the data dictionary (priority_type_code, waiting_list_type), I note that roughly 10% of values fall outside those lists. I don't know if that is something you usually note in the schema.
Elsewhere, we have linked to short data reports from the schema. Have you produced something similar for your data development work, @alschaffer? As it stands, the proportion with other values will appear as missing ( |
Here is where I've provided a summary of the data holdings and quality: https://docs.google.com/document/d/1Es1ODHeyUdIj6MXgFonZtQ2ncPC0_oaSfN8piGQc0ks/edit And this folder has some more granular data exploration, including frequency distributions for many variables:

I suspect those other categories are a combination of old codes, local codes, and typos (e.g. reversing the order of letters, inputting numbers instead of letters). If it's not too much trouble, it would be useful to include the other categories, just for the waiting_list_type variable. While they may not end up being important, I'd prefer not to exclude them right off the bat.

For ClockStops, the categories in the data are (in descending order of frequency): ORTT, IRTT, PTLO, PTLI, ONON, RTTO, INON, RTTI. For OpenPathways, they are: ORTT, ONON, IRTT, INON, PTLO, PTLI, OTHR, PTLN, PLTO, RTTO, PLTI, PTI, PTL0, RTTI, PTL, PTO, PTL1. (FYI this is from the notebook pdf output in the folder above.) You could use the same categories for both datasets, since it's the same variable.
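For anyone wanting to reproduce this kind of check, the following is a rough sketch of tallying waiting_list_type values and separating those that fall outside a documented set. The DOCUMENTED set and the sample values are placeholders invented for illustration; they are not the data dictionary's list, nor the real frequencies.

```python
from collections import Counter

# Placeholder only: substitute the values listed in the data dictionary.
DOCUMENTED = {"ORTT", "IRTT", "ONON", "INON"}


def summarise(values, documented=DOCUMENTED):
    """Return (all counts, counts of undocumented values, undocumented share)."""
    counts = Counter(values)
    other = {value: n for value, n in counts.items() if value not in documented}
    share_other = sum(other.values()) / len(values) if values else 0.0
    return counts, other, share_other


# Invented sample, including a likely typo ("PTL0" with a zero).
sample = ["ORTT"] * 6 + ["IRTT"] * 2 + ["PTLO", "PTL0"]
counts, other, share_other = summarise(sample)

# Print categories in descending order of frequency.
for value, n in counts.most_common():
    flag = "" if value in DOCUMENTED else "  (not in documented set)"
    print(f"{value}: {n}{flag}")
```

Keeping the undocumented values in a separate dictionary, rather than discarding them, matches the preference above not to exclude them at the outset.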
Thanks, @alschaffer. I think there's a slight -- but understandable -- misunderstanding. I've applied categorical constraints to I've mapped values of I've not mapped values of |
Just checking my understanding: does this mean we're effectively dropping the roughly 10% of values that aren't valid for these fields in the non-raw tables?
We're not dropping anything. Values of |
Thanks Iain, I get it now, I forgot that the values in |
)
week_ending_date = Series(
    datetime.date,
    description="The Sunday of the week that the pathway relates to",
I haven't looked at the tables in a while, but when I did previously, there were instances of this date being empty, or in the future. It should never be in the future afaik, since it should be the date the data was uploaded. Should there be some sort of note to the user in the docs?
Just checked again; there are no empty week_ending_dates now, but there are 23,554 future ones in the wl_openpathways table, and 1,816 in the wl_clockstops table ("future" being after this Sunday, in case the date entered is the date of the next Sunday rather than the one just gone).
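For reference, the "future" check described above could be sketched as follows. This is an illustration of the rule only (strictly after the upcoming Sunday), run on invented dates; it is not the query that produced the counts above, and the function names are hypothetical.

```python
from datetime import date, timedelta


def upcoming_sunday(today):
    """Return the next Sunday on or after `today`.

    date.weekday() numbers Monday as 0 and Sunday as 6, so a Sunday
    maps to itself.
    """
    return today + timedelta(days=(6 - today.weekday()) % 7)


def count_future(week_ending_dates, today):
    """Count dates strictly after the upcoming Sunday, ignoring empty values."""
    cutoff = upcoming_sunday(today)
    return sum(1 for d in week_ending_dates if d is not None and d > cutoff)


# Invented example: checked on a Wednesday.
today = date(2023, 10, 11)
dates = [date(2023, 10, 8), date(2023, 10, 15), date(2023, 10, 22), None]
n_future = count_future(dates, today)
```

Using the upcoming Sunday as the cutoff means a row stamped with next Sunday's date, rather than the Sunday just gone, is not counted as being in the future.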
Should there be some sort of note to the user in the docs?
On balance, I think we shouldn't put this kind of information in the docs, as it may become outdated without us knowing; the empty values in the Week_Ending_Date column are a good example. I know @alschaffer has investigated these tables. Is there a plan to publish those investigations, Andrea?
Thanks for documenting the ehrQL table, @alschaffer.
This adds the database table WL_Clockstops as the ehrQL tables wl_clockstops_raw and wl_clockstops, and the database table WL_OpenPathways as the ehrQL tables wl_openpathways_raw and wl_openpathways.

The raw ehrQL tables should help researchers understand the structure and semantics of the database tables (i.e. undertake data development). I've added their corresponding backend tables as instances of QueryTable rather than MappedTable, because the database tables contain binary columns that must be converted to string columns using custom SQL.

If we treat wl_clockstops as a view, then filtering each of the date columns in turn for rows that fall within a range takes from roughly 40ms to roughly 500ms. Adding another filter, such as also filtering by priority_type_code, takes from roughly 100ms to 1500ms. If we treat wl_openpathways as a view, then similar queries take about 500ms longer, possibly because many values of wl_openpathways.current_pathway_period_start_date are NULL.

I've added the constraints using the information in National Waiting List Dataset Submission v12 [1]. The categorical constraints on priority_type_code and waiting_list_type are especially useful, as they ensure that the dummy data are more realistic. As @alschaffer notes, however, in the real tables around 10% of values in each of these columns do not match the categorical constraints. Consequently, these values will be missing (NULL) in a real dataset.

You can find the rendered documentation at https://iaindillingham-add-nhse-wl.databuilder.pages.dev/reference/schemas/beta.tpp/#wl_clockstops. Thanks for your help documenting the ehrQL tables, @alschaffer.

I used the National Waiting List Dataset Submission v12 [1] to help document the columns. However, I'm still unclear what Clock Stops and Open Pathways are. Could you help me describe the ehrQL tables, @alschaffer? Are there URLs to these datasets that I could add to the documentation?

I haven't been able to test the performance characteristics of the cooked ehrQL tables, because I cannot connect to the VPN [2]. However, when I can, I will. Similarly, I haven't been able to check that the constraints reflect the data, rather than the documentation.

Footnotes
[1] https://bennettoxford.slack.com/archives/C02HJTL065A/p1697015782102079?thread_ts=1694531849.195469&cid=C02HJTL065A
[2] https://bennettoxford.slack.com/archives/C010SJ89SA3/p1696602442369059