Add WL_Clockstops and WL_OpenPathways #1651

Merged (5 commits) on Oct 23, 2023

@iaindillingham (Member) commented Oct 12, 2023

This adds the database table WL_Clockstops as the ehrQL tables wl_clockstops_raw and wl_clockstops, and the database table WL_OpenPathways as the ehrQL tables wl_openpathways_raw and wl_openpathways.

The raw ehrQL tables should help researchers understand the structure and semantics of the database tables (i.e. undertake data development). I've added their corresponding backend tables as instances of QueryTable rather than MappedTable, because the database tables contain binary columns that must be converted to string columns using custom SQL.

If we treat wl_clockstops as a view, then filtering each of the date columns in turn for rows that fall within a range takes from roughly 40ms to roughly 500ms. Adding another filter, such as also filtering by priority_type_code, takes from roughly 100ms to 1500ms. If we treat wl_openpathways as a view, then similar queries take about 500ms longer, possibly because many values of wl_openpathways.current_pathway_period_start_date are NULL.

I've added the constraints using the information in National Waiting List Dataset Submission v12.[1] The categorical constraints on priority_type_code and waiting_list_type are especially useful, as they ensure that the dummy data are more realistic. As @alschaffer notes, however, in the real tables around 10% of values in each of these columns do not match the categorical constraints. Consequently, these values will be missing (NULL) in a real dataset.

You can find the rendered documentation at https://iaindillingham-add-nhse-wl.databuilder.pages.dev/reference/schemas/beta.tpp/#wl_clockstops. Thanks for your help documenting the ehrQL tables, @alschaffer.


I used the National Waiting List Dataset Submission v12 [1] to help document the columns. However, I'm still unclear what Clock Stops and Open Pathways are. Could you help me describe the ehrQL tables, @alschaffer? Are there URLs to these datasets that I could add to the documentation?

I haven't been able to test the performance characteristics of the cooked ehrQL tables, because I cannot connect to the VPN.[2] However, when I can, I will. Similarly, I haven't been able to check that the constraints reflect the data, rather than the documentation.

Footnotes

  1. https://bennettoxford.slack.com/archives/C02HJTL065A/p1697015782102079?thread_ts=1694531849.195469&cid=C02HJTL065A

  2. https://bennettoxford.slack.com/archives/C010SJ89SA3/p1696602442369059

@iaindillingham iaindillingham linked an issue Oct 12, 2023 that may be closed by this pull request
3 tasks
@alschaffer

@iaindillingham There is a link to documentation below. While it provides useful background information and explains some of the terminology/concepts, it doesn't have a straightforward description of the datasets. So I've put together brief descriptions. Let me know if it's still not clear.

WL_Clockstops: This dataset contains all completed referral-to-treatment (RTT) pathways with a “clock stop” date between May 2021 and May 2022. Patients referred for non-emergency consultant-led treatment are on RTT pathways. The “clock start” date is the date of the first referral that starts the pathway. The “clock stop” date is when the patient either: receives treatment; declines treatment; enters a period of active monitoring; no longer requires treatment; or dies. The time spent waiting is the difference between these two dates. A patient may have multiple rows if they have multiple completed RTT pathways; however, there is only one row per unique pathway. Because referral identifiers aren’t necessarily unique between hospitals, unique RTT pathways can be identified using a combination of pseudo_referral_identifier, pseudo_patient_pathway_identifier, pseudo_organisation_code_patient_pathway_identifier_issuer and referral_to_treatment_period_start_date.

WL_OpenPathways: This dataset contains all people on open (incomplete) RTT or not current RTT (non-RTT) pathways as of May 2022. It is a snapshot of everyone still awaiting treatment as of May 2022 (i.e., the clock hasn’t stopped). Patients referred for non-emergency consultant-led treatment are on RTT pathways, while patients referred for non-consultant-led treatment are on non-RTT pathways. For each pathway, there is one row for every week that the patient is still waiting. Because referral identifiers aren’t necessarily unique between hospitals, unique RTT pathways can be identified using a combination of the pseudo_referral_identifier, pseudo_patient_pathway_identifier, pseudo_organisation_code_patient_pathway_identifier_issuer and referral_to_treatment_period_start_date.
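To make the composite key concrete, here is a minimal sketch in plain Python (not ehrQL) of grouping rows by the four identifying columns named above. The row dictionaries, the sample values, and the helper names `pathway_key` and `group_by_pathway` are made up for illustration; only the four column names come from the descriptions.

```python
import datetime
from collections import defaultdict

# Columns that, taken together, identify a unique pathway (per the descriptions above).
PATHWAY_KEY = (
    "pseudo_referral_identifier",
    "pseudo_patient_pathway_identifier",
    "pseudo_organisation_code_patient_pathway_identifier_issuer",
    "referral_to_treatment_period_start_date",
)


def pathway_key(row):
    """Return the composite key for one row (a dict of column name -> value)."""
    return tuple(row[column] for column in PATHWAY_KEY)


def group_by_pathway(rows):
    """Group rows (e.g. weekly open-pathway rows) by their composite pathway key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[pathway_key(row)].append(row)
    return dict(grouped)


# Hypothetical rows: two weekly rows for one pathway, plus one row that reuses
# the same referral identifier but was issued by a different organisation.
start = datetime.date(2022, 1, 10)
rows = [
    {"pseudo_referral_identifier": "r1", "pseudo_patient_pathway_identifier": "p1",
     "pseudo_organisation_code_patient_pathway_identifier_issuer": "orgA",
     "referral_to_treatment_period_start_date": start,
     "week_ending_date": datetime.date(2022, 5, 1)},
    {"pseudo_referral_identifier": "r1", "pseudo_patient_pathway_identifier": "p1",
     "pseudo_organisation_code_patient_pathway_identifier_issuer": "orgA",
     "referral_to_treatment_period_start_date": start,
     "week_ending_date": datetime.date(2022, 5, 8)},
    {"pseudo_referral_identifier": "r1", "pseudo_patient_pathway_identifier": "p1",
     "pseudo_organisation_code_patient_pathway_identifier_issuer": "orgB",
     "referral_to_treatment_period_start_date": start,
     "week_ending_date": datetime.date(2022, 5, 1)},
]

pathways = group_by_pathway(rows)
```

In this made-up example the three rows collapse to two pathways, because the third row differs only in the issuing organisation.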

More information can be found in the Recording and Reporting RTT guidance here: https://www.england.nhs.uk/statistics/statistical-work-areas/rtt-waiting-times/rtt-guidance/

@alschaffer

Also, for those variables where you provided the possible values from the data dictionary (priority_type_code, waiting_list_type), I note that a not insignificant proportion (~10%) have other values. I don't know if that is something you usually note in the schema.

cloudflare-workers-and-pages bot commented Oct 16, 2023

Deploying with Cloudflare Pages

Latest commit: 43072a1
Status: ✅  Deploy successful!
Preview URL: https://d4361f4e.databuilder.pages.dev
Branch Preview URL: https://iaindillingham-add-nhse-wl.databuilder.pages.dev

@iaindillingham
Member Author

> Also, for those variables where you provided the possible values from the data dictionary (priority_type_code, waiting_list_type), I note that a not insignificant proportion (~10%) have other values. I don't know if that is something you usually note in the schema.

Elsewhere, we have linked to short data reports from the schema. Have you produced something similar for your data development work, @alschaffer? As it stands, the proportion with other values will appear as missing (NULL). Have I missed any categories, or do the other values represent data entry (or similar) errors?

@iaindillingham iaindillingham marked this pull request as ready for review October 16, 2023 14:49
@alschaffer

Here is where I've provided a summary of the data holdings and quality: https://docs.google.com/document/d/1Es1ODHeyUdIj6MXgFonZtQ2ncPC0_oaSfN8piGQc0ks/edit

And this folder has some more granular data exploration, including frequency distributions for many variables:
G:\Shared drives\OpenSAFELY\Data\Waiting List Data\Notebook html output

I suspect those other categories are a combination of old codes, local codes, and typos (e.g. reversing the order of letters, inputting numbers instead of letters). If it's not too much trouble, it would be useful to include the other categories, just for the waiting_list_type variable. While they may not end up being important, I'd prefer not to exclude them right off the bat.

For ClockStops, the categories in the data are (in descending order of frequency): ORTT, IRTT, PTLO, PTLI, ONON, RTTO, INON, RTTI.

For OpenPathways, they are: ORTT, ONON, IRTT, INON, PTLO, PTLI, OTHR, PTLN, PLTO, RTTO, PLTI, PTI, PTL0, RTTI, PTL, PTO, PTL1.

(FYI this is from the notebook pdf output in the folder above.) You could use the same categories for both datasets, since it's the same variable.

@iaindillingham
Member Author

Thanks, @alschaffer. I think there's a slight -- but understandable -- misunderstanding.

I've applied categorical constraints to priority_type_code and waiting_list_type in both wl_clockstops and wl_openpathways (ehrQL tables). These categorical constraints apply only when we generate dummy data.

I've mapped values of PRIORITY_TYPE_CODE in WL_ClockStops and WL_OpenPathways (database tables). These mappings apply when we extract real data. The value that's extracted is the value in the mapping, or NULL.

I've not mapped values of Waiting_List_Type in either WL_ClockStops or WL_OpenPathways. The value that's extracted is the value in the column.
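The difference between a mapped and an unmapped column can be sketched in plain Python. This is an illustration only: the mapping below is a hypothetical stand-in (the real mapping is applied by the backend when data are extracted), and the function names are invented for this example.

```python
# Hypothetical mapping for illustration; not the mapping used by the backend.
PRIORITY_TYPE_MAPPING = {"1": "routine", "2": "urgent", "3": "two week wait"}


def extract_priority_type_code(raw_value):
    """Mapped column: return the mapped value, or None (NULL) if there is no mapping."""
    return PRIORITY_TYPE_MAPPING.get(raw_value)


def extract_waiting_list_type(raw_value):
    """Unmapped column: return the value exactly as it is stored."""
    return raw_value
```

Under these assumptions, an unrecognised PRIORITY_TYPE_CODE is extracted as NULL, whereas an unusual Waiting_List_Type such as PLTO is extracted unchanged. The categorical constraints are a separate concern and only affect which values appear in dummy data.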

@sebbacon
Contributor

> The categorical constraints on priority_type_code and waiting_list_type are especially useful, as they ensure that the dummy data are more realistic. As @alschaffer notes, however, in the real tables around 10% of values in each of these columns do not match the categorical constraints. Consequently, these values will be missing (NULL) in a real dataset.

Just checking my understanding: does this mean we're effectively dropping 10% of non-valid data for these fields in the non-raw tables?

@iaindillingham
Member Author

> Just checking my understanding: does this mean we're effectively dropping 10% of non-valid data for these fields in the non-raw tables?

We're not dropping anything. Values of priority_type_code are mapped, because most values of PRIORITY_TYPE_CODE either obviously conform to the data dictionary or don't. Values of waiting_list_type are not mapped, because it's less clear whether they conform or don't.

@alschaffer

Thanks Iain, I get it now, I forgot that the values in PRIORITY_TYPE_CODE needed to be mapped. And most of the "non-conforming" values in that variable are not interpretable anyway.

ehrql/backends/tpp.py (outdated, resolved)
)
week_ending_date = Series(
datetime.date,
description="The Sunday of the week that the pathway relates to",
Contributor

I haven't looked at the tables in a while, but when I did previously, there were instances of this date being empty, or in the future. It should never be in the future afaik, since it should be the date the data was uploaded. Should there be some sort of note to the user in the docs?

Just checked again; there are no empty week_ending_dates now, but there are 23,554 future ones in the wl_openpathways table, and 1,816 in the wl_clockstops table ("future" being after this Sunday, in case the date entered is the date of the next Sunday rather than the one just gone).
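For reference, the "future" rule described here could be expressed roughly as follows. This is a sketch of the rule as worded in the comment, not the query that was actually run; the function names are invented for this example.

```python
import datetime


def upcoming_sunday(today):
    """Return the Sunday on or after `today` (Monday is weekday 0, Sunday is 6)."""
    return today + datetime.timedelta(days=(6 - today.weekday()) % 7)


def is_future_week_ending_date(week_ending_date, today):
    """A date counts as "future" only if it falls after the upcoming Sunday."""
    return week_ending_date > upcoming_sunday(today)
```

With this definition, a week_ending_date equal to the coming Sunday is tolerated, and anything later is counted as a future date.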

Member Author

> Should there be some sort of note to the user in the docs?

On balance, I think we shouldn't put this kind of information in the docs, as it may become outdated without us knowing; empty values for the Week_Ending_Date column are a good example. I know @alschaffer has investigated these tables. Is there a plan to publish those investigations, Andrea?

tests/integration/backends/test_tpp.py (outdated, resolved)
Development

Successfully merging this pull request may close these issues.

Add WL_Clockstops and WL_OpenPathways
4 participants