Add `WL_Clockstops` and `WL_OpenPathways` #1651

iaindillingham · 2023-10-12T11:01:52Z

This adds the database table WL_Clockstops as the ehrQL tables wl_clockstops_raw and wl_clockstops, and the database table WL_OpenPathways as the ehrQL tables wl_openpathways_raw and wl_openpathways.

The raw ehrQL tables should help researchers understand the structure and semantics of the database tables (i.e. undertake data development). I've added their corresponding backend tables as instances of QueryTable rather than MappedTable, because the database tables contain binary columns that must be converted to string columns using custom SQL.
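As a rough illustration of why custom SQL is needed, the sketch below builds a `SELECT` statement that passes ordinary columns through and converts binary columns to hexadecimal strings. It is plain Python rather than the backend code in this PR. The helper names (`binary_as_hex`, `build_raw_query`), the choice of which column is binary, and the use of T-SQL's `CONVERT(..., 2)` style are assumptions made for the example.

```python
def binary_as_hex(column: str, alias: str) -> str:
    """Return a T-SQL expression that renders a binary column as a hex string.

    Style 2 of CONVERT omits the leading "0x" from the result.
    """
    return f"CONVERT(varchar(max), {column}, 2) AS {alias}"


def build_raw_query(table: str, plain: dict, binary: dict) -> str:
    """Build a SELECT that passes plain columns through and converts binary ones."""
    select_items = [f"{column} AS {alias}" for column, alias in plain.items()]
    select_items += [binary_as_hex(column, alias) for column, alias in binary.items()]
    return "SELECT\n    " + ",\n    ".join(select_items) + f"\nFROM {table}"


# Hypothetical column selection: which columns are binary is assumed here.
query = build_raw_query(
    table="WL_ClockStops",
    plain={"Patient_ID": "patient_id", "Waiting_List_Type": "waiting_list_type"},
    binary={"Pseudo_Referral_Identifier": "pseudo_referral_identifier"},
)
print(query)
```

A query string of this shape is the kind of thing a query-backed table can wrap, whereas a straight column-to-column mapping has nowhere to put the conversion.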

If we treat wl_clockstops as a view, then filtering each of the date columns in turn for rows that fall within a range takes from roughly 40ms to roughly 500ms. Adding another filter, such as also filtering by priority_type_code, takes from roughly 100ms to 1500ms. If we treat wl_openpathways as a view, then similar queries take about 500ms longer, possibly because many values of wl_openpathways.current_pathway_period_start_date are NULL.

I've added the constraints using the information in National Waiting List Dataset Submission v12.¹ The categorical constraints on priority_type_code and waiting_list_type are especially useful, as they ensure that the dummy data are more realistic. As @alschaffer notes, however, in the real tables around 10% of values in each of these columns do not match the categorical constraints. Consequently, these values will be missing (NULL) in a real dataset.

You can find the rendered documentation at https://iaindillingham-add-nhse-wl.databuilder.pages.dev/reference/schemas/beta.tpp/#wl_clockstops. Thanks for your help documenting the ehrQL tables, @alschaffer.

I used the National Waiting List Dataset Submission v12¹ to help document the columns. However, I'm still unclear what Clock Stops and Open Pathways are. Could you help me describe the ehrQL tables, @alschaffer? Are there URLs to these datasets that I could add to the documentation?

I haven't been able to test the performance characteristics of the cooked ehrQL tables, because I cannot connect to the VPN.² However, when I can, I will. Similarly, I haven't been able to check that the constraints reflect the data, rather than the documentation.

alschaffer · 2023-10-12T16:45:56Z

@iaindillingham There is a link to documentation below. While it provides useful background information and explains some of the terminology/concepts, it doesn't have a straightforward description of the datasets. So I've put together brief descriptions. Let me know if it's still not clear.

WL_Clockstops: This dataset contains all completed referral-to-treatment (RTT) pathways with a “clock stop” date between May 2021 and May 2022. Patients referred for non-emergency consultant-led treatment are on RTT pathways. The “clock start” date is the date of the first referral that starts the pathway. The “clock stop” date is when the patient either: receives treatment; declines treatment; enters a period of active monitoring; no longer requires treatment; or dies. The time spent waiting is the difference between these two dates. A patient may have multiple rows if they have multiple completed RTT pathways; however, there is only one row per unique pathway. Because referral identifiers aren’t necessarily unique between hospitals, unique RTT pathways can be identified using a combination of pseudo_referral_identifier, pseudo_patient_pathway_identifier, pseudo_organisation_code_patient_pathway_identifier_issuer and referral_to_treatment_period_start_date.
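The composite key and the waiting-time calculation described above can be sketched in plain Python. The rows are invented, and `referral_to_treatment_period_end_date` is an assumed name for the clock stop date column; the four key columns are the ones named in the description.

```python
import datetime
from typing import NamedTuple


class ClockStopRow(NamedTuple):
    pseudo_referral_identifier: str
    pseudo_patient_pathway_identifier: str
    pseudo_organisation_code_patient_pathway_identifier_issuer: str
    referral_to_treatment_period_start_date: datetime.date  # clock start
    referral_to_treatment_period_end_date: datetime.date  # clock stop (assumed name)


def pathway_key(row: ClockStopRow) -> tuple:
    """Combine the four columns that together identify a unique RTT pathway."""
    return (
        row.pseudo_referral_identifier,
        row.pseudo_patient_pathway_identifier,
        row.pseudo_organisation_code_patient_pathway_identifier_issuer,
        row.referral_to_treatment_period_start_date,
    )


def waiting_days(row: ClockStopRow) -> int:
    """Time spent waiting: clock stop date minus clock start date, in days."""
    delta = row.referral_to_treatment_period_end_date - row.referral_to_treatment_period_start_date
    return delta.days


# Invented rows: the same referral identifier issued by two different hospitals.
rows = [
    ClockStopRow("R1", "P1", "ORG_A", datetime.date(2021, 3, 1), datetime.date(2021, 6, 9)),
    ClockStopRow("R1", "P1", "ORG_B", datetime.date(2021, 3, 1), datetime.date(2021, 7, 1)),
]

unique_pathways = {pathway_key(row) for row in rows}
print(len(unique_pathways), waiting_days(rows[0]))
```

Using the referral identifier alone would merge the two rows into one pathway; the issuer column keeps them apart.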

WL_OpenPathways: This dataset contains all people on open (incomplete) RTT or not current RTT (non-RTT) pathways as of May 2022. It is a snapshot of everyone still awaiting treatment as of May 2022 (i.e., the clock hasn’t stopped). Patients referred for non-emergency consultant-led treatment are on RTT pathways, while patients referred for non-consultant-led treatment are on non-RTT pathways. For each pathway, there is one row for every week that the patient is still waiting. Because referral identifiers aren’t necessarily unique between hospitals, unique RTT pathways can be identified using a combination of the pseudo_referral_identifier, pseudo_patient_pathway_identifier, pseudo_organisation_code_patient_pathway_identifier_issuer and referral_to_treatment_period_start_date.
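Because there is one row per pathway per week, analyses that want one row per pathway need to collapse the weekly rows. A minimal sketch, assuming invented rows and that keeping the row with the latest `week_ending_date` is the desired behaviour:

```python
import datetime
from typing import Dict, Iterable, NamedTuple


class OpenPathwayRow(NamedTuple):
    pseudo_referral_identifier: str
    pseudo_patient_pathway_identifier: str
    pseudo_organisation_code_patient_pathway_identifier_issuer: str
    referral_to_treatment_period_start_date: datetime.date
    week_ending_date: datetime.date


def latest_row_per_pathway(rows: Iterable[OpenPathwayRow]) -> Dict[tuple, OpenPathwayRow]:
    """Keep, for each pathway key, the row with the most recent week_ending_date."""
    latest: Dict[tuple, OpenPathwayRow] = {}
    for row in rows:
        # The first four fields form the composite pathway key.
        key = tuple(row[:4])
        current = latest.get(key)
        if current is None or row.week_ending_date > current.week_ending_date:
            latest[key] = row
    return latest


start = datetime.date(2022, 1, 10)
# Invented rows: pathway A appears in two weekly snapshots, pathway B in one.
rows = [
    OpenPathwayRow("R1", "P1", "ORG_A", start, datetime.date(2022, 5, 1)),
    OpenPathwayRow("R1", "P1", "ORG_A", start, datetime.date(2022, 5, 8)),
    OpenPathwayRow("R2", "P2", "ORG_A", start, datetime.date(2022, 5, 1)),
]

latest = latest_row_per_pathway(rows)
key_a = ("R1", "P1", "ORG_A", start)
print(len(latest), latest[key_a].week_ending_date)
```

This sketch does not handle missing start dates, which would need a decision of their own in real data.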

More information can be found in the Recording and Reporting RTT guidance here: https://www.england.nhs.uk/statistics/statistical-work-areas/rtt-waiting-times/rtt-guidance/

alschaffer · 2023-10-12T16:47:22Z

Also, for those variables where you provided the possible values from the data dictionary (priority_type_code, waiting_list_type), I note that a not insignificant proportion (~10%) have other values. I don't know if that is something you usually note in the schema.

cloudflare-workers-and-pages · 2023-10-16T11:25:44Z

Deploying with Cloudflare Pages

Latest commit:	`43072a1`
Status:	✅ Deploy successful!
Preview URL:	https://d4361f4e.databuilder.pages.dev
Branch Preview URL:	https://iaindillingham-add-nhse-wl.databuilder.pages.dev


iaindillingham · 2023-10-16T14:43:09Z

> Also, for those variables where you provided the possible values from the data dictionary (priority_type_code, waiting_list_type), I note that a not insignificant proportion (~10%) have other values. I don't know if that is something you usually note in the schema.

Elsewhere, we have linked to short data reports from the schema. Have you produced something similar for your data development work, @alschaffer? As it stands, the proportion with other values will appear as missing (NULL). Have I missed any categories, or do the other values represent data entry (or similar) errors?

alschaffer · 2023-10-16T15:32:40Z

Here is where I've provided a summary of the data holdings and quality: https://docs.google.com/document/d/1Es1ODHeyUdIj6MXgFonZtQ2ncPC0_oaSfN8piGQc0ks/edit

And this folder has some more granular data exploration, including frequency distributions for many variables:
G:\Shared drives\OpenSAFELY\Data\Waiting List Data\Notebook html output

I suspect those other categories are a combination of old codes, local codes, and typos (e.g. reversing the order of letters, inputting numbers instead of letters). If it's not too much trouble, it would be useful to include the other categories, just for the waiting_list_type variable. While they may not end up being important, I'd prefer not to exclude them right off the bat.

For ClockStops, the categories in the data are (in descending order of frequency): ORTT, IRTT, PTLO, PTLI, ONON, RTTO, INON, RTTI.

For OpenPathways, it is: ORTT, ONON, IRTT, INON, PTLO, PTLI, OTHR, PTLN, PLTO, RTTO, PLTI, PTI, PTL0, RTTI, PTL, PTO, PTL1.

(FYI this is from the notebook pdf output in the folder above.) You could use the same categories for both datasets, since it's the same variable.
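Using the same categories for both datasets amounts to taking the union of the two lists above. A small sketch in plain Python, with `merge_categories` as a hypothetical helper; it keeps the order in which values first appear.

```python
# Observed waiting_list_type values, in descending order of frequency,
# copied from the two lists above.
clockstops_categories = ["ORTT", "IRTT", "PTLO", "PTLI", "ONON", "RTTO", "INON", "RTTI"]
openpathways_categories = [
    "ORTT", "ONON", "IRTT", "INON", "PTLO", "PTLI", "OTHR", "PTLN", "PLTO",
    "RTTO", "PLTI", "PTI", "PTL0", "RTTI", "PTL", "PTO", "PTL1",
]


def merge_categories(*category_lists):
    """Return the union of the lists, de-duplicated, in first-seen order."""
    return list(dict.fromkeys(value for values in category_lists for value in values))


shared_categories = merge_categories(clockstops_categories, openpathways_categories)
print(shared_categories)
```

Every ClockStops value also appears in the OpenPathways list, so the union has the same members as the OpenPathways list.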

iaindillingham · 2023-10-16T17:08:58Z

Thanks, @alschaffer. I think there's a slight -- but understandable -- misunderstanding.

I've applied categorical constraints to priority_type_code and waiting_list_type in both wl_clockstops and wl_openpathways (ehrQL tables). These categorical constraints apply only when we generate dummy data.

I've mapped values of PRIORITY_TYPE_CODE in WL_ClockStops and WL_OpenPathways (database tables). These mappings apply when we extract real data. The value that's extracted is the value in the mapping, or NULL.

I've not mapped values of Waiting_List_Type in either WL_ClockStops or WL_OpenPathways. The value that's extracted is the value in the column.
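The difference between the two columns at extraction time can be sketched as follows. The mapping contents are hypothetical and chosen only for illustration; the real mapping lives in the backend SQL.

```python
from typing import Optional

# Hypothetical mapping, for illustration only.
PRIORITY_TYPE_CODE_MAPPING = {
    "1": "routine",
    "2": "urgent",
    "3": "two week wait",
}


def extract_priority_type_code(raw: Optional[str]) -> Optional[str]:
    """Mapped column: return the mapped value, or None (NULL) if there is no mapping."""
    if raw is None:
        return None
    return PRIORITY_TYPE_CODE_MAPPING.get(raw)


def extract_waiting_list_type(raw: Optional[str]) -> Optional[str]:
    """Unmapped column: return whatever is in the database column."""
    return raw


print(extract_priority_type_code("2"))     # a value covered by the mapping
print(extract_priority_type_code("XX"))    # a value outside the mapping becomes None
print(extract_waiting_list_type("PLTO"))   # passed through unchanged
```

The categorical constraints are a separate matter: they shape the dummy data and play no part in either function.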

sebbacon · 2023-10-17T09:15:10Z

> The categorical constraints on priority_type_code and waiting_list_type are especially useful, as they ensure that the dummy data are more realistic. As @alschaffer notes, however, in the real tables around 10% of values in each of these columns do not match the categorical constraints. Consequently, these values will be missing (NULL) in a real dataset.

Just checking my understanding: does this mean we're effectively dropping 10% of non-valid data for these fields in the non-raw tables?

iaindillingham · 2023-10-17T09:45:20Z

> Just checking my understanding: does this mean we're effectively dropping 10% of non-valid data for these fields in the non-raw tables?

We're not dropping anything. Values of priority_type_code are mapped, because most values of PRIORITY_TYPE_CODE either obviously conform to the data dictionary or don't. Values of waiting_list_type are not mapped, because it's less clear whether they conform or don't.

alschaffer · 2023-10-17T09:49:58Z

Thanks, Iain, I get it now. I forgot that the values in PRIORITY_TYPE_CODE needed to be mapped. And most of the "non-conforming" values in that variable are not interpretable anyway.

ehrql/backends/tpp.py

rebkwok · 2023-10-19T16:28:05Z

ehrql/tables/beta/tpp.py

+    )
+    week_ending_date = Series(
+        datetime.date,
+        description="The Sunday of the week that the pathway relates to",


I haven't looked at the tables in a while, but when I did previously, there were instances of this date being empty, or in the future. It should never be in the future afaik, since it should be the date the data was uploaded. Should there be some sort of note to the user in the docs?

Just checked again; there are no empty week_ending_dates now, but there are 23,554 future ones in the wl_openpathways table, and 1816 in the wl_clockstops table ("future" being after this Sunday, in case the date entered is the date of the next Sunday rather than the one just gone).
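The definition of "future" used for these counts can be sketched in plain Python. The function names are hypothetical; the rule is that a week_ending_date counts as future only if it falls after the Sunday that ends the current week.

```python
import datetime


def upcoming_sunday(today: datetime.date) -> datetime.date:
    """Return today if it is a Sunday, otherwise the next Sunday."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    return today + datetime.timedelta(days=6 - today.weekday())


def is_future_week_ending(week_ending_date: datetime.date, today: datetime.date) -> bool:
    """True if the date falls after the Sunday that ends the current week."""
    return week_ending_date > upcoming_sunday(today)


today = datetime.date(2023, 10, 19)  # a Thursday
print(upcoming_sunday(today))
print(is_future_week_ending(datetime.date(2023, 10, 22), today))
print(is_future_week_ending(datetime.date(2023, 10, 29), today))
```

Allowing the coming Sunday avoids flagging rows where the date entered is the next Sunday instead of the one just gone.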

iaindillingham · 2023-10-23

> Should there be some sort of note to the user in the docs?

On balance, I think we shouldn't put this kind of information in the docs, as it may become outdated without us knowing; the empty values in the Week_Ending_Date column are a good example. I know @alschaffer has investigated these tables. Is there a plan to publish those investigations, Andrea?

tests/integration/backends/test_tpp.py

@alschaffer

Thanks for documenting the ehrQL table, @alschaffer.


iaindillingham linked an issue Oct 12, 2023 that may be closed by this pull request

Add WL_Clockstops and WL_OpenPathways #1646

Closed

3 tasks

iaindillingham force-pushed the iaindillingham/add-nhse-wl branch from d981436 to 39cfcc9 Compare October 16, 2023 11:24

github-actions bot deployed to databuilder-docs (Preview) October 16, 2023 11:25 View deployment

iaindillingham force-pushed the iaindillingham/add-nhse-wl branch from 39cfcc9 to d9c3348 Compare October 16, 2023 14:39

github-actions bot deployed to databuilder-docs (Preview) October 16, 2023 14:40 View deployment

iaindillingham marked this pull request as ready for review October 16, 2023 14:49

iaindillingham force-pushed the iaindillingham/add-nhse-wl branch from d9c3348 to 36ab8b0 Compare October 16, 2023 17:09

github-actions bot deployed to databuilder-docs (Preview) October 16, 2023 17:10 View deployment

iaindillingham mentioned this pull request Oct 17, 2023

Restrict access to Clock Stops and Open Pathways opensafely-core/opensafely-cli#223

Merged

rebkwok approved these changes Oct 19, 2023


iaindillingham force-pushed the iaindillingham/add-nhse-wl branch from 36ab8b0 to c466774 Compare October 23, 2023 09:17

github-actions bot deployed to databuilder-docs (Preview) October 23, 2023 09:19 View deployment

iaindillingham added 5 commits October 23, 2023 11:13

feat: Add wl_clockstops_raw

56c37b2

feat: Add wl_clockstops

39d9e81

Thanks for documenting the ehrQL table, @alschaffer.

feat: Add wl_openpathways_raw

7c5801c

feat: Add wl_openpathways

479d145

Thanks for documenting the ehrQL table, @alschaffer.

Generate docs

43072a1

iaindillingham force-pushed the iaindillingham/add-nhse-wl branch from c466774 to 43072a1 Compare October 23, 2023 10:30

github-actions bot deployed to databuilder-docs (Preview) October 23, 2023 10:31 View deployment

iaindillingham merged commit 941a9ae into main Oct 23, 2023

iaindillingham deleted the iaindillingham/add-nhse-wl branch October 23, 2023 10:46
