Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't deserialize and reserialize json for no reason #157

Merged
merged 1 commit into from
Jan 26, 2023

Conversation

mikix
Copy link
Contributor

@mikix mikix commented Jan 25, 2023

Description

This makes the core ETL loop much faster for basic FHIR tables.

We previously were reading the de-identified input ndjson, turning it into fhirclient resource objects, scrubbing it, then writing it back as json.

The value of the fhirclient step was validation and ease of use. But the MS deid tool validates for us. And we can live with loss of nice typed objects if it means huge speed wins.

And they do seem to be huge. Quick napkin math looks like we use 10% of the CPU and 25% of the wall clock time as before. (Of the core loop -- not counting the MS deid time.)

Which might take us from a CPU-bound profile to an IO-bound profile.

This removes all uses of fhirclient except for the i2b2 code (which would benefit from a similar change, if we cared about speeding that up) and the bulk exporter (for some oauth code).

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

@@ -177,22 +175,6 @@ def warn_mode():
logging.getLogger().setLevel(logging.WARN)


def error_fhir(fhir_resource):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two methods (error_fhir and print_json) were not used anywhere.

return FHIRReference({"reference": f"{resource_type}/{resource_id}"})


def ref_subject(subject_id: str) -> FHIRReference:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref_subject and ref_encounter are moved to i2b2 code, the only place that uses them. They still return a real FHIRReference for convenience.

###############################################################################


def fhir_concept(text: str, coded: List[Coding], extension=None) -> CodeableConcept:
Copy link
Contributor Author

@mikix mikix Jan 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of the methods here and below were either not used anywhere (removed) or only used in i2b2.transform (moved there).

Comment on lines +142 to +145
except AnalysisException:
return # if the table doesn't exist because we didn't write anything, that's fine - just bail

try:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated change, just something I noticed the other day -- you'd get a scary exception message if there was no table yet but also no input data.

Comment on lines 51 to 54
try:
self.root.get(self.root.joinpath(f"*{resource}*.ndjson"), f"{tmpdir.name}/")
except FileNotFoundError:
logging.warning(f"No resources found for {resource}")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same unrelated change -- be more graceful about lacking some input files.

@@ -1,67 +0,0 @@
"""Support for cohorts"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is not used anywhere and would need to be rewritten if it were to be revived anyway. This is some pre-Mike code that has had the world change around it.

I'm removing it rather than fix its use of fhirclient.

Comment on lines -251 to +241
yield self.resource(jsondict=json.loads(line), strict=False) # pylint: disable=not-callable
yield json.loads(line)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key line in this PR, which stops creating fhirclient objects.

This makes the core ETL loop much faster for basic FHIR tables.

We previously were reading the de-identified input ndjson,
turning it into fhirclient resource objects, scrubbing it, then
writing it back as json.

The value of the fhirclient step was validation and ease of use.
But the MS deid tool validates for us. And we can live with loss
of nice typed objects if it means huge speed wins.

And they do seem to be huge. Quick napkin math looks like we use 10%
of the CPU and 25% of the wall clock time as before. (Of the core
loop -- not counting the MS deid time.)

Which might take us from a CPU-bound profile to an IO-bound profile.

This removes all uses of fhirclient except for the i2b2 code (which
would benefit from a similar change, if we cared about speeding that
up) and the bulk exporter (for some oauth code).
@mikix mikix force-pushed the mikix/less-fhirclient branch from 19387b5 to e4965da Compare January 25, 2023 20:17
@mikix mikix changed the title WIP: fix: don't deserialize and reserialize json for no reason fix: don't deserialize and reserialize json for no reason Jan 25, 2023
Copy link
Contributor

@dogversioning dogversioning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩

@mikix mikix merged commit 54da18e into main Jan 26, 2023
@mikix mikix deleted the mikix/less-fhirclient branch January 26, 2023 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants