-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: don't deserialize and reserialize json for no reason #157
Conversation
@@ -177,22 +175,6 @@ def warn_mode(): | |||
logging.getLogger().setLevel(logging.WARN) | |||
|
|||
|
|||
def error_fhir(fhir_resource): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two methods (error_fhir
and print_json
) were not used anywhere.
return FHIRReference({"reference": f"{resource_type}/{resource_id}"}) | ||
|
||
|
||
def ref_subject(subject_id: str) -> FHIRReference: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref_subject
and ref_encounter
are moved to i2b2 code, the only place that uses them. They still return a real FHIRReference
for convenience.
############################################################################### | ||
|
||
|
||
def fhir_concept(text: str, coded: List[Coding], extension=None) -> CodeableConcept: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of the methods here and below were either not used anywhere (removed) or only used in i2b2.transform
(moved there).
except AnalysisException: | ||
return # if the table doesn't exist because we didn't write anything, that's fine - just bail | ||
|
||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated change, just something I noticed the other day -- you'd get a scary exception message if there was no table yet but also no input data.
cumulus/loaders/fhir/fhir_ndjson.py
Outdated
try: | ||
self.root.get(self.root.joinpath(f"*{resource}*.ndjson"), f"{tmpdir.name}/") | ||
except FileNotFoundError: | ||
logging.warning(f"No resources found for {resource}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same unrelated change -- be more graceful about lacking some input files.
@@ -1,67 +0,0 @@ | |||
"""Support for cohorts""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is not used anywhere and would need to be rewritten if it were to be revived anyway. This is some pre-Mike code that has had the world change around it.
I'm removing it rather than fix its use of fhirclient
.
yield self.resource(jsondict=json.loads(line), strict=False) # pylint: disable=not-callable | ||
yield json.loads(line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the key line in this PR, which stops creating fhirclient
objects.
This makes the core ETL loop much faster for basic FHIR tables. We previously were reading the de-identified input ndjson, turning it into fhirclient resource objects, scrubbing it, then writing it back as json. The value of the fhirclient step was validation and ease of use. But the MS deid tool validates for us. And we can live with loss of nice typed objects if it means huge speed wins. And they do seem to be huge. Quick napkin math looks like we use 10% of the CPU and 25% of the wall clock time as before. (Of the core loop -- not counting the MS deid time.) Which might take us from a CPU-bound profile to an IO-bound profile. This removes all uses of fhirclient except for the i2b2 code (which would benefit from a similar change, if we cared about speeding that up) and the bulk exporter (for some oauth code).
19387b5
to
e4965da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩ ⏩
Description
This makes the core ETL loop much faster for basic FHIR tables.
We previously were reading the de-identified input ndjson, turning it into fhirclient resource objects, scrubbing it, then writing it back as json.
The value of the fhirclient step was validation and ease of use. But the MS deid tool validates for us. And we can live with loss of nice typed objects if it means huge speed wins.
And they do seem to be huge. Quick napkin math looks like we use 10% of the CPU and 25% of the wall clock time as before. (Of the core loop -- not counting the MS deid time.)
Which might take us from a CPU-bound profile to an IO-bound profile.
This removes all uses of fhirclient except for the i2b2 code (which would benefit from a similar change, if we cared about speeding that up) and the bulk exporter (for some oauth code).
Checklist
docs/
) needs to be updated