-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explore alternative ways of populating data object download counts #1548
Comments
Initial investigation - injecting download countsSince accessing a In order to do this, I modified the class DataObject(BaseModel):
...
_download_count = None
@hybrid_property
downloads(self):
if self_download_count is None:
return len(self.download_entities) + len(self.bulk_download_entities)
return self._download_count This allows us to avoid running queries when accessing ...which we can accomplish by hooking into the query results BEFORE serialization (which is when I believe some code in FastAPI is eventually accessing def results_with_counts(db: Session, biosample_results):
"""
Hydrate paginated biosample results with data object download counts.
This consolidates counting downloads into a single database query, rather than
two queries for each data object included in the results (one for file_downloads, and
another for bulk_downloads).
"""
data_object_download_counts = {}
biosamples = biosample_results["results"]
for b in biosamples:
for op in b.omics_processing:
for da in op.outputs:
data_object_download_counts[da.id] = 0
for od in op.omics_data:
for da in od.outputs:
data_object_download_counts[da.id] = 0
counts = crud.get_data_object_counts(db, list(data_object_download_counts))
for row in counts:
data_object_download_counts[row[0]] = row[1]
for b in biosamples:
for op in b.omics_processing:
for da in op.outputs:
da._download_count = data_object_download_counts[da.id]
for od in op.omics_data:
for da in od.outputs:
da._download_count = data_object_download_counts[da.id]
return biosample_results Where Some validation of this approach can be gained by looking at the FastAPI debug toolbar, and gathering data to back it up. The FastAPI debug toolbar reports some information about SQLAlchemy, for example, when running a biosample search through the swagger API WITHOUT the above changes, it reports this: With the changes, we see a significant drop in both number of queries run and time spent doing SQLAlchemy stuff: Here are some metrics recorded using the following command: { time curl -X POST "http://localhost:8080/api/biosample/search?offset=0&limit=15" > /dev/null; } 2>&1 This offset/limit combination was chosen because it is the default when loading the Data Portal's home page. Times are in milliseconds.
This shows that the changes do have a significant impact on response time for the biosample search endpoint. It's still fairly slow, and issues more requests to the database than it likely needs to, but if we take an incremental approach to performance improvement I believe this is a good first step. |
Currently, data download object counts add a seemingly enormous overhead to the biosample search. This is likely due to how download counts are not actually stored in the database, but computed at lookup time by counting rows in the
bulk_download
andfile_download
tables.Since the results of the biosample search endpoint include this information, this lookup is done for each data object included in the result of a biosample search.
Possible things to experiment with/investigate include:
The text was updated successfully, but these errors were encountered: