Explore alternative ways of populating data object download counts #1548

Open
naglepuff opened this issue Feb 19, 2025 · 1 comment · May be fixed by #1558
@naglepuff
Collaborator

Currently, data object download counts add a seemingly enormous overhead to the biosample search. This is likely because download counts are not actually stored in the database, but are computed at lookup time by counting rows in the bulk_download and file_download tables.

Since the results of the biosample search endpoint include this information, this lookup is performed for each data object included in the search results.

Possible things to experiment with/investigate include:

  • Looking at indices (can we speed up the count queries?)
  • Materialized views (can we store integer counts for all data objects? How important is consistency? Can we tolerate differences between actual download counts and those displayed on the Data Portal?)
  • Reducing the number of queries (can we rewrite the biosample search so that only one query is run, hydrating the response with ALL relevant download counts?)
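On the first option, a minimal sketch of what an index on the count-query column might look like. Note that the table and column names here (`file_downloads`, `data_object_id`, and the index name) are assumptions based on the issue description, not the actual schema:

```python
from sqlalchemy import Column, Index, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class FileDownload(Base):
    """Hypothetical model for the file download table."""
    __tablename__ = "file_downloads"
    id = Column(Integer, primary_key=True)
    data_object_id = Column(String, nullable=False)
    # Index the column the per-object count queries filter on, so counting
    # rows for a single data object does not require a full table scan.
    __table_args__ = (
        Index("ix_file_downloads_data_object_id", "data_object_id"),
    )

# Demonstrate against an in-memory SQLite database.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

# Confirm the index was created.
indexes = inspect(engine).get_indexes("file_downloads")
print([ix["name"] for ix in indexes])
```

Whether this actually helps depends on the existing schema; if the column is already indexed (e.g. as part of a foreign key index), the bottleneck is more likely the sheer number of queries.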
@naglepuff
Collaborator Author

Initial investigation - injecting download counts

Accessing a DataObject's .downloads property runs 2 database queries, so each biosample search adds 2 * n queries, where n is the number of DataObjects associated with the returned biosamples. It therefore seemed promising to find a way to reduce the number of these queries. Additionally, profiling with the FastAPI debug toolbar indicated that these lookups were taking a lot of time.

In order to do this, I modified the downloads property of the DataObject model like so:

class DataObject(BaseModel):
    ...
    _download_count = None

    @hybrid_property
    def downloads(self):
        if self._download_count is None:
            # Fall back to the expensive path: two relationship loads.
            return len(self.download_entities) + len(self.bulk_download_entities)
        return self._download_count
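Stripped of SQLAlchemy, the pattern above is just a cached-value fallback: use the pre-populated count when one is available, and only compute from the relationships otherwise. A minimal framework-free sketch (the class is illustrative, not the actual model):

```python
class DataObjectSketch:
    """Illustrates the cached-count fallback used by the downloads property."""

    def __init__(self, download_entities, bulk_download_entities):
        self.download_entities = download_entities
        self.bulk_download_entities = bulk_download_entities
        # Populated externally to skip the expensive path.
        self._download_count = None

    @property
    def downloads(self):
        if self._download_count is None:
            # Expensive path: in the real model this triggers two DB queries.
            return len(self.download_entities) + len(self.bulk_download_entities)
        return self._download_count

obj = DataObjectSketch(["d1", "d2"], ["b1"])
print(obj.downloads)      # no cached value: counts the entities
obj._download_count = 42  # pre-populated, e.g. by a hydration step
print(obj.downloads)      # cached value wins
```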

This allows us to avoid running queries when accessing data_object.downloads as long as we populate data_object._download_count beforehand...

...which we can accomplish by hooking into the query results BEFORE serialization (which is, I believe, when some code in FastAPI eventually accesses .downloads on DataObjects), like so:

def results_with_counts(db: Session, biosample_results):
    """
    Hydrate paginated biosample results with data object download counts.

    This consolidates counting downloads into a single database query, rather than
    two queries for each data object included in the results (one for file_downloads, and
    another for bulk_downloads).
    """
    data_object_download_counts = {}
    biosamples = biosample_results["results"]
    for b in biosamples:
        for op in b.omics_processing:
            for da in op.outputs:
                data_object_download_counts[da.id] = 0
            for od in op.omics_data:
                for da in od.outputs:
                    data_object_download_counts[da.id] = 0
    counts = crud.get_data_object_counts(db, list(data_object_download_counts))
    for row in counts:
        data_object_download_counts[row[0]] = row[1]
    for b in biosamples:
        for op in b.omics_processing:
            for da in op.outputs:
                da._download_count = data_object_download_counts[da.id]
            for od in op.omics_data:
                for da in od.outputs:
                    da._download_count = data_object_download_counts[da.id]
    return biosample_results

Where crud.get_data_object_counts is a function that uses SQLAlchemy to build a single query to compute the download counts for all data objects.
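The issue doesn't show crud.get_data_object_counts itself; a plausible sketch of such a function, consolidating both tables into one grouped query via UNION ALL (table and column names are assumptions, and the real schema may differ):

```python
from sqlalchemy import (Column, Integer, String, create_engine, func,
                        select, union_all)
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class FileDownload(Base):
    __tablename__ = "file_download"
    id = Column(Integer, primary_key=True)
    data_object_id = Column(String, nullable=False)

class BulkDownload(Base):
    __tablename__ = "bulk_download"
    id = Column(Integer, primary_key=True)
    data_object_id = Column(String, nullable=False)

def get_data_object_counts(db, data_object_ids):
    """Return (data_object_id, count) pairs in a single database round trip."""
    # UNION ALL the id columns from both download tables, then GROUP BY
    # once, so every count comes back from one query.
    downloads = union_all(
        select(FileDownload.data_object_id.label("id")).where(
            FileDownload.data_object_id.in_(data_object_ids)),
        select(BulkDownload.data_object_id.label("id")).where(
            BulkDownload.data_object_id.in_(data_object_ids)),
    ).subquery()
    stmt = select(downloads.c.id, func.count()).group_by(downloads.c.id)
    return db.execute(stmt).all()

# Demonstrate against an in-memory SQLite database.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with Session(engine) as db:
    db.add_all([FileDownload(data_object_id="a"),
                FileDownload(data_object_id="a"),
                BulkDownload(data_object_id="a"),
                BulkDownload(data_object_id="b")])
    db.commit()
    counts = dict(get_data_object_counts(db, ["a", "b"]))
    print(counts)  # {'a': 3, 'b': 1}
```

Note that data objects with zero downloads simply don't appear in the result, which is why the hydration code above initializes every id to 0 first.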

We can validate this approach by looking at the FastAPI debug toolbar and by gathering timing data to back it up.

The FastAPI debug toolbar reports some information about SQLAlchemy. For example, when running a biosample search through the Swagger UI WITHOUT the above changes, it reports this:

[Image: FastAPI debug toolbar output without the changes]

With the changes, we see a significant drop in both the number of queries run and the time spent in SQLAlchemy:

[Image: FastAPI debug toolbar output with the changes]

Here are some metrics recorded using the following command:

{ time curl -X POST "http://localhost:8080/api/biosample/search?offset=0&limit=15" > /dev/null; } 2>&1

This offset/limit combination was chosen because it is the default when loading the Data Portal's home page. Times are in milliseconds.

| Change? | Request 1 | Request 2 | Request 3 | Request 4 | Request 5 | Avg. Response Time |
|---------|-----------|-----------|-----------|-----------|-----------|--------------------|
| Yes     | 707       | 563       | 527       | 523       | 521       | 568                |
| No      | 1241      | 925       | 906       | 962       | 1095      | 1028               |
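Using the average response times reported in the table, the relative improvement works out to roughly a 45% reduction:

```python
# Average response times (ms) from the table above.
avg_with, avg_without = 568, 1028
reduction = (avg_without - avg_with) / avg_without
print(f"{reduction:.0%}")  # ~45% faster on average
```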

This shows that the changes have a significant impact on response time for the biosample search endpoint. The endpoint is still fairly slow, and it issues more database queries than it likely needs to, but if we take an incremental approach to performance improvement, I believe this is a good first step.
