Explore alternative ways of populating data object download counts #1548

Open
naglepuff opened this issue Feb 19, 2025 · 1 comment · May be fixed by #1558
@naglepuff
Collaborator

Currently, data object download counts add a seemingly enormous overhead to the biosample search. This is likely because download counts are not actually stored in the database, but are computed at lookup time by counting rows in the bulk_download and file_download tables.

Since the results of the biosample search endpoint include this information, this lookup is performed for each data object included in the search results.

Possible things to experiment with/investigate include:

  • Looking at indices (can we speed up the count queries?)
  • Materialized views (can we store integer counts for all data objects? How important is consistency? Can we tolerate differences between actual download counts and those displayed on the Data Portal?)
  • Reducing the number of queries (can we rewrite the biosample search so that only one query is run, hydrating the response with ALL relevant download counts?)
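On the first option, a minimal sketch of what an index on the count-query column might look like. Note that the table and column names here (`file_downloads`, `data_object_id`, and the index name) are assumptions based on the issue description, not the actual schema:

```python
from sqlalchemy import Column, Index, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class FileDownload(Base):
    """Hypothetical model for the file download table."""
    __tablename__ = "file_downloads"
    id = Column(Integer, primary_key=True)
    data_object_id = Column(String, nullable=False)
    # Index the column the per-object count queries filter on, so counting
    # rows for a single data object does not require a full table scan.
    __table_args__ = (
        Index("ix_file_downloads_data_object_id", "data_object_id"),
    )

# Demonstrate against an in-memory SQLite database.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

# Confirm the index was created.
indexes = inspect(engine).get_indexes("file_downloads")
print([ix["name"] for ix in indexes])
```

Whether this actually helps depends on the existing schema; if the column is already indexed (e.g. as part of a foreign key index), the bottleneck is more likely the sheer number of queries.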
@naglepuff
Collaborator Author

Initial investigation - injecting download counts

Accessing a DataObject's .downloads property runs 2 database queries, so each biosample search adds 2 * n queries, where n is the number of DataObjects associated with the returned biosamples. It therefore seemed promising to find a way to reduce the number of these queries. Additionally, profiling with the FastAPI debug toolbar indicated that these lookups were taking a lot of time.

In order to do this, I modified the downloads property of the DataObject model like so:

class DataObject(BaseModel):
    ...
    _download_count = None

    @hybrid_property
    def downloads(self):
        if self._download_count is None:
            # Fall back to the expensive path: two relationship loads.
            return len(self.download_entities) + len(self.bulk_download_entities)
        return self._download_count
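Stripped of SQLAlchemy, the pattern above is just a cached-value fallback: use the pre-populated count when one is available, and only compute from the relationships otherwise. A minimal framework-free sketch (the class is illustrative, not the actual model):

```python
class DataObjectSketch:
    """Illustrates the cached-count fallback used by the downloads property."""

    def __init__(self, download_entities, bulk_download_entities):
        self.download_entities = download_entities
        self.bulk_download_entities = bulk_download_entities
        # Populated externally to skip the expensive path.
        self._download_count = None

    @property
    def downloads(self):
        if self._download_count is None:
            # Expensive path: in the real model this triggers two DB queries.
            return len(self.download_entities) + len(self.bulk_download_entities)
        return self._download_count

obj = DataObjectSketch(["d1", "d2"], ["b1"])
print(obj.downloads)      # no cached value: counts the entities
obj._download_count = 42  # pre-populated, e.g. by a hydration step
print(obj.downloads)      # cached value wins
```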

This allows us to avoid running queries when accessing data_object.downloads as long as we populate data_object._download_count beforehand...

...which we can accomplish by hooking into the query results BEFORE serialization (which is, I believe, when some code in FastAPI eventually accesses .downloads on DataObjects), like so:

def results_with_counts(db: Session, biosample_results):
    """
    Hydrate paginated biosample results with data object download counts.

    This consolidates counting downloads into a single database query, rather than
    two queries for each data object included in the results (one for file_downloads, and
    another for bulk_downloads).
    """
    data_object_download_counts = {}
    biosamples = biosample_results["results"]
    for b in biosamples:
        for op in b.omics_processing:
            for da in op.outputs:
                data_object_download_counts[da.id] = 0
            for od in op.omics_data:
                for da in od.outputs:
                    data_object_download_counts[da.id] = 0
    counts = crud.get_data_object_counts(db, list(data_object_download_counts))
    for row in counts:
        data_object_download_counts[row[0]] = row[1]
    for b in biosamples:
        for op in b.omics_processing:
            for da in op.outputs:
                da._download_count = data_object_download_counts[da.id]
            for od in op.omics_data:
                for da in od.outputs:
                    da._download_count = data_object_download_counts[da.id]
    return biosample_results

Where crud.get_data_object_counts is a function that uses SQLAlchemy to build a single query to compute the download counts for all data objects.
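The issue doesn't show crud.get_data_object_counts itself; a plausible sketch of such a function, consolidating both tables into one grouped query via UNION ALL (table and column names are assumptions, and the real schema may differ):

```python
from sqlalchemy import (Column, Integer, String, create_engine, func,
                        select, union_all)
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class FileDownload(Base):
    __tablename__ = "file_download"
    id = Column(Integer, primary_key=True)
    data_object_id = Column(String, nullable=False)

class BulkDownload(Base):
    __tablename__ = "bulk_download"
    id = Column(Integer, primary_key=True)
    data_object_id = Column(String, nullable=False)

def get_data_object_counts(db, data_object_ids):
    """Return (data_object_id, count) pairs in a single database round trip."""
    # UNION ALL the id columns from both download tables, then GROUP BY
    # once, so every count comes back from one query.
    downloads = union_all(
        select(FileDownload.data_object_id.label("id")).where(
            FileDownload.data_object_id.in_(data_object_ids)),
        select(BulkDownload.data_object_id.label("id")).where(
            BulkDownload.data_object_id.in_(data_object_ids)),
    ).subquery()
    stmt = select(downloads.c.id, func.count()).group_by(downloads.c.id)
    return db.execute(stmt).all()

# Demonstrate against an in-memory SQLite database.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
with Session(engine) as db:
    db.add_all([FileDownload(data_object_id="a"),
                FileDownload(data_object_id="a"),
                BulkDownload(data_object_id="a"),
                BulkDownload(data_object_id="b")])
    db.commit()
    counts = dict(get_data_object_counts(db, ["a", "b"]))
    print(counts)  # {'a': 3, 'b': 1}
```

Note that data objects with zero downloads simply don't appear in the result, which is why the hydration code above initializes every id to 0 first.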

We can validate this approach by looking at the FastAPI debug toolbar and by gathering timing data to back it up.

The FastAPI debug toolbar reports some information about SQLAlchemy. For example, when running a biosample search through the Swagger UI WITHOUT the above changes, it reports this:

[Image: FastAPI debug toolbar output without the changes]

With the changes, we see a significant drop in both the number of queries run and the time spent in SQLAlchemy:

[Image: FastAPI debug toolbar output with the changes]

Here are some metrics recorded using the following command:

{ time curl -X POST "http://localhost:8080/api/biosample/search?offset=0&limit=15" > /dev/null; } 2>&1

This offset/limit combination was chosen because it is the default when loading the Data Portal's home page. Times are in milliseconds.

| Change? | Request 1 | Request 2 | Request 3 | Request 4 | Request 5 | Avg. Response Time |
|---------|-----------|-----------|-----------|-----------|-----------|--------------------|
| Yes     | 707       | 563       | 527       | 523       | 521       | 568                |
| No      | 1241      | 925       | 906       | 962       | 1095      | 1028               |
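Using the average response times reported in the table, the relative improvement works out to roughly a 45% reduction:

```python
# Average response times (ms) from the table above.
avg_with, avg_without = 568, 1028
reduction = (avg_without - avg_with) / avg_without
print(f"{reduction:.0%}")  # ~45% faster on average
```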

This shows that the changes have a significant impact on response time for the biosample search endpoint. The endpoint is still fairly slow, and it issues more database queries than it likely needs to, but if we take an incremental approach to performance improvement, I believe this is a good first step.
