Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update stats endpoint to return quantity of non-parent studies #1509

Open
eecavanna opened this issue Jan 11, 2025 · 4 comments
Open

Update stats endpoint to return quantity of non-parent studies #1509

eecavanna opened this issue Jan 11, 2025 · 4 comments
Assignees
Labels
type: feature New feature or request X SMALL Less than 1 day

Comments

@eecavanna
Copy link
Collaborator

eecavanna commented Jan 11, 2025

Here's an example API response from https://data.microbiomedata.org/api/stats:

{
  "studies": 32,
  "locations": 33,
  "habitats": 41,
  "data_size": 70025040913105,
  "metagenomes": 4117,
  "metatranscriptomes": 245,
  "proteomics": 52,
  "metabolomics": 34,
  "lipodomics": 44,
  "organic_matter_characterization": 2547
}

The studies number currently accounts for all studies. It does not distinguish between so-called "parent" and "non-parent" studies, where the former are studies whose id values appear in the part_of column of a study, and the latter are studies whose id values do not appear in the part_of column of any studies.

I discussed this distinguishing characteristic (i.e. whether a study is a parent study versus a non-parent study) with several team members this week, while trying to gather metrics for a report/infographic. @aclum wants this endpoint to return the number of "non-parent" studies.

My proposal is to add a new field to this object, named non_parent_studies, whose value is the quantity of non-parent studies. That is as opposed to repurposing the existing studies field.

In the comments below, I will post a Postgres query that can be used to determine that quantity (off the top of my head, I don't know what it would look like as a SQLAlchemy statement).

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jan 11, 2025

Here are two versions of the aforementioned query. One of them works with the 1/1/2024 dump and the other works with the 1/1/2025 dump. They are labeled accordingly.

Get the number of studies that are not parent studies ("2025 version")

-- Get the number of studies that are not parent studies ("2025 version").
--
-- Note: This query is compatible with the nightly Postgres dump from 1/1/2025,
--       where the `study` table has a `part_of` column. It is not compatible
--       with the nightly Postgres dump from 1/1/2024 (a year earlier),
--       where the `study` table _does not yet_ have a `part_of` column.
--
SELECT count(*) AS "Number of studies (non-parent only)"
FROM study 
WHERE id NOT IN (
  -- Get the distinct IDs (as text) of all the parent studies.
  SELECT DISTINCT parent_study_id_as_text
  FROM study,
       jsonb_array_elements_text(part_of) AS parent_study_id_as_text
  WHERE jsonb_typeof(part_of) = 'array'
);

Get the number of studies that are not parent studies ("2024 version")

-- Get the number of studies that are not parent studies ("2024 version").
--
-- Note: This query is compatible with the nightly Postgres dump from 1/1/2024,
--       where the `study` table did not yet have a `part_of` column. It is not
--       compatible with the nightly Postgres dump from 1/1/2025 (a year later),
--       where the `study` table _does_ have a `part_of` column.
--
SELECT count(*) AS "Number of studies (non-parent only)"
FROM study 
WHERE id NOT IN (
  -- Get the distinct IDs (as text) of all the parent studies.
  SELECT DISTINCT parent_study_id_as_text
  FROM study,
       jsonb_array_elements_text(annotations->'part_of') AS parent_study_id_as_text
  WHERE jsonb_typeof(annotations) = 'object'
    AND jsonb_typeof(annotations->'part_of') = 'array'
);

@ssarrafan
Copy link

This is not assigned to anyone and hasn't been updated so I'm removing from sprint and adding to the backlog. @eecavanna if you need this to be prioritized please add to the agenda for next week's infra sync meeting.

@ssarrafan ssarrafan added the backlog Issues not assigned to a sprint or not finished during a sprint. Needs to be reprioritized. label Jan 24, 2025
@ssarrafan
Copy link

This is not assigned to anyone and hasn't been updated so I'm removing from sprint and adding to the backlog again.

@eecavanna eecavanna self-assigned this Feb 22, 2025
@eecavanna eecavanna removed the backlog Issues not assigned to a sprint or not finished during a sprint. Needs to be reprioritized. label Feb 22, 2025
@eecavanna
Copy link
Collaborator Author

Oops! I forgot about this ticket. I will assign it to myself and implement it in the sprint after next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: feature New feature or request X SMALL Less than 1 day
Projects
Status: No status
Development

No branches or pull requests

2 participants