Update stats endpoint to return quantity of non-parent studies #1509

eecavanna · 2025-01-11T00:12:06Z

Here's an example API response from https://data.microbiomedata.org/api/stats:

{
  "studies": 32,
  "locations": 33,
  "habitats": 41,
  "data_size": 70025040913105,
  "metagenomes": 4117,
  "metatranscriptomes": 245,
  "proteomics": 52,
  "metabolomics": 34,
  "lipodomics": 44,
  "organic_matter_characterization": 2547
}

The studies number currently accounts for all studies. It does not distinguish between so-called "parent" and "non-parent" studies, where the former are studies whose id values appear in the part_of column of a study, and the latter are studies whose id values do not appear in the part_of column of any studies.

I discussed this distinguishing characteristic (i.e. whether a study is a parent study versus a non-parent study) with several team members this week, while trying to gather metrics for a report/infographic. @aclum wants this endpoint to return the number of "non-parent" studies.

My proposal is to add a new field to this object, named non_parent_studies, whose value is the quantity of non-parent studies. That is as opposed to repurposing the existing studies field.

In the comments below, I will post a Postgres query that can be used to determine that quantity (off the top of my head, I don't know what it would look like as a SQLAlchemy statement).

The text was updated successfully, but these errors were encountered:

eecavanna · 2025-01-11T00:43:14Z

Here are two versions of the aforementioned query. One of them works with the 1/1/2024 dump and the other works with the 1/1/2025 dump. They are labeled accordingly.

Get the number of studies that are not parent studies ("2025 version")

-- Get the number of studies that are not parent studies ("2025 version").
--
-- Note: This query is compatible with the nightly Postgres dump from 1/1/2025,
--       where the `study` table has a `part_of` column. It is not compatible
--       with the nightly Postgres dump from 1/1/2024 (a year earlier),
--       where the `study` table _does not yet_ have a `part_of` column.
--
SELECT count(*) AS "Number of studies (non-parent only)"
FROM study 
WHERE id NOT IN (
  -- Get the distinct IDs (as text) of all the parent studies.
  SELECT DISTINCT parent_study_id_as_text
  FROM study,
       jsonb_array_elements_text(part_of) AS parent_study_id_as_text
  WHERE jsonb_typeof(part_of) = 'array'
);

Get the number of studies that are not parent studies ("2024 version")

-- Get the number of studies that are not parent studies ("2024 version").
--
-- Note: This query is compatible with the nightly Postgres dump from 1/1/2024,
--       where the `study` table did not yet have a `part_of` column. It is not
--       compatible with the nightly Postgres dump from 1/1/2025 (a year later),
--       where the `study` table _does_ have a `part_of` column.
--
SELECT count(*) AS "Number of studies (non-parent only)"
FROM study 
WHERE id NOT IN (
  -- Get the distinct IDs (as text) of all the parent studies.
  SELECT DISTINCT parent_study_id_as_text
  FROM study,
       jsonb_array_elements_text(annotations->'part_of') AS parent_study_id_as_text
  WHERE jsonb_typeof(annotations) = 'object'
    AND jsonb_typeof(annotations->'part_of') = 'array'
);

ssarrafan · 2025-01-24T23:08:38Z

This is not assigned to anyone and hasn't been updated so I'm removing from sprint and adding to the backlog. @eecavanna if you need this to be prioritized please add to the agenda for next week's infra sync meeting.

ssarrafan · 2025-02-22T20:08:45Z

This is not assigned to anyone and hasn't been updated so I'm removing from sprint and adding to the backlog again.

eecavanna · 2025-02-22T20:13:53Z

Oops! I forgot about this ticket. I will assign it to myself and implement it in the sprint after next.

eecavanna added type: feature New feature or request X SMALL Less than 1 day labels Jan 11, 2025

eecavanna added this to Autometrics Squad and 2025 - Sprint 54 - January 13 - 24, 2025 Jan 11, 2025

ssarrafan removed this from 2025 - Sprint 54 - January 13 - 24, 2025 Jan 24, 2025

ssarrafan added the backlog Issues not assigned to a sprint or not finished during a sprint. Needs to be reprioritized. label Jan 24, 2025

eecavanna added this to 2025 - Sprint 56 - Feb 10-21,2025 Feb 8, 2025

ssarrafan removed this from 2025 - Sprint 56 - Feb 10-21,2025 Feb 22, 2025

eecavanna self-assigned this Feb 22, 2025

eecavanna added this to 2025 - Sprint 58 - Mar 10- Mar 21, 2025 Feb 22, 2025

eecavanna removed the backlog Issues not assigned to a sprint or not finished during a sprint. Needs to be reprioritized. label Feb 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update stats endpoint to return quantity of non-parent studies #1509

Update stats endpoint to return quantity of non-parent studies #1509

eecavanna commented Jan 11, 2025 •

edited

Loading

eecavanna commented Jan 11, 2025 •

edited

Loading

ssarrafan commented Jan 24, 2025

ssarrafan commented Feb 22, 2025

eecavanna commented Feb 22, 2025

Update stats endpoint to return quantity of non-parent studies #1509

Update stats endpoint to return quantity of non-parent studies #1509

Comments

eecavanna commented Jan 11, 2025 • edited Loading

eecavanna commented Jan 11, 2025 • edited Loading

Get the number of studies that are not parent studies ("2025 version")

Get the number of studies that are not parent studies ("2024 version")

ssarrafan commented Jan 24, 2025

ssarrafan commented Feb 22, 2025

eecavanna commented Feb 22, 2025

eecavanna commented Jan 11, 2025 •

edited

Loading

eecavanna commented Jan 11, 2025 •

edited

Loading