
Endpoint /data_objects/study/{study_id} does not return all expected data objects #723

Closed
aclum opened this issue Oct 8, 2024 · 9 comments · Fixed by #738
Assignees
Labels
bug Something isn't working

Comments


aclum commented Oct 8, 2024

Describe the bug
This endpoint is not returning all expected data objects; it appears to return only the raw data. FWIW, the results are incorrect in the Berkeley environment as well.

Please check what is being used to connect workflow records to omics processing. If the code is using part_of instead of was_informed_by on the WorkflowExecution subclass records, that could explain what is happening; part_of shouldn't be used for this and doesn't exist in the Berkeley schema.
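To illustrate the distinction, the two linkage slots can be compared on toy records. This is a minimal sketch; the documents, ids, and the `linked_by` helper below are invented for illustration, and real records live in Mongo and conform to the NMDC schema:

```python
# Hypothetical, minimal documents illustrating the two linkage styles.
# Some legacy WorkflowExecution records carried `part_of`, while the
# Berkeley schema links workflows to data generation via `was_informed_by`.
workflow_docs = [
    {"id": "nmdc:wfmgan-00-aaaaaaaa", "was_informed_by": "nmdc:omprc-00-xxxxxxxx"},
    {"id": "nmdc:wfmgas-00-bbbbbbbb", "part_of": ["nmdc:omprc-00-xxxxxxxx"]},
]

def linked_by(docs, slot, target):
    """Return ids of documents whose `slot` references `target`."""
    hits = []
    for doc in docs:
        value = doc.get(slot)
        # The slot may be single-valued or multivalued depending on the record.
        if value == target or (isinstance(value, list) and target in value):
            hits.append(doc["id"])
    return hits
```

A traversal keyed only on one of these slots will silently miss records that use the other, which is consistent with the symptom described above.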

To Reproduce
Steps to reproduce the behavior:

  1. curl -X 'GET' 'https://api.microbiomedata.org/data_objects/study/nmdc%3Asty-11-5tgfr349' -H 'accept: application/json'
  2. Search the response for nmdc:dobj-11-10kp6g46, an expected data object from MetagenomeAnnotation; it is missing even though it can be found in alldocs when I search for it manually.

Expected behavior
Several thousand additional files should be returned. For example, 219 Functional Annotation GFF files are expected. You can search for this study in the Data Portal to see what the expected results are.


@eecavanna (Collaborator)

@PeopleMakeCulture and I pair-investigated this today. Here are my notes from that investigation (we did not fix the issue):

  • The alldocs collection in the production Mongo database (which complies with the legacy schema) currently does not include documents from the functional_annotation_agg collection (and documents in that collection do not have an id field).
    db.getCollection("alldocs").find({"type": "FunctionalAnnotationAggMember"}); // 0
    db.getCollection("functional_annotation_agg").countDocuments({"id": {$exists: true}}); // 0
  • We visited the aforementioned Study on the Data Portal (at https://data.microbiomedata.org/details/study/nmdc:sty-11-5tgfr349) and did not know how to use that web page to determine the "expected results" of the Runtime API endpoint, which were referenced in this statement from the issue description:

    ...219 Functional annotation gff files are expected. You can search this study in the data portal to see what the expect results are.

@eecavanna eecavanna changed the title find_data_objects_for_study_data_objects_study__study_id__get not returning all expected data objects Endpoint /data_objects/study/{study_id} does not return all expected data objects Oct 9, 2024

aclum commented Oct 10, 2024

@eecavanna @PeopleMakeCulture there is confusion between a data_object_type of Functional Annotation GFF on a record of type nmdc:DataObject and the functional_annotation_agg collection. I am referring to the former, which I expect but am not seeing. nmdc:dobj-11-10kp6g46 is a specific example of a DataObject that this endpoint should return for nmdc:sty-11-5tgfr349.

I AM able to find this record in alldocs in mongo prod with:


db.getCollection('alldocs').find({
  id: 'nmdc:dobj-11-10kp6g46'
});

but this record is NOT being returned by the endpoint. I assigned this to @sujaypatil96 because I suspect the issue is with the code the endpoint uses to get records from alldocs, not with alldocs itself, and he wrote that code.

@eecavanna (Collaborator)

Thanks for elaborating on the situation.

I wasn't familiar with the term "Functional Annotation GFF," but thought it might have something to do with the functional_annotation_agg collection, so @PeopleMakeCulture and I confirmed a couple of things about that collection with respect to alldocs. It sounds like that confirmation isn't relevant to this issue after all.

Assigning this ticket to @sujaypatil96 because you suspect the issue is with that endpoint makes sense to me.

For reference (by everyone)

Here's a link to the endpoint's code (in Runtime v1.10.0, which is running in production):

@router.get(
    "/data_objects/study/{study_id}",
    response_model_exclude_unset=True,
)
def find_data_objects_for_study(
    study_id: str,
    mdb: MongoDatabase = Depends(get_mongo_db),
):
    """This API endpoint is used to retrieve data object ids associated with
    all the biosamples that are part of a given study. This endpoint makes
    use of the `alldocs` collection for its implementation.

    :param study_id: NMDC study id for which data objects are to be retrieved
    :param mdb: PyMongo connection, defaults to Depends(get_mongo_db)
    :return: List of dictionaries where each dictionary contains biosample id as key,
        and another dictionary with key 'data_object_set' containing list of data object ids as value
    """
    biosample_data_objects = []
    study = raise404_if_none(
        mdb.study_set.find_one({"id": study_id}, ["id"]), detail="Study not found"
    )
    biosamples = mdb.biosample_set.find({"part_of": study["id"]}, ["id"])
    biosample_ids = [biosample["id"] for biosample in biosamples]
    for biosample_id in biosample_ids:
        current_ids = [biosample_id]
        collected_data_objects = []
        while current_ids:
            new_current_ids = []
            for current_id in current_ids:
                query = {"has_input": current_id}
                document = mdb.alldocs.find_one(query)
                if not document:
                    continue
                has_output = document.get("has_output")
                if not has_output:
                    continue
                for output_id in has_output:
                    if get_classname_from_typecode(output_id) == "DataObject":
                        data_object_doc = mdb.data_object_set.find_one(
                            {"id": output_id}
                        )
                        if data_object_doc:
                            collected_data_objects.append(strip_oid(data_object_doc))
                    else:
                        new_current_ids.append(output_id)
            current_ids = new_current_ids
        if collected_data_objects:
            biosample_data_objects.append(
                {
                    "biosample_id": biosample_id,
                    "data_object_set": collected_data_objects,
                }
            )
    return biosample_data_objects
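For anyone verifying this bug against the endpoint: the return value is a list of {"biosample_id": ..., "data_object_set": [...]} dictionaries, so a check for a specific DataObject id can be written as a small helper. This is a sketch; the sample payload and ids below are invented:

```python
def contains_data_object(response_payload, data_object_id):
    """Return True if any biosample entry's data_object_set includes the id."""
    return any(
        data_object.get("id") == data_object_id
        for entry in response_payload
        for data_object in entry.get("data_object_set", [])
    )

# Hypothetical payload mimicking the endpoint's documented return shape.
payload = [
    {
        "biosample_id": "nmdc:bsm-00-example1",
        "data_object_set": [{"id": "nmdc:dobj-00-raw1"}],
    }
]
```

Running this against the real response for nmdc:sty-11-5tgfr349 with the id nmdc:dobj-11-10kp6g46 should return True once the bug is fixed.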


sujaypatil96 commented Oct 15, 2024

The code currently doesn't account for DataObjects created by WorkflowExecution processes (using was_generated_by).

~~It makes the assumption that DataObjects are only created by/exist on the has_output of LibraryPreparation processes.~~


aclum commented Oct 15, 2024

Not all samples have LibraryPreparation. Some older records go directly from biosample to a DataGeneration subclass; in particular, records that get mass spec don't use this class at all. This needs to be handled in an agnostic fashion: it should use the collection of relationship slots (as derived from the schema) to figure out which slots to query.


sujaypatil96 commented Oct 15, 2024

Oh yes, you're right, sorry; that was a false statement I made above (I've struck it out). I don't know what I was thinking.

But the code is actually agnostic and doesn't "hardcode" any classes per se. It should work for the "Some older records go directly from biosample to a DataGeneration subclass" case.

The logic works in the following manner:

  1. Find biosamples associated with a study based on the associated_studies slot
  2. Check alldocs to see which documents have the given biosamples as has_input
  3. Get the has_output values of those documents
  4. Don't stop till you hit documents which are of type DataObject
  5. Once you do, pull out the DataObjects from data_object_set
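The steps above can be sketched against an in-memory stand-in for alldocs. The documents and ids here are invented, and membership in a set of known DataObject ids stands in for get_classname_from_typecode:

```python
def collect_data_objects(alldocs, data_object_ids, biosample_id):
    """Breadth-first walk over has_input -> has_output edges, starting from
    a biosample and stopping only when DataObject ids are reached."""
    collected = []
    current_ids = [biosample_id]
    while current_ids:
        next_ids = []
        for current_id in current_ids:
            # Follow every document that takes current_id as an input.
            for doc in (d for d in alldocs if current_id in d.get("has_input", [])):
                for output_id in doc.get("has_output", []):
                    if output_id in data_object_ids:
                        collected.append(output_id)  # terminal: a DataObject
                    else:
                        next_ids.append(output_id)   # intermediate: keep walking
        current_ids = next_ids
    return collected

# Invented chain: biosample -> LibraryPreparation -> DataGeneration -> raw reads.
alldocs = [
    {"id": "nmdc:lp-example", "has_input": ["nmdc:bsm-example"],
     "has_output": ["nmdc:procsm-example"]},
    {"id": "nmdc:dg-example", "has_input": ["nmdc:procsm-example"],
     "has_output": ["nmdc:dobj-example"]},
]
```

On this toy chain the walk does reach the raw DataObject through an intermediate processed sample, so the described logic handles the no-LibraryPreparation case too, as long as every matching document is followed at each step.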


aclum commented Oct 16, 2024

Then I don't understand why this isn't working for this study. Please dig in further.


aclum commented Oct 16, 2024

Unless it is stopping at the first DataObject it finds instead of continuing to check relationships.
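One plausible mechanism for that behavior (not confirmed here): the implementation calls mdb.alldocs.find_one(query), which returns at most one document, so when a single id is the has_input of more than one document only the first match is followed. The difference can be simulated on an in-memory graph; the documents and ids below are invented:

```python
def traverse(alldocs, start_id, follow_all):
    """Walk has_input -> has_output; follow either every matching document
    (find-like) or only the first match (find_one-like)."""
    outputs, current = [], [start_id]
    while current:
        nxt = []
        for cid in current:
            matches = [d for d in alldocs if cid in d.get("has_input", [])]
            if not follow_all:
                matches = matches[:1]  # mimics find_one: first match only
            for doc in matches:
                for out in doc.get("has_output", []):
                    outputs.append(out)
                    nxt.append(out)
        current = nxt
    return outputs

# A raw DataObject feeding two downstream workflows (ids are invented).
alldocs = [
    {"has_input": ["nmdc:dobj-raw"], "has_output": ["nmdc:dobj-asm"]},
    {"has_input": ["nmdc:dobj-raw"], "has_output": ["nmdc:dobj-ann"]},
]
```

With follow_all=False the second workflow's output is never reached, which would truncate the result set in exactly the way described in this issue.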

@ssarrafan (Contributor)

Moving to next sprint for more "digging"
@sujaypatil96 @aclum
