
Endpoint /data_objects/study/{study_id} does not return all expected data objects #723

Closed
aclum opened this issue Oct 8, 2024 · 9 comments · Fixed by #738
Assignees
Labels
bug Something isn't working

Comments


aclum commented Oct 8, 2024

Describe the bug
This endpoint is not returning all expected data objects; it appears to return only the raw data. FWIW, the results are incorrect in the Berkeley environment as well.

Please check what is being used to connect workflow records to omics processing. If the code is using part_of instead of was_informed_by on the WorkflowExecution subclass records, that could explain what is happening; part_of shouldn't be used for this and doesn't exist in the Berkeley schema.
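To illustrate the distinction, the two linkage slots can be compared on toy records. This is a minimal sketch; the documents, ids, and the `linked_by` helper below are invented for illustration, and real records live in Mongo and conform to the NMDC schema:

```python
# Hypothetical, minimal documents illustrating the two linkage styles.
# Some legacy WorkflowExecution records carried `part_of`, while the
# Berkeley schema links workflows to data generation via `was_informed_by`.
workflow_docs = [
    {"id": "nmdc:wfmgan-00-aaaaaaaa", "was_informed_by": "nmdc:omprc-00-xxxxxxxx"},
    {"id": "nmdc:wfmgas-00-bbbbbbbb", "part_of": ["nmdc:omprc-00-xxxxxxxx"]},
]

def linked_by(docs, slot, target):
    """Return ids of documents whose `slot` references `target`."""
    hits = []
    for doc in docs:
        value = doc.get(slot)
        # The slot may be single-valued or multivalued depending on the record.
        if value == target or (isinstance(value, list) and target in value):
            hits.append(doc["id"])
    return hits
```

A traversal keyed only on one of these slots will silently miss records that use the other, which is consistent with the symptom described above.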

To Reproduce
Steps to reproduce the behavior:

  1. curl -X 'GET' 'https://api.microbiomedata.org/data_objects/study/nmdc%3Asty-11-5tgfr349' -H 'accept: application/json'
  2. Search the response for nmdc:dobj-11-10kp6g46, an expected data object from MetagenomeAnnotation; it is missing even though it can be found in alldocs when I search for it manually.

Expected behavior
Several thousand additional files should be returned. For example, 219 Functional Annotation GFF files are expected. You can search for this study in the Data Portal to see what the expected results are.


@eecavanna (Collaborator)

@PeopleMakeCulture and I pair-investigated this today. Here are my notes from that investigation (we did not fix the issue):

  • The alldocs collection in the production Mongo database (which complies with the legacy schema) currently does not include documents from the functional_annotation_agg collection (and documents in that collection do not have an id field).
    db.getCollection("alldocs").find({"type": "FunctionalAnnotationAggMember"}); // 0
    db.getCollection("functional_annotation_agg").countDocuments({"id": {$exists: true}}); // 0
  • We visited the aforementioned Study on the Data Portal (at https://data.microbiomedata.org/details/study/nmdc:sty-11-5tgfr349) and did not know how to use that web page to determine the "expected results" of the Runtime API endpoint, which were referenced in this statement from the issue description:

    ...219 Functional annotation gff files are expected. You can search this study in the data portal to see what the expect results are.

@eecavanna eecavanna changed the title find_data_objects_for_study_data_objects_study__study_id__get not returning all expected data objects Endpoint /data_objects/study/{study_id} does not return all expected data objects Oct 9, 2024

aclum commented Oct 10, 2024

@eecavanna @PeopleMakeCulture there is confusion between a data_object_type of Functional Annotation GFF on a record of type nmdc:DataObject and the functional_annotation_agg collection. I am referring to the former, which I expect but am not seeing. nmdc:dobj-11-10kp6g46 is a specific example of a DataObject that this endpoint should return for nmdc:sty-11-5tgfr349.

I AM able to find this record in alldocs in mongo prod with:


db.getCollection('alldocs').find({
  id: 'nmdc:dobj-11-10kp6g46'
});

but this record is NOT being returned by the endpoint. I assigned this to @sujaypatil96 because I suspect the issue is with the code the endpoint uses to get records from alldocs, not with alldocs itself, and he wrote that code.

@eecavanna (Collaborator)

Thanks for elaborating on the situation.

I wasn't familiar with the term "Functional Annotation GFF," but thought it might have something to do with the functional_annotation_agg collection, so @PeopleMakeCulture and I confirmed a couple of things about that collection with respect to alldocs. It sounds like that confirmation isn't relevant to this issue after all.

Assigning this ticket to @sujaypatil96 because you suspect the issue is with that endpoint makes sense to me.

For reference (by everyone)

Here's a link to the endpoint's code (in Runtime v1.10.0, which is running in production):

@router.get(
    "/data_objects/study/{study_id}",
    response_model_exclude_unset=True,
)
def find_data_objects_for_study(
    study_id: str,
    mdb: MongoDatabase = Depends(get_mongo_db),
):
    """This API endpoint is used to retrieve data object ids associated with
    all the biosamples that are part of a given study. This endpoint makes
    use of the `alldocs` collection for its implementation.

    :param study_id: NMDC study id for which data objects are to be retrieved
    :param mdb: PyMongo connection, defaults to Depends(get_mongo_db)
    :return: List of dictionaries where each dictionary contains biosample id as key,
        and another dictionary with key 'data_object_set' containing list of data object ids as value
    """
    biosample_data_objects = []
    study = raise404_if_none(
        mdb.study_set.find_one({"id": study_id}, ["id"]), detail="Study not found"
    )
    biosamples = mdb.biosample_set.find({"part_of": study["id"]}, ["id"])
    biosample_ids = [biosample["id"] for biosample in biosamples]
    for biosample_id in biosample_ids:
        current_ids = [biosample_id]
        collected_data_objects = []
        while current_ids:
            new_current_ids = []
            for current_id in current_ids:
                query = {"has_input": current_id}
                document = mdb.alldocs.find_one(query)
                if not document:
                    continue
                has_output = document.get("has_output")
                if not has_output:
                    continue
                for output_id in has_output:
                    if get_classname_from_typecode(output_id) == "DataObject":
                        data_object_doc = mdb.data_object_set.find_one(
                            {"id": output_id}
                        )
                        if data_object_doc:
                            collected_data_objects.append(strip_oid(data_object_doc))
                    else:
                        new_current_ids.append(output_id)
            current_ids = new_current_ids
        if collected_data_objects:
            biosample_data_objects.append(
                {
                    "biosample_id": biosample_id,
                    "data_object_set": collected_data_objects,
                }
            )
    return biosample_data_objects
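For anyone verifying this bug against the endpoint: the return value is a list of {"biosample_id": ..., "data_object_set": [...]} dictionaries, so a check for a specific DataObject id can be written as a small helper. This is a sketch; the sample payload and ids below are invented:

```python
def contains_data_object(response_payload, data_object_id):
    """Return True if any biosample entry's data_object_set includes the id."""
    return any(
        data_object.get("id") == data_object_id
        for entry in response_payload
        for data_object in entry.get("data_object_set", [])
    )

# Hypothetical payload mimicking the endpoint's documented return shape.
payload = [
    {
        "biosample_id": "nmdc:bsm-00-example1",
        "data_object_set": [{"id": "nmdc:dobj-00-raw1"}],
    }
]
```

Running this against the real response for nmdc:sty-11-5tgfr349 with the id nmdc:dobj-11-10kp6g46 should return True once the bug is fixed.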


sujaypatil96 commented Oct 15, 2024

The code currently doesn't account for DataObjects created by WorkflowExecution processes (using was_generated_by).

~~It makes the assumption that DataObjects are only created by/exist on the has_output of LibraryPreparation processes.~~


aclum commented Oct 15, 2024

Not all samples have LibraryPreparation. Some older records go directly from biosample to a DataGeneration subclass; in particular, records that get mass spec don't use this class at all. This needs to be handled in an agnostic fashion: it should use the collection of relationship slots (as derived from the schema) to figure out which slots to query.


sujaypatil96 commented Oct 15, 2024

Oh yes, you're right, sorry; that was a false statement I made above (I've struck it out). I don't know what I was thinking.

But the code is actually agnostic and doesn't "hardcode" any classes per se. It should work for the "Some older records go directly from biosample to a DataGeneration subclass" case.

The logic works in the following manner:

  1. Find biosamples associated with a study based on the associated_studies slot
  2. Check alldocs to see which documents have the given biosamples as has_input
  3. Get the has_output values of those documents
  4. Don't stop till you hit documents which are of type DataObject
  5. Once you do, pull out the DataObjects from data_object_set
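The steps above can be sketched against an in-memory stand-in for alldocs. The documents and ids here are invented, and membership in a set of known DataObject ids stands in for get_classname_from_typecode:

```python
def collect_data_objects(alldocs, data_object_ids, biosample_id):
    """Breadth-first walk over has_input -> has_output edges, starting from
    a biosample and stopping only when DataObject ids are reached."""
    collected = []
    current_ids = [biosample_id]
    while current_ids:
        next_ids = []
        for current_id in current_ids:
            # Follow every document that takes current_id as an input.
            for doc in (d for d in alldocs if current_id in d.get("has_input", [])):
                for output_id in doc.get("has_output", []):
                    if output_id in data_object_ids:
                        collected.append(output_id)  # terminal: a DataObject
                    else:
                        next_ids.append(output_id)   # intermediate: keep walking
        current_ids = next_ids
    return collected

# Invented chain: biosample -> LibraryPreparation -> DataGeneration -> raw reads.
alldocs = [
    {"id": "nmdc:lp-example", "has_input": ["nmdc:bsm-example"],
     "has_output": ["nmdc:procsm-example"]},
    {"id": "nmdc:dg-example", "has_input": ["nmdc:procsm-example"],
     "has_output": ["nmdc:dobj-example"]},
]
```

On this toy chain the walk does reach the raw DataObject through an intermediate processed sample, so the described logic handles the no-LibraryPreparation case too, as long as every matching document is followed at each step.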


aclum commented Oct 16, 2024

Then I don't understand why this isn't working for this study. Please dig in further.


aclum commented Oct 16, 2024

Unless it is stopping at the first DataObject it finds instead of continuing to check relationships.
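One plausible mechanism for that behavior (not confirmed here): the implementation calls mdb.alldocs.find_one(query), which returns at most one document, so when a single id is the has_input of more than one document only the first match is followed. The difference can be simulated on an in-memory graph; the documents and ids below are invented:

```python
def traverse(alldocs, start_id, follow_all):
    """Walk has_input -> has_output; follow either every matching document
    (find-like) or only the first match (find_one-like)."""
    outputs, current = [], [start_id]
    while current:
        nxt = []
        for cid in current:
            matches = [d for d in alldocs if cid in d.get("has_input", [])]
            if not follow_all:
                matches = matches[:1]  # mimics find_one: first match only
            for doc in matches:
                for out in doc.get("has_output", []):
                    outputs.append(out)
                    nxt.append(out)
        current = nxt
    return outputs

# A raw DataObject feeding two downstream workflows (ids are invented).
alldocs = [
    {"has_input": ["nmdc:dobj-raw"], "has_output": ["nmdc:dobj-asm"]},
    {"has_input": ["nmdc:dobj-raw"], "has_output": ["nmdc:dobj-ann"]},
]
```

With follow_all=False the second workflow's output is never reached, which would truncate the result set in exactly the way described in this issue.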

@ssarrafan (Contributor)

Moving to next sprint for more "digging"
@sujaypatil96 @aclum
