Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

not all schema collections suffix with _sets #620

Merged
merged 4 commits into from
Aug 8, 2024

Conversation

dwinston
Copy link
Collaborator

@dwinston dwinston commented Aug 7, 2024

Description

Just like not all heroes wear capes, not all not all schema collections suffix with _sets.

Interpret perform_mongo_updates for non-id-having schema collections as simple insertions. Leave note in code about decision to insist on schema-supplied uniqueness signal.

This is a hack because e.g. https://w3id.org/nmdc/FunctionalAnnotationAggMember documents appear by eye to be unique via a compound key (metagenome_annotation_id, gene_function_id) and yet this is not explicit in the schema.

One potential solution is to auto-generate an id for such documents as a deterministic hash of the compound key, but any such id generation strategy should be unambiguous to a schema consumer such as nmdc-runtime.

For now, decision is to potentially re-insert "duplicate" documents, i.e. to interpret lack of id as lack of unique document identity for de-duplication.

Fixes #611

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • test_perform_mongo_updates_functional_annotation_agg

Definition of Done (DoD) Checklist:

  • My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
  • I have performed a self-review of my code
  • I have added tests that prove my fix is effective or that my feature works, incl. considering downstream usage (e.g. https://github.com/microbiomedata/notebook_hackathons) if applicable.
  • New and existing unit and functional tests pass locally with my changes (make up-test && make test-run)

interpret as simple insertion. leave note in code about decision to insist on schema-supplied uniqueness signal.

fix #611
@dwinston dwinston changed the title 611 not all schema collections have set not all schema collections have _sets Aug 7, 2024
@dwinston dwinston changed the title not all schema collections have _sets not all schema collections suffix with _sets Aug 7, 2024
eecavanna
eecavanna previously approved these changes Aug 7, 2024
@dwinston dwinston merged commit 3fd7786 into main Aug 8, 2024
2 checks passed
@dwinston dwinston deleted the 611-not-all-schema-collections-have-_set branch August 8, 2024 08:19
dwinston added a commit that referenced this pull request Aug 8, 2024
* fix: allow "update" of non-`id`-having document collections

interpret as simple insertion. leave note in code about decision to insist on schema-supplied uniqueness signal.

fix #611

* refactor to add test

* fix: rm abandoned candidate test

* Update nmdc_runtime/site/ops.py

Co-authored-by: eecavanna <[email protected]>

---------

Co-authored-by: eecavanna <[email protected]>
@cmungall
Copy link
Contributor

cmungall commented Aug 8, 2024

We can use https://w3id.org/linkml/unique_keys to declare a compound key (not totally understanding the broader issue though)

@shreddd
Copy link
Collaborator

shreddd commented Aug 8, 2024

I think we should review how this collection is typically generated based on:
https://github.com/microbiomedata/nmdc-aggregator/blob/main/generate_functional_agg.py

This will need some additional discussion. We can follow up on slack or in a small meeting.

@shreddd
Copy link
Collaborator

shreddd commented Aug 8, 2024

PS - this collection's role is to provide the aggregation tables for the data portal search (it will not be typically queried directly from the outside by end users), so we should keep that in mind.

One way to approach to this might be to model the set of mongo queries used in https://github.com/microbiomedata/nmdc-aggregator/blob/main/generate_functional_agg.py in the API.

@aclum
Copy link
Contributor

aclum commented Aug 9, 2024

@eecavanna @pkalita-lbl this PR is already closed, can we get a new ticket and PR to address the new issue

@eecavanna
Copy link
Collaborator

I'll create a new issue for it now. Thanks, @aclum.

Thanks, @pkalita-lbl, for reporting the issue and linking to the relevant commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dagster apply_metadata_in op fails to insert functional_annotation_agg documents
5 participants