
berkeley: Determine class ancestry per-type value (within collection) instead of per-collection (when generating alldocs) #694

Conversation

@eecavanna (Collaborator) commented Sep 19, 2024

Description

Fixed a bug in the code that was being used to generate the alldocs collection.

The bug in the original code was that a single document in each collection was being used to get the class ancestry of every document in that collection. This could produce incorrect class ancestries when dealing with a database that conforms to the Berkeley schema, since a collection can store documents of differing types.

In the first commit, we changed it to determine the class ancestry document-by-document. I never ran this on Dagster, but I assume it would have taken about 10 minutes (or a few hours, if including the functional_annotation_agg collection).

In the second commit, we changed it to be more like the original code, using insert_many (instead of insert_one, as in the first commit). Unlike the original code, we do not assume that all documents in a collection have the same class ancestry; however, we do take advantage of the fact that documents having the same type value have the same class ancestry, which allows us to process them in batches. A hypothetical sketch of this approach follows.
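
To make the batching concrete, here is a minimal, hypothetical sketch of the per-type-value approach. This is not the Runtime's actual code: `get_class_ancestry` stands in for however the Runtime maps a `type` value like 'nmdc:Study' to its ancestor class names, and the connection details and collection list are assumptions.

# Hypothetical sketch of the per-`type`-value batching approach; not the
# Runtime's actual implementation. `get_class_ancestry` stands in for
# whatever maps a `type` value (e.g. "nmdc:Study") to its list of ancestor
# class names; in practice `collection_names` comes from the schema.
from pymongo import MongoClient

mdb = MongoClient("mongodb://localhost:27017")["nmdc"]  # connection details assumed
collection_names = ["study_set", "workflow_execution_set"]  # illustrative subset

for coll_name in collection_names:
    for type_value in mdb[coll_name].distinct("type"):
        # Documents sharing a `type` value share a class ancestry, so the
        # ancestry is computed once per (collection, type-value) pair...
        ancestry = get_class_ancestry(type_value)  # hypothetical helper
        batch = []
        for doc in mdb[coll_name].find({"type": type_value}, {"_id": 0}):
            doc["type"] = ancestry
            batch.append(doc)
        # ...and the whole batch is written with a single insert_many call.
        if batch:
            mdb.alldocs.insert_many(batch)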

Fixes #690

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

  • It has not been tested yet

Configuration Details: none

Definition of Done (DoD) Checklist:

A simple checklist of necessary activities that add verifiable/demonstrable value to the product by asserting the quality of a feature, not the functionality of that feature; the latter should be stated as acceptance criteria in the issue this PR closes. Not all activities in the PR template will be applicable to each feature since the definition of done is intended to be a comprehensive checklist. Consciously decide the applicability of value-added activities on a feature-by-feature basis.

  • My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas

The previous code made the invalid assumption that a given collection
could only store documents of a single type. That constraint does not
exist in databases conforming to the Berkeley schema (nmdc-schema v11+).
@eecavanna self-assigned this Sep 19, 2024
@eecavanna added labels Sep 19, 2024: bug (Something isn't working), unplanned-task (not accounted for during sprint planning, but time sensitive), berkeley-schema (Related to making the Runtime work with the Berkeley schema), hasPR (has an active pull request to fix/close)
@eecavanna changed the title from "Determine class hierarchy per-document instead of per-collection" to "berkeley: Determine class hierarchy per-document instead of per-collection" Sep 19, 2024
@eecavanna marked this pull request as ready for review September 20, 2024 00:33
@eecavanna (Collaborator, Author) commented Sep 20, 2024

This works for me locally, with the latest Berkeley Mongo nightly dump loaded into my local Mongo server. The run took 1.5 minutes from start to finish.

Here's a screenshot showing the successful run in Dagit:

[Screenshot: successful run of materialize_alldocs in Dagit]

Logs from the run:
(All log entries below are from the materialize_alldocs step.)

6:46:04.138 PM  INFO  Inserted 27 documents from collection_name='study_set' originally having type_value='nmdc:Study'.
6:46:04.143 PM  INFO  Found 0 estimated documents for collection_name='workflow_execution_set'.
6:46:04.197 PM  INFO  Found 9 distinct `type` values in collection_name='workflow_execution_set': distinct_type_values=['nmdc:MagsAnalysis', 'nmdc:MetabolomicsAnalysis', 'nmdc:MetagenomeAnnotation', 'nmdc:MetagenomeAssembly', 'nmdc:MetagenomeSequencing', 'nmdc:MetaproteomicsAnalysis', 'nmdc:NomAnalysis', 'nmdc:ReadBasedTaxonomyAnalysis', 'nmdc:ReadQcAnalysis']
6:46:04.211 PM  INFO  Found 1884 documents having type_value='nmdc:MagsAnalysis' in collection_name='workflow_execution_set'.
6:46:04.252 PM  INFO  Inserted 1883 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MagsAnalysis'.
6:46:04.263 PM  INFO  Found 34 documents having type_value='nmdc:MetabolomicsAnalysis' in collection_name='workflow_execution_set'.
6:46:04.331 PM  INFO  Inserted 33 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetabolomicsAnalysis'.
6:46:04.346 PM  INFO  Found 2013 documents having type_value='nmdc:MetagenomeAnnotation' in collection_name='workflow_execution_set'.
6:46:04.411 PM  INFO  Inserted 2012 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetagenomeAnnotation'.
6:46:04.431 PM  INFO  Found 2124 documents having type_value='nmdc:MetagenomeAssembly' in collection_name='workflow_execution_set'.
6:46:04.498 PM  INFO  Inserted 2123 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetagenomeAssembly'.
6:46:04.522 PM  INFO  Found 631 documents having type_value='nmdc:MetagenomeSequencing' in collection_name='workflow_execution_set'.
6:46:04.548 PM  INFO  Inserted 630 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetagenomeSequencing'.
6:46:04.563 PM  INFO  Found 2 documents having type_value='nmdc:MetaproteomicsAnalysis' in collection_name='workflow_execution_set'.
6:46:06.086 PM  INFO  Inserted 1 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetaproteomicsAnalysis'.
6:46:06.109 PM  INFO  Found 2549 documents having type_value='nmdc:NomAnalysis' in collection_name='workflow_execution_set'.
6:46:06.446 PM  INFO  Inserted 2548 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:NomAnalysis'.
6:46:06.458 PM  INFO  Found 2106 documents having type_value='nmdc:ReadBasedTaxonomyAnalysis' in collection_name='workflow_execution_set'.
6:46:06.522 PM  INFO  Inserted 2105 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:ReadBasedTaxonomyAnalysis'.
6:46:06.536 PM  INFO  Found 2250 documents having type_value='nmdc:ReadQcAnalysis' in collection_name='workflow_execution_set'.
6:46:06.576 PM  INFO  Inserted 2249 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:ReadQcAnalysis'.
6:46:07.400 PM  INFO  refreshed Collection(Database(MongoClient(host=['mongo:27017'], document_class=dict, tz_aware=False, connect=True, directconnection=True), 'nmdc'), 'alldocs') collection with 144426 docs.
6:46:07.436 PM  STEP_OUTPUT  Yielded output "result" of type "Int". (Type check passed).
6:46:07.467 PM  HANDLED_OUTPUT  Handled output "result" using IO manager "io_manager" (path: /opt/dagster/dagster_home/storage/7298c571-b87b-48aa-a778-b37760687589/materialize_alldocs/result)
6:46:07.476 PM  STEP_SUCCESS  Finished execution of step "materialize_alldocs" in 1m19s.

@eecavanna (Collaborator, Author) commented Sep 20, 2024

Here's what I did in this PR

  1. Implemented bugfix previously discussed with teammates (first commit: ea17ec3)
  2. Replaced that with an optimized version (latest commit: 6bb4524)
  3. Confirmed the job runs successfully—locally—with the following Mongo dump from the NERSC filesystem loaded into my local Mongo server: /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-berkeley_2024-09-19_20-20-01/
  4. Measured the run duration to be about 90 seconds
  5. Confirmed my local Mongo database does contain an alldocs collection having 144,426 documents in it

Here's what I did NOT do

  1. Check whether the generated alldocs collection is "correct" (beyond being non-empty)

@dwinston and @PeopleMakeCulture: In case you have time Friday morning, I'd appreciate you checking out this branch, running the job locally, checking whether the generated alldocs collection is "correct", and posting your verdict(s) here in the PR. I'm comfortable with this PR being merged in as-is, if your verdict(s) are that the generated alldocs collection is correct (in that case, please merge this in without waiting for me to get online).

@eecavanna changed the title from "berkeley: Determine class hierarchy per-document instead of per-collection" to "berkeley: Determine class hierarchy per-type value (within collection) instead of per-collection" Sep 20, 2024
@eecavanna changed the title from "berkeley: Determine class hierarchy per-type value (within collection) instead of per-collection" to "berkeley: Determine class ancestry per-type value (within collection) instead of per-collection (when generating alldocs)" Sep 20, 2024
@PeopleMakeCulture (Collaborator) commented Sep 20, 2024

@eecavanna There is some code in the referential-integrity notebook that looks similar to your first commit, but it batches the write operations to Mongo. It is fairly performant (~30 sec with functional annotation) and uses LinkML's native SchemaView to get all class ancestors for each doc.

# Quoted from the referential-integrity notebook. `pbar` is a tqdm progress
# bar, and `doc_cls`, `document_reference_ranged_slots`, `pick`, and
# `schema_view` are helpers defined earlier in that notebook.
from pymongo import InsertOne

for coll_name in collection_names:
    pbar.set_description(f"processing {coll_name}...")
    requests = []
    for doc in mdb[coll_name].find():
        doc_type = doc_cls(doc, coll_name=coll_name)
        slots_to_include = ["id"] + document_reference_ranged_slots[doc_type]
        new_doc = pick(slots_to_include, doc)
        new_doc["type"] = schema_view.class_ancestors(doc_type)
        requests.append(InsertOne(new_doc))
        if len(requests) == 1000:  # ensure bulk-write batches aren't too huge
            result = mdb.alldocs.bulk_write(requests, ordered=False)
            pbar.update(result.inserted_count)
            requests.clear()
    if len(requests) > 0:  # flush the remaining requests for this collection
        result = mdb.alldocs.bulk_write(requests, ordered=False)
        pbar.update(result.inserted_count)
pbar.close()

Could you use this logic instead? It is fairly similar to your first commit.
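
For reference, here is a minimal sketch of the SchemaView usage mentioned above. `class_ancestors` is the real linkml_runtime API, but the schema file path is an assumption; point it at whatever nmdc-schema YAML artifact your environment provides.

# Minimal sketch of getting class ancestors via LinkML's SchemaView.
# The schema file path below is an assumption.
from linkml_runtime.utils.schemaview import SchemaView

schema_view = SchemaView("nmdc_schema/nmdc_materialized_patterns.yaml")

# Returns the class itself plus its ancestor class names, e.g. something
# like ["Study", "NamedThing"] (exact names depend on the schema version).
print(schema_view.class_ancestors("Study"))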

@sujaypatil96 (Collaborator) left a comment

Procedure followed to review this PR:

  • Checkout 690B-berkeley-dagster-job-ensure_alldocs-fails-with-assertionerror-1 branch
  • Run make up-dev
  • Download and restore the latest backup/dump of the Berkeley Mongo database into local Mongo: /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-berkeley_2024-09-19_20-12-01 (I had initially restored the prod dump, dump_nmdc-prod_2024-09-19_20-12-01, by mistake; see the correction below). Example commands are sketched after this list.
  • Note: The dump already has an alldocs collection present in it
  • Drop existing alldocs collection
  • Find ensure_alldocs job on Dagster/Dagit: http://localhost:3000/locations/repo@nmdc_runtime.site.repository%3Arepo/jobs/ensure_alldocs and run it using launchpad
  • Confirm the successful creation of an alldocs collection
  • Confirm that the number of documents in there is non-zero (I'm not sure what the exact expected count should be, but it looks correct)
  • Total number of documents created = 144,426
  • Execution time for ensure_alldocs job run ~ 16 s
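
For anyone reproducing this procedure, here are hypothetical forms of the restore and drop steps (hosts, ports, and paths are assumptions for a default local dev Mongo; adjust to your setup):

# Hypothetical commands; host/port/paths are assumptions, not the documented setup.
# Restore the Berkeley dump into a local Mongo server.
mongorestore --host localhost:27017 dump_nmdc-berkeley_2024-09-19_20-12-01
# Drop the pre-existing alldocs collection before running the ensure_alldocs job.
mongosh "mongodb://localhost:27017/nmdc" --eval "db.alldocs.drop()"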

Edit: Current status: running into an error.

[Screenshot 2024-09-20 at 11:39:09 AM: error encountered during the run]

@dwinston (Collaborator) commented Sep 20, 2024

>   • Download and restore latest backup/dump of the prod Mongo database into local Mongo: /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-prod_2024-09-19_20-12-01

@sujaypatil96 you want to restore a database from the "berkeley" deployment, e.g. /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-berkeley_2024-09-19_20-20-01.

Thank you for documenting your procedure so that I could zoom in on this aspect.

@sujaypatil96 self-requested a review September 20, 2024 18:26
@sujaypatil96 (Collaborator) left a comment
Oh great catch @dwinston, thank you! I've updated my review checklist above. I think this PR is good to go 🚀

Fantastic work @eecavanna 💕

@sujaypatil96 linked an issue Sep 20, 2024 that may be closed by this pull request
@eecavanna (Collaborator, Author) commented

Hi @PeopleMakeCulture, thanks for bringing that code to my attention. I like the use of bulk_write to perform the writes in batches. I also like the use of SchemaView as opposed to requiring a Python instance and calling a function that traverses the Python inheritance tree upward.

Yes, I think that approach would work (except I think I would omit the pick part, which I assume omits fields that are not references and are not named id).
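
(For clarity, the kind of `pick` helper I'm assuming is something like the following illustrative sketch; the notebook's actual implementation may differ.)

# Illustrative sketch of a `pick` helper: keep only the given keys.
def pick(keys, d):
    return {k: d[k] for k in keys if k in d}

pick(["id", "part_of"], {"id": "nmdc:sty-1", "part_of": [], "name": "x"})
# -> {"id": "nmdc:sty-1", "part_of": []}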

I'll merge this PR in as is and file a ticket about the fact that it deviates from the notebook, as I would like to standardize the various alldocs-generating approaches (standardize to the more robust approach), maybe even abstracting it into a common function.

@eecavanna (Collaborator, Author) commented

Thanks for testing it out on your end, @sujaypatil96! I'll merge this in.

As for this fix also fixing #689, I think there will be one code change required to fix the latter. I posted a comment about it in that Issue a few minutes ago.

@eecavanna merged commit 4f1ee79 into berkeley Sep 20, 2024
1 of 2 checks passed
@eecavanna deleted the 690B-berkeley-dagster-job-ensure_alldocs-fails-with-assertionerror-1 branch September 20, 2024 19:20
@PeopleMakeCulture (Collaborator) commented

> I'll merge this PR in as is and file a ticket about the fact that it deviates from the notebook, as I would like to standardize the various alldocs-generating approaches (standardize to the more robust approach), maybe even abstracting it into a common function.

@eecavanna Link to ticket pls?

@eecavanna (Collaborator, Author) commented

Here's a link to the ticket I was referring to: #696

Linked issue that may be closed by this pull request: berkeley: Dagster job ensure_alldocs fails with AssertionError