
berkeley: Determine class ancestry per-type value (within collection) instead of per-collection (when generating alldocs) #694

Conversation

@eecavanna (Collaborator) commented Sep 19, 2024

Description

Fixed a bug in the code that was being used to generate the alldocs collection.

The bug in the original code was that a single document in each collection was being used to get the class ancestry of every document in that collection. This could produce incorrect class ancestries when dealing with a database that conforms to the Berkeley schema, since a collection can store documents of differing types.

In the first commit, we changed it to determine the class ancestry document-by-document. I never ran this on Dagster, but I assume it would have taken about 10 minutes (or a few hours, if including the functional_annotation_agg collection).

In the second commit, we changed it to be more like the original code, using insert_many (instead of insert_one, as in the first commit). Unlike the original code, we do not assume that all documents in a collection have the same class ancestry; however, we do take advantage of the fact that documents having the same type value have the same class ancestry, which allows us to process them in batches. A hypothetical sketch of this approach follows.
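
To make the batching concrete, here is a minimal, hypothetical sketch of the per-type-value approach. This is not the Runtime's actual code: `get_class_ancestry` stands in for however the Runtime maps a `type` value like 'nmdc:Study' to its ancestor class names, and the connection details and collection list are assumptions.

# Hypothetical sketch of the per-`type`-value batching approach; not the
# Runtime's actual implementation. `get_class_ancestry` stands in for
# whatever maps a `type` value (e.g. "nmdc:Study") to its list of ancestor
# class names; in practice `collection_names` comes from the schema.
from pymongo import MongoClient

mdb = MongoClient("mongodb://localhost:27017")["nmdc"]  # connection details assumed
collection_names = ["study_set", "workflow_execution_set"]  # illustrative subset

for coll_name in collection_names:
    for type_value in mdb[coll_name].distinct("type"):
        # Documents sharing a `type` value share a class ancestry, so the
        # ancestry is computed once per (collection, type-value) pair...
        ancestry = get_class_ancestry(type_value)  # hypothetical helper
        batch = []
        for doc in mdb[coll_name].find({"type": type_value}, {"_id": 0}):
            doc["type"] = ancestry
            batch.append(doc)
        # ...and the whole batch is written with a single insert_many call.
        if batch:
            mdb.alldocs.insert_many(batch)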

Fixes #690

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

  • It has not been tested yet

Configuration Details: none

Definition of Done (DoD) Checklist:

A simple checklist of necessary activities that add verifiable/demonstrable value to the product by asserting the quality of a feature, not the functionality of that feature; the latter should be stated as acceptance criteria in the issue this PR closes. Not all activities in the PR template will be applicable to each feature since the definition of done is intended to be a comprehensive checklist. Consciously decide the applicability of value-added activities on a feature-by-feature basis.

  • My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas

The previous code made the invalid assumption that a given collection
could only store documents of a single type. That constraint does not
exist in databases conforming to the Berkeley schema (nmdc-schema v11+).
@eecavanna self-assigned this Sep 19, 2024
@eecavanna added labels Sep 19, 2024: bug (Something isn't working), unplanned-task (not accounted for during sprint planning, but time sensitive), berkeley-schema (Related to making the Runtime work with the Berkeley schema), hasPR (has an active pull request to fix/close)
@eecavanna changed the title from "Determine class hierarchy per-document instead of per-collection" to "berkeley: Determine class hierarchy per-document instead of per-collection" Sep 19, 2024
@eecavanna marked this pull request as ready for review September 20, 2024 00:33
@eecavanna (Collaborator, Author) commented Sep 20, 2024

This works for me locally, with the latest Berkeley Mongo nightly dump loaded into my local Mongo server. The run took 1.5 minutes from start to finish.

Here's a screenshot showing the successful run in Dagit:

[Screenshot: successful run of materialize_alldocs in Dagit]

Logs from the run:
(All log entries below are from the materialize_alldocs step.)

6:46:04.138 PM  INFO  Inserted 27 documents from collection_name='study_set' originally having type_value='nmdc:Study'.
6:46:04.143 PM  INFO  Found 0 estimated documents for collection_name='workflow_execution_set'.
6:46:04.197 PM  INFO  Found 9 distinct `type` values in collection_name='workflow_execution_set': distinct_type_values=['nmdc:MagsAnalysis', 'nmdc:MetabolomicsAnalysis', 'nmdc:MetagenomeAnnotation', 'nmdc:MetagenomeAssembly', 'nmdc:MetagenomeSequencing', 'nmdc:MetaproteomicsAnalysis', 'nmdc:NomAnalysis', 'nmdc:ReadBasedTaxonomyAnalysis', 'nmdc:ReadQcAnalysis']
6:46:04.211 PM  INFO  Found 1884 documents having type_value='nmdc:MagsAnalysis' in collection_name='workflow_execution_set'.
6:46:04.252 PM  INFO  Inserted 1883 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MagsAnalysis'.
6:46:04.263 PM  INFO  Found 34 documents having type_value='nmdc:MetabolomicsAnalysis' in collection_name='workflow_execution_set'.
6:46:04.331 PM  INFO  Inserted 33 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetabolomicsAnalysis'.
6:46:04.346 PM  INFO  Found 2013 documents having type_value='nmdc:MetagenomeAnnotation' in collection_name='workflow_execution_set'.
6:46:04.411 PM  INFO  Inserted 2012 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetagenomeAnnotation'.
6:46:04.431 PM  INFO  Found 2124 documents having type_value='nmdc:MetagenomeAssembly' in collection_name='workflow_execution_set'.
6:46:04.498 PM  INFO  Inserted 2123 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetagenomeAssembly'.
6:46:04.522 PM  INFO  Found 631 documents having type_value='nmdc:MetagenomeSequencing' in collection_name='workflow_execution_set'.
6:46:04.548 PM  INFO  Inserted 630 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetagenomeSequencing'.
6:46:04.563 PM  INFO  Found 2 documents having type_value='nmdc:MetaproteomicsAnalysis' in collection_name='workflow_execution_set'.
6:46:06.086 PM  INFO  Inserted 1 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:MetaproteomicsAnalysis'.
6:46:06.109 PM  INFO  Found 2549 documents having type_value='nmdc:NomAnalysis' in collection_name='workflow_execution_set'.
6:46:06.446 PM  INFO  Inserted 2548 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:NomAnalysis'.
6:46:06.458 PM  INFO  Found 2106 documents having type_value='nmdc:ReadBasedTaxonomyAnalysis' in collection_name='workflow_execution_set'.
6:46:06.522 PM  INFO  Inserted 2105 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:ReadBasedTaxonomyAnalysis'.
6:46:06.536 PM  INFO  Found 2250 documents having type_value='nmdc:ReadQcAnalysis' in collection_name='workflow_execution_set'.
6:46:06.576 PM  INFO  Inserted 2249 documents from collection_name='workflow_execution_set' originally having type_value='nmdc:ReadQcAnalysis'.
6:46:07.400 PM  INFO  refreshed Collection(Database(MongoClient(host=['mongo:27017'], document_class=dict, tz_aware=False, connect=True, directconnection=True), 'nmdc'), 'alldocs') collection with 144426 docs.
6:46:07.436 PM  STEP_OUTPUT  Yielded output "result" of type "Int". (Type check passed).
6:46:07.467 PM  HANDLED_OUTPUT  Handled output "result" using IO manager "io_manager" (path: /opt/dagster/dagster_home/storage/7298c571-b87b-48aa-a778-b37760687589/materialize_alldocs/result)
6:46:07.476 PM  STEP_SUCCESS  Finished execution of step "materialize_alldocs" in 1m19s.

@eecavanna (Collaborator, Author) commented Sep 20, 2024

Here's what I did in this PR

  1. Implemented bugfix previously discussed with teammates (first commit: ea17ec3)
  2. Replaced that with an optimized version (latest commit: 6bb4524)
  3. Confirmed the job runs successfully—locally—with the following Mongo dump from the NERSC filesystem loaded into my local Mongo server: /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-berkeley_2024-09-19_20-20-01/
  4. Measured the run duration to be about 90 seconds
  5. Confirmed my local Mongo database does contain an alldocs collection having 144,426 documents in it

Here's what I did NOT do

  1. Check whether the generated alldocs collection is "correct" (beyond being non-empty)

@dwinston and @PeopleMakeCulture: In case you have time Friday morning, I'd appreciate you checking out this branch, running the job locally, checking whether the generated alldocs collection is "correct", and posting your verdict(s) here in the PR. I'm comfortable with this PR being merged in as-is, if your verdict(s) are that the generated alldocs collection is correct (in that case, please merge this in without waiting for me to get online).

@eecavanna changed the title from "berkeley: Determine class hierarchy per-document instead of per-collection" to "berkeley: Determine class hierarchy per-type value (within collection) instead of per-collection" Sep 20, 2024
@eecavanna changed the title from "berkeley: Determine class hierarchy per-type value (within collection) instead of per-collection" to "berkeley: Determine class ancestry per-type value (within collection) instead of per-collection (when generating alldocs)" Sep 20, 2024
@PeopleMakeCulture (Collaborator) commented Sep 20, 2024

@eecavanna There is some code in the referential-integrity notebook that looks similar to your first commit, but it batches the write operations to Mongo. It is fairly performant (~30 sec with functional annotation) and uses LinkML's native SchemaView to get all class ancestors for each doc.

# Quoted from the referential-integrity notebook. `pbar` is a tqdm progress
# bar, and `doc_cls`, `document_reference_ranged_slots`, `pick`, and
# `schema_view` are helpers defined earlier in that notebook.
from pymongo import InsertOne

for coll_name in collection_names:
    pbar.set_description(f"processing {coll_name}...")
    requests = []
    for doc in mdb[coll_name].find():
        doc_type = doc_cls(doc, coll_name=coll_name)
        slots_to_include = ["id"] + document_reference_ranged_slots[doc_type]
        new_doc = pick(slots_to_include, doc)
        new_doc["type"] = schema_view.class_ancestors(doc_type)
        requests.append(InsertOne(new_doc))
        if len(requests) == 1000:  # ensure bulk-write batches aren't too huge
            result = mdb.alldocs.bulk_write(requests, ordered=False)
            pbar.update(result.inserted_count)
            requests.clear()
    if len(requests) > 0:  # flush the remaining requests for this collection
        result = mdb.alldocs.bulk_write(requests, ordered=False)
        pbar.update(result.inserted_count)
pbar.close()

Could you use this logic instead? It is fairly similar to your first commit.
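
For reference, here is a minimal sketch of the SchemaView usage mentioned above. `class_ancestors` is the real linkml_runtime API, but the schema file path is an assumption; point it at whatever nmdc-schema YAML artifact your environment provides.

# Minimal sketch of getting class ancestors via LinkML's SchemaView.
# The schema file path below is an assumption.
from linkml_runtime.utils.schemaview import SchemaView

schema_view = SchemaView("nmdc_schema/nmdc_materialized_patterns.yaml")

# Returns the class itself plus its ancestor class names, e.g. something
# like ["Study", "NamedThing"] (exact names depend on the schema version).
print(schema_view.class_ancestors("Study"))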

@sujaypatil96 (Collaborator) left a comment

Procedure followed to review this PR:

  • Checkout 690B-berkeley-dagster-job-ensure_alldocs-fails-with-assertionerror-1 branch
  • Run make up-dev
  • Download and restore the latest backup/dump of the Berkeley Mongo database into local Mongo: /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-berkeley_2024-09-19_20-12-01 (I had initially restored the prod dump, dump_nmdc-prod_2024-09-19_20-12-01, by mistake; see the correction below). Example commands are sketched after this list.
  • Note: The dump already has an alldocs collection present in it
  • Drop existing alldocs collection
  • Find ensure_alldocs job on Dagster/Dagit: http://localhost:3000/locations/repo@nmdc_runtime.site.repository%3Arepo/jobs/ensure_alldocs and run it using launchpad
  • Confirm the successful creation of an alldocs collection
  • Confirm that the number of documents in there is non-zero (I'm not sure what the exact expected count should be, but it looks correct)
  • Total number of documents created = 144,426
  • Execution time for ensure_alldocs job run ~ 16 s
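
For anyone reproducing this procedure, here are hypothetical forms of the restore and drop steps (hosts, ports, and paths are assumptions for a default local dev Mongo; adjust to your setup):

# Hypothetical commands; host/port/paths are assumptions, not the documented setup.
# Restore the Berkeley dump into a local Mongo server.
mongorestore --host localhost:27017 dump_nmdc-berkeley_2024-09-19_20-12-01
# Drop the pre-existing alldocs collection before running the ensure_alldocs job.
mongosh "mongodb://localhost:27017/nmdc" --eval "db.alldocs.drop()"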

Edit: Current status: running into an error.

[Screenshot 2024-09-20 at 11:39:09 AM: error encountered during the run]

@dwinston (Collaborator) commented Sep 20, 2024

>   • Download and restore latest backup/dump of the prod Mongo database into local Mongo: /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-prod_2024-09-19_20-12-01

@sujaypatil96 you want to restore a database from the "berkeley" deployment, e.g. /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-berkeley_2024-09-19_20-20-01.

Thank you for documenting your procedure so that I could zoom in on this aspect.

@sujaypatil96 self-requested a review September 20, 2024 18:26
@sujaypatil96 (Collaborator) left a comment
Oh great catch @dwinston, thank you! I've updated my review checklist above. I think this PR is good to go 🚀

Fantastic work @eecavanna 💕

@sujaypatil96 linked an issue Sep 20, 2024 that may be closed by this pull request
@eecavanna (Collaborator, Author) commented

Hi @PeopleMakeCulture, thanks for bringing that code to my attention. I like the use of bulk_write to perform the writes in batches. I also like the use of SchemaView as opposed to requiring a Python instance and calling a function that traverses the Python inheritance tree upward.

Yes, I think that approach would work (except I think I would omit the pick part, which I assume omits fields that are not references and are not named id).
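
(For clarity, the kind of `pick` helper I'm assuming is something like the following illustrative sketch; the notebook's actual implementation may differ.)

# Illustrative sketch of a `pick` helper: keep only the given keys.
def pick(keys, d):
    return {k: d[k] for k in keys if k in d}

pick(["id", "part_of"], {"id": "nmdc:sty-1", "part_of": [], "name": "x"})
# -> {"id": "nmdc:sty-1", "part_of": []}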

I'll merge this PR in as is and file a ticket about the fact that it deviates from the notebook, as I would like to standardize the various alldocs-generating approaches (standardize to the more robust approach), maybe even abstracting it into a common function.

@eecavanna (Collaborator, Author) commented

Thanks for testing it out on your end, @sujaypatil96! I'll merge this in.

As for this fix also fixing #689, I think there will be one code change required to fix the latter. I posted a comment about it in that Issue a few minutes ago.

@eecavanna merged commit 4f1ee79 into berkeley Sep 20, 2024
1 of 2 checks passed
@eecavanna deleted the 690B-berkeley-dagster-job-ensure_alldocs-fails-with-assertionerror-1 branch September 20, 2024 19:20
@PeopleMakeCulture (Collaborator) commented

> I'll merge this PR in as is and file a ticket about the fact that it deviates from the notebook, as I would like to standardize the various alldocs-generating approaches (standardize to the more robust approach), maybe even abstracting it into a common function.

@eecavanna Link to ticket pls?

@eecavanna (Collaborator, Author) commented

Here's a link to the ticket I was referring to: #696

Linked issue that may be closed by this pull request: berkeley: Dagster job ensure_alldocs fails with AssertionError