berkeley: Dagster job ensure_alldocs fails with AssertionError #690

Closed
eecavanna opened this issue Sep 19, 2024 · 4 comments · Fixed by #694
Labels: berkeley-schema (Related to making the Runtime work with the Berkeley schema), bug (Something isn't working)

Comments

@eecavanna
Collaborator

Today, I visited the Dagit instance in the Berkeley environment (nmdc-berkeley namespace on Spin) and tried running the ensure_alldocs job.

While the materialize_alldocs op was running, an error occurred. Here's a screenshot of the error message, followed by a copy/paste of the same error message:

[Screenshot: the error message shown in Dagit]

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "materialize_alldocs":

  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_plan.py", line 247, in dagster_event_sequence_for_step
    for step_event in check.generator(step_events):
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 500, in core_dagster_event_sequence_for_step
    for user_event in _step_output_error_checked_user_event_sequence(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 184, in _step_output_error_checked_user_event_sequence
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/execute_step.py", line 88, in _process_asset_results_to_events
    for user_event in user_event_sequence:
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute.py", line 198, in execute_core_compute
    for step_output in _yield_compute_results(step_context, inputs, compute_fn, compute_context):
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute.py", line 167, in _yield_compute_results
    for event in iterate_with_context(
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 476, in iterate_with_context
    with context_fn():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/utils.py", line 84, in op_execution_error_boundary
    raise error_cls(

The above exception was caused by the following exception:
AssertionError: configuration_set collection has class name of ['ChromatographyConfiguration', 'MassSpectrometryConfiguration'] and len 2

  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 478, in iterate_with_context
    next_output = next(iterator)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 141, in _coerce_op_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 129, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/opt/dagster/lib/nmdc_runtime/site/ops.py", line 1027, in materialize_alldocs
    len(collection_name_to_class_names[name]) == 1

I think this was the first time that job had been run in the Berkeley environment. That is based upon what I see here, on the "Runs" page of Dagit:

[Screenshot: the "Runs" page of Dagit]

Task

The task here is to make it so the ensure_alldocs job runs and an alldocs collection exists in the Berkeley Mongo database.

@eecavanna
Collaborator Author

This issue is causing the following downstream issue: #689

@aclum
Contributor

aclum commented Sep 19, 2024

There is no longer a 1:1 correspondence between collection names and class names. Since type is now universally enforced, we reduced the total number of collections.
For example, each first-level child of PlannedProcess now has a corresponding collection (see https://microbiomedata.github.io/berkeley-schema-fy24/PlannedProcess/),
so MaterialProcessing has a corresponding material_processing_set, etc.

This is high priority because it backs production endpoints and is needed for the NCBI export code.
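To illustrate why the assertion in the traceback fails, here is a minimal sketch that groups the class names found in each collection. The sample documents, their `id` values, the `nmdc:` prefix on `type`, and the helper `class_names_by_collection` are all assumptions made up for illustration; only the names `configuration_set`, `ChromatographyConfiguration`, `MassSpectrometryConfiguration`, and `collection_name_to_class_names` come from the error message above.

```python
# Hypothetical sample data: two collections, one of which holds documents
# of two different classes (as reported in the AssertionError).
sample_docs = {
    "configuration_set": [
        {"id": "cfg-1", "type": "nmdc:ChromatographyConfiguration"},
        {"id": "cfg-2", "type": "nmdc:MassSpectrometryConfiguration"},
    ],
    "study_set": [
        {"id": "sty-1", "type": "nmdc:Study"},
    ],
}


def class_names_by_collection(docs_by_collection):
    """Return a sorted list of distinct class names for each collection."""
    result = {}
    for name, docs in docs_by_collection.items():
        # Strip the (assumed) "nmdc:" prefix from each document's type.
        class_names = {doc["type"].split(":")[-1] for doc in docs}
        result[name] = sorted(class_names)
    return result


collection_name_to_class_names = class_names_by_collection(sample_docs)

# Collections that would trip a `len(...) == 1` check.
multi_class = [
    name
    for name, class_names in collection_name_to_class_names.items()
    if len(class_names) != 1
]
print(multi_class)
```

Any check that expects exactly one class per collection will fail for every collection listed in `multi_class`.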

@mslarae13 mslarae13 moved this from 📝 Todo to 🏗 In Progress in Berkeley-Schema Refactor Roll Out Sep 19, 2024
@eecavanna
Collaborator Author

I think @PeopleMakeCulture, @sujaypatil96, and I have a solid plan for fixing this. It will make the generation of alldocs take several minutes longer, because it involves processing documents one by one instead of treating every document in a collection as though it had the same class hierarchy. I plan to prototype the fix later today.
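A rough sketch of the per-document idea, not the actual fix in the linked pull request: each document's own `type` is resolved to its class lineage, rather than one lineage being assumed per collection. The `PARENT_OF` map, the `Configuration` parent class, the helper names, the sample `id` values, and the shape of the output entries are all hypothetical; a real implementation would presumably read the hierarchy from the schema.

```python
# Hypothetical class hierarchy (child -> parent); None marks a root class.
PARENT_OF = {
    "ChromatographyConfiguration": "Configuration",
    "MassSpectrometryConfiguration": "Configuration",
    "Configuration": None,
}


def class_and_ancestors(class_name):
    """Return the class name followed by its ancestors, nearest first."""
    lineage = []
    current = class_name
    while current is not None:
        lineage.append(current)
        current = PARENT_OF.get(current)
    return lineage


def to_alldocs_entries(docs):
    """Build one entry per document, using that document's own type."""
    entries = []
    for doc in docs:
        class_name = doc["type"].split(":")[-1]  # drop the assumed "nmdc:" prefix
        entries.append({"id": doc["id"], "type": class_and_ancestors(class_name)})
    return entries


entries = to_alldocs_entries(
    [
        {"id": "cfg-1", "type": "nmdc:ChromatographyConfiguration"},
        {"id": "cfg-2", "type": "nmdc:MassSpectrometryConfiguration"},
    ]
)
for entry in entries:
    print(entry)
```

Because the lookup happens once per document, the run time grows with the number of documents, which matches the expectation that the job will take longer.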

@eecavanna
Collaborator Author

A fix is ready for review in #694.
