`berkeley`: Function `ensure_schema_collections_and_alldocs` is incompatible with Berkeley schema #658

aclum · 2024-08-29T00:35:27Z

This came up in a PR where Patrick was working on updating ETL code.
there are errors about associated_studies being missing from biosample_set, upon further inspection it is trying to compare prod mongo, I inferred this based on it pulling in collection names which don't exist in berkeley like omics_processing_set, with the berkeley nmdc-schema release candidate release

We need an alldocs in mongo berkeley and for this test to point to that.

cc @dwinston @eecavanna @pkalita-lbl

The text was updated successfully, but these errors were encountered:

aclum · 2024-08-29T00:36:11Z

hoping Donny or Eric can look at this in @PeopleMakeCulture absence. cc @shreddd

eecavanna · 2024-08-29T00:45:16Z

Here's an excerpt from the logs of the GitHub Actions workflow run that @aclum linked to above (thanks!):

Show/hide excerpt

wait-for-it.sh: fastapi:8000 is available after 0 seconds
/usr/local/lib/python3.10/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.3.2, pluggy-1.5.0
rootdir: /code
plugins: anyio-4.4.0, asyncio-0.24.0, requests-mock-1.12.1, cov-5.0.0, mock-3.14.0
asyncio: mode=strict, default_loop_scope=None
collected 109 items

demo/metadata_migration/notebooks/test_helpers.py .                      [  0%]
tests/e2e/test_minter_api.py xxxx                                        [  4%]
tests/integration/test_minter_repository.py ....X...                     [ 11%]
tests/test_api/test_db_mongo.py .....                                    [ 16%]
tests/test_api/test_endpoints.py ssE

==================================== ERRORS ====================================
_______________ ERROR at setup of test_metadata_validate_json_0 ________________

    @pytest.fixture
    def api_site_client():
        mdb = get_mongo_db()
>       rs = ensure_test_resources(mdb)

tests/test_api/test_endpoints.py:204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_api/test_endpoints.py:121: in ensure_test_resources
    ensure_schema_collections_and_alldocs()
tests/test_api/test_endpoints.py:67: in ensure_schema_collections_and_alldocs
    materialize_alldocs(
/usr/local/lib/python3.10/site-packages/dagster/_core/definitions/op_definition.py:457: in __call__
    return direct_invocation_result(self, *args, **kwargs)
/usr/local/lib/python3.10/site-packages/dagster/_core/definitions/op_invocation.py:217: in direct_invocation_result
    result = invoke_compute_fn(
/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py:129: in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
nmdc_runtime/site/ops.py:1025: in materialize_alldocs
    raise e
nmdc_runtime/site/ops.py:1018: in materialize_alldocs
    nmdcdb = NMDCDatabase(
<string>:21: in __init__
    ???
/usr/local/lib/python3.10/site-packages/nmdc_schema/nmdc.py:440: in __post_init__
    self._normalize_inlined_as_list(slot_name="biosample_set", slot_type=Biosample, key_name="id", keyed=True)
/usr/local/lib/python3.10/site-packages/linkml_runtime/utils/yamlutils.py:105: in _normalize_inlined_as_list
    self._normalize_inlined(slot_name, slot_type, key_name, keyed, True)
/usr/local/lib/python3.10/site-packages/linkml_runtime/utils/yamlutils.py:190: in _normalize_inlined
    cooked_obj = slot_type(**as_dict(list_entry))
<string>:589: in __init__
    ???
/usr/local/lib/python3.10/site-packages/nmdc_schema/nmdc.py:2120: in __post_init__
    self.MissingRequiredField("associated_studies")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Biosample(id='nmdc:bsm-13-amrnys72', type='nmdc:Biosample', name='Sand microcosm microbial communities from a hyporhei...ample_type=None, technical_reps=None, analysis_type=[], sample_link=[], bulk_elect_conductivity=None, infiltrations=[])
field_name = 'associated_studies'

    def MissingRequiredField(self, field_name: str) -> None:
        """ Generic loader error handler """
>       raise ValueError(f"{field_name} must be supplied")
E       ValueError: associated_studies must be supplied

/usr/local/lib/python3.10/site-packages/linkml_runtime/utils/yamlutils.py:281: ValueError
---------------------------- Captured stdout setup -----------------------------
local mongodump already exists at /code/tests/nmdcdb

For reference, that PR's base branch is the berkeley branch.

eecavanna · 2024-08-29T00:52:21Z

The error occurred during the "setup" phase of the following test:

test_metadata_validate_json_0

nmdc-runtime/tests/test_api/test_endpoints.py

Lines 208 to 220 in 2c300cc

    
           def test_metadata_validate_json_0(api_site_client): 
        
               rv = api_site_client.request( 
        
                   "POST", 
        
                   "/metadata/json:validate", 
        
                   { 
        
                       "field_research_site_set": [ 
        
                           {"id": "nmdc:frsite-11-s2dqk408", "name": "BESC-470-CL2_38_23"}, 
        
                           {"id": "nmdc:frsite-11-s2dqk408", "name": "BESC-470-CL2_38_23"}, 
        
                           {"id": "nmdc:frsite-11-s2dqk408", "name": "BESC-470-CL2_38_23"}, 
        
                       ] 
        
                   }, 
        
               ) 
        
               assert rv.json()["result"] == "errors"

Specifically, it occurred while the following fixture was being built:

@pytest.fixture
def api_site_client():
    mdb = get_mongo_db()
    rs = ensure_test_resources(mdb)  # <-- the error occurred here
    return RuntimeApiSiteClient(base_url=os.getenv("API_HOST"), **rs["site_client"])

ensure_test_resources is defined here. Among other things, it calls a function named ensure_schema_collections_and_alldocs, which is defined here.

eecavanna · 2024-08-29T01:00:07Z

Here's an excerpt from the ensure_schema_collections_and_alldocs function:

    dump_dir = ensure_local_mongodump_exists()
    mongorestore_from_dir(mdb, dump_dir, skip_collections=["functional_annotation_agg"])

It calls a function named ensure_local_mongodump_exists (side note: I prefer the inclusion of this _exists suffix, which says what about it someone is ensuring).

That function (and the constant it accesses) is defined here:

nmdc-runtime/tests/test_api/test_endpoints.py

Lines 31 to 51 in 2c300cc

    
           TEST_MONGODUMPS_DIR = REPO_ROOT_DIR.joinpath("tests", "nmdcdb") 
        
           SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_BASENAME = ( 
        
               "nmdc-prod-schema-collections__2024-07-29_20-12-07" 
        
           ) 
        
           SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_URL = ( 
        
               "https://portal.nersc.gov/cfs/m3408/meta/mongodumps/" 
        
               f"{SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_BASENAME}.tar" 
        
           )  # 84MB. Should be < 100MB. 
        
           def ensure_local_mongodump_exists(): 
        
               dump_dir = TEST_MONGODUMPS_DIR.joinpath( 
        
                   SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_BASENAME 
        
               ) 
        
               if not os.path.exists(dump_dir): 
        
                   download_and_extract_tar( 
        
                       url=SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_URL, extract_to=TEST_MONGODUMPS_DIR 
        
                   ) 
        
               else: 
        
                   print(f"local mongodump already exists at {TEST_MONGODUMPS_DIR}") 
        
               return dump_dir

Looks like that function downloads a pre-made database dump from the NERSC filesystem.

eecavanna · 2024-08-29T01:03:58Z

The value of SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_BASENAME is the filename of a dump of the production database, specifically. I think this is consistent with what @aclum reported in the issue description.

eecavanna · 2024-08-29T01:07:19Z

In other words, I think this couples the test suite to the specific schema that happened to be in effect when that specific dump was generated.

I think this will be insufficient both for (a) the legacy-to-Berkeley roll out we are working on now; and (b) changes to certain aspects of the schema in the future.

eecavanna · 2024-08-29T01:09:35Z

The "coupling" (as I referred to it in the previous message) was introduced in this commit. I will assign to @dwinston to take a pass at resolving it.

eecavanna · 2024-08-29T19:56:03Z

During today's infrastructure meeting, we discussed the possibility of disabling this problematic code (which may involve disabling some tests) in order to unblock some current work; and developers can work on implementing a long term fix in parallel. I'll look into this with that path forward in mind today.

eecavanna · 2024-08-31T07:30:03Z

I created a PR in which I disabled the problematic code. The PR is here: #664

aclum added this to Berkeley-Schema Refactor Roll Out Aug 29, 2024

aclum mentioned this issue Aug 29, 2024

berkeley: Complete conversion of SubmissionPortalTranslator #657

Merged

10 tasks

eecavanna added this to 2024 - Sprint 44 - August 26 - September 6, 2024 Aug 29, 2024

eecavanna assigned dwinston Aug 29, 2024

eecavanna changed the title ~~berkeley ensure_schema_collections_and_alldocs test failing~~ berkeley: Test ensure_schema_collections_and_alldocs is failing (mismatch between DB dump and schema) Aug 29, 2024

eecavanna changed the title ~~berkeley: Test ensure_schema_collections_and_alldocs is failing (mismatch between DB dump and schema)~~ berkeley: Function ensure_schema_collections_and_alldocs causing test failures (mismatch between DB and schema) Aug 29, 2024

eecavanna added this to the Berkeley schema compatibility 🤝 milestone Aug 29, 2024

eecavanna mentioned this issue Aug 30, 2024

Merge main into berkeley (routine, August 29, 2024) #661

Merged

eecavanna changed the title ~~berkeley: Function ensure_schema_collections_and_alldocs causing test failures (mismatch between DB and schema)~~ berkeley: Function ensure_schema_collections_and_alldocs is incompatible with Berkeley schema (causing test failures) Aug 30, 2024

ssarrafan moved this to 📝 Todo in Berkeley-Schema Refactor Roll Out Aug 30, 2024

ssarrafan moved this to Todo in 2024 - Sprint 44 - August 26 - September 6, 2024 Aug 30, 2024

eecavanna linked a pull request Aug 31, 2024 that will close this issue

Remove test code's dependence on a static Mongo dump #664

Merged

eecavanna moved this from Todo to In Review in 2024 - Sprint 44 - August 26 - September 6, 2024 Aug 31, 2024

eecavanna moved this from 📝 Todo to 👀 In Review in Berkeley-Schema Refactor Roll Out Aug 31, 2024

dwinston added this to Polyneme mixset Sep 3, 2024

dwinston moved this to At bat in Polyneme mixset Sep 3, 2024

dwinston closed this as completed in #664 Sep 3, 2024

github-project-automation bot moved this from At bat to Scored in Polyneme mixset Sep 3, 2024

github-project-automation bot moved this from 👀 In Review to ✅ Done in Berkeley-Schema Refactor Roll Out Sep 3, 2024

github-project-automation bot moved this from In Review to Done in 2024 - Sprint 44 - August 26 - September 6, 2024 Sep 3, 2024

eecavanna removed this from the Berkeley schema compatibility 🤝 milestone Sep 18, 2024

eecavanna mentioned this issue Sep 18, 2024

Update the Runtime to work with the Berkeley schema (meta issue) 📚 #577

Closed

38 tasks

eecavanna changed the title ~~berkeley: Function ensure_schema_collections_and_alldocs is incompatible with Berkeley schema (causing test failures)~~ berkeley: Function ensure_schema_collections_and_alldocs is incompatible with Berkeley schema Sep 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`berkeley`: Function `ensure_schema_collections_and_alldocs` is incompatible with Berkeley schema #658

`berkeley`: Function `ensure_schema_collections_and_alldocs` is incompatible with Berkeley schema #658

aclum commented Aug 29, 2024 •

edited

Loading

aclum commented Aug 29, 2024

eecavanna commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024

eecavanna commented Aug 29, 2024

eecavanna commented Aug 31, 2024

berkeley: Function ensure_schema_collections_and_alldocs is incompatible with Berkeley schema #658

berkeley: Function ensure_schema_collections_and_alldocs is incompatible with Berkeley schema #658

Comments

aclum commented Aug 29, 2024 • edited Loading

aclum commented Aug 29, 2024

eecavanna commented Aug 29, 2024 • edited Loading

eecavanna commented Aug 29, 2024 • edited Loading

eecavanna commented Aug 29, 2024 • edited Loading

eecavanna commented Aug 29, 2024 • edited Loading

eecavanna commented Aug 29, 2024 • edited Loading

eecavanna commented Aug 29, 2024

eecavanna commented Aug 29, 2024

eecavanna commented Aug 31, 2024

`berkeley`: Function `ensure_schema_collections_and_alldocs` is incompatible with Berkeley schema #658

`berkeley`: Function `ensure_schema_collections_and_alldocs` is incompatible with Berkeley schema #658

aclum commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024 •

edited

Loading

eecavanna commented Aug 29, 2024 •

edited

Loading