Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

berkeley: Function ensure_schema_collections_and_alldocs is incompatible with Berkeley schema #658

Closed
aclum opened this issue Aug 29, 2024 · 9 comments · Fixed by #664
Closed
Assignees

Comments

@aclum
Copy link
Contributor

aclum commented Aug 29, 2024

This came up in a PR where Patrick was working on updating ETL code.
there are errors about associated_studies being missing from biosample_set, upon further inspection it is trying to compare prod mongo, I inferred this based on it pulling in collection names which don't exist in berkeley like omics_processing_set, with the berkeley nmdc-schema release candidate release

We need an alldocs in mongo berkeley and for this test to point to that.

cc @dwinston @eecavanna @pkalita-lbl

@aclum
Copy link
Contributor Author

aclum commented Aug 29, 2024

hoping Donny or Eric can look at this in @PeopleMakeCulture absence. cc @shreddd

@eecavanna
Copy link
Collaborator

eecavanna commented Aug 29, 2024

Here's an excerpt from the logs of the GitHub Actions workflow run that @aclum linked to above (thanks!):

Show/hide excerpt
wait-for-it.sh: fastapi:8000 is available after 0 seconds
/usr/local/lib/python3.10/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.3.2, pluggy-1.5.0
rootdir: /code
plugins: anyio-4.4.0, asyncio-0.24.0, requests-mock-1.12.1, cov-5.0.0, mock-3.14.0
asyncio: mode=strict, default_loop_scope=None
collected 109 items

demo/metadata_migration/notebooks/test_helpers.py .                      [  0%]
tests/e2e/test_minter_api.py xxxx                                        [  4%]
tests/integration/test_minter_repository.py ....X...                     [ 11%]
tests/test_api/test_db_mongo.py .....                                    [ 16%]
tests/test_api/test_endpoints.py ssE

==================================== ERRORS ====================================
_______________ ERROR at setup of test_metadata_validate_json_0 ________________

    @pytest.fixture
    def api_site_client():
        mdb = get_mongo_db()
>       rs = ensure_test_resources(mdb)

tests/test_api/test_endpoints.py:204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/test_api/test_endpoints.py:121: in ensure_test_resources
    ensure_schema_collections_and_alldocs()
tests/test_api/test_endpoints.py:67: in ensure_schema_collections_and_alldocs
    materialize_alldocs(
/usr/local/lib/python3.10/site-packages/dagster/_core/definitions/op_definition.py:457: in __call__
    return direct_invocation_result(self, *args, **kwargs)
/usr/local/lib/python3.10/site-packages/dagster/_core/definitions/op_invocation.py:217: in direct_invocation_result
    result = invoke_compute_fn(
/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py:129: in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
nmdc_runtime/site/ops.py:1025: in materialize_alldocs
    raise e
nmdc_runtime/site/ops.py:1018: in materialize_alldocs
    nmdcdb = NMDCDatabase(
<string>:21: in __init__
    ???
/usr/local/lib/python3.10/site-packages/nmdc_schema/nmdc.py:440: in __post_init__
    self._normalize_inlined_as_list(slot_name="biosample_set", slot_type=Biosample, key_name="id", keyed=True)
/usr/local/lib/python3.10/site-packages/linkml_runtime/utils/yamlutils.py:105: in _normalize_inlined_as_list
    self._normalize_inlined(slot_name, slot_type, key_name, keyed, True)
/usr/local/lib/python3.10/site-packages/linkml_runtime/utils/yamlutils.py:190: in _normalize_inlined
    cooked_obj = slot_type(**as_dict(list_entry))
<string>:589: in __init__
    ???
/usr/local/lib/python3.10/site-packages/nmdc_schema/nmdc.py:2120: in __post_init__
    self.MissingRequiredField("associated_studies")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Biosample(id='nmdc:bsm-13-amrnys72', type='nmdc:Biosample', name='Sand microcosm microbial communities from a hyporhei...ample_type=None, technical_reps=None, analysis_type=[], sample_link=[], bulk_elect_conductivity=None, infiltrations=[])
field_name = 'associated_studies'

    def MissingRequiredField(self, field_name: str) -> None:
        """ Generic loader error handler """
>       raise ValueError(f"{field_name} must be supplied")
E       ValueError: associated_studies must be supplied

/usr/local/lib/python3.10/site-packages/linkml_runtime/utils/yamlutils.py:281: ValueError
---------------------------- Captured stdout setup -----------------------------
local mongodump already exists at /code/tests/nmdcdb

For reference, that PR's base branch is the berkeley branch.

@eecavanna
Copy link
Collaborator

eecavanna commented Aug 29, 2024

The error occurred during the "setup" phase of the following test:

test_metadata_validate_json_0

def test_metadata_validate_json_0(api_site_client):
rv = api_site_client.request(
"POST",
"/metadata/json:validate",
{
"field_research_site_set": [
{"id": "nmdc:frsite-11-s2dqk408", "name": "BESC-470-CL2_38_23"},
{"id": "nmdc:frsite-11-s2dqk408", "name": "BESC-470-CL2_38_23"},
{"id": "nmdc:frsite-11-s2dqk408", "name": "BESC-470-CL2_38_23"},
]
},
)
assert rv.json()["result"] == "errors"

Specifically, it occurred while the following fixture was being built:

@pytest.fixture
def api_site_client():
    mdb = get_mongo_db()
    rs = ensure_test_resources(mdb)  # <-- the error occurred here
    return RuntimeApiSiteClient(base_url=os.getenv("API_HOST"), **rs["site_client"])

ensure_test_resources is defined here. Among other things, it calls a function named ensure_schema_collections_and_alldocs, which is defined here.

@eecavanna
Copy link
Collaborator

eecavanna commented Aug 29, 2024

Here's an excerpt from the ensure_schema_collections_and_alldocs function:

    dump_dir = ensure_local_mongodump_exists()
    mongorestore_from_dir(mdb, dump_dir, skip_collections=["functional_annotation_agg"])

It calls a function named ensure_local_mongodump_exists (side note: I prefer the inclusion of this _exists suffix, which says what about it someone is ensuring).

That function (and the constant it accesses) is defined here:

TEST_MONGODUMPS_DIR = REPO_ROOT_DIR.joinpath("tests", "nmdcdb")
SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_BASENAME = (
"nmdc-prod-schema-collections__2024-07-29_20-12-07"
)
SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_URL = (
"https://portal.nersc.gov/cfs/m3408/meta/mongodumps/"
f"{SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_BASENAME}.tar"
) # 84MB. Should be < 100MB.
def ensure_local_mongodump_exists():
dump_dir = TEST_MONGODUMPS_DIR.joinpath(
SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_BASENAME
)
if not os.path.exists(dump_dir):
download_and_extract_tar(
url=SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_URL, extract_to=TEST_MONGODUMPS_DIR
)
else:
print(f"local mongodump already exists at {TEST_MONGODUMPS_DIR}")
return dump_dir

Looks like that function downloads a pre-made database dump from the NERSC filesystem.

@eecavanna
Copy link
Collaborator

eecavanna commented Aug 29, 2024

The value of SCHEMA_COLLECTIONS_MONGODUMP_ARCHIVE_BASENAME is the filename of a dump of the production database, specifically. I think this is consistent with what @aclum reported in the issue description.

@eecavanna
Copy link
Collaborator

eecavanna commented Aug 29, 2024

In other words, I think this couples the test suite to the specific schema that happened to be in effect when that specific dump was generated.

I think this will be insufficient both for (a) the legacy-to-Berkeley roll out we are working on now; and (b) changes to certain aspects of the schema in the future.

@eecavanna
Copy link
Collaborator

The "coupling" (as I referred to it in the previous message) was introduced in this commit. I will assign to @dwinston to take a pass at resolving it.

@eecavanna eecavanna changed the title berkeley ensure_schema_collections_and_alldocs test failing berkeley: Test ensure_schema_collections_and_alldocs is failing (mismatch between DB dump and schema) Aug 29, 2024
@eecavanna eecavanna changed the title berkeley: Test ensure_schema_collections_and_alldocs is failing (mismatch between DB dump and schema) berkeley: Function ensure_schema_collections_and_alldocs causing test failures (mismatch between DB and schema) Aug 29, 2024
@eecavanna eecavanna added this to the Berkeley schema compatibility 🤝 milestone Aug 29, 2024
@eecavanna
Copy link
Collaborator

During today's infrastructure meeting, we discussed the possibility of disabling this problematic code (which may involve disabling some tests) in order to unblock some current work; and developers can work on implementing a long term fix in parallel. I'll look into this with that path forward in mind today.

@eecavanna eecavanna changed the title berkeley: Function ensure_schema_collections_and_alldocs causing test failures (mismatch between DB and schema) berkeley: Function ensure_schema_collections_and_alldocs is incompatible with Berkeley schema (causing test failures) Aug 30, 2024
@eecavanna eecavanna linked a pull request Aug 31, 2024 that will close this issue
@eecavanna eecavanna moved this from 📝 Todo to 👀 In Review in Berkeley-Schema Refactor Roll Out Aug 31, 2024
@eecavanna
Copy link
Collaborator

I created a PR in which I disabled the problematic code. The PR is here: #664

@dwinston dwinston moved this to At bat in Polyneme mixset Sep 3, 2024
@github-project-automation github-project-automation bot moved this from At bat to Scored in Polyneme mixset Sep 3, 2024
@github-project-automation github-project-automation bot moved this from 👀 In Review to ✅ Done in Berkeley-Schema Refactor Roll Out Sep 3, 2024
@eecavanna eecavanna removed this from the Berkeley schema compatibility 🤝 milestone Sep 18, 2024
@eecavanna eecavanna changed the title berkeley: Function ensure_schema_collections_and_alldocs is incompatible with Berkeley schema (causing test failures) berkeley: Function ensure_schema_collections_and_alldocs is incompatible with Berkeley schema Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants