-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
berkeley-schema-fy24
: id slot_usage resulting in badly minted IDs
#2122
Comments
berkeley-schema-fy24
: id slots have extra parenthesesberkeley-schema-fy24
: id slot_usage resulting in badly minted IDs
Thanks for reporting this, @kheal, prefixing the issue name with You have already CC'ed @dwinston. I'll also CC @PeopleMakeCulture so she is aware of the issue. I don't know how the minter gets that value to include in the ID it mints. I'm going to add this to "Update the Runtime to work with the Berkeley schema" meta issue, although—like you—I don't know where this will be fixed (in terms of the schema versus the runtime, or both). |
I think parentheses wrapped, pipe-separated list of typecodes are pretty common in the schema. I'm surprised we haven't seen problems earlier. In any case, I agree that |
In grep -c -R 'syntax: "{id_nmdc_prefix}:' src/schema/*
grep -c -R "syntax: '{id_nmdc_prefix}:" src/schema/*
|
similar pattern in |
The parentheses wrapped, pipe-separated list of typecodes are appropriate for the structured pattern checks that linkML does, but the minter seems to use the id slot to make ids. As written, how would the minter know which of the type codes to use in |
@kheal can you please show some minting API calls that result in useful vs mangled |
@brynnz22 was doing the actual minting while I was schema hunting to track down the source. Brynn, could you add an example API call for
|
swagger minting reminder:
aside: do we have any better specifications on how many characters can be in a typecode?
6 characters appears as the max in a few suggestions. These are meant to be very short mnemonics. For example, nmdc-schema typecodes
berkeley-schema-fy24 typecodes
aside: I would like us to pick a single minting authority code for all of our example data, like 00 or 99. |
Well I'm certainly in favor of removing the un-necessary parentheses wrappers from both schemas right away if you want. This works in PyCharm's find in files:
in
that pattern doesn't appear in |
@aclum do you think |
I have no problem shortening the typecodes for |
This is from The original motivation, decided at the berkeley in person meeting, behind supporting both (dgms|omprc) and (dgns|omprc) was to support the legacy typecodes so we didn't have to change ids when typecode changed. Doing otherwise sets a bad precedent in which our identifiers are potentially not persistent. I still that decision which means we'll need to support the legacy typecodes going forward. |
Thanks @aclum. I'll summarize what I think of are actionable steps going forward to close this issue.
I will fix the first two so we can review and merge during the metadata meeting tomorrow to unblock refactoring work in progress. I'll convert the third into a separate issue in the nmdc runtime repo and let @eecavanna or @dwinston address that issue. |
@turbomam if you want to test and get the API calls you can go the Berkeley runtime and mint these ids to see if you haven't already done so: https://api-berkeley.microbiomedata.org/docs#/minter/mint_ids_pids_mint_post |
FYI: I created a PR in |
Thanks @brynnz22 ! see also |
As mentioned here (verbatim): The Minter in the Berkeley environment has been updated to (a) tolerate one layer of extraneous parentheses and (b) to use the first typecode — |
We can keep a "legacy" identifier persistent if (a) we keep it bound to its entity, e.g. move it from The consequence of the above would be that the schema is (or rather, returns to being) unambiguous about ID format for minting and validation, and the case of legacy IDs becomes a subset of the case of alternative IDs. |
I'd strongly discourage changing the primary ID on a regular basis, the IDs get used in file names and file headers and propagate to downstream systems like IMG and we don't have logic with systems like the data portal yet to search alternative identifiers. |
In
berkeley-schema-fy24
, some of the syntax declarations on ids of classes result in extra parentheses during minting on the berkeley-schema nmdc runtime. This results in badly minted ids for instances of new classes.TLDR.
Actions to take to fix this bug:
MassSpectrometryConfiguration
andChromatographyConfiguration
(in berkeley schema repo)berkeley
: Update minter to use first typecode when pattern contains multiple typecodes nmdc-runtime#592For example:
Example minted ID of
ChromatographyConfiguration
= nmdc:(chrocon)-14-bwmbj819.That can be fixed easily with a schema change by removing the () around chrocon, as classes without this result in well minted IDs (example
Instrument
class mints to nmdc:inst-14-3p938v35 for example)However instances of
DataGeneration
:NucleotideSequencing
are less easily fixedExample minted ID of
NucleotideSequencing
= nmdc:(dgms|omprc)-14-2204gm61. 🙀My guess is that the minter doesn't know which to choose of those options (dgms|omprc).
My understanding is that we will need to keep the flexibility on the id prefixes for referential integrity, so I do not have a solution forward to fixing this example.
I am not sure where to fix this - on the schema or in the minter or some combination of both?
@brynnz22, @eecavanna, @aclum, @dwinston, @turbomam - I'm actually not sure who all needs to be in on this conversation since it's a schema:minter intersection issue.
The text was updated successfully, but these errors were encountered: