`berkeley-schema-fy24`: id slot_usage resulting in badly minted IDs #2122

kheal · 2024-07-16T00:34:02Z

In berkeley-schema-fy24, some of the syntax declarations on ids of classes result in extra parentheses during minting on the berkeley-schema nmdc runtime. This results in badly minted ids for instances of new classes.

TLDR.
Actions to take to fix this bug:

Shorten id typecodes to adhere to nmdc standards and update CONTRIBUTING guide accordingly (in berkeley schema repo)
Remove extraneous parentheses for ids on definition of MassSpectrometryConfiguration and ChromatographyConfiguration (in berkeley schema repo)
berkeley: Update minter to use first typecode when pattern contains multiple typecodes nmdc-runtime#592

For example:

  ChromatographyConfiguration:
    is_a: Configuration
    class_uri: nmdc:ChromatographyConfiguration
    slot_usage:
      id:
        structured_pattern:
          syntax: "{id_nmdc_prefix}:(chrocon)-{id_shoulder}-{id_blade}$"
          interpolated: true

Example minted ID of ChromatographyConfiguration = nmdc:(chrocon)-14-bwmbj819.
That can be fixed easily with a schema change by removing the () around chrocon, as classes without this result in well minted IDs (example Instrument class mints to nmdc:inst-14-3p938v35 for example)

However instances of DataGeneration:NucleotideSequencing are less easily fixed

  NucleotideSequencing:
    class_uri: nmdc:NucleotideSequencing
    is_a: DataGeneration
    slot_usage:
      id:
        structured_pattern:
          syntax: "{id_nmdc_prefix}:(dgns|omprc)-{id_shoulder}-{id_blade}$"
          interpolated: true

Example minted ID of NucleotideSequencing = nmdc:(dgms|omprc)-14-2204gm61. 🙀
My guess is that the minter doesn't know which to choose of those options (dgms|omprc).
My understanding is that we will need to keep the flexibility on the id prefixes for referential integrity, so I do not have a solution forward to fixing this example.

I am not sure where to fix this - on the schema or in the minter or some combination of both?

@brynnz22, @eecavanna, @aclum, @dwinston, @turbomam - I'm actually not sure who all needs to be in on this conversation since it's a schema:minter intersection issue.

The text was updated successfully, but these errors were encountered:

eecavanna · 2024-07-16T00:50:40Z

Thanks for reporting this, @kheal, prefixing the issue name with berkeley-schema-fy24:, and including the (dgns|omprc) example.

You have already CC'ed @dwinston. I'll also CC @PeopleMakeCulture so she is aware of the issue.

I don't know how the minter gets that value to include in the ID it mints. I'm going to add this to "Update the Runtime to work with the Berkeley schema" meta issue, although—like you—I don't know where this will be fixed (in terms of the schema versus the runtime, or both).

turbomam · 2024-07-16T16:39:03Z

I think parentheses wrapped, pipe-separated list of typecodes are pretty common in the schema. I'm surprised we haven't seen problems earlier.

In any case, I agree that (chrocon) isn't a good style. That was probably my fault., But it is a reasonable regular expression and I would expect the minter to parse it (and the NucleotideSequencing pattern) correctly.

turbomam · 2024-07-16T16:42:34Z

In nmdc-schema

grep -c -R 'syntax: "{id_nmdc_prefix}:' src/schema/*

src/schema/annotation.yaml:1
src/schema/basic_slots.yaml:0
src/schema/core.yaml:4
src/schema/external_identifiers.yaml:0
src/schema/mixs.yaml:0
src/schema/nmdc.yaml:37
src/schema/portal/emsl.yaml:0
src/schema/portal/jgi_metagenomics.yaml:0
src/schema/portal/.gitkeep:0
src/schema/portal/mixs_inspired.yaml:0
src/schema/portal/jgi_metatranscriptomics.yaml:0
src/schema/portal/sample_id.yaml:0
src/schema/prov.yaml:3
src/schema/workflow_execution_activity.yaml:16

grep -c -R "syntax: '{id_nmdc_prefix}:" src/schema/*

src/schema/annotation.yaml:0
src/schema/basic_slots.yaml:0
src/schema/core.yaml:0
src/schema/external_identifiers.yaml:0
src/schema/mixs.yaml:0
src/schema/nmdc.yaml:1
src/schema/portal/emsl.yaml:0
src/schema/portal/jgi_metagenomics.yaml:0
src/schema/portal/.gitkeep:0
src/schema/portal/mixs_inspired.yaml:0
src/schema/portal/jgi_metatranscriptomics.yaml:0
src/schema/portal/sample_id.yaml:0
src/schema/prov.yaml:0
src/schema/workflow_execution_activity.yaml:0

turbomam · 2024-07-16T16:43:55Z

similar pattern in berkeley-schema-fy24

kheal · 2024-07-16T16:46:12Z

The parentheses wrapped, pipe-separated list of typecodes are appropriate for the structured pattern checks that linkML does, but the minter seems to use the id slot to make ids. As written, how would the minter know which of the type codes to use in {id_nmdc_prefix}:(dgns|omprc)-{id_shoulder}-{id_blade}$ when generating the IDs. Therefore, I think the issue is only when the syntax {id_nmdc_prefix}:( is declared on the id.structured_pattern.syntax slot.

turbomam · 2024-07-16T16:47:17Z

@kheal can you please show some minting API calls that result in useful vs mangled id minting?

kheal · 2024-07-16T16:51:56Z

@brynnz22 was doing the actual minting while I was schema hunting to track down the source.

Brynn, could you add an example API call for

Instrument [useful]
MassSpectrometryConfiguration [mangled, but easily fixable with schema mods]
MassSpectrometry [mangled, and not easily fixable with schema mods alone]

turbomam · 2024-07-16T16:53:22Z

swagger minting reminder:

go to https://api.microbiomedata.org/docs, https://api-berkeley.microbiomedata.org/docs etc.
click on the authorize button towards the upper right
use the second form (OAuth2PasswordOrClientCredentialsBearer (OAuth2, clientCredentials)) to enter site, not user, credentials
close the authorization dialog box without logging out
go to the /pids/mint section on the swagger page
expand the POST /pids/mint box
click the try it out button on the right
change the schema_class.id value if desired
click the Execute button

aside: do we have any better specifications on how many characters can be in a typecode?

6 characters appears as the max in a few suggestions. These are meant to be very short mnemonics. For example, storpro could follow the example of poolp, ie pp not pro. chrocon could be chrcon

nmdc-schema typecodes

Typecode Name	Length
bsmprc	6
clsite	6
filtpr	6
frsite	6
libprp	6
procsm	6
wfmgan	6
wfmgas	6
wfmtan	6
wfmtas	6
extrp	5
filtpr	5
omprc	5
poolp	5
wfmag	5
wfmsa	5
wfnom	5
wfrbt	5
wfrqc	5
ansm	4
dobj	4
wfmb	4
wfmp	4
wfmt	4
bsm	3
sty	3

berkeley-schema-fy24 typecodes

Typecode Name	Length
(chrocon)	7
storpro	7
clsite	6
frsite	6
libprp	6
mixpro	6
procsm	6
subspr	6
wfmgan	6
wfmgas	6
wfmtan	6
wfmtas	6
calib	5
chcpr	5
cspro	5
(dgms\|omprc)	5
(dgns\|omprc)	5
dispro	5
extrp	5
filtpr	5
(mscon)	5
poolp	5
wfmag	5
wfmsa	5
wfnom	5
wfrbt	5
wfrqc	5
dobj	4
inst	4
wfmb	4
wfmp	4
wfmt	4
bsm	3
pex	3
sty	3

aside: I would like us to pick a single minting authority code for all of our example data, like 00 or 99.

turbomam · 2024-07-16T17:05:18Z

Well I'm certainly in favor of removing the un-necessary parentheses wrappers from both schemas right away if you want.

This works in PyCharm's find in files:

id\:\s+structured_pattern\:\s+syntax: "\{id_nmdc_prefix\}\:\(

in berkeley-schema-fy24

NucleotideSequencing "{id_nmdc_prefix}:(dgns|omprc)-{id_shoulder}-{id_blade}$"
MassSpectrometry "{id_nmdc_prefix}:(dgms|omprc)-{id_shoulder}-{id_blade}$"
MassSpectrometryConfiguration "{id_nmdc_prefix}:(mscon)-{id_shoulder}-{id_blade}$"
ChromatographyConfiguration "{id_nmdc_prefix}:(chrocon)-{id_shoulder}-{id_blade}$"

that pattern doesn't appear in nmdc-schema... I guess that's why you just discovered it now @kheal

turbomam · 2024-07-16T17:19:16Z

@aclum do you think NucleotideSequencing and MassSpectrometry currently need (or will always need) those dual-typecode patterns?

kheal · 2024-07-16T17:48:55Z

aside: do we have any better specifications on how many characters can be in a typecode?

We do not currently have any criteria for good typecodes, other than they are ideally between 3 and 6 characters in length.

^(?nmdc):(?[a-z]{1,6})-(?[0-9][a-z]{0,6}[0-9])-(?[A-Za-z0-9]+)(?(.[A-Za-z0-9]+)*)(?[A-Za-z0-9.-]+)?$

6 characters appears as the max in a few suggestions. These are meant to be very short mnemonics. For example, storpro could follow the example of poolp, ie pp not pro. chrocon could be chrcon

nmdc-schema typecodes
berkeley-schema-fy24 typecodes
aside: I would like us to pick a single minting authority code for all of our example data, like 00 or 99.

I have no problem shortening the typecodes for chrocon and storpro if 6 characters is preferred, but we need to do it ASAP because we are in the process of generating metadata for at least one of those classes. If we do decide to go that route I would also ask that standard to be added to the CONTRIBUTING.md under "Modeling Best Practices" to document our decision and help guide reviewers/contributors in the future.

aclum · 2024-07-16T18:45:31Z

This is from
https://microbiomedata.github.io/nmdc-schema/identifiers/
: An alphabetical code with a 1:1 correspondence to a class from the NMDC schema. Answers the question "of what class is the data record that bears this id"? Must consist of 1 to 6 lower case letters, although a minimum of 3 letters is suggested. The type code portion of an NMDC id must match the regular expression [a-z]{1,6}.

The original motivation, decided at the berkeley in person meeting, behind supporting both (dgms|omprc) and (dgns|omprc) was to support the legacy typecodes so we didn't have to change ids when typecode changed. Doing otherwise sets a bad precedent in which our identifiers are potentially not persistent. I still that decision which means we'll need to support the legacy typecodes going forward.

kheal · 2024-07-16T19:00:17Z

Thanks @aclum. I'll summarize what I think of are actionable steps going forward to close this issue.

Shorten id typecodes to adhere to nmdc standards and update CONTRIBUTING guide accordingly (in berkeley schema repo)
Remove extraneous parentheses for ids on definition of MassSpectrometryConfiguration and ChromatographyConfiguration (in berkeley schema repo)
berkeley: Update minter to use first typecode when pattern contains multiple typecodes nmdc-runtime#592

I will fix the first two so we can review and merge during the metadata meeting tomorrow to unblock refactoring work in progress. I'll convert the third into a separate issue in the nmdc runtime repo and let @eecavanna or @dwinston address that issue.

brynnz22 · 2024-07-16T22:57:36Z

@brynnz22 was doing the actual minting while I was schema hunting to track down the source.

Brynn, could you add an example API call for

Instrument [useful]

MassSpectrometryConfiguration [mangled, but easily fixable with schema mods]

MassSpectrometry [mangled, and not easily fixable with schema mods alone]

@turbomam if you want to test and get the API calls you can go the Berkeley runtime and mint these ids to see if you haven't already done so: https://api-berkeley.microbiomedata.org/docs#/minter/mint_ids_pids_mint_post

eecavanna · 2024-07-17T01:30:44Z

FYI: I created a PR in nmdc-runtime with what I think is a fix for the ~~third item~~ minter piece. PR: microbiomedata/nmdc-runtime#597. I've requested reviews (there) from the Polyneme team members.

turbomam · 2024-07-17T13:45:24Z

@turbomam if you want to test and get the API calls you can go the Berkeley runtime and mint these ids to see if you haven't already done so: https://api-berkeley.microbiomedata.org/docs#/minter/mint_ids_pids_mint_post

Thanks @brynnz22 !

see also

berkeley-schema-fy24: id slot_usage resulting in badly minted IDs #2122 (comment)

eecavanna · 2024-07-18T01:22:40Z

As mentioned here (verbatim): The Minter in the Berkeley environment has been updated to (a) tolerate one layer of extraneous parentheses and (b) to use the first typecode — foo — in a (foo|bar|baz) list.

dwinston · 2024-07-22T14:24:15Z

The original motivation, decided at the berkeley in person meeting, behind supporting both (dgms|omprc) and (dgns|omprc) was to support the legacy typecodes so we didn't have to change ids when typecode changed. Doing otherwise sets a bad precedent in which our identifiers are potentially not persistent. I still that decision which means we'll need to support the legacy typecodes going forward.

We can keep a "legacy" identifier persistent if (a) we keep it bound to its entity, e.g. move it from id to nmdc:alternative_identifiers; and (b) we ensure that part of NMDC's service architecture understands what the current preferred unique identifier is for a thing, given a non-unique but unambiguous identifier in a service request. This would apply to "external" PIDs as well, e.g. IGSNs for biosamples as applicable. In either case, the response would always yield (as id) the identifier that -- at present -- we want to be used when the client user cites the entity.

The consequence of the above would be that the schema is (or rather, returns to being) unambiguous about ID format for minting and validation, and the case of legacy IDs becomes a subset of the case of alternative IDs.

aclum · 2024-07-22T16:29:27Z

I'd strongly discourage changing the primary ID on a regular basis, the IDs get used in file names and file headers and propagate to downstream systems like IMG and we don't have logic with systems like the data portal yet to search alternative identifiers.

kheal added the bug Something isn't working label Jul 16, 2024

kheal self-assigned this Jul 16, 2024

kheal changed the title ~~berkeley-schema-fy24: id slots have extra parentheses~~ berkeley-schema-fy24: id slot_usage resulting in badly minted IDs Jul 16, 2024

kheal removed their assignment Jul 16, 2024

kheal added this to Berkeley-Schema Refactor Roll Out Jul 16, 2024

eecavanna mentioned this issue Jul 16, 2024

Update the Runtime to work with the Berkeley schema (meta issue) 📚 microbiomedata/nmdc-runtime#577

Closed

38 tasks

kheal self-assigned this Jul 16, 2024

kheal mentioned this issue Jul 16, 2024

Typecode fixes on IDs for ChromatographyConfiguration, StorageProcess, and MassSpectrometryConfiguration class ids microbiomedata/berkeley-schema-fy24#225

Merged

6 tasks

eecavanna mentioned this issue Jul 16, 2024

berkeley: Update minter to use first typecode when pattern contains multiple typecodes microbiomedata/nmdc-runtime#592

Closed

kheal moved this to 🏗 In Progress in Berkeley-Schema Refactor Roll Out Jul 22, 2024

eecavanna mentioned this issue Jul 24, 2024

Sample grouping class microbiomedata/berkeley-schema-fy24#233

Draft

17 tasks

kheal closed this as completed Jul 24, 2024

github-project-automation bot moved this from 🏗 In Progress to ✅ Done in Berkeley-Schema Refactor Roll Out Jul 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`berkeley-schema-fy24`: id slot_usage resulting in badly minted IDs #2122

`berkeley-schema-fy24`: id slot_usage resulting in badly minted IDs #2122

kheal commented Jul 16, 2024 •

edited

Loading

eecavanna commented Jul 16, 2024 •

edited

Loading

turbomam commented Jul 16, 2024

turbomam commented Jul 16, 2024 •

edited

Loading

turbomam commented Jul 16, 2024

kheal commented Jul 16, 2024

turbomam commented Jul 16, 2024

kheal commented Jul 16, 2024

turbomam commented Jul 16, 2024 •

edited

Loading

turbomam commented Jul 16, 2024 •

edited

Loading

turbomam commented Jul 16, 2024

kheal commented Jul 16, 2024 •

edited

Loading

aclum commented Jul 16, 2024

kheal commented Jul 16, 2024 •

edited

Loading

brynnz22 commented Jul 16, 2024

eecavanna commented Jul 17, 2024 •

edited

Loading

turbomam commented Jul 17, 2024

eecavanna commented Jul 18, 2024

dwinston commented Jul 22, 2024

aclum commented Jul 22, 2024

berkeley-schema-fy24: id slot_usage resulting in badly minted IDs #2122

berkeley-schema-fy24: id slot_usage resulting in badly minted IDs #2122

Comments

kheal commented Jul 16, 2024 • edited Loading

eecavanna commented Jul 16, 2024 • edited Loading

turbomam commented Jul 16, 2024

turbomam commented Jul 16, 2024 • edited Loading

turbomam commented Jul 16, 2024

kheal commented Jul 16, 2024

turbomam commented Jul 16, 2024

kheal commented Jul 16, 2024

turbomam commented Jul 16, 2024 • edited Loading

turbomam commented Jul 16, 2024 • edited Loading

turbomam commented Jul 16, 2024

kheal commented Jul 16, 2024 • edited Loading

aclum commented Jul 16, 2024

kheal commented Jul 16, 2024 • edited Loading

brynnz22 commented Jul 16, 2024

eecavanna commented Jul 17, 2024 • edited Loading

turbomam commented Jul 17, 2024

eecavanna commented Jul 18, 2024

dwinston commented Jul 22, 2024

aclum commented Jul 22, 2024

`berkeley-schema-fy24`: id slot_usage resulting in badly minted IDs #2122

`berkeley-schema-fy24`: id slot_usage resulting in badly minted IDs #2122

kheal commented Jul 16, 2024 •

edited

Loading

eecavanna commented Jul 16, 2024 •

edited

Loading

turbomam commented Jul 16, 2024 •

edited

Loading

turbomam commented Jul 16, 2024 •

edited

Loading

turbomam commented Jul 16, 2024 •

edited

Loading

kheal commented Jul 16, 2024 •

edited

Loading

kheal commented Jul 16, 2024 •

edited

Loading

eecavanna commented Jul 17, 2024 •

edited

Loading