# Identifiers in NMDC
Identifiers are crucial for the NMDC, both for any data objects *created* (aka minted) and for any external objects
*referenced*.
Examples of entities that require identifiers:
* Samples
* Data objects (e.g. sequence files)
* Taxa (e.g. NCBITaxon or GTDB)
* Genes, Proteins
* Sequences (e.g. genome/transcriptome)
* Ontology terms and other descriptors
    * functional orthologs, e.g. KEGG.orthology (KO) terms
    * pathways, e.g. KEGG.pathway, MetaCyc, GO
    * reactions/activities: KEGG, MetaCyc
    * chemical entities: CHEBI, CHEMBL, INCHI, ...
    * sequence feature types: SO, Rfam
Identifiers should be:
* Permanent
* Unique
* Resolvable
* Opaque
See [McMurry et al, PMID:28662064](https://www.ncbi.nlm.nih.gov/pubmed/28662064) for more desiderata.
## CURIEs - prefixed IDs
Following McMurry et al., we adopt the use of *prefixed identifiers*.
The syntax is:
`Prefix:LocalId`
Examples:
- GO:0008152
- BIOSAMPLE:SAMEA2397676
- DOI:10.1038/nbt1156
These prefixed identifiers are also known as CURIEs (Compact URIs). There is
a [W3C specification](https://www.w3.org/TR/curie) for these.
All prefixes should be registered with a standard identifier prefix system. These include:
* http://n2t.net
* http://identifiers.org
* http://obofoundry.org
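As a rough illustration of the `Prefix:LocalId` syntax, the sketch below splits a CURIE at its first colon. The function name `split_curie` is hypothetical and not part of any NMDC tooling, and the sketch assumes the prefix itself never contains a colon.

```python
def split_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE into (prefix, local_id) at the first colon.

    Hypothetical helper for illustration only.
    """
    prefix, sep, local_id = curie.partition(":")
    # Reject strings with no colon, an empty prefix, or an empty local ID.
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a prefixed identifier: {curie!r}")
    return prefix, local_id


print(split_curie("GO:0008152"))
print(split_curie("DOI:10.1038/nbt1156"))
```

Splitting only at the first colon keeps any later colons or slashes inside the local ID, as in the DOI example.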
## Examples
### INSDC BioSamples
Registry entry: https://registry.identifiers.org/registry/biosample
Example ID/CURIE: BIOSAMPLE:SAMEA2397676
Resolving via
identifiers.org: [https://identifiers.org/BIOSAMPLE:SAMEA2397676](https://identifiers.org/BIOSAMPLE:SAMEA2397676)
Resolving via n2t.net: [http://n2t.net/BIOSAMPLE:SAMEA2397676](http://n2t.net/BIOSAMPLE:SAMEA2397676)
### GOLD identifiers
https://registry.identifiers.org/registry/gold
Example ID: `GOLD:Gp0119849`
Resolving via identifiers.org: https://identifiers.org/GOLD:Gp0119849
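The resolver URLs shown in these examples follow a simple pattern: the resolver's base URL followed by the CURIE. A minimal sketch of that pattern, with hypothetical function names and no check that the prefix is actually registered with the resolver:

```python
def identifiers_org_url(curie: str) -> str:
    """Build an identifiers.org resolver URL for a CURIE (illustrative helper)."""
    return "https://identifiers.org/" + curie


def n2t_url(curie: str) -> str:
    """Build an n2t.net resolver URL for a CURIE (illustrative helper)."""
    return "http://n2t.net/" + curie


print(identifiers_org_url("GOLD:Gp0119849"))
print(n2t_url("BIOSAMPLE:SAMEA2397676"))
```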
### Identifiers for ontology terms and function descriptors
Most of the ontologies we use are in OBO. All OBO IDs are prefixed
using the ontology ID space. The list of ID spaces can be found on
http://obofoundry.org.
For example, the ID/CURIE `ENVO:00002007` represents the class `sediment` and is expanded to the URI
http://purl.obolibrary.org/obo/ENVO_00002007
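The expansion in the ENVO example can be sketched as follows: the colon becomes an underscore and the result is appended to the OBO PURL base. The helper name `expand_obo_curie` is made up for this illustration, and the sketch assumes the input is already a well-formed OBO CURIE.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def expand_obo_curie(curie: str) -> str:
    """Expand an OBO CURIE such as ENVO:00002007 to its PURL (illustrative helper)."""
    prefix, _, local_id = curie.partition(":")
    # OBO PURLs join the ID space and the local ID with an underscore.
    return f"{OBO_BASE}{prefix}_{local_id}"


print(expand_obo_curie("ENVO:00002007"))
```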
#### KEGG
KEGG is actually a set of databases, each with its own prefix, usually of the form `KEGG.$database`, e.g.
* [KEGG.ORTHOLOGY](https://registry.identifiers.org/registry/kegg.orthology) (aka KO), e.g. KEGG.ORTHOLOGY:K00001
* [KEGG.COMPOUND](https://registry.identifiers.org/registry/kegg.compound), e.g. KEGG.COMPOUND:C12345
## Recommended IDs for use within NMDC
The NMDC schema is annotated with the set of IDs that are allowed to act as primary keys for instances of each class.
For example, the class [OrthologyGroup](https://microbiomedata.github.io/nmdc-metadata/docs/OrthologyGroup) has a
description of the IDs allowed on the class web page; the first listed
is [KEGG.ORTHOLOGY](https://registry.identifiers.org/registry/kegg.orthology).
The underlying yaml looks like this:
```
orthology group:
  is_a: functional annotation term
  description: >-
    A set of genes or gene products in which all members are orthologous
  id_prefixes:
    - KEGG.ORTHOLOGY ## KO number
    - EGGNOG
    - PFAM
    - TIGRFAM
    - SUPFAM
    - PANTHER.FAMILY
  exact_mappings:
    - biolink:GeneFamily
```
The full URL for each prefix is in the JSON-LD context file.
## IDs minted for use within NMDC
The NMDC schema specifies legal identifiers for all of its classes. All data instances/records that are intended for upload into the NMDC metadata store must have an `id` field that follows this specification, which is discussed below.
NMDC offers a central identifier minting [endpoint](https://api.microbiomedata.org/docs#/minter/mint_ids_pids_mint_post) in order to save data contributors the trouble of hand-crafting `id`s.
The possibility of decentralized (or offline) minting of `id`s by trusted organizations has also been anticipated. `id` component 3 below (the shoulder) is used to indicate the organization that minted an `id`. LBL, which hosts the `id` minting endpoint, will use one shoulder value. If another organization, like JGI or EMSL, needs to bulk-create `id`s outside of the central identifier minting endpoint, it would use a different shoulder, to be determined by the NMDC schema and metadata team.
No matter where they are minted, all NMDC `id` values must match this abstract pattern:
```
nmdc:<type-code>-<shoulder>-<blade><.version><_locus>
```
The abstract pattern has six parts, delimited by hyphens (unless otherwise specified):
1. `nmdc`: All NMDC identifiers will begin with this static prefix.
2. `<type-code>`: An alphabetical code with a 1:1 correspondence to a class from the NMDC schema. Answers the question "of what class is the data record that bears this `id`?" Must consist of 1 to 6 lower case letters, although a minimum of 3 letters is suggested. The *type code* portion of an NMDC `id` must match the regular expression `[a-z]{1,6}`.
3. `<shoulder>`: A code that indicates what organization minted the identifier. Shoulder values must be zero to six lower case letters, flanked by one digit on either side. Answers the question "what organization minted this `id`?" The central identifier endpoint, hosted at LBL, uses the shoulder `00`. Should organizations like JGI or EMSL need to mint identifiers in bulk, they would be assigned other shoulders, so that `id` values aren't reused. The *shoulder* portion of an NMDC `id` must match the regular expression `[0-9][a-z]{0,6}[0-9]`.
4. `<blade>`: The fully unique part of the identifier under a given type code and shoulder namespace. The _shoulder_ and _blade_ together make up the _key_ of the identifier. The blade is an alphanumeric string of open-ended length with at least one character, following the regular expression: `[A-Za-z0-9]+`.
5. `<.version>`: Differentiates multiple iterations of a workflow. The delimiter used to separate the *version* from the *blade* and everything before it is a dot (`.`). The *version* is a potentially repeating alphanumeric pattern with a minimum length of 1 character. The *version* portion of an NMDC `id` must match the regular expression `(\.[A-Za-z0-9]+)*`.
6. `<_locus>`: Indicates the contig on which a genomic feature is found, along with its start and end coordinates. Delimited from the rest of the `id` by an underscore (`_`). The *locus* part, if present, must have at least one character from the set of uppercase letters, lower case letters, digits, underscores (`_`), dots (`.`) and hyphens (`-`). The regular expression that the locus will follow is: `_[A-Za-z0-9_\.-]+`.
The per-part regular expressions described above can be composed into one complete regular expression. Named capture groups have been used to tie in the part names.
```
^(?<prefix>nmdc):(?<typecode>[a-z]{1,6})-(?<shoulder>[0-9][a-z]{0,6}[0-9])-(?<blade>[A-Za-z0-9]+)(?<version>(\.[A-Za-z0-9]+)*)(?<locus>_[A-Za-z0-9_\.-]+)?$
```
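As a sketch of how the composed pattern might be applied, the Python snippet below compiles the same expression and returns the named parts. Python's `re` module writes named groups as `(?P<name>...)` rather than `(?<name>...)`. The function name `parse_nmdc_id` and the sample `id` values are made up for illustration and do not refer to real NMDC records or type codes.

```python
import re

# Same pattern as above, using Python's (?P<name>...) named-group syntax.
# fullmatch() anchors the match at both ends, so ^ and $ are omitted.
NMDC_ID = re.compile(
    r"(?P<prefix>nmdc):"
    r"(?P<typecode>[a-z]{1,6})-"
    r"(?P<shoulder>[0-9][a-z]{0,6}[0-9])-"
    r"(?P<blade>[A-Za-z0-9]+)"
    r"(?P<version>(\.[A-Za-z0-9]+)*)"
    r"(?P<locus>_[A-Za-z0-9_\.-]+)?"
)


def parse_nmdc_id(identifier: str):
    """Return a dict of the named parts, or None if the id does not match."""
    match = NMDC_ID.fullmatch(identifier)
    return match.groupdict() if match else None


# Made-up identifiers, for illustration only.
print(parse_nmdc_id("nmdc:abc-00-x1y2"))
print(parse_nmdc_id("nmdc:abc-00-x1y2.1_contig5_10_200"))
print(parse_nmdc_id("nmdc:ABC-00-x1y2"))  # upper-case type code: no match
```

When the optional parts are absent, `version` comes back as an empty string and `locus` as `None`, because the version group may repeat zero times while the locus group is optional as a whole.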
## Annotation identifiers
Both metaG and metaT analyses produce GFF3 files.
See [issue 184](https://github.com/microbiomedata/nmdc-metadata/issues/184) for more on how the GFF is modeled.
The main entity we care about in these is
the [gene product](https://microbiomedata.github.io/nmdc-metadata/docs/GeneProduct) ID (usually a protein); this is what
functional annotation hangs off.
This is typically a protein encoded by a CDS, e.g.
```
Ga0185794_41 GeneMark.hmm-2 v1.05 CDS 48 1037 56.13 + 0 ID=Ga0185794_41_48_1037;translation_table=11;start_type=ATG;product=5-methylthioadenosine/S-adenosylhomocysteine deaminase;product_source=KO:K12960;cath_funfam=3.20.20.140;cog=COG0402;ko=KO:K12960;ec_number=EC:3.5.4.28,EC:3.5.4.31;pfam=PF01979;superfamily=51338,51556
```
When converting col9 we ensure that each ID is correctly prefixed. So for example, we use `KEGG.ORTHOLOGY:K12960`
not `KO:K12960`, as the former is the official prefix according to KEGG and identifiers.org.
We will also later need a policy for IDs for the sequences in col1 (i.e. genome or transcript); please return later for
more details.
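A minimal sketch of this kind of prefix normalization is shown below. It is not the actual NMDC conversion code: the names `PREFIX_FIXES`, `normalize_curie` and `parse_attributes` are hypothetical, and the only mapping included is the `KO` to `KEGG.ORTHOLOGY` case mentioned above. It assumes column 9 is a `;`-separated list of `key=value` pairs whose values may be comma-separated.

```python
# Hypothetical mapping from prefixes seen in GFF attributes to official ones.
PREFIX_FIXES = {"KO": "KEGG.ORTHOLOGY"}


def normalize_curie(value: str) -> str:
    """Rewrite a known non-standard prefix; leave everything else untouched."""
    prefix, sep, local_id = value.partition(":")
    if sep and prefix in PREFIX_FIXES:
        return f"{PREFIX_FIXES[prefix]}:{local_id}"
    return value


def parse_attributes(col9: str) -> dict[str, list[str]]:
    """Parse GFF3 column 9 into {key: [values]} with normalized prefixes."""
    attributes = {}
    for pair in col9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, raw = pair.partition("=")
        attributes[key] = [normalize_curie(v) for v in raw.split(",")]
    return attributes


col9 = "ID=Ga0185794_41_48_1037;ko=KO:K12960;ec_number=EC:3.5.4.28,EC:3.5.4.31"
print(parse_attributes(col9))
```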
## Reuse vs minting new IDs
In 2023 NMDC transitioned from reusing identifiers from other organizations to using NMDC minted identifiers as the primary identifier. In April 2024 NMDC will update legacy records to use NMDC minted identifiers as the primary identifier. The table below shows how legacy identifiers can be found in updated records.
## Identifier mapping
| Identifier | Example | NMDC Schema Class | NMDC Schema Slot |
| :-------- | :------- | :------- | :------- |
| gold:Gs* | gold:Gs0114675 | Study | gold_study_identifiers |
| gold:Gb* | gold:Gb0110739 | Biosample | gold_biosample_identifiers |
| emsl:* | emsl:63ca2f94-6647-11eb-ae93-0242ac130002| Biosample | emsl_biosample_identifiers|
| igsn:* | igsn:IEWFS001H | Biosample | igsn_biosample_identifiers |
| gold:Gp* | gold:Gp0452734 | OmicsProcessing | gold_sequencing_project_identifiers |
| emsl:* | emsl:598506 | OmicsProcessing | alternative_identifiers |
Some legacy data object identifiers were based on file md5sums, either with or without a prefix (nmdc, jgi, emsl). In some cases the legacy value can be found by removing the prefix and searching DataObject records on slot md5_checksum. If you are having trouble finding information based on legacy identifiers please contact [email protected].
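The prefix-stripping step described above could look roughly like the sketch below. The helper name `legacy_id_to_md5` is hypothetical, and the result is only a candidate value to search for in `md5_checksum`; as noted, this works in some cases, not all.

```python
# Prefixes that legacy data object identifiers may carry, per the text above.
LEGACY_PREFIXES = ("nmdc", "jgi", "emsl")


def legacy_id_to_md5(identifier: str) -> str:
    """Strip a known legacy prefix, returning a candidate md5_checksum value."""
    prefix, sep, rest = identifier.partition(":")
    if sep and prefix in LEGACY_PREFIXES:
        return rest
    # Unprefixed (or unrecognized) identifiers are returned unchanged.
    return identifier


print(legacy_id_to_md5("nmdc:e0c70280a7a23c7c5cc1e589f72e896e"))
```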
## Additional details on legacy identifiers
Legacy metagenomics omics objects look like this:
```yaml
id: "gold:Gp0108335"
name: "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D"
has_input:
- "gold:Gb0108335"
part_of:
- "gold:Gs0112340"
has_output:
- "jgi:551a20d30d878525404e90d5"
omics_type: Metagenome
type: "nmdc:OmicsProcessing"
add_date: "30-OCT-14 12.00.00.000000000 AM"
mod_date: "22-MAY-20 06.13.12.927000000 PM"
ncbi_project_name: "Thawing permafrost microbial communities from the Arctic, studying carbon transformations - Permafrost 712P3D"
processing_institution: "Joint Genome Institute"
principal_investigator_name: "Virginia Rich"
```
The linked data object uses a jgi prefix and an md5 hash:
```yaml
id: "jgi:551a20d30d878525404e90d5"
name: "8871.1.114459.GCCAAT.fastq.gz"
description: "Raw sequencer read data"
file_size_bytes: 17586370657
type: "nmdc:DataObject"
```
Legacy metaproteomics omics objects look like this:
```yaml
id: "emsl:404590"
name: "FECB_21_5093B_01_23Dec14_Tiger_14-11-12"
description: "High res MS with low res CID MSn"
part_of:
- "gold:Gs0110132"
has_output:
- "emsl:output_404590"
omics_type: Proteomics
type: "nmdc:OmicsProcessing"
instrument_name: "VOrbiETD03"
processing_institution: "Environmental Molecular Sciences Lab"
```
The output data objects are formed from these:
```yaml
id: "emsl:output_404590"
name: "output: FECB_21_5093B_01_23Dec14_Tiger_14-11-12"
description: "High res MS with low res CID MSn"
file_size_bytes: 503296678
type: "nmdc:DataObject"
```
The data objects use hashes (md5) prefixed with nmdc:
```yaml
name: "404590_resultant.tsv"
description: "Aggregation of analysis tools{MSGFplus, MASIC} results"
file_size_bytes: 10948480
type: "nmdc:DataObject"
id: "nmdc:e0c70280a7a23c7c5cc1e589f72e896e"
```
## MIxS term identifiers
We are working with the GSC to provide permanent IDs for MIxS terms. Note these terms are schema-level rather than
data-level.
Please check this section later.
For now we place these in the nmdc namespace, e.g.
`nmdc:alt`
## Identifiers and semantic web URIs
We produce a JSON-LD context with the schema:
* [jsonschema/nmdc.context.jsonld](https://github.com/microbiomedata/nmdc-schema/blob/main/project/jsonschema/nmdc.schema.json)
When this is combined with schema-conformant JSON, RDF can be automatically created using the intended URIs.