Book: annotated Dataset example comparison to CDIF #472

Open
smrgeoinfo opened this issue Oct 24, 2024 · 1 comment

smrgeoinfo commented Oct 24, 2024

Comparing example at https://book.odis.org/thematics/dataset/index.html#id1 with CDIF recommendations (https://cross-domain-interoperability-framework.github.io/cdifbook/metadata/schemaorgimplementation.html#schema-org-implementation-of-cdif-metadata) and examples at https://github.com/Cross-Domain-Interoperability-Framework/cdifbook/tree/main/examples

{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "@id": "https://example.org/permanentUrlToThisJsonDoc",
    "name": "A concise but descriptive name of the dataset",
    "description": "An extended, free-text description of what's in the dataset, who created it, and other attributes",
    "url": "https://urlToTheDatasetOrLandingPage.org/",

Good alignment up to here; CDIF also declares the dcterms: prefix in the @context.
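For illustration, a minimal sketch of a @context that keeps schema.org as the default vocabulary and adds the dcterms: prefix. The IRI is the standard Dublin Core terms namespace; whether CDIF declares further prefixes here is not shown and should be checked against the CDIF book.

```json
{
    "@context": {
        "@vocab": "https://schema.org/",
        "dcterms": "http://purl.org/dc/terms/"
    },
    "@type": "Dataset",
    "name": "A concise but descriptive name of the dataset"
}
```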

    "sameAs": [
        "http://alternativeUrlToTheDatasetOrLandingPage.org"
    ],

It's not clear what the point of this is. It looks like an alternate link to the landing page, so is it sameAs the landing page?

"license": "This work is licensed under a Creative Commons Attribution (CC-BY) 4.0 License",

Good alignment

    "citation": [
        "Citation to other work relevant to this dataset",
        "Citation to other work relevant to this dataset",
        "Citation to other work relevant to this dataset"
    ],

CDIF doesn't include citation in its recommendation. Personally, I'd recommend against using it because it's so frequently misunderstood. In CDIF, if you want to link to 'other work relevant to this dataset', use schema:relatedLink.

"version": "2021-04-24T06:34:56.000Z",

Good alignment

    "keywords": [
        "Keyword 1",
        "Keyword 2",
        "Keyword 3"
    ],

Partial alignment; CDIF recommends using schema:DefinedTerm for keywords that come from an identifiable controlled vocabulary.
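As a rough sketch of what that could look like, the fragment below mixes a plain text keyword with one expressed as a schema:DefinedTerm. The term code and the example.org vocabulary URL are hypothetical placeholders, not values from either book.

```json
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "keywords": [
        "Keyword 1",
        {
            "@type": "DefinedTerm",
            "name": "Keyword 2",
            "termCode": "KW2",
            "inDefinedTermSet": "https://example.org/vocabulary/hypotheticalKeywordScheme"
        }
    ]
}
```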

"measurementTechnique": "The URL to or text about the methods, technique or technology used to generate this Dataset", 

Not in the CDIF recommendation. For this kind of provenance information in CDIF discovery, the recommendation is to use prov:wasGeneratedBy to link to sensors, instruments, software, and algorithms. For a full description of data creation, use the CDIF data integration profile.
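A possible shape for this, offered as a sketch only: the prov: prefix is the W3C PROV-O namespace, and nesting the instrument under a prov:Activity via prov:used follows PROV-O itself. The exact structure CDIF expects should be checked against the CDIF examples; the instrument name and URL are placeholders.

```json
{
    "@context": {
        "@vocab": "https://schema.org/",
        "prov": "http://www.w3.org/ns/prov#"
    },
    "@type": "Dataset",
    "prov:wasGeneratedBy": {
        "@type": "prov:Activity",
        "prov:used": [
            {
                "@type": "Thing",
                "name": "Name of the instrument, sensor, software or algorithm used",
                "url": "http://ontology.org/uriToSemanticDescriptorOfThisInstrument"
            }
        ]
    }
}
```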

    "variableMeasured": [
        {
            "@type": "PropertyValue",
            "name": "Name of a variable in the dataset",
            "description": "Extended description of this variable"
        },
        {
            "@type": "PropertyValue",
            "name": "Name of a variable in the dataset",
            "url": "http://ontology.org/uriToSemanticDescriptorOfThisVariable",
            "description": "Extended description of this variable?"
        },
        {
            "@type": "PropertyValue",
            "name": "SamplingDeviceApertureSurfaceArea",
            "url": "http://ontology.org/uriToSemanticDescriptorOfThisVariable",
            "description": "Extended description of this variable"
        }
    ],

Partial alignment. CDIF also includes use of schema:StatisticalVariable for schema:variableMeasured. For schema:PropertyValue, the guidance is "Variable must have a name and description, should have a propertyID with URI for the represented concept. The URI in the propertyID provides the semantic linkage for meaning of the variable."
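Applied to the third variable in the example, that guidance might look like the sketch below, where the concept URI moves from url to propertyID. The URI is the same placeholder used in the example above.

```json
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "variableMeasured": [
        {
            "@type": "PropertyValue",
            "name": "SamplingDeviceApertureSurfaceArea",
            "description": "Extended description of this variable",
            "propertyID": "http://ontology.org/uriToSemanticDescriptorOfThisVariable"
        }
    ]
}
```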

    "includedInDataCatalog": {
        "@id": "https://registryOfCatalogs.org/permanentUrlIdentifiyingCatalog",
        "@type": "DataCatalog",
        "url": "https://urlOfDataCatalog.org"
    },

Not in the CDIF Discovery recommendation. Is this supposed to identify the source of the metadata record? If so, it should be in the 'metadata about the metadata' section that CDIF recommends (https://cross-domain-interoperability-framework.github.io/cdifbook/metadata/contentmodel.html#properties-for-metadata-management). Usually the actual dataset that the metadata describes is in a repository, which is not generally referred to as a 'DataCatalog'.

    "temporalCoverage": "2007/2007",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "http://urlToDirectDownloadOfThisDataset.org/",
        "encodingFormat": "text/csv"
    },

Good alignment. CDIF also includes a recommendation for API-based data distribution (https://cross-domain-interoperability-framework.github.io/cdifbook/metadata/schemaorgimplementation.html#service-based-distribution), analogous to dcat:accessService.

    "spatialCoverage": {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            "description": "schema.org expects lat long (Y X) coordinate order",
            "polygon": "10.161667 142.014,18.033833 142.014,18.033833 147.997833,10.161667 147.997833,10.161667 142.014"
        },
        "additionalProperty": {
            "@type": "PropertyValue",
            "propertyID": "https://dbpedia.org/page/Spatial_reference_system",
            "value": "https://www.w3.org/2003/01/geo/wgs84_pos"
        }
    },

CDIF requires a schema:box, schema:line, schema:point or a named place (Place/name with string or DefinedTerm). Guidance for box: "For bounding box specification of the spatial extent of resource content. See ESIP SOSO for details. Recommend including only one bounding box; behavior of harvesting clients when multiple geometries are specified is unpredictable". CDIF also provides for optional geographic extents using other, more interoperable geometries; GeoSPARQL is recommended, see Ocean InfoHub. Other geometry schemes might be specified in a specific domain profile, e.g. for atmospheric data, subsurface data, or local coordinate systems.
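For comparison, the polygon in the example could be reduced to a single bounding box along these lines. This sketch assumes the lat long order stated in the example and the schema.org convention of lower corner followed by upper corner, separated by spaces; the corner values are taken directly from the polygon above.

```json
{
    "@context": {
        "@vocab": "https://schema.org/"
    },
    "@type": "Dataset",
    "spatialCoverage": {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            "box": "10.161667 142.014 18.033833 147.997833"
        }
    }
}
```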

    "provider": [
        {
            "@type": "Organization",
            "legalName": "Legal Name of Organisation which generated the dataset",
            "name": "Other Name of Organisation which generated the dataset",
            "url": "https://organisationWebsite.org/"
        }
    ],

CDIF guidance is that provider is the contact point for the agent responsible for a resource distribution; this is different from 'agent that generated the dataset'.

    "subjectOf": {
        "@type": "Event",
        "description": "Describe the event which is the subject of this dataset. For example, a cruise ID.",
        "name": "Concise and descriptive name of the Event",
        "potentialAction": {
            "@type": "Action",
            "name": "Concise but descriptive name of action that was part of an Event. For example, the name of a CTD cast",
            "agent": [
                "Name or permanent ID of person or thing that performed this action",
                "Name or permanent ID of person or thing that performed this action",
                "Name or permanent ID of person or thing that performed this action"
            ],
            "startTime": "2007-03-11T14:45UTC",
            "endTime": "2007-03-11T15:42UTC",


            "instrument": {
                "@type": "Thing",
                "name": "The name of the instrument used in the action. For example, the specific model of a CTD, a glider, a moored sensor",
                "url": "http://ontology.org/uriToSemanticDescriptorOfThisInstrument",
                "description": "Extended description of the sampling instrument"
            }
        }
    }
}

CDIF uses subjectOf for the graph node with metadata about the metadata record (dateModified, conformsTo, responsible parties...). It's not clear from the example here what the intention is. The schema.org guidance for subjectOf is that its value is "A CreativeWork or Event about this Thing." So this example would appear to document some Event or CreativeWork that is about the described dataset. My suspicion is that it's supposed to document the workflow that created the dataset. CDIF recommends using prov:wasGeneratedBy to document instruments, sensors, algorithms, software, etc. used to create the dataset, and prov:wasDerivedFrom to document resources (e.g. source datasets) that were used to create the described dataset. CDIF would link to CreativeWorks about the resource using relatedLink.
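To make the CDIF usage of subjectOf concrete, here is a hedged sketch of a metadata-about-the-metadata node. The property names dateModified and dcterms:conformsTo come from the list above; the CreativeWork type, the use of about and maintainer, the date, and all example.org URLs are assumptions for illustration and should be checked against the CDIF book before use.

```json
{
    "@context": {
        "@vocab": "https://schema.org/",
        "dcterms": "http://purl.org/dc/terms/"
    },
    "@type": "Dataset",
    "@id": "https://example.org/permanentUrlToTheDataset",
    "subjectOf": {
        "@type": "CreativeWork",
        "@id": "https://example.org/permanentUrlToThisJsonDoc",
        "about": {
            "@id": "https://example.org/permanentUrlToTheDataset"
        },
        "dateModified": "2024-10-24",
        "dcterms:conformsTo": {
            "@id": "https://example.org/hypotheticalMetadataProfile"
        },
        "maintainer": {
            "@type": "Organization",
            "name": "Organisation responsible for this metadata record"
        }
    }
}
```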

smrgeoinfo commented Oct 24, 2024

Done for now
