File should have technical metadata #16

Closed
jcoyne opened this issue Apr 17, 2015 · 20 comments

@jcoyne jcoyne modified the milestone: May Hydra PCDM Apr 30, 2015
@jlhardes

jlhardes commented May 1, 2015

Not sure if this is helpful here, but baseline technical metadata properties across file formats/types from the Hydra Technical Metadata Subgroup are available: https://docs.google.com/document/d/1SZCpSIdlGfXgoYrAnW2eRKlIt6O-1ADIDDhmLrvxeLc/edit#heading=h.a8hurtypz8qi

@jcoyne
Member Author

jcoyne commented May 4, 2015

In a block like so:

metadata do
  type PCDM::File
  property :foo, ...
end

@acoburn
Contributor

acoburn commented May 5, 2015

Reiterating @jlhardes's comment about the technical metadata profile -- this is now available on the duraspace wiki at: https://wiki.duraspace.org/display/hydra/Technical+Metadata+Application+Profile

@hectorcorrea
Member

@jcoyne Just to make sure I understand what I need to add here.

Is the goal of this ticket to have something like this at the end (assuming I get the correct Ruby classes for each of those predicates) ?

    metadata do
      configure type: RDFVocabularies::PCDMTerms.File
      property :label, predicate: ::RDF::RDFS.label

      # TODO: Get the proper Ruby classes for these predicates
      property :file_name, predicate: ::ebucore.file 
      property :file_size, predicate: ::ebucore.fileSize 
      property :date_created, predicate: ::ebucore.dateCreated
      property :file_hash, predicate: ::premis.hasMessageDigest
      property :mime_type, predicate: ::ebucore.hasMimeType
      property :date_modified, predicate: ::ebucore.dateModified
      property :file_format, predicate: ::pronom.puid
      property :byte_order, predicate: ::sweetjpl.byteOrder
    end
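
To make the prefixed names in the block above concrete, here is a minimal plain-Ruby sketch of how they might expand to full predicate URIs. It is not the hydra-pcdm implementation: the `expand` helper and the `PROPERTIES` table are hypothetical, and the namespace URIs are assumptions that should be checked against each vocabulary. The pronom and sweetjpl entries are left out because no namespace is given for them in this thread.

```ruby
# Illustrative only: plain-Ruby expansion of prefixed names to full URIs.
# The namespace URIs are assumptions; verify them against each vocabulary.
NAMESPACES = {
  "rdfs"    => "http://www.w3.org/2000/01/rdf-schema#",
  "ebucore" => "http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#",
  "premis"  => "http://www.loc.gov/premis/rdf/v1#"
}.freeze

# Hypothetical helper: turns "ebucore:fileSize" into a full predicate URI.
# Raises ArgumentError when the prefix is not in NAMESPACES.
def expand(prefixed_name)
  prefix, local = prefixed_name.split(":", 2)
  base = NAMESPACES.fetch(prefix) { raise ArgumentError, "unknown prefix: #{prefix}" }
  "#{base}#{local}"
end

# Property names taken from the proposal above, mapped to expanded predicates.
PROPERTIES = {
  label:         "rdfs:label",
  file_size:     "ebucore:fileSize",
  date_created:  "ebucore:dateCreated",
  date_modified: "ebucore:dateModified",
  mime_type:     "ebucore:hasMimeType",
  file_hash:     "premis:hasMessageDigest"
}.transform_values { |name| expand(name) }.freeze
```

With a real RDF library the table would hold URI objects rather than strings; strings keep the sketch dependency-free.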

@tpendragon
Contributor

👍 Pronom's impossible, but besides that

@awead
Contributor

awead commented May 8, 2015

Should hydra-pcdm be concerned about what kind of tech metadata it is, or just that it has any kind of tech metadata?

@jlhardes

jlhardes commented May 8, 2015

For this sprint, I think it makes sense to go with required properties (File Name and File Size) at a minimum. If recommended properties can also be included (Label, Date Created, File Hash, File Format Type, and Has Mime Type) that would make it more complete. The optional fields can probably be safely ignored for the sprint - they aren't completely workable anyway (pronom:puid, for example).
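
The required and recommended tiers described above could be checked with something like the following sketch. The symbol names and the `missing_properties` helper are hypothetical and only mirror the property labels listed in the comment; they are not part of hydra-pcdm.

```ruby
# Property tiers as listed in the comment above; the symbol names are illustrative.
REQUIRED    = %i[file_name file_size].freeze
RECOMMENDED = %i[label date_created file_hash file_format mime_type].freeze

# Hypothetical helper: reports which properties a metadata hash is missing.
# A property counts as missing when its value is nil or an empty string.
def missing_properties(metadata)
  present = metadata.reject { |_, value| value.nil? || value.to_s.empty? }.keys
  { required: REQUIRED - present, recommended: RECOMMENDED - present }
end

report = missing_properties(file_name: "page1.tiff", file_size: 1024, mime_type: "image/tiff")
# report[:required] is empty here; report[:recommended] lists what is still absent.
```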

@awead
Contributor

awead commented May 8, 2015

@jlhardes I agree. My question was really more about the schema. Are we enforcing a techdata schema at this level? I guess it doesn't matter: since it's RDF, an implementer who wants to use a different one can just add it in. I think we can flesh out those details at the hydra-works level, with integration tests that serve as an example for someone who wants to build their own PCDM-approved object and add additional technical metadata.

@hectorcorrea
Member

@jlhardes This is good to know since I've got File Size working (Fedora does that automatically using premis:hasSize as the predicate)

I need to talk with Esme about hasOriginalName since Fedora seems to do it out of the box but I cannot get it to work.

@hectorcorrea
Member

@jlhardes Question: Is it OK if we use the equivalent predicates indicated in the document that you posted (e.g. use "premis:hasSize" instead of "ebucore:fileSize") or should we use the one indicated at the top of each one (e.g. "ebucore:fileSize") ?

@jlhardes

jlhardes commented May 8, 2015

If we can stick with the properties at the top (Property name: ebucore:fileSize) that might make things easier for the sprint (not so much to implement). Those Property names are the main ones we'd like to see implemented anyway for technical metadata. The equivalent properties are listed to help explain the property and to provide options if the property we're listing can't be used for some reason.

@acoburn
Contributor

acoburn commented May 8, 2015

@hectorcorrea the logic behind using ebucore was that it is a comprehensive vocabulary for technical metadata. So rather than splitting the technical metadata properties across lots of different vocabularies (nfo, exif, dc, premis, etc, etc), it would be much more sane to start with a well supported, single vocabulary.

@hectorcorrea
Member

@jlhardes @acoburn thanks for the background info. I'll look into implementing it with ebucore then.

@hectorcorrea
Member

@jlhardes @acoburn Fedora automatically calculates and stores (as read-only) the following properties premis:fileSize, fedora:digest, and fedora:mimetype.

I could add three separate properties with ebucore predicates as the document recommends, but they would have to be set manually and would run the risk of having different values than what Fedora already stores. Do we really want to do that or should we stay with the Fedora-provided properties?

//cc: @awead @jcoyne (thoughts?)

@acoburn
Contributor

acoburn commented May 8, 2015

@hectorcorrea part of the thinking here was that if an external tool (e.g. FITS) calculates these values, they can be put into the ebucore properties (since the existing properties are managed by the server and hence read-only). The advantages of using the additional properties include:

  • There is a clearer line between server managed properties and externally managed properties (potentially useful for provenance)
  • The fedora: namespace is much less widely used than ebucore: and so potentially less transferable
  • By keeping as much technical metadata within a single namespace, you make the data (potentially) more usable in a LOD context
  • If an application chooses to use a different hashing algorithm than SHA1, that option is available
  • If there is a mismatch between the server managed properties and an externally generated value, that might be useful for certain types of preservation activities.

The disadvantages of using the additional properties are:

  • data duplication
  • more code to write / manage

That said, I don't actually have a strong opinion one way or the other. @jlhardes thoughts?
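
As a rough illustration of the mismatch and alternative-algorithm points above, the sketch below compares a server-reported SHA-1 with digests computed outside the server. The `fixity_report` helper is hypothetical, and it assumes the server digest arrives as a plain hex string; the actual format of the Fedora value may differ.

```ruby
require "digest"

# Hypothetical helper: compares a server-reported SHA-1 (assumed to be plain hex)
# with digests computed externally from the file content.
def fixity_report(content, server_sha1)
  external_sha1 = Digest::SHA1.hexdigest(content)
  {
    external_sha1:   external_sha1,
    external_sha256: Digest::SHA256.hexdigest(content), # a different algorithm, stored separately
    matches_server:  external_sha1 == server_sha1.downcase
  }
end

content = "hello world"
ok  = fixity_report(content, Digest::SHA1.hexdigest(content))
bad = fixity_report(content, "0" * 40) # simulated mismatch with a made-up digest
```

A `false` in `matches_server` is the case that might matter for preservation activities: both values are kept, so the discrepancy stays visible.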

@awead
Contributor

awead commented May 8, 2015

Yes, 👍 to that. But, I don't think hydra-pcdm should have any opinions about what tech data you're using or what you're using to create it. It should just allow you to use whichever tool and schema you prefer.

@jcoyne
Member Author

jcoyne commented May 8, 2015

@awead I disagree with that somewhat. I think it should provide an opinionated default. You should be allowed to do something else though.

@jlhardes

jlhardes commented May 8, 2015

We had some discussion about these properties in relation to properties that are already in Fedora and I wasn't quite sure which of these mapped, so you've helped clear that up, @hectorcorrea - thanks!

I don't actually understand how it works to NOT use what we are implementing on this sprint. It seems like we want to see a baseline of technical metadata across all Hydra implementations using PCDM to make things easier going across systems and sharing externally. I understand that we don't want to limit people's implementations by making these properties using these predicates a requirement but it seems like we do want to encourage their use.

I think for that reason and for the longer term it's better to go with a more externally-useful standard, so I'd stick with premis:hasMessageDigest, ebucore:hasMimeType, and ebucore:fileSize to express those properties, even though it is a bit of duplication.

Additionally, I don't think premis:fileSize actually exists (http://id.loc.gov/ontologies/premis.html) - at least not in RDF premis. I think the premis property might be hasSize so if Fedora is using premis:fileSize, I'm not sure what ontology is actually being used.

@awead
Contributor

awead commented May 9, 2015

@jcoyne agreed. If there's any additional tech metadata you want beyond what Fedora is giving you already, then it should be as easy as simply including a module with the additional properties. Any implementation would then override that module, or more realistically, just include their own. The side effect is that you may have extra triples with different predicates but duplicate object content.

So, if we use @jlhardes's recommendations, you'd have two triples with the checksum: fedora:digest and premis:hasMessageDigest. And two for mime type: fedora:mimeType and ebucore:hasMimeType (assuming their object values can be the same). I think @hectorcorrea meant premis:hasSize. That's what comes back from Fedora if you do a GET request on the binary's fcr:metadata node.
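
The "include a module, or swap in your own" idea and the resulting duplicate triples might look roughly like this. `PropertyDsl` is a minimal stand-in written for this sketch, not the ActiveFedora API, and `DefaultTechnicalMetadata`, `ExampleFile`, and the digest and mime type values are all made up for illustration.

```ruby
# Minimal stand-in for a property DSL; this is not the ActiveFedora API.
module PropertyDsl
  def properties
    @properties ||= {}
  end

  def property(name, predicate:)
    properties[name] = predicate
  end
end

# Hypothetical default module carrying the recommended predicates.
# An implementation could include a different module instead.
module DefaultTechnicalMetadata
  def self.included(base)
    base.extend(PropertyDsl)
    base.property :file_hash, predicate: "premis:hasMessageDigest"
    base.property :mime_type, predicate: "ebucore:hasMimeType"
  end
end

class ExampleFile
  include DefaultTechnicalMetadata
end

# Server-managed values as they might come back from Fedora (made-up values).
server_managed = { "fedora:digest" => "urn:sha1:abc123", "fedora:mimeType" => "image/tiff" }

# Application-managed copies: different predicates, same object content.
app_managed = {
  ExampleFile.properties[:file_hash] => server_managed["fedora:digest"],
  ExampleFile.properties[:mime_type] => server_managed["fedora:mimeType"]
}

triples = server_managed.merge(app_managed)
```

Here `triples` ends up with four predicates but only two distinct object values, which is the duplication described above.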

@hectorcorrea
Member

So I went ahead and implemented the additional properties. The only caveats with the current implementation are:

  1. Fedora considers PREMIS.hasMessageDigest a server-managed property and therefore it does not let us change the value of this property (i.e. this is a read-only property.)
  2. Property pronom:puid indicated in the document linked by @jlhardes was not implemented since this ontology hasn't been published.
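
The first caveat could be surfaced on the client side with a guard like the one below, so that a write to a server-managed predicate fails early instead of being rejected by the repository. This is only a sketch: the `SERVER_MANAGED` set, the `ReadOnlyPropertyError` class, and the `set_property` helper are hypothetical, and the set contains only the two predicates this thread describes as managed by Fedora.

```ruby
require "set"

# Predicates this thread describes as server-managed; the list is an assumption
# and is not read from Fedora itself.
SERVER_MANAGED = Set["premis:hasMessageDigest", "premis:hasSize"].freeze

class ReadOnlyPropertyError < StandardError; end

# Hypothetical helper: returns a new metadata hash with the value set,
# or raises if the predicate is server-managed.
def set_property(metadata, predicate, value)
  if SERVER_MANAGED.include?(predicate)
    raise ReadOnlyPropertyError, "#{predicate} is server-managed and read-only"
  end
  metadata.merge(predicate => value)
end

updated = set_property({}, "ebucore:dateCreated", "2015-05-08")
```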
