May inventories contain properties that aren't defined in the spec? #474

Closed
pwinckles opened this issue May 20, 2020 · 18 comments · Fixed by #478
Assignees
Labels
Decided An editorial decision that was decided OCFL Object
Milestone

Comments

@pwinckles

I was sure that the spec forbade this, but @neilsjefferies just pointed out to me that it in fact does not. Is this intentional?

If so, this would allow for something like, for example, a media type extension that augments inventory files like the following:

{
    "id": "obj1",
    "type": "https://ocfl.io/1.0/spec/#inventory",
    "digestAlgorithm": "sha512",
    "head": "v1",
    "contentDirectory": "content",
    "fixity": {},
    "manifest": {
        "5a..e1": [
            "v1/content/test.txt"
        ]
    },
    "versions": {
        "v1": {
            "created": "2020-05-20T16:26:11Z",
            "message": "initial commit",
            "state": {
                "5a..e1": [
                    "test.txt"
                ]
            }
        }
    },
    "mediaTypes": {
        "5a..e1": "text/plain"
    }
}
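
As a rough illustration of how a client might read such an augmented inventory, the sketch below resolves a logical path in a version's state to its digest and then looks that digest up in the mediaTypes block. Note that mediaTypes is the hypothetical key from the example above, not part of OCFL, and media_type_for is a name made up for this sketch.

```python
import json

# Trimmed copy of the inventory above; "mediaTypes" is the hypothetical extra key.
INVENTORY_JSON = """
{
    "id": "obj1",
    "head": "v1",
    "manifest": {"5a..e1": ["v1/content/test.txt"]},
    "versions": {
        "v1": {"state": {"5a..e1": ["test.txt"]}}
    },
    "mediaTypes": {"5a..e1": "text/plain"}
}
"""


def media_type_for(inventory, logical_path, version=None):
    """Return the media type recorded for a logical path, or None if unknown."""
    version = version or inventory["head"]
    state = inventory["versions"][version]["state"]
    for digest, paths in state.items():
        if logical_path in paths:
            # A missing "mediaTypes" key or a missing digest both mean "unknown".
            return inventory.get("mediaTypes", {}).get(digest)
    return None


inventory = json.loads(INVENTORY_JSON)
print(media_type_for(inventory, "test.txt"))
```

Because the lookup goes through the digest, files with identical content would share one media type entry, which may or may not be what an extension author wants.
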
@ahankinson
Contributor

It was certainly intended that no other keys can be present in the inventory other than those defined in the spec.

@neilsjefferies
Member

I'm pretty sure it was to enable subsequent versions of OCFL to make additions without breaking backwards compatibility.

@ahankinson
Contributor

ahankinson commented May 21, 2020

If new keys are added to the inventory then that would necessitate a new minor version of the spec, but it wouldn't break backwards compatibility. If the current keys defined in the spec were to be renamed or removed, it would break backwards compatibility but it would also require a major version change.

But in both cases I don't believe it was ever intended that keys not defined in the spec are permitted in the inventory. I think the MUSTs in the inventory section are missing an explicit 'and MUST NOT contain any other keys'.

@neilsjefferies
Member

If there is a MUST NOT clause then adding new keys in new versions breaks backwards compatibility. A V1.1 Inventory would fail V1.0 validation. This was something I thought we wanted to avoid as much as possible.

However, now we have the concept of an extensions mechanism, which did not exist when we worked on this bit, we have the additional possibility of having an "extension" key - which can contain all the extensions relevant to the object with their parameters. We can then require non-OCFL keys to be encapsulated in an extension. Thoughts...?
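
To make the encapsulation idea concrete, here is a minimal sketch of what moving non-OCFL keys under a single wrapper key could look like. The wrapper key name "extensions", the extension name "media-types", and the encapsulate helper are all assumptions for illustration; nothing here is defined by the spec. SPEC_KEYS lists the top-level keys that OCFL 1.0 defines.

```python
import copy

# Top-level inventory keys defined by OCFL 1.0.
SPEC_KEYS = frozenset({
    "id", "type", "digestAlgorithm", "head",
    "contentDirectory", "fixity", "manifest", "versions",
})


def encapsulate(inventory, extension_name):
    """Return a copy with every non-spec top-level key moved under
    the hypothetical extensions[extension_name] wrapper."""
    result = {k: copy.deepcopy(v) for k, v in inventory.items() if k in SPEC_KEYS}
    extra = {k: copy.deepcopy(v) for k, v in inventory.items() if k not in SPEC_KEYS}
    if extra:
        result["extensions"] = {extension_name: extra}
    return result


augmented = {"id": "obj1", "head": "v1", "mediaTypes": {"5a..e1": "text/plain"}}
print(encapsulate(augmented, "media-types"))
```

With this shape a validator would only need to tolerate one additional top-level key, and everything inside it would be the extension's responsibility.
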

@pwinckles
Author

I had assumed that inventories were to be validated against the OCFL version specified in their type.

There are some ambiguities here that should probably be clarified in the spec. Here are some more questions:

  1. Can an object created as OCFL v1.0 later be changed to v1.1 in a similar fashion to changing digestAlgorithm?
  2. If yes, then which version does the object conformance declaration reflect?

In a scenario where multiple OCFL spec versions exist that have substantive differences in inventory serialization, it becomes problematic to deserialize inventory files to anything other than a basic map structure if the format cannot be known without reading the inventory contents.
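
A small sketch of that problem, assuming the version can be read from the type URI in the same form as the example inventory above: the file has to be parsed into a plain map first, and only then can a reader pick version-specific handling. The declared_version helper and the regular expression are illustrative assumptions, not spec-defined behaviour.

```python
import json
import re

# Matches type values shaped like the one in the example inventory.
TYPE_PATTERN = re.compile(r"^https://ocfl\.io/(\d+\.\d+)/spec/#inventory$")


def declared_version(raw_json):
    """Parse an inventory into a plain dict and return (version, inventory)."""
    inventory = json.loads(raw_json)  # generic map; no typed model possible yet
    value = inventory.get("type")
    match = TYPE_PATTERN.match(value) if isinstance(value, str) else None
    if match is None:
        raise ValueError("inventory has no recognisable 'type' value")
    return match.group(1), inventory


version, inventory = declared_version(
    '{"id": "obj1", "type": "https://ocfl.io/1.0/spec/#inventory"}'
)
print(version)
```
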

That aside, if you were to allow inventories to include keys that are not defined in the spec, I would certainly feel better about it if they were at the very least encapsulated in some way, as @neilsjefferies was suggesting.

@bcail
Contributor

bcail commented May 21, 2020

A mediaType extension like @pwinckles's example would be interesting to me. mediaType/mimetype is one of the system properties handled by Fedora 3. I can move the mimetype to another file in ocfl, but if it were an option, I might put the mimetype in inventory.json. And I'd be fine with it being encapsulated in an "extensions" key. But I don't have to have it in inventory.json at all - I can handle it either way.

@zimeon
Contributor

zimeon commented May 21, 2020

My recollection is that it was intentional only to specify the necessary structure of the inventory so that there could be evolution or extension anywhere else. Boxing within extensions is fine for extensions but doesn't allow good evolution to future versions.

@pwinckles
Author

Is the inventory_schema.json up to date? It does not allow undefined properties: https://github.com/OCFL/spec/blob/master/draft/spec/inventory_schema.json#L7

@ahankinson
Contributor

Validating a 1.1 inventory with a 1.0 validator is a break in forward compatibility (the object is newer than the code that is verifying it), but I think that's expected. (You can't build a validator that will predict the future...)

If the new keys added in the subsequent version were required keys I think it should mean a major version change (e.g. 2.x) but if they were optional it would only require a minor version bump. I don't see how a blanket 'MUST NOT' would break compatibility, though -- if new keys are added, a new version of the spec is created, and then the MUST NOT no longer forbids that key. If, for example, @pwinckles's suggestion of a mediaTypes key gets added to 1.1, then it is no longer forbidden and can be included in v1.1 inventories.
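
The argument can be sketched as a key check that is relative to the spec version being validated against. The 1.0 key set is the one the spec defines; the "1.1" entry is purely hypothetical (no such version or mediaTypes key exists), and forbidden_keys is a made-up helper name.

```python
# Top-level inventory keys defined by OCFL 1.0.
V1_0_KEYS = frozenset({
    "id", "type", "digestAlgorithm", "head",
    "contentDirectory", "fixity", "manifest", "versions",
})

ALLOWED_KEYS = {
    "1.0": V1_0_KEYS,
    # Hypothetical future version that adds an optional mediaTypes key.
    "1.1": V1_0_KEYS | {"mediaTypes"},
}


def forbidden_keys(inventory, spec_version):
    """Return, sorted, the top-level keys the given spec version does not define."""
    return sorted(set(inventory) - ALLOWED_KEYS[spec_version])


inventory = {"id": "obj1", "head": "v1", "mediaTypes": {"5a..e1": "text/plain"}}
print(forbidden_keys(inventory, "1.0"))
print(forbidden_keys(inventory, "1.1"))
```

The same inventory is rejected under the 1.0 rules and accepted under the hypothetical 1.1 rules, which is the sense in which the MUST NOT moves with the spec version rather than blocking additions.
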

I think we should be clear about the purpose of the inventory file. In my mind, it contains only the data required to effectively track the changes to the files in the object. What I don't think it should be is a sort of config file for capturing different options and behaviours of the object. That would be my concern if we started putting keys, such as extensions, in it.

@neilsjefferies
Member

Yep, @ahankinson, I meant forwards. This does potentially matter though: an object version (and thus its inventory) should always be valid within all future versions of OCFL (since it is immutable). This actually places quite a few restrictions on what we can do with inventory entries. For example, it is not possible to require new keys without version specific language. @pwinckles I think this answers your question on conformance and upgrading too. A V2.0 object can contain V<2.0 versions and they should be valid. Inventory versions can only ratchet upwards with new versions, obviously!

Being overly proscriptive about keys doesn't prevent any failure modes or add new capabilities as far as I can see - all it does is add an additional compatibility issue for no obvious benefit.

@zimeon Hence I said "non-OCFL" keys should be in extensions, future OCFL versions should be able to specify additional keys. This needs careful wording though.

In the case of digests and fixity outside the OCFL standards, it does make sense that some reference can be made in the inventory to the relevant extension.

@ahankinson
Contributor

Uhhh... I don't think I agree with @neilsjefferies. We haven't really made any declarations about whether OCFL Objects can have mixed versions, and the presence of the 0=ocfl_object_1.0 NamAsTe file in the root as a declaration of the OCFL Object version would make that difficult.
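
A minimal sketch of why the root declaration makes mixed versions awkward: a reader that takes the object's version from the declaration file name gets exactly one version for the whole object. The object_version helper and the file-name pattern (generalised from 0=ocfl_object_1.0) are assumptions for illustration.

```python
import re

# Generalised from the 0=ocfl_object_1.0 file name mentioned above.
DECLARATION = re.compile(r"^0=ocfl_object_(\d+\.\d+)$")


def object_version(root_filenames):
    """Return the single OCFL version declared in an object root listing."""
    versions = [
        m.group(1)
        for name in root_filenames
        if (m := DECLARATION.match(name))
    ]
    if len(versions) != 1:
        raise ValueError(
            "expected exactly one object declaration, found %d" % len(versions)
        )
    return versions[0]


print(object_version(["0=ocfl_object_1.0", "inventory.json", "v1"]))
```
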

@zimeon
Contributor

zimeon commented May 21, 2020

To my mind, the handling of mixed version objects is an issue to defer for now.

@zimeon zimeon added this to the 1.0 milestone May 21, 2020
@ahankinson ahankinson self-assigned this Jun 2, 2020
@ahankinson
Contributor

Editor's meeting: Decision was to disallow attributes not specified in the spec, under the principle that it would be better to restrict behaviours and then gradually open them up than to do the opposite if it becomes a problem. This will be open for community feedback and more use-case gathering post-1.0.

@zimeon zimeon added Decided An editorial decision that was decided and removed Needs Discussion labels Jun 2, 2020
ahankinson added a commit that referenced this issue Jun 9, 2020
@bdwheele

Crud, it looks like I missed the ticket. At IU we're starting to look at OCFL as a potential storage format, and since we're using tapes as our storage, being able to hold some technical metadata (or other storage information) about the files in an object as part of the inventory would be a big help in reducing tape access if someone just wants to know the (rough) size of the object or duration information.

Until the 2.0 release cycle, is there an option to add a use-at-your-own-peril key that could be used to store that information and still validate? I'm thinking in the same vein as IANA's X-* mime types.

@ahankinson
Contributor

ahankinson commented Jun 11, 2020

Hi @bdwheele, the inventory wasn’t really designed to hold metadata about the files, it was primarily designed to make the versioning system in OCFL work. Since it’s trying to keep a level of compatibility across time and across clients, and because we didn’t feel like we had gathered enough use cases for this, we felt it was best to follow Postel’s law and “be conservative in what we validate”.

The equivalent to the “x-*” mimetype would probably be an extension that gathered the relevant metadata you need.

@bdwheele

That's fair enough and I understand the rationale, but if I may, I'd like to offer a bit of background on where I'm coming from to provide some context.

Here at Indiana University we have several decades of scanned documents (photos & books) as well as a sizeable collection of A/V material (~14PB) all of which are stored on a proprietary tape system (HPSS). As a further fly in the ointment, we share the tape system with the rest of the university so we have to be good citizens and not inadvertently create denial-of-service to the other units.

We're currently looking at what our preservation situation is going to look like for the future since what we have is a mix of several different systems. OCFL has come up multiple times during our investigation and it looks interesting. We have not decided if we want to use someone else's management software (with modifications to use our storage) or if we want to write our own, or even a mix of the two.

Using a tape system creates a lot of headaches when managing the content due to latency issues, so we try to collect as much information about the objects as possible before they're committed to storage. The downside currently is that the information is kept in two separate places: a copy in our database(s) and one on the tape storage (in several cases). We would like to make sure that the metadata we've collected in both places is consistent, or could be rebuilt without reading the files since that's incredibly time consuming.

It seems to me that having the ability to store arbitrary (and explicitly separated) data in the inventory would be a good thing. With that (and RFC 760's "Liberal in its receiving" text) in mind, would it be possible to create a 'private' top-level key that would be used by management tools to store arbitrary data about the object, which would not be validated by the OCFL validator beyond being syntactically correct? Within the private node it would probably be wise to suggest an application ID (such as edu_indiana_dlib_archivemanager or something) to allow multiple applications to store metadata without interfering with either the OCFL content or content stored by other applications.
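
As a sketch of the proposal (not anything the spec provides), application data could be namespaced under a 'private' key by application ID so that two tools never touch each other's entries. The key name private comes from the paragraph above; with_private_data and the second application ID are made up for illustration.

```python
import copy


def with_private_data(inventory, app_id, data):
    """Return a copy of the inventory with data stored under private[app_id].

    Other applications' entries and the OCFL-defined keys are left untouched.
    """
    result = copy.deepcopy(inventory)
    result.setdefault("private", {})[app_id] = copy.deepcopy(data)
    return result


base = {"id": "obj1", "head": "v1"}
step1 = with_private_data(
    base, "edu_indiana_dlib_archivemanager", {"sizes": {"5a..e1": 1024}}
)
# "org_example_tarlike" is a hypothetical second application.
step2 = with_private_data(step1, "org_example_tarlike", {"mode": "0644"})
print(sorted(step2["private"]))
```

A parser that only understands the OCFL-defined keys could then skip the whole private node without needing to know which applications wrote to it.
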

For IU we'd probably want to store technical metadata (size, mime type, stream information, etc), but a tar-like application that generates OCFL would likely include ownership and permissions. One assumes that "immutable" descriptive metadata would also be an option (ownership, alternate IDs, title, etc)

A solution of that nature would be forward compatible because it is up to the application to manage that content. Data stored in the private space would be ignored by parsers reading the 1.0(?) spec. If a future specification included fields for commonly used metadata, it would be up to the application to upgrade the package (because there are other files involved beyond the inventory) and it would be able to deal with backward compatibility by looking at both the future spec's location as well as its own private data.

An additional benefit to adding this space is that it would provide real usage data for future directions for OCFL: if all of the management applications are storing file size, for example, that would lend credence to adding a size field for future versions of OCFL.

Thank you for your time.

@pwinckles
Author

@bdwheele In your use case, is there a prohibiting factor that makes storing metadata in an object content file unacceptable?

University of Technology Sydney is currently doing this by combining OCFL and ro-crate.

@bdwheele

@pwinckles, no there isn't anything absolutely preventing us from going that route and it could work. So it's not a blocker, but embedding the immutable metadata in with the inventory could offer advantages in some situations:

  • Reading the information about the object would require only a single tape read. Due to the system we have there's no guarantee that two files would end up on the same tape. In HPSS, often files pushed at a single time are copied to separate tapes to speed up storage. Even if files were stored on the same tape, some metadata (such as descriptive metadata for the object as a whole) would most likely incur a substantial seek time to grab the current version inventory and the metadata that was written at object creation time.
  • If arbitrary metadata is allowed in the inventory then everything one needs to know about the object is in a single document, even if the consumer doesn't understand or care about all of it. Loading those documents into something like mongodb and doing collection queries would be trivial. It's true that documents could be merged from multiple files, but which files? It may be different on objects from different sources or different types of objects.
  • There's nowhere in the OCFL specification (at least that I see) that is the equivalent to BagIt's tags for organization, contact information, or description. If someone gets a tarball of an OCFL object there's no context (beyond the ID -- which may not be resolvable outside the creator's organization) about what the object is or where it came from. By allowing packaging tools to put information into the inventory, if that's something that's deemed important by the organization, there's a place to put it. It may not be a rigorously defined place, but it would get you into the self-documentation ballpark.

julianmorley pushed a commit that referenced this issue Jun 15, 2020
* Fixed: disallow arbitrary keys

Fixes #474

* Fixed: Addressing review comments

Moved MUST NOT constraint to section introduction