Adding a file size key to the inventory #629

Closed
tomwrobel opened this issue May 10, 2023 · 10 comments

@tomwrobel

tomwrobel commented May 10, 2023

Following on from, but not necessarily looking to revive: #474

It would be very useful for a repository manager to know how big an OCFL object and its component binary files are on disk. It affects a lot of decisions we're likely to make regarding how to handle the object and its component files.

Given the processing work required to generate the checksum, it seems like an opportunity to also record the size of the binary file that a given checksum represents. A key akin to the 'fixity' key, containing an object of key-value pairs, might allow this, e.g.:

"size": {
    "4d27c8...b53": "1213131",
    "7dcc35...c31": "83488484",
    "cf83e1...a3e": "0",
    "ffccf6...62e": "85834853845384422"
}
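
As a rough illustration of the point about reusing the checksum pass, here is a minimal Python sketch that hashes a file and counts its bytes in a single read. The helper names `digest_and_size` and `build_size_block` are hypothetical, and the choice of `sha512` and of a decimal string for the size are assumptions made to match the example above, not anything defined by the OCFL specification.

```python
import hashlib


def digest_and_size(path, algorithm="sha512", chunk_size=1024 * 1024):
    """Return (hex_digest, size_in_bytes) for a file, reading it only once."""
    hasher = hashlib.new(algorithm)
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large binaries are never held in memory at once.
        while chunk := handle.read(chunk_size):
            hasher.update(chunk)
            size += len(chunk)
    return hasher.hexdigest(), size


def build_size_block(paths):
    """Map each file's digest to its size as a decimal string.

    The resulting shape mirrors the proposed "size" key; it is an
    assumption for illustration, not part of any published inventory schema.
    """
    block = {}
    for path in paths:
        digest, size = digest_and_size(path)
        block[digest] = str(size)
    return block
```

Because the size is accumulated from the same chunks that feed the hasher, recording it adds no extra I/O beyond what the checksum already needs.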

@tomwrobel
Author

This could be considered as an additional kind of fixity check - the file should be x bytes in size - but I suspect I'm pushing the definition of the word 'fixity' here.

@rosy1280 rosy1280 added the Extensions Tickets that we believe should be extensions label Jun 1, 2023
@zimeon
Contributor

zimeon commented Jun 1, 2023

2023-06-01 Editors' discussion -- This could be done within the current specification by creating an extension that defines (as mentioned in #629 (comment)) a new fixity type, perhaps called size, that is simply the file size.
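
To make the suggestion concrete, the sketch below shows one way a fixity block could carry `size` as if it were a digest algorithm, following the usual fixity layout of algorithm name, then value, then a list of content paths. The function name `size_fixity_block`, the input shape, and the decimal-string encoding of the size are all assumptions for illustration; the actual rules would be whatever the extension ends up defining.

```python
from collections import defaultdict


def size_fixity_block(sizes_by_content_path):
    """Build a fixity-style block that treats 'size' as a digest algorithm.

    sizes_by_content_path: dict of content path -> size in bytes (int).
    Returns {"size": {"<decimal size>": [content paths...]}}.
    """
    by_size = defaultdict(list)
    for content_path, size in sizes_by_content_path.items():
        # Assumed encoding: the size as a decimal string, used as the "digest".
        by_size[str(size)].append(content_path)
    return {"size": {value: sorted(paths) for value, paths in by_size.items()}}
```

Note that any two files of equal length end up under the same key, whether or not their content is related.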

@tomwrobel
Author

@zimeon
Contributor

zimeon commented Jun 3, 2023

Yes, the process is outlined in https://github.com/OCFL/extensions/blob/main/docs/0001-digest-algorithms.md#maintenance. Because we are not versioning extensions, the PR should create a new digest algorithms extension that obsoletes 0001.

@tomwrobel
Author

Spun out to OCFL/extensions#64

tomwrobel added a commit to tomwrobel/extensions that referenced this issue Jun 6, 2023
Create a new digest algorithms extension to add 'size' to
the list of allowed algorithms; obsolete the previous
digest algorithms extension.

As described in OCFL#64 and
following discussion in OCFL/spec#629.

@srerickson
Contributor

srerickson commented Jun 7, 2023

The implication of size as a fixity digest algorithm is that collisions in fixity entries are not merely possible; they are to be expected. I'm wondering if this represents a significant enough change in how implementers should treat the fixity block to warrant further discussion.

@zimeon
Contributor

zimeon commented Jun 8, 2023

Interesting question @srerickson. My feeling is that it doesn't represent a major change in how fixity should be used, but I'd love to hear other thoughts. I just created a new fixture suggestion of an object that has two different files with the same MD5 value: OCFL/fixtures#107. Implementations have to deal with this possibility even without extension digests that might be even weaker than the currently specified digests.

@srerickson
Contributor

@zimeon that fixture is really helpful, thanks! This issue has helped me identify a problem in my own implementation, where fixity collisions are treated as an error condition instead of being handled gracefully.

I don't mean to belabor the point, but I wonder if the implementation notes could address collisions a bit better. From this discussion, a key difference between fixity and manifest digests is that manifest digests are assumed to be collision-free, whereas collisions in fixity digests should be expected and handled gracefully. This point doesn't come across very clearly in the current fixity section which, instead, focuses on content addressability and tampering.
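
For what it's worth, here is a minimal Python sketch of what "handled gracefully" might look like. It assumes the manifest and a single algorithm's fixity entries are plain dicts of value to list of content paths; the function names `find_fixity_collisions` and `check_fixity` are hypothetical and do not come from any existing OCFL library.

```python
def find_fixity_collisions(manifest, fixity_for_algorithm):
    """Return fixity values shared by content paths with different manifest digests.

    A collision is reported as information, not raised as an error.
    """
    # Invert the manifest: content path -> manifest digest.
    digest_of = {
        path: digest for digest, paths in manifest.items() for path in paths
    }
    collisions = {}
    for value, paths in fixity_for_algorithm.items():
        distinct = {digest_of[p] for p in paths if p in digest_of}
        if len(distinct) > 1:
            collisions[value] = sorted(distinct)
    return collisions


def check_fixity(actual_values, fixity_for_algorithm):
    """Return content paths whose recomputed value differs from the recorded one.

    actual_values: dict of content path -> value recomputed from disk.
    Each path is checked on its own, so shared values never cause a failure.
    """
    mismatches = []
    for expected, paths in fixity_for_algorithm.items():
        for path in paths:
            if actual_values.get(path) != expected:
                mismatches.append(path)
    return sorted(mismatches)
```

The design choice here is that validation is per content path: two different files recorded under the same fixity value both pass as long as each one still produces that value.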

@zimeon
Contributor

zimeon commented Jul 6, 2023

2023-07-06 Editors' discussion - we agree that it would be helpful to add a note to the fixity section of the Implementation Notes pointing out that fixity algorithms may generate the same value for different file content.

@rosy1280
Contributor

A PR for the digest algorithms extension has been submitted and is under review.

rosy1280 pushed a commit to OCFL/extensions that referenced this issue Sep 22, 2023
* Add 'size' to list of allowed digest algorithms

Create a new digest algorithms extension to add 'size' to
the list of allowed algorithms; obsolete the previous
digest algorithms extension.

As described in #64 and
following discussion in OCFL/spec#629.

* Make integer explicitly decimal

* Update size string expression definition