Add a FileSet class #59
Comments
👎 If it is required. Further discussion in this thread: https://groups.google.com/forum/#!msg/pcdm/Ep1Cty2JDx4/Hdw8I5pbDwAJ |
I've said it before, but I think enforcing an additional resource for the (I have no numbers to back this up...but) small percentage of use cases that will need multiple file sets is not to the general benefit. I understand the opinion around consistency for development, but I don't find this particularly persuasive. It seems that we can find some way to allow for the need of multiple FileSets without requiring them for all PCDM based resources. |
@whikloj Can you say more about your use cases? What kinds of content do you need to support? |
@mjgiarlo at bare minimum, the content we support now in Islandora 1.x; which, I am near 100% sure, does not have any public use cases from the Islandora Community on the need for multiple FileSets as described. Should we need FileSets, we'd be perfectly happy using the FileSets extension. 👎 forcing FileSets |
If the PCDM vocabulary did not use predicates such as |
@mjgiarlo sorry our data is 90% newspapers. But the question is more about the possibility of multiple FileSets. We scan and, for better or worse, never scan again. So a second FileSet is extremely unlikely in most all of our use cases. But perhaps I am misunderstanding the use cases that drove this need. |
@whikloj @ruebot @awoods The majority of the objects in your repository don't store derivatives? The goal of FileSets is "here's a grouping of binaries and a description about why they're grouped." If FileSets aren't required, then I assume you just never have multiple FileSets, and your object representing "Page 1" just has three files hooked to it and there's no reason to group the master with the derivatives because only one of those is a master and you can identify it? So, is optional filesets something we can imagine an ingest routine for? "If it uses hasFileSet, go there for files, if it just hasFile, go there, there's only one grouping, if it has both...I dunno, count it as three filesets?" |
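A minimal sketch of the ingest rule floated above, assuming triples are plain `(subject, predicate, object)` tuples. The predicate name `pcdm:hasFileSet` and the choice to treat directly attached files as one extra implicit grouping are assumptions for illustration, not anything the ontology defines.

```python
from collections import defaultdict

# Assumed predicate names; hasFileSet is hypothetical and not part of PCDM core.
HAS_FILE = "pcdm:hasFile"
HAS_FILE_SET = "pcdm:hasFileSet"


def file_groupings(triples, obj):
    """Return the file groupings for obj as a list of sorted lists of file IDs."""
    by_subject = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        by_subject[s][p].append(o)

    # Each explicit FileSet contributes one grouping made of its own files.
    groupings = [
        sorted(by_subject[fs][HAS_FILE]) for fs in by_subject[obj][HAS_FILE_SET]
    ]

    # Files hooked directly to the object, and not already in a FileSet,
    # are treated as a single implicit grouping (an assumption of this sketch).
    grouped = {f for group in groupings for f in group}
    loose = sorted(f for f in by_subject[obj][HAS_FILE] if f not in grouped)
    if loose:
        groupings.append(loose)
    return groupings
```

With only `hasFile` links the routine yields one grouping; with both kinds of link it yields one grouping per FileSet plus one for the leftover files, which is only one of several possible answers to the "if it has both" question.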
Ah, forgot to mention, the reason multiple filesets became a thing was there were institutions who had multiple masters for a single "page". |
@tpendragon if I understand correctly, the use case that motivated this new RDFS class is having multiple Masters, each with multiple derivatives, for the same real-world entity (page 1 of book 1, for example). By the way, Islandora (and @whikloj) do store derivatives, and a lot of them. I don't feel our use cases are that different, but our approaches are. Grouping multiple binary resources together under a FileSet class can be solved (without FileSet) by linking (via a specific predicate, not necessarily in PCDM space) a given Master binary to its derivatives. You could even make another construct, via proxies that point to the binary you want to consider your canonical Master. Just one of many ways. How do you solve the "which of the many masters" problem with FileSet? I also start to understand that some of these needs are based on programming paradigms, which is maybe the misunderstanding we have here, and probably has something to do with how Hydra does data modelling, hardcoding structure (based on RDF class to Ruby class matching, just guessing), versus what we want to do (trying to extract structure, constraints and requirements from ontologies and the triple store).
That is what I mean by hardcoded. The ontology itself (the semantic definition of the class + object properties allowed and their target classes) should give us that info instead of the code. I feel that is the idea of using Linked Data. I'm not saying one approach is better than the other. I'm saying we probably differ on how we deal with the logic over the structure. |
Nah, it's totally arbitrary. We store a model statement on the resource to map to the ruby class.
I suppose my hypothesis is that this specification is only as good as the tools we can build which utilize it. Arbitrary ingest was just an example. Someone somewhere will have to build that logic; if we can encode that logic into the ontology via constraints that the developer will have to follow, then great.
So I think in order to do this we'd have to redefine what a pcdm:File is, because right now it's a binary file. Bits don't have RDF statements - they can have nodes which describe them, but that's not part of the ontology now (except in the case of FileSets). |
Ah, I may be letting Fedora leak into my previous argument. To be clear, you're proposing:
Which I think solves that case, yeah. |
We've always said Files only have access and technical metadata. What's |
The other issue is one of practicality: FileSets are implementable in LDP, and specifically Fedora. If we say |
Just a comment, not pro or con filesets at this point: < derivedFrom > to my mind would count as technical metadata. It is expressing the technical process (via the relationship) used to generate that binary. There's an ebucore property that could possibly be used there, if we go that route. |
@ruebot What does "the content [you] support now in Islandora 1.x" look like? If you've already got this jotted down somewhere, I'd be happy to review existing documentation rather than expect you to type it all out. :) I'm struggling to imagine that Islandora doesn't already have robust support for multi-file works. |
w/r/t FileSets, they are, IMO, a very natural way to describe book-like objects. They are, effectively, how we describe thousands of such objects in our current repo: Manuscript -> Page -> (Set of files w/ color targets) and (set of files w/o color target). And I would expect to model them the same way going forward w/ F4. There are other objects in our repo (hundreds of thousands of them). For these resources, the FileSet abstraction is unnecessary. For these, I could live with that additional layer (FileSets) if necessary, even if not ideal. Looking into the future, one of the big areas of growth in our repository will be faculty research data. This data does not look anything like books. An example from last semester: perturbations of protein data observed over a period of time (as in several million observations over hundreds of specific protein chains). For this data, not only would FileSets be not quite to the point, the entire PCDM structure would likely get in the way. And no, I do not see "put all the data into a big zip file and be done with it" as an option. So I am left with a choice: do I model some objects (e.g. book-like things) using PCDM and some objects in ORE, or do I attempt to have consistency across the repository in terms of how structural metadata is expressed? Personally, I opt for the latter. |
In this parallel discussion FileSets are regarded as specific aggregations representing "digital content" in an abstract sense. The Files that they aggregate are different manifestations (i.e. different file formats, encodings, derivations, subsets, etc.) of the same digital source, so they have specific common traits. This gives FileSets a defined role (e.g. a scan of a page) distinguished from other pcdm:Objects which represent higher-level aggregations (e.g. pages, books, collections etc.). Even with a FileSet with one File (which is quite unlikely because you will almost surely have a thumbnail, an access image, a preservation copy, an OCR or metadata extract, etc.) you would still benefit from having an independent FileSet to put descriptive metadata about the digitized content. The dc:creator of a pcdm:Object may be the monk who wrote a book page, the dc:creator of the pcdm:File the photographer who reproduced it, and so on. 👍 to FileSets. |
Sorry, wrong link for the discussion mentioned. I mean this one: https://groups.google.com/forum/#!topic/hydra-tech/u181eBfgJcU |
What if we made FileSet not a subclass of Object (the current proposal), but allowed a single resource to be a FileSet and an Object at the same time? This would allow users to skip the extra node if they had no use for it. For example, if you had an Object with both image and text representations, you could have separate FileSets to separate them out:
But you could also just have a single Object/FileSet combo resource:
|
This again would conflate the concepts of real world object and digital content. I find having a dedicated class for the latter very useful. I don't see a problem with a FileSet hanging out by itself or potentially having multiple relationships with other Objects. At AIC we have Assets (wannabe FileSets), such as a photographic portrait, that can be representations of both an artwork and a person (Objects). |
In the above for FileSet I meant "a digital reproduction of a photographic portrait". Also, in the scenario you lay out:
you may think you are content with a simple book page that has only one image. But if in 5 years you make a better reproduction of that page, you will have a hard time separating the old reproduction from the new one. This discussion seems very similar to the one about the ordering ontology and about why we build a complex structure even for a simple scenario: the reason is to be interoperable and allow for expansion. |
@escowles exactly, I'd love to re-scan but the funding is generally always for new digitization, so... We currently have only 675,000 newspaper pages in Fedora 3. Every page has a single master (TIFF) and derivatives, and as I said we never re-scan unless the original scan is unusable, in which case we don't add the unusable scan. So a set of pcdm:File(s) attached to a pcdm:Object with (perhaps) pcdm use or ebucore predicates is perfectly workable. Heck, I don't even really want my derivatives in Fedora (but that is a CLAW discussion). So while I accept that some people want/need the ability to have multiple FileSets, to me it is just an extra layer to traverse. |
The benefit would be in separating the metadata about the newspaper page and the one about its scan.
@whikloj is this correct? I wonder how one would not even leave room for an option. |
So see, I would put the metadata about the newspaper page on the pcdm:Object (RdfSource) and the metadata about the scan on the pcdm:File (NonRdfSource). @scossu, I'm not saying we don't leave room for the ability. As it has not yet happened for us, I would love for it to be an option. But why force the extra layer for everything? |
You would put the descriptive metadata about the page (e.g. the author of the article(s), date of publication) in the Object; descriptive metadata about the digitized content (author and date of the scan) in the FileSet and technical metadata about the file itself (characterization, file timestamp, etc.) in the File.
To have one single model to predictably store and find information instead of two different ones depending on whether you plan on having one or more files. |
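To make the three-level split concrete, here is a small sketch using plain tuples as triples. Everything under the `ex:` prefix, the literal values, the `ex:FileSet` class name, and the use of `pcdm:hasMember` to link the Object to its FileSet are assumptions for illustration only.

```python
# All "ex:" identifiers and literal values below are made up for illustration.
PAGE = "ex:page1"
SCAN = "ex:page1/scan"
MASTER = "ex:page1/scan/master.tif"

TRIPLES = [
    # Descriptive metadata about the real-world page lives on the Object.
    (PAGE, "rdf:type", "pcdm:Object"),
    (PAGE, "dc:creator", "article author"),
    (PAGE, "pcdm:hasMember", SCAN),
    # Descriptive metadata about the digitization lives on the FileSet.
    (SCAN, "rdf:type", "ex:FileSet"),
    (SCAN, "dc:creator", "scanning technician"),
    (SCAN, "pcdm:hasFile", MASTER),
    # Technical metadata about the bitstream lives on the File.
    (MASTER, "rdf:type", "pcdm:File"),
    (MASTER, "ebucore:hasMimeType", "image/tiff"),
]


def describe(triples, subject):
    """Collect a subject's properties (keeps one value per predicate)."""
    return {p: o for s, p, o in triples if s == subject}


def scan_creators(triples, obj):
    """Return the creator of every digitization (FileSet) attached to obj."""
    filesets = [
        o
        for s, p, o in triples
        if s == obj
        and p == "pcdm:hasMember"
        and describe(triples, o).get("rdf:type") == "ex:FileSet"
    ]
    return [describe(triples, fs).get("dc:creator") for fs in filesets]
```

The lookup is the same whether a page has one digitization or several, which is the predictability argument being made; the cost is the extra node to traverse.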
@scossu Why would you not put descriptive metadata about the digitized content on the File itself? |
@ruebot yes, if you intend the pcdm:hasFile relationship to apply to JP2, JPG, and TN as well. |
reason for not having the fileset be the direct container itself? not a criticism. just curious. |
@dannylamb Maybe we could have the FileSet be a DirectContainer itself. Would it be more palatable if the LDP projection was basically the same, we're just calling out existing |
Huh. Do multiple direct containers work? The problem would be you'd have to manage a link that isn't ldp:contains to find the "FileSet" |
I am not quite following the discussion about LDP here and maybe I am missing an important part of the PCDM fundamentals, so bear with me. How is PCDM related to LDP, and most important, is PCDM membership related to LDP containment? My understanding is that PCDM defines the role of resources and their semantic relationships, while LDP focuses on structure and traversal. If we are talking about implementation examples around @ruebot's graph, I understand. If we are introducing LDP concepts in PCDM I would be OK as well, I just would like to know if this has always been a common understanding. To this point, I would actually rephrase @ruebot's statement to "Files MUST be members of exactly one FileSet". |
@scossu There's always been some tension about how to treat LDP: on one hand, PCDM is an abstract model that could be implemented in any number of systems. But on the other hand, most of the people involved in PCDM are planning to implement it with Fedora 4, so how PCDM maps to LDP is an important consideration. So I would say that LDP is definitely not a part of PCDM or required to use it. But many people who use PCDM are also interested in LDP, so it makes sense to also agree on the mapping (though separately from the modeling discussions). In this particular case, I think the LDP mapping is relevant to the modeling discussion, because it changes whether adding an extra FileSet node results in adding an extra LDP container or not (with implications for scalability, etc.). If adding a FileSet only results in slightly redefining an existing container in the LDP projection, then maybe that lessens the objection to requiring it. |
@tpendragon I believe that we could have a pcdm:Object as a BasicContainer and it could have multiple DirectContainers which were FileSets. This would result in the Object having direct hasFile links to each of the Files, and you could also add hasFileSet linking to the FileSets, which would link to their containing Files with ldp:contains. The triples would look like:
|
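The triples from the comment above are not reproduced here. As a rough illustration of the LDP projection it describes in prose, the sketch below derives the `pcdm:hasFile` membership triples that DirectContainers acting as FileSets would produce on their parent Object. The paths and the dictionary layout are hypothetical; `ldp:membershipResource` and `ldp:hasMemberRelation` are the LDP terms that configure a DirectContainer.

```python
def membership_triples(containers, contains):
    """Derive the membership triples implied by ldp:DirectContainers.

    containers: maps a container path to its DirectContainer configuration.
    contains:   list of (container, child) pairs, i.e. ldp:contains links.
    """
    derived = []
    for container, child in contains:
        cfg = containers.get(container)
        if cfg is None:
            # Not a DirectContainer: containment only, no membership triple.
            continue
        derived.append(
            (cfg["ldp:membershipResource"], cfg["ldp:hasMemberRelation"], child)
        )
    return derived


# Hypothetical layout: one Object with two FileSets modeled as DirectContainers.
CONTAINERS = {
    "/page1/fs1": {
        "ldp:membershipResource": "/page1",
        "ldp:hasMemberRelation": "pcdm:hasFile",
    },
    "/page1/fs2": {
        "ldp:membershipResource": "/page1",
        "ldp:hasMemberRelation": "pcdm:hasFile",
    },
}
CONTAINS = [
    ("/page1", "/page1/fs1"),
    ("/page1", "/page1/fs2"),
    ("/page1/fs1", "/page1/fs1/master.tif"),
    ("/page1/fs2", "/page1/fs2/transcript.txt"),
]
```

Under these assumptions the Object ends up with direct `pcdm:hasFile` links to every File, while each FileSet still reaches its own Files through `ldp:contains`.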
Can we clarify that we're using LDP or not? see #56 I'm mildly +1 on using LDP, but I think we need a definitive answer on the topic before proceeding with further discussion about FileSets. |
Here is a summary of issues and questions with
|
Mea culpa, question 2 should be |
@whikloj updated. |
There's some confusion, because FileSet has been refined multiple times due to the previous large ticket. I think the status of those answers now is this:
So my only point is this: Let's say we don't do FileSets as a required construct - there's no node describing file grouping. We, at Hydra, obviously have use cases and have fallen down on it as a necessary construct. So let's say we keep it in the extension, and don't violate anything PCDM. You don't do that. So let's say we each represent a postcard. Hydra:
Islandora
Is there any sort of useful interop we can have here? Are there any tools we can build off of PCDM 1.0 to generically work with both these models and do something useful? If we're just an extension, and we have to stick to PCDM 1, then FileSet has to be a pcdm:object. That means the graphs for Islandora's If the answer is no, then I think we need FileSet in some form in the ontology. If it's NOT a required construct, then the rules get more complex, and I would love to see examples of how we can have it be non-required (with graphs and restraints on the predicates defined here, in this ticket) and still talk about one another's models. I think we can all be happy here. |
What if the Hydra representation was:
I think this lets the Object use hasFile to link to the File, so Islandora and Hydra (and everyone else!) can use the existing pattern. But there is an optional overlay on top of that which groups the files and maps neatly to LDP containers, for purposes of having multiple sets of files, such as both an image and a transcription, or a new digitization, etc. |
@escowles Could you use |
@whikloj I think you could use |
Could |
@awead That's the other option: making Though I'm not sure about linking to FileSets from more than one Object. If Files are part of a single Object, and FileSets serve to group those Files, wouldn't the FileSets also be limited to that Object? |
I don't really have a use case for Files/FileSets being attached to more than one Object, but I remember that @scossu made the comment above
Just in case he has a use case he'd like to mention. |
I thought this was the case. |
@whikloj @dannylamb @DiegoPino @bryjbrown One use case we should think about, in terms of how we would do without FileSets, is the good old ETD (Electronic Thesis/Dissertation) that is a PDF and associated datasets. |
I understood the point of a FileSet to be to group together files from the same source bitstream? So PDF plus Datasets is a pcdm:Object, not a FileSet. |
@azaroth42 Cool. That's exactly what I was thinking, but wanted to make sure. I'm just trying to think of other use cases for FileSets from our perspective. |
@ruebot more broadly, the FileSet could contain derivatives from the original source, whether auto-generated or not, such as thumbnails, but also derived technical information such as fits xml, or other derivative-like things: TEI representations, full-text extraction, etc. |
@azaroth42 @ruebot A lot of the research data sets that I've worked with are different "views" (for lack of a better term) of the same raw data. Think different tabs on the same spreadsheet, or a chart image file representing the data in a separate CSV file. Not derivatives in the technical sense, but thematically derivative. Would this be a use case for FileSet, or does it just confuse things? |
@bryjbrown That seems like a reasonable use of a FileSet to me — including, for example, a data file and graphs/visualizations of it. |
A few questions/thoughts based on these last few comments + ideas:
Sorry, I'm coming from a "consistency in modeling is key" ideal here, as I'll have to do quite a bit of batch metadata updates in a few PCDM implementations. *edited to avoid presumption of inverses to these properties. |
Discussion of FileSets has moved on — closing this issue. There is still work going on in the Hydra community about how FileSets should work, and what they represent, and making that compatible with the core ontology. |
The works extension includes a FileSet class that represents an original file and other files derived from it. The Hydra implementation has found this to be a very useful structure, and the key to separating Objects that represent component parts of other Objects from groupings of Files.
Should we add a FileSet class to the core ontology?
See #53 for preliminary discussion.