Add storing of dataset related files to the File Abstraction layer #3919
Comments
To work on this issue, one will need a Swift account on a Swift service. To apply for one on Mass Open Cloud, you can fill out this form: https://docs.google.com/forms/d/e/1FAIpQLScM0jWjAFOWXr4ZY8FcRIVyeiAbtBaZFB5-suKqpJj0WJExOQ/viewform
In its current form, our storage abstraction mechanism, DataFileIO, deals exclusively with DataFiles (as the name suggests). But it should be fairly easy to extend it to handle dataset-level file storage as well. This is how I would approach it:

For files, the container location (the directory for local files, the endpoint for Swift) is derived from the global id of the dataset, so this part can be used with datasets as is. DataFiles also have a "storage identifier" (aka filename). The combination of the container and the filename is the unique physical location of the file. Datasets don't have "file names", but we should be able to get by without them.

DataFileIO has a concept of an "auxiliary file". For a DataFile, DataFileIO provides a "data access object" with a single stream of bytes, the actual data payload of the file, PLUS any number of extra, "auxiliary" files. For a local file, these aux files are basically extra files with the same filename plus an added extension (the "auxiliary tag"). If the main tabular data file is "foofile", the ingested original is saved as "foofile.orig", a cached copy of the file in R format is saved as "foofile.RData", etc. If it's an image, thumbnails of various sizes are saved as "foofile.thumb48", "foofile.thumb64", etc.

A DataAccess object for a Dataset could be a special case: unlike one for a DataFile, it would not provide the primary byte stream (for example, an attempt to open the main stream for read or write should throw an exception), but it should allow reading and writing any number of these auxiliary files. So a cached DDI metadata export for a Dataset will be treated as an aux file with the extension tag "ddi_export"; an uploaded dataset logo is an aux file with the tag "logo", etc. We can do this without the main storage identifier/filename, because there is only one of these per dataset container.

Some tweaks to the driver layer will be necessary to achieve this. As currently implemented, the local filesystem and Swift drivers will (I believe) throw an exception if you try to open an aux file when the main file is missing, so there will need to be some special logic for allowing that in a DataAccess object that's associated with a Dataset. Still, I believe it's all very doable. Also, as far as I remember, the metadata exports and the dataset logos are the only Dataset-level files as of now. So there is not much code to change in order to switch to reading and writing these files through DataFileIO.
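To make the proposal above a bit more concrete, here is a minimal sketch of what a dataset-level access object for the local filesystem might look like. Everything in it is an assumption made for illustration: the class name `DatasetAuxStorage`, its methods, the `authority/identifier` directory layout, and the choice to name each aux file by its tag alone are all hypothetical and are not the actual DataFileIO API. It only shows the two behaviors described: the main stream is refused, and aux files can be written and read even though no main file exists.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch only: none of these names come from the real codebase.
class DatasetAuxStorage {
    private final Path container;

    // Assumption: the container directory is derived from the dataset's
    // global id, here split into an authority and an identifier.
    DatasetAuxStorage(Path storageRoot, String authority, String identifier) {
        this.container = storageRoot.resolve(authority).resolve(identifier);
    }

    // A dataset has no primary byte stream, so opening it is always an error.
    InputStream openMainStream() {
        throw new UnsupportedOperationException(
                "Datasets have no main data stream; use auxiliary files");
    }

    // With no storage identifier to prefix, the aux tag alone names the file.
    Path auxPath(String auxTag) {
        if (auxTag == null || auxTag.startsWith(".")
                || !auxTag.matches("[A-Za-z0-9_.-]+")) {
            throw new IllegalArgumentException("Invalid aux tag: " + auxTag);
        }
        return container.resolve(auxTag);
    }

    // Writing an aux file must not require a main file to exist.
    void writeAux(String auxTag, byte[] content) throws IOException {
        Files.createDirectories(container);
        Files.write(auxPath(auxTag), content);
    }

    byte[] readAux(String auxTag) throws IOException {
        return Files.readAllBytes(auxPath(auxTag));
    }

    boolean auxExists(String auxTag) {
        return Files.exists(auxPath(auxTag));
    }
}
```

With something like this, a cached DDI export would be written with `writeAux("ddi_export", bytes)` and a logo with `writeAux("logo", bytes)`. A Swift driver would need the equivalent relaxation, mapping the container to an endpoint instead of a directory.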
Re: Swift - Kaizen (or a Swift account elsewhere) will be necessary to test the Swift driver implementation, yes.
- Need to delete the file upgrade_v4.7.1_to_v4.7.2.sql, since it was renamed (actually copied) to upgrade_v4.7.1_to_v4.8.sql.
We now have the ability to store Data Files either locally or on Swift (with others soon to come). However, dataset-related files (exports of DDI, JSON, etc.) are still always stored locally.
We need to support storing these dataset-related files on these alternative systems.