Add storing of dataset related files to the File Abstraction layer #3919

Closed
scolapasta opened this issue Jun 19, 2017 · 5 comments

@scolapasta
Contributor

We now have the ability to store Data Files either locally or on Swift (and others soon to come). However, dataset-related files (exports of DDI, JSON, etc.) are still always stored locally.

We need to support storing these dataset-related files on these alternative systems.

@djbrooke
Contributor

djbrooke commented Jun 21, 2017

  • Exports, thumbnails
  • Consider taking into account cost in the future

@pdurbin
Member

pdurbin commented Jun 27, 2017

To work on this issue, one will need a Swift account on a Swift service. To apply for one on Mass Open Cloud, you can fill out this form: https://docs.google.com/forms/d/e/1FAIpQLScM0jWjAFOWXr4ZY8FcRIVyeiAbtBaZFB5-suKqpJj0WJExOQ/viewform

@landreev
Contributor

In its current form, our storage abstraction mechanism - DataFileIO - deals exclusively with DataFiles (as the name suggests). But it should be fairly easy to extend it to handle dataset-level file storage as well. This is how I would approach it:

For the files, the container location - the directory for local files, the end point for swift - is derived from the global id of the dataset. So this part can be used with datasets as is.

DataFiles also have the "storage identifier" (aka filename). The combination of the container and the filename is the unique physical location of the file. Datasets don't have "file names". But we should be able to get by without them.

The DataFileIO has a concept of an "auxiliary file". For a DataFile, DataFileIO provides a "data access object" with a single stream of bytes - the actual data payload of the file - PLUS any number of extra, "auxiliary" files. For a local file these aux files are basically extra files with the same filename plus an added extension (the "auxiliary tag"). If the main tabular data file is "foofile", the ingested original is saved as "foofile.orig", a cached copy of the file in R format is saved as "foofile.RData", etc. If it's an image, thumbnails of various sizes are saved as "foofile.thumb48", "foofile.thumb64", etc.

A DataAccess object for a Dataset could be a special case: unlike one for a DataFile, it would not provide the primary byte stream (for example, an attempt to open the main stream for read or write should throw an exception), but it should allow reading and writing any number of these auxiliary files. So a cached DDI metadata export for a Dataset will be treated as an aux file with the extension tag "ddi_export"; an uploaded dataset logo is an aux file with the tag "logo", etc. We can do this without the main storage identifier/filename, because there is only one of these per dataset container.
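To make the idea above concrete, here is a rough sketch of how the two kinds of access objects could differ. Everything in it is hypothetical: the class names (StorageAccessSketch, FileAccessSketch, DatasetAccessSketch), the method names, and the choice to name dataset-level aux files by the tag alone are assumptions for illustration, not the actual DataFileIO code. It only covers the local filesystem case.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch only: none of these classes are the real DataFileIO API.
class AuxFileDemo {
    public static void main(String[] args) throws IOException {
        Path container = Files.createTempDirectory("dataset-container");
        DatasetAccessSketch dataset = new DatasetAccessSketch(container);
        // A cached DDI export is just an aux file with the tag "ddi_export".
        dataset.writeAux("ddi_export", "<codeBook/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(dataset.readAux("ddi_export"), StandardCharsets.UTF_8));
    }
}

abstract class StorageAccessSketch {
    // Per-dataset directory; assumed to be derived from the dataset global id.
    protected final Path container;

    StorageAccessSketch(Path container) {
        this.container = container;
    }

    // Physical location of the aux file identified by the given tag.
    abstract Path auxPath(String tag);

    // The primary byte stream, read fully into memory for brevity.
    abstract byte[] readMain() throws IOException;

    void writeAux(String tag, byte[] bytes) throws IOException {
        Files.createDirectories(container);
        Files.write(auxPath(tag), bytes);
    }

    byte[] readAux(String tag) throws IOException {
        return Files.readAllBytes(auxPath(tag));
    }
}

// File-level access: aux files are named "<storage identifier>.<tag>".
final class FileAccessSketch extends StorageAccessSketch {
    private final String storageIdentifier;

    FileAccessSketch(Path container, String storageIdentifier) {
        super(container);
        this.storageIdentifier = storageIdentifier;
    }

    @Override
    Path auxPath(String tag) {
        return container.resolve(storageIdentifier + "." + tag);
    }

    @Override
    byte[] readMain() throws IOException {
        return Files.readAllBytes(container.resolve(storageIdentifier));
    }
}

// Dataset-level access: no main file; aux files are named by tag alone (assumption).
final class DatasetAccessSketch extends StorageAccessSketch {
    DatasetAccessSketch(Path container) {
        super(container);
    }

    @Override
    Path auxPath(String tag) {
        return container.resolve(tag);
    }

    @Override
    byte[] readMain() {
        throw new UnsupportedOperationException("A dataset has no primary byte stream");
    }
}
```

With this shape, callers that only need exports or logos never touch the main stream, and any code that accidentally tries to read it from a dataset-level object fails loudly.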

Some tweaks to the driver layer will be necessary to achieve this. As currently implemented, the local filesystem and the swift driver will (I believe) throw an exception if you try to open an aux file when the main file is missing; so there will need to be some special logic for allowing that when in a DataAccess object that's associated with a Dataset.
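The special logic could look roughly like the following. This is a sketch under assumptions, not the existing driver code: LocalDriverSketch, forDataFile, forDataset and resolveAux are made-up names, and using a null main file name to mark a dataset-level object is just one possible way to carry that flag.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;

// Hypothetical sketch only: not the real local filesystem or Swift driver.
final class LocalDriverSketch {
    private final Path container;
    private final String mainFileName; // null marks a dataset-level object (assumption)

    private LocalDriverSketch(Path container, String mainFileName) {
        this.container = container;
        this.mainFileName = mainFileName;
    }

    static LocalDriverSketch forDataFile(Path container, String storageIdentifier) {
        return new LocalDriverSketch(container, Objects.requireNonNull(storageIdentifier));
    }

    static LocalDriverSketch forDataset(Path container) {
        return new LocalDriverSketch(container, null);
    }

    boolean isDatasetLevel() {
        return mainFileName == null;
    }

    // Resolve an aux file location; the "main file must exist" check
    // is skipped only for dataset-level objects.
    Path resolveAux(String tag) throws FileNotFoundException {
        if (isDatasetLevel()) {
            return container.resolve(tag);
        }
        if (!Files.exists(container.resolve(mainFileName))) {
            throw new FileNotFoundException("Main file is missing: " + mainFileName);
        }
        return container.resolve(mainFileName + "." + tag);
    }

    public static void main(String[] args) throws IOException {
        Path container = Files.createTempDirectory("driver-sketch");
        // Dataset-level: allowed even though no main file exists.
        System.out.println(forDataset(container).resolveAux("logo"));
        try {
            // File-level: still rejected while "foofile" is missing.
            forDataFile(container, "foofile").resolveAux("thumb48");
        } catch (FileNotFoundException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Keeping the existing check for DataFiles and relaxing it only for dataset-level objects means the current behavior for file-level aux files would not change.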

Still, I believe it's all very doable. Also, as far as I remember, the metadata exports and the dataset logos are the only Dataset-level files, as of now. So, not much code to change there, in order to switch to reading and writing these files through DataFileIO.

@landreev
Contributor

Re: swift - Kaizen (or a swift account elsewhere) will be necessary to test the swift driver implementation, yes.
But you don't need it to start working on this issue, because you will probably want to start by changing the top-level abstract class and the local filesystem driver implementation.

@ferrys ferrys self-assigned this Jul 7, 2017
@bsilverstein95 bsilverstein95 self-assigned this Jul 7, 2017
@rbhatta99 rbhatta99 self-assigned this Jul 13, 2017
rbhatta99 added a commit that referenced this issue Jul 17, 2017
rbhatta99 added a commit that referenced this issue Jul 18, 2017
@kcondon
Contributor

kcondon commented Aug 4, 2017

- Need to delete the file upgrade_v4.7.1_to_v4.7.2.sql, since it was renamed (actually copied) to upgrade_v4.7.1_to_v4.8.sql. (Resolved)
- Not able to add files: nothing appears in the list after choosing a file for upload, and no errors are shown. (Resolved; caching issue on browser and server.)
- Remove any locally created /tmp files whenever possible, e.g. export, etc.
