What is DataBank and what does it do?

kfletch edited this page Mar 14, 2012 · 7 revisions

DataBank is a repository designed to keep data safe over the long term. It can automatically obtain a Digital Object Identifier (DOI) for each data package, and can make the metadata, the underlying data, or both searchable and accessible to the wider world.

The most likely scenario is that an institution or subject-specific repository would run an instance of DataBank, which would archive data for posterity, and allow that data to be made public as appropriate.

Uploading data to DataBank

Data packages can be uploaded using the DataStage web module. Submissions can also be made through the DataBank API from any language that can make HTTP requests, as long as the data is submitted as a .zip file (ideally BagIt- or SWORD-2-compliant).

When uploading a data package, the following settings are available:

  • Public: metadata and data are fully accessible to the outside world from day one.
  • Embargoed: metadata is visible but underlying data files are invisible for a period of time (e.g. embargoed until the related thesis has been judged, or paper has been published).
  • Metadata-only: metadata is visible but the data will never be visible (the file is "embargoed" with no end date). All DataBank submissions are assigned a default 70-year embargo, but the administrator can change this default setting.
  • Dark archive: metadata and data completely invisible; only the person who made the deposit, and the silo manager, can ever find it. NOTE: this cannot be achieved on a DataBank instance in which some metadata or data are public. It requires a separate "dark" DataBank instance.
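
The four settings above can be read as a small access model. The following is a minimal sketch of that reading, not DataBank's own code: the names `Visibility`, `PackageAccess`, and `embargo_until` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    """Hypothetical labels for the four upload settings."""
    PUBLIC = "public"
    EMBARGOED = "embargoed"
    METADATA_ONLY = "metadata-only"
    DARK = "dark"


@dataclass
class PackageAccess:
    visibility: Visibility
    # Only meaningful for EMBARGOED packages (assumed field name).
    embargo_until: Optional[date] = None

    def metadata_visible(self) -> bool:
        # Metadata is hidden only in a dark archive.
        return self.visibility is not Visibility.DARK

    def data_visible(self, today: date) -> bool:
        if self.visibility is Visibility.PUBLIC:
            return True
        if self.visibility is Visibility.EMBARGOED:
            # Data becomes visible once the embargo date has been reached.
            return self.embargo_until is not None and today >= self.embargo_until
        # METADATA_ONLY and DARK never expose the data files.
        return False
```

For example, a package embargoed until a thesis has been examined would report its metadata as visible immediately, and its data as visible only from the embargo date onward.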

By default, all data held in a non-dark instance of DataBank is visible to Google and other web crawlers, unless the administrator adds a robots file asking crawlers not to index the site. Users can make their data easier to find by including richer metadata in the "manifest.rdf" file (the metadata "label" on the data package).
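
As an illustration of the robots file mentioned above, the sketch below uses Python's standard `urllib.robotparser` module to check how a well-behaved crawler would interpret a file that disallows everything. The file contents follow the standard robots.txt convention; the host name `databank.example.org` and the helper `crawler_allowed` are placeholders, not part of DataBank.

```python
from urllib.robotparser import RobotFileParser

# A robots file asking all crawlers to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots file permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; a real instance would use its own host and paths.
    url = "https://databank.example.org/some-silo/some-package"
    print(crawler_allowed(ROBOTS_TXT, "Googlebot", url))
```

Note that a robots file is only a request: it keeps compliant crawlers out, but it is not an access control and does not replace the embargo or dark-archive settings.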

File structure

DataBank works on the basis of "collections" (also known as "silos") that function as virtual administrative groups. Each silo has a set of users who can read and write files in the silo, and an administrator to manage it (including responsibility for payment, if there is a charge for storage space). Ordinarily, each research group would have its own silo; individuals can be members of multiple silos, and silos can have multiple administrators. User authentication is not yet compatible with single sign-on protocols, but we hope to support them by the end of the DataFlow project.

Note: the name of the silo cannot be changed. Ideally, silo names should be related to the research topic, rather than linked to an individual's name (e.g. "Image Bioinformatics Research Group", rather than "Shotton Research Group").
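
The silo arrangement described above can be pictured as a simple data structure. This is a rough sketch for illustration only: the `Silo` class and its field names are hypothetical, and the assumption that administrators can also read and write files is ours, not something stated here.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Silo:
    # The silo name is treated as fixed once the silo exists.
    name: str
    administrators: Set[str] = field(default_factory=set)
    users: Set[str] = field(default_factory=set)

    def can_read_write(self, username: str) -> bool:
        # Assumption: administrators have the same file access as users.
        return username in self.users or username in self.administrators

    def can_manage(self, username: str) -> bool:
        return username in self.administrators


# One person can belong to several silos, and a silo can have several administrators.
imaging = Silo("image-bioinformatics", administrators={"alice", "carol"}, users={"bob"})
genomics = Silo("genomics", administrators={"dave"}, users={"bob", "alice"})
```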

Within a silo, DataBank stores "data packages." Data packages are zipped files (compatible with BagIt and SWORD-2 standards), with an RDF-format "label" to indicate what is inside the package. To look inside and see the individual data files, and their particular metadata, the data package must be unzipped. This unzipping is done within DataBank.
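
To make the package layout concrete, here is a minimal sketch that builds a zipped data package in memory with a "manifest.rdf" label alongside the data files, using only Python's standard library. The RDF content (a single Dublin Core title) and the helpers `build_package` and `list_package` are illustrative assumptions; the metadata fields DataBank actually expects are not specified here, and this sketch does not produce a full BagIt bag.

```python
import io
import zipfile
from typing import Dict, List

# Illustrative label only: one Dublin Core title in RDF/XML.
MANIFEST_RDF = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="">
    <dcterms:title>Example data package</dcterms:title>
  </rdf:Description>
</rdf:RDF>
"""


def build_package(files: Dict[str, bytes], manifest: str = MANIFEST_RDF) -> bytes:
    """Zip the manifest and the data files, returning the archive as bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.rdf", manifest)
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def list_package(package: bytes) -> List[str]:
    """Look inside a zipped package and return the names of its members."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return archive.namelist()
```

Calling `list_package(build_package({"data/results.csv": b"a,b\n1,2\n"}))` lists the manifest followed by the data file, which mirrors the "unzip to see the individual files" step described above.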

Each DataStage instance submits its data packages to a single DataBank silo. The location (name) of that silo is part of the metadata "label" that is passed to DataBank when the data package is submitted.

What services are available for prospective users?

At present, the DataFlow project is developing software, with no immediate plans to offer a service based on the software. Users are free to install, configure, and further develop the software as they wish, on their own hardware or on commercially provided cloud resources.