merged some conflicts from develop #3921
rbhatta99 committed Aug 18, 2017
1 parent b960a07 commit 00226fd
Showing 8 changed files with 445 additions and 0 deletions.
@@ -0,0 +1,8 @@
"factoryData":"type: orcid | userEndpoint:{ORCID}/orcid-profile | clientId: FIXME | clientSecret: FIXME",
@@ -0,0 +1,5 @@
{
  "status": "validation passed",
  "uploadFolder": "DNXV2H",
  "totalSize": 1234567890
}
64 changes: 64 additions & 0 deletions doc/sphinx-guides/source/developers/big-data-support.rst
@@ -0,0 +1,64 @@
Big Data Support
================

Big data support is highly experimental. Eventually this content will move to the Installation Guide.

.. contents:: |toctitle|

Various components need to be installed and configured for big data support.

Data Capture Module (DCM)
-------------------------

Data Capture Module (DCM) is an experimental component that allows users to upload large datasets via rsync over ssh.

Install a DCM
~~~~~~~~~~~~~

Installation instructions can be found at . Note that a shared filesystem between Dataverse and your DCM is required. You cannot use a DCM with non-filesystem storage options such as Swift.

Once you have installed a DCM, you will need to configure two database settings on the Dataverse side. These settings are documented in the :doc:`/installation/config` section of the Installation Guide:

- ``:DataCaptureModuleUrl`` should be set to the URL of a DCM you installed.
- ``:UploadMethods`` should be set to ``dcm/rsync+ssh``.

This will allow your Dataverse installation to communicate with your DCM, so that Dataverse can download rsync scripts for your users.

Downloading rsync scripts via Dataverse API

The rsync script can be downloaded from Dataverse via API using an authorized API token. In the curl example below, replace ``$PERSISTENT_ID`` with a DOI or Handle. The URL is quoted so that the shell does not interpret the ``?`` character:

``curl -H "X-Dataverse-key: $API_TOKEN" "$DV_BASE_URL/api/datasets/:persistentId/dataCaptureModule/rsync?persistentId=$PERSISTENT_ID"``
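
For clients that are scripted rather than run by hand, the same request can be built without curl. The following is a minimal sketch in Python using only the standard library. The helper name ``build_rsync_script_request`` and the example base URL, token, and DOI are hypothetical and are not part of Dataverse; only the endpoint path and the ``X-Dataverse-key`` header come from the curl example.

```python
import urllib.parse
import urllib.request


def build_rsync_script_request(base_url, api_token, persistent_id):
    """Build (but do not send) the GET request shown in the curl example."""
    # urlencode percent-encodes the DOI or Handle so it is safe in a query string.
    query = urllib.parse.urlencode({"persistentId": persistent_id})
    url = (base_url.rstrip("/")
           + "/api/datasets/:persistentId/dataCaptureModule/rsync?" + query)
    return urllib.request.Request(url, headers={"X-Dataverse-key": api_token})


# Hypothetical values, for illustration only.
request = build_rsync_script_request(
    "https://dataverse.example.edu", "xxxxxxxx-xxxx", "doi:10.5072/FK2/DNXV2H")
print(request.full_url)
# To actually download the script, pass the request to urllib.request.urlopen().
```

Building the request separately from sending it keeps the URL construction easy to inspect before any network call is made.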

How a DCM reports checksum success or failure to Dataverse
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once the user uploads files to a DCM, that DCM will perform checksum validation and report to Dataverse the results of that validation. The DCM must be configured to pass the API token of a superuser. The implementation details, which are subject to change, are below.

The JSON that a DCM sends to Dataverse on successful checksum validation looks something like the contents of :download:`checksumValidationSuccess.json <../_static/installation/files/root/big-data-support/checksumValidationSuccess.json>` below:

.. literalinclude:: ../_static/installation/files/root/big-data-support/checksumValidationSuccess.json
   :language: json

- ``status`` - The valid strings to send are ``validation passed`` and ``validation failed``.
- ``uploadFolder`` - This is the directory on disk where Dataverse should attempt to find the files that a DCM has moved into place. There should always be a ``files.sha`` file and at least one data file. ``files.sha`` is a manifest of all the data files and their checksums. The ``uploadFolder`` directory is inside the directory where data is stored for the dataset and may have the same name as the "identifier" of the persistent id (DOI or Handle). For example, you would send ``"uploadFolder": "DNXV2H"`` in the JSON file when the absolute path to this directory is ``/usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/DNXV2H/DNXV2H``.
- ``totalSize`` - Dataverse will use this value to represent the total size in bytes of all the files in the "package" that's created. If 360 data files and one ``files.sha`` manifest file are in the ``uploadFolder``, this value is the sum of the sizes of the 360 data files. The manifest file is not counted.
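
The three fields above can be assembled from an upload directory as in the following Python sketch. This is an illustration under stated assumptions, not the DCM's actual implementation: the function name ``build_checksum_report`` is hypothetical, and the sketch assumes the manifest sits at the top level of the upload folder and that every other regular file under it is a data file.

```python
import json
import os
import tempfile


def build_checksum_report(upload_dir, passed=True):
    """Return a dict with the status, uploadFolder, and totalSize fields."""
    total = 0
    for root, _dirs, files in os.walk(upload_dir):
        for name in files:
            # The top-level files.sha manifest is not counted toward totalSize.
            if root == upload_dir and name == "files.sha":
                continue
            total += os.path.getsize(os.path.join(root, name))
    return {
        "status": "validation passed" if passed else "validation failed",
        # Only the folder name is sent, not the absolute path.
        "uploadFolder": os.path.basename(os.path.normpath(upload_dir)),
        "totalSize": total,
    }


# Demonstration with a throwaway directory standing in for an upload folder.
with tempfile.TemporaryDirectory() as parent:
    folder = os.path.join(parent, "DNXV2H")
    os.mkdir(folder)
    with open(os.path.join(folder, "data1.bin"), "wb") as f:
        f.write(b"x" * 10)
    with open(os.path.join(folder, "data2.bin"), "wb") as f:
        f.write(b"y" * 5)
    with open(os.path.join(folder, "files.sha"), "w") as f:
        f.write("manifest placeholder\n")
    report = build_checksum_report(folder)

print(json.dumps(report, indent=2))
```

In the demonstration the two data files contribute 10 and 5 bytes, so ``totalSize`` is 15 regardless of how large the manifest is.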

Here's the syntax for sending the JSON.

``curl -H "X-Dataverse-key: $API_TOKEN" -X POST -H 'Content-type: application/json' --upload-file checksumValidationSuccess.json "$DV_BASE_URL/api/datasets/:persistentId/dataCaptureModule/checksumValidation?persistentId=$PERSISTENT_ID"``
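
A component that reports validation results programmatically could build the equivalent POST request as sketched below in Python. The function name ``build_checksum_validation_request`` and the example values are hypothetical. The endpoint path, headers, and JSON fields are taken from the curl command and the sample JSON file, and the request is only constructed here, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_checksum_validation_request(base_url, api_token, persistent_id, report):
    """Build (but do not send) the POST request that reports validation results."""
    query = urllib.parse.urlencode({"persistentId": persistent_id})
    url = (base_url.rstrip("/")
           + "/api/datasets/:persistentId/dataCaptureModule/checksumValidation?"
           + query)
    return urllib.request.Request(
        url,
        data=json.dumps(report).encode("utf-8"),  # JSON body, as in the sample file
        headers={
            "X-Dataverse-key": api_token,  # must be a superuser's API token
            "Content-type": "application/json",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
report = {"status": "validation passed", "uploadFolder": "DNXV2H", "totalSize": 1234567890}
request = build_checksum_validation_request(
    "https://dataverse.example.edu", "xxxxxxxx-xxxx", "doi:10.5072/FK2/DNXV2H", report)
print(request.get_method(), request.full_url)
# To send the report, pass the request to urllib.request.urlopen().
```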


The following low-level command should only be used when troubleshooting the "import" code that a DCM uses, but it is documented here for completeness.

``curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$DV_BASE_URL/api/batch/jobs/import/datasets/files/$DATASET_DB_ID?uploadFolder=$UPLOAD_FOLDER&totalSize=$TOTAL_SIZE"``

Repository Storage Abstraction Layer (RSAL)
-------------------------------------------

For now, please see
82 changes: 82 additions & 0 deletions scripts/search/tests/data/dataset-finch2.json
@@ -0,0 +1,82 @@
{
  "datasetVersion": {
    "metadataBlocks": {
      "citation": {
        "fields": [
          {
            "value": "HTML & More",
            "typeClass": "primitive",
            "multiple": false,
            "typeName": "title"
          },
          {
            "value": [
              {
                "authorName": {
                  "value": "Markup, Marty",
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "authorName"
                },
                "authorAffiliation": {
                  "value": "W4C",
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "authorAffiliation"
                }
              }
            ],
            "typeClass": "compound",
            "multiple": true,
            "typeName": "author"
          },
          {
            "value": [
              {
                "datasetContactEmail": {
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "datasetContactEmail",
                  "value": "[email protected]"
                },
                "datasetContactName": {
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "datasetContactName",
                  "value": "Markup, Marty"
                }
              }
            ],
            "typeClass": "compound",
            "multiple": true,
            "typeName": "datasetContact"
          },
          {
            "value": [
              {
                "dsDescriptionValue": {
                  "value": "BEGIN<br></br>END",
                  "multiple": false,
                  "typeClass": "primitive",
                  "typeName": "dsDescriptionValue"
                }
              }
            ],
            "typeClass": "compound",
            "multiple": true,
            "typeName": "dsDescription"
          },
          {
            "value": [
              "Medicine, Health and Life Sciences"
            ],
            "typeClass": "controlledVocabulary",
            "multiple": true,
            "typeName": "subject"
          }
        ],
        "displayName": "Citation Metadata"
      }
    }
  }
}
@@ -0,0 +1,164 @@

import static;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.batch.operations.JobOperator;
import javax.batch.operations.JobSecurityException;
import javax.batch.operations.JobStartException;
import javax.batch.runtime.BatchRuntime;
import javax.json.JsonObject;
import javax.json.JsonObjectBuilder;

public class ImportFromFileSystemCommand extends AbstractCommand<JsonObject> {

    private static final Logger logger = Logger.getLogger(ImportFromFileSystemCommand.class.getName());

    final Dataset dataset;
    final String uploadFolder;
    final Long totalSize;
    final String mode;
    final ImportMode importMode;

    public ImportFromFileSystemCommand(DataverseRequest aRequest, Dataset theDataset, String theUploadFolder, Long theTotalSize, ImportMode theImportMode) {
        super(aRequest, theDataset);
        dataset = theDataset;
        uploadFolder = theUploadFolder;
        totalSize = theTotalSize;
        importMode = theImportMode;
        mode = theImportMode.toString();
    }

    public JsonObject execute(CommandContext ctxt) throws CommandException {
        JsonObjectBuilder bld = jsonObjectBuilder();
        /*
         * batch import as-individual-datafiles is disabled in this iteration;
         * only the import-as-a-package is allowed. -- L.A. Feb 2 2017
         */
        String fileMode = FileRecordWriter.FILE_MODE_PACKAGE_FILE;
        try {
            /*
             * Current constraints: 1. only supports merge and replace mode 2.
             * valid dataset 3. valid dataset directory 4. valid user & user has
             * edit dataset permission 5. only one dataset version 6. dataset
             * version is draft
             */
            if (!mode.equalsIgnoreCase("MERGE") && !mode.equalsIgnoreCase("REPLACE")) {
                String error = "Import mode: " + mode + " is not currently supported.";
                throw new IllegalCommandException(error, this);
            }
            if (!fileMode.equals(FileRecordWriter.FILE_MODE_INDIVIDUAL_FILES) && !fileMode.equals(FileRecordWriter.FILE_MODE_PACKAGE_FILE)) {
                String error = "File import mode: " + fileMode + " is not supported.";
                throw new IllegalCommandException(error, this);
            }
            // The dataset directory is <files root>/<authority>/<identifier>.
            File directory = new File(System.getProperty("")
                    + File.separator + dataset.getAuthority() + File.separator + dataset.getIdentifier());
            if (!isValidDirectory(directory)) {
                String error = "Dataset directory is invalid. " + directory;
                throw new IllegalCommandException(error, this);
            }

            if (Strings.isNullOrEmpty(uploadFolder)) {
                String error = "No uploadFolder specified";
                throw new IllegalCommandException(error, this);
            }

            // The upload folder lives inside the dataset directory.
            File uploadDirectory = new File(System.getProperty("")
                    + File.separator + dataset.getAuthority() + File.separator + dataset.getIdentifier()
                    + File.separator + uploadFolder);
            if (!isValidDirectory(uploadDirectory)) {
                String error = "Upload folder is not a valid directory.";
                throw new IllegalCommandException(error, this);
            }

            if (dataset.getVersions().size() != 1) {
                String error = "Error creating FilesystemImportJob with dataset with ID: " + dataset.getId() + " - Dataset has more than one version.";
                throw new IllegalCommandException(error, this);
            }

            if (dataset.getLatestVersion().getVersionState() != DatasetVersion.VersionState.DRAFT) {
                String error = "Error creating FilesystemImportJob with dataset with ID: " + dataset.getId() + " - Dataset isn't in DRAFT mode.";
                throw new IllegalCommandException(error, this);
            }

            try {
                long jid;
                // Job parameters handed to the batch runtime.
                Properties props = new Properties();
                props.setProperty("datasetId", dataset.getId().toString());
                props.setProperty("userId", getUser().getIdentifier().replace("@", ""));
                props.setProperty("mode", mode);
                props.setProperty("fileMode", fileMode);
                props.setProperty("uploadFolder", uploadFolder);
                if (totalSize != null && totalSize > 0) {
                    props.setProperty("totalSize", totalSize.toString());
                }
                JobOperator jo = BatchRuntime.getJobOperator();
                jid = jo.start("FileSystemImportJob", props);
                if (jid > 0) {
                    bld.add("executionId", jid).add("message", "FileSystemImportJob in progress");
                } else {
                    String error = "Error creating FilesystemImportJob with dataset with ID: " + dataset.getId();
                    throw new CommandException(error, this);
                }

            } catch (JobStartException | JobSecurityException ex) {
                String error = "Error creating FilesystemImportJob with dataset with ID: " + dataset.getId() + " - " + ex.getMessage();
                throw new IllegalCommandException(error, this);
            }

        } catch (Exception e) {
            bld.add("message", "Import Exception - " + e.getMessage());
        }
        // Return what was accumulated above: the execution id on success, or the error message.
        return bld.build();
    }

    /**
     * Make sure the directory path is truly a directory, exists and we can read
     * it.
     * @return isValid
     */
    private boolean isValidDirectory(File directory) {
        String path = directory.getAbsolutePath();
        if (!directory.exists()) {
            logger.log(Level.SEVERE, "Directory " + path + " does not exist.");
            return false;
        }
        if (!directory.isDirectory()) {
            logger.log(Level.SEVERE, path + " is not a directory.");
            return false;
        }
        if (!directory.canRead()) {
            logger.log(Level.SEVERE, "Unable to read files from directory " + path + ". Permission denied.");
            return false;
        }
        return true;
    }

}

