CSV Importer
Bulkrax can import from a CSV file that meets the following guidelines.
- The CSV MUST have a header row.
- The header row MUST include a field representing the `source_identifier`, containing a unique identifier for each item (see below for more detail).
- The CSV MUST have a `title` column.
- Every work MUST have a value in both the field representing the `source_identifier` and the `title` column.
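The guidelines above can be checked before an import is attempted. The following is a minimal sketch using Ruby's standard `csv` library; `csv_guideline_errors` and `REQUIRED_HEADERS` are hypothetical names, not part of Bulkrax, and the sketch assumes the identifier column is literally named `source_identifier` (your field mapping may use a different header).

```ruby
require 'csv'

# Hypothetical defaults: adjust if your field mapping uses another header
# for the source identifier.
REQUIRED_HEADERS = %w[source_identifier title].freeze

# Returns an array of human-readable problems; an empty array means the CSV
# text satisfies the guidelines checked here.
def csv_guideline_errors(csv_text, required: REQUIRED_HEADERS)
  table = CSV.parse(csv_text, headers: true)
  errors = []

  # Required columns must appear in the header row.
  missing = required - table.headers.compact
  missing.each { |name| errors << "missing header: #{name}" }

  # Every row needs a non-blank value in each required column that exists.
  # Line numbers assume one record per line (no multi-line cells).
  table.each.with_index(2) do |row, line|
    (required - missing).each do |name|
      errors << "line #{line}: blank #{name}" if row[name].to_s.strip.empty?
    end
  end

  errors
end
```

Calling `csv_guideline_errors(File.read(path))` on a candidate file gives a list that can be shown to whoever prepared the CSV.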
In Bulkrax, the `source_identifier` must exist. This field stores the import identifier for the Work or Collection. When the import runs, it checks whether a Work or Collection with that identifier already exists. If one does, the existing item is updated; if not, a new item is created.
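The update-or-create behaviour described above can be pictured with a small sketch. It uses a plain Hash as a stand-in for the repository; `import_record` is a hypothetical name, and this is not how Bulkrax looks up records internally.

```ruby
# store: Hash mapping source identifiers to attribute hashes. This is an
# in-memory stand-in for the repository, used only for illustration.
def import_record(store, source_identifier, attributes)
  if store.key?(source_identifier)
    # An item with this identifier already exists, so update it in place.
    store[source_identifier].merge!(attributes)
    :updated
  else
    # No match, so create a new item keyed by the identifier.
    store[source_identifier] = attributes.dup
    :created
  end
end
```

Running the same CSV twice therefore updates the items created on the first run rather than duplicating them, which is why the identifier has to be unique and stable.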
There are two ways to set the field:
- Use an existing Hyrax field

      Bulkrax.setup do |config|
        # Use the doi field (note: doi must be available on all works and collections).
        config.field_mappings['Bulkrax::CsvParser'] = {
          'doi' => { from: ['doi'], source_identifier: true }
        }
      end

- Allow Bulkrax to create the `source_identifier`

  If there isn't a field that's available and unique across all Works and Collections, Bulkrax can make a custom field. For example, this can be configured in the local application as follows:

      config.fill_in_blank_source_identifiers = ->(obj, index) { "#{Site.instance.account.name}-#{obj.importerexporter.id}-#{index}" }
      config.field_mappings['Bulkrax::CsvParser'] = {
        'bulkrax_identifier' => { from: ['bulkrax_identifier'], source_identifier: true }
      }

  You will also need to add the following to "app/indexers/shared_indexer" in your local app:

      solr_doc[Solrizer.solr_name('bulkrax_identifier', :facetable)] = object.bulkrax_identifier
A column will be imported if its name matches an existing metadata property in Hyrax, such as title or creator.
In addition, the following columns will be imported:
- collection or collection_#
- file or file_#
- file_url or file_url_#
- remote_files
- model
The default way to handle a field with multiple values is to use numbered headers:
creator_first_name_1 | creator_last_name_1 | creator_position_1 | creator_first_name_2 |
---|---|---|---|
Aaliyah | Haughton | Queen | Ruth |
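To picture how numbered headers relate to a multi-valued field, here is a minimal sketch that strips the numeric suffix and collects the values under the base name. `group_numbered_headers` is a hypothetical helper, not part of Bulkrax, and it does not reproduce how Bulkrax builds nested attributes internally.

```ruby
# row: Hash of header => cell value for a single CSV row.
# Returns a Hash of base header => array of non-blank values, in the order
# the columns appear in the row.
def group_numbered_headers(row)
  row.each_with_object({}) do |(header, value), fields|
    next if value.nil? || value.strip.empty?

    # "creator_first_name_2" becomes "creator_first_name".
    base = header.sub(/_\d+\z/, '')
    (fields[base] ||= []) << value.strip
  end
end
```

For the row above, `creator_first_name` would collect both "Aaliyah" and "Ruth", while `creator_last_name` and `creator_position` would each hold one value.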
If your CSV instead has a single header holding multiple values:
- the "join" property must be set in the field mapping
- the values must be separated by a semi-colon (;) or pipe (|)
creator_first_name | creator_last_name | creator_position |
---|---|---|
Aaliyah; Ruth | Haughton | Queen |
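The splitting of a single cell on semi-colons or pipes can be sketched as follows. `split_multi_value` and its default pattern are assumptions made for illustration; the delimiter Bulkrax actually applies is controlled by the field mapping described in the Configuration guide.

```ruby
# Splits a cell on ";" or "|", trimming whitespace and dropping empty parts.
# The default pattern is an assumption for this sketch.
def split_multi_value(cell, pattern = /\s*[;|]\s*/)
  cell.to_s.split(pattern).map(&:strip).reject(&:empty?)
end
```

With this helper, the `creator_first_name` cell in the table above yields two values, and the single-valued cells yield one each.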
A column headed `collection` will be used to define which collection imported works should be added to. However, the collection will need to already exist in the app.
Multiple collections can be supplied.
If the value provided matches a value found in the `system_identifier_field` of an existing collection, then works will be added to that collection. If not, a new collection will be created and both title and `system_identifier_field` will be set to the value supplied in the collection column.
For example:
| source_identifier | title | collection |
|---|---|---|
| imported_work_1 | Work One | Collection One |
| imported_work_2 | Work Two | Collection One; Collection Two |
In the first row (after the header), the Work being imported will be added to Collection One, and in the second, to both Collection One and Collection Two.
If either of those already exists, then the existing collection is used. If not, a new one is created.
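The matching described above (reuse a collection when the identifier is known, otherwise create one) can be sketched with a Hash standing in for the existing collections. `resolve_collections` is a hypothetical helper, and the sketch assumes semi-colons separate multiple collections, as in the example table.

```ruby
# cell: the raw value of the collection column for one row.
# collections: Hash of system identifier => collection attributes, an
# in-memory stand-in for the collections already in the app.
# Returns [identifier, :existing or :created] pairs in cell order.
def resolve_collections(cell, collections)
  cell.to_s.split(';').map(&:strip).reject(&:empty?).map do |value|
    status = collections.key?(value) ? :existing : :created
    # A new collection gets the supplied value as both title and identifier.
    collections[value] ||= { title: value, system_identifier: value }
    [value, status]
  end
end
```

Because the Hash is updated as rows are processed, a collection created by one row is treated as existing by later rows.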
The model column is used to determine the work type. It is not required. In its absence, either the field mapping or default_work_type will be used. Read more about these in the Configuration guide.
Files will be imported from a column called `file_#`, `file_url_#` or `remote_files` if they are present.
The `file_#` columns will each contain a single filename (these must be unique). Multiple files can be imported by using additional numbered headers.
The `file_url_#` columns will each contain a single URL to a file which will be downloaded and imported (these must be unique). Multiple files can be imported by using additional numbered headers.
The `remote_files` column will contain one or more URLs to files which will be downloaded and imported. Multiple URLs must be separated by a pipe (|). Semi-colons are valid URL syntax, so do not use them as separators, and URLs themselves MUST NOT contain pipes.
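The reason for the pipe separator becomes clear when a URL contains a semi-colon. A minimal sketch, assuming a hypothetical `parse_remote_files` helper (not part of Bulkrax) and example.com placeholder URLs:

```ruby
require 'uri'

# Splits a remote_files cell on pipes only, so semi-colons inside a URL
# survive, and parses each part with Ruby's standard URI library.
def parse_remote_files(cell)
  cell.to_s.split('|').map(&:strip).reject(&:empty?).map { |url| URI.parse(url) }
end
```

For a cell such as `https://example.com/a.pdf;v=1|https://example.com/b.jpg`, this returns two URIs, with the `;v=1` segment kept as part of the first.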
If imported from a pre-existing server location, files MUST be placed in a directory called `files` relative to the location of the CSV file.
If uploading using Browse Everything, the location of the files will be handled by the system.
For example:
source_identifier | title | creator | publisher | file |
---|---|---|---|---|
first_work | First work title | Smith, John | Faber and Faber | document.pdf |
second_work | Second work title | Jones, David | Macmillan | firstdocument.docx; seconddocument.pdf |
third_work | Third work title | Other, A.N. | Penguin | |
If the CSV to be imported is located at:
/tmp/imports/1/csv-to-be-imported.csv
then the files would be at:
/tmp/imports/1/files/document.pdf
/tmp/imports/1/files/firstdocument.docx
/tmp/imports/1/files/seconddocument.pdf
The third_work does not have any associated files.
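The path rule in this example (a `files` directory next to the CSV) can be sketched with Ruby's `File` methods. `resolve_file_paths` is a hypothetical helper for illustration, not a Bulkrax method.

```ruby
# Builds the expected on-disk location of each filename listed in the CSV,
# assuming the files sit in a "files" directory beside the CSV itself.
def resolve_file_paths(csv_path, filenames)
  files_dir = File.join(File.dirname(csv_path), 'files')
  filenames.map { |name| File.join(files_dir, name) }
end
```

Passing the CSV path above together with `['document.pdf']` gives `/tmp/imports/1/files/document.pdf`, and checking each result with `File.exist?` before importing is a quick way to catch misnamed files.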
A Zip file containing a single CSV and a folder named `files/` can be imported by the CSV Importer. The structure of the Zip is very important and is as follows:
metadata.csv
files/
|-- file_1.png
|-- file_2.jpg
See the Files Location guide for how to reference the files within the CSV.
In Finder, select the CSV and the `files/` folder (cmd + click to select multiple items), right click, and select Compress. This will create the Zip file that will be imported.
NOTE: The names of the files themselves don't matter, as long as they match what's in the `files` column in the CSV. Likewise, the name of the CSV does not matter. However, the name of the folder containing the files does matter and should be written exactly as "files" (lowercase and plural). Also, the structure of the Zip is important; for example, if you compress a directory containing the CSV and the `files/` folder, it will not import properly.
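The layout rules in this note can be checked against a list of entry names taken from the Zip (for example, the output of `unzip -Z1 archive.zip`). This is a minimal sketch; `zip_layout_errors` is a hypothetical helper that works on plain strings and does not read the archive itself.

```ruby
# entry_names: paths inside the Zip, e.g. ["metadata.csv", "files/a.png"].
# Returns a list of layout problems; an empty list means the layout matches
# the structure described above.
def zip_layout_errors(entry_names)
  errors = []

  # Exactly one CSV must sit at the top level of the archive.
  csvs = entry_names.select { |name| !name.include?('/') && name.downcase.end_with?('.csv') }
  errors << "expected exactly one CSV at the top level, found #{csvs.size}" unless csvs.size == 1

  # The folder must be named exactly "files" and also sit at the top level.
  has_files_dir = entry_names.any? { |name| name.start_with?('files/') }
  errors << 'missing top-level files/ folder' unless has_files_dir

  errors
end
```

An archive made by compressing a parent directory produces entries such as `import/metadata.csv`, which this check reports as both a missing top-level CSV and a missing `files/` folder.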
Please see the Configuration guide for information on how to configure and customize imports, for example by excluding columns from import or splitting data on specific delimiters.