
CSV Importer

Alisha Evans edited this page Sep 17, 2021 · 32 revisions

Bulkrax can import from a CSV file that follows these guidelines.

Required fields

  • The CSV MUST have a header row naming each column.
  • The header row MUST include a column representing the source_identifier, and that column MUST contain a unique identifier for each item (see Source Identifier below for more detail).
  • The CSV MUST have a title column.
  • Every work MUST have a value in both the source_identifier column and the title column.
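
The requirements above can be checked before an import with a few lines of plain Ruby. This is only an illustrative sketch, not Bulkrax's own validation: the method name `missing_required_fields` is hypothetical, and it assumes the source identifier column is literally named `source_identifier`.

```ruby
require 'csv'

# Hypothetical helper (not part of Bulkrax): returns a list of problems found
# in a CSV string, based on the required-field rules listed above.
def missing_required_fields(csv_text, required: %w[source_identifier title])
  table = CSV.parse(csv_text, headers: true)
  problems = []

  # Every required column must appear in the header row.
  required.each do |column|
    problems << "missing column: #{column}" unless table.headers.include?(column)
  end

  # Every row must have a non-blank value in each required column that exists.
  table.each_with_index do |row, index|
    required.each do |column|
      next unless table.headers.include?(column)
      value = row[column]
      problems << "row #{index + 1}: blank #{column}" if value.nil? || value.strip.empty?
    end
  end

  problems
end

sample = <<~CSV
  source_identifier,title
  work_1,First work
  work_2,
CSV

puts missing_required_fields(sample)
```

An empty result means the file meets the requirements listed here; it says nothing about whether the values are unique or whether other columns are valid.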

Source Identifier

In Bulkrax, the source_identifier must exist. This field stores the import identifier of the Work or Collection. When the import runs, it checks whether a Work or Collection already exists with that identifier; if so, it updates the existing item, and if not, it creates a new one.
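
The update-or-create behaviour can be pictured with a small stand-alone sketch. It assumes an in-memory Hash in place of the repository, and the names `Record` and `import_row` are hypothetical; Bulkrax's real lookup goes through Hyrax and is not modelled here.

```ruby
# Hypothetical in-memory stand-in for a stored Work or Collection.
Record = Struct.new(:source_identifier, :title)

# Updates the record with this identifier if one exists, otherwise creates it.
def import_row(store, source_identifier, title)
  existing = store[source_identifier]
  if existing
    existing.title = title # identifier already known: update in place
    :updated
  else
    store[source_identifier] = Record.new(source_identifier, title)
    :created
  end
end

store = {}
puts import_row(store, 'work_1', 'First draft') # prints "created"
puts import_row(store, 'work_1', 'Final title') # prints "updated"
puts store.size                                 # prints 1
```

Re-running an import with the same identifiers therefore changes existing items rather than duplicating them, which is why the identifier must be unique.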

There are two ways to set the field:

  • Use an existing Hyrax field

    Bulkrax.setup do |config|
      # Use the doi field (note: doi must be available on all works and collections).
      config.field_mappings['Bulkrax::CsvParser'] = {
        'doi' => { from: ['doi'], source_identifier: true }
      }
    end
    
  • Allow Bulkrax to create the source_identifier

    • If no existing field is available and unique across all Works and Collections, Bulkrax can generate a custom one. For example, in the local application's Bulkrax configuration:
      config.fill_in_blank_source_identifiers = ->(obj, index) { "#{Site.instance.account.name}-#{obj.importerexporter.id}-#{index}" }
      config.field_mappings['Bulkrax::CsvParser'] = {
        'bulkrax_identifier' => { from: ['bulkrax_identifier'], source_identifier: true }
      }
    
    • You will also need to add the following to "app/indexers/shared_indexer" in your local app:
    solr_doc[Solrizer.solr_name('bulkrax_identifier', :facetable)] = object.bulkrax_identifier
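
To see what a fill_in_blank_source_identifiers lambda of this shape produces, it can be run outside the application with stand-in objects. In this sketch `Importer` and `Entry` are hypothetical structs, the site-specific account name is replaced by the fixed prefix `demo`, and the starting index of 0 is an assumption rather than documented Bulkrax behaviour.

```ruby
# Hypothetical stand-ins so the lambda can run outside a Rails app.
Importer = Struct.new(:id)
Entry = Struct.new(:importerexporter)

# Same shape as the configured lambda, with a fixed prefix instead of the
# account name (an assumption made for this sketch).
fill_in = ->(obj, index) { "demo-#{obj.importerexporter.id}-#{index}" }

entry = Entry.new(Importer.new(7))
ids = (0..2).map { |index| fill_in.call(entry, index) }
puts ids.inspect # prints ["demo-7-0", "demo-7-1", "demo-7-2"]
```

Combining the importer id with the row index keeps generated identifiers distinct within one importer and across importers.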
    

Supported fields

All columns will be imported if the column name matches an existing metadata property in Hyrax, e.g. title or creator.

In addition, the following columns will be imported:

  • collection or collection_#
  • file or file_#
  • file_url or file_url_#
  • remote_files
  • model

Fields with multiple values

Default

The default way to handle a field with multiple values is to use numbered headers:

| creator_first_name_1 | creator_last_name_1 | creator_position_1 | creator_first_name_2 |
|---|---|---|---|
| Aaliyah | Haughton | Queen | Ruth |
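
One way to picture numbered headers is as several columns feeding a single multi-valued field. The sketch below groups a row by the header name with its trailing number removed; `collect_numbered` is a hypothetical helper, and Bulkrax's own parsing may group values differently.

```ruby
# Groups values from numbered headers (name_1, name_2, ...) under the base name.
def collect_numbered(row)
  row.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(header, value), grouped|
    next if value.nil? || value.empty?

    base = header.sub(/_\d+\z/, '') # strip a trailing _N, if any
    grouped[base] << value
  end
end

row = {
  'creator_first_name_1' => 'Aaliyah',
  'creator_last_name_1'  => 'Haughton',
  'creator_position_1'   => 'Queen',
  'creator_first_name_2' => 'Ruth'
}

grouped = collect_numbered(row)
puts grouped['creator_first_name'].inspect # prints ["Aaliyah", "Ruth"]
```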

Join

If your CSV instead has a single header holding multiple values:

  • the "join" property must be set in the field mapping
  • the values are separated by a semi-colon (;) or pipe (|)

| creator_first_name | creator_last_name | creator_position |
|---|---|---|
| Aaliyah; Ruth | Haughton | Queen |
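
Splitting a cell on either delimiter can be sketched as follows. `split_values` is a hypothetical helper used only to show the effect; it is not how the join property is implemented in Bulkrax.

```ruby
# Splits a cell on semi-colons or pipes and trims surrounding whitespace.
def split_values(cell)
  cell.to_s.split(/\s*[;|]\s*/).map(&:strip).reject(&:empty?)
end

puts split_values('Aaliyah; Ruth').inspect # prints ["Aaliyah", "Ruth"]
puts split_values('Haughton').inspect      # prints ["Haughton"]
```

A cell without a delimiter comes back as a one-element list, so single and multiple values can be handled the same way downstream.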

Collections

A column headed collection will be used to define which collection imported works should be added to. However, the collection will need to already exist in the app.

Multiple collections can be supplied.

If the value provided matches a value found in the system_identifier_field of an existing collection, then works will be added to that collection. If not, a new collection will be created and both title and system_identifier_field will be set to the value supplied in the collection column.

For example:

| source_identifier | title | collection |
|---|---|---|
| imported_work_1 | Work One | Collection One |
| imported_work_2 | Work Two | Collection One; Collection Two |

In the first row (after the header), the Work being imported will be added to Collection One, and in the second, to both Collection One and Collection Two.

If either of those already exists, then the existing collection is used. If not, a new one is created.

Model

The model column is used to determine the work type. It is not required. In its absence, either the field mapping or default_work_type will be used. Read more about these in the Configuration guide.

Files

Files will be imported from a column called file_#, file_url_# or remote_files if they are present.

The file_# columns will each contain a single filename (these must be unique). Multiple files can be imported by using additional numbered headers.

The file_url_# columns will each contain a single URL to a file which will be downloaded and imported (these must be unique). Multiple files can be imported by using additional numbered headers.

The remote_files column will contain one or more URLs to files which will be downloaded and imported. Multiple URLs MUST be separated by a pipe (|). Semi-colons are valid URL syntax, so do not use them as the separator, and the URLs themselves MUST NOT contain pipes.
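
The reason for the pipe-only rule shows up as soon as a URL contains a semi-colon. In this sketch `split_remote_files` is a hypothetical helper and the example.org URLs are placeholders.

```ruby
# Splits a remote_files cell on pipes only, so semi-colons inside URLs survive.
def split_remote_files(cell)
  cell.to_s.split('|').map(&:strip).reject(&:empty?)
end

cell = 'https://example.org/a.pdf;v=1 | https://example.org/b.png'
urls = split_remote_files(cell)
puts urls.inspect
# prints ["https://example.org/a.pdf;v=1", "https://example.org/b.png"]
```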

Files Location

If imported from a pre-existing server location, files MUST be placed in a directory called files relative to the location of the CSV file.

If uploading using Browse Everything, the location of the files will be handled by the system.

For example:

| source_identifier | title | creator | publisher | file |
|---|---|---|---|---|
| first_work | First work title | Smith, John | Faber and Faber | document.pdf |
| second_work | Second work title | Jones, David | Macmillan | firstdocument.docx; seconddocument.pdf |
| third_work | Third work title | Other, A.N. | Penguin | |

If the CSV to be imported is located at

/tmp/imports/1/csv-to-be-imported.csv

The files would be at:

/tmp/imports/1/files/document.pdf
/tmp/imports/1/files/firstdocument.docx
/tmp/imports/1/files/seconddocument.pdf

The third_work does not have any associated files.
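
The mapping from a file cell to a path on disk can be sketched in a few lines. `file_paths` is a hypothetical helper; it assumes filenames in a cell are separated by semi-colons, as in the example above, and that the files directory sits next to the CSV.

```ruby
# Resolves the filenames in a cell to paths in the files directory beside the CSV.
def file_paths(csv_path, file_cell)
  names = file_cell.to_s.split(';').map(&:strip).reject(&:empty?)
  names.map { |name| File.join(File.dirname(csv_path), 'files', name) }
end

csv_path = '/tmp/imports/1/csv-to-be-imported.csv'
puts file_paths(csv_path, 'firstdocument.docx; seconddocument.pdf')
# prints:
# /tmp/imports/1/files/firstdocument.docx
# /tmp/imports/1/files/seconddocument.pdf
```

An empty cell, as for third_work, yields an empty list, so no files are looked up for that row.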

Importing Metadata and Files from a Zip file

A Zip file containing a single CSV and a folder named files/ can be imported by the CSV Importer. The structure of the Zip is very important and is as follows:

metadata.csv
files/
  |
  file_1.png
  file_2.jpg

See the Files Location guide for how to reference the files within the CSV.

On macOS, in Finder, select the CSV and the files/ folder (cmd + click to select multiple items), right click, and select Compress. This will create the Zip file that will be imported.

NOTE: The names of the files themselves don't matter, as long as they match what's in the files column in the CSV. Likewise, the name of the CSV does not matter. However, the name of the folder containing the files does matter and should be written exactly as "files" (lowercase and plural). Also, the structure of the Zip is important; for example, if you compress a directory containing the CSV and the files/ folder, it will not import properly.
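
The layout rule can be checked against a list of entry names from the archive, for example the names printed by `unzip -l`. This is a rough sketch: `valid_import_layout?` is a hypothetical helper, it works on plain strings rather than opening the Zip, and it only covers the structure described in this section.

```ruby
# Returns true when there is exactly one CSV at the top level of the archive
# and at least one entry under a top-level files/ folder.
def valid_import_layout?(entry_names)
  top_level_csvs = entry_names.select do |name|
    !name.include?('/') && name.downcase.end_with?('.csv')
  end
  has_files_dir = entry_names.any? { |name| name.start_with?('files/') }
  top_level_csvs.size == 1 && has_files_dir
end

good = ['metadata.csv', 'files/', 'files/file_1.png', 'files/file_2.jpg']
bad  = ['export/metadata.csv', 'export/files/file_1.png'] # a parent directory was compressed

puts valid_import_layout?(good) # prints true
puts valid_import_layout?(bad)  # prints false
```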

Configuration and Customization

Please see the Configuration guide for information on how to configure and customize the import, for example by excluding columns from import or splitting data on specific delimiters.
