Custom field per dataset and also dataset file #6455

Closed
tainguyenbui opened this issue Dec 13, 2019 · 10 comments
Labels
Feature: Metadata, Type: Suggestion, User Role: Depositor

Comments

@tainguyenbui
Contributor

Hi Dataverse team! I wanted to raise an issue that could actually be a feature request

Description:

Request 1: The ability to define custom fields on a dataset that could be added to the metadata blocks

Given that I am a user that wants to store some extra information in a dataset
I would like to define a custom field
So I could easily retrieve that information, for example the location in S3, from a "Dataset Information" request

Request 2: The ability to define a custom field on a dataset file, so that I could store extra information related to that file

Scenario:
Given that I am a user that wants to store some extra information in a dataset
I would like to have a custom field per file
So I could find out specific information about that file, for instance, where that file was extracted from.

Desired outcome:
Extra custom information can be extracted from datasets and dataset files
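
Purely to illustrate the request (nothing below exists in the Dataverse API today; the customFields block and the field name are hypothetical), the kind of file metadata response we have in mind would look roughly like:

    {
      "label": "pr1954_p0214_0.csv",
      "dataFile": {
        "id": 1234,
        "filename": "pr1954_p0214_0.csv"
      },
      "customFields": {
        "sourceImageLocation": "z-results/personnel-records/1954/seg/bank/col_img/pr1954_p0214_0"
      }
    }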

@pdurbin
Member

pdurbin commented Dec 13, 2019

for example the location in S3

Could the location of the file in S3 be considered a form of data provenance?

You said "where that file was extracted from."

In the User Guide, we describe prov like this:

"Data Provenance is a record of where your data came from and how it reached its current form. It describes the origin of a data file, any transformations that have been made to that file, and any persons or organizations associated with that file. A data file’s provenance can aid in reproducibility and compliance with legal regulations. Dataverse can help you keep track of your data’s provenance."

So maybe it's a good match?

If so, and if you're willing to express the information in PROV-JSON format, http://guides.dataverse.org/en/4.18.1/api/native-api.html#provenance explains how to get the JSON in and out. I couldn't quickly find an example JSON in our guides, but we build up a JSON object here in a test: https://github.com/IQSS/dataverse/blob/v4.18.1/src/test/java/edu/harvard/iq/dataverse/api/ProvIT.java#L129

Here's more on the format: https://www.w3.org/Submission/2013/SUBM-prov-json-20130424/

We explain the feature in some depth at http://guides.dataverse.org/en/4.18.1/user/dataset-management.html#data-provenance

There's also a "provFreeform" field as an alternative to PROV-JSON: http://guides.dataverse.org/en/4.18.1/api/native-api.html#updating-file-metadata
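
For the scanned-image case, a minimal PROV-JSON sketch might look something like the following (the ex: names and labels are made up purely for illustration; see the W3C submission above for the full format):

    {
      "prefix": { "ex": "http://example.org/personnel-records/" },
      "entity": {
        "ex:scanned-image": { "prov:label": "scanned image of the book, stored in S3" },
        "ex:extracted-csv": { "prov:label": "CSV produced by text extraction" }
      },
      "wasDerivedFrom": {
        "_:d1": {
          "prov:generatedEntity": "ex:extracted-csv",
          "prov:usedEntity": "ex:scanned-image"
        }
      }
    }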

@tainguyenbui
Contributor Author

tainguyenbui commented Dec 15, 2019

Hi @pdurbin thanks for your reply.

The location we want to save is where the scanned image is, not where its original format is.

When we talk about images, we mean the scanned image of the book.
When we talk about the CSV, it is the result of the text extraction from the image.

Using the data provenance JSON could potentially fulfil our requirements; however, it may be overkill and an extra step in the file upload process. Imagine a situation where we are uploading a few hundred files per dataset.

We would like to make it easy to upload a path such as z-results/personnel-records/1954/seg/bank/col_img/pr1954_p0214_0 in the same request we use to upload a file, for instance via a new custom field in the form below:
[Screenshot 2019-12-15: the file upload form with the proposed custom field]
And, at the same time, have the same field available when uploading a file through a POST request to the native API.

@pdurbin
Member

pdurbin commented Dec 16, 2019

@tainguyenbui sure. To me it's all provenance. Your CSV comes ultimately from the scanned image.

The main reason I'm trying to sell you on using Dataverse's provenance feature is that you can start using it today.

In the past we've talked a lot about more metadata at the file level and even custom metadata at the file level, but this feature does not yet exist (and the effort is not small). Please see #594, #916, and #3259.

Another thought is that you could maintain a file as part of your dataset that serves as sort of a manifest or bill of materials of all the files in the dataset with extra metadata like the location of the scanned image. This file could be JSON or whatever you want.
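
To make that concrete, one possible shape for such a manifest (the keys and paths below are just an example, the format is entirely up to you) would be a JSON file in the dataset that maps each data file to the location of its scanned image:

    {
      "pr1954_p0214_0.csv": {
        "sourceImage": "z-results/personnel-records/1954/seg/bank/col_img/pr1954_p0214_0"
      },
      "pr1954_p0215_0.csv": {
        "sourceImage": "z-results/personnel-records/1954/seg/bank/col_img/pr1954_p0215_0"
      }
    }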

@tainguyenbui
Contributor Author

@pdurbin we feel like the modification and retrieval of file provenance could become complex when uploading a few hundred files.

We understand the complexity of adding more metadata at the file level, and looking at the past discussions, it seems unlikely that this option will be developed any time soon, if ever.

We have also considered keeping an index file mapping each file to its S3 prefix. However, we would not store that file in the Dataset, because then we would still need a place in the Dataset metadata block to point to the fileId or rootDataFileId of the index file, which would also require updates as the dataset grows. Instead, we would store that index file in the root of the S3 dataset folder, for instance, <s3-bucket>/personnel_records/1954/images_index.json

All in all, even if we take either of the two options above, it is important to have certain custom information in the Dataset metadata :(

Unfortunately, we still do not have a straightforward solution for this.

Thanks a lot for your help and effort to point us in the right direction

@djbrooke
Contributor

Thanks @tainguyenbui and @pdurbin for the good discussion here. The use case here seems similar to SBGrid's use case in that "If there's another way to access this file/dataset that may be more efficient/appropriate, I want to know about it." We implemented this with the "Local Access" in the screenshot here:

[Screenshot 2019-12-16: file page showing the "Local Access" field]

cc: @pameyer

@tainguyenbui
Contributor Author

tainguyenbui commented Dec 16, 2019

@djbrooke thanks for your reply. Although interesting, I think that if nothing is implemented for now, we would use the dataSource metadata field in the Dataset to populate some prefixes that we could iterate over. Possibly not ideal, but better than what we currently have.

For instance:

    "dataSource": [
        "personnel-records/1954/seg/bank",
        "personnel-records/1954/seg/credit_union",
        "personnel-records/1954/seg/firm"
    ],
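
If we go down this route via the native API's edit-metadata endpoint, the payload would presumably look something like the following (assuming the stock citation block's dataSources field, which is a multi-value primitive field; the typeName would change if a custom block were used instead):

    {
      "fields": [
        {
          "typeName": "dataSources",
          "multiple": true,
          "typeClass": "primitive",
          "value": [
            "personnel-records/1954/seg/bank",
            "personnel-records/1954/seg/credit_union",
            "personnel-records/1954/seg/firm"
          ]
        }
      ]
    }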

@pdurbin
Member

pdurbin commented Oct 9, 2022

@vkush

vkush commented Oct 9, 2022

Thanks @tainguyenbui and @pdurbin for the good discussion here. The use case here seems similar to SBGrid's use case in that "If there's another way to access this file/dataset that may be more efficient/appropriate, I want to know about it." We implemented this with the "Local Access" in the screenshot here:

Many thanks, @djbrooke. Is this common functionality, or is it a custom extension to a particular Dataverse instance? I cannot find anything similar in Dataverse 5.10.1. Your screenshot looks very nice, exactly like custom metadata fields at the file level (see the related issues in the comment above): Local Access, Download, Access, Verify Data. Is it some visualization of auxiliary files directly in the GUI? Perhaps visualization of auxiliary files could serve as a solution for custom metadata at the file level (with the custom metadata stored inside auxiliary files)?

@pdurbin
Member

pdurbin commented Oct 11, 2022

Is it a common functionality or it is just a custom extension to some DV-instance?

Local Access is part of enabling a feature we usually refer to as "rsync", but beware, we are thinking about removing it.

The setting behind "Local Access" is :LocalDataAccessPath and it is described here: https://guides.dataverse.org/en/5.12/developers/big-data-support.html#configuring-download-via-rsync

Those other items (Download, Access, and Verify Data) are part of the same rsync feature. I hope this helps.

@pdurbin added the "User Role: Depositor" label on Oct 7, 2023
@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

@cmbz closed this as completed on Aug 20, 2024