Custom field per dataset and also dataset file #6455

Closed
tainguyenbui opened this issue Dec 13, 2019 · 10 comments
Labels
Feature: Metadata, Type: Suggestion, User Role: Depositor

Comments

@tainguyenbui
Contributor

Hi Dataverse team! I wanted to raise an issue that could actually be a feature request

Description:

Request 1: The ability to define custom fields on a dataset that could be added to the metadata blocks

Given that I am a user that wants to store some extra information in a dataset
I would like to define a custom field
So I could easily retrieve that information, for example the location in S3, from a "Dataset Information" request

Request 2: The ability to define a custom field on a dataset file, so that I could store extra information related to that file

Scenario:
Given that I am a user that wants to store some extra information in a dataset
I would like to have a custom field per file
So I could find out specific information about that file, for instance, where that file was extracted from.

Desired outcome:
Extra custom information can be extracted from datasets and dataset files
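
Purely to illustrate the request (nothing below exists in the Dataverse API today; the customFields block and the field name are hypothetical), the kind of file metadata response we have in mind would look roughly like:

    {
      "label": "pr1954_p0214_0.csv",
      "dataFile": {
        "id": 1234,
        "filename": "pr1954_p0214_0.csv"
      },
      "customFields": {
        "sourceImageLocation": "z-results/personnel-records/1954/seg/bank/col_img/pr1954_p0214_0"
      }
    }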

@pdurbin
Member

pdurbin commented Dec 13, 2019

for example the location in S3

Could the location of the file in S3 be considered a form of data provenance?

You said "where that file was extracted from."

In the User Guide, we describe prov like this:

"Data Provenance is a record of where your data came from and how it reached its current form. It describes the origin of a data file, any transformations that have been made to that file, and any persons or organizations associated with that file. A data file’s provenance can aid in reproducibility and compliance with legal regulations. Dataverse can help you keep track of your data’s provenance."

So maybe it's a good match?

If so, and if you're willing to express the information in PROV-JSON format, http://guides.dataverse.org/en/4.18.1/api/native-api.html#provenance explains how to get the JSON in and out. I couldn't quickly find an example JSON in our guides, but we build up a JSON object here in a test: https://github.com/IQSS/dataverse/blob/v4.18.1/src/test/java/edu/harvard/iq/dataverse/api/ProvIT.java#L129

Here's more on the format: https://www.w3.org/Submission/2013/SUBM-prov-json-20130424/

We explain the feature in some depth at http://guides.dataverse.org/en/4.18.1/user/dataset-management.html#data-provenance

There's also a "provFreeform" field as an alternative to PROV-JSON: http://guides.dataverse.org/en/4.18.1/api/native-api.html#updating-file-metadata
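
For the scanned-image case, a minimal PROV-JSON sketch might look something like the following (the ex: names and labels are made up purely for illustration; see the W3C submission above for the full format):

    {
      "prefix": { "ex": "http://example.org/personnel-records/" },
      "entity": {
        "ex:scanned-image": { "prov:label": "scanned image of the book, stored in S3" },
        "ex:extracted-csv": { "prov:label": "CSV produced by text extraction" }
      },
      "wasDerivedFrom": {
        "_:d1": {
          "prov:generatedEntity": "ex:extracted-csv",
          "prov:usedEntity": "ex:scanned-image"
        }
      }
    }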

@tainguyenbui
Contributor Author

tainguyenbui commented Dec 15, 2019

Hi @pdurbin thanks for your reply.

The location we want to save is where the scanned image is, not where its original format is.

When we talk about images, we mean the scanned image of the book.
When we talk about the CSV, it is the result of the text extraction from the image.

Using the data provenance JSON could potentially fulfil our requirements; however, it may be overkill and an extra step in the file upload process. Imagine a situation where we are uploading a few hundred files per dataset.

We would like to make it easy to upload a path such as z-results/personnel-records/1954/seg/bank/col_img/pr1954_p0214_0 in the same request we use to upload a file, for instance via a new custom field in the form below:
[Screenshot 2019-12-15: the file upload form with the proposed custom field]
And, at the same time, have the same field available when uploading a file through a POST request to the native API.

@pdurbin
Member

pdurbin commented Dec 16, 2019

@tainguyenbui sure. To me it's all provenance. Your CSV comes ultimately from the scanned image.

The main reason I'm trying to sell you on using Dataverse's provenance feature is that you can start using it today.

In the past we've talked a lot about more metadata at the file level and even custom metadata at the file level, but this feature does not yet exist (and the effort is not small). Please see #594, #916, and #3259.

Another thought is that you could maintain a file as part of your dataset that serves as sort of a manifest or bill of materials of all the files in the dataset with extra metadata like the location of the scanned image. This file could be JSON or whatever you want.
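
To make that concrete, one possible shape for such a manifest (the keys and paths below are just an example, the format is entirely up to you) would be a JSON file in the dataset that maps each data file to the location of its scanned image:

    {
      "pr1954_p0214_0.csv": {
        "sourceImage": "z-results/personnel-records/1954/seg/bank/col_img/pr1954_p0214_0"
      },
      "pr1954_p0215_0.csv": {
        "sourceImage": "z-results/personnel-records/1954/seg/bank/col_img/pr1954_p0215_0"
      }
    }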

@tainguyenbui
Contributor Author

@pdurbin we feel like the modification and retrieval of file provenance could become complex when uploading a few hundred files.

We understand the complexity of adding more metadata at the file level, and looking at the past discussions, it seems unlikely that this option will be developed any time soon, if ever.

We have also considered keeping an index file mapping each file to its S3 prefix. However, we would not store that file in the Dataset, because then we would still need a place in the Dataset metadata block to point to the fileId or rootDataFileId of the index file, which would also require updates as the dataset grows. Instead, we would store that index file in the root of the S3 dataset folder, for instance, <s3-bucket>/personnel_records/1954/images_index.json

All in all, even if we take either of the two options above, it is important to have certain custom information in the Dataset metadata :(

Unfortunately, we still do not have a straightforward solution for this.

Thanks a lot for your help and effort to point us in the right direction

@djbrooke
Contributor

Thanks @tainguyenbui and @pdurbin for the good discussion here. The use case here seems similar to SBGrid's use case in that "If there's another way to access this file/dataset that may be more efficient/appropriate, I want to know about it." We implemented this with the "Local Access" in the screenshot here:

[Screenshot 2019-12-16: file page showing the "Local Access" field]

cc: @pameyer

@tainguyenbui
Contributor Author

tainguyenbui commented Dec 16, 2019

@djbrooke thanks for your reply. Although interesting, I think that if nothing is implemented for now, we would use the dataSource metadata field in the Dataset to populate some prefixes that we could iterate over. Possibly not ideal, but better than what we currently have.

For instance:

    "dataSource": [
        "personnel-records/1954/seg/bank",
        "personnel-records/1954/seg/credit_union",
        "personnel-records/1954/seg/firm"
    ],
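
If we go down this route via the native API's edit-metadata endpoint, the payload would presumably look something like the following (assuming the stock citation block's dataSources field, which is a multi-value primitive field; the typeName would change if a custom block were used instead):

    {
      "fields": [
        {
          "typeName": "dataSources",
          "multiple": true,
          "typeClass": "primitive",
          "value": [
            "personnel-records/1954/seg/bank",
            "personnel-records/1954/seg/credit_union",
            "personnel-records/1954/seg/firm"
          ]
        }
      ]
    }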

@pdurbin
Member

pdurbin commented Oct 9, 2022

@vkush

vkush commented Oct 9, 2022

Thanks @tainguyenbui and @pdurbin for the good discussion here. The use case here seems similar to SBGrid's use case in that "If there's another way to access this file/dataset that may be more efficient/appropriate, I want to know about it." We implemented this with the "Local Access" in the screenshot here:

Many thanks, @djbrooke. Is this common functionality, or is it a custom extension to a particular Dataverse instance? I cannot find anything similar in Dataverse 5.10.1. Your screenshot looks very nice, exactly like custom metadata fields at the file level (see the related issues in the comment above): Local Access, Download, Access, Verify Data. Is it some visualization of auxiliary files directly in the GUI? Perhaps visualization of auxiliary files could serve as a solution for custom metadata at the file level (with the custom metadata stored inside auxiliary files)?

@pdurbin
Member

pdurbin commented Oct 11, 2022

Is it a common functionality or it is just a custom extension to some DV-instance?

Local Access is part of enabling a feature we usually refer to as "rsync", but beware, we are thinking about removing it.

The setting behind "Local Access" is :LocalDataAccessPath and it is described here: https://guides.dataverse.org/en/5.12/developers/big-data-support.html#configuring-download-via-rsync

Those other items (Download, Access, and Verify Data) are part of the same rsync feature. I hope this helps.

@pdurbin added the "User Role: Depositor" label on Oct 7, 2023
@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

@cmbz closed this as completed on Aug 20, 2024