catalog-validate failed to validate metadata extracted by meta-extract #492

Closed
tientong98 opened this issue Oct 7, 2024 · 2 comments

@tientong98

Hello,

I'm currently using

datalad 1.0.2
datalad_catalog 1.1.1

and was not able to validate metadata files. Below is what I did:

datalad clone https://github.com/OpenNeuroDatasets/ds005454.git
datalad meta-extract -d ds005454 metalad_core | jq > metadata.json

I tried to use catalog-translate but wasn't successful: datalad catalog-translate metadata.json returned null, and datalad catalog-translate -m metadata.json returned an error:

[ERROR  ] unknown argument: -m 
usage: datalad catalog-translate [-h] [-c CATALOG] [--version] metadata

metadata.json doesn't pass the validation:

datalad catalog-add --catalog data-cat --metadata metadata.json          

catalog_add(error): data-cat [Expecting property name enclosed in double quotes: line 1 column 2 (char 1)]
catalog_add(error): data-cat [Extra data: line 1 column 9 (char 8)]
catalog_add(error): data-cat [Extra data: line 1 column 15 (char 14)]
catalog_add(error): data-cat [Extra data: line 1 column 20 (char 19)]
catalog_add(error): data-cat [Extra data: line 1 column 19 (char 18)]
catalog_add(error): data-cat [Extra data: line 1 column 22 (char 21)]
catalog_add(error): data-cat [Extra data: line 1 column 25 (char 24)]
catalog_add(error): data-cat [Extra data: line 1 column 20 (char 19)]
catalog_add(error): data-cat [Extra data: line 1 column 15 (char 14)]
catalog_add(error): data-cat [Extra data: line 1 column 16 (char 15)]
  [50 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  catalog_add (error: 60)

I also tried to validate a catalog schema file, which didn't pass validation either:

datalad catalog-add --catalog data-cat --metadata jsonschema_dataset.json 
catalog_add(error): data-cat [Expecting property name enclosed in double quotes: line 1 column 2 (char 1)]
catalog_add(error): data-cat [Extra data: line 1 column 12 (char 11)]
catalog_add(error): data-cat [Extra data: line 1 column 8 (char 7)]
catalog_add(error): data-cat [Extra data: line 1 column 10 (char 9)]
catalog_add(error): data-cat [Extra data: line 1 column 16 (char 15)]
catalog_add(error): data-cat [Extra data: line 1 column 9 (char 8)]
catalog_add(error): data-cat [Extra data: line 1 column 15 (char 14)]
catalog_add(error): data-cat [Extra data: line 1 column 11 (char 10)]
catalog_add(error): data-cat [Extra data: line 1 column 20 (char 19)]
catalog_add(error): data-cat [Extra data: line 1 column 14 (char 13)]
  [326 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  catalog_add (error: 336)

I was able to add metadata as specified in the example in the handbook:

touch toy_metadata.jsonl
echo '{ "type": "dataset", "dataset_id": "5df8eb3a-95c5-11ea-b4b9-a0369f287950", "dataset_version": "dae38cf901995aace0dde5346515a0134f919523", "name": "My toy dataset", "short_name": "My toy dataset", "description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus nec justo tellus. Nunc sagittis eleifend magna, eu blandit arcu tincidunt eu. Mauris pharetra justo nec volutpat euismod. Curabitur bibendum vitae nunc a pharetra. Donec non rhoncus risus, ac consequat purus. Pellentesque ultricies ut enim non luctus. Sed viverra dolor enim, sed blandit lorem interdum sit amet. Aenean tincidunt et dolor sit amet tincidunt. Vivamus in sollicitudin ligula. Curabitur volutpat sapien erat, eget consectetur mauris dapibus a. Phasellus fringilla justo ligula, et fringilla tortor ullamcorper id. Praesent tristique lacus purus, eu convallis quam vestibulum eget. Donec ullamcorper mi neque, vel tincidunt augue porttitor vel.", "doi": "", "url": "https://github.com/jsheunis/multi-echo-super", "license": { "name": "CC BY 4.0", "url": "https://creativecommons.org/licenses/by/4.0/" }, "authors": [ { "givenName": "Stephan", "familyName": "Heunis"} ], "keywords": [ "lorum", "ipsum", "foxes" ], "funding": [ { "name": "Stephans Bank Account", "identifier": "No. 42", "description": "Nothing to see here" } ], "metadata_sources": { "key_source_map": {}, "sources": [ { "source_name": "stephan_manual", "source_version": "1", "source_parameter": {}, "source_time": 1652340647.0, "agent_name": "Stephan Heunis", "agent_email": "" } ] } }' >> toy_metadata.jsonl
echo '{ "type": "file", "dataset_id": "5df8eb3a-95c5-11ea-b4b9-a0369f287950", "dataset_version": "dae38cf901995aace0dde5346515a0134f919523", "contentbytesize": 1403, "path": "README", "metadata_sources": { "key_source_map": {}, "sources": [ { "source_name": "stephan_manual", "source_version": "1", "source_parameter": {}, "source_time": 1652340647.0, "agent_name": "Stephan Heunis", "agent_email": "" } ] } }' >> toy_metadata.jsonl
echo '{ "type": "file", "dataset_id": "5df8eb3a-95c5-11ea-b4b9-a0369f287950", "dataset_version": "dae38cf901995aace0dde5346515a0134f919523", "contentbytesize": 15572, "path": "main_data/main_results.png", "metadata_sources": { "key_source_map": {}, "sources": [ { "source_name": "stephan_manual", "source_version": "1", "source_parameter": {}, "source_time": 1652340647.0, "agent_name": "Stephan Heunis", "agent_email": "" } ] } }' >> toy_metadata.jsonl

I can see differences in format between the handbook example and the output of meta-extract using metalad_core, which might be related to this issue. Please let me know the best way to move forward.

Thank you!
Tien

@jsheunis
Member

Hi @tientong98. Sorry I missed the notification for this issue.

The issue seems to be the format in which you input the metadata to the catalog-translate subcommand. When you run datalad catalog-translate --help you'll see:

positional arguments:
  metadata              The metalad-extracted metadata that is to be translated. Multiple input types are possible: - a path to a file
                        containing JSON lines - JSON lines from STDIN - a JSON serialized string.

So the file has to be JSON Lines, i.e. every single line must be readable as a JSON object. In your example you first run the output through jq, which reformats the JSON object produced by the metalad command as a multi-line, indented object. The first line is then just {, which is why validation fails. The command then returns null for each line, because catalog-translate (and similarly catalog-add and catalog-validate) processes a file line by line, treating each line as a separate object that either succeeds or fails. If you want a more informative error (or success) message for each line, you can set the result format of the command as follows:

datalad -f json catalog-translate metadata.json

which should return something like this instead of null:

{"action": "catalog_translate", "error_message": "Expecting property name enclosed in double quotes: line 1 column 2 (char 1)", "exception": "ConstraintError(Expecting property name enclosed in double quotes: line 1 column 2 (char 1))", "exception_traceback": "[compound.py:_item_yielder:304,formats.py:__call__:26,base.py:raise_for:55,formats.py:__call__:24,__init__.py:loads:346,decoder.py:decode:337,decoder.py:raw_decode:353]", "path": "/Users/jsheunis/Documents/psyinf/Data/ds005454", "status": "error"}
...
...
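The line-by-line failure described above can be reproduced without datalad. The following is only a minimal sketch using Python's standard json module. The helper name check_json_lines and the shortened record are made up for illustration, and this is not how datalad-catalog itself is implemented:

```python
import json


def check_json_lines(text):
    """Return (line_number, message) for each line that is not a JSON object.

    Hypothetical helper for illustration; not part of datalad-catalog.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, str(exc)))
            continue
        if not isinstance(value, dict):
            # e.g. a bare string from inside an indented array
            problems.append((number, "valid JSON, but not an object"))
    return problems


# Shortened record, reusing two fields from the output in this thread
record = {"type": "dataset", "dataset_id": "7d556bac-defc-4614-b9c1-4ab4b9681496"}

compact = json.dumps(record)           # one object on a single line
pretty = json.dumps(record, indent=2)  # multi-line, like jq's default output

print(check_json_lines(compact))
for number, message in check_json_lines(pretty):
    print(number, message)
```

The compact form yields no problems, while every line of the indented form is reported separately, starting with the lone { on line 1. That matches the pattern of one error per line seen in the catalog-add output.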

So the metadata in the file needs to be JSON Lines (alternatively it can come from STDIN, as a JSON-serialized string, or as a Python dict if you work with the Python API). If you remove the jq part from your first call, it should work:

>> datalad meta-extract -d . metalad_core > metadata.json

[INFO] Start core metadata extraction from Dataset(/Users/jsheunis/Documents/psyinf/Data/ds005454)
[INFO] Extracted core metadata from /Users/jsheunis/Documents/psyinf/Data/ds005454
[INFO] Finished core metadata extraction from Dataset(/Users/jsheunis/Documents/psyinf/Data/ds005454)

>> cat metadata.json

{"type": "dataset", "dataset_id": "7d556bac-defc-4614-b9c1-4ab4b9681496", "dataset_version": "cdd3369413f11e817fe5c72bfe73b0b9e035d839", "extractor_name": "metalad_core", "extractor_version": "1", "extraction_parameter": {}, "extraction_time": 1728994675.429282, "agent_name": "Stephan Heunis", "agent_email": "[email protected]", "extracted_metadata": {"@context": {"@vocab": "http://schema.org/", "datalad": "http://dx.datalad.org/"}, "@graph": [{"@id": "0678df9504071875731533290575892b", "@type": "agent", "name": "Git Worker", "email": "[email protected]"}, {"@id": "10b42c100a06d747da3cf528e083d9c5", "@type": "agent", "name": "Oscar Esteban", "email": "[email protected]"}, {"@id": "cdd3369413f11e817fe5c72bfe73b0b9e035d839", "identifier": "7d556bac-defc-4614-b9c1-4ab4b9681496", "@type": "Dataset", "version": "0-5-gcdd3369", "dateCreated": "2024-09-05T12:01:40+00:00", "dateModified": "2024-09-05T13:03:47+00:00", "hasContributor": [{"@id": "0678df9504071875731533290575892b"}, {"@id": "10b42c100a06d747da3cf528e083d9c5"}], "distribution": [{"@id": "35a94bd7-91c1-4851-9fef-7adaf326bca1"}, {"name": "s3-PUBLIC", "@id": "datalad:a6e74f1e-73cd-47c2-bb89-bdbce8a6649e"}, {"name": "origin", "url": "https://github.com/OpenNeuroDatasets/ds005454.git"}]}]}}

>> datalad catalog-translate metadata.json

{"type":"dataset","dataset_id":"7d556bac-defc-4614-b9c1-4ab4b9681496","dataset_version":"cdd3369413f11e817fe5c72bfe73b0b9e035d839","metadata_sources":{"key_source_map":{},"sources":[{"source_name":"metalad_core","source_version":"1","source_parameter":{},"source_time":1728994309.525142,"agent_email":"[email protected]","agent_name":"Stephan Heunis"}]},"name":"","url":["https://github.com/OpenNeuroDatasets/ds005454.git"],"authors":[{"name":"Git Worker","email":"[email protected]"},{"name":"Oscar Esteban","email":"[email protected]"}]}
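If an indented file already exists and re-running the extraction is inconvenient, it can be converted back to one object per line. The sketch below is only an illustration with the standard json module. The function name pretty_to_json_lines and the two small sample records are assumptions, and it assumes the input is one or more complete JSON values written one after another, as jq would emit them:

```python
import json


def pretty_to_json_lines(text):
    """Re-serialize concatenated JSON values as one compact value per line.

    Hypothetical helper for illustration; not part of datalad or jq.
    """
    decoder = json.JSONDecoder()
    lines = []
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so skip it here
        while position < len(text) and text[position].isspace():
            position += 1
        if position == len(text):
            break
        value, position = decoder.raw_decode(text, position)
        lines.append(json.dumps(value))
    return "".join(line + "\n" for line in lines)


# Two indented records, standing in for a jq-formatted metadata file
pretty = (
    json.dumps({"type": "dataset", "name": "a"}, indent=2)
    + "\n"
    + json.dumps({"type": "file", "path": "README"}, indent=2)
    + "\n"
)

converted = pretty_to_json_lines(pretty)
print(converted, end="")
```

Each output line can then be parsed on its own, which is what the line-by-line processing expects. On the command line, jq's compact output option (-c) should have the same effect if jq has to stay in the pipeline.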

@tientong98
Author

Thank you, @jsheunis! Yes, turning the metadata into JSON Lines solved my issue. Thank you very much for the detailed explanation, I really appreciate it!
