Croissant for lmms-lab/LMMs-Eval-Lite is incorrect #3145

Open
ccl-core opened this issue Mar 4, 2025 · 2 comments

ccl-core (Contributor) commented Mar 4, 2025

The croissant for the lmms-lab/LMMs-Eval-Lite dataset is not correct.

More specifically, the problem is in the "gqa/semantic/dependencies" field:

{
  "@type": "cr:Field",
  "@id": "gqa/semantic/dependencies",
  "name": "gqa/semantic/dependencies",
  "description": "Column 'semantic' from the Hugging Face parquet file.",
  "dataType": "sc:Integer",
  "source": {
    "fileSet": {
      "@id": "parquet-files-for-config-gqa"
    },
    "extract": {
      "column": "semantic"
    }
    // <==== Here the transform is missing!
  }
}

I don't know why "transform": { "jsonPath": "dependencies" } is missing here, given that it has been correctly added to all other subfields.
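To check whether other subfields have the same problem, something like the following sketch could scan a Croissant JSON-LD document for subfields whose `source` has no `jsonPath` transform. It assumes that nested columns are listed under a `subField` key and that `transform` is either a single object or a list; the helper names and the sibling subfield `gqa/semantic/example_ok` are hypothetical and only there for illustration.

```python
def has_json_path(source):
    """Return True if the source declares at least one jsonPath transform."""
    transforms = source.get("transform", [])
    if isinstance(transforms, dict):  # a single transform object is also allowed
        transforms = [transforms]
    return any("jsonPath" in t for t in transforms)


def iter_subfields(field):
    """Yield every nested subfield of a field, depth first."""
    subs = field.get("subField", [])  # assumption: nesting uses "subField"
    if isinstance(subs, dict):
        subs = [subs]
    for sub in subs:
        yield sub
        yield from iter_subfields(sub)


def find_subfields_missing_transform(metadata):
    """Collect the @id of each subfield that has a source but no jsonPath."""
    missing = []
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            for sub in iter_subfields(field):
                source = sub.get("source")
                if source is not None and not has_json_path(source):
                    missing.append(sub.get("@id"))
    return missing


# Minimal sample: the broken subfield from this issue next to a
# hypothetical sibling that carries the expected transform.
sample = {
    "recordSet": [{
        "@id": "gqa",
        "field": [{
            "@id": "gqa/semantic",
            "subField": [
                {
                    "@id": "gqa/semantic/dependencies",
                    "source": {
                        "fileSet": {"@id": "parquet-files-for-config-gqa"},
                        "extract": {"column": "semantic"},
                    },
                },
                {
                    "@id": "gqa/semantic/example_ok",  # hypothetical name
                    "source": {
                        "fileSet": {"@id": "parquet-files-for-config-gqa"},
                        "extract": {"column": "semantic"},
                        "transform": {"jsonPath": "example_ok"},
                    },
                },
            ],
        }],
    }]
}

missing = find_subfields_missing_transform(sample)
print(missing)
```

To run it against the real file, load the dataset's Croissant JSON with `json.load` and pass the resulting dict in place of `sample`.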

For comparison, here is the Croissant file for the same dataset in the mlcommons/croissant repo, which has the correct gqa/semantic/dependencies subfield: https://github.com/mlcommons/croissant/blob/0b5cdfdcea72025fa4f7eaf8384e92eda291a118/datasets/1.0/huggingface-lmms-eval-lite/metadata.json

It can be loaded both with and without Beam without any issue.

ccl-core (Contributor, Author) commented Mar 4, 2025

From what I can tell, the code that generates the Croissant is correct, so I don't understand why it fails to add the transform for that field only.
Could we maybe see if re-triggering generation for that dataset solves the problem? @lhoestq thanks!

lhoestq (Member) commented Mar 4, 2025

I just refreshed the Croissant file for this dataset. Let me know if it's fixed now.
