Croissant tag missing in some Croissant supported datasets #3135

Open
fylux opened this issue Feb 14, 2025 · 1 comment


fylux commented Feb 14, 2025

For example, the following dataset:

https://huggingface.co/datasets/allenai/c4

lacks a Croissant tag, not just in the UI but also when filtering by "library:mlcroissant" with the API. However, the Croissant file is available in the API:

https://huggingface.co/api/datasets/allenai/c4/croissant

When looking at the 15k most downloaded HF datasets, around 4k were lacking this tag. Sometimes this might be justified by a faulty DatasetInfo, but that's not always the case, as we have seen with allenai/c4.
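
A rough way to reproduce this check for a single dataset is sketched below. It assumes that the Hub's `/api/datasets/<repo_id>` response exposes a `tags` list; the helper names are made up for illustration, and only the Croissant URL and the "library:mlcroissant" tag come from this report.

```python
import json
import urllib.error
import urllib.request

HUB_API = "https://huggingface.co/api/datasets"
CROISSANT_TAG = "library:mlcroissant"


def croissant_url(repo_id: str) -> str:
    """Build the Croissant endpoint URL for a dataset repo."""
    return f"{HUB_API}/{repo_id}/croissant"


def has_croissant_tag(dataset_info: dict) -> bool:
    """Check the tag list of a dataset-info payload (assumed to have a 'tags' key)."""
    return CROISSANT_TAG in (dataset_info.get("tags") or [])


def croissant_available(repo_id: str) -> bool:
    """Return True if the Croissant endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(croissant_url(repo_id), timeout=30) as resp:
            return resp.status == 200
    except urllib.error.URLError:  # also covers HTTPError (4xx/5xx)
        return False


def is_missing_tag(repo_id: str) -> bool:
    """True when a Croissant file is served but the tag is absent."""
    with urllib.request.urlopen(f"{HUB_API}/{repo_id}", timeout=30) as resp:
        info = json.load(resp)
    return croissant_available(repo_id) and not has_croissant_tag(info)


# Example (needs network access):
# is_missing_tag("allenai/c4")
```

Running `is_missing_tag` over a list of repo ids would give a count comparable to the one above, although gated or private datasets may need an auth header that this sketch does not send.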

fyi @lhoestq

Member

lhoestq commented Mar 3, 2025

allenai/c4 has this error when listing the compatible libraries:

// 20250303181911
// https://datasets-server.huggingface.co/compatible-libraries?dataset=allenai/c4

{
  "error": "Failed to simplify json data files pattern: {'train': ['multilingual/c4-af.*.json.gz', 'multilingual/c4-am.*.json.gz', 'multilingual/c4-ar.*.json.gz', 'multilingual/c4-az.*.json.gz', 'multilingual/c4-be.*.json.gz', 'multilingual/c4-bg.*.json.gz', 'multilingual/c4-bg-Latn.*.json.gz', 'multilingual/c4-bn.*.json.gz', 'multilingual/c4-ca.*.json.gz', 'multilingual/c4-ceb.*.json.gz', 'multilingual/c4-co.*.json.gz', 'multilingual/c4-cs.*.json.gz', 'multilingual/c4-cy.*.json.gz', 'multilingual/c4-da.*.json.gz', 'multilingual/c4-de.*.json.gz', 'multilingual/c4-el.*.json.gz', 'multilingual/c4-el-Latn.*.json.gz', 'multilingual/c4-en.*.json.gz', 'multilingual/c4-eo.*.json.gz', 'multilingual/c4-es.*.json.gz', 'multilingual/c4-et.*.json.gz', 'multilingual/c4-eu.*.json.gz', 'multilingual/c4-fa.*.json.gz', 'multilingual/c4-fi.*.json.gz', 'multilingual/c4-fil.*.json.gz', 'multilingual/c4-fr.*.json.gz', 'multilingual/c4-fy.*.json.gz', 'multilingual/c4-ga.*.json.gz', 'multilingual/c4-gd.*.json.gz', 'multilingual/c4-gl.*.json.gz', 'multilingual/c4-gu.*.json.gz', 'multilingual/c4-ha.*.json.gz', 'multilingual/c4-haw.*.json.gz', 'multilingual/c4-hi.*.json.gz', 'multilingual/c4-hi-Latn.*.json.gz', 'multilingual/c4-hmn.*.json.gz', 'multilingual/c4-ht.*.json.gz', 'multilingual/c4-hu.*.json.gz', 'multilingual/c4-hy.*.json.gz', 'multilingual/c4-id.*.json.gz', 'multilingual/c4-ig.*.json.gz', 'multilingual/c4-is.*.json.gz', 'multilingual/c4-it.*.json.gz', 'multilingual/c4-iw.*.json.gz', 'multilingual/c4-ja.*.json.gz', 'multilingual/c4-ja-Latn.*.json.gz', 'multilingual/c4-jv.*.json.gz', 'multilingual/c4-ka.*.json.gz', 'multilingual/c4-kk.*.json.gz', 'multilingual/c4-km.*.json.gz', 'multilingual/c4-kn.*.json.gz', 'multilingual/c4-ko.*.json.gz', 'multilingual/c4-ku.*.json.gz', 'multilingual/c4-ky.*.json.gz', 'multilingual/c4-la.*.json.gz', 'multilingual/c4-lb.*.json.gz', 'multilingual/c4-lo.*.json.gz', 'multilingual/c4-lt.*.json.gz', 'multilingual/c4-lv.*.json.gz', 
'multilingual/c4-mg.*.json.gz', 'multilingual/c4-mi.*.json.gz', 'multilingual/c4-mk.*.json.gz', 'multilingual/c4-ml.*.json.gz', 'multilingual/c4-mn.*.json.gz', 'multilingual/c4-mr.*.json.gz', 'multilingual/c4-ms.*.json.gz', 'multilingual/c4-mt.*.json.gz', 'multilingual/c4-my.*.json.gz', 'multilingual/c4-ne.*.json.gz', 'multilingual/c4-nl.*.json.gz', 'multilingual/c4-no.*.json.gz', 'multilingual/c4-ny.*.json.gz', 'multilingual/c4-pa.*.json.gz', 'multilingual/c4-pl.*.json.gz', 'multilingual/c4-ps.*.json.gz', 'multilingual/c4-pt.*.json.gz', 'multilingual/c4-ro.*.json.gz', 'multilingual/c4-ru.*.json.gz', 'multilingual/c4-ru-Latn.*.json.gz', 'multilingual/c4-sd.*.json.gz', 'multilingual/c4-si.*.json.gz', 'multilingual/c4-sk.*.json.gz', 'multilingual/c4-sl.*.json.gz', 'multilingual/c4-sm.*.json.gz', 'multilingual/c4-sn.*.json.gz', 'multilingual/c4-so.*.json.gz', 'multilingual/c4-sq.*.json.gz', 'multilingual/c4-sr.*.json.gz', 'multilingual/c4-st.*.json.gz', 'multilingual/c4-su.*.json.gz', 'multilingual/c4-sv.*.json.gz', 'multilingual/c4-sw.*.json.gz', 'multilingual/c4-ta.*.json.gz', 'multilingual/c4-te.*.json.gz', 'multilingual/c4-tg.*.json.gz', 'multilingual/c4-th.*.json.gz', 'multilingual/c4-tr.*.json.gz', 'multilingual/c4-uk.*.json.gz', 'multilingual/c4-und.*.json.gz', 'multilingual/c4-ur.*.json.gz', 'multilingual/c4-uz.*.json.gz', 'multilingual/c4-vi.*.json.gz', 'multilingual/c4-xh.*.json.gz', 'multilingual/c4-yi.*.json.gz', 'multilingual/c4-yo.*.json.gz', 'multilingual/c4-zh.*.json.gz', 'multilingual/c4-zh-Latn.*.json.gz', 'multilingual/c4-zu.*.json.gz'], 'validation': ['multilingual/c4-af-validation.tfrecord-00000-of-00001.json.gz']}"
}

This error happens in get_compatible_libraries_for_json, which is used to list the libraries compatible with a dataset and to create the code snippets people can copy/paste to load the data. Apparently this function doesn't support the case where multiple glob patterns are provided to list the files of a dataset.
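
To make the failure mode concrete, here is a hypothetical sketch. It is not the actual dataset-viewer code: `simplify_data_files` is an invented helper that collapses each split to a single glob when possible and otherwise keeps the full list of patterns instead of raising. Whether the downstream snippet generators can consume a list of patterns is an assumption.

```python
from typing import Union

Patterns = Union[str, list[str]]


def simplify_data_files(data_files: dict[str, list[str]]) -> dict[str, Patterns]:
    """Hypothetical helper: one glob per split when possible, else the full list."""
    simplified: dict[str, Patterns] = {}
    for split, patterns in data_files.items():
        # Drop duplicate patterns while keeping their original order.
        unique = list(dict.fromkeys(patterns))
        # A single pattern can be used as-is; several patterns are kept
        # as a list rather than treated as an error.
        simplified[split] = unique[0] if len(unique) == 1 else unique
    return simplified


# Shortened version of the allenai/c4 patterns from the error message.
data_files = {
    "train": ["multilingual/c4-af.*.json.gz", "multilingual/c4-am.*.json.gz"],
    "validation": ["multilingual/c4-af-validation.tfrecord-00000-of-00001.json.gz"],
}
result = simplify_data_files(data_files)
```

Here `result["validation"]` is a single string while `result["train"]` stays a list of two patterns. Merging the train patterns into one broader glob such as `multilingual/c4-*.json.gz` is deliberately avoided, since it would also match the validation file.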

If we manage to fix it, Croissant will show up, along with the other compatible libraries such as Dask.
