Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LibYAML with PyYAML if available #6266

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

bryant1410
Copy link
Contributor

@bryant1410 bryant1410 commented Sep 27, 2023

PyYAML, the YAML framework used in this library, allows the use of LibYAML to accelerate the methods load and dump. To use it, a user would need to first install a PyYAML version that uses LibYAML (not available in PyPI; needs to be manually installed). Then, to actually use them, PyYAML suggests importing the LibYAML version of the Loader and Dumper and falling back to the default ones. This PR implements this change. See PyYAML docs for more info.

This change was motivated after trying to use any of the SugarCREPE datasets in the Hub provided by the org HuggingFaceM4. Such datasets save a lot of information (~1MB) in the YAML metadata from the README.md file and I noticed this slowed down the data loading process. BTW, I also noticed cache files for it is also slow because it tries to hash an instance of DatasetInfo, which in turn has all this metadata.

Also, I changed two list comprehensions into generator expressions to avoid allocating extra memory unnecessarily.

And BTW, there's an issue in PyYAML suggesting to make this automatic.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@bryant1410
Copy link
Contributor Author

On Ubuntu, if libyaml-dev is installed, you can install PyYAML 6.0.1 with LibYAML with the following command (as it's automatically detected):

pip install git+https://github.com/yaml/[email protected]

@bryant1410
Copy link
Contributor Author

Are the failing tests flaky?

@mariosasko
Copy link
Collaborator

We use huggingface_hub's RepoCard API instead of these modules to parse the YAML block (notice the deprecations), so the huggingface_hub repo is the right place to suggest these changes.

Personally, I'm not a fan of these changes, as a single non-standard usage of the ClassLabel type is not a sufficient reason to merge them. Also, the dataset in question stores data in a single Parquet file, with the features info embedded in its (schema) metadata, which means the YAML parsing can be skipped while preserving the features by directly loading the Parquet file:

from datasets import load_dataset
ds = load_dataset("parquet", data_files="https://huggingface.co/datasets/HuggingFaceM4/SugarCrepe_swap_obj/resolve/main/data/test-00000-of-00001-ca2ae6017a2336d7.parquet")

PS: Yes, these tests are flaky. We are working on fixing them.

@bryant1410
Copy link
Contributor Author

Oh, I didn't realize they were deprecated. Thanks for the tip on how to work around this issue!

For future reference, the places to change the code in huggingface_hub would be:

https://github.com/huggingface/huggingface_hub/blob/89cc69105074f1d071e0471144605f3cdfe1dab3/src/huggingface_hub/repocard.py#L506

https://github.com/huggingface/huggingface_hub/blob/89cc69105074f1d071e0471144605f3cdfe1dab3/src/huggingface_hub/utils/_fixes.py#L34

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants