FSTimeoutError for MLSUMClusteringP2P & MLSUMClusteringS2S #1311
Comments
Hi @bourdoiscatie! This looks like a TimeoutError that occurs when the system takes too long to download a dataset. MLSUM takes a few GB. Without knowing what hardware you ran this on, I can only suggest checking your internet connectivity and rerunning the task. Here's my successful run on a Linux machine: printout
Maybe @imenelydiaker or @KennethEnevoldsen have seen this error before? |
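For anyone who wants to automate the "rerun the task" suggestion, here is a minimal sketch of a retry wrapper with exponential backoff. The helper name `retry_on_timeout` and its parameters are hypothetical, not part of mteb or datasets. By default it only catches the built-in `TimeoutError`; on older Python versions you would probably need to pass `fsspec.exceptions.FSTimeoutError` explicitly via `retry_on`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on_timeout(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn() and retry it when it raises one of the retry_on exceptions.

    Hypothetical helper: the wait doubles after every failed attempt
    (base_delay, 2 * base_delay, ...). The last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Possible usage (assumption: `run_task` is your own function that
# triggers the dataset download, e.g. a single-task evaluation run):
#
#   from fsspec.exceptions import FSTimeoutError
#   retry_on_timeout(run_task, retry_on=(TimeoutError, FSTimeoutError))
```

Retrying only helps with a flaky connection; it will not help if the host is unreachable for good.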
I've never seen this error before. It looks like an internet issue, but it could be anything related to the network you're using. As @isaac-chung mentioned, MLSUM is quite a big dataset that takes some time to load. |
Thank you for your feedback @isaac-chung @imenelydiaker |
@bourdoiscatie for reference I'm using an A10 on a remote server, and the dataset was downloaded into the default HF cache location. |
Thanks for the information, I should be able to manage with all that 🤗 |
For those who have the same problem, it seems to be due to the datasets library since version 3.x. |
Hi! I'm Quentin from HF :) Unfortunately, we had to limit our support for script-based datasets for obvious security reasons, and apparently that made some issues related to relying on bad hosts resurface :/ Have you considered uploading the data to HF instead (ideally in Parquet, to avoid using a dataset script)? |
Looking at the code, I realize that the massive train split is not even used in practice:
Wouldn't it be more appropriate to load only the validation and test splits to speed things up? And as Quentin points out, possibly host these two splits on the Hub. |
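To make the suggestion concrete, here is a minimal sketch of requesting only the evaluation splits. The helper `load_eval_splits` is hypothetical and takes the loader as an argument, so it works with `datasets.load_dataset` or a stand-in. The dataset path and config name in the usage comment are assumptions, not taken from the mteb code. I have not verified whether requesting a split this way actually skips downloading the train files; that likely depends on how the dataset is hosted and on the datasets version.

```python
from typing import Any, Callable, Dict, Iterable


def load_eval_splits(
    loader: Callable[..., Any],
    path: str,
    config: str,
    splits: Iterable[str] = ("validation", "test"),
) -> Dict[str, Any]:
    """Load only the named splits and return them keyed by split name.

    Hypothetical helper: `loader` is expected to accept
    (path, config, split=...), which matches how
    datasets.load_dataset is commonly called.
    """
    return {split: loader(path, config, split=split) for split in splits}


# Possible usage (assumption: the dataset id and config are placeholders,
# replace them with the ones the task actually uses):
#
#   from datasets import load_dataset
#   eval_data = load_eval_splits(load_dataset, "<dataset-id>", "fr")
#   eval_data["test"]  # the train split is never requested
```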
Thanks @lhoestq and @bourdoiscatie for pointing this out. The best solution (imo) is to re-upload the dataset to HF using Parquet. We're working on it and will let you know when it's fixed. Thank you. 🙏 |
Hi!
I've just trained an embedding model in French and would like to test it on the MTEB_FR.
I used the following code:
and everything ran fine until MLSUMClusteringP2P, where I got the following error:
I then ran the code on each task individually, and everything ran except MLSUMClusteringP2P and MLSUMClusteringS2S, which both raised the same error.
This suggests to me that there may be a problem with these two datasets, but I can't say what it is. I haven't found any other issues reporting this problem.
Note that I'm using version 1.16.1 of the library.
If you can enlighten me on this point, I'd be very grateful 🙏