Unable to download IWSLT datasets #1676
Comments
As a temporary fix, I'm just downloading the datasets manually via the links in the documentation:
Then you can put the downloaded Then |
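A minimal sketch of the manual workaround described above, assuming the archive has already been downloaded by hand. The helper name `place_manual_download` and both path arguments are hypothetical; the directory the dataset loader actually expects is not stated here and has to be taken from the torchtext documentation.

```python
import shutil
from pathlib import Path


def place_manual_download(archive: Path, dataset_root: Path) -> Path:
    """Copy a manually downloaded archive into a dataset root directory.

    Both paths are supplied by the caller; this sketch does not know
    where torchtext looks for the file.
    """
    archive = Path(archive)
    if not archive.is_file():
        raise FileNotFoundError(f"Downloaded archive not found: {archive}")

    dataset_root = Path(dataset_root)
    dataset_root.mkdir(parents=True, exist_ok=True)

    target = dataset_root / archive.name
    # Do not overwrite a file that is already in place.
    if not target.exists():
        shutil.copy2(archive, target)
    return target
```

Copying rather than moving keeps the original download around, which helps if the cache directory has to be wiped later while debugging.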
Original comment by @austinvhuang: I've run into this as well. Given the download problem in 0.11, maybe download checks could be added to CI or integration tests?
Response by @parmeet below (sorry @austinvhuang, I meant to reply earlier but somehow ended up editing your original comment): Duplicate of #1620. Yes, we have this ongoing issue with torchtext <=0.11. Could you please upgrade to 0.12 or try the temporary fix suggested here: #1676 (comment).
We have had full testing earlier, but moved to mocked testing as explained in this issue #1493 |
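One way the download check suggested above could look, as a rough sketch rather than anything taken from the torchtext test suite. The function names `http_status` and `find_broken_urls` are hypothetical. The status fetcher is injectable, so the reporting logic can still be exercised with a mock while a scheduled job uses the real network call.

```python
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request (performs real network I/O)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def find_broken_urls(
    urls: Iterable[str],
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, str]:
    """Map each URL that fails the check to a short reason string."""
    broken: Dict[str, str] = {}
    for url in urls:
        try:
            status = fetch_status(url)
        except Exception as exc:  # network errors, timeouts, HTTP errors
            broken[url] = f"{type(exc).__name__}: {exc}"
            continue
        if status != 200:
            broken[url] = f"unexpected status {status}"
    return broken
```

A periodic job could call `find_broken_urls` on the dataset URLs and fail when the result is non-empty, which keeps the regular unit tests mocked and fast.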
Apparently torchtext 0.12 also has this download issue; only the error message differs. Looking into it, the error is apparently the same as the one found in 0.11, which is |
@parmeet Even though the root cause of this Error is unknown to me, do you think we could align the Error between two versions of TorchText? These OnlineReader could take extra keyword arguments and pass them to the |
I think one way to achieve this would be to go through the same error tracing as provided in the torchtext download hook for Google Drive. I am not exactly sure why this error message was removed from the implementation in GDriveReader here when |
I can't find out why via git blame, as the actual commit was buried in combined commits. But I think it's reasonable to add it back to the function in TorchData. |
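To illustrate the kind of check being discussed, here is a small sketch that validates response headers before a download is treated as successful. The helper name `require_content_disposition` and the error wording are illustrative assumptions, not code copied from torchtext or TorchData.

```python
from typing import Mapping


def require_content_disposition(headers: Mapping[str, str], url: str) -> str:
    """Return the content-disposition header, or raise a descriptive error."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    if "content-disposition" not in normalised:
        raise RuntimeError(
            f"Response for {url} has no content-disposition header. "
            "The link may be a sharing/viewing link rather than a direct "
            "download link, or the file may be unavailable."
        )
    return normalised["content-disposition"]
```

Raising the same explicit message in every reader would give users one consistent error to search for, regardless of the library version.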
I was under the impression that even when the |
@NivekT Thank you for pointing it out!!
@parmeet Does it mean the file does not exist on GDrive if |
Hi, I just wanted to point out that there seems to be another issue at play here. With pytorch/data#442, we can get past the content-disposition error described in the previous comments. But there's still a problem with the dataset loading, as it eventually times out with the following message:
I did some debugging and it seems a bit related to the nested caching in When I dug a bit deeper I found that the |
@lolzballs the caching issue you just mentioned seems to be related to #1735. cc @parmeet @VitalyFedyunin I wonder if this is caused by the cache inconsistency issue you mention here #1735 (comment) when using filters in our dataset logic. |
@Nayef211 thanks, it does sound like exactly what I'm observing with IWSLT. But I tried what is suggested in #1735 with (note the order of end_caching here and in the original code):

    def _filter_clean_cache(cache_decompressed_dp, full_filepath, uncleaned_filename):
        cache_inner_decompressed_dp = cache_decompressed_dp.on_disk_cache(
            filepath_fn=partial(_return_full_filepath, full_filepath)
        )
        cache_inner_decompressed_dp = cache_inner_decompressed_dp.open_files(mode="b").load_from_tar()
        cache_inner_decompressed_dp = cache_inner_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
        cache_inner_decompressed_dp = cache_inner_decompressed_dp.filter(partial(_filter_filename_fn, uncleaned_filename))
        cache_inner_decompressed_dp = cache_inner_decompressed_dp.map(partial(_clean_files_wrapper, full_filepath))
        return cache_inner_decompressed_dp

I still get the same behaviour: the inner |
Thanks @Nayef211, @lolzballs. I am also starting to see this issue, but only sporadically. Unfortunately, the error is not reproducible. |
Also I am not very clear what the Also I wonder if this and the issue #1747 are somehow linked? cc: @VitalyFedyunin |
Interesting, it seems this issue may be a regression? I was able to consistently reproduce the error using the latest One other thing I should mention is that I found that when the timeout happens it leaves behind a |
Could be the situation when locks from previous runs (with mispositioned |
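If leftover files from an interrupted run are the suspect, a sketch like the following could help list and optionally delete them before the next attempt. The helper name `remove_leftover_files` is hypothetical, and the glob pattern is deliberately a required argument because the exact name of the leftover file is not given here.

```python
from pathlib import Path
from typing import List


def remove_leftover_files(cache_dir: Path, pattern: str, dry_run: bool = True) -> List[Path]:
    """Find files under cache_dir matching pattern; delete them unless dry_run."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []

    # rglob searches the cache directory recursively.
    leftovers = sorted(p for p in cache_dir.rglob(pattern) if p.is_file())
    if not dry_run:
        for path in leftovers:
            path.unlink()
    return leftovers
```

Running it first with the default `dry_run=True` shows what would be deleted, which is safer than wiping the whole cache directory when only stale marker files are the problem.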
I agree that the message is cryptic for errors that are not timeouts. I will change it to point to some sort of diagnosis URL, to help users figure out whether the pipeline is bad and there are real errors. |
Pretty much taken from the docs, this is what I did. I tested again today with torchtext cb8475e and torchdata cd3892790, still get the same timeout. Based on the same commit I changed the order of the filter as I posted above, and no luck there. I've always removed the cache (in my case data/IWSLT2017) before rerunning it to test. If it helps, this is all done on an Arch Linux system. I'm not sure if it might be platform dependent. |
Status today: with torchtext version 0.15.2+cpu, it is still not possible to download the IWSLT datasets.
PyTorch version: 2.0.1 |
🐛 Bug
Describe the bug
Unable to download IWSLT2016 or IWSLT2017 datasets.
To Reproduce
Steps to reproduce the behavior:
The same error occurs when trying to use IWSLT2017.
Expected behavior
The program returns the next src, tgt pair in the training data.
Screenshots
Full error logs are in this gist.
Environment
Included in gist above.
Additional context
No additional context.