[Bug]: Can't read data (curl cannot perform request) #197
Comments
@bendichter This may be a general streaming issue present for particular NWBFiles, so maybe we should transfer this issue to PyNWB, what do you think?
Is this error repeatable for the same file? If so, can you put together a minimal example?
It's hard to tell at this point if this issue is being caused by a bug in DANDI, S3, h5py, ros3, or PyNWB.
Exactly, I'll try downloading a local file from there and see |
Yeah, IDK then. This was a reproducible error when running the archive-wide script on the Hub, but isolated and local attempts to reproduce behave fine. Not a problem with the file; PyNWB reads a local copy just fine. Maybe it's a parallelization issue, like multiple jobs trying to access assets from the same dandiset too quickly? Either way, deprioritizing this thing but keeping it in the back of our minds if we see it happen again.
Maybe bring this up with the DANDI team?
I will when/if I can find a nice concrete way of reproducing it |
Not sure if this is a similar issue, but I inspected dandiset 000003 using inspect_all (code shown below)
After running for more than 2 hours (with 11/101 files inspected, is that normal?), it crashed with an OSError: Unable to open file (curl cannot perform request). Attached is the report from the unfinished inspection, if it helps. I'd appreciate any thoughts on this.
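For reference, a minimal sketch of the kind of call described above. The original snippet did not survive this thread, so the keyword arguments shown here (`path`, `stream`) are assumptions about the NWB Inspector API at the time, not the reporter's actual code:

```python
# Hypothetical reconstruction of inspecting dandiset 000003 by streaming.
# The `stream` keyword and passing the dandiset ID as `path` are assumptions.
from nwbinspector import inspect_all

messages = list(
    inspect_all(
        path="000003",  # dandiset ID (assumed usage)
        stream=True,    # read assets remotely instead of downloading them first
    )
)
print(f"Collected {len(messages)} inspector messages")
```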
Definitely not normal for that DANDI set (I recall running it through streaming within 10 minutes or so, granted that was on the DANDIHub). Good to know I'm not the only one that has run into this before; the fact you made it 11 files in means the streaming and S3 resolution was working for some files, certainly... but as to why it suddenly, occasionally stops for other files in the same location, I still don't have an answer beyond speculation that it has something to do with the internet connection between the streaming device and the S3 bucket.
That's what I thought too. I'm not familiar with streaming via S3 so it's a shame if that's the case.
So from your experience, inspecting dandisets via streaming is relatively quick for all dandisets?
Not all, no, but I'd say most. Depends on how many files are in each set; the slowest tend to be those sets out there with hundreds of files in them. Also depends on how many objects are in each file, and how many checks end up getting triggered in each file (most checks in the NWB Inspector have early-exit logic so they take almost no time to scan if they aren't initially triggered).
Would be worth trying again tomorrow or something. In general I've never had problems with streaming a single NWB File via ros3.
I've tried 3 times now, at different times and on different days. The first time it crashed after some time but I didn't document it, the second time I had to terminate the process after 1 hour, and this is the third instance.
By "they" do you mean the NWB Inspector? I suppose it works by accessing files sequentially and not in parallel?
I inspected another dandiset (000059), which has 54 files, and the process took more than 3 hours. The last file in the set encountered the same error. I'm not sure I understand how this could be a connection/streaming issue, assuming that the streamed files are accessed sequentially, and given that my device can still perform the inspection on several other dandisets (albeit very slowly, and I haven't tried any that have as many files as 000003).
Ah, worth mentioning that when I ran all these I did so on the DANDIHub (which has a direct connection to the archive), as well as on an extra-large spawn (I think I used 20+ CPUs) combined with the nice parallelization of the NWB Inspector.
Indeed! And without issue. Here is our original report.
What a strange problem. Maybe we should consult the DANDI team; they might know the underlying issue (faulty S3 resolution, perhaps?). Any ideas @yarikoptic @jwodder? The TL;DR is that we're trying to open a large number of NWBFiles iteratively via ros3 streaming, and the reads intermittently fail with curl errors.
Adding to the list (my version) of unsuccessful NWB Inspector runs: dandiset 000117. It managed to run 157/197 files and crashed with the same error.
Just to be clear, the URLs that you're trying to read are all S3 URLs, correct? I believe S3 recommends retrying any 5xx or connection errors up to about 10 or 12 times (can't find the reference right now).
yes
I'm not sure I follow, would you please elaborate?
@anhknguyen96 If an HTTP request to an S3 URL fails due to either a connection error (e.g., read timeout) or because the server returned a response with an error code in the 500-600 range, Amazon recommends retrying the request with an exponentially-increasing delay between retries; here's the best reference for that I could find at the moment. The official Python AWS SDK, boto3, does retries automatically, but if you're building your own S3 client, you'll need to add in retry behavior yourself.
Thanks a bunch @jwodder, I'll add the suggested exponential-delay retry whenever that curl-related OSError is encountered.
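For anyone landing here later, a minimal sketch of what that exponential-delay retry could look like wrapped around a ros3-streamed read. This is only an illustration under assumptions (the function name, retry count, and delays are made up), not the change that actually went into the Inspector:

```python
# Illustrative retry wrapper around a ros3-streamed read; the helper name,
# retry count, and delays are assumptions, not the NWB Inspector's real code.
import time

from pynwb import NWBHDF5IO


def read_nwbfile_with_retries(s3_url: str, max_retries: int = 10):
    """Open a streamed NWBFile, retrying transient connection/curl errors."""
    for attempt in range(max_retries):
        try:
            io = NWBHDF5IO(path=s3_url, mode="r", load_namespaces=True, driver="ros3")
            return io, io.read()
        except OSError:  # e.g. "Unable to open file (curl cannot perform request)"
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponentially-increasing delay, as AWS recommends
```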
great, thanks all!
Now, finally, evidence of this on a well-established file in the testing suite: https://github.com/NeurodataWithoutBorders/nwbinspector/runs/7188205359?check_suite_focus=true I'll try to get this fix out in the next couple of days.
More evidence: https://github.com/NeurodataWithoutBorders/nwbinspector/runs/7282295969?check_suite_focus=true This time it occurred during CLI usage as well.
And again! https://github.com/NeurodataWithoutBorders/nwbinspector/runs/7282540229?check_suite_focus=true Never seen this many in sequence; wonder if it has something to do with that CI being Python 3.10, or with some dependency version in that environment.
I don't think it has to do with the files or the code. I think it's a connection issue on the S3 side.
Was the retrying that was recommended/agreed upon actually implemented? Are these fresh failures happening with retrying in place, or is it still the "one in a bunch" kind of failure?
PR #223 tries to fix it; my recent posts were merely evidence, outside of that PR, that it can occur in other places as well.
Yes, also just me mundanely remarking that I'm seeing more of these issues (even in the CI) than I did previously, for whatever reason.
Yes, read failures (in particular with network files) can occur any time a direct read from the file happens.
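To illustrate that point, a small sketch of how a curl failure can surface well after the initial read when streaming with ros3: h5py only fetches dataset bytes when they are actually sliced. The URL is a placeholder and the example assumes the file has at least one acquisition TimeSeries:

```python
from pynwb import NWBHDF5IO

# Placeholder URL; any ros3-readable S3 asset behaves the same way.
s3_url = "https://dandiarchive.s3.amazonaws.com/blobs/..."

io = NWBHDF5IO(path=s3_url, mode="r", driver="ros3", load_namespaces=True)
nwbfile = io.read()  # often succeeds: mostly metadata is fetched here

# Datasets are read lazily, so a transient connection failure can surface
# only here, outside of io.read(), when the bytes are actually requested.
series = next(iter(nwbfile.acquisition.values()))
first_samples = series.data[:10]
```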
What happened?
As I parse the remaining errors from the first attempt to inspect the entire archive, a number of repetitions of this issue occur. One basic example is in the traceback section below, but I've also attached all report files where it occurs, which span multiple objects.
There also seem to be some related issues such as ConstructErrors and KeyError: 'Unable to open object by token (unable to determine object type)'. In each of these cases, I am streaming the data via ros3, which could very well be part of the problem. Interestingly, sometimes these curl errors occur outside of the nwbfile = io.read() step, which causes the dandiset to not be inspectable at all (which I'm considering a separate issue). Anyway, here is the full set of tracebacks available in the ERROR sections of each report...
000008.txt
000011.txt
000035.txt
000041.txt
000043.txt
000054.txt
000121.txt
000148.txt
000211.txt
000223.txt
Steps to Reproduce
No response
Traceback
Operating System
Linux
Python Executable
Python
Python Version
3.9
Package Versions
No response
Code of Conduct