Transport endpoint is not connected #151
Comments
Thanks for the issue report. We need to understand a bit more about the effect of the error message you get (which is coming from FUSE itself). Are you stacking FUSE file systems or is rar2fs the only thing mounted on top of your underlying fs? Do you use the . |
Yes, the rar2fs process is still running when it happens, but I can't enter the folder. I am not stacking FUSE file systems. Here are the options I am using: I think it was the same when using I will run |
Can you also try to reduce the number of mount options to a minimum?
It is a bit weird that FUSE would not be able to reach the process; it is multi-threaded by default and FUSE can spawn numerous threads for file system operations unless some other critical action is in progress and it needs to block. Also I am a bit puzzled whether this happens during some archive scanning (cache population) procedure or if it happens during strict extraction of an archive. Once the cache is in effect there should not really be anything that would become busy in the way you describe. |
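For context, a stripped-down rar2fs invocation of the kind being asked for might look like the sketch below; the paths are placeholders and the actual options used in this thread were lost from the extract.

```sh
# Minimal sketch (placeholder paths): start with no extra options and add
# standard FUSE options such as -o allow_other back one at a time.
rar2fs /data/archives /media/rar2fs
# rar2fs -o allow_other /data/archives /media/rar2fs
```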
Ok I have reduced the options to I usually run Here are some outputs.
Output from I attached strace to the process. I see nothing except this below, no matter what is being done on the file system. Don't know if that might change if the error occurs?
|
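For reference, attaching strace to the running process and following all of its threads would look roughly like this; the exact command used above was not preserved in this extract, and the output path is arbitrary.

```sh
# Follow all threads (-f), print timestamps (-tt) and log to a file.
strace -f -tt -p "$(pidof rar2fs)" -o /tmp/rar2fs.strace
```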
Try to avoid that. Using find forces access to your mount point, which means FUSE has to deal with every single file access. That is why Do you need the The number of open files should drop unless you are in the middle of a complete mount point scan. It is not expected to have that many open file descriptors on an idling file system. There are no obvious system limits that look suspicious. I need to check the |
Also, would it be possible to run rar2fs in the foreground using |
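Running in the foreground relies on the standard libfuse options; a sketch with placeholder paths (the exact flags requested were elided above):

```sh
# -f keeps rar2fs in the foreground so log output goes to the terminal;
# -d additionally enables very verbose FUSE debug output.
rar2fs -f /data/archives /media/rar2fs
# rar2fs -f -d /data/archives /media/rar2fs
```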
Ok for
It was easily spotted by trying |
Yes, I need the Found the thread blocking on |
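A common way to locate a blocked thread in a live process is to attach gdb and dump every thread's backtrace; this is a generic sketch, not necessarily the exact procedure used here.

```sh
# Attach to the running process and print a backtrace of every thread,
# which shows what each one is blocked on.
gdb -p "$(pidof rar2fs)" -batch -ex 'thread apply all bt'
```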
The process crashed now. I did think I saw rar2fs still running on previous occasions, but now I'm thinking I might have been mistaken and it has been crashing all along. What I was seeing could have been an unrelated dangling instance left after a lazy unmount. I didn't really pay too much attention to it until now, since the crashes keep happening. Last lines in the rar2fs log
Last lines of the strace I had running. Maybe I should run it on more threads next time?
It crashed pretty much exactly when I did |
At a first glance, yes you seem to have found some nasty bug here. Would be great if we could narrow it down a bit. |
I checked your previous posts; can you please change this
using something like |
Ok using unlimited now. |
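The usual way to enable core dumps, assuming that is what was asked for above (the exact command was elided), is:

```sh
# Allow unlimited-size core dumps in the shell that starts rar2fs,
# then check where the kernel writes them.
ulimit -c unlimited
cat /proc/sys/kernel/core_pattern
```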
We might need some extra debug symbols, so configure the package using |
Ok |
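A typical autoconf-style rebuild with debug symbols, assuming that is what the elided configure line amounted to, is:

```sh
# Build with debug info and without optimization so backtraces resolve to file/line.
./configure CFLAGS="-g -O0"
make clean && make
```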
Another way of getting a better trace of exactly where the crash happens is to run it through gdb. It would give you output similar to the core dump, which needs to be loaded into gdb post-mortem anyway.
|
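A generic way of doing that (placeholder paths; the elided command above may differ):

```sh
# Start rar2fs in the foreground under gdb; on a crash gdb stops at the
# faulting instruction and the full backtrace of every thread can be taken.
gdb --args rar2fs -f /data/archives /media/rar2fs
# (gdb) run
# (gdb) thread apply all bt    # after the crash
```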
Ok, it crashed again but I didn't realize The crash message was different this time. I am running it through gdb now. Let's see what we get from that. |
I have spotted an error I have not seen before; I need to look into this.
|
Can you try this patch?
|
Sure I'll give it a try |
Still stable so far. |
I take it it is still working without hiccups? |
Yes, I have been stress testing a lot and no problems so far. At this point I'm 99% sure the patch solved it. |
When new data is added to a directory and all affected cache entries are invalidated there is a potential access of already freed memory. This issue was spotted using valgrind and was for some reason overlooked after commit c8af7494106705b547b7c545aee2fdd10aeec3ac was introduced.
==21603== Invalid read of size 8
==21603== Address 0x626b158 is 24 bytes inside a block of size 32 free'd
Resolves issue: #151
Signed-off-by: Hans Beckerus <hans.beckerus at gmail.com>
I have pushed a patch to master now. It is similar but not the same as the one posted here since it was rather ugly. |
When new data is added to a directory and all affected cache entries are invalidated there is a potential access of already freed memory. This issue was spotted using valgrind and was for some reason overlooked after commit 4b1d308 was introduced.
==21603== Invalid read of size 8
==21603== Address 0x626b158 is 24 bytes inside a block of size 32 free'd
Resolves issue: #151
Signed-off-by: Hans Beckerus <hans.beckerus at gmail.com>
I have still been running into crashes periodically, but really randomly. It can be stable for weeks, then crash a few times in a row, and then be stable again for weeks. But I was working with a nodejs script making symlinks and I noticed it was crashing rar2fs every time. I stripped the script down to a minimal dummy version which exposes the same bug
Here's the log after running it
I am running
|
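The actual dummy script is not preserved in this extract; based on the description, a hypothetical reconstruction could look like the following, where SRC is the backing directory and MNT the rar2fs mount point (both placeholders):

```sh
#!/bin/bash
# Hypothetical reconstruction of the symlink reproducer described above;
# the original script was not preserved. Paths are placeholders.
SRC=/data/crash            # backing directory rar2fs is mounted from
MNT=/media/rar2fs/crash    # the same directory seen through the rar2fs mount
mkdir -p "$SRC"
for i in $(seq 1 1000); do
    ln -sf /tmp/target "$SRC/link_$i"   # add new entries to the source directory
    ls "$MNT" > /dev/null               # force rar2fs to rescan and invalidate its dir cache
done
```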
Seems there is still some problem in this area. Will see what I can make out of your reproducer. |
Can you confirm if your source folder was in fact empty (except for the crash directory) while executing your JavaScript?
Mounted using:
Can you reproduce using this script? I believe I got the essence of it, right? |
Also, if you could run in the foreground ( |
No, that bash script does not crash. But this one works for me |
Got a huge crash with that one
|
Here is output from gdb
Here is the previous gdb output |
I now use a bucket size of 1 (i.e. every single entry in the cache will collide) and still no sign of a crash, not even with your variant of the script. So the question remains: was your mount point empty before you tried this or not? |
No, the mount point has thousands of files and folders. But the The crash does not happen when using |
So then we can conclude this is some thread concurrency issue. There is a lock missing somewhere. |
If possible can you try this patch (master/HEAD) to see if we enter a function that looks a bit suspicious?
You need to run using |
Yes, that function is being entered like 100 times when I cd to the |
Seems this is related to the underlying fs I'm using on the /data mountpoint. I am using cephfs, which is a fully POSIX-compliant distributed fs; it normally works the same as local filesystems for everything I throw at it. Way better than NFS in my experience, but it seems something in rar2fs is conflicting with it, unfortunately. I tried several other file systems and cannot reproduce the issue at all. Except when I tried NTFS: it also crashes using the same bash script every time. But the backtrace seems a bit different. The
I am unable to reproduce the issue when using |
Please try the patch below to see if it has any effect.
|
Same behaviour after patch with both ntfs & cephfs |
I found a race condition using valgrind/helgrind that might be related. Please try attached patch as well.
|
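For reference, hunting such a race with Helgrind amounts to running rar2fs in the foreground under the tool and exercising the mount; a sketch with placeholder paths:

```sh
# Run rar2fs under Helgrind to detect unsynchronized accesses; expect it to be slow.
valgrind --tool=helgrind rar2fs -f /data/archives /media/rar2fs
```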
Yep, that patch solves it! Awesome job 😄 |
I will push the patch to master and then close this issue. |
Commit 33af7aa did not completely solve the problem(s) reported in issue #151. After running valgrind/helgrind a race condition was spotted with the following signature:
==29979== Possible data race during write of size 8 at 0x62E2F20 by thread #7
==29979== Locks held: none
==29979== at 0x411D6B: hashtable_entry_delete_hash (string3.h:52)
==29979== by 0x412180: hashtable_entry_delete_subkeys (hashtable.c:250)
==29979== by 0x4126EE: dircache_get (dircache.c:195)
==29979== by 0x41BC63: syncdir (rar2fs.c:3082)
==29979== by 0x41BF12: rar2_getattr (rar2fs.c:3203)
==29979== by 0x4E4534F: lookup_path (fuse.c:2472)
The problem is caused by a missing rwlock when detecting stale cache entries and when such entries were invalidated. This patch also adds a few missing locks spotted by pure inspection of the code and which were not part of the use-case covered by the valgrind/helgrind test run.
Resolves-issue: #151
Signed-off-by: Hans Beckerus <hans.beckerus at gmail.com>
I randomly but constantly get "cannot access '/media': Transport endpoint is not connected".
This usually seems to happen under heavy load, like scanning many, many small files and such. With less load rar2fs remains stable.
It feels like this is related to some timeout: if the underlying fs is unresponsive for X amount of time, then it happens. The underlying fs might hang for a while under heavy load, but it never stops working.
Is there any way to improve this?