Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a deadlock issue using search asynchronously #446

Merged
merged 1 commit into from
Jul 22, 2023

Conversation

t2y
Copy link
Contributor

@t2y t2y commented Jul 19, 2023

The previous change is merged from #440.

I'm implementing #422 using the search asynchronous feature. I found a deadlock issue with the low-sized buffered channel in my local environment. When the buffered channel is full, a goroutine waits to send to the channel. But sometimes, searchResponse.Next() occurs deadlock since the messageMutex cannot get a lock in Conn.GetLastError().

While working at #440, I found Conn.err is a race condition. So I added messageMutex.Lock() code, which causes a deadlock issue with certain conditions (e.g., a low-sized buffered channel). That makes a more complicated problem.

This code is young, and we can revert quickly.

@cpuschma
Copy link
Member

I can't follow your thoughts. Removing the mutex check from GetLastError is not ideal in my opinion. Can you explain/do you know what caused the messageMutex to hang? Because a slow reader and therefore a blocked buffered channel is something we cannot account for.

@t2y
Copy link
Contributor Author

t2y commented Jul 19, 2023

@cpuschma I changed GetLastError() method to lock for searchResponse.Next() in #440. See below.

So I just revert my previous changes since searchResponse.Next() doesn't handle it.

@t2y
Copy link
Contributor Author

t2y commented Jul 19, 2023

Another problem. This failure is rarely a race condition, even if the messageMutex locks. I guess it doesn't reproduce after running again.

https://github.com/go-ldap/ldap/actions/runs/5599882841/jobs/10241454178?pr=447

@cpuschma
Copy link
Member

We cannot guarantee that the the Connection.err is guarded atomically without a mutex. Again, that's not an option.

I still don't see what the exact problem is, maybe because of the language barrier. Lets break things down:

  1. You establish a connection to the directory server
  2. You submit a search request with a buffered channel of N. In the internals, the library locks the mutex, encodes and sends the message and unlocks the mutex
  3. The incoming response are read in a seperate goroutine, which also shortly utilizes the mutex
  4. The entries are send to the channel one by one for a consumer to process it. No locking of the messageMutex

What am I missing here? 😓

@t2y
Copy link
Contributor Author

t2y commented Jul 19, 2023

Connection.err is guarded atomically without a mutex. Again, that's not an option.

I see. I agree with using mutex to get Connection.err. But the previous code didn't have mutex before I worked #440. Do you think it still needs mutex?

After #440, two goroutine access to Connection.err, and rarely it occurs a data race condition (see above CI failure), and sometimes it occurs a deadlock issue. Because searchResponse.Next() access to Connection.err. As the same as you, I thought no problem when I was working on #440, but it's difficult to imagine since it's a timing issue.

You can probably reproduce easily. This is how to reproduce in my environment.

  1. set 1 to a buffered channel
  2. call SearchAsync() to get about a thousand entries
  3. run several times

Additionally, searchResponse.Next() doesn't need to access Connection.err. At that time, I just thought it was a shortcut to get an error. If this code does not exist, the error occurs in the search process next time.

@cpuschma
Copy link
Member

Agreed @t2y 👍

Admittedly, the missing mutex for GetLastError was an oversight of me. At the time it wasn't designed for parallelism like SearchAsync in mind. Can you adjust your branch to keep the mutex but remove the GetLastError calls from the Next method?

…is trying to lock, and another goroutine waits since full of channel buffers go-ldap#440
@t2y t2y force-pushed the fix-potential-deadlock branch from 741471d to 735afe6 Compare July 20, 2023 23:08
@t2y
Copy link
Contributor Author

t2y commented Jul 20, 2023

@cpuschma Thanks for your review. I follow you.

@cpuschma cpuschma merged commit d85ebd1 into go-ldap:master Jul 22, 2023
@t2y t2y deleted the fix-potential-deadlock branch July 23, 2023 06:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants