
Lost mutex for key ... #9

Closed
kirstyannepollock opened this issue Feb 18, 2020 · 8 comments

Comments

kirstyannepollock commented Feb 18, 2020

I am getting 'Lost mutex for key xxx' when using 2 separate Mutexes on 2 separate services as part of my moleculer-based app. Just wondering what sort of thing I should be looking for that might cause this? I can't think right now how to construct a minimal example that replicates this behaviour, sorry. Just a few clues would be cool, especially since the errors come up as UnhandledPromiseRejectionWarning directly from RedisMutex._refresh with no stack trace.

kirstyannepollock (Author) commented

Aha: while debugging in VS Code, any breakpoint being hit, or any pausing of the Node service code, causes it.

swarthy (Owner) commented Feb 19, 2020

Hi! "Lost mutex/semaphore ..." is thrown in service1 when service2 "steals" the lock (from service1's point of view). This can happen when service1 fails to refresh the lock. Normally the refresh runs every refreshInterval ms, but if service1 is under huge CPU load, paused by a breakpoint in debug mode, or blocked by anything else that stalls the Node.js event loop, then after lockTimeout ms Redis deletes the key and service2 can legally acquire the lock. After that, when service1 "unfreezes" and tries to refresh the lock, it gets this error, because the lock no longer belongs to service1. So I suggest you find out what part of your service1 blocks the event loop.
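
A minimal sketch of that failure mode, assuming Redis is running locally and ioredis is used as the client; the key name 'example' and the timings are illustrative, not from this thread:

const Redis = require('ioredis')
const { Mutex } = require('redis-semaphore')

const redisClient = new Redis()

async function demo () {
  const mutex = new Mutex(redisClient, 'example', {
    lockTimeout: 1000,   // Redis deletes the key 1s after the last refresh
    refreshInterval: 300 // normally the lock is refreshed well before that
  })
  await mutex.acquire()

  // Simulate a blocked event loop (heavy CPU load, a debugger pause, ...):
  // the refresh timer cannot fire, so Redis expires the key after lockTimeout
  // and another process is free to acquire the lock.
  const start = Date.now()
  while (Date.now() - start < 2000) { /* busy wait */ }

  // Once the event loop unblocks, the pending refresh finds the key gone (or
  // held by someone else) and rejects with "Lost mutex for key example",
  // surfacing as an UnhandledPromiseRejectionWarning.
}

demo()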

kirstyannepollock (Author) commented Feb 19, 2020

Thanks. That explains it, and I have put in a mitigation. I am writing some test code now, and I am trying to work out how to set it up so that it catches the 'Lost mutex for key xxx' error, but it seems to always fall outside any error handling. The test I have so far:

it('acquire twice same node base lib expiry test', async () => {
  try {
    // Assumed setup (not shown in this snippet): redisClient is an ioredis
    // instance and Mutex comes from redis-semaphore.
    const options = {
      lockTimeout: 1000,
      acquireTimeout: 1000,
      retryInterval: 10,
      refreshInterval: 1100 // longer than lockTimeout, so the lock actually expires and never gets refreshed
    }
    const mutexObject = {
      mutexes: {
        channelSync: new Mutex(redisClient, 'channelSync', options)
      }
    }

    const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms))

    await mutexObject.mutexes.channelSync.acquire()
    await sleep(2000) // sleep past lockTimeout so the lock is lost
  } catch (err) {
    // ***** we never get here *****
    expect(err).toBeTruthy()
    expect(err.message).toBe('Lost mutex for key channelSync')

    // acquire = await mutexObject.mutexes.channelSync.acquire()
  }
})

How do I need to set it up so that I can detect and handle this error?

swarthy (Owner) commented Feb 20, 2020

Unfortunately, you can't catch this error that way, because it is designed to be fatal (it is thrown from a setTimeout callback). When the key is lost, it usually indicates an error in configuration (like your example with refreshInterval > lockTimeout) or a problem in code architecture (the event loop blocked by an infinite while (true) { ... }, for example). In both cases you want to fix the configuration/code, not handle the error. In a distributed system it is better to get a fatal error here in service1 (set up process.on('unhandledRejection', cb) or run Node.js with --unhandled-rejections=strict) and a notification from your monitoring system, than to handle the unexpected behaviour, because otherwise both service1 and service2 continue to work with the critical resource simultaneously.
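
A sketch of that fail-fast setup; the logging and the exit code are illustrative choices, not library behaviour:

process.on('unhandledRejection', (err) => {
  // Crash deliberately so the orchestrator (e.g. Kubernetes) restarts the
  // service and the monitoring system raises an alert.
  console.error('Fatal: unhandled promise rejection', err)
  process.exit(1)
})

Equivalently, starting the process with node --unhandled-rejections=strict app.js (Node.js 12+) makes any unhandled rejection terminate the process without a custom handler.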

kirstyannepollock (Author) commented

In my Node.js service code I have set up the process.on('unhandledRejection', ...) handler and written mitigations in the services, so all is fine there (I think). I can/should probably improve that to raise an event over the service bus to inform the other services. Our services really should NOT crash our backend in the use case I have, and I think I have that covered.

But I do want to write Jest tests to prove that if (e.g.) one k8s node crashes, the mutex is released and the relevant service on the other nodes can then get the mutex. I suppose I can put a process.on('unhandledRejection', ...) handler in the Jest code.

Do I understand you/the code right: with the correct parameters, the refresh means the lock should be held by the service for as long as the service keeps running, or until it is released (or somehow pre-empted, as can happen when I pause the code while debugging)?

swarthy (Owner) commented Feb 20, 2020

Yep, exactly! The idea (borrowed from a book about Redis) is to try to acquire the lock (or throw an error after acquireTimeout if the lock is still held by another service) and then refresh it every refreshInterval ms until .release() is called. If service1 dies, Redis deletes the key (lockTimeout = the key's TTL in Redis) and service2 can acquire the mutex.
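
A sketch of that lifecycle under a sane configuration (refreshInterval well below lockTimeout); the helper name withChannelSync and the specific numbers are illustrative, not from this thread:

const Redis = require('ioredis')
const { Mutex } = require('redis-semaphore')

async function withChannelSync (redisClient, work) {
  const mutex = new Mutex(redisClient, 'channelSync', {
    lockTimeout: 10000,    // Redis TTL: the key expires 10s after the last refresh
    acquireTimeout: 10000, // acquire() throws if the lock is still held after 10s
    retryInterval: 10,     // poll every 10ms while waiting to acquire
    refreshInterval: 3000  // extend the TTL every 3s while the lock is held
  })
  await mutex.acquire()
  try {
    await work() // the lock keeps being refreshed in the background
  } finally {
    await mutex.release() // stop refreshing and delete the key in Redis
  }
}

// usage: await withChannelSync(new Redis(), async () => { /* critical section */ })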

kirstyannepollock (Author) commented Feb 20, 2020

Yeah, great book; I only know about it because it was linked from your readme, thanks :-). Sadly, it seems I will not be able to write a Jest test due to a "bug" in Jest where process is not the real one: jestjs/jest#5146 (comment). Thanks for your help!

swarthy (Owner) commented Feb 20, 2020

You're welcome!
