RuntimeError: Communication Exception running gen2-triangulation #210
Thanks for the report and sorry for the trouble. Not immediately sure. But CC: @Erol444 on this.
If it helps, it failed today with a:
..and I just had another failure where it was the landmarks_left stream. I'm guessing something is crashing in the pipeline and it's just erroring out on whatever stream it's trying to retrieve.
Hello @madgrizzle, I have just pushed a small edit to the latest master branch of depthai-experiments. Could you try using that + the latest
Awesome, will try tonight! Thanks for looking into it.
It works great! Thanks for the fix!
Well, turns out that it just seems to work 'better', but it still crashes. I'm not entirely sure why I wasn't getting these messages before, but I recently did an update to depthai-python. This occurs right before the crash: [14442C10411DC2D200] [470.453] [system] [critical] Fatal error. Please report to developers. Log: 'PoolBase' '66'
@madgrizzle I'm not immediately sure why this issue occurs, will check with the team too. Could you specify which
@madgrizzle it appears to be a bug in depthai itself and @themarpe managed to fix that right away (kudos!). I'll circle back once the experiment is updated with a library version containing the fix, so you'll be able to test again. Thanks for reporting the issue!
Just curious.. is the "bug in depthai itself" a bug in the FW or a bug in depthai-python (or something else)?
@madgrizzle it was a bug in the FW regarding how messages are shared across the two cores available.
I was hoping 2.11.0 would fix it for me (based upon what I saw in the commits and announcement), but I get the same error message (X_LINK_ERROR). I've seen @Luxonis-Brandon demo the gen2-triangulation on twitter just recently, so I suspect that my problem may be specific to me. I can try to use a different machine to see if maybe it's an issue with my specific OAK-D or just the computer I'm using.
@madgrizzle does it happen immediately or after some time running?
@themarpe, not immediately, and the time it takes varies. At the moment, within 30 seconds or so. A couple of days ago (I had built your develop branch hoping the fix was in) it took a few minutes to crash. The OAK-D is powered by the power supply and plugged into an Intel NUC i7 via a USB 3 cable. I access the NUC via ssh/xserver, but I wouldn't expect that to cause the issue.
@themarpe, I spun up Ubuntu 20.04 on an RPi 4 and got everything installed and running, with the same result after a few seconds. It did manage to spit out more information though: terminate called after throwing an instance of 'dai::XLinkReadError' Also, on another run, I got this in the stream: [14442C10411DC2D200] [21.932] [system] [critical] Fatal error. Please report to developers. Log: 'class' '374' If this seems like a device issue, we can close this and I'll look for help on the discord.
@madgrizzle I'm investigating more.
@themarpe I did some testing and got frustrating results. Not sure any of this is useful, but figured more information is better than none. I first incrementally enabled various parts of the pipeline (by commenting out the sections I didn't want enabled) and it seemed to run well (>5 minutes) until I enabled the very last part to retrieve the landmarks.
Upon enabling that, it started to crash after a few seconds. So I thought maybe it was just related to that particular model, so I adapted the program to use the facial-landmarks-35-adas model and it ran for a really, really long time. This seemed at the time to prove my hypothesis. To make sure, I switched back to the original model to verify it still crashed (which it did) and then switched back to the new adas one to verify it didn't crash.. but then it did. That's the frustrating part. I tried wiping the blob cache and installing the newest version of the blob converter and it didn't seem to help.
@madgrizzle thanks very much for the extensive testing. The current state from my side is that some sort of memory corruption happens, which seems to be accelerated when faces are detected and/or the unsupported config error is printed (odd scenes where something is detected oddly cause this, afaik). The longest runs on my end were stationary scenes, while the shortest were dynamic movement with both empty scenes and scenes with faces. Regarding your observation - if you've removed that line, the actual on-device processing should not differ (the queue is non-blocking, so the messages are going to continue being produced). An issue in parsing could happen, but in that case you'd be left with a host-side error. Regarding the different model, do you think it helped with overall stability, in terms of timing and how soon it crashed? Anyway, I suspect an ImageManip issue, but I'm not 100% sure yet. I have a rewrite planned for ImageManip, which will hopefully address this as well, as it's quite an elusive and not deterministically reproducible bug (in terms of execution).
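As an aside, the non-blocking behaviour mentioned above is configured when the host creates the output queue. A minimal sketch, assuming a simple mono-camera pipeline and illustrative stream names (not the experiment's exact code):

```python
import depthai as dai

pipeline = dai.Pipeline()
mono = pipeline.create(dai.node.MonoCamera)
mono.setBoardSocket(dai.CameraBoardSocket.LEFT)
xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("mono_left")
mono.out.link(xout.input)

with dai.Device(pipeline) as device:
    # blocking=False: if the host falls behind, older messages are dropped
    # instead of stalling the device-side pipeline.
    q_left = device.getOutputQueue(name="mono_left", maxSize=4, blocking=False)
    while True:
        msg = q_left.tryGet()  # returns None when nothing is waiting
        if msg is not None:
            frame = msg.getCvFrame()  # parse/use the frame here
```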
@themarpe, you wrote:
Did you mean that the processing SHOULDN'T differ? I'm having a hard time reconciling what you wrote in the sentence with what you wrote in the parentheses.
I thought it solved it because it wouldn't crash the first time I ran the other model. But the second time and thereafter it crashed as much as the original model. That's the weird part.
I know very little about the internal workings, but I thought it was either the fact that two different 'pipelines' were being run (left camera and right camera) and some issue came up from that, or it was ImageManip, considering that face detection and landmark recognition in general have been pretty solid in other examples. I used both with gen1 stuff and it seemed solid. I will try to figure out how to do the cropping of the image host-side, thereby eliminating the ImageManip node, and see if it fixes the problem.
@madgrizzle
Thanks, that'd be a great data point to have. Let me know how it goes.
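For reference, host-side cropping as proposed above could look roughly like the sketch below. This is only an illustration: the detection is assumed to carry normalized bounding-box coordinates (as depthai detection results do), and the 48x48 target size is an assumption about the landmark network's input.

```python
import cv2
import numpy as np

def crop_face_host_side(frame: np.ndarray, det, target_size=(48, 48)) -> np.ndarray:
    """Crop a detected face on the host instead of using an on-device ImageManip node.

    `det` is assumed to have normalized xmin/ymin/xmax/ymax in the 0..1 range.
    """
    h, w = frame.shape[:2]
    x1 = max(0, int(det.xmin * w))
    y1 = max(0, int(det.ymin * h))
    x2 = min(w, int(det.xmax * w))
    y2 = min(h, int(det.ymax * h))
    if x2 <= x1 or y2 <= y1:
        raise ValueError("empty crop region")
    face = frame[y1:y2, x1:x2]
    # Resize to whatever fixed input size the landmark network expects.
    return cv2.resize(face, target_size)
```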
@themarpe Perhaps this should be an issue 'somewhere else' (I'll ask on discord), but it appears that the ImageManip config errors occur when the resize width is less than half the height. I was testing it at lunch and started to slowly move my hand in front of my face, and the 'box' (that gets drawn on the screen) started to narrow. When it got to around half the height, the errors started to occur. I had to get back to work, but I'll do some more testing tonight and see if I can catch those events.
@themarpe I've got things running much, much better (still crashes though) by not processing any face detections where the aspect ratio is less than 70% (width to height or height to width). That seemed to have eliminated most of the ImageManipConfig errors. It runs for several minutes now and is much more stable. So I tend to agree ImageManip is the likely culprit, but the fact it happens while the image is dynamic (I find the same to be the case) is odd. When it crashed as I was moving forward to get closer to the keyboard, I got this message:
I understand ImageManip is being rewritten, so I'll hold off on any more investigating/testing for the time being. /em fingers_crossed
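A sketch of the aspect-ratio filter described above, under the assumption that detections use normalized coordinates and that the ratio is measured in pixels (the names, frame size, and 0.7 threshold are illustrative):

```python
def is_usable_detection(det, frame_w: int, frame_h: int, min_ratio: float = 0.7) -> bool:
    """Skip bounding boxes that are much narrower than they are tall (or vice versa),
    which appeared to trigger ImageManip 'unsupported config' errors."""
    w = (det.xmax - det.xmin) * frame_w
    h = (det.ymax - det.ymin) * frame_h
    if w <= 0 or h <= 0:
        return False
    return min(w, h) / max(w, h) >= min_ratio

# usage: detections = [d for d in in_det.detections if is_usable_detection(d, 640, 400)]
```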
I solved this problem by modifying this line of code
The version on the repo still crashes for me if I make the change you described. So I keep wondering if it's a hardware problem. The more I work with it and try different things, the worse it gets. I give up, let it rest, and I come back to it days later and it works better initially, but then eventually craps out again and again.
Turns out the latter problem, where it seems to hang without crashing, was caused by the optimizations I made. When you don't retrieve the frames (left, right, cropped, etc.), the host program's loop runs too fast and it empties the queues of config and landmarks, and only rarely gets both left and right landmarks at the same time in a single loop's iteration. I had to slow down the loop by adding more delay to the cv2.waitKey() call so that it's not emptying out the queue.
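A sketch of the pacing fix described here; the function, queue names, and 50 ms delay are illustrative, not taken from the actual script:

```python
import cv2

def run_loop(q_landmarks_left, q_landmarks_right):
    """Host loop paced with a longer waitKey delay so it doesn't drain the
    output queues faster than the device fills them."""
    while True:
        lm_left = q_landmarks_left.tryGet()
        lm_right = q_landmarks_right.tryGet()

        if lm_left is not None and lm_right is not None:
            pass  # triangulate only when both sides arrive in the same iteration

        # A larger delay than the usual waitKey(1) keeps the loop from spinning
        # and emptying one queue long before the matching message arrives.
        if cv2.waitKey(50) & 0xFF == ord('q'):
            break
```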
Cross posting for visibility - luxonis/depthai-python#408 (comment) Main issue, as far as I've dug into it, is that a memory corruption is happening in a non-deterministic way, which looks like something might be wrong with the hardware, but it's just that the bug is "random" and hard to pin down to a specific cause. Will keep you posted after I discover more information about it.
@madgrizzle could you try installing the following library? It contains some fixes for Script node related memory allocation:
Then modify the script by adding the following (as @alex-luxonis suggested):
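The exact library link and suggested addition aren't preserved above, but judging from the later comments about setProcessor and LEON_CSS vs. LEON_MSS, the modification presumably concerned which Leon core the Script node runs on. A hypothetical sketch only, not the original suggestion:

```python
import depthai as dai

pipeline = dai.Pipeline()
script = pipeline.create(dai.node.Script)
# Running the Script node on the CSS core appeared to be stable in the tests
# reported below, while LEON_MSS reproduced the crash quickly.
script.setProcessor(dai.ProcessorType.LEON_CSS)
```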
Knock on wood, but this seems pretty stable so far. I've let it run for about 8 hours without issue on the RPi. I'll let it run overnight and then switch to the x86 host to see how it works there.
No problem running overnight. I'm now trying out the script from luxonis/depthai-python#408 (comment) that I modified to include facial detection/recognition. Previously it would run for a few seconds and crash, but it's been running for 15 minutes now with no issue.
@madgrizzle Thanks for the tests!
I had done both changes. When I go home at lunch, I'll undo the setProcessor change and see what the effect is.
@alex-luxonis Switching the code to use LEON_MSS crashed within about 1 minute. Switched back to LEON_CSS and it's running stable. Tried MSS again to make certain and it crashed in about 10 seconds. Back to CSS and stable. This is all still running on an RPi host (haven't moved the cable back to the x86).
@alex-luxonis A couple of things.. First, it appears that for the gen2-triangulation script, I did not change it to LEON_CSS (only the updated depthai). I went ahead and set it to LEON_MSS and will let it run for a while to be absolutely certain. Second, the script from luxonis/depthai-python#408 (comment) with my face detection/recognition additions, however, did crash using LEON_MSS rather quickly (seconds to minutes) and ran for 12+ hours last night using LEON_CSS without crashing. However, when playing around at lunch time (i.e., switching back and forth between MSS and CSS), it did crash with LEON_CSS after about a 15 minute run. That could potentially be related to something else (like ImageManip, or maybe just the process of switching the code, or something). Definite improvement running the new depthai + the Script node on the LEON_CSS processor.
@madgrizzle thanks for the extensive testing - the findings support my theory that moving to CSS only mitigates/suppresses the issue but isn't the cause of it.
@themarpe Seems so.. good enough for my use-case (giving my robot some vision).
Ack.. accidentally closed.
@madgrizzle in latest
It ran gen2-triangulation pretty solidly (I stopped it after 10 minutes), but then I ran it on my yoloSpatialCalculator + face recognition program that often hangs on 2.13.3 after a few minutes, and this version did the same. It's never consistent on when either occurs.. sometimes a couple of minutes, sometimes after a couple of hours, but sometimes just a few seconds. I do still notice lots of ImageManip unsupported config errors. Any progress on addressing that (I can deal with it, nevertheless)?
Thanks for testing - is your yolo + face recognition available somewhere openly? Might be a good benchmark to test against, just so that the error is more easily observed. Regarding ImageManip, we have a branch for it, but there are a couple of extra things to fix up and retest, so it might take a bit more. I'll also address the issue with unsupported config errors :)
It's currently part of a ROS node, but it won't be hard to strip it out and make it run standalone. I should have it on my repo tonight and will post a link.
This is my repo with the code I'm using: https://github.com/madgrizzle/visiontest It ran 12 hours today with no one in the room. When I walked in, it said it detected an elephant (great for my ego) and a few minutes later it locked up while my head was down looking at my phone. One key I've discovered to making it lock up is putting my hands to my face (like rubbing my eyes). It doesn't happen all the time, but so often that it has to be more than coincidence.
Thanks, will try to recreate :)
I'm currently running 2.13.3.0.dev+61eb5c1617623c628fab1cc09d123073518124b0 with great success. It ran for days using LEON_CSS until I intentionally stopped it. I'll try out LEON_MSS as well. I've added a few things to the pipeline since the code in the visiontest repo and moved it back into my ROS node because it appears stable enough for me (/em crosses fingers).
Finally getting back around to trying out the face detection with stereo demo and I find that it runs fine for about 5 seconds but then crashes with the following error:
Traceback (most recent call last):
  File "main.py", line 207, in
    frame = queues[i*4].get().getCvFrame()
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'mono_left' (X_LINK_ERROR)'
It only crashes when someone steps in front of the camera and face detection starts working.
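For context, the failing call in the traceback above is a blocking read on one of several output queues. A sketch of wrapping such reads so the stream that dies is at least reported (the helper and queue handling are illustrative, not the exact main.py code):

```python
def read_all_frames(queues):
    """Blocking reads over several depthai output queues; when the link drops
    (X_LINK_ERROR), report which stream failed before re-raising."""
    frames = {}
    for q in queues:
        try:
            frames[q.getName()] = q.get().getCvFrame()
        except RuntimeError as exc:
            print(f"Read from stream '{q.getName()}' failed: {exc}")
            raise
    return frames
```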