-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RPC error in docker stop, and eventually, vic-machine delete constantly fail with invalid memory address #2818
Comments
@cgtexmex tagging Clint as he has been working on the concurrency epic |
Also seen in https://ci.vmware.run/vmware/vic/6093 in every test suite:
Teardown for |
@hickeng vic-machine is not trying to delete VMs not belonging to one VCH. The CI has cleanup functions in every test suite teardown, to delete dangling VCHs, if the corresponding drone build is not running any more. That's why you saw those delete message in different drone build, but try to delete same container VM. |
For the number 2, CI also has another clean up function to delete datastore files, if there is no .vmx file found in that directory. |
@hickeng dig into why vic-machine delete failed with invalid memory issue, but unfortunately, it's hard to reproduce that error.
|
The shudtown, VM power off all succeed, but the refresh followed to update changeVersion failed for RPC error. Though the error message followed shows nothing. |
log from hostd:
|
@hickeng @dougm Any idea on that? |
The logging improvements @matthewavery is making should help here. Regarding the NPE, it is worth checking SearchIndex usage as it can return a nil value and error when there are no results. Looks like some uses check for that, but at least 1 case where we could return nil, but wasn't obvious is checked for != nil later on: https://github.com/vmware/vic/blob/master/lib/install/management/finder.go#L268 |
Well, just got something cause we still didn't remove that vch and container vm from our CI env. So I found that the NPE is from VirtualMachine.Device method. |
I'll add a nil check to govmomi for the Device method. |
This problem is still caused by VM invalid state issue. But it does not always happen in vm operations. Observed three different cases.
For these three issues, we'll need to do following fix:
@dougm I think the first one should happen in govmomi, and I'll take care of the second. We might take care of the third in the future. |
There is no 'vm.properties' method, are you referring to the object.Common.Properties method? That's just a shortcut to the property collector, there isn't anything specific to the object type vm or otherwise. In any case, I'm not sure about adding a workaround for this bug to govmomi. I may need to spend some time better understanding the bug and creating a small test case to reproduce outside of vic. Certainly will add the nil check to the VirtualMachine.Device method, as the sdk doc http://pubs.vmware.com/vsphere-60/index.jsp#com.vmware.wssdk.apiref.doc/vim.VirtualMachine.html states: "The virtual machine configuration is not guaranteed to be available. For example, the configuration information would be unavailable if the server is unable to access the virtual machine files on disk, and is often also unavailable during the initial phases of virtual machine creation. " |
Yeah, I know there is no vm.Properties method, but it looks like there is one, and we can add one for some workaround. That's what I mean. It's reasonable to return the empty value back to caller, but it might be difficult to fix all callers for it. I'm fine with it, but is there good way to show the error if another NPE happens? For this specific one, it needs lots of additional effort to figure out what's wrong. |
If we are only concerned with the VirtualMachine.Device() case here, we can return an error if Config == nil and include ConnectionState in the error message. How does that sound? The more generic case I'm not sure about yet. |
For this specific case, if you check the connection state, it might be better to return an invalid state error. But even with that, we also need to check the caller to make sure they are not throwing NPE. I can check vic code, to make sure they works even with empty vm.Config, but for govmomi, then you'll need to go through the callers. Otherwise we might go into other path to hit this NPE again. |
And I'm wondering what should we do with this error. For example, https://github.com/vmware/vic/blob/master/lib/portlayer/exec/base.go#L107, if there is empty config returned, what should we do there? |
I gonna have a method vm.Properties in vic vm.go file, to workaround this issue. And one log mechanism to catch NPE in vic-machine, in case we're leaking anything. |
@emlin sounds good. I will just add the nil check in VirtualMachine.Device for now then. |
vmware/govmomi#623 - includes a test case to reproduce nil VirtualMachine.Config |
Fixes vmware#2968 (NSX-T) Avoid possible VirtualMachine.Device NPE seen in vmware#2818 Fix for vmware#2878 (docker image update still required)
Fixes vmware#2968 (NSX-T) Avoid possible VirtualMachine.Device NPE seen in vmware#2818 Fix for vmware#2878 (docker image update still required)
Fixes vmware#2968 (NSX-T) Avoid possible VirtualMachine.Device NPE seen in vmware#2818 Fix for vmware#2878 (docker image update still required)
Seen in https://ci.vmware.run/vmware/vic/6045 and as a follow on from #2817
Build https://ci.vmware.run/vmware/vic/6013 left
VCH-6013-4771
anddrunk_englebart-9f1ce9...
behind.For some reason, all of the
vic-machine delete
operations that occur against that host attempt to deletedrunk_englebart
.The text was updated successfully, but these errors were encountered: