Failed to add disk during pull #2819
@fdawg4l Assigned to you for initial investigation.
@cgtexmex tagging Clint as he has been working on the concurrency epic
The esx reconfig
Looking at the messages @hickeng posted above,
Looks like scratch was deleted in the
So our
I can't tell you how convenient the narrow column width GH chose for their issue tracker is. Never mind that your widescreen monitor was made in the last 10 years; they insist you have a CRT from the 70s. Thanks GH!
The PL needed scratch at approx
Looks like we have a similar issue on build 6235. Logs here.
From 6235 in (hostd.log)
@derekbeard If you have a moment, can you take a look at this? Hostd says the parent or file doesn't exist but we see it does directly above. In port-layer.log, we see
And this is corroborated by
@mhagen-vmware Do we use ssh to do cleanup of the datastores between/after tests/suites/CI runs?
We don't use ssh at all; do you mean govc? We use govc before install of a VCH to clean up dangling networks/VMs/datastore files.
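For anyone following along, the cleanup govc does there is roughly the following, sketched with the govmomi Go API (illustrative only: the CI scripts drive the govc CLI directly, and the datastore paths here are hypothetical):

```go
package cleanup

import (
	"context"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/find"
	"github.com/vmware/govmomi/object"
)

// deleteDanglingFiles removes leftover datastore paths from a previous run.
// This mirrors what the govc-based cleanup does; it is not the CI code.
func deleteDanglingFiles(ctx context.Context, c *govmomi.Client, dsPaths []string) error {
	finder := find.NewFinder(c.Client, true)
	dc, err := finder.DefaultDatacenter(ctx)
	if err != nil {
		return err
	}

	fm := object.NewFileManager(c.Client)
	for _, p := range dsPaths { // e.g. "[datastore1] some-old-vch-folder" (hypothetical)
		task, err := fm.DeleteDatastoreFile(ctx, p, dc)
		if err != nil {
			return err
		}
		// The path may already be gone; a real cleanup script would inspect
		// the fault rather than ignore it.
		_ = task.Wait(ctx)
	}
	return nil
}
```

The relevant point for this issue is that a delete like this from one build can yank scratch.vmdk out from under another build sharing the same datastore.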
@fdawg4l If I read the logs (Test-Cases.Group6-VIC-Machine.6-05-Create-Validation-VCH-6235-7958-container-logs/journalctl) and image.go:scratch() correctly, then the detach that you pointed out above is expected (at least from my read of the code).
corresponds to the log message just before the mkfs.
look like the mkfs to me. And the next step in the scratch() function is the detach, likely corresponding to
Is this expected behavior, and if so, should I be focusing on some other symptoms or log messages?
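For reference, here's how I read that sequence, as a rough Go sketch (this is not the actual image.go code; Disk and DiskManager are hypothetical stand-ins for the port layer's disk manager, and the size and filesystem are illustrative):

```go
package scratch

import (
	"context"
	"fmt"
	"os/exec"
)

// Disk is a hypothetical handle to a VMDK attached to the endpoint VM.
type Disk struct {
	DatastorePath string // e.g. "[datastore1] imagestore/scratch.vmdk"
	DevicePath    string // block device once attached, e.g. "/dev/sdb"
}

// DiskManager is a hypothetical stand-in for the port layer's disk manager.
type DiskManager interface {
	CreateAndAttach(ctx context.Context, dsPath string, capacityKB int64) (*Disk, error)
	Detach(ctx context.Context, d *Disk) error
}

// writeScratch mirrors the order of operations described above: create the
// disk, attach it, mkfs, then detach. The detach is on the normal path.
func writeScratch(ctx context.Context, dm DiskManager, dsPath string) error {
	d, err := dm.CreateAndAttach(ctx, dsPath, 8*1024*1024) // size illustrative
	if err != nil {
		return fmt.Errorf("create/attach scratch: %w", err)
	}
	// Detach even if mkfs fails, so the detach log line is expected either way.
	defer dm.Detach(ctx, d)

	// Format the freshly attached device (the mkfs step in journalctl).
	if out, err := exec.CommandContext(ctx, "mkfs.ext4", d.DevicePath).CombinedOutput(); err != nil {
		return fmt.Errorf("mkfs: %v: %s", err, out)
	}
	return nil
}
```

So a detach right after the mkfs doesn't, by itself, indicate a failure.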
Yup! That's right @derekbeard. We tried to create a chained disk from
We created scratch earlier (i.e. in the log snippet you pasted), so it should be there. The open question is why we couldn't create the child disk from scratch. Was
Note: the "Invalid configuration..." message is directly from vSphere.
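For context, this is roughly what that chained-disk create looks like at the vSphere API level, sketched with govmomi types (not the actual vic disk-layer code; the paths, controller key, and omitted capacity are illustrative):

```go
package disk

import (
	"context"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/types"
)

// createChildDisk reconfigures the VM to create a new VMDK whose backing
// points at a parent such as scratch.vmdk.
func createChildDisk(ctx context.Context, vm *object.VirtualMachine, parent, child string) error {
	backing := &types.VirtualDiskFlatVer2BackingInfo{
		DiskMode:        string(types.VirtualDiskModePersistent),
		ThinProvisioned: types.NewBool(true),
		VirtualDeviceFileBackingInfo: types.VirtualDeviceFileBackingInfo{
			FileName: child, // e.g. "[datastore1] imagestore/images/<id>.vmdk" (illustrative)
		},
		// The chained-disk part: the new disk's parent is scratch.
		Parent: &types.VirtualDiskFlatVer2BackingInfo{
			VirtualDeviceFileBackingInfo: types.VirtualDeviceFileBackingInfo{
				FileName: parent, // e.g. "[datastore1] imagestore/scratch.vmdk"
			},
		},
	}

	spec := types.VirtualMachineConfigSpec{
		DeviceChange: []types.BaseVirtualDeviceConfigSpec{
			&types.VirtualDeviceConfigSpec{
				Operation:     types.VirtualDeviceConfigSpecOperationAdd,
				FileOperation: types.VirtualDeviceConfigSpecFileOperationCreate,
				Device: &types.VirtualDisk{
					VirtualDevice: types.VirtualDevice{
						Backing:       backing,
						ControllerKey: 100,                // illustrative SCSI controller key
						UnitNumber:    types.NewInt32(-1), // let vSphere pick the slot
					},
				},
			},
		},
	}

	task, err := vm.Reconfigure(ctx, spec)
	if err != nil {
		return err
	}
	return task.Wait(ctx)
}
```

If the parent is missing at reconfigure time, hostd reports that the parent or file doesn't exist, which matches the messages quoted earlier.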
Looks like these are the corresponding failure messages from the Hostd log:
Before this, there are a slew of FileManager.deleteFile() calls, including one to delete scratch.vmdk:
Right about this time I see the following in the port-layer.log:
Perhaps related?
The path is different. The path we care about in this case is
Looks like build 6243 was running at the same time on this ESX host. I'm investigating whether we're racing on the datastore, in which case this is a test issue. Still digging.
Yeah, 6243 nuked our image store from underneath this test.
Thank you for taking the time to look at this @derekbeard, but we have some test-framework inconsistencies that are causing this. This is not a storage issue. Reassigning.
Looks like we have some raciness due to concurrent CI builds running
Oh, interesting! Uh, give me a few minutes to figure out that
Reading that snippet, it looks like
The test has other issues too with respect to concurrent builds. The custom path is hardcoded and will collide if more than one build chooses the same ESX instance AND one attempts to clean up before the other completes. We namespace the image store (with a
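For illustration, the kind of per-VCH namespacing being described would look something like this (a hypothetical helper, not the actual port-layer layout):

```go
package datastore

import "path"

// imageStorePath builds a per-VCH datastore folder so that concurrent builds
// sharing an ESX host don't collide on a single hardcoded path. The layout
// here is illustrative only.
func imageStorePath(vchID, storeName string) string {
	// e.g. "VIC/<vch-id>/images/<store>" instead of a shared, fixed folder.
	return path.Join("VIC", vchID, "images", storeName)
}
```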
Related to #2043
@mhagen-vmware can you look at my comment above and advise? We either need to identify the tests that are using hardcoded, non-sharable paths we are likely racing on, reduce the number of workers to the number of ESX hosts and stop sharing altogether, or your change above has put us in a good place and we can close this issue.
Well, at this point my PR is in to prevent concurrency from happening again, so if we never see this again then I guess we can chalk it up to the aggressive cleanup from another job. But this line here:
But
Right, so it would ignore that folder as well, due to the previous line:
OOOHH!! OK, cool. Thanks. In that case, let's close this bug; if we see weird datastore issues again, I'll investigate (as they're likely real) or reopen this and look at how to address any possible collisions. We still have the issue of 2 VCHs using this hardcoded path and racing each other. This doesn't involve cleanup, just 2 VCHs conflicting with each other if we ever enable concurrency again. I'll create an issue to track that.
Investigating https://ci.vmware.run/vmware/vic/6045 as a follow-on from #2817.
There is a reconfigure (opID=e902fe86) to add a disk to one VCH that interleaves with vic-machine delete for another VCH (opID=e902feaa). I don't know if that interleaving is an issue, but as can be seen, we don't end up with an attached disk for the reconfigure. The reconfigure is
Test-Cases.Group6-VIC-Machine.6-4-Create-Basic-VCH-6045-5188
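For reference, a minimal govmomi-style check for the symptom described here: after the reconfigure task completes, the expected VirtualDisk isn't among the VM's devices. This is illustrative, not the test or port-layer code; vmdkPath is whatever backing the reconfigure asked for:

```go
package disk

import (
	"context"
	"fmt"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/types"
)

// hasAttachedDisk reports whether the VM currently has a VirtualDisk backed
// by the given datastore path.
func hasAttachedDisk(ctx context.Context, vm *object.VirtualMachine, vmdkPath string) (bool, error) {
	devices, err := vm.Device(ctx)
	if err != nil {
		return false, fmt.Errorf("list devices: %w", err)
	}
	for _, dev := range devices.SelectByType((*types.VirtualDisk)(nil)) {
		vd := dev.(*types.VirtualDisk)
		if b, ok := vd.Backing.(*types.VirtualDiskFlatVer2BackingInfo); ok && b.FileName == vmdkPath {
			return true, nil
		}
	}
	return false, nil
}
```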