critical target error message on VCH console #8112

Closed
RebeccaYo opened this issue Jun 29, 2018 · 7 comments · Fixed by #8154
Labels
area/appliance area/vsphere Intergration and interoperation with vSphere kind/defect Behavior that is inconsistent with what's intended status/need-info Additional information is needed to make progress

Comments

@RebeccaYo

VIC version:

1.4.0

Deployment details:

./vic-machine-linux create --name WV-VCH --public-network-ip 192.168.10.100/24 --management-network PublicNetwork --insecure-registry 10.115.68.63:443 --compute-resource ordos12.eng.vmware.com --no-tlsverify --no-tls --thumbprint ... --target 10.115.68.198 --user [email protected] --image-store Tegile-Lun3a --volume-store Tegile-Lun3a:default --volume-store Tegile-Lun1a:Tegile-Lun1a --volume-store Tegile-Lun1b:Tegile-Lun1b --bridge-network VCH1BridgeNetwork --client-network Testbed --public-network Testbed --container-network Testbed:cn1 --dns-server 192.168.10.101 --public-network-gateway 192.168.10.101 --endpoint-cpu 1 --endpoint-memory 2048

Steps to reproduce:

After deploying the VCH, the error appears within a number of hours to days.
This same error has appeared on many different deployments of VCH, which were located on different datastores (I was making sure this wasn't because of a bad disk). Every time, the error is the same (including sector number) except for the number at the beginning of the error.

Actual behavior:

There is an error message on the console of the VCH.
[ 10.487831] blk_update_request: critical target error, dev sda, sector 15958016. See attachment.
Also, I cannot connect to the Docker daemon of this VCH at tcp://192.168.10.100:2375.
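For anyone trying to confirm the same symptom, a minimal sketch of a TCP reachability probe follows. It only checks whether a TCP connection can be opened; it does not speak the Docker API. The helper name `can_connect` and the 3-second timeout are illustrative assumptions, not part of VIC.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; it tests reachability only.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

With the address from this report, the call would be `can_connect("192.168.10.100", 2375)`.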

Expected behavior:

Immediately after deploying the VCH, there was no error on the console and I did have connectivity to the Docker daemon. I expect that to remain the case.

Logs:

VCH Admin portal is inaccessible.
When I tried to enable SSH on the VCH, I received the following message:

INFO[0005] ### Configuring VCH for debug ####
INFO[0005] Validating target
INFO[0005]
INFO[0005] VCH ID: VirtualMachine:vm-2075
INFO[0005] Creating directory [Tegile-Lun3a] WV-VCH
INFO[0005] Datastore path is [Tegile-Lun3a] WV-VCH
INFO[0006]
INFO[0006] Installer version: v1.4.0-18893-6c385b0
INFO[0006] VCH version: v1.4.0-18893-6c385b0
ERRO[0006] Tools is not running in the appliance, unable to continue
ERRO[0006] Unable to enable ssh on the VCH appliance VM: Tools is not running in the appliance, unable to continue
INFO[0006] Collecting 67e398ac-44ad-4777-b58a-848f8b56df0f vpxd.log
ERRO[0006] Tools is not running in the appliance, unable to continue
ERRO[0006] --------------------
ERRO[0006] vic-machine-linux debug failed: debug failed

Additional details as necessary:

@RebeccaYo
Author

critical_target_error-06-29-18

@RebeccaYo RebeccaYo reopened this Jun 29, 2018
@hickeng
Member

hickeng commented Jul 2, 2018

@RebeccaYo

If you still have that VCH around, please could you supply the tether.debug and vmware.log files from the endpointVM datastore directory?

Additionally, given this recreates for you, please could you:

  1. deploy a VCH with --debug=2 during the initial step and
  2. configure the VCH for console access - this is done using vic-machine debug, which I assume is what you tried to run, judging by the output in the original issue. So long as you set the password you'll be able to log into the VM console (even if you don't enable SSH) until the password expires at midnight.

This gives us a method by which we can gather additional logging if/when the problem recreates. Given the message about tools not running I am wondering if this is a possible recreate of #7680 and would love to get actionable data on that.

Regarding the message on the console. I don't know why you're seeing this but...

  • the number at the beginning is the time since system boot - you'll see this message in dmesg output
  • at that time, the device in question is almost certainly the scratch.vmdk base disk, and the message likely appears just after either the hot-add prior to creating the filesystem on it, or the hot-remove once filesystem creation is done.
  • I have seen read errors before when reading from thin vmdks when inflation has to occur, but only recall it happening in a nested environment. I'm assuming this is not nested?
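To make the fields of the console message concrete, here is a minimal sketch that splits such a line into the uptime, device, and sector, and converts the sector to a byte offset. The function name `parse_blk_error` and the regular expression are illustrative assumptions; the conversion relies on the Linux block layer reporting sectors in 512-byte units.

```python
import re

# Matches lines such as:
# [ 10.487831] blk_update_request: critical target error, dev sda, sector 15958016
LINE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+blk_update_request:\s+"
    r"(?P<kind>.+?), dev (?P<dev>\w+), sector (?P<sector>\d+)"
)

# Linux block-layer sectors are 512 bytes, independent of the device's block size.
SECTOR_BYTES = 512


def parse_blk_error(line: str):
    """Return a dict describing a blk_update_request error line, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    sector = int(m.group("sector"))
    return {
        "uptime_s": float(m.group("uptime")),    # seconds since boot
        "kind": m.group("kind"),                 # e.g. "critical target error"
        "dev": m.group("dev"),                   # e.g. "sda"
        "sector": sector,
        "byte_offset": sector * SECTOR_BYTES,    # offset into the device
    }
```

For the line in this issue, the sector 15958016 corresponds to a byte offset of 8170504192 on sda.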

@hickeng hickeng added kind/defect Behavior that is inconsistent with what's intended area/vsphere Intergration and interoperation with vSphere area/appliance status/need-info Additional information is needed to make progress labels Jul 2, 2018
@RebeccaYo
Author

vmware.log
tether.debug.log
Hopefully these will be helpful.
I'll redeploy with debug=2 and enable shell access.
This environment is not nested.
Thanks again for your help.

@RebeccaYo
Author

Hi @hickeng, I've reproduced this on a VCH with debug=2. The VCH admin portal is unavailable, but I've attached the logs from /var/log/vic on the VCH.

vchlogs_8112_yo.tar.gz

@RebeccaYo
Author

@hickeng Any chance you could look at this? I'm seeing panic: runtime error: integer divide by zero appearing less than 24 hours after I create a VCH, necessitating a redeploy. This is currently a showstopper for my VIC testing as it interrupts the test run.
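For anyone triaging an extracted copy of the attached logs, a minimal sketch for locating the panic lines follows. It assumes the archive has already been unpacked into a local directory; the function name `find_panics` and the directory layout are hypothetical and not something the VIC tooling provides.

```python
from pathlib import Path

# Substring to look for; matches "panic: runtime error: integer divide by zero".
PANIC_MARKER = "panic: runtime error"


def find_panics(log_dir: str):
    """Yield (file path, line number, line text) for each line containing the marker."""
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains non-UTF-8 bytes.
        with path.open("r", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if PANIC_MARKER in line:
                    yield str(path), lineno, line.rstrip("\n")
```

Calling `list(find_panics("path/to/extracted/logs"))` returns every hit with its file and line number, which makes it easier to see which component logged the panic first.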

@hickeng
Member

hickeng commented Jul 24, 2018

@RebeccaYo Apologies for the delay. I've taken a look at the logs you attached. It's possible that this is a variant of #7680. One very effective way of determining whether this is the same issue is to deploy with a static IP on the management network and see if the problem persists. If it does not, then you can try with DHCP again, specifying --asymmetric-routes.

If this is confirmed to be the same issue I'm very interested in knowing whether the panic is also present in the tether.debug when deployed with debug=2. We've struggled to get any traction on #7680 and any insight would be invaluable.

@RebeccaYo
Author

Hi George, thank you, I'll try that. In the meantime, here's the tether.debug for the VCH deployed with debug=2. tether.debug.txt
