Nomad unable to launch on EC2 graviton instance types #7989

Closed
shoenig opened this issue May 16, 2020 · 1 comment · Fixed by #9589


shoenig commented May 16, 2020

As of v0.11.2 we go through the trouble of shipping a table of CPU performance data for EC2 instance types, so Nomad should be able to launch here. It looks like the standard CPU fingerprinter fails on Graviton (ARM) instances because no MHz information is available. (On AMD/Intel, the data is available but meaningless, and it is discarded.)

ubuntu@ip-172-31-17-121:~$ curl -s http://169.254.169.254/latest/meta-data/instance-type
a1.medium
ubuntu@ip-172-31-17-121:~$ lscpu
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          1
On-line CPU(s) list:             0
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           3
Model name:                      Cortex-A72
Stepping:                        r0p3
BogoMIPS:                        166.66
L1d cache:                       32 KiB
L1i cache:                       48 KiB
L2 cache:                        2 MiB
NUMA node0 CPU(s):               0
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Branch predictor hardening
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
ubuntu@ip-172-31-17-121:~$ ./nomad version 
Nomad v0.11.2 (807cfebe90d56f9e5beec3e72936ebe86acc8ce3)
ubuntu@ip-172-31-17-121:~$ ./nomad agent -dev -log-level=TRACE
==> No configuration files loaded
==> Starting Nomad agent...
==> Error starting agent: client setup failed: fingerprinting failed: cannot detect cpu total compute. CPU compute must be set manually using the client config option "cpu_total_compute"
    2020-05-16T14:58:28.830Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=
    2020-05-16T14:58:28.830Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2020-05-16T14:58:28.831Z [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
    2020-05-16T14:58:28.835Z [INFO]  nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:127.0.0.1:4647 Address:127.0.0.1:4647}]"
    2020-05-16T14:58:28.836Z [INFO]  nomad: serf: EventMemberJoin: ip-172-31-17-121.global 127.0.0.1
    2020-05-16T14:58:28.836Z [INFO]  nomad: starting scheduling worker(s): num_workers=1 schedulers=[service, batch, system, _core]
    2020-05-16T14:58:28.836Z [INFO]  client: using state directory: state_dir=/tmp/NomadClient618123295
    2020-05-16T14:58:28.837Z [INFO]  client: using alloc directory: alloc_dir=/tmp/NomadClient800920818
    2020-05-16T14:58:28.842Z [DEBUG] client.fingerprint_mgr: built-in fingerprints: fingerprinters=[arch, cgroup, consul, cpu, host, memory, network, nomad, signal, storage, vault, env_aws, env_gce]
    2020-05-16T14:58:28.843Z [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
    2020-05-16T14:58:28.843Z [INFO]  nomad.raft: entering follower state: follower="Node at 127.0.0.1:4647 [Follower]" leader=
    2020-05-16T14:58:28.843Z [INFO]  nomad: adding server: server="ip-172-31-17-121.global (Addr: 127.0.0.1:4647) (DC: dc1)"
    2020-05-16T14:58:28.843Z [DEBUG] client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup period=15s
    2020-05-16T14:58:28.844Z [DEBUG] client.fingerprint_mgr.cpu: detected core count: cores=1
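
To illustrate why detection comes up empty here, the following is a minimal sketch, not Nomad's actual fingerprinter. It assumes a hypothetical helper, `findMHz`, that scans lscpu-style `Key: value` lines for a `CPU MHz` entry. The a1.medium output above has no such line, so there is nothing to multiply the core count by.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// findMHz is a hypothetical helper (not part of Nomad). It scans
// lscpu-style "Key: value" lines for a "CPU MHz" entry and reports
// whether a usable value was found.
func findMHz(lscpu string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(lscpu))
	for sc.Scan() {
		// Split on the first colon only; values may contain colons too.
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok || strings.TrimSpace(key) != "CPU MHz" {
			continue
		}
		mhz, err := strconv.ParseFloat(strings.TrimSpace(val), 64)
		if err != nil {
			return 0, false
		}
		return mhz, true
	}
	return 0, false
}

func main() {
	// Trimmed from the a1.medium output above: there is no "CPU MHz" line.
	graviton := "Architecture: aarch64\nCPU(s): 1\nModel name: Cortex-A72\nBogoMIPS: 166.66\n"
	if _, ok := findMHz(graviton); !ok {
		fmt.Println("no MHz reported; cores * MHz cannot be computed")
	}
}
```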
shoenig self-assigned this Dec 9, 2020
shoenig added a commit that referenced this issue Dec 9, 2020
…etected

Previously, Nomad would fail to start up if the CPU fingerprinter could
not detect the CPU total compute (i.e. cores * MHz). This is common on
some EC2 instance types (graviton class), where the env_aws fingerprinter
will override the detected CPU performance with a more accurate value
anyway.

Instead of crashing on startup, have Nomad use a low default for available
cpu performance of 1000 ticks (e.g. 1 core * 1 GHz). This enables Nomad
to get past the useless cpu fingerprinting on those EC2 instances. The
crashing error message is now a log statement suggesting the setting of
cpu_total_compute in client config.

Fixes #7989
shoenig added this to the 1.0.1 milestone Dec 9, 2020
github-actions bot locked as resolved and limited conversation to collaborators Oct 27, 2022