Move setVMMaxMapCount from top-level in spec #1712
Promoting tweaking the host in an init container feels a little rough (it can totally be done, IF you accept the security risks associated with it), so I'm not sure we should be strongly advocating it. This is an ES feature (both memory-mapped files -- a lot of them -- and bootstrap checks -- yay!), which can be disabled via config (https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html#allow-mmap).

I think we can make the argument that we should instruct our users to either 1) set the sysctl setting in one of several ways, or 2) disable mmap if they are in an environment where they cannot control the ability to create a lot of memory maps. The description in those docs seems to fit. Thus, I propose the following:
```sh
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.2.0
  nodes:
  - name: default
    nodeCount: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      ## Hosts running ES should set vm.max_map_count according to
      ## https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
      ##
      ## If you're not able to configure this, node.store.allow_mmap can be set to false to
      ## disable memory mapped files to avoid this requirement, at the cost of performance
      ## (see https://www.elastic.co/guide/en/elasticsearch/reference/current/_maximum_map_count_check.html)
      #node.store.allow_mmap: false
EOF
```

We should probably discuss within the team whether we'd like to have this uncommented by default (for the quickstart only), and whether we'd like to document it further (e.g. add a section to "troubleshooting" that explains it in more depth and makes it easier to discover when running into the issue outside of the quickstart). A follow-up question should go to the ES team: are they fine with this as the quickstart contents, or are there better ways we can/should deal with it (e.g. potentially allowing a lower number of memory-mapped files for small clusters in the bootstrap check)?
Agreed. However, this has been, in my experience, the only way to make host-level changes on GKE. To phrase this another way, 100% of GKE users[1] are likely to need this vm.max_map_count sysctl setting.

In past explorations, I've found folks requesting things like a "DaemonJob" which would act like a DaemonSet (schedule on all nodes) but otherwise act like a Job (run once). I don't think anyone has a solution for a 'daemon job' style thing. An initContainer with elevated privileges is the shortest path to declare "in order for this pod to be successful, the host needs to be configured a certain way." There's been recent work to move sysctl settings (as mentioned above in the description) behind a PodSecurityPolicy, and maybe we can do something along those lines and then set this in the Elasticsearch pod's securityContext?

[1] I don't know how common it is for someone to run GKE but also do config management on their nodes (like with Puppet or Ansible). On Infosec, we run GKE's default OS image (Google Container-Optimized OS) and it has a number of properties, like filesystem changes reverting on reboot and upgrade, which make it hard or annoying to do this.

I think our default behavior should be the same as our best practices. In this case, do change the vm.max_map_count setting unless we can't find any way to have it by default. Otherwise, I feel that our user experience, especially involving support, would always open on a sour note: a user experiencing a performance issue, and we may blame the lack of a correct vm.max_map_count setting.
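For reference, the pod-level sysctl mechanism alluded to above looks roughly like the sketch below (not from this thread; names are placeholders). Kubernetes only applies namespaced sysctls from a pod spec, and vm.max_map_count is a node-level sysctl, so this route does not actually solve the problem, which is what the issue description below concludes about PodSecurityContext:

```yaml
# Illustration only: the pod securityContext sysctls field.
# vm.max_map_count is NOT namespaced, so the kubelet rejects this entry even
# if a PodSecurityPolicy whitelists it via allowedUnsafeSysctls.
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-demo              # hypothetical name
spec:
  securityContext:
    sysctls:
      - name: vm.max_map_count   # node-level sysctl: not settable per pod
        value: "262144"
  containers:
    - name: main
      image: busybox
      command: ['sh', '-c', 'sleep 3600']
```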
Checking the Helm chart, it has an optional initContainer to set the sysctl settings. The Helm chart docs are interestingly worded to imply that you are expected to set it yourself. The way I read this implies that it is not optional (though we know one can choose niofs instead of mmapfs, our docs warn this has bugs on Windows). Some prior discussions for this optional sysctl initContainer on the Helm chart, if that helps:
I feel like this is waaaaaay below the Kubernetes abstraction layer. A user is thinking about deploying Elasticsearch, and now we're forcing them to think about weird words like "mmap" and "sysctl". I don't think we should hide these details, but I wonder if we could suggest something on the same abstraction level as Kubernetes? Something like: it's possible the operator could detect these situations on its own (does an appropriate PodSecurityPolicy exist? etc.). It'd be ideal if users didn't have to think about this ;)
Does something provider-specific not work, or am I missing your point?
This works, manually. But we have dozens of GKE nodes coming and going week to week. We would have to build a system for detecting new GKE nodes and deploying this sysctl change quickly before we could deploy Elasticsearch. It adds another step, which adds complexity and room for error.
> On Wed, Sep 18, 2019 at 1:25 AM Peter Brachwitz wrote:
>
> However, this has been in my experience the only way to make host-level changes on GKE.
>
> Does something provider-specific not work, or am I missing your point? For example:
>
> `gcloud compute ssh $USER@$INSTANCE --project=$GCP_PROJECT --zone=$ZONE --command="sudo sysctl -w vm.max_map_count=262144"`
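To make the scaling objection above concrete, a hand-rolled version of the per-node approach would look something like the sketch below (not from the thread; the project variable and node-name filter are assumptions). It only covers nodes that exist when it runs, so it has to be re-run whenever nodes are added or recreated, which is exactly the extra moving part being objected to:

```sh
# Sketch: apply the sysctl to every existing GKE node over SSH.
# Newly created or recreated nodes are not covered until the next run.
gcloud compute instances list \
  --project="$GCP_PROJECT" \
  --filter="name~'^gke-'" \
  --format="value(name,zone.basename())" |
while read -r name zone; do
  gcloud compute ssh "$name" --project="$GCP_PROJECT" --zone="$zone" \
    --command="sudo sysctl -w vm.max_map_count=262144"
done
```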
@jordansissel you can configure the GKE nodes' bootstrap and add the sysctl setting there, so it happens e.g. before the kubelet even starts. This makes the most sense to me: configuring host-level settings on the host, at bootstrap.

This is a host-level sysctl (not namespaced), so one container setting it will affect all others on the host. Setting it in an init container and hoping nothing overwrites it with something incompatible doesn't feel like an obvious/perfect solution, assuming possible multi-tenancy/mixed workloads here.

Nothing would prevent you from including such an init container on your own, though, if you accept the trade-off; a few lines of YAML is sufficient:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.2.0
  nodes:
  - name: default
    nodeCount: 1
    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
```

In the end, this is going to have to be a trade-off between ease of use, security and isolation, and we need to have this discussion in order to pick the default behavior that we believe makes the most sense. There are different kinds of users with different expectations here, and it looks like every option currently on the table seems very odd to at least one of those groups?
Also to note: when a GKE node reboots, it resets all changes, so we would have to make sure this is detected and the sysctl is applied before Elasticsearch comes back online. The best way to do this on GKE is with k8s resources (a DaemonSet or initContainer).
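A minimal sketch of that DaemonSet idea (not something the operator ships; names, namespace and images are assumptions): a privileged init container sets the sysctl on every node, and because the DaemonSet keeps a pod on each node, the setting is reapplied when nodes reboot or are replaced.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-vm-max-map-count   # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: set-vm-max-map-count
  template:
    metadata:
      labels:
        app: set-vm-max-map-count
    spec:
      initContainers:
        - name: sysctl
          image: busybox
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
      containers:
        # A tiny long-running container keeps the pod (and the DaemonSet's
        # per-node guarantee) alive after the init container has finished.
        - name: pause
          image: registry.k8s.io/pause:3.9
```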
How? I'm unable to find any boot/configuration settings for Google Container-Optimized OS without building my own image.
Agreed. If we can still use a custom initContainer or securityContext (assuming a PSP will work for this), that should help us continue upgrading ECK without tons of development effort (i.e., we don't have to build custom GKE node images, etc.).

Is there still a "quick start" getting-started experience without this sysctl hack? If it's removed, Elasticsearch won't start unless we disable allow_mmap by default, which is not suitable for production and will be blamed quickly in support scenarios[1].

[1] When seeking advice on our own ECK clusters, the first reaction is always "stop using Google Persistent Disk and use local disks", even before the situation is evaluated to form a hypothesis. If we disable mmap by default, I believe we'll open the same scenario for ECK users: an unevaluated "use hybridfs instead of niofs" presented whenever performance concerns are raised. This will provide a bad experience for users who may not have chosen this outcome.

Everyone seems to have to set this sysctl their own special way: initContainer, securityContext with a PodSecurityPolicy, by managing the OS the kubelet runs on, etc. This feels weird. Can we bring this problem upstream to Kubernetes?

Can we revisit with the Elasticsearch team to allow Elasticsearch to run with the default vm.max_map_count? The default on GKE is 65536, and our minimum "required" is 4x that. The default on Ubuntu 18.04 seems to be 65530. Is 262144 really some magic number and 65536 is wrong?
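For anyone comparing those defaults on their own nodes, the effective value is easy to check (a quick sketch; the node name is a placeholder). Because vm.max_map_count is not namespaced, a container sees the node-level value, so SSH access is not strictly needed:

```sh
# Directly on a node (e.g. over SSH):
cat /proc/sys/vm/max_map_count

# From inside the cluster, pinned to a specific node:
kubectl run check-mmap --rm -it --restart=Never --image=busybox \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"NODE_NAME"}}' \
  -- cat /proc/sys/vm/max_map_count
```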
This was my understanding as well and what led me to be okay with keeping the init container as an option. It appears startup scripts are still blacklisted (https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1/NodeConfig). And while people running their own k8s will likely be building their own images, I really doubt that will be the case for most people on managed k8s platforms.
Found a recent discussion where this was rejected: elastic/elasticsearch#45815
Just in case I can't make the meeting tomorrow, here is my overall thinking on this:
I'm thinking that for a few reasons, maybe we should default

On the quickstart note, I feel the best way we can be sure everyone is successful is to set

Thoughts? I'll ping #elasticsearch and see if they have recommendations here.
I could definitely be persuaded. The docs I saw didn't make it clear what the trade-off of running only niofs was, just that one could do it. I'm assuming there's a reason for the hybridfs setting. If there's no real downside, the portability benefits are compelling. That said, I don't think it changes my opinion on the
What about keeping the Elasticsearch default?
Reading this issue, it doesn't seem like any of us feel comfortable making the recommendation to do that just yet though, and we would need to discuss it with the ES team. It seems like that would affect a bunch of people's opinions on the

I think Njal's suggestion to turn off

That's correct, @nkvoll and I discussed this a few weeks ago, and disabling
Ah, I must have misread then. Given that, I think I'm good with your suggestion here:

Though I think it may still be useful to keep the functionality in ECK but buried under

@jasontedor just to make sure I understand, your opinion is that for the quickstart we should disable
After some more discussions offline we decided to

There is still some disagreement as to what the best approach is for the documentation part. The two suggestions at the moment are:
```sh
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.2.0
  nodes:
  - name: default
    nodeCount: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      ## Hosts running ES should set vm.max_map_count according to
      ## https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
      ##
      ## If you're not able to configure this, node.store.allow_mmap can be set to false to
      ## disable memory mapped files to avoid this requirement, at the cost of performance
      ## (see https://www.elastic.co/guide/en/elasticsearch/reference/current/_maximum_map_count_check.html)
      #node.store.allow_mmap: false
    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
EOF
```
My vote: the example has

Gotta admit I am confused on the thought process here; the Helm chart makes the
Hey @brsolomon-deloitte, I think this issue offers some context on how we ended up with this decision. But just to reiterate: one of the main drivers for not including this by default is that it is considered bad practice to run privileged containers to manipulate the underlying hosts. Giving any container that kind of privilege carries a substantial security risk (the container can basically do anything on the host), and it is therefore typically also not allowed on secured Kubernetes installations.

For any production scenario we would expect the cluster admin to set up the hosts for best Elasticsearch performance. For proof-of-concept level work or development we document the init container hack here: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-virtual-memory.html

Hope that helps make our decision process clearer.
We think we are giving this attribute too much prominence, given that it is a hack to enable a smooth getting-started experience.
Part of #1141
Move the existing attribute one level down
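For illustration only: the issue title implies a top-level `setVmMaxMapCount` flag on the spec today, and the target of the move is exactly what this issue is discussing, so the nested location below is a placeholder, not a settled API.

```yaml
# Today: a prominent top-level flag on the Elasticsearch spec
spec:
  setVmMaxMapCount: true

# Proposed direction: the same switch moved one level down,
# e.g. into a per-node or pod-level section (placeholder location)
spec:
  nodes:
    - name: default
      setVmMaxMapCount: true
```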
Why not just securityContext?
Just using PodSecurityContext does not work, because the sysctl we want to use is typically not whitelisted, thereby defeating the purpose of this exercise.

Annotations
Looks like on OpenShift you can use annotations as well, so there is some precedent. This has the downside of creating an implicit API.