Move setVMMaxMapCount from top-level in spec #1712

Closed
pebrc opened this issue Sep 11, 2019 · 24 comments · Fixed by #1839
Labels: discuss (We need to figure this out), v1.0.0-beta1

Comments

pebrc (Collaborator) commented Sep 11, 2019

We think we are giving this attribute too much prominence, given that it is a hack to enable a smooth getting-started experience.

Part of #1141

Move the existing attribute one level down

kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.2.0
  unsafe: # tweaks, hacks, 
    setVmMaxMapCount: true 
  nodeSets:
  - name: default
    count: 3 

Why not just securityContext?

Just using PodSecurityContext does not work, because the sysctl we want to use is typically not whitelisted, thereby defeating the purpose of this exercise.

apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 7.3.0  
  nodeSets:
  - name: default
    podTemplate:
      spec:
        securityContext:
          sysctls:
          - name: vm.max_map_count
            value: "262144"
    count: 1
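
For context on what "whitelisted" means here: unsafe sysctls have to be explicitly allowed per node via the kubelet, roughly as in the sketch below (the flag exists; the rest is illustrative). Even then, the kubelet only applies this to namespaced sysctls, and vm.max_map_count is a node-level sysctl, so this route still does not help in our case.

# Allow an unsafe sysctl on a self-managed node (kubelet flag, or the
# equivalent allowedUnsafeSysctls list in the kubelet configuration file).
kubelet --allowed-unsafe-sysctls 'vm.max_map_count' ...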

Annotations

Looks like on OpenShift you can use annotations as well, so there is some precedent:

apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
  annotations:
    security.alpha.kubernetes.io/sysctls: kernel.shm_rmid_forced=1
    security.alpha.kubernetes.io/unsafe-sysctls: net.ipv4.route.min_pmtu=1000,kernel.msgmax=1 2 3
spec:

This has the downside of creating an implicit API

sebgl added the "discuss (We need to figure this out)" label on Sep 12, 2019
nkvoll (Member) commented Sep 13, 2019

Promoting tweaking the host in an init container feels a little rough (but can totally be done, IF you accept the security risks associated with it), so I'm not sure we should be strongly advocating it. This is an ES feature (both memory mapped files -- a lot of them, and bootstrap checks -- yay!), which can be disabled via config (https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html#allow-mmap):

You can restrict the use of the mmapfs and the related hybridfs store type via the setting node.store.allow_mmap. This is a boolean setting indicating whether or not memory-mapping is allowed. The default is to allow it. This setting is useful, for example, if you are in an environment where you can not control the ability to create a lot of memory maps, so you need to disable the ability to use memory-mapping.

I think we can make the argument that we can instruct our users to either 1) set the sysctl in one of several ways, or 2) disable mmap because they are in an environment where they cannot control the ability to create a lot of memory maps. The description seems to fit.

Thus, I propose the following:

  1. Remove setVmMaxMapCount from the spec. It's promoting bad behavior from containers.

  2. Provide something like this as a quickstart:

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.2.0
  nodes:
  - name: default
    nodeCount: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      ## Hosts running ES should set vm.max_map_count according to
      ## https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
      ##
      ## If you're not able to configure this, node.store.allow_mmap can be set to false to
      ## disable memory mapped files to avoid this requirement, at the cost of performance
      ## (see https://www.elastic.co/guide/en/elasticsearch/reference/current/_maximum_map_count_check.html)
      #node.store.allow_mmap: false
EOF

We should probably discuss within the team whether we'd like to have this uncommented by default (for the quickstart only), and whether we'd like to document this further (e.g. add a section to "troubleshooting" that explains it and allows for easier discovery when running into the issue outside of the quickstart).

A follow-up question should go to the ES team about whether they are fine with this as the quickstart contents, or whether there are better ways we can/should deal with this (e.g. potentially allowing a lower number of memory-mapped files for small clusters in the bootstrap check?).

jordansissel (Contributor) commented

Promoting tweaking the host in an init container feels a little rough

Agreed. However, this has been in my experience the only way to make host-level changes on GKE. To phrase this another way, 100% of GKE users[1] are likely to need this sysctl vm.max_map_count setting.

In past explorations, I've found folks requesting things like a "DaemonJob" which would act like a DaemonSet (schedule on all nodes) but otherwise act like a Job (run once). I don't think anyone's got a solution for a 'daemon job' style thing.

An initContainer with elevated privileges is the shortest path to declare "In order for this pod to be successful, the host needs to be configured a certain way." There's been recent work to move sysctl stuff (as mentioned above in the description) with a PodSecurityPolicy, and maybe we can do something along those lines and then set this in the Elasticsearch pod's securityContext?
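For illustration, a minimal sketch of what such a PodSecurityPolicy could look like (allowedUnsafeSysctls is a real policy/v1beta1 field; the name and the permissive boilerplate are placeholders, and whether the kubelet would actually accept vm.max_map_count from a pod securityContext is a separate question, since it is a node-level sysctl):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: elasticsearch-sysctls
spec:
  # Allow pods admitted under this policy to request the unsafe sysctl.
  allowedUnsafeSysctls:
  - "vm.max_map_count"
  # The remaining fields are the usual permissive boilerplate a PSP requires.
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'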

[1] I don't know how common it is for someone to run GKE but also do config management on their nodes (like with puppet or ansible). On Infosec, we run GKE's default OS image (Google Container-optimized OS) and it has a number of properties, like the filesystem changes revert on reboot and upgrade, which make it hard or annoying to do this.

I think our default behavior should be the same as our best practices. In this case, do change the vm.max_map_count setting unless we can't find any way to have it by default. Otherwise, I feel that our user experience, especially involving support, would always open on a sour note: a user experiences a performance issue, and we may blame the lack of a correct vm.max_map_count.

jordansissel (Contributor) commented

Checking the helm chart, it has an optional initContainer to set the sysctl settings.

The helm chart docs are interestingly worded to imply that you are expected to set vm.max_map_count somehow, whether by the initContainer or by some other means.

sysctlInitContainer.enabled | Allows you to disable the sysctlInitContainer if you are setting vm.max_map_count with another method

The way I read this, it implies that the setting is not optional (though we know one can choose niofs instead of mmapfs, but our docs warn this has bugs on Windows).

Some prior discussions for this optional sysctl initContainer on the helm chart, if that helps:

jordansissel (Contributor) commented

disable mmap because if you are in an environment where you can not control the ability to create a lot of memory maps

I feel like this is waaaaaay below the Kubernetes abstraction layer. A user is thinking about deploying Elasticsearch, and now we're forcing them to think about weird words like "mmap" and "sysctl". I don't think we should hide these details, but I wonder if we could suggest something on the same abstraction level as Kubernetes? Something like:

In order to achieve the best filesystem performance, you should set the vm.max_map_count sysctl. You can have ECK do this for you if you install the appropriate PodSecurityPolicy:

{example pod security policy}

If you can't install this PodSecurityPolicy, you will need to set an appropriate sysctl vm.max_map_count on all kubelet nodes yourself, or you will need to set [some Elasticsearch CRD setting] to false which causes Elasticsearch to avoid mmap, resulting in slower filesystem performance.

It's possible the operator could detect these situations on its own (Does an appropriate PodSecurityPolicy exist? etc)

It'd be ideal if users didn't have to think about this ;)

pebrc (Collaborator, Author) commented Sep 18, 2019

However, this has been in my experience the only way to make host-level changes on GKE.

Does something provider-specific not work, or am I missing your point?
For example:
gcloud compute ssh $USER@$INSTANCE --project=$GCP_PROJECT --zone=$ZONE --command="sudo sysctl -w vm.max_map_count=262144"

jordansissel (Contributor) commented Sep 18, 2019 via email

nkvoll (Member) commented Sep 18, 2019

@jordansissel you can configure the GKE nodes' bootstrap and add the sysctl setting there, so it happens e.g. before the kubelet even starts. This makes the most sense to me: configuring host-level settings on the host, at bootstrap.

This is a host-level sysctl (not namespaced), so one container setting it will affect all others on the host. Setting it in an init container and hoping nothing overwrites it with something incompatible doesn't feel like an obvious/perfect solution, assuming possible multi-tenancy/mixed workloads here.

Nothing would prevent you from including such an init container on your own, though, if you accept the trade-off; a few lines of YAML is sufficient:

apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.2.0
  nodes:
  - name: default
    nodeCount: 1
    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']

In the end, this is going to have to be a trade-off between ease of use, security, and isolation, and we need to have this discussion in order to pick the default behavior that we believe makes the most sense. There are different kinds of users with different expectations here, and it looks like every option currently on the table seems very odd to at least one of those groups.

jordansissel (Contributor) commented Sep 18, 2019 via email

jordansissel (Contributor) commented

you can configure the GKE nodes bootstrap and add the sysctl setting there

How? I'm unable to find any boot/configuration settings for Google Container OS without building my own image.

this is going to have to be a trade-off between ease of use, security, isolation

Agreed. If we can still use a custom initContainer or securityContext (assuming a PSP will work for this), that should help us continue upgrading ECK without tons of development effort (i.e. we don't have to build custom GKE node images, etc.).

Is there still a "quick start" getting-started experience without this sysctl hack? If it's removed, Elasticsearch won't start, unless we disable allow_mmap by default, which is not suitable for production and will be blamed quickly in support scenarios [1].

[1] when seeking advice on our own ECK clusters, the first reaction is always "Stop using Google Persistent Disk and use local disks" even before the situation is evaluated to form a hypothesis. If we disable mmap by default, I believe we'll open the same scenario for ECK users, an unevaluated "Use hybridfs instead of niofs" presented when any performance concerns are raised. This will provide a bad experience for users who may not have chosen this outcome.

Everyone seems to have to set sysctl their own special way: initContainer, securityContext w/ an PodSecurityPolicy, by managing the OS the kubelet runs on, etc. This feels weird. Can we bring this problem upstream to Kubernetes?

Can we revisit with the Elasticsearch team to allow Elasticsearch to run with the default vm.max_map_count? The default on GKE is 65536, and our minimum "required" is 4x that. The default on Ubuntu 18.04 seems to be 65530. Is 262144 really some magic number and 65536 is wrong?
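
For reference, the effective value on a node can be checked directly (vm.max_map_count is not namespaced, so a pod on that node sees the host value):

cat /proc/sys/vm/max_map_count
# or, equivalently
sysctl vm.max_map_count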

anyasabo (Contributor) commented

How? I'm unable to find any boot/configuration settings for Google Container OS without building my own image.

This was my understanding as well and what led me to be okay with keeping the init container as an option. It appears startup scripts are still blacklisted:

https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1/NodeConfig

And while people running their own k8s will likely be building their own images, I really doubt that will be the case for most people on managed k8s platforms.

jordansissel (Contributor) commented

Can we revisit with the Elasticsearch team to allow Elasticsearch to run with the default vm.max_map_count

Found a recent discussion where this was rejected: elastic/elasticsearch#45815

anyasabo (Contributor) commented

Just in case I can't make the meeting tomorrow, here is my overall thinking on this:

  • I do think it is useful to have this flag, given that it is a pain to configure this in a hosted environment like GKE (and potentially also AKS, EKS).

  • No strong opinion on if it defaults to true or false. I think I lean towards false because I don't think I want to encourage privileged pods that change the host by default. I can see the other argument that it is helpful to default to true from a quickstart perspective though, but I don't think that outweighs encouraging an overall not great practice of allowing all privileged pods.

  • I am in favor of moving it to a lower level. I don't particularly care what key we put it under. experimental, tweaks, and unsafe all work for me, and I probably prefer them in roughly that order.

  • I don't know enough about the trade offs to know if we can encourage niofs by default (and so avoid the mmap bootstrap check). Even if we do, I don't think that will change my mind on still wanting it to be an option but behind a key like experimental etc, because there may still be relevant use cases for mmapfs that would otherwise be a pain to use.

jordansissel (Contributor) commented

I'm thinking that for a few reasons, maybe we should default node.store.allow_mmap to false. Here's my reasoning:

  1. Allowing mmap requires vm.max_map_count to be set correctly. This requires one of these solutions:
     • k8s node administrator activity (outside ECK)
     • alternately, allow pods to set sysctls
     • alternately, use a privileged initContainer to set the sysctl
  2. There are too many ways for folks to set this sysctl, making it very difficult to have a "quickstart" under this scenario.
  3. Some users may not have any opportunity to set vm.max_map_count (due to policy or whatever preventing this setting).

On the quickstart note, I feel the best way we can be sure everyone is successful is to set allow_mmap to false. We can also write a "going to production" document. We have such a thing for Kibana, and Google has one for GKE.

Thoughts? I'll ping #elasticsearch and see if they have recommendations here.

anyasabo (Contributor) commented

I could definitely be persuaded. The docs I saw didn't make it clear what the trade off of running only niofs was, just that one could do it. I'm assuming there's a reason for the hybridfs setting. If there's no real downside, the portability benefits are compelling.

That said I don't think it changes my opinion on the setVmMaxMapCount feature -- we can still maintain it but put it behind an experimental flag of some sort.

pebrc (Collaborator, Author) commented Sep 23, 2019

On the quickstart note, I feel the best way we can be sure everyone is successful is to set allow_mmap to false. We can also write a "going to production" document. We have such a thing for Kibana, and Google has one for GKE.

What about keeping the Elasticsearch default (node.store.allow_mmap: true) and just disabling it in the quickstart example (as suggested in #1712 (comment))?

  • same defaults everywhere (ECK, outside of ECK)
  • quickstart will "just work"
  • we can remove the setting from the CRD
  • we can explain the reason for disabling in a footnote of the quickstart guide (not great I know)
  • we add a section about production configuration outlining the various ways to configure the kernel for Elasticsearch (privileged initContainer, pod security context, daemonset, k8s admin)
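
A minimal sketch of the DaemonSet variant from the last bullet, assuming a busybox image, a kube-system namespace, and illustrative names/labels:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: max-map-count-setter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: max-map-count-setter
  template:
    metadata:
      labels:
        name: max-map-count-setter
    spec:
      containers:
      - name: sysctl
        image: busybox
        securityContext:
          privileged: true
        # vm.max_map_count is not namespaced, so a privileged container changes the host value;
        # keep the pod running afterwards so the DaemonSet stays healthy.
        command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144 && while true; do sleep 3600; done']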

anyasabo (Contributor) commented

Reading this issue, it doesn't seem like any of us feel comfortable making that recommendation just yet, though, and we would need to discuss it with the ES team. It seems like that would affect a bunch of people's opinions on the setVmMaxMapCount flag in ECK, so I'm in favor of tabling this until we get an opinion from them.

pebrc (Collaborator, Author) commented Sep 23, 2019

so I'm in favor of tabling this until we get an opinion from them.

I think Njal's suggestion to turn off allow_mmap for the quickstart resulted from a conversation with @jasontedor, and I think the desire to keep the defaults the same across all platforms was also expressed.

jasontedor (Member) commented

That's correct, @nkvoll and I discussed this a few weeks ago, and disabling node.store.allow_mmap is the path that is least likely to give users a bad experience. Yes, there will be a performance difference, but in a lot of use cases it won't be noticeable, especially for a quickstart experience. Keeping the defaults the same is also a goal; we don't really want users to have a different experience depending on how they deploy Elasticsearch, so I'm supportive of only making this change for the quickstart.

anyasabo (Contributor) commented

Ah I must have misread then. Given that, I think I'm good with your suggestion here:

What about keeping the Elasticsearch default (node.store.allow_mmap: true) and just disable it in the quickstart example (as suggested in #1712 (comment))

Though I think it may still be useful to keep the functionality in ECK but buried under experimental so people who are okay with it can still use it.

@jasontedor just to make sure I understand, your opinion is that for the quickstart we should disable allow_mmap, but we still should have documentation on how to enable it and strongly recommend that it be enabled in production? Is that correct?

pebrc (Collaborator, Author) commented Sep 24, 2019

After some more discussions offline we decided to

  • remove the setVmMaxMapCount boolean from the CRD.
  • adjust the documentation to ensure our quickstart still gives a good copy/pasteable experience.
  • document the various ways to set sysctl settings on k8s

There is still some disagreement as to what the best approach is for the documentation part.

The two suggestions are at the moment:

  1. just use allow_mmap

[screenshot of the quickstart example, not reproduced]
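
Based on the quickstart proposal earlier in this thread, option 1 presumably amounts to something like the following (assumed reconstruction, not the actual screenshot contents):

cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.2.0
  nodes:
  - name: default
    nodeCount: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      # Disable memory mapping so the quickstart works without raising vm.max_map_count on the hosts.
      # For production, set vm.max_map_count on the hosts and remove this setting instead.
      node.store.allow_mmap: false
EOF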

  2. use an explicit init container
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.2.0
  nodes:
  - name: default
    nodeCount: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
      ## Hosts running ES should set vm.max_map_count according to
      ## https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
      ##
      ## If you're not able to configure this, node.store.allow_mmap can be set to false to
      ## disable memory mapped files to avoid this requirement, at the cost of performance
      ## (see https://www.elastic.co/guide/en/elasticsearch/reference/current/_maximum_map_count_check.html)
      #node.store.allow_mmap: false
    podTemplate:
      spec:
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
EOF




anyasabo (Contributor) commented Oct 1, 2019

My vote: the example has allow_mmap: false, with a link to a more detailed guide on the trade-offs and how to set max_map_count in a privileged init container.

sebgl (Contributor) commented Oct 1, 2019

My vote:
Option 1 looks simpler, especially if we document it as in @pebrc's screenshot, and provide a dedicated documentation page.
Option 2 looks way more verbose and advanced, but should definitely be documented in that dedicated doc page.

(edit: @anyasabo says it all in a single sentence 😄 )

brsolomon-deloitte commented Nov 16, 2021

Gotta admit I am confused about the thought process here; the Helm chart makes the sysctl calls in an initContainer by default, so why is it out of scope for this project to do the same as the default behavior? This project is advertised as "simplifying setup" but then makes what is likely the majority of users go out of their way with custom config.

pebrc (Collaborator, Author) commented Nov 17, 2021

Hey @brsolomon-deloitte, I think this issue offers some context on how we ended up with this decision. Just to reiterate, one of the main drivers for not including this by default is that it is considered bad practice to run privileged containers to manipulate the underlying hosts. Giving any container that kind of privilege carries a substantial security risk (the container can basically do anything on the host), and it is therefore typically not allowed on secured Kubernetes installations. For any production scenario we would expect the cluster admin to set up the hosts for best Elasticsearch performance. For proof-of-concept or development work we document the init container hack here: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-virtual-memory.html. Hope that helps make our decision process clearer.
