Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow total memory to be overridden #78750

Merged
merged 31 commits into from
Oct 16, 2021

Conversation

droberts195
Copy link
Contributor

Since #65905 Elasticsearch has determined the Java heap settings
from node roles and total system memory.

This change allows the total system memory used in that calculation
to be overridden with a user-specified value. This is intended to
be used when Elasticsearch is running on a machine where some other
software that consumes a non-negligible amount of memory is running.
For example, a user could tell Elasticsearch to assume it was
running on a machine with 3GB of RAM when actually it was running
on a machine with 4GB of RAM.

The system property is es.total_memory_bytes, so, for example,
could be specified using -Des.total_memory_bytes=3221225472.
(It is specified in bytes rather than using a unit, because it
needs to be parsed by startup code that does not have access to
the utility classes that interpret byte size units.)

Since elastic#65905 Elasticsearch has determined the Java heap settings
from node roles and total system memory.

This change allows the total system memory used in that calculation
to be overridden with a user-specified value. This is intended to
be used when Elasticsearch is running on a machine where some other
software that consumes a non-negligible amount of memory is running.
For example, a user could tell Elasticsearch to assume it was
running on a machine with 3GB of RAM when actually it was running
on a machine with 4GB of RAM.

The system property is `es.total_memory_bytes`, so, for example,
could be specified using `-Des.total_memory_bytes=3221225472`.
(It is specified in bytes rather than using a unit, because it
needs to be parsed by startup code that does not have access to
the utility classes that interpret byte size units.)
@droberts195 droberts195 added >enhancement :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts :Core/Infra/Core Core issues without another label :ml Machine learning v8.0.0 labels Oct 6, 2021
@droberts195 droberts195 marked this pull request as ready for review October 6, 2021 15:59
@elasticmachine elasticmachine added Team:Delivery Meta label for Delivery team Team:Core/Infra Meta label for core/infra team Team:ML Meta label for the ML team labels Oct 6, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Hi @droberts195, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Hi @droberts195, I've updated the changelog YAML for you.

Comment on lines 74 to 85
String totalMemoryBytesOption = userDefinedJvmOptions.stream()
.filter(option -> option.startsWith("-Des.total_memory_bytes="))
.findFirst()
.orElse(null);
if (totalMemoryBytesOption == null) {
return null;
}
try {
return Long.parseLong(totalMemoryBytesOption.split("=", 2)[1]);
} catch (NumberFormatException e) {
throw new IllegalArgumentException("Unable to parse number of bytes from [" + totalMemoryBytesOption + "]");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need an intermediate variable if you used Optional.map?

Suggested change
String totalMemoryBytesOption = userDefinedJvmOptions.stream()
.filter(option -> option.startsWith("-Des.total_memory_bytes="))
.findFirst()
.orElse(null);
if (totalMemoryBytesOption == null) {
return null;
}
try {
return Long.parseLong(totalMemoryBytesOption.split("=", 2)[1]);
} catch (NumberFormatException e) {
throw new IllegalArgumentException("Unable to parse number of bytes from [" + totalMemoryBytesOption + "]");
}
return userDefinedJvmOptions.stream()
.filter(option -> option.startsWith("-Des.total_memory_bytes="))
.findFirst()
.map(totalMemoryBytesOption -> {
try {
return Long.parseLong(totalMemoryBytesOption.split("=", 2)[1]);
} catch (NumberFormatException e) {
throw new IllegalArgumentException("Unable to parse number of bytes from [" + totalMemoryBytesOption + "]");
}
})
.orElse(null);

`total_override`::
(<<byte-units,byte value>>)
If the amount of physical memory has been overridden using the `es.total_memory_bytes`
system property on all selected nodes then this reports the sum of the overridden
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the total memory was only overridden on some of the selected nodes?

Copy link
Contributor Author

@droberts195 droberts195 Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a tricky one. I opted not to report any override at all at the cluster level if some nodes have overrides and others don't. (You can still get all the values from the node stats.) I guess the alternative would be to report the sum of overrides on the nodes that have overrides, plus un-overridden total on the nodes that don't, but only if at least one node has an override. Maybe that's better - I'd be interested to hear if subsequent reviewers have any thoughts on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also ties in with #78750 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to have any distinct concept of override in the stats at all. With available disk space, when limited through cgroups, we do not show what the real disk has available. Why not just show this as “this is the memory available”?

assertThat(OsProbe.getTotalMemoryOverride("123456789123456789"), is(123456789123456789L));
IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> OsProbe.getTotalMemoryOverride("abc"));
assertThat(e.getMessage(), is("Invalid value for [es.total_memory_bytes]: [abc]"));
// Although numeric, this value overflows long. This won't be a problem if the override is set appropriately on a 64 bit machine.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this won't be a problem on a 64 bit machine - can you elaborate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 64 bit pointer can address 16 exabytes of memory. A signed long can only store half that, but it's still huge. I think as machines start to have memory in the exabyte ranges we'll see 128 bit CPUs and languages will be changed to have 128 bit primitive integer types. At that time all the places where we've stored memory sizes in long will need to be changed to that new 128 bit primitive type, but that's decades in the future.

Basically, for the foreseeable future there's no justification for anyone to need to tell Elasticsearch to work on the assumption that the total memory on the machine is greater than 8 exabytes, so it doesn't matter that attempting to do so is an error. I'll change the comment to say that.

Copy link
Contributor

@mark-vieira mark-vieira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments. I think we should get somebody from core/infra to give this a once over as well. Otherwise 👍 .

* is read from the "es.total_memory_bytes" system property. Negative
* values or not set at all mean no override.
*/
public Long getTotalMemoryOverride() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should call this something less implementation-specific, like "totalUsableMemory"? I think in that case it would also make sense to have this return a long primative as well, and if there's no override defined, we just return the same value as totalSystemMemory. This saves downstream callers the trouble of null checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

assertThat(xArgs, hasItems("-Xms304m", "-Xmx304m"));
}

private List<String> machineDependentHeapTest(final String containerMemory, final List<String> extraJvmOptions) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure we're testing this scenario in our ArchiveTests and PackageTests as well to get full coverage across packaging types.

@droberts195
Copy link
Contributor Author

droberts195 commented Oct 11, 2021

Just summarizing some of the comments that have been raised against different lines but are all very interrelated:

  1. Is this like cgroups?
  2. Can we just modify total and free instead of introducing a 3rd field?
  3. If we do introduce a 3rd field, it should always be present, not only present if memory was overridden.

This is not exactly like cgroups. The key difference is that cgroups enforces boundaries but this functionality does not. So with cgroups, if the memory of the machine is 16GB but cgroups says this container can use 4GB then the OS will enforce that the container really can only use 4GB. With this functionality if a container has 4GB but we want Elasticsearch to work on the basis that something else will be using 1GB then there is nothing at the OS level that will stop Elasticsearch using more than 3GB, nor stop the other processes from using more than 1GB. It has been suggested that a better solution to doing all this would be to run the other processes in a separate container, and I guess this issue shows that that is indeed the better solution from the perspective of different processes fighting over memory. But I guess it also creates other problems when the other process is supposed to be monitoring the container where Elasticsearch is running and cannot get all the stats it wants when running outside the Elasticsearch container.

This also makes it hard to modify free to reflect the override. Suppose total is 1024m, and this is overridden to 844m (i.e. 180m reduction) to allow metricbeat to run in the same container without causing of OOM errors. Suppose at the time the node stats are reported free is 233m, and metricbeat is using 97m. It would be wrong to subtract 180m from 233m and report that as free. Instead we should subtract 83m from free, which is the 180m reduction less the 97m that metricbeat is currently using. But to actually do this, we'd need to know that metricbeat is the reason for the reduction, and be able to find it in the process table and get its stats. I don't think a non-root process would be able to find the stats for another non-root process, and even if it could passing the details of the process or multiple processes that the memory override relates to is going to add a lot of extra complexity.

As a result of the answers to points 1 and 2 I think the only way forward is to add a third field in addition to total and free. Messing with total and free in some way that doesn't reflect the operating system's opinion of them could make support cases where memory is a factor very challenging to work on. I will change this third field so that it's always present, but I think since it's always going to be there we should spend a bit more time thinking about the best name for it. total_usable is one option, but I wonder if usable might cause confusion about how it relates to free. Some other options could be ration, budget, share or allowance. Please let me know if you have any thoughts or if there's anyone else who's particularly good at choosing good field names that I should consult.

@@ -1286,7 +1286,8 @@ public static boolean allTemplatesInstalled(ClusterState clusterState) {

/**
* Find the memory size (in bytes) of the machine this node is running on.
* Takes container limits (as used by Docker for example) into account.
* Takes container limits (as used by Docker for example) and forced memory
* size overrides into account.
*/
static long machineMemoryFromStats(OsStats stats) {
long mem = stats.getMem().getTotal().getBytes();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion about cgroups made me think that the 10 lines below (not changed by this PR, so expand the diff to see them) are unnecessary now. They certainly were necessary back in some 6.x release when they were first added.

Is it the case that OperatingSystemMXBean takes cgroups into account only for Java 14 and above, hence:

throw new SystemMemoryInfoException("The minimum required Java version is 14 to use " + this.getClass().getName());

?

If that's the case then this method can be simplified for 8.0+ when we're mandating at least Java 17.

@droberts195
Copy link
Contributor Author

I wonder if usable might cause confusion about how it relates to free. Some other options could be ration, budget, share or allowance.

I went with adjusted_total in 449cf2a, which makes the new field always present.

I also removed ML's hand coded application of cgroups limits on the assumption Java 14+ does this automatically.

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

static Long getTotalMemoryOverride(String memoryOverrideProperty) {
try {
long memoryOverride = Long.parseLong(memoryOverrideProperty);
return (memoryOverride < 0) ? null : memoryOverride;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why lenience for a negative value, shouldn't that be an error? Instead could we use null as the sentinel for the property not existing?

assert total >= 0 : "expected total memory to be positive, got: " + total;
assert free >= 0 : "expected free memory to be positive, got: " + total;
assert adjustedTotal >= 0 : "expected adjusted total memory to be positive, got: " + adjustedTotal;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we are asserting here to ensure it is positive, but since asserts are not normally running in production, I think a hard check when parsing would still be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reluctant to make it throw exceptions in production here. The assertions date back to #42725, which was fixing #42157. #54415 and #56435 are also related. Even though those issues showed up in CI, I know there were also a load of problems reported by end users related to exceptions caused by negative memory values. This included nasty problems like ECK being unusable on certain operating systems.

This constructor is only called in 2 places in production code, and the code paths that lead to those calls were all changed in #42725 to ensure none of the arguments can be negative. Changing negative values reported by the JVM to zero was the general approach to solving all the related problems.

I'll make the two constructors log an error and change negative values to zero after the assertions. This will add an extra layer of protection that negative values won't corrupt the stats API outputs if a future code change means a negative value gets this far in production. And the assertions will still trip CI, so we should still find out about future JVM bugs or OS bugs as soon as we test with them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, but I was specifically referring to the adjusted total, which does not come from the JVM. CI won’t catch that since it will be passed in externally at runtime? Surely we can enforce our own system property does not have a negative value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely we can enforce our own system property does not have a negative value?

Yes, sure, that's done in https://github.com/elastic/elasticsearch/pull/78750/files#diff-709e127f78f07be389e23093a3fd58cde32a752de9b80a7f64230e330d744e36R148 now.

So none of the arguments to this constructor should be negative now.

The adjusted value might still come from the JVM if the system property wasn't set at all, because it will have been set to be the same as total. This class is really just a holder for passing previously calculated values around.

@droberts195
Copy link
Contributor Author

@elasticmachine update branch

@droberts195
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/part-1

@droberts195 droberts195 merged commit e86de06 into elastic:master Oct 16, 2021
@droberts195 droberts195 deleted the add_memory_override branch October 16, 2021 11:01
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 18, 2021
* upstream/master: (109 commits)
  Migrate custom role providers to licensed feature (elastic#79127)
  Remove stale AwaitsFix in InternalEngineTests (elastic#79323)
  Fix errors in RefreshListenersTests (elastic#79324)
  Reeable BwC Tests after elastic#79318 (elastic#79320)
  Mute BwC Tests for elastic#79318 (elastic#79319)
  Reenable BwC Tests after elastic#79308 (elastic#79313)
  Disable BwC Tests for elastic#79308 (elastic#79310)
  Adjust BWC for node-level field cap requests (elastic#79301)
  Allow total memory to be overridden (elastic#78750)
  Fix SnapshotBasedIndexRecoveryIT#testRecoveryIsCancelledAfterDeletingTheIndex (elastic#79269)
  Disable BWC tests
  Mute GeoIpDownloaderCliIT.testStartWithNoDatabases (elastic#79299)
  Add alias support to fleet search API (elastic#79285)
  Create a coordinating node level reader for tsdb (elastic#79197)
  Route documents to the correct shards in tsdb (elastic#77731)
  Inject migrate action regardless of allocate action (elastic#79090)
  Migrate to data tiers should always ensure a TIER_PREFERENCE is set (elastic#79100)
  Skip building of BWC distributions when building release artifacts (elastic#79180)
  Default ENFORCE_DEFAULT_TIER_PREFERENCE to true (elastic#79275)
  Deprecation of transient cluster settings (elastic#78794)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/IndexMode.java
#	server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 18, 2021
* upstream/master: (521 commits)
  Migrate custom role providers to licensed feature (elastic#79127)
  Remove stale AwaitsFix in InternalEngineTests (elastic#79323)
  Fix errors in RefreshListenersTests (elastic#79324)
  Reeable BwC Tests after elastic#79318 (elastic#79320)
  Mute BwC Tests for elastic#79318 (elastic#79319)
  Reenable BwC Tests after elastic#79308 (elastic#79313)
  Disable BwC Tests for elastic#79308 (elastic#79310)
  Adjust BWC for node-level field cap requests (elastic#79301)
  Allow total memory to be overridden (elastic#78750)
  Fix SnapshotBasedIndexRecoveryIT#testRecoveryIsCancelledAfterDeletingTheIndex (elastic#79269)
  Disable BWC tests
  Mute GeoIpDownloaderCliIT.testStartWithNoDatabases (elastic#79299)
  Add alias support to fleet search API (elastic#79285)
  Create a coordinating node level reader for tsdb (elastic#79197)
  Route documents to the correct shards in tsdb (elastic#77731)
  Inject migrate action regardless of allocate action (elastic#79090)
  Migrate to data tiers should always ensure a TIER_PREFERENCE is set (elastic#79100)
  Skip building of BWC distributions when building release artifacts (elastic#79180)
  Default ENFORCE_DEFAULT_TIER_PREFERENCE to true (elastic#79275)
  Deprecation of transient cluster settings (elastic#78794)
  ...

# Conflicts:
#	rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/tsdb/10_settings.yml
#	server/src/main/java/org/elasticsearch/common/settings/IndexScopedSettings.java
#	server/src/main/java/org/elasticsearch/common/settings/Setting.java
#	server/src/main/java/org/elasticsearch/index/IndexMode.java
#	server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Nov 9, 2021
The current capacity in use in autoscaling would use the full container
memory and not the adjusted total memory. ES sometimes responds with
`current_capacity` as `required_capacity` and the `current_capacity`
therefore need to use the adjusted capacity instead (since the
orchestration should add the memory reservation on top).

Relates elastic#78750
henningandersen added a commit that referenced this pull request Nov 9, 2021
The current capacity in use in autoscaling would use the full container
memory and not the adjusted total memory. ES sometimes responds with
`current_capacity` as `required_capacity` and the `current_capacity`
therefore need to use the adjusted capacity instead (since the
orchestration should add the memory reservation on top).

Relates #78750
henningandersen added a commit that referenced this pull request Nov 9, 2021
The current capacity in use in autoscaling would use the full container
memory and not the adjusted total memory. ES sometimes responds with
`current_capacity` as `required_capacity` and the `current_capacity`
therefore need to use the adjusted capacity instead (since the
orchestration should add the memory reservation on top).

Relates #78750
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts >enhancement :ml Machine learning Team:Core/Infra Meta label for core/infra team Team:Delivery Meta label for Delivery team Team:ML Meta label for the ML team v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants