Profile Che-Theia to determine the correct memory/cpu resources #18565

Closed
azatsarynnyy opened this issue Dec 9, 2020 · 12 comments
Labels: area/editor/theia, kind/task, severity/P1

Comments

@azatsarynnyy
Member

azatsarynnyy commented Dec 9, 2020

Is your task related to a problem? Please describe.

When a workspace starts, Che-Theia briefly shows that it is in Offline mode.
The Offline indicator is displayed only for a short time, but it makes for a poor UX.
The Offline indicator means that the Che-Theia backend did not respond to a ping request within the timeout.
Most likely, this happens because no memoryRequest/cpuLimit is specified for the Che-Theia sidecar.

Describe the solution you'd like

We need to profile Che-Theia to determine the correct memoryRequest/cpuLimit values to set for Che-Theia sidecar.

Describe alternatives you've considered

Additional context

@azatsarynnyy azatsarynnyy added kind/task Internal things, technical debt, and to-do tasks to be performed. severity/P1 Has a major impact to usage or development of the system. area/editor/theia Issues related to the che-theia IDE of Che labels Dec 9, 2020
@azatsarynnyy azatsarynnyy added this to the 7.24 milestone Dec 9, 2020
@azatsarynnyy
Member Author

After a detailed investigation, we figured out that Che-Theia's Offline mode is caused by issues in Theia's connection status check mechanism rather than by the requested/limited cluster resources.
So, the Offline mode problem will be fixed in a separate issue.

@vzhukovs please provide more details on your investigation results.

@azatsarynnyy azatsarynnyy added the status/in-progress This issue has been taken by an engineer and is under active development. label Dec 31, 2020
@vzhukovs
Contributor

vzhukovs commented Jan 3, 2021

Here are some investigation results on the memory and CPU configuration.

All measurements used a workspace based on eclipse/che-devfile-registry/devfiles/java-mysql/devfile.yaml.

Metrics were gathered with Prometheus and Grafana.
Che was deployed on minikube v1.16.0 using hyperkit with CPUs=2, Memory=6000MB, Disk=20000MB.

Measurement 1 (with default configuration):

  • Theia container
    • memoryLimit: 512m
    • memoryRequest: default
    • cpuLimit: default
    • cpuRequest: default

Here we can observe the general resource usage for the whole workspace:
[Screenshot: Grafana, Kubernetes compute resources, namespace pods, 2021-01-03 21:42:14]

Here we can observe the resource usage for the Theia container:
[Screenshot: Grafana, Kubernetes compute resources, pod, 2021-01-03 21:43:22]

At workspace startup we can see that Theia sometimes goes into offline mode.

Measurement 2:

  • Theia container
    • memoryLimit: 512m
    • memoryRequest: 512m
    • cpuLimit: default
    • cpuRequest: default

Here we can observe the general resource usage for the whole workspace:
[Screenshot: Grafana, Kubernetes compute resources, namespace pods, 2021-01-03 21:42:14]

Here we can observe the resource usage for the Theia container:
[Screenshot: Grafana, Kubernetes compute resources, pod, 2021-01-03 21:43:22]

The offline mode behaves the same as in the first measurement. The memory request doesn't affect the web socket connection.

Measurement 3:

  • Theia container
    • memoryLimit: 512m
    • memoryRequest: 512m
    • cpuLimit: 100m or 0.1 cpu
    • cpuRequest: default

Here we can observe the resource usage for the whole workspace:
[Screenshot: Grafana, Kubernetes compute resources, namespace pods, 2021-01-03 22:37:33]

The overall CPU limit grew to 0.6 with this configuration.

Here we can observe how the Theia container is throttled at workspace startup:
[Screenshot: Grafana, Kubernetes compute resources, pod, 2021-01-03 22:39:19]

Theia starts really slowly. Throttling affects the web socket connection by increasing the time needed to send and receive messages over the channel, so the offline message appears more often than usual.

Measurement 4:

  • Theia container
    • memoryLimit: 512m
    • memoryRequest: 512m
    • cpuLimit: 500m or 0.5 cpu
    • cpuRequest: default

Here we can see that setting the CPU limit to 0.5 affects the Theia container:
[Screenshot: Grafana, Kubernetes compute resources, pod, 2021-01-03 23:11:21]

This also affects the web socket connection: the web socket channel becomes really slow with this configuration.

Measurement 5:

  • Theia container
    • memoryLimit: 512m
    • memoryRequest: 512m
    • cpuLimit: 500m or 0.5cpu
    • cpuRequest: 500m or 0.5cpu

Throttling behaves the same as in Measurement 4:
[Screenshot: Grafana, Kubernetes compute resources, pod, 2021-01-03 23:03:03]

Measurement 6:

  • Theia container
    • memoryLimit: 512m
    • memoryRequest: 512m
    • cpuLimit: 900m or 0.9cpu
    • cpuRequest: default

The workspace did not start with this configuration. It failed with the error:

Error: Failed to run the workspace: "Unrecoverable event occurred: 'FailedScheduling', '0/1 nodes are available: 1 Insufficient cpu.', 'workspaceb907x975ac3h2bmz.workspace-7b9654c89c-xqmz8'"

Ping request times were recorded during the measurements.
Here are some ping request values from when Theia went into offline mode at workspace startup.

  • Measurement 1
    • 1553ms
    • 1882ms
    • 1040ms
  • Measurement 2
    • 1616ms
    • 1791ms
  • Measurement 3
    • 2101ms
    • 1904ms
    • 6695ms
    • 4301ms
    • 4848ms
    • 7910ms
    • 4463ms
    • 2057ms
  • Measurement 4
    • 1122ms
    • 3694ms
    • 2887ms
    • 2674ms
    • 1906ms
  • Measurement 5
    • 2101ms
    • 1904ms
    • 6695ms
    • 4301ms
    • 4848ms
    • 7910ms
    • 4463ms
    • 2057ms

As we can see, tuning the cpuLimit property doesn't have the desired effect on the ping service.
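
To compare the ping values listed above side by side, they can be summarized with a few lines of Python. This is only an illustrative sketch: the names `ping_ms`, `summarize` and `SLOW_PING_MS` are made up for this example, and the 1000 ms threshold is an assumption taken from the "over 1 second" slow ping mentioned in this comment, not a value read from the Che-Theia code.

```python
from statistics import median

# Ping durations (ms) recorded when Theia went offline at workspace startup,
# copied from the list above.
ping_ms = {
    "Measurement 1": [1553, 1882, 1040],
    "Measurement 2": [1616, 1791],
    "Measurement 3": [2101, 1904, 6695, 4301, 4848, 7910, 4463, 2057],
    "Measurement 4": [1122, 3694, 2887, 2674, 1906],
    "Measurement 5": [2101, 1904, 6695, 4301, 4848, 7910, 4463, 2057],
}

SLOW_PING_MS = 1000  # assumption for this example: "slow" means over 1 second


def summarize(samples):
    """Return count, median, max and number of slow pings for one measurement."""
    return {
        "count": len(samples),
        "median": median(samples),
        "max": max(samples),
        "slow": sum(1 for s in samples if s > SLOW_PING_MS),
    }


for name, samples in ping_ms.items():
    print(name, summarize(samples))
```

Every recorded ping is above one second, in the default configuration as well as in the CPU-limited ones.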

Digging into the connection status service revealed the following problem. There is a web socket activity handler which sets up a timer [1]. When the first message arrives on the channel, the handler calls ping after 4 seconds, and if the ping is successful, the connection status service sets up another timer [2] that switches the connection to offline after 5 seconds. In the meantime we receive another web socket message, and the connection status service sets up the timer for another ping request, and this ping request might be slow (over 1 second). During this one second timer [2] fires and Theia goes into offline mode. After that we finally receive the response from the second iteration of timer [1] and Theia immediately switches back to online mode.

So the problem is that the two timers don't track the pending ping promise.
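
The timing described above can be sketched as a tiny model. This is a simplification of the description in this comment, not Theia's actual implementation: the function and constant names are hypothetical, and the only inputs are the 4 second ping delay, the 5 second offline timeout and the ping round-trip times.

```python
PING_DELAY_MS = 4000       # timer [1]: wait after activity before pinging
OFFLINE_TIMEOUT_MS = 5000  # timer [2]: wait after a successful ping before going offline


def offline_window(first_rtt_ms, message_at_ms, second_rtt_ms):
    """Return (start, end) in ms of the offline flash, or None if there is none.

    Time 0 is the first web socket message; message_at_ms is the moment the
    next message arrives and re-arms timer [1].
    """
    first_pong = PING_DELAY_MS + first_rtt_ms      # first ping answered
    offline_at = first_pong + OFFLINE_TIMEOUT_MS   # timer [2] fires here
    second_pong = message_at_ms + PING_DELAY_MS + second_rtt_ms
    if second_pong > offline_at:
        # Timer [2] fired before the second ping was answered.
        return (offline_at, second_pong)
    return None


print(offline_window(100, 4900, 100))   # fast second ping -> None
print(offline_window(100, 4900, 1500))  # slow second ping -> (9100, 10400)
```

In this model the second ping has roughly one second of slack before timer [2] fires, so any ping slower than that produces a short offline flash.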

This issue can be reproduced on vanilla Theia when the user opens a large file in the editor (larger than 5-7 MB, depending on the user's host). In this case the web socket channel is busy transmitting the file content and the ping service cannot operate well.


There are two possible solutions:

  1. Increase the offlineTimeout parameter from the default 5 seconds to 7-8 seconds. This requires extending the ConnectionStatusOptions class and binding the extension in the DI container.
  2. Refactor AbstractConnectionStatusService to take into account the promise that the ping service produces each time a ping is performed.
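
A rough way to compare the two options is to model how long the offline indicator would be shown for a given ping duration. The sketch below is hypothetical throughout: `offline_flash_ms` and its parameters are invented for illustration, time 0 is the last successful ping, and it assumes the pending ping eventually succeeds. It also keeps the ping delay at 4 seconds when the timeout is raised, which is an assumption of this sketch; if the real delay is derived from the timeout, pass the derived value as `ping_delay_ms`.

```python
def offline_flash_ms(ping_rtt_ms, offline_timeout_ms=5000, ping_delay_ms=4000,
                     message_gap_ms=800, track_pending_ping=False):
    """Return how long (ms) the indicator shows offline; 0 means never.

    message_gap_ms is the hypothetical delay between the last successful
    ping and the next web socket message that schedules a new ping.
    """
    offline_at = offline_timeout_ms              # offline timer fires
    ping_sent = message_gap_ms + ping_delay_ms   # next ping goes out
    ping_done = ping_sent + ping_rtt_ms          # next ping is answered
    if ping_done <= offline_at:
        return 0
    if track_pending_ping and ping_sent <= offline_at:
        # Option 2: a ping is in flight when the timer fires, so the
        # decision is deferred until that ping resolves.
        return 0
    return ping_done - offline_at


print(offline_flash_ms(1500))                           # current behaviour
print(offline_flash_ms(1500, offline_timeout_ms=8000))  # option 1
print(offline_flash_ms(7910, offline_timeout_ms=8000))  # option 1, slowest measured ping
print(offline_flash_ms(7910, track_pending_ping=True))  # option 2
```

Under these assumptions a longer timeout hides moderately slow pings but not the slowest ones measured here, while tracking the pending ping avoids the flash regardless of how long the ping takes.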

@azatsarynnyy
Member Author

Thank you @vzhukovs for the great benchmarking!

So, setting cpuLimit to any of 100m, 500m, or 900m slows down Che-Theia significantly.
But it seems to have worked well with 1500m in @l0rd's tests in #18472 (see the issue description).
@vzhukovs does it make sense to check it on minikube as well, to determine the correct request/limit values we can set in the plugin registry?

@azatsarynnyy
Member Author

As we've figured out, the Offline mode issue cannot be fixed by tuning the CPU/memory resources.
@vzhukovs could you please register a separate issue in upstream Theia regarding the offline timeout,
since it is reproducible upstream as well?


> Increase the offlineTimeout parameter from the default 5 seconds to 7-8 seconds. This requires extending the ConnectionStatusOptions class and binding the extension in the DI container.

I believe a better option would be to introduce a Theia configuration parameter upstream and just set it in Che-Theia, e.g. here. It would give us more flexibility.
This is fair enough, considering that the current connection status check method in Theia really depends on the infrastructure/environment the application is running on.

@vzhukovs
Contributor

vzhukovs commented Jan 4, 2021

> @vzhukovs does it make sense to check it on minikube as well?

On minikube this configuration doesn't work:
[Screenshot: Eclipse Che | java-mysql-8-j4rzn, 2021-01-04 16:25:43]

@azatsarynnyy azatsarynnyy changed the title Profile Che-Theia to determine the correct memoryRequest/cpuLimit values Profile Che-Theia to determine the correct memory/cpu resources Jan 4, 2021
@vzhukovs
Contributor

vzhukovs commented Jan 5, 2021

> @vzhukovs could you please register a separate issue in upstream Theia regarding an offline timeout?

I've created a linked issue for this problem: #18723

@l0rd
Contributor

l0rd commented Jan 5, 2021

What I observe is that:

  1. In your tests you haven't been able to find good CPU request/limit values because you always get some CPU throttling. Good values for the CPU request and CPU limit would be the minimum values that guarantee no throttling.
  2. You are specifying a CPU limit but no CPU request; in this case Kubernetes automatically assigns a CPU request that matches the limit. In "measurement 6", for example, you could try setting the CPU request to 0.5 and the CPU limit to 1.
  3. The error you get in measurement 6 is due to a CPU request higher than the CPU available in minikube. To verify what's going on here you should get the real CPU request and CPU limit that are set for every container of the workspace pod. Besides Theia, other containers may request a CPU value that is exaggerated and that we can lower. I was using this snippet to get those values at workspace startup:
kubectl get po -l 'che.original_name=workspace' -o json -w |
   jq -r 'if .spec then .spec.containers[] |
             "---", .name, .resources.requests.cpu,  .resources.limits.cpu
          else "no pod yet" end'
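
Points 2 and 3 can be illustrated with a small calculation. This is a sketch under stated assumptions, not how the scheduler is implemented: the helper names are hypothetical, no LimitRange defaults are assumed for containers that set neither field, and the container list and the 1200m of allocatable CPU in the example are invented numbers, not values measured in this issue.

```python
def parse_cpu_millis(quantity):
    """Convert a Kubernetes CPU quantity ('500m', '1.5', '2') to millicores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def effective_cpu_request(limit=None, request=None):
    """CPU request (millicores) the scheduler sees for one container."""
    if request is not None:
        return parse_cpu_millis(request)
    if limit is not None:
        return parse_cpu_millis(limit)  # request defaults to the limit
    return 0  # assumption: no LimitRange default applies


def fits(containers, allocatable_millis):
    """Return (schedulable, total requested millicores) for a pod."""
    total = sum(effective_cpu_request(**c) for c in containers)
    return total <= allocatable_millis, total


# Hypothetical workspace pod: Theia with only cpuLimit set, plus two
# made-up sidecars; 1200m is an invented amount of free CPU on the node.
pod = [{"limit": "900m"}, {"limit": "500m"}, {"request": "100m"}]
print(fits(pod, 1200))

# Same pod with an explicit, lower request for Theia.
pod[0] = {"limit": "1", "request": "0.5"}
print(fits(pod, 1200))
```

With only a limit set, Theia alone accounts for 900m of requested CPU in this example; adding an explicit lower request brings the pod total under the hypothetical allocatable amount.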

@nickboldt nickboldt modified the milestones: 7.24, 7.25 Jan 8, 2021
@vzhukovs
Contributor

vzhukovs commented Jan 9, 2021

> You are specifying CPU limit but not CPU request and in this case Kubernetes automatically assigns a CPU request that matches the limit. On "measurement 6" for example you could try setting a CPU request to 0.5 and CPU limit to 1.

I've tried applying different configurations to the Theia container, with mixed outcomes:

  1. cpuLimit: 1, cpuRequest: 0.1
    [Screenshot: cpuLimit 1, cpuRequest 0.1]
    CPU throttling around 70%

  2. cpuLimit: 1, cpuRequest: 0.3
    [Screenshot: cpuLimit 1, cpuRequest 0.3]
    CPU throttling around 60%

  3. cpuLimit: 1, cpuRequest: 0.5
    [Screenshot: cpuLimit 1, cpuRequest 0.5]
    CPU throttling around 75%

  4. cpuLimit: 1, cpuRequest: 0.7
    [Screenshot: cpuLimit 1, cpuRequest 0.7]
    CPU throttling around 75%

  5. cpuLimit: 1.5, cpuRequest: 0.75
    [Screenshot: cpuLimit 1.5, cpuRequest 0.75]
    CPU throttling is minimal, around 7-10%

Visually, however, even with cpuLimit set to 1.5 the web socket communication is still somehow affected, and Theia continues to show offline mode. We can set cpuLimit to 1.5 and cpuRequest to 0.5-0.75 by default, but this won't properly fix the web socket communication channel. The connection status check mechanism should be reviewed upstream.

Setting cpuLimit or cpuRequest to more than 2 is probably not a good idea, because Che is usually started on a default minikube configuration (when a developer runs it on the host machine), and the default configuration allocates 2 CPUs to the whole local cluster. This might also be related to the number of physical cores, but I'm not sure about it.

> To verify what's going on here you should get the real CPU request and CPU limit that are set for every container of the workspace pod.

I tried the provided command and saw no difference between its output and what Grafana shows on the charts in real time.

@l0rd
Contributor

l0rd commented Jan 11, 2021

@vzhukovs what's your conclusion? What are the CPU request and limit values for Che Theia?

@vzhukovs
Contributor

@l0rd sorry for the late response. From what I've seen, using both the local installation and che.openshift.io, a configuration with cpuLimit: 1500m and cpuRequest: 500m is enough for comfortable work. This configuration gave the least throttling I was able to get. I'll contact @vitaliy-guliy to decide who will update the limits on the Theia container, because there is a parallel issue for that.

@azatsarynnyy
Member Author

@vzhukovs please provide a PR that sets the resources for Che-Theia based on your investigation in this issue.
Vitalii is working on all the other plug-ins except Che-Theia.

@azatsarynnyy
Member Author

The cpuRequest and cpuLimit are set by eclipse-che/che-plugin-registry#796. It improves the Che-Theia Offline mode situation a bit, but to fix it completely we'll continue working on #18723.

@azatsarynnyy azatsarynnyy removed the status/in-progress This issue has been taken by an engineer and is under active development. label Jan 20, 2021

4 participants