feat: deploy ChaosMesh on APISIX, to simulate more faults #2757
Comments
@Yiyiyimu Any progress? Do you need help? :)
Hi @moonming Sorry, the progress is a bit slower than expected. This is the final week of my Community Bridge Program, so I think I will have more time to work on this next week. I'll try to hurry up.
ok, take your time
Hi @moonming I ran some basic tests, injecting etcd network delay and etcd pod failure, and I got the following results for now.
So I guess I could try to integrate it with the backend e2e test in the dashboard, for example by running the e2e test while a certain chaos is injected, so that it is closer to a real user scenario and we can get more information. Hi @nic-chen do you think that's possible? Some coarse todo:
Sure. The E2E test and the main project are independent, so you could run it directly in your test env.
I do not think so. The dashboard's e2e test is currently relatively simple and not enough to support Chaos Mesh. Most of the tests in the apisix project are also e2e tests, so I think it is more appropriate to add a new CI task to the apisix project.
Proposal
Background
Currently, we have unit tests, integration tests, and e2e tests to ensure the fault tolerance of APISIX. But some problems, like network delay and CPU stress, are not covered by these tests. It would therefore be a good idea to introduce chaos engineering to simulate different types of faults and test how APISIX performs under them.
For chaos engineering, ChaosMesh could be a good choice for us. It has several benefits over other chaos engineering tools:
BTW, the chaos types ChaosMesh supports for now (Nov. 16, 2020) include pod chaos, network chaos, stress chaos, io chaos, time chaos, kernel chaos, HTTP chaos, and DNS chaos.
TODO
Following the principles of chaos engineering, there are two main parts we need to care about: 1. what we should test and 2. how to prove correctness after chaos injection.
As for what we have now, the problems we currently encounter and need to simulate are:
And the methods to test correctness include:
1. error log of Nginx and APISIX
2. whether CPU/memory use of APISIX is abnormally high
3. whether wrk benchmarking would fail
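The error-log check in item 1 could be automated roughly like this. It is a minimal sketch, assuming the standard Nginx error-log line format (`YYYY/MM/DD HH:MM:SS [level] ...`); the function names `count_log_levels` and `new_errors` are made up for illustration and are not part of APISIX.

```python
import re
from collections import Counter

# Standard Nginx error-log lines start like:
#   2021/01/20 10:00:00 [error] 1234#1234: *1 message
LOG_LEVEL = re.compile(r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \[(\w+)\]")


def count_log_levels(lines):
    """Count log lines per severity level; lines that do not match are ignored."""
    counts = Counter()
    for line in lines:
        match = LOG_LEVEL.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def new_errors(before, after, levels=("error", "crit", "alert", "emerg")):
    """Return how many extra lines of each serious level appeared after chaos."""
    return {
        level: after[level] - before[level]
        for level in levels
        if after[level] > before[level]
    }
```

Taking one snapshot of the log before injecting chaos and one after, a non-empty result from `new_errors` would flag the run for a closer look. Which levels count as a failure is a judgment call, not something the proposal specifies.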
After more investigation into how people use chaos engineering: to see how things go once a certain chaos takes effect, it would be better to use Prometheus/Grafana to plot APISIX performance metrics rather than focusing only on nginx logs. Also, since chaos is more about mocking problems faced in production, directly using monitoring tools lets us see what users would be facing.
To use Prometheus, we need a demo that exercises basic functions of APISIX, like a certain amount of traffic and new rules set at a certain time interval. It seems we do not have that kind of demo, so I plan to write a simple script to implement these features.
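A rough sketch of what such a demo script might look like. Everything concrete here is an assumption for illustration: the Admin API address, the placeholder key, the route shape, and the helper names `build_route_request` and `plan`. The block only builds the requests and the schedule; actually sending them (for example with `urllib.request.urlopen`) is left out.

```python
import json
import urllib.request

# Assumptions: Admin API address and key depend on the local APISIX config.
ADMIN_URL = "http://127.0.0.1:9080/apisix/admin"
API_KEY = "<admin-key>"  # placeholder, not a real key


def build_route_request(route_id, uri, upstream_node):
    """Build (but do not send) a PUT request that creates or updates a route."""
    body = {
        "uri": uri,
        "upstream": {"type": "roundrobin", "nodes": {upstream_node: 1}},
    }
    return urllib.request.Request(
        f"{ADMIN_URL}/routes/{route_id}",
        data=json.dumps(body).encode(),
        headers={"X-API-KEY": API_KEY, "Content-Type": "application/json"},
        method="PUT",
    )


def plan(duration_s, requests_per_s, rule_interval_s):
    """Return (second, action) events: steady traffic plus periodic rule updates."""
    events = []
    for second in range(duration_s):
        if second % rule_interval_s == 0:
            events.append((second, "update_route"))
        events.extend((second, "request") for _ in range(requests_per_s))
    return events
```

For example, `plan(60, 10, 15)` describes one minute of 10 requests per second with a route update every 15 seconds. What counts as "normal" traffic for APISIX is still an open question, so these numbers are arbitrary.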
With monitoring tools and the demo, we could then easily run different kinds of chaos and see how things go. When we find something interesting and useful, we could standardize it, write a test case for the scenario, and put it into CI. Given the earlier experiments, verifying a particular case is not that hard, so what we should focus on more is finding those interesting scenarios.
Timeplan
By 1.20 (Wed): finish writing the demo script, and present the metrics of APISIX with Grafana
By 1.22 (Fri): apply network chaos and see how APISIX works without etcd. Ideally test with different chaos cases
By 1.24 (Sun): write a test case for the network chaos and run it on CI
Future: more chaos cases!
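For the network chaos step, the experiment definition could be generated like this. It is a sketch only: the field names follow the Chaos Mesh `NetworkChaos` resource as I understand it and should be checked against the installed Chaos Mesh version (older releases may also require a `scheduler` section alongside `duration`). The namespace, labels, and the helper name `etcd_delay_manifest` are hypothetical.

```python
import json


def etcd_delay_manifest(namespace, etcd_labels, latency, duration):
    """Build a NetworkChaos manifest that delays traffic on the selected etcd pods."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": "etcd-network-delay", "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",  # apply to every pod matched by the selector
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": etcd_labels,
            },
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace and labels; adjust to the actual etcd deployment.
    manifest = etcd_delay_manifest("default", {"app": "etcd"}, "500ms", "60s")
    print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, the printed output could be saved to a file and applied directly.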
The most uncertain part for me is the demo: I'm unsure whether we already have that kind of demo and, if we don't, about some details of writing the script (like what normal traffic for APISIX looks like). Any suggestions are welcome!!
Here is a rough schedule for this feature:
Basic Integration
Problems to simulate
Correctness Proof
wrk benchmarking would fail
CI Deployment