
feat: deploy ChaosMesh on APISIX, to simulate more faults #2757

Closed
5 of 8 tasks
Yiyiyimu opened this issue Nov 16, 2020 · 6 comments
Assignees
Labels
chaos (chaos scenario to do), enhancement (New feature or request)
Milestone

Comments


Yiyiyimu commented Nov 16, 2020

Proposal

Background

Nowadays we have unit tests, integration tests, and e2e tests to ensure the fault tolerance of APISIX. But some problems, like network delay and CPU stress, are still not covered by those tests. Thus it would be worthwhile to introduce chaos engineering, to simulate different types of faults and test how APISIX performs under them.

To deploy chaos engineering, ChaosMesh could be a good choice for us. It has several advantages over other chaos engineering tools:

  1. ChaosMesh is a CNCF sandbox project with quite an active community, which suggests the project will keep improving and that we could get help when needed.
  2. ChaosMesh supports GitHub Actions, so once we set up the workflow for this integration, it would be easy to run these tests in our daily work.
  3. ChaosMesh already supports most types of chaos and keeps adding more. Although we might not need that much for now, that breadth will help when we decide to test more with it.
    BTW, the chaos types ChaosMesh supports as of now (Nov 16, 2020) include pod chaos, network chaos, stress chaos, IO chaos, time chaos, kernel chaos, HTTP chaos, and DNS chaos.
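To give a concrete idea of what one of these looks like, here is a minimal NetworkChaos manifest that injects latency into traffic toward etcd. The manifest name, label selector, and latency value are illustrative assumptions, and the field layout follows ChaosMesh's `v1alpha1` API, which may differ between versions:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: etcd-delay          # hypothetical name
spec:
  action: delay
  mode: all
  selector:
    labelSelectors:
      app: etcd             # adjust to your etcd deployment's labels
  delay:
    latency: "100ms"
  duration: "60s"
```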

TODO

Following the principles of chaos engineering, there are two main parts we need to care about: 1. what we should test, and 2. how to prove correctness after chaos injection.

As for what we have so far, the current problems we encounter and need to simulate are:

  1. an unstable connection to etcd
  2. etcd failure
  3. problems when CPU/memory/disk are stressed out
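For the third scenario, a StressChaos manifest along these lines could stress the CPU of the APISIX pods. This is only a sketch; the name, selector, and load values are assumptions to check against the ChaosMesh docs for your version:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: apisix-cpu-stress   # hypothetical name
spec:
  mode: all
  selector:
    labelSelectors:
      app: apisix           # adjust to your APISIX deployment's labels
  stressors:
    cpu:
      workers: 2            # number of stress workers
      load: 80              # target CPU load percentage per worker
  duration: "60s"
```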

And the methods to verify correctness include:

1. checking the error logs of Nginx and APISIX
2. whether the CPU/memory usage of APISIX is abnormally high
3. whether wrk benchmarking would fail
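The first check could be scripted. Below is a minimal sketch that scans Nginx-style error-log lines (e.g. `2020/12/08 10:00:00 [error] 1#1: ...`) and counts entries at or above a given severity; `count_log_entries` is a hypothetical helper, not existing tooling:

```python
import re

# Nginx error-log severity levels, in ascending order.
LEVELS = ["debug", "info", "notice", "warn", "error", "crit", "alert", "emerg"]

def count_log_entries(log_text, min_level="error"):
    """Count log lines whose bracketed level is at or above min_level."""
    threshold = LEVELS.index(min_level)
    count = 0
    for line in log_text.splitlines():
        m = re.search(r"\[(\w+)\]", line)
        if m and m.group(1) in LEVELS and LEVELS.index(m.group(1)) >= threshold:
            count += 1
    return count

sample = """2020/12/08 10:00:01 [warn] 1#1: something minor
2020/12/08 10:00:02 [error] 1#1: failed to connect to etcd
2020/12/08 10:00:03 [error] 1#1: timeout"""
print(count_log_entries(sample))  # 2 entries at "error" or above
```

In CI, a nonzero count after a chaos run could fail the job or at least flag the scenario for a closer look.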

After some more investigation into how people use chaos engineering, it seems better to use Prometheus/Grafana to plot APISIX performance metrics and see how things behave after a given chaos takes effect, rather than focusing only on Nginx logs. Also, since chaos is about mocking problems faced in production, using monitoring tools directly lets us see what users would actually be facing.
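On the Prometheus side, a tiny parser for the text exposition format is enough to pull a single sample out of a scraped metrics page. This is a hypothetical sketch; the metric name and sample text below are illustrative, not actual APISIX output:

```python
def scrape_metric(exposition_text, metric_name):
    """Return the first sample value for metric_name, or None if absent.

    Matches "name{labels} value" or "name value" lines in the
    Prometheus text exposition format; returns only the first match.
    """
    for line in exposition_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        if line.startswith(metric_name):
            return float(line.rsplit(" ", 1)[1])
    return None

sample = """# HELP apisix_http_status HTTP status codes per route
# TYPE apisix_http_status counter
apisix_http_status{code="200",route="1"} 1024
apisix_http_status{code="502",route="1"} 3"""
print(scrape_metric(sample, "apisix_http_status"))  # 1024.0
```

A check like this could assert, for example, that the 5xx count stays below a threshold while chaos is active.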

To use Prometheus, we need a demo that exercises the basic functions of APISIX, like a certain amount of traffic and new rules being set at a certain interval. It seems we do not have that kind of demo, so I plan to write a simple script to implement these features.
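A minimal sketch of such a script might look like the following. The Admin API address, admin key, and gateway address are assumptions based on common defaults and will differ per deployment; only `build_route` is pure logic, the network helpers need a running APISIX:

```python
import json
import urllib.request

ADMIN_API = "http://127.0.0.1:9080/apisix/admin"        # assumed default
API_KEY = "edd1c9f034335f136f87ad84b625c8f1"            # default key; change in production
GATEWAY = "http://127.0.0.1:9080"

def build_route(uri, upstream_addr):
    """Build the JSON body for PUT {ADMIN_API}/routes/{id} (the id goes in the URL)."""
    return {
        "uri": uri,
        "upstream": {"type": "roundrobin", "nodes": {upstream_addr: 1}},
    }

def put_route(route_id, body):
    """Create or update a route via the Admin API; return the HTTP status."""
    req = urllib.request.Request(
        f"{ADMIN_API}/routes/{route_id}",
        data=json.dumps(body).encode(),
        headers={"X-API-KEY": API_KEY, "Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def send_traffic(path, n=10):
    """Send n requests through the gateway; return how many succeeded."""
    ok = 0
    for _ in range(n):
        try:
            with urllib.request.urlopen(GATEWAY + path, timeout=3) as resp:
                ok += int(resp.status == 200)
        except OSError:
            pass  # chaos may drop requests; failures simply do not count
    return ok

# Example usage against a running APISIX (not executed here):
#   put_route(1, build_route("/hello", "127.0.0.1:1980"))
#   print(send_traffic("/hello"), "out of 10 requests succeeded")
```

Looping `put_route` and `send_traffic` on a timer would give Prometheus a steady stream of metrics to plot while chaos is injected.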

With the monitoring tools and the demo, we could then easily run different kinds of chaos and see how things go. When we find something interesting and useful, we can standardize it, write a test case for the scenario, and put it into CI. With the earlier experiments in hand, testing a given case is not that hard, so what we should focus on is finding those interesting scenarios.

Timeplan

By 1.20 (Wed): finish writing the demo script, and present the metrics of APISIX with Grafana
By 1.22 (Fri): apply network chaos and see how APISIX works without etcd; ideally test with different chaos cases
By 1.24 (Sun): write a test case for the network chaos, and run it on CI
Future: more chaos cases!

The most uncertain part for me is the demo: I'm not sure whether we already have that kind of demo, and if we don't, I'm unsure about some details of writing the script (like what counts as normal traffic for APISIX). Any suggestions are welcome!


Here to list a rough schedule of this feature:

Basic Integration

  • add a basic chaos to apisix

Problems to simulate

  • the connection with etcd is unstable
  • etcd failure
  • problems when cpu/memory/disk stressed out

Correctness Proof

  • error log of Nginx and APISIX
  • whether the CPU/memory usage of APISIX is abnormally high
  • whether wrk benchmarking would fail

CI Deployment

  • deploy on Github Actions
@Yiyiyimu Yiyiyimu added this to the 2.2 milestone Nov 16, 2020
@Yiyiyimu Yiyiyimu self-assigned this Nov 16, 2020
@juzhiyuan juzhiyuan added the enhancement (New feature or request) label Nov 24, 2020

moonming commented Dec 3, 2020

@Yiyiyimu Any progress? Do you need help? :)


Yiyiyimu commented Dec 3, 2020

Hi @moonming, sorry the progress is a bit slower than expected. This is the final week of my Community Bridge program, so I should have more time to work on this next week. I'll try to hurry up.


moonming commented Dec 3, 2020

ok, take your time


Yiyiyimu commented Dec 8, 2020

Hi @moonming I ran some basic tests, applying etcd network delay and etcd pod failure. Here are the results so far.

  1. Both kinds of chaos take effect. For the network delay, I could see the delay when I put a new route; for the pod failure, I could see that setting a new route would fail.
  2. However, a benchmark tool is not quite appropriate for testing this. I couldn't find a way to make wrk send a PUT request for a route, so I just tested with curl. That tells me whether it works as expected, and errors do indeed show up, but merely observing the network delay or pod failure is not that useful for us by itself.

So I guess I could try to integrate it with the backend e2e tests in the dashboard, e.g. running the e2e tests while a certain chaos is injected, so it would be more of a real user scenario and we could get more information. Hi @nic-chen do you think that's possible?

Some coarse todo:

  • run APISIX on Kubernetes in CI
  • run the e2e tests on Kubernetes
  • add chaos
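A CI job along these lines could cover the three items above. This is only a sketch, not an existing workflow; the action versions, Helm chart repo, install-script URL, and manifest path are all assumptions to verify against the current docs:

```yaml
name: chaos-test
on: [push, pull_request]

jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Create a local Kubernetes cluster
        uses: helm/kind-action@v1
      - name: Deploy APISIX and etcd
        run: |
          helm repo add apisix https://charts.apiseven.com
          helm install apisix apisix/apisix --namespace apisix --create-namespace
      - name: Install ChaosMesh
        run: |
          curl -sSL https://mirrors.chaos-mesh.org/latest/install.sh | bash
      - name: Inject chaos and run the e2e tests
        run: |
          kubectl apply -f chaos/network-delay.yaml   # hypothetical manifest path
          # run the e2e suite against the cluster here
```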

@idbeta idbeta mentioned this issue Dec 8, 2020

nic-chen commented Dec 8, 2020

So I guess I could try to integrate it with the backend e2e tests in the dashboard, e.g. running the e2e tests while a certain chaos is injected, so it would be more of a real user scenario and we could get more information. Hi @nic-chen do you think that's possible?

Sure. The e2e tests and the main project are independent, so you could run them directly in your test env.


moonming commented Dec 8, 2020

So I guess I could try to integrate it with the backend e2e tests in the dashboard, e.g. running the e2e tests while a certain chaos is injected, so it would be more of a real user scenario and we could get more information.

I do not think so. The dashboard's e2e tests are currently fairly simple and not enough to support ChaosMesh. Most of the tests in the apisix project are also e2e tests, so I think it is more appropriate to add a new CI task to the apisix project.

@Yiyiyimu Yiyiyimu modified the milestones: 2.2, 3.0 Dec 14, 2020
@juzhiyuan juzhiyuan added the wait for update (wait for the author's response in this issue/PR) label Dec 26, 2020
@spacewander spacewander modified the milestones: 2.2, 2.3 Dec 29, 2020
@spacewander spacewander added and removed the bug (Something isn't working) label Jan 20, 2021
@Yiyiyimu Yiyiyimu removed the wait for update (wait for the author's response in this issue/PR) label Jan 21, 2021
@Yiyiyimu Yiyiyimu added the chaos (chaos scenario to do) label Jan 26, 2021