[RFC] Add a few basic integration tests #13
Leaving this in draft for the moment because I expect it to fail in CI. Working on that now.
I managed to make this compile and run some basic integration tests in CI. Unfortunately, even the newest version of Focal with I'm promoting this change set from draft because I think it can be meaningfully discussed at this point.
This is a really interesting approach to integration testing on CI, and I'd love to land something like this! I wonder if you're getting bitten by the fact that Rust runs all tests in parallel by default. You may need to pass If you try to break the failing test into many smaller ones that have increasingly larger prefixes of the failing test, what does it point to as the failing operation?
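The prefix idea above could be sketched roughly as follows. This is only an illustration: the step names (`open_device`, `alloc_pd`, `create_cq`, `create_qp`) are hypothetical stand-ins rather than the operations in this PR's tests, and the failure in `create_cq` is simulated.

```rust
/// Signature shared by every step in this sketch.
type StepFn = fn() -> Result<(), String>;

/// One named step of a longer integration test.
type Step = (&'static str, StepFn);

// Hypothetical stand-ins for the real operations in the failing test.
fn open_device() -> Result<(), String> { Ok(()) }
fn alloc_pd() -> Result<(), String> { Ok(()) }
fn create_cq() -> Result<(), String> { Err("simulated failure".to_string()) }
fn create_qp() -> Result<(), String> { Ok(()) }

/// The full test, expressed as an ordered list of steps.
fn demo_steps() -> Vec<Step> {
    vec![
        ("open_device", open_device as StepFn),
        ("alloc_pd", alloc_pd as StepFn),
        ("create_cq", create_cq as StepFn),
        ("create_qp", create_qp as StepFn),
    ]
}

/// Run only the first `n` steps, stopping at the first failure and
/// reporting which step failed.
fn run_prefix(steps: &[Step], n: usize) -> Result<(), String> {
    for &(name, step) in steps.iter().take(n) {
        step().map_err(|e| format!("step `{}` failed: {}", name, e))?;
    }
    Ok(())
}

/// Try prefixes of length 1, 2, ... and return the name of the last
/// step in the shortest failing prefix, if any prefix fails.
fn first_failing_step(steps: &[Step]) -> Option<&'static str> {
    (1..=steps.len())
        .find(|&n| run_prefix(steps, n).is_err())
        .map(|n| steps[n - 1].0)
}
```

Each prefix length could also become its own `#[test]` function calling `run_prefix`, so the CI log shows directly which prefix is the shortest one to fail.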
I did as you suggested and limited the number of concurrent tests to 1. Regrettably, this did not solve the core problem. I went ahead and ran the integration tests under valgrind (somewhat muddling this PR in the process). The result seems to indicate that the VM run by Travis CI simply isn't equipped to run all of the instructions in the generated binary. For what it is worth, this test consistently passes under valgrind on my home machine. I'm not sure how to proceed here, as I am not convinced that the test failure is the fault of the test or the library; it seems to be a limitation of the CI environment. Thoughts?
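One way to compare the CI VM with a home machine would be to log which instruction-set extensions each host reports. The sketch below assumes the problem is a missing x86 extension, which this thread has not confirmed; the feature list is an arbitrary illustration, and `cpu_feature_report` and `missing_features` are hypothetical helper names.

```rust
/// Report whether a few x86 instruction-set extensions are usable on
/// the current host. The feature list is an illustrative assumption.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_feature_report() -> Vec<(&'static str, bool)> {
    vec![
        ("sse4.2", std::is_x86_feature_detected!("sse4.2")),
        ("avx", std::is_x86_feature_detected!("avx")),
        ("avx2", std::is_x86_feature_detected!("avx2")),
        ("bmi2", std::is_x86_feature_detected!("bmi2")),
        ("fma", std::is_x86_feature_detected!("fma")),
    ]
}

/// On other architectures there is nothing to report in this sketch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_feature_report() -> Vec<(&'static str, bool)> {
    Vec::new()
}

/// Names of the features from the report that the host lacks.
fn missing_features(report: &[(&'static str, bool)]) -> Vec<&'static str> {
    report
        .iter()
        .filter(|(_, available)| !*available)
        .map(|(name, _)| *name)
        .collect()
}
```

Printing `missing_features(&cpu_feature_report())` at the start of a test run on both machines would show whether the two environments differ in any of the listed features.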
I may try setting up an integration test suite on an AWS free tier VM. Then we could run the integration tests during development and before cutting a release. I have one ConnectX-6 VPI card and a bunch of ConnectX-5 VPI cards as well. I may also have some ConnectX-3 cards lying around. I think I also have at least one iWARP-capable NIC in a server somewhere. Thus, I could set up a containerized test and run some integration tests on real hardware every once in a while as well. I also think AWS Elastic Fabric Adapter supports RoCE (although I have not used it yet). I don't think the AWS machines with EFA are freely available, though.
Hmm, that's interesting. I'd be curious to see if moving to something like Azure might let us do what's needed? Alternatively, we may have to run our own CI infrastructure on a host that supports the required instructions, though that quickly becomes expensive and complex. Hmm...
Are you open to adjusting the way CI works for this project? I was thinking I would generate an AMI with the Linux kernel config necessary for SoftRoCE (I checked and the stock community AMIs don't have it as far as I can see). Then I can set up something like terraform-cloud to spin up a spot instance of that AMI on PR updates to this repo. It can be a t2.micro spot instance, so it can be really inexpensive (or more likely free). That EC2 instance can just run the extant That also opens the possibility of spinning up SoftRoCE devices on a Geneve tunnel connecting two different EC2 instances in the same VPC. That would let us do a "host to host" test (I suggest the Geneve tunnel because I'm not 100% sure how AWS will respond to Ethernet frames with RoCEv2's ethertype coming out of a system which isn't "supposed" to be emitting them. If you wrap it in a point-to-point Geneve tunnel then it will be invisible to AWS). I have a partial implementation of this plan now. I generated a machine image with the necessary kernel, iproute2, and rdma-core. The entire machine image build process is containerized, so it should work on any machine with Docker. I just need to convert it to an AMI with Packer and set up some simple Terraform to deploy it. I'm completely happy to do all of that (I really like Rust and I really like RDMA, and this project is of considerable interest to me). I just don't want to do it if that isn't the direction you want to take things. I am open to alternatives; I am just running out of simple ideas and I guess I'm reaching for somewhat more complex ones at this point 🤷.
Sorry again for taking so long to get back to you! I am open to changing CI, though I worry about having to maintain dedicated CI infrastructure because I know that realistically I won't be able to stay on top of that. If you'd be willing to host it though, I would be super happy to point the repo at it! I'm also looking for maintainers for this crate since I'm not using it myself, and you'd make a good candidate if you're also responsible for the integration tests! On your proposal more generally, I think the "right" CI for this is indeed dedicated CI. I do think there's value in also having a test suite that users can reasonably run locally, so we should find ways of mocking or using SoftRoCE where possible, but some end-to-end testing is extremely valuable. You'll have to be mindful of abuse once you're hosting your own CI though. Chances are someone will come along with a PR that adds a bitcoin mining script or something, and suddenly your EC2 use goes through the roof 😅
No worries on the response time; after all, I didn't get back to you promptly either :) Sorry about that, work has been a challenge lately. I agree with your thoughts regarding the CI. I'm actually working on a solution to that problem in the background a bit. The big problem so far is that I'm not having great luck with SoftRoCE. I can make a lot of test cases pass on my hardware, but those tests fail every time (or sometimes ~70% of the time) with the SoftRoCE device swapped in. Not really sure why, honestly. In any case, you are right about the danger of hosting your own CI (especially if it is running on a powerful VM or host). I will need to think about that carefully. Maybe we only run basic tests in something like Travis CI and then run a full integration test on hardware after the PR is reviewed (and doesn't obviously run a mining rig or a botnet or something). As for maintaining the crate, I would be pleased and honored to do so. I think this is a pretty solid foundation with a lot of room for improvement. I also use RoCE at work so this is, at minimum, a good excuse for me to study it further :D
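The split between basic tests on hosted CI and full tests on reviewed PRs could be sketched with Rust's `#[ignore]` attribute, which keeps a test out of a plain `cargo test` run until it is requested with `cargo test -- --ignored`. The environment variable name `RDMA_HW_TESTS` and the helper functions below are assumptions made up for this illustration, not part of this crate.

```rust
use std::env;

/// Hypothetical opt-in variable; the name is an assumption for this sketch.
const HW_TEST_VAR: &str = "RDMA_HW_TESTS";

/// Decide whether hardware tests were requested, given the raw value
/// of the opt-in variable (or `None` if it is unset).
fn hardware_tests_enabled(value: Option<&str>) -> bool {
    matches!(value, Some("1") | Some("true"))
}

/// Read the opt-in variable from the process environment.
fn hardware_tests_requested() -> bool {
    hardware_tests_enabled(env::var(HW_TEST_VAR).ok().as_deref())
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Cheap test that can run on any hosted CI machine.
    #[test]
    fn parses_opt_in_values() {
        assert!(hardware_tests_enabled(Some("1")));
        assert!(!hardware_tests_enabled(None));
    }

    /// Skipped by default; run only on a trusted machine after review.
    #[test]
    #[ignore = "needs an RDMA-capable NIC"]
    fn full_integration_on_hardware() {
        if !hardware_tests_requested() {
            eprintln!("skipping: set {}=1 to run", HW_TEST_VAR);
            return;
        }
        // Real verbs calls against the device would go here.
    }
}
```

With this layout the hosted CI never touches the hardware path, and a maintainer triggers the ignored tests by hand once a PR has been read.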
I created a few basic integration tests to complement the bindgen auto-generated test suite.
Trying @jonhoo's suggestion for reducing the test thread count to make CI run integration tests.
I am concerned that the test failure is transient. That is very irritating.
Trying to parse test output more cleanly. Trying with sudo now.
WIP patch. This is an incomplete attempt at building an AMI which can run integration tests. I still need to make the process automatic and I still need to ensure that the integration tests actually pass on the generated VM.
We were generating multiple images with the same /etc/machine-id parameter. This causes multiple machines to get the same MAC address in virtual systems which is very annoying. Just remove /etc/machine-id in the build and it will be generated automatically on first system boot.
FWIW, SoftRoCE (the Now, if we want to test libibverbs on a more "real" RDMA NIC, you could run with the AWS EFA provider natively (on the What troubles did you have with rxe? Happy to help debug the RDMA side of things. I am not a Rust expert or even a novice, so can't help with Rust<->C API binding issues, but I'm guessing you all have that covered.
Also, with Travis having shuttered the free tier support for OSS projects, do you want to move over to GitHub Actions or the like?
I actually gave that exact thing a pretty good effort a little while back. (you can see the record of my flailing attempt here) The problem I ran into was that the VMs spun up by GitHub Actions are running a Linux kernel (and modules) signed by Microsoft. Unfortunately the I tried compiling it out of band and loading it via insmod but I have yet to make that work. I actually have a (perhaps too ambitious) plan to work around this limitation using Kata containers with a custom compiled Linux kernel (with You can see some work I did in this direction here: https://github.com/daniel-noland/docker-in-kata The basic idea looks like:
This plan has the upshot of allowing for proper integration testing in addition to checking the functionality of the bindings. That said, this is a pretty involved thing to set up. If you (or anybody) have a simpler plan I'm all about it. I have a bunch of the components of this plan working but it really is no small thing to make it all work together.
Also, @rajachan
Would you know about the state of This limitation has persistently foiled my attempts to containerize the test suite for this project.
I've had success bringing up RDMA devices in a container namespace and running datapath tests on them. It has been a while, but some of the key things I had to plumb through via the container runtime were --- exposing the For the rxe driver, I know the Amazon Linux AMI did not have it but we might be able to get it to work with Canonical's newer Ubuntu AMIs? I can take a look at the GitHub Actions VM environments and see if I have any success. Will find some time in the coming week to look into it.
Bumps [actions/checkout](https://github.com/actions/checkout) from 3 to 4. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](actions/checkout@v3...v4) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
I created a few basic integration tests to complement the bindgen auto-generated test suite.
I'm not sure if this is the best approach but it is a place to start.