Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Requesting Temporary ignoring of default.go oci-runtime-validation test #721

Closed
YJDoc2 opened this issue Feb 19, 2022 · 8 comments
Closed

Comments

@YJDoc2
Copy link
Collaborator

YJDoc2 commented Feb 19, 2022

After two weeks of testing and a lot of digging of git commit history, I think I have finally found what has been causing CI failure for #694 's OCI validation test (there are some other failing tests, but this is not related to them).

The surprising reason behind that test failing is a knock-on effect of the way -

  1. Our CI is set up (I have partial blame in this 😓 )
  2. We recently adding these capabilities
  3. runtime-tools not validating there tests on runc (?)

If you see in commet I have mentioned that the CI is faioing on youki for these three capabilities, although if you check the integration test CI of any other PR there is no issue. The reason behind this is :

  1. We added caching mechanism to the CI for integration tests on 7th Sep Here. This caches the build artifacts of the runtime-tests builds, which is the compiled tests, and has a very good reason of doing so, as without it the CI takes 7-8 mins and with caching it takes about 3-4 mins. The downside is that the tests are not built each time, and we always use the version that was cached on 7th Sep.
  2. runtime-tools itself updated one of their dependencies go-capabilities on 26th of October here
  3. The go-capabilities itself was updated on Aug 5 here to include the three new capabilities, and thus the runtime tools itself as well inherited these updates.
  4. runc released version 1.1 here on 18th Jan , which included the support for these capabilities. I think we also have only recently implemented these as well.
  5. As our CI was using the cached version of tests, they did not check for these and thus tests were passed. But when I moved the git module in the refactor, the cache was invalidated, and youki was tested with new version and failed the tests.
  6. The probable reason is that the CI runner's kernel might not be updated or compiled with correct flags or has permissions to set these capabilities, so runc cannot use them OR the oci-tests has bug in it, (as far as I checked the do not validate the tests on runc or some other runtime)

It is too late here, so I'm not sure what approach might be best, my suggestion is to either comment the default.go test and open an issue if not yet opened on oci-runtime-tools

I do not have runc 1.1 yet, so cannot verify if the tests are also failing on it as well , but I would guess that it is so. I will try to get the 1.1 and check on it if I can.

@Furisto @utam0k Please take a look. Thank you.

@utam0k
Copy link
Member

utam0k commented Feb 20, 2022

@YJDoc2
um... we got into trouble 😭
First of all thanks for the awesome research. If it was for you, I probably wouldn't have noticed it.
I don't want to remove as many tests as possible because default includes many tests. it gets a little weird, but how about hacking from stdout only when default.go run? e.g. ignore only the relevant test cans in default.go.

@YJDoc2
Copy link
Collaborator Author

YJDoc2 commented Feb 20, 2022

Hey @utam0k
I agree that removing the test should be the very last option, as we have already commented few tests as they do not work, and removing tests will make more bugs slip through.

Currently my suggested approach is this :

  1. Build runc 1.1 locally and run oci tests on it
  2. If runc also fail the tests, then the tests themselves need to update :
  3. then I will open an issue on the runtime-tools repo notifying this, and wait for few days to see if they reply.
  4. If they do, then we can wait till they update, and then use that new version for us, till then we can keep using the cached version, as even though capabilities tests are old, other seems to have no major changes since our cache, and are currently working
  5. In case they do not reply, or won't be fixing the issue in near future, we can work on the way you suggested, and use grep to ignore these three specific capabilities related test failures, until we find some other good way to deal with it.
  6. In case runc does pass the tests, that would mean that it is our implementation that has some errors, and then we can fix them on our side, which would mean that the CI is not broken

I'll update here once I have ran the runc check.

@utam0k What do you think?

@YJDoc2
Copy link
Collaborator Author

YJDoc2 commented Feb 20, 2022

Hmm I cloned and built runc from main branch, and it seems that it is passing the default.go tests, but failing the delete.go tests 🤔 🤔 🤔
That should not be happening right? If you get time, can you also get runc 1.1 and verify this on your end as well?

If that is right, it would mean that is is some bug from our end, likely that we need to check the kernel version and enable the BPF and other features according to that.

EDIT : ok, I ran the tests on my local version youki once again to make sure that I was getting error here as well, but
apparently it has the same behavior, where default.go is passing but delete.go is failing, 😭 not sure why, investigating.

Update : So I have run this several times, and with different combinations of options, and what I'm almost certain of is :

  1. runc 1.1 / 1.0 both are passing default.go, but failing delete.go for some reasons (likely due to some setup issue of my system?)
  2. youki is failing default.go

@YJDoc2
Copy link
Collaborator Author

YJDoc2 commented Feb 20, 2022

I think I figured out what is causing the issue, and I think solution is simple if I understand it correctly :
As far as I know, we currently support CAP_PERFMON, CAP_CHECKPOINT_RESTORE, and CAP_BPF but in the capabilities code, we still ignore them when setting capabilities, Here : https://github.com/containers/youki/blob/main/crates/libcontainer/src/syscall/linux.rs#L136-L140

That is why I think even though youki reports that they are supported, they are ignored above when setting capabilities, and thus the error occurs.

I tried commenting that part, and unconditionally dropping the cap (the _ arm), and that seemed to pass the default and delete tests as well.

@Furisto if possible can you confirm what I am thinking is correct or not? This seemed to slip by originally due to the CI cahcing I have mentioned.

@utam0k What do you think?

@Furisto
Copy link
Collaborator

Furisto commented Feb 20, 2022

@YJDoc2 Makes sense to me. We should support these capabilities in Youki as well.

YJDoc2 added a commit to YJDoc2/youki that referenced this issue Feb 21, 2022
@utam0k
Copy link
Member

utam0k commented Feb 23, 2022

@YJDoc2 Sounds good to me to support them. But If we will support these capabilities, we have to check the linux kernel version. That's the only thing I want to pay attention to.

@yihuaf
Copy link
Collaborator

yihuaf commented Jul 21, 2023

@YJDoc2 is there still pending action items for this issue?

@YJDoc2
Copy link
Collaborator Author

YJDoc2 commented Jul 21, 2023

No I think everything mentioned here was covered in some or other PR/issue, so we can close this one out. Thanks for pinging!

@YJDoc2 YJDoc2 closed this as completed Jul 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants