
IceCap: Improve ring buffer stability #447

Closed
wants to merge 7 commits into main from ring-buffer-stability

Conversation

geky
Member

@geky geky commented May 6, 2022

This includes a number of changes to improve the stability of the ring-buffers used to communicate between domains. As things stood, it was easy to lose notifications and cause applications to lock up.

The commits have more details, but here are the high-level changes:

  1. RingBuffer is changed to enable notifications by default, avoiding a race condition during domain bring-up (a sketch of this race follows this list).
  2. Fixed an issue in BufferedRingBuffer where messages larger than the ring-buffer could hang it if multiple messages are in the queue.
  3. Reworked the virtio-console-server example a bit to use the BufferedRingBuffer more correctly.
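
For context, here is a minimal sketch of the bring-up race that change 1 avoids. The type and method names (RingBuffer::new, enable_notify_read, write) are placeholders for illustration only, not the actual IceCap API:

```rust
// Placeholder stand-ins for the IceCap ring-buffer API; the names, fields,
// and signatures here are assumptions for illustration only.
struct RingBuffer {
    notify_read_enabled: bool,
}

impl RingBuffer {
    // Old behaviour: notifications start out disabled.
    fn new() -> Self {
        RingBuffer { notify_read_enabled: false }
    }

    fn enable_notify_read(&mut self) {
        self.notify_read_enabled = true;
    }

    // Writer side: a wakeup is only raised if the reader has already opted in.
    fn write(&mut self, _data: &[u8]) {
        if self.notify_read_enabled {
            // raise a notification to the reading domain
        }
        // otherwise the data lands in the buffer but no wakeup is sent
    }
}

fn main() {
    let mut rb = RingBuffer::new();

    // Bring-up race: the peer writes before the reader has enabled
    // notifications, so the wakeup is dropped and a reader that sleeps
    // until notified never wakes up.
    rb.write(b"hello");
    rb.enable_notify_read(); // too late

    // Enabling notifications by default (in new()) closes this window.
}
```

The actual fix lives in the icecap repository (see the linked merge request); this sketch only shows why the default matters.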

This depends on https://gitlab.com/arm-research/security/icecap/icecap/-/merge_requests/22, which several commits in this history depend on (and which is already broken by a rebase needed there). I will wait until that is merged before fixing the submodule and taking this PR out of draft.

This should resolve #429, resolve #443, and fix the main hanging problem we are currently seeing in IceCap in CI.

@mathias-arm mathias-arm force-pushed the ring-buffer-stability branch from c97246a to 5578a0d May 6, 2022 19:35
@mathias-arm
Contributor

I merged https://gitlab.com/arm-research/security/icecap/icecap/-/merge_requests/22 and rebased this PR.

@geky geky force-pushed the ring-buffer-stability branch 2 times, most recently from 8ac3ef2 to f5087dc May 6, 2022 20:05
@geky geky marked this pull request as ready for review May 6, 2022 20:06
@geky
Member Author

geky commented May 6, 2022

Should be good to go now when/if CI passes

@dominic-mulligan-arm dominic-mulligan-arm added the bug, trusted-veracruz-runtime, and always-refactoring labels May 9, 2022
@dominic-mulligan-arm
Member

@geky unfortunately some commits are not signed: can you rewrite the Git history to sign them?

@mathias-arm
Contributor

You don't need to add commits to re-run CI, you can restart the failing jobs in "Actions".

@geky
Member Author

geky commented May 9, 2022

@geky unfortunately some commits are not signed: can you rewrite the Git history to sign them?

I was going to just remove those commits; they're empty and were only there to trigger CI.

You don't need to add commits to re-run CI, you can restart the failing jobs in "Actions".

Unfortunately this clobbers the log and hides the old results. I was particularly interested in how many runs failed (though not sure what to do with this info).

@mathias-arm
Contributor

You don't need to add commits to re-run CI, you can restart the failing jobs in "Actions".

Unfortunately this clobbers the log and hides the old results. I was particularly interested in how many runs failed (though not sure what to do with this info).

You can access the logs for all attempts, e.g. https://github.com/veracruz-project/veracruz/actions/runs/2247521453: the "Latest #2" drop-down allows you to select the attempt you want to look at.

@dominic-mulligan-arm
Member

You don't need to add commits to re-run CI, you can restart the failing jobs in "Actions".

Unfortunately this clobbers the log and hides the old results. I was particularly interested in how many runs failed (though not sure what to do with this info).

You can access the logs for all attempts, e.g. https://github.com/veracruz-project/veracruz/actions/runs/2247521453: the "Latest #2" drop-down allows you to select the attempt you want to look at.

The more you know!

@dominic-mulligan-arm
Member

@mathias-arm I went ahead and removed the unverified and useless commits added to force a CI re-run, and addressed my own comment above. Can you review this? It risks needing a rebase, given the amount of other changes going into the repository.

@geky
Member Author

geky commented May 11, 2022

You can access the logs for all attempts, e.g. https://github.com/veracruz-project/veracruz/actions/runs/2247521453: the "Latest #2" drop-down allows you to select the attempt you want to look at.

Oh this is good to know! Looks like they must have added this recently.

@mathias-arm I went ahead and removed the unverified and useless commits added to force a CI re-run, and addressed my own comment above.

Thanks for fixing it up. This looks good to me, though I'm not sure my review counts.


It should be noted we're still seeing unknown CI failures in IceCap, but this PR at least fixes some of the known reasons. More fixes can always come later.

@dominic-mulligan-arm
Member

No problem, it was a small change to get it over the line and remove any risk of rebase nightmares.

@mathias-arm
Contributor

It should be noted we're still seeing unknown CI failures in IceCap, but this PR at least fixes some of the known reasons. More fixes can always come later.

Since the 3 tests were disabled and the yield_() work-around went in, I don't remember seeing spurious CI failures. The changes look good, but I would be uncomfortable approving if this PR is going to make CI unstable again. Let's see the results of these checks and of a rebase; if both are successful, we will merge.

mathias-arm
mathias-arm previously approved these changes May 11, 2022
Contributor

@mathias-arm mathias-arm left a comment


Approval in principle, conditioned on a rebase and successful CI results.

@dominic-mulligan-arm
Member

Valid concerns --- @geky can you do the rebase?

@geky
Member Author

geky commented May 11, 2022

I'm not opposed to removing bad3596 for now; that would make this PR purely an improvement over main.

But I am concerned about the failures for f5087dc, since it in theory had all the fixes in it...

@geky
Member Author

geky commented May 11, 2022

Valid concerns --- @geky can you do the rebase?

Can do, but should we go ahead and remove bad3596? (the unignoring of tests)

@dominic-mulligan-arm
Member

I think bad3596 can be removed if it's still going to represent a strict improvement over main. @mathias-arm what do you think?

@geky geky force-pushed the ring-buffer-stability branch from 8654bdd to ec106ca May 11, 2022 15:53
@geky
Member Author

geky commented May 11, 2022

I went ahead and rebased and dropped bad3596, it's an easy enough patch to apply on its own.

@dominic-mulligan-arm
Member

Unverified commits 😢

@mathias-arm
Contributor

bad3596 can be dealt with separately in #416. But I don't want to assume it is the only contributor to CI instability. As @geky pointed out, f5087dc (now 66d5d48) failed in CI.

geky and others added 7 commits May 11, 2022 10:58
…console

Reworked to be:

1. More stable, hopefully more resilient to losing read/write
   notifications

2. Use BufferedRingBuffer on both sides, which should be less
   error-prone and avoid losing data when Veracruz responses exceed
   the size of the underlying ring-buffer

   Note this requires a bit of mixed usage of the BufferedRingBuffer
   on receiving, since we look ahead to see if we have a full
   message (a rough sketch of this look-ahead follows these commit
   notes).
…ons by default

See the relevant commit in icecap for more details. This should solve
the race-conditions around ring-buffer bring-ups we are seeing in CI.
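
A rough sketch of the look-ahead receive pattern mentioned in the first commit note above: bytes drained from the ring are accumulated locally, and a message is only taken once it is complete, so a response larger than the underlying ring-buffer never wedges the queue. The length-prefixed framing and function names here are assumptions for illustration, not the actual BufferedRingBuffer API:

```rust
// Hypothetical look-ahead receive over a buffered ring buffer. The 4-byte
// little-endian length prefix is an assumed framing for this sketch.

/// Returns a complete length-prefixed message if one is fully buffered.
fn try_take_message(pending: &mut Vec<u8>) -> Option<Vec<u8>> {
    if pending.len() < 4 {
        return None; // length prefix has not fully arrived yet
    }
    let len = u32::from_le_bytes([pending[0], pending[1], pending[2], pending[3]]) as usize;
    if pending.len() < 4 + len {
        return None; // look ahead: the message body is still incomplete
    }
    let msg = pending[4..4 + len].to_vec();
    pending.drain(..4 + len);
    Some(msg)
}

fn main() {
    // `pending` plays the role of the receive-side buffer that bytes drained
    // from the ring are appended to as they arrive.
    let mut pending: Vec<u8> = Vec::new();

    // Simulate two drains of the ring: the prefix arrives first, the body
    // only on a later pass. Nothing blocks while the message is incomplete.
    pending.extend_from_slice(&5u32.to_le_bytes());
    assert!(try_take_message(&mut pending).is_none());

    pending.extend_from_slice(b"hello");
    assert_eq!(try_take_message(&mut pending).as_deref(), Some(&b"hello"[..]));
}
```
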
@geky geky force-pushed the ring-buffer-stability branch from ec106ca to acd17c8 May 11, 2022 15:58
@geky
Member Author

geky commented May 11, 2022

Unverified commits 😢

Woopsie doopsie, fixed now 😅

@dominic-mulligan-arm
Member

What is the status of this, again? I've lost track of what's going on with this PR, unfortunately...

@mathias-arm
Contributor

Superseded by #555

Successfully merging this pull request may close these issues:

Race-condition in IceCap RealmOS
Make IceCap server tests more reliable