Re-check/assess the choose of firecracker #450

Open
hpvd opened this issue Dec 9, 2024 · 12 comments
Labels
proposal Enhancement idea or proposal

Comments

@hpvd

hpvd commented Dec 9, 2024

Proposed change

With its origins at AWS Lambda, Firecracker is a stunning approach to running short-lived functions.

Nex tastes awesome. It seems to be about easily managing/deploying/running

  • short-lived functions
  • but also long(er)-running services
  • (nearly) anywhere NATS lives

Given this, Firecracker (used in Nex, see https://docs.nats.io/using-nats/nex/internals/node_process) comes with some drawbacks that have to be known and weighed carefully:

  1. it's not as easy as it seems to find places where Firecracker can be run in production
    • the big 3 clouds are very expensive, because you have to rely on bare metal (other instance types don't support KVM)
    • cheap virtualized KVM instances seem unsuitable for long-running services because of automatic live migrations
    • typical cloud use cases with some geo-distribution would have to rely on fairly priced local vendors, which adds a lot of overhead from dealing with many parties (not only the technical setup...)
    • these 3 items were extracted from "Firecracker at $3K-10K per runtime hour is unusable" #330
  2. resource efficiency in long-running use cases may have some hurdles to overcome, e.g. see https://hocus.dev/blog/qemu-vs-firecracker/
  3. AI is everywhere, and with it comes the need for specialized hardware (GPU/NPU). Currently it's not possible to pass these devices through for use inside the microVM. The community has been requesting this for years, but it has only recently, and very cautiously, been discussed: "GPU (and PCIe) Support in Firecracker" firecracker-microvm/firecracker#4845
  4. newer security concepts like confidential computing (data is secured not only at rest and in motion but also in use, with memory encrypted per "user" on attested, and therefore trusted, hardware relying on AMD SEV-SNP, Intel TDX, ARM CCA) are not well explored in Firecracker yet...
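
For item 1, a quick first check on any candidate host or instance is whether a usable KVM device is exposed at all, since Firecracker needs read/write access to it. A minimal sketch in Python, assuming the conventional Linux device path `/dev/kvm`; `kvm_status` is a hypothetical helper name, not part of Firecracker or Nex:

```python
import os


def kvm_status(device: str = "/dev/kvm") -> str:
    """Return 'missing', 'no-access', or 'ok' for the given KVM device node.

    Assumption: the host exposes KVM at the conventional path /dev/kvm.
    """
    if not os.path.exists(device):
        # No device node: KVM is not loaded, or the instance type hides it.
        return "missing"
    if not os.access(device, os.R_OK | os.W_OK):
        # Node exists, but the current user lacks read/write permission.
        return "no-access"
    return "ok"


if __name__ == "__main__":
    print(f"/dev/kvm: {kvm_status()}")
```

This only tells you whether the device node is present and accessible. It says nothing about pricing, live migration, or nested-virtualization performance, which are the actual pain points listed above.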

No, I do not have a perfect solution, but it's imho worth knowing, thinking about, and discussing to pave the way to an even brighter future for Nex :-)

Use case

geo-distributed long-running services

Contribution

today only input for discussion, more possibly in the future...

@hpvd hpvd added the proposal Enhancement idea or proposal label Dec 9, 2024
@hpvd hpvd changed the title Recheck choose of firecracker Re-check choose of firecracker Dec 9, 2024
@hpvd hpvd changed the title Re-check choose of firecracker Re-check/assess the choose of firecracker Dec 9, 2024
@MikeHawkesCapventis

I'm happy to help shape this - although, I'm rather hoping that the forthcoming OCI side of NEX might help shape future Firecracker thinking as well.

@jordan-rash
Contributor

We will be merging the v3 branch into main pretty shortly. At that point Firecracker will no longer be a dependency, and you will be able to run nex anywhere. This new version of nex will have much more pluggable agent capabilities. The v3 branch will land before the replacement Firecracker agent, but it will let folks experiment with their own agents, so you can try anything. I would expect the replacement Firecracker agent to arrive Q1 2025-ish.

@hpvd
Author

hpvd commented Dec 9, 2024

@jordan-rash wow, that's really interesting. Many thanks for all your work on this!
Looking forward to these pluggable capabilities and what solutions may come up from requirements...

@MikeHawkesCapventis

Exciting times ahoy! Thanks for the update.

@hpvd
Author

hpvd commented Dec 9, 2024

since we are all very excited: would you mind sharing some early thoughts/sneak peek on this "replacement firecracker agent"?

@jordan-rash
Contributor

since we are all very excited: would you mind sharing some early thoughts/sneak peek on this "replacement firecracker agent"?

As soon as I have something I can share, I will. At this point, we are still working on defining the agent SDK, which I hope to be finished with next week 🤞🏼 (no promises 😅)

@MikeHawkesCapventis

Not sure if this is in scope for the agent SDK (should this be another issue), but please consider a restart on fail capability somewhere in / around the SDK. Currently, if the agent dies, Nex can't do anything about it. We don't even know it's died. This gives us a headache: if the agent fails - we have to restart and redeploy. It's the only part of the landscape that I find slightly uncomfortable: a single misbehaving script can kill an entire implementation, rather than just killing itself!
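
The restart-on-fail behaviour asked for here is essentially process supervision: notice that the child has exited abnormally, then start it again with some backoff. A minimal sketch in Python of that general idea, not Nex code and not its agent SDK; `supervise`, `CRASHING_AGENT`, and the retry/backoff parameters are hypothetical names chosen for illustration:

```python
import subprocess
import sys
import time

# Hypothetical stand-in for an agent process that always crashes (exit code 1).
CRASHING_AGENT = [sys.executable, "-c", "import sys; sys.exit(1)"]


def supervise(cmd, max_restarts=3, backoff_s=1.0, sleep=time.sleep):
    """Run cmd, restarting it after a non-zero exit, up to max_restarts times.

    Returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        # Blocks until the child exits, so a death is always noticed.
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean exit: nothing to restart
        if attempt < max_restarts:
            # Exponential backoff so a crash loop does not spin the host.
            sleep(backoff_s * 2 ** attempt)
    return exit_codes


if __name__ == "__main__":
    print(supervise(CRASHING_AGENT, max_restarts=2, backoff_s=0.1))
```

Capping the number of restarts is a deliberate choice in this sketch: a misbehaving script then exhausts only its own retry budget and the supervisor itself stays alive to report the failure.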

@hpvd
Author

hpvd commented Dec 12, 2024

Not sure if this is in scope for the agent SDK (should this be another issue),

it's imho both :-) would you mind also opening a new issue for this topic?
(so we can easily keep track of it)

@hpvd
Author

hpvd commented Dec 12, 2024

separate issue on this: Agent respawn upon death #454

@udf2457

udf2457 commented Jan 6, 2025

To be honest I would be wary of throwing the baby out with the bathwater.

Rather than dump firecracker support, it should be kept as a supported option. Firecracker should be both supported and documented as such going forward.

Using firecracker could suit many environments. The fact it does not suit yours does not mean NATS should throw away all the hard work done to date on ensuring interop with firecracker.

@hpvd
Author

hpvd commented Jan 6, 2025

To be honest I would be wary of throwing the baby out with the bathwater.

Rather than dump firecracker support, it should be kept as a supported option. Firecracker should be both supported and documented as such going forward.

Using firecracker could suit many environments. The fact it does not suit yours does not mean NATS should throw away all the hard work done to date on ensuring interop with firecracker.

you are absolutely right! Using firecracker could suit many environments.

@jordan-rash presented the imho best way to go: pluggable agent capabilities
see #450 (comment)

@jordan-rash
Contributor

As @hpvd points out, I don't think we are looking to dump anything, just separate where it makes sense. The core of nex needs to be as lightweight as possible and we'll hopefully keep moving in that direction. At the same time, it needs to be as extensible as we can make it so that it supports any agent that the community develops.

4 participants