Re-check/assess the choose of firecracker #450

hpvd · 2024-12-09T13:39:16Z

Proposed change

Originating at AWS Lambda, Firecracker is a stunning approach to running short-lived functions.

Nex looks awesome. It seems to be about easily managing, deploying, and running

- short-lived functions
- but also long(er)-running services
- (nearly) anywhere NATS lives

With this, there are some drawbacks that come with Firecracker (used in Nex https://docs.nats.io/using-nats/nex/internals/node_process) which have to be known and weighed carefully:

- It is not as easy as it seems to find places where Firecracker can run in production:
  - The big 3 clouds are very expensive, because you have to rely on bare metal (other instance types don't support KVM).
  - Cheap virtual KVM seems unsuitable for long-running services because of automatic live migrations.
  - Typical cloud use cases with some geo-distribution, relying on fairly priced local vendors, would add a lot of overhead from dealing with many parties (not only the technical setup...).
  - These 3 items were extracted from "Firecracker at $3K-10K per runtime hour is unusable" #330.
- Resource efficiency in long-running use cases may have some hurdles to overcome, e.g. see https://hocus.dev/blog/qemu-vs-firecracker/
- AI is everywhere, and with it the need for specialized hardware (GPU/NPU). Currently it is not possible to pass these devices through for use inside the VM. The community has requested this for years, but it has only recently, and very cautiously, been discussed: "GPU (and PCIe) Support in Firecracker" firecracker-microvm/firecracker#4845
- Newer security concepts like confidential computing (data is secured not only at rest and in motion but also in use, by encrypting memory per "user" on attested, and therefore trusted, hardware relying on AMD SEV-SNP, Intel TDX, ARM CCA) are not well explored in Firecracker yet...
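
To make the KVM point above concrete: Firecracker needs read/write access to the host's KVM device node, which is what most non-bare-metal cloud instances lack. Below is a minimal sketch of a pre-flight check. The function name `kvm_available` is made up for illustration and is not part of Nex or Firecracker; the default path `/dev/kvm` is the usual Linux location.

```python
import os


def kvm_available(device: str = "/dev/kvm") -> bool:
    """Return True if the KVM device node exists and is readable and writable."""
    # Firecracker talks to KVM through this device node; without it, or
    # without read/write permission on it, a microVM cannot be started.
    # NOTE: helper name and approach are illustrative assumptions only.
    return os.path.exists(device) and os.access(device, os.R_OK | os.W_OK)


if __name__ == "__main__":
    print("KVM usable:", kvm_available())
```

Running this on a candidate host is a cheap first filter. It only tells you the device is reachable, not whether the provider live-migrates the instance or how it performs.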

No, I do not have a perfect solution, but imho it is worth knowing, thinking about, and discussing this to pave the way to an even brighter future for Nex :-)

Use case

geo-distributed long-running services

Contribution

today only input for discussion, more possibly in the future...

MikeHawkesCapventis · 2024-12-09T14:23:00Z

I'm happy to help shape this - although, I'm rather hoping that the forthcoming OCI side of NEX might help shape future Firecracker thinking as well.

jordan-rash · 2024-12-09T15:53:02Z

We will be merging the v3 branch into main pretty shortly. At that point Firecracker is no longer a dependency and you will be able to run Nex anywhere. This new version of Nex will have much more pluggable agent capabilities. While the v3 branch will land before the replacement Firecracker agent, it will allow folks to experiment with their own agents, so you can experiment with anything. I would expect the replacement Firecracker agent to come Q1 2025-ish.

hpvd · 2024-12-09T16:03:53Z

@jordan-rash wow, that's really interesting. Many thanks for all your work on this!
Looking forward to these pluggable capabilities and to the solutions that may come out of the requirements...

MikeHawkesCapventis · 2024-12-09T16:07:50Z

Exciting times ahoy! Thanks for the update.

hpvd · 2024-12-09T20:42:49Z

Since we are all very excited: would you mind sharing some early thoughts or a sneak peek of this "replacement firecracker agent"?

jordan-rash · 2024-12-10T19:04:38Z

> Since we are all very excited: would you mind sharing some early thoughts or a sneak peek of this "replacement firecracker agent"?

As soon as I have something I can share, I will. At this point, we are still working on defining the agent SDK, which I hope to be finished with next week 🤞🏼 (no promises 😅)

MikeHawkesCapventis · 2024-12-12T14:20:49Z

Not sure if this is in scope for the agent SDK (should this be another issue?), but please consider a restart-on-failure capability somewhere in or around the SDK. Currently, if the agent dies, Nex can't do anything about it. We don't even know it has died. This gives us a headache: if the agent fails, we have to restart and redeploy. It's the only part of the landscape that I find slightly uncomfortable: a single misbehaving script can kill an entire implementation, rather than just killing itself!
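
For discussion, here is a rough sketch of what a restart-on-failure loop could look like in general. It is not how Nex or the agent SDK works (the SDK was still being defined at this point); `supervise`, `run_once`, `max_restarts` and `backoff_seconds` are hypothetical names chosen for this example.

```python
import time
from typing import Callable, List


def supervise(run_once: Callable[[], int], max_restarts: int = 3,
              backoff_seconds: float = 1.0) -> List[int]:
    """Run a workload and restart it after each non-zero exit.

    Returns the exit code of every attempt, in order.
    Hypothetical helper for illustration; not part of Nex.
    """
    exit_codes: List[int] = []
    for attempt in range(max_restarts + 1):
        code = run_once()
        exit_codes.append(code)
        if code == 0:
            break  # clean exit: nothing to restart
        if attempt < max_restarts:
            # Back off linearly so a crash loop does not spin the host.
            time.sleep(backoff_seconds * (attempt + 1))
    return exit_codes
```

With a real process, `run_once` could be something like `lambda: subprocess.call(["./my-agent"])`, where `./my-agent` is a placeholder binary. The point of the restart cap is that one misbehaving workload gets a bounded number of retries and its failure is observable in the returned exit codes, instead of silently taking everything else down.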

hpvd · 2024-12-12T16:08:47Z

> Not sure if this is in scope for the agent SDK (should this be another issue),

It's imho both :-) Would you mind also opening a new issue for this topic?
(so we can easily keep track of it)

hpvd · 2024-12-12T16:20:57Z

separate issue on this: Agent respawn upon death #454

udf2457 · 2025-01-06T19:17:17Z

To be honest, I would be wary of throwing the baby out with the bathwater.

Rather than dump firecracker support, it should be kept as a supported option. Firecracker should be both supported and documented as such going forward.

Using firecracker could suit many environments. The fact it does not suit yours does not mean NATS should throw away all the hard work done to date on ensuring interop with firecracker.

hpvd · 2025-01-06T19:41:18Z

> To be honest, I would be wary of throwing the baby out with the bathwater.
>
> Rather than dump firecracker support, it should be kept as a supported option. Firecracker should be both supported and documented as such going forward.
>
> Using firecracker could suit many environments. The fact it does not suit yours does not mean NATS should throw away all the hard work done to date on ensuring interop with firecracker.

You are absolutely right! Using Firecracker could suit many environments.

@jordan-rash presented what is imho the best way to go: pluggable agent capabilities,
see #450 (comment)

jordan-rash · 2025-01-06T21:25:27Z

As @hpvd points out, I don't think we are looking to dump anything, just separate things where it makes sense. The core of Nex needs to be as lightweight as possible and we'll hopefully keep moving in that direction. At the same time, it needs to be as extensible as we can make it so that it supports any agent that the community develops.

hpvd added the proposal (Enhancement idea or proposal) label Dec 9, 2024

hpvd changed the title ~~Recheck choose of firecracker~~ Re-check choose of firecracker Dec 9, 2024

hpvd mentioned this issue Dec 9, 2024

Firecracker at $3K-10K per runtime hour is unusable #330

Closed

hpvd changed the title ~~Re-check choose of firecracker~~ Re-check/assess the choose of firecracker Dec 9, 2024
