
RFD 121 bhyve brand: Discussion #76

Open · mgerdts opened this issue Jan 10, 2018 · 39 comments
mgerdts commented Jan 10, 2018

This issue represents an opportunity for discussion of RFD 121 while it remains in a pre-published state.

sjorge commented Jan 10, 2018

For the KVM brand (at least the last time I played with it), every device configured via zonecfg will get passed as a disk.

If possible, the same approach should be avoided for bhyve, to keep the option for device pass-through open in the future. (E.g. it is impossible to pass a tty through to a kvm-branded zone; this works fine on OmniOS.)

XXX Do we need to expose the bus:slot:function as a property on device and net resources?

To keep the option of pass-through open in the future, this might be needed. E.g. PCI devices usually need to be attached at the same spot to work properly.

@bcantrill bcantrill changed the title RFD-121: Discussion RFD 121: Discussion Jan 10, 2018
sjorge commented Jan 10, 2018

Oh, and you might want to reach out to Allan Jude; if I am not mistaken he revived a review for some code to have bhyve parse a simple config file. Maybe the zhyve bits can benefit from that, or the other way around.

wiedi commented Jan 10, 2018

VirtFS/9p filesystem passthrough (https://reviews.freebsd.org/D10335) might enable support for zonecfg "fs".

@jussisallinen

Sorry for posting something not directly related to the RFD.
Features along these lines would be welcome additions in the bigger picture, if feasible in terms of Triton:

  • zvol live resize (expansion in particular) without a reboot.
  • adding additional zvols to an instance without a reboot.

Live resize might already be on the list anyway; it seems to be a feature most missed by plain SmartOS users.

siepkes commented Jan 11, 2018

@jussisallinen I think you are looking for RFD 26.

mgerdts commented Jan 11, 2018

New predraft posted, with @yourname next to things that are in response to various people's comments.

  • @sjorge I've made notes that we need to accommodate pci passthrough and tty devices
  • @sjorge Also made some changes to the way that zhyve gets its config. This is an implementation detail that is likely to change over time.
  • @wiedi This is very interesting - thanks for pointing it out. This probably won't make the first cut, but seems useful for the future.
  • @jussisallinen I think that bhyve support for live resize should be straightforward. Hooking it into Triton is an area that I don't know much about. I'll add this to my list of things to think about as we work up the stack (@joshwilsdon may have thoughts). Live add/remove of devices is a bit more of a challenge and will not be in an early cut.

sjorge commented Jan 11, 2018

@mgerdts you added a note for @wiedi's 9p passthrough to the fs-allowed property. Shouldn't that go in the fs property instead? IIRC fs-allowed is for telling a zone what file system types it is allowed to mount.

But more clarification on the tty pass-through: I was talking specifically about cua* devices, e.g. for a serial UPS or GPS device. When running KVM outside a kvm-branded zone you can hook up one of qemu/kvm's devices to those. That does not work in kvm-branded zones due to all device zone properties being mapped to a disk. Sorry if that was not clear before.

mgerdts commented Jan 11, 2018

@sjorge Thanks for noticing that I only updated fs-allowed. I've now updated fs as well. As for the tty devices, I realize I chose the wrong term. I've changed that to serial to better convey the intent.

sjorge commented Jan 11, 2018

@mgerdts another thing: maybe make amd_hostbridge vs hostbridge (Intel) selectable!

Slot 0 should use amd_hostbridge for all OpenBSD instances, even on Intel-based hardware, as MSI/MSI-X interrupts (which are used by the xhci/tablet for accurate mouse control) are not available in OpenBSD unless the hostbridge is advertised as AMD.
https://wiki.freebsd.org/bhyve/OpenBSD

Also, if/once we export the VNC console, an option to set the wait flag would be great to expose:

The fbuf wait parameter instructs bhyve to only boot upon the initiation of a VNC connection, simplifying the installation of operating systems that require immediate keyboard input. This can be removed for post-installation use. https://wiki.freebsd.org/bhyve/UEFI
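
For reference, on FreeBSD that flag is appended to the fbuf device string; the slot number, listen address, and resolution below are only examples:

-s 29,fbuf,tcp=0.0.0.0:5900,w=1024,h=768,wait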

@pfmooney

Thanks for doing the initial write-up, Mike. (And thanks to everyone else for the feedback)

Some comments from my first pass:

/usr/sbin/amd64/bhyve -m 4g -c 2 -l com1,/dev/zconsole -P -H -s 1,lpc \
    -s 3,virtio-blk,/dev/zvol/rdsk/zones/$zone/data/disk0 \
    -s 4,virtio-net-viona,net0 \
    -l bootrom,/BHYVE_UEFI.fd "$zone"

We're expecting the bhyve vmm component to be sdev-aware, right? Perhaps we
can stick to a common name for each instance (VM?) since they'll all be
effectively namespaced into the zone. (So it'd be like /dev/vmm/<zonename>/VM from the GZ, /dev/vmm/VM from the zone)

The /dev/viona/ctl node is opened and a CREATE ioctl is issued. This creates a new minor that does not require a minor node. The return value from the ioctl is a file descriptor associated with the new viona minor.

Any reason to use /dev/viona/ctl over plain /dev/viona? I don't believe any piece of the viona interface is going to need (or frankly want) access to already-opened instances for other processes.
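
For concreteness, the flow described in the quoted text looks roughly like this. This is only a sketch: VNA_CREATE and its argument are placeholder names, not the viona driver's actual (private) ioctl interface.

#include <sys/types.h>
#include <fcntl.h>
#include <stropts.h>
#include <unistd.h>
#include <err.h>

#define	VNA_CREATE	1	/* placeholder ioctl number */

int
create_viona(void *create_args)
{
	int ctl, vfd;

	if ((ctl = open("/dev/viona/ctl", O_RDWR)) < 0)
		err(1, "open /dev/viona/ctl");

	/* CREATE returns an fd bound to a new anonymous minor. */
	if ((vfd = ioctl(ctl, VNA_CREATE, create_args)) < 0)
		err(1, "VNA_CREATE");

	(void) close(ctl);	/* the instance lives on through vfd */
	return (vfd);
}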

When the bhyve command exits, the kernel state remains present until a DESTROY ioctl is issued. To free these resources, vmctl --destroy must be used.

For in-zone bhyve, we'll need to wire up an explicit interface so that
lingering vmm resources are destroyed on zone shutdown.

The following are private implementation details that are architecturally relevant.
Guest networking configuration

XXX @pfmooney needs to review this

I am working on plumbing for this that will resemble the interface used by
qemu/KVM today: Where the userspace hypervisor component can be configured to
respond to DHCP requests using addressing information provided at startup.
My plan is for those addressing parameters to be passed to the bhyve(1)
commandline as parameters for the viona driver. It will handle all the
details of filtering and injecting the packets in as efficient a manner as
possible.

XXX It is not yet known if the following are needed:
vnd(8d)

The viona bits are using MAC directly, so only the libraries should be required, I think?

We are striving to not modify bhyve code any more than required so that it is easier to keep in sync with upstream. For this reason, a new source file, zhyve.c, is being added. This will contain an implementation of main() and any other bhyve brand-specific code that is required. The main() that is in bhyverun.c is renamed to bhyve_main() via -Dmain=bhyve_main in CPPFLAGS while compiling bhyverun.c.

In the global zone, /usr/sbin/amd64/bhyve and /usr/lib/brand/bhyve/usr/sbin/zhyve will be hard links to the same file. When invoked with a basename of bhyve, the command will behave exactly as documented in bhyve(8). When invoked with a basename of zhyve, it will read its arguments from /var/run/bhyve/zhyve.args

Is there a reason why we can't call the bhyve binary from zhyve rather than
the name-detection route? I suspect a separate handler process (zhyve as init)
would be desirable for doing guest restarts.
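
For reference, the name-detection approach quoted above amounts to very little code; something like the sketch below, where read_zhyve_args is a hypothetical helper that parses /var/run/bhyve/zhyve.args:

#include <libgen.h>
#include <string.h>

extern int bhyve_main(int, char **);		/* main() from bhyverun.c, renamed */
extern char **read_zhyve_args(int *argcp);	/* hypothetical: parses zhyve.args */

int
main(int argc, char **argv)
{
	/*
	 * Dispatch on the invoked name: as bhyve, keep the normal CLI;
	 * as zhyve, take arguments from /var/run/bhyve/zhyve.args.
	 */
	if (strcmp(basename(argv[0]), "zhyve") == 0)
		argv = read_zhyve_args(&argc);

	return (bhyve_main(argc, argv));
}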

The privileges will be stripped to the minimum required to run a guest. If bhyve only needs a privilege during startup, the privilege will be dropped prior to running code in the guest.

Is this the appropriate place to mention adding a privilege to the system which is required to instantiate vmm instances?

@sjorge
Copy link

sjorge commented Jan 12, 2018

I am working on plumbing for this that will resemble the interface used by
qemu/KVM today: Where the userspace hypervisor component can be configured to
respond to DHCP requests using addressing information provided at startup.
My plan is for those addressing parameters to be passed to the bhyve(1)
commandline as parameters for the viona driver. It will handle all the
details of filtering and injecting the packets in as efficient a manner as
possible.

Only DHCPv4? Or also SLAAC/DHCPv6?

@pfmooney

Only DHCPv4? Or also SLAAC/DHCPv6?

The first draft will only be DHCPv4 to support the "static" addressing model expected from existing KVM images to make it easier to test with them. I think we'll probably want a more explicit scheme for communicating static addresses (both v4 and v6) into the guest, rather than snooping for DHCP requests.

mgerdts commented Jan 12, 2018 via email

@pfmooney

Only because the sdev_plugin seems to not work in the "just create a device without a directory" mode. Once sdev issues are sorted, we can drop the directory.

I'm not sure the same sdev plugin integration is needed for viona. Something similar to what was done for the KVM driver may be adequate.

I was trying to make it so that the zone doesn't need proc_exec. The zone platform has a restart_init option that we could set to true for guest restarts where we want to reattach to the already allocated memory.

If we end up wanting/needing the iasl configuration on start-up, the effort to stay proc_exec-free may be in vain. Nothing prevents us from dropping privileges like that once the instance is in steady-state.

sjorge commented Jan 14, 2018

An option to specify a bootrom in the zonecfg might also be nice.
This opens up the possibility for, say, UEFI+GOP and UEFI+Serial boot ROMs, but also a UEFI+CSM bootrom for systems that do not support UEFI.

e.g. https://www.freshports.org/sysutils/uefi-edk2-bhyve-csm/

Maybe also an option to pass user-specified flags to the bhyve process.

sjorge commented Jan 17, 2018

Something else that popped into my head while I was doing some stuff with my kvm zone earlier today...

vmadm connect UUID will use socat, which is not that nice; manually using @jclulow's sercons is much nicer...

Anyway, can we maybe get bhyve brand to just work with zlogin -C ?

mgerdts commented Jan 17, 2018

Anyway, can we maybe get bhyve brand to just work with zlogin -C ?

The plan is to have com1 (ttya, ttyS0, whatever) connected to /dev/zconsole, which will give you access to whatever the guest attaches to its first serial port (hopefully the console). There will also be the ability to connect com2 to /dev/zconsole instead. Either of the two supported serial ports can alternatively be connected to a UNIX domain socket on which bhyve listens.
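
Once that lands, console access should be the standard zlogin flow from the global zone (the zone name here is illustrative):

# zlogin -C bhyve0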

The updates to the RFD will come soon.

melloc commented Jan 17, 2018

Only DHCPv4? Or also SLAAC/DHCPv6?

The first draft will only be DHCPv4 to support the "static" addressing model expected from existing KVM images to make it easier to test with them. I think we'll probably want a more explicit scheme for communicating static addresses (both v4 and v6) into the guest, rather than snooping for DHCP requests.

To expand on this further, I suspect that we will want to do with bhyve what we started doing with KVM images (only Ubuntu so far, I believe): fetch the information using mdata-get sdc:nics and use that to set things up inside the zone. This is how we want to eventually make all of our images work, so that we can support multiple IP addresses, IPv6, setting the correct MTU, etc., without having to snoop on the VM's traffic.

mgerdts commented Jan 18, 2018

@sjorge - The ability to select the bootrom is being added as well.

Is there a need to select arbitrary bootroms or is the selection of BHYVE_UEFI_CSM.fd or BHYVE_UEFI.fd sufficient?

sjorge commented Jan 18, 2018

Selecting between UEFI and UEFI_CSM will be sufficient, I think. I don't think I have seen any other bootroms available.

@pfmooney

To expand on this further, I suspect that we will want to do with bhyve what we started doing with KVM images (only Ubuntu so far I believe): fetch the information using mdata-get sdc:nics and use that to set things up inside the zone. This is how we want to eventually make all of our images work, so that we can support multiple IP addresses, IPv6, setting the correct MTU, etc. without having to snoop on the VM's traffic.

Absolutely. The DHCP addressing feature is only meant as a stopgap to allow KVM images to be used until better bhyve-specific ones (which do static addressing on their own) are created. Fortunately, the design for intercepting those DHCP/ARP requests in viona should result in lower overhead than what's used for KVM.

sjorge commented Jan 18, 2018

(Slightly OT) The idea is to grab the RARP/DHCP bits when they pass through viona; instead of sending them to the network stack, bhyve (or some other daemon) handles them and sends a reply?

mgerdts commented Mar 6, 2018

I've updated the draft. Please take a fresh look.

https://github.com/joyent/rfd/blob/master/rfd/0121/README.md

sjorge commented Mar 7, 2018

@mgerdts I left a few comments on the actual commit, that seemed easier than copy/pasting here to add comments.

ptribble commented Mar 7, 2018

For zoneadm install -i, can you consider two additional formats:

  1. A simple tarball, for example one obtained via 'docker export', or available for download (e.g. https://cloud-images.ubuntu.com/ for Ubuntu or http://download.proxmox.com/images/system/ for Proxmox) - this is what I typically feed into an LX zone, and is currently used in OmniOS and Tribblix

  2. ACI format images, which have the advantage of being a simple standard, and are what I would like to transition to as an image format

Thanks!

citrus-it commented Mar 7, 2018

Ideally allowed-address should be supported. I'm not talking about the guest having access to this information via zone attrs, but just that zoneadmd will configure L3 protection on the VNIC when starting the zone.

The process of zoneadmd setting environment variables, and then boot using these to build the bhyve CLI arguments and storing them in a file to be picked up by zhyve, seems unnecessarily convoluted. Have you considered lofs-mounting the zone's XML file into the zone as lx does? It could then be parsed directly or via libzonecfg. Alternatively, the boot process is also able to access the zone config, removing one level of indirection. SmartOS's zoneadmd is already quite different from upstream's, which could be a barrier.

Zone root - as I read it, a bhyve zone will effectively need to be a sparse zone, and zoneadm -z xx install actually installs an image onto the virtual disk for the guest. SmartOS and Tribblix already have sparse zones and OmniOS will get them as of May (with r151026). It's not clear how you intend to handle this gap in the upstream gate.

It is possible to run an arbitrary number of bhyve instances in the global zone or non-global zones, subject to resource constraints.

With the recent zone-specific-data integration, isn't there now a limit of one bhyve per NGZ?

mgerdts commented Mar 7, 2018

@ptribble Container images (docker or ACI, assuming I found the right ACI) are great for containers, but problematic for virtual machines. In order to create a VM, we would need to run a compatible OS to perform the installation. It seems like the container images could be installed over the top of some base image (which includes the required kernel, boot loader, etc.).

It seems like what you are after is better implemented with custom media that does the right thing for the OS that is being installed, or with post-installation orchestration. If you disagree, please share a rough outline of how any illumos-derived installation would be able to reliably install an arbitrary container image that probably assumes a non-illumos kernel. In particular, how are appropriate partition tables, file systems, boot loaders, and kernels put in place?

mgerdts commented Mar 7, 2018

@citrus-it allowed-address is one that I waffled on for a while. The way that SmartOS does this now is with something along the lines of dladm set-linkprop protection=... to handle all of the anti-spoofing rules. I see that protection is missing from dladm(1M), so perhaps that's something else that needs to be upstreamed.

Maybe what makes sense is to allow allowed-address and append that to any protection specified by linkprop name=protection. Of course, if linkprop name=protection doesn't exist, allowed-address would still be applied.

Note that this would probably not make the first cut that goes into smartos-joyent, but it seems reasonable to implement before upstreaming to illumos.
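
For reference, the zonecfg side of this would be the existing allowed-address property on the net resource; the zone name, NIC, and address below are illustrative:

# zonecfg -z bhyve0
zonecfg:bhyve0> select net physical=net0
zonecfg:bhyve0:net> set allowed-address=10.0.0.1/32
zonecfg:bhyve0:net> end
zonecfg:bhyve0> commit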

ptribble commented Mar 7, 2018

@mgerdts Hm, I guess that I'm getting too close to direct install. I suspect this is something I would have to experiment with to see what does and doesn't work, rather than trying to design it up front.

@citrus-it

@mgerdts thanks for the reply. OmniOS certainly has protection settable with:

reaper# dladm set-linkprop -p protection=ip-nospoof -p allowed-ips=10.0.0.1/32 test0
reaper# dladm show-linkprop test0 | egrep 'allowed-ip|protect'
test0        protection      rw   ip-nospoof     --             mac-nospoof,
test0        allowed-ips     rw   10.0.0.1/32    --             --

but the automatic setting done by zoneadmd is convenient and expected - we use it for both native and lx zones.

mgerdts commented Mar 7, 2018

@citrus-it I agree that the environment variables followed by boot dumping stuff into the zone is hacky. I'm considering alternatives.

One in particular mimics what we did with kernel zones in Solaris. We modified zoneadmd so that it has brand-specific handlers that run in zoneadmd. This gave us great flexibility in ongoing interactions between zoneadmd and the equivalent of zhyve (kzhost in solaris-kz zones). The way it works there is:

  • zoneadmd makes the system call to boot the zone
  • The in-zone process is started by the kernel
  • That process creates a door server and waits for commands
  • zoneadmd connects to the door and sends it a boot command with all the required config
  • The return from the door call says whether the start of the guest succeeded. See OS-6717

FWIW, I skipped the part where logging is set up, a la OS-6718.
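
To make the shape of that handshake concrete, here is a minimal sketch of the in-zone door server. The command format, door path, and start_guest() are all hypothetical; the real protocol is private to the brand.

#include <door.h>
#include <stropts.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

extern int start_guest(void);	/* hypothetical: boots the bhyve guest */

/* Door server procedure: zoneadmd door_call()s commands such as "boot". */
static void
server(void *cookie, char *argp, size_t asz, door_desc_t *dp, uint_t ndesc)
{
	int rv = -1;

	if (argp != NULL && asz >= 4 && strncmp(argp, "boot", 4) == 0)
		rv = start_guest();

	/* The return value tells zoneadmd whether guest start succeeded. */
	(void) door_return((char *)&rv, sizeof (rv), NULL, 0);
}

int
main(void)
{
	int did = door_create(server, NULL, 0);

	/* Attach the door to a path zoneadmd can open; path is illustrative. */
	(void) fattach(did, "/var/run/zhyve_door");
	(void) pause();		/* service door invocations */
	return (0);
}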

There are other ways to deal with this too. I'm admittedly biased toward what we did with kernel zones, as that proved to be quite handy as we added suspend/resume, live reconfiguration, live migration, etc. Not only that, the in-process brand hooks made it so that a ton of stuff that is in ksh in other brands was able to be implemented in C (often reusing existing C code).

I've posted a very incomplete working copy of a future RFD.

mgerdts commented Mar 7, 2018

@citrus-it thanks for that - I wasn't aware of -p allowed-ips=.... Seems like this is a great thing to keep.

@citrus-it

The current zoneadmd will always set both protection and allowed-ips for an interface in an exclusive-IP zone which has an allowed-address property; allowed-ips always gets a /32 mask, however.
https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/zoneadmd/vplat.c#L2741
I think this would be a zero-cost thing to support for bhyve zones since zoneadmd is already doing the work.

sjorge commented Mar 8, 2018

@mgerdts I was going over the RFD again; it is unclear to me how (or if) it is possible to pick amd_hostbridge vs hostbridge for slot 0.

(E.g. some guests only support one of the two; OpenBSD, for example, needs amd_hostbridge.)

mgerdts commented Mar 8, 2018

@sjorge good point. I think that is probably best done with device (or pci, if we go that route). I'll be sure to address this in an upcoming draft.

@mgerdts mgerdts changed the title RFD 121: Discussion RFD 121 bhyve brand: Discussion Mar 28, 2018
sjorge commented May 31, 2018

Another typo snuck in: uefi-csi-rom.bin

liv3010m commented May 5, 2020

If it works, can we have the option of specifying "pci_slot": "x:y:z" in a VNIC definition inside "nics" when creating a bhyve VM, so we can have more than 8 functions assigned to virtual NICs?

sjorge commented May 5, 2020

If it works, can we have the option of specifying "pci_slot": "x:y:z" in a VNIC definition inside "nics" when creating a bhyve VM, so we can have more than 8 functions assigned to virtual NICs?

There is already a ticket for this work: https://smartos.org/bugview/OS-7458
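
For anyone landing here later, the payload shape being asked about would look something like this (field values are illustrative; see the ticket for the actual design):

"nics": [
  {
    "nic_tag": "admin",
    "ip": "dhcp",
    "model": "virtio",
    "pci_slot": "0:6:0"
  }
]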

liv3010m commented May 5, 2020

Thanks! Sorry for not seeing it.
