
RFD 121 bhyve brand: Discussion #76

Open · mgerdts opened this issue Jan 10, 2018 · 39 comments
mgerdts commented Jan 10, 2018

This issue represents an opportunity for discussion of RFD 121 while it remains in a pre-published state.

sjorge commented Jan 10, 2018

For the KVM brand (at least the last time I played with it), every device configured via zonecfg will get passed as a disk.

If possible, the same approach should be avoided for bhyve, to keep the option for device pass-through open in the future. (E.g. it is impossible to pass a tty through to a kvm-branded zone; this works fine on OmniOS.)

XXX Do we need to expose the bus:slot:function as a property on device and net resources?

To keep the option of pass-through open in the future, this might be needed. E.g. PCI devices usually need to be attached at the same spot to work properly.

@bcantrill bcantrill changed the title RFD-121: Discussion RFD 121: Discussion Jan 10, 2018
sjorge commented Jan 10, 2018

Oh, and you might want to reach out to Allan Jude; if I am not mistaken he revived a review for some code to have bhyve parse a simple config file. Maybe the zhyve bits can benefit from that, or the other way around.

wiedi commented Jan 10, 2018

VirtFS/9p filesystem passthrough (https://reviews.freebsd.org/D10335) might enable support for zonecfg "fs".

@jussisallinen

Sorry for posting something not directly related to the RFD.
Features along these lines would be welcome additions in the bigger picture, if feasible in terms of Triton:

  • zvol live resize (expansion in particular) without a reboot.
  • adding additional zvols to an instance without a reboot.

Live resize might already be on the list anyway; it seems to be a feature most missed by plain SmartOS users.

siepkes commented Jan 11, 2018

@jussisallinen I think you are looking for RFD 26.

mgerdts commented Jan 11, 2018

New predraft posted, with @yourname next to things that are in response to various people's comments.

  • @sjorge I've made notes that we need to accommodate pci passthrough and tty devices
  • @sjorge Also made some changes to the way that zhyve gets its config. This is an implementation detail that is likely to change over time.
  • @wiedi This is very interesting - thanks for pointing it out. This probably won't make the first cut, but seems useful for the future.
  • @jussisallinen I think that bhyve support for live resize should be straightforward. Hooking it into Triton is an area that I don't know much about. I'll add this to my list of things to think about as we work up the stack (@joshwilsdon may have thoughts). Live add/remove of devices is a bit more of a challenge and will not be in an early cut.

sjorge commented Jan 11, 2018

@mgerdts you added a note for @wiedi's 9p passthrough to the fs-allowed property. Shouldn't that go in the fs property instead? IIRC fs-allowed is for telling a zone what file system types it is allowed to mount.

But more clarification on the tty pass-through: I was talking specifically about cua* devices, e.g. for a serial UPS or GPS device. When running KVM outside a kvm-branded zone you can hook up one of qemu/kvm's devices to those. That does not work in kvm-branded zones due to all device zone properties being mapped to a disk. Sorry if that was not clear before.

mgerdts commented Jan 11, 2018

@sjorge Thanks for noticing that I only updated fs-allowed. I've now updated fs as well. As for the tty devices, I realize I chose the wrong term. I've changed that to serial to better convey the intent.

sjorge commented Jan 11, 2018

@mgerdts another thing: maybe make amd_hostbridge vs hostbridge (Intel) selectable!

Slot 0 should use amd_hostbridge for all OpenBSD instances, even on Intel-based hardware, as MSI/MSI-X interrupts (which are used by the xhci/tablet for accurate mouse control) are not available in OpenBSD unless the hostbridge is advertised as AMD.
https://wiki.freebsd.org/bhyve/OpenBSD

Also, if/once we export the VNC console, an option to set the wait flag would be great to expose:

The fbuf wait parameter instructs bhyve to only boot upon the initiation of a VNC connection, simplifying the installation of operating systems that require immediate keyboard input. This can be removed for post-installation use. https://wiki.freebsd.org/bhyve/UEFI
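
For reference, on FreeBSD that flag is appended to the fbuf device string; the slot number, listen address, and resolution below are only examples:

-s 29,fbuf,tcp=0.0.0.0:5900,w=1024,h=768,wait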

@pfmooney

Thanks for doing the initial write-up, Mike. (And thanks to everyone else for the feedback)

Some comments from my first pass:

/usr/sbin/amd64/bhyve -m 4g -c 2 -l com1,/dev/zconsole -P -H -s 1,lpc \
    -s 3,virtio-blk,/dev/zvol/rdsk/zones/$zone/data/disk0 \
    -s 4,virtio-net-viona,net0 \
    -l bootrom,/BHYVE_UEFI.fd "$zone"

We're expecting the bhyve vmm component to be sdev-aware, right? Perhaps we
can stick to a common name for each instance (VM?) since they'll all be
effectively namespaced into the zone. (So it'd be like /dev/vmm/<zonename>/VM from the GZ, /dev/vmm/VM from the zone)

The /dev/viona/ctl node is opened and a CREATE ioctl is issued. This creates a new minor that does not require a minor node. The return value from the ioctl is a file descriptor associated with the new viona minor.

Any reason to use /dev/viona/ctl over plain /dev/viona? I don't believe any piece of the viona interface is going to need (or frankly want) access to already-opened instances for other processes.
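
For concreteness, the flow described in the quoted text looks roughly like this. This is only a sketch: VNA_CREATE and its argument are placeholder names, not the viona driver's actual (private) ioctl interface.

#include <sys/types.h>
#include <fcntl.h>
#include <stropts.h>
#include <unistd.h>
#include <err.h>

#define	VNA_CREATE	1	/* placeholder ioctl number */

int
create_viona(void *create_args)
{
	int ctl, vfd;

	if ((ctl = open("/dev/viona/ctl", O_RDWR)) < 0)
		err(1, "open /dev/viona/ctl");

	/* CREATE returns an fd bound to a new anonymous minor. */
	if ((vfd = ioctl(ctl, VNA_CREATE, create_args)) < 0)
		err(1, "VNA_CREATE");

	(void) close(ctl);	/* the instance lives on through vfd */
	return (vfd);
}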

When the bhyve command exits, the kernel state remains present until a DESTROY ioctl is issued. To free these resources, vmctl --destroy must be used.

For in-zone bhyve, we'll need to wire up an explicit interface so that
lingering vmm resources are destroyed on zone shutdown.

The following are private implementation details that are architecturally relevant.
Guest networking configuration

XXX @pfmooney needs to review this

I am working on plumbing for this that will resemble the interface used by
qemu/KVM today: Where the userspace hypervisor component can be configured to
respond to DHCP requests using addressing information provided at startup.
My plan is for those addressing parameters to be passed to the bhyve(1)
commandline as parameters for the viona driver. It will handle all the
details of filtering and injecting the packets in as efficient a manner as
possible.

XXX It is not yet known if the following are needed:
vnd(8d)

The viona bits are using MAC directly, so only the libraries should be required, I think?

We are striving to not modify bhyve code any more than required so that it is easier to keep in sync with upstream. For this reason, a new source file, zhyve.c, is being added. This will contain an implementation of main() and any other bhyve brand-specific code that is required. The main() that is in bhyverun.c is renamed to bhyve_main() via -Dmain=bhyve_main in CPPFLAGS while compiling bhyverun.c.

In the global zone, /usr/sbin/amd64/bhyve and /usr/lib/brand/bhyve/usr/sbin/zhyve will be hard links to the same file. When invoked with a basename of bhyve, the command will behave exactly as documented in bhyve(8). When invoked with a basename of zhyve, it will read its arguments from /var/run/bhyve/zhyve.args

Is there a reason why we can't call the bhyve binary from zhyve rather than
the name-detection route? I suspect a separate handler process (zhyve as init)
would be desirable for doing guest restarts.
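
For reference, the name-detection approach quoted above amounts to very little code; something like the sketch below, where read_zhyve_args is a hypothetical helper that parses /var/run/bhyve/zhyve.args:

#include <libgen.h>
#include <string.h>

extern int bhyve_main(int, char **);		/* main() from bhyverun.c, renamed */
extern char **read_zhyve_args(int *argcp);	/* hypothetical: parses zhyve.args */

int
main(int argc, char **argv)
{
	/*
	 * Dispatch on the invoked name: as bhyve, keep the normal CLI;
	 * as zhyve, take arguments from /var/run/bhyve/zhyve.args.
	 */
	if (strcmp(basename(argv[0]), "zhyve") == 0)
		argv = read_zhyve_args(&argc);

	return (bhyve_main(argc, argv));
}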

The privileges will be stripped to the minimum required to run a guest. If bhyve only needs a privilege during startup, the privilege will be dropped prior to running code in the guest.

Is this the appropriate place to mention adding a privilege to the system which is required to instantiate vmm instances?

@sjorge
Copy link

sjorge commented Jan 12, 2018

I am working on plumbing for this that will resemble the interface used by
qemu/KVM today: Where the userspace hypervisor component can be configured to
respond to DHCP requests using addressing information provided at startup.
My plan is for those addressing parameters to be passed to the bhyve(1)
commandline as parameters for the viona driver. It will handle all the
details of filtering and injecting the packets in as efficient a manner as
possible.

Only DHCPv4? Or also SLAAC/DHCPv6?

@pfmooney

Only DHCPv4? Or also SLAAC/DHCPv6?

The first draft will only be DHCPv4 to support the "static" addressing model expected from existing KVM images to make it easier to test with them. I think we'll probably want a more explicit scheme for communicating static addresses (both v4 and v6) into the guest, rather than snooping for DHCP requests.

mgerdts commented Jan 12, 2018 via email

@pfmooney

Only because the sdev_plugin seems to not work in the "just create a device without a directory" mode. Once sdev issues are sorted, we can drop the directory.

I'm not sure the same sdev plugin integration is needed for viona. Something similar to what was done for the KVM driver may be adequate.

I was trying to make it so that the zone doesn't need proc_exec. The zone platform has a restart_init option that we could set to true for guest restarts where we want to reattach to the already allocated memory.

If we end up wanting/needing the iasl configuration on start-up, the effort to stay proc_exec-free may be in vain. Nothing prevents us from dropping privileges like that once the instance is in steady-state.

sjorge commented Jan 14, 2018

An option to specify a bootrom in the zonecfg might also be nice.
This opens up the possibility for, say, UEFI+GOP and UEFI+Serial boot ROMs, but also a UEFI+CSM bootrom for systems that do not support UEFI.

e.g. https://www.freshports.org/sysutils/uefi-edk2-bhyve-csm/

Maybe also an option to pass user-specified flags to the bhyve process.

sjorge commented Jan 17, 2018

Something else that popped into my head while I was doing some stuff with my kvm zone earlier today...

vmadm connect UUID will use socat, which is not that nice; manually using @jclulow's sercons is much nicer...

Anyway, can we maybe get bhyve brand to just work with zlogin -C ?

mgerdts commented Jan 17, 2018

Anyway, can we maybe get bhyve brand to just work with zlogin -C ?

The plan is to have com1 (ttya, ttyS0, whatever) connected to /dev/zconsole, which will give you access to whatever the guest attaches to its first serial port (hopefully the console). There will also be the ability to connect com2 to /dev/zconsole instead. Either of the two supported serial ports can alternatively be connected to a UNIX domain socket on which bhyve listens.
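
Once that lands, console access should be the standard zlogin flow from the global zone (the zone name here is illustrative):

# zlogin -C bhyve0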

The updates to the RFD will come soon.

melloc commented Jan 17, 2018

Only DHCPv4? Or also SLAAC/DHCPv6?

The first draft will only be DHCPv4 to support the "static" addressing model expected from existing KVM images to make it easier to test with them. I think we'll probably want a more explicit scheme for communicating static addresses (both v4 and v6) into the guest, rather than snooping for DHCP requests.

To expand on this further, I suspect that we will want to do with bhyve what we started doing with KVM images (only Ubuntu so far, I believe): fetch the information using mdata-get sdc:nics and use that to set things up inside the zone. This is how we want to eventually make all of our images work, so that we can support multiple IP addresses, IPv6, setting the correct MTU, etc., without having to snoop on the VM's traffic.

mgerdts commented Jan 18, 2018

@sjorge - The ability to select the bootrom is being added as well.

Is there a need to select arbitrary bootroms or is the selection of BHYVE_UEFI_CSM.fd or BHYVE_UEFI.fd sufficient?

sjorge commented Jan 18, 2018

Selecting between UEFI and UEFI_CSM will be sufficient, I think. I don't think I have seen any other bootroms available.

@pfmooney

To expand on this further, I suspect that we will want to do with bhyve what we started doing with KVM images (only Ubuntu so far I believe): fetch the information using mdata-get sdc:nics and use that to set things up inside the zone. This is how we want to eventually make all of our images work, so that we can support multiple IP addresses, IPv6, setting the correct MTU, etc. without having to snoop on the VM's traffic.

Absolutely. The DHCP addressing feature is only meant as a stopgap to allow KVM images to be used until better bhyve-specific ones (which do static addressing on their own) are created. Fortunately, the design for intercepting those DHCP/ARP requests in viona should result in lower overhead than what's used for KVM.

sjorge commented Jan 18, 2018

(Slightly OT) The idea is to grab the RARP/DHCP bits when they pass through viona; instead of sending them to the network stack, bhyve (or some other daemon) handles them and sends a reply?

mgerdts commented Mar 6, 2018

I've updated the draft. Please take a fresh look.

https://github.com/joyent/rfd/blob/master/rfd/0121/README.md

sjorge commented Mar 7, 2018

@mgerdts I left a few comments on the actual commit, that seemed easier than copy/pasting here to add comments.

ptribble commented Mar 7, 2018

For zoneadm install -i, can you consider two additional formats:

  1. A simple tarball, for example one obtained via 'docker export', or available for download (e.g. https://cloud-images.ubuntu.com/ for Ubuntu or http://download.proxmox.com/images/system/ for Proxmox) - this is what I typically feed into an LX zone, and is currently used in OmniOS and Tribblix

  2. ACI format images, which have the advantage of being a simple standard, and are what I would like to transition to as an image format

Thanks!

citrus-it commented Mar 7, 2018

Ideally allowed-address should be supported. I'm not talking about the guest having access to this information via zone attrs, but just that zoneadmd will configure L3 protection on the VNIC when starting the zone.

The process of zoneadmd setting environment variables, and then boot using these to build the bhyve CLI arguments and storing them in a file to be picked up by zhyve, seems unnecessarily convoluted. Have you considered lofs-mounting the zone's XML file into the zone as lx does? It could then be parsed directly or via libzonecfg. Alternatively, the boot process is also able to access the zone config, removing one level of indirection. SmartOS's zoneadmd is already quite different from upstream's, which could be a barrier.

Zone root - as I read it, a bhyve zone will effectively need to be a sparse zone, and zoneadm -z xx install actually installs an image onto the virtual disk for the guest. SmartOS and Tribblix already have sparse zones and OmniOS will get them as of May (with r151026). It's not clear how you intend to handle this gap in the upstream gate.

It is possible to run an arbitrary number of bhyve instances in the global zone or non-global zones, subject to resource constraints.

With the recent zone-specific-data integration, isn't there now a limit of one bhyve per NGZ?

mgerdts commented Mar 7, 2018

@ptribble Container images (docker or ACI, assuming I found the right ACI) are great for containers, but problematic for virtual machines. In order to create a VM, we would need to run a compatible OS to perform the installation. It seems like the container images could be installed over the top of some base image (which includes the required kernel, boot loader, etc.).

It seems like what you are after is better implemented with custom media that does the right thing for the OS that is being installed, or with post-installation orchestration. If you disagree, please share a rough outline of how any illumos-derived installation would be able to reliably install an arbitrary container image that probably assumes a non-illumos kernel. In particular, how are appropriate partition tables, file systems, boot loaders, and kernels put in place?

mgerdts commented Mar 7, 2018

@citrus-it allowed-address is one that I waffled on for a while. The way that SmartOS does this now is with something along the lines of dladm set-linkprop protection=... to handle all of the anti-spoofing rules. I see that protection is missing from dladm(1M), so perhaps that's something else that needs to be upstreamed.

Maybe what makes sense is to allow allowed-address and append that to any protection specified by linkprop name=protection. Of course, if linkprop name=protection doesn't exist, allowed-address would still be applied.

Note that this would probably not make the first cut that goes into smartos-joyent, but it seems reasonable to implement before upstreaming to illumos.
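
For reference, the zonecfg side of this would be the existing allowed-address property on the net resource; the zone name, NIC, and address below are illustrative:

# zonecfg -z bhyve0
zonecfg:bhyve0> select net physical=net0
zonecfg:bhyve0:net> set allowed-address=10.0.0.1/32
zonecfg:bhyve0:net> end
zonecfg:bhyve0> commit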

ptribble commented Mar 7, 2018

@mgerdts Hm, I guess that I'm getting too close to direct install. I suspect this is something I would have to experiment with to see what does and doesn't work, rather than trying to design it up front.

@citrus-it

@mgerdts thanks for the reply. OmniOS certainly has protection settable with:

reaper# dladm set-linkprop -p protection=ip-nospoof -p allowed-ips=10.0.0.1/32 test0
reaper# dladm show-linkprop test0 | egrep 'allowed-ip|protect'
test0        protection      rw   ip-nospoof     --             mac-nospoof,
test0        allowed-ips     rw   10.0.0.1/32    --             --

but the automatic setting done by zoneadmd is convenient and expected - we use it for both native and lx zones.

mgerdts commented Mar 7, 2018

@citrus-it I agree that the environment variables followed by boot dumping stuff into the zone is hacky. I'm considering alternatives.

One in particular mimics what we did with kernel zones in Solaris. We modified zoneadmd so that it has brand-specific handlers that run in zoneadmd. This gave us great flexibility in ongoing interactions between zoneadmd and the equivalent of zhyve (kzhost in solaris-kz zones). The way it works there is:

  • zoneadmd makes the system call to boot the zone
  • The in-zone process is started by the kernel
  • That process creates a door server and waits for commands
  • zoneadmd connects to the door and sends it a boot command with all the required config
  • The return from the door call says whether the start of the guest succeeded. See OS-6717

FWIW, I skipped the part where logging is set up, a la OS-6718.
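
To make the shape of that handshake concrete, here is a minimal sketch of the in-zone door server. The command format, door path, and start_guest() are all hypothetical; the real protocol is private to the brand.

#include <door.h>
#include <stropts.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

extern int start_guest(void);	/* hypothetical: boots the bhyve guest */

/* Door server procedure: zoneadmd door_call()s commands such as "boot". */
static void
server(void *cookie, char *argp, size_t asz, door_desc_t *dp, uint_t ndesc)
{
	int rv = -1;

	if (argp != NULL && asz >= 4 && strncmp(argp, "boot", 4) == 0)
		rv = start_guest();

	/* The return value tells zoneadmd whether guest start succeeded. */
	(void) door_return((char *)&rv, sizeof (rv), NULL, 0);
}

int
main(void)
{
	int did = door_create(server, NULL, 0);

	/* Attach the door to a path zoneadmd can open; path is illustrative. */
	(void) fattach(did, "/var/run/zhyve_door");
	(void) pause();		/* service door invocations */
	return (0);
}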

There are other ways to deal with this too. I'm admittedly biased toward what we did with kernel zones, as that proved to be quite handy as we added suspend/resume, live reconfiguration, live migration, etc. Not only that, the in-process brand hooks made it so that a ton of stuff that is in ksh in other brands was able to be implemented in C (often reusing existing C code).

I've posted a very incomplete working copy of a future RFD.

mgerdts commented Mar 7, 2018

@citrus-it thanks for that - I wasn't aware of -p allowed-ips=.... Seems like this is a great thing to keep.

@citrus-it

The current zoneadmd will always set both protection and allowed-ips for an interface in an exclusive-IP zone which has an allowed-address property; allowed-ips always gets a /32 mask, however.
https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/zoneadmd/vplat.c#L2741
I think this would be a zero-cost thing to support for bhyve zones since zoneadmd is already doing the work.

sjorge commented Mar 8, 2018

@mgerdts I was going over the RFD again; it is unclear to me how (or if) it is possible to pick amd_hostbridge vs hostbridge for slot 0.

(E.g. some guests only support one of the two; OpenBSD, for example, needs amd_hostbridge.)

mgerdts commented Mar 8, 2018

@sjorge good point. I think that is probably best done with device (or pci, if we go that route). I'll be sure to address this in an upcoming draft.

@mgerdts mgerdts changed the title RFD 121: Discussion RFD 121 bhyve brand: Discussion Mar 28, 2018
sjorge commented May 31, 2018

Another typo snuck in: uefi-csi-rom.bin

liv3010m commented May 5, 2020

If it works, can we have the option of specifying "pci_slot": "x:y:z" in a VNIC definition inside "nics" when creating a bhyve VM, so we can have more than 8 functions assigned to virtual NICs?

sjorge commented May 5, 2020

If it works, can we have the option of specifying "pci_slot": "x:y:z" in a VNIC definition inside "nics" when creating a bhyve VM, so we can have more than 8 functions assigned to virtual NICs?

There is already a ticket for this work: https://smartos.org/bugview/OS-7458
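
For anyone landing here later, the payload shape being asked about would look something like this (field values are illustrative; see the ticket for the actual design):

"nics": [
  {
    "nic_tag": "admin",
    "ip": "dhcp",
    "model": "virtio",
    "pci_slot": "0:6:0"
  }
]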

liv3010m commented May 5, 2020

Thanks! Sorry for not seeing it.
