OpenEmbedded Xen Network Driver VM

I wrote about a similar topic what feels like ages ago and I guess it was (8 months is a long time in this business). Since then I’ve been throwing some spare time at this same problem and I’ve actually made measurable progress. There have been a number of serendipitous events that came together to make this possible, the most important of which is the massive update to the Xen recipe in meta-virtualization. With this it’s super easy to crank out a Xen pvops kernel so combining this with an image that has the right plumbing in place it’s not as hard as you might think to build an NDVM.

So armed with the new Xen stuff from meta-virtualization I set out to build a reference NDVM. This isn’t intended to replace the NDVM in a system like XenClient-XT which is far more sophisticated. It’s just intended for experimentation and I don’t intend to build anything more sophisticated than a dumb Ethernet bridge into it.

To host this I’ve started a layer I call ‘meta-integral’. I know, all the good names were taken. Anyways this is intended to be as sort of distro layer where I can experiment with Xen stuff. Currently I’ve got a distro config for dom0 and an NDVM. The dom0 work is still very much a work in progress but the NDVM (much simpler) will actually boot as a PV guest.

To build this just clone my git repo with the build scripts and it’ll do all of the hard work for you:

git clone https://github.com/flihp/oe-build-scripts.git
git checkout ndvm
./build.sh | tee build.log

This will crank out an image suitable to run on an Intel SandyBridge (SNB) system. I’ve only tested PV guests so you’ll have to set up a config like the following:

kernel = "/usr/lib/xen-common/bzImage-sugarbay.bin"
extra = "root=/dev/xvda console=hvc0"
iommu = "soft"
memory = "512"
name = "ndvm"
uuid = "a9ae8853-f1e9-41ca-9904-0e906efeb4af"
vcpus = "1"

disk = ['phy:/dev/loop0,xvda,w']
pci = ['0000:04:00.0']

Notice the kernel image and the rootfs image must be copied over to the Xen dom0 that you want to test the NDVM on. The image is listed in the kernel line and this can be found at tmp-eglibc/deploy/images/sugarbay/bzImage-sugarbay.bin relative to your build root. The image will be in the same directory and called something like integral-image-ndvm-sugarbay.ext3. Notice that the disk config is pointing at a loopback. You’ll have to set this up with losetup just like any other loopback device. The part that differentiates this from any other PV guest is that we’re passing a PCI network device through to it and it’ll offer up a bridge to other guest VMs. The definitive documentation on how to do this with Xen is here: http://wiki.xen.org/wiki/Xen_PCI_Passthrough

The bit that I had to wrangle to get the bridge set up properly with OE was the integration between a network interfaces file and the bridge. I’ve been spoiled by Debian and it’s seamless integration between the two. OE has no such niceties. In this situation I had to chose between hacking a script manually or finding the scripts that integrate interfaces configuration with the bridge and baking that into the bridge-utils package from meta-oe. I figured getting bridges integrated with interfaces would be useful to others so I went through the Debian source package, extracted the scripts and baked them into OE directly. Likely this should go ‘upstream’ but for now this specialization is just sitting in my meta-integral layer.

So after fixing up the bridge-utils package so it plays nice with the interfaces file, the interfaces in our NDVM looks like so:

# /etc/network/interfaces -- configuration file for ifup(8), ifdown(8)
 
# The loopback interface
auto lo
iface lo inet loopback

# real interface
auto eth0
iface eth0 inet manual

# xen bridge
auto xenbr0
iface xenbr0 inet manual
        bridge_ports eth0
        bridge_stp off
        bridge_waitport 0
        bridge_fw 0

So that’s it. Boot up this NDVM and it’ll have a physical network device and a bridge ready for consumption by other guests. I’ve not yet gone through and tested adding additional guests to the bridge so I’m assuming there’s still a bit of work lurking there. I’ll give this last bit a go and hopefully have positive results to post sooner than later. I’ve also not tested this on XenClient-XT as the most recent stable release is getting a bit old and likely there’s going to be incompatibilities between netfront / back stuff. This approach however is likely a great starting point if you’re building a service VM you want to run on our next release of XT though so feel free to fork and experiment.

UPDATE: Gave my NDVM a test just by giving the dom0 that was hosting it a vif. You can do this like so:

# xl network-attach Domain-0 backend=ndvm

The above assumes your NDVM has been named ‘ndvm’ in it’s VM config naturally. Anyways this will pop up a vif in dom0 backed by the NDVM. Pretty slick IMHO. Now to wrap this whole thing up so dom0 and the NDVM can be built as a single image with OE … Sounds easy enough 🙂

Talk at Xen Developer Summit 2013

UPDATE: Here’s the link: http://www.youtube.com/watch?v=6Q8mlTBn-ZI. I still haven’t been able to bring myself to actually watch it but I’m sure it’s great :

Just got back from the 2013 Xen Developer Summit where I gave a talk on a few interesting (to me at least) things. If you’re interested you can find my abstract here. My focus was naturally on SELinux / XSM stuff. Mostly my talk focused on the sVirt implementation in XenClient XT and another fun application of the architecture to our management stuff.

Had good chat with a guy from Amazon afterward about all of the other evil stuff someone could do if they compromised QEMU. So while sVirt prevents the specific scenario presented I’ve no doubt there are other hazards. He was specifically concerned over the Xen privcmd driver & the hypercalls it could make. Hard to disagree as QEMU with root permissions in dom0 can execute any hypercall it wants. The only way to address this (other than stubdoms) is to deprivilege QEMU to prevent it from making hypercalls. That would probably require some code-changes in QEMU so it’s no small task.

I also touched briefly on the design for an inter-VM communication (IVC) mechanism that was floated to xen-devel this summer. In XT we have an IVC called ‘V4V’ that isn’t acceptable to upstream. When it came to our XSM policy however V4V had some favorable properties in that we created a new object in the hypervisor that was a ‘first-class’ object in the policy.

The proposal uses the same model as the front/back drivers so there would be no new object specific to the IVC. This means there wouldn’t be way to differentiate the IVC from any other front/back driver. The purpose of the talk was to point this out and hopefully solicit some discussion. Got an even better conversation going on this point so hopefully I’ll have some fun stuff to report on this front soon.

Using OE to build an XT ‘Service VM’

UPDATE: I’ve deleted the build scripts git repo mentioned in this post and rolled all of my OE project build scripts into one repo. Find it here: git://github.com/flihp/oe-build-scripts.git

UPDATE #2: I’ve written a more up-to-date post on similar work here: http://twobit.us/blog/2013/11/openembedded-xen-network-driver-vm/. The data in this post should be considered out of date.

Over the past few weeks I’ve run into a few misconceptions about XenClient XT and OpenEmbedded. First is that XT is some sort of magical system that mere mortals can’t customize. Second is that building a special-purpose, super small Linux image on OpenEmbedded is an insurmountable task. This post is an attempt to dispel both of these misconceptions and maybe even motivate some fun work in the process.

Don’t get me wrong though, this isn’t a trival task and I didn’t start and end this work in one night. There’s still a bunch of work to do here. I’ll lay that out at the end though. For now, I’ve put up the build scripts I threw together last night on git hub. They’re super minimal and derived from another project. Get them here: https://github.com/flihp/transbridge-build-scripts

My goal here is to build a simple rootfs that XT can boot with a VM ‘type’ of ‘servicevm’. This is the type that the XT toolstack associates with the default ‘Network’ VM. Basically it will be a VM invisible to the user. Eventually I’d like for this example to be useful as a transparent network bridge suitable as an in-line filter or even as a ‘driver domain’. But let’s not get ahead of ourselves …

What image do I build?

The first thing that you need to chose to do when building an image with OE is what MACHINE you’re building for. XT uses Xen for virtualization so whatever platform you’re running it on will dictate the MACHINE. Since XT only runs on Intel hardware it’s pretty safe to assume you’re system is compatible with generic i586. The basic qemux86 MACHINE that’s in the oe-core layer builds for this so for these purposes it’ll suffice. This is already set up in the local.conf in my build scripts.

To build the minimal core image that’s in the oe-core layer just run my build.sh script from the root of the repository. I like to tee the output to a log file for inspection in the event of a failure:

./build.sh | tee build.log

Now you should have a bunch of new stuff in ./tmp-eglibc/deploy/images/ which includes an ext3 rootfs. The file name should be something like core-image-minimal.ext3. Copy this over to your XT dom0 and get ready to build a VM.

Make a VHD

The next thing to do is copy the ext3 image over to a VHD. From within /storage/disks create a new VHD large enough to hold the image. I’ve experimented with both core-image-basic and core-image-minimal and a 100M VHD will be large enough … yes that’s a very small rootfs. core-image-minimal is around 9M:

cd /storage/disks
vhd-util create -n transbridge.vhd -s 100

Next have tap-ctl create a new device node for the VHD:

tap-ctl create -a vhd:/storage/disks/transbridge.vhd

This will output the path to the device node created (and yeah the weird command syntax bugs me too). You can alternatively list the current blktap devices and find yours there:

tap-ctl list
1276    0    0        vhd /storage/ndvm/ndvm.vhd
1281    1    0        vhd /storage/ndvm/ndvm-swap.vhd
...

I’ve got no idea what the first number is (maybe PID of the blktap instance that’s backing the device?) but the 2nd and 3rd numbers are the major / minor for the device. The last column is the VHD file backing the device so find your VHD there and the major number then find the device in /dev/xen/blktap-2/tapdevX … mine had a major number of ‘8’ so that’s what I’ll use in this example. Then just byte copy your ext3 on to this device:

dd if=/storage/disks/transbridge.ext3 of=/dev/xen/blktap-2/tapdev8

Then you can mount your VHD in dom0 to poke around:

mount /dev/xen/blktap-2/tapdev8 /media/hdd

Where’s my kernel?

Yeah so OE doesn’t put a kernel on the rootfs for qemu machines. That’s part of why the core-image-minimal image is so damn small. QEMU doesn’t actually boot like regular hardware so you actually pass it the kernel on the command line so OE’s doing the right thing here. If you want the kernel from the OE build it’ll be in ./tmp-eglibc/deploy/images/ with the images … but it won’t boot on XT 😦

This is a kernel configuration thing. I could have spent a few days creating a new meta layer and customizing the Yocto kernel to get a ‘Xen-ified’ image but taht sounds like a lot of work. I’m happy for this to be quick and dirty for the time being so I just stole the kernel image from the XT ‘Network’ VM to see if I could get my VM booting.

You can do this too by first mounting the Network VMs rootfs. Cool thing is you don’t need to power down the Network VM to mount it’s FS in dom0! The disk is exposed to the Network VM as a read-only device so you can mount it read only in dom0:

mount /dev/xen/blktap-2/tapdev0 /media/cf

Then just copy the kernel and modules over to your new rootfs and set up some symlinks to the kernel image so it’s easy to find:

cp /media/cf/boot/vmlinuz-2.6.32.12-0.7.1 /media/hdd/boot
cp -R /media/cf/lib/modules/2.6.32.12-0.7.1 /media/hdd/lib/modules
cd /media/hdd/boot
ln -s vmlinuz-2.6.32.12-0.7.1 vmlinuz
cd /media/hdd
ln -s ./boot/vmlinuz-2.6.32.12-0.7.1 vmlinuz

You may find that there isn’t enough space on the ext3 image you copied on to the VHD. Remember that the ext3 image is only as large as the disk image created by OE. It’s size won’t be the same as the VHD you created unless you resize it to fill the full VHD. You can do so by first umount’ing the tapdev then running resize2fs on the tapdev:

umount /media/hdd
resize2fs /dev/xen/blktap-2/tapdev8

This will make the file system on the VHD expand to fill the full virtual disk. If you made your VHD large enough you’ll have enough space for the kernel and modules. Like I say above, 100M is a safe number but you can go smaller.

Finally you’ll want to be able to log into your VM. If you picked the minimal image it won’t have ssh or anything so you’ll need a getty listening on the xen console device. Add the following line to your inittab:

echo -e "nX:2345:respawn:/sbin/getty 115200 xvc0" >> /media/hdd/etc/inittab

There’s also gonna be a default getty trying to attach to ttyS0 which isn’t present. When the VM is up it will cause some messages on the console:

respawning too fast: disabled for 5 minutes

You can disable this by removing the ‘S’ entry in inittab but really the proper solution is a new image with a proper inittab for an XT service VM … I’ll get there eventually.

Make it a VM

Up till now all we’ve got is a VHD file. Without a VM to run it nothing interesting is gonna happen though so now we make one. The XT toolstack isn’t documented to the point where someone can just read the man page. But it will tell you a lot about itself if you just run it with out any parameters. Honestly I know very little about our toolstack so I’m always executing xec and grepping though the output.

After some experimentation here are the commands to create a new Linux VM from the provided template and modify it to be a para-virtualized service VM. In retrospect it may be better to use the ‘ndvm’ template but this is how I did it for better or for worse:

xec create-vm-with-template new-vm-linux

This command will output a path to the VM node in the XT configuration database. The name of the VM will also be something crazy. Ge the name from the output of xec-vm and change it to something sensible like ‘minimal’:

xec-vm --name  set name minimal

Your VM will also get a virutal CD-ROM which we don’t want so delete it and then add a disk for VHD we configured:

xec-vm --name minimal --disk 0 delete
xec-vm --name minimal add-disk
xec-vm --name minimal --disk 0 set phys-path /storage/disks/minimal.vhd

Then set all of the VM properties per the instructions provided in the XT Developer Guide:

xec-vm --name minimal --disk 0 set virt-path xvda
xec-vm --name minimal set flask-label "system_u:system_r:nilfvm_t"
xec-vm --name minimal set stubdom false
xec-vm --name minimal set hvm false
xec-vm --name minimal set qemu-dm-path ""
xec-vm --name minimal set slot -1
xec-vm --name minimal set type servicevm
xec-vm --name minimal set kernel /tmp/minimal-vmlinuz
xec-vm --name minimal set kernel-extract /vmlinuz
xec-vm --name minimal set cmd-line "root=/dev/xvda xencons=xvc0 console=xvc0 rw"

Then all that’s left is booting your new minimal VM:

xec-vm --name minimal start

You can then connect to the dom0 end of the Xen serial device to log into your VM:

screen $(xenstore-read /local/domain/$(xec-vm --name minimal get domid)/console/tty))

Next steps

This is a pretty rough set of instructions but it will produce a bootable VM on XenClient XT from a very small OpenEmbedded core-image-minimal. There’s tons of places this can be cleaned up starting with an kernel that’s specific to a Xen domU. A real OE DISTRO would be another welcome addition so various distro specific features could be added and removed more easily. If the lazywebs feel like contributing some OE skills to this effort leave me a comment.

Chrome web sandbox on XenClient

There’s lots of software out there that sets up a “sandbox” to protect your system from untrusted code. The examples that come to mind are Chrome and Adobe for the flash sandbox. The strength of these sandboxes are an interesting point of discussion. Strength is always related to the mechanism and if you’re running on Windows the separation guarantees you get are only as strong as the separation Windows affords to processes. If this is a strong enough guarantee for you then you probably won’t find this post very useful. If you’re interested in using XenClient and the Xen hypervisor to get yourself the strongest separation that I can think of, then read on!

Use Case

XenClient allows you to run any number of operating systems on a single piece of hardware. In my case this is a laptop. I’ve got two VMs: my work desktop (Windows 7) for email and other work stuff and my development system that runs Debian testing (Wheezy as of now).

Long story short, I don’t trust some of the crap that’s out on the web to run on either of these systems. I’d like to confine my web browsing to a separate VM to protect my company’s data and my development system. This article will show you how to build a bare bones Linux VM that runs a web browser (Chromium) and little more.

Setup

You’ll need a linux VM to host your web browser. I like Debian Wheezy since the PV xen drivers for network and disk work out of the box on XenClient (2.1). There’s a small bug that required you use LVM for your rootfs but I typically do that anyways so no worries there.

Typically I do an install omitting even the “standard system tools” to keep things as small as possible. This results in a root file system that’s < 1G. All you need to do then is install the web browser (chromium), rungetty, and the xinint package. Next is a bit of scripting and some minor configuration changes.

inittab

When this VM boots we want the web browser to launch and run full screen. We don’t want a window manager or anything. Just the browser.

When Linux boots, the init process parses the /etc/inittab file. One of the things specified in inittab are processes that init starts like getty. Typically inittab starts getty‘s on 6 ttys but we want it to start chrome for us. We can do this by having init execute rungetty (read the man page!) which we can then have execute arbitrary commands for us:

# /sbin/getty invocations for the runlevels.
#
# The "id" field MUST be the same as the last
# characters of the device (after "tty").
#
# Format:
#  :::
#
# Note that on most Debian systems tty7 is used by the X Window System,
# so if you want to add more getty's go ahead but skip tty7 if you run X.
#
1:2345:respawn:/sbin/getty 38400 tty1
2:23:respawn:/sbin/getty 38400 tty2
3:23:respawn:/sbin/getty 38400 tty3
4:23:respawn:/sbin/getty 38400 tty4
5:23:respawn:/sbin/getty 38400 tty5
6:23:respawn:/sbin/rungetty tty6 -u root /usr/sbin/chrome-restore.sh

Another configuration change you’ll have to make is in /etc/X11/Xwrapper.config. The default configuration in this file prevents users from starting X if their controlling TTY isn’t a virtual console. Since we’re kicking off chromium directly we need to relax this restriction:

allowed_users=anybody

chromium-restore script

Notice that we have rungetty execute a script for us and it does so as the root user. We don’t want chromium running as root but we need to do some set-up before we kick off chromium as an unprivileged user. Here’s the chrome-restore.sh script:

#!/bin/sh

USER=chromium
HOMEDIR=/home/${USER}
HOMESAFE=/usr/share/${USER}-clean
CONFIG=${HOMEDIR}/.config/chromium/Default/Preferences
LAUNCH=$(which chromium-launch.sh)
if [ ! -x "${LAUNCH}" ]; then
	echo "web-launch.sh not executable: ${LAUNCH}"
	exit 1
fi
CMD="${LAUNCH} ${CONFIG}"

rsync -avh --delete ${HOMESAFE}/ ${HOMEDIR}/ > /dev/null 2>&1
chown -R ${USER}:${USER} ${HOMEDIR}

/bin/su - -- ${USER} -l -c "STARTUP="${CMD}" startx" < /dev/null
shutdown -Ph now

The first part of this script is setting up the home directory for the user (chromium) that will be running chromium. This is the equivalent of us restoring the users home directory to a “known good state”. This means that the directory located at /usr/share/chromium-clean is a “known good” home directory for us to start from. On my system it’s basically an empty directory with chrome’s default config.

The second part of the script, well really the last two lines just runs startx as an unprivileged user. startx kicks off the X server but first we set a variable STARTUP to be the name of another script: chromium-launch.sh. When this variable is set, startx runs the command from the variable after the X server is started. This is a convenient way to kick off an X server that runs just a single graphical application.

The last command shuts down the VM. The shutdown command will only be run after the X server terminates which will happen once the chromium process terminates. This means that once the last browser tab is closed the VM will shutdown.

chromium-launch script

The chromium-launch.sh script looks like this:

#!/bin/sh

CONFIG=$1
if [ ! -f "${CONFIG}" ]; then
	echo "cannot locate CONFIG: ${CONFIG}"
	exit 1
fi

LINE=$(xrandr -q 2> /dev/null | grep Screen)
WIDTH=$(echo ${LINE} | awk '{ print $8 }')
HEIGHT=$(echo ${LINE} | awk '{ print $10 }' | tr -d ',')

sed -i -e "s&(s+"bottom":s+)-?[0-9]+&1${HEIGHT}&" ${CONFIG}
sed -i -e "s&(s+"left":s+)-?[0-9]+&10&" ${CONFIG}
sed -i -e "s&(s+"right":s+)-?[0-9]+&1${WIDTH}&" ${CONFIG}
sed -i -e "s&(s+"top":s+)-?[0-9]+&10&" ${CONFIG}
sed -i -e "s&(s+"work_area_bottom":s+)-?[0-9]+&1${HEIGHT}&" ${CONFIG}
sed -i -e "s&(s+"work_area_left":s+)-?[0-9]+&10&" ${CONFIG}
sed -i -e "s&(s+"work_area_right":s+)-?[0-9]+&1${WIDTH}&" ${CONFIG}
sed -i -e "s&(s+"work_area_top":s+)-?[0-9]+&10&" ${CONFIG}

chromium

It’s a pretty simple script. It takes one parameter which is the path to the main chromium config file. It query’s the X server through xrandr to get the screen dimensions (WIDTH and HEIGHT) which means it must be run after the X server starts. It then re-writes the relevant values in the config file to the maximum screen width and height so the browser is run “full screen”. Pretty simple stuff … once you figure out the proper order to do things and the format of the Preferences file which was non-trivial.

User Homedir

The other hard part is creating the “known good” home directory for your unprivileged user. What I did was start up chromium once manually. This causes the standard chromium configuration to be generated with default values. I then copied this off to /usr/share to be extracted on each boot.

Conclusion

So hopefully these instructions are enough to get you a Linux system that boots and runs Chromium as an unprivileged user. It should restore that users home directory to a known good state on each boot so that any downloaded data will be wiped clean. When the last browser tab is closed it will power off the system.

I use this on my XenClient XT system for browsing sites that I want to keep separate from my other VMs. It’s not perfect though and as always there is more that can be done to secure it. I’d start by making the root file system read only and adding SELinux would be fun. Also the interface is far too minimal. Finding a way to handle edge cases like making pop-ups manageable and allowing us to do things like control volume levels would also be nice. This may require configuring a minimal window manager which is a pretty daunting task. If you have any other interesting ways to make this VM more usable or lock it down better you should leave them in the comments.

Linux bridge forward EAPOL 8021x frames

XenClient is no different from other Xen configurations in that the networking hardware is shared between guests through a bridge hosted in dom0 (or a network driver domain in the case of XenClient XT). For most use cases the standard Linux bridge will route your traffic as expected. We ran into an interesting problem however when a customer doing a pilot on XenClient XT tried to authenticate their guest VMs using EAPOL (8021x auth over ethernet). The bridge gobbled up their packets and we got some pretty strange bug reports as a result.

Just throwing “linux bridge EAPOL 8021x” into a search engine will return a number of hits from various mailing lists where users report similar issues. The fix is literally a one line change that drops a check on the destination MAC address. This check is to ensure compliance with the 8021d standard which requires layer 2 bridges to drop packets from the “bridge filter MAC address group“. Since XenClient is a commercial product and the fix is in code that is responsible for our guest networking (which is pretty important) we wanted to code up a way to selectively enable this feature on a per-bridge basis using a sysfs node. We / I also tested the hell out of it for a few days straight.

The end result is a neat little patch that allows users to selectively pass EAPOL packets from their guests across the layer 2 bridge in dom0 / ndvm and out to their authentication infrastructure. The patch is available opensource just like the kernel and is available on the XenClient source CD. It’s also available here for your convenience 🙂