OpenXT: Contingencies Abandoned

This post is long overdue. I’ve been experimenting with using OE as a means of building measured systems for a while now. Back before OpenXT became a reality I was hedging bets and working on some overlapping tech in parallel. Now that OpenXT is available as OSS I’m going to abandon some of this work and shift focus to OpenXT. I do however feel like there should be some record of the work that was done and some explanation as to why I did it and how it relates to OpenXT.

Here goes …

Building systems with security properties

All of this nonsense began with some experimentation in using OE as a means to build measured systems. For some reason I think that a sound integrity measurement architecture is necessary for the construction of software systems with meaningful security properties. All of the necessary component parts were available as open source but there were few examples showing how they could be integrated into a functional whole. Those that did were generally research prototypes and weren’t maintained actively (need references). My work on meta-measured was the first step in my attempt to make the construction of an example measured system public and easily buildable.

Building a measured system with the Xen hypervisor as a primary component was another part of this work. I wasn’t using virtualization for the sake of virtualization though. Xen was a means to an end: its architecture allows for system partitioning in part through the Isolated Driver Domain model like the example I describe here. The NDVM is just the “low hanging fruit” here but it serves as a good example of how OE can be used to build very small Linux VMs that can serve as example device domains. There are ways to build smaller IDDs but IMHO a Linux image < 100MB is probably the point of diminishing returns currently. Hopefully in the future this will no longer be the case and we'll have IDDs based on unikernels or even smaller things.

Small, single purpose systems are important in that they allow us to extend the integrity measurement architecture into more fine-grained system components. Ideally these IDDs can be restarted so that the integrity state of the system can be refreshed on a periodic basis. By disaggregating the Xen dom0 we increase the granularity of our measurements from 1 (dom0) to 1 + the number of disaggregated components. By restarting and remeasuring these components we provide better "freshness" properties for systems that remain on-line for long periods of time.

This of course is all built on the initial root of trust established by hardware and the software components in meta-measured. Disaggregation on the scale of previously published academic work is the end goal though, with the function of dom0 reduced to domain construction.

The final piece of this work is to use the available mandatory access control mechanisms to restrict the interactions between disaggregated components. We get this by using the XSM and the reference policy from Xen. Further, there will always be cases where it’s either impossible or impractical to decompose some functions into separate VMs. In these cases the use of the SELinux MAC policy within Linux guests is necessary.

The Plan

So my plan went something like this: Construct OE layers for individual components. Provide reference images for independent test. One of these layers will be a “distro” where the other components can be integrated to become the final product. This ended up taking the form of the following meta layers:

  • meta-measured: boot time measurements using the D-RTM method and TPM utilities
  • meta-virtualization: recipes to build the Xen hypervisor and XSM policy
  • meta-selinux: recipes to build SELinux toolstack and MAC policy
  • meta-integral: distro layer to build platform and service VM images

Some of these meta layers provide a lot more functionality than the description given but I only list the bits that are relevant here.

Where I left off

I had made significant progress on this front but never really finished and didn’t write about the work as a whole. It’s been up on Github in a layer called ‘meta-integral’ (I know, all the good names were taken) for a while now and the last time I built it (~5 months ago) it produced a Xen dom0 and an NDVM that boots and runs guests. The hardest work was still ahead: I hadn’t yet integrated SELinux into dom0 and the NDVM, XSM was buildable but again not yet integrated, and the bulk of disaggregating dom0 hadn’t even begun.

This work was a contingency though. When I began working on this there had been no progress made or even discussion of getting OpenXT released as OSS. This side project was an outlet for work that I believe needs to be done in the open so that the few of us who think this is important could some day collaborate in a meaningful way. Now that OpenXT is a real thing I believe that this collaboration should happen there.

Enter OpenXT: aka the Future

Now that OpenXT is a reality the need for a distro layer to tie all of this together has largely gone away. The need for ‘meta-integral’ is no more and I’ll probably pull it down off of Github in the near future. The components and OE meta layers that I’ve enumerated above are all still necessary though. As far as my designs are concerned OpenXT will take over only as the distro and this means eliminating a lot of duplication.

In a world where OpenXT hadn’t made it out as OSS I would have had the luxury of building the distro from scratch and keeping it very much in line with the upstream components. But that’s not how it happened (a good thing) so things are a bit different. The bulk of the work that needs to be done for the project to gain momentum now is disentangling these components so that they can be developed in parallel with limited dependencies.

Specifically we duplicate recipes that are upstream in meta-virtualization, meta-selinux and meta-measured. To be fair, OpenXT actually had a lot of these recipes first but there was never any focus on upstreaming them. Eventually someone else duplicated this work in the open source and now we must pay off this technical debt and bring ourselves in-line with the upstream that has formed despite us.

What’s next?

So my work on meta-integral is over before it really started. No tragedy there but I am a bit sad that it never really got off the ground. OpenXT is the future of this work however so goal number one is getting that off the ground.

More to come on that front soon …

Building HVM Xen guests

On my Xen systems I’ve run pretty much 99% of my Linux guests paravirtualized (PV). Mostly this was because I’m lazy. Setting up a PV guest is super simple. No need for partitions, boot loaders or any of that complicated stuff. Setting up a PV Linux guest is generally as simple as setting up a chroot. You don’t even need to install a kernel.

There’s been a lot of work over the past 5+ years to add stuff to processors and Xen to make the PV extensions to Linux unnecessary. After checking out a presentation by Stefano Stabellini a few weeks back I decided I’m long overdue for some HVM learning. Since performance of HVM guests is now better than PV for most cases it’s well worth the effort.

This post will serve as my documentation for setting up HVM Linux guests. My goal was to get an HVM Linux installed using typical Linux tools and methods like LVM and chroots. I explicitly was trying to avoid using RDP or anything that isn’t a command-line utility. I wasn’t completely successful at this but hopefully I’ll figure it out in the next few days and post an update.

Disks and Partitions

Like every good Linux user, I consider LVM my friend. I’d love a more flexible disk backend (something that could be sparsely populated) but blktap2 is pretty much unmaintained these days. I’ll stop before I fall down that rabbit hole but long story short, I’m using LVMs to back my guests.

There’s a million ways to partition a disk. Generally my VMs are single-purpose and simple so a simple partitioning scheme is all I need. I haven’t bothered with extended partitions as I only need 3. The layout I’m using is best described by the output of sfdisk:

# partition table of /dev/mapper/myvg-hvmdisk
unit: sectors

/dev/mapper/myvg-hvmdisk1 : start=     2048, size=  2097152, Id=83
/dev/mapper/myvg-hvmdisk2 : start=  2099200, size=  2097152, Id=82
/dev/mapper/myvg-hvmdisk3 : start=  4196352, size= 16775168, Id=83
/dev/mapper/myvg-hvmdisk4 : start=        0, size=        0, Id= 0
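The sector numbers in that table are easy to sanity-check with a little shell arithmetic. This is only a minimal sketch: it assumes 512-byte sectors (check your own device), and sectors_to_mib and end_sector are hypothetical helper names made up for the illustration.

```shell
#!/bin/sh
# Assumption: 512-byte sectors, matching the "unit: sectors" table above.
SECTOR_BYTES=512

# sectors_to_mib SECTORS: print the size in MiB (hypothetical helper)
sectors_to_mib() {
    echo $(( $1 * SECTOR_BYTES / 1048576 ))
}

# end_sector START SIZE: print the first sector after the partition (hypothetical helper)
end_sector() {
    echo $(( $1 + $2 ))
}

sectors_to_mib 2097152        # partition 1 -> 1024
sectors_to_mib 2097152        # partition 2 -> 1024
sectors_to_mib 16775168       # partition 3 -> 8191
end_sector 2048 2097152       # -> 2099200, where partition 2 starts
end_sector 2099200 2097152    # -> 4196352, where partition 3 starts
```

Each partition starts exactly where the previous one ends, and the last one ends at sector 20971520, which is the end of a 10 GiB volume under the same 512-byte assumption.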

That’s 3 partitions, the first for /boot, the second for swap and the third for the rootfs. Pretty simple. Once the partition table is written to the LVM volume we need to get the kernel to read the new partition table to create devices for these partitions. This can be done with either the partprobe command or kpartx. I went with kpartx:

$ kpartx -a /dev/mapper/myvg-hvmdisk

After this you’ll have the necessary device nodes for all of your partitions. If you use kpartx as I have these device files will have a digit appended to them like the output of sfdisk above. If you use partprobe they’ll have the letter ‘p’ and a digit for the partition number. Other than that I don’t know that there’s a difference between the two methods.

Then get udev to refresh the links in /dev/disk/by-uuid (we’ll use these later):

$ udevadm trigger

Now we can set up the filesystems we need:

$ mkfs.ext2 /dev/mapper/myvg-hvmdisk1
$ mkswap /dev/mapper/myvg-hvmdisk2
$ mkfs.ext4 /dev/mapper/myvg-hvmdisk3

Install Linux

Installing Linux on these partitions is just like setting up any other chroot. First step is mounting everything. The following script fragment does that:

# mount VM disks (partitions in new LV)
if [ ! -d /media/hdd0 ]; then mkdir /media/hdd0; fi
mount /dev/mapper/myvg-hvmdisk3 /media/hdd0
if [ ! -d /media/hdd0/boot ]; then mkdir /media/hdd0/boot; fi
mount /dev/mapper/myvg-hvmdisk1 /media/hdd0/boot

# bind dev/proc/sys/tmpfs file systems from the host
if [ ! -d /media/hdd0/proc ]; then mkdir /media/hdd0/proc; fi
mount --bind /proc /media/hdd0/proc
if [ ! -d /media/hdd0/sys ]; then mkdir /media/hdd0/sys; fi
mount --bind /sys /media/hdd0/sys
if [ ! -d /media/hdd0/dev ]; then mkdir /media/hdd0/dev; fi
mount --bind /dev /media/hdd0/dev
if [ ! -d /media/hdd0/run ]; then mkdir /media/hdd0/run; fi
mount --bind /run /media/hdd0/run
if [ ! -d /media/hdd0/run/lock ]; then mkdir /media/hdd0/run/lock; fi
mount --bind /run/lock /media/hdd0/run/lock
if [ ! -d /media/hdd0/dev/pts ]; then mkdir /media/hdd0/dev/pts; fi
mount --bind /dev/pts /media/hdd0/dev/pts

Now that all of the mounts are in place we can debootstrap an install into the chroot:

$ sudo debootstrap wheezy /media/hdd0/

We can then chroot to the mountpoint for our new VM’s rootfs and put on the finishing touches:

$ chroot /media/hdd0


Unlike a PV guest, you’ll need a bootloader to get your HVM up and running. A first step in getting the bootloader installed is figuring out which disk will be mounted and where. This requires setting up your fstab file.

At this point we start to run into some awkward differences between our chroot and what our guest VM will look like once it’s booted. Our chroot reflects the device layout of the host on which we’re building the VM. This means that the device names for these disks will be different once the VM boots. On our host they’re all under the LVM /dev/mapper/myvg-hvmdisk and once the VM boots they’ll be something like /dev/xvda.

The easiest way to deal with this is to set our fstab up using UUIDs. This would look something like this:

# / was on /dev/xvda3 during installation
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /               ext4    errors=remount-ro 0       1
# /boot was on /dev/xvda1 during installation
UUID=yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy /boot           ext2    defaults        0       2
# swap was on /dev/xvda2 during installation
UUID=zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz none            swap    sw              0       0

By using UUIDs we can make our fstab accurate even in our chroot.
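One way to collect the UUIDs for an fstab like this is to ask blkid on the host. What follows is a minimal sketch rather than the exact procedure used for this VM: uuid_of and fstab_line are hypothetical helper names, and the device paths in the usage comments assume the kpartx naming from earlier.

```shell
#!/bin/sh
# uuid_of DEVICE: print the filesystem (or swap) UUID as reported by blkid.
# Needs to run on the host, usually as root. (hypothetical helper)
uuid_of() {
    blkid -s UUID -o value "$1"
}

# fstab_line UUID MOUNTPOINT FSTYPE OPTIONS DUMP PASS
# Print one fstab entry keyed by UUID instead of a device name. (hypothetical helper)
fstab_line() {
    printf 'UUID=%s %s %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example usage, writing the three entries into the chroot's fstab:
#   fstab_line "$(uuid_of /dev/mapper/myvg-hvmdisk3)" /     ext4 errors=remount-ro 0 1
#   fstab_line "$(uuid_of /dev/mapper/myvg-hvmdisk1)" /boot ext2 defaults          0 2
#   fstab_line "$(uuid_of /dev/mapper/myvg-hvmdisk2)" none  swap sw                0 0
```

Because the UUID lives in the filesystem itself, the same entry is valid both on the host and inside the booted guest.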

After this we need to set up the /etc/mtab file needed by lots of Linux utilities. I found that when installing Grub2 I needed this file in place and accurate.

Some data I’ve found on the web says to just copy or link the mtab file from the host into the chroot but this is wrong. If a utility consults this file to find the device file that’s mounted as the rootfs it will find the device holding the rootfs for the host, not the device that contains the rootfs for our chroot.

The way I made this file was to copy it off of the host where I’m building the guest VM and then modify it for the guest. Again I’m using UUIDs to identify the disks / partitions for the rootfs and /boot to keep from having data specific to the host platform leak into the guest. My final /etc/mtab looks like this:

rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=10240k,nr_inodes=253371,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=203892k,mode=755 0 0
/dev/disk/by-uuid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx / ext4 rw,relatime,errors=remount-ro,user_xattr,barrier=1,data=ordered 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
tmpfs /run/shm tmpfs rw,nosuid,nodev,noexec,relatime,size=617480k 0 0
/dev/disk/by-uuid/yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy /boot ext2 rw,relatime,errors=continue,user_xattr,acl 0 0

Finally we need to install both a kernel and the grub2 bootloader:

$ apt-get install linux-image-amd64 grub2

Installing Grub2 is a pain. All of the additional disks kicking around in my host confused the hell out of the grub installer scripts. I was given the option to install grub on a number of these disks and none were the one I wanted to install it on.

In the end I had to select the option to not install grub on any disk and fall back to installing it by hand:

$ grub-install --force --no-floppy --boot-directory=/boot /dev/disk/by-uuid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

And then generate the grub config file:


If all goes well the grub boot loader should now be installed on your disk and you should have a grub config file in your chroot /boot directory.

Final Fixups

Finally you’ll need to log into the VM. If you’re confident it will boot without you having to do any debugging then you can just configure the ssh server to start up and throw a public key in the root homedir. If you’re like me something will go wrong and you’ll need some boot logs to help you debug. I like enabling the serial emulation provided by qemu for this purpose. It’ll also allow you to login over serial which is convenient.

This is pretty standard stuff. No paravirtual console through the xen console driver. The qemu emulated serial console will show up at ttyS0 like any physical serial hardware. You can enable serial interaction with grub by adding the following fragment to /etc/default/grub:

GRUB_SERIAL_COMMAND="serial --speed=38400 --unit=0 --word=8 --parity=no --stop=1"

To get your kernel to log to the serial console as well set the GRUB_CMDLINE_LINUX variable thusly:

GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,38400n8"

Finally to get init to start a getty with a login prompt on the console add the following to your /etc/inittab:

T0:23:respawn:/sbin/getty -L ttyS0 38400 vt100

Stefano Stabellini has done another good write-up on the details of using both the PV and the emulated serial console. Give it a read for the gory details.

Once this is all done you need to exit the chroot, unmount all of those bind mounts and then unmount your boot and rootfs from the chroot directory. Once we have a VM config file created this VM should be bootable.
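The teardown is just the mount script run backwards. Here is a minimal sketch, assuming the same /media/hdd0 mount point and the same set of bind mounts as the script earlier; umount_order and teardown are hypothetical helper names.

```shell
#!/bin/sh
# umount_order ROOT: print the mount points deepest-first, i.e. the reverse
# of the order in which the earlier script mounted them. (hypothetical helper)
umount_order() {
    for sub in dev/pts run/lock run dev sys proc boot; do
        echo "$1/$sub"
    done
    echo "$1"
}

# teardown ROOT: unmount everything, stopping at the first failure so a busy
# mount is noticed instead of silently skipped. Run as root, outside the chroot.
teardown() {
    for mnt in $(umount_order "$1"); do
        umount "$mnt" || return 1
    done
}

# Example usage:
#   teardown /media/hdd0
```

Nested mounts such as dev/pts and run/lock have to go before their parents, which is why the list is not simply alphabetical.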

VM config

Then we need a configuration file for our VM. This is what my generic HVM template looks like. I’ve disabled all graphical stuff: sdl=0, stdvga=0, and vnc=0, enabled the emulated serial console: serial='pty' and set xen_platform_pci=1 so that my VM can use PV drivers.

The other stuff is standard for HVM guests, plus values like memory, name, and uuid that should be customized for your specific installation. Things like uuid and the mac address for your virtual NIC should be unique. There are websites out there that will generate these values. Xen has its own prefix for MAC addresses so use a generator to make a proper one.

builder = "hvm"
memory = "2048"
name = "myvm"
uuid = "uuuuuuuu-uuuu-uuuu-uuuu-uuuuuuuuuuuu"
vcpus = 1
cpus = '0-7'

disk = [
vif = [
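The unique values mentioned above can also be generated locally instead of on a website. This is a rough sketch: it relies on 00:16:3e being the MAC prefix (OUI) assigned to Xen, reads random bytes from /dev/urandom, and xen_mac is a hypothetical helper name. For the uuid, the uuidgen utility from util-linux does the job.

```shell
#!/bin/sh
# xen_mac: print a random MAC address under the Xen OUI 00:16:3e.
# (hypothetical helper; the three random bytes come from /dev/urandom)
xen_mac() {
    od -An -N3 -tx1 /dev/urandom |
        awk '{ printf "00:16:3e:%s:%s:%s\n", $1, $2, $3 }'
}

xen_mac

# A fresh value for the uuid field could come from:
#   uuidgen
```

Random generation does not guarantee uniqueness, so it is still worth checking the result against the other guests on the same bridge.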


Booting this VM is just like booting any PV guest:

xl create -c /etc/xen/vms/myvm.cfg

I’ve included the -c option to attach to the VM’s serial console and ideally we’d be able to see grub and the kernel dump a bunch of data as the system boots.


I’ve tested these instructions twice now on a Debian Wheezy system with Xen 4.3.1 installed from source. Both times Grub installs successfully but fails to boot. After enabling VNC for the VM and connecting with a viewer it’s apparent that the VM hangs when SeaBIOS tries to kick off grub.

As a work-around both times I’ve booted the VM from a Debian rescue ISO, set up a chroot much like in these instructions (the disk is now /dev/xvda though) and re-installed Grub. This does the trick and rebooting the VM from the disk now works. So I can only conclude that something in my instructions for installing Grub is wrong, though I think that’s unlikely as they match numerous other “install grub in a chroot” instructions on the web.

The source of the problem is speculation at this point. Part of me wants to dump the first 2M of my disk both after installing it using these instructions and then again after fixing it with the rescue CD. Now that I think about it the version of Grub installed in my chroot is probably a different version than the one on the rescue CD so that could have something to do with it.
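The before-and-after comparison of the first 2M could look something like the sketch below. It hasn't been run against the broken disk, so treat it as an outline: dump_head and differing_bytes are hypothetical helper names and the image file names are placeholders.

```shell
#!/bin/sh
# dump_head DEVICE OUTFILE: copy the first 2 MiB of a disk to a file.
# (hypothetical helper; reading the LV needs root)
dump_head() {
    dd if="$1" of="$2" bs=1048576 count=2 2>/dev/null
}

# differing_bytes FILE_A FILE_B: print how many byte positions differ.
# cmp -l prints one line per differing byte. (hypothetical helper)
differing_bytes() {
    echo $(( $(cmp -l "$1" "$2" 2>/dev/null | wc -l) ))
}

# Example usage:
#   dump_head /dev/mapper/myvg-hvmdisk before.img   # after the chroot install
#   dump_head /dev/mapper/myvg-hvmdisk after.img    # after the rescue ISO fix
#   differing_bytes before.img after.img
#   cmp -l before.img after.img | head              # offsets of the first differences
```

The offsets reported by cmp would show whether the difference sits in the first sector (the boot code in the MBR) or in the gap before the first partition at sector 2048, where Grub embeds its core image.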

Really though, I’ll probably just install syslinux and see if that works first. My experiences with Grub have generally been bad any time I try to do something out of the ordinary. It’s incredibly complicated and generally I just want something simple like syslinux to kick off a very simple VM.

I’ll post an update once I’ve got to the bottom of this mystery. Stay tuned.

OE image package

Here’s a fun problem that I don’t yet have a solution to: I want to build a single image with OE. This image will be my dom0. I want to include other images in this image. That is to say I want to package service VMs as part of / in my dom0.

All of the research I’ve done up till now (all 30 minutes of it) points to this having never been done before. I could be using the wrong keywords but the ones I tried turned up nothing on the respective OE and Yocto mailing lists. There seem to be a huge number of pitfalls here including things like changing the DISTRO_FEATURES in effect for the images as well as selecting image specific files for packages. On a few occasions I’ve used the distro name as a way to select specific configuration files like an fstab or interfaces.

What I want is to run bitbake once for the dom0 image and have it build all the other images and install them as packages in dom0. So I’d need to have recipes that actually package the images so they can be installed in another image. I think that will be the easy part.

The hard part will be making packages specific to each image with different files specific to the image. The only thing I can come up with for this is to play ugly tricks like building each VM image with a different MACHINE type but I’m not even sure if that will work. I guess all I can do for now is to experiment a bit and get on the mailing list to make sure I’m not duplicating work that’s already been done. This could get ugly.

OpenEmbedded Xen Network Driver VM

I wrote about a similar topic what feels like ages ago and I guess it was (8 months is a long time in this business). Since then I’ve been throwing some spare time at this same problem and I’ve actually made measurable progress. There have been a number of serendipitous events that came together to make this possible, the most important of which is the massive update to the Xen recipe in meta-virtualization. With this it’s super easy to crank out a Xen pvops kernel so combining this with an image that has the right plumbing in place it’s not as hard as you might think to build an NDVM.

So armed with the new Xen stuff from meta-virtualization I set out to build a reference NDVM. This isn’t intended to replace the NDVM in a system like XenClient-XT which is far more sophisticated. It’s just intended for experimentation and I don’t intend to build anything more sophisticated than a dumb Ethernet bridge into it.

To host this I’ve started a layer I call ‘meta-integral’. I know, all the good names were taken. Anyways this is intended to be a sort of distro layer where I can experiment with Xen stuff. Currently I’ve got a distro config for dom0 and an NDVM. The dom0 work is still very much a work in progress but the NDVM (much simpler) will actually boot as a PV guest.

To build this just clone my git repo with the build scripts and it’ll do all of the hard work for you:

git clone
git checkout ndvm
./ | tee build.log

This will crank out an image suitable to run on an Intel SandyBridge (SNB) system. I’ve only tested PV guests so you’ll have to set up a config like the following:

kernel = "/usr/lib/xen-common/bzImage-sugarbay.bin"
extra = "root=/dev/xvda console=hvc0"
iommu = "soft"
memory = "512"
name = "ndvm"
uuid = "a9ae8853-f1e9-41ca-9904-0e906efeb4af"
vcpus = "1"

disk = ['phy:/dev/loop0,xvda,w']
pci = ['0000:04:00.0']

Notice the kernel image and the rootfs image must be copied over to the Xen dom0 that you want to test the NDVM on. The kernel image is listed in the kernel line and this can be found at tmp-eglibc/deploy/images/sugarbay/bzImage-sugarbay.bin relative to your build root. The rootfs image will be in the same directory and called something like integral-image-ndvm-sugarbay.ext3. Notice that the disk config is pointing at a loopback. You’ll have to set this up with losetup just like any other loopback device. The part that differentiates this from any other PV guest is that we’re passing a PCI network device through to it and it’ll offer up a bridge to other guest VMs. The definitive documentation on how to do this with Xen is here:

The bit that I had to wrangle to get the bridge set up properly with OE was the integration between a network interfaces file and the bridge. I’ve been spoiled by Debian and its seamless integration between the two. OE has no such niceties. In this situation I had to choose between hacking a script manually or finding the scripts that integrate interfaces configuration with the bridge and baking that into the bridge-utils package from meta-oe. I figured getting bridges integrated with interfaces would be useful to others so I went through the Debian source package, extracted the scripts and baked them into OE directly. Likely this should go ‘upstream’ but for now this specialization is just sitting in my meta-integral layer.

So after fixing up the bridge-utils package so it plays nice with the interfaces file, the interfaces file in our NDVM looks like so:

# /etc/network/interfaces -- configuration file for ifup(8), ifdown(8)
# The loopback interface
auto lo
iface lo inet loopback

# real interface
auto eth0
iface eth0 inet manual

# xen bridge
auto xenbr0
iface xenbr0 inet manual
        bridge_ports eth0
        bridge_stp off
        bridge_waitport 0
        bridge_fw 0

So that’s it. Boot up this NDVM and it’ll have a physical network device and a bridge ready for consumption by other guests. I’ve not yet gone through and tested adding additional guests to the bridge so I’m assuming there’s still a bit of work lurking there. I’ll give this last bit a go and hopefully have positive results to post sooner rather than later. I’ve also not tested this on XenClient-XT as the most recent stable release is getting a bit old and there are likely to be incompatibilities between netfront / back stuff. This approach however is likely a great starting point if you’re building a service VM you want to run on our next release of XT though so feel free to fork and experiment.

UPDATE: Gave my NDVM a test just by giving the dom0 that was hosting it a vif. You can do this like so:

# xl network-attach Domain-0 backend=ndvm

The above assumes your NDVM has been named ‘ndvm’ in its VM config naturally. Anyways this will pop up a vif in dom0 backed by the NDVM. Pretty slick IMHO. Now to wrap this whole thing up so dom0 and the NDVM can be built as a single image with OE … Sounds easy enough 🙂

Talk at Xen Developer Summit 2013

UPDATE: Here’s the link: I still haven’t been able to bring myself to actually watch it but I’m sure it’s great

Just got back from the 2013 Xen Developer Summit where I gave a talk on a few interesting (to me at least) things. If you’re interested you can find my abstract here. My focus was naturally on SELinux / XSM stuff. Mostly my talk focused on the sVirt implementation in XenClient XT and another fun application of the architecture to our management stuff.

Had a good chat with a guy from Amazon afterward about all of the other evil stuff someone could do if they compromised QEMU. So while sVirt prevents the specific scenario presented I’ve no doubt there are other hazards. He was specifically concerned over the Xen privcmd driver & the hypercalls it could make. Hard to disagree as QEMU with root permissions in dom0 can execute any hypercall it wants. The only way to address this (other than stubdoms) is to deprivilege QEMU to prevent it from making hypercalls. That would probably require some code-changes in QEMU so it’s no small task.

I also touched briefly on the design for an inter-VM communication (IVC) mechanism that was floated to xen-devel this summer. In XT we have an IVC called ‘V4V’ that isn’t acceptable to upstream. When it came to our XSM policy however V4V had some favorable properties in that we created a new object in the hypervisor that was a ‘first-class’ object in the policy.

The proposal uses the same model as the front/back drivers so there would be no new object specific to the IVC. This means there wouldn’t be a way to differentiate the IVC from any other front/back driver. The purpose of the talk was to point this out and hopefully solicit some discussion. Got an even better conversation going on this point so hopefully I’ll have some fun stuff to report on this front soon.

sVirt in XenClient

It’s been 5 months since my last post about my on-going project required by my masters program at SU. With the hope of eventually getting my degree, this is my last post on the subject. In my previous post on this topic I described a quick prototype I coded up to test an example program and SELinux policy to demonstrate the sVirt architecture. This was a simple example of how categories from the MCS policy can be used to separate multiple instances of the same program. The logical step after implementing a prototype is coding up the real thing so in this post I’ll go into some detail describing an implementation of the sVirt architecture I coded for the XenClient XT platform. While it may have taken me far too long to write up a description of this project, it’s already running in a commercial product … so I’ve got that going for me 🙂


XenClient is a bit different than the upstream Xen in that the management stack has been completely rewritten. Instead of the xend process which was written in python, XenClient uses a toolstack that’s rewritten in Haskell. This posed two significant hurdles. First I’ve done little more than read the first few pages from a text book on Haskell so the sVirt code, though not complex, would be a bit over my skill level. Second SELinux has no Haskell bindings which would be required by the sVirt code.

Taking on the task of learning a new functional programming language and writing bindings for a relatively complex API in this language would have taken far longer than reasonable. Though we do intend to integrate sVirt into the toolstack proper, putting this work on the so called “critical path” would have been prohibitively expensive. Instead we implemented the sVirt code as a C program that is interposed between the toolstack and the QEMU instances it is intended to separate. Thus the toolstack (the xenmgr process) invokes the svirt-interpose program each time a QEMU process needs to be started for a VM. The svirt-interpose process then does all of the necessary functions to prepare the environment for the separation of the QEMU instance requested from the others currently running.

The remainder of this document describes the svirt-interpose program in detail. We begin by describing the interfaces down the call chain between the xenmgr, svirt-interpose and QEMU. We then go into detail describing the internal workings of the svirt-interpose code. This includes the algorithm used to assign categories to QEMU processes and to label the system objects used by these processes. We conclude with a brief analysis of the remaining pieces of the system that could benefit from similar separation. In my first post on this topic I describe possible attacks we’re defending against so I’ll not repeat that here.

Call Chain

As we’re unable to integrate the sVirt code directly into the toolstack we must interpose the execution of the sVirt binary between the toolstack and QEMU. We do this by having the toolstack invoke the sVirt binary and then have sVirt invoke QEMU after performing the necessary SELinux operations. For simplicity we leave the command line that the toolstack would normally pass to QEMU unchanged and simply extract the small piece of information we need from it in the sVirt code. All sVirt requires to do its job is the domain id (domid) of the VM it’s starting a QEMU instance for. This value is the first parameter so extracting it is quite simple.
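The real svirt-interpose is a C program, so the following is only a shell sketch of the interposition pattern, not the shipped code. domid_from_args is a hypothetical helper name and the QEMU path in the comments is a placeholder.

```shell
#!/bin/sh
# domid_from_args ARGS...: print the first argument if it looks like a domain
# id (a non-negative integer), otherwise fail. (hypothetical helper)
domid_from_args() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) echo "$1" ;;
    esac
}

# How an interposer built on this might be laid out:
#   domid=$(domid_from_args "$@") || exit 1
#   ... select a category and label objects for $domid here ...
#   exec /path/to/qemu "$@"    # placeholder path; arguments passed through unchanged
```

Validating the value before using it matters because it ends up in XenStore paths later on.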

The final bit that must be implemented is in policy. Here we must be sure that the policy we write reflects this call chain explicitly. This means removing the ability for the toolstack (xend_t) to invoke QEMU (qemu_t) directly and replacing this with allowing the toolstack to execute the svirt-interpose program (svirt_t) while allowing the svirt-interpose domain to transition to the QEMU domain. This is an important part of the process as it prevents the toolstack from bypassing the svirt code. Many will find protections like this superfluous as it implies protections from a malicious toolstack and the toolstack is a central component of the system’s TCB. There is a grain of truth in this argument though it represents a rather misguided analysis. It is very important to limit the permissions granted to a process to limit a possible vulnerability even if the process we’re confining is largely a “trusted” system component.

Category Selection

The central piece of this architecture is to select a unique MCS category for each QEMU process and assign this category to the resources belonging to said process. Before a category can be assigned to a resource we must first choose the category. The only requirement we have when selecting categories is that they are unique (not used by another QEMU process). There is no special meaning in a category number, so it makes sense to select the category number at random.

We’re presented with an interesting challenge here based on the nature of the svirt-interpose program. If this code was integrated with the toolstack directly it would be reasonable to maintain a data structure mapping the running virtual machines to their assigned categories. We could then select a random category number for a new QEMU instance and quickly walk this in-memory structure to be sure this category number hasn’t already been assigned to another process. But as was described previously, the svirt-interpose code is a short lived utility that is invoked by the toolstack and dies shortly after it starts up a QEMU process. Thus we need persistent storage to maintain this association.

The use of the XenStore is reasonable for such data and we use the key ‘selinux-mcs’ under the /local/domain/$domid node (where $domid is the domain id of a running VM) to store the value. Thus we randomly select a category and then walk the XenStore tree examining this key for each running VM. If a conflict is detected a new value is selected and the search continues. This is a very naive algorithm and we discuss ways in which it can be improved in the section on future work.
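The naive selection loop described above can be sketched roughly as follows. This is only an illustration, not the svirt-interpose code: a plain dict stands in for XenStore, the category range of 1024 is an assumption borrowed from common MCS policies, and storing the category as a bare integer string is also an assumption.

```python
import random

# Assumption: categories c0..c1023, as in common MCS policies.
MCS_CATEGORY_COUNT = 1024


def read_assigned_categories(xenstore):
    """Collect the 'selinux-mcs' value recorded for each running VM.

    `xenstore` is a plain dict (path -> value) standing in for XenStore.
    """
    assigned = set()
    for path, value in xenstore.items():
        parts = path.strip("/").split("/")
        # Match /local/domain/$domid/selinux-mcs and nothing else.
        if (len(parts) == 4 and parts[:2] == ["local", "domain"]
                and parts[3] == "selinux-mcs"):
            assigned.add(int(value))
    return assigned


def select_category(xenstore, rng=random):
    """Pick a random category, retrying until no running VM already uses it."""
    assigned = read_assigned_categories(xenstore)
    if len(assigned) >= MCS_CATEGORY_COUNT:
        raise RuntimeError("no free MCS categories")
    while True:
        candidate = rng.randrange(MCS_CATEGORY_COUNT)
        if candidate not in assigned:
            return candidate
```

Reading all assigned categories once and then retrying in memory avoids re-walking the tree on every collision, but it does nothing about two instances racing each other, which is part of why the algorithm is called naive.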

Object labeling

Once we’ve successfully interposed our svirt code between the toolstack and QEMU and implemented our category selection algorithm we have two tasks remaining. First we must enumerate the objects that belong to this QEMU instance and label them appropriately. Second we must perform the steps necessary to ensure the QEMU process will be labeled properly before we fork and exec it.

Determining the devices assigned to a VM by exploring the XenStore structures is tedious. The information we begin with is the domid of the VM we’re configuring QEMU for. From this we can examine the virtual block devices (VBDs) that belong to this VM but the structure in the VM specific configuration space rooted at /local/domain/$domid only contains information about the backend hosting the device. To find the OS objects associated with the device we need to determine the backend, then examine the configuration space for that backend.

We begin by listing the VBDs assigned to a VM by enumerating the /local/domain/$domid/device/vbd XenStore directory. This will yield a number of paths of the form /local/domain/$domid/device/vbd/$vbd_num where $vbd_num is the numeric id assigned to a virtual block device. VMs can be assigned any number of VBDs so we must process all VBDs listed in this directory.

From these paths representing each VBD assigned to a VM we get closer to the backing store by extracting the path to the backend of the split xen block driver. This path is contained in the key /local/domain/$domid/device/vbd/$vbd_num/backend. Once this path is extracted we check to see if the device in dom0 is writable by reading the ‘mode’ value. If the mode is ‘w’ the device is writable and we must apply the proper MCS label to it. We ignore read only VBDs as XenClient only assigns CDROMs as read only; all hard disks are allocated as read/write.

Once we’ve determined the device is writable we now need to extract the dom0 object (usually a block or loopback device file) that’s backing the device. The location of the device path in XenStore depends on the backend storage type in use. XenClient uses blktap processes to expose VHDs through device nodes in /dev and loopback devices to expose files that contain raw file systems. If a loopback device is in use the path to the device node will be stored in the XenStore key ‘loop-device’ in the corresponding VBD backend directory. Similarly if a bit more cryptic, the device node for a blktap device for a VHD will be in the XenStore key ‘params’.
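The walk from a domid to the dom0 device nodes can be sketched like this. It is a simplified model rather than the real code: a dict stands in for XenStore, and the backend paths and device nodes used with it are hypothetical. Only the key names (‘backend’, ‘mode’, ‘loop-device’, ‘params’) come from the description above.

```python
def writable_backing_devices(xenstore, domid):
    """Return the dom0 device nodes backing the writable VBDs of `domid`.

    `xenstore` is a plain dict (path -> value) standing in for XenStore.
    """
    vbd_dir = "/local/domain/%d/device/vbd/" % domid
    suffix = "/backend"
    devices = []
    for path in sorted(xenstore):
        if not (path.startswith(vbd_dir) and path.endswith(suffix)):
            continue
        # Only accept .../vbd/$vbd_num/backend, not deeper keys.
        vbd_num = path[len(vbd_dir):-len(suffix)]
        if not vbd_num or "/" in vbd_num:
            continue
        backend = xenstore[path]
        # Read-only VBDs (CD-ROMs on XenClient) are not labeled.
        if xenstore.get(backend + "/mode") != "w":
            continue
        # Loopback devices live under 'loop-device', blktap devices under 'params'.
        node = xenstore.get(backend + "/loop-device") or xenstore.get(backend + "/params")
        if node:
            devices.append(node)
    return devices
```

Checking ‘loop-device’ before ‘params’ is a design choice in this sketch: it assumes that when a loopback device is in use its key is the one that names the device node.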

Once these paths have been extracted the devices can be labeled using the SELinux API. To do so, we first need to know what the label should be. Through the SELinux API we can determine the current context for the file. We then set the MCS category calculated for the VM on this context and then change the file context to the resultant label. Important to note here is that both a sensitivity level and a category must be set on the security context. The SELinux API doesn’t shield us from the internals of the policy here and even though the MCS policy doesn’t reason about sensitivities there is a single sensitivity defined that must be assigned to every object (s0).
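As a rough illustration of the label calculation, the function below takes a context string and returns it with the s0 sensitivity and a single category applied. It only manipulates strings; in real code the current context would be read and the new one written through libselinux (for example getfilecon and setfilecon), which this sketch does not call. The function name is hypothetical.

```python
def with_mcs_category(context, category):
    """Return `context` with its level replaced by s0:c<category>.

    `context` is a 'user:role:type[:level]' string such as the one
    returned when querying the label of a file.
    """
    fields = context.split(":")
    if len(fields) < 3:
        raise ValueError("not a security context: %r" % context)
    user, role, type_ = fields[:3]
    # The MCS policy defines a single sensitivity, s0, which must still be set.
    return "%s:%s:%s:s0:c%d" % (user, role, type_, category)
```

Any level already present on the context is discarded, so a device that was previously labeled for another VM ends up with exactly one category.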

Assigning a category to the QEMU process is a bit different. Unlike file system objects there isn’t an object that we can query for a label. Instead we can ask the security server to calculate the resultant label of a transition from the current process (sVirt) to the destination process (QEMU). There is an alternative method available however and this allows us to determine the type for the QEMU process directly. SELinux has added some native support for virtualization and one such bit was the addition of the API call ‘selinux_virtual_domain_context_path’. This function returns the path of a file in the SELinux configuration directory that contains the type to be assigned to domains used for virtualization.

Once we have this type the category calculated earlier is then applied and the full context is obtained. SELinux has a specific API call that allows the caller to request the security server apply a specific context to the process produced by the next exec performed by the calling process (setexeccon). Once this has been done successfully the sVirt process cleans up the environment (closes file descriptors etc) and execs the QEMU program passing it the unmodified command line that was provided by the toolstack.


Applying an MCS category to a QEMU process and its resources is a fairly straightforward task. There are a few details that must be attended to in order to ensure that proper error handling is in place but the code is relatively short (~600 LOC) and quite easy to audit. There are some places where the QEMU processes must overlap however. XenClient is all about multiplexing shared hardware between multiple virtual machines on the same PC / Laptop. Sharing devices like the CD-ROM that is proxied to clients via QEMU requires some compromise.

As we state above the CD-ROM is read-only so an MCS category is not applied to the device itself but XenClient must ensure the accesses to the device are exclusive. This is achieved by QEMU obtaining an exclusive lock on a file in /var/lock before claiming the CD-ROM. All QEMU processes must be able to take this lock so the file must be created without any categories. This may seem like a minor detail but it’s quite tedious to implement in practice and it does represent a path for data to be transmitted from one QEMU process to another. Transmission through this lock file would require collusion between QEMU processes so it’s considered a minimal threat.

Future Work

This is my last post in a series that has spanned nearly a year. I’m a bit ashamed it’s taken me this long to write up my master’s project but it did end up taking on a life of its own, getting me a job with Citrix on the XenClient team. There’s still a lot of work to be done and I’m hoping to continue documenting it here. Firstly I have to collect the 8 blog posts that document this work and roll them up into a single document I can submit to my adviser to satisfy my degree requirements.

In parallel I’ll be working all things XenClient hopefully learning Haskell and integrating the sVirt architecture directly into our toolstack. Having this code in the toolstack directly will have a number of benefits. Most obviously it’ll remove a few forks so VM loading will be quicker. More interestingly though it will open up the possibility of applying MCS category labeling to devices (both PCI and USB) that are assigned to VMs. The end goal, as always, is strengthening the separation between the system components that need to remain separate thus improving the security of the system.

Another Go at My Masters Project

About a year ago I started working on my masters project. The topic turned out to be … not interesting enough to hold my attention I guess. This ended up being for the best since shortly after losing interest in my first project I got picked up on a project at work that I managed to dual-purpose for both work and school. I’m at a point where all I need to do to graduate is write this work up.

This first post (with a new tag) is a brief introduction of what this work is and why you should care. Future posts on this topic will cover the technical background and an implementation, both a simulation and code targeted at a specific Xen-based platform.

SELinux and Virtualization

I’ve been playing with SELinux for a while now and more recently I’ve had the need to work with the Xen hypervisor. Specifically I’ve been tasked with getting an SELinux policy running in the management VM (a.k.a. dom0). Most people savvy with SELinux would simply use a Linux distro that supports SELinux out of the box as their dom0 like Debian, Fedora, CentOS / RedHat. This is a valid argument but a minimal Debian install has a root file system that’s about a gigabyte in size and this was too big for our purposes.

The project was actually in full swing when I was tasked with SELinux integration and the team had already rolled their own dom0 using openembedded [1]. Getting SELinux up and running on an OE system was a pain but it’s done and that work isn’t all that interesting. What was interesting was implementing some policy and machinery to separate the instances of QEMU [2] that provide emulation for the HVM Windows guests [3].

SELinux does a great job of formalizing the interactions between processes. Any interactions that aren’t specified are prevented. I won’t go into a detailed explanation of the SELinux policy language [4] or Domain and Type Enforcement [5]. These are well documented elsewhere [6].

The Problem

What’s interesting is that on an SELinux system each process runs with a label defined by the label of its parent (the process that caused the execution) and the label on the program’s binary file on disk. This means that a process performing multiple fork/exec calls on the same binary will produce child processes all with the same SELinux label. Under nearly all circumstances this is the desired effect. On our specific platform there is a case where this isn’t what we want.

As our platform is Xen, xend is the daemon responsible for managing VMs. When xend starts an HVM client it executes QEMU. This causes all QEMU instances running to have the same SELinux label. In SELinux speak we would say these QEMU instances are all running “in the same domain” which is equivalent to them all having the same permissions, or more precisely, the same access to system resources.

At this point understanding the separation goals of Xen is important. Where an OS kernel aims to keep running processes separate, Xen aims to keep running VMs separate. Having instances of QEMU operating on behalf of what are effectively untrusted client VMs all with the same SELinux label undermines this goal of separation.

Separating QEMU Instances

The sVirt [7] project specifically addressed this problem with a prototype implementation some time back. Eventually this was integrated with libvirt so if you’re running libvirt[8] with the SELinux security driver[9] loaded then presumably you’ve got these protections in place. But again due to the embedded nature of our dom0 libvirt was way more software than we needed.

It became necessary to implement the basic sVirt design but integrated directly into our management stack. This isn’t a line-for-line re-implementation of the libvirt SELinux code though. I’ve made a number of changes that I will discuss in detail in the following posts though the design goals remain the same: keep instances of QEMU separate. The metric we’re using to judge the usefulness of this work is our answer to the question “what could a compromised QEMU instance access?”. If the answer to this is “all of the virtual hard disks on the system” then obviously we’ve failed. Instead we’re aiming to confine a QEMU instance to the resources it needs to function, like the virtual disks belonging to the VM it is providing services to.

Next Time

Now that the basic problem is laid out the next step is to cover some background. In my next post I’ll cover the background necessary to understand the specific pieces of the SELinux policy that this work uses to achieve these goals: the MCS policy.


1 the Open Embedded project:
2 the QEMU project:
3 Xen-HVM:
4 SELinux policy language:
5 Practical Domain and Type Enforcement for UNIX:
6 SELinux By Example:
7 sVirt Requirements Document:
8 The libvirt Project:
9 libvirt SELinux Driver:

Xen Network Driver Domain: How

In my last post I went into the reasons why exporting the network hardware from dom0 to an unprivileged driver domain is good for security. This time the “how” is our focus. The documentation out there isn’t perfect and it could use a bit of updating so expect to see a few edits to the relevant Xen wiki page [1] in the near future.

Basic setup

How you configure your Xen system is super important. The remainder of this post assumes you’re running the latest Xen from the unstable mercurial repository (4.0.1) with the latest 2.6.32 paravirt_ops kernel [2] from Jeremy Fitzhardinge’s git tree. If you’re running older versions of either Xen or the Linux kernel this may not work so you should consider updating.

For this post I’ll have 3 virtual machines (VMs).

  1. the administrative domain (dom0) which is required to boot the system
  2. an unprivileged domain (domU) that we’ll call “nicdom” which is short for network interface card (NIC) domain. You guessed it, this will become our network driver domain.
  3. another unprivileged domain (domU or client domain) that will get its virtual network interface from nicdom

I don’t really care how you build your virtual machines. Use whatever method you’re comfortable with. Personally I’m a command line junkie so I’ll be debootstrapping mine on LVM partitions as minimal Debian squeeze/sid systems running the latest pvops kernel. Initially the configuration files used to start up these two domUs will be nearly identical:

root="/dev/xvda ro"
extra="console=hvc0 xencons=tty"

client domain

root="/dev/xvda ro"
extra="console=hvc0 xencons=tty"

I’ve given the client a swap partition and more ram because I intend to turn it into a desktop. The nicdom (driver domain) has been kept as small as possible since it’s basically a utility that won’t have many logins. Obviously there’s more to it than just loading up these config files but installing VMs is beyond the scope of this document.

PCI pass through

The first step in configuring the nicdom is passing the network card directly through to it, and this starts with the xen-pciback driver. It hides the PCI device from dom0 which will later allow us to bind the device to a domU through configuration when we boot it using xm.

There are two ways to configure the xen-pciback driver:

  1. kernel parameters at dom0 boot time
  2. dynamic configuration using sysfs

xen-pciback kernel parameter

The first is the easiest so we’ll start there. You need to pass the kernel some parameters to tell it which PCI device to pass to the xen-pciback driver. Your grub kernel line should look something like this:

module /vmlinuz- /vmlinuz- root=/dev/something ro console=tty0 xen-pciback.hide=(00:19.0) intel_iommu=on

The important part here is the xen-pciback.hide parameter that identifies the PCI device to hide. I’m using a mixed Debian squeeze/sid system so getting used to grub2 is a bit of a task. Automating the configuration through grub is outside the scope of this document so I’ll assume you have a working grub.cfg or a way to build one.

Once you boot up your dom0 you’ll notice that lspci still shows the PCI device. That’s fine because the device is still there, it’s just that the kernel is ignoring it. What’s important is that when you issue an ip addr you don’t have a network device for this PCI device. On my system all I see is the loopback (lo) device, no eth0.

dynamic configuration with sysfs

If you don’t want to restart your system you can pass the network device to the xen-pciback driver dynamically. First you need to unload all drivers that access the device: modprobe -r e1000e. This is the e1000e driver in my case.

Next we tell the xen-pciback driver to hide the device by passing it the device address:

echo "0000:00:19.0" | sudo tee /sys/bus/pci/drivers/pciback/new_slot
echo "0000:00:19.0" | sudo tee /sys/bus/pci/drivers/pciback/bind

Some of you may be thinking “what’s a slot” and I’ve got no good answer. If someone reading this knows, leave me something in the comments if you’ve got the time.
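One detail that is easy to trip over is that the grub parameter above uses the short bus:device.function form (00:19.0) while the sysfs files are given the full form with a PCI domain prefix (0000:00:19.0). The sketch below normalizes an address and lists the two writes shown above without performing them. The helper names are hypothetical and defaulting a missing domain to 0000 is an assumption that holds on typical single-domain machines.

```python
import re

# [domain:]bus:device.function, all hex except the function (0-7).
_BDF = re.compile(
    r"^(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def full_pci_address(addr):
    """Expand '00:19.0' to '0000:00:19.0'; leave full addresses unchanged."""
    match = _BDF.match(addr)
    if match is None:
        raise ValueError("not a PCI address: %r" % addr)
    domain, bus, device, function = match.groups()
    # Assumption: a missing domain means PCI domain 0000.
    return "%s:%s:%s.%s" % ((domain or "0000").lower(), bus.lower(),
                            device.lower(), function)


def pciback_writes(addr, driver_dir="/sys/bus/pci/drivers/pciback"):
    """Return the (file, value) pairs to write, in order, to hide a device."""
    full = full_pci_address(addr)
    return [(driver_dir + "/new_slot", full), (driver_dir + "/bind", full)]
```

Returning the writes instead of performing them keeps the sketch safe to run on any machine; the actual writes need root and a dom0 with xen-pciback available.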

passing pci device to driver domain

Now that dom0 isn’t using the PCI device we can pass it off to our nicdom. We do this by including the line:

pci=['00:19.0']

in the configuration file for the nicdom. We can pass more than one device to this domain by placing another address between the square brackets like so:

pci=['00:19.0', '03:00.0']

Also we want to tell Xen that this domain is going to be a network driver domain and we have to configure IOMMU:

extra="console=hvc0 xencons=tty iommu=soft"

Honestly I’m not sure exactly what these last two configuration lines do. There are a number of mailing list posts giving a number of magic configurations that are required to get PCI passthrough to work right. These ones worked for me so YMMV. If anyone wants to explain please leave a comment.

Now when this domU boots we can lspci and we’ll see these two devices listed. Their address may be the same as in dom0 but this depends on how you’ve configured your kernel. Make sure to read the Xen wiki page for PCIPassthrough [4] as it’s quite complete.

Depending on how you’ve set up your nicdom you may already have some networking configuration in place. I’m partial to debootstrapping my installs on a LVM partition so I end up doing the network configuration by hand. I’ll dedicate a whole post to configuring the networking in the nicdom later. For now just get it working however you know how.

the driver domain

As much as we want to just jump in and make the driver domain work there are still a few configurations that we need to run through first.

Xen split drivers

Xen split drivers exist in two halves. The backend of the driver is located in the domain that owns the physical device. Each client domain that is serviced by the backend has a frontend driver that exposes a virtual device for the client. This is typically referred to as xen split drivers [3].

The xen networking drivers follow this model. For our nicdom to serve its purpose we need to load the xen-netback driver along with the xen-evtchn and the xenfs. We’ve already discussed what the xen-netback driver does so let’s talk about what the others are.

The xenfs driver exposes some xen specific stuff from the kernel to user space through the /proc file system. Exactly what this “stuff” is I’m still figuring out. If you dig into the code for the xen tools (xenstored and the various xenstore-* utilities) you’ll see a number of references to files in proc. From my preliminary reading this is where a lot of the xenstore data is exposed to domUs.

The xen-evtchn is a bit more mysterious to me at the moment. The name makes me think it’s responsible for the events used for communication between backend and frontend drivers but that’s just a guess.

So long story short, we need these modules loaded in nicdom:

modprobe -a xenfs xen-evtchn xen-netback

In the client we need the xenfs, xen-evtchn and the xen-netfront modules loaded.

Xen scripts and udev rules

Just like the Xen wiki says, we need to install the udev rules and the associated networking scripts. If you’re like me you like to know exactly what’s happening though, so you may want to trigger the backend / frontend and see the events coming from udev before you just blindly copy these files over.

udev events

To do this you need both the nicdom and the client VM up and running with no networking configured (see configs above). Once they’re both up start udevadm monitor --kernel --udev in each VM. Then try to create the network front and backends using xm. This is done from dom0 with a command like:

xm network-attach client mac=XX:XX:XX:XX:XX:XX,backend=nicdom

I’ll let the man page for xm explain the parameters 🙂

In the nicdom you should see the udev events creating the backend vif:

KERNEL[timestamp] online   /devices/vif/-4-0 (xen-backend)

There are actually quite a few events but this one is the most important mostly because of the script and vif values. script is how the udev rule configures the network interface in the driver domain and the vif tells us the new interface name.

Really we don’t care what udev events happen in the client since the kernel will just magically create an eth0 device like any other. You can configure it using /etc/network/interfaces or any other method. If you’re interested in which events are triggered in the client I recommend recreating this experiment for yourself.

Without any udev rules and scripts in place the xm network-attach command should fail after a time out period. If you’re into reading network scripts or xend log files you’ll see that xend is waiting for the nicdom to report the status of the network-attach in a xenstore variable:

DEBUG (DevController:144) Waiting for 0.
DEBUG (DevController:628) hotplugStatusCallback /local/domain/1/backend/vif/3/0/hotplug-status

installing rules, scripts and tools

Now that we’ve seen the udev events we want to install the rules for Xen that will wait for the right event and will then trigger the necessary script. From the udevadm output above we’ve seen that dom0 passes the script name through the udev event. This script name is actually configured in the xend-config.sxp file in dom0:

(vif-script vif-whatever)

You can use whatever xen networking script you want (bridge is likely the easiest).

So how to install the udev rules and the scripts? Well you could just copy them over manually (mount the nicdom partition in dom0 and literally cp them into place). This method got me in trouble though and this detail is omitted from the relevant Xen wiki page [1]. What I didn’t know is the info I just supplied above: that dom0 waits for the driver domain to report its status through the xenstore. The networking scripts that get run in nicdom report this status but they require some xenstore-* utilities that aren’t installed in a client domain by default.

Worse yet I couldn’t see any logging output from the script indicating that it was trying to execute xenstore-write and failing because there wasn’t an executable by that name on its path. Once I tracked down this problem (literally two weeks of code reading and bugging people on mailing lists) it was smooth sailing. You can install these utilities by hand to keep your nicdom as minimal as possible. What I did was copy over the whole xen-unstable source tree to my home directory on nicdom with the make tools target already built. Then I just ran make -C tools install to install all of the tools.
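Since the scripts fail without saying so, a quick check inside the nicdom for the utilities on the path would have saved me a lot of time. Here is a minimal sketch of such a check. Only xenstore-write is named above as a requirement; including xenstore-read in the list is my assumption, and the scripts you install may need others.

```python
import shutil

# xenstore-write is the tool the scripts were failing on;
# xenstore-read is an assumption and the list may be incomplete.
REQUIRED_TOOLS = ("xenstore-read", "xenstore-write")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing from PATH: " + ", ".join(missing))
    else:
        print("all xenstore tools found")
```

Keep in mind that udev may run the scripts with a different PATH than your login shell, so a clean result from an interactive session is a good sign rather than a guarantee.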

This is a bit heavy handed since it installs xend and xenstored which we don’t need. Not a big deal IMHO at this point. That’s pretty much it. If you want your vif to be created when your client VM is created just add a vif line to its configuration:



In short the Xen DriverDomain wiki page has nearly all the information you need to get a driver domain up and running. What it’s missing are the little configuration tweaks that likely change from time to time and the fact that the xenstore-* tools need to be installed in the driver domain. This last bit really stumped me since there seems to be virtually no debug info that comes out of the networking scripts.

If anyone out there tries to follow this leave me some feedback. There’s a lot of info here and I’m sure I forgot something. I’m interested in any way I can make this better / more clear so let me know what you think.


Xen Network Driver Domain: Why

If you’ve been watching the xen-user, xen-devel or the xen channel on freenode you’ve probably seen me asking questions about setting up a driver domain. You also may have noticed that the Xen wiki page dedicated to the topic [1] is a bit light on details. I’ve even had a few people contact me directly through email and my blog to see if I have this working yet which I think is great. I’m glad to know there are other people interested in this besides me.

This post is the first in a series in which I’ll explain how I went about configuring Xen to bind a network device to an unprivileged domain (typically called domU) and how I configured this domU (Linux) to act as a network back end for other Linux domUs. This first post will frame the problem and tell you why you should care. My next post will dig into the details of what needs to be done to set up a network driver domain.


First off why is this important? Every Xen configuration I’ve seen had all devices hosted in the administrative domain (dom0) and everything worked fine. What do we gain by removing the device from dom0 and having it hosted in a domU?

The answer to these questions is all about security. If all you care about is functionality then don’t bother configuring a driver domain. You get no new “features” and no performance improvement (that I know of). What you do get is a dom0, the most security critical domain, with a reduced attack surface.

Let’s consider an attack scenario to make this concrete: Say an exploit exists in whichever Linux network driver you use. This exploit allows a remote attacker to send a specially crafted packet to your NIC and execute arbitrary code in your kernel. This is a worst case scenario, probably not something that will happen but it is possible. If dom0 is hosting this and all other devices and their drivers your system is hosed. The attacker can manipulate all of your domUs, the data in your domUs, everything.

Now suppose you’ve bound your NIC to a domU and configure this domU to act as a network back end for other domUs. Using the same (slightly far-fetched) vulnerability as an example, have we reduced the impact of the exploit?

The driver domain makes a significant difference here. Instead of having malicious code executing in dom0, it’s in an unprivileged domain. This isn’t to say that the exploit has no effect on the system. What we have achieved though is reducing the effects of the exploit. Instead of a full system compromise we’re now faced with a single unprivileged domain compromise and a denial of service on the networking offered to the other VMs.

The attacker can take down the networking for all of your VMs but they can’t get at their disks, they can’t shut them down, and they can’t create new VMs. Sure the attacker could sit in the driver domain and snoop on your domUs’ traffic but this sort of snooping is possible remotely. The appropriate use of encryption solves the disclosure problem. In the worst case scenario an attacker could use this exploit as a first step in a more complex attack on the other VMs by manipulating the network driver front ends and possibly triggering another exploit.

In short a driver domain can reduce the attack surface of your Xen system. It’s not a cure-all but it’s good for your overall system security. I’m pretty burned out so I’ll wrap up with the matching “How” part of this post in the next day or two. Stay tuned.


xend fails silently

I’ve been giving myself a crash course in Xen recently. I’ve played with Xen in the past but it’s always annoyed me that the kernel for dom0 was super old (2.6.18). Debian has always been good with their support and the 2.6.26 kernel that ships with Lenny has worked well.

Now that Xen 4.0 is out and I’m getting ready to transition over to Squeeze, I wanted to go back and find where the upstream dom0 kernel was maintained. Turns out there’s a git tree with a stable 2.6.32 and more bleeding edge kernels as well. Building both Xen 4 and the kernel from git was pretty uneventful. I did however run into a problem getting xend to start.

There’s nothing worse than a program that fails without giving any good indication as to why. xend failed with this error message in /var/log/xen/xend-debug.log:

Error: (111, 'Connection refused')

That’s it. Not helpful at all. I started fishing through an strace but got lost in the output pretty quick. Luckily there was some help to be had in the #xen channel on freenode.

The folks on IRC told me it’s quite common for xend to fail quietly like this if there’s a needed xen kernel module that isn’t loaded. The usual suspects are xen-evtchn or xen-gntdev but it may be some other component that wasn’t compiled in. In my situation I had xen-gntdev built in but xen-evtchn was built as a module and hadn’t been loaded. A quick modprobe later and xend was up and running. I went one step further and recompiled this piece into the kernel directly.
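A quick way to run through the usual suspects is to compare them against /proc/modules. The sketch below does that on the text of the file. The two module names come from the advice above; everything else is illustrative. Note that /proc/modules lists names with underscores rather than dashes.

```python
def missing_modules(proc_modules_text, needed=("xen-evtchn", "xen-gntdev")):
    """Return the modules in `needed` that are not listed as loaded.

    `proc_modules_text` is the content of /proc/modules, where the first
    field of each line is the module name written with underscores.
    """
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in needed if name.replace("-", "_") not in loaded]


if __name__ == "__main__":
    with open("/proc/modules") as handle:
        for name in missing_modules(handle.read()):
            print("not loaded (or built in): " + name)
```

Drivers compiled directly into the kernel never show up in /proc/modules, so a name reported here is only a candidate for modprobe, not proof that something is missing.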

Now I’m happily running Xen 4.0 with a 2.6.32 kernel on Debian Squeeze. Good times.