sVirt in XenClient

It’s been 5 months since my last post about my ongoing project required by my master’s program at SU. With the hope of eventually getting my degree, this is my last post on the subject. In my previous post on this topic I described a quick prototype I coded up to test an example program and SELinux policy demonstrating the sVirt architecture. This was a simple example of how categories from the MCS policy can be used to separate multiple instances of the same program. The logical step after implementing a prototype is coding up the real thing, so in this post I’ll go into some detail describing an implementation of the sVirt architecture I coded for the XenClient XT platform. While it may have taken me far too long to write up a description of this project, it’s already running in a commercial product … so I’ve got that going for me 🙂

Background

XenClient is a bit different from upstream Xen in that the management stack has been completely rewritten. Instead of the xend process, which was written in Python, XenClient uses a toolstack written in Haskell. This posed two significant hurdles. First, I’ve done little more than read the first few pages of a textbook on Haskell, so the sVirt code, though not complex, would be a bit over my skill level. Second, SELinux has no Haskell bindings, which the sVirt code would require.

Taking on the task of learning a new functional programming language and writing bindings for a relatively complex API in this language would have taken far longer than reasonable. Though we do intend to integrate sVirt into the toolstack proper, putting this work on the so-called “critical path” would have been prohibitively expensive. Instead we implemented the sVirt code as a C program that is interposed between the toolstack and the QEMU instances it is intended to separate. Thus the toolstack (the xenmgr process) invokes the svirt-interpose program each time a QEMU process needs to be started for a VM. The svirt-interpose process then performs all of the steps necessary to prepare the environment for separating the requested QEMU instance from the others currently running.

The remainder of this document describes the svirt-interpose program in detail. We begin by describing the interfaces down the call chain between the xenmgr, svirt-interpose and QEMU.
We then go into detail describing the internal workings of the svirt-interpose code. This includes the algorithm used to assign categories to QEMU processes and to label the system objects used by these processes. We conclude with a brief analysis of the remaining pieces of the system that could benefit from similar separation. In my first post on this topic I describe possible attacks we’re defending against so I’ll not repeat that here.

Call Chain

As we’re unable to integrate the sVirt code directly into the toolstack we must interpose the execution of the sVirt binary between the toolstack and QEMU. We do this by having the toolstack invoke the sVirt binary and then have sVirt invoke QEMU after performing the necessary SELinux operations. For simplicity we leave the command line that the toolstack would normally pass to QEMU unchanged and simply extract the small piece of information we need from it in the sVirt code. All sVirt requires to do its job is the domain id (domid) of the VM it’s starting a QEMU instance for. This value is the first parameter so extracting it is quite simple.

The final bit that must be implemented is in policy. Here we must be sure that the policy we write reflects this call chain explicitly. This means removing the ability for the toolstack (xend_t) to invoke QEMU (qemu_t) directly and replacing it with allowing the toolstack to execute the svirt-interpose program (svirt_t) while allowing the svirt-interpose domain to transition to the QEMU domain. This is an important part of the process as it prevents the toolstack from bypassing the svirt code. Many will find protections like this superfluous as they imply protection from a malicious toolstack, and the toolstack is a central component of the system’s TCB. There is a grain of truth in this argument though it reflects a rather misguided analysis: it is very important to limit the permissions granted to a process, and thereby the reach of a possible vulnerability, even if the process we’re confining is largely a “trusted” system component.
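The shape of this call chain can be sketched in raw kernel policy language. The domain types (xend_t, svirt_t, qemu_t) come from the description above; the file types for the on-disk binaries (svirt_exec_t, qemu_exec_t) are naming assumptions for this sketch, and a real policy would build these rules out of reference policy interface macros rather than writing them by hand:

```
# toolstack may execute svirt-interpose and transitions into svirt_t
allow xend_t svirt_exec_t:file { getattr open read execute };
allow xend_t svirt_t:process transition;
type_transition xend_t svirt_exec_t:process svirt_t;

# svirt-interpose may execute QEMU and transitions into qemu_t
allow svirt_t qemu_exec_t:file { getattr open read execute };
allow svirt_t qemu_t:process transition;
type_transition svirt_t qemu_exec_t:process qemu_t;

# crucially, there is no rule allowing xend_t to transition to qemu_t
```

The absence of any xend_t to qemu_t transition rule is what enforces the interposition: the toolstack simply has no path to a QEMU domain except through svirt_t.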

Category Selection

The central piece of this architecture is selecting a unique MCS category for each QEMU process and assigning this category to the resources belonging to said process. Before a category can be assigned to a resource we must first choose the category. The only requirement we have when selecting categories is that they are unique (not used by another QEMU process). There is no special meaning in a category number, so it makes sense to select it at random.

We’re presented with an interesting challenge here based on the nature of the svirt-interpose program. If this code were integrated with the toolstack directly it would be reasonable to maintain a data structure mapping the running virtual machines to their assigned categories. We could then select a random category number for a new QEMU instance and quickly walk this in-memory structure to be sure the number hasn’t already been assigned to another process. But as described previously, the svirt-interpose code is a short-lived utility that is invoked by the toolstack and dies shortly after it starts up a QEMU process. Thus we need persistent storage to maintain this association.

The use of the XenStore is reasonable for such data and we use the key ‘selinux-mcs’ under the /local/domain/$domid node (where $domid is the domain id of a running VM) to store the value. Thus we randomly select a category and then walk the XenStore tree examining this key for each running VM. If a conflict is detected a new value is selected and the search continues. This is a very naive algorithm and we discuss ways in which it can be improved in the section on future work.
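The selection loop can be sketched as follows. This is a Python illustration rather than the C of the actual implementation, an in-memory set stands in for the walk over the ‘selinux-mcs’ keys in XenStore, and the category range is an assumption of the sketch:

```python
import random

# assumed category range for this sketch; c0 is left out deliberately
CATEGORY_MIN, CATEGORY_MAX = 1, 1023

def select_category(in_use):
    """Pick a random MCS category not assigned to any running VM.

    'in_use' stands in for the set of values read from each running
    VM's /local/domain/$domid/selinux-mcs key in XenStore.
    """
    while True:
        candidate = random.randint(CATEGORY_MIN, CATEGORY_MAX)
        if candidate not in in_use:  # naive: retry on collision
            return candidate

# categories already claimed by two running QEMU instances
running = {440, 850}
print("selected category: s0:c%d" % select_category(running))
```

With around a thousand categories and only a handful of VMs, collisions are rare enough that the retry loop terminates almost immediately, which is why the naive approach is tolerable in practice.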

Object labeling

Once we’ve successfully interposed our svirt code between the toolstack and QEMU and implemented our category selection algorithm we have two tasks remaining. First we must enumerate the objects that belong to this QEMU instance and label them appropriately. Second we must perform the steps necessary to ensure the QEMU process will be labeled properly before we fork and exec it.

Determining the devices assigned to a VM by exploring the XenStore structures is tedious. The information we begin with is the domid of the VM we’re configuring QEMU for. From this we can examine the virtual block devices (VBDs) that belong to this VM but the structure in the VM specific configuration space rooted at /local/domain/$domid only contains information about the backend hosting the device. To find the OS objects associated with the device we need to determine the backend, then examine the configuration space for that backend.

We begin by listing the VBDs assigned to a VM by enumerating the /local/domain/$domid/device/vbd XenStore directory. This will yield a number of paths of the form /local/domain/$domid/device/vbd/$vbd_num where $vbd_num is the numeric id assigned to a virtual block device. VMs can be assigned any number of VBDs so we must process all VBDs listed in this directory.

From these paths representing each VBD assigned to a VM we get closer to the backing store by extracting the path to the backend of the split Xen block driver. This path is contained in the key /local/domain/$domid/device/vbd/$vbd_num/backend. Once this path is extracted we check to see if the device in dom0 is writable by reading the ‘mode’ value. If the mode is ‘w’ the device is writable and we must apply the proper MCS label to it. We ignore read-only VBDs as XenClient only assigns CD-ROMs as read-only; all hard disks are allocated as read/write.

Once we’ve determined the device is writable we need to extract the dom0 object (usually a block or loopback device file) that’s backing the device. The location of the device path in XenStore depends on the backend storage type in use. XenClient uses blktap processes to expose VHDs through device nodes in /dev, and loopback devices to expose files that contain raw file systems. If a loopback device is in use the path to the device node will be stored in the XenStore key ‘loop-device’ in the corresponding VBD backend directory. Similarly, if a bit more cryptic, the device node for a blktap device backing a VHD will be in the XenStore key ‘params’.
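The walk described over the last few paragraphs can be sketched as follows, in Python for illustration. The dict contents (domid 3, the VBD numbers, the device node paths) are hypothetical stand-ins for a real XenStore tree, and xs_read/xs_ls stand in for the real XenStore API:

```python
# mock XenStore tree; paths follow the layout described above
STORE = {
    "/local/domain/3/device/vbd/768/backend": "/local/domain/0/backend/vbd/3/768",
    "/local/domain/0/backend/vbd/3/768/mode": "w",
    "/local/domain/0/backend/vbd/3/768/params": "/dev/xen/blktap-2/tapdev0",
    "/local/domain/3/device/vbd/5632/backend": "/local/domain/0/backend/vbd/3/5632",
    "/local/domain/0/backend/vbd/3/5632/mode": "r",
}

def xs_read(path):
    return STORE.get(path)

def xs_ls(prefix):
    """List the immediate child node names under 'prefix'."""
    kids = set()
    for key in STORE:
        if key.startswith(prefix + "/"):
            kids.add(key[len(prefix) + 1:].split("/")[0])
    return sorted(kids)

def writable_backing_devices(domid):
    """Return the dom0 device paths backing each writable VBD of a VM."""
    devices = []
    vbd_root = "/local/domain/%d/device/vbd" % domid
    for vbd_num in xs_ls(vbd_root):
        backend = xs_read("%s/%s/backend" % (vbd_root, vbd_num))
        if xs_read(backend + "/mode") != "w":
            continue  # read-only (CD-ROM): no category applied
        # loopback devices record their node in 'loop-device',
        # blktap/VHD devices in 'params'
        dev = xs_read(backend + "/loop-device") or xs_read(backend + "/params")
        devices.append(dev)
    return devices

print(writable_backing_devices(3))  # → ['/dev/xen/blktap-2/tapdev0']
```

Note the indirection: the per-VM directory only names the backend path, so each VBD costs a second hop into the backend's own configuration space before the dom0 object can be found.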

Once these paths have been extracted the devices can be labeled using the SELinux API. To do so, we first need to know what the label should be. Through the SELinux API we can determine the current context for the file. We then set the MCS category calculated for the VM on this context and then change the file context to the resultant label. Important to note here is that both a sensitivity level and a category must be set on the security context. The SELinux API doesn’t shield us from the internals of the policy here and even though the MCS policy doesn’t reason about sensitivities there is a single sensitivity defined that must be assigned to every object (s0).
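The context manipulation amounts to replacing the level portion of a four-field context string. The real code does this around libselinux calls (getfilecon to fetch the current context, setfilecon to apply the result); the string handling itself can be sketched in Python, with the type name in the example being an assumption:

```python
def set_category(context, category):
    """Replace the MCS level of a security context with s0:cN.

    An SELinux context is 'user:role:type:level'. Even though the MCS
    policy doesn't reason about sensitivities, every object still needs
    the single defined sensitivity (s0) alongside its category.
    """
    user, role, ctype = context.split(":")[:3]
    return "%s:%s:%s:s0:c%d" % (user, role, ctype, category)

before = "system_u:object_r:svirt_image_t:s0"
print(set_category(before, 440))  # → system_u:object_r:svirt_image_t:s0:c440
```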

Assigning a category to the QEMU process is a bit different. Unlike file system objects, there isn’t an object that we can query for a label. Instead we can ask the security server to calculate the resultant label of a transition from the current process (sVirt) to the destination process (QEMU). There is an alternative method available, however, that allows us to determine the type for the QEMU process directly. SELinux has added some native support for virtualization and one such bit was the addition of the API call ‘selinux_virtual_domain_context_path’. This function returns the path of a file in the SELinux configuration directory that contains the type to be assigned to domains used for virtualization.

Once we have this type the category calculated earlier is then applied and the full context is obtained. SELinux has a specific API call that allows the caller to request the security server apply a specific context to the process produced by the next exec performed by the calling process (setexeccon). Once this has been done successfully the sVirt process cleans up the environment (closes file descriptors etc) and execs the QEMU program passing it the unmodified command line that was provided by the toolstack.

Conclusion

Applying an MCS category to a QEMU process and its resources is a fairly straightforward task. There are a few details that must be attended to in order to ensure that proper error handling is in place, but the code is relatively short (~600 LOC) and quite easy to audit. There are some places where the QEMU processes must overlap, however. XenClient is all about multiplexing shared hardware between multiple virtual machines on the same PC / laptop. Sharing devices like the CD-ROM that is proxied to clients via QEMU requires some compromise.

As we state above the CD-ROM is read-only so an MCS category is not applied to the device itself, but XenClient must ensure that accesses to the device are exclusive. This is achieved by QEMU obtaining an exclusive lock on a file in /var/lock before claiming the CD-ROM. All QEMU processes must be able to take this lock so the file must be created without any categories. This may seem like a minor detail but it’s quite tedious to implement in practice and it does represent a path for data to be transmitted from one QEMU process to another. Transmission through this lock file would require collusion between QEMU processes so it’s considered a minimal threat.
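The locking scheme is ordinary advisory file locking. A minimal sketch of the idea in Python (not the actual QEMU code, and using a temp directory rather than /var/lock so it can run anywhere):

```python
import fcntl
import os
import tempfile

# stand-in for the real lock file under /var/lock
lock_path = os.path.join(tempfile.gettempdir(), "cdrom-exclusive.lock")

def claim_cdrom(path):
    """Take the exclusive, non-blocking lock held while the CD-ROM is
    claimed; returns an fd on success, None if someone else holds it."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except OSError:
        os.close(fd)
        return None

first = claim_cdrom(lock_path)   # succeeds: the lock was free
second = claim_cdrom(lock_path)  # fails: the lock is already held
```

flock locks attach to the open file description, so even a second open() of the same path is refused while the first descriptor stays open, and the lock is released automatically if the holding process dies.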

Future Work

This is my last post in this series that has nearly spanned a year. I’m a bit ashamed it’s taken me this long to write up my master’s project, but it did end up taking on a life of its own, getting me a job with Citrix on the XenClient team. There’s still a lot of work to be done and I’m hoping to continue documenting it here. First I have to collect the 8 blog posts that document this work and roll them up into a single document I can submit to my adviser to satisfy my degree requirements.

In parallel I’ll be working on all things XenClient, hopefully learning Haskell and integrating the sVirt architecture directly into our toolstack. Having this code in the toolstack directly will have a number of benefits. Most obviously it’ll remove a few forks so VM loading will be quicker. More interestingly, though, it will open up the possibility of applying MCS category labeling to devices (both PCI and USB) that are assigned to VMs. The end goal, as always, is strengthening the separation between the system components that need to remain separate, thus improving the security of the system.

Troubles with Ovi Store after upgrading N8 to Anna

I took some time yesterday to upgrade my beloved Nokia N8 to the new(ish) Symbian^3 Anna build and it wasn’t a smooth upgrade. Upgrading through the Ovi Suite went well enough but applications like the Ovi Store just wouldn’t work after the upgrade. The Ovi Store application would install and start but would just sit on the splash screen showing the “Loading …” message. At this point I did a hard reset, which did nothing.

After combing through a handful of forum posts I ran across this one in the Nokia Europe discussion boards. Initially I had no idea what this guy was talking about. I guess I’m used to my phone being an appliance so downgrading packages seemed pretty crazy, especially since I had no idea where to obtain the packages (sis files) described in the post. A few web searches later I ran across the “Fix Symbian” site. Pretty encouraging name all things considered. Anyways all I did was download the sis files from the post titled “S^3 QT 4.7.3 MOBILITY 1.1.3″ which downgrades a number of components in the ” Notifications Support Package Symbian3 v1.1.11120″ package. One quick reboot and Ovi Store was back up and running.

So I guess the Anna upgrade that Nokia is shipping is broken? Pretty strange really but not something that can’t be worked around thanks to some contributions from the interwebs. Unfortunately the upgrade to Anna didn’t fix the problems I’ve had with my N8 and my wireless access point. I’ll debug this sooner or later, likely when I upgrade my home network with the ALIX boards I got in the mail last week.

ALIX3D2 Arrives

I’ve been hoping to upgrade my older ALIX2D3 router / firewall / wireless access point for a while now. While taking a few hours off from work over the holiday I surfed over to PCEngines and bought myself a Christmas present: a matching pair of ALIX3D2 boards and a Compex WLM200N5-23 MiniPCI card (802.11an) plus the necessary wires / cables / 5GHz antenna and enclosures etc.

Everything showed up in the mail today:

Unfortunately one board arrived with minor damage to the CF socket. Bent pins suck:

I’m waiting to hear back from the folks over at PCEngines to see how they want to handle a repair / return. I really want to just bend it back and hope it works but I’m worried if I break it off they won’t accept a return. Glad I bought two so I still have one to play with in the meantime.

My next post should be about my finally completed Master’s project, but likely it’ll be about building OE for this device 🙂

Update:
PC Engines support was super quick to respond. They gave me permission to try to bend the pin back myself and assured me that if my attempted repair failed they’d still replace the board. The pin bent back into place easily enough. I’m hoping to do an install tomorrow to be sure the CF slot is functional.

Atom Based Home NFS

It’s been a while since I’ve posted any new content but that’s not because I haven’t been doing anything worthy of mention. During the few breaks I’ve had from my day job I took the time to replace my QNAP 419p NAS with a custom system to host NFS shares. Here’s just a quick laundry list of the parts I settled on after quite a bit of shopping.

My requirements for this system were really basic. All it does is host NFS shares and run rTorrent from time to time. There was no need to use a full desktop grade processor but I still wanted something x86 compatible with a good bit of ram and a PCI-E port so I could use a real hardware RAID card.

Motherboard

The Super Micro MBD-X7SPE-HF-D525 was a good fit. It has a sweet little Atom D525 soldered onto the board and it’s surprisingly quick (1.8GHz) though it’s no work horse. There’s only one 16x PCI-E port but that’s all I need. It supports up to 4GB of RAM though it’s pretty low end laptop memory. It also has two NICs so you can bond them for more throughput if you want to get fancy.

If you check out the Super Micro site you’ll see they advertise this board as having RAID on board. It’s still not hardware RAID though and since the RAID is done in firmware / driver the Atom processor on this board would have a very hard time keeping up with parity calculations under heavy usage. There’s plenty of literature out there about “fake RAID” so don’t be fooled.

RAID Card

I’ve had excellent luck with 3ware in the past so when they were bought out by LSI I was skeptical. Some of their older cards are still branded 3ware and I tracked down a 3ware 9650SE 8LPML on ebay for under $300 … since I couldn’t afford it new at close to $500 when I was looking 3 months back.

The Linux drivers for this card and the management software are still great. Setting up RAID 6 was easy though the card requires 5 drives to do this RAID level where theoretically it should only need 4.

RAM

The memory for the Super Micro board is practically free. I tracked down two 2GB SODIMMs of Kingston ValueRAM for like $30. Not the fastest RAM in the world but it works.

Disks

Buying 5 hard disks is not cheap. This is especially true if you want to get big drives and ones that will be fast enough to last. I went with 5 Western Digital AV-GP WD20EURS 2TB 3.0Gb/s drives. They’re nice and big, pretty fast and quiet. User reviews on sites like Newegg are a good way to check up on products before you buy. This one had particularly good reviews and so far they’ve been great. None were DOA, as drives have a tendency to be. They’re also very quiet and run relatively cool.

Drive Enclosure

With this many drives it’s worth investing in an enclosure to hold them. You won’t have to rig fans on the drives to keep air moving over them and you’ll have the luxury of easy swapping if something goes wrong and you need to swap out drives. There isn’t much available in this space so your choices are limited. I went with 2x ICY DOCK MB453SPF-B 3 in 2 enclosures.

I was a bit pissed when one of the enclosures showed up DOA but Newegg was pretty good about doing a quick exchange. It cost me a few extra bucks in shipping but it could have been worse … ::shrug::

Case

I had an old case from Lian Li and a small power supply that had been sitting in my basement unused for several years. Luckily the case had the four 5.25″ bays that I needed to hold these enclosures. Fitting enclosures into cases never goes as planned though.

The case is super nice but it had these little tabs that stuck out into the drive bays. These were intended to support individual 5.25 drives but they just got in the way of these enclosures. They were easy enough to remove with a saw and a file.

That’s about all there is to it. Setting up the RAID so I could boot Linux from the 3ware card was a bit of a pain and I wish I had documented the process. Setting it up through the firmware interface took some experimentation but it is possible. Then just install Debian and an NFS server and you’re done.

Spanish Thruxton

I travel a bunch but I’ve only traveled internationally twice now. Interestingly enough, in my short trips to both the UK and Spain I’ve run into my Thruxton’s foreign relatives. Naturally, the Spanish Thruxton I came across this week in Barcelona is red.

You do not have sufficient permissions to access this page

I’ve been working on a set of scripts to backup the wordpress instances on my server. While structuring these scripts I realized that I hadn’t structured my WP installs consistently. I had played around with using table prefixes and hosting multiple WP instances in a single database but eventually I just broke them out into unique databases for simplicity.

The table prefixes persisted however and while I was poking around I decided to rename the tables dropping the prefixes. What I didn’t count on is that some values in the usermeta and options tables are prepended with the table prefix from the wp-config.php. Without fixing up these values in the database your site will function normally, but the admin interface will only show an error message:

You do not have sufficient permissions to access this page.

The database values that need to be fixed up are some entries in the meta_key column of the usermeta table and the option_name column of the options table. Let’s assume that your prefix is pre_, that you’ve already removed the prefix from your database and that you now want to fix up these values. The following SQL commands will remove your old prefix from these tables:

UPDATE usermeta SET meta_key = REPLACE(meta_key,'pre_','');
UPDATE options SET option_name = REPLACE(option_name,'pre_','');

References

http://www.tech-evangelist.com/2010/02/06/wordpress-error-sufficient-permissions/
http://wordpress.org/support/topic/changed-table-prefix-got-insufficient-permissions-error

Farewell to summer 2011

It’s been getting progressively colder and I can feel another Syracuse winter bearing down fast. Last Friday I got the feeling that I may not get the chance to do it again so I rode my motorcycle over to Shifty’s for lunch. It’s funny what we crave when faced with the end of the summer and the inevitability of winter. Hot dogs and PBR hit the spot for sure. Farewell summer, you will be missed.

Financial Site Password Policies

One of the many things I’ve had to do as part of transitioning to my new job is move my retirement savings (401k) over to a new provider. In this case I’ve been moving over to the Fidelity site. The security of financial web sites never fails to disappoint.

No, I didn’t try some crazy SQL injection, XSS, or XSRF attack on the site (that would be illegal!). I just read their password policy:

  • Use 6 to 12 letters and/or numbers
  • Do not use one entire piece of personally identifiable information such as your Social Security number, telephone number, or date of birth. Instead, alter or disguise it (e.g., Jane212Smith)
  • Do not use more than 5 instances of a single number or letter, or easily recognized sequences (e.g., 12345 or 11111)
  • Do not use symbols, punctuation marks, or spaces (e.g., #,@, /, *, -.)

::Sigh:: A 12 character max and only letters and numbers? Come on, seriously? According to Microsoft [1] 14 characters is the recommended minimum for a password. But length isn’t the only factor. By excluding all special characters they shrink the usable character set from roughly 95 printable ASCII characters down to 62.
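Here are the rough numbers behind that complaint, assuming every character is chosen uniformly at random (62 alphanumerics versus roughly 95 printable ASCII characters):

```python
import math

def bits(alphabet_size, length):
    """Entropy in bits of a random password of the given length."""
    return length * math.log2(alphabet_size)

# Fidelity's best case: 12 characters from letters + digits (26+26+10 = 62)
print(round(bits(62, 12), 1))  # 71.5 bits
# Microsoft's recommended minimum: 14 characters from ~95 printable ASCII
print(round(bits(95, 14), 1))  # 92.0 bits
# for comparison: a case-insensitive 12-char alphanumeric password (36 chars)
print(round(bits(36, 12), 1))  # 62.0 bits
```

That’s a gap of more than 20 bits between their ceiling and the recommended floor, i.e. the recommended password is over a million times harder to brute force.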

Let’s think about this technically for a second though. Why would they need to place these restrictions on their customers’ security? Let’s assume the password is just another field in the Fidelity database, because that’s exactly what it is. Typically there are two reasons to limit field length and content in a database:

  • efficiency: variable string length fields are very expensive
  • special characters can be a bit dangerous in that a scripting language or the SQL engine fronting the database might interpret the characters as commands

If however, Fidelity follows best practices for storing passwords in their database neither of these concerns apply because YOU SHOULD NEVER STORE THE PASSWORD DIRECTLY IN THE DATABASE! A hash of the password is what should be stored and the process of hashing addresses both of these concerns because it:

  • normalizes the password length
  • sanitizes any special characters

This last point can also be mitigated by using the proper and safe SQL commands available in any modern database engine.

So what these restrictions make me think is that the Fidelity site may actually be storing my password in plain text. That, or they’re just placing restrictions on my password strength arbitrarily which makes little sense.

Unfortunately this Fidelity site isn’t the worst offender that I’ve seen, and I don’t even analyze website security for a living. My experiences are limited only to the sites I’ve had to use over the years. Some time back around 2005 I had an account with a bank that didn’t even let me change my password for their website if I wanted to. They set my password for their website when I established my PIN for my ATM card! That’s right, my password for their website was a four digit number. I wrote them a letter pointing out the weakness in their password policy and I got back a form letter basically telling me to go away … so I did, and I took my money (totaling a whopping $2500) with me.

These are the people we trust with our life savings … Most of the web forums I have an account on fixed these problems back around 2007. Fixing something like this isn’t rocket science, just best practice.

UPDATE
After dialing into the phone system for Fidelity it became apparent that special characters are prohibited in passwords because their phone system authenticates users against the same database. It’s convenient, but touch-tone key pads just haven’t kept pace with keyboards 🙂 This also means that their passwords are probably case-insensitive too which nearly cuts the character space in half yet again. ::sigh:: balancing backward compatibility against security is an age old problem. You can look to Microsoft for a few of the biggest examples: the bazillion Windows users still surfing on IE6, the upcoming EOL for Windows XP etc. The Fidelity password strength issue isn’t anywhere near the scale of these examples but the principle holds.

[1]: http://www.microsoft.com/security/online-privacy/passwords-create.aspx

sVirt Simulation Demo

In my last post on this topic I gave a quick description of a simulation of the sVirt architecture. Talking about it is only half the work. In this post I’ll show it in action and interpret the output as it relates to the separation goals.

Building and Installing

After you clone the git repo from my last post you’ll have all of the source code and policy for the simulation. Typing make in the root of the tree should build both the policy and the simulation code. Install it with make install.

There are differences between Linux distros in how they handle their SELinux setup. This was developed and tested on Debian Squeeze.

Interface

The simulation is a bit sparse so the interface is just a textual one. Run it on the console as root:

svirt-sim

The simulation is run as root because it mimics a virtualization management stack like libvirt or xend, creating and managing mock VHDs in /var/lib/svirt-sim/images.

The initial output you’ll see is just a prompt of two hashes ## and a message indicating you can get a list of commands by pressing ?. With these commands you can interactively perform actions typical for a virtualization stack: create, delete, start, stop and show information about virtual machines:

press ? for a list of commands
##:?
c : create a VM
d : delete a VM
h : start a VM
k : stop a VM
s : show objects and labels
q : quit

Creating VMs

First things first, we need to create two VMs. When we create a VM the svirt-sim process creates a VHD for it and allocates an MCS category for the mock QEMU instance (that we call not-qemu) that will be started when the VM is started. The first one we’ll create will do what we expect it to do: access its VHD file, simulating disk access. We’ll call this VM ‘good-one’ since it’s well behaved. The second VM we’ll create will be the ‘bad-one’ and it will attempt to violate confinement and access VHDs belonging to the other VM. Here’s the output we’ll see:

##:c
what's the name of the VM?: good-one
what's the vm's disposition [good|bad]?: good
creating: /var/lib/svirt-sim/images/good-one.vhd
##:c
what's the name of the VM?: bad-one
what's the vm's disposition [good|bad]?: bad
creating: /var/lib/svirt-sim/images/bad-one.vhd

Now that we’ve created our VMs we can examine the internal state of the management stack by pressing ‘s’:

##:s
======================
VMs:
VM Name:        bad-one
  VHD Name:     /var/lib/svirt-sim/images/bad-one.vhd
  Disposition:  bad
  VM qemu PID   0
  VM Status:    Stopped
  mcs category:     s0:c850
  prev vhd context: (null)
VM Name:        good-one
  VHD Name:     /var/lib/svirt-sim/images/good-one.vhd
  Disposition:  good
  VM qemu PID   0
  VM Status:    Stopped
  mcs category:     s0:c440
  prev vhd context: (null)
======================
=============================
dumping security context list
  mcs: s0:c850
  mcs: s0:c440
=============================

We can also see the mock VHDs created by the management stack and examine their security context:

-rw-r-----. 1 root root unconfined_u:object_r:svirt_sim_image_t:s0:c0 0 Sep  3 23:27 bad-one.vhd
-rw-r-----. 1 root root unconfined_u:object_r:svirt_sim_image_t:s0:c0 0 Sep  3 23:27 good-one.vhd

We use the first category (c0) as the context for VMs that aren’t running. No VM should ever be run with this category so the VHD of a running VM should never be accessible to a not-qemu instance.

Running VMs

The VMs created above can be run by pressing ‘h’. Once they’re running press ‘s’ again to see the change in the internal state of the management stack. We’ll now see the pid of each mock QEMU process that was started.

##:h
what's the name of the VM?: good-one
starting vm: good-one
security_start for vm with mcs: s0:c440
started child with pid 1298
##:h
what's the name of the VM?: bad-one
starting vm: bad-one
security_start for vm with mcs: s0:c850
started child with pid 1300
##:s
======================
VMs:
VM Name:        bad-one
  VHD Name:     /var/lib/svirt-sim/images/bad-one.vhd
  Disposition:  bad
  VM qemu PID   1300
  VM Status:    Running
  mcs category:     s0:c850
  prev vhd context: unconfined_u:object_r:svirt_sim_image_t:s0:c0
VM Name:        good-one
  VHD Name:     /var/lib/svirt-sim/images/good-one.vhd
  Disposition:  good
  VM qemu PID   1298
  VM Status:    Running
  mcs category:     s0:c440
  prev vhd context: unconfined_u:object_r:svirt_sim_image_t:s0:c0
======================
=============================
dumping security context list
  mcs: s0:c850
  mcs: s0:c440
=============================

Now that the simulated “VMs” are running we can view their contexts and those of their VHDs to be sure they match up to the internal state of the svirt-sim stack:

# ls -l /var/lib/svirt-sim/images
-rw-r-----. 1 root root system_u:object_r:mcs_server_image_t:s0:c850 0 Sep  3 23:27 bad-one.vhd
-rw-r-----. 1 root root system_u:object_r:mcs_server_image_t:s0:c440 1 Sep  3 23:28 good-one.vhd
ps auxZ | grep not-qemu
system_u:system_r:not_qemu_t:s0:c440 root 1298  0.0  0.1   1664   544 ?        Ss   23:28   0:00 /sbin/not-qemu --good --file=/var/lib/svirt-sim/images/good-one.vhd
system_u:system_r:not_qemu_t:s0:c850 root 1300  0.0  0.1   1696   544 ?        Ss   23:28   0:00 /sbin/not-qemu --bad --file=/var/lib/svirt-sim/images/bad-one.vhd

So far the output looks as we’d expect: the category assigned to each running not-qemu process matches the category of its VHD file.

Testing Confinement

Built in to this simulation is a simple confinement test. The not-qemu process has an option I call its disposition. If started with the option --good it will only attempt to access the VHD file it’s been assigned. If started with the --bad option it will cycle through the directory containing all of the VHDs and attempt to access each of them. This is meant to simulate a malicious QEMU process reading the disk of another VM. Above we started two VMs, one good and one bad. We can see the evidence of their behavior in the system logs.

The syslog on my system shows the following trace of accesses made by each of the not-qemu processes. I’ve removed the timestamps but the messages should be read as sequential.

/sbin/not-qemu[1298]: opening VHD file: /var/lib/svirt-sim/images/good-one.vhd
/sbin/not-qemu[1298]: not-qemu with pid: 1298 still alive
/sbin/not-qemu[1300]: new file path: /var/lib/svirt-sim/images/.
/sbin/not-qemu[1300]: opening VHD file: /var/lib/svirt-sim/images/.
/sbin/not-qemu[1300]: open failed: Is a directory
/sbin/not-qemu[1298]: opening VHD file: /var/lib/svirt-sim/images/good-one.vhd
/sbin/not-qemu[1298]: not-qemu with pid: 1298 still alive
/sbin/not-qemu[1300]: new file path: /var/lib/svirt-sim/images/..
/sbin/not-qemu[1300]: opening VHD file: /var/lib/svirt-sim/images/..
/sbin/not-qemu[1300]: open failed: Is a directory
/sbin/not-qemu[1298]: opening VHD file: /var/lib/svirt-sim/images/good-one.vhd
/sbin/not-qemu[1298]: not-qemu with pid: 1298 still alive
/sbin/not-qemu[1300]: new file path: /var/lib/svirt-sim/images/bad-one.vhd
/sbin/not-qemu[1300]: opening VHD file: /var/lib/svirt-sim/images/bad-one.vhd
/sbin/not-qemu[1300]: not-qemu with pid: 1300 still alive
/sbin/not-qemu[1298]: opening VHD file: /var/lib/svirt-sim/images/good-one.vhd
/sbin/not-qemu[1298]: not-qemu with pid: 1298 still alive
/sbin/not-qemu[1300]: new file path: /var/lib/svirt-sim/images/good-one.vhd
/sbin/not-qemu[1300]: opening VHD file: /var/lib/svirt-sim/images/good-one.vhd
/sbin/not-qemu[1300]: open failed: Permission denied
kernel: [ 2678.904913] type=1400 audit(1315107146.430:22): avc:  denied  { read write } for  pid=1300 comm="not-qemu" name="good-one.vhd" dev=sda3 ino=1444187 scontext=system_u:system_r:not_qemu_t:s0:c850 tcontext=system_u:object_r:mcs_server_image_t:s0:c440 tclass=file
/sbin/not-qemu[1298]: opening VHD file: /var/lib/svirt-sim/images/good-one.vhd
/sbin/not-qemu[1298]: not-qemu with pid: 1298 still alive
/sbin/not-qemu[1300]: new file path: /var/lib/svirt-sim/images/.
/sbin/not-qemu[1300]: opening VHD file: /var/lib/svirt-sim/images/.
/sbin/not-qemu[1300]: open failed: Is a directory
/sbin/not-qemu[1298]: not-qemu with pid 1298 stopping on user request
/sbin/not-qemu[1298]: not-qemu with pid 1298 shutting down
/sbin/not-qemu[1300]: not-qemu with pid 1300 stopping on user request
/sbin/not-qemu[1300]: not-qemu with pid 1300 shutting down

There’s a bit of noise here: ‘bad’ not-qemu instances loop through the images directory contents without differentiating between files and directories, so they attempt to open and read the special directory entries . and .., which I’ll clean up in the future.

The important bit is that the ‘bad’ not-qemu instance with PID 1300 attempts to access a VHD that doesn’t belong to it. The operation fails and the kernel logs an AVC denial.

Conclusion

Whenever I’m writing SELinux policy I’m always looking for ways to test it. Policy analysis that proves a policy fulfills a specific security goal is the holy grail of security, and this simulation isn’t policy analysis. Instead it’s just a simple simulation giving an example of what we’ve discussed using a logical representation of the mlsconstrain statements in the SELinux MCS policy. Specifically, the category component of a process’s context must dominate that of the object it’s accessing. We represented this in logic previously; now this simulation shows that the security server in the Linux kernel really does enforce this constraint.
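For reference, the shape of the constraint being enforced looks roughly like the fragment below. This is a paraphrase to illustrate the dominance check, not a verbatim copy of the reference policy’s MCS file (the real statements cover more permissions and carve out exemptions for privileged types):

# Paraphrased MCS-style constraint: a process may read or write a file
# only if the high level of its context (h1) dominates the high level
# of the file's context (h2), i.e. its category set is a superset.
mlsconstrain file { read write }
    (h1 dom h2);

With s0:c850 and s0:c440 neither category set dominates the other, which is exactly why the cross-VM open above failed.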

Another interesting use of this simulation would be to pair the real QEMU policy from the reference policy with a specially crafted binary to push the limits of the policy. This binary could exercise any number of known ways information can be leaked across QEMU instances. If anyone out there tries this, I’d love to hear about it.