Xen NUMA Roadmap
About This Page
This page acts as a collection point for all NUMA related features. The idea is to use this space to summarize the status of each of them and track their progress. This is done in the hope of facilitating, as much as possible, the collaboration between the various community members, and of limiting the risk of duplicated effort.
For more general information about NUMA on Xen, check this page.
Updating this page
This is a Wiki, so, please, go ahead and update/fix (if not a Wiki editor, see this). The maintainer of this page is Dario, so also feel free to contact him about anything you may need. Even better, especially if it is about the actual development of one of the features, start a conversation on the xen-devel mailing list (but in that case, be sure you follow this).
In the list below, each Work item contains the name and e-mail of the person working on it. In that context, WORKING means work has already started, and patches may already have been submitted or will be shortly.
PLANNED means the person is keen on doing the job, but no code has been written yet. If you want to help or take over, consider dropping that person a note.
If there is no name at all, the item is something that has been identified as useful, but is still unclaimed.
A barred work item means it is done (and the name tells who did it).
Automatic VM placement
This is about picking a NUMA node (or a set of NUMA nodes) on which a newly created VM would best execute, in order to maximize its own and the overall system performance.
Check out the Automatic NUMA Placement page.
Basics are there. The old XEND toolstack had a placement logic, which did not initially go into XL. It is now there too, starting from Xen 4.2. That being said, there is still a lot of room for improvement, and for making the placement algorithm more advanced and powerful.
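As a purely illustrative sketch of what such a placement logic has to do (all names and structures here are made up, this is not the actual libxl code): find the smallest set of nodes with enough free memory for the VM, and break ties in favour of the least loaded nodes.

```python
from itertools import combinations

# Hypothetical, simplified greedy NUMA placement heuristic;
# illustrative only, not the actual libxl implementation.
def place_vm(nodes, vm_mem):
    """Pick the smallest set of nodes with enough free memory for the
    VM, breaking ties in favour of the least loaded candidate set."""
    # nodes: list of dicts with 'id', 'free_mem' (MB), 'nr_vcpus'
    for size in range(1, len(nodes) + 1):
        candidates = [
            c for c in combinations(nodes, size)
            if sum(n['free_mem'] for n in c) >= vm_mem
        ]
        if candidates:
            # fewer vCPUs already placed there == less loaded
            return min(candidates,
                       key=lambda c: sum(n['nr_vcpus'] for n in c))
    return None  # the VM does not fit anywhere
```

Once a candidate set is found, the VM's vCPUs would then be given affinity to the pCPUs of those nodes, as the work items below describe.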
- Dario (<firstname.lastname@example.org>):
at VM creation time, choose a node or a set of nodes where the VM fits (memory- and vCPU-wise) and pin the VM's vCPUs to the nodes' pCPUs. Patch series: v1, v2, v3, v4, v5, v6, v7, v8, v9. Relevant changesets: f4b5a21f93ad, 4165d71479f9.
- Dario (<email@example.com>), WORKING: allow the user to control the placement algorithm by specifying some of the parameters it uses, instead of always determining them implicitly. Patch series: v1, v2 needs reposting.
- Dario (<firstname.lastname@example.org>), WORKING: enhance the placement algorithm to take latencies between nodes (node distances) into account. Patch series: v1, v2, but too much computational complexity was being introduced, needs rethinking.
- Dario (<email@example.com>), PLANNED: provide aids to enable easy verification and testing of the placement (stressing it by generating synthetic placement request). Discussion: 1.
- Dario (<firstname.lastname@example.org>), PLANNED: enhance the placement algorithm to take some more sophisticated measure of node load into account.
- (Semi-)Automatic placement for Dom0. Discussion: 1.
NUMA aware scheduling
Instead of having to statically pin the vCPUs to the nodes' pCPUs, just have them prefer running on the nodes where their memory resides. If considered independently from NUMA, this feature can be seen as giving vCPUs a sort of soft affinity (i.e., a set of pCPUs where they prefer to run), in addition to their hard affinity (i.e., pinning).
Check out the NUMA Aware Scheduling page.
For credit1, it's done (see below for patches and changesets). We are now concentrating on making node affinity (soft affinity) per-vCPU, instead of per-domain. That work, still concentrating on credit1, was almost ready to go into Xen 4.4 but, because of some last-minute issues, we decided it could wait for 4.5.
For credit2, some work started, although it's quite complicated, as credit2 lacks pinning (hard affinity) too.
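The core idea of NUMA aware scheduling, a two-step pCPU selection honouring soft affinity first and hard affinity second, can be sketched like this (a simplified illustration with hypothetical names, not the actual credit1 code):

```python
# Illustrative two-step pCPU selection for a vCPU, as used by
# NUMA aware scheduling; hypothetical sketch, not scheduler code.
def pick_pcpu(idle_pcpus, hard_affinity, soft_affinity):
    """Prefer an idle pCPU in the vCPU's soft (node) affinity;
    fall back to its hard affinity (pinning) otherwise."""
    # soft affinity is only honoured where it intersects the hard one
    preferred = idle_pcpus & hard_affinity & soft_affinity
    if preferred:
        return min(preferred)   # deterministic pick, for the sketch
    allowed = idle_pcpus & hard_affinity
    return min(allowed) if allowed else None
```

The point of the sketch is the ordering: soft affinity biases the choice, but never prevents the vCPU from running somewhere its hard affinity allows.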
- Dario (<email@example.com>),
NUMA aware scheduling for credit. Some related discussion (and patches): 1. Patch series: v1, v2, v3, v4, v5, v6. Relevant changesets: 8bf04f2ed8de, 6a8c84c8e25f.
- Dario (<firstname.lastname@example.org>), WORKING: per-vcpu soft affinity in credit. Patch series: v1, v1-resend, v2, v3, v4, v5, v5-resend. v6 is in this git branch, waiting to be rebased and reposted as soon as Xen 4.5 development cycle opens.
- Justin (email@example.com), WORKING: Hard and soft affinity for credit2. Discussion: 1. Patch series: v1, v2. While working on this, a bug in how credit2 handles multiple runqueues was found. Here are the attempts to fix that, as preliminary work: v1, v2, v3
Virtual NUMA (support for NUMA guests)
If a guest ends up on more than one node, make sure it knows it's running on a NUMA platform (smaller than the actual host, but still NUMA). This is very important for some specific kinds of workloads, for instance HPC ones. In fact, if the guest OS (and its applications) has any NUMA support, exporting a virtual topology to the guest is the only way to make that support effective, and perhaps to fill, at least to some extent, the gap introduced by the need to distribute guests over more than one node. Under the name of vNUMA, this is one of the key and most advertised features of VMware vSphere 5 ("vNUMA: what it is and why it matters").
For PV guests, most of the work is done (by Elena, while participating in Round 6 of OPW), although it still needs to be properly upstreamed. Various patch series have been submitted over that period; here are the most relevant ones: first RFC for Xen, for Linux; second RFC for Xen, for Linux; actual v1 for Xen; v2 for Xen, for Linux; v3 for Xen; v4 for Xen.
Having vNUMA in both Dom0 and DomU will enable some potentially relevant optimizations, e.g., with respect to the split driver model Xen supports: making sure to run the backend and the frontend on the same NUMA node, and/or to run the backend on the same node to which the I/O device is attached (see also IONUMA below). Some thoughts about this here.
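For illustration only, the kind of information a vNUMA-enabled toolstack has to build and hand to the guest looks roughly like this (field names are hypothetical, not the actual hypercall or toolstack interface):

```python
# Illustrative only: a virtual NUMA topology description for a guest
# with 2 virtual nodes and 4 vCPUs. Field names are made up.
vnuma_topology = {
    "nr_vnodes": 2,
    "vnode_mem_mb": [2048, 2048],      # memory per virtual node
    "vcpu_to_vnode": [0, 0, 1, 1],     # 4 vCPUs, 2 per virtual node
    "vdistance": [[10, 20],            # SLIT-style virtual distances
                  [20, 10]],
    "vnode_to_pnode": [0, 1],          # backing physical nodes
}

def vnuma_consistent(t):
    """Basic sanity checks on a virtual topology description."""
    n = t["nr_vnodes"]
    return (len(t["vnode_mem_mb"]) == n
            and len(t["vnode_to_pnode"]) == n
            and len(t["vdistance"]) == n
            and all(len(row) == n for row in t["vdistance"])
            and all(0 <= v < n for v in t["vcpu_to_vnode"]))
```

Note how the vnode_to_pnode mapping is exactly what memory migration (below) would invalidate, which is why the interaction between the two features needs care.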
- Elena (<firstname.lastname@example.org>), WORKING: upstream PV vNUMA in both Xen and Linux.
- Matt (<email@example.com>), sent in an RFC.
- automatic placement for resuming/migrating domains: if they have a virtual topology, better not to change it;
- memory migration: it can change the actual topology (should we update it on-line or disable memory migration?)
Dynamic memory migration
Between different nodes of one host, either upon user request or automatically, as a form of load balancing (similar to what happens on the CPU with the NUMA-aware scheduler). Some development of this feature happened during the Xen 4.3 window, but then stalled. It is supposed to restart during the 4.5 development window.
Started, but not yet ready to leave some developer's private patch queue on their dev-box. The need to support both HVM and PV guests complicates things quite a bit. Xenbus, QEMU, and a lot of inherent characteristics of the Xen architecture get in the way of having it done simply within the hypervisor (as happens for NUMA aware scheduling). The current idea being pursued is for it to happen at the low toolstack level (perhaps with the hypervisor exporting statistics that will help toolstacks and users make proper decisions), sort of mimicking a suspend-resume cycle.
- Dario (<firstname.lastname@example.org>), WORKING: enable moving memory from one node to another (on the same host).
- Dario (<email@example.com>), PLANNED: track how many and from whom non node-local memory is being accessed.
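As a toy illustration of how the second work item could feed the first (hypothetical names and structures, not actual code), statistics on remote accesses could drive migration decisions like this:

```python
# Illustrative sketch: flag domains whose fraction of remote
# (non node-local) memory accesses exceeds a threshold, as
# candidates for memory migration. Names are made up.
def migration_candidates(domains, threshold=0.5):
    """Return the ids of domains accessing mostly remote memory."""
    return [d['id'] for d in domains
            if d['remote_accesses'] / d['total_accesses'] > threshold]
```

In the scheme described above, the hypervisor would export the per-domain counters, and the toolstack would apply a policy like this one before triggering the actual (suspend-resume-like) move.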
IONUMA
If not only memory, but also I/O controllers are attached to specific nodes, you'll end up with devices which are better used by VMs running on those nodes (or, vice versa, VMs that are better run on the proper node if/when they want to use a specific device).
Yang Zhang did some previous investigation of this situation, which, by the way, goes under the name IONUMA; the result is the presentation I/O Scalability in Xen, from Xen Summit 2011.
- export IONUMA information to the user, as is currently done for the NUMA topology with xl info -n;
- IONUMA and automatic placement: as said in the description, IONUMA info (once available) should bias the automatic placement decisions;
- Dom0/Driver IONUMA: the backends' DMA buffers for a device should be allocated on (or as close as possible to) the node where its I/O controller is attached;
- guest IONUMA: devices passed through to guests should have their DMA buffers allocated on (or as close as possible to) the node where their I/O controller is attached.
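The node-selection logic the last two items call for can be sketched as follows (purely illustrative, with hypothetical names; real DMA buffer allocation happens in the hypervisor/kernel, not like this):

```python
# Illustrative IONUMA heuristic: allocate a DMA buffer on the
# device's own node if it has room, else on the closest node
# (by NUMA distance) that does. Hypothetical sketch only.
def dma_alloc_node(dev_node, free_mem, distance, needed):
    """Pick the best node for a DMA buffer of 'needed' MB.

    dev_node: node the device's I/O controller is attached to
    free_mem: free memory per node, indexed by node id
    distance: SLIT-style node distance matrix
    """
    candidates = [n for n, free in enumerate(free_mem)
                  if free >= needed]
    if not candidates:
        return None
    # the device's own node has the minimal distance to itself,
    # so it wins automatically whenever it has enough free memory
    return min(candidates, key=lambda n: distance[dev_node][n])
```

The same distance-based fallback applies to both the Dom0/driver and the guest case above; only who performs the allocation differs.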