Node affine NUMA scheduler

Erich Focht <efocht@ess.nec.de>

Background

  This scheduler extension is targeted at cc-NUMA machines with CPUs grouped in multiple nodes, each node having its own memory. The example in the sketch below is a 16 CPU ccNUMA machine, each node holding 4 processors and its own memory. The nodes are connected by an interconnect over which the data for memory accesses to remote nodes is transferred and over which a cache coherency protocol is implemented.
[Figure: NUMA machine, four nodes of 4 CPUs each with local memory, connected by an interconnect]


Accessing the memory of a remote node incurs penalties in memory access latency and bandwidth. It is therefore desirable to keep processes on or near the node on which their memory (or most of it) is allocated. As an example: the small benchmark program used in the testing section takes 30% longer to complete when running on a remote node of an NEC AzusA Itanium server, even though this machine has an excellent remote-to-local memory access latency ratio of 1.6.

Some machines implement an additional large node-level cache which tries to hide the distance to the remote nodes. Programs with a small node-level cache footprint see only the latency of this cache, but take a big performance hit if they get scheduled on another node. Codes running out of cache (generating many cache misses) suffer just as they would on machines without a node cache if their memory is allocated on a distant node.

Fixing the cpus_allowed mask of tasks to a particular node would of course be a solution, but experiments show that for loads >100% this leads to poorer performance (due to bad load balancing).

In this approach each task gets a homenode on which its memory will be allocated; the scheduler tries to keep the task on its homenode and to reattract it to that node if it has been migrated away. The kernel must provide some mechanism to control the node on which memory gets allocated.
 


Implementation

The node affine NUMA scheduler is built on top of Ingo Molnar's O(1) scheduler, which replaced the old Linux scheduler in the 2.5.2 kernel. The newer versions are in the latest 2.5 kernels; I'll try to backport the essential features to 2.4.X kernels, please check the Download section.


NUMA topology information

As sketched above, the CPUs of a NUMA machine are typically grouped in nodes. The node IDs are normally reflected in the (S)APIC ID of each CPU. I'll call these physical node IDs. Each platform should provide a macro SAPICID_TO_PNODE() which extracts the physical node number from the (S)APIC ID of a processor.
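
If, for example, a platform encodes the physical node number in the upper bits of the (S)APIC ID, the macro could be as simple as the following sketch. The bit layout shown here is purely illustrative, each platform defines its own:

#define NODE_ID_SHIFT 8   /* hypothetical: node ID assumed to sit in the upper bits */
#define SAPICID_TO_PNODE(sapicid)  ((sapicid) >> NODE_ID_SHIFT)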

CPUs in a physical node are grouped in a CPU pool. The scheduler deals with pool IDs, or logical node IDs, whose numbering starts at 0. Several variables describe the node topology; the arrays pool_ptr[] and pool_cpus[] are organized like the compressed row storage format for sparse matrices. A loop over the CPUs of one pool can be written as:
for (i = pool_ptr[pool]; i < pool_ptr[pool+1]; i++) {
    cpu = pool_cpus[i];
    ... // do work on the cpu
}
The CPU_TO_NODE() macro uses the array lnode_number[], which is initialized in bld_pools() (kernel/sched.c).
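
Putting the pieces together, the topology data can be pictured roughly as follows. The array sizes and the pool_ncpus() helper are illustrative; only the names numpools, pool_ptr[], pool_cpus[] and lnode_number[] come from the actual code:

/* Sketch of the topology variables (sizes are illustrative) */
int numpools;                /* number of CPU pools (logical nodes)                    */
int pool_ptr[NR_CPUS + 1];   /* pool p owns pool_cpus[pool_ptr[p] .. pool_ptr[p+1]-1]  */
int pool_cpus[NR_CPUS];      /* CPU numbers, grouped pool by pool                      */
int lnode_number[NR_CPUS];   /* logical node (pool) ID of each CPU                     */

#define CPU_TO_NODE(cpu)  (lnode_number[cpu])

/* Illustrative helper: number of CPUs in a pool, read straight
   off the compressed row layout. */
static inline int pool_ncpus(int pool)
{
    return pool_ptr[pool + 1] - pool_ptr[pool];
}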


Homenode

The task structure has been extended by an element int node which specifies the homenode of the task, i.e. the node on which:
  1. the task's memory gets allocated (if a coupling to a NUMA memory affinity patch exists),
  2. the task should preferably run, i.e. if it gets scheduled away because of high load on its homenode, the task will be attracted back by the scheduler.
Because we have no mechanism to move the memory of a task from one node to another, it doesn't make much sense to change the homenode of a task during its lifetime. It is basically chosen shortly after the new task is created; the next section describes the details.
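
Schematically, the extension looks like the sketch below. The placement of the fields inside the real struct task_struct is of course different, and node_policy is only introduced in the "Initial load balancing" section:

struct task_struct {
    /* ... existing fields ... */
    int node;          /* homenode: memory is allocated here and the
                          scheduler tries to keep the task here        */
    int node_policy;   /* controls where the homenode is chosen
                          (see "Initial load balancing" below)         */
    /* ... existing fields ... */
};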

In addition to the number of running tasks nr_running, each runqueue stores in the array nr_homenode[0 ... numpools] the number of enqueued tasks coming from each particular node. In this way the scheduler can quickly look up whether there are tasks from other nodes running on the current CPU.
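
A simplified sketch of this bookkeeping; the runqueue is reduced to the relevant members, the helper functions and MAX_NUMPOOLS are made up for illustration, and the real counters are of course updated under the runqueue lock in the enqueue/dequeue paths:

#define MAX_NUMPOOLS 16                           /* illustrative upper bound on numpools */

struct runqueue {
    unsigned long nr_running;                     /* total runnable tasks                 */
    unsigned long nr_homenode[MAX_NUMPOOLS];      /* runnable tasks, counted per homenode */
    /* ... remaining runqueue members ... */
};

/* Hypothetical helpers keeping the per-node counters in sync */
static inline void rq_account_enqueue(struct runqueue *rq, int homenode)
{
    rq->nr_running++;
    rq->nr_homenode[homenode]++;
}

static inline void rq_account_dequeue(struct runqueue *rq, int homenode)
{
    rq->nr_running--;
    rq->nr_homenode[homenode]--;
}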

Initial load balancing

Currently the homenode of a new task can be selected either in do_execve() or in do_fork(). The choice depends on the number of tasks assigned to each node, not on the actual load of each node, because tasks running on foreign nodes might return to their homenode. An additional task structure element int node_policy decides where the initial load balancing (i.e. the choice of the homenode) is done:

  node_policy   balancing in   comment
  0 (default)   do_execve()    Tasks that fork() but don't exec() end up on the same node!
  1             do_fork()      A new homenode is chosen only if the child has its own new mm structure!
  2             do_fork()      Always choose a new homenode.
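
The selection itself can be pictured as picking the node with the fewest tasks currently assigned to it. The sketch below is only an illustration of that idea, not the actual code: the counter array, the function name and the missing locking are assumptions, and numpools / MAX_NUMPOOLS are reused from the sketches above.

/* Hypothetical per-node counters of tasks whose homenode is that node */
static int node_nr_tasks[MAX_NUMPOOLS];

/* Illustrative initial balancing: pick the least populated node.
   The real code is invoked from do_fork()/do_execve() depending on
   the node_policy of the parent task. */
static int choose_homenode(void)
{
    int node, best = 0;

    for (node = 1; node < numpools; node++)
        if (node_nr_tasks[node] < node_nr_tasks[best])
            best = node;

    node_nr_tasks[best]++;      /* the new task now counts toward this node */
    return best;
}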

A small utility can be used to change the node_policy field of a particular task. One would typically change the node_policy of the shell from which the application is started, run the application, and change it back if needed.

Dynamic load balancing

The load_balance() routine is changed as follows: