Fixing the cpus_allowed mask of tasks to a particular nodes would of course be a solution but experiments show that for loads >100% this leads to poorer performance (due to bad load balancing).
In this approach each task gets a homenode on which its memory will be
allocated and the scheduler will try to keep the task on its homenode
or reattract it to that node if it was migrated away from it.
The kernel must provide some mechanism to control the node on which
memory gets allocated.
The variables pool_ptr[] and pool_cpus[] are similar to the compressed row format for matrices. A loop over the cpus of one pool can be written as:
- numpools : number of pools
- pool_nr_cpus[] : number of cpus per pool
- pool_cpus[] : logical cpu numbers sorted by their pool id
- pool_ptr[] : pointer into poo_cpus[], points to first cpu of a certain pool
- pool_mask[] : cpu mask for processors in a pool
- CPU_TO_NODE() / lnode_number[] : pool id of each logical cpu.
for (i = pool_ptr[pool]; i < pool_ptr[pool+1]; i++) {The CPU_TO_NODE() macro is using the array lnode_number[] which is initialized in bld_pools() (kernel/sched.c).
cpu = pool_cpus[i];
... // do work on the cpu
}
Because we have no mechanism to move the memory of one task from one node to another, it doesn't make much sense to change the homenode of a task during its lifetime. It is basically chosen shortly after the new task is created, the next section describes the details.
- the tasks memory gets allocated (if a coupling to a NUMA memory affinity patch exists),
- the task should preferably run, i.e. if it gets scheduled away because of high load on its homenode, the task will be attracted back by the scheduler.
| node_policy | balancing in | comment |
| 0 (default) | do_execve() | Tasks that fork() but don't exec() end up on the same node! |
| 1 | do_fork() | New homenode is chosen only if the child has its own new mm structure! |
| 2 | do_fork() | Allways choose new homenode. |
if (CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
weight = (jiffies - tmp->sleep_timestamp)/cache_decay_ticks;
#ifdef CONFIG_NODE_AFFINE_SCHED
if (weight >= MAX_CACHE_WEIGHT) weight=MAX_CACHE_WEIGHT-1;
weight += POOL_WEIGHT(this_pool,tmp->node);
if (tmp->node == CPU_TO_NODE(busiest->curr->cpu))
weight -= MAX_CACHE_WEIGHT;
#endif
if (weight > maxweight) {
maxweight = weight;
next = tmp;
}
}
| node distance | avg. load | low load | comment |
| 10 | 132ms | 2ms | stealing from same node: handled without delays |
| 15 | 199ms | 3ms | stealing from same supernode |
| 20 | 265ms | 4ms | stealing from other supernode |
| 1 |
O1_ia64-ef7-2.4.18.patch.bz2 |
IA64 port of O(1) scheduler, applies
cleanly to kernels with ia64-020622 patch |
| 2 |
O1_i386-ef7-2.4.18.patch.bz2 |
i386 version of the O(1) scheduler, applies to vanilla 2.4.18 kernel |
| 3 |
Nod20_2.4.18-ia64-O1ef7.patch |
Node affine scheduler, apply over a kernel with O(1)-ef7 scheduler |
| 4 |
proc_sched_hist_2.4.18-ia64-O1ef7.patch |
Patch for runqueue load and load balancer history information in /proc/sched. |
| 5 |
NodeAff_Discontig_coupling-2.patch |
Ensures that memory gets allocated on the homenode. DISCONTIGMEM patch required! (This patch applies over the ia64 version of discontigmem!) |
for(i=0; i<n; i++)
a[ix[i]] = a[ix[i]] + b[i];
with random ix[]. This is quite typical for simulations using unstructured grids.
There are two similar perl scripts included which can be used for submitting several such processes in parallel and formatting the output. They are hardwired for 4 nodes (sorry) and need a file describing the cpu to node assignement. It should contain the node numbers of the logical CPUs separated by one blank (sorry for the inconvenience, I was using a patch for accessing this info in /proc...).
Usage: try calling:
time ./disp 4 ./affinity 1000000
This will start 4 copies of ./affinity and collect some statistics
about percentage of time spent on each node, maximum percentage
spent on a node, the node number on which it spent most of the time,
the initial
node, the user time. Interesting is also the elapsed
time returned by the "time" command. The output looks like this:
[focht@azusa ~/affinity]> ./dispn 4 ./affinity 1000000
Executing 4 times: ./affinity 1000000
-----------------------------------------------------------------------
% Time on Node
Scheduled node
Job
0 1
2 3
Max (m , i) Time (s)
-----------------------------------------------------------------------
1 100.0
0.0 0.0 0.0
| 100.0 (0 = 0) | 29.5
2
0.0 0.0 0.0
100.0 | 100.0 (3 = 3) | 26.5
3 100.0
0.0 0.0 0.0
| 100.0 (0 = 0) | 29.5
4
0.0 100.0 0.0
0.0 | 100.0 (1 = 1) | 26.6
-----------------------------------------------------------------------
Average user time = 28.00 seconds
When a task runs alone on a node it needs 26.5s, if two tasks are
running on the same node (but on different CPUs), each needs 29.5s. When
running alone on the wrong node (i.e. the memory is allocated on another
node) the test needs 38s, roughly 40% more.
Normally one should find:
Below are some results for a load of 150%.
The behavior of the old Linux scheduler is illustrated in the following output of a typical run. The machine runs 24 affinity processes on 16 CPUs. One can see a nearly equal distribution of the CPU times over the nodes. The asteriscs show proceses which ran most of the time on another node than the one on which they started.
[focht@azusa ~/affinity_test]> ./dispn 24 ./affinity 1000000
Executing 24 times: ./affinity 1000000
-----------------------------------------------------------------------
% Time on Node Scheduled node
Job 0 1 2 3 Max (m , i) Time (s)
-----------------------------------------------------------------------
1 19.9 27.0 22.0 31.0 | 31.0 (3 = 3) | 40.13
2 23.4 27.1 24.8 24.7 | 27.1 (1 = 1) | 40.65
3 32.4 22.3 23.1 22.2 | 32.4 (0 = 2) * | 41.08
4 28.3 19.1 26.9 25.6 | 28.3 (0 = 2) * | 40.69
5 22.9 20.0 29.1 28.0 | 29.1 (2 = 3) * | 40.49
6 24.6 22.4 24.8 28.1 | 28.1 (3 = 2) * | 40.76
7 29.3 23.7 17.4 29.6 | 29.6 (3 = 0) * | 40.42
8 28.1 30.9 22.2 18.8 | 30.9 (1 = 1) | 40.15
9 26.2 21.4 23.3 29.1 | 29.1 (3 = 3) | 40.51
10 22.3 29.6 24.4 23.7 | 29.6 (1 = 2) * | 40.72
11 26.2 22.8 27.8 23.2 | 27.8 (2 = 0) * | 41.07
12 24.1 25.5 25.9 24.4 | 25.9 (2 = 0) * | 41.30
13 19.7 29.6 22.0 28.7 | 29.6 (1 = 2) * | 41.74
14 21.2 26.6 23.6 28.6 | 28.6 (3 = 0) * | 41.61
15 19.4 27.1 26.8 26.6 | 27.1 (1 = 1) | 40.72
16 21.1 20.5 26.4 32.1 | 32.1 (3 = 2) * | 41.52
17 22.5 29.1 26.9 21.5 | 29.1 (1 = 2) * | 41.00
18 20.9 27.0 23.1 29.0 | 29.0 (3 = 1) * | 40.70
19 21.6 29.8 25.6 22.9 | 29.8 (1 = 3) * | 41.68
20 27.9 24.2 26.2 21.6 | 27.9 (0 = 1) * | 41.28
21 27.9 23.1 25.7 23.3 | 27.9 (0 = 2) * | 41.43
22 26.9 24.3 27.8 21.0 | 27.8 (2 = 3) * | 41.62
23 27.7 21.6 32.3 18.4 | 32.3 (2 = 3) * | 41.95
24 34.4 24.4 22.5 18.7 | 34.4 (0 = 1) * | 41.43
-----------------------------------------------------------------------
Average user time = 41.03 seconds
With the O(1) scheduler the results are better, but several processes spend most of their CPU-time on another node than the one they started to run (the lines with the asteriscs). Those tasks needed significantly more usertime than the ones which spent their lifetime on the same node.
[focht@azusa ~/affinity_test]> ./dispn 24 ./affinity 1000000
Executing 24 times: ./affinity 1000000
-----------------------------------------------------------------------
% Time on Node Scheduled node
Job 0 1 2 3 Max (m , i) Time (s)
-----------------------------------------------------------------------
1 100.0 0.0 0.0 0.0 | 100.0 (0 = 0) | 32.64
2 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 31.97
3 0.0 57.2 0.0 42.8 | 57.2 (1 = 3) * | 37.28
4 41.7 0.0 0.0 58.3 | 58.3 (3 = 0) * | 38.28
5 41.5 0.0 58.5 0.0 | 58.5 (2 = 0) * | 38.05
6 0.0 0.0 0.0 100.0 | 100.0 (3 = 3) | 32.19
7 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 32.15
8 0.0 0.0 42.3 57.7 | 57.7 (3 = 2) * | 37.59
9 0.0 0.0 100.0 0.0 | 100.0 (2 = 2) | 31.88
10 0.0 56.9 43.1 0.0 | 56.9 (1 = 2) * | 37.01
11 0.0 57.0 43.0 0.0 | 57.0 (1 = 2) * | 37.09
12 0.0 0.0 100.0 0.0 | 100.0 (2 = 2) | 32.03
13 0.0 0.0 0.0 100.0 | 100.0 (3 = 3) | 31.78
14 99.7 0.3 0.0 0.0 | 99.7 (0 = 1) * | 37.81
15 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 32.30
16 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 32.24
17 0.0 0.5 0.0 99.5 | 99.5 (3 = 1) * | 39.97
18 0.0 57.0 0.0 43.0 | 57.0 (1 = 3) * | 37.36
19 0.0 0.0 99.5 0.5 | 99.5 (2 = 3) * | 42.47
20 0.0 0.0 0.0 100.0 | 100.0 (3 = 3) | 31.87
21 100.0 0.0 0.0 0.0 | 100.0 (0 = 0) | 31.90
22 0.0 0.0 100.0 0.0 | 100.0 (2 = 2) | 31.46
23 100.0 0.0 0.0 0.0 | 100.0 (0 = 0) | 32.67
24 100.0 0.0 0.0 0.0 | 100.0 (0 = 0) | 32.62
-----------------------------------------------------------------------
Average user time = 34.69 seconds
[focht@azusa ~/affinity_test]> ./dispn 24 ./affinity 1000000
Executing 24 times: ./affinity 1000000
-----------------------------------------------------------------------
% Time on Node Scheduled node
Job 0 1 2 3 Max (m , i) Time (s)
-----------------------------------------------------------------------
1 100.0 0.0 0.0 0.0 | 100.0 (0 = 0) | 33.3
2 0.0 0.0 0.0 100.0 | 100.0 (3 = 3) | 32.5
3 0.0 0.0 100.0 0.0 | 100.0 (2 = 2) | 31.7
4 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 32.1
5 0.0 0.0 100.0 0.0 | 100.0 (2 = 2) | 31.1
6 0.0 0.0 0.0 100.0 | 100.0 (3 = 3) | 32.3
7 39.3 0.0 60.7 0.0 | 60.7 (2 = 0) * | 39.8
8 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 32.3
9 0.0 0.0 0.0 100.0 | 100.0 (3 = 3) | 32.2
10 38.5 0.0 61.5 0.0 | 61.5 (2 = 0) * | 39.9
11 0.0 0.0 0.0 100.0 | 100.0 (3 = 3) | 31.9
12 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 32.2
13 0.0 0.0 100.0 0.0 | 100.0 (2 = 2) | 31.8
14 0.0 0.0 100.0 0.0 | 100.0 (2 = 2) | 31.3
15 0.0 0.0 0.0 100.0 | 100.0 (3 = 3) | 32.2
16 0.0 0.0 0.0 100.0 | 100.0 (3 = 3) | 31.8
17 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 31.8
18 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 31.8
19 100.0 0.0 0.0 0.0 | 100.0 (0 = 0) | 32.9
20 100.0 0.0 0.0 0.0 | 100.0 (0 = 0) | 32.3
21 100.0 0.0 0.0 0.0 | 100.0 (0 = 0) | 33.0
22 0.0 100.0 0.0 0.0 | 100.0 (1 = 1) | 32.6
23 0.0 0.0 100.0 0.0 | 100.0 (2 = 2) | 31.7
24 100.0 0.0 0.0 0.0 | 100.0 (0 = 0) | 33.2
-----------------------------------------------------------------------
Average user time = 32.83 seconds