Graham cluster
2018-07-02 (updated 2020-04-22)
Graham (GP3) is a heterogeneous HPC cluster located at the University of Waterloo and operated by SHARCNET. It is one of several new Compute Canada national systems.
Nodes
There are five kinds of CPU nodes:
Type | Count | Memory | Cores | Total cores | Memory/core |
---|---|---|---|---|---|
`base` | 903 | 128 GB | 32 | 28,896 | 4 GB |
`base` | 72 | 192 GB | 44 | 3,168 | 4.4 GB |
`cloud` | 56 | 256 GB | 32 | 1,792 | 8 GB |
`bigmem512` | 24 | 512 GB | 32 | 768 | 16 GB |
`bigmem3000` | 3 | 3 TB | 64 | 192 | 48 GB |
Total | 1,058 | | | 34,816 | |
The first three types are in the `base` partitions, and the last two are in the `large` partitions.
The memory amounts are nominal, with the actual usable space potentially being a bit less.
There are also 203 GPU nodes with all sorts of different specs, having P100, V100, or T4 cards.
All of the processors in the cluster are Xeons and most (except in a handful of GPU nodes) are clocked at 2.1 GHz, but they have different architectures.
Partitions
There are three special CPU partitions: `cpubase_interac`, `cpularge_interac`, and `cpubackfill`.
The first two are for interactive jobs (maximum time: 3 hours).
The last one is automatically added to the list of partitions for jobs not exceeding 24 hours and is used for backfilling lower-priority short jobs into otherwise wasted slots.
The `cpubackfill` partition includes some of every kind of CPU node, so even jobs requiring huge amounts of memory can get backfilled.
The other 24 CPU partitions are of the form `cpuX_Y_bZ`, where `X` is either `base` or `large`, `Y` is either `bycore` or `bynode`, and `Z` is a digit between 1 and 6; for example, `cpularge_bycore_b4`.
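The naming scheme can be spelled out by enumerating it. This is only a sketch of the pattern described above; `list_partitions` is a name made up for this example, not a command that exists on the cluster.

```shell
# Print every partition name of the form cpuX_Y_bZ.
# list_partitions is a hypothetical helper, not part of Slurm.
list_partitions() {
    for x in base large; do             # memory bucket
        for y in bycore bynode; do      # allocation mode
            for z in 1 2 3 4 5 6; do    # time bucket
                echo "cpu${x}_${y}_b${z}"
            done
        done
    done
}

list_partitions
```

Two memory buckets times two allocation modes times six time buckets gives the 24 names.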
The difference between `bycore` and `bynode` is whether the entire node is allocated to the job; this happens when the job is submitted as exclusive, or otherwise depending on the core and memory requirements (gory details in `/etc/slurm/job_submit.lua`).
The `base` memory bucket is used by default (33,856 cores).
If the memory requirements call for it, the `large` bucket is used instead (960 cores).
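Both core counts follow from the node counts listed under Nodes. As a quick check, using only those numbers (the variable names are mine):

```shell
# Cores per memory bucket, computed as node count * cores per node.
base_cores=$(( 903 * 32 + 72 * 44 + 56 * 32 ))   # both base types plus cloud
large_cores=$(( 24 * 32 + 3 * 64 ))              # bigmem512 plus bigmem3000

echo "base: $base_cores, large: $large_cores, total: $(( base_cores + large_cores ))"
# prints: base: 33856, large: 960, total: 34816
```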
The time buckets are much more straightforward:
Bucket | Time (hours) | Time (days) |
---|---|---|
`b1` | 3 | |
`b2` | 12 | |
`b3` | 24 | 1 |
`b4` | 72 | 3 |
`b5` | 168 | 7 |
`b6` | 672 | 28 |
Every job goes into the smallest time bucket that is large enough to contain it. No partition allows jobs longer than 4 weeks.
The total number of cores in each partition is as follows:
Bucket | `cpubase_bycore` | `cpubase_bynode` | `cpularge_bycore` | `cpularge_bynode` |
---|---|---|---|---|
`b1` | 19,572 | 33,652 | 640 | 928 |
`b2` | 19,572 | 33,652 | 640 | 928 |
`b3` | 18,932 | 32,692 | 576 | 864 |
`b4` | 11,840 | 23,580 | 192 | 736 |
`b5` | 5,552 | 11,916 | 192 | 224 |
`b6` | 1,916 | 3,512 | 96 | 128 |
Here's a shell one-liner to see the number of free cores in each partition:
    for x in base large; do for y in core node; do for z in {1..6}; do partition="cpu${x}_by${y}_b${z}"; echo -n "$partition "; sinfo --noheader -o '%C' -p "$partition" | tr / ' ' | awk '{print $2, $4}'; done; done; done | column -t
Don't run this too frequently: it calls `sinfo` once per partition, and the scheduler is (seemingly permanently) under heavy load already.
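One way to keep repeated checks from hitting the scheduler is to cache the output for a few minutes. The wrapper below is a minimal sketch of that idea, not something provided on Graham; the name `cached`, the cache file path, and the `free_cores` function in the usage comment are all assumptions made for illustration.

```shell
# cached MINUTES FILE COMMAND [ARGS...]
# Print FILE if it was modified less than MINUTES minutes ago; otherwise run
# COMMAND, store its output in FILE, and print that.
# (Hypothetical helper; not part of Slurm.)
cached() {
    minutes=$1
    file=$2
    shift 2
    # find prints the file name only if the file exists and is recent enough.
    if [ -z "$(find "$file" -mmin "-$minutes" 2>/dev/null)" ]; then
        # Write to a temporary file first so that a failed command
        # does not clobber the previous result.
        "$@" > "$file.tmp" && mv "$file.tmp" "$file"
    fi
    cat "$file"
}

# Usage, assuming the one-liner has been wrapped in a function called free_cores:
#   cached 10 "$HOME/.cache/free_cores.txt" free_cores
```

With this in place, running the check several times within ten minutes queries `sinfo` only once.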
Examples
Some concrete examples of where jobs end up based on their parameters:
sbatch arguments | Partition |
---|---|
`-c 1 --mem-per-cpu 2000 -t 5:00:00` | `cpubase_bycore_b2,cpubackfill` |
`-c 32 --mem 2000 -t 28-0` | `cpubase_bynode_b6` |
`-c 1 --mem 500G -t 3-0` | `cpularge_bycore_b4` |
`-c 32 --mem-per-cpu 9000 -t 3:00:00` | `cpularge_bynode_b1,cpubackfill` |
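The time-related part of these results can be reproduced with a short function. This is a sketch under some loud assumptions: `partitions_for` is a made-up name, the memory bucket and allocation mode are passed in by hand (the real decision is made in `/etc/slurm/job_submit.lua`, which is not reproduced here), and the wall time is given in whole hours, rounded up.

```shell
# partitions_for MEMORY ALLOCATION HOURS
#   MEMORY:     base or large (chosen by hand here)
#   ALLOCATION: bycore or bynode (chosen by hand here)
#   HOURS:      requested wall time in whole hours, rounded up
# Hypothetical helper; it only models the time bucket and the backfill rule.
partitions_for() {
    bucket=1
    # Find the smallest time bucket that is large enough for the job.
    for limit in 3 12 24 72 168 672; do
        [ "$3" -le "$limit" ] && break
        bucket=$((bucket + 1))
    done
    if [ "$bucket" -gt 6 ]; then
        echo "no partition allows jobs longer than 672 hours" >&2
        return 1
    fi
    name="cpu${1}_${2}_b${bucket}"
    # Jobs not exceeding 24 hours also get the backfill partition.
    if [ "$3" -le 24 ]; then
        name="$name,cpubackfill"
    fi
    echo "$name"
}

partitions_for base  bycore 5     # -c 1 --mem-per-cpu 2000 -t 5:00:00
partitions_for base  bynode 672   # -c 32 --mem 2000 -t 28-0
partitions_for large bycore 72    # -c 1 --mem 500G -t 3-0
partitions_for large bynode 3     # -c 32 --mem-per-cpu 9000 -t 3:00:00
```

The four calls correspond to the four rows of the table, in order.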