Graham cluster

Graham (GP3) is a heterogeneous HPC cluster located at the University of Waterloo and operated by SHARCNET. It is one of several new Compute Canada national systems.

Nodes

There are five kinds of CPU nodes:

Type        Count  Memory  Cores  Total cores  Memory/core
base          903  128 GB     32       28,896  4 GB
               72  192 GB     44        3,168  4.4 GB
cloud          56  256 GB     32        1,792  8 GB
bigmem512      24  512 GB     32          768  16 GB
bigmem3000      3    3 TB     64          192  48 GB
Total       1,058                      34,816

The first three types are in the base partitions, and the last two are in the large partitions. The memory amounts are nominal; the actual usable memory may be a bit less.

There are also 203 GPU nodes in a variety of configurations, equipped with P100, V100, or T4 cards.

All of the processors in the cluster are Xeons, and most (the exceptions are in a handful of GPU nodes) are clocked at 2.1 GHz, but they have different architectures.
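
This heterogeneity is visible through Slurm itself. The following is a minimal sketch using standard sinfo format fields (%N for node lists, %f for node features, %G for generic resources such as GPUs); the exact feature and GRES strings depend on how the site has configured them:

# Show node ranges together with their advertised CPU feature tags and GPU resources
sinfo --noheader -o '%N %f %G' | sort -u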

Partitions

There are three special CPU partitions: cpubase_interac, cpularge_interac, and cpubackfill. The first two are for interactive jobs (maximum time: 3 hours). The last one is automatically added to the list of partitions for jobs not exceeding 24 hours and is used for backfilling lower priority short jobs into otherwise wasted slots. The cpubackfill partition includes some of every kind of CPU node, so even jobs requiring huge amounts of memory can get backfilled.
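
As a sketch of the interactive case, assuming the usual Slurm salloc workflow (the resource numbers here are arbitrary, and the routing is done by the submission filter rather than by naming the partition yourself):

# Request a 2-hour interactive session; a request like this should be routed to cpubase_interac
salloc -c 4 --mem-per-cpu=2000 -t 2:00:00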

The other 24 CPU partitions are of the form cpuX_Y_bZ, where X is either base or large, Y is either bycore or bynode, and Z is a digit between 1 and 6; for example, cpularge_bycore_b4. The difference between bycore and bynode is whether entire nodes are allocated to the job; this happens when the job is submitted as exclusive, or otherwise depending on its core and memory requirements (the gory details are in /etc/slurm/job_submit.lua).

Jobs use the base memory bucket by default (33,856 cores); if the memory requirements call for it, the large bucket is used instead (960 cores). A simplified sketch of the whole selection follows the time-bucket table below.

The time buckets are much more straightforward:

Bucket  Time (hours)  Time (days)
b1                 3
b2                12
b3                24            1
b4                72            3
b5               168            7
b6               672           28

Every job goes into the smallest time bucket that is large enough to contain it. No partition allows jobs longer than 4 weeks.
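
The selection can be approximated in a few lines of shell. This is a simplified sketch, not the real logic from /etc/slurm/job_submit.lua: it ignores the bycore/bynode decision and the exact per-node-type limits, picking a time bucket from the requested hours and a memory class from a per-core request in MB (the 8000 MB cut-off is a rough stand-in for "fits on a base node"):

# Rough guess at the partition a job lands in (not the real job_submit.lua logic).
# Usage: guess_partition HOURS MEM_PER_CORE_MB
guess_partition() {
    local hours=$1 mem_mb=$2 bucket class
    if   [ "$hours" -le 3   ]; then bucket=b1
    elif [ "$hours" -le 12  ]; then bucket=b2
    elif [ "$hours" -le 24  ]; then bucket=b3
    elif [ "$hours" -le 72  ]; then bucket=b4
    elif [ "$hours" -le 168 ]; then bucket=b5
    else                            bucket=b6
    fi
    # Per-core requests above what base nodes offer push the job into the large bucket
    if [ "$mem_mb" -le 8000 ]; then class=base; else class=large; fi
    echo "cpu${class}_bycore_${bucket}"
}
guess_partition 5 2000    # prints cpubase_bycore_b2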

The total number of cores in each partition is as follows:

    cpubase_bycore  cpubase_bynode  cpularge_bycore  cpularge_bynode
b1          19,572          33,652              640              928
b2          19,572          33,652              640              928
b3          18,932          32,692              576              864
b4          11,840          23,580              192              736
b5           5,552          11,916              192              224
b6           1,916           3,512               96              128

Here's a shell one-liner that shows, for each partition, the number of idle cores and the total:

for x in base large; do for y in core node; do for z in {1..6}; do partition="cpu${x}_by${y}_b${z}"; echo -n "$partition "; sinfo --noheader -o '%C' -p "$partition" | tr / ' ' | awk '{print $2, $4}'; done; done; done | column -t

It's important not to run this too frequently to avoid further stressing the scheduler, which is (seemingly permanently) under heavy load.

Examples

Some concrete examples of where jobs end up based on their parameters:

sbatch arguments                     Partition
-c 1 --mem-per-cpu 2000 -t 5:00:00   cpubase_bycore_b2,cpubackfill
-c 32 --mem 2000 -t 28-0             cpubase_bynode_b6
-c 1 --mem 500G -t 3-0               cpularge_bycore_b4
-c 32 --mem-per-cpu 9000 -t 3:00:00  cpularge_bynode_b1,cpubackfill
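
As a concrete sketch, the first row above corresponds to a job script along the following lines (./my_program is a hypothetical placeholder):

#!/bin/bash
#SBATCH -c 1                   # one core
#SBATCH --mem-per-cpu=2000     # 2000 MB per core: stays in the base memory bucket
#SBATCH -t 5:00:00             # 5 hours: b2 time bucket, and under 24 hours, so cpubackfill is added
# No --partition is given; the submission filter picks cpubase_bycore_b2,cpubackfill
./my_program                   # hypothetical executable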

See also