Graham cluster

Graham (GP3) is a heterogeneous HPC cluster located at the University of Waterloo and operated by SHARCNET. It is one of several new Compute Canada national systems.

Nodes

There are four kinds of CPU nodes:

Type        Count  Memory  Cores  Total cores  Memory/core
base          884  128 GB     32       28,288         4 GB
cloud          56  256 GB     32        1,792         8 GB
bigmem512      24  512 GB     32          768        16 GB
bigmem3000      3    3 TB     64          192        48 GB
Total         967                      31,040

The memory amounts are nominal; the actual usable memory is a bit less. The first two types are in the base partitions, and the last two are in the large partitions.

There are also 160 GPU nodes (32 cores, 128 GB RAM) with two P100 cards each.

Almost all of the processors in the cluster (including on the GPU nodes) are identical: Xeon E5-2683 v4 running at 2.1 GHz. The only exceptions are the bigmem3000 nodes: since the E5-2683 v4 supports at most 1.54 TB of RAM, these run the more expensive Xeon E7-4850 v4 processors instead, also at 2.1 GHz.

Partitions

There are three special CPU partitions: cpubase_interac, cpularge_interac, and cpubackfill. The first two are for interactive jobs (maximum time: 3 hours). The last is automatically added to the partition list of jobs not exceeding 24 hours; it is used to backfill lower-priority short jobs into otherwise wasted slots. The cpubackfill partition includes some of every kind of CPU node, so even jobs requiring huge amounts of memory can get backfilled.
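
The limits and sizes of these special partitions can be checked directly with sinfo (%P, %l, and %D are the standard partition, time-limit, and node-count format fields):

sinfo -p cpubase_interac,cpularge_interac,cpubackfill -o '%P %l %D'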

The other 24 CPU partitions are of the form cpuX_Y_bZ, where X is either base or large, Y is either bycore or bynode, and Z is a digit between 1 and 6; for example, cpularge_bycore_b4. The difference between bycore and bynode is whether the entire node is allocated to the job; this happens when the job is submitted as exclusive, or otherwise depending on the core and memory requirements (gory details in /etc/slurm/job_submit.lua).
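
Brace expansion generates the full list of 24 names:

echo cpu{base,large}_by{core,node}_b{1..6} | tr ' ' '\n'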

The base memory bucket is used by default (30,080 cores). If the memory requirements call for it, the large bucket is used instead (960 cores). This happens when a job requires more than 8,040 MB per core, more than 256,500 MB per node, or for other memory-related reasons.
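
As a minimal sketch of just those two thresholds (a hypothetical helper; the authoritative and more involved logic lives in /etc/slurm/job_submit.lua):

# Sketch of the memory bucket choice based only on the two thresholds
# above; the real logic is in job_submit.lua.
mem_bucket() {
  local per_core_mb=$1 cores_per_node=$2
  if (( per_core_mb > 8040 || per_core_mb * cores_per_node > 256500 )); then
    echo large
  else
    echo base
  fi
}
mem_bucket 2000 1    # base
mem_bucket 9000 32   # large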

The time buckets are much more straightforward:

Bucket  Time (hours)  Time (days)
b1                 3
b2                12
b3                24            1
b4                72            3
b5               168            7
b6               672           28

Every job goes into the smallest time bucket that is large enough to contain it. No partition allows jobs longer than 4 weeks.
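
Assuming only the limits from the table above, the mapping can be sketched as a shell function:

# Map a requested walltime (in hours) to its time bucket.
# Upper bounds taken from the table above.
time_bucket() {
  local h=$1
  if   (( h <= 3 ));   then echo b1
  elif (( h <= 12 ));  then echo b2
  elif (( h <= 24 ));  then echo b3
  elif (( h <= 72 ));  then echo b4
  elif (( h <= 168 )); then echo b5
  elif (( h <= 672 )); then echo b6
  else echo 'longer than any partition allows' >&2; return 1
  fi
}
time_bucket 5    # b2
time_bucket 72   # b4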

The total number of cores in each partition is as follows:

Bucket  cpubase_bycore  cpubase_bynode  cpularge_bycore  cpularge_bynode
b1              13,792          27,872              416              960
b2              13,152          27,232              416              960
b3              12,512          26,272              384              864
b4               8,832          20,352              192              736
b5               5,472           9,568              192              224
b6               1,696           3,072               96              128

Here's a shell one-liner to see the number of idle and total cores in each partition (sinfo's %C field reports allocated/idle/other/total counts):

for x in base large; do for y in core node; do for z in {1..6}; do partition="cpu${x}_by${y}_b${z}"; echo -n "$partition "; sinfo --noheader -o '%C' -p "$partition" | tr / ' ' | awk '{print $2, $4}'; done; done; done | column -t

It's important not to run this too frequently to avoid further stressing the scheduler, which is (seemingly permanently) under heavy load.

Examples

Some concrete examples of where jobs end up based on their parameters:

sbatch arguments                     Partition
-c 1 --mem-per-cpu 2000 -t 5:00:00   cpubase_bycore_b2,cpubackfill
-c 32 --mem 2000 -t 28-0             cpubase_bynode_b6
-c 1 --mem 500G -t 3-0               cpularge_bycore_b4
-c 32 --mem-per-cpu 9000 -t 3:00:00  cpularge_bynode_b1,cpubackfill
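
To double-check where a job actually landed, something like the following works; job.sh is a placeholder script, and --parsable makes sbatch print just the job ID:

jobid=$(sbatch --parsable -c 1 --mem-per-cpu 2000 -t 5:00:00 job.sh)
squeue --job "$jobid" --noheader -o '%P'   # expect cpubase_bycore_b2,cpubackfill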