Graham cluster

Graham (GP3) is a heterogeneous HPC cluster located at the University of Waterloo and operated by SHARCNET. It is one of several new Compute Canada national systems.

Nodes

There are four kinds of CPU nodes:

Type        Count  Memory  Cores  Total cores  Memory/core
base          884  128 GB     32       28,288         4 GB
cloud          56  256 GB     32        1,792         8 GB
bigmem512      24  512 GB     32          768        16 GB
bigmem3000      3    3 TB     64          192        48 GB
Total         967                      31,040

The memory amounts are nominal; the actual usable memory is a bit less. The first two types are in the base partitions, and the last two are in the large partitions.

There are also 160 GPU nodes (32 cores, 128 GB RAM) with two P100 cards each.

Almost all of the processors in the cluster (including on the GPU nodes) are identical: Xeon E5-2683 v4 running at 2.1 GHz. The only exceptions are the bigmem3000 nodes: since the E5-2683 v4 supports at most 1.54 TB of RAM, these run the more expensive Xeon E7-4850 v4 processors instead, also at 2.1 GHz.

Partitions

There are three special CPU partitions: cpubase_interac, cpularge_interac, and cpubackfill. The first two are for interactive jobs (maximum time: 3 hours). The last is automatically added to the partition list of jobs not exceeding 24 hours; it is used to backfill lower-priority short jobs into otherwise wasted slots. The cpubackfill partition includes some of every kind of CPU node, so even jobs requiring huge amounts of memory can get backfilled.
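
The limits and sizes of these special partitions can be checked directly with sinfo (%P, %l, and %D are the standard partition, time-limit, and node-count format fields):

sinfo -p cpubase_interac,cpularge_interac,cpubackfill -o '%P %l %D'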

The other 24 CPU partitions are of the form cpuX_Y_bZ, where X is either base or large, Y is either bycore or bynode, and Z is a digit between 1 and 6; for example, cpularge_bycore_b4. The difference between bycore and bynode is whether the entire node is allocated to the job; this happens when the job is submitted as exclusive, or otherwise depending on the core and memory requirements (gory details in /etc/slurm/job_submit.lua).
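
Brace expansion generates the full list of 24 names:

echo cpu{base,large}_by{core,node}_b{1..6} | tr ' ' '\n'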

The base memory bucket is used by default (30,080 cores). If the memory requirements call for it, the large bucket is used instead (960 cores). This happens when a job requires more than 8,040 MB per core, more than 256,500 MB per node, or for other memory-related reasons.
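
As a minimal sketch of just those two thresholds (a hypothetical helper; the authoritative and more involved logic lives in /etc/slurm/job_submit.lua):

# Sketch of the memory bucket choice based only on the two thresholds
# above; the real logic is in job_submit.lua.
mem_bucket() {
  local per_core_mb=$1 cores_per_node=$2
  if (( per_core_mb > 8040 || per_core_mb * cores_per_node > 256500 )); then
    echo large
  else
    echo base
  fi
}
mem_bucket 2000 1    # base
mem_bucket 9000 32   # large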

The time buckets are much more straightforward:

Bucket  Time (hours)  Time (days)
b1                 3
b2                12
b3                24            1
b4                72            3
b5               168            7
b6               672           28

Every job goes into the smallest time bucket that is large enough to contain it. No partition allows jobs longer than 4 weeks.
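
Assuming only the limits from the table above, the mapping can be sketched as a shell function:

# Map a requested walltime (in hours) to its time bucket.
# Upper bounds taken from the table above.
time_bucket() {
  local h=$1
  if   (( h <= 3 ));   then echo b1
  elif (( h <= 12 ));  then echo b2
  elif (( h <= 24 ));  then echo b3
  elif (( h <= 72 ));  then echo b4
  elif (( h <= 168 )); then echo b5
  elif (( h <= 672 )); then echo b6
  else echo 'longer than any partition allows' >&2; return 1
  fi
}
time_bucket 5    # b2
time_bucket 72   # b4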

The total number of cores in each partition is as follows:

Bucket  cpubase_bycore  cpubase_bynode  cpularge_bycore  cpularge_bynode
b1              13,792          27,872              416              960
b2              13,152          27,232              416              960
b3              12,512          26,272              384              864
b4               8,832          20,352              192              736
b5               5,472           9,568              192              224
b6               1,696           3,072               96              128

Here's a shell one-liner to see the number of idle and total cores in each partition (sinfo's %C field reports allocated/idle/other/total counts):

for x in base large; do for y in core node; do for z in {1..6}; do partition="cpu${x}_by${y}_b${z}"; echo -n "$partition "; sinfo --noheader -o '%C' -p "$partition" | tr / ' ' | awk '{print $2, $4}'; done; done; done | column -t

It's important not to run this too frequently to avoid further stressing the scheduler, which is (seemingly permanently) under heavy load.

Examples

Some concrete examples of where jobs end up based on their parameters:

sbatch arguments                     Partition
-c 1 --mem-per-cpu 2000 -t 5:00:00   cpubase_bycore_b2,cpubackfill
-c 32 --mem 2000 -t 28-0             cpubase_bynode_b6
-c 1 --mem 500G -t 3-0               cpularge_bycore_b4
-c 32 --mem-per-cpu 9000 -t 3:00:00  cpularge_bynode_b1,cpubackfill
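
To double-check where a job actually landed, something like the following works; job.sh is a placeholder script, and --parsable makes sbatch print just the job ID:

jobid=$(sbatch --parsable -c 1 --mem-per-cpu 2000 -t 5:00:00 job.sh)
squeue --job "$jobid" --noheader -o '%P'   # expect cpubase_bycore_b2,cpubackfill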