Cedar cluster
2020-04-22

Cedar (GP2) is a heterogeneous HPC cluster located at Simon Fraser University and operated by WestGrid. It is one of several new Compute Canada national systems.
Nodes
There are seven kinds of CPU nodes:
| Type | Count | Memory | Cores | Total cores | Memory/core |
|---|---|---|---|---|---|
| `basecompute` | 575 | 128 GB | 32 | 18,400 | 4 GB |
| `basecomputev2` | 640 | 192 GB | 48 | 30,720 | 4 GB |
| `basecomputev3` | 768 | 192 GB | 48 | 36,864 | 4 GB |
| `largecompute` | 94 | 256 GB | 32 | 3,008 | 8 GB |
| `bigmem512` | 24 | 512 GB | 32 | 768 | 16 GB |
| `bigmem1500` | 24 | 1.5 TB | 32 | 768 | 48 GB |
| `bigmem3000` | 4 | 3 TB | 32 | 128 | 96 GB |
| Total | 2,129 | | | 90,656 | |
The first four types are in the `base` partitions, and the last three are in the `large` partitions.
The memory amounts are nominal; the actual usable memory on each node is somewhat less.
There are also 338 GPU nodes with all sorts of different specs, equipped with either P100 or V100 cards.
All of the processors in the cluster are Xeons, clocked anywhere from 2.1 GHz to 2.4 GHz, and spanning several architectures.
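To cross-check the node table above from a login node, a generic `sinfo` invocation (nothing Cedar-specific) along these lines should list the distinct node configurations:

```
# One line per distinct node configuration: node count, CPUs per node, memory per node (MB), features.
sinfo --exact --noheader --format '%D %c %m %f'
```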
Partitions
There are five special CPU partitions: `cpubase_interac`, `cpularge_interac`, `cpubackfill`, `c12hbackfill`, and `cpupreempt`.
The first two are for interactive jobs (maximum time: 3 hours).
The next two are automatically added to the list of partitions for jobs not exceeding 24 or 12 hours, respectively, and are used for backfilling lower priority short jobs into otherwise wasted slots.
The `cpubackfill` partition includes some of every kind of CPU node, so even jobs requiring huge amounts of memory can get backfilled, and `c12hbackfill` even includes contributed nodes.
The last of the special partitions has a time limit of 122 days (4 months), but must be requested manually and has `PreemptMode=REQUEUE`, so jobs from other partitions allocated to the same hardware will cause jobs on this partition to be restarted.
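A submission might look something like this (the script name and walltime are placeholders, not anything the cluster prescribes):

```
# cpupreempt has to be named explicitly; --requeue lets Slurm restart the job after preemption.
sbatch -p cpupreempt --requeue -t 60-00:00:00 job.sh
```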
The other 24 CPU partitions are of the form `cpuX_Y_bZ`, where `X` is either `base` or `large`, `Y` is either `bycore` or `bynode`, and `Z` is a digit between 1 and 6; for example, `cpularge_bycore_b4`.
The difference between `bycore` and `bynode` is whether the entire node is allocated to the job; this happens when the job is submitted as exclusive, or otherwise depending on the core and memory requirements (gory details in `/etc/slurm/job_submit.lua`).
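For instance, a job submitted with `--exclusive` (everything else here is a placeholder) should end up in one of the `bynode` partitions:

```
# Whole-node request; per the job_submit.lua logic described above, this routes to a bynode partition.
sbatch --exclusive -c 32 -t 24:00:00 job.sh
```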
The `base` memory bucket is used by default (88,992 cores). If the memory requirements call for it, the `large` bucket is used instead (1,664 cores).
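As a rough rule of thumb only (the real logic is in `job_submit.lua`; the cutoff below is inferred from the node table, not verified), the choice comes down to whether the per-core memory request fits on a `base`-partition node:

```
# Sketch of the memory bucket choice; the 8000 MB/core cutoff is an assumption based on the node table.
mem_per_core_mb=$1
if [ "$mem_per_core_mb" -le 8000 ]; then
  echo "base bucket (cpubase_*)"
else
  echo "large bucket (cpularge_*)"
fi
```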
The time buckets are much more straightforward:
| Bucket | Time (hours) | Time (days) |
|---|---|---|
| `b1` | 3 | |
| `b2` | 12 | |
| `b3` | 24 | 1 |
| `b4` | 72 | 3 |
| `b5` | 168 | 7 |
| `b6` | 672 | 28 |
Every job goes into the smallest time bucket that is large enough to contain it.
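Picking the bucket for a given walltime is easy to express as a small shell function (limits copied from the table above):

```
# Print the smallest time bucket that fits a walltime given in hours.
time_bucket() {
  local hours=$1
  for pair in "b1 3" "b2 12" "b3 24" "b4 72" "b5 168" "b6 672"; do
    set -- $pair
    if [ "$hours" -le "$2" ]; then echo "$1"; return; fi
  done
  echo "too long for any bucket"
}
time_bucket 18   # prints b3
```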
The total number of cores in each partition is as follows:
| | `cpubase_bycore` | `cpubase_bynode` | `cpularge_bycore` | `cpularge_bynode` |
|---|---|---|---|---|
| `b1` | 68,928 | 88,352 | 1,088 | 1,600 |
| `b2` | 68,928 | 88,352 | 1,088 | 1,600 |
| `b3` | 59,840 | 76,192 | 1,088 | 1,600 |
| `b4` | 48,608 | 61,888 | 704 | 1,120 |
| `b5` | 34,784 | 44,992 | 288 | 544 |
| `b6` | 20,960 | 27,616 | 160 | 224 |
Here's a shell one-liner to see the number of free cores in each partition:
```
# sinfo's %C prints allocated/idle/other/total CPU counts; keep the idle and total fields.
for x in base large; do for y in core node; do for z in {1..6}; do partition="cpu${x}_by${y}_b${z}"; echo -n "$partition "; sinfo --noheader -o '%C' -p "$partition" | tr / ' ' | awk '{print $2, $4}'; done; done; done | column -t
```
It's important not to run this too frequently to avoid further stressing the scheduler, which is (seemingly permanently) under heavy load.
Examples
Some concrete examples of where jobs end up based on their parameters:
| `sbatch` arguments | Partition |
|---|---|
| `-c 1 --mem-per-cpu 2000 -t 18:00:00` | `cpubase_bycore_b3,cpubackfill` |
| `-c 48 --mem 2000 -t 28-0` | `cpubase_bynode_b6` |
| `-c 1 --mem 500G -t 3-0` | `cpularge_bycore_b4` |
| `-c 32 --mem-per-cpu 9000 -t 3:00:00` | `cpularge_bynode_b1,cpubackfill,c12hbackfill` |