Climate Incident - Nodes non-operational in parallel partition

Incident Report for Rāpoi HPC

Resolved

HPC is now recommissioned and available for you to use.
The quicktest partition now has: amd01n[01-04].
Intel machines (itl02n[01-04]) have been retired.
The rest of the partitions have been restored to the state they were in before we moved to reduced capacity, i.e.,
gpu partition has gpu[01-03],
longrun has bigtmp[01-02],
bigmem has high[01-04], and
parallel has the amd and spj nodes in it.
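If you want to confirm the restored layout from a login node, here is a minimal Python sketch, assuming the standard Slurm sinfo command is on your PATH (the format specifiers below are stock sinfo options, not anything Rāpoi-specific):

    # Sketch: print each partition's node list via Slurm's sinfo.
    # Assumes sinfo is on PATH; %P = partition name, %N = compressed node list.
    import subprocess

    result = subprocess.run(
        ["sinfo", "--noheader", "--format=%P %N"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines():
        partition, _, nodes = line.partition(" ")
        print(f"{partition:12s} {nodes}")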
Posted Mar 26, 2025 - 00:21 UTC

Update

The DS team has completed the HPC system relocation to the off-campus site.
Testing will be conducted tomorrow morning, and an update will be provided by midday.
Posted Mar 25, 2025 - 04:03 UTC

Update

We are continuing to monitor for any further issues.
Posted Mar 25, 2025 - 01:39 UTC

Update

The HPC relocation is tomorrow (20 Mar). The DS team will shut down power for the move. Testing will occur on Tuesday (25 Mar). Please contact the support team with any questions.
Posted Mar 18, 2025 - 22:23 UTC

Update

The Rāpoi HPC system will be unavailable from Thursday, March 20th, until the week of March 24th due to relocation.

We expect to begin testing on Tuesday, March 25th, and based on the results, we will determine when the system can be reopened for general use.

Please set appropriate time limits on new job submissions so that your jobs end before March 20.
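If it helps, here is a minimal Python sketch for working out the largest --time value a job submitted now can request and still finish before the shutdown (the cut-off is assumed to be midnight UTC on March 20, and the job script name myjob.sl is a hypothetical placeholder):

    # Sketch: compute the maximum Slurm time limit so a job
    # submitted right now finishes before the March 20 shutdown.
    # Assumes a midnight UTC cut-off; "myjob.sl" is hypothetical.
    from datetime import datetime, timezone

    SHUTDOWN = datetime(2025, 3, 20, tzinfo=timezone.utc)

    remaining = SHUTDOWN - datetime.now(timezone.utc)
    if remaining.total_seconds() <= 0:
        raise SystemExit("The shutdown window has started; do not submit.")

    days = remaining.days
    hours, rem = divmod(remaining.seconds, 3600)
    minutes = rem // 60

    # Slurm accepts time limits in days-hours:minutes:seconds form.
    print(f"sbatch --time={days}-{hours:02d}:{minutes:02d}:00 myjob.sl")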

We will provide updates as necessary.
Posted Feb 28, 2025 - 01:51 UTC

Update

Status - relocation:
The team at Digital Solutions is making progress on the migration. The logistics of quotes, approvals, and insurance have been completed. Currently, they are waiting for the off-campus facility provider to get the racks ready with power and network cabling.
Status - down nodes:
Due to the possibility of summer humidity levels exceeding 80% RH, we are unable to restart the parallel nodes at this time.
Planned:
Users will receive at least 10 days' notice before the cluster is shut down for relocation. The migration will result in several days of complete system outage while the DS team de-racks, transports, and re-racks all the equipment.
Potential delays:
A change freeze is planned for the first week of Trimester 1 (starting 24th Feb), which is typically a high-demand period. Depending on operational workload and incident response, this may further impact migration timelines.
We appreciate your patience and will provide updates as soon as firm migration dates are confirmed. Please reach out with any concerns.
Posted Feb 19, 2025 - 21:58 UTC

Monitoring

The compute infrastructure is running with limited capacity. Nodes will be moved to a new off-campus facility, but the schedule hasn’t been announced yet.
Posted Feb 10, 2025 - 21:34 UTC
This incident affected: Apply for New Accounts, Job Submission, Running Slurm Jobs, Pending Slurm Jobs in the Queue, Rāpoi: Login node, Storage Infrastructure, and Rāpoi: Compute Infrastructure.