VMware Storage Problem
Incident Report for Cyso
Postmortem

On 17 January at 12:10, during routine hardware maintenance on a server in the Equinix data center carried out by a technician from one of our suppliers, both power feeds to the equipment in that rack, including two storage switches, were interrupted.

The cause appears to be a combination of circumstances in which both power feeds on the ATS PDU were interrupted. An ATS (Automatic Transfer Switch) automatically switches between the A and B feeds when a failure occurs on one of them. Because both feeds were lost at once, the storage switches connected to that PDU in the affected rack both failed. As soon as we suspected that a power outage was at the root of the situation, we dispatched engineers to the data center to assess and repair the situation on site.
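To illustrate the failure mode: an ATS keeps its output powered as long as at least one feed is healthy, so only the simultaneous loss of both feeds takes the load down. The Python sketch below is purely illustrative; the class and method names are ours and do not represent any vendor's firmware or API.

    # Illustrative model of ATS failover behaviour; all names are hypothetical.
    class AutomaticTransferSwitch:
        """Feeds the load from feed A, transferring to feed B if A fails."""

        def __init__(self):
            self.feed_a_ok = True
            self.feed_b_ok = True

        def output_live(self) -> bool:
            # The output stays powered while at least one feed is healthy.
            return self.feed_a_ok or self.feed_b_ok

        def active_feed(self):
            if self.feed_a_ok:
                return "A"
            if self.feed_b_ok:
                return "B"
            return None  # both feeds down

    ats = AutomaticTransferSwitch()
    ats.feed_a_ok = False          # failure on feed A: ATS transfers to B
    assert ats.output_live() and ats.active_feed() == "B"

    ats.feed_b_ok = False          # both feeds interrupted, as in this incident
    assert not ats.output_live()   # downstream equipment loses power

As the model shows, the redundancy an ATS provides only holds as long as the two feeds fail independently; a single event that interrupts both defeats it.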

Because a part of the storage network that is designed to be redundant failed in its entirety, SAN storage became unavailable for a large portion of our customers in the Equinix data center, which caused virtual servers to crash.

After power was restored by an on-site Cyso engineer at 12:59, we began rebooting the affected servers and, where needed, checking for file system errors caused by the unexpected loss of SAN storage. By 14:35, 80% of the affected systems were back online; at 16:15 this had risen to 95%, and at 19:30 the last remaining errors were resolved.
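For context on the file system checks: when a Linux server loses its backing storage, the kernel typically remounts affected file systems read-only, and those mounts can be found by inspecting /proc/mounts. The sketch below is a minimal, hypothetical example of such a check, not our actual tooling.

    # Minimal sketch: list mount points that Linux left read-only, e.g. after
    # a SAN outage. The 'ro' heuristic and paths are illustrative assumptions.
    def readonly_mounts(mounts_path="/proc/mounts"):
        """Return mount points whose mount options include 'ro'."""
        readonly = []
        with open(mounts_path) as f:
            for line in f:
                device, mountpoint, fstype, options = line.split()[:4]
                if "ro" in options.split(","):
                    readonly.append(mountpoint)
        return readonly

    if __name__ == "__main__":
        for mnt in readonly_mounts():
            # A data mount that turned read-only after the outage typically
            # needs a file system check (fsck) and a reboot to recover.
            print(f"read-only filesystem detected: {mnt}")

Servers flagged in this way would correspond to the ones needing the reboots and checks described above.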

Because our infrastructure is connected to two separate power feeds in the data center, this equipment failure caught us off guard. We will check the cabling in the data center to find out why this went wrong and take measures to prevent a recurrence. In addition, until further notice, external suppliers will only be given access to our data centers under the supervision of one of our own engineers. Finally, we will investigate whether the robustness of our servers against an unexpected loss of SAN storage can be increased.

Posted Jan 18, 2022 - 17:27 CET

Resolved
The final remaining issues as a result of today's incident have been resolved.
Posted Jan 17, 2022 - 19:15 CET
Update
We're still working on a few remaining issues resulting from today's power disruption. We hope to give the all-clear soon.
Posted Jan 17, 2022 - 16:56 CET
Monitoring
Most affected systems are back online. Our engineers are still working on those that are still experiencing after-effects.
Posted Jan 17, 2022 - 13:43 CET
Update
Because server storage was disconnected earlier, we are still rebooting servers that are left with read-only file systems.
Posted Jan 17, 2022 - 13:27 CET
Identified
Power has been restored in the rack. We are now checking all systems that were affected.
Posted Jan 17, 2022 - 13:08 CET
Update
Our engineers are now on site at the data center, working on the affected rack.
Posted Jan 17, 2022 - 13:00 CET
Update
The disruption may have been caused by a power issue in one of our racks. We're still investigating.
Posted Jan 17, 2022 - 12:45 CET
Investigating
We are currently experiencing issues within part of our VMware cloud platform in our Equinix data center, which may result in services being unavailable. Our engineers are working to find the cause of the problem.
Posted Jan 17, 2022 - 12:27 CET