Shibboleth Outage on 6/23/2011

Problem

The Shibboleth SSO service experienced a complete outage on Thursday, 6/23/2011, from 3:07pm until 3:18pm. During this time, users were not able to sign in to applications that use Shibboleth.

Cause

On the afternoon of 6/23, IT Services provisioned two new Shibboleth IdP servers. As part of capacity planning, we wanted to maintain spare/standby servers in the pool. These servers were being configured and tested at the time of the outage.

We use Terracotta, a session clustering product. During the configuration and testing process, we inadvertently started one of the new servers, which caused it to join the Terracotta cluster. The server was offline in the load balancer and was not receiving authentication requests, but it was a client as far as Terracotta was concerned. Service degraded from this point and became unavailable a few minutes later.

Normally, a new client joining the Terracotta cluster (even unintentionally) should not adversely impact the service. In this case, however, when the new client joined the cluster, Terracotta caused all of the other nodes to lock up. Requests queued up and eventually timed out.
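
The failure mode described above (requests queuing behind a lock that is never released, then timing out) can be illustrated with a minimal sketch. This is only a generic illustration in Python and does not model Terracotta or explain why the lock occurred; the names `handle_request` and `simulate`, the number of requests, and the timeout value are all hypothetical.

```python
import threading


def handle_request(shared_lock, timeout_s):
    """Serve one request that must take the shared lock before it can proceed."""
    acquired = shared_lock.acquire(timeout=timeout_s)
    if not acquired:
        # The lock never became available: the request gives up.
        return "timeout"
    try:
        return "ok"
    finally:
        shared_lock.release()


def simulate(lock_held, n_requests=5, timeout_s=0.2):
    """Run n_requests concurrently and return the outcome of each one."""
    shared_lock = threading.Lock()
    if lock_held:
        # Hypothetical stand-in for a cluster-wide lock that is never released.
        shared_lock.acquire()

    results = [None] * n_requests

    def worker(i):
        results[i] = handle_request(shared_lock, timeout_s)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(simulate(lock_held=False))
    print(simulate(lock_held=True))
```

With the lock free, every request completes. With the lock held, every request waits for the full timeout and then fails, which matches the symptom users saw: sign-in attempts hung and then timed out.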

We do not yet understand why the lock occurred. We are continuing to investigate and are also seeking help from the Terracotta forum.

Resolution

Service was restored by restarting the web servers on all nodes.

Additional Action Items

Pending the outcome of the root cause analysis.