I wanted to share the significant events that affected IAMUCLA services (Shibboleth, Grouper WS, EDWS) over the last few days. Certain queries against Enterprise Directory (ED) were causing it to stall, which prevented new requests from being processed until the current ones finished. We are still actively working with Oracle to determine a root cause.
What is Enterprise Directory?
ED is our backend person repository, which most key IAMUCLA services such as Shibboleth, Grouper, and EDWS depend on for user lookups.
What exactly went wrong?
Export jobs that extract data from ED run at scheduled times spread throughout the day. We stagger these jobs so they do not overlap, because exporting large amounts of data is resource-intensive. Starting 9/23, between 3 and 4am, we noticed that a query was stalling the directory, preventing other searches from completing until the original query finished. This caused services like Shibboleth to periodically throw errors.
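For illustration, staggered export schedules like this are typically expressed in cron. The job names, paths, and times below are hypothetical, not our actual schedule:

```shell
# Hypothetical crontab illustrating staggered ED export jobs.
# Start times are spaced so that no two heavy exports overlap.
0 3 * * *  /opt/ed/bin/export_identities.sh    # hypothetical ~3:00am job
0 9 * * *  /opt/ed/bin/export_groups.sh        # hypothetical ~9:00am job
0 15 * * * /opt/ed/bin/export_affiliations.sh  # hypothetical ~3:00pm job
```

Staggering avoids overlapping load, but as this incident showed, a single stalled query can still block everything behind it.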
How come you did not notice the issue sooner?
We first noticed the issue on 9/23, when we saw a spike in traffic from a number of IPs. We originally suspected that Shibboleth was being DDoS'ed and reported the issue to Information Security for investigation. Information Security determined that it was not a DDoS attack, but rather an application continuously redirecting back to Shibboleth. We are working with that application's owners to address the redirect loop.
The IAMUCLA team then dug deeper and, after checking whether other services were experiencing problems, identified ED as the likely source of the issue. On 9/26 we opened a support request with Oracle and are still actively working with them to resolve it.
How did you alleviate the problem?
On 9/26 we also made the decision to shift Shibboleth to our backup ED instance. Even though traffic was minimal during the 3-4am period, we wanted to ensure availability of the system. Since this was the first full week of school, we scheduled the shift for midnight on the 27th, a reasonable hour with less traffic and impact to the community.
By shifting Shibboleth, and with it the majority of the requests, we expected that our primary ED instance would no longer experience the issue while we continued to troubleshoot. However, on 9/27 at about 1:00pm, we noticed the same issue occurring, which caused outages to EDWS and Grouper WS. At that point we decided to shift these services to our backup ED instance as well.
Our primary ED instance is now only processing the daily export jobs until we resolve the issue with Oracle.
Did monitoring not trigger an alert?
Setting up end-to-end alerts has been a challenge. We have OS-level monitors on all our systems, and most applications have log alerts or status checks. However, ED was technically still responding and did not generate any error logs to alert us. Monitors on Shibboleth check status pages to ensure the application itself is healthy, and since Shibboleth was responding, no alerts were triggered. Grouper WS and EDWS, however, did trigger alerts, as searches eventually queued and those services stopped responding.
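One direction we are considering for closing this gap is a synthetic end-to-end probe that times an actual directory search rather than polling a status page. The sketch below is illustrative only: `search_fn` stands in for a real LDAP search against ED, which is not shown here.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def probe(search_fn, timeout_seconds=5.0):
    """Run a real directory search under a hard deadline.

    A status-page check only proves the front end is up; timing an
    actual search catches a backend that accepts connections but has
    stalled, which is what happened to ED.
    Returns (healthy, elapsed_seconds).
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(search_fn)
        try:
            future.result(timeout=timeout_seconds)
            return True, time.monotonic() - start
        except TimeoutError:
            # The search did not finish in time: treat as unhealthy.
            return False, time.monotonic() - start
        except Exception:
            # The search itself failed: also unhealthy.
            return False, time.monotonic() - start

# Example with a stand-in search function (hypothetical; a real probe
# would perform an LDAP search against ED):
healthy, elapsed = probe(lambda: time.sleep(0.05))
```

A probe like this, run every few minutes against each ED instance, would have flagged the stall even while the status pages stayed green.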
What are your next steps?
We are still working with Oracle to troubleshoot the issue. No other instance (test or production) is experiencing it, so replicating it has been fairly difficult. Oracle is recommending configuration changes and additional logging, which will require restarts of the ED server. Email communications will be sent shortly with a better timeline for these changes. We apologize for the short notice (1 or 2 days) we will be giving before applying these changes to our primary production server. We will apply the same changes to our backup once we shift services back to the primary, and we will provide additional notice when that change occurs.
We are also raising the priority of our ED upgrade project. Its architecture and configuration need to be revisited given the explosion of data from Grouper groups.
A stretch goal of ours is to aggregate our logs and correlate events across all IAMUCLA systems to better determine the health of our services.
We are aware of the impact our services have on the campus community. When there are service issues, it is always a balancing act between diagnosing and resolving the problem and maintaining as much service availability as possible. This is especially difficult during key academic calendar dates.
Finally, I wanted to thank the team who worked tirelessly to troubleshoot and maintain service availability.