Day-Long E-mail Outage Affects 10,500
Repeated problems have plagued MIT’s electronic mail systems in the second half of July. Failures of the traditional IMAP and the new Microsoft Exchange mail systems caused large portions of campus e-mail to be unavailable for the better part of a day, followed by shorter and smaller outages last week.
The major outage began just before 7 a.m. on Thursday, July 23, and affected all users of the traditional IMAP system (also known as Cyrus), which is Information Services and Technologies’ (IS&T) older e-mail system and is used by 90 percent of MIT e-mail users. (The rest use MIT’s new Microsoft Exchange system, which was unaffected in this outage.) Users who forward or split their e-mail to other e-mail providers such as Gmail were unaffected by the outage. Users of MIT’s webmail service were affected.
Jeffrey I. Schiller ’79, MIT’s network manager, said the system went down when both controllers in the Storage Area Network (SAN) device that manages the e-mail disk drives failed. (There are two in case one fails.) The Tech spoke to Schiller on Friday, July 24, the day after the outage.
By 9 a.m., IS&T had replaced the failed components and restored service to all accounts on the post office servers po9, po12, and po14 (there is no po13; servers 1-8 are no longer in use). However, the hardware failure caused data corruption on po10 and po11, leaving about 5,100 students and 5,400 faculty and staff without e-mail service for much of the outage.
Schiller said that the disk array at fault was old and due to be replaced soon. He also said that the problem was extremely rare, and none of the published documentation for the disk array contained instructions that correctly solved the problem.
Schiller said that this problem was not specific to the traditional IMAP service, and could just as easily have affected the Exchange service. IS&T eventually escalated the problem to a Sun Microsystems engineer, who Schiller said was extremely knowledgeable.
MIT provided Sun with detailed debugging diagnostics from the SAN, which Sun took hours to analyze before producing a recovery procedure.
Around 4:15 p.m., Sun provided detailed instructions to recover from the failure; executing the instructions took until 8:15 p.m., when service was restored.
IS&T succeeded in fixing the broken SAN, but had that approach not worked, a full backup of IMAP e-mail was also available. A copy of mailboxes is made between 5 p.m. and 9 a.m. nightly. It would have taken about five hours to restore the affected servers’ data from the backup.
Other Outages Follow
Users of the new Exchange e-mail system were not immune to problems last month either: one of the Exchange post office servers failed around 8 a.m. on Tuesday, July 28. Service was restored at 9:11 a.m., IS&T reported.
Around 1 a.m. on Thursday, July 30, the traditional IMAP service again had a problem. The po9 post office server experienced a kernel panic related to disk corruption and went offline. Three-quarters of the mailboxes were online by 2:05 a.m., and the remainder were brought online by 11 a.m.
The most recent serious e-mail outage before last month’s occurred in 2007, when po14 was inaccessible for almost four days.
IS&T has not responded to repeated inquiries over the past week regarding the Tuesday Exchange outage or last week’s po9 outage; spokeswoman Christine C. Fitzgerald indicated she was waiting to hear from technical staff.
Redundancy Plan in Flux
The day after the big outage, Schiller emphasized that SAN failures are rare, and that to buy enough redundancy to eliminate outages would have cost “money that MIT is not prepared to spend.”
Asked last Friday how MIT views the repeated outages, Theresa M. Stone, MIT’s Executive Vice President, said that the “IS&T team does not believe any outage of e-mail is acceptable, and has worked to introduce redundancy to protect the system.”
Yesterday, Schiller said Stone recently asked IS&T to take “what steps were necessary” to ensure a reliable mail system for both traditional IMAP and Exchange. “And we will do that,” Schiller said.
Prior to the outage, IS&T had already been working on a fully redundant storage system for Exchange e-mail users, but traditional IMAP users were not included. Yesterday, Schiller said IS&T is now migrating all of the IMAP users’ data to the redundant SAN originally built for the Exchange users. The plan is to replicate the redundant SAN both on campus, in W92, and off campus, in building OC11, located at One Summer Street, Boston.
Because the volume of data is very large, the migration could take weeks to finish, Schiller said.
This process has benefitted from tight management coordination. IS&T’s technical lead on the mail system reports directly to Executive Vice President Stone, because IS&T’s Vice President has announced his retirement and a mid-level manager is out on medical leave.
John A. Hawkinson contributed reporting to this article.