Downed Mail Server Up Saturday, Then Another Fails, Webmail Too
The mail server outage that began last Wednesday ended on Saturday morning, but it was followed by an unrelated outage of another of the five post office servers on Monday afternoon. In a third outage, early Tuesday, at least two of the six Webmail servers were down for about half an hour.
On Wednesday morning, po14.mit.edu crashed, leaving all 4,000 of its users without access to e-mail. By Thursday morning, initial repairs on “firebrick,” one of four file systems on po14, had not been completed. Information Services and Technology staff changed to a new plan: bring po14 up with its three intact file systems, and copy the remaining mailboxes from the damaged file system to a new file system. The resulting mailbox-by-mailbox restoration began on Thursday morning and was not completed until about 8 a.m. Saturday.
Throughout the outage, IS&T provided updates on the 3-DOWN outage information service, http://3down.mit.edu/, though it was not until 6 p.m. on Friday that IS&T published any information about the rate of restores. No estimates for restore times were ever provided. Friday’s 6 p.m. update reported that there were “below 300” mailboxes awaiting restoration. By 10 p.m., 3-DOWN reported that fewer than 75 mailboxes were still affected, but the restores were not completed until the following morning.
According to Jeffrey I. Schiller ’79, restores were prioritized, with faculty accounts first and guest accounts last. The firebrick file system held about 843 gigabytes of data across 27 million files, Schiller said.
Beginning Wednesday, IS&T provided technical details of the overall outage in a Web page linked from the main 3-DOWN page, http://3down.mit.edu/fcgi-bin/3down?showpage2=email.
po10 ran out of disk space
At about 3:43 p.m. Monday, IS&T experienced another mail server failure. According to Schiller, a file system on po10 (named “plum”) ran out of disk space. Schiller explained that IS&T receives daily reports about disk space usage, but in the aftermath of the po14 outage, “things have been a little hectic.” IS&T staff repaired the problem by 5:17 p.m., so the outage lasted about an hour and a half. Schiller said there were “no messages lost.”
Midnight Webmail failure
At least two of the six Webmail servers that support Web-based e-mail access (http://webmail.mit.edu) failed around midnight Monday night, shortly before press time. The servers reported “database errors” to users, but the problems cleared up around 12:30 a.m. IS&T representatives were not available for comment.
Mail forwarding disabled
During po14’s outage, affected users could forward new e-mail to other mail services, such as Gmail. Mail forwarding changes are generally processed in updates that run every four hours, but those updates were disabled very early Friday morning. As a result, users on po14 who set up mail forwarding by Thursday could receive their new mail, but those who tried to set up forwarding after that were not successful.
According to Schiller, the update “takes significant resources to process, so it probably would have done more harm to leave it enabled.”
The updates remained disabled through early afternoon Monday, well after the outage ended on Saturday. Schiller was not able to account for the delay.
Widespread delays
During and shortly after the po14 outage, the MIT mail system experienced delays that were visible to all users. Because the mail system was under heavy load while holding po14-bound mail, some other e-mail was delayed by several hours.
According to Schiller, IS&T focused on delivering on-campus mail, and the delays were primarily in mail originating from off campus, which undergoes additional spam filtering and uses different servers than e-mail sent from on campus.
When asked whether the system was engineered to handle the failure of a single post office mail server (such as po14) without affecting other e-mail, Schiller said, “not for this long.” Schiller attributed some of the delays to the open source “sendmail” program, which handles queueing and delivery on most of MIT’s mail servers.
There are many competing alternatives to sendmail, both open source and commercial, and most are designed to perform more efficiently than sendmail. MIT has been using sendmail for decades.
The remaining three post office mail servers (po9, po11, and po12) have not experienced failures in several years.