Yesterday, Google's e-mail service, Gmail, went offline for about 100 minutes. At first no one knew why, but Google has since revealed the cause.
The Gmail team took a small number of Gmail servers offline to perform routine upgrades. However, the company underestimated the load that recent changes to the service would place on the request routers. Some of the request routers became overloaded and told the rest of the system to stop sending them traffic. Their load was then transferred to the remaining request routers, which caused more of them to become overloaded. Within a few minutes all of the request routers were overloaded, causing a massive outage. While mobile users still had some luck retrieving their e-mail, people using Gmail on the web could not see their messages because their requests could not be routed. IMAP/POP access still worked because it uses different routers.
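The chain reaction described above can be sketched in a few lines of Python. This is only a toy model, not Google's actual system: the function name, the even spread of load across routers, and all the numbers are assumptions made for illustration.

```python
def simulate_cascade(total_routers, offline, capacity, total_load):
    """Toy model of a cascading overload (hypothetical, not Google's design).

    Load is assumed to be spread evenly over the healthy routers. Whenever
    the per-router load exceeds capacity, one router stops accepting
    traffic and its share is pushed onto the rest.
    Returns (routers still serving, history of (healthy, load per router)).
    """
    healthy = total_routers - offline
    history = []
    while healthy > 0:
        load_per_router = total_load / healthy
        history.append((healthy, load_per_router))
        if load_per_router <= capacity:
            return healthy, history  # the system has settled
        healthy -= 1  # an overloaded router sheds its traffic
    return 0, history  # every router has dropped out


# Illustrative numbers only: 10 routers, 850 units of total load.
print(simulate_cascade(10, 0, 100, 850)[0])  # all 10 routers keep serving
print(simulate_cascade(10, 2, 100, 850)[0])  # 2 offline: cascade down to 0
print(simulate_cascade(10, 2, 150, 850)[0])  # more capacity: 8 keep serving
```

In this model, once the per-router load crosses the capacity limit, every router that drops out only raises the load on the others, so the system does not recover by itself. With enough spare capacity per router, the same two routers can go offline without tipping the rest over.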
To prevent this from happening again, the Gmail team increased request router capacity well beyond normal peak demand so that there is headroom to absorb unexpected load. Ben Treynor, the VP Engineering and Site Reliability Czar at Google, wrote, “Gmail remains more than 99.9% available to all users, and we’re committed to keeping events like today’s notable for their rarity.”