We host websites and email. We take our responsibility very seriously and we understand that the margin of error is slim -- 100% uptime is the only acceptable standard. But as we administer and manage these servers to well over 99.99% uptime, we know that perfection is impossible and that the stumbles made by the biggest names in corporate America are legendary. Here are a few...

- REGISTER.COM suffers further DOS attack: Millions of sites downed again
- FAA Says Systems That Process Flight Plans Is Down, Mass Flight Delays
- APPLE DISCUSSES MOBILEME SHORTAGES
- FACEBOOK CRASHES IN WAKE OF 'SCRABULOUS' TAKEDOWN
- AMAZON WEB SERVICES GOES DOWN, TAKES OUT SOME WEB 2.0 SITES
- OUTAGE LEAVES MANY HOTMAIL USERS COLD
- AMAZON’S S3 OUTAGE: IS THE CLOUD TOO COMPLICATED?
- BLACKBERRY SUFFERS 'CRITICAL' OUTAGE
- RIM OFFERS EXPLANATION FOR MASSIVE OUTAGE
- .MAC USERS MOCK APPLE SLOGAN DURING OUTAGE
- CHRONOLOGY OF THE RECENT PAYPAL OUTAGE
- EBAY OUTAGE BLAMED ON SOFTWARE FROM SUN AND ORACLE

Microsoft Email Crash Affects Millions Of Users

REDMOND, Wash. - The world's largest email provider, Microsoft, was struggling to restore its services Friday after outages that reportedly affected up to 365 million users worldwide.

Skype Still Struggling After Worldwide Outage

A significant number of people are still unable to use the Skype Internet communication service nearly 24 hours after technical glitches brought down the company's network. Around 9 a.m. EST Thursday morning, the company Tweeted that as many as 10 million users were now able to use the service, although the company recently announced a peak usage of 25 million. This means as many as 15 million people are still unable to access the telephony network -- and advanced services such as group video calling may take longer to restore.

"In the last hour, we've seen evidence of a significant increase in the number of people online," the company wrote in a blog post. "Because of the way the Skype software works, it's not possible for anyone to obtain an exact figure, but we now estimate it to be over 10 million."

Users across the globe reported issues accessing the service Wednesday morning, prompting the company to acknowledge the issue on Twitter: "Some of you may have problems signing in to Skype -- we're investigating, and we're sorry for the disruption to your conversations." Skype followed up with another Tweet assuring users that their "engineers and site operations are working non-stop to get things back to normal." The problem is unconnected to the hacking attacks that disabled popular websites such as MasterCard and Visa in recent weeks.

Chaim Haas, a spokesman for Skype, explained to FoxNews.com that the company's telephony network relies on millions of individual connections between computers and phones to stay up and running, referencing a blog post by the company. "Under normal circumstances, there are a large number of supernodes available," network features which act like phone directories for Skype, Haas said. "Unfortunately, today, many of them were taken offline by a problem affecting some versions of Skype." Engineers created new "mega-supernodes" that solved the problem, Haas said.

As of about 3:30 p.m. EST, normal services had started returning to Skype, the company said, acknowledging that it may take "several more hours" before all users can sign in again. Later that day, the company Tweeted again -- this time a request for patience.

Behind last night's Bing outage

Microsoft said that a configuration change that was mistakenly moved from testing onto the live Bing.com site was to blame for an outage Thursday that left Microsoft's search engine completely inaccessible for more than half an hour. A Microsoft representative told CNET on Friday that the problem appears to have come when something being tested was moved onto the live site.
Twitter, Facebook Investigating Service Disruptions

Twitter is in the midst of defending itself against an ongoing denial-of-service attack, the micro-blogging service reported this morning. Few details have been released so far about the attack. The company, however, confirmed the attack today in a blog post, and noted that even though the site is back up, officials are still working to recover and defend against the attack.

“On this otherwise happy Thursday morning, Twitter is the target of a denial of service attack,” blogged Twitter co-founder Biz Stone. “Attacks such as this are malicious efforts orchestrated to disrupt and make unavailable services such as online banks, credit card payment gateways, and in this case, Twitter for intended customers or users. We are defending against this attack now and will continue to update our status blog as we continue to defend and later investigate.”

There were also reports that Facebook was attacked as well, though officials there have not confirmed the attack. Facebook spokesperson Malorie Lucich said the company was investigating the reports and would update users as soon as possible. “Earlier this morning, we encountered issues within our network that resulted in a short period of degraded site experience for some visitors,” she said. “No user data was at risk and the matter is now resolved for the majority of users. We’re monitoring the situation to ensure that users continue to have the fast and reliable experience they’ve come to expect from Facebook.”

Ray Dickenson, CTO at security firm Authentium, noted that many denial-of-service attacks are launched from botnets. “Twitter is such a high profile site, it may be just a bot-herder or one of their customers wanting to show off the power of their botnet,” he said.

Register.com suffers further DOS attack

Computer Problem Causes Mass Flight Delays

Travelers are facing mass flight delays today as the result of a computer problem at the Federal Aviation Administration.
The FAA has two systems that process flight plans - one located in Atlanta and the other one in Salt Lake City. But the Atlanta system went down at 1:30pm today, and all flight plans are now being handled out of Salt Lake City. As a result, delays could pile up at airports across the country. Delays up to 90 minutes are already surfacing at several airports.

"This was a failure mode we have not seen before," said FAA chief operating officer Hank Krakowski on Tuesday afternoon. According to the FAA, about 6,500 airplanes are in the FAA system, though the aviation agency has not said how many were in the sky and how many were on the ground when the problem occurred. With such a heavy volume of air traffic typically converging on the East Coast, delays could spread depending on how much time it takes to iron out the problem. Krakowski said most of the delays were happening in the eastern portion of the United States, with none reported west of Dallas or Chicago.

'SCRABBLE' APP ON FACEBOOK CRASHES IN WAKE OF 'SCRABULOUS' TAKEDOWN

When Scrabulous, a popular game on Facebook's developer platform, was shut down earlier on Tuesday because of copyright infringement issues with the manufacturer of the Scrabble board game, word game fans weren't totally left in the dark. After all, Electronic Arts (which handles the digital rights to Scrabble for the game's parent company, Hasbro) had recently created an official beta version of Scrabble for the platform.

Problem is, the servers that were hosting the "real" Scrabble app couldn't handle the load of new migrants, and the application crashed on Tuesday afternoon. Oops! "We'll be back up shortly," an apologetic error message read. "We're working on some tech problems and Scrabble will be ready to play as soon as possible!"

The game is slated to exit the beta phase in the middle of next month, and some (my colleague Rafe Needleman among them) initially found it to be a better-quality game experience than Scrabulous had been.
But in the wake of a server crash, Facebook users weren't too pleased, as the message wall for the Scrabble application revealed. "Wow, does this suck," one Facebook user wrote. "Why can't you guys work out a licensing deal with the Scrabulous boys? Now we're back to square one and have to go through all of your debugging process."

Well, to be fair, rumor has it that Hasbro put out an acquisition offer for Scrabulous, only to have it rebuffed because its creators thought the amount offered was insufficient.

"Sucks, sucks, sucks," another Facebook user said. "Locks up at 30 percent loading. Sucks. Oh, did I mention it sucks? Get a grip, Hasbro." Too bad "FAIL" will net you only seven points.

APPLE GETS UNUSUALLY CHATTY ABOUT MOBILEME SHORTAGES

This week we saw an unusually chatty Apple, given the MobileMe service's functionality problems. MobileMe had a rough start two weeks ago, when users reported issues with the “Internet service that takes the best of .Mac and more.” Not only did Apple apologize last week, but it also offered customers a 30-day extension of their MobileMe subscription and promised to keep them updated about the repair process, which it did. “Be assured people here are working 24-7 to improve matters, and we're going to favor getting you new info hot off the presses even if we have to post corrections or further updates later,” Apple's blog said on Friday.

It appears that 1 percent of the MobileMe members reported a mail outage last Friday, when one of Apple's mail servers blocked their access to their MobileMe mail accounts. Apple reported fixing the problem, but unfortunately, the affected members will only be able to read mail they've received since last Friday, but not prior to that. The company expects to restore full access to the accounts and estimated that it should take no longer than a week for that to happen.
However, it appears that the affected users have lost 10 percent of their mail messages received between July 16 and July 18.

So what exactly happened on launch date? Apple blames more traffic than they had anticipated for the failure to access the web versions of the MobileMe applications – Mail, Contacts, Calendar, Gallery, iDisk. However, “we've since added server capacity and tuned our software to scale better – i.e. behave more gracefully when traffic spikes.” Overall, Apple reported 70 bugs fixed, including the one that prevented MobileMe IMAP mail folders from syncing correctly between the web app and Mac OS X Mail or Outlook. Further details are expected next week.

OUTAGE LEAVES MANY HOTMAIL USERS COLD

Microsoft's Windows Live services experienced a significant outage Tuesday, leaving many users unable to get to their Hotmail inboxes. A company representative said all Windows Live services are affected, though not all users are reporting problems. Microsoft said it is still trying to determine the cause of the problems.

"We are aware that some customers may be experiencing difficulty accessing their Windows Live accounts," the software maker said in a statement to CNET News.com. "We're actively investigating the cause and are working to take the appropriate steps to remedy the situation as rapidly as possible. We sincerely apologize for any inconvenience and disruption this may be causing our customers."

Amazon Web Services goes down, takes out some Web 2.0 sites

Some sites based on "cloud computing" got a wake-up call yesterday when the system failed. Amazon Web Services stopped working yesterday morning, which affected a number of Web 2.0 sites. TechCrunch was quick to point out that this blew a big hole in the "cloud computing" hype that seems to be prevalent in Silicon Valley at the moment.
It said: "This could just be growing pains for Amazon Web Services, as more startups and other companies come to rely on it for their Web-scale computing infrastructure. But even if the outage only lasted a couple hours, it is unacceptable. Nobody is going to trust their business to cloud computing unless it is more reliable than the data-center computing that is the current norm. So many Websites now rely on Amazon's S3 storage service and, increasingly, on its EC2 compute cloud as well, that an outage takes down a lot of sites, or at least takes down some of their functionality. Cloud computing needs to be 99.999 percent reliable if Amazon and others want it to become more widely adopted." Amazon Web Services is nothing like that reliable: it seems it only aspires to 99.9% availability, which would have been unacceptable in an antique mainframe, let alone a specialised fault-tolerant server. If people really want "five nines" availability, they'll have to pay for it, and at the moment it doesn't come at anything like Amazon's prices. One of the people promoting cloud computing is Greg Olsen, founder and chief technology officer of Coghead. Rather amusingly, the day before Amazon fell over, GigaOM published his guest column about adopting this stuff. He wrote: "By leveraging service options like Amazon's EC2 and S3, a small company can deploy a complex, highly available and scalable multi-user software application -- without huge upfront investments in hardware or software infrastructure. Likewise, a very small company can build a simple, narrowly focused service and can cost-effectively sell it to a mass audience. Neither of these companies would have been possible only a short time ago." Although I have a natural resistance to boosterism, I think Olsen is right and TechCrunch is wrong. Cloud computing does not need to be 99.999% reliable to get adopted by Web 2.0 companies. 
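For a sense of scale, the availability percentages quoted above can be turned into hours of downtime per year. The following is only a minimal sketch of that arithmetic: the function names are made up for illustration and come from none of the articles, and the combined figure assumes the two services fail independently, which is a simplification.

```python
# Illustrative arithmetic only; these names are hypothetical, not from any vendor.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year permitted at a given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def serial_availability_percent(*availability_percents: float) -> float:
    """Combined availability when every listed service must be up at once.

    Assumes the services fail independently of one another (a simplification).
    """
    combined = 1.0
    for percent in availability_percents:
        combined *= percent / 100
    return combined * 100


if __name__ == "__main__":
    for level in (99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(level)
        print(f"{level}% uptime permits about {hours:.2f} hours down per year")
    # A "five nines" site that depends on a "three nines" supplier
    combined = serial_availability_percent(99.999, 99.9)
    print(f"Combined availability: {combined:.3f}%")
```

Under these assumptions, "three nines" permits roughly 8.76 hours of downtime a year, while "five nines" permits a little over five minutes. A service that depends on a "three nines" supplier also ends up slightly below "three nines" itself.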
It makes sense to adopt cloud computing because it's cheap and because you don't need much technical competence to do it. It therefore meets Web 2.0 needs very nicely. Of course, you'd have to be incompetent way beyond stupidity to build your banking, air traffic control, hospital or mission-critical corporate system on Amazon Web Services, because these do need to be reliable. Web 2.0 systems don't. Who really cares if Twitter goes down for a couple of hours, or even a couple of days, apart from the people who run Twitter?

There are, however, a couple of useful lessons from the debacle. The first is that "cloud computing" is still mostly hype. It will stop being mostly hype when service providers start to offer guaranteed service level agreements (SLAs) backed up by real financial guarantees. The second is that relying on somebody else's unreliable system makes your system less reliable, not more reliable. You don't have "five nines" reliability in whatever it is you do if you're using a supplier that only has "three nines" reliability. And if you're relying on a beta Web 2.0 site that's relying on another beta service like Amazon Web Services, then you're just asking for trouble.

Web-based services are great, especially if they're free or very cheap, but it's insane to pretend they have the reliability of the electricity grid (which isn't wholly reliable) or a water utility (ditto, plus leaks). Web sites today don't guarantee reliability, availability or adequate performance, and there are lots of ways you can lose not just the service but also your data (as I wrote in a column this week). I'm not saying you shouldn't use them. I am saying that you should know what you're doing. Yesterday just showed that some people don't.

AMAZON’S S3 OUTAGE: IS THE CLOUD TOO COMPLICATED?

Over the weekend Amazon’s S3 storage service was down for an extended period and a bunch of Web 2.0 sites lost avatars, images and other items on their sites.
Since enterprises haven’t totally jumped on the bandwagon Amazon’s outage didn’t have broader ramifications. But Amazon’s latest outage–the second big one this year–will hamper dreams of enterprise class services for the masses. After all, the dream for cloud computing is enterprise reliability for pennies. In this view, the cloud will just work, uptime will always be there and we’ll tap into this architecture and always be tethered to the Web. Michael Krigsman gives Amazon props for transparency with its latest outage, but the larger issue is reliability and how much redundancy should we expect for a few pennies a gigabyte (Techmeme). If Amazon can’t democratize cloud computing and bring us a bunch of “9s” reliability who can? Om Malik writes:
Om hits the mark. The problem: The Web is one big legacy system. And cloud computing relies on millions of connections and services. In other words, it’s a troubleshooting nightmare when the cloud goes bust. And like any company wrestling with legacy systems cloud computing vendors will dust off a tired playbook. The solutions will be the usual: Relegate legacy systems to plumbing and create more services and applications to keep infrastructure current. In other words, the cloud will likely become more of a rat’s nest. What’s scary about that prognosis is the cloud is already too complicated since it’s built on creaky infrastructure.

BlackBerry Suffers 'Critical' Outage

A CRITICAL BlackBerry network outage in the US has hampered business deals and presidential campaign plans after users were left stranded without access to email. The maker of BlackBerry handsets, which are ubiquitous in professional and political circles and are used to send and receive emails on the run, said its US network had experienced a "critical severity outage" today.

"This is an emergency notification regarding the current BlackBerry Infrastructure outage," said an email sent by company Research In Motion to its large BlackBerry clients. The email said the outage affected business clients and "users of the Americas network". Research In Motion did not say what caused the outage, when regular service was expected to be restored or how many people could be affected. About one hour after the notification, some customers said a few emails were going through. Others said they continued to be without service.

Some BlackBerry users appeared to enjoy a respite from the device, which has been affectionately dubbed the "CrackBerry" due to its addictive nature. On Parliament Hill in Ottawa, Liberal Party spokesman Jean-Francois Del Torchio said things seemed very relaxed for a while. "It made my life a little bit easier, since I didn't have to reply.
But when I arrived at my desktop and I saw all the e-mails I received, I was like, 'Oh, I still need to work'," he joked.

Carmi Levy of AR Communications said service reliability was a serious concern for telecommunications companies because if problems became routine, they could turn customers away. A massive outage in April last year crashed the BlackBerry network across the US, leaving thousands of users without access to wireless email. Research In Motion CEO Jim Balsillie said at the time that such incidents were "very rare" and the Waterloo, Ontario-based company was taking steps to prevent such an outage from happening again. Executives, politicians, lawyers and other professionals rely on the BlackBerry for its ability to send secure emails.

With Wojtek Dabrowski in Toronto for Reuters

RIM OFFERS EXPLANATION FOR MASSIVE OUTAGE

Research In Motion finally offered some details late Thursday about what caused a severe outage of its BlackBerry e-mail service from Tuesday evening until Wednesday morning. The company said in a statement that it had ruled out security and capacity issues as a cause of the outage that left millions of so-called "CrackBerry" addicts without access to their e-mail for several hours. The company also said the incident was not caused by any hardware failure or core software issue.

Ruling out those causes, the company has "determined that the incident was triggered by the introduction of a new, noncritical system routine that was designed to provide better optimization of the system's cache." In computing terms, a cache is a temporary storage area that allows data to be served up quickly. RIM said the system routine had not been expected to affect the regular operations of the BlackBerry servers and infrastructure.
Despite previous testing, the new system routine produced an unexpected effect that set off a chain reaction, triggering a series of interaction errors between the system's operational database and the cache. After RIM isolated the database problem and tried unsuccessfully to fix the issue, it began its "failover" process to a backup system. But that also failed. "Although the backup system and failover process had been repeatedly and successfully tested previously, the failover process did not fully perform to RIM's expectations in this situation and therefore caused further delay in restoring service and processing the resulting message queue," the company said in the statement. RIM also said it has already identified several aspects of its testing, monitoring and recovery processes that it plans to improve as a result of the incident. Since the outage's start--around 5 p.m. PDT Tuesday--the company had been quiet about its cause. But experts said they were convinced the issue had to do with RIM's network since subscribers were still able to make phone calls and send and receive text messages. RIM's service is centralized and works by routing all BlackBerry e-mails through one of two main network operations centers, which are essentially large data centers. One center is located in Canada and primarily serves the Western Hemisphere as well as parts of Asia. The other data center, located in the U.K., handles e-mail traffic in Europe, Africa and the Middle East. Analysts had speculated that since most of the people affected by the outage were based in North America that it was likely the problem occurred in the center located in Waterloo, Ontario. By Wednesday morning, RIM said, the e-mail had begun trickling into in-boxes across North America. The service was operating normally on Thursday, the company said. RIM has built a strong reputation as a reliable service provider that has attracted bankers, lawyers and lawmakers as subscribers. 
The company has recently been trying to broaden its appeal to consumers with new products, such as the BlackBerry Pearl handheld and the BlackBerry 8800. The new strategy has helped the company rapidly expand its subscribers. In its latest quarter, RIM reported it had added 1.02 million new subscribers, taking its total to 8 million. This is a huge increase from the 2 million subscribers the company reported a year ago, when it settled a patent infringement case with NTP. The company expects to add between 1.12 million and 1.15 million subscribers during the current quarter.
.Mac users mock Apple slogan during outage

Apple Computer's latest advertising campaign, pegged to the slogan "It just works," is irritating some .Mac users as they wonder when the service will become operational again. Over the past four days, .Mac users have struggled to get its Web site publishing features, iWeb, and related file-share capabilities, iDisk, to work. Users have complained not only about the length of the outage, but also what they say is a tardy response from .Mac's technical support team, according to postings on Apple's discussion board.

"It is going on 96 hours for me. Completely Unacceptable," wrote a user named BK Broiler in a post to the discussion board. "The .76 IP now pings, since yesterday, but iDisk does not work still. It'll only work with the /etc/hosts trick, but not on its own. I got a canned e-mail from Apple today after 72 hours of silence from the time I sent the trouble call. Thanks, Apple, for making a joke out of long term customer loyalty, and for just not giving a ****. It may be time to switch away from Mac after 20 years."

Apple said Monday it is investigating the issue.

eBay outage blamed on software from Sun and Oracle

Online auction mega-site eBay was offline for over 24 hours this weekend, causing an estimated $2-3 million loss of business for the company. But the company was eager to spread the blame and offset some of the embarrassment by blaming the outage on its reliance on software from Sun Microsystems and Oracle. eBay's site uses Sun Solaris and Oracle's server database.

Particularly damning is the fact that eBay's Web site goes crashing down on a fairly regular basis. EBay is one of the most popular destinations on the Web, but the constant problems are causing customers to look elsewhere.

"We are sorry," wrote eBay CEO Meg Whitman in a letter to its users. "We know that you expect uninterrupted service from eBay.
We believe that this is reasonable, and we know we haven't lived up to your expectations. We want to earn back your trust that we'll provide you with this level of service."

CHRONOLOGY OF THE RECENT PAYPAL OUTAGE

We are monitoring this situation closely, and we will continue to update you as new information is available. We appreciate your patience. Regards,

Today, access to PayPal continues to be intermittent. Some members are able to log in to the site and make payments and perform other activities, although they may be experiencing very slow system responses. Other members are not able to get in right away, or at all. PayPal users may also be having problems with their debit cards. Sellers who use PayPal shipping functionality may be having problems shipping products to their buyers, and buyers may be experiencing difficulties paying sellers. We encourage members to be patient with trading partners as we work to improve PayPal access. These PayPal issues are the result of unforeseen problems that resulted when a new code base to upgrade the site architecture was introduced to the PayPal platform on Friday morning. The code worked well when tested and during the first hours of launch. Unfortunately, problems handling peak levels of traffic developed later in the day that created intermittent availability and errors for members. These problems have continued in varying degrees since Friday. Account data and personal information have not been compromised by these issues. eBay and PayPal technical teams are working at full force to fix the underlying problems and improve site access. We will continue to update you on the status of this situation. Regards,

eBay and PayPal are continuing to work to resolve these issues, and we will continue to update you. We understand the inconvenience this issue has caused for some members, and we appreciate your patience. Regards,

We sincerely apologize for the inconvenience this may have caused and we appreciate your continued patience.

Members may be experiencing intermittent errors while accessing the PayPal site or when attempting to pay for eBay items with PayPal. We are aware of the problem and are currently working on a solution. We appreciate your patience at this time. Regards,