Monday, December 27, 2010

My thoughts on Skype Outage

Recently, there was a Skype outage and people did not know what to make of it. Was it an attack, a Skype bug, or a Windows bug? As of now, most are saying it was a Windows update bug (ain’t that the truth, LOL). Here are some of my thoughts on the matter. The Skype network, which has been known to support up to 20 million users online, is very similar to the network behind LimeWire (more formally, the Gnutella network). This is because the Skype developers are the same ones who wrote the famous Kazaa P2P filesharing program, whose FastTrack network is built on the same kind of design. They realized how easy it was for individuals to share files over the Internet with little infrastructural support and decided to use the same concept to send Voice over IP streams instead of file streams. VoIP communication can run on very low bandwidth (less than 10 KB/s), so as long as you keep latency to a minimum (under 150 ms) and buffer enough to counteract jitter, people can have conversations over the Internet directly, using their own download/upload bandwidth and resources. So Skype can support millions of people without having to buy a bunch of servers to host services, which is a very attractive business model.

One of the main services Skype provides is peer discovery. Peer discovery is the ability to “discover” information about how I can connect to a peer that I am interested in. For example, Alice would like to talk to Bob, but before Alice can talk to Bob, Alice has to “discover” information on how to reach Bob. Because of how the Internet is set up, Bob usually does not have a static endpoint (or IP address). Even if you discover Bob's IP address, you may not be able to connect to Bob directly because Bob is behind a NAT (network address translator; most home users know this as their wireless router). As a result, most P2P networks (such as Skype and Gnutella) use a hybrid model with supernodes and leaf nodes. Supernodes are SUPER, meaning you can connect to them directly because they are not behind NATs or firewalls. NATs/firewalls usually allow outgoing connections while blocking incoming connections. That's why you can initiate a connection to a web server, but the web server cannot initiate a connection to you. Back to Alice and Bob: if they are both using their wireless laptops at home, they usually need the help of a supernode to set up their communication channel. When Bob runs Skype, Bob's machine automatically connects to a few supernodes. When Alice wants to communicate with Bob, she has to find out which supernodes Bob is connected to through a flood-based, random-walk, or DHT-based search (this is similar to searching for a file you’re interested in; Gnutella uses the first two, eDonkey uses the last one). Once Alice discovers the supernodes that Bob is connected to, she can send a communication request to Bob through these supernodes and they can initiate a bidirectional VoIP channel (if Bob is not firewalled, Alice can also try to connect to Bob directly through the IP address discovered in the search process). Moral of the story: the supernodes are important to the Skype network.
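To make the flood-based flavor of that search concrete, here is a minimal sketch in Python. It is not Skype's actual protocol (which is closed); the supernode names, the link table, the user registrations, and the `flood_search` function are all made up for illustration. Alice's supernode asks its neighbors, they ask theirs, and a hop limit (TTL) keeps the query from flooding the whole network.

```python
from collections import deque

# Hypothetical overlay: which supernodes are linked to each other.
SUPERNODE_LINKS = {
    "S1": ["S2", "S3"],
    "S2": ["S1", "S4"],
    "S3": ["S1"],
    "S4": ["S2", "S5"],
    "S5": ["S4"],
}

# Hypothetical registrations: which leaf users are attached to each supernode.
ATTACHED_USERS = {
    "S1": {"alice"},
    "S4": {"bob"},
    "S5": {"bob"},
}


def flood_search(links, attached, start, target, ttl):
    """Flood a query outward from `start`, at most `ttl` hops.

    Returns the set of supernodes that report `target` as an attached user.
    """
    found = set()
    seen = {start}                  # supernodes that already got the query
    queue = deque([(start, ttl)])   # (supernode, hops the query may still travel)
    while queue:
        node, hops_left = queue.popleft()
        if target in attached.get(node, ()):
            found.add(node)
        if hops_left == 0:
            continue                # TTL expired: do not forward further
        for neighbor in links.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops_left - 1))
    return found


# Alice is attached to S1 and looks for Bob within 3 hops.
print(sorted(flood_search(SUPERNODE_LINKS, ATTACHED_USERS, "S1", "bob", ttl=3)))
# prints ['S4', 'S5']
```

With a TTL of 2 the query only reaches S4, and with a TTL of 1 it finds nothing, which is the usual trade-off in flooding: a bigger TTL finds more but costs every supernode more traffic.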

It turns out that Skype has about 100,000 supernodes (Gnutella has a similar number) to support the millions of users that it has. The beauty of Skype (and P2P in general) is that Skype does not own the supernodes; they are selected from among Skype's users. Therefore, Skype does not have to invest millions to support this discovery and communication setup. It uses the users' resources to make it happen. In contrast, GoogleChat also allows for VoIP communication, but it uses its own resources to enable this communication. The Skype outage occurred because a large fraction of the supernodes went offline. As a result, millions of users started looking for other supernodes to connect to, and the remaining supernodes became overloaded, since each can only support a limited number of users (in Gnutella, a supernode can support 30 users, but it's probably more for Skype). Basically, there were not enough supernodes for all of the users. Skype's solution was to use its own resources to run MEGA-supernodes in order to restore the Skype network. It's funny because by running these supernodes, Skype temporarily became similar to GoogleChat, losing some of the savings of running a P2P network. The other issue is that Skype does not have direct control over the supernodes, which makes it harder to diagnose the problem and guard against future attacks. If this had been an attack on GoogleChat's "supernode" servers (called STUN servers), Google would have diagnosed the issue much faster. But for now it seems Skype has to speculate about the reasons, because it does not have direct access to the machines that went offline. Debugging P2P systems is usually harder than debugging traditional systems because you lose the ability to thoroughly monitor the system: the integral part of it is not safely running in your data center somewhere but out in the wild Internet on users' machines.
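Some back-of-the-envelope arithmetic shows why losing supernodes hurts so quickly. The sketch below uses the 20 million users and 100,000 supernodes mentioned above, but the per-supernode capacity of 250 users and the 30% offline figure are pure assumptions on my part (nobody outside Skype knows the real numbers), and `overload_report` is just an illustrative helper.

```python
def overload_report(users, supernodes, capacity_per_supernode, offline_percent):
    """Estimate how many users are left without a supernode slot.

    `offline_percent` is a whole-number percentage of supernodes that went down.
    """
    surviving = supernodes * (100 - offline_percent) // 100
    total_capacity = surviving * capacity_per_supernode
    stranded = max(0, users - total_capacity)
    # Average users each surviving supernode would have to carry.
    avg_load = users / surviving if surviving else float("inf")
    return {
        "surviving_supernodes": surviving,
        "total_capacity": total_capacity,
        "stranded_users": stranded,
        "avg_load_per_supernode": avg_load,
    }


# Figures from the post: 20 million users, 100,000 supernodes.
# Assumed (hypothetical): 250 users per supernode, 30% of supernodes offline.
report = overload_report(20_000_000, 100_000, 250, 30)
print(report["surviving_supernodes"])  # prints 70000
print(report["stranded_users"])        # prints 2500000
```

Under these assumed numbers the healthy network has room to spare (100,000 x 250 = 25 million slots), but with 30% of the supernodes gone, 2.5 million users have nowhere to connect, and they keep hammering the survivors looking for a slot.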

The biggest question in the whole story is: will it happen again, or can it? Basically, can we still rely on Skype? Of course it can happen again, but it probably won't. Just like any system, Skype can be attacked with DDoS attacks similar to those used recently to take down WikiLeaks, Visa, PayPal and so on. The attacker just has to flood the supernodes, or even Skype’s MEGA-supernodes, to cause an outage. Many people believe that it would require significantly more resources (botnets) to attack Skype because of its distributed nature (that may be true, but I'm not sure). Overall, although Skype is a P2P system, its increased dependence on a much smaller subset of its network makes it vulnerable. But this is a tough problem to solve, because many networks have supernodes that form naturally and become targets for attack (the power grid is famous for such weaknesses). One obvious solution I can think of is for Skype to lower the criteria for a node to become a supernode, so instead of 100,000 supernodes, you have 1,000,000 supernodes. But that solution has its own consequences. In the end, Skype is great (except for its wiretapping features) and I just wish it were free and open-source software.

3 comments:

  1. Sounds like Skype just needs to bankroll enough mega nodes to fight DDoS attacks and bugs. They can use Amazon EC2.

  2. You know, I had the exact same thought when I found out that they were rolling out MEGA nodes. I thought "I wonder if they are using EC2". But it seems like they are using their own resources for now. But I think a good solution would be for them to periodically monitor the size of the network through crawls and if supernode count starts to drop, they automatically spawn some EC2 meganode instances. Skype should give me a job.

  3. to ptony82 - I agree! They should give you a job :-)
