Skype went out for about 24 hours just before Christmas. Skype management is embarrassed and promise this will not happen again, which of course is true. The particuar situation is now prevented. However, the question is: Will Skype never be out again?
Skype’s CIO explains what went wrong in this post mortem of what I’d call Skype’s first Black Swan.
To summarise, it was a high load on Skypes infrastructure which triggered a bug in a certain version of the Windows client for Skype which again increased the load on the infrastructure, thereby rapidly taking the entire network down and making it almost impossible to get it up again.
The bug was always there of course, and it was probably already known internally at Skype. It is also possible that the risk of server overloading and service degradation had been identified, but obviously not in the context of making a complete system crash a likely possibility (if so, they would have prevented it). Further, I’m quite certain that the risk of the client bug affecting the server load had not been identified. Humans are positive thinkers, as Taleb documents in his book: The Black Swan: The Impact of the Highly Improbable.
So Skype’s challenge now is to prevent outages in general by identifying and preventing Black Swans in general. This will involve a cross organisational backwards thinking process, which the innovation driven company has probably not been focusing on at all until now. (I may actually be wrong here, Janus Friis, one of the founders of Skype, used to work in a support function of an ISP so he may have been involved in preventing problems, but generally, startups think very positively, and even if Skype has millions of users, it’s still a very young company – a startup.)
One may think that this is going to be extremely expensive for Skype since they will have to predict every possible way their system can go wrong. It does not have to be that expensive, although it will cost money.
When securing a nuclear facility, engineers don’t have to analyze every possible way a disaster can happen, instead they think: How can we prevent failure at every level? This is what I mean with “backwards thinking” – start assuming something is failing, then work backwards identifying ways to prevent it becoming worse.
This is done on multiple levels: On component level, asking what can go wrong here and how can we prevent a bug or incident from affecting the rest of the system? And on system level, assuming that disaster is happning, how can we prevent it from developing.
I assume that’s what Skype is doing now.
Wishing everyone a happy 2011!