Thursday, February 26, 2009

local error, global fault - internet infrastructure critique

so i was talking to some folks from our national funding agency about what is interesting net research to do - they were not aware of the set of failures in the internet over recent years caused by "small" errors of configuration leading to global problems -
e.g.
0. the root DNS operator zeroing the database the root servers boot from, so they returned NXDOMAIN for the planet for 6 hours

1. the youtube blackout caused by local BGP config in small asian ISP

2. google mistyping a config rule for listing search result sites as "risk of harm to your computer" and marking 100% of the world as bad

there are others - these represent the problems caused by NOT STAYING WITH THE PROGRAM - the internet is decentralised - organisations that want to own pieces of it horizontally cause problems (there are tools to avoid most of these problems, but they require a modicum of cooperation)...people forget these design philosophy rules (aka architecture) at their peril (and ours:)

2 comments:

Tony Finch said...

Do you have a reference for the root zone outage? I don't remember that happening.

I don't think that BGP outages are an example of failure owing to too much centralization. They're more to do with transitive trust, and lack of sanity checking on who advertises routes (owing to slack network administration or because it's inherently hard to check BGP feeds in general). The AS7007 incident in 1997 was a wake-up call but too many ISPs are still sleeping.

clog said...

verisign accidentally zeroed a dbase they boot their root servers with - i can't find a gospel ref right now - many places survived by boosting the TTL in their caches manually to 24 hours...was a few years back

BGP - the ability to do sanity check depends on global coop but I agree...
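
to make the "sanity check" idea concrete, here is a rough sketch of an origin check on incoming route announcements. everything in it is an assumption for illustration: the registry, the `check_announcement` name, the documentation prefix and the private-use AS numbers are all made up, and a real check would need a shared, trustworthy registry (which is exactly the global coop part) plus things like maximum prefix length that are left out here.

```python
import ipaddress

# hypothetical registry of (prefix, AS allowed to originate it);
# 203.0.113.0/24 is a documentation prefix, 64500 a documentation AS number
REGISTRY = [
    (ipaddress.ip_network("203.0.113.0/24"), 64500),
]


def check_announcement(prefix, origin_as, registry):
    """Classify an announcement as 'ok', 'bad-origin' or 'unknown'."""
    announced = ipaddress.ip_network(prefix)
    # registry entries that cover the announced prefix (same IP version only)
    covering = [
        (net, asn)
        for net, asn in registry
        if net.version == announced.version and announced.subnet_of(net)
    ]
    if not covering:
        return "unknown"  # nobody registered this address space
    if any(asn == origin_as for _, asn in covering):
        return "ok"  # a registered holder is originating it
    return "bad-origin"  # covered space, but announced by someone else


if __name__ == "__main__":
    # the registered holder announcing its own prefix
    print(check_announcement("203.0.113.0/24", 64500, REGISTRY))
    # a more-specific announced by a different AS, the shape of the youtube incident
    print(check_announcement("203.0.113.128/25", 64511, REGISTRY))
```

the point of the sketch is that the check itself is trivial; the hard part is that every network has to agree on, and keep up to date, what goes in the registry.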