As IT becomes more central to everything – both work and play – the cost of disruptions and outages continues to escalate. So while more attention – and dollars – are being devoted to disaster recovery and business continuity, Chris Yetman’s call to embrace failure would seem counterintuitive.
However, there is a method to his seeming madness. Yetman is SVP of Operations at Vantage Data Centers and the former VP of Amazon Web Services Infrastructure Operations, where he had worldwide responsibility for operations and networking for Amazon’s data centers and oversaw a tenfold increase in the number of servers.
“IT is not about technology, it’s about adding business value.” One way to add value is to increase server utilization, currently languishing around 5-15% in most data centers. “The whole idea is to stuff the snot out of the machine,” he said. “The real fun starts when you can run 40-60% utilization… and can get more bang for the buck using what you’ve already got…”
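To put rough numbers on that utilization gap, here is a minimal sketch in Python. It assumes, purely for illustration, a hypothetical fleet of 1,000 servers whose workload can be consolidated freely; the function name and figures are not from Yetman's talk, and real workloads rarely pack this neatly.

```python
import math

def servers_needed(current_servers, current_util, target_util):
    """Servers required to carry the same total work at a higher average utilization.

    Utilizations are fractions between 0 and 1. Assumes work can be moved
    freely between identical machines (an idealization).
    """
    if not (0 < current_util <= 1 and 0 < target_util <= 1):
        raise ValueError("utilization must be in the range (0, 1]")
    total_work = current_servers * current_util  # in "fully busy server" units
    # Round before taking the ceiling to avoid floating-point noise
    # pushing an exact answer up by one server.
    return math.ceil(round(total_work / target_util, 9))

# Hypothetical fleet: 1,000 servers at 10% utilization, consolidated to 50%.
needed = servers_needed(1000, 0.10, 0.50)
print(f"Servers needed at 50% utilization: {needed}")
```

Under these assumptions the same work fits on a fifth of the machines, which is where the savings in power, cooling and floor space would come from.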
Tapping into that unused potential brings benefits beyond adding value and cutting costs, including reduced power, cabling, cooling and floor space requirements. “Stop trying to protect the systems from failure,” said Yetman. Instead, learn how to master what happens when a server fails. “Then you can go back to treating machines the way you should, rotten. They’re just machines.”
Organizations are afraid to allow equipment to fail and work hard to avoid breaking a server. “Too often we are needlessly cautious when it comes to hardware. Too often we make IT decisions based on fear. These decisions cost us more than we suspect.”
IT equipment can handle higher temperatures than the industry currently runs at, he said. Going hotter will reduce your power costs and improve your Power Usage Effectiveness (PUE), and coupling that with a focus on recovering faster from the relatively slight increase in hardware failures will “set you free”.
As an example, he said a large data center running five-plus megawatts can spend $500,000 or more on energy per month. However, if you run the temperature up to 80 degrees Fahrenheit in the cold aisle, taking your PUE down from 1.3 to 1.2, you will save over $500,000 per year in energy and break a whole lot less than that in hardware.
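The arithmetic behind that savings figure can be sketched in a few lines of Python. This assumes the five megawatts refers to IT load and uses an electricity price of $0.12 per kWh; both are assumptions made here for illustration, not figures from Yetman, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days

def annual_energy_cost(it_load_mw, pue, price_per_kwh):
    """Yearly facility energy cost in dollars.

    PUE is total facility power divided by IT power, so total power
    is the IT load multiplied by the PUE.
    """
    total_kw = it_load_mw * 1000 * pue
    return total_kw * HOURS_PER_YEAR * price_per_kwh

def annual_savings(it_load_mw, pue_before, pue_after, price_per_kwh):
    """Dollars saved per year by lowering PUE at a constant IT load."""
    return (annual_energy_cost(it_load_mw, pue_before, price_per_kwh)
            - annual_energy_cost(it_load_mw, pue_after, price_per_kwh))

# Assumed inputs: 5 MW of IT load, $0.12 per kWh (illustrative only).
savings = annual_savings(5.0, 1.3, 1.2, 0.12)
print(f"Estimated annual savings: ${savings:,.0f}")
```

With these inputs the estimate comes out a little above $500,000 per year, in line with the figure quoted; a lower electricity price or a smaller IT load would shrink it proportionally.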
The challenge shouldn’t be to create bullet-proof hardware, but to write applications to be resilient. “Managing state in your applications well enough to recover from server failure is not easy. But the rewards in lowered PUE and TCO are well worth the investment.”
1-Be brave. Embrace failure.
2-Maintain infinite control.
3-Pick the right equipment.
4-Increase the voltage.
5-Reduce pressure drop.