Data Center Knowledge | News and analysis for the data center industry
Wednesday, July 5th, 2017
3:23p | Machine Learning Tools are Coming to the Data Center
Back at the dawn of the internet, data centers could be small and simple. A large ecommerce service could make do with a couple of 19-inch racks holding all the necessary servers, storage, and networking. Today’s hyper-scale data centers cover acres, with tens of thousands of hardware boxes sitting in thousands of racks. Along with the design changes, these mega-server farms have been built in new, remote locations, trading proximity to large population centers for cheap power.
As they automate data center operations, public clouds like Amazon Web Services or Microsoft Azure hire fewer and fewer highly skilled data center engineers, who are usually outnumbered by security staff and relatively low-skilled workers who do manual labor, such as handling hardware deliveries. Fewer staff managing more servers means monitoring the power and cooling infrastructure requires greater reliance on sensors, which we might now call Internet of Things hardware. They help identify issues to an extent, but there are many cases in which the experience of a seasoned facilities engineer is hard to replace with sensors. These are things like recognizing a sound that indicates a fan is about to fail or locating a leak by hearing the sound of water drops.
You need more than sensors to monitor modern data center infrastructure, and a new generation of applications aims to fill the gap by applying machine learning to IoT sensor networks. The idea is to capture operator knowledge and turn it into rules to help interpret sounds and video, for example, adding a new layer of automated management for increasingly empty data centers. The services promise “to predict and prevent data center infrastructure incidents and failures,” Rhonda Ascierto of 451 Research told Data Center Knowledge. “Faster mean time to recovery and more effective capacity provisioning could also reduce risk.”
See also: Deep Learning Driving Up Data Center Power Density
Predictive Analytics and Wider Data Variety
The first step in this direction is predictive analytics in data center infrastructure management, or DCIM, software. One example is software by a company called Vigilent, based in Oakland, California. Its “control system is based on machine-learning software that determines the relationships between variables such as rack temperature, cooling unit settings, cooling capacity, cooling redundancy, power use, and risk of failure. It controls cooling units, including variable frequency drives (VFDs), by turning units on and off, adjusting VFDs up or down, and adjusting units’ temperature setpoints,” Ascierto said. It uses wireless temperature sensors and predicts what would happen if an operator took a certain action – such as shutting off a cooling unit or increasing set-point temperature.
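To make the “what-if” idea concrete, here is a minimal sketch of how such a predictive step could work: learn how rack inlet temperature responds to a cooling setpoint from historical samples, then predict the effect of a setpoint change before making it. The model, data, names, and safety limit are all illustrative assumptions, not Vigilent’s actual system.

```python
def fit_linear(xs, ys):
    """Ordinary least squares fit for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Historical (cooling setpoint in C, observed rack inlet temp in C) pairs.
history = [(18, 22.1), (19, 23.0), (20, 24.2), (21, 25.1), (22, 26.0)]
slope, intercept = fit_linear([s for s, _ in history], [t for _, t in history])

def predict_rack_temp(setpoint):
    """Predict rack inlet temperature for a hypothetical setpoint."""
    return slope * setpoint + intercept

# "What would happen if we raised the setpoint to 24 C?"
predicted = predict_rack_temp(24)
SAFE_LIMIT = 27.0  # illustrative inlet-temperature limit
print(f"predicted inlet temp at setpoint 24: {predicted:.1f} C, "
      f"{'OK' if predicted <= SAFE_LIMIT else 'UNSAFE'}")
```

A production system would use many more variables (cooling capacity, redundancy, power draw) and a richer model, but the principle is the same: simulate the action against a learned model before touching real hardware.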
A different example is Oneserve Infinite, which mixes sensors with a wider variety of data points, pulling in, for example, usage data and weather conditions to deliver what the Exeter, England-based company calls “Predictive Field Service Management.” The aim here is to predict maintenance requirements, avoid failures, and keep downtime to a minimum. Chris Proctor, Oneserve’s CEO, told us that by applying these techniques, it should be possible to also handle strategic planning and procurement. “A data center would be able to manage their assets and their resources much more accurately and effectively,” he said. (To our knowledge, this kind of functionality isn’t yet live in any data center.)
Oneserve focuses on wider maintenance issues, but the approach maps well with how data centers operate, working with in-house operations and third-party contractors. One useful aspect of its tooling is a dashboard that tracks issues with past maintenance, allowing users to detail where access may be difficult, or where problems have occurred multiple times. Today that’s a very manual approach, but you’ll need this kind of data to train a machine learning system in the future.
Tapping Human Knowledge
One example of a company that combines sensor data with operator knowledge is San Jose-based LitBit. According to Scott Noteboom, its founder and CEO, who in the past led data center strategy for Yahoo and later Apple, LitBit’s data center AI, or DAC, allows operators to build, train, and tune their own “co-workers” using machine-learning techniques. These could respond to events across a data center, alerting operators or, eventually, automating actions. The key to LitBit’s approach is a form of assisted learning, where the system alerts the operators when it detects a new abnormal event, and the operators then create a set of rules for reacting to such events in the future. To collect data, LitBit has a mobile app that takes videos, which it can then turn into thousands of images for training.
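The assisted-learning loop described above can be sketched in a few lines: flag readings that fall outside what the system has seen before, escalate novel events to an operator, and reuse the operator’s labels as rules the next time a similar event appears. The class, thresholds, and readings below are illustrative assumptions, not LitBit’s implementation.

```python
class AssistedAnomalyDetector:
    def __init__(self, tolerance=3.0):
        self.samples = []           # readings observed during normal operation
        self.rules = {}             # operator-labeled event -> action
        self.tolerance = tolerance  # how far outside normal counts as abnormal

    def observe(self, value):
        """Return a learned action, 'ask-operator' for novel events, or None."""
        if len(self.samples) >= 5 and self._is_abnormal(value):
            label = self._classify(value)
            if label in self.rules:
                return self.rules[label]  # known event: apply the learned rule
            return "ask-operator"         # novel event: escalate to a human
        self.samples.append(value)        # normal reading: extend the baseline
        return None

    def teach(self, value, action):
        """Operator labels an abnormal reading with the action to take next time."""
        self.rules[self._classify(value)] = action

    def _is_abnormal(self, value):
        mean = sum(self.samples) / len(self.samples)
        spread = (max(self.samples) - min(self.samples)) or 1.0
        return abs(value - mean) > self.tolerance * spread

    def _classify(self, value):
        mean = sum(self.samples) / len(self.samples)
        return "high" if value > mean else "low"

# Fan-vibration readings (arbitrary units): a quiet baseline, then a spike.
detector = AssistedAnomalyDetector()
for reading in [1.0, 1.1, 0.9, 1.0, 1.05]:
    detector.observe(reading)
print(detector.observe(9.0))                  # novel spike -> ask-operator
detector.teach(9.0, "page-facilities-team")   # operator writes the rule
print(detector.observe(9.5))                  # next spike -> learned action
```

The first spike pages a human; once labeled, similar events trigger the stored response automatically, which is the “build, train, and tune your own co-workers” pattern in miniature.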
The startup provides a managed cloud service, which will allow it to take advantage of many users’ anonymized data to build more complex and more accurate models; while some customers will choose to keep their trained models secret, others might sell theirs as an additional source of revenue. As Ascierto pointed out to us, “the value of data center management data multiplies when it is aggregated and analyzed at scale. By applying algorithms to large datasets aggregated from many customers – with diverse types of data centers and in different locations – … suppliers can, for example, predict when equipment will fail and when cooling thresholds will be breached.”
More on LitBit: This IoT Startup Wants to Break Down Data Center Silos
Don’t Go Seeking a Career Coach Just Yet
There’s a lot of implicit knowledge in operations, and surfacing it as rules can help identify problems and react faster, especially when the human operator with the knowledge isn’t around. Even if you don’t operate large geographically isolated data centers, you still want to be able to respond effectively during off-hours or during staff illness. A data center AI probably won’t completely replace your operations staff, but it could become a tool that enhances their existing skills and helps transfer them to other team members.
This area isn’t mature, but it’s developing fast. Machine learning applications using sensor data are improving rapidly and being used across a wide range of industries. Microsoft Research, for example, has been working with Sierra Systems to develop machine learning-based audio analysis for oil and water pipeline defects, using its Cognitive Toolkit to help classify anomalies. At the other end of the scale, machine learning models and tooling built for hyper-scale clouds are being downscaled, with compressed neural networks using quantized weights running on low-capacity devices like the Raspberry Pi.
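As a toy illustration of the weight quantization mentioned above: map 32-bit float weights to 8-bit integers plus a single scale factor, shrinking storage roughly 4x so a model can fit on a small device. Real toolchains are far more sophisticated (per-channel scales, calibration, quantization-aware training); this only shows the core idea, and the weights are made up.

```python
def quantize(weights):
    """Symmetric 8-bit quantization: w is approximated by q * scale, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and scale."""
    return [x * scale for x in q]

weights = [0.82, -0.41, 0.05, -0.97, 0.33]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(q)                            # small integers instead of 32-bit floats
print(f"max error: {max_err:.4f}")  # rounding error is bounded by scale / 2
```

Each weight now needs one byte instead of four, at the cost of a bounded rounding error, which is why quantized networks can run acceptably on hardware like the Raspberry Pi.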
Don’t expect implementing an AI-based data center management service to give you instant results; the technology is new, the services are still in development, and they will need a lot of training. You may well need more sensors than you already have for your DCIM software, Ascierto points out. “If you wanted to exploit AI for end-to-end chiller-to-rack decisions, then acoustic and vibration sensors would be required for some equipment, as well as environmental sensors and power meters. If the goal is to optimize and automate setpoint temperatures for cooling units, then multiple environmental sensors per rack (top, middle, bottom) may be required.”
The underlying data models may be there, but they will also have to be tuned for your specific equipment, your specific workload, and, most importantly, your site’s idiosyncrasies. Training an AI support system will take time, just like bringing a new human operator on board, but in time, machine learning tools similar to those already running in production in the cloud will help run your data center.
6:58p | Oracle’s Hurd Bullish on Cloud Business, Says Enterprise Market Largely Untapped
Oracle co-CEO Mark Hurd said enterprises have yet to spend most of the money they will eventually spend on cloud services that will replace their on-premises data centers, implying that even though Amazon Web Services has a massive lead in the market today, its share is still tiny compared to the market’s potential size.
He made the comments during a recent media event at the company’s headquarters in Redwood City, California, where he was interviewed onstage by Recode’s Kara Swisher. The conversation focused largely on the enterprise cloud market and Oracle’s role and aspirations in the space.
Cloud Data Center Spend
The cloud services business is capital-intensive; the likes of Microsoft and Google have been spending in the neighborhood of $10 billion each annually to build out the data center infrastructure for their global cloud empires. At an event earlier this year, Hurd told an audience that because Oracle had faster servers, the company didn’t need to spend as much as its competitors on data centers.
At least one of those competitors called his bluff publicly. There’s little indication that Oracle has cloud hardware superior to the hardware that runs in Amazon, Microsoft, or Google data centers, and many of the top engineers who designed Oracle’s new cloud platform came from those competitors’ infrastructure teams.
Hurd’s answer to the capital question was less heavy-handed this time around. He acknowledged that, despite investors’ wishes, big capital expenditures were unavoidable for a company that wants to play in the space, but added that it is possible to use technology to cut that Capex in half, or even by three quarters:
“If I do something technically, I might be able to halve my Capex cost. I might be able to turn it into a quarter. If I multi-tenant my database, if I put it in memory, if I multi-tenant my middle tier, I can actually shrink the number of data centers I need and actually shrink my Capex at the same time. I need capital, but I also need innovation, and I need technology.”
Oracle decided to “go at this hard three or four years ago,” Hurd said, referring to the company’s recent ramp-up in investment in its cloud business. That investment included a big push to build a new cloud platform and launch data centers to host it.
“We made the decision to build all of these global data centers to deploy this technology and to do it fast. We invested in incremental R&D. Our R&D six years ago was $3.7 billion. It’s now over $5 billion.”
Some wholesale data center providers in the US benefited from the recent spike in data center spend by Oracle. In 2015, the company reportedly leased about 25MW of wholesale data center capacity in Northern Virginia alone (Northern Virginia is America’s largest data center market, home to more cloud data center capacity than any other region). In 2016, it was one of the two companies that leased the most data center space in the US. The other one was Microsoft.
A Poster Cloud Deal With AT&T
Hurd expanded somewhat on the big cloud deal Oracle recently closed with AT&T. The American telco has been a big Oracle database user for a long time, according to him, and will be moving that data to the Oracle cloud platform.
“I think they have in excess of an exabyte of data, which is you ask how much, it’s just a lot. They have it in a lot of Oracle databases, and some of those Oracle databases are actually really big Oracle databases, and so what we’ve come to as an agreement to really bring all the benefits of Oracle out to AT&T, and we’ll work together collaboratively as we migrate those databases to the Oracle cloud.”
AT&T will also be taking advantage of some cloud applications by Oracle, starting with the field service app, which will automate the process currently done by the telco’s 7,000 or so technicians.
“The First Inning”
The run rate of Oracle’s cloud services business today is north of $4 billion, Hurd said, adding that its current growth rates will get it to $10 billion “relatively shortly.”
While highlighting growth of Oracle’s cloud software business, Hurd indicated that rather than placing a particular focus on a single layer of the enterprise cloud stack, his company is going for a comprehensive play:
“…we’re coming in with more of a complete stack, applications, platform and infrastructure.”
In comparing Oracle’s performance in the overall cloud services market to Amazon, Microsoft, and Google, Hurd said his company – often perceived as a latecomer in the space – had for a long time been focused on cloud applications and was now building out the infrastructure to support that business.
To highlight how big the overall cloud opportunity is and how little of it has been tapped so far, even by the leading cloud providers, he said it is difficult for any cloud provider today to capture even $50 billion of the trillion-dollar enterprise technology market.
“I would say we’re in the first inning of all this.”
He repeated a forecast he’s made before, saying 80 percent of corporate data centers that exist today will be gone within the next decade.
“All of DevTest, which is a third of the industry, will go to the cloud. All the commercial applications will go that have a commercial alternative, and security, which is one of the issues we’ve talked about, will actually flip from being a concern to being a benefit. The security implications in the cloud, particularly of the big enterprise providers, you will be more secure there than you will be on the traditional on-prem world.”
7:41p | How to End On-Call IT Burnout and Post-Traumatic Alert Fatigue
Peter Waterhouse is a Senior Strategist for CA Technologies.
In so many ways, IT operations has developed a military-style culture. If IT ops teams are not fighting fires, they’re triaging application casualties. Tech engineers are the troubleshooters and problem solvers who hunker down in command centers and war rooms.
For the battle-weary on-call staff who are regularly dragged out of bed in the middle of the night, having to constantly deal with flaky infrastructure and poorly designed applications carries a heavy personal toll. So, what are the signs an IT organization is engaged in bad on-call practices? Three obvious ones to consider include:
Support teams are overloaded – Any talk of continuous delivery counts for squat if systems are badly designed, hurriedly released and poorly tested. If teams are constantly running from one problem to another then someone or something will eventually break. Of course, good application support engineers try to do the right thing by patching up systems to keep them in action. But such are the stresses of working in these environments that no time is ever available to work on permanent solutions. The result: Applications with Band-aids just limp from one major outage to the next.
Bad practice becomes the norm – If on-call staff are constantly asked to deal with floods of false alarms, then any sense of urgency in responding to those alerts will be diminished – staff become desensitized. It’s a problem well understood in the field of healthcare, where clinical staff have been known to dial back cardiac alarm systems due to a nuisance factor. Similarly in IT, when on-call staff have alert fatigue, they might be inclined to rejig some alert thresholds or hack up an automation to put old incident paging systems into snooze mode. Whatever the cheat, the results are never good.
Poor visibility and insight – What could be worse than being woken up at 3 a.m. to deal with a tech crisis? Being woken up at 3 a.m. and being absolutely powerless to do anything about it. Even with a swag of open source monitoring tools at their disposal, including log aggregators and dashboard systems, on-call teams still struggle to address complex problems. Not because these tools are bad per se, but because narrowly focused monitoring only provides partial answers. That’s always been troublesome, but it is now even more problematic due to the distributed, API-centric nature of microservice-style architectures.
Poor visibility doesn’t only manifest technically; there are people issues too. If senior managers aren’t aware of on-call burnout or just turn a blind eye, then methods should be employed to help them wake up and smell the stink. A good place to start is discussing the people cost associated with stressful on-call rotations. If, however, the empathetic approach falls short, try presenting all those latency, saturation, and utilization issues in the context of business impact – like revenue, profit, and customer satisfaction.
Improving Conditions for Better Business Results
Apart from using monitoring to present on-call calamities in clear business terms, there are many other common-sense approaches that can help give on-callers their lives back.
Make alerts actionable – What’s the point of alerting on machine-related issues when they have no tangible impact on the business? Good monitoring avoids this by aggregating metrics at a service level and only alerting on-call staff when customers are hurting and problems need fixing immediately. Anything else can wait until tomorrow, when everyone’s had a good night’s sleep.
Automate runbooks – It’s a good practice to develop concise documentation that guides on-call staff during major service disruptions. That’s all fine and dandy but runbook effectiveness is highly dependent on development teams providing clear and up-to-date instructions, which isn’t always top of mind. Although there’s no substitute for good support documentation, advanced analytics-based monitoring tools can augment manual detective work with fully automated evidence gathering, correlation and recovery workflows.
Put developers on-call – However good on-call support engineers are, no one knows the idiosyncrasies of an application better than the people who wrote the actual code. Putting developers on-call means the people who most likely caused the problem are the ones being put on the spot to fix it. Witnessing programming stuff-ups first hand in the small hours of the morning is also a great motivator to put things right – permanently.
Audit continuously – Even if the ultimate goal is to never page on-call staff, a more realistic objective is to ensure staff never get paged for the same problem twice. Again, good monitoring tools and analytics can support this – by for example, reviewing performance alerts over historical time periods and correlating with infrastructure or code changes.
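One way to sketch the never-paged-twice goal: when a new page arrives, check it against a record of previously resolved incidents, and flag repeats so the team audits why the earlier fix did not hold. The incident record, matching key (service plus alert name), and fix text below are all made-up illustrations; real tooling would correlate against change history and time windows.

```python
from datetime import datetime

# Record of incidents that were already resolved (illustrative data).
resolved_incidents = {
    ("checkout", "db-connection-pool-exhausted"): {
        "resolved": datetime(2017, 6, 20),
        "fix": "raised the connection pool size",
    },
}

def audit_page(service, alert):
    """Flag repeat pages so the team fixes root causes, not symptoms."""
    prior = resolved_incidents.get((service, alert))
    if prior:
        return (f"REPEAT: seen before, resolved {prior['resolved']:%Y-%m-%d} "
                f"({prior['fix']}) - audit why the fix did not hold")
    return "NEW: open an incident and record the resolution when done"

print(audit_page("checkout", "db-connection-pool-exhausted"))  # repeat page
print(audit_page("checkout", "disk-full"))                     # genuinely new
```

Even this naive key catches the worst pattern, the same alert waking the same team month after month, and turns it into an auditable backlog item instead of accepted noise.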
When employees are continuously placed in stressful on-call situations they burn out, and so will your business. By combining constant on-call reviews and auditing with advanced monitoring practices, organizations can eliminate alert fatigue, increase service reliability, and reduce the need for costly unplanned work.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Penton.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.