Data Center Knowledge | News and analysis for the data center industry
Tuesday, September 2nd, 2014
Why Data Scientists Want More Than Hadoop
Marilyn Matz is CEO and co-founder of Paradigm4, the creator of SciDB, a computational database management system used to solve large-scale, complex analytics challenges on Big – and Diverse – Data.
Many new analytical uses require more powerful algorithms and computational approaches than what’s possible in Hadoop or relational databases. Data scientists increasingly need to leverage all of their organization’s data sources in novel ways, using tools and analytical infrastructures suitable for the task.
As we found out from our survey of data scientists, organizations are moving increasingly from simple SQL aggregates and summary statistics to next-generation complex analytics. This includes machine learning, clustering, correlation and principal components analysis.
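To make "complex analytics" concrete, here is a minimal sketch of the kind of workload the survey respondents describe, in Python; the synthetic data and the NumPy/scikit-learn library choices are illustrative assumptions, not tools named in the survey.

```python
# Illustrative only: the kind of "complex analytics" the survey describes
# (correlation, principal components analysis, clustering) on synthetic data.
# NumPy/scikit-learn are assumed here; the article names no specific tools.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # 1,000 observations, 20 features

corr = np.corrcoef(X, rowvar=False)        # pairwise feature correlations
components = PCA(n_components=3).fit_transform(X)                 # PCA
labels = KMeans(n_clusters=4, n_init=10).fit_predict(components)  # clustering

print(corr.shape, components.shape, np.bincount(labels))
```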
Hadoop missing the mark
Hadoop is well suited for simple parallel problems, but it comes up short for large-scale complex analytics. A growing number of complex analytics use cases are proving to be unworkable in Hadoop. Some examples include recommendation engines based on millions of customers and products, running massive correlations across giant arrays of genetic sequencing data, and applying powerful noise-reduction algorithms to find actionable information in sensor and image data.
Currently, first-wave Hadoop adopters like Google, Facebook and LinkedIn are required to have a small army of developers to program and maintain Hadoop. But many organizations either don’t have the resources required for Hadoop and MapReduce programming expertise in-house or they face complex analytics use cases that can’t be readily solved with Hadoop. Since Hadoop does not support SQL, joins and other key functionality required for managing and manipulating data are not available to data scientists.
Addressing significant shortcomings
Hadoop vendors have also recognized the limitations. They are adding SQL functionality to their products to accommodate data scientists’ preference for a higher-level query language over low-level programming languages like Java, and to address the limitations of MapReduce.
For example, Cloudera has abandoned MapReduce and is offering Impala to provide SQL on top of the Hadoop Distributed File System (HDFS). Splice Machine and Hadapt are also adding SQL-sitting-on-Hadoop solutions to address Hadoop’s significant shortcomings. While these approaches make it easier to program, they are limited in how far they take you because they operate on a file system, not a database management system. Finally, they don’t have atomicity, consistency, isolation and durability (ACID) capabilities that are highly desirable for some applications. And they are slow.
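As a rough illustration of what an SQL-on-Hadoop layer buys a data scientist, here is a minimal sketch of a query against Impala from Python; the impyla DB-API driver, host, port and table names are all assumptions chosen for the example, not details from the article.

```python
# Hypothetical example: an SQL aggregate over data in HDFS via Impala.
# Assumes the impyla DB-API driver; host, port and table are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()
# A few lines of SQL standing in for what would otherwise be a hand-written
# MapReduce job with mappers, reducers and a build/deploy cycle.
cur.execute("""
    SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```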
Beyond SQL functionality, leveraging skill sets
In addition to lacking SQL functionality, Hadoop doesn't effectively leverage data scientist skill sets. In a Hadoop environment, end users typically write MapReduce jobs in Java. But data scientists prefer to work in powerful and familiar high-level languages such as R and Python.
As a result, data stored in Hadoop tends to get exported to a data scientist’s preferred analytical environment, injecting time-intensive, low-value data movement into analytical workflows. Moving data out of Hadoop for analysis, summarization and aggregation and then having to move results back to Hadoop destroys data provenance and makes it difficult for data scientists to seamlessly explore and analyze their data across a spectrum of granularity and aggregations.
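A hedged sketch of that round trip, using the standard hdfs dfs command line plus pandas; the paths, file names and column names are hypothetical and stand in for whatever a real workflow would use.

```python
# Illustrative only: the export / analyze / re-import round trip described
# above. Paths and column names are hypothetical; assumes the Hadoop CLI.
import subprocess
import pandas as pd

# 1. Pull raw data out of HDFS into the analyst's local environment.
subprocess.run(["hdfs", "dfs", "-get", "/data/events/part-00000.csv",
                "events.csv"], check=True)

# 2. Analyze in the data scientist's preferred tool (pandas here).
df = pd.read_csv("events.csv")
summary = df.groupby("user_id")["value"].agg(["mean", "count"])
summary.to_csv("summary.csv")

# 3. Push results back into HDFS -- the step that breaks provenance, since
#    the summary no longer carries a link to the raw rows it came from.
subprocess.run(["hdfs", "dfs", "-put", "-f", "summary.csv",
                "/results/summary.csv"], check=True)
```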
Rethinking Hadoop-based strategies
Many organizations are drawn to Hadoop because the Hadoop Distributed File System enables a low-cost storage strategy for a broad range of data types without having to pre-define table schemas or determine what the data will eventually be used for. While this is convenient, it’s a terribly inefficient approach for storing and analyzing massive volumes of structured data.
The move from simple to complex analytics on Big Data points to an emerging need for analytics that scale beyond single-server memory limits and handle sparsity, missing values and mixed sampling frequencies appropriately. These complex analytics methods can also provide data scientists with unsupervised and assumption-free approaches, letting all the data do the talking. Storage and analytics solutions that leverage inherent data structure produce significantly better performance than Hadoop.
While Hadoop is a useful and pervasive technology, it's hardly a golden hammer. Hadoop and MapReduce environments require significant development resources and fail to leverage the power of popular high-level languages like R and Python preferred by data scientists.
Too slow for interactive data exploration and not suited for complex analysis, Hadoop forces data scientists to move data from the Hadoop Distributed File System to analytical environments, a time-consuming and low-value activity. As data scientists increasingly turn to complex analytics for help solving their most difficult problems, organizations are rethinking their Hadoop-based strategies.
Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena. See our guidelines and submission process for information on participating. View previously published Industry Perspectives in our Knowledge Library.
SimpliVity Converged Infrastructure Adds Cisco UCS
Hyperconverged infrastructure provider SimpliVity launched a new release of its OmniStack Data Virtualization Platform, adding Cisco UCS support.
The company says in addition to its own packaging of OmniStack on Dell x86 appliances, offered as OmniCube, it will now provide OmniStack Integrated Solution with Cisco Unified Computing System (UCS) C-Series Rack-Mount systems.
Cloud Economics – Enterprise Performance
The company says OmniStack-powered hyperconverged infrastructure provides a 3x reduction in costs versus traditional siloed infrastructure.
SimpliVity places its OmniStack software on an x86 platform that provides hypervisor, compute, storage services and network switching. The company says it will consolidate core functions on Cisco UCS C240 rack-mount servers, including the hypervisor, compute, storage, network switching, backup, replication, cloud gateway, caching, WAN optimization, real-time deduplication and more.
Instances of OmniStack-powered hyperconverged infrastructure within and across data centers form a federation that can be globally managed from a single interface via VMware vCenter.
Doron Kempel, chairman and CEO of SimpliVity, said that the company “is thrilled with the market’s response to our unique Data Virtualization Platform. We are not the “first to market”; however, we offer the most complete solution. As we engaged with some of the most sophisticated customers and partners across the globe, demand for an integrated solution with Cisco Unified Computing Systems became a recurring theme. SimpliVity is committed to delivering the best of both worlds: on one hand, x86 cloud economics, reducing TCO by 3x; on the other hand, tier-1 enterprise capabilities: performance, data-efficiency, data protection and global unified management. We are pleased to introduce our OmniStack Data Virtualization Platform, now integrated with Cisco UCS C240 rack-mount servers.”
Kempel calls the company’s approach Convergence 3.0: putting the whole stack in one box, including servers, switch, storage, de-dupe, backup and a WAN function. The company made the announcement at VMworld this week in Las Vegas, where OmniStack fits right into the themes of the software-defined data center and hyperconvergence.
iCloud Data Breach a Black Eye For Cloud In General
Apple is investigating vulnerabilities in iCloud after the service was exploited to hack the accounts of celebrities, leading to the publication of nude photos and videos. There are reports of more than 100 female celebrities being compromised.
A posting on GitHub, an online code-sharing site, by hackappcom said the group had discovered a bug in the Find My iPhone service, which tracks the location of a missing phone and allows a user to disable the phone remotely. The bug allowed an outside user to try passwords repeatedly rather than limiting the number of attempts. As a basic security measure, most online services lock down an account after multiple failed attempts.
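The missing control was ordinary rate limiting. Below is a minimal sketch of the kind of lockout logic most services apply; the attempt threshold and lockout window are arbitrary illustrative values, not Apple's actual policy.

```python
# Sketch of per-account lockout after repeated failed logins. The threshold
# and window are arbitrary illustrative values, not any vendor's real policy.
import time
from collections import defaultdict

MAX_ATTEMPTS = 5
LOCKOUT_SECONDS = 15 * 60

failed_attempts = defaultdict(list)   # account -> timestamps of failures

def login_allowed(account: str) -> bool:
    """Reject further attempts once an account has too many recent failures."""
    now = time.time()
    recent = [t for t in failed_attempts[account] if now - t < LOCKOUT_SECONDS]
    failed_attempts[account] = recent
    return len(recent) < MAX_ATTEMPTS

def record_failure(account: str) -> None:
    failed_attempts[account].append(time.time())
```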
Some media outlets are claiming a brute-force tool called “iBrute” was used to obtain the celebrities’ passwords, giving the attackers access to photos stored in their iCloud accounts.
The vulnerability was patched, but not until after the damage was done and severe violations of privacy had occurred. While it appears that individual accounts, rather than the service itself, were hacked, it is a very public breach that is likely to affect trust in cloud services in general.
“We take user privacy very seriously and are actively investigating this report,” said Apple spokeswoman Natalie Kerris.
The onus falls on both Apple and users to protect themselves. Huge amounts of press coverage have made everyone more aware that they are not necessarily safe storing information in the cloud, and security measures are not always baked in. Steps such as enabling two-step verification are needed on the part of users across cloud services. Strong passwords, or better yet passphrases, greatly reduce the chance of an incident.
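As a rough, back-of-the-envelope illustration of why passphrases help, the comparison below estimates search-space sizes; the 62-character pool and 2,048-word list are assumptions chosen for the example, and the advantage only holds when the words are picked at random.

```python
# Rough search-space comparison; pool sizes are illustrative assumptions.
import math

password_bits = 8 * math.log2(62)       # random 8 chars: letters + digits
passphrase_bits = 6 * math.log2(2048)   # 6 words from a 2,048-word list

print(f"8-character random password: ~{password_bits:.0f} bits")   # ~48
print(f"6-word random passphrase:    ~{passphrase_bits:.0f} bits")  # 66
```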
The event is a reminder that the general public views cloud as safe by default, thanks to the big technology names behind these cloud offerings. Cloud, however, is only as safe as the services that rest upon it. In the case of iCloud, the Find My iPhone service had a serious issue. The event guarantees that users will be more cautious using cloud services, as the only foolproof way to avoid an attack is to not store data online. The use of cloud services will continue to grow; however, the event serves as a wake-up call that not even the biggest services are necessarily foolproof, and that users must take steps to protect themselves.
Peak 10 Opens Third Atlanta Metro Data Center
Peak 10 has opened its third data center in the Atlanta market — the first 15,000-square-foot phase of its facility in Alpharetta, an Atlanta suburb. The site can accommodate two more similar phases.
The data center adds to Peak 10’s existing 35,000-square-foot local presence. The company is seeing healthy growth in the area, and Alpharetta’s tech profile is growing. The town has branded itself as “The Technology City of the South” and presents a rich market for Peak 10 to tap in the near future.
“Peak 10’s reliable, scalable infrastructure is fully equipped to serve our technology-oriented employment community and help them grow,” said David Belle Isle, mayor of Alpharetta. “As we create a culture of technology in Alpharetta, companies like Peak 10 that are enabling innovation and growth are a cornerstone of economic development.”
The company already serves several hundred customers in the greater Atlanta area, including automotive retail business Kauffman Tire, MovieStop, a value retailer of new and used DVD and Blu-ray movies, and Curse, a large online gaming information property.
The new facility’s first fully operational tenant is Lancope, a local provider of network visibility and security intelligence through contextual security analytics. Lancope has been a customer in Peak 10’s Norcross data center for years. It has secured space in the new data center as a way to increase the security and redundancy of its mission-critical data.
Peak 10 continues to expand across its footprint, recently adding space in Cincinnati as well as breaking ground on a 60,000-square-foot data center in Tampa, Florida.
The company was acquired by GI Partners, one of the largest private equity players in the data center market, earlier this year in a deal believed to be between $800 million and $900 million. The acquisition helped fuel aggressive but calculated expansion plans.
Atlanta has emerged as a front-runner in business and technology innovation. It is one of the country’s leading markets for health IT, telecom and Internet security companies, and it is emerging as a mobile payments epicenter.
“With the Atlanta market, we are witnessing a remarkable shift as this region becomes a magnet for technology and innovation driven by advancements of data-centric sectors from finance and mobility to logistics, retail and healthcare IT,” said David Jones, chairman and CEO of Peak 10. “The convergence of these markets supported by data center and cloud services creates boundless growth potential for us and our customers here. As the region continues to expand, we will remain deeply interested and invested in their success.”
HP Rolls Out Gen9 ProLiant Servers
HP launched a new portfolio of ProLiant servers last week, featuring workload optimizations to dramatically increase compute capacity and efficiency. The company said ProLiant Gen9 servers will be in sync with the latest generation of Intel processors being released and change compute economics with modular architecture and converged infrastructure.
HP President and CEO Meg Whitman said in a webcast that HP “will bring organizations three times the compute capacity of previous ProLiants, greater efficiency in processing multiple workloads, make infrastructure provisioning 66 times faster and drive down the cost of ownership.”
Leaving its Moonshot servers in a category all their own, HP Gen9 servers span four architectures – blade, rack, tower and scale-out. HP says the new generation of servers triples compute capacity and increases efficiency across multiple workloads at a lower total cost of ownership, thanks to design optimization and automation.
The features helping to fuel performance gains are HP-unique PCIe accelerators, HP DDR4 SmartMemory and tighter integration with HP OneView. New features for HP OneView will be available later in the year and provide infrastructure lifecycle management capabilities to promote HP’s software-defined management strategy to converge tools for managing server, storage and networking.
The recent IDC global server market report for the second quarter of 2014 shows HP continuing as the leading vendor, with its servers accounting for 25.4 percent of all shipments. HP said additional details on the Gen9 portfolio will be announced at the Intel Developer Forum next week.
The new HP ProLiant Gen9 servers will be available through HP and worldwide channel partners beginning September 8.
IBM Wires U.S. Open Using Cloud Tech, Analytics, Watson
As athletes compete in the thick of the U.S. Open, IBM has decided to remind everyone just how much of its technology is involved in running the massive tennis tournament.
The company has long been a partner of the competition — the collaboration is entering its 25th year — and the technology behind all the stats has evolved greatly over the years. IBM is now providing deeper analytics and an experience optimized for the three screens (desktop, tablet and phone) through which today’s fans follow the spectacle.
The event is a high-profile use of technologies the company hopes to further proliferate in other industries. Clients in banking, government, healthcare, retail and other business verticals are already using the same mobile, analytics, cloud and social technologies that enable the United States Tennis Association to connect with millions of tennis fans around the world.
Making sense of lots of data
IBM had 54 million visits during the last tournament, up 18 percent from the year before. Last year, there were about 5.5 million mobile application launches.
“It all starts with the collection of data,” said John Kent, IBM worldwide sponsorship marketing executive. “The shelf life of data is also very small. The volume of data is so huge and IBM technology helps identify patterns. A lot of this takes place in the cloud. We now have very few servers onsite, except for the scoring system, staging and publishing environment.”
Overall, all the content that’s published — be it videos, photos or news articles — ends up being more than 20 terabytes of data. IBM continuously sifts through data points during the entire tournament, providing commentators with key factoids based on player trends, as well as surfacing general trends by identifying patterns.
The umpire records the outcome of the point and other factoids, while courtside statisticians record stats such as radar readings and player and ball position data. All the information comes to IBM onsite, and the broadcaster puts up a graphic relating to the point, such as the number of aces.
IBM redesigned the application to generate point commentary this year. After the point, relevant factoids are presented.
“The systems we provide are not just data, but we’re also trying to drive some insights,” said Kent. ESPN uses SPSS predictive analytics technology to look at 41 million data points, find patterns and generate ‘keys to the game.’ It looks at factors such as whether or not two competitors have played each other in the past or against players of a similar style. “ESPN makes use of them, picking one of the keys and calling it an IBM insight,” said Kent.
“During last year’s [tournament], we looked at real-time traffic, but [also] other factors, such as player popularity on Twitter,” he said. “If a player is coming up on Twitter, the site reflects the trend. The algorithms the team has written will predict traffic, and Orchestrator will change based on that forecast. What’s critical is not so much what will happen at 5 [o'clock], but what will happen in the next hour.”
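As a rough sketch of the idea Kent describes (forecast the next hour and provision ahead of it), the example below blends recent site traffic with a Twitter-mention trend; the weighting, numbers and scaling rule are invented for illustration and are not IBM's actual model.

```python
# Illustrative only: forecast next-hour traffic from recent visits plus a
# Twitter-mention trend, then pick a capacity target. Not IBM's real model.
def forecast_next_hour(recent_visits, recent_mentions, mention_weight=0.3):
    """Per-hour counts, oldest first; blend baseline traffic with buzz growth."""
    baseline = sum(recent_visits) / len(recent_visits)
    mention_growth = recent_mentions[-1] / max(recent_mentions[0], 1)
    return baseline * (1 + mention_weight * (mention_growth - 1))

def servers_needed(forecast_visits, visits_per_server=50_000, headroom=1.25):
    return max(1, round(forecast_visits * headroom / visits_per_server))

visits = [180_000, 210_000, 260_000]    # last three hours of site traffic
mentions = [4_000, 9_000, 15_000]       # a player trending on Twitter

f = forecast_next_hour(visits, mentions)
print(f"forecast {f:,.0f} visits next hour -> provision {servers_needed(f)} servers")
```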
IBM cloud, Watson technology in action
The infrastructure is in a software-defined environment through IBM SmartCloud Orchestrator. The company hopes to implement software-defined networking and storage going forward. Orchestrator is based on open standards and can manage public, private and hybrid clouds through an easy-to-use interface.
“We also make some use of the Watson foundation for insights and streams,” Kent said. “We don’t take full advantage of Watson yet, but we’re looking at ways of doing that. The core of this really makes use of a lot of the IBM technologies that make up cloud.”
These technologies include Power Servers, System X server architecture, Tivoli and the Cloud Orchestrator.
“It’s a journey rather than a stick in the ground with cloud,” said Kent. “We’re standardizing all of our systems. First we’re consolidating and virtualizing, then we get standard processes in place.”
IBM also uses a technology called MessageSight for real-time data. MessageSight is a full-featured messaging appliance designed for machine-to-machine and mobile environments. The scoreboard subscribes to MessageSight, and as the score changes, it’s pushed out in real time to end users.
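A minimal, generic sketch of the publish/subscribe flow described above: a scoreboard subscribes to a score topic, and every published update is pushed to it immediately. This models the pattern only; it is not MessageSight's actual API, and the topic and message fields are invented for the example.

```python
# Generic publish/subscribe sketch of the scoreboard flow; not MessageSight's API.
from collections import defaultdict
from typing import Callable

class Broker:
    def __init__(self) -> None:
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        self.subscribers[topic].append(callback)

    def publish(self, topic: str, message: dict) -> None:
        for callback in self.subscribers[topic]:
            callback(message)    # pushed to every subscriber as it arrives

broker = Broker()
broker.subscribe("usopen/court1/score",
                 lambda msg: print("scoreboard update:", msg))
broker.publish("usopen/court1/score", {"sets": "6-4 3-2", "server": "Player A"})
```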
One of the major evolutions is how the data is displayed over desktop, tablet and phone. Part of the challenge is optimizing for a wide array of devices. “A few short years ago we had a website and mobile site,” said Kent. “Now you need to optimize, and the complexity really comes in with working with USTA and what kind of experience they want on the platform. iPhone and Android needs snackable info, whereas an iPad is more of a second-screen experience.”
Three-site data center architecture
Behind the U.S. Open, IBM uses what it refers to as a “three-site architecture.”
“Three data centers are geographically dispersed and virtualized as one,” said Kent. “They’re globally load balanced.”
This architecture has been in place since the years when IBM ran the technology for the Olympics. The three-site architecture is “more disaster avoidance than disaster recovery. Rather than doubling capacity, we put 50 percent in each location,” he said. “We can lose an entire site and still have 100 percent. We can perform maintenance while it remains up.”
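The arithmetic behind that claim is straightforward: each of the three sites is provisioned for 50 percent of peak demand, 150 percent in total, so losing any single site still leaves 100 percent. A small sketch of the calculation:

```python
# Capacity check for the three-site design described above.
SITES = 3
CAPACITY_PER_SITE = 0.50     # each site carries 50% of peak demand

total_capacity = SITES * CAPACITY_PER_SITE               # 150% of peak
after_one_site_fails = (SITES - 1) * CAPACITY_PER_SITE   # still 100% of peak

print(f"normal: {total_capacity:.0%}, after losing one site: {after_one_site_fails:.0%}")
```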