When most people consider big data, they think of the end game: analytics. But according to industry experts, the precursor to successful analytics is an integrated technology foundation that is tuned for a variety of big data workloads.
“To maximize the use and results of any enterprise technology implementation, both hardware and software must work well together,” says Boyd Davis, vice president and general manager of data center software at Intel. “Big data is no different. Companies need a foundational layer that provides top-notch manageability and security in support of application-level software and services.”
Intel and Cisco are working together to deliver this foundational layer:
- The Intel® Distribution for Apache Hadoop software (Intel® Distribution) is being integrated with the Intel Xeon® processor-based Cisco Unified Computing System™ (Cisco UCS®) Common Platform Architecture (CPA) for Big Data.
- The result is a comprehensive Hadoop platform that delivers exceptional performance, management, and capacity while reducing risk and accelerating deployment.
Creating an enterprise-ready platform
Big data technology—and Apache Hadoop in particular—is finding use in an enormous number of applications and is being evaluated and adopted by enterprises of all sizes. While the technology helps transform large volumes of data into actionable information, many organizations are struggling to deploy an effective and reliable Hadoop infrastructure that is appropriate for mission-critical applications.
“Cisco and Intel enjoy a close technology partnership, and we’re extending this relationship to create next-generation big data solutions,” says Paul Perez, vice president and general manager of computing systems at Cisco. “We share a vision for a data analytics platform that is seamlessly integrated into an enterprise environment. One that takes advantage of the storage, networking, and built-in automation of Cisco UCS and Intel’s processor and management technologies, making it easy to plan, provision, execute, and scale.”
Supporting a variety of workloads
The combination of Intel Distribution and Cisco UCS® is being tuned to support a variety of workloads and investigations, including:
- Batch-mode analysis
- Massively parallel processing (MPP) queries
- Machine learning
- Streaming analytics
Currently the most common big data investigation, batch-mode analysis includes direct MapReduce jobs or Hive queries involving very large data sets. According to Davis, the typical response time is one to several minutes.
“An example of batch-mode analysis would be a job that tries to find anomalies in trading transactions that happened over a period of a month or a year,” he explains. “This would be accomplished by combining trading data with other large reference data sets.”
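The job Davis describes can be pictured as a map step that joins each trade with reference data, followed by a reduce step that flags outliers for each key. The following is a minimal single-process Python sketch of that MapReduce pattern, not Hadoop code. The record fields (`symbol`, `price`), the `REFERENCE_PRICES` table, and the 20 percent threshold are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical reference data set: a typical price per symbol.
REFERENCE_PRICES = {"AAA": 100.0, "BBB": 50.0}

# Hypothetical trading records; a real job would scan months of data.
TRADES = [
    {"id": 1, "symbol": "AAA", "price": 101.0},
    {"id": 2, "symbol": "AAA", "price": 135.0},
    {"id": 3, "symbol": "BBB", "price": 49.5},
    {"id": 4, "symbol": "BBB", "price": 20.0},
]


def map_trade(trade):
    """Map step: join a trade with reference data, emit (symbol, (id, deviation))."""
    reference = REFERENCE_PRICES[trade["symbol"]]
    deviation = abs(trade["price"] - reference) / reference
    yield trade["symbol"], (trade["id"], deviation)


def reduce_symbol(symbol, values, threshold=0.2):
    """Reduce step: keep the ids of trades whose deviation exceeds the threshold."""
    return symbol, sorted(tid for tid, dev in values if dev > threshold)


def run_job(trades):
    """Group mapped pairs by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for trade in trades:
        for key, value in map_trade(trade):
            groups[key].append(value)
    return dict(reduce_symbol(k, v) for k, v in groups.items())


anomalies = run_job(TRADES)
print(anomalies)
```

On a cluster, the map and reduce functions would run in parallel across many nodes; the grouping step here stands in for the shuffle that the framework performs between them.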
MPP queries typically involve data warehouse applications like Hive, with an expectation of browser response time or better. In these queries, the reference data sets are generally smaller than those used in batch-mode analysis.
“MPP queries are often performed to analyze and segment the purchase patterns of customers in a retail chain over short periods of time—using up to a week of data—in order to set prices,” says Davis. “Another example is a pipeline set of queries, used for tasks like malware detection, where an automated job takes output of one query and uses it as input for another. The shorter response time for each query speeds up detection, and therefore improves prevention and response measures.”
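The pipelined queries in the malware example, where one query's output becomes the next query's input, can be sketched as follows. This is only an illustration of the chaining idea: it uses Python's built-in `sqlite3` module as a stand-in for an MPP query engine, and the tables, columns, host names, and hit threshold are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE connections (host TEXT, destination TEXT);
    CREATE TABLE downloads (host TEXT, filename TEXT);
    INSERT INTO connections VALUES
        ('pc-1', 'bad.example'), ('pc-1', 'bad.example'),
        ('pc-1', 'bad.example'), ('pc-2', 'intranet.example');
    INSERT INTO downloads VALUES
        ('pc-1', 'invoice.exe'), ('pc-2', 'report.pdf');
    """
)


def suspicious_hosts(conn, destination, min_hits):
    """Query 1: hosts that contacted a destination at least min_hits times."""
    rows = conn.execute(
        "SELECT host FROM connections WHERE destination = ? "
        "GROUP BY host HAVING COUNT(*) >= ? ORDER BY host",
        (destination, min_hits),
    )
    return [host for (host,) in rows]


def files_downloaded_by(conn, hosts):
    """Query 2: files downloaded by the hosts that query 1 returned."""
    if not hosts:
        return []
    placeholders = ", ".join("?" for _ in hosts)
    rows = conn.execute(
        f"SELECT filename FROM downloads WHERE host IN ({placeholders}) "
        "ORDER BY filename",
        hosts,
    )
    return [name for (name,) in rows]


# The automated job feeds the first result straight into the second query.
hosts = suspicious_hosts(conn, "bad.example", min_hits=3)
suspect_files = files_downloaded_by(conn, hosts)
print(hosts, suspect_files)
```

Because each stage waits on the one before it, the total detection time is roughly the sum of the individual query times, which is why shorter per-query response matters in this kind of pipeline.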
Machine learning includes both predictive analytics and data mining. Using Bayesian classifiers, neural networks, and other algorithms, machines can automatically improve their modeling and prediction capabilities. With data mining, previously unknown relationships within data sets can be discovered.
“We can use predictive analytics to anticipate machine failures,” Davis explains. “And data mining can be used to discover interesting dependencies in, say, social networking data or telecom call records.”
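To make the machine-failure example concrete, here is a minimal sketch of a naive Bayes classifier, one simple kind of Bayesian classifier, written in plain Python. The sensor categories, the tiny training set, and the add-one smoothing are assumptions made for illustration; they do not describe how the Intel or Cisco software works.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (temperature, vibration) readings
# and whether the machine later failed.
TRAINING = [
    (("hot", "high"), "fail"),
    (("hot", "high"), "fail"),
    (("hot", "low"), "ok"),
    (("cool", "high"), "ok"),
    (("cool", "low"), "ok"),
    (("cool", "low"), "ok"),
]


def train(samples):
    """Count class frequencies and per-feature value frequencies within each class."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values = defaultdict(set)  # feature index -> distinct values seen
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def predict(model, features):
    """Return the label with the highest log posterior, using add-one smoothing."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(features):
            numerator = feature_counts[(i, label)][value] + 1
            denominator = count + len(values[i])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
prediction_hot = predict(model, ("hot", "high"))
prediction_cool = predict(model, ("cool", "low"))
print(prediction_hot, prediction_cool)
```

Retraining on a larger set of labeled readings is what allows the predictions to improve over time, which is the sense in which the model learns.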
Streaming analytics involves investigating data immediately as it flows into a cluster, rather than pulling it from a static repository. This type of analysis is becoming increasingly important for sensor data (compiled by smart meters, security systems, and the like), allowing discoveries and decisions to be made in real time.
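As a rough illustration of analyzing data while it arrives, the sketch below keeps only a small sliding window of recent sensor readings and flags a reading as soon as it exceeds the recent average by a set factor. The function name `detect_spikes`, the window size, the factor, and the sample meter values are all hypothetical.

```python
from collections import deque


def detect_spikes(readings, window=3, factor=1.5):
    """Yield (index, value) for readings above factor times the mean of the previous window."""
    recent = deque(maxlen=window)
    for index, value in enumerate(readings):
        if len(recent) == window and value > factor * (sum(recent) / window):
            yield index, value
        recent.append(value)


# Hypothetical smart-meter readings in kilowatts. Any iterator works,
# so the same function can consume values one at a time as they arrive.
meter_feed = iter([1.0, 1.1, 0.9, 1.0, 2.4, 1.0, 1.1])
spikes = list(detect_spikes(meter_feed))
print(spikes)
```

The point of the design is that nothing is stored beyond the window: each decision is made from the current reading and a few recent ones, without first loading the data into a repository.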
“We are optimizing our Hadoop software so it works seamlessly on Cisco UCS, regardless of the workload or application,” says Davis. “It will be as close to plug-and-play as possible, so enterprises can focus on application-level software and services and not worry about the foundational layer.”
“Cisco is committed to big data, open source, and our work with Intel,” says Perez, “to optimize data-intensive computing for on-premise enterprise and hosted as-a-service environments.”