Because of their ability to handle large volumes of disparate, loosely structured data, Hadoop-based systems are becoming a foundational pillar of data storage and analytics. Unlike traditional relational databases, which require information to be transformed into a specified structure, or schema, before it can even be loaded into the database, Hadoop focuses on storing data in its raw format. This allows analysts and developers to apply structure to suit the needs of their applications at the time they access the data.
“With the continued growth in scope and scale of applications using Hadoop and other data sources, the vision of an enterprise data lake has started to materialize,” says Shaun Connolly, vice president of strategy for Hortonworks, a leading contributor to and provider of Hadoop for the enterprise. “Combining data from multiple silos, including internal and external data sources, helps organizations find answers to complex questions that no one previously knew how to ask.”
Data lake, defined
The concept of a data lake is closely tied to Hadoop and its ecosystem of open source projects. The traditional “Schema on Write” approach of data management requires a lot of forethought and IT involvement, whereas Hadoop’s “Schema on Read” approach empowers users to quickly store data in any format and apply structure in a very flexible and agile way, whenever needed. As such, a data lake represents a shared repository where any type of data can be collected, accessed, and analyzed by any number of users within an organization.
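The Schema-on-Read idea can be illustrated with a minimal sketch in plain Python (the event records and field lists here are hypothetical, not from any particular Hadoop tool): raw records land in storage exactly as produced, and each consumer applies only the structure it needs at access time.

```python
import json
import io

# Hypothetical raw events, landed in the lake exactly as produced.
# No schema was enforced at write time, so records may differ in shape.
raw_events = io.StringIO(
    '{"user": "alice", "action": "click", "ts": 1000}\n'
    '{"user": "bob", "ts": 1005, "device": "mobile"}\n'
)

# Schema on read: structure is applied at access time, per consumer.
# Fields missing from a record are tolerated rather than rejected.
def read_with_schema(lines, fields):
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# One consumer projects a clickstream view from the same raw data.
clickstream_view = list(read_with_schema(raw_events, ["user", "action"]))
```

A second consumer could re-read the same raw bytes with a different field list (say, `["user", "device"]`) without any upfront transformation, which is the contrast with Schema on Write, where a single structure must be settled before the first record is loaded.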
But according to Gartner, the growing hype surrounding data lakes is causing substantial confusion in the information management space. Several vendors are marketing data lakes as an essential component of big data implementations, but there is little agreement among vendors about what constitutes a data lake, or how to get value from it.
“In broad terms, data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format,” says Nick Heudecker, research director at Gartner. “The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”
Control versus freedom
While data lakes effectively solve a number of problems—overcoming independently managed data silos through consolidation and providing a new way to handle large quantities of disparate and often unstructured data—they also create gray areas surrounding information governance, control, and security.
“There is always value to be found in data,” says Andrew White, vice president and distinguished analyst at Gartner. “But the question your organization has to address is this: Do we allow or even encourage one-off, independent analysis of information in silos or a data lake, bringing said data together, or do we formalize to a degree that effort, and try to sustain the value-generating skills we develop?”
According to Hortonworks, companies have an opportunity to do both.
“Hadoop complements existing systems and data warehouses, it doesn’t replace them,” says John Kreisa, vice president of strategic marketing at Hortonworks. “And it’s the data lake that pulls it all together, from Hadoop and data warehouse environments to the systems, inputs, and sensors generating data to the analytics tools being applied.”
For information and applications requiring the highest security and fastest, most consistent response, enterprise data warehouses are still the platform of choice, he adds, delivering the performance, governance, and control that is often needed for mission-critical activities. But when an organization wants to bring together and explore both structured and unstructured workloads from a variety of systems and silos, Hadoop environments and data lakes provide a compelling and cost-effective option.
“Implementing Hadoop is typically a journey from single-instance applications to a fully fledged data lake,” says Connolly. “The journey is not about assembling petabytes of data. It’s about encouraging people to combine new types of data with existing data sources, and enabling them to play with that data in ways that creatively unlock the value within.”