In big data storage and processing, a data lake is a new type of platform for storing a company’s data. So how does it differ from a data warehouse, and what is the point of it?
Processed data storage vs. raw data storage: that’s one of the major differences between data warehouses and data lakes. The term data lake first appeared in October 2010 on James Dixon’s blog. Dixon, CTO of Business Intelligence specialist Pentaho, defined it as follows: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
The notions of raw data and samples are where data lakes differ from data warehouses: a data lake collects huge volumes of raw data (structured, semi-structured and unstructured), stores it in its native format, and only then analyses it.
Or more precisely: “While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question,” as TechTarget explains.
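The flat layout described above can be illustrated with a minimal in-memory sketch. It is a toy, not how any particular data lake product works: the names `ToyDataLake`, `LakeObject`, `ingest` and `query`, the tag keys and the sample records are all hypothetical and chosen only for illustration. Each payload is kept as raw bytes in its native format, receives a unique identifier, and carries metadata tags; a query first narrows the lake down by tags, and only that smaller set is parsed and analysed.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw data element: an untouched payload plus descriptive tags."""
    object_id: str
    payload: bytes  # stored as-is, in its native format
    tags: dict[str, str] = field(default_factory=dict)


class ToyDataLake:
    """Flat store (hypothetical): one namespace, every object keyed by its ID."""

    def __init__(self) -> None:
        self._objects: dict[str, LakeObject] = {}

    def ingest(self, payload: bytes, **tags: str) -> str:
        """Store the payload unchanged and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = LakeObject(object_id, payload, dict(tags))
        return object_id

    def query(self, **wanted: str) -> list[LakeObject]:
        """Return every object whose tags match all requested key/value pairs."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in wanted.items())
        ]


# Ingest structured, semi-structured and unstructured data side by side.
lake = ToyDataLake()
lake.ingest(b'{"customer": 42, "amount": 19.9}', source="web_shop", format="json")
lake.ingest(b"customer,amount\n42,5.0\n", source="store_till", format="csv")
lake.ingest(b"Customer 42 called about a late delivery.",
            source="call_centre", format="text")

# A business question arrives: select the relevant subset by tag,
# then parse only that subset at analysis time.
json_orders = lake.query(format="json")
amounts = [json.loads(obj.payload)["amount"] for obj in json_orders]
```

Nothing is cleaned or reshaped at ingestion time in this sketch; interpretation of each payload is deferred until a question makes it relevant, which is the contrast with the pre-structured warehouse drawn above.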
Data Lake: architecture for merging data silos
In 2013, the Cisco France blog specified that “the concept of data lakes highlights the need to create a modern company architecture to organise, manage and exploit large volumes of data operationally. [...] The reason we’ve come up with data lakes is that companies’ data up to now has been stored in independent silos. For about ten years software vendors designed specific solutions to exploit the various data, which reinforced this idea of silos and was an obstacle to interoperability.” Galaxy Consulting confirms:
“The data lake concept hopes to solve information silos. Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction.”
Business Intelligence vs. data science
Vincent Heuschling, CEO of Affini-tech, explains another advantage of data lakes: “Big data operations are often restricted by the difficulties of collecting and ingesting it in the systems. In this respect, being able to load all the data onto a platform in a raw state and iterate rapidly is an undeniable advantage. [...] The ability to ingest and react to data in real time means applications can directly interact with it. This goes beyond the Business Intelligence aspect of data warehouses: the value no longer lies in just using the data for reporting.”
Heuschling goes on to point out how advances in data lake platforms can benefit marketing and sales: companies now have “a 360° vision of clients, and can segment, predict and anticipate consumer behaviour.”
Furthermore, with the Internet of Things, data lakes can facilitate large-scale machine learning. Where Business Intelligence and data warehouses excel at retrospective analytics and metrics, data lakes open up new possibilities for predictive analytics.