There is a lot of hype surrounding data management, and the concept of a data lake in particular, but what does this mean in practical terms? How do I know if my company should set sail into these uncharted waters?
What is a Data Lake?
Much like a real lake holds water coming from multiple streams, a data lake is designed to take information from various source systems and hold it in its purest form. The raw data can originate from a variety of sources, including operational systems such as an ERP, log files, social media, documents, and images. No matter the point of origin, the central idea is to retain information in its native format for use by the systems and processes that lie downstream.
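To make "native format" concrete, here is a minimal sketch of landing a file in a lake without touching its contents. The folder layout (a "raw" zone organized by source system and landing date) and the function names are assumptions made for illustration, not a standard; real lakes usually sit on object storage rather than a local disk.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def land_raw_file(source_file: Path, lake_root: Path, source_system: str, landed_on: date) -> Path:
    """Copy a file into the lake's raw zone without altering its contents.

    The raw/<source_system>/<date>/ layout is a hypothetical convention.
    """
    target_dir = lake_root / "raw" / source_system / landed_on.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    # Byte-for-byte copy: no parsing, no schema, no reshaping.
    shutil.copy2(source_file, target)
    return target


def checksum(path: Path) -> str:
    """SHA-256 of a file's bytes, used to confirm the landed copy is unchanged."""
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

Comparing checksum() on the source and the landed copy is a simple way to confirm that nothing was changed on the way in. The same function works for a CSV export, a log file, or an image, because it never looks inside the file.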
Advantages of using a Data Lake
A well-stocked data lake is a data scientist’s paradise, much like a well-stocked pond is a fisherman’s paradise. Data scientists can fish through the data looking for trends and patterns using advanced analytics tools and processes. Meanwhile, data deemed useful to support the business is harvested from the lake, refined, and curated into a format that decision makers can interact with. Often this curated layer takes the form of a traditional data warehouse or data mart, which serves as the foundation for operational reports or dashboards. The data can also be used to provide interoperability between systems or data-trading partners, since the lake offers a centralized, vendor- and application-agnostic platform from which to source data.
One question that still seems to surface with stakeholders is “why fill the data lake if it’s ultimately going to dock in the data warehouse?” The key concept here is that not all data that streams into the lake will be modeled into a structured data warehouse. Only data that has been deemed necessary for business users to interact with will chart a course from the lake to the warehouse. However, what’s important to the business will likely change over time, so holding the original data supports shifts in organizational mindset and the competitive landscape. Additionally, as mentioned above, the data lake can support those looking to go fishing. For example, data scientists can use the lake to look for a new species, detect changing patterns, and ultimately look for signals in the data that may indicate shifting tides in the marketplace. This would not be possible in the data warehouse because the data that lands there has already been modeled to answer specific, known business questions.
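The trip from lake to warehouse can be sketched in a few lines. This is an illustration only: the sample events, the fact_orders table, and the curate_orders function are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import json
import sqlite3

# Hypothetical raw events as they might sit in the lake, one JSON document per line.
RAW_EVENTS = [
    '{"type": "order", "order_id": 1, "amount": 25.0, "channel": "web", "user_agent": "Mozilla/5.0"}',
    '{"type": "page_view", "url": "/pricing", "user_agent": "Mozilla/5.0"}',
    '{"type": "order", "order_id": 2, "amount": 40.5, "channel": "store"}',
]


def curate_orders(raw_lines, conn):
    """Load only order events, and only the modeled fields, into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL, channel TEXT)"
    )
    rows = []
    for line in raw_lines:
        event = json.loads(line)
        # Anything that is not an order stays in the lake only.
        if event.get("type") == "order":
            rows.append((event["order_id"], event["amount"], event["channel"]))
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Calling curate_orders(RAW_EVENTS, sqlite3.connect(":memory:")) loads the two orders and leaves the page view behind. The raw list itself is never modified, so if page views or user agents start to matter to the business later, they are still in the lake waiting to be modeled.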
Ultimately, data lakes are a functional, cost-effective way to wrangle an organization’s data and give it a landing spot that can serve as a springboard for future expeditions. If your organization is drowning in data and wondering what to do, the data lake might just be the catalyst needed on the path to analytics maturity.