01- Databricks Lakehouse Fundamentals Learning Plan — V2
What is a data lakehouse? The history of Data Management!
In this article, we will cover:
1. The origin and purpose of the data lakehouse
2. The challenges of managing big data
The History of Data Management and Analytics
To better understand what a data lakehouse is, we need to explore the history of data management and analytics.
In the late 1980s, businesses wanted to use data-driven insights for business decisions and innovation. To do this, organisations had to move beyond simple relational databases to systems that could manage and analyze data being collected and generated at high volume and at a faster pace.
Data Warehouse
Data warehouses were designed to collect and consolidate this influx of data, and to support business intelligence (BI) and analytics.
Data in a warehouse is structured and cleaned to fit predefined schemas.
However, data warehouses were not designed to handle semi-structured and unstructured data, and storing and analyzing data that did not fit the predefined schema became very expensive.
As companies grew and the world became more digital, data collection increased a thousandfold in volume, velocity, and variety.
Processing the data and producing useful insights took too long, and data warehouses had limited capability to handle this variety and velocity.
2000s Big Data Explosion
In the early 2000s, the advent of big data drove the development of data lakes, which can handle structured, semi-structured, and unstructured data alike.
Multiple data types could be stored side by side in a data lake. Data from many different sources, such as weblogs or sensors, could be streamed into the data lake quickly and cheaply on low-cost cloud object storage.
However, while the data lake solved the storage dilemma, it introduced new concerns and lacked important features of the data warehouse:
- Data lakes do not support transactions and cannot enforce data quality, so the reliability of the stored data is questionable.
- With such large volumes of data, analytical performance suffers, which slows business decision-making.
- Because so much of the data is unstructured, governing a data lake and enforcing security and privacy policies is difficult.
Businesses Required Two Disparate, Incompatible Data Platforms
Because data lakes could not fully replace data warehouses for reliable BI insights, businesses built complex technology stacks that combined data lakes, data warehouses, and additional specialized systems for streaming, time series, graph, and image data.
Such an environment introduced complexity and delay, as data teams were stuck in silos, handling data at each stage and completing disjoint work.
Data had to be copied between systems, and in some cases copied back, which undermined data governance, not to mention the cost of storing the data twice in disjoint systems. Successful AI implementation was difficult because actionable outcomes required data from multiple places.
Companies Reporting Measurable Value From Data
The value behind the data was being lost.
In research by Accenture, only 32% of organizations reported getting measurable value and insights from their data; the other 68% did not.
Something needed to change: businesses needed a single, flexible, high-performance system to support ever-increasing use cases such as data exploration, predictive modeling, and predictive analytics.
Data teams needed systems that could support all of their data applications, including SQL analytics, real-time analysis, data science, and machine learning.
The Data Lakehouse
To meet these needs and address these challenges, a new data management architecture was invented: the data lakehouse.
The data lakehouse was developed as an open architecture that combines the benefits of a data lake with the analytical power and controls of a data warehouse.
Because the lakehouse is built on the data lake, it stores all types of data together and becomes a single source that serves both BI and AI directly.
Key Features of Data Lakehouses
Data lakehouses such as the Databricks Lakehouse provide several key features:
- Transaction support: ACID transactions for concurrent read/write interactions (see the sketch after this list).
- Schema enforcement and governance: Schemas are enforced for data integrity, with support for robust auditing needs.
- Data governance: Supports compliance with privacy regulations and tracking of data use.
- BI support: Reduces the latency between obtaining data and drawing insights by letting BI tools work directly on the source data.
- Decoupled storage and compute: Each operates on its own cluster, allowing them to scale independently to support specific needs.
- Open storage formats: Open, standardized formats such as Apache Parquet let a variety of tools access the data directly and efficiently.
- Support for diverse data types: Structured, semi-structured, and unstructured data can be stored, accessed, and analyzed in one place.
- Support for diverse workloads: A range of workloads such as machine learning, data science, and SQL analytics can use the same data repository.
- End-to-end streaming: Removes the need for a separate system to serve real-time reports.
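To make two of these features concrete, here is a minimal PySpark sketch showing transaction support and schema enforcement as Delta Lake (the default table format on Databricks) exposes them. It assumes a Spark environment with Delta Lake available; the table name `events` and the sample rows are just illustrations.

```python
from pyspark.sql import SparkSession

# On Databricks the `spark` session and Delta configs already exist;
# the builder below is only needed when running locally with the
# delta-spark package installed.
spark = (
    SparkSession.builder
    .appName("lakehouse-features-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# ACID transaction support: this append either fully commits
# or leaves the table unchanged -- readers never see partial writes.
events = spark.createDataFrame(
    [(1, "click"), (2, "view")], ["id", "event_type"]
)
events.write.format("delta").mode("append").saveAsTable("events")

# Schema enforcement: appending a frame whose schema does not match
# the table fails, instead of silently corrupting the data.
bad_rows = spark.createDataFrame([("oops",)], ["wrong_column"])
try:
    bad_rows.write.format("delta").mode("append").saveAsTable("events")
except Exception as err:
    print(f"Schema enforcement rejected the write: {err}")
```

In a Databricks notebook the builder block can be dropped, since the configured `spark` session is already provided.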
The lakehouse supports the work of data engineers, data scientists, and data analysts all in one location.
The lakehouse is the modern evolution of the data warehouse: it provides all the benefits and features of a warehouse without compromising the flexibility and depth of a data lake.