Data Warehouses Are the Past, Present, and Future
The death of the data warehouse has long been prophesied, yet it always seems to be on the horizon, never realized. First it was NoSQL, then Hadoop, then data lakes that would kill the data warehouse. Yet here we are. Snowflake was the hottest initial public offering (IPO) of 2020, and demand for data and analytics engineers who can crank value out of a data warehouse is as high as ever.
In 2010, the future of data warehouses felt pretty bleak. Most analytics teams were relying on traditional row-based, online transactional processing (OLTP) databases for their data warehouses, and data volume was exploding. When it came to processing and querying all that data for analysis, columnar databases came to the rescue, but they required an ever-expanding hardware footprint.
While bare-metal data warehouse appliances provided a massive jump in processing power, adding that hardware to your server room was quite an investment, one that's hard to imagine 10 years later.
Things changed for the better in 2012, when Amazon launched Redshift, a columnar data warehouse built on top of PostgreSQL that you could spin up in minutes and pay for in small increments, with no massive up-front cost.
Migrations away from overtaxed, row-based SQL data warehouses to Redshift grew massively. The barrier to entry for a high-performing data warehouse was lowered substantially, and suddenly what had looked like the impending death of the data warehouse turned into a rebirth.
Next, extract, load, transform (ELT) wiped out extract, transform, load (ETL). The difference between the two patterns is where the T (transform) step takes place, and distributed columnar databases made the shift possible. It's now better to focus on extracting data, loading it into a data warehouse, and then performing the necessary transformations. With ELT, data engineers can focus on the extract and load steps, while analysts can use SQL to transform the ingested data for reporting and analysis.
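As a minimal sketch of the pattern, imagine raw order records have already been extracted from a source system and loaded, untouched, into the warehouse; the transform step is then just SQL run where the data lives. All table and column names here (raw.orders, analytics.orders_daily, and so on) are hypothetical:

    -- ELT: the data is already in the warehouse, raw and untransformed.
    -- The T step builds a reporting table directly from the raw one.
    CREATE TABLE analytics.orders_daily AS
    SELECT
        CAST(order_timestamp AS DATE) AS order_date,
        customer_id,
        COUNT(*)                      AS order_count,
        SUM(order_total)              AS revenue
    FROM raw.orders
    WHERE order_status = 'complete'
    GROUP BY 1, 2;

Because the transform is nothing more than a SQL statement executed by the warehouse itself, it sits squarely in the analyst's toolkit; no separate transformation server or engineering deploy is required.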
In other words, this new breed of data warehouses made it possible (and economical) to store and query far higher volumes of data than ever before. ELT saved the data warehouse.
The concept of a data lake was first introduced around 2010. The benefit of storing vast amounts of data without defining its structure when it's stored (schema-on-write), and instead imposing structure only when it's queried (schema-on-read), is real. However, such an approach carries costs in data discovery and governance, as well as added complexity for the data analyst or analytics engineer who works with the data.
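To make schema-on-read concrete, here's a small sketch in Snowflake-style SQL: raw JSON events land in a single semistructured column with no schema defined up front, and structure is imposed only in the query. The events table and its fields are hypothetical:

    -- Schema-on-read: store the raw JSON now, decide on structure later.
    CREATE TABLE raw.events (payload VARIANT);

    -- Structure is applied at query time, not at load time.
    SELECT
        payload:user_id::STRING    AS user_id,
        payload:event_type::STRING AS event_type,
        payload:ts::TIMESTAMP      AS event_time
    FROM raw.events
    WHERE payload:event_type::STRING = 'page_view';

The flexibility is real, but so is the cost described above: nothing in the table definition tells an analyst what fields exist inside payload, which is exactly the discovery and governance gap that a warehouse with defined schemas avoids.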
With the cost of storing and querying large structured datasets dropping and performance climbing, some of the downsides of data lakes for analytics became more noticeable. Still, data lakes have a place in an analytics infrastructure. There remains a need to store data that isn't consistently structured, or that arrives in volumes that make even the most robust data warehouses creak. For most data teams, though, data lakes have been a complement to the data warehouse rather than a replacement.
Data warehouses aren’t going anywhere anytime soon. Snowflake continues to blow away expectations for developers and investors alike, and I expect a wave of data warehouse innovation soon.
If you're hesitant to invest in a greenfield data warehouse, to migrate a legacy one to a modern platform, or to hire data engineers with data warehousing know-how, don't fear! You're building for now and investing intelligently for the future.
In the ever-evolving landscape of data management, cloud data warehouses have emerged not just as a technological asset but as a cornerstone of organizational intelligence and decision-making. Despite the rise of storage and processing technologies that once seemed to threaten its existence, the data warehouse has not only endured but flourished. The resurgence of cloud platforms such as Snowflake and BigQuery underscores that role: they let organizations store, process, and analyze vast volumes of data with an efficiency and scalability that on-premises appliances never matched. The shift toward ELT reinforces the advantage, keeping transformation close to the data and enabling agile analysis. Today, a data warehouse is not merely a repository of information but an asset that drives innovation, streamlines operations, and surfaces insight from data. For companies navigating the digital age, investing in a robust data warehouse isn't just a strategic move; it's how you sustain competitive advantage and future-proof the business.