The Big Data Warehouse

By Ian Dudley, Enterprise Architect

Over the last 30 years, business intelligence evolved from a cottage industry—whose main tool was a desktop computer—to a mature business using centralized, enterprise-wide analytic platforms underpinned by an enterprise data warehouse. The traditional enterprise data warehouse copies the data it needs into a central location, transforms it into a common format, removes the noise, reconciles the inconsistencies, and creates a pristine, holistic, enterprise view that seamlessly combines information from disparate business units and third-party data brokers.

Then, just as it reached maturity, the enterprise data warehouse died, laid low by a combination of big data and the cloud.

Mobile technology, the rise of digital business and the increasing connectedness of the world through the Internet of Things have generated exponential growth in the volume of analyzable big data. Enterprise decision-making is increasingly reliant on data from outside the enterprise: from traditional partners, from “born in the cloud” companies such as Twitter and Facebook, and from brokers of cloud-hosted utility datasets such as weather and econometrics. Meanwhile, businesses are migrating their own internal systems and data—including sales, finance, customer relationship management and email—to cloud services.

Big data has volume, velocity and variety: it is large, grows at a fast rate, and exists in many different physical formats (video, text, audio, web page, database, etc.). It is not possible to apply traditional warehousing techniques to this sort of data. The workflows that populate an enterprise data warehouse need to be defined by IT professionals, and many processes require expert human intervention for quality control and exception handling. There are also the technical demands of the bandwidth needed to copy data into the central warehouse environment and the computing power to transform it into a standard format.

At the same time—as more and more sources of data move to the cloud—what Gartner calls “data gravity” will pull enterprise data out of the on-premise data center and disperse it into the cloud, accelerating the demise of the enterprise data warehouse.

Today, the enterprise needs a big data warehouse that combines on-premise and in-the-cloud datasets into a comprehensive view of its business and the environment in which it operates.

LOGICAL DATA WAREHOUSES AND FEDERATION TECHNOLOGY

One of the most mature solutions to the problem of aggregating data from disparate data sources is the logical data warehouse. To the user, a logical data warehouse looks just like an enterprise data warehouse. But in fact, it is a single query interface that assembles a logical view “on top of” a set of heterogeneous and dispersed data sources.

In order to look like an enterprise data warehouse, a logical warehouse has to transform and standardize the data in real time, as it is queried. The great strength of the logical warehouse is that it is possible to do just enough standardization and transformation to meet an immediate business need just in time, rather than having to standardize and transform every piece of data for every possible query up front.
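
As a toy illustration of this just-enough, just-in-time transformation, the sketch below answers one hypothetical question across two simulated sources, standardizing their different schemas and currencies only when the query runs. The source names, schemas and conversion rate are all assumptions made for the example.

```python
# A minimal sketch of query-time federation, assuming two hypothetical sources:
# an on-premise ERP table (simulated with in-memory SQLite) and a cloud app that
# returns records in a different shape and currency. Nothing is copied into a
# central warehouse; harmonization happens only when the question is asked.
import sqlite3

erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE sales (cust_id TEXT, amount_usd REAL)")
erp.executemany("INSERT INTO sales VALUES (?, ?)", [("C001", 120.0), ("C002", 80.0)])

cloud_orders = [  # stands in for a cloud application's API response
    {"customerRef": "C001", "value": 95.0, "currency": "EUR"},
    {"customerRef": "C003", "value": 40.0, "currency": "EUR"},
]

EUR_USD = 1.1  # assumed conversion rate, for illustration only

def total_sales_by_customer():
    """Apply 'just enough' standardization for this one question, at query time."""
    totals = {}
    for cust_id, amount in erp.execute("SELECT cust_id, amount_usd FROM sales"):
        totals[cust_id] = totals.get(cust_id, 0.0) + amount
    for rec in cloud_orders:
        usd = rec["value"] * EUR_USD if rec["currency"] == "EUR" else rec["value"]
        totals[rec["customerRef"]] = totals.get(rec["customerRef"], 0.0) + usd
    return {cust: round(total, 2) for cust, total in totals.items()}

print(total_sales_by_customer())  # {'C001': 224.5, 'C002': 80.0, 'C003': 44.0}
```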

However, if the data being queried is stored in multiple formats and in multiple physical locations, this is both technically difficult and inefficient; this is why the logical warehouse has not simply replaced physical data warehouses.

Logical data warehouses that span enterprises—an approach often described as “federation”—are particularly inefficient, because the problems of data being in different formats, and in physically distant locations connected by limited bandwidth, are magnified when physically and technologically distinct corporations are bridged together. Today’s solutions are the data lake and APIs. But both suffer challenges of their own.

THE DATA LAKE

A data lake is a physical instantiation of a logical data warehouse: data is copied from wherever it normally resides into a centralized big data file system, thereby solving the problem of data being physically dispersed. This is not any kind of return to the traditional data warehouse—the data lake is designed for far greater agility, scalability, and diversity of data sources than a traditional data warehouse.
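
As a minimal sketch of what “landing” data in a lake can look like, assume a simple source-and-date partitioned folder layout; a local path stands in for HDFS or object storage, and all paths and naming conventions below are hypothetical.

```python
# Copy source extracts into the lake unchanged, partitioned by system and ingest date.
# The lake root, folder layout and file names are assumptions for illustration.
from datetime import date
from pathlib import Path
import shutil

LAKE_ROOT = Path("/data/lake/raw")  # in practice an HDFS or object-store location

def land_in_lake(source_file: str, source_system: str) -> Path:
    """Land a raw extract in the lake with no transformation on ingest."""
    target_dir = LAKE_ROOT / source_system / f"ingest_date={date.today():%Y-%m-%d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # raw copy: keep the source's native format
    return target

# Example usage (file names invented):
# land_in_lake("crm_accounts_2024-05-01.csv", source_system="crm")
# land_in_lake("web_clicks_2024-05-01.json.gz", source_system="clickstream")
```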

It is relatively easy for an enterprise to add its own data to a lake, but there are many datasets of critical importance outside of the enterprise. It may be possible to copy low volume, stable, third-party datasets into the lake, but this will not always be a viable solution—whether for reasons of data confidentiality or data volume, or because the data is volatile and requires excessive change management.

In the case of external data in the cloud, an enterprise might be able to extend its private network into the cloud to encompass the dataset and federate it with its on-premise lake. This is undoubtedly technically feasible; the only issues are the willingness of the data owner to agree to the arrangement, and performance (which depends on the bandwidth of the connection between the cloud and the on-premise data center).

APIs

Rather than lifting and shifting data into a lake, a big data warehouse can federate with external data sources by means of their published APIs.

It is often far more effective for an enterprise to consume APIs that answer its business questions (“What is my top-performing brand?”) than to amass the data required to answer these questions for itself.
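
A hedged sketch of what that looks like in practice, using an invented provider endpoint and response shape:

```python
# Hypothetical example of consuming an API that answers the business question directly,
# rather than pulling the underlying data. The endpoint, parameters and response fields
# are all invented for illustration.
import requests

resp = requests.get(
    "https://api.example-broker.com/v1/brand-performance",  # hypothetical endpoint
    params={"metric": "revenue", "period": "2024-Q1", "top": 1},
    headers={"Authorization": "Bearer <access token>"},
    timeout=30,
)
resp.raise_for_status()
top_brand = resp.json()["results"][0]  # assumed response shape
print(top_brand["brand"], top_brand["revenue"])
```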

Remote APIs are an appropriate solution when the enterprise knows what it needs to know and can ask for it. Remote APIs are less effective for advanced analytics and data discovery—where the user doesn’t know what they don’t know, and as a result is obliged to make API calls that move large volumes of data across a wide area network. This has traditionally been a poorly performing approach, mainly due to the bandwidth problems of moving so much data. Advanced analytics is one of the main uses of big data; given that big data is inherently distributed, solving the problem of how to run discovery-type processes in this environment has received serious attention. The most promising approach is to implement APIs designed from the ground up to support big data. These APIs overcome the problem of moving large amounts of unstructured data across a network by transmitting the data in a highly compressed form and having it describe itself, so that the lack of a pre-defined structure is not an issue.
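
One concrete way to realize “compressed, self-describing” transfer is Apache Arrow’s IPC stream format, sketched below with pyarrow. The article does not name a specific technology, so this is an illustrative assumption: the schema travels with the data, and the record batches are compressed before they cross the network.

```python
# Producer and consumer sides of a self-describing, compressed data exchange,
# sketched with pyarrow. Table contents are invented for the example.
import pyarrow as pa

table = pa.table({
    "event_id": [1, 2, 3],
    "event_type": ["click", "view", "click"],
})

# Producer: serialize schema + compressed record batches into one byte stream.
sink = pa.BufferOutputStream()
options = pa.ipc.IpcWriteOptions(compression="zstd")
with pa.ipc.new_stream(sink, table.schema, options=options) as writer:
    writer.write_table(table)
payload = sink.getvalue()  # bytes an API could return over the wire

# Consumer: no pre-agreed structure is needed; the stream describes itself.
received = pa.ipc.open_stream(payload).read_all()
print(received.schema)
```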

INTEGRATING DATA IN A BIG DATA WORLD

Co-locating data, whether physically in a data lake or logically through APIs or a form of federation, solves the problem of data dispersal. It does not address the issue that the data is in many different formats and is un-harmonized. The enterprise data warehouse solves the format problem by brute force: extracting all of the data from source and loading it into a single database. It solves the integration problem by using master data management software to apply a consistent set of descriptive metadata and identifying keys across the whole dataset.

Although big data technologies and the data lake approach have a major role to play in the future of data warehousing, the many different types of data the warehouse needs to contain (including images, video, documents, associations, key-value pairs, and plain old relational data) mean that there is no one physical format that is optimal for storing and querying all of it. As a result, many people are strong proponents of a polyglot persistence approach: data is stored in the most appropriate form and an overarching access layer provides a single interface and query language to interrogate the data. The access layer takes responsibility for translating the query into a form the underlying data stores can understand, combining results across stores and presenting the results back to the user.
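
A toy sketch of such an access layer follows, with an in-memory relational store and a plain dictionary standing in for a document store; the store contents and keys are invented.

```python
# Each dataset lives in the store that suits it; one thin facade translates a simple
# request into each store's own query mechanism and merges the results.
import sqlite3

class AccessLayer:
    def __init__(self):
        # Relational store: structured sales facts (in-memory SQLite for the example).
        self.sql = sqlite3.connect(":memory:")
        self.sql.execute("CREATE TABLE sales (product_id TEXT, units INTEGER)")
        self.sql.executemany("INSERT INTO sales VALUES (?, ?)", [("P1", 10), ("P2", 4)])
        # Document store: product descriptions (a dict stands in for a document DB).
        self.docs = {
            "P1": {"name": "Widget", "image": "p1.jpg"},
            "P2": {"name": "Gadget", "image": "p2.jpg"},
        }

    def product_report(self, product_id: str) -> dict:
        """Answer one question by querying each store in its own terms."""
        row = self.sql.execute(
            "SELECT SUM(units) FROM sales WHERE product_id = ?", (product_id,)
        ).fetchone()
        doc = self.docs.get(product_id, {})
        return {"product_id": product_id, "units_sold": row[0] or 0, **doc}

layer = AccessLayer()
print(layer.product_report("P1"))
```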

There are already many interfaces that allow developers to query big data in a non-relational format using SQL. Although it may take some time for comprehensive, fully functional and efficient solutions to the complications of polyglot persistence to become mainstream, it is an eminently solvable problem. The problem of data integration and harmonization is much more challenging because it is one of content, not technology. One way of looking at this is to recognize that polyglot persistence gives you the grammar, but no vocabulary. Grammar has just a small number of rules, but the language it governs will have hundreds of thousands, if not millions, of words.

Unless the disparate datasets in a data lake are aligned and harmonized, they cannot be joined or co-analyzed. The techniques used to do this in an enterprise data warehouse are manual and rely on a limited number of experts—they don’t scale to big-data volumes.

Data discovery tools have provided a partial solution to the problem by democratizing integration. An analysis may require a business user to combine several disparate datasets. To support this, discovery tools have built-in, lightweight, high-productivity integration capabilities, generally known as “data wrangling” to distinguish them from the heavyweight extract, transform and load (ETL) processes of a data warehouse. This basic and very user-friendly functionality removes the expert bottleneck from integration: users can do the integration for themselves. The downside of this approach is that integration tends to be done by users in isolation, and the integration process is not repeatable, shareable or cataloged. It results in data puddles rather than a data lake. This may be the best approach to providing a fast answer to a tactical question: it allows the flexibility to apply just enough integration to meet the business need if, for example, all that is required is a quick directional read on a trend. The one-size-fits-all enterprise data warehouse approach is inflexible and slow in comparison. Nevertheless, the data puddle has obvious issues if a consistent or enterprise-wide view is required.
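
The sketch below illustrates the kind of lightweight wrangling a discovery tool performs under the hood, here with pandas and invented column names: just enough key cleanup to join an internal extract to an external dataset for a quick directional read.

```python
# Ad hoc, user-driven wrangling: fine for one tactical question, but not a governed,
# reusable integration (the "data puddle" risk described above). Data is invented.
import pandas as pd

internal = pd.DataFrame({"sku": ["A-1", "A-2"], "units_sold": [120, 45]})
external = pd.DataFrame({"SKU Code": ["a1", "a2"], "Market Share %": [12.5, 3.1]})

# Quick, local harmonization of the join key, done by the analyst in the tool.
external["sku"] = external["SKU Code"].str.upper().str.replace("A", "A-", regex=False)
combined = internal.merge(external[["sku", "Market Share %"]], on="sku", how="left")
print(combined)
```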

Companies have often spent a great deal of time and money curating their enterprise master data. The data warehousing guru Ralph Kimball argues that the dimensions identified in the enterprise warehouse model and the master data associated with them are valuable enterprise assets that should be reused in a big data world. Matching dimensions between internal and external datasets and identifying common items on those dimensions allow data to be snapped together and years of master data investment to be leveraged.
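
A minimal sketch of that “snapping together” follows, assuming a curated product dimension keyed by GTIN and an external feed that knows only the GTIN; all records are invented.

```python
# Conform an external dataset to an existing master dimension, in the spirit of
# Kimball's conformed dimensions. Keys, attributes and the feed are hypothetical.
product_dimension = {  # curated enterprise master data: surrogate key -> attributes
    "P1": {"gtin": "00012345000116", "brand": "Widget Co"},
    "P2": {"gtin": "00012345000284", "brand": "Gadget Inc"},
}
gtin_to_key = {attrs["gtin"]: key for key, attrs in product_dimension.items()}

external_sales = [  # third-party feed that identifies products only by GTIN
    {"gtin": "00012345000116", "retail_units": 900},
    {"gtin": "99999999999999", "retail_units": 50},  # unknown to the dimension
]

for row in external_sales:
    row["product_key"] = gtin_to_key.get(row["gtin"])  # snap onto the master key
    # rows left with product_key=None would go to an exception process, not the warehouse

print(external_sales)
```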

A problem for both traditional and democratized data integration is that they rely on people, albeit a much larger pool of people in the case of democratized integration. Big data is not only vast, it is also fast: if the sheer amount of data needing integration does not overwhelm individuals, the speed at which it needs to be integrated before it becomes stale will. That is why the common thread linking the tools attempting to scale data integration for the digital world is the use of machine learning and statistical matching techniques. After appropriate learning, these tools fully automate simple integration tasks and only promote complex cases or exceptions to humans for adjudication. For some analyses, fully automated matching may give a good enough result. If users need real-time or near-real-time directional reporting, it is the only way to go.
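
The sketch below illustrates the pattern with a deliberately simple statistical matcher: plain string similarity and invented thresholds stand in for the machine-learning models such tools actually use, with low-confidence cases routed to a human adjudication queue.

```python
# Automated matching with human adjudication for exceptions. Names, thresholds and
# the similarity measure are assumptions made for the example.
from difflib import SequenceMatcher

master_customers = ["Acme Corporation", "Globex Ltd", "Initech"]
incoming = ["ACME Corp.", "Globex Limited", "Umbrella Group"]

AUTO_ACCEPT, NEEDS_REVIEW = 0.85, 0.60  # assumed confidence thresholds

def best_match(name: str):
    """Return (score, master record) for the closest master customer."""
    scored = [(SequenceMatcher(None, name.lower(), m.lower()).ratio(), m)
              for m in master_customers]
    return max(scored)

for name in incoming:
    score, match = best_match(name)
    if score >= AUTO_ACCEPT:
        print(f"auto-matched: {name!r} -> {match!r} ({score:.2f})")
    elif score >= NEEDS_REVIEW:
        print(f"send to human adjudication: {name!r} ~ {match!r} ({score:.2f})")
    else:
        print(f"no match, treat as new record: {name!r} ({score:.2f})")
```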

CONCLUSION

Given the current state of technology, there is no single solution for creating a big data warehouse to replace the now outdated enterprise data warehouse. In the short term, the enterprise data warehouse will remain the golden nugget at the heart of a company’s data assets. It will be supplemented by, or subsumed into, a data lake, which contains the many and various big data sources the company is able to co-locate. Data that cannot be co-located in the lake will be accessed through APIs and federation technologies, such as logical data warehouses. Data harmonization will take on even greater importance than it does now, but will transform from the clerically intensive, one-size-fits-all approach of the enterprise warehouse to a highly automated and need-driven approach.

 
