Wrangling Data in a Big Data World

  By Michelle Kennedy, Program Manager, Thought Leadership

It’s Monday morning and Luke, the BIGCO brand manager, walks into his office. He asks the digital assistant device on his desk how BIGCO’s Acme soft-drinks brand performed over the weekend. After a second, the assistant replies that Acme’s share of sales has fallen by 0.5%. Luke asks what’s driving the decrease, and is told it’s due to issues in the BIGCO West region. His digital assistant offers to email Tom, the sales manager, with a summary of their findings.

Two hours later in California, Tom uses the report to drill down to the root cause, combining sales data with BIGCO’s shipment data. After he blends in third-party data, including market share, weather and econometrics, it looks as if failure to promote Acme during prolonged periods of good weather is a contributing factor. Tom is able to simulate the effect of various promotions on brand share and profitability, and creates a plan. Luke approves the plan, and his digital assistant makes a note to monitor the situation and report progress.

Digital assistants and smart machines are cool, but the most useful business insights come from combining internal data with a multiplicity of external data sources, whether it be sales, shipments, promotions, financials, or a hundred other things. It’s data integration that allows this scenario to play out. Without data integration, the only thing cool technology can do is stare helplessly at a pile of bricks it can’t assemble into anything useful.

The Basics of Data Integration

Data integration relies on the ability to link fields containing the same information, such as information about states, across different datasets. If every dataset used the same identifiers this would be easy, but they don’t: some use two-letter codes (IL, CT), some use full names (Illinois, Connecticut), and so on. And that’s a simple example. Universal Product Codes (UPCs) identify a type of product (a 10-oz. can of BIGCO’s Acme soda, say) and can be used for point of sale and stock keeping. Electronic Product Codes identify individual items, so every single can (more likely, every bottle of champagne, as people aren’t terribly interested in tracking by the soda can) could carry a different code. Still other codes identify aggregations of products, such as in-store combo packs and warehouse pallets.
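To make this concrete, here is a minimal sketch, in Python, of reconciling state identifiers across two datasets; the field names and the abbreviated lookup table are purely illustrative.

```python
# Minimal sketch: reconciling state identifiers across two datasets.
# Field names and the (abbreviated) lookup table are illustrative only.

STATE_TO_CODE = {"illinois": "IL", "connecticut": "CT"}  # extend for all states

def normalize_state(value: str) -> str:
    """Map a full state name or a two-letter code to the two-letter code."""
    cleaned = value.strip().lower()
    if len(cleaned) == 2:
        return cleaned.upper()                    # already a code: "il" -> "IL"
    return STATE_TO_CODE.get(cleaned, value)      # full name: "Illinois" -> "IL"

sales = [{"state": "Illinois", "units": 120}, {"state": "CT", "units": 80}]
shipments = [{"state": "IL", "cases": 10}, {"state": "Connecticut", "cases": 7}]

# Once both datasets carry the same identifier, they can be joined on it.
for row in sales + shipments:
    row["state"] = normalize_state(row["state"])
```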

Data integration means reconciling these different entities and coding systems. Part of the process is to make the data ready for analysis by either aggregating it up or disaggregating it down to a common basis so that, for example, point-of-sale data at UPC level can be combined with advertising data at brand level. Finally, data has to be enriched to enhance its usefulness for analytics: supplementing a brief product description, say, with codified attributes such as manufacturer, brand, size, flavor, packaging, health claims and ingredients.
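As a sketch of the aggregation step, the following assumes hypothetical UPC-level point-of-sale data and brand-level advertising spend, and rolls the former up to brand level so the two can be combined.

```python
# Minimal sketch: aggregating UPC-level point-of-sale data up to brand level
# so it can be combined with advertising data held at brand level.
# Column names and figures are illustrative only.
import pandas as pd

pos = pd.DataFrame({
    "upc":   ["00001", "00002", "00003"],
    "brand": ["Acme", "Acme", "Other"],
    "sales": [1200.0, 800.0, 500.0],
})
advertising = pd.DataFrame({
    "brand":    ["Acme", "Other"],
    "ad_spend": [300.0, 100.0],
})

# Roll point-of-sale data up to the common (brand) basis...
brand_sales = pos.groupby("brand", as_index=False)["sales"].sum()

# ...then combine it with the advertising data.
combined = brand_sales.merge(advertising, on="brand")
print(combined)
```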

Data Wrangling for Digital Business

The gold standard for data integration is the extract, transform and load (ETL) process associated with the data warehouse. ETL provides an automated, high-quality process with defined outcomes, and is the best way to curate long-lived, high-value assets, such as the data used in C-suite dashboards and KPIs.
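For readers who haven’t seen it spelled out, a bare-bones sketch of the extract, transform and load pattern might look like this, with SQLite standing in for the warehouse and an assumed weekend_sales.csv as the source.

```python
# Bare-bones sketch of the extract -> transform -> load pattern, with SQLite
# standing in for the data warehouse. The file, table and column names are
# assumptions for illustration.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Cleanse and standardize before loading: enforce types, normalize codes.
    return [(r["upc"], r["state"].upper(), float(r["sales"])) for r in rows]

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sales (upc TEXT, state TEXT, sales REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("weekend_sales.csv")), conn)
```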

The problem with using ETL on the big datasets typical of digital business is that ETL is expert-led, has a long setup time, and requires ongoing work to incorporate new and changed data. It is the opposite of agile. Businesses need to tap into a huge array of new data sources, and it’s impossible to know in advance which of them contain valuable insights; there’s a desperate need for fast, lightweight, “good enough” data integration that lets data scientists explore and experiment.

The solution to the integration bottleneck is to turn the ETL architecture on its head: Instead of providing a human with the tools to craft an integration process, provide an automated integration tool with a human helper to process exceptions. A growing number of cloud-based integration platforms of this kind are on the market, operating on machine learning and statistical best-fit mapping.
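A toy sketch of that inversion, machine proposes and a human handles exceptions, could use simple string similarity to map incoming field names to a known schema and queue anything below a confidence threshold for review; the schema, threshold and field names here are assumptions, not any vendor’s actual method.

```python
# Toy sketch of "machine proposes, human handles exceptions": fuzzy-match
# incoming field names against a known schema and queue low-confidence
# matches for review. Schema, threshold and inputs are illustrative only.
from difflib import SequenceMatcher, get_close_matches

KNOWN_FIELDS = ["state", "brand", "upc", "sales_units", "ship_date"]

def propose_mapping(incoming_fields, threshold=0.6):
    auto_mapped, needs_review = {}, []
    for field in incoming_fields:
        candidates = get_close_matches(field.lower(), KNOWN_FIELDS, n=1, cutoff=0.0)
        best = candidates[0] if candidates else None
        score = SequenceMatcher(None, field.lower(), best).ratio() if best else 0.0
        if score >= threshold:
            auto_mapped[field] = best      # confident: map automatically
        else:
            needs_review.append(field)     # exception: route to a human helper
    return auto_mapped, needs_review

mapped, review = propose_mapping(["State_Code", "Brand Name", "units_sold", "wthr_idx"])
print(mapped)   # high-confidence mappings applied automatically
print(review)   # fields a human still needs to resolve
```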

ETL requires data to be copied into a data warehouse, something that quickly becomes onerous as the volume, velocity and variety of big data continue to grow. Data “federation” and “virtualization” technologies address this by allowing analysts to query multiple remote data sources and create virtual integrated views in memory, without having to bulk-copy data. Because the source data is accessed directly, the synchronization problems of the ETL model (where copies must be refreshed whenever information is updated) are avoided.
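As one illustration of the idea, an engine such as DuckDB can query files where they sit and assemble an integrated view in memory; the file names and columns below are hypothetical.

```python
# Sketch of the federated/virtualized idea: query sources where they sit and
# build an integrated view in memory instead of bulk-copying into a warehouse.
# DuckDB is one example engine; file and column names are hypothetical.
import duckdb

view = duckdb.sql("""
    SELECT s.state, s.brand, SUM(s.units) AS units, SUM(sh.cases) AS cases
    FROM 'pos_sales.csv'     AS s          -- source 1: point-of-sale extract
    JOIN 'shipments.parquet' AS sh         -- source 2: shipment data
      ON s.state = sh.state AND s.brand = sh.brand
    GROUP BY s.state, s.brand
""")
print(view.df())   # materialize the virtual view only when it is needed
```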

ETL is also strongly oriented toward data in relational databases. Digital business has unleashed a huge variety of new data formats and stores, including Hadoop, NoSQL and graph databases, key-value stores, messages, logs and documents. In addition to these text and number stores, map-related information, images, videos and sound recordings are increasingly important to business. Tom may want to use satellite images to incorporate traffic volumes by daypart into his analysis, or to look for a correlation between Acme’s sales and the number of vehicles parked near beaches or parks. Here too, a growing number of cloud platforms can connect to this multiplicity of data formats, with no requirement that they be relational.
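A small sketch of pulling two non-relational sources, a JSON document export and a plain-text log, into a common tabular shape; the paths, fields and log format are assumptions for illustration.

```python
# Small sketch: pulling two non-relational sources (a JSON document export and
# a plain-text log) into a common tabular shape. Paths, fields and the log
# format are assumptions for illustration.
import json
import re
import pandas as pd

# Product documents, e.g. exported from a document store (one JSON object per line).
with open("products.jsonl") as f:
    docs = pd.DataFrame(json.loads(line) for line in f)

# Semi-structured log lines such as "2024-06-01 12:00:03 INFO upc=00001 units=3".
pattern = re.compile(r"(\S+ \S+) \w+ upc=(\d+) units=(\d+)")
with open("pos.log") as f:
    events = pd.DataFrame(
        (m.groups() for line in f if (m := pattern.search(line))),
        columns=["timestamp", "upc", "units"],
    )

# Both sources now share a tabular shape and a key ("upc") to join on.
combined = events.merge(docs, on="upc", how="left")
```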

The Future of Data Wrangling

You need to optimize for competing priorities such as speed, accuracy, completeness and the availability of expertise on a case-by-case basis: There is no one-size-fits-all solution. It’s also likely that a process will move from one flavor of integration to another over time: Tom’s initial analysis of Acme’s fall in market share used “quick and dirty” methods of data integration you would not want to incorporate into routine reporting for the C-suite.

ETL will not go away, but smart machines will increasingly get their arms around big data, and be able to integrate it reliably, automatically, and at speed. That means Tom’s digital assistant will be able to do more than help him understand what happened to BIGCO’s Acme sales and why. She’ll be able to combine BIGCO data with any data on the Internet, or whatever other sources she can access, to create models that better predict the sales of BIGCO brands. Ultimately she’ll be able to respond to predictive requests such as, “How do we increase sales in the West region by 10%?”

But not quite yet.

 
