Big Data – what is it? Use it with care.

Unfortunately, IT consultants are often technology-driven and love their buzzwords: cloud, mobile, big data, the Internet of Things, digital transformation. These are best used in combination … “digital transformation with mobile big data in the cloud” is a hard thing to disagree with.

Please don’t get me wrong. The intellectual and technical achievements covered by these buzzwords are astounding: growth in processing power, storage capacity and algorithmic sophistication has been exponential. However, no single technology is a magic bullet.

I would like to make two cautionary observations:

  1. Big data means many things to many people. A recent ACM Transactions paper on massively parallel storage opens by noting that “The term big data is vague enough to have lost much of its meaning”, with the result that as soon as you use the term you have to say what you mean by it. In-memory computing? Text analytics? High-performance computing (HPC)? Massively parallel storage? Streaming analytics? Domain-relevant query languages? The answer is all of the above: it depends on your perspective – and on the technology you are starting from. So, for example, the oil service company Baker Hughes tells a big data story that focuses on HPC and visualization. SAP talks about in-memory databases. Control vendors focus on streaming data and alarms. SAS Institute talks about statistics, and Splunk about log records. Big data is also sometimes made to mean the same as artificial intelligence or machine learning. Big data is a tool box: pick and mix what you need for the specific task … and don’t use a hammer when you need to drive a screw.
  2. Senior managers are being fed balderdash about big data. I will quote two fresh examples, although I could have chosen many from the mid-1990s as well:
    1. First, take the November 2014 issue of Harvard Business Review. A side-box in an otherwise excellent article on GE’s Internet of Things tells us that “unlike analogue signals, digital data is perfectly transmitted” and that this drives digital transformation. This is not even true at a technical level, as I experience every week when I have to download a certain newspaper twice to my iPad because of data corruption during the download. More importantly, the statement ignores the analogue or physical things or people at either end of the transmission. What is transmitted perfectly is all too often inaccurate, incomplete, mistaken or a lie. Perfectly transmitted garbage remains garbage.
    2. Second, McKinsey Quarterly recently published an article entitled “Artificial Intelligence meets the C-suite”. The article recommends that C-suite executives become “data driven” and warns that “domain expertise” is a barrier to this: a prejudice, or “survivor bias”. One of the areas of domain expertise that can supposedly be replaced is G&G (geology and geophysics): “The oil and gas industry, for instance, has incredibly rich data sources.” We then get a description of drill logs and seismic data sets, concluding: “Now these are incredibly rich and complex data sets and, at the moment, they’ve been mostly manually interpreted. And when you manually interpret what comes off a sensor on a drill bit or a seismic survey, you miss a lot of the richness that a machine-learning algorithm can pick up.” The authors therefore conclude that “the best thing you can possibly do is to get rid of the domain expert who comes with preconceptions about what are the interesting correlations or relationships in the data and to bring in somebody who’s really good at drawing signals out of data.”

This sort of wild triumphalism is scary. It extrapolates from data about people and what they say (e.g. Google and Facebook) to data about the physical world. This attitude ignores physics and the well-founded physical correlations and relationships captured in engineering models. It arrogantly overlooks the professional knowledge and skill that go into building physically meaningful models of the world, and claims that a brute-force statistical or pattern-matching algorithm is superior.

We need both physical models and statistical models, working together. This means we also need data scientists and engineers working with each other rather than against each other. McKinsey’s proposals are dangerous and counterproductive.
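
To make “working together” concrete, here is a minimal sketch of a so-called grey-box model: a physics equation supplies the structure, and a statistical model is trained only on the residuals the physics does not explain. Everything in it – the orifice equation, the temperature effect, the numbers – is invented for illustration; it is a sketch of the approach, not anyone’s actual method.

    # Grey-box sketch: physics supplies the structure, statistics
    # corrects only what the physics does not explain.
    # All equations and numbers are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(42)

    # Synthetic "plant data" standing in for a real data historian.
    dp = rng.uniform(0.5, 4.0, 200)        # differential pressure [bar]
    temp = rng.uniform(20.0, 80.0, 200)    # fluid temperature [deg C]
    # "True" flow: orifice-type physics plus a temperature effect
    # that the simple physics model below does not know about.
    flow_true = 10.0 * np.sqrt(dp) * (1.0 + 0.002 * (temp - 50.0))
    flow_meas = flow_true + rng.normal(0.0, 0.1, 200)  # sensor noise

    # Step 1: the physics model (the domain expert's contribution).
    def physics_flow(dp):
        """Orifice equation Q = C * sqrt(dP) with a calibrated C."""
        return 10.0 * np.sqrt(dp)

    # Step 2: a statistical model of the residuals (the data
    # scientist's contribution) - here just a linear fit in temp.
    residual = flow_meas - physics_flow(dp)
    coeff = np.polyfit(temp, residual, deg=1)

    def greybox_flow(dp, temp):
        """Physics prediction plus the learned correction."""
        return physics_flow(dp) + np.polyval(coeff, temp)

    # Physics alone versus physics plus statistics.
    def rms(err):
        return float(np.sqrt(np.mean(err ** 2)))

    print("RMS error, physics only:        ",
          rms(flow_meas - physics_flow(dp)))
    print("RMS error, physics + statistics:",
          rms(flow_meas - greybox_flow(dp, temp)))

Neither model is much use alone: the pure statistical fit would happily report spurious correlations, while the pure physics model misses the effect it does not contain. Together they do better than either.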

The more things change…

I would like to tell you the story of a project – a big data project. Streaming data was used to monitor and optimize production from a processing plant. The industry was living in interesting times: competition from Asia and new players in the USA was squeezing margins, and the company needed to cut several thousand positions.

The project had the following cast of actors:

  • A vice-president for operations with the latest in portable personal devices.
  • An expert in simulation and analysis of the key processing equipment in the company’s value chain. He was really old: at least 50.
  • An artificial intelligence research manager with a very expensive Mac.
  • An operations support engineer who knew that this project could improve production by 2% and availability by about the same … and hoped that he could prove this to the vice-president for operations.
  • A data scientist, who was very smart – he knew the latest about artificial intelligence, ontologies and Markov chains – and pitied anyone who couldn’t understand it like he could.
  • An IT department that delivered a cloud-based set of standardized business services, with user self-service, a three-month turnaround on change requests and an obstructive attitude to non-standard or technical computing.
  • A 63-year-old production planner, who had a spreadsheet that only he could run. This spreadsheet was used for all production planning in the plant; the assumptions in it were only occasionally checked against actual process behaviour.
  • An academic, who had a really cool algorithm that might be useful.
  • A junior project manager who tried to tie all of this together and get everyone to pull in the same direction.

The project had a few false starts, with a huge challenge in communication between the AI people and operations. But it delivered in the end.

Anybody recognize this type of project?

This project happened nearly 30 years ago. The “personal portable device” was a Compaq Portable III “luggable lunchbox”, with a 20 MB hard disk, an orange plasma screen and a weight of 10 kg. But to the Vice-President, it was an iPad. The IT department’s standardized cloud was an IBM mainframe. Yet the project team and the implementation challenges are the same now as they were in 1988. I was that junior project manager.

The more things change, the more they stay the same. Our project team was struggling with the same technical and organizational issues that capital-intensive industries still face today. Data-processing power has grown exponentially, and this makes it possible to analyze big data sets. However, our capacity to extract value from them has grown at best linearly … and will continue to do so unless IT professionals give their business customers – the geophysicists, petroleum technologists, operations engineers and logistics planners – easy access to the right information, so that they can control (and then optimize) their corner of the world.
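
By way of a closing illustration – with invented tag names, model and thresholds, nothing from the actual 1988 project – the streaming monitoring at the heart of that story could today be sketched in a few lines: compare each measurement against the model’s expectation, smooth the deviation, and alert the operations engineer only when it drifts.

    # Streaming-monitoring sketch: flag sustained drift between
    # measured values and a model's expectation. Tag names, numbers
    # and thresholds are invented for illustration.
    from typing import Iterable, Iterator, Tuple

    def monitor(readings: Iterable[Tuple[float, float]],
                alpha: float = 0.1,
                threshold: float = 2.0) -> Iterator[Tuple[int, float]]:
        """Yield (step, drift) alerts from (measured, expected) pairs.

        An exponentially weighted moving average (EWMA) of the
        deviation means a single noisy sample does not trigger an
        alarm, but a sustained drift does.
        """
        ewma = 0.0
        for step, (measured, expected) in enumerate(readings):
            ewma = alpha * (measured - expected) + (1.0 - alpha) * ewma
            if abs(ewma) > threshold:
                yield step, ewma

    # Example: throughput slowly drifts away from the planned value.
    expected = [100.0] * 60
    measured = [100.0 + (0.2 * (t - 30) if t > 30 else 0.0)
                for t in range(60)]
    for step, drift in monitor(zip(measured, expected)):
        print(f"step {step}: smoothed deviation {drift:+.2f}")

The algorithm is trivial; the hard part, in 1988 as now, is getting the expectation right and getting the alert in front of the person who can act on it.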