The Five Important D’s of Big Data Variety
Written by Dr. Kirk Borne (Big Data Variety)
A recent situation caused me to reflect on how often we jump to conclusions: we infer a hypothesis and then (maybe with less proof than I had in this case) assume that our conclusion is true.
For the modern digital organization, the proof of any inference that drives decisions must be in the data! The data makes possible more accurate and trustworthy conclusions.
When I was out for a walk a few weeks ago, I heard a loud low-flying aircraft passing overhead. This was not unusual since we live about 10 miles from a major international airport. In this case, I thought to myself that the aircraft’s sound seemed more directly overhead and lower than normal as well as being suggestive of a larger than average jet aircraft.
I realized that in my one simple thought, I had made three different inferences from a single stream of data. That initial data stream was the audible sound of the aircraft. The three inferences were about the altitude (lower than normal), the size (larger than average), and the flight path (more directly overhead). When I looked up, my tri-inference hypothesis was confirmed. The plane was a very large, low-flying jet for a major overnight shipping company. The slightly unusual flight path may have arisen because these cargo planes are probably instructed to land on a different runway at the airport than the usual commercial passenger flights. Consequently, the altitude and location were slightly different from those of the slightly smaller commercial passenger planes that pass overhead every day.
Personalization Made Possible by Big Data Variety
I frequently refer to the era of big data as “the end of demographics”.
By that, I mean that we now have many more features, attributes, data sources, and insights into each entity in our domain: people, processes, and products. These multiple data sources enable a “360 degree view” of the entity, thus empowering a more personalized (even hyper-personalized) understanding of and response to the needs of that unique entity. In “big data language”, we are talking about one of the 3 V’s of big data: big data variety!
High variety is one of the foundational features of big data: we now measure many more features, characteristics, and dimensions of insight into nearly everything due to the plethora of data sources, sensors, and signals that we measure, monitor, and mine. Consequently, we no longer need to rely on a limited number of features and attributes when making decisions, taking actions, and generating inferences. We can make better, tailored, more personalized decisions and actions. Every entity is unique! That marks the end of demographics.
Here is another example: suppose that a person goes to their doctor to report problems with painful headaches. That headache pain is a single symptom – a single data source, a single signal, a single sensor. However, one could imagine a large number of possible inferences from that one single signal. The headaches could be caused by insufficient sleep (for example, from sleep apnea), high blood pressure, pregnancy, or a brain tumor. Obviously, each one of these diagnoses carries a very different course of action and treatment.
In “data science language”, what we are describing are different segments (clusters) in the hyperspace of symptoms and causes in which the many causes (clusters) are projected on top of one another (overlap one another) in the symptom space. The way that a data scientist resolves that degeneracy is to introduce more parameters (higher big data variety) in order to “look at” those overlapping clusters from different angles and perspectives, thus resolving and clearly separating the different diagnosis clusters. (Degeneracy is a technical data science word that refers to a situation when two or more states of a system look the same, or redundant, from certain perspectives, due to a limited set of data features.) High-variety data enables the discovery of multiple clusters, and eventually identifies the correct cluster (correct diagnosis, in this case).
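The idea of resolving overlapping clusters by adding a feature can be sketched with synthetic data. This is only an illustration under stated assumptions: the two "diagnoses", the feature names (headache severity, blood pressure), and all numeric values are invented, and a simple nearest-centroid rule stands in for whatever classifier one would actually use.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 500  # synthetic patients per (hypothetical) diagnosis

# Feature 1: headache severity. Both diagnoses follow the same distribution,
# so the two clusters lie on top of each other along this axis (degeneracy).
severity_a = rng.normal(loc=5.0, scale=1.0, size=n)
severity_b = rng.normal(loc=5.0, scale=1.0, size=n)

# Feature 2: an additional measurement (here called blood pressure) that
# differs between the two diagnoses. The values are made up.
pressure_a = rng.normal(loc=120.0, scale=5.0, size=n)
pressure_b = rng.normal(loc=160.0, scale=5.0, size=n)

X = np.column_stack([
    np.concatenate([severity_a, severity_b]),
    np.concatenate([pressure_a, pressure_b]),
])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])


def nearest_centroid_accuracy(features, labels):
    """Assign each point to the closest class centroid; return the accuracy."""
    centroids = np.stack([features[labels == k].mean(axis=0) for k in (0, 1)])
    # Pairwise distances between every point and both centroids: shape (N, 2).
    distances = np.linalg.norm(
        features[:, None, :] - centroids[None, :, :], axis=2
    )
    predicted = distances.argmin(axis=1)
    return float((predicted == labels).mean())


acc_one_feature = nearest_centroid_accuracy(X[:, :1], y)  # severity only
acc_two_features = nearest_centroid_accuracy(X, y)        # severity + pressure

print(f"severity only:       {acc_one_feature:.2f}")
print(f"severity + pressure: {acc_two_features:.2f}")
```

With severity alone, the accuracy should stay close to chance (about one half), because the clusters are indistinguishable from that perspective. Adding the second feature lets the same simple rule separate them almost perfectly.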
Reaching More Refined Data-Driven Insights
Higher variety data means that we are adding data from other sensors, other signals, other sources, and of different types. Going back to our low-flying airplane example, this has the following application:
- I not only heard the aircraft (sound = audio data).
- I also looked at it (sight = visual data).
- I observed its flight path (dynamic change over time = time series data).
The proof of my inference about the airplane was in the data! Additional data sources provided the variety of data signals that were needed in order to derive a correct conclusion.
Similarly, when you go to the doctor with that headache, the doctor will start asking about other symptoms (e.g., lack of appetite or other pains) and may order other medical tests (blood pressure checks or other lab tests). Those additional data sources and sensors provide the variety of data signals that are needed in order to derive the correct diagnosis.
These examples (low-flying aircraft and headache pain) are representative analogies of a large number of different use cases in every organization, every business, and every process. The more data you have, the better you are able to detect and discover interesting and important phenomena and events. However, the more variety your data has, the better you are able to correctly diagnose, interpret, understand, gain insights from, and take appropriate action in response to those phenomena and events.
The Five Important D’s of Big Data Variety
The detection and separation of multiple classes (or diagnoses) of entities (persons, things, or events) in your data collection improves when a sufficient number of usable and informative data features are available for exploration and testing. This empowers five important uses of big data variety in analytics. These are:
- Disambiguation of different entities that otherwise look the same when examined with a small (insufficient) number of data features (e.g., two customers who have the same first name, last name, and city of residence).
- Deduplication of multiple instances of the same entity in multiple databases (e.g., the same customer listed with different ID numbers in a sales database compared with the customer call center database).
- Discrimination (distinction and separation) between different classes (categories, or diagnoses) that may overlap strongly in some subsets of data feature space (e.g., medical patients with the same symptoms and the same diagnosis may still need completely different treatments for their condition due to allergies to specific medications).
- Discovery of new classes of entities (unknown unknowns) that are “lurking” and previously hidden from detection due to their strong similarity to already existing well-known entities (e.g., a new strain of the seasonal flu).
- Decreased model bias that may have been caused by applying an insufficient number of features (insufficient information) in making a diagnosis, or classification, or decision (e.g., fitting a straight line to a correlated pair of decision variables when the data clearly shows there is more information and structure in the data than that).
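The first two items in this list, disambiguation and deduplication, can be illustrated with a small sketch. Everything here is hypothetical: the records, field names, ID formats, and the `group_by` helper are invented for the example, and exact matching on a tuple of fields is a deliberate simplification of real entity resolution, which usually needs fuzzy matching.

```python
from collections import defaultdict

# Hypothetical customer records from two systems (all values are made up).
records = [
    {"source": "sales", "id": "S-101", "first": "Ann", "last": "Lee",
     "city": "Austin", "email": "ann.lee@example.com", "birth_year": 1980},
    {"source": "call_center", "id": "C-877", "first": "Ann", "last": "Lee",
     "city": "Austin", "email": "ann.lee@example.com", "birth_year": 1980},
    {"source": "sales", "id": "S-202", "first": "Ann", "last": "Lee",
     "city": "Austin", "email": "alee77@example.org", "birth_year": 1994},
]


def group_by(records, fields):
    """Group record IDs by the tuple of values found in `fields`."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[f] for f in fields)
        groups[key].append(rec["id"])
    return list(groups.values())


# With only name and city, all three records look like one entity.
few_features = group_by(records, ["first", "last", "city"])

# With more features, S-202 is disambiguated as a different person, while
# S-101 and C-877 are recognized as duplicates of the same customer.
more_features = group_by(
    records, ["first", "last", "city", "email", "birth_year"]
)

print(few_features)   # [['S-101', 'C-877', 'S-202']]
print(more_features)  # [['S-101', 'C-877'], ['S-202']]
```

The same extra features serve both purposes: they split apart entities that only looked identical, and they confirm that records with different ID numbers refer to the same entity.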
Big data variety makes all of this possible because variety is definitely the spice of discovery. The proof is in the data.