Written by Dr. Kirk Borne
In this, the latest in our exclusive series from Dr. Kirk Borne, Principal Data Scientist at Booz Allen Hamilton, and a continuation of his article “The Five Important D’s of Big Data Variety,” we learn why it’s critically important to have multiple, diverse data sources in order to minimize bias and arrive at objective conclusions.
At the beginning of 2019, I enjoyed New Year’s Day by watching a lot of football games on TV. As I watched different teams, listening to the opinions of experts and of fans (mostly on social media), it was clear to me that most of us have very strong opinions and bias in favor of our favorite team. We can cite one or more “convincing” statistics to support our point of view. Those who have a different opinion can do the same.
There is no winning in an argument like that. Fortunately, in those cases, the real winning is done on the field of play, not in our arguments. But there are plenty of other more significant cases where we have opinions that differ from other people, and where that difference of opinion (or bias) can have serious consequences. How can we come to an objective reconciliation, or evidence-based rationale, for our points of view in these cases?
How Can We Arrive at Objective Understanding in Data Science?
One way to do this is through the lens of multiple, diverse perspectives. By analyzing the full picture, the full set of facts and statistics, we may hopefully minimize our filters and arrive at an objective understanding of the thing that otherwise we may each be perceiving differently.
In data science, we refer to these multiple, diverse perspectives as diverse data sources. Diversity in data (high data variety) is one of the three defining characteristics of big data, along with high data volume and high velocity. We discussed the power and value of high-variety data in a previous article on this site: “The Five Important D’s of Big Data Variety.” We won’t repeat those lessons here. Instead, we focus specifically on the bias-busting power of high-variety data, which was the last of the five D’s mentioned in the earlier article: Decreased model bias.
Here, we broaden our meaning of “bias” to go beyond model bias. In its technical statistical sense, model bias corresponds to “underfitting,” which essentially means that there is more information and structure in the data than our model has captured.
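To make the narrower, statistical meaning concrete, here is a minimal Python sketch of underfitting. The data set (points on the curve y = x^2) and the helper names `fit_line` and `mean_squared_error` are assumptions made up for illustration; they do not come from the earlier article.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the observations and the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data with curved structure: y = x^2.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

# Model A: a straight line in x. Too simple for this data, so it underfits.
slope_a, intercept_a = fit_line(xs, ys)
mse_line = mean_squared_error(xs, ys, slope_a, intercept_a)

# Model B: a straight line in the feature x^2, which can capture the curve.
squared = [x * x for x in xs]
slope_b, intercept_b = fit_line(squared, ys)
mse_curve = mean_squared_error(squared, ys, slope_b, intercept_b)

print(f"straight line MSE: {mse_line:.2f}")
print(f"with x^2 feature MSE: {mse_curve:.2f}")
```

On this toy data the straight line leaves a mean squared error of 12, while the model that is given the squared feature fits the points exactly. The error that remains under Model A is structure in the data that the simpler model could not capture.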
Natural Bias Busting is not Personal Bias Busting
In this article, we generalize our discussion to a broader definition of bias: lacking a neutral viewpoint, or having a viewpoint that is partial. We will call this natural bias, since the examples can be considered as “naturally occurring” without obvious intent. This article does not elaborate on personal bias (which might be intentional), though the cause for that kind of prejudice is essentially the same: not considering and taking into account the full knowledge and understanding of the person or entity that is the subject of the bias.
Examples of Natural Bias Busting
We describe some examples of natural bias first, and then recommend a remedy for those of us working in the realm of data specifically.
1. A few years ago, there was a photograph of a white and gold dress that circulated virally on social media — it went viral because some people insisted that it was actually blue and black.
Or was it the other way around? I’m not sure, but maybe I am biased! There was heated disagreement about the actual colors. It was all a matter of perception. Or was it? After all, the dress was either one or the other, not both, right? In fact, the reality was that each person’s eyes’ optic nerves, under different lighting, were more or less responsive to different colors. Hence, to each person, the dress was the colors that they said it was, despite the apparent contradiction. In the end, we had to agree that the different perspectives were both correct since the sensor (data collector) was each person’s optic nerves, and each person was simply reporting what their “data” measured. The different perceived colors were an optical illusion. Collecting sensor data from multiple sources (what different people saw, under different lighting conditions) proved this fact.
2. A similar event occurred in 2018 when an audio recording went viral on social media, on which some people heard the word “Laurel” and some heard the word “Yanny”.
Once again, there was strong disagreement, with each person convinced that they were right. Fortunately, again there was a reasonable scientific explanation having to do with the audio frequencies in the recording and the sensitivity of each individual person’s ears to different frequencies (high vs. low). It was eventually demonstrated that each of the two words could be heard by the same person when the frequency of the recording was adjusted either higher or lower. The different perceived sounds were an auditory illusion. Collecting sensor data from multiple sources (what different people heard, under different audio frequency filtering) proved this fact.
3. A third example comes from S. I. Hayakawa’s book “Language in Thought and Action.”
In that book, he described an experience that he had at a train station in America after the start of World War II. As a person of Japanese descent, he was being stared at suspiciously by some of the people in the station. One might say that they saw him in only one dimension: his physical characteristics, which were clear indicators of his biological heritage, but which could not (and did not) provide a full picture of the man himself. He started a casual conversation with a young couple at the station, in which they discussed the weather, train schedules, family, and concerns about the war. Afterward, he noticed that the other people in the station went back to their other concerns (the weather and the train delays).
The casual conversation provided additional perspectives (simple, but meaningful insights) into the man that proved to the suspicious onlookers that there was no basis for their suspicion after all. He was as concerned about his family in this situation as was everyone else (for example, when would he see his family again?). The filters of bias were removed, and the full person was seen and understood. In this case, the additional “data sources” were simply the conversational exchanges of thoughts and words that communicated concerns that were mutually shared by others. This was a wonderful example of “language in thought and action.”
4. Another example of natural bias comes from a famous cartoon. The cartoon shows three or more blind men (or blindfolded men) feeling an elephant.
They each feel a different aspect of the elephant: the tail, a tusk, an ear, the body, a leg — and they consequently offer a different interpretation of what they believe this thing is (which they cannot see). They say it might be a rope (the tail), or a spear (the tusk), or a large fan (the ear), or a wall (the body), or a tree trunk (the leg). Only after the blindfolds are removed (or an explanation is given) do they finally “see” the full truth of this large complex reality. It has many different features, facets, and characteristics. Focusing on only one of those features and insisting that this partial view describes the whole thing would be foolish.
We have similar complex systems in our organizations, whether it is the human body (in healthcare), or our population of customers (in marketing), or the Earth (in climate science), or different components in a complex system (like a manufacturing facility), or our students (in a classroom), or whatever. Unless we break down the silos and start sharing our data (insights) about all the dimensions, viewpoints, and perspectives of our complex system, we will consequently be drawn into biased conclusions and actions, and thus miss the key insights that enable us to understand the wonderful complexity and diversity of the thing in its entirety. Integrating the many data sources enables us to arrive at the “single correct view” of the thing: the 360 view!
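As a rough illustration of what integrating siloed sources can look like in practice, the Python sketch below merges three partial views of the same customers into one record per customer. The source names, fields, customer IDs, and the `build_360_view` helper are all hypothetical, chosen only to make the idea concrete.

```python
from collections import defaultdict

# Hypothetical siloed sources, each keyed by the same customer ID.
sales = {"c1": {"last_purchase": "running shoes"}, "c2": {"last_purchase": "tent"}}
support = {"c1": {"open_tickets": 2}}
web = {"c1": {"visits_last_30d": 14}, "c2": {"visits_last_30d": 3}}


def build_360_view(**sources):
    """Merge per-source records into one record per entity.

    Field names are prefixed with the source name so that the origin of
    each value stays visible and fields from different silos cannot collide.
    """
    view = defaultdict(dict)
    for source_name, records in sources.items():
        for entity_id, fields in records.items():
            for field, value in fields.items():
                view[entity_id][f"{source_name}.{field}"] = value
    return dict(view)


customer_360 = build_360_view(sales=sales, support=support, web=web)
for customer_id, record in sorted(customer_360.items()):
    print(customer_id, record)
```

Each silo on its own is like one blind man’s hand on the elephant. Only the merged record shows that customer c1 is a recent buyer, a frequent visitor, and someone with open support tickets, all at once.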
How can we conquer our potential for natural bias in data analytics activities?
The approach includes both a data strategy and an analytics strategy: what to collect, and why to collect it.
First, we should aim to collect data from a variety of sources. This might include sensor data from IoT (Internet of Things) devices, or alternative data sources that exist outside of our organization (like social sentiment, online business reports, government open data repositories, or other open data sets).
Second, we must break down the silos (organizational, cultural, or administrative) that isolate the data sources and prevent integration. Instead, we must curate integrated data repositories.
Curation includes labeling, annotating, documenting, indexing, and rating the data sources so that others in our organization can understand them, know how to use them, where to find them, and when to use them.
Third, we must commit to learn from all data. Exploring high-variety data to find value in them can be a key differentiator for our organization. For example, explore additional attributes of customer transactions: not only what customers purchased, but how and when (like sales channel, shipping method, and time of day of purchase).
Fourth, innovate with these data sources. The value derived from these innovations will help to build advocacy across our organization for this data-sharing and data democratization policy.
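The third step above suggests looking beyond what customers purchased to how and when they purchased it. A minimal Python sketch of that kind of exploration follows. The transactions, the field names, and the `part_of_day` bucketing rule are assumptions invented for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical transactions; field names are illustrative only.
transactions = [
    {"item": "kettle", "channel": "web", "shipping": "standard", "time": "2019-01-05T09:15"},
    {"item": "kettle", "channel": "mobile", "shipping": "express", "time": "2019-01-05T21:40"},
    {"item": "mug", "channel": "mobile", "shipping": "express", "time": "2019-01-06T22:05"},
    {"item": "mug", "channel": "store", "shipping": "pickup", "time": "2019-01-06T12:30"},
]


def part_of_day(timestamp):
    """Bucket an ISO timestamp into a coarse time-of-day label."""
    hour = datetime.fromisoformat(timestamp).hour
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


# Count purchases by (channel, shipping method, time of day).
profile = Counter(
    (t["channel"], t["shipping"], part_of_day(t["time"])) for t in transactions
)

for combo, count in profile.most_common():
    print(combo, count)
```

Counting by item alone would show two kettles and two mugs. Counting by the additional attributes surfaces a different pattern in this toy data: evening mobile purchases with express shipping.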
Bust Bias with the CCDI Data Analytics Strategy
The four-step program described above corresponds to the CCDI data and analytics strategy (Collect, Curate, Differentiate, and Innovate):
- Collect: seek to acquire diverse (high-variety) data from multiple sources.
- Curate: label, annotate, rate, and index data for reuse and sharing.
- Differentiate: perform Exploratory Value Analysis (EVA) to find the data and the analytics products that will set you and your organization apart from your competition.
- Innovate: get busy creating products, services, and outcomes that deliver big value from big data.
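The Curate step is the easiest of the four to picture in code. The Python sketch below shows one possible shape for a small data catalog in which each source is labeled, described, located, and rated so that colleagues can find it. The `CatalogEntry` fields, the example sources, and the `find_sources` helper are hypothetical, not part of any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Curation metadata for one data source (all fields are illustrative)."""
    name: str
    location: str          # where to find it
    description: str       # what it is and how to use it
    labels: set = field(default_factory=set)
    rating: int = 0        # e.g. a 1-5 quality score assigned by a curator


catalog = [
    CatalogEntry("pos_sales", "warehouse.sales.pos", "In-store transactions",
                 {"sales", "internal"}, rating=5),
    CatalogEntry("social_sentiment", "lake/external/sentiment",
                 "Brand mentions with sentiment scores",
                 {"marketing", "external"}, rating=3),
    CatalogEntry("iot_shelf_sensors", "lake/iot/shelves",
                 "Shelf weight sensor readings",
                 {"sales", "iot"}, rating=4),
]


def find_sources(entries, label, min_rating=0):
    """Return entries carrying a label, best-rated first."""
    matches = [e for e in entries if label in e.labels and e.rating >= min_rating]
    return sorted(matches, key=lambda e: e.rating, reverse=True)


for entry in find_sources(catalog, "sales", min_rating=4):
    print(entry.name, entry.location, entry.rating)
```

Even a catalog this small lets someone outside the original team discover that shelf sensor data exists and is relevant to sales analysis.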
Collecting high-variety data from diverse sources, connecting the dots, and building the 360 view of our domain is not only the data silo-busting thing to do. It is also the bias-busting thing to do. High-variety data makes that possible, and there is no shortage of biases for high-variety data to bust, including cognitive bias, confirmation bias, salience bias, and sampling bias, just to name a few!
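Sampling bias is the simplest of these to demonstrate with numbers. In the Python sketch below, the three collection channels and their satisfaction scores are invented for illustration. Pooling sources is not a guarantee of an unbiased estimate; it helps only to the extent that the sources together cover the population of interest.

```python
from statistics import mean

# Hypothetical satisfaction scores (1-5) collected through three channels.
# Each channel reaches a different kind of customer.
sources = {
    "online_survey": [5, 4, 5, 4],
    "call_center": [1, 2, 2, 1],
    "in_store": [3, 4, 3, 2],
}

# Estimate from each source alone.
single_source = {name: mean(scores) for name, scores in sources.items()}

# Estimate after pooling all sources.
pooled = mean(score for scores in sources.values() for score in scores)

for name, estimate in single_source.items():
    print(f"{name}: {estimate:.2f}")
print(f"pooled: {pooled:.2f}")
```

In this toy data, an analyst who sees only the online survey would report an average of 4.5, and one who sees only the call center would report 1.5. The pooled estimate of 3.0 reflects all three perspectives.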