Putting It Together: Statistics: Describing Data

The trials and tribulations of data visualization for good

“I love big data.  It’s got such potential for storytelling.”  At DataKind, we hear some version of this narrative every week.  As more and more social organizations dip their toes into using data, invariably the conversation about data visualization comes up. There is a growing feeling that data visualization, with its combination of “engaging visuals” and “data-driven interactivity”, may be the magic bullet that turns opaque spreadsheets and dry statistics into funding, proof, and global action.

However, after four years of applying data-driven techniques to social challenges at DataKind, we feel that data visualization, while it does have an important place in our work, is a mere sliver of what it takes to work with data.  Worse, the ubiquity of data visualization tools has led to a wasteland of confusing, ugly, and sometimes unhelpful pie charts, word clouds, and worse.

 

Two pie charts. Each is broken into dozens of slivers, most of them too tiny to read. The one on the right is a call-out of the one on the left, and it has a list of items that are too small to be legible on its right.

Ugh.

The challenge is that data visualization is not an end-goal, it is a process.  It is often the final step in a long manufacturing chain along which data is poked, prodded, and molded to get to that pretty graph.  Ignoring that process is at best misinformed, and at worst destructive.

Let me show you an example:  In New York City, we had a very controversial program called Stop and Frisk that allowed police officers to stop people on the street they felt were a potential threat in an attempt to find and reclaim illegal weapons.

After a Freedom of Information Act (FOIA) request by the New York Civil Liberties Union (NYCLU) resulted in the New York Police Department (NYPD) releasing all of their Stop and Frisk data publicly, people flocked to the data to independently pick apart how effective the program was.

The figure below comes from WNYC, a public radio station located in New York City.  Here they’ve shaded each city block brighter pink the more stops and frisks occurred there.  The green dots on the map indicate where guns were found.  What the figure shows is that the green dots do not appear as close to the hot pink squares as one would believe they should.  The implication, then, is that Stop and Frisk may not actually be all that effective in getting guns off the street.

Map of New York City, against a black backdrop. The boroughs appear in shades of purple (majority area) and pink (smaller areas scattered in the city) to reflect number of police stops per block. Small dots of green, hard to see, note where guns were found during police stops.

But then a citizen journalist created this map of the same data.

Map of New York City against black backdrop, which is a smaller subset of previous map. Compared to previous map, this one also shows hues of purple and pink, though the pink is much more prominent in the map overall. Green dots to show guns found during police stops are significantly bigger and more prominent across the map.

By simply changing the shading scheme slightly, he notes, this map makes the green dots look much closer to the hot pink squares.  In fact, he goes further, removing the artificial constraints of the block-by-block analysis and smoothing over the whole area in New York, resulting in a map where those green dots stare unblinkingly on top of the hot-red stop and frisk regions.

Heat Map of The Bronx. A key shows shades of orange, from light to dark, to reflect number of police stops. Green circles indicate guns found during police stops; bigger green circles reflect more guns found in an area.

The argument this author makes visually is that Stop and Frisk does in fact work.
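To see how much the shading scheme alone can matter, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the stop counts and gun locations are invented (they are not the NYPD data), and the function names are hypothetical. It is not how either map was actually produced. It only compares two common ways of deciding which blocks get the "hottest" color.

```python
# Hypothetical stop counts for ten city blocks (invented; not NYPD data).
stops = [2, 3, 5, 8, 12, 15, 20, 40, 90, 400]
# Hypothetical indices of the blocks where a gun was found.
gun_blocks = [7, 8, 9]


def hot_equal_interval(counts, n_classes=3):
    """Flag blocks in the top class when the range is cut into equal-width bins."""
    lo, hi = min(counts), max(counts)
    cutoff = hi - (hi - lo) / n_classes
    return [c > cutoff for c in counts]


def hot_quantile(counts, n_classes=3):
    """Flag blocks in the top class when bins hold roughly equal numbers of blocks."""
    ranked = sorted(counts)
    top_size = len(ranked) // n_classes
    cutoff = ranked[len(ranked) - top_size - 1]
    return [c > cutoff for c in counts]


def share_of_guns_in_hot(hot_flags, guns):
    """Fraction of gun-recovery blocks that the scheme shades as 'hot'."""
    return sum(hot_flags[i] for i in guns) / len(guns)


equal_share = share_of_guns_in_hot(hot_equal_interval(stops), gun_blocks)
quantile_share = share_of_guns_in_hot(hot_quantile(stops), gun_blocks)
print(f"equal-width shading: {equal_share:.0%} of gun blocks look hot")
print(f"quantile shading:    {quantile_share:.0%} of gun blocks look hot")
```

With these made-up numbers, equal-width bins put one of the three gun blocks in the hottest class, while quantile bins put all three there. The data never changed; only the binning did.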

So who’s right here?  Well, both of them.  And neither of them.  These pictures are just that: pictures.  Though they “use” data, they are not science. They are not analyses. They are mere visuals.

When data visualization is used simply to show alluring infographics about whether people like Coke or Pepsi better, the stakes of persuasion like this are low.  But when it is used to argue for or against public policy, the misuse of data visualization to persuade can have drastic consequences.  Data visualization without rigorous analysis is at best just rhetoric and, at worst, incredibly harmful.

“Data for Humans vs. Data for Machines”

The fundamental challenge underlying this inadvertently malicious use of data comes, I believe, from a vagueness in terminology.  When people crow about “the promise of data”, they are often describing two totally different activities under the same umbrella.  I’ve dubbed these two schools of thought “data for humans” vs. “data for machines”.

Data for Humans:  The most popular use of data, especially in the social sector, places all of the emphasis on the data itself as the savior.  The idea is that, if we could just show people more data, we could prove our impact, encourage funding, and change behavior.  Your bar charts, maps, and graphs pointing-up-and-to-the-right all fall squarely into this category.  In fact almost all data visualization falls here, relying on the premise that showing a decisionmaker some data about the past will be all it takes to drive future change.

Unfortunately, while I believe data is a necessary part of this advocacy work, it is never sufficient by itself.  The challenge with using “data for humans” is threefold:

  1. Humans don’t make decisions based on data, at least not alone.  Plato once said “Human behavior flows from three main sources: desire, emotion, and knowledge.”  I want to believe he listed those aspects in that order intentionally. Study after study has shown that humans rationalize beliefs with data, not vice versa.  If behavior change were driven by data and graphs alone, we would be 50 years into a united battle against climate change.  Conversely, we will leap to conclusions from data visualizations that “feel” right, but are not rigorously tested, like the conclusions from the Stop and Frisk images above.
  2. The public still treats data and data visualization as “fact” and “science”.  I believe the public has gained enough visual literacy to question the motives of photojournalists or documentary filmmakers, aware that there is an auteur behind the final piece who intends for us to walk away with their chosen understanding.  We have yet to bring that same skepticism to data visualization, though we need to. The result of this illiteracy is that we are less critical of graphs and charts than of written arguments, because the use of data gives the sense that “fact” or “science” is at work, even if what we’re doing is little more than visually bloviating.
  3. The data or visualization you see at the end of the road is opaque to interrogation.  It is difficult, if not impossible, to know where that “58%” statistic or that flashy bar graph, grinning up at you from the page, came from.  Because we don’t have ways to know how the data was collected, manipulated, and designed, we can’t answer any of the questions we might want to raise above. If point 2 means we need to treat data visualization as photojournalism, then this point implores us to go further and require forensic photographers in this work.

Data for Machines: For these reasons, DataKind specializes in projects focusing on what we refer to as “Data for Machines”.  The promise of abundant data is not that we can show people more data, but that we can take advantage of computers, algorithms, and rigorous statistical methodologies to learn from these new datasets.  The data is not the end goal, it is the raw resource we use to fuel computer systems that can learn from this information and, in many cases, even predict what is likely to happen in the future.

For example, instead of engaging in the Stop and Frisk gallery debate above, DataKind volunteers loaded the NYPD data into computers and created statistical models to rigorously test whether or not racial discrimination was occurring disproportionately in different parts of the city.  While the models needed further evaluation, this analysis shows how data should be used. People shouldn’t try to draw conclusions from pictures of data – we’re notoriously bad at that as humans – we should be building models and using scientific methods to learn from data.
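As a rough illustration of what "testing" rather than "looking" can mean, here is a minimal sketch in Python of a chi-square test of independence on a two-by-two table. It is not the model the DataKind volunteers built, which is not described here. The two areas, the outcome being counted, and every number in the table are invented for illustration, and a real analysis would need to account for far more than two columns of counts.

```python
import math

# Invented counts for two hypothetical areas: [stops with a frisk, stops without].
table = {
    "area_a": [120, 380],
    "area_b": [90, 410],
}


def chi_square_2x2(rows):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Expected count if area and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_one_df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


chi2 = chi_square_2x2(list(table.values()))
p = p_value_one_df(chi2)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```

The value of a test like this is that it states its assumptions and produces a number that others can check and dispute, which a shaded map does not.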

Celebrating Visualization

It should come as no surprise, then, that creating data visualization well means designing in a way that leads people to draw scientific conclusions themselves.

There are many examples of insightful, persuasive, and downright clever data visualizations, but perhaps one of the best visualization practices I know of is to turn the idea of visualization on its head.  Data visualization is incredibly good for allowing one to ask questions, not answer them.  The huge amount of data that we have available to us now means that we need visual techniques just to help us make sense of what we need to try to make sense of.

So where do we go from here?

First off, you can boycott the tyranny of pie charts and word clouds, rail against those three pitfalls, and share these last two examples far and wide. But I think we can also all go out and start thinking about how data can truly be used to its fullest advantage. Aside from just using “data for machines,” the best data visualization should raise questions and inspire exploration, not just sum up information or try to tell us the answer. Today we have more information than ever before and we have a new opportunity to use it to mobilize others, provided we do so with sensitivity.  Now, more than ever, we need to all be out there on the front lines looking beyond data visualization as merely a way to satisfy our funders’ requirements and instead looking at data as a way to ask deep questions of our world and our future.