The Art and Science of Data Visualization (2024)

A comprehensive guide on how to think about and create brilliant data visualizations.

Published in Towards Data Science · 31 min read · Oct 14, 2019

Data visualization — our working definition will be “the graphical display of data” — is one of those things like driving, cooking, or being fun at parties: everyone thinks they’re really great at it, because they’ve been doing it for a while. Particularly for those coming to data science from an engineering background, data visualizations are often seen as something trivial, to be rushed through to show stakeholders once the fun modelling has been finished.

Yet visualizations are often the main way complicated problems are explained to decision makers. For something so essential to so many people’s daily work, data visualization is rarely directly taught, instead being something new professionals are expected to learn via osmosis. But this isn’t the best approach. Data visualization is a skill like any other, and even experienced practitioners could benefit from honing their skills in the subject.

Hence this short lesson on the topic. This post is on the longer side, but it aims to give you a comprehensive grounding in the concepts underlying data visualizations, in a way that will make you better at your job.

At no point do I intend to teach you how to make a specific graphic in a specific software. I don’t know what software might be applicable to your needs in the future, or what visualizations you’ll need to formulate when — and quite frankly, Google exists — so this isn’t a cookbook with step-by-step instructions. The goal here is not to provide you with recipes for future use, but rather to teach you what flour is — to introduce you to the basic concepts and building blocks of effective data visualizations.

With that said, you can find the code (as three R Markdown files) to build this article on my personal GitHub.

The Mantras

As much as possible, I’ve collapsed those basic concepts into four mantras we’ll return to throughout this course. The mantras are:

  1. A good graphic tells a story.
  2. Everything should be made as simple as possible, but no simpler.
  3. Use the right tool for the job.
  4. Ink is cheap. Electrons are even cheaper.

Each mantra serves as the theme for a section, and will also be interwoven throughout. The theme of this first section is, easily enough:

A good graphic tells a story

When making a graphic, it is important to understand what the graphic is for. After all, you usually won’t make a chart that is a perfect depiction of your data — modern data sets tend to be too big (in terms of number of observations) and wide (in terms of number of variables) to depict every data point on a single graph. Instead, the analyst consciously chooses what elements to include in a visualization in order to identify patterns and trends in the data in the most effective manner possible. In order to make those decisions, it helps a little to think both about why and how graphics are made.

Why do we tell a story?

As far as the why question goes, the answer usually comes down to one of two larger categories:

  • To help identify patterns in a data set, or
  • To explain those patterns to a wider audience

These are the rationales behind creating what are known as, respectively, exploratory and explanatory graphics. Exploratory graphics are often very simple pictures of your data, built to identify patterns in your data that you might not know exist yet. Take for example a simple graphic, showing tree circumference as a function of age:

The Art and Science of Data Visualization (3)

This visualization isn’t anything too complex (two variables, thirty-five observations, not much text), but it already shows us a trend that exists in the data. We could use this information, if we were so inspired, to start investigating why tree growth changes with age, now that we’re broadly aware of how it changes.

Explanatory graphs, meanwhile, are all about the whys. Where an exploratory graphic focuses on identifying patterns in the first place, an explanatory graphic aims to explain why they happen and — in the best examples — what exactly the reader is to do about them. Explanatory graphics can exist on their own or in the context of a larger report, but their goals are the same: to provide evidence about why a pattern exists and provide a call to action. For instance, we can reimagine the same tree graph with a few edits in order to explain what patterns we’re seeing:

The Art and Science of Data Visualization (4)

I want to specifically call out the title here: “Orange tree growth tapers by year 4.” A good graphic tells a story, remember. As such, whatever title you give your graph should reflect the point of that story — titles such as “Tree diameter (cm) versus age (days)” and so on add nothing that the user can’t get from the graphic itself. Instead, use your title to advance your message whenever it makes sense — otherwise, if it doesn’t add any new information, you’re better off erasing it altogether.

The important takeaway here is not that explanatory graphics are necessarily more polished than exploratory ones, or that exploratory graphics are only for the analyst — periodic reporting, for instance, will often use highly polished exploratory graphics to identify existing trends, hoping to spur more intensive analysis that will identify the whys. Instead, the message is that knowing the end purpose of your graph — whether it should help identify patterns in the first place or explain how they got there — can help you decide what elements need to be included to tell the story your graphic is designed to address.

How do we tell a story?

The other important consideration when thinking about graph design is how you’ll actually tell your story, including what design elements you’ll use and what data you’ll display. My preferred paradigm when deciding between the possible “hows” is to weigh the expressiveness and effectiveness of the resulting graphic, as defined by Jeffrey Heer at the University of Washington. Heer writes:

  • Expressiveness: A set of facts is expressible in a visual language if the sentences (i.e. the visualizations) in the language express all the facts in the set of data, and only the facts in the data.
  • Effectiveness: A visualization is more effective than another visualization if the information conveyed by one visualization is more readily perceived than the information in the other visualization.

Or, to simplify:

  1. Tell the truth and nothing but the truth (don’t lie, and don’t lie by omission)
  2. Use encodings that people decode better (where better = faster and/or more accurate)

Keep this concept in the back of your mind as we move into our mechanics section — it should be your main consideration while deciding which elements you use! We’ll keep returning to these ideas of explanatory and exploratory, as well as expressiveness and effectiveness, throughout the other two sections.

Let’s move from theoretical considerations of graphing to the actual building blocks you have at your disposal. As we do so, we’re also going to move on to mantra #2:

Everything should be made as simple as possible — but no simpler.

Graphs are inherently a 2D image of our data:

The Art and Science of Data Visualization (5)

They have an x and a y scale, and — as in our scatter plot here — the position a point falls along each scale tells you how large its values are. But this setup only allows us to look at two variables in our data — and we’re frequently interested in seeing relationships between more than two variables.

So the question becomes: how can we visualize those extra variables? We can try adding another position scale:

The Art and Science of Data Visualization (6)

But 3D images are hard to wrap your head around, complicated to produce, and not as effective in delivering your message. They do have their uses — particularly when you’re able to build real, physical 3D models, and not just make 3D shapes on 2D planes — but frequently aren’t worth the trouble.

So what tools do we have in our toolbox? The ones that are generally agreed upon (no, really — this is an area of active debate) fall into four categories:

  • Position (like we already have with X and Y)
  • Color
  • Shape
  • Size

These are the tools we can use to encode more information into our graphics. We’re going to call these aesthetics, but any number of other words could work — some people refer to them as scales, some as values. I call them aesthetics because that’s what my language of choice (R, using ggplot2) calls them — but the word itself comes from the fact that these are the things that change how your graph looks.

For what it’s worth, we’re using an EPA data set for this unit, representing fuel economy data from 1999 and 2008 for 38 popular models of car. “Hwy” is highway mileage, “displ” is engine displacement (so volume), and “cty” is city mileage. But frankly, our data set doesn’t matter right now — most of our discussion here is applicable to any data set you’ll pick up.

We’re going to go through each of these aesthetics, to talk about how you can encode more information in each of your graphics. Along the way, remember our mantras:

  1. A good graphic tells a story
  2. Everything should be made as simple as possible — but no simpler
  3. Use the right tool for the job
  4. Ink is cheap. Electrons are even cheaper

We’ll talk about how these are applicable throughout this section.

Position

Let’s start off discussing these aesthetics by finishing up talking about position. The distance of values along the x, y, or — in the case of our 3D graphic — z axes represents how large a particular variable is. People inherently understand that values further out on each axis are more extreme — for instance, imagine you came across the following graphic (made with simulated data):

The Art and Science of Data Visualization (7)

Which values do you think are higher?

Most people innately assume that the bottom left-hand corner represents a 0 on both axes, and that the further you get from that corner the higher the values are. This relatively obvious revelation hints at a much more important concept in data visualizations: perceptual topology should match data topology. Put another way, values which feel larger in a graph should represent values that are larger in your data. As such, when working with position, higher values should be the ones further away from that lower left-hand corner; you should let your viewer’s subconscious assumptions do the heavy lifting for you.

Applying this advice to categorical data can get a little tricky. Imagine that we’re looking at the average highway mileages for manufacturers of the cars in our data set:

The Art and Science of Data Visualization (8)

In this case, the position along the x axis just represents a different car maker, in alphabetical order. But remember, position in a graph is an aesthetic that we can use to encode more information in our graphics. And we aren’t doing that here — for instance, we could show the same information without using x position at all:

The Art and Science of Data Visualization (9)

Try to compare Pontiac and Hyundai on the first graph, versus on this second one. If anything, removing our extraneous x aesthetic has made it easier to compare manufacturers. This is a big driver behind our second mantra — that everything should be made as simple as possible, but no simpler. Having extra aesthetics confuses a graph, making it harder to understand the story it’s trying to tell.

However, when making a graphic, we should always be aiming to make important comparisons easy. As such, we should take advantage of our x aesthetic by arranging our manufacturers not alphabetically, but rather by their average highway mileage:

The Art and Science of Data Visualization (10)

By reordering our graphic, we’re now able to better compare more similar manufacturers. It’s now much faster to understand our visualization: closer comparisons are easier to make, so placing similar values closer together makes them far easier to grasp. Look at Pontiac vs Hyundai now, for instance. Generally speaking, don’t put things in alphabetical order; use the order in which you place things to encode additional information.
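As a rough sketch of that reordering step in Python (this is not the R code behind the figures, and the mileage numbers are invented for illustration), the idea is simply to sort categories by the value being plotted instead of by name. The helper name `mean_by_category` is my own.

```python
from collections import defaultdict

# Hypothetical (manufacturer, highway mpg) records, invented for illustration;
# these are not the values from the EPA data set used in the article.
records = [
    ("audi", 27), ("audi", 25), ("dodge", 17), ("dodge", 19),
    ("honda", 33), ("honda", 32), ("toyota", 26), ("toyota", 24),
]

def mean_by_category(rows):
    """Return {category: mean value} for an iterable of (category, value) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for category, value in rows:
        totals[category][0] += value
        totals[category][1] += 1
    return {cat: total / count for cat, (total, count) in totals.items()}

means = mean_by_category(records)

# Alphabetical order encodes nothing about the data.
alphabetical = sorted(means)

# Ordering by the plotted value puts similar categories next to each other,
# with the highest value first.
by_value = sorted(means, key=means.get, reverse=True)

print(alphabetical)  # ['audi', 'dodge', 'honda', 'toyota']
print(by_value)      # ['honda', 'audi', 'toyota', 'dodge']
```

Whatever plotting tool you use, passing it the categories in the `by_value` order is usually all it takes to get the sorted version of the chart.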

As a quick side note, I personally believe that, when working with categorical values along the X axis, you should reorder your values so the highest value comes first. For some reason, I just find having the tallest bar/highest point (or whatever is being used to show value) next to the Y axis line is much cleaner looking than the alternative:

The Art and Science of Data Visualization (11)

For what it’s worth, I’m somewhat less dogmatic about this when the categories are on the Y axis. I still believe the highest value should be at the top, as humans expect higher values to be further from that bottom left corner:

The Art and Science of Data Visualization (12)

However, I’m not as instantly repulsed by the opposite ordering as I am with the X axis, likely because the bottom bar/point being the furthest looks like a more natural shape, and is still along the X axis line:

The Art and Science of Data Visualization (13)

For this, at least, your mileage may vary. Also, it’s worth pointing out how much cleaner the labels on this graph are when they’re on the Y axis — flipping your coordinate system, like we’ve done here, is a good way to display data when you’ve got an unwieldy number of categories.

Color

While we’ve done a good job covering the role position plays in communicating information, we’re still stuck on the same question we started off with: How can we show a third variable on the graph?

One of the most popular ways is to use colors to represent your third variable. It might be worth talking through how color can be used with a simulated data set. Take for example the following graph:

The Art and Science of Data Visualization (14)

And now let’s add color for our third variable:

The Art and Science of Data Visualization (15)

Remember: perceptual topology should match data topology. Which values are larger?

Most people would say the darker ones. But is it always that simple? Let’s change our color scale to compare:

The Art and Science of Data Visualization (16)

Sure, some of these colors are darker than others — but I wouldn’t say any of them tell me a value is particularly high or low.

That’s because humans don’t perceive hue (the actual shade of a color) as an ordered value. The color of a point doesn’t communicate that the point has a higher or lower value than any other point on the graph. Instead, hue works as an unordered value, which only tells us which points belong to which groupings. In order to tell how high or low a point’s value is, we instead have to use luminance, or how bright or dark the individual point is.
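To make the ordered-versus-unordered distinction a little more concrete, here is a minimal Python sketch using only the standard library. It rests on two assumptions of mine: the HLS color model from `colorsys` stands in for the color space a plotting library would actually use, and the standard sRGB relative luminance formula stands in for "how light or dark a color looks". Neither comes from the original figures.

```python
import colorsys

def relative_luminance(rgb):
    """Relative luminance of an sRGB color whose channels are in [0, 1]."""
    def linearize(c):
        # Undo the sRGB gamma curve before weighting the channels.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

# One hue (blue) at five lightness steps: a candidate scale for ordered data.
single_hue_ramp = [
    colorsys.hls_to_rgb(2 / 3, lightness, 1.0)
    for lightness in (0.2, 0.35, 0.5, 0.65, 0.8)
]

# Six different hues at the same HLS lightness: a candidate scale for groups.
hue_sweep = [colorsys.hls_to_rgb(i / 6, 0.5, 1.0) for i in range(6)]

ramp_lum = [relative_luminance(c) for c in single_hue_ramp]
sweep_lum = [relative_luminance(c) for c in hue_sweep]

print([round(v, 3) for v in ramp_lum])   # rises steadily from dark to light
print([round(v, 3) for v in sweep_lum])  # jumps up and down with no order
```

The single-hue ramp gets steadily lighter, so it can carry "more" and "less". The hue sweep has no such direction, which is why it only works for telling groups apart.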

There’s one other axis you can move colors along in order to encode value — how vibrant a color is, known as chroma:

The Art and Science of Data Visualization (17)

Just keep in mind that luminance and chroma (how light a color is and how vibrant it is) are ordered values, while hue (or shade of color) is unordered. This becomes relevant when dealing with categorical data. For instance, moving back to the scatter plot we started with:

The Art and Science of Data Visualization (18)

If we wanted to encode a categorical variable in this — for instance, the class of vehicle — we could use hue to distinguish the different types of cars from one another:

The Art and Science of Data Visualization (19)

In this case, using hue to distinguish our variables clearly makes more sense than using either chroma or luminance:

The Art and Science of Data Visualization (20)

This is a case of knowing what tool to use for the job: chroma and luminance will clearly imply certain variables are closer together than is appropriate for categorical data, while hue won’t give your audience any helpful information about an ordered variable. Note, though, that I’d still discourage using the rainbow to distinguish categories in your graphics. The colors of the rainbow aren’t exactly unordered values (for instance, red and orange are much more similar colors than yellow and blue), and you’ll wind up implying connections between your categories that you might not want to suggest. Also, the rainbow is just really ugly:

The Art and Science of Data Visualization (21)

Speaking of using the right tool for the job, one of the worst things people like to do in data visualizations is overuse color. Take for instance the following example:

The Art and Science of Data Visualization (22)

In this graph, the variable “class” is being represented by both position along the x axis, and by color. By duplicating this effort, we’re making our graph harder to understand — encoding the information once is enough, and doing it any more times than that is a distraction. Remember the second mantra: Everything should be made as simple as possible — but no simpler. The best data visualization is one that includes all the elements needed to deliver the message, and no more.

You can feel free to use color in your graphics, so long as it adds more information to the plot — for instance, if it’s encoding a third variable:

The Art and Science of Data Visualization (23)

But replicating as we did above is just adding more junk to your chart.

There’s one last way you can use color effectively in your plot, and that’s to highlight points with certain characteristics:

The Art and Science of Data Visualization (24)

Doing so allows the viewer to quickly pick out the most important sections of our graph, increasing its effectiveness. Note that I used shape instead of color to separate the class of vehicles, by the way — combining point highlighting and using color to distinguish categorical variables can work, but can also get somewhat chaotic:

The Art and Science of Data Visualization (25)

There’s one other reason color is a tricky aesthetic to get right in your graphics: roughly 1 in 12 men and 1 in 200 women have some form of color vision deficiency and can’t reliably tell certain colors apart. That means you should be careful when using color in your visualizations: use colorblind-safe color palettes (check out “ColorBrewer” or “viridis” for more on these), and pair it with another aesthetic whenever possible.

Shape

The easiest aesthetic to pair color with is the next most frequently used — shape. This one is much more intuitive than color — to demonstrate, let’s go back to our scatter plot:

The Art and Science of Data Visualization (26)

We can now change the shape of each point based on what class of vehicle it represents:

The Art and Science of Data Visualization (27)

Imagine we were doing the same exercise as we did with color earlier — which values are larger?

I’ve spoiled the answer already by telling you what the shapes represent — none of them are inherently larger than the others. Shape, like hue, is an unordered value.

The same basic concepts apply when we change the shape of lines, not just points. For instance, if we plot separate trend lines for front-wheel, rear-wheel, and four-wheel drive cars, we can use line type to represent each type of vehicle:

The Art and Science of Data Visualization (28)

But even here, no one line type implies a higher or lower value than the others.

There are two caveats to be made to this rule, however. For instance, if we go back to our original scatter plot and change which shapes we’re using:

The Art and Science of Data Visualization (29)

This graph seems to imply more connection between the first three classes of car (which are all different types of diamonds) and the next three classes (which are all types of triangle), while singling out SUVs. In this way, we’re able to use shape to imply connection between our groupings — more similar shapes, which differ only in angle or texture, imply a closer relationship to one another than to other types of shape. This can be a blessing as well as a curse — if you pick, for example, a square and a diamond to represent two unrelated groupings, your audience might accidentally read more into the relationship than you had meant to imply.

It’s also worth noting that different shapes can pretty quickly clutter up a graph. As a general rule of thumb, using more than 3–4 shapes on a graph is a bad idea, and more than 6 means you need to do some thinking about what you actually want people to take away.

Size

Our last aesthetic is that of size. Going back to our original scatter plot, we could imagine using size like this:

The Art and Science of Data Visualization (30)

Size is an inherently ordered value — large size points imply larger values. Specifically, humans perceive larger areas as corresponding to larger values — the points which are three times larger in the above graph are about three times larger in value, as well.

This becomes tricky when size is used incorrectly, either by mistake or to distort the data. Sometimes an analyst maps the variable to the radius of the point rather than its area, resulting in graphs like the one below:

The Art and Science of Data Visualization (31)

In this example, the points representing a cty value of 10 don’t look anything close to 1/3 as large as the points representing 30. This makes the increase seem much steeper upon looking at this chart — so be careful when working with size as an aesthetic that your software is using the area of points, not radius!
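The arithmetic behind that warning is easy to check. Here is a small Python sketch comparing the two mappings; the function names and the `area_per_unit` and `radius_per_unit` parameters are my own inventions rather than any plotting library's API.

```python
import math

def radius_from_area_scale(value, area_per_unit=1.0):
    """Radius of a point whose AREA is proportional to value."""
    return math.sqrt(value * area_per_unit / math.pi)

def radius_from_radius_scale(value, radius_per_unit=1.0):
    """Radius of a point whose RADIUS is proportional to value (the mistake)."""
    return value * radius_per_unit

def area(radius):
    """Area of a circle with the given radius."""
    return math.pi * radius ** 2

low, high = 10, 30  # for example, city mileage values of 10 and 30

correct_ratio = area(radius_from_area_scale(high)) / area(radius_from_area_scale(low))
distorted_ratio = area(radius_from_radius_scale(high)) / area(radius_from_radius_scale(low))

print(round(correct_ratio, 2))    # 3.0
print(round(distorted_ratio, 2))  # 9.0
```

Because area grows with the square of the radius, a value three times larger ends up drawn nine times larger when the radius is what gets scaled.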

It’s also worth noting that unlike color — which can be used to distinguish groupings, as well as represent an ordered value — it’s generally a bad idea to use size for a categorical variable. For instance, if we mapped point size to class of vehicle:

The Art and Science of Data Visualization (32)

We seem to be implying relationships here that don’t actually exist, like a minivan and midsize vehicle being basically the same. As a result, it’s best to only use size for continuous (or numeric) data.

A Tangent

Now that we’ve gone over these four aesthetics, I want to go on a quick tangent. When it comes to how quickly and easily humans perceive each of these aesthetics, research has settled on the following order:

  1. Position
  2. Size
  3. Color (especially chroma and luminance)
  4. Shape

And as we’ve discussed repeatedly, the best data visualization is one that includes exactly as many elements as it takes to deliver a message, and no more. Everything should be made as simple as possible, but no simpler.

However, we live in a world of humans, where the scientifically most effective method is not always the most popular one. And since color is inherently more exciting than size as an aesthetic, the practitioner often finds themselves using colors to denote values where size would have sufficed. And since we know that color should usually be used alongside shape in order to be more inclusive in our visualizations, size often winds up being the last aesthetic used in a chart. This is fine; sometimes we have to optimize for things other than “how quickly can someone understand my chart”, such as “how attractive does my chart look” or “what does my boss want from me”. But it’s worth noting, in case you see contradictory advice in the future: the disagreement comes down to whether your source is teaching the most scientifically sound theory or the most applicable practice.

Let’s transition away from aesthetics, and towards our third mantra:

Use the right tool for the job.

Think back to our first chart:

The Art and Science of Data Visualization (33)

As you already know, this is a scatter plot — also known as a point graph. Now say we added a line of best fit to it:

The Art and Science of Data Visualization (34)

This didn’t stop being a scatter plot once we drew a line on it — but the term scatter plot no longer really encompasses everything that’s going on here. It’s also obviously not a line chart, as even though there’s a line on it, it also has points.

Rather than quibble about what type of chart this is, it’s more helpful to describe what tools we’ve used to depict our data. We refer to these as geoms, short for geometries — because when you get really deep into things, these are geometric representations of how your data set is distributed along the x and y axes of your graph. I don’t want to get too far down that road — I just want to explain the vocabulary so that we aren’t talking about what type of chart that is, but rather what geoms it uses. Framing things that way makes it easier to understand how things can be combined and reformatted, rather than assuming each type of chart can only do one thing.

Two continuous variables

This chart uses two geoms that are really good for graphs that have a continuous y and a continuous x — points and lines. This is what people refer to most of the time when they say a line graph — a single smooth trend line that shows a pattern in the data. However, a line graph can also mean a chart where each point is connected in turn:

The Art and Science of Data Visualization (35)

It’s important to be clear about which type of chart you’re expected to produce! I always refer to the former as a trend line, for clarity.

These types of charts have enormous value for quick exploratory graphics, showing how various combinations of variables interact with one another. For instance, many analysts start familiarizing themselves with new data sets using scatter plot matrices (sometimes loosely called correlation matrices), which create a grid of scatter plots, one for each pair of variables:

The Art and Science of Data Visualization (36)

In this format, understanding interactions between your data is quick and easy, with certain variable interactions obviously jumping out as promising avenues for further exploration.

To back up just a little, there’s one major failing of scatter plots that I want to highlight before moving on. If you happen to have more than one point with the same x and y values, a scatter plot will just draw each point over the previous one, making it seem like you have less data than you actually do. Adding a little bit of random noise to your values (for instance, using RAND() in Excel) can help show the actual densities of your data, especially when you’re dealing with numbers that haven’t been measured as precisely as they could have been.

The Art and Science of Data Visualization (37)
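If you aren’t working in Excel, the same trick is only a few lines in most languages. Here is a minimal Python sketch; the `jitter` helper, its `width` and `seed` parameters, and the sample values are all made up for illustration.

```python
import random

def jitter(values, width, seed=0):
    """Add uniform noise in [-width, +width] to each value."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical measurements rounded to whole numbers, so many points coincide.
x = [2, 2, 2, 3, 3, 4, 4, 4, 4]
y = [20, 20, 20, 25, 25, 30, 30, 30, 30]

raw_points = set(zip(x, y))
jittered_points = set(zip(jitter(x, 0.2, seed=1), jitter(y, 0.5, seed=2)))

print(len(raw_points))       # 3 distinct positions for 9 observations
print(len(jittered_points))  # one distinct position per observation
```

Keep the noise small relative to the gaps between real values. The goal is to reveal how many points are stacked up, not to move them somewhere misleading.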

One last chart that does well with two continuous variables is the area chart, which resembles a line chart but fills in the area beneath the line:

The Art and Science of Data Visualization (38)

Area plots make sense when 0 is a relevant number to your data set — that is, a 0 value wouldn’t be particularly unexpected. They’re also frequently used when you have multiple groupings and care about their total sum:

The Art and Science of Data Visualization (39)

(This new data set is the “diamonds” data set, which records the size, quality, cut, and sale price of roughly 54,000 diamonds. We’ll be going back and forth between it and the EPA data set from now on.)

Now one drawback of stacked area charts is that it can be very hard to estimate how any individual grouping shifts along the x axis, due to the cumulative effects of all the groups underneath it. For instance, there are actually fewer “fair” diamonds at 0.25 carats than at 1.0, but because “ideal” and “premium” spike so much, your audience might draw the wrong conclusions. In situations where the total matters more than the groupings, this is alright; otherwise, it’s worth looking at other types of charts.
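A tiny numeric example shows how the stacking itself creates this problem. The Python sketch below uses invented counts (not the diamonds data) and a hypothetical `stack` helper that computes where the top edge of each band would be drawn.

```python
def stack(series_by_group, order):
    """Return the upper boundary of each group's band in a stacked chart."""
    tops = {}
    running = [0] * len(series_by_group[order[0]])
    for group in order:
        # Each band sits on top of everything drawn before it.
        running = [base + value for base, value in zip(running, series_by_group[group])]
        tops[group] = running
    return tops

# Hypothetical counts at three x positions; not taken from the diamonds data.
counts = {
    "ideal": [100, 400, 900],
    "premium": [80, 300, 700],
    "fair": [50, 40, 30],  # this group shrinks from left to right
}

tops = stack(counts, order=["ideal", "premium", "fair"])

print(counts["fair"])  # [50, 40, 30]
print(tops["fair"])    # [230, 740, 1630]
```

The “fair” counts fall at every step, yet the line your audience sees for that band climbs steeply, because it inherits the growth of the groups underneath it.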

One continuous variable

If instead you’re looking to see how a single continuous variable is distributed throughout your data set, one of the best tools at your disposal is the histogram. A histogram counts how many observations in your data set fall into each range of a continuous variable, and plots those counts as bars:

The Art and Science of Data Visualization (40)

One important flag to raise with histograms is that you need to pay attention to how your data is being binned. If you haven’t picked the right width for your bins, you might risk missing peaks and valleys in your data set, and might misunderstand how your data is distributed — for instance, look what shifts if we graph 500 bins, instead of the 30 we used above:

The Art and Science of Data Visualization (41)
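To see why the number of bins matters so much, it helps to do the binning by hand once. This is a minimal Python sketch with made-up values and a hypothetical `bin_counts` helper; real plotting libraries handle edge cases more carefully than this does.

```python
def bin_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical values with two clusters, one near 1.0 and one near 2.0.
values = [0.9, 1.0, 1.0, 1.1, 1.0, 2.0, 2.1, 1.9, 2.0, 2.0]

print(bin_counts(values, 1))  # [10]
print(bin_counts(values, 3))  # [5, 0, 5]
```

With a single bin the two clusters merge into one bar, while three bins expose the empty gap between them. Trying a few bin counts before settling on one is a cheap way to avoid missing structure like this.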

An alternative to the histogram is the frequency plot, which uses a line chart in the place of bars to represent the frequency of a value in your dataset:

The Art and Science of Data Visualization (42)

Again, however, you have to pay attention to how wide your data bins are with these charts — you might accidentally smooth over major patterns in your data if you aren’t careful!

The Art and Science of Data Visualization (43)

One large advantage of the frequency chart over the histogram is how it deals with multiple groupings — if your groupings trade dominance at different levels of your variable, the frequency graph will make it much more obvious how they shift than a histogram will.

(Note that I’ve done something weird to the data in order to show how the distributions change below.)

The Art and Science of Data Visualization (44)

One categorical variable, one continuous

If you want to compare a categorical and continuous variable, you’re usually stuck with some form of bar chart:

The Art and Science of Data Visualization (45)

The bar chart is possibly the least exciting type of graph in existence, mostly because of how prevalent it is — but that’s because it’s really good at what it does. Bar charts are one of the most easily interpreted and effective types of visualizations, no matter how exciting they are.

However, some people are really intent on ruining that. Take, for instance, the stacked bar chart, often used to add a third variable to the mix:

The Art and Science of Data Visualization (46)

Compare Fair/G to Premium/G. It’s next to impossible to accurately compare the boxes — they don’t share a top or a bottom line, so you can’t really make a comparison. In these situations, it’s a better idea to use a dodged bar chart instead:

The Art and Science of Data Visualization (47)

Dodged bar charts are usually a better choice for comparing the actual numbers of different groupings. However, this chart does a good job showing one of the limitations dodged bar charts come up against: once you get past 4 or 5 groupings, making comparisons is tricky. In these cases, you’re probably trying to apply the wrong chart for the job, and should consider either breaking your chart up into smaller ones (remember, ink is cheap, and electrons are even cheaper) or replacing your bars with a few lines.

The one place where stacked bar charts are appropriate, however, is when you’re comparing the relative proportions of two different groups in each bar. For instance, take the following graph:

The Art and Science of Data Visualization (48)

In this case, making comparisons across groups is trivial, made simple by the fact that the groupings all share a common line — at 100% for group 1, and at 0% for group 2. This point of reference solves the issue we had with more than two groupings — though note we’d still prefer a dodged bar chart if the bars didn’t always sum to the same amount.
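Getting from raw counts to this kind of chart means rescaling every bar so it sums to 100%. Here is a rough Python sketch of that step, with invented counts and a hypothetical `to_proportions` helper that assumes every bar has a nonzero total.

```python
def to_proportions(bars):
    """Rescale the group counts in each bar so that every bar sums to 1.0."""
    result = {}
    for bar, groups in bars.items():
        total = sum(groups.values())  # assumed to be nonzero
        result[bar] = {group: count / total for group, count in groups.items()}
    return result

# Hypothetical counts for two groups in three bars with very different totals.
bars = {
    "A": {"group 1": 30, "group 2": 70},
    "B": {"group 1": 5, "group 2": 15},
    "C": {"group 1": 300, "group 2": 100},
}

proportions = to_proportions(bars)

for bar, groups in proportions.items():
    print(bar, groups)
```

After rescaling, bars B and C are directly comparable even though one holds 20 observations and the other 400. The totals themselves disappear from the chart, though, so this only makes sense when the proportions are the story.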

A Quick Tangent

This is usually where most people will go on a super long rant about pie charts and how bad they are. They’re wrong, but in an understandable way.

People love to hate on pie charts, because they’re almost universally a bad chart. However, if it’s important for your viewer to be able to quickly figure out what proportion two or more groupings make up of the whole, a pie chart is actually the fastest and most effective way to get the point across. For instance, compare the following pie and bar charts, made with the same data set:

The Art and Science of Data Visualization (49)
The Art and Science of Data Visualization (50)

It’s a lot easier to tell that, say, A is smaller than C through F in the pie chart than in the bar plot, since humans are better at summing angles than areas. In these instances, feel free to use a pie chart, and tell anyone giving you flak that I said it was OK.

Two categorical variables

Our last combination is when you’re looking to have a categorical variable on both the x and y axes. These are trickier plots to think about, as we no longer encode value in position based on how far away a point is from the lower left-hand corner, but rather have to get creative in effectively using position to encode a value. Remember that a geom is a geometric representation of how your data set is distributed along the x and y axes of your graph. When both of your axes are categorical, you have to get creative to show that distribution.

One method is to use density, as we would in a scatter plot, to show how many data points you have falling into each combination of categories graphed. You can do this by making a “point cloud” chart, where more dense clouds represent more common combinations:

The Art and Science of Data Visualization (51)

Even without a single number on this chart, its message is clear — we can tell how our diamonds are distributed with a single glance. A similar way to do this is to use a heat map, where differently colored cells represent a range of values:

The Art and Science of Data Visualization (52)

I personally think heat maps are less effective — partially because by using the color aesthetic to encode this value, you can’t use it for anything else — but they’re often easier to make with the resources at hand.
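Both the point cloud and the heat map rest on the same tally: how many rows fall into each combination of the two categories. A rough sketch of that tally, using made-up rows rather than the real diamonds data, might look like this:

```python
from collections import Counter

# Hypothetical (cut, color) pairs standing in for rows of a data set.
rows = [
    ("Ideal", "G"), ("Ideal", "G"), ("Ideal", "E"),
    ("Premium", "G"), ("Fair", "E"), ("Ideal", "G"),
]

# Count how often each combination of categories occurs.
cell_counts = Counter(rows)

cuts = sorted({cut for cut, _ in rows})
colors = sorted({color for _, color in rows})

# One row per cut, one column per color; combinations never seen count as 0.
grid = [[cell_counts[(cut, color)] for color in colors] for cut in cuts]

for cut, counts in zip(cuts, grid):
    print(cut, counts)
```

A heat map would color each cell of `grid` by its count, while a point cloud would draw that many jittered points in the cell.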

As we move into our final section, it’s time to dwell on our final mantra:

Ink is cheap. Electrons are even cheaper.

Dealing with big data sets

Think back to the diamonds data set we used in the last section. It contains data on 54,000 individual diamonds, including the carat and sale price for each. If we wanted to compare those two continuous variables, we might think a scatter plot would be a good way to do so:

The Art and Science of Data Visualization (53)

Unfortunately, it seems like 54,000 points is a few too many for this plot to do us much good! This is a clear case of what’s called overplotting — we simply have too much data on a single graph.

There are three real solutions to this problem. First off, we could decide simply that we want to refactor our chart, and instead show how a metric — such as average sale price — changes at different carats, rather than how our data is distributed:

The Art and Science of Data Visualization (54)

There are all sorts of ways we can do this sort of refactoring — if we wanted, we could get a very similar graph by binning our data and making a bar plot:

The Art and Science of Data Visualization (55)

Either way, though, we’re not truly showing the same thing as was in the original graph — we don’t have any indication of how our data is distributed along both these axes.
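As a sketch of what this kind of refactoring does under the hood, the snippet below groups points into fixed-width bins and averages within each bin. The function `mean_by_bin`, the bin width, and the sample points are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def mean_by_bin(points, bin_width):
    """Group (x, y) pairs into fixed-width x bins and average y in each bin.

    Returns a dict mapping the bin index to the mean y value; the bin with
    index i covers x values from i * bin_width up to (i + 1) * bin_width.
    """
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for x, y in points:
        index = math.floor(x / bin_width)
        totals[index] += y
        sizes[index] += 1
    return {index: totals[index] / sizes[index] for index in sorted(totals)}


# Hypothetical (carat, price) pairs, not taken from the diamonds data set.
points = [(0.3, 400), (0.4, 600), (0.7, 2000), (0.9, 3000), (1.2, 6000)]

averages = mean_by_bin(points, bin_width=0.5)
```

Note what gets thrown away: five points become three averages, and nothing in `averages` says how many points sat in each bin or how spread out they were.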

The second solution solves this problem much more effectively — make all your points semi-transparent:

The Art and Science of Data Visualization (56)

By doing this, we’re now able to see areas where our data is much more densely distributed, something that was lost in the summary statistics. For instance, it appears that low-carat diamonds are much more tightly grouped than higher-carat ones. We can also see some dark stripes at “round-number” values for carat, which suggests to me that our data has some integrity issues if appraisers are more likely to give a stone a rounded number.
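Why does transparency reveal density? Assuming ordinary alpha blending, where each point lets a fixed share of whatever is beneath it show through, the combined opacity of a stack of identical points grows with the number of points. A small sketch (the function name is my own):

```python
def stacked_opacity(alpha, n_points):
    """Combined opacity of n identical points drawn on top of each other.

    Assumes standard alpha blending: each point with opacity `alpha` lets
    (1 - alpha) of what lies beneath it show through.
    """
    return 1 - (1 - alpha) ** n_points


# With a hypothetical per-point opacity of 10%, sparse areas stay faint
# while crowded areas approach solid ink.
for n in (1, 5, 10, 50):
    print(n, round(stacked_opacity(0.1, n), 3))
```

The curve also flattens out, so once an area is crowded enough it reads as solid no matter how many more points land there. Lower opacities push that saturation point further away.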

The challenge with this approach comes when we want to map a third variable — let’s use cut — in our graphic. We can try to change the aesthetics of our graph as usual:

The Art and Science of Data Visualization (57)

But unfortunately the sheer number of points drowns out most of the variance in color and shape on the graphic. In this case, our best option may be to facet our plots — that is, to split our one large plot into several small multiples:

The Art and Science of Data Visualization (58)

Ink is cheap. Electrons are even cheaper. Make more than one graph.

By splitting out our data into several smaller graphics, we’re much better able to see how the distribution shifts between our categories. In fact, we could use this technique to split our data even further, into a matrix of scatter plots showing how different groups are distributed:

The Art and Science of Data Visualization (59)
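Faceting itself is a simple operation: split the rows by a category, then draw one panel per group. The sketch below shows that split, plus one design choice I am assuming here, which is that all panels share the same axis limits so they stay comparable. The function names and records are hypothetical.

```python
from collections import defaultdict


def facet(records, key):
    """Split records into one list per distinct value of `key`."""
    panels = defaultdict(list)
    for record in records:
        panels[record[key]].append(record)
    return dict(panels)


def shared_limits(records, field):
    """Axis limits computed over ALL records, so every panel uses the same scale."""
    values = [record[field] for record in records]
    return min(values), max(values)


# Hypothetical rows, not taken from the diamonds data set.
records = [
    {"cut": "Ideal", "carat": 0.3, "price": 500},
    {"cut": "Ideal", "carat": 1.0, "price": 5000},
    {"cut": "Fair", "carat": 0.9, "price": 2800},
    {"cut": "Premium", "carat": 1.5, "price": 9000},
]

panels = facet(records, "cut")
x_limits = shared_limits(records, "carat")
y_limits = shared_limits(records, "price")
```

Each entry in `panels` would become one small multiple, drawn with `x_limits` and `y_limits` rather than limits fitted to that panel alone.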

One last, extremely helpful use of faceting is to split apart charts with multiple entangled lines:

The Art and Science of Data Visualization (60)

These charts, commonly referred to as “spaghetti charts”, are usually much easier to use when split into small multiples:

The Art and Science of Data Visualization (61)

Now, one major drawback of facet charts is that they can make comparisons much harder. If, in our line chart, it’s more important to know that most clarities are close in price at 2 carats than it is to know how the price for each clarity changes with carat, then the first chart is likely the more effective option. In those cases, however, it’s worth reassessing how many lines you actually need on your graph: if you only care about a few clarities, then only include those lines. The goal is to make important comparisons easy, with the understanding that some comparisons are more important than others.

Cast your mind back to the graphic I used as an example of an explanatory chart:

The Art and Science of Data Visualization (62)

You might have noticed that this chart is differently styled from all the others in this lesson: it doesn’t have the grey background or grid lines or anything else.

Remember our second mantra: everything should be made as simple as possible, but no simpler. This chart reflects that goal. We’ve lost some of the distracting elements — the colored background and grid lines — and changed the other elements to make the overall graphic more effective. The objective is to have no extraneous element on the graph, so that it might be as expressive and effective as possible. This usually means using minimal colors, minimal text, and no grid lines. (After all, those lines are usually only useful in order to pick out a specific value — and if you’re expecting people to need specific values, you should give them a table!)

Those extraneous elements are known as chartjunk. You see this a lot with graphs made in Excel — they’ll have dark backgrounds, dark lines, special shading effects or gradients that don’t encode information, or — worst of all — those “3D” bar/line/pie charts, because these things can be added with a single click. However, they tend to make your graphics less effective as they force the user to spend more time separating data from ornamentation. Everything should be made as simple as possible, but no simpler — so don’t try to pretty up your graph with non-useful elements.

Another common instance of chartjunk is animation in graphics. While animated graphics are exciting and trendy, they tend to reduce the effectiveness of your graphics because as humans, when something is moving we can’t focus on anything else. Check out these examples from the Harvard Vision Lab — they show just how hard it is to notice changes when animation is added. This isn’t to say you can never use animation — but its uses are best kept to times when your graphic looking cool is more important than it conveying information.

As we wind down this lesson, I want to finish by touching on a few common mistakes that didn’t have a great home in any other section, mostly because we were too busy talking about good design principles.

Dual y axes

Chief among these mistakes are plots with two y axes, beloved by charlatans and financial advisors since days unwritten. Plots with two y axes are a great way to force a correlation that doesn’t really exist into existence on your chart. In almost every case, you should just make two graphs — ink is cheap. Electrons are even cheaper. The precise reasons are outside of the scope of this lesson, so check out this link for an extremely entertaining read on the subject. I’ve borrowed Kieran’s code for the below viz — look at how we can imply different things, just by changing how we scale our axes!

The Art and Science of Data Visualization (63)
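The trick behind a dual-axis plot is that where a line sits on the page depends only on its value relative to its own axis limits. Here is a minimal sketch with two made-up series; `axis_fraction` is a hypothetical helper, not part of any plotting library.

```python
def axis_fraction(value, axis_min, axis_max):
    """Vertical position of a value as a fraction of the plot height."""
    return (value - axis_min) / (axis_max - axis_min)


# Hypothetical series; neither comes from the article's data.
series_left = [100, 105, 110]       # drawn against the left y axis
series_right = [50.0, 50.5, 51.0]   # drawn against the right y axis

left_positions = [axis_fraction(v, 100, 110) for v in series_left]

# Right axis squeezed to 50..51: the two lines land on top of each other.
tight_positions = [axis_fraction(v, 50, 51) for v in series_right]

# Right axis widened to 0..100: the very same series now looks flat.
wide_positions = [axis_fraction(v, 0, 100) for v in series_right]
```

One series rose by 10% and the other by 2%, yet the tight axis makes them look identical. Nothing about the data changed between `tight_positions` and `wide_positions`, only the axis limits.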

Over-complex visualizations

Another common issue in visualizations comes from the analyst getting a little too technical with their graphs. For instance, think back to our original diamonds scatter plot:

The Art and Science of Data Visualization (64)

Looking at this chart, we can see that carat and price have a positive correlation — as one increases, the other does as well. However, it’s not a linear relationship; instead, it appears that price increases faster as carat increases.

The more statistically-minded analyst might already be thinking that we could make this relationship linear by log-transforming the axes — and they’d be right! We can see a clear linear relationship when we make the transformation:

The Art and Science of Data Visualization (65)
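To see why the transformation straightens the curve, suppose (purely as an assumption for illustration) that price follows a power law, price = a * carat ** b. Taking logs of both sides gives log(price) = log(a) + b * log(carat), which is a straight line with slope b. The sketch below checks this numerically with made-up constants that are not fitted to the diamonds data.

```python
import math

# Hypothetical power-law constants, chosen only for illustration.
a, b = 4000.0, 1.7

carats = [0.5, 1.0, 2.0, 4.0]
prices = [a * c ** b for c in carats]


def slopes(xs, ys):
    """Slope of each straight segment joining consecutive points."""
    return [
        (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    ]


# On the raw axes the slope keeps growing: the curve bends upward.
raw_slopes = slopes(carats, prices)

# On log-log axes every segment has the same slope, the exponent b.
log_slopes = slopes(
    [math.log10(c) for c in carats],
    [math.log10(p) for p in prices],
)
```

Real data will not follow a power law exactly, so the log-log plot will only be approximately straight.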

Unfortunately, transforming your visualizations in this way can make your graphic hard to understand; in fact, only about 60% of professional scientists can even understand graphs with log-transformed axes. As such, transforming your axes like this tends to reduce the effectiveness of your graphic, and this type of visualization should instead be reserved for exploratory graphics and modeling.

That just about wraps up this introduction to the basic concepts of data visualizations. Hopefully you’ve picked up some concepts or vocabulary that can help you think about your own visualizations in your daily life. If nothing else, I hope you remember our mantras of data visualization:

  1. A good graph tells a story.
  2. Everything should be made as simple as possible — but no simpler.
  3. Use the right tool for the job.
  4. Ink is cheap. Electrons are even cheaper.

Hopefully these concepts will help you maximize the expressiveness and efficiency of your visualizations, steering you to use exactly as many aesthetics and design elements as it takes to tell your story. You’ll know to match perceptual and data topology. You’ll strive to make important comparisons easy, and you’ll know to make more than one chart.

Much luck. Go forth and visualize, and teach others how to as well. Our field will be so much the better for it.

Mike Mahoney is a data analyst, passionate about data visualization and finding ways to apply data insights to complex systems. Find out more on his website or connect with him on LinkedIn.
