4 Data Visualization via ggplot2
In Chapter 3, we discussed the importance of datasets being tidy. You will see in examples here why having a tidy dataset helps us immensely when plotting our data. In plotting our data, we will be able to gain valuable insights from our data that we couldn’t initially see from just looking at the raw data. We will focus on using Hadley Wickham’s
ggplot2 package in doing so, which was developed to work specifically on datasets that are tidy. It provides an easy way to customize your plots and is based on data visualization theory given in The Grammar of Graphics (Wilkinson 2005).
At the most basic level, graphics/plots/charts provide a nice way for us to get a sense for how quantitative variables compare in terms of their center and their spread. The most important thing to know about graphics is that they should be created to make it obvious for your audience to see the findings you want to get across. This requires a balance of not including too much in your plots, but also including enough so that relationships and interesting findings can be easily seen. As we will see, plots/graphics also help us to identify patterns and outliers in our data. We will see that a common extension of these ideas is to compare the distribution of one quantitative variable (i.e., what the spread of a variable looks like or how the variable is distributed in terms of its values) as we go across the levels of a different categorical variable.
Before we proceed with this chapter, let’s load all the necessary packages.
library(ggplot2)
library(nycflights13)
library(knitr)
library(dplyr)
4.1 The Grammar of Graphics
We begin with a discussion of a theoretical framework for data visualization known as “The Grammar of Graphics,” which serves as the basis for the
ggplot2 package. Much like the way we construct sentences in any language using a linguistic grammar (nouns, verbs, subjects, objects, etc.), the theoretical framework given by Leland Wilkinson (Wilkinson 2005) allows us to specify the components of a statistical graphic.
4.1.1 Components of Grammar
In short, the grammar tells us that:
A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.
Specifically, we can break a graphic into the following three essential components:
data: the data set comprised of variables that we map.
geom: the geometric object in question. This refers to our type of objects we can observe in our plot. For example, points, lines, bars, etc.
aes: aesthetic attributes of the geometric object that we can perceive on a graphic. For example, x/y position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data set. If not assigned, they are set to defaults.
4.1.2 Napoleon’s March on Moscow
In 1812, Napoleon led a French invasion of Russia, marching on Moscow. It was one of the biggest military disasters due in large part to the Russian winter. In 1869, a French civil engineer named Charles Joseph Minard published arguably one of the greatest statistical visualizations of all-time, which summarized this march:
This was considered a revolution in statistical graphics because between the map on top and the line graph on the bottom, there are 6 dimensions of information (i.e. variables) being displayed on a 2-dimensional page. Let’s view this graphic through the lens of the Grammar of Graphics:
For example, the data variable longitude gets mapped to the x aesthetic of the points geometric objects on the map, while the annotated line-graph displays temperature variable information via its mapping to the y aesthetic of the line geometric object.
4.1.3 Other Components of the Grammar
There are other components of the Grammar of Graphics we can control:
facet: how to break up a plot into subsets
statistical transformations: this includes smoothing, binning values into a histogram, or leaving the values un-transformed.
scales: these both
- convert data units to physical units the computer can display
- draw a legend and/or axes, which provide an inverse mapping to make it possible to read the original data values from the graph.
coordinate system for x/y values: typically cartesian, but other systems such as polar coordinates are also possible.
In this text, we will only focus on the first two: faceting (introduced in Section 4.6) and statistical transformations (in a limited sense, when we consider barplots in Section 4.8); the other components are left to a more advanced text. This is not a problem when producing a plot, as each of these components has default settings.
There are other extra attributes that can be tweaked as well including the plot title, axes labels, and over-arching themes for the plot. In general, the Grammar of Graphics allows for customization but also a consistent framework that allows the user to easily tweak their creations as needed in order to convey a message about their data.
4.1.4 The ggplot2 Package
We next introduce Hadley Wickham’s ggplot2 package, which is an implementation of the Grammar of Graphics for R (Wickham and Chang 2016). You may have noticed that a lot of previous text in this chapter is written in computer font. This is because the various components of the Grammar of Graphics are specified using the ggplot function, which expects at a bare minimum as arguments
- the data frame where the variables exist (the data argument) and
- the names of the variables to be plotted (the mapping argument).
The names of the variables will be entered into the aes function as arguments, where aes stands for “aesthetics”.
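As a rough sketch of how these two arguments fit together, the template below builds a plot from a small made-up data frame. The data frame toy_data and its columns hours and score are hypothetical and exist only for illustration; what matters is the pattern of ggplot(data = ..., mapping = aes(...)) followed by a geom layer.

```r
library(ggplot2)

# Hypothetical toy data, invented purely for illustration
toy_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(52, 60, 71, 75, 88)
)

# data: the data frame; mapping: which variables go to which aesthetics
p <- ggplot(data = toy_data, mapping = aes(x = hours, y = score)) +
  geom_point()  # geom: draw one point per row

p  # printing the object draws the plot
```

Storing the plot in an object such as p is optional; it simply makes it possible to print the same plot again or to add further layers later.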
4.2 Five Named Graphs - The 5NG
For our purposes, we will be limiting consideration to five different types of graphs (note that in this text we use the terms “graphs”, “plots”, and “charts” interchangeably). We term these five named graphs the 5NG:
- scatter-plots
- line-graphs
- histograms
- boxplots
- barplots
With this repertoire of plots, you can visualize a wide array of data variables thrown at you. We will discuss some variations of these, but with the 5NG in your toolbox you can do big things! Something we will also stress here is that certain plots only work for categorical/logical variables and others only for quantitative variables. You’ll want to quiz yourself often as we go along on which plot makes sense given a particular problem set-up.
4.3 5NG#1: Scatter-plots
The simplest of the 5NG are scatter-plots (also called bivariate plots); they allow you to investigate the relationship between two continuous variables. While you may already be familiar with such plots, let’s view them through the lens of the Grammar of Graphics. Specifically, we will graphically investigate the relationship between the following two continuous variables in the
flights data frame:
dep_delay: departure delay on the horizontal “x” axis and
arr_delay: arrival delay on the vertical “y” axis
for Alaska Airlines flights leaving NYC in 2013. This requires paring down the
flights data frame to a smaller data frame
all_alaska_flights consisting of only Alaska Airlines (carrier code “AS”) flights.
data(flights)
all_alaska_flights <- flights %>%
  filter(carrier == "AS")
This code snippet makes use of functions in the
dplyr package for data manipulation to achieve our goal: it takes the
flights data frame and
filters it to only return the rows which meet the condition
carrier == "AS" (recall equality is specified with
== and not
=). You will see many more examples using this function in Chapter 5.
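To make the filtering step concrete, the short check below applies the same filter() idea to a tiny made-up data frame. The names toy_flights and toy_alaska and all of the values are hypothetical; only filter() and the == comparison come from the text above.

```r
library(dplyr)

# Hypothetical toy data standing in for the flights data frame
toy_flights <- data.frame(
  carrier   = c("AS", "UA", "AS", "DL"),
  dep_delay = c(-3, 12, 40, 0)
)

# Keep only the rows where the condition carrier == "AS" is TRUE
toy_alaska <- toy_flights %>%
  filter(carrier == "AS")

toy_alaska
```

Writing carrier = "AS" with a single equals sign inside filter() does not test for equality; dplyr stops with an error in that case, which is usually a sign that == was intended.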
(LC4.1) Take a look at both the flights and all_alaska_flights data frames by running View(flights) and View(all_alaska_flights) in the console. In what respect do these data frames differ?
4.3.1 Scatter-plots via geom_point
We proceed to create the scatter-plot using the ggplot() function:
ggplot(data = all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
  geom_point()
You are encouraged to press Return on your keyboard after entering the +. As we add more and more elements, it will be nice to keep them indented as you see above. Note that this will not work if you begin the line with the +.
Let’s break this down, keeping in mind our discussion in Section 4.1:
- Within the ggplot() function call, we specify two of the components of the grammar:
  - The data frame to be all_alaska_flights by setting data = all_alaska_flights
  - The aesthetic mapping by setting aes(x = dep_delay, y = arr_delay). Specifically, dep_delay maps to the x position aesthetic and arr_delay maps to the y position aesthetic.
- We add a layer to the ggplot() function call using the + sign.
- The layer in question specifies the third component of the grammar: the geometric object. In this case the geometric objects are points, set by specifying geom_point().
In Figure 4.2 we see that a positive relationship exists between dep_delay and arr_delay: as departure delays increase, arrival delays tend to also increase. We also note that the majority of points fall near the point (0, 0). There is a large mass of points clustered there. (We will work more with this data set in Chapter 9, where we investigate correlation and linear regression.)
(LC4.2) What are some practical reasons why dep_delay and arr_delay have a positive relationship?
(LC4.3) What variables (not necessarily in the
flights data frame) would you expect to have a negative correlation (i.e. a negative relationship) with
dep_delay? Why? Remember that we are focusing on continuous variables here.
(LC4.4) Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaskan flights?
(LC4.5) What are some other features of the plot that stand out to you?
(LC4.6) Create a new scatter-plot using different variables in the
all_alaska_flights data frame by modifying the example above.
The large mass of points near (0, 0) can cause some confusion. This is the result of a phenomenon called over-plotting. As one may guess, this corresponds to values being plotted on top of each other over and over again. It is often difficult to know just how many values are plotted in this way when looking at a basic scatter-plot as we have here. There are two ways to address this issue:
- By adjusting the transparency of the points via the alpha argument
- By jittering the points via geom_jitter()
The first way of relieving over-plotting is by changing the
alpha argument to
geom_point() which controls the transparency of the points. By default, this value is set to
1. We can change this value to a smaller fraction (greater than 0) to change the transparency of the points in the plot:
ggplot(data = all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2)
Note how this function call is identical to the one in Section 4.3.1, but with alpha = 0.2 added inside geom_point().
The second way of relieving over-plotting is to jitter the points a bit. In other words, we are going to add just a bit of random noise to the points to better see them and remove some of the over-plotting. You can think of “jittering” as shaking the points a bit on the plot. Instead of using geom_point, we use geom_jitter to perform this shaking and specify roughly how much jitter to add with the width and height arguments. This corresponds to how hard you’d like to shake the plot in units corresponding to those for both the horizontal and vertical variables (in this case minutes).
ggplot(data = all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
  geom_jitter(width = 30, height = 30)
Note how this function call is identical to the one in Section 4.3.1, but with geom_point() replaced with geom_jitter(). The plot in Figure 4.4 helps us a little bit in getting a sense for the over-plotting, but with a relatively large dataset like this one (714 flights), it can be argued that changing the transparency of the points by setting alpha proved more effective.
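The two approaches are not mutually exclusive. As a minimal sketch, the call below combines them by passing position_jitter() to geom_point() alongside alpha. The seed value 2013 is an arbitrary choice made here so that the random shaking is the same on every run, and it assumes a ggplot2 version in which position_jitter() accepts a seed argument.

```r
library(ggplot2)
library(dplyr)
library(nycflights13)

all_alaska_flights <- flights %>%
  filter(carrier == "AS")

# Jitter positions are random; fixing a seed makes the plot reproducible.
# alpha is combined with jittering so that dense regions still look darker.
p_jitter <- ggplot(data = all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
  geom_point(
    alpha = 0.2,
    position = position_jitter(width = 30, height = 30, seed = 2013)
  )

p_jitter
```

Whether this reads better than either approach on its own is a judgment call that depends on the data.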
(LC4.7) Why is setting the
alpha argument value useful with scatter-plots? What further information does it give you that a regular scatter-plot cannot?
(LC4.8) After viewing Figure 4.3 above, give a range of arrival delays and departure delays that occur most frequently. How has that region changed compared to when you observed the same plot without alpha = 0.2 set in Figure 4.2?
Scatter-plots display the relationship between two continuous variables and may be the most used plot today as they can provide an immediate way to see the trend in one variable versus another. If you try to create a scatter-plot where either one of the two variables is not quantitative however, you will get strange results. Be careful!
With medium to large datasets, you may need to play with either
geom_jitter or the
alpha argument in order to get a good feel for relationships in your data. This tweaking is often a fun part of data visualization since you’ll have the chance to see different relationships come about as you make subtle changes to your plots.
4.4 5NG#2: Line-graphs
The next of the 5NG is a line-graph. They are most frequently used when the x-axis represents time and the y-axis represents some other numerical variable; such plots are known as time series. Time represents a variable that is connected together by each day following the previous day. In other words, time has a natural ordering. Line-graphs should be avoided when there is not a clear sequential ordering to the explanatory variable, i.e. the x-variable or the predictor variable.
Our focus turns to the temp variable in this weather dataset. By
- looking over the weather dataset by typing View(weather) in the console and
- running ?weather to bring up the help file,
we can see that the
temp variable corresponds to hourly temperature (in Fahrenheit) recordings at weather stations near airports in New York City. Instead of considering all hours in 2013 for all three airports in NYC, let’s focus on the hourly temperature at Newark airport (
origin code “EWR”) for the first 15 days in January 2013. The
weather data frame in the
nycflights13 package contains this data, but we first need to filter it to only include those rows that correspond to Newark in the first 15 days of January.
data(weather)
early_january_weather <- weather %>%
  filter(origin == "EWR" & month == 1 & day <= 15)
This is similar to the previous use of the filter command in Section 4.3; however, we now use the & operator. The above selects only those rows in weather where origin == "EWR" and month == 1 and day <= 15.
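As an aside, dplyr's filter() also accepts several conditions as separate, comma-separated arguments and keeps only the rows where all of them hold. The sketch below compares that form with the & form used above; the object names with_ampersand and with_commas are hypothetical.

```r
library(dplyr)
library(nycflights13)

# Conditions joined with the & operator ...
with_ampersand <- weather %>%
  filter(origin == "EWR" & month == 1 & day <= 15)

# ... and the same conditions passed as separate arguments
with_commas <- weather %>%
  filter(origin == "EWR", month == 1, day <= 15)

# Check that both approaches keep the same rows
all.equal(with_ampersand, with_commas)
```

Which form to use is largely a matter of taste; the & form makes the logical "and" explicit.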
(LC4.9) Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather) in the console. In what respect do these data frames differ?
(LC4.10) The weather data is recorded hourly. Why does the
time_hour variable correctly identify the hour of the measurement whereas the
hour variable does not?
4.4.1 Line-graphs via geom_line
We plot a line-graph of hourly temperature using geom_line():
ggplot(data = early_january_weather, aes(x = time_hour, y = temp)) +
  geom_line()
Much as with the ggplot() call in Section 4.3.1, we specify the components of the Grammar of Graphics:
- Within the ggplot() function call, we specify two of the components of the grammar:
  - The data frame to be early_january_weather by setting data = early_january_weather
  - The aesthetic mapping by setting aes(x = time_hour, y = temp). Specifically, time_hour (i.e. the time variable) maps to the x position aesthetic and temp maps to the y position aesthetic.
- We add a layer to the ggplot() function call using the + sign.
- The layer in question specifies the third component of the grammar: the geometric object. In this case the geometric object is a line, set by specifying geom_line().
(LC4.11) Why should line-graphs be avoided when there is not a clear ordering of the horizontal axis?
(LC4.12) Why are line-graphs frequently used when time is the explanatory variable?
(LC4.13) Plot a time series of a variable other than
temp for Newark Airport in the first 15 days of January 2013.
Line-graphs, just like scatter-plots, display the relationship between two continuous variables. However the variable on the x-axis (i.e. the explanatory variable) should have a natural ordering, like some notion of time. We can mislead our audience if that isn’t the case.
4.5 5NG#3: Histograms
Let’s consider the temp variable in the weather data frame once again, but now, unlike with the line-graphs in Section 4.4, let’s say we don’t care about the relationship of temperature to time, but rather we care about the (statistical) distribution of temperatures. We could just produce points where each of the different values appears on something similar to a number line:
This gives us a general idea of how the values of
temp differ. We see that temperatures vary from around 11 up to 100 degrees Fahrenheit. The area between 40 and 60 degrees appears to have more points plotted than outside that range.
4.5.1 Histograms via geom_histogram
What is commonly produced instead of this strip plot is a plot known as a histogram. The histogram shows how many elements of a single numerical variable fall in specified bins. In this case, these bins may correspond to between 0-10°F, 10-20°F, etc. We produce a histogram of the hourly temperatures at all three NYC airports in 2013:
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
- There is only one variable being mapped in aes(): the single continuous variable temp. You don’t need to compute the y-aesthetic: it gets computed automatically.
- We set the geometric object to be geom_histogram().
- We got a warning message of 1 rows containing non-finite values being removed. This is due to one of the values of temperature being missing. R is alerting us that this happened.
4.5.2 Adjusting the Bins
We can adjust the number/size of the bins in two ways:
- By adjusting the number of bins via the bins argument
- By adjusting the width of the bins via the binwidth argument
First, we have the power to specify how many bins we would like to put the data into as an argument in the
geom_histogram function. By default, this is chosen to be 30 somewhat arbitrarily; we have received a warning above our plot that this was done.
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(bins = 60, color = "white")
Note the addition of the
color argument. If you’d like to be able to more easily differentiate each of the bins, you can specify the color of the outline as done above.
Second, instead of specifying the number of bins, we can also specify the width of the bins by using the binwidth argument in the geom_histogram function:
ggplot(data = weather, mapping = aes(x = temp)) +
  geom_histogram(binwidth = 10, color = "white")
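To see what "binning" means without any plotting, the sketch below counts temperatures in 10-degree-wide bins by hand using base R's cut() and table(). The break points from 0 to 110 are an assumption chosen here to mirror the 0-10°F, 10-20°F bins mentioned earlier; they are not necessarily the boundaries that geom_histogram() picks by default.

```r
library(nycflights13)

# Cut temperatures into 10-degree-wide bins: (0,10], (10,20], ..., (100,110]
temp_bins <- cut(weather$temp, breaks = seq(0, 110, by = 10))

# Count how many hourly recordings fall into each bin;
# missing temperatures are dropped by table() by default
bin_counts <- table(temp_bins)
bin_counts
```

Each count corresponds to the height of one bar in a histogram drawn with the same bins. If the bars of geom_histogram(binwidth = 10) do not line up with these counts, adding an argument such as boundary = 0 is one way to align the bin edges.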
(LC4.14) What does changing the number of bins from 30 to 60 tell us about the distribution of temperatures?
(LC4.15) Would you classify the distribution of temperatures as symmetric or skewed?
(LC4.16) What would you guess is the “center” value in this distribution? Why did you make that choice?
(LC4.17) Is this data spread out greatly from the center or is it close? Why?
Histograms, unlike scatter-plots and line-graphs, present information on only a single continuous variable. In particular they are visualizations of the (statistical) distribution of values.
4.6 Facets
Before continuing the 5NG, we briefly introduce a new concept called faceting. Faceting is used when we’d like to create small multiples of the same plot over a different categorical variable. By default, all of the small multiples will have the same vertical axis.
For example, suppose we were interested in looking at how the temperature histograms we saw in Section 4.5 varied by month. This is what is meant by “the distribution of a variable over another variable”: temp is one variable and month is the other variable. In order to look at histograms of temp for each month, we add a layer facet_wrap(~ month). You can also specify how many rows you’d like the small multiple plots to be in using nrow inside of facet_wrap:
ggplot(data = weather, aes(x = temp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ month, nrow = 4)
As we might expect, the temperature tends to increase as summer approaches and then decrease as winter approaches.
(LC4.18) What other things do you notice about the faceted plot above? How does a faceted plot help us see relationships between two variables?
(LC4.19) What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100?
(LC4.20) For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the variability of the variables and other important characteristics.
(LC4.21) Does the
temp variable in the
weather data set have a lot of variability? Why do you say that?
4.7 5NG#4: Boxplots
While using faceted histograms can provide a way to compare distributions of a continuous variable split by groups of a categorical variable as in Section 4.6, an alternative plot called a boxplot (also called a side-by-side boxplot) achieves the same task and is frequently preferred. The boxplot uses the information provided in the five-number summary referred to in Appendix A. It gives a way to compare this summary information across the different levels of a categorical variable.
4.7.1 Boxplots via geom_boxplot
Let’s create a boxplot to compare the monthly temperatures as we did above with the faceted histograms.
ggplot(data = weather, aes(x = month, y = temp)) +
  geom_boxplot()
Note the first warning that is given here. (The second one corresponds to missing values in the data frame and it is turned off on subsequent plots.) Observe that this plot does not look like what we were expecting. We were expecting to see the distribution of temperatures for each month (so 12 different boxplots). This gives us the overall boxplot without any other groupings. We can get around this by introducing a new function for our x variable:
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot()
We have introduced a new function called factor() here. One of the things this function does is to convert a discrete value like month (1, 2, …, 12) into a categorical variable. The “box” part of this plot represents the 25th percentile, the median (50th percentile), and the 75th percentile. The dots correspond to outliers. (The specific formulation for these outliers is discussed in Appendix A.) The lines show how the data outside the center 50%, which is defined by the first and third quartiles, vary. Longer lines correspond to more variability and shorter lines correspond to less variability.
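The effect of factor() can be inspected directly in the console. The sketch below is only an illustration: the object names month_factor and july_temp are hypothetical, and fivenum() is base R's five-number summary, whose hinges may differ slightly from the quartiles that geom_boxplot() draws.

```r
library(nycflights13)

# month is stored as a number, so ggplot2 treats it as continuous
class(weather$month)

# factor() turns the distinct month values into category labels (levels)
month_factor <- factor(weather$month)
class(month_factor)
levels(month_factor)

# Five-number summary (minimum, lower hinge, median, upper hinge, maximum)
# of temp for July, roughly the information a single box summarizes;
# fivenum() drops missing values by default
july_temp <- weather$temp[weather$month == 7]
fivenum(july_temp)
```

Note that when a month has outliers, the lines of its boxplot stop short of the minimum or maximum reported here, because the outliers are drawn as separate dots.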
(LC4.22) What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point.
(LC4.23) Which months have the highest variability in temperature? What reasons can you give for this?
(LC4.24) We looked at the distribution of a continuous variable over a categorical variable here with this boxplot. Why can’t we look at the distribution of one continuous variable over the distribution of another continuous variable? Say, temperature across pressure, for example?
(LC4.25) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram?
Boxplots provide a way to compare and contrast the distribution of one quantitative variable across multiple levels of one categorical variable. One can easily look to see where the median falls across the different groups by looking at the center line in the box. You can also see how spread out the variable is across the different groups by looking at the width of the box and also how far out the lines stretch from the box. If the lines stretch far from the box but the box has a small width, the variability of the values closer to the center is much smaller than the variability of the outer ends of the variable. Lastly, outliers are even more easily identified when looking at a boxplot than when looking at a histogram.
4.8 5NG#5: Barplots
Both histograms and boxplots represent ways to visualize the variability of continuous variables. Another common task is to present the distribution of a categorical variable. This is a simpler task since we will be interested in how many elements from our data fall into the different categories of the categorical variable.
4.8.1 Barplots via geom_bar
Frequently, the best way to visualize these different counts (also known as frequencies) is via a barplot. Consider the distribution of airlines that flew out of New York City in 2013. Here we explore the number of flights from each airline/carrier. This can be plotted by invoking the geom_bar function in ggplot2:
ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar()
To see which airline names correspond to these carrier codes, we can look at the airlines data frame in the nycflights13 package. Note the use of the kable function here in the knitr package, which produces a nicely-formatted table of the values in the airlines data frame.
kable(airlines)
|carrier|name|
|---|---|
|9E|Endeavor Air Inc.|
|AA|American Airlines Inc.|
|AS|Alaska Airlines Inc.|
|B6|JetBlue Airways|
|DL|Delta Air Lines Inc.|
|EV|ExpressJet Airlines Inc.|
|F9|Frontier Airlines Inc.|
|FL|AirTran Airways Corporation|
|HA|Hawaiian Airlines Inc.|
|MQ|Envoy Air|
|OO|SkyWest Airlines Inc.|
|UA|United Air Lines Inc.|
|US|US Airways Inc.|
|VX|Virgin America|
|WN|Southwest Airlines Co.|
|YV|Mesa Airlines Inc.|
Going back to our barplot, we see that United Air Lines, JetBlue Airways, and ExpressJet Airlines had the most flights depart New York City in 2013. To get the actual number of flights by each airline we can use the
count function in the
dplyr package on the
carrier variable in
flights, which we will introduce formally in Chapter 5.
flights_table <- flights %>%
  dplyr::count(carrier)
knitr::kable(flights_table)
Technical note: Note the use of :: in the code above. This is another way of ensuring the correct function is called. A count function exists in a couple of different packages, and sometimes you’ll receive strange errors when a different instance of a function is used. This is a great way of telling R that “I want this one!”. You specify the name of the package directly before the :: and then the name of the function immediately after it.
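A well-known case of two packages sharing a function name is filter, which exists in both dplyr and the stats package that ships with R. The sketch below uses that pair rather than count, purely as an illustration of how :: picks one version; the data frame toy is hypothetical.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = 1:5)

# dplyr's filter(): keeps the rows of a data frame that match a condition
kept_rows <- dplyr::filter(toy, x > 2)

# stats' filter(): a different function that applies a linear filter
# (here a 3-term moving average) to a numeric series
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

kept_rows
smoothed
```

Without the package prefix, R uses whichever version was loaded most recently, which is why the explicit prefix can prevent confusing errors.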
(LC4.26) Why are histograms inappropriate for visualizing categorical variables?
(LC4.27) What is the difference between histograms and barplots?
(LC4.28) How many Envoy Air flights departed NYC in 2013?
(LC4.29) What was the seventh highest airline in terms of departed flights from NYC in 2013? How could we better present the table to get this answer quickly?
4.8.2 Must avoid pie charts!
Unfortunately, one of the most common plots seen today for categorical data is the pie chart. While they may seem harmless enough, they actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book “Creating More Effective Graphs” (Robbins 2013), we overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another.
Let’s examine our previous barplot example on the number of flights departing NYC by airline. This time we will use a pie chart. As you review this chart, try to identify
- how much larger the portion of the pie is for ExpressJet Airlines (EV) compared to US Airways (US),
- what the third largest carrier is in terms of departing flights, and
- how many carriers have fewer flights than United Airlines (UA).
While it is quite easy to look back at the barplot to get the answer to these questions, it’s quite difficult to get the answers correct when looking at the pie graph. Barplots can always present the information in a way that is easier for the eye to determine relative position. There may be one exception from Nathan Yau at FlowingData.com but we will leave this for the reader to decide:
(LC4.30) Why should pie charts be avoided and replaced by barplots?
(LC4.31) What is your opinion as to why pie charts continue to be used?
4.8.3 Using barplots to compare two variables
Barplots are the go-to way to visualize the frequency of different categories of a categorical variable. They make it easy to order the counts and to compare one group’s frequency to another. Another use of barplots (unfortunately, sometimes inappropriately and confusingly) is to compare two categorical variables together. Let’s examine the distribution of outgoing flights from NYC by carrier and airport. To do so, we first join the airports data frame to flights so that each flight carries the name of its origin airport:
flights_namedports <- flights %>%
  inner_join(airports, by = c("origin" = "faa"))
After running View(flights_namedports), we see that
name now corresponds to the name of the airport as referenced by the
origin variable. We will now plot
carrier as the horizontal variable. When we specify
geom_bar, it will specify
count as being the vertical variable. A new addition here is
fill = name. Look over what was produced from the plot to get an idea of what this argument gives.
fill is an aesthetic just like x is an aesthetic. We need to map the name variable to this aesthetic. Any time you use a variable like this, you need to make sure it is wrapped inside the aes function. This is a common error! Make note of this now so you don’t run into this problem later.
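The difference between mapping an aesthetic to a variable and setting it to a fixed value can be sketched with a tiny made-up data frame. The data frame toy_flights and its columns carrier and airport are hypothetical stand-ins for flights_namedports, and "steelblue" is just an arbitrary color name.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy_flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA", "DL"),
  airport = c("JFK", "LGA", "EWR", "EWR", "JFK", "LGA")
)

# Mapping: fill depends on a variable, so it goes inside aes()
p_mapped <- ggplot(data = toy_flights, mapping = aes(x = carrier, fill = airport)) +
  geom_bar()

# Setting: fill is one fixed color for every bar, so it goes outside aes()
p_fixed <- ggplot(data = toy_flights, mapping = aes(x = carrier)) +
  geom_bar(fill = "steelblue")

p_mapped
p_fixed
```

Writing geom_bar(fill = airport) outside of aes() would typically fail with an "object not found" error, because R then looks for an object called airport rather than a column of the data frame.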
ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
  geom_bar()
This plot is what is known as a stacked barplot. While simple to make, it often leads to many problems.
(LC4.32) What kinds of questions are not easily answered by looking at the above figure?
(LC4.33) What can you say, if anything, about the relationship between airline and airport in NYC in 2013 in regards to the number of departing flights?
Another variation on the stacked barplot is the side-by-side barplot.
ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
  geom_bar(position = "dodge")
(LC4.34) Why might the side-by-side barplot be preferable to a stacked barplot in this case?
(LC4.35) What are the disadvantages of using a side-by-side barplot, in general?
Lastly, an often preferred type of barplot is the faceted barplot. We already saw this concept of faceting and small multiples in Section 4.6. This gives us a nicer way to compare the distributions across both carrier and airport/name:
ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) +
  geom_bar() +
  facet_grid(name ~ .)
Note how the facet_grid function arguments are written here. We want the names of the airports listed vertically and the carrier listed horizontally. As you may have guessed, this argument and other formulas of this sort in R are in y ~ x order. We will see more examples of this in Chapter 9.
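The effect of the formula's orientation can be checked on a small example. In the sketch below, toy_flights and its columns are hypothetical stand-ins for flights_namedports; the two calls differ only in which side of the ~ the faceting variable appears on.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration
toy_flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA", "DL"),
  airport = c("JFK", "LGA", "EWR", "EWR", "JFK", "LGA")
)

base_plot <- ggplot(data = toy_flights, mapping = aes(x = carrier)) +
  geom_bar()

# airport ~ . puts airport on the "y" side: one row of panels per airport
p_rows <- base_plot + facet_grid(airport ~ .)

# . ~ airport puts airport on the "x" side: one column of panels per airport
p_cols <- base_plot + facet_grid(. ~ airport)

p_rows
p_cols
```

Stacking the panels in rows keeps the carrier axis aligned across airports, which is one reason the row layout suits the comparison made in this section.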
(LC4.36) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case?
(LC4.37) What information about the different carriers at different airports is more easily seen in the faceted barplot?
Barplots are the preferred way of displaying categorical variables. They are easy to understand and make it easy to compare groups of a categorical variable. When dealing with more than one categorical variable, faceted barplots are frequently preferred over side-by-side or stacked barplots. Stacked barplots are sometimes nice to look at, but it is quite difficult to compare across the levels since the segments of the bars start at different heights and have different sizes. Side-by-side barplots can provide an improvement on this, but the issue about comparing across groups still must be dealt with.
An excellent resource as you begin to create plots using the ggplot2 package is a cheatsheet that RStudio has put together entitled “Data Visualization with ggplot2”, available by clicking the RStudio Menu Bar -> Help -> Cheatsheets -> “Data Visualization with ggplot2”.
This covers more than what we’ve discussed in this chapter but provides nice visual descriptions of what each function produces.
In addition, we’ve created a mind map to help you remember which types of plots are most appropriate in a given situation by identifying the types of variables involved in the problem. It is available here and below.
4.9.2 Script of R code
An R script file of all R code used in this chapter is available here.
Review questions have been designed using the
fivethirtyeight R package (Ismay and Chunn 2017) with links to the corresponding FiveThirtyEight.com articles in our free DataCamp course Effective Data Storytelling using the
tidyverse. The material in this chapter is covered in the chapters of the DataCamp course available below:
A ggplot2 Review DataCamp course is in development currently.
4.9.3 What’s to come?
In Chapter 5, we’ll further explore data by grouping our data, creating summaries based on those groupings, filtering our data to match conditions, and other manipulations with our data including defining new columns/variables. These data manipulation procedures will go hand-in-hand with the data visualizations you’ve produced here.
Wilkinson, Leland. 2005. The Grammar of Graphics (Statistics and Computing). Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Wickham, Hadley, and Winston Chang. 2016. ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Robbins, Naomi. 2013. Creating More Effective Graphs. Chart House.
Ismay, Chester, and Jennifer Chunn. 2017. fivethirtyeight: Data and Code Behind the Stories and Interactives at ’FiveThirtyEight’. https://github.com/rudeboybert/fivethirtyeight.